Wooldridge, Introductory Econometrics, 4th ed.
Chapter 1: Nature of Econometrics and
Economic Data
What do we mean by econometrics? Econometrics is the field of economics in which statistical methods are developed and applied to estimate economic relationships, test economic theories, and evaluate plans and policies implemented by private industry, government, and supranational organizations. Econometrics encompasses forecasting–not only the high-profile forecasts of macroeconomic and financial variables, but also forecasts of demand for a product, the likely effects of a tax package, or the interaction between the demand for health services and welfare reform.
Why is econometrics separate from mathematical statistics? Because most applications of
statistics in economics and finance are related
to the use of non-experimental data, or ob-
servational data. The fundamental techniques
of statistics have been developed for use on
experimental data: that gathered from con-
trolled experiments, where the design of the
experiment and the reliability of measurements
from its outcomes are primary foci. In rely-
ing on observational data, economists are more
like astronomers, able to collect and analyse in-
creasingly complete measures on the world (or
universe) around them, but unable to influence
the outcomes.
This distinction is not absolute; some eco-
nomic policies are in the nature of experiments,
and economists have been heavily involved in
both their design and implementation. A good
example within the last five years is the imple-
mentation of welfare reform, limiting individ-
uals’ tenure on the welfare rolls to five years
of lifetime experience. Many doubted that this
would be successful in addressing the needs of
those welfare recipients who have low job skills;
but the reforms have been surprisingly suc-
cessful, as a recent article in The Economist
states, at raising the employment rate among
this cohort. Economists are also able to care-
fully examine the economic consequences of
massive regime changes in an economy, such
as the transition from a planned economy to
a capitalist system in the former Soviet bloc.
But fundamentally applied econometricians ob-
serve the data, and use sophisticated tech-
niques to evaluate their meaning.
We speak of this work as empirical analysis, or
empirical research. The first step is the careful
formulation of the question of interest. This
will often involve the application or develop-
ment of an economic model, which may be as
simple as noting that normal goods have neg-
ative price elasticities, or exceedingly complex,
involving a full-fledged description of many as-
pects of a set of interrelated markets and the
supply/demand relationships for the products
traded (as would, for instance, an economet-
ric analysis of an antitrust issue, such as U.S.
v Microsoft). Economists are often attacked
for their imperialistic tendencies–applying eco-
nomic logic to consider such diverse topics as
criminal behavior, fertility, or environmental issues–
but where there is an economic dimension, the
application of economic logic and empirical re-
search based on econometric practice may yield
valuable insights. Gary Becker, who has made
a career of applying economics to non-economic
realms, won a Nobel Prize for his efforts. Crime,
after all, is yet another career choice, and for
high school dropouts who don’t see much fu-
ture in flipping burgers at minimum wage, it
is hardly surprising that there are ample ap-
plicants for positions in a drug dealer’s distri-
bution network. In risk-adjusted terms (gaug-
ing the risk of getting shot, or arrested and
successfully prosecuted...) the risk-adjusted
hourly wage is many times the minimum wage.
Should we be surprised by the outcome?
Regardless of whether empirical research is
based on a formal economic model or eco-
nomic intuition, the hypotheses about economic
behavior must be transformed into an econo-
metric model that can be applied to the data.
In an economic model, we can speak of func-
tions such as Q = Q (P, Y ) ; but if we are to es-
timate the parameters of that relationship, we
must have an explicit functional form for the Q
function, and determine that it is an appropri-
ate form for the model we have in mind. For
instance, if we were trying to predict the effi-
ciency of an automobile in terms of its engine
size (displacement, in cubic inches or liters), Americans
would likely rely on a measure like mpg – miles
per gallon. But the engineering relationship is
not linear between mpg and displacement; it
is much closer to being a linear function if we
relate gallons per mile (gpm = 1/mpg) to en-
gine size. The relationship will be curvilinear in
mpg terms, requiring a more complex model,
but nearly linear in gpm vs displacement. An
econometric model will spell out the role of
each of its variables: for instance,
gpm_i = β0 + β1 displ_i + ε_i
would express the relationship between the fuel
consumption of the ith automobile to its en-
gine size, or displacement, as a linear function,
with an additive error term εi which encom-
passes all factors not included in the model.
The parameters of the model are the β terms,
which must be estimated via statistical meth-
ods. Once that estimation has been done, we
may test specific hypotheses on their values:
for instance, that β1 is positive (larger engines
use more fuel), or that β1 takes on a certain
value. Estimating this relationship for Stata’s
auto.dta dataset of 74 automobiles, the pre-
dicted relationship is
gpm_i = 0.029 + 0.011 displ_i
where displacement is measured in hundreds of
in3. This estimated relationship has an “R2”
value of 0.59, indicating that 59% of the vari-
ation of gpm around its mean is “explained”
by displacement, and a root-mean-squared er-
ror of 0.008 (which can be compared to gpm’s
mean of 0.050, corresponding to about 21 mpg).
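The fit above is reported from Stata's auto.dta; as a minimal sketch of the same kind of calculation, the following Python/numpy code runs the simple regression on synthetic data (the displacement range, error spread, and coefficients are invented to mimic the reported fit, not taken from the actual dataset):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the auto data: displacement in hundreds of
# cubic inches, gpm = gallons per mile (hypothetical values).
displ = rng.uniform(0.8, 4.5, size=74)
gpm = 0.029 + 0.011 * displ + rng.normal(0, 0.008, size=74)

# OLS slope and intercept via the covariance/variance formulas
b1 = np.cov(displ, gpm, ddof=1)[0, 1] / np.var(displ, ddof=1)
b0 = gpm.mean() - b1 * displ.mean()

resid = gpm - (b0 + b1 * displ)
r2 = 1 - resid.var(ddof=0) / gpm.var(ddof=0)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {r2:.2f}")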
The structure of economic data
We must acquaint ourselves with some termi-
nology to describe the several forms in which
economic and financial data may appear. A
great deal of the work we will do in this course
will relate to cross-sectional data: a sam-
ple of units (individuals, families, firms, in-
dustries, countries...) taken at a given point
in time, or in a particular time frame. The
sample is often considered to be a random
sample of some sort when applied to micro-
data such as that gathered from individuals
or households. For instance, the official esti-
mates of the U.S. unemployment rate are gath-
ered from a monthly survey of individuals, in
which each is asked about their employment
status. It is not a count, or census, of those
out of work. Of course, some cross sections
are not samples, but may represent the pop-
ulation: e.g. data from the 50 states do not
represent a random sample of states. A cross-
sectional dataset can be conceptualized as a
spreadsheet, with variables in the columns and
observations in the rows. Each row is uniquely
identified by an observation number, but in
a cross-sectional dataset the ordering of the
observations is immaterial. Different variables
may correspond to different time periods; we
might have a dataset containing municipalities,
their employment rates, and their population in
the 1990 and 2000 censuses.
The other major form of data considered in
econometrics is the time series: a series of
evenly spaced measurements on a variable. A
time-series dataset may contain a number of
measures, each measured at the same frequency,
including measures derived from the originals
such as lagged values, differences, and the like.
Time series are innately more difficult to han-
dle in an econometric context because their
observations almost surely are interdependent
across time. Most economic and financial time
series exhibit some degree of persistence. Al-
though we may be able to derive some mea-
sures which should not, in theory, be explain-
able from earlier observations (such as tomor-
row’s stock return in an efficient market), most
economic time series are both interrelated and
autocorrelated–that is, related to themselves
across time periods. In a spreadsheet context, the variables would be placed in the columns, and the rows labelled with dates or times. The order of the observations in a time-series dataset matters, since it denotes the passage of equal increments of time. We will discuss time-series data and some of the special techniques that have been developed for its analysis in the latter part of the course.
Two combinations of these data schemes are also widely used: pooled cross-section/time series (CS/TS) datasets and panel, or longitudinal, data sets. The former (CS/TS) arise in the context of a repeated survey–such as a presidential popularity poll–where the respondents are randomly chosen. It is advantageous to analyse multiple cross-sections, but not possible to link observations across the cross-sections. Much more useful are panel data sets, in which we have time series of observations on the same unit: for instance, C_{i,t}
might be the consumption level of the ith house-
hold at time t. Many of the datasets we com-
monly utilize in economic and financial research
are of this nature: for instance, a great deal of
research in corporate finance is carried out with
Standard and Poor’s COMPUSTAT, a panel
data set containing 20 years of annual financial
statements for thousands of major U.S. corpo-
rations. There is a wide array of specialized
econometric techniques that have been devel-
oped to analyse panel data; we will not touch
upon them in this course.
Causality and ceteris paribus
The hypotheses tested in applied economet-
ric analysis are often posed to make inferences
about the possible causal effects of one or
more factors on a response variable: that is, do
changes in consumers’ incomes “cause” changes
in their consumption of beer? At some level,
of course, we can never establish causation–
unlike the physical sciences, where the interre-
lations of molecules may follow well-established
physical laws, our observed phenomena rep-
resent innately unpredictable human behavior.
In economic theory, we generally hold that in-
dividuals exhibit rational behavior; but since
the econometrician does not observe all of the
factors that might influence behavior, we can-
not always make sensible inferences about po-
tentially causal factors. Whenever we “opera-
tionalize” an econometric model, we implic-
itly acknowledge that it can only capture a
few key details of the behavioral relationship,
and leaves many additional factors (which
may or may not be observable) in the “pound
of ceteris paribus.” Ceteris paribus–literally,
other things equal–always underlies our infer-
ences from empirical research. Our best hope
is that we might control for many of the fac-
tors, and be able to use our empirical findings
to ascertain whether systematic factors have
been omitted. Any econometric model should
be subjected to diagnostic testing to deter-
mine whether it contains obvious flaws. For in-
stance, the relationship between mpg and displ
in the automobile data is strictly dominated by
a model containing both displ and displ2, given
the curvilinear relation between mpg and displ.
Thus the original linear model can be viewed as
unacceptable in comparison to the polynomial
model; this conclusion could be drawn from
analysis of the model’s residuals, coupled with
an understanding of the engineering relation-
ship that posits a nonlinear function between
mpg and displ.
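As a rough illustration of this diagnostic point, the sketch below (Python, with synthetic data chosen only to mimic the reciprocal mpg–displacement relation described above) fits mpg on displ alone and on displ plus displ², and compares the residual sums of squares:

import numpy as np

rng = np.random.default_rng(1)

# Synthetic illustration: mpg is roughly the reciprocal of a linear
# function of displacement, so it is curvilinear in displ.
displ = rng.uniform(0.8, 4.5, size=200)
mpg = 1.0 / (0.029 + 0.011 * displ) + rng.normal(0, 1.0, size=200)

# Fit mpg on displ, and on displ and displ^2, then compare residuals.
lin = np.polyfit(displ, mpg, deg=1)
quad = np.polyfit(displ, mpg, deg=2)

ssr_lin = np.sum((mpg - np.polyval(lin, displ)) ** 2)
ssr_quad = np.sum((mpg - np.polyval(quad, displ)) ** 2)
print(ssr_lin, ssr_quad)   # the quadratic fit leaves a smaller residual sum of squares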
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 2: The simple regression model
Most of this course will be concerned with use
of a regression model: a structure in which
one or more explanatory variables are consid-
ered to generate an outcome variable, or de-
pendent variable.We begin by considering the
simple regression model, in which a single ex-
planatory, or independent, variable is involved.
We often speak of this as ‘two-variable’ regres-
sion, or ‘Y on X regression’. Algebraically,
yi = β0 + β1xi + ui (1)
is the relationship presumed to hold in the pop-
ulation for each observation i. The values of y
are expected to lie on a straight line, depending
on the corresponding values of x. Their values
will differ from those predicted by that line by
the amount of the error term, or disturbance,
u, which expresses the net effect of all factors
other than x on the outcome y−that is, it re-
flects the assumption of ceteris paribus. We
often speak of x as the ‘regressor’ in this rela-
tionship; less commonly we speak of y as the
‘regressand.’ The coefficients of the relation-
ship, β0 and β1, are the regression parameters,
to be estimated from a sample. They are pre-
sumed constant in the population, so that the
effect of a one-unit change in x on y is assumed
constant for all values of x.
As long as we include an intercept in the rela-
tionship, we can always assume that E (u) = 0,
since a nonzero mean for u could be absorbed
by the intercept term.
The crucial assumption in this regression model
involves the relationship between x and u. We
consider x a random variable, as is u, and con-
cern ourselves with the conditional distribution
of u given x. If that distribution is equivalent to
the unconditional distribution of u, then we can
conclude that there is no relationship between
x and u−which, as we will see, makes the es-
timation problem much more straightforward.
To state this formally, we assume that
E (u | x) = E (u) = 0 (2)
or that the u process has a zero conditional
mean. This assumption states that the unob-
served factors involved in the regression func-
tion are not related in any systematic manner
to the observed factors. For instance, con-
sider a regression of individuals’ hourly wage
on the number of years of education they have
completed. There are, of course, many factors
influencing the hourly wage earned beyond the
number of years of formal schooling. In work-
ing with this regression function, we are as-
suming that the unobserved factors–excluded
from the regression we estimate, and thus rel-
egated to the u term–are not systematically
related to years of formal schooling. This may
not be a tenable assumption; we might con-
sider “innate ability” as such a factor, and it
is probably related to success in both the edu-
cational process and the workplace. Thus, in-
nate ability–which we cannot measure without
some proxies–may be positively correlated to
the education variable, which would invalidate
assumption (2).
The population regression function, given
the zero conditional mean assumption, is
E(y | x) = β0 + β1 x    (3)
This allows us to separate y into two parts:
the systematic part, related to x, and the un-
systematic part, which is related to u. As long
as assumption (2) holds, those two compo-
nents are independent in the statistical sense.
Let us now derive the least squares estimates
of the regression parameters.
Let [(xi, yi) : i = 1, ..., n] denote a random sam-
ple of size n from the population, where y_i and x_i are presumed to obey the relation (1).
The assumption (2) allows us to state that
E(u) = 0, and given that assumption, that
Cov(x, u) = E(xu) = 0, where Cov(·) denotes
the covariance between the random variables.
These assumptions can be written in terms of
the regression error:
E (yi − β0 − β1xi) = 0 (4)
E [xi (yi − β0 − β1xi)] = 0
These two equations place two restrictions on
the joint probability distribution of x and u.
Since there are two unknown parameters to be
estimated, we might look upon these equations
to provide solutions for those two parameters.
We choose estimators b0 and b1 to solve the
sample counterparts of these equations, mak-
ing use of the principle of the method of mo-
ments:
n^{-1}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0    (5)

n^{-1}\sum_{i=1}^{n} x_i\,(y_i - b_0 - b_1 x_i) = 0
the so-called normal equations of least squares.
Why is this method said to be “least squares”?
Because as we shall see, it is equivalent to min-
imizing the sum of squares of the regression
residuals. How do we arrive at the solution?
The first “normal equation” can be seen to be
b_0 = \bar{y} - b_1 \bar{x}    (6)
where ȳ and x̄ are the sample averages of those
variables. This implies that the regression line
passes through the point of means of the sam-
ple data. Substituting this solution into the
second normal equation, we now have one equa-
tion in one unknown, b1 :
0 = \sum_{i=1}^{n} x_i\,(y_i - (\bar{y} - b_1\bar{x}) - b_1 x_i)

\sum_{i=1}^{n} x_i\,(y_i - \bar{y}) = b_1 \sum_{i=1}^{n} x_i\,(x_i - \bar{x})

b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

b_1 = \frac{Cov(x, y)}{Var(x)}    (7)
where the slope estimate is merely the ratio of
the sample covariance of the two variables to
the variance of x, which must be nonzero for
the estimates to be computed. This merely
implies that not all of the sample values of x
can take on the same value. There must be
diversity in the observed values of x. These
estimates–b0 and b1−are said to be the ordi-
nary least squares (OLS) estimates of the
regression parameters, since they can be de-
rived by solving the least squares problem:
\min_{b_0, b_1} S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2    (8)
Here we minimize the sum of squared residu-
als, or differences between the regression line
and the values of y, by choosing b0 and b1.
If we take the derivatives ∂S/∂b0 and ∂S/∂b1 and set the resulting first order conditions to
zero, the two equations that result are exactly
the OLS solutions for the estimated parame-
ters shown above. The “least squares” esti-
mates minimize the sum of squared residuals,
in the sense that any other line drawn through
the scatter of (x, y) points would yield a larger
sum of squared residuals. The OLS estimates
provide the unique solution to this problem,
and can always be computed if (i) V ar(x) > 0
and (ii) n ≥ 2. The estimated OLS regression
line is then
ŷ_i = b_0 + b_1 x_i    (9)
where the “hat” denotes the predicted value
of y corresponding to that value of x. This is
the sample regression function (SRF), cor-
responding to the population regression func-
tion, or PRF (3). The population regression
function is fixed, but unknown, in the popu-
lation; the SRF is a function of the particular
sample that we have used to derive it, and a
different SRF will be forthcoming from a differ-
ent sample. The primary interest in these es-
timates usually involves b1 = ∂y/∂x = ∆y/∆x,
the amount by which y is predicted to change
from a unit change in the level of x. This slope
is often of economic interest, whereas the con-
stant term in many regressions is devoid of
economic meaning. For instance, a regres-
sion of major companies’ CEO salaries on the
firms’ return on equity–a measure of economic
performance–yields the regression estimates
S = 963.191 + 18.501r (10)
where S is the CEO’s annual salary, in thou-
sands of 1990 dollars, and r is average re-
turn on equity over the prior three years, in
per cent. This implies that a one percent in-
crease in ROE over the past three years is
worth $18,501 to a CEO, on average. The
average annual salary for the 209 CEOs in the
sample is $1.28 million, so the increment is
about 1.4% of that average salary. The SRF
can also be used to predict what a CEO will
earn for any level of ROE; points on the esti-
mated regression function are such predictions.
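As a small illustration of using points on the SRF as predictions, the sketch below simply evaluates equation (10) in Python; the 10% and 11% ROE values are arbitrary examples, not figures from the text:

def predicted_salary(roe_percent: float) -> float:
    """Point on the SRF in equation (10): salary in thousands of 1990 dollars."""
    return 963.191 + 18.501 * roe_percent

# A CEO of a firm with a 10% average ROE is predicted to earn about $1.15M;
# each additional percentage point of ROE adds $18,501 on average.
print(predicted_salary(10.0))                            # 1148.201
print(predicted_salary(11.0) - predicted_salary(10.0))   # 18.501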
Mechanics of OLS
Some algebraic properties of the OLS regression line:
(1) The sum (and average) of the OLS residuals is zero:

\sum_{i=1}^{n} e_i = 0    (11)

which follows from the first normal equation, which specifies that the estimated regression line goes through the point of means (x̄, ȳ), so that the mean residual must be zero.
(2) By construction, the sample covariance between the OLS residuals and the regressor is zero:

Cov(e, x) = \sum_{i=1}^{n} x_i e_i = 0    (12)

This is not an assumption, but follows directly from the second normal equation. The estimated coefficients, which give rise to the residuals, are chosen to make it so.
(3) Each value of the dependent variable may
be written in terms of its prediction and its
error, or regression residual:
y_i = ŷ_i + e_i
so that OLS decomposes each yi into two parts:
a fitted value, and a residual. Property (2) also
implies that Cov(e, ŷ) = 0, since ŷ is a linear
transformation of x, and linear transformations
have linear effects on covariances. Thus, the
fitted values and residuals are uncorrelated in
the sample. Taking this property and applying
it to the entire sample, we define
SST = \sum_{i=1}^{n}(y_i - \bar{y})^2

SSE = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2

SSR = \sum_{i=1}^{n} e_i^2
as the Total sum of squares, Explained sum
of squares, and Residual sum of squares, re-
spectively. Note that SST expresses the total
variation in y around its mean (and we do not
strive to “explain” its mean; only how it varies
about its mean). The second quantity, SSE,
expresses the variation of the predicted values
ŷ around the mean value of y (and it is trivial
to show that ŷ has the same mean as y). The
third quantity, SSR, is the same as the least
squares criterion S from (8). (Note that some
textbooks interchange the definitions of SSE
and SSR, since both “explained” and “error”
start with E, and “regression” and “residual”
start with R). Given these sums of squares, we
can generalize the decomposition mentioned
above into
SST = SSE + SSR (13)
or, the total variation in y may be divided into
that explained and that unexplained, i.e. left
in the residual category. To prove the validity
of (13), note that
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}))^2

= \sum_{i=1}^{n}[e_i + (\hat{y}_i - \bar{y})]^2

= \sum_{i=1}^{n} e_i^2 + 2\sum_{i=1}^{n} e_i\,(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2

SST = SSR + SSE

given that the middle term in this expression is equal to zero. But this term is the sample covariance of e and ŷ, given a zero mean for e, and by (12) we have established that this is zero.
How good a job does this SRF do? Does the
regression function explain a great deal of the
variation of y, or not very much? That can
now be answered by making use of these sums
of squares:
R² = [r_{xy}]² = SSE/SST = 1 − SSR/SST
The R2 measure (sometimes termed the coef-
ficient of determination) expresses the percent
of variation in y around its mean “explained”
by the regression function. It is an r, or simple
correlation coefficient, squared, in this case of
simple regression on a single x variable. Since
the correlation between two variables ranges
between -1 and +1, the squared correlation
ranges between 0 and 1. In that sense, R2
is like a batting average. In the case where
R2 = 0, the model we have built fails to ex-
plain any of the variation in the y values around
their mean–unlikely, but it is certainly possible
to have a very low value of R2. In the case
where R2 = 1, all of the points lie on the SRF.
That is unlikely when n > 2, but it may be
the case that all points lie close to the line,
in which case R2 will approach 1. We can-
not make any statistical judgment based di-
rectly on R2, or even say that a model with
a higher R2 and the same dependent variable
is necessarily a better model; but other things
equal, a higher R2 will be forthcoming from a
model that captures more of y's behavior. In
cross-sectional analyses, where we are trying
to understand the idiosyncrasies of individual
behavior, very low R2 values are common, and
do not necessarily denote a failure to build a
useful model.
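A quick numerical check of the decomposition (13) and of R² as a squared correlation, on simulated data (the coefficients, error spread, and sample size here are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
e = y - yhat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((yhat - y.mean()) ** 2)
ssr = np.sum(e ** 2)

print(np.isclose(sst, sse + ssr))              # SST = SSE + SSR
print(sse / sst, np.corrcoef(x, y)[0, 1] ** 2) # R^2 equals the squared simple correlation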
Important issues in evaluating applied work:
how do the quantities we have estimated change
when the units of measurement are changed?
In the estimated model of CEO salaries, since
the y variable was measured in thousands of
dollars, the intercept and slope coefficient refer to those units as well. If we measured salaries in dollars, the intercept and slope would be multiplied by 1000, but nothing else would change. The correlation between y and x is not affected by linear transformations, so we would not alter the R² of this equation by changing its units of measurement. Likewise, if ROE were measured in decimals rather than per cent, it would merely change the units of measurement of the slope coefficient. Dividing r by 100 would cause the slope to be multiplied by 100. In the original (10), with r in per cent, the slope is 18.501 (thousands of dollars per one-unit change in r). If we expressed r in decimal form, the slope would be 1850.1. A change in r from 0.10 to 0.11–a one per cent increase in ROE–would be associated with a change in salary of (0.01)(1850.1) = 18.501 thousand dollars. Again, the correlation between salary and ROE would not be altered. This also applies for a transformation such as F = 32 + (9/5)C;
it would not matter whether we viewed tem-
perature in degrees F or degrees C as a causal
factor in estimating the demand for heating oil,
since the correlation between the dependent
variable and temperature would be unchanged
by switching from Fahrenheit to Celsius de-
grees.
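The sketch below illustrates these units-of-measurement points on simulated data (the ROE and salary values are invented, not the 209-CEO sample, and the simple_ols helper is purely illustrative): rescaling a variable rescales the coefficients, but leaves R² unchanged.

import numpy as np

rng = np.random.default_rng(3)
roe = rng.uniform(5, 25, size=209)                    # per cent
salary = 963 + 18.5 * roe + rng.normal(0, 500, 209)   # thousands of dollars

def simple_ols(x, y):
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return b0, b1, r2

print(simple_ols(roe, salary))          # slope in thousands of $ per percentage point of ROE
print(simple_ols(roe / 100, salary))    # ROE in decimals: slope 100 times larger, same R^2
print(simple_ols(roe, salary * 1000))   # salary in dollars: both coefficients 1000 times larger, same R^2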
Functional form
Simple linear regression would seem to be a
workable tool if we have a presumed linear re-
lationship between y and x, but what if theory
suggests that the relation should be nonlinear?
It turns out that the “linearity” of regression
refers to y being expressed as a linear func-
tion of x−but neither y nor x need be the “raw
data” of our analysis. For instance, regressing
y on t (a time trend) would allow us to analyse
a linear trend, or constant growth, in the data.
What if we expect the data to exhibit expo-
nential growth, as would population, or sums
earning compound interest? If the underlying
model is
y = A exp (rt) (14)
log y = logA+ rt
y∗ = A∗+ rt (15)
so that the “single-log” transformation may
be used to express a constant-growth relation-
ship, in which r is the regression slope coef-
ficient that directly estimates ∂ log y/∂t. Like-
wise, the “double-log” transformation can be
used to express a constant-elasticity relation-
ship, such as that of a Cobb-Douglas function:
y = Axα (16)
log y = logA+ α logx
y∗ = A∗+ αx∗
In this context, the slope coefficient α is an
estimate of the elasticity of y with respect to
x, given that ηy,x = ∂ log y/∂ logx by the defini-
tion of elasticity. The original equation is non-
linear, but the transformed equation is a linear
function which may be estimated by OLS re-
gression.
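A brief sketch of the double-log transformation at work, using simulated data from a constant-elasticity relation (the values of A, α, and the error spread are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(4)

# Constant-elasticity relation y = A * x^alpha with a multiplicative error
x = rng.uniform(1, 10, size=500)
A, alpha = 2.0, 0.7
y = A * x**alpha * np.exp(rng.normal(0, 0.1, size=500))

# OLS on the log-transformed variables recovers the elasticity alpha
ly, lx = np.log(y), np.log(x)
b1 = np.cov(lx, ly, ddof=1)[0, 1] / np.var(lx, ddof=1)
b0 = ly.mean() - b1 * lx.mean()
print(b1)          # close to 0.7, the elasticity of y with respect to x
print(np.exp(b0))  # close to A = 2.0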
Likewise, a model in which y is thought to de-
pend on 1/x (the reciprocal model) may be
estimated by linear regression by just defin-
ing a new variable, z, equal to 1/x (presuming
x > 0). That model has an interesting inter-
pretation if you work out its algebra.
We often use a polynomial form to allow for
nonlinearities in a regression relationship. For
instance, rather than including only x as a re-
gressor, we may include x and x2. Although
this relationship is linear in the parameters, it
implies that ∂y/∂x = β + 2γx, so that the effect
of x on y now depends on the level of x for
that observation, rather than being a constant
factor.
Properties of OLS estimators
Now let us consider the properties of the re-
gression estimators we have derived, consider-
ing b0 and b1 as estimators of their respective
population quantities. To establish the unbi-
asedness of these estimators, we must make
several assumptions:
Proposition 1 SLR1: in the population, the
dependent variable y is related to the indepen-
dent variable x and the error u as
y = β0 + β1x + u (17)
Proposition 2 SLR2: we can estimate the pop-
ulation parameters from a sample of size n,
{(xi, yi), i = 1, ..., n}.
Proposition 3 SLR3: the error process has a
zero conditional mean:
E (u | x) = 0. (18)
Proposition 4 SLR4: the independent vari-
able x has a positive variance:
(n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 > 0.    (19)
Given these four assumptions, we may pro-
ceed, considering the intercept and slope esti-
mators as random variables. For the slope es-
timator, we may express the estimator in terms
of population coefficients and errors:
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{s_x^2}    (20)
where we have defined s2x as the total variation
in x (not the variance of x). Substituting, we
can write the slope estimator as:
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{s_x^2}

= \frac{\beta_0 \sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n}(x_i - \bar{x})\,x_i + \sum_{i=1}^{n}(x_i - \bar{x})\,u_i}{s_x^2}    (21)
We can show that the first term in the nu-
merator is algebraically zero, given that the
deviations around the mean sum to zero. The
second term can be written as β1 Σ(xi − x̄)xi, and since Σ(xi − x̄)xi = Σ(xi − x̄)² = s²x, that term is merely β1 when
divided by s²x. Thus this expression can be rewritten as:
b_1 = \beta_1 + \frac{1}{s_x^2}\sum_{i=1}^{n}(x_i - \bar{x})\,u_i
showing that any randomness in the estimates
of b1 is derived from the errors in the sample,
weighted by the deviations of their respective
x values. Given the assumed independence of
the distributions of x and u implied by (18),
this expression implies that:
E (b1) = β1,
or that b1 is an unbiased estimate of β1, given
the propositions above. The four propositions
listed above are all crucial for this result, but
the key assumption is the independence of x
and u.
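A small Monte Carlo sketch of this unbiasedness result (the population coefficients, error spread, sample size, and number of replications are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, n, reps = 1.0, 2.0, 50, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(0, 3, size=n)          # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    slopes[r] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(slopes.mean())   # close to beta1 = 2.0: b1 is unbiased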
We are also concerned about the precision of
the OLS estimators. To derive an estimator
of the precision, we must add an assumption
on the distribution of the error u :
Proposition 5 SLR5: (homoskedasticity):
V ar (u | x) = V ar(u) = σ2.
This assumption states that the variance of the
error term is constant over the population, and
thus within the sample. Given (18), the con-
ditional variance is also the unconditional vari-
ance. The errors are considered drawn from a
fixed distribution, with a mean of zero and a
constant variance of σ2. If this assumption is vi-
olated, we have the condition of heteroskedas-
ticity, which will often involve the magnitude
of the error variance relating to the magnitude
of x, or to some other measurable factor.
Given this additional assumption, but no fur-
ther assumptions on the nature of the distri-
bution of u, we may demonstrate that:
Var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{s_x^2}    (22)
so that the precision of our estimate of the
slope is dependent upon the overall error vari-
ance, and is inversely related to the variation in
the x variable. The magnitude of x does not
matter, but its variability in the sample does
matter. If we are conducting a controlled experiment (quite unlikely in economic analysis) we would want to choose widely spread values of x to generate the most precise estimate of ∂y/∂x.
We can likewise prove that b0 is an unbiased estimator of the population intercept, with sampling variance:

Var(b_0) = \frac{n^{-1}\sigma^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\, s_x^2}    (23)

so that the precision of the intercept depends, as well, upon the sample size and the magnitude of the x values. These formulas for the sampling variances will be invalid in the presence of heteroskedasticity–that is, when proposition SLR5 is violated.
These formulas are not operational, since they include the unknown parameter σ². To calculate estimates of the variances, we must first replace σ² with a consistent estimate, s², derived from the least squares residuals:
ei = yi − b0 − b1xi, i = 1, ..., n (24)
We cannot observe the error ui for a given ob-
servation, but we can generate a consistent es-
timate of the ith observation’s error with the ith
observation’s least squares residual, e_i. Like-
wise, a sample quantity corresponding to the
population variance σ2 can be derived from the
residuals:
s^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2 = \frac{SSR}{n-2}    (25)
where the numerator is just the least squares
criterion, SSR, divided by the appropriate de-
grees of freedom. Here, two degrees of free-
dom are lost, since each residual is calculated
by replacing two population coefficients with
their sample counterparts. This now makes it
possible to generate the estimated variances
and, more usefully, the estimated standard
error of the regression slope:
s_{b_1} = \frac{s}{s_x}
where s is the standard deviation, or standard
error, of the disturbance process (that is, √s²),
and s_x is the square root of s²x. It is this estimated standard
error that will be displayed on the computer
printout when you run a regression, and used
to construct confidence intervals and hypoth-
esis tests about the slope coefficient. We can
calculate the estimated standard error of the
intercept term by the same means.
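A sketch of these calculations on simulated data, following equations (24)–(25) and the expression for the slope's standard error above (the sample size and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 3, size=n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - b0 - b1 * x                          # least squares residuals, equation (24)

s2 = np.sum(e**2) / (n - 2)                  # equation (25)
s_x = np.sqrt(np.sum((x - x.mean())**2))     # square root of the total variation in x
se_b1 = np.sqrt(s2) / s_x                    # estimated standard error of the slope
print(np.sqrt(s2), se_b1)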
Regression through the origin
We could also consider a special case of the
model above where we impose a constraint
that β0 = 0, so that y is taken to be propor-
tional to x. This will often be inappropriate; it
is generally more sensible to let the data calcu-
late the appropriate intercept term, and rees-
timate the model subject to that constraint
only if that is a reasonable course of action.
Otherwise, the resulting estimate of the slope
coefficient will be biased. Unless theory sug-
gests that a strictly proportional relationship is
appropriate, the intercept should be included in
the model.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 3: Multiple regression analysis:
Estimation
In multiple regression analysis, we extend the
simple (two-variable) regression model to con-
sider the possibility that there are additional
explanatory factors that have a systematic ef-
fect on the dependent variable. The simplest
extension is the “three-variable” model, in which
a second explanatory variable is added:
y = β0 + β1x1 + β2x2 + u (1)
where each of the slope coefficients are now
partial derivatives of y with respect to the x
variable which they multiply: that is, hold-
ing x2 fixed, β1 = ∂y/∂x1. This extension also
allows us to consider nonlinear relationships,
such as a polynomial in z, where x1 = z and
x2 = z2. Then, the regression is linear in x1
and x2, but nonlinear in z : ∂y/∂z = β1 + 2β2z.
The key assumption for this model, analogous
to that which we specified for the simple re-
gression model, involves the independence of
the error process u and both regressors, or ex-
planatory variables:
E (u | x1, x2) = 0. (2)
This assumption of a zero conditional mean
for the error process implies that it does not
systematically vary with the x′s nor with any
linear combination of the x′s; u is independent,
in the statistical sense, from the distributions
of the x′s.
The model may now be generalized to the case
of k regressors:
y = β0 + β1x1 + β2x2 + ...+ βkxk + u (3)
where the β coefficients have the same inter-
pretation: each is the partial derivative of y
with respect to that x, holding all other x's constant (ceteris paribus), and the u term is
that nonsystematic part of y not linearly re-
lated to any of the x′s. The dependent variable
y is taken to be linearly related to the x′s, which
may bear any relation to each other (e.g. poly-
nomials or other transformations) as long as
there are no exact linear dependencies among
the regressors. That is, no x variable can be
an exact linear transformation of another, or
the regression estimates cannot be calculated.
The independence assumption now becomes:
E (u | x1, x2, ..., xk) = 0. (4)
Mechanics and interpretation of OLS
Consider first the “three-variable model” given
above in (1). The estimated OLS equation
contains the parameters of interest:
ŷ = b_0 + b_1 x_1 + b_2 x_2    (5)
and we may define the ordinary least squares
criterion in terms of the OLS residuals, calcu-
lated from a sample of size n, from this expres-
sion:
\min S = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_{i1} - b_2 x_{i2})^2    (6)
where the minimization of this expression is
performed with respect to each of the three
parameters, {b0, b1, b2}. In the case of k regres-
sors, these expressions include terms in bk, and
the minimization is performed with respect to
the (k+1) parameters {b0, b1, b2, ...bk}. For this
to be feasible, n > (k + 1) : that is, we must
have a sample larger than the number of pa-
rameters to be estimated from that sample.
The minimization is carried out by differenti-
ating the scalar S with respect to each of the
b′s in turn, and setting the resulting first order
condition to zero. This gives rise to (k+ 1) si-
multaneous equations in (k+1) unknowns, the
regression parameters, which are known as the
least squares normal equations. The normal equations are expressions in the sums of squares and cross products of y and the regressors, including a first “regressor” which is a column of 1's, multiplying the constant term. For the “three-variable” regression model, we can write out the normal equations as:

\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2    (7)

\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2

\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2

Just as in the “two-variable” case, the first normal equation can be interpreted as stating that the regression surface (in 3-space) passes through the multivariate point of means {\bar{x}_1, \bar{x}_2, \bar{y}}. These three equations may be uniquely solved, by normal algebraic techniques or linear algebra, for the estimated least squares parameters.
This extends to the case of k regressors and (k+1) regression parameters. In each case, the
regression coefficients are considered in the ce-
teris paribus sense: that each coefficient mea-
sures the partial effect of a unit change in its
variable, or regressor, holding all other regres-
sors fixed. If a variable is a component of more
than one regressor–as in a polynomial relation-
ship, as discussed above–the total effect of a
change in that variable is additive.
Fitted values, residuals, and their proper-
ties
Just as in simple regression, we may calculate
fitted values, or predicted values, after esti-
mating a multiple regression. For observation
i, the fitted value is
ŷ_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_k x_{ik}    (8)
and the residual is the difference between the
actual value of y and the fitted value:
e_i = y_i − ŷ_i    (9)
As with simple regression, the sum of the resid-
uals is zero; they have, by construction, zero
covariance with each of the x variables, and
thus zero covariance with ŷ; and since the av-
erage residual is zero, the regression surface
passes through the multivariate point of means,
{\bar{x}_1, \bar{x}_2, ..., \bar{x}_k, \bar{y}}.
There are two instances where the simple re-
gression of y on x1 will yield the same coeffi-
cient as the multiple regression of y on x1 and
x2, with respect to x1. In general, the simple re-
gression coefficient will not equal the multiple
regression coefficient, since the simple regres-
sion ignores the effect of x2 (and considers that
it can be viewed as nonsystematic, captured in
the error u). When will the two coefficients be
equal? First, when the coefficient of x2 is truly
zero–that is, when x2 really does not belong in
the model. Second, when x1 and x2 are un-
correlated in the sample. This is likely to be
quite rare in actual data. However, these two
cases suggest when the two coefficients will
be similar; when x2 is relatively unimportant in
explaining y, or when it is very loosely related
to x1.
We can define the same three sums of squares–
SST, SSE, SSR−as in simple regression, and
R2 is still the ratio of the explained sum of
squares (SSE) to the total sum of squares
(SST ). It is no longer a simple correlation (e.g.
ryx) squared, but it still has the interpretation
of a squared simple correlation coefficient: the
correlation between y and ŷ, r_{y,ŷ}. A very im-
portant principle is that R2 never decreases
when an explanatory variable is added to a
able may be, the R² of the expanded regression will be no less than that of the original regression. Thus, the regression R² may be arbitrarily increased by adding variables (even unimportant variables), and we should not be impressed by a high value of R² in a model with a long list of explanatory variables.
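A short demonstration of this property on simulated data: an irrelevant regressor (pure noise, unrelated to y) still cannot lower R². The r_squared helper below is for illustration only, assuming nothing beyond standard numpy:

import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)          # irrelevant regressor, unrelated to y

def r_squared(y, *regressors):
    X = np.column_stack([np.ones(len(y)), *regressors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

print(r_squared(y, x1))        # R^2 of the original model
print(r_squared(y, x1, junk))  # never smaller, even though 'junk' is irrelevant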
Just as with simple regression, it is possible to fit a model through the origin, suppressing the constant term. It is important to note that many of the properties we have discussed no longer hold in that case: for instance, the least squares residuals (the e_i) no longer have a zero sample average, and the R² from such an equation can actually be negative–that is, the equation does worse than the “model” which specifies that ŷ = ȳ for all i. If the population intercept β0 differs from zero, the slope coefficients computed in a regression through the origin will be biased. Therefore, we often will include an intercept, and let the data determine whether it should be zero.
Expected value of the OLS estimators
We now discuss the statistical properties of the
OLS estimators of the parameters in the pop-
ulation regression function. The population
model is taken to be (3). We assume that we
have a random sample of size n on the vari-
ables of the model. The multivariate analogue
to our assumption about the error process is
now:
E (u | x1, x2, ..., xk) = 0 (10)
so that we consider the error process to be
independent of each of the explanatory vari-
ables’ distributions. This assumption would
not hold if we misspecified the model: for in-
stance, if we ran a simple regression with inc
as the explanatory variable, but the population
model also contained inc2. Since inc and inc2
will have a positive correlation, the simple re-
gression’s parameter estimates will be biased.
This bias will also appear if there is a sepa-
rate, important factor that should be included
in the model; if that factor is correlated with
the included regressors, their coefficients will
be biased.
In the context of multiple regression, with sev-
eral independent variables, we must make an
additional assumption about their measured val-
ues:
Proposition 1 In the sample, none of the in-
dependent variables x may be expressed as an
exact linear relation of the others (including a
vector of 1s).
Every multiple regression that includes a con-
stant term can be considered as having a vari-
able x0i = 1 ∀i. This proposition states that
each of the other explanatory variables must
have nonzero sample variance: that is, it may
not be a constant in the sample. Second,
the proposition states that there is no per-
fect collinearity, or multicollinearity, in the
sample. If we could express one x as a linear
combination of the other x variables, this as-
sumption would be violated. If we have perfect
collinearity in the regressor matrix, the OLS es-
timates cannot be computed; mathematically,
they do not exist. A trivial example of perfect
collinearity would be the inclusion of the same
variable twice, measured in different units (or
via a linear transformation, such as tempera-
ture in degrees F versus C). The key concept:
each regressor we add to a multiple regression
must contain information at the margin. It
must tell us something about y that we do not
already know. For instance, if we consider x1 :
proportion of football games won, x2 : pro-
portion of games lost, and x3: proportion of
games tied, and we try to use all three as ex-
planatory variables to model alumni donations
to the athletics program, we find that there
is perfect collinearity: since for every college
in the sample, the three variables sum to one
by construction. There is no information in,
e.g., x3 once we know the other two, so in-
cluding it in a regression with the other two
makes no sense (and renders that regression
uncomputable). We can leave any one of the
three variables out of the regression; it does
not matter which one. Note that this proposi-
tion is not an assumption about the population
model: it is an implication of the sample data
we have to work with. Note also that this only
applies to linear relations among the explana-
tory variables: a variable and its square, for
instance, are not linearly related, so we may
include both in a regression to capture a non-
linear relation between y and x.
Given the four assumptions: that of the pop-
ulation model, the random sample, the zero
conditional mean of the u process, and the ab-
sence of perfect collinearity, we can demon-
strate that the OLS estimators of the popula-
tion parameters are unbiased:
Ebj = βj, j = 0, ..., k (11)
What happens if we misspecify the model by
including irrelevant explanatory variables: x
variables that, unbeknownst to us, are not in
the population model? Fortunately, this does
not damage the estimates. The regression will
still yield unbiased estimates of all of the coef-
ficients, including unbiased estimates of these
variables’ coefficients, which are zero in the
population. It may be improved by removing
such variables, since including them in the re-
gression consumes degrees of freedom (and re-
duces the precision of the estimates); but the
effect of overspecifying the model is rather
benign. The same applies to overspecifying
a polynomial order; including quadratic and
cubic terms when only the quadratic term is
needed will be harmless, and you will find that
the cubic term’s coefficient is far from signifi-
cant.
However, the opposite case–where we under-
specify the model by mistakenly excluding a
relevant explanatory variable–is much more se-
rious. Let us formally consider the direction
and size of bias in this case. Assume that the
population model is:
y = β0 + β1x1 + β2x2 + u (12)
but we do not recognize the importance of x2,
and mistakenly consider the relationship
y = β0 + β1x1 + u (13)
to be fully specified. What are the consequences of estimating the latter relationship? We can show that in this case:
E\,b_1 = \beta_1 + \beta_2\,\frac{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)\,x_{i2}}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2}    (14)
so that the OLS coefficient b1 will be biased–not equal to its population value of β1, even in an expected sense–in the presence of the second term. That term will be nonzero when β2 is nonzero (which it is, by assumption) and when the fraction is nonzero. But the fraction is merely a simple regression coefficient in the auxiliary regression of x2 on x1. If the regressors are correlated with one another, that regression coefficient will be nonzero, and its magnitude will be related to the strength of the correlation (and the units of the variables). Say that the auxiliary regression is:
x_2 = d_0 + d_1 x_1 + v    (15)
with d1 > 0, so that x1 and x2 are positively correlated (e.g. as income and wealth would be in a sample of household data). Then we can write the bias as:
Eb1 − β1 = β2d1 (16)
and its sign and magnitude will depend on both the relation between y and x2 and the interrelation among the explanatory variables. If there is no such relationship–if x1 and x2 are uncorrelated in the sample–then b1 is unbiased (since in that special case multiple regression reverts to simple regression). In all other cases, though, there will be bias in the estimation of the underspecified model. If the left side of (16) is positive, we say that b1 has an upward bias: the OLS value will be too large. If it were negative, we would speak of a downward bias. If the OLS coefficient is closer to zero than the population coefficient, we would say that it is “biased toward zero” or attenuated.
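A Monte Carlo sketch of the omitted-variable bias formula (16), with arbitrary illustrative values β1 = 1, β2 = 0.5, d1 = 0.8, so the expected estimate from the underspecified model is about β1 + β2 d1 = 1.4:

import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 2000
beta1, beta2, d1 = 1.0, 0.5, 0.8     # x2 = d0 + d1*x1 + v in the population

biased = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.3 + d1 * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # underspecified model: regress y on x1 only
    biased[r] = np.cov(x1, y, ddof=1)[0, 1] / np.var(x1, ddof=1)

print(biased.mean())   # roughly beta1 + beta2*d1 = 1.4, not beta1 = 1.0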
It is more difficult to evaluate the potential bias in a multiple regression, where the population relationship involves k variables and we include, for instance, k − 1 of them. All of the OLS coefficients in the underspecified model will generally be biased in this circumstance unless the omitted variable is uncorrelated with each included regressor (a very unlikely outcome). What we can take away as a general rule is the asymmetric nature of specification error: it is far more damaging to exclude a relevant variable than to include an irrelevant variable. When in doubt (and we almost always are in doubt as to the nature of the true relationship) we will always be better off erring on the side of caution, and including variables that we are not certain should be part of the explanation of y.
Variance of the OLS estimators
We first reiterate the assumption of homoskedasticity, in the context of the k-variable regression model:

Var(u | x_1, x_2, ..., x_k) = \sigma^2    (17)
If this assumption is satisfied, then the error
variance is identical for all combinations of the
explanatory variables. If it is violated, we say
that the errors are heteroskedastic, and must
be concerned about our computation of the
OLS estimates’ variances. The OLS estimates
are still unbiased in this case, but our esti-
mates of their variances are not. Given this
assumption, plus the four made earlier, we can
derive the sampling variances, or precision, of
the OLS slope estimators:
Var(b_j) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}, \quad j = 1, ..., k    (18)
where SSTj is the total variation in xj about
its mean, and R²_j is the R² from the auxiliary
regression of x_j on all of the other x
variables, including the constant term. We see
immediately that this formula applies to sim-
ple regression, since the formula we derived for
the slope estimator in that instance is identi-
cal, given that R2j = 0 in that instance (there
are no other x variables). Given the population
error variance σ2, what will make a particular
OLS slope estimate more precise? Its preci-
sion will be increased (i.e. its sampling vari-
ance will be smaller) the larger is the variation
in the associated x variable. Its precision will
be decreased, the larger the amount of vari-
able xj that can be “explained” by other vari-
ables in the regression. In the case of perfect
collinearity, R2j = 1, and the sampling variance
goes to infinity. If R2j is very small, then this
variable makes a large marginal contribution to
the equation, and we may calculate a relatively
more precise estimate of its coefficient. If R2j is
quite large, the precision of the coefficient will
be low, since it will be difficult to “partial out”
the effect of variable j on y from the effects of
the other explanatory variables (with which it
is highly correlated). However, we must has-
ten to add that the assumption that there is no
perfect collinearity does not preclude R2j from
being close to unity–it only states that it is less
than unity. The principle stated above, when
we discussed collinearity, applies here as well: at
the margin, each explanatory variable must add
information that we do not already have, in
whole or in large part, if that variable is to have
a meaningful role in a regression model of y. This for-
mula for the sampling variance of an OLS co-
efficient also explains why we might not want
to overspecify the model: if we include an irrel-
evant explanatory variable, the point estimates
are unbiased, but their sampling variances will
be larger than they would be in the absence
of that variable (unless the irrelevant variable
is uncorrelated with the relevant explanatory
variables).
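A brief sketch evaluating formula (18) for a posited error variance; with only one other regressor, the auxiliary R²_j reduces to the squared correlation between the two regressors (all of the numbers here are invented):

import numpy as np

rng = np.random.default_rng(9)
n, sigma = 500, 2.0
x2 = rng.normal(size=n)
# x1 is strongly related to x2, so the auxiliary R^2 for x1 is high
x1 = 0.9 * x2 + rng.normal(scale=0.3, size=n)

sst1 = np.sum((x1 - x1.mean())**2)            # total variation in x1
r2_1 = np.corrcoef(x1, x2)[0, 1] ** 2         # auxiliary R^2 (only one other regressor here)

var_b1 = sigma**2 / (sst1 * (1 - r2_1))       # equation (18)
print(r2_1, var_b1)
# With uncorrelated regressors the same formula gives sigma^2 / sst1, a smaller sampling variance.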
How do we make (18) operational? As written,
it cannot be computed, since it depends on the
unknown population parameter σ2. Just as in
the case of simple regression, we must replace
σ2 with a consistent estimate:
s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - (k+1)} = \frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}    (19)
where the numerator is just SSR, and the de-
nominator is the sample size, less the number
of estimated parameters: the constant and k
slopes. In simple regression, we computed s²
using a denominator of n − 2, reflecting the two estimated parameters: intercept plus slope.
Now, we must account for the additional slope
parameters. This also suggests that we cannot
estimate a k−variable regression model with-
out having a sample of size at least (k+1). In-
deed, just as two points define a straight line,
the degrees of freedom in simple regression will
be positive iff n > 2. For multiple regression,
with k slopes and an intercept, n > (k + 1).
Of course, in practice, we would like to use a
much larger sample than this in order to make
inferences about the population.
The positive square root of s2 is known as
the standard error of regression, or SER.
(Stata reports s on the regression output la-
belled “Root MSE”, or root mean squared er-
ror). It is in the same units as the dependent
variable, and is the numerator of our estimated
standard errors of the OLS coefficients. The
magnitude of the SER is often compared to
the mean of the dependent variable to gauge
the regression’s ability to “explain” the data.
In the presence of heteroskedasticity–where the
variance of the error process is not constant
over the sample–the estimate of s2 presented
above will be biased. Likewise, the estimates
of coefficients’ standard errors will be biased,
since they depend on s2. If there is reason to
worry about heteroskedasticity in a particular
sample, we must work with a different ap-
proach to compute these measures.
Efficiency of OLS estimators
An important result, which underlies the widespread
use of OLS regression, is the Gauss-Markov
Theorem, describing the relative efficiency of
the OLS estimators. Under the assumptions
that we have made above for multiple regression–
and making no further distributional assump-
tions about the error process–we may show
that:
Proposition 2 (Gauss-Markov) Among the
class of linear, unbiased estimators of the pop-
ulation regression function, OLS provides the
best estimators, in terms of minimum sampling
variance: OLS estimators are best linear unbi-
ased estimators (BLUE).
This theorem only considers estimators that
have these two properties of linearity and unbi-
asedness. Linearity means that the estimator–
the rule for computing the estimates–can be
written as a linear function of the data y (es-
sentially, as a weighted average of the y val-
ues). OLS clearly meets this requirement. Un-
der the assumptions above, OLS estimators
are also unbiased. Given those properties, the
proof of the Gauss-Markov theorem demon-
strates that the OLS estimators have the mini-
mum sampling variance of any possible estima-
tor: that is, they are the “best” (most precise)
that could possibly be calculated. This theo-
rem is not based on the assumption that, for
instance, the u process is Normally distributed;
only that it is independent of the x variables
and homoskedastic (that is, that it is i.i.d.).
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 4: Multiple regression analysis:
Inference
We have discussed the conditions under which
OLS estimators are unbiased, and derived the
variances of these estimators under the Gauss-
Markov assumptions. The Gauss-Markov the-
orem establishes that OLS estimators have the
smallest variance of any linear unbiased estima-
tors of the population parameters. We must
now more fully characterize the sampling distri-
bution of the OLS estimators–beyond its mean
and variance–so that we may test hypotheses
on the population parameters. To make the
sampling distribution tractable, we add an as-
sumption on the distribution of the errors:
Proposition 1 MLR6 (Normality) The popu-
lation error u is independent of the explanatory
variables x1, ..., xk and is normally distributed
with zero mean and constant variance: u ∼ N(0, σ²).
This is a much stronger assumption than we
have previously made on the distribution of the
errors. The assumption of normality, as we
have stated it, subsumes both the assumption
of the error process being independent of the
explanatory variables, and that of homoskedas-
ticity. For cross-sectional regression analysis,
these six assumptions define the classical lin-
ear model. The rationale for normally dis-
tributed errors is often phrased in terms of the
many factors influencing y being additive, ap-
pealing to the Central Limit Theorem to sug-
gest that the sum of a large number of random
factors will be normally distributed. Although
we might have reason in a particular context
to doubt this rationale, we usually use it as a
working hypothesis. Various transformations–
such as taking the logarithm of the dependent
variable–are often motivated in terms of their
inducing normality in the resulting errors.
What is the importance of assuming normal-
ity for the error process? Under the assump-
tions of the classical linear model, normally dis-
tributed errors give rise to normally distributed
OLS estimators:
b_j \sim N(\beta_j,\ Var(b_j))    (1)

which will then imply that:

\frac{b_j - \beta_j}{\sigma_{b_j}} \sim N(0, 1)    (2)
This follows since each of the bj can be writ-
ten as a linear combination of the errors in the
sample. Since we assume that the errors are in-
dependent, identically distributed normal ran-
dom variates, any linear combination of those
errors is also normally distributed. We may
also show that any linear combination of the
bj is also normally distributed, and a subset
of these estimators has a joint normal distri-
bution. These properties will come in handy
in formulating tests on the coefficient vector.
We may also show that the OLS estimators
will be approximately normally distributed (at
least in large samples), even if the underlying
errors are not normally distributed.
Testing an hypothesis on a single βj
To test hypotheses about a single population
parameter, we start with the model containing
k regressors:
y = β0 + β1x1 + β2x2 + ...+ βkxk + u (3)
Under the classical linear model assumptions,
a test statistic formed from the OLS estimates
may be expressed as:
\frac{b_j - \beta_j}{s_{b_j}} \sim t_{n-k-1}    (4)
Why does this test statistic differ from (2)
above? In that expression, we considered the
variance of bj as an expression including σ, the
unknown standard deviation of the error term
(that is, √σ²). In this operational test statistic
(4), we have replaced σ with a consistent es-
timate, s. That additional source of sampling
variation requires the switch from the standard
normal distribution to the t distribution, with
(n−k−1) degrees of freedom. Where n is not
all that large relative to k, the resulting t distri-
bution will have considerably fatter tails than
the standard normal. Where (n − k − 1) is a
large number–greater than 100, for instance–
the t distribution will essentially be the stan-
dard normal. The net effect is to make the
critical values larger for a finite sample, and
raise the threshold at which we will conclude
that there is adequate evidence to reject a par-
ticular hypothesis.
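As a check on those thresholds, Stata's invttail() function returns the critical value isolating a given probability in the upper tail; a small sketch, with 20 and 120 degrees of freedom chosen purely for illustration:

* two-tailed 5% critical values for 20 and 120 degrees of freedom
di invttail(20, 0.025)
di invttail(120, 0.025)
* the limiting standard normal critical value, for comparison
di invnormal(0.975)

The first value is noticeably larger than 1.96, while the second is nearly indistinguishable from it.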
The test statistic (4) allows us to test hypothe-
ses regarding the population parameter βj : in
particular, to test the null hypothesis
H0 : βj = 0 (5)
for any of the regression parameters. The
“t-statistic” used for this test is merely that
printed on the output when you run a regres-
sion in Stata or any other program: the ratio
of the estimated coefficient to its estimated
standard error. If the null hypothesis is to be
rejected, the “t-stat” must be larger (in ab-
solute value) than the critical point on the t-
distribution. The “t-stat” will have the same
sign as the estimated coefficient, since the stan-
dard error is always positive. Even if βj is actu-
ally zero in the population, a sample estimate
of this parameter, bj, will never equal exactly
zero. But when should we conclude that it
could be zero? When its value cannot be dis-
tinguished from zero. There will be cause to
reject this null hypothesis if the value, scaled
by its standard error, exceeds the threshold.
For a “two-tailed test,” there will be reason to
reject the null if the “t-stat” takes on a large
negative value or a large positive value; thus
we reject in favor of the alternative hypothesis
(of βj 6= 0) in either case. This is a two-sided
alternative, giving rise to a two-tailed test. If
the hypothesis is to be tested at, e.g., the 95%
level of confidence, we use critical values from
the t-distribution which isolate 2.5% in each
tail, for a total of 5% of the mass of the dis-
tribution. When using a computer program to
calculate regression estimates, we usually are
given the “p-value” of the estimate–that is,
the tail probability corresponding to the coef-
ficient’s t-value. The p-value may usefully be
considered as the probability of observing a t-
statistic as extreme as that shown if the null
hypothesis is true. If the t-value was equal to,
e.g., the 95% critical value, the p-value would
be exactly 0.05. If the t-value was higher, the
p-value would be closer to zero, and vice versa.
Thus, we are looking for small p-values as in-
dicative of rejection. A p-value of 0.92, for in-
stance, corresponds to an hypothesis that can
be rejected at the 8% level of confidence–thus
quite irrelevant, since we would expect to find
a value that large 92% of the time under the
null hypothesis. On the other hand, a p-value
of 0.08 will reject at the 90% level, but not at
the 95% level; only 8% of the time would we
expect to find a t-statistic of that magnitude
if H0 was true.
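Those reported p-values can be reproduced with Stata's ttail() function, which returns the upper-tail probability of the t distribution; a brief sketch, with y, x1 and x2 standing in for whatever variables are in the model:

regress y x1 x2
* two-tailed p-value for the coefficient on x1
di 2*ttail(e(df_r), abs(_b[x1]/_se[x1]))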
What if we have a one-sided alternative? For
instance, we may phrase the hypothesis of in-
terest as:
H0 : βj ≥ 0 (6)
HA : βj < 0
Here, we must use the appropriate critical point
on the t-distribution to perform this test at the
same level of confidence. If the point estimate
bj is positive, then we do not have cause to
reject the null. If it is negative, we may have
cause to reject the null if it is a sufficiently
large negative value. The critical point should
be that which isolates 5% of the mass of the
distribution in that tail (for a 95% level of con-
fidence). This critical value will be smaller (in
absolute value) than that corresponding to a
two-tailed test, which isolates only 2.5% of the
mass in that tail. The computer program al-
ways provides you with a p-value for a two-
tailed test; if the p-value is 0.08, for instance,
it corresponds to a one-tailed p-value of 0.04
(that being the mass in that tail).
Testing other hypotheses about βj
Every regression output includes the informa-
tion needed to test the two-tailed or one-tailed
hypotheses that a population parameter equals
zero. What if we want to test a different hy-
pothesis about the value of that parameter?
For instance, we would not consider it sensible
for the mpc for a consumer to be zero, but we
might have an hypothesized value (of, say, 0.8)
implied by a particular theory of consumption.
How might we test this hypothesis? If the null
is stated as:
H0 : βj = aj (7)
where aj is the hypothesized value, then the
appropriate test statistic becomes:
(bj − aj)/sbj ∼ tn−k−1 (8)
and we may simply calculate that quantity and
compare it to the appropriate point on the t-
distribution. Most computer programs provide
you with assistance in this effort; for instance,
if we believed that aj, the coefficient on bdrms,
should be equal to $20,000 in a regression of
house prices on square footage and bdrms (e.g.
using HPRICE1), we would use Stata’s test
command:
regress price bdrms sqrft
test bdrms=20000
where we use the name of the variable as a
shorthand for the name of the coefficient on
that variable. Stata, in that instance, presents
us with:
( 1) bdrms = 20000.0
F( 1, 85) = 0.26
Prob > F = 0.6139
making use of an F-statistic, rather than a t-
statistic, to perform this test. In this partic-
ular case–of an hypothesis involving a single
regression coefficient–we may show that this
F-statistic is merely the square of the asso-
ciated t-statistic. The p-value would be the
same in either case. The estimated coefficient
is 15198.19, with an estimated standard error
of 9483.517. Plugging in these values to (8)
yields a t-statistic:
. di (_b[bdrms]-20000)/_se[bdrms]
-.50633208
which, squared, is the F-statistic shown by the
test command. Just as with tests against a
null hypothesis of zero, the results of the test
command may be used for one-tailed tests as
well as two-tailed tests; then, the magnitude of
the coefficient matters (i.e. the fact that the
estimated coefficient is about $15,000 means
we would never reject a null that it is less than
$20,000), and the p-value must be adjusted for
one tail. Any number of test commands may
be given after a regress command in Stata,
testing different hypotheses about the coeffi-
cients.
Confidence intervals
As we discussed in going over Appendix C, we
may use the point estimate and its estimated
standard error to calculate an hypothesis test
on the underlying population parameter, or we
may form a confidence interval for that pa-
rameter. Stata makes that easy in a regression
context by providing the 95% confidence inter-
val for every estimated coefficient. If you want
to use some other level of significance, you may
either use the level() option on regress (e.g.
regress price bdrms sqrft, level(90)) or you
may change the default level for this run with
set level. All further regressions will report
confidence intervals with that level of confi-
dence. To connect this concept to that of the
hypothesis test, consider that in the above ex-
ample the 95% confidence interval for βbdrms extended from −3657.581 to 34053.96; thus, an hypothesis test with the null that βbdrms takes on any value in this interval (including zero) will not lead to a rejection.
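That interval can be reproduced by hand from the coefficient, its standard error and the t critical value; a minimal sketch using the HPRICE1 regression above:

regress price bdrms sqrft
* 95% confidence interval for the bdrms coefficient, computed manually
di _b[bdrms] - invttail(e(df_r), 0.025)*_se[bdrms]
di _b[bdrms] + invttail(e(df_r), 0.025)*_se[bdrms]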
Testing hypotheses about a single linear
combination of the parameters
Economic theory will often suggest that a par-
ticular linear combination of parameters should
take on a certain value: for instance, in a
Cobb-Douglas production function, that the
slope coefficients should sum to one in the case
of constant returns to scale (CRTS):
Q = A·L^β1·K^β2·E^β3 (9)
log Q = log A + β1 log L + β2 log K + β3 log E + υ
where K,L,E are the factors capital, labor, and
energy, respectively. We have added an error
term to the double-log-transformed version of
this model to represent it as an empirical re-
lationship. The hypothesis of CRTS may be
stated as:
H0 : β1 + β2 + β3 = 1 (10)
The test statistic for this hypothesis is quite
straightforward:
(b1 + b2 + b3 − 1)/sb1+b2+b3 ∼ tn−k−1 (11)
and its numerator may be easily calculated.
The denominator, however, is not so simple; it
represents the standard error of the linear com-
bination of estimated coefficients. You may
recall that the variance of a sum of random
variables is not merely the sum of their vari-
ances, but an expression also including their
covariances, unless they are independent. The
random variables {b1, b2, b3} are not indepen-
dent of one another since the underlying re-
gressors are not independent of one another.
Each of the underlying regressors is assumed
to be independent of the error term u, but
not of the other regressors. We would expect,
for instance, that firms with a larger capital
stock also have a larger labor force, and use
more energy in the production process. The
variance (and standard error) that we need
may be readily calculated by Stata, however,
from the variance-covariance matrix of the es-
timated parameters via the test command:
test cap+labor+energy=1
will provide the appropriate test statistic, again
as an F-statistic with a p-value. You may in-
terpret this value directly. If you would like the
point and interval estimate of the hypothesized
combination, you can compute that (after a re-
gression) with the lincom (linear combination)
command:
lincom cap + labor + energy
will show the sum of those values and a confi-
dence interval for that sum.
We may also use this technique to test other
hypotheses than adding-up conditions on the
parameters. For instance, consider a two-factor
Cobb-Douglas function in which you have only
labor and capital, and you want to test the hy-
pothesis that labor’s share is 2/3. This implies
that the labor coefficient should be twice the
capital coefficient, or:
H0 : βL = 2βK, or (12)
H0 :βLβK
= 2, or
H0 : βL − 2βK = 0
Note that this does not allow us to test a non-
linear hypothesis on the parameters: but con-
sidering that a ratio of two parameters is a
constant is not a nonlinear restriction. In the
latter form, we may specify it to Stata’s test
command as:
test labor - 2*cap = 0
In fact, Stata will figure out that form if you
specify the hypothesis as:
test labor=2*cap
(rewriting it in the above form), but it is not
quite smart enough to handle the ratio form.
It is easy to rewrite the ratio form into one
of the other forms. Either form will produce
an F-statistic and associated p-value related to
this single linear hypothesis on the parameters
which may be used to make a judgment about
the hypothesis of interest.
Testing multiple linear restrictions
When we use the test command, an F-statistic
is reported–even when the test involves only
one coefficient–because in general, hypothesis
tests may involve more than one restriction on
the population parameters. The hypotheses
discussed above–even that of CRTS, involv-
ing several coefficients–still only represent one
restriction on the parameters. For instance, if
CRTS is imposed, the elasticities of the factors
of production must sum to one, but they may
individually take on any value. But in most
applications of multiple linear regression, we
concern ourselves with joint tests of restric-
tions on the parameters.
The simplest joint test is that which every re-
gression reports: the so-called “ANOVA F”
test, which has the null hypothesis that each
of the slopes is equal to zero. Note that in a
multiple regression, specifying that each slope
individually equals zero is not the same thing
as specifying that their sum equals zero. This
“ANOVA” (ANalysis Of VAriance) F-test is of
interest since it essentially tests whether the
entire regression has any explanatory power.
The null hypothesis, in this case, is that the
“model” is y = β0 + u : that is, none of the
explanatory variables assist in explaining the
variation in y. We cannot test any hypothesis
on the R2 of a regression, but we will see that
there is an intimate relationship between the
R2 and the ANOVA F:
R2 = SSE/SST (13)
F = (SSE/k) / (SSR/(n − (k + 1)))
∴ F = (R2/k) / ((1 − R2)/(n − (k + 1)))
where the ANOVA F, the ratio of mean square
explained variation to mean square unexplained
variation, is distributed as F(k, n−(k+1)) under the null hypothesis. For a simple regression, this statistic is F(1, n−2), which is identical to (tb1,n−2)²: that is, the square of the t−statistic for the slope coefficient, with precisely the same p−value as that t−statistic. In a multiple regres-
sion context, we do not often find an insignif-
icant F− statistic, since the null hypothesis is
a very strong statement: that none of the ex-
planatory variables, taken singly or together,
explain any significant fraction of the variation
of y about its mean. That can happen, but it
is often somewhat unlikely.
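The relationship in (13) can be verified from the results Stata stores after any regression; a rough illustration, with y and the x's standing in for the variables of your model:

regress y x1 x2 x3
* reconstruct the ANOVA F from R-squared and the degrees of freedom
di (e(r2)/e(df_m)) / ((1 - e(r2))/e(df_r))
* compare with the F statistic stored by regress
di e(F)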
The ANOVA F tests k exclusion restrictions:
that all k slope coefficients are jointly zero. We
may use an F-statistic to test that a number of
slope coefficients are jointly equal to zero. For
instance, consider a regression of 353 major
league baseball players’ salaries (from MLB1).
If we regress lsalary (log of player’s salary)
on years (number of years in majors), gamesyr
(number of games played per year), and sev-
eral variables indicating the position played (
frstbase, scndbase, shrtstop, thrdbase, catcher),
we get an R2 of 0.6105, and an ANOVA F
(with 7 and 345 d.f.) of 77.24 with a p−value of zero. The overall regression is clearly
significant, and the coefficients on years and
gamesyr both have the expected positive and
significant coefficients. Only one of the five coefficients on the positions played, however, is significantly different from zero at the 5%
level: scndbase, with a negative value (-0.034)
and a p− value of 0.015. The frstbase and
shrtstop coefficients are also negative (but in-
significant), while the thrdbase and catcher co-
efficients are positive and insignificant. Should
we just remove all of these variables (except
for scndbase)? The F-test for these five exclu-
sion restrictions will provide an answer to that
question:
. test frstbase scndbase shrtstop
thrdbase catcher
( 1) frstbase = 0.0
( 2) scndbase = 0.0
( 3) shrtstop = 0.0
( 4) thrdbase = 0.0
( 5) catcher = 0.0
F( 5, 345) = 2.37
Prob > F = 0.0390
At the 95% level of confidence, we reject the hypothesis that these coefficients are jointly zero. That result, of
course, could be largely driven by the scndbase
coefficient:
. test frstbase shrtstop thrdbase catcher
( 1) frstbase = 0.0
( 2) shrtstop = 0.0
( 3) thrdbase = 0.0
( 4) catcher = 0.0
F( 4, 345) = 1.56
Prob > F = 0.1858
So perhaps it would be sensible to remove these
four, which even when taken together do not
explain a meaningful fraction of the variation
in lsalary. But this illustrates the point of the
joint hypothesis test: the result of simulta-
neously testing several hypotheses (that, for
instance, individual coefficients are equal to
zero) cannot be inferred from the results of
the individual tests. If each coefficient is sig-
nificant, then a joint test will surely reject the
joint exclusion restriction; but the converse is
assuredly false.
Notice that a joint test of exclusion restrictions
may be easily conducted by Stata's test com-
mand, by merely listing the variables whose co-
efficients are presumed to be zero under the
null hypothesis. The resulting test statistic
is an F with as many numerator degrees of
freedom as there are coefficients (or variables)
in the list. It can be written in terms of the
residual sums of squares (SSRs) of the “unre-
stricted” and “restricted” models:
F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)] (14)
Since adding variables to a model will never de-
crease SSR (nor decrease R2), the “restricted”
model–in which certain coefficients are not freely
estimated from the data, but constrained–must
have SSR at least as large as the “unrestricted”
model, in which all coefficients are data-determined
at their optimal values. Thus the difference
in the numerator is non-negative. If it is a
large value, then the restrictions severely di-
minish the explanatory power of the model.
The amount by which it is diminished is scaled
by the number of restrictions, q, and then di-
vided by the unrestricted model’s s2. If this ra-
tio is a large number, then the “average cost
per restriction” is large relative to the explana-
tory power of the unrestricted model, and we
have evidence against the null hypothesis (that
is, the F− statistic will be larger than the crit-
ical point on an F− table with q and (n−k−1)
degrees of freedom). If the ratio is smaller than
the critical value, we do not reject the null
hypothesis, and conclude that the restrictions
are consistent with the data. In this circum-
stance, we might then reformulate the model
with the restrictions in place, since they do
not conflict with the data. In the baseball
player salary example, we might drop the four
insignificant variables and reestimate the more
parsimonious model.
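The statistic in (14) can also be computed directly from the residual sums of squares that Stata stores as e(rss); a sketch for the five exclusion restrictions in the baseball example (variable names as above):

* unrestricted model
regress lsalary years gamesyr frstbase scndbase shrtstop thrdbase catcher
scalar ssr_ur = e(rss)
scalar df_ur = e(df_r)
* restricted model, with the five position dummies excluded
regress lsalary years gamesyr
scalar ssr_r = e(rss)
* subset F statistic with q = 5 restrictions
di ((ssr_r - ssr_ur)/5) / (ssr_ur/df_ur)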
Testing general linear restrictions
The apparatus described above is far more pow-
erful than it might appear. We have considered
individual tests involving a linear combination
of the parameters (e.g. CRTS) and joint tests
involving exclusion restrictions (as in the base-
ball players’ salary example). But the “subset
F” test defined in (14) is capable of being ap-
plied to any set of linear restrictions on the
parameter vector: for instance, that β1 = 0,
β2+β3+β4 = 1, and β5 = −1. What would this
set of restrictions imply about a regression of
y on {X1, X2, X3, X4, X5}? That regression, in
its unrestricted form, would have k = 5, with 5
estimated slope coefficients and an intercept.
The joint hypotheses expressed above would
state that a restricted form of this equation
would have three fewer parameters, since β1
would be constrained to zero, β5 to -1, and
one of the coefficients {β2, β3, β4} expressed
in terms of the other two. In the terminol-
ogy of (14), q = 3. How would we test the
hypothesis? We can readily calculate SSRur,
but what about SSRr? One approach would
be to algebraically substitute the restrictions
in the model, estimate that restricted model,
and record its SSRr value. This can be done
with any computer program that estimates a
multiple regression, but it requires that you do
the algebra and transform the variables accord-
ingly. (For instance, constraining β5 to -1 im-
plies that you should form a new dependent
variable, (y +X5)). Alternatively, if you are us-
ing a computer program that can test linear
restrictions, you may use its features. Stata
will test general linear restrictions of this sort
with the test command:
regress y x1 x2 x3 x4 x5
test (x1) (x2+x3+x4=1) (x5=-1)
This test command will print an F-statistic for
the set of three linear restrictions on the re-
gression: for instance,
( 1) years = 0.0
( 2) frstbase + scndbase + shrtstop = 1.0
( 3) sbases = -1.0
F( 3, 347) = 38.54
Prob > F = 0.0000
The F-test will have three numerator degrees
of freedom, because you have specified three
linear hypotheses to be jointly applied to the
coefficient vector. This syntax of test may
be used to construct any set of linear restric-
tions on the coefficient vector, and perform the
joint test for the validity of those restrictions.
The test statistic will reject the null hypoth-
esis (that the restrictions are consistent with
the data) if its value is large relative to the
underlying F-distribution.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 6: Multiple regression analysis:
Further issues
What effects will the scale of the X and y vari-
ables have upon multiple regression? The co-
efficients’ point estimates are ∂y/∂Xj, so they
are in the scale of the data–for instance, dol-
lars of wage per additional year of education.
If we were to measure either y or X in differ-
ent units, the magnitudes of these derivatives
would change, but the overall fit of the regres-
sion equation would not. Regression is based
on correlation, and any linear transformation
leaves the correlation between two variables
unchanged. The R2, for instance, will be un-
affected by the scaling of the data. The stan-
dard error of a coefficient estimate is in the
same units as the point estimate, and both
will change by the same factor if the data are
scaled. Thus, each coefficient’s t− statistic
will have the same value, with the same p−
value, irrespective of scaling. The standard
error of the regression (termed “Root MSE”
by Stata) is in the units of the dependent vari-
able. The ANOVA F, based on R2, will be
unchanged by scaling, as will be all F-statistics
associated with hypothesis tests on the param-
eters. As an example, consider a regression of
babies’ birth weight, measured in pounds, on
the number of cigarettes per day smoked by
their mothers. This regression would have the
same explanatory power if we measured birth
weight in ounces, or kilograms, or alternatively
if we measured nicotine consumption by the
number of packs per day rather than cigarettes
per day.
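A small sketch of this invariance, assuming a dataset with variables bwght (birth weight) and cigs (cigarettes smoked per day):

regress bwght cigs
* rescale the dependent variable (e.g. pounds to ounces) and re-estimate
gen bwght16 = 16*bwght
regress bwght16 cigs
* R-squared, t-statistics and p-values are identical across the two runs;
* only the coefficients and standard errors are scaled by 16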
A corollary to this result applies to a dependent
variable measured in logarithmic form. Since
the slope coefficient in this case is an elas-
ticity or semi-elasticity, a change in the de-
pendent variable’s units of measurement does
not affect the slope coefficient at all (since
log(cy) = log c + log y), but rather just shows
up in the intercept term.
Beta coefficients
In economics, we generally report the regres-
sion coefficients’ point estimates when present-
ing regression results. Our coefficients often
have natural units, and those units are mean-
ingful. In other disciplines, many explanatory
variables are indices (measures of self-esteem,
or political freedom, etc.), and the associated
regression coefficients’ units are not well de-
fined. To evaluate the relative importance of
a number of explanatory variables, it is com-
mon to calculate so-called beta coefficients–
standardized regression coefficients, from a re-
gression of y∗ on X∗, where the starred vari-
ables have been “z-transformed.” This trans-
formation (subtracting the mean and dividing
by the sample standard deviation) generates
variables with a mean of zero and a standard
deviation of one. In a regression of standard-
ized variables, the (beta) coefficient estimates
∂y∗/∂X∗ express the effect of a one standard
deviation change in Xj in terms of standard
deviations of y. The explanatory variable with
the largest (absolute) beta coefficient thus has
the biggest “bang for the buck” in terms of an
effect on y. The intercept in such a regres-
sion is zero by construction. You need not
perform this standardization in most regression
programs to compute beta coefficients; for in-
stance, in Stata, you may just use the beta op-
tion, e.g. regress lsalary years gamesyr scndbase,
beta which causes the beta coefficients to be
printed (rather than the 95% confidence in-
terval for each coefficient) on the right of the
regression output.
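Equivalently, we could z-transform the variables ourselves and rerun the regression; a brief sketch, with y and x standing in for any two variables:

* standardize the variables by hand
egen zy = std(y)
egen zx = std(x)
regress zy zx
* the slope on zx equals the beta coefficient reported by: regress y x, beta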
Logarithmic functional forms
Many econometric models make use of vari-
ables measured in logarithms: sometimes the
dependent variable, sometimes both dependent
and independent variables. Using the “double-
log” transformation (of both y and X) we can
turn a multiplicative relationship, such as a
Cobb-Douglas production function, into a lin-
ear relation in the (natural) logs of output and
the factors of production. The estimated co-
efficients are, themselves, elasticities: that is,
∂ log y/∂ logXj, which have the units of per-
centage changes. The “single-log” transfor-
mation regresses log y on X, measured in nat-
ural units (alternatively, some columns of X
might be in logs, and some columns in lev-
els). If we are interpreting the coefficient on
a levels variable, it is ∂ log y/∂Xj, or approx-
imately the percentage change in y resulting
from a one unit change in X. We often use
this sort of model to estimate an exponen-
tial trend–that is, a growth rate–since if the
X variable is t, we have ∂ log y/∂t, or an es-
timate of the growth rate of y. The interpre-
tation of regression coefficients as percentage
changes depends on an approximation, that
log(1 + x) ≈ x for small x. If x is sizable–
and we seek the effect for a discrete change
in x− then we must take care with that ap-
proximation. The exact percentage change,
%∆y = 100·[exp(bj ∆Xj) − 1], will give us a
more accurate prediction of the change in y.
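A minimal sketch of this adjustment, assuming a wage equation with lwage, educ and exper:

regress lwage educ exper
* approximate percentage effect of one more year of education
di 100*_b[educ]
* exact percentage change implied by a one-unit change in educ
di 100*(exp(_b[educ]) - 1)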
Why do so many econometric models utilize
logs? For one thing, a model with a log de-
pendent variable often more closely satisfies
the assumptions we have made for the classi-
cal linear model. Most economic variables are
constrained to be positive, and their empirical
distributions may be quite non-normal (think
of the income distribution). When logs are
applied, the distributions are better behaved.
Taking logs also reduces the extrema in the
data, and curtails the effects of outliers. We
often see economic variables measured in dol-
lars in log form, while variables measured in
units of time, or interest rates, are often left
in levels. Variables which are themselves ratios
are often left in that form in empirical work
(although they could be expressed in logs; but
something like an unemployment rate already
has a percentage interpretation). We must
be careful when discussing ratios to distinguish
between a 0.01 change and a one unit change. If the unemployment rate is measured as a decimal, e.g. 0.05 or 0.06, we might be concerned with the effect of a 0.01 change (a one percentage point increase in unemployment)–which will be 1/100 of the regression coefficient’s magnitude!
Polynomial functional forms
We often make use of polynomial functional
forms–or their simplest form, the quadratic–to
represent a relationship that is not likely to be
linear. If y is regressed on x and x2, it is im-
portant to note that we must calculate ∂y/∂x
taking account of this form–that is, we cannot
consider the effect of changing x while holding
x2 constant. Thus, ∂y/∂x = b1 + 2b2x, and
the slope in {x, y} space will depend upon the
level of x at which we evaluate the derivative.
In many applications, b1 > 0 while b2 < 0, so
that while x is increasing, y is increasing at a
decreasing rate, or levelling off. Naturally, for
sufficiently large x, y will take on smaller val-
ues, and in the limit will become negative; but
in the range of the data, y will often appear
to be a concave function of x. We could also
have the opposite sign pattern, b1 < 0 while
b2 > 0, which will lead to a U-shaped relation
in the {x, y} plane, with y decreasing, reaching
a minimum, and increasing–somewhat like an
average cost curve. Higher-order polynomial
terms may also be used, but they are not as
commonly found in empirical work.
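After fitting a quadratic, the slope at any chosen x and the implied turning point, −b1/(2·b2), can be computed from the stored coefficients; a sketch with exper and its square as the regressors:

gen expersq = exper^2
regress lwage exper expersq
* slope dy/dx evaluated at exper = 10
di _b[exper] + 2*_b[expersq]*10
* turning point of the quadratic
di -_b[exper]/(2*_b[expersq])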
Interaction terms
An important technique that allows for non-
linearities in an econometric model is the use
of interaction terms–the product of explana-
tory variables. For instance, we might model
the house price as a function of bdrms, sqft,
and sqft· bdrms, which would make the partial
derivatives with respect to each factor depend
upon the other. For instance, ∂price/∂bdrms =
bbdrms + bsqft·bdrms · sqft, so that the effect of an
additional bedroom on the price of the house
also depends on the size of the house. Like-
wise, the effect of additional square footage
(e.g. an addition) depends on the number of
bedrooms. Since a model with no interaction
terms is a special case of this model, we may
readily test for the presence of these nonlin-
earities by examining the significance of the
interaction term’s estimated coefficient. If it
is significant, the interaction term is needed to
capture the relationship.
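With the interaction included, lincom gives a point and interval estimate of the marginal effect at any chosen value of the other variable; a sketch for the house-price example, where the 2,000 square foot figure is purely illustrative:

gen sqrbdrms = sqrft*bdrms
regress price bdrms sqrft sqrbdrms
* effect of an additional bedroom for a 2,000 square foot house
lincom bdrms + 2000*sqrbdrms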
Adjusted R2
In presenting multiple regression, we established
that R2 cannot decrease when additional ex-
planatory variables are added to the model,
even if they have no significant effect on y.
A “longer” model will always appear to be su-
perior to a “shorter” model, even though the
latter is a more parsimonious representation of
the relationship. How can we deal with this in
comparing alternative models, some of which
may have many more explanatory factors than
others? We can express the standard R2 as:
R2 = 1 − SSR/SST = 1 − (SSR/n)/(SST/n) (1)
Since all models with the same dependent vari-
able will have the same SST, and SSR cannot
increase with additional variables, R2 is a non-
decreasing function of k. An alternative mea-
sure, computed by most econometrics pack-
ages, is the so-called “R-bar-squared” or ‘Ad-
justed R2” :
R̄2 = 1 − [SSR/(n − (k + 1))] / [SST/(n − 1)] (2)
where the numerator and denominator of R̄2
are divided by their respective degrees of free-
dom (just as they are in computing the mean
squared measures in the ANOVA F table). For
a given dependent variable, the denominator
does not change; but the numerator, which
is s2, may rise or fall as k is increased. An
additional regressor uses one more degree of
freedom, so (n − (k + 1)) declines; and SSR
declines as well (or remains unchanged). If
SSR declines by a larger percentage than the
degrees of freedom, then R̄2 rises, and vice versa. Adding a number of regressors with little explanatory power will increase R2, but will decrease R̄2, which may even become negative! R̄2 does not have the interpretation of a
squared correlation coefficient, nor of a “bat-
ting average” for the model. But it may be
used to compare different models of the same
dependent variable. Note, however, that we
cannot make statistical judgments based on
this measure; for instance, we can show that
R̄2 will rise if we add one variable to the model with a |t| > 1, but a t of unity is never significant. Thus, an increase in R̄2 cannot be taken as meaningful (the coefficients must be examined for significance) but, conversely, if a “longer” model has a lower R̄2, its usefulness is cast in doubt. R̄2 is also useful in that it
can be used to compare non-nested models–
i.e. two models, neither of which is a proper
subset of the other. A “subset F” test cannot
be used to compare these models, since there
is no hypothesis under which the one model
emerges from restrictions on the other, and
vice versa. R̄2 may be used to make informal comparisons of non-nested models, as long as they have the same dependent variable. Stata presents R̄2 as the “Adj R-squared” on the regression output.
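Formula (2) can be checked against the results Stata stores after any regression; a brief sketch, with y and the x's standing in for your variables:

regress y x1 x2
* adjusted R-squared computed from its definition (SST = e(mss) + e(rss))
di 1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1))
* compare with the stored value
di e(r2_a)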
Prediction and residual analysis
The predictions of a multiple regression are,
simply, the evaluation of the regression line
for various values of the explanatory variables.
We can always calculate y for each observa-
tion used in the regression; these are known
as “in-sample” or “ex post” predictions. Since
the estimated regression equation is a func-
tion, we can evaluate the function for any set
of values {X01, X02, ..., X0k} and form the associ-
ated point estimate y0, which might be termed
an “out-of-sample” or “ex ante” forecast of
the regression equation. How reliable are the
forecasts of the equation? Since the predicted
values are linear combinations of the b values,
we can calculate an interval estimate for the
predicted value. This is the confidence inter-
val for E(y0): that is, the average value that
would be predicted by the model for a specific
set of X values. This may be calculated after
any regression in Stata using the predict com-
mand’s stdp option: that is, predict stdpred,
stdp will save a variable named “stdpred” con-
taining the standard error of prediction. The
95% confidence interval will then be, for large
samples, {ŷ − 1.96·stdpred, ŷ + 1.96·stdpred}. An
illustration of this confidence interval for a sim-
ple regression is given here. Note that the con-
fidence intervals are parabolic, with the mini-
mum width interval at X, widening symmetri-
cally as we move farther from X. For a multiple
regression, the confidence interval will be nar-
rowest at the multivariate point of means of
the X ′s.
[Figure: prediction interval for E(y): fitted values of Displacement (cu. in.) plotted against Weight (lbs.), with lower (plo) and upper (phi) confidence bands.]
However, if we want a confidence interval for
a specific value of y− rather than for the mean
of y− we must also take into account the fact
that a predicted value of y will contain an er-
ror, u. On average, that error is assumed to be
zero; that is, E(u) = 0. For a specific value of
y, though, there will be an error ui; we do not
know its magnitude, but we have estimated
that it is drawn from a distribution with stan-
dard error s. Thus, the standard error of fore-
cast will include this additional source of un-
certainty, and confidence intervals formed for
specific values of y will be wider than those as-
sociated with predictions of the mean y. This
standard error of forecast series can be calcu-
lated, after a regression has been estimated,
with the predict command, specifying the stdf
option. If the variable stdfc is created, the
95% confidence interval will then be, for large
samples, {ŷ − 1.96·stdfc, ŷ + 1.96·stdfc}. An illus-
tration of this confidence interval for a simple
regression is given here, juxtaposed with that
shown earlier for the standard error of predic-
tion. As you can see, the added uncertainty
associated with a draw from the error distribu-
tion makes the prediction interval much wider.
[Figure: prediction intervals for E(y) and for a specific value of y: fitted values of Displacement (cu. in.) plotted against Weight (lbs.), with the wider forecast bands (plof) shown alongside the bands for the mean prediction (plo).]
Residual analysis
The OLS residuals are often calculated and
analyzed after estimating a regression. In a
purely technical sense, they may be used to
test the validity of the several assumptions that
underlie the application of OLS. When plotted,
do they appear systematic? Does their dis-
persion appear to be roughly constant, or is
it larger for some X values than others? Ev-
idence of systematic behavior in the magni-
tude of the OLS residuals, or in their disper-
sion, would cast doubt on the OLS results.
A number of formal tests, as we will discuss,
are based on the residuals, and many graph-
ical techniques for examining their random-
ness (or lack thereof) are available. In Stata,
help regression diagnostics discusses many of
them.
The residuals are often used to test specific
hypotheses about the underlying relationship.
For instance, we could fit a regression of the
salaries of employees of XYZ Corp. on a num-
ber of factors which should relate to their salary
level: experience, education, specific qualifica-
tions, job level, and so on. Say that such a
regression was run, and the residuals retrieved.
If we now sort the residuals by factors not
used to explain salary levels, such as the em-
ployee’s gender or race, what will we find? Un-
der nondiscrimination laws, there should be no
systematic reason for women to be paid more
or less than men, or blacks more or less than
whites, after we have controlled for these fac-
tors. If there are significant differences be-
tween the average residual for, e.g., blacks and
whites, then we would have evidence of “sta-
tistical discrimination.” Regression equations
have often played an important role in inves-
tigating charges of discrimination in the work-
place. Likewise, most towns’ and cities’ as-
sessments of real estate (used to set the tax
levy on that property) are performed by regres-
sion, in which the explanatory factors include
the characteristics of a house and its neighbor-
hood. Since many houses will not have been
sold in the recent past, the regression must
be run over a sample of houses that have been
sold, and out-of-sample predictions used to es-
timate the appropriate price for a house that
has not been sold recently, based on its at-
tributes and trends in real estate transactions
prices in its neighborhood. A mechanical eval-
uation of the fair market value of the house
may be subject to error, but previous meth-
ods used–in which knowledgeable individuals
attached valuations based on their understand-
ing of the local real estate market–are more
subjective.
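Returning to the salary example, a small sketch of such a residual comparison, assuming variables salary, exper, educ and a female dummy:

regress salary exper educ
predict uhat, residuals
* compare mean residuals across a factor deliberately excluded from the model
ttest uhat, by(female)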
Wooldridge, Introductory Econometrics, 4th ed.
Chapter 7: Multiple regression analysis with
qualitative information: Binary (or dummy)
variables
We often consider relationships between observed outcomes and qualitative factors: models in which a continuous dependent variable is related to a number of explanatory factors, some of which are quantitative, and some of which are qualitative. In econometrics, we also consider models of qualitative dependent variables, but we will not explore those models in this course due to time constraints. But we can readily evaluate the use of qualitative information in standard regression models with continuous dependent variables.
Qualitative information often arises in terms of some coding, or index, which takes on a
number of values: for instance, we may know
in which one of the six New England states
each of the individuals in our sample resides.
The data themselves may be coded with the
biliteral “MA”, “RI”, “ME”, etc. How can
we use this factor in a regression equation?
In the data, state takes on six distinct val-
ues. We must create six binary variables, or
dummy variables, each of which will refer to
one state–that is, that variable will be 1 if the
individual comes from that state, and 0 oth-
erwise. We can generate this set of 6 vari-
ables easily in Stata with the command tab
state, gen(st), which will create 6 new vari-
ables in our dataset: st1, st2, ... st6. Each
of these variables are dummies–that is, they
only contain 0 or 1 values. If we add up these
variables, we get–exactly–a vector of 1’s, sug-
gesting that we will never want to use all 6
variables in a regression (since by knowing the
values of any 5...) We may also find the pro-
portions of each state’s citizens in our sample
very easily: summ st* will give the descriptive
statistics of all 6 variables, and the mean of
each st dummy is the sample proportion living
in that state.
In Stata 11, we actually do not have to create
these variables explicitly; we can make use of
factor variables, which will automatically cre-
ate the dummies.
How can we use these dummy variables? Say
that we wanted to know whether incomes dif-
fered significantly across the 6-state region.
What if we regressed income on any five of
these st dummies? We could do this with ex-
plicit variables as
regress income st2-st6
or with factor variables as
regress income i.state
In either case, we are estimating the equation
income = β0 + β2 st2 + β3 st3 + β4 st4 + β5 st5 + β6 st6 + u (1)
where I have suppressed the observation sub-
scripts. What are the regression coefficients in
this case? β0 is the average income in the 1st
state–the dummy for which is excluded from
the regression. β2 is the difference between
the income in state 2 and the income in state
1. β3 is the difference between the income
in state 3 and the income in state 1, and so
on. What is the ordinary “ANOVA F” in this
context–the test that all the slopes are equal
to zero? Precisely the test of the null hypoth-
esis:
H0 : µ1 = µ2 = µ3 = µ4 = µ5 = µ6 (2)
versus the alternative that not all six of the
state means are the same value. It turns out
that we can test this same hypothesis by ex-
cluding any one of the dummies, and including
the remaining five in the regression. The co-
efficients will differ, but the p− value of the
ANOVA F will be identical for any of these
regressions. In fact, this regression is an ex-
ample of “classical one-way ANOVA”–testing
whether a qualitative factor (in this case, state
of residence) explains a significant fraction of
the variation in income.
What if we wanted to generate point and in-
terval estimates of the state means of income?
Then it would be most convenient to reformu-
late (1) by including all 6 dummies, and remov-
ing the constant term. This is, algebraically,
the same regression:
regress income st1-st6, noconstant
or with factor variables as
regress income ibn.state, noconstant
The coefficient on the now-included st1 will be
precisely that reported above as β0. The coeffi-
cient reported for st2 will be precisely (β0 + β2)
from the previous model, and so on. But now
those coefficients will be reported with confi-
dence intervals around the state means. Those
statistics could all be calculated if you only es-
timated (1), but to do so you would have to
use lincom for each coefficient. Running this
alternative form of the model is much more
convenient for estimating the state means in
point and interval form. But to test the hy-
pothesis (2), it is most convenient to run the
original regression–since then the ANOVA F
performs the appropriate test with no further
ado.
What if we fail to reject the ANOVA F null?
Then it appears that the qualitative factor “state”
does not explain a significant fraction of the
variation in income. Perhaps the relevant clas-
sification is between northern, more rural New
England states (NEN) and southern, more pop-
ulated New England states (NES). Given the
nature of dummy variables, we may generate
these dummies two ways. We can express the
Boolean condition in terms of the state vari-
able: gen nen = (state=="VT" | state=="NH" | state=="ME"). This expression, with parens
on the right hand side of the generate state-
ment, evaluates that expression and returns
true (1) or false (0). The vertical bar (|) is
Stata’s OR operator; since every person in the
sample lives in one and only one state, we must
use OR to phrase the condition that they live in
northern New England. But there is another
way to generate this nen dummy, given that
we have st1...st6 defined for the regression
above. Let’s say that Vermont, New Hamp-
shire and Maine have been coded as st6, st4
and st3, respectively. We may just gen nen =
st3+st4+st6, since the sum of mutually exclu-
sive and exhaustive dummies must be another
dummy. To check, the resulting nen will have
a mean equal to the percentage of the sample
that live in northern New England; the equiva-
lent nes dummy will have a mean for southern
New England residents; and the sum of those
two means will of course be 1. We can then
run a simplified form of our model as regress
inc nen; the ANOVA F statistic for that regres-
sion tests the null hypothesis that incomes in
northern and southern New England do not
differ significantly. Since we have excluded
nes, the “slope” coefficient on nen measures
the amount by which northern New England
income differs from southern New England in-
come; the mean income for southern New Eng-
land is the constant term. If we want point and
interval estimates for those means, we should
regress inc nen nes, noc.
Regression with continuous and dummy variables
In the above examples, we have estimated “pure ANOVA” models–regression models in which all of the explanatory variables are dummies. In econometric research, we often want to combine quantitative and qualitative information, including some regressors that are measurable and others that are dummies. Consider the simplest example: we have data on individuals’ wages, years of education, and their gender. We could create two gender dummies, male and female, but we will only need one in the analysis: say, female. We create this variable as gen female = (gender=="F"). We can then estimate the model:
wage = β0 + β1educ+ β2female+ u (3)
The constant term in this model now becomes the wage for a male with zero years of education. Male wages are predicted as b0 +
b1educ, while female wages are predicted as
b0 + b1educ+ b2. The gender differential is thus
b2. How would we test for the existence of “sta-
tistical discrimination”–that, say, females with
the same qualifications are paid a lower wage?
The hypothesis of interest is the one-sided alternative HA : β2 < 0, tested against the null H0 : β2 ≥ 0. The t−statistic for b2 will provide us with this hypothesis test.
What is this model saying about wage struc-
ture? Wages are a linear function of the years
of education. If b2 is significantly different
than zero, then there are two “wage profiles”–
parallel lines in {educ, wage} space, each with
a slope of b1, with their intercepts differing by
b2.
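A brief sketch of that one-tailed test, computed from the stored coefficient and standard error (variable names as in (3)):

regress wage educ female
* one-sided p-value for H0: beta_female >= 0 against HA: beta_female < 0
di ttail(e(df_r), -_b[female]/_se[female])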
What if we wanted to expand this model to
consider the possibility that wages differ by
both gender and race? Say that each worker is
classified as race=white or race=black. Then
we could gen black = (race=="black") to cre-
ate the dummy variable, and add it to (3).
What, now, is the constant term? The wage
for a white male with zero years of education.
Is there a significant race differential in wages?
If so, the coefficient b3, which measures the
difference between white and black wages, ce-
teris paribus, will be significantly different from
zero. In {educ, wage} space, the model can be
represented as four parallel lines, with each in-
tercept labelled by a combination of gender
and race.
What if our racial data classified each worker
as white, Black or Asian? Then we would run
the regression:
wage = β0 + β1 educ + β2 female + β3 Black + β4 Asian + u (4)
or, with factor variables,
regress wage educ female i.race
where the constant term still refers to a white
male. In this model, b3 measures the differ-
ence between black and white wages, ceteris
paribus, while b4 measures the difference be-
tween Asian and white wages. Each can be
examined for significance. But how can we
determine whether the qualitative factor, race,
affects wages? That is a joint test, that both
β3 = 0 and β4 = 0, and should be conducted
as such. If factor variables were used, we could
do this with
testparm i.race
No matter how the equation is estimated, we
should not make judgments based on the indi-
vidual dummies’ coefficients, but should rather
include both race variables if the null is re-
jected, or remove them both if it is not. When
we examine a qualitative factor, which may give rise to a number of dummy variables, they should be treated as a group. For instance, we might want to modify (3) to consider the effect of state of residence:
wage = β0 + β1 educ + β2 female + Σ(j=2 to 6) γj stj + u (5)
where we include any 5 of the 6 st variables designating the New England states. The test that wage levels differ significantly due to state of residence is the joint test that γj = 0, j = 2, ..., 6 (or, if factor variables are used, testparm i.state). A judgment concerning the relevance of state of residence should be made on the basis of this joint test (an F-test with 5 numerator degrees of freedom).
Note that if the dependent variable was measured in log form, the coefficients on dummies
would be interpreted as percentage changes; if
(5) was respecified to place log(wage) as the
dependent variable, the coefficient b1 would
measure the percentage return to education
(how many percent does the wage change for
each additional year of education), while the
coefficient b2 would measure the (approximate)
percentage difference in wage levels between
females and males, ceteris paribus. The state
dummies would, likewise, measure the percent-
age difference in wage levels between that state
and the excluded state (state 1).
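When a dummy's coefficient is sizable, the exact percentage differential can be recovered with the same exp() adjustment discussed in Chapter 6; a short sketch using lwage, educ and female:

regress lwage educ female
* approximate percentage differential between females and males
di 100*_b[female]
* exact percentage differential
di 100*(exp(_b[female]) - 1)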
We must be careful when working with vari-
ables that have an ordinal interpretation, and
are thus coded in numeric form, to treat them
as ordinal. For instance, if we model the in-
terest rate corporations must pay to borrow
(corprt) as a function of their credit rating,
we consider that Moody’s and Standard and
Poor’s assign credit ratings somewhat like grades:
AAA, AA, A, BAA, BA, B, C, et cetera. Those
could be coded as 1,2,...,7. Just as we can
agree that an “A” grade is better than a “B”,
a triple-A bond rating results in a lower bor-
rowing cost than a double-A rating. But while
GPAs are measured on a clear four-point scale,
the bond ratings are merely ordinal, or ordered:
everyone agrees on the rating scale, but the
differential between AA borrowers’ rates and A
borrowers’ rates might be much smaller than
that between B and C borrowers’ rates: es-
pecially the case if C denotes “below invest-
ment grade”, which will reduce the market for
such bonds. Thus, although we might have
a numeric index corresponding to AAA...C, we
should not assume that ∂corprt/∂index is con-
stant; we should not treat index as a cardi-
nal measure. Clearly, the appropriate way to
proceed is to create dummy variables for each
rating class, and include all but one of those
variables in a regression of corprt on bond rat-
ing and other relevant factors. For instance, if
we leave out the AAA dummy, all of the ratings
class dummies’ coefficients will then measure
the degree to which those borrowers’ bonds
bear higher rates than those of AAA borrowers.
But we could just as well leave out the C rating
class dummy, and measure the effects of rat-
ings classes relative to the worst credits’ cost
of borrowing.
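A rough sketch of this approach, assuming a categorical variable rating with the seven classes coded so that class 1 is AAA, and some other regressor (here called firmsize) in the equation; all of these names are hypothetical:

tabulate rating, gen(rcat)
* omit the AAA dummy (rcat1) so each coefficient measures the spread over AAA
regress corprt rcat2-rcat7 firmsize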
Interactions involving dummy variables
Just as continuous variables may be interacted
in regression equations, so can dummy vari-
ables. We might, for instance, have one set of
dummies indicating the gender of respondents
(female) and another set indicating their mar-
ital status (married). We could regress lwage
on these two dummies:
lwage = b0 + b1female+ b2married+ u
which gives rise to the following classification
of mean wages, conditional on the two fac-
tors (which is thus a classic “two-way ANOVA”
setup):
                 male          female
unmarried        b0            b0 + b1
married          b0 + b2       b0 + b1 + b2
We assume that the two effects, gender and
marital status, have independent effects on the
dependent variable. Why? Because this joint
distribution is modelled as the product of the
marginals. What is the difference between male
and female wages? b1, irrespective of marital
status. What is the difference between un-
married and married wages? b2, irrespective of
gender.
If we were to relax the assumption that gen-
der and marital status had independent effects
on wages, we would want to consider their
interaction. Since there are only two cate-
gories of each variable, we only need one in-
teraction term, fm, to capture the possible ef-
fects. As above, that term could be generated
as a Boolean (noting that & is Stata’s AND
operator): gen fm=(female==1) & (married==1),
or we could generate it algebraically, as gen
fm=female*married. In either case, it represents
the intersection of the sets. We then add a
term, b3fm, to the equation, which then ap-
pears as an additive constant in the lower right
cell of the table. Now, if the coefficient on fm
is significantly nonzero, the effect of being fe-
male on the wage differs, depending on marital
status, and vice versa. Are the interaction ef-
fects important–that is, does the joint distribu-
tion differ from the product of the marginals?
That is easily discerned, since if that is so b3 will be significantly nonzero.
Using explicit variables, this would be estimated
as
regress wage female married fm
or, with factor variables, we can make use of
the factorial interaction operator:
regress wage female married i.female#i.married
or, in an even simpler form,
regress wage i.female##i.married
where the double hash mark indicates the full
factorial interaction, including both the main
effects of each factor and their interaction.
Two extensions of this framework come to
mind. Sticking with two-way ANOVA (con-
sidering two factors’ effects), imagine that in-
stead of marital status we consider race =
{white, Black, Asian}. To run the model without interactions, we would include two of these dummies in the regression–say, Black and Asian; the constant term would be the mean wage of a white male (the excluded class). What if we wanted to include interactions? Then we would define f·Black and f·Asian, and include those two regressors as well. The test for the significance of interactions is now a joint test that these two coefficients are jointly zero.
With factor variables, we can just say
regress wage i.female##i.race
where the factorial interaction includes all race categories, both in levels and interacted with the female dummy.
A second extension of the interaction concept is far more important: what if we want to consider a regular regression, on quantitative variables, but want to allow for different slopes
for different categories of observations? Then
we create interaction effects between the dum-
mies that define those categories and the mea-
sured variables. For instance,
lwage = b0 + b1 female + b2 educ + b3 (female × educ) + u
Here, we are in essence estimating two sepa-
rate regressions in one: a regression for males,
with an intercept of b0 and a slope of b2, and
a regression for females, with an intercept of
(b0 + b1) and a slope of (b2 + b3) . Why would
we want to do this? We could clearly estimate
the two separate regressions, but if we did that,
we could not conduct any tests (e.g. do males
and females have the same intercept? The
same slope?). If we use interacted dummies,
we can run one regression, and test all of the
special cases of this model which are nested
within: that the slopes are the same, that
the intercepts are the same, and the “pooled”
case in which we need not distinguish between
males and females. Since each of these special
cases merely involves restrictions on this gen-
eral form, we can run this equation and then
just conduct the appropriate tests.
This can be done with factor variables as
regress wage i.female##c.educ
where we must use the c. operator to tell Stata
that educ is to be treated as a continuous vari-
able, rather than considering all possible levels
of that variable in the dataset.
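A sketch of those nested tests after a single estimation, using an explicitly constructed interaction (femeduc is a made-up name):

gen femeduc = female*educ
regress lwage female educ femeduc
* do males and females share the same slope on educ?
test femeduc
* same intercept and same slope (the fully pooled model)?
test female femeduc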
If we extended this logic to include race, as de-
fined above, as an additional factor, we would
include two of the race dummies (say, Black
and Asian) and interact each with educ. This
would be a model without interactions, where
the effects of gender and race are considered to be independent, but it would allow us to estimate different regression lines for each combination of gender and race, and test for the importance of each factor. These interaction methods are often used to test hypotheses about the importance of a qualitative factor–for instance, in a sample of companies from which we are estimating their profitability, we may want to distinguish between companies in different industries, or companies that underwent a significant merger, or companies that were formed within the last decade, and evaluate whether their expenditures on R&D or advertising have the same effects across those categories.
All of the necessary tests involving dummy variables and interacted dummy variables may be easily specified and computed, since models without interacted dummies (or without certain dummies in any form) are merely restricted
forms of more general models in which they
appear. Thus, the standard “subset F” test-
ing strategy that we have discussed for the
testing of joint hypotheses on the coefficient
vector may be readily applied in this context.
The text describes how a “Chow test” may be
formulated by running the general regression,
running a restricted form in which certain con-
straints are imposed, and performing a com-
putation using their sums of squared errors;
this computation is precisely that done with
Stata’s test command. The advantage of set-
ting up the problem for the test command is
that any number of tests (e.g. above, for the
importance of gender, or for the importance of
race) may be conducted after estimating a sin-
gle regression; it is not necessary to estimate
additional regressions to compute any possible
“subset F” test statistic, which is what the
“Chow test” is doing.
Wooldridge, Introductory Econometrics, 4th ed.
Chapter 8: Heteroskedasticity
In laying out the standard regression model, we made the assumption of homoskedasticity of the regression error term: that its variance is assumed to be constant in the population, conditional on the explanatory variables. The assumption of homoskedasticity fails when the variance changes in different segments of the population: for instance, if the variance of the unobserved factors influencing individuals’ saving increases with their level of income. In such a case, we say that the error process is heteroskedastic. This does not affect the unbiasedness or consistency of the ordinary least squares point estimates–and the assumption of homoskedasticity did not underlie our derivation of the OLS formulas. But if this assumption is not tenable, we may not be able to rely
on the interval estimates of the parameters–on
their confidence intervals, and t−statistics de-
rived from their estimated standard errors. In-
deed, the Gauss-Markov theorem, proving the
optimality of least squares among linear un-
biased estimators of the regression equation,
does not hold in the presence of heteroskedas-
ticity. If the error variance is not constant,
then OLS estimators are no longer BLUE.
How, then, should we proceed? The classical
approach is to test for heteroskedasticity, and
if it is evident, try to model it. We can de-
rive modified least squares estimators (known
as weighted least squares) which will regain
some of the desirable properties enjoyed by
OLS in a homoskedastic setting. But this ap-
proach is sometimes problematic, since there
are many plausible ways in which the error vari-
ance may differ in segments of the population–
depending on some of the explanatory variables
in our model, or perhaps on some variables
that are not even in the model. We can use
weighted least squares effectively if we can de-
rive the correct weights, but may not be much
better off if we cannot convince ourselves that
our application of weighted least squares is
valid.
Fortunately, fairly recent developments in econo-
metric theory have made it possible to avoid
these quandaries. Methods have been devel-
oped to adjust the estimated standard errors
in an OLS context for heteroskedasticity of
unknown form–to develop what are known as
robust standard errors. Most statistical pack-
ages now support the calculation of these ro-
bust standard errors when a regression is esti-
mated. If heteroskedasticity is a problem, the
robust standard errors will differ from those
calculated by OLS, and we should take the for-
mer as more appropriate. How can you com-
pute these robust standard errors? In Stata,
one merely adds the option ,robust to the regress
command. The ANOVA F-table will be suppressed (as will the adjusted R2 measure), since neither is valid when robust standard errors are being computed, and the term "robust" will be displayed above the standard errors of the coefficients to remind you that robust errors are in use.
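For instance, using Stata's auto dataset (an arbitrary choice for illustration), the two sets of standard errors may be compared by running the regression twice:

sysuse auto, clear
* conventional OLS standard errors
regress price mpg weight
* heteroskedasticity-robust ("White") standard errors
regress price mpg weight, robust

The point estimates are identical in the two runs; only the standard errors, t-statistics and p-values differ.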
How are robust standard errors calculated? Consider a model with a single explanatory variable. The OLS estimator can be written as:

b_1 = \beta_1 + \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2}

This gives rise to an estimated variance of the slope parameter:

Var(b_1) = \frac{\sum_i (x_i - \bar{x})^2 \sigma_i^2}{\left[\sum_i (x_i - \bar{x})^2\right]^2}    (1)
This expression reduces to the standard expression from Chapter 2 if σ_i^2 = σ^2 for all observations:

Var(b_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}
But if σ_i^2 ≠ σ^2, this simplification cannot be performed on (1). How can we proceed? Halbert White showed (in a famous article in Econometrica, 1980) that the unknown error variance of the ith observation, σ_i^2, can be consistently estimated by e_i^2: that is, by the square of the OLS residual from the original equation. This enables us to compute robust variances of the parameters; for instance, (1) can now be computed from the OLS residuals, and its square root will be the robust standard error of b_1. This carries over to multiple regression; in the general case of k explanatory variables,
Var(b_j) = \frac{\sum_i r_{ij}^2 e_i^2}{\left(\sum_i r_{ij}^2\right)^2}    (2)
where e_i^2 is the square of the ith OLS residual, and r_ij is the ith residual from regressing variable j on all other explanatory variables. The
square root of this quantity is the heteroskedasticity-
robust standard error, or the “White” stan-
dard error, of the jth estimated coefficient. It
may be used to compute the heteroskedasticity-
robust t−statistic, which then will be valid for
tests of the coefficient even in the presence of
heteroskedasticity of unknown form. Likewise,
F -statistics, which would also be biased in the
presence of heteroskedasticity, may be consis-
tently computed from the regression in which
the robust standard errors of the coefficients
are available.
If we have this better mousetrap, why would
we want to report OLS standard errors–which
would be subject to bias, and thus unreliable,
if there is a problem of heteroskedasticity? If
(and only if) the assumption of homoskedasticity is valid, the OLS standard errors are preferred, since they will have an exact t-distribution at any sample size. The application of robust standard errors is justified only as the sample size becomes large. If we are working with a sample of modest size, and the assumption of homoskedasticity is tenable, we should rely on OLS standard errors. But since robust standard errors are very easily calculated in most statistical packages, it is a simple task to estimate both sets of standard errors for a particular equation and consider whether inference based on the OLS standard errors is fragile. In large data sets, it has become increasingly common practice to report the robust standard errors.
Testing for heteroskedasticity
We may want to demonstrate that the model we have estimated does not suffer from heteroskedasticity, and justify reliance on OLS and
OLS standard errors in this context. How might
we evaluate whether homoskedasticity is a rea-
sonable assumption? If we estimate the model
via standard OLS, we may then base a test
for heteroskedasticity on the OLS residuals.
If the assumption of homoskedasticity, condi-
tional on the explanatory variables, holds, it
may be written as:
H0 : V ar (u|x1, x2, ..., xk) = σ2
And a test of this null hypothesis can evalu-
ate whether the variance of the error process
appears to be independent of the explanatory
variables. We cannot observe the variances
of each observation, of course, but as above
we can rely on the squared OLS residual, e_i^2, to be a consistent estimator of σ_i^2. One of
the most common tests for heteroskedastic-
ity is derived from this line of reasoning: the
Breusch–Pagan test. The BP test involves
regressing the squares of the OLS residuals on
a set of variables—such as the original explana-
tory variables—in an auxiliary regression:
e_i^2 = d0 + d1x1 + d2x2 + ... + dkxk + v    (3)
If the magnitude of the squared residual—a
consistent estimator of the error variance of
that observation—is not related to any of the
explanatory variables, then this regression will
have no explanatory power: its R2 will be small,
and its ANOVA F−statistic will indicate that
it does not explain any meaningful fraction of
the variation of e_i^2 around its own mean. (Note
that although the OLS residuals have mean
zero, and are in fact uncorrelated by construc-
tion with each of the explanatory variables,
that does not apply to their squares). The
Breusch–Pagan test can be conducted by either the ANOVA F-statistic from (3), or by a large-sample form known as the Lagrange multiplier statistic: LM = n × R2 from the auxiliary regression. Under H0 of homoskedasticity, LM ∼ χ^2_k.
The Breusch–Pagan test can be computed with the estat hettest command after regress.
regress price mpg weight length
estat hettest
which would evaluate the residuals from the regression for heteroskedasticity, with respect to the original explanatory variables. The null hypothesis is that of homoskedasticity; if a small p-value is received, the null is rejected in favor of heteroskedasticity (that is, the auxiliary regression, which is not shown, had a meaningful amount of explanatory power). The routine displays the LM statistic and its p-value versus the χ^2_k distribution. If a rejection is re-
ceived, one should rely on robust standard er-
rors for the original regression. Although we
have demonstrated the Breusch–Pagan test by
employing the original explanatory variables,
the test may be used with any set of variables–
including those not in the regression, but sus-
pected of being systematically related to the
error variance, such as the size of a firm, or
the wealth of an individual.
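The same statistic can also be computed by hand from the auxiliary regression, which makes it easy to substitute any set of suspect variables. A minimal sketch, reusing the regression above:

regress price mpg weight length
predict double e, residuals
gen double e2 = e^2
* auxiliary regression of squared residuals on the explanatory variables
regress e2 mpg weight length
* LM statistic = n * R-squared, referred to the chi-squared(k) distribution
display "LM = " e(N)*e(r2) "  p-value = " chi2tail(3, e(N)*e(r2))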
The Breusch-Pagan test is a special case of
White’s general test for heteroskedastic-
ity. The sort of heteroskedasticity that will
damage OLS standard errors is that which in-
volves correlations between squared errors and
explanatory variables. White’s test takes the
list of explanatory variables {x1, x2, ..., xk} and
augments it with squares and cross products
of each of these variables. The White test
then runs an auxiliary regression of e_i^2 on the
explanatory variables, their squares, and their
cross products. Under the null hypothesis, none
of these variables should have any explanatory
power, if the error variances are not system-
atically varying. The White test is another
LM test, of the n × R2 form, but involves a
much larger number of regressors in the aux-
iliary regression. In the example above, rather
than just including mpg weight length, we would also include mpg^2, weight^2, length^2, mpg×weight, mpg×length, and weight×length: 9 regressors in all, giving rise to a test statistic with a χ^2(9) distribution.
How can you perform White’s test? Give the
command ssc install whitetst (you only need
do this once) and it will install this routine in
Stata. The whitetst command will automat-
ically generate these additional variables and
perform the test after a regress command.
Since Stata knows what explanatory variables
were used in the regression, you need not spec-
ify them; just give the command whitetst after
regress. You may also use the fitted option to
base the test on powers of the predicted val-
ues of the regression rather than the full list of
regressors, squares and cross products.
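For example, after the regression used above (assuming the routine has been installed from SSC as described):

regress price mpg weight length
* White's general test using squares and cross products of the regressors
whitetst
* a more parsimonious version based on powers of the fitted values
whitetst, fitted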
Weighted least squares estimation
As an alternative to using heteroskedasticity-
robust standard errors, we could transform the
regression equation if we had knowledge of the
form taken by heteroskedasticity. For instance,
if we had reason to believe that:
Var(u|x) = σ^2 h(x)
where h(x) is some function of the explana-
tory variables that could be made explicit (e.g.
h(x) = income), we could use that informa-
tion to properly specify the correction for het-
eroskedasticity. What would this entail? Since
in this case we are saying that Var(u|x) ∝ income, the standard deviation of u_i, conditional on income_i, is proportional to √income_i. This could then be used to perform weighted least squares: a
used to perform weighted least squares: a
technique in which we transform the variables
in the regression, and then run OLS on the
transformed equation. For instance, if we were
estimating a simple savings function from the
dataset saving.dta, in which sav is regressed
on inc, and believed that there might be het-
eroskedasticity of the form above, we would
perform the following transformations:
gen sd=sqrt(inc)
gen wsav=sav/sd
gen kon=1/sd
gen winc=inc/sd
regress wsav kon winc,noc
Note that there is no constant term in the
weighted least squares (WLS) equation, and
that the coefficient on winc still has the same
connotation: that of the marginal propensity
to save. In this case, though, we might be
thankful that Stata (and most modern pack-
ages) have a method for estimating WLS mod-
els by merely specifying the form of the weights:
regress sav inc [aw=1/inc]
In this case, the “aw” indicates that we are us-
ing “analytical weights”—Stata’s term for this
sort of weighting—and the analytical weight
is specified to be the inverse of the observa-
tion variance (not its standard error). If you
run this regression, you will find that its coef-
ficient estimates and their standard errors are
identical to those of the transformed equation–
with less hassle than the latter, in which the
summary statistics (F-statistic, R2, predicted
values, residuals, etc.) pertain to the trans-
formed dependent variable (wsav) rather than
the original variable.
The use of this sort of WLS estimation is less
popular than it was before the invention of
“White” standard errors; in theory, the trans-
formation to homoskedastic errors will yield
more attractive properties than even the use
of “White” standard errors, conditional on our
proper specification of the form of the het-
eroskedasticity. But of course we are not sure
about that, and imprecise treatment of the
errors may not be as attractive as the less
informed technique of using the robust esti-
mates.
One case in which we do know the form of
the heteroskedasticity is that of grouped data,
in which the data we are using has been ag-
gregated from microdata into groups of dif-
ferent sizes. For instance, a dataset with 50
states’ average values of income, family size,
etc. calculated from a random sample of the
U.S. population will have widely varying preci-
sion in those average values. The mean val-
ues for a small state will be computed from
relatively few observations, whereas the coun-
terpart values for a large state will be more
precisely estimated. Since we know that the
standard error of the mean is σ/√n, we recog-
nize how this effect will influence the precision
of the estimates. How, then, can we use this
dataset of 50 observations while dealing with
the known heteroskedasticity of the states’ er-
rors? This too is weighted least squares, where
the weight on the individual state should be its
population. This can be achieved in Stata by
specifying “frequency weights”–a variable con-
taining the number of observations from which
each sample observation represents. If we had
state-level data on saving, income and popula-
tion, we might regress saving income [fw=pop]
to achieve this weighting.
One additional observation regarding heteroskedasticity: we often see, in empirical studies, that an equation has been specified in some ratio form, for instance, with per capita dependent and independent variables for data on states or countries, or in terms of financial ratios for firm- or industry-level data. Although there may be no mention of heteroskedasticity in the study, it is very likely that these ratio forms have been chosen to limit the potential damage of heteroskedasticity in the estimated model. There can certainly be heteroskedasticity in a per-capita form regression on country-level data, but it is much less likely to be a problem than it would be if, say, the levels of GDP were used in that model. Likewise, scaling firms' values by total assets, or total revenues, or the number of employees will tend to mitigate the difficulties caused by extremes in scale between large corporations and corner stores. Such models should still be examined for their errors' behavior, but the popularity of the ratio form in these instances is an implicit consideration of potential heteroskedasticity.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 9: More on specification and data
problems
Functional form misspecification
We may have a model that is correctly speci-
fied, in terms of including the appropriate ex-
planatory variables, yet commit functional form
misspecification–in which the model does not
properly account for the relationship between
dependent and observed explanatory variables.
We have considered this sort of problem when
discussing polynomial models; omitting a squared
term, for instance, and constraining ∂y/∂x to
be constant (rather than linear in x) would be
a functional form misspecification. We may
also encounter difficulties of this sort with re-
spect to interactions among the regressors. If
omitted, the effects of those regressors will be
estimated as constant, rather than varying as
they would in the case of interacted variables.
In the context of models with more than one
categorical variable, assuming that their effects
can be treated as independent (thus omitting
interaction terms) would yield the same diffi-
culty.
We may, of course, use the tools already de-
veloped to deal with these problems, in the
sense that if we first estimate a general model
that allows for powers, interaction terms, etc.
and then “test down” with joint F tests, we
can be confident that the more specific model
we develop will not have imposed inappropri-
ate restrictions along the way. But how can
we consider the possibility that there are missing elements even in the context of our general model?
One quite useful approach to a general test for functional form misspecification is Ramsey's RESET (regression specification error test). The idea behind RESET is quite simple: if we have properly specified the model, no nonlinear functions of the independent variables should be significant when added to our estimated equation. Since the fitted, or predicted, values (ŷ) of the estimated model are linear in the independent variables, we may consider powers of the predicted values as additional regressors. Clearly the ŷ values themselves cannot be added to the regression, since they are by construction linear combinations of the x variables. But their squares, cubes, ... are not. The RESET formulation reestimates the original equation, augmented by powers of ŷ (usually squares, cubes, and fourth powers are sufficient) and conducts an F-test for the joint null
hypothesis that those variables have no sig-
nificant explanatory power. This test is easy
to implement, but many computer programs
have it already programmed; for instance, in
Stata one may just specify estat ovtest (omit-
ted variable test) after any regression, and the
Ramsey RESET will be produced. However,
as Wooldridge cautions, RESET should not be
considered a general test for omission of rele-
vant variables; it is a test for misspecification
of the relationship between y and the x values
in the model, and nothing more.
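A minimal sketch of the test in Stata, with illustrative (assumed) variable names:

* fit the model of interest
regress lwage educ exper tenure
* Ramsey RESET: augments the model with powers of the fitted values
* and tests their joint significance
estat ovtest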
Tests against nonnested alternatives
The standard joint testing framework is not
helpful in the context of “competing models,”
or nonnested alternatives. These alternatives
can also arise in the context of functional form:
for instance,
y = β0 + β1x1 + β2x2 + u (1)
y = β0 + β1 log x1 + β2 log x2 + u
are nonnested models. The mechanical al-
ternative, in which we construct an artificial
model that contains each model as a special
case, is often not very attractive (and some-
time will not even be feasible). An alterna-
tive approach is that of Davidson and MacK-
innon. Using the same logic applied in devel-
oping Ramsey’s RESET, we can estimate each
of the models in (1), generate their predicted
values, and include them in the other equation.
Under the null hypothesis that the first form of
the model is correctly specified, a linear com-
bination of the logs of the x variables should
have no power to improve it, and that coef-
ficient should be insignificant. Likewise, one
can reestimate the second model, including the
predicted values from the first model. This
testing strategy–often termed the Davidson-
MacKinnon “J test”–may indicate that one
of the models is robust against the other.
There are no guarantees, though, in that ap-
plying the J test to these two equations may
generate zero, one, or two rejections. If nei-
ther hypothesis is rejected, then the data are
not helpful in ranking the models. If both are
rejected, we are given an indication that nei-
ther model is adequate, and that a continued
specification search should be conducted. If
one rejection is received, then the J test is
definitive in indicating that one of the models
dominates (or subsumes) the other, and not
vice versa. However, this does not imply that
the preferred model is well specified; again, this
test is against a very specific alternative, and
does not deliver a “clean bill of health” for the
preferred model should one emerge.
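A sketch of the procedure in Stata, using hypothetical variables y, x1 and x2 to stand for the two competing specifications:

* model A: levels of the regressors
regress y x1 x2
predict double yhatA, xb
* model B: logs of the regressors
gen double lx1 = log(x1)
gen double lx2 = log(x2)
regress y lx1 lx2
predict double yhatB, xb
* J test of model A against model B: B's fitted values should add nothing to A
regress y x1 x2 yhatB
test yhatB
* J test of model B against model A
regress y lx1 lx2 yhatA
test yhatA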
Proxy variables
So far, we have discussed issues of misspec-
ification resulting from improper handling of
the x variables. In many economic models, we
are forced to employ “proxy variables”: ap-
proximate measures of an unobservable phe-
nomenon. For instance, admissions officers
use SAT scores and high school GPAs as prox-
ies for applicants’ ability and intelligence. No
one argues that standardized tests or grade
point averages are actually measuring aptitude,
or intelligence; but there are reasons to believe
that the observable variable is well correlated
with the unobservable, or latent, variable. To
what extent will a model estimated using such
proxies for the variables in the underlying re-
lationship be successful, in terms of delivering
consistent estimates of its parameters? First,
of course, it must be established that there
is a correlation between the observable vari-
able and the latent variable. If we consider the
latent variable as having a linear relation to
a measurable proxy variable, the error in that
relation must not be correlated with other re-
gressors. When we estimate the relationship
including the proxy variable, it should be ap-
parent that the measurement error from the
latent variable equation ends up in the error
term, as an additional source of uncertainty.
This is an incentive to avoid proxy variables
where one can, since they will inexorably inflate
the error variance in the estimated regression.
But usually they are employed out of necessity,
in models for which we have no ability to mea-
sure the latent variable. If there are several
potential proxy measures, they might each be
tested, to attempt to ascertain whether bias is
being introduced to the relationship.
In some cross-sectional relationships, we have
the opportunity to use a lagged value of the
dependent variable as a proxy variable. For in-
stance, if we are trying to explain cities’ crime
rates, we might consider that there are likely to be similarities, regardless of the effectiveness of anti-crime strategies, between current crime rates and last year's values. Thus, a prior value of the dependent variable, determined before this year's value, may be a useful proxy for a number of factors that cannot otherwise be quantified. This approach might often be used to deal with factors such as "business climate," in which some states or municipalities are viewed as more welcoming to business; there may be many aspects to this perception, some of them more readily quantifiable (such as tax rates), some of them not so (such as local officials' willingness to negotiate infrastructure improvements, or assist in funding for a new facility). But in the absence of radical changes in localities' stance in this regard, the prior year's (or decade's) business investment in the locality may be a good proxy for those factors, perceived much more clearly by the business decisionmakers than by the econometrician.
Measurement error
We often must deal with the issue of mea-
surement error: that the variable that theory
tells us belongs in the relationship cannot be
precisely measured in the available data. For
instance, the exact marginal tax rate that an
individual faces will depend on many factors,
only some of which we might be able to ob-
serve: even if we knew the individual’s income,
number of dependents, and homeowner sta-
tus, we could only approximate the effect of
a change in tax law on his or her tax liabil-
ity. We are faced, therefore, with using an
approximate measure, including some error of
measurement, whenever we might attempt to
formulate and implement such a model. This is
conceptually similar to the proxy variable prob-
lem we have already discussed, but in this case
it is not a latent variable problem. There is an
observable magnitude, but we do not necessar-
ily observe it. For instance, reported income is
an imperfect measure of actual income, while
IQ score is only a proxy for ability. Why is
measurement error of concern? Because the
behavior we’re trying to model–be it of indi-
viduals, firms, or nations–presumably is driven
by the actual measures, not our mismeasured
approximations of those factors. To the extent
that we fail to capture the actual measure, we
may misinterpret the behavioral response.
If measurement error is observed in the de-
pendent variable–for instance, if the true rela-
tionship explains y∗, but we only observe y = y∗ + ε, where ε is a mean-zero error process,
then ε becomes a component of the regres-
sion error term: yet another reason why the
relationship does not fit perfectly. We assume
that ε is not systematic, in particular, that it is
not correlated with the independent variables
X. As long as that is the case, then this form
of measurement error does no real harm; it
merely weakens the model, without introduc-
ing bias in either point or interval estimates. If
the magnitude of the measurement error in y is
correlated with one or more of the x variables,
then we will have a problem of bias.
Measurement error in an explanatory variable,
on the other hand, is a far more serious prob-
lem. Say that the true model is
y = β0 + β1x∗1 + u (2)
but that x∗1 is not observed; instead, we ob-
serve x1 = x∗1+ε1. We can assume that E(ε1) =
0 with generality. But what should be as-
sumed about the relationship between ε1 and
x∗1? First, let us assume that ε1 is uncorre-
lated with the observed measure x1 (that is,
larger values of x1 do not give rise to sys-
tematically larger (or smaller) errors of mea-
surement). This can be written as Cov(ε1, x1) = 0. But if this is the case, it must be true that Cov(ε1, x∗1) ≠ 0: that is, the error
of measurement must be correlated with the
actual explanatory variable x∗1, so that we can
write the estimated equation (in which x∗1 is
replaced with the observable x1) as
y = β0 + β1x1 + (u− β1ε1) (3)
Since both u and ε1 have zero mean and are
uncorrelated (by assumption) with x1, the pres-
ence of measurement error merely inflates the
error term: that is, Var(u − β1ε1) = σ_u^2 + β1^2 σ_ε1^2,
given that we have assumed that u and ε1 are
uncorrelated with each other. Thus, measure-
ment error in x∗1 does not negatively affect the
regression of y on x1; it merely inflates the
error variance, like measurement error in the
dependent variable.
However, this is not the case that we usu-
ally consider under the heading of errors-in-
variables. It is perhaps more reasonable to
assume that the measurement error is uncorre-
lated with the true explanatory variable: Cov(
ε1, x∗1) = 0. If this is so, then Cov(ε1, x1) = Cov(ε1, (x∗1 + ε1)) ≠ 0 by construction, and the
regression (3) will have a correlation between
its explanatory variable x1 and the composite
error term. The covariance of (x1, u − β1ε1) = −β1 Cov(ε1, x1) = −β1 σ_ε1^2 ≠ 0, causing the
OLS regression of y on x1 to be biased and
inconsistent. In this simple case of a single ex-
planatory variable measured with error, we can
determine the nature of the bias:

plim(b_1) = \beta_1 + \frac{Cov(x_1, u - \beta_1 \epsilon_1)}{Var(x_1)} = \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{\epsilon_1}^2} \right)    (4)
demonstrating that the OLS point estimate
will be attenuated (biased toward zero), since the ratio in parentheses must be a fraction less than one. Clearly, in the absence of measurement error, σ_ε1^2 → 0, and the OLS coefficient becomes unbiased and consistent. As σ_ε1^2 increases rela-
tive to the variance in the (correctly measured)
explanatory variable, the OLS coefficient be-
comes more and more unreliable, shrinking to-
ward zero.
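The attenuation result is easy to see in a small simulation (a sketch with made-up parameters, not an example from the text). With β1 = 1 and equal variances for x∗1 and the measurement error, the probability limit of the OLS slope is 0.5:

clear
set obs 10000
set seed 12345
* true regressor, structural error, and true model with beta1 = 1
gen double xstar = rnormal()
gen double u = rnormal()
gen double y = 1 + xstar + u
* observed regressor contaminated with classical (CEV) measurement error
gen double x1 = xstar + rnormal()
* the estimated slope should be close to 0.5 rather than 1
regress y x1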
What can we conclude in a multiple regression
equation, in which perhaps one of the explana-
tory variables is subject to measurement error?
If the measurement error is uncorrelated with the
true (correctly measured) explanatory variable,
then the result we have here applies: the OLS
coefficients will be biased and inconsistent for
all of the explanatory variables (not merely the
variable measured with error), but we can no
longer predict the direction of bias in general
terms. Realistically, more than one explana-
tory variable may be subject to measurement
error (e.g. both reported income and wealth
may be erroneous).
We might be discouraged by these findings,
but fortunately there are solutions to these
problems. The models in question, in which
we suspect the presence of serious errors of
measurement, may be estimated by techniques
other than OLS regression. We will discuss
those instrumental variable techniques, which
may also be used to deal with problems of si-
multaneity, or two-way causality, in Chapter
15.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 10: Basic regression analysis with
time series data
We now turn to the analysis of time series
data. One of the key assumptions underlying
our analysis of cross-sectional data will prove
to be untenable when we consider time series
data; thus, we separate out the issues of time
series modelling from that of cross sections.
How does time series data differ? First of all,
it has a natural ordering, that of calendar time
at some periodic frequency. Note that we are
not considering here a dataset in which some
of the variables are dated at a different point
in time: e.g. a survey measuring this year’s in-
come, and (as a separate variable) last year’s
income. In time series data sets, the observa-
tions are dated, and thus we need to respect
their order, particularly if the model we consider has a dynamic specification (involving variables from more than one point in time). What is a time series? Merely a sequence of observations on some phenomenon observed at regular intervals. Those intervals may correspond to the passage of calendar time (e.g. annual, quarterly, monthly data) or they may reflect an economic process that is irregular in calendar time (such as business-daily data). In either case, our observations may not be available for every point in time (for instance, there are days when a given stock does not trade on the exchange).
A second important difference between cross-sectional and time series data: with the former, we can reasonably assume that the sample is drawn randomly from the appropriate population, and could conceive of one or many alternate samples constructed from the same population. In the case of time series data, we consider the sequence of events we have recorded
as a realization of the underlying process. We
only have one realization available, in the sense
that history played out a specific sequence of
events. In an alternate universe, Notre Dame
might have lost to BC this year. Randomness
plays a role, of course, just as it does in cross-
sectional data; we do not know what will tran-
spire until it happens, so that time series data
ex ante are random variables. We often speak
of a time series as a stochastic process, or
time series process, focusing on the concept
that there is some mechanism generating that
process, with a random component.
Types of time series regression models
Models used in a time series context can often
be grouped into those sharing common fea-
tures. By far the simplest is a static model,
such as
yt = β0 + β1x1,t + β2x2,t + ut (1)
We may note that this model is the equiva-
lent of the cross-sectional regression model,
with the i subscript in the cross section re-
placed by t in the time series context. Each
observation is modeled as depending only on
contemporaneous values of the explanatory
variables. This structure implies that all of the
interactions among the variables of the model
are assumed to take place immediately: or,
taking the frequency into account, within the
same time period. Thus, such a model might
be reasonable when applied to annual data,
where the length of the observation interval is
long enough to allow behavioral adjustments
to take place. If we applied the same model
to higher-frequency data, we might consider
that assumption inappropriate; we might con-
sider, for instance, that a tax cut would not be
fully reflected by higher retail sales in the same
month that it took effect. An example of such
a structure that appears in many textbooks is
the static Phillips curve:
πt = β0 + β1URt + ut (2)
where πt is this year’s inflation rate, and URt is this year’s unemployment rate. Stating the
model in this form not only implies that the
level of unemployment is expected to affect the
rate of inflation (presumably with a negative
sign), but also that the entire effect of changes
in unemployment will be reflected in inflation
within the observation interval (e.g. one year).
In many contexts, we find a static model in-
adequate to reflect what we consider to be
the relationship between explanatory variables
and those variables we wish to explain. For
instance, economic theory surely predicts that
changes in interest rates (generated by mone-
tary policy) will have an effect on firms’ capital
investment spending. At lower interest rates,
firms will find more investment projects with a
positive expected net present value. But since
it takes some time to carry out these projects–
equipment must be ordered, delivered, and in-
stalled, or new factories must be built and
equipped–we would not expect that quarterly
investment spending would reflect the same
quarter’s (or even the previous quarter’s) in-
terest rates. Presumably interest rates affect
capital investment spending with a lag, and we
must take account of that phenomenon. If we
were to model capital investment with a static
model, we would be omitting relevant explana-
tory variables: the prior values of the causal
factors. These omissions would cause our es-
timates of the static model to be biased and
inconsistent. Thus, we must use some form of
distributed lag model to express the relation-
ship between current and past values of the
explanatory variables and the outcome. Dis-
tributed lag models may take a finite number
of lagged values into account (thus the Finite
Distributed Lag model, or FDL) or they may
use an infinite distributed lag: e.g. all past
values of the x variables. When an infinite DL
model is specified, some algebraic sleight-of-
hand must be used to create a finite set of
regressors.
A simple FDL model would be
ft = β0 + β1pet + β2pet−1 + β3pet−2 + ut (3)
in which we consider the fertility rate in the
population as a function of the personal ex-
emption, or child allowance, over this year and
the past two years. We would expect that the
effect of a greater personal exemption is posi-
tive, but realistically we would not expect the
effect to be (only) contemporaneous. Given
that there is at least a 9-month lag between
the decision and the recorded birth, we would
expect such an effect (if it exists) to be largely
concentrated in the β2 and β3 coefficients. In-
deed, we might consider whether additional
lags are warranted. In this model, β1 is the
impact effect, or impact multiplier of the
personal exemption, measuring the contempo-
raneous change. How do we calculate ∂f/∂pe?
That (total) derivative must be considered as
the effect of a one-time change in pe that
raises the exemption by one unit and leaves
it permanently higher. It may be computed
by evaluating the steady state of the model:
that with all time subscripts dropped. Then
it may be seen that the total effect, or long-
run multiplier, of a permanent change in pe
is (β1 + β2 + β3) . In this specification, we pre-
sume that there is an impact effect (allowing
for a nonzero value of β1) but we are impos-
ing the restriction that the entire effect will be
felt within the two year lag. This is testable,
of course, by allowing for additional lag terms
in the model, and testing for their joint sig-
nificance. However the analysis of individual
lag coefficients is often hampered–especially
at higher frequencies such as quarterly and
monthly data–by high autocorrelation in the
series. That is, the values of the series are
closely related to each other over time. If this
is the case, then many of the individual coeffi-
cients in a FDL regression model may not be
distinguishable from zero. This does not im-
ply, though, that the sum of those coefficients
(i.e. the long run multiplier) will be imprecisely
estimated. We may get a very precise value for
that effect, even if its components are highly
intercorrelated.
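Once the data have been declared as a time series (with tsset, as described below), such a model and its long run multiplier can be computed directly. A sketch with hypothetical series names fert and pe, observed annually:

tsset year
* finite distributed lag in the personal exemption
regress fert pe L.pe L2.pe
* the impact multiplier is the coefficient on pe; the long run multiplier
* is the sum of the current and lagged coefficients
lincom pe + L.pe + L2.pe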
One additional concern that will apply in esti-
mating FDL models, especially when the num-
ber of observations is limited. Each lagged
value included in a model results in the loss
of one observation in the estimation sample.
Likewise, the use of a first difference (∆yt ≡ yt − yt−1) on either the left or right side of
a model results in the loss of one observa-
tion. If we have a long time series, we may
not be too concerned about this; but if we
were working with monthly data, and felt it
appropriate to consider 12 lags of the explana-
tory variables, we would lose the first year of
data to provide these starting values. Com-
puter programs such as Stata may be set up
to recognize the time series nature of the data
(in Stata, we use the tsset command to iden-
tify the date variable, which must contain the
calendar dates over which the data are mea-
sured), and construct lags and first differences
taking these constraints into account (for in-
stance, a lagged value of a variable will be set
to a missing value where it is not available).
In Stata, once a dataset has been established
as time series, we may use the operators L.,D.
and F. to refer to the lag, difference or lead of a
variable, respectively: so L.gdp is last period’s
gdp, D.gdp is the first difference, and F.gdp is
next year’s value. These operators can also
consider higher lags, so L2.gdp is the second
lag, and L(1/4).gdp refers to the first four lags,
using standard Stata “numlist” notation (help
numlist for details).
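A brief sketch of these operators in practice, with illustrative variable names:

* declare an annual time series indexed by the variable year
tsset year
* one-period lag and first difference of gdp
gen lgdp = L.gdp
gen dgdp = D.gdp
* regress inv on current gdp and its first two lags
regress inv L(0/2).gdp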
Finite sample properties of OLS
How must we modify the assumptions under-
lying OLS to deal with time series data? First
of all, we assume that there is a linear model
linking y with a set of explanatory variables,
{x1...xk}, with an additive error u, for a sample
of t = 1, ..., n. It is useful to consider the ex-
planatory variables as being arrayed in a matrix
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ x_{2,1} & \cdots & x_{2,k} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{pmatrix}
where, like a spreadsheet,
the rows are the observations (indexed by time)
and the columns are the variables (which may
actually be dated differently: e.g. x2 may ac-
tually be the lag of x1, etc.) To proceed with
the development of the finite sample properties
of OLS, we assume:
Proposition 1 For each t, E(ut|X) = 0, where
X is the matrix of explanatory variables.
This is a key assumption, and quite a strong
one: it states not only that the error is con-
temporaneously uncorrelated with each of the
explanatory variables, but also that the error is
assumed to be uncorrelated with elements of
X at every point in time. The weaker state-
ment of contemporaneous exogeneity,
E(ut|xt,1, xt,2, ..., xt,k) = 0 is analogous to the
assumption that we made in the cross-sectional
context. But this is a stronger assumption, for
it states that the elements of X, past, present,
and future, are independent of the errors: or
that the explanatory variables in X are strictly
exogenous. It is important to note that
this assumption, by itself, says nothing about
the correlations over time among the explana-
tory variables (or their correlations with each
other), nor about the possibility that succes-
sive elements of u may be correlated (in which
case we would say that u is autocorrelated).
The assumption only states that the distribu-
tions of u and X are independent.
What might cause this assumption to fail? Clearly,
omitted variables and/or measurement error
are likely causes of a correlation between the
regressors and errors. But in a time series con-
text there are other likely suspects. If we esti-
mate a static model, for instance, but the true
relationship is dynamic–in which lagged values
of some of the explanatory variables also have
direct effects on y−then we will have a correla-
tion between contemporaneous x and the error
term, since it will contain the effects of lagged
x, which is likely to be correlated with cur-
rent x. So this assumption of strict exogeneity
has strong implications for the correct speci-
fication of the model (in this case, we would
need to specify a FDL model). It also implies
that there cannot be correlation between cur-
rent values of the error process and future x
values: something that would be likely in a case
where some of the x variables are policy in-
struments. For instance, consider a model of
farmers’ income, dependent on (among other
factors) government price supports for their
crop. If unprecedented shocks (such as a se-
ries of droughts), which are unpredictable and
random effects of weather on farmers’ income,
trigger an expansion of the government price
support program, then the errors today are cor-
related with future x values.
The last assumption we need is the standard
assumption that the columns of X are linearly
independent: that is, there are no exact linear
relations, or perfect collinearity, among the
regressors.
With these assumptions in hand, we can demon-
strate that the OLS estimators are unbiased,
both conditional on X and unconditionally. The
random sampling assumption that allowed us to prove
unbiasedness in the cross-sectional context has
been replaced by the assumption of strict ex-
ogeneity in the time series context. We now
turn to the interval estimates. As previously,
we assume that the error variance, conditioned
on X, is homoskedastic: V ar(ut|X) = V ar(ut) =
σ2, ∀t. In a time series context, this assumption
states that the error variance is constant over
time, and in particular not influenced by the
X variables. In some cases, this may be quite
unrealistic. We now add an additional assump-
tion, particular to time series analysis: that
there is no serial correlation in the errors:
Cov(ut, us|X) = Cov(ut, us) = 0, ∀t ≠ s. This
assumption states that the errors are not auto-
correlated, or correlated with one another, so
that there is no systematic pattern in the errors
over time. This may clearly be violated, if the
error in one period (for instance, the degree to
which the actual level of y falls short of the de-
sired level) is positively (or negatively) related
to the error in the previous period. Positive
autocorrelation can readily arise in a situation
where there is partial adjustment to a discrep-
ancy, whereas negative autocorrelation is much
more likely to reflect “overshooting,” in which
a positive error (for instance, an overly opti-
mistic forecast) is followed by a negative error
(a pessimistic forecast). This assumption has
nothing to do with the potential autocorrela-
tion within the X matrix; it only applies to
the error process. Why is this assumption only
relevant for time series? In cross sections, we
assume random sampling, whereby each obser-
vation is independent of every other. In time
series, the sequence of the observations makes
it likely that if independence is violated, it will
show up in successive observations’ errors.
With these additional assumptions, we may
state the Gauss-Markov theorem for OLS esti-
mators of a time series model (OLS estimators
are BLUE), implying that the variances of the
OLS estimators are given by:
Var(b_j|X) = \frac{\sigma^2}{SST_j (1 - R_j^2)}    (4)
where SSTj is the total sum of squares of the
jth explanatory variable, and R_j^2 is the R2 from
a regression of variable xj on the other ele-
ments of X. Likewise, the unknown parameter
σ2 may be replaced by its consistent estimate,
s^2 = SSR/(n − k − 1), identical to that discussed previ-
ously.
As in our prior derivation, we will assume that
the errors are normally distributed: u ∼ N(0, σ2).
If the above assumptions hold, then the stan-
dard t−statistics and F−statistics we have ap-
plied in a cross-sectional context will also be
applicable in time series regression models.
Functional form, dummy variables, and in-
dex numbers
We find that a logarithmic transformation is
very commonly used in time series models, par-
ticularly with series that reflect stocks, flows,
or prices (rather than rates). Many models
are specified with the first difference of log(y),
implying that the dependent variable is the
growth rate of y. Dummy variables are also
very useful to test for structural change. We
may have a priori information that indicates
that unusual events were experienced in partic-
ular time periods: wars, strikes, or presidential
elections, or a market crash. In the context of
a dynamic model, we do not want to merely
exclude those observations, since that would
create episodes of missing data. Instead, we
can “dummy” the period of the event, which
then allows for an intercept shift (or, with in-
teractions, for a slope shift) during the un-
usual period. The tests for significance of the
dummy coefficients permit us to identify the
importance of the period, and justify its special
treatment. We may want to test that the rela-
tionship between inflation and unemployment
(the “Phillips curve”) is the same in Repub-
lican and Democratic presidential administra-
tions; this may readily be done with a dummy
for one party, added to the equation and inter-
acted to allow for a slope change between the
two parties’ equations. Dummy variables are
also used widely in financial research, to con-
duct event studies: models in which a par-
ticular event, such as the announcement of a
takeover bid, is hypothesized to trigger “ab-
normal” returns to the stock. In this context,
high-frequency (e.g. daily) data on stock re-
turns are analyzed, with a dummy set equal to
1 on and after the date of the takeover bid
announcement. A test for the significance of
the dummy coefficient allows us to analyze the
importance of this event. (These models are
explicitly discussed in EC327, Financial Econo-
metrics).
Creation of these dummies in Stata is made
easier by the tin() function (read: tee-in). If
the data set has been established as a time
series via tsset, you may refer to natural time
periods in generating new variables or spec-
ifying the estimation sample. For instance,
gen prefloat = (tin(1959q1,1971q3)) will gen-
erate a dummy for that pre-Smithsonian pe-
riod, and a model may be estimated over a
subset of the observations via regress ... if
tin(1970m1,1987m9).
In working with time series data, we are often
concerned with series measured as index num-
bers, such as the Consumer Price Index, GDP
Deflator, Index of Industrial Production, etc.
The price series are often needed to gener-
ate real values from nominal magnitudes. The
usual concerns must be applied in working with
these index number series, some of which have
been rebased (e.g. from 1982=100 to 1987=100)
and must be adjusted accordingly for a new
base period and value. Interesting implications
arise when we work with “real” magnitudes,
expressed in logs: for instance, labor supply
is usually modelled as depending on the real
wage,(wp
). If we express these variables in logs,
the log of the real wage becomes logw− log p.
Regressing the log of hours worked on a single
variable, (logw − log p), is a restricted version
of a regression in which the two variables are
entered separately. In that regression, the co-
efficients will almost surely differ in their ab-
solute value. But economic theory states that
only the real wage should influence workers’
decisions; they should not react to changes in
its components (e.g. they should not be will-
ing to supply more hours of labor if offered a
higher nominal wage that only makes up for a
decrease in their purchasing power).
Trends and seasonality
Many economic time series are trending: grow-
ing over time. One of the reasons for very high
R2 values in many time series regressions is the
common effect of time on many of the vari-
ables considered. This brings a challenge to
the analysis of time series data, since when we
estimate a model in which we consider the ef-
fect of several causal factors, we must be care-
ful to account for the co-movements that may
merely reflect trending behavior. Many macro
series reflect upward trends; some, such as the
cost of RAM for personal computers, exhibit
strong downward trends. We can readily model
a linear trend by merely running a regression
of the series on t, in which the slope coefficient
is then ∂y/∂t. To create a time trend in Stata,
you can just generate t = _n, where _n is Stata's observation number. It does not matter where
a trend starts, or the units in which it is ex-
pressed; a trend is merely a series that changes
by a fixed amount per time period. A linear
trend may prove to be inadequate for many
economic series, which we might expect on a
theoretical basis to exhibit constant growth,
not constant increments. In this case, an ex-
ponential trend may readily be estimated (for
strictly positive y) by regressing log y on t. The
slope coefficient is then a direct estimate of
the percentage growth rate per period. We
could also use a polynomial model, such as a
quadratic time trend, regressing the level of
y on t and t^2.
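A sketch of these trend specifications in Stata, with an assumed dependent variable y:

* linear time trend: _n is the observation number
gen t = _n
gen t2 = t^2
* linear trend: the slope is the change in y per period
regress y t
* exponential trend: the slope is the per-period growth rate
gen lny = log(y)
regress lny t
* quadratic trend
regress y t t2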
Nothing about trending economic variables vi-
olates our basic assumptions for the estima-
tion of OLS regression models with time se-
ries data. However, it is important to consider
whether significant trends exist in the series;
if we ignore a common trend, we may be esti-
mating a spurious regression, in which both y
and the X variables appear to be correlated be-
cause of the influence on both of an omitted
factor, the passage of time. We can readily
guard against this by including a time trend
(linear or quadratic) in the regression; if it is
needed, it will appear to be a significant de-
terminant of y. In some cases, inclusion of a
time trend can actually highlight a meaning-
ful relationship between y and one or more x
variables: since their coefficients are now es-
timates of their co-movement with y, ceteris
paribus: that is, net of the trend in y.
We may link the concept of a regression in-
clusive of trend to the common practice of
analyzing detrended data. Rather than regressing y on X and t, we could remove the trend from y and each of the variables in X. How? Regress each variable on t, and save the residuals (if desired, adding back the original mean of the series). This is then the detrended y, call it y∗, and the detrended explanatory variables X∗ (not including a trend term). If we now estimate the regression of y∗ on X∗, we will find that the slope coefficients' point and interval estimates are exactly equal to those from the original regression of y on X and t. Thus, it does not matter whether we first detrend the series and run the regression, or estimate the regression with the trend included. Those are equivalent strategies, and since the latter is less burdensome, it may be preferred by the innately lazy researcher.
Another issue that may often arise in time series data of quarterly, monthly or higher frequency is seasonality. Some economic variables are provided in seasonally adjusted form.
In databanks and statistical publications, the
acronym SAAR (seasonally adjusted at annual
rate) is often found. Other economic series are
provided in their raw form, often labelled NSA,
or not seasonally adjusted. Seasonal factors
play an important role in many series. Natu-
rally, they reflect the seasonal patterns in many
commodities’ measures: agricultural prices dif-
fer between harvest periods and out-of-season
periods, fuel prices differ due to winter demand
for oil and natural gas, or summer demand
for gasoline. But there are seasonal factors
in many series we might consider with a more
subtle interpretation. Retail sales, naturally,
are very high in the holiday period: but so is
the demand for cash, since shoppers and gift-
givers will often need more cash at that time.
Payrolls in the construction industry will ex-
hibit seasonal patterns, as construction falls
off in cold climates, but may be stimulated by
a mild winter. Many financial series will re-
flect the adjustments made by financial firms
to “dress up” quarter-end balance sheets and
improve apparent performance.
If all of the data series we are using in a model
have been seasonally adjusted by their produc-
ers, we may not be concerned about seasonal-
ity. But often we will want to use some NSA
series, or be worried about the potential for
seasonal effects. In this case, just as we dealt
with trending series by including a time trend,
we should incorporate seasonality into the re-
gression model by including a set of seasonal
dummies. For quarterly data, we will need 3
dummies; for monthly data, 11 dummies; and
so on. If we are using business-daily data such
as financial time series, we may want to in-
clude “day-of-week” effects, with dummies for
four of the five business days.
How would you use quarterly dummies in Stata?
First of all, you must know what the time vari-
able in the data set is: give the command
tsset to find out. If it is a quarterly variable, the tsset range will report dates with embedded "q"s. Then you may create one quarterly dummy as gen q1=(quarter(dofq(qtr))==1), which will take on 1 in the first quarter, and 0 otherwise. To consider whether series income exhibits seasonality, regress income L(1/3).q1 and examine the F-statistic. You could, of course, include any three of the four quarter dummies; L(0/2) would include dummies for quarters 1, 2 and 3, and yield the same F-statistic. Note that inclusion of these three dummies will require the loss of at least two observations at the beginning of the sample. This form of seasonal adjustment will consider the effect of each season to be linear; if we wanted to consider multiplicative seasonality, e.g. sales are always 10% higher in the fourth quarter, that could be achieved by regressing log y on the seasonal dummies. A trend could be included in either form of the regression to capture trending behavior over and above seasonality; in the latter regression, of course,
it would represent an exponential (constant
growth) trend.
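An alternative way to generate the full set of quarterly dummies and test for seasonality, assuming the quarterly date variable is named qtr:

* quarter of the year, 1..4, and a full set of dummies qd1-qd4
gen q = quarter(dofq(qtr))
tabulate q, generate(qd)
* include any three dummies, leaving one quarter as the base category
regress income qd2 qd3 qd4
* joint F-test for seasonality
testparm qd2 qd3 qd4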
Just as with a trend, we may either deseason-
alize each series (by regressing it on seasonal
dummies, saving the residuals, and adding the
mean of the original series) and regress sea-
sonally adjusted series on each other; or we
may include a set of seasonal dummies (leav-
ing one out) in a regression of y on X, and test
for the joint significance of the seasonal dum-
mies. The coefficients on the X variables will
be identical, in both point and interval form,
using either strategy.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 12: Serial correlation and heteroskedas-
ticity in time series regressions
What will happen if we violate the assump-
tion that the errors are not serially corre-
lated, or autocorrelated? We demonstrated
that the OLS estimators are unbiased, even in
the presence of autocorrelated errors, as long
as the explanatory variables are strictly exoge-
nous. This is analogous to our results in the
case of heteroskedasticity, where the presence
of heteroskedasticity alone does not cause bias
nor inconsistency in the OLS point estimates.
However, following that parallel argument, we
will be concerned with the properties of our
interval estimates and hypothesis tests in the
presence of autocorrelation.
OLS is no longer BLUE in the presence of se-
rial correlation, and the OLS standard errors
and test statistics are no longer valid, even
asymptotically. Consider a first-order Markov
error process:
ut = ρut−1 + et, |ρ| < 1 (1)
where the et are uncorrelated random variables
with mean zero and constant variance. What
will be the variance of the OLS slope estimator
in a simple (y on x) regression model? For
simplicity let us center the x series so that \bar{x} = 0. Then the OLS estimator will be:

b_1 = \beta_1 + \frac{\sum_{t=1}^{n} x_t u_t}{SST_x}    (2)
where SSTx is the sum of squares of the x
series. In computing the variance of b1, con-
ditional on x, we must account for the serial
correlation in the u process:
Var(b_1) = \frac{1}{SST_x^2} Var\left( \sum_{t=1}^{n} x_t u_t \right)
= \frac{1}{SST_x^2} \left( \sum_{t=1}^{n} x_t^2 Var(u_t) + 2 \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} x_t x_{t+j} E(u_t u_{t+j}) \right)
= \frac{\sigma^2}{SST_x} + 2 \left( \frac{\sigma^2}{SST_x^2} \right) \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \rho^j x_t x_{t+j}
where σ^2 = Var(ut) and we have used the fact that E(ut ut+j) = Cov(ut, ut+j) = ρ^j σ^2 in the derivation. Notice that the first term in this expression is merely the OLS variance of b1 in the absence of serial correlation. When will the second term be nonzero? When ρ is nonzero, and the x process itself is autocorrelated, this double summation will have a nonzero value. But since nothing prevents the explanatory variables from exhibiting autocorrelation (and in fact many explanatory variables take on similar values through time) the
only way in which this second term will vanish
is if ρ is zero, and u is not serially correlated.
In the presence of serial correlation, the second
term will cause the standard OLS variances of
our regression parameters to be biased and in-
consistent. In most applications, when serial
correlation arises, ρ is positive, so that suc-
cessive errors are positively correlated. In that
case, the second term will be positive as well.
Recall that this expression is the true variance
of the regression parameter; OLS will only con-
sider the first term. In that case OLS will seri-
ously underestimate the variance of the param-
eter, and the t−statistic will be much too high.
If on the other hand ρ is negative–so that suc-
cessive errors result from an “overshooting”
process–then we may not be able to determine
the sign of the second term, since odd terms
will be negative and even terms will be positive.
Surely, though, it will not be zero. Thus the
consequence of serial correlation in the errors–
particularly if the autocorrelation is positive–
will render the standard t− and F−statistics
useless.
Serial correlation in the presence of lagged
dependent variables
A case of particular interest, even in the con-
text of simple y on x regression, is that where
the “explanatory variable” is a lagged depen-
dent variable. Suppose that the conditional
expectation of yt is linear in its past value:
E(yt|yt−1) = β0 + β1yt−1. We can always add
an error term to this relation, and write it as
yt = β0 + β1yt−1 + ut (3)
Let us first assume that the error is “well be-
haved,” i.e. E(ut|yt−1) = 0, so that there is
no correlation between the current error and
the lagged value of the dependent variable. In
this setup the explanatory variable cannot be
strictly exogenous, since there is a contempo-
raneous correlation between yt and ut by con-
struction; but in evaluating the consistency of
OLS in this context we are concerned with the
correlation between the error and yt−1, not the
correlation with yt, yt−2, and so on. In this
case, OLS would still yield consistent point estimates, with biased standard errors, as we derived above, even if the u process were serially correlated.
But it is often claimed that with the joint presence of a lagged dependent variable and autocorrelated errors, OLS will be inconsistent. This
arises, as it happens, from the assumption that
the u process in (3) follows a particular autore-
gressive process, such as the first-order Markov
process in (1). If this is the case, then we
do have a problem of inconsistency, but it is
arising from a different source: the misspeci-
fication of the dynamics of the model. If we
combine (3) with (1), we really have an AR(2)
model for yt, since we can lag (3) one period
and substitute it into (1) to rewrite the model
as:
y_t = β_0 + β_1 y_{t-1} + ρ ( y_{t-1} − β_0 − β_1 y_{t-2} ) + e_t
    = β_0 (1 − ρ) + (β_1 + ρ) y_{t-1} − ρ β_1 y_{t-2} + e_t
    = α_0 + α_1 y_{t-1} + α_2 y_{t-2} + e_t        (4)
so that the conditional expectation of yt prop-
erly depends on two lags of y, not merely one.
Thus the estimation of (3) via OLS is indeed
inconsistent, but the reason for that inconsis-
tency is that y is correctly modelled as AR(2).
The AR(1) model is seen to be a dynamic mis-
specification of (4); as is always the case, the
omission of relevant explanatory variables will
cause bias and inconsistency in OLS estimates,
especially if the excluded variables are corre-
lated with the included variables. In this case,
that correlation will almost surely be meaning-
ful. To arrive at consistent point estimates of
this model, we merely need add yt−2 to the
estimated equation. That does not deal with
the inconsistent interval estimates, which will
require a different strategy.
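As a sketch of that respecification in Stata (with a hypothetical series y in a tsset dataset):

    regress y L.y          // dynamically misspecified if (4) is the true model
    regress y L.y L2.y     // adding the second lag restores consistent point estimates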
Testing for first-order serial correlation
Since the presence of serial correlation invali-
dates our standard hypothesis tests and inter-
val estimates, we should be concerned about
testing for it. First let us consider testing
for serial correlation in the k−variable regres-
sion model with strictly exogenous regressors–
which rules out, among other things, lagged
dependent variables.
The simplest structure which we might posit
for serially correlated errors is AR(1), the first
order Markov process, as given in (1). Let us
assume that et is uncorrelated with the entire
past history of the u process, and that et is ho-
moskedastic. The null hypothesis is H0 : ρ = 0
in the context of (1). If we could observe the
u process, we could test this hypothesis by es-
timating (1) directly. Under the maintained
assumptions, we can replace the unobservable
ut with the OLS residual vt. Thus a regres-
sion of the OLS residuals on their own lagged
values,
v_t = κ + ρ v_{t-1} + ε_t,   t = 2, ..., n        (5)
will yield a t− test. That regression can be run
with or without an intercept, and the robust
option may be used to guard against violations
of the homoskedasticity assumption. It is only
an asymptotic test, though, and may not have
much power in small samples.
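A minimal Stata sketch of this residual-based test, assuming hypothetical variables y and x in a dataset that has been tsset, might be:

    * fit the model and recover the OLS residuals
    regress y x
    predict uhat, residuals
    * regress the residuals on their own lag; the t-test on L.uhat tests rho = 0
    regress uhat L.uhat, vce(robust)

The vce(robust) option guards against violations of homoskedasticity, as noted above.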
A very common strategy in considering the
possibility of AR(1) errors is the Durbin-Watson
test, which is also based on the OLS residuals:
DW = Σ_{t=2}^n ( v_t − v_{t-1} )^2 / Σ_{t=1}^n v_t^2        (6)
Simple algebra shows that the DW statistic is
closely linked to the estimate of ρ from the
large-sample test:
DW ≈ 2 (1 − ρ̂)        (7)
ρ̂ ≈ 1 − DW / 2
The relationship is not exact because of the
difference between (n−1) terms in the numer-
ator and n terms in the denominator of the
DW test. The difficulty with the DW test is
that the critical values must be evaluated from
a table, since they depend on both the number
of regressors (k) and the sample size (n), and
are not unique: for a given level of confidence,
the table contains two values, dL and dU . If
the computed value falls below dL, the null is
clearly rejected. If it falls above dU , there is
no cause for rejection. But in the intervening
region, the test is inconclusive. The test can-
not be used on a model without a constant
term, and it is not appropriate if there are any
lagged dependent variables. You may perform
the test in Stata, after a regression, using the
estat dwatson command.
In the presence of one or more lagged de-
pendent variables, an alternative statistic may
be used: Durbin’s h statistic, which merely
amounts to augmenting (5) with the explana-
tory variables from the original regression. This
test statistic may readily be calculated in Stata
with the estat durbinalt command.
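For example, a brief sketch of both tests (hypothetical variables y and x, data already tsset):

    regress y x
    estat dwatson        // Durbin-Watson d, valid only with strictly exogenous regressors
    regress y L.y x
    estat durbinalt      // Durbin's alternative test, usable with a lagged dependent variable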
Testing for higher-order serial correlation
One of the disadvantages of tests for AR(1)
errors is that they consider precisely that al-
ternative hypothesis. In many cases, if there
is serial correlation in the error structure, it
may manifest itself in a more complex relation-
ship, involving higher-order autocorrelations;
e.g. AR(p). A logical extension to the test de-
scribed in (5) and the Durbin “h” test is the
Breusch-Godfrey test, which considers the
null of nonautocorrelated errors against an al-
ternative that they are AR(p). This can readily
be performed by regressing the OLS residu-
als on p lagged values, as well as the regres-
sors from the original model. The test is the
joint null hypothesis that those p coefficients
are all zero, which can be considered as an-
other nR2 Lagrange multiplier (LM) statistic,
analogous to White’s test for heteroskedastic-
ity. The test may easily be performed in Stata
using the estat bgodfrey command. You must
specify the lag order p to indicate the degree
of autocorrelation to be considered. If p = 1,
the test is essentially Durbin’s “h” statistic.
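A sketch of the Breusch-Godfrey test in Stata, assuming quarterly data and hypothetical variables y and x, and allowing for up to fourth-order autocorrelation:

    regress y x
    estat bgodfrey, lags(4)    // LM test of H0: no autocorrelation up to order 4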
An even more general test often employed on
time series regression models is the Box-Pierce
or Ljung-Box Q statistic, or “portmanteau
test,” which has the null hypothesis that the
error process is “white noise,” or nonautocor-
related, versus the alternative that it is not
well behaved. The “Q” test evaluates the au-
tocorrelation function of the errors, and in that
sense is closely related to the Breusch-Godfrey
test. That test evaluates the conditional au-
tocorrelations of the residual series, whereas
the “Q” statistic uses the unconditional auto-
correlations. The “Q” test can be applied to
any time series as a test for “white noise,” or
randomness. For that reason, it is available
in Stata as the command wntestq. This test
is often reported in empirical papers as an in-
dication that the regression models presented
therein are reasonably specified.
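As an illustration (a sketch with hypothetical variables), the Q test may be applied to the residuals of a fitted model, or indeed to any time series suspected of autocorrelation:

    regress y x
    predict uhat, residuals
    wntestq uhat, lags(12)     // portmanteau (Q) test that uhat is white noise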
Any of these tests may be used to evaluate the
hypothesis that the errors exhibit serial correla-
tion, or nonindependence. But caution should
be exercised when their null hypotheses are re-
jected. It is very straightforward to demon-
strate that serial correlation may be induced by
simple misspecification of the equation–for in-
stance, modeling a relationship as linear when
it is curvilinear, or when it represents expo-
nential growth. Many time series models are
misspecified in terms of inadequate dynam-
ics: that is, the relationship between y and
the regressors may involve many lags of the
regressors. If those lags are mistakenly omit-
ted, the equation suffers from misspecification
bias, and the regression residuals will reflect
the missing terms. In this context, a visual in-
spection of the residuals is often useful. User-
written Stata routines such as tsgraph, sparl
and particularly ofrtplot should be employed
to better understand the dynamics of the re-
gression function. Each may be located and
installed with Stata’s ssc command, and each
is well documented with on–line help.
Correcting for serial correlation with strictly
exogenous regressors
Since we recognize that OLS cannot provide
consistent interval estimates in the presence
of autocorrelated errors, how should we pro-
ceed? If we have strictly exogenous regressors
(in particular, no lagged dependent variables),
we may be able to obtain an appropriate esti-
mator through transformation of the model. If
the errors follow the AR(1) process in (1), we
determine that V ar(ut) = σ2e /(1− ρ2
). Con-
sider a simple y on x regression with auto-
correlated errors following an AR(1) process.
Then simple algebra will show that the quasi-
differenced equation

( y_t − ρ y_{t-1} ) = (1 − ρ) β_0 + β_1 ( x_t − ρ x_{t-1} ) + ( u_t − ρ u_{t-1} )        (8)
will have nonautocorrelated errors, since the
error term in this equation is in fact et, by
assumption well behaved. This transforma-
tion can only be applied to observations 2, ..., n,
but we can write down the first observation in
static terms to complete that, plugging in a
zero value for the time-zero value of u. This ex-
tends to any number of explanatory variables,
as long as they are strictly exogenous; we just
quasi-difference each, and use the quasi-differenced
version in an OLS regression.
But how can we employ this strategy when
we do not know the value of ρ? It turns out
that the feasible generalized least squares
(GLS) estimator of this model merely replaces ρ with a consistent estimate, ρ̂. The resulting model is asymptotically appropriate, even if it lacks small sample properties. We can derive an estimate of ρ from the OLS residuals, or from the calculated value of the Durbin-Watson statistic on those residuals. Most commonly, if this technique is employed, we use an algorithm that implements an iterative scheme, revising the estimate of ρ in a number of steps to derive the final results. One common methodology is the Prais-Winsten estimator, which makes use of the first observation, transforming it separately. It may be used in Stata via the prais command. That same command may also be used to employ the Cochrane-Orcutt estimator, a similar iterative technique that ignores the first observation. (In a large sample, it will not matter if one observation is lost.) This estimator can be executed using the corc option of the prais command.
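A sketch of both estimators, again with hypothetical variables y and x in a tsset dataset:

    prais y x            // Prais-Winsten FGLS, iterating on the estimate of rho
    prais y x, corc      // Cochrane-Orcutt variant, dropping the first observation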
We do not expect these estimators to provide
the same point estimates as OLS, as they are
working with a fundamentally different model.
If they provide similar point estimates, the FGLS
estimator is to be preferred, since its standard
errors are consistent. However, in the presence
of lagged dependent variables, more compli-
cated estimation techniques are required.
An aside on first differencing. An alternative
to employing the feasible GLS estimator, in
which a value of ρ inside the unit circle is esti-
mated and used to transform the data, would
be to first difference the data: that is, trans-
form the left and right hand side variables into
differences. This would indeed be the proper
procedure to follow if it was suspected that
the variables possessed a unit root in their
time series representation. But if the value of
ρ in (1) is strictly less than 1 in absolute value,
first differencing only approximates the appropriate quasi-difference, since differencing is equivalent to imposing ρ = 1 on
the error process. If the process’s ρ is quite dif-
ferent from 1, first differencing is not as good
a solution as applying the FGLS estimator.
Also note that if you difference a standard re-
gression equation in y, x1, x2... you derive an
equation that does not have a constant term.
A constant term in an equation in differences
corresponds to a linear trend in the levels equa-
tion. Unless the levels equation already con-
tains a linear trend, applying differences to that
equation should result in a model without a
constant term.
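A sketch of the first-difference alternative, using Stata's time series operators on hypothetical variables y, x1 and x2:

    regress D.y D.x1 D.x2, noconstant
    * include a constant in the differenced equation only if the levels
    * equation is thought to contain a linear trend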
Robust inference in the presence of auto-
correlation
Just as we utilized the “White” heteroskedasticity-
consistent standard errors to deal with het-
eroskedasticity of unknown form, we may gen-
erate estimates of the standard errors that are
robust to both heteroskedasticity and auto-
correlation. Why would we want to do this
rather than explicitly take account of the au-
tocorrelated errors via the feasible generalized
least squares estimator described earlier? If we
doubt that the explanatory variables may be
considered strictly exogenous, then the FGLS
estimates will not even be consistent, let alone
efficient. Also, FGLS is usually implemented
in the context of an AR(1) model, since it is
much more complex to apply it to a more com-
plex AR structure. But higher-order autocor-
relation in the errors may be quite plausible.
Robust methods may take account of that be-
havior.
The methodology to compute what are often
termed heteroskedasticity- and autocorrelation-
consistent (HAC) standard errors was devel-
oped by Newey and West; thus they are of-
ten referred to as Newey-West standard er-
rors. Unlike the White standard errors, which
require no judgment, the Newey-West stan-
dard errors must be calculated conditional on
a choice of maximum lag. They are calculated
from a distributed lag of the OLS residuals,
and one must specify the longest lag at which
autocovariances are to be computed. Normally
a lag length exceeding the periodicity of the
data will suffice; e.g. at least 4 for quar-
terly data, 12 for monthly data, etc. The
Newey-West (HAC) standard errors may be
readily calculated for any OLS regression using
Stata’s newey command. You must provide the
“option” lag( ), which specifies the maximum
lag order, and your data must be tsset (that is,
known to Stata as time series data). Since the
Newey-West formula involves an expression in
the squares of the residuals which is identical
to White’s formula (as well as a second term
in the cross-products of the residuals), these
robust estimates subsume White’s correction.
Newey-West standard errors in a time series
context are robust to both arbitrary autocor-
relation (up to the order of the chosen lag) as
well as arbitrary heteroskedasticity.
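For instance, with hypothetical quarterly variables y, x1 and x2 in a dataset that has been tsset, Newey-West standard errors with a maximum lag of 4 could be obtained as:

    newey y x1 x2, lag(4)    // HAC standard errors, autocovariances computed up to lag 4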
Heteroskedasticity in the time series con-
text
Heteroskedasticity can also occur in time se-
ries regression models; its presence, while not
causing bias nor inconsistency in the point es-
timates, has the usual effect of invalidating the
standard errors, t−statistics, and F−statistics,
just as in the cross–sectional case. Since the
Newey–West standard error formula subsumes
the White (robust) standard error component,
if the Newey–West standard errors are com-
puted, they will also be robust to arbitrary de-
partures from homoskedasticity. However, the
standard tests for heteroskedasticity assume
independence of the errors, so if the errors are
serially correlated, those tests will not generally
be correct. It thus makes sense to test for se-
rial correlation first (using a heteroskedasticity–
robust test if it is suspected), correct for se-
rial correlation, and then apply a test for het-
eroskedasticity.
In the time series context, it may be quite plau-
sible that if heteroskedasticity—that is, vari-
ations in volatility in a time series process—
exists, it may itself follow an autoregressive
pattern. This can be termed a dynamic form
of heteroskedasticity, in which Engle’s ARCH
(autoregressive conditional heteroskedasticity)
model applies. The simplest ARCH model may
be written as:
y_t = β_0 + β_1 z_t + u_t

E( u_t^2 | u_{t-1}, u_{t-2}, ... ) = E( u_t^2 | u_{t-1} ) = α_0 + α_1 u_{t-1}^2

The second line is the conditional variance of u_t given that series’ past history, assuming that
the u process is serially uncorrelated. Since
conditional variances must be positive, this only
makes sense if α0 > 0 and α1 ≥ 0. We can
rewrite the second line as:
u_t^2 = α_0 + α_1 u_{t-1}^2 + υ_t
which then appears as an autoregressive model
in the squared errors, with stability condition
α1 < 1. When α1 > 0, the squared errors con-
tain positive serial correlation, even though the
errors themselves do not.
If this sort of process is evident in the regres-
sion errors, what are the consequences? First
of all, OLS is still BLUE. There are no as-
sumptions on the conditional variance of the
error process that would invalidate the use of
OLS in this context. But we may want to
explicitly model the conditional variance of the
error process, since in many financial series the
movements of volatility are of key importance
(for instance, option pricing via the standard
Black–Scholes formula requires an estimate of
the volatility of the underlying asset’s returns,
which may well be time–varying).
Estimation of ARCH models—of which there
are now many flavors, with the most common
extension being Bollerslev’s GARCH (gener-
alised ARCH)—may be performed via Stata’s
arch command. Tests for ARCH, which are
based on the squared residuals from an OLS re-
gression, are provided by Stata’s estat archlm
command.
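A sketch of both steps, using hypothetical variables y and z in a tsset dataset:

    regress y z
    estat archlm, lags(1)        // LM test for first-order ARCH in the OLS residuals
    arch y z, arch(1)            // maximum likelihood estimation of the ARCH(1) model
    arch y z, arch(1) garch(1)   // Bollerslev's GARCH(1,1) extension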
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 15: Instrumental variables and two
stage least squares
Many economic models involve endogeneity:
that is, a theoretical relationship does not fit
into the framework of y-on-X regression, in
which we can assume that the y variable is de-
termined by (but does not jointly determine)
X. Indeed, the simplest analytical concepts we
teach in principles of economics—a demand
curve in micro, and the Keynesian consump-
tion function in macro—are relations of this
sort, where at least one of the “explanatory”
variables is endogenous, or jointly determined
with the “dependent” variable. From a math-
ematical standpoint, the difficulties that this
endogeneity causes for econometric analysis are identical to those which we have already considered in two contexts: that of omitted variables, and that of errors-in-variables, or measurement error in the X variables. In each of these three cases, OLS is not capable of delivering consistent parameter estimates. We now turn to a general solution to the problem of endogenous regressors, which as we will see can also be profitably applied in other contexts in which the omitted variable (or poorly measured variable) can be taken into account. The general concept is that of the instrumental variables estimator; a popular form of that estimator, often employed in the context of endogeneity, is known as two-stage least squares (2SLS).
To motivate the problem, let us consider the omitted-variable problem: for instance, a wage equation, which would be correctly specified as:
log(wage) = β0 + β1educ+ β2abil + e (1)
This equation cannot be estimated, because
ability (abil) is not observed. If we had a proxy
variable available, we could substitute it for
abil; the quality of that equation would then
depend on the degree to which it was a good
proxy. If we merely ignore abil, it becomes part
of the error term in the specification:
log(wage) = β0 + β1educ+ u (2)
If abil and educ are correlated, OLS will yield
biased and inconsistent estimates. To consis-
tently estimate this equation, we must find an
instrumental variable: a new variable that
satisfies certain properties. Imagine that vari-
able z is uncorrelated with u, but is correlated
with educ. A variable that meets those two
conditions is an instrumental variable for educ.
We cannot directly test the former assumption,
since we cannot observe u; but we can readily
test the latter assumption, and should do so,
by merely regressing the included explanatory
variable on the instrument:
educ = π0 + π1z + υ (3)
In this regression, we should easily reject H0 :
π1 = 0. It should be clear that there is no
unique choice of an instrument in this situa-
tion; many potential variables could meet these
two conditions, of being uncorrelated with the
unobservable factors influencing the wage (in-
cluding abil) and correlated with educ. Note
that in this context we are not searching for a
proxy variable for abil; if we had a good proxy
for abil, it would not make a satisfactory instru-
mental variable, since correlation with abil im-
plies correlation with the error process u. What
might serve in this context? Perhaps some-
thing like the mother’s level of education, or
the number of siblings, would make a sensible
instrument. If we determine that we have a
reasonable instrument, how may it be used?
Return to the misspecified equation (2), and
write it in general terms of y and x :
y = β0 + β1x+ u (4)
If we now take the covariance of each term in
the equation with our instrument z, we find:
Cov(y, z) = β1Cov(x, z) + Cov(u, z) (5)
We have made use of the fact that the covari-
ance with a constant is zero. Since by assump-
tion the instrument is uncorrelated with the
error process u, the last term has expectation
zero, and we may solve (5) for our estimate of
β1 :
b_1 = Cov(y, z) / Cov(x, z) = [ Σ (y_i − ȳ)(z_i − z̄) ] / [ Σ (x_i − x̄)(z_i − z̄) ]        (6)
Note that this estimator has an interesting spe-
cial case where x = z : that is, where an ex-
planatory variable may serve as its own instru-
ment, which would be appropriate if Cov(x, u) =
0. In that case, this estimator may be seen to be the OLS estimator of β1. Thus, we may consider OLS as a special case of IV, usable when the assumption of exogeneity of the x variable(s) may be made. We may also note that the IV estimator is consistent, as long as the two key assumptions about the instrument's properties are satisfied. The IV estimator is not an unbiased estimator, though, and in small samples its bias may be substantial.
Inference with the IV estimator
To carry out inference–compute interval estimates and hypothesis tests–we assume that the error process is homoskedastic: in this case, conditional on the instrumental variable z, not the included explanatory variable x. With this additional assumption, we may derive the asymptotic variance of the IV estimator as:

Var(b_1) = σ^2 / ( SST_x ρ^2_{xz} )        (7)

where n is the sample size, SST_x is the total sum of squares of the explanatory variable, and ρ^2_{xz} is the R^2 (or squared correlation) in a regression of x on z: that is, equation (3). This quantity can be consistently estimated; σ^2 from the regression residuals, just as with OLS. Notice that as the correlation between the explanatory variable x and the instrument z increases, ceteris paribus, the sampling variance of b_1 decreases. Thus, an instrumental variables estimate generated from a “better” instrument will be more precise (conditional, of course, on the instrument having zero correlation with u). Note as well that this estimated variance must exceed that of the OLS estimator of b_1, since 0 ≤ ρ^2_{xz} ≤ 1. In the case where an explanatory variable may serve as its own instrument, the squared correlation is unity. The IV estimator will always have a larger asymptotic variance than will the OLS estimator, but that merely reflects the introduction of an additional source of uncertainty (in the form of
the instrument, imperfectly correlated with the
explanatory variable).
What will happen if we use the instrumental variables estimator with a “poor” or “weak” instrument? A weak correlation between x and z can produce sizable bias in the estimator. If there is any
correlation between z and u, a weak correla-
tion between x and z will render IV estimates
inconsistent. Although we cannot observe the
correlation between z and u, we can empirically
evaluate the correlation between the explana-
tory variable and its instrument, and should
always do so.
It should also be noted that an R2 measure in
the context of the IV estimator is not the “per-
centage of variation explained” measure that
we are familiar with in OLS terms. In the pres-
ence of correlation between x and u, we can no
longer decompose the variation in y into two
independent components, SSE and SSR, and
R2 has no natural interpretation. In the OLS
context, a joint hypothesis test can be writ-
ten in terms of R2 measures; that cannot be
done in the IV context. Just as the asymp-
totic variance of an IV estimator exceeds that
of OLS, the R2 measure from IV will never
beat that which may be calculated from OLS.
If we wanted to maximize R2, we would just
use OLS; but when OLS is biased and incon-
sistent, we seek an estimation technique that
will focus on providing consistent estimates of
the regression parameters, and not mechani-
cally find the “least squares” solution in terms
of inconsistent parameter estimates.
IV estimates in the multiple regression con-
text
The instrumental variables technique illustrated
above can readily be extended to the case of
multiple regression. To introduce some nota-
tion, consider a structural equation:
y1 = β0 + β1y2 + β2z1 + u1 (8)
where we have suppressed the observation sub-
scripts. The y variables are endogenous; the
z variable is exogenous. The endogenous na-
ture of y2 implies that if this equation is esti-
mated by OLS, the point estimates will be bi-
ased and inconsistent, since the error term will
be correlated with y2. We need an instrument
for y2 : a variable that is correlated with y2,
but not correlated with u. Let us write the en-
dogenous explanatory variable in terms of the
exogenous variables, including the instrument
z2 :
y2 = π0 + π1z1 + π2z2 + v (9)
The key identification condition is that π2 ≠ 0;
that is, after partialling out z1, y2 and z2 are
still meaningfully correlated. This can readily
be tested by estimating the auxiliary regres-
sion (9). We cannot test the other crucial as-
sumption: that in this context, cov(z2, v) = 0.
Given the satisfaction of these assumptions,
we may then derive the instrumental variables
estimator of (8) by writing down the “normal
equations” for the least squares problem, and
solving them for the point estimates. In this
context, z1 serves as an instrument for itself.
We can extend this logic to include any number
of additional exogenous variables in the equa-
tion; the condition that the analogue to (9)
must have π2 ≠ 0 always applies. Likewise,
we could imagine an equation with additional
endogenous variables; for each additional en-
dogenous variable on the right hand side, we
would have to find another appropriate instru-
ment, which would have to meet the two con-
ditions specified above.
Two stage least squares (2SLS)
What if we have a single endogenous explanatory variable, as in equation (8), but have more than one potential instrument? There might be several variables available, each of which would have a significant coefficient in an equation like (9), and could be considered uncorrelated with u. Depending on which of the potential instruments we employ, we will derive different IV estimates, with differing degrees of precision. This is not a very attractive possibility, since it suggests that depending on how we implement the IV estimator, we might reach different qualitative conclusions about the structural model. The technique of two-stage least squares (2SLS) has been developed to deal with this problem. How might we combine several instruments to produce the single instrument needed to implement IV for equation (8)? Naturally, by running a regression–in this case, an auxiliary regression of the form of equation (9), with all of
the available instruments included as explana-
tory variables. The predicted values from that regression, ŷ2, will serve as the instrument for y2, and this auxiliary regression is the “first stage” of 2SLS. In the “second stage,” we use the IV estimator, making use of the generated instrument ŷ2. The IV estimator we
developed above can be shown, algebraically,
to be a 2SLS estimator; but although the IV
estimator becomes non-unique in the presence
of multiple instruments, the 2SLS estimation
technique will always yield a unique set of pa-
rameter values for a given instrument list.
Although from a pedagogical standpoint we
speak of the two stages, we should not actually
perform 2SLS “by hand.” Why? Because the
second stage will yield the “wrong” residuals
(being computed from the instruments rather
than the original variables), which implies that
all statistics computed from those residuals will
be incorrect (the estimate s2, the estimated
standard errors of the parameters, etc.) We
should make use of a computer program that
has a command to perform 2SLS (or, as some
programs term it, instrumental variables). In
Stata, you use the ivregress command to per-
form either IV or 2SLS estimation. The syntax
of ivregress is:
ivregress 2sls depvar [varlist1] (varlist2 = varlist_iv)
where depvar is the dependent variable; varlist1,
which may not be present, is the list of in-
cluded exogenous variables (such as z1 in equation (8)); varlist2 contains the included endogenous variables (such as y2 in equation (8)); and varlist_iv contains the list of instruments
that are not included in the equation, but will
be used to form the instrumental variables es-
timator. If we wanted to estimate equation
(8) with Stata, we would give the command
ivregress 2sls y1 z1 (y2 = z2). If we had ad-
ditional exogenous variables in the equation,
they would follow z1. If we had additional in-
struments (and were thus performing 2SLS),
we would list them after z2.
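Returning to the wage example, a sketch of the command (with hypothetical dataset variables lwage, educ, motheduc and sibs, where mother's education and number of siblings play the role of the instruments suggested earlier) would be:

    ivregress 2sls lwage (educ = motheduc sibs), first
    * the first option reports the first-stage regression, so the partial
    * correlation between educ and the instruments can be inspected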
The 2SLS estimator may be applied to a much
more complex model, in which there are mul-
tiple endogenous explanatory variables (which
would be listed after y2 in the command), as
well as any number of instruments and included
exogenous variables. The constraint that must
always be satisfied is related to the parenthe-
sized lists: the order condition for identifi-
cation. Intuitively, it states that for each in-
cluded endogenous variable (e.g. y2), we must
have at least one instrument—that is, one ex-
ogenous variable that does not itself appear in
the equation, or satisfies an exclusion restric-
tion. If there are three included endogenous
variables, then we must have no fewer than
three instruments after the equals sign, or the
equation will not be identified. That is, it will
not be possible to solve for a unique solution
in terms of the instrumental variables estima-
tor. In the case (such as the example above)
where the number of included endogenous vari-
ables exactly equals the number of excluded
exogenous variables, we satisfy the order con-
dition with equality, and the standard IV es-
timator will yield a solution. Where we have
more instruments than needed, we satisfy the
order condition with inequality, and the 2SLS
form of the estimator must be used to derive
unique estimates, since we have more equa-
tions than unknowns: the equation is overi-
dentified. If we have fewer instruments than
needed, we fail the order condition, since there
are more unknowns than equations. No econo-
metric technique can solve this problem of un-
deridentification. There are additional condi-
tions for identification—the order condition is
necessary, but not sufficient—as it must also
be the case that the excluded instruments have a nonzero partial correlation with the included endogenous explanatory variable.
This would fail, for instance, if one of our can-
didate instruments was actually a linear com-
bination of the included exogenous variables.
IV and errors-in-variables
The instrumental variables estimator can also
be used fruitfully to deal with the errors-in-
variables model discussed earlier–not surpris-
ingly, since the econometric difficulties caused
by errors-in-variables are mathematically the
same problem as that of an endogenous ex-
planatory variable. To deal with errors-in-variables,
we need an instrument for the mismeasured x
variable that satisfies the usual assumptions:
being well correlated with x, but not corre-
lated with the error process. If we could find a
second measurement of x–even one also subject to measurement error–we could use it as an instrument, since it would presumably be well correlated with x itself, but if generated by an independent measurement process, uncorrelated with the original x's measurement error. Thus, we might conduct a household survey which inquires about disposable income, consumption, and saving. The respondents' answers about their saving last year might well be mismeasured, since it is much harder to track saving than, say, earned income. The same could be said for their estimates of how much they spent on various categories of consumption. But using income and consumption data, we could derive a second (mismeasured) estimate of saving, and use it as an instrument to mitigate the problems of measurement error in the direct estimate.
IV may also be used to solve proxy problems; imagine that we are regressing log(wage) on
education and experience, using a theoretical
model that suggests that “ability” should ap-
pear as a regressor. Since we do not have a
measure of ability, we use a test score as a
proxy variable. That may introduce a prob-
lem, though, since the measurement error in
the relation of test score to ability will cause
the test score to be correlated with the error
term. This might be dealt with if we had a
second test score measure–on a different apti-
tude test–which could then be used as an in-
strument. The two test scores are likely to be
correlated, and the measurement error in the
first (the degree that it fails to measure abil-
ity) should not be correlated with the second
score.
Tests for endogeneity and overidentifying
restrictions
Since the use of IV will necessarily inflate the
variances of the estimators, and weaken our
ability to make inferences from our estimates,
we might be concerned about the need to ap-
ply IV (or 2SLS) in a particular equation. One
form of a test for endogeneity can be readily
performed in this context. Imagine that we
have the equation:
y1 = β0 + β1y2 + β2z1 + β3z2 + u1 (10)
where y2 is the single endogenous explanatory
variable, and the z′s are included exogenous
variables. Imagine that the equation is overi-
dentified for IV: that is, we have at least two
instruments (in this case, z3 and z4) which
could be used to estimate (10) via 2SLS. If
we performed 2SLS, we would be estimating
the following reduced form equation in the
“first stage”:
y2 = π0 + π1z1 + π2z2 + π3z3 + π4z4 + v (11)
which would allow us to compute OLS residuals, v̂. Those residuals will be that part of y2 not correlated with the z's. If there is a problem of endogeneity of y2 in equation (10), it will occur because cov(v, u1) ≠ 0. We cannot observe v, but we can calculate a consistent estimate of v as v̂. Including v̂ as an additional regressor in the OLS model

y1 = β0 + β1 y2 + β2 z1 + β3 z2 + δ v̂ + ω        (12)
and testing for the significance of δ will give
us the answer. If cov(v, u1) = 0, our estimate
of δ should not be significantly different from
zero. If that is the case, then there is no ev-
idence that y2 is endogenous in the original
equation, and OLS may be applied. If we reject
the hypothesis that δ = 0, we should not rely
on OLS, but should rather use IV (or 2SLS).
This test may also be generalized for the pres-
ence of multiple included endogenous variables
in (10); the relevant test is then an F−test,
jointly testing that a set of δ coefficients are
all zero. This test is available within Stata as
the estat endog command following ivregress.
Although we can never directly test the maintained hypothesis that the instruments are uncorrelated with the error process u, we can derive indirect evidence on the suitability of the instruments if we have an excess of instruments: that is, if the equation is overidentified, so that we are using 2SLS. The ivregress residuals may be regressed on all exogenous variables (included exogenous variables plus instruments). Under the null hypothesis that all IVs are uncorrelated with u, a Lagrange multiplier statistic of the nR2 form will not exceed the critical point on a χ2(r) distribution, where r is the number of overidentifying restrictions (i.e. the number of excess instruments). If we reject this hypothesis, then we cast doubt on the suitability of the instruments; at least some of them do not appear to be satisfying the condition of orthogonality with the error process. This test is available within Stata as the estat overid command following ivregress.
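Using the notation of equations (10) and (11), a sketch of both tests (with hypothetical dataset variables named to match) would be:

    ivregress 2sls y1 z1 z2 (y2 = z3 z4)
    estat endog      // test of the null that y2 is exogenous
    estat overid     // test of the overidentifying restrictions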
Applying 2SLS in a time series context
When there are concerns of included endoge-
nous variables in a model fit to time series
data, we have a natural source of instruments
in terms of predetermined variables. For in-
stance, if y2t is an explanatory variable, its own
lagged values, y2t−1 or y2t−2, might be used as
instruments: they are likely to be correlated
with y2t, and they will not be correlated with
the error term at time t, since they were gen-
erated at an earlier point in time. The one
caveat that must be raised in this context re-
lates to autocorrelated errors: if the errors are
themselves autocorrelated, then the presumed
exogeneity of predetermined variables will be in
doubt. Tests for autocorrelated errors should
be conducted; in the presence of autocorrela-
tion, more distant lags might be used to miti-
gate this concern.
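A sketch of this approach, with a hypothetical time variable t, dependent variable y, included exogenous variable x and endogenous regressor y2:

    tsset t
    ivregress 2sls y x (y2 = L.y2 L2.y2)
    * the lags of y2 are valid instruments only if the error process
    * is not itself autocorrelated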
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 16: Simultaneous equations mod-
els
An obvious reason for the endogeneity of ex-
planatory variables in a regression model is si-
multaneity: that is, one or more of the “ex-
planatory” variables are jointly determined with
the “dependent” variable. Models of this sort
are known as simultaneous equations mod-
els (SEMs), and they are widely utilized in
both applied microeconomics and macroeco-
nomics. Each equation in a SEM should be a
behavioral equation which describes how one
or more economic agents will react to shocks
or shifts in the exogenous explanatory vari-
ables, ceteris paribus. The simultaneously de-
termined variables often have an equilibrium
interpretation, and we consider that these vari-
ables are only observed when the underlying
model is in equilibrium. For instance, a de-
mand curve relating the quantity demanded to
the price of a good, as well as income, the
prices of substitute commodities, etc. concep-
tually would express that quantity for a range
of prices. But the only price-quantity pair that
we observe is that resulting from market clear-
ing, where the quantities supplied and demanded
were matched, and an equilibrium price was
struck. In the context of labor supply, we
might relate aggregate hours to the average
wage and additional explanatory factors:
hi = β0 + β1wi + β2z1 + ui (1)
where the unit of observation might be the
county. This is a structural equation, or be-
havioral equation, relating labor supply to its
causal factors: that is, it reflects the structure
of the supply side of the labor market. This
equation resembles many that we have consid-
ered earlier, and we might wonder why there
would be any difficulty in estimating it. But
if the data relate to an aggregate–such as the
hours worked at the county level, in response
to the average wage in the county–this equa-
tion poses problems that would not arise if, for
instance, the unit of observation was the indi-
vidual, derived from a survey. Although we can
assume that the individual is a price- (or wage-)
taker, we cannot assume that the average level
of wages is exogenous to the labor market in
Suffolk County. Rather, we must consider that
it is determined within the market, affected by
broader economic conditions. We might con-
sider that the z variable expresses wage levels
in other areas, which would cet.par. have an
effect on the supply of labor in Suffolk County;
higher wages in Middlesex County would lead
to a reduction in labor supply in the Suffolk
County labor market, cet. par.
To complete the model, we must add a speci-
fication of labor demand:
hi = γ0 + γ1wi + γ2z2 + υi (2)
where we model the quantity demanded of la-
bor as a function of the average wage and ad-
ditional factors that might shift the demand
curve. Since the demand for labor is a de-
rived demand, dependent on the cost of other
factors of production, we might include some
measure of factor cost (e.g. the cost of capi-
tal) as this equation’s z variable. In this case,
we would expect that a higher cost of capital
would trigger substitution of labor for capital
at every level of the wage, so that γ2 > 0. Note
that the supply equation represents the behav-
ior of workers in the aggregate, while the de-
mand equation represents the behavior of em-
ployers in the aggregate. In equilibrium, we
would equate these two equations, and expect that at some equilibrium level of labor utilization and average wage the labor market clears. These two equations then constitute a simultaneous equations model (SEM) of the labor market.

Neither of these equations may be consistently estimated via OLS, since the wage variable in each equation is correlated with the respective error term. How do we know this? Because these two equations can be solved and rewritten as two reduced form equations in the endogenous variables hi and wi. Each of those variables will depend on the exogenous variables in the entire system–z1 and z2–as well as the structural errors ui and υi. In general, any shock to either labor demand or supply will affect both the equilibrium quantity and price (wage). Even if we rewrote one of these equations to place the wage variable on the left hand side, this problem would persist: both endogenous variables in the system are jointly determined by the exogenous variables and structural shocks. Another implication of this structure is that we must have separate explanatory
factors in the two equations. If z1 = z2, for in-
stance, we would not be able to solve this sys-
tem and uniquely identify its structural param-
eters. There must be factors that are unique
to each structural equation that, for instance,
shift the supply curve without shifting the de-
mand curve.
The implication here is that even if we only
care about one of these structural equations–
for instance, we are tasked with modelling la-
bor supply, and have no interest in working
with the demand side of the market–we must
be able to specify the other structural equa-
tions of the model. We need not estimate
them, but we must be able to determine what
measures they would contain. For instance,
consider estimating the relationship between
murder rate, number of police, and wealth for
a number of cities. We might expect that both
of those factors would reduce the murder rate,
cet.par.: more police are available to appre-
hend murderers, and perhaps prevent murders,
while we might expect that lower-income cities
might have greater unrest and crime. But can
we reasonably assume that the number of po-
lice (per capita) is exogenous to the murder
rate? Probably not, in the sense that cities
striving to reduce crime will spend more on po-
lice. Thus we might consider a second struc-
tural equation that expressed the number of
police per capita as a function of a number of
factors. We may have no interest in estimat-
ing this equation (which is behavioral, reflect-
ing the behavior of city officials), but if we are
to consistently estimate the former equation–
the behavioral equation reflecting the behavior
of murderers–we will have to specify the sec-
ond equation as well, and collect data for its
explanatory factors.
Simultaneity bias in OLS
What goes wrong if we use OLS to estimate a structural equation containing endogenous explanatory variables? Consider the structural system:

y1 = α1 y2 + β1 z1 + u1        (3)
y2 = α2 y1 + β2 z2 + u2

in which we are interested in estimating the first equation. Assume that the z variables are exogenous, in that each is uncorrelated with each of the error processes u. What is the correlation between y2 and u1? If we substitute the first equation into the second, we derive:

y2 = α2 (α1 y2 + β1 z1 + u1) + β2 z2 + u2
(1 − α2 α1) y2 = α2 β1 z1 + β2 z2 + α2 u1 + u2        (4)

If we assume that α2 α1 ≠ 1, we can derive the reduced form equation for y2 as:

y2 = π21 z1 + π22 z2 + υ2        (5)
where the reduced form error term υ2 = (α2 u1 + u2) / (1 − α2 α1). Thus y2 depends on u1, and estimation by
OLS of the first equation in (3) will not yield
consistent estimates. We can consistently es-
timate the reduced form equation (5) via OLS,
and that in fact is an essential part of the strat-
egy of the 2SLS estimator. But the parameters
of the structural equation are nonlinear trans-
formations of the reduced form parameters, so
being able to estimate the reduced form pa-
rameters does not achieve the goal of provid-
ing us with point and interval estimates of the
structural equation.
In this special case, we can evaluate the simul-
taneity bias that would result from improperly
applying OLS to the original structural equa-
tion. The covariance of y2 and u1 is equal to
the covariance of y2 and υ2:
=[α2/ (1− α2α1)E
(u2
1
)]= [α2/ (1− α2α1)]σ2
1 (6)
If we have some priors about the signs of the
α parameters, we may sign the bias. Generally,
it could be either positive or negative; that is,
the OLS coefficient estimate could be larger
or smaller than the correct estimate, but will
not be equal to the population parameter in
an expected sense unless the bracketed expres-
sion is zero. Note that this would happen if
α2 = 0 : that is, if y2 was not simultaneously
determined with y1. But in that case, we do not
have a simultaneous system; the model in that
case is said to be a recursive system, which
may be consistently estimated with OLS.
Identifying and estimating a structural equa-
tion
The tool that we will apply to consistently
estimate structural equations such as (3) is
one that we have seen before: two-stage least
squares (2SLS). The application of 2SLS in a
structural system is more straightforward than
the general application of instrumental vari-
ables estimators, since the specification of the
system makes clear what variables are available
as instruments. Let us first consider a slightly
different two-equation structural system:
q = α1p+ β1z1 + u1 (7)
q = α2p+ u2
We presume these equations describe the work-
ings of a market, and that the equilibrium con-
dition of market clearing has been imposed.
Let q be per capita milk consumption at the
county level, p be the average price of a gallon
of milk in that county, and let z1 be the price
of cattle feed. The first structural equation
is thus the supply equation, with α1 > 0 and
β1 < 0: that is, a higher cost of production
will generally reduce the quantity supplied at
the same price per gallon. The second equa-
tion is the demand equation, where we pre-
sume that α2 < 0, reflecting the slope of the
demand curve in the {p, q} plane. Given a ran-
dom sample on {p, q, z1}, what can we achieve?
The demand equation is said to be identified–
in fact, exactly identified–since one instru-
ment is needed, and precisely one is available.
z1 is available because the demand for milk
does not depend on the price of cattle feed, so
we take advantage of an exclusion restriction
that makes z1 available to identify the demand
curve. Intuitively, we can think of variations
in z1 shifting the supply curve up and down,
tracing out the demand curve; in doing so, it
makes it possible for us to estimate the struc-
tural parameters of the demand curve.
What about the supply curve? It, also, has
a problem of simultaneity bias, but it turns
out that the supply equation is unidentified.
Given the model as we have laid it out, there
is no variable available to serve as an instru-
ment for p : that is, we need a variable that
affects demand (and shifts the demand curve)
but does not directly affect supply. In this
case, no such variable is available, and we can-
not apply the instrumental variables technique
without an instrument. What if we went back
to the drawing board, and realized that the
price of orange juice should enter the demand
equation–although it tastes terrible on corn
flakes, orange juice might be a healthy substi-
tute for quenching one’s thirst? Then the sup-
ply curve would be identified–exactly identified–
since we now would have a single instrument
that served to shift demand but did not enter
the supply relation. What if we also consid-
ered the price of beer as an additional demand
factor? Then we would have two available in-
struments (presuming that each is appropri-
ately correlated), and 2SLS would be used to
“boil them down” into the single instrument
needed. In that case, we would say that the
supply curve would be overidentified.
The identification status of each structural equa-
tion thus hinges upon exclusion restrictions:
our a priori statements that certain variables
do not appear in certain structural equations.
If they do not appear in a structural equation,
they may be used as instruments to assist in
identifying the parameters of that equation.
For these variables to successfully identify the
parameters, they must have nonzero popula-
tion parameters in the equation in which they
are included. Consider an example:
hours = f1 (log(wage), educ, age, kl6, wifeY )
log(wage) = f2( hours, educ, xper, xper2 )        (8)
The first equation is a labor supply relation,
expressing the number of hours worked by a
married woman as a function of her wage, ed-
ucation, age, the number of preschool children,
and non-wage income (including spouses’s earn-
ings). The second equation is a labor demand
equation, expressing the wage to be paid as
a function of hours worked, the employee's education, and a polynomial in her work experience. The exclusion restrictions indicate that the demand for labor does not depend on the worker's age (nor should it!), the presence of preschool kids, or other resources available to the worker. Likewise, we assume that the woman's willingness to participate in the market does not depend on her labor market experience. One instrument is needed to identify each equation; xper and xper2 are available to identify the supply equation, while age, kl6 and wifeY are available to identify the demand equation. This is the order condition for identification, essentially counting instruments and variables to be instrumented; each equation is overidentified. But the order condition is only necessary; the sufficient condition is the rank condition, which essentially states that in the reduced-form equation:
log(wage) = g( educ, age, kl6, wifeY, xper, xper2 )        (9)
at least one of the population coefficients on
{xper, xper2} must be nonzero. But since we
can consistently estimate this equation with
OLS, we may generate sample estimates of
those coefficients, and test the joint null that
both coefficients are zero. If that null is re-
jected, then we satisfy the rank condition for
the first equation, and we may proceed to esti-
mate it via 2SLS. The equivalent condition for
the demand equation is that at least one of the
population coefficients {age, kl6, wifeY } in the
regression of hours on the system’s exogenous
variables is nonzero. If any of those variables
are significant in the equivalent reduced-form
equation, it may be used as an instrument to
estimate the demand equation via 2SLS.
The application of two-stage least squares (via
Stata’s ivregress 2sls command) involves iden-
tifying the endogenous explanatory variable(s),
the exogenous variables that are included in
each equation, and the instruments that are
excluded from each equation. To satisfy the
order condition, the list of (excluded) instru-
ments must be at least as long as the list of en-
dogenous explanatory variables. This logic car-
ries over to structural equation systems with
more than two endogenous variables / equa-
tions; a structural model may have any num-
ber of endogenous variables, each defined by
an equation, and we can proceed to evaluate
the identification status of each equation in
turn, given the appropriate exclusion restric-
tions. Note that if an equation is uniden-
tified, due to the lack of appropriate instru-
ments, then no econometric technique may be
used to estimate its parameters. In that case,
we do not have knowledge that would allow us
to “trace out” that equation’s slope while we
move along it.
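As a sketch, the two equations in (8) could each be estimated with ivregress, assuming dataset variables named hours, lwage (for log(wage)), educ, age, kl6, wifeY, xper and xpersq (for xper2):

    * supply equation: xper and xpersq serve as instruments for lwage
    ivregress 2sls hours educ age kl6 wifeY (lwage = xper xpersq)
    estat overid
    * demand equation: age, kl6 and wifeY serve as instruments for hours
    ivregress 2sls lwage educ xper xpersq (hours = age kl6 wifeY)
    estat overid

Each equation is overidentified, so the test of overidentifying restrictions is available after each estimation.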
Simultaneous equations models with time
series
One of the most common applications of 2SLS
in applied work is the estimation of structural
time series models. For instance, consider a
simple macro model:
Ct = β0 + β1 (Yt − Tt) + β2rt + u1t
It = γ0 + γ1rt + u2t
Yt = Ct + It +Gt (10)
In this system, aggregate consumption each
quarter is determined jointly with disposable
income. Even if we assume that taxes are ex-
ogenous (and in fact they are responsive to
income), the consumption function cannot be
consistently estimated via OLS. If the interest
rate is taken as exogenous (set, for instance,
by monetary policy makers) then the invest-
ment equation may be consistently estimated
via OLS. The third equation is an identity; it
need not be estimated, and holds without er-
ror, but its presence makes explicit the simul-
taneous nature of the model. If r is exoge-
nous, then we need one instrument to estimate
the consumption function; government spend-
ing will suffice, and consumption will be exactly
identified. If r is to be taken as endogenous,
we would have to add at least one equation
to the model to express how monetary pol-
icy reacts to economic conditions. We might
also make the investment function more re-
alistic by including dynamics–that investment
depends on lagged income, for instance, Yt−1
(firms make investment spending plans based
on the demand for their product). This would
allow Yt−1, a predetermined variable, to be
used as an additional instrument in estimation
of the consumption function. We may also
use lags of exogenous variables–for instance,
lagged taxes or government spending–as in-
struments in this context.
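A sketch of the consumption function estimation, assuming hypothetical quarterly series C, Y, T, r and G that have been tsset:

    generate Yd = Y - T              // disposable income
    ivregress 2sls C r (Yd = G)      // exactly identified: G instruments Yd
    ivregress 2sls C r (Yd = G L.Y)  // overidentified, adding lagged income as an instrument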
Although this only scratches the surface of a
broad set of issues relating to the estimation
of structural models with time series data, it
should be clear that those models will generally
require instrumental variables techniques such
as 2SLS for the consistent estimation of their
component relationships.
Wooldridge, Introductory Econometrics, 4th
ed.
Appendix C: Fundamentals of mathemati-
cal statistics
A short review of the principles of mathemati-
cal statistics. Econometrics is concerned with
statistical inference: learning about the char-
acteristics of a population from a sample of the
population. The population is a well-defined
group of subjects–and it is important to de-
fine the population of interest. Are we trying
to study the unemployment rate of all labor
force participants, or only teenaged workers, or
only AHANA workers? Given a population, we
may define an economic model that contains
parameters of interest–coefficients, or elastic-
ities, which express the effects of changes in
one variable upon another.
Let Y be a random variable (r.v.) representing
a population with probability density function
(pdf) f(y; θ), with θ a scalar parameter. We
assume that we know f, but do not know the
value of θ. Let a random sample from the pop-
ulation be (Y1, ..., YN) , with Yi being an inde-
pendent random variable drawn from f(y; θ).
We speak of Yi being iid – independently and
identically distributed. We often assume that
random samples are drawn from the Bernoulli distribution: for instance, if I pick a student randomly from my class list, what is the probability that she is female? That probability is γ, where a fraction γ of the students are female, so P(Yi = 1) = γ and P(Yi = 0) = (1 − γ). For
many other applications, we will assume that
samples are drawn from the Normal distribu-
tion. In that case, the pdf is characterized by
two parameters, µ and σ2, expressing the mean
and spread of the distribution, respectively.
Finite sample properties of estimators
The finite sample properties (as opposed to
asymptotic properties) apply to all sample sizes,
large or small. These are of great relevance
when we are dealing with samples of limited
size, and unable to conduct a survey to gener-
ate a larger sample. How well will estimators
perform in this context? First we must distin-
guish between estimators and estimates. An
estimator is a rule, or algorithm, that speci-
fies how the sample information should be ma-
nipulated in order to generate a numerical es-
timate. Estimators have properties–they may
be reliable in some sense to be defined; they
may be easy or difficult to calculate; that dif-
ficulty may itself be a function of sample size.
For instance, a test which involves measuring
the distances between every observation of a
variable involves an order of calculations which
grows more than linearly with sample size. An
estimator with which we are all familiar is the
sample average, or arithmetic mean, of N num-
bers: add them up and divide by N. That es-
timator has certain properties, and its applica-
tion to a sample produces an estimate. We
will often call this a point estimate–since it
yields a single number–as opposed to an inter-
val estimate, which produces a range of val-
ues associated with a particular level of confi-
dence. For instance, an election poll may state
that 55% are expected to vote for candidate
A, with a margin of error of ±4%. If we trust
those results, it is likely that candidate A will
win, with between 51% and 59% of the vote.
We are concerned with the sampling distribu-
tions of estimators–that is, how the estimates
they generate will vary when the estimator is
applied to repeated samples.
What are the finite-sample properties which we
might be able to establish for a given estimator
and its sampling distribution? First of all, we
are concerned with unbiasedness. An estimator W of θ is said to be unbiased if E(W) = θ for all possible values of θ. If an estimator is unbiased, then its probability distribution has an expected value equal to the population parameter it is estimating. Unbiasedness does not mean that a given estimate is equal to θ, or even very close to θ; it means that if we drew an infinite number of samples from the population and averaged the W estimates, we would obtain θ. An estimator that is biased exhibits Bias(W) = E(W) − θ. The magnitude of the bias will depend on the distribution of the Y and the function that transforms Y into W, that is, the estimator. In some cases we can demonstrate unbiasedness (or show that the bias is zero) regardless of the distribution of Y; for instance, consider the sample average Ȳ, which is an unbiased estimate of the population mean µ:
$$E(\bar{Y}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\,E\!\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\,n\mu = \mu$$
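To make unbiasedness concrete, the following small simulation (not part of the original notes; it assumes Python with NumPy is available, and the parameter values are arbitrary) draws many random samples and averages the resulting sample means:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 5.0, 2.0, 30, 100_000

# Draw `reps` independent samples of size n and compute each sample average.
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Averaging the estimates over repeated samples recovers mu (approximately),
# illustrating E(Ybar) = mu, even though any single estimate may miss mu.
print(sample_means.mean())   # close to 5.0
```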
Any hypothesis tests on the mean will require
an estimate of the variance, σ2, from a popu-
lation with mean µ. Since we do not know µ
(but must estimate it with Ȳ), the sample variance is defined as
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2$$
with one degree of freedom lost by the replacement of the population statistic µ with its sample estimate Ȳ. This is an unbiased estimate of
the population variance, whereas the counter-
part with a divisor of n will be biased unless we
know µ. Of course, the degree of this bias will
depend on the difference between $\frac{n}{n-1}$ and unity, which disappears as n → ∞.
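The role of the n − 1 divisor can be checked by simulation; this is a rough sketch under the same assumptions as above (Python with NumPy; arbitrary parameter values), not a derivation from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# ddof=1 divides by (n - 1): the unbiased estimator S^2.
s2_unbiased = samples.var(axis=1, ddof=1)
# ddof=0 divides by n: biased downward by the factor (n - 1)/n when mu is estimated.
s2_biased = samples.var(axis=1, ddof=0)

print(s2_unbiased.mean())  # approximately 4.0
print(s2_biased.mean())    # approximately 4.0 * (n - 1)/n = 3.6
```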
Two difficulties with unbiasedness as a crite-
rion for an estimator: some quite reasonable
estimators are unavoidably biased, but useful;
and more seriously, many unbiased estimators
are quite poor. For instance, picking the first
value in a sample as an estimate of the popula-
tion mean, and discarding the remaining (n−1)
values, yields an unbiased estimator of µ, since
E(Y1) = µ; but this is a very imprecise estima-
tor.
What additional information do we need to
evaluate estimators? We are concerned with
the precision of the estimator as well as its
bias. An unbiased estimator with a smaller
sampling variance will dominate its counter-
part with a larger sampling variance: e.g. we
can demonstrate that the estimator that uses
only the first observation to estimate µ has a
much larger sampling variance than the sample
average, for nontrivial n. What is the sampling
variance of the sample average?
$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\,\mathrm{Var}\!\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(Y_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$
where the variance of the sum equals the sum of the variances because the Yi are drawn independently,
so that the precision of the sample average de-
pends on the sample size, as well as the (un-
known) variance of the underlying distribution
of Y. Using the same logic, we can derive the
sampling variance of the “estimator” that uses
only the first observation of a sample as σ2.
Even for a sample of size 2, the sample mean
will be twice as precise.
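A quick simulation comparison (a sketch only, assuming Python with NumPy and arbitrary parameter values) shows how much more precise the sample average is than the first-observation "estimator":

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 3.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))

mean_est = samples.mean(axis=1)   # the sample average from each sample
first_obs = samples[:, 0]         # the "estimator" using only the first observation

# Both estimators are unbiased, but their sampling variances differ sharply.
print(mean_est.var())    # approximately sigma^2 / n = 0.36
print(first_obs.var())   # approximately sigma^2 = 9.0
```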
This leads us to the concept of efficiency:
given two unbiased estimators of θ, an estima-
tor W1 is efficient relative to W2 when Var(W1) ≤ Var(W2) ∀ θ, with strict inequality for at least
one θ. A relatively efficient unbiased estimator
dominates its less efficient counterpart. We
can compare two estimators, even if one or
both is biased, by comparing mean squared er-
ror (MSE), MSE(W) = E[(W − θ)²]. This ex-
pression can be shown to equal the variance of
the estimator plus the square of the bias; thus,
it equals the variance for an unbiased estima-
tor.
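The decomposition MSE = variance + squared bias can be verified numerically; the sketch below (Python with NumPy assumed, values arbitrary) uses the divisor-n variance estimator, which is biased, as the estimator W:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 500_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
w = samples.var(axis=1, ddof=0)   # biased estimator of sigma^2 (divisor n)

mse = np.mean((w - sigma2) ** 2)                      # E[(W - theta)^2]
var_plus_bias2 = w.var() + (w.mean() - sigma2) ** 2   # Var(W) + Bias(W)^2

print(mse, var_plus_bias2)   # the two quantities agree
```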
Large sample (asymptotic) properties of
estimators
We can compare estimators, and evaluate their
relative usefulness, by appealing to their large
sample properties–or asymptotic properties.
That is, how do they behave as sample size
goes to infinity? We see that the sample aver-
age has a sampling variance with limiting value
of zero as n → ∞. The first asymptotic prop-
erty is that of consistency. If Wn is an estimator of θ based on a sample [Y1, ..., Yn] of size n, Wn is said to be a consistent estimator of θ if, for
every ε > 0,
P (|Wn − θ| > ε)→ 0 as n→∞.
Intuitively, a consistent estimator becomes more
accurate as the sample size increases without
bound. If an estimator does not possess this
property, it is said to be inconsistent. In that
case, it does not matter how much data we
have; the “recipe” that tells us how to use the
data to estimate θ is flawed. If an estimator is
unbiased and its variance shrinks to zero as n → ∞, then the estimator is consistent (more generally, it suffices that both the bias and the variance vanish as n → ∞).
A consistent estimator has probability limit,
or plim, equal to the population parameter:
plim(Ȳ) = µ. Some mechanics of plims: let
θ be a parameter and g (·) a continuous func-
tion, so that γ = g(θ). Suppose plim(Wn) = θ,
and we devise an estimator of γ, Gn = g(Wn).
Then plim(Gn) = γ, or
plim g(Wn) = g (plim Wn) .
This allows us to establish the consistency of
estimators which can be shown to be transfor-
mations of other consistent estimators. For in-
stance, we can demonstrate that the estimator
given above of the population variance is not
only unbiased but consistent. The standard
deviation is the square root of the variance:
a nonlinear function, continuous for positive
arguments. Thus the standard deviation S is
a consistent estimator of the population stan-
dard deviation. Some additional properties of
plims, if plim(Tn) = α and plim(Un) = β :
plim (Tn + Un) = α+ β
plim (TnUn) = αβ
plim (Tn/Un) = α/β, provided β ≠ 0.
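As an informal check of these plim mechanics (a sketch assuming Python with NumPy; the population values are arbitrary), one can watch S² settle down to σ² as n grows, and its continuous transformation S = √(S²) settle down to σ:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 2.0   # population standard deviation is 2.0

# As n grows, S^2 approaches sigma^2 = 4, and S = sqrt(S^2) approaches sigma = 2,
# illustrating plim g(W_n) = g(plim W_n) for the continuous function g = sqrt.
for n in (10, 100, 10_000, 1_000_000):
    y = rng.normal(mu, sigma, size=n)
    s2 = y.var(ddof=1)
    print(n, s2, np.sqrt(s2))
```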
Consistency is a property of point estimators:
the distribution of the estimator collapses around
the population parameter in the limit, but that
says nothing about the shape of the distribu-
tion for a given sample size. To work with in-
terval estimators and hypothesis tests, we need
a way to approximate the distribution of the es-
timators. Most estimators used in economet-
rics have distributions that are reasonably ap-
proximated by the Normal distribution for large
samples, leading to the concept of asymptotic
normality:
P (Zn ≤ z)→ Φ (z) as n→∞
where Φ (·) is the standard normal cumulative
distribution function (cdf). We will often say
“Zn ∼ N(0,1)” or “Zn is asy N.” This relates to
one form of the central limit theorem (CLT).
If [Y1, ...Yn] is a random sample with mean µ
and variance σ2,
$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$$
has an asymptotic standard normal distribu-
tion. Regardless of the population distribu-
tion of Y, this standardized version of Y will
be asy N, and the entire distribution of Z will
become arbitrarily close to the standard nor-
mal as n → ∞. Since many of the estimators
we will derive in econometrics can be viewed as
sample averages, the law of large numbers and
the central limit theorem can be combined to
show that these estimators will be asy N. In-
deed, the above estimator will be asy N even
if we replace σ with a consistent estimator of
that parameter, S.
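The central limit theorem at work can be illustrated with a deliberately non-normal population; this is a sketch under the usual assumptions (Python with NumPy, arbitrary sample size), not an implementation from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 100_000

# A decidedly non-normal population: exponential with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
mu, sigma = 1.0, 1.0

z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# If the CLT is at work, z should behave like a standard normal variate:
print(z.mean(), z.std())           # near 0 and 1
print(np.mean(np.abs(z) <= 1.96))  # near 0.95
```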
General approaches to parameter estimation
What general strategies will provide us with es-
timators with desirable properties such as un-
biasedness, consistency and efficiency? One of
the most fundamental strategies for estimation
is the method of moments, in which we re-
place population moments with their sample
counterparts. We have seen this above, where
a consistent estimator of the population variance is defined by replacing the unknown population mean µ with a consistent estimate thereof, Ȳ. A sec-
ond widely employed strategy is the principle of
maximum likelihood, where we choose an es-
timator of the population parameter θ by find-
ing the value that maximizes the likelihood of
observing the sample data. We will not fo-
cus on maximum likelihood estimators in this
course, but note their importance in econo-
metrics. Most of our work here is based on the
least squares principle: that to find an esti-
mate of the population parameter, we should
solve a minimization problem. We can readily
show that the sample average is a method of
moments estimator (and is in fact a maximum
likelihood estimator as well). We demonstrate
now that the sample average is a least squares
estimator:
$$\min_{m}\ \sum_{i=1}^{n} (Y_i - m)^2$$
will yield an estimator, m, which is identical
to that defined as Ȳ. We may show that the value m = Ȳ minimizes the sum of squared deviations, and that any other value m′ would yield a larger sum (or would not be “least squares”). Standard re-
gression techniques, to which we will devote
much of the course, are often called “OLS”:
ordinary least squares.
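To see the least squares principle deliver the sample average, a numerical minimization can be compared with the arithmetic mean; this sketch assumes Python with NumPy and SciPy and uses artificial data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.normal(10.0, 3.0, size=200)   # artificial sample

# Minimize the sum of squared deviations sum_i (y_i - m)^2 over m.
result = minimize_scalar(lambda m: np.sum((y - m) ** 2))

print(result.x)    # the least squares solution ...
print(y.mean())    # ... coincides with the sample average
```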
Interval estimation and confidence intervals
Since an estimator will yield a value (or point
estimate) as well as a sampling variance, we
may generally form a confidence interval around
the point estimate in order to make proba-
bility statements about a population param-
eter. For instance, the fraction of Firestone
tires involved in fatal accidents is surely not
0.0005 of those sold. Any number of samples
would yield estimates of that mean differing
from that number (and for a continuous ran-
dom variable, the probability of a point is zero).
But we can test the hypothesis that 0.0005 of
the tires are involved with fatal accidents if we
can generate both a point and interval esti-
mate for that parameter, and if the interval
estimate cannot reject 0.0005 as a plausible
value. This is the concept of a confidence in-
terval, which is defined with regard to a given
level of “confidence” or level of probability. For a sample drawn from a Normal population with unit variance,
$$P\!\left(-1.96 < \frac{\bar{Y} - \mu}{1/\sqrt{n}} < 1.96\right) = 0.95,$$
which defines the interval estimate $\left(\bar{Y} - \frac{1.96}{\sqrt{n}},\ \bar{Y} + \frac{1.96}{\sqrt{n}}\right)$. We do not conclude from this that the probability that µ lies in the interval is 0.95; the population parameter either lies in the interval or it does not. The proper way to consider the confidence interval is that if we construct intervals in this way from a large number of random samples drawn from the population, 95% of those intervals will contain µ. Thus, if a hypothesized value for µ lies outside the confidence interval for a single sample, that would occur by chance only 5% of the time.
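The “95% of intervals contain µ” interpretation can be checked by simulation; the sketch below (Python with NumPy assumed; the population is taken to be N(0, 1) purely for illustration) constructs many intervals and counts how often they cover µ:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, n, reps = 0.0, 100, 50_000

# Population with known unit variance, so each interval is Ybar +/- 1.96/sqrt(n).
samples = rng.normal(mu, 1.0, size=(reps, n))
ybar = samples.mean(axis=1)
half_width = 1.96 / np.sqrt(n)

covered = (ybar - half_width <= mu) & (mu <= ybar + half_width)
print(covered.mean())   # approximately 0.95, the coverage rate of the interval
```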
But what if we do not have a standard normal variate, for which we know the variance equals unity? If we have a variable Y which we conclude is distributed as N(µ, σ²), we arrive at the difficulty that we do not know σ², and thus
cannot specify the confidence interval. Via the
method of moments, we replace the unknown
σ2 with a consistent estimate, S2, to form the
transformed statistic
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
denoting that its distribution is no longer stan-
dard normal, but “Student’s t” with (n − 1) degrees of freedom. The t distribution has fatter tails
than does the normal; above 20 or 25 degrees
of freedom, it is approximated quite well by
the normal. Thus, confidence intervals con-
structed with the t distribution will be wider
for small n, since the value will be larger than
1.96. A 95% confidence interval, given the
symmetry of the t distribution, will leave 2.5%
of probability in each tail (a two-tailed t test).
If cα/2 is the 100(1 − α/2) percentile of the t distribution, a 100(1 − α)% confidence interval for the mean will be defined as:
$$\left(\bar{y} - c_{\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + c_{\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$
where s is the estimated standard deviation of
Y. We often refer to s/√n as the standard er-
ror of the parameter–in this case, the standard
error of our estimate of µ. Note well the dif-
ference between the concepts of the standard
deviation of the underlying distribution (an es-
timate of σ) and the standard error, or preci-
sion, of our estimate of the mean µ. We will
return to this distinction when we consider re-
gression parameters. A simple rule of thumb,
for large samples, is that a 95% confidence in-
terval is roughly two standard errors on either
side of the point estimate–the counterpart of
a “t of 2” denoting significance of a param-
eter. If an estimated parameter is more than
two standard errors from zero, a test of the hy-
pothesis that it equals zero in the population
will likely be rejected.
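For a concrete computation, a t-based 95% confidence interval for µ might be formed as follows (a sketch only; it assumes Python with NumPy and SciPy, and the data are artificial):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(5.0, 2.0, size=25)   # artificial sample
n = y.size

ybar = y.mean()
se = y.std(ddof=1) / np.sqrt(n)     # standard error of the estimated mean

c = stats.t.ppf(0.975, df=n - 1)    # 97.5th percentile of t with n - 1 d.f.
print(ybar - c * se, ybar + c * se) # 95% confidence interval for mu
```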
Hypothesis testing
We want to test a specific hypothesis about
the value of a population parameter θ. We may
believe that the parameter equals 0.42; so that
we state the null and alternative hypotheses:
H0 : θ = 0.42
HA : θ ≠ 0.42
In this case, we have a two-sided alternative:
we will reject the null if our point estimate
is “significantly” below 0.42, or if it is “sig-
nificantly” above 0.42. In other cases, we
may specify the alternative as one-sided. For
instance, in a quality control study, our null
might be that the proportion of rejects from
the assembly line is no more than 0.03, versus
the alternative that it is greater than 0.03. A
rejection of the null would lead to a shutdown
of the production process, whereas a smaller
proportion of rejects would not be cause for
concern. Using the principles of the scientific
method, we set up the hypothesis and consider
whether there is sufficient evidence against the
null to reject it. Like the principle that a find-
ing of guilt must be associated with evidence
beyond a reasonable doubt, the null will stand
unless sufficient evidence is found to reject it
as unlikely. Just as in the courts, there are two
potential errors of judgment: we may find an
innocent person guilty, and reject a null even
when it is true; this is Type I error. We may
also fail to convict a guilty person, or fail to reject a false null; this is Type II error. Just as
the judicial system tries to balance those two
types of error (especially considering the con-
sequences of punishing the innocent, or even
putting them to death), we must be concerned
with the magnitude of these two sources of er-
ror in statistical inference. We construct hy-
pothesis tests so as to make the probability
of a Type I error fairly small; this is the level
of the test, and is usually denoted as α. For
instance, if we operate at a 95% level of con-
fidence, then the level of the test is α = 0.05.
When we set α, we are expressing our tolerance
for committing a Type I error (and rejecting a
true null). Given α, we would like to minimize
the probability of a Type II error, or equiva-
lently maximize the power of the test, which
is just one minus the probability of committing
a Type II error, and failing to reject a false null.
We must balance the level of the test (and
the risk of falsely rejecting the truth) with the
power of the test (and failing to reject a false
null).
When we use a computer program to calculate
point and interval estimates, we are given the
information that will allow us to reject or fail to
reject a particular null. This is usually phrased
in terms of p− values, which are the tail prob-
abilities associated with a test statistic. If the
p-value is less than the level of the test, then it
leads to a rejection: a p-value of 0.035 allows
us to reject the null at the level of 0.05. One
must be careful to avoid misinterpreting a p-value of, say, 0.94: such a value indicates that the data provide essentially no evidence against the null, not that the null is 94% likely to be true.
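As an illustration of how such a p-value arises, the sketch below (assuming Python with NumPy and SciPy; the sample is hypothetical, and the null value 0.42 is the one used above) computes a two-sided p-value for a test on the mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(0.45, 0.1, size=40)   # hypothetical sample
n = y.size

theta0 = 0.42                        # null hypothesis value, H0: theta = 0.42
t_stat = (y.mean() - theta0) / (y.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability, under H0, of a t statistic at least this extreme.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)   # reject H0 at the 5% level if p_value < 0.05
```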
One should also note the duality between con-
fidence intervals and hypothesis tests. They
utilize the same information: the point esti-
mate, the precision as expressed in the stan-
dard error, and a value taken from the under-
lying distribution of the test statistic (such as
1.96). If the boundary of the 95% confidence
interval contains a value δ, then a hypothesis
test that the population parameter equals δ will
be on the borderline of acceptance and rejec-
tion at the 5% level. We can consider these
quantities as either defining an interval esti-
mate for the parameter, or alternatively sup-
porting an hypothesis test for the parameter.