Wooldridge, Introductory Econometrics, 4th ed.
Chapter 1: Nature of Econometrics and
Economic Data
What do we mean by econometrics? Econometrics is the field of economics in which statistical methods are developed and applied to estimate economic relationships, test economic theories, and evaluate plans and policies implemented by private industry, government, and supranational organizations. Econometrics encompasses forecasting–not only the high-profile forecasts of macroeconomic and financial variables, but also forecasts of demand for a product, the likely effects of a tax package, or the interaction between the demand for health services and welfare reform.
Why is econometrics separate from mathematical statistics? Because most applications of
statistics in economics and finance are related
to the use of non-experimental data, or ob-
servational data. The fundamental techniques
of statistics have been developed for use on
experimental data: that gathered from con-
trolled experiments, where the design of the
experiment and the reliability of measurements
from its outcomes are primary foci. In rely-
ing on observational data, economists are more
like astronomers, able to collect and analyse in-
creasingly complete measures on the world (or
universe) around them, but unable to influence
the outcomes.
This distinction is not absolute; some eco-
nomic policies are in the nature of experiments,
and economists have been heavily involved in
both their design and implementation. A good
example within the last five years is the imple-
mentation of welfare reform, limiting individ-
uals’ tenure on the welfare rolls to five years
of lifetime experience. Many doubted that this
would be successful in addressing the needs of
those welfare recipients who have low job skills;
but the reforms have been surprisingly suc-
cessful, as a recent article in The Economist
states, at raising the employment rate among
this cohort. Economists are also able to care-
fully examine the economic consequences of
massive regime changes in an economy, such
as the transition from a planned economy to
a capitalist system in the former Soviet bloc.
But fundamentally applied econometricians ob-
serve the data, and use sophisticated tech-
niques to evaluate their meaning.
We speak of this work as empirical analysis, or
empirical research. The first step is the careful
formulation of the question of interest. This
will often involve the application or develop-
ment of an economic model, which may be as
simple as noting that normal goods have neg-
ative price elasticities, or exceedingly complex,
involving a full-fledged description of many as-
pects of a set of interrelated markets and the
supply/demand relationships for the products
traded (as would, for instance, an economet-
ric analysis of an antitrust issue, such as U.S.
v Microsoft). Economists are often attacked
for their imperialistic tendencies–applying eco-
nomic logic to consider such diverse topics as
criminal behavior, fertility, or environmental issues–
but where there is an economic dimension, the
application of economic logic and empirical re-
search based on econometric practice may yield
valuable insights. Gary Becker, who has made
a career of applying economics to non-economic
realms, won a Nobel Prize for his efforts. Crime,
after all, is yet another career choice, and for
high school dropouts who don’t see much fu-
ture in flipping burgers at minimum wage, it
is hardly surprising that there are ample ap-
plicants for positions in a drug dealer’s distri-
bution network. In risk-adjusted terms (gaug-
ing the risk of getting shot, or arrested and
successfully prosecuted...) the risk-adjusted
hourly wage is many times the minimum wage.
Should we be surprised by the outcome?
Regardless of whether empirical research is
based on a formal economic model or eco-
nomic intuition, the hypotheses about economic
behavior must be transformed into an econo-
metric model that can be applied to the data.
In an economic model, we can speak of func-
tions such as Q = Q (P, Y ) ; but if we are to es-
timate the parameters of that relationship, we
must have an explicit functional form for the Q
function, and determine that it is an appropri-
ate form for the model we have in mind. For
instance, if we were trying to predict the effi-
ciency of an automobile in terms of its engine
size (displacement, in cubic inches or liters), Americans
would likely rely on a measure like mpg – miles
per gallon. But the engineering relationship is
not linear between mpg and displacement; it
is much closer to being a linear function if we
relate gallons per mile (gpm = 1/mpg) to en-
gine size. The relationship will be curvilinear in
mpg terms, requiring a more complex model,
but nearly linear in gpm vs displacement. An
econometric model will spell out the role of
each of its variables: for instance,
gpm_i = β0 + β1 displ_i + ε_i
would express the relationship between the fuel
consumption of the ith automobile to its en-
gine size, or displacement, as a linear function,
with an additive error term εi which encom-
passes all factors not included in the model.
The parameters of the model are the β terms,
which must be estimated via statistical meth-
ods. Once that estimation has been done, we
may test specific hypotheses on their values:
for instance, that β1 is positive (larger engines
use more fuel), or that β1 takes on a certain
value. Estimating this relationship for Stata’s
auto.dta dataset of 74 automobiles, the pre-
dicted relationship is
gpm_i = 0.029 + 0.011 displ_i
where displacement is measured in hundreds of
in3. This estimated relationship has an “R2”
value of 0.59, indicating that 59% of the vari-
ation of gpm around its mean is “explained”
by displacement, and a root-mean-squared er-
ror of 0.008 (which can be compared to gpm’s
mean of 0.050, corresponding to about 21 mpg).
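The fit above is reported from Stata's auto.dta; as a minimal sketch of the same kind of calculation, the following Python/numpy code runs the simple regression on synthetic data (the displacement range, error spread, and coefficients are invented to mimic the reported fit, not taken from the actual dataset):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the auto data: displacement in hundreds of
# cubic inches, gpm = gallons per mile (hypothetical values).
displ = rng.uniform(0.8, 4.5, size=74)
gpm = 0.029 + 0.011 * displ + rng.normal(0, 0.008, size=74)

# OLS slope and intercept via the covariance/variance formulas
b1 = np.cov(displ, gpm, ddof=1)[0, 1] / np.var(displ, ddof=1)
b0 = gpm.mean() - b1 * displ.mean()

resid = gpm - (b0 + b1 * displ)
r2 = 1 - resid.var(ddof=0) / gpm.var(ddof=0)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {r2:.2f}")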
The structure of economic data
We must acquaint ourselves with some termi-
nology to describe the several forms in which
economic and financial data may appear. A
great deal of the work we will do in this course
will relate to cross-sectional data: a sam-
ple of units (individuals, families, firms, in-
dustries, countries...) taken at a given point
in time, or in a particular time frame. The
sample is often considered to be a random
sample of some sort when applied to micro-
data such as that gathered from individuals
or households. For instance, the official esti-
mates of the U.S. unemployment rate are gath-
ered from a monthly survey of individuals, in
which each is asked about their employment
status. It is not a count, or census, of those
out of work. Of course, some cross sections
are not samples, but may represent the pop-
ulation: e.g. data from the 50 states do not
represent a random sample of states. A cross-
sectional dataset can be conceptualized as a
spreadsheet, with variables in the columns and
observations in the rows. Each row is uniquely
identified by an observation number, but in
a cross-sectional dataset the ordering of the
observations is immaterial. Different variables
may correspond to different time periods; we
might have a dataset containing municipalities,
their employment rates, and their population in
the 1990 and 2000 censuses.
The other major form of data considered in
econometrics is the time series: a series of
evenly spaced measurements on a variable. A
time-series dataset may contain a number of
measures, each measured at the same frequency,
including measures derived from the originals
such as lagged values, differences, and the like.
Time series are innately more difficult to han-
dle in an econometric context because their
observations almost surely are interdependent
across time. Most economic and financial time
series exhibit some degree of persistence. Al-
though we may be able to derive some mea-
sures which should not, in theory, be explain-
able from earlier observations (such as tomor-
row’s stock return in an efficient market), most
economic time series are both interrelated and
autocorrelated–that is, related to themselves
across time periods. In a spreadsheet context, the variables would be placed in the columns, and the rows labelled with dates or times. The order of the observations in a time-series dataset matters, since it denotes the passage of equal increments of time. We will discuss time-series data and some of the special techniques that have been developed for its analysis in the latter part of the course.
Two combinations of these data schemes are also widely used: pooled cross-section/time series (CS/TS) datasets and panel, or longitudinal, data sets. The former (CS/TS) arise in the context of a repeated survey–such as a presidential popularity poll–where the respondents are randomly chosen. It is advantageous to analyse multiple cross-sections, but not possible to link observations across the cross-sections. Much more useful are panel data sets, in which we have time series of observations on the same unit: for instance, C_{i,t}
might be the consumption level of the ith house-
hold at time t. Many of the datasets we com-
monly utilize in economic and financial research
are of this nature: for instance, a great deal of
research in corporate finance is carried out with
Standard and Poor’s COMPUSTAT, a panel
data set containing 20 years of annual financial
statements for thousands of major U.S. corpo-
rations. There is a wide array of specialized
econometric techniques that have been devel-
oped to analyse panel data; we will not touch
upon them in this course.
Causality and ceteris paribus
The hypotheses tested in applied economet-
ric analysis are often posed to make inferences
about the possible causal effects of one or
more factors on a response variable: that is, do
changes in consumers’ incomes “cause” changes
in their consumption of beer? At some level,
of course, we can never establish causation–
unlike the physical sciences, where the interre-
lations of molecules may follow well-established
physical laws, our observed phenomena rep-
resent innately unpredictable human behavior.
In economic theory, we generally hold that in-
dividuals exhibit rational behavior; but since
the econometrician does not observe all of the
factors that might influence behavior, we can-
not always make sensible inferences about po-
tentially causal factors. Whenever we “opera-
tionalize” an econometric model, we implic-
itly acknowledge that it can only capture a
few key details of the behavioral relationship,
and leaves many additional factors (which
may or may not be observable) in the “pound
of ceteris paribus.” Ceteris paribus–literally,
other things equal–always underlies our infer-
ences from empirical research. Our best hope
is that we might control for many of the fac-
tors, and be able to use our empirical findings
to ascertain whether systematic factors have
been omitted. Any econometric model should
be subjected to diagnostic testing to deter-
mine whether it contains obvious flaws. For in-
stance, the relationship between mpg and displ
in the automobile data is strictly dominated by
a model containing both displ and displ2, given
the curvilinear relation between mpg and displ.
Thus the original linear model can be viewed as
unacceptable in comparison to the polynomial
model; this conclusion could be drawn from
analysis of the model’s residuals, coupled with
an understanding of the engineering relation-
ship that posits a nonlinear function between
mpg and displ.
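As a rough illustration of this diagnostic point, the sketch below (Python, with synthetic data chosen only to mimic the reciprocal mpg–displacement relation described above) fits mpg on displ alone and on displ plus displ², and compares the residual sums of squares:

import numpy as np

rng = np.random.default_rng(1)

# Synthetic illustration: mpg is roughly the reciprocal of a linear
# function of displacement, so it is curvilinear in displ.
displ = rng.uniform(0.8, 4.5, size=200)
mpg = 1.0 / (0.029 + 0.011 * displ) + rng.normal(0, 1.0, size=200)

# Fit mpg on displ, and on displ and displ^2, then compare residuals.
lin = np.polyfit(displ, mpg, deg=1)
quad = np.polyfit(displ, mpg, deg=2)

ssr_lin = np.sum((mpg - np.polyval(lin, displ)) ** 2)
ssr_quad = np.sum((mpg - np.polyval(quad, displ)) ** 2)
print(ssr_lin, ssr_quad)   # the quadratic fit leaves a smaller residual sum of squares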
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 2: The simple regression model
Most of this course will be concerned with use
of a regression model: a structure in which
one or more explanatory variables are consid-
ered to generate an outcome variable, or de-
pendent variable.We begin by considering the
simple regression model, in which a single ex-
planatory, or independent, variable is involved.
We often speak of this as ‘two-variable’ regres-
sion, or ‘Y on X regression’. Algebraically,
yi = β0 + β1xi + ui (1)
is the relationship presumed to hold in the pop-
ulation for each observation i. The values of y
are expected to lie on a straight line, depending
on the corresponding values of x. Their values
will differ from those predicted by that line by
the amount of the error term, or disturbance,
u, which expresses the net effect of all factors
other than x on the outcome y−that is, it re-
flects the assumption of ceteris paribus. We
often speak of x as the ‘regressor’ in this rela-
tionship; less commonly we speak of y as the
‘regressand.’ The coefficients of the relation-
ship, β0 and β1, are the regression parameters,
to be estimated from a sample. They are pre-
sumed constant in the population, so that the
effect of a one-unit change in x on y is assumed
constant for all values of x.
As long as we include an intercept in the rela-
tionship, we can always assume that E (u) = 0,
since a nonzero mean for u could be absorbed
by the intercept term.
The crucial assumption in this regression model
involves the relationship between x and u. We
consider x a random variable, as is u, and con-
cern ourselves with the conditional distribution
of u given x. If that distribution is equivalent to
the unconditional distribution of u, then we can
conclude that there is no relationship between
x and u−which, as we will see, makes the es-
timation problem much more straightforward.
To state this formally, we assume that
E (u | x) = E (u) = 0 (2)
or that the u process has a zero conditional
mean. This assumption states that the unob-
served factors involved in the regression func-
tion are not related in any systematic manner
to the observed factors. For instance, con-
sider a regression of individuals’ hourly wage
on the number of years of education they have
completed. There are, of course, many factors
influencing the hourly wage earned beyond the
number of years of formal schooling. In work-
ing with this regression function, we are as-
suming that the unobserved factors–excluded
from the regression we estimate, and thus rel-
egated to the u term–are not systematically
related to years of formal schooling. This may
not be a tenable assumption; we might con-
sider “innate ability” as such a factor, and it
is probably related to success in both the edu-
cational process and the workplace. Thus, in-
nate ability–which we cannot measure without
some proxies–may be positively correlated to
the education variable, which would invalidate
assumption (2).
The population regression function, given
the zero conditional mean assumption, is
E(y | x) = β0 + β1 x    (3)
This allows us to separate y into two parts:
the systematic part, related to x, and the un-
systematic part, which is related to u. As long
as assumption (2) holds, those two compo-
nents are independent in the statistical sense.
Let us now derive the least squares estimates
of the regression parameters.
Let [(xi, yi) : i = 1, ..., n] denote a random sam-
ple of size n from the population, where y_i and x_i are presumed to obey the relation (1).
The assumption (2) allows us to state that
E(u) = 0, and given that assumption, that
Cov(x, u) = E(xu) = 0, where Cov(·) denotes
the covariance between the random variables.
These assumptions can be written in terms of
the regression error:
E (yi − β0 − β1xi) = 0 (4)
E [xi (yi − β0 − β1xi)] = 0
These two equations place two restrictions on
the joint probability distribution of x and u.
Since there are two unknown parameters to be
estimated, we might look upon these equations
to provide solutions for those two parameters.
We choose estimators b0 and b1 to solve the
sample counterparts of these equations, mak-
ing use of the principle of the method of mo-
ments:
n^{-1}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0    (5)

n^{-1}\sum_{i=1}^{n} x_i\,(y_i - b_0 - b_1 x_i) = 0
the so-called normal equations of least squares.
Why is this method said to be “least squares”?
Because as we shall see, it is equivalent to min-
imizing the sum of squares of the regression
residuals. How do we arrive at the solution?
The first “normal equation” can be seen to be
b_0 = \bar{y} - b_1 \bar{x}    (6)
where ȳ and x̄ are the sample averages of those
variables. This implies that the regression line
passes through the point of means of the sam-
ple data. Substituting this solution into the
second normal equation, we now have one equa-
tion in one unknown, b1 :
0 = \sum_{i=1}^{n} x_i\,(y_i - (\bar{y} - b_1\bar{x}) - b_1 x_i)

\sum_{i=1}^{n} x_i\,(y_i - \bar{y}) = b_1 \sum_{i=1}^{n} x_i\,(x_i - \bar{x})

b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

b_1 = \frac{Cov(x, y)}{Var(x)}    (7)
where the slope estimate is merely the ratio of
the sample covariance of the two variables to
the variance of x, which must be nonzero for
the estimates to be computed. This merely
implies that not all of the sample values of x
can take on the same value. There must be
diversity in the observed values of x. These
estimates–b0 and b1−are said to be the ordi-
nary least squares (OLS) estimates of the
regression parameters, since they can be de-
rived by solving the least squares problem:
\min_{b_0, b_1} S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2    (8)
Here we minimize the sum of squared residu-
als, or differences between the regression line
and the values of y, by choosing b0 and b1.
If we take the derivatives ∂S/∂b0 and ∂S/∂b1 and set the resulting first order conditions to
zero, the two equations that result are exactly
the OLS solutions for the estimated parame-
ters shown above. The “least squares” esti-
mates minimize the sum of squared residuals,
in the sense that any other line drawn through
the scatter of (x, y) points would yield a larger
sum of squared residuals. The OLS estimates
provide the unique solution to this problem,
and can always be computed if (i) V ar(x) > 0
and (ii) n ≥ 2. The estimated OLS regression
line is then
ŷ_i = b_0 + b_1 x_i    (9)
where the “hat” denotes the predicted value
of y corresponding to that value of x. This is
the sample regression function (SRF), cor-
responding to the population regression func-
tion, or PRF (3). The population regression
function is fixed, but unknown, in the popu-
lation; the SRF is a function of the particular
sample that we have used to derive it, and a
different SRF will be forthcoming from a differ-
ent sample. The primary interest in these es-
timates usually involves b1 = ∂y/∂x = ∆y/∆x,
the amount by which y is predicted to change
from a unit change in the level of x. This slope
is often of economic interest, whereas the con-
stant term in many regressions is devoid of
economic meaning. For instance, a regres-
sion of major companies’ CEO salaries on the
firms’ return on equity–a measure of economic
performance–yields the regression estimates
S = 963.191 + 18.501r (10)
where S is the CEO’s annual salary, in thou-
sands of 1990 dollars, and r is average re-
turn on equity over the prior three years, in
per cent. This implies that a one percent in-
crease in ROE over the past three years is
worth $18,501 to a CEO, on average. The
average annual salary for the 209 CEOs in the
sample is $1.28 million, so the increment is
about 1.4% of that average salary. The SRF
can also be used to predict what a CEO will
earn for any level of ROE; points on the esti-
mated regression function are such predictions.
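As a small illustration of using points on the SRF as predictions, the sketch below simply evaluates equation (10) in Python; the 10% and 11% ROE values are arbitrary examples, not figures from the text:

def predicted_salary(roe_percent: float) -> float:
    """Point on the SRF in equation (10): salary in thousands of 1990 dollars."""
    return 963.191 + 18.501 * roe_percent

# A CEO of a firm with a 10% average ROE is predicted to earn about $1.15M;
# each additional percentage point of ROE adds $18,501 on average.
print(predicted_salary(10.0))                            # 1148.201
print(predicted_salary(11.0) - predicted_salary(10.0))   # 18.501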
Mechanics of OLS
Some algebraic properties of the OLS regression line:
(1) The sum (and average) of the OLS residuals is zero:

\sum_{i=1}^{n} e_i = 0    (11)

which follows from the first normal equation, which specifies that the estimated regression line goes through the point of means (x̄, ȳ), so that the mean residual must be zero.
(2) By construction, the sample covariance between the OLS residuals and the regressor is zero:

Cov(e, x) = \sum_{i=1}^{n} x_i e_i = 0    (12)

This is not an assumption, but follows directly from the second normal equation. The estimated coefficients, which give rise to the residuals, are chosen to make it so.
(3) Each value of the dependent variable may
be written in terms of its prediction and its
error, or regression residual:
y_i = ŷ_i + e_i
so that OLS decomposes each yi into two parts:
a fitted value, and a residual. Property (2) also
implies that Cov(e, ŷ) = 0, since ŷ is a linear
transformation of x, and linear transformations
have linear effects on covariances. Thus, the
fitted values and residuals are uncorrelated in
the sample. Taking this property and applying
it to the entire sample, we define
SST = \sum_{i=1}^{n}(y_i - \bar{y})^2

SSE = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2

SSR = \sum_{i=1}^{n} e_i^2
as the Total sum of squares, Explained sum
of squares, and Residual sum of squares, re-
spectively. Note that SST expresses the total
variation in y around its mean (and we do not
strive to “explain” its mean; only how it varies
about its mean). The second quantity, SSE,
expresses the variation of the predicted values
ŷ around the mean value of y (and it is trivial
to show that ŷ has the same mean as y). The
third quantity, SSR, is the same as the least
squares criterion S from (8). (Note that some
textbooks interchange the definitions of SSE
and SSR, since both “explained” and “error”
start with E, and “regression” and “residual”
start with R). Given these sums of squares, we
can generalize the decomposition mentioned
above into
SST = SSE + SSR (13)
or, the total variation in y may be divided into
that explained and that unexplained, i.e. left
in the residual category. To prove the validity
of (13), note that
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}))^2

= \sum_{i=1}^{n}[e_i + (\hat{y}_i - \bar{y})]^2

= \sum_{i=1}^{n} e_i^2 + 2\sum_{i=1}^{n} e_i\,(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2

SST = SSR + SSE

given that the middle term in this expression is equal to zero. But this term is the sample covariance of e and ŷ, given a zero mean for e, and by (12) we have established that this is zero.
How good a job does this SRF do? Does the
regression function explain a great deal of the
variation of y, or not very much? That can
now be answered by making use of these sums
of squares:
R² = [r_{xy}]² = SSE/SST = 1 − SSR/SST
The R2 measure (sometimes termed the coef-
ficient of determination) expresses the percent
of variation in y around its mean “explained”
by the regression function. It is an r, or simple
correlation coefficient, squared, in this case of
simple regression on a single x variable. Since
the correlation between two variables ranges
between -1 and +1, the squared correlation
ranges between 0 and 1. In that sense, R2
is like a batting average. In the case where
R2 = 0, the model we have built fails to ex-
plain any of the variation in the y values around
their mean–unlikely, but it is certainly possible
to have a very low value of R2. In the case
where R2 = 1, all of the points lie on the SRF.
That is unlikely when n > 2, but it may be
the case that all points lie close to the line,
in which case R2 will approach 1. We can-
not make any statistical judgment based di-
rectly on R2, or even say that a model with
a higher R2 and the same dependent variable
is necessarily a better model; but other things
equal, a higher R2 will be forthcoming from a
model that captures more of y's behavior. In
cross-sectional analyses, where we are trying
to understand the idiosyncrasies of individual
behavior, very low R2 values are common, and
do not necessarily denote a failure to build a
useful model.
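A quick numerical check of the decomposition (13) and of R² as a squared correlation, on simulated data (the coefficients, error spread, and sample size here are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
e = y - yhat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((yhat - y.mean()) ** 2)
ssr = np.sum(e ** 2)

print(np.isclose(sst, sse + ssr))              # SST = SSE + SSR
print(sse / sst, np.corrcoef(x, y)[0, 1] ** 2) # R^2 equals the squared simple correlation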
Important issues in evaluating applied work:
how do the quantities we have estimated change
when the units of measurement are changed?
In the estimated model of CEO salaries, since
the y variable was measured in thousands of
dollars, the intercept and slope coefficient refer to those units as well. If we measured salaries in dollars, the intercept and slope would be multiplied by 1000, but nothing else would change. The correlation between y and x is not affected by linear transformations, so we would not alter the R² of this equation by changing its units of measurement. Likewise, if ROE were measured in decimals rather than per cent, it would merely change the units of measurement of the slope coefficient. Dividing r by 100 would cause the slope to be multiplied by 100. In the original (10), with r in per cent, the slope is 18.501 (thousands of dollars per one-unit change in r). If we expressed r in decimal form, the slope would be 1850.1. A change in r from 0.10 to 0.11–a one per cent increase in ROE–would be associated with a change in salary of (0.01)(1850.1) = 18.501 thousand dollars. Again, the correlation between salary and ROE would not be altered. This also applies for a transformation such as F = 32 + (9/5)C;
it would not matter whether we viewed tem-
perature in degrees F or degrees C as a causal
factor in estimating the demand for heating oil,
since the correlation between the dependent
variable and temperature would be unchanged
by switching from Fahrenheit to Celsius de-
grees.
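The sketch below illustrates these units-of-measurement points on simulated data (the ROE and salary values are invented, not the 209-CEO sample, and the simple_ols helper is purely illustrative): rescaling a variable rescales the coefficients, but leaves R² unchanged.

import numpy as np

rng = np.random.default_rng(3)
roe = rng.uniform(5, 25, size=209)                    # per cent
salary = 963 + 18.5 * roe + rng.normal(0, 500, 209)   # thousands of dollars

def simple_ols(x, y):
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return b0, b1, r2

print(simple_ols(roe, salary))          # slope in thousands of $ per percentage point of ROE
print(simple_ols(roe / 100, salary))    # ROE in decimals: slope 100 times larger, same R^2
print(simple_ols(roe, salary * 1000))   # salary in dollars: both coefficients 1000 times larger, same R^2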
Functional form
Simple linear regression would seem to be a
workable tool if we have a presumed linear re-
lationship between y and x, but what if theory
suggests that the relation should be nonlinear?
It turns out that the “linearity” of regression
refers to y being expressed as a linear func-
tion of x−but neither y nor x need be the “raw
data” of our analysis. For instance, regressing
y on t (a time trend) would allow us to analyse
a linear trend, or constant growth, in the data.
What if we expect the data to exhibit expo-
nential growth, as would population, or sums
earning compound interest? If the underlying
model is
y = A exp (rt) (14)
log y = logA+ rt
y∗ = A∗+ rt (15)
so that the “single-log” transformation may
be used to express a constant-growth relation-
ship, in which r is the regression slope coef-
ficient that directly estimates ∂ log y/∂t. Like-
wise, the “double-log” transformation can be
used to express a constant-elasticity relation-
ship, such as that of a Cobb-Douglas function:
y = Axα (16)
log y = logA+ α logx
y∗ = A∗+ αx∗
In this context, the slope coefficient α is an
estimate of the elasticity of y with respect to
x, given that ηy,x = ∂ log y/∂ logx by the defini-
tion of elasticity. The original equation is non-
linear, but the transformed equation is a linear
function which may be estimated by OLS re-
gression.
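A brief sketch of the double-log transformation at work, using simulated data from a constant-elasticity relation (the values of A, α, and the error spread are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(4)

# Constant-elasticity relation y = A * x^alpha with a multiplicative error
x = rng.uniform(1, 10, size=500)
A, alpha = 2.0, 0.7
y = A * x**alpha * np.exp(rng.normal(0, 0.1, size=500))

# OLS on the log-transformed variables recovers the elasticity alpha
ly, lx = np.log(y), np.log(x)
b1 = np.cov(lx, ly, ddof=1)[0, 1] / np.var(lx, ddof=1)
b0 = ly.mean() - b1 * lx.mean()
print(b1)          # close to 0.7, the elasticity of y with respect to x
print(np.exp(b0))  # close to A = 2.0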
Likewise, a model in which y is thought to de-
pend on 1/x (the reciprocal model) may be
estimated by linear regression by just defin-
ing a new variable, z, equal to 1/x (presuming
x > 0). That model has an interesting inter-
pretation if you work out its algebra.
We often use a polynomial form to allow for
nonlinearities in a regression relationship. For
instance, rather than including only x as a re-
gressor, we may include x and x2. Although
this relationship is linear in the parameters, it
implies that ∂y/∂x = β + 2γx, so that the effect
of x on y now depends on the level of x for
that observation, rather than being a constant
factor.
Properties of OLS estimators
Now let us consider the properties of the re-
gression estimators we have derived, consider-
ing b0 and b1 as estimators of their respective
population quantities. To establish the unbi-
asedness of these estimators, we must make
several assumptions:
Proposition 1 SLR1: in the population, the
dependent variable y is related to the indepen-
dent variable x and the error u as
y = β0 + β1x + u (17)
Proposition 2 SLR2: we can estimate the pop-
ulation parameters from a sample of size n,
{(xi, yi), i = 1, ..., n}.
Proposition 3 SLR3: the error process has a
zero conditional mean:
E (u | x) = 0. (18)
Proposition 4 SLR4: the independent vari-
able x has a positive variance:
(n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 > 0.    (19)
Given these four assumptions, we may pro-
ceed, considering the intercept and slope esti-
mators as random variables. For the slope es-
timator, we may express the estimator in terms
of population coefficients and errors:
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{s_x^2}    (20)
where we have defined s2x as the total variation
in x (not the variance of x). Substituting, we
can write the slope estimator as:
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{s_x^2}

= \frac{\beta_0 \sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n}(x_i - \bar{x})\,x_i + \sum_{i=1}^{n}(x_i - \bar{x})\,u_i}{s_x^2}    (21)
We can show that the first term in the nu-
merator is algebraically zero, given that the
deviations around the mean sum to zero. The
second term can be written as β1 Σ(xi − x̄)xi, and since Σ(xi − x̄)xi = Σ(xi − x̄)² = s²x, that term is merely β1 when
divided by s²x. Thus this expression can be rewritten as:
b_1 = \beta_1 + \frac{1}{s_x^2}\sum_{i=1}^{n}(x_i - \bar{x})\,u_i
showing that any randomness in the estimates
of b1 is derived from the errors in the sample,
weighted by the deviations of their respective
x values. Given the assumed independence of
the distributions of x and u implied by (18),
this expression implies that:
E (b1) = β1,
or that b1 is an unbiased estimate of β1, given
the propositions above. The four propositions
listed above are all crucial for this result, but
the key assumption is the independence of x
and u.
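A small Monte Carlo sketch of this unbiasedness result (the population coefficients, error spread, sample size, and number of replications are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, n, reps = 1.0, 2.0, 50, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(0, 3, size=n)          # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    slopes[r] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(slopes.mean())   # close to beta1 = 2.0: b1 is unbiased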
We are also concerned about the precision of
the OLS estimators. To derive an estimator
of the precision, we must add an assumption
on the distribution of the error u :
Proposition 5 SLR5: (homoskedasticity):
V ar (u | x) = V ar(u) = σ2.
This assumption states that the variance of the
error term is constant over the population, and
thus within the sample. Given (18), the con-
ditional variance is also the unconditional vari-
ance. The errors are considered drawn from a
fixed distribution, with a mean of zero and a
constant variance of σ2. If this assumption is vi-
olated, we have the condition of heteroskedas-
ticity, which will often involve the magnitude
of the error variance relating to the magnitude
of x, or to some other measurable factor.
Given this additional assumption, but no fur-
ther assumptions on the nature of the distri-
bution of u, we may demonstrate that:
Var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{s_x^2}    (22)
so that the precision of our estimate of the
slope is dependent upon the overall error vari-
ance, and is inversely related to the variation in
the x variable. The magnitude of x does not
matter, but its variability in the sample does
matter. If we are conducting a controlled experiment (quite unlikely in economic analysis) we would want to choose widely spread values of x to generate the most precise estimate of ∂y/∂x.
We can likewise prove that b0 is an unbiased estimator of the population intercept, with sampling variance:

Var(b_0) = \frac{n^{-1}\sigma^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\, s_x^2}    (23)

so that the precision of the intercept depends, as well, upon the sample size and the magnitude of the x values. These formulas for the sampling variances will be invalid in the presence of heteroskedasticity–that is, when proposition SLR5 is violated.
These formulas are not operational, since they include the unknown parameter σ². To calculate estimates of the variances, we must first replace σ² with a consistent estimate, s², derived from the least squares residuals:
ei = yi − b0 − b1xi, i = 1, ..., n (24)
We cannot observe the error ui for a given ob-
servation, but we can generate a consistent es-
timate of the ith observation’s error with the ith
observation’s least squares residual, e_i. Like-
wise, a sample quantity corresponding to the
population variance σ2 can be derived from the
residuals:
s^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2 = \frac{SSR}{n-2}    (25)
where the numerator is just the least squares
criterion, SSR, divided by the appropriate de-
grees of freedom. Here, two degrees of free-
dom are lost, since each residual is calculated
by replacing two population coefficients with
their sample counterparts. This now makes it
possible to generate the estimated variances
and, more usefully, the estimated standard
error of the regression slope:
s_{b_1} = \frac{s}{s_x}
where s is the standard deviation, or standard
error, of the disturbance process (that is, √s²),
and s_x is the square root of s²x. It is this estimated standard
error that will be displayed on the computer
printout when you run a regression, and used
to construct confidence intervals and hypoth-
esis tests about the slope coefficient. We can
calculate the estimated standard error of the
intercept term by the same means.
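A sketch of these calculations on simulated data, following equations (24)–(25) and the expression for the slope's standard error above (the sample size and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 3, size=n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - b0 - b1 * x                          # least squares residuals, equation (24)

s2 = np.sum(e**2) / (n - 2)                  # equation (25)
s_x = np.sqrt(np.sum((x - x.mean())**2))     # square root of the total variation in x
se_b1 = np.sqrt(s2) / s_x                    # estimated standard error of the slope
print(np.sqrt(s2), se_b1)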
Regression through the origin
We could also consider a special case of the
model above where we impose a constraint
that β0 = 0, so that y is taken to be propor-
tional to x. This will often be inappropriate; it
is generally more sensible to let the data calcu-
late the appropriate intercept term, and rees-
timate the model subject to that constraint
only if that is a reasonable course of action.
Otherwise, the resulting estimate of the slope
coefficient will be biased. Unless theory sug-
gests that a strictly proportional relationship is
appropriate, the intercept should be included in
the model.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 3: Multiple regression analysis:
Estimation
In multiple regression analysis, we extend the
simple (two-variable) regression model to con-
sider the possibility that there are additional
explanatory factors that have a systematic ef-
fect on the dependent variable. The simplest
extension is the “three-variable” model, in which
a second explanatory variable is added:
y = β0 + β1x1 + β2x2 + u (1)
where each of the slope coefficients are now
partial derivatives of y with respect to the x
variable which they multiply: that is, hold-
ing x2 fixed, β1 = ∂y/∂x1. This extension also
allows us to consider nonlinear relationships,
such as a polynomial in z, where x1 = z and
x2 = z2. Then, the regression is linear in x1
and x2, but nonlinear in z : ∂y/∂z = β1 + 2β2z.
The key assumption for this model, analogous
to that which we specified for the simple re-
gression model, involves the independence of
the error process u and both regressors, or ex-
planatory variables:
E (u | x1, x2) = 0. (2)
This assumption of a zero conditional mean
for the error process implies that it does not
systematically vary with the x′s nor with any
linear combination of the x′s; u is independent,
in the statistical sense, from the distributions
of the x′s.
The model may now be generalized to the case
of k regressors:
y = β0 + β1x1 + β2x2 + ...+ βkxk + u (3)
where the β coefficients have the same inter-
pretation: each is the partial derivative of y
with respect to that x, holding all other x's constant (ceteris paribus), and the u term is
that nonsystematic part of y not linearly re-
lated to any of the x′s. The dependent variable
y is taken to be linearly related to the x′s, which
may bear any relation to each other (e.g. poly-
nomials or other transformations) as long as
there are no exact linear dependencies among
the regressors. That is, no x variable can be
an exact linear transformation of another, or
the regression estimates cannot be calculated.
The independence assumption now becomes:
E (u | x1, x2, ..., xk) = 0. (4)
Mechanics and interpretation of OLS
Consider first the “three-variable model” given
above in (1). The estimated OLS equation
contains the parameters of interest:
ŷ = b_0 + b_1 x_1 + b_2 x_2    (5)
and we may define the ordinary least squares
criterion in terms of the OLS residuals, calcu-
lated from a sample of size n, from this expres-
sion:
\min S = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_{i1} - b_2 x_{i2})^2    (6)
where the minimization of this expression is
performed with respect to each of the three
parameters, {b0, b1, b2}. In the case of k regres-
sors, these expressions include terms in bk, and
the minimization is performed with respect to
the (k+1) parameters {b0, b1, b2, ...bk}. For this
to be feasible, n > (k + 1) : that is, we must
have a sample larger than the number of pa-
rameters to be estimated from that sample.
The minimization is carried out by differenti-
ating the scalar S with respect to each of the
b′s in turn, and setting the resulting first order
condition to zero. This gives rise to (k+ 1) si-
multaneous equations in (k+1) unknowns, the
regression parameters, which are known as the
least squares normal equations. The normal equations are expressions in the sums of squares and cross products of y and the regressors, including a first “regressor” which is a column of 1's, multiplying the constant term. For the “three-variable” regression model, we can write out the normal equations as:

\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2    (7)

\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2

\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2

Just as in the “two-variable” case, the first normal equation can be interpreted as stating that the regression surface (in 3-space) passes through the multivariate point of means {\bar{x}_1, \bar{x}_2, \bar{y}}. These three equations may be uniquely solved, by normal algebraic techniques or linear algebra, for the estimated least squares parameters.
This extends to the case of k regressors and (k+1) regression parameters. In each case, the
regression coefficients are considered in the ce-
teris paribus sense: that each coefficient mea-
sures the partial effect of a unit change in its
variable, or regressor, holding all other regres-
sors fixed. If a variable is a component of more
than one regressor–as in a polynomial relation-
ship, as discussed above–the total effect of a
change in that variable is additive.
Fitted values, residuals, and their proper-
ties
Just as in simple regression, we may calculate
fitted values, or predicted values, after esti-
mating a multiple regression. For observation
i, the fitted value is
ŷ_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_k x_{ik}    (8)
and the residual is the difference between the
actual value of y and the fitted value:
e_i = y_i − ŷ_i    (9)
As with simple regression, the sum of the resid-
uals is zero; they have, by construction, zero
covariance with each of the x variables, and
thus zero covariance with ŷ; and since the av-
erage residual is zero, the regression surface
passes through the multivariate point of means,
{\bar{x}_1, \bar{x}_2, ..., \bar{x}_k, \bar{y}}.
There are two instances where the simple re-
gression of y on x1 will yield the same coeffi-
cient as the multiple regression of y on x1 and
x2, with respect to x1. In general, the simple re-
gression coefficient will not equal the multiple
regression coefficient, since the simple regres-
sion ignores the effect of x2 (and considers that
it can be viewed as nonsystematic, captured in
the error u). When will the two coefficients be
equal? First, when the coefficient of x2 is truly
zero–that is, when x2 really does not belong in
the model. Second, when x1 and x2 are un-
correlated in the sample. This is likely to be
quite rare in actual data. However, these two
cases suggest when the two coefficients will
be similar; when x2 is relatively unimportant in
explaining y, or when it is very loosely related
to x1.
We can define the same three sums of squares–
SST, SSE, SSR−as in simple regression, and
R2 is still the ratio of the explained sum of
squares (SSE) to the total sum of squares
(SST ). It is no longer a simple correlation (e.g.
ryx) squared, but it still has the interpretation
of a squared simple correlation coefficient: the
correlation between y and ŷ, r_{y,ŷ}. A very im-
portant principle is that R2 never decreases
when an explanatory variable is added to a
able may be, the R² of the expanded regression will be no less than that of the original regression. Thus, the regression R² may be arbitrarily increased by adding variables (even unimportant variables), and we should not be impressed by a high value of R² in a model with a long list of explanatory variables.
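A short demonstration of this property on simulated data: an irrelevant regressor (pure noise, unrelated to y) still cannot lower R². The r_squared helper below is for illustration only, assuming nothing beyond standard numpy:

import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)          # irrelevant regressor, unrelated to y

def r_squared(y, *regressors):
    X = np.column_stack([np.ones(len(y)), *regressors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

print(r_squared(y, x1))        # R^2 of the original model
print(r_squared(y, x1, junk))  # never smaller, even though 'junk' is irrelevant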
Just as with simple regression, it is possible to fit a model through the origin, suppressing the constant term. It is important to note that many of the properties we have discussed no longer hold in that case: for instance, the least squares residuals (the e_i) no longer have a zero sample average, and the R² from such an equation can actually be negative–that is, the equation does worse than the “model” which specifies that ŷ = ȳ for all i. If the population intercept β0 differs from zero, the slope coefficients computed in a regression through the origin will be biased. Therefore, we often will include an intercept, and let the data determine whether it should be zero.
Expected value of the OLS estimators
We now discuss the statistical properties of the
OLS estimators of the parameters in the pop-
ulation regression function. The population
model is taken to be (3). We assume that we
have a random sample of size n on the vari-
ables of the model. The multivariate analogue
to our assumption about the error process is
now:
E (u | x1, x2, ..., xk) = 0 (10)
so that we consider the error process to be
independent of each of the explanatory vari-
ables’ distributions. This assumption would
not hold if we misspecified the model: for in-
stance, if we ran a simple regression with inc
as the explanatory variable, but the population
model also contained inc2. Since inc and inc2
will have a positive correlation, the simple re-
gression’s parameter estimates will be biased.
This bias will also appear if there is a sepa-
rate, important factor that should be included
in the model; if that factor is correlated with
the included regressors, their coefficients will
be biased.
In the context of multiple regression, with sev-
eral independent variables, we must make an
additional assumption about their measured val-
ues:
Proposition 1 In the sample, none of the in-
dependent variables x may be expressed as an
exact linear relation of the others (including a
vector of 1s).
Every multiple regression that includes a con-
stant term can be considered as having a vari-
able x0i = 1 ∀i. This proposition states that
each of the other explanatory variables must
have nonzero sample variance: that is, it may
not be a constant in the sample. Second,
the proposition states that there is no per-
fect collinearity, or multicollinearity, in the
sample. If we could express one x as a linear
combination of the other x variables, this as-
sumption would be violated. If we have perfect
collinearity in the regressor matrix, the OLS es-
timates cannot be computed; mathematically,
they do not exist. A trivial example of perfect
collinearity would be the inclusion of the same
variable twice, measured in different units (or
via a linear transformation, such as tempera-
ture in degrees F versus C). The key concept:
each regressor we add to a multiple regression
must contain information at the margin. It
must tell us something about y that we do not
already know. For instance, if we consider x1 :
proportion of football games won, x2 : pro-
portion of games lost, and x3: proportion of
games tied, and we try to use all three as ex-
planatory variables to model alumni donations
to the athletics program, we find that there
is perfect collinearity: since for every college
in the sample, the three variables sum to one
by construction. There is no information in,
e.g., x3 once we know the other two, so in-
cluding it in a regression with the other two
makes no sense (and renders that regression
uncomputable). We can leave any one of the
three variables out of the regression; it does
not matter which one. Note that this proposi-
tion is not an assumption about the population
model: it is an implication of the sample data
we have to work with. Note also that this only
applies to linear relations among the explana-
tory variables: a variable and its square, for
instance, are not linearly related, so we may
include both in a regression to capture a non-
linear relation between y and x.
Given the four assumptions: that of the pop-
ulation model, the random sample, the zero
conditional mean of the u process, and the ab-
sence of perfect collinearity, we can demon-
strate that the OLS estimators of the popula-
tion parameters are unbiased:
Ebj = βj, j = 0, ..., k (11)
What happens if we misspecify the model by
including irrelevant explanatory variables: x
variables that, unbeknownst to us, are not in
the population model? Fortunately, this does
not damage the estimates. The regression will
still yield unbiased estimates of all of the coef-
ficients, including unbiased estimates of these
variables’ coefficients, which are zero in the
population. It may be improved by removing
such variables, since including them in the re-
gression consumes degrees of freedom (and re-
duces the precision of the estimates); but the
effect of overspecifying the model is rather
benign. The same applies to overspecifying
a polynomial order; including quadratic and
cubic terms when only the quadratic term is
needed will be harmless, and you will find that
the cubic term’s coefficient is far from signifi-
cant.
However, the opposite case–where we under-
specify the model by mistakenly excluding a
relevant explanatory variable–is much more se-
rious. Let us formally consider the direction
and size of bias in this case. Assume that the
population model is:
y = β0 + β1x1 + β2x2 + u (12)
but we do not recognize the importance of x2,
and mistakenly consider the relationship
y = β0 + β1x1 + u (13)
to be fully specified. What are the consequences of estimating the latter relationship? We can show that in this case:
E\,b_1 = \beta_1 + \beta_2\,\frac{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)\,x_{i2}}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2}    (14)
so that the OLS coefficient b1 will be biased–not equal to its population value of β1, even in an expected sense–in the presence of the second term. That term will be nonzero when β2 is nonzero (which it is, by assumption) and when the fraction is nonzero. But the fraction is merely a simple regression coefficient in the auxiliary regression of x2 on x1. If the regressors are correlated with one another, that regression coefficient will be nonzero, and its magnitude will be related to the strength of the correlation (and the units of the variables). Say that the auxiliary regression is:
x_2 = d_0 + d_1 x_1 + v    (15)
with d1 > 0, so that x1 and x2 are positively correlated (e.g. as income and wealth would be in a sample of household data). Then we can write the bias as:
Eb1 − β1 = β2d1 (16)
and its sign and magnitude will depend on both the relation between y and x2 and the interrelation among the explanatory variables. If there is no such relationship–if x1 and x2 are uncorrelated in the sample–then b1 is unbiased (since in that special case multiple regression reverts to simple regression). In all other cases, though, there will be bias in the estimation of the underspecified model. If the left side of (16) is positive, we say that b1 has an upward bias: the OLS value will be too large. If it were negative, we would speak of a downward bias. If the OLS coefficient is closer to zero than the population coefficient, we would say that it is “biased toward zero” or attenuated.
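A Monte Carlo sketch of the omitted-variable bias formula (16), with arbitrary illustrative values β1 = 1, β2 = 0.5, d1 = 0.8, so the expected estimate from the underspecified model is about β1 + β2 d1 = 1.4:

import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 2000
beta1, beta2, d1 = 1.0, 0.5, 0.8     # x2 = d0 + d1*x1 + v in the population

biased = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.3 + d1 * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # underspecified model: regress y on x1 only
    biased[r] = np.cov(x1, y, ddof=1)[0, 1] / np.var(x1, ddof=1)

print(biased.mean())   # roughly beta1 + beta2*d1 = 1.4, not beta1 = 1.0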
It is more difficult to evaluate the potential bias in a multiple regression, where the population relationship involves k variables and we include, for instance, k − 1 of them. All of the OLS coefficients in the underspecified model will generally be biased in this circumstance unless the omitted variable is uncorrelated with each included regressor (a very unlikely outcome). What we can take away as a general rule is the asymmetric nature of specification error: it is far more damaging to exclude a relevant variable than to include an irrelevant variable. When in doubt (and we almost always are in doubt as to the nature of the true relationship) we will always be better off erring on the side of caution, and including variables that we are not certain should be part of the explanation of y.
Variance of the OLS estimators
We first reiterate the assumption of homoskedasticity, in the context of the k-variable regression model:

Var(u | x_1, x_2, ..., x_k) = \sigma^2    (17)
If this assumption is satisfied, then the error
variance is identical for all combinations of the
explanatory variables. If it is violated, we say
that the errors are heteroskedastic, and must
be concerned about our computation of the
OLS estimates’ variances. The OLS estimates
are still unbiased in this case, but our esti-
mates of their variances are not. Given this
assumption, plus the four made earlier, we can
derive the sampling variances, or precision, of
the OLS slope estimators:
Var(b_j) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}, \quad j = 1, ..., k    (18)
where SSTj is the total variation in xj about
its mean, and R²_j is the R² from the auxiliary
regression of x_j on all of the other x
variables, including the constant term. We see
immediately that this formula applies to sim-
ple regression, since the formula we derived for
the slope estimator in that instance is identi-
cal, given that R2j = 0 in that instance (there
are no other x variables). Given the population
error variance σ2, what will make a particular
OLS slope estimate more precise? Its preci-
sion will be increased (i.e. its sampling vari-
ance will be smaller) the larger is the variation
in the associated x variable. Its precision will
be decreased, the larger the amount of vari-
able xj that can be “explained” by other vari-
ables in the regression. In the case of perfect
collinearity, R2j = 1, and the sampling variance
goes to infinity. If R2j is very small, then this
variable makes a large marginal contribution to
the equation, and we may calculate a relatively
more precise estimate of its coefficient. If R2j is
quite large, the precision of the coefficient will
be low, since it will be difficult to “partial out”
the effect of variable j on y from the effects of
the other explanatory variables (with which it
is highly correlated). However, we must has-
ten to add that the assumption that there is no
perfect collinearity does not preclude R2j from
being close to unity–it only states that it is less
than unity. The principle stated above, when
we discussed collinearity, applies here as well: at
the margin, each explanatory variable must add
information that we do not already have, in
whole or in large part, if that variable is to have
a meaningful role in a regression model of y. This for-
mula for the sampling variance of an OLS co-
efficient also explains why we might not want
to overspecify the model: if we include an irrel-
evant explanatory variable, the point estimates
are unbiased, but their sampling variances will
be larger than they would be in the absence
of that variable (unless the irrelevant variable
is uncorrelated with the relevant explanatory
variables).
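A brief sketch evaluating formula (18) for a posited error variance; with only one other regressor, the auxiliary R²_j reduces to the squared correlation between the two regressors (all of the numbers here are invented):

import numpy as np

rng = np.random.default_rng(9)
n, sigma = 500, 2.0
x2 = rng.normal(size=n)
# x1 is strongly related to x2, so the auxiliary R^2 for x1 is high
x1 = 0.9 * x2 + rng.normal(scale=0.3, size=n)

sst1 = np.sum((x1 - x1.mean())**2)            # total variation in x1
r2_1 = np.corrcoef(x1, x2)[0, 1] ** 2         # auxiliary R^2 (only one other regressor here)

var_b1 = sigma**2 / (sst1 * (1 - r2_1))       # equation (18)
print(r2_1, var_b1)
# With uncorrelated regressors the same formula gives sigma^2 / sst1, a smaller sampling variance.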
How do we make (18) operational? As written,
it cannot be computed, since it depends on the
unknown population parameter σ2. Just as in
the case of simple regression, we must replace
σ2 with a consistent estimate:
s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - (k+1)} = \frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}    (19)
where the numerator is just SSR, and the de-
nominator is the sample size, less the number
of estimated parameters: the constant and k
slopes. In simple regression, we computed s²
using a denominator of n − 2, reflecting the two estimated parameters: intercept plus slope.
Now, we must account for the additional slope
parameters. This also suggests that we cannot
estimate a k−variable regression model with-
out having a sample of size at least (k+1). In-
deed, just as two points define a straight line,
the degrees of freedom in simple regression will
be positive iff n > 2. For multiple regression,
with k slopes and an intercept, n > (k + 1).
Of course, in practice, we would like to use a
much larger sample than this in order to make
inferences about the population.
The positive square root of s2 is known as
the standard error of regression, or SER.
(Stata reports s on the regression output la-
belled “Root MSE”, or root mean squared er-
ror). It is in the same units as the dependent
variable, and is the numerator of our estimated
standard errors of the OLS coefficients. The
magnitude of the SER is often compared to
the mean of the dependent variable to gauge
the regression’s ability to “explain” the data.
In the presence of heteroskedasticity–where the
variance of the error process is not constant
over the sample–the estimate of s2 presented
above will be biased. Likewise, the estimates
of coefficients’ standard errors will be biased,
since they depend on s2. If there is reason to
worry about heteroskedasticity in a particular
sample, we must work with a different ap-
proach to compute these measures.
Efficiency of OLS estimators
An important result, which underlies the widespread
use of OLS regression, is the Gauss-Markov
Theorem, describing the relative efficiency of
the OLS estimators. Under the assumptions
that we have made above for multiple regression–
and making no further distributional assump-
tions about the error process–we may show
that:
Proposition 2 (Gauss-Markov) Among the
class of linear, unbiased estimators of the pop-
ulation regression function, OLS provides the
best estimators, in terms of minimum sampling
variance: OLS estimators are best linear unbi-
ased estimators (BLUE).
This theorem only considers estimators that
have these two properties of linearity and unbi-
asedness. Linearity means that the estimator–
the rule for computing the estimates–can be
written as a linear function of the data y (es-
sentially, as a weighted average of the y val-
ues). OLS clearly meets this requirement. Un-
der the assumptions above, OLS estimators
are also unbiased. Given those properties, the
proof of the Gauss-Markov theorem demon-
strates that the OLS estimators have the mini-
mum sampling variance of any possible estima-
tor: that is, they are the “best” (most precise)
that could possibly be calculated. This theo-
rem is not based on the assumption that, for
instance, the u process is Normally distributed;
only that it is independent of the x variables
and homoskedastic (that is, that it is i.i.d.).
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 4: Multiple regression analysis:
Inference
We have discussed the conditions under which
OLS estimators are unbiased, and derived the
variances of these estimators under the Gauss-
Markov assumptions. The Gauss-Markov the-
orem establishes that OLS estimators have the
smallest variance of any linear unbiased estima-
tors of the population parameters. We must
now more fully characterize the sampling distri-
bution of the OLS estimators–beyond its mean
and variance–so that we may test hypotheses
on the population parameters. To make the
sampling distribution tractable, we add an as-
sumption on the distribution of the errors:
Proposition 1 MLR6 (Normality) The popu-
lation error u is independent of the explanatory
variables x1, ..., xk and is normally distributed
with zero mean and constant variance: u ∼ N(0, σ²).
This is a much stronger assumption than we
have previously made on the distribution of the
errors. The assumption of normality, as we
have stated it, subsumes both the assumption
of the error process being independent of the
explanatory variables, and that of homoskedas-
ticity. For cross-sectional regression analysis,
these six assumptions define the classical lin-
ear model. The rationale for normally dis-
tributed errors is often phrased in terms of the
many factors influencing y being additive, ap-
pealing to the Central Limit Theorem to sug-
gest that the sum of a large number of random
factors will be normally distributed. Although
we might have reason in a particular context
to doubt this rationale, we usually use it as a
working hypothesis. Various transformations–
such as taking the logarithm of the dependent
variable–are often motivated in terms of their
inducing normality in the resulting errors.
What is the importance of assuming normal-
ity for the error process? Under the assump-
tions of the classical linear model, normally dis-
tributed errors give rise to normally distributed
OLS estimators:
b_j \sim N(\beta_j,\ Var(b_j))    (1)

which will then imply that:

\frac{b_j - \beta_j}{\sigma_{b_j}} \sim N(0, 1)    (2)
This follows since each of the bj can be writ-
ten as a linear combination of the errors in the
sample. Since we assume that the errors are in-
dependent, identically distributed normal ran-
dom variates, any linear combination of those
errors is also normally distributed. We may
also show that any linear combination of the
bj is also normally distributed, and a subset
of these estimators has a joint normal distri-
bution. These properties will come in handy
in formulating tests on the coefficient vector.
We may also show that the OLS estimators
will be approximately normally distributed (at
least in large samples), even if the underlying
errors are not normally distributed.
Testing an hypothesis on a single βj
To test hypotheses about a single population
parameter, we start with the model containing
k regressors:
y = β0 + β1x1 + β2x2 + ...+ βkxk + u (3)
Under the classical linear model assumptions,
a test statistic formed from the OLS estimates
may be expressed as:
\frac{b_j - \beta_j}{s_{b_j}} \sim t_{n-k-1}    (4)
Why does this test statistic differ from (2)
above? In that expression, we considered the
variance of bj as an expression including σ, the
unknown standard deviation of the error term
(that is, √σ²). In this operational test statistic
(4), we have replaced σ with a consistent es-
timate, s. That additional source of sampling
variation requires the switch from the standard
normal distribution to the t distribution, with
(n−k−1) degrees of freedom. Where n is not
all that large relative to k, the resulting t distri-
bution will have considerably fatter tails than
the standard normal. Where (n − k − 1) is a
large number–greater than 100, for instance–
the t distribution will essentially be the stan-
dard normal. The net effect is to make the
critical values larger for a finite sample, and
raise the threshold at which we will conclude
that there is adequate evidence to reject a par-
ticular hypothesis.
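As a check on those thresholds, Stata's invttail() function returns the critical value isolating a given probability in the upper tail; a small sketch, with 20 and 120 degrees of freedom chosen purely for illustration:

* two-tailed 5% critical values for 20 and 120 degrees of freedom
di invttail(20, 0.025)
di invttail(120, 0.025)
* the limiting standard normal critical value, for comparison
di invnormal(0.975)

The first value is noticeably larger than 1.96, while the second is nearly indistinguishable from it.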
The test statistic (4) allows us to test hypothe-
ses regarding the population parameter βj : in
particular, to test the null hypothesis
H0 : βj = 0 (5)
for any of the regression parameters. The
“t-statistic” used for this test is merely that
printed on the output when you run a regres-
sion in Stata or any other program: the ratio
of the estimated coefficient to its estimated
standard error. If the null hypothesis is to be
rejected, the “t-stat” must be larger (in ab-
solute value) than the critical point on the t-
distribution. The “t-stat” will have the same
sign as the estimated coefficient, since the stan-
dard error is always positive. Even if βj is actu-
ally zero in the population, a sample estimate
of this parameter, bj, will never equal exactly
zero. But when should we conclude that it
could be zero? When its value cannot be dis-
tinguished from zero. There will be cause to
reject this null hypothesis if the value, scaled
by its standard error, exceeds the threshold.
For a “two-tailed test,” there will be reason to
reject the null if the “t-stat” takes on a large
negative value or a large positive value; thus
we reject in favor of the alternative hypothesis
(of βj 6= 0) in either case. This is a two-sided
alternative, giving rise to a two-tailed test. If
the hypothesis is to be tested at, e.g., the 95%
level of confidence, we use critical values from
the t-distribution which isolate 2.5% in each
tail, for a total of 5% of the mass of the dis-
tribution. When using a computer program to
calculate regression estimates, we usually are
given the “p-value” of the estimate–that is,
the tail probability corresponding to the coef-
ficient’s t-value. The p-value may usefully be
considered as the probability of observing a t-
statistic as extreme as that shown if the null
hypothesis is true. If the t-value was equal to,
e.g., the 95% critical value, the p-value would
be exactly 0.05. If the t-value was higher, the
p-value would be closer to zero, and vice versa.
Thus, we are looking for small p-values as in-
dicative of rejection. A p-value of 0.92, for in-
stance, corresponds to an hypothesis that can
be rejected at the 8% level of confidence–thus
quite irrelevant, since we would expect to find
a value that large 92% of the time under the
null hypothesis. On the other hand, a p-value
of 0.08 will reject at the 90% level, but not at
the 95% level; only 8% of the time would we
expect to find a t-statistic of that magnitude
if H0 was true.
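Those reported p-values can be reproduced with Stata's ttail() function, which returns the upper-tail probability of the t distribution; a brief sketch, with y, x1 and x2 standing in for whatever variables are in the model:

regress y x1 x2
* two-tailed p-value for the coefficient on x1
di 2*ttail(e(df_r), abs(_b[x1]/_se[x1]))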
What if we have a one-sided alternative? For
instance, we may phrase the hypothesis of in-
terest as:
H0 : βj ≥ 0 (6)
HA : βj < 0
Here, we must use the appropriate critical point
on the t-distribution to perform this test at the
same level of confidence. If the point estimate
bj is positive, then we do not have cause to
reject the null. If it is negative, we may have
cause to reject the null if it is a sufficiently
large negative value. The critical point should
be that which isolates 5% of the mass of the
distribution in that tail (for a 95% level of con-
fidence). This critical value will be smaller (in
absolute value) than that corresponding to a
two-tailed test, which isolates only 2.5% of the
mass in that tail. The computer program al-
ways provides you with a p-value for a two-
tailed test; if the p-value is 0.08, for instance,
it corresponds to a one-tailed p-value of 0.04
(that being the mass in that tail).
Testing other hypotheses about βj
Every regression output includes the informa-
tion needed to test the two-tailed or one-tailed
hypotheses that a population parameter equals
zero. What if we want to test a different hy-
pothesis about the value of that parameter?
For instance, we would not consider it sensible
for the mpc for a consumer to be zero, but we
might have an hypothesized value (of, say, 0.8)
implied by a particular theory of consumption.
How might we test this hypothesis? If the null
is stated as:
H0 : βj = aj (7)
where aj is the hypothesized value, then the
appropriate test statistic becomes:
(bj − aj)/sbj ∼ tn−k−1 (8)
and we may simply calculate that quantity and
compare it to the appropriate point on the t-
distribution. Most computer programs provide
you with assistance in this effort; for instance,
if we believed that aj, the coefficient on bdrms,
should be equal to $20,000 in a regression of
house prices on square footage and bdrms (e.g.
using HPRICE1), we would use Stata’s test
command:
regress price bdrms sqrft
test bdrms=20000
where we use the name of the variable as a
shorthand for the name of the coefficient on
that variable. Stata, in that instance, presents
us with:
( 1) bdrms = 20000.0
F( 1, 85) = 0.26
Prob > F = 0.6139
making use of an F-statistic, rather than a t-
statistic, to perform this test. In this partic-
ular case–of an hypothesis involving a single
regression coefficient–we may show that this
F-statistic is merely the square of the asso-
ciated t-statistic. The p-value would be the
same in either case. The estimated coefficient
is 15198.19, with an estimated standard error
of 9483.517. Plugging in these values to (8)
yields a t-statistic:
. di (_b[bdrms]-20000)/_se[bdrms]
-.50633208
which, squared, is the F-statistic shown by the
test command. Just as with tests against a
null hypothesis of zero, the results of the test
command may be used for one-tailed tests as
well as two-tailed tests; then, the magnitude of
the coefficient matters (i.e. the fact that the
estimated coefficient is about $15,000 means
we would never reject a null that it is less than
$20,000), and the p-value must be adjusted for
one tail. Any number of test commands may
be given after a regress command in Stata,
testing different hypotheses about the coeffi-
cients.
Confidence intervals
As we discussed in going over Appendix C, we
may use the point estimate and its estimated
standard error to calculate an hypothesis test
on the underlying population parameter, or we
may form a confidence interval for that pa-
rameter. Stata makes that easy in a regression
context by providing the 95% confidence inter-
val for every estimated coefficient. If you want
to use some other level of significance, you may
either use the level() option on regress (e.g.
regress price bdrms sqrft, level(90)) or you
may change the default level for this run with
set level. All further regressions will report
confidence intervals with that level of confi-
dence. To connect this concept to that of the
hypothesis test, consider that in the above ex-
ample the 95% confidence interval for βbdrms extended from −3657.581 to 34053.96; thus, an hypothesis test with the null that βbdrms takes on any value in this interval (including zero) will not lead to a rejection.
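That interval can be reproduced by hand from the coefficient, its standard error and the t critical value; a minimal sketch using the HPRICE1 regression above:

regress price bdrms sqrft
* 95% confidence interval for the bdrms coefficient, computed manually
di _b[bdrms] - invttail(e(df_r), 0.025)*_se[bdrms]
di _b[bdrms] + invttail(e(df_r), 0.025)*_se[bdrms]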
Testing hypotheses about a single linear
combination of the parameters
Economic theory will often suggest that a par-
ticular linear combination of parameters should
take on a certain value: for instance, in a
Cobb-Douglas production function, that the
slope coefficients should sum to one in the case
of constant returns to scale (CRTS):
Q = A·L^β1·K^β2·E^β3 (9)
log Q = log A + β1 log L + β2 log K + β3 log E + υ
where K,L,E are the factors capital, labor, and
energy, respectively. We have added an error
term to the double-log-transformed version of
this model to represent it as an empirical re-
lationship. The hypothesis of CRTS may be
stated as:
H0 : β1 + β2 + β3 = 1 (10)
The test statistic for this hypothesis is quite
straightforward:
(b1 + b2 + b3 − 1)/sb1+b2+b3 ∼ tn−k−1 (11)
and its numerator may be easily calculated.
The denominator, however, is not so simple; it
represents the standard error of the linear com-
bination of estimated coefficients. You may
recall that the variance of a sum of random
variables is not merely the sum of their vari-
ances, but an expression also including their
covariances, unless they are independent. The
random variables {b1, b2, b3} are not indepen-
dent of one another since the underlying re-
gressors are not independent of one another.
Each of the underlying regressors is assumed
to be independent of the error term u, but
not of the other regressors. We would expect,
for instance, that firms with a larger capital
stock also have a larger labor force, and use
more energy in the production process. The
variance (and standard error) that we need
may be readily calculated by Stata, however,
from the variance-covariance matrix of the es-
timated parameters via the test command:
test cap+labor+energy=1
will provide the appropriate test statistic, again
as an F-statistic with a p-value. You may in-
terpret this value directly. If you would like the
point and interval estimate of the hypothesized
combination, you can compute that (after a re-
gression) with the lincom (linear combination)
command:
lincom cap + labor + energy
will show the sum of those values and a confi-
dence interval for that sum.
We may also use this technique to test other
hypotheses than adding-up conditions on the
parameters. For instance, consider a two-factor
Cobb-Douglas function in which you have only
labor and capital, and you want to test the hy-
pothesis that labor’s share is 2/3. This implies
that the labor coefficient should be twice the
capital coefficient, or:
H0 : βL = 2βK, or (12)
H0 :βLβK
= 2, or
H0 : βL − 2βK = 0
Note that this does not allow us to test a non-
linear hypothesis on the parameters: but con-
sidering that a ratio of two parameters is a
constant is not a nonlinear restriction. In the
latter form, we may specify it to Stata’s test
command as:
test labor - 2*cap = 0
In fact, Stata will figure out that form if you
specify the hypothesis as:
test labor=2*cap
(rewriting it in the above form), but it is not
quite smart enough to handle the ratio form.
It is easy to rewrite the ratio form into one
of the other forms. Either form will produce
an F-statistic and associated p-value related to
this single linear hypothesis on the parameters
which may be used to make a judgment about
the hypothesis of interest.
Testing multiple linear restrictions
When we use the test command, an F-statistic
is reported–even when the test involves only
one coefficient–because in general, hypothesis
tests may involve more than one restriction on
the population parameters. The hypotheses
discussed above–even that of CRTS, involv-
ing several coefficients–still only represent one
restriction on the parameters. For instance, if
CRTS is imposed, the elasticities of the factors
of production must sum to one, but they may
individually take on any value. But in most
applications of multiple linear regression, we
concern ourselves with joint tests of restric-
tions on the parameters.
The simplest joint test is that which every re-
gression reports: the so-called “ANOVA F”
test, which has the null hypothesis that each
of the slopes is equal to zero. Note that in a
multiple regression, specifying that each slope
individually equals zero is not the same thing
as specifying that their sum equals zero. This
“ANOVA” (ANalysis Of VAriance) F-test is of
interest since it essentially tests whether the
entire regression has any explanatory power.
The null hypothesis, in this case, is that the
“model” is y = β0 + u : that is, none of the
explanatory variables assist in explaining the
variation in y. We cannot test any hypothesis
on the R2 of a regression, but we will see that
there is an intimate relationship between the
R2 and the ANOVA F:
R2 = SSE/SST (13)
F = (SSE/k) / (SSR/(n − (k + 1)))
∴ F = (R2/k) / ((1 − R2)/(n − (k + 1)))
where the ANOVA F, the ratio of mean square
explained variation to mean square unexplained
variation, is distributed as F(k, n−(k+1)) under the null hypothesis. For a simple regression, this statistic is F(1, n−2), which is identical to (tb1,n−2)²: that is, the square of the t−statistic for the slope coefficient, with precisely the same p−value as that t−statistic. In a multiple regres-
sion context, we do not often find an insignif-
icant F− statistic, since the null hypothesis is
a very strong statement: that none of the ex-
planatory variables, taken singly or together,
explain any significant fraction of the variation
of y about its mean. That can happen, but it
is often somewhat unlikely.
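The relationship in (13) can be verified from the results Stata stores after any regression; a rough illustration, with y and the x's standing in for the variables of your model:

regress y x1 x2 x3
* reconstruct the ANOVA F from R-squared and the degrees of freedom
di (e(r2)/e(df_m)) / ((1 - e(r2))/e(df_r))
* compare with the F statistic stored by regress
di e(F)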
The ANOVA F tests k exclusion restrictions:
that all k slope coefficients are jointly zero. We
may use an F-statistic to test that a number of
slope coefficients are jointly equal to zero. For
instance, consider a regression of 353 major
league baseball players’ salaries (from MLB1).
If we regress lsalary (log of player’s salary)
on years (number of years in majors), gamesyr
(number of games played per year), and sev-
eral variables indicating the position played (
frstbase, scndbase, shrtstop, thrdbase, catcher),
we get an R2 of 0.6105, and an ANOVA F
(with 7 and 345 d.f.) of 77.24 with a p−value of zero. The overall regression is clearly
significant, and the coefficients on years and
gamesyr both have the expected positive and
significant coefficients. Only one of the five coefficients on the positions played, however, is significantly different from zero at the 5%
level: scndbase, with a negative value (-0.034)
and a p− value of 0.015. The frstbase and
shrtstop coefficients are also negative (but in-
significant), while the thrdbase and catcher co-
efficients are positive and insignificant. Should
we just remove all of these variables (except
for scndbase)? The F-test for these five exclu-
sion restrictions will provide an answer to that
question:
. test frstbase scndbase shrtstop
thrdbase catcher
( 1) frstbase = 0.0
( 2) scndbase = 0.0
( 3) shrtstop = 0.0
( 4) thrdbase = 0.0
( 5) catcher = 0.0
F( 5, 345) = 2.37
Prob > F = 0.0390
At the 95% level of confidence, we reject the hypothesis that these coefficients are jointly zero. That result, of
course, could be largely driven by the scndbase
coefficient:
. test frstbase shrtstop thrdbase catcher
( 1) frstbase = 0.0
( 2) shrtstop = 0.0
( 3) thrdbase = 0.0
( 4) catcher = 0.0
F( 4, 345) = 1.56
Prob > F = 0.1858
So perhaps it would be sensible to remove these
four, which even when taken together do not
explain a meaningful fraction of the variation
in lsalary. But this illustrates the point of the
joint hypothesis test: the result of simulta-
neously testing several hypotheses (that, for
instance, individual coefficients are equal to
zero) cannot be inferred from the results of
the individual tests. If each coefficient is sig-
nificant, then a joint test will surely reject the
joint exclusion restriction; but the converse is
assuredly false.
Notice that a joint test of exclusion restrictions
may be easily conducted by Stata's test com-
mand, by merely listing the variables whose co-
efficients are presumed to be zero under the
null hypothesis. The resulting test statistic
is an F with as many numerator degrees of
freedom as there are coefficients (or variables)
in the list. It can be written in terms of the
residual sums of squares (SSRs) of the “unre-
stricted” and “restricted” models:
F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)] (14)
Since adding variables to a model will never de-
crease SSR (nor decrease R2), the “restricted”
model–in which certain coefficients are not freely
estimated from the data, but constrained–must
have SSR at least as large as the “unrestricted”
model, in which all coefficients are data-determined
at their optimal values. Thus the difference
in the numerator is non-negative. If it is a
large value, then the restrictions severely di-
minish the explanatory power of the model.
The amount by which it is diminished is scaled
by the number of restrictions, q, and then di-
vided by the unrestricted model’s s2. If this ra-
tio is a large number, then the “average cost
per restriction” is large relative to the explana-
tory power of the unrestricted model, and we
have evidence against the null hypothesis (that
is, the F− statistic will be larger than the crit-
ical point on an F− table with q and (n−k−1)
degrees of freedom). If the ratio is smaller than
the critical value, we do not reject the null
hypothesis, and conclude that the restrictions
are consistent with the data. In this circum-
stance, we might then reformulate the model
with the restrictions in place, since they do
not conflict with the data. In the baseball
player salary example, we might drop the four
insignificant variables and reestimate the more
parsimonious model.
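The statistic in (14) can also be computed directly from the residual sums of squares that Stata stores as e(rss); a sketch for the five exclusion restrictions in the baseball example (variable names as above):

* unrestricted model
regress lsalary years gamesyr frstbase scndbase shrtstop thrdbase catcher
scalar ssr_ur = e(rss)
scalar df_ur = e(df_r)
* restricted model, with the five position dummies excluded
regress lsalary years gamesyr
scalar ssr_r = e(rss)
* subset F statistic with q = 5 restrictions
di ((ssr_r - ssr_ur)/5) / (ssr_ur/df_ur)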
Testing general linear restrictions
The apparatus described above is far more pow-
erful than it might appear. We have considered
individual tests involving a linear combination
of the parameters (e.g. CRTS) and joint tests
involving exclusion restrictions (as in the base-
ball players’ salary example). But the “subset
F” test defined in (14) is capable of being ap-
plied to any set of linear restrictions on the
parameter vector: for instance, that β1 = 0,
β2+β3+β4 = 1, and β5 = −1. What would this
set of restrictions imply about a regression of
y on {X1, X2, X3, X4, X5}? That regression, in
its unrestricted form, would have k = 5, with 5
estimated slope coefficients and an intercept.
The joint hypotheses expressed above would
state that a restricted form of this equation
would have three fewer parameters, since β1
would be constrained to zero, β5 to -1, and
one of the coefficients {β2, β3, β4} expressed
in terms of the other two. In the terminol-
ogy of (14), q = 3. How would we test the
hypothesis? We can readily calculate SSRur,
but what about SSRr? One approach would
be to algebraically substitute the restrictions
in the model, estimate that restricted model,
and record its SSRr value. This can be done
with any computer program that estimates a
multiple regression, but it requires that you do
the algebra and transform the variables accord-
ingly. (For instance, constraining β5 to -1 im-
plies that you should form a new dependent
variable, (y +X5)). Alternatively, if you are us-
ing a computer program that can test linear
restrictions, you may use its features. Stata
will test general linear restrictions of this sort
with the test command:
regress y x1 x2 x3 x4 x5
test (x1) (x2+x3+x4=1) (x5=-1)
This test command will print an F-statistic for
the set of three linear restrictions on the re-
gression: for instance,
( 1) years = 0.0
( 2) frstbase + scndbase + shrtstop = 1.0
( 3) sbases = -1.0
F( 3, 347) = 38.54
Prob > F = 0.0000
The F-test will have three numerator degrees
of freedom, because you have specified three
linear hypotheses to be jointly applied to the
coefficient vector. This syntax of test may
be used to construct any set of linear restric-
tions on the coefficient vector, and perform the
joint test for the validity of those restrictions.
The test statistic will reject the null hypoth-
esis (that the restrictions are consistent with
the data) if its value is large relative to the
underlying F-distribution.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 6: Multiple regression analysis:
Further issues
What effects will the scale of the X and y vari-
ables have upon multiple regression? The co-
efficients’ point estimates are ∂y/∂Xj, so they
are in the scale of the data–for instance, dol-
lars of wage per additional year of education.
If we were to measure either y or X in differ-
ent units, the magnitudes of these derivatives
would change, but the overall fit of the regres-
sion equation would not. Regression is based
on correlation, and any linear transformation
leaves the correlation between two variables
unchanged. The R2, for instance, will be un-
affected by the scaling of the data. The stan-
dard error of a coefficient estimate is in the
same units as the point estimate, and both
will change by the same factor if the data are
scaled. Thus, each coefficient’s t− statistic
will have the same value, with the same p−
value, irrespective of scaling. The standard
error of the regression (termed “Root MSE”
by Stata) is in the units of the dependent vari-
able. The ANOVA F, based on R2, will be
unchanged by scaling, as will be all F-statistics
associated with hypothesis tests on the param-
eters. As an example, consider a regression of
babies’ birth weight, measured in pounds, on
the number of cigarettes per day smoked by
their mothers. This regression would have the
same explanatory power if we measured birth
weight in ounces, or kilograms, or alternatively
if we measured nicotine consumption by the
number of packs per day rather than cigarettes
per day.
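A small sketch of this invariance, assuming a dataset with variables bwght (birth weight) and cigs (cigarettes smoked per day):

regress bwght cigs
* rescale the dependent variable (e.g. pounds to ounces) and re-estimate
gen bwght16 = 16*bwght
regress bwght16 cigs
* R-squared, t-statistics and p-values are identical across the two runs;
* only the coefficients and standard errors are scaled by 16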
A corollary to this result applies to a dependent
variable measured in logarithmic form. Since
the slope coefficient in this case is an elas-
ticity or semi-elasticity, a change in the de-
pendent variable’s units of measurement does
not affect the slope coefficient at all (since
log(cy) = log c + log y), but rather just shows
up in the intercept term.
Beta coefficients
In economics, we generally report the regres-
sion coefficients’ point estimates when present-
ing regression results. Our coefficients often
have natural units, and those units are mean-
ingful. In other disciplines, many explanatory
variables are indices (measures of self-esteem,
or political freedom, etc.), and the associated
regression coefficients’ units are not well de-
fined. To evaluate the relative importance of
a number of explanatory variables, it is com-
mon to calculate so-called beta coefficients–
standardized regression coefficients, from a re-
gression of y∗ on X∗, where the starred vari-
ables have been “z-transformed.” This trans-
formation (subtracting the mean and dividing
by the sample standard deviation) generates
variables with a mean of zero and a standard
deviation of one. In a regression of standard-
ized variables, the (beta) coefficient estimates
∂y∗/∂X∗ express the effect of a one standard
deviation change in Xj in terms of standard
deviations of y. The explanatory variable with
the largest (absolute) beta coefficient thus has
the biggest “bang for the buck” in terms of an
effect on y. The intercept in such a regres-
sion is zero by construction. You need not
perform this standardization in most regression
programs to compute beta coefficients; for in-
stance, in Stata, you may just use the beta op-
tion, e.g. regress lsalary years gamesyr scndbase,
beta which causes the beta coefficients to be
printed (rather than the 95% confidence in-
terval for each coefficient) on the right of the
regression output.
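Equivalently, we could z-transform the variables ourselves and rerun the regression; a brief sketch, with y and x standing in for any two variables:

* standardize the variables by hand
egen zy = std(y)
egen zx = std(x)
regress zy zx
* the slope on zx equals the beta coefficient reported by: regress y x, beta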
Logarithmic functional forms
Many econometric models make use of vari-
ables measured in logarithms: sometimes the
dependent variable, sometimes both dependent
and independent variables. Using the “double-
log” transformation (of both y and X) we can
turn a multiplicative relationship, such as a
Cobb-Douglas production function, into a lin-
ear relation in the (natural) logs of output and
the factors of production. The estimated co-
efficients are, themselves, elasticities: that is,
∂ log y/∂ logXj, which have the units of per-
centage changes. The “single-log” transfor-
mation regresses log y on X, measured in nat-
ural units (alternatively, some columns of X
might be in logs, and some columns in lev-
els). If we are interpreting the coefficient on
a levels variable, it is ∂ log y/∂Xj, or approx-
imately the percentage change in y resulting
from a one unit change in X. We often use
this sort of model to estimate an exponen-
tial trend–that is, a growth rate–since if the
X variable is t, we have ∂ log y/∂t, or an es-
timate of the growth rate of y. The interpre-
tation of regression coefficients as percentage
changes depends on an approximation, that
log(1 + x) ≈ x for small x. If x is sizable–
and we seek the effect for a discrete change
in x− then we must take care with that ap-
proximation. The exact percentage change,
%∆y = 100·[exp(bj ∆Xj) − 1], will give us a
more accurate prediction of the change in y.
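A minimal sketch of this adjustment, assuming a wage equation with lwage, educ and exper:

regress lwage educ exper
* approximate percentage effect of one more year of education
di 100*_b[educ]
* exact percentage change implied by a one-unit change in educ
di 100*(exp(_b[educ]) - 1)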
Why do so many econometric models utilize
logs? For one thing, a model with a log de-
pendent variable often more closely satisfies
the assumptions we have made for the classi-
cal linear model. Most economic variables are
constrained to be positive, and their empirical
distributions may be quite non-normal (think
of the income distribution). When logs are
applied, the distributions are better behaved.
Taking logs also reduces the extrema in the
data, and curtails the effects of outliers. We
often see economic variables measured in dol-
lars in log form, while variables measured in
units of time, or interest rates, are often left
in levels. Variables which are themselves ratios
are often left in that form in empirical work
(although they could be expressed in logs; but
something like an unemployment rate already
has a percentage interpretation). We must
be careful when discussing ratios to distinguish
between a 0.01 change and a one unit change. If the unemployment rate is measured as a decimal, e.g. 0.05 or 0.06, we might be concerned with the effect of a 0.01 change (a one percentage point increase in unemployment)–which will be 1/100 of the regression coefficient’s magnitude!
Polynomial functional forms
We often make use of polynomial functional
forms–or their simplest form, the quadratic–to
represent a relationship that is not likely to be
linear. If y is regressed on x and x2, it is im-
portant to note that we must calculate ∂y/∂x
taking account of this form–that is, we cannot
consider the effect of changing x while holding
x2 constant. Thus, ∂y/∂x = b1 + 2b2x, and
the slope in {x, y} space will depend upon the
level of x at which we evaluate the derivative.
In many applications, b1 > 0 while b2 < 0, so
that while x is increasing, y is increasing at a
decreasing rate, or levelling off. Naturally, for
sufficiently large x, y will take on smaller val-
ues, and in the limit will become negative; but
in the range of the data, y will often appear
to be a concave function of x. We could also
have the opposite sign pattern, b1 < 0 while
b2 > 0, which will lead to a U-shaped relation
in the {x, y} plane, with y decreasing, reaching
a minimum, and increasing–somewhat like an
average cost curve. Higher-order polynomial
terms may also be used, but they are not as
commonly found in empirical work.
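After fitting a quadratic, the slope at any chosen x and the implied turning point, −b1/(2·b2), can be computed from the stored coefficients; a sketch with exper and its square as the regressors:

gen expersq = exper^2
regress lwage exper expersq
* slope dy/dx evaluated at exper = 10
di _b[exper] + 2*_b[expersq]*10
* turning point of the quadratic
di -_b[exper]/(2*_b[expersq])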
Interaction terms
An important technique that allows for non-
linearities in an econometric model is the use
of interaction terms–the product of explana-
tory variables. For instance, we might model
the house price as a function of bdrms, sqft,
and sqft· bdrms, which would make the partial
derivatives with respect to each factor depend
upon the other. For instance, ∂price/∂bdrms =
bbdrms + bsqft·bdrms · sqft, so that the effect of an
additional bedroom on the price of the house
also depends on the size of the house. Like-
wise, the effect of additional square footage
(e.g. an addition) depends on the number of
bedrooms. Since a model with no interaction
terms is a special case of this model, we may
readily test for the presence of these nonlin-
earities by examining the significance of the
interaction term’s estimated coefficient. If it
is significant, the interaction term is needed to
capture the relationship.
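With the interaction included, lincom gives a point and interval estimate of the marginal effect at any chosen value of the other variable; a sketch for the house-price example, where the 2,000 square foot figure is purely illustrative:

gen sqrbdrms = sqrft*bdrms
regress price bdrms sqrft sqrbdrms
* effect of an additional bedroom for a 2,000 square foot house
lincom bdrms + 2000*sqrbdrms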
Adjusted R2
In presenting multiple regression, we established
that R2 cannot decrease when additional ex-
planatory variables are added to the model,
even if they have no significant effect on y.
A “longer” model will always appear to be su-
perior to a “shorter” model, even though the
latter is a more parsimonious representation of
the relationship. How can we deal with this in
comparing alternative models, some of which
may have many more explanatory factors than
others? We can express the standard R2 as:
R2 = 1 − SSR/SST = 1 − (SSR/n)/(SST/n) (1)
Since all models with the same dependent vari-
able will have the same SST, and SSR cannot
increase with additional variables, R2 is a non-
decreasing function of k. An alternative mea-
sure, computed by most econometrics pack-
ages, is the so-called “R-bar-squared” or ‘Ad-
justed R2” :
R̄2 = 1 − [SSR/(n − (k + 1))] / [SST/(n − 1)] (2)
where the numerator and denominator of R̄2
are divided by their respective degrees of free-
dom (just as they are in computing the mean
squared measures in the ANOVA F table). For
a given dependent variable, the denominator
does not change; but the numerator, which
is s2, may rise or fall as k is increased. An
additional regressor uses one more degree of
freedom, so (n − (k + 1)) declines; and SSR
declines as well (or remains unchanged). If
SSR declines by a larger percentage than the
degrees of freedom, then R̄2 rises, and vice versa. Adding a number of regressors with little explanatory power will increase R2, but will decrease R̄2, which may even become negative! R̄2 does not have the interpretation of a
squared correlation coefficient, nor of a “bat-
ting average” for the model. But it may be
used to compare different models of the same
dependent variable. Note, however, that we
cannot make statistical judgments based on
this measure; for instance, we can show that
R̄2 will rise if we add one variable to the model with a |t| > 1, but a t of unity is never significant. Thus, an increase in R̄2 cannot be taken as meaningful (the coefficients must be examined for significance) but, conversely, if a “longer” model has a lower R̄2, its usefulness is cast in doubt. R̄2 is also useful in that it
can be used to compare non-nested models–
i.e. two models, neither of which is a proper
subset of the other. A “subset F” test cannot
be used to compare these models, since there
is no hypothesis under which the one model
emerges from restrictions on the other, and
vice versa. R̄2 may be used to make informal comparisons of non-nested models, as long as they have the same dependent variable. Stata presents R̄2 as the “Adj R-squared” on the regression output.
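Formula (2) can be checked against the results Stata stores after any regression; a brief sketch, with y and the x's standing in for your variables:

regress y x1 x2
* adjusted R-squared computed from its definition (SST = e(mss) + e(rss))
di 1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1))
* compare with the stored value
di e(r2_a)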
Prediction and residual analysis
The predictions of a multiple regression are,
simply, the evaluation of the regression line
for various values of the explanatory variables.
We can always calculate y for each observa-
tion used in the regression; these are known
as “in-sample” or “ex post” predictions. Since
the estimated regression equation is a func-
tion, we can evaluate the function for any set
of values {X01, X02, ..., X0k} and form the associ-
ated point estimate y0, which might be termed
an “out-of-sample” or “ex ante” forecast of
the regression equation. How reliable are the
forecasts of the equation? Since the predicted
values are linear combinations of the b values,
we can calculate an interval estimate for the
predicted value. This is the confidence inter-
val for E(y0): that is, the average value that
would be predicted by the model for a specific
set of X values. This may be calculated after
any regression in Stata using the predict com-
mand’s stdp option: that is, predict stdpred,
stdp will save a variable named “stdpred” con-
taining the standard error of prediction. The
95% confidence interval will then be, for large
samples, {ŷ − 1.96·stdpred, ŷ + 1.96·stdpred}. An
illustration of this confidence interval for a sim-
ple regression is given here. Note that the con-
fidence intervals are parabolic, with the mini-
mum width interval at X, widening symmetri-
cally as we move farther from X. For a multiple
regression, the confidence interval will be nar-
rowest at the multivariate point of means of
the X ′s.
[Figure: prediction interval for E(y): fitted values of Displacement (cu. in.) plotted against Weight (lbs.), with lower (plo) and upper (phi) confidence bands.]
However, if we want a confidence interval for
a specific value of y− rather than for the mean
of y− we must also take into account the fact
that a predicted value of y will contain an er-
ror, u. On average, that error is assumed to be
zero; that is, E(u) = 0. For a specific value of
y, though, there will be an error ui; we do not
know its magnitude, but we have estimated
that it is drawn from a distribution with stan-
dard error s. Thus, the standard error of fore-
cast will include this additional source of un-
certainty, and confidence intervals formed for
specific values of y will be wider than those as-
sociated with predictions of the mean y. This
standard error of forecast series can be calcu-
lated, after a regression has been estimated,
with the predict command, specifying the stdf
option. If the variable stdfc is created, the
95% confidence interval will then be, for large
samples, {ŷ − 1.96·stdfc, ŷ + 1.96·stdfc}. An illus-
tration of this confidence interval for a simple
regression is given here, juxtaposed with that
shown earlier for the standard error of predic-
tion. As you can see, the added uncertainty
associated with a draw from the error distribu-
tion makes the prediction interval much wider.
[Figure: prediction intervals for E(y) and for a specific value of y: fitted values of Displacement (cu. in.) plotted against Weight (lbs.), with the wider forecast bands (plof) shown alongside the bands for the mean prediction (plo).]
Residual analysis
The OLS residuals are often calculated and
analyzed after estimating a regression. In a
purely technical sense, they may be used to
test the validity of the several assumptions that
underlie the application of OLS. When plotted,
do they appear systematic? Does their dis-
persion appear to be roughly constant, or is
it larger for some X values than others? Ev-
idence of systematic behavior in the magni-
tude of the OLS residuals, or in their disper-
sion, would cast doubt on the OLS results.
A number of formal tests, as we will discuss,
are based on the residuals, and many graph-
ical techniques for examining their random-
ness (or lack thereof) are available. In Stata,
help regression diagnostics discusses many of
them.
The residuals are often used to test specific
hypotheses about the underlying relationship.
For instance, we could fit a regression of the
salaries of employees of XYZ Corp. on a num-
ber of factors which should relate to their salary
level: experience, education, specific qualifica-
tions, job level, and so on. Say that such a
regression was run, and the residuals retrieved.
If we now sort the residuals by factors not
used to explain salary levels, such as the em-
ployee’s gender or race, what will we find? Un-
der nondiscrimination laws, there should be no
systematic reason for women to be paid more
or less than men, or blacks more or less than
whites, after we have controlled for these fac-
tors. If there are significant differences be-
tween the average residual for, e.g., blacks and
whites, then we would have evidence of “sta-
tistical discrimination.” Regression equations
have often played an important role in inves-
tigating charges of discrimination in the work-
place. Likewise, most towns’ and cities’ as-
sessments of real estate (used to set the tax
levy on that property) are performed by regres-
sion, in which the explanatory factors include
the characteristics of a house and its neighbor-
hood. Since many houses will not have been
sold in the recent past, the regression must
be run over a sample of houses that have been
sold, and out-of-sample predictions used to es-
timate the appropriate price for a house that
has not been sold recently, based on its at-
tributes and trends in real estate transactions
prices in its neighborhood. A mechanical eval-
uation of the fair market value of the house
may be subject to error, but previous meth-
ods used–in which knowledgeable individuals
attached valuations based on their understand-
ing of the local real estate market–are more
subjective.
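Returning to the salary example, a small sketch of such a residual comparison, assuming variables salary, exper, educ and a female dummy:

regress salary exper educ
predict uhat, residuals
* compare mean residuals across a factor deliberately excluded from the model
ttest uhat, by(female)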
Wooldridge, Introductory Econometrics, 4th ed.
Chapter 7: Multiple regression analysis with
qualitative information: Binary (or dummy)
variables
We often consider relationships between observed outcomes and qualitative factors: models in which a continuous dependent variable is related to a number of explanatory factors, some of which are quantitative, and some of which are qualitative. In econometrics, we also consider models of qualitative dependent variables, but we will not explore those models in this course due to time constraints. But we can readily evaluate the use of qualitative information in standard regression models with continuous dependent variables.
Qualitative information often arises in terms of some coding, or index, which takes on a
number of values: for instance, we may know
in which one of the six New England states
each of the individuals in our sample resides.
The data themselves may be coded with the
biliteral “MA”, “RI”, “ME”, etc. How can
we use this factor in a regression equation?
In the data, state takes on six distinct val-
ues. We must create six binary variables, or
dummy variables, each of which will refer to
one state–that is, that variable will be 1 if the
individual comes from that state, and 0 oth-
erwise. We can generate this set of 6 vari-
ables easily in Stata with the command tab
state, gen(st), which will create 6 new vari-
ables in our dataset: st1, st2, ... st6. Each
of these variables are dummies–that is, they
only contain 0 or 1 values. If we add up these
variables, we get–exactly–a vector of 1’s, sug-
gesting that we will never want to use all 6
variables in a regression (since by knowing the
values of any 5...) We may also find the pro-
portions of each state’s citizens in our sample
very easily: summ st* will give the descriptive
statistics of all 6 variables, and the mean of
each st dummy is the sample proportion living
in that state.
In Stata 11, we actually do not have to create
these variables explicitly; we can make use of
factor variables, which will automatically cre-
ate the dummies.
How can we use these dummy variables? Say
that we wanted to know whether incomes dif-
fered significantly across the 6-state region.
What if we regressed income on any five of
these st dummies? We could do this with ex-
plicit variables as
regress income st2-st6
or with factor variables as
regress income i.state
In either case, we are estimating the equation
income = β0 + β2 st2 + β3 st3 + β4 st4 + β5 st5 + β6 st6 + u (1)
where I have suppressed the observation sub-
scripts. What are the regression coefficients in
this case? β0 is the average income in the 1st
state–the dummy for which is excluded from
the regression. β2 is the difference between
the income in state 2 and the income in state
1. β3 is the difference between the income
in state 3 and the income in state 1, and so
on. What is the ordinary “ANOVA F” in this
context–the test that all the slopes are equal
to zero? Precisely the test of the null hypoth-
esis:
H0 : µ1 = µ2 = µ3 = µ4 = µ5 = µ6 (2)
versus the alternative that not all six of the
state means are the same value. It turns out
that we can test this same hypothesis by ex-
cluding any one of the dummies, and including
the remaining five in the regression. The co-
efficients will differ, but the p− value of the
ANOVA F will be identical for any of these
regressions. In fact, this regression is an ex-
ample of “classical one-way ANOVA”–testing
whether a qualitative factor (in this case, state
of residence) explains a significant fraction of
the variation in income.
What if we wanted to generate point and in-
terval estimates of the state means of income?
Then it would be most convenient to reformu-
late (1) by including all 6 dummies, and remov-
ing the constant term. This is, algebraically,
the same regression:
regress income st1-st6, noconstant
or with factor variables as
regress income ibn.state, noconstant
The coefficient on the now-included st1 will be
precisely that reported above as β0. The coeffi-
cient reported for st2 will be precisely (β0 + β2)
from the previous model, and so on. But now
those coefficients will be reported with confi-
dence intervals around the state means. Those
statistics could all be calculated if you only es-
timated (1), but to do so you would have to
use lincom for each coefficient. Running this
alternative form of the model is much more
convenient for estimating the state means in
point and interval form. But to test the hy-
pothesis (2), it is most convenient to run the
original regression–since then the ANOVA F
performs the appropriate test with no further
ado.
What if we fail to reject the ANOVA F null?
Then it appears that the qualitative factor “state”
does not explain a significant fraction of the
variation in income. Perhaps the relevant clas-
sification is between northern, more rural New
England states (NEN) and southern, more pop-
ulated New England states (NES). Given the
nature of dummy variables, we may generate
these dummies two ways. We can express the
Boolean condition in terms of the state vari-
able: gen nen = (state=="VT" | state=="NH" | state=="ME"). This expression, with parens
on the right hand side of the generate state-
ment, evaluates that expression and returns
true (1) or false (0). The vertical bar (|) is
Stata’s OR operator; since every person in the
sample lives in one and only one state, we must
use OR to phrase the condition that they live in
northern New England. But there is another
way to generate this nen dummy, given that
we have st1...st6 defined for the regression
above. Let’s say that Vermont, New Hamp-
shire and Maine have been coded as st6, st4
and st3, respectively. We may just gen nen =
st3+st4+st6, since the sum of mutually exclu-
sive and exhaustive dummies must be another
dummy. To check, the resulting nen will have
a mean equal to the percentage of the sample
that live in northern New England; the equiva-
lent nes dummy will have a mean for southern
New England residents; and the sum of those
two means will of course be 1. We can then
run a simplified form of our model as regress
inc nen; the ANOVA F statistic for that regres-
sion tests the null hypothesis that incomes in
northern and southern New England do not
differ significantly. Since we have excluded
nes, the “slope” coefficient on nen measures
the amount by which northern New England
income differs from southern New England in-
come; the mean income for southern New Eng-
land is the constant term. If we want point and
interval estimates for those means, we should
regress inc nen nes, noc.
Regression with continuous and dummy variables
In the above examples, we have estimated “pure ANOVA” models–regression models in which all of the explanatory variables are dummies. In econometric research, we often want to combine quantitative and qualitative information, including some regressors that are measurable and others that are dummies. Consider the simplest example: we have data on individuals’ wages, years of education, and their gender. We could create two gender dummies, male and female, but we will only need one in the analysis: say, female. We create this variable as gen female = (gender=="F"). We can then estimate the model:
wage = β0 + β1educ+ β2female+ u (3)
The constant term in this model now becomes the wage for a male with zero years of education. Male wages are predicted as b0 +
b1educ, while female wages are predicted as
b0 + b1educ+ b2. The gender differential is thus
b2. How would we test for the existence of “sta-
tistical discrimination”–that, say, females with
the same qualifications are paid a lower wage?
The hypothesis of interest is the one-sided alternative HA : β2 < 0, tested against the null H0 : β2 ≥ 0. The t−statistic for b2 will provide us with this hypothesis test.
What is this model saying about wage struc-
ture? Wages are a linear function of the years
of education. If b2 is significantly different
than zero, then there are two “wage profiles”–
parallel lines in {educ, wage} space, each with
a slope of b1, with their intercepts differing by
b2.
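A brief sketch of that one-tailed test, computed from the stored coefficient and standard error (variable names as in (3)):

regress wage educ female
* one-sided p-value for H0: beta_female >= 0 against HA: beta_female < 0
di ttail(e(df_r), -_b[female]/_se[female])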
What if we wanted to expand this model to
consider the possibility that wages differ by
both gender and race? Say that each worker is
classified as race=white or race=black. Then
we could gen black = (race=="black") to cre-
ate the dummy variable, and add it to (3).
What, now, is the constant term? The wage
for a white male with zero years of education.
Is there a significant race differential in wages?
If so, the coefficient b3, which measures the
difference between white and black wages, ce-
teris paribus, will be significantly different from
zero. In {educ, wage} space, the model can be
represented as four parallel lines, with each in-
tercept labelled by a combination of gender
and race.
What if our racial data classified each worker
as white, Black or Asian? Then we would run
the regression:
wage = β0 + β1 educ + β2 female + β3 Black + β4 Asian + u (4)
or, with factor variables,
regress wage educ female i.race
where the constant term still refers to a white
male. In this model, b3 measures the differ-
ence between black and white wages, ceteris
paribus, while b4 measures the difference be-
tween Asian and white wages. Each can be
examined for significance. But how can we
determine whether the qualitative factor, race,
affects wages? That is a joint test, that both
β3 = 0 and β4 = 0, and should be conducted
as such. If factor variables were used, we could
do this with
testparm i.race
No matter how the equation is estimated, we
should not make judgments based on the indi-
vidual dummies’ coefficients, but should rather
include both race variables if the null is re-
jected, or remove them both if it is not. When
we examine a qualitative factor, which may give rise to a number of dummy variables, they should be treated as a group. For instance, we might want to modify (3) to consider the effect of state of residence:
wage = β0 + β1 educ + β2 female + Σ(j=2 to 6) γj stj + u (5)
where we include any 5 of the 6 st variables designating the New England states. The test that wage levels differ significantly due to state of residence is the joint test that γj = 0, j = 2, ..., 6 (or, if factor variables are used, testparm i.state). A judgment concerning the relevance of state of residence should be made on the basis of this joint test (an F-test with 5 numerator degrees of freedom).
Note that if the dependent variable was measured in log form, the coefficients on dummies
would be interpreted as percentage changes; if
(5) was respecified to place log(wage) as the
dependent variable, the coefficient b1 would
measure the percentage return to education
(how many percent does the wage change for
each additional year of education), while the
coefficient b2 would measure the (approximate)
percentage difference in wage levels between
females and males, ceteris paribus. The state
dummies would, likewise, measure the percent-
age difference in wage levels between that state
and the excluded state (state 1).
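When a dummy's coefficient is sizable, the exact percentage differential can be recovered with the same exp() adjustment discussed in Chapter 6; a short sketch using lwage, educ and female:

regress lwage educ female
* approximate percentage differential between females and males
di 100*_b[female]
* exact percentage differential
di 100*(exp(_b[female]) - 1)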
We must be careful when working with vari-
ables that have an ordinal interpretation, and
are thus coded in numeric form, to treat them
as ordinal. For instance, if we model the in-
terest rate corporations must pay to borrow
(corprt) as a function of their credit rating,
we consider that Moody’s and Standard and
Poor’s assign credit ratings somewhat like grades:
AAA, AA, A, BAA, BA, B, C, et cetera. Those
could be coded as 1,2,...,7. Just as we can
agree that an “A” grade is better than a “B”,
a triple-A bond rating results in a lower bor-
rowing cost than a double-A rating. But while
GPAs are measured on a clear four-point scale,
the bond ratings are merely ordinal, or ordered:
everyone agrees on the rating scale, but the
differential between AA borrowers’ rates and A
borrowers’ rates might be much smaller than
that between B and C borrowers’ rates: es-
pecially the case if C denotes “below invest-
ment grade”, which will reduce the market for
such bonds. Thus, although we might have
a numeric index corresponding to AAA...C, we
should not assume that ∂corprt/∂index is con-
stant; we should not treat index as a cardi-
nal measure. Clearly, the appropriate way to
proceed is to create dummy variables for each
rating class, and include all but one of those
variables in a regression of corprt on bond rat-
ing and other relevant factors. For instance, if
we leave out the AAA dummy, all of the ratings
class dummies’ coefficients will then measure
the degree to which those borrowers’ bonds
bear higher rates than those of AAA borrowers.
But we could just as well leave out the C rating
class dummy, and measure the effects of rat-
ings classes relative to the worst credits’ cost
of borrowing.
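A rough sketch of this approach, assuming a categorical variable rating with the seven classes coded so that class 1 is AAA, and some other regressor (here called firmsize) in the equation; all of these names are hypothetical:

tabulate rating, gen(rcat)
* omit the AAA dummy (rcat1) so each coefficient measures the spread over AAA
regress corprt rcat2-rcat7 firmsize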
Interactions involving dummy variables
Just as continuous variables may be interacted
in regression equations, so can dummy vari-
ables. We might, for instance, have one set of
dummies indicating the gender of respondents
(female) and another set indicating their mar-
ital status (married). We could regress lwage
on these two dummies:
lwage = b0 + b1female+ b2married+ u
which gives rise to the following classification
of mean wages, conditional on the two fac-
tors (which is thus a classic “two-way ANOVA”
setup):
                 male          female
unmarried        b0            b0 + b1
married          b0 + b2       b0 + b1 + b2
We assume that the two effects, gender and
marital status, have independent effects on the
dependent variable. Why? Because this joint
distribution is modelled as the product of the
marginals. What is the difference between male
and female wages? b1, irrespective of marital
status. What is the difference between un-
married and married wages? b2, irrespective of
gender.
If we were to relax the assumption that gen-
der and marital status had independent effects
on wages, we would want to consider their
interaction. Since there are only two cate-
gories of each variable, we only need one in-
teraction term, fm, to capture the possible ef-
fects. As above, that term could be generated
as a Boolean (noting that & is Stata’s AND
operator): gen fm=(female==1) & (married==1),
or we could generate it algebraically, as gen
fm=female*married. In either case, it represents
the intersection of the sets. We then add a
term, b3fm, to the equation, which then ap-
pears as an additive constant in the lower right
cell of the table. Now, if the coefficient on fm
is significantly nonzero, the effect of being fe-
male on the wage differs, depending on marital
status, and vice versa. Are the interaction ef-
fects important–that is, does the joint distribu-
tion differ from the product of the marginals?
That is easily discerned, since if that is so b3 will be significantly nonzero.
Using explicit variables, this would be estimated
as
regress wage female married fm
or, with factor variables, we can make use of
the factorial interaction operator:
regress wage female married i.female#i.married
or, in an even simpler form,
regress wage i.female##i.married
where the double hash mark indicates the full
factorial interaction, including both the main
effects of each factor and their interaction.
Two extensions of this framework come to
mind. Sticking with two-way ANOVA (con-
sidering two factors’ effects), imagine that in-
stead of marital status we consider race =
{white, Black, Asian}. To run the model without interactions, we would include two of these dummies in the regression–say, Black and Asian; the constant term would be the mean wage of a white male (the excluded class). What if we wanted to include interactions? Then we would define f·Black and f·Asian, and include those two regressors as well. The test for the significance of interactions is now a joint test that these two coefficients are jointly zero.
With factor variables, we can just say
regress wage i.female##i.race
where the factorial interaction includes all race categories, both in levels and interacted with the female dummy.
A second extension of the interaction concept is far more important: what if we want to consider a regular regression, on quantitative variables, but want to allow for different slopes
for different categories of observations? Then
we create interaction effects between the dum-
mies that define those categories and the mea-
sured variables. For instance,
lwage = b0 + b1 female + b2 educ + b3 (female × educ) + u
Here, we are in essence estimating two sepa-
rate regressions in one: a regression for males,
with an intercept of b0 and a slope of b2, and
a regression for females, with an intercept of
(b0 + b1) and a slope of (b2 + b3) . Why would
we want to do this? We could clearly estimate
the two separate regressions, but if we did that,
we could not conduct any tests (e.g. do males
and females have the same intercept? The
same slope?). If we use interacted dummies,
we can run one regression, and test all of the
special cases of this model which are nested
within: that the slopes are the same, that
the intercepts are the same, and the “pooled”
case in which we need not distinguish between
males and females. Since each of these special
cases merely involves restrictions on this gen-
eral form, we can run this equation and then
just conduct the appropriate tests.
This can be done with factor variables as
regress wage i.female##c.educ
where we must use the c. operator to tell Stata
that educ is to be treated as a continuous vari-
able, rather than considering all possible levels
of that variable in the dataset.
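A sketch of those nested tests after a single estimation, using an explicitly constructed interaction (femeduc is a made-up name):

gen femeduc = female*educ
regress lwage female educ femeduc
* do males and females share the same slope on educ?
test femeduc
* same intercept and same slope (the fully pooled model)?
test female femeduc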
If we extended this logic to include race, as de-
fined above, as an additional factor, we would
include two of the race dummies (say, Black
and Asian) and interact each with educ. This
would be a model without interactions, where
the effects of gender and race are considered to be independent, but it would allow us to estimate different regression lines for each combination of gender and race, and test for the importance of each factor. These interaction methods are often used to test hypotheses about the importance of a qualitative factor–for instance, in a sample of companies from which we are estimating their profitability, we may want to distinguish between companies in different industries, or companies that underwent a significant merger, or companies that were formed within the last decade, and evaluate whether their expenditures on R&D or advertising have the same effects across those categories.
All of the necessary tests involving dummy variables and interacted dummy variables may be easily specified and computed, since models without interacted dummies (or without certain dummies in any form) are merely restricted
forms of more general models in which they
appear. Thus, the standard “subset F” test-
ing strategy that we have discussed for the
testing of joint hypotheses on the coefficient
vector may be readily applied in this context.
The text describes how a “Chow test” may be
formulated by running the general regression,
running a restricted form in which certain con-
straints are imposed, and performing a com-
putation using their sums of squared errors;
this computation is precisely that done with
Stata’s test command. The advantage of set-
ting up the problem for the test command is
that any number of tests (e.g. above, for the
importance of gender, or for the importance of
race) may be conducted after estimating a sin-
gle regression; it is not necessary to estimate
additional regressions to compute any possible
“subset F” test statistic, which is what the
“Chow test” is doing.
Wooldridge, Introductory Econometrics, 4th ed.
Chapter 8: Heteroskedasticity
In laying out the standard regression model, we made the assumption of homoskedasticity of the regression error term: that its variance is assumed to be constant in the population, conditional on the explanatory variables. The assumption of homoskedasticity fails when the variance changes in different segments of the population: for instance, if the variance of the unobserved factors influencing individuals’ saving increases with their level of income. In such a case, we say that the error process is heteroskedastic. This does not affect the unbiasedness or consistency of the ordinary least squares point estimates–and the assumption of homoskedasticity did not underlie our derivation of the OLS formulas. But if this assumption is not tenable, we may not be able to rely
on the interval estimates of the parameters–on
their confidence intervals, and t−statistics de-
rived from their estimated standard errors. In-
deed, the Gauss-Markov theorem, proving the
optimality of least squares among linear un-
biased estimators of the regression equation,
does not hold in the presence of heteroskedas-
ticity. If the error variance is not constant,
then OLS estimators are no longer BLUE.
How, then, should we proceed? The classical
approach is to test for heteroskedasticity, and
if it is evident, try to model it. We can de-
rive modified least squares estimators (known
as weighted least squares) which will regain
some of the desirable properties enjoyed by
OLS in a homoskedastic setting. But this ap-
proach is sometimes problematic, since there
are many plausible ways in which the error vari-
ance may differ in segments of the population–
depending on some of the explanatory variables
in our model, or perhaps on some variables
that are not even in the model. We can use
weighted least squares effectively if we can de-
rive the correct weights, but may not be much
better off if we cannot convince ourselves that
our application of weighted least squares is
valid.
Fortunately, fairly recent developments in econo-
metric theory have made it possible to avoid
these quandaries. Methods have been devel-
oped to adjust the estimated standard errors
in an OLS context for heteroskedasticity of
unknown form–to develop what are known as
robust standard errors. Most statistical pack-
ages now support the calculation of these ro-
bust standard errors when a regression is esti-
mated. If heteroskedasticity is a problem, the
robust standard errors will differ from those
calculated by OLS, and we should take the for-
mer as more appropriate. How can you com-
pute these robust standard errors? In Stata,
one merely adds the option ,robust to the regress
command. The ANOVA F-table will be suppressed (as will the adjusted R2 measure), since neither is valid when robust standard errors are being computed, and the term "robust" will be displayed above the standard errors of the coefficients to remind you that robust errors are in use.
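For instance, using Stata's auto dataset (an arbitrary choice for illustration), the two sets of standard errors may be compared by running the regression twice:

sysuse auto, clear
* conventional OLS standard errors
regress price mpg weight
* heteroskedasticity-robust ("White") standard errors
regress price mpg weight, robust

The point estimates are identical in the two runs; only the standard errors, t-statistics and p-values differ.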
How are robust standard errors calculated? Consider a model with a single explanatory variable. The OLS estimator can be written as:

b_1 = \beta_1 + \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2}

This gives rise to an estimated variance of the slope parameter:

Var(b_1) = \frac{\sum_i (x_i - \bar{x})^2 \sigma_i^2}{\left[\sum_i (x_i - \bar{x})^2\right]^2}    (1)
This expression reduces to the standard expression from Chapter 2 if σ_i^2 = σ^2 for all observations:

Var(b_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}
But if σ_i^2 ≠ σ^2, this simplification cannot be performed on (1). How can we proceed? Halbert White showed (in a famous article in Econometrica, 1980) that the unknown error variance of the ith observation, σ_i^2, can be consistently estimated by e_i^2: that is, by the square of the OLS residual from the original equation. This enables us to compute robust variances of the parameters; for instance, (1) can now be computed from the OLS residuals, and its square root will be the robust standard error of b_1. This carries over to multiple regression; in the general case of k explanatory variables,
Var(b_j) = \frac{\sum_i r_{ij}^2 e_i^2}{\left(\sum_i r_{ij}^2\right)^2}    (2)
where e_i^2 is the square of the ith OLS residual, and r_ij is the ith residual from regressing variable j on all other explanatory variables. The
square root of this quantity is the heteroskedasticity-
robust standard error, or the “White” stan-
dard error, of the jth estimated coefficient. It
may be used to compute the heteroskedasticity-
robust t−statistic, which then will be valid for
tests of the coefficient even in the presence of
heteroskedasticity of unknown form. Likewise,
F -statistics, which would also be biased in the
presence of heteroskedasticity, may be consis-
tently computed from the regression in which
the robust standard errors of the coefficients
are available.
If we have this better mousetrap, why would
we want to report OLS standard errors–which
would be subject to bias, and thus unreliable,
if there is a problem of heteroskedasticity? If
(and only if) the assumption of homoskedasticity is valid, the OLS standard errors are preferred, since they will have an exact t-distribution at any sample size. The application of robust standard errors is justified only as the sample size becomes large. If we are working with a sample of modest size, and the assumption of homoskedasticity is tenable, we should rely on OLS standard errors. But since robust standard errors are very easily calculated in most statistical packages, it is a simple task to estimate both sets of standard errors for a particular equation and consider whether inference based on the OLS standard errors is fragile. In large data sets, it has become increasingly common practice to report the robust standard errors.
Testing for heteroskedasticity
We may want to demonstrate that the model we have estimated does not suffer from heteroskedasticity, and justify reliance on OLS and
OLS standard errors in this context. How might
we evaluate whether homoskedasticity is a rea-
sonable assumption? If we estimate the model
via standard OLS, we may then base a test
for heteroskedasticity on the OLS residuals.
If the assumption of homoskedasticity, condi-
tional on the explanatory variables, holds, it
may be written as:
H0 : V ar (u|x1, x2, ..., xk) = σ2
And a test of this null hypothesis can evalu-
ate whether the variance of the error process
appears to be independent of the explanatory
variables. We cannot observe the variances
of each observation, of course, but as above
we can rely on the squared OLS residual, e_i^2, to be a consistent estimator of σ_i^2. One of
the most common tests for heteroskedastic-
ity is derived from this line of reasoning: the
Breusch–Pagan test. The BP test involves
regressing the squares of the OLS residuals on
a set of variables—such as the original explana-
tory variables—in an auxiliary regression:
e_i^2 = d0 + d1x1 + d2x2 + ... + dkxk + v    (3)
If the magnitude of the squared residual—a
consistent estimator of the error variance of
that observation—is not related to any of the
explanatory variables, then this regression will
have no explanatory power: its R2 will be small,
and its ANOVA F−statistic will indicate that
it does not explain any meaningful fraction of
the variation of e_i^2 around its own mean. (Note
that although the OLS residuals have mean
zero, and are in fact uncorrelated by construc-
tion with each of the explanatory variables,
that does not apply to their squares). The
Breusch–Pagan test can be conducted by either the ANOVA F-statistic from (3), or by a large-sample form known as the Lagrange multiplier statistic: LM = n × R2 from the auxiliary regression. Under H0 of homoskedasticity, LM ∼ χ^2_k.
The Breusch–Pagan test can be computed with the estat hettest command after regress.
regress price mpg weight length
estat hettest
which would evaluate the residuals from the regression for heteroskedasticity, with respect to the original explanatory variables. The null hypothesis is that of homoskedasticity; if a small p-value is received, the null is rejected in favor of heteroskedasticity (that is, the auxiliary regression, which is not shown, had a meaningful amount of explanatory power). The routine displays the LM statistic and its p-value versus the χ^2_k distribution. If a rejection is re-
ceived, one should rely on robust standard er-
rors for the original regression. Although we
have demonstrated the Breusch–Pagan test by
employing the original explanatory variables,
the test may be used with any set of variables–
including those not in the regression, but sus-
pected of being systematically related to the
error variance, such as the size of a firm, or
the wealth of an individual.
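The same statistic can also be computed by hand from the auxiliary regression, which makes it easy to substitute any set of suspect variables. A minimal sketch, reusing the regression above:

regress price mpg weight length
predict double e, residuals
gen double e2 = e^2
* auxiliary regression of squared residuals on the explanatory variables
regress e2 mpg weight length
* LM statistic = n * R-squared, referred to the chi-squared(k) distribution
display "LM = " e(N)*e(r2) "  p-value = " chi2tail(3, e(N)*e(r2))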
The Breusch-Pagan test is a special case of
White’s general test for heteroskedastic-
ity. The sort of heteroskedasticity that will
damage OLS standard errors is that which in-
volves correlations between squared errors and
explanatory variables. White’s test takes the
list of explanatory variables {x1, x2, ..., xk} and
augments it with squares and cross products
of each of these variables. The White test
then runs an auxiliary regression of e_i^2 on the
explanatory variables, their squares, and their
cross products. Under the null hypothesis, none
of these variables should have any explanatory
power, if the error variances are not system-
atically varying. The White test is another
LM test, of the n × R2 form, but involves a
much larger number of regressors in the aux-
iliary regression. In the example above, rather
than just including mpg weight length, we would also include mpg^2, weight^2, length^2, mpg×weight, mpg×length, and weight×length: 9 regressors in all, giving rise to a test statistic with a χ^2(9) distribution.
How can you perform White’s test? Give the
command ssc install whitetst (you only need
do this once) and it will install this routine in
Stata. The whitetst command will automat-
ically generate these additional variables and
perform the test after a regress command.
Since Stata knows what explanatory variables
were used in the regression, you need not spec-
ify them; just give the command whitetst after
regress. You may also use the fitted option to
base the test on powers of the predicted val-
ues of the regression rather than the full list of
regressors, squares and cross products.
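For example, after the regression used above (assuming the routine has been installed from SSC as described):

regress price mpg weight length
* White's general test using squares and cross products of the regressors
whitetst
* a more parsimonious version based on powers of the fitted values
whitetst, fitted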
Weighted least squares estimation
As an alternative to using heteroskedasticity-
robust standard errors, we could transform the
regression equation if we had knowledge of the
form taken by heteroskedasticity. For instance,
if we had reason to believe that:
Var(u|x) = σ^2 h(x)
where h(x) is some function of the explana-
tory variables that could be made explicit (e.g.
h(x) = income), we could use that informa-
tion to properly specify the correction for het-
eroskedasticity. What would this entail? Since
in this case we are saying that Var(u|x) ∝ income, the standard deviation of u_i, conditional on income_i, is proportional to √income_i. This could then be used to perform weighted least squares: a
used to perform weighted least squares: a
technique in which we transform the variables
in the regression, and then run OLS on the
transformed equation. For instance, if we were
estimating a simple savings function from the
dataset saving.dta, in which sav is regressed
on inc, and believed that there might be het-
eroskedasticity of the form above, we would
perform the following transformations:
gen sd=sqrt(inc)
gen wsav=sav/sd
gen kon=1/sd
gen winc=inc/sd
regress wsav kon winc,noc
Note that there is no constant term in the
weighted least squares (WLS) equation, and
that the coefficient on winc still has the same
connotation: that of the marginal propensity
to save. In this case, though, we might be
thankful that Stata (and most modern pack-
ages) have a method for estimating WLS mod-
els by merely specifying the form of the weights:
regress sav inc [aw=1/inc]
In this case, the “aw” indicates that we are us-
ing “analytical weights”—Stata’s term for this
sort of weighting—and the analytical weight
is specified to be the inverse of the observa-
tion variance (not its standard error). If you
run this regression, you will find that its coef-
ficient estimates and their standard errors are
identical to those of the transformed equation–
with less hassle than the latter, in which the
summary statistics (F-statistic, R2, predicted
values, residuals, etc.) pertain to the trans-
formed dependent variable (wsav) rather than
the original variable.
The use of this sort of WLS estimation is less
popular than it was before the invention of
“White” standard errors; in theory, the trans-
formation to homoskedastic errors will yield
more attractive properties than even the use
of “White” standard errors, conditional on our
proper specification of the form of the het-
eroskedasticity. But of course we are not sure
about that, and imprecise treatment of the
errors may not be as attractive as the less
informed technique of using the robust esti-
mates.
One case in which we do know the form of
the heteroskedasticity is that of grouped data,
in which the data we are using has been ag-
gregated from microdata into groups of dif-
ferent sizes. For instance, a dataset with 50
states’ average values of income, family size,
etc. calculated from a random sample of the
U.S. population will have widely varying preci-
sion in those average values. The mean val-
ues for a small state will be computed from
relatively few observations, whereas the coun-
terpart values for a large state will be more
precisely estimated. Since we know that the
standard error of the mean is σ/√n, we recog-
nize how this effect will influence the precision
of the estimates. How, then, can we use this
dataset of 50 observations while dealing with
the known heteroskedasticity of the states’ er-
rors? This too is weighted least squares, where
the weight on the individual state should be its
population. This can be achieved in Stata by
specifying “frequency weights”–a variable con-
taining the number of observations from which
each sample observation represents. If we had
state-level data on saving, income and popula-
tion, we might regress saving income [fw=pop]
to achieve this weighting.
One additional observation regarding heteroskedasticity: we often see, in empirical studies, that an equation has been specified in some ratio form, for instance, with per capita dependent and independent variables for data on states or countries, or in terms of financial ratios for firm- or industry-level data. Although there may be no mention of heteroskedasticity in the study, it is very likely that these ratio forms have been chosen to limit the potential damage of heteroskedasticity in the estimated model. There can certainly be heteroskedasticity in a per-capita form regression on country-level data, but it is much less likely to be a problem than it would be if, say, the levels of GDP were used in that model. Likewise, scaling firms' values by total assets, or total revenues, or the number of employees will tend to mitigate the difficulties caused by extremes in scale between large corporations and corner stores. Such models should still be examined for their errors' behavior, but the popularity of the ratio form in these instances is an implicit consideration of potential heteroskedasticity.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 9: More on specification and data
problems
Functional form misspecification
We may have a model that is correctly speci-
fied, in terms of including the appropriate ex-
planatory variables, yet commit functional form
misspecification–in which the model does not
properly account for the relationship between
dependent and observed explanatory variables.
We have considered this sort of problem when
discussing polynomial models; omitting a squared
term, for instance, and constraining ∂y/∂x to
be constant (rather than linear in x) would be
a functional form misspecification. We may
also encounter difficulties of this sort with re-
spect to interactions among the regressors. If
omitted, the effects of those regressors will be
estimated as constant, rather than varying as
they would in the case of interacted variables.
In the context of models with more than one
categorical variable, assuming that their effects
can be treated as independent (thus omitting
interaction terms) would yield the same diffi-
culty.
We may, of course, use the tools already de-
veloped to deal with these problems, in the
sense that if we first estimate a general model
that allows for powers, interaction terms, etc.
and then “test down” with joint F tests, we
can be confident that the more specific model
we develop will not have imposed inappropri-
ate restrictions along the way. But how can
we consider the possibility that there are missing elements even in the context of our general model?
One quite useful approach to a general test for functional form misspecification is Ramsey's RESET (regression specification error test). The idea behind RESET is quite simple: if we have properly specified the model, no nonlinear functions of the independent variables should be significant when added to our estimated equation. Since the fitted, or predicted, values (ŷ) of the estimated model are linear in the independent variables, we may consider powers of the predicted values as additional regressors. Clearly the ŷ values themselves cannot be added to the regression, since they are by construction linear combinations of the x variables. But their squares, cubes, ... are not. The RESET formulation reestimates the original equation, augmented by powers of ŷ (usually squares, cubes, and fourth powers are sufficient) and conducts an F-test for the joint null
hypothesis that those variables have no sig-
nificant explanatory power. This test is easy
to implement, but many computer programs
have it already programmed; for instance, in
Stata one may just specify estat ovtest (omit-
ted variable test) after any regression, and the
Ramsey RESET will be produced. However,
as Wooldridge cautions, RESET should not be
considered a general test for omission of rele-
vant variables; it is a test for misspecification
of the relationship between y and the x values
in the model, and nothing more.
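A minimal sketch of the test in Stata, with illustrative (assumed) variable names:

* fit the model of interest
regress lwage educ exper tenure
* Ramsey RESET: augments the model with powers of the fitted values
* and tests their joint significance
estat ovtest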
Tests against nonnested alternatives
The standard joint testing framework is not
helpful in the context of “competing models,”
or nonnested alternatives. These alternatives
can also arise in the context of functional form:
for instance,
y = β0 + β1x1 + β2x2 + u (1)
y = β0 + β1 log x1 + β2 log x2 + u
are nonnested models. The mechanical al-
ternative, in which we construct an artificial
model that contains each model as a special
case, is often not very attractive (and some-
time will not even be feasible). An alterna-
tive approach is that of Davidson and MacK-
innon. Using the same logic applied in devel-
oping Ramsey’s RESET, we can estimate each
of the models in (1), generate their predicted
values, and include them in the other equation.
Under the null hypothesis that the first form of
the model is correctly specified, a linear com-
bination of the logs of the x variables should
have no power to improve it, and that coef-
ficient should be insignificant. Likewise, one
can reestimate the second model, including the
predicted values from the first model. This
testing strategy–often termed the Davidson-
MacKinnon “J test”–may indicate that one
of the models is robust against the other.
There are no guarantees, though, in that ap-
plying the J test to these two equations may
generate zero, one, or two rejections. If nei-
ther hypothesis is rejected, then the data are
not helpful in ranking the models. If both are
rejected, we are given an indication that nei-
ther model is adequate, and that a continued
specification search should be conducted. If
one rejection is received, then the J test is
definitive in indicating that one of the models
dominates (or subsumes) the other, and not
vice versa. However, this does not imply that
the preferred model is well specified; again, this
test is against a very specific alternative, and
does not deliver a “clean bill of health” for the
preferred model should one emerge.
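A sketch of the procedure in Stata, using hypothetical variables y, x1 and x2 to stand for the two competing specifications:

* model A: levels of the regressors
regress y x1 x2
predict double yhatA, xb
* model B: logs of the regressors
gen double lx1 = log(x1)
gen double lx2 = log(x2)
regress y lx1 lx2
predict double yhatB, xb
* J test of model A against model B: B's fitted values should add nothing to A
regress y x1 x2 yhatB
test yhatB
* J test of model B against model A
regress y lx1 lx2 yhatA
test yhatA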
Proxy variables
So far, we have discussed issues of misspec-
ification resulting from improper handling of
the x variables. In many economic models, we
are forced to employ “proxy variables”: ap-
proximate measures of an unobservable phe-
nomenon. For instance, admissions officers
use SAT scores and high school GPAs as prox-
ies for applicants’ ability and intelligence. No
one argues that standardized tests or grade
point averages are actually measuring aptitude,
or intelligence; but there are reasons to believe
that the observable variable is well correlated
with the unobservable, or latent, variable. To
what extent will a model estimated using such
proxies for the variables in the underlying re-
lationship be successful, in terms of delivering
consistent estimates of its parameters? First,
of course, it must be established that there
is a correlation between the observable vari-
able and the latent variable. If we consider the
latent variable as having a linear relation to
a measurable proxy variable, the error in that
relation must not be correlated with other re-
gressors. When we estimate the relationship
including the proxy variable, it should be ap-
parent that the measurement error from the
latent variable equation ends up in the error
term, as an additional source of uncertainty.
This is an incentive to avoid proxy variables
where one can, since they will inexorably inflate
the error variance in the estimated regression.
But usually they are employed out of necessity,
in models for which we have no ability to mea-
sure the latent variable. If there are several
potential proxy measures, they might each be
tested, to attempt to ascertain whether bias is
being introduced to the relationship.
In some cross-sectional relationships, we have
the opportunity to use a lagged value of the
dependent variable as a proxy variable. For in-
stance, if we are trying to explain cities’ crime
rates, we might consider that there are likely to be similarities, regardless of the effectiveness of anti-crime strategies, between current crime rates and last year's values. Thus, a prior value of the dependent variable, determined before this year's value, may be a useful proxy for a number of factors that cannot otherwise be quantified. This approach might often be used to deal with factors such as "business climate," in which some states or municipalities are viewed as more welcoming to business; there may be many aspects to this perception, some of them more readily quantifiable (such as tax rates), some of them not so (such as local officials' willingness to negotiate infrastructure improvements, or assist in funding for a new facility). But in the absence of radical changes in localities' stance in this regard, the prior year's (or decade's) business investment in the locality may be a good proxy for those factors, perceived much more clearly by the business decisionmakers than by the econometrician.
Measurement error
We often must deal with the issue of mea-
surement error: that the variable that theory
tells us belongs in the relationship cannot be
precisely measured in the available data. For
instance, the exact marginal tax rate that an
individual faces will depend on many factors,
only some of which we might be able to ob-
serve: even if we knew the individual’s income,
number of dependents, and homeowner sta-
tus, we could only approximate the effect of
a change in tax law on his or her tax liabil-
ity. We are faced, therefore, with using an
approximate measure, including some error of
measurement, whenever we might attempt to
formulate and implement such a model. This is
conceptually similar to the proxy variable prob-
lem we have already discussed, but in this case
it is not a latent variable problem. There is an
observable magnitude, but we do not necessar-
ily observe it. For instance, reported income is
an imperfect measure of actual income, while
IQ score is only a proxy for ability. Why is
measurement error of concern? Because the
behavior we’re trying to model–be it of indi-
viduals, firms, or nations–presumably is driven
by the actual measures, not our mismeasured
approximations of those factors. To the extent
that we fail to capture the actual measure, we
may misinterpret the behavioral response.
If measurement error is observed in the de-
pendent variable–for instance, if the true rela-
tionship explains y∗, but we only observe y = y∗ + ε, where ε is a mean-zero error process,
then ε becomes a component of the regres-
sion error term: yet another reason why the
relationship does not fit perfectly. We assume
that ε is not systematic, in particular, that it is
not correlated with the independent variables
X. As long as that is the case, then this form
of measurement error does no real harm; it
merely weakens the model, without introduc-
ing bias in either point or interval estimates. If
the magnitude of the measurement error in y is
correlated with one or more of the x variables,
then we will have a problem of bias.
Measurement error in an explanatory variable,
on the other hand, is a far more serious prob-
lem. Say that the true model is
y = β0 + β1x∗1 + u (2)
but that x∗1 is not observed; instead, we ob-
serve x1 = x∗1+ε1. We can assume that E(ε1) =
0 with generality. But what should be as-
sumed about the relationship between ε1 and
x∗1? First, let us assume that ε1 is uncorre-
lated with the observed measure x1 (that is,
larger values of x1 do not give rise to sys-
tematically larger (or smaller) errors of mea-
surement). This can be written as Cov(ε1, x1) = 0. But if this is the case, it must be true that Cov(ε1, x∗1) ≠ 0: that is, the error
of measurement must be correlated with the
actual explanatory variable x∗1, so that we can
write the estimated equation (in which x∗1 is
replaced with the observable x1) as
y = β0 + β1x1 + (u− β1ε1) (3)
Since both u and ε1 have zero mean and are
uncorrelated (by assumption) with x1, the pres-
ence of measurement error merely inflates the
error term: that is, Var(u − β1ε1) = σ_u^2 + β1^2 σ_ε1^2,
given that we have assumed that u and ε1 are
uncorrelated with each other. Thus, measure-
ment error in x∗1 does not negatively affect the
regression of y on x1; it merely inflates the
error variance, like measurement error in the
dependent variable.
However, this is not the case that we usu-
ally consider under the heading of errors-in-
variables. It is perhaps more reasonable to
assume that the measurement error is uncorre-
lated with the true explanatory variable: Cov(
ε1, x∗1) = 0. If this is so, then Cov(ε1, x1) = Cov(ε1, (x∗1 + ε1)) ≠ 0 by construction, and the
regression (3) will have a correlation between
its explanatory variable x1 and the composite
error term. The covariance of (x1, u − β1ε1) = −β1 Cov(ε1, x1) = −β1 σ_ε1^2 ≠ 0, causing the
OLS regression of y on x1 to be biased and
inconsistent. In this simple case of a single ex-
planatory variable measured with error, we can
determine the nature of the bias:

plim(b_1) = \beta_1 + \frac{Cov(x_1, u - \beta_1 \epsilon_1)}{Var(x_1)} = \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{\epsilon_1}^2} \right)    (4)
demonstrating that the OLS point estimate
will be attenuated (biased toward zero), since the ratio in parentheses must be a fraction less than one. Clearly, in the absence of measurement error, σ_ε1^2 → 0, and the OLS coefficient becomes unbiased and consistent. As σ_ε1^2 increases rela-
tive to the variance in the (correctly measured)
explanatory variable, the OLS coefficient be-
comes more and more unreliable, shrinking to-
ward zero.
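The attenuation result is easy to see in a small simulation (a sketch with made-up parameters, not an example from the text). With β1 = 1 and equal variances for x∗1 and the measurement error, the probability limit of the OLS slope is 0.5:

clear
set obs 10000
set seed 12345
* true regressor, structural error, and true model with beta1 = 1
gen double xstar = rnormal()
gen double u = rnormal()
gen double y = 1 + xstar + u
* observed regressor contaminated with classical (CEV) measurement error
gen double x1 = xstar + rnormal()
* the estimated slope should be close to 0.5 rather than 1
regress y x1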
What can we conclude in a multiple regression
equation, in which perhaps one of the explana-
tory variables is subject to measurement error?
If the measurement error is uncorrelated with the
true (correctly measured) explanatory variable,
then the result we have here applies: the OLS
coefficients will be biased and inconsistent for
all of the explanatory variables (not merely the
variable measured with error), but we can no
longer predict the direction of bias in general
terms. Realistically, more than one explana-
tory variable may be subject to measurement
error (e.g. both reported income and wealth
may be erroneous).
We might be discouraged by these findings,
but fortunately there are solutions to these
problems. The models in question, in which
we suspect the presence of serious errors of
measurement, may be estimated by techniques
other than OLS regression. We will discuss
those instrumental variable techniques, which
may also be used to deal with problems of si-
multaneity, or two-way causality, in Chapter
15.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 10: Basic regression analysis with
time series data
We now turn to the analysis of time series
data. One of the key assumptions underlying
our analysis of cross-sectional data will prove
to be untenable when we consider time series
data; thus, we separate out the issues of time
series modelling from that of cross sections.
How does time series data differ? First of all,
it has a natural ordering, that of calendar time
at some periodic frequency. Note that we are
not considering here a dataset in which some
of the variables are dated at a different point
in time: e.g. a survey measuring this year’s in-
come, and (as a separate variable) last year’s
income. In time series data sets, the observa-
tions are dated, and thus we need to respect
their order, particularly if the model we consider has a dynamic specification (involving variables from more than one point in time). What is a time series? Merely a sequence of observations on some phenomenon observed at regular intervals. Those intervals may correspond to the passage of calendar time (e.g. annual, quarterly, monthly data) or they may reflect an economic process that is irregular in calendar time (such as business-daily data). In either case, our observations may not be available for every point in time (for instance, there are days when a given stock does not trade on the exchange).
A second important difference between cross-sectional and time series data: with the former, we can reasonably assume that the sample is drawn randomly from the appropriate population, and could conceive of one or many alternate samples constructed from the same population. In the case of time series data, we consider the sequence of events we have recorded
as a realization of the underlying process. We
only have one realization available, in the sense
that history played out a specific sequence of
events. In an alternate universe, Notre Dame
might have lost to BC this year. Randomness
plays a role, of course, just as it does in cross-
sectional data; we do not know what will tran-
spire until it happens, so that time series data
ex ante are random variables. We often speak
of a time series as a stochastic process, or
time series process, focusing on the concept
that there is some mechanism generating that
process, with a random component.
Types of time series regression models
Models used in a time series context can often
be grouped into those sharing common fea-
tures. By far the simplest is a static model,
such as
yt = β0 + β1x1,t + β2x2,t + ut (1)
We may note that this model is the equiva-
lent of the cross-sectional regression model,
with the i subscript in the cross section re-
placed by t in the time series context. Each
observation is modeled as depending only on
contemporaneous values of the explanatory
variables. This structure implies that all of the
interactions among the variables of the model
are assumed to take place immediately: or,
taking the frequency into account, within the
same time period. Thus, such a model might
be reasonable when applied to annual data,
where the length of the observation interval is
long enough to allow behavioral adjustments
to take place. If we applied the same model
to higher-frequency data, we might consider
that assumption inappropriate; we might con-
sider, for instance, that a tax cut would not be
fully reflected by higher retail sales in the same
month that it took effect. An example of such
a structure that appears in many textbooks is
the static Phillips curve:
πt = β0 + β1URt + ut (2)
where πt is this year’s inflation rate, and URt is this year’s unemployment rate. Stating the
model in this form not only implies that the
level of unemployment is expected to affect the
rate of inflation (presumably with a negative
sign), but also that the entire effect of changes
in unemployment will be reflected in inflation
within the observation interval (e.g. one year).
In many contexts, we find a static model in-
adequate to reflect what we consider to be
the relationship between explanatory variables
and those variables we wish to explain. For
instance, economic theory surely predicts that
changes in interest rates (generated by mone-
tary policy) will have an effect on firms’ capital
investment spending. At lower interest rates,
firms will find more investment projects with a
positive expected net present value. But since
it takes some time to carry out these projects–
equipment must be ordered, delivered, and in-
stalled, or new factories must be built and
equipped–we would not expect that quarterly
investment spending would reflect the same
quarter’s (or even the previous quarter’s) in-
terest rates. Presumably interest rates affect
capital investment spending with a lag, and we
must take account of that phenomenon. If we
were to model capital investment with a static
model, we would be omitting relevant explana-
tory variables: the prior values of the causal
factors. These omissions would cause our es-
timates of the static model to be biased and
inconsistent. Thus, we must use some form of
distributed lag model to express the relation-
ship between current and past values of the
explanatory variables and the outcome. Dis-
tributed lag models may take a finite number
of lagged values into account (thus the Finite
Distributed Lag model, or FDL) or they may
use an infinite distributed lag: e.g. all past
values of the x variables. When an infinite DL
model is specified, some algebraic sleight-of-
hand must be used to create a finite set of
regressors.
A simple FDL model would be
ft = β0 + β1pet + β2pet−1 + β3pet−2 + ut (3)
in which we consider the fertility rate in the
population as a function of the personal ex-
emption, or child allowance, over this year and
the past two years. We would expect that the
effect of a greater personal exemption is posi-
tive, but realistically we would not expect the
effect to be (only) contemporaneous. Given
that there is at least a 9-month lag between
the decision and the recorded birth, we would
expect such an effect (if it exists) to be largely
concentrated in the β2 and β3 coefficients. In-
deed, we might consider whether additional
lags are warranted. In this model, β1 is the
impact effect, or impact multiplier of the
personal exemption, measuring the contempo-
raneous change. How do we calculate ∂f/∂pe?
That (total) derivative must be considered as
the effect of a one-time change in pe that
raises the exemption by one unit and leaves
it permanently higher. It may be computed
by evaluating the steady state of the model:
that with all time subscripts dropped. Then
it may be seen that the total effect, or long-
run multiplier, of a permanent change in pe
is (β1 + β2 + β3) . In this specification, we pre-
sume that there is an impact effect (allowing
for a nonzero value of β1) but we are impos-
ing the restriction that the entire effect will be
felt within the two year lag. This is testable,
of course, by allowing for additional lag terms
in the model, and testing for their joint sig-
nificance. However the analysis of individual
lag coefficients is often hampered–especially
at higher frequencies such as quarterly and
monthly data–by high autocorrelation in the
series. That is, the values of the series are
closely related to each other over time. If this
is the case, then many of the individual coeffi-
cients in a FDL regression model may not be
distinguishable from zero. This does not im-
ply, though, that the sum of those coefficients
(i.e. the long run multiplier) will be imprecisely
estimated. We may get a very precise value for
that effect, even if its components are highly
intercorrelated.
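Once the data have been declared as a time series (with tsset, as described below), such a model and its long run multiplier can be computed directly. A sketch with hypothetical series names fert and pe, observed annually:

tsset year
* finite distributed lag in the personal exemption
regress fert pe L.pe L2.pe
* the impact multiplier is the coefficient on pe; the long run multiplier
* is the sum of the current and lagged coefficients
lincom pe + L.pe + L2.pe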
One additional concern that will apply in esti-
mating FDL models, especially when the num-
ber of observations is limited. Each lagged
value included in a model results in the loss
of one observation in the estimation sample.
Likewise, the use of a first difference (∆yt ≡ yt − yt−1) on either the left or right side of
a model results in the loss of one observa-
tion. If we have a long time series, we may
not be too concerned about this; but if we
were working with monthly data, and felt it
appropriate to consider 12 lags of the explana-
tory variables, we would lose the first year of
data to provide these starting values. Com-
puter programs such as Stata may be set up
to recognize the time series nature of the data
(in Stata, we use the tsset command to iden-
tify the date variable, which must contain the
calendar dates over which the data are mea-
sured), and construct lags and first differences
taking these constraints into account (for in-
stance, a lagged value of a variable will be set
to a missing value where it is not available).
In Stata, once a dataset has been established
as time series, we may use the operators L.,D.
and F. to refer to the lag, difference or lead of a
variable, respectively: so L.gdp is last period’s
gdp, D.gdp is the first difference, and F.gdp is
next year’s value. These operators can also
consider higher lags, so L2.gdp is the second
lag, and L(1/4).gdp refers to the first four lags,
using standard Stata “numlist” notation (help
numlist for details).
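A brief sketch of these operators in practice, with illustrative variable names:

* declare an annual time series indexed by the variable year
tsset year
* one-period lag and first difference of gdp
gen lgdp = L.gdp
gen dgdp = D.gdp
* regress inv on current gdp and its first two lags
regress inv L(0/2).gdp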
Finite sample properties of OLS
How must we modify the assumptions under-
lying OLS to deal with time series data? First
of all, we assume that there is a linear model
linking y with a set of explanatory variables,
{x1...xk}, with an additive error u, for a sample
of t = 1, ..., n. It is useful to consider the ex-
planatory variables as being arrayed in a matrix
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ x_{2,1} & \cdots & x_{2,k} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{pmatrix}
where, like a spreadsheet,
the rows are the observations (indexed by time)
and the columns are the variables (which may
actually be dated differently: e.g. x2 may ac-
tually be the lag of x1, etc.) To proceed with
the development of the finite sample properties
of OLS, we assume:
Proposition 1 For each t, E(ut|X) = 0, where
X is the matrix of explanatory variables.
This is a key assumption, and quite a strong
one: it states not only that the error is con-
temporaneously uncorrelated with each of the
explanatory variables, but also that the error is
assumed to be uncorrelated with elements of
X at every point in time. The weaker state-
ment of contemporaneous exogeneity,
E(ut|xt,1, xt,2, ..., xt,k) = 0 is analogous to the
assumption that we made in the cross-sectional
context. But this is a stronger assumption, for
it states that the elements of X, past, present,
and future, are independent of the errors: or
that the explanatory variables in X are strictly
exogenous. It is important to note that
this assumption, by itself, says nothing about
the correlations over time among the explana-
tory variables (or their correlations with each
other), nor about the possibility that succes-
sive elements of u may be correlated (in which
case we would say that u is autocorrelated).
The assumption only states that the distribu-
tions of u and X are independent.
What might cause this assumption to fail? Clearly,
omitted variables and/or measurement error
are likely causes of a correlation between the
regressors and errors. But in a time series con-
text there are other likely suspects. If we esti-
mate a static model, for instance, but the true
relationship is dynamic–in which lagged values
of some of the explanatory variables also have
direct effects on y−then we will have a correla-
tion between contemporaneous x and the error
term, since it will contain the effects of lagged
x, which is likely to be correlated with cur-
rent x. So this assumption of strict exogeneity
has strong implications for the correct speci-
fication of the model (in this case, we would
need to specify a FDL model). It also implies
that there cannot be correlation between cur-
rent values of the error process and future x
values: something that would be likely in a case
where some of the x variables are policy in-
struments. For instance, consider a model of
farmers’ income, dependent on (among other
factors) government price supports for their
crop. If unprecedented shocks (such as a se-
ries of droughts), which are unpredictable and
random effects of weather on farmers’ income,
trigger an expansion of the government price
support program, then the errors today are cor-
related with future x values.
The last assumption we need is the standard
assumption that the columns of X are linearly
independent: that is, there are no exact linear
relations, or perfect collinearity, among the
regressors.
With these assumptions in hand, we can demon-
strate that the OLS estimators are unbiased,
both conditional on X and unconditionally. The
random sampling assumption that allowed us to prove
unbiasedness in the cross-sectional context has
been replaced by the assumption of strict ex-
ogeneity in the time series context. We now
turn to the interval estimates. As previously,
we assume that the error variance, conditioned
on X, is homoskedastic: V ar(ut|X) = V ar(ut) =
σ2, ∀t. In a time series context, this assumption
states that the error variance is constant over
time, and in particular not influenced by the
X variables. In some cases, this may be quite
unrealistic. We now add an additional assump-
tion, particular to time series analysis: that
there is no serial correlation in the errors:
Cov(ut, us|X) = Cov(ut, us) = 0, ∀t ≠ s. This
assumption states that the errors are not auto-
correlated, or correlated with one another, so
that there is no systematic pattern in the errors
over time. This may clearly be violated, if the
error in one period (for instance, the degree to
which the actual level of y falls short of the de-
sired level) is positively (or negatively) related
to the error in the previous period. Positive
autocorrelation can readily arise in a situation
where there is partial adjustment to a discrep-
ancy, whereas negative autocorrelation is much
more likely to reflect “overshooting,” in which
a positive error (for instance, an overly opti-
mistic forecast) is followed by a negative error
(a pessimistic forecast). This assumption has
nothing to do with the potential autocorrela-
tion within the X matrix; it only applies to
the error process. Why is this assumption only
relevant for time series? In cross sections, we
assume random sampling, whereby each obser-
vation is independent of every other. In time
series, the sequence of the observations makes
it likely that if independence is violated, it will
show up in successive observations’ errors.
With these additional assumptions, we may
state the Gauss-Markov theorem for OLS esti-
mators of a time series model (OLS estimators
are BLUE), implying that the variances of the
OLS estimators are given by:
Var(b_j|X) = \frac{\sigma^2}{SST_j (1 - R_j^2)}    (4)
where SSTj is the total sum of squares of the
jth explanatory variable, and R_j^2 is the R2 from
a regression of variable xj on the other ele-
ments of X. Likewise, the unknown parameter
σ2 may be replaced by its consistent estimate,
s^2 = SSR/(n − k − 1), identical to that discussed previ-
ously.
As in our prior derivation, we will assume that
the errors are normally distributed: u ∼ N(0, σ2).
If the above assumptions hold, then the stan-
dard t−statistics and F−statistics we have ap-
plied in a cross-sectional context will also be
applicable in time series regression models.
Functional form, dummy variables, and in-
dex numbers
We find that a logarithmic transformation is
very commonly used in time series models, par-
ticularly with series that reflect stocks, flows,
or prices (rather than rates). Many models
are specified with the first difference of log(y),
implying that the dependent variable is the
growth rate of y. Dummy variables are also
very useful to test for structural change. We
may have a priori information that indicates
that unusual events were experienced in partic-
ular time periods: wars, strikes, or presidential
elections, or a market crash. In the context of
a dynamic model, we do not want to merely
exclude those observations, since that would
create episodes of missing data. Instead, we
can “dummy” the period of the event, which
then allows for an intercept shift (or, with in-
teractions, for a slope shift) during the un-
usual period. The tests for significance of the
dummy coefficients permit us to identify the
importance of the period, and justify its special
treatment. We may want to test that the rela-
tionship between inflation and unemployment
(the “Phillips curve”) is the same in Repub-
lican and Democratic presidential administra-
tions; this may readily be done with a dummy
for one party, added to the equation and inter-
acted to allow for a slope change between the
two parties’ equations. Dummy variables are
also used widely in financial research, to con-
duct event studies: models in which a par-
ticular event, such as the announcement of a
takeover bid, is hypothesized to trigger “ab-
normal” returns to the stock. In this context,
high-frequency (e.g. daily) data on stock re-
turns are analyzed, with a dummy set equal to
1 on and after the date of the takeover bid
announcement. A test for the significance of
the dummy coefficient allows us to analyze the
importance of this event. (These models are
explicitly discussed in EC327, Financial Econo-
metrics).
Creation of these dummies in Stata is made
easier by the tin() function (read: tee-in). If
the data set has been established as a time
series via tsset, you may refer to natural time
periods in generating new variables or spec-
ifying the estimation sample. For instance,
gen prefloat = (tin(1959q1,1971q3)) will gen-
erate a dummy for that pre-Smithsonian pe-
riod, and a model may be estimated over a
subset of the observations via regress ... if
tin(1970m1,1987m9).
In working with time series data, we are often
concerned with series measured as index num-
bers, such as the Consumer Price Index, GDP
Deflator, Index of Industrial Production, etc.
The price series are often needed to gener-
ate real values from nominal magnitudes. The
usual concerns must be applied in working with
these index number series, some of which have
been rebased (e.g. from 1982=100 to 1987=100)
and must be adjusted accordingly for a new
base period and value. Interesting implications
arise when we work with “real” magnitudes,
expressed in logs: for instance, labor supply
is usually modelled as depending on the real
wage,(wp
). If we express these variables in logs,
the log of the real wage becomes logw− log p.
Regressing the log of hours worked on a single
variable, (logw − log p), is a restricted version
of a regression in which the two variables are
entered separately. In that regression, the co-
efficients will almost surely differ in their ab-
solute value. But economic theory states that
only the real wage should influence workers’
decisions; they should not react to changes in
its components (e.g. they should not be will-
ing to supply more hours of labor if offered a
higher nominal wage that only makes up for a
decrease in their purchasing power).
Trends and seasonality
Many economic time series are trending: grow-
ing over time. One of the reasons for very high
R2 values in many time series regressions is the
common effect of time on many of the vari-
ables considered. This brings a challenge to
the analysis of time series data, since when we
estimate a model in which we consider the ef-
fect of several causal factors, we must be care-
ful to account for the co-movements that may
merely reflect trending behavior. Many macro
series reflect upward trends; some, such as the
cost of RAM for personal computers, exhibit
strong downward trends. We can readily model
a linear trend by merely running a regression
of the series on t, in which the slope coefficient
is then ∂y/∂t. To create a time trend in Stata,
you can just generate t = _n, where _n is Stata's observation number. It does not matter where
a trend starts, or the units in which it is ex-
pressed; a trend is merely a series that changes
by a fixed amount per time period. A linear
trend may prove to be inadequate for many
economic series, which we might expect on a
theoretical basis to exhibit constant growth,
not constant increments. In this case, an ex-
ponential trend may readily be estimated (for
strictly positive y) by regressing log y on t. The
slope coefficient is then a direct estimate of
the percentage growth rate per period. We
could also use a polynomial model, such as a
quadratic time trend, regressing the level of
y on t and t^2.
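A sketch of these trend specifications in Stata, with an assumed dependent variable y:

* linear time trend: _n is the observation number
gen t = _n
gen t2 = t^2
* linear trend: the slope is the change in y per period
regress y t
* exponential trend: the slope is the per-period growth rate
gen lny = log(y)
regress lny t
* quadratic trend
regress y t t2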
Nothing about trending economic variables vi-
olates our basic assumptions for the estima-
tion of OLS regression models with time se-
ries data. However, it is important to consider
whether significant trends exist in the series;
if we ignore a common trend, we may be esti-
mating a spurious regression, in which both y
and the X variables appear to be correlated be-
cause of the influence on both of an omitted
factor, the passage of time. We can readily
guard against this by including a time trend
(linear or quadratic) in the regression; if it is
needed, it will appear to be a significant de-
terminant of y. In some cases, inclusion of a
time trend can actually highlight a meaning-
ful relationship between y and one or more x
variables: since their coefficients are now es-
timates of their co-movement with y, ceteris
paribus: that is, net of the trend in y.
We may link the concept of a regression in-
clusive of trend to the common practice of
analyzing detrended data. Rather than regressing y on X and t, we could remove the trend from y and each of the variables in X. How? Regress each variable on t, and save the residuals (if desired, adding back the original mean of the series). This is then the detrended y, call it y∗, and the detrended explanatory variables X∗ (not including a trend term). If we now estimate the regression of y∗ on X∗, we will find that the slope coefficients' point and interval estimates are exactly equal to those from the original regression of y on X and t. Thus, it does not matter whether we first detrend the series and run the regression, or estimate the regression with the trend included. Those are equivalent strategies, and since the latter is less burdensome, it may be preferred by the innately lazy researcher.
Another issue that may often arise in time series data of quarterly, monthly or higher frequency is seasonality. Some economic variables are provided in seasonally adjusted form.
In databanks and statistical publications, the
acronym SAAR (seasonally adjusted at annual
rate) is often found. Other economic series are
provided in their raw form, often labelled NSA,
or not seasonally adjusted. Seasonal factors
play an important role in many series. Natu-
rally, they reflect the seasonal patterns in many
commodities’ measures: agricultural prices dif-
fer between harvest periods and out-of-season
periods, fuel prices differ due to winter demand
for oil and natural gas, or summer demand
for gasoline. But there are seasonal factors
in many series we might consider with a more
subtle interpretation. Retail sales, naturally,
are very high in the holiday period: but so is
the demand for cash, since shoppers and gift-
givers will often need more cash at that time.
Payrolls in the construction industry will ex-
hibit seasonal patterns, as construction falls
off in cold climates, but may be stimulated by
a mild winter. Many financial series will re-
flect the adjustments made by financial firms
to “dress up” quarter-end balance sheets and
improve apparent performance.
If all of the data series we are using in a model
have been seasonally adjusted by their produc-
ers, we may not be concerned about seasonal-
ity. But often we will want to use some NSA
series, or be worried about the potential for
seasonal effects. In this case, just as we dealt
with trending series by including a time trend,
we should incorporate seasonality into the re-
gression model by including a set of seasonal
dummies. For quarterly data, we will need 3
dummies; for monthly data, 11 dummies; and
so on. If we are using business-daily data such
as financial time series, we may want to in-
clude “day-of-week” effects, with dummies for
four of the five business days.
How would you use quarterly dummies in Stata?
First of all, you must know what the time vari-
able in the data set is: give the command
tsset to find out. If it is a quarterly variable, the tsset range will report dates with embedded "q"s. Then you may create one quarterly dummy as gen q1=(quarter(dofq(qtr))==1), which will take on 1 in the first quarter, and 0 otherwise. To consider whether series income exhibits seasonality, regress income L(1/3).q1 and examine the F-statistic. You could, of course, include any three of the four quarter dummies; L(0/2) would include dummies for quarters 1, 2 and 3, and yield the same F-statistic. Note that inclusion of these three dummies will require the loss of at least two observations at the beginning of the sample. This form of seasonal adjustment will consider the effect of each season to be linear; if we wanted to consider multiplicative seasonality, e.g. sales are always 10% higher in the fourth quarter, that could be achieved by regressing log y on the seasonal dummies. A trend could be included in either form of the regression to capture trending behavior over and above seasonality; in the latter regression, of course,
it would represent an exponential (constant
growth) trend.
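An alternative way to generate the full set of quarterly dummies and test for seasonality, assuming the quarterly date variable is named qtr:

* quarter of the year, 1..4, and a full set of dummies qd1-qd4
gen q = quarter(dofq(qtr))
tabulate q, generate(qd)
* include any three dummies, leaving one quarter as the base category
regress income qd2 qd3 qd4
* joint F-test for seasonality
testparm qd2 qd3 qd4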
Just as with a trend, we may either deseason-
alize each series (by regressing it on seasonal
dummies, saving the residuals, and adding the
mean of the original series) and regress sea-
sonally adjusted series on each other; or we
may include a set of seasonal dummies (leav-
ing one out) in a regression of y on X, and test
for the joint significance of the seasonal dum-
mies. The coefficients on the X variables will
be identical, in both point and interval form,
using either strategy.
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 12: Serial correlation and heteroskedas-
ticity in time series regressions
What will happen if we violate the assump-
tion that the errors are not serially corre-
lated, or autocorrelated? We demonstrated
that the OLS estimators are unbiased, even in
the presence of autocorrelated errors, as long
as the explanatory variables are strictly exoge-
nous. This is analogous to our results in the
case of heteroskedasticity, where the presence
of heteroskedasticity alone does not cause bias
nor inconsistency in the OLS point estimates.
However, following that parallel argument, we
will be concerned with the properties of our
interval estimates and hypothesis tests in the
presence of autocorrelation.
OLS is no longer BLUE in the presence of se-
rial correlation, and the OLS standard errors
and test statistics are no longer valid, even
asymptotically. Consider a first-order Markov
error process:
ut = ρut−1 + et, |ρ| < 1 (1)
where the et are uncorrelated random variables
with mean zero and constant variance. What
will be the variance of the OLS slope estimator
in a simple (y on x) regression model? For
simplicity let us center the x series so that \bar{x} = 0. Then the OLS estimator will be:

b_1 = \beta_1 + \frac{\sum_{t=1}^{n} x_t u_t}{SST_x}    (2)
where SSTx is the sum of squares of the x
series. In computing the variance of b1, con-
ditional on x, we must account for the serial
correlation in the u process:
Var(b_1) = \frac{1}{SST_x^2} Var\left( \sum_{t=1}^{n} x_t u_t \right)
= \frac{1}{SST_x^2} \left( \sum_{t=1}^{n} x_t^2 Var(u_t) + 2 \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} x_t x_{t+j} E(u_t u_{t+j}) \right)
= \frac{\sigma^2}{SST_x} + 2 \left( \frac{\sigma^2}{SST_x^2} \right) \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \rho^j x_t x_{t+j}
where σ^2 = Var(ut) and we have used the fact that E(ut ut+j) = Cov(ut, ut+j) = ρ^j σ^2 in the derivation. Notice that the first term in this expression is merely the OLS variance of b1 in the absence of serial correlation. When will the second term be nonzero? When ρ is nonzero, and the x process itself is autocorrelated, this double summation will have a nonzero value. But since nothing prevents the explanatory variables from exhibiting autocorrelation (and in fact many explanatory variables take on similar values through time) the
only way in which this second term will vanish
is if ρ is zero, and u is not serially correlated.
In the presence of serial correlation, the second
term will cause the standard OLS variances of
our regression parameters to be biased and in-
consistent. In most applications, when serial
correlation arises, ρ is positive, so that suc-
cessive errors are positively correlated. In that
case, the second term will be positive as well.
Recall that this expression is the true variance
of the regression parameter; OLS will only con-
sider the first term. In that case OLS will seri-
ously underestimate the variance of the param-
eter, and the t−statistic will be much too high.
If on the other hand ρ is negative–so that suc-
cessive errors result from an “overshooting”
process–then we may not be able to determine
the sign of the second term, since odd terms
will be negative and even terms will be positive.
Surely, though, it will not be zero. Thus the
consequence of serial correlation in the errors–
particularly if the autocorrelation is positive–
will render the standard t− and F−statistics
useless.
Serial correlation in the presence of lagged
dependent variables
A case of particular interest, even in the con-
text of simple y on x regression, is that where
the “explanatory variable” is a lagged depen-
dent variable. Suppose that the conditional
expectation of yt is linear in its past value:
E(yt|yt−1) = β0 + β1yt−1. We can always add
an error term to this relation, and write it as
yt = β0 + β1yt−1 + ut (3)
Let us first assume that the error is “well be-
haved,” i.e. E(ut|yt−1) = 0, so that there is
no correlation between the current error and
the lagged value of the dependent variable. In
this setup the explanatory variable cannot be
strictly exogenous, since there is a contempo-
raneous correlation between yt and ut by con-
struction; but in evaluating the consistency of
OLS in this context we are concerned with the
correlation between the error and yt−1, not the
correlation with yt, yt−2, and so on. In this
case, OLS would still yield consistent point estimates, with biased standard errors, as we derived above, even if the u process were serially correlated.
But it is often claimed that with the joint presence of a lagged dependent variable and autocorrelated errors, OLS will be inconsistent. This
arises, as it happens, from the assumption that
the u process in (3) follows a particular autore-
gressive process, such as the first-order Markov
process in (1). If this is the case, then we
do have a problem of inconsistency, but it is
arising from a different source: the misspeci-
fication of the dynamics of the model. If we
combine (3) with (1), we really have an AR(2)
model for yt, since we can lag (3) one period
and substitute it into (1) to rewrite the model
as:
y_t = β_0 + β_1 y_{t-1} + ρ ( y_{t-1} − β_0 − β_1 y_{t-2} ) + e_t
    = β_0 (1 − ρ) + (β_1 + ρ) y_{t-1} − ρ β_1 y_{t-2} + e_t
    = α_0 + α_1 y_{t-1} + α_2 y_{t-2} + e_t        (4)
so that the conditional expectation of yt prop-
erly depends on two lags of y, not merely one.
Thus the estimation of (3) via OLS is indeed
inconsistent, but the reason for that inconsis-
tency is that y is correctly modelled as AR(2).
The AR(1) model is seen to be a dynamic mis-
specification of (4); as is always the case, the
omission of relevant explanatory variables will
cause bias and inconsistency in OLS estimates,
especially if the excluded variables are corre-
lated with the included variables. In this case,
that correlation will almost surely be meaning-
ful. To arrive at consistent point estimates of
this model, we merely need add yt−2 to the
estimated equation. That does not deal with
the inconsistent interval estimates, which will
require a different strategy.
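As a sketch of that respecification in Stata (with a hypothetical series y in a tsset dataset):

    regress y L.y          // dynamically misspecified if (4) is the true model
    regress y L.y L2.y     // adding the second lag restores consistent point estimates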
Testing for first-order serial correlation
Since the presence of serial correlation invali-
dates our standard hypothesis tests and inter-
val estimates, we should be concerned about
testing for it. First let us consider testing
for serial correlation in the k−variable regres-
sion model with strictly exogenous regressors–
which rules out, among other things, lagged
dependent variables.
The simplest structure which we might posit
for serially correlated errors is AR(1), the first
order Markov process, as given in (1). Let us
assume that et is uncorrelated with the entire
past history of the u process, and that et is ho-
moskedastic. The null hypothesis is H0 : ρ = 0
in the context of (1). If we could observe the
u process, we could test this hypothesis by es-
timating (1) directly. Under the maintained
assumptions, we can replace the unobservable
ut with the OLS residual vt. Thus a regres-
sion of the OLS residuals on their own lagged
values,
v_t = κ + ρ v_{t-1} + ε_t,   t = 2, ..., n        (5)
will yield a t− test. That regression can be run
with or without an intercept, and the robust
option may be used to guard against violations
of the homoskedasticity assumption. It is only
an asymptotic test, though, and may not have
much power in small samples.
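A minimal Stata sketch of this residual-based test, assuming hypothetical variables y and x in a dataset that has been tsset, might be:

    * fit the model and recover the OLS residuals
    regress y x
    predict uhat, residuals
    * regress the residuals on their own lag; the t-test on L.uhat tests rho = 0
    regress uhat L.uhat, vce(robust)

The vce(robust) option guards against violations of homoskedasticity, as noted above.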
A very common strategy in considering the
possibility of AR(1) errors is the Durbin-Watson
test, which is also based on the OLS residuals:
DW = Σ_{t=2}^n ( v_t − v_{t-1} )^2 / Σ_{t=1}^n v_t^2        (6)
Simple algebra shows that the DW statistic is
closely linked to the estimate of ρ from the
large-sample test:
DW ≈ 2 (1 − ρ̂)        (7)
ρ̂ ≈ 1 − DW / 2
The relationship is not exact because of the
difference between (n−1) terms in the numer-
ator and n terms in the denominator of the
DW test. The difficulty with the DW test is
that the critical values must be evaluated from
a table, since they depend on both the number
of regressors (k) and the sample size (n), and
are not unique: for a given level of confidence,
the table contains two values, dL and dU . If
the computed value falls below dL, the null is
clearly rejected. If it falls above dU , there is
no cause for rejection. But in the intervening
region, the test is inconclusive. The test can-
not be used on a model without a constant
term, and it is not appropriate if there are any
lagged dependent variables. You may perform
the test in Stata, after a regression, using the
estat dwatson command.
In the presence of one or more lagged de-
pendent variables, an alternative statistic may
be used: Durbin’s h statistic, which merely
amounts to augmenting (5) with the explana-
tory variables from the original regression. This
test statistic may readily be calculated in Stata
with the estat durbinalt command.
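For example, a brief sketch of both tests (hypothetical variables y and x, data already tsset):

    regress y x
    estat dwatson        // Durbin-Watson d, valid only with strictly exogenous regressors
    regress y L.y x
    estat durbinalt      // Durbin's alternative test, usable with a lagged dependent variable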
Testing for higher-order serial correlation
One of the disadvantages of tests for AR(1)
errors is that they consider precisely that al-
ternative hypothesis. In many cases, if there
is serial correlation in the error structure, it
may manifest itself in a more complex relation-
ship, involving higher-order autocorrelations;
e.g. AR(p). A logical extension to the test de-
scribed in (5) and the Durbin “h” test is the
Breusch-Godfrey test, which considers the
null of nonautocorrelated errors against an al-
ternative that they are AR(p). This can readily
be performed by regressing the OLS residu-
als on p lagged values, as well as the regres-
sors from the original model. The test is the
joint null hypothesis that those p coefficients
are all zero, which can be considered as an-
other nR2 Lagrange multiplier (LM) statistic,
analogous to White’s test for heteroskedastic-
ity. The test may easily be performed in Stata
using the estat bgodfrey command. You must
specify the lag order p to indicate the degree
of autocorrelation to be considered. If p = 1,
the test is essentially Durbin’s “h” statistic.
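A sketch of the Breusch-Godfrey test in Stata, assuming quarterly data and hypothetical variables y and x, and allowing for up to fourth-order autocorrelation:

    regress y x
    estat bgodfrey, lags(4)    // LM test of H0: no autocorrelation up to order 4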
An even more general test often employed on
time series regression models is the Box-Pierce
or Ljung-Box Q statistic, or “portmanteau
test,” which has the null hypothesis that the
error process is “white noise,” or nonautocor-
related, versus the alternative that it is not
well behaved. The “Q” test evaluates the au-
tocorrelation function of the errors, and in that
sense is closely related to the Breusch-Godfrey
test. That test evaluates the conditional au-
tocorrelations of the residual series, whereas
the “Q” statistic uses the unconditional auto-
correlations. The “Q” test can be applied to
any time series as a test for “white noise,” or
randomness. For that reason, it is available
in Stata as the command wntestq. This test
is often reported in empirical papers as an in-
dication that the regression models presented
therein are reasonably specified.
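As an illustration (a sketch with hypothetical variables), the Q test may be applied to the residuals of a fitted model, or indeed to any time series suspected of autocorrelation:

    regress y x
    predict uhat, residuals
    wntestq uhat, lags(12)     // portmanteau (Q) test that uhat is white noise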
Any of these tests may be used to evaluate the
hypothesis that the errors exhibit serial correla-
tion, or nonindependence. But caution should
be exercised when their null hypotheses are re-
jected. It is very straightforward to demon-
strate that serial correlation may be induced by
simple misspecification of the equation–for in-
stance, modeling a relationship as linear when
it is curvilinear, or when it represents expo-
nential growth. Many time series models are
misspecified in terms of inadequate dynam-
ics: that is, the relationship between y and
the regressors may involve many lags of the
regressors. If those lags are mistakenly omit-
ted, the equation suffers from misspecification
bias, and the regression residuals will reflect
the missing terms. In this context, a visual in-
spection of the residuals is often useful. User-
written Stata routines such as tsgraph, sparl
and particularly ofrtplot should be employed
to better understand the dynamics of the re-
gression function. Each may be located and
installed with Stata’s ssc command, and each
is well documented with on–line help.
Correcting for serial correlation with strictly
exogenous regressors
Since we recognize that OLS cannot provide
consistent interval estimates in the presence
of autocorrelated errors, how should we pro-
ceed? If we have strictly exogenous regressors
(in particular, no lagged dependent variables),
we may be able to obtain an appropriate esti-
mator through transformation of the model. If
the errors follow the AR(1) process in (1), we
determine that V ar(ut) = σ2e /(1− ρ2
). Con-
sider a simple y on x regression with auto-
correlated errors following an AR(1) process.
Then simple algebra will show that the quasi-
differenced equation

( y_t − ρ y_{t-1} ) = (1 − ρ) β_0 + β_1 ( x_t − ρ x_{t-1} ) + ( u_t − ρ u_{t-1} )        (8)
will have nonautocorrelated errors, since the
error term in this equation is in fact et, by
assumption well behaved. This transforma-
tion can only be applied to observations 2, ..., n,
but we can write down the first observation in
static terms to complete that, plugging in a
zero value for the time-zero value of u. This ex-
tends to any number of explanatory variables,
as long as they are strictly exogenous; we just
quasi-difference each, and use the quasi-differenced
version in an OLS regression.
But how can we employ this strategy when
we do not know the value of ρ? It turns out
that the feasible generalized least squares
(GLS) estimator of this model merely replaces ρ with a consistent estimate, ρ̂. The resulting model is asymptotically appropriate, even if it lacks small sample properties. We can derive an estimate of ρ from the OLS residuals, or from the calculated value of the Durbin-Watson statistic on those residuals. Most commonly, if this technique is employed, we use an algorithm that implements an iterative scheme, revising the estimate of ρ in a number of steps to derive the final results. One common methodology is the Prais-Winsten estimator, which makes use of the first observation, transforming it separately. It may be used in Stata via the prais command. That same command may also be used to employ the Cochrane-Orcutt estimator, a similar iterative technique that ignores the first observation. (In a large sample, it will not matter if one observation is lost.) This estimator can be executed using the corc option of the prais command.
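A sketch of both estimators, again with hypothetical variables y and x in a tsset dataset:

    prais y x            // Prais-Winsten FGLS, iterating on the estimate of rho
    prais y x, corc      // Cochrane-Orcutt variant, dropping the first observation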
We do not expect these estimators to provide
the same point estimates as OLS, as they are
working with a fundamentally different model.
If they provide similar point estimates, the FGLS
estimator is to be preferred, since its standard
errors are consistent. However, in the presence
of lagged dependent variables, more compli-
cated estimation techniques are required.
An aside on first differencing. An alternative
to employing the feasible GLS estimator, in
which a value of ρ inside the unit circle is esti-
mated and used to transform the data, would
be to first difference the data: that is, trans-
form the left and right hand side variables into
differences. This would indeed be the proper
procedure to follow if it was suspected that
the variables possessed a unit root in their
time series representation. But if the value of
ρ in (1) is strictly less than 1 in absolute value,
first differencing only approximates the appropriate quasi-difference, since differencing is equivalent to imposing ρ = 1 on
the error process. If the process’s ρ is quite dif-
ferent from 1, first differencing is not as good
a solution as applying the FGLS estimator.
Also note that if you difference a standard re-
gression equation in y, x1, x2... you derive an
equation that does not have a constant term.
A constant term in an equation in differences
corresponds to a linear trend in the levels equa-
tion. Unless the levels equation already con-
tains a linear trend, applying differences to that
equation should result in a model without a
constant term.
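A sketch of the first-difference alternative, using Stata's time series operators on hypothetical variables y, x1 and x2:

    regress D.y D.x1 D.x2, noconstant
    * include a constant in the differenced equation only if the levels
    * equation is thought to contain a linear trend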
Robust inference in the presence of auto-
correlation
Just as we utilized the “White” heteroskedasticity-
consistent standard errors to deal with het-
eroskedasticity of unknown form, we may gen-
erate estimates of the standard errors that are
robust to both heteroskedasticity and auto-
correlation. Why would we want to do this
rather than explicitly take account of the au-
tocorrelated errors via the feasible generalized
least squares estimator described earlier? If we
doubt that the explanatory variables may be
considered strictly exogenous, then the FGLS
estimates will not even be consistent, let alone
efficient. Also, FGLS is usually implemented
in the context of an AR(1) model, since it is
much more complex to apply it to a more com-
plex AR structure. But higher-order autocor-
relation in the errors may be quite plausible.
Robust methods may take account of that be-
havior.
The methodology to compute what are often
termed heteroskedasticity- and autocorrelation-
consistent (HAC) standard errors was devel-
oped by Newey and West; thus they are of-
ten referred to as Newey-West standard er-
rors. Unlike the White standard errors, which
require no judgment, the Newey-West stan-
dard errors must be calculated conditional on
a choice of maximum lag. They are calculated
from a distributed lag of the OLS residuals,
and one must specify the longest lag at which
autocovariances are to be computed. Normally
a lag length exceeding the periodicity of the
data will suffice; e.g. at least 4 for quar-
terly data, 12 for monthly data, etc. The
Newey-West (HAC) standard errors may be
readily calculated for any OLS regression using
Stata’s newey command. You must provide the
“option” lag( ), which specifies the maximum
lag order, and your data must be tsset (that is,
known to Stata as time series data). Since the
Newey-West formula involves an expression in
the squares of the residuals which is identical
to White’s formula (as well as a second term
in the cross-products of the residuals), these
robust estimates subsume White’s correction.
Newey-West standard errors in a time series
context are robust to both arbitrary autocor-
relation (up to the order of the chosen lag) as
well as arbitrary heteroskedasticity.
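For instance, with hypothetical quarterly variables y, x1 and x2 in a dataset that has been tsset, Newey-West standard errors with a maximum lag of 4 could be obtained as:

    newey y x1 x2, lag(4)    // HAC standard errors, autocovariances computed up to lag 4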
Heteroskedasticity in the time series con-
text
Heteroskedasticity can also occur in time se-
ries regression models; its presence, while not
causing bias nor inconsistency in the point es-
timates, has the usual effect of invalidating the
standard errors, t−statistics, and F−statistics,
just as in the cross–sectional case. Since the
Newey–West standard error formula subsumes
the White (robust) standard error component,
if the Newey–West standard errors are com-
puted, they will also be robust to arbitrary de-
partures from homoskedasticity. However, the
standard tests for heteroskedasticity assume
independence of the errors, so if the errors are
serially correlated, those tests will not generally
be correct. It thus makes sense to test for se-
rial correlation first (using a heteroskedasticity–
robust test if it is suspected), correct for se-
rial correlation, and then apply a test for het-
eroskedasticity.
In the time series context, it may be quite plau-
sible that if heteroskedasticity—that is, vari-
ations in volatility in a time series process—
exists, it may itself follow an autoregressive
pattern. This can be termed a dynamic form
of heteroskedasticity, in which Engle’s ARCH
(autoregressive conditional heteroskedasticity)
model applies. The simplest ARCH model may
be written as:
y_t = β_0 + β_1 z_t + u_t

E( u_t^2 | u_{t-1}, u_{t-2}, ... ) = E( u_t^2 | u_{t-1} ) = α_0 + α_1 u_{t-1}^2

The second line is the conditional variance of u_t given that series’ past history, assuming that
the u process is serially uncorrelated. Since
conditional variances must be positive, this only
makes sense if α0 > 0 and α1 ≥ 0. We can
rewrite the second line as:
u_t^2 = α_0 + α_1 u_{t-1}^2 + υ_t
which then appears as an autoregressive model
in the squared errors, with stability condition
α1 < 1. When α1 > 0, the squared errors con-
tain positive serial correlation, even though the
errors themselves do not.
If this sort of process is evident in the regres-
sion errors, what are the consequences? First
of all, OLS is still BLUE. There are no as-
sumptions on the conditional variance of the
error process that would invalidate the use of
OLS in this context. But we may want to
explicitly model the conditional variance of the
error process, since in many financial series the
movements of volatility are of key importance
(for instance, option pricing via the standard
Black–Scholes formula requires an estimate of
the volatility of the underlying asset’s returns,
which may well be time–varying).
Estimation of ARCH models—of which there
are now many flavors, with the most common
extension being Bollerslev’s GARCH (gener-
alised ARCH)—may be performed via Stata’s
arch command. Tests for ARCH, which are
based on the squared residuals from an OLS re-
gression, are provided by Stata’s estat archlm
command.
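A sketch of both steps, using hypothetical variables y and z in a tsset dataset:

    regress y z
    estat archlm, lags(1)        // LM test for first-order ARCH in the OLS residuals
    arch y z, arch(1)            // maximum likelihood estimation of the ARCH(1) model
    arch y z, arch(1) garch(1)   // Bollerslev's GARCH(1,1) extension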
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 15: Instrumental variables and two
stage least squares
Many economic models involve endogeneity:
that is, a theoretical relationship does not fit
into the framework of y-on-X regression, in
which we can assume that the y variable is de-
termined by (but does not jointly determine)
X. Indeed, the simplest analytical concepts we
teach in principles of economics—a demand
curve in micro, and the Keynesian consump-
tion function in macro—are relations of this
sort, where at least one of the “explanatory”
variables is endogenous, or jointly determined
with the “dependent” variable. From a math-
ematical standpoint, the difficulties that this
endogeneity causes for econometric analysis are identical to those which we have already considered in two contexts: that of omitted variables, and that of errors-in-variables, or measurement error in the X variables. In each of these three cases, OLS is not capable of delivering consistent parameter estimates. We now turn to a general solution to the problem of endogenous regressors, which as we will see can also be profitably applied in other contexts in which the omitted variable (or poorly measured variable) can be taken into account. The general concept is that of the instrumental variables estimator; a popular form of that estimator, often employed in the context of endogeneity, is known as two-stage least squares (2SLS).
To motivate the problem, let us consider the omitted-variable problem: for instance, a wage equation, which would be correctly specified as:
log(wage) = β0 + β1educ+ β2abil + e (1)
This equation cannot be estimated, because
ability (abil) is not observed. If we had a proxy
variable available, we could substitute it for
abil; the quality of that equation would then
depend on the degree to which it was a good
proxy. If we merely ignore abil, it becomes part
of the error term in the specification:
log(wage) = β0 + β1educ+ u (2)
If abil and educ are correlated, OLS will yield
biased and inconsistent estimates. To consis-
tently estimate this equation, we must find an
instrumental variable: a new variable that
satisfies certain properties. Imagine that vari-
able z is uncorrelated with u, but is correlated
with educ. A variable that meets those two
conditions is an instrumental variable for educ.
We cannot directly test the former assumption,
since we cannot observe u; but we can readily
test the latter assumption, and should do so,
by merely regressing the included explanatory
variable on the instrument:
educ = π0 + π1z + υ (3)
In this regression, we should easily reject H0 :
π1 = 0. It should be clear that there is no
unique choice of an instrument in this situa-
tion; many potential variables could meet these
two conditions, of being uncorrelated with the
unobservable factors influencing the wage (in-
cluding abil) and correlated with educ. Note
that in this context we are not searching for a
proxy variable for abil; if we had a good proxy
for abil, it would not make a satisfactory instru-
mental variable, since correlation with abil im-
plies correlation with the error process u. What
might serve in this context? Perhaps some-
thing like the mother’s level of education, or
the number of siblings, would make a sensible
instrument. If we determine that we have a
reasonable instrument, how may it be used?
Return to the misspecified equation (2), and
write it in general terms of y and x :
y = β0 + β1x+ u (4)
If we now take the covariance of each term in
the equation with our instrument z, we find:
Cov(y, z) = β1Cov(x, z) + Cov(u, z) (5)
We have made use of the fact that the covari-
ance with a constant is zero. Since by assump-
tion the instrument is uncorrelated with the
error process u, the last term has expectation
zero, and we may solve (5) for our estimate of
β1 :
b_1 = Cov(y, z) / Cov(x, z) = [ Σ (y_i − ȳ)(z_i − z̄) ] / [ Σ (x_i − x̄)(z_i − z̄) ]        (6)
Note that this estimator has an interesting spe-
cial case where x = z : that is, where an ex-
planatory variable may serve as its own instru-
ment, which would be appropriate if Cov(x, u) =
0. In that case, this estimator may be seen to be the OLS estimator of β1. Thus, we may consider OLS as a special case of IV, usable when the assumption of exogeneity of the x variable(s) may be made. We may also note that the IV estimator is consistent, as long as the two key assumptions about the instrument's properties are satisfied. The IV estimator is not an unbiased estimator, though, and in small samples its bias may be substantial.
Inference with the IV estimator
To carry out inference–compute interval estimates and hypothesis tests–we assume that the error process is homoskedastic: in this case, conditional on the instrumental variable z, not the included explanatory variable x. With this additional assumption, we may derive the asymptotic variance of the IV estimator as:

Var(b_1) = σ^2 / ( SST_x ρ^2_{xz} )        (7)

where n is the sample size, SST_x is the total sum of squares of the explanatory variable, and ρ^2_{xz} is the R^2 (or squared correlation) in a regression of x on z: that is, equation (3). This quantity can be consistently estimated; σ^2 from the regression residuals, just as with OLS. Notice that as the correlation between the explanatory variable x and the instrument z increases, ceteris paribus, the sampling variance of b_1 decreases. Thus, an instrumental variables estimate generated from a “better” instrument will be more precise (conditional, of course, on the instrument having zero correlation with u). Note as well that this estimated variance must exceed that of the OLS estimator of b_1, since 0 ≤ ρ^2_{xz} ≤ 1. In the case where an explanatory variable may serve as its own instrument, the squared correlation is unity. The IV estimator will always have a larger asymptotic variance than will the OLS estimator, but that merely reflects the introduction of an additional source of uncertainty (in the form of
the instrument, imperfectly correlated with the
explanatory variable).
What will happen if we use the instrumental variables estimator with a “poor” or “weak” instrument? A weak correlation between x and z can produce sizable bias in the estimator. If there is any
correlation between z and u, a weak correla-
tion between x and z will render IV estimates
inconsistent. Although we cannot observe the
correlation between z and u, we can empirically
evaluate the correlation between the explana-
tory variable and its instrument, and should
always do so.
It should also be noted that an R2 measure in
the context of the IV estimator is not the “per-
centage of variation explained” measure that
we are familiar with in OLS terms. In the pres-
ence of correlation between x and u, we can no
longer decompose the variation in y into two
independent components, SSE and SSR, and
R2 has no natural interpretation. In the OLS
context, a joint hypothesis test can be writ-
ten in terms of R2 measures; that cannot be
done in the IV context. Just as the asymp-
totic variance of an IV estimator exceeds that
of OLS, the R2 measure from IV will never
beat that which may be calculated from OLS.
If we wanted to maximize R2, we would just
use OLS; but when OLS is biased and incon-
sistent, we seek an estimation technique that
will focus on providing consistent estimates of
the regression parameters, and not mechani-
cally find the “least squares” solution in terms
of inconsistent parameter estimates.
IV estimates in the multiple regression con-
text
The instrumental variables technique illustrated
above can readily be extended to the case of
multiple regression. To introduce some nota-
tion, consider a structural equation:
y1 = β0 + β1y2 + β2z1 + u1 (8)
where we have suppressed the observation sub-
scripts. The y variables are endogenous; the
z variable is exogenous. The endogenous na-
ture of y2 implies that if this equation is esti-
mated by OLS, the point estimates will be bi-
ased and inconsistent, since the error term will
be correlated with y2. We need an instrument
for y2 : a variable that is correlated with y2,
but not correlated with u. Let us write the en-
dogenous explanatory variable in terms of the
exogenous variables, including the instrument
z2 :
y2 = π0 + π1z1 + π2z2 + v (9)
The key identification condition is that π2 ≠ 0;
that is, after partialling out z1, y2 and z2 are
still meaningfully correlated. This can readily
be tested by estimating the auxiliary regres-
sion (9). We cannot test the other crucial as-
sumption: that in this context, cov(z2, v) = 0.
Given the satisfaction of these assumptions,
we may then derive the instrumental variables
estimator of (8) by writing down the “normal
equations” for the least squares problem, and
solving them for the point estimates. In this
context, z1 serves as an instrument for itself.
We can extend this logic to include any number
of additional exogenous variables in the equa-
tion; the condition that the analogue to (9)
must have π2 ≠ 0 always applies. Likewise,
we could imagine an equation with additional
endogenous variables; for each additional en-
dogenous variable on the right hand side, we
would have to find another appropriate instru-
ment, which would have to meet the two con-
ditions specified above.
Two stage least squares (2SLS)
What if we have a single endogenous explanatory variable, as in equation (8), but have more than one potential instrument? There might be several variables available, each of which would have a significant coefficient in an equation like (9), and could be considered uncorrelated with u. Depending on which of the potential instruments we employ, we will derive different IV estimates, with differing degrees of precision. This is not a very attractive possibility, since it suggests that depending on how we implement the IV estimator, we might reach different qualitative conclusions about the structural model. The technique of two-stage least squares (2SLS) has been developed to deal with this problem. How might we combine several instruments to produce the single instrument needed to implement IV for equation (8)? Naturally, by running a regression–in this case, an auxiliary regression of the form of equation (9), with all of
the available instruments included as explana-
tory variables. The predicted values from that regression, ŷ2, will serve as the instrument for y2, and this auxiliary regression is the “first stage” of 2SLS. In the “second stage,” we use the IV estimator, making use of the generated instrument ŷ2. The IV estimator we
developed above can be shown, algebraically,
to be a 2SLS estimator; but although the IV
estimator becomes non-unique in the presence
of multiple instruments, the 2SLS estimation
technique will always yield a unique set of pa-
rameter values for a given instrument list.
Although from a pedagogical standpoint we
speak of the two stages, we should not actually
perform 2SLS “by hand.” Why? Because the
second stage will yield the “wrong” residuals
(being computed from the instruments rather
than the original variables), which implies that
all statistics computed from those residuals will
be incorrect (the estimate s2, the estimated
standard errors of the parameters, etc.) We
should make use of a computer program that
has a command to perform 2SLS (or, as some
programs term it, instrumental variables). In
Stata, you use the ivregress command to per-
form either IV or 2SLS estimation. The syntax
of ivregress is:
ivregress 2sls depvar [varlist1] (varlist2 = varlist_iv)
where depvar is the dependent variable; varlist1,
which may not be present, is the list of in-
cluded exogenous variables (such as z1 in equation (8)); varlist2 contains the included endogenous variables (such as y2 in equation (8)); and varlist_iv contains the list of instruments
that are not included in the equation, but will
be used to form the instrumental variables es-
timator. If we wanted to estimate equation
(8) with Stata, we would give the command
ivregress 2sls y1 z1 (y2 = z2). If we had ad-
ditional exogenous variables in the equation,
they would follow z1. If we had additional in-
struments (and were thus performing 2SLS),
we would list them after z2.
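Returning to the wage example, a sketch of the command (with hypothetical dataset variables lwage, educ, motheduc and sibs, where mother's education and number of siblings play the role of the instruments suggested earlier) would be:

    ivregress 2sls lwage (educ = motheduc sibs), first
    * the first option reports the first-stage regression, so the partial
    * correlation between educ and the instruments can be inspected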
The 2SLS estimator may be applied to a much
more complex model, in which there are mul-
tiple endogenous explanatory variables (which
would be listed after y2 in the command), as
well as any number of instruments and included
exogenous variables. The constraint that must
always be satisfied is related to the parenthe-
sized lists: the order condition for identifi-
cation. Intuitively, it states that for each in-
cluded endogenous variable (e.g. y2), we must
have at least one instrument—that is, one ex-
ogenous variable that does not itself appear in
the equation, or satisfies an exclusion restric-
tion. If there are three included endogenous
variables, then we must have no fewer than
three instruments after the equals sign, or the
equation will not be identified. That is, it will
not be possible to solve for a unique solution
in terms of the instrumental variables estima-
tor. In the case (such as the example above)
where the number of included endogenous vari-
ables exactly equals the number of excluded
exogenous variables, we satisfy the order con-
dition with equality, and the standard IV es-
timator will yield a solution. Where we have
more instruments than needed, we satisfy the
order condition with inequality, and the 2SLS
form of the estimator must be used to derive
unique estimates, since we have more equa-
tions than unknowns: the equation is overi-
dentified. If we have fewer instruments than
needed, we fail the order condition, since there
are more unknowns than equations. No econo-
metric technique can solve this problem of un-
deridentification. There are additional condi-
tions for identification—the order condition is
necessary, but not sufficient—as it must also
be the case that the excluded instruments have a nonzero partial correlation with the included endogenous explanatory variable.
This would fail, for instance, if one of our can-
didate instruments was actually a linear com-
bination of the included exogenous variables.
IV and errors-in-variables
The instrumental variables estimator can also
be used fruitfully to deal with the errors-in-
variables model discussed earlier–not surpris-
ingly, since the econometric difficulties caused
by errors-in-variables are mathematically the
same problem as that of an endogenous ex-
planatory variable. To deal with errors-in-variables,
we need an instrument for the mismeasured x
variable that satisfies the usual assumptions:
being well correlated with x, but not corre-
lated with the error process. If we could find a
second measurement of x–even one also subject to measurement error–we could use it as an instrument, since it would presumably be well correlated with x itself, but if generated by an independent measurement process, uncorrelated with the original x's measurement error. Thus, we might conduct a household survey which inquires about disposable income, consumption, and saving. The respondents' answers about their saving last year might well be mismeasured, since it is much harder to track saving than, say, earned income. The same could be said for their estimates of how much they spent on various categories of consumption. But using income and consumption data, we could derive a second (mismeasured) estimate of saving, and use it as an instrument to mitigate the problems of measurement error in the direct estimate.
IV may also be used to solve proxy problems; imagine that we are regressing log(wage) on
education and experience, using a theoretical
model that suggests that “ability” should ap-
pear as a regressor. Since we do not have a
measure of ability, we use a test score as a
proxy variable. That may introduce a prob-
lem, though, since the measurement error in
the relation of test score to ability will cause
the test score to be correlated with the error
term. This might be dealt with if we had a
second test score measure–on a different apti-
tude test–which could then be used as an in-
strument. The two test scores are likely to be
correlated, and the measurement error in the
first (the degree that it fails to measure abil-
ity) should not be correlated with the second
score.
Tests for endogeneity and overidentifying
restrictions
Since the use of IV will necessarily inflate the
variances of the estimators, and weaken our
ability to make inferences from our estimates,
we might be concerned about the need to ap-
ply IV (or 2SLS) in a particular equation. One
form of a test for endogeneity can be readily
performed in this context. Imagine that we
have the equation:
y1 = β0 + β1y2 + β2z1 + β3z2 + u1 (10)
where y2 is the single endogenous explanatory
variable, and the z′s are included exogenous
variables. Imagine that the equation is overi-
dentified for IV: that is, we have at least two
instruments (in this case, z3 and z4) which
could be used to estimate (10) via 2SLS. If
we performed 2SLS, we would be estimating
the following reduced form equation in the
“first stage”:
y2 = π0 + π1z1 + π2z2 + π3z3 + π4z4 + v (11)
which would allow us to compute OLS residuals, v̂. Those residuals will be that part of y2 not correlated with the z's. If there is a problem of endogeneity of y2 in equation (10), it will occur because cov(v, u1) ≠ 0. We cannot observe v, but we can calculate a consistent estimate of v as v̂. Including v̂ as an additional regressor in the OLS model

y1 = β0 + β1 y2 + β2 z1 + β3 z2 + δ v̂ + ω        (12)
and testing for the significance of δ will give
us the answer. If cov(v, u1) = 0, our estimate
of δ should not be significantly different from
zero. If that is the case, then there is no ev-
idence that y2 is endogenous in the original
equation, and OLS may be applied. If we reject
the hypothesis that δ = 0, we should not rely
on OLS, but should rather use IV (or 2SLS).
This test may also be generalized for the pres-
ence of multiple included endogenous variables
in (10); the relevant test is then an F−test,
jointly testing that a set of δ coefficients are
all zero. This test is available within Stata as
the estat endog command following ivregress.
Although we can never directly test the maintained hypothesis that the instruments are uncorrelated with the error process u, we can derive indirect evidence on the suitability of the instruments if we have an excess of instruments: that is, if the equation is overidentified, so that we are using 2SLS. The ivregress residuals may be regressed on all exogenous variables (included exogenous variables plus instruments). Under the null hypothesis that all IVs are uncorrelated with u, a Lagrange multiplier statistic of the nR2 form will not exceed the critical point on a χ2(r) distribution, where r is the number of overidentifying restrictions (i.e. the number of excess instruments). If we reject this hypothesis, then we cast doubt on the suitability of the instruments; at least some of them do not appear to be satisfying the condition of orthogonality with the error process. This test is available within Stata as the estat overid command following ivregress.
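Using the notation of equations (10) and (11), a sketch of both tests (with hypothetical dataset variables named to match) would be:

    ivregress 2sls y1 z1 z2 (y2 = z3 z4)
    estat endog      // test of the null that y2 is exogenous
    estat overid     // test of the overidentifying restrictions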
Applying 2SLS in a time series context
When there are concerns of included endoge-
nous variables in a model fit to time series
data, we have a natural source of instruments
in terms of predetermined variables. For in-
stance, if y2t is an explanatory variable, its own
lagged values, y2t−1 or y2t−2, might be used as
instruments: they are likely to be correlated
with y2t, and they will not be correlated with
the error term at time t, since they were gen-
erated at an earlier point in time. The one
caveat that must be raised in this context re-
lates to autocorrelated errors: if the errors are
themselves autocorrelated, then the presumed
exogeneity of predetermined variables will be in
doubt. Tests for autocorrelated errors should
be conducted; in the presence of autocorrela-
tion, more distant lags might be used to miti-
gate this concern.
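A sketch of this approach, with a hypothetical time variable t, dependent variable y, included exogenous variable x and endogenous regressor y2:

    tsset t
    ivregress 2sls y x (y2 = L.y2 L2.y2)
    * the lags of y2 are valid instruments only if the error process
    * is not itself autocorrelated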
Wooldridge, Introductory Econometrics, 4th
ed.
Chapter 16: Simultaneous equations mod-
els
An obvious reason for the endogeneity of ex-
planatory variables in a regression model is si-
multaneity: that is, one or more of the “ex-
planatory” variables are jointly determined with
the “dependent” variable. Models of this sort
are known as simultaneous equations mod-
els (SEMs), and they are widely utilized in
both applied microeconomics and macroeco-
nomics. Each equation in a SEM should be a
behavioral equation which describes how one
or more economic agents will react to shocks
or shifts in the exogenous explanatory vari-
ables, ceteris paribus. The simultaneously de-
termined variables often have an equilibrium
interpretation, and we consider that these vari-
ables are only observed when the underlying
model is in equilibrium. For instance, a de-
mand curve relating the quantity demanded to
the price of a good, as well as income, the
prices of substitute commodities, etc. concep-
tually would express that quantity for a range
of prices. But the only price-quantity pair that
we observe is that resulting from market clear-
ing, where the quantities supplied and demanded
were matched, and an equilibrium price was
struck. In the context of labor supply, we
might relate aggregate hours to the average
wage and additional explanatory factors:
hi = β0 + β1wi + β2z1 + ui (1)
where the unit of observation might be the
county. This is a structural equation, or be-
havioral equation, relating labor supply to its
causal factors: that is, it reflects the structure
of the supply side of the labor market. This
equation resembles many that we have consid-
ered earlier, and we might wonder why there
would be any difficulty in estimating it. But
if the data relate to an aggregate–such as the
hours worked at the county level, in response
to the average wage in the county–this equa-
tion poses problems that would not arise if, for
instance, the unit of observation was the indi-
vidual, derived from a survey. Although we can
assume that the individual is a price- (or wage-)
taker, we cannot assume that the average level
of wages is exogenous to the labor market in
Suffolk County. Rather, we must consider that
it is determined within the market, affected by
broader economic conditions. We might con-
sider that the z variable expresses wage levels
in other areas, which would cet.par. have an
effect on the supply of labor in Suffolk County;
higher wages in Middlesex County would lead
to a reduction in labor supply in the Suffolk
County labor market, cet. par.
To complete the model, we must add a speci-
fication of labor demand:
hi = γ0 + γ1wi + γ2z2 + υi (2)
where we model the quantity demanded of la-
bor as a function of the average wage and ad-
ditional factors that might shift the demand
curve. Since the demand for labor is a de-
rived demand, dependent on the cost of other
factors of production, we might include some
measure of factor cost (e.g. the cost of capi-
tal) as this equation’s z variable. In this case,
we would expect that a higher cost of capital
would trigger substitution of labor for capital
at every level of the wage, so that γ2 > 0. Note
that the supply equation represents the behav-
ior of workers in the aggregate, while the de-
mand equation represents the behavior of em-
ployers in the aggregate. In equilibrium, we
would equate these two equations, and expect that at some equilibrium level of labor utilization and average wage the labor market clears. These two equations then constitute a simultaneous equations model (SEM) of the labor market.

Neither of these equations may be consistently estimated via OLS, since the wage variable in each equation is correlated with the respective error term. How do we know this? Because these two equations can be solved and rewritten as two reduced form equations in the endogenous variables hi and wi. Each of those variables will depend on the exogenous variables in the entire system–z1 and z2–as well as the structural errors ui and υi. In general, any shock to either labor demand or supply will affect both the equilibrium quantity and price (wage). Even if we rewrote one of these equations to place the wage variable on the left hand side, this problem would persist: both endogenous variables in the system are jointly determined by the exogenous variables and structural shocks. Another implication of this structure is that we must have separate explanatory
factors in the two equations. If z1 = z2, for in-
stance, we would not be able to solve this sys-
tem and uniquely identify its structural param-
eters. There must be factors that are unique
to each structural equation that, for instance,
shift the supply curve without shifting the de-
mand curve.
The implication here is that even if we only
care about one of these structural equations–
for instance, we are tasked with modelling la-
bor supply, and have no interest in working
with the demand side of the market–we must
be able to specify the other structural equa-
tions of the model. We need not estimate
them, but we must be able to determine what
measures they would contain. For instance,
consider estimating the relationship between
murder rate, number of police, and wealth for
a number of cities. We might expect that both
of those factors would reduce the murder rate,
cet.par.: more police are available to appre-
hend murderers, and perhaps prevent murders,
while we might expect that lower-income cities
might have greater unrest and crime. But can
we reasonably assume that the number of po-
lice (per capita) is exogenous to the murder
rate? Probably not, in the sense that cities
striving to reduce crime will spend more on po-
lice. Thus we might consider a second struc-
tural equation that expressed the number of
police per capita as a function of a number of
factors. We may have no interest in estimat-
ing this equation (which is behavioral, reflect-
ing the behavior of city officials), but if we are
to consistently estimate the former equation–
the behavioral equation reflecting the behavior
of murderers–we will have to specify the sec-
ond equation as well, and collect data for its
explanatory factors.
Simultaneity bias in OLS
What goes wrong if we use OLS to estimate a structural equation containing endogenous explanatory variables? Consider the structural system:

y1 = α1 y2 + β1 z1 + u1        (3)
y2 = α2 y1 + β2 z2 + u2

in which we are interested in estimating the first equation. Assume that the z variables are exogenous, in that each is uncorrelated with each of the error processes u. What is the correlation between y2 and u1? If we substitute the first equation into the second, we derive:

y2 = α2 (α1 y2 + β1 z1 + u1) + β2 z2 + u2
(1 − α2 α1) y2 = α2 β1 z1 + β2 z2 + α2 u1 + u2        (4)

If we assume that α2 α1 ≠ 1, we can derive the reduced form equation for y2 as:

y2 = π21 z1 + π22 z2 + υ2        (5)
where the reduced form error term υ2 = (α2 u1 + u2) / (1 − α2 α1). Thus y2 depends on u1, and estimation by
OLS of the first equation in (3) will not yield
consistent estimates. We can consistently es-
timate the reduced form equation (5) via OLS,
and that in fact is an essential part of the strat-
egy of the 2SLS estimator. But the parameters
of the structural equation are nonlinear trans-
formations of the reduced form parameters, so
being able to estimate the reduced form pa-
rameters does not achieve the goal of provid-
ing us with point and interval estimates of the
structural equation.
In this special case, we can evaluate the simul-
taneity bias that would result from improperly
applying OLS to the original structural equa-
tion. The covariance of y2 and u1 is equal to
the covariance of y2 and υ2:
=[α2/ (1− α2α1)E
(u2
1
)]= [α2/ (1− α2α1)]σ2
1 (6)
If we have some priors about the signs of the
α parameters, we may sign the bias. Generally,
it could be either positive or negative; that is,
the OLS coefficient estimate could be larger
or smaller than the correct estimate, but will
not be equal to the population parameter in
an expected sense unless the bracketed expres-
sion is zero. Note that this would happen if
α2 = 0 : that is, if y2 was not simultaneously
determined with y1. But in that case, we do not
have a simultaneous system; the model in that
case is said to be a recursive system, which
may be consistently estimated with OLS.
Identifying and estimating a structural equa-
tion
The tool that we will apply to consistently
estimate structural equations such as (3) is
one that we have seen before: two-stage least
squares (2SLS). The application of 2SLS in a
structural system is more straightforward than
the general application of instrumental vari-
ables estimators, since the specification of the
system makes clear what variables are available
as instruments. Let us first consider a slightly
different two-equation structural system:
q = α1p+ β1z1 + u1 (7)
q = α2p+ u2
We presume these equations describe the work-
ings of a market, and that the equilibrium con-
dition of market clearing has been imposed.
Let q be per capita milk consumption at the
county level, p be the average price of a gallon
of milk in that county, and let z1 be the price
of cattle feed. The first structural equation
is thus the supply equation, with α1 > 0 and
β1 < 0: that is, a higher cost of production
will generally reduce the quantity supplied at
the same price per gallon. The second equa-
tion is the demand equation, where we pre-
sume that α2 < 0, reflecting the slope of the
demand curve in the {p, q} plane. Given a ran-
dom sample on {p, q, z1}, what can we achieve?
The demand equation is said to be identified–
in fact, exactly identified–since one instru-
ment is needed, and precisely one is available.
z1 is available because the demand for milk
does not depend on the price of cattle feed, so
we take advantage of an exclusion restriction
that makes z1 available to identify the demand
curve. Intuitively, we can think of variations
in z1 shifting the supply curve up and down,
tracing out the demand curve; in doing so, it
makes it possible for us to estimate the struc-
tural parameters of the demand curve.
What about the supply curve? It, also, has
a problem of simultaneity bias, but it turns
out that the supply equation is unidentified.
Given the model as we have laid it out, there
is no variable available to serve as an instru-
ment for p : that is, we need a variable that
affects demand (and shifts the demand curve)
but does not directly affect supply. In this
case, no such variable is available, and we can-
not apply the instrumental variables technique
without an instrument. What if we went back
to the drawing board, and realized that the
price of orange juice should enter the demand
equation–although it tastes terrible on corn
flakes, orange juice might be a healthy substi-
tute for quenching one’s thirst? Then the sup-
ply curve would be identified–exactly identified–
since we now would have a single instrument
that served to shift demand but did not enter
the supply relation. What if we also consid-
ered the price of beer as an additional demand
factor? Then we would have two available in-
struments (presuming that each is appropri-
ately correlated), and 2SLS would be used to
“boil them down” into the single instrument
needed. In that case, we would say that the
supply curve would be overidentified.
The identification status of each structural equa-
tion thus hinges upon exclusion restrictions:
our a priori statements that certain variables
do not appear in certain structural equations.
If they do not appear in a structural equation,
they may be used as instruments to assist in
identifying the parameters of that equation.
For these variables to successfully identify the
parameters, they must have nonzero popula-
tion parameters in the equation in which they
are included. Consider an example:
hours = f1 (log(wage), educ, age, kl6, wifeY )
log(wage) = f2( hours, educ, xper, xper2 )        (8)
The first equation is a labor supply relation,
expressing the number of hours worked by a
married woman as a function of her wage, ed-
ucation, age, the number of preschool children,
and non-wage income (including spouses’s earn-
ings). The second equation is a labor demand
equation, expressing the wage to be paid as
a function of hours worked, the employee's education, and a polynomial in her work experience. The exclusion restrictions indicate that the demand for labor does not depend on the worker's age (nor should it!), the presence of preschool kids, or other resources available to the worker. Likewise, we assume that the woman's willingness to participate in the market does not depend on her labor market experience. One instrument is needed to identify each equation; xper and xper2 are available to identify the supply equation, while age, kl6 and wifeY are available to identify the demand equation. This is the order condition for identification, essentially counting instruments and variables to be instrumented; each equation is overidentified. But the order condition is only necessary; the sufficient condition is the rank condition, which essentially states that in the reduced-form equation:
log(wage) = g( educ, age, kl6, wifeY, xper, xper2 )        (9)
at least one of the population coefficients on
{xper, xper2} must be nonzero. But since we
can consistently estimate this equation with
OLS, we may generate sample estimates of
those coefficients, and test the joint null that
both coefficients are zero. If that null is re-
jected, then we satisfy the rank condition for
the first equation, and we may proceed to esti-
mate it via 2SLS. The equivalent condition for
the demand equation is that at least one of the
population coefficients {age, kl6, wifeY } in the
regression of hours on the system’s exogenous
variables is nonzero. If any of those variables
are significant in the equivalent reduced-form
equation, it may be used as an instrument to
estimate the demand equation via 2SLS.
The application of two-stage least squares (via
Stata’s ivregress 2sls command) involves iden-
tifying the endogenous explanatory variable(s),
the exogenous variables that are included in
each equation, and the instruments that are
excluded from each equation. To satisfy the
order condition, the list of (excluded) instru-
ments must be at least as long as the list of en-
dogenous explanatory variables. This logic car-
ries over to structural equation systems with
more than two endogenous variables / equa-
tions; a structural model may have any num-
ber of endogenous variables, each defined by
an equation, and we can proceed to evaluate
the identification status of each equation in
turn, given the appropriate exclusion restric-
tions. Note that if an equation is uniden-
tified, due to the lack of appropriate instru-
ments, then no econometric technique may be
used to estimate its parameters. In that case,
we do not have knowledge that would allow us
to “trace out” that equation’s slope while we
move along it.
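As a sketch, the two equations in (8) could each be estimated with ivregress, assuming dataset variables named hours, lwage (for log(wage)), educ, age, kl6, wifeY, xper and xpersq (for xper2):

    * supply equation: xper and xpersq serve as instruments for lwage
    ivregress 2sls hours educ age kl6 wifeY (lwage = xper xpersq)
    estat overid
    * demand equation: age, kl6 and wifeY serve as instruments for hours
    ivregress 2sls lwage educ xper xpersq (hours = age kl6 wifeY)
    estat overid

Each equation is overidentified, so the test of overidentifying restrictions is available after each estimation.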
Simultaneous equations models with time
series
One of the most common applications of 2SLS
in applied work is the estimation of structural
time series models. For instance, consider a
simple macro model:
Ct = β0 + β1 (Yt − Tt) + β2rt + u1t
It = γ0 + γ1rt + u2t
Yt = Ct + It +Gt (10)
In this system, aggregate consumption each
quarter is determined jointly with disposable
income. Even if we assume that taxes are ex-
ogenous (and in fact they are responsive to
income), the consumption function cannot be
consistently estimated via OLS. If the interest
rate is taken as exogenous (set, for instance,
by monetary policy makers) then the invest-
ment equation may be consistently estimated
via OLS. The third equation is an identity; it
need not be estimated, and holds without er-
ror, but its presence makes explicit the simul-
taneous nature of the model. If r is exoge-
nous, then we need one instrument to estimate
the consumption function; government spend-
ing will suffice, and consumption will be exactly
identified. If r is to be taken as endogenous,
we would have to add at least one equation
to the model to express how monetary pol-
icy reacts to economic conditions. We might
also make the investment function more re-
alistic by including dynamics–that investment
depends on lagged income, for instance, Yt−1
(firms make investment spending plans based
on the demand for their product). This would
allow Yt−1, a predetermined variable, to be
used as an additional instrument in estimation
of the consumption function. We may also
use lags of exogenous variables–for instance,
lagged taxes or government spending–as in-
struments in this context.
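A sketch of the consumption function estimation, assuming hypothetical quarterly series C, Y, T, r and G that have been tsset:

    generate Yd = Y - T              // disposable income
    ivregress 2sls C r (Yd = G)      // exactly identified: G instruments Yd
    ivregress 2sls C r (Yd = G L.Y)  // overidentified, adding lagged income as an instrument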
Although this only scratches the surface of a
broad set of issues relating to the estimation
of structural models with time series data, it
should be clear that those models will generally
require instrumental variables techniques such
as 2SLS for the consistent estimation of their
component relationships.
Wooldridge, Introductory Econometrics, 4th
ed.
Appendix C: Fundamentals of mathemati-
cal statistics
A short review of the principles of mathemati-
cal statistics. Econometrics is concerned with
statistical inference: learning about the char-
acteristics of a population from a sample of the
population. The population is a well-defined
group of subjects–and it is important to de-
fine the population of interest. Are we trying
to study the unemployment rate of all labor
force participants, or only teenaged workers, or
only AHANA workers? Given a population, we
may define an economic model that contains
parameters of interest–coefficients, or elastic-
ities, which express the effects of changes in
one variable upon another.
Let Y be a random variable (r.v.) representing
a population with probability density function
(pdf) f(y; θ), with θ a scalar parameter. We
assume that we know f, but do not know the
value of θ. Let a random sample from the pop-
ulation be (Y1, ..., YN) , with Yi being an inde-
pendent random variable drawn from f(y; θ).
We speak of Yi being iid – independently and
identically distributed. We often assume that
random samples are drawn from the Bernoulli distribution: for instance, if I pick a student randomly from my class list, what is the probability that she is female? That probability is γ, where a fraction γ of the students are female, so P(Yi = 1) = γ and P(Yi = 0) = (1 − γ). For
many other applications, we will assume that
samples are drawn from the Normal distribu-
tion. In that case, the pdf is characterized by
two parameters, µ and σ2, expressing the mean
and spread of the distribution, respectively.
Finite sample properties of estimators
The finite sample properties (as opposed to
asymptotic properties) apply to all sample sizes,
large or small. These are of great relevance
when we are dealing with samples of limited
size, and unable to conduct a survey to gener-
ate a larger sample. How well will estimators
perform in this context? First we must distin-
guish between estimators and estimates. An
estimator is a rule, or algorithm, that speci-
fies how the sample information should be ma-
nipulated in order to generate a numerical es-
timate. Estimators have properties–they may
be reliable in some sense to be defined; they
may be easy or difficult to calculate; that dif-
ficulty may itself be a function of sample size.
For instance, a test which involves measuring
the distances between every observation of a
variable involves an order of calculations which
grows more than linearly with sample size. An
estimator with which we are all familiar is the
sample average, or arithmetic mean, of N num-
bers: add them up and divide by N. That es-
timator has certain properties, and its applica-
tion to a sample produces an estimate. We
will often call this a point estimate–since it
yields a single number–as opposed to an inter-
val estimate, which produces a range of val-
ues associated with a particular level of confi-
dence. For instance, an election poll may state
that 55% are expected to vote for candidate
A, with a margin of error of ±4%. If we trust
those results, it is likely that candidate A will
win, with between 51% and 59% of the vote.
We are concerned with the sampling distribu-
tions of estimators–that is, how the estimates
they generate will vary when the estimator is
applied to repeated samples.
What are the finite-sample properties which we
might be able to establish for a given estimator
and its sampling distribution? First of all, we
are concerned with unbiasedness. An estimator W of θ is said to be unbiased if E(W) = θ for all possible values of θ. If an estimator is unbiased, then its probability distribution has an expected value equal to the population parameter it is estimating. Unbiasedness does not mean that a given estimate is equal to θ, or even very close to θ; it means that if we drew an infinite number of samples from the population and averaged the W estimates, we would obtain θ. An estimator that is biased exhibits Bias(W) = E(W) − θ. The magnitude of the bias will depend on the distribution of the Y and the function that transforms Y into W, that is, the estimator. In some cases we can demonstrate unbiasedness (or show that the bias is zero) regardless of the distribution of Y; for instance, consider the sample average Ȳ, which is an unbiased estimate of the population mean µ:
$$E(\bar{Y}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\,E\!\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\,n\mu = \mu$$
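To make unbiasedness concrete, the following small simulation (not part of the original notes; it assumes Python with NumPy is available, and the parameter values are arbitrary) draws many random samples and averages the resulting sample means:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 5.0, 2.0, 30, 100_000

# Draw `reps` independent samples of size n and compute each sample average.
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Averaging the estimates over repeated samples recovers mu (approximately),
# illustrating E(Ybar) = mu, even though any single estimate may miss mu.
print(sample_means.mean())   # close to 5.0
```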
Any hypothesis tests on the mean will require
an estimate of the variance, σ2, from a popu-
lation with mean µ. Since we do not know µ
(but must estimate it with Ȳ), the sample variance is defined as
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2$$
with one degree of freedom lost by the replacement of the population statistic µ with its sample estimate Ȳ. This is an unbiased estimate of
the population variance, whereas the counter-
part with a divisor of n will be biased unless we
know µ. Of course, the degree of this bias will
depend on the difference between $\frac{n}{n-1}$ and unity, which disappears as n → ∞.
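The role of the n − 1 divisor can be checked by simulation; this is a rough sketch under the same assumptions as above (Python with NumPy; arbitrary parameter values), not a derivation from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# ddof=1 divides by (n - 1): the unbiased estimator S^2.
s2_unbiased = samples.var(axis=1, ddof=1)
# ddof=0 divides by n: biased downward by the factor (n - 1)/n when mu is estimated.
s2_biased = samples.var(axis=1, ddof=0)

print(s2_unbiased.mean())  # approximately 4.0
print(s2_biased.mean())    # approximately 4.0 * (n - 1)/n = 3.6
```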
Two difficulties with unbiasedness as a crite-
rion for an estimator: some quite reasonable
estimators are unavoidably biased, but useful;
and more seriously, many unbiased estimators
are quite poor. For instance, picking the first
value in a sample as an estimate of the popula-
tion mean, and discarding the remaining (n−1)
values, yields an unbiased estimator of µ, since
E(Y1) = µ; but this is a very imprecise estima-
tor.
What additional information do we need to
evaluate estimators? We are concerned with
the precision of the estimator as well as its
bias. An unbiased estimator with a smaller
sampling variance will dominate its counter-
part with a larger sampling variance: e.g. we
can demonstrate that the estimator that uses
only the first observation to estimate µ has a
much larger sampling variance than the sample
average, for nontrivial n. What is the sampling
variance of the sample average?
$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\,\mathrm{Var}\!\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(Y_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$
where the variance of the sum equals the sum of the variances because the Yi are drawn independently,
so that the precision of the sample average de-
pends on the sample size, as well as the (un-
known) variance of the underlying distribution
of Y. Using the same logic, we can derive the
sampling variance of the “estimator” that uses
only the first observation of a sample as σ2.
Even for a sample of size 2, the sample mean
will be twice as precise.
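A quick simulation comparison (a sketch only, assuming Python with NumPy and arbitrary parameter values) shows how much more precise the sample average is than the first-observation "estimator":

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 3.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))

mean_est = samples.mean(axis=1)   # the sample average from each sample
first_obs = samples[:, 0]         # the "estimator" using only the first observation

# Both estimators are unbiased, but their sampling variances differ sharply.
print(mean_est.var())    # approximately sigma^2 / n = 0.36
print(first_obs.var())   # approximately sigma^2 = 9.0
```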
This leads us to the concept of efficiency:
given two unbiased estimators of θ, an estima-
tor W1 is efficient relative to W2 when Var(W1) ≤ Var(W2) ∀ θ, with strict inequality for at least
one θ. A relatively efficient unbiased estimator
dominates its less efficient counterpart. We
can compare two estimators, even if one or
both is biased, by comparing mean squared er-
ror (MSE), MSE(W) = E[(W − θ)²]. This ex-
pression can be shown to equal the variance of
the estimator plus the square of the bias; thus,
it equals the variance for an unbiased estima-
tor.
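The decomposition MSE = variance + squared bias can be verified numerically; the sketch below (Python with NumPy assumed, values arbitrary) uses the divisor-n variance estimator, which is biased, as the estimator W:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 500_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
w = samples.var(axis=1, ddof=0)   # biased estimator of sigma^2 (divisor n)

mse = np.mean((w - sigma2) ** 2)                      # E[(W - theta)^2]
var_plus_bias2 = w.var() + (w.mean() - sigma2) ** 2   # Var(W) + Bias(W)^2

print(mse, var_plus_bias2)   # the two quantities agree
```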
Large sample (asymptotic) properties of
estimators
We can compare estimators, and evaluate their
relative usefulness, by appealing to their large
sample properties–or asymptotic properties.
That is, how do they behave as sample size
goes to infinity? We see that the sample aver-
age has a sampling variance with limiting value
of zero as n → ∞. The first asymptotic prop-
erty is that of consistency. If Wn is an estimator of θ based on a sample [Y1, ..., Yn] of size n, Wn is said to be a consistent estimator of θ if, for
every ε > 0,
P (|Wn − θ| > ε)→ 0 as n→∞.
Intuitively, a consistent estimator becomes more
accurate as the sample size increases without
bound. If an estimator does not possess this
property, it is said to be inconsistent. In that
case, it does not matter how much data we
have; the “recipe” that tells us how to use the
data to estimate θ is flawed. If an estimator is
unbiased and its variance shrinks to zero as n → ∞, then the estimator is consistent (more generally, it suffices that both the bias and the variance vanish as n → ∞).
A consistent estimator has probability limit,
or plim, equal to the population parameter:
plim(Ȳ) = µ. Some mechanics of plims: let
θ be a parameter and g (·) a continuous func-
tion, so that γ = g(θ). Suppose plim(Wn) = θ,
and we devise an estimator of γ, Gn = g(Wn).
Then plim(Gn) = γ, or
plim g(Wn) = g (plim Wn) .
This allows us to establish the consistency of
estimators which can be shown to be transfor-
mations of other consistent estimators. For in-
stance, we can demonstrate that the estimator
given above of the population variance is not
only unbiased but consistent. The standard
deviation is the square root of the variance:
a nonlinear function, continuous for positive
arguments. Thus the standard deviation S is
a consistent estimator of the population stan-
dard deviation. Some additional properties of
plims, if plim(Tn) = α and plim(Un) = β :
plim (Tn + Un) = α+ β
plim (TnUn) = αβ
plim (Tn/Un) = α/β, provided β ≠ 0.
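As an informal check of these plim mechanics (a sketch assuming Python with NumPy; the population values are arbitrary), one can watch S² settle down to σ² as n grows, and its continuous transformation S = √(S²) settle down to σ:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 2.0   # population standard deviation is 2.0

# As n grows, S^2 approaches sigma^2 = 4, and S = sqrt(S^2) approaches sigma = 2,
# illustrating plim g(W_n) = g(plim W_n) for the continuous function g = sqrt.
for n in (10, 100, 10_000, 1_000_000):
    y = rng.normal(mu, sigma, size=n)
    s2 = y.var(ddof=1)
    print(n, s2, np.sqrt(s2))
```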
Consistency is a property of point estimators:
the distribution of the estimator collapses around
the population parameter in the limit, but that
says nothing about the shape of the distribu-
tion for a given sample size. To work with in-
terval estimators and hypothesis tests, we need
a way to approximate the distribution of the es-
timators. Most estimators used in economet-
rics have distributions that are reasonably ap-
proximated by the Normal distribution for large
samples, leading to the concept of asymptotic
normality:
P (Zn ≤ z)→ Φ (z) as n→∞
where Φ (·) is the standard normal cumulative
distribution function (cdf). We will often say
“Zn ∼ N(0,1)” or “Zn is asy N.” This relates to
one form of the central limit theorem (CLT).
If [Y1, ...Yn] is a random sample with mean µ
and variance σ2,
$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$$
has an asymptotic standard normal distribu-
tion. Regardless of the population distribu-
tion of Y, this standardized version of Y will
be asy N, and the entire distribution of Z will
become arbitrarily close to the standard nor-
mal as n → ∞. Since many of the estimators
we will derive in econometrics can be viewed as
sample averages, the law of large numbers and
the central limit theorem can be combined to
show that these estimators will be asy N. In-
deed, the above estimator will be asy N even
if we replace σ with a consistent estimator of
that parameter, S.
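The central limit theorem at work can be illustrated with a deliberately non-normal population; this is a sketch under the usual assumptions (Python with NumPy, arbitrary sample size), not an implementation from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 100_000

# A decidedly non-normal population: exponential with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
mu, sigma = 1.0, 1.0

z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# If the CLT is at work, z should behave like a standard normal variate:
print(z.mean(), z.std())           # near 0 and 1
print(np.mean(np.abs(z) <= 1.96))  # near 0.95
```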
General approaches to parameter estimation
What general strategies will provide us with es-
timators with desirable properties such as un-
biasedness, consistency and efficiency? One of
the most fundamental strategies for estimation
is the method of moments, in which we re-
place population moments with their sample
counterparts. We have seen this above, where
a consistent estimator of the population variance is defined by replacing the unknown population mean µ with a consistent estimate thereof, Ȳ. A sec-
ond widely employed strategy is the principle of
maximum likelihood, where we choose an es-
timator of the population parameter θ by find-
ing the value that maximizes the likelihood of
observing the sample data. We will not fo-
cus on maximum likelihood estimators in this
course, but note their importance in econo-
metrics. Most of our work here is based on the
least squares principle: that to find an esti-
mate of the population parameter, we should
solve a minimization problem. We can readily
show that the sample average is a method of
moments estimator (and is in fact a maximum
likelihood estimator as well). We demonstrate
now that the sample average is a least squares
estimator:
$$\min_{m}\ \sum_{i=1}^{n} (Y_i - m)^2$$
will yield an estimator, m, which is identical
to that defined as Ȳ. We may show that the value m = Ȳ minimizes the sum of squared deviations, and that any other value m′ would yield a larger sum (or would not be “least squares”). Standard re-
gression techniques, to which we will devote
much of the course, are often called “OLS”:
ordinary least squares.
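To see the least squares principle deliver the sample average, a numerical minimization can be compared with the arithmetic mean; this sketch assumes Python with NumPy and SciPy and uses artificial data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.normal(10.0, 3.0, size=200)   # artificial sample

# Minimize the sum of squared deviations sum_i (y_i - m)^2 over m.
result = minimize_scalar(lambda m: np.sum((y - m) ** 2))

print(result.x)    # the least squares solution ...
print(y.mean())    # ... coincides with the sample average
```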
Interval estimation and confidence intervals
Since an estimator will yield a value (or point
estimate) as well as a sampling variance, we
may generally form a confidence interval around
the point estimate in order to make proba-
bility statements about a population param-
eter. For instance, the fraction of Firestone
tires involved in fatal accidents is surely not
0.0005 of those sold. Any number of samples
would yield estimates of that mean differing
from that number (and for a continuous ran-
dom variable, the probability of a point is zero).
But we can test the hypothesis that 0.0005 of
the tires are involved with fatal accidents if we
can generate both a point and interval esti-
mate for that parameter, and if the interval
estimate cannot reject 0.0005 as a plausible
value. This is the concept of a confidence in-
terval, which is defined with regard to a given
level of “confidence” or level of probability. For a sample drawn from a Normal population with unit variance,
$$P\!\left(-1.96 < \frac{\bar{Y} - \mu}{1/\sqrt{n}} < 1.96\right) = 0.95,$$
which defines the interval estimate $\left(\bar{Y} - \frac{1.96}{\sqrt{n}},\ \bar{Y} + \frac{1.96}{\sqrt{n}}\right)$. We do not conclude from this that the probability that µ lies in the interval is 0.95; the population parameter either lies in the interval or it does not. The proper way to consider the confidence interval is that if we construct intervals in this way from a large number of random samples drawn from the population, 95% of those intervals will contain µ. Thus, if a hypothesized value for µ lies outside the confidence interval for a single sample, that would occur by chance only 5% of the time.
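The “95% of intervals contain µ” interpretation can be checked by simulation; the sketch below (Python with NumPy assumed; the population is taken to be N(0, 1) purely for illustration) constructs many intervals and counts how often they cover µ:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, n, reps = 0.0, 100, 50_000

# Population with known unit variance, so each interval is Ybar +/- 1.96/sqrt(n).
samples = rng.normal(mu, 1.0, size=(reps, n))
ybar = samples.mean(axis=1)
half_width = 1.96 / np.sqrt(n)

covered = (ybar - half_width <= mu) & (mu <= ybar + half_width)
print(covered.mean())   # approximately 0.95, the coverage rate of the interval
```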
But what if we do not have a standard normal variate, for which we know the variance equals unity? If we have a variable Y which we conclude is distributed as N(µ, σ²), we arrive at the difficulty that we do not know σ², and thus
cannot specify the confidence interval. Via the
method of moments, we replace the unknown
σ2 with a consistent estimate, S2, to form the
transformed statistic
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
denoting that its distribution is no longer stan-
dard normal, but “Student’s t” with (n − 1) degrees of freedom. The t distribution has fatter tails
than does the normal; above 20 or 25 degrees
of freedom, it is approximated quite well by
the normal. Thus, confidence intervals con-
structed with the t distribution will be wider
for small n, since the value will be larger than
1.96. A 95% confidence interval, given the
symmetry of the t distribution, will leave 2.5%
of probability in each tail (a two-tailed t test).
If cα/2 is the 100(1 − α/2) percentile of the t distribution, a 100(1 − α)% confidence interval for the mean will be defined as:
$$\left(\bar{y} - c_{\alpha/2}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + c_{\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$
where s is the estimated standard deviation of
Y. We often refer to s/√n as the standard er-
ror of the parameter–in this case, the standard
error of our estimate of µ. Note well the dif-
ference between the concepts of the standard
deviation of the underlying distribution (an es-
timate of σ) and the standard error, or preci-
sion, of our estimate of the mean µ. We will
return to this distinction when we consider re-
gression parameters. A simple rule of thumb,
for large samples, is that a 95% confidence in-
terval is roughly two standard errors on either
side of the point estimate–the counterpart of
a “t of 2” denoting significance of a param-
eter. If an estimated parameter is more than
two standard errors from zero, a test of the hy-
pothesis that it equals zero in the population
will likely be rejected.
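For a concrete computation, a t-based 95% confidence interval for µ might be formed as follows (a sketch only; it assumes Python with NumPy and SciPy, and the data are artificial):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(5.0, 2.0, size=25)   # artificial sample
n = y.size

ybar = y.mean()
se = y.std(ddof=1) / np.sqrt(n)     # standard error of the estimated mean

c = stats.t.ppf(0.975, df=n - 1)    # 97.5th percentile of t with n - 1 d.f.
print(ybar - c * se, ybar + c * se) # 95% confidence interval for mu
```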
Hypothesis testing
We want to test a specific hypothesis about
the value of a population parameter θ. We may
believe that the parameter equals 0.42; so that
we state the null and alternative hypotheses:
H0 : θ = 0.42
HA : θ ≠ 0.42
In this case, we have a two-sided alternative:
we will reject the null if our point estimate
is “significantly” below 0.42, or if it is “sig-
nificantly” above 0.42. In other cases, we
may specify the alternative as one-sided. For
instance, in a quality control study, our null
might be that the proportion of rejects from
the assembly line is no more than 0.03, versus
the alternative that it is greater than 0.03. A
rejection of the null would lead to a shutdown
of the production process, whereas a smaller
proportion of rejects would not be cause for
concern. Using the principles of the scientific
method, we set up the hypothesis and consider
whether there is sufficient evidence against the
null to reject it. Like the principle that a find-
ing of guilt must be associated with evidence
beyond a reasonable doubt, the null will stand
unless sufficient evidence is found to reject it
as unlikely. Just as in the courts, there are two
potential errors of judgment: we may find an
innocent person guilty, and reject a null even
when it is true; this is Type I error. We may
also fail to convict a guilty person, or fail to reject a false null; this is Type II error. Just as
the judicial system tries to balance those two
types of error (especially considering the con-
sequences of punishing the innocent, or even
putting them to death), we must be concerned
with the magnitude of these two sources of er-
ror in statistical inference. We construct hy-
pothesis tests so as to make the probability
of a Type I error fairly small; this is the level
of the test, and is usually denoted as α. For
instance, if we operate at a 95% level of con-
fidence, then the level of the test is α = 0.05.
When we set α, we are expressing our tolerance
for committing a Type I error (and rejecting a
true null). Given α, we would like to minimize
the probability of a Type II error, or equiva-
lently maximize the power of the test, which
is just one minus the probability of committing
a Type II error, and failing to reject a false null.
We must balance the level of the test (and
the risk of falsely rejecting the truth) with the
power of the test (and failing to reject a false
null).
When we use a computer program to calculate
point and interval estimates, we are given the
information that will allow us to reject or fail to
reject a particular null. This is usually phrased
in terms of p− values, which are the tail prob-
abilities associated with a test statistic. If the
p-value is less than the level of the test, then it
leads to a rejection: a p-value of 0.035 allows
us to reject the null at the level of 0.05. One
must be careful to avoid misinterpreting a p-value of, say, 0.94: such a value indicates that the data provide essentially no evidence against the null, not that the null is 94% likely to be true.
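As an illustration of how such a p-value arises, the sketch below (assuming Python with NumPy and SciPy; the sample is hypothetical, and the null value 0.42 is the one used above) computes a two-sided p-value for a test on the mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(0.45, 0.1, size=40)   # hypothetical sample
n = y.size

theta0 = 0.42                        # null hypothesis value, H0: theta = 0.42
t_stat = (y.mean() - theta0) / (y.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability, under H0, of a t statistic at least this extreme.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)   # reject H0 at the 5% level if p_value < 0.05
```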
One should also note the duality between con-
fidence intervals and hypothesis tests. They
utilize the same information: the point esti-
mate, the precision as expressed in the stan-
dard error, and a value taken from the under-
lying distribution of the test statistic (such as
1.96). If the boundary of the 95% confidence
interval contains a value δ, then a hypothesis
test that the population parameter equals δ will
be on the borderline of acceptance and rejec-
tion at the 5% level. We can consider these
quantities as either defining an interval esti-
mate for the parameter, or alternatively sup-
porting an hypothesis test for the parameter.