Limited Dependent Variables

Lecture notes for Econometrics (First year, MPhil in Economics)

Simon Quinn*

Hilary 2013

These notes provide some background for six lectures that I will be giving this year for the M.Phil Econometrics core course. I will use slides for the lectures themselves. I will make the slides available online after our last lecture. It is likely that there will be some things in these notes that we do not have time to cover in class, and we may cover some things in class that are not covered in these notes. Though we will focus in class on the most important issues, please consider all of the lectures and all of these notes to be potentially relevant for the exam (except where noted).

For each lecture, I have ‘starred’ (‘⋆’) references to Cameron and Trivedi (2005) and to Wooldridge (2002 and 2010). You are required to read at least one of these, but you do not need to read more. I have also provided other references; you are not required or expected to read these.

* Department of Economics, Centre for the Study of African Economies and St Antony’s College, University of Oxford: [email protected]. I must particularly thank Victoria Prowse for her assistance in preparing these lectures. I must also thank Cameron Chisholm, Cosmina Dorobantu, Sebastian Königs, Julien Labonne and Jeremy Magruder for very useful comments. All errors remain my own.


Lectures

We will spend six lectures on limited dependent variables; the following table summarises.

HILARY

Week 7 Lecture 1 Binary Choice I Monday, 11.30am – 1.00pm

Week 7 Lecture 2 Binary Choice II Tuesday, 11.30am – 1.00pm

Week 7 Lecture 3 Discrete Ordered Choice Wednesday, 9.30am – 11.00am

Week 8 Lecture 4 Discrete Multinomial Choice Monday, 11.30am – 1.00pm

Week 8 Lecture 5 Censored and Truncated Outcomes Tuesday, 11.30am – 1.00pm

Week 8 Lecture 6 Selection Wednesday, 9.30am – 11.00am

Problem sets

There are two problem sets for limited dependent variables. Both were written by Victoria Prowse, and have evolved on this course over several years. The problem sets do not attempt to cover every aspect of our lectures, but they will provide a very useful opportunity for you to consider key concepts and techniques.

So, what’s changed?

This is the second year that I have lectured this topic. In revising for the exam, you may wish to look at past papers — and hence you should note the following small differences between the content for this year and the content prior to 2012:

• We will not be discussing the specific details of any numerical optimisation techniques. (Therefore, for example, you may disregard Question 7(ii)(c) on the June 2011 Exam.)

• We will discuss the Generalised Ordered Probit (and Generalised Ordered Logit), which were not covered prior to 2012.

2 [email protected]

Page 3: Limited DEpendent Variable

Are any of these notes not examinable?

There are a few small aspects of these notes that are not examinable for any question on limited dependent variables:

• I provide some example Stata code; you do not need to know any Stata code for any exam questions about limited dependent variables.

• I briefly discuss nonparametric and semiparametric estimation in a few places; you do not need to know anything about these estimation techniques for any exam questions about limited dependent variables.

• You do not need to be able to repeat the derivation in equations 4.13 to 4.24.

Nonetheless, I include this extra material because I think it may be useful to help your general understanding of the other themes that we will discuss.

3 [email protected]

Page 4: Limited DEpendent Variable

1 Lecture 1: Binary Choice I

Required readings (for Lectures 1 and 2):

• ⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 463 – 478 (i.e. sections 14.1 to 14.4, inclusive); or

• ⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 453 – 461 (i.e. sections 15.1 to 15.4, inclusive); or

• ⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 561 – 569 (i.e. sections 15.1 to 15.4, inclusive).

Other references:

• TRAIN, K. (2009): Discrete Choice Methods with Simulation. Cambridge University Press.

• GOULD, W., PITBLADO, J. AND SRIBNEY, W. (2006): Maximum Likelihood Estimation with Stata. Stata Press.

1.1 An illustrative empirical question

Our first two lectures consider models for binary dependent variables; that is, models for contexts in which our outcome of interest takes just two values. We will focus on a simple illustrative question: how has primary school attendance changed over time in Tanzania? There are many reasons that this question may be important for empirical researchers — for example, it may be of historical interest in understanding Tanzania’s long-run economic development, or it may be important for considering present-day earnings differences across Tanzanian age cohorts.

We shall consider this question using data from Tanzania’s 2005/2006 Integrated Labour Force Survey (‘ILFS’). For simplicity, we will consider a single explanatory variable: the year in which a respondent was born. We index individuals by i, and denote the ith individual’s year of birth as xi. We record educational attainment by a dummy variable, yi, defined such that:

$$y_i = \begin{cases} 0 & \text{if the $i$th individual did not complete primary education;} \\ 1 & \text{if the $i$th individual did complete primary education.} \end{cases} \quad (1.1)$$

(Note immediately that — as with all binary outcome models — this denomination is arbitrary: for example, we could just as easily reverse the assignment of 0 and 1 without changing anything of the structure of the problem.)

Figure 1.1 illustrates the data: it shows the education dummy variable on the y axis (with data points ‘jittered’, for illustrative clarity), and the age variable on the x axis. Note that we will limit consideration to individuals born between 1920 and 1980 (inclusive).

4 [email protected]

Page 5: Limited DEpendent Variable

1.2 A simple model of binary choice

Figure 1.1: Primary school attainment in Tanzania across age cohorts

1.2 A simple model of binary choice

We began with a somewhat imprecise question: how has primary school attendance changed over time in Tanzania? More formally, we will be interested in estimating the following ‘object of interest’:

$$\Pr(y_i = 1 \,|\, x_i). \quad (1.2)$$

That is, we will build and estimate a model of the probability of attaining a primary school education, conditional upon year of birth.

As with most econometric outcome variables, investment in primary school education is a matter of choice. We therefore begin by specifying a (very simple) microeconomic model of investment in education. Denote the ith household’s utility of attending primary school as $U^S_i(x_i)$ and the utility of not attending school (i.e. ‘staying home’) as $U^H_i(x_i)$. For simplicity, we will assume that each utility function is additive in the year in which a child was born:1

$$U^S_i(x_i) = \alpha^S_0 + \alpha^S_1 x_i + \mu^S_i \quad (1.3)$$
$$U^H_i(x_i) = \alpha^H_0 + \alpha^H_1 x_i + \mu^H_i. \quad (1.4)$$

1 This would be the case, for example, if we think that the ‘utility cost’ of primary education has changed linearly over time — or, indeed, the ‘utility benefit’ from a primary education.

5 [email protected]

Page 6: Limited DEpendent Variable

1.3 The probit model

This is a very simple example of an ‘additive random utility model’ (‘ARUM’).

Define $\beta_0 \equiv \alpha^S_0 - \alpha^H_0$, $\beta_1 \equiv \alpha^S_1 - \alpha^H_1$ and $\varepsilon_i \equiv \mu^S_i - \mu^H_i$. Then, trivially, we model a household as having invested in primary education if:

$$\beta_0 + \beta_1 x_i + \varepsilon_i \geq 0. \quad (1.5)$$

We can therefore define a ‘latent variable’, $y^*_i$:

$$y^*_i(x_i; \beta_0, \beta_1) \equiv \beta_0 + \beta_1 x_i + \varepsilon_i. \quad (1.6)$$

We can express this latent variable as determining our outcome variable for the ith individual:

$$y_i = \begin{cases} 0 & \text{if } y^*_i < 0 \\ 1 & \text{if } y^*_i \geq 0. \end{cases} \quad (1.7)$$

So far, so good — but we’re still not in a position to estimate the object of interest. To do this, we need to make a distributional assumption.

1.3 The probit model

Assumption 1.1 (DISTRIBUTION OF εi) εi is i.i.d. with a standard normal distribution, independent of xi:

$$\varepsilon_i \,|\, x_i \sim \mathcal{N}(0, 1). \quad (1.8)$$

You will be familiar with the normal distribution, and with the concepts of the probability density and the cumulative density. Recall that the probability density function φ(x) is:

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \cdot \exp\left(-\frac{x^2}{2}\right), \quad (1.9)$$

and that the cumulative density function (Φ(x)) has no closed-form expression:

$$\Phi(x) = \int_{-\infty}^{x} \phi(z)\, dz. \quad (1.10)$$

Figure 1.2 illustrates. With our distributional assumption, we can now write the conditional probability of primary education:2

$$\begin{aligned}
\Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1) &= \Pr(\beta_0 + \beta_1 x_i + \varepsilon_i \geq 0 \,|\, x_i) &(1.11)\\
&= \Pr(-\varepsilon_i \leq \beta_0 + \beta_1 x_i \,|\, x_i) &(1.12)\\
&= \Pr(\varepsilon_i \leq \beta_0 + \beta_1 x_i \,|\, x_i) &(1.13)\\
&= \Phi(\beta_0 + \beta_1 x_i) &(1.14)\\
\therefore \Pr(y_i = 0 \,|\, x_i; \beta_0, \beta_1) &= 1 - \Phi(\beta_0 + \beta_1 x_i). &(1.15)
\end{aligned}$$

2 Note that equation 1.13 follows from equation 1.12 because, under the assumption of normality, the distribution of ε is symmetric.

6 [email protected]

Page 7: Limited DEpendent Variable

1.3 The probit model

Figure 1.2: The standard normal: probability density (φ(·)) and cumulative density (Φ(·))


Equation 1.14 (and, equivalently, equation 1.15) defines the ‘probit model’. There is certainly no need to motivate the probit model with an additive random utility approach, as we have done; indeed, the vast majority of empirical papers (and econometrics textbooks) start merely with some equivalent to equation 1.14. But the additive random utility approach is useful (i) for thinking about how microeconomic models may motivate econometric analysis, (ii) for explaining the ‘latent variable interpretation’ of the probit model, and (iii) to lay the conceptual groundwork for other limited dependent variable models that we will discuss later (in particular, models of multinomial choice). Train (2009, page 14) explains the utility-based approach in discrete choice models as follows:

Discrete choice models are usually derived under an assumption of utility-maximising behaviour by the decision maker. . . It is important to note, however, that models derived from utility maximisation can also be used to represent decision making that does not entail utility maximisation. The derivation assumes that the model is consistent with utility maximisation; it does not preclude the model from being consistent with other forms of behaviour. The models can also be seen as simply describing the relation of explanatory variables to the outcome of a choice, without reference to exactly how the choice is made.

7 [email protected]

Page 8: Limited DEpendent Variable

1.4 Estimation by maximum likelihood

1.4 Estimation by maximum likelihood

1.4.1 The log-likelihood

Equation 1.14 defines the probit model. But this still requires a method of estimation. The method used for the probit model is maximum likelihood.3 For the ith individual, the likelihood can be written as:

$$L_i(\beta_0, \beta_1; y_i \,|\, x_i) = \Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1)^{y_i} \cdot \Pr(y_i = 0 \,|\, x_i; \beta_0, \beta_1)^{1-y_i} \quad (1.16)$$
$$= \Phi(\beta_0 + \beta_1 x_i)^{y_i} \cdot \left[1 - \Phi(\beta_0 + \beta_1 x_i)\right]^{1-y_i}. \quad (1.17)$$

The log-likelihood, therefore, is:

$$\ell_i(\beta_0, \beta_1; y_i \,|\, x_i) = y_i \cdot \ln \Phi(\beta_0 + \beta_1 x_i) + (1 - y_i) \cdot \ln\left[1 - \Phi(\beta_0 + \beta_1 x_i)\right]. \quad (1.18)$$

Denoting the stacked values of yi and xi as y and x respectively, we can write the log-likelihood for a sample of N individuals as:

$$\ell(\beta_0, \beta_1; \boldsymbol{y} \,|\, \boldsymbol{x}) = \sum_{i=1}^{N} \left\{ y_i \cdot \ln \Phi(\beta_0 + \beta_1 x_i) + (1 - y_i) \cdot \ln\left[1 - \Phi(\beta_0 + \beta_1 x_i)\right] \right\}. \quad (1.19)$$

You will be aware that several numerical algorithms may be used to find the values (β0, β1) that jointly maximise this log-likelihood — for example, the Newton-Raphson method, the Berndt-Hall-Hall-Hausman algorithm, the Davidon-Fletcher-Powell algorithm, the Broyden-Fletcher-Goldfarb-Shanno algorithm, etc. Happily, Stata (and other statistical packages) has these algorithms built in, so we can use these algorithms without having to code them ourselves.
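Purely as an illustration of this point (and, as noted earlier, Stata code is not examinable), equation 1.19 can be handed to Stata’s built-in maximiser directly. This is a minimal sketch of my own: the program name myprobit_lf is hypothetical, and the data are assumed to be the WorkingSample dataset used in the appendix to this lecture.

* A minimal sketch: hand-coding the probit log-likelihood of equation 1.19
* and letting Stata's built-in algorithms maximise it
program define myprobit_lf
    args lnf xb
    * ln Phi(xb) when y = 1; ln[1 - Phi(xb)] when y = 0
    quietly replace `lnf' = ln(normal(`xb')) if $ML_y1 == 1
    quietly replace `lnf' = ln(1 - normal(`xb')) if $ML_y1 == 0
end

ml model lf myprobit_lf (primaryplus = yborn)
ml maximize

The results should match those from the canned command probit primaryplus yborn.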

1.4.2 Properties of the maximum likelihood estimator

Before we go on with our probit example, we should briefly revise several important properties of maximum likelihood estimators. Suppose that we have an outcome vector, y, and a matrix of explanatory variables, X; further, suppose that we are interested in fitting a parameter vector β. You will recall that we can generally specify the log-likelihood as:

$$\ell(\boldsymbol{\beta}; \boldsymbol{y} \,|\, \boldsymbol{X}) = \ln f(\boldsymbol{y} \,|\, \boldsymbol{X}; \boldsymbol{\beta}); \quad (1.20)$$

that is, the log-likelihood is the log of the conditional probability density (or probability mass) of y, given X, for some parameter value β. This can formally be described as the conditional log-likelihood function, but we usually just term it ‘the log-likelihood’.4 Further, you will recall that, if we assume observations are independent across individuals, we can decompose the log-likelihood as:

$$\ell(\boldsymbol{\beta}; \boldsymbol{y} \,|\, \boldsymbol{X}) = \sum_{i=1}^{N} \ell_i(\boldsymbol{\beta}; y_i \,|\, x_i) = \sum_{i=1}^{N} \ln f(y_i \,|\, x_i; \boldsymbol{\beta}). \quad (1.21)$$

3 However, this is certainly not the only way we could estimate the probit model. For example, equation 1.14 implies that E(yi | xi) = Φ(β0 + β1xi); the model could therefore also be estimated by Nonlinear Least Squares (i.e. a method-of-moments estimator).

4 Cameron and Trivedi (page 139) note that ignoring the marginal likelihood of X is not a problem “if f(y | X) and f(X) depend on mutually exclusive sets of parameters”; that is, if there is no endogeneity problem.

The maximum likelihood estimate $\hat{\boldsymbol{\beta}}_{ML}$ therefore solves:

$$\left.\frac{\partial \ell(\boldsymbol{\beta}; \boldsymbol{y} \,|\, \boldsymbol{X})}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}_{ML}} = 0, \quad (1.22)$$

where the left-hand side of this expression is called the ‘score vector’.

You will recall further that all maximum likelihood estimators share at least four important properties:

(i) Consistency: In general terms, an estimator is consistent if, as the number of observations becomes very large, the probability of the estimator missing the true parameter value goes to zero. Suppose that we are trying to estimate some true scalar parameter β, and that we are using a maximum likelihood estimator $\hat\beta_{ML}$, with N observations in our sample. Then consistency means that, for any ε > 0,

$$\lim_{N \to \infty} \Pr\left(\left|\hat\beta_{ML} - \beta\right| > \varepsilon\right) = 0. \quad (1.23)$$

We can describe this by saying “$\hat\beta_{ML}$ converges in probability to the true value β”, and we can write

$$\operatorname*{plim}_{N \to \infty}\; \hat\beta_{ML} = \beta. \quad (1.24)$$

(ii) Asymptotic normality: Assuming some regularity conditions, the asymptotic distribution of a maximum likelihood estimator is normal:5

$$\sqrt{N} \cdot \left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \xrightarrow{d} \mathcal{N}\left(0,\; \mathcal{I}(\boldsymbol{\beta})^{-1}\right), \quad (1.25)$$

$$\text{where } \mathcal{I}(\boldsymbol{\beta}) = -E\left(\frac{\partial^2 \ell_i(\boldsymbol{\beta}; y_i \,|\, x_i)}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'}\right). \quad (1.26)$$

We generally estimate I(β) using:

$$\hat{\mathcal{I}}(\hat{\boldsymbol{\beta}}_{ML}) = -\frac{1}{N} \sum_{i=1}^{N} \left.\frac{\partial^2 \ell_i(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'}\right|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}_{ML}}. \quad (1.27)$$

5 See, for example, page 392 of Wooldridge (2002) for those conditions. In this context, the conditions are: (i) that β is interior to the set of possible values for β, and (ii) that the log-likelihood is twice continuously differentiable on the interior of that set. We will not need to worry about these conditions in these lectures.

9 [email protected]

Page 10: Limited DEpendent Variable

1.4 Estimation by maximum likelihood

An alternative approach is to use the ‘outer product of the gradient’ (sometimes known as the ‘BHHH estimate’, or just the ‘OPG’):

$$\hat{\mathcal{I}}_{OPG}(\hat{\boldsymbol{\beta}}_{ML}) = \frac{1}{N} \sum_{i=1}^{N} \left.\left(\frac{\partial \ell_i(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right) \cdot \left(\frac{\partial \ell_i(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'}\right)\right|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}_{ML}}. \quad (1.28)$$

As Gould et al explain, “the OPG has the same asymptotic properties as [I(β)⁻¹], the ‘conventional variance estimator’” (page 11).6 One advantage of the OPG is that it does not require calculation of the second derivatives of the log-likelihood; if we are maximising our log-likelihood with an algorithm that does not itself require those second derivatives (for example, the BHHH method), the OPG method may therefore be computationally more convenient (see, for example, Gould et al, pages 11–12).

Suppose, then, that we want to test a hypothesis H0: β = β0. The estimated covariance matrix I(βML)⁻¹ (or its OPG equivalent) can be used to perform a Wald test. Alternatively, we can perform a Likelihood Ratio test (a short Stata illustration follows this list of properties):

$$2 \cdot \left[\ell\left(\hat{\boldsymbol{\beta}}_{ML}\right) - \ell\left(\boldsymbol{\beta}_0\right)\right] \sim \chi^2(k), \quad (1.29)$$

where k is the number of restricted parameters in β0.

(iii) Efficiency: Equations 1.25 and 1.26 show that, asymptotically, the variance of maximum likelihood estimators is the ‘Cramér-Rao lower bound’ (i.e. the inverse of the Fisher information matrix). That is, maximum likelihood estimators are efficient: the asymptotic variance of the maximum likelihood estimator is at least as small as the variance of any other consistent estimator of the parameter.

(iv) Invariance: If γ = f(β) is a one-to-one, continuous and continuously differentiable function, then $\hat\gamma_{ML} = f(\hat{\boldsymbol{\beta}}_{ML})$.
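To illustrate the Likelihood Ratio test of equation 1.29 concretely (a sketch of my own, not examinable, again using the WorkingSample data), we can test H0: β1 = 0 by comparing the fitted Tanzanian probit with a constant-only probit:

probit primaryplus yborn
estimates store unrestricted
probit primaryplus
estimates store restricted
* lrtest reports 2*[l(unrestricted) - l(restricted)] and its chi-squared p-value
lrtest unrestricted restricted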

1.4.3 Goodness of fit in the probit model

For simplicity, let’s return to our earlier example of a probit model with a single explanatory variable. You will be familiar with the R2 statistic from linear regression models; this statistic reports the proportion of variation in the outcome variable that is explained by variation in the regressors. Unfortunately, this statistic does not generalise naturally to the maximum likelihood context. Instead, the standard goodness-of-fit statistic for maximum likelihood estimates is ‘McFadden’s Pseudo-R2’:

$$R^2_p \equiv 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}, \quad (1.30)$$

where ℓ(β̂) is the value of the maximised log-likelihood, and ℓ0 is the log-likelihood for a model without explanatory variables (so, in the context of our probit model, ℓ0 is the log-likelihood for a probit estimation using Pr(yi = 1 | x) = Φ(β0)). You should confirm that we will always obtain R2p ∈ (0, 1).

6 I use square brackets in the quote to show where I have inserted our notation in place of Gould et al’s. I use this notation at several points in these notes.
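As a small Stata aside (not examinable): after a probit estimation, Stata stores the maximised log-likelihood in e(ll) and the constant-only log-likelihood in e(ll_0), so McFadden’s Pseudo-R2 can be recovered by hand (it is also reported directly in the probit output):

probit primaryplus yborn
* McFadden's Pseudo-R2 = 1 - l(betahat)/l0
display "Pseudo-R2 = " 1 - e(ll)/e(ll_0)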

Additionally, in a binary outcome model, we may wish to report the ‘percent correctly predicted’. Wooldridge (2002, page 465) explains:

For each i, we compute the predicted probability that yi = 1, given the explanatory variables, xi. If [Φ(β0 + β1xi) > 0.5], we predict yi to be unity; if [Φ(β0 + β1xi) ≤ 0.5], yi is predicted to be zero. The percentage of times the predicted yi matches the actual yi is the percent correctly predicted. In many cases it is easy to predict one of the outcomes and much harder to predict another outcome, in which case the percent correctly predicted can be misleading as a goodness-of-fit statistic. More informative is to compute the percent correctly predicted for each outcome, y = 0 and y = 1.
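A sketch of this calculation in Stata (not examinable; the variable names phat and yhat are my own):

* Percent correctly predicted, computed by hand after the probit estimation
probit primaryplus yborn
predict phat, pr
generate yhat = (phat > 0.5) if !missing(phat)
* Row percentages give the percent correctly predicted separately for y = 0 and y = 1
tabulate primaryplus yhat, row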

1.4.4 Back to Tanzania. . .

Table 1.1 reports the results of the probit estimation for Tanzania (see column (1)). We estimate β̂0 = −90.395 and β̂1 = 0.046; both estimates are highly significant. Columns (2) and (3) show respectively the estimated mean marginal effect and the estimated marginal effect at the mean (that is, the estimated marginal effect for xi = 1962.627). (We will discuss the concept of marginal effects shortly.) Figure 1.3 shows the predicted probability of primary school attainment: Φ(β̂0 + β̂1 · xi). (Appendix 1 provides the basic Stata commands for producing these estimates.)

Table 1.1: Probit estimates for primary school attainment in Tanzania

                                      Estimates     Marginal effects
                                                    Mean effect    Effect at mean
                                      (1)           (2)            (3)
Year born                             0.046         0.015          0.018
                                      (0.001)***    (0.0002)***    (0.0004)***
Const.                                -90.395
                                      (2.075)***

Obs.                                  10000
Log-likelihood                        -5684.679
Pseudo-R2                             0.165
Successes correctly predicted (%)     84.7
Failures correctly predicted (%)      57.2
Mean of "year born"                   1962.627

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

11 [email protected]

Page 12: Limited DEpendent Variable

1.5 Normalisations in the probit model

Figure 1.3: Probit estimates for primary school attainment in Tanzania

1.5 Normalisations in the probit model

We assumed earlier that εi ∼ N(0, 1). But suppose instead that we had assumed more generally that εi ∼ N(µ, σ2). In that case, we would write the conditional probability as:

$$\begin{aligned}
\Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1) &= \Pr(\beta_0 + \beta_1 x_i + \varepsilon_i \geq 0 \,|\, x_i) &(1.31)\\
&= \Pr(\varepsilon_i \leq \beta_0 + \beta_1 x_i \,|\, x_i) &(1.32)\\
&= \Pr\left(\frac{\varepsilon_i - \mu}{\sigma} \leq \frac{\beta_0 - \mu}{\sigma} + \frac{\beta_1}{\sigma} \cdot x_i \,\middle|\, x_i\right) &(1.33)\\
&= \Phi\left(\frac{\beta_0 - \mu}{\sigma} + \frac{\beta_1}{\sigma} \cdot x_i\right) &(1.34)\\
\therefore \Pr(y_i = 0 \,|\, x_i; \beta_0, \beta_1) &= 1 - \Phi\left(\frac{\beta_0 - \mu}{\sigma} + \frac{\beta_1}{\sigma} \cdot x_i\right). &(1.35)
\end{aligned}$$

This clearly presents a problem: the best that we can now do is to identify the objects (β0 − µ)·σ⁻¹ and β1·σ⁻¹. That is, the earlier assumptions that µ = 0 and σ = 1 are identifying assumptions: they are normalisations without which we cannot identify either β0 or β1.

This should not come as a surprise: remember that we can always take a monotone increasing transformation of a utility function without changing any of the observed choices. This has an important implication for the way that we interpret the magnitude of parameter estimates from discrete choice models; as Train explains (2009, page 24, emphasis in original):

The [estimated coefficients in a probit model] reflect, therefore, the effect of the observed variables relative to the standard deviation of the unobserved factors.

1.6 Interpreting the results: Marginal effects

The parameter estimates from a probit model are often difficult to interpret in any intuitive sense; a policymaker, for example, is hardly likely to be impressed if told that “the estimated effect of age on primary school completion in Tanzania is β̂1 = 0.046”! Instead, the interpretation of probit estimates tends to focus upon (i) the predicted probabilities of success and, consequently, (ii) the estimated marginal effects.

In a binary outcome model, a given marginal effect is the ceteris paribus effect of changing one individual characteristic upon an individual’s probability of ‘success’. In the context of the Tanzanian education data, the marginal effects measure the predicted difference in the probability of primary school attainment between individuals born one year apart.

Having estimated the parameters β0 and β1, the estimated marginal effects follow straightforwardly. For an individual i born in year xi, the predicted probability of completing primary education is:

$$\Pr(y_i = 1 \,|\, x_i; \hat\beta_0, \hat\beta_1) = \Phi\left(\hat\beta_0 + \hat\beta_1 \cdot x_i\right). \quad (1.36)$$

Had the ith individual been born a year later, (s)he would have a predicted probability of completing primary education of:

$$\Phi\left(\hat\beta_0 + \hat\beta_1 \cdot (x_i + 1)\right). \quad (1.37)$$

For the ith individual, the estimated marginal effect of the variable x is therefore:

$$\hat{M}_d(x_i; \hat\beta_0, \hat\beta_1) = \Phi\left(\hat\beta_0 + \hat\beta_1 \cdot (x_i + 1)\right) - \Phi\left(\hat\beta_0 + \hat\beta_1 \cdot x_i\right). \quad (1.38)$$

$\hat{M}_d(x_i; \hat\beta_0, \hat\beta_1)$ provides the marginal effect for the discrete variable xi. If xi were continuous — or treated as being continuous for simplicity (an approximation that often works well) — we would instead use:

$$\hat{M}_c(x_i; \hat\beta_0, \hat\beta_1) = \frac{\partial \Pr(y_i = 1 \,|\, x_i; \hat\beta_0, \hat\beta_1)}{\partial x_i} \quad (1.39)$$
$$= \hat\beta_1 \cdot \phi\left(\hat\beta_0 + \hat\beta_1 \cdot x_i\right). \quad (1.40)$$
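As a quick check on equation 1.40 (an illustrative calculation of my own, using the estimates reported in Table 1.1), evaluate the continuous marginal effect at the sample mean year of birth, x̄ = 1962.627:

$$\hat{M}_c(\bar{x}; \hat\beta_0, \hat\beta_1) = 0.046 \cdot \phi(-90.395 + 0.046 \times 1962.627) \approx 0.046 \cdot \phi(-0.11) \approx 0.018,$$

which matches the ‘effect at mean’ in column (3) of Table 1.1, up to rounding in the reported coefficients.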

13 [email protected]

Page 14: Limited DEpendent Variable

1.6 Interpreting the results: Marginal effects

Figure 1.4: Probit estimates and marginal effects for primary school attainment in Tanzania

Figure 1.4 shows the predicted probabilities of success for the Tanzanian data — as in Figure 1.3 — but superimposes the estimated marginal effects (where, for simplicity, I have treated year of birth as a continuous variable). You will see that the estimated marginal effect is greatest for individuals born in 1959 — and that this is the year for which the predicted probability of primary attainment is (approximately) 0.5. This is clearly no coincidence: the function φ(x) is maximised at x = 0, and Φ(0) = 0.5. This restriction — i.e. that the marginal effect is largest for individuals with a predicted probability of 0.5 — can be relaxed by, for example, the ‘scobit’ estimator. This estimator, however, is beyond the scope of our course.

Figure 1.4 shows one way of reporting the marginal effects — i.e. by calculating the marginal effects separately for all individuals in the sample, and then graphing. But there are several alternative approaches: for example, statistical packages generally report either the average of the marginal effects across the sample, or the marginal effect at the mean of the regressors.7 Alternatively, you may wish to take some weighted average of the sample marginal effects, if the sample is unrepresentative of the population of interest. Table 1.1 earlier reported both the mean marginal effect and the marginal effect at the mean.

7 Cameron and Trivedi (page 467) prefer the former; they say, “it is best to use. . . the sample average of the marginal effects. Some programs instead evaluate at the sample average of the regressors. . . ”.

In short, the marginal effects are particularly important for binary outcome variables, and it is generally a very good idea to report marginal effects alongside estimates of the parameters (or, indeed, instead of them). Standard statistical packages can compute estimated marginal effects straightforwardly — and, similarly, can use the delta method to calculate corresponding standard errors.

1.7 Appendix to Lecture 1: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

First, let’s clear Stata’s memory and load our dataset.

clear

use WorkingSample

We can then tabulate primaryplus, the dummy variable that records whether or not a respondent has primary education (or higher):

tab primaryplus

Similarly, let’s summarise the variable yborn, which records the year in which each respondent was born:

summarize yborn

Time to estimate. We begin with our probit model, explaining primaryplus by yborn (and a constant):

probit primaryplus yborn

Finally, let’s calculate marginal effects — first as the mean across the sample, and then as the marginal effect at the mean of yborn:

margins, dydx(yborn)
margins, dydx(yborn) atmeans

15 [email protected]

Page 16: Limited DEpendent Variable

2 Lecture 2: Binary Choice IIRequired readings (for Lectures 1 and 2):

• ? CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Appli-cations. Cambridge University Press, pages 463 – 478 (i.e. sections 14.1 to 14.4, inclusive)or

• ? WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. TheMIT Press, pages 453 – 461 (i.e. sections 15.1 to 15.4, inclusive)or

• ? WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nded.). The MIT Press, pages 561 – 569 (i.e. sections 15.1 to 15.4, inclusive).

Other references:

• HARRISON, G. (2011): “Randomisation and Its Discontents,” Journal of African Economies,20(4), 626–652.

• ANGRIST, J.D. AND PISCHKE, J.S. (2008): Mostly Harmless Econometrics: An Empiri-cist’s Companion. Princeton University Press.

2.1 The logit modelLecture 1 considered the probit model: a model of binary choice in which the latent error variableis assumed to have a standard normal distribution. You will recall that, in the context of a singleexplanatory variable (xi), this model can be summarised succinctly by our earlier equation 1.14:

Pr(yi = 1 |xi; β0, β1) = Φ(β0 + β1xi). (1.14)

An alternative approach is to assume that the latent unobservable has a ‘logistic distribution’:

Assumption 2.1 (DISTRIBUTION OF εi) ε is i.i.d. with a logistic distribution, independent of x:

$$\Pr(\varepsilon \leq Z \,|\, x) = \Lambda(Z) \quad (2.1)$$
$$= \frac{\exp(Z)}{1 + \exp(Z)}. \quad (2.2)$$

Figure 2.1 shows the cdf of the logistic distribution, compared to the normal.

16 [email protected]

Page 17: Limited DEpendent Variable

2.1 The logit model

Figure 2.1: Cumulative density functions: normal and logistic distributions

Symmetric to our derivation of the probit specification, we can write:

$$\Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1) = \Pr(\varepsilon_i \leq \beta_0 + \beta_1 x_i \,|\, x_i) \quad (2.3)$$
$$= \Lambda(\beta_0 + \beta_1 x_i). \quad (2.4)$$

Equation 2.4 is directly analogous to equation 1.14; it defines the ‘logit model’. The logit model is an alternative to the probit model for estimating the conditional probability of a binary outcome. For any given dataset, the predicted probabilities from a logit model are generally almost identical to those from a probit model, as we will see later in this lecture.

All of the reasoning from Lecture 1 extends by analogy to the case of the logit model: we can follow the same principles to (i) form the log-likelihood, (ii) maximise the log-likelihood, (iii) measure the goodness-of-fit, (iv) normalise our parameter estimates and (v) interpret the marginal effects. We will not rehearse these principles in this lecture, but you should understand the way that they extend from the probit case to the logit case.
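To illustrate point (i), for example, the sample log-likelihood for the logit model is exactly equation 1.19 with Λ(·) in place of Φ(·):

$$\ell(\beta_0, \beta_1; \boldsymbol{y} \,|\, \boldsymbol{x}) = \sum_{i=1}^{N} \left\{ y_i \cdot \ln \Lambda(\beta_0 + \beta_1 x_i) + (1 - y_i) \cdot \ln\left[1 - \Lambda(\beta_0 + \beta_1 x_i)\right] \right\}.$$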

17 [email protected]

Page 18: Limited DEpendent Variable

2.2 The log-odds ratio in the logit model

2.2 The log-odds ratio in the logit model

The probit model and the logit model are almost identical in their implications. However, when we use the logit model, we sometimes speak about the ‘odds ratio’, because this ratio has a natural relationship to the estimated parameters from a logit specification.8

Generally, the odds ratio is defined as:

$$\text{odds ratio} = \frac{\text{probability of success}}{\text{probability of failure}}. \quad (2.5)$$

In the context of our Tanzanian problem, and using the logit specification, we can write:

$$\begin{aligned}
\text{odds ratio}_i &= \frac{\Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1)}{\Pr(y_i = 0 \,|\, x_i; \beta_0, \beta_1)} &(2.6)\\
&= \frac{\Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1)}{1 - \Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1)} &(2.7)\\
&= \left(\frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\right) \cdot \left(\frac{1}{1 + \exp(\beta_0 + \beta_1 x_i)}\right)^{-1} &(2.8)\\
&= \exp(\beta_0 + \beta_1 x_i). &(2.9)
\end{aligned}$$

In the logit model, we can therefore interpret the index β0 + β1xi as providing the ‘log odds ratio’, so that the parameter β1 shows the effect of xi on this log ratio:

$$\beta_0 + \beta_1 x_i = \ln(\text{odds ratio}_i) \quad (2.10)$$
$$\beta_1 = \frac{\partial \ln(\text{odds ratio}_i)}{\partial x_i}. \quad (2.11)$$

This implies that, for a small change in xi, the value 100β1 · ∆xi is approximately the percentage change in the odds ratio.
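For example (an illustrative calculation of my own, using the logit estimate β̂1 = 0.077 reported in Table 2.1 below):

$$\exp(\hat\beta_1) = \exp(0.077) \approx 1.080,$$

so a one-year increase in year of birth multiplies the estimated odds of primary completion by about 1.08; that is, the odds rise by roughly 8 per cent, close to the small-change approximation 100 β̂1 = 7.7 per cent.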

2.3 Probit or logit?

Cameron and Trivedi have an extensive discussion of the theoretical and empirical considerations in choosing between the probit or logit model: see pages 471–473. In these notes, I would like simply to emphasise their comment about empirical considerations:

Empirically, either logit or probit can be used. There is often little difference between the predicted probabilities from probit and logit models. The difference is greatest in the tails where probabilities are close to 0 or 1. The difference is much less if interest lies only in marginal effects averaged over the sample rather than for each individual.

Figure 2.2 illustrates this point by comparing estimates from the Tanzanian data.

8 Of course, this doesn’t mean that we can’t talk about the odds ratio when discussing other models — just that the ratio has a more intuitive relationship to the parameters of interest in the logit model.

18 [email protected]

Page 19: Limited DEpendent Variable

2.4 The Linear Probability Model

Figure 2.2: Probit estimates and logit estimates for primary school attainment in Tanzania

2.4 The Linear Probability Model

To this point, we have considered two models: probit and logit. We have specified these models in terms of a conditional probability of success, but we could equally specify them in terms of the conditional expectation of the outcome variable:

$$E(y_i \,|\, x_i; \beta_0, \beta_1) = 1 \times \Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1) + 0 \times \Pr(y_i = 0 \,|\, x_i; \beta_0, \beta_1) \quad (2.12)$$
$$= \Pr(y_i = 1 \,|\, x_i; \beta_0, \beta_1). \quad (2.13)$$

Thus, for the probit model, we used:

$$E(y_i \,|\, x_i; \beta_0, \beta_1) = \Phi(\beta_0 + \beta_1 x_i); \quad (2.14)$$

for the logit model, we used:

$$E(y_i \,|\, x_i; \beta_0, \beta_1) = \Lambda(\beta_0 + \beta_1 x_i). \quad (2.15)$$

A simpler approach is to assume that the conditional probability of success — and, therefore, the conditional expectation of the outcome — is linear in the explanatory variable(s):

$$E(y_i \,|\, x_i; \beta_0, \beta_1) = \beta_0 + \beta_1 x_i \quad (2.16)$$
$$\Leftrightarrow\; y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad (2.17)$$

19 [email protected]

Page 20: Limited DEpendent Variable

2.4 The Linear Probability Model

where E(εi |xi) = 0.

This is known as the ‘Linear Probability Model’, or ‘LPM’ for short. As equation 2.17 implies, the parameters of interest for the LPM can be obtained very simply: just use OLS. We will not rehearse here the principles involved in OLS estimation.

2.4.1 Predicted probabilities and marginal effects in the Linear Probability Model

Predicted probabilities in the LPM are trivial:

$$\Pr(y_i = 1 \,|\, x_i; \hat\beta_0, \hat\beta_1) = \hat\beta_0 + \hat\beta_1 x_i. \quad (2.18)$$

Note that nothing constrains this predicted probability to lie in the unit interval. We will return to this point shortly.

Marginal effects in the LPM are similarly trivial: whether xi is discrete or continuous, its estimated marginal effect is β̂1. Note that this marginal effect is identical across all values of x.

2.4.2 Heteroskedasticity in the Linear Probability Model

The Linear Probability Model generally produces heteroskedastic errors. We can illustrate this straightforwardly using our simple example; for a given xi, we have:

$$\varepsilon_i = \begin{cases} 1 - \beta_0 - \beta_1 x_i & \text{with conditional probability } \beta_0 + \beta_1 x_i \\ -\beta_0 - \beta_1 x_i & \text{with conditional probability } 1 - \beta_0 - \beta_1 x_i. \end{cases} \quad (2.19)$$

Figure 2.3 illustrates. We know that Var(εi | xi) = E(ε²i | xi) − [E(εi | xi)]², and — as we noted earlier — E(εi | xi) = 0. We therefore have:

$$\begin{aligned}
\operatorname{Var}(\varepsilon_i \,|\, x_i) &= E\left(\varepsilon_i^2 \,|\, x_i\right) &(2.20)\\
&= \Pr(y_i = 1 \,|\, x_i) \cdot (1 - \beta_0 - \beta_1 x_i)^2 + \Pr(y_i = 0 \,|\, x_i) \cdot (-\beta_0 - \beta_1 x_i)^2 &(2.21)\\
&= (\beta_0 + \beta_1 x_i) \cdot (1 - \beta_0 - \beta_1 x_i)^2 + (1 - \beta_0 - \beta_1 x_i) \cdot (-\beta_0 - \beta_1 x_i)^2 &(2.22)\\
&= (\beta_0 + \beta_1 x_i) \cdot (1 - \beta_0 - \beta_1 x_i). &(2.23)
\end{aligned}$$

Therefore, Var(εi | xi) depends upon xi so long as β1 ≠ 0.9 The simplest way of dealing with this problem is to use White’s heteroskedasticity-robust standard errors (which can be implemented straightforwardly in Stata using the ‘robust’ option). Alternatively, we could use Weighted Least Squares; this produces more efficient estimates, but requires predicted probabilities to lie between 0 and 1 (which, as we discussed, is not guaranteed in the LPM).

9 Note, then, that if we are testing a null hypothesis H0 : β1 = 0 in this model, we do not need to worry about heteroskedasticity.
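A minimal Stata sketch of the Weighted Least Squares approach just described (not examinable; the variable names phat and w are my own, and observations with fitted probabilities outside the unit interval are simply dropped here):

* Feasible Weighted Least Squares for the LPM: weight each observation by the
* inverse of its estimated error variance, Var(e_i | x_i) = p_i(1 - p_i),
* using the fitted values from a first-step OLS regression
regress primaryplus yborn
predict phat
generate w = phat * (1 - phat) if phat > 0 & phat < 1
regress primaryplus yborn [aweight = 1/w]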

20 [email protected]

Page 21: Limited DEpendent Variable

2.5 LPM or MLE?

Figure 2.3: Heteroskedasticity in the Linear Probability Model


2.5 LPM or MLE?

2.5.1 Relative advantages and disadvantages

We noted earlier that the difference between probit and logit is very small — both in terms of the estimates that they provide and in terms of their underlying structure. The Linear Probability Model, however, is clearly quite different — for example, as Cameron and Trivedi note (page 466), the Linear Probability Model, unlike probit and logit, “does not use a cdf”. So which approach should be preferred — probit/logit on the one hand, or the Linear Probability Model on the other?

This can be quite a controversial issue in applied research! On the one hand, many researchers prefer the probit or logit models, on the basis that they constrain predicted probabilities to the unit interval, and that they therefore imply sensible marginal effects across the entire range of explanatory variables. Cameron and Trivedi, for example, say (page 471):

Although OLS estimation with heteroskedastic standard errors can be a useful exploratory data analysis tool, it is best to use the logit or probit MLE for final data analysis.

In a 2011 article about Randomised Controlled Trials (‘RCT’) in the Journal of African Economies, Harrison said this about the use of OLS for limited dependent variable models (footnote omitted):10

10 Harrison also discussed the issue in his presentation at the 2011 CSAE Annual Conference, available at http://www.csae.ox.ac.uk/conferences/2011-EdiA/video.html.

21 [email protected]

Page 22: Limited DEpendent Variable

2.5 LPM or MLE?

One side-effect of the popularity of RCT is the increasing use of Ordinary Least Squares estimators when dependent variables are binary, count or otherwise truncated in some manner. One is tempted to call this the OLS Gone Wild reality show, akin to the Girls Gone Wild reality TV show, but it is much more sober and demeaning stuff. I have long given up asking researchers in seminars why they do not just report the marginal effects for the right econometric specification. Instead I ask if we should just sack those faculty in the room who seem to waste our time teaching things like logit, count models or hurdle models.

In their book Mostly Harmless Econometrics, Angrist and Pischke (2009, page 94) take a different approach:

Should the fact that a dependent variable is limited affect empirical practice? Many econometrics textbooks argue that, while OLS is fine for continuous dependent variables, when the outcome of interest is a limited dependent variable (LDV), linear regression models are inappropriate and nonlinear models such as probit and Tobit are preferred. In contrast, our view of regression as inheriting its legitimacy from the [Conditional Expectation Function] makes LDVness less central.

That is, the Linear Probability Model can still be used to estimate the average marginal effect. Cameron and Trivedi acknowledge (page 471) that:

The OLS estimator [that is, the Linear Probability Model] is nonetheless useful as an exploratory tool. In practice it provides a reasonable direct estimate of the sample-average marginal effect on the probability that y = 1 as x changes, even though it provides a poor model for individual probabilities. In practice it provides a good guide to which variables are statistically significant.

Further, the Linear Probability Model is sometimes preferred for computational reasons; maximum likelihood models can prove much more difficult to estimate where, for example, there is a very large number of observations or a large number of explanatory variables.

2.5.2 Estimates from Tanzania

Table 2.1 reports estimates from the probit, logit and LPM models for the Tanzanian education example; Figure 2.4 shows the predicted probabilities. Together, the table and figure illustrate several important features of the three models. First, all three models predict very similar mean marginal effects. Second, the mean marginal effect for the Linear Probability Model is identical to the parameter estimate.11 Third, the probit and logit models predict conditional probabilities in the unit interval; in contrast, the LPM implies nonsensical predicted probabilities for people born before about 1925.

11 For this reason, we would never report the estimate and the marginal effect separately for the LPM; I have done so here simply to emphasise their equivalence.

22 [email protected]

Page 23: Limited DEpendent Variable

2.5 LPM or MLE?

Table 2.1: Probit, logit and LPM results from Tanzania

                        Probit                      Logit                       LPM
                        Estimate      Marginal      Estimate      Marginal      Estimate      Marginal
                        (1)           (2)           (3)           (4)           (5)           (6)
Year born               0.046         0.015         0.077         0.015         0.016         0.016
                        (0.001)***    (0.0002)***   (0.002)***    (0.0002)***   (0.0003)***   (0.0003)***
Const.                  -90.395                     -149.997                    -30.510
                        (2.075)***                  (3.651)***                  (0.603)***

Obs.                    10000                       10000                       10000
Log-likelihood          -5684.679                   -5680.831
Pseudo-R2               0.165                       0.165
R2                                                                              0.210
Correctly predicted:
  Successes (%)         84.7                        84.7                        86.3
  Failures (%)          57.2                        57.2                        55.2

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.
‘Marginal’ refers to the mean marginal effect. The Linear Probability Model was run using White’s heteroskedasticity-robust standard errors.

Figure 2.4: LPM, probit and logit estimates for primary school attainment in Tanzania

23 [email protected]

Page 24: Limited DEpendent Variable

2.6 The single-index assumption

2.6 The single-index assumption

Albert Einstein is sometimes quoted — misquoted, perhaps — as having said that “everything should be made as simple as possible, but no simpler”. In this spirit, we have considered the probit, logit and LPM models solely in the context of a single (scalar) explanatory variable, xi. All of the basic principles of these estimators can be understood in this way, so we have not yet considered the multivariate context.

In most empirical applications, however, we have more than one explanatory variable. It is straightforward to take all of our previous reasoning on xi and generalise it to a vector xi, by replacing β0 + β1xi with the linear index β · xi (where, generally, xi is understood as including an element ‘1’, to allow an intercept). This is how we deal with multiple explanatory variables in the context of the probit, logit and LPM models; thus, in the multivariate case, we specify either:

$$\Pr(y_i = 1 \,|\, \boldsymbol{x}_i) = \Phi(\boldsymbol{\beta} \cdot \boldsymbol{x}_i) \text{ for probit,} \quad (2.24)$$
$$\text{or } \Pr(y_i = 1 \,|\, \boldsymbol{x}_i) = \Lambda(\boldsymbol{\beta} \cdot \boldsymbol{x}_i) \text{ for logit,} \quad (2.25)$$
$$\text{or } \Pr(y_i = 1 \,|\, \boldsymbol{x}_i) = \boldsymbol{\beta} \cdot \boldsymbol{x}_i \text{ for LPM.} \quad (2.26)$$

This is a very general structure; it permits quite flexible estimation of binary outcome models with a large number of explanatory variables. However, note that the explanatory variables enter linearly through a ‘single index’, β · xi. It is easy to think of functions that violate this assumption. For example, the left surface of Figure 2.5 shows the function y = Φ(3(x1 + x2 − 1)); this satisfies the single index assumption because we can write y = F(β · x). But the right surface in Figure 2.5 shows the function y = 0.5 (Φ(10x1 − 3) + Φ(10x2 − 3)); this cannot be expressed as y = F(β · x), so it violates the single-index assumption.

2.7 A general framework

If we are willing to impose the single-index assumption, we can write the probit, logit and LPM models as special cases of a very general structure:

$$\Pr(y_i = 1 \,|\, \boldsymbol{x}_i) = F(\boldsymbol{\beta} \cdot \boldsymbol{x}_i), \quad (2.27)$$

where F(z) = Φ(z) for probit, F(z) = Λ(z) for logit and F(z) = z for the LPM. F can be referred to as a ‘link function’. If F(z) is a cdf — as it is, for example, for probit and logit — then we can rewrite all of our earlier maximum likelihood results more generally in terms of F(z). This is the way, for example, that Cameron and Trivedi introduce probit and logit (see pages 465 to 469 of their text); Wooldridge takes the same approach (see pages 457 – 458 of his 2002 text).
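For instance, in this notation the binary-outcome log-likelihood of equation 1.19 becomes, for a general cdf link F,

$$\ell(\boldsymbol{\beta}; \boldsymbol{y} \,|\, \boldsymbol{X}) = \sum_{i=1}^{N} \left\{ y_i \cdot \ln F(\boldsymbol{\beta} \cdot \boldsymbol{x}_i) + (1 - y_i) \cdot \ln\left[1 - F(\boldsymbol{\beta} \cdot \boldsymbol{x}_i)\right] \right\},$$

with F = Φ recovering the probit log-likelihood and F = Λ the logit.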

24 [email protected]

Page 25: Limited DEpendent Variable

2.8 Appendix to Lecture 2: Stata code

Figure 2.5: The single index restriction: an example (left) and a violation (right)

The surface on the left shows the function y = Φ(3(x1 + x2 − 1)); the graph on the right shows the function y = 0.5 (Φ(10x1 − 3) + Φ(10x2 − 3)). Thus, the surface on the left may be expressed as y = f(β · x), but the surface on the right requires a bivariate function y = g(x1, x2).

More generally, equation 2.27 permits semiparametric estimation, in which a researcher can jointly estimate the parameter vector β and the link function F. In our simple application with one explanatory variable, this kind of approach just implies fitting a univariate nonparametric function, y = f(xi) + εi, where E(εi | xi) = 0. Figure 2.6 illustrates, where the function F is fitted with a kernel.

You do not need to understand any nonparametric or semiparametric methods for these lectures. However, the underlying point is worth remembering: the probit, logit and LPM models all impose particular assumptions on the data, and we can sometimes relax these assumptions using more flexible estimation techniques. (Some of these techniques are covered in the second-year M.Phil course ‘Advanced Econometrics 1 – Microeconometrics’.)

2.8 Appendix to Lecture 2: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

Let’s again clear Stata’s memory and load our dataset.

clear

use WorkingSample

25 [email protected]

Page 26: Limited DEpendent Variable

2.8 Appendix to Lecture 2: Stata code

Figure 2.6: Probit and kernel estimates for primary school attainment in Tanzania

We can run a logit estimation with the logit command:

logit primaryplus yborn

We can use the same margins command as for probit:

margins, dydx(yborn)
margins, dydx(yborn) atmeans

Finally, we can run the Linear Probability Model using the command regress, for an OLS regression; we add the option ‘robust’ to calculate White’s heteroskedasticity-robust standard errors:

regress primaryplus yborn, robust

26 [email protected]

Page 27: Limited DEpendent Variable

3 Lecture 3: Discrete ordered choice

Required reading:

• ⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 519 – 520 (i.e. section 15.9.1); or

• ⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 504 – 507 (i.e. section 15.10); or

• ⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 655 – 659 (i.e. sections 16.3.1 and 16.3.2).

Other references:

• CUNHA, F., HECKMAN, J. AND NAVARRO, S. (2007): “The Identification and Economic Content of Ordered Choice Models with Stochastic Thresholds,” International Economic Review, 48(4), 1273–1309.

3.1 The concept of ordered choice

In Lectures 1 and 2, we considered the problem of binary outcome variables; we did so by considering Tanzanians’ decision whether or not to complete primary school education. In this lecture, we extend our earlier reasoning to consider the problem of discrete ordered choice. To do so, we will continue to work with the Tanzanian ILFS dataset; we will now consider Tanzanians’ decision between three choices: (i) not completing primary education, (ii) completing primary education but not secondary education, and (iii) completing secondary education.

We will denote our outcome variable as follows:

$$y_i = \begin{cases} 0 & \text{if the $i$th individual did not complete primary education;} \\ 1 & \text{if the $i$th individual completed primary education but not secondary education;} \\ 2 & \text{if the $i$th individual completed secondary education.} \end{cases} \quad (3.1)$$

Figure 3.1 shows how attainment of primary and secondary education has changed over time in Tanzania; it plots our new variable yi against respondents’ year of birth (xi). Note again one of the key characteristics of many limited dependent variable models: the outcome variable is categorical, so the numerical values taken by yi have no cardinal meaning. There is no sense, for example, in which completing secondary education (yi = 2) is ‘twice as good’, or ‘twice as useful’, or ‘twice as anything’ as completing primary education (yi = 1).

Figure 3.1: Primary and secondary school attainment in Tanzania across age cohorts

We will model this education decision as an ordered choice. It is clear that, in some intuitive sense, the categories ‘no education’ – ‘primary education’ – ‘secondary education’ are ordered; for example, secondary education requires more time than primary education, which itself (obviously) requires more time than no education. This kind of intuitive reasoning often justifies the description of a choice as a ‘discrete ordered choice’. Ideally, though, we should be able to go further: we should be able to describe the outcome as a monotone step function of some continuous latent variable. To illustrate what this might mean, we will consider a simple microeconomic model of Tanzanians’ investment in education.

3.2 A simple optimal stopping model

Assume that a student obtains some utility from attending school (or, equivalently, pays some utility cost), and that this utility changes with (i) the student’s year of birth (xi), and (ii) the student’s unobserved taste for education (εi):

$$u^s_{it}(x_i) = \beta_0 + \beta_1 x_i + \varepsilon_i. \quad (3.2)$$

Additionally, assume that the student may work and receive in-period utility determined by the student’s level of education (si):

$$u_{it}(s_i) = \alpha s_i. \quad (3.3)$$

We assume that εi is known to the student, but unobservable to a researcher.

28 [email protected]

Page 29: Limited DEpendent Variable

3.2 A simple optimal stopping model

For simplicity, let’s assume that the student faces only three choices: (i) do not attend school (s = 0), (ii) finish primary school (s = 7), and (iii) finish secondary school (s = 12). We will assume that students have a lifetime of known finite duration T > 12 years and, for simplicity, we will make the extreme assumption that students assign equal utility weight to each period.12 Given these assumptions, we can write three value functions, one corresponding to each choice:

$$V_0(0, x_i, \varepsilon_i) = 0 \quad (3.4)$$
$$V_0(7, x_i, \varepsilon_i) = 7 \cdot (\beta_0 + \beta_1 x_i + \varepsilon_i) + (T - 7) \cdot 7\alpha \quad (3.5)$$
$$V_0(12, x_i, \varepsilon_i) = 12 \cdot (\beta_0 + \beta_1 x_i + \varepsilon_i) + (T - 12) \cdot 12\alpha. \quad (3.6)$$

Therefore, the student prefers s = 7 to s = 0 if and only if:13

$$\beta_0 + \beta_1 x_i + \varepsilon_i \geq (7 - T) \cdot \alpha. \quad (3.7)$$

Similarly, the student prefers s = 12 to s = 7 if and only if:

$$12 \cdot (\beta_0 + \beta_1 x_i + \varepsilon_i) + (T - 12) \cdot 12\alpha \geq 7 \cdot (\beta_0 + \beta_1 x_i + \varepsilon_i) + (T - 7) \cdot 7\alpha \quad (3.8)$$
$$\Leftrightarrow\; \beta_0 + \beta_1 x_i + \varepsilon_i \geq (19 - T) \cdot \alpha. \quad (3.9)$$

We can therefore define two ‘cutpoints’,

$$\kappa_1 = \alpha \cdot (7 - T) - \beta_0 \quad (3.10)$$
$$\kappa_2 = \alpha \cdot (19 - T) - \beta_0, \quad (3.11)$$

and express the ith student’s decision (yi) as an ordered choice in the latent variable β1xi + εi:

$$y(x_i, \varepsilon_i; \beta_1, \kappa_1, \kappa_2) = \begin{cases} 0 & \text{if } \beta_1 x_i + \varepsilon_i < \kappa_1; \\ 1 & \text{if } \beta_1 x_i + \varepsilon_i \in [\kappa_1, \kappa_2); \\ 2 & \text{if } \beta_1 x_i + \varepsilon_i \geq \kappa_2. \end{cases} \quad (3.12)$$

Our simple optimal stopping model therefore implies that y(xi, εi; β1, κ1, κ2) is a monotone step function in β1xi + εi. Cunha, Heckman and Navarro (2007) discuss several classes of models — including a dynamic schooling choice model — that imply this kind of monotone step function solution. Figure 3.2 illustrates.

Notice that, in this model, the observable covariate x affects the latent index rather than the cutpoints; for this reason, we can describe the Ordered Probit as a model of ‘index shift’. There are two reasons that this result matters for empirical analysis:

(i) We may wish to exploit the ordered nature of the outcome variable for more efficient estimation;

(ii) We may wish to test the microeconomic model by testing the implication that x affects educational choice through ‘index shift’.

12 That is, we will use a subjective discount factor of 1.

13 We will assume that the indifferent student chooses the higher level of education.

29 [email protected]

Page 30: Limited DEpendent Variable

3.3 The Ordered Probit

Figure 3.2: Optimal schooling as a monotone step function in β1xi + εi


3.3 The Ordered Probit

The implications of our simple optimal stopping model are important. However, we need more before we can take these implications to data: once again, we need a distributional assumption about ε. We will make the same assumption that we made in Lecture 1.

Assumption 3.1 (DISTRIBUTION OF εi) εi is i.i.d. with a standard normal distribution, independent of xi:

εi |xi ∼ N (0, 1). (3.13)

Armed with this assumption, it is straightforward to write the log-likelihood for the ith individual:

ℓi(β1, κ1, κ2; yi | xi) =
    ln Φ(κ1 − β1xi)                          if yi = 0;
    ln [Φ(κ2 − β1xi) − Φ(κ1 − β1xi)]         if yi = 1;
    ln [1 − Φ(κ2 − β1xi)]                    if yi = 2.        (3.14)

3.4 Marginal effects in the Ordered Probit model

Marginal effects in the Ordered Probit model are directly analogous to marginal effects in the probit model. For simplicity, we will consider only the case in which xi is continuous. Consider first the marginal effects for the extreme categories, yi = 2 and yi = 0. Following the reasoning in subsection 1.6, we have:

M0(xi; β1, κ1) = ∂ Pr(yi = 0 | xi; β1, κ1) / ∂xi = −β1 · φ(κ1 − β1xi),  and        (3.15)
M2(xi; β1, κ2) = ∂ Pr(yi = 2 | xi; β1, κ2) / ∂xi = β1 · φ(κ2 − β1xi).              (3.16)

For the intermediate category, we can find the marginal effect simply by considering the effect of xi at both cutoffs:

M1(xi; β1, κ1, κ2) = ∂ Pr(yi = 1 | xi; β1, κ1, κ2) / ∂xi = β1 · [φ(κ1 − β1xi) − φ(κ2 − β1xi)].        (3.17)

These principles generalise naturally to the case where xi is discrete, and to the case in which there are more than three categories.
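As a check on these formulas, the marginal effects can be computed by hand after estimation and compared with the output of margins. The sketch below is a minimal example, assuming the educ_cat and yborn variables used in the Stata appendix to this lecture; because it evaluates equation 3.17 at the sample mean of yborn, it corresponds to a marginal effect at the mean rather than a mean marginal effect.

oprobit educ_cat yborn
summarize yborn, meanonly
local xbar = r(mean)
* Equation 3.17 at the mean of x: beta1 * [phi(kappa1 - beta1*x) - phi(kappa2 - beta1*x)]
display _b[yborn] * (normalden(_b[/cut1] - _b[yborn]*`xbar') - normalden(_b[/cut2] - _b[yborn]*`xbar'))
margins, dydx(yborn) predict(outcome(1)) atmeans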

3.5 The Ordered Probit in Tanzania

Table 3.1 shows the estimates from the Ordered Probit model for our Tanzanian data: we estimate β1 = 0.039, κ1 = 76.672 and κ2 = 78.517. Columns 2 and 3 respectively show the mean marginal effects for the outcomes y = 1 and y = 2 (that is, I omit the mean marginal effect for outcome y = 0; you should be able to calculate this, however). Figure 3.3 shows the consequent predicted probabilities.

Table 3.1: Estimates from Tanzania: Ordered Probit

                    Estimates      Mean Marginal Effects
                                   y = 1          y = 2
                    (1)            (2)            (3)
Year born           0.039          0.008          0.005
                    (0.001)***     (0.002)***     (0.002)***
Cutoff 1 (κ1)       76.672
                    (1.886)***
Cutoff 2 (κ2)       78.517
                    (1.890)***
Obs.                10000
Log-likelihood      -7972.275
Pseudo-R2           0.105

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

31 [email protected]

Page 32: Limited DEpendent Variable

3.6 The Generalised Ordered Probit

Figure 3.3: Ordered Probit estimates for primary and secondary school attainment in Tanzania across age cohorts

3.6 The Generalised Ordered Probit

We just noted that the Ordered Probit model is a model of index shift: the observable variable xi affects the latent index β1xi + εi. This was justified by our simple optimal stopping model, in which year of birth (xi) directly affected each student’s utility from attending school. This model — i.e. both the optimal stopping model and the Ordered Probit — therefore implied that we could summarise the effect of age on both primary and secondary education with a single parameter: β1. This implies, for example, that if we estimate that, over time, students are more likely to complete primary school (i.e. Pr(yi = 0 | xi) is decreasing), we must also estimate that students are more likely to complete secondary school (i.e. that Pr(yi = 2 | xi) is increasing). This is implied in equations 3.15 and 3.16; the marginal effects on the largest and smallest outcomes must have opposite signs.

However, we might be concerned that this structure is too restrictive. After all, there might be many good reasons that, over time, students have become less likely to complete secondary education and less likely to end up with no education at all (i.e. with more students stopping after primary school). This might be the case if, for example, the cost of primary education has fallen over time but the cost of secondary education has increased. In that case, we may still believe that educational choice is a monotone step function in the student’s unobserved taste for education (εi), but we may

32 [email protected]

Page 33: Limited DEpendent Variable

3.6 The Generalised Ordered Probit

want to allow the explanatory variable to affect each cutpoint differently. That is, we may want to use a model of ‘cutpoint shift’, rather than of ‘index shift’:

y(xi, εi; γ1, γ2, κ1, κ2) =
    0   if εi < κ1 − γ1xi;
    1   if εi ∈ [κ1 − γ1xi, κ2 − γ2xi);
    2   if εi ≥ κ2 − γ2xi.                                     (3.18)

This model is identical to our earlier model in the special case γ1 = γ2 = β1. But, by allowing γ1 and γ2 to vary separately, we can allow for a more flexible model while still exploiting the ordered structure of the decision. (Note, of course, that we haven’t gone back to modify our simple optimal stopping model to reflect this change; however, we could certainly do so — for example, by allowing xi to affect the cost of each schooling level differently.) Figure 3.4 illustrates this more general model.

Figure 3.4: Optimal schooling as a monotone step function in εi

[Figure: y(xi, εi; γ1, γ2, κ1, κ2) plotted against εi, stepping from 0 to 1 at κ1 − γ1xi and from 1 to 2 at κ2 − γ2xi.]

If we maintain the assumption that εi has a standard normal distribution, we can describe this new model as a ‘Generalised Ordered Probit’. We will not write the log-likelihood for this model, but it is straightforward and directly analogous to the log-likelihood for the Ordered Probit. Table 3.2 shows the estimation results, with estimated mean marginal effects. Compared to the Ordered Probit, the Generalised Ordered Probit implies a slightly higher mean marginal effect upon the probability of primary education (yi = 1), but a slightly lower effect upon the probability of secondary (yi = 2).

33 [email protected]

Page 34: Limited DEpendent Variable

3.6 The Generalised Ordered Probit

Table 3.2: Estimates from Tanzania: Generalised Ordered Probit

                                   Estimates      Mean Marginal Effects
                                                  y = 1          y = 2
Year born                                         0.013          0.001
                                                  (0.0002)***    (0.0002)***
Cutoff 1:
  Year born                        0.045
                                   (0.001)***
  Const.                           88.725
                                   (2.017)***
Cutoff 2:
  Year born                        0.01
                                   (0.001)***
  Const.                           21.794
                                   (2.882)***
Obs.                               10000
Log-likelihood                     -7779.772
Pseudo-R2                          0.126

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

Figure 3.5 shows the consequent predicted probabilities. Figure 3.6 shows the estimated cutoff functions for the Generalised Ordered Probit (that is, κ1 − γ1xi and κ2 − γ2xi) — along with the cutoffs for the Ordered Probit (κ1 − β1xi and κ2 − β1xi). The diagram shows the fundamental difference between the Ordered Probit and Generalised Ordered Probit: the Ordered Probit restricts the cutoff functions to be parallel. Of course, this may or may not be a valid (or useful) restriction, depending on our particular empirical context. We can test the restriction straightforwardly: you should verify that, using a Likelihood Ratio test, we can compare the results in Tables 3.2 and 3.1 and obtain LR = 2 × (7972.275 − 7779.772) = 385.006, which implies a tiny p-value when compared to a χ2(1) distribution.
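A minimal sketch of how this comparison might be run in Stata, assuming the educ_cat and yborn variables from the appendix to this lecture and that the user-written goprobit command is installed (lrtest requires each set of estimates to be stored, and relies on goprobit posting its log-likelihood in e()):

oprobit educ_cat yborn
estimates store op
goprobit educ_cat yborn
estimates store gop
lrtest gop op
* Or by hand, from the log-likelihoods reported in Tables 3.1 and 3.2:
display 2 * (7972.275 - 7779.772)
display chi2tail(1, 2 * (7972.275 - 7779.772))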

34 [email protected]

Page 35: Limited DEpendent Variable

3.6 The Generalised Ordered Probit

Figure 3.5: Generalised Ordered Probit estimates for primary and secondary school attainment in Tanzania across age cohorts

Figure 3.6: Estimated cutoff functions: Ordered Probit and Generalised Ordered Probit

35 [email protected]

Page 36: Limited DEpendent Variable

3.7 The Ordered Logit and Generalised Ordered Logit

3.7 The Ordered Logit and Generalised Ordered Logit

Recall that, in the binary outcome case, the probit model is motivated by the assumption that the latent error term has a normal distribution, and the logit model is motivated by the assumption that the error has a logistic distribution. In this lecture, we have considered the Ordered Probit and the Generalised Ordered Probit. Both specifications have relied upon the assumption that ε has a normal distribution. However, as in the binary outcome case, we could assume instead that ε has a logistic distribution. By analogy to the binary outcome case, we would then call our estimators the Ordered Logit and the Generalised Ordered Logit.
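In Stata, these logistic counterparts can be fitted in exactly the same way as their probit analogues. A minimal sketch, again assuming the educ_cat and yborn variables from the appendix (gologit2 is a user-written command and may need to be installed first):

ologit educ_cat yborn
margins, dydx(yborn) predict(outcome(1))
margins, dydx(yborn) predict(outcome(2))
* Generalised Ordered Logit (user-written; install once with: ssc install gologit2)
gologit2 educ_cat yborn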

3.8 The Linear Probability Model and discrete ordered choice

In Lecture 2, we considered the Linear Probability Model as an alternative to the probit or logit model. We can also use a Linear Probability Model as an alternative to the Generalised Ordered Probit (or Generalised Ordered Logit). It would be tempting to write such an alternative like this:

yi = β0 + β1xi + εi, (3.19)

where yi again refers to our three-outcome measure of educational achievement and xi is again year of birth. That is, we could simply run an OLS regression of yi on xi. However, it is very difficult — if not impossible — to justify this approach. The reason for the difficulty is simple: as we noted earlier, yi is a categorical outcome, where the values ‘0’, ‘1’ and ‘2’ have no cardinal meaning. It is therefore not meaningful to talk about a ‘one unit increase in the outcome variable’ (for example, we cannot interpret our estimate of β1 in terms of a marginal effect on a conditional probability). Unfortunately, it is not uncommon to see researchers using specifications like equation 3.19 for studying discrete outcomes.

Instead, we ought to estimate in a way that respects the categorical nature of the dependent variable. If we want to use a linear probability structure, we can do this by using multiple LPM estimates. In our Tanzanian example, we can do this by defining two new binary outcomes:

pi = { 1 if yi = 1;
       0 if yi ≠ 1.                                            (3.20)

si = { 1 if yi = 2;
       0 if yi ≠ 2.                                            (3.21)

We can then run two separate Linear Probability Models to estimate the effect of xi on the probability of primary completion and the probability of secondary completion:

pi = γ0 + γ1xi + εi                                            (3.22)
si = δ0 + δ1xi + µi.                                           (3.23)

Figure 3.7 shows the resulting estimates. We can compare this graph directly to Figure 3.5. Of course, the estimates illustrated in Figure 3.7 are not necessarily good estimates: all of the objections to the Linear Probability Model that we discussed in Lecture 2 still apply. Arguably, these objections apply with even more force where the dependent variable has multiple outcomes: we may think that it is even more important, in this case, to use an estimator that can be rationalised in terms of an underlying economic structure. But these estimates can at least be defended as providing reasonable estimates of the marginal effect of xi on the probability of choosing yi = 1 and the probability of choosing yi = 2. Unfortunately, this is not something that we can say about equation 3.19.14

Figure 3.7: Linear Probability Model estimates for primary and secondary school attainment in Tanzania across age cohorts

14 I have introduced this ‘multiple LPM’ approach as an alternative to the Generalised Ordered Probit. We could also use it as an alternative to models of discrete multinomial choice (i.e. ‘unordered’ choice), which we will discuss in Lecture 4. However, as we will see in that lecture, models of unordered choice have traditionally placed particular emphasis upon having choice-theoretic foundations — which, as we saw in Lecture 2, the Linear Probability Model does not provide.

37 [email protected]

Page 38: Limited DEpendent Variable

3.9 Appendix to Lecture 3: Stata code

3.9 Appendix to Lecture 3: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

We can start by clearing the memory and loading the data — as we did previously. Then, we can tabulate our categorical education variable:

tab educ_cat

We can run an Ordered Probit with the oprobit command:

oprobit educ_cat yborn

We can then calculate mean marginal effects for the outcomes y = 1 and y = 2:

margins, dydx(yborn) predict(outcome(1))
margins, dydx(yborn) predict(outcome(2))

We can then fit the Generalised Ordered Probit using the goprobit command:

goprobit educ_cat yborn

(Note that goprobit actually estimates, in our terminology, −κ1 and −κ2, rather than κ1 and κ2. Note also that this command may not be installed on your computer; the command is not currently built in to Stata, so you may have to download it separately.)

The same margins commands will then calculate mean marginal effects.

To use multiple Linear Probability Models instead, we can generate new dummy variables, then run OLS regressions:

gen p = (educ_cat == 1)
gen s = (educ_cat == 2)
reg p yborn, robust
reg s yborn, robust

38 [email protected]

Page 39: Limited DEpendent Variable

4 Lecture 4: Discrete multinomial choice

Required reading:

• ? CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 490 – 506 (i.e. sections 15.1 to 15.5.3, inclusive); or

• ? WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 497 – 502 (i.e. section 15.9.1 and part of section 15.9.2); or

• ? WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 643 – 649 (i.e. sections 16.1, 16.2.1 and part of 16.2.2).

Other references:

• LEWIS, W.A. (1954): “Economic Development with Unlimited Supplies of Labour,” The Manchester School, 22(2), 139–191.

• MCFADDEN, D. (1974): “The Measurement of Urban Travel Demand,” Journal of Public Economics, 3(4), 303–328.

• MCFADDEN, D. (2000): “Economic Choices”, Nobel Prize Lecture, 8 December 2000.

4.1 Occupational choice in Tanzania

Travel demand forecasting has long been the province of transportation engineers, who have built up over the years considerable empirical wisdom and a repertory of largely ad hoc models which have proved successful in various applications. . . [but] there still does not exist a solid foundation in behavioral theory for demand forecasting practices. Because travel behavior is complex and multifaceted, and involves ‘non-marginal’ choices, the task of bringing economic consumer theory to bear is a challenging one. Particularly difficult is the integration of a satisfactory behavioural theory with practical statistical procedures for calibration and forecasting.

McFadden (1974, emphasis added)

The main sources from which workers come as economic development proceeds are subsistence agriculture, casual labour, petty trade, domestic service, wives and daughters in the household, and the increase of population. In most but not all of these sectors, if the country is overpopulated relatively to its natural resources, the marginal productivity of labour is negligible, zero, or even negative.

Lewis (1954)

39 [email protected]

Page 40: Limited DEpendent Variable

4.1 Occupational choice in Tanzania

In this lecture, we consider occupational choice in Tanzania. Both geographically and conceptually, the Tanzanian labour market is a long way from the San Francisco Bay Area Rapid Transit system (the ‘BART’). Nonetheless, we will analyse occupational choice using some of the econometric methods that Daniel McFadden famously developed to predict demand for the new BART. Like McFadden, our concern shall be to estimate the conditional probability of various discrete and unordered choices, and to do so with — as McFadden termed it — “a solid foundation in behavioral theory”.

Let’s begin in Tanzania. Figure 4.1 describes occupational choice among employed Tanzanians of different ages. The figure — and our subsequent analysis — uses a ternary outcome variable, covering three mutually exclusive categories:

yi =
    1   if the ith respondent works in agriculture;
    2   if the ith respondent is self-employed (outside of agriculture);
    3   if the ith respondent is wage employed (outside of agriculture).        (4.1)

Figure 4.1: Occupational categories and age in Tanzania

We will not worry too much today about why occupational choice in Tanzania might matter; as in earlier lectures, we will use the Tanzanian data as an illustrative vehicle for our econometric techniques. However, it is not difficult to see why this kind of occupational choice might be important for understanding Tanzania’s development; the quote from Lewis’s famous 1954 paper, for example, highlights sectoral shifts as an important mechanism for long-run development, and the data in Figure 4.1 might provide insights into the flexibility with which workers can achieve such shifts. (For example, if older workers are more likely to choose employment in agriculture than younger workers, this may suggest some ‘switching costs’ between sectors.)

4.2 An Additive Random Utility Model

As in earlier lectures, we will motivate our econometric methods by a simple underlying microeconomic model. Typically, this kind of choice-theoretic foundation is more common in the analysis of discrete unordered choice than in the models that we have studied earlier. For example, the latent variable interpretation is a useful approach for thinking about the probit and logit models, but is not generally a starting point for analysis; similarly, an optimal stopping model is just one possible foundation for models of discrete ordered choice. But in the analysis of discrete unordered choice, an additive random utility model is a common starting point. As Cameron and Trivedi (page 506) explain:

The econometrics literature has placed great emphasis in restricting attention to multinomial models that are consistent with maximisation of a random utility function. This is similar to restricting analysis to demand functions that are consistent with consumer choice theory.

Suppose, therefore, that we again have data on N individuals, indexed i ∈ {1, . . . , N}. Assume that each individual makes a choice yi = j, where there are a finite number J of options available. Critically, suppose that we observe information at the level of each individual, including his/her choices (that is, we observe xi and yi). That is, we do not observe information at the level of each available option; we will consider this alternative kind of data structure later in this lecture. You may query how reasonable it may be to model occupational outcomes purely as a matter of choice — after all, could an agricultural employee simply choose to take a wage job? — but we will leave this concern aside for this lecture.

As in Lecture 1, we will assume that the ith individual’s utility from the jth choice is determined by an additive random utility model:

Uij(xi) = α_0^(j) + α_1^(j) xi + εij.                          (4.2)

Thus, for example, for choices j ∈ {1, 2, 3}, the individual obtains the following utilities:

Ui1(xi) = α_0^(1) + α_1^(1) xi + εi1                           (4.3)
Ui2(xi) = α_0^(2) + α_1^(2) xi + εi2                           (4.4)
Ui3(xi) = α_0^(3) + α_1^(3) xi + εi3.                          (4.5)

Together, these three utilities determine the choice of an optimising agent. Figure 4.2 illustrates preferences between the three options in two-dimensional space; in each box, the bold outcome represents the agent’s choice.

41 [email protected]

Page 42: Limited DEpendent Variable

4.2 An Additive Random Utility Model

Figure 4.2: Multinomial choice among three options

[Figure: the plane with axes U2(x) − U1(x) and U3(x) − U1(x), divided by the two axes and the line U3(x) = U2(x) into regions, each labelled with the implied preference ordering over options 1, 2 and 3.]

We can, therefore, express the conditional probability of the ith individual choosing, say, option 1:

Pr(yi = 1 | xi) = Pr[ U(1)(xi) > U(2)(xi) and U(1)(xi) > U(3)(xi) | xi ]                         (4.6)
               = Pr[ α_0^(1) + α_1^(1) xi + εi1 > α_0^(2) + α_1^(2) xi + εi2  and
                     α_0^(1) + α_1^(1) xi + εi1 > α_0^(3) + α_1^(3) xi + εi3 | xi ].             (4.7)

More generally, if the ith individual were to choose yi = j out of J choices, we could write:

Pr(yi = j | xi) = Pr[ U(j)(xi) > max_{k≠j} U(k)(xi) | xi ]                                       (4.8)
               = Pr[ α_0^j + α_1^j xi + εij > max_{k≠j} ( α_0^k + α_1^k xi + εik ) | xi ].       (4.9)

In order to estimate using equation 4.9, we again need to make a distributional assumption.

42 [email protected]

Page 43: Limited DEpendent Variable

4.3 The Multinomial Logit model

4.3 The Multinomial Logit model

Assumption 4.1 (DISTRIBUTION OF εij) εij is i.i.d. with a Type I Extreme Value distribution, independent of xi:

Pr(εij < z |xi) = Pr(εij < z) = exp (− exp (−z)) . (4.10)

Equation 4.10, of course, defines the cumulative distribution function F(z); this implies a probability density function of:

f(z) = d/dz [exp(− exp(−z))] = exp(−z) · exp(− exp(−z))        (4.11)
     = exp(−z) · F(z).                                         (4.12)

Figure 4.3 shows the cumulative distribution function for the Type I Extreme Value distribution, compared to the cdf of the normal.

Figure 4.3: Cumulative distribution functions: Normal and Type I Extreme Value distributions

Figure 4.4 shows how this distributional assumption might imply the different outcomes yi = 1, yi = 2 and yi = 3; the figure shows the same two-dimensional space as Figure 4.2, but with simulated values for Ui1, Ui2 and Ui3. (For simplicity, I have generated the graph by setting all of the parameters α_0^(j) and α_1^(j) to zero; that is, the graph shows variation generated solely by the Type I Extreme Value distribution on εij.)
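A simulation of this kind is easy to reproduce. The sketch below is a minimal version, with all of the systematic utility parameters set to zero as in the figure; it uses the inverse-CDF draw −ln(−ln(u)) to generate Type I Extreme Value variates.

* Simulate utilities with Type I Extreme Value errors and record the chosen option
clear
set obs 10000
set seed 1
forvalues j = 1/3 {
    gen u`j' = -ln(-ln(runiform()))    // U_ij = eps_ij, since all alphas are set to zero
}
gen y = 1
replace y = 2 if u2 > u1 & u2 > u3
replace y = 3 if u3 > u1 & u3 > u2
tab y                                   // each option is chosen roughly one third of the time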

43 [email protected]

Page 44: Limited DEpendent Variable

4.3 The Multinomial Logit model

Figure 4.4: Multinomial choice among three options: Simulated data

With this distributional assumption, we can now find an expression for the conditional probability that the ith individual chooses outcome j from J choices.15 Note that you do not need to memorise this derivation for the exam; however, I think the derivation is useful for understanding the underlying structure required by multinomial choice models.

First, suppose that the error term for the chosen option, εij , were known. Then we could write:

Pr(yi = j | xi, εij) = Pr[ U(j)(xi) > max_{k≠j} U(k)(xi) | xi, εij ]                                 (4.13)
                    = Pr[ α_0^k + α_1^k xi + εik < α_0^j + α_1^j xi + εij | xi, εij  ∀ k ≠ j ]        (4.14)
                    = Pr[ εik < εij + α_0^j + α_1^j xi − α_0^k − α_1^k xi | xi, εij  ∀ k ≠ j ]        (4.15)
                    = ∏_{k≠j} exp{ − exp[ −(εij + α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] }.           (4.16)

15 This derivation is taken from Train (2009, pages 74-75).

44 [email protected]

Page 45: Limited DEpendent Variable

4.3 The Multinomial Logit model

Of course, εij is not known; we therefore need to integrate across its possible values:

Pr(yi = j | xi) = ∫_{−∞}^{∞} f(εij) · Pr(yi = j | xi, εij) dεij                                                            (4.17)

   = ∫_{−∞}^{∞} exp(−εij) · exp[− exp(−εij)] · ∏_{k≠j} exp{ − exp[ −(εij + α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } dεij   (4.18)

   = ∫_{−∞}^{∞} exp(−εij) · ∏_k exp{ − exp[ −(εij + α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } dεij                          (4.19)

   = ∫_{−∞}^{∞} exp(−εij) · exp{ − Σ_k exp[ −(εij + α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } dεij                          (4.20)

   = ∫_{−∞}^{∞} exp(−εij) · exp{ − exp(−εij) · Σ_k exp[ −(α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } dεij.                   (4.21)

We can now integrate by substitution. Define t = exp(−εij), so that dt = − exp(−εij) · dεij. Note that lim_{εij→−∞} t = ∞ and lim_{εij→∞} t = 0. Then we can rewrite our integral as:

Pr(yi = j | xi) = ∫_0^∞ exp{ −t · Σ_k exp[ −(α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } dt                                       (4.22)

   = [ exp{ −t · Σ_k exp[ −(α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] } / ( −Σ_k exp[ −(α_0^j + α_1^j xi − α_0^k − α_1^k xi) ] ) ]_0^∞   (4.23)

   = 1 / Σ_k exp[ −(α_0^j + α_1^j xi − α_0^k − α_1^k xi) ]                                                                      (4.24)

   = exp[ α_0^j + α_1^j xi ] / Σ_k exp[ α_0^k + α_1^k xi ].                                                                     (4.25)

45 [email protected]

Page 46: Limited DEpendent Variable

4.3 The Multinomial Logit model

Let’s return to our example with three outcomes, y ∈ {1, 2, 3}. The last derivation implies that we can write the conditional probabilities of the outcomes as:

Pr(yi = 1 | xi) = exp(α_0^(1) + α_1^(1) xi) / [ exp(α_0^(1) + α_1^(1) xi) + exp(α_0^(2) + α_1^(2) xi) + exp(α_0^(3) + α_1^(3) xi) ];   (4.26)

Pr(yi = 2 | xi) = exp(α_0^(2) + α_1^(2) xi) / [ exp(α_0^(1) + α_1^(1) xi) + exp(α_0^(2) + α_1^(2) xi) + exp(α_0^(3) + α_1^(3) xi) ];   (4.27)

Pr(yi = 3 | xi) = exp(α_0^(3) + α_1^(3) xi) / [ exp(α_0^(1) + α_1^(1) xi) + exp(α_0^(2) + α_1^(2) xi) + exp(α_0^(3) + α_1^(3) xi) ].   (4.28)

It would be tempting to take these three conditional probabilities and write the log-likelihood; for our three outcomes, we could therefore try to maximise the log-likelihood across six unknown parameters (that is, the parameters α_0^(1), α_0^(2), α_0^(3), α_1^(1), α_1^(2) and α_1^(3)). However, this would be a mistake, because we would be unable to find a unique set of values that would maximise the function. (That is, the model would be underidentified.) The reason, of course, is that we can only ever express utility in relative terms: we have seen this principle already in deriving both the probit and the Ordered Probit models. We therefore need to choose a ‘base category’, and estimate relative to the utility from that category. We shall choose 1 as the base category, and define β_0^(2) ≡ α_0^(2) − α_0^(1) and β_1^(2) ≡ α_1^(2) − α_1^(1) (and symmetrically for β_0^(3) and β_1^(3)). Note the emphasis here that the choice of base category is arbitrary: our predicted probabilities would be identical if we were to choose a different base category. Then we can multiply the numerator and denominator of each conditional probability by exp(−α_0^(1) − α_1^(1) xi), to obtain:

Pr(yi = 1 | xi) = 1 / [ 1 + exp(β_0^(2) + β_1^(2) xi) + exp(β_0^(3) + β_1^(3) xi) ];                         (4.29)

Pr(yi = 2 | xi) = exp(β_0^(2) + β_1^(2) xi) / [ 1 + exp(β_0^(2) + β_1^(2) xi) + exp(β_0^(3) + β_1^(3) xi) ];  (4.30)

Pr(yi = 3 | xi) = exp(β_0^(3) + β_1^(3) xi) / [ 1 + exp(β_0^(2) + β_1^(2) xi) + exp(β_0^(3) + β_1^(3) xi) ].  (4.31)

46 [email protected]

Page 47: Limited DEpendent Variable

4.4 Estimates from Tanzania

These conditional probabilities can then be used to define the log-likelihood; it is now straightforward that, for the ith individual, the log-likelihood is:16

ℓi(β_0^(2), β_1^(2), β_0^(3), β_1^(3); yi | xi) = 1(yi = 1) · ln [Pr(yi = 1 | xi)] + 1(yi = 2) · ln [Pr(yi = 2 | xi)]
                                                 + 1(yi = 3) · ln [Pr(yi = 3 | xi)].                       (4.32)

This log-likelihood function defines the ‘Multinomial Logit’ model. The Multinomial Logit is the simplest model for unordered choice. (Note that, if J = 2, the Multinomial Logit collapses to the logit model that we considered in Lecture 2.) Marginal effects in the Multinomial Logit model are directly analogous to marginal effects in the earlier models.
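Equations 4.29 to 4.31 can be verified by hand after estimation. The sketch below is illustrative; it assumes the work_cat and yborn variables from the appendix to this lecture, and that work_cat is coded 1, 2, 3 without value labels, so that mlogit names its equations 2 and 3.

mlogit work_cat yborn, baseoutcome(1)
* Rebuild the three conditional probabilities of equations 4.29 - 4.31 by hand
gen xb2 = _b[2:_cons] + _b[2:yborn]*yborn
gen xb3 = _b[3:_cons] + _b[3:yborn]*yborn
gen denom = 1 + exp(xb2) + exp(xb3)
gen p1 = 1/denom
gen p2 = exp(xb2)/denom
gen p3 = exp(xb3)/denom
* Compare with Stata's own predicted probabilities
predict q1 q2 q3, pr
summarize p1 q1 p2 q2 p3 q3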

4.4 Estimates from Tanzania

Table 4.1 shows the estimates from the Tanzanian data. Note that, given the foundation of our additive random utility model, we can express the estimates in terms of ‘relative utility’ from self-employment and wage employment; we estimate β_0^(2) = −36.383, β_1^(2) = 0.019, β_0^(3) = −48.562 and β_1^(3) = 0.025. All four estimates are highly significant. Figure 4.5 shows the consequent predicted probabilities of all three work categories (conditional upon having employment); this shows that older employed Tanzanians are significantly more likely to be working in agriculture than are younger employed Tanzanians, and that the converse applies for wage employment and self-employment.

16 I denote the indicator function by 1(·). Note that, for simplicity, I have not explicitly written each conditional probability as depending upon the parameters of interest, though they clearly do.

47 [email protected]

Page 48: Limited DEpendent Variable

4.4 Estimates from Tanzania

Table 4.1: Estimates from Tanzania: Multinomial Logit

                                       Estimates      Mean Marginal Effects
                                                      y = 2          y = 3
                                       (1)            (2)            (3)
Year born                                             0.001          0.002
                                                      (0.0007)***    (0.0006)***
Relative utility of self-employment:
  Year born                            0.019
                                       (0.003)***
  Const.                               -36.383
                                       (6.308)***
Relative utility of wage employment:
  Year born                            0.025
                                       (0.004)***
  Const.                               -48.562
                                       (7.289)***
Obs.                                   4136
Log-likelihood                         -4218.448
Pseudo-R2                              0.006

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

Figure 4.5: Occupational categories and age in Tanzania: Multinomial Logit estimates

48 [email protected]

Page 49: Limited DEpendent Variable

4.5 From Multinomial Logit to Conditional Logit

4.5 From Multinomial Logit to Conditional Logit

We assumed earlier that the ith individual’s utility from the jth choice depends linearly upon (i) the observable characteristics of the individual, xi and (ii) unobservable characteristics of the individual’s taste for that choice, εij:

Uij(xi) = α_0^(j) + α_1^(j) xi + εij.                          (4.2)

Critically, this structure does not allow for observable characteristics of different options. However, we can allow straightforwardly for that possibility, by allowing the variable x to be indexed by both individual and potential choice: xij. That is, we now assume that the researcher observes information on characteristics of options that the ith individual did not choose — for example, a researcher might know what the ith individual would have paid to take the train, even though she actually chose the bus.17 We can therefore write Uij as a linear function of characteristics — and, for the sake of generality, we now use vector notation to allow for multiple characteristics:18

Uij(xij) = α · xij + εij. (4.33)

Note that xij is indexed at the level of the option-individual combination; the vector includes characteristics that vary at the level of the individual and the choice — for example, respondents may face different relative costs of using different types of transport. Wooldridge cautions (2002, page 500) that “xij cannot contain elements that vary only across i and not j; in particular, xij does not contain unity”. This means, therefore, that xij cannot include personal characteristics (for example, age, sex, income, etc); we will see an intuitive reason for this shortly (in equation 4.39).

In the Multinomial Logit, the outcome y was indexed by individuals, and took the value of the particular choice made; for example, we might write yi = j. But in the Conditional Logit model, we need to express the outcome at the level of the individual-choice combination. Therefore, we redefine our outcome as a binary variable:

yij = { 1 if the ith individual chooses option j, and;
        0 if the ith individual chooses some other option k ≠ j.        (4.34)

All of our reasoning from the Multinomial Logit extends to the Conditional Logit. The conditional probability of individual i choosing outcome j is:

Pr(yij = 1 | xi1, . . . , xiJ) = exp(β · xij) / Σ_{k=1}^{J} exp(β · xik).        (4.35)

17 Cameron and Trivedi provide an example of this kind of data structure in their Table 15.1 on page 492; in that application, a researcher observes the price that different respondents faced for each of four types of fishing, even though each respondent chose only one type.

18 Of course, we could also have used vector notation for all of our earlier reasoning in the Multinomial Logit model; this would have implied estimating two vectors β_1^(2) and β_1^(3). But the scalar case captured all of the important aspects of Multinomial Logit in a simpler context, and matched directly our illustrative empirical application.

49 [email protected]

Page 50: Limited DEpendent Variable

4.6 Independence assumptions

The log-likelihood follows straightforwardly:

ℓi(β; yi1, . . . , yiJ | xi1, . . . , xiJ) = Σ_{j=1}^{J} yij · ln Pr(yij = 1 | xi1, . . . , xiJ).        (4.36)
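The Tanzanian data used in these notes does not contain alternative-specific covariates, so the sketch below uses hypothetical variable names purely for illustration: id (individual), alt (alternative), choice (the binary outcome yij of equation 4.34) and cost (an element of xij). With data in this ‘long’ form — one row per individual-alternative pair — Stata’s asclogit command fits the model of equations 4.35 and 4.36; clogit with a group identifier is an older equivalent.

* One row per individual-alternative pair (variable names are illustrative)
asclogit choice cost, case(id) alternatives(alt)
* Equivalently:
clogit choice cost, group(id)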

4.6 Independence assumptions

The Multinomial Logit and Conditional Logit are very tractable models. As we have discussed, they provide an analytical expression for the log-likelihood; this function can therefore be evaluated and maximised easily. But this analytical tractability comes at a cost: the Multinomial Logit and Conditional Logit both require that the unobservable terms, εij, have a Type I Extreme Value distribution, and that these terms are distributed independently of each other. This has serious implications for the structure of individual choice. Consider, for example, the Multinomial Logit. Using equation 4.25, we can write the ratio of the conditional probability that yi = A and that yi = B:

Pr(yi = A | xi) / Pr(yi = B | xi) = exp[α_0^(A) + α_1^(A) xi] / exp[α_0^(B) + α_1^(B) xi]        (4.37)
                                  = exp[α_0^(A) − α_0^(B) + (α_1^(A) − α_1^(B)) · xi].           (4.38)

That is, the ratio of probabilities for any two alternatives cannot depend upon how much the respondents like any of the other alternatives on offer. Similarly, consider the Conditional Logit. Using equation 4.35, the ratio of conditional probabilities for two choices is:

Pr(yiA = 1 | xi1, . . . , xiJ) / Pr(yiB = 1 | xi1, . . . , xiJ) = exp[β · (xiA − xiB)].        (4.39)

Thus, in the Conditional Logit model, the ratio of probabilities for two alternatives cannot depend upon the characteristics of any other alternative (or, as noted, on any characteristics that do not vary across j).

Cameron and Trivedi (page 503, emphasis in original) describe why these results are so concerning:

A limitation of the [Conditional Logit] and [Multinomial Logit] models is that discrimination among the [J] alternatives reduces to a series of pairwise comparisons that are unaffected by the characteristics of alternatives other than the pair under consideration. . .

As an extreme example, the conditional probability of commute by car given commute by car or red bus is assumed in an MNL or CL model to be independent of whether commuting by blue bus is an option. However, in practice we would expect introduction of a blue bus, which is the same as a red bus in every aspect except colour, to have little impact on car use and to halve use of the red bus, leading to an increase in the conditional probability of car use given commute by car or red bus.

This weakness of MNL is known in the literature as the red bus – blue bus problem, or more formally as the assumption of independence of irrelevant alternatives.

This is clearly a serious limitation of the conditional logit and multinomial logit. Indeed, in his Nobel Prize Lecture in 2000, Daniel McFadden even went so far as to say (page 339):

The MNL model has proven to have wide empirical applicability, but as a theoretical model of choice behaviour its IIA property is unsatisfactorily restrictive.

A variety of alternative models have been developed in order to relax these independence assumptions while still maintaining a clear basis in individual utility maximisation. For example, the Alternative-Specific Multinomial Probit assumes that the values of εij are drawn from a multivariate normal distribution with a flexible covariance matrix; this would allow, for example, that the unobservable utility from taking a blue bus is very close to the unobservable utility from taking a red bus. However, this model — like most other alternative models — does not admit a closed-form expression for the log-likelihood. The log-likelihood is therefore evaluated using simulation-based methods (e.g. ‘Simulated Maximum Likelihood’). These models are beyond the scope of our lectures — though they build directly upon the principles that we have discussed.

4.7 Appendix to Lecture 4: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

First, clear the memory and load the data, as in previous lectures. We can then tabulate the variable work_cat:

tab work_cat

We can estimate a multinomial logit — with a base category of 1 (agricultural employment) — by:

mlogit work_cat yborn, baseoutcome(1)

Mean marginal effects for self-employment and wage employment — categories 2 and 3 — can be calculated by:

margins, dydx(yborn) predict(outcome(2))
margins, dydx(yborn) predict(outcome(3))

51 [email protected]

Page 52: Limited DEpendent Variable

5 Lecture 5: Censored and truncated outcomes

Required reading:

• ? CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 529 – 544 (i.e. sections 16.1 to 16.3.7, inclusive); or

• ? WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 517 – 527 (i.e. sections 16.1 to 16.4, inclusive); or

• ? WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 667 – 677 (i.e. sections 17.1 to 17.3, inclusive).

5.1 The problem of incompletely observed data

Data is often observed incompletely. For example, we may often need to worry about problems of measurement error, missing observations, non-random attrition, and so on. In this lecture, we will discuss the specific problem of censoring on an outcome variable (and the related problem of truncation). There are two reasons that this problem is worth discussing. First, censoring often arises in real data, and can cause serious problems for empirical analysis if it is not addressed. Second, censoring provides a useful introduction to the issues and methods involved in correcting for problems of sample selection (the topic of Lecture 6). Cameron and Trivedi define truncated and censored data as follows (page 529, emphasis in original):

Leading cases of incompletely observed data are truncation and censoring. For truncated data some observations on both the dependent variable and the regressors are lost. For example, income may be the dependent variable and only low-income people are included in the sample. For censored data information on the dependent variable is lost, but not data on the regressors. For example, people of all income levels may be included in the sample, but for confidentiality reasons the income of high-income people may be top-coded and reported only as exceeding, say, $100,000 per year. Truncation entails greater information loss than does censoring.

In this lecture, we will focus upon the problem of censored data.19 Once again, we will use the Tanzanian data for illustrative purposes. However, we will use an artificially modified version of the data. Specifically, we will imagine that, for purposes of survey anonymity, the Tanzanian National Bureau of Statistics censored the measure of income at a gross weekly income of 250,000 Tanzanian shillings (i.e. almost exactly £100). Figure 5.1 shows the actual recorded Tanzanian income distribution (in the top panel), and then the imaginary income distribution with artificial top-coding (in the bottom panel).

19 Truncated data requires very similar methods, and a similar log-likelihood function.

52 [email protected]

Page 53: Limited DEpendent Variable

5.2 The Tobit model

Figure 5.1: Income in Tanzania with artificial top-coding

The question for this lecture is simple: how can we estimate the effect of education on earnings when earnings are censored? For simplicity, we will consider four educational achievements: (i) no education (i.e. not having completed primary school), (ii) primary education, (iii) secondary education and (iv) tertiary education. Figure 5.2 shows the distribution of (true) gross weekly income across those four categories, with the top-coding point marked. The figure illustrates immediately why censorship produces empirical problems: censorship has different effects across the different educational categories. Thus, for example, only about 12% of income-earning Tanzanians with no education would be affected by our artificial top-coding; however, the top-coding would affect almost half of income-earning Tanzanians with a tertiary education. Figure 5.3 shows the effect on conditional means: top-coding reduces mean measured income by more for those education categories with higher earnings; if we do not correct for top-coding, we will therefore underestimate the effect of education.

5.2 The Tobit model

Let’s consider the problem more formally. We will define D_i^p = 1 if the ith individual has completed primary education, and define D_i^s and D_i^t respectively for secondary and tertiary. We can denote log earnings by the variable y∗, and therefore specify:

y∗_i = α + βp D_i^p + βs D_i^s + βt D_i^t + ε∗_i.              (5.1)

53 [email protected]

Page 54: Limited DEpendent Variable

5.2 The Tobit model

Figure 5.2: Income distributions in Tanzania by education category

In the actual Tanzanian data, y∗ is observed, so we can run an OLS regression on equation 5.1 to estimate the average earnings of individuals with different educational achievements. (Note that we are not saying anything here about causation: because we are saying nothing about ‘ability bias’, we are simply seeking here to describe conditional average earnings.)

In our artificial data, however, things are not so simple; recall that, in our artificial data, we are imagining that earnings is top-coded at a weekly income of 250,000 Tanzanian shillings. In our artificial data, we therefore observe not y∗ but y, where we have:

yi = { y∗_i if y∗_i < z;
       z    if y∗_i ≥ z,                                       (5.2)

where z denotes the censorship point (in this case, defined such that z = ln(250000)). Alternatively, equation 5.2 can be rewritten as:

yi = min(y∗_i, z).                                             (5.3)

As in earlier lectures, we need a distributional assumption on the error term.

Assumption 5.1 (DISTRIBUTION OF ε∗_i) ε∗_i is i.i.d. with a normal distribution with mean 0 and variance σ2, independent of D_i^p, D_i^s and D_i^t:

ε∗_i | D_i^p, D_i^s, D_i^t ∼ N(0, σ2).                         (5.4)

54 [email protected]

Page 55: Limited DEpendent Variable

5.2 The Tobit model

Figure 5.3: Mean log income by education category, with and without top-coding

For simplicity, we can define µi such that µi | D_i^p, D_i^s, D_i^t ∼ N(0, 1), and write:

y∗_i = α + βp D_i^p + βs D_i^s + βt D_i^t + σ · µi.            (5.5)

From this equation, we can derive the log-likelihood. The idea is simple: we can recover consistent estimates of our parameters of interest if we can write a log-likelihood that jointly models (i) the income-generating process, and (ii) the process of top-coding. First, we can write the conditional probability of censorship:20

Pr(yi = z | D_i^p, D_i^s, D_i^t) = Pr(y∗_i ≥ z | D_i^p, D_i^s, D_i^t)                                  (5.6)
   = Pr( µi ≥ [z − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ  |  D_i^p, D_i^s, D_i^t )                  (5.7)
   = 1 − Φ( [z − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ )                                            (5.8)
   = Φ( [α + βp D_i^p + βs D_i^s + βt D_i^t − z] / σ ).                                                 (5.9)

For individuals whose income is not censored, the likelihood is the probability that income is not censored, multiplied by the conditional density of income (conditional also upon income not being censored):

20 Note that, because µi has a standard normal distribution, the conditional probability of censorship follows a probit model.

Pr(yi < z | D_i^p, D_i^s, D_i^t) · f(yi | yi < z, D_i^p, D_i^s, D_i^t)                                       (5.10)
   = Pr(yi < z | D_i^p, D_i^s, D_i^t) · f(yi | D_i^p, D_i^s, D_i^t) / Pr(yi < z | D_i^p, D_i^s, D_i^t)        (5.11)
   = f(yi | D_i^p, D_i^s, D_i^t)                                                                             (5.12)
   = σ^{-1} · φ( [yi − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ ).                                          (5.13)

(Note that, for any random variable X distributed such that X ∼ N(µ, σ2), the probability density is f(X; µ, σ2) = σ^{-1} · φ[(X − µ)/σ].)

We can therefore write the log-likelihood for the ith individual as:

ℓi(α, βp, βs, βt, σ; yi | D_i^p, D_i^s, D_i^t) =
    ln Φ( [α + βp D_i^p + βs D_i^s + βt D_i^t − z] / σ )                    if yi = z;
    ln [ σ^{-1} · φ( [yi − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ ) ]    if yi < z.        (5.14)

We can write the log-likelihood for the sample as:

ℓ(·) = Σ_{i=1}^{N} { 1(yi = z) · ln Φ( [α + βp D_i^p + βs D_i^s + βt D_i^t − z] / σ )
                   + 1(yi < z) · ln [ σ^{-1} · φ( [yi − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ ) ] }.        (5.15)
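In Stata, this log-likelihood is maximised by the built-in tobit command. A minimal sketch of the comparison reported in Table 5.1, assuming a log-income variable lny and education dummies d_prim, d_sec and d_tert (these variable names are illustrative; they are not the names used in the Tanzanian dataset):

* OLS on the censored data (column (2)) and on the truncated data (column (3)):
reg lny d_prim d_sec d_tert
reg lny d_prim d_sec d_tert if lny < ln(250000)
* Tobit, modelling the top-coding directly (column (4)):
tobit lny d_prim d_sec d_tert, ul(`=ln(250000)')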

5.3 Back to Tanzania. . .

Table 5.1 shows estimates using the earnings data. Column (1) shows the OLS estimates using the actual earnings data; i.e. the column shows consistent estimates of the relationship between education and earnings. Columns (2) and (3) use the top-coded data: column (2) estimates with the top-coded observations included, while column (3) drops those observations. Neither approach works well in recovering the column (1) estimates. Column (4) shows the Tobit estimates: the estimator performs very well in recovering the column (1) estimates.21 Figure 5.4 illustrates.

21 Admittedly, the Tobit estimator overestimates the average earnings for the tertiary-educated; however, this difference is not significant.

56 [email protected]

Page 57: Limited DEpendent Variable

5.3 Back to Tanzania. . .

Table 5.1: Tobit estimates for the artificial Tanzanian top-coding

                   OLS                                              Tobit
                   Real data     Censored data    Truncated data    Censored data
                   (1)           (2)              (3)               (4)
Const.             10.786        10.705           10.470            10.776
                   (0.041)***    (0.035)***       (0.034)***        (0.041)***
Primary            0.556         0.501            0.445             0.564
                   (0.049)***    (0.042)***       (0.042)***        (0.05)***
Secondary          1.076         0.981            0.982             1.111
                   (0.071)***    (0.06)***        (0.062)***        (0.072)***
Tertiary           1.401         1.185            0.936             1.577
                   (0.21)***     (0.177)***       (0.224)***        (0.227)***
Obs.               3660          3660             2990              3660
Log-likelihood                                                      -5613.710

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.
‘Real data’ refers to the true Tanzanian earnings data. ‘Censored data’ refers to the data including top-coded observations (at 250,000 Tanzanian shillings). ‘Truncated data’ refers to the data omitting top-coded observations.

Figure 5.4: Tobit estimates for Tanzania

57 [email protected]

Page 58: Limited DEpendent Variable

5.4 The Inverse Mills Ratio

5.4 The Inverse Mills Ratio

Hopefully, it is clear from the earlier discussion that censorship (in the form of top-coding) has created serious problems for the OLS estimator: censorship caused OLS to underestimate substantially the effect of education on conditional earnings. So what’s going on? We can consider the problem by writing the expectation of yi, conditional on covariates and upon earnings lying below the top-coding point. We noted earlier that even dropping the censored observations — that is, transforming the problem to one of truncation, rather than of censorship — did not produce consistent estimates. In this section, we will use the truncated sample in order to introduce the concept of the ‘Inverse Mills Ratio’.

Figure 5.5: Mean log income by education category, dropping top-coded observations

Figure 5.5 shows mean log income across education category, comparing the mean of the true data to the mean of truncated data. (That is, Figure 5.3 compared the true data to censored data; Figure 5.5 shows the equivalent figure for truncated data. In doing so, the figure summarises the coefficients in column (3) of Table 5.1.)

Our objective is to derive an expression for the expected heights of the light gray bars in Figure 5.5. To do this, we need to consider a very nice feature of the normal distribution. As you know, the probability density function of the standard normal is:

φ(x) = (1/√(2π)) · exp(−x²/2).                                 (5.16)

58 [email protected]

Page 59: Limited DEpendent Variable

5.4 The Inverse Mills Ratio

It follows that the slope of this pdf is:

dφ(x)/dx = −x · (1/√(2π)) · exp(−x²/2)                         (5.17)
         = −x · φ(x).                                          (5.18)

This is very useful; we can use this result to derive the expectation of a truncated normal distribution. Suppose that X ∼ N(0, 1). Then we can write:

E(X | X < z) = ∫_{−∞}^{z} x · φ(x) dx / ∫_{−∞}^{z} φ(x) dx      (5.19)
            = − ∫_{−∞}^{z} [dφ(x)/dx] dx / Φ(z)                 (5.20)
            = − [φ(x)]_{−∞}^{z} / Φ(z)                          (5.21)
            = − φ(z) / Φ(z).                                    (5.22)

Symmetrically, we have:

E(X | X > z) = φ(z) / [1 − Φ(z)].                               (5.23)
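Equation 5.22 is easy to check by simulation; the sketch below uses z = 0.5 as an arbitrary illustrative truncation point.

* Compare the simulated mean of a truncated standard normal with -phi(z)/Phi(z)
clear
set obs 1000000
set seed 1
gen x = rnormal()
summarize x if x < 0.5
display "simulated: " r(mean) "   formula: " -normalden(0.5)/normal(0.5)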

Therefore, we can write the expectation of yi, conditional on covariates and upon ‘no truncation’ as:22

E(yi | yi < z, D_i^p, D_i^s, D_i^t) = E(y∗_i | y∗_i < z, D^p, D^s, D^t)                                     (5.24)
   = α + βp D_i^p + βs D_i^s + βt D_i^t
     + σ · E( µi | µi < [z − (α + βp D_i^p + βs D_i^s + βt D_i^t)] / σ )                                    (5.25)
   = α + βp D_i^p + βs D_i^s + βt D_i^t
     − σ · φ( [z − (α + βp D_i^p + βs D_i^s + βt D_i^t)] · σ^{-1} ) / Φ( [z − (α + βp D_i^p + βs D_i^s + βt D_i^t)] · σ^{-1} ),        (5.26)

where the final equality follows from our earlier results about the truncated standard normal.

22 Note that equation 5.25 follows from equation 5.24 because we are conditioning on D_i^p, D_i^s and D_i^t; that is, those variables are known, and enter linearly, so we can take them outside of the expectation operator.

59 [email protected]

Page 60: Limited DEpendent Variable

5.4 The Inverse Mills Ratio

The final term — the Inverse Mills Ratio term — is clearly central to understanding why top-coding causes us to underestimate the conditional mean earnings of each education category. To see this clearly, we can write the conditional mean for each level of education separately:

E(yi | yi < z, D^p, D^s, D^t) =
    α − σ · φ([z − α] · σ^{-1}) / Φ([z − α] · σ^{-1})                              if no education;
    α + βp − σ · φ([z − α − βp] · σ^{-1}) / Φ([z − α − βp] · σ^{-1})               if primary education;
    α + βs − σ · φ([z − α − βs] · σ^{-1}) / Φ([z − α − βs] · σ^{-1})               if secondary education;
    α + βt − σ · φ([z − α − βt] · σ^{-1}) / Φ([z − α − βt] · σ^{-1})               if tertiary education.        (5.27)

Figure 5.6 shows the Inverse Mills Ratio function: the function is strictly decreasing, strictly convex, lim_{x→∞} φ(x)/Φ(x) = 0 and lim_{x→−∞} φ(x)/(x · Φ(x)) = −1. Thus, if the effect of education is larger (e.g. βp is higher), then the argument to the Inverse Mills Ratio will be smaller, the Ratio will be larger, and the difference between true mean earnings and mean censored earnings will be greater. This, of course, is precisely what we observed in Figure 5.5.

Figure 5.6: The Inverse Mills Ratio

[Figure 5.6 plots y = φ(x)/Φ(x) for x between −4 and 4, together with its asymptote y = −x.]
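If you would like to reproduce a figure like this, the ratio can be graphed directly in Stata using the built-in normal density and distribution functions. This is a minimal sketch; the plotting range is simply chosen to match Figure 5.6.

* Plot the Inverse Mills Ratio and its asymptote y = -x.
twoway (function y = normalden(x)/normal(x), range(-4 4)) ///
       (function y = -x, range(-4 0)), ///
       xtitle("x") ytitle("y") legend(order(1 "y = phi(x)/Phi(x)" 2 "y = -x"))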

The Inverse Mills Ratio is therefore useful for understanding the consequences of truncation. (And, indeed, we could also write a slightly more complicated expression for the conditional expectations under censorship.) But the Ratio is more important than that; we will see in our next lecture that it is also fundamental for correcting problems of sample selection.

60 [email protected]

Page 61: Limited DEpendent Variable

5.5 A warning

5.5 A warning: Censoring is a data problem, not an economic outcome

This has been a lecture about modelling incompletely observed data: we have considered an example in which an observable economic variable (income) has been censored, through top-coding. This is quite different to a context in which an outcome variable appears to be censored or truncated because of agents' economic decisions. For example, consider the top panel of Figure 5.7; it shows the number of hours worked in the previous week, for a sample from the Tanzanian ILFS (for respondents' main activity). In some respects, this looks a bit like the outcome that we might expect from a Tobit model with censorship from below at zero hours worked.

Figure 5.7: Using a Tobit structure for hours worked

The Tobit model is certainly one structure that is sometimes placed upon data like this. Researchers sometimes argue, for example, that respondents each have a 'preferred number of hours worked' — if that preferred number is greater than zero, respondents work that number of hours; otherwise, they work zero hours. However, this interpretation is problematic, for at least two reasons. First, as a matter of principle, we may not find it reasonable to model individuals' labour supply decisions in this way. For example, we may think that individuals require some effort to find a job, or face some transportation cost of travelling to work — so it may not make sense, say, to model anybody as working for 10 minutes a week. Second, as a matter of empirical observation, many censored or truncated variables actually have a distribution quite different to that implied by a Tobit structure. For example, consider the bottom panel of Figure 5.7: this shows the distribution of a simulated 'hours of work' variable, where the parameters used are obtained by a Tobit estimation on the true data (in the top panel). Loosely, we can interpret the bottom panel as saying, "This is what the distribution of hours should look like, if the underlying process truly is a Tobit model". It is clear that the distributions look quite different: for example, the proportion of respondents who do not work is substantially higher in the real data than in the simulated data. In short, the Tobit model does not seem to describe this data well — either as a matter of empirical observation or in its implications about underlying behaviour.
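For concreteness, the comparison in the bottom panel can be produced along the following lines. This is a minimal sketch with hypothetical variable names: hours is the observed hours-worked variable, and educ and age stand in for whatever regressors are used; the value of sigma_hat must be taken from the Tobit output.

* Fit a Tobit with censoring from below at zero, then simulate hours from the
* fitted model and compare the two distributions.
tobit hours educ age, ll(0)
predict double xb_hat, xb                // fitted linear index
scalar sigma_hat = 20                    // replace with the sigma reported by -tobit-
set seed 12345
gen double hours_sim = max(0, xb_hat + sigma_hat*rnormal())
histogram hours, name(actual, replace)
histogram hours_sim, name(simulated, replace)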

Angrist and Pischke make a similar point (2009, page 102):

Do Tobit-type latent variables models ever make sense? Yes, if the data you are working with are truly censored. True censoring means the latent variable has an empirical counterpart that is the outcome of primary interest. A leading example from labor economics is CPS earnings data, which topcodes (censors) very high values of earnings to protect respondent confidentiality.

Instead of using the Tobit model, it may be more reasonable to consider individuals as making two labour supply decisions — i.e. (i) whether to work at all and, if so, (ii) how many hours to work. The Tobit model links those two decisions in a very rigid and very specific way — but it may be preferable to use a model that treats those two outcomes separately. Such an approach might fit the data more closely, while also allowing more flexibility in its implications about economic behaviour. This is the kind of approach that we will consider in our final lecture.

5.6 Appendix to Lecture 5: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

gen logincome_cens = logincome
replace logincome_cens = log(250000) ///
    if logincome >= log(250000) & logincome != .

reg logincome primary secondary tertiary

reg logincome_cens primary secondary tertiary

tobit logincome_cens primary secondary tertiary, ul(12.429)

(Note, in interpreting this last line, that exp(12.429) ≈ 250000; that is, the last line is telling Stata to use 12.429 as the upper limit for censorship of log wages.)

62 [email protected]

Page 63: Limited DEpendent Variable

6 Lecture 6: Selection

Required reading:

• ⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 546 – 553 (i.e. section 16.5); or

• ⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 551 – 566 (i.e. sections 17.1 to 17.4.1, inclusive); or

• ⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 790 – 808 (i.e. sections 19.3 to 19.6.1, inclusive).

Other references:

• DEATON, A. (1997): The Analysis of Household Surveys: A Microeconometric Approach to Development Policy. The World Bank.

• NEWEY, W., J. POWELL AND J. WALKER (1990): "Semiparametric Estimation of Selection Models: Some Empirical Results," American Economic Review, 80(2), 324–328.

6.1 The problem of endogenous selection

What is the effect of education upon worker productivity in Tanzania? This is an important question, for several reasons: for example, it may be important in assessing the value of public investment in education, or it may be useful for learning about how firms reward their employees. It is also a canonical illustration for many issues in microeconometrics; in this lecture, we will consider this question in order to think about the problem of endogenous selection.

For illustrative purposes, we will begin by considering a simulated dataset. In that simulated data, students choose years of education xi ∈ {1, . . . , 20}, and the ith student then has a potential log income y∗i , where y∗i is determined by a standard Mincer earnings equation:

y∗i = β0 + β1xi + ε∗i . (6.1)

That is, we can interpret y∗i as an income offer — the income that the ith individual would earn if that individual were to hold employment. Critically, we will assume that we cannot fully observe y∗i : we will assume that the ith individual earns y∗i only if that individual finds a job. Otherwise, we will assume that the value for log income is missing (in the sense that log(0) is undefined). As researchers, we therefore observe not potential log income, y∗i , but actual log income, yi:

\[
y_i = \begin{cases} y^*_i & \text{if the $i$th individual has a job;} \\ \cdot & \text{if the $i$th individual does not have a job.} \end{cases} \tag{6.2}
\]

63 [email protected]

Page 64: Limited DEpendent Variable

6.2 The Heckman selection model

We will assume that E(ε∗i |xi) = 0; that is, we will assume that, if we could observe y∗i , we could estimate β0 and β1 consistently simply by using OLS. (In reality, of course, we may be concerned that this assumption does not hold; we might be concerned both about endogenous sample selection and about endogenous choice of education. But we will ignore this possibility in this lecture so that we can focus on the problem of endogenous selection.)

Before we go any further, we must consider a simple question: why does it matter whether an individual has a job or not? The simplest approach to estimating β0 and β1 would simply be to regress yi on xi for those individuals who have a job. This would be a valid approach — that is, it would produce consistent estimates of β0 and β1 — if the individuals with a job were a representative sample of all individuals in the population. For example, it would be a valid approach if each individual's employment status were determined by flipping a coin.

The problem, of course, is that employment is not determined by flipping a coin — and there are good reasons to believe that the mechanism that determines whether someone has a job is not independent of the mechanism that determines how much that individual can earn. For example, it may be that individuals with a higher earning capacity invest more resources in job search; similarly, it may be that employers are more likely to offer a job to someone who is likely to be more productive. This is what researchers sometimes term a 'selection problem': the individuals for whom we have data (in this case, individuals with a job) may be an unrepresentative subsample of the population, because the process by which the sample is chosen depends upon the mechanism being studied (in this case, the relationship between education and earnings). In this lecture, we consider a method for recovering consistent estimates of β0 and β1 while allowing for this selection problem.

6.2 The Heckman selection model

6.2.1 A triangular simultaneous equations model

The central idea behind a selection model in this context is simple: in order to estimate consistently the relationship between education and income (that is, to obtain consistent estimates of β0 and β1), we need to model both the relationship between education and earnings and the mechanism by which individuals find employment. Equation 6.1 includes a single error term, ε∗i ; for simplicity, we will term this 'work ability'. But we will now introduce a second error term, denoted µi, which we shall term 'work desire'. We will also introduce a new variable, zi ∈ {0, . . . , 4}, denoting the number of children in the ith individual's household. This variable requires an important assumption.

Assumption 6.1 The number of children in an individual's household (zi) is independent of that individual's 'work ability' (ε∗i ) and 'work desire' (µi).

Like all assumptions, this is debatable. For example, the assumption rules out the possibility that individuals with higher 'work ability' choose to have fewer children in order to pursue their careers. I leave it to you to consider how reasonable you find this assumption; it is an assumption that we shall maintain throughout this lecture.

We will assume that the ith individual's employment status depends upon (i) his or her education, xi, (ii) the number of children in his or her household, zi, and (iii) his or her 'work desire', µi. We will denote ji as a dummy variable for whether the ith individual has a job, and will assume:

\[
j_i = \begin{cases} 0 & \text{if } \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \mu_i < 0; \\ 1 & \text{if } \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \mu_i \ge 0. \end{cases} \tag{6.3}
\]

(Note, in passing, the similarity between equation 6.3 and our earlier equations 1.6 and 1.7.)

Together, equations 6.1, 6.2 and 6.3 describe what can be termed a 'triangular simultaneous equations system'; more generally, we can write the system as:

yi = y(xi, ji, ε∗i ) (6.4)

ji = j(xi, zi, µi). (6.5)

Critically, note that — because of Assumption 6.1 — zi directly affects ji but does not directly affect yi. That is, zi is excluded from the function y; this assumption is therefore sometimes termed the 'exclusion restriction'. In this context, the validity of the exclusion restriction depends critically upon our assumption that zi and ε∗i are independent; if, for example, individuals with higher 'work ability' choose to have fewer children, the exclusion restriction will be violated.

Together, equations 6.4 and 6.5 emphasise the importance of the relationship between ε∗i and µi: as we noted earlier, it is this relationship that creates the selection problem that our estimator will need to address. In the Heckman selection model, we proceed by assuming that ε∗i and µi are drawn from the following bivariate normal distribution, independent of xi and zi:
\[
\begin{pmatrix} \varepsilon^* \\ \mu \end{pmatrix} \bigg|\, x_i, z_i \;\sim\; N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho \cdot \sigma \\ \rho \cdot \sigma & 1 \end{pmatrix} \right). \tag{6.6}
\]

In our simulated data, we will set ρ = 0.75 and σ = 1. Figure 6.1 shows the simulated distribution of (µi, ε∗i ) for a sample of size N = 10000. Additionally, we have zi ∈ {0, . . . , 4}, simulated with equal probability on each point of support (i.e. a probability of 0.2).
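A dataset of this kind can be generated along the following lines. This is a minimal sketch: the γ and β values are those reported in section 6.2.6, the seed is arbitrary, and the uniform distribution over years of education is my own assumption for illustration.

* Simulate the selection model of equations 6.1 - 6.3 and 6.6.
clear
set obs 10000
set seed 2013
gen educ  = 1 + floor(20*runiform())                 // years of education, x_i in {1,...,20}
gen z     = floor(5*runiform())                      // children in household, z_i in {0,...,4}
gen mu    = rnormal()                                // 'work desire', standard normal
gen eps   = 0.75*mu + sqrt(1 - 0.75^2)*rnormal()     // 'work ability': sigma = 1, corr(eps, mu) = 0.75
gen job   = (-0.5 + 0.1*educ - 0.25*z + mu >= 0)     // employment dummy, equation 6.3
gen ystar = 10 + 0.1*educ + eps                      // potential log income, equation 6.1
gen y     = ystar if job == 1                        // observed log income, equation 6.2

By construction, z is generated independently of (ε∗, µ), so Assumption 6.1 holds in this simulated world.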

We will derive formal results shortly. But, even before we do this, Figure 6.1 provides an intuitive illustration of the selection problem. First, note that individuals with a higher value of µi are more likely to have a job — and that those individuals are also likely to have a higher value of ε∗i . Moreover, we can see intuitively why the value of ε∗i among individuals with a job is likely to correlate with individuals' education. If γ1 > 0, more educated individuals are more likely to have a job. Therefore, those less educated individuals who have a job will, on average, have a higher work desire than their more educated workmates; in turn, this implies that they will also have a higher average work ability. We should therefore expect an OLS regression of yi on xi to underestimate the effects of schooling. To formalise this, we need to revise some characteristics of the bivariate normal.

65 [email protected]

Page 66: Limited DEpendent Variable

6.2 The Heckman selection model

Figure 6.1: Work desire and work ability (simulated data)

6.2.2 Marginal and conditional distributions under the bivariate normal

Suppose that two random variables, X and Y , have a bivariate normal distribution:
\[
\begin{pmatrix} Y \\ X \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \sigma^2_Y & \rho \cdot \sigma_X \sigma_Y \\ \rho \cdot \sigma_X \sigma_Y & \sigma^2_X \end{pmatrix} \right). \tag{6.7}
\]

It follows that the marginal distribution of X is:
\[
X \sim N\left( \mu_X, \sigma^2_X \right). \tag{6.8}
\]

The distribution of Y , conditional on X = x, is:
\[
Y \mid (X = x) \sim N\!\left( \mu_Y + \rho \cdot \frac{\sigma_Y}{\sigma_X} \cdot (x - \mu_X),\; (1 - \rho^2) \cdot \sigma^2_Y \right). \tag{6.9}
\]

(If you are interested in an intuitive illustration of the bivariate normal distribution, I recommend http://demonstrations.wolfram.com/TheBivariateNormalDistribution/.)

Equation 6.6 is clearly a special case of 6.7, where the distributional assumption on (ε∗i , µi) normalises the means of each variable to zero and the variance of µi to 1. Two important results follow. First, consider the marginal distribution of µi:

µi ∼ N (0, 1). (6.10)

66 [email protected]

Page 67: Limited DEpendent Variable

6.2 The Heckman selection model

The ith individual's employment status is therefore determined by a linear index of xi and zi (see equation 6.3), plus an error term that has a standard normal distribution: that is, employment status is determined by a probit model. This implies that we can estimate γ0, γ1 and γ2 consistently by running a probit estimation of ji on xi and zi.

Second, consider the conditional distribution of ε∗i given µi:
\[
\varepsilon^*_i \mid \mu_i \sim N\left( \rho\sigma \cdot \mu_i,\; (1 - \rho^2) \cdot \sigma^2 \right). \tag{6.11}
\]

Critically, this implies that the conditional expectation of ε∗i is linear in µi (as Figure 6.1 illustrated):

E(ε∗i |µi) = ρσ · µi. (6.12)

Consider, then, the conditional expectation of ε∗i for someone who has a job:

\[
\begin{aligned}
E(\varepsilon^*_i \mid x_i, z_i, j_i = 1; \gamma_0, \gamma_1, \gamma_2) &= \rho\sigma \cdot E(\mu_i \mid x_i, z_i, j_i = 1) && (6.13) \\
&= \rho\sigma \cdot E(\mu_i \mid \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \mu_i \ge 0) && (6.14) \\
&= -\rho\sigma \cdot E(-\mu_i \mid -\mu_i \le \gamma_0 + \gamma_1 x_i + \gamma_2 z_i) && (6.15) \\
&= \rho\sigma \cdot \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}. && (6.16)
\end{aligned}
\]

This derivation follows our earlier results on the Inverse Mills Ratio (see, for example, equation 5.22). From this result, we can write the conditional expectation of log income, given that an individual has a job:

\[
E(y_i \mid x_i, z_i, j_i = 1; \beta_0, \beta_1, \gamma_0, \gamma_1, \gamma_2) = \beta_0 + \beta_1 x_i + \rho\sigma \cdot \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}. \tag{6.17}
\]

This implies that, for the subsample having a job, we can write:

\[
y_i = \beta_0 + \beta_1 x_i + \rho\sigma \cdot \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} + \eta_i, \tag{6.18}
\]

where E(ηi |xi, zi) = 0.

This should remind you of our earlier derivation of the Tobit estimator; for this reason, the Heckman selection correction model is sometimes known as a 'Generalised Tobit' estimator (or a 'Type 2 Tobit').

6.2.3 Estimation using a two-step method

The simplest way of thinking about the Heckman model is through a 'two-step' estimation method, implied by equation 6.18:

67 [email protected]

Page 68: Limited DEpendent Variable

6.2 The Heckman selection model

(i) Run a probit of ji on xi and zi (i.e. for the entire sample). Use the estimates γ̂0, γ̂1 and γ̂2 to construct an estimated Inverse Mills Ratio,
\[
\frac{\phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)}{\Phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)}.
\]

(ii) Run an OLS regression of yi on xi and the estimated Inverse Mills Ratio, for all individuals who have a job (a minimal Stata sketch of both steps follows).
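Using the variables from the simulation sketch above (educ, z, job and y), the two steps can be carried out by hand as follows. This is a minimal sketch; the standard errors from the final regression are not corrected for the generated regressor.

probit job educ z                                    // step (i): first-stage probit
predict double index, xb                             // estimated gamma0 + gamma1*x_i + gamma2*z_i
gen double imr = normalden(index)/normal(index)      // estimated Inverse Mills Ratio
reg y educ imr if job == 1                           // step (ii): second-stage OLS on job-holders

The coefficient on imr estimates ρσ; in practice, the built-in command heckman y educ, select(job = educ z) twostep implements the same procedure and adjusts the standard errors, as discussed below.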

As Wooldridge (2002, page 564) explains, "The procedure is sometimes called Heckit after Heckman (1976) and the tradition of putting 'it' on the end of procedures related to probit (such as Tobit)." Assuming that all of the model assumptions are correct, this method will yield consistent estimates of β0 and β1. The Inverse Mills Ratio term has acted in the OLS regression as a control for the differences in work ability created by the selection process: for this reason, we can describe the Inverse Mills Ratio function in this context as a 'control function'. Control function methods are useful for many different types of applied problems.

The Heckit procedure produces consistent estimates of β0 and β1, but does not generally produce consistent estimates of the standard errors, if ρ ≠ 0. There are two reasons for this: (i) ρ ≠ 0 implies heteroskedasticity, and (ii) γ̂0, γ̂1 and γ̂2 are estimates of the true γ0, γ1 and γ2, so the estimated Inverse Mills Ratio is a 'generated regressor'. Fortunately, Stata adjusts the standard errors to correct these problems (using the 'heckman ..., twostep' command). Note that if we are only interested in testing H0 : ρ = 0, these problems do not arise (i.e. we can trust the standard errors from the OLS regression).

6.2.4 Estimation using maximum likelihood

We can estimate the Heckman selection model efficiently using maximum likelihood. Like the two-step approach, the maximum likelihood estimator must explain two features of the data:

(i) Whether the ith individual has a job, and

(ii) If the individual has a job, how much (s)he earns.

To write the log-likelihood, let's start by turning our earlier logic around. The marginal distribution of ε∗i is N (0, σ2); and, for a given value of ε∗i , the conditional distribution of µi is:

\[
\mu_i \mid \varepsilon^*_i \sim N\left( \rho\sigma^{-1} \cdot \varepsilon^*_i,\; (1 - \rho^2) \right). \tag{6.19}
\]

We can use this expression to write the conditional probability that γ0 + γ1xi + γ2zi + µi ≥ 0 (i.e. that the ith individual has a job), for a given value of ε∗i . First, note that equation 6.19 implies:

\[
\left. \frac{-\mu_i + \rho\sigma^{-1} \cdot \varepsilon^*_i}{\sqrt{1 - \rho^2}} \,\right|\, \varepsilon^*_i \sim N(0, 1). \tag{6.20}
\]

68 [email protected]

Page 69: Limited DEpendent Variable

6.2 The Heckman selection model

The probability of a respondent having a job, conditional upon having a 'work ability' ε∗i , is therefore:

\[
\begin{aligned}
\Pr(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \mu_i \ge 0 \mid \varepsilon^*_i) &= \Pr(-\mu_i \le \gamma_0 + \gamma_1 x_i + \gamma_2 z_i \mid \varepsilon^*_i) && (6.21) \\
&= \Pr\!\left( \frac{-\mu_i + \rho\sigma^{-1} \cdot \varepsilon^*_i}{\sqrt{1 - \rho^2}} \le \frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1} \cdot \varepsilon^*_i}{\sqrt{1 - \rho^2}} \,\middle|\, \varepsilon^*_i \right) && (6.22) \\
&= \Phi\!\left( \frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1} \cdot \varepsilon^*_i}{\sqrt{1 - \rho^2}} \right). && (6.23)
\end{aligned}
\]

This provides the probability mass of having a job, conditional on an observed value of ε∗i (which we are terming εi). Given that the marginal distribution of ε∗i is normal with variance σ2, the marginal probability density of ε∗i is:

\[
\sigma^{-1} \cdot \phi\!\left( \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right). \tag{6.24}
\]

Therefore, the joint likelihood of observing the ith respondent in a job and of observing that respondent's particular wage is simply the marginal probability density times the conditional probability mass:

\[
\Phi\!\left( \frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1} \cdot \varepsilon^*_i}{\sqrt{1 - \rho^2}} \right) \times \sigma^{-1} \cdot \phi\!\left( \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right). \tag{6.25}
\]

Thus, for the ith individual, we can write the log-likelihood as:

\[
\ell_i(\cdot) =
\begin{cases}
\ln\left[ 1 - \Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i) \right] & \text{if the $i$th individual has no job;} \\[2ex]
\ln \Phi\!\left( \dfrac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1} \cdot (y_i - \beta_0 - \beta_1 x_i)}{\sqrt{1 - \rho^2}} \right) + \ln\left[ \sigma^{-1} \cdot \phi\!\left( \dfrac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right) \right] & \text{if the $i$th individual has a job at wage } y_i.
\end{cases}
\tag{6.26}
\]

This, it must be said, is quite an intimidating expression — and you will have the opportunity to revise its derivation as part of the second problem set. The underlying idea, however, should be clear: we can write the log-likelihood for a selection model by thinking about the marginal distribution of the observed wage and the probability of employment conditional upon that wage having been offered.

69 [email protected]

Page 70: Limited DEpendent Variable

6.2 The Heckman selection model

6.2.5 Maximum Likelihood or Heckman’s Two-Step?

Both the Heckit estimator and the maximum likelihood estimator produce consistent estimates under the bivariate normality assumption of equation 6.6. If we are willing to accept that assumption, we should use the maximum likelihood estimator, because of its efficiency (see, for example, our earlier discussion on page 10 of these notes).

However, note that the Heckit estimator did not require the full distributional assumption of equation 6.6; instead, it required only (i) the standard normal marginal distribution assumption of equation 6.10 and (ii) the linear conditional expectation assumption of equation 6.12. As in many econometric contexts, the choice between the Heckit estimator and the maximum likelihood estimator is therefore a choice between an estimator that is more efficient and one that is more robust.

Wooldridge (2002, page 566) summarises the trade-off as follows:23

[Maximum likelihood estimation] will be more efficient than the two-step procedure under joint normality of [µi and εi], and it will produce standard errors and likelihood ratio statistics that can be used directly. . . The drawbacks are that it is less robust than the two-step procedure and that it is sometimes difficult to get the problem to converge.

Cameron and Trivedi make the same point: see pages 550–551.
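In Stata, both estimators are invoked through the heckman command; only the option differs. A short sketch, again using the simulated variable names from earlier:

heckman y educ, select(job = educ z)            // full maximum likelihood
heckman y educ, select(job = educ z) twostep    // Heckit two-step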

6.2.6 Illustration using simulated data

These principles can be illustrated using our simulated data. First, Figure 6.2 shows the predicted probabilities of having a job from the first-stage probit model. I simulated the first stage using γ0 = −0.5, γ1 = 0.1 and γ2 = −0.25; the five fitted lines therefore correspond to different values of zi, with the highest line being zi = 0 and the lowest being zi = 4. I simulated the second stage using ρ = 0.75, σ = 1, β0 = 10 and β1 = 0.1.

Figure 6.3 shows the distribution of simulated ε∗i across different levels of education. The dotted line shows that E(ε∗i |xi) = 0 (that is, it confirms that OLS would produce consistent estimates of β0 and β1 if we were able to observe earnings for the entire sample).

Figure 6.4 shows the distribution of simulated εi, i.e. the distribution of ε∗i for individuals with a job (for simplicity, I now show only individuals having no children in the household: zi = 0). The conditional expectation is noticeably larger for individuals with less education; as discussed, these individuals have 'overcome more' in order to find a job, and therefore have a higher average work ability. The smoothed line is given by ρσ times the Inverse Mills Ratio (i.e. as in equation 6.16).

23 Wooldridge actually prefers the term 'partial' maximum likelihood here, because the wage does not enter the likelihood unless an individual has a job.

70 [email protected]

Page 71: Limited DEpendent Variable

6.2 The Heckman selection model

Figure 6.2: Conditional probability of earning income (simulated data)

Figure 6.3: Conditional expectation of ε∗ (simulated data)

71 [email protected]

Page 72: Limited DEpendent Variable

6.2 The Heckman selection model

Figure 6.4: Conditional expectation of ε (simulated data, z = 0)

Table 6.1 shows three sets of estimates. Note that the primary object of interest is β1, the effect of education upon log earnings. Column 3 reports the OLS estimate: β1 = 0.059. OLS substantially underestimates the effect of education, for the reasons that we have discussed. Columns 1 and 2 report the Heckman estimates; they respectively show the estimates from maximum likelihood and from the two-step method. The estimates are nearly identical, and both estimate β1 very close to its true value (β1 = 0.1).

72 [email protected]

Page 73: Limited DEpendent Variable

6.2 The Heckman selection model

Table 6.1: Heckman estimates for a simulated selection model

                                       Heckman                      OLS
                         Maximum Likelihood      Two-step
                                 (1)                (2)              (3)
First stage:
  γ0 (true γ0 = −0.5)           -.505              -.509
                              (0.031)***         (0.032)***
  γ1 (true γ1 = 0.1)             0.1                0.1
                              (0.002)***         (0.002)***
  γ2 (true γ2 = −0.25)          -.247              -.245
                              (0.009)***         (0.01)***
Second stage:
  β0 (true β0 = 10.0)           10.041             9.965           10.987
                              (0.055)***         (0.094)***       (0.029)***
  β1 (true β1 = 0.1)             0.098              0.101           0.059
                              (0.003)***         (0.004)***       (0.002)***
Other parameters:
  ρ (true ρ = 0.75)              0.727              0.768
                              (0.024)
  σ (true σ = 1)                 0.983              1.005
                              (0.017)
  ρσ (true ρσ = 0.75)            0.715              0.772
                              (0.034)            (0.067)
Obs.                            10000              10000            10000
Log-likelihood               -11822.11

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

73 [email protected]

Page 74: Limited DEpendent Variable

6.3 An unconvincing alternative (or “How can I find log(0)?”)

6.3 An unconvincing alternative (or “How can I find log(0)?”)

“I’ve got a bit of a problem with my thesis...”

“Go on...”

“Well, I have log earnings on the lefthand side of my estimation.”

“Okay...”

“And lots of people in my data don’t have any earnings.”

“Sure...what’s the problem?”

“Well...how do I find the log of zero?”

In one form or another, almost every applied researcher has had this conversation at some point in his or her life — yet it is a conversation that is rarely addressed directly by econometrics textbooks. Of course, the problem really has nothing to do with 'finding the log of zero'; as we have discussed in this lecture, the problem is that those individuals who do not work are unlikely to be drawn randomly from our population of interest. This is why we needed to discuss 'work ability' and 'work desire', and the consequent Heckman selection model.

But there is another approach that deserves a mention, if only because it is reasonably popular: just ignore the selection problem and instead code the outcome variable as y = log(wage + 1). This function is almost identical to y = log(wage) for individuals who have a positive wage, but now allows us to include y = 0 in our estimations for everyone without a job. For example, we could then estimate the effect of education using a regression of the form log(wagei + 1) = β0 + β1xi + ωi.
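With the simulated variables from earlier, the shortcut amounts to something like the following sketch, shown only to make the problem discussed next concrete:

gen double wage = 0
replace wage = exp(y) if job == 1                 // wage is zero for those without a job
gen double logwage1 = log(wage + 1)
reg logwage1 educ                                 // mixes employment and earnings effects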

This is a superficially appealing approach, because we appear to have resolved the entire selection problem without needing any of the irritations of Heckman's approach (and, indeed, without needing an exclusion restriction). There is just one problem: it doesn't work. Figure 6.5 shows why, using the same simulated example that we have just been examining: because the log(wagei + 1) transformation combines individuals who have a job with those who do not, the regression estimates (shown by the dotted line) conflate two separate effects: (i) the effect of education on whether an individual has a job, and (ii) the effect of education on the individual's potential earnings.24 Selection problems should be confronted directly, rather than sidestepped with arbitrary transformations.25

24 The log(y + 1) transformation produces an estimate β1 = 0.430, whereas the true value used for simulation was β1 = 0.1.

25 The same could be said of several variations on the theme — including, for example, using y = log(wagei + κ) for some arbitrarily chosen κ > 1, and the 'inverse hyperbolic sine transformation', y = log(wagei + √(wage²i + 1)).

74 [email protected]

Page 75: Limited DEpendent Variable

6.4 Selection in the Tanzanian data

Figure 6.5: Scatter plot of log(wage + 1) against years of education (with jitter)

6.4 Selection in the Tanzanian data

Table 6.2 shows results from a Heckman two-step estimation on the Tanzanian data. For the excluded variable, zi, we use the number of babies in the household (where a 'baby' is defined as a child aged 0 or 1). The estimations are run separately for men (N = 4953) and women (N = 5047). Compared to the OLS estimations, the Heckman correction slightly reduces the estimated return to education for men, and slightly increases the estimate for women. This suggests that there may be a positive correlation between 'work desire' and 'work ability' for women, but a negative correlation for men (note the sign of ρσ for the two Heckman estimations). However, note that neither selection effect is significant; i.e. we cannot rule out, for either group, that selection into employment is 'as if random', so that the OLS estimates are consistent.26

Of course, these conclusions depend upon our model being correct — and this includes the exclusion restriction. I leave you to consider how reasonable these assumptions are.

26 Notice, however, that the standard errors on the Heckman estimates are much larger than the standard errors on OLS.

75 [email protected]

Page 76: Limited DEpendent Variable

6.5 Flexible extensions to the Heckman model

Table 6.2: Income and education in Tanzania: Heckman estimates

                                       Men                          Women
                             OLS          Heckman          OLS          Heckman
                             (1)            (2)            (3)            (4)
First stage:
  Const.                                   -.799                          -.967
                                         (0.04)***                      (0.034)***
  Years of education                       0.127                          0.097
                                         (0.006)***                     (0.005)***
  Babies in the household                  -.162                          -.252
                                         (0.036)***                     (0.042)***
Second stage:
  Const.                   10.639         10.835         10.570          9.769
                          (0.064)***     (0.692)***     (0.063)***      (0.649)***
  Years of education        0.112          0.098          0.089          0.126
                          (0.009)**      (0.043)**      (0.009)**       (0.031)***
Other:
  ρσ                                       -.178                          0.524
                                         (0.518)                        (0.423)
Obs.                        2240           4953           1420           5047

Confidence: *** ↔ 99%, ** ↔ 95%, * ↔ 90%.

6.5 Flexible extensions to the Heckman model

We have considered two versions of the Heckman selection model. One version — the maximum likelihood approach — depends upon a strong assumption about the joint distribution of µi and ε∗i . The other version — the two-step approach — is less efficient, but relies only upon the assumption that the conditional expectation of ε∗i is linear in µi. But this assumption can be relaxed further. We can generalise equation 6.12 by writing that conditional expectation as a flexible function of µi:

E (ε∗i |µi) = f(µi). (6.27)

This implies that, instead of equation 6.18, we can write our second stage as a flexible function of the Inverse Mills Ratio:

\[
y_i = \beta_0 + \beta_1 x_i + g\!\left( \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} \right) + \eta_i. \tag{6.28}
\]

76 [email protected]

Page 77: Limited DEpendent Variable

6.6 Appendix to Lecture 6: Stata code

Deaton (1997, page 105) notes that the function g can be estimated parametrically — for example, by using a polynomial specification, such as:

\[
\begin{aligned}
y_i = \beta_0 + \beta_1 x_i &+ \delta_1 \cdot \left( \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} \right) + \delta_2 \cdot \left( \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} \right)^2 \\
&+ \delta_3 \cdot \left( \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} \right)^3 + \delta_4 \cdot \left( \frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} \right)^4 + \eta_i.
\end{aligned}
\tag{6.29}
\]

This specification nests the two-step estimator of equation 6.18 for the special case that δ1 = ρσ and δ2 = δ3 = δ4 = 0. Alternatively, we could estimate equation 6.28 semiparametrically, as a 'partial linear model' — as in, for example, Newey, Powell and Walker (1990).
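A parametric version of equation 6.29 can be implemented by reusing the estimated Inverse Mills Ratio (imr) constructed in the two-step sketch in section 6.2.3. This is a minimal sketch:

gen double imr2 = imr^2
gen double imr3 = imr^3
gen double imr4 = imr^4
reg y educ imr imr2 imr3 imr4 if job == 1
test imr2 imr3 imr4        // the Heckit special case corresponds to delta2 = delta3 = delta4 = 0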

6.6 Appendix to Lecture 6: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

Having cleared the memory and loaded our data, we can implement the Heckman model using the command 'heckman'; we choose the two-step estimator by adding the option 'twostep'. We can therefore run the following:

reg logincome educ if sex == "male"

heckman logincome educ HHminors if sex == "male", ///
    twostep select(educ HHbabies)

reg logincome educ if sex == "female"

heckman logincome educ if sex == "female", ///
    twostep select(educ HHbabies)

77 [email protected]