Sample Selection
Walter Sosa-Escudero
Econ 507. Econometric Analysis. Spring 2012
April 23, 2012
Walter Sosa-Escudero Sample Selection
Preliminaries 1) Truncated normal distribution
X f(x), X|X < a: X truncated in a. Then
f(x|X < a) = f(x)Pr(X < a)
If X N(, 2), and recalling that
Pr(X < a) = Pr(
1/(X ) < 1/(a ))
= Pr(z < ),
(a )/, z (x )/.
f(x|X < a) =1
12pi
exp[12(x
)2]()
=(1/)(z)
()
Walter Sosa-Escudero Sample Selection
Result (no proof): if X N(, 2), then:
E(X|X < a) = ()()
Truncated to the right: expected value moves to the left(general).
How much? Depends on and 2
(z) (z)/(z) is known as the inverse Mills ratio
Walter Sosa-Escudero Sample Selection
Preliminaries 2) Latent variables and probits
Recall the probit model:
Pr(y = 1|x) = (x)
can be consistently estimated by MLE based on a randomsample (yi, xi), i = 1, . . . , n.
Consider the regression model
y = x + u, u N(0, 2)y not directly observable (a latent variable) but, instead, weobserve y = 1[y > 0].
Which parameters of the regression models can be estimatedconsistently with this information?
Walter Sosa-Escudero Sample Selection
Note that:
P (y = 1|x) = P (y > 0|x)= P (u > x|x)= P (u < x|x)= P (u/ < x/ | x)= (x)
This is a probit model with /.
Based on (yi, xi), we can estimate consistenly by MLE, evenwhen we cannot estimate and 2 separately.
Walter Sosa-Escudero Sample Selection
2 and are not identified with the sample (yi, xi). Forexample = 10 and 2 = 2 are observationally equivalent tothe case = 5 y 2 = 1. / is identified.
Walter Sosa-Escudero Sample Selection
Sample selectivity
Consider the regression model
yi = xi + ui
si is a selectivity variable: si = 1 observed, 0 if not.
We can think that there is a super sample of size N ofyi , xi, si and that we observe the sub sample y
i , xi, only
when si = 1.
Example: female labor productivity
Walter Sosa-Escudero Sample Selection
OLS under selectivity
With a random sample (yi , xi), consistency relies on:
E(ui|xi) = 0
which implies E(yi|xi) = xi.
Now we do not have a random sample, but one conditioned onsi = 1. Taking conditional expectations:
E(yi|xi, si = 1) = xi + E(ui|xi, si = 1)OLS based on the selected sample will be inconsistent, unlessE(ui|xi, si = 1) = 0.
Walter Sosa-Escudero Sample Selection
Not every selectivity mechanism makes OLS inconsistent.
If u is independent of x, OLS still consistent (why?).
If s = g(x), OLS still consistent.
Examples: wages and education. Males with odd SSN. Maleswith formal education?
Walter Sosa-Escudero Sample Selection
An estimable model under selectivity
Consider the following equations:{y1i = x
1i 1 + u1i (regression)
y2i = x2i 2 + u2i (selectivity)
Let y2i = 1[y2i > 0].
Example: y1i = wage, regression equantion determines wages based onpersons characteristics (x1i). y2i = net utility of work, x2i are observeddeterminantes of utility.
Walter Sosa-Escudero Sample Selection
Assumptions:
1 (y2i, x2i) observed for everyone.
2 (y1i, x1i) observed only if y2i = 1 (selected sample).
3 (u1i, u2i) are independent of x2i and have zero mean.
4 u2i N(0, 22).5 E(u1i|u2i) = u2. No-observables can be related.
Walter Sosa-Escudero Sample Selection
Selectivity bias
Here si y2i.
E(y1i|x1i, y2i = 1) = x1i1 + E(u1i | x1i, y2i = 1)= x1i1 + E
[E(u1i|u2i) | x1i, y2i = 1
]= x1i1 + E
[u2i | x1i, y2i = 1
]= x1i1 + E
[u2i | x1i, y2i > 0
]= x1i1 + E
[u2i | x1i , u2i < x2i2
]= x1i1 2 (x2i2/2)= x1i1 2 zi 6= x1i1
with zi (x2i2/2). OLS with the selected sample isinconsistent.
Walter Sosa-Escudero Sample Selection
E(y1i|x1i, y2i = 1) = x1i1 2 zi 6= x1i1
Inconsistency: omission of zi. Heckman (1979): selectivitybias as misspecification.
Inconsistency due to the correlation between u1i y u2i, , thatis 6= 0.
Walter Sosa-Escudero Sample Selection
Heckmans two-step estimator
Define u1i y1i x1i1 zi, 2. Solving:
y1i = x1i +
z + u1i
where, by construction E(u1i|x1i, y2i = 1) = 0.x1i, zi observable when y2i = 1: OLS of y1i on x1i and ziusing the selected sample gives consistent esimates of 1 and.Problem: zi (x2i2/2) is NOT observable, it depends on2 and 2.
Walter Sosa-Escudero Sample Selection
Note that u2i N(0, 22), hence:
P (y2i = 1) = P (y2i > 0) = P (u2i/2 < x
2i2/2) = (x
2i)
P (y2i = 1) is a probit model with unknown coefficients .
x2i and y2i are observed for everybody: can be estimatedusing probit.
Important: we cannot identify 2i and 2 separately but = 2i/2i.
Walter Sosa-Escudero Sample Selection
This suggests the following two-stage method:
Stage 1: Obtain estimates based on the probit modelP (y2i = 1) = (x
2i) using the full sample. Predict zi using
zi = (x2i).
Stage 2: Regress y2i on x1i and zi using the selected sample.This produces consistent estimates of 1 and
.
Walter Sosa-Escudero Sample Selection
Practical remarks
The method is consistent and asymptotically normal (methodof moments estimator). Standard inference works fine.
Careful with asympototic variance. The second stage isheteroskedastic. Requires correction. See Greene (Ch. 20).
A test of H0 : = 0 may be used to check sample selectivity.Under H0, the regression model with the selected sample ishomoskedastic, test can be based in a model without takingcare of heteroskedasticity.
Classic issue: low power when x1 is very similar to x2.
MLE? Requires bivariate normality. Complicated likelihood,rarely used.
Walter Sosa-Escudero Sample Selection