Issues in Estimation Data Generating Process: What behavior and what sampling process generated data...

Issues in Estimation

Data Generating Process:

What behavior and what sampling process generated data that you have collected?

Estimation

• Are you gathering a random sample of all possible participants (e.g. telephone or mail survey of population)?

• Or, are you sampling on site?

1. Censored Samples

If you sample a population of potential participants, you will find that some took trips to the site of interest and some (many?) took no trips.

Plot of trip cost against number of trips for all observations in a hypothetical sample.

[Figure: scatter of observations; axes are number of trips and trip cost. Non-participants lie along the line at 0 trips.]

Here’s a hypothetical data set and actual least squares regression lines

[Figure: trips (0 to 10) against cost (10 to 40), with two fitted lines: the least squares line including zeros and the least squares line excluding zeros.]

Which, if either, is right?

Answer: Neither
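A small simulation can make the point concrete. The sketch below is only an illustration under assumed values: the "true" parameters, the cost range, and the sample size are hypothetical, and trips are treated as a continuous variable censored at zero rather than as integers. It fits least squares with and without the zeros and compares both slopes to the true slope.

```python
import random

def ols(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(0)
TRUE_B0, TRUE_B1, SIGMA = 8.0, -0.4, 3.0   # hypothetical "true" values

costs, trips = [], []
for _ in range(5000):
    c = random.uniform(10, 40)
    latent = TRUE_B0 + TRUE_B1 * c + random.gauss(0, SIGMA)
    costs.append(c)
    trips.append(max(latent, 0.0))          # censoring at zero trips

# Least squares on everyone, zeros included
_, slope_with_zeros = ols(costs, trips)

# Least squares on trip takers only
takers = [(c, z) for c, z in zip(costs, trips) if z > 0]
_, slope_without_zeros = ols([c for c, _ in takers], [z for _, z in takers])

print("true slope:", TRUE_B1)
print("OLS including zeros:", round(slope_with_zeros, 3))
print("OLS excluding zeros:", round(slope_without_zeros, 3))
```

In this setup both fitted slopes should come out closer to zero than the true slope, which is the sense in which neither line is right.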

Censored Samples – empirical models to analyze them:

• Tobit model – assumes an underlying latent variable that could be negative

• Count models – recognize that trips are non-negative integers

• Sample selection models – model the participation decision differently from the trips decision

Tobit Model

Underlying model

Latent variable:

$$z_i^* = \beta_{0i} + \beta_1 c_i + \varepsilon_i$$

But,

$z_i = z_i^*$, if $z_i^* > 0$

$z_i = 0$, if $z_i^* \le 0$

(To cut down on notation, $\beta_{0i}$ stands for the intercept and all other covariates that might be in the model, so it varies over individuals.)

Estimation by Maximum Likelihood

Every observation makes a contribution to the likelihood function.

Contribution by non-trip takers:

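$$\Pr(z_i^* \le 0) = \Pr\Big[\sum_k \beta_k x_{ik} + \varepsilon_i \le 0\Big] = \Pr\Big[\varepsilon_i \le -\sum_k \beta_k x_{ik}\Big] = F\Big(-\sum_k \beta_k x_{ik}\Big)$$

where F is the cumulative distribution function for $\varepsilon$; the x’s are the explanatory variables in the model, including cost of access.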

Contribution by trip takers:

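$$\Pr\Big[\varepsilon_i = z_i - \sum_k \beta_k x_{ik} \,\Big|\, z_i^* > 0\Big]\,\Pr(z_i^* > 0) = \frac{f\big(z_i - \sum_k \beta_k x_{ik}\big)}{\Pr\big[\sum_k \beta_k x_{ik} + \varepsilon_i > 0\big]}\,\Pr\Big[\sum_k \beta_k x_{ik} + \varepsilon_i > 0\Big] = f\Big(z_i - \sum_k \beta_k x_{ik}\Big)$$

where f is the density function for $\varepsilon$.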

Note: this is the same expression as for ordinary least squares.

Tobit – maximize the following likelihood function

Likelihood function equals:

$$L = \prod_{i \in T} f\Big(z_i - \sum_k \beta_k x_{ik}\Big)\;\prod_{i \in N} F\Big(-\sum_k \beta_k x_{ik}\Big)$$

where T is the set of trip takers and N is the set of non-trip takers
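The likelihood above can be written out in a few lines. This is a minimal sketch, assuming the error is normal with standard deviation `sigma`, so that f and F become the normal density and distribution function. The function name `tobit_loglik` and the argument layout are illustrative choices, not LIMDEP's routine. In practice the log-likelihood would be handed to a numerical optimizer.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal: provides pdf() and cdf()

def tobit_loglik(beta, sigma, X, z):
    """Log-likelihood of a Tobit model censored at zero (normal errors assumed).

    beta  : list of coefficients, one per column of X
    sigma : standard deviation of the error, must be > 0
    X     : list of covariate rows, e.g. [1.0, cost]
    z     : list of observed trips (0 for non-trip takers)
    """
    total = 0.0
    for row, zi in zip(X, z):
        xb = sum(b * x for b, x in zip(beta, row))
        if zi > 0:
            # Trip taker (set T): density f(z_i - xb) of a N(0, sigma^2) error
            total += math.log(STD_NORMAL.pdf((zi - xb) / sigma) / sigma)
        else:
            # Non-trip taker (set N): F(-xb) = Phi(-xb / sigma)
            total += math.log(STD_NORMAL.cdf(-xb / sigma))
    return total

# Hypothetical three-person sample: columns are [intercept, cost]
X = [[1.0, 10.0], [1.0, 20.0], [1.0, 35.0]]
z = [6.0, 2.0, 0.0]
print(tobit_loglik([12.0, -0.6], 2.0, X, z))
```

The sketch takes logs directly, so it would fail if a probability underflowed to zero for an extreme observation; a production routine would guard against that.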

For our simple example:

Ordinary Least Squares estimates:

$\beta_0 = 8.89$

$\beta_1 = -.28$

$\sigma^2 = 2.4$

Tobit estimates:

$\beta_0 = 13.81$

$\beta_1 = -.72$

$\sigma^2 = 2.3$

[Figure: trips (0 to 10) against cost (10 to 40), showing the fitted OLS and Tobit lines.]

How do we get welfare measures in the Tobit?

The Tobit is usually estimated in linear form.

The area behind a linear demand function is given by:

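$$CS_i = \int_{c_i^0}^{\tilde{c}_i} (\beta_{0i} + \beta_1 c)\,dc = -\frac{z_i^2}{2\beta_1}$$

where $c_i^0$ is the individual's current trip cost and $\tilde{c}_i$ is the cost at which demand falls to zero.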

But how do you evaluate this expression? What do you use for $z_i$?

Do you use the individual’s actual number of trips?

Or do you use the predicted number of trips using the model?

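[Figure: estimated demand function in (trips, cost) space with slope $1/\hat{\beta}_1$, marking the individual's cost $c_i$, actual trips $z_i$, and predicted trips $\hat{z}_i$.]

Use $\hat{\beta}_1$ as the estimate for $\beta_1$.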

If you want to use the predicted number of trips...

You must calculate the expected value of trips in the Tobit framework – which is a somewhat complicated expression.

Fortunately, LIMDEP* will do this for you in a simple command.

*LIMDEP is a software package by William Greene, Columbia University

You should know that expected trips will always be positive in the Tobit.
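For reference, under normal errors the expected value of censored trips has a standard closed form, $E(z_i) = \Phi(a)\,x\beta + \sigma\,\phi(a)$ with $a = x\beta/\sigma$. The sketch below assumes that normal-error setup; the function name is illustrative, and the inputs in the example are hypothetical numbers rather than estimates from the slides.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal pdf and cdf

def tobit_expected_trips(xb, sigma):
    """E[z] for z = max(0, xb + e) with e ~ N(0, sigma^2)."""
    a = xb / sigma
    return STD_NORMAL.cdf(a) * xb + sigma * STD_NORMAL.pdf(a)

# Even when the latent index xb is negative, expected trips stay positive
for xb in (-5.0, 0.0, 5.0):
    print(xb, round(tobit_expected_trips(xb, 2.0), 4))
```

When `xb` is large relative to `sigma`, the expected value is close to `xb` itself; when `xb` is very negative it approaches zero from above.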

The answers can be quite different…

but the choice is not obvious.

In our simple example, the difference isn’t great.

                          Using Actual z    Using Predicted z
Ave. trips                     2.53               2.55
Ave. consumer surplus        $15.32             $13.93
Total CS for sample         $459.60            $417.90

Difference in average consumer surplus is due to nonlinearity of consumer surplus in trips.
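The effect of that nonlinearity can be seen in a toy calculation. The sketch below uses the linear-demand surplus formula $-z_i^2 / (2\beta_1)$ with the Tobit slope from the example ($-.72$); the two lists of trips are made up for illustration and share roughly the same mean, but the actual trips are more spread out than the predicted ones.

```python
def cs_linear(z, beta1):
    """Consumer surplus behind a linear demand curve: -z^2 / (2 * beta1), beta1 < 0."""
    return -z ** 2 / (2.0 * beta1)

BETA1 = -0.72                            # Tobit slope from the simple example
actual = [0, 0, 1, 3, 8]                 # hypothetical actual trips
predicted = [1.5, 2.0, 2.4, 2.8, 3.3]    # hypothetical predicted trips, similar mean

avg_cs_actual = sum(cs_linear(z, BETA1) for z in actual) / len(actual)
avg_cs_predicted = sum(cs_linear(z, BETA1) for z in predicted) / len(predicted)

print(round(avg_cs_actual, 2), round(avg_cs_predicted, 2))
```

Because surplus depends on trips squared, the more dispersed series gives the larger average surplus even though the average number of trips is about the same.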

Reasons for using one rather than another…

Use the expected value of trips, $\hat{z}_i$, if you think the dominant source of “error” is from measurement.

Use the actual number of trips, $z_i$, if you think the dominant source of “error” is from specification.

(Note: in the Tobit, the predicted number of trips is never zero.)

Getting an estimate for the population

If your sample is a random sample of the population:

average CS * population

Count Models

The Tobit assumes an underlying latent variable that can take on negative values.

Count models explicitly account for the fact that the dependent variable, trips, can only be an integer and can only be non-negative.

Count Models..

…specify that the quantity demanded of trips is a non-negative random variable whose mean is a function of the exogenous regressors in the model.

The Poisson Distribution is a common choice

Poisson distribution:

$$\Pr(z_i = n) = \frac{e^{-\lambda_i}\,\lambda_i^{\,n}}{n!}$$

where the mean is $\lambda_i$, and it is usually modeled as:

$$\lambda_i = \exp\Big(\sum_k \beta_k x_{ik}\Big)$$

Intuition?

The Poisson model implies that the number of trips a person decides to take is a random variable drawn from a distribution that only allows non-negative integers.

The distribution can be centered around different non-negative numbers, however, depending on the exogenous variables the individual faces.

E.g. A person with a relatively low access cost will face a distribution with a higher mean number of trips.

An individual’s contribution to the likelihood function in the Poisson is this very complicated looking expression:

$$\frac{\exp\big(-\exp(\sum_k \beta_k x_{ik})\big)\,\big(\exp(\sum_k \beta_k x_{ik})\big)^{z_i}}{z_i!}$$

(Note: 0! is defined mathematically as 1.)

Fortunately, LIMDEP will estimate this for you without any hard work on your part.
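In log form the Poisson contribution is less intimidating: it is $-\lambda_i + z_i \sum_k \beta_k x_{ik} - \log(z_i!)$. The sketch below computes it with the standard library; the function name, coefficients, and covariate values are hypothetical, chosen only to show the calculation.

```python
import math

def poisson_log_contribution(beta, x_row, z):
    """log of exp(-lam) * lam**z / z!, with lam = exp(sum_k beta_k * x_k)."""
    xb = sum(b * x for b, x in zip(beta, x_row))
    lam = math.exp(xb)
    return -lam + z * xb - math.lgamma(z + 1)   # lgamma(z + 1) equals log(z!)

beta = [1.5, -0.05]     # hypothetical intercept and cost coefficient
x_row = [1.0, 10.0]     # intercept term and a trip cost of 10
for z in range(4):
    print(z, round(math.exp(poisson_log_contribution(beta, x_row, z)), 4))
```

Summing these log contributions over individuals gives the log-likelihood that the estimation routine maximizes.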

Getting Welfare Measures in the Poisson

The expected number of trips for an individual is the mean of the Poisson distribution for that individual. The mean is $\lambda_i$ in the above expression and is usually specified as a semi-log function of the explanatory variables:

$$E(z_i) = \lambda_i = \exp(\beta_{0i} + \beta_1 c_i)$$

We saw earlier that…

the area under a semi-log demand function is given by:

$$CS_i = \int_{c_i^0}^{\infty} \exp(\beta_{0i} + \beta_1 c)\,dc = -\frac{z_i}{\beta_1}$$

Because CS is linear in trips for a semi-log function, it does not matter whether you use actual or expected trips. The answer is the same.

Welfare measures in our simple hypothetical case

                          Using Actual z    Using Predicted z
Ave. trips                     2.53               2.53
Ave. consumer surplus        $14.90             $14.90
Total CS for sample         $447.00            $447.00

The Poisson has the property that the mean of expected trips = mean of actual trips.

The formula for consumer surplus in a semi-log function is linear in trips.

THEREFORE, it does not matter in this model whether you use expected or actual trips.
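A quick numerical check of this argument, as a sketch: the semi-log surplus is $-z_i/\beta_1$, so averaging it over any two series with the same mean gives the same answer. The slope and the two trip series below are hypothetical.

```python
def cs_semilog(z, beta1):
    """Consumer surplus under a semi-log demand curve: -z / beta1, beta1 < 0."""
    return -z / beta1

BETA1 = -0.15                            # hypothetical cost coefficient
actual = [0, 0, 1, 3, 8]                 # hypothetical actual trips
expected = [1.5, 2.0, 2.4, 2.8, 3.3]     # hypothetical expected trips, same mean

avg_cs_actual = sum(cs_semilog(z, BETA1) for z in actual) / len(actual)
avg_cs_expected = sum(cs_semilog(z, BETA1) for z in expected) / len(expected)

print(round(avg_cs_actual, 2), round(avg_cs_expected, 2))
```

Both averages equal the mean number of trips divided by $-\beta_1$, so the two columns of the table must agree once the means agree.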

Another Popular Count Model

The negative binomial distribution is also used often. It is a more general distribution than the Poisson, in that it does not constrain the mean and the variance to be equal.

See LIMDEP if you wish to estimate this model.

Participation vs Demand for Trips

In the above models, the same process determines both how many trips a user takes and whether or not he is a user.

Suppose different factors affected

– whether he used the site

– how many times he used the site, if he did use the site

Two types of models (see LIMDEP):

– Combination of probit and truncated models (e.g. Cragg)

– Selection models (e.g. Heckman)

2. Truncated Samples

Now suppose you have only collected data from people who actually visit the site.

There will be no zeros in this dataset.

Do you still need to make econometric adjustments?

The answer is “YES”

Ordinary least squares assumes that every observation is drawn from a normal distribution with a given variance.

Let’s look at data again…

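Remember the model is:

$$z_i = \beta_{0i} + \beta_1 c_i + \varepsilon_i$$

OLS assumes that

$$\varepsilon_i \sim N(0, \sigma^2)$$

[Figure: trip cost against number of trips for site visitors only, with no observations at 0 trips. The error distribution is truncated for observations near the axis. One line shows the relationship you want; the other shows the result of running an OLS regression.]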

OLS applied to truncated data produces biased slope estimates if truncation is “relevant”.

The bias will generate a larger negative estimate for the slope of the line in the graph, which is really a smaller negative estimate for $\beta_1$.

Since $-\beta_1$ is in the denominator of the consumer surplus formula, the result will be an over-estimate of consumer surplus.

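Contribution to the Likelihood Function in the Truncated Model

$$\Pr(\text{trips} = z_i \mid \text{trips} > 0) = \frac{f\big(z_i - \sum_k \beta_k x_{ik}\big)}{\Pr\big[\sum_k \beta_k x_{ik} + \varepsilon_i > 0\big]}$$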
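As a sketch of this contribution under normal errors (an assumption; the slides leave f general), the numerator is the normal density of the residual and the denominator is $\Phi(x\beta/\sigma)$. The function name and the numbers in the example are illustrative only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal pdf and cdf

def truncated_log_contribution(xb, sigma, z):
    """log[ f(z - xb) / Pr(xb + e > 0) ] for e ~ N(0, sigma^2) and z > 0."""
    log_density = math.log(STD_NORMAL.pdf((z - xb) / sigma) / sigma)
    log_prob_positive = math.log(STD_NORMAL.cdf(xb / sigma))   # Pr(e > -xb)
    return log_density - log_prob_positive

# Hypothetical visitor: index xb = 1.0, sigma = 2.0, observed 3 trips
print(truncated_log_contribution(1.0, 2.0, 3.0))
```

Dividing by the probability of being a visitor rescales the density so that it integrates to one over positive trips, which is what distinguishes this from the OLS contribution.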

[Figure: trips (2 to 12) against cost (5 to 20), showing the OLS regression line and the truncated regression line.]

The difference between the OLS and Truncated estimated relationship for our simple hypothetical data

Oh no, another problem!

The reason you have only non-zero observations for trips is probably because you sampled on site.

On-site sampling is often the only practical way to get enough information on users of a site.

But this, too, causes problems!

If you randomly sample on-site, you are actually randomly sampling trips instead of trip-takers.

This is not a random sample of users of the site.

The problem is called “endogenous stratification”.

A simple example...

Suppose there are only two types of users:

25 users take 1 trip to site

75 users take 2 trips to site

Total number of trips taken = 175.

Average number of trips taken = 1.75.

Now, suppose you randomly sample trips (not users).

Prob. of encountering a 1-trip user = 25/175 = .14 (rather than .25)

Prob. of encountering a 2-trip user = 150/175 = .86 (rather than .75), since the 75 two-trip users account for 150 of the 175 trips

A solution to endogenous stratification is to weight each observation by 1/trips.
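The two-type example can be worked through in code to see what the 1/trips weights do. This is a sketch using only the numbers from the example; the variable names are illustrative.

```python
users = {1: 25, 2: 75}   # trips per user -> number of users (from the example)

total_users = sum(users.values())                      # 100 users
total_trips = sum(z * n for z, n in users.items())     # 175 trips

# Share of each user type in the population vs. in an on-site sample of trips
population_share = {z: n / total_users for z, n in users.items()}
onsite_share = {z: z * n / total_trips for z, n in users.items()}

# Mean trips from the on-site sample, unweighted and weighted by 1/trips
naive_mean = sum(z * p for z, p in onsite_share.items())
weights = {z: p / z for z, p in onsite_share.items()}
weighted_mean = sum(z * w for z, w in weights.items()) / sum(weights.values())

print(population_share, onsite_share)
print(round(naive_mean, 3), round(weighted_mean, 3))
```

The unweighted on-site mean overstates trips per user, while the weighted mean returns the population value of 1.75.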

Parameter estimates for our little sample:

                   $\beta_0$    $\beta_1$
OLS                 12.85      -0.59
OLS weighted         9.99      -0.44
Truncated*          14.05      -0.74
Trun weighted*      13.76      -0.84

*Note: for many problems the truncated model does not converge in estimation.

A Better and Easier Alternative

Poisson Count Model:

– Easy to estimate with truncation.

– Easy to estimate with truncation and endogenous stratification.

“It turns out that”…..

You can solve both the truncation and the endogenous stratification problem by:

estimating the regular Poisson with the value $z_i - 1$ substituted for $z_i$ in estimation
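In terms of the likelihood, this means each on-site observation contributes the ordinary Poisson probability evaluated at one trip fewer than observed. The sketch below shows that contribution in log form. The function name is illustrative, and the cost value of 15 is a made-up input; the coefficients are the Constant and COST estimates reported for this model, used here only as example numbers.

```python
import math

def onsite_poisson_log_contribution(beta, x_row, z):
    """Regular Poisson log-probability evaluated at z - 1 (requires z >= 1)."""
    xb = sum(b * x for b, x in zip(beta, x_row))
    lam = math.exp(xb)
    w = z - 1                                   # substitute z - 1 for z
    return -lam + w * xb - math.lgamma(w + 1)   # lgamma(w + 1) equals log(w!)

beta = [2.882, -0.131]   # Constant and COST coefficients
x_row = [1.0, 15.0]      # intercept term and a hypothetical trip cost of 15
for z in range(1, 5):
    print(z, round(math.exp(onsite_poisson_log_contribution(beta, x_row, z)), 4))
```

The probabilities sum to one over z = 1, 2, 3, ..., so the adjusted model puts no weight on zero trips, as an on-site sample requires.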

Poisson Endogenous Stratification Results and Welfare Estimates

             Coeff.    Std.Err.   t-ratio    P-value
Constant      2.882     0.244      11.8      2.89E-15
COST         -0.131     0.027      -4.78     1.72E-06

*Note: Remember that this is basically a semi-log demand function, so the parameters are not directly comparable to the parameters in the previous models.

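Welfare Calculation

Average WTP estimate for elimination of site:

$$\text{WTP} = \frac{\bar{z}}{-\beta_c} = \frac{3.16}{.13} = \$24.30$$

where $\beta_c$ is the COST coefficient.

Note: $\bar{z}$ must also be adjusted for endogenous stratification.

Mean number of trips:

$$\bar{z} = \frac{N}{\sum_{n=1}^{N} \dfrac{1}{z_n}}$$

N = number of individuals sampled

$z_n$ = number of trips taken by individual n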
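The two steps of the welfare calculation can be sketched as follows. The function names are illustrative; the seven-person on-site sample is hypothetical (one 1-trip user and six 2-trip users, mirroring the earlier two-type example), while 3.16 and .13 are the values used in the calculation above.

```python
def adjusted_mean_trips(trips):
    """N / sum(1 / z_n): mean trips corrected for on-site sampling (all z_n >= 1)."""
    return len(trips) / sum(1.0 / z for z in trips)

def wtp_loss_of_site(mean_trips, beta_cost):
    """Average WTP under semi-log demand: mean_trips / (-beta_cost), beta_cost < 0."""
    return mean_trips / -beta_cost

# Hypothetical on-site sample: 2-trip users are over-represented
sample = [1, 2, 2, 2, 2, 2, 2]
print(adjusted_mean_trips(sample))               # 1.75, versus a raw mean of about 1.86

# Reproducing the slide's calculation
print(round(wtp_loss_of_site(3.16, -0.13), 2))   # 24.31, reported as $24.30
```

The adjusted mean is the harmonic mean of observed trips, which is what weighting each observation by 1/trips produces.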
