Econometrics using STATA : Part 2


Benjamin Monnery, EconomiX, Univ Paris Nanterre

M1 Economie du Droit, 2017-2018

FINDING DATA | COVARIATE-ADJUSTMENT | MATCHING

CONTENT OF PART 2

When an RCT is not an option, the only alternative is to use observational (real-life) data

1. How to retrieve data?

• public sources (data.gouv), data repositories, journal archives
• how to clean/manipulate data sets in Stata?

2. How to fix selection bias?

• when there is only selection on observables (part 2), i.e. easy problems where you know all the determinants of assignment that are correlated with Y
  Methods: stratification, covariate-adjustment and matching

• when there is also selection on unobservables (part 3)
  Methods: IV, panel, DID, RDD...



EXAMPLE OF SELECTION ON OBSERVABLES

What’s the effect of lawyers on judicial outcomes, e.g. Pr(conviction)?

Among defendants, having a lawyer is “as good as random” conditional on ...

• strength of the case (evidence)
• wealth, ...

⇒ Among these determinants of treatment, strength of the case surely correlates with Pr(conviction)

What about wealth? (depends on the judicial system)

Assumption: there is selection on observables (only) if

$$E[Y_i^1 \mid T = 1, X] = E[Y_i^1 \mid T = 0, X]$$

$$E[Y_i^0 \mid T = 1, X] = E[Y_i^0 \mid T = 0, X]$$

Potential outcomes are the same on average for treated and untreated individuals with the same X



Finding Data



Access to data is necessary to answer questions
> know the key sources, be able to manipulate their data

Access to novel data is (almost) necessary to publish in top scientific journals

• good data + good method + interesting topic = top science
• “competition” for data among researchers
• difficult to teach
> be curious, follow the news, learn to code



DATA GOUV

also look at INSEE, ministries’ websites...



HARVARD DATAVERSE AND JOURNAL ARCHIVES

Many top scientific journals (like the AER) now require online publication of datasets

https://www.aeaweb.org/articles?id=10.1257/aer.20161503




CIVIL SOCIETY INITIATIVES

We will use some of their data later in the course (Diff-in-Diff)



Covariate-adjustment



INTUITION

We want to estimate a causal treatment effect by comparing the observed outcomes of treated and untreated people

If we think we know all the determinants X of treatment assignment T that also relate to Y (selection on observables), we can simply compare treated and untreated outcomes conditional on X

How to “condition on X”?

1. statistically control for X in a regression model (covariate adjustment): estimate $Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_i + \varepsilon_i$
2. use matching (e.g. propensity score matching)
3. use stratification (subclassification): compute differences within small groups (strata/cells) of X

⇒ Covariate-adjustment is the regression analog to stratification



In a problem of selection on observables, we want to compare treated and untreated within subgroups with similar potential outcomes

Ex: what’s the effect of lawyers on defendants’ probability of conviction?

⇒ True answer? Probably a reduction of Pr(conviction)

⇒ Problem (selection bias): the propensity to hire a lawyer and the probability of conviction are both related to the strength of the evidence against the defendant

• if the court has strong evidence against the defendant, he is more likely to hire a lawyer to help him
• however, he is also more likely to be convicted eventually

⇒ hence a risk of selection bias due to differences in strength of evidence

If you can measure strength of evidence, selection bias can be “easily” eliminated by stratification, covariate-adjustment or matching



STRATIFICATION

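Tab 1. Sample of Defendants (X = strength of evidence, T = has a lawyer)

X / T     Yes    No   All
Strong     40    10    70
Weak       10    20    30
All        50    50   100

Tab 2. Numbers Convicted

X / T     Yes    No   All
Strong     30    10    40
Weak        5    15    20
All        35    25    60

Note: as printed, the Strong row total (70) and the No column total (50) of Tab 1 do not equal the sum of their cells (40 + 10 = 50 and 10 + 20 = 30).

Stata: tab X T (Tab 1) and tab X T if Convicted==1 (Tab 2)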

• Naive estimator: compare rates of conviction between Yes & No
  Treated: 35/50 = 70%   Untreated: 25/50 = 50%

• Naive answer: detrimental “effect” of lawyers of +20 percentage points!

⇒ But strength of evidence is related to both Lawyers and Convictions: selection bias

Better estimator: stratify by (condition on) strength of evidence





• Among Strong cases
  Treated: 30/40 = 75%   Untreated: 10/10 = 100%
  Treatment effect: -25 pp

• Among Weak cases
  Treated: 5/10 = 50%   Untreated: 15/20 = 75%
  Treatment effect: -25 pp

⇒ Hence the stratified estimator gives a treatment effect of -25 pp
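The within-stratum comparison can be reproduced in Stata from the cell counts of the two tables. This is a minimal sketch: the variable names (strong, lawyer, convicted, n) are illustrative, and the data are entered as one row per cell with a frequency weight rather than one row per defendant.

```stata
clear
* one row per (evidence, lawyer, outcome) cell; n = number of defendants
* names are hypothetical; counts are the cell counts of Tab 1 and Tab 2
input strong lawyer convicted n
1 1 1 30
1 1 0 10
1 0 1 10
0 1 1  5
0 1 0  5
0 0 1 15
0 0 0  5
end

* conviction rates by lawyer status, separately for strong and weak cases
bysort strong: tabulate lawyer convicted [fweight = n], row nofreq

* the same comparison as a regression within each stratum
bysort strong: regress convicted lawyer [fweight = n]

* pooled regression controlling for the stratum (covariate adjustment)
regress convicted lawyer i.strong [fweight = n]
```

In each stratum the coefficient on lawyer is the difference in conviction rates (0.75 - 1.00 and 0.50 - 0.75), so both regressions return -0.25, and the pooled regression with the stratum dummy returns the same value because the two stratum effects are identical.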



STRATIFICATION VERSUS REGRESSIONS

Stratification solves problems of selection on observables

However, in practice it is only appropriate in the simplest situations:

• with few variables affecting T and Y
• which are all categorical
• e.g. 1 dummy (strong/weak), 2 dummies (+ rich/poor), ...

In real life, assignment often depends on a large number of non-dichotomous variables, so the sample would have to be stratified into a very large number of groups (cells/strata)
⇒ problem known as the curse of dimensionality




Problem 1 with stratification : the curse of dimensionality

Assume we want to condition on (stratify by) k dummy variables: the number of different groups will be $2^k$

With k = 10, we have $2^{10} = 1024$ group-specific treatment effects to compare and average ($2^{11} = 2048$, $3^{10} = 59049$)

• computation can become long
• many cells will be empty or contain only treated or only untreated observations: can’t compute the group-specific effect
> makes the estimated effect less general (i.e. local) as some observations are left out
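A short Stata sketch makes the problem concrete. The data are simulated and purely hypothetical (500 observations, 10 independent dummies d1 to d10, a random treatment t); the point is only to count how many of the possible cells are usable.

```stata
* number of strata with 10 or 11 dummies, or 10 three-valued covariates
display 2^10
display 2^11
display 3^10

* simulated illustration (hypothetical data)
clear
set seed 2018
set obs 500
forvalues k = 1/10 {
    generate d`k' = runiform() < 0.5
}
generate t = runiform() < 0.5

* one identifier per non-empty cell of (d1, ..., d10)
egen cell = group(d1-d10)
bysort cell: egen n_treated = total(t)
bysort cell: generate n_cell = _N
egen tag = tag(cell)

* number of non-empty cells (at most 500 out of 1024 possible)
count if tag
* number of cells containing both treated and untreated observations
count if tag & n_treated > 0 & n_treated < n_cell
```

Only the cells counted by the last command allow a within-cell comparison; observations in all other cells are dropped from a stratified estimate.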




Problem 2 with stratification : continuous variables

In real life, many variables are not categorical but continuous

• strong/weak and rich/poor are statistical constructions that simplify the calculations
• the true underlying variables are continuous in nature

⇒ stratification assumes homogeneity within groups

Regressions can easily solve both problems: many X and a mix of categorical and continuous variables



COVARIATE ADJUSTMENT

Goal: conditional on X, treatment should be “as good as random”

Key: control appropriately for the effect of wealth and case strength

• Flexible specification:
  - only a linear effect: $Y_i = \beta_0 + \beta_1 \text{Lawyer}_i + \beta_2 \text{Wealth}_i + \varepsilon_i$
  - or a more flexible form: logarithmic, polynomial ($\text{Wealth}^2$, $\text{Wealth}^3$, ...), by categories/bins, linear + bins...

• Relevant data/variables:
  - Use data on the “best” variables explaining treatment assignment, instead of long-shot proxy variables
  - annual pre-tax income, disposable income, net wealth, gross wealth? Family wealth (to account for possible family support)?
  > a (linear?) combination of several variables, or some index?

Recall: do not condition on potential mediators (e.g. length of trial) as they will capture part of the true causal effect of T on Y
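The alternative specifications listed above could look as follows in Stata. This is a sketch on simulated data: the variable names (convicted, lawyer, wealth, strong) and all coefficients in the data-generating lines are assumptions made for illustration, and the outcome is modelled with a linear probability model.

```stata
clear
set seed 2018
set obs 1000
* simulated data; names and coefficients are hypothetical
generate wealth    = exp(rnormal(10, 1))
generate strong    = runiform() < 0.5
generate lawyer    = (0.5*strong + 0.3*(ln(wealth) - 10) + rnormal() > 0)
generate convicted = (strong - 0.5*lawyer - 0.2*(ln(wealth) - 10) + rnormal() > 0)

* linear control for wealth
regress convicted lawyer wealth i.strong, vce(robust)

* logarithmic control
generate ln_wealth = ln(wealth)
regress convicted lawyer ln_wealth i.strong, vce(robust)

* polynomial (quadratic) control, using factor-variable notation
regress convicted lawyer c.ln_wealth##c.ln_wealth i.strong, vce(robust)

* control by bins (quintiles of wealth)
xtile wealth_q = wealth, nquantiles(5)
regress convicted lawyer i.wealth_q i.strong, vce(robust)
```

Comparing the coefficient on lawyer across the four regressions shows how much the estimate depends on the functional form chosen for wealth.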



ASSUMPTIONS

The key underlying assumptions :

• Conditional independence assumption (CIA, or unconfoundedness)

$$(Y_i^1, Y_i^0) \perp T \mid X$$

CIA is not directly testable (you need to argue why it’s credible)

• Common support (or overlap)

$$\Pr(T = 1 \mid X) \in (0, 1)$$

Common support is easily testable

+ SUTVA (stable unit treatment value assumption)

Then stratification, covariate-adjustment and matching will work



REGRESSION ANATOMY

Under those assumptions, why exactly does covariate-adjustment work, i.e. give a causal effect of T on Y?

⇒ what do multiple regressions do?

We know that a simple regression with OLS: $Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$

... gives $\hat\beta_1 = \dfrac{Cov(Y, X_1)}{Var(X_1)}$

And a multiple regression with OLS: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

... gives $\hat\beta = (X'X)^{-1} X'Y$ ... ?

To understand what it means, let’s turn to the regression anatomy theorem
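The statement of the theorem is not reproduced in this transcript. In its usual form (also known as the Frisch-Waugh-Lovell result), the multiple-regression coefficient on $X_1$ equals $\hat\beta_1 = Cov(Y, \tilde X_1) / Var(\tilde X_1)$, where $\tilde X_1$ is the residual from a regression of $X_1$ on all the other covariates. A quick way to check this in Stata, using the auto dataset shipped with Stata (the choice of variables is arbitrary):

```stata
sysuse auto, clear

* multiple regression: coefficient on mpg, controlling for weight
regress price mpg weight

* step 1: keep the part of mpg that is not explained by weight
regress mpg weight
predict mpg_res, residuals

* step 2: bivariate regression of price on the residualized mpg
regress price mpg_res
```

The coefficient on mpg_res in the last regression equals the coefficient on mpg in the first one (the standard errors differ slightly because of the degrees of freedom). In the lawyer example, this means $\hat\beta_T$ only uses the variation in T that is left once X has been accounted for.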






SENSITIVITY TO CIA

We can estimate how sensitive the results are to potential confounders

Simulation approach:

• Simulate a “fake” variable F that is correlated with both T and Y

• Look at the effect of including this new covariate F on $\hat\beta_T$

• By comparing the $\hat\beta_T$ obtained under different constructions of F (variance-covariance), document the sensitivity of your findings to a violation of CIA

⇒ If $\hat\beta_T$ only disappears under “unrealistic” assumptions (very large correlations between F and T and between F and Y), then the effect is robust to potential selection on unobservables
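One possible way to run such a simulation in Stata is sketched below. Everything here is an assumption made for illustration: the data are simulated, and the fake variable F is built as a linear combination of T and Y plus noise, with a single parameter s governing how strongly it correlates with both. Other constructions of F are equally possible.

```stata
clear
set seed 12345
set obs 1000
* simulated data (hypothetical): x is the observed confounder
generate x = rnormal()
generate t = (0.5*x + rnormal() > 0)
generate y = 1 - 0.5*t + 0.8*x + rnormal()

* baseline covariate-adjusted estimate
quietly regress y t x
scalar b_base = _b[t]

* fake variables F of increasing strength, correlated with both T and Y
foreach s of numlist 0.1 0.3 0.5 {
    generate f = `s'*t + `s'*y + rnormal()
    quietly regress y t x f
    display "strength = `s'   beta_T = " _b[t] "   (baseline: " scalar(b_base) ")"
    drop f
}
```

The displayed coefficients show how far $\hat\beta_T$ moves from its baseline as F becomes more strongly related to T and Y; the judgment of which strengths are "realistic" remains up to the researcher.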



Matching



MATCHING

Another popular method to deal with selection on observables is matching

Matching (in French: appariement)

Idea: form many pairs of similar individuals (i, j), one treated and one non-treated, and look at their average difference in outcomes

$$\widehat{ATT} = \frac{1}{N_1} \sum_{i:\, T_i = 1} \left( Y_i - Y_{j(i)} \right)$$

where $N_1$ is the number of treated individuals and $Y_{j(i)}$ is the outcome of j, the non-treated individual closest to the treated i (i.e. the match for i)



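Note that we can also recover the ATU and the ATE with matching:

$$\widehat{ATU} = \frac{1}{N_0} \sum_{i:\, T_i = 0} \left( Y_{j(i)} - Y_i \right)$$

where, for a non-treated i, j(i) is the closest treated individual, and

$$\widehat{ATE} = \frac{N_1}{N} \widehat{ATT} + \frac{N_0}{N} \widehat{ATU}$$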



SIMPLE EXTENSIONS

Note that we can

• match on many dimensions (many X): that’s preferable to make CIA hold

• use several matches for a given i: that’s preferred to reduce variance

$$\widehat{ATT} = \frac{1}{N_1} \sum_{i:\, T_i = 1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right)$$

For now, we focus on the simplest case: 1x1 matching on one X



1X1 MATCHING ON ONE X






ANOTHER EXAMPLE : 1X1 MATCHING ON ONE X

The estimated ATT after matching is 16426 − 13982 = 2444

whereas before matching: 16426 − 20724 = −4298



SEVERAL X

In practice, we usually need to match on many observable variables

⇒ difficult to find i and j that are perfectly similar on all X (exact matching)

Other methods:

• coarsened exact matching (“exact” matching within bins/ranges)

• distance-based matching
  - Euclidean distance:

$$\lVert x_i - x_j \rVert = \sqrt{(x_i - x_j)'(x_i - x_j)} = \sqrt{\sum_{k=1}^{K} (x_{ki} - x_{kj})^2}$$

  - Normalized Euclidean distance, Mahalanobis distance

• propensity score matching

Distance-based and propensity score matching are most often used
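Distance-based matching on several X can be sketched with Stata's built-in teffects nnmatch command (available from Stata 13). The data below are simulated and the variable names (y, t, x1, x2) are hypothetical; the true effect is set to 2 in the data-generating line only so that the output has something to be compared with.

```stata
clear
set seed 2018
set obs 500
* simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate t  = (0.5*x1 + 0.5*x2 + rnormal() > 0)
generate y  = 1 + 2*t + x1 + x2 + rnormal()

* nearest-neighbor matching on (x1, x2) with the Euclidean distance, ATT
teffects nnmatch (y x1 x2) (t), atet metric(euclidean)

* same estimator with the default Mahalanobis distance
teffects nnmatch (y x1 x2) (t), atet
```

The Euclidean metric treats a one-unit gap in every covariate as equally important, so it is sensitive to the scale of each X; the Mahalanobis metric rescales the gaps by the covariance matrix of the covariates.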



SEVERAL MATCHES

In practice, we often want to increase precision by using several matches for each i

• Single nearest neighbor matching

• k-nearest neighbors matching (e.g. k = 5 or 10)

• Caliper (or radius) matching (maximal distance between i and j)

• Kernel matching (different weights by distance)

• etc.

Asymptotically, they are all similar; but in practice, this choice can matter
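These variants map onto options of the user-written psmatch2 command (installed once with ssc install psmatch2). The sketch below uses simulated, hypothetical data and matches on the propensity score that psmatch2 estimates by probit by default; the caliper value 0.01 is an arbitrary choice.

```stata
* ssc install psmatch2      // user-written command, install once
clear
set seed 2018
set obs 1000
* simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate t  = (0.5*x1 + 0.5*x2 + rnormal() > 0)
generate y  = 1 + 2*t + x1 + x2 + rnormal()

psmatch2 t x1 x2, outcome(y) neighbor(1)            // single nearest neighbor
psmatch2 t x1 x2, outcome(y) neighbor(5)            // 5 nearest neighbors
psmatch2 t x1 x2, outcome(y) radius caliper(0.01)   // radius matching within a caliper
psmatch2 t x1 x2, outcome(y) kernel                 // kernel matching
```

Each call reports the unmatched difference and the ATT, so the four estimates can be compared directly.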



PROPENSITY SCORE MATCHING

As with distance-based matching, we want to aggregate all differences in X into a single index, the propensity score p(x)

p(x) measures the probability that individuals are treated (T = 1) based on their observables

• Among the treated, some were very likely to be treated, some less so
• Among the non-treated, some were very likely not to be treated, some less so

⇒ common support in p(x) between the two groups

Propensity score matching matches individuals with similar p(x) (but different actual treatment status)

⇒ need to estimate p(x)



PROPENSITY SCORE MATCHING

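To estimate p(x) for each individual (and then match neighbors), we usually use a probit (or logit) model with latent index $T^* = X'\beta + \varepsilon$:

$$\begin{aligned}
\Pr(T = 1 \mid X) &= \Pr(T^* > 0) \\
&= \Pr(X'\beta + \varepsilon > 0) \\
&= \Pr(\varepsilon > -X'\beta) \\
&= 1 - \Phi(-X'\beta) \\
&= \Phi(X'\beta)
\end{aligned}$$

where $\Phi$ is the CDF of $\varepsilon$ (standard normal for the probit); the last step uses its symmetry

X are pre-determined variables (and interactions, polynomials, etc.) likely to explain T

and then predict the scores: $\hat p_i(x_i) = \Phi(X_i'\hat\beta)$

⇒ $\hat p_i(x_i)$ ranges from 0 to 1 (if probit or logit is used)

⇒ Hopefully with common support and balance of x between the two groups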
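Estimating and predicting the score takes two commands in Stata. The sketch below uses simulated, hypothetical data (t, x1, x2) and includes one interaction term as an example of a richer specification.

```stata
clear
set seed 2018
set obs 1000
* simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate t  = (0.5*x1 + 0.5*x2 + rnormal() > 0)

* probit of treatment on pre-determined covariates and their interaction
probit t x1 x2 c.x1#c.x2
predict pscore, pr            // predicted Pr(T = 1 | X)
summarize pscore

* logit alternative
logit t x1 x2 c.x1#c.x2
predict pscore_logit, pr

* the two sets of scores are usually very close
correlate pscore pscore_logit
```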



MAIN PRACTICAL ISSUES

Check common support: compare the two distributions of p(x)

Check balance of covariates: use simple t-tests, tests of proportions, or the standardized bias; if the standardized bias is above 20%, the difference is still “large”

Be careful about inference: propensity score matching is a two-step process, so you need to adjust your standard errors (using bootstrap)

Many other choices to make: type of matching (1-1, 1-5, caliper, kernel, etc.), with or without replacement...
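The formula for the standardized bias is not reproduced in this transcript. The usual definition for a covariate x is $SB = 100 \times (\bar x_T - \bar x_C) / \sqrt{(s_T^2 + s_C^2)/2}$, with means and variances computed in the treated and comparison groups. The checks above can be sketched with psmatch2 and its companion commands psgraph and pstest, here on simulated, hypothetical data; the number of bootstrap replications (200) is an arbitrary choice.

```stata
* ssc install psmatch2      // provides psmatch2, psgraph and pstest
clear
set seed 2018
set obs 1000
* simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate t  = (0.5*x1 + 0.5*x2 + rnormal() > 0)
generate y  = 1 + 2*t + x1 + x2 + rnormal()

* 1-1 nearest neighbor matching, restricted to the common support
psmatch2 t x1 x2, outcome(y) neighbor(1) common

* common support: distribution of p(x) by treatment status
psgraph

* balance before and after matching (t-tests and standardized bias)
pstest x1 x2, both

* bootstrap the whole two-step procedure to get a standard error for the ATT
bootstrap r(att), reps(200): psmatch2 t x1 x2, outcome(y) neighbor(1) common
```

As an alternative, Stata's built-in teffects psmatch reports standard errors that take into account that the propensity score is estimated.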



BONUS 3 : PRISON-BASED EDUCATION AND RECIDIVISM

Goal: write a 1-page critical review of the paper/chapter

• brief summary of the paper (topic, method, main points, results)
• discuss method, experimental design, interpretations, conclusions
• relate it to the class
• criticisms, shortcomings?

Send the PDF by email before next Monday (noon) at [email protected]



EXAMPLE : PRISON-BASED EDUCATION AND RECIDIVISM

Data on 31,000 prisoners released in New York State between 2005 and 2008

The authors track recidivism (rearrest) within 3 years

Only 347 of them received a college degree in prison

Challenge: make those 347 graduates as comparable as possible to other prisoners who did not get a college degree

Method: match prisoners based on their propensity to get a degree, predicted from 47 covariates
⇒ 1-1 nearest neighbor matching with a caliper of 0.01



APPLICATION ON STATA

Let’s imagine we want to estimate the effect of halfway houses (semi-liberté) instead of prison on recidivism in a sample of offenders sentenced to prison in France

• allows convicts to work, train, attend classes (probably good for reentry)

• requires them to return to “custody” every night (probably adequate to monitor offenders)

• often perceived as less punitive (possibly bad for future deterrence)

⇒ what’s the net causal effect on recidivism, after accounting for selection?

Main assumption: the Conditional Independence Assumption holds after matching on the propensity score

In Stata, we can simply use psmatch2
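A minimal sketch of what such a call could look like is given below. The course dataset is not part of this transcript, so the example runs on simulated data: the variable names (recid, halfway, age, priors, employed), the data-generating lines and the caliper of 0.01 are all assumptions made for illustration.

```stata
* ssc install psmatch2      // user-written command, install once
clear
set seed 2018
set obs 2000
* simulated offenders (hypothetical variables and coefficients)
generate age      = 18 + int(40*runiform())
generate priors   = rpoisson(2)
generate employed = runiform() < 0.4
generate halfway  = (-0.5 + 0.8*employed - 0.1*priors + rnormal() > 0)
generate recid    = (0.2 - 0.3*halfway + 0.15*priors - 0.02*(age - 18) + rnormal() > 0)

* 1-1 nearest neighbor matching on the propensity score, caliper of 0.01,
* restricted to the common support
psmatch2 halfway age priors employed, outcome(recid) neighbor(1) caliper(0.01) common

* built-in alternative: propensity score matching with a probit first stage
teffects psmatch (recid) (halfway age priors employed, probit), atet
```

The first variable after psmatch2 is the treatment, the following ones enter the propensity score model, and outcome() names the variable on which the ATT is computed.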


