
  • ECO 7377 Microeconometrics

    Daniel L. Millimet

    Southern Methodist University

    Spring 2020

    DL Millimet (SMU) ECO 7377 Spring 2020 1 / 479

  • Introduction

    Applied research in economics can be loosely classified into two types:

    1 Predictive modeling (forecasting, associations)
    2 Causal estimation (cause-and-effect)

    While the first is important and useful, the second is of primary interest (to me)

  • Causal analysis is needed to predict the impact of changing circumstances or policies, or for the evaluation of existing policies or interventions

    I Predictive modeling addresses the following question: “If an agent arrives with attributes x, what is the minimum-MSE estimate of his/her y?”

    F Answer: Ê[y|x] = x β̂ (assuming linearity)

    I Causal estimation addresses the following question: “If an intervention exposes an agent to a treatment, ∂x_j, what is the minimum-MSE estimate of his/her ∂y?”

    F Answer: ∂Ê[y|x]/∂x_j = β̂_j (assuming linearity)
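    The contrast between the two questions can be made concrete with a small simulation (a minimal NumPy sketch; the simulated model, seed, and variable names are ours, not from the slides): under linearity, the predictive answer uses the whole fitted line x β̂, while the causal answer is just the slope coefficient β̂_j.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Simulated linear model: y = 1 + 2x + eps
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# OLS via least squares on [1, x]
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictive question: minimum-MSE estimate of y for an agent with x = 0.5
y_pred = np.array([1.0, 0.5]) @ beta_hat

# Causal question (under linearity): dE[y|x]/dx is the slope coefficient
marginal_effect = beta_hat[1]
```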

  • Notes

    I In general, estimating a CEF is very different than estimating a partial derivative
    I Origins of machine learning techniques lie with predictive modeling (with high-dimensional data)

    F See Mullainathan & Spiess (2017)

    Prior to conducting, or when reviewing, empirical analysis, questions that need to be answered:

    1 Is predictive modeling sufficient or is causal estimation necessary?
    2 What is the causal relationship of interest? Is it economically interesting?
    3 What is the identification strategy? Is it reasonable?
    4 What is the method of statistical inference?

  • Several statistical issues are confronted when answering these questions in economic research:

    Specification of the causal relationship of interest entails more than just defining x and y ... lots of parameters could be estimated

    I Heterogeneous vs. homogeneous effects

    F To whom does it apply?
    F What question does it answer?

    I Know what you are estimating

    Statistical inference is often difficult and overlooked

    I Spherical vs. non-spherical errors
    I Generated regressors
    I Derivation/computation of estimated asymptotic variances of estimators
    I Finite sample properties of estimators

  • Identification of causal relationships frequently encounters

    I Selection issues

    F Self-selection (endogeneity)
    F Sample selection (missing data, attrition)

    I Measurement issues

    F Classical vs. non-classical error
    F Dependent vs. independent variable
    F Continuous vs. discrete variables
    F Differential vs. non-differential

    I Modeling issues

    F Functional form (P, SNP, NP)
    F Role of space (spillovers, spatial correlation)
    F Structural vs. atheoretic (Keane 2010)

  • Outline

    1 Inference: simulation methods
    2 Program Evaluation

    1 Causation
    2 Randomization
    3 Selection on Observed Variables
    4 Selection on Unobserved Variables

    3 Data Issues

    1 Sample Selection
    2 Measurement Error

    4 Spatial Models
    5 Efficiency Models

  • Simulation-Based Inference: Introduction

    General structure of estimation

    population ⇒ θ
        ↓
    random sample ⇒ θ̂

    Problem: θ̂ is an estimate; need to assess its dbn for proper inference

    Solutions
    I Asymptotic theory
    I Simulation methods ⇒ bootstrap, jackknife, sub-sampling, ...

    Stata: -bootstrap-, -bsample-, -jackknife-

  • Bootstrap setup

    1 {y_i, x_i}, i = 1, ..., N is a random sample from a population distribution, F
    2 θ̂ is a statistic computed from the sample
    3 F* is the empirical distribution of the data (the so-called resampling distribution)
    4 {y*_i, x*_i}, i = 1, ..., N is a resample of the data with replacement of the same size as the original sample
    5 θ̂* is the statistic computed from the resample

  • The bootstrap principle states

    I F* ≈ F
    I The variation in θ̂ is well approximated by the variation in θ̂*

    In practice, the process
    I Results in a vector of estimates, θ̂*_b, b = 1, ..., B, where B is the # of bootstrap repetitions
    I Use this vector of estimates to conduct inference

    population ⇒ θ
        ↓
    random sample ⇒ θ̂
        ↓
    bootstrap sample ⇒ θ̂*

  • Many different bootstrap methods

    I Parametric vs. nonparametric
    I Resampling algorithms

    F iid
    F Wild bootstrap
    F Block/cluster
    F Subsampling (M/N)

    I Estimand

    F Parameter
    F Test statistic (not discussed here)

  • Simulation-Based Inference: Implementation

    To fix ideas, consider the following estimation problem

    I Regression model

    y_i = x_i β + ε_i

    I Problem: given sample estimates, β̂, need to obtain std errors or confidence intervals

  • There are two common sampling methods

    1 Resampling the data
    2 Resampling the errors

    Resampling data

    I Resample (with replacement) observations (y_i, x_i) ⇒ {y*_i, x*_i}, i = 1, ..., N
    I Estimate the original model (OLS) on the resampled data set ⇒ β̂*
    I Repeat B times ⇒ β̂*_b, b = 1, ..., B
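    The course implements this in Stata (-bootstrap-); as an illustrative sketch, the resampling-data algorithm for OLS takes only a few lines of NumPy (simulated data, seed, and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 200, 999

# Simulated sample: y = 1 + 2x + eps
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = ols(X, y)

# Resample (y_i, x_i) pairs with replacement, re-estimate, repeat B times
beta_star = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, N, size=N)     # N index draws with replacement
    beta_star[b] = ols(X[idx], y[idx])

se_boot = beta_star.std(axis=0, ddof=1)  # bootstrap std errors
```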

  • Resampling residuals

    I Given β̂ from OLS on original sample, obtain residuals ⇒ ε̂_i, i = 1, ..., N
    I Resample (with replacement) a vector of N residuals ⇒ ε̂*_i, i = 1, ..., N

    F This represents a random draw from the (nonparametric) empirical dbn of the residuals

    I Generate y*_i = x_i β̂ + ε̂*_i (which imposes β = β̂)
    I Regress y* on x by OLS ⇒ β̂*
    I Repeat B times ⇒ β̂*_b, b = 1, ..., B
    I Alternative (parametric) approach replaces step 2 above with the following:

    F Estimate

    σ̂² = (1/(N − K)) Σ_i ε̂²_i

    F Draw N random numbers, ε̂*_i, i = 1, ..., N, from N(0, σ̂²)
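    A minimal sketch of the residual-resampling algorithm (our simulated data and names; the parametric variant from the slide is indicated in comments):

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 200, 999
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat                       # residuals from original fit

beta_star = np.empty((B, 2))
for b in range(B):
    # Nonparametric: draw N residuals from their empirical dbn
    e_star = rng.choice(resid, size=N, replace=True)
    # Parametric alternative: draw from N(0, sigma2_hat) instead, where
    #   sigma2_hat = resid @ resid / (N - 2)
    y_star = X @ beta_hat + e_star             # imposes beta = beta_hat
    beta_star[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]

se_boot = beta_star.std(axis=0, ddof=1)
```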

  • Notes on resampling

    Resampling data is typically preferred since it is less model dependent

    Whether resampling the data or the residuals, the previous discussion assumes iid data since resampling occurs without regard to any dependence across observations

    If there exists some sort of dependence in the data, then resample blocks or clusters of data

    Example #1: Time series data with serial correlation

    I Model

    y_t = x_t β + ε_t, t = 1, ..., T

    I Resample blocks of length l by drawing obs randomly from t = 1, ..., T − l
    I If obs t′ is chosen for the bootstrap sample, also include obs t = t′ + 1, ..., t′ + (l − 1)
    I Draw T/l obs so final bootstrap sample size remains T
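    The block-resampling steps above can be sketched as follows (an illustrative NumPy version with a simulated AR(1) series; the statistic bootstrapped here, the sample mean, is our choice for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
T, l, B = 120, 6, 999      # T divisible by l, so T/l blocks restore length T

# AR(1) errors induce serial correlation
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1.0 + eps

def block_resample(y, l, rng):
    T = len(y)
    starts = rng.integers(0, T - l + 1, size=T // l)  # random block starts
    # each drawn start t' contributes obs t', t'+1, ..., t'+(l-1)
    return np.concatenate([y[s:s + l] for s in starts])

# Bootstrap the sample mean, preserving within-block dependence
means = np.array([block_resample(y, l, rng).mean() for _ in range(B)])
se_block = means.std(ddof=1)
```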

  • Example #2: Panel data

    I For example, individuals within hhs, or employees within firms, or individuals over time
    I Model

    y_it = x_it β + ε_it, i = 1, ..., N

    I Generate bootstrap samples by resampling (with replacement) observations i
    I If i is chosen for the bootstrap sample, include i for all t

    Key: Blocks/clusters are chosen such that data are iid across blocks
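    A sketch of the cluster (panel) bootstrap, resampling whole individuals (simulated data with an individual effect to create within-i dependence; names and the mean-statistic are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, B = 50, 4, 500          # 50 individuals, each observed T times

ids = np.repeat(np.arange(N), T)
alpha = rng.normal(size=N)    # individual effect -> dependence within i
y = 1.0 + alpha[ids] + rng.normal(size=N * T)

def cluster_resample(y, ids, rng):
    # Resample whole individuals; if i is drawn, keep all t for that i
    drawn = rng.integers(0, N, size=N)
    rows = np.concatenate([np.where(ids == i)[0] for i in drawn])
    return y[rows]

means = np.array([cluster_resample(y, ids, rng).mean() for _ in range(B)])
se_cluster = means.std(ddof=1)
```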

  • What to do with β̂*_b, b = 1, ..., B? Several options ...

    Obtain std error for original sample estimate, β̂, given by

    se(β̂) = sqrt[ (1/(B − 1)) Σ_b (β̂*_b − β̄*)² ]

    where β̄* is the mean of the bootstrap estimates

    Obtain symmetric CI using normal approximation

    β ∈ { β̂ ± t_{1−α/2, B−1} se(β̂) }

    Obtain asymmetric CI using percentile method

    β ∈ { β̂*_{α/2}, β̂*_{1−α/2} }

    where the subscript refers to the quantile of the empirical dbn of β̂*

    Obtain symmetric CI using the centered percentile method

    β ∈ { 2β̂ − β̂*_{1−α/2}, 2β̂ − β̂*_{α/2} }
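    These first three options are a few lines of code once the bootstrap vector is in hand (a sketch using stand-in draws for β̂*_b; we use the normal critical value 1.96 in place of t_{1−α/2,B−1}, which is nearly identical for large B):

```python
import numpy as np

rng = np.random.default_rng(4)

# Suppose beta_hat and B bootstrap estimates are already in hand;
# stand-in draws are used here purely for illustration
beta_hat = 2.0
beta_star = beta_hat + rng.normal(scale=0.1, size=999)
alpha = 0.05

se = beta_star.std(ddof=1)                     # bootstrap std error
lo_q, hi_q = np.quantile(beta_star, [alpha / 2, 1 - alpha / 2])

ci_normal = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)   # normal approx
ci_percentile = (lo_q, hi_q)                               # percentile
ci_centered = (2 * beta_hat - hi_q, 2 * beta_hat - lo_q)   # centered percentile
```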

  • Obtain asymmetric bias-corrected and accelerated CIs (BCa)

    I CI given by β ∈ { β̂*_{p1}, β̂*_{p2} } for appropriately chosen p1, p2 quantiles
    I Quantiles given by

    p1 = Φ[ z0 + (z0 − z_{1−α/2}) / (1 − a(z0 − z_{1−α/2})) ]
    p2 = Φ[ z0 + (z0 + z_{1−α/2}) / (1 − a(z0 + z_{1−α/2})) ]

    where z_{1−α/2} is the (1 − α/2)th quantile of the std normal distribution and

    z0 = Φ^{−1}[ (1/B) Σ_b I(β̂*_b ≤ β̂) ]   (median bias)

    a = Σ_i (β̄_J − β̂_J(i))³ / { 6 [ Σ_i (β̄_J − β̂_J(i))² ]^{3/2} }   (acceleration parameter)

    where β̂_J(i) is the jackknife estimate (omitting obs i from the original sample) and β̄_J is the mean of the jackknife estimates

  • Notes:

    I BC CI obtained by setting a = 0
    I BCa requires B > 1000
    I z0 = 0 when β̂ = median of the β̂*_b
    I a reflects the rate of change of the standard error of β̂ with respect to the true value, β

    F The standard normal approximation assumes that the standard error is invariant with respect to the true value
    F The acceleration parameter corrects for deviations in practice
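    The BCa recipe can be sketched end-to-end for a simple statistic, the sample mean (our choice for illustration), using the stdlib NormalDist for Φ and Φ⁻¹; the data and seed are simulated:

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()            # standard normal: nd.cdf = Phi, nd.inv_cdf = Phi^{-1}
rng = np.random.default_rng(5)

x = rng.normal(loc=1.0, size=100)   # sample; statistic is the mean
theta_hat = x.mean()

B = 2000
theta_star = np.array([rng.choice(x, size=len(x)).mean() for _ in range(B)])

# Median-bias correction: z0 = Phi^{-1}( (1/B) sum_b I(theta*_b <= theta_hat) )
z0 = nd.inv_cdf((theta_star <= theta_hat).mean())

# Acceleration via the jackknife (all leave-one-out means at once)
n = len(x)
theta_jack = (x.sum() - x) / (n - 1)
d = theta_jack.mean() - theta_jack
a = (d ** 3).sum() / (6 * ((d ** 2).sum()) ** 1.5)

alpha = 0.05
z = nd.inv_cdf(1 - alpha / 2)
p1 = nd.cdf(z0 + (z0 - z) / (1 - a * (z0 - z)))
p2 = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
ci_bca = tuple(np.quantile(theta_star, [p1, p2]))
```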

  • Example: x̄ with x ∼ N(0, 1), N = 1000, x̄ ∼a N(0, 0.001)

    [Figure: bootstrap vs. asymptotic sampling dbn of the sample mean, for B = 50, 500, 1000, and 10000 repetitions]

  • Simulation-Based Inference: Alternative Resampling Methods

    Jackknife estimation

    Also known as leave-one-out estimation

    Algorithm
    I Estimate model using original sample ⇒ β̂ (if OLS model, say)
    I Omit obs i and re-estimate model on sample of N − 1 obs ⇒ β̂(i)
    I Repeat omitting each i once (implies N estimations)
    I Standard error obtained as

    se(β̂) = sqrt[ ((N − 1)/N) Σ_i (β̂(i) − β̄(·))² ]

    where β̄(·) is the mean of the N leave-one-out estimates

    In some situations, delete-d jackknife achieves superior performance
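    For the sample mean (our illustrative statistic), the jackknife SE formula above reproduces the classical s/√n exactly, which makes for a quick sanity check (simulated data and names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, size=50)
n = len(x)

# Leave-one-out estimates of the mean (each omits one observation)
theta_jack = (x.sum() - x) / (n - 1)
theta_bar = theta_jack.mean()

se_jack = np.sqrt((n - 1) / n * ((theta_jack - theta_bar) ** 2).sum())

# For the sample mean, the jackknife SE equals the classical s/sqrt(n)
se_classic = x.std(ddof=1) / np.sqrt(n)
```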

  • Subsampling

    Algorithm
    I Estimate model using original sample ⇒ β̂ (if OLS model, say)
    I Draw samples of size M

  • Comparison of bootstrap and subsampling

    I The dbn of the data in the population is given by F
    I The dbn of the data in the sample is given by F*
    I β̂ is an estimate obtained from a random sample of size N from F
    I Objective: understand the dbn of β̂ if one repeatedly samples data of size N from F

    F This is not possible
    F Bootstrap: repeatedly sample data of size N from F*
    F Subsampling: repeatedly sample data of size M from F (since random subsamples of the data are random samples from F)

    I Thus, neither bootstrap nor subsampling is exactly right

    F Bootstrap generates resampled data of the right sample size, but from the wrong dbn (F* vs. F)
    F Subsampling generates resampled data of the wrong sample size (M < N), but from the right dbn

  • Simulation-Based Inference: Failure of the Bootstrap

    Resampling methods are not guaranteed to work; theoretical justification is needed

    Most common failures occur

    1 when the parameter of interest is a non-smooth function of the data (e.g., median vs. mean)
    2 when the parameter of interest lies on the edge of the parameter space (e.g., probability close to one or variance close to zero)

  • Example: x_med with x ∼ N(0, 1), N = 1000, x_med ∼a N(0, 0.00157)

    [Figure: bootstrap vs. asymptotic sampling dbn of the sample median, for B = 50, 500, 1000, and 10000 repetitions]
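    The non-smoothness problem with the median can be seen directly in code (a sketch with our simulated data): a resampled median can only equal one of the original sample values, so its bootstrap dbn is discrete and lumpy, while the mean's is nearly continuous.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=101)      # odd n, so the median is a sample point
B = 2000

meds = np.empty(B)
means = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=len(x))
    meds[b] = np.median(xs)
    means[b] = xs.mean()

# The resampled median takes values only in the original sample, so its
# bootstrap dbn is supported on few points; the mean's varies smoothly
n_distinct_median = len(np.unique(meds))
n_distinct_mean = len(np.unique(means))
```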

  • Causation: Introduction

    Many empirical questions in economics, business, medicine, etc. pertain to the causal effect of a program or policy or treatment

    I Correlation, in contrast, is (typically) less interesting and informative

    Statistical and econometric literature analyzing causation has seen tremendous growth over the past several decades

    Central problem concerns evaluation of the causal effect of exposure to a treatment or program by a set of units on some outcome

    I These units are agents such as individuals, households, firms, geographical areas, etc.

  • Philosophy of causality...

    I Rich literature in analytic philosophy on causality
    I Two main approaches to defining causality:

    F Regularity approaches: Hume: “We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second.” (from An Enquiry Concerning Human Understanding, section VII)
    F Counterfactual approaches:
    — Hume: “Or, in other words, where, if the first object had not been, the second never had existed.” (from An Enquiry Concerning Human Understanding, section VII)
    — JS Mill: “Thus, if a person eats of a particular dish, and dies in consequence, that is, would not have died if he had not eaten of it, people would be apt to say that eating of that dish was the cause of his death.”

  • Regularity approach: a minimal constant conjunction between the two objects

    I Basic idea behind predictive modeling
    I May be spurious if there exists some factor B to explain the conjunction (or correlation) between the two objects

    F B is known as a confounder or confounding variable

    I Be wary: correlation does not imply causation


  • Counterfactual approach: differences across a range of possible worlds

    I Holland (1986, 2003): a treatment (cause) is a potential manipulation that one can imagine

    F “NO CAUSATION WITHOUT MANIPULATION”
    F Gender, race are not treatments?!? (see Greiner and Rubin 2011)

    I Imbens and Wooldridge (2009):

    F “A CRITICAL FEATURE IS THAT, IN PRINCIPLE, EACH UNIT CAN BE EXPOSED TO MULTIPLE LEVELS OF THE TREATMENT.”

    I Angrist and Pischke (2009): a treatment should be manipulatable conditional on other factors

    F “NO FUNDAMENTALLY UNIDENTIFIED QUESTIONS”

    I Microeconometrics today emphasizes the counterfactual approach

  • Causation: Rubin Causal Model

    The dominant view of causation (in microeconomics) is based on the counterfactual approach

    Greiner & Rubin (2011): “For analysts from a variety of fields, the intensely practical goal of causal inference is to discover what would happen if we changed the world in some way.”

  • Typically referred to as the Rubin Causal Model (Neyman 1923; Roy 1951; Rubin 1974)

    Crucial underpinning is the notion of potential outcomes

    Potential outcomes refer to the outcome that would be realized under different states of nature

    I Example: A sick individual may receive either Treatment 0 or 1. The outcome is either Recovery or Death. Thus, there are two possible states of nature (Treatment 0 or 1) and there is an outcome that would be realized in each state of nature.

    Under the counterfactual approach, the causal effect of Treatment 1 relative to Treatment 0 would be the difference in outcomes across these two states of nature for a given individual

  • The counterfactual approach immediately leads to three salient points

    1 Causal impacts of a treatment/intervention/program/policy are only defined with respect to a well-defined alternative

    I Typically the alternative is the ‘absence of treatment’
    I Not always obvious and must be made explicit

    2 Causal impacts are individual-specific

    I Each individual potentially has his or her own potential outcomes and hence treatment effect
    I Referred to as constant vs. heterogeneous treatment effects
    I Has important implications for interpreting the results

    3 Only one state of nature is actually realized at a point in time

    I We can observe at most one potential outcome for any individual; the remainder are missing
    I The causal effect of a treatment is not observable for any individual
    I Any estimator of causal effects must overcome this missing data problem
    I To do so requires assumptions and the assumptions must be credible

  • The missing data problem arising from the fact that only one state of nature is realized is referred to as the fundamental problem of causal inference (Holland 1986) and is difficult to overcome

  • Notation

    Treatment Assignment
    I D_i = (binary) treatment indicator for observation i

    D_i = { 1 if treated; 0 if untreated }

    Potential Outcomes
    I Y_i(1) = outcome of observation i with treatment
    I Y_i(0) = outcome of observation i without treatment

  • Implicit in this representation is the Stable Unit Treatment Value Assumption (SUTVA, Rubin 1978)

    1 SUTVA implies that potential outcomes of observation i are independent of the treatment assignment of all other agents (rules out general equilibrium or indirect effects via spillovers)

    I Allows one to write potential outcomes solely as a function of own treatment assignment

    Y_i(0) ≡ y_i(D_1, D_2, ..., D_{i−1}, 0, D_{i+1}, ..., D_N) = y_i(0)
    Y_i(1) ≡ y_i(D_1, D_2, ..., D_{i−1}, 1, D_{i+1}, ..., D_N) = y_i(1)

    I See Imbens & Wooldridge (2009) for references that relax this assumption

    F Stata: -ntreatreg-

    2 SUTVA implies that the treatment, D, is identical for all observations (i.e., no variation in treatment intensity)

  • Parameters of Interest

    ∆_i = Y_i(1) − Y_i(0) = treatment effect for obs i

    I Either Y_i(1) or Y_i(0) is observed for each i, never both

    F As a result, ∆_i is never observed

    I Missing potential outcome is the missing counterfactual

    Can summarize the distribution of ∆_i by focusing on different aspects

    ∆_ATE = E[∆_i] = E[Y(1) − Y(0)]
    ∆_ATT = E[∆_i | D = 1] = E[Y(1) − Y(0) | D = 1]
    ∆_ATU = E[∆_i | D = 0] = E[Y(1) − Y(0) | D = 0]
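    In a simulation both potential outcomes can be generated, so ∆_i is observable and the three parameters can be computed directly (a sketch with our own data-generating process; here selection is related to the individual gain, so the three parameters differ):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 100_000

# Both potential outcomes are generated, so Delta_i is known here
y0 = rng.normal(size=N)
y1 = 1.0 + y0 + rng.normal(scale=0.5, size=N)   # heterogeneous gains
delta = y1 - y0

# Selection related to the gain: those with large Delta_i take treatment
d = (delta > 1.0).astype(int)

ate = delta.mean()
att = delta[d == 1].mean()
atu = delta[d == 0].mean()
```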

  • Other Parameters of Interest

    1. Local Average Treatment Effect (Imbens & Angrist 1994; Angrist et al. 1996)

    I Defined as

    ∆_LATE = E[Y(1) − Y(0) | i ∈ Ω]

    where Ω refers to some specified local or subpopulation

    2. Marginal Treatment Effect (Heckman & Vytlacil 1999, 2001, 2005, 2007)

    I Defined later

  • 3. Intent to Treat Effect

    I Defined as

    ∆_ITT = E[Y(1) − Y(0)]

    where

    D = { 1 if agents have the opportunity to undertake the treatment; 0 if agents do not have the opportunity to undertake the treatment }

  • 4. Average Direct, Indirect, and Total Effects

    I Define potential outcomes as

    Y(d) = Y(d, M(d)), d = 0, 1

    where M is referred to as a mediator as D affects M and M affects potential outcomes

    F Example: D is college completion, M is occupation, and Y is wages

    I Parameters of potential interest defined as

    Causal Mediation Effect (Indirect Effect): E[Y(d, M(1)) − Y(d, M(0))]
    Direct Effect: E[Y(1, M(d)) − Y(0, M(d))]
    Total Effect: E[Y(1, M(1)) − Y(0, M(0))]

    I See Robins and Greenland (1992), Pearl (2001), Imai & Yamamoto (2013), Acharya et al. (2016)

  • Relationship among the ATE, ATT, and ATU

    I Let

    Y_i(1) ≡ E[Y(1)] + υ_1i
    Y_i(0) ≡ E[Y(0)] + υ_0i

    I This implies

    ∆_i = Y_i(1) − Y_i(0)
        = E[Y(1) − Y(0)] + υ_1i − υ_0i
        = ∆_ATE + υ_1i − υ_0i

    and

    ∆_ATT = ∆_ATE + E[υ_1i − υ_0i | D = 1]
    ∆_ATU = ∆_ATE + E[υ_1i − υ_0i | D = 0]

    where E[υ_1i − υ_0i | D = d] is the average, obs-specific gain from treatment for group d

  • Can re-define any of the above parameters for sub-populations defined on the basis of attributes, x

    ∆_ATE(x) = E[Y(1) − Y(0) | x]
    ∆_ATT(x) = E[Y(1) | x, D = 1] − E[Y(0) | x, D = 1]
    ∆_ATU(x) = E[Y(1) | x, D = 0] − E[Y(0) | x, D = 0]

    where these are conditional average treatment effects

    The previous unconditional parameters are obtained by integrating over the dbn of x in the relevant population

    ∆_ATE = ∫ ∆_ATE(x) f(x) dx
    ∆_ATT = ∫ ∆_ATT(x) f(x | D = 1) dx
    ∆_ATU = ∫ ∆_ATU(x) f(x | D = 0) dx

  • Evaluation Problem

    Setup...

    Attributes of i: {Y_i(0), Y_i(1), D_i, x_i}
    Observed for i: {y_i, D_i, x_i}

    where
    I y_i = D_i Y_i(1) + (1 − D_i) Y_i(0) = observed outcome for obs i
    I x_i is a vector of attributes of obs i

    Question: How does one circumvent the missing counterfactual problem to estimate ∆_ATE, ∆_ATT, ∆_ATU, or any other summary statistic of the distribution of ∆?

    I Any answer must make comparisons across agents with different treatment assignment
    I Once we compare agents across treatment groups, we must address possible bias from self-selection into treatment and control groups

  • Example #1... ATT

    I Consider ∆_ATT = E[Y(1) | D = 1] − E[Y(0) | D = 1]
    I The sample counterpart to E[Y(1) | D = 1] is observed in the data, but one does not observe the counterpart to E[Y(0) | D = 1]
    I If one uses outcomes of the untreated, we can define

    ∆̃ ≡ E[Y(1) | D = 1] − E[Y(0) | D = 0]

    I Some algebra reveals

    ∆_ATT = E[Y(1) | D = 1] − E[Y(0) | D = 1]
          = E[Y(1) | D = 1] − E[Y(0) | D = 0] + E[Y(0) | D = 0] − E[Y(0) | D = 1]

    ⇒ ∆̃ − ∆_ATT = E[Y(0) | D = 1] − E[Y(0) | D = 0]   (selection bias)

    I Generally, ∆̃ ≠ ∆_ATT
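    The identity ∆̃ − ∆_ATT = selection bias holds exactly in any simulated sample, which makes it easy to verify (a sketch with our own data-generating process, where treated units have higher Y(0)):

```python
import numpy as np

rng = np.random.default_rng(11)
N = 200_000

# Constant treatment effect of 1, but treated units have higher Y(0)
u = rng.normal(size=N)
y0 = u + rng.normal(size=N)
y1 = y0 + 1.0
d = (u > 0).astype(int)               # positive selection on u
y = np.where(d == 1, y1, y0)          # observed outcome

delta_tilde = y[d == 1].mean() - y[d == 0].mean()      # naive comparison
att = (y1 - y0)[d == 1].mean()                         # true ATT = 1
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
```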

  • Example #2... ATE

    I Consider estimating ∆_ATE = E[Y(1)] − E[Y(0)]
    I The sample counterpart of neither unconditional expectation is observed in the data
    I If one uses conditional expectations, we can again define

    ∆̃ ≡ E[Y(1) | D = 1] − E[Y(0) | D = 0]

    I Some algebra reveals

    ∆̃ − ∆_ATE = (E[Y(1) | D = 1] − E[Y(0) | D = 0]) − (E[Y(1)] − E[Y(0)])
               = (E[Y(1) | D = 1] − E[Y(1) | D = 0])[1 − Pr(D = 1)]
                 + (E[Y(0) | D = 1] − E[Y(0) | D = 0]) Pr(D = 1)
               = (∆̃ − ∆_ATT) Pr(D = 1) + (∆̃ − ∆_ATU)[1 − Pr(D = 1)]

    which is a weighted average of the selection biases for the ATT and ATU

  • Decomposition of the selection biases

    I Biases given by

    E[Y(0) | D = 1] − E[Y(0) | D = 0], and
    E[Y(1) | D = 1] − E[Y(1) | D = 0]

    I Terms are decomposed into 3 or 4 components in

    F Heckman et al. (1998)
    F King & Zeng (2006)

  • Question: How does one circumvent the missing counterfactual problem to estimate ∆_ATE, ∆_ATT, ∆_ATU, or any other summary statistic of the distribution of ∆?

    Answer: By estimating the missing counterfactual, but such estimates are only as valid as the assumptions that underlie them and the data used to derive the estimates

    I The central issue in the RCM is the relationship between treatment assignment and potential outcomes

    F Typically referred to as the treatment assignment rule or mechanism
    F Growing literature on assignment rules (Manski 2000, 2004; Pepper 2002, 2003; Dehejia 2005; Lechner & Smith 2007; Kitagawa & Tetenov 2018)

    I Estimation proceeds under different assumptions concerning the assignment of or selection into treatment
    I Three different categories of assumptions

    1 Random assignment
    2 Selection on observables (observed variables)
    3 Selection on unobservables (unobserved variables)

  • Early Example of Potential Outcomes: Roy Model (Roy 1951)

    At the center of the RCM is the interplay between treatment assignment, potential outcomes, and observed outcomes

    Problem is one of self-selection; highlighted in a very clever fashion in Roy (1951)

    Specific issue in Roy (1951) was occupational choice

    I Individuals have potential earnings associated with different occupations
    I Realized earnings reflect the chosen occupation

    Example: Suppose

    (Y(0), Y(1))′ ∼ N((0, 1)′, Σ)

  • [Figure: unconditional distributions of potential outcomes Y(0) and Y(1); N = 100,000, ρ = 0.7]

  • Conditional distributions

    Depend on
    I Who selects into treatment or control group, and
    I Correlation of potential outcomes

    Positive correlation in above example (ρ ≈ 0.7)

  • Positive selection: Assume those with Y(1) > 1 select into treatment

    [Figure: conditional distributions of potential outcomes; N = 100,000, ρ = 0.7]

  • Negative selection: Assume those with Y(0) < 0 select into treatment

    [Figure: conditional distributions of potential outcomes; N = 100,000, ρ = 0.7]

  • Random assignment:

    [Figure: conditional distributions of potential outcomes; N = 100,000, ρ = 0.7]

    Lesson to be learned: observed distributions are not the unconditional distributions; hence, differences in observed dbns tell us nothing about treatment effects in the absence of further assumptions

  • Roy Model

    Two occupations: hunter, fisher

    Potential incomes

    Y(d) = µ_d(x) + υ_d, d = 0 (h), 1 (f)

    Decision rule: maximize income

    D = I(Y(1) − Y(0) > 0)
      = I(µ_1(x) − µ_0(x) + υ_1 − υ_0 > 0)

    Observed income

    y = D Y(1) + (1 − D) Y(0)

    Treatment assignment depends on observables, x, and unobservables, υ_1 − υ_0

    Notes:
    1 Cov(D, υ_1 − υ_0) ≠ 0 is referred to as essential heterogeneity (Heckman et al. 2006)
    2 Cov(D, υ_1 − υ_0) ≠ 0 ⇒ Cov(D, D(υ_1 − υ_0)) ≠ 0
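    The Roy model's self-selection is easy to simulate, matching the setup used in the earlier figures, N = 100,000 and ρ = 0.7 (an illustrative sketch; all names are ours): the income-maximizing choice makes the observed conditional means differ from the unconditional means of 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(9)
N, rho = 100_000, 0.7

# Correlated potential incomes: Y(0) ~ N(0,1), Y(1) ~ N(1,1), corr = rho
cov = np.array([[1.0, rho], [rho, 1.0]])
yy = rng.multivariate_normal([0.0, 1.0], cov, size=N)
y0, y1 = yy[:, 0], yy[:, 1]

# Income-maximizing occupational choice: D = 1(Y(1) - Y(0) > 0)
d = (y1 - y0 > 0).astype(int)
y_obs = np.where(d == 1, y1, y0)

# Observed (conditional) means differ from the unconditional means (0 and 1)
mean_y1_treated = y1[d == 1].mean()
mean_y0_untreated = y0[d == 0].mean()
```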

  • Generalized Roy Model

    Replace the income maximization decision rule with a more general rule

    Decision rule

    Y(d) = µ_d(x) + υ_d, d = 0, 1
    D = I(h(z) − u > 0)

    When D is a voluntary program (e.g., job training), u may reflect (i) costs of participation and (ii) foregone earnings (opportunity costs)

    Implies that treatment assignment depends on observables, z, and unobservables, u

    I Random Assignment: x ∩ z = ∅ and Corr(u, υ_d) = 0 ∀d
    I Selection on Observables: x ∩ z ≠ ∅ and Corr(u, υ_d) = 0 ∀d
    I Selection on Unobservables: x ∩ z ≠ ∅ and Corr(u, υ_d) ≠ 0 ∀d

  • Directed Acyclic Graphs (DAGs)

    Random Assignment

    [DAG with nodes D, Y, X, v]

    Z = coin toss
    u = empty

  • Selection on Observables

    [DAG with nodes D, Y, X, v]

    Z = observables
    u = unobservables

  • Selection on Unobservables

    [DAG with nodes D, Y, X, v]

    Z = observables
    u = unobservables

  • Moving Forward

    Guided by the potential outcomes framework, figure out conditions under which different estimators may provide consistent estimates of the ATE, ATT, ATU, etc.

    Key points:
    I Given the missing counterfactual problem, any estimator of the causal effects of a treatment must rely on some assumptions

    F Thus, no estimator is guaranteed to ‘always’ work or (perhaps) ‘always’ fail
    F Performance of every estimator is application-specific

    I Different estimators rely on different assumptions and thus should not be expected to yield similar estimates
    I Not all assumptions can be tested
    I Different estimators may estimate different aspects of the dbn of ∆ and thus answer different questions

  • Random Assignment

    First solution is to randomize treatment assignment via Randomized Control Trials (RCTs)

    I In the Generalized Roy Model, z could reflect the outcome of a coin toss and u = 0 ∀i

    [DAG with nodes D, Y, X, v]

    Z = coin toss
    u = empty

  • Generally speaking, randomization is the preferred solution; often called the “gold standard”

  • Reason: randomization ensures that treatment assignment is independent of potential outcomes in expectation

    Freedman (2006): “Experiments offer more reliable evidence on causation than observational studies.”

    Imbens (2009): “More generally, and this is the key point, in a situation where one has control over the assignment mechanism, there is little to gain, and much to lose, by giving that up through allowing individuals to choose their own treatment regime. Randomization ensures exogeneity of key variables, where in a corresponding observational study one would have to worry about their endogeneity.”

  • That said, not everyone is convinced by experiments (without doing some more mental work)

    “Much of the criticism about experiments is about the difficulty of generalizing from the evaluation of one particular program to predicting what would happen to this program in a different context. Clearly, without theory to guide us on why a result extends from a context to another, it is difficult to jump directly to a policy conclusion. However, when experiments are motivated by a theory, the results of experiments (not only on the final outcomes, but on the entire chain of intermediate outcomes that led to the endpoint of interest) serve as a test of some of the implications of that theory. The combination of data points then eventually provides sufficient evidence to make policy recommendations.”

    Duflo (2010), http://www.aeaweb.org/econwhitepapers/white_papers/Esther_Duflo.pdf

  • “From an ex post evaluation standpoint, a carefully planned experiment using random assignment of program status represents the ideal scenario, delivering highly credible causal inference. But from an ex ante evaluation standpoint, the causal inferences from a randomized experiment may be a poor forecast of what were to happen if the program were to be ‘scaled up’.”

    DiNardo & Lee (2011)

    Ex post evaluation answers the question: “What happened?” (descriptive)

    Ex ante evaluation answers the question: “What would happen?” (predictive)

  • At issue is the distinction between internal and external validity

    I Proper RCTs yield high internal validity, but external validity is not guaranteed
    I See recent work on external validity and problems of scale-up (e.g., Kline & Tamer 2018; Bisbee et al. 2017; Andrews & Oster 2017 NBER WP; Davis et al. 2017 NBER WP)
    I Stata: -extbounds-

  • Randomization may occur at different stages

    1 Population-level: randomize among agents in the population; typically not feasible since it would entail ‘compelling’ treatment by some
    2 Eligibility-level: randomize among the population of eligibles by randomly denying eligibility to a subset
    3 Application-level: randomize among the population of program applicants by randomly accepting/rejecting a subset

    Notes
    I Stage at which randomization occurs generally affects what can be learned unless additional assumptions are made
    I Lab experiments are generally type 3 since randomization occurs within the subpopulation of experiment applicants

  • Assumptions (with population-level randomization)

    (A.i) {y, D} is an iid sample from the population
    (A.ii) Y(0), Y(1) ⊥ D
    (A.iii) Pr(D = 1) ∈ (0, 1)

    Notes
    I (A.i) implies SUTVA
    I (A.ii) implies E[Y(1)|D = 1] = E[Y(1)|D = 0] = E[Y(1)]; similarly for E[Y(0)]
    I (A.ii) also implies ∆ATE = ∆ATT = ∆ATU since

    E[Y(1) − Y(0)]  (= ∆ATE)  =  E[Y(1) − Y(0)|D = 1]  (= ∆ATT)  =  E[Y(1) − Y(0)|D = 0]  (= ∆ATU)

    I (A.ii) relies on perfect compliance
    I (A.iii) ensures all agents have some probability of receiving and not receiving the treatment

  • Imperfect compliance may invalidate (A.ii) if such non-compliance is related to potential outcomes

    Three options in this case:
    1 Difference in average outcomes based on initial assignment estimates the intent-to-treat effect, ∆ITT, under imperfect compliance; may actually be more policy relevant
    2 Use initial assignment as an instrument for actual assignment (discussed later)
    3 Partial identification (discussed later)

  • Estimation

    ∆̂ATE = Ê[yi |D = 1] − Ê[yi |D = 0]
         = ∑Ni=1 yi I[Di = 1] / ∑Ni=1 I[Di = 1] − ∑Ni=1 yi I[Di = 0] / ∑Ni=1 I[Di = 0]
      →p E[yi |D = 1] − E[yi |D = 0]
         = E[Di Yi(1) + (1 − Di)Yi(0) |D = 1] − E[Di Yi(1) + (1 − Di)Yi(0) |D = 0]
         = E[Yi(1)|D = 1] − E[Yi(0)|D = 0]
         = E[Yi(1)] − E[Yi(0)]
         = ∆ATE
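Under population-level randomization the estimator above is just a difference in sample means. A minimal simulation sketch (all data and parameter values here are invented for illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
D = rng.binomial(1, 0.5, N)          # randomized treatment assignment
Y0 = rng.normal(0, 1, N)             # potential outcome if untreated
Y1 = Y0 + 2.0                        # constant treatment effect of 2
y = np.where(D == 1, Y1, Y0)         # observed outcome y = D*Y(1) + (1-D)*Y(0)

# Difference-in-means estimator of the ATE
ate_hat = y[D == 1].mean() - y[D == 1 - 1].mean()
```

With randomization, `ate_hat` converges to the true effect (2 here) as N grows.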


  • Properties
    I Unbiased
    I Consistent
    I Asymptotically normal
    I Nonparametrically identified: no parametric or functional form assumptions needed

  • Notes

    Randomization succeeds by balancing (in expectation) both observed and unobserved attributes of participants in the treatment and control group

    Balance can be assessed by testing for differences in the joint distribution of predetermined attributes across the treatment and control groups

  • Randomization at the eligibility or application stage only yields an estimate of the ATT, which does not equal the ATE unless (i) treatment effects are homogeneous or (ii) agents do not become eligible or apply due to unobserved, observation-specific gains to the treatment, υ1 − υ0

    I Implies the level of randomization is important for interpreting results
    F Example: Project on Incentives in Teaching (POINT) experiment

    “The Project on Incentives in Teaching (POINT) was a three-year study conducted in the Metropolitan Nashville School System from 2006-07 through 2008-09, in which middle school mathematics teachers voluntarily participated in a controlled experiment to assess the effect of financial rewards for teachers whose students showed unusually large gains on standardized tests. The experiment was intended to test the notion that rewarding teachers for improved scores would cause scores to rise. It was up to participating teachers to decide what, if anything, they needed to do to raise student performance... Thus, POINT was focused on the notion that a significant problem in American education is the absence of appropriate incentives... By and large, results did not confirm this hypothesis. [S]tudents of teachers randomly assigned to the treatment group (eligible for bonuses) did not outperform students whose teachers were assigned to the control group (not eligible for bonuses).”

    https://my.vanderbilt.edu/performanceincentives/files/2012/09/Full-Report-Teacher-Pay-for-Performance-Experimental-Evidence-from-the-Project-on-Incentives-in-Teaching-20104.pdf

  • Estimation via regression is also possible
    I Specification

    yi = Di Yi(1) + (1 − Di)Yi(0)
       = Yi(0) + Di [Yi(1) − Yi(0)]
       = Yi(0) + ∆i Di
       = E[Y(0)] + E[∆]Di + {(Yi(0) − E[Y(0)]) + (∆i − E[∆])Di}
       = α + βDi + εi

    where β̂ = ∆̂ATE
    I Augmenting the regression model with covariates can improve precision since it reduces the variance of the error term

    yi = α + βDi + xi δ + ε̃i

    since σ²ε̃ < σ²ε
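The precision gain from adding covariates can be seen in a small simulation (an invented DGP for illustration; the short regression's error absorbs the omitted 3x term, inflating the residual variance):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
D = rng.binomial(1, 0.5, N)          # randomized treatment
x = rng.normal(0, 1, N)              # covariate, independent of D
y = 1.0 + 2.0 * D + 3.0 * x + rng.normal(0, 1, N)

# Short regression of y on (1, D): error absorbs 3x, so sigma^2 = 9 + 1
X_short = np.column_stack([np.ones(N), D])
b_short, res_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

# Long regression of y on (1, D, x): residual variance falls to ~1
X_long = np.column_stack([np.ones(N), D, x])
b_long, res_long, *_ = np.linalg.lstsq(X_long, y, rcond=None)

sigma2_short = res_short[0] / N      # ~10
sigma2_long = res_long[0] / N        # ~1
```

Both regressions estimate the same causal coefficient on D (x is independent of D), but the long regression does so with a much smaller error variance.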


  • Example: Krueger (1999), “Experimental Estimates of Education Production Functions,” Quarterly Journal of Economics, 114, 497-532.

    Student/Teacher Achievement Ratio (STAR) RCT in Tennessee in the 1980s

    Students assigned to
    1 Small kindergarten class: 13-17 students
    2 Large kindergarten class + teacher’s aide: 22-25 students
    3 Large kindergarten class + no teacher’s aide: 22-25 students

    Randomization occurred within schools


  • Selection on Observables

    Randomization is often not feasible in economics

    Applied economists typically must rely on observational (or non-experimental) data

    [Diagram: treatment D and outcome Y, with observables (x) and unobservables (υ) affecting both]

  • Data structure is now given by ...

    Attributes of i: {Yi(1), Yi(0), Di, xi, υ1i, υ0i}        Observed for i: {yi, Di, xi}

    where xi is the full vector of observable attributes of i

  • Selection on Observables

    Assumptions

    (A.i) iid sample: {y, D, x} is an iid sample from the population
    (A.ii) Conditional independence (CIA) or unconfoundedness: Yi(0), Yi(1) ⊥ D | x
    (A.iii) Common support (CS) or overlap: Pr(D = 1|x) ∈ (0, 1)

    Notes

    (A.i) implies SUTVA

    (A.ii) implies D is randomly assigned conditional on x

    Pr(Di = 1|xi, Yi(1), Yi(0)) = Pr(Di = 1|xi)

    (A.iii) ensures one observes agents with a particular x in both the treatment and control groups

    (A.ii), (A.iii) ⇒ strong ignorability (Rosenbaum & Rubin 1983)

  • Comments on Strong Ignorability

    Is a ‘strong’ assumption (no pun intended)

    x’s must be pre-determined (i.e., unaffected by treatment assignment)
    I If some x’s are affected by D or the anticipation of D, then their inclusion will mask (at least) some of the treatment effect (Lechner 2008)

    There may not exist any vector x in a particular data set for a particular treatment such that strong ignorability holds
    I There is some tension between CIA and CS; CIA takes precedence
    I Strong ignorability requires an instrument to exist, but it need not be observed (or even known) such that D is random conditional on x

    Imbens & Rubin (2015) argue that CIA is a reasonable approximation in many applications and alternative assumptions may be even less credible
    I I do not totally agree, ...
    I but they are smarter than I am!

  • Nonparametric Identification

    Estimation

    ∆̂ATE(x) = Ê[yi |xi = x, D = 1] − Ê[yi |xi = x, D = 0]
            = ∑Ni=1 yi I[xi = x, Di = 1] / ∑Ni=1 I[xi = x, Di = 1] − ∑Ni=1 yi I[xi = x, Di = 0] / ∑Ni=1 I[xi = x, Di = 0]
         →p E[yi |xi = x, D = 1] − E[yi |xi = x, D = 0]
            = E[Yi(1)|xi = x, D = 1] − E[Yi(0)|xi = x, D = 0]
            = E[Yi(1)|xi = x] − E[Yi(0)|xi = x]

    and then

    ∆̂ATE = Ê[∆̂ATE(x)] = ∫ ∆̂ATE(x) f(x) dx = (1/N) ∑i ∆̂ATE(xi)

    Similar story for other parameters, except the final step uses f(x|D = 1) or f(x|D = 0)
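For a discrete, low-dimensional x this cell-by-cell estimator can be computed directly. A sketch (simulated data; the cell probabilities and the effect function 1 + x are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
x = rng.integers(0, 3, N)                 # discrete covariate with 3 cells
p = np.array([0.2, 0.5, 0.8])[x]          # selection on observables: p(x)
D = rng.binomial(1, p)
Y0 = x.astype(float) + rng.normal(0, 1, N)
Y1 = Y0 + 1.0 + x                         # heterogeneous effect: 1 + x
y = np.where(D == 1, Y1, Y0)

# Cell-by-cell difference in means = Delta_hat(x)
ate_cells = np.array([
    y[(x == v) & (D == 1)].mean() - y[(x == v) & (D == 0)].mean()
    for v in range(3)
])

# Final step: average over f(x) for the ATE, over f(x|D=1) for the ATT
f_x = np.bincount(x, minlength=3) / N
ate_hat = ate_cells @ f_x
f_x_treated = np.bincount(x[D == 1], minlength=3) / (D == 1).sum()
att_hat = ate_cells @ f_x_treated
```

Here the true ATE is 1 + E[x] = 2, while the true ATT is larger (≈ 2.4) because high-x cells are more likely to be treated.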


  • Final Notes

    If x is continuous and/or high dimensional, then this estimator cannot be used since the probability of observing more than one obs with the same value of x is zero

    I Possible solutions:
    1 Functional form assumptions ⇒ regression
    2 Dimensionality reduction ⇒ matching, weighting

  • CIA is not testable
    I One common ‘test’ employed entails testing for differences in pre-treatment outcomes conditional on x between the to-be-treated and the controls
    I Intuition: if D is uncorrelated with unobservables related to the outcome conditional on x, then pre-treatment outcomes should be unrelated to (future) D conditional on x
    I Heckman et al. (1999) refer to this as the alignment fallacy
    F Test based on outcomes more than one period in the past is misleading if shocks are serially correlated and agents self-select into the treatment group due to an adverse shock in the period directly before treatment
    F In general, the test is useful if it rejects the independence of D and y conditional on x in periods prior to treatment; if it fails to reject, then the test is ambiguous

    Several simulation studies comparing the performance of various estimators
    I Frölich (2004), Busso et al. (2014), Frölich et al. (2017)

  • Selection on Observables
    Strong Ignorability: Regression

    Previous results showed that

    ∆ATE(x) = E[Yi(1)|xi = x] − E[Yi(0)|xi = x]
            = E[yi |xi = x, D = 1] − E[yi |xi = x, D = 0]

    Implies the key is to estimate the CEF E[yi |xi, Di]

  • Assumptions

    (A.iv) Separability:

    Yi(0) = µ0(xi) + υ0i
    Yi(1) = µ1(xi) + υ1i

    where E[υ1 |x] = E[υ0 |x] = E[υ1 − υ0 |x] = 0

    (A.v) Functional forms:

    (A.va) Constant treatment effect

    µ0(xi) = α0 + xi β
    µ1(xi) = α1 + xi β

    (A.vb) Heterogeneous treatment effects

    µ0(xi) = α0 + xi β0
    µ1(xi) = α1 + xi β1

  • Given (A.i), (A.ii), (A.iv), and (A.va) ...

    Implications

    E[yi |xi, D = 0] = α0 + xi β + E[υ0i |xi, D = 0]
    E[yi |xi, D = 1] = α1 + xi β + E[υ1i |xi, D = 1]

    implies

    ∆ATE(x) = E[yi |xi = x, D = 1] − E[yi |xi = x, D = 0]
            = α1 − α0
            = ∆ATE = ∆ATT = ∆ATU

  • Estimation
    I Via OLS

    yi ≡ Yi(0) + Di (Yi(1) − Yi(0))
       = α0 + xi β + υ0i + Di (α1 + xi β + υ1i − α0 − xi β − υ0i)
       = α0 + xi β + (α1 − α0)Di + [υ0i + Di (υ1i − υ0i)]
       = α0 + xi β + ∆ATE Di + υ̃i

    I Coefficient on D is an unbiased estimate of the causal parameter, and ∆ATE = ∆ATT = ∆ATU

  • Given (A.i), (A.ii), (A.iv), and (A.vb) ...

    Implications

    E[yi |xi, D = 0] = α0 + xi β0 + E[υ0i |xi, D = 0]
    E[yi |xi, D = 1] = α1 + xi β1 + E[υ1i |xi, D = 1]

    implies

    ∆ATE(x) = E[yi |xi = x, D = 1] − E[yi |xi = x, D = 0]
            = (α1 − α0) + x(β1 − β0)

    and

    ∆ATE = ∫ ∆ATE(x) f(x) dx = (α1 − α0) + E[x](β1 − β0)
    ∆ATT = ∫ ∆ATE(x) f(x|D = 1) dx = (α1 − α0) + E[x|D = 1](β1 − β0)
    ∆ATU = ∫ ∆ATE(x) f(x|D = 0) dx = (α1 − α0) + E[x|D = 0](β1 − β0)

  • Estimation
    I Via OLS

    yi = α0 + xi β0 + (α1 − α0)Di + xi Di (β1 − β0) + [υ0i + Di (υ1i − υ0i)]
       = α0 + xi β0 + α̃1 Di + xi Di β̃1 + υ̃i

    I Estimates given by

    ∆̂ATE(x) = α̃̂1 + x β̃̂1
    ∆̂ATE = α̃̂1 + x̄ β̃̂1
    ∆̂ATT = α̃̂1 + x̄1 β̃̂1
    ∆̂ATU = α̃̂1 + x̄0 β̃̂1

    where x̄j = ∑i xi I[Di = j] / ∑i I[Di = j], j = 0, 1

    Stata: -teffects, ra-
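A sketch of this interacted-OLS recipe (simulated data with CIA holding by construction; all parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x = rng.normal(0, 1, N)
D = rng.binomial(1, 1 / (1 + np.exp(-x)))      # selection on x only (CIA holds)
y0 = 1.0 + 1.0 * x + rng.normal(0, 1, N)       # alpha0=1, beta0=1
y1 = 2.0 + 2.0 * x + rng.normal(0, 1, N)       # alpha1=2, beta1=2
y = np.where(D == 1, y1, y0)

# OLS of y on (1, x, D, D*x)
X = np.column_stack([np.ones(N), x, D, D * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
a1_tilde, b1_tilde = b[2], b[3]                # alpha1-alpha0 and beta1-beta0

ate_hat = a1_tilde + x.mean() * b1_tilde           # uses E[x]
att_hat = a1_tilde + x[D == 1].mean() * b1_tilde   # uses E[x | D = 1]
atu_hat = a1_tilde + x[D == 0].mean() * b1_tilde   # uses E[x | D = 0]
```

Since treatment probability rises with x and the effect rises with x, the estimates satisfy ATT > ATE > ATU in this design.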


  • Example: Millimet (2000), “The Impact of Children on Wages, Job Tenure, and the Division of Household Labor,” Economic Journal, 110, C139-C157.

    N = 1485 married couples from 1976 PSID

    Variables
    I y = log(hourly wage) (husband)
    I D = union status (husband)
    I x = educ, race, age

    Specifications

    yi = α0 + xi β + ∆Di + εi
    yi = α0 + xi β + ∆0 Di + xi Di ∆1 + υ̃i


  • Selection on Observables
    Strong Ignorability: Matching (Intro)

    Preliminaries
    I Matching methods were quite popular, but (perhaps) less so now
    F Pearl (2010, p. 114): “The method of propensity score (Rosenbaum and Rubin 1983), or propensity score matching (PSM), is the most developed and popular strategy for causal analysis in observational studies...”
    I (Incorrectly) viewed by many as a ‘magic bullet’ and/or solution to ‘endogeneity’
    I In practice, only as good as the underlying assumptions

    Assumptions required: (A.i), (A.ii), and (A.iii)

    CS assumption (A.iii) is needed in place of the functional form assumptions (A.iv and A.v) used in the regression approach

  • Selection on Observables
    Strong Ignorability: Matching vs. Regression

    Regression
    I Uses the full sample and gives equal weight to all controls
    I Uses extrapolation based on assumed functional forms for the potential outcomes if the distribution of the x’s differs in the treatment and control groups
    F Extrapolation is particularly prominent if the x’s are unbalanced in the full sample

    Matching
    I Matching weights observations differently, giving more weight to those deemed most ‘similar’ and potentially giving many controls zero weight
    I No functional form assumptions
    F Instead, matching requires the presence of reasonably close matches

  • Matching requires, and thus highlights problems due to, failure of CS, whereas regression extrapolates to overcome a failure of CS

    [Figure: simulated treated and untreated units with group-specific regression lines; E[y|x,D=0] = 1 + 1x; E[y|x,D=1] = 1.5 + 2.5x; sigma = 0.25]

    I CS is violated, but OLS simply extrapolates from each group to estimate the missing counterfactual at a particular value of x
    I If the linear regression specification is not globally accurate, then regression may yield severe bias

  • Prior to implementing the regression approach, it is useful to examine the standardized differences in x across the treatment and control groups

    I Standardized difference for a particular x is given by

    norm-diff = |x̄1 − x̄0| / √[½(σ²x1 + σ²x0)]

    I If norm-diff > 0.25, regression results are sensitive to functional form assumptions
    I See Imbens & Wooldridge (2009)
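The normalized difference is straightforward to compute. A small sketch (simulated covariates with a 0.5 standard-deviation imbalance, chosen purely for illustration):

```python
import numpy as np

def norm_diff(x_treat, x_ctrl):
    """Normalized difference in means of one covariate."""
    num = abs(x_treat.mean() - x_ctrl.mean())
    den = np.sqrt(0.5 * (x_treat.var(ddof=1) + x_ctrl.var(ddof=1)))
    return num / den

rng = np.random.default_rng(4)
x1 = rng.normal(0.5, 1, 5000)    # treated covariate, shifted by 0.5 sd
x0 = rng.normal(0.0, 1, 5000)    # control covariate
nd = norm_diff(x1, x0)           # roughly 0.5, above the 0.25 threshold
```

Note the statistic is scale-free: doubling both samples' units leaves it unchanged.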


  • Selection on Observables
    Strong Ignorability: Matching (Basics)

    To proceed, recall our parameters of interest

    ∆ATE = E[Y(1) − Y(0)]
    ∆ATT = E[Y(1) − Y(0)|D = 1]
    ∆ATU = E[Y(1) − Y(0)|D = 0]

    Sample counterparts

    ∆̂ATE = (1/N) ∑i [Yi(1) − Yi(0)]
    ∆̂ATT = (1/N1) ∑i [Yi(1) − Yi(0)] I[Di = 1]
    ∆̂ATU = (1/N0) ∑i [Yi(1) − Yi(0)] I[Di = 0]

    These are infeasible estimators since each is a function of missing data

  • Feasible estimators

    ∆̂ATT = (1/N1) ∑i [Yi(1) − Ŷi(0)] I[Di = 1]
    ∆̂ATU = (1/N0) ∑i [Ŷi(1) − Yi(0)] I[Di = 0]
    ∆̂ATE = (N1/N) ∆̂ATT + (N0/N) ∆̂ATU

    where
    I Ŷi(0), Ŷi(1) are estimates of the missing counterfactuals, obtained as

    Ŷi(0) = [1 / ∑j∈{Dj=0} ωij] ∑j∈{Dj=0} ωij Yj(0)
    Ŷi(1) = [1 / ∑j∈{Dj=1} ωij] ∑j∈{Dj=1} ωij Yj(1)

    I ωij = weight given to observation j by observation i
    I and Nd = ∑i I[Di = d], d = 0, 1

  • Notes
    I Estimated missing counterfactual is a weighted average of outcomes from the control group
    I All matching estimators take this form
    I Various matching estimators differ only in terms of how the weights are specified

  • Example

    ID   D   Y    Y(0)   Y(1)   x
    1    0   10   10     ?      12
    2    0   15   15     ?      16
    3    1   14   ?      14     12
    4    1   18   ?      18     16

  • Example (cont.)

    ID   D   Y    Y(0)   Y(1)   x
    1    0   10   10     14     12
    2    0   15   15     18     16
    3    1   14   10     14     12
    4    1   18   15     18     16

    ⇒ ∆̂3 = 14 − 10 = 4, ∆̂4 = 18 − 15 = 3  ⇒ ∆̂ATT = 3.5
    ⇒ ∆̂1 = 14 − 10 = 4, ∆̂2 = 18 − 15 = 3  ⇒ ∆̂ATU = 3.5
    ⇒ ∆̂ATE = 0.5(3.5) + 0.5(3.5) = 3.5
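The worked example can be replicated with exact matching on x (data taken directly from the table above):

```python
import numpy as np

# Toy data: obs 1,2 untreated; obs 3,4 treated; x values match exactly in pairs
D = np.array([0, 0, 1, 1])
y = np.array([10.0, 15.0, 14.0, 18.0])
x = np.array([12, 16, 12, 16])

def counterfactual(i):
    """Exact matching: mean outcome of opposite-treatment obs with identical x."""
    mask = (x == x[i]) & (D != D[i])
    return y[mask].mean()

# Individual effects: Y(1) - Y(0), filling the missing potential outcome
effects = np.array([y[i] - counterfactual(i) if D[i] == 1
                    else counterfactual(i) - y[i] for i in range(4)])
att = effects[D == 1].mean()      # (4 + 3)/2 = 3.5
atu = effects[D == 0].mean()      # (4 + 3)/2 = 3.5
ate = 0.5 * att + 0.5 * atu       # 3.5
```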


  • Selection on Observables
    Strong Ignorability: Matching (Weighting Schemes)

    Three primary classes of weighting schemes

    1. Exact matching
    I Positive weight to observations with identical x, zero otherwise
    I Implies weights satisfy

    ωij > 0 if xj = xi ;  ωij = 0 if xj ≠ xi

    I With multiple matches, typically just use the average (i.e., ωij = 1/Ni, where Ni is the number of matches for obs i)
    I Estimator is subject to the ‘curse of dimensionality’
    I R package: -MatchIt-
    I Stata: -kmatch-


  • 2. Coarsened exact matching
    I Intuition: ‘round’ x to fewer distinct values, then match exactly on the coarsened data
    I Developed in Iacus et al. (2011)
    I Implies weights satisfy

    ωij > 0 if x̃j = x̃i ;  ωij = 0 if x̃j ≠ x̃i

    where x̃ is a vector of coarsened attributes
    I With multiple matches, typically just use the average (i.e., ωij = 1/Ni, where Ni is the number of matches for obs i)
    I R package: -cem-
    I Stata: -cem-


  • 3. Inexact matching
    I Positive weight given to observations deemed to be sufficiently ‘close’
    I If multiple matches are used, weights are increasing in ‘closeness’
    I Implies weights satisfy

    ωij > 0 if xj ≈ xi ;  ωij = 0 otherwise

    I Requires specification of a metric to measure the ‘distance’ between xj and xi and then a choice of weights as a function of distances
    I R packages: -MatchIt-, -Matching-
    I Stata: -teffects-, -kmatch-

  • Inexact Matching in Detail

    Requires a measure of distance between any two observations, i and j
    1 Euclidean-type distance metrics are of the form

    dij = (xi − xj)′ W (xi − xj)

    where common choices for W are
    F W = I (identity matrix)
    F W = Σ⁻¹, where Σ is the average sample variance-covariance matrix of x across the treated and control groups (Mahalanobis metric)
    F W = diag(Σ⁻¹), which replaces the off-diagonal terms in the above version with zeros (Euclidean metric)

    2 Propensity score methods compute the distance based on differences in the probability of being in the treatment group given x

    p(x) = Pr(D = 1|x) ∈ [0, 1]

    where the distance between two observations is

    dij = |p(xi) − p(xj)|, or
    dij = |ℓ(xi) − ℓ(xj)|

    where ℓ(xi) = log[p(xi)/(1 − p(xi))]
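A sketch of the propensity-score distance, estimating p(x) by logit via Newton-Raphson (simulated data; the true coefficients 0.5 and 1.0 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5000
x = rng.normal(0, 1, N)
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))

# Logit MLE for p(x) = Pr(D=1|x) via Newton-Raphson
X = np.column_stack([np.ones(N), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (D - p)                 # score
    hess = (X * W[:, None]).T @ X        # information matrix
    beta += np.linalg.solve(hess, grad)

p_hat = 1 / (1 + np.exp(-X @ beta))
# Scalar distance between any two observations, regardless of dim(x)
d_01 = abs(p_hat[0] - p_hat[1])
```

Whatever the dimension of x, the resulting dij is a scalar, which is the dimensionality-reduction payoff of the propensity score.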

  • Both distance measures have some merit
    I Both circumvent dimensionality as dij is a scalar
    I Goal is to balance the x’s; i.e., xj ≈ xi if dij ≈ 0
    F Ho et al. (2007, p. 209): “[T]he goal of matching is to achieve the best balance for a large number of observations, using any method of matching that is a function of X, so long as we do not consult Y.”

    Both potentially entail some estimation in order to compute dij
    I Euclidean-type distance metrics potentially require estimation of W
    I Propensity score methods most likely require estimation of p(x)
    F Will return to this later ... for now assume we know or have an estimate of dij
    F And most of the following discussion focuses on estimating ∆ATT

  • Identification using the propensity score
    I Rosenbaum & Rubin (1983, Theorem 3) prove that

    Yi(0), Yi(1) ⊥ D | x ⇒ Yi(0), Yi(1) ⊥ D | p(x)

    I Estimation

    ∆̂ATE(p(x)) = Ê[yi |p(xi) = p, D = 1] − Ê[yi |p(xi) = p, D = 0]
             →p E[Yi(1)|p(xi) = p] − E[Yi(0)|p(xi) = p]

    and then

    ∆̂ATE = Ê[∆̂ATE(p(x))] = ∫ ∆̂ATE(p(x)) f(p) dp = (1/N) ∑i ∆̂ATE(p(xi))

    I Similar story for other parameters, except the final step uses f(p|D = 1) or f(p|D = 0)

  • Example

    ID   D   Y    Y(0)   Y(1)   x1   x2   x3   p(x)
    1    0   10   10     18     12   29   0    0.25
    2    0   15   15     14     16   51   0    0.47
    3    1   14   15     14     12   36   1    0.50
    4    1   18   10     18     16   42   0    0.30

    Notes
    I Exact matching is still problematic since p(x) is continuous on the unit interval
    I Inexact matching assigns ωij > 0 when dij ≈ 0 ⇒ estimators are biased

  • Given dij, several weighting schemes are frequently used
    1 Single nearest neighbor (SNN)
    2 k-nearest neighbor (k-NN)
    3 Caliper (or radius)
    4 Kernel
    5 Local-linear
    6 Stratification (or interval or subclassification or blocking)

    Biggest difference is whether all controls are utilized or only a subset when estimating the missing counterfactual for obs i
    I (1) – (3) use a subset
    I (4) – (5) is ambiguous; depends on kernel
    I (6) uses all

    Asymptotically, all inexact matching estimators are equivalent since the ‘inexactness’ disappears as N → ∞
    Finite sample performance can vary dramatically

  • Single Nearest Neighbor Matching

    Sets

    j* = arg min j:Dj=0 |dij|

    ⇒ ωij = 1 if j = j*, 0 otherwise

    Intuition: j* is ‘closest’ to i, but with different treatment assignment

    Estimated missing counterfactual given by

    Ŷi(0) = yj* = Yj*(0)
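A minimal SNN sketch, matching with replacement on a known propensity score (simulated data; in practice p(x) would itself be estimated, and all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 4000
x = rng.normal(0, 1, N)
p = 1 / (1 + np.exp(-x))                 # true propensity score
D = rng.binomial(1, p)
y = x + 2.0 * D + rng.normal(0, 0.5, N)  # true treatment effect = 2

treated = np.where(D == 1)[0]
controls = np.where(D == 0)[0]

# For each treated obs, the single nearest control on |p_i - p_j|
effects = []
for i in treated:
    j_star = controls[np.argmin(np.abs(p[controls] - p[i]))]
    effects.append(y[i] - y[j_star])     # Y_i(1) - Yhat_i(0)
att_hat = np.mean(effects)
```

Matching with replacement lets the same control serve several treated observations, which keeps matches close at the cost of reusing information.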


  • k-Nearest Neighbor Matching

    Sets

    {j*} = k-arg min j:Dj=0 |dij|

    ⇒ ωij = 1/k if j ∈ {j*}, 0 otherwise

    Intuition: compute the average of the k ‘closest’ to i, but with different treatment assignment than i

    Estimated missing counterfactual given by

    Ŷi(0) = (1/k) ∑j* yj* = (1/k) ∑j* Yj*(0)

  • Caliper or Radius Matching (Cochran & Rubin 1973)

    Sets

    {j*} = {j : |dij| < ε}

    for a specified value of ε

    ⇒ ωij = 1/ki if j ∈ {j*}, 0 otherwise

    Intuition: compute the average over all ki obs that are ‘closer’ to i than ε, but with different treatment assignment than i

    Estimated missing counterfactual given by

    Ŷi(0) = (1/ki) ∑j* yj* = (1/ki) ∑j* Yj*(0)

  • Kernel Matching (Smith & Todd 2005)

    Sets

    {j*} = {j : |dij / aN| ≤ ε}

    ωij = G(dij / aN) / ∑j′∈{Dj′=0} G(dij′ / aN) if j ∈ {j*}, 0 otherwise

    where G(·) is the kernel function and aN is the bandwidth

    Intuition: compute the weighted average over all ki obs that receive positive weight given the choice of G(·) and aN, but with different treatment assignment than i

    Some kernel functions (e.g., the Gaussian) have unbounded support, and thus all controls receive positive weight for obs i
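Kernel matching with an Epanechnikov kernel can be sketched as follows (simulated data; the bandwidth aN = 0.05 is an arbitrary tuning choice, not a recommendation from the lecture):

```python
import numpy as np

def epanechnikov(u):
    # kernel with support |u| <= 1, i.e. epsilon = 1
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

rng = np.random.default_rng(8)
N = 3000
x = rng.normal(0, 1, N)
p = 1 / (1 + np.exp(-x))                 # true propensity score
D = rng.binomial(1, p)
y = x + 2.0 * D + rng.normal(0, 0.5, N)  # true treatment effect = 2

a_N = 0.05                               # bandwidth (tuning choice)
controls = np.where(D == 0)[0]
effects = []
for i in np.where(D == 1)[0]:
    G = epanechnikov((p[controls] - p[i]) / a_N)
    if G.sum() > 0:                      # skip treated obs with no local support
        w = G / G.sum()                  # kernel weights over controls
        effects.append(y[i] - w @ y[controls])
att_hat = np.mean(effects)
```

Each treated observation's counterfactual is a weighted average of all controls within the bandwidth window, with weights declining in distance.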


  • Local Linear Matching (Smith & Todd 2005)

    Sets

    {j*} = {j : |dij / aN| ≤ ε}

    ωij = [Gij ∑j′∈{Dj′=0} Gij′ d²ij′ − (Gij dij)(∑j′∈{Dj′=0} Gij′ dij′)] / [∑j∈{Dj=0} Gij ∑j′∈{Dj′=0} Gij′ d²ij′ − (∑j′∈{Dj′=0} Gij′ dij′)²] if j ∈ {j*}, 0 otherwise

    where Gij = G(dij / aN)

    Intuition: similar to kernel matching, but differs in handling of weights assigned to obs when obs are distributed asymmetrically around i or when there are gaps in the distribution of the propensity score

  • Stratification

    Differs from above schemes (although it can be written as a matching estimator)

    Algorithm
    I Unit interval is divided into k intervals based on the propensity score
    I The average outcome of treated and untreated is computed within each interval
    I ∆̂ATE(k) = Ȳ1 − Ȳ0 is computed within each interval

    Finally

    ∆̂ATT = ∑k (N1k / N1) ∆̂ATE(k)

    where N1k is the number of treated within stratum k
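A stratification sketch on K = 20 equal-width propensity score bins (simulated data; the number of strata is a tuning choice, and strata with no treated or no controls are dropped):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 5000
x = rng.normal(0, 1, N)
p = 1 / (1 + np.exp(-x))                 # true propensity score
D = rng.binomial(1, p)
y = x + 2.0 * D + rng.normal(0, 0.5, N)  # true treatment effect = 2

K = 20
edges = np.linspace(0, 1, K + 1)
strata = np.clip(np.digitize(p, edges) - 1, 0, K - 1)

# Within-stratum difference in means, weighted by N1k/N1
att_num = 0.0
n1_used = 0
for k in range(K):
    t = (strata == k) & (D == 1)
    c = (strata == k) & (D == 0)
    if t.any() and c.any():              # need both groups within the stratum
        att_num += t.sum() * (y[t].mean() - y[c].mean())
        n1_used += t.sum()
att_hat = att_num / n1_used
```

Finer strata reduce bias from within-stratum differences in p(x) but leave fewer observations per cell.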


  • Example

    ID   D   y    Y(0)                                  Y(1)                                  p(x)
    1    0   10   10                                    ∑l∈{Dl=1} ω1l yl ≡ ω13 y3 + ω14 y4    0.3
    2    0   15   15                                    ∑l∈{Dl=1} ω2l yl ≡ ω23 y3 + ω24 y4    0.4
    3    1   14   ∑l∈{Dl=0} ω3l yl ≡ ω31 y1 + ω32 y2    14                                    0.5
    4    1   18   ∑l∈{Dl=0} ω4l yl ≡ ω41 y1 + ω42 y2    18                                    0.6

  • Selection on Observables
    Strong Ignorability: Matching (Implementation)

    Several practical issues are confronted when implementing matching estimators

    1 Distance metric is unknown
    2 With or without replacement
    3 Trimming
    4 Balance of the covariates, x
    5 Covariate adjustment
    6 Variable selection
    7 Weighting scheme
    8 Failure of CIA
    9 Non-binary treatments
    10 Inference

  • 1. Distance metric is unknown
    I Euclidean-type distance metrics are of the form

    dij = (xi − xj)′ W (xi − xj)

    F Requires estimation of W if W depends on variances and/or covariances of x’s
    F Just use plug-in estimator with sample values
    I Propensity score distance metrics depend on p(x)
    F Estimation typically based on logit or probit
    F Some discussion of more complex estimators such as semi- or non-parametric estimators or machine learning algorithms
    F Not clear it matters much given the choice of x (discussed below)
    F Recommendation: Logit or probit is sufficient
    F Newer alternatives
    ⇒ Sant’Anna et al. (2018); R package: -ips- (integrated propensity score)
    ⇒ Imai & Ratkovic (2014); R package: -cbps- (covariate balancing propensity score)
    ⇒ Hainmueller (2012); Stata: -ebalance- (entropy balancing)
    F King & Nielsen (2018) strongly advocate against propensity score matching


  • 2. With or without replacement
    I After a control is matched to a treated observation under SNN, k-NN, or caliper matching, should it be removed as a possible match for subsequent treated observations?
    F Removed ⇒ matching without replacement
    F Not removed ⇒ matching with replacement
    I Choice has three implications:
    1 Match order: Without replacement implies that estimates are not invariant to match order
    2 Bias vs. efficiency: Without replacement leads to worse matches on average (↑ bias), but utilization of more controls (↑ efficiency)
    3 Inference: Without replacement makes inference (standard errors) easier since matched controls are independent
    I Recommendation: Without replacement if N0 >> N1, and match from highest p(xi) to lowest (since presumably the highest is most difficult to match)

  • 3. Trimming
    I May want to exclude control and/or treated observations that are deemed ‘too different’ from the remainder of the sample
    F Entails simply discarding some data prior to matching
    F Changes the interpretation of the parameter being estimated from, e.g.,

    ∆ATE = E[∆i]

    to

    ∆ATE = E[∆i | i ∈ C]

    where C is the trimmed subpopulation
    I Why trim?
    F CS region is defined as

    Sp = {p(x) : f(p|D = 1) > 0 and f(p|D = 0) > 0}

    F Matching estimates are only defined at values of p(x) ∈ Sp
    F In practice, may want to exclude obs outside Sp
    F To do so requires an estimate, Ŝp


  • 3. Trimming (cont.)
    I Implemented two ways in practice
    1 Discard obs outside of Ŝp, defined as

    Ŝp = {p(x) : p ∈ [ max{ min i∈{Di=0} p(xi), min i∈{Di=1} p(xi) }, min{ max i∈{Di=0} p(xi), max i∈{Di=1} p(xi) } ]}

    2 Drop if p(xi) ∉ [α, 1 − α], where α is chosen based on criteria developed in Crump et al. (2009) and may be equal to zero; see also Imbens & Rubin (2015)
    I Recommendation: Method 1 is very common, but Method 2 is preferred

    I Stata: -teffects overlap-
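Both trimming rules are simple to implement. A sketch (simulated propensity scores; α = 0.1 is often cited as the Crump et al. rule of thumb, used here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 5000
p = rng.beta(2, 2, N)                    # estimated propensity scores
D = rng.binomial(1, p)

# Method 1: keep p within the overlap of the two groups' [min, max] ranges
lo = max(p[D == 0].min(), p[D == 1].min())
hi = min(p[D == 0].max(), p[D == 1].max())
keep_minmax = (p >= lo) & (p <= hi)

# Method 2: drop p outside [alpha, 1 - alpha]
alpha = 0.1
keep_crump = (p >= alpha) & (p <= 1 - alpha)
```

Either way, the estimand changes to the treatment effect on the retained (trimmed) subpopulation.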


    [Figure: estimated propensity score densities for unionh = 0 and unionh = 1]

  • 4. Balance of the covariates, x
    I Matching provides unbiased estimates of average treatment effects because the data mimic a randomized experiment conditional on x or p(x)
    F By mimicking a random experiment, the x’s should be balanced (i.e., have the same distribution in expectation) across the matched treated and controls
    F In fact, Rosenbaum & Rubin (1983, Theorem 1) prove that the propensity score is a balancing score

    x ⊥ D | p(x)

    which holds regardless of the validity of CIA
    F This is an asymptotic property; balancing tests gauge finite sample performance
    F See Lee (2013), Imbens & Rubin (2015)

  • 4. Balance of the covariates, x (cont.)
    I Most common approach is to compare normalized differences in each x before and after matching

    norm-diff = |x̄1 − x̄0| / √[½(σ²x1 + σ²x0)]

    where x̄d is the sample mean of either the original data or weighted depending on how observations factor into the matching process
    F A norm-diff > 0.2 is considered ‘large’ (Rosenbaum & Rubin 1985)
    I Hotelling T² test
    F Test joint null of equal (weighted) means across treatment and control group

    T² = (x̄1 − x̄0)′ Σ̂⁻¹ (x̄1 − x̄0)

    I Regression-based test
    F Regress each x on a polynomial of p(x), D, and D interacted with the same polynomial of p(x) ...

    xi = φ0 + ∑Ss=1 φs p(xi)^s + π0 Di + ∑Ss=1 πs Di p(xi)^s + ηi

    and test H0: π0 = π1 = · · · = πS = 0
    F Regression may be unweighted or weighted
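A sketch of the two-sample Hotelling statistic on already-balanced simulated covariates (the n/2 scaling applies with equal group sizes; under the null the scaled statistic is approximately χ² with dim(x) degrees of freedom for large samples, so compare it to a χ² or F critical value):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
x1 = rng.normal(0, 1, (n, 3))     # matched treated covariates (balanced by design)
x0 = rng.normal(0, 1, (n, 3))     # matched control covariates

diff = x1.mean(axis=0) - x0.mean(axis=0)
S = 0.5 * (np.cov(x1.T) + np.cov(x0.T))        # pooled covariance (equal n's)
T2 = (n / 2) * diff @ np.linalg.solve(S, diff) # scaled Hotelling T^2
```

Large values of the statistic indicate the joint null of equal means is rejected, i.e., the x's remain imbalanced after matching.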


  • 4. Balance of the covariates, x (cont.)
    I Failure of matching to sufficiently balance some x’s ⇒
    1 Choose a different distance metric (e.g., alter the propensity score specification)
    2 Perform covariate adjustment within the matched sample (next)
    I Stata: -tebalance-

    [Figure: balance plot of propensity score densities, raw vs. matched samples, control and treated]

  • 5. Covariate adjustment
    I Inexact matching entails some bias due to this inexactness
    I Imbalance in the x’s exacerbates this bias
    I Adjusting for differences in x’s after matching can reduce this bias
    I How? Example for ATT:
    1 Regress Y on x using matched controls ⇒ β̂c
    2 Adjust the estimated missing counterfactual for treated obs i to

    Ŷi(0) = [1 / ∑j:Dj=0 ωij] ∑j:Dj=0 ωij [Yj + (xi − xj)β̂c]

    which leads to the following adjusted estimator for the ATT

    ∆̂ATT_adj = Ȳ1 − Ȳ0 − (x̄1 − x̄0)β̂c = ∆̂ATT − (x̄1 − x̄0)β̂c

    where means are computed over the matched treated and control obs
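A sketch of this bias adjustment under SNN matching (simulated data; the linear outcome model makes β̂c ≈ 1 by construction, and all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(12)
N = 3000
x = rng.normal(0, 1, N)
p = 1 / (1 + np.exp(-x))                 # true propensity score
D = rng.binomial(1, p)
y = x + 2.0 * D + rng.normal(0, 0.5, N)  # true treatment effect = 2

controls = np.where(D == 0)[0]
treated = np.where(D == 1)[0]

# Step 0: SNN match on the propensity score, with replacement
matches = np.array([controls[np.argmin(np.abs(p[controls] - p[i]))]
                    for i in treated])

# Step 1: regress Y on x among matched controls -> beta_c
Xc = np.column_stack([np.ones(matches.size), x[matches]])
coef, *_ = np.linalg.lstsq(Xc, y[matches], rcond=None)
beta_c = coef[1]

# Step 2: subtract the regression-predicted gap in x from each matched pair
raw = y[treated] - y[matches]
adj = raw - (x[treated] - x[matches]) * beta_c
att_adj = adj.mean()
```

The adjustment removes the residual bias coming from the inexactness of the matches on x.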


  • 5. Covariate adjustment (cont.)
    I Stata: -teffects nnmatch, biasadj-
    I Reference: Abadie & Imbens (2011)
    I Recommendation: Covariate adjustment is preferred to remove any residual differences in x’s

• 6. Variable selection (choice of x)
  I Three questions to ask:
    1 Which (unique) variables to include in x?
    2 Which higher order and interaction terms of the x's should be included?
    3 Are some x's more important than other x's?
  I Answers:
    1 Any variable that is pre-determined (i.e., unaffected by the treatment) and necessary for CIA, Y(0), Y(1) ⊥ D | x
      — Goal is not to predict treatment assignment
      — Do not include instrumental variables (Wooldridge 2016)
    2 Higher order and interaction terms can be added to improve covariate balance in the matched sample
      — It is fine to iterate: estimate distance, match, check balance, adjust, and repeat ...
    3 If so, one can manipulate the distance metric to match exactly on some x's
      — Can adjust W in Euclidean-type distance metrics
      — Can adjust p(x) so matches will be constrained to have identical values of certain x's


• 6. Variable selection (choice of x)
  I References: Rubin & Thomas (1996), HIT (1997), HIST (1998), Heckman & Smith (1999), Lechner (2002), Smith & Todd (2005), Hirano et al. (2003), Brookhart et al. (2006), Zhao (2007), Wooldridge (2009), Pearl (2009), Shaikh et al. (2009), Millimet & Tchernis (2009), Imbens & Rubin (2015)


• 7. Weighting scheme
  I SNN minimizes bias, but is less efficient
  I Multiple matches per treated obs (via k-NN or caliper) may be beneficial if N0 >> N1
  I Kernel matching may be optimal in a mean-squared error sense
  I Local linear matching may be optimal if many obs have p(x) close to the boundary of 0 or 1
  I Stratification is intuitive, but requires a potentially ad hoc choice of strata
  I References: Huber et al. (2013), Busso et al. (2014)
  I Recommendation: Try several and compare


• 8. Failure of CIA
  I Presence of unobserved attributes correlated with both treatment assignment and potential outcomes invalidates CIA
    F Implies the treatment is not random conditional on observed x alone
    F This implies estimated average treatment effects are (more) biased (than just due to inexactness of matches)
  I Four options:
    1 Gather additional data such that CIA holds
      — This may entail use of panel data
      — Leads to difference-in-differences matching based on changes in outcomes
    2 Assess how important such unobserved attributes would have to be in order to explain estimated treatment effects obtained by matching
    3 Alter the estimand
    4 Different estimator (selection on unobservables approach)


• Difference-in-Differences Matching

  Consider ∆ATT

    ∆ATT(p(x)) = { E[Y(1)|p(x), D = 1] − E[Y(0)|p(x), D = 0]
                 + E[Y(0)|p(x), D = 0] − E[Y(0)|p(x), D = 1] }

  where matching estimators are based on

    ∆̃ATT(p(x)) = E[Y(1)|p(x), D = 1] − E[Y(0)|p(x), D = 0]

  which implies

    bias = ∆̃ATT(p(x)) − ∆ATT(p(x))
         = E[Y(0)|p(x), D = 1] − E[Y(0)|p(x), D = 0]
           (counterfactual)      (observed)

  which is zero under CIA


• Rearranging terms yields

    ∆ATT(p(x)) = ∆̃ATT(p(x)) − bias

  This suggests a bias-corrected estimator is feasible if the bias can be consistently estimated

  Might assume the bias equals the difference in mean outcomes prior to treatment

    bias = E[Yt(0)|p(x), D = 1] − E[Yt(0)|p(x), D = 0]
        ?= E[Yt′(0)|p(x), D = 1] − E[Yt′(0)|p(x), D = 0]

  where t′ precedes the treatment and t is post-treatment


• Implies

    ∆̃̃ATT(p(x)) = ∆̃ATT(p(x)) − bias
                = E[Yt(1)|p(x), D = 1] − E[Yt(0)|p(x), D = 0]
                  − { E[Yt′(0)|p(x), D = 1] − E[Yt′(0)|p(x), D = 0] }
                = E[Yt(1) − Yt′(0)|p(x), D = 1] − E[Yt(0) − Yt′(0)|p(x), D = 0]

  and ∆̃̃ATT(p(x)) = ∆ATT(p(x)) requires

    E[Yt(0) − Yt′(0)|p(x), D = 1] = E[Yt(0) − Yt′(0)|p(x), D = 0]

  which is different than the original CIA


• Implementation: difference the data ∀i, then match to estimate the ATE, ATT, or ATU

  DID matching requires the original CIA be replaced with

    ∆Y(0), ∆Y(1) ⊥ D | p(x)

  Intuition:
  I DID matching requires the change in potential outcomes to be independent of treatment assignment given the PS
  I Equivalently, there are no time-varying unobservables correlated with both the changes in outcomes and treatment assignment given x

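The implementation step above (difference, then match) can be sketched as follows. This is my own illustration, not course code; for brevity it matches 1-NN on x directly rather than on an estimated propensity score.

```python
# Illustrative sketch: DID matching for the ATT — difference each unit's
# outcome over time, then 1-NN match treated to controls (here on x itself).
import numpy as np

def did_match_att(y_pre, y_post, d, x):
    """ATT from matched differences, Δy_i = y_post,i − y_pre,i."""
    dy = np.asarray(y_post, float) - np.asarray(y_pre, float)
    d = np.asarray(d).astype(bool)
    x = np.asarray(x, float)
    # nearest control (in x) for each treated obs
    j = np.abs(x[~d][None, :] - x[d][:, None]).argmin(axis=1)
    return float(np.mean(dy[d] - dy[~d][j]))
```

Because matching is on differenced outcomes, time-invariant unobservables (unit fixed effects) correlated with treatment drop out, exactly as the modified CIA requires.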

• Rosenbaum Bounds

  Method of assessing sensitivity of a matching estimator to an unobserved confounder (Rosenbaum 2002)

  Assume

    p(xi) = F(xi β + γui) = exp(xi β + γui) / [1 + exp(xi β + γui)]

  where u is an unobserved binary variable and F is the logistic CDF

  Implications
  I Odds ratio for obs i is

      p(xi) / [1 − p(xi)] = exp(xi β + γui)

  I Odds ratio for obs i relative to obs i′ is

      { p(xi) / [1 − p(xi)] } / { p(xi′) / [1 − p(xi′)] } = exp(xi β + γui) / exp(xi′ β + γui′)
                                                          = exp{γ(ui − ui′)} if xi = xi′

  I Thus, two observationally identical obs have different probabilities of being treated if γ ≠ 0 and ui ≠ ui′


• How does inference regarding the treatment effect parameters change as γ and ui − ui′ change?
  I Since u is binary, ui − ui′ ∈ {−1, 0, 1}
  I Implies

      1/exp{γ} ≤ { p(xi) / [1 − p(xi)] } / { p(xi′) / [1 − p(xi′)] } ≤ exp{γ}

    where
      F exp{γ} = 1 ⇒ no selection bias
      F exp{γ} → ∞ ⇒ greater selection bias
  I Rosenbaum bounds compute bounds on the significance level of the matching estimate as exp{γ} changes values
      F If the matching estimate is statistically insignificant even when exp{γ} ≈ 1, then the treatment effect is not robust
      F If the matching estimate is statistically significant even when exp{γ} is 'large', then the treatment effect is "not sensitive to hidden bias"

  Stata: -rbounds-, -mhbounds-


• Simulation Approach

  Nannicini (2007) and Ichino et al. (2008) propose an alternative method of assessing the robustness of ATT estimates obtained under CIA

  Intuition:
  I If there is a relevant unobservable that invalidates the CIA, but such that CIA would hold if one observed this variable, then estimation is straightforward if one can simulate/impute this unobserved variable
  I The sensitivity analysis proceeds by comparing the baseline matching estimate to estimates obtained after additionally conditioning upon the simulated confounder
  I The unobserved variable can be simulated in different ways to capture different hypotheses regarding the nature of potential confounders


• Setup
  I The parameter of interest is the ∆ATT
  I Accordingly, Y(0) ⊥ D | x denotes the required CIA
  I Suppose that this condition is not met, but if an unobservable, U, is added then a stronger CIA holds

      Y(0) ⊥ D | x, U

  I Implies

      E[Y(0)|D = 1, x] ≠ E[Y(0)|D = 0, x]
      E[Y(0)|D = 1, x, U] = E[Y(0)|D = 0, x, U]


• Solution
  I Simulate the potential confounder and use it as a matching covariate
    F For simplicity, the potential outcomes and the confounding variable are assumed to be binary
    F Conditional independence of U and x is also assumed
    F Hence, the distribution of U is fully characterized by the choice of the following four parameters

        pij ≡ Pr(U = 1|D = i, y = j) = Pr(U = 1|D = i, y = j, x),  i, j ∈ {0, 1}

    F Given the parameters pij, a value of U is simulated for each observation depending on D, y
  I ∆ATT is then estimated with U as an additional matching covariate

  For a given set of the parameters pij, many simulations are performed, ∆ATT is computed for each simulation, and the mean/sd of the estimates reported


• Choosing pij ...
  I It is essential to consider useful potential confounders
  I Calibrated confounders: choose pij to make the distribution of U similar to the empirical distribution of observable binary covariates
  I Killer confounders: search over different pij for the existence of a U which makes ∆ATT = 0
  I One can also simulate other meaningful confounders by setting the parameters pij and pi·, where pi· can be computed as

      pi· ≡ Pr(U = 1|D = i) = Σ_{j=0}^{1} pij · Pr(y = j|D = i),  i ∈ {0, 1}

  Stata: -sensatt-

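The simulation step — drawing U from the chosen cell probabilities pij — can be sketched as below. This is my own illustration of the mechanism, not the -sensatt- implementation.

```python
# Illustrative sketch: simulate a binary confounder U with
# Pr(U = 1 | D = i, y = j) = p[i, j], independently of x by assumption.
import numpy as np

def simulate_confounder(d, y, p, rng):
    """Draw U_i = 1 with probability p[d_i, y_i] for each observation."""
    d = np.asarray(d).astype(int)
    y = np.asarray(y).astype(int)          # binary outcome
    probs = np.asarray(p, float)[d, y]     # per-observation cell probability
    return (rng.random(len(d)) < probs).astype(int)
```

Each call produces one simulated U; the sensitivity analysis repeats this many times, re-estimating ∆ATT with U added to the matching covariates each time.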

• Partial Conditional Independence

  See Masten & Poirier (2018)

  Intuition: Allow for small deviations from CIA

    |Pr(D = 1|Y(0), Y(1), x) − Pr(D = 1|x)| ≤ c

  I CIA implies that c = 0
  I Partial CIA implies c > 0, but 'small'

  Estimation proceeds by bounding average treatment effect parameters conditional on the choice of c

  Stata code forthcoming


• Minimum Bias Approach

  See Millimet & Tchernis (2013), McCarthy et al. (2014)

  Intuition: Restrict the region of analysis to minimize the bias of matching estimators arising from the failure of CIA

  Disadvantages:
  I Requires a lot of structure to estimate this region
  I Interpretation of the estimated treatment effect changes to the average treatment effect in this region

  Stata: -bmte-


• 9. Non-binary treatments
  I In some instances, there may be a vector of treatments, D ∈ {0, 1, 2, ..., T}
  I This alters the analysis in three ways
    1 Many possible parameters of interest; e.g.,

        ∆^{d″}_{dd′} = E[Y(d) − Y(d′)|D = d″]  ∀ d, d′, d″

      (E.g., what is the expected effect of job training (d) relative to job search assistance (d′) for those who did nothing (d″)?)
    2 Altered identification assumptions: conditional independence of all potential outcomes given x, or only some?
    3 Now, must estimate the generalized propensity score (GPS) given by

        e(d, x) = Pr(D = d|x)

      which may be estimated using multinomial or ordered models
  I With D representing variable treatment intensity or D continuous, treatment effects are referred to as the dose-response function
  I Recent work in Lee (2018)
  I Stata: -teffects multivalued-, -poparms-


• 10. Inference
  I Usual t-test for the difference in mean outcomes across matched treated and untreated groups ignores estimation of the propensity score and the nature of matching
  I Smooth matching estimators (e.g., kernel matching) rely on the bootstrap
  I Standard bootstrap fails in the case of non-smooth matching estimators
  I Abadie & Imbens (2006) provide asymptotic standard errors for Euclidean matching estimators
  I See also Abadie & Imbens (2008, 2011, 2012, 2016), Otsu & Rai (2017), Rothe (2017), Bodory et al. (2018), Yang & Ding (2018)

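For the smooth-estimator case above, the pairs bootstrap can be sketched as follows (my own illustration, not course code): resample observations with replacement and re-estimate everything, including the propensity score, in each draw.

```python
# Illustrative sketch: pairs bootstrap standard error for a smooth estimator.
# `estimator` must re-do all estimation steps from the resampled data.
import numpy as np

def bootstrap_se(estimator, data, n_boot=200, seed=0):
    """SD of the estimator across resampled datasets (rows resampled jointly)."""
    rng = np.random.default_rng(seed)
    n = len(data[0])
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        stats.append(estimator(*[np.asarray(a)[idx] for a in data]))
    return float(np.std(stats, ddof=1))
```

Resampling (y, D, x) jointly is what lets the bootstrap capture the first-stage estimation of p̂(x); it is valid for smooth estimators like kernel matching but, as noted above, fails for non-smooth nearest-neighbor matching.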

• Selection on Observables
  Strong Ignorability: Inverse Propensity Score Weighting (IPW) Estimators

  Alternative to matching estimators, but still rely on estimating the propensity score

  Identities

    E[Dy/p(x)] = E[DY(1)/p(x)]
               = E[ E[DY(1)/p(x) | x] ]
               = E[ (1/p(x)) E[DY(1) | x] ]
               = E[ (1/p(x)) E[D | x] E[Y(1) | x] ]   (by CIA)
               = E[ (p(x)/p(x)) E[Y(1) | x] ]
               = E[ E[Y(1) | x] ] = E[Y(1)]

  and, similarly,

    E[ (1 − D)y / (1 − p(x)) ] = E[Y(0)]


• Parameters of interest (Horvitz & Thompson 1952)

    ∆ATE = E[ Dy/p(x) − (1 − D)y/(1 − p(x)) ] = E{ [D − p(x)] y / (p(x)[1 − p(x)]) }

    ∆ATT = E{ [D − p(x)] y / (Pr(D = 1)[1 − p(x)]) }

    ∆ATU = E{ [D − p(x)] y / (Pr(D = 0) p(x)) }

  Proof: Wooldridge (2002, p. 615)

  Estimation: Replace p(x) with p̂(x) and expectations and probabilities with sample means


• Normalized estimators (Hirano & Imbens 2001)
  I ∆̂ATE is the difference in two weighted averages, where the weights are

      Di / [N p̂(xi)]  and  (1 − Di) / [N(1 − p̂(xi))]

  I Problem: the weights may not sum to unity
  I HI normalize the weights by their sums within the treated and untreated groups
  I Unnormalized estimator assigns equal weights of 1/N to each observation
  I Normalized estimator (e.g., ∆̂ATE):

      ∆̂ATE = [ Σi Di yi / p̂(xi) ] / [ Σi Di / p̂(xi) ]
            − [ Σi (1 − Di) yi / (1 − p̂(xi)) ] / [ Σi (1 − Di) / (1 − p̂(xi)) ]

  I Tends to be more stable in practice as it restricts weights to ≤ 1; Millimet & Tchernis (2009), Busso et al. (2011) find it performs better

  Stata: -teffects ipw-

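The unnormalized and normalized ATE estimators above differ only in the denominator used for each weighted average. A minimal sketch (my own, not -teffects ipw-), taking p̂(x) as given:

```python
# Illustrative sketch: Horvitz–Thompson (unnormalized) vs. Hirano–Imbens
# (normalized) IPW estimates of the ATE, with the propensity score given.
import numpy as np

def ipw_ate(y, d, ps, normalized=True):
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    w1, w0 = d / ps, (1 - d) / (1 - ps)        # inverse-PS weights
    if normalized:
        # weights within each group sum to one
        return float((w1 @ y) / w1.sum() - (w0 @ y) / w0.sum())
    n = len(y)
    return float((w1 @ y) / n - (w0 @ y) / n)  # equal 1/N weighting
```

Dividing by the realized weight sums (rather than N) is what bounds each observation's normalized weight by 1 and stabilizes the estimator when some p̂(xi) are small.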

• Easily extended to multi-valued treatments
  I Vector of treatments given by D ∈ {0, 1, 2, ..., T}
  I Treatment parameters defined as

      ∆^{d″}_{dd′} = E[Y(d) − Y(d′)|D = d″]  ∀ d, d′, d″

  I Generalized propensity score given by

      e(d, x) = Pr(D = d|x)

  I Estimators utilize sample counterparts to the following equalities

      E[ Dd y / e(d, x) ] = E[Y(d)]

      E[ Dd y e(d′, x) / e(d, x) ] = E[Y(d)|D = d′]

  I See Uysal (2015)


• Weighting estimators can be extremely sensitive if obs have p̂(x) ≈ 0 or 1
  I Typically necessary to trim the sample
  I Fairly ad hoc

  Standard errors obtained via bootstrap

  Other weighting schemes considered in Li et al. (2018)


• Selection on Observables
  Strong Ignorability: Propensity Score Residual Estimation

  Lee (2017) proposes a computationally simple estimator relying on the propensity score

  Estimator obtained via OLS estimation of

    yi − ȳ = ∆[Di − p̂(xi)] + εi

  where p̂(xi) = Φ(xi γ̂) is obtained via a probit model and

    plim ∆̂ = E[ ω(x) E[Y(1) − Y(0)|x] ] ≠ ∆ATE

  with

    ω(x) ≡ p(x)[1 − p(x)] / E{ p(x)[1 − p(x)] }

  However, finite sample performance of the estimator is good

  Standard errors obtained via bootstrap or asymptotic formula

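The propensity score residual regression above reduces to a single no-intercept OLS slope. A minimal sketch (my own illustration), taking p̂(x) as given rather than estimating the probit:

```python
# Illustrative sketch: Lee's (2017) propensity-score-residual estimator —
# regress demeaned y on the residual D − p̂(x), no intercept.
import numpy as np

def psr_estimate(y, d, ps):
    """Slope from regressing (y − ȳ) on (D − p̂)."""
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    r = d - ps                                  # propensity score residual
    return float((r @ (y - y.mean())) / (r @ r))
```

With a homogeneous treatment effect, the ω(x)-weighted average in the plim collapses to the effect itself, so the estimator recovers it; under heterogeneous effects it converges to the overlap-weighted average instead of the ATE.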

• Two alternative estimators that may perform better
    1 OLS estimation of

        yi − Γ̂i = ∆[Di − p̂(xi)] + εi

      where Γ̂i = Σ_{j=0}^{q} δj (xi γ̂)^j
      F I.e., yi − Γ̂i = residual from the OLS regression of yi on a constant and a polynomial in the linear index from the propensity score model, xi γ̂
      F Motivation is to improve performance if p(x) is mis-specified
      F Lee (2017) suggests q = 2 or 3
    2 OLS estimation of

        (yi − ȳ) / √( p̂(xi)[1 − p̂(xi)] ) = ∆ (Di − p̂(xi)) / √( p̂(xi)[1 − p̂(xi)] ) + εi

      which is consistent for ∆ATE, but may do poorly in practice if p̂(xi) ≈ 0, 1 for some observations


• Method extends easily to non-binary treatments
  I Let D ∈ {0, 1, 2, ..., T} and Dd = I(D = d)
  I Let p(x) = {p1(x), p2(x), ..., pT(x)} be the set of propensity scores estimated via multinomial or ordered probit
  I Estimator obtained via OLS estimation of

      yi − ȳ = Σd ∆d [Ddi − p̂d(xi)] + εi

    or

      yi − Γ̂i = Σd ∆d [Ddi − p̂d(xi)] + εi

    where Γ̂i = Σ_{j=0}^{q} δj (xi γ̂)^j if an ordered probit is used, and

      ∆dd′ = ∆^{d″}_{dd′} = ∆d − ∆d′


• Selection on Observables
  Strong Ignorability: Double-Robust Estimators

  Robins & Rotnitzky (1995), Hirano & Imbens (2001), Lunceford & Davidian (2004), and others discuss DR estimators

  DR estimators combine regression and weighting estimators and are doubly robust because they are consistent as long as
  I The regression specification for the outcome is correctly specified, or
  I The propensity score specification is correctly specified

  DR is a class of estimators that possess this property

  Intuition: Control for x twice, once via linear regression and once via the propensity score, as a means of overkill

  Several DR estimators exist in the literature


• Estimation

  OLS estimation

    yi = α0 + xi β + α̃1 Di + θ0 [Di / p̂(xi)] + θ1 [(1 − Di) / (1 − p̂(xi))] + ε̃i

    ∆̂ATE = α̃̂1 + (1/N) Σi [ θ̂0 Di / p̂(xi) − θ̂1 (1 − Di) / (1 − p̂(xi)) ]

    ∆̂ATT = α̃̂1 + (1/N1) Σ_{i:Di=1} [ θ̂0 Di / p̂(xi) − θ̂1 (1 − Di) / (1 − p̂(xi)) ]

    ∆̂ATU = α̃̂1 + (1/N0) Σ_{i:Di=0} [ θ̂0 Di / p̂(xi) − θ̂1 (1 − Di) / (1 − p̂(xi)) ]


• WLS estimation: ATE or ATT

    yi = α0 + xi β + α̃1 Di + υ̃i

  where the weights are

    λ^ATE_i = √( Di / p̂(xi) + (1 − Di) / (1 − p̂(xi)) )

    λ^ATT_i = √( Di + (1 − Di) p̂(xi) / (1 − p̂(xi)) )

  and similarly for ATU


• Augmented IPW: ATE (Lunceford & Davidian 2004; Glynn & Quinn 2010)

    ∆̂ATE = (1/N) Σi [ (Di yi − (Di − p̂(xi)) μ̂1(xi)) / p̂(xi)
                      − ((1 − Di) yi + (Di − p̂(xi)) μ̂0(xi)) / (1 − p̂(xi)) ]

  where μ̂0(xi) and μ̂1(xi) are estimated via separate OLS regressions of y on x

  Stata: -dr-, -teffects aipw-

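The AIPW display above maps directly to code. A minimal sketch (my own illustration, not -teffects aipw-), with μ̂0, μ̂1 from separate OLS fits and p̂(x) taken as given:

```python
# Illustrative sketch: augmented IPW (doubly robust) ATE estimator,
# term for term as in the Lunceford–Davidian expression above.
import numpy as np

def aipw_ate(y, d, x, ps):
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    X = np.column_stack([np.ones(len(y)), np.atleast_2d(np.asarray(x, float).T).T])
    b1 = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)[0]  # mu1: OLS on treated
    b0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]  # mu0: OLS on controls
    mu1, mu0 = X @ b1, X @ b0
    term1 = (d * y - (d - ps) * mu1) / ps
    term0 = ((1 - d) * y + (d - ps) * mu0) / (1 - ps)
    return float(np.mean(term1 - term0))
```

The (Di − p̂(xi))μ̂(xi) augmentation terms have mean zero when either the outcome model or the propensity score is correct, which is the source of the double robustness.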

• Uysal (2015) discusses doubly robust estimators of multi-valued treatment effects

  Setup
  I Vector of treatments given by D ∈ {0, 1, 2, ..., T}
  I Treatment parameters defined as

      ∆^{d″}_{dd′} = E[Y(d) − Y(d′)|D = d″]  ∀ d, d′, d″

  I Generalized propensity score given by

      e(d, x) = Pr(D = d|x)

    which may be estimated using multinomial or ordered logit/probit models


• Estimation
  I WLS estimation: ∆dd′ or ∆^{d″}_{dd′}

      yi = Σd ∆d Ddi + Σd Ddi (xi − x̄) βd + εi

    where

      ∆dd′ = ∆^{d″}_{dd′} = ∆d − ∆d′

    and the weights are

      λ^{∆dd′}_i = √( Σd Ddi / ê(d, xi) )

      λ^{∆^{d″}_{dd′}}_i = √( Σd Ddi ê(d″, xi) / ê(d, xi) )

  I Standard errors obtained via bootstrap


• Waernbaum & Pazzagli (2017, WP)
  I Assess the bias of DR estimators when the propensity score and outcome equations are both mis-specified
  I Assess situations when the bias of DR estimators is smaller than the bias of IPW estimators (normalized and non-normalized)


• Selection on Observables
  Regression Discontinuity

  First introduced in Thistlethwaite & Campbell (1960)

  Two classes of models: sharp, fuzzy
  I Sharp RD is a selection on observables estimator