
11 BEST PRACTICES IN QUASI-EXPERIMENTAL DESIGNS

Matching Methods for Causal Inference

ELIZABETH A. STUART

DONALD B. RUBIN

Many studies in social science that aim to estimate the effect of an intervention suffer from treatment selection bias, where the units who receive the treatment may have different characteristics from those in the control condition. These preexisting differences between the groups must be controlled to obtain approximately unbiased estimates of the effects of interest. For example, in a study estimating the effect of bullying on high school graduation, students who were bullied are likely to be very different from students who were not bullied on a wide range of characteristics, such as socioeconomic status and academic performance, even before the bullying began. It is crucial to try to separate out the causal effect of the bullying from the effect of these preexisting differences between the “treated” and “control” groups. Matching methods provide a way to attempt to do so.

Random assignment of units to receive (or not receive) the treatment of interest ensures that there are no systematic differences between the treatment and control groups before treatment assignment. However, random assignment is often infeasible in social science research, due to either ethical or practical concerns. Matching methods constitute a growing collection of techniques that attempt to replicate, as closely as possible, the ideal of randomized experiments when using observational data.

There are two key ways in which the matching methods we discuss replicate a randomized experiment. First, matching aims to select subsamples of the treated and control groups that are, at worst, only randomly different from one another on all observed covariates. In other words, matching seeks to identify subsamples of treated and control units that are “balanced” with respect to observed covariates: The observed covariate distributions are essentially the same in the treatment and control groups. The methods described in this chapter examine how best to choose subsamples from the original treated and control groups such that the distributions of covariates in the matched groups are substantially more similar than in the original groups, when this is possible.

A second crucial similarity between a randomized experiment and a matched observational study is that each study has two clear stages. The first stage is design, in which the units to be compared are selected, without use of the values of the outcome variables. Like the design of a randomized experiment, the matches are chosen without access to any of the outcome data, thereby preventing intentional or unintentional bias when selecting a particular matched sample to achieve a desired result. Only after the design is set does the second stage begin, which involves the analyses of the outcome, estimating treatment effects using the matched sample. We only discuss propensity score methods that are applicable at the design stage in the sense that they do not involve any outcome data. Some methods that use propensity scores, including some weighting techniques, can involve outcome data, and such methods are not discussed here.

This chapter reviews the diverse literature on matching methods, with particular attention paid to providing practical guidance based on applied and simulation results that indicate the potential of matching methods for bias reduction in observational studies. We first provide an introduction to the goal of matching and a very brief history of these methods; the second section presents the theory and motivation behind propensity scores, discussing how they are a crucial component when using matching methods. We then discuss other methods of controlling for covariates in observational studies, such as regression analysis, and explain why matching methods (particularly when combined with regression in the analysis stage) are more effective. The implementation of matching methods, including challenges and evaluations of their performance, is then discussed. We conclude with recommendations for researchers and a discussion of software currently available. Throughout the chapter, we motivate the methods using data from the National Supported Work Demonstration (Dehejia & Wahba, 1999; LaLonde, 1986).

Designing Observational Studies

The methods described here are relevant for two types of situations. The first, which is arguably more common in social science research, is a situation where all covariate and outcome data are already available on a large set of units, but a subset of those units will be chosen for use in the analysis. This subsetting (or “matching”) is done with the aim of selecting subsets of the treated and control groups with similar observed covariate distributions, thereby increasing robustness in observational studies by reducing reliance on modeling assumptions. The main objective of the matching is to reduce bias. But what about variance? Although discarding units in the matching process will result in smaller sample sizes and thus might appear to lead to increases in sampling variance, this is not always the case because improved balance in the covariate distributions will decrease the variance of estimators (Snedecor & Cochran, 1980). H. Smith (1997) gives an empirical example where estimates from one-to-one matching have lower estimated standard deviations than estimates from a linear regression, even though thousands of observations were discarded in the one-to-one matching, and all were used in the regression.

The second situation is one in which outcome data are not yet collected on the units, and cost constraints prohibit measuring the outcome variables on all units. In that situation, matching methods can help choose for follow-up the control units most similar to those in the treated group. The matching identifies those control units who are most similar to the treated units so that, rather than random samples of units being discarded, the units discarded are those most irrelevant as points of comparison with the treated units. This second situation motivated much of the early work in matching methods (Althauser & Rubin, 1970; Rubin, 1973a, 1973b), which compared the benefits of choosing matched versus random samples for follow-up.

Matching methods can be considered as one method for designing an observational study, in the sense of selecting the most appropriate data for reliable estimation of causal effects, as discussed in Cochran and Rubin (1973), Rubin (1977, 1997, 2004), Rosenbaum (1999, 2002), and Heckman, Ichimura, and Todd (1997). These papers stress the importance of carefully designing an observational study by making appropriate choices when it is impossible to have full control (e.g., randomization). The careful design of an observational study must involve making careful choices about the data used in making comparisons of outcomes in treatment and control conditions.

Other approaches that attempt to control for covariate differences between treated and control units include regression analysis or selection models, which estimate parameters of a model for the outcome of interest conditional on the covariates (and a treatment/control indicator). Matching methods are preferable to these model-based adjustments for two key reasons. First, matching methods do not use the outcome values in the design of the study and thus preclude the selection of a particular design to yield a desired result. As stated by Rubin (2001),

Arguably, the most important feature of experiments is that we must decide on the way data will be collected before observing the outcome data. If we could try hundreds of designs and for each see the resultant answer, we could capitalize on random variation in answers and choose the design that generated the answer we wanted! The lack of availability of outcome data when designing experiments is a tremendous stimulus for “honesty” in experiments and can be in well-designed observational studies as well. (p. 169)

Second, when there are large differences in the covariate distributions between the groups, standard model-based adjustments rely heavily on extrapolation and model-based assumptions. Matching methods highlight these differences and also provide a way to limit reliance on the inherently untestable modeling assumptions and the consequential sensitivity to those assumptions.

Matching methods and regression-based model adjustments should also not be seen as competing methods but rather as complementary, which is a decades-old message. In fact, as discussed earlier, much research over a period of decades (Cochran & Rubin, 1973; Ho, Imai, King, & Stuart, 2007; Rubin, 1973b, 1979; Rubin & Thomas, 2000) has shown that the best approach is to combine the two methods by, for example, doing regression adjustment on matched samples. Selecting matched samples reduces bias due to covariate differences, and regression analysis on those matched samples can adjust for small remaining differences and increase efficiency of estimates. These approaches are similar in spirit to the recent “doubly robust” procedures of Robins and Rotnitzky (2001), which provide consistent estimation of causal effects if either the model of treatment assignment (e.g., the propensity scores) or the model of the outcome is correct, although these latter methods are more sensitive to a correctly specified model used for weighting and generally do not have the clear separation of stages of design and analysis that we advocate here.

The National Supported Work Demonstration

The National Supported Work (NSW) Demonstration was a federally and privately funded randomized experiment done in the 1970s to estimate the effects of a job training program for disadvantaged workers. Since a series of analyses beginning in the 1980s (Dehejia & Wahba, 1999, 2002; LaLonde, 1986; J. Smith & Todd, 2005), the data set from this study has become a canonical example in the literature on matching methods.

In the NSW Demonstration, eligible individuals were randomly selected to participate in the training program. Treatment group members and control group members (those not selected to participate) were followed up to estimate the effect of the program on later earnings. Because the NSW program was a randomized experiment, the difference in means in the outcomes between the randomized treated and control groups is an unbiased estimate of the average treatment effect for the subjects in the randomized experiment; it indicated that, on average, among all male participants, the program raised annual earnings by approximately $800.

To investigate whether nonexperimental methods could yield a result similar to that from the randomized experiment, LaLonde (1986) used such methods to estimate the treatment effect, with the experimental estimate of the treatment effect as a benchmark. LaLonde used, in analogy with then-current econometric practice, two sources of comparison units, both large national databases: the Panel Study of Income Dynamics (PSID) and the Current Population Survey (CPS). LaLonde found that the nonexperimental methods gave a wide range of impact estimates, ranging from approximately −$16,000 to $700, and concluded that it was difficult to replicate the experimental results with any of the nonexperimental methods available at that time.

In the 1990s, Dehejia and Wahba (1999) used propensity score matching methods to estimate the effect of the NSW program, using comparison groups similar to those used by LaLonde. They found that most of the comparison group members used by LaLonde were in fact very dissimilar to the treated group members and that by restricting the analysis to the comparison group members who looked most similar to the treated group, they were able to replicate results found in the NSW experimental data. Using the CPS, which had a larger pool of individuals comparable to those in the treated group, for the sample of men with 2 years of pretreatment earnings data available, Dehejia and Wahba (1999) obtained a range of treatment effect estimates of $1,559 to $1,681, quite close to the experimental estimate of approximately $1,800 for the same sample. Although there is still debate regarding the use of nonexperimental data to estimate the effects of the NSW program (see, e.g., Dehejia, 2005; J. Smith & Todd, 2005), this example has nonetheless remained an important illustration of the use of matching methods in practice.

We will use a subset of these data as an illustrative example throughout this chapter. The “full” data set that we use has 185 treated males who had 2 years of preprogram earnings data (1974 and 1975) as well as 429 comparison males from the CPS who were younger than age 55, unemployed in 1976, and had income below the poverty line in 1975. The goal of matching will be to select the comparison males who look most similar to the treated group on other covariates. The covariates available in this data set include age, education level, high school degree, marital status, race, ethnicity, and earnings in 1974 and 1975. In this chapter, we do not attempt to obtain a reliable estimate of the effect of the NSW program but rather use the data only to illustrate matching methods.1

Notation and Background

As first formalized by Rubin (1974), the estimation of causal effects, whether from data in a randomized experiment or from information obtained from an observational study, is inherently a comparison of potential outcomes on individual units, where a unit is a physical object (e.g., a person or a school) at a particular point in time. In particular, the causal effect for unit i is the comparison of unit i's outcome if unit i receives the treatment (unit i's potential outcome under treatment), Yi(1), and unit i's outcome if unit i receives the control (unit i's potential outcome under control), Yi(0). The “fundamental problem of causal inference” (Holland, 1986; Rubin, 1978) is that, for each unit, we can observe only one of these potential outcomes because each unit will receive either treatment or control, not both. The estimation of causal effects can thus be thought of as a missing data problem, where at least half of the values of interest (the unobserved potential outcomes) are missing (Rubin, 1976a). We are interested in predicting the unobserved potential outcomes, thus enabling the comparison of the potential outcomes under treatment and control.
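To make the notation concrete, here is a minimal simulated illustration of this missing-data view; the data-generating process and the constant unit-level effect are purely hypothetical choices for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    # Hypothetical potential outcomes Yi(0) and Yi(1) for each unit.
    y0 = rng.normal(10, 2, n)        # outcome under control
    y1 = y0 + 3                      # outcome under treatment (true effect = 3)
    w = rng.integers(0, 2, n)        # treatment assignment W

    # The fundamental problem: only one potential outcome is observed per unit.
    y_obs = np.where(w == 1, y1, y0)
    print("W:        ", w)
    print("observed: ", y_obs.round(2))
    # The unit-level effect y1 - y0 is never directly observable, because the
    # other potential outcome is missing for every unit.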

For efficient causal inference and good estimation of the unobserved potential outcomes, we would like to compare groups of treated and control units that are as similar as possible. If the groups are very different, the prediction of the Yi(1) for the control group will be made using information from treated units, who look very different from those in the control group, and likewise, the prediction of the Yi(0) for the treated units will be made using information from control units, who look very different from the treated units.

Randomized experiments use a known randomized assignment mechanism to ensure “balance” of the covariates between the treated and control groups: The groups will be only randomly different from one another on all background covariates, observed and unobserved. In observational studies, we must posit an assignment mechanism, which stochastically determines which units receive treatment and which receive control. A key initial assumption in observational studies is that of strongly ignorable treatment assignment (Rosenbaum & Rubin, 1983b), which implies that (a) treatment assignment (W) is unconfounded (Rubin, 1990); that is, it is independent of the potential outcomes (Y(0), Y(1)) given the covariates (X): W ⊥ (Y(0), Y(1)) | X, and (b) there is a positive probability of receiving each treatment for all values of X: 0 < P(W = 1 | X) < 1 for all X. Part (b) essentially states that there is overlap in the propensity scores. However, since below we discuss methods to impose this by discarding units outside the region of overlap, in the rest of the chapter, we focus on the first part of the strong ignorability assumption: unconfounded treatment assignment, sometimes called “selection on observables” or “no hidden bias.” Imbens (2004) discusses the plausibility of this assumption in economics, and this issue is discussed further later in this chapter, including tests for sensitivity to the assumption of unconfounded treatment assignment.

A second assumption that is made in nearly all studies estimating causal effects (including randomized experiments) is the stable unit treatment value assumption (SUTVA; Rubin, 1980). There are two components to this assumption. The first is that there is only one version of each treatment possible for each unit. The second component is that of no interference: The treatment assignment of one unit does not affect the potential outcomes of any other units. This is also sometimes referred to as the assumption of “no spillover.” Some recent work has discussed relaxing this SUTVA assumption, in the context of school effects (Hong & Raudenbush, 2006) or neighborhood effects (Sobel, 2006).

History of Matching Methods

Matching methods have been in use since the first half of the 20th century, with much of the early work in sociology (Althauser & Rubin, 1970; Chapin, 1947; Greenwood, 1945). However, a theoretical basis for these methods was not developed until the late 1960s and early 1970s. This development began with a paper by Cochran (1968), which particularly examined subclassification but had clear connections with matching, including Cochran's occasional use of the term stratified matching to refer to subclassification. Cochran and Rubin (1973) and Rubin (1973a, 1973b) continued this development for situations with one covariate, and Cochran and Rubin (1973) and Rubin (1976b, 1976c) extended the results to multivariate settings.

Dealing with multiple covariates was a challenge due to both computational and data problems. With more than just a few covariates, it becomes very difficult to find matches with close or exact values of all covariates. An important advance was made in 1983 with the introduction of the propensity score by Rosenbaum and Rubin (1983b), a generalization of discriminant matching (Cochran & Rubin, 1973; Rubin, 1976b, 1976c). Rather than requiring close or exact matches on all covariates, matching on the scalar propensity score enables the construction of matched sets with similar distributions of covariates.

Developments were also made regarding the theory behind matching methods, particularly in the context of affinely invariant matching methods (such as most implementations of propensity score matching) with ellipsoidally symmetric covariate distributions (Rubin & Stuart, 2006; Rubin & Thomas, 1992a, 1992b, 1996). Affinely invariant matching methods are those that yield the same matches following an affine (e.g., linear) transformation of the data (Rubin & Thomas, 1992a). This theoretical development grew out of initial work on equal percent bias-reducing (EPBR) matching methods in Rubin (1976a, 1976c). EPBR methods reduce bias in all covariate directions by the same percentage, thus ensuring that if close matches are obtained in some direction (such as the discriminant), then the matching is also reducing bias in all other directions and so cannot be increasing bias in an outcome that is a linear combination of the covariates. Methods that are not EPBR will infinitely increase bias for some linear combinations of the covariates.

Since the initial work on matching methods, which was primarily in sociology and statistics, matching methods have been growing in popularity, with developments and applications in a variety of fields, including economics (Imbens, 2004), medicine (D'Agostino, 1998), public health (Christakis & Iwashyna, 2003), political science (Ho et al., 2007), and sociology (Morgan & Harding, 2006; Winship & Morgan, 1999). A review of the older work and more recent applications can also be found in Rubin (2006).

PROPENSITY SCORES

In applications, it is often very difficult to find close matches on each covariate. Rather than attempting to match on all of the covariates individually, propensity score matching matches on the scalar propensity score, which is the most important scalar summary of the covariates. Propensity scores, first introduced in Rosenbaum and Rubin (1983b), provided a key step in the continual development of matching methods by enabling the formation of matched sets that have balance on a large number of covariates.

The propensity score for unit i is defined as the probability of receiving the treatment given the observed covariates: e(Xi) = P(Wi = 1 | Xi). There are two key theorems relating to their use (Rosenbaum & Rubin, 1983b). The first is that propensity scores are balancing scores: At each value of the propensity score, the distribution of the covariates, X, that define the propensity score is the same in the treated and control groups. In other words, within a small range of propensity score values, the treated and control groups' observed covariate distributions are only randomly different from each other, thus replicating a mini-randomized experiment, at least with respect to these covariates. Second, if treatment assignment is unconfounded given the observed covariates (i.e., does not depend on the potential outcomes), then treatment assignment is also unconfounded given only the propensity score. This justifies matching or forming subclasses based on the propensity score rather than on the full set of multivariate covariates. Thus, when treatment assignment is unconfounded, for a specific value of the propensity score, the difference in means in the outcome between the treated and control units with that propensity score value is an unbiased estimate of the mean treatment effect at that propensity score value.

Abadie and Imbens (2006) present theoretical results that provide additional justification for matching on the propensity score, showing that creating estimates based on matching on one continuous covariate (such as the propensity score) is N^(1/2)-consistent, but attempting to match on more than one covariate without discarding any units is not. Thus, in this particular case, using the propensity score enables consistent estimation of treatment effects.

Propensity Score Estimation

In practice, the true propensity scores are rarely known outside of randomized experiments and thus must be estimated. Propensity scores are often estimated using logistic regression, although other methods such as classification and regression trees (CART; Breiman, Friedman, Olshen, & Stone, 1984), discriminant analysis, or generalized boosted models (McCaffrey, Ridgeway, & Morral, 2004) can also be used. Matching or subclassification is then done using the estimated propensity score (e.g., the fitted values from the logistic regression).
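As a sketch of this estimation step, the following uses logistic regression via scikit-learn; the data frame, treatment column, and covariate names are assumptions standing in for whatever covariates a given study makes available (the commented names echo the NSW covariates listed earlier).

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def estimate_pscores(df, covs, treat="w"):
        """Fit a logistic model of treatment on covariates and return the
        fitted probabilities, i.e., the estimated propensity scores."""
        model = LogisticRegression(max_iter=1000).fit(df[covs], df[treat])
        return pd.Series(model.predict_proba(df[covs])[:, 1],
                         index=df.index, name="pscore")

    # Hypothetical usage, with illustrative column names:
    # covs = ["age", "educ", "degree", "married", "black", "hispanic",
    #         "re74", "re75"]
    # df["pscore"] = estimate_pscores(df, covs)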

In the matching literature, there has been some discussion of the effects of matching using estimated rather than true propensity scores, especially regarding the variance of estimates. Theoretical and analytic work has shown that, although more bias reduction can be obtained using true propensity scores, matching on estimated propensity scores can control variance orthogonal to the discriminant and thus can lead to more precise estimates of the treatment effect (Rubin & Thomas, 1992b, 1996). Analytic expressions for the bias and variance reduction possible for these situations are given in Rubin and Thomas (1992b). Specifically, Rubin and Thomas (1992b) state that “with large pools of controls, matching using estimated linear propensity scores results in approximately half the variance for the difference in the matched sample means as in corresponding random samples for all covariates uncorrelated with the population discriminant” (p. 802). This finding is confirmed in simulation work in Rubin and Thomas (1996) and in an empirical example in Hill, Rubin, and Thomas (1999). Hence, in situations where nearly all bias can be eliminated relatively easily, matching on the estimated propensity scores is superior to matching on the true propensity score because it will result in more precise estimates of the average treatment effect.

Model Specification

The model specification and diagnostics when estimating propensity scores are not the standard model diagnostics for logistic regression or CART, as discussed by Rubin (2004). With propensity score estimation, concern is not with the parameter estimates of the model but rather with the quality of the matches and sometimes with the accuracy of the predictions of treatment assignment (the propensity scores themselves). When the propensity scores will be used for matching or subclassification, the key diagnostic is covariate balance in the resulting matched samples or subclasses. When propensity scores are used directly in weighting adjustments, more attention should be paid to the accuracy of the model predictions since the estimates of the treatment effect may be very sensitive to the accuracy of the propensity score values themselves.

Rosenbaum and Rubin (1984); Perkins, Tu, Underhill, Zhou, and Murray (2000); Dehejia and Wahba (2002); and Michalopoulos, Bloom, and Hill (2004) described propensity score model-fitting strategies that involve examining the resulting covariate balance in subclasses defined by the propensity score. If covariates (or their squares or cross-products) are found to be unbalanced, those terms are then included in the propensity score specification, which should improve balance, subject to sample size limitations.
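The sketch below illustrates one way to run such a balance check within propensity-score subclasses; the data frame layout, column names, and the use of the standardized difference in means as the balance summary are assumptions of the sketch. Covariates flagged as unbalanced (or their squares or cross-products) would then be added to the propensity score model and the check repeated.

    import numpy as np
    import pandas as pd

    def std_diff(x, w):
        """Standardized difference in means between treated (w == 1)
        and control (w == 0) values of a covariate."""
        s = np.sqrt((x[w == 1].var() + x[w == 0].var()) / 2)
        return (x[w == 1].mean() - x[w == 0].mean()) / s if s > 0 else 0.0

    def subclass_balance(df, covs, treat="w", pscore="pscore", n_sub=5):
        """Standardized differences for each covariate within subclasses
        defined by quantiles of the estimated propensity score."""
        sub = pd.qcut(df[pscore], n_sub, labels=False, duplicates="drop")
        table = {}
        for c in covs:
            table[c] = [std_diff(df.loc[sub == k, c], df.loc[sub == k, treat])
                        for k in sorted(sub.unique())]
        return pd.DataFrame(table).T  # rows: covariates; columns: subclasses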

Drake (1993) stated that treatment effect estimates are more sensitive to misspecification of the model of the outcome than to misspecification of the propensity score model. Dehejia and Wahba (1999, 2002) and Zhao (2004) also provided evidence that treatment effect estimates may not be too sensitive to the propensity score specification. However, these evaluations are fairly limited; for example, Drake considered only two covariates.

WHEN IS REGRESSION ANALYSIS TRUSTWORTHY?

It has been known for many years that regression analysis can lead to misleading results when the covariate distributions in the groups are very different (e.g., Cochran, 1957; Cochran & Rubin, 1973; Rubin, 1973b). Rubin (2001, p. 174) stated the three basic conditions that must generally be met for regression analyses to be trustworthy, in the case of approximately normally distributed covariates2 (a code sketch for checking these conditions follows the list):

1. The difference in the means of the propensity scores in the two groups being compared must be small (e.g., the means must be less than half a standard deviation apart), unless the situation is benign in the sense that:

   a. the distributions of the covariates in both groups are nearly symmetric,

   b. the distributions of the covariates in both groups have nearly the same variances, and

   c. the sample sizes are approximately the same.

2. The ratio of the variances of the propensity score in the two groups must be close to 1 (e.g., 1/2 or 2 are far too extreme).

3. The ratio of the variances of the residuals of the covariates after adjusting for the propensity score must be close to 1 (e.g., 1/2 or 2 are far too extreme).
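Here is a minimal sketch for computing these three quantities; the inputs, and the use of a simple per-covariate linear regression on the propensity score for condition 3, are assumptions of the sketch.

    import numpy as np

    def regression_conditions(e_t, e_c, X_t, X_c):
        """Compute Rubin's (2001) three diagnostics. e_t, e_c: estimated
        propensity scores in the treated and control groups; X_t, X_c:
        covariate matrices (rows are units, columns are covariates)."""
        pooled_sd = np.sqrt((e_t.var() + e_c.var()) / 2)
        b = (e_t.mean() - e_c.mean()) / pooled_sd   # condition 1: want |b| small
        r = e_t.var() / e_c.var()                   # condition 2: want near 1

        # Condition 3: variance ratio of each covariate's residuals after a
        # linear adjustment for the propensity score, computed per group.
        ratios = []
        for j in range(X_t.shape[1]):
            res_t = X_t[:, j] - np.polyval(np.polyfit(e_t, X_t[:, j], 1), e_t)
            res_c = X_c[:, j] - np.polyval(np.polyfit(e_c, X_c[:, j], 1), e_c)
            ratios.append(res_t.var() / res_c.var())
        return b, r, np.array(ratios)               # condition 3: each near 1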

These guidelines arise from results on the bias resulting from regression analysis in samples with large initial covariate bias that show that linear regression adjustment can grossly overcorrect or undercorrect for bias when these conditions are not met (Cochran & Rubin, 1973; Rubin, 1973b, 1979, 2001). For example, when the propensity score means are one quarter of a standard deviation apart in the two groups, the ratio of the treated to control group variance is 1/2, and the model of the outcome is moderately nonlinear (y = e^(x/2)), linear regression adjustment can lead to a 300% reduction in bias; in other words, the adjustment not only removes the original bias but introduces a bias twice as large in the opposite direction! Results are even more striking for larger initial bias between the groups, where the amount of bias remaining can be substantial even if most (in percentage) of the initial bias has been removed (see Rubin, 2001, Table 1).

Despite these striking results, regression adjustment on unmatched data is still a common method for attempting to estimate causal effects. Matching methods provide a way to avoid extrapolation and reliance on the modeling assumptions, by ensuring the comparison of treated and control units with similar covariate distributions, when this is possible, and warning of the inherent extrapolation in regression models when there is little overlap in distributions.

IMPLEMENTATION OF MATCHING METHODS

We now turn to the implementation of matching methods. There are five key steps when using matching methods to estimate causal effects. These are (1) choosing the covariates to be used in the matching process; (2) defining a distance measure, used to assess whether units are “similar”; (3) choosing a specific matching algorithm to form matched sets; (4) diagnosing the matches obtained (and iterating between (2) and (3)); and finally, (5) estimating the effect of the treatment on the outcome, using the matched sets found in (4) and possibly other adjustments. The following sections provide further information on each of these steps.

Choosing the Covariates

The first step is to choose the covariates on which close matches are desired. As discussed earlier, an underlying assumption when estimating causal effects using nonexperimental data is that treatment assignment is unconfounded (Rosenbaum & Rubin, 1983b) given the covariates used in the matching process. To make this assumption plausible, it is important to include in the matching procedure any covariates that may be related to treatment assignment and the outcome; the most important covariates to include are those that are related to treatment assignment because the matching will typically be done for many outcomes. Theoretical and empirical research has shown the importance of including a large set of covariates in the matching procedure (Hill, Reiter, & Zanutto, 2004; Lunceford & Davidian, 2004; Rubin & Thomas, 1996). Greevy, Lu, Silber, and Rosenbaum (2004) provide an example where the power of the subsequent analysis in a randomized experiment is increased by matching on 14 covariates, even though only 2 of those covariates are directly related to the outcome (the other 12 are related to the outcome only through their correlation with the 2 on which the outcome explicitly depends).

A second consideration is that the covariates included in the matching must be “proper” covariates in the sense of not being affected by treatment assignment. It is well-known that matching or subclassifying on a variable affected by treatment assignment can lead to substantial bias in the estimated treatment effect (Frangakis & Rubin, 2002; Greenland, 2003; Imbens, 2004). All variables should thus be carefully considered as to whether they are “proper” covariates. This is especially important in fields such as epidemiology and political science, where the treatment assignment date is often somewhat undefined. If it is deemed to be critical to control for a variable potentially affected by treatment assignment, it is better to exclude that variable from the matching procedure and include it in the analysis model for the outcome (Reinisch, Sanders, Mortensen, & Rubin, 1995) and hope for balance on it, or to use principal stratification methods (Frangakis & Rubin, 2002) to deal with it.

Selecting a Distance Measure

The next step when using matching methods is to define the “distance” measure that will be used to decide whether units are “similar” in terms of their covariate values. “Distance” is in quotes because the measure will not necessarily be a proper “full-rank” distance in the mathematical sense. One extreme distance measure is that of exact matching, which groups units only if they have the same values of all the covariates. Because limited sample sizes (and large numbers of covariates) make it very difficult to obtain exact matches, distance measures that are not full rank and that combine distances on individual covariates, such as propensity scores, are commonly used in practice.

Two measures of the distance between units on multiple covariates are the Mahalanobis distance, which is full rank, and the propensity score distance, which is not. The Mahalanobis distance on covariates X between units i and j is (Xi − Xj)′Σ−1(Xi − Xj), where Σ can be the true or estimated variance-covariance matrix in the treated group, the control group, or a pooled sample; the control group variance-covariance matrix is usually used. The propensity score distance is defined as the absolute difference in (true or estimated) propensity scores between two units. See the “Propensity Score Estimation” section for more details on estimating propensity scores. Gu and Rosenbaum (1993) and Rubin and Thomas (2000) compare the performance of matching methods based on Mahalanobis metric matching and propensity score matching and find that the two distance measures perform similarly when there is a relatively small number of covariates, but propensity score matching works better than Mahalanobis metric matching with large numbers of covariates (greater than 5). One reason for this is that the Mahalanobis metric is attempting to obtain balance on all possible interactions of the covariates (which is very difficult in multivariate space), effectively considering all of the interactions as equally important. In contrast, propensity score matching allows the exclusion of terms from the propensity score model and thereby the inclusion of only the important terms (e.g., main effects, two-way interactions) on which to obtain balance.
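A minimal sketch of the two distance measures as just defined (the quadratic form without a square root, following the definition above; variable names are illustrative):

    import numpy as np

    def mahalanobis_dist(x_i, x_j, cov_inv):
        """Mahalanobis distance (Xi - Xj)' S^-1 (Xi - Xj) between the
        covariate vectors of two units, given the inverse of S."""
        d = x_i - x_j
        return float(d @ cov_inv @ d)

    def pscore_dist(e_i, e_j):
        """Propensity score distance: absolute difference in scores."""
        return abs(e_i - e_j)

    # The control-group variance-covariance matrix is usually used:
    # cov_inv = np.linalg.inv(np.cov(X_control, rowvar=False))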

As discussed below, these distance measures can be combined or used in conjunction with exact matching on certain covariates. Combining these distance measures with exact matching on certain covariates sets the distance between two units equal to infinity if the units are not exactly matched on those covariates.

Selecting Matches

Once the distance measure is defined, the next step is to choose the matched samples. This section provides a summary of some of the most common types of matching methods, given a particular distance measure. These methods include nearest neighbor matching and its variations (such as caliper matching) and subclassification methods (such as full matching). We provide an overview of each, as well as references for further information and examples.

Nearest Neighbor Matching

Nearest neighbor matching (Rubin, 1973a) generally selects k matched controls for each treated unit (often, k = 1). The simplest nearest neighbor matching uses a “greedy” algorithm, which cycles through the treated units one at a time, selecting for each the available control unit with the smallest distance to the treated unit. A more sophisticated algorithm, “optimal” matching, minimizes a global measure of balance (Rosenbaum, 2002). Rosenbaum (2002) argues that the collection of matches found using optimal matching can have substantially better balance than matches found using greedy matching, without much loss in computational speed. Generally, greedy matching performs poorly with respect to average pair differences when there is intense competition for controls and performs well when there is little competition. In practical situations, when assessing the matched groups' covariate balance, Gu and Rosenbaum (1993) find that optimal matching does not in general perform any better than greedy matching in terms of creating groups with good balance but does do better at reducing the distance between pairs. As summarized by Gu and Rosenbaum (1993), “Optimal matching picks about the same controls [as greedy matching] but does a better job of assigning them to treated units” (p. 413).
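A minimal sketch of the greedy algorithm just described, matching 1:1 on the propensity score without replacement (the function name and inputs are illustrative):

    import numpy as np

    def greedy_match(e_treated, e_control):
        """1:1 greedy nearest neighbor matching on the propensity score,
        without replacement: cycle through treated units, giving each the
        closest still-available control. Assumes at least as many controls
        as treated units. Returns matched control indices, one per
        treated unit."""
        available = set(range(len(e_control)))
        matches = []
        for e_t in e_treated:
            j = min(available, key=lambda k: abs(e_control[k] - e_t))
            matches.append(j)
            available.remove(j)       # each control used at most once
        return np.array(matches)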

Figure 11.1 illustrates the result of a one-to-one greedy nearest neighbor matching algorithm implemented using the NSW data described in “The National Supported Work Demonstration” section. The propensity score was estimated using all covariates available in the data set. Of the 429 available control individuals, the 185 with propensity scores closest to those of the 185 treated individuals were selected as matches. We see that there is fairly good overlap throughout most of the range of propensity scores and that most of the control individuals not used as matches had very low propensity scores and so were inapposite for use as points of comparison.

[Figure 11.1: distributions of the propensity score (0.0 to 0.8) in the treatment and control groups]

Figure 11.1 Matches chosen using 1:1 nearest neighbor matching on the propensity score. Black units were matched; gray units were unmatched. A total of 185 treated units were matched to 185 control units; 244 control units were discarded.

When there are large numbers of control units, it is sometimes possible to get multiple good matches for each treated unit, which can reduce sampling variance in the treatment effect estimates. Although one-to-one matching is the most common, a larger number of matches for each treated unit is often possible. Unless there are many units with the same covariate values, using multiple controls for each treated unit is expected to increase bias because the second, third, and fourth closest matches are, by definition, further away from the treated unit than is the first closest match, but using multiple matches can decrease sampling variance due to the larger matched sample size. Of course, in settings where the outcome data have yet to be collected and there are cost constraints, researchers must balance the benefit of obtaining multiple matches for each unit with the increased costs. Examples using more than one control match for each treated unit include H. Smith (1997) and Rubin and Thomas (2000).

Another key issue is whether controls can be used as matches for more than one treated unit, that is, whether the matching should be done “with replacement” or “without replacement.” Matching with replacement can often yield better matches because controls that look similar to many treated units can be used multiple times. In addition, like optimal matching, when matching with replacement, the order in which the treated units are matched does not matter. However, a drawback of matching with replacement is that it may be that only a few unique control units will be selected as matches; the number of times each control is matched should be monitored and reflected in the estimated precision of estimated causal effects.

Using the NSW data, Dehejia and Wahba (2002) match with replacement from the PSID sample because there are few control individuals comparable to those in the treated group, making matching with replacement appealing. When one-to-one matching is done without replacement, nearly half of the treated group members end up with matches that are quite far away. They conclude that matching with replacement can be useful when there are a limited number of control units with values similar to those in the treated group.

Limited Exact Matching

Rosenbaum and Rubin (1985a) illustrate the futility of attempting to find matching treated and control units with the same values of all the covariates: exact matches cannot be found for most units. However, it is often desirable (and possible) to obtain exact matches on a few key covariates, such as race or sex. Combining exact matching on key covariates with propensity score matching can lead to large reductions in bias and can result in a design analogous to blocking in a randomized experiment.

For example, in Rubin (2001), the analyses are done separately for males and females, with male smokers matched to male nonsmokers and female smokers matched to female nonsmokers. Similarly, in Dehejia and Wahba (1999), the analysis is done separately for males and females.

Mahalanobis Metric Matching on Key Covariates Within Propensity Score Calipers

Caliper matching (Althauser & Rubin, 1970) selects matches within a specified range (caliper c) of a one-dimensional covariate X (which may actually be a combination of multiple covariates, such as the propensity score): |Xtj − Xcj| ≤ c for all treatment/control matched pairs, indexed by j. Cochran and Rubin (1973) investigate various caliper sizes and show that with a normally distributed covariate, a caliper of 0.2 standard deviations can remove 98% of the bias due to that covariate, assuming all treated units are matched. Althauser and Rubin (1970) find that even a looser matching (1.0 standard deviations of X) can still remove approximately 75% of the initial bias due to X. Rosenbaum and Rubin (1985b) show that if the caliper matching is done using the propensity score, the bias reduction is obtained on all of the covariates that went into the propensity score. They suggest that a caliper of 0.25 standard deviations of the logit transformation of the propensity score can work well in general.
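A minimal sketch of that suggested caliper, assuming an array of estimated propensity scores for all units is available:

    import numpy as np

    def logit(e):
        """Logit transformation of the propensity score."""
        return np.log(e / (1 - e))

    def within_caliper(e_t, e_c, e_all, width=0.25):
        """True if a treated/control pair lies within a caliper of `width`
        standard deviations of the logit of the propensity score."""
        caliper = width * np.std(logit(np.asarray(e_all)))
        return abs(logit(e_t) - logit(e_c)) <= caliper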

For situations where there are some key continuous covariates on which particularly close matches are desired, Mahalanobis matching on the key covariates can be combined with propensity score matching, resulting in particularly good balance (Rosenbaum & Rubin, 1985b; Rubin & Thomas, 2000). The Mahalanobis distance is usually calculated on covariates that are believed to be particularly predictive of the outcome of interest or of treatment assignment. For example, in the NSW Demonstration data, Mahalanobis metric matching on the 2 years of preprogram earnings could be done within propensity score calipers.

Subclassification

Rosenbaum and Rubin (1984) discuss reducing bias due to multiple covariates in observational studies through subclassification on estimated propensity scores, which forms groups of units with similar propensity scores and thus similar covariate distributions. For example, subclasses may be defined by splitting the treated and control groups at the quintiles of the propensity score in the treated group, leading to five subclasses with approximately the same number of treated units in each. That work builds on the work by Cochran (1968) on subclassification using a single covariate; when the conditional expectation of the outcome variable is a monotone function of the propensity score, creating just five propensity score subclasses removes at least 90% of the bias in the estimated treatment effect due to each of the observed covariates. Thus, five subclasses are often used, although with large sample sizes, more subclasses, or even variable-sized subclasses, are often desirable. This method is clearly related to making an ordinal version of a continuous underlying covariate.
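A sketch of the subclass-formation step just described, splitting at the quantiles of the treated group's propensity scores; the data frame layout and column names are assumptions for the sketch.

    import numpy as np
    import pandas as pd

    def form_subclasses(df, pscore="pscore", treat="w", n_sub=5):
        """Assign every unit to a subclass defined by quantiles of the
        propensity score among *treated* units, so that subclasses hold
        roughly equal numbers of treated units."""
        qs = np.quantile(df.loc[df[treat] == 1, pscore],
                         np.linspace(0, 1, n_sub + 1))
        qs[0], qs[-1] = -np.inf, np.inf   # so every unit falls in a bin
        return pd.cut(df[pscore], bins=qs, labels=False, duplicates="drop")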

Lunceford and Davidian (2004) assess subclassification on the propensity score and find that subclassification without subsequent within-strata model adjustment (as discussed earlier) can lead to biased answers due to residual imbalance within the strata. They suggest a need for further research on the optimal number of subclasses, a topic also discussed in Du (1998).

Subclassification is illustrated using the NSW data in Figure 11.2, where six propensity score subclasses were formed to have approximately equal numbers of treated units. All units are placed into one of the six subclasses. The control units within each subclass are given equal weight, proportional to the number of treated units in the subclass; thus the treated and control units in each subclass receive the same total weight.

[Figure 11.2: distributions of the propensity score (0.0 to 0.8) in the treatment and control groups, with subclass boundaries shown]

Figure 11.2 Results from subclassification on the propensity score. Subclasses are indicated by vertical lines. The weight given to each unit is represented by its symbol size; larger symbols correspond to larger weight.

If the balance achieved in matched samples selected using nearest neighbor matching is not adequate, subclassification of the matches chosen using nearest neighbor matching can be done to yield improved balance. This is illustrated in Figure 11.3, where, after one-to-one nearest neighbor matching, six subclasses have been formed with approximately the same number of treated units in each subclass. This process is illustrated in Rubin (2001) and Rubin (2007).

[Figure 11.3: distributions of the propensity score (0.0 to 0.8) in the treatment and control groups after matching, with subclass boundaries shown]

Figure 11.3 One-to-one nearest neighbor matching on the propensity score followed by subclassification. Black units were matched; gray units were unmatched. Subclasses indicated by vertical lines.

Full Matching

An extension of subclassification is “full matching” (Rosenbaum, 1991a, 2002), in which the matched sample is composed of matched sets (subclasses), where each matched set contains either (a) one treated unit and one or more controls or (b) one control unit and one or more treated units. Full matching is optimal in terms of minimizing a weighted average of the distances between each treated subject and each control subject within each matched set. Hansen (2004) gives a practical evaluation of the method, estimating the effect of SAT coaching, illustrating that, although the original treated and control groups had propensity score differences of 1.1 standard deviations, the matched sets from full matching differed by less than 2% of a standard deviation. To achieve efficiency gains, Hansen (2004) also describes a variation of full matching that restricts the ratio of the number of treated units to the number of control units in each matched set, a method also applied in Stuart and Green (in press).

The output from full matching is illustrated using the NSW data in Figure 11.4. Because it is not feasible to show the individual matched sets (in these data, 103 matched sets were created), the units are represented by their relative weights. All treated units receive a weight of 1 (and thus the symbols are all the same size). Control units in matched sets with many control units and few treated units receive small weight (e.g., the units with propensity scores close to 0), whereas control units in matched sets with few control units and many treated units (e.g., the units with propensity scores close to 0.8) receive large weight. The weighted treated and control group covariate distributions look very similar. As in simple subclassification, all control units within a matched set receive equal weight. However, because there are many more matched sets than with simple subclassification, the variation in the weights is much larger across matched sets.

[Figure 11.4: distributions of the propensity score (0.0 to 0.8) in the treatment and control groups after full matching]

Figure 11.4 Results from full matching on the propensity score. The weight given to each unit is represented by its size; larger symbols correspond to higher weight.

Because subclassification and full matching place all available units into one of the subclasses, these methods may have particular appeal for researchers who are reluctant to discard some of the control units. However, these methods are not relevant for situations where the matching is being used to select units for follow-up or for situations where some units have essentially zero probability of receiving the other treatment.

Weighting Adjustments

Another method that uses all units is weighting, where observations are weighted by their inverse propensity score (Czajka, Hirabayashi, Little, & Rubin, 1992; Lunceford & Davidian, 2004; McCaffrey et al., 2004). Weighting can also be thought of as the limit of subclassification as the number of observations and the number of subclasses go to infinity. Weighting methods are based on Horvitz-Thompson estimation (Horvitz & Thompson, 1952), used frequently in sample surveys. A drawback of weighting adjustments is that, as with Horvitz-Thompson estimation, the sampling variance of resulting weighted estimators can be very large if the weights are extreme (if the propensity scores are close to 0 or 1). Thus, the subclassification or full matching approaches, which also use all units, may be more appealing because the resulting weights are less variable.
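A minimal sketch of such an inverse-propensity-weighted estimate of the average treatment effect (the particular estimator and the names are assumptions for the sketch):

    import numpy as np

    def ipw_estimate(y, w, e):
        """Horvitz-Thompson-style inverse-propensity-weighted estimate of
        the average treatment effect. y: outcomes; w: 0/1 treatment;
        e: estimated propensity scores. Note that the weights, and hence
        the sampling variance, blow up as e approaches 0 or 1."""
        wt = np.where(w == 1, 1.0 / e, 1.0 / (1.0 - e))
        treated_mean = np.average(y[w == 1], weights=wt[w == 1])
        control_mean = np.average(y[w == 0], weights=wt[w == 0])
        return treated_mean - control_mean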




Another type of weighting procedure is that of kernel weighting adjustments, which average over multiple persons in the control group for each treated unit, with weights defined by their distance from the treated unit. Heckman, Ichimura, Smith, and Todd (1998) and Heckman, Ichimura, and Todd (1998) describe a local linear matching estimator that requires specifying a bandwidth parameter. Generally, larger bandwidths increase bias but reduce variance by putting weight on units that are further away from the treated unit of interest. A complication with these methods is this need to define a bandwidth or smoothing parameter, which does not generally have an intuitive meaning; Imbens (2004) provides some guidance on that choice.
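As a sketch of the general idea (not the local linear estimator of Heckman and colleagues), the following computes a kernel-weighted control mean for one treated unit; the Epanechnikov kernel and the bandwidth value are illustrative assumptions.

    import numpy as np

    def kernel_control_mean(e_i, e_controls, y_controls, h=0.05):
        """Weighted average of control outcomes for a treated unit with
        propensity score e_i, with weights declining in propensity-score
        distance (Epanechnikov kernel, bandwidth h)."""
        u = (np.asarray(e_controls) - e_i) / h
        k = np.where(np.abs(u) < 1, 0.75 * (1 - u ** 2), 0.0)
        if k.sum() == 0:
            return np.nan          # no controls within the bandwidth
        return np.average(np.asarray(y_controls), weights=k)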

With all of these weighting approaches, it is still important to separate clearly the design and analysis stages. The propensity score should be carefully estimated, using approaches such as those described earlier, and the weights set before any use of those weights in models relating the outcomes to covariates.

Diagnostics for Matching Methods

Diagnosing the quality of the matches obtained from a matching method is of primary importance. Extensive diagnostics and propensity score model specification checks are required for each data set, as discussed by Dehejia (2005). Matching methods have a variety of simple diagnostic procedures that can be used, most based on the idea of assessing balance between the treated and control groups. Although we would ideally compare the multivariate covariate distributions in the two groups, that is difficult when there are many covariates, and so generally comparisons are done for each univariate covariate separately, for two-way interactions of covariates, and for the propensity score, as the most important univariate summary of the covariates.

At a minimum, the balance diagnostics should involve comparing the mean covariate values in the groups, sometimes standardized by the standard deviation in the full sample; ideally, other characteristics of the distributions, such as variances, correlations, and interactions between covariates, should also be compared. Common diagnostics include t tests of the covariates, Kolmogorov-Smirnov tests, and other comparisons of distributions (e.g., Austin & Mamdani, 2006). Ho et al. (2007) provide a summary of numerical and graphical summaries of balance, including empirical quantile-quantile plots to examine the empirical distribution of each covariate in the matched samples. Rosenbaum and Rubin (1984) examine F ratios from a two-way analysis of variance performed for each covariate, where the factors are treatment/control and propensity score subclasses.
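A sketch of a simple per-covariate balance table for matched samples, combining a standardized mean difference, a variance ratio, and a Kolmogorov-Smirnov test (the layout and inputs are assumptions for the sketch):

    import numpy as np
    from scipy import stats

    def balance_table(X_t, X_c, names):
        """Per-covariate balance diagnostics for matched treated (X_t) and
        control (X_c) samples; rows are (name, standardized difference,
        variance ratio, Kolmogorov-Smirnov p value)."""
        rows = []
        for j, name in enumerate(names):
            xt, xc = X_t[:, j], X_c[:, j]
            pooled = np.sqrt((xt.var() + xc.var()) / 2)
            smd = (xt.mean() - xc.mean()) / pooled if pooled > 0 else 0.0
            vratio = xt.var() / xc.var() if xc.var() > 0 else np.nan
            ks_p = stats.ks_2samp(xt, xc).pvalue
            rows.append((name, smd, vratio, ks_p))
        return rows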




Rubin (2001) presents diagnostics related to the conditions given in the previous section that indicate when regression analyses are trustworthy. These diagnostics include assessing the standardized difference in means of the propensity scores between the two treatment groups, the ratio of the variances of the propensity scores in the two groups, and, for each covariate, the ratio of the variance of the residuals orthogonal to the propensity score in the two groups. The standardized differences in means should generally be less than 0.25, and the variance ratios should be close to 1, certainly between 0.5 and 2, as discussed earlier.

Analysis of Outcome Data After Matching

The analysis of outcome(s) should proceed only after the observational study design has been set in the sense that the matched samples have been chosen, and it has been determined that the matched samples have adequate balance. In keeping with the idea of replicating a randomized experiment, the same methods that would be used in an experiment can be used in the matched data. In particular, matching methods are not designed to “compete” with modeling adjustments such as linear regression, and in fact, the two methods have been shown to work best in combination. Many authors have discussed for decades the benefits of combining matching or propensity score weighting and regression adjustment (Abadie & Imbens, 2006; Heckman et al., 1997; Robins & Rotnitzky, 1995; Rubin, 1973b, 1979; Rubin & Thomas, 2000).

The intuition for using both is the same as that behind regression adjustment in randomized experiments, where the regression adjustment is used to “clean up” small residual covariate imbalance between the treatment and control groups. The matching method reduces large covariate bias between the treated and control groups, and the regression is used to adjust for any small residual biases and to increase efficiency. These “bias-corrected” matching methods have been found by Abadie and Imbens (2006) and Glazerman, Levy, and Myers (2003) to work well in practice, using simulated and actual data. Rubin (1973b, 1979), Rubin and Thomas (2000), and Ho et al. (2007) show that models based on matched data are much less sensitive to model misspecification and more robust than are models fit to the full data sets.

Some slight adjustments to the analysis methods are required with some particular matching methods. With procedures such as full matching, subclassification, or matching with replacement, where there may be different numbers of treated and control units at each value of the covariates, the analysis should incorporate weights to account for these varying distributions. Examples of this can be found in Dehejia and Wahba (1999), Hill et al. (2004), and Michalopoulos et al. (2004). When subclassification has been used, estimates should be obtained separately within each subclass and then aggregated across subclasses to obtain an overall effect (Rosenbaum & Rubin, 1984). Estimates within each subclass are sometimes calculated using simple differences in means, although empirical (Lunceford & Davidian, 2004) and theoretical (Abadie & Imbens, 2006) work has shown that better results are obtained if regression adjustment is used in conjunction with the subclassification. When aggregating across subclasses, weighting the subclass estimates by the number of treated units in each subclass estimates the average treatment effect for the units in the treated group; if there was no matching done before subclassification, weighting by the overall number of units in each subclass estimates the overall average treatment effect for the population of treated and control units.
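For instance, a simple version of the subclassification estimator just described might look as follows, assuming a data frame with columns ps (the estimated propensity score), treat, and outcome, and assuming each subclass contains units from both groups; within-subclass regression adjustment, which the work cited above favors, could replace the simple differences in means:

    import numpy as np
    import pandas as pd

    def subclassification_estimate(df, n_subclasses=5):
        """Effect for the treated by subclassification on the propensity
        score, aggregated with weights equal to the number of treated
        units in each subclass (Rosenbaum & Rubin, 1984)."""
        # Subclasses defined by quantiles of the estimated propensity score
        sub = pd.qcut(df["ps"], q=n_subclasses, labels=False)
        effects, n_treated = [], []
        for _, s in df.groupby(sub):
            t = s.loc[s["treat"] == 1, "outcome"]
            c = s.loc[s["treat"] == 0, "outcome"]
            effects.append(t.mean() - c.mean())  # simple difference in means
            n_treated.append(len(t))
        return np.average(effects, weights=n_treated)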

COMPLICATIONS IN USING MATCHING METHODS

Overlap in Distributions

In some analyses, some of the control units may be very dissimilar from all treated units, or some of the treated units may be very dissimilar from all control units, potentially exhibited by propensity scores outside the range of the other treatment group. Thus, it is sometimes desirable to explicitly discard units with “extreme” values of the propensity score—for example, treated units for whom there are no control units with propensity score values as large. Doing the analysis only in the areas where there is distributional overlap—that is, with “common support” (regions of the covariate space that have both treated and control units)—will lead to more robust inference. This, in essence, is what matching is usually attempting to do; defining the area of common support is a way to discard units that are unlike all units in the other treatment group.

However, it is often difficult to determine whether there is common support in multidimensional space. One way of doing so is to examine the overlap of the propensity score distributions. This is illustrated in Dehejia and Wahba (1999), where control units with propensity scores lower than the minimum propensity score for the treated units are discarded. A second method of examining the multivariate overlap involves examining the “convex hull” of the covariates, essentially identifying the multidimensional space that allows interpolation rather than extrapolation (King & Zeng, 2007). Imbens (2004) also discusses these issues in an economic context.
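A minimal sketch of this style of common-support rule, using the same hypothetical column names as in the earlier sketches:

    def trim_to_common_support(df):
        """Keep only units in the region of distributional overlap: drop
        controls with propensity scores below the treated minimum, and
        treated units with scores above the control maximum."""
        min_treated = df.loc[df["treat"] == 1, "ps"].min()
        max_control = df.loc[df["treat"] == 0, "ps"].max()
        keep = (((df["treat"] == 0) & (df["ps"] >= min_treated)) |
                ((df["treat"] == 1) & (df["ps"] <= max_control)))
        return df[keep]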

Missing Covariate Values

Most of the literature on matching and propensity scores assumes fully observed covariates, so that models such as logistic regression can be used to estimate the propensity scores. However, there are often missing values in the covariates, which complicates matching and propensity score estimation. Two complex statistical models used to estimate propensity scores in this case are pattern mixture models (Rosenbaum & Rubin, 1984) and general location models (D’Agostino & Rubin, 2000). A key consideration when thinking about missing covariate values is that the pattern of missing covariates can be prognostically important, and in such cases, the methods should condition on the observed values of the covariates and on the observed missing-data indicators.

There has not been much theoretical work done on the appropriate procedures for dealing with missing covariate values. Multiple researchers have done empirical comparisons of methods, but this is clearly an area for further research. D’Agostino, Lang, Walkup, and Morgan (2001) compare three simpler methods of dealing with missing covariate values: The first uses only units with complete data and discards all units with any missing data, the second does a simple imputation for missing values and includes indicators for missing values in the propensity score model, and the third fits separate propensity score models for each pattern of missing data (a pattern mixture approach, as in Rosenbaum & Rubin, 1984). All three methods perform well in terms of creating well-matched samples. They find that the third method performs the best, evaluated by treating the original complete-case data set (i.e., all individuals with no missing values) as the “truth,” imposing additional nonignorable missing data values on that complete-case data set, and examining which method best reproduces the estimate observed in the original complete-case data, given the imposed missingness. Song, Belin, Lee, Gao, and Rotheram-Borus (2001) compare two methods of using propensity scores with missing covariate data. The first uses mean imputation for the missing values and then estimates the propensity scores. The second multiply imputes the covariates (Rubin, 1987) and estimates propensity scores in each “complete” data set. A mixed effects model is used to analyze the longitudinal outcome data in each data set, and the multiple imputation combining rules are used to obtain one estimate of the treatment effect. Results are similar using the two methods. Both methods show that the covariates are very poorly balanced between the treated and control groups and that good matches are hard to find (a finding that standard modeling approaches would not necessarily have discovered). Hill (2004) finds that methods using multiple imputation work better than complete-data or complete-variable methods (which use only units with complete data or only variables with complete data).
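The second of the methods compared by D’Agostino et al. (2001), simple imputation plus missing-value indicators in the propensity score model, is straightforward to sketch (scikit-learn’s logistic regression is used here as one possible estimation routine; variable names are hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def propensity_with_missing_indicators(X, treat):
        """Fill missing covariate values with column means and add
        missing-data indicators to the propensity score model, so the
        estimated scores condition on the observed covariate values and
        on the observed pattern of missingness.
        X: 2-D float array with np.nan marking missing entries."""
        indicators = np.isnan(X).astype(float)
        X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
        design = np.hstack([X_filled, indicators])
        model = LogisticRegression(max_iter=1000).fit(design, treat)
        return model.predict_proba(design)[:, 1]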

Unobserved Variables

A critique of any observational study is that there may be unobserved covariates that affect both treatment assignment and the outcome, thus violating the assumption of unconfounded treatment assignment. The approach behind matching is that of dealing as well as possible with the observed covariates; close matching on the observed covariates will also lessen the bias due to unobserved covariates that are correlated with the observed covariates. However, there may still be concern regarding unobserved differences between the treated and control groups.

The assumption of unconfounded treatment assignment can never be directly tested. However, some researchers have proposed tests in which an estimate is obtained for an effect that is “known” to be zero, such as the difference in a pretreatment measure of the outcome variable (Imbens, 2004) or the difference in outcomes between multiple control groups (Rosenbaum, 1987b). If the test indicates that the effect is not equal to zero, then the assumption of unconfounded treatment assignment is deemed to be less plausible.
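One simple version of such a test, sketched under the assumption that the matched samples include a pretreatment measure of the outcome in a hypothetical column named pre_y:

    from scipy import stats

    def placebo_test(matched, pre_outcome="pre_y"):
        """Estimate an effect that is 'known' to be zero: the matched-
        sample difference in a pretreatment measure of the outcome.
        A clearly nonzero estimate casts doubt on the assumption of
        unconfounded treatment assignment."""
        t = matched.loc[matched["treat"] == 1, pre_outcome]
        c = matched.loc[matched["treat"] == 0, pre_outcome]
        return t.mean() - c.mean(), stats.ttest_ind(t, c).pvalue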

Analyses can also be performed to assess sensitivity to an unobserved variable. Rosenbaum and Rubin (1983a) extend the ideas of Cornfield et al. (1959), who examined how strong the correlations would have to be between a hypothetical unobserved covariate and both treatment assignment and the outcome to make the observed estimate of the treatment effect be zero. This approach is also discussed and applied to an economic application in Imbens (2003). Rosenbaum (1991b) describes a sensitivity analysis for case-control studies and discusses how sensitivity could also be assessed in situations where there are multiple sources of control units available—some closer on some (potentially unobserved) dimensions and others closer on other (potentially unobserved) dimensions. See Stuart and Rubin (in press) for another example of using multiple sources of control units.

Multiple Treatment Doses

Throughout this discussion of matching, it has been assumed that there are just two groups: treated and control. However, in many studies, there are actually multiple levels of the treatment (e.g., doses of a drug). Rosenbaum (2002) summarizes two methods for dealing with multiple treatment levels. In the first method, the propensity score is still a scalar function of the covariates (Joffe & Rosenbaum, 1999). This method uses a model such as an ordinal logit model to match on a linear combination of the covariates. This is illustrated in Lu, Zanutto, Hornik, and Rosenbaum (2001), where matching is used to form pairs that balance covariates but differ markedly in dose of treatment received. This differs from the standard matching setting in that there are not clear “treatment” and “control” groups, and thus any two subjects could conceivably be paired. An optimal matching algorithm for this setting is described and applied to the evaluation of a media campaign against drug abuse. In the second method, each of the levels of treatment has its own propensity score (e.g., Imbens, 2000; Rosenbaum, 1987a), and each propensity score is used one at a time to estimate the distribution of responses that would have been observed if all units had received that treatment level. These distributions are then compared.
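A sketch of the first step of the second method, assuming a covariate matrix X and an integer-coded dose variable; with more than two dose levels, scikit-learn’s logistic regression fits a multinomial model by default, yielding one estimated score per treatment level for each unit, which can then be used one level at a time for weighting or subclassification:

    from sklearn.linear_model import LogisticRegression

    def dose_level_propensity_scores(X, dose):
        """One estimated propensity score per treatment level: column k
        of the returned array is each unit's estimated probability of
        receiving dose level k given the covariates."""
        model = LogisticRegression(max_iter=1000).fit(X, dose)
        return model.predict_proba(X)  # shape: (n_units, n_levels)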

Encompassing these two methods, Imai and van Dyk (2004) generalize the propensity score to arbitrary treatment regimes (including ordinal, categorical, and multidimensional). They provide theorems for the properties of this generalized propensity score (the propensity function), showing that it has properties similar to those of the propensity score for binary treatments, in that adjusting for the low-dimensional (not always scalar, but always low-dimensional) propensity function balances the covariates. They advocate subclassification rather than matching and provide two examples, as well as simulations showing the performance of adjustment based on the propensity function.

Diagnostics are especially crucial in this setting because it becomes more difficult to assess the balance of the resulting samples when there are multiple treatment levels. It is even unclear what balance precisely means in this setting: Does there need to be balance among all of the levels, or only among pairwise comparisons of dose levels? Future work is needed to examine these issues.

EVALUATION OF MATCHING METHODS

Two major types of evaluations of matching methods have been done, one using simulated data and another trying to replicate results from randomized experiments using observational data. Simulations that compare the performance of matching methods in terms of bias reduction include Cochran and Rubin (1973), Rubin (1973a, 1973b, 1979), Rubin and Thomas (2000), Gu and Rosenbaum (1993), Frolich (2004), and Zhao (2004). These generally include relatively small numbers of covariates drawn from known distributions. Many of the results from these simulations have been included in the discussions of methods provided in this chapter.

A second type of evaluation has attempted to replicate the results of randomized experiments using observational data. Glazerman et al. (2003) summarize the results from 12 case studies that attempted to replicate experimental estimates using nonexperimental data, all in the context of job training, welfare, and employment programs with earnings as the outcome of interest. The nonexperimental methods include matching and covariance adjustment. From the 12 studies, they extract 1,150 estimates of the bias (approximately 96 per study), where bias is defined as the difference between the result from the randomized experiment and the result using observational data. They determine that it is in general difficult to replicate experimental results consistently and that nonexperimental estimates are often dramatically different from experimental results. However, some general guidance can be obtained.

Glazerman et al. (2003) find that one-to-one propensity score matching performs better than other propensity score matching methods or non–propensity score matching and that standard econometric selection correction procedures, such as instrumental variables or the Heckman selection correction, tend to perform poorly. As discussed earlier, their results also show that combining methods, such as matching and covariance adjustment, is better than using those methods individually. They also stress the importance of high-quality data and a rich set of covariates, and they discuss the difficulties in trying to use large publicly available data sets for this purpose. However, there are counterexamples to this general guidance. For example, using the NSW data, Dehejia and Wahba (1999) found that propensity score matching methods using a large, publicly available national data set replicated experimental results very well.

A number of authors, particularly Heckman and colleagues, use data from the U.S. National Job Training Partnership Act (JTPA) study to evaluate matching methods (Heckman et al., 1997; Heckman, Ichimura, Smith, & Todd, 1998; Heckman, Ichimura, & Todd, 1998). Some of their results are similar to those of Glazerman et al. (2003), particularly stressing the importance of high-quality data. Matching is best able to replicate the JTPA experimental results when (a) the same data sources are used for the participants and nonparticipants, thereby ensuring similar covariate meaning and measurement; (b) participants and nonparticipants reside in the same local labor markets; and (c) the data set contains a rich set of covariates to model the probability of receiving the treatment. Reaching somewhat similar conclusions, Michalopoulos et al. (2004) use data on welfare-to-work programs that had random assignment and again find that within-state comparisons have less bias than out-of-state comparisons. They compare estimates from propensity score matching, ordinary least squares, a fixed-effects model, and a random-growth model and find that no method is consistently better than the others but that the matching method was more useful for diagnosing situations in which the data set was insufficient for the comparison. Hill et al. (2004) also stress the importance of matching on geography as well as other covariates; using data from a randomized experiment of a child care program for low-birth-weight children and comparison data from the National Longitudinal Study of Youth, they were able to replicate well the experimental results using matching with a large set of covariates, including individual-level and geographic area–level covariates. Ordinary least squares with the full set of covariates or matching with a smaller set of covariates did not perform as well as the propensity score matching with the full set of covariates. Agodini and Dynarski (2004) describe an example where matching methods highlighted the fact that the data were insufficient to estimate causal effects without heroic assumptions.

ADVICE TO AN INVESTIGATOR

To conclude, this section provides advice to investigators interested in implementing matching methods.

Control Group and Covariate Selection

As discussed in Cochran (1965), Cochran and Rubin (1973), and Rosenbaum (1999), a key to estimating causal effects with observational data is to identify an appropriate control group, ideally with good overlap with the treated group. Care should be taken to find data sets that have units similar to those in the treated group, with comparable covariate meaning and availability. Approximations for the maximum percent bias reduction possible can be used to determine which of a set of control groups are likely to provide the best matches or to help guide sample sizes and matching ratios (Rubin, 1976c; Rubin & Thomas, 1992b, 1996). Large pools of potential controls are beneficial, as many articles show that much better balance is achieved when there are many controls available for the matching (Rubin, 1976c; Rubin & Thomas, 1996). As discussed earlier, researchers should include all available covariates in the propensity score specification; excluding potentially relevant covariates can create bias in the estimation of treatment effects, but including potentially irrelevant covariates will typically not reduce the quality of the matches much (Rubin & Thomas, 1996).

Distance Measure

Once the control pool is selected, propensity score matching is the most effective method for reducing bias due to many covariates (Gu & Rosenbaum, 1993; Rosenbaum & Rubin, 1985b). As discussed earlier, propensity scores can be estimated using logistic regression, and the propensity score specification should be assessed using a method such as that in the “Model Specification” subsection. This generally involves examining the balance of covariates in subclasses defined by the propensity score. If there are a few covariates designated as particularly related to the outcome, and thus it is considered desirable to obtain especially close matches on those covariates, Mahalanobis matching on those key covariates can be done within propensity score calipers (Rosenbaum & Rubin, 1985b; Rubin & Thomas, 2000).
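One possible greedy implementation of Mahalanobis matching within propensity score calipers, offered as a sketch rather than as the exact algorithm of Rosenbaum and Rubin (1985b); X_t and X_c hold the key covariates and ps_t and ps_c the estimated propensity scores:

    import numpy as np

    def mahalanobis_within_calipers(X_t, X_c, ps_t, ps_c, caliper):
        """Greedy 1:1 matching: for each treated unit, the closest
        control in Mahalanobis distance on the key covariates, among
        controls whose propensity scores lie within the caliper."""
        pooled = np.vstack([X_t, X_c])
        cov_inv = np.linalg.inv(np.cov(pooled.T))
        available = set(range(len(X_c)))
        matches = {}
        for i in range(len(X_t)):
            inside = [j for j in available
                      if abs(ps_t[i] - ps_c[j]) <= caliper]
            if not inside:
                continue  # no control within the caliper; unit unmatched
            diffs = X_c[inside] - X_t[i]
            # Squared Mahalanobis distance to each eligible control
            d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
            j_best = inside[int(np.argmin(d2))]
            matches[i] = j_best
            available.remove(j_best)
        return matches  # treated index -> matched control index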

Recommended Matching Methods

Our advice for the matching method itself is very general: Try a variety of methods and use the diagnostics discussed earlier to determine which approach yields the most closely matched samples. Since the design and analysis stages are clearly separated and the outcome is not used in the matching process, trying a variety of methods and selecting the one that leads to the best covariate balance cannot bias the results. Although the best method will depend on the individual data set, below we highlight some methods that are likely to produce good results for the two general situations considered in this chapter.

To Select Units for Follow-Up

For the special case of doing matching for the purpose of selecting well-matched controls for follow-up (i.e., when the outcome values are not yet available), optimal matching is generally best for producing well-matched pairs (Gu & Rosenbaum, 1993). Optimal matching aims to reduce a global distance measure, rather than just considering each match one at a time, and thus reconsiders earlier matches if better overall balance could be obtained by breaking an earlier match. Further details are given in the “Nearest Neighbor Matching” subsection. However, if overall balanced samples are all that is desired (rather than specifically matched pairs), then an easier and more straightforward nearest neighbor greedy matching algorithm can be used to select the controls.
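For intuition, optimal one-to-one matching on the propensity score can be expressed as an assignment problem; the sketch below uses a generic assignment solver rather than the network-flow implementations typically used in the matching literature:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def optimal_pair_match(ps_t, ps_c):
        """Optimal 1:1 matching on the propensity score: minimizes the
        total distance over all pairs, so an early match is 'broken' if
        doing so improves the overall solution. Assumes at least as many
        controls as treated units; ps_t and ps_c are NumPy arrays."""
        distance = np.abs(ps_t[:, None] - ps_c[None, :])
        t_idx, c_idx = linear_sum_assignment(distance)
        return list(zip(t_idx, c_idx))  # (treated, control) pairs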

Researchers should also consider whether it is feasible (or desirable) to obtain more than one matched control for each treated unit (or even some treated units), as discussed earlier. With relatively small control pools, it may be difficult to obtain more than one match for each treated unit and still obtain large reductions in bias. However, with larger control pools, it may be possible to obtain more than one match for each treated unit without sacrificing bias reduction. This decision is also likely to involve cost considerations.

If Outcome Data Are Already Available

When the outcome values are already available, first put the outcome values away when matching! Then, a variety of good methods exist and should be considered and tried. In particular, one-to-one propensity score matching is often a good place to start (or k-to-one if there are many controls relative to the number of treated units). If there is still substantial bias between the groups in the matched samples (e.g., imbalance in the propensity score of more than 0.5 standard deviations), the nearest neighbor matching can be combined with subclassification on the matched samples, as discussed earlier. Full matching and subclassification on the full data sets also often work well in practice, where full matching can be thought of as in between the two extremes of one-to-one matching and weighting.

Outcome Analysis

After matched samples are selected, the outcome analysis can proceed: linear regression, logistic regression, hierarchical modeling, and so on. As discussed earlier, results should be less sensitive to the modeling assumptions and thus should be fairly insensitive to the model specification, as compared with the same analysis on the original unmatched samples. With procedures such as full matching, subclassification, or matching with replacement, where there may be different numbers of treated and control units at each value of the covariates, the analysis should incorporate weights to account for these varying distributions.
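For example, a weighted least squares analysis incorporating such weights, assuming a hypothetical column named weight produced by the matching procedure:

    import statsmodels.api as sm

    def weighted_outcome_analysis(df, covariates):
        """Weighted least squares with the matching weights, for designs
        (full matching, subclassification, matching with replacement)
        in which units carry unequal weights."""
        X = sm.add_constant(df[["treat"] + list(covariates)])
        fit = sm.WLS(df["outcome"], X, weights=df["weight"]).fit()
        return fit.params["treat"]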

SOFTWARE

A variety of software packages are currently available to implement matching methods. These include multiple R packages (MatchIt, Ho, Imai, King, & Stuart, 2006; twang, Ridgeway, McCaffrey, & Morral, 2006; Matching, Sekhon, 2006), multiple Stata packages (Abadie, Drukker, Herr, & Imbens, 2004; Becker & Ichino, 2002; Leuven & Sianesi, 2003), and SAS code for propensity score matching (D’Agostino, 1998; Parsons, 2001). A major benefit of the R packages (particularly MatchIt and twang) is that they clearly separate the design and analysis stages and have extensive propensity score diagnostics. The Stata and SAS packages and procedures do not explicitly separate these two stages.

NOTES

1. The data for this example are available at http://www.nber.org/%7Erdehejia/nswdata.html and in the MatchIt matching package for R, available at http://gking.harvard.edu/matchit.

2. With nonnormally distributed covariates, the conditions are even more complex.

REFERENCES

Abadie, A., Drukker, D., Herr, J. L., & Imbens, G. W. (2004). Implementing matching estimators for average treatment effects in Stata. The Stata Journal, 4(3), 290–311.

Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267.

Agodini, R., & Dynarski, M. (2004). Are experiments the only option? A look at dropout prevention programs. Review of Economics and Statistics, 86(1), 180–194.

Althauser, R., & Rubin, D. (1970). The computerized construction of a matched sample. American Journal of Sociology, 76, 325–346.

Austin, P. C., & Mamdani, M. M. (2006). A comparison of propensity score methods: A case-study illustrating the effectiveness of post-AMI statin use. Statistics in Medicine, 25, 2084–2106.

Becker, S. O., & Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata Journal, 2(4), 358–377.

Breiman, L. J., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.

Chapin, F. (1947). Experimental designs in sociological research. New York: Harper.

Christakis, N. A., & Iwashyna, T. I. (2003). The health impact of health care on families: A matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science & Medicine, 57, 465–475.

Cochran, W. G. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13, 261–281.

Cochran, W. G. (1965). The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 234–255.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295–313.

Cochran, W. G., & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35, 417–446.

Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22, 173–203.

Czajka, J. C., Hirabayashi, S., Little, R., & Rubin, D. B. (1992). Projecting from advance data using propensity modeling. Journal of Business and Economic Statistics, 10, 117–131.

D’Agostino, R. B., Jr. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, 2265–2281.

D’Agostino, R. B., Jr., Lang, W., Walkup, M., & Morgan, T. (2001). Examining the impact of missing data on propensity score estimation in determining the effectiveness of self-monitoring of blood glucose (SMBG). Health Services & Outcomes Research Methodology, 2, 291–315.

D’Agostino, R. B., Jr., & Rubin, D. B. (2000). Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association, 95, 749–759.


Dehejia, R. H. (2005). Practical propensity score matching: A reply to Smith and Todd. Journal of Econometrics, 125, 355–364.

Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Re-evaluating the evaluation of training programs. Journal of the American Statistical Association, 94, 1053–1062.

Dehejia, R. H., & Wahba, S. (2002). Propensity score matching methods for non-experimental causal studies. Review of Economics and Statistics, 84, 151–161.

Drake, C. (1993). Effects of misspecification of the propensity score on estimators of treatment effects. Biometrics, 49, 1231–1236.

Du, J. (1998). Valid inferences after propensity score subclassification using maximum number of subclasses as building blocks. Unpublished doctoral dissertation, Harvard University, Department of Statistics.

Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58, 21–29.

Frolich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. Review of Economics and Statistics, 86(1), 77–90.

Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589, 63–93.

Greenland, S. (2003). Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology, 14(3), 300–306.

Greenwood, E. (1945). Experimental sociology: A study in method. New York: King’s Crown Press.

Greevy, R., Lu, B., Silber, J. H., & Rosenbaum, P. (2004). Optimal multivariate matching before randomization. Biostatistics, 5, 263–275.

Gu, X., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2, 405–420.

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618.

Heckman, J. J., Ichimura, H., & Todd, P. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 64, 605–654.

Heckman, J. J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66(5), 1017–1098.

Heckman, J. J., Ichimura, H., & Todd, P. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies, 65, 261–294.

Hill, J. (2004). Reducing bias in treatment effect estimation in observational studies suffering from missing data. Working Paper 04-01, Columbia University Institute for Social and Economic Research and Policy (ISERP).

Hill, J., Reiter, J., & Zanutto, E. (2004). A comparison of experimental and observational data analyses. In A. Gelman & X.-L. Meng (Eds.), Applied Bayesian modeling and causal inference from an incomplete-data perspective (pp. 44–56). New York: John Wiley.

Hill, J., Rubin, D. B., & Thomas, N. (1999). The design of the New York School Choice Scholarship Program evaluation. In L. Bickman (Ed.), Research designs: Inspired by the work of Donald Campbell (pp. 155–180). Thousand Oaks, CA: Sage.

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2006). MatchIt: Nonparametric preprocessing for parametric causal inference. Software for using matching methods in R. Available at http://gking.harvard.edu/matchit/

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3), 199–236. Available at http://pan.oxfordjournals.org/cgi/reprint/mpl013?ijkey=K17Pjban3gH2zs0&keytype=ref

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.

Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101(475), 901–910.

Horvitz, D., & Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.

Imai, K., & van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 99(467), 854–866.

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706–710.

Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2), 126–132.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29.

Joffe, M. M., & Rosenbaum, P. R. (1999). Propensity scores. American Journal of Epidemiology, 150, 327–333.


King, G., & Zeng, L. (2007). When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly, 51, 183–210.

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604–620.

Leuven, E., & Sianesi, B. (2003). psmatch2. Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Available at http://www1.fee.uva.nl/scholar/mdw/leuven/stata

Lu, B., Zanutto, E., Hornik, R., & Rosenbaum, P. R. (2001). Matching with doses in an observational study of a media campaign against drug abuse. Journal of the American Statistical Association, 96, 1245–1253.

Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937–2960.

McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425.

Michalopoulos, C., Bloom, H. S., & Hill, C. J. (2004). Can propensity-score methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? Review of Economics and Statistics, 86(1), 156–179.

Morgan, S. L., & Harding, D. J. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35(1), 3–60.

Parsons, L. S. (2001, April). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. Paper presented at SAS SUGI 26, Long Beach, CA.

Perkins, S. M., Tu, W., Underhill, M. G., Zhou, X.-H., & Murray, M. D. (2000). The use of propensity scores in pharmacoepidemiological research. Pharmacoepidemiology and Drug Safety, 9, 93–101.

Reinisch, J., Sanders, S., Mortensen, E., & Rubin, D. B. (1995). In utero exposure to phenobarbital and intelligence deficits in adult men. Journal of the American Medical Association, 274, 1518–1525.

Ridgeway, G., McCaffrey, D., & Morral, A. (2006). twang: Toolkit for weighting and analysis of nonequivalent groups. Software for using matching methods in R. Available at http://cran.r-project.org/src/contrib/Descriptions/twang.html

Robins, J., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122–129.

Robins, J., & Rotnitzky, A. (2001). Comment on P. J. Bickel and J. Kwon, “Inference for semiparametric models: Some questions and an answer.” Statistica Sinica, 11(4), 920–936.

Rosenbaum, P. R. (1987a). Model-based direct adjustment. Journal of the American Statistical Association, 82, 387–394.

Rosenbaum, P. R. (1987b). The role of a second control group in an observational study (with discussion). Statistical Science, 2(3), 292–316.

Rosenbaum, P. R. (1991a). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B (Methodological), 53(3), 597–610.

Rosenbaum, P. R. (1991b). Sensitivity analysis for matched case-control studies. Biometrics, 47(1), 87–100.

Rosenbaum, P. R. (1999). Choice as an alternative to control in observational studies (with discussion and rejoinder). Statistical Science, 14(3), 259–304.

Rosenbaum, P. R. (2002). Observational studies (2nd ed.). New York: Springer-Verlag.

Rosenbaum, P. R., & Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B, 45(2), 212–218.

Rosenbaum, P. R., & Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.

Rosenbaum, P. R., & Rubin, D. B. (1985a). The bias due to incomplete matching. Biometrics, 41, 103–116.

Rosenbaum, P. R., & Rubin, D. B. (1985b). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33–38.

Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29, 159–184.

Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29, 185–203.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.

Rubin, D. B. (1976a). Inference and missing data (with discussion). Biometrika, 63, 581–592.

Rubin, D. B. (1976b). Multivariate matching methods that are equal percent bias reducing: I. Some examples. Biometrics, 32, 109–120.


Rubin, D. B. (1976c). Multivariate matching methods that are equal percent bias reducing: II. Maximums on bias reduction. Biometrics, 32, 121–132.

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1–26.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6, 34–58.

Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74, 318–328.

Rubin, D. B. (1980). Discussion of “Randomization analysis of experimental data in the Fisher randomization test” by D. Basu. Journal of the American Statistical Association, 75, 591–593.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley.

Rubin, D. B. (1990). Formal modes of statistical inference for causal effects. Journal of Statistical Planning and Inference, 25, 279–292.

Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757–763.

Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services & Outcomes Research Methodology, 2, 169–188.

Rubin, D. B. (2004). On principles for modeling propensity scores in medical research. Pharmacoepidemiology and Drug Safety, 13, 855–857.

Rubin, D. B. (2006). Matched sampling for causal inference. Cambridge, UK: Cambridge University Press.

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26(1), 20–30.

Rubin, D. B., & Stuart, E. A. (2006). Affinely invariant matching methods with discriminant mixtures of proportional ellipsoidally symmetric distributions. The Annals of Statistics, 34(4), 1814–1826.

Rubin, D. B., & Thomas, N. (1992a). Affinely invariant matching methods with ellipsoidal distributions. The Annals of Statistics, 20, 1079–1093.

Rubin, D. B., & Thomas, N. (1992b). Characterizing the effect of matching using linear propensity score methods with normal distributions. Biometrika, 79, 797–809.

Rubin, D. B., & Thomas, N. (1996). Matching using estimated propensity scores, relating theory to practice. Biometrics, 52, 249–264.

Rubin, D. B., & Thomas, N. (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95, 573–585.

Sekhon, J. S. (2006). Matching: Multivariate and propensity score matching with balance optimization. Software for using matching methods in R. Available at http://sekhon.berkeley.edu/matching

Smith, H. (1997). Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27, 325–353.

Smith, J., & Todd, P. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125, 305–353.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th ed.). Ames: Iowa State University Press.

Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476), 1398–1407.

Song, J., Belin, T. R., Lee, M. B., Gao, X., & Rotheram-Borus, M. J. (2001). Handling baseline differences and missing items in a longitudinal study of HIV risk among runaway youths. Health Services & Outcomes Research Methodology, 2, 317–329.

Stuart, E. A., & Green, K. M. (in press). Using full matching to estimate causal effects in non-experimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology.

Stuart, E. A., & Rubin, D. B. (in press). Matching with multiple control groups with adjustment for group differences. Journal of Educational and Behavioral Statistics.

Winship, C., & Morgan, S. L. (1999). The estimation of causal effects from observational data. Annual Review of Sociology, 25, 659–706.

Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metrics, and Monte Carlo evidence. Review of Economics and Statistics, 86(1), 91–107.
