
Panel Data Designs and Estimators as Substitutes for Randomized Controlled Trials in the Evaluation of Public Programs

Paul J. Ferraro, Juan José Miranda

Abstract: In the evaluation of public programs, experimental designs are rare. Researchers instead rely on observational designs. Observational designs that use panel data are widely portrayed as superior to time-series or cross-sectional designs because they provide opportunities to control for observable and unobservable variables correlated with outcomes and exposure to a program. The most popular panel data evaluation designs use linear, fixed-effects estimators with additive individual and time effects. To assess the ability of observational designs to replicate results from experimental designs, scholars use design replications. No such replications have assessed popular, fixed-effects panel data models that exploit repeated observations before and after treatment assignment. We implement such a study using, as a benchmark, results from a randomized environmental program that included effective and ineffective treatments. The popular linear, fixed-effects estimator fails to generate impact estimates or statistical inferences similar to the experimental estimator. Applying common flexible model specifications or trimming procedures also fails to yield accurate estimates or inferences. However, following best practices for selecting a nonexperimental comparison group and combining matching methods with panel data estimators, we replicate the experimental benchmarks. We demonstrate how the combination of panel and matching methods mitigates common concerns about specifying the correct functional form, the nature of treatment effect heterogeneity, and the way in which time enters the model. Our results are consistent with recent claims that design trumps methods in estimating treatment effects and that combining designs is more likely to approximate a randomized controlled trial than applying a single design.

JEL Codes: C1, C23, C93, D12, Q25

Keywords: Causal inference, Impact evaluation, Matching methods, Panel data, Within study

Paul J. Ferraro is at Johns Hopkins University, Bloomberg School of Public Health, Carey School of Business, and Whiting School of Engineering, 100 International St., Baltimore, MD 21202 ([email protected]). Juan José Miranda is at the World Bank, Environment and Natural Resources Global Practice, 1818 H St., NW, Washington, DC 20433 ([email protected]). For feedback and comments, we thank Spencer Banzhaf, Thomas Cook, Craig McIntosh, Sheila Olmstead, Jeffrey Smith, Cody Wing, Vivian Wong, and participants at seminars at CATIE, Cornell University, Georgia State University, NCSU Camp Resource Conference, Oregon State University, Pontificia Universidad Católica de Chile, Purdue University, Resources for the Future, and University of Wyoming. For the water data, we thank Kathy Nguyen, Herb Richardson, and Kathleen Brown of Cobb County Water System and Diane Raymond of Fulton County.

Received October 15, 2014; Accepted July 26, 2016; Published online February 9, 2017.

JAERE, volume 4, number 1. © 2017 by The Association of Environmental and Resource Economists. All rights reserved. 2333-5955/2017/0401-0008$10.00 http://dx.doi.org/10.1086/689868


The three keys to success [in impact evaluation] are "design, design, design." . . . No form of statistical analysis can fully rescue a weak research design.

—Bloom 2010

PUBLIC PROGRAMS ARE RARELY IMPLEMENTED as randomized controlled trials. Thus researchers aiming to estimate program impacts rely on nonexperimental, observational designs. Although nonexperimental designs can, in theory, perform as well as experimental designs in identifying causal relationships, debates about their performance in practice are common (e.g., Lalonde 1986; Heckman, Ichimura, and Todd 1997; Dehejia 2005; Smith and Todd 2005).

To better understand the performance of observational designs, scholars often use "within-study comparisons" or "design replications" (Cook, Shadish, and Wong 2008). In these studies, researchers estimate a program's impact by randomizing units into treatment and control groups. Under effective randomization, the estimated treatment effect is assumed to have high internal validity. Then the researchers drop the randomized control group and form a nonrandomized comparison group, like one would have in an observational study (in our study, "control group" refers to randomly assigned untreated units, whereas "comparison group" refers to untreated units in an observational study). Using the nonexperimental comparison group, the researchers reestimate the treatment effect using an empirical design assumed to eliminate sources of bias. To assess the performance of the nonexperimental design, the nonexperimental estimate is compared to the experimental estimate.

In our design replication, the experimental benchmark comes from a randomized controlled trial (RCT) with over 100,000 households in a metropolitan county in the United States (Ferraro, Miranda, and Price 2011; Ferraro and Miranda 2013, 2014; Ferraro and Price 2013; Bernedo, Ferraro, and Price 2014). In the RCT, a water utility sent messages to households to induce voluntary reductions in water use. Ferraro and Price (2013) detected no randomization biases or interference among units. Thus randomization of households into control and treatment groups, followed by a contrast of each group's mean water consumption, is assumed to provide an unbiased estimator of the average treatment effect.

To form a nonexperimental comparison group, we use households from a neighboring county. The neighboring county had similar water-pricing policies and the same water sources, weather patterns, state and metro regulatory environments, and other regional confounding factors during the messaging experiment. Participants did not self-select into the program, but they may have sorted themselves across counties based on characteristics that also affect water use. For the experimental and nonexperimental households, we have 13 months of pre-treatment panel data and 4 months of post-treatment panel data.

Our study advances the evaluation literature in important ways. First, our study is the first design replication to contrast an experimental benchmark to the estimates from a linear, fixed-effects panel data (FEPD) model that exploits repeated observations before and after treatment assignment. These models are workhorses in evaluations that use panel data (Wooldridge 2005; Imai and Kim 2014). They are particularly common in the fields of policy, economics, and political science and are often viewed as good substitutes for randomized experimental designs.1 We know of only one other design replication with repeated observations before and after treatment assignment (St. Clair, Cook, and Hallberg 2014), but that study uses a random-effects model, rather than a FEPD model.2 We show that the standard linear, fixed-effects estimator with additive individual and time effects cannot replicate the experimental estimates or statistical inferences.

Second, our study is the first to explore the role of pre-processing the data by matching or by trimming prior to applying the FEPD estimator. Recently, scholars have recommended combining designs to achieve robustness under misspecification (e.g., Ho et al. 2007; Imbens and Wooldridge 2009). When matching (particularly caliper matching) and the FEPD model are combined in our study, the nonexperimental estimator yields estimates and inferences similar to the RCT's. Without matching, or with trimming or incomplete matching, the estimates and inferences are inaccurate, even with flexible modeling of unit-specific time trends.

1. For example, in recent issues of an applied economics journal (American Economic Journal: Applied Economics, 2 (1)–6 (3)) and an applied policy journal (Journal of Policy Analysis and Management, 28 (1)–33 (2)), 90 studies use panel data with repeated observations on the same units to estimate a treatment effect. Over three-quarters use a version of the linear, fixed-effects panel data model.

2. In the statistical, social science, and program evaluation literatures, multiple terms are used to describe the same models. By "fixed-effects model," we mean an estimator that uses a vector of dummy variables for the units from which grouped data arise (the unit-specific effects are part of the intercept term). By "random-effects model," we mean an estimator that assumes unit effects are drawn from an underlying, modeled distribution. Random-effects models assume that unobservable, unit-specific characteristics are uncorrelated with the independent variables, whereas fixed-effects models permit correlation. The contrast between the two models is often portrayed as a bias-variance trade-off, with fixed-effects models having lower bias and random-effects models having lower variance (e.g., Angrist and Pischke 2009).


Importantly, we show how matching helps: pre-processing the data by matching or trimming addresses criticisms of the behavioral assumptions underlying the FEPD estimator. In particular, we show that matching addresses the modeling of counterfactual trends by design rather than through parametric modeling and specification searches. It also renders other assumptions of the FEPD estimator, such as homogeneous treatment effects, less troublesome. We also show that researchers should not rely on the degree to which pre-treatment trends are parallel as an indirect test of the design's identification assumptions. In fact, omitting pre-treatment water use variables from the matching has less of an effect on performance than omitting time-invariant attributes.

Our study has common features with a related study (Ferraro and Miranda 2014). These two studies are the only design replications using an environmental program. Most design replications are in the context of labor, poverty, or education programs (see next section). They are also the only design replications using information-based and social norm–based interventions, also known as "nudges." Behavioral nudges to achieve policy goals are increasingly popular and are often characterized by inexpensive implementation and small treatment effects. Inaccurate estimates of these effects can thus greatly influence cost-benefit analyses. Moreover, the context of the two studies is one in which exposure to the program occurs because of where people chose to live rather than because people chose to participate in the program (self-selection). Like other public programs, environmental programs are frequently implemented in administrative units like states, counties, or cities. Program evaluation designs typically look to unexposed, neighboring administrative units to create comparison groups, and then apply statistical techniques to control for observable and unobservable differences between the exposed and unexposed groups (e.g., Card and Krueger 1994). Such evaluation contexts differ from the mostly voluntary program contexts that comprise the design-replication literature (education being a notable counterexample).

Like Ferraro and Miranda (2014), we use an experimental benchmark in which one treatment had a large, statistically significant impact, and another treatment had a negligible, statistically insignificant impact (trimming the top one-quarter of 1% of the sample yields a precisely estimated effect near zero). As they write, "a valid observational design should be able to detect an impact where one exists, and fail to detect one where one does not exist" (345). In contrast, the rest of the design-replication literature comprises studies of a single treatment with a statistically significant, policy-relevant impact. We also address the criticism that design replications fail to consider the sensitivity of results to the choice of sample (Smith and Todd 2005) by using bootstrapping methods to determine the sensitivity of a design's performance to changes in the sample (McKenzie, Gibson, and Stillman 2010).

Unlike Ferraro and Miranda (2014), we use the full panel data available from the program. Ferraro and Miranda look at cross-sectional estimators and a simple difference-in-differences estimator using one pre-treatment observation and one post-treatment observation (they find that the difference-in-differences estimator performs poorly). Also unlike Ferraro and Miranda, we quantitatively explore the reasons why pre-processing via matching improves the performance of our estimator. In the next section, we place our study in the broader design-replication literature.

1. THE PERFORMANCE OF OBSERVATIONAL DESIGNS

One of the first design replications uses the treated group from the National Supported Work (NSW) randomized field experiment and nonrandom comparison groups from national surveys, such as the Current Population Survey and the Panel Study of Income Dynamics (Lalonde 1986; Fraker and Maynard 1987; Lalonde and Maynard 1987). The authors conclude that nonexperimental designs cannot systematically recover experimental estimates of labor market program impacts, but these conclusions have been debated (Heckman et al. 1997; Dehejia and Wahba 1999, 2002; Glazerman, Levy, and Myers 2003; Smith and Todd 2005).

Heckman and colleagues argue that three criteria are necessary to draw unbiased (or small-bias) inferences from observational designs (Heckman et al. 1997): (i) participant and nonparticipant data come from the same sources, with similar measures of the outcome variable; (ii) participants and nonparticipants share the same economic environment; and (iii) the data contain a rich set of variables for eliminating sources of bias. In a review of design replications, Cook et al. (2008) argue that observational designs perform better when they condition on pre-treatment variables (outcomes and covariates) and the treated and comparison units come from similar environments (echoing the three Heckman et al. criteria). Our design replication conforms to these criteria.

Most design replications focus on conditioning strategies (i.e., selection on observables), like matching or regression analyses, and a few others examine regression discontinuity designs and instrumental variable (IV) designs (e.g., Agodini and Dynarski 2004; Buddelmeyer and Skoufias 2004; Hill, Reiter, and Zanutto 2004; Black, Galdo, and Smith 2005; Arceneaux, Gerber, and Green 2006; Diaz and Handa 2006; Wilde and Hollister 2007; Handa and Maluccio 2010; McKenzie et al. 2010; Wing and Cook 2013). Regression discontinuity designs seem to perform best: they consistently replicate experimental benchmarks in the neighborhood of the discontinuity. There are too few IV design replications to draw any conclusions (and they are complicated by the fact that the designs estimate a local average treatment effect, which may differ from the average treatment effect estimated in the RCT benchmark). For conditioning strategies, estimates tend to be more accurate the greater the use of pre-treatment variables (outcomes and covariates) and the more similar the treated and untreated unit environments.

Most design replications with pre-treatment outcomes have only one pre-treatment period, allowing simple two-period, difference-in-differences or analysis of covariance (ANCOVA) designs. Several design replications in labor economics have argued that pre-processing the data with matching, followed by a difference-in-differences estimator, performs better than a cross-sectional matching approach (Heckman et al. 1997; Heckman, Ichimura, and Todd 1998; Smith and Todd 2005).


To our knowledge, only one design replication has repeated pre-treatment and post-treatment outcomes and, like our study, contrasts a panel data design to an RCT (St. Clair et al. 2014). The treatment was a K–12 educational intervention aimed at improving student performance. With 70 schools and a clustered RCT design, no treatment effect was detected (perhaps because of low statistical power owing to the sample's modest size and lack of student-level data). To form a nonexperimental comparison group, the authors use all schools in the state that were not in the experiment. Using this comparison group, the authors estimate the intervention's effect using a random-effects model that includes an interaction of the treatment variable with a time variable to allow for differential trends between treated and comparison groups.3

They find that the nonexperimental design yields estimates similar to the experimental benchmark ("similar" defined as "within 0.20 standard deviations"). The authors also pre-process the data with a simple form of matching and draw the same conclusions.

2. THE RCT AND THE NONEXPERIMENTAL COMPARISON GROUP

In 2007, Cobb County Water System (Georgia, USA) implemented a targeted, residential information campaign as an RCT. Its goal was to test the effectiveness of three messages in inducing voluntary reductions in water consumption among its residential customers during a drought. Each treatment group comprised approximately 11,700 households, and the control group comprised approximately 71,600 households. Treatments were assigned in May 2007 and randomized within almost 400 meter route units (small neighborhoods). Ferraro and Price (2013) show that randomization was effective at balancing pre-treatment water use across treatment arms, there was no differential attrition by treatment group,4 there was no treatment noncompliance, and there was unlikely to have been any interference among units (i.e., Stable Unit Treatment Value Assumption violations).

We focus on two of the three treatments: (i) the technical information message and (ii) the social comparison message (for text, see the appendix of Ferraro and Price 2013). The technical information message instructed households on strategies to reduce water use. The social comparison message augmented the technical information with social norm–based encouragement and a social comparison, in which own consumption was compared to median county consumption. To justify the program costs, the Water System sought a 2% reduction in 2007 summer water use. We choose the technical information treatment because it had a small estimated effect that was well below the 2% threshold. In contrast, the social comparison treatment had the largest estimated treatment effect: more than double the 2% threshold (Ferraro and Price 2013).

3. Our impression is that education researchers rely more on random-effects models because they often wish to model the role of time-invariant factors like schools. Such modeling is much more difficult in the context of a fixed-effects model that eliminates the effects of time-invariant factors. As in many evaluation contexts, however, the key identifying assumption of the random-effects estimator does not seem tenable in our study: i.e., the assumption that unobserved, household time-invariant effects are uncorrelated with exposure to the treatment.

4. The home is the treated unit, rather than the people, and thus technically there is zero attrition. Less than 0.5% of the homes in the experiment experienced a name change on the customer account between April and October 2007, and there is no evidence of differential change rates among treatment arms.

To form a nonexperimental comparison group, we asked water utility employees in metropolitan Atlanta which water system was most like the Cobb County Water System. The employees consistently identified the Fulton County Water Service Division, which services households in northern Fulton County. Households in this service area share a border and a congressional district with Cobb County (northern Fulton County does not include the city of Atlanta or the county south of the city). The two utilities measure water use with meters and had similar water-pricing policies during the study period. Their customers live close to each other (approximately a one-hour drive between the two farthest points), and had common water sources, weather patterns, media markets, state and metro regulatory environments, and other regional factors that affected water use during the study period.

Drawing households from two small contiguous counties in a single metropolitan area affords the opportunity (a) to meet the Heckman, Ichimura, and Todd and Cook, Shadish, and Wong criteria for effective observational designs, (b) to thoroughly understand the factors in each county that affect water use, and (c) to control for a variety of local, time-varying unobservables that affect households in both counties by design rather than statistically. Nevertheless, the presence of a contemporaneous treatment in Fulton County would render these households inappropriate counterfactuals for Cobb County (e.g., if Fulton had started its own targeted water conservation campaign at the same time). Based on interviews with water staff from the two systems, we feel comfortable asserting that no such contemporaneous treatment exists. Moreover, unlike in labor or education contexts, the assumption that the treatment did not affect the post-treatment composition of the two counties is credible (i.e., no one moved to or from Cobb County because of a conservation message). The assumption of no spillovers from treated to untreated households is also credible (i.e., water prices are regulated and were unchanged during the study period, and Cobb County treated households are unlikely to have shared their messages with Fulton County households).

3. NONEXPERIMENTAL DESIGN

We define $Y_i$ as the water use by the $i$th household, $D_i$ as the treatment status, $Y_i^0$ as potential monthly water use by household $i$ in the absence of treatment ($D_i = 0$), $Y_i^1$ as potential monthly water use by household $i$ in the presence of treatment ($D_i = 1$), and $E[\cdot]$ as the expectation operator. The average treatment effect (ATE) and average treatment effect on the treated (ATT) are thus defined as follows (time subscripts are suppressed):

$$\mathrm{ATE} = E[Y_i^1 - Y_i^0], \tag{1a}$$

$$\mathrm{ATT} = E[Y_i^1 - Y_i^0 \mid D_i = 1]. \tag{1b}$$

In an experimental design, randomization ensures that ATE = ATT because it ensures that the following two conditional mean independence assumptions are satisfied:

$$E[Y_i^0 \mid D_i = 1] - E[Y_i^0 \mid D_i = 0] = 0, \tag{2a}$$

$$E[Y_i^1 \mid D_i = 1] - E[Y_i^1 \mid D_i = 0] = 0. \tag{2b}$$

Under these assumptions, the ATE and ATT can be estimated using a simple contrast of mean post-treatment water use between treated and untreated groups:

$$\mathrm{ATE} = \mathrm{ATT} = E[Y_i^1 \mid D_i = 1] - E[Y_i^0 \mid D_i = 0]. \tag{3}$$
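To make the logic of equation (3) concrete, a minimal simulation sketch (ours, not the paper's; all variable names and distributional choices are hypothetical) draws potential outcomes, randomizes treatment so that conditions (2a) and (2b) hold by construction, and recovers the true effect with a simple difference in means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: monthly water use in thousands of gallons
# without treatment (y0) and with treatment (y1); the true effect is -2.
y0 = rng.gamma(shape=4.0, scale=9.0, size=n)
y1 = y0 - 2.0

# Random assignment makes D independent of (y0, y1), so conditions (2a) and
# (2b) hold by construction.
d = rng.integers(0, 2, size=n).astype(bool)
y_obs = np.where(d, y1, y0)

# Equation (3): a simple contrast of mean water use recovers ATE = ATT.
contrast = y_obs[d].mean() - y_obs[~d].mean()
print(f"difference in means: {contrast:.3f} (true effect: -2.000)")
```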

Using the nonexperimental comparison group, we wish to estimate the effect of the treatments on households in Cobb County; that is, the ATT. In this nonexperimental setting, the key identification assumption (2a) is implausible. Although households did not volunteer to be exposed to the treatments, their exposure is a function of where they chose to live. Households likely choose to live in Cobb or Fulton County based on fixed (time-invariant) household characteristics, $A_i$, that may also affect water use (fig. 1). Other sources of bias may come from time-varying unobservable factors (e.g., changes in micro-climatic conditions) or from unobservable heterogeneous responses to common shocks (e.g., households respond differently to local media coverage of the drought). We address these two potential sources of bias below. The treatment is assigned at one point in time, and thus we can ignore dynamic selection bias, which can be difficult to address in evaluations (and thus is often assumed away in practice).

Adding a time subscript, $t$, the ATT is defined as:

$$\mathrm{ATT} = E[Y_{it}^1 - Y_{it}^0 \mid D_{it} = 1]. \tag{4}$$

With observations of water use before and after treatment assignment, one can relax the assumption in expression (2a). Based on our assumptions about the process by which some households were exposed to the treatment, $D_{it}$ is as good as randomly assigned conditional on $A_i$, $t$, and time-varying observable household characteristics, $X_{it}$. In other words, we assume

$$E[Y_{it}^0 \mid A_i, t, X_{it}, D_{it}] = E[Y_{it}^0 \mid A_i, t, X_{it}]. \tag{5}$$

By making the additional assumption that treatment effects ($\delta$) and time effects ($\lambda_t$) are additive and not heterogeneous across households, we can write


$$E[Y_{it}^0 \mid A_i, t, X_{it}] = a + A_i'\gamma + \lambda_t + X_{it}'\beta, \tag{6a}$$

$$E[Y_{it}^1 \mid A_i, t, X_{it}] = E[Y_{it}^0 \mid A_i, t, X_{it}] + \delta. \tag{6b}$$

These assumptions imply the well-known linear, fixed-effects panel data (FEPD) estimator for $\delta$:

$$Y_{it} = a_i + \lambda_t + X_{it}'\beta + \delta D_{it} + \varepsilon_{it}, \quad \text{where } \varepsilon_{it} = Y_{it}^0 - E[Y_{it}^0 \mid A_i, t, X_{it}] \text{ and } a_i = a + A_i'\gamma. \tag{7}$$

Given our assumptions, this model provides an unbiased estimator of the ATE. If we relax the assumption that the treatment effect is linearly additive for every unit and simply make this assumption for Cobb County households, it provides an unbiased estimator of the ATT (i.e., Fulton and Cobb households may respond differently to the treatment).
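A minimal sketch of how one might take equation (7) to data, assuming a hypothetical long-format data frame df (household-by-month rows with columns water, hh, month, and D); household-clustered standard errors are a common choice for this model, not necessarily the paper's exact variance estimator:

```python
import statsmodels.formula.api as smf

def fepd_estimate(df):
    # Equation (7): household fixed effects a_i, month effects lambda_t, and a
    # single additive coefficient delta on D. The time-invariant A_i'gamma
    # term is absorbed by the household dummies; time-varying X_it are
    # controlled by design in this study, so none are included here.
    model = smf.ols("water ~ D + C(hh) + C(month)", data=df)
    # Cluster-robust standard errors by household (a common, assumed choice).
    fit = model.fit(cov_type="cluster", cov_kwds={"groups": df["hh"]})
    return fit.params["D"], fit.bse["D"]
```

With roughly 100,000 households, expanding household dummies is impractical; in practice the household effects would be absorbed by within-demeaning (e.g., PanelOLS in the linearmodels package with entity_effects=True and time_effects=True).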

As noted in the previous section, we control for time-varying characteristics, $X_{it}$ (e.g., weather, media coverage, economic conditions, etc.), by design: by choosing comparison households from the same local environment. Any systematic differences between treated and untreated units are thus assumed to be captured by time-invariant characteristics. The FEPD estimator for the ATT assumes that, across each time step, the expected trends of treatment and comparison groups are the same in the absence of the treatment (i.e., the expected trend of the comparison group represents the expected trend of the treatment group in the absence of treatment).

Figure 1. Directed acyclic graph (DAG) depicting the assumed treatment assignment mechanism in the observational design. A single-headed arrow represents a causal link or pathway between two variables. The variable $Y_{it}$ is monthly water use by household $i$ at time $t$, and $D_{it}$ is treatment status of household $i$ at time $t$. People are assumed to choose to live in Cobb or Fulton County based on fixed household characteristics, $A_i$, that also affect water use. This vector of time-invariant characteristics is the principal source of confounding that the observational design must eliminate (see sec. 3). Time-varying characteristics are assumed to be unimportant by design (by choosing comparison households who share the same economic, climatic, and regulatory environment).

Embedded in the FEPD estimator are four strong, but often ignored, assumptions:

(1) Units exhibit a common response to common shocks. Common time shocks (e.g., a change in water regulations that affects all households in metropolitan Atlanta; a media report on the drought) are assumed to affect water use in treatment and comparison groups in the same way, on average. This assumption could be violated if, for example, exposure to local media reports about the drought is more strongly correlated with treatment assignment than are the time-invariant omitted variables whose confounding influence has been blocked through the control of the fixed effects. In this case, the FEPD estimator can suffer from greater bias than an estimator that fails to control for time-invariant confounders.

(2) The treatment effect is additive and constant (homogeneous). The FEPD estimator weights on sample variances rather than on sample frequencies. More specifically, it averages the group-specific coefficients in proportion to both the conditional variance of treatment (D) and the proportion of the sample in each group. Thus, if the assumption of a homogeneous treatment effect does not hold, then the fixed-effects estimator overweights groups that have larger variance of treatment conditional upon other covariates and underweights groups with smaller conditional variances (Gibbons, Suarez-Serrato, and Urbancic 2014); the simulation sketch after this list illustrates the point.

(3) The functional form of the model is linear. The assumptions of linearity and an additive, constant treatment effect are required to sweep away the effects of time-invariant unobservables through standard first-differencing or averaging techniques. But if the correct specification were nonlinear, the conditional variance weighting of the FEPD estimator could be very different from the weighting that would be appropriate for estimating the ATT. This problem could arise if, for example, the time-invariant characteristics that affect water use are highly imbalanced between the treatment and comparison groups.

(4) The panel data generating process is not dynamic. That is, lagged dependent variables do not belong in the model.
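As a concrete illustration of assumption (2), the following simulation sketch (ours; all numbers are hypothetical) builds two equal-sized groups whose treatment probabilities, and hence conditional treatment variances, differ. A regression with group fixed effects recovers a variance-weighted average of the two effects rather than the equal-weighted ATE:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Two equal-sized groups that differ in treatment probability, and hence in
# the conditional variance of D: var(D | g) = p(1 - p) is 0.25 for A, 0.09 for B.
rows = []
for g, p_treat, effect in [("A", 0.5, 1.0), ("B", 0.1, 3.0)]:
    d = (rng.random(5000) < p_treat).astype(int)
    y = 10.0 + effect * d + rng.normal(0.0, 1.0, 5000)
    rows.append(pd.DataFrame({"g": g, "D": d, "y": y}))
df = pd.concat(rows, ignore_index=True)

# The equal-weighted ATE is (1.0 + 3.0) / 2 = 2.0, but the regression with
# group effects weights each group by n_g * var(D | g), so the estimate is
# pulled toward group A's effect: (0.25 * 1.0 + 0.09 * 3.0) / 0.34 ~= 1.53.
fit = smf.ols("y ~ D + C(g)", data=df).fit()
print(f"fixed-effects estimate: {fit.params['D']:.2f} (equal-weighted ATE: 2.00)")
```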

Although there are many plausible ways to relax these four assumptions, most are complicated, and choosing among them can be difficult. For example, a recent publication (Lee 2014) offers a new class of functional form and lag specification tests for panel data models, but the tests depend on a complicated generalized spectral derivative approach and a variety of untestable assumptions. Moreover, should the tests imply nonlinearity or the need to incorporate lagged dependent values, the estimation procedures require much more restrictive assumptions and more complicated methods (for nonlinearities, see Lancaster 2000; for dynamic panel data, see Wooldridge 2002; Chabé-Ferret 2015).

Likewise, time can be modeled in our study more flexibly than through monthly dummy variables. For example, one could model the fixed effects as household-by-season fixed effects, rather than household fixed effects. Or, as an alternative, one could allow monthly household responses to vary conditional on observable characteristics. The problem for the applied empiricist is that there are myriad ways to flexibly model time, and it is not clear which way might be best; the specification sketches below give a flavor. A similar problem confronts the applied researcher who wishes to relax the assumption of a homogeneous treatment effect.5
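Three of the many possibilities, written as regression formulas against the hypothetical panel df from the earlier sketch (the columns season and home_value are illustrative additions of ours, not variables from the paper):

```python
import statsmodels.formula.api as smf

# Alternative, equally defensible ways to let time enter the model.
SPECS = {
    "baseline":     "water ~ D + C(hh) + C(month)",            # eq. (7)
    "hh_by_season": "water ~ D + C(hh):C(season) + C(month)",  # household-by-season FE
    "by_covariate": "water ~ D + C(hh) + C(month) * home_value",  # monthly responses
}                                                                 # vary with a covariate

def fit_spec(df, name):
    # Any of the specifications can be estimated the same way; the difficulty
    # is that nothing in the data says which flexible form is the right one.
    return smf.ols(SPECS[name], data=df).fit()
```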

Rather than try various ways to relax these four assumptions, we consider an alternative approach that emphasizes design over methods.6 In an RCT, the four assumptions are innocuous because treatment and control units are balanced on all characteristics and thus have the same expected potential water use. To make our observational sample look more like our RCT sample, we use matching or trimming algorithms to pre-process the data and make the treatment and comparison groups observationally more similar prior to treatment assignment. In other words, we select Fulton households that are observationally similar to the Cobb households in terms of their pre-treatment water use and their observable, time-invariant characteristics known to affect water use. When treated and untreated households are observationally similar and, for example, are exposed to contemporaneous, post-treatment shocks, the assumption of homogeneous average responses among treated and untreated units is more plausible (e.g., if observable measures of education and wealth are correlated with exposure to treatment and the responses to local media reports on the drought, the homogeneous response assumption is plausible after conditioning on education and wealth). Moreover, if the important sources of treatment effect heterogeneity are functions of observable characteristics, matching and trimming can render the homogeneous treatment effect assumption less problematic. Likewise, if treated and untreated units are similar in pre-treatment water use levels, the assumption that lagged dependent variables do not belong in the model may be innocuous.

5. In the studies referenced in n. 1 that use a fixed-effects estimator, only one in four allow for flexible specifications of time trends, and fewer than one in five acknowledge that treatment effects might be heterogeneous (if estimation of conditional treatment effects takes place, it is usually conditional on three or fewer observable characteristics).

6. The distinction between "design" and "methods" (or "analysis," as methods are sometimes labeled) is not one often made in economics. Design characterizes the assumptions needed to identify the relevant treatment effect and the data that conform to the assumptions. Methods are procedures that will be used to estimate impacts and draw statistical inferences that are consistent with the design assumptions. Design is akin to the identification strategy, in which the analyst characterizes the estimand, the selection process (the process by which some units came to be exposed to the treatment and others did not), and the way in which knowledge about the selection process can be used to identify the estimand. Methods are the specific procedures (estimators) used to estimate impacts and draw statistical inferences within a design.

The idea that combining estimation designs can make the estimates more robust to deviations from each design's underlying assumptions is not new (Ho et al. 2007; Imbens and Wooldridge 2009). Yet we know of very few panel data studies that pre-process their data with matching on pre-treatment variables before using a parametric or nonparametric, regression-based estimator (Heckman, Ichimura, and Todd 1998; Galiani, Gertler, and Schargrodsky 2005; Uchida et al. 2007; Arriagada et al. 2012; Davis, Fuchs, and Gertler 2014; Alix-Garcia, Sims, and Yañez-Pagans 2015; Jones and Lewis 2015; Wendland et al. 2015).7 We suspect that matching is not typically used with panel data because analysts assume that the fixed-effects model controls for the confounding influence of all time-invariant attributes, whether they are unobservable or not.8

4. DATA AND SAMPLES

4.1. Data

The data are derived from four sources. Household water use data are from the Cobb County Water System and Fulton County Water Service Division. We have 13 months of pre-treatment data (May 2006 to May 2007) and 4 months of post-treatment data (June to September 2007).9 The county tax assessor databases provide home and property characteristics. The 2000 US Census provides data on neighborhood characteristics at the block group level.10

Table 1, which reproduces table 2 in Ferraro and Miranda (2014), shows average water consumption in thousands of gallons during key watering seasons for Cobb households in the experiment, including the randomized control group, and for Fulton households (see the appendix, available online, for more details about the Fulton data). Fulton households tend to use more water, on average, and have greater variability. In terms of pre-experiment water use, there are no statistical differences across the treated and randomized control groups (see Ferraro and Miranda [2014] for further details).

7. Davis et al. (2014) use a crude form of matching with panel data: match each participating household with the nonparticipant household with the most similar average monthly pre-treatment consumption among the 10 closest neighbors of the participant.

8. Authors often write phrases like "any time-invariant unobservable characteristics will be removed with the individual fixed effects," "fixed effects remove time-invariant effects," and "the model eliminates the unobserved fixed effect."

9. The administrative data correspond to water billing months. Thus, for example, May 2006 to May 2007 water use corresponds to June 2006 to June 2007 billing months.

10. A block group is a subdivision of a census tract. The number of people in a block group varies from 600 to 3,000, with an average size of 1,500 people.

For the matching and trimming procedures, we select covariates that are observable to policy makers and that theory or empirical studies suggest could be important controls in a study on water conservation. Ferraro and Price (2013) show that previous water use strongly predicts future water use. Based on metering data from the water utilities, we use the two variables used by Ferraro and Price in their analysis: May–October 2006 use (the main water use season) and March–April 2007 use (to reflect changes to homes and yards prior to treatment assignment at the end of May 2007). We also explore what happens when one matches on each month's use individually.

From the 2007 county tax assessor databases, we select fair market value of home ($), property size (acres), and age of home (years), all variables that reflect the scope and incentives for water use and conservation (fair market value is also a proxy for income and wealth of the residents). From the 2000 US Census, we choose variables that also reflect the scope and incentives for water use and conservation: per capita income ($), percentage of adults over 25 years old with a bachelor's degree or higher, percentage of people living below the poverty line, percentage of the population that is white, and percentage of renter-occupied housing units.

Table 1. Water Consumption by Seasons (in Thousands of Gallons)

                                  Cobb County                            Fulton County
                      Technical       Social          Experimental   Nonexperimental
Variable and          Information     Comparison      Control        Comparison Group
Indicator             Treatment (1)   Treatment (2)   (3)            (4)

Water 2006:a
  Mean                57.48           57.7            57.43          66.27
  SD                  38.54           40.46           40.47          54.49
Winter 2006/7:b
  Mean                25.26           25.19           25.18          27.72
  SD                  14.58           14.63           16.02          94.27
Spring 2007:c
  Mean                27.14           26.7            27.2           25
  SD                  19.77           18.94           45             71.13
Summer 2007:d
  Mean                36.03           34.55           36.04          41.22
  SD                  30.31           26.12           28.96          52.27
Observations          10,044          9,985           61,556         30,757

Source. Ferraro and Miranda (2014, table 2).
a. Water use May–October 2006.
b. Water use November 2006–February 2007.
c. Water use March–May 2007.
d. Water use June–September 2007.


Table 2 shows descriptive statistics from the tax assessor and the US Census. In general, the treated homes are slightly cheaper, older, and on smaller properties. Further, they are in block groups with lower per capita income, a lower percentage of people with a higher degree, a lower percentage of renter-occupied housing units, and a lower percentage of whites.

4.2. Matching and Trimming

We create four samples to which we apply the FEPD estimator. The full sample does not reweight the comparison group observations. The trimmed sample discards observations with extreme propensity scores. The matching samples with and without calipers restrict the sample to comparison households that are observationally more similar to the treated households. The trimmed and matched samples are created using the full set of covariates described in section 4.1.

Matching methods aim to match treated units to comparison units in order to achieve balance on the distributions of confounding covariates. To create the matched sample, we choose the matching method that generates the best covariate balancing results for our sample: nearest-neighbor (1:1) Mahalanobis covariate matching with replacement. We apply this matching algorithm with and without calipers. Calipers can further improve covariate balance between treatment and comparison groups by defining a tolerance level for judging the quality of the matches; if a treated household does not have a match within the caliper, it is eliminated from the sample. In our main analysis, all matches not equal to or within 1 standard deviation of each covariate are dropped. Although calipers can reduce bias, they create a subsample that may not be representative of the population of treated households and, by dropping units, reduce the statistical power of the design.
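A compact sketch of this matching step, assuming hypothetical covariate arrays for treated and comparison households (the function name and the exact caliper bookkeeping are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_match(X_treat, X_comp, caliper_sd=1.0):
    """1:1 nearest-neighbor Mahalanobis matching with replacement.

    X_treat and X_comp are hypothetical (n, k) covariate arrays for treated
    and comparison households. A treated unit is dropped when its nearest
    match differs by more than caliper_sd standard deviations on any covariate.
    """
    X_all = np.vstack([X_treat, X_comp])
    VI = np.linalg.inv(np.cov(X_all, rowvar=False))  # inverse covariance matrix
    dist = cdist(X_treat, X_comp, metric="mahalanobis", VI=VI)
    nn = dist.argmin(axis=1)  # with replacement: a comparison unit can repeat
    sd = X_all.std(axis=0)
    keep = (np.abs(X_treat - X_comp[nn]) <= caliper_sd * sd).all(axis=1)
    return nn[keep], keep  # matched comparison rows, mask of retained treated units
```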

Like matching, trimming is an approach that reweights the comparison group by restricting the sample. The trimmed sample discards (or applies a weight of zero to) observations with extreme propensity scores. Crump et al. (2009) derive an optimal trimming rule for discarding observations with extreme propensity scores. Like caliper matching, however, trimming may lead to an estimate of a treatment effect for a subpopulation that is not representative of the whole population. Based on a Logit model to estimate the propensity scores, the optimal trimming rule discards observations with estimated propensity scores outside the interval [0.06, 0.94]. Using a Probit specification generates the trimming interval [0.07, 0.93].11

11. The optimal lower limit for the social comparison treatment is 0.0612 (0.0669 under the Probit), while the limit for the technical information treatment is 0.0623 (0.0674 under the Probit).
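A sketch of the trimming step described above, assuming a hypothetical covariate matrix X and a 0/1 treatment vector (our helper; only the interval endpoints come from the paper):

```python
import statsmodels.api as sm

def trim_sample(X, treated, lo=0.06, hi=0.94):
    """Drop observations with extreme estimated propensity scores.

    A logit of treatment status on the covariates gives the scores; units
    outside [lo, hi] are discarded. The defaults mirror the paper's
    logit-based interval; Crump et al. (2009) derive the cutoff from the
    data rather than fixing it a priori.
    """
    exog = sm.add_constant(X)
    pscore = sm.Logit(treated, exog).fit(disp=0).predict(exog)
    return (pscore >= lo) & (pscore <= hi)  # boolean mask of retained units
```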


Table 2. Descriptive Statistics: Cobb and Fulton County Household and Neighborhood Variables

                                  Cobb County                            Fulton County
                      Technical       Social          Experimental   Nonexperimental
Variable and          Information     Comparison      Control        Comparison
Indicator             Treatment (1)   Treatment (2)   (3)            (4)

Tax Assessor (Household) Variables
Fair market value ($):
  Mean                249,876.90      252,866.30      251,033.60     349,095.50
  SD                  157,520.50      174,787.60      161,074.70     215,771.60
Age of home (years):
  Mean                22.05           22.05           22.05          16.99
  SD                  12.74           12.87           12.97          8.65
Size of property (acres):
  Mean                .18             .19             .20            .61
  SD                  1.17            1.02            1.14           .89

Census (Neighborhood) Variablesa
% of people with higher degree:
  Mean                .73             .73             .73            .85
  SD                  .15             .15             .15            .07
% of people below poverty level:
  Mean                .04             .04             .04            .03
  SD                  .04             .04             .04            .03
Per capita income:
  Mean                30,851.68       30,847.70       30,833.04      42,385.28
  SD                  9,215.26        9,211.39        9,192.00       10,536.13
% of renter-occupied housing units:
  Mean                .12             .12             .12            .16
  SD                  .15             .15             .15            .16
% white:
  Mean                .84             .84             .84            .87
  SD                  .13             .13             .13            .06
Observations          10,044          9,985           61,556         30,757

Source. Ferraro and Miranda (2014, table 3).
a. At census block group level.


The covariate balance results in table 3 (social comparison treatment) and table 4 (technical information treatment) corroborate our expectations that trimming and matching improve covariate balance and that caliper matching shows the best balance (but note that balance gets worse for the March–April 2007 variable). For each covariate, we present four ways to evaluate the improvement in covariate balance (Lee 2011): (a) difference in means; (b) standardized mean difference, for which Rosenbaum and Rubin (1985) suggest that a standardized difference greater than 20 should be considered large; (c) eQQ mean difference, a nonparametric measure that evaluates the rank rather than the precise value of the observations (Ho et al. 2007); and (d) variance ratio between treated and untreated units, which should be equal to one if there is perfect balance (Sekhon 2011).
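For one covariate at a time, these four diagnostics can be computed as in the sketch below (our implementation; conventions such as the pooled-SD denominator and the quantile grid for the eQQ statistic vary across software and may differ from the paper's):

```python
import numpy as np

def balance_stats(x_treat, x_comp):
    """Four balance diagnostics for one covariate (after Lee 2011).

    Assumed conventions: the standardized difference is scaled by 100 (values
    above 20 are flagged as large by Rosenbaum and Rubin 1985); the eQQ
    statistic averages absolute differences between empirical quantiles;
    the variance ratio equals 1 under perfect balance.
    """
    mean_diff = x_treat.mean() - x_comp.mean()
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_comp.var(ddof=1)) / 2)
    std_diff = 100 * mean_diff / pooled_sd
    q = np.linspace(0.01, 0.99, 99)
    eqq = np.abs(np.quantile(x_treat, q) - np.quantile(x_comp, q)).mean()
    var_ratio = x_treat.var(ddof=1) / x_comp.var(ddof=1)
    return mean_diff, std_diff, eqq, var_ratio
```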

5. EXPERIMENTAL BENCHMARK AND DESIGN REPLICATION EVALUATION CRITERIA

5.1. Experimental Benchmark

In order to match water utility records to the tax and census data, we drop 14% of the observations from the original experimental sample. Thus we first replicate Ferraro and Price (2013) using the matched subsample. We estimate their ordinary least squares (OLS) regression specification:

$$\text{Water Use Summer 2007}_i = a_i + \beta\,\text{Treatment}_i + \delta' X_i + \varepsilon_i, \tag{8}$$

where the dependent variable is the post-treatment aggregate water use (June–September) and the covariates $X_i$ are the observable covariates described in section 4 (including the two pre-treatment aggregated water use variables from Ferraro and Price) and dummy variables for the meter routes (recall that randomization is within meter routes). The results are nearly identical to those in Ferraro and Price in magnitude and inference (table A1 in the appendix).
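As a regression formula, equation (8) might look as follows (a sketch; cross_df and every column name are hypothetical placeholders for the covariates described in section 4):

```python
import statsmodels.formula.api as smf

def experimental_benchmark(cross_df):
    # cross_df is a hypothetical one-row-per-household cross section.
    formula = (
        "summer07 ~ treat + water_may_oct06 + water_mar_apr07 + fair_value"
        " + home_age + acres + pc_income + pct_degree + pct_poverty"
        " + pct_white + pct_renter + C(route)"  # randomization was within meter routes
    )
    return smf.ols(formula, data=cross_df).fit()
```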

To establish an experimental benchmark using the same FEPD estimator that we use in the observational design (i.e., to ensure an "apples to apples" comparison), we estimate equation (7), where the dependent variable is monthly water consumption and the monthly time variables go from May 2006 to September 2007 (the post-treatment months are June–September 2007). These experimental benchmark estimates are presented in table 5.

5.2. Design Replication

Cook et al. (2008) identify six criteria that high-quality design replications should meet. We believe our experimental benchmark is well executed (criterion 1) and our observational design is well motivated and appropriate (criterion 2). We know of no third variable confounding our observational design (criterion 3; e.g., a contemporaneous shock to the comparison pool at time of treatment assignment). We use the same FEPD estimators to analyze the experimental and nonexperimental data (criterion 4).


Table3.

Covariate

Balance:S

ocialC

omparisonTreatment

FullSample

Trimmed

Sample

Matching

with

outC

alipers

Matching

with

Calipers

%Im

provem

ent

(1)to

(2)

%Im

provem

ent

(1)to

(3)

%Im

provem

ent

(1)to

(4)

(1)

(2)

(3)

(4)

(5)

(6)

(7)

Water

Use

May–October

2006

Meandifference

–8.565

–4.416

–2.480

–.697

4871

92Standardized

meandifference

–21.170

–11.498

–6.130

–1.919

4671

91MeanraweQ

Qdifference

8.667

4.524

2.917

1.825

4866

79Varianceratio

(treat/untreat)

.551

.538

1.102

1.082

–3

7782

Water

Use

March–April2007

Meandifference

.875

1.088

2.082

2.419

–24

–138

–176

Standardized

meandifference

7.629

9.856

18.151

23.128

–29

–138

–203

MeanraweQ

Qdifference

2.334

3.542

2.136

2.442

–52

8–5

Varianceratio

(treat/untreat)

.033

.015

1.722

1.770

–2

2520

FairMarketValue

Meandifference

–96,280.0

–58,738.0

–45,676.0

–26,950.0

3953

72Standardized

meandifference

–55.1

–38.2

–26.1

–18.9

3153

66MeanraweQ

Qdifference

97,173.0

58,755.0

47,582.0

28,213.0

4051

71Varianceratio

(treat/untreat)

.656

.675

1.192

1.138

644

60

Age

ofHom

e

Meandifference

5.060

1.050

1.850

.533

7963

89Standardized

meandifference

39.319

10.061

14.377

5.345

7463

86MeanraweQ

Qdifference

5.294

1.845

1.952

.885

6563

83Varianceratio

(treat/untreat)

2.213

1.333

1.397

1.216

7367

82

This content downloaded from 129.119.038.195 on December 05, 2017 11:44:46 AMAll use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).

Page 18: Panel Data Designs and Estimators as Substitutes for ...faculty.smu.edu/millimet/classes/eco7377/papers/ferraro miranda 2017.pdfUnlike Ferraro and Miranda (2014), we use the full panel

Table3(Continued)

FullSample

Trimmed

Sample

Matching

with

outC

alipers

Matching

with

Calipers

%Im

provem

ent

(1)to

(2)

%Im

provem

ent

(1)to

(3)

%Im

provem

ent

(1)to

(4)

(1)

(2)

(3)

(4)

(5)

(6)

(7)

Size

ofProperty(A

cres)

Meandifference

–.419

–.370

–.311

–.305

1226

27Standardized

meandifference

–40.983

–102.130

–30.372

–66.572

–149

26–6

MeanraweQ

Qdifference

.460

.400

.328

.314

1329

32Varianceratio

(treat/untreat)

1.326

1.787

1.020

1.194

–141

9440

%of

Peoplewith

HigherDegree

Meandifference

–.126

–.056

–.069

–.027

5645

79Standardized

meandifference

–86.258

–55.861

–47.145

–29.831

3545

65MeanraweQ

Qdifference

.126

.057

.069

.028

5545

78Varianceratio

(treat/untreat)

4.275

1.456

1.953

1.397

8671

88

%of

PeopleBelow

Poverty

Level

Meandifference

.005

.001

.007

.002

84–32

70Standardized

meandifference

15.567

2.950

20.571

7.423

81–32

52MeanraweQ

Qdifference

.008

.004

.009

.004

49–14

56Varianceratio

(treat/untreat)

1.892

1.101

1.145

1.082

8984

91

298

This content downloaded from 129.119.038.195 on December 05, 2017 11:44:46 AMAll use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).

Page 19: Panel Data Designs and Estimators as Substitutes for ...faculty.smu.edu/millimet/classes/eco7377/papers/ferraro miranda 2017.pdfUnlike Ferraro and Miranda (2014), we use the full panel

Table3(Continued)

FullSample

Trimmed

Sample

Matching

with

outC

alipers

Matching

with

Calipers

%Im

provem

ent

(1)to

(2)

%Im

provem

ent

(1)to

(3)

%Im

provem

ent

(1)to

(4)

(1)

(2)

(3)

(4)

(5)

(6)

(7)

Per

Capita

Income

Meandifference

–11,538.0

–5,259.0

–5,255.0

–3,105.0

5454

73Standardized

meandifference

–125.3

–61.6

–57.0

–37.8

5154

70MeanraweQ

Qdifference

11,537.0

5,686.4

5,395.1

3,516.3

5153

70Varianceratio

(treat/untreat)

.764

1.468

1.378

1.330

–99

–61

–40

%of

Renter-OccupiedHousing

Units

Meandifference

–.036

–.018

.007

–.001

4980

97Standardized

meandifference

–24.813

–16.692

5.011

–1.061

3380

96MeanraweQ

Qdifference

.039

.022

.016

.010

4257

73Varianceratio

(treat/untreat)

.823

.616

.899

1.033

–117

4381

%W

hite

Meandifference

–.027

–.004

–.041

–.005

85–53

80Standardized

meandifference

–20.323

–3.649

–31.030

–11.160

82–53

45MeanraweQ

Qdifference

.040

.031

.041

.009

22–1

77Varianceratio

(treat/untreat)

4.061

2.496

3.172

.811

5129

94

299

Table 4. Covariate Balance: Technical Information Treatment

Columns: (1) Full Sample; (2) Trimmed Sample; (3) Matching without Calipers; (4) Matching with Calipers; (5) % Improvement, (1) to (2); (6) % Improvement, (1) to (3); (7) % Improvement, (1) to (4).

                                        (1)         (2)         (3)         (4)      (5)    (6)    (7)
Water Use May–October 2006
  Mean difference                    –8.786      –4.021      –2.495       –.861      54     72     90
  Standardized mean difference      –22.798     –10.383      –6.474      –2.360      54     72     90
  Mean raw eQQ difference             8.887       4.135       2.766       1.736      53     69     80
  Variance ratio (treat/untreat)       .500        .559       1.057       1.061      12     89     88
Water Use March–April 2007
  Mean difference                      .902       1.351       2.154       2.585     –50   –139   –187
  Standardized mean difference        7.788      11.321      18.594      23.298     –45   –139   –199
  Mean raw eQQ difference             2.363       3.742       2.219       2.610     –58      6    –10
  Variance ratio (treat/untreat)       .033        .018       1.847       1.869      –2     12     10
Fair Market Value
  Mean difference                 –99,257.0   –57,902.0   –46,502.0   –27,024.0      42     53     73
  Standardized mean difference        –63.0       –38.6       –29.5       –19.8      39     53     69
  Mean raw eQQ difference         100,137.0    57,980.0    48,010.0    28,746.0      42     52     71
  Variance ratio (treat/untreat)       .532        .658       1.101       1.163      27     78     65
Age of Home
  Mean difference                     5.064       1.223       1.955        .629      76     61     88
  Standardized mean difference       39.743      11.632      15.344       6.265      71     61     84
  Mean raw eQQ difference             5.282       1.965       2.040        .919      63     61     83
  Variance ratio (treat/untreat)      2.169       1.370       1.410       1.226      68     65     81
Size of Property (Acres)
  Mean difference                     –.433       –.359       –.314       –.309      17     27     28
  Standardized mean difference      –36.992    –110.530     –26.876     –76.632    –199     27   –107
  Mean raw eQQ difference              .464        .385        .337        .319      17     28     31
  Variance ratio (treat/untreat)      1.735       1.676       1.224       1.293       8     70     60
% of People with Higher Degree
  Mean difference                     –.126       –.058       –.068       –.026      54     46     79
  Standardized mean difference      –85.732     –57.345     –46.536     –29.532      33     46     66
  Mean raw eQQ difference              .126        .059        .069        .028      53     46     78
  Variance ratio (treat/untreat)      4.322       1.494       1.930       1.353      85     72     89
% of People Below Poverty Level
  Mean difference                      .006        .001        .007        .002      83    –24     68
  Standardized mean difference       16.814       3.433      20.863       8.491      80    –24     50
  Mean raw eQQ difference              .008        .004        .009        .004      49    –11     54
  Variance ratio (treat/untreat)      1.935       1.116       1.133       1.093      88     86     90
Per Capita Income
  Mean difference                 –11,533.0    –5,223.0    –5,184.0    –3,028.0      55     55     74
  Standardized mean difference       –125.2       –60.9       –56.3       –37.0      51     55     70
  Mean raw eQQ difference          11,532.0     5,650.0     5,328.0     3,461.8      51     54     70
  Variance ratio (treat/untreat)       .765       1.495       1.397       1.341    –111    –69    –45
% of Renter-Occupied Housing Units
  Mean difference                     –.034       –.021        .009        .000      38     74     99
  Standardized mean difference      –22.388     –18.744       5.844       –.424      16     74     98
  Mean raw eQQ difference              .036        .024        .017        .011      33     53     71
  Variance ratio (treat/untreat)       .874        .626        .926       1.046    –197     41     64
% White
  Mean difference                     –.027       –.004       –.040       –.006      86    –49     80
  Standardized mean difference      –20.497      –3.272     –30.510     –11.812      84    –49     42
  Mean raw eQQ difference              .040        .033        .041        .010      19      0     76
  Variance ratio (treat/untreat)      4.189       2.691       3.169        .797      47     32     94

Source. Cobb County Water System, Fulton County Water Service Division, Tax Assessor, and 2000 US Census.


The literature provides no clear guidance for determining whether a nonexperimental estimate is "good" in comparison to these experimental benchmark estimates. Many studies apply no criteria at all. We define two criteria for judging the quality of the nonexperimental estimate:

Accuracy Criterion: (a) The nonexperimental point estimate should be in the 95% confidence interval of the experimental point estimate (e.g., Arceneaux et al. 2006; Greenberg, Michalopoulos, and Robins 2006). (b) We also wish to make the correct inference in testing the null hypothesis of no treatment effect (Type 1 error = 5%).

Based on results in table 5, the accuracy criterion implies that the nonexperimental estimate for the social comparison treatment should be in the range [–443, –263] gallons and for the technical information treatment in the range [–111, 99] gallons.12

Table 5. Fixed Effects Panel Data Estimator Using Experimental Sample

Treatment                                    Estimated Effect
Technical information treatment effect       –.006 (.054)
Social comparison treatment effect           –.353*** (.046)
Observations                                 1,394,455

Note. Post-treatment assignment average treatment effect on the treated in thousands of gallons per month. For example, post-treatment assignment (June–September 2007) households randomly assigned into the social comparison treatment consumed approximately 353 fewer gallons per month, on average, than households in the control group. Robust standard errors in parentheses (clustered at the household level).
* p < .10. ** p < .05. *** p < .01.

12. Instead, we could define the following criterion: the nonexperimental estimate 95% confidence interval should cover the experimental estimate. Such a criterion, however, would be affected by the precision of the nonexperimental estimate and thus the sample size. We would not want to infer that the nonexperimental design performs well simply if the estimate has a large confidence interval. St. Clair et al. (2014) use a criterion that the nonexperimental estimate should be within 0.20 standard deviations, but that criterion is too generous for behavioral "nudges" that have small effect sizes and are inexpensive to implement.


Sensitivity to sample choice has been a concern in the design-replication literature (Smith and Todd 2005). One can judge the sensitivity of the proposed methods to different samples by bootstrapping to estimate a distribution of treatment effects (McKenzie et al. 2010). We bootstrap the nonexperimental data to generate new samples with the same number of observations. Reestimating the treatment effect for each bootstrapped sample creates a distribution of treatment effects using the nonexperimental data. For each design, we use 500 repetitions. Based on this exercise, we add a second performance criterion (a sketch of the resampling appears after the criterion below):

Robustness (to Sample Choice) Criterion: Under random sampling using bootstrapping, the percentage of estimates that (a) fall within the 95% confidence interval of the experimental estimate and (b) lead to the correct inference in testing the null hypothesis of no treatment effect is 50% or higher for both treatments (i.e., an observational design cannot be judged successful if it can only consistently recover one of the treatment effects).

6. NONEXPERIMENTAL DESIGN RESULTS

6.1. Main Results

Prior to presenting the impact estimates, we consider the assumption of parallel trends for treated and untreated units. Figure 2 shows the pre-treatment mean monthly consumption trends for the social comparison treatment group, technical information treatment group, Cobb County randomized control group, and the Fulton County nonexperimental comparison group. Although the treated and comparison group levels and trends look identical in the 6 months prior to treatment assignment, they look quite different during summer 2006. This difference suggests that the Fulton households, on average, may not form a good counterfactual for the treated Cobb households.

Figure 3 and figure 4 show that the trend lines become much more similar after matching (although a little worse in the months just prior to treatment assignment). Although we saw in section 4.2 that caliper matching improved covariate balance, there is little visual change in the trends after applying calipers. As we will see, however, the use of calipers does affect the treatment effect estimates.

Table 6 shows the treatment effect estimates. Using the full sample, we estimate that the social comparison message reduced average monthly use by 965 gallons. The technical information message reduced average use by 618 gallons. The nonexperimental estimate of the social comparison treatment effect is almost three times larger than the experimental estimate. The nonexperimental estimate of the technical information treatment effect is about one hundred times larger.



Figure 2. Pre-treatment mean monthly water consumption, in thousands of gallons

Figure 3. Pre-treatment mean monthly water consumption: Social comparison treatment, matched sample with and without calipers, in thousands of gallons.


These estimates do not satisfy the accuracy criterion. Importantly, they send the wrong signal to decision makers: they erroneously imply that technical information messages achieve almost two-thirds of the impact of social comparison messages. Given that the technical messages (a) are cheaper, because they require fewer sheets of paper and fewer data (no need to examine past consumption patterns), (b) can be targeted to the entire population rather than only to customers who lived in their home the previous year, and (c) are well understood by utility managers, these results could be interpreted as implying that the technical information message is preferable.

Columns 2 and 6 of table 6 present estimates using the trimmed sample (Logit specification). The estimates are smaller than the estimates with the full sample, but they do not satisfy the accuracy criterion (although the estimate for the technical information treatment comes very close to satisfying it).

Columns 3 and 7 present estimates using the sample pre-processed by matching. The estimate for the technical information treatment effect satisfies the accuracy criterion, and the estimate for the social comparison treatment effect comes very close to satisfying it (both satisfy an alternative criterion that the experimental estimate should be in the 95% CI of the nonexperimental estimate).

Columns 4 and 8 present estimates using the sample pre-processed by caliper matching. Despite the barely perceptible effect of the calipers on the pre-treatment trend of the matched Fulton County comparison group, both nonexperimental estimates are nearly identical to those generated by the experimental data, and they satisfy the accuracy criterion. Were we to double the caliper size to 2 standard deviations, the technical information treatment estimate would still satisfy the accuracy criterion, and the social comparison treatment estimate would just barely miss satisfying it by 6 gallons (–449 gallons). Because we drop two-thirds fewer units, the precision of the estimates increases (social comparison estimate significant at p < .01).

Figure 4. Pre-treatment mean monthly water consumption: Technical information treatment, matched sample with and without calipers, in thousands of gallons.


Table 6. Fixed Effects Panel Data Estimator Using Nonexperimental Sample

                            Social Comparison Treatment(a)                 Technical Information Treatment(a)
                            Full       Trimmed     Matched Sample(c)       Full       Trimmed     Matched Sample(c)
                            Sample     Sample(b)   No Caliper  Caliper     Sample     Sample(b)   No Caliper  Caliper
                            (1)        (2)         (3)         (4)         (5)        (6)         (7)         (8)

Treatment effect estimate   –.965***   –.720***    –.516***    –.356**     –.618***   –.119       –.109       –.002
                            (.087)     (.144)      (.158)      (.161)      (.091)     (.105)      (.161)      (.164)
Observations                697,187    365,721     339,507     239,462     698,240    371,160     341,478     243,966
Number of households:
  Treatment                 10,038     6,741       9,988       7,045       10,099     6,831       10,044      7,175
  Control                   30,973     14,772      9,988       7,045       30,973     15,002      10,044      7,175

(a) Post-treatment assignment average treatment effect on the treated in thousands of gallons per month. Robust standard errors in parentheses (clustered at the household level).
(b) Based on a Logit model to estimate the propensity scores, the trimming rule discards observations with estimated propensity scores outside the interval [0.06, 0.94].
(c) Numbers of observations for matched samples represent unique households. Repeated observations are taken into account using frequency weights.
* p < .10. ** p < .05. *** p < .01.


6.2. Nonexperimental Data Bootstrapping

The bootstrapping results in figure 5 reveal how sensitive the performance of each evaluation design is to the sample used in table 6. The dotted vertical lines are the experimental benchmarks. With no data pre-processing, the FEPD estimator consistently overestimates the treatment effects. Using the trimmed sample, the performance of the FEPD estimator depends substantially upon the propensity score model that determines the optimal trimming threshold (Logit or Probit), with no clear pattern. The caliper-matching sample outperforms the others.

Table 7 reports the percentage of repetitions that satisfy the accuracy criterion. In other words, the table reports the percentage of point estimates that are within the 95% confidence interval of the experimental estimate and yield the correct inference.

Figure 5. Nonexperimental bootstrapping of treatment effect (in thousands of gallons)


The FEPD estimator with the full sample performs poorly: 0% of the runs fell within the experimental 95% CI with the correct inference. In fact, the FEPD estimator performs worse than an OLS regression estimator that ignores the panel data structure and regresses an aggregate post-treatment water use value on household covariates and a single aggregate pre-treatment variable. These OLS bootstrapping estimates satisfy the accuracy conditions almost one-quarter of the time (Ferraro and Miranda 2014). After matching, the FEPD estimator performs better, but the robustness criterion is only satisfied after pre-processing the data with caliper matching: 73% and 79% of the time, the social comparison and technical information treatment estimates, respectively, are within the experimental benchmark's 95% CI and we draw the correct inferences. Thus the performance of a combined caliper matching and FEPD design is very robust to changes in the sample.

7. WHY DOES PRE-PROCESSING BY MATCHING IMPROVE PERFORMANCE?

Most authors who use the FEPD estimator write that it controls for time-invariant unobservable confounders. Thus some readers may be surprised that matching on pre-treatment, time-invariant, observable home and neighborhood characteristics is important. Some may wonder if we could ignore these characteristics and instead simply match more precisely on pre-treatment water use; in other words, make the pre-treatment trends line up as closely as possible (including for the few months right before treatment assignment that remained imbalanced even after matching).

If we simply match on each month of pre-treatment water use, we can make the trend lines of the treated and comparison groups nearly identical (see appendix figs. A1, A2). The estimates, however, are much worse than when we match on all covariates (–0.948 for social comparison and –0.615 for technical information; p < .01 for both). This result is consistent with Chabé-Ferret's (2015) argument that closely matching pre-treatment outcomes immediately before treatment assignment can lead to bias because it puts too much emphasis on transitory shocks.

Table 7. Nonexperimental Design Bootstrapping: Percentage of Results within 95% CI of Experimental Estimate and with Correct Inference (Robustness Criterion)

                                              Social Comparison   Technical Information
                                              Treatment (%)       Treatment (%)
                                              (1)                 (2)

Full sample:
  Panel data and full sample                  0                   0
Trimmed sample:
  Panel data and trimming rule (Logit)        2                   41
  Panel data and trimming rule (Probit)       0                   2
Matching sample:
  Panel data and matching without calipers    16                  50
  Panel data and matching with calipers       73                  79

Note. Results based on 500 repetitions.



If we exclude just the home attributes or just the census attributes from the matching, we see the same patterns: the estimates are too large in absolute value and their relative differences are wrong. In fact, omission of the pre-treatment water use variables from the matching is preferable to omission of the home and neighborhood characteristics. Matching only on the home and census characteristics yields more accurate estimates than matching only on pre-treatment water use (without calipers: –0.612, p < .01, for social comparison; –0.150 for technical information, p > .10; with calipers: –0.541 for social comparison, p < .01; –0.009 for technical information, p > .10). Thus parallel pre-treatment trends are an important attribute of good designs that use panel data estimators, but they are not sufficient. Furthermore, identical trends are not necessary.

Why does matching improve the FEPD estimator's performance? Once the treated and untreated units are balanced on pre-treatment observables, the four strong implicit assumptions of the FEPD estimator described in section 3 are more plausible.

As noted earlier, there are other ways to make these assumptions more plausible. For example, one could attempt to model time more flexibly in the regression specification. Given the way the pre-treatment trends look in figure 2, one might reasonably decide to allow for household-specific seasonal variation. There are three watering seasons in metro Atlanta, and thus, in table 8, we allow for household-by-season fixed effects. The second and fifth columns present the estimates with this more flexible specification (cols. 1 and 4 present the original full sample estimates). The new estimates are closer to the experimental benchmarks but are far from satisfying the accuracy criterion.

If one conjectures that the way in which households respond to common shocks depends on their fixed characteristics, one could consider a more extreme attempt to model time flexibly: interact all the pre-treatment home and neighborhood covariates with the month dummy variables. We have never seen such a flexible specification (adding well over 100 new variables), but were one to do so, the estimates get closer to the experimental benchmarks (–0.624, p < .01; –0.168, p > .10). Yet they still fail to meet the accuracy criterion. Undoubtedly, there are other approaches to modeling time flexibly. Matching, however, obviates the need to choose one: it addresses the issue by design, rather than through statistical methods.

Likewise, consider the implicit assumption of homogeneous treatment effects. Like the "common response to common shocks" assumption, the constant treatment effect is hard to believe when the treated and untreated units, despite having similar trends, have very different levels of fixed characteristics (like home size and age) that affect water use.13 The assumption is more plausible after achieving balance among observable fixed characteristics.


Table 8. Flexible Fixed Effects Panel Data Estimators Using Nonexperimental Sample

                             Social Comparison Treatment                          Technical Information Treatment
                             Full       Household-by-   Household-by-Season       Full       Household-by-   Household-by-Season
                             Sample     Season Fixed    Fixed Effects +           Sample     Season Fixed    Fixed Effects +
                                        Effects         Covariate × Treatment(a)             Effects         Covariate × Treatment(a)
                             (1)        (2)             (3)                       (4)        (5)             (6)

Estimated treatment effect   –.965***   –.719***        –.536***                  –.618***   –.411***        –.222
                             (.087)     (.082)          (.191)                    (.091)     (.085)          (.193)
Observations                 697,187    697,187         692,614                   698,240    698,240         693,616

Note. Average treatment effect on the treated in thousands of gallons per month. Robust standard errors in parentheses (clustered at the household level).
(a) Treatment effect estimate evaluated at the mean covariate values of the treated group.
* p < .10. ** p < .05. *** p < .01.



Of course, one could try to directly model the treatment effect heterogeneity. If one were to assume, for example, that the treatment effect is moderated by observable, fixed characteristics, one could interact the treatment with these observables in the model (adding 16 more terms to the model). Using a specification that includes these interactions with the household-by-season fixed effects, the third and sixth columns in table 8 present the treatment effects when all the interaction terms are evaluated at the mean values for the treated units. The estimates are closer to the experimental benchmarks, and decision makers would draw the correct inferences about the relative merits of the two treatments. The accuracy criterion, however, is not satisfied. By increasing the complexity of the specification, one might eventually stumble upon one that generates estimates that satisfy the accuracy criterion, but in the absence of an experimental benchmark, the precise specification for incorporating this heterogeneity would be debatable. By matching first on observables, one obviates the need for debatable and complicated modeling of the potential forms of heterogeneity that arise from observable, time-invariant characteristics.

Similarly, there have been advances in fixed-effects estimators that relax the assumption that no lagged dependent variables belong in the model. For example, one can achieve consistent estimation of the parameters in a model that includes both lagged dependent variables and unobserved household fixed effects by using Y_{it–2} as an instrument for Y_{it–1}. Such an estimator, however, requires the strong assumption that Y_{it–2} ⊥ ε_{it}, which is not tenable in our context. However, pre-processing via matching followed by estimation with a fixed-effects model is similar in spirit to a model that includes both lagged dependent variables and unobserved household fixed effects. If what makes Cobb County households different from Fulton County households are time-invariant household unobservables as well as time-varying unobservables that are captured by household water use h periods ago (where h are the pre-treatment periods used in matching), the estimation procedure we follow can account for both sources of bias and permit consistent estimation of the treatment effects.

Although our extensions are already more complex than what is found in most program evaluations with panel data, we do not doubt that one could develop even more complex specifications that could replicate the experimental benchmarks. For example, one could think about different ways to incorporate heterogeneous treatment effects (e.g., Fernandez-Val and Lee 2013) and different ways to model the effects of time. Yet there are myriad ways to accomplish such extensions. How would one choose among them? Pre-processing the data through matching accomplishes the same objectives as the more complicated modeling, while also seeming more justifiable in a broader set of applied contexts.

13. In fact, using the RCT data, Ferraro and Miranda (2013) found evidence of heterogeneous responses conditional on observable characteristics.



8. CONCLUSIONS

The results show that (1) fixed-effects panel data estimators alone are not a panacea for addressing bias (even time-invariant sources of bias), and (2) careful consideration of the validity of the identifying assumptions for causal inference is critical. When the standard, linear fixed-effects panel data (FEPD) estimator is combined with matching algorithms, the estimates and inferences from the observational design are similar to those from the experimental design.14 In other words, through a combination of design and methods, we can approximate a randomized controlled trial (Rubin 2008).

Without pre-processing the data by matching, the estimates and inferences from the simple FEPD estimator can be quite misleading for decision makers. In fact, in terms of our accuracy and robustness criteria, the simple FEPD estimator is outperformed by a combination of caliper matching and an OLS difference-in-means estimator studied by Ferraro and Miranda (2014); in other words, an analyst does better ignoring the monthly panel data structure. However, a combination of caliper matching and the FEPD estimator greatly outperforms all estimators in Ferraro and Miranda (2014) on both criteria.

Discarding observations with high and low propensity scores (trimming) improves the nonexperimental estimates, but not as much as matching does. The poor performance of trimming may result from two sources: (1) unlike matching (with or without calipers), which drops outliers and inliers, trimming only drops outliers (Sekhon 2010; Ferraro and Miranda 2014); and (2) trimming may perform well only when the treatment effect is homogeneous (Busso, DiNardo, and McCrary 2009, 2011), but treatment effects are heterogeneous in our study.

Our matching design matches households based on pre-treatment outcomes as well as time-invariant observable home and neighborhood characteristics known to affect the outcome and systematically different between treated and untreated units. The pre-treatment outcomes, the home characteristics, and the neighborhood characteristics are all required. Leaving out one of these sets of covariates yields inaccurate impact estimates. With caliper matching in particular, the nonexperimental design does an excellent job of replicating the experimental benchmark and yields the same statistical inferences about treatment effects. Based on a bootstrapping exercise, its performance is not sensitive to the specific sample used in the analysis.

14. Ferraro and Miranda (2014) similarly show that matching improves the performance of the OLS difference-in-means estimator (with covariate adjustment). After matching, this estimator outperforms the other estimators they examine, including the differences-in-differences estimator with single pre-treatment and post-treatment observations.


Why does pre-processing the data by matching on time-invariant observables greatly improve the performance of the FEPD estimator? Because it makes the identifying assumptions of the FEPD estimator more plausible, as they would be in a randomized experiment. In particular, it reduces concerns about the linear specification, the assumption of common mean responses to common shocks among treated and untreated households, the assumption of homogeneous treatment responses, and the assumption that lagged dependent variables do not belong in the model. Concerns about the validity of these assumptions are prevalent in a wide range of program evaluation contexts.

In our study context, our analysis suggests that the most important violation is the assumption of homogeneous treatment responses. By matching first on observables, one obviates the need for debatable specifications and complicated modeling of the potential forms of heterogeneity that arise from the moderating effects of observable, time-invariant characteristics. However, failing to match on important time-invariant, observable characteristics and instead focusing the matching exercise on ensuring that pre-treatment trends are nearly identical in the treated and matched untreated units reduces accuracy and robustness. Thus, although our analysis confirms popular notions that similar pre-treatment trends are an important attribute of good observational designs, it also shows that identical trends are neither necessary nor sufficient.

As with all design replications, our conclusions come from a single context, and thus we cannot be certain that they apply to other contexts. Moreover, we have only examined some of the plausible evaluation designs that could be appropriate for our program context. Nevertheless, our results do support recent claims that an emphasis on research design (Rubin 2008) and on combining different strategies in evaluations makes the assumptions required for causal inferences from observational data more plausible.

REFERENCES

Agodini, Roberto, and Mark Dynarski. 2004. Are experiments the only option? A look at dropout prevention programs. Review of Economics and Statistics 86 (1): 180–94.

Alix-Garcia, Jennifer M., Katharine R. E. Sims, and Patricia Yañez-Pagans. 2015. Only one tree from each seed? Environmental effectiveness and poverty alleviation in Mexico's Payments for Ecosystem Services Program. American Economic Journal: Economic Policy 7 (4): 1–40.

Angrist, Joshua D., and Jorn-Steffen Pischke. 2009. Mostly harmless econometrics: An empiricist's companion. Princeton, NJ: Princeton University Press.

Arceneaux, Kevin, Alan Gerber, and Donald P. Green. 2006. Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis 14:37–62.

Arriagada, Rodrigo A., Paul J. Ferraro, Erin O. Sills, Subhrendu K. Pattanayak, and Silvia Cordero-Sancho. 2012. Do payments for environmental services reduce deforestation? A farm-level evaluation from Costa Rica. Land Economics 88:382–99.

Bernedo, Maria, Paul J. Ferraro, and Michael Price. 2014. The persistent impacts of norm-based messaging and their implications for water conservation. Journal of Consumer Policy 37 (3): 437–52.

Black, Dan, Jose Galdo, and Jeffrey Smith. 2005. Evaluating the bias of the regression discontinuity design using experimental data. Working paper, University of Michigan.

Bloom, Howard. 2010. Nine lessons about doing evaluation research: Remarks on accepting the Peter H. Rossi Award at the Association for Public Policy Analysis and Management Conference, November 5, Boston.

Buddelmeyer, Hielke, and Emmanuel Skoufias. 2004. An evaluation of the performance of regression discontinuity design on PROGRESA. World Bank Policy Research Working Paper 3386, World Bank, Washington, DC.

Busso, Matias, John DiNardo, and Justin McCrary. 2009. Finite sample properties of semiparametric estimators of average treatment effects. Working paper, University of Michigan.

———. 2011. New evidence on the finite sample properties of propensity score reweighting and matching estimators. Review of Economics and Statistics 96 (5): 885–97.

Card, David, and Alan B. Krueger. 1994. Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review 84:772–93.

Chabé-Ferret, Sylvain. 2015. Analysis of the bias of matching and difference-in-difference under alternative earnings and selection processes. Journal of Econometrics 185 (1): 110–23.

Cook, Thomas D., William R. Shadish, and Vivian C. Wong. 2008. Three conditions under which observational studies produce the same results as experiments. Journal of Policy Analysis and Management 27 (4): 724–50.

Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika 96 (1): 187–99.

Davis, Lucas W., Alan Fuchs, and Paul Gertler. 2014. Cash for coolers: Evaluating a large-scale appliance replacement program in Mexico. American Economic Journal: Economic Policy 6 (4): 207–38.

Dehejia, Rajeev H. 2005. Practical propensity score matching: A reply to Smith and Todd. Journal of Econometrics 125:355–64.

Dehejia, Rajeev H., and Sadek Wahba. 1999. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94 (448): 1053–62.

———. 2002. Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics 84 (1): 151–61.

Diaz, Juan Jose, and Sudhanshu Handa. 2006. An assessment of propensity score matching as a nonexperimental impact estimator: Evidence from Mexico's PROGRESA program. Journal of Human Resources 41 (2): 319–45.

Fernandez-Val, Iván, and Joonhwah Lee. 2013. Panel data models with non-additive unobserved heterogeneity: Estimation and inference. Quantitative Economics 4 (3): 453–81.

Ferraro, Paul J., and Juan José Miranda. 2013. Heterogeneous treatment effects and causal mechanisms in non-pecuniary, information-based environmental policies: Evidence from a large-scale field experiment. Resource and Energy Economics 35 (3): 356–79.

———. 2014. The performance of non-experimental designs in the evaluation of environmental policy: A design-replication study using a large-scale randomized experiment as a benchmark. Journal of Economic Behavior and Organization 107, pt. A: 344–65.

Ferraro, Paul J., Juan José Miranda, and Michael K. Price. 2011. The persistence of treatment effects with non-pecuniary policy instruments: Evidence from a randomized environmental policy experiment. American Economic Review: Papers and Proceedings 101 (3): 318–22.

Ferraro, Paul J., and Michael K. Price. 2013. Using non-pecuniary strategies to influence behavior: Evidence from a large-scale field experiment. Review of Economics and Statistics 95 (1): 64–73.

Fraker, Thomas, and Rebecca Maynard. 1987. The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources 22 (2): 194–227.

Galiani, Sebastian, Paul Gertler, and Ernesto Schargrodsky. 2005. Water for life: The impact of the privatization of water services on child mortality. Journal of Political Economy 113 (1): 83–120.

Gibbons, Charles E., Juan Carlos Suarez-Serrato, and Michael B. Urbancic. 2014. Broken or fixed effects? NBER Working Paper no. 20342, National Bureau of Economic Research, Cambridge, MA.

Glazerman, Steven, Dan M. Levy, and David Myers. 2003. Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political and Social Science 589 (1): 63–93.

Greenberg, David H., Charles Michalopoulos, and Philip K. Robins. 2006. Do experimental and nonexperimental evaluations give different answers about the effectiveness of government-funded training programs? Journal of Policy Analysis and Management 25 (3): 523–52.

Handa, Sudhanshu, and John A. Maluccio. 2010. Matching the gold standard: Comparing experimental and nonexperimental evaluation techniques for a geographically targeted program. Economic Development and Cultural Change 58 (3): 415–47.

Heckman, James J., Hidehiko Ichimura, and Petra Todd. 1997. Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. Review of Economic Studies 64 (4): 605–54.

———. 1998. Matching as an econometric evaluation estimator. Review of Economic Studies 65 (2): 261–94.

Hill, Jennifer L., Jerome P. Reiter, and Elaine L. Zanutto. 2004. A comparison of experimental and observational data analyses. In Applied Bayesian modeling and causal inference from incomplete-data perspectives, ed. Andrew Gelman and Xiao-Li Meng. New York: Wiley.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2007. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15:199–236.

Imai, Kosuke, and In Song Kim. 2014. On the use of linear fixed effects regression estimators for causal inference. Working paper.

Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47 (1): 5–86.

Jones, Kelly W., and David J. Lewis. 2015. Estimating the counterfactual impact of conservation programs on land cover outcomes: The role of matching and panel regression techniques. PLOS One 10 (10): e0141380. doi:10.1371/journal.pone.0141380.

Lalonde, Robert J. 1986. Evaluating the econometric evaluations of training with experimental data. American Economic Review 76:604–20.

Lalonde, Robert J., and Rebecca Maynard. 1987. How precise are evaluations of employment and training programs: Evidence from a field experiment. Evaluation Review 11 (4): 428–51.

Lancaster, Anthony. 2000. The incidental parameter problem since 1948. Journal of Econometrics 95 (2): 391–413.

Lee, Wang-Sheng. 2011. Propensity score matching and variations on the balancing test. Empirical Economics, May 26, 1–34.

Lee, Yoon-Jin. 2014. Testing a linear dynamic panel data model against nonlinear alternatives. Journal of Econometrics 178 (1): 146–66.

McKenzie, David, John Gibson, and Steven Stillman. 2010. How important is selection? Experimental vs. non-experimental measures of the income gains from migration. Journal of the European Economic Association 8 (4): 913–45.

Rosenbaum, Paul R., and Donald B. Rubin. 1985. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. American Statistician 39 (1): 33–38.

Rubin, Donald B. 2008. For objective causal inference, design trumps analysis. Annals of Applied Statistics 2 (3): 808–40.

Sekhon, Jasjeet S. 2010. Package "Matching": R documentation, University of California at Berkeley. http://sekhon.berkeley.edu/matching/Match.html.

———. 2011. Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software 42 (7): 1–52.

Smith, Jeffrey, and Petra Todd. 2005. Does matching overcome Lalonde's critique of nonexperimental estimators? Journal of Econometrics 125:305–53.

St. Clair, Travis, Thomas D. Cook, and Kelly Hallberg. 2014. Examining the internal validity and statistical precision of the comparative interrupted time series design by comparison with a randomized experiment. American Journal of Evaluation 35 (3): 311–27.

Uchida, Emi, Jintao Xu, Zhigang Xu, and Scott Rozelle. 2007. Are the poor benefiting from China's land conservation program? Environment and Development Economics 12 (4): 593–620.

Wendland, Kelly J., Matthias Baumann, David J. Lewis, Anika Sieber, and Volker C. Radeloff. 2015. Protected area effectiveness in European Russia: A post-matching panel data analysis. Land Economics 91 (1): 149–68.

Wilde, Elizabeth Ty, and Robinson Hollister. 2007. How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. Journal of Policy Analysis and Management 26 (3): 455–77.

Wing, Coady, and Thomas D. Cook. 2013. Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management 32 (4): 853–77.

Wooldridge, Jeffrey M. 2002. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Portuguese Economic Journal 1 (2): 117–39.

———. 2005. Fixed-effects and related estimators for correlated random-coefficient and treatment effect panel data models. Review of Economics and Statistics 87 (2): 385–90.