
Interrupted Time Series and Regression Discontinuity Design

Marcelo Coca Perraillon

University of Colorado Anschutz Medical Campus

March 2019


Outline

Interrupted time series

Basic design

Estimation

Validity issues

Extensions

Overview of regression discontinuity

Meaning and validity of RDD

Several examples from the literature

Estimation (where most decisions are made)

Sharp RDD example: Nursing home ratings

Fuzzy RDD example: Early intervention therapies


Code/data

You can get the code and an example dataset from my website (click on Code in the left menu):

http://tinyurl.com/mcperraillon


ITS


Interrupted time series is a fairly basic before and after analysis

You have probably learned, for good reasons, that before and after analyses are suspicious

Yet, in some circumstances, and under some assumptions, they can be fairly convincing designs

Today, I’ll separate design and estimation, which in general is the way to go

See Rubin (2007, 2008). Bottom line: “[O]bservational studies can and should be designed to approximate randomized experiments as closely as possible. In particular, observational studies should be designed using only background information to create subgroups of similar treated and control units, where similar here refers to their distributions of background variables.”



Time series: observations for a single variable made consecutively over time

Could be same units or different units

Example: prescription numbers/rates over time for different people in each time period or the same group of people measured at different points over time

At some point during the observation period there is a change that creates an “interruption” in the time series

For example, a policy change like the implementation of Medicare Part D in 2006, the release of a black box warning for antidepressant use in children, or a change in payment policy in Medicaid would create an “interruption”

The idea is to use that interruption to measure the effects of policies


Example

Medicaid in New Hampshire imposed a three-drug limit that restricted medication reimbursement among chronically ill poor patients with cardiac and other chronic illnesses

The outcome: the mean number of prescriptions declined by half

Soumerai et al (2017, 1987).


Example

Medicare Part D changes in prescriptions for Medicare beneficiaries

Schneeweiss et al (2009)

Use pharmacy claims for the elderly: problems with measurement and establishing insurance coverage


Thinking about effects

Note that in the two examples effects can be described in different ways, which has implications when thinking about the estimation part

Is it a change in the level? Is it a change in the rate? In a model, intercept vs slope changes

Is the change continuous or does it decay? In the Medicaid example, there was another change prompted by the concerns about prescription declines

Is the effect immediate or delayed?

In health policy, changes often happen before the actual policy implementation. Example: Medicaid expansion or Part D


Estimation

Estimation is (relatively) simple with models parametrized in different ways depending on whether one assumes a change in level or a change in slope or if other features are also incorporated

For example: $y = \beta_0 + \beta_1 t + \beta_2 post + \beta_3 (t \times post) + \epsilon$, where $post = 1$ if after the policy intervention and $t$ is time

In the pre-period: $E[y_{pre} \mid x] = \beta_0 + \beta_1 t$

In the post-period: $E[y_{post} \mid x] = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) t$

Note that several tests are possible. If $\beta_2 = \beta_3 = 0$ then there is no policy effect

If $\beta_2 = 0$ but $\beta_3 \neq 0$ then there was a change in the slope, but not the level, after the interruption. $\beta_3$ is the change in the rate (remember, interactions in the linear model are differences of differences). The slope after the policy is given by $\beta_1 + \beta_3$
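
The model and tests above can be sketched in Stata on simulated data. This is only an illustration: the 48 monthly observations, the policy date at month 24, and all coefficient values are assumptions made up for the example, not numbers from the Medicaid or Part D studies.

```stata
* Minimal ITS sketch on simulated data (all numbers are assumptions)
clear
set seed 12345
set obs 48
gen t = _n                          // time in months
gen post = t > 24                   // hypothetical policy after month 24
gen y = 50 + 0.5*t - 10*post - 0.3*t*post + rnormal(0, 2)

* y = b0 + b1*t + b2*post + b3*t*post + e
reg y c.t##i.post

* Joint Wald test of no policy effect: b2 = b3 = 0
test 1.post 1.post#c.t

* Slope after the policy: b1 + b3
lincom t + 1.post#c.t

* Level difference evaluated at the policy date: b2 + b3*24
lincom 1.post + 24*1.post#c.t
```

Because t is not centered in this parametrization, the coefficient on post is the level difference extrapolated to t = 0. The last lincom evaluates the difference at the policy date instead, which is usually the quantity of interest.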


Estimation

For the Medicaid example above, a Wald test for $\beta_2$ would be more relevant since there doesn’t seem to be much happening in the slope

A model like $y = \beta_0 + \beta_2 post + \epsilon$ could be a better fit

If you think that a change in slope and not level is a better assumption, splines are an option.

With splines, we can model a change in the slope with no change in the level


Splines

The policy change (“interruption”) happens at time $t = k$. In splines lingo, $k$ is the knot

The model is $y = \beta_0 + \beta_1 time + \beta_2 (time - k)_+ + \epsilon$

The $(z)_+$ is called a truncated line function and is defined as being equal to $z$ if $z$ is positive and zero otherwise

So $(time - k)_+$ will be equal to $(time - k)$ after the policy change and zero at the knot or before

The only difficult part about splines is to get the coding right; the rest is (relatively) easy

See Stata’s mkspline command. You can do all sorts of things with splines (they don’t have to be linear). Sometimes they are called piecewise regression


Splines

Model: $y = \beta_0 + \beta_1 time + \beta_2 (time - k)_+ + \epsilon$

Before the policy change the model is A): $E[y] = \beta_0 + \beta_1 time$

After: If $time > k$ the model is: $E[y] = \beta_0 + \beta_1 time + \beta_2 (time - k)$

Same as centering (more on it soon), so when $time > k$ we can rewrite it as B): $E[y] = (\beta_0 - \beta_2 k) + (\beta_1 + \beta_2) time$

If $\beta_2 = 0$, then the slope before and after is the same (so A and B are the same)

Note that $\beta_2$ is the incremental change in slope

The trick of using the truncated function is that it allows for a different slope after $k$

See Lopez Bernal (2017) for other parametrizations
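
As a rough sketch of the coding step, the truncated line function can be built by hand or with mkspline. The data are simulated; the knot at 30 and the coefficient values are assumptions chosen only for illustration.

```stata
* Linear spline with one knot, simulated data (all numbers are assumptions)
clear
set seed 2019
set obs 60
gen time = _n
local k = 30                            // hypothetical knot (policy date)
gen time_k = max(time - `k', 0)         // truncated line function (time - k)+
gen y = 20 + 1*time - 1.5*time_k + rnormal(0, 3)

* Coefficient on time_k is the incremental change in slope (beta2)
reg y time time_k
lincom time + time_k                    // slope after the knot

* Same variables via mkspline; with the marginal option the second
* coefficient is the change in slope rather than the slope itself
mkspline s1 `k' s2 = time, marginal
reg y s1 s2
```

Without the marginal option, mkspline codes the second variable so that its coefficient is the slope after the knot, so it is worth checking which parametrization is being estimated before interpreting the output.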


Statistical issues

The models I wrote above are linear models implicitly assuming errors are iid and $\epsilon_i \sim N(0, \sigma^2)$ or, equivalently, $y_i \sim N(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}, \sigma^2)$

Of course, two issues: 1) The outcome drives distributional assumptions. If counts, for example, a negative binomial or Poisson model would be better

2) The errors cannot be iid since they can’t be independent. Autocorrelation is a feature of time series

Although the coefficient estimates from the linear model are unbiased, the standard errors would be wrong

Autoregressive integrated moving average (ARIMA) or other solutions are commonly used. There are tests for autocorrelation, like the Breusch-Godfrey test

Another issue (seasonality)
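
A minimal sketch of how the autocorrelation issue might be checked and handled in Stata, on simulated data. The AR(1) coefficient of 0.6, the policy date, and the effect size are assumptions. Newey-West standard errors and Prais-Winsten regression are shown as two common alternatives to ARIMA, not as the approach used in the studies cited above.

```stata
* ITS with AR(1) errors, simulated data (all numbers are assumptions)
clear
set seed 42
set obs 60
gen t = _n
tsset t
gen post = t > 30                         // hypothetical policy date

* Build AR(1) errors recursively: e_t = 0.6*e_{t-1} + noise
gen e = rnormal(0, 2) in 1
replace e = 0.6*e[_n-1] + rnormal(0, 2) in 2/l
gen y = 50 + 0.5*t - 8*post + e

reg y t post                              // OLS: SEs ignore autocorrelation
estat bgodfrey, lags(1)                   // Breusch-Godfrey test

newey y t post, lag(1)                    // Newey-West standard errors
prais y t post                            // Prais-Winsten AR(1) regression
```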


Validity

Intuitively, the main validity issue is whether other changes happening after the policy/interruption could have caused the change in $y$

Your textbook also mentions instrumentation (changes in procedures change measurement) and

Selection (the composition of the group changes after the policy change)

Compelling interrupted time series designs tend to be short-term, with changes that are hard to explain otherwise


Extensions

Adding a nonequivalent no-treatment control group time series

As the duck test describing an example of abductive reasoning goes: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck

Yes, this is the same as difference-in-difference models... See Wing et al. (2018)

Note how the parallel trend and the common shocks assumptions of DiD are extensions of the issues discussed in Chapter 6

Please read Chapter 6. Don’t underestimate your textbook


And now for something completely different...

Switching gears: regression discontinuity design


Introduction

Basics

Method developed to estimate treatment effects in non-experimental settings

Provides causal estimates of treatment effects. Those estimates are Local Average Treatment Effects (LATE)

Design limits external validity in some cases

Good internal validity; some assumptions can be empirically verified

Relatively easy to estimate but it has some complications

First application: Thistlethwaite and Campbell (1960) (Does the last name Campbell sound familiar?)


Thistlethwaite and Campbell

They studied the impact of merit awards on future academic outcomes

Awards allocated based on test scores

If a person had a score greater than $c$, the cutoff point, then she received the award

The wrong way of analyzing: compare those who received the award to those who didn’t

Thistlethwaite and Campbell realized they could compare individuals just above and below the cutoff point

By now I find this idea intuitive but it’s not at first. It helps if you think that choosing the point $c$ is arbitrary or measured with error

Say, it’s 1200. But why not 1210? Or 1190? We know that the test scores used to give scholarships are related to future academic outcomes but there is no solid relationship between 1200 and the outcomes


Validity

Simple idea: assignment mechanism is known

We know that the probability of treatment jumps to 1 if test score > c

Assumption is that individuals cannot manipulate with precision their assignment variable (think about the SAT)

Key word: precision. Consequence: comparable individuals near cutoff point

If treated and untreated individuals are similar near the cutoff point then data can be analyzed as if it were a (conditionally) randomized experiment

If this is true, then background characteristics should be similar near $c$ (can be checked empirically)

The estimated treatment effect applies to those near the cutoff point (limits external validity)


Validity

Validity hinges on the assignment mechanism being free of manipulation with precision, or on the cutoff point being arbitrary or measured with error

Manipulation example 1: Test with few questions and plenty of time

Manipulation example 2: DMV test to get a driving license

Example 3: Some mechanism makes the cutoff point related to the outcome (think biology: blood pressure). What if measured with error?

Example 4: Eligibility criteria to obtain some benefit (say, below income of 28K). Why? How could you verify assumptions?

Common confusion: Some manipulation is fine (you can always study harder, for example). What is key is no manipulation with precision, or no deterministic relationship between the cutoff point and outcomes. For example, RDD wouldn’t work if there is a biological mechanism by which a baby weighing less than 1,200 grams is very likely to have a bad outcome


Graphical Example

Simulated data with c = 140

gen y = 100 + 80*T + 2*x + rnormal(0, 20)


No effect


Sharp and fuzzy RDD

Sharp RDD: Assignment or running variable completely determines treatment. A jump in the probability of treatment before and after the cutoff point, from 0 to 1

Fuzzy RDD: Cutoff point increases the probability of treatment but doesn’t completely determine treatment. A change in the probability of treatment before and after but not from 0 to 1

Which brings us back to the world of instrumental variables...

Fuzzy RDD is not used as often but has a lot of potential, in particular because no mental contortions are needed to justify the exclusion restriction

Think of encouragement designs or imperfect compliance (like the Oregon study)


Is it a jump or a kink?

There is a fairly new related method starting to make an appearance in the literature: regression kink design. See: Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015)

Kink: “a sharp twist or curve in something that is otherwise straight”

The idea is almost the same as RDD. The assignment variable doesn’t create a discontinuity (“jump”) but instead is assumed to create a change in slope or a kink

Example: Suppose that a glucose blood test determines who gets treatment or not. If a patient’s test result is, say, over 120, then she gets the drug. That’s a sharp RDD. But if receiving the drug depends on some other factors besides the blood test, then it’s a fuzzy RDD

Kink: the blood test threshold determines the dosage of the medication instead and we expect that the outcome, say, future blood glucose, will change but not “jump” at 120. It’s a subtle distinction. Still not clear to me how to think about it


Examples from literature

Almond et al. (2010): Assignment variable is birth weight. Infants with low birth weight (< 1,500 grams or about 3.3 pounds) receive more medical treatment

Lee, Moretti, Butler (2004): The vote share (0 to 100 percent) for a candidate is a continuous variable. A candidate is elected if he or she obtains more than 50% of the votes. They evaluated the voting record of candidates in close elections

CMS rates nursing homes using 1 to 5 stars. Overall stars are assigned based on deficiency data transformed into a points system. Outcome: new admissions six months after the release of ratings (consumer response)

Alternative outcome: changes in quality scores after 6 months (provider response)


Assignment of stars based on scores


RDD as a special type of randomization

Suppose you randomize people the old-fashioned way. You have a dataset with 2000 person ids. You create a new column that is a draw from a uniform random variable called rv

If rv > 0.5, then assign to treatment group. We know that each person has equal probability of being in either group (it’s a uniform distribution)

If no treatment is performed, would there be any relationship between an outcome (any outcome) and the uniform random variable?

No. Furthermore, there wouldn’t be any relationship between the assignment variable rv and any person characteristic (rv and everything else are independent)

But what about an outcome after performing an intervention on the treatment group? Is there a relationship between rv and the outcome? Said another way, do we need to control for rv in our models? NO




set obs 2000

gen id = _n

* Simulated baseline outcome (chi-squared)

gen y0 = rnormal(10,1)^2

* Randomize

gen rv = runiform()

gen T = 0

replace T = 1 if rv > .5

* Pretend treatment is effective

gen y1 = y0

replace y1 = y0+10 if T==1



RDD as a special type of randomization

scatter y1 rv, msize(tiny) || lfit y1 rv if T==1, color(red) ///

|| lfit y1 rv if T==0, color(blue) ///

legend(off) ytitle("Outcome") xtitle("Assignment score (uniform rv)") ///

saving(rv.gph)

graph export rv.png, replace



RDD as a special type of randomization

* Controlling or not for the assignment variable is irrelevant

* rv is not a confounder

qui reg y1 T

est sto m1

qui reg y1 T rv

est sto m2

est table m1 m2, p

----------------------------------------
    Variable |     m1          m2
-------------+--------------------------
           T |  10.78717    10.198423
             |    0.0000       0.0000
          rv |              1.1523433
             |                 0.7189
       _cons | 100.96664    100.68496
             |    0.0000       0.0000
----------------------------------------
                          legend: b/p




RDD is like a conditionally randomized trial in which the assignment score is not like rv in the example above

It’s not a random number but rather a number that is assumed/expected to be related to the outcome of interest

So we have to control for that variable, unlike rv

It’s an example of techniques used when we do know something about the assignment mechanism



Golden rule: If we know how the treatment was assigned, we can do something to estimate causal effects

Often, we need exclusion restrictions that can’t be verified with data

Instrumental variables: We know there is one variable that predicts well who gets the treatment (and we argue that this variable is “exogenous” or not related to the outcome, conditionally)

Interrupted time series: We know that treatment assignment is correlated with time (and we argue that that’s the only change)

Difference-in-difference: We know treatment assignment is correlated with time and we know that one group was not treated (and we argue for common shocks)

RDD: We know exactly how treatment was assigned (but need to argue about no precise manipulation or arbitrariness of the cutoff point)

Propensity scores: We think we know something... PS depend on the same assumptions as regression adjustment (but adjustment is a time-honored way of obtaining causal effects)



Regression adjustment and causal inference

It’s easy to overlook but regression adjustment is one of the oldest causal inference tools we have

Simple example. Suppose that we randomize patients based on severity of illness because we suspect a treatment is effective and we want to give those who are in worse health a chance to receive the treatment

We randomly assign 80% of patients who are worse (let’s call them ill) to the treatment group. If patients are not so ill, we randomize them 50-50

At the end of the trial we estimate the treatment effect on some outcome $y$: $y_i = \beta_0 + \beta_1 T_i + \epsilon_i$

Of course, $\beta_1$ is biased, and probably biased upwards. We solve this by holding severity constant: $y_i = \gamma_0 + \gamma_1 T_i + \gamma_2 ill_i + \epsilon_i$

If you belong to the econ tribe: the zero conditional mean assumption is violated in the first model



Digression: Propensity scores

You could also stratify and estimate two models, one for those ill and one for those not ill. No confounding but you may want to combine estimates

Or you could estimate the propensity score: $P(T = 1 \mid ILL) = f(ILL)$ or $\text{logit}(P(T = 1 \mid ILL)) = \beta_0 + \beta_1 ILL$

The predicted probability of treatment will be higher for those who are ill

Use the inverse of the propensity score as a weight in the model $y_i = \beta_0 + \beta_1 T_i + \epsilon_i$

Same as regression adjustment (some argue that using the PS is better because of a more flexible functional form)
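
The severity example and the propensity score weighting can be sketched together on simulated data. Everything numeric here is an assumption: 40% of patients are ill, the outcome is a symptom score where higher is worse, illness adds 10 points, and the true treatment effect is -5. The weights use 1/ps for the treated and 1/(1 - ps) for the untreated, which is the usual form of inverse probability weighting for the average treatment effect.

```stata
* Conditional randomization: naive, adjusted, and IPW estimates
* (simulated data; all numbers are assumptions)
clear
set seed 101
set obs 5000
gen ill = runiform() < 0.4
* 80% of the ill and 50% of the not so ill are assigned to treatment
gen T = runiform() < cond(ill, 0.8, 0.5)
* Symptom score: higher is worse; true treatment effect is -5
gen y = 50 - 5*T + 10*ill + rnormal(0, 5)

reg y T                    // omits severity: biased upwards
reg y T ill                // regression adjustment

* Propensity score and inverse probability weights
logit T ill
predict ps, pr
gen ipw = cond(T == 1, 1/ps, 1/(1 - ps))
reg y T [pweight = ipw]
```

In this setup the naive coefficient is pulled towards zero because the treated group contains more ill patients, while the adjusted and weighted estimates should both land near the assumed effect.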



Regression adjustment and RDD

You may wonder why I am talking about regression adjustment and randomization, or in this case, conditional randomization

Because RDD is just like a conditionally randomized experiment (near a cutoff point)

We often, but not always, expect that the assignment variable is related to the outcome in some way. So:

(1) We need to control for the assignment variable

(2) We need to get the functional form right



One thing I don’t like about how RDD is often presented

In most examples, there is a strong relationship between the assignment variable and the outcome

Therefore, we need to model that relationship correctly. If the assignment was random like in the previous example, then we don’t care about controlling for the assignment variable

In some applications the relationship between the assignment variable and the outcome may not be strong so the proper functional form is less relevant

One good example is my nursing home RDD paper (more on this in a sec)


Estimation

Estimation: Parametric

Simplest case is linear relationship between Y and X

$Y_i = \beta_0 + \beta_1 T_i + \beta_3 X_i + \epsilon_i$

$T_i = 1$ if subject $i$ received treatment and $T_i = 0$ otherwise. You can also write this as $T_i = 1(X_i > c)$ or $T_i = 1_{[X_i > c]}$

$X$ is the assignment variable (sometimes called “forcing” or “running” variable)

Usually centered at cutoff point

$Y_i = \beta_0 + \beta_1 T_i + \beta_3 (X_i - c) + \epsilon_i$

The treatment effect is $\beta_1$

$E[Y \mid T = 1, X = c] = \beta_0 + \beta_1$ and $E[Y \mid T = 0, X = c] = \beta_0$.

$E[Y \mid T = 1, X = c] - E[Y \mid T = 0, X = c] = \beta_1$.
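
A minimal sketch of this estimation, reusing the data generating line from the graphical example (cutoff at 140, jump of 80). The sample size and the distribution of the assignment variable are assumptions, since the original do file is not reproduced here.

```stata
* Sharp RDD, linear specification centered at the cutoff
* (sample size and distribution of x are assumptions)
clear
set seed 140
set obs 1000
gen x = rnormal(140, 20)            // assignment variable
gen T = x > 140                     // sharp assignment at c = 140
gen y = 100 + 80*T + 2*x + rnormal(0, 20)

gen x_c = x - 140                   // center at the cutoff
reg y T x_c                         // _b[T] estimates the jump at c
```

With centering, the constant is the expected outcome for the untreated at the cutoff (here 100 + 2*140 = 380 by construction) and the coefficient on T is the treatment effect at the cutoff.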


Reminder on centering

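Centering changes the interpretation of the intercept:

$Y = \beta_0 + \beta_1 (Age - 65) + \beta_2 Edu$

$= \beta_0 + \beta_1 Age - 65 \beta_1 + \beta_2 Edu$

$= (\beta_0 - 65 \beta_1) + \beta_1 Age + \beta_2 Edu$

Compare to: $Y = \alpha_0 + \alpha_1 Age + \alpha_2 Edu$

$\beta_1 = \alpha_1$, $\beta_2 = \alpha_2$, but $\alpha_0 = \beta_0 - 65 \beta_1$, so $\alpha_0 \neq \beta_0$

Useful with interactions:

$Y = \alpha_0 + \alpha_1 Age + \alpha_2 Edu + \alpha_3 Age \times Edu$

Compare to:

$Y = \beta_0 + \beta_1 (Age - 65) + \beta_2 (Edu - 12) + \beta_3 (Age - 65) \times (Edu - 12)$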
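
One way to see why centering helps with interactions is to compare the Age slope implied by each of the two interaction models. This is a short worked step added for illustration, using the same symbols as above:

```latex
\frac{\partial Y}{\partial Age} = \alpha_1 + \alpha_3 \, Edu
\quad \text{(uncentered)}
\qquad
\frac{\partial Y}{\partial Age} = \beta_1 + \beta_3 \, (Edu - 12)
\quad \text{(centered)}
```

So $\alpha_1$ is the Age slope when Edu = 0, which may be far outside the data, while $\beta_1$ is the Age slope when Edu = 12. The interaction coefficient is unchanged ($\beta_3 = \alpha_3$). In RDD, centering the assignment variable at $c$ plays the same role: the coefficient on $T$ is the effect at the cutoff.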


Extrapolation

Note that the estimation of treatment effect in RDD depends on extrapolation

To the left of cutoff point only non-treated observations

To the right of cutoff point only treated observations

What is the treatment effect at $X = 130$? Just plug in:

$E[Y \mid T, X = 130] = \beta_0 + \beta_1 T + \beta_3 (130 - 140)$



Extrapolation... Dashed lines are extrapolations



Counterfactuals

The extrapolation is a counterfactual or potential outcome

Each person $i$ has two potential outcomes (Rubin’s causal framework).

$Y_i(1)$ denotes the outcome of person $i$ if in the treated group

$Y_i(0)$ denotes the outcome of person $i$ if in the non-treated group

Causal effect of treatment for person $i$ is $Y_i(1) - Y_i(0)$

Average treatment effect is $E[Y_i(1) - Y_i(0)]$

Only one potential outcome is observed. In randomized experiments, one group provides the counterfactual for the other because they are comparable (exchangeable)

Exchangeability (epi). Also called “selection on observables” or “no unmeasured confounders”



Counterfactuals, II

In RDD the counterfactuals are conditional on $X$ as in a conditionally randomized trial (think severity)

We are interested in the treatment effect at $X = c$: $E[Y_i(1) - Y_i(0) \mid X_i = c]$

Treatment effect is $\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$

Estimation possible because of the continuity of $E[Y_i(1) \mid X]$ and $E[Y_i(0) \mid X]$

See Hahn, Todd, and Van der Klaauw (2001) for details

The estimation of the treatment effect is based on extrapolation because of lack of overlap. Therefore, the functional relationship between $X$ and $Y$ must be correctly specified


Need to model relationship between X and Y correctly

What if nonlinear? Could result in a biased treatment effect if one assumes a linear model.


Other specifications

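More general: $Y_i = \beta_0 + \beta_1 T_i + \beta_3 f(X_i - c) + \epsilon_i$

Writing $X_i$ for the centered variable $(X_i - c)$: $Y_i = \beta_0 + \beta_1 T_i + \beta_3 f(X_i) + \epsilon_i$

The most common forms for $f(X_i)$ are polynomials

Polynomials of order $p$:

$Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_i + \beta_3 X_i^2 + \beta_4 X_i^3 + \cdots + \beta_{p+1} X_i^p + \epsilon_i$

More flexibility with interactions

2nd degree with interactions:

$Y_i = \beta_0 + \beta_1 T_i + \beta_3 X_i + \beta_4 X_i^2 + \beta_5 X_i \times T_i + \beta_6 X_i^2 \times T_i + \epsilon_i$

Question: Why not control for other covariates?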



Third degree polynomial. Actual model second degree polynomial (see Stata do file). However...



A note on higher order polynomials

We will see an example in which using higher order polynomials does not influence results

In some cases, however, it may matter

Gelman and Imbens (2014) not so subtle title: “Why High-order Polynomials Should not be Used in Regression Discontinuity Designs”

“We argue that estimators for causal effects based on [higher order polynomials] can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials...”


Example: Part C Early Intervention services: fuzzy RDD

About 1.4% of infants in the US are born very low birth weight (usually a weight below 1,500 grams or about 3.3 pounds)

Low birth weight is associated with many long-term health problems and developmental difficulties

In clinical trials, early intervention (EI) services have been shown to be effective in improving outcomes

Part C of the Individuals with Disabilities Education Act (IDEA) authorizes states to provide a state-wide system of developmental services for infants and toddlers with developmental delays (I’ll call this EI)

The problem is that the clinical trial evidence uses interventions that are more intense than what is typically done in practice

The weak EI evidence in practice is in part due to the methodological challenges associated with conducting EI outcomes research using observational data



Assignment mechanism

We could just compare infants who get EI to those who don’t

But who gets EI? What is the assignment mechanism?

Eligibility varies by state but in general there is a weight threshold that has to be met. In Colorado, for example, it’s 1,200 grams

Mothers are usually referred for services, which are free, but sadly few of those referred use the services. Also, there could be a physical exam that demonstrates a developmental problem

(As I said, once you start pondering the assignment mechanism you can come up with some solutions)



It is fuzzy

The weight threshold does not completely determine who gets EI but it should be a strong predictor of obtaining EI services

If it strongly predicts treatment, we can use this variability to obtain causal effects (we are 100% into instrumental variables territory here)

(An aside on encouragement designs and IV or why you should make an angry face when people tell you that IVs don’t work. The method works; its usage could be of bad quality though)

We need an exclusion restriction. We need to argue that the threshold is not related to outcomes

This is problematic because we know that low birth weight is related to a host of outcomes so the exclusion restriction is hard to satisfy

We solve this by the same argument as in sharp RDD: we could restrict the estimation to a narrow window around the threshold



It is fuzzy

Note that restricting the estimation to the window is giving you: conditional independence

Or said another way, all observed and unobserved characteristics of infants should be the same close to the threshold

As in sharp RDD, we can in part test with data

This makes it easier to justify the exclusion restriction
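
The fuzzy design can be sketched as an instrumental variables regression within a window, with the threshold indicator as the instrument. The data below are simulated and purely hypothetical: the 1,200 gram threshold follows the Colorado example, but the variable names, the 300 gram window, the unobserved confounder u, and the true effect of 3 are assumptions, not NFP data or results.

```stata
* Fuzzy RDD as IV within a window (simulated data; all numbers assumed)
clear
set seed 1200
set obs 20000
gen w = runiform(700, 1700)             // birth weight in grams
gen w_c = w - 1200                      // centered at the threshold
gen below = w_c < 0                     // eligibility indicator (instrument)
gen u = rnormal()                       // unobserved confounder
* EI use jumps at the threshold but is not determined by it
gen EI = (-1.5 + 1.2*below + 0.5*u + rnormal()) > 0
gen y = 10 + 3*EI + 0.004*w_c + 2*u + rnormal()

reg y EI w_c if abs(w_c) < 300          // naive: confounded by u
reg EI below w_c if abs(w_c) < 300      // first stage: jump in P(EI)
ivregress 2sls y w_c (EI = below) if abs(w_c) < 300
```

The naive regression overstates the effect here because u raises both EI use and the outcome. The IV estimate relies only on the jump in EI use at the threshold, controlling for centered weight within the window.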



Some preliminary results

Nurse Family Partnership (NFP) data from 2002-2017

About 135,000 births in about 42 states

I have birth weight, EI referral and EI use information. Lots of covariates to check balance

Very messy data; still uncovering issues



Some preliminary results

Can only use states with a threshold eligibility (25 states)




If an infant is below the threshold, the probability of using EI is 19.79 percent. If above the threshold, the probability is 3.78 percent

So the threshold itself is a strong predictor of who receives EI

As an aside, not all those infants who are referred end up using EI. Of those referred, only 31.80 percent end up using EI

The NFP nurse made the recommendation for referral



Weight and probability of EI use

The probability of using EI is a function of weight (centered at threshold) (polynomial of second degree)



Nonparametric version

Using a smoother (lowess)




Next step, check balance around windows. Always a trade-off between bias and variance

Smaller window, less bias since closer to RDD validity assumption but larger variance because sample sizes are smaller

There is (was?) some literature about the “optimal” bandwidth but nothing much has come from it (as far as I know)

Then add outcomes: weight after 6, 12, 18, 24 months and developmental scores... Cleaning the data

Need to finish by July...


Estimation Nursing Homes

Five-star ratings

Review of design

Two possible effects: consumer response and provider response

The importance of conceptual frameworks: one mechanism is clearer than the other

Why we are struggling so much with the provider response



Parametrization

The model in the consumer response paper (dropping covariates, subscripts, and random effects to make things easier) is:

$\ln(adm) = \beta_0 + \beta_1 T + \beta_2 year$

$T$ is a dummy variable equal to one if a nursing home received an additional star at $t = 1$ and zero if a nursing home did not receive an additional star in that period. In the pre-period, $T$ is equal to zero for all nursing homes

In the pre-period: $\ln(adm_{pre}) = \beta_0$. In the post-period: $\ln(adm_{post}) = \beta_0 + \beta_1 T + \beta_2$

In the post-period treated: $\ln(adm_{post,t}) = \beta_0 + \beta_1 + \beta_2$. In the post-period control: $\ln(adm_{post,c}) = \beta_0 + \beta_2$

So $\ln(adm_{post,t}) - \ln(adm_{post,c}) = \beta_1$ and therefore $\ln\left(\frac{adm_{post,t}}{adm_{post,c}}\right) = \beta_1$ and $\frac{adm_{post,t}}{adm_{post,c}} = e^{\beta_1}$
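
To make the interpretation of the exponentiated coefficient concrete, here is a worked example with a hypothetical value of $\beta_1 = 0.05$ (an assumed number for illustration, not an estimate from the paper):

```latex
\frac{adm_{post,t}}{adm_{post,c}} = e^{\beta_1} = e^{0.05} \approx 1.051
```

Under that assumed value, nursing homes that received an additional star would have about 5.1 percent more admissions in the post-period than those that did not.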



Parametrization

Trick is that there is an implicit interaction introduced by making $T = 1$ only in the post-period

This is the same as a difference-in-difference design in which we assume that in the pre-period admissions in the treatment and control groups are the same

DiD estimator: $(Y_{treated}^{post} - Y_{treated}^{pre}) - (Y_{control}^{post} - Y_{control}^{pre})$ reduces to $(Y_{treated}^{post} - Y_{control}^{post})$ if $Y_{treated}^{pre} = Y_{control}^{pre}$

So why do it this way? Common way of analyzing two-period experiments using longitudinal data. It allowed us to control for pre-period covariates (so it uses more data)

Why Poisson? Why random effects?

A note on overdispersion in Poisson models



Golden Rule II

The closer the model is to the data generating process the more precise the estimates (so lower SEs)

If counts, use Poisson or Negative binomial models

If 0/1 outcomes, use logit or probit (why bother with LPMs??)

If survival data, then use survival models to account for censoring

As Will Manning would say, your standard errors are wrong otherwise

Don’t be like a lot of economists. There is a world outside the linear model (the linear model, not OLS)


Example

Real dataset

Data from Lee, Moretti, Butler (2004)

U.S. House elections (1946-1995)

Forcing variable is Democratic vote share. If share > 50 then the Democratic candidate is elected

Outcome is a liberal voting score from the Americans for Democratic Action (ADA)

Do candidates who are elected in close elections tend to moderate their congressional voting?

“We find that the degree of electoral strength has no effect on a legislator’s voting behavior”

Data and code are on Chalk


Graph a bit messy (about 13,500 obs)

scatter score demvoteshare, msize(tiny) xline(0.5) ///

xtitle("Democrat vote share") ytitle("ADA score")


Good idea to add some “jittering”

With the jitter option, it is easier to see where the mass is

scatter score demvoteshare, msize(tiny) xline(0.5) ///

xtitle("Democrat vote share") ytitle("ADA score") jitter(5)


Useful to “smooth” data with LOWESS

lowess score demvoteshare if democrat ==1, gen (lowess_y_d1) nograph bw(0.5)

lowess score demvoteshare if democrat ==0, gen (lowess_y_d0) nograph bw(0.5)

....

....


LOWESS

LOcally WEighted Scatterplot Smoothing

Non-parametric graphical method

Computationally intensive (one regression per data point)

For each data point, run a weighted linear regression (linear or polynomial in $X$) using all the observations within a window. Weights give more importance to observations close to the data point

The predicted value $\hat{y}_i$ is then the “smoothed” point $(x_i, \hat{y}_i)$


Parametric: Linear relationship

scatter score demvoteshare, msize(tiny) xline(0.5) xtitle("Democrat vote share") ///

ytitle("ADA score") || lfit score demvoteshare if democrat ==1, color(red) || ///

lfit score demvoteshare if democrat ==0, color(red) legend(off)


Quadratic

gen demvoteshare2 = demvoteshare^2

reg score demvoteshare demvoteshare2 democrat

predict scorehat0


Third degree polynomial

gen demvoteshare3 = demvoteshare^3

reg score demvoteshare demvoteshare2 demvoteshare3 democrat

predict scorehat01


Fourth degree polynomial

gen demvoteshare4 = demvoteshare^4

reg score demvoteshare demvoteshare2 demvoteshare3 demvoteshare4 ///

democrat

predict scorehat02


Mean (null model) to fifth degree polynomial

line scorehat04 demvoteshare if democrat ==1, sort color(gray) || ///

line scorehat04 demvoteshare if democrat ==0, sort color(gray) legend(off) ....


Parametric

Note that polynomials “smooth” the data (like LOWESS)

We used all the data even though we want treatment effect at c

But polynomials give weight to points away from $c$ and tend to provide smaller SEs

In other datasets, the choice of polynomial degree will matter (see Gelman and Imbens, 2014)

Why not only use data close to $c$? Bias and variance trade-off


Restrict to a window

Run a flexible regression like a polynomial with interactions (stratified) but don’t use observations away from the cutoff. Choose a bandwidth around X = 0.5. Lee et al (2004) used 0.4 to 0.6.

* scorehat0 was already created on the quadratic slide, so drop before re-predicting
capture drop scorehat1 scorehat0

* Quadratic fit on each side of the cutoff, restricted to the 0.4-0.6 window
reg score demvoteshare demvoteshare2 if democrat ==1 & ///
    (demvoteshare>.40 & demvoteshare<.60)
predict scorehat1 if e(sample)

reg score demvoteshare demvoteshare2 if democrat ==0 & ///
    (demvoteshare>.40 & demvoteshare<.60)
predict scorehat0 if e(sample)

scatter score demvoteshare, msize(tiny) xline(0.5) xtitle("Democrat vote share") ///
    ytitle("ADA score") || ///
    line scorehat1 demvoteshare if democrat ==1, sort color(red) || ///
    line scorehat0 demvoteshare if democrat ==0, sort color(red) legend(off)

graph export lee3_1.png, replace

72 / 90

Example

73 / 90

Example

Limit to window, 2nd degree polynomial

gen x_c = demvoteshare - 0.5

gen x2_c = x_c^2

reg score i.democrat##(c.x_c c.x2_c) if (demvoteshare>.40 & demvoteshare<.60)

      Source |       SS           df       MS      Number of obs   =      4632
-------------+----------------------------------   F(5, 4626)      =   1153.29
       Model |  2622762.02         5  524552.404   Prob > F        =    0.0000
    Residual |   2104043.2      4626  454.829918   R-squared       =    0.5549
-------------+----------------------------------   Adj R-squared   =    0.5544
       Total |  4726805.22      4631   1020.6878   Root MSE        =    21.327

---------------------------------------------------------------------------------
          score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
     1.democrat |    45.9283   1.892566    24.27   0.000     42.21797    49.63863
            x_c |   38.63988   60.77525     0.64   0.525     -80.5086    157.7884
           x2_c |   295.1723   594.3159     0.50   0.619    -869.9704    1460.315
                |
 democrat#c.x_c |
             1  |   6.507415   88.51418     0.07   0.941    -167.0226    180.0374
                |
democrat#c.x2_c |
             1  |  -744.0247   862.0435    -0.86   0.388    -2434.041    945.9916
                |
          _cons |   17.71198   1.310861    13.51   0.000     15.14207    20.28189
---------------------------------------------------------------------------------
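
Why the coefficient on 1.democrat is the treatment effect: a short sketch, where D stands for democrat and the labels β0 to β5 are my own names for the six coefficients in the table (they are not in the Stata output).

```latex
E[\text{score} \mid D, x_c] = \beta_0 + \beta_1 D + \beta_2 x_c + \beta_3 x_c^2
                              + \beta_4 D\,x_c + \beta_5 D\,x_c^2

% Because x_c = demvoteshare - 0.5, the cutoff is at x_c = 0,
% where every term containing x_c drops out:
E[\text{score} \mid D = 1, x_c = 0] - E[\text{score} \mid D = 0, x_c = 0]
  = (\beta_0 + \beta_1) - \beta_0 = \beta_1 \approx 45.93
```

Without centering, the coefficient on democrat would be the gap between the two curves at demvoteshare = 0, which is not the quantity of interest.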

74 / 90

Advice

So what should you do?

Best case: Whatever you do gives you similar results (like in this example)

Most common strategy is to restrict estimation to a window adjusting for covariates

It used to be popular to use higher order polynomials

Try different windows and present sensitivity analyses

Balance should determine the size of the window

Try non-parametric methods
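
A sensitivity analysis over windows can be as simple as a loop. This is a sketch on simulated data with a true jump of 5; the variable names and the list of half-widths are assumptions for illustration.

```stata
* Illustration only: simulated data with a true jump of 5 at x = 0.5
clear
set seed 2019
set obs 5000
gen x  = runiform()
gen t  = x >= 0.5
gen xc = x - 0.5
gen y  = 10 + 5*t + 20*xc + 60*xc^3 + rnormal(0, 2)

* Same linear-with-interaction model, estimated in wider and wider windows
foreach h in 0.05 0.10 0.20 0.40 {
    quietly regress y i.t##c.xc if abs(xc) < `h'
    display "window +/- `h': jump = " %6.3f _b[1.t] ///
        "   SE = " %6.3f _se[1.t] "   N = " e(N)
}
```

Narrow windows should give noisier estimates; wide windows lean more on the linear functional form.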

75 / 90

Nonparametric

Nonparametric methods

Paper by Hahn, Todd, and Van der Klaauw (2001) clarified assumptions about RDD and framed estimation as a nonparametric problem

Emphasized using local polynomial regression instead of something like LOWESS

“Nonparametric methods” means a lot of things in statistics

In the context of RDD, the idea is to estimate a model that does not assume a functional form for the relationship between Y and X. The model is something like Y_i = f(X_i) + ε_i

A very basic method: calculate E[Y] for each bin on X (think of a histogram)
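
The binned-means idea, done by hand on simulated data. The bin width of 0.05 and all variable names are assumptions for illustration; bins are defined relative to the cutoff so that no bin straddles it.

```stata
* Illustration only: simulated data with a jump at x = 0.5
clear
set seed 2019
set obs 2000
gen double x = runiform()
gen y = 10 + 5*(x >= 0.5) + 20*(x - 0.5) + rnormal(0, 2)

* Bins of width 0.05 counted from the cutoff: negative = left, 0 and up = right
gen bin = floor((x - 0.5) / 0.05)

* One row per bin: mean outcome, mean x, and number of observations
collapse (mean) ybar = y xmid = x (count) n = y, by(bin)
list bin xmid ybar n, clean noobs
```

Plotting ybar against xmid gives the usual RDD picture of bin means on each side of the cutoff.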

76 / 90

Nonparametric

Stata has a command to do just that: cmogram

After installing the command (ssc install cmogram) type help cmogram. Lots of useful options

Common way to show RDD data. See for example Figure II of Almond et al. (2010). To recreate something like Figure 1 of Lee et al (2004):

cmogram score demvoteshare, cut(.5) scatter line(.5) qfit

77 / 90

Nonparametric

Compare to linear and LOWESS fits

cmogram score demvoteshare, cut(.5) scatter line(.5) lfit
cmogram score demvoteshare, cut(.5) scatter line(.5) lowess

78 / 90

Nonparametric Local polynomials

Local polynomial regression

Hahn, Todd, and Van der Klaauw (2001) showed that one-sided Kernel estimation (like LOWESS) may have poor properties because the point of interest is at a boundary

Proposed to use instead a local linear nonparametric regression

Stata’s lpoly command estimates kernel-weighted local polynomial regression

Think of it as a weighted regression restricted to a window (hence “local”). The Kernel provides the weights

A rectangular Kernel would give the same result as taking E[Y] at a given bin on X. The triangular Kernel gives more importance to observations close to the center

Method sensitive to choice of bandwidth (window)
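
A local linear fit at the cutoff can be written as two weighted regressions. This is a sketch on simulated data with a true jump of 5; the bandwidth of 0.1 and the variable names are assumptions for illustration. One thing to keep in mind: lpoly's degree() option defaults to 0 (a local mean), so degree(1) has to be requested to get a local linear fit.

```stata
* Illustration only: simulated data with a true jump of 5 at x = 0.5
clear
set seed 2019
set obs 5000
gen x  = runiform()
gen t  = x >= 0.5
gen xc = x - 0.5
gen y  = 10 + 5*t + 20*xc + 60*xc^3 + rnormal(0, 2)

* Triangular kernel: weight falls linearly from 1 at the cutoff to 0 at h
local h = 0.1
gen w = 1 - abs(xc)/`h' if abs(xc) < `h'

* Weighted line on each side; with xc centered, the intercept is the fit at the cutoff
quietly regress y xc [aweight = w] if t == 0
scalar left = _b[_cons]
quietly regress y xc [aweight = w] if t == 1
scalar right = _b[_cons]
display "jump at cutoff = " right - left
```

Changing the local h and rerunning shows the sensitivity to bandwidth directly.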

79 / 90


Local regression is a smoothing method

Kernel-weighted local polynomial regression is a smoothing method

lpoly score demvoteshare if democrat == 0, nograph kernel(triangle) gen(x0 sdem0) bwidth(0.1)

lpoly score demvoteshare if democrat == 1, nograph kernel(triangle) gen(x1 sdem1) bwidth(0.1)

<omitted>

80 / 90


Treatment effect

We’re interested in getting the treatment effect at X = 0.5

* sdem0 and sdem1 were created on the previous slide, so drop before regenerating
capture drop sdem0 sdem1

* Evaluate each smooth only at the cutoff (stored in observation 1)
gen forat = 0.5 in 1

lpoly score demvoteshare if democrat == 0, nograph kernel(triangle) gen(sdem0) ///
    at(forat) bwidth(0.1)

lpoly score demvoteshare if democrat == 1, nograph kernel(triangle) gen(sdem1) ///
    at(forat) bwidth(0.1)

gen dif = sdem1 - sdem0

list sdem1 sdem0 dif in 1/1

     +----------------------------------+
     |     sdem1       sdem0        dif |
     |----------------------------------|
  1. | 64.395204   16.908821   47.48639 |
     +----------------------------------+

81 / 90


Different windows

What happens when we change the bandwidth?

82 / 90


Nonparametric

With non-parametric methods in RDD came several methods to choose “optimal windows”

In practical applications, you may want to check balance around that window

Standard error of treatment effect can be bootstrapped

Could add other variables to nonparametric methods but more complicated

See Stata do file for examples using command rdrobust
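
Bootstrapping the standard error of the jump, sketched on simulated data. To keep the example short I use a local linear regression with a rectangular kernel (a plain window of ±0.1, held fixed across replications); the data, names, window, and number of replications are all assumptions for illustration.

```stata
* Illustration only: simulated data with a true jump of 5 at x = 0.5
clear
set seed 2019
set obs 2000
gen x  = runiform()
gen t  = x >= 0.5
gen xc = x - 0.5
gen y  = 10 + 5*t + 20*xc + rnormal(0, 2)

* Resample observations, re-estimate the jump each time, report the bootstrap SE
bootstrap jump = _b[1.t], reps(200) seed(1): ///
    regress y i.t##c.xc if abs(xc) < 0.1
```

Holding the window fixed ignores the uncertainty from choosing it, which is one reason to also look at rdrobust.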

83 / 90


Using rdrobust

. rdrobust score demvoteshare, c(0.5) all bwselect(IK)

Sharp RD Estimates using Local Polynomial Regression

        Cutoff c = .5 | Left of c  Right of c        Number of obs = 13577
----------------------+----------------------        Rho (h/b)     = 0.770
        Number of obs |      3535        3318        NN Matches    = 3
 Order Loc. Poly. (p) |         1           1        BW Type       = IK
       Order Bias (q) |         2           2        Kernel Type   = Triangular
    BW Loc. Poly. (h) |     0.152       0.152
          BW Bias (b) |     0.197       0.197

--------------------------------------------------------------------------------------
                      |            Loc. Poly.   Robust              [Robust
                score |    Coef.   Std. Err.       z     P>|z|    95% Conf. Interval]
----------------------+---------------------------------------------------------------
         demvoteshare |   47.171      1.262    36.9043   0.000     44.1     49.047108
--------------------------------------------------------------------------------------

All Estimates. Outcome: score. Running Variable: demvoteshare.
--------------------------------------------------------------------------------------
               Method |    Coef.   Std. Err.       z     P>|z|    [95% Conf. Interval]
----------------------+---------------------------------------------------------------
         Conventional |   47.171     .98131    48.0692   0.000   45.247     49.093991
       Bias-Corrected |   46.574     .98131    47.4608   0.000    44.65     48.496943
               Robust |   46.574      1.262    36.9043   0.000     44.1     49.047108
--------------------------------------------------------------------------------------

84 / 90

Conclusion

Parametric or non-parametric?

When would parametric or non-parametric or window size matter?

Small effect

Relationship between Y and X different away from cutoff

Functional form not well captured by polynomials (or other functional form)

Parametric: can add random effects, clustering SEs,...

But more important: What if the outcome cannot be assumed to be normally distributed?

The curse and blessing of so many good RDD guides...

With counts, for example, need to use Poisson or Negative Binomial models

If conclusions are different, do worry

85 / 90

Design vs analysis

The other thing I don’t like about how RDD is presented

Design and analysis/estimation are separate issues

The design tells about the internal and external validity

But the specifics of the data tell you which statistical assumptions make more sense

If outcome is 1/0, why would you estimate a linear model?

If outcome is a count, why wouldn’t you estimate a negative binomial model or Poisson model?

More importantly, why would non-parametric models be better in those cases?
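
The same design with a count outcome, as a sketch on simulated data. The data-generating process, the ±0.2 window, and the variable names are assumptions for illustration. By construction the true jump in the expected count at the cutoff is exp(1.4) − exp(1), about 1.34.

```stata
* Illustration only: simulated count outcome with a jump at x = 0.5
clear
set seed 2019
set obs 5000
gen x  = runiform()
gen t  = x >= 0.5
gen xc = x - 0.5
gen y  = rpoisson(exp(1 + 0.4*t + 1.5*xc))

* Poisson model within a window, slopes allowed to differ by side
poisson y i.t##c.xc if abs(xc) < 0.2, vce(robust)

* Jump in the expected count at the cutoff (xc = 0), on the count scale
margins, dydx(t) at(xc = 0)
```

For a 1/0 outcome the same two lines work with logit in place of poisson; margins then reports the jump in the predicted probability.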

86 / 90

Almond et al (2010)

Marginal returns to medical care

Big picture: is spending more money on health care worth it (in terms of health gained)?

Actual research: is spending more money on low-weight newborns worth it in terms of mortality reductions? Compare marginal costs (dollars) to marginal benefits (mortality transformed into dollars).

On jargon: In economics marginal = additional. So compare additional spending to additional benefit

In IV language, the “marginal” patient is the “complier”

RDD part used to estimate marginal benefits. Data from U.S. Census birth 1983 to 2002

Forcing variable is newborn weight. Cutoff point c = 1,500 grams (about 3.3 lbs)

87 / 90

Almond et al (2010)

Data

Did they use a fuzzy or sharp RDD?

Related question: What is the “treatment”?

What models did they use? And what was the outcome?

88 / 90

Almond et al (2010) Estimation

Estimating equation

Their model is:

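$$Y_i = \alpha_0 + \alpha_1 VLBW_i + \alpha_2 VLBW_i \times (g_i - 1500) + \alpha_3 (1 - VLBW_i)(g_i - 1500) + \alpha_t + \alpha_s + \delta X'_i + \varepsilon_i \tag{1}$$

Change notation so $VLBW = T$ and $(g_i - 1500) = X$ and after doing some algebra the model is:

$$Y = \alpha_0 + \alpha_1 T + \alpha_3 X + (\alpha_2 - \alpha_3)\, T \times X + (\alpha_t + \alpha_s + \delta X') + \varepsilon$$

$(\alpha_t + \alpha_s + \delta X')$ are covariates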
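
The "some algebra" step, written out. Only the two slope terms change; T is the VLBW indicator and X is birth weight minus 1500, as defined above.

```latex
\alpha_2\, T X + \alpha_3 (1 - T) X
  = \alpha_2\, T X + \alpha_3 X - \alpha_3\, T X
  = \alpha_3 X + (\alpha_2 - \alpha_3)\, T X

% Slope in X when T = 0:  \alpha_3
% Slope in X when T = 1:  \alpha_3 + (\alpha_2 - \alpha_3) = \alpha_2
```

So their specification is the usual RDD model with an interaction: α1 is the jump at the cutoff, and the slope is allowed to differ on each side.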

89 / 90

Almond et al (2010) Estimation...

Covariates

They compared means of covariates above and below the cutoff point

They found some differences (large sample) so they include covariates in the model

They did a RDD-type analysis on covariates to see if they were “smooth” (no jump at VLBW cutoff)
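
A covariate smoothness check is just the RDD regression with the covariate on the left-hand side. This is a sketch on simulated data, not their specification; the covariate z, the ±0.2 window, and the other names are assumptions for illustration. By construction z trends with x but has no jump at the cutoff.

```stata
* Illustration only: simulated covariate with no discontinuity at x = 0.5
clear
set seed 2019
set obs 5000
gen x  = runiform()
gen t  = x >= 0.5
gen xc = x - 0.5
gen z  = 3 + 2*xc + rnormal(0, 1)

* The coefficient on 1.t estimates the jump in the covariate at the cutoff
regress z i.t##c.xc if abs(xc) < 0.2
```

A coefficient on 1.t close to zero is what "smooth" looks like; with very large samples, judge the size of the jump and not only its p-value.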

90 / 90