Regression Discontinuity Designs
PUBL0050 – Week 9
Jack Blumenau
Department of Political Science
UCL
Course outline
1. Potential Outcomes and Causal Inference
2. Randomized Experiments
3. Selection on Observables (I)
4. Selection on Observables (II)
5. Difference-in-Differences and Panel Data
6. Synthetic Control Method
7. Instrumental Variables (I)
8. Instrumental Variables (II)
9. Regression Discontinuity Designs
10. Overview and Review
Lecture outline
Motivation
Sharp RDD
RDD estimation
Validating RDD
Fuzzy RDD
Conclusions
Motivation
Running example
What happens when extremists win primaries?
What are the consequences of nominating an extremist candidate in a primary election for downstream electoral outcomes? Andy Hall (2015) studies a sample of primary elections for the US House between 1980 and 2010 where the primary was contested between an extremist candidate and a moderate candidate. Extremism is determined by receiving donations from extreme interest groups. He uses the outcomes of these races to compare the electoral outcomes of moderates and extremists in subsequent general elections.
- Outcome ($Y_{i,p,t}$): Party vote share in district $i$ in the general election at time $t$
- Treatment ($D_{i,p,t}$): 1 if the party’s primary winner in district $i$ is an “extremist”
- Running variable ($X_{i,p,t}$): Candidate vote share in the primary election in district $i$
Extremist candidates
Why can’t we interpret this causally?
# Naive comparison: mean general-election vote share by type of primary winner
vote_share_extreme <- mean(hall$vote_share_general[hall$extreme == 1])
vote_share_moderate <- mean(hall$vote_share_general[hall$extreme == 0])
vote_share_extreme - vote_share_moderate
## [1] -0.02736295
Selection bias: Extremists may differ in many ways from moderates
- Candidate differences
  - Less experienced
  - Less well financed
  - Less supported by local party
- District differences
  - May be selected in districts where the party performs poorly historically
Extremist candidates
What approach might we use to estimate a causal effect?
- Condition on observed differences between extremists and moderates
- Find an instrument that increases the probability of an extremist winning the primary, but that has no direct effect on the general election
- Use variation over time in party vote shares in a difference-in-differences analysis
An alternative approach (RDD):
- Compare the vote share of parties in districts where extremists narrowly won their primary races to the vote share of parties in districts where extremists narrowly lost
- → Assume that winning the election is as good as random in close races
RDD outline
- Each unit has a score on a “running variable” which determines treatment
- Treatment is:
  - assigned to units when their score on the running variable exceeds a known cutoff
  - not assigned to units whose value of the score is below the cutoff
- Key feature: the probability of receiving the treatment changes abruptly at the cutoff
- The discontinuous change in this probability can be used to learn about the local causal effect of the treatment on an outcome of interest
Intuition
Units with scores barely below the cutoff can be used as counterfactuals for units with scores barely above it. I.e. districts where the extremist narrowly wins their election are comparable to districts in which the extremist narrowly loses.
RDD outline
RDD is an appropriate strategy when we know that treatment and control conditions are not randomly assigned, but we know the assignment rule that determines how units are assigned or selected into the treatment.
The design relies on us having access to a forcing variable that determines the treatment status.
RDD is widely used in rule-based settings, where it is clear how and when $D_i = 1$ is assigned:
- Elections
- Administrative programmes
- Geographic boundaries
Sharp RDD
Sharp Regression Discontinuity Designs
Imagine that our binary treatment variable, $D_i$, is completely determined by the value of an explanatory variable, $X_i$, according to:
$$D_i = \mathbb{1}(X_i \geq c), \quad \text{so} \quad D_i = \begin{cases} 1 & \text{if } X_i \geq c \\ 0 & \text{if } X_i < c \end{cases}$$
where
- $X_i$ is known as the “forcing” or “running” variable, and may be correlated with the outcomes ($Y_i$) and potential outcomes ($Y_{1i}$, $Y_{0i}$)
- $c$ is a fixed cutoff point
Implications:
- $D_i$ is a deterministic function of $X_i$ → when we know $X_i$, we know $D_i$
- $D_i$ is a discontinuous function of $X_i$ → no matter how close to $c$ we are, $D_i = 0$ until $X_i \geq c$
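As a minimal sketch of the sharp assignment rule above, the following R code simulates a running variable and derives treatment status from it. The cutoff of 2000 and the simulated scores are illustrative assumptions, not data used in the lecture.

```r
# Illustrative sketch: sharp assignment rule D_i = 1(X_i >= c).
# The data are simulated; nothing here comes from the lecture's datasets.
set.seed(1)
cutoff <- 2000                                  # assumed cutoff c
x <- round(runif(500, min = 1600, max = 2400))  # simulated running variable X_i
d <- as.integer(x >= cutoff)                    # treatment is a deterministic function of X_i

# No common support: treated units all have X_i >= c, control units all have X_i < c
table(treated = d, above_cutoff = x >= cutoff)
```

The cross-tabulation has empty off-diagonal cells, which is the "complete absence of common support" that makes sharp RDD different from selection-on-observables designs.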
Examples of $X_i$, $D_i$, and $c$
- Eggers (2015)
  - $Y_i$ – turnout (aggregate)
  - $D_i$ – proportional representation in a French town
  - $X_i$ – population of the town
  - $c$ – 3500
- de Kadt (2017)
  - $Y_i$ – turnout (individual)
  - $D_i$ – voting in South Africa in 1994
  - $X_i$ – age in 1994
  - $c$ – 18
- Hall (2015)
  - $Y_i$ – party vote share in general election
  - $D_i$ – primary election won by an extremist
  - $X_i$ – margin of victory in the primary election
  - $c$ – 0
Graphical illustration
Do scholarships increase earnings?
Thistlethwaite and Campbell (1960) study the effects of college scholarships on employment outcomes for students later in life. They study the allocation of “merit awards”, which were given out to students based on a score: anyone with a score at or above some cutoff received the merit award, whereas everyone below that cutoff did not.
- Outcome ($Y_i$): Adult earnings ($)
- Treatment ($D_i$): Receipt of a merit award
- Running variable ($X_i$): Score on a standardized test
- Cutoff ($c$): Scores of 2000 or more on $X_i$ result in a merit award.
Graphical illustration ($X_i$ and $D_i$)
[Figure: treatment status $D$ (0 to 1) plotted against the running variable $X$ (1600 to 2400). Units below $X_i = c$ are assigned to control; units at or above it are assigned to treatment.]
Graphical illustration ($X_i$ and $Y_i$)
[Figure: outcome $Y$ (20000 to 35000) plotted against the running variable $X$ (1600 to 2400), with the cutoff marked at $X_i = c$.]
Graphical illustration ($X_i$, $Y_{1i}$ and $Y_{0i}$)
[Figure: outcome $Y$ (20000 to 35000) plotted against the running variable $X$ (1600 to 2400). Labelled regions: observed outcomes (control), unobserved outcomes (treatment), observed outcomes (treatment), unobserved outcomes (control).]
Graphical illustration ($\tau_{ATE}$ at $c$)
[Figure: outcome $Y$ (20000 to 35000) plotted against the running variable $X$ (1600 to 2400), with the gap in outcomes at $X_i = c$ labelled LATE.]
Sharp RDD: Identification
We want to estimate the difference in expected outcomes between $D_i = 1$ and $D_i = 0$ at the threshold $c$.
Can we estimate this?
$$\begin{aligned}
\tau_{LATE} &= E[Y_{1i} | X_i = c] - E[Y_{0i} | X_i = c] \\
&= E[Y_i | X_i = c, D_i = 1] - E[Y_i | X_i = c, D_i = 0]
\end{aligned}$$
No! We never observe both $D_i = 1$ and $D_i = 0$ at $c$.
We have a complete absence of common support: no treatment unit will have the same value of $X_i$ as a control unit, because $D_i$ is a discontinuous function of $X_i$ (where the discontinuity is defined at $c$).
Sharp RDD: Identification
Identification assumptions
$E[Y_{1i} | X_i, D_i]$ and $E[Y_{0i} | X_i, D_i]$ are continuous in $X_i$ around the threshold $X_i = c$
Identification result
The treatment effect at the threshold $c$ is identified by:
$$\begin{aligned}
\tau_{LATE} &= E[Y_{1i} - Y_{0i} | X_i = c] \\
&= E[Y_{1i} | X_i = c] - E[Y_{0i} | X_i = c] \\
&= \lim_{x \downarrow c} E[Y_i | X_i = x] - \lim_{x \uparrow c} E[Y_i | X_i = x]
\end{aligned}$$
Implications:
- We extrapolate a small amount to infer potential outcomes at $c$
- Without further assumptions, the LATE identifies the ATE only at $c$
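To illustrate why the estimand is local, here is a small simulation in which both potential outcomes are visible (which is never the case in real data). The functional forms and numbers are assumptions chosen purely for illustration.

```r
# Illustrative sketch: the RD estimand is the average effect at X_i = c only.
# Potential outcomes are simulated, so (unlike in real data) both are visible.
set.seed(2)
cutoff  <- 2000
x       <- runif(10000, min = 1600, max = 2400)   # simulated running variable
x_tilde <- x - cutoff

y0 <- 25000 + 5 * x_tilde              # E[Y_0 | X], continuous at the cutoff
y1 <- y0 + 3000 + 10 * x_tilde         # E[Y_1 | X]; the effect grows with X

effect_at_cutoff <- 3000                           # Y_1 - Y_0 at x_tilde = 0
ate_above        <- mean((y1 - y0)[x >= cutoff])   # average effect above the cutoff
ate_below        <- mean((y1 - y0)[x <  cutoff])   # average effect below the cutoff

c(at_cutoff = effect_at_cutoff, above_cutoff = ate_above, below_cutoff = ate_below)
```

Under this assumed heterogeneity, the effect at the cutoff differs markedly from the average effect among units further from it, so the RD estimate should not be read as an effect for the whole sample.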
Local nature of the RD effect
[Figure: three panels plotting $E[Y_1 | X]$ and $E[Y_0 | X]$ against the running variable $X$ (1600 to 2400), titled “No heterogeneity”, “Moderate heterogeneity” and “Severe heterogeneity”.]
RDD estimation
Estimation
1. Recode the running variable to deviations from $c$: $\tilde{X}_i = X_i - c$
  - $\tilde{X}_i = 0$ if $X_i = c$
  - $\tilde{X}_i > 0$ if $X_i > c$ and so $D_i = 1$
  - $\tilde{X}_i < 0$ if $X_i < c$ and so $D_i = 0$
2. Decide on a regression model for $E[Y_i | X_i, D_i]$
  - Linear, same slope for $E[Y_{0i} | X_i]$ and $E[Y_{1i} | X_i]$
  - Linear, different slopes
  - Polynomial
  - Local linear
3. Produce an RD plot, visualising the discontinuity
4. Inference via regression standard errors
Estimation
Consider the following model, where $\tilde{X}_i = X_i - c$:
$$E[Y_i | D_i, X_i] = \alpha + \beta \tilde{X}_i + \tau D_i$$
Why does $\tau$ identify the LATE in this model (i.e. the difference between $E[Y_i | X_i = c, D_i = 1]$ and $E[Y_i | X_i = c, D_i = 0]$)?
If $X_i = c$ then $\tilde{X}_i = 0$:
$$E[Y_i | \tilde{X}_i = 0, D_i = 1] = \alpha + \beta \cdot 0 + \tau \cdot 1 = \alpha + \tau$$
and
$$E[Y_i | \tilde{X}_i = 0, D_i = 0] = \alpha + \beta \cdot 0 + \tau \cdot 0 = \alpha$$
and so:
$$E[Y_i | X_i = c, D_i = 1] - E[Y_i | X_i = c, D_i = 0] = (\alpha + \tau) - \alpha = \tau$$
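The derivation above can be checked numerically. The sketch below simulates data from the same-slope model with an assumed true jump of 3000, then confirms that the coefficient on the treatment dummy equals the difference in fitted values at the cutoff. All variable names and parameter values are hypothetical.

```r
# Illustrative sketch: with the running variable centred at the cutoff,
# the coefficient on the treatment dummy is the jump at X_i = c.
set.seed(3)
n       <- 5000
cutoff  <- 2000
x       <- runif(n, min = 1600, max = 2400)
x_tilde <- x - cutoff                       # centred running variable
d       <- as.integer(x_tilde >= 0)
tau     <- 3000                             # assumed true discontinuity
y       <- 25000 + 5 * x_tilde + tau * d + rnorm(n, sd = 1000)

fit     <- lm(y ~ x_tilde + d)
tau_hat <- unname(coef(fit)["d"])

# Fitted values at the cutoff for a treated and a control unit
at_cutoff <- predict(fit, newdata = data.frame(x_tilde = c(0, 0), d = c(1, 0)))
gap       <- unname(at_cutoff[1] - at_cutoff[2])

c(tau_hat = tau_hat, gap_at_cutoff = gap)
```

Without centring, the intercept and the fitted values at $X_i = 0$ would refer to a point far outside the data, which is why step 1 of the estimation recipe recodes the running variable first.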
Estimation in R (I)
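# Linear fit with a common slope on both sides of the cutoff
same_slope_model <- lm(vote_share_general ~ extreme + running_variable,
                       data = hall_subset)
# Linear fit with separate slopes on each side of the cutoff
different_slope_model <- lm(vote_share_general ~ extreme * running_variable,
                            data = hall_subset)
# Cubic polynomial with separate coefficients on each side of the cutoff
polynomial_model <- lm(vote_share_general ~ extreme * running_variable +
                         extreme * I(running_variable^2) +
                         extreme * I(running_variable^3),
                       data = hall_subset)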
Linear model, same slopes
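[Figure: general election vote share (0.4 to 0.8) against extremist primary election margin (−0.2 to 0.2), with a linear fit that shares one slope on both sides of the cutoff.]
$$E[Y_i | \tilde{X}_i, D_i] = \alpha + \beta \tilde{X}_i + \tau D_i$$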
Linear model, different slopes
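[Figure: general election vote share (0.4 to 0.8) against extremist primary election margin (−0.2 to 0.2), with linear fits that have different slopes on each side of the cutoff.]
$$E[Y_i | \tilde{X}_i, D_i] = \alpha + \beta_{01} \tilde{X}_i + \beta_1 (\tilde{X}_i D_i) + \tau D_i$$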
Non-linear model
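[Figure: general election vote share (0.4 to 0.8) against extremist primary election margin (−0.2 to 0.2), with cubic polynomial fits on each side of the cutoff.]
$$E[Y_i | \tilde{X}_i, D_i] = \alpha + \beta_{01} \tilde{X}_i + \beta_{02} \tilde{X}_i^2 + \beta_{03} \tilde{X}_i^3 + \beta_1 (\tilde{X}_i D_i) + \beta_2 (\tilde{X}_i^2 D_i) + \beta_3 (\tilde{X}_i^3 D_i) + \tau D_i$$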
Comparing models
                Same slope   Different slope   Polynomial
                   (1)             (2)             (3)
extreme          −0.098          −0.095          −0.116
                 (0.034)         (0.034)         (0.074)
Constant          0.643           0.606           0.624
                 (0.019)         (0.024)         (0.053)
Observations       233             233             233
R2                0.035           0.060           0.102
Note: Standard errors in parentheses
Implication: When an extremist wins a “coin-flip” election over a moderate, the party’s general-election vote share decreases on average by approximately 9-12 percentage points.
Non-linearity mistaken for discontinuity
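It is often the case that the choice of functional form for $\tilde{X}_i$ is consequential for the inference about $\hat{\tau}_{LATE}$:
[Figure: outcome (−0.5 to 1.5) plotted against the running variable (0 to 1), illustrating a non-linear relationship that can be mistaken for a discontinuity.]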
Bandwidth selection
One way to reduce this type of model dependence is to focus only on observations that are close to the cutoff.
In practice, only keep observations with:
$$c - h \leq X_i \leq c + h$$
where $h$ is a positive value determining the window or bandwidth size.
The bandwidth, $h$, controls the width of the neighbourhood around the cutoff that is used to calculate the discontinuity.
$h$ directly affects the properties of the estimation process, and empirical findings can be sensitive to the particular value that one chooses for $h$.
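A minimal sketch of the windowing step, using simulated data in which the true relationship is a smooth cubic with no discontinuity at all. The data-generating process, the helper name `rd_linear` and the bandwidth values are assumptions made for illustration.

```r
# Illustrative sketch: keep only observations with c - h <= X_i <= c + h.
# The outcome is a smooth cubic in x, so the TRUE jump at the cutoff is zero
# and any estimated jump from a linear fit reflects misspecification.
set.seed(4)
n   <- 5000
dat <- data.frame(x = runif(n, min = -30, max = 30))   # cutoff at 0
dat$d <- as.integer(dat$x >= 0)
dat$y <- 0.0005 * dat$x^3 + rnorm(n, sd = 1)

# Hypothetical helper: linear RD (different slopes) within bandwidth h
rd_linear <- function(h, cutoff = 0) {
  window <- dat[dat$x >= cutoff - h & dat$x <= cutoff + h, ]
  fit    <- lm(y ~ d * x, data = window)
  unname(coef(fit)["d"])
}

estimates <- c(h_30 = rd_linear(30), h_5 = rd_linear(5))
estimates
```

With the wide window the linear specification should report a sizeable spurious jump, while the narrow window should give an estimate close to zero, at the cost of using far fewer observations.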
27 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10Bandwidth = 30
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10Bandwidth = 20
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10Bandwidth = 10
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
−30 −20 −10 0 10 20 30
−5
05
10Bandwidth = 5
Running variable
Out
com
e
LinearQuadratic
28 / 53
Bandwidth selection
Implications
Comparing average outcomes in a small neighbourhood to the right and left of the cutoff:
1. Makes estimates of the LATE less dependent on the functional form specification for $\tilde{X}_i$
2. Decreases the bias that comes from misspecification
3. Leads to a smaller sample size, thus increasing the variance
In picking $h$ we face a bias-variance trade-off:
- Smaller values of $h$ → less bias in $\hat{\tau}_{LATE}$
- Smaller values of $h$ → greater variance in $\hat{\tau}_{LATE}$ (i.e. $SE(\hat{\tau}_{LATE})$ ↑)
How do we pick h?
The choice of $h$ is important for the estimates of $\hat{\tau}_{LATE}$.
Two approaches to choosing $h$:
1. “Optimal” bandwidth selection
  - Use algorithmic bandwidth selection methods
  - Most common → Imbens-Kalyanaraman procedure
    - Choose $h$ to balance the bias-variance trade-off
    - $h$ is chosen to minimise the expected mean-squared error of the RD estimator
2. Reporting results from multiple bandwidths
  - In practice, it is common to show how much (if at all) the estimate of $\hat{\tau}_{LATE}$ changes as we vary the bandwidth
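The second approach can be sketched as a loop over candidate bandwidths. The example below uses simulated data with an assumed true jump of 2 and a hand-picked grid of bandwidths; it is a base-R illustration rather than the procedure used to produce the lecture's sensitivity plot.

```r
# Illustrative sketch: re-estimate the discontinuity over a grid of bandwidths.
# Simulated data with an assumed true jump of 2 at the cutoff (x = 0).
set.seed(5)
n   <- 4000
dat <- data.frame(x = runif(n, min = -30, max = 30))
dat$d <- as.integer(dat$x >= 0)
dat$y <- 1 + 0.05 * dat$x + 2 * dat$d + rnorm(n, sd = 1)

bandwidths  <- c(5, 10, 20, 30)
sensitivity <- t(sapply(bandwidths, function(h) {
  fit <- lm(y ~ d * x, data = dat[abs(dat$x) <= h, ])
  est <- summary(fit)$coefficients["d", c("Estimate", "Std. Error")]
  c(bandwidth = h, estimate = unname(est[1]), std_error = unname(est[2]))
}))
sensitivity
```

Each row gives the estimate and its standard error for one bandwidth; the standard error should shrink as the window widens, which is the variance side of the trade-off.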
Extremist candidates – optimal bandwidth
library(rdd)
optimal_bandwidth <- IKbandwidth(X = hall$running_variable,
                                 Y = hall$vote_share_general,
                                 cutpoint = 0)
optimal_bandwidth
## [1] 0.0851
Extremist candidates – optimal bandwidth
rd_est <- RDestimate(vote_share_general ~ running_variable,
                     cutpoint = 0,
                     bw = optimal_bandwidth,
                     data = hall)
rd_est
##
## Call:
## RDestimate(formula = vote_share_general ~ running_variable, data = hall,
##     cutpoint = 0, bw = optimal_bandwidth)
##
## Coefficients:
##      LATE    Half-BW  Double-BW
##  -0.07504   -0.06580   -0.06792
Extremist candidates – bandwidth sensitivity
[Figure: estimated treatment effect (−0.4 to 0.1) plotted against bandwidth (0.05 to 0.25).]
Break
Validating RDD
Falsification checks
1. Balance checks: Are covariates discontinuous at the threshold?
2. Placebo thresholds: Do we estimate significant treatment effects at “placebo” thresholds, $c^*$?
3. Sorting: Are units able to “sort” around the threshold?
Balance checks
If treatment is “as good as random” around the threshold, then in expectation treated and control units around the threshold should be the same with respect to both observed and unobserved covariates.
We cannot check balance of unobserved covariates, but we can assess balance on observed covariates ($Z_i$, not an instrument!):
- Visual inspection
  - Plot $E[Z_i | \tilde{X}_i, D_i]$: there should be no discontinuities around $c$
  - The relationship between $\tilde{X}_i$ and $Z_i$ should be smooth around $c$
- RD model for covariates
  - Estimate $E[Z_i | \tilde{X}_i, D_i] = \alpha + \beta_{01} \tilde{X}_i + \beta_1 (\tilde{X}_i D_i) + \tau_Z D_i$
  - This should yield $\tau_Z = 0$ if $Z_i$ is balanced at the threshold
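The RD model for covariates can be sketched as follows. The covariate `z` is hypothetical and simulated to vary smoothly with the running variable, so the estimated jump should be close to zero.

```r
# Illustrative sketch: a balance check fits the RD model to a pre-treatment
# covariate Z_i instead of the outcome. Data are simulated so that Z_i varies
# smoothly with the running variable and has no jump at the cutoff.
set.seed(6)
n   <- 3000
dat <- data.frame(x_tilde = runif(n, min = -0.2, max = 0.2))
dat$d <- as.integer(dat$x_tilde >= 0)
dat$z <- 0.5 + 0.8 * dat$x_tilde + rnorm(n, sd = 0.1)   # hypothetical covariate

balance_fit <- lm(z ~ d * x_tilde, data = dat)
tau_z <- summary(balance_fit)$coefficients["d", ]
round(tau_z, 3)   # estimate, standard error, t value and p-value for tau_Z
```

In applied work the same model would be run once per observed covariate, and the estimates are often displayed together in a single coefficient plot.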
Balance checks for extremist candidates
[Figure: estimated RD treatment effects (x-axis from −0.4 to 0.4) for three covariates: probability female, probability experienced, and share of donations.]
We can also check whether the discontinuity only appears where it “should” appear, and that it is zero at other candidate values of the cutoff.
If we have a placebo value $c^* \neq c$, then define $\tilde{X}^*_i = X_i - c^*$ and estimate:
$$E[Y_i | \tilde{X}^*_i, D_i] = \alpha + \beta_{01} \tilde{X}^*_i + \beta_1 (\tilde{X}^*_i D_i) + \tau^* D_i$$
or more flexible specifications thereof. Here $D_i$ is recoded to indicate $X_i \geq c^*$.
Implication: If our RDD is valid, we should find no significant treatment effects, $\tau^*$, for any $c^* \neq c$.
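A placebo-threshold check can be sketched by recentring the running variable at each candidate cutoff and re-estimating the jump. The data are simulated with an assumed true jump of 0.5 at zero; the helper `jump_at` and the window width are hypothetical choices, with windows picked so that placebo windows do not span the true cutoff.

```r
# Illustrative sketch: estimate the "jump" at placebo cutoffs c* as well as at
# the true cutoff. Simulated data with an assumed true jump of 0.5 at x = 0.
set.seed(7)
n <- 6000
x <- runif(n, min = -1, max = 1)
y <- 0.3 * x + 0.5 * (x >= 0) + rnorm(n, sd = 0.2)

# Hypothetical helper: linear RD (different slopes) around a candidate cutoff
jump_at <- function(c_star, h = 0.25) {
  x_star <- x - c_star                      # recentre at the candidate cutoff
  d_star <- as.integer(x_star >= 0)         # recoded "treatment" indicator
  window <- data.frame(y, d_star, x_star)[abs(x_star) <= h, ]
  fit    <- lm(y ~ d_star * x_star, data = window)
  unname(coef(fit)["d_star"])
}

cutoffs   <- c(-0.5, 0, 0.5)                # 0 is the true cutoff
jumps     <- sapply(cutoffs, jump_at)
setNames(jumps, cutoffs)
```

Only the estimate at the true cutoff should be clearly different from zero in this simulation.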
Placebo test for extremist candidates
[Figure: estimated LATE (−0.2 to 0.3) plotted against the cut point used (−0.15 to 0.15).]
Sorting
The RDD is based on the assumption that there is continuity in the potential outcomes at the threshold.
One way this assumption might be violated is if units can control their values of the running variable.
Examples of sorting:
- Population thresholds: Administrators might misreport population in a town/district if particular benefits are received at certain thresholds (e.g. Eggers et al., 2018)
- Earnings thresholds: Individuals may reduce their earnings if benefits are granted to those below a certain income (e.g. McCrary, 2008)
- Geographic thresholds: Businesses might locate in different areas if benefits are allocated differentially across localities (e.g. Keele and Titiunik, 2015)
Sorting
McCrary (2008) proposes a test to detect sorting in $\tilde{X}_i$:
- Looks for evidence of discontinuous jumps in the density of the running variable at the threshold
- Null hypothesis is that there is no sorting, so small p-values from the test suggest evidence of sorting
- DCdensity() from the rdd package in R, applied to the running variable
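To convey the idea behind a sorting check without relying on any package, the sketch below compares the number of observations just above and just below the cutoff. This is a deliberately crude stand-in and not McCrary's density test; the window width and the simulated data are assumptions.

```r
# Illustrative sketch: a crude check for sorting. If units cannot manipulate
# the running variable, observations in a narrow window around the cutoff
# should fall on either side with roughly equal probability.
# This is a simplified stand-in, NOT McCrary's (2008) density test.
set.seed(8)
x_tilde <- runif(2000, min = -0.5, max = 0.5)   # simulated, no sorting
h       <- 0.05                                 # assumed window half-width

n_above <- sum(x_tilde >= 0 & x_tilde <= h)
n_below <- sum(x_tilde <  0 & x_tilde >= -h)

sorting_check <- binom.test(n_above, n_above + n_below, p = 0.5)
sorting_check$p.value    # small values would suggest bunching on one side
```

The McCrary test is preferable in practice because it estimates the density smoothly on each side rather than relying on raw counts in an arbitrary window.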
Sorting of extreme candidates
DCdensity(hall$running_variable)
## [1] 0.9563002
[Figure: estimated density of the running variable on either side of the cutoff, over the range −0.2 to 0.2.]
Compound treatments
RDD assumes that the only thing that is determined by $X_i$ at the cutoff is the probability of receiving the treatment.
It is often the case that there are multiple changes at a given cutoff, and so we can only estimate a compound treatment effect.
Eggers (2015) uses the fact that French towns with ≥ 3500 people hold PR elections while towns with < 3500 hold majoritarian elections.
- Outcome ($Y_i$): Turnout in municipality
- Treatment ($D_i$): PR election system
- Running variable ($X_i$): Population of municipality
- Cutoff ($c$): 3500
Key question: Is the electoral system the only thing that changes at 3500?
Compound treatments (Eggers et al., 2018)
Fuzzy RDD
Fuzzy RDD
Thresholds/cutoffs may not perfectly determine treatment status, but might still create discontinuities in the probability of treatment exposure.
Incentives to participate in a program may change discontinuously at a threshold, but the incentives are not powerful enough to move all units from nonparticipation to participation.
We can think of the cutoff as assigning units to a treatment condition, where only some units will comply with the treatment.
→ We can use discontinuities to produce instrumental variable estimators of the effect of the treatment (close to the discontinuity).
Assumptions in Fuzzy RDD
1. First stage
  - There should be a discontinuity in treatment probability at the cutoff
  - Empirically: check RD plots with the running variable on the x-axis and treatment probability on the y-axis
2. Local independence
  - The treatment assignment should be as good as random around the cutoff
  - Empirically: check RD balance plots of covariates
3. Monotonicity
  - No units should be discouraged from taking the treatment at the cutoff
  - Generally trivial
4. Exclusion restriction
  - Crossing the cutoff should only affect the outcome through a unit’s treatment values, not through any other channel
  - Often plausible, so long as it is only $D$ that is affected at $c$ (no compound effects)
Fuzzy RD example
Does education decrease anti-immigrant views?
Although low levels of education are powerful predictors of anti-immigrant sentiment, it is difficult to establish a causal relationship between education and attitudes towards immigrants. Marshall and Cavaille (2018) use an RDD to address this question by exploiting changes to the length of mandatory education in five countries (Denmark, France, UK, Netherlands, and Sweden).
- Outcome ($Y_i$): Index of anti-immigrant attitudes
- Treatment ($D_i$): 1 if respondent was affected by the reform
- Running variable ($X_i$): Birth year of the respondent minus the birth year of those first affected by the policy
Schooling and immigration attitudes
Schooling and immigration attitudes
Here, treatment is determined by age:
$$D_{i,c} = \begin{cases} 1 & \text{if birth year}_{i,c} \geq \text{birth year of first affected}_c \\ 0 & \text{if birth year}_{i,c} < \text{birth year of first affected}_c \end{cases}$$
But many students would have stayed in school longer even in the absence of a reform. We therefore have some non-compliance (always-takers).
Fuzzy RD estimation
1. Restrict data to a small window above and below the cutoff ($\pm h$)
2. Code the instrument, $Z_i$, using the running variable ($Z_i = \mathbb{1}\{X_i \geq c\}$)
3. Fit 2SLS
$$Y_i = \alpha + \beta_1 \tilde{X}_i + \beta_2 Z_i \tilde{X}_i + \tau \hat{D}_i$$
where $\hat{D}_i$ is instrumented by $Z_i$ and $\tilde{X}_i = X_i - c$
4. We can, as before, add more flexible specifications for $\tilde{X}_i$
5. We would normally also plot and estimate both the first- and second-stage discontinuities
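The steps above can be sketched in base R. Because the model is just-identified and both stages use the same controls, the 2SLS point estimate of $\tau$ equals the ratio of the reduced-form jump to the first-stage jump, so no IV package is needed for the point estimate (standard errors would need a proper 2SLS routine). The compliance rates, effect size and bandwidth below are assumptions.

```r
# Illustrative sketch: fuzzy RD as the ratio of two discontinuities.
# Simulated data: crossing the cutoff raises the probability of treatment from
# 0.2 to 0.7 (imperfect compliance), and treatment has an assumed effect of 2.
set.seed(9)
n       <- 20000
x_tilde <- runif(n, min = -1, max = 1)                 # centred running variable
z       <- as.integer(x_tilde >= 0)                    # instrument: above cutoff
d       <- rbinom(n, size = 1, prob = 0.2 + 0.5 * z)   # treatment actually taken
y       <- 1 + 0.5 * x_tilde + 2 * d + rnorm(n, sd = 1)

h   <- 0.5                                             # assumed bandwidth
dat <- data.frame(y, d, z, x_tilde)[abs(x_tilde) <= h, ]

first_stage  <- lm(d ~ z * x_tilde, data = dat)   # jump in treatment probability
reduced_form <- lm(y ~ z * x_tilde, data = dat)   # jump in the outcome

fs_jump   <- unname(coef(first_stage)["z"])
tau_fuzzy <- unname(coef(reduced_form)["z"]) / fs_jump

c(first_stage = fs_jump, tau_fuzzy = tau_fuzzy)
```

Dividing by the first-stage jump rescales the reduced-form effect to the units whose treatment status actually changes at the cutoff, which is why the resulting estimate is local to compliers.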
Schooling and immigration attitudes – first stage
Implication: On average, reforms increase a student’s secondary schooling by 0.29 years.
Schooling and immigration attitudes – reduced form
Implication: On many indexes, reform-affected students are less opposed to immigration.
Schooling and immigration attitudes – LATE
Note that the LATE estimated in a fuzzy RD is “local” in two ways:
- Local to the threshold
- Local to the compliers
Conclusions
Internal and external validity
- Internal validity
  - RDD is a transparent approach to inference which requires less stringent assumptions than IV (at least in the sharp RDD case)
  - Many of the key identifying assumptions are empirically verifiable
  - RDD has been shown to do a very good job at recovering known experimental benchmarks (Cook et al., 2008)
- External validity
  - Sharp RDD only identifies the ATE at the point of the discontinuity
  - Fuzzy RDD only identifies the ATE at the point of the discontinuity, amongst compliers
  - Generalizability depends on how weird the units are at the cutoff, and how weird the compliers are
Next week
1. Advice for coursework
2. Overview of course
3. Topics for future study