set.seed(0)
library(brms)
library(rstanarm)
19 Regression discontinuity
Instead of a randomized experiment, we can design an experiment with no random element, and our variables still satisfy the ignorability assumption.
One possible way to do this is by regression continuity. The basic idea is to study the behavior of the outcome \(y\) over a range of a continuous variable \(X\), often called the forcing variable. Assume that, from prior knowledge, we expect \(y\) to be a smooth function of \(X\). If there is a sudden jump in \(y\) at a particular value \(X=c\), it is possible that the sudden jump is caused by the effect of the treatment.
In this case, we can define the treatment \(z\) with \(z=1\) if \(z \geq c\) and \(z=0\) otherwise. This design has a severe overlap problem as the values of \(x\) between the treatment and control groups are disjoint. We can only estimate where the groups overlap, that is, at \(X=c\). Thus we are interested in the local average treatment effect (LATE) at \(X=c\), which can be computed as follows:
\[ \tau_{\text{SRD}} = \mathbb{E}[y^1-y^0\vert X=c]. \]
Right now, it is not possible to directly compute \(\tau_{\text{SRD}}\) as it involves the counterfactual. So we have to make some assumptions first.
Assumptions
Conditional ignorability:
\[ y^1,y^0 \perp z \vert X, \]
which is automatically satisfied since \(z\) is deterministic.
Continuity of the expected potential outcomes with respect to the forcing variable:
\[\mathbb{E}[y^1\vert X] \text{ and }\mathbb{E}[y^0\vert X] \text{ are continuous in }X.\]
With these assumptions, the average treatment effect can be computed as follows:
\[ \tau_{\text{SRD}} = \mathbb{E}[y^1-y^0\vert X=c]. \tag{19.1}\]
In other words, we can estimate SATE using the difference between the values of the two linear functions at \(X=c\).
We now show that Equation 19.1 holds under the assumptions. By the continuity of the outcome, the ignorability, and SUTVA, we have
\[ \mathbb{E}[y^1\vert X=c] = \lim_{x\uparrow c}\mathbb{E}[y^1\vert X=x] = \lim_{x\uparrow c}\mathbb{E}[y^1\vert z=1, X=x]=\lim_{x\uparrow c}\mathbb{E}[y\vert X=x]. \]
Similarly, we have \(\mathbb{E}[y^0\vert X=c]=\lim_{x\downarrow c}\mathbb{E}[y\vert X=x]\). It follows that
\[ \tau_{\text{SRD}} = \lim_{x\uparrow c}\mathbb{E}[y\vert X=x] - \lim_{x\downarrow c}\mathbb{E}[y\vert X=x]. \]
Consequently, we can estimate LATE using the difference between the values of the linear functions at \(X=c\). As there is no overlap at any other value of \(X\), it is not possible to estimate LATE at these points since we cannot observe couterfactuals. Instead, we have to extrapolate the LATE at \(X=c\) other values of \(X\). The estimates of LATE can be seen as the vertical distance between two red lines in the plot below.
19.1 Deriving the linear regression
To find the linear regression that gives us an estimate of LATE, we first subtract the forcing variable by the cutoff.
\[ x' = x - c. \]
Now we can write the equations for the two linear regression with intercepts \(\alpha_0\) and \(\alpha_0+\delta\) as follows:
\[\begin{align*} \text{Control:}\quad y^0 &= \alpha_0 + \alpha_1x'+\text{error} \\ \text{Treatment:}\quad y^1 &= \alpha_0+\delta + \beta_1x'+ \text{error}. \end{align*}\]Taking the expectation on both equations, conditioning on \(x=c\) (or \(x'=0\)),
\[ \mathbb{E}[y^1\vert x=c] - \mathbb{E}[y^0\vert x=c] = (\alpha_0+\delta) -\alpha_0 = \delta. \]
The combination of two models above is equivalent to a single interactive model:
\[ y = \alpha_0 + \delta z + \alpha_1x' + (\beta_1-\alpha_1)zx' + \text{error}. \]
Notice that \(\delta\) is a coefficient of the treatment variable \(z\). So we can fit a linear regression (possibly with an interaction between the treatment and forcing variable) and obtain the coefficient of the treatment variable as an estimate of LATE.
Only regress near the cutoff. It is a good practice to fit the regression only on a small interval around the cutoff—this is to prevent the points that are far away from the cutoff to have any effect on the local estimate, as the following figure illustrates.
19.2 Example: The effect of educational support on test scores in Chile
As an example, we consider the data from the Chilean government, who implemented “900 schools program” in an attempt to improve the performance of struggling public schools. The educational resources have been distributed to schools whose mean fourth-grade test scores are below a cutoff. Here, the forcing variable is the mean test score and the outcome is the follow-up reading test score in 1992.
Here is the list of variables that we are going to use:
Name | Description |
---|---|
read92 |
The score of reading test in 1992 |
eligible |
1 if rule2 is less than cutoff , 0 otherwise |
rule2 |
the earlier test score minus cutoff |
cutoff |
The test score cutoff for educational support |
read88 |
The score of reading test in 1988 |
math88 |
The score of math test in 1988 |
The data is contained in chile.csv
.
<- read.csv("data/chile.csv")
chile
head(chile[, c("eligible", "rule2", "cutoff", "read88", "math88", "read92")])
eligible rule2 cutoff read88 math88 read92
1 0 3.41499853 49.4 54.37 51.26 57.000
2 0 31.81499672 43.4 79.06 71.37 85.515
3 0 4.44500113 43.4 47.76 47.93 51.971
4 0 16.79999733 46.4 65.13 61.27 66.374
5 0 0.09499893 49.4 49.26 49.73 52.500
6 1 -2.03000116 46.4 50.51 38.23 55.333
Here, eligible
is the treatment variable and rule2
is the forcing variable. In each group, the expected reading score in 1992 should move continuously along the earlier test score, so expect the continuity assumption to be satisfied. The forcing variable rule2
is already subtracted by the score cutoff; hence, we are ready to fit the model with an interaction. As mentioned before, we fit the regression only on a small interval around the cutoff. In this example, the cutoff for rule2
is zero, so we shall fit on the subset of schools whose rule2
is between \(-5\) and \(5\).
<- stan_glm(read92 ~ eligible + rule2 + eligible:rule2
fit_1 + read88 + math88,
data=chile, subset = abs(rule2)<5,
refresh=0)
fit_1
stan_glm
family: gaussian [identity]
formula: read92 ~ eligible + rule2 + eligible:rule2 + read88 + math88
observations: 883
predictors: 6
subset: abs(rule2) < 5
------
Median MAD_SD
(Intercept) 23.4 4.4
eligible 2.1 0.9
rule2 0.1 0.2
read88 0.6 0.1
math88 0.2 0.1
eligible:rule2 0.1 0.3
Auxiliary parameter(s):
Median MAD_SD
sigma 6.9 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The estimate of LATE is \(2.1\) with standard error \(0.9\). Let us take a look at the regression lines.
The data is too noisy to estimate the interaction, so we might instead fit the model without the interaction. In this case, the two lines are parallel to each other.
<- stan_glm(read92 ~ eligible + rule2
fit_2 + read88 + math88,
data=chile, subset = abs(rule2)<5,
refresh=0)
fit_2
stan_glm
family: gaussian [identity]
formula: read92 ~ eligible + rule2 + read88 + math88
observations: 883
predictors: 5
subset: abs(rule2) < 5
------
Median MAD_SD
(Intercept) 23.5 4.3
eligible 2.1 0.9
rule2 0.1 0.2
read88 0.6 0.1
math88 0.2 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 6.9 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
We see that the estimate of LATE is the same as before. Let us take a look at the regression lines.