19  Regression discontinuity

Instead of running a randomized experiment, we can sometimes design a study with no random element whose variables still satisfy the ignorability assumption.

One possible way to do this is regression discontinuity. The basic idea is to study the behavior of the outcome \(y\) over a range of a continuous variable \(X\), often called the forcing variable. Assume that, from prior knowledge, we expect \(y\) to be a smooth function of \(X\). If there is a sudden jump in \(y\) at a particular value \(X=c\), it is possible that the jump is caused by the effect of a treatment assigned at that cutoff.

An example of regression discontinuity
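
To make the idea concrete, here is a minimal simulation in R (our own illustration, not part of the original text; the cutoff of 0 and the jump of size 2 are arbitrary choices):

set.seed(1)
n <- 500
x <- runif(n, -2, 2)                      # forcing variable
cutoff <- 0
z <- as.numeric(x < cutoff)               # treatment assigned below the cutoff
y <- 1 + 0.8*x + 2*z + rnorm(n, 0, 0.5)   # smooth in x except for a jump of 2 at the cutoff

# plot the outcome against the forcing variable; the jump appears at x = 0
plot(x, y, pch=20, col=ifelse(z == 1, "red", "blue"))
abline(v=cutoff, lty=2)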

In this case, we can define the treatment \(z\) by \(z=1\) if \(X < c\) and \(z=0\) otherwise (in the derivation and the example below, treatment is assigned below the cutoff). This design has a severe overlap problem: the values of \(X\) in the treatment and control groups are disjoint, and the two groups meet only at the boundary \(X=c\), so that is the only point at which the treatment effect can be identified. Thus we are interested in the local average treatment effect (LATE) at \(X=c\), which is defined as follows:

\[ \tau_{\text{SRD}} = \mathbb{E}[y^1-y^0\vert X=c]. \]

As it stands, we cannot compute \(\tau_{\text{SRD}}\) directly, as it involves counterfactuals, so we first have to make some assumptions.

Assumptions

  1. Conditional ignorability:

    \[ y^1,y^0 \perp z \vert X, \]

    which is automatically satisfied since \(z\) is a deterministic function of \(X\).

  2. Continuity of the expected potential outcomes with respect to the forcing variable:

    \[\mathbb{E}[y^1\vert X] \text{ and }\mathbb{E}[y^0\vert X] \text{ are continuous in }X.\]

With these assumptions, \(\tau_{\text{SRD}}\) can be computed from the observed data as follows:

\[ \tau_{\text{SRD}} = \lim_{x\uparrow c}\mathbb{E}[y\vert X=x] - \lim_{x\downarrow c}\mathbb{E}[y\vert X=x]. \tag{19.1}\]

In other words, we can estimate LATE as the difference between the left and right limits of the observed regression function at \(X=c\); in practice, this is the difference between the values of two fitted regression lines at the cutoff.

We now show that Equation 19.1 holds under the assumptions. By the continuity of the expected potential outcomes, the ignorability, and SUTVA, we have

\[ \mathbb{E}[y^1\vert X=c] = \lim_{x\uparrow c}\mathbb{E}[y^1\vert X=x] = \lim_{x\uparrow c}\mathbb{E}[y^1\vert z=1, X=x]=\lim_{x\uparrow c}\mathbb{E}[y\vert X=x]. \]

where the last equality holds because \(z=1\) whenever \(x < c\). Similarly, we have \(\mathbb{E}[y^0\vert X=c]=\lim_{x\downarrow c}\mathbb{E}[y\vert X=x]\). It follows that

\[ \tau_{\text{SRD}} = \mathbb{E}[y^1\vert X=c] - \mathbb{E}[y^0\vert X=c] = \lim_{x\uparrow c}\mathbb{E}[y\vert X=x] - \lim_{x\downarrow c}\mathbb{E}[y\vert X=x], \]

which is exactly Equation 19.1.

Consequently, we can estimate LATE at the cutoff as the difference between the two fitted regression lines at \(X=c\). As there is no overlap at any other value of \(X\), it is not possible to estimate the treatment effect at those points, since we never observe the counterfactuals. Instead, we have to extrapolate the LATE at \(X=c\) to other values of \(X\). The estimate of LATE can be seen as the vertical distance between the two red lines in the plot below.

Extrapolation of LATE estimate at the cutoff
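
Here is a minimal sketch of Equation 19.1 on simulated data (our own illustration; the true jump is 2): estimate the left and right limits with separate regressions on each side of the cutoff, and take their difference at the cutoff.

set.seed(1)
n <- 500
x <- runif(n, -2, 2)
cutoff <- 0
z <- as.numeric(x < cutoff)
y <- 1 + 0.8*x + 2*z + rnorm(n, 0, 0.5)
d <- data.frame(x, y)

fit_below <- lm(y ~ x, data=d, subset = x < cutoff)    # treated side: left limit
fit_above <- lm(y ~ x, data=d, subset = x >= cutoff)   # control side: right limit

# difference of the two fitted lines at the cutoff; should be close to 2
predict(fit_below, newdata=data.frame(x=cutoff)) -
  predict(fit_above, newdata=data.frame(x=cutoff))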

19.1 Deriving the linear regression

To find the linear regression that gives us an estimate of LATE, we first subtract the cutoff from the forcing variable:

\[ x' = x - c. \]

Now we can write the equations of the two linear regressions, with intercepts \(\alpha_0\) and \(\alpha_0+\delta\), as follows:

\[\begin{align*} \text{Control:}\quad y^0 &= \alpha_0 + \alpha_1x'+\text{error} \\ \text{Treatment:}\quad y^1 &= \alpha_0+\delta + \beta_1x'+ \text{error}. \end{align*}\]

Taking the expectation of both equations and conditioning on \(x=c\) (that is, \(x'=0\)), we obtain

\[ \mathbb{E}[y^1\vert x=c] - \mathbb{E}[y^0\vert x=c] = (\alpha_0+\delta) -\alpha_0 = \delta. \]

Combining the two models above yields a single interactive model:

\[ y = \alpha_0 + \delta z + \alpha_1x' + (\beta_1-\alpha_1)zx' + \text{error}. \]

Notice that \(\delta\) is the coefficient of the treatment variable \(z\). So we can fit a single linear regression (possibly with an interaction between the treatment and the forcing variable) and take the coefficient of the treatment variable as an estimate of LATE.
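
A minimal sketch of this procedure, continuing with the simulated x, z, y, and cutoff from the sketch above: the coefficient of z in the interactive model recovers the jump.

xprime <- x - cutoff              # center the forcing variable at the cutoff
fit_inter <- lm(y ~ z*xprime)     # expands to z + xprime + z:xprime
coef(fit_inter)["z"]              # estimate of delta, the LATE at the cutoff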

Only regress near the cutoff. It is good practice to fit the regression only on a small interval around the cutoff; this prevents points far from the cutoff from having undue influence on the local estimate, as the following figure and the short sketch after it illustrate.

Left: Regressions on all points. Right: Regression on a small interval around the cutoff
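
Continuing the simulated example, restricting the regression to a window around the cutoff only requires a subset; the half-width h = 0.5 below is an arbitrary choice.

h <- 0.5
fit_window <- lm(y ~ z*xprime, subset = abs(xprime) < h)
coef(fit_window)["z"]             # local estimate using only points near the cutoff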

19.2 Example: The effect of educational support on test scores in Chile

As an example, we consider data from Chile, whose government implemented the “900 schools program” in an attempt to improve the performance of struggling public schools. Educational resources were distributed to schools whose mean fourth-grade test scores were below a cutoff. Here, the forcing variable is the mean test score and the outcome is the follow-up reading test score in 1992.

Here is the list of variables that we are going to use:

Name      Description
read92    Reading test score in 1992
eligible  1 if the earlier score is below the cutoff (rule2 < 0), 0 otherwise
rule2     The earlier test score minus the cutoff
cutoff    The test score cutoff for educational support
read88    Reading test score in 1988
math88    Math test score in 1988

The data is contained in chile.csv.

set.seed(0)
library(brms)
library(rstanarm)
# read in the school-level data from chile.csv
chile <- read.csv("data/chile.csv")

head(chile[, c("eligible", "rule2", "cutoff", "read88", "math88", "read92")])
  eligible       rule2 cutoff read88 math88 read92
1        0  3.41499853   49.4  54.37  51.26 57.000
2        0 31.81499672   43.4  79.06  71.37 85.515
3        0  4.44500113   43.4  47.76  47.93 51.971
4        0 16.79999733   46.4  65.13  61.27 66.374
5        0  0.09499893   49.4  49.26  49.73 52.500
6        1 -2.03000116   46.4  50.51  38.23 55.333

Here, eligible is the treatment variable and rule2 is the forcing variable; note that eligible equals 1 exactly when rule2 is negative. In each group, the expected reading score in 1992 should vary continuously with the earlier test score, so we expect the continuity assumption to be satisfied. The forcing variable rule2 already has the cutoff subtracted, so we are ready to fit the model with an interaction. As mentioned before, we fit the regression only on a small interval around the cutoff. In this example, the cutoff for rule2 is zero, so we shall fit on the subset of schools whose rule2 is between \(-5\) and \(5\).

fit_1 <- stan_glm(read92 ~ eligible + rule2 + eligible:rule2
                  + read88 + math88,
                  data=chile, subset = abs(rule2)<5,
                  refresh=0)

fit_1
stan_glm
 family:       gaussian [identity]
 formula:      read92 ~ eligible + rule2 + eligible:rule2 + read88 + math88
 observations: 883
 predictors:   6
 subset:       abs(rule2) < 5
------
               Median MAD_SD
(Intercept)    23.4    4.4  
eligible        2.1    0.9  
rule2           0.1    0.2  
read88          0.6    0.1  
math88          0.2    0.1  
eligible:rule2  0.1    0.3  

Auxiliary parameter(s):
      Median MAD_SD
sigma 6.9    0.2   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The estimate of LATE is \(2.1\) with standard error \(0.9\). Let us take a look at the regression lines.
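
The figure is not reproduced here, but the fitted lines can be drawn from fit_1 directly. The following is one possible sketch (our own, not the original plotting code); it holds read88 and math88 at their means within the fitting window and computes the expected outcome with rstanarm's posterior_epred:

near <- subset(chile, abs(rule2) < 5)
plot(near$rule2, near$read92, pch=20,
     col=ifelse(near$eligible == 1, "red", "blue"),
     xlab="rule2", ylab="read92")
for (z in 0:1) {
  # fitted line for each group, with covariates fixed at their window means
  grid <- data.frame(rule2 = if (z == 1) seq(-5, 0, 0.1) else seq(0, 5, 0.1),
                     eligible = z,
                     read88 = mean(near$read88),
                     math88 = mean(near$math88))
  lines(grid$rule2, colMeans(posterior_epred(fit_1, newdata=grid)))
}
abline(v=0, lty=2)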

The data is too noisy to estimate the interaction (its coefficient is estimated as only 0.1 with MAD_SD 0.3), so we might instead fit the model without the interaction. In this case, the two regression lines are parallel to each other.

fit_2 <- stan_glm(read92 ~ eligible + rule2
                  + read88 +  math88,
                  data=chile, subset = abs(rule2)<5,
                  refresh=0)

fit_2
stan_glm
 family:       gaussian [identity]
 formula:      read92 ~ eligible + rule2 + read88 + math88
 observations: 883
 predictors:   5
 subset:       abs(rule2) < 5
------
            Median MAD_SD
(Intercept) 23.5    4.3  
eligible     2.1    0.9  
rule2        0.1    0.2  
read88       0.6    0.1  
math88       0.2    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 6.9    0.2   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We see that the estimate of LATE is the same as before. Let us take a look at the regression lines.
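
The same plotting sketch as above, with fit_2 in place of fit_1, now draws two parallel lines. As an additional check that is not in the original text, the two models can be compared with approximate leave-one-out cross-validation, which rstanarm supports through loo:

loo_1 <- loo(fit_1)         # approximate LOO for the model with the interaction
loo_2 <- loo(fit_2)         # and for the model without it
loo_compare(loo_1, loo_2)   # difference in expected log predictive density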