```r
set.seed(0)
library(rstanarm)
```
20 Difference-in-differences
20.1 Example: effect of minimum wage on employment
Suppose that we would like to estimate the effect of raising the minimum wage on employment. With a lot of money and power, we could run a randomized experiment: flip a coin for each local labor market in the country; if it comes up heads, we raise the minimum wage, and if it comes up tails, we keep it the same.
Of course, this is just a thought experiment; such a randomized experiment is not feasible. Nonetheless, it is possible to estimate the treatment effect when we have before-after data on a pair of units: both are controls before, but only one of them is treated after. This is what Card and Krueger (1993) did after learning that New Jersey's minimum wage was about to be raised from $4.25 to $5.05 in April 1992, while neighboring Pennsylvania's minimum wage stayed at $4.25. They seized this opportunity and surveyed around 400 fast-food restaurants in both states: first in February 1992 and again in November 1992.
Let \(\alpha\) and \(\beta\) be the baseline employment in New Jersey and Pennsylvania, respectively. Let \(\delta\) be the effect of raising the minimum wage, and assume that every other factor had the same effect \(\gamma\) on both states (plausible since the two states are adjacent). The employment data obtained from the surveys would look like the following table:
| | February 1992 | November 1992 | Difference |
|---|---|---|---|
| New Jersey | \(\alpha\) | \(\alpha+\gamma+\delta\) | \(\gamma+\delta\) |
| Pennsylvania | \(\beta\) | \(\beta+\gamma\) | \(\gamma\) |
| Difference | | | \(\delta\) |
We see that the treatment effect \(\delta\) is the difference between the two states' before-after differences, or the difference in differences. In practice, noise means this quantity does not exactly equal the treatment effect, so we call it the difference-in-differences (DID) estimate of the treatment effect.
The raw data can be downloaded from Card's personal website. Here, we will use the preprocessed data stored in `wage92.csv`.
```r
wage92 <- read.csv("data/wage92.csv")
wage92 <- na.omit(wage92) # remove NA rows
head(wage92[, c("d_nj",
                "y_ft_employment_before",
                "y_ft_employment_after")])
```

```
   d_nj y_ft_employment_before y_ft_employment_after
4     0                   34.0                  20.0
5     0                   24.0                  35.5
7     0                   70.5                  29.0
8     0                   23.5                  36.5
9     0                   11.0                  11.0
10    0                    9.0                   8.5
```
Below are descriptions of the relevant variables:
Name | Description |
---|---|
d_nj |
1 if New Jersey; 0 if Pennsylvania (Treatment) |
y_ft_employment_before |
Full time equivalent employment before treatment (Outcome) |
y_ft_employment_after |
Full time equivalent employment after treatment (Outcome) |
Now we can compute the difference-in-differences estimate from the mean employment in each group.
```r
wage_nj <- subset(wage92, d_nj == 1)
wage_pa <- subset(wage92, d_nj == 0)

before_nj <- mean(wage_nj$y_ft_employment_before)
after_nj  <- mean(wage_nj$y_ft_employment_after)
diff_nj   <- after_nj - before_nj

before_pa <- mean(wage_pa$y_ft_employment_before)
after_pa  <- mean(wage_pa$y_ft_employment_after)
diff_pa   <- after_pa - before_pa

did <- diff_nj - diff_pa
```
Let us summarize this in a table as shown above.
```r
result <- data.frame(State = c("New Jersey", "Pennsylvania", "Difference"),
                     Before = c(before_nj, before_pa, NA),
                     After = c(after_nj, after_pa, NA),
                     Difference = c(diff_nj, diff_pa, did))
result
```

```
         State   Before    After Difference
1   New Jersey 20.65775 21.04842   0.390669
2 Pennsylvania 23.70455 21.82576  -1.878788
3   Difference       NA       NA   2.269457
```
The DID estimate tells us that raising the minimum wage from $4.25 to $5.05 would increase full-time equivalent employment by 2.27 per restaurant on average.
20.2 Regression for the difference-in-differences estimate
We can also use a linear regression to obtain the DID estimate. Let \(y_{\text{before}}\) and \(y_{\text{after}}\) be the outcomes before and after the treatment period, and \(z\) be the treatment assignment. We can regress the difference on the treatment variable:
\[ y_{\text{after}} - y_{\text{before}} = \beta_0 + \delta z + \varepsilon. \tag{20.1}\]
Then, the coefficient \(\delta\) of the treatment indicator is the DID estimate. This is because
\[ \mathbb{E}[y_{\text{after}} - y_{\text{before}}\vert z=1]-\mathbb{E}[y_{\text{after}} - y_{\text{before}}\vert z=0] = (\beta_0+\delta)-\beta_0 = \delta. \]
Let us try this method on the employment data, regressing the change in employment on the New Jersey indicator.
```r
fit_1 <- stan_glm((y_ft_employment_after - y_ft_employment_before) ~ d_nj,
                  data = wage92,
                  seed = 0, refresh = 0)
print(fit_1, digits = 2)
```

```
stan_glm
 family:       gaussian [identity]
 formula:      (y_ft_employment_after - y_ft_employment_before) ~ d_nj
 observations: 350
 predictors:   2
------
            Median MAD_SD
(Intercept) -1.88   1.09
d_nj         2.23   1.17

Auxiliary parameter(s):
      Median MAD_SD
sigma 8.74   0.33

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```
The DID estimate is 2.23 with a standard error of 1.17, which is close to the point estimate of 2.27 that we just computed directly from the differences between the means.
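To see why the regression recovers the same number, here is a minimal sketch on simulated data (group sizes, means, and the effect size are all made up for illustration): for a binary regressor, the OLS slope is exactly the difference of group means, so regressing the change on \(z\) reproduces the difference-in-differences.

```r
set.seed(1)
n <- 200
z <- rep(c(0, 1), each = n)            # 0 = control, 1 = treated
y_before <- rnorm(2 * n, mean = 20, sd = 5)
# common trend of -2 for everyone, plus a treatment effect of 3
y_after <- y_before - 2 + 3 * z + rnorm(2 * n, sd = 1)

# DID as a difference of mean before-after differences
did_means <- mean(y_after[z == 1] - y_before[z == 1]) -
             mean(y_after[z == 0] - y_before[z == 0])

# DID as the slope in a regression of the change on the treatment
fit <- lm(I(y_after - y_before) ~ z)
did_reg <- unname(coef(fit)["z"])

all.equal(did_means, did_reg)  # TRUE: the two estimates coincide
```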
20.2.1 Different observations before and after the treatment time
Let \(P\) be a time indicator, with \(P=0\) and \(P=1\) signifying the time before and after the treatment took effect, respectively. If the units observed at \(P=0\) differ from those observed at \(P=1\), then we cannot compute \(y_{\text{after}} - y_{\text{before}}\) for any unit. Assuming that the observations in each of the treatment and control groups are drawn independently from the same distribution, we can instead fit the following regression with an interaction term:
\[ y = \beta_0 + \beta_1z+ \beta_2P +\delta zP + \varepsilon. \]
The DID estimate is the coefficient \(\delta\) of the interaction term, as it is the difference between the two coefficients of \(z\) from fitting \(y = a+bz\) on the data with \(P=1\) and \(P=0\), respectively. More explicitly,
\[\begin{align*} \mathbb{E}[y\vert z=1, P=1] - \mathbb{E}[y\vert z=0, P=1] &= (\beta_0 + \beta_1+\beta_2+\delta)- (\beta_0+\beta_2) \\ &= \beta_1+\delta \\ \mathbb{E}[y\vert z=1, P=0] - \mathbb{E}[y\vert z=0, P=0] &= (\beta_0 + \beta_1)- \beta_0 \\ &= \beta_1. \end{align*}\]Subtracting these two equalities yields
\[ \text{DID} = (\beta_1+\delta) - \beta_1= \delta. \]
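To check this identity numerically, the sketch below simulates independent cross-sections at \(P=0\) and \(P=1\) (all parameters are invented) and fits the interaction model with `lm`; the coefficient on the \(zP\) interaction matches the four-cell difference-in-differences of means exactly, because the model is saturated in the four cells.

```r
set.seed(2)
n <- 2000
z <- rbinom(n, 1, 0.5)              # group membership
P <- rep(c(0, 1), each = n / 2)     # time indicator; different units per period
# beta0 = 10, beta1 = 2, beta2 = -1, delta = 3
y <- 10 + 2 * z - 1 * P + 3 * z * P + rnorm(n)

fit <- lm(y ~ z * P)                # expands to z + P + z:P
delta_hat <- unname(coef(fit)["z:P"])

# the same number as the four-cell difference-in-differences of means
did_cells <- (mean(y[z == 1 & P == 1]) - mean(y[z == 0 & P == 1])) -
             (mean(y[z == 1 & P == 0]) - mean(y[z == 0 & P == 0]))
all.equal(delta_hat, did_cells)  # TRUE
```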
20.2.2 Difference-in-differences by matching
Alternatively, we can use propensity score matching to match each unit observed before the treatment time to a unit in the same group observed after. We then treat each pair as a single observation with the observed values of \(y_{\text{before}}\) and \(y_{\text{after}}\), and with these new observations we can obtain the DID estimate by fitting the regression in Equation 20.1.
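A minimal sketch of this idea on simulated data (the matching rule and all parameters are illustrative; with real confounders one would match on a propensity score estimated from the covariates, e.g. with `glm(..., family = binomial)`): each unit observed before the treatment time is paired with the nearest same-group unit observed after, and the regression of Equation 20.1 is then fit on the pair differences.

```r
set.seed(3)
n <- 300
# one group: n units observed before (P = 0), n different units after (P = 1)
sim_group <- function(z, trend) {
  x <- runif(2 * n)                          # covariate to match on
  P <- rep(c(0, 1), each = n)
  y <- 5 + 2 * x + trend * P + rnorm(2 * n, sd = 0.5)
  data.frame(z = z, P = P, x = x, y = y)
}
dat <- rbind(sim_group(0, trend = -1),       # control: common trend only
             sim_group(1, trend = -1 + 3))   # treated: trend plus effect of 3

# within each group, match every "before" unit to the nearest "after" unit on x
match_pairs <- function(d) {
  before <- d[d$P == 0, ]
  after  <- d[d$P == 1, ]
  idx <- vapply(before$x, function(xi) which.min(abs(after$x - xi)), integer(1))
  data.frame(z = before$z, diff = after$y[idx] - before$y)
}
paired <- rbind(match_pairs(dat[dat$z == 0, ]),
                match_pairs(dat[dat$z == 1, ]))

# regression (20.1) on the matched pair differences
fit <- lm(diff ~ z, data = paired)
unname(coef(fit)["z"])   # close to the true effect of 3
```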
In all cases, we have made a strong assumption: that the changes in the outcomes, absent the treatment, would be the same in both New Jersey and Pennsylvania. We will discuss the assumptions behind the DID estimate further in the next section.
20.3 Parallel trends assumption
From Equation 20.1, we define the potential changes as the differences between the potential outcomes, with or without the treatment, and the outcome observed before the treatment was applied:
\[ d^1 = y^1 - y_{\text{before}}, \quad d^0 = y^0 - y_{\text{before}}, \]
where \(y^1,y^0\) are the potential outcomes. In view of Equation 20.1, in order for the coefficient \(\delta\) to be a valid causal estimate, the dependent variable in the regression must be independent of the treatment assignment, which is guaranteed when
\[ d^0 \perp z. \tag{20.2}\]
This is referred to as the parallel trends assumption: it implies that the change in a treated unit, had it not received the treatment, would be the same as that of a control unit. We can show that the DID estimate is an unbiased estimate of the average treatment effect on the treated (ATT) as follows:
\[\begin{align*} &\mathbb{E}[y - y_{\text{before}}\vert z=1] - \mathbb{E}[y - y_{\text{before}}\vert z=0] \\ &= \mathbb{E}[y^1 - y_{\text{before}}\vert z=1] - \mathbb{E}[y^0 - y_{\text{before}}\vert z=0] \\ &= \mathbb{E}[d^1\vert z=1] - \mathbb{E}[d^0 \vert z=0] \\ &= \mathbb{E}[d^1\vert z=1 ] - \mathbb{E}[d^0\vert z=1] \\ &= \mathbb{E}[y^1-y^0\vert z=1] \\ &= \mathbb{E}[y^1\vert z=1] - \mathbb{E}[y^0\vert z=1], \end{align*}\]where we used Equation 20.2 to show the third equality. Two comments are in order:
- If we instead make the stronger assumption \(d^1,d^0 \perp z\), then the ATT equals the ATE, in which case we can estimate both with DID.
- If \(y_{\text{before}} \perp z\) (which implies \(\mathbb{E}[y_{\text{before}}\vert z=1]=\mathbb{E}[y_{\text{before}}\vert z=0]\)), we can instead assume \(y^0 \perp z\) and modify the proof to show that the DID estimate is an unbiased estimate of the ATT (and if \(y^1,y^0 \perp z\), then the ATT equals the ATE).
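The ATT calculation above can be checked by simulation. The sketch below invents a data-generating process in which \(d^0 \perp z\) holds even though the baseline levels differ sharply between groups, so a naive after-only comparison is biased while DID is not (all numbers are made up for illustration).

```r
set.seed(4)
n <- 2000
z <- rbinom(n, 1, 0.4)                    # treatment assignment
y_before <- 20 + 5 * z + rnorm(n)         # baselines differ by group
d0 <- -2 + rnorm(n)                       # untreated change, independent of z
d1 <- d0 + 3                              # treatment adds 3 to everyone
y_after <- y_before + ifelse(z == 1, d1, d0)

att <- mean(d1[z == 1] - d0[z == 1])      # true ATT = 3
did <- mean(y_after[z == 1] - y_before[z == 1]) -
       mean(y_after[z == 0] - y_before[z == 0])
naive <- mean(y_after[z == 1]) - mean(y_after[z == 0])

c(att = att, did = did, naive = naive)    # did is near 3; naive is near 8
```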
With confounding covariates, however, this assumption might not hold. For example, in the 1992 survey, almost half of the fast-food restaurants were Burger Kings, and around 80% of those were in New Jersey; so if Burger King was especially responsive to the minimum wage raise compared with the other chains, the potential employment change in New Jersey would differ from that in Pennsylvania.
Thus, we have to adjust for these confounders, say \(x\), via a conditional version of the assumption:
\[ d^1,d^0 \perp z \vert x. \]
With this assumption, we obtain the DID estimate by also regressing on the confounders. In the employment example, we can adjust for the restaurant chain and ownership indicators.
```r
fit_2 <- stan_glm((y_ft_employment_after - y_ft_employment_before) ~
                    d_nj + x_burgerking + x_kfc +
                    x_roys + x_wendys + x_co_owned,
                  data = wage92,
                  seed = 0, refresh = 0)
print(fit_2, digits = 2)
```

```
stan_glm
 family:       gaussian [identity]
 formula:      (y_ft_employment_after - y_ft_employment_before) ~ d_nj + x_burgerking +
    x_kfc + x_roys + x_wendys + x_co_owned
 observations: 350
 predictors:   7
------
             Median MAD_SD
(Intercept)  -3.22  26.83
d_nj          2.35   1.16
x_burgerking  1.69  26.74
x_kfc         1.87  26.76
x_roys       -0.30  27.07
x_wendys      1.06  26.85
x_co_owned    0.36   1.10

Auxiliary parameter(s):
      Median MAD_SD
sigma 8.73   0.34

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```
Nonetheless, in this example, the DID estimate of 2.35 with standard error 1.16 is not noticeably different from the previous one.
Note. For the reasons explained in Section 15.2.2, do not adjust for post-treatment covariates.
20.3.1 Checking the parallel trends assumption
It is possible to check the parallel trends assumption when the data were recorded at multiple time points before the treatment took effect. In that case, there are three main ways to check the assumption.
- Check the plot over time. We can compare the graphs of the average outcomes of the treatment and control groups over a period leading up to the treatment. If the graphs are drifting apart or converging, the parallel trends assumption might not be satisfied.
- Statistical test. To see whether the trends differ between the treatment and control groups, we can fit the following regression with an interaction term on the data from before the treatment occurred:
\[ y = \beta_0 + \beta_1 z + \beta_2\,\text{Time} + \beta_3\,\text{Time}\cdot z+\varepsilon, \]
where the main effect of \(z\) is included so that a level difference between the groups is not absorbed into the interaction. We then perform a statistical test of \(\beta_3=0\); if we fail to reject it, it is unlikely that the trends differ. On the other hand, if we reject \(\beta_3=0\), we should still look at the graphs to see whether the difference in trends is visually small and was detected only because of a large sample size, or whether the outcomes differ substantially only over a brief moment, outside of which the trends are very similar.
- The placebo test. The idea is to treat some of the untreated data as fake treated data and check whether the DID estimate is significant, even though there should be no effect. More precisely, we first remove the time period in which the treatment took effect. Then we choose an earlier time period and designate the outcomes over that period as the results of a fake treatment. If the DID estimate for this fake treatment is significant, the parallel trends assumption might be violated.
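The statistical test and the placebo test can both be sketched on simulated pre-treatment panel data (group sizes, trends, and periods are invented; the fit includes a main effect of \(z\) so that a level difference between the groups does not contaminate the interaction):

```r
set.seed(5)
n <- 200
d <- expand.grid(id = 1:(2 * n), time = 1:4)   # four pre-treatment periods
d$z <- as.numeric(d$id > n)                    # group labels; no one treated yet
d$y <- 10 + 1.5 * d$time + 2 * d$z + rnorm(nrow(d))  # parallel trends by design

# 1. statistical test: is the Time x z interaction zero?
fit <- lm(y ~ z * time, data = d)
coef(summary(fit))["z:time", ]                 # estimate should be near 0

# 2. placebo test: pretend a treatment happened between times 2 and 3
placebo_did <- (mean(d$y[d$z == 1 & d$time == 3]) -
                mean(d$y[d$z == 1 & d$time == 2])) -
               (mean(d$y[d$z == 0 & d$time == 3]) -
                mean(d$y[d$z == 0 & d$time == 2]))
placebo_did                                    # should also be near 0
```

With real data, a nonzero interaction or a significant placebo estimate would be evidence against parallel trends.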