library(Synth)
22 Synthetic control
In difference-in-differences, we saw that it is possible to estimate the treatment effect if the treated and controlled units are similar before applying the treatment. But more often than not, we might not find a controlled unit that exactly matches the treated unit that, we have, especially when there are many confounders to adjust for.
The idea of synthetic control is to make a fake unit that resembles the pre-treatment treated unit by combining the existing controlled units.
22.1 Example: study of the effect of taxation on cigarette consumption
The method of synthetic control was demonstrated in a study of the effect of cigarette taxation on its consumption (Abadie, Diamond, and Hainmueller 2010). One could argue that imposing the tax would decrease cigarette consumption. On the other hand, since cigarettes are addictive, the consumption would not decrease by much.
In particular, Abadie, Diamond, and Hainmueller studied the effect of Proposition 99 which imposes several restrictions on cigarette sales in California since 1988: a 25-cent tax per cigarette pack, a ban on cigarette vending machines in public areas accessible by juveniles, and a ban on the sale of single cigarettes.
To study the effect of Proposition 99 using synthetic control, the authors collected the data of cigarette consumption in California and other states, before and after imposing the statute. The data prop99.csv
contains the information regarding cigarette sales and taxes across all states from 1970 to 2000, on which we shall perform a couple of preprocessing steps:
- Add a column that represents states by numbers.
- There were several states that imposed similar cigarette restrictions. These states do not represent those in the control group, so we have to remove them.
- There are several information regarding Proposition 99 in the
SubMeasureDesc
column, but we will only use the cigarette consumption as the outcome.
# Load and preprocess the cigarette consumption data
<- read.csv("data/prop99.csv")
prop99
# Add a column that represents states by numbers
$state <- as.numeric(factor(prop99$LocationDesc))
prop99
# Remove states that imposed similar cigarette restrictions
<- c('Massachusetts', 'Arizona', 'Oregon',
bad_states 'Florida', 'Alaska', 'Hawaii', 'Maryland',
'Michigan', 'New Jersey', 'New York',
'Washington', 'District of Columbia')
<- prop99[
prop99 $SubMeasureDesc=="Cigarette Consumption (Pack Sales Per Capita)" &
prop99!(prop99$LocationDesc %in% bad_states),
]
<- prop99[, c("state", "LocationDesc", "Year", "Data_Value")]
prop99
head(prop99)
state LocationDesc Year Data_Value
2 1 Alabama 2014 61.7
20 4 Arkansas 2014 54.4
26 5 California 2014 22.7
32 6 Colorado 2014 36.7
38 7 Connecticut 2014 30.1
44 8 Delaware 2014 71.2
We also need additional information about these states in order to “match” pre-intervention California with the other states. We will use the smoking.rda
data which contains cigarette sales cigsale
, average log of income lnincome
, beer sales beer
, proportion of 15-24 age group in the population age15to24
, and the cigarette’s retail price retprice
.
# Load the state data
load("data/smoking.rda")
head(smoking)
state year cigsale lnincome beer age15to24 retprice
1 1 1970 89.8 NA NA 0.1788618 39.6
2 1 1971 95.4 NA NA 0.1799278 42.7
3 1 1972 101.1 9.498476 NA 0.1809939 42.3
4 1 1973 102.9 9.550107 NA 0.1820599 42.1
5 1 1974 108.2 9.537163 NA 0.1831260 43.1
6 1 1975 111.7 9.540031 NA 0.1841921 46.6
Now we join these two dataframes by the state numbers and the years. We also make a new variable allstates
that stores the remaining 39 states.
# Join the two dataframes
colnames(smoking)[2] <- "Year"
<- merge(prop99, smoking, by=c("state", "Year"))
prop99_full
# Obtain the list of states
<- unique(prop99_full$LocationDesc)
allstates
head(prop99_full)
state Year LocationDesc Data_Value cigsale lnincome beer age15to24 retprice
1 1 1970 Alabama 89.8 89.8 NA NA 0.1788618 39.6
2 1 1971 Alabama 95.4 95.4 NA NA 0.1799278 42.7
3 1 1972 Alabama 101.1 101.1 9.498476 NA 0.1809939 42.3
4 1 1973 Alabama 102.9 102.9 9.550107 NA 0.1820599 42.1
5 1 1974 Alabama 108.2 108.2 9.537163 NA 0.1831260 43.1
6 1 1975 Alabama 111.7 111.7 9.540031 NA 0.1841921 46.6
Let us visualize and compare the cigarette consumption in California and Colorado. We also plot a vertical line that splits between the pre-intervention period (1970-1987) and the post-intervention period (1988-2000).
<- prop99[prop99$LocationDesc == "California", ]
ca_data <- prop99[prop99$LocationDesc == "Colorado", ]
co_data
plot(ca_data$Year, ca_data$Data_Value, type = "l",
ylab = "Per-capita cigarette sales (packs)",
xlab = "Year", ylim = c(20, 140),
col = "blue", lwd = 3)
lines(co_data$Year, co_data$Data_Value, lty = 2,
col = "red", lwd = 3)
abline(v = 1988, lty = 3, lwd = 2)
legend("topright",
legend = c("California",
"Colorado",
"Proposition 99"),
col = c("blue", "red", "black"),
lty = 1:3, lwd = 3)
The trends of the cigarette consumption between these two states are similar up until 1988. After that, there is a gap which might be a result of cigarette taxation in California.
22.2 The method of synthetic control
We now explain the method of constructing a new controlled unit that has similar features to those of the treated unit before the intervention. Let us start with a simple case of only one covariate \(x\). Here, we introduce some notations:
- Suppose that there are \(J\) units: Unit \(1\) is the treated unit and unit \(2,\ldots,J\) are the controlled units.
- Suppose that the outcomes were observed at time \(t=1,\ldots,T\), and the effect of the intervention occurred at time \(T_0<T\).
- \(y_{jt}\) is the outcome of \(j\)-th unit at time \(t\).
- \(x_{jt}\) is the covariate of \(j\)-th unit at time \(t\).
The idea is to construct Unit \(1\)’s synthetic covariate \(\hat{x}_{1t}\) as a linear combination of the controlled units’ covariates:
\[\hat{x}_{1t} = w_2x_{2t}+\ldots + w_Jx_{Jt}; \qquad t=1,\ldots, T\]
where \(w_2,\ldots, w_J\) are positive weights that sum to one. Our goal now is to find the combination of weights so that Unit \(1\)’s synthetic covariate \(\hat{x}_{1t}\) is “close” to its actual covariate \(x_{1t}\); this problem can be cast as a linear regression of \(x_{1t}\) on \(x_{2t},\ldots , x_{Jt}\), with \(w_2,\ldots, w_J\) as the coefficients. The diagram below illustrates our regression problem:
Suppose that we have an additional covariate \(z\), we can stack them next to \(x\) as follows:
Now suppose that we have \(m\) covariates: \(x^1_{jt},\ldots,x^m_{jt}\), which give rise to a synthetic covariate \(\hat{x}^k_{1t}=\sum_{j=2}^J w_j x^k_{jt}\) for \(k=1,\ldots,m\). We can cast the linear regression as minimizing the least-squares objective with respect to \(w_2,\ldots,w_J\).
\[\begin{align*} & \min_{w_2,\ldots,w_J} \sum_{k=1}^m\sum_{t=1}^T\left(x^k_{1t} - \sum_{j=2}^J w_j x^k_{jt}\right)^2\\ &\text{subject to}\quad w_2+\ldots+w_J = 1. \end{align*}\]However, some covariates might be more important than the others in predicting the outcome; for example, one of our covariates is the cigarette sales, which should be more predictive of the cigarette consumption than the beer sales. We thus multiply the term associated with the \(k\)-th covariate by a positive weight \(v_k\) to reflect its relative importance:
\[\begin{align} &\min_{w_2,\ldots,w_J} \sum_{k=1}^m \sum_{t=1}^T v_k \left(x^k_{1t} - \sum_{j=2}^J w_j x^k_{jt}\right)^2 \label{eq-opt}\tag{1} \\ &\text{subject to}\quad w_2+\ldots+w_J = 1. \end{align}\]We will discuss on how to choose \(v_1,\ldots,v_m\) in a moment, but let us assume for now that these weights are known.
Solving \(\eqref{eq-opt}\) yields a solution \(\hat{w}_2,\ldots,\hat{w}_J\). We then use this solution to obtain Unit \(1\)’s synthetic outcomes \(\hat{y}_{1t}\) as follows:
\[\begin{equation} \hat{y}_{1t} = \hat{w}_2y_{2t}+\ldots + \hat{w}_Jy_{Jt}, \qquad t=1,\ldots, T. \label{eq-y}\tag{2}. \end{equation}\]The diagram below illustrates the relationship between Unit \(1\)’s synthetic and actual outcomes.
We can now continue our discussion on the choice of the weights \(v_1,\ldots,v_m\). Since the pre-intervention outcomes of the synthetic control should be similar to those of Unit \(1\), we typically choose \(v_1,\ldots,v_m\) that minimize the corresponding least-squares objective:
\[\begin{align} &\min_{v_1,\ldots,v_m} \sum_{t=1}^{T_0-1} \left(y_{1t} - \sum_{j=2}^J \hat{w}_j y_{jt}\right)^2 \label{eq-v} \tag{3}\\ &\text{where}\quad \hat{w}_2,\ldots,\hat{w}_{J} \text{ solve \eqref{eq-opt}}. \end{align}\]Notice that the sum only consists of the pre-intervention data. After solving \(\eqref{eq-v}\) for \(v_1,\ldots,v_m\), we obtain the associated \(\hat{w}_2,\ldots,\hat{w}_{J}\)—inserting these into \(\eqref{eq-y}\) yields our estimate of Unit \(1\)’s post-intervention counterfactual outcome \(\hat{y}_{1t}, t\geq T_0\). We can then estimate the causal effect using:
\[\begin{equation} \hat{\tau}_{1t} = y_{1t} - \hat{y}_{1t}, \qquad t=T_0,\ldots,T. \label{eq-tau}\tag{4} \end{equation}\]22.3 Synthetic control in R using the Synth
package
Now we continue our example of synthetic control in R. We already have the preprocessed dataframe prop99_full
which contains both the covariates and the outcomes over 39 states. We can then use the dataprep
function provided in the Synth
package to split the data into dataframes of the covariates and the outcomes for synthetic control. For simple usage, we define a wrapper function named prepare_data
that only asks for the name of the treated unit. Full descriptions of the parameters of dataprep
can be found here.
# Prepare data for synthetic control of a specified state
<- function(state) {
prepare_data return(
dataprep(
foo = prop99_full,
predictors = c("cigsale", "lnincome", "beer", "age15to24"),
predictors.op = "mean",
time.predictors.prior = 1970:1987,
dependent = "Data_Value",
unit.variable = "state",
unit.names.variable = "LocationDesc",
time.variable = "Year",
treatment.identifier = state,
controls.identifier = allstates[!allstates == state],
time.optimize.ssr = 1970:1987,
time.plot = 1970:2000
)
)
}
<- prepare_data("California") prop99_prep
To perform the method of synthetic control, we simply call the synth
function on the prepared data.
# Perform synthetic control
<- synth(data.prep.obj = prop99_prep) prop99_synth
After that, we call synth.tab
on the output and the actual data to create tables summarizing the optimal weights \(\hat{v}_1,\ldots,\hat{v}_m\) and \(\hat{w}_2,\ldots,\hat{w}_J\) and the corresponding minimum values of the lease-squares \(\eqref{eq-opt}\) and \(\eqref{eq-v}\).
# Table summarizing the result
<- synth.tab(prop99_synth, prop99_prep)
prop99_tab
$tab.w prop99_tab
w.weights unit.names unit.numbers
1 0.000 Alabama 1
4 0.000 Georgia 11
6 0.000 Idaho 13
7 0.000 Illinois 14
8 0.000 Indiana 15
11 0.000 Iowa 16
13 0.000 Kansas 17
14 0.000 Kentucky 18
15 0.000 Louisiana 19
16 0.002 Maine 20
17 0.000 Minnesota 24
18 0.000 Mississippi 25
19 0.000 Missouri 26
20 0.000 Montana 27
24 0.000 Nebraska 28
25 0.000 Nevada 29
26 0.000 New Hampshire 30
27 0.429 New Mexico 32
28 0.508 North Carolina 34
29 0.000 North Dakota 35
30 0.000 Ohio 36
32 0.000 Oklahoma 37
34 0.061 Pennsylvania 39
35 0.000 Arkansas 4
36 0.000 Colorado 6
37 0.000 Connecticut 7
39 0.000 Delaware 8
We can see that North Carolina is the most representative state in the synthetic control, while New Mexico is the second most.
Unfortunately, synth
only outputs the weights and the least-squares minimum. To compute the synthetic outcome \(\eqref{eq-y}\), we have to multiply the controlled’s outcome matrix, which is stored in the prepared data’s Y0plot
column, and the vector of optimal weights, stored in the result’s solution.w
column.
print(prop99_prep$Y0plot[1:5, 1:8])
1 11 13 14 15 16 17 18
1970 89.8 100.3 124.8 120.0 155.0 109.9 102.4 124.8
1971 95.4 104.1 125.5 117.6 161.1 115.7 108.5 125.6
1972 101.1 103.9 134.3 110.8 156.3 117.0 126.1 126.6
1973 102.9 108.0 137.9 109.3 154.7 119.8 121.8 124.4
1974 108.2 109.7 132.8 112.4 151.3 123.7 125.6 131.9
print(prop99_synth$solution.w[1:8])
[1] 2.518675e-06 8.593394e-06 4.654868e-06 3.732791e-06 5.340892e-08
[6] 1.424996e-05 1.815872e-06 1.289167e-06
# Calculate the outcomes of the synthetic control
<- prop99_prep$Y0plot%*%prop99_synth$solution.w
synth_control
tail(synth_control)
w.weight
1995 91.60042
1996 89.35529
1997 89.43415
1998 86.59770
1999 86.62141
2000 78.64532
To plot both the actual outcome and the synthetic outcome, we can use the path.plot
function. Here, we also plot a vertical line that indicates the intervention (the enactment of Proposition 99) in 1988.
# Plot the outcomes of the treated unit and the synthetic control
path.plot(prop99_synth, prop99_prep,
tr.intake = 1988,
Ylab = "Per-capita cigarette sales (packs)")
To compute an estimate of the treatment effect using \(\eqref{eq-tau}\), we subtract the treated’s outcome, stored in the prepared data’s Y1plot
column, by the synthetic control’s outcome.
# Calculate an estimate of the treatment effect
<- prop99_prep$Y1plot - synth_control
tax_effects
tail(tax_effects)
5
1995 -35.20042
1996 -34.85529
1997 -35.63415
1998 -34.29770
1999 -39.42141
2000 -37.04532
We can also plot our estimate of the treatment effect over time using gaps.plot
function.
gaps.plot(prop99_synth, prop99_prep,
tr.intake = 1988,
Main = "Treated outcome - Synthetic outcome",
Ylab = "Estimated taxation effect (packs per capita)")
22.4 Permutation test
Without a frame of reference, it is not clear if our estimate displayed above is statistically significant. To show this, we can use a permutation test, whose hypotheses are:
\[\begin{align*} H_0 &: \text{The effects of treatment on the treated and the controlled are the same.} \\ H_1 &: \text{The effect of treatment on the treated is less than that on the controlled.} \end{align*}\]To perform the test, we follow the same procedure as above for all other units to obtain estimates of the placebo effects and see if the estimate of the treatment effect is sufficiently smaller than that of the placebo effects. In addition, we only consider units whose pre-intervention synthetic outcomes are close to the actual outcomes in terms of the mean-squared error (MSE):
\[ \text{MSE}_j = \frac1{T_0}\sum_{t=1}^{T_0} (y_{jt} - \hat{y}_{jt})^2. \]
In particular, we discard any state whose pre-intervention error is greater than 1600.
<- data.frame(row.names=1970:2000)
prop99_placebos
for (state in allstates) {
# Estimate the treatment effect for the state
<- prepare_data(state)
prop99_prep <- synth(data.prep.obj = prop99_prep)
prop99_synth <- prop99_prep$Y0plot%*%prop99_synth$solution.w
synth_control <- prop99_prep$Y1plot - synth_control
tax_effects
# Only consider state with small pre-intervention error
if(mean(tax_effects[1:18]^2) < 1600) {
<- tax_effects[, 1]
prop99_placebos[, state]
} }
To visualize the permutation test, we plot the estimate of treatment effect and placebo effects.
plot(1970:2000, prop99_placebos$California, xlab = "Year",
ylab = "Estimated taxation effect (packs per capita)",
ylim = c(-80, 100), type = "l", lwd = 3)
for (state in colnames(prop99_placebos)) {
lines(1970:2000, prop99_placebos[, state], col='#00000050')
}abline(a = 0, b = 0, lty = 2, lwd = 2)
abline(v = 1988, lty = 3, lwd = 2)
The good thing about this test is that the \(p\)-value—the proportion of the controlled units whose estimates of the placebo effects are smaller than the treated unit’s synthetic control estimate—can be easily computed. For example, let us compute the \(p\)-value of the estimate in year 2000.
<- prop99_placebos["2000", ]
effect2000 <- mean(effect2000 < effect2000$California)
p_value
print(p_value)
[1] 0.04166667