22  Synthetic control

In difference-in-differences, we saw that it is possible to estimate the treatment effect if the treated and controlled units are similar before applying the treatment. But more often than not, we might not find a controlled unit that exactly matches the treated unit that, we have, especially when there are many confounders to adjust for.

The idea of synthetic control is to make a fake unit that resembles the pre-treatment treated unit by combining the existing controlled units.

22.1 Example: study of the effect of taxation on cigarette consumption

The method of synthetic control was demonstrated in a study of the effect of cigarette taxation on its consumption (Abadie, Diamond, and Hainmueller 2010). One could argue that imposing the tax would decrease cigarette consumption. On the other hand, since cigarettes are addictive, the consumption would not decrease by much.

In particular, Abadie, Diamond, and Hainmueller studied the effect of Proposition 99 which imposes several restrictions on cigarette sales in California since 1988: a 25-cent tax per cigarette pack, a ban on cigarette vending machines in public areas accessible by juveniles, and a ban on the sale of single cigarettes.

To study the effect of Proposition 99 using synthetic control, the authors collected the data of cigarette consumption in California and other states, before and after imposing the statute. The data prop99.csv contains the information regarding cigarette sales and taxes across all states from 1970 to 2000, on which we shall perform a couple of preprocessing steps:

  1. Add a column that represents states by numbers.
  2. There were several states that imposed similar cigarette restrictions. These states do not represent those in the control group, so we have to remove them.
  3. There are several information regarding Proposition 99 in the SubMeasureDesc column, but we will only use the cigarette consumption as the outcome.
library(Synth)
# Load and preprocess the cigarette consumption data
prop99 <- read.csv("data/prop99.csv")

# Add a column that represents states by numbers
prop99$state <- as.numeric(factor(prop99$LocationDesc))

# Remove states that imposed similar cigarette restrictions
bad_states <- c('Massachusetts', 'Arizona', 'Oregon',
                'Florida', 'Alaska', 'Hawaii', 'Maryland',
                'Michigan', 'New Jersey', 'New York',
                'Washington', 'District of Columbia')
prop99 <- prop99[
    prop99$SubMeasureDesc=="Cigarette Consumption (Pack Sales Per Capita)" &
    !(prop99$LocationDesc %in% bad_states),
    ]

prop99 <- prop99[, c("state", "LocationDesc", "Year", "Data_Value")]

head(prop99)
   state LocationDesc Year Data_Value
2      1      Alabama 2014       61.7
20     4     Arkansas 2014       54.4
26     5   California 2014       22.7
32     6     Colorado 2014       36.7
38     7  Connecticut 2014       30.1
44     8     Delaware 2014       71.2

We also need additional information about these states in order to “match” pre-intervention California with the other states. We will use the smoking.rda data which contains cigarette sales cigsale , average log of income lnincome , beer sales beer , proportion of 15-24 age group in the population age15to24 , and the cigarette’s retail price retprice.

# Load the state data
load("data/smoking.rda")

head(smoking)
  state year cigsale lnincome beer age15to24 retprice
1     1 1970    89.8       NA   NA 0.1788618     39.6
2     1 1971    95.4       NA   NA 0.1799278     42.7
3     1 1972   101.1 9.498476   NA 0.1809939     42.3
4     1 1973   102.9 9.550107   NA 0.1820599     42.1
5     1 1974   108.2 9.537163   NA 0.1831260     43.1
6     1 1975   111.7 9.540031   NA 0.1841921     46.6

Now we join these two dataframes by the state numbers and the years. We also make a new variable allstates that stores the remaining 39 states.

# Join the two dataframes
colnames(smoking)[2] <- "Year"
prop99_full <- merge(prop99, smoking, by=c("state", "Year"))

# Obtain the list of states
allstates <- unique(prop99_full$LocationDesc)

head(prop99_full)
  state Year LocationDesc Data_Value cigsale lnincome beer age15to24 retprice
1     1 1970      Alabama       89.8    89.8       NA   NA 0.1788618     39.6
2     1 1971      Alabama       95.4    95.4       NA   NA 0.1799278     42.7
3     1 1972      Alabama      101.1   101.1 9.498476   NA 0.1809939     42.3
4     1 1973      Alabama      102.9   102.9 9.550107   NA 0.1820599     42.1
5     1 1974      Alabama      108.2   108.2 9.537163   NA 0.1831260     43.1
6     1 1975      Alabama      111.7   111.7 9.540031   NA 0.1841921     46.6

Let us visualize and compare the cigarette consumption in California and Colorado. We also plot a vertical line that splits between the pre-intervention period (1970-1987) and the post-intervention period (1988-2000).

ca_data <- prop99[prop99$LocationDesc == "California", ]
co_data <- prop99[prop99$LocationDesc == "Colorado", ]

plot(ca_data$Year, ca_data$Data_Value, type = "l",
     ylab = "Per-capita cigarette sales (packs)",
     xlab = "Year", ylim = c(20, 140),
     col = "blue", lwd = 3)
lines(co_data$Year, co_data$Data_Value, lty = 2,
      col = "red", lwd = 3)
abline(v = 1988, lty = 3, lwd = 2)
legend("topright",
       legend = c("California",
                  "Colorado",
                  "Proposition 99"),
       col = c("blue", "red", "black"),
       lty = 1:3, lwd = 3)

The trends of the cigarette consumption between these two states are similar up until 1988. After that, there is a gap which might be a result of cigarette taxation in California.

22.2 The method of synthetic control

We now explain the method of constructing a new controlled unit that has similar features to those of the treated unit before the intervention. Let us start with a simple case of only one covariate \(x\). Here, we introduce some notations:

  • Suppose that there are \(J\) units: Unit \(1\) is the treated unit and unit \(2,\ldots,J\) are the controlled units.
  • Suppose that the outcomes were observed at time \(t=1,\ldots,T\), and the effect of the intervention occurred at time \(T_0<T\).
  • \(y_{jt}\) is the outcome of \(j\)-th unit at time \(t\).
  • \(x_{jt}\) is the covariate of \(j\)-th unit at time \(t\).

The idea is to construct Unit \(1\)’s synthetic covariate \(\hat{x}_{1t}\) as a linear combination of the controlled units’ covariates:

\[\hat{x}_{1t} = w_2x_{2t}+\ldots + w_Jx_{Jt}; \qquad t=1,\ldots, T\]

where \(w_2,\ldots, w_J\) are positive weights that sum to one. Our goal now is to find the combination of weights so that Unit \(1\)’s synthetic covariate \(\hat{x}_{1t}\) is “close” to its actual covariate \(x_{1t}\); this problem can be cast as a linear regression of \(x_{1t}\) on \(x_{2t},\ldots , x_{Jt}\), with \(w_2,\ldots, w_J\) as the coefficients. The diagram below illustrates our regression problem:

G n0 1 2 3 4 5 6 n1 x₂₁ x₂₂ x₃₁ x₃₂ x₄₁ x₄₂ x₅₁ x₅₂ times × n2 w₂ w₃ w₄ w₅ equal = n3 x̂₁₁ x̂₁₂ Unit ⟵  Unit  ⟶ Time Time W W U1 Unit 1

Suppose that we have an additional covariate \(z\), we can stack them next to \(x\) as follows:

G n0 1 2 3 1 2 3 n1x x₂₁ x₃₁ x₄₁ x₅₁ n1z z₂₁ z₃₁ z₄₁ z₅₁ times × n2 w₂ w₃ w₄ w₅ equal = n3x x̂₁₁ n3z ẑ₁₁ Unit ⟵  Unit  ⟶ Time1 Time Time2 Time W W U1 Unit 1

Now suppose that we have \(m\) covariates: \(x^1_{jt},\ldots,x^m_{jt}\), which give rise to a synthetic covariate \(\hat{x}^k_{1t}=\sum_{j=2}^J w_j x^k_{jt}\) for \(k=1,\ldots,m\). We can cast the linear regression as minimizing the least-squares objective with respect to \(w_2,\ldots,w_J\).

\[\begin{align*} & \min_{w_2,\ldots,w_J} \sum_{k=1}^m\sum_{t=1}^T\left(x^k_{1t} - \sum_{j=2}^J w_j x^k_{jt}\right)^2\\ &\text{subject to}\quad w_2+\ldots+w_J = 1. \end{align*}\]

However, some covariates might be more important than the others in predicting the outcome; for example, one of our covariates is the cigarette sales, which should be more predictive of the cigarette consumption than the beer sales. We thus multiply the term associated with the \(k\)-th covariate by a positive weight \(v_k\) to reflect its relative importance:

\[\begin{align} &\min_{w_2,\ldots,w_J} \sum_{k=1}^m \sum_{t=1}^T v_k \left(x^k_{1t} - \sum_{j=2}^J w_j x^k_{jt}\right)^2 \label{eq-opt}\tag{1} \\ &\text{subject to}\quad w_2+\ldots+w_J = 1. \end{align}\]

We will discuss on how to choose \(v_1,\ldots,v_m\) in a moment, but let us assume for now that these weights are known.

Solving \(\eqref{eq-opt}\) yields a solution \(\hat{w}_2,\ldots,\hat{w}_J\). We then use this solution to obtain Unit \(1\)’s synthetic outcomes \(\hat{y}_{1t}\) as follows:

\[\begin{equation} \hat{y}_{1t} = \hat{w}_2y_{2t}+\ldots + \hat{w}_Jy_{Jt}, \qquad t=1,\ldots, T. \label{eq-y}\tag{2}. \end{equation}\]

The diagram below illustrates the relationship between Unit \(1\)’s synthetic and actual outcomes.

G n0 1 2 3 4 5 6 n1 y₂₁ y₂₂ y₃₁ y₃₂ y₄₁ y₄₂ y₅₁ y₅₂ times × n2 ŵ₂ ŵ₃ ŵ₄ ŵ₅ equal = n3 ŷ₁₁ ŷ₁₂ Unit ⟵  Unit  ⟶ Time Time W W U1 Unit 1

We can now continue our discussion on the choice of the weights \(v_1,\ldots,v_m\). Since the pre-intervention outcomes of the synthetic control should be similar to those of Unit \(1\), we typically choose \(v_1,\ldots,v_m\) that minimize the corresponding least-squares objective:

\[\begin{align} &\min_{v_1,\ldots,v_m} \sum_{t=1}^{T_0-1} \left(y_{1t} - \sum_{j=2}^J \hat{w}_j y_{jt}\right)^2 \label{eq-v} \tag{3}\\ &\text{where}\quad \hat{w}_2,\ldots,\hat{w}_{J} \text{ solve \eqref{eq-opt}}. \end{align}\]

Notice that the sum only consists of the pre-intervention data. After solving \(\eqref{eq-v}\) for \(v_1,\ldots,v_m\), we obtain the associated \(\hat{w}_2,\ldots,\hat{w}_{J}\)—inserting these into \(\eqref{eq-y}\) yields our estimate of Unit \(1\)’s post-intervention counterfactual outcome \(\hat{y}_{1t}, t\geq T_0\). We can then estimate the causal effect using:

\[\begin{equation} \hat{\tau}_{1t} = y_{1t} - \hat{y}_{1t}, \qquad t=T_0,\ldots,T. \label{eq-tau}\tag{4} \end{equation}\]

22.3 Synthetic control in R using the Synth package

Now we continue our example of synthetic control in R. We already have the preprocessed dataframe prop99_full which contains both the covariates and the outcomes over 39 states. We can then use the dataprep function provided in the Synth package to split the data into dataframes of the covariates and the outcomes for synthetic control. For simple usage, we define a wrapper function named prepare_data that only asks for the name of the treated unit. Full descriptions of the parameters of dataprep can be found here.

# Prepare data for synthetic control of a specified state
prepare_data <- function(state) {
    return(
        dataprep(
            foo = prop99_full,
            predictors = c("cigsale", "lnincome", "beer", "age15to24"),
            predictors.op = "mean",
            time.predictors.prior = 1970:1987,
            dependent = "Data_Value",
            unit.variable = "state",
            unit.names.variable = "LocationDesc",
            time.variable = "Year",
            treatment.identifier = state,
            controls.identifier = allstates[!allstates == state],
            time.optimize.ssr = 1970:1987,
            time.plot = 1970:2000
        )
    )
}

prop99_prep <- prepare_data("California")

To perform the method of synthetic control, we simply call the synth function on the prepared data.

# Perform synthetic control
prop99_synth <- synth(data.prep.obj = prop99_prep)

After that, we call synth.tab on the output and the actual data to create tables summarizing the optimal weights \(\hat{v}_1,\ldots,\hat{v}_m\) and \(\hat{w}_2,\ldots,\hat{w}_J\) and the corresponding minimum values of the lease-squares \(\eqref{eq-opt}\) and \(\eqref{eq-v}\).

# Table summarizing the result
prop99_tab <- synth.tab(prop99_synth, prop99_prep)

prop99_tab$tab.w
   w.weights     unit.names unit.numbers
1      0.000        Alabama            1
4      0.000        Georgia           11
6      0.000          Idaho           13
7      0.000       Illinois           14
8      0.000        Indiana           15
11     0.000           Iowa           16
13     0.000         Kansas           17
14     0.000       Kentucky           18
15     0.000      Louisiana           19
16     0.002          Maine           20
17     0.000      Minnesota           24
18     0.000    Mississippi           25
19     0.000       Missouri           26
20     0.000        Montana           27
24     0.000       Nebraska           28
25     0.000         Nevada           29
26     0.000  New Hampshire           30
27     0.429     New Mexico           32
28     0.508 North Carolina           34
29     0.000   North Dakota           35
30     0.000           Ohio           36
32     0.000       Oklahoma           37
34     0.061   Pennsylvania           39
35     0.000       Arkansas            4
36     0.000       Colorado            6
37     0.000    Connecticut            7
39     0.000       Delaware            8

We can see that North Carolina is the most representative state in the synthetic control, while New Mexico is the second most.

Unfortunately, synth only outputs the weights and the least-squares minimum. To compute the synthetic outcome \(\eqref{eq-y}\), we have to multiply the controlled’s outcome matrix, which is stored in the prepared data’s Y0plot column, and the vector of optimal weights, stored in the result’s solution.w column.

print(prop99_prep$Y0plot[1:5, 1:8])
         1    11    13    14    15    16    17    18
1970  89.8 100.3 124.8 120.0 155.0 109.9 102.4 124.8
1971  95.4 104.1 125.5 117.6 161.1 115.7 108.5 125.6
1972 101.1 103.9 134.3 110.8 156.3 117.0 126.1 126.6
1973 102.9 108.0 137.9 109.3 154.7 119.8 121.8 124.4
1974 108.2 109.7 132.8 112.4 151.3 123.7 125.6 131.9
print(prop99_synth$solution.w[1:8])
[1] 2.518675e-06 8.593394e-06 4.654868e-06 3.732791e-06 5.340892e-08
[6] 1.424996e-05 1.815872e-06 1.289167e-06
# Calculate the outcomes of the synthetic control
synth_control <- prop99_prep$Y0plot%*%prop99_synth$solution.w

tail(synth_control)
     w.weight
1995 91.60042
1996 89.35529
1997 89.43415
1998 86.59770
1999 86.62141
2000 78.64532

To plot both the actual outcome and the synthetic outcome, we can use the path.plot function. Here, we also plot a vertical line that indicates the intervention (the enactment of Proposition 99) in 1988.

# Plot the outcomes of the treated unit and the synthetic control
path.plot(prop99_synth, prop99_prep,
          tr.intake = 1988,
          Ylab = "Per-capita cigarette sales (packs)")

To compute an estimate of the treatment effect using \(\eqref{eq-tau}\), we subtract the treated’s outcome, stored in the prepared data’s Y1plot column, by the synthetic control’s outcome.

# Calculate an estimate of the treatment effect
tax_effects <- prop99_prep$Y1plot - synth_control

tail(tax_effects)
             5
1995 -35.20042
1996 -34.85529
1997 -35.63415
1998 -34.29770
1999 -39.42141
2000 -37.04532

We can also plot our estimate of the treatment effect over time using gaps.plot function.

gaps.plot(prop99_synth, prop99_prep,
          tr.intake = 1988,
          Main = "Treated outcome - Synthetic outcome",
          Ylab = "Estimated taxation effect (packs per capita)")

22.4 Permutation test

Without a frame of reference, it is not clear if our estimate displayed above is statistically significant. To show this, we can use a permutation test, whose hypotheses are:

\[\begin{align*} H_0 &: \text{The effects of treatment on the treated and the controlled are the same.} \\ H_1 &: \text{The effect of treatment on the treated is less than that on the controlled.} \end{align*}\]

To perform the test, we follow the same procedure as above for all other units to obtain estimates of the placebo effects and see if the estimate of the treatment effect is sufficiently smaller than that of the placebo effects. In addition, we only consider units whose pre-intervention synthetic outcomes are close to the actual outcomes in terms of the mean-squared error (MSE):

\[ \text{MSE}_j = \frac1{T_0}\sum_{t=1}^{T_0} (y_{jt} - \hat{y}_{jt})^2. \]

In particular, we discard any state whose pre-intervention error is greater than 1600.

prop99_placebos <- data.frame(row.names=1970:2000)

for (state in allstates) {
    # Estimate the treatment effect for the state
    prop99_prep <- prepare_data(state)
    prop99_synth <- synth(data.prep.obj = prop99_prep)
    synth_control <- prop99_prep$Y0plot%*%prop99_synth$solution.w
    tax_effects <- prop99_prep$Y1plot - synth_control

    # Only consider state with small pre-intervention error
    if(mean(tax_effects[1:18]^2) < 1600) {
        prop99_placebos[, state] <- tax_effects[, 1]
    }
}

To visualize the permutation test, we plot the estimate of treatment effect and placebo effects.

plot(1970:2000, prop99_placebos$California, xlab = "Year",
     ylab = "Estimated taxation effect (packs per capita)",
     ylim = c(-80, 100), type = "l", lwd = 3)

for (state in colnames(prop99_placebos)) {
    lines(1970:2000, prop99_placebos[, state], col='#00000050')
}
abline(a = 0, b = 0, lty = 2, lwd = 2)
abline(v = 1988, lty = 3, lwd = 2)

The good thing about this test is that the \(p\)-value—the proportion of the controlled units whose estimates of the placebo effects are smaller than the treated unit’s synthetic control estimate—can be easily computed. For example, let us compute the \(p\)-value of the estimate in year 2000.

effect2000 <- prop99_placebos["2000", ]
p_value <- mean(effect2000 < effect2000$California)

print(p_value)
[1] 0.04166667

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505. https://doi.org/10.1198/jasa.2009.ap08746.