15  Causal inference with regression

In this chapter, we use regression to estimate causal effects from data collected in a randomized experiment.

15.1 Regression for simple difference estimate

We start with causal estimation using a regression of the outcome on the treatment assignment variable.

\[ y_i = a + bz_i + \varepsilon_i. \]
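In a randomized experiment, the least-squares coefficient \(b\) on a binary treatment indicator is exactly the difference in mean outcomes between the treatment and control groups. A minimal sketch with simulated data (all numbers here are made up for illustration):

```r
# Hypothetical randomized experiment: true treatment effect is 5
set.seed(1)
n <- 100
z <- rbinom(n, 1, 0.5)             # random treatment assignment
y <- 50 + 5*z + rnorm(n, 0, 10)    # simulated outcomes

fit <- lm(y ~ z)
b <- coef(fit)["z"]
diff_means <- mean(y[z == 1]) - mean(y[z == 0])
all.equal(unname(b), diff_means)   # TRUE: slope equals difference in means
```

This identity is why the simple regression of the outcome on the treatment indicator reproduces the difference estimate.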

We use the Electric Company data, from a randomized experiment that studied the effect of a new educational TV program, The Electric Company, on children's reading abilities. The experiment used a matched-pairs design: in each participating school, the two classes in each grade with the lowest reading scores formed a pair, and one class in each pair was randomly assigned to watch the program. The students in these classes took a pre-test at the beginning of the school year and a post-test at the end. The data are contained in electric.csv.


Let us take a look at the data first.

electric <- read.csv("data/electric.csv")
head(electric)

  X post_test pre_test grade treatment supp pair_id
1 1      48.9     13.8     1         1    1       1
2 2      70.5     16.5     1         1    0       2
3 3      89.7     18.5     1         1    1       3
4 4      44.2      8.8     1         1    0       4
5 5      77.5     15.3     1         1    1       5
6 6      84.7     15.0     1         1    0       6

To see the effects of the TV program on the reading abilities, we plot the histograms of the post-test scores for both treatment and control groups. Here, the vertical lines are the averages.

Figure 15.1: The histograms of the post-test scores

We observe that the effect of the program varies across the grades. We fit a regression of the post-test score on the treatment indicator.

fit_1 <- stan_glm(post_test ~ treatment, data=electric,
                  refresh=0)

 family:       gaussian [identity]
 formula:      post_test ~ treatment
 observations: 192
 predictors:   2
            Median MAD_SD
(Intercept) 94.3    1.7  
treatment    5.7    2.5  

Auxiliary parameter(s):
      Median MAD_SD
sigma 17.6    0.9  

* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The estimate is \(5.7\) with a standard error of \(2.5\). Since Figure 15.1 indicates that the effect of the TV program varies across the grades, we might consider fitting separate regression models by grade. For each grade, we store the simulations of the difference estimate (the coefficient of treatment) in a column of a matrix named fit_2.

fit_2 <- array(NA, c(4000, 4))
colnames(fit_2) <- c("Grade 1", "Grade 2",
                     "Grade 3", "Grade 4")
for (k in 1:4) {
  model <- stan_glm(post_test ~ treatment, data=electric,
                    subset=(grade==k), refresh=0)
  fit_2[, k] <- as.matrix(model)[, 'treatment']
}

From these simulations, we can plot the \(50\%\) and \(95\%\) uncertainty intervals of the difference estimate for each grade.
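One way to obtain these intervals is to take quantiles of the simulation draws in each column. A sketch using a synthetic draw matrix in place of the actual fit_2 simulations (the means and spreads below are invented):

```r
# Synthetic stand-in for fit_2: 4000 draws x 4 grades
set.seed(2)
sims <- cbind("Grade 1" = rnorm(4000, 12, 3),
              "Grade 2" = rnorm(4000, 8, 2),
              "Grade 3" = rnorm(4000, 2, 1),
              "Grade 4" = rnorm(4000, 1, 1))

# Medians plus central 50% and 95% intervals for each grade
summ <- apply(sims, 2, quantile,
              probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
round(summ, 1)
```

The same call applied to the real fit_2 matrix gives the intervals plotted below.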

The point estimates of the per-grade difference in post-test scores, with 50% and 95% uncertainty intervals

The result agrees with Figure 15.1 that the program has a larger effect on lower grades. We also notice that the estimates for higher grades have lower standard errors.

15.2 Adding pre-treatment covariates to the model

We also have the pre-test scores as a covariate, which is highly correlated with the post-test scores. Adjusting for this covariate improves our estimate; otherwise, a chance imbalance in pre-test scores between the control and treatment groups could make the estimate inaccurate.

For each grade, we fit regression models with the pre-test scores \(x_i\):

\[ y = \beta_0 + \beta_1 z + \beta_2 x + \varepsilon. \]

Assume that the assignment is completely randomized, so it is independent of the pre-treatment covariate. This implies conditional ignorability: \(z \perp y^0,y^1\vert x\). Consequently,

\[\begin{align*} \mathbb{E}[y\vert z=1,x] &= \mathbb{E}[y^1\vert z=1,x] = \mathbb{E}[y^1\vert x] \\ \mathbb{E}[y\vert z=0,x] &= \mathbb{E}[y^0\vert z=0,x] = \mathbb{E}[y^0\vert x]. \end{align*}\]

This implies

\[\begin{align*} \tau_{\text{SATE}} &= \mathbb{E}[y^1-y^0] \\ &= \mathbb{E}_{x}\mathbb{E}[y^1\vert x] - \mathbb{E}_x\mathbb{E}[y^0\vert x] \\ &= \mathbb{E}_{x}\mathbb{E}[y\vert z=1,x] - \mathbb{E}_x\mathbb{E}[y\vert z=0,x] \\ &= \mathbb{E}_x[(\beta_0 + \beta_1 + \beta_2x) - (\beta_0 + \beta_2x)] \\ &= \beta_1. \end{align*}\]

Since the coefficient obtained from fitting the model, say \(\hat{\beta}_1\), is an unbiased estimator of \(\beta_1\), it is an unbiased estimator of the average causal effect \(\tau_{\text{SATE}}\) as well.
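We can check this unbiasedness, and the precision gain from adjusting for the covariate, with a small simulation (all coefficients and sample sizes here are hypothetical):

```r
# Simulated randomized experiment with a pre-treatment covariate x;
# the true treatment effect is 5
set.seed(3)
n <- 500
x <- rnorm(n, 100, 15)             # hypothetical pre-test scores
z <- sample(rep(0:1, n/2))         # complete randomization
y <- 10 + 5*z + 0.8*x + rnorm(n, 0, 5)

fit_unadj <- lm(y ~ z)
fit_adj   <- lm(y ~ z + x)
se <- function(f) summary(f)$coefficients["z", "Std. Error"]

coef(fit_adj)["z"]                        # close to the true effect 5
c(unadjusted = se(fit_unadj),
  adjusted   = se(fit_adj))               # adjusted SE is smaller
```

The adjusted estimate stays centered on the true effect while its standard error shrinks, which previews the comparison between fit_2 and fit_3 below.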

Let us estimate the average causal effect by fitting the regression model in R.

fit_3 <- array(NA, c(4000, 4))
colnames(fit_3) <- c("Grade 1", "Grade 2",
                     "Grade 3", "Grade 4")
for (k in 1:4) {
  model <- stan_glm(post_test ~ treatment + pre_test,
                    data=electric, subset=(grade==k), refresh=0)
  fit_3[, k] <- as.matrix(model)[, 'treatment']
}


Figure 15.2: The point estimate of per-grade SATE, adjusted for the pre-test scores with uncertainties

With the pre-test scores, the standard errors of the estimated treatment effects become smaller.

15.2.1 Regression with interactions

We can add the interaction between the treatment and covariates to our model as well. To illustrate this, we take the data of grade 4 students in the Electric Company study, and fit a regression with an interaction between the treatment and the pre-test score.

fit_4 <- stan_glm(post_test ~ treatment + pre_test
                  + treatment:pre_test,
                  data=electric, subset=(grade==4), refresh=0)

 family:       gaussian [identity]
 formula:      post_test ~ treatment + pre_test + treatment:pre_test
 observations: 42
 predictors:   4
 subset:       (grade == 4)
                   Median MAD_SD
(Intercept)        38.8    4.9  
treatment          14.3    9.0  
pre_test            0.7    0.0  
treatment:pre_test -0.1    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.2    0.3   

* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The fitted model is the following:

\[\begin{align*} y &= 38.8 + 14.3z + 0.7x -0.1zx + \varepsilon \\ &= 38.8 + (14.3-0.1x)z + 0.7x + \varepsilon. \end{align*}\]

Our estimate of the treatment effect is the coefficient of \(z\), which is \(14.3-0.1x\). This allows us to have a covariate-dependent causal estimate. For the grade 4 students, the minimum and maximum pre-test scores are 80 and 120, respectively, so the treatment effect varies from \(14.3-0.1\times 120=2.3\) to \(14.3-0.1\times 80=6.3\). This example illustrates that including the interaction allows us to see how the treatment effect varies with the covariates. We can also estimate SATE (and PATE) by averaging the causal estimates over all units in the sample:

\[ \hat{\tau}_{\text{SATE}} = \frac1{n}\sum_{i=1}^{n} (14.3-0.1x_i) = 14.3 - 0.1\bar{x}, \]

which can be computed in R as follows:

sims <- as.matrix(fit_4)
n_sims <- nrow(sims)
is_grade_4 <- (electric$grade == 4)

pretest_4 <- electric$pre_test[is_grade_4]
mean_pretest_4 <- mean(pretest_4)

# column 2 is the treatment coefficient; column 4 is the interaction
avg_effect <- sims[, 2] + sims[, 4]*mean_pretest_4
print(avg_effect[1:10])

 [1] 0.6597359 1.4787528 1.4710521 1.5079603 1.2501985 1.3890738 2.7665281
 [8] 0.6622269 2.8731758 1.1845089

With this, we can compute the point estimate and the uncertainty.

print(c(median(avg_effect), mad(avg_effect)))
[1] 1.7747740 0.6423454

We note that this is similar to the estimate of SATE shown in Figure 15.2. More generally, averaging the covariate-dependent estimates from the model with an interaction typically gives a result close to the estimate of the average causal effect from the model without the interaction.
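This closeness can be checked directly by fitting both models to the same data. A sketch with simulated data (the coefficients below are invented, with average effect \(10 - 0.05\bar{x} \approx 5\)):

```r
# Simulated data with a treatment-by-covariate interaction
set.seed(4)
n <- 2000
x <- rnorm(n, 100, 10)
z <- rbinom(n, 1, 0.5)
y <- 40 + 10*z + 0.7*x - 0.05*z*x + rnorm(n, 0, 1)

fit_int <- lm(y ~ z + x + z:x)   # with interaction
fit_add <- lm(y ~ z + x)         # without interaction

# Average of the per-unit effects b1 + b4*x_i equals b1 + b4*mean(x)
avg_int <- coef(fit_int)["z"] + coef(fit_int)["z:x"] * mean(x)
c(interaction_avg = unname(avg_int),
  additive = unname(coef(fit_add)["z"]))   # both close to 5
```

The two summaries differ slightly in finite samples, but both target the average causal effect.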

This way of obtaining a causal estimate extends to two or more interactions. For example, suppose we have two covariates \(x_1,x_2\) and interactions \(zx_1\) and \(zx_2\). The regression model is

\[\begin{align*} y &= \beta_0 + \beta_1z + \beta_2 x_1 + \beta_3 x_2 + \beta_4 zx_1 + \beta_5 zx_2 + \varepsilon \\ &= \beta_0 + (\beta_1+\beta_4x_1+\beta_5x_2)z + \beta_2 x_1 + \beta_3 x_2 + \varepsilon. \end{align*}\]

Thus, the estimated causal effect at covariate levels \(x_1\) and \(x_2\) is \(\beta_1+\beta_4x_1+\beta_5x_2\), and an estimate of SATE is \(\beta_1+\beta_4\bar{x}_1+\beta_5\bar{x}_2\).
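The two-covariate version can be sketched the same way (all coefficients in this simulation are invented; the covariates have mean approximately zero, so the true SATE is close to \(\beta_1 = 2\)):

```r
# Simulated data with two covariates and two treatment interactions
set.seed(5)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
z <- rbinom(n, 1, 0.5)
y <- 1 + 2*z + x1 - 0.5*x2 + 0.3*z*x1 - 0.2*z*x2 + rnorm(n, 0, 0.5)

fit <- lm(y ~ z + x1 + x2 + z:x1 + z:x2)
b <- coef(fit)

# SATE estimate: b1 + b4*mean(x1) + b5*mean(x2)
sate_hat <- b["z"] + b["z:x1"]*mean(x1) + b["z:x2"]*mean(x2)
unname(sate_hat)   # close to 2
```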

15.2.2 Do not add post-treatment covariates to the regression

In general, one should not adjust for post-treatment covariates (sometimes referred to as mediators). Here, adjusting for a variable means adding it as a regression input. To see why, we consider the following example of a study of the effect of child care services on the child's intelligence.

  • \(y\) is the child’s IQ score.
  • \(z\) is the treatment assignment.
  • \(x\) is a pre-treatment covariate that indicates whether both parents have a high school education.
  • \(q\) is a post-treatment covariate that measures parenting quality.

Suppose that the relationship between these variables follows the linear model:

\[ y = \beta_0 + \beta_1z + \beta_2x + \beta_3q + \varepsilon. \tag{15.1}\]

However, in most cases, the post-treatment covariate is not independent of the treatment assignment. For example, we could have the following relationship between \(q\) and \(z\):

\[ q = 1 + 0.2z + \varepsilon'. \]

Assume further that the relationship between \(q\) and \(z\) follows SUTVA, which implies that \(\varepsilon'\) is the same under \(z=0\) and \(z=1\). Let us compare two families with the same parenting quality and the same high school education, but with different treatments:

\[\begin{align*} \text{Family 1: }&z_1=0, q_1=1 \\ \Rightarrow\quad & q^0_1=1, q^1_1 = 1.2, \varepsilon'_1=0 \\ \text{Family 2: }&z_2=1, q_2=1 \\ \Rightarrow\quad & q^0_2=0.8, q^1_2 = 1, \varepsilon'_2=-0.2. \end{align*}\]

Thus, we are actually comparing two families with different parenting skills: \(q^0_1=1\) versus \(q^0_2=0.8\). This example illustrates that, among families with the same observed parenting quality (\(q\)) and high school education (\(x\)), there can still be variation in the underlying parenting skill (\(q^0\)). To estimate the average causal effect, we would want to control for \(q^0\), the parenting skill absent treatment, since that is what relates to the child's IQ before the treatment. This suggests that the coefficient \(\beta_1\) in Equation 15.1 is not an appropriate estimate of the average causal effect.
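A small simulation in the spirit of this example (all coefficients below are invented) shows the bias concretely. Here parenting quality \(q\) responds to treatment and shares a latent cause, parenting skill, with the outcome:

```r
# Simulated post-treatment bias; the true effect of z on y is 5
set.seed(6)
n <- 5000
skill <- rnorm(n)                      # latent parenting skill
z <- rbinom(n, 1, 0.5)                 # randomized treatment
q <- 1 + 0.2*z + skill                 # parenting quality responds to z
y <- 100 + 5*z + 10*skill + rnorm(n)

b_good <- coef(lm(y ~ z))["z"]         # close to the true effect 5
b_bad  <- coef(lm(y ~ z + q))["z"]     # biased: since skill = q - 1 - 0.2*z,
                                       # y = 90 + 3*z + 10*q + error,
                                       # so this estimate is close to 3
c(unadjusted = unname(b_good), adjusted = unname(b_bad))
```

Adjusting for \(q\) lets the regression absorb part of the treatment's pathway and compare units with different latent skill, distorting the estimate.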