16  Causal inference with observational data

Observational data refers to data obtained by observing an event of interest; for example, data on the outcomes of a costly treatment that is only applied to patients with extremely poor health conditions. In this setting, we could have a covariate that affects both the treatment and the outcome; such a covariate is usually called a confounding covariate, or confounder.

[Causal diagram: Health condition \(x\) → Treatment \(z\), Health condition \(x\) → Outcome \(y\), Treatment \(z\) → Outcome \(y\)]

Here, “Health condition” is a confounder. Since the data does not come from a randomized experiment, the difference-of-means estimate is unsuited for estimating the average causal effect: we can imagine that the average outcome among the treated patients (the patients with poor health conditions) must be lower than the average of the treated outcomes \(y^1\), and that the average outcome among the control patients (the patients with better health conditions) must be higher than the average of the control outcomes \(y^0\).

16.1 Assumption in an observational study

Even when the data is not obtained from a randomized experiment, we can turn the causal estimation problem into a linear regression problem as long as the ignorability assumption is satisfied. In an observational study, we must make sure that the variables satisfy this assumption:

\[ y^0, y^1 \perp z \vert x, \]

which is similar to the assumption for a randomized block experiment. The difference is that, in an observational study, the assumption is not implied by the design of the experiment, but by our prior knowledge of the relationships among the variables. If the ignorability assumption holds, the average causal effect can be estimated using the coefficient \(\beta_1\) of the treatment assignment in the regression model:

\[ y = \beta_0 + \beta_1z + \beta_2x + \varepsilon. \]

The proof of this can be found in Section 15.2.

So far, we have discussed causal estimation when there is only a single observed confounder. In general, the causal effect can be estimated as long as all confounders are observed. If some confounders are unobserved, we risk introducing bias into our estimate.
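
As a quick illustration, here is a minimal simulation sketch (with made-up coefficients and sample size) in which a single confounder \(x\) drives both the treatment assignment and the outcome; once we adjust for \(x\), the coefficient of the treatment recovers the true causal effect. A plain lm fit is used here for brevity, but stan_glm could be used in the same way.

set.seed(1)
n <- 1000
x <- rnorm(n)                          # confounder
z <- rbinom(n, 1, plogis(2 * x))       # treatment assignment depends on x
y <- 1 + 5 * z + 3 * x + rnorm(n)      # true causal effect of z is 5

coef(lm(y ~ z + x))["z"]               # close to 5 after adjusting for x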

16.1.1 Omitted variable bias

We can quantify the bias from omitting the confounder \(x\) when the relationship between the variables can be described with a linear regression model:

\[ y = \beta_0 + \beta_1 z + \beta_2 x + \varepsilon. \tag{16.1}\]

Suppose that we were not aware of the potential confounder \(x\) and fit a misspecified model:

\[ y = \beta'_0 + \beta'_1 z + \varepsilon', \tag{16.2}\]

where \(\beta'_0\) and \(\beta'_1\) are another set of coefficients. To measure the bias introduced by using this model, we fit a regression of the confounder \(x\) on the treatment \(z\):

\[ x = \gamma_0+ \gamma_1z + \varepsilon''. \tag{16.3}\]

Substituting Equation 16.3 back into Equation 16.1 yields

\[ y = \beta_0 + \beta_2\gamma_0 + (\beta_1 + \beta_2\gamma_1) z + \varepsilon+\beta_2\varepsilon''. \tag{16.4}\]

Equating the coefficient of \(z\) in Equation 16.2 and Equation 16.4 yields

\[ \beta'_1 = \beta_1 + \beta_2\gamma_1. \]

We can see that our causal estimate \(\beta'_1\) is biased by \(\beta_2\gamma_1\). This also implies that the bias vanishes if the covariate \(x\) is not associated with the treatment (\(\gamma_1=0\)) or if the covariate is not associated with the outcome (\(\beta_2=0\)); in either case, omitting \(x\) does not bias the estimate.
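
When each of these models is fit by least squares, the same relation holds exactly for the estimated coefficients. Here is a small simulation sketch (with made-up coefficients) that checks this numerically:

set.seed(2)
n <- 1000
x <- rnorm(n)                        # confounder
z <- rbinom(n, 1, plogis(x))         # treatment depends on x
y <- 1 + 5 * z + 3 * x + rnorm(n)    # data generated from Equation 16.1

beta   <- coef(lm(y ~ z + x))        # Equation 16.1: full model
beta_m <- coef(lm(y ~ z))            # Equation 16.2: misspecified model
gamma  <- coef(lm(x ~ z))            # Equation 16.3: confounder on treatment

beta_m["z"] - (beta["z"] + beta["x"] * gamma["z"])   # zero up to rounding error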

16.1.2 Imbalance of confounder distributions

An observed confounder is imbalanced when the distribution of the confounder for the treatment group differs from that of the control group. Examples of an imbalanced confounder \(x\) are shown in the following plots:

In the left plot, the distributions of \(x\) for the control and treatment groups have different means, while in the right plot, assuming that the mean is non-zero, the distributions have different second moments.

Causal estimation with imbalanced confounder distribution would force us to rely more on the correctness of our model. For example, suppose that the true model of the population is:

\[\begin{align*} \text{Treatment: }y &= \beta_0 + \beta_1x + \beta_2x^2 + \theta + \varepsilon \\ \text{Control: }y &= \beta_0 + \beta_1x + \beta_2x^2 + \varepsilon. \end{align*}\]

The causal effect \(\theta\) can then be estimated by averaging each equation over its group and subtracting, which yields

\[ \theta = \bar{y}_1 - \bar{y}_0 - \beta_1 (\bar{x}_1 - \bar{x}_0) - \beta_2(\bar{x_1^2} - \bar{x_0^2}), \]

where \(\bar{y}_1, \bar{x}_1, \bar{x_1^2}\) are the averages of the treatment group and \(\bar{y}_0, \bar{x}_0, \bar{x_0^2}\) are the averages of the control group.

Suppose that we wanted to keep it simple and estimate the causal effect using the difference between the means:

\[ \theta' = \bar{y}_1 - \bar{y}_0. \]

Then our estimate would be off from the true value by \(\beta_1 (\bar{x}_1 - \bar{x}_0) + \beta_2(\bar{x_1^2} - \bar{x_0^2})\). This bias is small if the confounder distributions for the treatment and control groups are almost identical, which implies \(\bar{x}_1 \approx \bar{x}_0\) and \(\bar{x_1^2} \approx \bar{x_0^2}\). On the other hand, if the distributions are vastly different, the bias can become large.
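
The following simulation sketch (with made-up values of \(\beta\), \(\theta\), and an imbalanced confounder) compares the raw difference in means with the estimate obtained after subtracting the bias term derived above:

set.seed(3)
n <- 500
x1 <- rnorm(n, mean = 2)             # confounder in the treatment group
x0 <- rnorm(n, mean = 0)             # confounder in the control group (imbalanced)
b0 <- 1; b1 <- 2; b2 <- 0.5; theta <- 3
y1 <- b0 + b1 * x1 + b2 * x1^2 + theta + rnorm(n)
y0 <- b0 + b1 * x0 + b2 * x0^2 + rnorm(n)

theta_naive <- mean(y1) - mean(y0)   # difference between the means
bias <- b1 * (mean(x1) - mean(x0)) + b2 * (mean(x1^2) - mean(x0^2))
c(theta_naive, theta_naive - bias)   # the corrected value is close to theta = 3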

16.1.3 Lack of complete overlap

Overlap (or common support) is the intersection of the ranges of the confounder data for the treatment and control groups. We say that the distributions have complete overlap if their ranges coincide. Lack of complete overlap in the confounders leads to a causal estimation problem, because for some observed values of the confounder we have no information on the counterfactual outcomes. The plots below show examples based on the Electric Company data, which has the pre-test score as a confounder. Here, the solid curves are the true relationships between the post-test score and the pre-test score for the treatment group (black dots) and the control group (gray dots). The dashed lines in the left plot are regression lines of the post-test scores on the treatment and the pre-test score, while the dashed lines on the right also allow for an interaction between the two predictors. The causal effect at any level of pre-test score is simply the vertical distance between the two solid curves.

As the confounder distributions for the treatment and control groups do not completely overlap, our causal estimate (the vertical distance between the dashed lines) substantially underestimates the true average treatment effect (the vertical distance between the solid curves).

Nonetheless, it is still possible to estimate the treatment effect in the region where the confounder is observed for both groups. As shown in the plots below, by restricting our analysis to this region and fitting a linear regression (without or with an interaction) as before, we obtain an estimate of the treatment effect that is accurate within this region.
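
As a rough sketch of how this restriction could be carried out in R (on simulated data with made-up coefficients, since the plots above are only illustrative), we keep the observations whose pre-test scores fall inside the intersection of the two ranges before fitting the regression with an interaction:

# simulated data imitating the situation above (made-up numbers)
set.seed(4)
treat   <- data.frame(z = 1, pre_test = rnorm(100, mean = 60, sd = 10))
control <- data.frame(z = 0, pre_test = rnorm(100, mean = 40, sd = 10))
treat$post_test   <- 20 + 0.9 * treat$pre_test + 10 + rnorm(100, sd = 5)
control$post_test <- 20 + 0.9 * control$pre_test + rnorm(100, sd = 5)
both <- rbind(treat, control)

# region where the pre-test ranges of the two groups overlap
overlap_lo <- max(min(treat$pre_test), min(control$pre_test))
overlap_hi <- min(max(treat$pre_test), max(control$pre_test))
in_overlap <- both$pre_test >= overlap_lo & both$pre_test <= overlap_hi

# regression with an interaction, fit only on the region of overlap
lm(post_test ~ z * pre_test, data = both, subset = in_overlap)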

16.2 The Electric Company example

The Electric Company data that we used in the previous chapter in fact has an additional covariate: the teacher of each class in the treatment group had the choice of replacing the current regular reading program with the TV program or supplementing it; the choice is indicated by the covariate supp (\(1\) for supplementing, \(0\) for replacing, and NA for every control class).

library(bayesplot)
library(rstanarm)
electric <- read.csv("data/electric.csv")

head(electric)
  X post_test pre_test grade treatment supp pair_id
1 1      48.9     13.8     1         1    1       1
2 2      70.5     16.5     1         1    0       2
3 3      89.7     18.5     1         1    1       3
4 4      44.2      8.8     1         1    0       4
5 5      77.5     15.3     1         1    1       5
6 6      84.7     15.0     1         1    0       6
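
Before fitting any model, we can check that supp is recorded only for the treated classes; the cross-tabulation below (with NAs counted) should show NA exactly where treatment is \(0\):

table(electric$supp, electric$treatment, useNA="ifany")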

Suppose that we would like to estimate the causal effect of supplementing versus replacing among the classes that were assigned to watch the TV program. Assuming that the pre-test score also affects the choice of supplement (this is just for demonstration, as many other factors could affect this choice), the relationship between the variables is illustrated by the following graphical model:

[Causal diagram: pre_test → supp, pre_test → post_test, supp → post_test]

As pre_test affects both supp and post_test, it is a confounder, so we must adjust for it in our linear regression.

# posterior draws of the supp coefficient, one column per grade
fit_supp <- array(NA, c(4000, 4))
colnames(fit_supp) <- c("Grade 1", "Grade 2",
                        "Grade 3", "Grade 4")
for (k in 1:4) {
  # fit only on the treated classes in grade k (supp is NA for control classes)
  model <- stan_glm(post_test ~ supp + pre_test,
                    data=electric,
                    subset=(grade==k) & (!is.na(supp)),
                    refresh=0)
  fit_supp[, k] <- as.matrix(model)[, 'supp']
}

mcmc_intervals(fit_supp)

Figure 16.1: Point estimates, with uncertainties, of the per-grade effect of supplementing versus replacing, adjusted for the pre-test scores

We conclude from the plot that supplementing the regular reading program with the TV program is more effective than replacing it in the lower grades.
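
Because fit_supp contains the posterior draws of the supp coefficient, we can also summarize this comparison numerically, for instance with the posterior medians and the posterior probability that supplementing beats replacing in each grade:

apply(fit_supp, 2, median)   # posterior median effect of supplementing, per grade
colMeans(fit_supp > 0)       # posterior Pr(supplementing is better), per grade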

16.2.1 Examining overlap of the confounder distribution

We can plot histograms of the confounder (the pre-test score) for the treatment and control groups. In each plot, the pink histogram is that of the treatment group (the classes that supplemented), and the blue histogram is that of the control group (the classes that replaced).

# semi-transparent colors for overlaying the two histograms
blue <- rgb(173, 216, 230, maxColorValue=255,
            alpha=80, names="lt.blue")
pink <- rgb(255, 192, 203, maxColorValue=255,
            alpha=80, names="lt.pink")

par(mfrow=c(1,4))
for (k in 1:4){
    # subset the pre-test scores and supp indicators consistently for grade k
    in_grade <- electric$grade==k & !is.na(electric$supp)
    grade_k_scores <- electric$pre_test[in_grade]
    grade_k_supp <- electric$supp[in_grade]
    min_score <- min(grade_k_scores)
    max_score <- max(grade_k_scores)
    hist(grade_k_scores[grade_k_supp==0],
         breaks=seq(min_score, max_score, length.out=6),
         xlim=c(min_score-1, max_score+1),
         ylim=c(0, 7),
         main=paste("Grade", k), col=blue,
         xlab="Pre-test score",
         freq=TRUE)
    hist(grade_k_scores[grade_k_supp==1],
         breaks=seq(min_score, max_score, length.out=6),
         col=pink, freq=TRUE, add=TRUE)
}

We clearly see the imbalance between the treatment and control groups in Grade 1 and Grade 4, and there is a lack of complete overlap in Grade 3. In particular, some classes in Grade 3 that supplemented the regular reading program have average pre-test scores lower than those of all the classes that replaced it with the TV program. We should keep these observations in mind when assessing the accuracy of our causal estimates.
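
As a numerical complement to the histograms (a quick sketch, not part of the original analysis), we can also compare the average pre-test scores of the supplement and replacement classes within each grade:

# mean pre-test score by grade and supplement choice (treated classes only)
aggregate(pre_test ~ grade + supp, data = subset(electric, !is.na(supp)), FUN = mean)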