# 14 Basics of causal inference

## 14.1 A running example

Consider the following hypothetical scenario: over the past few decades, omega-3 fatty acids have been promoted and advertised as an effective supplement for reducing blood pressure. Being skeptical about this claim, you decide to investigate. You find eight friends who agree to join your experiment.

- Four of the friends were in the treatment group. They agreed to take the fish oil supplement every day for one year.
- The other four friends were in the control group. They agreed to simply maintain their usual diets, without any fish oil supplement.

After one year, you measure the systolic blood pressure of each of the eight participants. For simplicity, we consider a systolic blood pressure of 160 mmHg or higher as “high blood pressure”.

### 14.1.1 Potential outcomes, counterfactuals, and causal effects

To formalize the causal questions in this study, we introduce the following notation for each participant \(i\in\{1,\ldots,8\}\):

- Let \(z_i\) be the *treatment* variable of \(i\).
  - \(z_{i}=0\) if \(i\) is in the control group (i.e. he/she did not take any fish oil supplement).
  - \(z_{i}=1\) if \(i\) is in the treatment group (i.e. he/she had been taking the fish oil supplement).

- Let \(y_{i}^{0}\) and \(y_{i}^1\) be two *outcome* variables.
  - \(y_{i}^0\) denotes the blood pressure of \(i\) if he/she did not take any supplement.
  - \(y_{i}^1\) denotes the blood pressure of \(i\) if he/she had been taking the supplement.

The outcome variables \(y_{i}^{0}\) and \(y_{i}^1\) are commonly referred to as **potential outcomes**. It is important to note that the potential outcomes are assigned to all participants, regardless of whether or not they had received the treatment (i.e. the supplement).

Thus, for everyone in the control group (\(z_i=0\)), the value of \(y_i^0\) is observed, while that of \(y_i^1\) is unobserved. And for everyone in the treatment group (\(z_i=1\)), the value of \(y_i^1\) is observed while that of \(y_i^0\) is unobserved. Sometimes, we might be interested in *what would have happened* if a particular participant from the control group had received the treatment, or vice versa. In these cases, the outcomes of interest would be \(y_{i}^{1}\) and \(y_i^0\), respectively. Such outcomes are referred to as **counterfactual outcomes**. For each participant, it is impossible to directly measure his/her counterfactual outcome; this is commonly referred to as the **fundamental problem of causal inference**.

Let \(y_i\) be the observed outcome (not potential outcome) of person \(i\). We can express it in terms of the potential outcomes:

\[ y_{i} = y_i^0(1-z_i) + y_i^1z_i. \]

The **causal effect** of supplement versus non-supplement for person \(i\) is the difference between the two potential outcomes:

\[ \tau_i = y_i^1 - y_i^0. \]
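The two formulas above can be sketched in a few lines of Python. The numbers below are made-up potential outcomes for three hypothetical participants; both columns are assumed known purely for illustration, which is never the case in a real study.

```python
# Hypothetical potential outcomes (mmHg) for three participants.
# In practice only one of y0[i], y1[i] is ever observed.
y0 = [140, 150, 160]   # blood pressure without the supplement
y1 = [135, 140, 155]   # blood pressure with the supplement
z = [0, 1, 0]          # treatment assignment

# Observed outcome: y_i = y_i^0 (1 - z_i) + y_i^1 z_i
y_obs = [a * (1 - t) + b * t for a, b, t in zip(y0, y1, z)]

# Individual causal effect: tau_i = y_i^1 - y_i^0
tau = [b - a for a, b in zip(y0, y1)]

print(y_obs)  # [140, 140, 160]
print(tau)    # [-5, -10, -5]
```

The observed outcome simply picks out one potential outcome per participant, while computing `tau` requires both.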

Hypothetical data from the experiment are shown in the table below. For every participant, exactly one of the two potential outcomes is missing, so the individual causal effect cannot be computed.

| Unit \(i\) | Treatment \(z_i\) | Potential outcome #1 \(y_i^0\) | Potential outcome #2 \(y_i^1\) | Observed outcome \(y_i\) | Causal effect \(\tau_i\) |
|---|---|---|---|---|---|
| Alex | 0 | 140 | ? | 140 | ? |
| Anna | 0 | 140 | ? | 140 | ? |
| Bill | 0 | 150 | ? | 150 | ? |
| Bob | 0 | 150 | ? | 150 | ? |
| Cindy | 1 | ? | 155 | 155 | ? |
| Carol | 1 | ? | 155 | 155 | ? |
| Dan | 1 | ? | 160 | 160 | ? |
| Dave | 1 | ? | 160 | 160 | ? |

## 14.2 Average causal effects

We introduce several notions of causal effect.

First is the causal effect on an individual; this is sometimes called *individual treatment effect* (ITE).

\[ \text{individual treatment effect: } \tau_i = y_i^1-y_i^0. \]

The *sample average treatment effect* (SATE) is the average of the ITEs across all \(n\) units in the sample.

\[ \tau_{\text{SATE}} = \frac{1}{n}\sum_{i=1}^n(y_i^1-y_i^0). \]

The *conditional average treatment effect* (CATE) is the average treatment effect over a subset \(\mathcal{C}\) of the units, such as “men” or “people who received the treatment”.

\[ \tau_{\text{CATE}} = \frac{1}{\lvert\mathcal{C}\rvert}\sum_{c\in\mathcal{C}}(y_c^1-y_c^0), \]

where \(\lvert\mathcal{C}\rvert\) is the number of units in the subset \(\mathcal{C}\).

The *population average treatment effect* (PATE) is the average treatment effect across the population.

\[ \tau_{\text{PATE}} = \frac{1}{N}\sum_{i=1}^N(y_i^1-y_i^0). \]

If the sample is a random sample from the population, then SATE can be used to estimate PATE, and any unbiased estimator of SATE is also an unbiased estimator of PATE.
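As a small illustration of the definitions of ITE, SATE and CATE, the sketch below computes them for eight hypothetical units. Both potential outcomes are assumed known here, which is only possible in a made-up example; the subset \(\mathcal{C}\) (units aged 40) is likewise an arbitrary choice for illustration.

```python
from statistics import mean

# Hypothetical ages and potential outcomes for eight units
# (both potential outcomes assumed known, for illustration only).
age = [40, 40, 50, 50, 60, 60, 70, 70]
y0 = [140, 140, 150, 150, 160, 160, 170, 170]
y1 = [135, 135, 140, 140, 155, 155, 160, 160]

# Individual treatment effects: tau_i = y_i^1 - y_i^0
ite = [b - a for a, b in zip(y0, y1)]

# SATE: average ITE over the whole sample
sate = mean(ite)

# CATE: average ITE over the subset C = {units aged 40}
cate_40 = mean(t for t, x in zip(ite, age) if x == 40)

print(sate)     # -7.5
print(cate_40)  # -5
```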

Estimation of the average treatment effects is straightforward when the treatment is assigned completely at random.

## 14.3 Randomized experiments

### 14.3.1 Completely randomized experiments

In a *completely randomized experiment*, everyone in the sample is equally likely to be assigned to the treatment group or the control group. As a result, the average of the observed \(y^1_i\)’s in the treatment group and the average of the observed \(y^0_i\)’s in the control group are representative of the corresponding averages over the whole sample, and so (with \(n/2\) units in each group) we can estimate SATE with

\[ \tau = \frac1{n/2}\sum_{i, z_i=1}y_i^1 - \frac1{n/2}\sum_{i,z_i=0} y_i^0, \]

which is the same as the coefficient \(\tau\) of the regression \(y_i = a + \tau z_{i}\). In other words, in a completely randomized experiment, we can estimate SATE using a regression of the outcome on the treatment assignment. If the sample is representative of the population, we can also use \(\tau\) to estimate PATE.
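The equivalence between the difference in means (treated minus control) and the regression coefficient can be checked numerically. The sketch below uses hypothetical observed data and the closed-form least-squares slope for a single regressor, \(\operatorname{cov}(z,y)/\operatorname{var}(z)\); it is an illustration, not a substitute for a regression library.

```python
from statistics import mean

# Hypothetical observed data: z is the assignment, y the observed outcome.
z = [0, 1, 0, 1, 0, 1, 0, 1]
y = [140, 135, 150, 140, 160, 155, 170, 160]

# Difference in means: treated minus control.
diff = (mean(v for v, t in zip(y, z) if t == 1)
        - mean(v for v, t in zip(y, z) if t == 0))

# Least-squares fit of y = a + tau * z (closed form for one regressor).
zbar, ybar = mean(z), mean(y)
slope = (sum((t - zbar) * (v - ybar) for t, v in zip(z, y))
         / sum((t - zbar) ** 2 for t in z))
intercept = ybar - slope * zbar

print(diff)       # -7.5
print(slope)      # -7.5
print(intercept)  # 155.0
```

The intercept equals the control-group mean, and the slope equals the difference between the treated and control means.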

To illustrate how the random assignment affects the estimate of SATE, we compare two data sets, which also include the ages of the units. The first one is an ideal scenario for a randomized experiment. In each row, the bold potential outcome is the one actually observed, and the non-bold one is not observed.

Table 14.1: An ideal randomized experiment.

| Unit \(i\) | Age \(x_i\) | Treatment \(z_i\) | Potential outcome #1 \(y_i^0\) | Potential outcome #2 \(y_i^1\) | Observed outcome \(y_i\) |
|---|---|---|---|---|---|
| Alex | 40 | 0 | **140** | 135 | 140 |
| Anna | 40 | 1 | 140 | **135** | 135 |
| Bill | 50 | 0 | **150** | 140 | 150 |
| Bob | 50 | 1 | 150 | **140** | 140 |
| Cindy | 60 | 0 | **160** | 155 | 160 |
| Carol | 60 | 1 | 160 | **155** | 155 |
| Dan | 70 | 0 | **170** | 160 | 170 |
| Dave | 70 | 1 | 170 | **160** | 160 |

In this case, the simple difference in means, \(147.5-155 = -7.5\), is the same as the SATE of \(-7.5\).

Now let us compare this with a less ideal randomized scenario:

Table 14.2: A less ideal randomized experiment.

| Unit \(i\) | Age \(x_i\) | Treatment \(z_i\) | Potential outcome #1 \(y_i^0\) | Potential outcome #2 \(y_i^1\) | Observed outcome \(y_i\) |
|---|---|---|---|---|---|
| Alex | 40 | 1 | 140 | **135** | 135 |
| Anna | 40 | 1 | 140 | **135** | 135 |
| Bill | 50 | 1 | 150 | **140** | 140 |
| Bob | 50 | 0 | **150** | 140 | 150 |
| Cindy | 60 | 0 | **160** | 155 | 160 |
| Carol | 60 | 0 | **160** | 155 | 160 |
| Dan | 70 | 0 | **170** | 160 | 170 |
| Dave | 70 | 1 | 170 | **160** | 160 |

We can see that the treatment is assigned mostly to younger participants, and the difference in means, \(142.5 - 160 = -17.5\), substantially underestimates the SATE of \(-7.5\); that is, it exaggerates the reduction in blood pressure.
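The two comparisons can be reproduced with a short Python sketch. It uses the potential outcomes and the two assignment vectors from the tables in this section; the helper name `diff_in_means` is introduced here only for illustration.

```python
from statistics import mean

def diff_in_means(z, y0, y1):
    """Treated mean minus control mean, using only the observed outcomes."""
    treated = [b for t, b in zip(z, y1) if t == 1]
    control = [a for t, a in zip(z, y0) if t == 0]
    return mean(treated) - mean(control)

# Potential outcomes shared by both tables
# (Alex, Anna, Bill, Bob, Cindy, Carol, Dan, Dave).
y0 = [140, 140, 150, 150, 160, 160, 170, 170]
y1 = [135, 135, 140, 140, 155, 155, 160, 160]

z_balanced = [0, 1, 0, 1, 0, 1, 0, 1]    # assignment in the first table
z_imbalanced = [1, 1, 1, 0, 0, 0, 0, 1]  # assignment in the second table

sate = mean(b - a for a, b in zip(y0, y1))

print(sate)                                 # -7.5
print(diff_in_means(z_balanced, y0, y1))    # -7.5
print(diff_in_means(z_imbalanced, y0, y1))  # -17.5
```

The potential outcomes are identical in both cases; only the assignment differs, and that alone moves the estimate from \(-7.5\) to \(-17.5\).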

In many scenarios, the sample is not perfectly randomized, so we have to make some *adjustment* for the imbalance, a technique that we will introduce later.

### 14.3.2 Randomized blocks experiments

In some experiments, the participants can be divided into *blocks* according to the observed values of a subset of variables. If there are equal numbers of control and treated units within each block, as in Table 14.1, we can simply estimate SATE using the difference in means.

However, in some experiments, the ratio of control to treated units might differ across the blocks. In the fish oil supplement example, older people might be more in need of the supplement than younger people, so the researchers might mimic this pattern by assigning the treatment to more people in the older blocks than in the younger blocks, as shown in the table below.

| Unit \(i\) | Age \(x_i\) | Treatment \(z_i\) | Potential outcome #1 \(y_i^0\) | Potential outcome #2 \(y_i^1\) | Observed outcome \(y_i\) |
|---|---|---|---|---|---|
| Alex | 40 | 0 | **140** | 135 | 140 |
| Anne | 40 | 0 | **140** | 135 | 140 |
| Anna | 40 | 1 | 140 | **135** | 135 |
| Bill | 50 | 0 | **150** | 140 | 150 |
| Brad | 50 | 0 | **150** | 140 | 150 |
| Bob | 50 | 1 | 150 | **140** | 140 |
| Cindy | 60 | 0 | **160** | 155 | 160 |
| Carol | 60 | 1 | 160 | **155** | 155 |
| Chris | 60 | 1 | 160 | **155** | 155 |
| Dan | 70 | 0 | **170** | 160 | 170 |
| Dave | 70 | 1 | 170 | **160** | 160 |
| Drew | 70 | 1 | 170 | **160** | 160 |

The difference in means, \(-0.83\), overestimates the SATE of \(-7.5\); that is, it makes the supplement look far less effective than it is.

A better estimate of SATE can be obtained by first computing the difference in means within each block, and then taking a weighted average of these differences, with weights proportional to the number of units in each block.
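The block-weighted estimate can be sketched as follows, using the observed columns of the table above (age block, treatment, observed outcome). The helper name `diff_in_means` is introduced here for illustration.

```python
from statistics import mean

# Observed data from the table above: (age block, z, observed y).
data = [
    (40, 0, 140), (40, 0, 140), (40, 1, 135),
    (50, 0, 150), (50, 0, 150), (50, 1, 140),
    (60, 0, 160), (60, 1, 155), (60, 1, 155),
    (70, 0, 170), (70, 1, 160), (70, 1, 160),
]

def diff_in_means(rows):
    """Treated mean minus control mean for the given rows."""
    treated = [y for _, z, y in rows if z == 1]
    control = [y for _, z, y in rows if z == 0]
    return mean(treated) - mean(control)

# Naive estimate that ignores the blocks.
naive = diff_in_means(data)

# Weighted average of within-block differences,
# with weights proportional to block size.
weighted = 0.0
for b in sorted({blk for blk, _, _ in data}):
    rows = [r for r in data if r[0] == b]
    weighted += len(rows) / len(data) * diff_in_means(rows)

print(round(naive, 2))  # -0.83
print(weighted)         # -7.5
```

The within-block differences are \(-5\), \(-10\), \(-5\) and \(-10\); since every block has three units, each receives weight \(1/4\).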

Another way to obtain the estimate is to fit a linear regression of the outcome on the treatment variable and indicators for three of the four blocks:

\[ y_i = a + \tau_{\text{RB}}z_i + \beta_1b_{1i} + \beta_2b_{2i} + \beta_3b_{3i}. \]

Of course, this is an accurate estimator of SATE if there is little variation in the outcomes within each block, or in other words, if the blocking variable is highly predictive of the outcome. Thus, in a randomized blocks experiment, we should select blocking variables that are predictive of the outcome, based on either theory or results from previous studies.
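The coefficient \(\tau_{\text{RB}}\) can be computed without a regression library. The sketch below relies on the Frisch-Waugh-Lovell theorem: in a regression with a full set of block indicators, the coefficient on \(z_i\) equals the slope obtained after subtracting the block means from both \(z_i\) and \(y_i\). This is a swapped-in computational shortcut, not necessarily how the regression would be fitted in practice; the data are the observed columns of the unbalanced blocks table.

```python
from statistics import mean

# Observed data: (age block, z, observed y).
data = [
    (40, 0, 140), (40, 0, 140), (40, 1, 135),
    (50, 0, 150), (50, 0, 150), (50, 1, 140),
    (60, 0, 160), (60, 1, 155), (60, 1, 155),
    (70, 0, 170), (70, 1, 160), (70, 1, 160),
]

# Subtract block means from z and y (this absorbs the intercept
# and the block indicators).
z_res, y_res = [], []
for b in sorted({blk for blk, _, _ in data}):
    rows = [r for r in data if r[0] == b]
    z_bar = mean(z for _, z, _ in rows)
    y_bar = mean(y for _, _, y in rows)
    z_res += [z - z_bar for _, z, _ in rows]
    y_res += [y - y_bar for _, _, y in rows]

# Least-squares slope of the demeaned outcome on the demeaned treatment.
tau_rb = (sum(a * c for a, c in zip(z_res, y_res))
          / sum(a * a for a in z_res))

print(round(tau_rb, 6))  # -7.5
```

In this data set the regression estimate coincides with the block-weighted average of \(-7.5\). In general the regression weights the blocks by size and by how balanced the assignment is within each block, so the two estimates need not agree exactly.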

Randomized blocks experiments have one advantage over completely randomized experiments: their estimates of SATE (or PATE) have smaller standard errors, because the units within each block are more homogeneous.

### 14.3.3 Matched pairs experiments

A *matched pairs experiment* is a special case of a randomized block design with only two units in each block. For example, Table 14.1 shows data from a matched pairs experiment. In each block, we randomly select one unit (with probability \(0.5\)) to receive the treatment, and the other unit serves as the control.

This design is very effective when the members of each matched pair are similar to each other, because the difference of the observed outcomes in each pair is then a good estimate of the treatment effect. Suppose that there are \(K\) pairs. Let \(y_j^T\) be the outcome of the treated unit and \(y_j^C\) be the outcome of the control unit in pair \(j\). Then, we can estimate SATE using the average of the \(K\) differences:

\[ \bar{d} = \frac1{K}\sum_{j=1}^K (y_j^T - y_j^C), \]

where \(d_j=y_j^T-y_j^C\) is the difference within pair \(j\). Such pairs arise naturally among children of the same family, students in the same class or workers in the same department.
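As a quick check of this estimator, the sketch below applies it to the four age-matched pairs of the balanced table in Section 14.3.1, where each pair consists of one treated and one control unit of the same age.

```python
from statistics import mean

# (treated outcome y_j^T, control outcome y_j^C) for the K = 4 pairs:
# (Anna, Alex), (Bob, Bill), (Carol, Cindy), (Dave, Dan).
pairs = [(135, 140), (140, 150), (155, 160), (160, 170)]

# Within-pair differences d_j = y_j^T - y_j^C and their average.
d = [y_t - y_c for y_t, y_c in pairs]
d_bar = mean(d)

print(d)      # [-5, -10, -5, -10]
print(d_bar)  # -7.5
```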

### 14.3.4 Group or cluster-randomized experiments

Sometimes, for logistical or cost reasons, the treatment is assigned at the group level. For example, a schoolwide schedule reform requires assigning the new schedule to all students in a school; a new working hours policy requires changing the working hours of all employees in a company. A simple approach to causal analysis at the group level is to treat each group as a single unit and use an aggregated value of the response variable as the outcome.
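A minimal sketch of this group-level approach is given below. The school names, scores and assignments are entirely hypothetical and serve only to show the two steps: aggregate within each group, then compare groups.

```python
from statistics import mean

# Hypothetical data: each school is one group, and every student
# in a school shares the school's treatment assignment z.
schools = {
    "school_a": {"z": 1, "scores": [72, 75, 78]},
    "school_b": {"z": 1, "scores": [70, 74]},
    "school_c": {"z": 0, "scores": [68, 71, 65]},
    "school_d": {"z": 0, "scores": [69, 73]},
}

# Step 1: aggregate to one outcome per group.
group_mean = {name: mean(s["scores"]) for name, s in schools.items()}

# Step 2: difference in means, treating each group as a single unit.
treated = [group_mean[n] for n, s in schools.items() if s["z"] == 1]
control = [group_mean[n] for n, s in schools.items() if s["z"] == 0]
estimate = mean(treated) - mean(control)

print(estimate)  # 4.0
```

Note that the effective sample size here is the number of groups (four), not the number of students.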

## 14.4 Assumptions of randomized experiments

In this section, we discuss several assumptions that randomized designs rely on for valid causal analysis.

### 14.4.1 Ignorability

The first assumption is *ignorability*, whose exact form depends on the design. We will state a version of this assumption for each design mentioned in the previous section.

**Completely randomized design**

For a completely randomized design, ignorability states that the distribution of each potential outcome is independent of the treatment assignment. This can be written formally as

\[ z \perp y^0,y^1. \tag{14.1}\]

In our running example, this says that whether a participant is assigned to the control or the treatment group is unrelated to what his/her blood pressure would be after one year, with or without the supplement.

The ignorability assumption implies that the difference in means is unbiased. To see this, we compute the expectations.

\[\begin{align*} \mathbb{E}[y\vert z=1] &= \mathbb{E}[y^1\vert z=1] = \mathbb{E}[y^1] \\ \mathbb{E}[y\vert z=0] &= \mathbb{E}[y^0\vert z=0] = \mathbb{E}[y^0]. \end{align*}\]

In each line, the first equality follows from the fact that the potential outcome is observed for the corresponding treatment assignment, and the second equality follows from ignorability. Consequently,

\[ \mathbb{E}[y\vert z=1] - \mathbb{E}[y\vert z=0] = \mathbb{E}[y^1]- \mathbb{E}[y^0] = \tau_{\text{SATE}}. \]

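If the data are also a random sample from the population, we can replace the right-hand side by \(\tau_{\text{PATE}}\). This is consistent with the data in Table 14.1 and Table 14.2: the difference in means recovers SATE in the first table, but is far off in the second, where the treatment assignment is related to age and hence to the potential outcomes, so that the ignorability assumption is violated.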
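Unbiasedness can be verified exactly on a small example by enumerating every possible assignment. The sketch below assumes a completely randomized design in which exactly four of the eight units are treated and all \(\binom{8}{4}=70\) assignments are equally likely; the potential outcomes are the hypothetical values used in Section 14.3.1.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for the eight units.
y0 = [140, 140, 150, 150, 160, 160, 170, 170]
y1 = [135, 135, 140, 140, 155, 155, 160, 160]
n = len(y0)

sate = mean(b - a for a, b in zip(y0, y1))

# Difference in means under every assignment with n/2 treated units.
estimates = []
for treated in combinations(range(n), n // 2):
    treated = set(treated)
    t_mean = mean(y1[i] for i in treated)
    c_mean = mean(y0[i] for i in range(n) if i not in treated)
    estimates.append(t_mean - c_mean)

print(len(estimates))   # 70
print(mean(estimates))  # -7.5
print(sate)             # -7.5
```

Individual assignments can give estimates far from \(-7.5\), but their average over all equally likely assignments equals SATE.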

**Randomized blocks experiments**

Let \(b\) be the blocking variable. The ignorability assumption for randomized block designs is:

\[ z \perp y^0,y^1 \mid b. \]

In other words, within each block, all units have the same probability of being assigned to the treatment group. Note that if the probability of treatment is the same across all blocks (that is, \(z \perp b\)), then Equation 14.1 is satisfied, and the difference in means is an unbiased estimator of SATE (or PATE).

**Matched pairs experiments**

This is a special case of randomized block experiments with two units in each block. By the definition of a matched pairs experiment, every unit has the same probability (\(0.5\)) of being assigned the treatment, so Equation 14.1 is satisfied, and the difference in means is an unbiased estimator of SATE (or PATE).

### 14.4.2 Stable unit treatment value assumption (SUTVA)

The *stable unit treatment value assumption* (SUTVA) is simply

\[ y_i^{z_i} = y_i^{z'_i} \quad \text{if }z_i=z'_i. \]

In other words, the potential outcome of unit \(i\) depends only on the treatment that unit \(i\) receives, and nothing else. This assumption has several implications. First, it implies that the outcome of a unit does not depend on the other units’ treatment assignments. Without this condition, causal estimation would quickly become intractable. In our running example, if a unit’s outcome also depended on the other units’ treatments, then each unit would have a potential outcome for each of the \(2^8=256\) different combinations of treatment assignments to the \(8\) people, and we clearly do not have enough data to consider these \(256\) possibilities.
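The count of \(256\) can be confirmed by listing every assignment vector for eight units. The short sketch below is only a counting illustration; it does not model any outcomes.

```python
from itertools import product

n_units = 8

# Every possible vector (z_1, ..., z_8) of treatment assignments.
assignments = list(product([0, 1], repeat=n_units))
print(len(assignments))  # 256

# Without SUTVA, a unit's outcome may differ across all 256 vectors.
# Under SUTVA, it depends only on the unit's own entry, which takes
# just two values, so each unit has two potential outcomes.
own_values = {a[0] for a in assignments}
print(len(own_values))   # 2
```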

Here are some examples of SUTVA violations.

- In a study of the effect of a new fertilizer, each of several adjacent plots is randomly assigned to receive or not receive the fertilizer. However, the fertilizer from a treated plot might leak into a control plot, violating SUTVA.
- A vaccine for a contagious disease, randomly administered to people in a community, could result in unvaccinated people having a lower chance of contracting the disease.
- In a study that offers families from the same housing complex the chance to move to a better neighborhood, a family accepting the offer and moving out might affect (positively or negatively) another family that did not receive the offer.

If we would like to perform an experiment in which SUTVA most likely does not hold because of “interference” between units, as in the examples above, one solution is to assign the treatment at the group level. For example, consider a study whose goal is to introduce a new technique to encourage physical activity among students. Suppose that the technique is randomly assigned to a few students and turns out to be effective. It would then increase the physical activity of the assigned students, which in turn could increase that of the non-assigned students as well. Thus it makes more sense to study the effect of the technique at the school level instead of the individual level.

## 14.5 Some difficulties in causal inference

We address some concerns that are usually present in causal studies.

- The ability to recover SATE (as in a completely randomized experiment) is referred to as *internal validity*, and the extent to which the result of the study can be generalized to the population is referred to as *external validity*. Sometimes, it is difficult for an experiment to have external validity, so one has to adjust the estimates of the treatment effect to the population.
- The experiment can affect the behavior of the participants. For example, participants in a study of the effect of lighting on productivity are likely to be more productive during the experiment because they know they are being observed. Possible solutions include not revealing the goal of the experiment to the participants, and not telling them whether they are in the control or the treatment group.
- Missing pre-treatment data are usually not fatal, as the missingness is independent of the treatment assignment. Missing outcome data, however, are very common in the control group, since those in the control group are less likely to be emotionally engaged in the study. In this case, the ignorability assumption is violated, since the missingness depends on the treatment assignment.
- Participants might not comply with the treatment assignment. In our running example, a participant who was assigned the treatment might forget to take the supplement, or decide to stop taking it after a while. Such noncompliance can make our estimate completely invalid.