# 18Instrumental variables

## 18.1 Motivation

Suppose that we want to estimate the effect of watching a TV show Sesame Street on preschool children’s recognition of English letters. We might consider an experiment where the treatment is watching Sesame Street. However, it would be difficult for us to force the children to watch the TV show for prevent them from watching it.

Instead, what we can encourage a random subgroup of the children to watch the show. Now, we have an additional variable, the encouragement that affects the treatment variable.

Let us take a look at the data, which is contained sesame.csv.

set.seed(0)
library(brms)
library(rstanarm)
sesame <- read.csv("data/sesame.csv")

head(sesame[, c("encouraged", "watched", "setting", "site", "postlet")])
  encouraged watched setting site postlet
1          1       0       2    1      30
2          1       1       2    1      37
3          0       1       2    1      46
4          0       0       2    1      14
5          0       1       2    1      63
6          0       1       2    1      36

The outcome is postlet, which is the post-treatment measurement of the letter recognition task.

The encouragement is an example of an instrumental variable (IV), a variable that affects the treatment but does not affect the outcome. This is an example of quasi-experiment, that is, an experiment to study the causal effect without random treatment assigment.

## 18.2 Terminologies for instrumental varialbes

In addition to the usual causal notations of the treatment variable $$T$$, the outcome variable $$y$$, and the confounders $$x$$, we also have:

• the instrumental variable $$z$$ affecting the treatment,
• the potential treatments $$T^0$$ and $$T^1$$ which is observed only if $$z=0$$ and $$z=1$$, respectively.
• the potential outcomes $$y^0$$ and $$y^1$$ which is observed only if $$z=0$$ and $$z=1$$, respectively.

We summarize the variables in the graphical model below.

We can estimate the effect of the instrument on the outcome—sometimes called intent-to-treat (ITT) effect—using the difference in the means of encouraged and not encouraged units effect. This is different than the treatment effect which is usually the quantity of interest. We will discuss the formulations and assumptions imposed by this model and ways to estimate the average treatment effect.

We now categorize the units into four groups by their treatment response to the instrument.

1. Compliers Instrument has a positive effect on their treatments, that is, $$T^1_i=1$$ and $$T^0_i=0$$.
2. Defiers Instrument has a negative effect on their treatments, that is, $$T^1_i=0$$ and $$T^0_i=1$$.
3. Never-takers Units who never take the treatment, that is, $$T^1_i=T^0_i=0$$.
4. Always-takers Units who always take the treatment, that is, $$T^1_i=T^0_i=1$$.

We are only interested in estimating the average treatment effect on the compliers—this is called the complier average causal effect (CACE).

A good instrument induces many compliers and no defiers; we will discuss more about desired properties of an instrument in the next section.

## 18.3 Assumptions for instrumental variables

In addition to SUTVA, the method of intrumental variables relies on several assumptions.

Ignorability of the instrument. We assume that the randomization in the instrumental variable is independent of the potential treatments and the outcomes:

$y^1,y^0,T^1,T^0 \perp z.$

Unlike the standard causal analysis, the ignorability of the treatment does not have to be satisfied.

Monotonicity. We assume that there is no defiers, that is, no unit who would have taken the treatment if they were not encouraged and would not take the treatment if they were encouraged.

Relevance. Of course, in an instrumental variable design, there would be no hope in estimating the treatment effect if the instrument is unrelated to the treatment. This assumption can be written as:

$\Pr[T^1_i=1 \vert z_i=1] >0 \quad\text{and}\quad \Pr[T^1_i=0 \vert z_i=0] > 0.$

Exclusion restriction. We assume that there is no effect of the instrument on the outcomes of the never-takers (who would not have taken the treatment either way) and always-takers (who would have taken the treatment either way).

We can come up with a story that violates the exclusion restriction. In the Sesame Street example, we could have parents who prohibited their children from watching television (so the children were never-takers). But after being encouraged to have their children watch the Sesame Street, they decided to purchases a similar educational material for their children to read instead.

If there are some covariates $$x$$, we might instead assume the conditional ignorability:

$y^1,y^0,T^1,T^0 \perp z\vert x,$

We will see that these assumptions lead to unbiasedness of the difference-in-means estimation in the next section.

## 18.4 Intent-to-treat (ITT) effect and complier average causal effec (CACE)

The following table shows hypothetical data of the Sesame Street experiment. The bold numbers are observed values.

Unit
$$i$$
Potential treatment
$$T^0_i$$
Potential treatment
$$T^1_i$$
Unit Type
$$c_i$$
Encouragement
$$z_i$$
Potential outcome
$$y^0_i$$
Potential outcome
$$y^1_i$$
Instrument effect
$$y^1_i-y^0_i$$
1 0 1 complier 0 67 76 9
2 0 1 complier 0 72 80 8
3 0 0 never-taker 0 68 68 0
4 1 1 always-taker 0 76 76 0
5 1 1 always-taker 0 74 74 0
6 0 1 complier 1 74 81 7
7 0 1 complier 1 68 78 10
8 0 0 never-taker 1 70 70 0
9 1 1 always-taker 1 80 80 0
10 1 1 always-taker 1 82 82 0

Assuming that the instrument assignments are random, this data satisfies all the four assumptions above. Specifically, there is no defiers (no $$T^0_i=1$$ and $$T^1_i=0$$) and no instrument effect on the never-takers and always-takers.

Now let us calculate the intent-to-treat (ITT) effect using the difference in the means of encouraged and not encouraged children, which in turn is the average of the instrument effect $$\tau_i$$.

$\text{ITT} = \frac{9+8+0+0+0+7+10+0+0+0}{10} =3.4.$

We can also calculate the complier average causal effect (CACE), using the difference in the means of the treated and controlled compliers.

$\text{CACE} = \frac{9+8+7+10}{4} = 8.5.$

In our hypothetical data, we can write CACE in terms of ITT as follows:

$\text{CACE} = \frac{\text{ITT}}{4/10} = \frac{\text{ITT}}{\Pr[c=\text{complier}]} = \frac{\mathbb{E}[y\vert z=1] - \mathbb{E}[y\vert z=0]}{\Pr[c=\text{complier}]}.$

### 18.4.1 Compute CACE using ITT

In a instrumental variable design, our quantity of interest is the average treatment effect over the compliers, that is, the CACE. However, we generally do not observe which units are complier, so we cannot compute CACE directly from the definition. But following computations will show that, under the four assumptions above, we can compute CACE using ITT.

First, we have

\begin{align*} \text{ITT} &=\mathbb{E}[y\vert z=1]-\mathbb{E}[y\vert z=0] \\ &= \mathbb{E}[y^1-y^0]\\ &=\mathbb{E}[y^1-y^0\vert c=\text{complier}]\Pr[c=\text{complier}] \\ &\quad + \mathbb{E}[y^1-y^0\vert c=\text{defier}]\Pr[c=\text{defier}] \\ &\quad + \mathbb{E}[y^1-y^0\vert c=\text{never-taker}]\Pr[c=\text{never-taker}] \\ &\quad + \mathbb{E}[y^1-y^0\vert c=\text{always-taker}]\Pr[c=\text{always-taker}] \\ &= \mathbb{E}[y^1-y^0\vert c=\text{complier}]\Pr[c=\text{complier}]\\ & = \text{CACE}*\Pr[c=\text{complier}], \end{align*}

where the last equality follows from the monotonicity assumption ($$\Pr[c=\text{defier}]=0$$) and the exclusion restriction assumption ($$Y^1-Y^0=0$$ for the never-takers and always-takers). To remove $$\Pr[c=\text{complier}]$$, we consider $$\mathbb{E}[T\vert z=1]-\mathbb{E}[T\vert z=0]$$. Using the monotonicity assumption as before, we obtain

\begin{align*} \mathbb{E}[T\vert z=1]-\mathbb{E}[T\vert z=0] &= \mathbb{E}[T^1-T^0]\\ &=\mathbb{E}[T^1-T^0\vert c=\text{complier}]\Pr[c=\text{complier}] \\ &\quad + \mathbb{E}[T^1-T^0\vert c=\text{defier}]\Pr[c=\text{defier}] \\ &= \mathbb{E}[T^1-T^0\vert c=\text{complier}]\Pr[c=\text{complier}] \\ &= \Pr[c=\text{complier}], \end{align*}

which cannot be zero because of the assumption that the instrument must have an effect on the treatment. Dividing these two expressions yields

$\text{CACE} = \frac{\mathbb{E}[y\vert z=1] - \mathbb{E}[y\vert z=0]}{\mathbb{E}[T\vert z=1]-\mathbb{E}[T\vert z=0]}.$

Now, using the ignorability assumption, we can estimate the numerator using the difference in the means of the outcomes of the encouraged children ($$z=1$$) and not encouraged children ($$z=0$$), which can be computed with a linear regression.

itt_zy <- stan_glm(postlet ~ encouraged, data=sesame,
refresh=0)

coef(itt_zy)["encouraged"]

The coefficient of encouraged is our estimate of ITT. And we can estimate the denominator using the difference in the means of the treatment assignments of those two groups of children.

itt_zt <- stan_glm(watched ~ encouraged, data=sesame,
refresh=0)

coef(itt_zt)["encouraged"]
encouraged
0.3626182 

The coefficient of encouraged is the proportion of compliers in the data. Dividing the ITT estimate by the proportion, we obtain a CACE estimate, sometimes called a Wald estimate.

wald_est <- coef(itt_zy)["encouraged"] / coef(itt_zt)["encouraged"]

wald_est
encouraged
7.901558 

When there are covariates $$x$$, the CACE needs to be defined for each level of $$x$$, and all of our derivations above have to be conditioned on $$x$$. More precisely, with the conditional ignorability $$y^1,y^0,T^1,T^0 \perp z\vert x$$ and the corresponding assumptions, we have

$\text{CACE}(x) = \frac{\mathbb{E}[y\vert z=1,x] - \mathbb{E}[y\vert z=0,x]}{\mathbb{E}[T\vert z=1,x]-\mathbb{E}[T\vert z=0,x]}.$

## 18.5 Two-stage least squares

We discuss a general method of estimating CACE, called two-stage least square (2SLS).

In 2SLS, we pe form two linear regressions:

1. Regression of the treatment variable on the instrument:

$\hat{T} = \alpha_0 + \alpha_1z_1.$

1. Regression of the outcome on the first model’s predicted treatment:

$y = \beta_0 + \beta_1\hat{T} + \varepsilon.$

Here is an example of 2SLS on the Sesame Street data.

fit_2a <- stan_glm(watched ~ encouraged, data=sesame,
refresh=0)
sesame$watched_hat <- fit_2a$fitted
fit_2b <- stan_glm(postlet ~ watched_hat, data=sesame,
refresh=0)

fit_2b
stan_glm
family:       gaussian [identity]
formula:      postlet ~ watched_hat
observations: 240
predictors:   2
------
(Intercept) 20.5    4.0
watched_hat  8.1    5.0

Auxiliary parameter(s):
sigma 13.3    0.6

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Here, the estimated treatment effect is $$8.1$$. However, the second regression does not give the correct standard error as it has to be computed from the observed treatment assignments, not the predicted one. Instead, we can fit 2SLS using the brms library and return the correct standard error.

f1 <- bf(watched ~ encour)
f2 <- bf(postlet ~ watched)
IV_brm_a <- brm(f1 + f2, data=sesame, refresh=0)
IV_brm_a
 Family: MV(gaussian, gaussian)
Links: mu = identity; sigma = identity
mu = identity; sigma = identity
Formula: watched ~ encour
postlet ~ watched
Data: sesame (Number of observations: 240)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
watched_Intercept     0.55      0.04     0.47     0.63 1.00     3977     3069
postlet_Intercept    20.44      3.71    13.37    28.21 1.00     2076     2310
watched_encour        0.36      0.05     0.26     0.46 1.00     4128     2913
postlet_watched       8.10      4.69    -1.61    17.05 1.00     1956     1862

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_watched     0.38      0.02     0.35     0.42 1.00     4661     2943
sigma_postlet    12.64      0.69    11.47    14.19 1.00     2790     2448

Residual Correlations:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
rescor(watched,postlet)     0.16      0.15    -0.13     0.45 1.00     1839
Tail_ESS
rescor(watched,postlet)     2117

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We can also incorporate the covariates by adding them to both regression models. For example, let us add two covariates: site and setting to our models.

f1 <- bf(watched ~ encour + prelet + setting + factor(site))
f2 <- bf(postlet ~ watched + prelet + setting + factor(site))
IV_brm_b <- brm(f1 + f2, data=sesame, refresh=0)
IV_brm_b
 Family: MV(gaussian, gaussian)
Links: mu = identity; sigma = identity
mu = identity; sigma = identity
Formula: watched ~ encour + prelet + setting + factor(site)
postlet ~ watched + prelet + setting + factor(site)
Data: sesame (Number of observations: 240)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
watched_Intercept       0.66      0.11     0.45     0.87 1.00     3995     3119
postlet_Intercept       1.33      4.51    -7.62    10.27 1.00     1803     1713
watched_encour          0.34      0.05     0.24     0.44 1.00     5514     3055
watched_prelet          0.01      0.00    -0.00     0.01 1.00     6257     3176
watched_setting        -0.05      0.05    -0.16     0.05 1.00     5656     2961
watched_factorsite2     0.03      0.07    -0.10     0.17 1.00     3359     2622
watched_factorsite3    -0.11      0.07    -0.24     0.02 1.00     3467     2980
watched_factorsite4    -0.34      0.07    -0.49    -0.20 1.00     3891     3161
watched_factorsite5    -0.29      0.10    -0.49    -0.10 1.00     3778     3169
postlet_watched        13.90      3.86     6.40    21.33 1.00     1721     1917
postlet_prelet          0.70      0.08     0.56     0.85 1.00     5399     2797
postlet_setting         1.60      1.42    -1.20     4.39 1.00     3497     3150
postlet_factorsite2     8.39      1.87     4.64    12.05 1.00     4126     3028
postlet_factorsite3    -3.98      1.76    -7.40    -0.43 1.00     3237     2916
postlet_factorsite4     0.87      2.32    -3.67     5.58 1.00     2094     2162
postlet_factorsite5     2.69      2.81    -2.85     8.26 1.00     3138     2528

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_watched     0.36      0.02     0.33     0.39 1.00     5986     2795
sigma_postlet     9.44      0.54     8.51    10.63 1.00     3014     2271

Residual Correlations:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
rescor(watched,postlet)    -0.18      0.15    -0.46     0.13 1.00     1692
Tail_ESS
rescor(watched,postlet)     2050

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The instrumental variables’ treatment effect estimate is the coefficient of postlet_watched, which is $$13.9$$ with standard error $$3.9$$.

The 2SLS method can be easily extended to continuous instrument and treatment variables. But we must be careful in the case of binary instrument and continuous treatment as there is no universal way—as least without parametric assumptions—to quantify the instrument’s effect on the treatment.

## 18.6 Multiple instruments, treatments and covariates

We can have multiple variables playing different roles in the instrumental variables model. In general, we have

• The outcome $$y$$
• Multiple treatments $$T_1,\ldots,T_k$$
• Multiple instruments $$z_1,\ldots,z_m$$; these variables only affect the outcome through the treatments.
• Multiple covariates $$x_1,\ldots,x_l$$; these variables affect the outcomes directly.

For each treatment and each instrument, we categorized the units by the compliance as follows:

• Compliers Instrument has a positive (negative) effect on their treatments.
• Never-takers and Always-taker Instrument has no effect on their treatments.
• Defiers Instrument has a negative (positive) effect on their treatments.

In order to estimate the CACE of each treatment’s the following assumptions must be satisfied:

• Ignorability of the instrument $$y^1,y^0,T_i^1,T_i^0 \perp z_j \vert x_1,\ldots,x_l$$ for $$i=1,\ldots,k$$ and $$j=1\ldots,m$$.
• Monotonicity There is no defier, that is, no unit who reacts in the opposite direction of what we expect.
• Relevance The instrument and the treatment must be related.
• Exclusion restriction There is no instrument effect on the outcomes of the never-takers and always-takers.

In addition, to identify all treatment effects, we have to ensure that any combination of the observed treatments can be attained by adjusting the instruments; this only happens when the number of instruments is equal or greater than the number of treatments.

• Full rank model $$m=k$$ (exactly identified) or $$m>k$$ (overidentified).

With these assumptions, we can estimate the treatment effect(s) using the following 2SLS.

\begin{align*} \hat{T}_{1} &= \alpha_1 + \sum_{j=1}^{k}\alpha_{1j}z_j + \sum_{j=1}^{l}\delta_{1j}x_j \\ \hat{T}_{2} &= \alpha_2 + \sum_{j=1}^{k}\alpha_{2j}z_j + \sum_{j=1}^{l}\delta_{2j}x_j \\ &\vdots \\ \hat{T}_{m} &= \alpha_m + \sum_{j=1}^{k}\alpha_{mj}z_j + \sum_{j=1}^{l}\delta_{mj}x_j \\ y &= \beta_0 + \sum_{i=1}^{m}\beta_i\hat{T}_i +\sum_{i=1}^{l}\gamma_ix_i + \varepsilon. \end{align*}

As an example, we apply this method on cigarette sales data in the 48 continental US States in 1995 from Webel (2011).

cigarette <- read.csv("data/cigarette.csv")

head(cigarette[, c("packs", "rprice", "rincome", "salestax", "cigtax")])
      packs   rprice  rincome  salestax   cigtax
1 101.08543 103.9182 12.91535 0.9216975 26.57481
2 111.04297 115.1854 12.16907 5.4850193 36.41732
3  71.95417 130.3199 13.53964 6.2057067 42.86964
4  56.85931 138.1264 16.07359 9.0363074 40.02625
5  82.58292 109.8097 16.31556 0.0000000 28.87139
6  79.47219 143.2287 20.96236 8.1072834 48.55643

We would like to estimate the treatment effect of cigarette price (rprice) on the number of cigarette packs sold (packs). However, the price itself might be affected by the demand for cigarettes; so we instead add two instrumental variables that are rather affected by fiscal policy, namely sales tax (salestax) and cigarette tax (cigtax). We also has income (rincome) as a covariate. Our model is summarized in the following graph:

With this, we can now perform 2SLS using brm.

lm1 <- bf(log(rprice) ~ salestax + cigtax + log(rincome))
lm2 <- bf(log(packs) ~ log(rprice) + log(rincome))
IV_cig <- brm(lm1 + lm2, data=cigarette, refresh=0)
IV_cig
 Family: MV(gaussian, gaussian)
Links: mu = identity; sigma = identity
mu = identity; sigma = identity
Formula: log(rprice) ~ salestax + cigtax + log(rincome)
log(packs) ~ log(rprice) + log(rincome)
Data: cigarette (Number of observations: 48)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
logrprice_Intercept      4.10      0.10     3.89     4.30 1.00     3290
logpacks_Intercept       9.90      1.10     7.75    12.02 1.00     4381
logrprice_salestax       0.01      0.00     0.01     0.02 1.00     6481
logrprice_cigtax         0.01      0.00     0.01     0.01 1.00     4017
logrprice_logrincome     0.11      0.04     0.03     0.19 1.00     3254
logpacks_logrprice      -1.28      0.28    -1.83    -0.74 1.00     2792
logpacks_logrincome      0.29      0.25    -0.18     0.77 1.00     2662
Tail_ESS
logrprice_Intercept      2875
logpacks_Intercept       2757
logrprice_salestax       3128
logrprice_cigtax         3607
logrprice_logrincome     2812
logpacks_logrprice       2425
logpacks_logrincome      2699

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_logrprice     0.03      0.00     0.03     0.04 1.00     3110     2668
sigma_logpacks      0.20      0.02     0.16     0.25 1.00     2959     2399

Residual Correlations:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
rescor(logrprice,logpacks)    -0.25      0.15    -0.53     0.05 1.00     2788
Tail_ESS
rescor(logrprice,logpacks)     2582

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The point estimate of the coefficient is $$-1.28$$ with standard error $$0.28$$, which implies that $$1\%$$ increase in cigarette price would decrease cigarette consumption by $$-1.00\%-1.56\%$$. Note that what we just estimated is the treatment effect on the complier states, that is, the states that would increase the cigarette price as a result of the tax raises.

## 18.7 Testing the assumptions

The method of instrument variables require many assumptions, some of which can be explained away with the prior knowledge about the variables. Even if this is not the case, we can still test some of the assumptions from the data.

### 18.7.1 Testing relevance

Not only the instrument must be related to the treatment, the relationship must be strong enough so that we can estimate the treatment effect reliably. A common problem is when we have a weak instrument, an instrument that does not have a strong relationship with the treatment, which can lead to biased estimates.

To test the relevance, we can use the joint $$F$$-test to compare the first-stage regression with and without the instruments; and we reject the null hypothesis if the model with the instruments has better predictive power. For more details on the join F-test, see e.g. James et al. (2021, chap. 3.2) and Hanck et al. (2019, chap. 7.3).

The usual statistical test with a specified type I error does not exclude weak instrument, so we have to come up with a new criteria on the F-statistic in order to decide if the instrument is strong enough. A rule of thumb says that an F-statistic greater than 10 is sufficient. For more reliable cutoffs and other testing methods for weak instruments, see Stock and Yogo (2002) for an extensive study.

### 18.7.2 Testing exclusion restriction

One way to test the exclusion restriction is by performing 2SLS only on the group of always-takers and never-takers and see if the effect of the instrument is negligible.

Another way is by fitting the second-stage regression with the observed treatment and the instrument:

$y = \beta_0 +\sum_{i=1}^{k}\alpha_i z_i + \sum_{i=1}^{m}\beta_i T_i +\sum_{i=1}^{l}\gamma_i x_i + \varepsilon.$

The exclusion restriction says that the instrument only affects the outcome through the treatment, so we can use the joint $$F$$-test to see if some of $$\alpha_i \not= 0$$, in which case we say that the assumption is violated.

When the 2SLS model is overidentified, that is, when there is more instruments than the treatments, there is also Sargan test, which look at the relationship between the residuals of the second-stage model and the instruments, and we reject the null if the relationship is significant.

# References

Hanck, Christoph, Martin Arnold, Alexander Gerber, and Martin Schmelzer. 2019. Introduction to Econometrics with r. University of Duisburg-Essen.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning. Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Stock, James, and Motohiro Yogo. 2002. “Testing for Weak Instruments in Linear IV Regression.” National Bureau of Economic Research. https://doi.org/10.3386/t0284.
Webel, Karsten. 2011. JH Stock, MW Watson: Introduction to Econometrics. Springer Nature BV.