# 17Subclassification and propensity score matching

## 17.1 Subclassification

When the confounder is a discrete variable, subclassification is an easy way to estimate the average causal effect. We demonstrate this on an example of data more than 4000 children born in the 1980s. Some of the children were received high-quality child care from the Infant Health and Development Program (IHDP). We want to measure the effect of the child care on the cognitive abilities, evaluated with an IQ-like test at age 3. The data is contained in cc2.csv. Below, we show some of the attributes, namely age in months (age), body weight (bw), mother’s education (educ), treatment (treat) and the IQ score at age 3 (ppvtr.36).

set.seed(0)
library(rstanarm)
library(survey)
cc2 <- read.csv("data/cc2.csv")

head(cc2[, c('age', 'bw', 'educ', 'treat', 'ppvtr.36')])
       age   bw educ treat ppvtr.36
1 60.79671 1559    4     1      111
2 59.77823 2240    1     1       81
3 59.51540 1900    1     1       92
4 59.18686 1550    4     1      103
5 58.79261 2270    1     1       81
6 58.49692 1550    2     1       94

To make it simple, we assume that the only confounder is the mother’s education. This is the case when the program specifically targetted those families who received only basic or no education.

The Childcare data recorded four levels of mother’s education: not a high school graduate (lths), a high school graduate (hs), at some college (ltcoll), and a college gradute (college)

head(cc2[, c("lths", "hs", "ltcoll", "college")])
  lths hs ltcoll college
1    0  0      0       1
2    1  0      0       0
3    1  0      0       0
4    0  0      0       1
5    1  0      0       0
6    0  1      0       0

In this case, the conditional treatment effect (CATE) for each level of mother’s education is the difference of means within the level.

educ_list <- c("lths", "hs", "ltcoll", "college")

df <- data.frame(matrix(nrow = 4, ncol = 5))
colnames(df) <- c("Mother\'s education", "Treatment effect", "Standard error", "#treated", "#controls")
df$"Mother\'s education" <- educ_list for (k in 1:4){ edu <- educ_list[k] iq_treat <- cc2$ppvtr.36[cc2$treat==1 & cc2[edu]==1] iq_contr <- cc2$ppvtr.36[cc2$treat==0 & cc2[edu]==1] n_treat <- length(iq_treat) n_contr <- length(iq_contr) tr_effect <- mean(iq_treat) - mean(iq_contr) tr_se <- sqrt(var(iq_treat)/n_treat +var(iq_contr)/n_contr) df[k, "Treatment effect"] <- tr_effect df[k, "Standard error"] <- tr_se df[k, "#treated"] <- n_treat df[k, "#controls"] <- n_contr } df  Mother's education Treatment effect Standard error #treated #controls 1 lths 9.298590 1.461121 126 1232 2 hs 4.057315 1.873075 82 1738 3 ltcoll 7.871995 2.402038 48 789 4 college 4.622168 2.322062 34 332 Notice that most mothers in this dataset did not finish high school, which should be reflected when we combine the mother’s education-specific estimates to obtain the average treatment effect. To this end, we calculate a weighted average of the estimates, with weights defined by the number of children in each group; such estimation method is referred to as subclassification. $\hat{\tau}_{\text{ATE}} = \frac{9.3*1358 + 4.1*1820 + 7.9*837+4.6*366}{1358+1820+837+366} = 6.5,$ and the standard error is $$\sqrt{\frac{1.5^2*1358^2 + 1.9^2*1820^2 + 2.4^2*837^2+2.3^2*366^2}{(1358+1820+837+366)^2}}=1.04$$. ### 17.1.1 Average effect of treatment on the treated In observational data, sometimes we only care about the treatment effect on the treated group. For example, we would like to measure the treatment effect on the children that were eligible for the child care program, which are those in the treatment group. In this case, we would like to measure the average effect of treatment on the treated (ATT). We can also use subclassification to estimate ATT, only counting the children in the treatment group. $\hat{\tau}_{\text{ATT}} = \frac{9.3*126 + 4.1*82 + 7.9*48+4.6*34}{126+82+48+34} = 7.0,$ and the standard error is $$\sqrt{\frac{1.5^2*126^2 + 1.9^2*82^2 + 2.4^2*48^2+2.3^2*34^2}{(126+82+48+34)^2}}=0.9$$. So there is no visible difference between the estimates of the average treatment over all children (ATE) and the average treatment over the treated children (ATT) in this case. ## 17.2 Propensity score matching Matching refers to any method of transforming the original data to make it look like a sample from a randomized experiment, from which we can estimate causal effects with little bias, even when our model is misspecified. We will go over a specific type of matching, called propensity score matching. Essentially, we use a logistic regression to compute a “score” for each unit; these scores will be used to “match” between treated and untreated units. We shall detail the propensity score matching as a five-step process below. ### 17.2.1 Step 1: Choose the confounders and estimand Choosing confounders. Researchers often choose a list of confounders based on previous literature. In the child care example, there are not many covariates, so we decide to include them all. Choosing estimands. We have talked about two estimands: the average treatment effect (ATE) and the average treatment effect on the treated (ATT). We will also occasionally mention the average treatment effect on the controlled (ATC). The choice of estimand depends on the purpose of the estimation. In the child care example, we want to evaluate how the child care program affects the children that are eligible for the program, so we choose to estimate ATT. To estimate ATT, we will keep the treatment group the same, while transforming the control group to look like the treatment group. If we were to estimate the effect of the treatment on the control, we would transform the treatment group to match the control group instead. ### 17.2.2 Step 2: Estimate the propensity score In this step, we fit a logistic model to estimate the probability of a child receiving the treatment, conditioning on the covariates. ps_fit_1 <- stan_glm(treat ~ bw + bwg + hispanic + black + b.marr + lths + hs + ltcoll + work.dur + prenatal + sex + first + preterm + momage + dayskidh + income, family=binomial(link='logit'), data=cc2, algorithm='optimizing', refresh=0) We then use the fitted model to calculate the propensity score of each child, that is, the predicted linear function for each child in the treatment group (the reason we use the linear function instead of the predicted probability is because it is easier to compare the distribution before and after the matching; see Step 4 below). Each score plays a role of summarizing the covariates of each unit as a single number. pscores <- predict(ps_fit_1) print(pscores[1:10])  1 2 3 4 5 6 4.262056670 -0.210344313 0.554438239 2.917770699 0.009094453 1.744454213 7 8 9 10 1.576030355 -2.535481435 -0.702373486 0.532513542  ### 17.2.3 Step 3: Match controlled units to the treated units Our goal now is to transform the control group so that its distribution of the propensity scores matches that of the treatment group. The transformation that we use here is simply choosing, for each treated unit $$T_i$$ , the controlled unit that has the closest propensity score to that of $$T_i$$. We can either match with or without replacement. Matching without replacement In matching without replacement for ATT estimation, we match each treated unit with the closest controlled unit that has not been matched yet. The diagram below shows a hypothetical example of matching without replacement; the numbers shown are the propensity scores from the logistic regression. When matching for ATC estimation, we swtich the roles between the treated units and the controlled units. We can match a controlled unit to each treated unit using the provided matching function. source("library/matching.R") matches <- matching(z=cc2$treat, score=pscores, replace=FALSE)

print(summary(matches))
          Length Class  Mode
match.ind 4381   -none- numeric
cnts      4381   -none- numeric
pairs     4381   -none- numeric
print(matches$match.ind[1:10])   1150 2899 730 2913 1455 2433 4131 1406 2689 3910 Here, the number in the $$i$$-th row is the index of the unit that has been matched with the $$i$$-th unit. Since we are matching without replacement, the number of matched controlled units has to be the same as the number of units in the treatment group, which is 290 in this example. Thus, the total number of matched units is $$290*2=580$$. matched <- cc2[matches$match.ind,]

nrow(matched)
 580

With this new dataset, we will have to check that the covariates between the treatment and control groups are balanced, and the propensity scores in both groups sufficiently overlap; this is detailed in Step 4. But before that let us have a glimpse of what is going to happen in the final step: we will run a linear regression on the matched dataset and use the coefficient of the treatment assignment to estimate ATT.

Code example for fitting a weighted regression model is shown in Listing 17.1 below.

Listing 17.1: Regression for matching without replacement

reg_ps <- stan_glm(ppvtr.36 ~ treat + bw + bwg + hispanic
+ black + b.marr + lths + hs + ltcoll
+ work.dur + prenatal + sex + first
+ preterm + momage + dayskidh + income,
data=cc2[matches$match.ind,], algorithm='optimizing') summary(reg_ps)['treat', 1:2]  Median MAD_SD 10.426381 1.520227  Note that fitting a regression model on a matched data without replacement is the same as fitting on a weighted data, with weight equals $$1$$ for each unit that has been matched, and weight equals $$0$$ for each unit that has not been matched. This kind of “weighted data” interpretation will be relevant for the next type of matching. Matching with replacement In matching with replacement for ATT estimation, we match each treated unit with the closest controlled unit that might or might not have been matched. The diagram below shows a hypothetical example of matching with replacement; the numbers shown are the propensity scores from the logistic regression. When matching for ATC estimation, we swtich the roles between the treated units and the controlled units. For this type of matching, we can again use the matching function with replace=TRUE. matches.wr <- matching(z=cc2$treat, score=pscores, replace=TRUE)
wts.wr <- matches.wr$cnts print(wts.wr[780:800])   1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Here, the $$i$$-th number in wts.wr is the number of times that the $$i$$-unit has been matched. In this example, the $$781$$st unit has been matched twice, which means that we should include two copies of this unit in the restructured control group. In other words, we can treat each numbers in wts.wr as the weight of the corresponding unit. To fit a linear regression model with weighted data, we may use functions provided by the survey library. Specifically, we use svydesign to specify the weigths and svyglm to fit the model. Here, we use ids=~1 for data with no clusters. Code example for fitting a weighted regression model is shown in Listing 17.2 below. Listing 17.2: Regression for matching with replacement ps_fit_1_design <- svydesign(ids=~1, weights=matches.wr$cnts,
data=cc2)
reg_ps.wr <- svyglm(ppvtr.36 ~ treat + bw + bwg + hispanic
+ black + b.marr + lths + hs + ltcoll
+ work.dur + prenatal + sex + first
+ preterm + momage + dayskidh + income,
design=ps_fit_1_design,
data=cc2)

summary(reg_ps.wr)$coef['treat', 1:2]  Estimate Std. Error 9.590492 2.015102  ### 17.2.4 Step 4: Inspect balance and overlap in propensity scores As alluded in Step 3, we will check if (1) the covariates between the treatment and control groups are balanced, and (2) the propensity scores in both groups sufficiently overlap. Here, we will use balance function from the provided balance.R to compute the difference in covariate means between treatment and control groups. # Uncomment to install required package #install.packages("Hmisc") source("library/balance.R") covs <- c('bw', 'preterm', 'dayskidh', 'sex', 'first', 'age', 'black', 'hispanic', 'white', 'b.marr', 'lths', 'hs', 'ltcoll', 'college', 'work.dur', 'prenatal', 'momage') bal_nr <- balance(rawdata=cc2[,covs], treat=cc2$treat,
matched=matches$cnts, estimand='ATT') bal_nr.wr <- balance(rawdata=cc2[,covs], treat=cc2$treat,
matched=matches.wr$cnts, estimand='ATT') After calling balance, the mean differences over all covariates before the matching are stored in output$diff.means.raw, and those after the matching are stored in output$diff.means.matched. Here, we define a function named plot_mean_diffs to plot these differences from a balance’s output. plot_mean_diffs <- function(bal, title) { pts <- bal$diff.means.raw[,4]
pts2 <- bal$diff.means.matched[,4] K <- length(pts) plot(c(pts,pts2), c(1:K, 1:K), bty='n', xlab='', ylab='', xaxt='n', yaxt='n', type='n', main=title) abline(v=0, lty=2) points(pts, 1:K, cex=1) points(pts2, 1:K, pch=19, cex=1) axis(3) axis(2, at=1:K, labels=covs, las=2, hadj=1, lty=0) } Now, let us compare the absolute mean differences obtained from matching with replacement and without replacement. par(mfrow=c(1,2)) par(mar=c(1,4.3,7,2)) plot_mean_diffs(bal_nr, "Absolute mean differences\n Matching without replacement") par(mar=c(1,4.3,7,2)) plot_mean_diffs(bal_nr.wr, "Absolute mean differences\n Matching with replacement") We see that matching with replacement has smaller mean differences, which implies that it has a better balance. To check the overlap, we inspect the histograms of the propensity scores in the treatment and control groups. par(mfrow=c(1,2)) # Plot the histograms of the propensity scores before matching par(mar=c(5,4,2,1)) hist(pscores[cc2$treat==0], main="Before matching",
border="darkgrey", xlab="logit propensity scores",
freq=FALSE)
hist(pscores[cc2$treat==1], freq=FALSE, add=TRUE) # Plot the histograms of the propensity after matching par(mar=c(5,4,2,1)) hist(pscores[cc2[matches.wr$match.ind, 'treat']==0],
main="After matching", border="darkgrey",
xlab="logit propensity scores", freq=FALSE)
hist(pscores[cc2[matches.wrmatch.ind, 'treat']==1], freq=FALSE, add=TRUE) We can see that the the propensity scores have a better overlap after matching. ### 17.2.5 Before step 5: Repeat steps 2-4 until a good balance is achieved We do not want to proceed to Step 5 yet until we obtain a good balance in the covariates. There are generally two ways to achieve a better balance: • Changing the model for the propensity score. This can be done in several ways: • Adding interactions between the covariates • Finding other potential confounders • Transforming existing continuous variables • Changing the way the propensity scores are used to restructure the data, for example, to a different matching method. ### 17.2.6 Step 5: Fit the regression on the restructured data As mentioned in Step 3, after we are satisfied with the balance, we can now fit the linear regression model on the restructured dataset. Example code for fitting a regression on matched data without replacement can be found in Listing 17.1, and code for matched data with replacemenet can be found in Listing 17.2. The estimate of average treatment effect is $$10.4\pm 1.5$$ and $$9.6\pm 2.0$$, respectively. ### 17.2.7 Other considerations There are several considerations that one has to keep in mind when performing propensity score matching. • It is not always a good idea to include all covariates in the list of potential confounders. Here is a basic guideline on which covariates to choose: • Do not include post-treatment covariates • Do not include covariates that are strongly related to the treatment but not strongly related to the outcome. An example of such covariate is instrumental variable with will be introduced in the next chapter. • Do include covariates that are strongly related to the outcome. • Instead of using the propensity scores, we can simply compute the distance between two observations using a known distance functions; for example, the Euclidean distance: $$d(X_i,X_j) = \sqrt{\sum_{k=1}^K (X_{ik}-X_{jk})^2}$$ and the Mahalanobis distance: $$d(X_i,X_j) = (X_i-X_j)^T\Sigma^{-1}(X_i-X_j)$$, where $$\Sigma$$ is the covariance matrix of the data. • It is not neccessary to aim for an accurate model for the propensity score. In fact, an accurate model might cause more problems than it solves. For example, consider a logistic model that can perfectly predicts whether an unit received the treatment or control. In this case, the propensity scores are all $$0$$ or $$1$$, indicating that there is no overlap, and matching is no differen than randomly assigned a controlled unit to each treated unit. • There are several modifications for the matching algorithm: • We can match each treated unit with $$k>1$$ controlled unit closest to it; this is sometimes called $$k$$-to-$$1$$ matching. • We can specify a threshold $$d>0$$ and match each treated unit with all controlled units that are less than $$d$$ distance away. • In the two algorithms above, we can give more weights to closer matches and less weights to farther matches. There is also an R library MatchIt which is dedicated to matching for causal estimations. Check out this detailed example to get started. ## 17.3 Inverse probability weighting We can also use propensity scores as units’ weights without any matching. The idea is to weight the sample so that it is representative of the group of interest. To illustrate the idea of inverse probability weighting, we consider the following hypothetic scenario: Suppose for simplicity that there is only one confounder $$X$$ which has only two possible values: $$x_1$$ and $$x_2$$. Assume further that the potential outcomes are: \begin{align*} y^1(x_1) = 3, \quad &y^0(x_1) = 1 \\ y^1(x_2) = 2, \quad &y^0(x_2) = 1, \end{align*} and the probability scores are: $p(x_1)=\Pr[Z=1 \vert X=x_1] = 0.7, \quad p(x_2)=\Pr[Z=1 \vert X=x_2] = 0.4. \tag{17.1}$ Below is an example of count data generated from these conditional distributions. $\begin{array}{r|cc}& X=x_1 & X=x_2 \\ \hline Z=1 & 70 & 80 \\ Z=0 & 30 & 120 \\ \tau & 2 & 1 \end{array}$ The ATE is $$\frac{2*70+2*30+1*80+1*120}{70+30+80+120}=1.33$$. Due to the imbalance in the numbers of observed $$y^1$$ and $$y^0$$ for each value of $$X$$, the difference-in-mean estimate of $$\frac{3*70+2*80}{70+80}-\frac{1*30+1*120}{30+120}=1.47$$ does not exactly match the ATE. The can improve our estimate by balancing the numbers of treated and controlled units. If we were to know the conditional distribution Equation 19.1, we could have scaled the number of treated units down by a factor of $$\Pr[Z=1\vert X=x_i]$$ and the number of controlled units by a factor of $$\Pr[Z=0\vert X=x_i]$$ to obtain the so-called pseudo-population as shown below: $\begin{array}{r|cc}& X=x_1 & X=x_2 \\ \hline Z=1 & 100 & 200 \\ Z=0 & 100 & 200 \\ \tau & 2 & 1 \end{array}$ Now the difference-in-mean estimate is accurate: $$\frac{3*100+2*200}{100+200}-\frac{1*100+1*200}{100+200}=1.33$$, which exactly matches the ATE. Since we generally do not know the conditional distributions Equation 19.1, we estimate them using a logistic regression $$\hat{z} = \hat{p}(x) = \operatorname{logit}^{-1}(\beta_0+\beta_1x)$$. In general, the inverse probability weighting estimate (IPW) is: \begin{align*} \hat{\tau}^{\text{IPW}} &= \frac1{n}\sum_{i;z_i=1} \frac{y_i}{\hat{p}(x_i)} - \frac1{n}\sum_{i;z_i=0} \frac{y_i}{1-\hat{p}(x_i)} \\ &= \frac1{n}\sum_{i=1}^n \frac{y_iz_i}{\hat{p}(x_i)} - \frac1{n}\sum_{i=1}^n \frac{y_i(1-z_i)}{1-\hat{p}(x_i)}. \end{align*} Theoretically, if $$\hat{p}=p$$ and the ignorability assumption holds, then $$\hat{\tau}^{\text{IPW}}$$ is an unbiased estimator of $$\tau_{\text{ATE}}$$. To show this, we first compute the expectation of the first term. \begin{align*} \mathbb{E}\left[\frac{YZ}{p(X)}\right] &= \mathbb{E}\left[\mathbb{E}\left[\frac{YZ}{p(X)}\Bigg\vert X\right]\right] \\ & = \mathbb{E} \left[ \mathbb{E} \left[ \frac{Y^1 ~T}{p(X)} \Bigg|X\right] \right]\\ & = \mathbb{E} \left[ \frac{\mathbb{E} [ Y^1 |X]~ \mathbb{E}[T | X]}{p(X)} \right]\\ & = \mathbb{E} \left[\mathbb{E} [ Y^1 |X] \right]\\ & = \mathbb{E} [ Y^1 ]. \end{align*} Similarly, we have $$\mathbb{E}\left[\frac{Y(1-Z)}{1-\hat{p}(X)}\right]=\mathbb{E} [ Y^0 ]$$. It follows that $$\mathbb{E}[\tau^{\text{IPW}}]=\mathbb{E} [ Y^1 ]-\mathbb{E} [ Y^0 ]$$. For each unit $$(x,z,y)$$, we let $$\hat{p}(X)$$ be the propensity score obtained from the logistic regression. We will put a weight on each unit as follows: • Estimating the average treatment effect (ATE) • For every treated unit $$(x,z,y)$$, we put a weight of $$\frac{1}{\hat{p}(x)}$$. • For every controlled unit $$(x,z,y)$$, we put a weight of $$\frac{1}{1-\hat{p}(x)}$$. • Estimating the average effect of treatment on the treated (ATT) • For every treated unit $$(x,z,y)$$, we put a weight of $$1$$. • For every controlled unit $$(x,z,y)$$, we put a weight of $$\frac{\hat{p}(x)}{1-\hat{p}(x)}$$. • Estimating the average effect of treatment on the controlled (ATC) • For every treated unit $$(x,z,y)$$, we put a weight of $$\frac{1-\hat{p}(x)}{\hat{p}(x)}$$. • For every controlled unit $$(x,z,y)$$, we put a weight of $$1$$. Below is example code for estimating ATT of the child care program using the inverse probability weighting. inv.logit <- plogis wt.iptw <- inv.logit(pscores) / (1 - inv.logit(pscores)) wt.iptw[cc2treat==0] <- wt.iptw[cc2$treat==0] wt.iptw[cc2$treat==1] <- 1

ps_fit_iptw_design <- svydesign(ids=~1, weights=wt.iptw, data=cc2)
reg_ps.iptw <- svyglm(ppvtr.36 ~ treat + bw + bwg + hispanic
+ black + b.marr + lths + hs + ltcoll
+ work.dur + prenatal + sex + first
+ preterm + momage + dayskidh + income,
design=ps_fit_iptw_design, data=cc2)

summary(reg_ps.iptw)\$coef['treat', 1:2]
  Estimate Std. Error
8.372319   2.332639