set.seed(0)
library(rstanarm)
21 Panel data
Panel data, or longitudinal data, is data that contains multiple observations of each unit. As we can see from the method of difference-in-differences, under certain assumptions, the temporal nature of the data can be exploit to extract the treatment effect. Below are relevant variables in panel data analysis:
- \(t=1,2,\ldots,T\) is the time.
- \(y_{it}\) is unit \(i\)’s time-varying outcome.
- \(x_{it}\) is unit \(i\)’s vector of time-varying predictors, including the treatment.
- \(u_i\) is unit \(i\)’s vector of unobserved time-invariant effect.
Here, we impose that the unobserved individual effect is not changing over time. For example, \(u_i\) might be \(i\)’s health condition before receiving scheduled treatments. The difference-in-differences setting is a special case of panel data with \(T=2\) and \(u_i\) is the same for all units in each of the treatment and control groups.
Throughout this chapter, we will use panel data of 545 observations of young men from 1980 to 1987 from Vella and Verbeek (1998). The outcome (wage
) is the log of wage, and the predictors that we will use are: years of experience (exper
), whether the wage is set by collective bargaining (union
) and the marriage status (married
). We will also use the unique identifier (nr
) and the year identified (year
).
<- read.csv("data/wage97.csv")
wage97 <- wage97[, c("nr", "year", "exper",
wage97 "union", "married", "wage")]
$nr <- factor(wage97$nr)
wage97$year <- factor(wage97$year)
wage97
<- c("yes" = 1, "no" = 0)
lookup $union <- lookup[wage97$union]
wage97$married <- lookup[wage97$married]
wage97
head(wage97)
nr year exper union married wage
1 13 1980 1 0 0 1.197540
2 13 1981 2 1 0 1.853060
3 13 1982 3 0 0 1.344462
4 13 1983 4 0 0 1.433213
5 13 1984 5 0 0 1.568125
6 13 1985 6 0 0 1.699891
Notice that Unit #13 had been observed once a year since 1980. To see what kind of model we are looking for, we take a look at the wages of three people in the dataset.
<- c("red", "blue", "green")
colors plot(1, type = "n", xlab = "Year", ylab = "Wage",
xlim = c(1980,1987), ylim = c(0.75, 2.5))
for (i in 1:3) {
points(1980:1987, wage97$wage[8*i+1:8*(i+1)],
pch = 16, col = colors[i])
}
This suggests that simply fitting a regression on this data would be a big mistake, since such model assumes the same level of wages for every young men. Instead, we have to come up with a model that allows for the difference in wages.
21.1 Fixed effects model
We assume that the outcome is composed of the individual time-invariant effect and time-specific treatment effect, resulting in the following model:
\[ y_{it} = \beta x_{it} + u_i + \varepsilon_{it}, \quad \varepsilon_{it} \sim N(0,\sigma). \tag{21.1}\]
The vectors of coefficients \(\beta\) is the same for all units and times, but the individual effect \(u_i\) varies across the units.
The inclusion of the fixed effect term \(u_i\) means that we are controlling for anything that does not change over time; this is where we exploit the panel structure to control for unobserved confounders, given that they are constant over time. After fitting the model, the confounder effects will be stored in \(u_i\).
One way to obtain an estimate of \(u_i\) is by adding an indicator variable that is unique for each unit, but this would be computationally inefficient if we observe a large number of units. Instead, we use a trick that will remove \(u_i\) from the equation. First, we compute
- \(\bar{y}_{i}\) the mean of User \(i\)’s time-varying outcomes,
- \(\bar{x}_i\) the mean of User \(i\)’s time-varying predictors.
Then, we subtract the variables in Equation 21.1 by their means. Since \(u_i\) is fixed over time, the mean of \(u_i\) is \(u_i\) itself.
\[\begin{align*} y_{it}-\bar{y}_{i} &= \beta (x_{it}-\bar{x}_i) + (u_i-u_i) + (\varepsilon_{it}- \bar{\varepsilon}_i) \\ &= \beta (x_{it}-\bar{x}_i) + (\varepsilon_{it}- \bar{\varepsilon}_i). \end{align*}\]Denoting \(\ddot{y}_{it}=y_{it}-\bar{y}_{i}\), \(\ddot{x}_{it}=x_{it}-\bar{x}_{i}\) and \(\ddot{\varepsilon}_{it}=\varepsilon_{it}- \bar{\varepsilon}_i\), we obtain a new regression equation:
\[ \ddot{y}_{it} = \beta \ddot{x}_{it} + (\varepsilon_{it}- \bar{\varepsilon}_i). \]
Thus, we can instead fit on the demeaned data to obtain the estimate of the coefficients, and in particular, the average treatment effect.
Let us try this on the wage data. We can subtract the mean from each variable by using the scale
function with the scale
argument set to FALSE
.
$exper_c <- with(wage97, exper - ave(exper, nr))
wage97$union_c <- with(wage97, union - ave(union, nr))
wage97$married_c <- with(wage97, married - ave(married, nr))
wage97$wage_c <- with(wage97, wage - ave(wage, nr))
wage97
head(wage97[, c("nr", "year", "exper_c",
"union_c", "married_c", "wage_c")])
nr year exper_c union_c married_c wage_c
1 13 1980 -3.5 -0.125 0 -0.05811187
2 13 1981 -2.5 0.875 0 0.59740792
3 13 1982 -1.5 -0.125 0 0.08880961
4 13 1983 -0.5 -0.125 0 0.17756126
5 13 1984 0.5 -0.125 0 0.31247301
6 13 1985 1.5 -0.125 0 0.44423887
Now we can fit the regression model with e.g. stan_glm
. However, the differences in the number of observations, predictors’ variances, etc. suggest that we should assume heterogeneous standard errors across the units. We can fit such model using stan_glmer
from the rstanarm
library. In the formula, we add (1 | nr)
, which means that we are varying the intercept (1
) by the unit.
<- stan_glmer(wage ~ married + union + exper
fit_1 + (1 | nr),
data=wage97, seed=0, refresh=0)
print(fit_1, digits=3)
stan_glmer
family: gaussian [identity]
formula: wage ~ married + union + exper + (1 | nr)
observations: 3115
------
Median MAD_SD
(Intercept) 1.241 0.027
married 0.080 0.020
union 0.102 0.022
exper 0.055 0.003
Auxiliary parameter(s):
Median MAD_SD
sigma 0.348 0.005
Error terms:
Groups Name Std.Dev.
nr (Intercept) 0.3852
Residual 0.3485
Num. levels: nr 429
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The model tells us that marriage increases the wage by 8% on average (since the outcome is the log of the wage).
The intercept of each individual’s model is a sum of two components:
- The first component is the fixed effects which is the same for all units; this is the value of
intercept
shown above, - The second component is the random effects which varies from unit to unit. To see the random effects of all units, we can use
ranef
on the fitted model.
ranef(fit_1)$nr[1:5, ]
[1] -0.22613167 -0.01290594 0.17485998 0.26456645 -0.10256147
The actual intercept \(u_i\) of each individual’s model can be obtained using coef
on the fitted model.
head(coef(fit_1)$nr)
(Intercept) married union exper
13 1.014496 0.08003359 0.1019362 0.05538296
17 1.227722 0.08003359 0.1019362 0.05538296
45 1.415488 0.08003359 0.1019362 0.05538296
110 1.505195 0.08003359 0.1019362 0.05538296
120 1.138067 0.08003359 0.1019362 0.05538296
126 1.621628 0.08003359 0.1019362 0.05538296
Notice that each intercept is equal to the sum of the fixed effects and random effects above.
Let us take a look at the actual wage versus the predicted wage from the fixed effects model. Overall, the predictions (the line plots) are close to the actual wages.
<- c("red", "blue", "green")
colors plot(1, type = "n", xlab = "Year", ylab = "Wage",
xlim = c(1980,1987), ylim = c(0.75, 2.5))
for (i in 1:3) {
points(1980:1987, wage97$wage[8*i+1:8*(i+1)],
pch = 16, col = colors[i])
lines(1980:1987, fit_1$fitted.values[8*i+1:8*(i+1)],
pch = 18, col = colors[i], type = "b")
}
21.2 Time effects
In addition to the user’s effect, we can have a fixed effect at each time as well.
\[ y_{it} = \beta x_{it} + u_i + v_t + \varepsilon_{it}, \quad \varepsilon_{it} \sim N(0,\sigma). \tag{21.2}\]
This is applicable when the time is a counfounder that affects both the treatment and the outcome. For example, due to cultural shift, the number of marriages increases with time. And due to the inflation, the wage also increases with time. The graphical model of our example is shown below:
To fit the fixed effects model with stan_glmer
, simply add (1 | year)
to the formula.
<- stan_glmer(wage ~ married + union + exper
fit_2 + (1 | nr) + (1 | year),
data=wage97, seed=0, refresh=0)
print(fit_2, digits=3)
stan_glmer
family: gaussian [identity]
formula: wage ~ married + union + exper + (1 | nr) + (1 | year)
observations: 3115
------
Median MAD_SD
(Intercept) 1.673 0.124
married 0.074 0.020
union 0.108 0.022
exper -0.013 0.015
Auxiliary parameter(s):
Median MAD_SD
sigma 0.349 0.005
Error terms:
Groups Name Std.Dev.
nr (Intercept) 0.3653
year (Intercept) 0.2202
Residual 0.3486
Num. levels: nr 429, year 8
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
With this new model, the average effect of marriage on wage decreases from 8% to 7.4%, but it is still significant.
We can use ranef
to inspect the random effects as before.
ranef(fit_2)$nr[1:5, ]
[1] -0.3381000 0.0525898 0.1198533 0.3938407 -0.1562601
We can look at the time’s random effects as well.
ranef(fit_2)$year[1:5, ]
[1] -0.26494327 -0.15934185 -0.09733085 -0.03082152 0.03349460
And we can see that the effect increases by the years.
21.3 Assumptions and Cautions
Ignorability. For the coefficient of the treatment ot be a valid estimate of the treatment effect, we have to assume independent conditional to the grouping variable—in this case the unit \(i\)—and pre-treatment covariates \(x\). That is,
\[ y^0,y^1 \perp z \vert i, x. \]
If the model has time effects, we have to condition on the time variable as well.
Confounders must be constant over time. The inclusion of the time-invariant variable \(u_i\) imposes that confounding must remain constant at all time. Any unobserved confounders that are changing over time will not be detected by the fixed effects model.
No reverse causality. The problem of assuming a wrong causal direction usually comes up in panel data. For example, instead of marriage having an effect on the wage, it might be the other way around, as people with more wages have more chance of getting married. As another example. when one considers between the police spending and crime rate, the causal effect can go both ways: higher crime rate can cause more police spending, or having more police spending actually reduces the crime rate. Review of previous research is a common way to find the right direction, as well as asking domain experts.