21  Panel data

Panel data, or longitudinal data, is data that contains multiple observations of each unit. As we can see from the method of difference-in-differences, under certain assumptions, the temporal nature of the data can be exploit to extract the treatment effect. Below are relevant variables in panel data analysis:

Here, we impose that the unobserved individual effect is not changing over time. For example, \(u_i\) might be \(i\)’s health condition before receiving scheduled treatments. The difference-in-differences setting is a special case of panel data with \(T=2\) and \(u_i\) is the same for all units in each of the treatment and control groups.

Throughout this chapter, we will use panel data of 545 observations of young men from 1980 to 1987 from Vella and Verbeek (1998). The outcome (wage) is the log of wage, and the predictors that we will use are: years of experience (exper), whether the wage is set by collective bargaining (union) and the marriage status (married). We will also use the unique identifier (nr) and the year identified (year).

wage97 <- read.csv("data/wage97.csv")
wage97 <- wage97[, c("nr", "year", "exper",
                     "union", "married", "wage")]

wage97$nr <- factor(wage97$nr)
wage97$year <- factor(wage97$year)

lookup <- c("yes" = 1, "no" = 0)
wage97$union <- lookup[wage97$union]
wage97$married <- lookup[wage97$married]

  nr year exper union married     wage
1 13 1980     1     0       0 1.197540
2 13 1981     2     1       0 1.853060
3 13 1982     3     0       0 1.344462
4 13 1983     4     0       0 1.433213
5 13 1984     5     0       0 1.568125
6 13 1985     6     0       0 1.699891

Notice that Unit #13 had been observed once a year since 1980. To see what kind of model we are looking for, we take a look at the wages of three people in the dataset.

colors <- c("red", "blue", "green")
plot(1, type = "n", xlab = "Year", ylab = "Wage",
     xlim = c(1980,1987), ylim = c(0.75, 2.5))

for (i in 1:3) {
    points(1980:1987, wage97$wage[8*i+1:8*(i+1)],
           pch = 16, col = colors[i])

This suggests that simply fitting a regression on this data would be a big mistake, since such model assumes the same level of wages for every young men. Instead, we have to come up with a model that allows for the difference in wages.

21.1 Fixed effects model

We assume that the outcome is composed of the individual time-invariant effect and time-specific treatment effect, resulting in the following model:

\[ y_{it} = \beta x_{it} + u_i + \varepsilon_{it}, \quad \varepsilon_{it} \sim N(0,\sigma). \tag{21.1}\]

The vectors of coefficients \(\beta\) is the same for all units and times, but the individual effect \(u_i\) varies across the units.

The inclusion of the fixed effect term \(u_i\) means that we are controlling for anything that does not change over time; this is where we exploit the panel structure to control for unobserved confounders, given that they are constant over time. After fitting the model, the confounder effects will be stored in \(u_i\).

One way to obtain an estimate of \(u_i\) is by adding an indicator variable that is unique for each unit, but this would be computationally inefficient if we observe a large number of units. Instead, we use a trick that will remove \(u_i\) from the equation. First, we compute

  • \(\bar{y}_{i}\) the mean of User \(i\)’s time-varying outcomes,
  • \(\bar{x}_i\) the mean of User \(i\)’s time-varying predictors.

Then, we subtract the variables in Equation 21.1 by their means. Since \(u_i\) is fixed over time, the mean of \(u_i\) is \(u_i\) itself.

\[\begin{align*} y_{it}-\bar{y}_{i} &= \beta (x_{it}-\bar{x}_i) + (u_i-u_i) + (\varepsilon_{it}- \bar{\varepsilon}_i) \\ &= \beta (x_{it}-\bar{x}_i) + (\varepsilon_{it}- \bar{\varepsilon}_i). \end{align*}\]

Denoting \(\ddot{y}_{it}=y_{it}-\bar{y}_{i}\), \(\ddot{x}_{it}=x_{it}-\bar{x}_{i}\) and \(\ddot{\varepsilon}_{it}=\varepsilon_{it}- \bar{\varepsilon}_i\), we obtain a new regression equation:

\[ \ddot{y}_{it} = \beta \ddot{x}_{it} + (\varepsilon_{it}- \bar{\varepsilon}_i). \]

Thus, we can instead fit on the demeaned data to obtain the estimate of the coefficients, and in particular, the average treatment effect.

Let us try this on the wage data. We can subtract the mean from each variable by using the scale function with the scale argument set to FALSE.

wage97$exper_c <- with(wage97, exper - ave(exper, nr))
wage97$union_c <- with(wage97, union - ave(union, nr))
wage97$married_c <- with(wage97, married - ave(married, nr))
wage97$wage_c <- with(wage97, wage - ave(wage, nr))

head(wage97[, c("nr", "year", "exper_c",
                "union_c", "married_c", "wage_c")])
  nr year exper_c union_c married_c      wage_c
1 13 1980    -3.5  -0.125         0 -0.05811187
2 13 1981    -2.5   0.875         0  0.59740792
3 13 1982    -1.5  -0.125         0  0.08880961
4 13 1983    -0.5  -0.125         0  0.17756126
5 13 1984     0.5  -0.125         0  0.31247301
6 13 1985     1.5  -0.125         0  0.44423887

Now we can fit the regression model with e.g. stan_glm. However, the differences in the number of observations, predictors’ variances, etc. suggest that we should assume heterogeneous standard errors across the units. We can fit such model using stan_glmer from the rstanarm library. In the formula, we add (1 | nr), which means that we are varying the intercept (1) by the unit.

fit_1 <- stan_glmer(wage ~ married + union + exper
                    + (1 | nr),
                    data=wage97, seed=0, refresh=0)

print(fit_1, digits=3)
 family:       gaussian [identity]
 formula:      wage ~ married + union + exper + (1 | nr)
 observations: 3115
            Median MAD_SD
(Intercept) 1.241  0.027 
married     0.080  0.020 
union       0.102  0.022 
exper       0.055  0.003 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.348  0.005 

Error terms:
 Groups   Name        Std.Dev.
 nr       (Intercept) 0.3852  
 Residual             0.3485  
Num. levels: nr 429 

* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The model tells us that marriage increases the wage by 8% on average (since the outcome is the log of the wage).

The intercept of each individual’s model is a sum of two components:

  • The first component is the fixed effects which is the same for all units; this is the value of intercept shown above,
  • The second component is the random effects which varies from unit to unit. To see the random effects of all units, we can use ranef on the fitted model.
ranef(fit_1)$nr[1:5, ]
[1] -0.22613167 -0.01290594  0.17485998  0.26456645 -0.10256147

The actual intercept \(u_i\) of each individual’s model can be obtained using coef on the fitted model.

    (Intercept)    married     union      exper
13     1.014496 0.08003359 0.1019362 0.05538296
17     1.227722 0.08003359 0.1019362 0.05538296
45     1.415488 0.08003359 0.1019362 0.05538296
110    1.505195 0.08003359 0.1019362 0.05538296
120    1.138067 0.08003359 0.1019362 0.05538296
126    1.621628 0.08003359 0.1019362 0.05538296

Notice that each intercept is equal to the sum of the fixed effects and random effects above.

Let us take a look at the actual wage versus the predicted wage from the fixed effects model. Overall, the predictions (the line plots) are close to the actual wages.

colors <- c("red", "blue", "green")
plot(1, type = "n", xlab = "Year", ylab = "Wage",
     xlim = c(1980,1987), ylim = c(0.75, 2.5))

for (i in 1:3) {
    points(1980:1987, wage97$wage[8*i+1:8*(i+1)],
           pch = 16, col = colors[i])
    lines(1980:1987, fit_1$fitted.values[8*i+1:8*(i+1)],
          pch = 18, col = colors[i], type = "b")

21.2 Time effects

In addition to the user’s effect, we can have a fixed effect at each time as well.

\[ y_{it} = \beta x_{it} + u_i + v_t + \varepsilon_{it}, \quad \varepsilon_{it} \sim N(0,\sigma). \tag{21.2}\]

This is applicable when the time is a counfounder that affects both the treatment and the outcome. For example, due to cultural shift, the number of marriages increases with time. And due to the inflation, the wage also increases with time. The graphical model of our example is shown below:

G X Person Z Marriage X->Z Y Wage X->Y Z->Y W Time W->Z W->Y

To fit the fixed effects model with stan_glmer, simply add (1 | year) to the formula.

fit_2 <- stan_glmer(wage ~ married + union + exper
                    + (1 | nr) + (1 | year),
                    data=wage97, seed=0, refresh=0)

print(fit_2, digits=3)
 family:       gaussian [identity]
 formula:      wage ~ married + union + exper + (1 | nr) + (1 | year)
 observations: 3115
            Median MAD_SD
(Intercept)  1.673  0.124
married      0.074  0.020
union        0.108  0.022
exper       -0.013  0.015

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.349  0.005 

Error terms:
 Groups   Name        Std.Dev.
 nr       (Intercept) 0.3653  
 year     (Intercept) 0.2202  
 Residual             0.3486  
Num. levels: nr 429, year 8 

* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

With this new model, the average effect of marriage on wage decreases from 8% to 7.4%, but it is still significant.

We can use ranef to inspect the random effects as before.

ranef(fit_2)$nr[1:5, ]
[1] -0.3381000  0.0525898  0.1198533  0.3938407 -0.1562601

We can look at the time’s random effects as well.

ranef(fit_2)$year[1:5, ]
[1] -0.26494327 -0.15934185 -0.09733085 -0.03082152  0.03349460

And we can see that the effect increases by the years.

21.3 Assumptions and Cautions

Ignorability. For the coefficient of the treatment ot be a valid estimate of the treatment effect, we have to assume independent conditional to the grouping variable—in this case the unit \(i\)—and pre-treatment covariates \(x\). That is,

\[ y^0,y^1 \perp z \vert i, x. \]

If the model has time effects, we have to condition on the time variable as well.

Confounders must be constant over time. The inclusion of the time-invariant variable \(u_i\) imposes that confounding must remain constant at all time. Any unobserved confounders that are changing over time will not be detected by the fixed effects model.

No reverse causality. The problem of assuming a wrong causal direction usually comes up in panel data. For example, instead of marriage having an effect on the wage, it might be the other way around, as people with more wages have more chance of getting married. As another example. when one considers between the police spending and crime rate, the causal effect can go both ways: higher crime rate can cause more police spending, or having more police spending actually reduces the crime rate. Review of previous research is a common way to find the right direction, as well as asking domain experts.


Vella, Francis, and Marno Verbeek. 1998. “Whose Wages Do Unions Raise? A Dynamic Model of Unionism and Wage Rate Determination for Young Men.” Journal of Applied Econometrics 13 (2): 163–83. http://www.jstor.org/stable/223257.