12  Generalized linear models

Linear regression and logistic regression are examples of a more general class of models: generalized linear models (GLMs). As in logistic regression, we can change the nonlinear transformation and the outcome distribution to handle various types of data, such as data with bounded outcomes, count data, and data with multi-valued outcomes.

12.1 Definition of generalized linear models

Generalized linear models (GLMs) are a class of models that conform to the transformed linear predictor design. Specifically, a GLM consists of:

  1. A vector of outcome data \(y=(y_1,\ldots,y_n)\).
  2. A matrix of predictors \(X=(\mathbf{1}_n,X_1,\ldots,X_p)\) and a vector of coefficients \(\beta=(\beta_0,\ldots,\beta_p)^T\). The linear predictor is given by \(X\beta\).
  3. A link function \(g\) that transforms the linear predictor to the model’s prediction through its inverse: \(\hat{y} = g^{-1}(X\beta)\).
  4. A distribution of the outcome, given the prediction: \(p(y\vert\hat{y})\).
  5. Possibly other parameters, such as variances, overdispersion parameters, and upper and/or lower bounds on the outcome.

Here are the link functions and the outcome distributions that we used in linear regression and logistic regression:

  • In the linear regression, we used \(g(x)=x\) (and so \(g^{-1}(x)=x\)) and \(p(y\vert \hat{y})=\mathcal{N}(y\vert X\beta,\sigma^2)\), the normal density with mean \(X\beta\) and variance \(\sigma^2\).
  • In the logistic regression, we used the logit function \(g(x)=\text{logit}(x)\) with the logistic inverse: \(g^{-1}(x) = \text{logit}^{-1}(x)= e^x/(1+e^x)\). With the prediction \(\hat{y}=\text{logit}^{-1}(X\beta)\), the outcome distribution is the Bernoulli distribution: \(p(y\vert\hat{y}) = \hat{y}^y(1-\hat{y})^{1-y}\).
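
As a concrete illustration, the two links and their inverses can be written as plain R functions (a minimal sketch; base R also provides qlogis and plogis for the logit and its inverse):

identity_link <- function(x) x                # linear regression: g(x) = x
logit <- function(p) log(p/(1 - p))           # logistic regression link
invlogit <- function(x) exp(x)/(1 + exp(x))   # inverse link
invlogit(logit(0.25))  # recovers 0.25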

12.2 Poisson and negative binomial regression

12.2.1 Poisson regression

We start with the simplest regression model for count data:

\[ y \sim \operatorname{Poisson}(e^{X\beta}). \]

This is a GLM in which:

  • The link function is \(g(x)=\log x\).

  • The prediction is \(\hat{y} = g^{-1}(X\beta) = e^{X\beta}\).

  • The outcome distribution is \(p(y\vert\hat{y})\sim \operatorname{Poisson}(\hat{y})\).

From the properties of the Poisson distribution, we have

  • \(\mathbb{E}[y\vert X] = e^{X\beta}\).

  • \(\operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y\vert X]} = e^{X\beta/2}\).

Thus, in a Poisson model, the standard deviation of the outcome is fully determined by the model. If the expected outcome is \(\mathbb{E}[y] = 10\), then the prediction errors are mostly in the range of \(\pm\sqrt{\mathbb{E}[y]} \approx\pm 3.2\). But for many datasets, the actual prediction errors fall outside this range.
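
A quick base-R simulation confirms that Poisson draws have standard deviation equal to the square root of their mean:

# Poisson draws with mean 10 should have sd close to sqrt(10), about 3.2
y_sim <- rpois(1e5, lambda=10)
c(mean=mean(y_sim), sd=sd(y_sim), sqrt_mean=sqrt(10))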

12.2.2 Overdispersion and underdispersion

Overdispersion and underdispersion refer to data that show more or less variation, respectively, than the Poisson model implies. In other words, when fitting a Poisson model, the residuals of overdispersed data are often larger, and those of underdispersed data mostly smaller, than the square root of the predicted value. For such data, using Poisson models would be inappropriate. In the next section, we introduce another model that allows larger prediction errors.

12.2.3 Negative binomial regression

We now introduce another model for count data:

\[ y \sim \operatorname{negative binomial}(e^{X\beta},\phi), \]

where \(\operatorname{negative binomial}(\mu,\phi)\) is parameterized by its mean \(\mu\) and a shape parameter \(\phi\). (Classically, the negative binomial models the number of failures in a sequence of iid Bernoulli trials before observing \(\phi\) successes; here the distribution is reparameterized in terms of its mean, and the range of \(\phi\) is extended to positive real numbers.)

The predictive standard deviation is

\[ \operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y\vert X]+\frac{1}{\phi}\mathbb{E}[y\vert X]^2}. \]

In the context of count data modeling, \(\phi\) is the “reciprocal dispersion” parameter:

  • Lower values of \(\phi\) correspond to more overdispersion.
  • Higher values of \(\phi\) correspond to less overdispersion.
  • The negative binomial distribution becomes the Poisson distribution in the limit \(\phi\to\infty\) (that is, when there is no overdispersion).
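
We can see the effect of \(\phi\) by simulating from the mean/shape parameterization; here is a sketch using MASS::rnegbin, whose theta argument plays the role of \(\phi\):

library(MASS)
# mean 10, phi = 1: sd is sqrt(10 + 10^2/1), about 10.5 (strong overdispersion)
sd(rnegbin(1e5, mu=10, theta=1))
# mean 10, phi = 100: sd is sqrt(10 + 10^2/100), about 3.3 (close to Poisson)
sd(rnegbin(1e5, mu=10, theta=100))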

12.2.4 Exposure and offset

In many cases, the outcomes depend on the amount of exposure to the environment, and so two different outcomes may not be directly comparable. For example, the number of daily deaths by country depends on population size, and the number of car accidents at an intersection depends on the number of cars running through that intersection.

If this is the case, we may instead let \(e^{X\beta}\) be the expected rate of outcomes \(r\), so that the expected outcome is the product of the exposure \(u\) and the rate \(r\):

\[ \mathbb{E}[y] = ur = ue^{X\beta}, \tag{12.1}\]

which leads to a new model in which the exposure \(u\) enters multiplicatively:

\[ y\sim \operatorname{negative binomial}(ue^{X\beta},\phi). \]

We can separate the exposure from the linear predictor by taking the logarithm of both sides of Equation 12.1:

\[ \log\mathbb{E}[y] = \log u +X\beta. \]

With this, we call \(\log u\) the offset.

12.2.5 Example: effect of pest management on reducing cockroach levels

We consider the Roaches dataset, which was used to study the effect of pest management on reducing cockroach levels in urban apartments. In the experiment, there were 158 apartments in the treatment group and 104 apartments in the control group. The data consists of the following variables:

Name        Description
y           post-treatment roach count
roach1      pre-treatment roach level
treatment   0 if control, 1 if treatment
senior      1 if the apartment is restricted to elderly
exposure2   number of days the traps had been laid

We create a new predictor named roach100, which is roach1 scaled down by a factor of 100.

roaches <- read.csv("data/roaches.csv")
roaches$roach100 <- roaches$roach1/100

head(roaches)
  X   y roach1 treatment senior exposure2 roach100
1 1 153 308.00         1      0  0.800000   3.0800
2 2 127 331.25         1      0  0.600000   3.3125
3 3   7   1.67         1      0  1.000000   0.0167
4 4   7   3.00         1      0  1.000000   0.0300
5 5   0   2.00         1      0  1.142857   0.0200
6 6   0   0.00         1      0  1.000000   0.0000

First, we fit a Poisson regression by specifying family=poisson in stan_glm. The number of post-treatment roaches depends on the number of days the traps had been laid, so it makes sense to let exposure2 be the model’s exposure.

library(rstanarm)
fit_1 <- stan_glm(y ~ roach100 + treatment + senior,
                  family=poisson,
                  offset=log(exposure2),
                  data=roaches, refresh=0)

print(fit_1)
stan_glm
 family:       poisson [log]
 formula:      y ~ roach100 + treatment + senior
 observations: 262
 predictors:   4
------
            Median MAD_SD
(Intercept)  3.1    0.0  
roach100     0.7    0.0  
treatment   -0.5    0.0  
senior      -0.4    0.0  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Interpreting the coefficients. The fitted Poisson model for the number of post-treatment roaches is

\[ y \sim \operatorname{Poisson}(\texttt{exposure2}\cdot e^{3.1+0.7*\texttt{roach100}-0.5*\texttt{treatment}-0.4*\texttt{senior}}). \]

We can interpret the coefficients of the regression as follows:

  • The intercept is the prediction for \(\texttt{roach100}=0\), \(\texttt{treatment}=0\) and \(\texttt{senior}=0\). More precisely, for a non-senior apartment that was roach-free before and did not receive the pest management, the expected number of roaches per unit of exposure is \(e^{3.1}\approx 22\). Note that in some settings the predictors cannot all be zero, in which case the intercept has no direct interpretation.
  • The coefficient \(0.7\) of \(\texttt{roach100}\) indicates that, for each additional \(100\) pre-treatment roaches (while keeping treatment and senior at the same level), the expected number of post-treatment roaches increases by a factor of \(e^{0.7}\approx 2.0\); that is, it roughly doubles. (The approximation \(e^{\beta}\approx 1+\beta\) is accurate only for small \(\beta\).)
  • The coefficient \(-0.5\) of \(\texttt{treatment}\) indicates that the number of post-treatment roaches in an apartment with pest management is lower than that of an apartment without pest management (with the same level of roach100 and senior) by a factor of \(e^{-0.5}\approx 0.61\), or about a \(39\%\) decrease.
  • The coefficient \(-0.4\) of \(\texttt{senior}\) indicates that the number of post-treatment roaches in a senior apartment is lower than that of a non-senior apartment (with the same level of roach100 and treatment) by a factor of \(e^{-0.4}\approx 0.67\), or about a \(33\%\) decrease.

Checking the fit via simulation. Now we check the model’s fit by looking at the posterior predictive distribution. As usual, we use the posterior_predict function to generate 4000 replicated datasets of post-treatment roach counts, from which we sample 100 for plotting.

yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
subset <- sample(n_sims, 100)

After that, we use ppc_dens_overlay to compare the distributions of the original and simulated data. Here, we plot both on the logarithmic scale, which makes the discrepancies easier to see.

library(bayesplot)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_1[subset,]+1))

The plots show that the original data is overdispersed and contains a lot of zeros; this indicates that the Poisson model might not be suitable as it only allows a relatively small number of zeros.

We thus turn to the negative binomial regression, which can be done by specifying family=neg_binomial_2. As before, we let exposure2 be the exposure. There is no need to specify the reciprocal dispersion parameter \(\phi\)—it can be estimated from the data.

fit_2 <- stan_glm(y ~ roach100 + treatment + senior,
                  family=neg_binomial_2,
                  offset=log(exposure2),
                  data=roaches, refresh=0)

print(fit_2)
stan_glm
 family:       neg_binomial_2 [log]
 formula:      y ~ roach100 + treatment + senior
 observations: 262
 predictors:   4
------
            Median MAD_SD
(Intercept)  2.8    0.2  
roach100     1.3    0.3  
treatment   -0.8    0.2  
senior      -0.3    0.3  

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 0.3    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Let us see how the posterior predictive fits our data.

yrep_2 <- posterior_predict(fit_2)

n_sims <- nrow(yrep_2)
subset <- sample(n_sims, 100)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_2[subset,]+1))

The negative binomial model looks like a better fit, with a high probability of zeros. However, the model allows the number of roaches to be as high as \(10^5\), which is unrealistic. We will see how we can adjust this model later in the chapter.

12.3 Logistic-binomial and beta-binomial models

12.3.1 Logistic-binomial model

The logistic regression can be used to model the number of successes from \(n\) Bernoulli trials. In this setting, we can use the following GLM design:

  • The link function is \(g(x)=\text{logit}(x)\).

  • The prediction is \(\hat{p} = g^{-1}(X\beta) = \text{logit}^{-1}(X\beta)\).

  • The outcome distribution is \(p(y\vert\hat{p})\sim \operatorname{Binomial}(n,\hat{p})\), where \(n\) is the number of trials.

Let us try this model on simulated data of basketball shooting. The following code simulates \(N =100\) players, each taking \(n=20\) shots. We also encode our assumption that field goal percentage is negatively correlated with weight.

N <- 100
weight <- rnorm(N, 216, 31)
p <- 0.6 - 0.1*(weight - 216)/31
n <- rep(20, N)
y <- rbinom(N, n, p)
data <- data.frame(n=n, y=y, weight=weight)

head(data)
   n  y   weight
1 20 12 222.2577
2 20 14 218.8143
3 20 10 211.4947
4 20 10 239.5631
5 20 11 173.2195
6 20 12 190.9105

To model such count data with logistic regression in stan_glm, we specify the outcome as a pair: the number of successes and the number of failures.

fit_1a <- stan_glm(cbind(y, 20-y) ~ weight,
                   family=binomial(link="logit"),
                   data=data, refresh=0)

print(fit_1a, digits=3)
stan_glm
 family:       binomial [logit]
 formula:      cbind(y, 20 - y) ~ weight
 observations: 100
 predictors:   2
------
            Median MAD_SD
(Intercept)  3.735  0.383
weight      -0.016  0.002

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Since the data were generated from a binomial model, we expect the residuals to show no more variation than the binomial model implies. Let us see if that is the case.

y <- fit_1a$y[,1]
p_hat <- fitted(fit_1a)
y_hat <- 20 * p_hat

plot(y_hat, y - y_hat,
     xlab="Predicted number of field goals",
     ylab="Residual", pch=20)
abline(0, 0)

12.3.2 Overdispersion

Logistic-binomial models often suffer from overdispersion, that is, more variation in the data than the model indicates. To detect overdispersion, we first recall that the standard deviation of \(y\sim \operatorname{Binomial}(n,\hat{p})\) is \(\sqrt{n\hat{p}(1-\hat{p})}\), where \(\hat{p}=\text{logit}^{-1}(X\beta)\). Then, we consider the standardized residual:

\[ z_i = \frac{y_i-\hat{y}_i}{\operatorname{sd}(\hat{y}_i)} = \frac{y_i-n_i\hat{p}_i}{\sqrt{n_i\hat{p}_i(1-\hat{p}_i)}}, \]

which under the model has mean \(0\) and standard deviation \(1\). We can then formally test for overdispersion by comparing \(\sum_{i=1}^N z^2_i\) to the \(\chi^2_{N-k}\) distribution, where \(k\) is the number of estimated coefficients (here \(k=2\): the intercept and the weight coefficient).
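
Here is a sketch of this test for fit_1a, reusing the simulated data above (\(N=100\) players, 2 estimated coefficients):

# Standardized residuals and chi-squared test for overdispersion
p_hat <- fitted(fit_1a)
z <- (data$y - data$n*p_hat)/sqrt(data$n*p_hat*(1 - p_hat))
sum(z^2)/(100 - 2)                              # should be near 1 if no overdispersion
pchisq(sum(z^2), df=100 - 2, lower.tail=FALSE)  # p-value of the test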

12.3.3 Beta-binomial model

To handle the overdispersion, we modify the outcome distribution of the logistic-binomial model to obtain the beta-binomial model, which has the following GLM design:

  • The link function is \(g(x)=\text{logit}(x)\).

  • The prediction is \(\hat{p} = g^{-1}(X\beta) = \text{logit}^{-1}(X\beta)\).

  • The outcome distribution is \[ p(y\vert\hat{p})\sim \operatorname{Beta-Binomial}\left(n,\hat{p}\phi,\left(1-\hat{p}\right)\phi\right), \] where \(n\) is the number of trials and \(\phi>0\) is a precision parameter that controls the overdispersion.

The parameters of the beta-binomial distribution are chosen so that the mean of \(y\) is \(n\hat{p}\) and the standard deviation of \(y\) is controlled by \(\phi\).

\[\begin{align*} \mathbb{E}[y\vert\hat{p}] &= n\hat{p} \\ \operatorname{sd}(y\vert\hat{p}) &= \sqrt{n\hat{p}(1-\hat{p})\left(\frac{n+\phi}{1+\phi}\right)}. \end{align*}\]

Therefore, lower \(\phi\) allows for more overdispersion, higher \(\phi\) for less overdispersion. In the limit \(\phi\to \infty\) (no overdispersion), we recover the logistic-binomial model.
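
We can check this behavior by simulating beta-binomial draws through their beta-mixture representation (a sketch; the helper rbetabinom below is written just for this illustration):

# Draw a beta probability, then a binomial count
rbetabinom <- function(N, n, p, phi) rbinom(N, n, rbeta(N, p*phi, (1 - p)*phi))
sd(rbetabinom(1e5, n=20, p=0.5, phi=5))    # low phi: sd around 4.6
sd(rbetabinom(1e5, n=20, p=0.5, phi=1e6))  # huge phi: sd near the binomial sd
sqrt(20*0.5*0.5)                           # binomial sd, about 2.24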

To fit the beta-binomial model, we use the brms library, whose brm function lets us select the beta_binomial family. We specify the number of trials via the trials() helper; in this example, the number of trials (i.e. number of shots) for each basketball player is 20. The input of trials() can also be a vector when the number of trials varies across data points. The dispersion parameter \(\phi\) is estimated from the data, so there is no need to specify it in brm.

# install.packages("brms")
library(brms)
fit_1b <- brm(y|trials(20) ~ weight, family=beta_binomial,
              data=data, seed=0, refresh=0)
Compiling Stan program...
Start sampling
print(fit_1b, digits=3)
 Family: beta_binomial 
  Links: mu = logit; phi = identity 
Formula: y | trials(20) ~ weight 
   Data: data (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI  Rhat Bulk_ESS Tail_ESS
Intercept    3.723     0.403    2.937    4.506 1.000     4797     3090
weight      -0.016     0.002   -0.019   -0.012 0.999     5058     3181

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI  Rhat Bulk_ESS Tail_ESS
phi  129.642    77.237   43.188  339.890 1.003     2204     2323

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The dispersion parameter estimate \(\phi\approx 129.6\) tells us that the predictive standard deviation of the beta-binomial model is \(\sqrt{\frac{20+129.6}{1+129.6}}\approx 1.07\) times that of the logistic-binomial model: very little overdispersion, as expected for data simulated from a binomial model.

12.4 Ordered and unordered categorical regression

Sometimes the outcomes can have more than two categories, which can be ordered or unordered.

Examples of ordered categories are:

  • Easy, Medium, Hard
  • Slow, Normal, Fast
  • Urban, Suburban, Rural

Examples of unordered categories are:

  • Football, Basketball, Baseball
  • Car, Bus, Train
  • A, B, O, AB

Here, we introduce two models: one to handle ordered categories, and another to handle unordered categories.

12.4.1 Ordered logistic regression

For the data with ordered categories \(1,\ldots,K\), we may use the following logistic model:

  • The link function \(g\) can be any strictly increasing function that maps \((0,1)\) to \((-\infty, \infty)\).

    • If \(g(x)=\text{logit}(x)\), then the resulting model is called the ordered logit model or proportional odds model.
    • If \(g(x)=\Phi^{-1}(x)\), where \(\Phi\) is the cdf of the standard normal distribution, then the resulting model is called the ordered probit model.
  • The prediction for the cumulative probability of the first \(k\) categories is \(\hat{p}_{\leq k} = g^{-1}(c_{k\vert k+1}-X\beta)\) for \(1\leq k \leq K-1\). Here, we have additional parameters \(c_{1\vert 2}<c_{2\vert 3}<\ldots<c_{K-1\vert K}\), called the cutpoints; they play the role of the intercept, so \(X\beta\) contains no intercept term in this model.

  • The outcome cumulative distribution is

    \[ \Pr(y\leq k)= \hat{p}_{\leq k}, \qquad k=1,\ldots,K-1 \]

    and \(\Pr(y \leq K)=1\). We can calculate the probabilities of individual categories as follows:

    \[ \Pr(y= k)= \Pr(y\leq k) - \Pr(y\leq k-1), \qquad k=1,\ldots,K. \]

This formulation is well-defined: since the cutpoints \(\{c_{k\vert k+1}\}_{k=1}^{K-1}\) are increasing and \(g^{-1}\) is increasing, the cumulative probabilities \(\hat{p}_{\leq k}\) are increasing in \(k\) as well. The two variations differ mainly through the tails of the latent distribution: the logistic distribution has heavier tails than the normal, so under the ordered logit model the category probabilities approach \(0\) and \(1\) more slowly, as the linear predictor moves away from a cutpoint, than under the ordered probit model.

Let us try the ordered logit model on the Storable data. The data contain records of voting games played by groups of two to six college students. A summary of the game is as follows:

  • Each student was given a total of 4 votes to play in two rounds.
  • The students had the choice to cast 1, 2 or 3 votes on the first round, and the remaining votes on the second round.
  • Before casting the first votes, the students were told the payoffs for the winner, which were drawn from the uniform distribution on the interval \([1,100]\).
  • In addition, the students were told the distribution of the payoffs.

Here, we only take the results of two-player games, restricted to a single participant (person 401). The vote column contains the number of votes cast in the first round and the value column contains the payoff.

data_2player <- read.csv("data/2playergames.csv")
data_401 <- subset(data_2player, person == 401,
                   select = c("vote", "value"))
data_401$factor_vote <- factor(data_401$vote,
                               levels = c(1, 2, 3),
                               labels = c("1", "2", "3"),
                               ordered=TRUE)

plot(data_401$value, data_401$factor_vote,
     xlab="Payoffs", ylab="Votes")

As expected, with higher payoffs, the student was willing to cast more votes in the first round.

To fit the ordered logit model, we can use the stan_polr function (proportional odds logistic regression) provided in rstanarm. Here, we specify the prior mean of the \(R^2\) of the prediction to be \(0.3\).

fit_1 <- stan_polr(factor(vote) ~ value, method="logistic",
                   prior=R2(0.3, "mean"), data=data_401,
                   refresh=0)

print(fit_1)
stan_polr
 family:       ordered [logistic]
 formula:      factor(vote) ~ value
 observations: 20
------
      Median MAD_SD
value 0.1    0.0   

Cutpoints:
    Median MAD_SD
1|2 2.7    1.4   
2|3 5.8    2.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The results show two cutpoints: \(c_{1\vert 2} = 2.7\) and \(c_{2\vert 3}=5.8\). Below is a plot of the expected vote given the payoffs from the model.
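
A sketch of that plot, plugging the posterior medians into the cumulative probabilities \(\Pr(y\leq k)=\text{logit}^{-1}(c_{k\vert k+1}-X\beta)\):

# Expected vote as a function of the payoff, using the posterior medians
value <- seq(1, 100, length.out=200)
p_le1 <- plogis(2.7 - 0.1*value)  # Pr(vote <= 1)
p_le2 <- plogis(5.8 - 0.1*value)  # Pr(vote <= 2)
expected_vote <- 1*p_le1 + 2*(p_le2 - p_le1) + 3*(1 - p_le2)
plot(value, expected_vote, type="l", xlab="Payoffs", ylab="Expected votes")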

12.4.2 Unordered logistic regression

For the data with unordered categories \(1,\ldots,K\), we can use the categorical logit model, which has the following GLM specification:

  • The link function is

    \[ g(x_1,\ldots,x_K) = \left(\log\frac{x_1}{x_K},\ldots,\log\frac{x_{K-1}}{x_K}\right). \]

    The inverse of the link function is called the multinomial logit, also known as the softmax function (see the sketch after this list).

    \[ g^{-1}(x_1,\ldots,x_{K-1}) = \left(\frac{e^{x_1}}{1+\sum_k e^{x_k}},\ldots,\frac{e^{x_{K-1}}}{1+\sum_k e^{x_k}},\frac{1}{1+\sum_k e^{x_k}}\right). \]

  • The prediction is \((\hat{p}_{1},\ldots,\hat{p}_K)= g^{-1}(X\beta_1,\ldots,X\beta_{K-1})\). Notice that we now have \(K-1\) sets of coefficients: \(\beta_1,\ldots,\beta_{K-1}\).

  • The outcome distribution is

    \[ \Pr(y = k)= \hat{p}_{k}, \qquad k=1,\ldots,K. \]
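
Here is a small sketch of the inverse link in R, with category \(K\) as the reference (the helper softmax_ref is written just for this illustration):

# Input: the vector (X beta_1, ..., X beta_{K-1}); output: probabilities of categories 1..K
softmax_ref <- function(eta) c(exp(eta), 1)/(1 + sum(exp(eta)))
softmax_ref(c(0.5, -1))  # three probabilities summing to 1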

Let us fit this model on the iris dataset, which contains measurements of sepal length, sepal width, petal length and petal width for three different species of iris.

data(iris)

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

To fit the unordered logistic regression model of the species on the other predictors, we again use the brm function.

fit_2 <- brm(Species ~ ., family = categorical(link = "logit"),
             data=iris, prior=set_prior("normal(0, 8)"),
             refresh=0)

print(fit_2)
 Family: categorical 
  Links: muversicolor = logit; muvirginica = logit 
Formula: Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width 
   Data: iris (Number of observations: 150) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
muversicolor_Intercept      -14.25     20.10   -55.11    22.78 1.00     2062
muvirginica_Intercept       -40.24     23.24   -88.70     2.94 1.00     2270
muversicolor_Sepal.Length     1.38      4.25    -6.94     9.90 1.00     1642
muversicolor_Sepal.Width     -4.02      4.05   -12.36     3.49 1.00     1731
muversicolor_Petal.Length     6.90      3.25     0.99    13.70 1.00     1338
muversicolor_Petal.Width      0.16      5.18   -10.23     9.84 1.00     1723
muvirginica_Sepal.Length     -0.91      4.31    -9.14     7.49 1.00     1715
muvirginica_Sepal.Width      -6.82      4.24   -15.56     1.13 1.00     1763
muvirginica_Petal.Length     13.44      3.90     6.34    21.65 1.00     1327
muvirginica_Petal.Width      10.00      5.41    -0.73    20.90 1.00     1962
                          Tail_ESS
muversicolor_Intercept        2212
muvirginica_Intercept         2368
muversicolor_Sepal.Length     2441
muversicolor_Sepal.Width      2005
muversicolor_Petal.Length     1965
muversicolor_Petal.Width      2154
muvirginica_Sepal.Length      2283
muvirginica_Sepal.Width       1692
muvirginica_Petal.Length      1635
muvirginica_Petal.Width       2483

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Since the data has three categories, the fitted model has two sets of coefficients associated with two categories: versicolor and virginica.
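
To turn the coefficients into predicted probabilities, we can use posterior_epred, which for a categorical model returns an array of draws \(\times\) observations \(\times\) categories. A sketch for a hypothetical new flower:

new_flower <- data.frame(Sepal.Length=6.0, Sepal.Width=3.0,
                         Petal.Length=4.5, Petal.Width=1.5)
epred <- posterior_epred(fit_2, newdata=new_flower)  # 4000 x 1 x 3 array
round(apply(epred, 3, mean), 3)                      # average over posterior draws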

12.5 Models with unequal error standard deviations

The usual linear regression model \(y\sim\mathcal{N}(X\beta,\sigma^2)\) assumes that the error standard deviation \(\sigma\) is constant. We can, however, allow the standard deviation to vary with the values of the predictors; this condition is called heteroscedasticity. For example, it is possible to fit the model \(y\sim\mathcal{N}(X\beta_1,\sigma^2)\) with \(\sigma=e^{X\beta_2}\), where \(\beta_1\) and \(\beta_2\) are the model’s parameters. In the example of earnings data, we could fit a linear regression of log earnings with error standard deviation of the form \(e^{c+d*\texttt{male}}\), allowing different error standard deviations for women and men. This can be done using brm, with bf as a formula helper.

earnings <- read.csv("data/earnings.csv")
fit_1 <- brm(bf(log(earn)|subset(earn>0) ~ height + male,
                                   sigma ~ male),
             data=earnings, refresh=0)
Compiling Stan program...
Start sampling
print(fit_1)
 Family: gaussian 
  Links: mu = identity; sigma = log 
Formula: log(earn) | subset(earn > 0) ~ height + male 
         sigma ~ male
   Data: earnings (Number of observations: 1816) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           7.93      0.51     6.97     8.95 1.00     2666     2943
sigma_Intercept    -0.12      0.02    -0.16    -0.07 1.00     4280     2922
height              0.02      0.01     0.01     0.04 1.00     2592     2944
male                0.37      0.06     0.25     0.49 1.00     2665     2945
sigma_male         -0.06      0.04    -0.13     0.01 1.00     4101     2939

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
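
Note from the Links line above that sigma is modeled on the log scale, so the implied residual standard deviations are obtained by exponentiating the estimates:

exp(-0.12)         # women: residual sd of about 0.89
exp(-0.12 - 0.06)  # men: point estimate of about 0.84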

We can allow heteroscedasticity in other models as well. For example, the negative binomial model can be extended to \(y\sim\operatorname{negative binomial}(e^{X\beta_1},e^{X\beta_2})\), in which the reciprocal dispersion \(\phi\) also depends on the predictors. Let us try this on the roaches data, with \(\phi\) (the shape parameter in brms) depending on the treatment and the seniority.

fit_2 <- brm(bf(y ~ treatment + senior + offset(log(exposure2)),
                shape ~ treatment + senior),
             family=negbinomial,
             data=roaches, refresh=0)
Compiling Stan program...
Start sampling
print(fit_2)
 Family: negbinomial 
  Links: mu = log; shape = log 
Formula: y ~ treatment + senior + offset(log(exposure2)) 
         shape ~ treatment + senior
   Data: roaches (Number of observations: 262) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           3.75      0.21     3.36     4.18 1.00     4948     2693
shape_Intercept    -1.27      0.15    -1.56    -0.99 1.00     4524     2910
treatment          -0.54      0.27    -1.06    -0.01 1.00     4729     2661
senior             -0.74      0.34    -1.37    -0.03 1.00     5038     2875
shape_treatment    -0.17      0.19    -0.54     0.20 1.00     4903     3170
shape_senior       -0.55      0.21    -0.97    -0.16 1.00     4836     3303

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
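
The shape is likewise modeled on the log scale (Links: shape = log), so we can read off the implied reciprocal dispersion \(\phi\) by group, for example:

exp(-1.27)                # control, non-senior apartments
exp(-1.27 - 0.17 - 0.55)  # treated senior apartments: smaller phi, more overdispersion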

12.6 Mixture models for data with many zeros

12.6.1 Hurdle models

We briefly mentioned in Section 7.1.1 how to handle data with many zero-valued outcomes, specifically for log-linear models. In summary, we can use logistic regression to classify whether the outcome is zero, and then use linear regression to model the non-zero outcomes. We demonstrate here how to do this on the earnings data with stan_glm. First, we fit a logistic regression model of whether the earnings are zero.

# (earn == 0) is an indicator of whether the earnings are zero
fit_1a <- stan_glm((earn == 0) ~ height + male,
                   family=binomial(link="logit"),
                   data=earnings, refresh=0)

print(fit_1a)
stan_glm
 family:       binomial [logit]
 formula:      (earn == 0) ~ height + male
 observations: 1816
 predictors:   3
------
            Median MAD_SD
(Intercept)  2.9    1.9  
height      -0.1    0.0  
male        -1.7    0.3  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Then, we fit a linear regression model on the data with non-zero earnings.

fit_1b <- stan_glm(log(earn) ~ height + male,
                   data=earnings, subset=earn>0,
                   refresh=0)

print(fit_1b)
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ height + male
 observations: 1629
 predictors:   3
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 8.0    0.5   
height      0.0    0.0   
male        0.4    0.1   

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.9    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The resulting model is a mixture of the linear and logistic models. Suppose that we want to simulate the earnings of a randomly chosen 68-inch-tall woman from this model. This can be done in two steps: first, we use the logistic model to predict whether her earnings are zero; if they are not, we use the linear model to predict the amount. Both predictions can be obtained using posterior_predict.

new <- data.frame(height=68, male=0)
pred_1a <- posterior_predict(fit_1a, newdata=new)
pred_1b <- posterior_predict(fit_1b, newdata=new)
pred <- ifelse(pred_1a==1, 0, exp(pred_1b))

print(pred[1:10])
 [1]  9395.789  4514.346 11952.875 21699.246 28884.118 25188.422  6430.726
 [8] 69471.164 19237.140     0.000

To fit the mixture model in a single step, we may use brm with the hurdle_lognormal family.

fit_2 <- brm(bf(earn ~ height + male, hu ~ height + male),
             family=hurdle_lognormal,
             data=earnings, refresh=0)
Compiling Stan program...
Start sampling
print(fit_2)
 Family: hurdle_lognormal 
  Links: mu = identity; sigma = identity; hu = logit 
Formula: earn ~ height + male 
         hu ~ height + male
   Data: earnings (Number of observations: 1816) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        7.99      0.50     7.01     8.99 1.00     3170     2735
hu_Intercept     2.91      2.01    -0.98     6.88 1.00     3628     3229
height           0.02      0.01     0.01     0.04 1.00     3106     2731
male             0.37      0.06     0.25     0.49 1.00     2992     2943
hu_height       -0.07      0.03    -0.13    -0.01 1.00     3660     3077
hu_male         -1.67      0.32    -2.33    -1.06 1.00     2728     2445

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.87      0.02     0.84     0.90 1.00     4786     2594

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the coefficients of the linear regression (Intercept, height and male) and the logistic regression (hu_Intercept, hu_height and hu_male) are almost identical to the ones from the direct fits above. For count data with many zeros, one can use the hurdle_poisson or hurdle_negbinomial family in brm.

12.6.2 Zero-inflated models

As in hurdle models, a zero-inflated model also consists of a logistic regression that predicts whether the outcome is zero, combined with a second regression model of our choice. The difference is that in a hurdle model the second component can only produce non-zero outcomes (e.g. log-linear models), while in a zero-inflated model it can also produce zeros (e.g. Poisson or negative binomial models).
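
Concretely, if \(\pi\) is the zero probability from the logistic component and \(p(\cdot)\) is the count distribution, the two models assign, for \(k\geq 1\),

\[\begin{align*} \text{hurdle:} &\quad \Pr(y=0)=\pi, & \Pr(y=k)&=(1-\pi)\,\frac{p(k)}{1-p(0)},\\ \text{zero-inflated:} &\quad \Pr(y=0)=\pi+(1-\pi)\,p(0), & \Pr(y=k)&=(1-\pi)\,p(k). \end{align*}\]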

In Section 12.2.5, we fitted a negative binomial model to the roaches data. Let us now try a zero-inflated negative binomial model on this data using brm. Here, we use all predictors, including the exposure variable, to predict whether the number of post-treatment roaches is zero.

fit_3 <- brm(bf(y ~ roach100 + treatment + senior +
                    offset(log(exposure2)),
                zi ~ roach100 + treatment + senior +
                     offset(log(exposure2))),
                family=zero_inflated_negbinomial(),
             data=roaches, refresh=0)
Compiling Stan program...
Start sampling
print(fit_3)
 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: y ~ roach100 + treatment + senior + offset(log(exposure2)) 
         zi ~ roach100 + treatment + senior + offset(log(exposure2))
   Data: roaches (Number of observations: 262) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        3.17      0.20     2.79     3.58 1.00     3483     2875
zi_Intercept    -1.05      0.51    -2.13    -0.14 1.00     3024     2640
roach100         0.86      0.16     0.56     1.20 1.00     3095     2653
treatment       -0.55      0.22    -0.98    -0.12 1.00     3700     2969
senior          -0.09      0.26    -0.60     0.43 1.00     3446     2657
zi_roach100    -12.60      4.27   -22.89    -6.26 1.00     1725     1289
zi_treatment     1.21      0.50     0.30     2.26 1.00     3370     2576
zi_senior        1.02      0.48     0.09     1.97 1.00     3292     2829

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.49      0.06     0.38     0.61 1.00     2983     2784

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We again check the model’s fit with fake data simulation.

yrep_3 <- posterior_predict(fit_3)

n_sims <- nrow(yrep_3)
subset <- sample(n_sims, 100)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_3[subset,]+1))

The fit of this model is slightly better than that of the negative binomial model. Specifically, fewer simulated apartments have post-treatment roach counts above 10,000.