# 12 Generalized linear models

Linear regression and logistic regression are examples of a more general class of models: generalized linear models (GLMs). As in logistic regression, we can change the nonlinear transformation and the outcome distribution to handle various types of data, such as data with bounded outcomes, count data, and data with multi-valued outcomes.

## 12.1 Definition of generalized linear models

Generalized linear models (GLMs) are a class of models that conform to the transformed linear predictor design. Specifically, a GLM consists of:

1. A vector of outcome data $$y=(y_1,\ldots,y_n)$$.
2. A matrix of predictors $$X=(\mathbf{1}_n,X_1,\ldots,X_p)$$ and a vector of coefficients $$\beta=(\beta_0,\ldots,\beta_p)^T$$. The linear predictor is $$X\beta$$.
3. A link function $$g$$ that transforms the linear predictor to the model’s prediction through its inverse: $$\hat{y} = g^{-1}(X\beta)$$.
4. A distribution of the outcome, given the prediction: $$p(y\vert\hat{y})$$.
5. Other parameters such as variances, overdispersion parameters, and the outcome’s upper and/or lower bounds.

Here are the link functions and the outcome distributions that we used in linear regression and logistic regression:

• In the linear regression, we used $$g(x)=x$$ (and so $$g^{-1}(x)=x$$) and $$p(y\vert \hat{y})=p(y\vert X\beta)=\mathcal{N}(y\vert X\beta,\sigma^2)$$, the normal density with mean $$X\beta$$ and variance $$\sigma^2$$ evaluated at $$y$$.
• In the logistic regression, we used the logit function $$g(x)=\text{logit}(x)$$ with the logistic inverse: $$g^{-1}(x) = \text{logit}^{-1}(x)= e^x/(1+e^x)$$. With the prediction $$\hat{y}=\text{logit}^{-1}(X\beta)$$, the outcome distribution is the Bernoulli distribution: $$p(y\vert\hat{y}) = \hat{y}^y(1-\hat{y})^{1-y}$$.
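As a quick numeric illustration of the two inverse links (plain Python with a hypothetical linear predictor value, separate from the chapter's R code):

```python
import math

def identity_inv(x):
    # Inverse link for linear regression: g(x) = x, so g^{-1}(x) = x.
    return x

def logit_inv(x):
    # Inverse link for logistic regression: logit^{-1}(x) = e^x / (1 + e^x).
    return math.exp(x) / (1 + math.exp(x))

# A hypothetical linear predictor value X*beta = 0.
print(identity_inv(0.0))   # linear regression prediction: 0.0
print(logit_inv(0.0))      # logistic regression prediction: 0.5
```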

## 12.2 Poisson and negative binomial regression

### 12.2.1 Poisson regression

We start with the simplest regression model for count data, the Poisson regression:

$y \sim \operatorname{Poisson}(e^{X\beta}).$

This is a GLM in which:

• The link function is $$g(x)=\log x$$.

• The prediction is $$\hat{y} = g^{-1}(X\beta) = e^{X\beta}$$.

• The outcome distribution is $$p(y\vert\hat{y})\sim \operatorname{Poisson}(\hat{y})$$.

From the properties of the Poisson distribution, we have

• $$\mathbb{E}[y\vert X] = e^{X\beta}$$.

• $$\operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y]} = e^{X\beta/2}$$.

Thus, in a Poisson model, the standard deviation of the outcome is already determined by the model. If the expected outcome is $$\mathbb{E}[y] = 10$$, then the prediction errors are mostly in the range of $$\pm\sqrt{\mathbb{E}[y]} \approx\pm 3.16$$. But for many datasets, the prediction errors fall outside this range.
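A quick arithmetic check of this range (plain Python, just to verify the number):

```python
import math

# For a Poisson outcome with E[y] = 10, the sd is sqrt(E[y]).
mean_y = 10
sd_y = math.sqrt(mean_y)
print(round(sd_y, 2))   # 3.16
```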

### 12.2.2 Overdispersion and underdispersion

Overdispersion and underdispersion refer to data that show more or less variation than the Poisson model predicts. In other words, when fitting a Poisson model, the residuals of overdispersed data are often larger, and those of underdispersed data mostly smaller, than the square root of the predicted value. For such data, a Poisson model would be inappropriate. In the next section, we introduce another model that allows larger prediction errors.

### 12.2.3 Negative binomial regression

We introduce another model for the count data.

$y \sim \operatorname{negative binomial}(e^{X\beta},\phi),$

where $$\operatorname{negative binomial}(\mu,\phi)$$ is parameterized here by its mean $$\mu=e^{X\beta}$$ and a dispersion parameter $$\phi$$. (In its classical form, the negative binomial distribution models the number of failures in a sequence of iid Bernoulli trials before observing $$\phi$$ successes, with the range of $$\phi$$ extended to positive real numbers.)

The predictive standard deviation is

$\operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y\vert X]+\frac{1}{\phi}\mathbb{E}[y\vert X]^2}.$

In the context of count data modeling, $$\phi$$ is the “reciprocal dispersion” parameter:

• Lower values of $$\phi$$ correspond to more overdispersion.
• Higher values of $$\phi$$ correspond to less overdispersion.
• The negative binomial distribution becomes the Poisson distribution in the limit $$\phi\to\infty$$ (that is, when there is no overdispersion).
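The role of $$\phi$$ in the predictive standard deviation can be checked numerically; a small sketch with a hypothetical mean of 10 (plain Python, not part of the chapter's R code):

```python
import math

def negbin_sd(mu, phi):
    # Predictive sd of the negative binomial: sqrt(mu + mu^2 / phi).
    return math.sqrt(mu + mu**2 / phi)

mu = 10.0
print(round(negbin_sd(mu, 1.0), 2))      # 10.49: heavy overdispersion
print(round(negbin_sd(mu, 100.0), 2))    # 3.32: mild overdispersion
print(round(math.sqrt(mu), 2))           # 3.16: Poisson limit as phi -> infinity
```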

### 12.2.4 Exposure and offset

In many cases, the outcomes depend on the amounts of exposure to the environment, and so two different outcomes may not be directly comparable. For example, the number of daily deaths by country depends on population size. The number of car accidents at an intersection depends on the number of cars running through that intersection.

If this is the case, we may instead let $$e^{X\beta}$$ be the expected rate of outcomes $$r$$; the expected outcome is then the product of the exposure $$u$$ and the rate $$r$$:

$\mathbb{E}[y] = ur = ue^{X\beta}, \tag{12.1}$

which leads to a new model incorporating the exposure $$u$$:

$y\sim \operatorname{negative binomial}(ue^{X\beta},\phi).$

We can separate the exposure and the exponential term by applying the logarithm on both sides of Equation 12.1.

$\log\mathbb{E}[y] = \log u +X\beta.$

In this formulation, $$\log u$$ is called the offset.
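Numerically, the two forms of Equation 12.1 agree; a quick check in plain Python with hypothetical exposure and linear predictor values:

```python
import math

# Hypothetical numbers: exposure u and linear predictor X*beta.
u = 5.0          # e.g., five days of exposure
xb = 1.2         # linear predictor

# Two equivalent ways to write the expected outcome:
ey_product = u * math.exp(xb)            # E[y] = u * e^{X beta}
ey_offset = math.exp(math.log(u) + xb)   # E[y] = e^{log u + X beta}
print(abs(ey_product - ey_offset) < 1e-9)  # True: log u acts as an offset
```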

### 12.2.5 Example: effect of pest management on reducing cockroach levels

We consider the Roaches dataset, which was used to study the effect of pest management on reducing cockroach levels in urban apartments. In the experiment, there were 158 apartments in the treatment group and 104 apartments in the control group. The data consists of the following variables:

Name       Description
y          post-treatment roach count
roach1     pre-treatment roach level
treatment  0 if control, 1 if treatment
senior     1 if the apartment is restricted to the elderly
exposure2  number of days the traps had been laid

We create a new predictor named roach100, which is roach1 scaled down by a factor of 100.

roaches <- read.csv("data/roaches.csv")
roaches$roach100 <- roaches$roach1/100

head(roaches)
  X   y roach1 treatment senior exposure2 roach100
1 1 153 308.00         1      0  0.800000   3.0800
2 2 127 331.25         1      0  0.600000   3.3125
3 3   7   1.67         1      0  1.000000   0.0167
4 4   7   3.00         1      0  1.000000   0.0300
5 5   0   2.00         1      0  1.142857   0.0200
6 6   0   0.00         1      0  1.000000   0.0000

First, we fit a Poisson regression by specifying family=poisson in stan_glm. The number of post-treatment roaches depends on the number of days the traps had been laid, so it makes sense to let exposure2 be the model’s exposure.

library(rstanarm)
fit_1 <- stan_glm(y ~ roach100 + treatment + senior,
family=poisson,
offset=log(exposure2),
data=roaches, refresh=0)

print(fit_1)
stan_glm
family:       poisson [log]
formula:      y ~ roach100 + treatment + senior
observations: 262
predictors:   4
------
Median MAD_SD
(Intercept)  3.1    0.0
roach100     0.7    0.0
treatment   -0.5    0.0
senior      -0.4    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Interpreting the coefficients. The fitted Poisson model for the number of post-treatment roaches is

$y \sim \operatorname{Poisson}(e^{3.1+0.7*\texttt{roach100}-0.5*\texttt{treatment}-0.4*\texttt{senior}}).$

We can interpret the coefficients of the regression as follows:

• The intercept is the prediction for $$\texttt{roach100}=0$$, $$\texttt{treatment}=0$$ and $$\texttt{senior}=0$$. More precisely, for a non-senior apartment that was roach-free before and did not receive the pest management, the expected number of roaches is $$e^{3.1}\approx 22$$. Note that sometimes some of the predictors cannot be zero, and the intercept cannot be interpreted in that case.
• The coefficient $$0.7$$ of $$\texttt{roach100}$$ indicates that, for each additional $$100$$ pre-treatment roaches (while keeping treatment and senior at the same level), the expected number of post-treatment roaches increases by a factor of $$e^{0.7}\approx 2.0$$; that is, it roughly doubles.
• The coefficient $$-0.5$$ of $$\texttt{treatment}$$ indicates that the expected number of post-treatment roaches in an apartment with pest management is lower than that of an apartment without pest management (with the same level of roach100 and senior) by a factor of $$e^{-0.5}\approx 0.61$$, or about a $$39\%$$ decrease.
• The coefficient $$-0.4$$ of $$\texttt{senior}$$ indicates that the expected number of post-treatment roaches in a senior apartment is lower than that of a non-senior apartment (with the same level of roach100 and treatment) by a factor of $$e^{-0.4}\approx 0.67$$, or about a $$33\%$$ decrease.
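The multiplicative factors above come from exponentiating the coefficients; a quick check in plain Python:

```python
import math

# Fitted Poisson coefficients from the text (excluding the intercept).
coefs = {"roach100": 0.7, "treatment": -0.5, "senior": -0.4}
for name, b in coefs.items():
    # Multiplicative effect on the expected count of a one-unit change.
    print(name, round(math.exp(b), 2))
# roach100 2.01, treatment 0.61, senior 0.67
```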

Checking the fit via simulation. Now we check the model’s fit by looking at the posterior predictive distribution. As usual, we use the posterior_predict function to generate 4000 simulated datasets of post-treatment roach counts, then we sample 100 of them.

yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
subset <- sample(n_sims, 100)

After that, we use ppc_dens_overlay to compare the distributions of the original and simulated data. Here, we plot both on the logarithmic scale to make the differences between them easier to see.

library(bayesplot)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_1[subset,]+1)) The plots show that the original data is overdispersed and contains a lot of zeros; this indicates that the Poisson model might not be suitable as it only allows a relatively small number of zeros. We thus turn to the negative binomial regression, which can be done by specifying family=neg_binomial_2. As before, we let exposure2 be the exposure. There is no need to specify the reciprocal dispersion parameter $$\phi$$—it can be estimated from the data. fit_2 <- stan_glm(y ~ roach100 + treatment + senior, family=neg_binomial_2, offset=log(exposure2), data=roaches, refresh=0) print(fit_2) stan_glm family: neg_binomial_2 [log] formula: y ~ roach100 + treatment + senior observations: 262 predictors: 4 ------ Median MAD_SD (Intercept) 2.8 0.2 roach100 1.3 0.3 treatment -0.8 0.2 senior -0.3 0.3 Auxiliary parameter(s): Median MAD_SD reciprocal_dispersion 0.3 0.0 ------ * For help interpreting the printed output see ?print.stanreg * For info on the priors used see ?prior_summary.stanreg Let us see how the posterior predictive fits our data. yrep_2 <- posterior_predict(fit_2) n_sims <- nrow(yrep_2) subset <- sample(n_sims, 100) ppc_dens_overlay(log10(roaches$y+1), log10(yrep_2[subset,]+1))

The negative binomial model looks like a better fit, with a high probability of zero. However, the model allows the number of roaches to be as high as $$10^5$$, which is unrealistic. We will see how to adjust this model later in the chapter.

## 12.3 Logistic-binomial and beta-binomial models

### 12.3.1 Logistic-binomial model

The logistic regression can be used to model the number of successes from $$n$$ Bernoulli trials. In this setting, we can use the following GLM design:

• The link function is $$g(x)=\text{logit}(x)$$.

• The prediction is $$\hat{p} = g^{-1}(X\beta) = \text{logit}^{-1}(X\beta)$$.

• The outcome distribution is $$p(y\vert\hat{p})\sim \operatorname{Binomial}(n,\hat{p})$$, where $$n$$ is the number of trials.
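The three bullets above can be traced through numerically; a minimal sketch in plain Python with a hypothetical linear predictor and $$n=20$$ trials:

```python
import math

def binomial_glm_prediction(xb, n):
    # Logistic-binomial GLM: p_hat = logit^{-1}(X beta), y ~ Binomial(n, p_hat).
    p_hat = 1 / (1 + math.exp(-xb))
    mean_y = n * p_hat                       # E[y] = n * p_hat
    sd_y = math.sqrt(n * p_hat * (1 - p_hat))  # binomial sd
    return p_hat, mean_y, sd_y

# Hypothetical linear predictor X*beta = 0.4 with n = 20 trials.
p_hat, mean_y, sd_y = binomial_glm_prediction(0.4, 20)
print(round(p_hat, 3), round(mean_y, 2), round(sd_y, 2))
```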

Let us try this model on simulated data of basketball shooting. The following code produces $$N =100$$ players, each shooting $$n=20$$ shots. We also encode our assumption that the field goal percentage decreases with weight.

N <- 100
weight <- rnorm(N, 216, 31)
p <- 0.6 - 0.1*(weight - 216)/31
n <- rep(20, N)
y <- rbinom(N, n, p)
data <- data.frame(n=n, y=y, weight=weight)

head(data)
   n  y   weight
1 20 12 222.2577
2 20 14 218.8143
3 20 10 211.4947
4 20 10 239.5631
5 20 11 173.2195
6 20 12 190.9105

To model the count data with the logistic regression, the outcome is a pair: the number of successes and the number of failures.

fit_1a <- stan_glm(cbind(y, 20-y) ~ weight,
family=binomial(link="logit"),
data=data, refresh=0)

print(fit_1a, digits=3)
stan_glm
family:       binomial [logit]
formula:      cbind(y, 20 - y) ~ weight
observations: 100
predictors:   2
------
Median MAD_SD
(Intercept)  3.735  0.383
weight      -0.016  0.002

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Since the data is generated from the binomial model, we expect the residuals to be very small. Let us see if that is the case.

y <- fit_1a$y[,1]
p_hat <- fitted(fit_1a)
y_hat <- 20 * p_hat

plot(y_hat, y - y_hat, xlab="Predicted number of field goals", ylab="Residual", pch=20)
abline(0, 0)

### 12.3.2 Overdispersion

Logistic-binomial models often suffer from overdispersion; that is, the variation in the data is greater than the model indicates. To detect overdispersion, we first recall the standard deviation of $$y\sim \operatorname{Binomial}(n,\hat{p})$$, which is $$\sqrt{n\hat{p}(1-\hat{p})}$$, where $$\hat{p}=\text{logit}^{-1}(X\beta)$$. Then, we consider the standardized residual:

$z_i = \frac{y_i-\hat{y}_i}{\operatorname{sd}(\hat{y}_i)} = \frac{y_i-n_i\hat{p}_i}{\sqrt{n_i\hat{p}_i(1-\hat{p}_i)}},$

which has mean $$0$$ and standard deviation $$1$$. We can then formally test for overdispersion by comparing $$\sum_{i=1}^N z^2_i$$ to the $$\chi^2_{N-p}$$ distribution, where $$p$$ is the number of predictors.

### 12.3.3 Beta-binomial model

To handle the overdispersion, we modify the outcome distribution of the logistic-binomial model to obtain the beta-binomial model, which has the following GLM design:

• The link function is $$g(x)=\text{logit}(x)$$.

• The prediction is $$\hat{p} = g^{-1}(X\beta) = \text{logit}^{-1}(X\beta)$$.

• The outcome distribution is

$p(y\vert\hat{p})\sim \operatorname{Beta-Binomial}\left(n,\hat{p}\phi,\left(1-\hat{p}\right)\phi\right),$

where $$n$$ is the number of trials and $$\phi>0$$ is an overdispersion parameter.

The parameters of the beta-binomial distribution are chosen so that the mean of $$y$$ is $$n\hat{p}$$ and the standard deviation of $$y$$ is controlled by $$\phi$$:

\begin{align*} \mathbb{E}[y\vert\hat{p}] &= n\hat{p} \\ \operatorname{sd}(y\vert\hat{p}) &= \sqrt{n\hat{p}(1-\hat{p})\left(\frac{n+\phi}{1+\phi}\right)}. \end{align*}

Therefore, lower $$\phi$$ allows for more overdispersion, and higher $$\phi$$ for less overdispersion. In the limit $$\phi\to \infty$$ (no overdispersion), we recover the logistic-binomial model.
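The beta-binomial standard deviation formula can be explored numerically; a small sketch in plain Python with hypothetical values of $$n$$, $$\hat{p}$$ and $$\phi$$:

```python
import math

def beta_binomial_sd(n, p_hat, phi):
    # sd(y | p_hat) = sqrt(n * p_hat * (1 - p_hat) * (n + phi) / (1 + phi))
    return math.sqrt(n * p_hat * (1 - p_hat) * (n + phi) / (1 + phi))

n, p_hat = 20, 0.6   # hypothetical: 20 shots per player, p_hat = 0.6
binom_sd = math.sqrt(n * p_hat * (1 - p_hat))   # logistic-binomial sd
for phi in [1.0, 10.0, 1e6]:
    # As phi grows, the beta-binomial sd approaches the binomial sd.
    print(phi, round(beta_binomial_sd(n, p_hat, phi), 3))
print(round(binom_sd, 3))

# The ratio of the beta-binomial sd to the binomial sd depends only on n
# and phi; with the fitted value phi = 129.6 reported below:
ratio = math.sqrt((n + 129.6) / (1 + 129.6))
print(round(ratio, 2))   # about 1.07
```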
To fit the beta-binomial model, we use the brms library, which has the brm function that allows us to select the beta_binomial family. We specify the number of trials via the trials() function. In this example, the number of trials (i.e. number of shots) for each basketball player is 20. The input of trials() can be a vector in case the number of trials varies by data point. The dispersion parameter $$\phi$$ is estimated directly from the data, so there is no need to specify $$\phi$$ in brm.

# install.packages("brms")
library(brms)
fit_1b <- brm(y|trials(20) ~ weight, family=beta_binomial,
data=data, seed=0, refresh=0)

Compiling Stan program...
Start sampling

print(fit_1b, digits=3)
 Family: beta_binomial
  Links: mu = logit; phi = identity
Formula: y | trials(20) ~ weight
   Data: data (Number of observations: 100)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
          Estimate Est.Error l-95% CI u-95% CI  Rhat Bulk_ESS Tail_ESS
Intercept    3.723     0.403    2.937    4.506 1.000     4797     3090
weight      -0.016     0.002   -0.019   -0.012 0.999     5058     3181

Family Specific Parameters:
    Estimate Est.Error l-95% CI u-95% CI  Rhat Bulk_ESS Tail_ESS
phi  129.642    77.237   43.188  339.890 1.003     2204     2323

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The dispersion parameter $$\phi=129.6$$ tells us that the predictive standard deviation of the beta-binomial model is $$\sqrt{\frac{20+129.6}{1+129.6}}\approx 1.07$$ times that of the logistic-binomial model (here $$n=20$$ trials per player).

## 12.4 Ordered and unordered categorical regression

Sometimes the outcomes can have more than two categories, which can be ordered or unordered.
Examples of ordered categories are:

• Easy, Medium, Hard
• Slow, Normal, Fast
• Urban, Suburban, Rural

Examples of unordered categories are:

• Football, Basketball, Baseball
• Car, Bus, Train
• A, B, O, AB

Here, we introduce two models: one to handle ordered categories, and another to handle unordered categories.

### 12.4.1 Ordered logistic regression

For data with ordered categories $$1,\ldots,K$$, we may use the following model:

• The link function $$g$$ can be any strictly increasing function that maps $$(0,1)$$ to $$(-\infty, \infty)$$. If $$g(x)=\text{logit}(x)$$, the resulting model is called the ordered logit model or proportional odds model. If $$g(x)=\Phi^{-1}(x)$$, where $$\Phi$$ is the cdf of the standard normal distribution, the resulting model is called the ordered probit model.

• The prediction for the $$k$$-th category is $$\hat{p}_{\leq k} = g^{-1}(c_{k\vert k+1}-X\beta)$$ for $$1\leq k \leq K-1$$. Here, we have additional parameters $$c_{1\vert 2}<c_{2\vert 3}<\ldots<c_{K-1\vert K}$$, called the cutpoints.

• The outcome cumulative distribution is

$\Pr(y\leq k)= \hat{p}_{\leq k}, \qquad k=1,\ldots,K-1$

and $$\Pr(y \leq K)=1$$. We can calculate the probabilities of individual categories as follows:

$\Pr(y= k)= \Pr(y\leq k) - \Pr(y\leq k-1), \qquad k=1,\ldots,K.$

This formulation is well-defined: since the cutpoints $$\{c_{k\vert k+1}\}_{k=1}^{K-1}$$ are increasing and $$g$$ (and hence $$g^{-1}$$) is increasing, $$\hat{p}_{\leq k}$$ is increasing as well.

The two variations of the model are used in different situations: the ordered logit model is used if the probabilities of $$y$$ are expected to be evenly distributed across all categories, while the ordered probit model is used if the probabilities are expected to be concentrated at the middle values.

Let us try the ordered logit model on the Storable data. The data contains records of voting games played by groups of 2-6 college students.
A summary of the game is as follows:

• Each student was given a total of 4 votes to play in two rounds.
• The students had the choice to cast 1, 2 or 3 votes in the first round, and the remaining votes in the second round.
• Before casting the first votes, the students were told the payoffs for the winner, which were drawn from the uniform distribution on the interval $$[1,100]$$.
• In addition, the students were told the distribution of the payoffs.

Here, we only take the results of the games played between two students. The vote column contains the number of first-round votes and the value column contains the payoffs.

data_2player <- read.csv("data/2playergames.csv")
data_401 <- subset(data_2player, person == 401, select = c("vote", "value"))
data_401$factor_vote <- factor(data_401$vote, levels = c(1, 2, 3),
labels = c("1", "2", "3"), ordered=TRUE)

plot(data_401$value, data_401$factor_vote, xlab="Payoffs", ylab="Votes")

As expected, with larger payoffs in the first round, the students were willing to cast more votes. To fit the ordered logit model, we can use the stan_polr function (proportional odds logistic regression) provided in rstanarm. Here, we specify the prior mean of the $$R^2$$ of the prediction to be $$0.3$$.

fit_1 <- stan_polr(factor(vote) ~ value, method="logistic",
prior=R2(0.3, "mean"), data=data_401, refresh=0)

print(fit_1)
stan_polr
 family:       ordered [logistic]
 formula:      factor(vote) ~ value
 observations: 20
------
      Median MAD_SD
value 0.1    0.0

Cutpoints:
    Median MAD_SD
1|2 2.7    1.4
2|3 5.8    2.1

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The results show two cutpoints: $$c_{1\vert 2} = 2.7$$ and $$c_{2\vert 3}=5.8$$. Below is the plot of the expected vote given the payoffs from the model.
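With the fitted cutpoints and coefficient, the implied category probabilities can be computed by hand; a sketch in plain Python for a hypothetical payoff of 40 (assuming the cumulative form $$\Pr(y\leq k)=\text{logit}^{-1}(c_{k\vert k+1}-X\beta)$$):

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Fitted values from the text: cutpoints 2.7 and 5.8, coefficient 0.1.
c12, c23, beta = 2.7, 5.8, 0.1
value = 40  # hypothetical payoff

p_le_1 = inv_logit(c12 - beta * value)   # Pr(vote <= 1)
p_le_2 = inv_logit(c23 - beta * value)   # Pr(vote <= 2)
probs = [p_le_1, p_le_2 - p_le_1, 1 - p_le_2]
print([round(p, 3) for p in probs])      # probabilities of casting 1, 2, 3 votes
```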
### 12.4.2 Unordered logistic regression

For data with unordered categories $$1,\ldots,K$$, we can use the categorical logit model, which has the following GLM specification:

• The link function is

$g(x_1,\ldots,x_K) = \left(\log\frac{x_1}{x_K},\ldots,\log\frac{x_{K-1}}{x_K}\right).$

The inverse of the link function is called the multinomial logit, also known as the softmax function:

$g^{-1}(x_1,\ldots,x_{K-1}) = \left(\frac{e^{x_1}}{1+\sum_k e^{x_k}},\ldots,\frac{e^{x_{K-1}}}{1+\sum_k e^{x_k}},\frac{1}{1+\sum_k e^{x_k}}\right),$

where the sums run over $$k=1,\ldots,K-1$$.

• The prediction is $$(\hat{p}_{1},\ldots,\hat{p}_K)= g^{-1}(X\beta_1,\ldots,X\beta_{K-1})$$. Notice that we now have $$K-1$$ sets of coefficients: $$\beta_1,\ldots,\beta_{K-1}$$.

• The outcome distribution is

$\Pr(y = k)= \hat{p}_{k}, \qquad k=1,\ldots,K.$

Let us fit this model on the iris dataset, which contains the sepal length, sepal width, petal length and petal width of three different species of iris.

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

To fit the unordered logistic regression model of the species on the other predictors, we again use the brm function.
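Before looking at the fit, the softmax inverse link can be checked numerically; a minimal sketch in plain Python with hypothetical linear predictors:

```python
import math

def softmax_inverse_link(x):
    # Multinomial logit inverse link for K-1 linear predictors x_1..x_{K-1}:
    # returns K probabilities summing to 1, with the K-th category as baseline.
    denom = 1 + sum(math.exp(xk) for xk in x)
    return [math.exp(xk) / denom for xk in x] + [1 / denom]

probs = softmax_inverse_link([1.0, -0.5])  # hypothetical linear predictors
print([round(p, 3) for p in probs])
print(abs(sum(probs) - 1) < 1e-12)   # True: a valid probability vector
```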
fit_2 <- brm(Species ~ ., family = categorical(link = "logit"),
data=iris, prior=set_prior("normal(0, 8)"), refresh=0)

print(fit_2)
 Family: categorical
  Links: muversicolor = logit; muvirginica = logit
Formula: Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
   Data: iris (Number of observations: 150)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
muversicolor_Intercept      -14.25     20.10   -55.11    22.78 1.00     2062     2212
muvirginica_Intercept       -40.24     23.24   -88.70     2.94 1.00     2270     2368
muversicolor_Sepal.Length     1.38      4.25    -6.94     9.90 1.00     1642     2441
muversicolor_Sepal.Width     -4.02      4.05   -12.36     3.49 1.00     1731     2005
muversicolor_Petal.Length     6.90      3.25     0.99    13.70 1.00     1338     1965
muversicolor_Petal.Width      0.16      5.18   -10.23     9.84 1.00     1723     2154
muvirginica_Sepal.Length     -0.91      4.31    -9.14     7.49 1.00     1715     2283
muvirginica_Sepal.Width      -6.82      4.24   -15.56     1.13 1.00     1763     1692
muvirginica_Petal.Length     13.44      3.90     6.34    21.65 1.00     1327     1635
muvirginica_Petal.Width      10.00      5.41    -0.73    20.90 1.00     1962     2483

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Since the data has three categories, the fitted model has two sets of coefficients, associated with the versicolor and virginica categories.

## 12.5 Models with unequal error standard deviations

The usual linear regression model $$y\sim\mathcal{N}(X\beta,\sigma^2)$$ assumes that the error standard deviation $$\sigma$$ is fixed.
However, we can allow the standard deviation to vary with the values of the predictors; this condition is called heteroscedasticity. For example, it is possible to fit the model $$y\sim\mathcal{N}(X\beta_1,e^{X\beta_2})$$, where $$\beta_1$$ and $$\beta_2$$ are the model’s parameters.

In the example of the earnings data, we could fit the linear regression of log-earnings with the error standard deviation in the form of $$e^{c+d\cdot\texttt{male}}$$, allowing different error standard deviations for women and men. This can be done using brm, with bf as a formula helper.

earnings <- read.csv("data/earnings.csv")
fit_1 <- brm(bf(log(earn)|subset(earn>0) ~ height + male, sigma ~ male),
data=earnings, refresh=0)

Compiling Stan program...
Start sampling

print(fit_1)
 Family: gaussian
  Links: mu = identity; sigma = log
Formula: log(earn) | subset(earn > 0) ~ height + male
         sigma ~ male
   Data: earnings (Number of observations: 1816)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           7.93      0.51     6.97     8.95 1.00     2666     2943
sigma_Intercept    -0.12      0.02    -0.16    -0.07 1.00     4280     2922
height              0.02      0.01     0.01     0.04 1.00     2592     2944
male                0.37      0.06     0.25     0.49 1.00     2665     2945
sigma_male         -0.06      0.04    -0.13     0.01 1.00     4101     2939

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We can allow heteroscedasticity in other models as well. For example, the negative binomial model can be extended to $$y\sim\operatorname{negative binomial}(e^{X\beta_1},e^{X\beta_2})$$, in which $$\phi$$ depends on the predictors. Let us try this on the roaches data, with $$\phi$$ (the shape parameter) depending on the treatment and the seniority.
fit_2 <- brm(bf(y ~ treatment + senior + offset(log(exposure2)),
shape ~ treatment + senior),
family=negbinomial, data=roaches, refresh=0)

Compiling Stan program...
Start sampling

print(fit_2)
 Family: negbinomial
  Links: mu = log; shape = log
Formula: y ~ treatment + senior + offset(log(exposure2))
         shape ~ treatment + senior
   Data: roaches (Number of observations: 262)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           3.75      0.21     3.36     4.18 1.00     4948     2693
shape_Intercept    -1.27      0.15    -1.56    -0.99 1.00     4524     2910
treatment          -0.54      0.27    -1.06    -0.01 1.00     4729     2661
senior             -0.74      0.34    -1.37    -0.03 1.00     5038     2875
shape_treatment    -0.17      0.19    -0.54     0.20 1.00     4903     3170
shape_senior       -0.55      0.21    -0.97    -0.16 1.00     4836     3303

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

## 12.6 Mixture models for data with many zeros

### 12.6.1 Hurdle models

We briefly mentioned back in Section 7.1.1 how to handle data with a lot of zero-valued outcomes, specifically for log-linear models. In summary, we can use a logistic regression to classify whether the outcome is zero, and then use a linear regression to model the non-zero outcomes. We demonstrate here how to do this on the earnings data with stan_glm.

First, we fit a logistic regression model on whether the earning is zero.
# (earn == 0) is an indicator of whether the earning is zero
fit_1a <- stan_glm((earn == 0) ~ height + male,
family=binomial(link="logit"),
data=earnings, refresh=0)

print(fit_1a)
stan_glm
 family:       binomial [logit]
 formula:      (earn == 0) ~ height + male
 observations: 1816
 predictors:   3
------
            Median MAD_SD
(Intercept)  2.9    1.9
height      -0.1    0.0
male        -1.7    0.3

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Then, we fit a linear regression model on the data with non-zero earnings.

fit_1b <- stan_glm(log(earn) ~ height + male, data=earnings,
subset=earn>0, refresh=0)

print(fit_1b)
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ height + male
 observations: 1629
 predictors:   3
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 8.0    0.5
height      0.0    0.0
male        0.4    0.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.9    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The resulting model is a mixture of the linear and logistic models. Suppose that we want to simulate the earnings of a randomly chosen 68-inch tall woman from this model; this can be done in two steps: first, we use the logistic model to predict whether her earning is zero, and if it is not zero, we use the linear model to predict her earning. Both predictions can be obtained using posterior_predict.

new <- data.frame(height=68, male=0)
pred_1a <- posterior_predict(fit_1a, newdata=new)
pred_1b <- posterior_predict(fit_1b, newdata=new)
pred <- ifelse(pred_1a==1, 0, exp(pred_1b))

print(pred[1:10])
 [1]  9395.789  4514.346 11952.875 21699.246 28884.118 25188.422  6430.726
 [8] 69471.164 19237.140     0.000

To fit the mixture model in a single step, we may use brm with the hurdle_lognormal family.

fit_2 <- brm(bf(earn ~ height + male, hu ~ height + male),
family=hurdle_lognormal, data=earnings, refresh=0)

Compiling Stan program...
Start sampling

print(fit_2)
 Family: hurdle_lognormal
  Links: mu = identity; sigma = identity; hu = logit
Formula: earn ~ height + male
         hu ~ height + male
   Data: earnings (Number of observations: 1816)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        7.99      0.50     7.01     8.99 1.00     3170     2735
hu_Intercept     2.91      2.01    -0.98     6.88 1.00     3628     3229
height           0.02      0.01     0.01     0.04 1.00     3106     2731
male             0.37      0.06     0.25     0.49 1.00     2992     2943
hu_height       -0.07      0.03    -0.13    -0.01 1.00     3660     3077
hu_male         -1.67      0.32    -2.33    -1.06 1.00     2728     2445

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.87      0.02     0.84     0.90 1.00     4786     2594

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Notice that the coefficients of the linear regression (Intercept, height and male) and the logistic regression (hu_Intercept, hu_height and hu_male) are almost identical to those from the direct fits above. For count data with many zeros, one can use the hurdle_poisson or hurdle_negbinomial family in brm.

### 12.6.2 Zero-inflated models

As in hurdle models, a zero-inflated model also consists of a logistic regression model that predicts whether the outcome is zero, followed by a regression model of our choice. The difference is that, in a hurdle model, outputs of the second model must be non-zero (e.g. log-linear models), while in a zero-inflated model, they can be zero (e.g. Poisson or negative binomial models).

In Section 12.2.5, we fitted a negative binomial model to the roaches data. Let us try fitting a zero-inflated negative binomial model on this data using brm.
Here, we use all predictors, including the exposure variable, to predict whether the number of post-treatment roaches is zero.

fit_3 <- brm(bf(y ~ roach100 + treatment + senior + offset(log(exposure2)),
zi ~ roach100 + treatment + senior + offset(log(exposure2))),
family=zero_inflated_negbinomial(), data=roaches, refresh=0)

Compiling Stan program...
Start sampling

print(fit_3)
 Family: zero_inflated_negbinomial
  Links: mu = log; shape = identity; zi = logit
Formula: y ~ roach100 + treatment + senior + offset(log(exposure2))
         zi ~ roach100 + treatment + senior + offset(log(exposure2))
   Data: roaches (Number of observations: 262)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        3.17      0.20     2.79     3.58 1.00     3483     2875
zi_Intercept    -1.05      0.51    -2.13    -0.14 1.00     3024     2640
roach100         0.86      0.16     0.56     1.20 1.00     3095     2653
treatment       -0.55      0.22    -0.98    -0.12 1.00     3700     2969
senior          -0.09      0.26    -0.60     0.43 1.00     3446     2657
zi_roach100    -12.60      4.27   -22.89    -6.26 1.00     1725     1289
zi_treatment     1.21      0.50     0.30     2.26 1.00     3370     2576
zi_senior        1.02      0.48     0.09     1.97 1.00     3292     2829

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.49      0.06     0.38     0.61 1.00     2983     2784

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We again check the model’s fit with fake-data simulation.

yrep_3 <- posterior_predict(fit_3)
n_sims <- nrow(yrep_3)
subset <- sample(n_sims, 100)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_3[subset,]+1))

The fit of this model is slightly better than that of the negative binomial model. Specifically, there are fewer simulated apartments with post-treatment roach counts larger than 10,000.