# 12 Generalized linear models

Linear regression and logistic regression are examples of a more general class of models: generalized linear models (GLMs). As in logistic regression, we can modify the link function and the model of the outcomes to handle various types of data, such as data with bounded outcomes, count data, and data with multi-valued outcomes.

## 12.1 Definition of generalized linear models

Generalized linear models (GLMs) are a class of models built around a transformed linear predictor. Specifically, a GLM consists of:

1. A vector of outcome data $$y=(y_1,\ldots,y_n)$$.
2. A matrix of predictors $$X=(\mathbf{1}_n,X_1,\ldots,X_p)$$ and a vector of coefficients $$\beta=(\beta_0,\ldots,\beta_p)^T$$. The linear predictor is given by $$X\beta$$.
3. A link function $$g$$ that transforms the linear predictor to the model’s prediction through its inverse: $$\hat{y} = g^{-1}(X\beta)$$.
4. A distribution of the outcome, given the prediction: $$p(y\vert\hat{y})$$.
5. Other parameters, such as variances, overdispersion parameters, and the outcome’s upper and/or lower bounds.

Here are the link functions and outcome distributions that we used in linear regression and logistic regression:

• In linear regression, we used $$g(x)=x$$ (and so $$g^{-1}(x)=x$$) and the normal outcome distribution $$y\vert\hat{y}\sim\mathcal{N}(X\beta,\sigma^2)$$.
• In the logistic regression, we used the logit function $$g(x)=\text{logit}(x)$$ with the logistic inverse: $$g^{-1}(x) = \text{logit}^{-1}(x)= e^x/(1+e^x)$$. With the prediction $$\hat{y}=\text{logit}^{-1}(X\beta)$$, the outcome distribution is the Bernoulli distribution: $$p(y\vert\hat{y}) = \hat{y}^y(1-\hat{y})^{1-y}$$.
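As a quick sanity check of the logistic link, base R provides both directions: qlogis is the logit function and plogis is its inverse (this is an illustrative sketch, not part of a fitted model).

```r
# qlogis(p) = log(p/(1-p)) is the logit link g;
# plogis(x) = e^x/(1+e^x) is its inverse g^{-1}
x <- 1.5
p <- plogis(x)          # inverse-logit of a linear predictor value
round(p, 4)             # 0.8176
qlogis(p)               # recovers x = 1.5
```

Applying the logit to the inverse-logit returns the original linear predictor, which is exactly the relationship $$\hat{y}=g^{-1}(X\beta)$$ inverted.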

## 12.2 Poisson and negative binomial regression

### 12.2.1 Poisson regression

Poisson regression models count data $$y_i\in\{0,1,2,\ldots\}$$ with

$y \sim \operatorname{Poisson}(e^{X\beta}).$

This is a GLM in which:

• The link function is $$g(x)=\log x$$.

• The prediction is $$\hat{y} = g^{-1}(X\beta) = e^{X\beta}$$.

• The outcome distribution is $$y\vert\hat{y}\sim \operatorname{Poisson}(\hat{y})$$.

From the properties of the Poisson distribution, we have

• $$\mathbb{E}[y\vert X] = e^{X\beta}$$.

• $$\operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y]} = e^{X\beta/2}$$.

Thus, in a Poisson model, the standard deviation of the outcome is already determined by the model. If the expected outcome is $$\mathbb{E}[y] = 10$$, then the prediction errors are mostly in the range of $$\pm\sqrt{\mathbb{E}[y]} \approx\pm 3.2$$. But for many datasets, the prediction errors fall outside this range.
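We can verify this mean-sd relationship by simulation; here rpois draws from a Poisson distribution with mean 10 (a standalone sketch, not part of the roaches example below).

```r
# For a Poisson outcome, the sd equals the square root of the mean
set.seed(1)
y <- rpois(1e5, lambda = 10)
mean(y)   # close to 10
sd(y)     # close to sqrt(10), about 3.16
```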

### 12.2.2 Overdispersion and underdispersion

Overdispersion and underdispersion refer to data that show more or less variation than the Poisson model predicts. In other words, when fitting a Poisson model, the residuals of overdispersed data are often larger, and those of underdispersed data mostly smaller, than the square root of the predicted value. For such data, a Poisson model would be inappropriate. In the next section, we introduce another model that allows for larger prediction errors.
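For illustration (with hypothetical parameter values), we can simulate overdispersed counts using rnbinom, whose mu/size parameterization has variance mu + mu^2/size:

```r
# Overdispersed counts: the variance exceeds the mean,
# which a Poisson model (variance = mean) cannot capture
set.seed(2)
y <- rnbinom(1e5, mu = 10, size = 2)   # variance about 10 + 100/2 = 60
c(mean = mean(y), var = var(y))
```

For these simulated data, residuals from a Poisson fit would routinely exceed the square root of the predicted value.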

### 12.2.3 Negative binomial regression

We introduce another model for count data:

$y \sim \operatorname{negative\ binomial}(e^{X\beta},\phi),$

where $$\operatorname{negative\ binomial}(\mu,\phi)$$ is parameterized here by its mean $$\mu=e^{X\beta}$$ and the reciprocal dispersion $$\phi$$. (Classically, the negative binomial distribution models the number of failures in a sequence of iid Bernoulli trials before observing $$\phi$$ successes, with the range of $$\phi$$ extended to the positive real numbers.)

The predictive standard deviation is

$\operatorname{sd}(y\vert X) = \sqrt{\mathbb{E}[y\vert X]+\frac{1}{\phi}\mathbb{E}[y\vert X]^2}.$

In the context of count data modeling, $$\phi$$ is the “reciprocal dispersion” parameter:

• Lower values of $$\phi$$ correspond to more overdispersion.
• Higher values of $$\phi$$ correspond to less overdispersion.
• The negative binomial distribution becomes the Poisson distribution in the limit $$\phi\to\infty$$ (that is, when there is no overdispersion).
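A quick simulation (with hypothetical values of $$\phi$$) confirms both the predictive standard deviation formula and the Poisson limit:

```r
# sd of negative binomial draws vs. the formula sqrt(mu + mu^2/phi)
set.seed(3)
mu <- 10
sim_sd <- function(phi) sd(rnbinom(1e6, mu = mu, size = phi))
sim_sd(0.3)   # strong overdispersion: about sqrt(10 + 100/0.3), i.e. 18.5
sim_sd(1e6)   # phi -> infinity: about sqrt(10), the Poisson sd
```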

### 12.2.4 Exposure and offset

In many cases, the outcomes depend on the amount of exposure to the environment, and so two different outcomes may not be directly comparable. For example, the number of daily deaths by country depends on population size, and the number of car accidents at an intersection depends on the number of cars running through that intersection.

If this is the case, we may instead let $$e^{X\beta}$$ be the expected rate of outcomes $$r$$, so that the expected outcome is the product of the exposure $$u$$ and the rate $$r$$:

$\mathbb{E}[y] = ur = ue^{X\beta}, \tag{12.1}$

which leads to a new model with the new predictor $$u$$:

$y\sim \operatorname{negative binomial}(ue^{X\beta},\phi).$

We can separate the exposure and the exponential term by taking the logarithm of both sides of Equation 12.1:

$\log\mathbb{E}[y] = \log u +X\beta.$

In this form, $$\log u$$ is called the offset: it enters the linear predictor with its coefficient fixed at $$1$$.
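The identity in Equation 12.1 is easy to verify numerically (the exposures and linear predictor below are hypothetical):

```r
# log E[y] = log u + X beta: the offset log(u) enters the
# linear predictor with its coefficient fixed at 1
u   <- c(1, 2, 5)      # hypothetical exposures, e.g. days observed
eta <- 0.5             # hypothetical linear predictor X beta
E_y <- u * exp(eta)    # Equation 12.1
all.equal(log(E_y), log(u) + eta)   # TRUE
```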

### 12.2.5 Example: effect of pest management on reducing cockroach levels

We consider the Roaches dataset, which was used to study the effect of pest management on reducing cockroach levels in urban apartments. In the experiment, there were 158 apartments in the treatment group and 104 apartments in the control group. The data consists of the following variables:

| Name | Description |
|------|-------------|
| y | post-treatment roach count |
| roach1 | pre-treatment roach level |
| treatment | 0 if control, 1 if treatment |
| senior | 1 if the apartment is restricted to the elderly |
| exposure2 | number of days the traps had been laid |

We create a new predictor named roach100, which is roach1 scaled down by a factor of 100.

roaches <- read.csv("data/roaches.csv")
roaches$roach100 <- roaches$roach1/100

head(roaches)
  X   y roach1 treatment senior exposure2 roach100
1 1 153 308.00         1      0  0.800000   3.0800
2 2 127 331.25         1      0  0.600000   3.3125
3 3   7   1.67         1      0  1.000000   0.0167
4 4   7   3.00         1      0  1.000000   0.0300
5 5   0   2.00         1      0  1.142857   0.0200
6 6   0   0.00         1      0  1.000000   0.0000

First, we fit a Poisson regression by specifying family=poisson in stan_glm. The number of post-treatment roaches depends on the number of days the traps had been laid, so it makes sense to let exposure2 be the model’s exposure.

library(rstanarm)
fit_1 <- stan_glm(y ~ roach100 + treatment + senior,
                  family=poisson,
                  offset=log(exposure2),
                  data=roaches, refresh=0)

print(fit_1)
stan_glm
family:       poisson [log]
formula:      y ~ roach100 + treatment + senior
observations: 262
predictors:   4
------
            Median MAD_SD
(Intercept)  3.1    0.0
roach100     0.7    0.0
treatment   -0.5    0.0
senior      -0.4    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Interpreting the coefficients. The fitted Poisson model for the number of post-treatment roaches is

$y \sim \operatorname{Poisson}(e^{3.1+0.7*\texttt{roach100}-0.5*\texttt{treatment}-0.4*\texttt{senior}}).$

We can interpret the coefficients of the regression as follows:

• The intercept is the prediction for $$\texttt{roach100}=0$$, $$\texttt{treatment}=0$$ and $$\texttt{senior}=0$$. More precisely, for a non-senior apartment that was roach-free before and did not receive pest management, the expected number of roaches is $$e^{3.1}\approx 22$$. Note that sometimes some of the predictors cannot be zero, in which case the intercept cannot be interpreted this way.
• The coefficient $$0.7$$ of $$\texttt{roach100}$$ indicates that, for each additional $$100$$ pre-treatment roaches (while keeping treatment and senior at the same level), the expected number of post-treatment roaches increases by a factor of $$e^{0.7}\approx 2.0$$, a roughly $$100\%$$ increase.
• The coefficient $$-0.5$$ of $$\texttt{treatment}$$ indicates that the expected number of post-treatment roaches in an apartment with pest management is lower than that of an apartment without pest management (with the same levels of roach100 and senior) by a factor of $$e^{-0.5}\approx 0.61$$, or about a $$39\%$$ decrease.
• The coefficient $$-0.4$$ of $$\texttt{senior}$$ indicates that the expected number of post-treatment roaches in a senior apartment is lower than that of a non-senior apartment (with the same levels of roach100 and treatment) by a factor of $$e^{-0.4}\approx 0.67$$, or about a $$33\%$$ decrease.

Checking the fit via simulation. Now we check the model’s fit by looking at the posterior predictive distribution. As usual, we use the posterior_predict function to generate 4000 simulated vectors of post-treatment roach counts, then we sample 100 of them.

yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)
subset <- sample(n_sims, 100)

After that, we use ppc_dens_overlay to compare the distributions of the original and simulated data. Here, we plot both on the logarithmic scale to make the differences between the distributions more pronounced.

library(bayesplot)
ppc_dens_overlay(log10(roaches$y+1), log10(yrep_1[subset,]+1))

The plots show that the original data are overdispersed and contain a lot of zeros; this indicates that the Poisson model might not be suitable, as it only allows a relatively small number of zeros. We thus turn to the negative binomial regression, which can be done by specifying family=neg_binomial_2. As before, we let exposure2 be the exposure. There is no need to specify the reciprocal dispersion parameter $$\phi$$; it can be estimated from the data.

fit_2 <- stan_glm(y ~ roach100 + treatment + senior,
                  family=neg_binomial_2,
                  offset=log(exposure2),
                  data=roaches, refresh=0)

print(fit_2)
stan_glm
family:       neg_binomial_2 [log]
formula:      y ~ roach100 + treatment + senior
observations: 262
predictors:   4
------
            Median MAD_SD
(Intercept)  2.8    0.2
roach100     1.3    0.3
treatment   -0.8    0.2
senior      -0.3    0.3

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 0.3    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Let us see how the posterior predictive fits our data.

yrep_2 <- posterior_predict(fit_2)
n_sims <- nrow(yrep_2)
subset <- sample(n_sims, 100)

ppc_dens_overlay(log10(roaches$y+1), log10(yrep_2[subset,]+1))

The negative binomial model looks like a better fit, with a high probability of zero. However, the model allows the number of roaches to be as high as $$10^5$$, which is unrealistic. We will see how we can adjust this model later in the chapter.

## 12.3 Logistic-binomial and beta-binomial models

### 12.3.1 Logistic-binomial model

The logistic regression can be used to model the number of successes from $$n$$ Bernoulli trials. In this setting, we can use the following GLM design:

• The link function is $$g(x)=\text{logit}(x)$$.

• The prediction is $$\hat{p} = g^{-1}(X\beta) = \text{logit}^{-1}(X\beta)$$.

• The outcome distribution is $$p(y\vert\hat{p})\sim \operatorname{Binomial}(n,\hat{p})$$, where $$n$$ is the number of trials.

Let us try this model on simulated data of basketball shooting. The following code simulates $$N=100$$ players, each taking $$n=20$$ shots. We also encode our assumption that the field goal percentage is inversely correlated with the player’s weight.

N <- 100
weight <- rnorm(N, 216, 31)
p <- 0.6 - 0.1*(weight - 216)/31
n <- rep(20, N)
y <- rbinom(N, n, p)
data <- data.frame(n=n, y=y, weight=weight)

head(data)
   n  y   weight
1 20 12 222.2577
2 20 14 218.8143
3 20 10 211.4947
4 20 10 239.5631
5 20 11 173.2195
6 20 12 190.9105

To model the count data with logistic regression, the outcome is specified as a pair: the number of successes and the number of failures.

fit_1a <- stan_glm(cbind(y, 20-y) ~ weight,
                   family=binomial(link="logit"),
                   data=data, refresh=0)

print(fit_1a, digits=3)
stan_glm
family:       binomial [logit]
formula:      cbind(y, 20 - y) ~ weight
observations: 100
predictors:   2
------
* For info on the priors used see ?prior_summary.stanreg