```
library(rstanarm)
<- plogis invlogit
```

# 11 Diagnostics of logistic regression models

After fitting a logistic regression model, it is a good idea to inspect the model in more details. We discuss techniques of plotting the fitted model and checking the residuals. We also look into adding interaction terms to the model, as well as identifying problems that may arise when fitting the model.

## 11.1 Plotting logistic regression and binary data

### 11.1.1 Plotting binary data using binned averages

The usual scatterplot of data with binary outcomes might not be insightful as there is a lot of overlaps. A better way of plotting the data for is by binning a predictor, with other predictors held at constants, and plot the *binned averages*, the averages of the predictor and the outcome in each bin. We demonstrate this with the `wells`

data from the previous chapter.

```
<- read.csv("data/wells.csv")
wells $dist100 <- wells$dist/100 wells
```

We fit a logistic regression model of well-switching on the distance.

```
<- stan_glm(switch ~ dist100,
fit_1 family = binomial(link="logit"),
data = wells, refresh=0)
```

Let us divide the distance into 8 bins.

```
<- 8
K # assign bin for each data point
<- as.numeric(cut(wells$dist100, K))
bins
print(data.frame(wells$dist100, bins)[1:6,])
```

```
wells.dist100 bins
1 0.16826 1
2 0.47322 2
3 0.20967 1
4 0.21486 1
5 0.40874 1
6 0.69518 2
```

Then we compute the average of the distance and the outcome for each bin.

```
<- rep(NA, K) # initial vector of NAs
x_bar <- rep(NA, K) # initial vector of NAs
y_bar for (k in 1:K){
<- mean(wells$dist100[bins==k]) # average of k-th bin
x_bar[k] <- mean(wells$switch[bins==k]) # average of k-th bin
y_bar[k] }
```

And now we plot the binned averages (the plot of white circles below). Since the averages of outcomes are no longer repeated values of 0 and 1, we can see some vertical variation in the plot. We also see that the binned averages are close to the fitted logistic regression line.

```
plot(wells$dist100, wells$switch, ylim=c(-0.05, 1.05),
xlab="Distance (100m)", ylab="Pr(Switching)",
pch=20, cex=1.0, main="Data and binned averages")
points(x_bar, y_bar, pch=21, cex=1.5)
# fitted logistic regression
curve(invlogit(0.6 - 0.6*x), add=TRUE)
```

### 11.1.2 Plotting decision boundaries when there are two predictors

Suppose that we have a logistic model with two predictors. Even though the model is nonlinear, our decision on the outcome of a new data point can be linear. A typical decision \(\hat{y}\) of a new data point \((x_1,x_2)\) is

\[ \hat{y} = \begin{cases} 1 &\text{if } \Pr(y=1\vert x_1,x_2) > 0.5 \\ 0 &\text{if } \Pr(y=1\vert x_1,x_2) \leq 0.5. \end{cases} \]

Since \(\Pr(y=1\vert x_1,x_2)=\operatorname{logit}^{-1}(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2)\) and \(\operatorname{logit}^{-1}(x)=0.5\) when \(x=0\), so we can rewrite the decision as follows:

\[ \hat{y} = \begin{cases} 1 &\text{if } \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 > 0 \\ 0 &\text{if } \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 \leq 0. \end{cases} \]

In this case, the linear function \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 = 0\) splits our decision, and so it is the decision boundary.

Let us fit a logistic regression of the well-switching on the distance and the arsenic level.

```
<- stan_glm(switch ~ dist100 + arsenic,
fit_2 family = binomial(link="logit"),
data = wells, refresh=0)
= coef(fit_2) b
```

Then, we plot the decision boundary of this model. In addition, we plot the lines of two extremes: \(\Pr(y\vert x_1,x_2)=0.1\) and \(\Pr(y\vert x_1,x_2)=0.9\), which is equivalent to \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 =\operatorname{logit}(0.1)\) and \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 =\operatorname{logit}(0.9)\), respectively. Note that to plot these equations with `abline`

, we need to write \(x_2\) as a function of \(x_1\); the corresponding equations are:

```
plot(wells$dist100[wells$switch==1],
$arsenic[wells$switch==1],
wellsmain="Data and 10%, 50%, 90% discrimination lines
from fitted logistic regression",
xlab="Distance (100m)",
ylab="Arsenic",
col = rgb(red=0, green=0, blue=1, alpha=0.5),
pch=20)
points(wells$dist100[wells$switch==0],
$arsenic[wells$switch==0],
wellscol = rgb(red=1, green=0, blue=0, alpha=0.5),
pch=20)
abline(-b[1] / b[3],
-b[2] / b[3])
abline((logit(0.9) - b[1]) / b[3],
-b[2] / b[3], lty=2)
abline((logit(0.1) - b[1]) / b[3],
-b[2] / b[3], lty=2)
```

We notice from the plot that the model does not suit the data well, as many points in each class leak to the other side of the decision boundary.

## 11.2 Predictive simulation

We can plot the uncertainty of the coefficients using `stan_glm`

’s simulations. First, let us take the model with a single predictor `dist100`

and plot the first 500 simulations of the intercept and the slope.

```
<- as.matrix(fit_1)
sims plot(sims[1:500,1], sims[1:500,2], xlab=expression(beta[0]),
ylab=expression(beta[1]), pch=20, cex=.5)
```

We can also visualize the uncertainty in the predicted probabilities by plotting the logistic models with the simulated coefficients from above. The plot of the first 20 models below tells us that the probabilities of switching under shorter distances to the nearest safe well are more varied than those under longer distances.

```
plot(wells$dist, wells$switch,
ylim=c(0, 1), xlim=c(0, 3.3),
xlab="Distance (in meters) to nearest safe well",
ylab="Pr (switching)")
for (j in 1:20) {
<- sims[j, 1]
a <- sims[j, 2]
b curve (invlogit(a + b * x), lwd=.5,
col="darkgray", add=TRUE)
}<- coef(fit_1)[1]
a_hat <- coef(fit_1)[2]
b_hat curve(invlogit(a_hat + b_hat*x), lwd=1, add=TRUE)
```

## 11.3 Log score for logistic regression

In this section, we only consider the logistic regression model with the point estimate \(\hat{\beta}\) (i.e. the numbers shown in the output of `stan_glm`

). We recall the likelihood function of the logistic model.

\[ p\left(y\vert\hat{\beta},X\right) = \prod_{i=1}^n \begin{cases}\operatorname{logit}^{-1}(X_i\hat{\beta}) \qquad &\text{if } y_i = 1 \\ 1-\operatorname{logit}^{-1}(X_i\hat{\beta}) &\text{if } y_i = 0. \end{cases} \]

Given \(m\) new labeled data points \((X^{\text{new}}_1,y_1)\ldots,(X^{\text{new}}_m,y_m)\), we denote the probability prediction \(p_i = \operatorname{logit}^{-1}(X^{\text{new}}_i \hat{\beta})\) for \(i=1,\ldots,m\). The log score is computed by taking the logarithm of the likelihood.

\[ \text{out-of-sample log score} = \sum_{i=1}^m \begin{cases}\log p_i \qquad &\text{if } y_i = 1 \\ \log(1-p_i) &\text{if } y_i = 0. \end{cases} \]

Let us see how the log score behaves in three cases.

- Almost perfectly correct predictions; that is, \(p_i\approx 1\) whenever \(y_i=1\) and \(p_i\approx 0\) whenever \(y_i=0\). In the former case, we have \(\log p_i \approx \log 1 = 0\), and in the latter, \(\log(1-p_i)\approx\log 1 = 0\) as well. Thus we expect the log score to be close to zero.
- Completely wrong predictions; that is, \(p_i\approx 0\) whenever \(y_i=1\) and \(p_i\approx 1\) whenever \(y_i=0\). In the former case, we have \(\log p_i \approx \log 0 = -\infty\) and in the latter, \(\log(1-p_i) \approx \log 0 = -\infty\). In other words, a large negative log score indicates that the model has bad predictive performance.
- Random guessing; that is, \(p_i=0.5\) for all \(i\), which implies \(\log p_i =\log(1-p_i) = \log 0.5\) for all \(i\). The corresponding log score is \(\sum_{i=1}^m \log 0.5 = m\log 0.5\). Any model with meaningful predictions should do better than random guessing, so its log score should be more than \(m\log 0.5\).

We measure the out-of-sample predictive performance with expected log predictive density (elpd) based on the above log score. We can compute the elpd of the model by simply calling the `loo`

function.

### 11.3.1 Example of variable selection: well-switching example

```
<- loo(fit_2)
loo_2
print(loo_2)
```

```
Computed from 4000 by 3020 log-likelihood matrix
Estimate SE
elpd_loo -1968.4 15.7
p_loo 3.2 0.1
looic 3936.9 31.4
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```

The elpd of this model is \(-1968.4\). In comparison, the elpd of random guessing is \(3020\log 0.5=-2093\) (as their are 3020 data points in the dataset). So in terms of elpd, our model is better than random guessing.

Now, let us try adding an interaction term between the distance and the arsenic level.

```
<- stan_glm(switch ~ dist100 + arsenic + dist100:arsenic,
fit_3 family = binomial(link="logit"), data = wells,
refresh=0)
print(fit_3)
```

```
stan_glm
family: binomial [logit]
formula: switch ~ dist100 + arsenic + dist100:arsenic
observations: 3020
predictors: 4
------
Median MAD_SD
(Intercept) -0.1 0.1
dist100 -0.6 0.2
arsenic 0.6 0.1
dist100:arsenic -0.2 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

```
<- loo(fit_3)
loo_3
loo_compare(loo_2, loo_3)
```

```
elpd_diff se_diff
fit_3 0.0 0.0
fit_2 -0.4 1.9
```

There is virtually no improvement over the previous model, and the standard error of the interaction term is not significantly smaller than the point estimate, so we decide to discard it from the model.

Now, let us add two more predictors that might be relevant: the years of education of the well user (`educ4`

) and the status of association with any community organization (`assoc`

).

```
<- stan_glm(switch ~ dist100 + arsenic + educ4 + assoc,
fit_4 family = binomial(link="logit"), data = wells,
refresh=0)
print(fit_4)
```

```
stan_glm
family: binomial [logit]
formula: switch ~ dist100 + arsenic + educ4 + assoc
observations: 3020
predictors: 5
------
Median MAD_SD
(Intercept) -0.2 0.1
dist100 -0.9 0.1
arsenic 0.5 0.0
educ4 0.2 0.0
assoc -0.1 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

The coefficient of `assoc`

is highly uncertain (high standard error compared to the point estimate) so we decide to remove the predictor.

```
<- stan_glm(switch ~ dist100 + arsenic + educ4,
fit_5 family = binomial(link="logit"),
data = wells, refresh=0)
print(fit_5)
```

```
stan_glm
family: binomial [logit]
formula: switch ~ dist100 + arsenic + educ4
observations: 3020
predictors: 4
------
Median MAD_SD
(Intercept) -0.2 0.1
dist100 -0.9 0.1
arsenic 0.5 0.0
educ4 0.2 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

Every predictors seems to be significant, so now we add the interaction terms between the education and the other predictors.

```
<- stan_glm(switch ~ dist100 + arsenic + educ4 +
fit_6 :educ4 + arsenic:educ4,
dist100family = binomial(link="logit"),
data = wells, refresh=0)
print(fit_6)
```

```
stan_glm
family: binomial [logit]
formula: switch ~ dist100 + arsenic + educ4 + dist100:educ4 + arsenic:educ4
observations: 3020
predictors: 6
------
Median MAD_SD
(Intercept) 0.1 0.1
dist100 -1.3 0.2
arsenic 0.4 0.1
educ4 -0.1 0.1
dist100:educ4 0.3 0.1
arsenic:educ4 0.1 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

Now let us compare the elpd of this model to the model with two predictors.

```
<- loo(fit_6)
loo_6
loo_compare(loo_2, loo_6)
```

```
elpd_diff se_diff
fit_6 0.0 0.0
fit_2 -15.6 6.3
```

The elpd of the latest model is much higher, so we decide to keep this model for later sections.

## 11.4 Residuals for logistic regression

Let \(y\) be the actual outcome and \(\Pr(y=1)\) the predicted probability of a data point. The residual of the prediction is given by

\[ \text{residual} = y - \Pr(y=1). \]

Unlike the linear regression, it does not make sense to plot the residual vs. the predicted probability as the points would belong to only one of the following functions of the predicted probability: \(1-\Pr(y=1)\) and \(-\Pr(y=1)\). We illustrate this point by making a residual plot of the previous model of well-switching.

```
# We can use predict(fit_6, type="response", newdata=wells)
# but it is much slower than fitted.
<- fitted(fit_6)
pred6 1:5] pred6[
```

```
1 2 3 4 5
0.6930966 0.4385670 0.7245031 0.6108331 0.6025867
```

```
= wells$switch-pred6
residuals
plot(pred6, residuals, pch=20, cex=.2,
xlab="Estimated Pr(switching)",
ylab="Average residual")
```

A recommended way to visualize the residuals is by plotting the *binned residuals*, which can be obtained by binning the data on the predicted probabilities, and then computing the averages of the probabilites and the residuals for each bin.

We compute and plot the binned residuals using the `binned_residuals`

function from the `performance`

library. Let \(p_j\) be the average predicted probability in bin \(j\) and \(n_j\) be the number of points in bin \(j\). If the model were true, then, the \(j\)-th binned residual should fall inside the interval \([-2\sqrt{p_j(1-p_j)/n_j}, 2\sqrt{p_j(1-p_j)/n_j}]\) (the dotted lines) with probability \(0.95\).

```
install.packages("performance")
install.packages("see") # required to plot residuals
library(performance)
```

```
= binned_residuals(fit_6)
binnedres
plot(binnedres)
```

We can also plot binned residuals versus an input of interested by specifying the name of the input as an argument to the `binned_residuals`

function. Here, we plot the binned residuals versus the distance and the arsenic level and see if there are any unusual patterns in the plots.

```
= binned_residuals(fit_6, "dist100")
binnedres_dist = binned_residuals(fit_6, "arsenic")
binnedres_arsenic
plot(binnedres_dist)
```

`plot(binnedres_arsenic)`

The binned residual plot of the distance is mostly flat around zero, indicating that the model is a good fit. The plot of the arsenic level, however, has a rising and falling pattern, in which case we might want to apply the logarithmic transformation.

## 11.5 Logarithmic transformation

As in the linear regression we can apply the logarithm on the variables to further improve the fit of the model.

We continue the well-switching example. Detecting the unusual pattern in the binned residuals of the arsenic level, we decide to apply the logarithm on this predictor.

```
$log_arsenic <- log(wells$arsenic)
wells
<- stan_glm(switch ~ dist100 + log_arsenic + educ4 +
fit_7 :educ4 + log_arsenic:educ4,
dist100family = binomial(link="logit"),
data = wells, refresh=0)
print(fit_7)
```

```
stan_glm
family: binomial [logit]
formula: switch ~ dist100 + log_arsenic + educ4 + dist100:educ4 + log_arsenic:educ4
observations: 3020
predictors: 6
------
Median MAD_SD
(Intercept) 0.5 0.1
dist100 -1.4 0.2
log_arsenic 0.8 0.1
educ4 0.0 0.1
dist100:educ4 0.3 0.1
log_arsenic:educ4 0.1 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

Then we plot the binned residuals of the arsenic level again.

```
= binned_residuals(fit_7, "log_arsenic")
binnedres_logarsenic
plot(binnedres_logarsenic)
```

The residuals look better, though the problem remains at the low arsenic levels: the households with low arsenic levels are less likely to switch than predicted by the model.

Let us compare the elpd of the log model to the previous model.

```
<- loo(fit_7)
loo_7
loo_compare(loo_6, loo_7)
```

```
elpd_diff se_diff
fit_7 0.0 0.0
fit_6 -14.9 4.3
```

The \(14.8\) increase in elpd suggests that the log model is a better fit than the previous model.

## 11.6 Error rate

Another way to measure the performance of a logistic regression model is by comparing the actual outcome \(y\) to the model’s predicted outcome:

\[ \hat{y} = \begin{cases} 1 &\text{if } \Pr(y=1\vert x_1,x_2) > 0.5 \\ 0 &\text{if } \Pr(y=1\vert x_1,x_2) \leq 0.5. \end{cases} \]

The *error rate* is the proportion of mismatches between \(y\) and \(\hat{y}\) in the dataset. Thus lower error rate implies that the model’s predictions are more accurate.

To compute the error rate in R, we first output the model’s predicted probabilities.

`<- fitted(fit_7) pred8 `

Then we can compute the error rate as follows:

```
<- mean((pred8>0.5 & wells$switch==0) |
error_rate <=0.5 & wells$switch==1))
(pred8
error_rate
```

`[1] 0.3622517`

Let us compare this error rate with the one from a much simpler model, which predicts the majority outcome for all data points.

`print(c("Proportion of 1's", mean(wells$switch)))`

`[1] "Proportion of 1's" "0.575165562913907"`

`print(c("Proportion of 0's", 1-mean(wells$switch)))`

`[1] "Proportion of 0's" "0.424834437086093"`

The error rate of this simple model is \(0.43\). So in terms of error rates, our logistic model’s predictions are more accurate than outputting the majority outcome.

## 11.7 Nonidentification

### 11.7.1 Collinearity

*Collinearity* happens when one of the predictors is a linear combination of the other predictors, which results in unstable fitting procedure. This results in coefficients having large standard errors. We can solve this issue by removing some of the predictors while keeping the elpd at the same level.

### 11.7.2 Separation

*Separation* happens when a combination of the predictors can be used to completely split the outcomes by their values. In the example below, we have data of US election in 1964, where the `black`

variable completely aligns the outcome (`rvote`

), as there is no black respondent that votes for the republican.

```
<- read.table("data/nes.txt")
nes <- nes[nes$year == 1964 &
nes64 !is.na(nes$rvote) &
!is.na(nes$female) &
!is.na(nes$black) &
!is.na(nes$income),]
head(nes64[, c("year", "rvote", "black")])
```

```
year rvote black
8467 1964 0 0
8468 1964 0 0
8470 1964 0 0
8471 1964 0 0
8473 1964 0 0
8474 1964 0 0
```

`$rvote==1 & nes64$black==1,] nes64[nes64`

```
[1] year resid weight1 weight2
[5] weight3 age gender race
[9] educ1 urban region income
[13] occup1 union religion educ2
[17] educ3 martial_status occup2 icpsr_cty
[21] fips_cty partyid7 partyid3 partyid3_b
[25] str_partyid father_party mother_party dlikes
[29] rlikes dem_therm rep_therm regis
[33] vote regisvote presvote presvote_2party
[37] presvote_intent ideo_feel ideo7 ideo
[41] cd state inter_pre inter_post
[45] black female age_sq rep_presvote
[49] rep_pres_intent south real_ideo presapprov
[53] perfin1 perfin2 perfin presadm
[57] age_10 age_sq_10 newfathe newmoth
[61] parent_party white year_new income_new
[65] age_new vote.1 age_discrete race_adj
[69] dvote rvote
<0 rows> (or 0-length row.names)
```

In this case, the best maximum likelihood estimate of the coefficient of `black`

is \(-\infty\). In the summary of the regression on three predictors below, we notice that the standard error `black`

is abnormally high.

```
<- glm(rvote ~ female + black + income,
fit_8 family=binomial(link="logit"),
data=nes64)
summary(fit_8)$coefficients
```

```
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1509055 0.21594094 -5.3297235 9.836238e-08
female -0.0873733 0.13623456 -0.6413446 5.212988e-01
black -16.8337552 420.40038735 -0.0400422 9.680595e-01
income 0.1922987 0.05846259 3.2892611 1.004508e-03
```

We can handle this issue by adding some prior to the model. In fact, we can just use the default prior in `stan_glm`

(the weakly informative prior) and the coefficient and the standard error becomes much smaller.

```
<- stan_glm(rvote ~ female + black + income,
fit_9 family=binomial(link="logit"),
data=nes64, refresh=0)
print(fit_9)
```

```
stan_glm
family: binomial [logit]
formula: rvote ~ female + black + income
observations: 1058
predictors: 4
------
Median MAD_SD
(Intercept) -1.2 0.2
female -0.1 0.1
black -8.8 4.2
income 0.2 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```