11  Diagnostics of logistic regression models

After fitting a logistic regression model, it is a good idea to inspect the model in more detail. We discuss techniques for plotting the fitted model and checking the residuals. We also look into adding interaction terms to the model, as well as identifying problems that may arise when fitting the model.

11.1 Plotting logistic regression and binary data

11.1.1 Plotting binary data using binned averages

The usual scatterplot of data with binary outcomes is often not very insightful because many points overlap. A better way of plotting the data is to bin a predictor, with the other predictors held constant, and plot the binned averages, that is, the averages of the predictor and of the outcome within each bin. We demonstrate this with the wells data from the previous chapter.

library(rstanarm)

invlogit <- plogis
wells <- read.csv("data/wells.csv")
wells$dist100 <- wells$dist/100

We fit a logistic regression model of well-switching on the distance.

fit_1 <- stan_glm(switch ~ dist100,
                  family = binomial(link="logit"), 
                  data = wells, refresh=0)

Let us divide the distance into 8 bins.

K <- 8
# assign bin for each data point
bins <- as.numeric(cut(wells$dist100, K))

print(data.frame(wells$dist100, bins)[1:6,])
  wells.dist100 bins
1       0.16826    1
2       0.47322    2
3       0.20967    1
4       0.21486    1
5       0.40874    1
6       0.69518    2

Then we compute the average of the distance and the outcome for each bin.

x_bar <- rep(NA, K)  # initialize vector of bin averages for the predictor
y_bar <- rep(NA, K)  # initialize vector of bin averages for the outcome
for (k in 1:K){
  x_bar[k] <- mean(wells$dist100[bins==k])  # average distance in k-th bin
  y_bar[k] <- mean(wells$switch[bins==k])   # proportion of switchers in k-th bin
}

And now we plot the binned averages (the white circles in the plot below). Since the binned averages of the outcome are no longer just 0s and 1s, we can see some vertical variation in the plot. We also see that the binned averages are close to the fitted logistic regression curve.

plot(wells$dist100, wells$switch, ylim=c(-0.05, 1.05),
     xlab="Distance (100m)", ylab="Pr(Switching)",
     pch=20, cex=1.0, main="Data and binned averages")
points(x_bar, y_bar, pch=21, cex=1.5)

# fitted logistic regression, using the point estimates from fit_1
curve(invlogit(coef(fit_1)[1] + coef(fit_1)[2]*x), add=TRUE)

11.1.2 Plotting decision boundaries when there are two predictors

Suppose that we have a logistic regression model with two predictors. Even though the model is nonlinear, the decision rule for classifying a new data point can be linear. A typical decision \(\hat{y}\) for a new data point \((x_1,x_2)\) is

\[ \hat{y} = \begin{cases} 1 &\text{if } \Pr(y=1\vert x_1,x_2) > 0.5 \\ 0 &\text{if } \Pr(y=1\vert x_1,x_2) \leq 0.5. \end{cases} \]

Since \(\Pr(y=1\vert x_1,x_2)=\operatorname{logit}^{-1}(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2)\) and \(\operatorname{logit}^{-1}(x)=0.5\) exactly when \(x=0\), we can rewrite the decision as follows:

\[ \hat{y} = \begin{cases} 1 &\text{if } \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 > 0 \\ 0 &\text{if } \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 \leq 0. \end{cases} \]

In this case, the line \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 = 0\) separates the two decisions, and so it is called the decision boundary.

Let us fit a logistic regression of well switching on the distance and the arsenic level.

fit_2 <- stan_glm(switch ~ dist100 + arsenic,
                  family = binomial(link="logit"), 
                  data = wells, refresh=0)

b <- coef(fit_2)

Then, we plot the decision boundary of this model. In addition, we plot the lines of two more extreme probabilities, \(\Pr(y=1\vert x_1,x_2)=0.1\) and \(\Pr(y=1\vert x_1,x_2)=0.9\), which are equivalent to \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 =\operatorname{logit}(0.1)\) and \(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2 =\operatorname{logit}(0.9)\), respectively. Note that to plot these lines with abline, we need to write \(x_2\) as a function of \(x_1\); the corresponding equations are:

\[\begin{align*} x_2 &= -\frac{\hat{\beta}_0}{\hat{\beta}_2}-\frac{\hat{\beta}_1}{\hat{\beta}_2}x_1 \\ x_2 &= \frac{\operatorname{logit}(0.1)-\hat{\beta}_0}{\hat{\beta}_2}-\frac{\hat{\beta}_1}{\hat{\beta}_2}x_1 \\ x_2 &= \frac{\operatorname{logit}(0.9)-\hat{\beta}_0}{\hat{\beta}_2}-\frac{\hat{\beta}_1}{\hat{\beta}_2}x_1. \end{align*}\]

plot(wells$dist100[wells$switch==1],
     wells$arsenic[wells$switch==1],
     main="Data and 10%, 50%, 90% discrimination lines 
from fitted logistic regression",
     xlab="Distance (100m)",
     ylab="Arsenic",
     col = rgb(red=0, green=0, blue=1, alpha=0.5),
     pch=20)
points(wells$dist100[wells$switch==0],
       wells$arsenic[wells$switch==0],
       col = rgb(red=1, green=0, blue=0, alpha=0.5),
       pch=20)

logit <- qlogis  # define the logit function, mirroring invlogit above

abline(-b[1] / b[3],
       -b[2] / b[3])
abline((logit(0.9) - b[1]) / b[3],
       -b[2] / b[3], lty=2)
abline((logit(0.1) - b[1]) / b[3],
       -b[2] / b[3], lty=2)

We notice from the plot that the model does not discriminate the data well: many points from each class fall on the wrong side of the decision boundary.
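To quantify this impression, we can count the proportion of points that fall on the wrong side of the 50% decision boundary, using the point estimates stored in b (a quick sketch; this anticipates the error rate of Section 11.6).

# linear predictor at the point estimates
lin_pred <- b[1] + b[2]*wells$dist100 + b[3]*wells$arsenic
y_hat <- as.numeric(lin_pred > 0)  # decide 1 if above the decision boundary
mean(y_hat != wells$switch)        # proportion of misclassified points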

11.2 Predictive simulation

We can plot the uncertainty of the coefficients using the posterior simulations from stan_glm. First, let us take the model with a single predictor, dist100, and plot the first 500 simulation draws of the intercept and the slope.

sims <- as.matrix(fit_1)
plot(sims[1:500,1], sims[1:500,2], xlab=expression(beta[0]),
     ylab=expression(beta[1]), pch=20, cex=.5)

We can also visualize the uncertainty in the predicted probabilities by plotting the logistic curves corresponding to the simulated coefficients above. The plot of the first 20 curves below shows that the probabilities of switching are tightly constrained at shorter distances, where most of the data lie, and more varied at longer distances.

plot(wells$dist100, wells$switch,
     ylim=c(0, 1), xlim=c(0, 3.3),
     xlab="Distance (in 100 meters) to nearest safe well", 
     ylab="Pr (switching)")
for (j in 1:20) {
  a <- sims[j, 1]
  b <- sims[j, 2]
  curve (invlogit(a + b * x), lwd=.5,
         col="darkgray", add=TRUE)
}
a_hat <- coef(fit_1)[1]
b_hat <- coef(fit_1)[2]
curve(invlogit(a_hat + b_hat*x), lwd=1, add=TRUE)

11.3 Log score for logistic regression

In this section, we only consider the logistic regression model evaluated at the point estimate \(\hat{\beta}\) (i.e., the posterior medians shown in the output of stan_glm). We recall the likelihood function of the logistic model.

\[ p\left(y\vert\hat{\beta},X\right) = \prod_{i=1}^n \begin{cases}\operatorname{logit}^{-1}(X_i\hat{\beta}) \qquad &\text{if } y_i = 1 \\ 1-\operatorname{logit}^{-1}(X_i\hat{\beta}) &\text{if } y_i = 0. \end{cases} \]

Given \(m\) new labeled data points \((X^{\text{new}}_1,y_1),\ldots,(X^{\text{new}}_m,y_m)\), we denote the predicted probabilities by \(p_i = \operatorname{logit}^{-1}(X^{\text{new}}_i \hat{\beta})\) for \(i=1,\ldots,m\). The log score is the logarithm of the likelihood of the new data:

\[ \text{out-of-sample log score} = \sum_{i=1}^m \begin{cases}\log p_i \qquad &\text{if } y_i = 1 \\ \log(1-p_i) &\text{if } y_i = 0. \end{cases} \]

Let us see how the log score behaves in three cases.

  1. Almost perfectly correct predictions; that is, \(p_i\approx 1\) whenever \(y_i=1\) and \(p_i\approx 0\) whenever \(y_i=0\). In the former case, we have \(\log p_i \approx \log 1 = 0\), and in the latter, \(\log(1-p_i)\approx\log 1 = 0\) as well. Thus we expect the log score to be close to zero.
  2. Completely wrong predictions; that is, \(p_i\approx 0\) whenever \(y_i=1\) and \(p_i\approx 1\) whenever \(y_i=0\). In the former case, we have \(\log p_i \approx \log 0 = -\infty\) and in the latter, \(\log(1-p_i) \approx \log 0 = -\infty\). In other words, a large negative log score indicates that the model has bad predictive performance.
  3. Random guessing; that is, \(p_i=0.5\) for all \(i\), which implies \(\log p_i =\log(1-p_i) = \log 0.5\) for all \(i\). The corresponding log score is \(\sum_{i=1}^m \log 0.5 = m\log 0.5\). Any model with meaningful predictions should do better than random guessing, so its log score should be greater than \(m\log 0.5\).

We measure out-of-sample predictive performance with the expected log predictive density (elpd), which is based on the log score above. We can estimate the elpd of a model by simply calling the loo function.
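As a concrete illustration, here is a sketch of computing the in-sample log score of fit_2 by hand from its fitted probabilities; loo instead estimates an out-of-sample version of this quantity.

p <- fitted(fit_2)  # predicted probabilities at the point estimates
# sum of log(p_i) where y_i = 1 and log(1 - p_i) where y_i = 0
in_sample_log_score <- sum(ifelse(wells$switch == 1, log(p), log(1 - p)))
in_sample_log_score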

11.3.1 Example of variable selection with the well-switching data

loo_2 <- loo(fit_2)

print(loo_2)

Computed from 4000 by 3020 log-likelihood matrix

         Estimate   SE
elpd_loo  -1968.4 15.7
p_loo         3.2  0.1
looic      3936.9 31.4
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The elpd of this model is \(-1968.4\). In comparison, the elpd of random guessing is \(3020\log 0.5\approx-2093\) (as there are 3020 data points in the dataset). So in terms of elpd, our model is better than random guessing.
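For reference, the random-guessing benchmark can be computed directly:

nrow(wells) * log(0.5)  # log score of predicting 0.5 for every data point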

Now, let us try adding an interaction term between the distance and the arsenic level.

fit_3 <- stan_glm(switch ~ dist100 + arsenic + dist100:arsenic,
                  family = binomial(link="logit"), data = wells,
                  refresh=0)

print(fit_3)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic + dist100:arsenic
 observations: 3020
 predictors:   4
------
                Median MAD_SD
(Intercept)     -0.1    0.1  
dist100         -0.6    0.2  
arsenic          0.6    0.1  
dist100:arsenic -0.2    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
loo_3 <- loo(fit_3)

loo_compare(loo_2, loo_3)
      elpd_diff se_diff
fit_3  0.0       0.0   
fit_2 -0.4       1.9   

There is virtually no improvement in elpd over the previous model, and the point estimate of the interaction coefficient is not large relative to its standard error, so we decide to discard the interaction term from the model.

Now, let us add two more predictors that might be relevant: the years of education of the well user (educ4) and whether the user is associated with any community organization (assoc).

fit_4 <- stan_glm(switch ~ dist100 + arsenic + educ4 + assoc,
                  family = binomial(link="logit"), data = wells,
                  refresh=0)

print(fit_4)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic + educ4 + assoc
 observations: 3020
 predictors:   5
------
            Median MAD_SD
(Intercept) -0.2    0.1  
dist100     -0.9    0.1  
arsenic      0.5    0.0  
educ4        0.2    0.0  
assoc       -0.1    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The coefficient of assoc is highly uncertain (its standard error is high compared to the point estimate), so we decide to remove this predictor.

fit_5 <- stan_glm(switch ~ dist100 + arsenic + educ4,
                  family = binomial(link="logit"), 
                  data = wells, refresh=0)

print(fit_5)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic + educ4
 observations: 3020
 predictors:   4
------
            Median MAD_SD
(Intercept) -0.2    0.1  
dist100     -0.9    0.1  
arsenic      0.5    0.0  
educ4        0.2    0.0  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Every predictor now seems to be significant, so we next add interaction terms between education and the other two predictors.

fit_6 <- stan_glm(switch ~ dist100 + arsenic + educ4 +
                  dist100:educ4 + arsenic:educ4,
                  family = binomial(link="logit"),
                  data = wells, refresh=0)

print(fit_6)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic + educ4 + dist100:educ4 + arsenic:educ4
 observations: 3020
 predictors:   6
------
              Median MAD_SD
(Intercept)    0.1    0.1  
dist100       -1.3    0.2  
arsenic        0.4    0.1  
educ4         -0.1    0.1  
dist100:educ4  0.3    0.1  
arsenic:educ4  0.1    0.0  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Now let us compare the elpd of this model to the model with two predictors.

loo_6 <- loo(fit_6)

loo_compare(loo_2, loo_6)
      elpd_diff se_diff
fit_6   0.0       0.0  
fit_2 -15.6       6.3  

The elpd of the latest model is higher by about 15.6, which is much larger than the standard error of the difference, so we keep this model for the later sections.

11.4 Residuals for logistic regression

Let \(y\) be the actual outcome and \(\Pr(y=1)\) the predicted probability of a data point. The residual of the prediction is given by

\[ \text{residual} = y - \Pr(y=1). \]

Unlike in linear regression, it does not make sense to plot the raw residuals against the predicted probabilities, as each point must fall on one of only two curves: \(1-\Pr(y=1)\) (when \(y=1\)) or \(-\Pr(y=1)\) (when \(y=0\)). We illustrate this point by making a residual plot of the previous model of well switching.

# We can use predict(fit_6, type="response", newdata=wells) 
# but it is much slower than fitted.
pred6 <- fitted(fit_6)
pred6[1:5]
        1         2         3         4         5 
0.6930966 0.4385670 0.7245031 0.6108331 0.6025867 
residuals <- wells$switch - pred6

plot(pred6, residuals, pch=20, cex=.2,
     xlab="Estimated Pr(switching)",
     ylab="Average residual")

A recommended way to visualize the residuals is to plot the binned residuals, which are obtained by binning the data on the predicted probabilities and then computing the averages of the probabilities and of the residuals within each bin.

We compute and plot the binned residuals using the binned_residuals function from the performance package. Let \(p_j\) be the average predicted probability in bin \(j\) and \(n_j\) the number of points in bin \(j\). If the model were true, then the \(j\)-th binned residual should fall inside the interval \([-2\sqrt{p_j(1-p_j)/n_j}, 2\sqrt{p_j(1-p_j)/n_j}]\) (the dotted lines) with probability approximately \(0.95\).

install.packages("performance")
install.packages("see")  # required to plot residuals 

library(performance)
binnedres <- binned_residuals(fit_6)

plot(binnedres)
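For comparison, here is a rough sketch of computing binned residuals by hand, reusing the binning idea from Section 11.1.1; binned_residuals chooses its bins differently, so the resulting plot will not match its output exactly.

n_bins <- 40                                   # number of bins (an arbitrary choice here)
bin_id <- as.numeric(cut(pred6, n_bins))       # bin the data on the predicted probabilities
p_bar <- tapply(pred6, bin_id, mean)           # average predicted probability in each bin
r_bar <- tapply(residuals, bin_id, mean)       # average residual in each bin
n_j <- tapply(pred6, bin_id, length)           # number of points in each bin
bound <- 2 * sqrt(p_bar * (1 - p_bar) / n_j)   # +/- 2 standard-error bounds

plot(p_bar, r_bar, pch=20,
     xlab="Estimated Pr(switching)", ylab="Average residual")
lines(p_bar, bound, lty=2)
lines(p_bar, -bound, lty=2)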

We can also plot the binned residuals versus an input of interest by passing the name of the input as an argument to the binned_residuals function. Here, we plot the binned residuals versus the distance and the arsenic level to see whether there are any unusual patterns.

binnedres_dist <- binned_residuals(fit_6, "dist100")
binnedres_arsenic <- binned_residuals(fit_6, "arsenic")

plot(binnedres_dist)

plot(binnedres_arsenic)

The binned residual plot against the distance is mostly flat around zero, indicating a reasonable fit. The plot against the arsenic level, however, shows a rising-and-falling pattern, which suggests applying a logarithmic transformation to this predictor.

11.5 Logarithmic transformation

As in linear regression, we can apply the logarithm to variables to further improve the fit of the model.

We continue the well-switching example. Having detected the unusual pattern in the binned residuals of the arsenic level, we apply the logarithm to this predictor.

wells$log_arsenic <- log(wells$arsenic)

fit_7 <- stan_glm(switch ~ dist100 + log_arsenic + educ4 +
                  dist100:educ4 + log_arsenic:educ4,
                  family = binomial(link="logit"),
                  data = wells, refresh=0)

print(fit_7)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + log_arsenic + educ4 + dist100:educ4 + log_arsenic:educ4
 observations: 3020
 predictors:   6
------
                  Median MAD_SD
(Intercept)        0.5    0.1  
dist100           -1.4    0.2  
log_arsenic        0.8    0.1  
educ4              0.0    0.1  
dist100:educ4      0.3    0.1  
log_arsenic:educ4  0.1    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Then we plot the binned residuals of the arsenic level again.

binnedres_logarsenic <- binned_residuals(fit_7, "log_arsenic")

plot(binnedres_logarsenic)

The residuals look better, though a problem remains at the low arsenic levels: households with low arsenic levels are less likely to switch than the model predicts.

Let us compare the elpd of the log model to the previous model.

loo_7 <- loo(fit_7)

loo_compare(loo_6, loo_7)
      elpd_diff se_diff
fit_7   0.0       0.0  
fit_6 -14.9       4.3  

The increase of about \(14.9\) in elpd suggests that the log model fits better than the previous model.

11.6 Error rate

Another way to measure the performance of a logistic regression model is to compare the actual outcome \(y\) to the model’s predicted outcome:

\[ \hat{y} = \begin{cases} 1 &\text{if } \Pr(y=1\vert x) > 0.5 \\ 0 &\text{if } \Pr(y=1\vert x) \leq 0.5. \end{cases} \]

The error rate is the proportion of mismatches between \(y\) and \(\hat{y}\) in the dataset; a lower error rate means that the model’s predictions are more accurate.

To compute the error rate in R, we first output the model’s predicted probabilities.

pred7 <- fitted(fit_7)

Then we can compute the error rate as follows:

error_rate <- mean((pred7>0.5 & wells$switch==0) |
                   (pred7<=0.5 & wells$switch==1))

error_rate
[1] 0.3622517

Let us compare this error rate with the one from a much simpler model, which predicts the majority outcome for all data points.

print(c("Proportion of 1's", mean(wells$switch)))
[1] "Proportion of 1's" "0.575165562913907"
print(c("Proportion of 0's", 1-mean(wells$switch)))
[1] "Proportion of 0's" "0.424834437086093"

The error rate of this simple model is about \(0.42\) (the proportion of 0’s). So in terms of error rate, our logistic regression’s predictions are more accurate than always predicting the majority outcome.
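The same baseline can be computed in one line:

# error rate of always predicting the majority outcome (here, switch = 1)
min(mean(wells$switch), 1 - mean(wells$switch))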

11.7 Nonidentification

11.7.1 Collinearity

Collinearity happens when one of the predictors is (nearly) a linear combination of the other predictors, which makes the fitting procedure unstable and results in coefficients with large standard errors. We can address this issue by removing some of the predictors while checking that the elpd stays at about the same level.
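As a hypothetical illustration (the copied variable below is made up for this demonstration and is not part of the well-switching analysis), we can add a nearly collinear copy of dist100 to the two-predictor model and compare the reported uncertainties; we would expect the two distance coefficients to be far more uncertain than the single dist100 coefficient in fit_2.

# construct a near-duplicate of dist100 by adding a little noise (illustration only)
set.seed(1)
wells$dist100_copy <- wells$dist100 + rnorm(nrow(wells), 0, 0.01)

fit_collinear <- stan_glm(switch ~ dist100 + dist100_copy + arsenic,
                          family = binomial(link="logit"),
                          data = wells, refresh=0)
print(fit_collinear)  # compare the MAD_SDs of dist100 and dist100_copy with those in fit_2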

11.7.2 Separation

Separation happens when a combination of the predictors can be used to perfectly split the outcomes by their values. In the example below, we use data from the 1964 US presidential election, in which the black variable perfectly predicts the outcome (rvote) in one direction: no black respondent in the sample voted for the Republican candidate.

nes <- read.table("data/nes.txt")
nes64 <- nes[nes$year == 1964 &
               !is.na(nes$rvote) &
               !is.na(nes$female) &
               !is.na(nes$black) &
               !is.na(nes$income),]

head(nes64[, c("year", "rvote", "black")])
     year rvote black
8467 1964     0     0
8468 1964     0     0
8470 1964     0     0
8471 1964     0     0
8473 1964     0     0
8474 1964     0     0
nes64[nes64$rvote==1 & nes64$black==1,]
 [1] year            resid           weight1         weight2        
 [5] weight3         age             gender          race           
 [9] educ1           urban           region          income         
[13] occup1          union           religion        educ2          
[17] educ3           martial_status  occup2          icpsr_cty      
[21] fips_cty        partyid7        partyid3        partyid3_b     
[25] str_partyid     father_party    mother_party    dlikes         
[29] rlikes          dem_therm       rep_therm       regis          
[33] vote            regisvote       presvote        presvote_2party
[37] presvote_intent ideo_feel       ideo7           ideo           
[41] cd              state           inter_pre       inter_post     
[45] black           female          age_sq          rep_presvote   
[49] rep_pres_intent south           real_ideo       presapprov     
[53] perfin1         perfin2         perfin          presadm        
[57] age_10          age_sq_10       newfathe        newmoth        
[61] parent_party    white           year_new        income_new     
[65] age_new         vote.1          age_discrete    race_adj       
[69] dvote           rvote          
<0 rows> (or 0-length row.names)
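Equivalently, a quick cross-tabulation of rvote and black makes the empty cell that causes the separation easy to spot:

with(nes64, table(rvote, black))  # the cell (rvote=1, black=1) has zero respondents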

In this case, the maximum likelihood estimate of the coefficient of black is \(-\infty\). In the summary of the regression on three predictors below, we notice that the standard error of black is abnormally high.

fit_8 <- glm(rvote ~ female + black + income,
             family=binomial(link="logit"),
             data=nes64)

summary(fit_8)$coefficients
               Estimate   Std. Error    z value     Pr(>|z|)
(Intercept)  -1.1509055   0.21594094 -5.3297235 9.836238e-08
female       -0.0873733   0.13623456 -0.6413446 5.212988e-01
black       -16.8337552 420.40038735 -0.0400422 9.680595e-01
income        0.1922987   0.05846259  3.2892611 1.004508e-03

We can handle this issue by putting a prior on the coefficients. In fact, we can just use the default weakly informative prior in stan_glm, and both the coefficient and its standard error become much smaller in magnitude.

fit_9 <- stan_glm(rvote ~ female + black + income,
                   family=binomial(link="logit"),
                   data=nes64, refresh=0)

print(fit_9)
stan_glm
 family:       binomial [logit]
 formula:      rvote ~ female + black + income
 observations: 1058
 predictors:   4
------
            Median MAD_SD
(Intercept) -1.2    0.2  
female      -0.1    0.1  
black       -8.8    4.2  
income       0.2    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg