9  Logistic regression

In this chapter we consider a prediction task with binary outcomes (0 or 1). Our running example is poll data collected before the 1992 presidential election. For each respondent \(i\) in the poll, we label \(y_i=1\) if they preferred George Bush and \(y_i=0\) if they preferred Bill Clinton. We predict these preferences from respondents’ income levels, measured on a five-point scale.

nes <- read.table("data/nes.txt")
nes92 <- nes[nes$year == 1992 &
               !is.na(nes$rvote) & 
               !is.na(nes$dvote) & 
               (nes$rvote==1 | nes$dvote==1),]

head(nes92[, c("income", "rvote")])
      income rvote
32093      4     1
32094      2     1
32096      1     0
32097      2     1
32098      3     0
32099      4     0

We first attempt to predict the label from a linear function of the predictors,

\[ X\beta = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p, \]

where \(X=(1,x_1,\ldots,x_p)\) and \(\beta=(\beta_0,\ldots,\beta_p)\). However, the range of the function is \((-\infty,\infty)\). To map this range to \((0,1)\), we introduce the logit function:

\[ \operatorname{logit}(x) = \log\left(\frac{x}{1-x}\right), \]

which maps \((0,1)\) to \((-\infty,\infty)\). What we need is the inverse of the logit function, which is commonly called the logistic function:

\[ \operatorname{logit}^{-1}(x) = \frac{e^x}{1+e^x}. \]

We can access the logit and logistic functions in R with qlogis and plogis, respectively.

logit <- qlogis
invlogit <- plogis
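
As a quick sanity check that these two functions are indeed inverses of each other:

invlogit(logit(0.25))  # returns 0.25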

Below is a plot of \(\operatorname{logit}^{-1}(x)\). We can see that the function is bounded above by 1, bounded below by 0, and equals 0.5 at \(x=0\).

curve(invlogit(x), xlim=c(-6, 6))

Here is the plot of \(\operatorname{logit}^{-1}(-x)\):

curve(invlogit(-x), xlim=c(-6, 6))

We propose a model for the probability of preferring George Bush over Bill Clinton.

\[ \Pr(y = 1\vert \beta,X) = \operatorname{logit}^{-1}(X\beta) = \operatorname{logit}^{-1}(\beta_0+\beta_1*\texttt{income}). \tag{9.1}\]

From the plots above, we see that \(\operatorname{logit}^{-1}(\beta_0+\beta_1\,\texttt{income})\) increases with income when \(\beta_1>0\) and decreases when \(\beta_1<0\).

Under the logistic model (Equation 9.1), we can compute the conditional probability that \(y=0\).

\[ \Pr(y = 0\vert \beta,X) = 1-\operatorname{logit}^{-1}(X\beta) = 1-\operatorname{logit}^{-1}(\beta_0+\beta_1*\texttt{income}). \tag{9.2}\]

Unlike linear regression, this model has no error term: the randomness of the outcome enters through the Bernoulli probability itself.

9.1 Maximum likelihood for logistic regression

As in linear regression, we fit the model by finding the parameters \(\beta\) that maximize the likelihood function. Given data \((X_1,y_1),\ldots,(X_n,y_n)\) and the probability model (Equation 9.1) and (Equation 9.2) above, the likelihood function of \(\beta\) is

\[\begin{align*} p\left(y\vert\beta,X\right) &= p\left(y_1\vert\beta,X_1\right)p\left(y_2\vert\beta,X_2\right)\ldots p\left(y_n\vert\beta,X_n\right) \\ &= \Pr\left(y=y_1\vert\beta,X_1\right)\Pr\left(y=y_2\vert\beta,X_2\right)\ldots \Pr\left(y=y_n\vert\beta,X_n\right). \end{align*}\]

Replacing each term in the product using (Equation 9.1) and (Equation 9.2), we can write the product in a compact form as

\[\begin{align*} p\left(y\vert\beta,X\right) &= \prod_{i=1}^n \begin{cases}\operatorname{logit}^{-1}(X_i\beta) \qquad &\text{if } y_i = 1 \\ 1-\operatorname{logit}^{-1}(X_i\beta) &\text{if } y_i = 0 \end{cases} \\ &= \prod_{i=1}^n \left(\operatorname{logit}^{-1}(X_i\beta)\right)^{y_i}\left(1-\operatorname{logit}^{-1}(X_i\beta)\right)^{1-y_i}. \end{align*}\]

We then maximize this expression over \(\beta\). However, unlike linear regression, standard calculus does not give us a closed-form solution, so we have to resort to an iterative optimization algorithm that converges to a stationary point, i.e. a point with zero partial derivatives. We will not discuss the algorithm here. Optimization theory tells us the maximization problem has a unique solution, unless there is collinearity or separation; we shall discuss these two conditions in a later chapter.
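
To make this concrete, here is a minimal sketch that maximizes the log-likelihood above numerically with optim, assuming the nes92 data frame loaded earlier (this is a simple illustration, not the algorithm R’s fitting routines actually use):

# Negative log-likelihood of the one-predictor logistic regression above
neg_log_lik <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)        # logit^{-1}(beta_0 + beta_1 * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
mle <- optim(c(0, 0), neg_log_lik, x=nes92$income, y=nes92$rvote)
mle$par  # point estimates of (beta_0, beta_1)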

9.2 Bayesian inference for logistic regression

What we have just discussed in the previous section can be extended to Bayesian inference. As we have shown in Chapter 4, if the prior distributions of the parameters \(\beta\) are uniform, then the vector \(\hat{\beta}\) that maximizes the posterior distribution is the same as the maximum likelihood estimate.

The default prior in stan_glm is again a weakly informative prior, which imposes the following prior distributions on the parameters:

  • Each coefficient \(\beta_k\) for \(k=1,2,\ldots,p\) is given a normal prior \(\mathcal{N}\left(0,(2.5/\operatorname{sd}(x_k))^2\right)\).

  • The prediction at the mean \(\beta_0+\beta_1\bar{x}_1+\ldots+\beta_p\bar{x}_p\) is given a normal prior \(\mathcal{N}(0,2.5^2)\).

But sometimes we have prior information about the coefficients. For example, in a logistic regression with one predictor, \(\Pr(y=1)=\operatorname{logit}^{-1}(a+bx)\), we might expect \(y\) to increase with \(x\) (a classic example is \(x=\text{smoking}\) and \(y=\text{cancer}\)). We can then impose a “soft constraint” on \(b\) by giving it a normal prior \(\mathcal{N}(0.5,0.5^2)\), which puts most of the prior probability on values of \(b\) between 0 and 1.
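
A quick computation shows what this prior implies for \(b\):

# Prior probability that b lies between 0 and 1 under N(0.5, 0.5^2)
pnorm(1, mean=0.5, sd=0.5) - pnorm(0, mean=0.5, sd=0.5)  # about 0.68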

9.3 Fitting a logistic regression model in R

To fit the model using stan_glm, we specify the argument family=binomial(link="logit").

library(rstanarm)
fit_1 <- stan_glm(rvote ~ income, family=binomial(link="logit"),
                  data=nes92, refresh=0)

print(fit_1)
stan_glm
 family:       binomial [logit]
 formula:      rvote ~ income
 observations: 1179
 predictors:   2
------
            Median MAD_SD
(Intercept) -1.4    0.2  
income       0.3    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We can plot the observed \(y_i\) against income and overlay the fitted curve \(\operatorname{logit}^{-1}(-1.4+0.3x)\) using the invlogit function.

a <- coef(fit_1)[1]
b <- coef(fit_1)[2]

plot(nes92$income, nes92$rvote,
     xlab="income", ylab="Pr(republican vote)")
curve(invlogit(a + b * x), add=TRUE)

We can also plot curves from the parameter simulations to visualize the uncertainty in the coefficients. In the following code, we plot 20 draws sampled from the 4000 simulations.

sims_1 <- as.matrix(fit_1)  # a matrix of parameter simulations 
n_sims <- nrow(sims_1)

plot(nes92$income, nes92$rvote,
     xlab="income", ylab="Pr(republican vote)")
for (j in sample(n_sims, 20)){
  a <- sims_1[j, 1]
  b <- sims_1[j, 2]
  curve(invlogit(a + b * x), col="gray",
        lwd=0.5, add=TRUE)
}

If, for some reason, we believe that people with higher income levels tend to prefer Bush over Clinton, then we might assume that the coefficient of income is somewhere between \(0\) and \(1\). Therefore, we use \(\mathcal{N}(0.5,0.5^2)\) as a prior for the coefficient.

fit_2 <- stan_glm(rvote ~ income, family=binomial(link="logit"),
                  data=nes92, prior=normal(0.5, 0.5),
                  refresh=0)

print(fit_2)
stan_glm
 family:       binomial [logit]
 formula:      rvote ~ income
 observations: 1179
 predictors:   2
------
            Median MAD_SD
(Intercept) -1.4    0.2  
income       0.3    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

However, since the dataset is large (1179 observations), the prior has no visible effect on the coefficient estimates.
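
To see exactly which priors were used in each fit, we can call prior_summary, as suggested in the printed output above:

prior_summary(fit_1)  # default weakly informative priors
prior_summary(fit_2)  # normal(0.5, 0.5) prior on the income coefficient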

9.3.1 Interpreting the coefficients

From the fitted model \(\Pr(y=1) = \operatorname{logit}^{-1}(-1.4+0.3x)\), we can interpret the intercept and the slope as follows:

  • Intercept: As \(y=1\) corresponds to voting for Bush, the intercept \(-1.4\) would be interpreted by setting \(x=0\). However, \(x=0\) is not a meaningful value, as income is measured on a 1-5 scale. Therefore, we interpret the intercept indirectly by evaluating the probability at some other value of \(x\). For example, we can evaluate \(\Pr(\text{Bush support})\) at the average value of \(x\).

    mean_income <- mean(nes92$income)
    cat("The mean income is", mean_income, "\n")
    The mean income is 3.075488 
    invlogit(-1.4 + 0.3 * mean_income)
    [1] 0.3828772

    which tells us that at the average value of \(x\), the probability of voting for Bush is 0.38.

  • Slope: the positive slope implies that a respondent with a higher income level is more likely to vote for Bush. We measure the difference in \(\Pr(y=1)\) for two respondents whose incomes differ by \(1\). Keep in mind that, as logistic regression involves a nonlinear function, the difference does not stay the same across different values of \(x\). As an example, let us measure the difference near the central value of \(x\). Since \(\bar{x}=3.1\), we evaluate the difference in the probabilities between \(x=2\) and \(x=3\).

    \[ \operatorname{logit}^{-1}(-1.4 + 0.3 * 3) - \operatorname{logit}^{-1}(-1.4 + 0.3 * 2) = 0.068. \]

    Thus, a respondent with income level \(3\) has a \(0.068\) higher probability of supporting Bush than one with income level \(2\); a quick check of this arithmetic in R is shown below.
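
A quick reproduction of this arithmetic, using the invlogit alias defined earlier:

invlogit(-1.4 + 0.3 * 3) - invlogit(-1.4 + 0.3 * 2)  # approximately 0.068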

9.3.1.1 Divide-by-4 rule for coefficient interpretation

If we look at the curve of the logistic function, we see that its slope is maximized at the center of the curve, that is, where \(a+bx=0\). To find the slope of \(\operatorname{logit}^{-1}(a+bx)\) at this point, we take its derivative with respect to \(x\).

\[ \frac{d}{dx}\frac{e^{a+bx}}{1+e^{a+bx}} = \frac{d}{dx}\left(1-\frac{1}{1+e^{a+bx}}\right)=\frac{be^{a+bx}}{\left(1+e^{a+bx}\right)^2}. \]

Consequently, the slope at \(a+bx=0\) is \(b\cdot 1/(1+1)^2=b/4\). Therefore, we can interpret \(b/4\) as the maximum difference in \(\Pr(y=1)\) corresponding to a one-unit difference in \(x\).

In our example, the coefficient is \(0.3\), which implies that a difference of one income category corresponds to at most a \(0.3/4 = 0.075\) increase in the probability of supporting Bush.
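
As a rough check, using the rounded estimates \(-1.4\) and \(0.3\) (the center of the curve, where \(-1.4+0.3x=0\), is at \(x\approx 4.7\)):

0.3 / 4                                              # divide-by-4 upper bound: 0.075
invlogit(-1.4 + 0.3 * 5) - invlogit(-1.4 + 0.3 * 4)  # change near the center, just under 0.075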

9.4 Different types of predictions

9.4.1 Point prediction

Suppose that we observe a new vector input \(X^{\text{new}}\). Assuming the new data follow the logistic model: \(\Pr(y^{\text{new}}=1\vert \beta, X^{\text{new}})= \operatorname{logit}^{-1}(X^{\text{new}}\beta)\), a point prediction is the expected value of \(y^{\text{new}}\) with respect to the posterior distribution of \(\beta\).

\[\begin{align*} \mathbb{E}[y^{\text{new}} \mid X^{\text{new}}] &= \mathbb{E}_{\beta}\left[\mathbb{E}[y^{\text{new}} \mid \beta, X^{\text{new}}]\right] \\ &= \mathbb{E}_{\beta}\left[\Pr(y^{\text{new}}=1\vert \beta, X^{\text{new}})\right] \\ &= \mathbb{E}_{\beta}\left[\operatorname{logit}^{-1}(X^{\text{new}}\beta)\right]. \end{align*}\]

One might be inclined to estimate the expectation with \(\operatorname{logit}^{-1}(X^{\text{new}}\hat{\beta})\), where \(\hat{\beta}\) is the vector of point estimates of the parameters. However, this leads to an incorrect estimate: as \(\operatorname{logit}^{-1}\) is a nonlinear function, we have

\[ \mathbb{E}_{\beta}\left[\operatorname{logit}^{-1}(X^{\text{new}}\beta)\right] \not= \operatorname{logit}^{-1}\left(\mathbb{E}_{\beta}\left[X^{\text{new}}\beta\right]\right). \]

Instead, we estimate the expectation by averaging \(\operatorname{logit}^{-1}(X^{\text{new}}\beta_1),\ldots,\operatorname{logit}^{-1}(X^{\text{new}}\beta_S)\) over the posterior simulations \(\beta_1,\ldots,\beta_S\) of \(\beta\), which can be obtained from the output of stan_glm. Below is an example of a point prediction for \(x^{\text{new}}=5\) (which corresponds to \(X^{\text{new}}=(1,5)\)).

sims_1 <- as.matrix(fit_1)  # a matrix of parameter simulations
x_new <- 5
a <- sims_1[, 1]  # 4000 simulations of intercept
b <- sims_1[, 2]  # 4000 simulations of slope

epred <- invlogit(a + b * x_new)  # 4000 simulations of prediction

print(epred[1:5])
[1] 0.5850560 0.5315515 0.5898944 0.5638610 0.5113758
pred <- mean(epred)

print(pred)
[1] 0.5566969
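
For comparison, the naive plug-in estimate discussed above uses the point estimates \(\hat{\beta}\) directly; with this much data it is typically very close to the average of the simulated probabilities, but the two are not identical in general:

invlogit(coef(fit_1)[1] + coef(fit_1)[2] * x_new)  # plug-in estimate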

All of this can be done in two lines by specifying type="response" in the predict function.

new <- data.frame(income=5)
pred <- predict(fit_1, type="response", newdata=new)

print(pred)
        1 
0.5566969 

9.4.2 Generating linear predictions

We can obtain simulation draws of the linear part of the model, \(X^{\text{new}}\beta_1,\ldots,X^{\text{new}}\beta_S\), using posterior_linpred.

linpred <- posterior_linpred(fit_1, newdata=new)

print(linpred[1:5])
[1] 0.34356410 0.12637401 0.36352882 0.25684676 0.04551089
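
These draws live on the logit scale. Applying invlogit maps them to the probability scale, which is exactly what posterior_epred in the next subsection returns:

invlogit(linpred[1:5])  # matches the posterior_epred draws shown below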

9.4.3 Generating outcome probabilities

Another way of calculating \(\Pr(y^{\text{new}}=1\vert \beta, X^{\text{new}})= \operatorname{logit}^{-1}(X^{\text{new}}\beta)\) over the posterior simulations \(\beta=\beta_1,\ldots,\beta_S\) is by calling the posterior_epred function.

epred <- posterior_epred(fit_1, newdata=new)

print(epred[1:5])
[1] 0.5850560 0.5315515 0.5898944 0.5638610 0.5113758

which gives the same outputs as the ones we used to calculate the point prediction above.

We can use these simulations to estimate the uncertainty in the predictions. For example, we can look at the mean and standard deviation of the probabilities that people with income level 5 would support Bush.

print(c(mean(epred), sd(epred)))
[1] 0.55669690 0.02858685

The mean of \(0.56\) and the standard deviation of \(0.03\) tell us that, among the people with income level 5, the percentage of Bush supporters is probably in the range \(56\% \pm 3\%\).

9.4.4 Generating binary outcomes

The output of posterior_epred consists of \(S\) predicted probabilities \(p_i = \Pr(y^{\text{new}}=1\vert \beta_i, X^{\text{new}}),\ i=1,\ldots,S\), for an individual voter with income \(X^{\text{new}}\). In other words, the distribution of the binary outcome \(y^{\text{new}}\) under the \(i\)-th simulation is \(\text{Bernoulli}(p_i)\). Thus, to sample new binary outcomes \(y^{\text{new}}_1,\ldots,y^{\text{new}}_S\) from the posterior predictive distribution, we can draw \(y^{\text{new}}_i\) from \(\text{Bernoulli}(p_i)\) for \(i=1,\ldots,S\). We can simulate these binary outcomes using posterior_predict.

postpred <- posterior_predict(fit_1, newdata=new)

print(postpred[1:10])
 [1] 1 1 0 1 0 1 1 1 1 0

From theory, the average of these simulations should be close to \(\mathbb{E}[y^{\text{new}} \mid X^{\text{new}}]\). Let us check if this is the case.

print(mean(postpred))
[1] 0.559

The average of \(0.56\) is close to the point prediction above.
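
Equivalently, we could have sampled these binary outcomes by hand from the epred draws computed above, with one Bernoulli draw per posterior simulation:

set.seed(1)  # for reproducibility of this sketch
y_new <- rbinom(n=length(epred), size=1, prob=epred)  # one 0/1 draw per simulation
mean(y_new)  # should again be close to the point prediction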

9.4.5 Predictions with multiple inputs

We can also use these functions to make predictions for a vector or a matrix of observations. For example, the following code computes different types of predictions for five new people whose income levels take on values 1 through 5:

new <- data.frame(income=1:5)
pred <- predict(fit_1, type="response", newdata=new)
linpred <- posterior_linpred(fit_1, newdata=new)
epred <- posterior_epred(fit_1, newdata=new)
postpred <- posterior_predict(fit_1, newdata=new)

Here, pred is a vector of length \(5\), and linpred, epred and postpred are matrices of size \(\texttt{n\_sims}\times 5\). Let us check out the first few rows of epred, for example.

print(epred[1:5,])
          
iterations         1         2         3         4         5
      [1,] 0.2674881 0.3385725 0.4177735 0.5014555 0.5850560
      [2,] 0.2569800 0.3176267 0.3851666 0.4574406 0.5315515
      [3,] 0.2263548 0.3034600 0.3934721 0.4913494 0.5898944
      [4,] 0.2367092 0.3070615 0.3877025 0.4750042 0.5638610
      [5,] 0.2436727 0.3019333 0.3673578 0.4380625 0.5113758

We notice that people with higher income levels are more likely to vote for Bush than those with lower income levels.

We can use these simulations to approximate various posterior quantities. For example, we can create a new variable that indicates whether, in each simulation, Bush is more popular among people with income level 5 than among those with income level 4.

indicator <- epred[, 5] > epred[, 4]

print(indicator[1:10])
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

which can be used to compute the posterior probability that Bush is more popular among people with income level 5 than among those with income level 4.

mean(indicator)
[1] 1

We can also compute a 95% posterior interval for the difference in the proportion of Bush supporters between people with income level 5 and those with income level 4. To do this, we use the quantile function.

quantile(epred[, 5] - epred[, 4], c(0.025, 0.975))
     2.5%     97.5% 
0.0540931 0.1078188 

This tells us that the difference is around \(5.4\%\) to \(10.8\%\).