10  Logistic regression with multiple predictors

Including more predictors to the logistic regression adds more complexity, not only to the model but also its interpretability. In this chapter, we introduce a new concept for interpreting a coefficient, namely the average predictive difference. We also discuss logistic models with interactions and how to interpret their coefficients.

10.1 Example: wells in Bangladesh

Many of the wells in Bangladesh and other South Asian countries are contaminated with arsenic, which is a cumulative poison that may leads to cancer and other diseases. A research team from the United States and Bangladesh has inspected all the wells in Araihazar, Bangladesh and marked them as “safe” if the amount of arsenic is 50 micrograms per liter, and “unsafe” otherwise. People with unsafe wells were encouraged to switch to nearby private or community wells that were safe.

A few years later, the researchers returned surveyed the households in this area. The outcome variable is

\[ \texttt{switch} = \begin{cases}1 \quad\text{if household }i\text{ switched to a new well}\\ 0 \quad\text{if household }i\text{ continued using its own well}.\end{cases} \]

We will fit a logistic regression of this binary variable on the following inputs:

Name Description
dist The distance to the closest known safe well (meters)
arsenic The arsenic level of the household’s well (ug/l)
assoc whether any member of the household are active member in community organizations
educ the education level of the head of household

First, we fit a regression of switch on two predictors: the distance and the arsenic level. The distanced are scaled down by a factor of 100 so that the coefficient does not become too small.

library(rstanarm)

invlogit <- plogis
wells <- read.csv("data/wells.csv")
wells$dist100 <- wells$dist/100

head(wells)
  switch arsenic   dist dist100 assoc educ educ4
1      1    2.36 16.826 0.16826     0    0  0.00
2      1    0.71 47.322 0.47322     0    0  0.00
3      0    2.07 20.967 0.20967     0   10  2.50
4      1    1.15 21.486 0.21486     0   12  3.00
5      1    1.10 40.874 0.40874     1   14  3.50
6      1    3.90 69.518 0.69518     1    9  2.25
fit_1 <- stan_glm(switch ~ dist100 + arsenic,
                  family = binomial(link="logit"),
                  data = wells, refresh=0)

print(fit_1)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic
 observations: 3020
 predictors:   3
------
            Median MAD_SD
(Intercept)  0.0    0.1  
dist100     -0.9    0.1  
arsenic      0.5    0.0  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Looking at the coefficients and the standard errors, both predictors are relevant to the prediction.

10.2 Average predictive difference for coefficient interpretation

A natural way to interpret the coefficient of arsenic is to compute the predictive difference between two values of arsenic with other predictors held at a constant. For example, we might want to see how the predicted probability change from \(\texttt{arsenic}=0.5\) (the lowest unsafe level) to \(\texttt{arsenic} = 1.0\). Here, we fix \(\texttt{dist100} = 0.4\).

\[ \operatorname{logit}^{-1}(-0.9*0.4 + 0.5*1.0)-\operatorname{logit}^{-1}(-0.9*0.4 + 0.5*0.5) = 0.062. \]

We have to be careful on the values of the held-constant predictors; if we choose values that are far away from the actual range of inputs, the corresponding predictive difference might not be achievable. Below are the logistic plots of two examples of data with two variables \(u\) and \(v\), where the values of \(v\) are at the extremes. We can see that, if we fixed \(v\) at its mean value, the predictive difference would be very large compared to those at the actual values of \(v\).

Plots of logistic regression between with two predictors: \(u=0\) and \(u=1\), and \(v\) is a continuous variable.

A better practice is to compute the predictive differences in one predictor, evaluated at the actual values of the other predictors in the data. Then, we take the average of the differences. For example, let us denote the well-switching data by \(\{(\texttt{switch}_1,\texttt{dist100}_1,\texttt{arsenic}_1),\) \(\ldots,\) \((\texttt{switch}_n,\texttt{dist100}_n,\texttt{arsenic}_n)\}\). Then the average predictive difference (APD) between \(\texttt{arsenic}=0.5\) and \(\texttt{arsenic} = 1.0\) can be calculated as follows:

\[ \frac{1}{n}\sum_{i=1}^n \left[\operatorname{logit}^{-1}(-0.9*\texttt{dist100}_i + 0.5*1.0)-\operatorname{logit}^{-1}(-0.9*\texttt{dist100}_i + 0.5*0.5)\right]. \]

Here is the computation in R:

arsenic_hi <- 1.0
arsenic_lo <- 0.5
b <- coef(fit_1)

# differences of logistics for all values of dist100 in the data
diffs <- invlogit(b[1] + b[2] * wells$dist100 + b[3] * arsenic_hi) -
         invlogit(b[1] + b[2] * wells$dist100 + b[3] * arsenic_lo)

mean(diffs)
[1] 0.05603215

In other words, on average, households with the arsenic level of 1.0 are 5.6% more likely to switch the well than those with the arsenic level of 0.5.

10.3 Logistic regression with interactions

Now suppose that we would like to add an interaction term between the distance and the arsenic level to the logistic regression. In R, this can be done by adding dist100:arsenic in the formula.

fit_2 <- stan_glm(switch ~ dist100 + arsenic + dist100:arsenic,
                  family = binomial(link="logit"), data = wells,
                  refresh=0)

print(fit_2)
stan_glm
 family:       binomial [logit]
 formula:      switch ~ dist100 + arsenic + dist100:arsenic
 observations: 3020
 predictors:   4
------
                Median MAD_SD
(Intercept)     -0.1    0.1  
dist100         -0.6    0.2  
arsenic          0.6    0.1  
dist100:arsenic -0.2    0.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

According to this model, the probability of switching \(\Pr(\texttt{switch}=1)\) is:

\[ \operatorname{logit}^{-1}(-0.1 -0.6*\texttt{dist100} +0.6*\texttt{arsenic} -0.2*\texttt{dist100}*\texttt{arsenic}). \]

The coefficient of the interaction term seems to be relevant, so we keep it in the model for the time being. We now interpret each coefficient.

  • Intercept: usually, we interpret the intercept by letting the other predictors be zero. However, it is impossible for the distance and the arsenic level to be zero; so we evaluate the prediction at the average values of \(\texttt{dist100}=0.48\) and \(\texttt{arsenic} = 0.66\). The corresponding probability of switching is

    \[ \operatorname{logit}^{-1}(-0.1 -0.6*0.48+0.6*1.66-0.2*0.48*1.66)=0.61. \]

  • Effect of the distance: we can rewrite the model as

    \[ \operatorname{logit}^{-1}(-0.1 +0.6*\texttt{arsenic}-(0.6+0.2*\texttt{arsenic})*\texttt{dist100}) . \]

    With \(\texttt{arsenic}\) fixed at a positive constant, then the coefficient of \(\texttt{dist100}\) is positive, which implies that households that are farther away from the nearest safe well are less likely to switch. Moreover, as the households’ wells have higher arsenic levels, the importance of the distance to the probability of switching is increasing.

    We plot the logistic regression of probability of switching as a function of distance to the nearest safe well at two arsenic levels: \(\texttt{arsenic}=0.5\) and \(\texttt{arsenic}=3.0\).

    b <- coef(fit_2)
    
    curve(invlogit(b[1] + b[2]*x + b[3]*0.5 + b[4]*x*0.5),
          xlab="Distance (in 100 meters) to nearest safe well",
          ylab="Pr (switching)",
          xlim=c(0,3),
          ylim=c(0,0.8))
    curve(invlogit(b[1] + b[2]*x + b[3]*3.0 + b[4]*x*3.0), add=TRUE)
    text (0.50, 0.36, "if As = 0.5")
    text (0.75, 0.50, "if As = 3.0")

    For \(\texttt{arsenic}=1.0\), the probability of switching starts off higher and moves down faster; this is because the arsenic level became less relevant as the safe wells was too far away from the households. At 300 meters, the arsenic level barely affected the households’ decisions at all.

    We use the APD to measure the effect of the distance on the switching probability. Let us compute the APD of the distance from \(\texttt{dist100}=0\) to \(\texttt{dist100}=1\).

    dist100_hi <- 1
    dist100_lo <- 0
    b <- coef(fit_2)
    
    # differences of logistics for all values of dist100 in the data
    diffs <- invlogit(b[1] + b[2] * dist100_hi + b[3] * wells$arsenic +
                      b[4] * dist100_hi * wells$arsenic) -
             invlogit(b[1] + b[2] * dist100_lo + b[3] * wells$arsenic +
                      b[4] * dist100_lo * wells$arsenic)
    
    mean(diffs)
    [1] -0.1949786

    which implies that, on average, households that are 100 meters from the nearest safe well are 20% less likely to switch compared to households that lived right next to a safe well.

  • Effect of the arsenic level: we can rewrite the model as

    \[ \operatorname{logit}^{-1}(-0.1 -0.6*\texttt{dist100} +(0.6-0.2*\texttt{dist100})*\texttt{arsenic}). \]

    Notice that the coefficient of \(\texttt{arsenic}\) decreases as \(\texttt{dist100}\) increases. This means that, as the distance increases, the importance of arsenic to the probability of switching decreases.

    Let us compute the APD between \(\texttt{arsenic}=0.5\) to \(\texttt{arsenic}=1.0\).

    arsenic_hi <- 1.0
    arsenic_lo <- 0.5
    b <- coef(fit_2)
    
    # differences of logistics for all values of dist100 in the data
    diffs <- invlogit(b[1] + b[2] * wells$dist100 + b[3] * arsenic_hi +
                      b[4] * wells$dist100 * arsenic_hi) -
             invlogit(b[1] + b[2] * wells$dist100 + b[3] * arsenic_lo +
                      b[4] * wells$dist100 * arsenic_lo)
    
    mean(diffs)
    [1] 0.0579842

    which implies that, on average, households with the arsenic level of 1.0 are 5.8% more likely to switch the well than those with the arsenic level of 0.5.

    Here is the plot of the logistic regression lines of the probability of switching as a function of arsenic level, at distances of 0 meters and 50 meters.

    b <- coef(fit_2)
    
    curve(invlogit(b[1] + b[2]*0.0 + b[3]*x + b[4]*0.0*x),
          xlab="Arsenic level",
          ylab="Pr (switching)",
          xlim=c(0,10),
          ylim=c(0.35,1.0))
    curve(invlogit(b[1] + b[2]*0.5 + b[3]*x + b[4]*0.5*x), add=TRUE)
    text (1.6, 0.78, "if dist = 0")
    text (4.7, 0.7, "if dist = 50")

    Of course, households that lived next to a safe well were more likely two switch the well. The probabilities reach the same value as the arsenic level became dangerously high; this is because households almost always wanted to switch, regardless of the distance to the safe well.