5  Linear regression with multiple predictors

The linear regression with multiple predictors \(x_1,\ldots,x_p\) can be written in matrix-vector form as:

\[\begin{align*} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} &= \begin{pmatrix} \beta_0 + \beta_1 x_{11} + \ldots + \beta_p x_{1p} + \varepsilon_1 \\ \beta_0 + \beta_1 x_{21} + \ldots + \beta_p x_{2p} + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \ldots + \beta_p x_{np} + \varepsilon_n \end{pmatrix} \\ &= \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{pmatrix} \cdot \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \end{align*}\]

or, in short,

\[ y = X\beta + \varepsilon. \]

Here, \(y\) is the vector of outputs, \(X\) is the design matrix, \(\beta\) is the vector of parameters and \(\varepsilon\) is the vector of error terms. Assuming that the \(\varepsilon_i\) are independent and distributed as \(\mathcal{N}(0,\sigma^2)\), another way of writing the model is

\[ y\sim \mathcal{N}(X\beta, \sigma^2I). \]

We obtain an estimator \(\hat{\beta}\) of \(\beta\) by dropping the error term and solving the resulting approximate equation (assuming that \(X^TX\) is invertible):

\[\begin{align*} y &\approx X\hat{\beta} \\ X^Ty &\approx X^TX\hat{\beta} \\ \hat{\beta} &\approx (X^TX)^{-1}X^Ty. \end{align*}\]

This \(\hat{\beta}\) is called the ordinary least squares (OLS) estimate of \(\beta\).
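As a quick sanity check of this formula, here is a minimal sketch (simulated data, not part of this chapter's examples) comparing the closed-form estimate with R's built-in lm():

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2*x1 - 0.5*x2 + rnorm(n)
X <- cbind(1, x1, x2)                      # design matrix with a column of 1s
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
coef(lm(y ~ x1 + x2))                      # agrees with beta_hat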

First, we import the KidIQ data, which contains children's test scores along with characteristics of their mothers.

kidiq <- read.csv("data/kidiq.csv")

head(kidiq)
  kid_score mom_hs    mom_iq mom_work mom_age
1        65      1 121.11753        4      27
2        98      1  89.36188        4      25
3        85      1 115.44316        4      27
4        83      1  99.44964        3      25
5       115      1  92.74571        4      27
6        98      0 107.90184        1      18

Here, kid_score is the child’s IQ score, mom_hs is an indicator for whether the mother graduated from high school (0 or 1), and mom_iq is the mother’s IQ score.

Now we fit a linear regression model of kid_score on two predictors: mom_hs and mom_iq.

library(rstanarm)
fit_1 <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq,
                  refresh=0)
print(fit_1)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs + mom_iq
 observations: 434
 predictors:   3
------
            Median MAD_SD
(Intercept) 25.9    6.0  
mom_hs       6.0    2.2  
mom_iq       0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.2    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Thus the fitted line from the linear regression above is:

\[ \text{kid\_score} = 26 + 6 * \text{mom\_hs} + 0.6 * \text{mom\_iq} + \text{error}. \]

The coefficient of mom_iq can be interpreted as follows: comparing children whose mothers have the same high-school status (mom_hs) but whose mothers' IQs differ by 1 point, the model predicts a difference of 0.6 points in kid_score on average.

5.1 Interactions

The linear model with two predictors above forces the slope of mom_iq to be the same in both the mom_hs = 0 and mom_hs = 1 groups. However, if we fit separate regressions on these two subsets:

mom_hs_0 <- kidiq[kidiq$mom_hs == 0, ]
mom_hs_1 <- kidiq[kidiq$mom_hs == 1, ]

fit_2 <- stan_glm(kid_score ~ mom_iq, data=mom_hs_0,
                  refresh=0)
fit_3 <- stan_glm(kid_score ~ mom_iq, data=mom_hs_1,
                  refresh=0)

# plot both groups at once so that the axis limits cover all points
plot(kidiq$mom_iq, kidiq$kid_score,
     col=ifelse(kidiq$mom_hs == 1, "red", "blue"), cex=0.8,
     xlab="mom_iq", ylab="kid_score")
abline(coef(fit_2), col="blue")
abline(coef(fit_3), col="red")
legend(75, 135, legend=c("mom_hs = 0", "mom_hs = 1"),
       col=c("blue", "red"), lty=1)

we can see that the slopes differ noticeably between the two groups. A remedy is to add an interaction term between mom_hs and mom_iq, which can be done by adding mom_hs:mom_iq to the model formula in stan_glm.

fit_4 <- stan_glm(kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq, data=kidiq,
                  refresh=0)
print(fit_4)
stan_glm
 family:       gaussian [identity]
 formula:      kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq
 observations: 434
 predictors:   4
------
              Median MAD_SD
(Intercept)   -10.0   13.5 
mom_hs         49.5   14.9 
mom_iq          0.9    0.1 
mom_hs:mom_iq  -0.5    0.2 

Auxiliary parameter(s):
      Median MAD_SD
sigma 18.0    0.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The output tells us that the fitted model (with rounded coefficients) is:

\[ \text{kid\_score} = -10 + 50 * \text{mom\_hs} + 0.9 * \text{mom\_iq} - 0.5 * \text{mom\_hs} * \text{mom\_iq} + \text{error}. \]

We now obtain equations of kid_score vs mom_iq for each mom_hs group by plugging mom_hs = 0 and mom_hs = 1 into the rounded equation above. When mom_hs = 0, the equation becomes:

\[ \text{kid\_score} = -10 + 0.9 * \text{mom\_iq} + \text{error}. \]

When mom_hs = 1, the equation becomes:

\[ \text{kid\_score} = 40 + 0.4 * \text{mom\_iq} + \text{error}. \]
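As a rough check, the separate fits fit_2 and fit_3 from above should give approximately these intercepts and slopes (only approximately, since the two subset posteriors need not match the interaction model exactly):

coef(fit_2)  # should be close to the mom_hs = 0 line
coef(fit_3)  # should be close to the mom_hs = 1 line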

Comparing the three equations above, we can interpret the coefficients of the model with the interaction term as follows:

  • The intercept represents the predicted test score for children whose mothers did not complete high school (mom_hs = 0) and whose mothers have an IQ of 0 (mom_iq = 0), which is not a meaningful scenario.

    The intercept can be more interpretable if input variables are centered before including them as regression predictors.

  • The coefficient of mom_hs is the difference between the predicted test scores of children whose mothers did not complete high school (mom_hs = 0) and children whose mothers did (mom_hs = 1), both with mom_iq = 0; this comparison is not meaningful, since no mother has an IQ of 0. To make the coefficient more interpretable, one can center mom_iq (subtract its mean from the observed values) before fitting, so that the coefficient becomes the predicted difference at the average mother's IQ; see the sketch after this list.

  • The coefficient of mom_iq is the predicted difference in test scores between children whose mothers did not complete high school (mom_hs = 0) but whose mothers' IQs differ by 1 point.

  • The coefficient on the interaction term is the difference between the slopes of the two regression lines (one per mom_hs group) in the plot above.
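As mentioned in the bullets above, centering the predictors before fitting makes the intercept and main effects interpretable at average values rather than at zero. A minimal sketch (the c_ variable names are our own):

kidiq$c_mom_hs <- kidiq$mom_hs - mean(kidiq$mom_hs)
kidiq$c_mom_iq <- kidiq$mom_iq - mean(kidiq$mom_iq)
# the main effects now describe comparisons at the average mom_hs and mom_iq
fit_4c <- stan_glm(kid_score ~ c_mom_hs + c_mom_iq + c_mom_hs:c_mom_iq,
                   data=kidiq, refresh=0)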

When should we look for interactions? Interactions typically arise when the effect of a predictor differs across groups. For example, when predicting the occurrence of cancer from smoking status (0 or 1) and home radon exposure, the increase in cancer risk associated with radon exposure is higher among smokers than among non-smokers.
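A hypothetical model formula for this example (assuming a data frame cancer_data with a binary outcome cancer, a smoking indicator and a continuous radon measurement; since the outcome is binary, we use logistic regression) could be:

fit <- stan_glm(cancer ~ smoking + radon + smoking:radon,
                family=binomial(link="logit"), data=cancer_data,
                refresh=0)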

5.2 Regression with multiple levels of a categorical predictor

We give an example of a regression with multiple inputs, some of which are categorical variables with multiple levels. Here, we will use the Earnings data.

earnings <- read.csv("data/earnings.csv")
summary(factor(earnings$ethnicity))
   Black Hispanic    Other    White 
     180      104       38     1494 

We fit a linear regression of weight on three predictors: height, male (0 or 1) and ethnicity (White, Black, Hispanic and Other).

fit_5 <- stan_glm(weight ~ height + male + ethnicity, data=earnings,
                  refresh=0)
print(fit_5)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height + male + ethnicity
 observations: 1789
 predictors:   6
------
                  Median MAD_SD
(Intercept)       -99.8   15.6 
height              3.8    0.2 
male               12.1    1.9 
ethnicityHispanic  -6.1    3.5 
ethnicityOther    -12.3    5.3 
ethnicityWhite     -5.2    2.2 

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.6    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We can see that stan_glm has automatically created three new indicator variables (also called dummy variables), namely ethnicityHispanic, ethnicityOther and ethnicityWhite. The value of each variable is 1 if the person belongs to the corresponding level, and 0 otherwise. For example, a Hispanic person has ethnicityHispanic = 1, ethnicityOther = 0 and ethnicityWhite = 0. Looking closely at the regression output, we notice that there is no indicator for the Black group. This is because stan_glm took Black as the baseline category against which the other groups are compared. Consequently, a Black person is represented by ethnicityHispanic = 0, ethnicityOther = 0 and ethnicityWhite = 0.
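To see this coding directly, we can inspect the design matrix that R expands the formula into (a quick base-R check, separate from the model fitting above):

head(model.matrix(~ height + male + ethnicity, data=earnings))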

The model from the regression above is:

\[\begin{align*} \text{weight} &= -100 + 3.8*\text{height} + 12.1*\text{male} \\ & \qquad - 6.1*\text{Hispanic} - 12.3*\text{Other} - 5.2*\text{White} + \text{error}. \end{align*}\]

The coefficients of the ethnicity indicators can be interpreted as follows: between two persons with the same height and the same gender,

  • On average, a Hispanic person is \(6.1\) pounds lighter than a Black person, an Other person is \(12.3\) pounds lighter than a Black person, and a White person is \(5.2\) pounds lighter than a Black person.
  • On average, a Hispanic person is \(12.3-6.1=6.2\) pounds heavier than an Other person.
  • On average, a Hispanic person is \(6.1-5.2=0.9\) pounds lighter than a White person.
  • On average, a White person is \(12.3-5.2=7.1\) pounds heavier than an Other person.

Sometimes we would like to change the baseline group; this can be done by specifying the order of the levels of the categorical variable, since the first level is taken as the baseline. The example below shows how to set White as the baseline group.

earnings$eth <- factor(earnings$ethnicity,
                       levels=c("White", "Black", "Hispanic", "Other"))
fit_5 <- stan_glm(weight ~ height + male + eth, data=earnings,
                  refresh=0)
print(fit_5)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height + male + eth
 observations: 1789
 predictors:   6
------
            Median MAD_SD
(Intercept) -104.7   16.3
height         3.8    0.3
male          12.1    2.0
ethBlack       5.2    2.3
ethHispanic   -1.0    3.0
ethOther      -7.0    4.8

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.6    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
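Equivalently, base R's relevel() changes only the reference level of a factor:

earnings$eth <- relevel(factor(earnings$ethnicity), ref="White")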

5.2.1 Simulation-based prediction

From the fitted model, we give an example of computing \(\Pr(\text{weight} > 180)\) for a Hispanic male who is 70 inches tall. As in the previous chapter, we compute the probability using posterior simulation: posterior_predict returns draws from the posterior predictive distribution of weight for the new data point, and the mean of the indicator (draw > 180) estimates the probability.

new = data.frame(height=70, male=1, eth="Hispanic")
y_pred <- posterior_predict(fit_5, newdata=new)
y_180_indicator <- (y_pred > 180)
y_180_prob <- mean(y_180_indicator)
cat("Probability of being heavier than 180 pounds: ", y_180_prob)
Probability of being heavier than 180 pounds:  0.433

5.3 Paired and blocked designs as a regression problem

In the previous sections, we saw that regression coefficients can be interpreted as comparisons. Conversely, comparisons between two or more groups can be treated as regressions.

  1. Completely randomized experiment

    Suppose that in an experiment, people are randomly assigned to treatment and control groups. A standard estimate of the treatment effect is \(\bar{y}_T-\bar{y}_C\). We can obtain this estimate by assigning an indicator variable to each group: 0 for the control group and 1 for the treatment group. Then \(\bar{y}_T-\bar{y}_C\) is precisely the least squares coefficient of treatment in the following regression (stan_glm's default priors give a nearly identical estimate; a simulated check appears after this list):

    fit <- stan_glm(y ~ treatment, data=experiment)

    where treatment is 0 if the person is in the control group, and 1 if they are in the treatment group.

  2. Paired design

    When the experiment consists of control-treatment pairs, so that the two people in a pair are not independent, we can add an indicator variable for the pair, e.g. both people in the first pair get pair = 1, those in the second pair get pair = 2, and so on.

    fit <- stan_glm(y ~ treatment + factor(pair), data=experiment)
  3. Block design

    When considering the treatment effect over \(J\) groups of people, we can add an indicator variable group for group membership, e.g. everyone in the first group gets group = 1.

    fit <- stan_glm(y ~ treatment + factor(group), data=experiment)
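As a quick simulated check of the first claim (hypothetical data; we use lm() here because the least squares coefficient equals \(\bar{y}_T-\bar{y}_C\) exactly, whereas stan_glm's default priors give a nearly identical estimate):

set.seed(1)
n <- 100
treatment <- rbinom(n, 1, 0.5)
y <- 60 + 5*treatment + rnorm(n, 0, 10)
experiment <- data.frame(y, treatment)
mean(y[treatment == 1]) - mean(y[treatment == 0])      # ybar_T - ybar_C
coef(lm(y ~ treatment, data=experiment))["treatment"]  # identical value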

5.4 Weighted regression

The OLS estimate \(\hat{\beta}\) minimizes the least squares objective:

\[ \sum_i (y_i-X_i\hat{\beta})^2, \]

where \(X_i=(1,x_{i1},\ldots,x_{ip})\) is the \(i\)-th row of \(X\). We can modify the objective with weights \(w_1,\ldots,w_n\), one for each term in the sum:

\[\begin{equation} \sum_i w_i(y_i-X_i\hat{\beta})^2. \tag{1}\label{eq1} \end{equation}\]

We typically require that \(\sum_i w_i = 1\), although rescaling all weights by the same constant does not change the estimate. Why add these weights to the objective? Sometimes the proportions of different groups in the sample do not match those in the population, and we would like to correct for this when fitting the model. For example, suppose that our data consist of 70 white persons and 30 black persons. To balance the model between these two groups, we can put weight 1/140 on each term corresponding to a white person and 1/60 on each term corresponding to a black person, so that each group receives total weight 1/2.

Let \(W=\text{Diag}(w_1,\ldots,w_n)\). The estimate \(\hat{\beta}_{\text{wls}}\) that minimizes the weighted objective \(\eqref{eq1}\) is

\[ \hat{\beta}_{\text{wls}} = (X^TWX)^{-1}X^TWy. \]
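A minimal sketch (simulated data) checking this formula against lm()'s weights argument, which minimizes the same weighted objective:

set.seed(1)
x <- rnorm(50)
y <- 1 + 2*x + rnorm(50)
w <- runif(50); w <- w / sum(w)            # normalize the weights to sum to 1
X <- cbind(1, x)                           # design matrix with intercept column
W <- diag(w)
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # beta_hat_wls
coef(lm(y ~ x, weights=w))                 # same values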

To fit a weighted regression model using stan_glm, we just need to specify the weights parameter.

summary(factor(earnings$ethnicity))
   Black Hispanic    Other    White 
     180      104       38     1494 
eth <- earnings$ethnicity
# give each of the four ethnicity groups a total weight of 1/4 (sums to 1)
w <- (1/(4*180))*(eth == "Black") + 
  (1/(4*104))*(eth == "Hispanic") + 
  (1/(4*38))*(eth == "Other") + 
  (1/(4*1494))*(eth == "White")

fit_6 <- stan_glm(weight ~ height + male + ethnicity, data=earnings,
                  weights=w, refresh=0)
print(fit_6)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height + male + ethnicity
 observations: 1789
 predictors:   6
------
                  Median MAD_SD
(Intercept)       -100.4   16.3
height               3.9    0.3
male                12.1    2.0
ethnicityHispanic   -6.2    3.5
ethnicityOther     -12.3    5.3
ethnicityWhite      -5.2    2.3

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.7    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The results of the weighted regression are not much different from those of the unweighted regression.