6  Model diagnostics and evaluation

We will discuss several graphical and quantitative ways to check our model’s fit to the data, and later we will cover cross validation, a technique for comparing different models.

6.1 Plotting the data and the fitted model

6.1.1 Model with one predictor

In the one-predictor case, we can visualize the regression fit simply by plotting the data and the fitted regression line. Here, we plot the KidIQ data and the fitted line.

library(rstanarm)
kidiq = read.csv("data/kidiq.csv")

fit_2 <- stan_glm(kid_score ~ mom_iq, data=kidiq, refresh=0)
plot(kidiq$mom_iq, kidiq$kid_score, 
     xlab="Mother IQ score", ylab="Child test score")
abline(coef(fit_2)[1], coef(fit_2)[2])

6.1.2 Model with two predictors

6.1.2.1 Model with one categorical predictor and no interaction

We can plot each group along with the fitted regression line for that group. Using the KidIQ data as an example, recall that the regression equation without an interaction is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq}. \]

Consequently, the equation of \(\text{kid\_score}\) on \(\text{mom\_iq}\) for the group \(\text{mom\_hs} = 1\) is

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + \hat{\beta}_3 * \text{mom\_iq}, \]

and the equation for the group \(\text{mom\_hs} = 0\) is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]

We thus obtain the intercept and the slope of each equation, which we use to plot the regression line of each \(\text{mom\_hs}\) group below.

fit_3 <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq,
                  refresh=0)

# create a vector with value = "red" if mom_hs==1
# "blue" if mom_hs==0
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")

print(colors[1:10])
 [1] "red"  "red"  "red"  "red"  "red"  "blue" "red"  "red"  "red"  "red" 
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score", 
     col=colors, pch=20)
b_hat <- coef(fit_3)
abline(b_hat[1] + b_hat[2], b_hat[3], col="red")
abline(b_hat[1], b_hat[3], col="blue")

6.1.2.2 Model with one categorical predictor and an interaction

In this case, the equation is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq} +\hat{\beta}_4*\text{mom\_hs}*\text{mom\_iq}. \]

When \(\text{mom\_hs}=1\), the equation is

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + (\hat{\beta}_3+\hat{\beta}_4) * \text{mom\_iq}, \]

and when \(\text{mom\_hs} = 0\), we have the same equation as the no-interaction case:

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]

Again, we can use the intercept and slope in each equation to plot the regression line on the corresponding level of \(\text{mom\_hs}\).

fit_4 <- stan_glm(kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq, data=kidiq,
                  refresh=0)
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score", 
     col=colors, pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2], b_hat[3] + b_hat[4], col="red")
abline(b_hat[1], b_hat[3], col="blue")

We can see that, unlike the no-interaction case, the lines are not parallel to each other.

6.1.2.3 Plotting the regression with uncertainty

We can use the posterior simulations to represent the uncertainty in the estimated regression coefficients. As an example, we plot the interaction model above on the subset of data with \(\text{mom\_hs} = 1\), along with the regression lines from 20 simulations drawn from the stan_glm fit.

sims <- as.matrix(fit_4)
n_sims <- nrow(sims)
mom_iq_1 <- kidiq$mom_iq[kidiq$mom_hs == 1]
kid_score_1 <- kidiq$kid_score[kidiq$mom_hs == 1]
plot(mom_iq_1, kid_score_1, 
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
sims_display <- sample(n_sims, 20)  # sample 20 numbers between 1 and n_sims
for (i in sims_display){
  # plot with the coefficients from the i-th simulation
  abline(sims[i,1] + sims[i,2], sims[i,3] + sims[i,4], col="gray")
}
# plot the line from the point estimates of fit_4 (for the mom_hs = 1 group)
abline(coef(fit_4)[1] + coef(fit_4)[2], coef(fit_4)[3] + coef(fit_4)[4], col="black")

6.1.3 Model with multiple predictors

Suppose that the regression equation is

\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\ldots+\hat{\beta}_px_p. \]

For each predictor \(x_k\), we can plot the line of \(y\) vs. \(x_k\) while holding the other predictors fixed at their averages. For example, the equation of \(y\) vs. \(x_1\) is

\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2\bar{x}_2+\ldots+\hat{\beta}_p\bar{x}_p. \]

In the example above, we can plot the line of kid_score vs. mom_iq from the interaction model fit_4 by fixing mom_hs at its average. The equation becomes:

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2* \overline{\text{mom\_hs}}) + (\hat{\beta}_3 +\hat{\beta}_4*\overline{\text{mom\_hs}})* \text{mom\_iq} . \]

mom_hs_mean = mean(kidiq$mom_hs)

plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2] * mom_hs_mean, 
       b_hat[3] + b_hat[4] * mom_hs_mean)

6.2 Plotting the outcome against the prediction

Another way to check the model’s fit is to plot the outcome \(y\) against the linear prediction \(\hat{y} = \hat{\beta}_0+\hat{\beta}_1 x_1+\ldots+\hat{\beta}_px_p\). Let’s apply this technique to the KidIQ data above.

kid_score_pred_4 = predict(fit_4)

plot(kid_score_pred_4, kidiq$kid_score,
     xlab="Prediction", ylab="Outcome",
     asp=1)  # set the aspect ratio between x- and y-axis to 1
abline(0, 1)  # plot line y = x

6.3 Residual plots

One way to evaluate the model’s fit and independence of errors is to plot the residuals:

\[ r_i = y_i- \hat{y}_i, \]

against the predictions \(\hat{y}_i\). Let’s see a residual plot for the simple regression of kid_score vs. mom_iq.

Predictions = predict(fit_2)

Residuals = kidiq$kid_score - Predictions

plot(Predictions, Residuals)
abline(0, 0)
abline(sigma(fit_2), 0, lty="dashed")  # +1 standard deviation
abline(-sigma(fit_2), 0, lty="dashed")  # -1 standard deviation

The residuals are relatively small compared to the outcomes, most of which lie between 40 and 140, and they look sufficiently random; this is consistent with the assumption that the model’s errors are independent with zero mean.
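
As a quick numerical check (not in the original code), we can confirm that the residuals average to roughly zero and that their spread is close to the estimated residual standard deviation:

# quick numerical summary of the residuals from fit_2
mean(Residuals)   # should be close to 0
sd(Residuals)     # should be close to sigma(fit_2)
sigma(fit_2)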

6.4 Comparing simulated data to real data

We can also simulate outcomes from the posterior predictive distribution and compare their distribution (that is, their histogram) to that of the original data. As an example, we take a look at the Newcomb dataset from Stigler (1977), which contains measurements from an experiment to estimate the speed of light. The outcome \(y\) is the time required for light to travel a distance of 7442 meters, recorded as deviations from 24800 nanoseconds.

newcomb = read.csv("data/newcomb.txt")

head(newcomb)
    y
1  28
2  26
3  33
4  24
5  34
6 -44

We fit a simple normal model to these data,

\[ y = \beta_0 + \varepsilon, \quad \varepsilon\sim\mathcal{N}(0,\sigma^2), \]

which is equivalent to \(y\sim\mathcal{N}(\beta_0,\sigma^2)\).

fit <- stan_glm(y ~ 1, data=newcomb,
                refresh=0)
y_sims <- posterior_predict(fit)
n_sample <- nrow(y_sims)  # number of posterior draws (4000 by default)

Here, y_sims is a matrix with 4000 rows and 66 columns: each row is one simulated replication of the 66 observations in the original dataset. To compare with the data, we randomly select 20 of these replicated datasets and plot their histograms.
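
As a quick sanity check (not in the original code), we can confirm the dimensions of y_sims:

dim(y_sims)  # expect 4000 rows (posterior draws) and 66 columns (observations)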

par(mfrow=c(5, 4))
par(mar = c(1, 1, 1, 1))
for (s in sample(n_sample, 20)) {
  hist(y_sims[s,], main=NULL, ylab=NULL)
}

Now let us compare these histograms with the original data.

hist(newcomb$y, main=NULL, xlab="y", breaks=20)

The simulated histograms are noticeably different from the histogram of the original data: the real measurements include extreme low values (such as the -44 visible above) that the simulated datasets do not reproduce. This tells us that the normal model is not suitable for these data. Alternatively, one might use an asymmetric contaminated normal distribution or a symmetric long-tailed distribution.

6.5 Explained variance \(R^2\)

The coefficient of determination (\(R^2\)) is calculated as follows:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{TSS}-\text{RSS}}{\text{TSS}}, \]

where \(\text{RSS}\) is the residual sum of squares and \(\text{TSS}\) is the total sum of squares.

  • \(\text{TSS}\) measures the variance of the outcome \(y\).

  • \(\text{RSS}\) measures the amount of variance unexplained by the regression.

  • Therefore, \(\text{TSS}-\text{RSS}\) measures the amount of variance explained by the regression.

  • Consequently, \(R^2\) measures the proportion of variance in \(y\) that is explained by the regression.
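
To make the formula concrete, here is a small computation (not in the original code) of the classical \(R^2\) for fit_3, using its point predictions:

# classical R^2 for fit_3, computed directly from the definition
y <- kidiq$kid_score
y_hat <- predict(fit_3)            # point predictions
RSS <- sum((y - y_hat)^2)          # residual sum of squares
TSS <- sum((y - mean(y))^2)        # total sum of squares
1 - RSS / TSS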

To understand \(R^2\) further, we consider two special cases of the linear regression with one predictor: \(\hat{y} = \hat{a}+\hat{b} x\).

  • If the fitted line is the same as the horizontal line, that is, \(\hat{y}_i = \bar{y}\) for all \(i\), then \(\text{RSS}=\text{TSS}\). So when the fitted model does not explain any of the variance in \(y\), we have \(R^2=0\).

  • Suppose that the fitted line passes through all the points perfectly, so the residuals are all zeros. In this case, \(\text{RSS}=0\) and so \(R^2=1\). Thus, \(R^2=1\) indicates that the regression line explains all of the variance in \(y\).

6.5.1 Bayesian \(R^2\)

There is one problem with using \(R^2\) for Bayesian regression: \(R^2\) is only guaranteed to be non-negative when the \(\hat{y}_i\) are predictions from the model with the OLS coefficients. In general, a regression model can have a negative \(R^2\); at an extreme, think of a line that lies so far from the data that its residual sum of squares exceeds the total sum of squares. Since the simulated coefficients from a Bayesian regression are not necessarily the OLS estimates, the \(R^2\) of the corresponding models might be negative. To address this, Gelman et al. (2019) proposed an alternative definition of \(R^2\) for Bayesian regression:

\[ R^2_{\text{Bayes}} = \frac{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2}{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2 + \sigma^2} = \frac{\text{Explained variance}}{\text{Explained variance}+\text{Residual variance}}, \]

where \(\bar{\hat{y}}\) is the mean of the \(\hat{y}_i\)’s. Thus, \(R^2_{\text{Bayes}}\) is always between \(0\) and \(1\). Again, \(R^2_{\text{Bayes}}=0\) if the fitted model is a horizontal line, and \(R^2_{\text{Bayes}}\) is close to \(1\) when the explained variance dominates the residual variance.
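
To see how the formula maps onto the posterior simulations, here is a hand computation of \(R^2_{\text{Bayes}}\) for fit_3 (an illustrative sketch; it assumes the first three columns of the simulation matrix are the regression coefficients, which is rstanarm’s usual ordering, and in practice one would simply call bayes_R2 as shown below):

# hand computation of Bayesian R^2 for fit_3, one value per posterior draw
sims_3 <- as.matrix(fit_3)                   # columns: (Intercept), mom_hs, mom_iq, sigma
X <- cbind(1, kidiq$mom_hs, kidiq$mom_iq)    # design matrix
y_pred <- sims_3[, 1:3] %*% t(X)             # linear predictor: one row per draw, one column per observation
var_fit <- apply(y_pred, 1, var)             # explained variance for each draw
var_res <- sims_3[, "sigma"]^2               # residual variance for each draw
R2_bayes <- var_fit / (var_fit + var_res)
median(R2_bayes)                             # should be close to the bayes_R2 result below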

In practice, we compute \(R^2_{\text{Bayes}}\) for each simulation of the coefficients and \(\sigma\) to obtain the posterior distribution of \(R^2_{\text{Bayes}}\). This can be done in the rstanarm library by calling bayes_R2 on a fitted model; a point estimate can then be obtained as the median of the simulated values. As an example, let us take the KidIQ models with and without an interaction term from above.

R2_sims = bayes_R2(fit_3)
print(R2_sims[1:10])
 [1] 0.2389541 0.2227313 0.2685082 0.1419911 0.2051568 0.2286202 0.2148354
 [8] 0.2427741 0.1771237 0.1796563
cat("A point estimate of Bayesian R2 is: ", median(R2_sims))
A point estimate of Bayesian R2 is:  0.2134595
R2_sims = bayes_R2(fit_4)
print(R2_sims[1:10])
 [1] 0.2479147 0.2588455 0.2706708 0.1677184 0.2207979 0.2373506 0.2485137
 [8] 0.2423056 0.2449781 0.2004792
cat("A point estimate of Bayesian R2 is: ", median(R2_sims))
A point estimate of Bayesian R2 is:  0.2302812

The results show that \(R^2_{\text{Bayes}}\) of the model with an interaction term is larger than that without an interaction term.

6.6 Cross validation

A model should be evaluated on how well it makes predictions on new data. However, sometimes we would like to evaluate and compare models without waiting for new data. One can instead hold out a subset of the existing data, fit the model on the remaining data, and then evaluate it on the held-out data: this is called cross validation.
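
To make the idea concrete, here is a minimal hold-out sketch (not part of the original example; the split fraction and seed are arbitrary choices): we set aside a random 20% of the KidIQ data, fit the two-predictor model on the rest, and score the held-out predictions by their root mean squared error.

# minimal hold-out sketch: fit on 80% of the KidIQ data, evaluate on the other 20%
set.seed(1)
n <- nrow(kidiq)
test_idx <- sample(n, round(0.2 * n))
fit_train <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq[-test_idx, ], refresh=0)
pred_test <- predict(fit_train, newdata=kidiq[test_idx, ])
rmse_holdout <- sqrt(mean((kidiq$kid_score[test_idx] - pred_test)^2))
print(rmse_holdout)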

6.6.1 Leave-one-out cross validation

In leave-one-out cross validation (LOO), the model is fitted on all but a single data point, and then it is evaluated on that data point. We suggest two ways of evaluating the model:

6.6.1.1 1. Log score and deviance

Suppose that \(\beta_i\) and \(\sigma_i\) are the parameters fitted on the data consisting of all but the \(i\)-th data point. The likelihood of the \(i\)-th data point \((X_i,y_i)\) is

\[ p(y_i \mid \beta_i,\sigma_i) = \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left(-\frac{1}{2\sigma_i^2}(y_i-X_i\beta_i)^2\right). \]

A larger likelihood suggests a better fit of the model to this data point. The log score is the log of this likelihood, dropping the constant term \(-\frac{1}{2}\log(2\pi)\):

\[ -\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]

We perform LOO for all data points. The model’s performance is measured by taking the sum of the log scores. This is called the expected log predictive density (elpd).

\[ \text{elpd} = \sum_{i=1}^n-\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]

To compare the performance between two models, we compare their elpd.
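
For instance, here is what the exact computation looks like for a single held-out point, using the plug-in point estimates from the refit (a sketch only, with the two-predictor model as an example; this is the quantity that loo approximates for every \(i\)):

# exact leave-one-out log score for the first data point, using plug-in estimates
i <- 1
fit_minus_i <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq[-i, ], refresh=0)
b <- coef(fit_minus_i)
s <- sigma(fit_minus_i)
y_hat_i <- b[1] + b[2] * kidiq$mom_hs[i] + b[3] * kidiq$mom_iq[i]
log_score_i <- -log(s) - (kidiq$kid_score[i] - y_hat_i)^2 / (2 * s^2)
print(log_score_i)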

The exact computation of elpd is quite slow, since it requires fitting the model \(n\) times, once per held-out point. The loo function in R implements an approximation of elpd that is much faster to compute.

As an example, we compute the elpd of the model that uses mom_hs as the only predictor:

\[ \text{kid\_score} = \text{mom\_hs}. \]

fit_1 <- stan_glm(kid_score ~ mom_hs, data=kidiq,
                 refresh=0)

loo_1 <- loo(fit_1)
print(loo_1)

Computed from 4000 by 434 log-likelihood matrix

         Estimate   SE
elpd_loo  -1914.8 13.8
p_loo         3.0  0.3
looic      3829.6 27.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The output can be interpreted as follows:

  • elpd_loo is the estimated elpd along with a standard error representing uncertainty due to using only 434 data points.

  • p_loo is the estimated “effective number of parameters” in the model, which is essentially the number of parameters that accounts for information in the prior and the data. The above model has 3 parameters, so it makes sense that p_loo is close to 3 here.

  • looic is the LOO information criterion, which is \(-2\times\)elpd_loo.
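
As a side note, assuming the loo package’s usual object structure, these quantities are stored in the returned object’s estimates table, so they can be extracted programmatically; for example, the relation looic \(=-2\times\) elpd_loo can be checked directly:

# the estimates table of the loo object (rows: elpd_loo, p_loo, looic)
loo_1$estimates
-2 * loo_1$estimates["elpd_loo", "Estimate"]  # equals the looic estimate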

Let us compare fit_1 with fit_3: \(\text{kid\_score} = \text{mom\_iq} + \text{mom\_hs}\).

loo_3 <- loo(fit_3)
print(loo_3)

Computed from 4000 by 434 log-likelihood matrix

         Estimate   SE
elpd_loo  -1876.1 14.2
p_loo         4.1  0.4
looic      3752.2 28.4
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

We can see that fit_3’s elpd of -1876.1 is higher than fit_1’s elpd of -1914.8. This suggests that fit_3’s posterior predictive distribution matches the data better than that of fit_1.

To also obtain the standard error of the difference in elpd between fit_3 and fit_1, we can use the loo_compare function.

loo_compare(loo_3, loo_1)
      elpd_diff se_diff
fit_3   0.0       0.0  
fit_1 -38.7       8.3  

The output tells us that fit_1’s elpd is 38.7 smaller than that of fit_3, with a standard error of 8.3 for the difference. As a rule of thumb, if elpd_diff is greater than 4, the number of observations is greater than 100, and the model is not badly misspecified, then se_diff is a reliable measure of uncertainty in the difference between elpd’s.

Let us compare the elpd between fit_3, the model with two predictors and no interaction term, and fit_4, the model with two predictors and an interaction term.

loo_4 <- loo(fit_4)

loo_compare(loo_3, loo_4)
      elpd_diff se_diff
fit_4  0.0       0.0   
fit_3 -3.4       2.7   

The difference in elpd (3.4) is less than 4 and not much larger than its standard error, so there is no clear improvement from adding the interaction term.