library(rstanarm)
6 Model diagnostics and evaluation
We will discuss several graphical and quantitative ways to check a model’s fit to the data, and later we will cover cross validation, a technique for comparing different models.
6.1 Plotting the data and the fitted model
6.1.1 Model with one predictor
In the one-predictor case, we can visualize the regression’s fit by simply plotting the data and the regression line. Here, we plot the KidIQ
data and the fitted line.
kidiq = read.csv("data/kidiq.csv")
fit_2 <- stan_glm(kid_score ~ mom_iq, data=kidiq, refresh=0)
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score")
abline(coef(fit_2)[1], coef(fit_2)[2])
6.1.2 Model with two predictors
6.1.2.1 Model with one categorical predictor and no interaction
We can plot each group and the fitted model on that group. Using the KidIQ
data as an example, we recall that the non-interactive regression equation is
\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq}. \]
Consequently, the equation of \(\text{kid\_score}\) on \(\text{mom\_iq}\) for the group \(\text{mom\_hs} = 1\) is
\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + \hat{\beta}_3 * \text{mom\_iq}, \]
and the equation for the group \(\text{mom\_hs} = 0\) is
\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]
We thus obtain the intercept and the slope of each equation, which we use to plot the regression line of each \(\text{mom\_hs}\) group below.
fit_3 <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq,
                  refresh=0)
# create a vector with value = "red" if mom_hs==1
#                              "blue" if mom_hs==0
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")
print(colors[1:10])
[1] "red" "red" "red" "red" "red" "blue" "red" "red" "red" "red"
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     col=colors, pch=20)
b_hat <- coef(fit_3)
abline(b_hat[1] + b_hat[2], b_hat[3], col="red")
abline(b_hat[1], b_hat[3], col="blue")
6.1.2.2 Model with one categorical variable and an interaction
In this case, the equation is
\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq} +\hat{\beta}_4*\text{mom\_hs}*\text{mom\_iq}. \]
When \(\text{mom\_hs}=1\), the equation is
\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + (\hat{\beta}_3+\hat{\beta}_4) * \text{mom\_iq}, \]
and when \(\text{mom\_hs} = 0\), we have the same equation as the no-interaction case:
\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]
Again, we can use the intercept and slope in each equation to plot the regression line on the corresponding level of \(\text{mom\_hs}\).
fit_4 <- stan_glm(kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq, data=kidiq,
                  refresh=0)
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     col=colors, pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2], b_hat[3] + b_hat[4], col="red")
abline(b_hat[1], b_hat[3], col="blue")
We can see that, unlike the no-interaction case, the lines are not parallel to each other.
6.1.2.3 Plotting the regression with uncertainty
We can use the posterior simulations to represent the uncertainty in the estimated regression coefficients. As an example, we plot the interactive model above on the subset of data with \(\text{mom\_hs} = 1\) along with 20 simulations drawn from the stan_glm
fit.
sims <- as.matrix(fit_4)
n_sims <- nrow(sims)
mom_iq_1 <- kidiq$mom_iq[kidiq$mom_hs == 1]
kid_score_1 <- kidiq$kid_score[kidiq$mom_hs == 1]
plot(mom_iq_1, kid_score_1,
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
sims_display <- sample(n_sims, 20)  # sample 20 numbers between 1 and n_sims
for (i in sims_display) {
  # plot with the coefficients from the i-th simulation
  abline(sims[i,1] + sims[i,2], sims[i,3] + sims[i,4], col="gray")
}
# plot the line with the estimated coefficients of the interactive model
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2], b_hat[3] + b_hat[4], col="black")
6.1.3 Model with multiple predictors
Suppose that the regression equation is
\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\ldots+\hat{\beta}_px_p. \]
For each predictor \(x_k\), we can plot the line of \(y\) vs. \(x_k\) while holding the other predictors fixed at their averages. For example, the equation of \(y\) vs. \(x_1\) is
\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2\bar{x}_2+\ldots+\hat{\beta}_p\bar{x}_p. \]
In the example above, we can plot the line of kid_score
vs mom_iq
by fixing mom_hs
at its average. The equation becomes:
\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2* \overline{\text{mom\_hs}}) + (\hat{\beta}_3 +\hat{\beta}_4*\overline{\text{mom\_hs}})* \text{mom\_iq} . \]
mom_hs_mean = mean(kidiq$mom_hs)
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2] * mom_hs_mean,
       b_hat[3] + b_hat[4] * mom_hs_mean)
6.2 Plotting the outcome against the prediction
Another way to check the model’s fit is to plot the outcome \(y\) against the linear prediction \(\hat{y} = \hat{\beta}_0+\hat{\beta}_1 x_1+\ldots+\hat{\beta}_px_p\). Let’s apply this technique to the KidIQ
data above.
kid_score_pred_4 = predict(fit_4)
plot(kid_score_pred_4, kidiq$kid_score,
xlab="Prediction", ylab="Outcome",
asp=1) # set the aspect ratio between x- and y-axis to 1
abline(0, 1) # plot line y = x
6.3 Residual plots
One way to evaluate the model’s fit and independence of errors is to plot the residuals:
\[ r_i = y_i- \hat{y}_i, \]
against the predictions \(\hat{y}_i\). Let’s look at a residual plot for the simple regression of kid_score vs. mom_iq.
Predictions = predict(fit_2)
Residuals = kidiq$kid_score - Predictions
plot(Predictions, Residuals)
abline(0, 0)
abline(sigma(fit_2), 0, lty="dashed") # +1 standard deviation
abline(-sigma(fit_2), 0, lty="dashed") # -1 standard deviation
The residuals are small relative to the outcomes, most of which lie between 40 and 140, and they look sufficiently random; this is consistent with the model’s errors being independent with zero mean.
6.4 Comparing simulated data to real data
We can also simulate outcomes from the posterior predictive distribution and compare their distribution (that is, their histogram) to that of the original data. As an example, we look at the Newcomb
dataset from Stigler (1997), which contains data from an experiment to estimate the speed of light. Here, the outcome \(y\) represents the amount of time required for light to travel a distance of 7442 meters, recorded as deviations from 24800 nanoseconds.
newcomb = read.csv("data/newcomb.txt")
head(newcomb)
y
1 28
2 26
3 33
4 24
5 34
6 -44
We fit a simple normal model to this data:
\[ y = \beta_0 + \varepsilon, \quad \varepsilon\sim\mathcal{N}(0,\sigma^2), \]
which is equivalent to \(y\sim\mathcal{N}(\beta_0,\sigma^2)\).
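As a quick sanity check of this equivalence: for an intercept-only regression, the least-squares intercept is exactly the sample mean of \(y\). The sketch below uses lm on toy data (not the Newcomb measurements) purely for illustration:

```r
# Toy data for illustration only
set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)

# Intercept-only regression: the fitted intercept equals the sample mean
fit0 <- lm(y ~ 1)
intercept <- unname(coef(fit0)[1])
c(intercept, mean(y))  # the two values agree
```

The Bayesian fit with stan_glm’s default weak priors gives essentially the same answer, up to simulation noise.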
fit <- stan_glm(y ~ 1, data=newcomb,
                refresh=0)
y_sims <- posterior_predict(fit)
n_sample <- nrow(y_sims)
Here, y_sims
is a 4000 × 66 matrix: each of its 4000 rows is a simulated replication of the original dataset of 66 observations. Below, we plot the histograms of 20 randomly chosen replications.
par(mfrow=c(5, 4))
par(mar = c(1, 1, 1, 1))
for (s in sample(n_sample, 20)) {
hist(y_sims[s,], main=NULL, ylab=NULL)
}
Now let us compare these histograms with the original data.
hist(newcomb$y, main=NULL, xlab="y", breaks=20)
The simulated histograms are noticeably different from that of the original data, and we can see that the normal model is not suitable for the data. Alternatively, one might use an asymmetric contaminated normal distribution or a symmetric long-tailed distribution.
6.5 Explained variance \(R^2\)
The coefficient of determination (\(R^2\)) is calculated as follows:
\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{TSS}-\text{RSS}}{\text{TSS}}, \]
where \(\text{RSS}\) is the residual sum of squares and \(\text{TSS}\) is the total sum of squares.
\(\text{TSS}\) measures the variance of the outcome \(y\).
\(\text{RSS}\) measures the amount of variance unexplained by the regression.
Therefore, \(\text{TSS}-\text{RSS}\) measures the amount of variance explained by the regression.
Consequently, \(R^2\) measures the proportion of variance in \(y\) that is explained by the regression.
To understand \(R^2\) further, we consider two special cases of the linear regression with one predictor: \(\hat{y} = \hat{a}+\hat{b} x\).
If the fitted line is the same as the horizontal line, that is, \(\hat{y}_i = \bar{y}\) for all \(i\), then \(\text{RSS}=\text{TSS}\). So when the fitted model does not explain any of the variance in \(y\), we have \(R^2=0\).
Suppose that the fitted line passes through all the points perfectly, so the residuals are all zeros. In this case, \(\text{RSS}=0\) and so \(R^2=1\). Thus, \(R^2=1\) indicates that the regression line explains all of the variance in \(y\).
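The definition can be checked directly. The sketch below simulates toy data, computes \(R^2\) from RSS and TSS by hand, and compares it with the value reported by lm (all names here are illustrative):

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)
rss <- sum((y - fitted(fit))^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
R2  <- 1 - rss / tss

c(R2, summary(fit)$r.squared)  # both give the same value
```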
6.5.1 Bayesian \(R^2\)
There is one problem with using \(R^2\) for Bayesian regression: \(R^2\) is only guaranteed to be non-negative when the \(\hat{y}_i\) are predictions from the model with OLS coefficients. In general, a regression model can have a negative \(R^2\); at an extreme, think of a linear model that is very far away from the data. Since the simulated coefficients from the Bayesian regression are not necessarily the OLS estimates, the \(R^2\) of the corresponding models might be negative. To remedy this, Gelman et al. (2019) proposed an alternative definition of \(R^2\) for Bayesian regression:
\[ R^2_{\text{Bayes}} = \frac{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2}{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2 + \sigma^2} = \frac{\text{Explained variance}}{\text{Explained variance}+\text{Residual variance}}, \]
where \(\bar{\hat{y}}\) is the mean of the \(\hat{y}_i\)’s. Thus, \(R^2_{\text{Bayes}}\) is always between \(0\) and \(1\). Again, \(R^2_{\text{Bayes}}=0\) if the fitted model is a horizontal line, and \(R^2_{\text{Bayes}}\) is close to \(1\) when the explained variance dominates the residual variance.
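In sketch form, the formula is applied draw by draw: for each posterior draw of the coefficients and \(\sigma\), compute the variance of the linear predictor and divide it by itself plus \(\sigma^2\). The toy draws below merely stand in for real stan_glm simulations (all names are illustrative):

```r
set.seed(7)
x <- rnorm(100)
n_draws <- 1000

# Stand-ins for posterior draws of (intercept, slope, sigma)
a     <- rnorm(n_draws, 1, 0.1)
b     <- rnorm(n_draws, 2, 0.1)
sigma <- abs(rnorm(n_draws, 1, 0.1))

R2_bayes <- sapply(1:n_draws, function(s) {
  y_hat <- a[s] + b[s] * x                # linear predictor for draw s
  var(y_hat) / (var(y_hat) + sigma[s]^2)  # explained / (explained + residual)
})

# By construction, every draw's R2_bayes lies strictly between 0 and 1
median(R2_bayes)
```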
In practice, we compute \(R^2_{\text{Bayes}}\) for each simulation of the coefficients and \(\sigma\) to obtain the posterior distribution of \(R^2_{\text{Bayes}}\). This can be done in the rstanarm
library by calling bayes_R2
on a fitted model. To obtain a point estimate, one can compute the median of the simulated values. As an example, let us take the KidIQ
models with and without an interaction term from above.
R2_sims = bayes_R2(fit_3)
print(R2_sims[1:10])
[1] 0.2389541 0.2227313 0.2685082 0.1419911 0.2051568 0.2286202 0.2148354
[8] 0.2427741 0.1771237 0.1796563
cat("A point estimate of Bayesian R2 is: ", median(R2_sims))
A point estimate of Bayesian R2 is: 0.2134595
R2_sims = bayes_R2(fit_4)
print(R2_sims[1:10])
[1] 0.2479147 0.2588455 0.2706708 0.1677184 0.2207979 0.2373506 0.2485137
[8] 0.2423056 0.2449781 0.2004792
cat("A point estimate of Bayesian R2 is: ", median(R2_sims))
A point estimate of Bayesian R2 is: 0.2302812
The results show that \(R^2_{\text{Bayes}}\) of the model with an interaction term is larger than that without an interaction term.
6.6 Cross validation
A model should be evaluated on how well it makes predictions on new data. However, sometimes we would like to evaluate and compare models without waiting for new data. Instead, one can hold out a subset of the existing data, train the model on the remaining data, and then evaluate it on the held-out data; this is called cross validation.
6.6.1 Leave-one-out cross validation
In leave-one-out cross validation (LOO), the model is fitted on all but a single data point, and then it is evaluated on that data point. We suggest two ways of evaluating the model:
6.6.1.1 1. Log score and deviance
Suppose that \(\beta_i\) and \(\sigma_i\) are the parameters fitted on the data consisting of all but the \(i\)-th data point. The likelihood of the \(i\)-th data point \((X_i,y_i)\) is
\[ p(y_i \mid \beta_i,\sigma_i) = \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left(-\frac{1}{2\sigma_i^2}(y_i-X_i\beta_i)^2\right). \]
A larger likelihood suggests a better fit of the model to this data point. The log score is the log of the likelihood without the constant term:
\[ -\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]
We perform LOO for all data points. The model’s performance is measured by taking the sum of the log scores. This is called the expected log predictive density (elpd).
\[ \text{elpd} = \sum_{i=1}^n-\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]
To compare the performance between two models, we compare their elpd.
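Exact LOO can be sketched by refitting the model \(n\) times, once per held-out point. The toy example below uses lm in place of stan_glm for speed, so the parameters fitted without point \(i\) are least-squares estimates (all names are illustrative):

```r
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

log_scores <- sapply(1:n, function(i) {
  fit_i   <- lm(y[-i] ~ x[-i])  # fit without the i-th data point
  y_hat_i <- coef(fit_i)[1] + coef(fit_i)[2] * x[i]
  sigma_i <- summary(fit_i)$sigma
  # log score of the held-out point (constant term dropped)
  -log(sigma_i) - (y[i] - y_hat_i)^2 / (2 * sigma_i^2)
})
elpd <- sum(log_scores)
```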
The exact computation of elpd is quite slow, since it requires fitting the model \(n\) times. The loo
function in R implements an approximation of elpd that is much faster to compute.
As an example, we compute the elpd of the following model:
\[ \text{kid\_score} = \text{mom\_hs}. \]
fit_1 <- stan_glm(kid_score ~ mom_hs, data=kidiq,
                  refresh=0)
loo_1 <- loo(fit_1)
print(loo_1)
Computed from 4000 by 434 log-likelihood matrix
Estimate SE
elpd_loo -1914.8 13.8
p_loo 3.0 0.3
looic 3829.6 27.6
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
The output can be interpreted as follows:
elpd_loo
is the estimated elpd, along with a standard error representing uncertainty due to using only 434 data points.
p_loo
is the estimated “effective number of parameters” in the model, which is essentially the number of parameters that accounts for the information in the prior and the data. The above model has 3 parameters, so it makes sense that p_loo
is close to 3 here.
looic
is the LOO information criterion, which is \(-2\times\) elpd_loo
.
Let us compare fit_1
with fit_3
: \(\text{kid\_score} = \text{mom\_iq} + \text{mom\_hs}\).
loo_3 <- loo(fit_3)
print(loo_3)
Computed from 4000 by 434 log-likelihood matrix
Estimate SE
elpd_loo -1876.1 14.2
p_loo 4.1 0.4
looic 3752.2 28.4
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
We can see that fit_3
’s elpd of -1876.1 is higher than fit_1
’s elpd of -1914.8. This suggests that fit_3
’s posterior predictive distribution matches the data better than that of fit_1
.
To also obtain the standard error of the difference in elpd between fit_3
and fit_1
, we can use the loo_compare
function.
loo_compare(loo_3, loo_1)
elpd_diff se_diff
fit_3 0.0 0.0
fit_1 -38.7 8.3
The output tells us that fit_1
’s elpd is 38.7 smaller than that of fit_3
, with a standard error of 8.3 for the difference. As a rule of thumb, if elpd_diff
is greater than 4, the number of observations is greater than 100, and the model is not badly misspecified, then se_diff
is a reliable measure of the uncertainty in the difference between elpd
’s.
Let us compare the elpd
between fit_3
, the model with two predictors and no interaction term, and fit_4
, the model with two predictors and an interaction term.
loo_4 <- loo(fit_4)
loo_compare(loo_3, loo_4)
elpd_diff se_diff
fit_4 0.0 0.0
fit_3 -3.4 2.7
The difference is less than 4, so there is no clear improvement from adding the interaction term.