`library(rstanarm)`

# 6 Model diagnostics and evaluation

We will talk about several graphical and quantitative ways to check our model’s fit to the data, and later we will talk about cross validation, a technique for comparing different models.

## 6.1 Plotting the data and the fitted model

### 6.1.1 Model with one predictor

In one-predictor cases, we can visualize the regression’s fit by simply plotting the data and the regression line. Here, we plot the `KidIQ` data and the fitted line.

```
kidiq <- read.csv("data/kidiq.csv")
fit_2 <- stan_glm(kid_score ~ mom_iq, data=kidiq, refresh=0)
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score")
abline(coef(fit_2)[1], coef(fit_2)[2])
```

### 6.1.2 Model with two predictors

#### 6.1.2.1 Model with one categorical predictor and no interaction

We can plot each group and the fitted model on that group. Using the `KidIQ` data as an example, we recall that the regression equation without an interaction is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq}. \]

Consequently, the equation of \(\text{kid\_score}\) on \(\text{mom\_iq}\) for the group \(\text{mom\_hs} = 1\) is

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + \hat{\beta}_3 * \text{mom\_iq}, \]

and the equation for the group \(\text{mom\_hs} = 0\) is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]

We thus obtain the intercept and the slope of each equation, which we use to plot the regression line of each \(\text{mom\_hs}\) group below.

```
fit_3 <- stan_glm(kid_score ~ mom_hs + mom_iq, data=kidiq,
                  refresh=0)
# create a vector with value "red" if mom_hs==1
# and "blue" if mom_hs==0
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")
print(colors[1:10])
```

` [1] "red" "red" "red" "red" "red" "blue" "red" "red" "red" "red" `

```
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     col=colors, pch=20)
b_hat <- coef(fit_3)
abline(b_hat[1] + b_hat[2], b_hat[3], col="red")
abline(b_hat[1], b_hat[3], col="blue")
```

#### 6.1.2.2 Model with one categorical variable and an interaction

In this case, the equation is

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_2* \text{mom\_hs} + \hat{\beta}_3 * \text{mom\_iq} +\hat{\beta}_4*\text{mom\_hs}*\text{mom\_iq}. \]

When \(\text{mom\_hs}=1\), the equation is

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2) + (\hat{\beta}_3+\hat{\beta}_4) * \text{mom\_iq}, \]

and when \(\text{mom\_hs} = 0\), we have the same equation as the no-interaction case:

\[ \text{kid\_score} = \hat{\beta}_1 + \hat{\beta}_3 * \text{mom\_iq}. \]

Again, we can use the intercept and slope in each equation to plot the regression line on the corresponding level of \(\text{mom\_hs}\).

```
fit_4 <- stan_glm(kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq, data=kidiq,
                  refresh=0)
colors <- ifelse(kidiq$mom_hs==1, "red", "blue")
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     col=colors, pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2], b_hat[3] + b_hat[4], col="red")
abline(b_hat[1], b_hat[3], col="blue")
```

We can see that, unlike in the no-interaction case, the two lines are not parallel.

#### 6.1.2.3 Plotting the regression with uncertainty

We can use the posterior simulations to represent the uncertainty in the estimated regression coefficients. As an example, we plot the interaction model above on the subset of data with \(\text{mom\_hs} = 1\), along with the lines from 20 simulations drawn from the `stan_glm` fit.

```
sims <- as.matrix(fit_4)
n_sims <- nrow(sims)
mom_iq_1 <- kidiq$mom_iq[kidiq$mom_hs == 1]
kid_score_1 <- kidiq$kid_score[kidiq$mom_hs == 1]
plot(mom_iq_1, kid_score_1,
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
sims_display <- sample(n_sims, 20)  # sample 20 numbers between 1 and n_sims
for (i in sims_display) {
  # plot the line with the coefficients from the i-th simulation
  abline(sims[i,1] + sims[i,2], sims[i,3] + sims[i,4], col="gray")
}
# plot the line with the estimated coefficients of fit_4
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2], b_hat[3] + b_hat[4], col="black")
```

### 6.1.3 Model with multiple predictors

Suppose that the regression equation is

\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\ldots+\hat{\beta}_px_p. \]

For each predictor \(x_k\), we can plot the line of \(y\) vs. \(x_k\) while holding the other predictors fixed at their averages. For example, the equation of \(y\) vs. \(x_1\) is

\[ \hat{y} = \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2\bar{x}_2+\ldots+\hat{\beta}_p\bar{x}_p. \]

In the example above, we can plot the line of `kid_score` vs. `mom_iq` by fixing `mom_hs` at its average. The equation becomes:

\[ \text{kid\_score} = (\hat{\beta}_1 + \hat{\beta}_2* \overline{\text{mom\_hs}}) + (\hat{\beta}_3 +\hat{\beta}_4*\overline{\text{mom\_hs}})* \text{mom\_iq} . \]

```
mom_hs_mean <- mean(kidiq$mom_hs)
plot(kidiq$mom_iq, kidiq$kid_score,
     xlab="Mother IQ score", ylab="Child test score",
     pch=20)
b_hat <- coef(fit_4)
abline(b_hat[1] + b_hat[2] * mom_hs_mean,
       b_hat[3] + b_hat[4] * mom_hs_mean)
```

## 6.2 Plotting the outcome against the prediction

Another way to check the model’s fit is to plot the outcome \(y\) against the linear prediction \(\hat{y} = \hat{\beta}_0+\hat{\beta}_1 x_1+\ldots+\hat{\beta}_px_p\). Let’s apply this technique to the `KidIQ` data above.

```
kid_score_pred_4 <- predict(fit_4)
plot(kid_score_pred_4, kidiq$kid_score,
     xlab="Prediction", ylab="Outcome",
     asp=1)  # set the aspect ratio between the x- and y-axes to 1
abline(0, 1)  # plot the line y = x
```

## 6.3 Residual plots

One way to evaluate the model’s fit and independence of errors is to plot the residuals:

\[ r_i = y_i- \hat{y}_i, \]

against the predictions \(\hat{y}_i\). Let’s see a residual plot for the simple regression of `kid_score` vs. `mom_iq`.

```
Predictions <- predict(fit_2)
Residuals <- kidiq$kid_score - Predictions
plot(Predictions, Residuals)
abline(0, 0)
abline(sigma(fit_2), 0, lty="dashed")   # +1 standard deviation
abline(-sigma(fit_2), 0, lty="dashed")  # -1 standard deviation
```

The residuals are relatively small compared to the outcomes, most of which lie between 40 and 140, and they look sufficiently random; this suggests that the model’s errors are independent with mean zero.

## 6.4 Comparing simulated data to real data

We can also simulate outcomes from the posterior predictive distribution and compare their distribution (that is, their histogram) to that of the original data. As an example, we take a look at the `Newcomb` dataset from Stigler (1977), which contains data from an experiment to estimate the speed of light. Here, the outcome \(y\) represents the amount of time required for light to travel a distance of 7442 meters, recorded as a deviation from 24800 nanoseconds.

```
newcomb <- read.csv("data/newcomb.txt")
head(newcomb)
```

```
y
1 28
2 26
3 33
4 24
5 34
6 -44
```

We fit a simple normal model to this data:

\[ y = \beta_0 + \varepsilon, \quad \varepsilon\sim\mathcal{N}(0,\sigma^2), \]

which is equivalent to \(y\sim\mathcal{N}(\beta_0,\sigma^2)\).

```
fit <- stan_glm(y ~ 1, data=newcomb, refresh=0)
y_sims <- posterior_predict(fit)
n_sample <- nrow(y_sims)
```

Here, `y_sims` contains 4000 simulations of \(y\): each of its rows is one simulated replication of the original dataset of 66 observations. To compare with the original data, we randomly select 20 of these simulated datasets and plot their histograms.

```
par(mfrow=c(5, 4))
par(mar = c(1, 1, 1, 1))
for (s in sample(n_sample, 20)) {
hist(y_sims[s,], main=NULL, ylab=NULL)
}
```

Now let us compare these histograms with the original data.

`hist(newcomb$y, main=NULL, xlab="y", breaks=20)`

The simulated histograms are noticeably different from that of the original data: in particular, none of them reproduce the extreme low outliers such as the \(-44\) seen in `head(newcomb)` above. We can see that the normal model is not suitable for the data. Alternatively, one might use an asymmetric contaminated normal distribution or a symmetric long-tailed distribution.

## 6.5 Explained variance \(R^2\)

The *coefficient of determination* (\(R^2\)) is calculated as follows:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{TSS}-\text{RSS}}{\text{TSS}}, \]

where \(\text{RSS}\) is the *residual sum of squares* and \(\text{TSS}\) is the *total sum of squares*.

- \(\text{TSS}\) measures the variance of the outcome \(y\).
- \(\text{RSS}\) measures the amount of variance left unexplained by the regression.
- Therefore, \(\text{TSS}-\text{RSS}\) measures the amount of variance *explained* by the regression. Consequently, \(R^2\) measures the *proportion of variance in \(y\) that is explained by the regression*.

To understand \(R^2\) further, we consider two special cases of the linear regression with one predictor: \(\hat{y} = \hat{a}+\hat{b} x\).

If the fitted line is the horizontal line \(\hat{y} = \bar{y}\), that is, \(\hat{y}_i = \bar{y}\) for all \(i\), then \(\text{RSS}=\text{TSS}\). So when the fitted model does not explain any of the variance in \(y\), we have \(R^2=0\).

Suppose that the fitted line passes through all the points perfectly, so the residuals are all zeros. In this case, \(\text{RSS}=0\) and so \(R^2=1\). Thus, \(R^2=1\) indicates that the regression line explains all of the variance in \(y\).
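
As a quick check of the definition, here is a minimal sketch (not part of the example above) that computes \(R^2\) directly from \(\text{RSS}\) and \(\text{TSS}\) for an OLS fit of the `KidIQ` model via `lm`, so the result can be compared with the built-in `r.squared`:

```
# compute R^2 from its definition for an OLS fit (illustrative sketch)
fit_ols <- lm(kid_score ~ mom_hs + mom_iq, data=kidiq)
rss <- sum(residuals(fit_ols)^2)                         # residual sum of squares
tss <- sum((kidiq$kid_score - mean(kidiq$kid_score))^2)  # total sum of squares
print(1 - rss / tss)               # R^2 from the definition
print(summary(fit_ols)$r.squared)  # lm's own R^2, should agree
```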

### 6.5.1 Bayesian \(R^2\)

There is one problem with using \(R^2\) for Bayesian regression: \(R^2\) is only guaranteed to be non-negative when the \(\hat{y}_i\) are predictions from the model with OLS coefficients. In general, however, a regression model can have a negative \(R^2\); at an extreme, one can think of a linear model that is very far away from the data, so that \(\text{RSS}>\text{TSS}\). Since the simulated coefficients from a Bayesian regression are not necessarily the OLS ones, the \(R^2\) of the corresponding models might be negative. To address this, Gelman et al. (2019) proposed an alternative definition of \(R^2\) for Bayesian regression:

\[ R^2_{\text{Bayes}} = \frac{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2}{\frac{1}{n-1}\sum_i (\hat{y}_i-\bar{\hat{y}})^2 + \sigma^2} = \frac{\text{Explained variance}}{\text{Explained variance}+\text{Residual variance}}, \]

where \(\bar{\hat{y}}\) is the mean of the \(\hat{y}_i\)’s. Thus, \(R^2_{\text{Bayes}}\) is always between \(0\) and \(1\). Again, \(R^2_{\text{Bayes}}=0\) if the fitted model is a horizontal line, and \(R^2_{\text{Bayes}}\) is close to \(1\) when the explained variance dominates the residual variance.
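
To see what this formula does, here is a hedged sketch applying it to a single posterior draw of `fit_3`; the `bayes_R2` function used below performs the same computation for every draw at once, and building the design matrix by hand here is purely illustrative, not `rstanarm`’s internal code:

```
# R^2_Bayes for one posterior draw (illustrative sketch)
sims_3  <- as.matrix(fit_3)                      # draws of (Intercept, mom_hs, mom_iq, sigma)
X       <- cbind(1, kidiq$mom_hs, kidiq$mom_iq)  # design matrix
y_hat   <- as.vector(X %*% sims_3[1, 1:3])       # predictions under the first draw
var_fit <- var(y_hat)                            # explained variance (1/(n-1) scaling)
var_res <- sims_3[1, "sigma"]^2                  # residual variance under the first draw
print(var_fit / (var_fit + var_res))
```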

In practice, we compute \(R^2_{\text{Bayes}}\) for each simulation of the coefficients and \(\sigma\) to obtain the posterior distribution of \(R^2_{\text{Bayes}}\). This can be done in the `rstanarm` library by calling `bayes_R2` on a fitted model. To obtain a point estimate, one can compute the median of the simulated values. For example, let us take the models of `KidIQ` with and without an interaction term from above.

```
R2_sims <- bayes_R2(fit_3)
print(R2_sims[1:10])
```

```
[1] 0.2389541 0.2227313 0.2685082 0.1419911 0.2051568 0.2286202 0.2148354
[8] 0.2427741 0.1771237 0.1796563
```

`cat("A point estimate of Bayesian R2 is: ", median(R2_sims))`

`A point estimate of Bayesian R2 is: 0.2134595`

```
R2_sims <- bayes_R2(fit_4)
print(R2_sims[1:10])
```

```
[1] 0.2479147 0.2588455 0.2706708 0.1677184 0.2207979 0.2373506 0.2485137
[8] 0.2423056 0.2449781 0.2004792
```

`cat("A point estimate of Bayesian R2 is: ", median(R2_sims))`

`A point estimate of Bayesian R2 is: 0.2302812`

The results show that \(R^2_{\text{Bayes}}\) of the model with an interaction term is slightly larger than that of the model without one.
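
Since `bayes_R2` returns the full set of posterior simulations, we can also report an uncertainty interval rather than a single point estimate; for example:

```
# posterior median and 90% interval of R^2_Bayes for the interaction model
quantile(bayes_R2(fit_4), c(0.05, 0.5, 0.95))
```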

## 6.6 Cross validation

A model should be evaluated on how well it makes predictions on new data. However, sometimes we would like to evaluate and compare models without waiting for new data. One can instead hold out a subset of the existing data, train the model on the remaining data, and then evaluate it on the held-out data: this is called *cross validation*.

### 6.6.1 Leave-one-out cross validation

In leave-one-out cross validation (LOO), the model is fitted on all but a single data point, and then it is evaluated on that data point. We suggest two ways of evaluating the model:

#### 6.6.1.1 1. Log score and deviance

Suppose that \(\beta_i\) and \(\sigma_i\) are the parameters fitted on the data consisting of all but the \(i\)-th data point. The likelihood of the \(i\)-th data point \((X_i,y_i)\) is

\[ p(y_i \mid \beta_i,\sigma_i) = \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left(-\frac{1}{2\sigma_i^2}(y_i-X_i\beta_i)^2\right). \]

A larger likelihood suggests a better fit of the model to this data point. The *log score* is the log of the likelihood without the constant term:

\[ -\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]

We perform LOO for all data points. The model’s performance is measured by taking the sum of the log scores. This is called the *expected log predictive density* (elpd).

\[ \text{elpd} = \sum_{i=1}^n-\log\sigma_i-\frac{1}{2\sigma^2_i}(y_i-X_i\beta_i)^2. \]

To compare the performance between two models, we compare their elpd.
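
To make the definition concrete, here is a brute-force sketch of exact LOO. As a simplification, it plugs in OLS point estimates from `lm` in place of the Bayesian fit, so it only illustrates the idea rather than reproducing `stan_glm`’s results:

```
# exact LOO log score for the i-th data point (illustrative sketch:
# lm point estimates stand in for the Bayesian posterior)
loo_log_score <- function(i) {
  fit_i   <- lm(kid_score ~ mom_hs, data=kidiq[-i, ])  # fit without point i
  y_hat_i <- predict(fit_i, newdata=kidiq[i, ])        # predict the held-out point
  sigma_i <- summary(fit_i)$sigma                      # residual sd estimate
  # log predictive density of the held-out point; note that dnorm keeps
  # the constant -log(sqrt(2*pi)) that the text drops
  dnorm(kidiq$kid_score[i], mean=y_hat_i, sd=sigma_i, log=TRUE)
}
elpd_exact <- sum(sapply(1:nrow(kidiq), loo_log_score))
print(elpd_exact)
```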

The exact computation of elpd is quite slow, since it requires fitting the model \(n\) times, as the sketch above does. The `loo` function in R implements an approximation of elpd that is much faster to compute.

As an example, we compute the elpd of the following model:

\[ \text{kid\_score} = \text{mom\_hs}. \]

```
fit_1 <- stan_glm(kid_score ~ mom_hs, data=kidiq,
                  refresh=0)
loo_1 <- loo(fit_1)
print(loo_1)
```

```
Computed from 4000 by 434 log-likelihood matrix

         Estimate   SE
elpd_loo  -1914.8 13.8
p_loo         3.0  0.3
looic      3829.6 27.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```

The output can be interpreted as follows:

- `elpd_loo` is the estimated elpd, along with a standard error representing uncertainty due to using only 434 data points.
- `p_loo` is the estimated “effective number of parameters” in the model, which is essentially the number of parameters that accounts for the information in the prior and the data. The above model has 3 parameters, so it makes sense that `p_loo` is close to 3 here.
- `looic` is the LOO information criterion, which is \(-2\times\) `elpd_loo`.

Let us compare `fit_1` with `fit_3`: \(\text{kid\_score} = \text{mom\_iq} + \text{mom\_hs}\).

```
loo_3 <- loo(fit_3)
print(loo_3)
```

```
Computed from 4000 by 434 log-likelihood matrix

         Estimate   SE
elpd_loo  -1876.1 14.2
p_loo         4.1  0.4
looic      3752.2 28.4
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```

We can see that `fit_3`’s elpd of -1876.1 is higher than `fit_1`’s elpd of -1914.8. This suggests that `fit_3`’s posterior predictive distribution matches the data better than that of `fit_1`.

To also obtain the standard error of the difference in elpd between `fit_3` and `fit_1`, we can use the `loo_compare` function.

`loo_compare(loo_3, loo_1)`

```
      elpd_diff se_diff
fit_3   0.0       0.0
fit_1 -38.7       8.3
```

The output tells us that `fit_1`’s elpd is 38.7 smaller than that of `fit_3`, with the standard error of the difference equal to 8.3. As a rule of thumb, if `elpd_diff` is greater than 4, the number of observations is greater than 100, and the model is not badly misspecified, then `se_diff` is a reliable measure of uncertainty in the difference between elpd’s.
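
One informal way to read this output is to look at how many standard errors the difference lies below zero; a small sketch:

```
# ratio of the elpd difference to its standard error (illustrative)
comp <- loo_compare(loo_3, loo_1)
print(comp[2, "elpd_diff"] / comp[2, "se_diff"])  # about -38.7 / 8.3, i.e. -4.7
```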

Let us compare the elpd between `fit_3`, the model with two predictors and no interaction term, and `fit_4`, the model with two predictors and an interaction term.

```
loo_4 <- loo(fit_4)
loo_compare(loo_3, loo_4)
```

```
      elpd_diff se_diff
fit_4  0.0       0.0
fit_3 -3.4       2.7
```

The absolute difference is less than 4, so there is no clear improvement from adding the interaction term.