8  Comparing regression models

General principles

  1. Use prior knowledge to pick relevant input variables.
  2. For inputs that have large effects, consider including their interactions.
  3. If the coefficient of a predictor has a small standard error, then we might want to keep it in the model.
  4. If the coefficient of a predictor has a large standard error, and there is no reason for the predictor in the model, then we might want to remove it.
  5. If a coefficient contradicts reality (for example, a negative coefficient for education in an income regression):

    • If the standard error is large, then the unusual estimate can be explained by its uncertainty.

    • If the standard error is small, try to understand how this could happen. In the income vs. education example, the negative coefficient might arise because the data were collected from a subpopulation in which the more educated people are younger and hence tend to have lower average incomes.

8.1 Example: predicting the yields of mesquite bushes

We apply these model-comparison techniques to the mesquite data, where the goal is to estimate the weight (in grams) of the yield harvested from a mesquite bush.

[Figure: a mesquite tree]

mesquite <- read.table("data/mesquite.dat",
                       header=TRUE)

head(mesquite)
  obs group diam1 diam2 total_height canopy_height density weight
1   1   MCD   1.8  1.15         1.30          1.00       1  401.3
2   2   MCD   1.7  1.35         1.35          1.33       1  513.7
3   3   MCD   2.8  2.55         2.16          0.60       1 1179.2
4   4   MCD   1.3  0.85         1.80          1.20       1  308.0
5   5   MCD   3.3  1.90         1.55          1.05       1  855.2
6   6   MCD   1.4  1.40         1.20          1.00       1  268.7

The input variables are:

Variable name   Description
diam1           diameter along the longer axis of the canopy (meters)
diam2           diameter along the shorter axis of the canopy (meters)
canopy_height   height of the canopy (meters)
total_height    total height of the bush (meters)
density         number of primary stems per plant unit
group           MCD or ALS, indicating two different times of measurement

To start off, we regress weight on all of the predictors.

library(rstanarm)
fit_1 <- stan_glm(weight ~ diam1 + diam2 + canopy_height +
                  total_height + density + group, 
                  data=mesquite, refresh=0)

print(fit_1)
stan_glm
 family:       gaussian [identity]
 formula:      weight ~ diam1 + diam2 + canopy_height + total_height + density + 
       group
 observations: 46
 predictors:   7
------
              Median  MAD_SD 
(Intercept)   -1092.5   176.5
diam1           190.7   118.8
diam2           368.2   128.6
canopy_height   359.2   223.5
total_height   -104.2   195.2
density         132.0    34.4
groupMCD        362.8    98.8

Auxiliary parameter(s):
      Median MAD_SD
sigma 274.2   31.3 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We evaluate this model using the expected log predictive density (elpd), estimated here by leave-one-out cross validation with the loo function.

loo_1 <- loo(fit_1)
Warning: Found 3 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 3 times to compute the ELPDs for the problematic observations directly.
print(loo_1)

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo   -334.1 12.5
p_loo        15.7  8.4
looic       668.2 24.9
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     43    93.5%   904       
 (0.5, 0.7]   (ok)        0     0.0%   <NA>      
   (0.7, 1]   (bad)       2     4.3%   38        
   (1, Inf)   (very bad)  1     2.2%   3         
See help('pareto-k-diagnostic') for details.

Three observations with \(\hat{k}>0.7\) indicate that the approximate elpd computation might not be accurate. Aki Vehtari has provided a quick guideline on how to diagnose the model from the p_loo and Pareto \(\hat{k}\) values:

  • If all Pareto \(\hat{k}\) are small, the model is likely to be ok (although there can be better models).

  • If there are high Pareto \(\hat{k}\) values:

    • If p_loo << p (the number of parameters), then the model is likely to be misspecified. A posterior predictive check (PPC) is likely to detect the problem, too. Try an overdispersed model, or add more structural information (nonlinearity, mixture model, etc.).

    • If p_loo > p and p << n, then the model is likely to be badly misspecified, and a PPC is likely to detect the problem, too. A case example is the roaches analysis at https://rawgit.com/avehtari/modelselection_tutorial/master/roaches

    • If p_loo > p and p is relatively large compared to the number of observations, say p > n/5 (more accurately, we should count the number of observations influencing each parameter, as in hierarchical models some groups may have a small n and others a large n), then it is possible that a PPC does not detect the problem. A case example is the Stan forum thread “Recommendations for what to do when k exceeds 0.5 in the loo package?”

    • If p_loo < p and p is relatively large compared to the number of observations (p > n/5), then the model is likely so flexible or the population prior so weak that it is difficult to predict for a left-out observation even if the model is the true one. Case examples are the simulated 8 schools data in https://arxiv.org/abs/1507.04544 and Gaussian processes and spatial models with short correlation lengths.
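
In our case, we could follow the warning's suggestion and compute the elpd exactly for the problematic observations by refitting the model without each of them; the k_threshold argument of loo does this automatically. A sketch (loo_1_exact is a hypothetical name; we do not use it below):

loo_1_exact <- loo(fit_1, k_threshold=0.7)  # refits once per observation with k > 0.7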

Let us try fitting a model on the logarithmic scale. Since group is categorical, we do not apply the logarithmic transformation to this predictor.

fit_2 <- stan_glm(log(weight) ~ log(diam1) + log(diam2) + 
                  log(canopy_height) + log(total_height) + 
                  log(density) + group, 
                  data=mesquite, refresh=0)

print(fit_2)
stan_glm
 family:       gaussian [identity]
 formula:      log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) + 
       log(total_height) + log(density) + group
 observations: 46
 predictors:   7
------
                   Median MAD_SD
(Intercept)        4.8    0.2   
log(diam1)         0.4    0.3   
log(diam2)         1.1    0.2   
log(canopy_height) 0.4    0.3   
log(total_height)  0.4    0.3   
log(density)       0.1    0.1   
groupMCD           0.6    0.1   

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.3    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
loo_2 <- loo(fit_2)
Warning: Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.
print(loo_2)

Computed from 4000 by 46 log-likelihood matrix

         Estimate   SE
elpd_loo    -19.5  5.4
p_loo         7.6  1.6
looic        39.0 10.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     41    89.1%   1011      
 (0.5, 0.7]   (ok)        4     8.7%   352       
   (0.7, 1]   (bad)       1     2.2%   313       
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

The approximate elpd of this model is more reliable than that of the previous one. However, the two models have different outcome variables (weight vs. log(weight)), so their elpds are not directly comparable; we adjust loo_2 for the logarithmic transformation by subtracting the logarithm of each observation's weight from its pointwise log score (the Jacobian adjustment).

loo_2a <- loo_2
loo_2a$pointwise[,1] <- loo_2a$pointwise[,1] - log(mesquite$weight)
sum(loo_2a$pointwise[,1])
[1] -291.7755
loo_compare(loo_1, loo_2a)
Warning: Not all models have the same y variable. ('yhash' attributes do not
match)
      elpd_diff se_diff
fit_2   0.0       0.0  
fit_1 -42.3      10.8  

The elpd of the logarithmic model is larger than that of the linear model, so we continue with the logarithmic model.

We can also simulate datasets from each model and compare them with the original dataset.

library(bayesplot)
yrep_1 <- posterior_predict(fit_1)    # one replicated dataset per posterior draw
n_sims <- nrow(yrep_1)
subset <- sample(n_sims, 100)         # overlay a random subset of 100 draws
ppc_dens_overlay(mesquite$weight, yrep_1[subset,])

yrep_2 <- posterior_predict(fit_2)
ppc_dens_overlay(log(mesquite$weight), yrep_2[subset,])

The density plots show that the fit on the log scale is much better.

8.1.1 Constructing a simpler model

So far we have been throwing all of the predictors into our model. But sometimes we might want a simpler model that is more interpretable. For example, we can create a new predictor that measures the volume of the canopy:

\[ \texttt{canopy\_volume} = \texttt{diam1} * \texttt{diam2} * \texttt{canopy\_height}. \]

The technique of creating a new predictor from the old ones, in machine learning terms, is called feature engineering.

mesquite$canopy_volume <- mesquite$diam1 * mesquite$diam2 * mesquite$canopy_height

fit_3 <- stan_glm(log(weight) ~ log(canopy_volume), data=mesquite,
                  refresh=0)

print(fit_3)
stan_glm
 family:       gaussian [identity]
 formula:      log(weight) ~ log(canopy_volume)
 observations: 46
 predictors:   2
------
                   Median MAD_SD
(Intercept)        5.2    0.1   
log(canopy_volume) 0.7    0.1   

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.4    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Let us compare this model with the previous logarithmic model that uses all predictors.

loo_3 <- loo(fit_3)

loo_compare(loo_2, loo_3)
      elpd_diff se_diff
fit_2  0.0       0.0   
fit_3 -7.1       5.0   

The estimated difference in elpd is only 7.1, but this new model is much easier to interpret.

Let us add more predictors: one is canopy_shape, which measures the ratio between the two diameters

\[ \texttt{canopy\_shape} = \texttt{diam1}/\texttt{diam2}, \]

and the others are total_height, density, and group. We take an educated guess that a ratio of diameters close to one is a sign of a healthy mesquite bush; thus, if the predictors are not strongly correlated, we expect the coefficient of canopy_shape to be negative.

mesquite$canopy_shape <- mesquite$diam1 / mesquite$diam2

fit_4 <- stan_glm(log(weight) ~ log(canopy_volume) + log(canopy_shape) +
                  log(total_height) + log(density) + group, 
                  data=mesquite, refresh=0)

print(fit_4)
stan_glm
 family:       gaussian [identity]
 formula:      log(weight) ~ log(canopy_volume) + log(canopy_shape) + log(total_height) + 
       log(density) + group
 observations: 46
 predictors:   6
------
                   Median MAD_SD
(Intercept)         4.9    0.1  
log(canopy_volume)  0.7    0.1  
log(canopy_shape)  -0.5    0.2  
log(total_height)   0.2    0.3  
log(density)        0.1    0.1  
groupMCD            0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.3    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

As before, we compare this model with the full logarithmic model.

loo_4 <- loo(fit_4)

loo_compare(loo_2, loo_4)
      elpd_diff se_diff
fit_2  0.0       0.0   
fit_4 -0.1       1.2   

The estimated difference in elpd is negligible. However, the coefficients of total_height and density are small relative to their standard errors, so we might want to remove these predictors from the model.

Finally, we are left with a model with three predictors:

fit_5 <- stan_glm(log(weight) ~ log(canopy_volume) + log(canopy_shape) +
                  group, data=mesquite, refresh=0)

print(fit_5)
stan_glm
 family:       gaussian [identity]
 formula:      log(weight) ~ log(canopy_volume) + log(canopy_shape) + group
 observations: 46
 predictors:   4
------
                   Median MAD_SD
(Intercept)         4.9    0.1  
log(canopy_volume)  0.8    0.1  
log(canopy_shape)  -0.4    0.2  
groupMCD            0.6    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.3    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
loo_5 <- loo(fit_5)

loo_compare(loo_2, loo_5)
      elpd_diff se_diff
fit_5  0.0       0.0   
fit_2 -1.5       1.5   

In the end, we obtain a simple model with three predictors that performs as well as the full logarithmic model.

We could try adding more predictors or interaction terms, as in the sketch below, but it would take a substantial amount of work to find a model that performs significantly better. Alternatively, one could try different priors or a more complex model, such as a multilevel model.
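
As an illustration of such an experiment (a sketch only; fit_6 is a hypothetical model that we have not fit here), one could let the volume effect differ between the two measurement groups and compare the result against fit_5:

fit_6 <- stan_glm(log(weight) ~ log(canopy_volume) + log(canopy_shape) +
                  group + group:log(canopy_volume),
                  data=mesquite, refresh=0)
loo_compare(loo_5, loo(fit_6))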

8.2 Different priors for the coefficients

In the previous section, we implicitly used rstanarm's default weakly informative priors for the coefficients.

We work through an example of predicting the grades of a sample of high school students from Portugal.

data <- read.csv("data/student-merged.csv")

head(data)
  G1mat G2mat G3mat G1por G2por G3por school sex age address famsize Pstatus
1     7    10    10    13    13    13      0   0  15       0       0       1
2     8     6     5    13    11    11      0   0  15       0       0       1
3    14    13    13    14    13    12      0   0  15       0       0       1
4    10     9     8    10    11    10      0   0  15       0       0       1
5    10    10    10    13    13    13      0   0  15       0       0       1
6    12    12    11    11    12    12      0   0  15       0       0       1
  Medu Fedu traveltime studytime failures schoolsup famsup paid activities
1    1    1          2         4        1         1      1    1          1
2    1    1          1         2        2         1      1    0          0
3    2    2          1         1        0         1      1    1          1
4    2    4          1         3        0         1      1    1          1
5    3    3          2         3        2         0      1    1          1
6    3    4          1         3        0         1      1    1          1
  nursery higher internet romantic famrel freetime goout Dalc Walc health
1       1      1        1        0      3        1     2    1    1      1
2       0      1        1        1      3        3     4    2    4      5
3       1      1        0        0      4        3     1    1    1      2
4       1      1        1        0      4        3     2    1    1      5
5       1      1        1        1      4        2     1    2    3      3
6       1      1        1        0      4        3     2    1    1      5
  absences
1        2
2        2
3        8
4        2
5        8
6        2

We will predict the third period math grade given the other predictors.

predictors <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu",
"traveltime","studytime","failures","schoolsup","famsup","paid","activities",
"nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc",
"Walc","health","absences")
data_G3mat <- subset(data, subset=G3mat>0, select=c("G3mat",predictors))

Let us first fit the standard regression model with the default priors.

fit1 <- stan_glm(G3mat ~ ., data=data_G3mat, refresh=0)

We plot the posterior distributions of the coefficients using the mcmc_areas function from the bayesplot library.

p1 <- mcmc_areas(as.matrix(fit1), pars=vars(-'(Intercept)',-sigma),
                 prob_outer=0.95, area_method = "scaled height")
p1

The different amounts of uncertainty make it difficult to compare the coefficients. For example, it is hard to tell whether absences is more relevant than health.

To compare two predictors, we have to transform them so that they share the same scale. The most common scaling technique is standardization: subtract the mean from each predictor's observed values, then divide by the sample standard deviation:

\[ x \to \frac{x-\bar{x}}{\text{sd}(x)}. \]

With this transformation, all predictors now have a mean of zero and a sample standard deviation of one.

In R, we can standardize all predictors by using the scale function.

datastd_G3mat <- data_G3mat
datastd_G3mat[,predictors] <- scale(data_G3mat[,predictors])

Now, let us fit the model on the standardized data and plot the coefficients again.

fit2 <- stan_glm(G3mat ~ ., data=datastd_G3mat, refresh=0)

p2 <- mcmc_areas(as.matrix(fit2), pars=vars(-'(Intercept)',-sigma),
                 prob_outer=0.95, area_method = "scaled height")
p2

The uncertainties of the coefficients are now on a similar scale, so the coefficients can be compared directly. We can see that absences is more relevant than many of the other predictors.
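
With standardized predictors, we can also quantify such comparisons directly from the posterior draws. For example, a sketch of estimating the posterior probability that the coefficient of absences is larger in magnitude than that of health:

draws <- as.matrix(fit2)                              # posterior draws of all parameters
mean(abs(draws[, "absences"]) > abs(draws[, "health"]))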

8.2.1 Priors for variable selection

To obtain a simpler model, we may assume that only some of the predictors are relevant while the effects of the others are negligible. One way to encode this assumption in the model is to use a prior distribution with a sharp peak around zero; examples of such priors are the regularized horseshoe prior and the Laplace prior.

[Figure: densities of priors for variable selection]
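
To get a rough sense of these shapes without the plot, we can simulate draws from both priors. This is a sketch assuming unit scales; the horseshoe draws below omit the regularization:

set.seed(1)
lambda <- abs(rcauchy(1e5))                              # local scales: half-Cauchy(0, 1)
horseshoe <- rnorm(1e5, 0, lambda)                       # horseshoe draws: N(0, lambda^2)
laplace <- sample(c(-1, 1), 1e5, replace=TRUE) * rexp(1e5)  # Laplace draws via signed exponential
quantile(horseshoe, c(0.25, 0.5, 0.75))                  # sharper peak at zero, heavier tails
quantile(laplace, c(0.25, 0.5, 0.75))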

8.2.1.1 Regularized horseshoe prior

The regularized horseshoe prior, as shown in the plot above, places the prior \(\mathcal{N}(0,\tau^2\lambda_j^2)\) on the \(j\)-th coefficient, where \(\tau\) and \(\lambda_j\) are given the following priors:

  1. \(\tau\) is a global scale that shrinks all \(\beta_j\) towards zero. The prior for \(\tau\) is

    \[ \text{Half-Cauchy}\left(0,\texttt{global\_scale}\right). \]

    A common choice for the global scale is

    \[ \texttt{global\_scale} = \frac{p_0}{p-p_0}\frac{\sigma}{\sqrt{n}}, \]

    where \(p\) is the number of predictors, \(p_0\) is the expected number of relevant predictors, \(\sigma\) is the model’s estimated standard deviation, and \(n\) is the number of observations.

  2. \(\lambda_j\) is a local scale that allows some \(\beta_j\) to escape the shrinkage. The prior for \(\lambda_j\) is

    \[ \text{Half-Cauchy}\left(0,\texttt{slab\_scale}\right). \]

    A common choice for the slab scale is

    \[ \texttt{slab\_scale} = \sqrt{\frac{\hat{R}^2}{p_0}}\text{sd}(y), \]

    where \(\hat{R}^2\) is an estimate of the proportion of variance explained by the model. With this choice of slab scale, if the \(p_0\) relevant coefficients are sampled from the slab and the predictors have mean zero and standard deviation one, then the variance of the linear predictor is precisely \(\hat{R}^2\text{Var}(y)\):

    \[\begin{align*} \text{Var}\left(\beta_0+\beta_1x_1+\ldots+\beta_{p_0} x_{p_0}\right) &= \text{Var}(\beta_1)\text{Var}(x_1)+\ldots+\text{Var}(\beta_{p_0})\text{Var}(x_{p_0}) \\ &= \underbrace{\frac{\hat{R}^2}{p_0}\text{Var}(y)+\ldots+\frac{\hat{R}^2}{p_0}\text{Var}(y)}_{p_0\text{ terms}} \\ &= \hat{R}^2\text{Var}(y). \end{align*}\]
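
As a quick numerical check of this derivation, here is a small simulation sketch with the illustrative values \(p_0=6\), \(\hat{R}^2=0.3\), and \(\text{sd}(y)=1\):

set.seed(1)
p0 <- 6; R2_hat <- 0.3; sd_y <- 1
slab_scale <- sqrt(R2_hat/p0) * sd_y
x <- matrix(rnorm(1e5*p0), ncol=p0)                      # standardized predictors
beta <- matrix(rnorm(1e5*p0, 0, slab_scale), ncol=p0)    # coefficients drawn from the slab
var(rowSums(x * beta))                                   # approximately R2_hat * sd_y^2 = 0.3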

Below is an example of linear regression with the regularized horseshoe prior with \(p_0=6\) and \(\hat{R}^2 = 0.3\) (make sure that the data is standardized!):

p <- length(predictors)
n <- nrow(datastd_G3mat)
p0 <- 6
R2_hat <- 0.3
slab_scale <- sqrt(R2_hat/p0)*sd(datastd_G3mat$G3mat)
# global scale without sigma, as the scaling by sigma is done inside stan_glm
global_scale <- (p0/(p - p0))/sqrt(n)

fit3 <- stan_glm(G3mat ~ ., data=datastd_G3mat,
                 prior=hs(global_scale=global_scale, slab_scale=slab_scale),
                 refresh=0)

print(fit3)
stan_glm
 family:       gaussian [identity]
 formula:      G3mat ~ .
 observations: 343
 predictors:   27
------
            Median MAD_SD
(Intercept) 11.6    0.2  
school      -0.1    0.1  
sex          0.2    0.2  
age         -0.2    0.2  
address      0.1    0.1  
famsize      0.0    0.1  
Pstatus      0.0    0.1  
Medu         0.2    0.2  
Fedu         0.1    0.2  
traveltime   0.0    0.1  
studytime    0.1    0.2  
failures    -0.6    0.2  
schoolsup   -0.7    0.2  
famsup      -0.1    0.1  
paid        -0.1    0.2  
activities   0.0    0.1  
nursery      0.0    0.1  
higher       0.0    0.1  
internet     0.1    0.2  
romantic     0.0    0.1  
famrel       0.0    0.1  
freetime     0.0    0.1  
goout       -0.3    0.2  
Dalc         0.0    0.1  
Walc        -0.2    0.2  
health      -0.1    0.2  
absences    -0.5    0.2  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.9    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We can see that most of the coefficients are essentially zero. The following distribution plots show that most of the coefficients are shrunk towards zero, making it easier to spot the relevant predictors.

p3 <- mcmc_areas(as.matrix(fit3), pars=vars(-'(Intercept)',-sigma),
                 prob_outer=0.95, area_method = "scaled height")
p3

From this plot, we see that the most relevant predictors are failures, schoolsup, goout and absences. Let us run the regression with only these four variables.

fit4 <- stan_glm(G3mat ~ failures + schoolsup + 
                 goout + absences, data=datastd_G3mat,
                 refresh=0)

print(fit4)
stan_glm
 family:       gaussian [identity]
 formula:      G3mat ~ failures + schoolsup + goout + absences
 observations: 343
 predictors:   5
------
            Median MAD_SD
(Intercept) 11.6    0.2  
failures    -0.8    0.2  
schoolsup   -0.7    0.2  
goout       -0.5    0.2  
absences    -0.6    0.2  

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.0    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Not surprisingly, failures, goout, and absences negatively affect the grades. The same goes for schoolsup: a student who requires extra educational support tends to perform worse than one who does not.

We compare its elpd to that of the model with all predictors.

loo2 <- loo(fit2)
loo4 <- loo(fit4)

loo_compare(loo2, loo4)
     elpd_diff se_diff
fit4  0.0       0.0   
fit2 -1.3       6.3   

The model with only four predictors performs as well as the model with all predictors.

8.2.1.2 LASSO regression

LASSO regression corresponds to Bayesian regression with the Laplace (double-exponential) prior. To fit a LASSO-type model, simply run stan_glm with prior=lasso(autoscale=TRUE).

fit5 <- stan_glm(G3mat ~ ., data=datastd_G3mat,
                 prior=lasso(autoscale=TRUE),
                 refresh=0)

print(fit5)
stan_glm
 family:       gaussian [identity]
 formula:      G3mat ~ .
 observations: 343
 predictors:   27
------
            Median MAD_SD
(Intercept) 11.6    0.2  
school      -0.1    0.1  
sex          0.2    0.2  
age         -0.2    0.2  
address      0.1    0.2  
famsize      0.1    0.1  
Pstatus      0.0    0.1  
Medu         0.2    0.2  
Fedu         0.2    0.2  
traveltime   0.0    0.1  
studytime    0.2    0.2  
failures    -0.5    0.2  
schoolsup   -0.7    0.2  
famsup      -0.1    0.1  
paid        -0.2    0.2  
activities   0.0    0.1  
nursery      0.0    0.1  
higher       0.0    0.1  
internet     0.2    0.1  
romantic    -0.1    0.1  
famrel       0.0    0.1  
freetime     0.0    0.1  
goout       -0.3    0.2  
Dalc         0.0    0.1  
Walc        -0.2    0.2  
health      -0.2    0.1  
absences    -0.5    0.2  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.9    0.1   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

This gives us almost the same model as the one with the horseshoe prior. In general, the regularized horseshoe prior is recommended over the LASSO.