Use prior knowledge to pick relevant input variables.

For inputs that have large effects, consider including their interactions.

If the coefficient of a predictor has a small standard error, then we might want to keep it in the model.

If the coefficient of a predictor has a large standard error, and there is no strong reason to include the predictor in the model, then we might want to remove it.

If a coefficient contradicts reality (for example, a negative coefficient for education in an income regression), examine its standard error:

If the standard error is large, then the unusual estimate can be explained by its uncertainty.

If the standard error is small, try to understand how it could happen. In the income vs. education example, the negative coefficient might be because the data was collected from a subpopulation in which the more educated people are younger and hence tend to have lower average income.

8.1 Example: predicting the yields of mesquite bushes

We apply these model-checking techniques to the mesquite data, where the goal is to estimate the weight (in grams) of yield harvested from a mesquite bush.
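The printed fit below can be reproduced with rstanarm along these lines; a sketch, in which the data frame name mesquite and the file path are assumptions:

```r
library(rstanarm)

mesquite <- read.csv("data/mesquite.csv")  # assumed path
fit_1 <- stan_glm(weight ~ diam1 + diam2 + canopy_height + total_height +
                    density + group,
                  data = mesquite, refresh = 0)
print(fit_1)
```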

stan_glm
family: gaussian [identity]
formula: weight ~ diam1 + diam2 + canopy_height + total_height + density +
group
observations: 46
predictors: 7
------
Median MAD_SD
(Intercept) -1092.5 176.5
diam1 190.7 118.8
diam2 368.2 128.6
canopy_height 359.2 223.5
total_height -104.2 195.2
density 132.0 34.4
groupMCD 362.8 98.8
Auxiliary parameter(s):
Median MAD_SD
sigma 274.2 31.3
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We evaluate this model using elpd (the expected log pointwise predictive density), estimated by leave-one-out cross-validation with the loo package.

loo_1 <- loo(fit_1)

Warning: Found 3 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 3 times to compute the ELPDs for the problematic observations directly.

print(loo_1)

Computed from 4000 by 46 log-likelihood matrix
Estimate SE
elpd_loo -334.1 12.5
p_loo 15.7 8.4
looic 668.2 24.9
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 43 93.5% 904
(0.5, 0.7] (ok) 0 0.0% <NA>
(0.7, 1] (bad) 2 4.3% 38
(1, Inf) (very bad) 1 2.2% 3
See help('pareto-k-diagnostic') for details.

Three instances with \(\hat{k}>0.7\) indicate that the approximate elpd computation might not be accurate. Aki Vehtari has provided a quick guideline on how to diagnose the model from p_loo and Pareto \(\hat{k}\) values:

If all Pareto k small, model is likely to be ok (although there can be better models)

If there are high Pareto k values:

If p_loo << the number of parameters p, then the model is likely to be misspecified. Posterior predictive checking (PPC) is likely to detect the problem, too. Try using an overdispersed model, or add more structural information (nonlinearity, mixture model, etc.).

If p_loo > the number of parameters p, then the model is likely to be badly misspecified. If the number of parameters p is relatively large compared to the number of observations (p > n/5; more accurately, we should count the number of observations influencing each parameter, as in hierarchical models some groups may have small n and some groups large n), it is possible that PPC doesn't detect the problem. A case example is the discussion "Recommendations for what to do when k exceeds 0.5 in the loo package?".

If p_loo < the number of parameters p and p is relatively large compared to the number of observations (p > n/5), it is likely that the model is so flexible or the population prior is so weak that it is difficult to predict the left-out observation even if the model is the true one. Case examples are the simulated 8 schools data in https://arxiv.org/abs/1507.04544 and Gaussian processes and spatial models with short correlation lengths.
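Following the recommendation printed in the warning above, the flagged observations can be handled exactly at the cost of refitting; a sketch using rstanarm's loo method:

```r
# refit the model once per flagged observation to compute its elpd exactly
loo_1 <- loo(fit_1, k_threshold = 0.7)
```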

Let us try fitting a logarithmic model instead. Since group is categorical, we do not apply the logarithmic transformation to this predictor.
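A sketch of the corresponding call (mesquite is the assumed data frame name):

```r
fit_2 <- stan_glm(log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) +
                    log(total_height) + log(density) + group,
                  data = mesquite, refresh = 0)
print(fit_2)
```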

stan_glm
family: gaussian [identity]
formula: log(weight) ~ log(diam1) + log(diam2) + log(canopy_height) +
log(total_height) + log(density) + group
observations: 46
predictors: 7
------
Median MAD_SD
(Intercept) 4.8 0.2
log(diam1) 0.4 0.3
log(diam2) 1.1 0.2
log(canopy_height) 0.4 0.3
log(total_height) 0.4 0.3
log(density) 0.1 0.1
groupMCD 0.6 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.3 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

loo_2 <- loo(fit_2)

Warning: Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.

print(loo_2)

Computed from 4000 by 46 log-likelihood matrix
Estimate SE
elpd_loo -19.5 5.4
p_loo 7.6 1.6
looic 39.0 10.7
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 41 89.1% 1011
(0.5, 0.7] (ok) 4 8.7% 352
(0.7, 1] (bad) 1 2.2% 313
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.

The approximate elpd of this model is more reliable than that of the previous one. However, the two elpds are not yet comparable, since this model predicts log(weight) rather than weight; we adjust loo_2 for the logarithmic transformation by subtracting the logarithm of each observation's weight from its pointwise log score.
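A minimal sketch of this adjustment, assuming the raw weights live in mesquite$weight:

```r
loo_2_adj <- loo_2
# subtract log(weight) from each pointwise log score
# (Jacobian of the transformation y -> log y)
loo_2_adj$pointwise[, "elpd_loo"] <-
  loo_2_adj$pointwise[, "elpd_loo"] - log(mesquite$weight)
sum(loo_2_adj$pointwise[, "elpd_loo"])  # adjusted total elpd
```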

The density plots show that the fit on the log scale is much better.
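Density overlays like these can be drawn with rstanarm's pp_check, which plots the observed outcome against posterior predictive replications:

```r
pp_check(fit_1)  # original scale
pp_check(fit_2)  # log scale
```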

8.1.1 Constructing a simpler model

We have been throwing all the predictors into our model. But sometimes we might want to look for a simpler model that is more interpretable. For example, we can create a new predictor that measures the volume of the canopy:
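A sketch of the construction and the fit shown below; defining canopy volume as the product of the two diameters and the canopy height is an assumption consistent with the available predictors:

```r
mesquite$canopy_volume <- with(mesquite, diam1 * diam2 * canopy_height)
fit_3 <- stan_glm(log(weight) ~ log(canopy_volume),
                  data = mesquite, refresh = 0)
print(fit_3)
```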

stan_glm
family: gaussian [identity]
formula: log(weight) ~ log(canopy_volume)
observations: 46
predictors: 2
------
Median MAD_SD
(Intercept) 5.2 0.1
log(canopy_volume) 0.7 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.4 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Let us compare this model with the previous logarithmic model with all predictors.

loo_3 <- loo(fit_3)
loo_compare(loo_2, loo_3)

elpd_diff se_diff
fit_2 0.0 0.0
fit_3 -7.1 5.0

There is a difference of only 7.1 in the estimated elpd, but this new model is much easier to interpret.

Let us add more predictors: one is canopy_shape, which measures the ratio between the two diameters, and the others are total_height, density, and group. We take an educated guess that having a ratio of diameters close to one is a sign of a healthy mesquite bush; thus, if the predictors are not correlated, we expect the coefficient of canopy_shape to be negative.
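A sketch of the construction and fit, defining canopy_shape as the ratio of the two diameters (an assumption consistent with the text):

```r
mesquite$canopy_shape <- with(mesquite, diam1 / diam2)
fit_4 <- stan_glm(log(weight) ~ log(canopy_volume) + log(canopy_shape) +
                    log(total_height) + log(density) + group,
                  data = mesquite, refresh = 0)
print(fit_4)
```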

stan_glm
family: gaussian [identity]
formula: log(weight) ~ log(canopy_volume) + log(canopy_shape) + log(total_height) +
log(density) + group
observations: 46
predictors: 6
------
Median MAD_SD
(Intercept) 4.9 0.1
log(canopy_volume) 0.7 0.1
log(canopy_shape) -0.5 0.2
log(total_height) 0.2 0.3
log(density) 0.1 0.1
groupMCD 0.6 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.3 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

As before, we compare this model with the full logarithmic model.

loo_4 <- loo(fit_4)
loo_compare(loo_2, loo_4)

elpd_diff se_diff
fit_2 0.0 0.0
fit_4 -0.1 1.2

The estimated difference in elpd is insignificant. However, the coefficients of total_height and density have large standard errors relative to their estimates, so we might want to remove these predictors from the model.

Finally, we are left with a model with three predictors:
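A sketch of the fit, reusing the derived predictors from above:

```r
fit_5 <- stan_glm(log(weight) ~ log(canopy_volume) + log(canopy_shape) + group,
                  data = mesquite, refresh = 0)
print(fit_5)
```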

stan_glm
family: gaussian [identity]
formula: log(weight) ~ log(canopy_volume) + log(canopy_shape) + group
observations: 46
predictors: 4
------
Median MAD_SD
(Intercept) 4.9 0.1
log(canopy_volume) 0.8 0.1
log(canopy_shape) -0.4 0.2
groupMCD 0.6 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.3 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

loo_5 <- loo(fit_5)
loo_compare(loo_2, loo_5)

elpd_diff se_diff
fit_5 0.0 0.0
fit_2 -1.5 1.5

In the end, we obtain a simple model with three predictors that performs as well as the full logarithmic model.

We can try adding more predictors or interaction terms but it would take a substantial amount of work to find a model that performs significantly better. Alternatively, one can try a different prior or a more complex model, such as a multilevel model.

8.2 Different priors for the coefficients

In the previous section, we implicitly used the default weakly informative priors for the coefficients.

We work on an example of predicting grades from a sample of high school students from Portugal.

data <- read.csv("data/student-merged.csv")
head(data)

The different amounts of uncertainty make it difficult to compare the coefficients. For example, it is really hard to see if absences is more relevant than health.

To compare two predictors, we have to transform them so that they share the same scale. The most common scaling technique is standardization, which consists of subtracting the mean of each predictor from its observed values and then dividing by the sample standard deviation:

\[
x \to \frac{x-\bar{x}}{\text{sd}(x)}.
\]

With this transformation, all predictors now have a mean of zero and a sample standard deviation of one.

In R, we can standardize all predictors by using the scale function.
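A sketch, assuming the outcome column is named G3mat (the math grade, matching the later code) and every other column is a predictor:

```r
predictors <- setdiff(names(data), "G3mat")  # assumed outcome name
datastd_G3mat <- data
datastd_G3mat[, predictors] <- scale(data[, predictors])
# every predictor now has mean 0 and sample sd 1
```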

The standard errors of the coefficients are now similar to each other. We can see that absences is more relevant than many of the other predictors.

8.2.1 Priors for variable selection

To obtain a simpler model, we may assume that only some of the predictors are relevant while the others are negligible. One way to insert this assumption into the model is by using a prior whose distribution has a sharp peak around zero; examples of such priors are the regularized horseshoe prior and the Laplace prior.

8.2.1.1 Regularized horseshoe prior

The regularized horseshoe prior, as shown in the plot above, places a \(\mathcal{N}(0,\tau^2\lambda_j^2)\) prior on the \(j\)-th coefficient \(\beta_j\), where the scales \(\tau\) and \(\lambda_j\) have the following priors:

\(\tau\) is a global scale that shrinks all \(\beta_j\) towards zero. The prior for \(\tau\) is a half-Cauchy,

\[
\tau \sim \mathrm{C}^{+}\!\left(0,\tau_0^2\right),
\qquad
\tau_0 = \frac{p_0}{p-p_0}\,\frac{\sigma}{\sqrt{n}},
\]

where \(p\) is the number of predictors, \(p_0\) is the expected number of relevant predictors, \(\sigma\) is the model's estimated standard deviation, and \(n\) is the number of observations.

\(\lambda_j\) is a local scale that allows some \(\beta_j\) to escape the shrinkage. The prior for \(\lambda_j\) is a half-Cauchy, \(\lambda_j \sim \mathrm{C}^{+}(0,1)\), regularized by a slab of scale

\[
c = \sqrt{\hat{R}^2/p_0}\;\operatorname{sd}(y),
\]

where \(\hat{R}^2\) is an estimated proportion of variance explained by the model. With this choice of slab scale, the variance of the linear model, with the coefficients sampled from the priors and the predictors with zero mean and standard deviation 1, is precisely \(\hat{R}^2\operatorname{Var}(y)\).
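To see this, suppose roughly \(p_0\) coefficients are relevant and drawn at the slab scale \(c=\sqrt{\hat{R}^2/p_0}\,\operatorname{sd}(y)\), while the rest are shrunk to zero; with independent standardized predictors,

\[
\operatorname{Var}\Bigl(\sum_j \beta_j x_j\Bigr)
\approx p_0\,c^2
= p_0\cdot\frac{\hat{R}^2}{p_0}\operatorname{Var}(y)
= \hat{R}^2\operatorname{Var}(y).
\]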

Below is an example of linear regression with the regularized horseshoe prior with \(p_0=6\) and \(\hat{R}^2 = 0.3\) (make sure that the data is standardized!):

p <- length(predictors)
n <- nrow(datastd_G3mat)
p0 <- 6
R2_hat <- 0.3
slab_scale <- sqrt(R2_hat/p0) * sd(datastd_G3mat$G3mat)
# global scale without sigma, as the scaling by sigma is done inside stan_glm
global_scale <- (p0/(p - p0)) / sqrt(n)
fit3 <- stan_glm(G3mat ~ ., data = datastd_G3mat,
                 prior = hs(global_scale = global_scale, slab_scale = slab_scale),
                 refresh = 0)
print(fit3)

stan_glm
family: gaussian [identity]
formula: G3mat ~ .
observations: 343
predictors: 27
------
Median MAD_SD
(Intercept) 11.6 0.2
school -0.1 0.1
sex 0.2 0.2
age -0.2 0.2
address 0.1 0.1
famsize 0.0 0.1
Pstatus 0.0 0.1
Medu 0.2 0.2
Fedu 0.1 0.2
traveltime 0.0 0.1
studytime 0.1 0.2
failures -0.6 0.2
schoolsup -0.7 0.2
famsup -0.1 0.1
paid -0.1 0.2
activities 0.0 0.1
nursery 0.0 0.1
higher 0.0 0.1
internet 0.1 0.2
romantic 0.0 0.1
famrel 0.0 0.1
freetime 0.0 0.1
goout -0.3 0.2
Dalc 0.0 0.1
Walc -0.2 0.2
health -0.1 0.2
absences -0.5 0.2
Auxiliary parameter(s):
Median MAD_SD
sigma 2.9 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

We can see that most of the coefficient estimates are close to zero. The following distribution plots show that most of the coefficients are shrunk towards zero, making it easier to spot the relevant predictors.

From this plot, we see that the most relevant predictors are failures, schoolsup, goout and absences. Let us run the regression with only these four variables.
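A sketch of this reduced fit (fit4 is a hypothetical name; the data frame is the standardized one used above):

```r
fit4 <- stan_glm(G3mat ~ failures + schoolsup + goout + absences,
                 data = datastd_G3mat, refresh = 0)
print(fit4)
```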

stan_glm
family: gaussian [identity]
formula: G3mat ~ failures + schoolsup + goout + absences
observations: 343
predictors: 5
------
Median MAD_SD
(Intercept) 11.6 0.2
failures -0.8 0.2
schoolsup -0.7 0.2
goout -0.5 0.2
absences -0.6 0.2
Auxiliary parameter(s):
Median MAD_SD
sigma 3.0 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Not surprisingly, failures, goout, and absences negatively affect the grades. The same goes for schoolsup: a student who requires extra educational support tends to perform worse than one who does not.

We compare its elpd to the model with all predictors.
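A sketch of that comparison (object names are assumptions; fit3 is the horseshoe model above and fit4 the four-predictor model):

```r
loo_hs <- loo(fit3)
loo_4  <- loo(fit4)
loo_compare(loo_hs, loo_4)
```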