4  Prediction and Bayesian inference

We go back to the Election vs Economy example.

library("rstanarm")
hibbs <- read.table("data/hibbs.dat", 
                    header=TRUE)
M1 <- stan_glm(vote ~ growth, data=hibbs, 
               refresh=0)  # suppress output
print(M1)
stan_glm
 family:       gaussian [identity]
 formula:      vote ~ growth
 observations: 16
 predictors:   2
------
            Median MAD_SD
(Intercept) 46.3    1.6  
growth       3.0    0.7  

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.9    0.7   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The stan_glm function, in addition to fitting the model, also performs 4000 simulations from the posterior distribution. The simulations can be obtained using the as.matrix function.

sims <- as.matrix(M1)
print(sims[1:5,])
          parameters
iterations (Intercept)   growth    sigma
      [1,]    43.17848 3.362969 4.879039
      [2,]    41.49456 3.830520 5.439569
      [3,]    42.10958 3.690195 5.296429
      [4,]    50.66764 2.115225 3.683646
      [5,]    43.20911 3.490817 3.634799

Then we can use these simulations to compute, for example, the posterior median and the posterior median absolute deviation (MAD).

Median <- apply(sims, 2, median)
MAD_SD <- apply(sims, 2, mad)
print(cbind(Median, MAD_SD))
               Median    MAD_SD
(Intercept) 46.271414 1.6281478
growth       3.034514 0.6773794
sigma        3.880989 0.7492301

We can see that the numbers are similar to the results of the regression above.

4.1 Prediction and uncertainty: predict, posterior_linpred, and posterior_predict

These functions are called on the fitted regression (M1) and produce predictions that incorporate increasing levels of uncertainty.

  1. predict returns the best point estimate for the average value of \(y\) given a new value of \(x\).

    \[ \hat{y} = \hat{a}+\hat{b}x^{\text{new}}. \]

    First, we create a new dataframe with a single value \(x^{\text{new}}=2\%\).

    new <- data.frame(growth=2.0)
    print(new)
      growth
    1      2

    Then we use the predict function.

    y_point_pred <- predict(M1, newdata=new)
    print(y_point_pred)
           1 
    52.36159 

    This gives the same value as using the point estimates of the intercept and coefficient above: \(46.3+3.0\times 2 = 52.3\).

  2. posterior_linpred

    This function returns a vector of posterior simulations of the linear predictor:

    \[ \hat{y}_i = a_i+b_ix^{\text{new}}, \quad i=1,\ldots,4000,\]

    computed over all simulations \(a_1,\ldots,a_{4000}\) and \(b_1,\ldots,b_{4000}\) from the posterior distributions of the intercept and slope, respectively. This is equivalent to:

    sims <- as.matrix(M1)  # matrix of all 4000 simulations of a, b and sigma
    a <- sims[,1]  # vector of 4000 simulations of intercept
    b <- sims[,2]  # vector of 4000 simulations of slope
    y_linpred <- a + b*as.numeric(new)
    print(y_linpred[1:10])
     [1] 49.90441 49.15560 49.48997 54.89809 50.19075 50.58949 50.74200 53.82518
     [9] 50.96752 51.68159

    Calling posterior_linpred gives us the same numbers:

    y_linpred <- posterior_linpred(M1, newdata=new)
    print(y_linpred[1:10])
     [1] 49.90441 49.15560 49.48997 54.89809 50.19075 50.58949 50.74200 53.82518
     [9] 50.96752 51.68159
  3. posterior_predict

    This function returns a vector of predictions, taking into account the uncertainty in \(a\), \(b\) and \(\sigma\):

    \[ \hat{y}_i = a_i + b_ix^{\text{new}}+\varepsilon_i, \quad \varepsilon_i\sim\mathcal{N}(0,\sigma_i^2). \]

    In addition to the \(a_i\)’s and \(b_i\)’s as above, we now also use \(\sigma_1,\ldots,\sigma_{4000}\), the 4000 simulations from the posterior distribution of \(\sigma\). This is the same as the following code:

    sims <- as.matrix(M1)  # matrix of all 4000 simulations of a, b and sigma
    n_sims <- nrow(sims)  # number of rows in sims
    a <- sims[,1]  # vector of 4000 simulations of intercept
    b <- sims[,2]  # vector of 4000 simulations of slope
    sigma <- sims[,3]  # vector of 4000 simulations of sigma
    y_pred <- a + b*as.numeric(new) + rnorm(n_sims, 0, sigma)
    print(y_pred[1:10])
     [1] 54.28794 43.60678 42.78229 57.70130 50.11131 48.34323 50.50501 58.01345
     [9] 55.45994 54.43732

    Let us look at the histogram of the predictions.

    hist(y_pred, breaks=20)

    Now we try calling posterior_predict directly.

    y_pred <- posterior_predict(M1, newdata=new)
    print(y_pred[1:10])
     [1] 46.02915 55.83342 53.87139 47.00921 50.72119 43.23567 56.37825 54.21147
     [9] 48.83689 54.31061

    The predictions are not exactly the same as those from the direct simulation because the random noise draws differ. Nonetheless, we can compare the histograms from posterior_predict and from the direct simulation; a side-by-side sketch is given below.

    hist(y_pred, breaks=20)
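
    To make the comparison concrete, here is a minimal sketch (reusing the objects a, b, sigma, n_sims and new created above) that draws both sets of predictions and plots their histograms side by side; the two distributions should look essentially the same, since only the random noise draws differ.

    y_pred_direct <- a + b*as.numeric(new) + rnorm(n_sims, 0, sigma)  # direct simulation
    y_pred_pp <- as.numeric(posterior_predict(M1, newdata=new))  # via posterior_predict
    par(mfrow=c(1, 2))
    hist(y_pred_direct, breaks=20, main="Direct simulation")
    hist(y_pred_pp, breaks=20, main="posterior_predict")
    par(mfrow=c(1, 1))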

Now we can compute the point estimate of Clinton’s voting share, the MAD and the winning probability.

y_pred_mean <- median(y_pred)
y_pred_mad <- mad(y_pred)
win_indicator <- (y_pred > 50)
print(win_indicator[1:10])
 [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
win_prob <- mean(win_indicator)
cat("Clinton's voting share: ", y_pred_mean, "\n")
Clinton's voting share:  52.43405 
cat("MAD: ", y_pred_mad, "\n")
MAD:  4.085968 
cat("Winning probability: ", win_prob, "\n")
Winning probability:  0.7225 

4.1.1 Predictions on multiple inputs

We can also make predictions for a range of input values (in this case, the outputs of posterior_linpred and posterior_predict are matrices with one column of simulations per input value). Remember that we made a dataframe new consisting of a single value. In general, we can use a dataframe containing a sequence of inputs. Here is an example where the inputs are \(-2.0\%,-1.5\%,\ldots,4.0\%\).

new_grid <- data.frame(growth=seq(-2.0, 4.0, 0.5))
y_point_pred_grid <- predict(M1, newdata=new_grid)
y_linpred_grid <- posterior_linpred(M1, newdata=new_grid)
y_pred_grid <- posterior_predict(M1, newdata=new_grid)

cat("Point estimations\n")
Point estimations
print(y_point_pred_grid)
       1        2        3        4        5        6        7        8 
40.16561 41.69011 43.21461 44.73910 46.26360 47.78810 49.31260 50.83709 
       9       10       11       12       13 
52.36159 53.88609 55.41058 56.93508 58.45958 
cat("\nLinear predictions with uncertainty\n")

Linear predictions with uncertainty
print(y_linpred_grid[1:5,])
          
iterations        1        2        3        4        5        6        7
      [1,] 36.45254 38.13402 39.81551 41.49699 43.17848 44.85996 46.54145
      [2,] 33.83352 35.74878 37.66404 39.57930 41.49456 43.40982 45.32508
      [3,] 34.72919 36.57429 38.41938 40.26448 42.10958 43.95468 45.79977
      [4,] 46.43718 47.49480 48.55241 49.61002 50.66764 51.72525 52.78286
      [5,] 36.22748 37.97289 39.71830 41.46370 43.20911 44.95452 46.69993
          
iterations        8        9       10       11       12       13
      [1,] 48.22293 49.90441 51.58590 53.26738 54.94887 56.63035
      [2,] 47.24034 49.15560 51.07086 52.98612 54.90138 56.81664
      [3,] 47.64487 49.48997 51.33507 53.18016 55.02526 56.87036
      [4,] 53.84047 54.89809 55.95570 57.01331 58.07092 59.12854
      [5,] 48.44534 50.19075 51.93616 53.68156 55.42697 57.17238
cat("\nPosterior predictions\n")

Posterior predictions
print(y_pred_grid[1:5,])
            1        2        3        4        5        6        7        8
[1,] 26.70167 35.02488 42.53255 36.80278 44.89615 47.73739 43.53268 43.42376
[2,] 35.67405 29.74778 36.34294 35.20969 50.87781 47.92984 41.38884 47.09931
[3,] 40.22928 30.95276 33.26086 36.60504 39.26104 49.63381 43.18587 50.99151
[4,] 49.62335 54.93808 50.40952 52.63837 52.18761 53.39729 52.31814 49.97139
[5,] 38.27645 42.18214 38.80070 39.16123 41.91069 49.53228 51.09435 44.59416
            9       10       11       12       13
[1,] 44.33410 53.65338 52.16829 60.35885 60.36851
[2,] 51.49593 42.22083 53.14718 53.94014 50.78556
[3,] 44.88779 62.01955 46.33628 44.58224 62.77947
[4,] 56.96249 52.31356 59.01079 66.40041 56.22663
[5,] 47.96626 47.24054 53.12222 51.18830 57.61050

The result of predict is a vector of length 13, while the results of posterior_linpred and posterior_predict are \(4000\times 13\) matrices, each containing 4000 simulations for each of the 13 values of growth.
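
We can confirm these shapes directly:

length(y_point_pred_grid)  # should be 13
dim(y_linpred_grid)  # should be 4000 13
dim(y_pred_grid)  # should be 4000 13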

4.1.2 Predictions with input uncertainty

Previously, we expressed uncertainty in the election outcome conditional on fixed values of economic growth. However, growth is usually only estimated prior to the campaign and is revised by the government some time afterwards. Hence, we also have to take the uncertainty in growth into account when making predictions.

Let us assume that before the campaign, the estimated economic growth was 2%, but that between the campaign and the election the growth figure might change slightly. We shall therefore model the growth by \(\mathcal{N}(2.0,0.3^2)\). Let us simulate the growth from this distribution.

x_new <- rnorm(n_sims, 2.0, 0.3)  # create 4000 random numbers from N(2.0, 0.3^2)

We can then simulate the distribution of the prediction using the simulated \(a\), \(b\) and \(\sigma\).

sims <- as.matrix(M1)  # matrix of all 4000 simulations of a, b and sigma
n_sims <- nrow(sims)  # number of rows in sims
a <- sims[,1]  # vector of 4000 simulations of intercept
b <- sims[,2]  # vector of 4000 simulations of slope
sigma <- sims[,3]  # vector of 4000 simulations of sigma

y_pred <- rnorm(n_sims, a + b*x_new, sigma)

Now we can compute the point estimate of Clinton’s voting share, the MAD and the winning probability as before.

y_pred_mean <- median(y_pred)
y_pred_mad <- mad(y_pred)

win_indicator <- (y_pred > 50)
win_prob <- mean(win_indicator)

cat("Clinton's voting share: ", y_pred_mean, "\n")
Clinton's voting share:  52.37621 
cat("MAD: ", y_pred_mad, "\n")
MAD:  4.15687 
cat("Winning probability: ", win_prob, "\n")
Winning probability:  0.71425 

Notice that the point prediction (about 52.4) is essentially unchanged, while the MAD has increased (here from about 4.09 to 4.16) to reflect the extra uncertainty coming from the input.

4.2 Different types of priors in regression

Recall that in Bayesian inference, the likelihood is multiplied by a prior distribution to yield a posterior distribution.
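
In symbols, for our regression with parameters \(a\), \(b\), \(\sigma\) and data \(y\), this reads

\[ p(a,b,\sigma\mid y) \propto p(y\mid a,b,\sigma)\,p(a,b,\sigma). \]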

In previous sections, we obtained the posterior distribution without being too concerned with the prior distributions, since the data show a strong linear relationship between the two variables. When the data are less informative, we need to think more carefully about our choice of prior. We will consider three specific types of prior.

  1. Uniform prior distribution

    This is sometimes called a non-informative or flat prior. With a flat prior, the posterior is simply the product of the likelihood function and a constant, so the maximum likelihood estimate is the mode of the posterior distribution (a quick numerical check against least squares is sketched after this list).

    As we mentioned in the previous chapter, to run with flat priors, set the prior options for the intercept, the coefficients, and the scale parameter to NULL.

    M3 <- stan_glm(vote ~ growth, data=hibbs,
                   prior_intercept=NULL, prior=NULL, prior_aux=NULL,
                   refresh=0)
    print(M3)
    stan_glm
     family:       gaussian [identity]
     formula:      vote ~ growth
     observations: 16
     predictors:   2
    ------
                Median MAD_SD
    (Intercept) 46.2    1.7  
    growth       3.1    0.8  
    
    Auxiliary parameter(s):
          Median MAD_SD
    sigma 4.0    0.8   
    
    ------
    * For help interpreting the printed output see ?print.stanreg
    * For info on the priors used see ?prior_summary.stanreg

    Then we plot the simulated values of \(a\) and \(b\) from the posterior distribution.

    sims <- as.matrix(M3)
    a <- sims[,1]
    b <- sims[,2]
    plot(a, b, cex=0.3)

    We can see that the (joint) posterior distribution of \(a\) and \(b\) is approximately a bivariate normal distribution centered at the point estimates.

  2. Weakly informative prior

    Weakly informative priors contain information about the scales of the parameters, where the scales are obtained from some appropriate analysis. They are not informative priors, as they do not use substantive prior knowledge about the parameters.

    By default, the stan_glm function uses a data-dependent weakly informative prior (the priors actually used by a fitted model can be inspected with prior_summary; a sketch appears after this list).

    \[ p(a,b,\sigma) = p(a\vert b)p(b)p(\sigma), \]

    where

    1. \(p(b) = \mathcal{N}(0,2.5\ \text{sd}(y)/\text{sd}(x))\)
    2. \(p(a\vert b) = p(a+b\bar{x}\vert b) = \mathcal{N}(\bar{y}, 2.5\ \text{sd}(y))\)
    3. \(p(\sigma) = \text{Exp}(1/\text{sd}(y))\).

    The scales of the intercept and slope are obtained via the following analysis:

    For a model of the form \(y=a+bx+\text{error}\), the formulae for the OLS estimates are

    \[\begin{align*} \hat{b} &= \frac{\sum_{i=1}^n (y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{a} &= \bar{y} - \hat{b}\bar{x}. \end{align*}\]

    Applying the Cauchy-Schwarz inequality: \(\left\lvert\sum_i a_ib_i\right\rvert \leq \sqrt{\sum_i a^2_i}\sqrt{\sum_i b^2_i}\), we obtain

    \[\begin{align*} \lvert\hat{b}\rvert &= \frac{\left\lvert\sum_{i=1}^n (y_i-\bar{y})(x_i-\bar{x})\right\rvert}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ &\leq \frac{\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ &= \frac{\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}} \\ &= \frac{\text{sd}(y)}{\text{sd}(x)}. \end{align*}\]

    We use this inequality on the estimate \(\lvert\hat{b}\rvert\) to guide our prior belief about the value of \(b\). More precisely, the inequality suggests that \(\lvert b\rvert\) should not greatly exceed \(\text{sd}(y)/\text{sd}(x)\). This motivates the weakly informative prior for \(b\) used in stan_glm, namely \(\mathcal{N}(0, 2.5\ \text{sd}(y)/\text{sd}(x))\); such a prior pulls the slope estimate towards the range from \(-2.5\ \text{sd}(y)/\text{sd}(x)\) to \(2.5\ \text{sd}(y)/\text{sd}(x)\). Here, the factor \(2.5\) was chosen so that the prior does not have too much influence on the slope estimate when the data are sufficiently informative.

    For the intercept \(a\), the choice of the distribution simply comes from the following computations with \(b\) fixed:

    \[\begin{align*} \mathbb{E}[a+b\bar{x}] &= \mathbb{E}[\bar{y}] \\ \text{sd}(a+b\bar{x}) &= \text{sd}(a). \end{align*}\]

    Lastly, \(\text{Exp}(1/\text{sd}(y))\) is chosen as a prior for \(\sigma\) because it is a distribution over the positive real numbers with mean equal to \(\text{sd}(y)\).

    To fit a linear regression with the weakly informative prior, simply run

    M1 <- stan_glm(vote ~ growth, data=hibbs, refresh=0)
    print(M1)
    stan_glm
     family:       gaussian [identity]
     formula:      vote ~ growth
     observations: 16
     predictors:   2
    ------
                Median MAD_SD
    (Intercept) 46.3    1.6  
    growth       3.0    0.7  
    
    Auxiliary parameter(s):
          Median MAD_SD
    sigma 3.9    0.7   
    
    ------
    * For help interpreting the printed output see ?print.stanreg
    * For info on the priors used see ?prior_summary.stanreg
  3. Informative prior

    Informative priors are designed using prior knowledge about the variables. For the intercept, the prior_intercept argument in stan_glm is specified through our guess of \(a+b\bar{x}\), which is the value of \(y\) when \(x\) is set to its average value. In the Election vs Economy example, this corresponds to Clinton’s voting share when growth is historically average: it should be close to 50%, and it should not be less than 40% or more than 60%. Thus we set the prior for the intercept to be \(\mathcal{N}(50,10)\).

    For the slope, we should ask: how much would the voting share swing if economic growth increased by 1%? The swing is most likely not going to be more than 10%. With this information, we set the prior for the slope to be \(\mathcal{N}(5,5)\).

    Here is the code to fit a Bayesian regression model with these priors:

    M4 <- stan_glm(vote ~ growth, data=hibbs,
                   prior=normal(5,5), prior_intercept=normal(50, 10),
                   refresh=0)
    print(M4)
    stan_glm
     family:       gaussian [identity]
     formula:      vote ~ growth
     observations: 16
     predictors:   2
    ------
                Median MAD_SD
    (Intercept) 46.2    1.7  
    growth       3.1    0.7  
    
    Auxiliary parameter(s):
          Median MAD_SD
    sigma 3.9    0.7   
    
    ------
    * For help interpreting the printed output see ?print.stanreg
    * For info on the priors used see ?prior_summary.stanreg

    We see that the differences between this model and the previous models are not noticeable, since the prior contains very little information compared to the data.
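
As mentioned in the discussions of the flat and default priors above, here are two quick checks, reusing the earlier fits M1 (default priors) and M3 (flat priors) and the hibbs data. With flat priors the posterior medians should be very close to the least-squares (maximum likelihood) estimates, and prior_summary shows the data-dependent default priors, whose slope scale we can recompute by hand as \(2.5\ \text{sd}(y)/\text{sd}(x)\).

coef(lm(vote ~ growth, data=hibbs))  # least-squares estimates
coef(M3)  # posterior medians from the flat-prior fit
prior_summary(M1)  # priors actually used in the default fit
2.5*sd(hibbs$vote)/sd(hibbs$growth)  # implied slope-prior scale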

4.2.1 Example of regression with different priors: Beauty and sex ratio

Here, we consider data in which the predictor is the parents’ attractiveness on a five-point scale, and the target variable is the percentage of girl births among parents in each attractiveness category.

We fit two regression models, one with a weakly informative prior and one with an informative prior. For the informative prior, we set priors for the following parameters (a code sketch of both fits follows the list):

  1. Intercept: to find a prior, we must guess the value of \(a+b\bar{x}\), that is, the percentage of girl births for parents of average attractiveness. We have prior knowledge that the percentage of girl births is stable at roughly 48.5% to 49%, so we choose the prior to be \(\mathcal{N}(48.8,0.5^2)\).
  2. Slope: we think that the parents’ attractiveness should have barely any effect on the percentage of girl births, so we choose the prior to be \(\mathcal{N}(0,0.2^2)\).
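
With these priors, the two fits might be coded along the following lines. This is only a sketch: the data frame name sexratio and its column names x (attractiveness) and y (percentage of girl births) are assumed for illustration.

fit_default <- stan_glm(y ~ x, data=sexratio, refresh=0)  # default weakly informative prior
fit_informative <- stan_glm(y ~ x, data=sexratio,
                            prior=normal(0, 0.2), prior_intercept=normal(48.8, 0.5),
                            refresh=0)  # informative priors described above
print(fit_default)
print(fit_informative)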

Below are the plots of the regressions with the weakly informative prior and the informative prior, respectively. We can see that the data offer very little information about \(a\) and \(b\).