2  Linear regression with a single predictor

As usual, we load the rstanarm package.

library("rstanarm")

2.1 Predicting presidential vote share from the economy

We will load the data from hibbs.dat which was created by Douglas Hibbs to forecast elections based on economic growth. Two important variables are: growth, the economic growth in the previous year and vote, the incumbent party’s vote percentage.

hibbs <- read.table("data/hibbs.dat", 
                    header=TRUE)
head(hibbs, 5)
  year growth  vote inc_party_candidate other_candidate
1 1952   2.40 44.60           Stevenson      Eisenhower
2 1956   2.89 57.76          Eisenhower       Stevenson
3 1960   0.85 49.91               Nixon         Kennedy
4 1964   4.21 61.34             Johnson       Goldwater
5 1968   3.02 49.60            Humphrey           Nixon

Plot the data

plot(hibbs$growth, hibbs$vote, xlab="Economic growth",
     ylab="Incumbent party's vote share")

Fit a regression model with stan_glm:

M1 <- stan_glm(vote ~ growth, data=hibbs, 
               refresh=0)  # suppress output

Display the model:

print(M1)
stan_glm
 family:       gaussian [identity]
 formula:      vote ~ growth
 observations: 16
 predictors:   2
------
            Median MAD_SD
(Intercept) 46.3    1.7  
growth       3.0    0.7  

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.9    0.7   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The fitted line is

\[ y = 46.3 + 3.0x. \]

Plot the fitted regression line:

plot(hibbs$growth, hibbs$vote, xlab="Economic growth",
     ylab="Incumbent party's vote share")
abline(coef(M1))

  • At \(x=0\) (zero economic growth), the incumbent party is predicted to receive 46.3% of the votes. This makes sense, as people are less likely to vote a party with poor performance.

  • Every 1% of economic growth corresponds to an expected 3.0% increase in vote share for the incumbent party.

  • The 68% confidence interval of the slope is \([3.0 \pm 0.7]=[2.3,3.7]\) and the 95% confidence interval is \([3.0 \pm 2\times 0.7]=[1.6,4.4]\). It would be very unlikely that the data is generated from a model whose true slope is 0.

  • The estimated residual standard deviation (the standard deviation of the error term) is 3.9%. This means that roughly 68% of the true vote percentages fall within \(\pm 3.9\) of the regression line.

2.1.1 Predicting the 2008 election

In the years leading up to the 2008 election, the economic growth was approximately 0.1% or \(x=0.1\). The linear model, \(y=46.3+3.0x\), predicted \(y=46.6\), or 46.6% of the vote going to the incumbent party, which was the Republicans at that time. It thus predicts 53.4% for Barack Obama, implying Democrats’ victory in 2008.

2.1.2 Predicting the 2016 election

We now use the model to predict the 2016 presidential election of Democrat Hillary Clinton vs. Republican Donald Trump. At that time, the economic growth was approximately 2%. The linear model predicted

\[ 46.3 + 3.0\times2.0 = 52.3. \]

In other words, the model predicted that Clinton would have won the 2016 election, when in fact the winner was actually Trump.

Maybe we should have taken the uncertainty into account as well. We could ask ourselves: what is the chance that Clinton would win in that year? To answer this question, we recall that our model also has the error term:

\[ y = 46.3 + 3.0x + \varepsilon, \quad \varepsilon\sim\mathcal{N}(0, 3.9). \]

Plugging in $x=2.0$ yields a random variable \(y\):

\[y = 46.3 + 3.0\times 2.0 +\varepsilon = 52.3 +\varepsilon \sim \mathcal{N}(52.3, 3.9).\]

The distribution of \(y\) is shown below. Clinton would have won if the vote share is greater than 50%. Thus the probability of Clinton winning the election is \(\Pr[y>50] = 0.72\).

Forecast distribution

The shaded area can be computed using the following code:

1-pnorm(50, 52.3, 3.9)
[1] 0.7223187

2.2 Checking the model’s fit via simulation

We will demonstrate how to check the model’s fit using the elections data above.

2.2.0.1 Step 1: Creating parameters from the fitted model

a <- 46.3
b <- 3.0
sigma <- 3.9
x <- hibbs$growth
n <- length(x)

2.2.0.2 Step 2: Simulating fake data

y <- a + b*x + rnorm(n, 0, sigma)
fake <- data.frame(x, y)

2.2.0.3 Step 3: Fitting the model and comparing fitted to assumed parameters

fit <- stan_glm(y ~ x, data=fake, refresh=0)
print(fit)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 16
 predictors:   2
------
            Median MAD_SD
(Intercept) 46.3    1.8  
x            4.1    0.7  

Auxiliary parameter(s):
      Median MAD_SD
sigma 4.1    0.8   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The estimated coefficients (48.5 and 2.0) seem close enough to the assumed true values (46.3 and 3.0).

We can compare the coefficients formally by checking if the true values falls within 68% and 95% confidence intervals of the estimated coefficients. For simplicity, we will only do this for the slope \(b\).

b_hat <- coef(fit)["x"]
b_se <- se(fit)["x"]
print(b_hat)
       x 
4.105981 
print(b_se)
        x 
0.7335738 

We then check whether the assumed true value of \(b\) falls within the 68% and 95% confidence intervals. However, since the original data is small, with a sample size of only 16, we need to use \(t\)-distribution instead of the normal distribution.

t_68 <- qt(0.84, n-2)
t_95 <- qt(0.975, n-2)
cover_68 <- abs(b - b_hat) < t_68 * b_se
cover_95 <- abs(b - b_hat) < t_95 * b_se
paste("68% coverage: ", mean(cover_68))
[1] "68% coverage:  0"
paste("95% coverage: ", mean(cover_95))
[1] "95% coverage:  1"

2.2.0.4 Step 4: Repeating the simulation in a loop

Now, we have to repeat the simulation several times and see if the *coverage probabilities*, that is, the probabilities that the confidence intervals contain the true coefficient, are close to 68% and 95%, respectively.

n_fake <- 1000
cover_68 <- rep(NA, n_fake)  # c(NA, ... , NA)
cover_95 <- rep(NA, n_fake)  # c(NA, ... , NA)
t_68 <- qt(0.84, n-2)
t_95 <- qt(0.975, n-2)
pb <- txtProgressBar(min=0, max=n_fake, initial=0, style=3)
for (s in 1:n_fake){
  setTxtProgressBar(pb, s)
  y <- a + b*x + rnorm(n, 0, sigma)
  fake <- data.frame(x, y)
  fit <- stan_glm(y ~ x, data = fake, refresh = 0)

  b_hat <- coef(fit)["x"]
  b_se <- se(fit)["x"]

  cover_68[s] <- abs(b - b_hat) < t_68 * b_se
  cover_95[s] <- abs(b - b_hat) < t_95 * b_se
}
close(pb)
paste("68% coverage: ", mean(cover_68))
[1] "68% coverage:  0.702"
paste("95% coverage: ", mean(cover_95))
[1] "95% coverage:  0.949"

This simulation gives the desired result: close to 68% of 68% confidence intervals, and close to 95% of 95% confidence intervals, contain the true coefficients.