1  Basic regression

We start with the basics of Bayesian regression on simulated data. We will go over the steps needed to obtain the coefficient estimates as well as their uncertainties. First, we load the rstanarm package, an R package for Bayesian regression modeling that we will use throughout the course.

library("rstanarm")

1.1 Simulation

Simulate data as follows:

\[\begin{aligned} x &= 1,2,\ldots,20 \\ y &= 0.2 + 0.3x + \epsilon \\ \epsilon &\sim N(0,0.5) \end{aligned}\]

x <- 1:20
n <- length(x)
a <- 0.2
b <- 0.3
sigma <- 0.5
y <- a + b*x + sigma*rnorm(n)

fake <- data.frame(x, y)

Plot the data

plot(fake$x, fake$y, main="Data")

Fit a regression model with stan_glm

fit_1 <- stan_glm(y ~ x, data=fake)

Here, the printed result shows the estimated coefficients (posterior medians) along with their uncertainties (the MAD_SD values, which play the role of standard errors). It also gives an estimate of \(\sigma\).

print(fit_1, digits=2)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 20
 predictors:   2
------
            Median MAD_SD
(Intercept) 0.26   0.24  
x           0.30   0.02  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.52   0.09  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
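The Median and MAD_SD columns can also be reproduced directly from the posterior draws. Here is a minimal sketch, assuming the fit_1 object from above; R's mad() is the scaled median absolute deviation, which is what the MAD_SD column reports.

draws <- as.matrix(fit_1)   # posterior draws, one column per parameter (including sigma)
apply(draws, 2, median)     # corresponds to the Median column
apply(draws, 2, mad)        # corresponds to the MAD_SD column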

Plot the data with the fitted regression line

plot(fake$x, fake$y, main="Data and fitted regression line")
a_hat <- coef(fit_1)[1]
b_hat <- coef(fit_1)[2]
abline(a_hat, b_hat)
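To visualize the uncertainty around the fitted line, one option is to overlay regression lines corresponding to a few posterior draws; a rough sketch, reusing the plot above:

sims <- as.matrix(fit_1)   # posterior draws of (Intercept), x, and sigma
for (i in sample(nrow(sims), 20)) {
  abline(sims[i, "(Intercept)"], sims[i, "x"], col="gray")   # one plausible line per draw
}
abline(a_hat, b_hat)   # redraw the median line on top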

Here is a summary of the parameters.

Parameter    Assumed value   Estimate   Uncertainty
\(a\)        0.2             0.26       0.24
\(b\)        0.3             0.30       0.02
\(\sigma\)   0.5             0.52       0.09

From the properties of the normal distribution, there is roughly a 68% chance that the true intercept \(a\) lies between \(0.26-0.24=0.02\) and \(0.26+0.24=0.50\), and roughly a 95% chance that it lies between \(0.26-2\times 0.24=-0.22\) and \(0.26+2\times 0.24=0.74\).
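These intervals can also be read off directly from the posterior draws, without the normal approximation; for example:

posterior_interval(fit_1, prob=0.68)   # central 68% posterior intervals
posterior_interval(fit_1, prob=0.95)   # central 95% posterior intervals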

1.2 Earnings data

earnings <- read.csv("data/earnings.csv")

Fit a regression model

fit_2 <- stan_glm(earnk ~ height + male, data=earnings)
print(fit_2)
stan_glm
 family:       gaussian [identity]
 formula:      earnk ~ height + male
 observations: 1816
 predictors:   3
------
            Median MAD_SD
(Intercept) -25.8   12.0 
height        0.6    0.2 
male         10.6    1.5 

Auxiliary parameter(s):
      Median MAD_SD
sigma 21.4    0.3  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The fitted model is:

\[ \text{earnings} = -25.8 + 0.6\cdot\text{height} + 10.6\cdot\text{male} + \text{error}, \]

where earnings (the earnk variable) are measured in thousands of dollars.

How should we interpret the coefficients here? Let’s look at the following suggestions:

  1. If we were to increase someone’s height by one inch, his or her earnings would increase by $600 on average.
  2. Comparing two people of the same sex who differ by one inch in height, the average difference in earnings is $600.

Between these two choices, the latter is more sensible: interpreting the coefficient as a comparison between people is safer than interpreting it as the effect of an intervention.

Similarly, it is more appropriate to say that, comparing two people with the same height but different sex, the man earns on average $10,600 more than the woman.
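We can check such comparisons from the fitted model itself. Here is a minimal sketch using rstanarm’s posterior_epred() with two hypothetical people who differ only in height; the exact result will vary from run to run.

new <- data.frame(height=c(69, 70), male=c(1, 1))   # two hypothetical men, one inch apart
pred <- posterior_epred(fit_2, newdata=new)         # posterior draws of expected earnings (in $1000s)
median(pred[,2] - pred[,1])                         # roughly 0.6, i.e. about $600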

1.3 Historical origins of regression

The term regression comes from Francis Galton, a quantitative social scientist who fit linear models to understand the relationship between parents’ and children’s heights. He noticed that:

  1. Children of tall parents tended to be taller than average but shorter than their parents.
  2. Children of short parents tended to be shorter than average but taller than their parents.

Heights thus appeared to regress toward the mean from one generation to the next, hence the term regression.

Let’s look at data on mothers’ and daughters’ heights published by Karl Pearson and Alice Lee in 1903.

heights <- read.table("data/Heights.txt", header=TRUE)
print(heights[1:5,])
  daughter_height mother_height
1            52.5          59.5
2            52.5          59.5
3            53.5          59.5
4            53.5          59.5
5            55.5          59.5

Now we fit a regression model to the data.

fit_3 <- stan_glm(daughter_height ~ mother_height, data=heights)
print(fit_3)
stan_glm
 family:       gaussian [identity]
 formula:      daughter_height ~ mother_height
 observations: 5524
 predictors:   2
------
              Median MAD_SD
(Intercept)   29.8    0.8  
mother_height  0.5    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 2.3    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Let’s plot the data and the regression line.

plot(heights$mother_height, heights$daughter_height, main="Data and the fitted regression line")
a_hat <- coef(fit_3)[1]
b_hat <- coef(fit_3)[2]
abline(a_hat, b_hat)

Let’s take a look at the averages of the daughters’ and mothers’ heights.

print(mean(heights$daughter_height))
[1] 63.85626
print(mean(heights$mother_height))
[1] 62.49873

The equation of the regression line is

\[ y = 29.8 + 0.5x + \text{error} \]

The line’s slope of 0.5 is easy to interpret: an additional inch of mother’s height corresponds to an additional 0.5 inches of daughter’s height, on average.

If the mother’s height is 70 inches, then the daughter’s predicted height is \(29.8 + 0.5 \times 70 = 64.8\) inches, so on average the daughter is shorter than her mother but still taller than the average.

If the mother’s height is 50 inches, then the daughter’s predicted height is \(29.8 + 0.5 \times 50 = 54.8\) inches, so on average the daughter is taller than her mother but still shorter than the average.
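These predictions can be checked directly from the fitted coefficients; because the estimates depend on the posterior draws, the results will be approximately, not exactly, 64.8 and 54.8.

a_hat + b_hat*70   # predicted average daughter height for a 70-inch mother
a_hat + b_hat*50   # predicted average daughter height for a 50-inch mother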

Looking back at the equation, the coefficient of less than 1 pulls predictions toward the mean and thus reduces the variation of the daughters’ heights, while the error term adds variation back.
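As a rough numerical check, under the model the standard deviation of the daughters’ heights should be close to \(\sqrt{b^2\,\mathrm{Var}(x) + \sigma^2}\), where \(x\) is the mother’s height. A sketch, assuming rstanarm’s sigma() extractor for the residual standard deviation:

sd(heights$daughter_height)                                   # observed variation in daughters' heights
sqrt(b_hat^2 * var(heights$mother_height) + sigma(fit_3)^2)   # variation implied by the model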

1.4 How regression to the mean can confuse people

Regression to the mean can sometimes lead people to mistakenly attribute it to a causal effect. Here’s a simulation of midterm and final exam scores for a group of students. Each score is the sum of two components: the student’s true ability and random noise.

n <- 1000
true_ability <- rnorm(n, 50, 10)
noise_1 <- rnorm(n, 0, 10)
noise_2 <- rnorm(n, 0, 10)
midterm <- true_ability + noise_1
final <- true_ability + noise_2
exams <- data.frame(midterm, final)
fit_4 <- stan_glm(final ~ midterm, data=exams)
print(fit_4)
stan_glm
 family:       gaussian [identity]
 formula:      final ~ midterm
 observations: 1000
 predictors:   2
------
            Median MAD_SD
(Intercept) 23.3    1.4  
midterm      0.5    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 12.3    0.3  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
plot(midterm, final, xlab="Midterm score", ylab="Final score")
abline(coef(fit_4))

One might infer from the coefficient of 0.5 that the students who did well on the midterm got overconfident and slacked off before the final, while the students who did poorly on the midterm were motivated and tried extra hard for the final. But we know that this is not the case, since we generated the data ourselves! The real reason behind the regression to the mean is the variation between the midterm and final scores: a student who scores very well on the midterm (e.g. the students with midterm scores near 100 on the far right of the plot) is likely to have a high level of skill, but also to have been lucky on the midterm (i.e. a large positive noise term), so on the final exam that student tends to perform better than average but worse than on the midterm.
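We can see this pattern numerically in the simulated data; a quick sketch:

mean(exams$final[exams$midterm > 80])   # high midterm scorers: above the overall mean, but below their midterm scores on average
mean(exams$final[exams$midterm < 20])   # low midterm scorers: below the overall mean, but above their midterm scores on average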