library("rstanarm")
1 Basic regression
We start with the basics of Bayesian regression on simulated data. We will go over the steps to in order to obtain the coefficient estimates, as well as their uncertainties. First, we load the rstanarm
package, an R package for Bayesian regression modeling which we will use throughout the course.
1.1 Simulation
Simulate data as follows:
\[\begin{align} x &= 1,2,\ldots,20 \\ y &= 0.2 + 0.3x + \epsilon \\ \epsilon &\sim N(0,0.5) \end{align}\]
<- 1:20
x <- length(x)
n <- 0.2
a <- 0.3
b <- 0.5
sigma <- a + b*x + sigma*rnorm(n)
y
<- data.frame(x, y) fake
Plot the data
plot(fake$x, fake$y, main="Data")
Fit a regression model with stan_glm
<- stan_glm(y ~ x, data=fake) fit_1
Here, the result shows the estimated coefficients with the uncertainties (the standard errors). It also estimates \(\sigma\).
print(fit_1, digits=2)
stan_glm
family: gaussian [identity]
formula: y ~ x
observations: 20
predictors: 2
------
Median MAD_SD
(Intercept) 0.26 0.24
x 0.30 0.02
Auxiliary parameter(s):
Median MAD_SD
sigma 0.52 0.09
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Plot the data with the fitted regression line
plot(fake$x, fake$y, main="Data and fitted regression line")
= coef(fit_1)[1]
a_hat = coef(fit_1)[2]
b_hat abline(a_hat, b_hat)
Here are the summary of the parameters.
Parameter | Assumed value | Estimate | Uncertainty |
---|---|---|---|
\(a\) | 0.2 | 0.55 | 0.26 |
\(b\) | 0.3 | 0.28 | 0.02 |
\(\sigma\) | 0.5 | 0.54 | 0.09 |
From the properties of the standard normal distribution, for 68% of the time the true intercept \(a\) is between \(0.55-0.26=0.29\) and \(0.55+0.26=0.81\), and for 95% of the time it is between \(0.55-2\times 0.26=0.03\) and \(0.55+2\times 0.26=1.07\).
1.2 Earnings data
<- read.csv("data/earnings.csv") earnings
Fit a regression model
<- stan_glm(earnk ~ height + male, data=earnings) fit_2
print(fit_2)
stan_glm
family: gaussian [identity]
formula: earnk ~ height + male
observations: 1816
predictors: 3
------
Median MAD_SD
(Intercept) -25.8 12.0
height 0.6 0.2
male 10.6 1.5
Auxiliary parameter(s):
Median MAD_SD
sigma 21.4 0.3
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The fitted model is:
\[ \text{earnings} = -25.7 + 0.6*\text{height} + 10.6 *\text{male} + \text{error}. \]
How should we interpret the coefficients here? Let’s look at the following suggestions:
- If we were to increase someone’s height by one inch, his or her earning would increase by an expected amount of $600.
- Comparing two people with the same sex but one inch different in height, the average difference in earnings is $600.
Between these two choices, the latter is more sensible. Comparison is a safer interpretation of the coefficient than the effect.
Similarly, it is more appropriate to say that, comparing two people with the same height but different sex, the man’s earnings will be $10600 more than the woman’s on average.
1.3 Historical origins of regression
The term regression comes from Francis Galton, a quantitative social scientist, who fit linear models to understand parent-child height relationship. He noticed that:
- Children of tall parents tended to be taller than average but less tall than the parents.
- Children of shorter parents tended to be shorter than the average but less short than the parents.
Thus, over time, people’s heights have regressed to the mean, hence the term regression.
Let’s look at the data of people’s heights published by Karl Pearson and Alice Lee in 1903.
<- read.table("data/Heights.txt", header=TRUE)
heights print(heights[1:5,])
daughter_height mother_height
1 52.5 59.5
2 52.5 59.5
3 53.5 59.5
4 53.5 59.5
5 55.5 59.5
Now we fit a regression model to the data.
<- stan_glm(daughter_height ~ mother_height, data=heights) fit_3
print(fit_3)
stan_glm
family: gaussian [identity]
formula: daughter_height ~ mother_height
observations: 5524
predictors: 2
------
Median MAD_SD
(Intercept) 29.8 0.8
mother_height 0.5 0.0
Auxiliary parameter(s):
Median MAD_SD
sigma 2.3 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Let’s plot the data and the regression line.
plot(heights$mother_height, heights$daughter_height, main="Data and the fitted regression line")
= coef(fit_3)[1]
a_hat = coef(fit_3)[2]
b_hat abline(a_hat, b_hat)
Let’s take a look at the average of daughters’ and mothers’ heights.
print(mean(heights$daughter_height))
[1] 63.85626
print(mean(heights$mother_height))
[1] 62.49873
The equation of the regression line is
\[ y = 29.8 + 0.5x + \text{error} \]
The line’s slope of 0.5 is easy to interpret—adding one inch to the mother’s height corresponds to an increase in the daughter’s height by 0.5 inches.
If the mother’s height is 70 inches, then the daughter’s height is 64.8 on average, so the daughter is less tall than the mother’s but still taller than the average.
If the mother’s height is 50 inches, then the daughter’s height is 54.8 on average, so the daughter is less short than the mother’s but still shorter than the average.
Looking back at the equation, we see \(<1\) coefficient which reduces variation of the daughters’ heights, while the error term adds to the variation.
1.4 How regression to the mean can confuse people
Sometimes regression to the mean can lead people to mistakenly attribute it to causality. Here’s a simulation of midterm and final scores of a group of students. Each score is composed of two components: the student’s true ability and a random noise.
<- 1000
n <- rnorm(n, 50, 10)
true_ability <- rnorm(n, 0, 10)
noise_1 <- rnorm(n, 0, 10)
noise_2 <- true_ability + noise_1
midterm <- true_ability + noise_2
final <- data.frame(midterm, final) exams
<- stan_glm(final ~ midterm, data=exams) fit_4
print(fit_4)
stan_glm
family: gaussian [identity]
formula: final ~ midterm
observations: 1000
predictors: 2
------
Median MAD_SD
(Intercept) 23.3 1.4
midterm 0.5 0.0
Auxiliary parameter(s):
Median MAD_SD
sigma 12.3 0.3
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
plot(midterm, final, xlab="Midterm score", ylab="Final score")
abline(coef(fit_4))
One might infer from the coefficient of 0.5 that the students who did well on the midterm got overconfident and slacked off before the final, and the students who did poor on the midterm were motivated and tried extra hard for the final. But we know that this is not the case, as we generated this data ourselves! The real reason behind the regression to the mean is the variation between the midterms and the final scores: a student who scores very well on the first midterm (e.g. the two students with ~100 midterms score on the far right) are likely to have a high level of skill, and also was very lucky at the midterm (i.e. large positive noise) and so in the final exam, the student performs better than average but worse than on the midterm.