# 1Basic regression

We start with the basics of Bayesian regression on simulated data. We will go over the steps to in order to obtain the coefficient estimates, as well as their uncertainties. First, we load the rstanarm package, an R package for Bayesian regression modeling which we will use throughout the course.

library("rstanarm")

## 1.1 Simulation

Simulate data as follows:

\begin{align} x &= 1,2,\ldots,20 \\ y &= 0.2 + 0.3x + \epsilon \\ \epsilon &\sim N(0,0.5) \end{align}

x <- 1:20
n <- length(x)
a <- 0.2
b <- 0.3
sigma <- 0.5
y <- a + b*x + sigma*rnorm(n)

fake <- data.frame(x, y)

Plot the data

plot(fake$x, fake$y, main="Data") Fit a regression model with stan_glm

fit_1 <- stan_glm(y ~ x, data=fake)

Here, the result shows the estimated coefficients with the uncertainties (the standard errors). It also estimates $$\sigma$$.

print(fit_1, digits=2)
stan_glm
family:       gaussian [identity]
formula:      y ~ x
observations: 20
predictors:   2
------
(Intercept) 0.26   0.24
x           0.30   0.02

Auxiliary parameter(s):
sigma 0.52   0.09

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Plot the data with the fitted regression line

plot(fake$x, fake$y, main="Data and fitted regression line")
a_hat = coef(fit_1)
b_hat = coef(fit_1)
abline(a_hat, b_hat) Here are the summary of the parameters.

Parameter Assumed value Estimate Uncertainty
$$a$$ 0.2 0.55 0.26
$$b$$ 0.3 0.28 0.02
$$\sigma$$ 0.5 0.54 0.09

From the properties of the standard normal distribution, for 68% of the time the true intercept $$a$$ is between $$0.55-0.26=0.29$$ and $$0.55+0.26=0.81$$, and for 95% of the time it is between $$0.55-2\times 0.26=0.03$$ and $$0.55+2\times 0.26=1.07$$.

## 1.2 Earnings data

earnings <- read.csv("data/earnings.csv")

Fit a regression model

fit_2 <- stan_glm(earnk ~ height + male, data=earnings)
print(fit_2)
stan_glm
family:       gaussian [identity]
formula:      earnk ~ height + male
observations: 1816
predictors:   3
------
(Intercept) -25.8   12.0
height        0.6    0.2
male         10.6    1.5

Auxiliary parameter(s):
sigma 21.4    0.3

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

The fitted model is:

$\text{earnings} = -25.7 + 0.6*\text{height} + 10.6 *\text{male} + \text{error}.$

How should we interpret the coefficients here? Let’s look at the following suggestions:

1. If we were to increase someone’s height by one inch, his or her earning would increase by an expected amount of $600. 2. Comparing two people with the same sex but one inch different in height, the average difference in earnings is$600.

Between these two choices, the latter is more sensible. Comparison is a safer interpretation of the coefficient than the effect.

Similarly, it is more appropriate to say that, comparing two people with the same height but different sex, the man’s earnings will be $10600 more than the woman’s on average. ## 1.3 Historical origins of regression The term regression comes from Francis Galton, a quantitative social scientist, who fit linear models to understand parent-child height relationship. He noticed that: 1. Children of tall parents tended to be taller than average but less tall than the parents. 2. Children of shorter parents tended to be shorter than the average but less short than the parents. Thus, over time, people’s heights have regressed to the mean, hence the term regression. Let’s look at the data of people’s heights published by Karl Pearson and Alice Lee in 1903. heights <- read.table("data/Heights.txt", header=TRUE) print(heights[1:5,])  daughter_height mother_height 1 52.5 59.5 2 52.5 59.5 3 53.5 59.5 4 53.5 59.5 5 55.5 59.5 Now we fit a regression model to the data. fit_3 <- stan_glm(daughter_height ~ mother_height, data=heights) print(fit_3) stan_glm family: gaussian [identity] formula: daughter_height ~ mother_height observations: 5524 predictors: 2 ------ Median MAD_SD (Intercept) 29.8 0.8 mother_height 0.5 0.0 Auxiliary parameter(s): Median MAD_SD sigma 2.3 0.0 ------ * For help interpreting the printed output see ?print.stanreg * For info on the priors used see ?prior_summary.stanreg Let’s plot the data and the regression line. plot(heights$mother_height, heights$daughter_height, main="Data and the fitted regression line") a_hat = coef(fit_3) b_hat = coef(fit_3) abline(a_hat, b_hat) Let’s take a look at the average of daughters’ and mothers’ heights. print(mean(heights$daughter_height))
 63.85626
print(mean(heights\$mother_height))
 62.49873

The equation of the regression line is

$y = 29.8 + 0.5x + \text{error}$

The line’s slope of 0.5 is easy to interpret—adding one inch to the mother’s height corresponds to an increase in the daughter’s height by 0.5 inches.

If the mother’s height is 70 inches, then the daughter’s height is 64.8 on average, so the daughter is less tall than the mother’s but still taller than the average.

If the mother’s height is 50 inches, then the daughter’s height is 54.8 on average, so the daughter is less short than the mother’s but still shorter than the average.

Looking back at the equation, we see $$<1$$ coefficient which reduces variation of the daughters’ heights, while the error term adds to the variation.

## 1.4 How regression to the mean can confuse people

Sometimes regression to the mean can lead people to mistakenly attribute it to causality. Here’s a simulation of midterm and final scores of a group of students. Each score is composed of two components: the student’s true ability and a random noise.

n <- 1000
true_ability <- rnorm(n, 50, 10)
noise_1 <- rnorm(n, 0, 10)
noise_2 <- rnorm(n, 0, 10)
midterm <- true_ability + noise_1
final <- true_ability + noise_2
exams <- data.frame(midterm, final)
fit_4 <- stan_glm(final ~ midterm, data=exams)
print(fit_4)
stan_glm
family:       gaussian [identity]
formula:      final ~ midterm
observations: 1000
predictors:   2
------
(Intercept) 23.3    1.4
midterm      0.5    0.0

Auxiliary parameter(s):
* For info on the priors used see ?prior_summary.stanreg
plot(midterm, final, xlab="Midterm score", ylab="Final score")
abline(coef(fit_4)) 