# 3 Fitting linear regression

## 3.1 Least squares

The classic linear regression model is:

$y_i = a+bx_i+\varepsilon_i.$

Define the residuals as

$r_i = y_i - (\hat{a}+\hat{b}x_i).$

Note that the residuals are different from the errors $$\varepsilon_i = y_i-(a+bx_i)$$, which cannot be computed from the observed data because $$a$$ and $$b$$ are unknown.

In least squares regression, we choose the estimates $$(\hat{a},\hat{b})$$ that minimize the residual sum of squares:

$\text{RSS} = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\hat{a}+\hat{b}x_i))^2.$

The $$(\hat{a},\hat{b})$$ that minimizes RSS is called the ordinary least squares or OLS estimate, which is given by

\begin{align*} \hat{b} &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2}, \\ \hat{a} &= \bar{y} - \hat{b}\bar{x}. \end{align*}

Consequently, we can write the line equation as

$y = \hat{a}+\hat{b}x = \bar{y} - \hat{b}\bar{x} +\hat{b}x = \bar{y} + \hat{b}(x-\bar{x}).$

Thus, the fitted line passes through the mean of the data $$(\bar{x},\bar{y})$$.
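
As a quick numerical check of these formulas, here is a minimal R sketch (using the small fake dataset that also appears in the rstanarm example of Section 3.4) that computes $$(\hat{a},\hat{b})$$ from the closed-form expressions and compares them with R's built-in lm():

x <- 1:10
y <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

# OLS estimates from the closed-form expressions
b_hat <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

c(a_hat, b_hat)
coef(lm(y ~ x))   # should agree with the values above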

## 3.2 Estimation of residual standard deviation $$\sigma$$

Recall that we assume $$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$$. To find an unbiased estimator of $$\sigma^2$$, we use the following fact:

$\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-2}.$

Combined with the fact that the expectation of a chi-square random variable equals its number of degrees of freedom, we have

$\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n-2,$

or equivalently,

$\mathbb{E}\left[\frac{\text{RSS}}{n-2}\right] = \sigma^2.$

Therefore, $$\frac{\text{RSS}}{n-2}$$ is an unbiased estimator of $$\sigma^2$$, and we estimate the residual standard deviation $$\sigma$$ using

$\hat{\sigma} = \sqrt{\frac{\text{RSS}}{n-2}}=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2}.$
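
Continuing the sketch from Section 3.1 (reusing x, y, a_hat, and b_hat), this estimate can be computed directly and compared with the value reported by lm():

n <- length(y)
rss <- sum((y - (a_hat + b_hat * x))^2)   # residual sum of squares
sigma_hat <- sqrt(rss / (n - 2))          # divide by n - 2, not n

sigma_hat
sigma(lm(y ~ x))   # should agree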

## 3.3 Maximum likelihood estimation

From the model $$y_i=a+bx_i+\varepsilon_i$$ where $$\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$$, it follows that $$y_i\sim \mathcal{N}(a+bx_i,\sigma^2)$$. The likelihood function is defined as the probability density function of the data, considered as a function of the parameters. Let $$y=(y_1,\ldots,y_n)$$ and $$X=(x_1,\ldots,x_n)$$. Then, the likelihood function in terms of $$\hat{a},\hat{b}$$ and $$\hat{\sigma}$$ is

\begin{align*} p(y\mid \hat{a},\hat{b},\hat{\sigma},X) &= \prod_{i=1}^n p(y_i\mid \hat{a},\hat{b},\hat{\sigma},X) \\ &= \frac{1}{(\sqrt{2\pi}\hat{\sigma})^n}\prod_{i=1}^n\exp\left(-\frac{1}{2}\left(\frac{y_i-(\hat{a}+\hat{b}x_i)}{\hat{\sigma}}\right)^2\right) \\ &= \frac{1}{(\sqrt{2\pi}\hat{\sigma})^n}\exp\left(-\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2\right). \end{align*}

Another way to estimate the parameters $$a,b$$ and $$\sigma$$ is to find the $$\hat{a},\hat{b}$$ and $$\hat{\sigma}$$ that maximize the likelihood function. It is common to work with the log-likelihood function instead:

$\log p(y\mid \hat{a},\hat{b},\hat{\sigma},X) = -\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2-n\log \left(\sqrt{2\pi}\hat{\sigma}\right).$
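
As an aside, this log-likelihood is easy to code directly; here is a minimal sketch (reusing x, y, n, a_hat, b_hat, and sigma_hat from the earlier snippets) that checks it against the sum of normal log-densities from dnorm():

# log-likelihood as a function of (a, b, sigma)
loglik <- function(a, b, sigma) {
  -sum((y - (a + b * x))^2) / (2 * sigma^2) - n * log(sqrt(2 * pi) * sigma)
}

# sanity check: should equal the sum of normal log-densities
loglik(a_hat, b_hat, sigma_hat)
sum(dnorm(y, mean = a_hat + b_hat * x, sd = sigma_hat, log = TRUE))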

We can see that maximizing the log-likelihood function with respect to $$\hat{a}$$ and $$\hat{b}$$ is equivalent to minimizing the residual sum of squares. In other words, in linear regression, the maximum likelihood estimates of $$a$$ and $$b$$ are the same as the OLS estimates. To find the maximum likelihood estimate of $$\sigma$$, we set the derivative of the log-likelihood with respect to $$\hat{\sigma}$$ equal to zero:

$\frac{1}{\hat{\sigma}^3}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2-\frac{n}{\hat{\sigma}} = 0,$

which gives

$\hat{\sigma}_{\text{mle}} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2}.$

We observe that the maximum likelihood estimate is biased: it divides the RSS by $$n$$ rather than $$n-2$$, so $$\hat{\sigma}^2_{\text{mle}}$$ underestimates $$\sigma^2$$ on average.
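
The bias can be illustrated with a small simulation (a sketch with made-up true values $$a=1$$, $$b=2$$, $$\sigma=3$$): averaging $$\hat{\sigma}^2_{\text{mle}}$$ over many simulated datasets falls noticeably below $$\sigma^2$$, while $$\text{RSS}/(n-2)$$ averages close to $$\sigma^2$$.

# simulate many datasets from a known model and average the two variance estimates
set.seed(123)
a_true <- 1; b_true <- 2; sigma_true <- 3
x0 <- 1:10
n0 <- length(x0)

reps <- replicate(10000, {
  y0 <- a_true + b_true * x0 + rnorm(n0, 0, sigma_true)
  rss0 <- sum(resid(lm(y0 ~ x0))^2)
  c(mle = rss0 / n0, unbiased = rss0 / (n0 - 2))
})

rowMeans(reps)   # the "mle" entry falls below sigma_true^2 = 9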

Below is a plot of the likelihood function as a function of $$a$$ and $$b$$, with $$\sigma$$ fixed.

The second plot shows a level curve of the likelihood near the maximum likelihood estimate.
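
The plots referenced above can be reproduced along the following lines, fixing $$\sigma$$ at $$\hat{\sigma}$$ and evaluating the likelihood on a grid of $$(a,b)$$ values (a sketch reusing loglik, a_hat, b_hat, and sigma_hat from the earlier snippets; the grid ranges are arbitrary choices around the OLS estimates):

# evaluate the likelihood on a grid of (a, b) values with sigma fixed at sigma_hat
a_grid <- seq(a_hat - 10, a_hat + 10, length.out = 100)
b_grid <- seq(b_hat - 2, b_hat + 2, length.out = 100)
lik <- outer(a_grid, b_grid,
             Vectorize(function(a, b) exp(loglik(a, b, sigma_hat))))

persp(a_grid, b_grid, lik, xlab = "a", ylab = "b", zlab = "likelihood")   # surface
contour(a_grid, b_grid, lik, xlab = "a", ylab = "b")                      # level curves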

## 3.4 Bayesian linear regression

In Bayesian inference, we start by specifying a prior distribution on the parameters; in this case, the parameters are $$a,b$$ and $$\sigma^2$$:

$p(a,b,\sigma^2).$

Examples of prior distributions are:

1. Flat prior: $$p(a,b,\sigma^2)=1$$.
2. $$p(a,b,\sigma^2)=p(a,b\mid \sigma^2)p(\sigma^2)$$, where $$p(a,b\mid \sigma^2)$$ is a normal distribution and $$p(\sigma^2)$$ is an inverse-gamma distribution.

Then, we specify the likelihood function. In linear regression, this is usually the normal likelihood.

$p(y\mid a,b,\sigma^2,X) = \frac{1}{(\sqrt{2\pi}\sigma)^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-(a+bx_i))^2\right).$

We use the likelihood, which carries the information from the data, to update our beliefs about the parameters via Bayes' rule:

$p(a,b,\sigma^2 \mid y,X) \propto p(y\mid a,b,\sigma^2,X)p(a,b,\sigma^2).$

The left-hand side is called the posterior distribution. We then draw the parameters from the posterior distribution and use them to simulate posterior quantities, such as the posterior mean of $$y$$ or posterior intervals.

Here is a code example of Bayesian regression:

library(rstanarm)
x <- 1:10
y <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
fake <- data.frame(x, y)

To fit the Bayesian regression with stan_glm's default (weakly informative) priors,

fit1 <- stan_glm(y ~ x, data=fake)
print(fit1)
stan_glm
family:       gaussian [identity]
formula:      y ~ x
observations: 10
predictors:   2
------
            Median MAD_SD
(Intercept) -13.7    6.7
x             5.1    1.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 10.0    2.6

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

To fit the Bayesian regression with a flat prior (in this case, the posterior is proportional to the likelihood),

fit2 <- stan_glm(y ~ x, data=fake, prior_intercept=NULL,
                 prior=NULL, prior_aux=NULL)

Here, prior_intercept=NULL sets a flat prior for the intercept, prior=NULL sets a flat prior for the other coefficients, and prior_aux=NULL sets a flat prior for the auxiliary parameter, which in the Gaussian model is the residual standard deviation $$\sigma$$.

print(fit2)
stan_glm
family:       gaussian [identity]
formula:      y ~ x
observations: 10
predictors:   2
------
            Median MAD_SD
(Intercept) -13.8    7.2
x             5.1    1.2

Auxiliary parameter(s):
      Median MAD_SD
sigma 10.4    2.9

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
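
As the printed notes suggest, the priors actually used in each fit can be inspected with prior_summary():

prior_summary(fit1)   # default weakly informative priors
prior_summary(fit2)   # flat priors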

By default, stan_glm uses simulation to fit the model. To use optimization instead, set algorithm="optimizing". Because the priors here are flat, the following code computes the maximum likelihood estimates.

fit3 <- stan_glm(y ~ x, data=fake, prior_intercept=NULL,
                 prior=NULL, prior_aux=NULL,
                 algorithm="optimizing")
print(fit3)
stan_glm
family:       gaussian [identity]
formula:      y ~ x
observations: 10
predictors:   2
------
            Median MAD_SD
(Intercept) -13.1    6.2
x             5.0    1.0

Auxiliary parameter(s):
      Median MAD_SD
sigma 9.8    2.2

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
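
As a quick check, the point estimates obtained with flat priors should be close to the classical least squares fit from lm():

fit_ols <- lm(y ~ x, data=fake)
coef(fit_ols)    # compare with the (Intercept) and x estimates above
sigma(fit_ols)   # compare with sigma above (lm divides the RSS by n - 2)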

## 3.5 Simulations from stan_glm

The fit from stan_glm yields a matrix of simulations. Here is an example of using these simulations to construct a 95% posterior interval for the coefficient of x.

sims <- as.matrix(fit1)
quantile(sims[, 2], c(0.025, 0.975))
    2.5%    97.5%
2.575954 7.418997 

which is close to the approximation $$[5.1 \pm 2\times 1.1]$$ obtained from the printed median and MAD_SD of the coefficient of x.
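
rstanarm also provides posterior_interval(), which computes this type of interval directly from the simulations:

posterior_interval(fit1, prob=0.95, pars="x")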