3  Fitting linear regression

3.1 Least squares

The classic linear regression model is:

\[y_i = a+bx_i+\varepsilon_i.\]

Define the residuals as

\[r_i = y_i - (\hat{a}+\hat{b}x_i).\]

Note that the residuals are different from the errors \(\varepsilon_i = y_i-(a+bx_i)\), which cannot be obtained from the observed data.

In least squares regression, we find the \((\hat{a},\hat{b})\) that minimize the residual sum of squares:

\[\text{RSS} = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - (\hat{a}+\hat{b}x_i))^2.\]

The \((\hat{a},\hat{b})\) that minimize the RSS are called the ordinary least squares (OLS) estimates, given by

\[\begin{align*} \hat{b} &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2}, \\ \hat{a} &= \bar{y} - \hat{b}\bar{x}. \end{align*}\]

Consequently, we can write the line equation as

\[y = \hat{a}+\hat{b}x = \bar{y} - \hat{b}\bar{x} +\hat{b}x = \bar{y} + \hat{b}(x-\bar{x}).\]

Thus, the line goes through the mean of the data \((\bar{x},\bar{y})\).
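As a quick check of these formulas, here is a minimal sketch using made-up numbers (not data from these notes); the closed-form estimates should agree with the coefficients returned by lm().

# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1)
# OLS estimates from the closed-form formulas above
b_hat <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)
c(a_hat, b_hat)
# lm() should return the same coefficients
coef(lm(y ~ x))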

3.2 Estimation of residual standard deviation \(\sigma\)

Recall that we assume \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\). To find an unbiased estimator of \(\sigma^2\), we use the following fact:

\[\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-2}.\]

Combined with the fact that the expectation of a chi-square random variable equals its number of degrees of freedom, we have

\[\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n-2,\]

or equivalently,

\[\mathbb{E}\left[\frac{\text{RSS}}{n-2}\right] = \sigma^2.\]

Therefore, \(\frac{\text{RSS}}{n-2}\) is an unbiased estimator of \(\sigma^2\). Thus, we estimate the residual standard deviation \(\sigma\) using

\[\hat{\sigma} = \sqrt{\frac{\text{RSS}}{n-2}}=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2}.\]
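Continuing the small sketch above, the same quantity can be computed from an lm() fit; the sigma() extractor should return the same value.

# Residual standard deviation from the formula vs. what lm() reports
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
sigma_hat <- sqrt(rss / (length(y) - 2))
c(sigma_hat, sigma(fit))  # the two should match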

3.3 Maximum likelihood estimation

From the model \(y_i=a+bx_i+\varepsilon_i\) where \(\varepsilon_i\sim\mathcal{N}(0,\sigma^2)\), it follows that \(y_i\sim \mathcal{N}(a+bx_i,\sigma^2)\). The likelihood function is defined as the probability density function of the data, considered as a function of the parameters. Let \(y=(y_1,\ldots,y_n)\) and \(X=(x_1,\ldots,x_n)\). Then, the likelihood function in terms of \(\hat{a},\hat{b}\) and \(\hat{\sigma}\) is

\[\begin{align*} p(y\mid \hat{a},\hat{b},\hat{\sigma},X) &= \prod_{i=1}^n p(y_i\mid \hat{a},\hat{b},\hat{\sigma},X) \\ &= \frac{1}{(\sqrt{2\pi}\hat{\sigma})^n}\prod_{i=1}^n\exp\left(-\frac{1}{2}\left(\frac{y_i-(\hat{a}+\hat{b}x_i)}{\hat{\sigma}}\right)^2\right) \\ &= \frac{1}{(\sqrt{2\pi}\hat{\sigma})^n}\exp\left(-\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2\right). \end{align*}\]

Another way to estimate the parameters \(a,b\) and \(\sigma\) is to find the \(\hat{a},\hat{b}\) and \(\hat{\sigma}\) that maximize the likelihood function. It is common to compute the log-likelihood function first:

\[ \log p(y\mid \hat{a},\hat{b},\hat{\sigma},X) = -\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2-n\log\left(\sqrt{2\pi}\,\hat{\sigma}\right). \]

We can see that maximizing the log-likelihood function with respect to \(\hat{a}\) and \(\hat{b}\) is equivalent to minimizing \(\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2\), which is the RSS. In other words, in linear regression, the maximum likelihood estimates of \(a\) and \(b\) are the same as the OLS estimates. To find the maximum likelihood estimate of \(\sigma\), we set the derivative of the log-likelihood with respect to \(\hat{\sigma}\) to zero:

\[ \frac{1}{\hat{\sigma}^3}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2-\frac{n}{\hat{\sigma}} = 0, \]

which gives

\[ \hat{\sigma}_{\text{mle}} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-(\hat{a}+\hat{b}x_i))^2}. \]

We observe that the maximum likelihood estimate divides by \(n\) rather than \(n-2\), so it is a biased estimate of \(\sigma\).
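Continuing the sketch from Section 3.2 (reusing rss and sigma_hat), the two estimates of \(\sigma\) differ only in the divisor:

# MLE of sigma divides by n; the estimator from Section 3.2 divides by n - 2
sigma_mle <- sqrt(rss / length(y))
c(sigma_mle, sigma_hat)  # the MLE is slightly smaller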

Below is a plot of the likelihood function as a function of \(a\) and \(b\), with \(\sigma\) fixed.

[Figure: likelihood function over \(a\) and \(b\), with \(\sigma\) fixed]

The second plot shows a level curve of the likelihood near the maximum likelihood estimate.

[Figure: level curves of the likelihood near the maximum likelihood estimate]
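As a minimal sketch (reusing the illustrative x, y, a_hat, b_hat and sigma_mle defined above), the function evaluated on a grid to produce such plots is just the log-likelihood written out; the plotting code itself is not reproduced here.

# Log-likelihood of (a, b, sigma), written directly from the expression above
loglik <- function(a, b, sigma, x, y) {
  -sum((y - (a + b * x))^2) / (2 * sigma^2) -
    length(y) * log(sqrt(2 * pi) * sigma)
}
# The log-likelihood is largest at the maximum likelihood estimates
loglik(a_hat, b_hat, sigma_mle, x, y)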

3.4 Bayesian linear regression

In Bayesian inference, we start by specifying a prior distribution on the parameters. In this case, the parameters are \(a,b\) and \(\sigma^2\).

\[p(a,b,\sigma^2).\]

Examples of prior distributions are:

1. Flat prior: \(p(a,b,\sigma^2)=1\).
2. \(p(a,b,\sigma^2)=p(a,b\mid \sigma^2)\,p(\sigma^2)\), where \(p(a,b\mid \sigma^2)\) is a normal distribution and \(p(\sigma^2)\) is an inverse-gamma distribution.

Then, we specify the likelihood function. In linear regression, this is usually the normal likelihood.

\[p(y\mid a,b,\sigma^2,X) = \frac{1}{(\sqrt{2\pi}\sigma)^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-(a+bx_i))^2\right).\]

We use the likelihood, which contains the information from the data, to update our beliefs about the parameters via Bayes' rule:

\[ p(a,b,\sigma^2 \mid y,X) \propto p(y\mid a,b,\sigma^2,X)p(a,b,\sigma^2).\]

The left-hand side is called the posterior distribution. We then draw the parameters from the posterior distribution and use the draws to simulate posterior quantities, such as the posterior mean of \(y\) or confidence intervals.

Here is a code example of Bayesian regression:

library(rstanarm)
x <- 1:10
y <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
fake <- data.frame(x, y)

To fit the Bayesian regression with the default weakly informative priors,

fit1 <- stan_glm(y ~ x, data=fake)
print(fit1)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 10
 predictors:   2
------
            Median MAD_SD
(Intercept) -13.7    6.7 
x             5.1    1.1 

Auxiliary parameter(s):
      Median MAD_SD
sigma 10.0    2.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

To fit the Bayesian regression with the flat prior (in this case, the posterior is proportional to the likelihood),

fit2 <- stan_glm(y ~ x, data=fake, prior_intercept=NULL, 
                 prior=NULL, prior_aux=NULL)

Here, prior_intercept=NULL sets a flat prior for the intercept, prior=NULL sets a flat prior for the other coefficients, and prior_aux=NULL sets a flat prior for the auxiliary parameter \(\sigma\).

print(fit2)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 10
 predictors:   2
------
            Median MAD_SD
(Intercept) -13.8    7.2 
x             5.1    1.2 

Auxiliary parameter(s):
      Median MAD_SD
sigma 10.4    2.9  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
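To check which priors were actually used in a fit, call prior_summary (see ?prior_summary.stanreg, referenced in the output above) on the fitted object, for example:

prior_summary(fit2)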

By default, stan_glm uses simulation to fit the model. To use optimization instead, set algorithm="optimizing". With the flat priors above, the following code computes the maximum likelihood estimate.

fit3 <- stan_glm(y ~ x, data=fake, prior_intercept=NULL, 
                 prior=NULL, prior_aux=NULL,
                 algorithm="optimizing")
print(fit3)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ x
 observations: 10
 predictors:   2
------
            Median MAD_SD
(Intercept) -13.1    6.2 
x             5.0    1.0 

Auxiliary parameter(s):
      Median MAD_SD
sigma 9.8    2.2   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

3.5 Simulations from stan_glm

The fit from stan_glm yields a matrix of posterior simulations. Here is an example of using these simulations to construct a 95% confidence interval for the coefficient of x.

sims <- as.matrix(fit1)
quantile(sims[, 2], c(0.025, 0.975))
    2.5%    97.5% 
2.575954 7.418997 

This interval is close to the approximation \(5.1 \pm 2\times 1.1 = [2.9, 7.3]\) based on the posterior median and MAD_SD from the printed output.
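The same approximation can be computed from the simulations themselves; R's mad() is scaled to be comparable to a standard deviation, matching the MAD_SD column in the printed output:

median(sims[, 2]) + c(-2, 2) * mad(sims[, 2])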