library(rstanarm)
7 Logarithmic transformations
We might want to try a logarithmic transformation when
- Additivity and linearity are not reasonable assumptions.
- The outcomes are all positive.
In this case, we run the regression with \(\log y_i\) as target values:
\[ \log y_i = \beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i. \]
To obtain a model that predicts the outcome from the input, we exponentiate both sides of the equation.
\[\begin{align*} y_i &= e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i} \\ &= B_0B_1^{x_{i1}}\ldots B_p^{x_{ip}}E_i, \end{align*}\]where \(B_0=e^{\beta_0}, B_1=e^{\beta_1},\ldots\) and \(E_i=e^{\varepsilon_i}\).
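As a quick numeric check (in Python, with made-up coefficient values), the additive model on the log scale is identical to the multiplicative model on the original scale:

```python
import math

# Hypothetical coefficients and a single predictor value
beta0, beta1, x1, eps = 5.9, 0.1, 70.0, 0.05

# Additive form on the log scale, then exponentiated
y_from_log = math.exp(beta0 + beta1 * x1 + eps)

# Multiplicative form: B0 * B1^x1 * E
B0, B1, E = math.exp(beta0), math.exp(beta1), math.exp(eps)
y_multiplicative = B0 * B1 ** x1 * E

assert math.isclose(y_from_log, y_multiplicative)
```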
Consider the logarithmic regression on the Earnings
data. This can be done in R by simply replacing earn
in the formula with log(earn)
(notice that we only regress on the subset with \(\text{earn} > 0\)).
earnings = read.csv("data/earnings.csv")
head(earnings)
height weight male earn earnk ethnicity education mother_education
1 74 210 1 50000 50 White 16 16
2 66 125 0 60000 60 White 16 16
3 64 126 0 30000 30 White 16 16
4 65 200 0 25000 25 White 17 17
5 63 110 0 50000 50 Other 16 16
6 68 165 0 62000 62 Black 18 18
father_education walk exercise smokenow tense angry age
1 16 3 3 2 0 0 45
2 16 6 5 1 0 0 58
3 16 8 1 2 1 1 29
4 NA 8 1 2 0 0 57
5 16 5 6 2 0 0 91
6 18 1 1 2 2 2 54
logmodel_1 <- stan_glm(log(earn) ~ height, data=earnings,
                       subset=earn>0, refresh=0)
print(logmodel_1)
stan_glm
family: gaussian [identity]
formula: log(earn) ~ height
observations: 1629
predictors: 2
subset: earn > 0
------
Median MAD_SD
(Intercept) 5.9 0.4
height 0.1 0.0
Auxiliary parameter(s):
Median MAD_SD
sigma 0.9 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Here are the plots of this model on the log scale and on the original scale.
plot(earnings$height, log(earnings$earn), pch=20,
xlab="height", ylab="log(earnings)")
abline(coef(logmodel_1))
beta = coef(logmodel_1)
plot(earnings$height, earnings$earn, pch=20,
xlab="height", ylab="earnings")
curve(exp(beta[1] + beta[2]*x), add=TRUE)
In the latter plot, the fitted curve rises like an exponential function.
7.1 Interpreting the coefficients
The fitted model from the logarithmic regression above is
\[ \text{log(earnings)} = 5.9 + 0.1\cdot\text{height}. \]
This model suggests that, for two people, say \(A\) and \(B\), whose heights differ by 1, their log-earnings differ by 0.1 on average:
\[\begin{align*} \log(\text{earnings}_A) - \log(\text{earnings}_B) &= 0.1 \\ \log \left(\frac{\text{earnings}_A}{\text{earnings}_B}\right) &= 0.1 \\ \frac{\text{earnings}_A}{\text{earnings}_B} &= e^{0.1} \approx 1.1. \end{align*}\]Here, we have used the approximation \(e^x \approx 1+x\), which is reasonably accurate for small \(x\) (say, \(\lvert x\rvert < 0.25\)).
From this, we can interpret the slope of \(0.1\) as follows:
A 1-inch increase in height corresponds to an expected 10% increase in earnings.
This interpretation is why we have (implicitly) used logarithms base \(e\) instead of base \(10\). If we were to use base \(10\), we would instead obtain \(\frac{\text{earnings}_A}{\text{earnings}_B} = 10^{\hat{\beta}_1}\), where \(\hat{\beta}_1\) is the slope from fitting the regression with logarithms base \(10\). In that case, we cannot read the percentage increase in earnings directly off the coefficient.
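A quick numeric check of this approximation (in Python, for illustration):

```python
import math

ratio = math.exp(0.1)   # exact multiplicative effect of a 1-inch difference
approx = 1 + 0.1        # first-order approximation e^x ≈ 1 + x

print(ratio)            # ≈ 1.105, i.e. about a 10% increase
assert abs(ratio - approx) < 0.01
```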
7.1.1 When there are zero-valued outcomes
Sometimes, we might face a situation where some of the outcomes are zero, and we cannot take their logarithms directly. One way to model such data is to first fit a classification model that predicts whether an instance has a non-zero outcome (for example, a logistic regression model, which will be introduced in the next chapter). For a new data point, if the classifier predicts a non-zero outcome, we then use the fitted logarithmic model to predict the actual value of the outcome.
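A minimal sketch of this two-stage prediction logic (in Python, for illustration; `p_nonzero` is a hypothetical classifier probability and `log_pred` a log-scale regression prediction, not quantities computed in this chapter's R code):

```python
import math

def predict_two_stage(p_nonzero, log_pred, threshold=0.5):
    """Predict the outcome: zero if the classifier says so,
    otherwise exponentiate the log-scale regression prediction."""
    if p_nonzero <= threshold:
        return 0.0
    return math.exp(log_pred)

print(predict_two_stage(0.9, math.log(30000)))  # ≈ 30000.0
print(predict_two_stage(0.2, math.log(30000)))  # 0.0
```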
7.2 Model checking with simulations
We will compare the logarithmic regression with the linear regression by replicating a dataset from each model and comparing it to the observed dataset.
We first fit the linear model.
fit_1 <- stan_glm(earn ~ height, data=earnings,
                  subset=earn>0, refresh=0)
and then we simulate outcomes from the posterior predictive distribution.
yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)         # number of rows in the simulation
subset <- sample(n_sims, 100)  # randomly pick 100 rows
yrep_1_100 <- yrep_1[subset,]  # pick rows of yrep_1 by row indices in subset
There is a convenient library bayesplot
that allows us to make density plots of the simulations and the data via the ppc_dens_overlay
function.
library(bayesplot)
ppc_dens_overlay(earnings$earn[earnings$earn>0], yrep_1_100)
We can see that the distributions do not quite match as the observed data is more concentrated and is non-negative.
Now let us try the same for the logarithmic model.
yrep_log_1 <- posterior_predict(logmodel_1)
n_sims <- nrow(yrep_log_1)             # number of rows in the simulation
subset <- sample(n_sims, 100)          # randomly pick 100 rows
yrep_log_1_100 <- yrep_log_1[subset,]  # pick rows by list of indices in subset
ppc_dens_overlay(log(earnings$earn[earnings$earn>0]), yrep_log_1_100)
Visually, the logarithmic model has a better fit than the linear model.
7.3 elpd for the logarithmic regression
First, let us look at the elpd of the linear and logarithmic models, starting with the linear one.
loo_1 = loo(fit_1)
Warning: Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.
print(loo_1)
Computed from 4000 by 1629 log-likelihood matrix
Estimate SE
elpd_loo -18610.8 167.6
p_loo 28.6 21.8
looic 37221.5 335.3
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1628 99.9% 616
(0.5, 0.7] (ok) 0 0.0% <NA>
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 1 0.1% 1
See help('pareto-k-diagnostic') for details.
loo_log_1 = loo(logmodel_1)
print(loo_log_1)
Computed from 4000 by 1629 log-likelihood matrix
Estimate SE
elpd_loo -2100.6 38.8
p_loo 3.9 0.4
looic 4201.2 77.6
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
Notice that the elpd_loo
values of these two models are on different scales; this is because the likelihood of the linear model and that of the log-transformed model are densities over different scales of the outcome, so we have to make a correction for this difference in the computation of LOO. Recall that the equation for a logarithmic model is:
\[ \log y = \beta_0 + \beta_1x_{1} + \ldots + \beta_p x_{p} + \varepsilon=X\beta + \varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,\sigma^2). \]
Define a new random variable \(z=\log y\). The equation above tells us that
\[ z \sim \mathcal{N}(X\beta,\sigma^2). \]
To obtain the elpd of the logarithmic model, we need to compute the density \(p(y\mid \beta,\sigma, X)\). This can be achieved by applying the change-of-variable formula between two variables \(y\) and \(z\).
\[ p(y\mid \beta,\sigma, X) = p(z\mid \beta,\sigma, X)\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}p(z\mid \beta,\sigma, X). \]
Here, \(\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}\) since \(\frac{dz}{dy} = \frac{1}{y}\) and we assume that \(y\) is positive. Thus, for a held-out data point \((X_i, y_i)\), with \(z_i = \log y_i\), the log-likelihood of this point is
\[ \log p(y_i\mid \beta,\sigma, X_i) = \underbrace{\log p(z_i\mid \beta,\sigma, X_i)}_{\text{log-likelihood of }z_i}-\log y_i. \]
In other words, we can make a correction for the elpd of the logarithmic model by subtracting \(\log y_i\) for each instance \((X_i, y_i)\).
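We can sanity-check this change-of-variable correction numerically (in Python, for illustration): the log-density of a lognormal distribution should equal the log-density of the corresponding normal evaluated at \(z=\log y\), minus \(\log y\). The values of `mu` and `sigma` below are made up.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log-density of Normal(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_lognormal_pdf(y, mu, sigma):
    """Log-density of y via the correction: log p(z) - log y, with z = log y."""
    return log_normal_pdf(math.log(y), mu, sigma) - math.log(y)

# Compare against the standard lognormal density formula
y, mu, sigma = 30000.0, 9.7, 0.9
direct = (-math.log(y * sigma * math.sqrt(2 * math.pi))
          - (math.log(y) - mu)**2 / (2 * sigma**2))
assert math.isclose(log_lognormal_pdf(y, mu, sigma), direct)
```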
We can inspect the elpd of each instance by calling pointwise
attribute of the loo
result. The following code shows the first five rows of the elpd from the logarithmic regression:
print(loo_log_1$pointwise[1:5,])
elpd_loo mcse_elpd_loo p_loo looic influence_pareto_k
[1,] -1.1015364 0.0006749132 0.0018272358 2.203073 -0.066103226
[2,] -1.9373695 0.0007647321 0.0019790022 3.874739 -0.058415542
[3,] -1.1543427 0.0004481174 0.0006992233 2.308685 -0.036653352
[4,] -0.9582225 0.0003211416 0.0003854306 1.916445 -0.008417499
[5,] -1.9181897 0.0009642173 0.0031575345 3.836379 -0.061855767
From this, we can subtract \(\log y_i\) (the log of earn
in this case) from elpd_loo
(the first column).
loo_log_1$pointwise[,1] <- loo_log_1$pointwise[,1] - log(earnings$earn[earnings$earn>0])
We can now sum elpd_loo
over all instances.
elpd_with_correction <- sum(loo_log_1$pointwise[,1])
print(elpd_with_correction)
[1] -17933
The elpd is now on the same scale as that of the linear model, so we can compare the elpds of the linear and logarithmic models.
loo_compare(loo_log_1, loo_1)
Warning: Not all models have the same y variable. ('yhash' attributes do not
match)
elpd_diff se_diff
logmodel_1 0.0 0.0
fit_1 -677.8 155.4
7.4 Log-log model
If the log transformation is applied to both an input variable and the outcome, the coefficient can be interpreted as the percentage difference in \(y\) per percentage difference in \(x\).
Let us try this technique on the Earnings
data.
logmodel_5 <- stan_glm(log(earn) ~ log(height) + male, data=earnings,
                       subset=earn>0, refresh=0)
print(logmodel_5)
stan_glm
family: gaussian [identity]
formula: log(earn) ~ log(height) + male
observations: 1629
predictors: 3
subset: earn > 0
------
Median MAD_SD
(Intercept) 2.9 2.2
log(height) 1.6 0.5
male 0.4 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.9 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
The coefficient 1.6 of \(\log(\text{height})\) implies that a 1% increase in height corresponds to an expected 1.6% increase in earnings.
The coefficient 0.4 of \(\text{male}\) implies that, on average, a male earns \(e^{0.4}-1 \approx 50\%\) more than a female of the same height (the approximation \(e^x\approx 1+x\) is less accurate here, since \(0.4\) is not small).
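A quick numeric check of these interpretations (in Python, for illustration, using the fitted coefficients above):

```python
import math

# Log-log slope: a 1% increase in height multiplies earnings by 1.01^1.6
ratio_height = 1.01 ** 1.6
print(ratio_height)      # ≈ 1.016, i.e. about a 1.6% increase

# Indicator coefficient: exact multiplicative effect of male = 1
ratio_male = math.exp(0.4)
print(ratio_male)        # ≈ 1.49, i.e. about a 50% increase
```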