# 7 Logarithmic transformations

We might want to try a logarithmic transformation when

• Additivity and linearity are not reasonable assumptions.
• The outcomes are all positive.

In this case, we run the regression with $$\log y_i$$ as target values:

$\log y_i = \beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i.$

To obtain a model that predicts the outcome from the input, we exponentiate both sides of the equation.

\begin{align*} y_i &= e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i} \\ &= B_0B_1^{x_{i1}}\ldots B_p^{x_{ip}}E_i, \end{align*}

where $$B_0=e^{\beta_0}, B_1=e^{\beta_1},\ldots$$ and $$E_i=e^{\varepsilon_i}$$.

Consider the logarithmic regression on the Earnings data. This can be done in R by simply replacing earn in the formula with log(earn) (notice that we only regress on the subset with $$\text{earn} > 0$$).

```r
library(rstanarm)
earnings = read.csv("data/earnings.csv")
```

```r
head(earnings)
  height weight male  earn earnk ethnicity education mother_education
1     74    210    1 50000    50     White        16               16
2     66    125    0 60000    60     White        16               16
3     64    126    0 30000    30     White        16               16
4     65    200    0 25000    25     White        17               17
5     63    110    0 50000    50     Other        16               16
6     68    165    0 62000    62     Black        18               18
  father_education walk exercise smokenow tense angry age
1               16    3        3        2     0     0  45
2               16    6        5        1     0     0  58
3               16    8        1        2     1     1  29
4               NA    8        1        2     0     0  57
5               16    5        6        2     0     0  91
6               18    1        1        2     2     2  54
```
```r
logmodel_1 <- stan_glm(log(earn) ~ height, data=earnings,
                       subset=earn>0, refresh=0)
print(logmodel_1)
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ height
 observations: 1629
 predictors:   2
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 5.9    0.4
height      0.1    0.0

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.9    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

Here are the plots of this model on the log scale and on the original scale.

```r
plot(earnings$height, log(earnings$earn), pch=20,
     xlab="height", ylab="log(earnings)")
abline(coef(logmodel_1))

beta = coef(logmodel_1)

plot(earnings$height, earnings$earn, pch=20,
     xlab="height", ylab="earnings")
curve(exp(beta[1] + beta[2]*x), add=TRUE)
```

In the latter plot, the fitted curve bends upward like an exponential function.

## 7.1 Interpreting the coefficients

The fitted model from the logarithmic regression above is

$\text{log(earnings)} = 5.9 + 0.1\cdot\text{height}.$

This model suggests that, for two people, say $$A$$ and $$B$$, whose heights differ by 1, their log-earnings differ by 0.1 on average:

\begin{align*} \log(\text{earnings}_A) - \log(\text{earnings}_B) &= 0.1 \\ \log \left(\frac{\text{earnings}_A}{\text{earnings}_B}\right) &= 0.1 \\ \frac{\text{earnings}_A}{\text{earnings}_B} &= e^{0.1} \approx 1.1, \end{align*}

Here, we have used the approximation $$e^x \approx 1+x$$, which is reasonably accurate for $$x$$ close to zero (for $$x=0.1$$, the exact value is $$e^{0.1}\approx 1.105$$).

From this, we can interpret the slope of $$0.1$$ as follows:

A 1-inch increase in height corresponds to an expected 10% increase in earnings.

This interpretation is why we have (implicitly) used logarithms base $$e$$ instead of base $$10$$. If we were to use base $$10$$, we would instead obtain $$\frac{\text{earnings}_A}{\text{earnings}_B} = 10^{\hat{\beta}_1}$$, where $$\hat{\beta}_1$$ is the slope from fitting the regression with logarithms base $$10$$. In that case, we cannot read the percentage increase in earnings directly off the coefficient.
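We can check this reading of the coefficient numerically in base R (the slope value 0.1 is the fitted coefficient from above):

```r
b <- 0.1                 # fitted slope on the natural-log scale
exp(b)                   # multiplicative effect on earnings per inch: ~1.105
1 + b                    # first-order approximation exp(b) = 1 + b for small b

# With base-10 logarithms the same slope would imply a factor of 10^b,
# which can no longer be read as "b * 100 percent".
10^b                     # ~1.259
```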

### 7.1.1 When there are zero-valued outcomes

Sometimes, we might face a situation where some of the outcomes are zero, and we cannot take their logarithms directly. One way to model such data is to first fit a classification model that predicts whether an instance has a non-zero outcome (for example, a logistic regression model, which will be introduced in the next chapter). For a new data point, we first use this model to decide whether the outcome is non-zero; if it is, we use the fitted logarithmic model to predict the actual value of the outcome.
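A minimal sketch of this two-stage approach with rstanarm, assuming the same Earnings data; the 0.5 classification threshold and the new data point are illustrative choices, not part of the original example:

```r
library(rstanarm)
earnings <- read.csv("data/earnings.csv")

# Stage 1: logistic regression for whether earnings are positive
fit_positive <- stan_glm((earn > 0) ~ height, family = binomial(link = "logit"),
                         data = earnings, refresh = 0)

# Stage 2: logarithmic regression on the positive subset
fit_log <- stan_glm(log(earn) ~ height, data = earnings,
                    subset = earn > 0, refresh = 0)

# Prediction for a new person: zero if stage 1 predicts a zero outcome,
# otherwise exponentiate the stage-2 prediction back to the dollar scale
new_data <- data.frame(height = 68)
p_positive <- mean(posterior_epred(fit_positive, newdata = new_data))
y_pred <- if (p_positive > 0.5) {
  exp(mean(posterior_predict(fit_log, newdata = new_data)))
} else 0
```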

## 7.2 Model checking with simulations

We will compare the logarithmic regression with the linear regression by replicating a dataset from each model and comparing it to the observed dataset.

We first fit the linear model.

```r
fit_1 <- stan_glm(earn ~ height, data=earnings,
                  subset=earn>0, refresh=0)
```

and then we simulate outcomes from the posterior predictive distribution.

```r
yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)         # number of rows in the simulation
subset <- sample(n_sims, 100)  # randomly pick 100 rows
yrep_1_100 <- yrep_1[subset,]  # pick rows of yrep_1 by row indices in subset
```

The convenient bayesplot library provides the ppc_dens_overlay function, which overlays density plots of the simulations and the observed data.

```r
library(bayesplot)
ppc_dens_overlay(earnings$earn[earnings$earn>0], yrep_1_100)
```

We can see that the distributions do not quite match: the observed data are more concentrated and non-negative.

Now let us try the same for the logarithmic model.

```r
yrep_log_1 <- posterior_predict(logmodel_1)
n_sims <- nrow(yrep_log_1)             # number of rows in the simulation
subset <- sample(n_sims, 100)          # randomly pick 100 rows
yrep_log_1_100 <- yrep_log_1[subset,]  # pick rows by list of indices in subset

ppc_dens_overlay(log(earnings$earn[earnings$earn>0]), yrep_log_1_100)
```

Visually, the logarithmic model has a better fit than the linear model.

## 7.3 elpd for the logarithmic regression

First, let us compare the elpd of the linear and logarithmic models.

```r
loo_1 = loo(fit_1)
Warning: Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo'
again with argument 'k_threshold = 0.7' in order to calculate the ELPD without
the assumption that these observations are negligible. This will refit the model
1 times to compute the ELPDs for the problematic observations directly.
print(loo_1)

Computed from 4000 by 1629 log-likelihood matrix

         Estimate    SE
elpd_loo -18610.8 167.6
p_loo        28.6  21.8
looic     37221.5 335.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1628  99.9%   616
 (0.5, 0.7]   (ok)          0   0.0%   <NA>
   (0.7, 1]   (bad)         0   0.0%   <NA>
   (1, Inf)   (very bad)    1   0.1%   1
See help('pareto-k-diagnostic') for details.
```
```r
loo_log_1 = loo(logmodel_1)
print(loo_log_1)

Computed from 4000 by 1629 log-likelihood matrix

         Estimate   SE
elpd_loo  -2100.6 38.8
p_loo         3.9  0.4
looic      4201.2 77.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```

Notice that the elpd_loo values of these two models are on different scales; this is because the likelihood of the linear model and that of the transformed model are defined on different outcome scales, so we have to correct for this difference in the computation of LOO. Recall that the equation for a logarithmic model is:

$\log y = \beta_0 + \beta_1x_{1} + \ldots + \beta_p x_{p} + \varepsilon=X\beta + \varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,\sigma^2).$

Define a new random variable $$z=\log y$$. The equation above tells us that

$z \sim \mathcal{N}(X\beta,\sigma^2).$

To obtain the elpd of the logarithmic model, we need to compute the density $$p(y\mid \beta,\sigma, X)$$. This can be achieved by applying the change-of-variable formula between two variables $$y$$ and $$z$$.

$p(y\mid \beta,\sigma, X) = p(z\mid \beta,\sigma, X)\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}p(z\mid \beta,\sigma, X).$

Here, $$\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}$$ since we assume that $$y$$ is positive. Thus, for a held-out data point $$(X_i, y_i)$$, with $$z_i = \log y_i$$, the log-likelihood of this point is

$\log p(y_i\mid \beta,\sigma, X_i) = \underbrace{\log p(z_i\mid \beta,\sigma, X_i)}_{\text{elpd of }\log y_i}-\log y_i.$

In other words, we can make a correction for the elpd of the logarithmic model by subtracting $$\log y_i$$ for each instance $$(X_i, y_i)$$.
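We can verify this change of variable numerically in base R: the lognormal density of $$y$$ equals the normal density of $$z=\log y$$ divided by $$y$$. The parameter values below are arbitrary illustrations, not the fitted ones:

```r
mu <- 10; sigma <- 0.9
y <- 30000
z <- log(y)

# Change of variable: p(y) = p(z) / y
p_y_manual <- dnorm(z, mean = mu, sd = sigma) / y

# dlnorm computes the same lognormal density directly
p_y_direct <- dlnorm(y, meanlog = mu, sdlog = sigma)
all.equal(p_y_manual, p_y_direct)           # TRUE

# On the log scale, the correction is additive: log p(y) = log p(z) - log(y)
log_corrected <- dnorm(z, mu, sigma, log = TRUE) - log(y)
all.equal(log_corrected, log(p_y_direct))   # TRUE
```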

We can inspect the elpd of each instance via the pointwise attribute of the loo result. The following code shows the first five rows of the pointwise results from the logarithmic regression:

```r
print(loo_log_1$pointwise[1:5,])
       elpd_loo mcse_elpd_loo        p_loo    looic influence_pareto_k
[1,] -1.1015364  0.0006749132 0.0018272358 2.203073       -0.066103226
[2,] -1.9373695  0.0007647321 0.0019790022 3.874739       -0.058415542
[3,] -1.1543427  0.0004481174 0.0006992233 2.308685       -0.036653352
[4,] -0.9582225  0.0003211416 0.0003854306 1.916445       -0.008417499
[5,] -1.9181897  0.0009642173 0.0031575345 3.836379       -0.061855767
```

From this, we can subtract $$\log y_i$$ (the log of earn in this case) from the elpd_loo values (first column).

```r
loo_log_1$pointwise[,1] <- loo_log_1$pointwise[,1] - log(earnings$earn[earnings$earn>0])
```

We can now sum elpd_loo over all instances.

```r
elpd_with_correction <- sum(loo_log_1$pointwise[,1])
print(elpd_with_correction)
[1] -17933
```

The elpd is now on the same scale as that of the linear model, so we can compare the two models directly.

```r
loo_compare(loo_log_1, loo_1)
Warning: Not all models have the same y variable. ('yhash' attributes do not
match)
           elpd_diff se_diff
logmodel_1    0.0       0.0
fit_1      -677.8     155.4
```
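The elpd_diff reported for fit_1 matches simple arithmetic on the two elpd values printed above:

```r
elpd_linear <- -18610.8   # elpd_loo of the linear model fit_1
elpd_log    <- -17933     # corrected elpd of logmodel_1
elpd_linear - elpd_log    # -677.8, the elpd_diff shown for fit_1
```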

## 7.4 Log-log model

If the log transformation is applied to both an input variable and the outcome, the coefficient can be interpreted as the percentage difference in $$y$$ per percentage difference in $$x$$.

Let us try this technique on the Earnings data.

```r
logmodel_5 <- stan_glm(log(earn) ~ log(height) + male, data=earnings,
                       subset=earn>0, refresh=0)

print(logmodel_5)
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ log(height) + male
 observations: 1629
 predictors:   3
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 2.9    2.2
log(height) 1.6    0.5
male        0.4    0.1

Auxiliary parameter(s):
* For info on the priors used see ?prior_summary.stanreg
```
• The coefficient 1.6 of $$\log(\text{height})$$ implies that a 1% increase in height corresponds to roughly a 1.6% increase in earnings.
• The coefficient 0.4 of $$\text{male}$$ implies that, on average, a male earns about $$e^{0.4} \approx 1.5$$ times as much as a female of the same height; here the coefficient is too large for the approximation $$e^x \approx 1+x$$ to be accurate, so we exponentiate it directly.
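As a numerical check of this elasticity interpretation in base R (using the fitted slope 1.6 from above):

```r
b <- 1.6
# In a log-log model, a 1% increase in height multiplies earnings by 1.01^b
1.01^b               # ~1.016: roughly a 1.6% increase in earnings
(1.01^b - 1) * 100   # percentage change in earnings, close to b
```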