`library(rstanarm)`

# 7 Logarithmic transformations

We might want to try a logarithmic transformation when

- Additivity and linearity are not reasonable assumptions.
- The outcomes are all positive.

In this case, we run the regression with \(\log y_i\) as target values:

\[ \log y_i = \beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i. \]

To obtain a model that predicts the outcome from the input, we exponentiate both sides of the equation.

\[\begin{align*} y_i &= e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i} \\ &= B_0B_1^{x_{i1}}\ldots B_p^{x_{ip}}E_i, \end{align*}\]where \(B_0=e^{\beta_0}, B_1=e^{\beta_1},\ldots\) and \(E_i=e^{\varepsilon_i}\).
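As a quick numerical sanity check (with made-up coefficients, not fitted values), we can confirm that exponentiating the linear predictor agrees with the product of exponentiated coefficients:

```
# Hypothetical coefficients, for illustration only
beta0 <- 5.9; beta1 <- 0.1
x1 <- 70
exp(beta0 + beta1*x1)       # exponentiated linear predictor
exp(beta0) * exp(beta1)^x1  # B_0 * B_1^{x_1}: the same number
```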

Consider the logarithmic regression on the `Earnings` data. This can be done in R by simply replacing `earn` in the formula with `log(earn)` (notice that we only regress on the subset with \(\text{earn} > 0\)).

```
earnings <- read.csv("data/earnings.csv")
head(earnings)
```

```
  height weight male  earn earnk ethnicity education mother_education
1     74    210    1 50000    50     White        16               16
2     66    125    0 60000    60     White        16               16
3     64    126    0 30000    30     White        16               16
4     65    200    0 25000    25     White        17               17
5     63    110    0 50000    50     Other        16               16
6     68    165    0 62000    62     Black        18               18
  father_education walk exercise smokenow tense angry age
1               16    3        3        2     0     0  45
2               16    6        5        1     0     0  58
3               16    8        1        2     1     1  29
4               NA    8        1        2     0     0  57
5               16    5        6        2     0     0  91
6               18    1        1        2     2     2  54
```

```
logmodel_1 <- stan_glm(log(earn) ~ height, data=earnings,
                       subset=earn>0, refresh=0)
print(logmodel_1)
```

```
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ height
 observations: 1629
 predictors:   2
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 5.9    0.4
height      0.1    0.0

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.9    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

Here are the plots of this model on the log scale and on the original scale.

```
plot(earnings$height, log(earnings$earn), pch=20,
     xlab="height", ylab="log(earnings)")
abline(coef(logmodel_1))
```

```
beta <- coef(logmodel_1)
plot(earnings$height, earnings$earn, pch=20,
     xlab="height", ylab="earnings")
curve(exp(beta[1] + beta[2]*x), add=TRUE)
```

In the latter plot, the fitted curve bends upward like an exponential function.

## 7.1 Interpreting the coefficients

The fitted model from the logarithmic regression above is

\[ \text{log(earnings)} = 5.9 + 0.1\cdot\text{height}. \]

This model suggests that, for two people, say \(A\) and \(B\), whose heights differ by 1, their log-earnings differ by 0.1 on average:

\[\begin{align*} \log(\text{earnings}_A) - \log(\text{earnings}_B) &= 0.1 \\ \log \left(\frac{\text{earnings}_A}{\text{earnings}_B}\right) &= 0.1 \\ \frac{\text{earnings}_A}{\text{earnings}_B} &= e^{0.1} \approx 1.1. \end{align*}\]Here, we have used the approximation \(e^x \approx 1+x\), which is reasonably accurate for small \(x\).

From this, we can interpret the slope of \(0.1\) as follows:

A 1-inch increase in height corresponds to an expected 10% increase in earnings.

This interpretation is why we have (implicitly) used logarithms base \(e\) instead of base \(10\). If we were to use base \(10\), we would instead obtain \(\frac{\text{earnings}_A}{\text{earnings}_B} = 10^{\hat{\beta}_1}\), where \(\hat{\beta}_1\) is the slope from fitting the regression with base-\(10\) logarithms. In this case, we cannot read off the percentage increase in earnings directly from the coefficient.
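A quick numerical check (numbers not from the text): the base-\(e\) slope of 0.1 gives a multiplier close to 1.1, and re-expressing the same slope in base \(10\) yields the same multiplier while hiding the percentage interpretation:

```
exp(0.1)              # about 1.105, i.e. roughly a 10% increase
b10 <- 0.1 / log(10)  # the equivalent base-10 slope (hypothetical)
10^b10                # same multiplier, but b10 itself is not a percentage
```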

### 7.1.1 When there are zero-valued outcomes

Sometimes, we might face a situation where some of the outcomes are zero, and we cannot take their logarithms directly. One way to model such data is to first fit a *classification* model that predicts whether an instance has a non-zero outcome (for example, a logistic regression model, which will be introduced in the next chapter). For a new data point, we use this model to decide whether the outcome is non-zero; if it is, we use the fitted logarithmic model to predict the actual value of the outcome.
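A minimal sketch of this two-stage approach, assuming the `earnings` data and `logmodel_1` from above; the classification step here uses `stan_glm` with `family=binomial` (covered in the next chapter). This is an outline under those assumptions, not a definitive implementation:

```
# Stage 1 (sketch): classify whether the outcome is non-zero
fit_pos <- stan_glm((earn > 0) ~ height, family=binomial(link="logit"),
                    data=earnings, refresh=0)
# Stage 2: logmodel_1, fitted on the subset with earn > 0, predicts log(earn)
new_point <- data.frame(height=70)
p_pos <- posterior_epred(fit_pos, newdata=new_point)         # Pr(earn > 0)
pred_log <- posterior_predict(logmodel_1, newdata=new_point) # log(earn), given earn > 0
```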

## 7.2 Model checking with simulations

We will compare the logarithmic regression with the linear regression by replicating a dataset from each model and comparing it to the observed dataset.

We first fit the linear model.

```
fit_1 <- stan_glm(earn ~ height, data=earnings,
                  subset=earn>0, refresh=0)
```

and then we simulate outcomes from the posterior predictive distribution.

```
yrep_1 <- posterior_predict(fit_1)
n_sims <- nrow(yrep_1)        # number of rows in the simulation
subset <- sample(n_sims, 100) # randomly pick 100 rows
yrep_1_100 <- yrep_1[subset,] # pick rows of yrep_1 by row indices in subset
```

There is a convenient library `bayesplot` that allows us to overlay density plots of the simulations and the data via the `ppc_dens_overlay` function.

`library(bayesplot)`

`ppc_dens_overlay(earnings$earn[earnings$earn>0], yrep_1_100)`

We can see that the distributions do not quite match: the observed data is more concentrated and is non-negative.

Now let us try the same for the logarithmic model.

```
yrep_log_1 <- posterior_predict(logmodel_1)
n_sims <- nrow(yrep_log_1)            # number of rows in the simulation
subset <- sample(n_sims, 100)         # randomly pick 100 rows
yrep_log_1_100 <- yrep_log_1[subset,] # pick rows by list of indices in subset
ppc_dens_overlay(log(earnings$earn[earnings$earn>0]), yrep_log_1_100)
```

Visually, the logarithmic model has a better fit than the linear model.

## 7.3 elpd for the logarithmic regression

First, let us compute the elpd of the linear model.

`loo_1 <- loo(fit_1)`

`Warning: Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.`

`print(loo_1)`

```
Computed from 4000 by 1629 log-likelihood matrix

         Estimate    SE
elpd_loo -18610.8 167.6
p_loo        28.6  21.8
looic     37221.5 335.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.  Min. n_eff
(-Inf, 0.5]   (good)     1628  99.9% 616
 (0.5, 0.7]   (ok)          0   0.0% <NA>
   (0.7, 1]   (bad)         0   0.0% <NA>
   (1, Inf)   (very bad)    1   0.1% 1

See help('pareto-k-diagnostic') for details.
```

```
loo_log_1 <- loo(logmodel_1)
print(loo_log_1)
```

```
Computed from 4000 by 1629 log-likelihood matrix

         Estimate   SE
elpd_loo  -2100.6 38.8
p_loo         3.9  0.4
looic      4201.2 77.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```

Notice that the `elpd_loo` values of these two models are on different scales; this is because the likelihoods of the linear model and the transformed model are for different quantities, and we have to correct for this difference in the LOO computation. Recall the equation of the logarithmic model:

\[ \log y = \beta_0 + \beta_1x_{1} + \ldots + \beta_p x_{p} + \varepsilon=X\beta + \varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,\sigma^2). \]

Define a new random variable \(z=\log y\). The equation above tells us that

\[ z \sim \mathcal{N}(X\beta,\sigma^2). \]

To obtain the elpd of the logarithmic model, we need to compute the density \(p(y\mid \beta,\sigma, X)\). This can be achieved by applying the change-of-variable formula between two variables \(y\) and \(z\).

\[ p(y\mid \beta,\sigma, X) = p(z\mid \beta,\sigma, X)\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}p(z\mid \beta,\sigma, X). \]

Here, \(\left\lvert \frac{dz}{dy}\right\rvert = \frac{1}{y}\) since we assume that \(y\) is positive. Thus, for a held-out data point \((X_i, y_i)\), with \(z_i = \log y_i\), the log-likelihood of this point is

\[ \log p(y_i\mid \beta,\sigma, X_i) = \underbrace{\log p(z_i\mid \beta,\sigma, X_i)}_{\text{log-likelihood of } z_i}-\log y_i. \]

In other words, we can make a correction for the elpd of the logarithmic model by subtracting \(\log y_i\) for each instance \((X_i, y_i)\).
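This change-of-variables identity can be checked numerically: subtracting \(\log y\) from the normal log-density of \(z=\log y\) gives exactly the lognormal log-density, which R provides as `dlnorm` (arbitrary values, for illustration only):

```
mu <- 10; sigma <- 0.9; y <- 30000
dnorm(log(y), mu, sigma, log=TRUE) - log(y)  # corrected log-density of y
dlnorm(y, meanlog=mu, sdlog=sigma, log=TRUE) # lognormal log-density: the same
```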

We can inspect the elpd of each instance via the `pointwise` attribute of the `loo` result. The following code shows the first five rows of the pointwise elpd from the logarithmic regression:

`print(loo_log_1$pointwise[1:5,])`

```
       elpd_loo mcse_elpd_loo        p_loo    looic influence_pareto_k
[1,] -1.1015364  0.0006749132 0.0018272358 2.203073       -0.066103226
[2,] -1.9373695  0.0007647321 0.0019790022 3.874739       -0.058415542
[3,] -1.1543427  0.0004481174 0.0006992233 2.308685       -0.036653352
[4,] -0.9582225  0.0003211416 0.0003854306 1.916445       -0.008417499
[5,] -1.9181897  0.0009642173 0.0031575345 3.836379       -0.061855767
```

From this, we can subtract \(\log y_i\) (the log of `earn` in this case) from the `elpd_loo` column (the first column).

`loo_log_1$pointwise[,1] <- loo_log_1$pointwise[,1] - log(earnings$earn[earnings$earn>0])`

We can now sum `elpd_loo`

over all instances.

```
elpd_with_correction <- sum(loo_log_1$pointwise[,1])
print(elpd_with_correction)
```

`[1] -17933`

The elpd is now on the same scale as that of the linear model, and we can compare the elpd of the linear and logarithmic models.

`loo_compare(loo_log_1, loo_1)`

```
Warning: Not all models have the same y variable. ('yhash' attributes do not
match)
```

```
elpd_diff se_diff
logmodel_1 0.0 0.0
fit_1 -677.8 155.4
```

## 7.4 Log-log model

If the log transformation is applied to both an input variable and the outcome, the coefficient can be interpreted as the percentage difference in \(y\) per percentage difference in \(x\).
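To see why, exponentiating a log-log model gives \(y = B_0 x^{\beta_1}\), so multiplying \(x\) by 1.01 multiplies \(y\) by \(1.01^{\beta_1} \approx 1 + 0.01\beta_1\). A quick check with a hypothetical slope:

```
b <- 1.6  # hypothetical log-log slope
1.01^b    # about 1.016: a 1% increase in x gives about a 1.6% increase in y
```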

Let us try this technique on the `Earnings` data.

```
logmodel_5 <- stan_glm(log(earn) ~ log(height) + male, data=earnings,
                       subset=earn>0, refresh=0)
print(logmodel_5)
```

```
stan_glm
 family:       gaussian [identity]
 formula:      log(earn) ~ log(height) + male
 observations: 1629
 predictors:   3
 subset:       earn > 0
------
            Median MAD_SD
(Intercept) 2.9    2.2
log(height) 1.6    0.5
male        0.4    0.1

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.9    0.0

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
```

The coefficient 1.6 of \(\log(\text{height})\) implies that a 1% increase in height corresponds to a 1.6% increase in earnings.

The coefficient 0.4 of \(\text{male}\) implies that, on average, a male earns about 40% more than a female of the same height (by the approximation \(e^x \approx 1+x\); the exact multiplier is \(e^{0.4}\approx 1.49\)).