25  Jackknife+, CV+ and Quantile regression

A downside of full conformal prediction is the need to refit the predictive model every time a new data point is introduced, making it computationally expensive. The split conformal, on the other hand, is fast but uses the data less efficiently, as each data point serves either fitting or calibration, but not both. Between these two extremes, we would like a method that uses every data point for both fitting and scoring while remaining reasonably fast to compute; two such methods are Jackknife+ and CV+ (Barber et al. 2021).

25.1 Jackknife+

A simple way to fix the data-efficiency problem is to fit the model and calibrate on the whole dataset. However, this approach is likely to overfit: the non-conformity scores on the training data are naturally smaller than those on unseen data, resulting in a prediction interval that is too narrow to cover enough new data points.
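As a quick illustration of this optimism, here is a minimal sketch on simulated data of our own (the names xs, ys, d, and fit are ours) comparing in-sample and held-out non-conformity scores:

set.seed(1)
xs <- runif(200, 0, 2)
ys <- 2 * xs + rnorm(200)
d <- data.frame(x = xs, y = ys)

# Fit a flexible model on the first half of the data only
fit <- lm(y ~ poly(x, 5), data = d[1:100, ])

# Mean non-conformity score, in-sample vs. held-out
mean(abs(d$y[1:100] - predict(fit, d[1:100, ])))     # smaller
mean(abs(d$y[101:200] - predict(fit, d[101:200, ]))) # typically larger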

The Jackknife addresses this issue using leave-one-out fitting and calibration: let \(\mathcal{D}_{1:n} = \{(x_1, y_1),\ldots,(x_n,y_n)\}\) be the observed data. For each \(i\in \{1,\ldots,n\}\), we fit the model on all but one data point, \(\mathcal{D}_{1:n}- \{(x_i,y_i)\}\), to obtain a fitted model \(\hat{\mu}_{-i}\). The non-conformity score is then

\[ R_i = \lvert y_i - \hat{\mu}_{-i}(x_i) \rvert. \]

Let \(\hat{\mu}\) be the model fitted on all observed data \(\mathcal{D}_{1:n}\). Let \(x_{\textsf{new}}\) be a new data point. The 90% Jackknife prediction interval is similar to the one in the split conformal:

\[ \left[\text{5th-perc. of } \left\{\hat{\mu}(x_{\textsf{new}})-R_i\right\}, \text{95th-perc. of } \left\{\hat{\mu}(x_{\textsf{new}})+R_i\right\}\right]. \]

However, this interval consists of percentiles of the distances \(R_i\) measured from the single fixed prediction \(\hat{\mu}(x_{\textsf{new}})\), which might be too restrictive and result in undercoverage.
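To make this concrete, here is a minimal sketch of the plain Jackknife interval, reusing the simulated data d from the sketch above; note that both endpoints are centered at the single full-data prediction:

# Leave-one-out non-conformity scores
R <- rep(NA, 200)
for (i in 1:200) {
    model_loo <- lm(y ~ x, data = d[-i, ])
    R[i] <- abs(d$y[i] - predict(model_loo, d[i, ]))
}

# 90% Jackknife interval at x = 1: the 5th percentile of
# {mu_hat(x_new) - R_i} equals mu_hat(x_new) minus the 95th percentile of {R_i}
model_full <- lm(y ~ x, data = d)
ynew_hat <- predict(model_full, data.frame(x = 1))
c(ynew_hat - quantile(R, 0.95), ynew_hat + quantile(R, 0.95))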

Jackknife+. To fix the Jackknife’s issue, we simply use \(\hat{\mu}_{-i}\) instead of \(\hat{\mu}\) to predict the outcome of the new data point. Here are the steps to perform Jackknife+ in full detail:

Let \(\mathcal{D}_{1:n} = \{(x_1, y_1),\ldots,(x_n,y_n)\}\) be the observed data and \((x_{\textsf{new}},?)\) be a new data point.

For each \(i\in \{1,\ldots,n\}\),

  1. Fit the model on all but one data point \(\mathcal{D}_{1:n}- \{(x_i,y_i)\}\). Let \(\hat{\mu}_{-i}\) be the fitted model.

  2. Compute the non-conformity score: \(R_i = \lvert y_i - \hat{\mu}_{-i}(x_i) \rvert\).

  3. Compute a lower and upper bound of \(\hat{\mu}_{-i}\)’s prediction interval for the new data point:

    \[L_i = \hat{\mu}_{-i}(x_{\textsf{new}}) - R_i,\quad U_i = \hat{\mu}_{-i}(x_{\textsf{new}}) + R_i.\]

The Jackknife+ prediction interval is

\[ C^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}}) = \left[\alpha\text{-quantile of } \{L_1,\ldots,L_n\}, (1-\alpha)\text{-quantile of } \{U_1,\ldots,U_n\}\right]. \]

The following theorem from Barber et al. (2021) shows that this prediction interval has coverage probability at least \(1-2\alpha\). So, for example, to obtain a 90% prediction interval we would set \(\alpha = 0.05\).

Theorem 25.1 If \((x_1,y_1),\ldots,(x_n,y_n),(x_{\textsf{new}},y_{\textsf{new}})\) are exchangeable, then

\[ \Pr[y_{\textsf{new}} \in C^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}})] \geq 1-2\alpha. \]

The following diagram visualizes the difference between Jackknife and Jackknife+.

Comparison between Jackknife and Jackknife+ intervals.

Let us try the Jackknife+ method on the KidIQ dataset.

kidiq <- read.csv("data/kidiq.csv")

n <- nrow(kidiq)
new_data <- data.frame(kid_score = NA,  # the new point's outcome is unknown
                       mom_hs = 0,
                       mom_iq = 90,
                       mom_work = 1,
                       mom_age = 20)

head(kidiq)
  kid_score mom_hs    mom_iq mom_work mom_age
1        65      1 121.11753        4      27
2        98      1  89.36188        4      25
3        85      1 115.44316        4      27
4        83      1  99.44964        3      25
5       115      1  92.74571        4      27
6        98      0 107.90184        1      18

Suppose that we want to find a 90% prediction interval of the new data point; then we have to set \(\alpha=0.05\).

alpha <- 0.05

lowers <- rep(NA, n)
uppers <- rep(NA, n)

for (i in 1:n) {
    # Leave-one-out fit: exclude the i-th data point
    model <- lm(kid_score ~ ., data = kidiq[-i, ])
    ynew_hat <- predict(model, new_data)

    # Non-conformity score of the held-out point
    yi_hat <- predict(model, kidiq[i, ])
    yi <- kidiq[i, "kid_score"]
    Ri <- abs(yi - yi_hat)

    # Bounds of the leave-one-out model's interval for the new point
    lowers[i] <- ynew_hat - Ri
    uppers[i] <- ynew_hat + Ri
}

lower <- quantile(lowers, alpha)
upper <- quantile(uppers, 1 - alpha)
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 39.96481 , 112.2441 ]

The prediction interval is similar to those obtained in the previous chapters, so it should cover a sufficient proportion of the data points. To see this, we plot the interval over the data points whose features are similar to those of the new data point.

kidiq_subset <- kidiq[kidiq$mom_hs == 0 &
                      kidiq$mom_work == 1 &
                      abs(kidiq$mom_age - 20) < 5,]

plot(kid_score ~ mom_iq, data = kidiq_subset,
     ylab = "Child's IQ", xlab = "Mother's IQ",
     ylim = c(38, 113), pch = 19)
arrows(x0 = new_data$mom_iq, y0 = lower,
       x1 = new_data$mom_iq, y1 = upper,
       code = 3, angle = 90, length = 0.1)
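We can also compute the fraction of these nearby points that fall inside the interval; this is only a heuristic check, since the interval was constructed for \(x_{\textsf{new}}\) rather than for each point in the subset:

# Fraction of the similar points covered by the interval
mean(kidiq_subset$kid_score >= lower & kidiq_subset$kid_score <= upper)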

25.2 CV+

A 5-fold split of the observed data.

CV+ is a generalization of Jackknife+ in which we split the data into several folds instead of leaving out one point at a time for calibration. Here are the steps to perform CV+ in full detail:

Let \(\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) be the observed data, \((x_{\textsf{new}},?)\) be a new data point, and \(K\) be a pre-specified number of folds.

  1. Split \(\{1,\ldots,n\}\) into \(K\) folds \(I_1,\ldots,I_{K}\). Define a function \(f:\{1,\ldots, n\}\to \{1,\ldots,K\}\) such that \(f(i)=k\) if \(i\in I_k\).

  2. For each \(k\in\{1,\ldots,K\}\),

    2.1. Fit the model on \(\{(x_i,y_i) : i\in \{1,\ldots,n\}- I_k\}\). Let \(\hat{\mu}_k\) be the fitted model.

    2.2. For each \(i\in I_k\),

            \(\bullet\)  Compute the non-conformity score \(R_i = \lvert y_i - \hat{\mu}_{k}(x_i) \rvert\).

            \(\bullet\)  Compute a lower and upper bound of \(\hat{\mu}_{k}\)’s prediction interval for the new data point: \[L_i = \hat{\mu}_{k}(x_{\textsf{new}}) - R_i,\quad U_i = \hat{\mu}_{k}(x_{\textsf{new}}) + R_i.\]

The CV+ prediction interval is

\[ C^{\textsf{CV+}}_{1-2\alpha}(x_{\textsf{new}}) = \left[\alpha\text{-quantile of } \{L_1,\ldots,L_n\}, (1-\alpha)\text{-quantile of } \{U_1,\ldots,U_n\}\right]. \]

The following theorem from Barber et al. (2021) shows that this prediction interval has coverage probability at least \(1-2\alpha\). So, for example, to obtain a 90% prediction interval we would set \(\alpha = 0.05\).

Theorem 25.2 If \((x_1,y_1),\ldots,(x_n,y_n),(x_{\textsf{new}},y_{\textsf{new}})\) are exchangeable, then

\[ \Pr[y_{\textsf{new}} \in C^{\textsf{CV+}}_{1-2\alpha}(x_{\textsf{new}})] \geq 1-2\alpha. \]

Let us use CV+ to construct a 90% prediction interval for the new data point of the KidIQ dataset. Here, we use createFolds from the caret package to split the data into 10 folds.

library(caret)
alpha <- 0.05

lowers <- rep(NA, n)
uppers <- rep(NA, n)

folds <- createFolds(kidiq$kid_score, k = 10)
for (fold in folds) {
    # Fit on all folds except the current one
    model <- lm(kid_score ~ ., data = kidiq[-fold, ])
    ynew_hat <- predict(model, new_data)

    # Non-conformity scores of the held-out fold
    yi_hat <- predict(model, kidiq[fold, ])
    yi <- kidiq[fold, "kid_score"]
    Ri <- abs(yi - yi_hat)

    lowers[fold] <- ynew_hat - Ri
    uppers[fold] <- ynew_hat + Ri
}

lower <- quantile(lowers, alpha)
upper <- quantile(uppers, 1 - alpha)
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 40.04851 , 112.5228 ]
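As an aside, if the caret dependency is unwanted, an equivalent fold split can be done in base R; a minimal sketch:

# Balanced random fold labels, then a list of index vectors
# (the same shape that createFolds returns)
fold_id <- sample(rep(1:10, length.out = n))
folds <- split(seq_len(n), fold_id)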

25.3 Quantile regression

In the split conformal, the prediction interval is \([\hat{y}_{\textsf{new}}-R_{(k)}, \hat{y}_{\textsf{new}}+R_{(k)}]\). Notice that this interval has a constant width that does not depend on the new data point, so the split conformal might not be appropriate for heteroskedastic data.

Alternatively, we could instead estimate lower and upper quantiles of the outcome and use them as the endpoints of the prediction interval. To estimate a specific quantile of the outcome given a set of predictors, we can use quantile regression. Below is an example of using quantile regression to estimate the 0.1- and 0.9-quantiles on simulated data in R.

library(quantreg)
n_points <- 500
slope <- 2

x <- runif(n_points, 0, 2)
u <- x * rnorm(n_points) # heteroskedastic errors
y <- x * slope + u

simdata <- data.frame(x = x, y = y)

plot(x, y, pch = 20, ylim = c(-0.5, 7))
abline(rq(y ~ x, tau = 0.9), col = "blue", lwd = 2)
abline(rq(y ~ x, tau = 0.1), col = "red", lwd = 2)
legend("topleft", legend = c("0.9-quantile", "0.1-quantile"),
       col = c("blue", "red"), lty = 1, lwd = 2)

0.1-quantile regression and 0.9-quantile regression.

This plot suggests that the quantile estimates can be used to construct a prediction interval.

Motivated by this observation, Romano, Patterson, and Candes (2019) introduced conformal quantile regression (CQR), which uses the quantile estimates to construct a prediction interval. The steps to perform CQR are as follows:

Let \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\) be the observed data, \((x_{\textsf{new}},?)\) be a new data point, \(\gamma\in (0,0.5)\) be a pre-specified quantile level, and \(\beta\in(0,1)\) be the target coverage level.

  1. Split \(\{1,\ldots,n\}\) into a training set \(\mathsf{Tr}\) and a calibration set \(\mathsf{Cal}\).

  2. Fit \(\gamma\)-quantile and \((1-\gamma)\)-quantile regressions on \(\{(x_i,y_i) : i \in \mathsf{Tr}\}\). Denote the fitted models by \(\hat{q}^{-}\) and \(\hat{q}^{+}\), respectively.

  3. For each \(j\in \mathsf{Cal}\), compute the non-conformity score:

\[R_j = \max\left\{\hat{q}^{-}(x_j) - y_j, y_j - \hat{q}^{+}(x_j)\right\}.\]

In other words, \(R_j\) is a signed distance from \(y_j\) to the nearer of the two regression lines: \(R_j\) is negative if \(\hat{q}^{-}(x_j) < y_j < \hat{q}^{+}(x_j)\) and positive otherwise.

  4. Sort the scores \(\{R_j : j\in \mathsf{Cal}\}\) in increasing order: \(R_{(1)},\ldots, R_{(m)}\), where \(m = \lvert\mathsf{Cal}\rvert\).

  5. Let \(k = \lceil \beta(m+1)\rceil\). The prediction interval is

\[ C^{\textsf{CQR}}_{\beta}(x_{\textsf{new}}) = [\hat{q}^{-}(x_{\textsf{new}})-R_{(k)}, \hat{q}^{+}(x_{\textsf{new}})+R_{(k)}]. \]

The following theorem from Romano, Patterson, and Candes (2019) shows that this prediction interval has coverage probability at least \(\beta\).

Theorem 25.3 If \((x_1,y_1),\ldots,(x_n,y_n),(x_{\textsf{new}},y_{\textsf{new}})\) are exchangeable, then

\[ \Pr[y_{\textsf{new}} \in C^{\textsf{CQR}}_{\beta}(x_{\textsf{new}})] \geq \beta. \]

Let us use CQR to construct a 95% prediction interval for a new data point at \(x = 1.5\), with \(\gamma = 0.2\) and half of the data held out for calibration.

beta <- 0.95
q <- 0.2                  # quantile level, playing the role of gamma
m <- floor(n_points/2)    # size of the calibration set

new_xy <- data.frame(x = 1.5, y = NA)

calib_id <- sample(seq_len(n_points), size = m)
simdata_train <- simdata[-calib_id, ]
simdata_calib <- simdata[calib_id, ]

# Training
model_lo <- rq(y ~ x, tau = q, data = simdata_train)
model_hi <- rq(y ~ x, tau = 1 - q, data = simdata_train)

# Calibration
yhat_lo <- predict(model_lo, newdata = simdata_calib)
yhat_hi <- predict(model_hi, newdata = simdata_calib)
y <- simdata_calib$y
R <- pmax(yhat_lo - y, y - yhat_hi)

Rk <- quantile(R, beta)  # empirical beta-quantile of the calibration scores

# Estimate the quantiles of the new data point
ynew_hat_lo <- predict(model_lo, newdata = new_xy)
ynew_hat_hi <- predict(model_hi, newdata = new_xy)

# Construct the prediction interval
lower <- ynew_hat_lo - Rk
upper <- ynew_hat_hi + Rk
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 0.553646 , 5.422904 ]

Let us plot the prediction interval to see how it fares on the simulated data,

plot(y ~ x, data = simdata, pch = 20, ylim = c(-0.5, 7))
arrows(x0 = new_xy$x, y0 = lower,
       x1 = new_xy$x, y1 = upper,
       code = 3, angle = 90, length = 0.1,
       col = "red", lwd = 2.5)

which shows that the interval has sufficient coverage at \(x=1.5\).
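To go beyond a visual check, we can estimate the empirical coverage of the CQR interval on freshly simulated test data from the same process; this is our own sanity check, not part of the CQR procedure:

# Fresh test data from the same heteroskedastic process
x_test <- runif(1000, 0, 2)
y_test <- x_test * slope + x_test * rnorm(1000)

# CQR interval at each test point, then the empirical coverage
test_lo <- predict(model_lo, newdata = data.frame(x = x_test)) - Rk
test_hi <- predict(model_hi, newdata = data.frame(x = x_test)) + Rk
mean(y_test >= test_lo & y_test <= test_hi)  # should be about beta = 0.95 or higher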

References

Barber, Rina Foygel, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. 2021. “Predictive Inference with the Jackknife+.” The Annals of Statistics 49 (1): 486–507. https://doi.org/10.1214/20-AOS1965.
Romano, Yaniv, Evan Patterson, and Emmanuel Candes. 2019. “Conformalized Quantile Regression.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/5103c3584b063c431bd1268e9b5e76fb-Paper.pdf.