24  Full & split conformal prediction

There are many situations where a wrong prediction can lead to costly consequences. For example, a wrong prediction about a patient's condition after treatment could lead to a fatal outcome. It is therefore important to understand the reliability and uncertainty of our predictions.

In the first few chapters, we discussed Bayesian regression, where the posterior predictive distribution measures the uncertainty of a prediction. One downside of that approach is the need to specify the distribution that generates the data in the form of a likelihood. In this chapter, we introduce a method with minimal distributional assumptions that estimates prediction intervals by exploiting the exchangeability of the data points.

24.1 Review: quantile

Let us first review the concept of a quantile. Given a random variable \(X\) with a continuous distribution function, the quantile function \(\operatorname{Quantile}(\beta, X)\) is the smallest value \(x\) such that \(\Pr[X \leq x] = \beta\). But if the distribution function of \(X\) has jumps, for example when \(X\) is discrete, such an \(x\) might not exist, so we have to relax the definition a bit. More precisely, the quantile function is:

\[ \operatorname{Quantile}(\beta,X) = \inf\{x : \Pr[X \leq x] \geq \beta\}. \]

Suppose that we observe data \(x=(x_1,\ldots,x_n)\). We can then define the empirical quantile function as follows:

\[ \operatorname{Quantile}(\beta,x) = \min\left\{x_i : \frac{\#\{j : x_j \leq x_i\}}{n} \geq \beta\right\}. \tag{24.1}\]

In other words, one can compute the quantile by sorting the observed values in ascending order and finding the smallest \(x_i\) such that the proportion of values less than or equal to \(x_i\) is at least \(\beta\).

To compute quantiles in R, we can use the quantile function. Below is an example of calculating \(\operatorname{Quantile}(0.1, x)\) where \(x\) is a sample from the standard normal distribution.

set.seed(0)

x <- rnorm(1000)
q <- quantile(x, 0.1)

print(q)
     10% 
-1.28896 
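
One caveat: by default, quantile interpolates between order statistics (type = 7), so the printed value may differ slightly from the exact empirical quantile of Equation 24.1; passing type = 1 selects the inverse of the empirical distribution function, which matches the definition. A quick sanity check:

# Equation 24.1 by hand: the smallest order statistic whose
# empirical CDF is at least beta = 0.1
q_manual <- sort(x)[ceiling(0.1 * length(x))]

# type = 1 is the inverse empirical CDF, matching Equation 24.1
q_type1 <- quantile(x, 0.1, type = 1)

all.equal(unname(q_type1), q_manual)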

When plotting a histogram of this sample, the area to the left of the quantile should take up approximately 10% of the total area.
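
We can check this visually with base R graphics (the exact shape depends on the simulated sample):

# histogram of the sample with the 0.1-quantile marked
hist(x, breaks = 40, main = "", xlab = "x")
abline(v = q, col = "red", lwd = 2)  # about 10% of the area lies left of this line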

24.2 Quantile Lemma

Exchangeability. Our method for constructing prediction intervals relies on the main assumption that the random variables of interest are exchangeable, that is, the joint distribution of \(R_1,\ldots,R_n\) is invariant under any permutation of the indices. One must therefore be careful and check whether the data at hand satisfy this assumption. For example, i.i.d. data points are always exchangeable, but data points generated from a distribution that shifts over time are not.

Suppose that \(R_1,\ldots,R_n\), together with a new variable \(R_{\textsf{new}}\), are exchangeable. Then the following quantile lemma tells us that it is possible to know something about \(R_{\textsf{new}}\) in relation to \(R_1,\ldots,R_n\) without knowing its underlying probability distribution. The lower bound, a standard result in conformal prediction, is due to Vovk, Gammerman, and Shafer (2005), and the upper bound is due to Lei et al. (2018).

Lemma 24.1 (Quantile Lemma) Denote \(R_{1:n}=\{R_1,\ldots,R_n\}\). If \(R_1,\ldots,R_{n},R_{\textsf{new}}\) are exchangeable, then for any \(\beta\in (0,1)\),

\[ \Pr\left[R_{\textsf{new}} \leq \operatorname{Quantile}(\beta, R_{1:n} \cup \{\infty\})\right] \geq \beta. \tag{24.2}\]

If, in addition, ties occur with probability zero, i.e. \(\Pr[R_i=R_j]=0\) for all \(i\not= j\), then we also have an upper bound:

\[ \Pr\left[R_{\textsf{new}} \leq\operatorname{Quantile}(\beta, R_{1:n} \cup \{\infty\})\right] \leq \beta+ \frac1{n+1}. \tag{24.3}\]

Proof. We go over the proof in five steps.

  1. Let \(q=\operatorname{Quantile}(\beta, R_{1:n} \cup \{\infty\})\). If we modify the data \(R_{1:n} \cup \{\infty\}\) by moving \(\infty\) to any other value larger than \(q\), the \(\beta\)-quantile remains unchanged. Specifically, moving \(\infty\) to \(R_{\textsf{new}}\) yields

    \[ R_{\textsf{new}} > \operatorname{Quantile}(\beta, R_{1:n} \cup \{\infty\}) \Longleftrightarrow R_{\textsf{new}} > \operatorname{Quantile}(\beta, R_{1:n} \cup \{R_{\textsf{new}}\}), \]

    which is equivalent to

    \[ R_{\textsf{new}} \leq \operatorname{Quantile}(\beta, R_{1:n} \cup \{\infty\}) \Longleftrightarrow R_{\textsf{new}} \leq \operatorname{Quantile}(\beta, R_{1:n} \cup \{R_{\textsf{new}}\}). \]

  2. From the definition of the empirical quantile (Equation 24.1), \(\operatorname{Quantile}(\beta, R_{1:n} \cup \{R_{\textsf{new}}\})\) is the \(\lceil\beta(n+1)\rceil\)-th smallest of the \(n+1\) values. Hence, writing \(\operatorname{rank}(R_{\textsf{new}}) = \#\{j \in I : R_{j}\le R_{\textsf{new}}\}\) with \(I=\{1,\ldots,n,\textsf{new}\}\),

    \[ R_{\textsf{new}} \leq \operatorname{Quantile}(\beta, R_{1:n} \cup \{R_{\textsf{new}}\}) \Longleftrightarrow \operatorname{rank}(R_{\textsf{new}}) \le \lceil\beta(n+1)\rceil. \]
  3. Let us detour a bit and consider \(\Pr[\operatorname{rank}(R_{\textsf{new}})=r]\) for any \(r\in\{1,\ldots,n+1\}\). Since every rank must be attained by some index in \(I\), we have

    \[ \sum_{i\in I}\Pr[\operatorname{rank}(R_{i})=r] \geq \Pr[\operatorname{rank}(R_{i})=r \text{ for some }i\in I] = 1. \tag{24.4}\]

    By exchangeability, \(\Pr[\operatorname{rank}(R_{i})=r]\) is the same for all \(i\in I\), so Equation 24.4 implies \(\Pr[\operatorname{rank}(R_{\textsf{new}})=r] \geq \frac1{n+1}\) for all \(r\).

  4. Combining Steps 2 and 3, we obtain the lower bound in Equation 24.2:

    \[\begin{align*} \Pr\left[\operatorname{rank}(R_{\textsf{new}}) \le \lceil\beta(n+1)\rceil\right] &= \sum_{r=1}^{\lceil\beta(n+1)\rceil}\Pr[\operatorname{rank}(R_{\textsf{new}})=r] \\ &\geq \frac{\lceil\beta(n+1)\rceil}{n+1} \\ &\geq \beta. \end{align*}\]
  5. If ties occur with probability zero, then the rank of \(R_{\textsf{new}}\) is distributed exactly as \(\operatorname{Uniform}\{1,\ldots,n+1\}\), from which we can compute the probability exactly and obtain the upper bound in Equation 24.3:

    \[ \Pr\left[\operatorname{rank}(R_{\textsf{new}}) \le \lceil\beta(n+1)\rceil\right] = \frac{\lceil \beta(n+1)\rceil}{n+1} \leq \beta + \frac1{n+1}. \]
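
Before applying the lemma, let us sanity-check it with a minimal simulation sketch, assuming i.i.d. (hence exchangeable) scores. With \(n=20\) and \(\beta=0.8\), the lemma predicts a coverage probability between \(0.8\) and \(0.8 + 1/21 \approx 0.848\).

set.seed(1)
n_obs <- 20
beta_sim <- 0.8

covered <- replicate(10000, {
    R <- rnorm(n_obs)   # exchangeable scores
    R_new <- rnorm(1)   # a new draw from the same distribution
    # Quantile(beta, scores together with an infinite point), as in Equation 24.1
    q <- sort(c(R, Inf))[ceiling(beta_sim * (n_obs + 1))]
    R_new <= q
})

mean(covered)  # around 17/21, inside [0.8, 0.848]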

24.3 Full conformal prediction

The main idea behind constructing the prediction intervals is to let \(R_1,\ldots,R_{n},R_{\textsf{new}}\) be the "non-conformity scores" of our data points \((x_1,y_1),\ldots,(x_n,y_n),(x_{\textsf{new}},?)\), and then use the quantile lemma to find a prediction interval for the outcome associated with \(x_{\textsf{new}}\), say \(y_{\textsf{new}}\).

Let us consider the problem of predicting a child's IQ score from the mother's education, IQ score, years of employment, and age. We load the KidIQ dataset and create a new hypothetical data point for which the child's IQ has not yet been observed.

kidiq <- read.csv("data/kidiq.csv")

n <- nrow(kidiq)
new_data <- data.frame(kid_score = NA,
                       mom_hs = 0,
                       mom_iq = 90,
                       mom_work = 1,
                       mom_age = 20)

head(kidiq)
  kid_score mom_hs    mom_iq mom_work mom_age
1        65      1 121.11753        4      27
2        98      1  89.36188        4      25
3        85      1 115.44316        4      27
4        83      1  99.44964        3      25
5       115      1  92.74571        4      27
6        98      0 107.90184        1      18

Our goal is to construct a 95% prediction interval for the child's IQ. Here, we introduce our first method, namely full conformal prediction (Vovk, Gammerman, and Shafer 2005), which utilizes the quantile lemma to obtain a prediction interval with any model of choice (for example, a linear regression model). There are two main variants of full conformal prediction.

24.3.1 Deleted full conformal prediction

Let \(\mathcal{D}_{1:n} = \{(x_1,y_1),\ldots,(x_n,y_n)\}\) be the observed data and \((x_{\textsf{new}},?)\) be a new data point. We can compute the prediction intervals for the outcome \(y_{\textsf{new}}\) associated with \(x_{\textsf{new}}\) as follows:

For each possible value of \(y\):

  1. Add a new point \((x_{\textsf{new}},y)\) to the dataset.

  2. For each \(i\in\{1,\ldots,n,\textsf{new}\}\),

    2.1. Fit our predictive model on all but the \(i\)-th data point: \(\mathcal{D}_{1:n} \cup \{(x_{\textsf{new}},y)\} \setminus \{(x_i,y_i)\}\). Let \(\hat{\mu}^{y}_{-i}\) be the fitted model.

    2.2. Compute the non-conformity score \(R^{y}_i = \lvert y_i - \hat{\mu}^{y}_{-i}(x_i) \rvert\).

  3. Sort \(R^y_1,\ldots, R^y_n\) in increasing order: \(R^y_{(1)},\ldots, R^y_{(n)}\).

  4. Let \(k = \lceil \beta(n+1)\rceil\). We keep \(y\) if \(R^{y}_{\textsf{new}} \leq R^y_{(k)}\) and discard it otherwise.

After we are done with all \(y\), our prediction interval \(C_{\beta}(x_{\textsf{new}})\) consists of the values of \(y\) that we keep. In other words, \(C_{\beta}(x_{\textsf{new}}) = \{y : R^{y}_{\textsf{new}} \leq R^{y}_{(k)}\}\).

The fact that \(C_{\beta}(x_{\textsf{new}})\) is a \(\beta\)-prediction interval is a simple consequence of the quantile lemma. Let \(y_{\textsf{new}}\) be the actual outcome associated with \(x_{\textsf{new}}\), and assume that \((x_1,y_1),\ldots,(x_n,y_n),(x_{\textsf{new}}, y_{\textsf{new}})\) are exchangeable. Then the scores \(R^{y_{\textsf{new}}}_1,\ldots,R^{y_{\textsf{new}}}_n,R^{y_{\textsf{new}}}_{\textsf{new}}\) are also exchangeable by their symmetric construction, and \(R^{y_{\textsf{new}}}_{(k)}\) is the \(\beta\)-quantile of \(R^{y_{\textsf{new}}}_{1:n} \cup \{\infty\}\) whenever \(k \leq n\). It then follows from Equation 24.2 that

\[ \Pr[y_{\textsf{new}}\in C_{\beta}(x_{\textsf{new}})] = \Pr\big[R^{y_{\textsf{new}}}_{\textsf{new}} \leq R^{y_{\textsf{new}}}_{(k)}\big] \geq \beta. \]

Below is an implementation of the procedure on the KidIQ data. Here, we assume that the possible values of IQ range from 1 to 200. With \(n = 434\) data points and \(\beta = 0.95\), the quantile index is \(k = \lceil 0.95 \times 435 \rceil = 414\). To speed up the process, we convert the data frame into a matrix and fit the linear regressions with the bare-bones .lm.fit instead of lm.

kidiq_mat <- as.matrix(kidiq)
kidiq_mat <- rbind(kidiq_mat, as.matrix(new_data))
X <- kidiq_mat[, -1]            # predictors
X <- cbind(rep(1, nrow(X)), X)  # add an intercept column
Y <- kidiq_mat[, 1]             # outcome (kid_score)

beta <- 0.95
k <- ceiling(beta * (n+1))

Rnew <- rep(NA, 200)
Rk <- rep(NA, 200)

for (y in 1:200) {
    Y[n+1] <- y                 # hypothesize y as the new outcome
    R <- rep(NA, n+1)

    # leave-one-out fits: score each point with the model fitted without it
    for (i in 1:(n+1)) {
        model <- .lm.fit(X[-i, ], Y[-i])
        yhat <- X[i, ] %*% coef(model)
        R[i] <- abs(Y[i] - yhat)
    }

    Rnew[y] <- R[n+1]           # score of the new point
    Rk[y] <- sort(R[1:n])[k]    # beta-quantile of the other scores
}

Let us plot \(R^{y}_{\textsf{new}}\) and \(R^{y}_{(k)}\) in order to find the prediction interval for \(y\).

plot(1:200, Rnew, type = "l", lwd = 2, col = "blue",
     xlab = "y", ylab = "Score")
lines(1:200, Rk, lwd = 2, lty = 2, col = "red")
legend("topright", legend=c("Rnew", "0.95-quantile"),
       col=c("blue", "red"), lty=1:2)

The prediction interval consists of all \(y\) whose score \(R^{y}_{\textsf{new}}\) lies below the \(\beta\)-quantile \(R^{y}_{(k)}\). We can also read off the lower and upper endpoints of the interval directly.

which_y_conform <- which(Rnew < Rk)
last <- length(which_y_conform)

lower <- which_y_conform[1]
upper <- which_y_conform[last]
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 40 , 112 ]

To see how well the prediction interval covers the data, we plot the data points whose characteristics are similar to those of the new data point. Specifically, we take the children whose mothers were in the 16-24 age group, never graduated from high school (mom_hs = 0), and had been working for one year (mom_work = 1).

kidiq_subset <- kidiq[kidiq$mom_hs == 0 &
                      kidiq$mom_work == 1 &
                      abs(kidiq$mom_age - 20) < 5,]

plot(kid_score ~ mom_iq, data = kidiq_subset,
     ylab = "Child's IQ", xlab = "Mother's IQ",
     ylim = c(38, 113), pch = 19)
arrows(x0 = new_data$mom_iq, y0 = lower,
       x1 = new_data$mom_iq, y1 = upper,
       code = 3, angle = 90, length = 0.1)

The plot indicates that the interval has sufficient coverage over the data points.

24.3.2 Ordinary full conformal prediction

This variant of full conformal prediction fits the model on the full dataset only once for each value of \(y\), instead of \(n+1\) times, so it is a lot faster to run than the deleted variant.

Let \(\mathcal{D}_{1:n} = \{(x_1,y_1),\ldots,(x_n,y_n)\}\) be the observed data and \((x_{\textsf{new}},?)\) be a new data point.

For each possible value of \(y\):

  1. Add a new point \((x_{\textsf{new}},y)\) to the dataset.

  2. Fit our predictive model on \(\mathcal{D}_{1:n} \cup \{(x_{\textsf{new}},y)\}\). Let \(\hat{\mu}^{y}\) be the fitted model.

  3. For each \(i\in\{1,\ldots,n,\textsf{new}\}\), compute the non-conformity score \(R^{y}_i = \lvert y_i - \hat{\mu}^{y}(x_i) \rvert\).

  4. Sort \(R^y_1,\ldots, R^y_n\) in increasing order: \(R^y_{(1)},\ldots, R^y_{(n)}\).

  5. Let \(k = \lceil \beta(n+1)\rceil\). We keep \(y\) if \(R^{y}_{\textsf{new}} \leq R^y_{(k)}\) and discard it otherwise.

After we are done with all \(y\), our prediction interval \(C_{\beta}(x_{\textsf{new}})\) consists of the values of \(y\) that we keep. In other words, \(C_{\beta}(x_{\textsf{new}}) = \{y : R^{y}_{\textsf{new}} \leq R^{y}_{(k)}\}\).

beta <- 0.95
k <- ceiling(beta * (n+1))

Rnew <- rep(NA, 200)
Rk <- rep(NA, 200)

for (y in 1:200) {
    new_data[, "kid_score"] <- y  # hypothesize y as the new outcome

    # fit once on all n+1 points; the absolute residuals are the scores
    model <- lm(kid_score ~ ., data = rbind(kidiq, new_data))
    R <- abs(residuals(model))

    Rnew[y] <- R[n+1]             # score of the new point
    Rk[y] <- sort(R[1:n])[k]      # beta-quantile of the other scores
}

From this, let us compute the prediction interval.

which_y_conform <- which(Rnew < Rk)
last <- length(which_y_conform)

lower <- which_y_conform[1]
upper <- which_y_conform[last]
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 40 , 112 ]

There is little to no difference between the two prediction intervals. See Abad et al. (2022) and Vovk et al. (2019) for more discussion and experiments on these two variants.

24.4 Split conformal prediction

One downside of full conformal prediction is that it requires fitting the predictive model as many times as there are candidate values of \(y\). Alternatively, we can split the data into two sets: a training set to fit the model, and a calibration set on which to compute the non-conformity scores. The \(\beta\)-quantile of the scores is then used to obtain a prediction interval as before. The fact that the training set is not involved in the scoring has two implications:

  • We only need to assume that the data points in the calibration set and the new data point are exchangeable;
  • The model is fitted only once, on the training set. In particular, we no longer have to refit the model for every candidate value of \(y\).

Together, these lead to faster computation at the cost of some statistical efficiency, since the model is fitted on only part of the dataset. This method, referred to as split conformal prediction, can be performed as follows:

Let \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\) be the observed data and \((x_{\textsf{new}},?)\) be a new data point.

  1. Split \(\{1,\ldots,n\}\) into a training set \(\mathsf{Tr}\) and a calibration set \(\mathsf{Cal}\), and let \(m = |\mathsf{Cal}|\).

  2. Fit our model on \(\{(x_i,y_i) : i \in \mathsf{Tr}\}\). Denote the fitted model by \(\hat{\mu}\).

  3. For each \(j\in \mathsf{Cal}\), compute the non-conformity score \(R_j = \lvert y_j - \hat{\mu}(x_j) \rvert\).

  4. Sort the scores \(\{R_j : j\in\mathsf{Cal}\}\) in increasing order: \(R_{(1)},\ldots, R_{(m)}\).

  5. Let \(k = \lceil \beta(m+1)\rceil\) and \(\hat{y}_{\textsf{new}} = \hat{\mu}(x_{\textsf{new}})\). The prediction interval is

\[ C_{\beta}(x_{\textsf{new}}) = \left[\hat{y}_{\textsf{new}}-R_{(k)}, \hat{y}_{\textsf{new}}+R_{(k)}\right]. \]

Again, the fact that \(C_{\beta}(x_{\textsf{new}})\) is a \(\beta\)-prediction interval is a simple consequence of the quantile lemma: let \(y_{\textsf{new}}\) be the actual outcome associated with \(x_{\textsf{new}}\) and \(R_{\textsf{new}} = \lvert y_{\textsf{new}} - \hat{y}_{\textsf{new}}\rvert\). As before, \(R_{\textsf{new}} \leq R_{(k)}\) with probability at least \(\beta\), which is equivalent to \(y_{\textsf{new}} \in [\hat{y}_{\textsf{new}}- R_{(k)}, \hat{y}_{\textsf{new}}+ R_{(k)}]\).

Here is an example of split conformal prediction on the KidIQ dataset:

beta <- 0.95
m <- floor(n/2)
k <- ceiling(beta * (m+1))

# random train/calibration split
calib_id <- sample(seq_len(n), size = m)
kidiq_train <- kidiq[-calib_id, ]
kidiq_calib <- kidiq[calib_id, ]

# fit once on the training set
model <- lm(kid_score ~ ., data = kidiq_train)

# non-conformity scores on the calibration set
yhat <- predict(model, newdata = kidiq_calib)
y <- kidiq_calib$kid_score
R <- abs(y - yhat)

Rk <- sort(R)[k]  # beta-quantile of the scores

ynew_hat <- predict(model, newdata = new_data)
lower <- ynew_hat - Rk
upper <- ynew_hat + Rk
cat("Prediction interval: [", lower, ",", upper, "]")
Prediction interval: [ 36.34356 , 111.7025 ]

The prediction interval is a bit wider than those from the full conformal predictions, which agrees with our remark at the beginning of this section about the statistical efficiency of split conformal prediction.
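
To see the coverage guarantee in action, we can estimate the empirical coverage by repeatedly holding out one observation as the "new" data point and rerunning the split conformal procedure. Below is a rough sketch, reusing n and beta from above; each repetition redraws the hold-out point and the train/calibration split.

set.seed(1)
covered <- replicate(200, {
    # hold out one observation to play the role of the new point
    new_id <- sample(seq_len(n), 1)
    rest <- kidiq[-new_id, ]

    # split the remaining data into training and calibration sets
    m <- floor(nrow(rest) / 2)
    calib_id <- sample(seq_len(nrow(rest)), size = m)
    model <- lm(kid_score ~ ., data = rest[-calib_id, ])

    # beta-quantile of the calibration scores
    calib <- rest[calib_id, ]
    R <- abs(calib$kid_score - predict(model, newdata = calib))
    Rk <- sort(R)[ceiling(beta * (m + 1))]

    # does the interval cover the held-out outcome?
    yhat <- predict(model, newdata = kidiq[new_id, ])
    abs(kidiq$kid_score[new_id] - yhat) <= Rk
})

mean(covered)  # should be at least roughly 0.95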

References

Abad, Javier, Umang Bhatt, Adrian Weller, and Giovanni Cherubin. 2022. “Approximating Full Conformal Prediction at Scale via Influence Functions.”
Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. 2018. “Distribution-Free Predictive Inference for Regression.” Journal of the American Statistical Association 113 (523): 1094–1111. https://doi.org/10.1080/01621459.2017.1307116.
Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. 2005. Algorithmic Learning in a Random World. Springer-Verlag. https://doi.org/10.1007/b106715.
Vovk, Vladimir, Jieli Shen, Valery Manokhin, and Min-ge Xie. 2019. “Nonparametric Predictive Distributions Based on Conformal Prediction.” Machine Learning 108 (3): 445–74. https://doi.org/10.1007/s10994-018-5755-8.