# 26 Conformal prediction for classification

Consider the following classification task: let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point. The goal is to find a class $$y_{\textsf{new}}\in\{1,\ldots,K\}$$ that is best associated with $$x_{\textsf{new}}$$.

As in regression, conformal prediction allows us to obtain a prediction set which contains multiple classes. Let $$\hat{p}(y\vert x)$$ be a model that estimates the probability that an example with predictor $$x$$ is in class $$y$$; for example, $$\hat{p}$$ could be a logistic regression model. We compute the model’s likelihood on $$(x_i,y_i)$$:

$P_i = \hat{p}(y_i\vert x_i).$

One candidate for the non-conformity score is the negative likelihood:

$R_i = -P_i.$

We then sort $$R_1,\ldots,R_n$$ in increasing order: $$R_{(1)},\ldots,R_{(n)}$$. Let $$\alpha\in (0,0.5)$$ and $$\beta=1-\alpha$$. The quantile lemma in Section 24.2 tells us that the prediction set

$\{y\in\{1,\ldots,K\} : -\hat{p}(y\vert x_{\textsf{new}}) \leq \lceil \beta(n+1) \rceil\text{-quantile of } \{R_1,\ldots, R_{n}\}\},$

has $$1-\alpha$$ coverage probability. We can also write this set in terms of likelihood:

$\{y\in\{1,\ldots,K\} : \hat{p}(y\vert x_{\textsf{new}}) \geq \lfloor \alpha(n+1) \rfloor\text{-quantile of }\{P_1,\ldots, P_n\}\},$

which is easier to interpret. Building on this observation, we present two approaches for obtaining such a prediction set.
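To make the quantile threshold concrete, here is a minimal numerical sketch. The calibration likelihoods and the class probabilities at $$x_{\textsf{new}}$$ are synthetic placeholders (assumed purely for illustration); in practice a fitted model $$\hat{p}$$ would supply both.

```python
import numpy as np

# Hypothetical setup: n calibration likelihoods P_i = p_hat(y_i | x_i),
# drawn at random here, and the model's class probabilities at x_new.
rng = np.random.default_rng(0)
n, K = 99, 4
P = rng.uniform(0.2, 1.0, size=n)     # stand-in calibration likelihoods
p_new = rng.dirichlet(np.ones(K))     # stand-in for p_hat(y | x_new), y = 1..K

alpha = 0.1
k = int(np.floor(alpha * (n + 1)))    # k = floor(alpha (n+1)) = 10 here
threshold = np.sort(P)[k - 1]         # k-th smallest likelihood (1-indexed)

# Prediction set: classes whose estimated likelihood clears the threshold.
pred_set = [y for y in range(1, K + 1) if p_new[y - 1] >= threshold]
print(pred_set)
```

Note that the threshold depends only on the calibration likelihoods, not on the candidate class, so it is computed once per new point.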

## 26.1 Full conformal approach

Let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point.

For each $$y_{\textsf{new}}\in \{1,\ldots, K\}$$,

1. Add the candidate point $$(x_{\textsf{new}},y_{\textsf{new}})$$ to the dataset.

2. Fit the model on $$\mathcal{D}_{1:n}\cup \{(x_{\textsf{new}},y_{\textsf{new}})\}$$. Let $$\hat{p}$$ be the fitted model.

3. For each $$i\in \{1,\ldots,n,\textsf{new}\}$$, compute the probability estimate $$P_i=\hat{p}(y_i\vert x_i)$$.

4. Sort the probabilities $$P_1,\ldots,P_n$$ in increasing order: $$P_{(1)},\ldots,P_{(n)}$$.

5. Let $$k = \lfloor \alpha(n+1) \rfloor$$. Include $$y_{\textsf{new}}$$ in the prediction set if $$\hat{p}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_{(k)}$$; otherwise discard it.

In summary, the prediction set is

$\mathcal{S}_{1-\alpha}(x_{\textsf{new}}) = \{y_{\textsf{new}} \in \{1,\ldots,K\} : \hat{p}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_{(k)}\},$

where $$\hat{p}$$ and $$P_{(k)}$$ are recomputed for each candidate $$y_{\textsf{new}}$$.

It follows from Lemma 24.1 that this prediction set has $$1-\alpha$$ coverage probability: let $$(x_{\textsf{new}}, y)$$ be a new data point; then

$\Pr[y \in \mathcal{S}_{1-\alpha}(x_{\textsf{new}})] \geq 1-\alpha.$
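The refitting loop above can be sketched as follows. The nearest-centroid "model" with softmax probabilities, the toy Gaussian data, and the helper names (`fit`, `predict_proba`) are all illustrative assumptions chosen because refitting is cheap; any probabilistic classifier $$\hat{p}$$ could take its place.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs, labels in {1, 2}.
n, K, alpha = 50, 2, 0.1
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([1, 2], n // 2)
x_new = np.array([1.8, 2.1])

def fit(X, y):
    """Stand-in model fit: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in range(1, K + 1)}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the centroids."""
    d = np.array([-np.sum((x - centroids[c]) ** 2) for c in range(1, K + 1)])
    e = np.exp(d - d.max())
    return e / e.sum()

k = int(np.floor(alpha * (n + 1)))
pred_set = []
for y_new in range(1, K + 1):
    # Steps 1-2: refit on the augmented dataset D u {(x_new, y_new)}.
    Xa, ya = np.vstack([X, x_new]), np.append(y, y_new)
    centroids = fit(Xa, ya)
    # Step 3: likelihoods of the observed labels under the refit model.
    P = np.array([predict_proba(centroids, Xa[i])[ya[i] - 1] for i in range(n)])
    P_new = predict_proba(centroids, x_new)[y_new - 1]
    # Steps 4-5: keep y_new if its likelihood clears the k-th smallest P_i.
    if P_new >= np.sort(P)[k - 1]:
        pred_set.append(y_new)
print(pred_set)
```

Each candidate class triggers one refit, so a full pass over a new point costs $$K$$ model fits; this is the expense the next section avoids.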

## 26.2 Jackknife+ approach

The problem with the full conformal approach is the need to refit the model for every candidate class of every new data point. To avoid this, we can take the Jackknife+ approach: the models are fit on the training data alone, once, and then reused to make predictions for any new data point. Here are the steps to obtain a prediction set using the Jackknife+ approach in full detail:

Let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point.

1. For each $$i\in \{1,\ldots,n\}$$,

1.1. Fit the model on $$\mathcal{D}_{1:n}\setminus\{(x_i,y_i)\}$$, that is, the data with the $$i$$-th point held out. Let $$\hat{p}_{-i}$$ be the fitted model.

1.2. Compute the probability estimate $$P_i=\hat{p}_{-i}(y_i\vert x_i)$$.

2. Let $$k = \lfloor \alpha(n+1) \rfloor$$. For each $$y_{\textsf{new}}\in \{1,\ldots, K\}$$, include $$y_{\textsf{new}}$$ in the prediction set if

$\#\{i\in\{1,\ldots,n\} : \hat{p}_{-i}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_i\} \geq k.$

In summary, the prediction set is

$\mathcal{S}^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}}) = \{y_{\textsf{new}} \in \{1,\ldots,K\} : \#\{i : \hat{p}_{-i}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_i\} \geq k\}.$

This set has $$1-2\alpha$$ coverage probability. More precisely, let $$(x_{\textsf{new}}, y)$$ be a new data point; then

$\Pr[y \in \mathcal{S}^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}})] \geq 1-2\alpha.$
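The leave-one-out computation can be sketched as follows. As before, the nearest-centroid model with softmax probabilities, the toy data, and the helper names are illustrative assumptions standing in for a real classifier; the point is that the $$n$$ leave-one-out fits happen once and are then reused for every candidate class.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, alpha = 40, 2, 0.1
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([1, 2], n // 2)
x_new = np.array([2.2, 1.7])

def fit(X, y):
    """Stand-in model fit: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in range(1, K + 1)}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the centroids."""
    d = np.array([-np.sum((x - centroids[c]) ** 2) for c in range(1, K + 1)])
    e = np.exp(d - d.max())
    return e / e.sum()

# Step 1: one leave-one-out fit per training point; no refit per class.
loo_models, P = [], np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    m = fit(X[mask], y[mask])
    loo_models.append(m)
    P[i] = predict_proba(m, X[i])[y[i] - 1]

# Step 2: keep y_new when enough leave-one-out likelihoods P_i fall at or
# below the leave-one-out predicted probability of y_new at x_new.
k = int(np.floor(alpha * (n + 1)))
pred_set = []
for y_new in range(1, K + 1):
    count = sum(predict_proba(m, x_new)[y_new - 1] >= P[i]
                for i, m in enumerate(loo_models))
    if count >= k:
        pred_set.append(y_new)
print(pred_set)
```

The $$n$$ leave-one-out fits are still a nontrivial cost, but they are paid once on the training data rather than $$K$$ times per new point.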
