# 26 Conformal prediction for classification

Consider the following classification task: let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point. The goal is to find a class $$y_{\textsf{new}}\in\{1,\ldots,K\}$$ that is best associated with $$x_{\textsf{new}}$$.

As in regression, conformal prediction allows us to obtain a prediction set which contains multiple classes. Let $$\hat{p}(y\vert x)$$ be a model that estimates the probability that an example with predictor $$x$$ is in class $$y$$; for example, $$\hat{p}$$ could be a logistic regression model. We compute the model’s likelihood on $$(x_i,y_i)$$:

$P_i = \hat{p}(y_i\vert x_i).$

One candidate for the non-conformity score is the negative likelihood:

$R_i = -P_i.$

We then sort $$R_1,\ldots,R_n$$ in increasing order: $$R_{(1)},\ldots,R_{(n)}$$. Let $$\alpha\in (0,0.5)$$ and $$\beta=1-\alpha$$. The quantile lemma in Section 24.2 tells us that the prediction set

$\{y\in\{1,\ldots,K\} : -\hat{p}(y\vert x_{\textsf{new}}) \leq \lceil \beta(n+1) \rceil\text{-quantile of } \{R_1,\ldots, R_{n}\}\},$

has $$1-\alpha$$ coverage probability. We can also write this set in terms of likelihood:

$\{y\in\{1,\ldots,K\} : \hat{p}(y\vert x_{\textsf{new}}) \geq \lfloor \alpha(n+1) \rfloor\text{-quantile of }\{P_1,\ldots, P_n\}\},$

which is easier to interpret. Building on this observation, we present two approaches for obtaining such a prediction set.
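To make the quantile threshold concrete, here is a minimal numerical sketch. The calibration likelihoods and the class probabilities at $$x_{\textsf{new}}$$ are synthetic placeholders (assumed purely for illustration); in practice a fitted model $$\hat{p}$$ would supply both.

```python
import numpy as np

# Hypothetical setup: n calibration likelihoods P_i = p_hat(y_i | x_i),
# drawn at random here, and the model's class probabilities at x_new.
rng = np.random.default_rng(0)
n, K = 99, 4
P = rng.uniform(0.2, 1.0, size=n)     # stand-in calibration likelihoods
p_new = rng.dirichlet(np.ones(K))     # stand-in for p_hat(y | x_new), y = 1..K

alpha = 0.1
k = int(np.floor(alpha * (n + 1)))    # k = floor(alpha (n+1)) = 10 here
threshold = np.sort(P)[k - 1]         # k-th smallest likelihood (1-indexed)

# Prediction set: classes whose estimated likelihood clears the threshold.
pred_set = [y for y in range(1, K + 1) if p_new[y - 1] >= threshold]
print(pred_set)
```

Note that the threshold depends only on the calibration likelihoods, not on the candidate class, so it is computed once per new point.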

## 26.1 Full conformal approach

Let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point.

For each $$y_{\textsf{new}}\in \{1,\ldots, K\}$$,

1. Add the candidate point $$(x_{\textsf{new}},y_{\textsf{new}})$$ to the dataset.

2. Fit the model on $$\mathcal{D}_{1:n}\cup \{(x_{\textsf{new}},y_{\textsf{new}})\}$$. Let $$\hat{p}$$ be the fitted model.

3. For each $$i\in \{1,\ldots,n,\textsf{new}\}$$, compute the probability estimate $$P_i=\hat{p}(y_i\vert x_i)$$.

4. Sort the probabilities $$P_1,\ldots,P_n$$ in increasing order: $$P_{(1)},\ldots,P_{(n)}$$.

5. Let $$k = \lfloor \alpha(n+1) \rfloor$$. Include $$y_{\textsf{new}}$$ in the prediction set if $$\hat{p}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_{(k)}$$; otherwise discard it.

In summary, the prediction set is

$\mathcal{S}_{1-\alpha}(x_{\textsf{new}}) = \{y_{\textsf{new}} \in \{1,\ldots,K\} : \hat{p}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_{(k)}\},$

where $$\hat{p}$$ and $$P_{(k)}$$ are recomputed for each candidate $$y_{\textsf{new}}$$.

It follows from Lemma 24.1 that this prediction set has $$1-\alpha$$ coverage probability: let $$(x_{\textsf{new}}, y)$$ be a new data point; then

$\Pr[y \in \mathcal{S}_{1-\alpha}(x_{\textsf{new}})] \geq 1-\alpha.$
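The refitting loop above can be sketched as follows. The nearest-centroid "model" with softmax probabilities, the toy Gaussian data, and the helper names (`fit`, `predict_proba`) are all illustrative assumptions chosen because refitting is cheap; any probabilistic classifier $$\hat{p}$$ could take its place.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs, labels in {1, 2}.
n, K, alpha = 50, 2, 0.1
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([1, 2], n // 2)
x_new = np.array([1.8, 2.1])

def fit(X, y):
    """Stand-in model fit: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in range(1, K + 1)}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the centroids."""
    d = np.array([-np.sum((x - centroids[c]) ** 2) for c in range(1, K + 1)])
    e = np.exp(d - d.max())
    return e / e.sum()

k = int(np.floor(alpha * (n + 1)))
pred_set = []
for y_new in range(1, K + 1):
    # Steps 1-2: refit on the augmented dataset D u {(x_new, y_new)}.
    Xa, ya = np.vstack([X, x_new]), np.append(y, y_new)
    centroids = fit(Xa, ya)
    # Step 3: likelihoods of the observed labels under the refit model.
    P = np.array([predict_proba(centroids, Xa[i])[ya[i] - 1] for i in range(n)])
    P_new = predict_proba(centroids, x_new)[y_new - 1]
    # Steps 4-5: keep y_new if its likelihood clears the k-th smallest P_i.
    if P_new >= np.sort(P)[k - 1]:
        pred_set.append(y_new)
print(pred_set)
```

Each candidate class triggers one refit, so a full pass over a new point costs $$K$$ model fits; this is the expense the next section avoids.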

## 26.2 Jackknife+ approach

The problem with the full conformal approach is the need to refit the model for every candidate class of every new data point. To avoid this, we can take the Jackknife+ approach: the models are fit on the training data alone, once, and then reused to make predictions for any new data point. Here are the steps to obtain a prediction set using the Jackknife+ approach in full detail:

Let $$\mathcal{D}_{1:n}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ be the observed data with $$y_i\in \{1,\ldots,K\}$$, and $$(x_{\textsf{new}},?)$$ be a new data point.

1. For each $$i\in \{1,\ldots,n\}$$,

1.1. Fit the model on $$\mathcal{D}_{1:n}\setminus\{(x_i,y_i)\}$$, that is, the data with the $$i$$-th point held out. Let $$\hat{p}_{-i}$$ be the fitted model.

1.2. Compute the probability estimate $$P_i=\hat{p}_{-i}(y_i\vert x_i)$$.

2. Let $$k = \lfloor \alpha(n+1) \rfloor$$. For each $$y_{\textsf{new}}\in \{1,\ldots, K\}$$, include $$y_{\textsf{new}}$$ in the prediction set if

$\#\{i\in\{1,\ldots,n\} : \hat{p}_{-i}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_i\} \geq k.$

In summary, the prediction set is

$\mathcal{S}^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}}) = \{y_{\textsf{new}} \in \{1,\ldots,K\} : \#\{i : \hat{p}_{-i}(y_{\textsf{new}}\vert x_{\textsf{new}}) \geq P_i\} \geq k\}.$

This set has $$1-2\alpha$$ coverage probability. More precisely, let $$(x_{\textsf{new}}, y)$$ be a new data point; then

$\Pr[y \in \mathcal{S}^{\textsf{JK+}}_{1-2\alpha}(x_{\textsf{new}})] \geq 1-2\alpha.$
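The leave-one-out computation can be sketched as follows. As before, the nearest-centroid model with softmax probabilities, the toy data, and the helper names are illustrative assumptions standing in for a real classifier; the point is that the $$n$$ leave-one-out fits happen once and are then reused for every candidate class.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, alpha = 40, 2, 0.1
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([1, 2], n // 2)
x_new = np.array([2.2, 1.7])

def fit(X, y):
    """Stand-in model fit: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in range(1, K + 1)}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the centroids."""
    d = np.array([-np.sum((x - centroids[c]) ** 2) for c in range(1, K + 1)])
    e = np.exp(d - d.max())
    return e / e.sum()

# Step 1: one leave-one-out fit per training point; no refit per class.
loo_models, P = [], np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    m = fit(X[mask], y[mask])
    loo_models.append(m)
    P[i] = predict_proba(m, X[i])[y[i] - 1]

# Step 2: keep y_new when enough leave-one-out likelihoods P_i fall at or
# below the leave-one-out predicted probability of y_new at x_new.
k = int(np.floor(alpha * (n + 1)))
pred_set = []
for y_new in range(1, K + 1):
    count = sum(predict_proba(m, x_new)[y_new - 1] >= P[i]
                for i, m in enumerate(loo_models))
    if count >= k:
        pred_set.append(y_new)
print(pred_set)
```

The $$n$$ leave-one-out fits are still a nontrivial cost, but they are paid once on the training data rather than $$K$$ times per new point.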
