Classification with Valid and Adaptive Coverage

Yaniv Romano (Department of Statistics, Stanford University, Stanford, CA, USA; yromano@stanford.edu), Matteo Sesia (Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA; sesia@marshall.usc.edu), and Emmanuel J. Candès (Departments of Mathematics and of Statistics, Stanford University, Stanford, CA, USA; candes@stanford.edu). Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract. Conformal inference, cross-validation+, and the jackknife+ are hold-out methods that can be combined with virtually any machine learning algorithm to construct prediction sets with guaranteed marginal coverage. In this paper, we develop specialized versions of these techniques for categorical and unordered response labels that, in addition to providing marginal coverage, are also fully adaptive to complex data distributions, in the sense that they perform favorably in terms of approximate conditional coverage compared to alternative methods. The heart of our contribution is a novel conformity score, which we explicitly demonstrate to be powerful and intuitive for classification problems, but whose underlying principle is potentially far more general. Experiments on synthetic and real data demonstrate the practical value of our theoretical guarantees, as well as the statistical advantages of the proposed methods over the existing alternatives.

1 Introduction

Imagine we have $n$ data samples $\{(X_i, Y_i)\}_{i=1}^n$ with features $X_i \in \mathbb{R}^p$ and a discrete label $Y_i \in \mathcal{Y} = \{1, 2, \ldots, C\}$. The samples are drawn exchangeably (e.g., i.i.d., although exchangeability alone is sufficient) from some unknown distribution $P_{XY}$. Given such data and a desired coverage level $1-\alpha \in (0,1)$, we seek to construct a prediction set $\hat{C}_{n,\alpha} \subseteq \mathcal{Y}$ for the unseen label of a new data point $(X_{n+1}, Y_{n+1})$, also drawn exchangeably from $P_{XY}$, achieving marginal coverage; that is, obeying
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(X_{n+1}) \right] \geq 1-\alpha. \tag{1}
$$
The probability above is taken over all $n+1$ data points, and we ask that (1) holds for any fixed $\alpha$, $n$, and $P_{XY}$. While marginal coverage has the advantage of being both desirable and practically achievable, it unfortunately does not imply the stronger notion of conditional coverage:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(x) \mid X_{n+1} = x \right] \geq 1-\alpha. \tag{2}
$$
The latter asks for valid coverage conditional on a specific observed value of the features $X$. It is already known that conditional coverage cannot be achieved in theory without strong modeling assumptions [1, 23], which we are not willing to make in this paper. That said, it is undeniable that conditional coverage would be preferable. We thus seek to develop classification methods that are provably valid in the marginal sense (1) and also attempt to sensibly approximate conditional coverage (2). At the same time, we want powerful predictions, in the sense that the cardinality of $\hat{C}$ should be as small as possible.

1.1 The oracle classifier

Imagine we have an oracle with perfect knowledge of the conditional distribution $P_{Y|X}$ of $Y$ given $X$. This would of course give the problem away; to be sure, we would define optimal prediction sets $C^{\mathrm{oracle}}_{\alpha}(X_{n+1})$ with conditional coverage as follows: for any $x \in \mathbb{R}^p$, set $\pi_y(x) = \mathbb{P}[Y = y \mid X = x]$ for each $y \in \mathcal{Y}$. Denote by $\pi_{(1)}(x) \geq \pi_{(2)}(x) \geq \ldots \geq \pi_{(C)}(x)$ the order statistics of $\pi_y(x)$. For simplicity, let us assume for now that there are no ties; we will relax this assumption shortly.
For any $\tau \in [0, 1]$, define the generalized conditional quantile function
$$
L(x; \pi, \tau) = \min\left\{ c \in \{1, \ldots, C\} : \pi_{(1)}(x) + \pi_{(2)}(x) + \ldots + \pi_{(c)}(x) \geq \tau \right\} \tag{3}
$$
(recall that the conditional quantiles for continuous responses are $\inf\{y \in \mathbb{R} : \mathbb{P}[Y \leq y \mid X = x] \geq \tau\}$), and the prediction set
$$
C^{\mathrm{oracle}+}_{\alpha}(x) = \left\{ \text{`}y\text{' indices of the } L(x; \pi, 1-\alpha) \text{ largest } \pi_y(x) \right\}. \tag{4}
$$
Hence, (4) is the smallest deterministic set that contains a response with feature values $X = x$ with probability at least $1-\alpha$. For example, if $\pi_1(x) = 0.3$, $\pi_2(x) = 0.6$, and $\pi_3(x) = 0.1$, we have $\pi_{(1)}(x) = 0.6$, $\pi_{(2)}(x) = 0.3$, and $\pi_{(3)}(x) = 0.1$, with $L(x; \pi, 0.9) = 2$, $C^{\mathrm{oracle}+}_{0.1}(x) = \{1, 2\}$, and $L(x; \pi, 0.5) = 1$, $C^{\mathrm{oracle}+}_{0.5}(x) = \{2\}$. Furthermore, define a function $S$ with input $x$, $u \in [0, 1]$, $\pi$, and $\tau$, which computes the set of most likely labels up to (but possibly excluding) the one identified by (3):
$$
S(x, u; \pi, \tau) =
\begin{cases}
\text{`}y\text{' indices of the } L(x; \pi, \tau) - 1 \text{ largest } \pi_y(x), & \text{if } u \leq V(x; \pi, \tau), \\
\text{`}y\text{' indices of the } L(x; \pi, \tau) \text{ largest } \pi_y(x), & \text{otherwise},
\end{cases} \tag{5}
$$
where
$$
V(x; \pi, \tau) = \frac{1}{\pi_{(L(x;\pi,\tau))}(x)} \left[ \sum_{c=1}^{L(x;\pi,\tau)} \pi_{(c)}(x) - \tau \right].
$$
With this in place, by letting $u$ be the realization of a uniform random variable, we can see that the oracle has access to tighter randomized prediction sets, namely,
$$
C^{\mathrm{oracle}}_{\alpha}(x) = S(x, U; \pi, 1-\alpha). \tag{6}
$$
Above, $U \sim \mathrm{Uniform}(0, 1)$ is independent of everything else. It is easy to verify that the sets in (6) are the smallest randomized prediction sets with conditional coverage at level $1-\alpha$. In the above example, we would have $C^{\mathrm{oracle}}_{0.5}(x) = \emptyset$ with probability $(0.6 - 0.5)/0.6 = 1/6$ and $C^{\mathrm{oracle}}_{0.5}(x) = \{2\}$ otherwise. Finally, if there are any ties among the class probabilities, the oracle could simply break them at random. Of course, we do not have access to such an oracle since $P_{Y|X}$ is unknown.

1.2 Preview of our methods

This paper uses classifiers trained on the available data to approximate the unknown conditional distribution of $Y \mid X$. A key strength of the proposed methods is their ability to work with any black-box predictive model, including neural networks, random forests, support vector classifiers, or any other currently existing or possible future alternatives. The only restriction on the training algorithm is that it should treat all samples exchangeably; i.e., it should be invariant to their order. Most off-the-shelf tools offer suitable probability estimates $\hat{\pi}_y(x)$ that we can exploit, regardless of whether they are well-calibrated, by plugging them into an algorithm inspired by the oracle from Section 1.1 in order to obtain prediction sets with guaranteed coverage, as we shall see. Our reader will understand that naively substituting $\pi_y(x)$ with $\hat{\pi}_y(x)$ in the oracle procedure would yield predictions lacking any statistical guarantees, because $\hat{\pi}_y(x)$ may be a poor approximation of $\pi_y(x)$. Fortunately, we can automatically account for errors in $\hat{\pi}_y(x)$ by adaptively choosing the threshold $\tau$ in (3) in such a way as to guarantee finite-sample coverage on future test points. The intuition is that setting $\tau = 1-\alpha$ may not guarantee coverage at level $1-\alpha$ for future test points if $\hat{\pi} \neq \pi$. However, we can compute the empirical coverage on hold-out data as a function of $\tau$, and then select the smallest value of $\tau$ that leads to the desired $1-\alpha$ coverage. Below, we will show that this adaptive tuning rigorously yields tight coverage.
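To make the oracle construction and the role of the threshold $\tau$ concrete, the following NumPy sketch implements the building blocks $L$, $V$, and $S$ from (3)-(5) and the randomized oracle set (6). This is a minimal illustration written for this note, not the authors' reference implementation (their package is at https://github.com/msesia/arc); the function names are ours.

```python
import numpy as np

def generalized_quantile_L(probs, tau):
    """L(x; pi, tau) in (3): smallest number of top classes whose probabilities sum to >= tau."""
    order = np.argsort(-probs)                  # classes sorted from most to least likely
    cumsum = np.cumsum(probs[order])
    L = min(int(np.searchsorted(cumsum, tau)) + 1, len(probs))  # guard against rounding near tau = 1
    return L, order, cumsum

def randomized_set_S(probs, u, tau):
    """S(x, u; pi, tau) in (5): the top-L classes, possibly dropping the L-th one at random."""
    L, order, cumsum = generalized_quantile_L(probs, tau)
    V = (cumsum[L - 1] - tau) / probs[order[L - 1]]  # V(x; pi, tau) in (5)
    keep = L - 1 if u <= V else L
    return set(order[:keep].tolist())

# Worked example from Section 1.1 (labels indexed 0, 1, 2 here instead of 1, 2, 3):
probs = np.array([0.3, 0.6, 0.1])
rng = np.random.default_rng()
oracle_set = randomized_set_S(probs, rng.uniform(), tau=1 - 0.5)  # alpha = 0.5 in (6)
print(oracle_set)  # empty with probability 1/6, {1} (i.e., label 2) otherwise
```

With `tau = 1 - alpha` and the true class probabilities, `randomized_set_S` reproduces the randomized oracle (6); the calibration procedures developed below replace the true probabilities with a black-box estimate and tune `tau` on hold-out data.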
1.3 Related work

We build upon conformal inference [12, 24, 26] and take inspiration from [3, 5, 8-10, 13, 18], which made conformal prediction for regression problems adaptive to heteroscedasticity, thus bringing it closer to conditional coverage [20]. Conformal inference has been applied before to classification problems [7, 19, 24, 25] in order to attain marginal coverage; however, the idea of explicitly trying to approximate the oracle from Section 1.1 is novel. We will see that our procedure empirically achieves better conditional coverage than a direct application of conformal inference. While working on this project, we became aware of the independent work of [2], which also seeks to improve the conditional coverage of conformal classification methods. However, their approach differs substantially; see Section 2.4. Finally, our method also naturally accommodates calibration through cross-validation+ and the jackknife+ [4], which had not yet been extended to classification, although the natural generality of these calibration techniques has also been very recently noted by others [10].

A different but related line of work focuses on post-processing the output of black-box classification algorithms to produce more accurate probability estimates [6, 11, 15, 16, 22, 27, 28], although without achieving prediction sets with provable finite-sample coverage. These techniques are complementary to our methods and may help further boost our performance by improving the accuracy of any given black box; however, we have not tested them empirically in this paper for space reasons.

2.1 Generalized inverse quantile conformity scores

Suppose we have a black-box classifier $\hat{\pi}_y(x)$ that estimates the true unknown class probabilities $\pi_y(x)$. Here, we only assume $\hat{\pi}_y(x)$ to be standardized: $0 \leq \hat{\pi}_y(x) \leq 1$ and $\sum_{y=1}^{C} \hat{\pi}_y(x) = 1$, for all $x$ and $y$. An example may be the output of the softmax layer of a neural network, after normalization. In fact, almost any standard machine learning software, e.g., sklearn, can produce a suitable $\hat{\pi}$, either through random forests, k-nearest neighbors, or support vector machines, to name a few options. Then, we plug $\hat{\pi}$ into a modified version of the imaginary oracle procedure of Section 1.1, where the threshold $\tau$ needs to be carefully calibrated using hold-out samples independent of the training data. We will present two alternative methods for calibrating $\tau$; both are based on the following idea. We define a function $E$, with input $x$, $y$, $u$, $\hat{\pi}$, which computes the smallest value of $\tau$ such that the set $S(x, u; \hat{\pi}, \tau)$ in (5) contains the label $y$ conditional on $X = x$. We call this the generalized inverse quantile conformity score function:
$$
E(x, y, u; \hat{\pi}) = \min\left\{ \tau \in [0, 1] : y \in S(x, u; \hat{\pi}, \tau) \right\}. \tag{7}
$$
By construction, our scores evaluated on hold-out samples $(X_i, Y_i)$, namely $E_i = E(X_i, Y_i, U_i; \hat{\pi})$, are uniformly distributed conditional on $X$ if $\hat{\pi} = \pi$. (Each $U_i$ is a uniform random variable in $[0, 1]$, independent of everything else.) Therefore, one could also intuitively look at (7) as a special type of p-value. It is worth emphasizing that this property makes our scores naturally comparable across different samples, in contrast with the scores found in the earlier literature on adaptive conformal inference [18]. In fact, alternative conformity scores [2, 10, 12, 18] generally have different distributions at different values of $X$, even in the ideal case where the base method (our $\hat{\pi}$) is a perfect oracle.
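Under the no-ties assumption, the minimization in (7) admits a simple closed form: the score equals the cumulative estimated probability of all classes at least as likely as $y$, minus $u$ times the estimated probability of $y$ itself. The sketch below illustrates this formula; it is our own derivation-based illustration, not code from the authors' package.

```python
import numpy as np

def conformity_score_E(probs, y, u):
    """Closed form of the generalized inverse quantile conformity score (7), assuming no ties:
    total probability of the classes ranked at least as high as y, minus u * probs[y]."""
    order = np.argsort(-probs)                  # most to least likely
    rank = int(np.where(order == y)[0][0])      # position of label y in that ranking
    return float(np.cumsum(probs[order])[rank] - u * probs[y])

# Continuing the example of Section 1.1, with probs = (0.3, 0.6, 0.1):
probs = np.array([0.3, 0.6, 0.1])
print(conformity_score_E(probs, y=1, u=0.5))  # 0.6 - 0.5 * 0.6 = 0.30
print(conformity_score_E(probs, y=0, u=0.5))  # 0.9 - 0.5 * 0.3 = 0.75
```

As a consistency check, a label $y$ belongs to `randomized_set_S(probs, u, tau)` from the earlier sketch exactly when `tau` exceeds this score, which is what (7) demands.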
Below, we shall see that, loosely speaking, we can construct prediction sets with provable marginal coverage for future test points by applying (5) with a value of $\tau$ close to the $1-\alpha$ quantile of $\{E_i\}_{i \in \mathcal{I}_2}$, where $\mathcal{I}_2$ is the set of hold-out data points not used to train $\hat{\pi}$; see (8).

2.2 Adaptive classification with split-conformal calibration

Algorithm 1 implements the above idea with split-conformal calibration, with which we begin because it is the easiest to explain. Later, we will consider alternative calibration methods based on cross-validation+ and the jackknife+; we do not discuss full-conformal calibration in the interest of space, and because it is often computationally prohibitive. For simplicity, we will apply Algorithm 1 by splitting the data into two sets of equal size; however, this is not necessary, and using more data points for training may sometimes perform better in practice [20].

Algorithm 1: Adaptive classification with split-conformal calibration
1. Input: data $\{(X_i, Y_i)\}_{i=1}^n$, $X_{n+1}$, black-box learning algorithm $\mathcal{B}$, level $\alpha \in (0, 1)$.
2. Randomly split the training data into 2 subsets, $\mathcal{I}_1, \mathcal{I}_2$.
3. Sample $U_i \sim \mathrm{Uniform}(0, 1)$ for each $i \in \{1, \ldots, n+1\}$, independently of everything else.
4. Train $\mathcal{B}$ on all samples in $\mathcal{I}_1$: $\hat{\pi} \leftarrow \mathcal{B}(\{(X_i, Y_i)\}_{i \in \mathcal{I}_1})$.
5. Compute $E_i = E(X_i, Y_i, U_i; \hat{\pi})$ for each $i \in \mathcal{I}_2$, with the function $E$ defined in (7).
6. Compute $\hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ as the $\lceil (1-\alpha)(1 + |\mathcal{I}_2|) \rceil$-th smallest value in $\{E_i\}_{i \in \mathcal{I}_2}$.
7. Use the function $S$ defined in (5) to construct the prediction set at $X_{n+1}$ as:
$$
\hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) = S\left(X_{n+1}, U_{n+1}; \hat{\pi}, \hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})\right). \tag{8}
$$
8. Output: a prediction set $\hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1})$ for the unobserved label $Y_{n+1}$.

Theorem 1. If the samples $(X_i, Y_i)$, for $i \in \{1, \ldots, n+1\}$, are exchangeable and $\mathcal{B}$ from Algorithm 1 is invariant to permutations of its input samples, the output of Algorithm 1 satisfies:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) \right] \geq 1-\alpha. \tag{9}
$$
Furthermore, if the scores $E_i$ are almost surely distinct, the marginal coverage is nearly tight:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) \right] \leq 1-\alpha + \frac{1}{|\mathcal{I}_2| + 1}. \tag{10}
$$

The proofs of this theorem and all other results are in Supplementary Section S2. Marginal coverage holds regardless of the quality of the black-box approximation; however, one can intuitively expect that if the black box is consistent and a large amount of data is available, so that $\hat{\pi}_y(x) \approx \pi_y(x)$, the output of our procedure will tend to be a close approximation of the output of the oracle, which provides optimal conditional coverage. This statement could be made rigorous under some additional technical assumptions besides the consistency of the black box [20]. However, we prefer to avoid tedious technical details, especially since the intuition is already clear. If $\hat{\pi} = \pi$, the sets $S(X_i, U_i; \pi, \tau)$ in (5) will tend to contain the true labels for a fraction $\tau$ of the points $i \in \mathcal{I}_2$, as long as $|\mathcal{I}_2|$ is large. In this limit, $\hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ becomes approximately equal to $1-\alpha$, and the predictions in (8) will eventually approach those in (6).
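The compact sketch below ties Algorithm 1 together, reusing the `conformity_score_E` and `randomized_set_S` helpers from the earlier sketches and a generic sklearn classifier with `predict_proba` as the black box $\mathcal{B}$. It assumes `X`, `Y`, and `X_test` are NumPy arrays with labels encoded as 0, ..., C-1 (so the columns of `predict_proba` line up with the labels); as before, this is an illustration rather than the authors' reference code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def split_conformal_sets(X, Y, X_test, alpha=0.1, black_box=None, seed=0):
    """Sketch of Algorithm 1: split-conformal calibration of the threshold tau."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    I1, I2 = idx[: n // 2], idx[n // 2:]                       # step 2: random split

    black_box = black_box or RandomForestClassifier(random_state=seed)
    black_box.fit(X[I1], Y[I1])                                # step 4: train on I_1
    pi_cal = black_box.predict_proba(X[I2])                    # estimated class probabilities
    U_cal = rng.uniform(size=len(I2))                          # step 3: randomization variables

    # Step 5: conformity scores (7) on the hold-out fold I_2
    E_cal = np.array([conformity_score_E(p, y, u)
                      for p, y, u in zip(pi_cal, Y[I2], U_cal)])

    # Step 6: tau_hat is the ceil((1 - alpha)(1 + |I_2|))-th smallest hold-out score
    k = min(int(np.ceil((1 - alpha) * (1 + len(I2)))), len(I2))
    tau_hat = np.sort(E_cal)[k - 1]

    # Step 7: randomized prediction sets S(X_test, U; pi_hat, tau_hat)
    pi_test = black_box.predict_proba(X_test)
    U_test = rng.uniform(size=len(X_test))
    return [randomized_set_S(p, u, tau_hat) for p, u in zip(pi_test, U_test)]
```

Any classifier exposing `fit` and `predict_proba` can be passed as `black_box`; the calibration step never looks inside the model, which is the modularity emphasized throughout the paper.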
2.3 Adaptive classification with cross-validation+ and jackknife+ calibration

A limitation of Algorithm 1 is that it only uses part of the data to train the predictive algorithm. Consequently, the estimate $\hat{\pi}$ may not be as accurate as it could have been had we used all the data for estimation purposes. This is especially true if the sample size $n$ is small. Algorithm 2 presents an alternative solution that replaces data splitting with a cross-validation approach, which is computationally more expensive but often provides tighter prediction sets.

Algorithm 2: Adaptive classification with CV+ calibration
1. Input: data $\{(X_i, Y_i)\}_{i=1}^n$, $X_{n+1}$, black-box $\mathcal{B}$, number of splits $K \leq n$, level $\alpha \in (0, 1)$.
2. Randomly split the training data into $K$ disjoint subsets, $\mathcal{I}_1, \ldots, \mathcal{I}_K$, each of size $n/K$.
3. Sample $U_i \sim \mathrm{Uniform}(0, 1)$ for each $i \in \{1, \ldots, n+1\}$, independently of everything else.
4. For each $k \in \{1, \ldots, K\}$:
5. &nbsp;&nbsp; Train $\mathcal{B}$ on all samples except those in $\mathcal{I}_k$: $\hat{\pi}^k \leftarrow \mathcal{B}(\{(X_i, Y_i)\}_{i \in \{1,\ldots,n\} \setminus \mathcal{I}_k})$.
6. Use the function $E$ defined in (7) to construct the prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ as:
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) = \left\{ y \in \mathcal{Y} : \sum_{i=1}^{n} \mathbb{1}\left[ E(X_i, Y_i, U_i; \hat{\pi}^{k(i)}) < E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)}) \right] < (1-\alpha)(n+1) \right\}, \tag{11}
$$
where $k(i) \in \{1, \ldots, K\}$ is the fold containing the $i$-th sample.
7. Output: a prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ for the unobserved label $Y_{n+1}$.

In words, in Algorithm 2, we sweep over all possible labels $y \in \mathcal{Y}$ and include $y$ in the final prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ if fewer than $(1-\alpha)(n+1)$ of the hold-out scores $E(X_i, Y_i, U_i; \hat{\pi}^{k(i)})$, evaluated on the true labeled data, are strictly smaller than the corresponding score $E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)})$. Note that we have assumed $n/K$ to be an integer for simplicity; however, different splits can have different sizes. In the special case where $K = n$, we refer to the hold-out system in Algorithm 2 as jackknife+ rather than cross-validation+, consistently with the terminology in [4].

Theorem 2. Under the same assumptions as Theorem 1, the output of Algorithm 2 satisfies:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) \right] \geq 1 - 2\alpha - \min\left\{ \frac{2(1-1/K)}{n/K+1}, \; \frac{1-K/n}{K+1} \right\}. \tag{12}
$$
In the special case where $K = n$, this bound simplifies to:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{JK+}}_{n,\alpha}(X_{n+1}) \right] \geq 1 - 2\alpha. \tag{13}
$$

Note that this only establishes coverage slightly below $1 - 2\alpha$. Therefore, to guarantee $1-\alpha$ coverage, we should replace the input $\alpha$ in Algorithm 2 with a smaller value near $\alpha/2$. We chose not to do so because our experiments show that the current implementation already typically covers at level $1-\alpha$ (or even higher) in practice; this empirical observation is consistent with [4]. Furthermore, there exists a conservative variation of Algorithm 2 for which we can prove $1-\alpha$ coverage without modifying the input level; see Supplementary Section S1.1.

To see why everything above makes sense, consider what would happen if the black-box estimates of the conditional probabilities in Algorithm 2 were exact. In this case, the final prediction set in (11) would become
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) = \left\{ y \in \mathcal{Y} : E(X_{n+1}, y, U_{n+1}; \pi) < \hat{Q}_{1-\alpha}\left(\{E(X_i, Y_i, U_i; \pi)\}_{i \in \{1,\ldots,n\}}\right) \right\}, \tag{14}
$$
where $\hat{Q}_{1-\alpha}$ is defined as in Section 2.2. If $n$ is large, for any fixed threshold $\tau$, we can expect $S(X_i, U_i; \pi, \tau)$ to contain $Y_i$ for approximately a fraction $\tau$ of samples $i$. Therefore, $\hat{Q}_{1-\alpha}(\{E(X_i, Y_i, U_i; \pi)\}_{i \in \{1,\ldots,n\}}) \approx 1-\alpha$, and the decision rule becomes approximately
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) \approx \left\{ y \in \mathcal{Y} : E(X_{n+1}, y, U_{n+1}; \pi) \leq 1-\alpha \right\}, \tag{15}
$$
which is equivalent to the oracle procedure from Section 1.1.
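As a companion to the split-conformal sketch above, here is an illustrative implementation of the CV+ rule (11), again reusing `conformity_score_E` and assuming labels encoded as 0, ..., C-1; it stores one score per (test point, candidate label, training sample), which is fine for a sketch but wasteful for large $n$.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

def cv_plus_sets(X, Y, X_test, alpha=0.1, K=10, seed=0):
    """Sketch of Algorithm 2: CV+ calibration via the comparison rule (11)."""
    rng = np.random.default_rng(seed)
    n, n_test = len(Y), len(X_test)
    n_classes = int(Y.max()) + 1                       # labels assumed to be 0, ..., C-1
    U, U_test = rng.uniform(size=n), rng.uniform(size=n_test)

    E_holdout = np.zeros(n)                            # E(X_i, Y_i, U_i; pi_hat^{k(i)})
    E_test = np.zeros((n_test, n_classes, n))          # E(X_{n+1}, y, U_{n+1}; pi_hat^{k(i)})
    for train_idx, hold_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        model = RandomForestClassifier(random_state=seed).fit(X[train_idx], Y[train_idx])
        probs_hold = model.predict_proba(X[hold_idx])
        probs_test = model.predict_proba(X_test)
        # Test-point scores under this fold's model; they are shared by every i in the fold.
        fold_scores = np.array([[conformity_score_E(probs_test[t], y, U_test[t])
                                 for y in range(n_classes)] for t in range(n_test)])
        for j, i in enumerate(hold_idx):
            E_holdout[i] = conformity_score_E(probs_hold[j], Y[i], U[i])
            E_test[:, :, i] = fold_scores

    # Rule (11): keep y if fewer than (1 - alpha)(n + 1) hold-out scores fall below its score.
    cutoff = (1 - alpha) * (n + 1)
    return [{y for y in range(n_classes) if np.sum(E_holdout < E_test[t, y, :]) < cutoff}
            for t in range(n_test)]
```

Setting `K = n` turns this into the jackknife+ variant, at the cost of fitting $n$ models; the sketch assumes every training fold contains all classes so that the `predict_proba` columns stay aligned with the labels.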
2.4 Comparison with alternative conformal methods

Conformal prediction has been proposed before in the context of classification [24], through a very general calibration rule of the form $\hat{C}(x; t) = \{y \in \mathcal{Y} : \hat{f}(y \mid x) \geq t\}$, where the score $\hat{f}$ is a function learned by a black-box classifier. However, to date it was not clear how to best translate the output of the classifier into a powerful score $\hat{f}$ for the above decision rule. In fact, typical choices of $\hat{f}(y \mid x)$, e.g., the estimated probability of $Y = y$ given $X = x$, often lead to poor conditional coverage because the same threshold $t$ is applied both to easy-to-classify samples (where one label has probability close to 1 given $X$) and to hard-to-classify samples (where all probabilities are close to $1/|\mathcal{Y}|$ given $X$). Therefore, this homogeneous conformal classification may significantly underperform compared to the oracle from Section 1.1, even in the ideal case where the black box manages to learn the correct probabilities. This limitation has also been very recently noted in [2] and is analogous to that addressed by [18] in problems with a continuous response variable [20].

The work of [2] addresses this problem by applying quantile regression [18] to hold-out scores $\hat{f}$. However, their solution has two limitations. Firstly, it involves additional data splitting to avoid overfitting, which tends to reduce power. Secondly, its theoretical asymptotic optimality is weaker than ours because it requires the consistency of two black boxes instead of one (this should be clear even though we have explained consistency only heuristically). Practically, experiments suggest that our method provides superior conditional coverage and often yields smaller prediction sets.

2.5 Extension to label-conditional coverage

Our method can be easily extended to obtain provable coverage separately for each class [19, 24]:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(X_{n+1}) \mid Y_{n+1} = y \right] \geq 1-\alpha, \quad \forall y \in \mathcal{Y}. \tag{16}
$$
The only difference is that the threshold $\tau$ should be calibrated separately for each class. More precisely, focusing on the extension of Algorithm 1 for simplicity, we would compute $\hat{Q}^{(y)}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ as the $\lceil (1-\alpha)(1 + |\{i \in \mathcal{I}_2 : Y_i = y\}|) \rceil$-th smallest value in $\{E_i\}_{i \in \mathcal{I}_2 : Y_i = y}$, for each $y \in \mathcal{Y}$. Then, we would define $\hat{\tau} = \max_{y \in \mathcal{Y}} \hat{Q}^{(y)}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ and output $\hat{C}^{\mathrm{SC-lc}}_{n,\alpha}(X_{n+1}) = S(X_{n+1}, U_{n+1}; \hat{\pi}, \hat{\tau})$. More details about this extension are in Supplementary Section S1.2. In the interest of simplicity, we have not explicitly sought label-conditional coverage in the following numerical experiments. Nonetheless, we shall see from the results in Figure 3 that our method performs relatively well in terms of label-conditional coverage even without the explicit constraint.

3 Experiments with simulated data

3.1 Methods and metrics

We compare the performance of Algorithms 1 (SC) and 2 (CV+, JK+), which are based on the new generalized inverse quantile conformity scores in (7), to that of homogeneous conformal classification (HCC) and conformal quantile classification (CQC) [2]. We focus on two different data-generating scenarios in which marginal coverage is not a good proxy for conditional coverage (the second setting is discussed in Supplementary Section S3.3). In both cases, we explore three different black boxes: an oracle that knows the true $\pi_y(x)$ for all $y \in \mathcal{Y}$ and $x$; a support vector classifier (SVC) implemented by the sklearn Python package; and a random forest classifier (RFC), also implemented by sklearn; see Supplementary Section S3.1 for more details. We fix $\alpha = 0.1$ and assess performance in terms of marginal coverage, conditional coverage, and mean cardinality of the prediction sets. Conditional coverage is measured by an estimate of the worst-slice (WS) coverage similar to that in [2], as explained in Supplementary Section S1.3. The cardinality of the prediction sets is computed both marginally and conditionally on coverage; the former is defined as $\mathbb{E}[|\hat{C}(X_{n+1})|]$ and the latter as $\mathbb{E}[|\hat{C}(X_{n+1})| \mid Y_{n+1} \in \hat{C}(X_{n+1})]$. Additional coverage and size metrics, defined by conditioning on the value of a given discrete feature, e.g., $X_1$, are discussed in Supplementary Section S3.
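For completeness, here is a small helper, written for this note, that computes empirical versions of these metrics (marginal coverage, average set size, size conditional on coverage, and the label-conditional coverage targeted by (16)) from a list of prediction sets; the worst-slice estimate of conditional coverage is more involved and is not reproduced here.

```python
import numpy as np

def evaluate_sets(pred_sets, Y_test):
    """Empirical coverage and size metrics for a list of prediction sets."""
    Y_test = np.asarray(Y_test)
    covered = np.array([y in s for s, y in zip(pred_sets, Y_test)])
    sizes = np.array([len(s) for s in pred_sets], dtype=float)
    return {
        "marginal_coverage": covered.mean(),
        "avg_size": sizes.mean(),
        "avg_size_given_covered": sizes[covered].mean() if covered.any() else float("nan"),
        # Label-conditional coverage, one value per observed class, as in (16)
        "label_conditional_coverage": {int(y): covered[Y_test == y].mean()
                                       for y in np.unique(Y_test)},
    }

# Example: sets = split_conformal_sets(X, Y, X_test, alpha=0.1); evaluate_sets(sets, Y_test)
```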
3.2 Experiments with multinomial model and inhomogeneous features

We generate the features $X \in \mathbb{R}^p$, with $p = 10$, as follows: $X_1 = 1$ with probability $1/5$ and $X_1 = 8$ otherwise, while $X_2, \ldots, X_{10}$ are independent standard normal. The conditional distribution of $Y \in \{1, \ldots, 10\}$ given $X = x$ is multinomial with weights $w_j(x) = z_j(x) / \sum_{j'=1}^{10} z_{j'}(x)$, where $z_j(x) = \exp(x^T \beta_j)$ and each $\beta_j \in \mathbb{R}^p$ is sampled from an independent standard normal distribution.

Figure 1 confirms that our methods have valid conditional coverage if the true class probabilities are provided by an oracle. If the probabilities are estimated by the RFC, the conditional coverage appears to be only slightly below $1-\alpha$, and it is nearly perfect with the SVC black box. By contrast, the conditional coverage of the alternative methods is always significantly lower than $1-\alpha$, even with the help of the oracle. Our methods produce slightly larger prediction sets when the oracle is available, but our sets are typically smaller than those of CQC and only slightly larger than those of HCC when the class probabilities are estimated. Finally, note that JK+ is the most powerful of our methods, followed by CV+, although SC is computationally more affordable.

Figure 1: Several classification methods on simulated data with 10 classes, for different choices of calibration and black-box models. SC, CV+, and JK+ are applied with our new generalized inverse quantile conformity scores defined in (7). The results correspond to 100 independent experiments with 1000 training samples and 5000 test samples each. All methods have 90% marginal coverage. (a): Marginal coverage and worst-slice conditional coverage. (b): Size of prediction sets, marginally and conditional on coverage.
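For reference, a short sketch of this data-generating process (our own code, with arbitrary seeds; the coefficients $\beta_j$ are drawn once and shared by the training and test sets) is given below; it can be paired with the `split_conformal_sets` and `evaluate_sets` helpers sketched earlier, and the returned weights can serve as the oracle probabilities.

```python
import numpy as np

def make_model(p=10, n_classes=10, seed=0):
    """Draw the fixed coefficients beta_j that define the conditional distribution of Y given X."""
    return np.random.default_rng(seed).standard_normal((n_classes, p))

def sample_multinomial_data(n, beta, seed=0):
    """Synthetic data of Section 3.2: inhomogeneous first feature and multinomial labels."""
    n_classes, p = beta.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X[:, 0] = np.where(rng.uniform(size=n) < 1 / 5, 1.0, 8.0)  # X_1 = 1 w.p. 1/5, else 8
    z = np.exp(X @ beta.T)                                      # z_j(x) = exp(x^T beta_j)
    w = z / z.sum(axis=1, keepdims=True)                        # multinomial weights w_j(x)
    Y = np.array([rng.choice(n_classes, p=w_i) for w_i in w])   # labels encoded as 0, ..., 9
    return X, Y, w                                              # w gives the oracle probabilities

beta = make_model()
X_train, Y_train, _ = sample_multinomial_data(1000, beta, seed=1)
X_test, Y_test, w_test = sample_multinomial_data(5000, beta, seed=2)
```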
4 Experiments with real data

In this section, we compare the performance of our proposed methods (SC, CV+, and JK+), with the new generalized inverse quantile conformity scores defined in (7), to that of HCC and CQC [2]. We found that the original suggestion of [2] to fit a quantile neural network [21] on the class probability score can be unstable and yield very wide predictions. Therefore, we offer a second variant of this calibration method, denoted by CQC-RF, which replaces the quantile neural network estimator with quantile random forests [14]; see Supplementary Section S4 for details. The validity and statistical efficiency of each method are evaluated according to the same metrics as in Section 3. In all experiments, we set $\alpha = 0.1$ and use the following base predictive models: (i) kernel SVC, (ii) random forest classifier (RFC), and (iii) two-layer neural network classifier (NNet). A detailed description of each algorithm and the corresponding hyper-parameters is in Supplementary Section S4. The methods are tested on two well-known data sets: the Mice Protein Expression data set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) and the MNIST handwritten digit data set. The supplementary material describes the processing pipeline and discusses additional experiments on the Fashion-MNIST and CIFAR10 data sets. Supplementary Tables S1-S4 summarize the results of our experiments in more detail and also consider additional settings.

Figure 2 shows that all methods attain valid marginal coverage on the Mice Protein Expression data, as expected. Here, HCC, CQC, and CQC-RF fail to achieve conditional coverage, in contrast to the proposed methods (SC, CV+, JK+) based on our new conformity scores in (7). Turning to efficiency, we observe that the prediction sets of CV+ and JK+ are smaller than those of SC, and comparable in size to those of HCC. Here, the original CQC algorithm performs poorly both in terms of conditional coverage and cardinality. The CQC-RF variant is not as unstable as the original CQC, although it does not perform much better than HCC.

Figure 2: Experiments on Mice Protein Expression data. 100 independent experiments with 500 randomly chosen training samples and 580 test samples each. Left: coverage. Right: size of prediction sets (extremely large values for CQC not shown). Other details are as in Figure 1.

Figure 3 presents the results on the MNIST data. Here, the sample size is relatively large and hence we exclude JK+ due to its higher computational cost. As in the previous experiments, all methods achieve 90% marginal coverage. Unlike CQC, CQC-RF, and HCC, our methods also attain valid conditional coverage when relying on the NNet or SVC as base predictive models. With the RFC, all methods tend to undercover, suggesting that this classifier estimates the class probabilities poorly, and our prediction sets are larger than those constructed by CQC-RF and HCC. By contrast, the NNet enables our methods to achieve conditional coverage with prediction sets comparable in size to those produced by CQC-RF and HCC. The bottom part of Figure 3 demonstrates that CV+ also has conditional coverage given the true class label $Y$, while SC performs only slightly worse. In striking contrast, HCC, CQC, and CQC-RF all fail to achieve 90% conditional coverage.

5 Conclusions

This paper introduced a principled and versatile modular method for constructing prediction sets for multi-class classification problems that enjoy provable finite-sample coverage and also behave well in terms of conditional coverage when compared to alternatives. Our approach leverages the power of any black-box machine learning classifier that may be available to practitioners, and is easily calibrated via various hold-out procedures, e.g., conformal splitting, CV+, or the jackknife+. This flexibility makes our approach widely applicable and offers options to balance computational efficiency, data parsimony, and power.

Although this paper focused on classification, using conformity scores similar to those in (7) to calibrate hold-out procedures for regression problems [4, 18] is tantalizing. In fact, previous work in the regression setting focused on conformity scores that measure the distance of a data point from its predicted interval on the scale of the $Y$ values (which makes sense for homoscedastic regression, but may not be optimal otherwise), rather than the amount by which one would need to relax the nominal threshold (our $\tau$) until the true value is covered. We leave it to future work to explore the performance of our intuitive metrics in other settings. The Python package at https://github.com/msesia/arc implements our methods. This repository also contains code to reproduce our experiments.

Broader Impact

Machine learning algorithms are increasingly relied upon by decision makers.
It is therefore crucial to combine the predictive performance of such complex machinery with practical guarantees on the reliability and uncertainty of their output. We view the calibration methods presented in this paper as an important step towards this goal. In fact, uncertainty estimation is an effective way to quantify and communicate the benefits and limitations of machine learning. Moreover, the proposed methodologies provide an attractive way to move beyond the standard prediction accuracy measure used to compare algorithms. For instance, one can compare the performance of two candidate predictors, e.g., a random forest and a neural network (see Figure 3), by looking at the size of the corresponding prediction sets and/or their conditional coverage.

Figure 3: Experiments on MNIST data. 100 independent experiments with 10000 randomly chosen training samples and 5000 test samples each. Top: coverage and size of prediction sets (large values for CQC not shown). Bottom: coverage with the neural network black box, conditional on the true $Y$, and size of the corresponding prediction sets, conditional on coverage. Other details are as in Figure 1.

Finally, the approximate conditional coverage that we seek in this work is highly relevant within the broader framework of fairness, as discussed by [17] within a regression setting. While our approximate conditional coverage already implicitly reduces the risk of unwanted bias, an equalized coverage requirement [17] can also be easily incorporated into our methods to explicitly avoid discrimination based on protected categories.

We conclude by emphasizing that the validity of our methods relies on the exchangeability of the data points. If this assumption is violated (e.g., with time-series data), our prediction sets may not have the right coverage. A general suggestion here is to always try to leverage specific knowledge of the data and of the application domain to judge whether the exchangeability assumption is reasonable. Finally, our data-splitting techniques in Section 4 offer a practical way to verify empirically the validity of the predictions on any given data set.

Acknowledgments and Disclosure of Funding

E. C. was partially supported by Office of Naval Research grant N00014-20-12157, and by the Army Research Office (ARO) under grant W911NF-17-1-0304. Y. R. was partially supported by ARO under the same grant. Y. R. thanks the Zuckerman Institute, ISEF Foundation, the Viterbi Fellowship, Technion, and the Koret Foundation for providing additional research support. M. S. was supported by NSF grant DMS 1712800. Y. R. and M. S. were advised by E. C. at Stanford University.

References

[1] R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684, 2019.
[2] M. Cauchois, S. Gupta, and J. Duchi. Knowing what you know: valid confidence sets in multiclass and multilabel prediction. arXiv preprint arXiv:2004.10181, 2020.
[3] V. Chernozhukov, K. Wüthrich, and Y. Zhu. Distributional conformal prediction. arXiv preprint arXiv:1909.07889, 2019.
[4] R. Foygel Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.
[5] L. Guan. Conformal prediction with localization. arXiv preprint arXiv:1908.08558, 2019.
[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.
[7] Y. Hechtlinger, B. Póczos, and L. Wasserman. Cautious deep learning. arXiv preprint arXiv:1805.09460, 2018.
[8] R. Izbicki, G. T. Shimizu, and R. B. Stern. Distribution-free conditional predictive bands using density estimators. arXiv preprint arXiv:1910.05575, 2019.
[9] D. Kivaranovic, K. D. Johnson, and H. Leeb. Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634, 2019.
[10] A. K. Kuchibhotla and A. K. Ramdas. Nested conformal prediction and the generalized jackknife+. arXiv preprint arXiv:1910.10562, 2019.
[11] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pages 3787-3798, 2019.
[12] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094-1111, 2018.
[13] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71-96, 2014.
[14] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983-999, 2006.
[15] L. Neumann, A. Zisserman, and A. Vedaldi. Relaxed softmax: Efficient confidence autocalibration for safe pedestrian detection. 2018.
[16] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.
[17] Y. Romano, R. F. Barber, C. Sabatti, and E. Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, April 2020. https://hdsr.mitpress.mit.edu/pub/qedrwcz3.
[18] Y. Romano, E. Patterson, and E. J. Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538-3548, 2019.
[19] M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223-234, 2019.
[20] M. Sesia and E. J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.
[21] J. W. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299-311, 2000.
[22] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. arXiv preprint arXiv:1902.06977, 2019.
[23] V. Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475-490, 2012.
[24] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.
[25] V. Vovk, D. Lindsay, I. Nouretdinov, and A. Gammerman. Mondrian confidence machine. Technical report, Royal Holloway, University of London, 2003. On-line Compression Modelling project.
[26] V. Vovk, I. Nouretdinov, and A. Gammerman. On-line predictive linear regression. The Annals of Statistics, 37(3):1566-1590, 2009.
[27] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609-616. Citeseer, 2001.
[28] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694-699, 2002.