Classification with Valid and Adaptive Coverage

Yaniv Romano (Department of Statistics, Stanford University, Stanford, CA, USA; yromano@stanford.edu), Matteo Sesia (Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA; sesia@marshall.usc.edu), and Emmanuel J. Candès (Departments of Mathematics and of Statistics, Stanford University, Stanford, CA, USA; candes@stanford.edu). Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract. Conformal inference, cross-validation+, and the jackknife+ are hold-out methods that can be combined with virtually any machine learning algorithm to construct prediction sets with guaranteed marginal coverage. In this paper, we develop specialized versions of these techniques for categorical and unordered response labels that, in addition to providing marginal coverage, are also fully adaptive to complex data distributions, in the sense that they perform favorably in terms of approximate conditional coverage compared to alternative methods. The heart of our contribution is a novel conformity score, which we explicitly demonstrate to be powerful and intuitive for classification problems, but whose underlying principle is potentially far more general. Experiments on synthetic and real data demonstrate the practical value of our theoretical guarantees, as well as the statistical advantages of the proposed methods over the existing alternatives.

1 Introduction

Imagine we have $n$ data samples $\{(X_i, Y_i)\}_{i=1}^n$ with features $X_i \in \mathbb{R}^p$ and a discrete label $Y_i \in \mathcal{Y} = \{1, 2, \ldots, C\}$. The samples are drawn exchangeably (e.g., i.i.d., although exchangeability alone is sufficient) from some unknown distribution $P_{XY}$. Given such data and a desired coverage level $1-\alpha \in (0,1)$, we seek to construct a prediction set $\hat{C}_{n,\alpha} \subseteq \mathcal{Y}$ for the unseen label of a new data point $(X_{n+1}, Y_{n+1})$, also drawn exchangeably from $P_{XY}$, achieving marginal coverage; that is, obeying
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(X_{n+1}) \right] \geq 1-\alpha. \tag{1}
$$
The probability above is taken over all $n+1$ data points, and we ask that (1) holds for any fixed $\alpha$, $n$, and $P_{XY}$. While marginal coverage has the advantage of being both desirable and practically achievable, it unfortunately does not imply the stronger notion of conditional coverage:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(x) \mid X_{n+1} = x \right] \geq 1-\alpha. \tag{2}
$$
The latter asks for valid coverage conditional on a specific observed value of the features $X$. It is already known that conditional coverage cannot be achieved in theory without strong modeling assumptions [1, 23], which we are not willing to make in this paper. That said, it is undeniable that conditional coverage would be preferable. We thus seek to develop classification methods that are provably valid in the marginal sense (1) and also attempt to sensibly approximate conditional coverage (2). At the same time, we want powerful predictions, in the sense that the cardinality of $\hat{C}$ should be as small as possible.

1.1 The oracle classifier

Imagine we have an oracle with perfect knowledge of the conditional distribution $P_{Y|X}$ of $Y$ given $X$. This would of course give the problem away; to be sure, we would define optimal prediction sets $C^{\mathrm{oracle}}_{\alpha}(X_{n+1})$ with conditional coverage as follows: for any $x \in \mathbb{R}^p$, set $\pi_y(x) = \mathbb{P}[Y = y \mid X = x]$ for each $y \in \mathcal{Y}$. Denote by $\pi_{(1)}(x) \geq \pi_{(2)}(x) \geq \ldots \geq \pi_{(C)}(x)$ the order statistics of $\pi_y(x)$. For simplicity, let us assume for now that there are no ties; we will relax this assumption shortly.
For any $\tau \in [0, 1]$, define the generalized conditional quantile function
$$
L(x; \pi, \tau) = \min\left\{ c \in \{1, \ldots, C\} : \pi_{(1)}(x) + \pi_{(2)}(x) + \ldots + \pi_{(c)}(x) \geq \tau \right\} \tag{3}
$$
(recall that the conditional quantiles for continuous responses are $\inf\{y \in \mathbb{R} : \mathbb{P}[Y \leq y \mid X = x] \geq \tau\}$), and the prediction set
$$
C^{\mathrm{oracle}+}_{\alpha}(x) = \left\{ \text{`}y\text{' indices of the } L(x; \pi, 1-\alpha) \text{ largest } \pi_y(x) \right\}. \tag{4}
$$
Hence, (4) is the smallest deterministic set that contains a response with feature values $X = x$ with probability at least $1-\alpha$. For example, if $\pi_1(x) = 0.3$, $\pi_2(x) = 0.6$, and $\pi_3(x) = 0.1$, we have $\pi_{(1)}(x) = 0.6$, $\pi_{(2)}(x) = 0.3$, and $\pi_{(3)}(x) = 0.1$, with $L(x; \pi, 0.9) = 2$, $C^{\mathrm{oracle}+}_{0.1}(x) = \{1, 2\}$, and $L(x; \pi, 0.5) = 1$, $C^{\mathrm{oracle}+}_{0.5}(x) = \{2\}$. Furthermore, define a function $S$ with input $x$, $u \in [0, 1]$, $\pi$, and $\tau$, which computes the set of most likely labels up to (but possibly excluding) the one identified by (3):
$$
S(x, u; \pi, \tau) =
\begin{cases}
\text{`}y\text{' indices of the } L(x; \pi, \tau) - 1 \text{ largest } \pi_y(x), & \text{if } u \leq V(x; \pi, \tau), \\
\text{`}y\text{' indices of the } L(x; \pi, \tau) \text{ largest } \pi_y(x), & \text{otherwise},
\end{cases} \tag{5}
$$
where
$$
V(x; \pi, \tau) = \frac{1}{\pi_{(L(x;\pi,\tau))}(x)} \left[ \sum_{c=1}^{L(x;\pi,\tau)} \pi_{(c)}(x) - \tau \right].
$$
With this in place, by letting $u$ be the realization of a uniform random variable, we can see that the oracle has access to tighter randomized prediction sets, namely,
$$
C^{\mathrm{oracle}}_{\alpha}(x) = S(x, U; \pi, 1-\alpha). \tag{6}
$$
Above, $U \sim \mathrm{Uniform}(0, 1)$ is independent of everything else. It is easy to verify that the sets in (6) are the smallest randomized prediction sets with conditional coverage at level $1-\alpha$. In the above example, we would have $C^{\mathrm{oracle}}_{0.5}(x) = \emptyset$ with probability $(0.6 - 0.5)/0.6 = 1/6$ and $C^{\mathrm{oracle}}_{0.5}(x) = \{2\}$ otherwise. Finally, if there are any ties among the class probabilities, the oracle could simply break them at random. Of course, we do not have access to such an oracle since $P_{Y|X}$ is unknown.

1.2 Preview of our methods

This paper uses classifiers trained on the available data to approximate the unknown conditional distribution of $Y \mid X$. A key strength of the proposed methods is their ability to work with any black-box predictive model, including neural networks, random forests, support vector classifiers, or any other currently existing or possible future alternatives. The only restriction on the training algorithm is that it should treat all samples exchangeably; i.e., it should be invariant to their order. Most off-the-shelf tools offer suitable probability estimates $\hat{\pi}_y(x)$ that we can exploit, regardless of whether they are well-calibrated, by plugging them into an algorithm inspired by the oracle from Section 1.1 in order to obtain prediction sets with guaranteed coverage, as we shall see. Our reader will understand that naively substituting $\pi_y(x)$ with $\hat{\pi}_y(x)$ in the oracle procedure would yield predictions lacking any statistical guarantees, because $\hat{\pi}_y(x)$ may be a poor approximation of $\pi_y(x)$. Fortunately, we can automatically account for errors in $\hat{\pi}_y(x)$ by adaptively choosing the threshold $\tau$ in (3) in such a way as to guarantee finite-sample coverage on future test points. The intuition is that setting $\tau = 1-\alpha$ may not guarantee coverage at level $1-\alpha$ for future test points if $\hat{\pi} \neq \pi$. However, we can compute the empirical coverage on hold-out data as a function of $\tau$, and then select the smallest value of $\tau$ that leads to the desired $1-\alpha$ coverage. Below, we will show that this adaptive tuning rigorously yields tight coverage.
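To make the oracle construction and the role of the threshold $\tau$ concrete, the following NumPy sketch implements the building blocks $L$, $V$, and $S$ from (3)-(5) and the randomized oracle set (6). This is a minimal illustration written for this note, not the authors' reference implementation (their package is at https://github.com/msesia/arc); the function names are ours.

```python
import numpy as np

def generalized_quantile_L(probs, tau):
    """L(x; pi, tau) in (3): smallest number of top classes whose probabilities sum to >= tau."""
    order = np.argsort(-probs)                  # classes sorted from most to least likely
    cumsum = np.cumsum(probs[order])
    L = min(int(np.searchsorted(cumsum, tau)) + 1, len(probs))  # guard against rounding near tau = 1
    return L, order, cumsum

def randomized_set_S(probs, u, tau):
    """S(x, u; pi, tau) in (5): the top-L classes, possibly dropping the L-th one at random."""
    L, order, cumsum = generalized_quantile_L(probs, tau)
    V = (cumsum[L - 1] - tau) / probs[order[L - 1]]  # V(x; pi, tau) in (5)
    keep = L - 1 if u <= V else L
    return set(order[:keep].tolist())

# Worked example from Section 1.1 (labels indexed 0, 1, 2 here instead of 1, 2, 3):
probs = np.array([0.3, 0.6, 0.1])
rng = np.random.default_rng()
oracle_set = randomized_set_S(probs, rng.uniform(), tau=1 - 0.5)  # alpha = 0.5 in (6)
print(oracle_set)  # empty with probability 1/6, {1} (i.e., label 2) otherwise
```

With `tau = 1 - alpha` and the true class probabilities, `randomized_set_S` reproduces the randomized oracle (6); the calibration procedures developed below replace the true probabilities with a black-box estimate and tune `tau` on hold-out data.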
1.3 Related work

We build upon conformal inference [12, 24, 26] and take inspiration from [3, 5, 8-10, 13, 18], which made conformal prediction for regression problems adaptive to heteroscedasticity, thus bringing it closer to conditional coverage [20]. Conformal inference has been applied before to classification problems [7, 19, 24, 25] in order to attain marginal coverage; however, the idea of explicitly trying to approximate the oracle from Section 1.1 is novel. We will see that our procedure empirically achieves better conditional coverage than a direct application of conformal inference. While working on this project, we became aware of the independent work of [2], which also seeks to improve the conditional coverage of conformal classification methods. However, their approach differs substantially; see Section 2.4. Finally, our method also naturally accommodates calibration through cross-validation+ and the jackknife+ [4], which had not yet been extended to classification, although the natural generality of these calibration techniques has also been very recently noted by others [10].

A different but related line of work focuses on post-processing the output of black-box classification algorithms to produce more accurate probability estimates [6, 11, 15, 16, 22, 27, 28], although without achieving prediction sets with provable finite-sample coverage. These techniques are complementary to our methods and may help further boost our performance by improving the accuracy of any given black box; however, we have not tested them empirically in this paper for space reasons.

2.1 Generalized inverse quantile conformity scores

Suppose we have a black-box classifier $\hat{\pi}_y(x)$ that estimates the true unknown class probabilities $\pi_y(x)$. Here, we only assume $\hat{\pi}_y(x)$ to be standardized: $0 \leq \hat{\pi}_y(x) \leq 1$ and $\sum_{y=1}^{C} \hat{\pi}_y(x) = 1$, for all $x$ and $y$. An example may be the output of the softmax layer of a neural network, after normalization. In fact, almost any standard machine learning software, e.g., sklearn, can produce a suitable $\hat{\pi}$, either through random forests, k-nearest neighbors, or support vector machines, to name a few options. Then, we plug $\hat{\pi}$ into a modified version of the imaginary oracle procedure of Section 1.1, where the threshold $\tau$ needs to be carefully calibrated using hold-out samples independent of the training data. We will present two alternative methods for calibrating $\tau$; both are based on the following idea. We define a function $E$, with input $x$, $y$, $u$, $\hat{\pi}$, which computes the smallest value of $\tau$ such that the set $S(x, u; \hat{\pi}, \tau)$ in (5) contains the label $y$ conditional on $X = x$. We call this the generalized inverse quantile conformity score function:
$$
E(x, y, u; \hat{\pi}) = \min\left\{ \tau \in [0, 1] : y \in S(x, u; \hat{\pi}, \tau) \right\}. \tag{7}
$$
By construction, our scores evaluated on hold-out samples $(X_i, Y_i)$, namely $E_i = E(X_i, Y_i, U_i; \hat{\pi})$, are uniformly distributed conditional on $X$ if $\hat{\pi} = \pi$. (Each $U_i$ is a uniform random variable in $[0, 1]$, independent of everything else.) Therefore, one could also intuitively look at (7) as a special type of p-value. It is worth emphasizing that this property makes our scores naturally comparable across different samples, in contrast with the scores found in the earlier literature on adaptive conformal inference [18]. In fact, alternative conformity scores [2, 10, 12, 18] generally have different distributions at different values of $X$, even in the ideal case where the base method (our $\hat{\pi}$) is a perfect oracle.
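Under the no-ties assumption, the minimization in (7) admits a simple closed form: the score equals the cumulative estimated probability of all classes at least as likely as $y$, minus $u$ times the estimated probability of $y$ itself. The sketch below illustrates this formula; it is our own derivation-based illustration, not code from the authors' package.

```python
import numpy as np

def conformity_score_E(probs, y, u):
    """Closed form of the generalized inverse quantile conformity score (7), assuming no ties:
    total probability of the classes ranked at least as high as y, minus u * probs[y]."""
    order = np.argsort(-probs)                  # most to least likely
    rank = int(np.where(order == y)[0][0])      # position of label y in that ranking
    return float(np.cumsum(probs[order])[rank] - u * probs[y])

# Continuing the example of Section 1.1, with probs = (0.3, 0.6, 0.1):
probs = np.array([0.3, 0.6, 0.1])
print(conformity_score_E(probs, y=1, u=0.5))  # 0.6 - 0.5 * 0.6 = 0.30
print(conformity_score_E(probs, y=0, u=0.5))  # 0.9 - 0.5 * 0.3 = 0.75
```

As a consistency check, a label $y$ belongs to `randomized_set_S(probs, u, tau)` from the earlier sketch exactly when `tau` exceeds this score, which is what (7) demands.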
Below, we shall see that, loosely speaking, we can construct prediction sets with provable marginal coverage for future test points by applying (5) with a value of $\tau$ close to the $1-\alpha$ quantile of $\{E_i\}_{i \in \mathcal{I}_2}$, where $\mathcal{I}_2$ is the set of hold-out data points not used to train $\hat{\pi}$; see (8).

2.2 Adaptive classification with split-conformal calibration

Algorithm 1 implements the above idea with split-conformal calibration, with which we begin because it is the easiest to explain. Later, we will consider alternative calibration methods based on cross-validation+ and the jackknife+; we do not discuss full-conformal calibration in the interest of space, and because it is often computationally prohibitive. For simplicity, we will apply Algorithm 1 by splitting the data into two sets of equal size; however, this is not necessary, and using more data points for training may sometimes perform better in practice [20].

Algorithm 1: Adaptive classification with split-conformal calibration
1. Input: data $\{(X_i, Y_i)\}_{i=1}^n$, $X_{n+1}$, black-box learning algorithm $\mathcal{B}$, level $\alpha \in (0, 1)$.
2. Randomly split the training data into 2 subsets, $\mathcal{I}_1, \mathcal{I}_2$.
3. Sample $U_i \sim \mathrm{Uniform}(0, 1)$ for each $i \in \{1, \ldots, n+1\}$, independently of everything else.
4. Train $\mathcal{B}$ on all samples in $\mathcal{I}_1$: $\hat{\pi} \leftarrow \mathcal{B}(\{(X_i, Y_i)\}_{i \in \mathcal{I}_1})$.
5. Compute $E_i = E(X_i, Y_i, U_i; \hat{\pi})$ for each $i \in \mathcal{I}_2$, with the function $E$ defined in (7).
6. Compute $\hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ as the $\lceil (1-\alpha)(1 + |\mathcal{I}_2|) \rceil$-th smallest value in $\{E_i\}_{i \in \mathcal{I}_2}$.
7. Use the function $S$ defined in (5) to construct the prediction set at $X_{n+1}$ as:
$$
\hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) = S\left(X_{n+1}, U_{n+1}; \hat{\pi}, \hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})\right). \tag{8}
$$
8. Output: a prediction set $\hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1})$ for the unobserved label $Y_{n+1}$.

Theorem 1. If the samples $(X_i, Y_i)$, for $i \in \{1, \ldots, n+1\}$, are exchangeable and $\mathcal{B}$ from Algorithm 1 is invariant to permutations of its input samples, the output of Algorithm 1 satisfies:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) \right] \geq 1-\alpha. \tag{9}
$$
Furthermore, if the scores $E_i$ are almost surely distinct, the marginal coverage is nearly tight:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{SC}}_{n,\alpha}(X_{n+1}) \right] \leq 1-\alpha + \frac{1}{|\mathcal{I}_2| + 1}. \tag{10}
$$

The proofs of this theorem and all other results are in Supplementary Section S2. Marginal coverage holds regardless of the quality of the black-box approximation; however, one can intuitively expect that if the black box is consistent and a large amount of data is available, so that $\hat{\pi}_y(x) \approx \pi_y(x)$, the output of our procedure will tend to be a close approximation of the output of the oracle, which provides optimal conditional coverage. This statement could be made rigorous under some additional technical assumptions besides the consistency of the black box [20]. However, we prefer to avoid tedious technical details, especially since the intuition is already clear. If $\hat{\pi} = \pi$, the sets $S(X_i, U_i; \pi, \tau)$ in (5) will tend to contain the true labels for a fraction $\tau$ of the points $i \in \mathcal{I}_2$, as long as $|\mathcal{I}_2|$ is large. In this limit, $\hat{Q}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ becomes approximately equal to $1-\alpha$, and the predictions in (8) will eventually approach those in (6).
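The compact sketch below ties Algorithm 1 together, reusing the `conformity_score_E` and `randomized_set_S` helpers from the earlier sketches and a generic sklearn classifier with `predict_proba` as the black box $\mathcal{B}$. It assumes `X`, `Y`, and `X_test` are NumPy arrays with labels encoded as 0, ..., C-1 (so the columns of `predict_proba` line up with the labels); as before, this is an illustration rather than the authors' reference code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def split_conformal_sets(X, Y, X_test, alpha=0.1, black_box=None, seed=0):
    """Sketch of Algorithm 1: split-conformal calibration of the threshold tau."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    I1, I2 = idx[: n // 2], idx[n // 2:]                       # step 2: random split

    black_box = black_box or RandomForestClassifier(random_state=seed)
    black_box.fit(X[I1], Y[I1])                                # step 4: train on I_1
    pi_cal = black_box.predict_proba(X[I2])                    # estimated class probabilities
    U_cal = rng.uniform(size=len(I2))                          # step 3: randomization variables

    # Step 5: conformity scores (7) on the hold-out fold I_2
    E_cal = np.array([conformity_score_E(p, y, u)
                      for p, y, u in zip(pi_cal, Y[I2], U_cal)])

    # Step 6: tau_hat is the ceil((1 - alpha)(1 + |I_2|))-th smallest hold-out score
    k = min(int(np.ceil((1 - alpha) * (1 + len(I2)))), len(I2))
    tau_hat = np.sort(E_cal)[k - 1]

    # Step 7: randomized prediction sets S(X_test, U; pi_hat, tau_hat)
    pi_test = black_box.predict_proba(X_test)
    U_test = rng.uniform(size=len(X_test))
    return [randomized_set_S(p, u, tau_hat) for p, u in zip(pi_test, U_test)]
```

Any classifier exposing `fit` and `predict_proba` can be passed as `black_box`; the calibration step never looks inside the model, which is the modularity emphasized throughout the paper.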
2.3 Adaptive classification with cross-validation+ and jackknife+ calibration

A limitation of Algorithm 1 is that it only uses part of the data to train the predictive algorithm. Consequently, the estimate $\hat{\pi}$ may not be as accurate as it could have been had we used all the data for estimation purposes. This is especially true if the sample size $n$ is small. Algorithm 2 presents an alternative solution that replaces data splitting with a cross-validation approach, which is computationally more expensive but often provides tighter prediction sets.

Algorithm 2: Adaptive classification with CV+ calibration
1. Input: data $\{(X_i, Y_i)\}_{i=1}^n$, $X_{n+1}$, black-box $\mathcal{B}$, number of splits $K \leq n$, level $\alpha \in (0, 1)$.
2. Randomly split the training data into $K$ disjoint subsets, $\mathcal{I}_1, \ldots, \mathcal{I}_K$, each of size $n/K$.
3. Sample $U_i \sim \mathrm{Uniform}(0, 1)$ for each $i \in \{1, \ldots, n+1\}$, independently of everything else.
4. For each $k \in \{1, \ldots, K\}$:
5. &nbsp;&nbsp; Train $\mathcal{B}$ on all samples except those in $\mathcal{I}_k$: $\hat{\pi}^k \leftarrow \mathcal{B}(\{(X_i, Y_i)\}_{i \in \{1,\ldots,n\} \setminus \mathcal{I}_k})$.
6. Use the function $E$ defined in (7) to construct the prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ as:
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) = \left\{ y \in \mathcal{Y} : \sum_{i=1}^{n} \mathbb{1}\left[ E(X_i, Y_i, U_i; \hat{\pi}^{k(i)}) < E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)}) \right] < (1-\alpha)(n+1) \right\}, \tag{11}
$$
where $k(i) \in \{1, \ldots, K\}$ is the fold containing the $i$-th sample.
7. Output: a prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ for the unobserved label $Y_{n+1}$.

In words, in Algorithm 2, we sweep over all possible labels $y \in \mathcal{Y}$ and include $y$ in the final prediction set $\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1})$ if fewer than $(1-\alpha)(n+1)$ of the hold-out scores $E(X_i, Y_i, U_i; \hat{\pi}^{k(i)})$, evaluated on the true labeled data, are strictly smaller than the corresponding score $E(X_{n+1}, y, U_{n+1}; \hat{\pi}^{k(i)})$. Note that we have assumed $n/K$ to be an integer for simplicity; however, different splits can have different sizes. In the special case where $K = n$, we refer to the hold-out system in Algorithm 2 as jackknife+ rather than cross-validation+, consistently with the terminology in [4].

Theorem 2. Under the same assumptions as Theorem 1, the output of Algorithm 2 satisfies:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) \right] \geq 1 - 2\alpha - \min\left\{ \frac{2(1-1/K)}{n/K+1}, \; \frac{1-K/n}{K+1} \right\}. \tag{12}
$$
In the special case where $K = n$, this bound simplifies to:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}^{\mathrm{JK+}}_{n,\alpha}(X_{n+1}) \right] \geq 1 - 2\alpha. \tag{13}
$$

Note that this only establishes coverage slightly below $1 - 2\alpha$. Therefore, to guarantee $1-\alpha$ coverage, we should replace the input $\alpha$ in Algorithm 2 with a smaller value near $\alpha/2$. We chose not to do so because our experiments show that the current implementation already typically covers at level $1-\alpha$ (or even higher) in practice; this empirical observation is consistent with [4]. Furthermore, there exists a conservative variation of Algorithm 2 for which we can prove $1-\alpha$ coverage without modifying the input level; see Supplementary Section S1.1.

To see why everything above makes sense, consider what would happen if the black-box estimates of the conditional probabilities in Algorithm 2 were exact. In this case, the final prediction set in (11) would become
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) = \left\{ y \in \mathcal{Y} : E(X_{n+1}, y, U_{n+1}; \pi) < \hat{Q}_{1-\alpha}\left(\{E(X_i, Y_i, U_i; \pi)\}_{i \in \{1,\ldots,n\}}\right) \right\}, \tag{14}
$$
where $\hat{Q}_{1-\alpha}$ is defined as in Section 2.2. If $n$ is large, for any fixed threshold $\tau$, we can expect $S(X_i, U_i; \pi, \tau)$ to contain $Y_i$ for approximately a fraction $\tau$ of samples $i$. Therefore, $\hat{Q}_{1-\alpha}(\{E(X_i, Y_i, U_i; \pi)\}_{i \in \{1,\ldots,n\}}) \approx 1-\alpha$, and the decision rule becomes approximately
$$
\hat{C}^{\mathrm{CV+}}_{n,\alpha}(X_{n+1}) \approx \left\{ y \in \mathcal{Y} : E(X_{n+1}, y, U_{n+1}; \pi) \leq 1-\alpha \right\}, \tag{15}
$$
which is equivalent to the oracle procedure from Section 1.1.
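As a companion to the split-conformal sketch above, here is an illustrative implementation of the CV+ rule (11), again reusing `conformity_score_E` and assuming labels encoded as 0, ..., C-1; it stores one score per (test point, candidate label, training sample), which is fine for a sketch but wasteful for large $n$.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

def cv_plus_sets(X, Y, X_test, alpha=0.1, K=10, seed=0):
    """Sketch of Algorithm 2: CV+ calibration via the comparison rule (11)."""
    rng = np.random.default_rng(seed)
    n, n_test = len(Y), len(X_test)
    n_classes = int(Y.max()) + 1                       # labels assumed to be 0, ..., C-1
    U, U_test = rng.uniform(size=n), rng.uniform(size=n_test)

    E_holdout = np.zeros(n)                            # E(X_i, Y_i, U_i; pi_hat^{k(i)})
    E_test = np.zeros((n_test, n_classes, n))          # E(X_{n+1}, y, U_{n+1}; pi_hat^{k(i)})
    for train_idx, hold_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        model = RandomForestClassifier(random_state=seed).fit(X[train_idx], Y[train_idx])
        probs_hold = model.predict_proba(X[hold_idx])
        probs_test = model.predict_proba(X_test)
        # Test-point scores under this fold's model; they are shared by every i in the fold.
        fold_scores = np.array([[conformity_score_E(probs_test[t], y, U_test[t])
                                 for y in range(n_classes)] for t in range(n_test)])
        for j, i in enumerate(hold_idx):
            E_holdout[i] = conformity_score_E(probs_hold[j], Y[i], U[i])
            E_test[:, :, i] = fold_scores

    # Rule (11): keep y if fewer than (1 - alpha)(n + 1) hold-out scores fall below its score.
    cutoff = (1 - alpha) * (n + 1)
    return [{y for y in range(n_classes) if np.sum(E_holdout < E_test[t, y, :]) < cutoff}
            for t in range(n_test)]
```

Setting `K = n` turns this into the jackknife+ variant, at the cost of fitting $n$ models; the sketch assumes every training fold contains all classes so that the `predict_proba` columns stay aligned with the labels.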
2.4 Comparison with alternative conformal methods

Conformal prediction has been proposed before in the context of classification [24], through a very general calibration rule of the form $\hat{C}(x; t) = \{y \in \mathcal{Y} : \hat{f}(y \mid x) \geq t\}$, where the score $\hat{f}$ is a function learned by a black-box classifier. However, to date it was not clear how to best translate the output of the classifier into a powerful score $\hat{f}$ for the above decision rule. In fact, typical choices of $\hat{f}(y \mid x)$, e.g., the estimated probability of $Y = y$ given $X = x$, often lead to poor conditional coverage because the same threshold $t$ is applied both to easy-to-classify samples (where one label has probability close to 1 given $X$) and to hard-to-classify samples (where all probabilities are close to $1/|\mathcal{Y}|$ given $X$). Therefore, this homogeneous conformal classification may significantly underperform compared to the oracle from Section 1.1, even in the ideal case where the black box manages to learn the correct probabilities. This limitation has also been very recently noted in [2] and is analogous to that addressed by [18] in problems with a continuous response variable [20].

The work of [2] addresses this problem by applying quantile regression [18] to hold-out scores $\hat{f}$. However, their solution has two limitations. Firstly, it involves additional data splitting to avoid overfitting, which tends to reduce power. Secondly, its theoretical asymptotic optimality is weaker than ours because it requires the consistency of two black boxes instead of one (this should be clear even though we have explained consistency only heuristically). Practically, experiments suggest that our method provides superior conditional coverage and often yields smaller prediction sets.

2.5 Extension to label-conditional coverage

Our method can be easily extended to obtain provable coverage separately for each class [19, 24]:
$$
\mathbb{P}\left[ Y_{n+1} \in \hat{C}_{n,\alpha}(X_{n+1}) \mid Y_{n+1} = y \right] \geq 1-\alpha, \quad \forall y \in \mathcal{Y}. \tag{16}
$$
The only difference is that the threshold $\tau$ should be calibrated separately for each class. More precisely, focusing on the extension of Algorithm 1 for simplicity, we would compute $\hat{Q}^{(y)}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ as the $\lceil (1-\alpha)(1 + |\{i \in \mathcal{I}_2 : Y_i = y\}|) \rceil$-th smallest value in $\{E_i\}_{i \in \mathcal{I}_2 : Y_i = y}$, for each $y \in \mathcal{Y}$. Then, we would define $\hat{\tau} = \max_{y \in \mathcal{Y}} \hat{Q}^{(y)}_{1-\alpha}(\{E_i\}_{i \in \mathcal{I}_2})$ and output $\hat{C}^{\mathrm{SC-lc}}_{n,\alpha}(X_{n+1}) = S(X_{n+1}, U_{n+1}; \hat{\pi}, \hat{\tau})$. More details about this extension are in Supplementary Section S1.2. In the interest of simplicity, we have not explicitly sought label-conditional coverage in the following numerical experiments. Nonetheless, we shall see from the results in Figure 3 that our method performs relatively well in terms of label-conditional coverage even without the explicit constraint.

3 Experiments with simulated data

3.1 Methods and metrics

We compare the performance of Algorithms 1 (SC) and 2 (CV+, JK+), which are based on the new generalized inverse quantile conformity scores in (7), to that of homogeneous conformal classification (HCC) and conformal quantile classification (CQC) [2]. We focus on two different data-generating scenarios in which marginal coverage is not a good proxy for conditional coverage (the second setting is discussed in Supplementary Section S3.3). In both cases, we explore three different black boxes: an oracle that knows the true $\pi_y(x)$ for all $y \in \mathcal{Y}$ and $x$; a support vector classifier (SVC) implemented by the sklearn Python package; and a random forest classifier (RFC), also implemented by sklearn; see Supplementary Section S3.1 for more details. We fix $\alpha = 0.1$ and assess performance in terms of marginal coverage, conditional coverage, and mean cardinality of the prediction sets. Conditional coverage is measured by an estimate of the worst-slice (WS) coverage similar to that in [2], as explained in Supplementary Section S1.3. The cardinality of the prediction sets is computed both marginally and conditionally on coverage; the former is defined as $\mathbb{E}[|\hat{C}(X_{n+1})|]$ and the latter as $\mathbb{E}[|\hat{C}(X_{n+1})| \mid Y_{n+1} \in \hat{C}(X_{n+1})]$. Additional coverage and size metrics, defined by conditioning on the value of a given discrete feature, e.g., $X_1$, are discussed in Supplementary Section S3.
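For completeness, here is a small helper, written for this note, that computes empirical versions of these metrics (marginal coverage, average set size, size conditional on coverage, and the label-conditional coverage targeted by (16)) from a list of prediction sets; the worst-slice estimate of conditional coverage is more involved and is not reproduced here.

```python
import numpy as np

def evaluate_sets(pred_sets, Y_test):
    """Empirical coverage and size metrics for a list of prediction sets."""
    Y_test = np.asarray(Y_test)
    covered = np.array([y in s for s, y in zip(pred_sets, Y_test)])
    sizes = np.array([len(s) for s in pred_sets], dtype=float)
    return {
        "marginal_coverage": covered.mean(),
        "avg_size": sizes.mean(),
        "avg_size_given_covered": sizes[covered].mean() if covered.any() else float("nan"),
        # Label-conditional coverage, one value per observed class, as in (16)
        "label_conditional_coverage": {int(y): covered[Y_test == y].mean()
                                       for y in np.unique(Y_test)},
    }

# Example: sets = split_conformal_sets(X, Y, X_test, alpha=0.1); evaluate_sets(sets, Y_test)
```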
3.2 Experiments with multinomial model and inhomogeneous features

We generate the features $X \in \mathbb{R}^p$, with $p = 10$, as follows: $X_1 = 1$ with probability $1/5$ and $X_1 = 8$ otherwise, while $X_2, \ldots, X_{10}$ are independent standard normal. The conditional distribution of $Y \in \{1, \ldots, 10\}$ given $X = x$ is multinomial with weights $w_j(x) = z_j(x) / \sum_{j'=1}^{10} z_{j'}(x)$, where $z_j(x) = \exp(x^T \beta_j)$ and each $\beta_j \in \mathbb{R}^p$ is sampled from an independent standard normal distribution.

Figure 1 confirms that our methods have valid conditional coverage if the true class probabilities are provided by an oracle. If the probabilities are estimated by the RFC, the conditional coverage appears to be only slightly below $1-\alpha$, and it is nearly perfect with the SVC black box. By contrast, the conditional coverage of the alternative methods is always significantly lower than $1-\alpha$, even with the help of the oracle. Our methods produce slightly larger prediction sets when the oracle is available, but our sets are typically smaller than those of CQC and only slightly larger than those of HCC when the class probabilities are estimated. Finally, note that JK+ is the most powerful of our methods, followed by CV+, although SC is computationally more affordable.

Figure 1: Several classification methods on simulated data with 10 classes, for different choices of calibration and black-box models. SC, CV+, and JK+ are applied with our new generalized inverse quantile conformity scores defined in (7). The results correspond to 100 independent experiments with 1000 training samples and 5000 test samples each. All methods have 90% marginal coverage. (a): Marginal coverage and worst-slice conditional coverage. (b): Size of prediction sets, marginally and conditional on coverage.
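For reference, a short sketch of this data-generating process (our own code, with arbitrary seeds; the coefficients $\beta_j$ are drawn once and shared by the training and test sets) is given below; it can be paired with the `split_conformal_sets` and `evaluate_sets` helpers sketched earlier, and the returned weights can serve as the oracle probabilities.

```python
import numpy as np

def make_model(p=10, n_classes=10, seed=0):
    """Draw the fixed coefficients beta_j that define the conditional distribution of Y given X."""
    return np.random.default_rng(seed).standard_normal((n_classes, p))

def sample_multinomial_data(n, beta, seed=0):
    """Synthetic data of Section 3.2: inhomogeneous first feature and multinomial labels."""
    n_classes, p = beta.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X[:, 0] = np.where(rng.uniform(size=n) < 1 / 5, 1.0, 8.0)  # X_1 = 1 w.p. 1/5, else 8
    z = np.exp(X @ beta.T)                                      # z_j(x) = exp(x^T beta_j)
    w = z / z.sum(axis=1, keepdims=True)                        # multinomial weights w_j(x)
    Y = np.array([rng.choice(n_classes, p=w_i) for w_i in w])   # labels encoded as 0, ..., 9
    return X, Y, w                                              # w gives the oracle probabilities

beta = make_model()
X_train, Y_train, _ = sample_multinomial_data(1000, beta, seed=1)
X_test, Y_test, w_test = sample_multinomial_data(5000, beta, seed=2)
```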
4 Experiments with real data

In this section, we compare the performance of our proposed methods (SC, CV+, and JK+), with the new generalized inverse quantile conformity scores defined in (7), to that of HCC and CQC [2]. We found that the original suggestion of [2] to fit a quantile neural network [21] on the class probability score can be unstable and yield very wide predictions. Therefore, we offer a second variant of this calibration method, denoted by CQC-RF, which replaces the quantile neural network estimator with quantile random forests [14]; see Supplementary Section S4 for details. The validity and statistical efficiency of each method are evaluated according to the same metrics as in Section 3. In all experiments, we set $\alpha = 0.1$ and use the following base predictive models: (i) kernel SVC, (ii) random forest classifier (RFC), and (iii) two-layer neural network classifier (NNet). A detailed description of each algorithm and the corresponding hyper-parameters is in Supplementary Section S4. The methods are tested on two well-known data sets: the Mice Protein Expression data set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) and the MNIST handwritten digit data set. The supplementary material describes the processing pipeline and discusses additional experiments on the Fashion-MNIST and CIFAR10 data sets. Supplementary Tables S1-S4 summarize the results of our experiments in more detail and also consider additional settings.

Figure 2 shows that all methods attain valid marginal coverage on the Mice Protein Expression data, as expected. Here, HCC, CQC, and CQC-RF fail to achieve conditional coverage, in contrast to the proposed methods (SC, CV+, JK+) based on our new conformity scores in (7). Turning to efficiency, we observe that the prediction sets of CV+ and JK+ are smaller than those of SC, and comparable in size to those of HCC. Here, the original CQC algorithm performs poorly both in terms of conditional coverage and cardinality. The CQC-RF variant is not as unstable as the original CQC, although it does not perform much better than HCC.

Figure 2: Experiments on Mice Protein Expression data. 100 independent experiments with 500 randomly chosen training samples and 580 test samples each. Left: coverage. Right: size of prediction sets (extremely large values for CQC not shown). Other details are as in Figure 1.

Figure 3 presents the results on the MNIST data. Here, the sample size is relatively large and hence we exclude JK+ due to its higher computational cost. As in the previous experiments, all methods achieve 90% marginal coverage. Unlike CQC, CQC-RF, and HCC, our methods also attain valid conditional coverage when relying on the NNet or SVC as base predictive models. With the RFC, all methods tend to undercover, suggesting that this classifier estimates the class probabilities poorly, and our prediction sets are larger than those constructed by CQC-RF and HCC. By contrast, the NNet enables our methods to achieve conditional coverage with prediction sets comparable in size to those produced by CQC-RF and HCC. The bottom part of Figure 3 demonstrates that CV+ also has conditional coverage given the true class label $Y$, while SC performs only slightly worse. In striking contrast, HCC, CQC, and CQC-RF all fail to achieve 90% conditional coverage.

5 Conclusions

This paper introduced a principled and versatile modular method for constructing prediction sets for multi-class classification problems that enjoy provable finite-sample coverage and also behave well in terms of conditional coverage when compared to alternatives. Our approach leverages the power of any black-box machine learning classifier that may be available to practitioners, and is easily calibrated via various hold-out procedures, e.g., conformal splitting, CV+, or the jackknife+. This flexibility makes our approach widely applicable and offers options to balance computational efficiency, data parsimony, and power.

Although this paper focused on classification, using conformity scores similar to those in (7) to calibrate hold-out procedures for regression problems [4, 18] is tantalizing. In fact, previous work in the regression setting focused on conformity scores that measure the distance of a data point from its predicted interval on the scale of the $Y$ values (which makes sense for homoscedastic regression, but may not be optimal otherwise), rather than the amount by which one would need to relax the nominal threshold (our $\tau$) until the true value is covered. We leave it to future work to explore the performance of our intuitive metrics in other settings. The Python package at https://github.com/msesia/arc implements our methods. This repository also contains code to reproduce our experiments.

Broader Impact

Machine learning algorithms are increasingly relied upon by decision makers.
It is therefore crucial to combine the predictive performance of such complex machinery with practical guarantees on the reliability and uncertainty of their output. We view the calibration methods presented in this paper as an important step towards this goal. In fact, uncertainty estimation is an effective way to quantify and communicate the benefits and limitations of machine learning. Moreover, the proposed methodologies provide an attractive way to move beyond the standard prediction accuracy measure used to compare algorithms. For instance, one can compare the performance of two candidate predictors, e.g., a random forest and a neural network (see Figure 3), by looking at the size of the corresponding prediction sets and/or their conditional coverage.

Figure 3: Experiments on MNIST data. 100 independent experiments with 10000 randomly chosen training samples and 5000 test samples each. Top: coverage and size of prediction sets (large values for CQC not shown). Bottom: coverage with the neural network black box, conditional on the true $Y$, and size of the corresponding prediction sets, conditional on coverage. Other details are as in Figure 1.

Finally, the approximate conditional coverage that we seek in this work is highly relevant within the broader framework of fairness, as discussed by [17] within a regression setting. While our approximate conditional coverage already implicitly reduces the risk of unwanted bias, an equalized coverage requirement [17] can also be easily incorporated into our methods to explicitly avoid discrimination based on protected categories.

We conclude by emphasizing that the validity of our methods relies on the exchangeability of the data points. If this assumption is violated (e.g., with time-series data), our prediction sets may not have the right coverage. A general suggestion here is to always try to leverage specific knowledge of the data and of the application domain to judge whether the exchangeability assumption is reasonable. Finally, our data-splitting techniques in Section 4 offer a practical way to verify empirically the validity of the predictions on any given data set.

Acknowledgments and Disclosure of Funding

E. C. was partially supported by Office of Naval Research grant N00014-20-12157, and by the Army Research Office (ARO) under grant W911NF-17-1-0304. Y. R. was partially supported by ARO under the same grant. Y. R. thanks the Zuckerman Institute, ISEF Foundation, the Viterbi Fellowship, Technion, and the Koret Foundation for providing additional research support. M. S. was supported by NSF grant DMS 1712800. Y. R. and M. S. were advised by E. C. at Stanford University.

References

[1] R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684, 2019.
[2] M. Cauchois, S. Gupta, and J. Duchi. Knowing what you know: valid confidence sets in multiclass and multilabel prediction. arXiv preprint arXiv:2004.10181, 2020.
[3] V. Chernozhukov, K. Wüthrich, and Y. Zhu. Distributional conformal prediction. arXiv preprint arXiv:1909.07889, 2019.
[4] R. Foygel Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.
[5] L. Guan. Conformal prediction with localization. arXiv preprint arXiv:1908.08558, 2019.
[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.
[7] Y. Hechtlinger, B. Póczos, and L. Wasserman. Cautious deep learning. arXiv preprint arXiv:1805.09460, 2018.
[8] R. Izbicki, G. T. Shimizu, and R. B. Stern. Distribution-free conditional predictive bands using density estimators. arXiv preprint arXiv:1910.05575, 2019.
[9] D. Kivaranovic, K. D. Johnson, and H. Leeb. Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634, 2019.
[10] A. K. Kuchibhotla and A. K. Ramdas. Nested conformal prediction and the generalized jackknife+. arXiv preprint arXiv:1910.10562, 2019.
[11] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pages 3787-3798, 2019.
[12] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094-1111, 2018.
[13] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71-96, 2014.
[14] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983-999, 2006.
[15] L. Neumann, A. Zisserman, and A. Vedaldi. Relaxed softmax: Efficient confidence autocalibration for safe pedestrian detection. 2018.
[16] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.
[17] Y. Romano, R. F. Barber, C. Sabatti, and E. Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, April 2020. https://hdsr.mitpress.mit.edu/pub/qedrwcz3.
[18] Y. Romano, E. Patterson, and E. J. Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538-3548, 2019.
[19] M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223-234, 2019.
[20] M. Sesia and E. J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.
[21] J. W. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299-311, 2000.
[22] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. arXiv preprint arXiv:1902.06977, 2019.
[23] V. Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475-490, 2012.
[24] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.
[25] V. Vovk, D. Lindsay, I. Nouretdinov, and A. Gammerman. Mondrian confidence machine. Technical report, Royal Holloway, University of London, 2003. On-line Compression Modelling project.
[26] V. Vovk, I. Nouretdinov, and A. Gammerman. On-line predictive linear regression. The Annals of Statistics, 37(3):1566-1590, 2009.
[27] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609-616. Citeseer, 2001.
[28] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694-699, 2002.