# conformal_prediction_under_ambiguous_ground_truth__a2a88efd.pdf

Published in Transactions on Machine Learning Research (10/2023)

Conformal prediction under ambiguous ground truth

David Stutz1, Abhijit Guha Roy2, Tatiana Matejovicova1, Patricia Strachan2, Ali Taylan Cemgil1, Arnaud Doucet1

1Google Deep Mind, 2Google {dstutz, agroy, tatianama, trishs, taylancemgil, arnauddoucet}@google.com

Reviewed on Open Review: https://openreview.net/forum?id=CAd6V2q Xxc

Conformal Prediction (CP) allows to perform rigorous uncertainty quantification by constructing a prediction set C(X) satisfying P(Y C(X)) 1 α for a user-chosen α [0, 1] by relying on calibration data (X1, Y1), ..., (Xn, Yn) from P = PX PY |X. It is typically implicitly assumed that PY |X is the true posterior label distribution. However, in many real-world scenarios, the labels Y1, ..., Yn are obtained by aggregating expert opinions using a voting procedure, resulting in a one-hot distribution PY |X vote . This is the case for most datasets, even well-known ones like Image Net. For such voted labels, CP guarantees are thus w.r.t. Pvote = PX PY |X vote rather than the true distribution P. In cases with unambiguous ground truth labels, the distinction between Pvote and P is irrelevant. However, when experts do not agree because of ambiguous labels, approximating PY |X with a one-hot distribution PY |X vote ignores this uncertainty. In this paper, we propose to leverage expert opinions to approximate PY |X using a non-degenerate distribution PY |X agg . We then develop Monte Carlo CP procedures which provide guarantees w.r.t. Pagg = PX PY |X agg by sampling multiple synthetic pseudo-labels from PY |X agg for each calibration example X1, ..., Xn. In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. Pvote under-covers expert annotations: calibrated for 72% coverage, it falls short by on average 10%; our Monte Carlo CP closes this gap both empirically and theoretically. We also extend Monte Carlo CP to multi-label classification and CP with calibration examples enriched through data augmentation.

1 Introduction

Many application domains, especially safety-critical applications such as medical diagnostics, require reasonable uncertainty estimates for decision making and benefit from statistical performance guarantees. Conformal prediction (CP) is a statistical framework allowing to quantify uncertainty rigorously by providing finite-sample, non-asymptotic performance guarantees. First introduced by Vovk et al. (2005), it has become very popular in machine learning as it is widely applicable while making no dsitributional or model assumptions, e.g., see (Romano et al., 2019; Sadinle et al., 2019; Romano et al., 2020; Angelopoulos et al., 2021; Stutz et al., 2021; Fisch et al., 2022) and Angelopoulos & Bates (2021) for more references.

Specifically, we consider a classification task with K classes and denote by [K] the set {1, . . . , K}. Then, a classifier π : X K outputs the class probabilities where K is the K-simplex. Based only on a held-out set of n calibration examples (Xi, Yi) P, CP allows us to return a prediction set C(X) [K] for a given test point (X, Y ) (with Y being unobserved) dependent on the calibration data such that

P(Y C(X)) 1 α, (1)

whatever being P and π as long as the joint distribution of ((X1, Y1), . . . , (Xn, Yn), (X, Y )) is exchangeable. Here α [0, 1] is a user-specified parameter and the probability in Equation (1) is not only over (X, Y ) but

Published in Transactions on Machine Learning Research (10/2023)

Figure 1: Two ambiguous examples from CIFAR10-H (Peterson et al., 2019) with the corresponding annotations of p = 50 experts. These annotations can be used to build Pagg(Y = y|X = x, Y 1, ..., Y p), see text for details.

also over the calibration set. This is called the coverage guarantee of CP. The size of such prediction sets |C(X)|, also called inefficiency, is a good indicator of the uncertainty for X.

It is commonly taken for granted that P = PX PY |X where PY |X is the true posterior label distribution so that the l.h.s. of Equation (1) is the probability for the ground truth label Y being an element of the prediction set C(X). However, in many practical applications, the calibration labels (Yi)i [n] are obtained based on (multiple) expert opinions. In medical applications, for example, it is usually impossible to identify the patient s actual condition as this would require invasive and expensive tests. Instead, labels are derived from expert annotations, e.g., doctor ratings (Liu et al., 2020). Beyond medical diagnosis, however, this is generally the case for many popular datasets such as CIFAR10 (Krizhevsky, 2009; Peterson et al., 2019), COCO (Lin et al., 2014), Image Net (Russakovsky et al., 2015) or many datasets from the GLUE benchmark (Wang et al., 2019) such as Multi NLI (Williams et al., 2018) (see Appendix A). Based on the expert annotations, the majority voted label is then typically selected as the ground truth label. Formally, this means that we consider Yi Pvote( |X = xi) where Pvote(Y = y|X = x) is a one-hot distribution. Thus, when using CP on these voted labels, we obtain guarantees of the form Pvote(Y C(X)) 1 α for Pvote = PX PY |X vote rather than a guarantee w.r.t. to the true distribution, P(Y C(X)) 1 α. The difference between Pvote and P is small for some tasks, including popular benchmarks such as CIFAR10, because most examples are fairly unambiguous, meaning that annotators do rarely disagree because P(Y = y|X = x) is one-hot anyway and a simple aggregation strategy can unambiguously determine the true class. However, in complex tasks such as the dermatology problem addressed in this work, there are many examples where experts disagree. More importantly, this disagreement is often irresolvable (Schaekermann et al., 2016; Gordon et al., 2022; Uma et al., 2022), meaning that obtaining more annotations will likely not reduce disagreement. In such cases, Pvote(Y = y|X = x) differs significantly from the true conditional distribution P(Y = y|X = x).

In such scenarios, performing CP using voted labels can severely underestimate the true uncertainty (see Section 2.2 for an intuitive example). Instead of summarizing the expert annotations by a single one-hot distribution, we propose here to rely on an approximation of Pagg(Y = y|X = x) to P(Y = y|X = x) to better account for this uncertainty. As a simple example, assume that for each calibration data X, expert q [p] annotates a single class Y q [K] as done on CIFAR10-H (Peterson et al., 2019), see Figure 1 for an illustration. Then, let pk = Pp q=1 I(Y q = k) be the number of experts selecting class k, a simple voting procedure sets Pvote(Y = y|X = x, Y 1, ..., Y p) = I(y = ˆY ) where ˆY = arg maxk [K] pk (disregarding ties for simplicity) and unconditionally we also typically will have Pvote(Y = y|X = x) a one-hot distribution in unambiguous cases (i.e. the same label is selected for all expert annotations). Obviously, if there is sufficient disagreement among annotators, taking ˆY will ignore many of the (usually expensive) annotations. As a result, performing CP on the voted labels ignores this uncertainty. Instead, we argue that selecting an aggregated distribution, e.g., Pagg(Y = y|X = x, Y 1, ..., Y p) = 1

k [K] pk I(y = k), allows us to better capture uncertainty present in the annotations. Ideally, by integrating over (Y 1, ..., Y p), the resulting Pagg is a good approximation of the true distribution P (and we discuss common ways of constructing Pagg in Section 3.1).

Published in Transactions on Machine Learning Research (10/2023)

Contributions: In this paper, we propose CP procedures that allow us to construct a prediction set C(X) satisfying Pagg(Y C(X)) 1 α for Pagg = PX PY |X agg as long as one can sample from PY |X agg 1. Note that while it would be desirable to have instead guarantees w.r.t. the distribution P, this is an impossible task as we never observe any data with labels sampled from PY |X. Whether PY |X agg is a good approximation to PY |X will be application dependent. Instead, the prediction set C(X) outputted by our procedures is the one we could compute if the experts had access to X and their opinions were aggregated through PY |X agg . This is the best one can hope for. We emphasize that we are not making more model assumptions than is currently made by applying CP to voted labels from PY |X vote . In contrast, we make the usually implicit assumptions on label collection explicit. To perform calibration with PY |X agg , we propose a sampling-based approach, coined Monte Carlo CP, which proceeds by sampling multiple pseudo ground truth labels from PY |X agg for each calibration example X1, . . . , Xn. We show how this approach can provide rigorous coverage guarantees despite not having access to exchangeable calibration examples. We present experiments on a skin condition classification case study: Here, calibration with voted labels from PY |X vote is shown to undercover expert annotated labels by a staggering 10%. Our approach closes this gap both theoretically and empirically. Moreover, we discuss extensions to multi-label classification and calibration with augmented calibration examples for robust CP.

Outline: The rest of this paper is structured as follows: In Section 2, we illustrate the problem on a toy dataset and introduce required background on CP. Then, Section 3 introduces our original Monte Carlo CP procedures, highlights the applicability of this approach to related problems and discusses related work. In Section 4, we present an application to skin condition classification following (Liu et al., 2020), multi-label classification and data augmentation before concluding in Section 5.

2 Background and Motivating Example

2.1 Conformal prediction

In the following, we briefly review standard (split) CP (Vovk et al., 2005; Papadopoulos et al., 2002). To this end, we assume a classifier πy(x) P(Y = y|X = x) approximating the posterior label probabilities is available. This model will be typically based on learned parameters using a training set. Then, given a set of calibration examples (Xi, Yi)i [n] from P, we want to construct a prediction set C(X) [K] for the test point (X, Y ) such that the coverage guarantee from Equation (1) holds. As mentioned earlier, this requires the calibration examples and the test example to be exchangeable but does not make any further assumption on the data distribution or on the underlying model π.

A popular conformal predictor proceeds as follows: given a real-valued conformity score E(X, k) based on the model predictions π(x) K, we define

C(X) = {k [K] : E(X, k) τ} (2)

where τ is the α(n + 1) smallest element of {E(Xi, Yi)}i [n], equivalently τ is obtained by computing the α(n + 1) /n quantile of the distribution of the conformity scores of the calibration examples

τ = Q {E(Xi, Yi)}i [n]; α(n + 1)

Here, Q( ; q) denotes the q-quantile. By construction of the quantile, see e.g. (Romano et al., 2019; Vovk et al., 2005; Angelopoulos & Bates, 2021), this ensures that the lower bound on coverage in Equation (1) is satisfied. Additionally, if the conformity scores are almost surely distinct, then we have the following upper bound P(Y C(X)) 1 α + 1 n+1. The conformity score is a design choice. A standard choice, which we will use throughout the paper, is E(x, k) = πk(x) (Sadinle et al., 2019) but many alternative scores have been proposed in the literature (Romano et al., 2020; Angelopoulos et al., 2021).

1As explained further, we cannot obtain i.i.d. samples from PY |X agg but only exchangeable ones. This prevents the use of concentration inequalities to obtain bounds on Pagg(Y C(X)|X = x) for a given C(X).

Published in Transactions on Machine Learning Research (10/2023)

0.0 0.2 0.4 0.6 0.8 1.0 1.2 x0

MLP class. boundary

Examples with their true labels

0 2 Class k

E(x, k) = k(x)

0: Conf. scores

0 2 Class k

p-value for class k

0: p-values

0 2 Class k

0: Posteriors

0 2 Class k

E(x, k) = k(x)

1: Conf. scores

0 2 Class k

p-value for class k

1: p-values

0 2 Class k

1: Posteriors

Figure 2: Illustration of CP on a toy dataset detailed in Section 2.2 for two examples X, indexed 0 and 1. Here, the conformity scores {E(X, k)}k [K] are taken to be the estimated posterior probabilities πk(X) of a multilayer preceptron (MLP) whose decision boundaries are shown. To construct the prediction sets C(X), these scores are thresholded using the threshold τ, cf. Equations (2) and (3). Equivalently, we can use the p-value associated formulation and threshold (ρy)y [K] at confidence level α, cf Equations (4) and (5). Calibrating against true labels, both approaches obtain coverage 1 α (w.r.t. the true labels).

An alternative view on CP can be obtained through a p-value formulation; see e.g. (Sadinle et al., 2019). The conformity scores of the calibration examples and test example (X, Y ) are exchangeable so if the distribution of E(X, Y ) is continuous then

ρY = |{i [n] : E(Xi, Yi) E(X, Y )}| + 1

n + 1 = Pn i=1 I[E(Xi, Yi) E(X, Y )] + 1

is uniformly distributed over {1/(n + 1), 2/(n + 1)..., 1} and thus ρY is a p-value in the sense it satisfies P(ρY α) α, equivalently P(ρY > α) 1 α. If the distribution is not continuous, then it can be checked that P(ρY α) α still holds (see e.g. (Bates et al., 2023)). It follows directly that

C(X) := {y [K] : ρy > α} (5)

satisfies P(Y C(X)) 1 α and the prediction set obtained this way is identical to the one obtained using Equations (2) and (3). For completeness, the equivalence between both formulations is detailed in Section B.1 of Appendix B. In contrast to calibrating the threshold τ, the p-value formulation requires computing (ρk)k [K] for each test example X and is thus computationally more expensive. The p-value formulation will be useful in our context as p-values can easily be combined to obtain a new p-value; see e.g. (Vovk & Wang, 2020). We illustrate an application of CP to a 3-class classification problem in Figure 2.

2.2 Motivating Example

We now consider a toy dataset to illustrate the impact a voting strategy can have on prediction sets produced by CP. Consider the true distribution P = PX PY |X = PY PX|Y be defined as follows: PX|Y admits a density p(x|y) = N(x; µy, diag(σ2 y)) with µy Rd and diag(σ2 y) Rd d being mean and variance. Further, one has P(Y = y) = wy with PK y=1 wy = 1. To generate examples (X, Y ) P, we first sample Y from P(Y = y) = wy and then X from p(x|y) so we have ground truth labels for each example. Furthermore, by Bayes rule we have access to the true posterior probability mass function P(Y = y|X = x). If the Gaussians are well-separated, P(Y = y|X = x) will be crisp, i.e., close to one-hot with low entropy. However, encouraging significant overlap between these Gaussians, e.g., by moving the means (µk)k [K] close together, will result in highly ambiguous P(Y = y|X = x). We sample examples (Xi, Yi)i [n] as outlined above and indicate classes by color in Figure 3 (top left). These synthetic data were previously used in the example displayed in Figure 2 to illustrate CP. The true posterior label probabilities P(Yi = y|X = Xi) are also displayed, cf. Figure 3 (top middle). As can be seen, these distributions can be very ambiguous in between all three classes. Let us assume a large set of annotators that allow us by majority vote to recover ˆYi = arg maxy [K] P(Y = y|X = Xi), i.e., the voted label, as shown on Figure 3 (top right). This defines the one-hot distribution Pvote(Y = y|X = Xi) = I(y = ˆYi). Clearly, these voted labels ignore the fact that PY |X

can have high entropy.

Contrary to Figure 2, we now perform standard split CP using calibration data relying on the voted labels (Xi, ˆYi)i [n] from Pvote = PX PY |X vote rather than (Xi, Yi)i [n] from P = PX PY |X, using a conformity

Published in Transactions on Machine Learning Research (10/2023)

Smooth, true and voted labels on toy dataset

0.2 0.4 0.6 0.8 1.0 x0

Examples with their smooth labels

0.2 0.4 0.6 0.8 1.0 x0

Examples with their true labels

Class 0 Class 1 Class 2

0.2 0.4 0.6 0.8 1.0 x0

Examples with their voted labels

Class 0 Class 1 Class 2

Calibration with voted labels

0.86 0.88 0.90 0.92 0.94 0.96 Empirical coverage

Calibration with voted labels

Target Coverage of true labels Coverage of voted labels

Coverage w.r.t. true labels

Coverage w.r.t. voted labels coverage gap

Figure 3: Illustration of an ambiguous problem with K = 3 classes, in two dimensions, using our toy dataset. Top: We show examples colored by their true posteriors P(Y = y|X = x) (left), true labels (middle) and voted labels, i.e., ˆY = arg maxy [K] P(Y = y|X = x) (right). Note the high ambiguity between the classes, which is clearly ignored in the voted labels. Bottom: Empirical coverage over random calibration/test splits, i.e. the proportion of true or voted labels included in the constructed prediction sets C(x), when calibrating against the voted labels. This produces prediction sets that significantly undercover w.r.t. true labels. Figure 4 shows how our Monte Carlo conformal prediction, as introduced in Section 3, overcomes this gap.

score E(x, k) = πk(x) where πk(x) is given by a multilayer perceptron. We randomly split the examples in two halves for calibration and testing. In Figure 3 (bottom), we plot the empirical coverage, i.e., the fraction of test examples for which (a) the true label (blue) or (b) the voted label (green) is included in the predicted prediction set. In this case, CP guarantees the latter by design, (b), to be 95% (on average across calibration/test splits). Strikingly, however, coverage against the (usually unknown) true labels is significantly worse. Of course, this gap depends on the ambiguity of the problem.

3 Monte Carlo conformal prediction

As discussed in the introduction and illustrated in Section 2.2, we want to avoid using a one-hot distribution PY |X vote when labeling ambiguous examples. Instead we first explain here how, based on expert opinions, we can obtain non-degenerate estimates PY |X agg of PY |X. We then show how PY |X agg can be leveraged to provide novel CP procedures for constructing prediction sets that satisfy coverage guarantees w.r.t. Pagg = PX PY |X agg . Given that we never observe labels from the true distribution PY |X, providing guarantees w.r.t. Pagg is the best we can hope to do. For a test example X, this can be understood as guaranteeing coverage against labels that expert annotators would assign if they had access to X and their annotations were aggregating using PY |X agg .

3.1 From expert annotations to PY |X agg

From now on, we assume that we have calibration data (Xi, Bi) P for i [n] and a test data X, B P with B unobserved; the joint distribution of (Xi, Bi)i [n] and (X, B) being exchangeable. Here Bi corresponds to a set of expert annotations, the space of annotations being dependent on the application. For example, on CIFAR-10H (Peterson et al., 2019), each expert provides a single label in [K]. In contrast, in our dermatology application, using data from (Liu et al., 2020), Bi represents a set of partial rankings whose cardinality depends on i. This corresponds to differential diagnoses from dermatologists. However, while

Published in Transactions on Machine Learning Research (10/2023)

the format of the Bi s can vary, we are always interested in returning prediction sets of classes C(X) [K] satisfying a coverage guarantee w.r.t. Pagg where Pagg needs to be specified based on the expert annotations Bi2

We give two simple examples to illustrate how PY |X agg can be obtained and then present a more generic framework. However, the aim of this section is not to provide the best way to aggregate expert opinions. This is application dependent and there is a substantial literature on the topic. Instead, we focus here on showing how such techniques can be exploited in a CP framework:

Single labels: We first revisit and extend the construction presented in Section 1 where B = (Y 1, ..., Y p) with Y q [K] corresponding to the single label the expert q assigns to X. We then consider Pagg(Y = y|X, B) = 1 p P

k [K] pk I(y = k) for pk = Pp q=1 I(Y q = k)3. Note that this approximation is deterministic given (X, B). However, we could also use a bootstrap procedure by resampling the entries of B with replacement.

Partial rankings: In medical applications, it is common for expert annotations to be given by differential diagnoses, corresponding to partial rankings. That is, out of the K possible conditions, each expert returns a partial ranking of conditions because multiple conditions are deemed plausible while a large majority of the conditions can be excluded. While probabilistic models for aggregating such rankings exist (Hajek et al., 2014; Zhu et al., 2023), we follow (Liu et al., 2020) and describe a simple deterministic procedure called inverse rank normalization which we will also exploit in our experiments, see Section 4.1. Let B = (B1, ..., Bp) be the collection of available partial rankings for data X. Each partial ranking Bq is divided into nq blocks, Bq = (Bq 1, ..., Bq nq), with Bq i [K] and Bq i T Bq j = and Bq nq the collection of excluded conditions. The conditions in Bq i are assessed as being more plausible than the conditions in Bq j for i < j. We then define

1 i|Bq i |I(y Bq i ), Pagg(Y = y|X, B) = αy PK k=1 αk . (6)

As above, this approximation is deterministic given (X, B). Again, we could also use a bootstrapping procedure by resampling the entries of B with replacement.

More generally, we assume the distribution of (X, B) admits a density p(x, b). Then, we will denote by λ = (λ1, ..., λK) the probabilities (Pagg(Y = 1|X, B), ..., Pagg(Y = K|X, B)) obtained by some deterministic or stochastic annotations aggregation procedures. We refer to λ as plausibilities since they quantify how plausible the different classes are to be the actual ground truth label given the annotations B. Then, the resulting marginal density of (X, λ) is given by

p(x, λ) = Z p(x, λ, b)db = Z p(λ|b, x)p(x, b)db. (7)

In the examples above, note that p(λ|b, x) = p(λ|b). Then, our aggregation model for label Y can be written as

Pagg(Y = y|X = x) = Z p(y|λ)p(λ|x)dλ = Z Z λyp(λ|b, x)p(b|x)dλdb. (8)

In words, p(b|x) corresponds to the annotation process, i.e., how experts draw their annotations when observing data x, and p(λ|b, x) models aggregation of annotations into plausibilities. Given data (X, B) p(x, b), then B is distributed according to p(b|X) given X. So by sampling λ p(λ|B, X) and then Y such that P(Y = y|λ) = λy, we obtain a sample from Pagg(Y = y|X). Note however that we cannot obtain i.i.d.

2Based on this setup, we could in principle develop a CP method returning prediction sets for B, e.g., prediction sets of rankings, satisfying a coverage guarantee w.r.t P. However, we are interested here in prediction sets for labels.

3Pagg(Y = y|X, B) obviously converges towards P(Y = y|X) as p if Y q i.i.d. Pagg( |X), i.e., this assumes that expert annotations follow the true distribution P.

Published in Transactions on Machine Learning Research (10/2023)

Algorithm 1 Monte Carlo CP with 1 α coverage guarantee for m = 1 and 1 2α for m 2. Input: Calibration examples (Xi, λi)i [n]; test example X; confidence level α; number of samples m Output: Prediction set C(X) for test example

1. Sample m labels (Y j i )j [m] per calibration example (Xi)i [n] where P(Y j i = k) = λik.

2. Calibrate the threshold τ using

τ = Q {E(Xi, Y j i )}i [n],j [m]; αm(n + 1) m + 1

3. Return C(X) = {k [K] : E(X, k) τ}.

0.94 0.95 0.96 0.97 0.98 0.99 Empirical coverage

Monte Carlo conformal calibration

Target Coverage of true labels Coverage of voted labels

Figure 4: Monte Carlo CP with m = 10 sampled labels per calibration example (see Algorithm 1) applied to the toy dataset from Figure 3. We use the true posterior probabilities P(Y = y|X = x) as plausibilities λy to sample labels (see Figure 3, left). Again, we show coverage wrt. true or voted labels. In contrast to standard CP calibrated against voted labels, Monte Carlo CP clearly obtains the target coverage (black) of 1 α = 95% (on average) w.r.t. the true labels.

samples from Pagg(Y = y|X) as we only have access to one sample from B given X. We can only obtain exchangeable samples by repeating sampling from λ p(λ|B, X) and P(Y = y|λ) = λy. Hence we cannot use concentration inequalities to obtain finite sample bounds for quantities such as Pagg(Y C(X)|X = x) for a given C(X).

We will instead develop CP procedures returning prediction sets satisfying Pagg(Y C(X)) 1 α, it is worth clarifying what this means when Pagg(Y |X) is ambiguous, i.e., not one-hot. With Pagg = PX PY |X agg , we can rewrite the coverage guarantee as

Pagg(Y C(X)) = EX PX EY Pagg( |X)[I[Y C(X)]] . (9)

It makes explicit that, coverage is marginal across examples and classes: in ambiguous cases, the prediction set C(X) might cover only part of the classes with non-zero plausibility. We will refer to Pagg(Y C(X)) defined in Equation (9) as aggregated coverage to emphasize that this is w.r.t. Pagg. This is to contrast from voted coverage Pvote(Y C(X)) which is w.r.t. Pvote. To obtain prediction sets with aggregated coverage guarantees, we will assume access to a set of exchangeable calibration data (Xi, λi)i [n] of examples and corresponding plausibilities from p(x, λ). To obtain this calibration data, we can simply rely on the original (exchangeable) calibration data (Xi, Bi)i [n] and then sample λi p( |Xi, Bi) where λi = (λi1, ..., λi K). In the examples given above, λi is given deterministically given (Xi, Bi) so p(λ|x, b) is a delta-Dirac mass but, if a bootstrapping procedure is used to account for uncertainty in the annotation process, then it is not anymore.

3.2 Introduction to Monte Carlo conformal prediction

We propose Monte Carlo CP, a sampling-based approach to CP under ambiguous ground truth. Given the calibration examples (Xi, λi)i [n], we sample m labels Y j i for each Xi according to plausibilities λi, i.e. P(Y j i = k) = λik and duplicate the corresponding inputs4. That is, we obtain m n new calibration examples (Xi, Y j i )i [n],j [m] and then apply the conformal calibration outlined in Algorithm 1. As shown in Figure 4, Monte Carlo CP overcomes the coverage gap illustrated on our toy dataset in Figure 3.

4An alternative valid procedure would sample m plausabilities λj i p( |Xi, Bi) for each Xi and then Y j i such that P(Y j i = k) = λj ik. When p( |Xi, Bi) is a delta-mass distribution for all i, this coincides with the procedure described previously.

Published in Transactions on Machine Learning Research (10/2023)

0.946 0.948 0.950 0.952 0.954 0.956 Empirical coverage

Variation in aggregated coverage from sampled labels

Target Aggregated coverage

0.90 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 Empirical coverage

Aggregated coverage for 10% calibration data

Target m = 1 sampled labels

m = 5 sampled labels

m = 10 sampled labels

0.1 0.2 0.3 0.4 0.5 Fraction of calibration data

Std in empirical coverage

Standard deviation in aggregated coverage

m = 1 sampled labels

m = 5 sampled labels

m = 10 sampled labels

Figure 5: Left: Empirical aggregated coverage for Monte Carlo CP, m = 1, with a fixed calibration/test split but different randomly sampled labels Y j i , cf. Algorithm 1. The approach obtains coverage 1 α across calibration/test splits but overestimates coverage slightly for this particular split. However, in Monte Carlo CP, coverage is marginal across not only test and calibration examples but also the sampled labels during calibration, as shown in this histogram. Middle and right: Instead fixing the sampled labels for different m and plotting variation in coverage across random calibration/test split shows that aggregated coverage is empirically close to 1 α even for m > 1. Large m generally reduces the variability in coverage observed across calibration/test splits, especially for small calibration sets (we consider 5% to 50% calibration data).

The aggregated coverage guarantees we will provide for Algorithm 1 are marginal over the sampled labels (Y j i )i [n],j [m]. This is illustrated in Figure 5 (left) on the toy dataset of Section 2.2 using a fixed calibration/test split but multiple samples of (Y j i )i [n],j [m]. The empirical aggregated coverage across calibration/test splits is 1 α (see Figure 5 (left)) but the variability across such splits is high if the entropy of PY |X agg is high for many calibration data. This variability can be reduced by increasing m (see Figure 5, right), this is especially beneficial in the low calibration data regime.

We now discuss the theoretical coverage properties of Algorithm 1. Recall that we have assumed in Section 3.1 that the joint distribution of (Xi, Bi)i [n] and a test data (X, B) is exchangeable. This implies that the joint distribution of (Xi, Y j i )i [n] and (X, Y ) is exchangeable for any j [m]. For m = 1, it is clear that this approach boils down to standard split CP applied to calibration data (Xi, Y 1 i ) Pagg and thus Pagg(Y C(X)) 1 α follows directly. For m 2, the calibration examples include m repetitions of each Xi and exchangeability with the test example X, typically used to establish the validity of CP, is not satisfied anymore. Nevertheless, as mentioned earlier, we empirically observe an empirical coverage close to 1 α on average in Figure 5 (middle).

In the following, Section 3.3, we focus on establishing rigorous coverage guarantees for Monte Carlo CP when m 2, showing that we can establish Pagg(Y C(X)) 1 2α. This is akin to the observation in (Barber et al., 2021) that jackknife+, cross-conformal or out-of-bag CP consistently obtain empirical coverage 1 α while they only guarantee 1 2α. In applications where rigorous coverage guarantees better than 1 2α are necessary, we show in Section 3.4 that this can be achieved by developing an alternative Monte Carlo CP procedure presented in Algorithm 2 which relies on an additional split of the calibration examples.

3.3 Coverage 1 2α by averaging p-values

In order to establish the coverage guarantee Pagg(Y C(X)) 1 2α when m 2 for Algorithm 1, let us introduce for j [m]:

ρj Y = Pn i=1 I[E(Xi, Y j i ) E(X, Y )] + 1 n + 1 . (11)

The random variables ρj Y are p-values, i.e. P(ρj Y α) α, since the scores {E(Xi, Y j i )}i [n] and {E(X, Y )} are exchangeable for fixed j [m]. We can now average these quantities over j [m] to obtain

Pm j=1 Pn i=1 I[E(Xi, Y j i ) E(X, Y )] + 1

m(n + 1) . (12)

Published in Transactions on Machine Learning Research (10/2023)

x E(x, k) πk(x)

r1 i = 1 + Pn i=1 I[E(xi, y1 i ) E(x, k)]

r2 i = 1 + Pn i=1 I[E(xi, y2 i ) E(x, k)]

rm i = 1 + Pn i=1 I[E(xi, ym i ) E(x, k)]

ρ1 i = r1 i/n+1

ρ2 i = r2 i/n+1

ρm i = rm i /n+1

Figure 6: The m label samples result in (ρj k)j [m] for the test example X, (ρj Y )j [m] being p-values. We can average ρj k over j and threshold the resulting average ρk at level α to obtain the prediction set, this is equivalent to Algorithm 1 as explained in Section 3.3. Alternatively we can combine the p-values ρj Y using the ECDF approach described in Section 3.4; see Algorithm 2 for a detailed computational description.

As ρY is an average of (dependent) p-values, it follows from standard results (Rüschendorf, 1982; Meng, 1994) that P(2 ρY 2α) 2α, equivalently P( ρY > α) 1 2α. Hence we have P(Y C(X)) 1 2α for C(X) = {y [K] : ρy > α}. Nevertheless, as discussed previously, we obtain empirical coverage close to 1 α, cf. Figure 5 (middle), see also the discussion in (Barber et al., 2021, Section 4). This approach is illustrated in Figure 6. Note that in the context of aggregating m different models, Linusson et al. (2017) also considered similar types of averaging.

Practically, we do not have to compute and actually average ρj y for j [m] and y [K] to determine C(X) = {y [K] : ρy > α}. As in Section 2.1, we can reformulate this prediction set as C(X) = {k [K] : E(X, k) τ} as in Equation (2) where τ is the αm(n+1) m+1 smallest element of {E(Xi, Y j i )}i [n],j [m], equivalently τ is obtained by computing

τ = Q {E(Xi, Y j i )}i [n],j [m]; αm(n + 1) m + 1

and confidence sets are constructed as C(X) = {k [K] : E(X, k) τ} for τ given in Equation (10). This is established in Section B.2 of Appendix B.

3.4 Beyond coverage 1 2α

The 1 2α coverage guarantee from Monte Carlo CP arises from the fact that we use a standard result about average of p-values (Rüschendorf, 1982; Meng, 1994). When the p-values are independent, a few techniques have been proposed to obtain a 1 α coverage; see e.g. (Cinar & Viechtbauer, 2022) for a comprehensive review. However, in our case, the p-values (ρj Y )j [m] we want to combine are strongly dependent as they use the same calibration examples Xi and then rely on different pseudo-labels Y j i from the same distribution (given by λi). In this setting, many standard methods yield overly conservative results.

We follow here a method that directly estimates the cumulative distribution function (CDF) of the combined, e.g., averaged p-values (Balasubramanian et al., 2015; Toccaceli & Gammerman, 2019; Toccaceli, 2019). Let ρY = 1/m Pm j=1 ρj Y be the averaged p-values. As shown in Figure 7 (left), these averaged p-values will not be uniformly distributed. However, if F denotes the CDF of ρY , then ρY = F( ρY ) is a p-value5 and thus the prediction set C(X) = {y [K] : ρy > α} will obtain coverage 1 α. The true CDF is unknown but we can split the original calibration examples into X1, . . . , Xl and Xl+1, . . . , Xn and used the second split to obtained an empirical estimate F of F. The procedure is described in Algorithm 2.

As F is not the true CDF F but an empirical CDF (ECDF) estimate, C(X) = {y [K] : F( ρy) > α} would only provide an approximate coverage guarantee at level 1 α. However, if the original calibration examples Xl+1, . . . , Xn are i.i.d., then the Dvoretzky Kiefer Wolfowitz inequality (see e.g. (Wasserman,

5We have P(F( ρY ) f) = P(F(F 1(U) f) for U a uniform random variable on [0, 1] for F 1(f) = inf{ρ R : F(ρ) f}. However F(F 1(U)) U so P(F(F 1(U) f) P(U f) = f. Hence, we have P(ρY α) α.

Published in Transactions on Machine Learning Research (10/2023)

0.0 0.2 0.4 0.6 0.8 1.0 Value

Distribution of combined p-values

Averaged p-values ECDF corrected p-values

0.0 0.2 0.4 0.6 0.8 1.0 Target coverage 1

Empirical coverage

ECDF correction of p-values

Baseline ECDF ECDF 1 band

Figure 7: For illustration, we sample 10k p-values for m = 10 independent tests and duplicate them to obtain m = 20 dependent tests. Left: Histograms for averaged and ECDF-corrected p-values. The ECDF correction is able to ensure that p-values are distributed approximately uniformly. Right: Target confidence level α plotted against the empirical confidence level when calibrating with averaged p-values (blue) and the ECDF-corrected ones (green). The Dvoretzky Kiefer Wolfowitz inequality provides tight finite-sample guarantees on the ECDF correction, using here δ = 0.0001 = 0.01%.

0.94 0.95 0.96 Empirical coverage

Aggregated coverage of ECDF-based approach

Target Monte Carlo CP ECDF Monte Carlo CP Figure 8: On our toy dataset, we plot aggregated coverage obtained using Monte Carlo CP without and with ECDF correction for m = 10 and l = n/2 . As expected, Monte Carlo CP, even for m 2, does not result in reduced coverage. Thus, ECDF correction does not yield meaningfully different results besides exhibiting higher variation in coverage across calibration/test splits due to the additional split l.

2006)) provides rigorous finite sample guarantees for the ECDF, i.e.

P( sup f [0,1] | F(f) F(f)| > ϵ) 2 exp( 2(n l)ϵ2). (14)

Basic manipulation results in a confidence band for F

P( f [0, 1], F (f) F(f) F+(f)) 1 δ with F (f) = min

1 2(n l) log 2

This confidence band is illustrated shown in Figure 7 (right). We can now define the prediction set C(X) = {y [K] : ρy > α} for ρY = F+( ρY ) and show that it satisfies Pagg(Y C(X)) (1 α)(1 δ). Indeed, writing F+ F to abbreviate F+(f) F(f) f, we have P( F+( ρY ) > α) P( F+( ρY ) > α| F+ F)P( F+ F) P(F( ρY ) > α| F+ F)P( F+ F). Now we have from Equation (14) that P( F+ F) 1 δ and P(F( ρY ) > α| F+ F) = P(F( ρY ) > α) as the probability in Equation (14) is over (Xi, Yi)i {l+1,...,n} while ρY is only a function of (Xi, Y j i )i [l],j [m] and (X, Y ). Using the fact that P(F( ρY ) > α) 1 α, we obtain the desired coverage guarantee.

Figure 8 compares the empirical coverage obtained using standard CP with true labels to both approaches of Monte Carlo CP (cf. Algorithms 1 and 2) to verify that this approach indeed obtains coverage 1 α (1 α)(1 δ) on average for δ 1. As discussed, without the ECDF correction discussed here, this is only empirical as the approach discussed in the previous section can only guarantee 1 2α. We also did not observe a significant difference in prediction set size when using or not using the ECDF correction.

3.5 Summary

We have presented two Monte Carlo CP techniques, Algorithms 1 and 2, which rely on sampling multiple synthetic pseudo-labels from PY |X agg for each calibration data. In this setting, the coverage guarantees we

Published in Transactions on Machine Learning Research (10/2023)

Algorithm 2 ECDF Monte Carlo CP with (1 α)(1 δ) coverage guarantee. Input: Calibration examples (Xi, λi)i [n]; test example X; confidence levels α, δ; data split 1 l n 1; number of samples m Output: Prediction set C(X) for test example X

1. Sample m labels (Y j i )j [m] per calibration example (Xi)i [l] where P(Y j i = k) = λik.

2. Sample one label Yi per calibration example (Xi)i {l+1,...,n} where P(Yi = k) = λik.

3. Compute ( ρi)i {l+1,...,n} where

Pm j=1 Pl p=1 I[E(Xp, Y j p ) E(Xi, Yi)] + 1

m(l + 1) . (16)

4. Build the ECDF F(f) = 1 n l Pn i=l+1 I[ ρi f] and its upper bound F+(f) using Equation (15).

5. For test example X, compute for k [K]

Pm j=1 Pl p=1 I[E(Xp, Y j p ) E(X, k)] + 1

m(l + 1) , ρcorr k = F+( ρk). (17)

6. Return C(X) = {k [K] : ρcorr k > α}.

obtain have to be understood marginally across calibration examples and their sampled labels. We summarize the advantages and drawbacks of these methods in Table 1. In the following, we show that these developments are relevant beyond the setting of ambiguous ground truth and discuss related work.

Procedure Theoretical coverage

Avg. empirical coverage

Variability emp. coverage

Additional calibration split Alg. 1, m = 1 1 α 1 α High No Alg. 1, m 2 1 2α 1 α Low No Alg. 2 (1 α)(1 δ) (1 α)(1 δ) Low Yes

Table 1: Overview of the proposed Monte Carlo CP procedures in terms of the provided theoretical coverage guarantee, the observed empirical coverage (averaged across sampled pseudo labels), variability of the empirical coverage (w.r.t. sampled pseudo labels) and whether an additional split of the calibration data is required.

3.6 Applications to related problems

Multi-label classification: Let Yi [K] be the known, ground truth multi-label set for each calibration example Xi. For simplicity, we express the multi-label setting using plausibilities λi that divide probability mass equally across the Li = |Yi| 1 labels. We can then apply Monte Carlo CP to this setup. When m 1, we obtain calibration examples where the proportion of (Y j i )j [m] equal to a given class is approximately 1/Li. This is related to an empirical existing method to perform multi-label CP (Tsoumakas & Katakis, 2007; Wang et al., 2014; 2015) where for each calibration example (Xi, Yi) we perform CP using the calibration examples (Xi, Y 1 i ), . . . , (Xi, Y Li i ). Monte Carlo CP provides coverage guarantees on Pagg(Y C(X)) for

Pagg(Y = y|X = x) = X

Y λ(Y)y p(Y|x) (18)

with p(Y|x) being the conditional probability of the set Y given X = x and λ(Y)y = I(y Y)/|Y|.

Published in Transactions on Machine Learning Research (10/2023)

Data augmentation and robustness: Consider a scenario where we have exchangeable calibration data (Xi, Yi)i [n] for Yi [K]. We want to augment the set of calibration data by using data augmentation, i.e. for each X1 i := Xi we sample additional X2 i , ..., Xm i p( |Xi). For example, these could correspond to versions of Xi which are (adversarially or randomly) perturbed or corrupted, rotated, flipped, etc. As we usually train with data augmentation and frequently want to improve robustness against distribution shifts or specific types of perturbations, considering these augmentations for calibration is desirable. However, when using the augmented set m n of calibration data (Xj i , Yi)i [n],j [m], the joint distribution of calibration data and test data is not exchangeable anymore. However, we can still provide rigorous coverage guarantees using a procedure very similar to Monte Carlo CP. For each test data (X, Y ), we set X1 = X and sample augmentations X2, ..., Xm p( |X). We then proceed by averaging the following m p-values

ρj Y = Pn i=1 I[E(Xj i , Yi) E(Xj, Y )] + 1

n + 1 , ρY = 1

j=1 ρj Y . (19)

By following arguments similar to Section 3.3, the prediction set C(X1, X2, ..., Xn) = {y [K] : ρy > α} satisfies Paug(Y C(X1, X2, ..., Xn)) 1 2α where Paug is the joint distribution of the test data (X = X1, Y ) and the augmentations (X2, .., Xm). In this case, the coverage guarantee is marginal across (X1, ..., X2, Y ) and (Xj i , Yi)i [n],j [m].

3.7 Related work

CP (Vovk et al., 2005) has recently found numerous applications in machine learning, see e.g. (Romano et al., 2019; Sadinle et al., 2019; Romano et al., 2020; Angelopoulos et al., 2021; Stutz et al., 2021; Fisch et al., 2022). In this paper, we focus on split CP (Papadopoulos et al., 2002). However, there are also transductive and cross-validation/bagging-inspired variants being studied (Vovk et al., 2005; Vovk, 2015; Steinberger & Leeb, 2016; Barber et al., 2021; Linusson et al., 2020). Our work is related to these approaches in that many of them guarantee coverage 1 2α while empirically obtaining coverage close to 1 α. For example, cross-CP (Vovk, 2015) was recently shown to satisfy a 1 2α guarantee in (Vovk et al., 2018; Kim et al., 2020). Moreover, this guarantee is also based on combining p-values without making any assumption about their dependence structure.

As outlined before, our work is also related to CP for multi-label classification which faces similar challenges as CP for ambiguous ground truth (Wang et al., 2014; 2015; Lambrou & Papadopoulos, 2016; Papadopoulos, 2014; Cauchois et al., 2021). Finally, our work has similarities to work on adversarially robust CP (Gendler et al., 2022), especially in terms of our ideal coverage guarantee.

There is also a long history of work on combining dependent or independent p-values (Fisher, 1925; Rüschendorf, 1982; Meng, 1994; Heard & Rubin-Delanchy, 2017). Key work has been done in (Vovk et al., 2018), showing results without dependence assumption and thereby establishing guarantees for, e.g., cross-CP. Similar to us, (Balasubramanian et al., 2015; Linusson et al., 2017; Toccaceli & Gammerman, 2019; Toccaceli, 2019) use the ECDF to combine p-values but they do not provide rigorous coverage guarantees for this procedure.

4 Applications

4.1 Main case study: skin condition classification

In the main case study of this paper, we follow (Liu et al., 2020; Stutz et al., 2023) and consider a very ambiguous as well as safety-critical application in dermatology: skin condition classification from multiple images. We use the dataset of Liu et al. (2020) consisting of 1949 test examples and 419 classes with up to 6 color images resized to 448 448 pixels. The classes, i.e., conditions, were annotated by various dermatologists who provide partial rankings. These rankings are aggregated deterministically to obtain the plausibilities λ using the inverse rank normalization procedure of (Liu et al., 2020) described in Section 3.1. We followed (Roy et al., 2022; Stutz et al., 2023) to train a classifier that achieves 72.6% top-3 accuracy

Published in Transactions on Machine Learning Research (10/2023)

0 200 400 600 800 Sorted examples

Realized coverage

Realized voted and aggregated coverage for CP with voted labels

Aggregated coverage (0.62) Voted coverage (0.73) Argmax plausibility

Figure 9: Realized voted coverage, i.e., I[arg maxk λik C(Xi)] (green), and aggregated coverage, i.e., P y [K] λiy I[y C(Xi)] (blue), for standard CP using voted labels, i.e., arg maxk λik. Additionally, we plot the maximum plausibility per example (dashed black) as proxy of ambiguity. We sort examples on the x-axis according to each plot individually. Clearly, many cases are very ambiguous and aggregated coverage is underestimated severely (62% vs. the target of 73%).

0.60 0.65 0.70 0.75 Empirical coverage

CP with voted labels

Target Voted coverage Aggregated coverage

0.700 0.725 0.750 0.775 0.800 0.825 0.850 Empirical coverage

Monte Carlo CP

Target Voted coverage Aggregated coverage

0.70 0.75 0.80 0.85 Empirical coverage

ECDF Monte Carlo CP

Target Voted coverage Aggregated coverage

2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 Inefficiency

Average: 2.66

4.2 4.4 4.6 4.8 5.0 Inefficiency

Average: 4.57

3.5 4.0 4.5 5.0 5.5 Inefficiency

Average: 4.54

Figure 10: Comparison between CP applied to voted labels (left), Monte Carlo CP (middle) and ECDF Monte Carlo CP (right) in terms of voted coverage (blue) and aggregated coverage (green) across 100 random calibration/test splits (top) as well as inefficiency (bottom). We use m = 10. With voted labels does not reach the 73% aggregated target coverage (black). Monte Carlo CP overcomes this gap at the expense of higher inefficiency (bottom). Using ECDF-corrected p-values increases observed variation slightly due to the additional calibration data split (half of the original split).

against the voted label from the plausibilities6. We chose a coverage level of 1 α = 73% for our experiments (with results for α = 0.1 in the appendix) to stay comparable to the base model.

In Figure 9, we highlight how ambiguous the plausibilities for skin condition classification are: in dotted black, we plot the largest plausibility against (sorted) examples. As baseline, we performed CP using the classifier s softmax output as conformity scores and calibrating against the voted labels ˆy := arg maxk λik per calibration example (Xi, λi). In blue, we plot the realized coverage by evaluating I[ˆy C(Xi)] per example. This is a step function and roughly 27% of the examples on the x-axis are covered. In green, we plot the realized aggregated coverage by evaluating P

k [K] λik I[k C(Xi)] per example. For many examples, aggregated coverage lies in between (0, 1) showing that the obtained prediction sets only cover part of the plausibility mass. More importantly, aggregated coverage is 62% on average, i.e., significantly under-estimated by calibrating against voted labels. This is the core problem we intend to address.

While the above results consider a fixed calibration/test split, Figure 10 shows our overall results across 100 random splits. As suggested above, on the left, we show how standard CP applied to voted labels does not achieve (as expected) the target of 73% for the aggregated coverage Pagg(Y C(X)) (green). As highlighted in Figure 11, the prediction sets miss highly plausible conditions. In the middle and on the right, we demonstrate that Monte Carlo CP with m = 10 overcomes this problem and achieves (on average) the target aggregated coverage of 73%. Note that coverage w.r.t. the voted label (blue) increases alongside

6To be precise, in 72.6% of the cases, the voted label from the plausibilities is included in the top-3 prediction set derived from the predicted softmax output π(x) without any conformal calibration.

Published in Transactions on Machine Learning Research (10/2023)

Hema Mela Angi Pyog Skin Mela Atyp O/E Classes

CP with voted labels

Plausibilities Confidence set (0.41 agg. coverage)

Hema Mela Angi Pyog Skin Mela Atyp O/E Classes

Monte Carlo CP

Plausibilities Confidence set (0.79 agg. coverage)

Arte Calc Cell Pyod Veno Absc Acan Acan Classes

CP with voted labels

Plausibilities Confidence set (0.40 agg. coverage)

Arte Calc Cell Pyod Veno Absc Acan Acan Classes

Monte Carlo CP

Plausibilities Confidence set (0.60 agg. coverage)

Figure 11: Comparison of CP with voted labels and Monte Carlo CP on two concrete examples. Both are ambiguous cases as shown by the high-entropy plausibilities. Monte Carlo CP clearly covers more plausibility mass (i.e., yields higher realized aggregated coverage), potentially improving patient outcome. Appendix C includes more qualitative results.

0.87 0.88 0.89 0.90 0.91 0.92 0.93 Empirical coverage

Aggregated coverage for multi-label classification

5.8 6.0 6.2 6.4 6.6 Inefficiency

Inefficiency histogram

Average: 6.13 Inefficiency

0 1 2 3 4 5 6 7 8 9 Class

Confidence level Labels p-values

0 1 2 3 4 5 6 7 8 9 Class

Confidence level Labels p-values

Figure 12: The multi-label CP strategy of Wang et al. (2015; 2014); Tsoumakas & Katakis (2007) is a slight variant of our Monte Carlo CP approach. As shown on this example of up to two overlaid, differently colored digits, Monte Carlo CP achieves target coverage of 90% (top left). However, it is free to decide how many labels to cover per examples (top middle) and, due to the poor performance of the base model (58.8% aggregated coverage), inefficiency is rather high. On the bottom, we show two examples of our dataset with the corresponding ground truth label set Y (blue) and the obtained Monte Carlo p-values (green).

aggregated coverage (i.e., the gap between voted and aggregated coverage remains approximately the same). In terms of inefficiency, avoiding over-confidence by improving aggregated coverage leads to a significant increase in the average prediction set size from 2.66 to 4.57. However, Figure 11 highlights that this is necessary for the prediction sets to include relevant conditions.

4.2 Case study: multi-label classification

In Figure 12 we consider a simple MNIST-based multi-label classification problem with up to two, differently colored digits per image. We trained 10 multi-layer perceptrons with 100 hidden units for each digit to determine if the digit is present in the image. This simple classifier achieves 58.8% aggregated coverage when thresholding the 10 individual sigmoids at 0.5. As discussed in Section 3.6, a common strategy (Wang et al., 2015; 2014; Tsoumakas & Katakis, 2007) of performing multi-label CP consists in repeating each example according to its number of labels (here, at most 2). We can achieve something similar with rigorous aggregated coverage guarantee by uniformly sampling labels to perform Monte Carlo CP. We illustrate this procedure in Figure 12 (top left) where we show that it achieves the 90% coverage target. Furthermore, our discussion in Section 3.2 establishes the corresponding guarantee of 1 2α without ECDF correction. However, it is important to understand what aggregated coverage means for multi-label classification: CP decides how many of the labels it intends to cover per example in order to achieve the marginal coverage guarantee. This is highlighted in Figure 12 (bottom) showing the corresponding p-values for two examples. In the first example, only one out of two examples is covered by the prediction set.

Published in Transactions on Machine Learning Research (10/2023)

0.80 0.82 0.84 0.86 0.88 0.90 0.92 Empirical coverage

Standard CP

Target Clean Augmented

0.84 0.86 0.88 0.90 0.92 0.94 Empirical coverage

Monte Carlo CP m = 2

Target Clean Augmented

0.88 0.90 0.92 0.94 0.96 Empirical coverage

Monte Carlo CP m = 10

Target Clean Augmented

2.5 5.0 7.5 10.0 12.5 15.0 Inefficiency

Clean average: 3.41 Augmented average:4.32 Clean Augmented

0 10 20 30 40 50 60 Inefficiency

Clean average: 6.37 Augmented average:9.91 Clean Augmented

0 20 40 60 80 100 Inefficiency

Clean average: 9.51 Augmented average:17.35 Clean Augmented

Figure 13: Standard CP applied to original images (left) and Monte Carlo CP (middle and right) applied to the original plus augmented images from Image Net for m = 2 (middle) and m = 10 (left). We show coverage on original (blue) and augmented images (green, average across 25 Auto Augment augmentations per image). Monte Carlo CP is able to increase coverage on augmented images significantly at the expense of higher inefficiency (on both original and augmented images). Using more augmentations during calibration generally increases coverage on augmented images and reduces the observed variation across random calibration/test splits.

4.3 Case study: data augmentation and robustness

We also apply Monte Carlo CP in the data augmentation and robustness setting outlined Section 3.6. Specifically, we took a pre-trained Mobile Net V2 (Howard et al., 2017) achieving 71.3% (top-1) accuracy on the first 5k test examples of Image Net (Russakovsky et al., 2015) and additionally evaluated it on augmented images using Auto Augment (Cubuk et al., 2018). We generated 25 random augmentations per test example. The model achieves 60.2% accuracy on average, significantly lower than on the original images. Similarly, Figure 13 shows a significant gap in coverage when calibrating only on the original images. While coverage on original examples is around the target of 90%, depending on the random calibration/test split, coverage of augmented images is only slightly above 80%. With the Monte Carlo CP procedure described in Section 3.6, we can use as calibration data the original and augmented images. This procedure increases coverage on the augmented images significantly at the cost of higher inefficiency. As coverage is marginal and thus split across augmented and original images, the coverage increase is largest when using more augmented images during calibration. We interpret these results in two ways: First, there is no reason anymore to train stateof-the-art models with data augmentation but discard augmented images during calibration. Second, our Monte Carlo CP approach is effective in improving robustness against augmentations or other corruptions.

5 Discussion

In many classification tasks, the available ground truth labels arise from a voting process relying on several expert annotations resulting in a one-hot distribution PY |X vote . However, in scenarios where annotators tend to disagree because ground truth labels are ambiguous, this voting approach ignores the label uncertainty. Thus, performing CP with such voted labels can only guarantee coverage w.r.t. PY |X vote . This can have severe consequences, especially in safety-critical applications such as the dermatology case study in this paper. Instead, we use standard procedures from the literature to aggregate expert opinions and return a nondegenerate conditional probability distribution PY |X agg that can capture uncertainty about Y . In this paper, we proposed two Monte Carlo CP procedures which can output prediction sets satisfying some pre-specified coverage guarantees under the corresponding distribution Pagg = PX PY |X agg by sampling m exchangeable synthetic labels from PY |X agg for each calibration data. This allows to reduce the variability of the empirical coverage across realizations of the sampled labels. For a test example X, the prediction set outputted by such procedures can be thought as the one expert would assign if they had access to X and their annotations

Published in Transactions on Machine Learning Research (10/2023)

were aggregated using PY |X agg . In the scenario considered here where true labels are never observed, this is the best one can hope for. We have established rigorous coverage guarantees for these Monte Carlo CP procedures despite the fact that the joint distribution between the calibration data and the test data is not exchangeable. While the averaged empirical coverage provided by Algorithm 1 is 1 α, our theoretical result only guarantees 1 2α. In applications where rigorous tighter theoretical guarantees are required, we show how it is possible to modify this procedure to obtain (1 α)(1 δ) coverage at the cost of an additional calibration split, see Algorithm 2. We demonstrated the applicability of these approaches in the safety-critical and particularly ambiguous setting of skin condition classification. In this context, the use of voted labels leads to overconfident prediction sets which leads to severe under-coverage of critical conditions. Our Monte Carlo CP overcomes this gap empirically and theoretically. We also demonstrated how our methodology allows conformal calibration with augmented examples and provides, for the first time, a coverage guarantee for multi-label CP.

We also want to highlight the assumptions that underlie our work and some potential extensions. First, we assume access to calibration data of the form (Xi, Bi)i [n] for annotations Bi from which we obtain calibration data (Xi, λi)i [n] for plausibilities λi and then,by sampling, pseudo-labels (Xi, Y j i )i [n],j [m]. While assuming that the joint distribution of (Xi, Y 1 i )i [n] and test data (X, Y ) is exchangeable for Algorithm 1 and that these data are i.i.d. for Algorithm 2 is standard, this is not true in presence of distribution shift at test time. However, we believe it should be possible to adapt Monte Carlo CP to this setting using the techniques developed in (Tibshirani et al., 2019). We note finally that it is also possible to bypass having to sample pseudo-labels and perform conformal calibration directly on the plausibilities using calibration data (Xi, λi)i [n] but this leads to prediction intervals which are difficult to interpret and exploit. Second,

our method relies on an aggregation model PY |X agg . However, we emphasize that we are not making more assumptions than is currently made by applying CP to voted labels. On the contrary, we explicitly state the typically implied assumptions regarding label collection. Finally, it is important to keep in mind that, as any CP technique, the proposed Monte Carlo CP procedures only provide unconditional coverage guarantees of the form Pagg(Y C(X)) 1 α and not guarantees of the form Pagg(Y C(X)|X = x) 1 α.

Broader impact

Our work addresses the problem of conformal calibration in ambiguous settings where ground truth labels are not crisp as generally implicitly assumed in supervised machine learning. As discussed in our skin condition classification case study, ignoring this ambiguity can have severe consequences if used standard techniques relying on voted labels: key conditions (such as the cancerous Melanoma in Figure 11) may not be covered in the outputted prediction sets despite experts including them in their annotations. This can have very immediate negative consequences for patients as well as increase cost and strain on the healthcare system. Therefore, we generally view the impact of our work very positively, especially in the deployment of safetycritical applications such as in health. However, it is also important to look into how the methods proposed here affect different demographic groups.

Data availability

The de-identified dermatology data used in this paper is not publicly available due to restrictions in the data-sharing agreements.

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. ar Xiv.org, abs/2107.07511, 2021.

Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In Proc. of the International Conference on Learning Representations (ICLR), 2021.

Published in Transactions on Machine Learning Research (10/2023)

Vineeth Nallure Balasubramanian, Shayok Chakraborty, and Sethuraman Panchanathan. Conformal predictions for information fusion - A comparative study of p-value combination methods. Annals of Mathematics and Artificial Intelligence, 74(1-2):45 65, 2015.

Rina Foygel Barber, J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486 507, 2021.

Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, and Matteo Sesia. Testing for outliers with conformal p-values. The Annals of Statistics, 51(1), 2023.

Maxime Cauchois, Suyash Gupta, and John C. Duchi. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research (JMLR), 22:81:1 81:42, 2021.

Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. ar Xiv.org, abs/1708.00055, 2017.

Ozan Cinar and Wolfgang Viechtbauer. The poolr package for combining independent and dependent p values. Journal of Statistical Software, 101(1), 2022.

Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. ar Xiv.org, abs/1805.09501, 2018.

Adam Fisch, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. Conformal prediction sets with limited false positives. In Proc. of the International Conference on Machine Learning (ICML), 2022.

Ronald Aylmer Fisher. Statistical Methods for Research Workers. Edinburgh Oliver & Boyd, 1925.

Asaf Gendler, Tsui-Wei Weng, Luca Daniel, and Yaniv Romano. Adversarially robust conformal prediction. In Proc. of the International Conference on Learning Representations (ICLR), 2022.

Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeffrey T. Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. Jury learning: Integrating dissenting voices into machine learning models. In Proc. of the Conference on Human Factors in Computing Systems, 2022.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Bruce Hajek, Sewoong Oh, and Jiaming Xu. Minimax-optimal inference from partial rankings. Advances in Neural Information Processing Systems, 27, 2014.

Nicholas A. Heard and Patrick Rubin-Delanchy. Choosing between methods of combining p-values. Biometrika, 105:239 246, 2017.

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ar Xiv.org, abs/1704.04861, 2017.

Byol Kim, Chen Xu, and Rina Foygel Barber. Predictive inference is free with the jackknife+-after-bootstrap. In Advances in Neural Information Processing Systems (Neur IPS), 2020.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Antonis Lambrou and Harris Papadopoulos. Binary relevance multi-label conformal predictor. In Proc. of the Symposium on Conformal and Probabilistic Prediction and Applications (COPA), 2016.

Published in Transactions on Machine Learning Research (10/2023)

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In Proc. of the European Conference on Computer Vision (ECCV), 2014.

Henrik Linusson, Ulf Norinder, Henrik Boström, Ulf Johansson, and Tuve Löfström. On the calibration of aggregated conformal predictors. In Proc. of the Symposium on Conformal and Probabilistic Prediction and Applications (COPA), 2017.

Henrik Linusson, Ulf Johansson, and Henrik Boström. Efficient conformal predictor ensembles. Neurocomputing, 397:266 278, 2020.

Yuan Liu, Ayush Jain, Clara Eng, David H. Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, Vishakha Gupta, Nalini Singh, Vivek Natarajan, Rainer Hofmann-Wellenhof, Gregory S. Corrado, Lily H. Peng, Dale R. Webster, Dennis Ai, Susan Huang, Yun Liu, R. Carter Dunn, and David Coz. A deep learning system for differential diagnosis of skin diseases. Nature Medicine, 26:900 908, 2020.

Xiao-Li Meng. Posterior Predictive p-Values. The Annals of Statistics, 22(3):1142 1160, 1994.

Harris Papadopoulos. A cross-conformal predictor for multi-label classification. In Proc. of the Symposium on Conformal and Probabilistic Prediction and Applications (COPA), IFIP Advances in Information and Communication Technology, 2014.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alexander Gammerman. Inductive confidence machines for regression. In Proc. of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2002.

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2019.

Yaniv Romano, Evan Patterson, and Emmanuel J. Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems (Neur IPS), 2019.

Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems (Neur IPS), 2020.

Abhijit Guha Roy, Jie Ren, Shekoofeh Azizi, Aaron Loh, Vivek Natarajan, Basil Mustafa, Nick Pawlowski, Jan Freyberg, Yuan Liu, Zachary Beaver, Nam Vo, Peggy Bui, Samantha Winter, Patricia Mac Williams, Gregory S. Corrado, Umesh Telang, Yun Liu, A. Taylan Cemgil, Alan Karthikesalingam, Balaji Lakshminarayanan, and Jim Winkens. Does your dermatology classifier know what it doesn t know? detecting the long-tail of unseen conditions. Medical Image Analysis, 75:102274, 2022.

Ludger Rüschendorf. Random variables with maximum sums. Advances in Applied Probability, 14(3):623 632, 1982.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211 252, 2015.

Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019.

Mike Schaekermann, Edith Law, Alex C Williams, and William Callaghan. Resolvable vs. irresolvable ambiguity: A new hybrid framework for dealing with uncertain ground truth. In SIGCHI Workshop on Human-Centered Machine Learning, volume 2016, 2016.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 2013.

Published in Transactions on Machine Learning Research (10/2023)

Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables. ar Xiv.org, abs/1602.05801, 2016.

David Stutz, Krishnamurthy Dvijotham, Ali Taylan Cemgil, and Arnaud Doucet. Learning optimal conformal classifiers. In Proc. of the International Conference on Learning Representations (ICLR), 2021.

David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Mac Williams, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale Webster, Yuan Liu, Greg Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, and Alan Karthikesalingam. Evaluating ai systems under uncertain ground truth: a case study in dermatology. ar Xiv.org, abs/2307.02191, 2023.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems (Neur IPS), pp. 2526 2536, 2019.

Paolo Toccaceli. Conformal predictor combination using Neyman Pearson lemma. In Proc. of the Symposium on Conformal and Probabilistic Prediction and Applications (COPA), 2019.

Paolo Toccaceli and Alexander Gammerman. Combination of inductive Mondrian conformal predictors. Machine Learning, 108(3):489 510, 2019.

Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal on Data Warehousing and Mining, 3(3):1 13, 2007.

Alexandra Uma, Dina Almanea, and Massimo Poesio. Scaling and disagreements: Bias, noise, and ambiguity. Frontiers in Artifical Intelligence, 5, 2022.

Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1-2):9 28, 2015.

Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. Biometrika, 107(4):791 808, 2020.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Verlag, Berlin, Heidelberg, 2005.

Vladimir Vovk, Ilia Nouretdinov, Valery Manokhin, and Alexander Gammerman. Cross-conformal predictive distributions. In Proc. of the Symposium on Conformal and Probabilistic Prediction and Applications (COPA), 2018.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. of the International Conference on Learning Representations (ICLR), 2019.

Hua-zhen Wang, Xin Liu, Ilia Nouretdinov, and Zhiyuan Luo. A comparison of three implementations of multi-label conformal prediction. In Proc. of the International Symposium on Statistical Learning and Data Sciences, 2015.

Huazhen Wang, Xin Liu, Bing Lv, Fan Yang, and Yanzhu Hong. Reliable multi-label learning via conformal predictor and random forest for syndrome differentiation of chronic fatigue in traditional chinese medicine. PLOS ONE, 9(6):1 14, 2014.

Larry Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

Wanchuang Zhu, Yingkai Jiang, Jun S Liu, and Ke Deng. Partition Mallows model and its inference for rank aggregation. Journal of the American Statistical Association, 118(541):343 359, 2023.

Published in Transactions on Machine Learning Research (10/2023)

A Annotations in popular datasets

Many popular machine learning datasets have been created using annotators to determine labels. Often, they also use very basic aggregation and voting mechanisms (corresponding to Pagg and Pvote) such as majority voting or averaging. For example, the list of most cited datasets on Papers with Code7 includes the following:

The CIFAR10 dataset (Krizhevsky, 2009) is a popular multiclass classification benchmark of 32 32 pixel color images. It was labeled using a strategy similar to majority voting: first, a crowd-sourced label was collected which was then verified and potentially corrected by an author.

CIFAR10H (Peterson et al., 2019) revisited the CIFAR10 dataset by gathering additional human annotations where each annotator provides a single label.

Image Net (Russakovsky et al., 2015) uses multiple annotators with majority voted labels.

COCO (Lin et al., 2014) labels a category as present if any out of 8 annotators labeled it as present (even if all other annotators disagree).

The Stanford sentiment treebank (Socher et al., 2013) includes 3 annotations per example which are used in various ways during evaluation.

Multi NLI (Williams et al., 2018) includes 5 annotations which are majority voted.

The semantic textual similarity benchmark (Cer et al., 2017) averages scores across multiple annotators.

The Recognizing Textual Entailment datasets8 also consider multiple annotators and several editions of the corresponding challenge simply neglected examples with disagreeing annotators (equivalent to majority voting while ignoring ambiguous examples).

The above natural language processing datasets are all part of the GLUE benchmark (Wang et al., 2019) the go-to benchmark for natural language understanding. Datasets beyond (multi-label) classification also often include multiple annotations. VQA (Goyal et al., 2017), for example, includes 10 free-form answers per question; aggregating them is clearly non-trivial. For many datasets, it might also be unclear how annotations have been collected or aggregated. The Quora Question Pairs webpage9 explicitly mentions label errors but does not give details on annotation and aggregation. All of these examples emphasize that aggregating annotations and using majority voted or averaged labels is extremely common across supervised machine learning. This means that plausibilities are typically readily obtainable and the core problem we address using majority voted labels on ambiguous tasks for calibration leads to under-coverage is highly relevant.

B Calibration threshold and p-values

B.1 Single p-value

We include for completeness a proof of the results presented in Section 2.1. These are standard results. The p-value formulation detailed here will be then extended to establish the validity of some of the procedures proposed in this work.

We first establish that the prediction set defined by Equations (2) and (3) satisfies

1 α P(Y C(X)) 1 α + 1 n + 1, (20)

the upper bound requiring the additional assumption that the conformity scores are almost surely distinct.

7https://paperswithcode.com/datasets?q=&v=lst&o=cited 8https://tac.nist.gov//2008/rte/past_data/index.html 9https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

Published in Transactions on Machine Learning Research (10/2023)

Let us write (Xn+1, Yn+1) = (X, Y ) and Si = E(Xi, Yi). As (Si)i [n+1] are exchangeable, then a direct application of (Romano et al., 2019, Appendix A, Lemma 2) shows that for any α (0, 1) and τ = Q { Si}i [n]; (1 α)(n + 1) /n

P( Sn+1 τ ) 1 α (21)

and, if the random variables (Si)i [n+1] are almost surely distinct, then

P( Sn+1 τ ) 1 α + 1 n + 1. (22)

So C(Xn+1) includes all the values y such that S(Xn+1, y) is smaller or equal than the (1 α)(n + 1) smallest values of ( Si)i [n]. This is equivalent to considering all the values y such that S(Xn+1, y) is larger or equal than the α(n+1) smallest values of (Si)i [n]. Hence this corresponds to the prediction set defined by Equations (2) and (3).

We now show that this prediction set can also be obtained by thresholding p-values. Indeed Equation (4) can be rewritten as

ρYn+1 = Pn+1 i=1 I(Si Sn+1)

n + 1 . (23)

As (Si)i [n+1] are exchangeable, then ρYn+1 is uniformly distributed on { 1 n+1, 2 n+1, ..., 1} if the distribution of the scores is continuous and thus ρYn+1 is a p-value, i.e. P(ρYn+1 α) α P(ρYn+1 > α) 1 α. It can be checked that this property still holds if the distribution of the scores is not continuous (see e.g. (Bates et al., 2023)). So if we define the prediction set as

C(Xn+1) = {y : ρy > α} (24)

then by construction it follows that

P(Yn+1 C(Xn+1)) 1 α. (25)

We show here that Equation (24) does indeed coincide with the set defined by Equations (2) and (3). We have

i=1 I(Si E(Xn+1, y)) > α(n + 1) 1. (26)

If α(n + 1) is an integer then Pn i=1 I(Si E(Xn+1, y)) α(n + 1), i.e. E(Xn+1, y) is larger or equal than the α(n + 1) smallest values of (Si)i [n]. If α(n + 1) is not an integer then it means that Pn i=1 I(Si E(Xn+1, y)) α(n + 1) 1, i.e. E(Xn+1, y) is larger than the α(n + 1) 1 smallest values of (Si)i [n]. However, we have α(n + 1) 1 = α(n + 1) as α(n + 1) is not an integer. So overall, we have that ρy > α corresponds to E(Xn+1, y) being larger or equal than the α(n + 1) smallest values (Si)i [n]. So the prediction set in Equation (24) does coincide with the set defined by Equations (2) and (3).

Finally note that when the distributions of the scores is continuous, i.e. the scores are almost surely distinct, then we also have P(ρYn+1 α) α 1/n+1 so P(ρYn+1 > α) 1 α + 1/n+1.

B.2 Average of p-values

We establish here the expression of the prediction set obtained by thresholding the following average p-value

Pm j=1 Pn+1 i=1 I(Sj i Sj n+1)

m(n + 1) (27)

where Sj n+1 = Sn+1 for all j [m]. We know that by construction the set C(Xn+1) = {y : ρy > α} is such that P(Yn+1 C(Xn+1)) 1 2α. (28)

Published in Transactions on Machine Learning Research (10/2023)

0.700 0.725 0.750 0.775 0.800 0.825 0.850

Empirical coverage

Standard CP w/ m = 1 sampled label

Target Voted coverage Aggregated coverage

0.70 0.72 0.74 0.76 0.78 0.80 0.82 0.84 Empirical coverage

Standard CP w/ m = 10 sampled label

Target Voted coverage Aggregated coverage

4.0 4.2 4.4 4.6 4.8 5.0 5.2 Inefficiency

Average: 4.57

4.2 4.4 4.6 4.8 5.0 Inefficiency

Average: 4.55

Figure 14: Complementary to Figure 10, we show results for standard CP with m = 1 and m = 10 sampled labels per example in terms of voted coverage (blue) and aggregated coverage (green) across 100 random calibration/test splits (top) as well as inefficiency (bottom). As for Monte Carlo CP in Algorithm 2, we sample m labels from the plausibilities, but use the quantile from Equation (3) for calibration. For m = 1, this coincides with Monte Carlo CP; for m > 1, however, the quantile computation differs from Equation (13).

i=1 I(Sj i E(Xn+1, y)) > αm(n + 1) m. (29)

If αm(n+1) is an integer then E(Xn+1, y) needs to be larger or equal than the αm(n+1) m+1 smallest values of (Sj i )i [n],j [m]. If αm(n+1) is not an integer then Pm j=1 Pn i=1 I(Sj i E(Xn+1, y)) αm(n+1) m, i.e. E(X, y) is larger or equal than the αm(n+1) m smallest values of Si. However, we have αm(n+1) 1 = αm(n + 1) as αm(n + 1) is not an integer. So αm(n + 1) m = αm(n + 1) m + 1. So overall we need E(Xn+1, y) larger or equal than the αm(n+1) m+1 smallest values of (Sj i )i [n],j [m], equivalently larger that the quantile of (Sj i )i [n],j [m] at level αm(n+1) m+1

mn , i.e. C(Xn+1) = {k [K] : E(Xn+1, k) τ} for τ defined in Equation (10).

C Additional results for skin condition classification

Figure 14 shows a baseline experiment using standard CP with sampled labels instead of our Monte Carlo CP from Algorithm 2. Specifically, we sample m = 1 and m = 10 labels as we would for Monte Carlo CP but apply the standard quantile from Equation (3) for calibration. Specifically, we use α(n+1) /n (Equation (3)) instead of αm(n+1) m+1/mn (Equation (13)) which is used for Monte Carlo CP. For m = 1, these coincide; for m > 2, the quantiles may differ but are still close for large n. Thus, it is not surprising that this approach also obtains (on average) target coverage 1 α. However, in contrast to Monte Carlo CP, this approach does not provide any coverage guarantee for m > 1.

Figure 15 shows the p-values of standard CP (with voted labels, left) and Monte Carlo CP (with sampled labels, right) w.r.t. to the voted labels (top) and 10 labels randomly sampled from the plausibilities (per example, bottom). CP against voted labels results in the corresponding p-values being uniformly distributed. However, the distribution of p-values corresponding to sampled labels is slightly skewed towards 0 (compare to the black line). With Monte Carlo CP, we observe the opposite: the p-values corresponding to voted labels are not entirely uniformly distributed while those corresponding to sampled labels are. This highlights the impact of ambiguity on the corresponding p-values.

Figure 16 presents complementary results to Figure 9 in the main paper. Specifically, on top, we additionally consider ties among the voted labels (i.e., there is no unique arg max in the plausibilities λ). In the calibration examples, we break these ties randomly. At test time, however, we can decide to evaluate

Published in Transactions on Machine Learning Research (10/2023)

0.0 0.2 0.4 0.6 0.8 1.0 p-value

CP with voted labels: p-values for voted labels

Uniform CP with voted labels

0.0 0.2 0.4 0.6 0.8 1.0 p-value

Monte Carlo CP: p-values for voted labels

Uniform Simple MCCP ECDF MCCP

0.0 0.2 0.4 0.6 0.8 1.0 p-value

CP with voted labels: p-values for sampled labels

Uniform CP with voted labels

0.0 0.2 0.4 0.6 0.8 1.0 p-value

Monte Carlo CP: p-values for sampled labels

Uniform Simple MCCP ECDF MCCP

Figure 15: p-value histograms for CP with voted labels (left) and Monte Carlo CP (right). In both cases, we show the p-values for the voted labels (arg max of plausibilities, top) and labels samples from the plausibilities (bottom). Calibrating against voted labels guarantees a uniform distribution of the p-values shown on top, Monte Carlo calibration guarantees the same for the bottom histograms. As a result, compared to the expected uniform distribution (black), the distribution of p-values w.r.t. sampled labels is skewed for CP with voted labels (bottom left) while p-values w.r.t. to voted labels are skewed for Monte Carlo CP (top right).

0 200 400 600 800 Sorted examples

Realized coverage

Realized voted and aggregated coverage for CP with voted labels

Aggregated coverage (0.62) Voted coverage (0.73) Voted coverage w/ ties (0.72) Argmax plausibility

0 200 400 600 800 Sorted examples

Realized coverage

Realized aggregated coverage comparison

CP with voted labels (0.61) Simple MCCP (0.72) ECDF MCCP (0.72) Argmax plausibility

Figure 16: Top: Complementary to Figure 9, we include evaluation against voted labels while considering ties (red, 1/Li PLi j=1 I[Y j i C(Xi)] with Y j i being one of Li tied labels for case i) which hints towards several ambiguous cases. Looking at aggregated coverage (i.e., P y [K] λik I[y C(Xi)]) in blue, however, shows that a significant portion of examples are ambiguous and standard evaluation against the voted labels (i.e., I[arg maxk λik C(Xi)]) is unreasonable. Bottom: Aggregated coverage plot for Monte Carlo CP procedures compared to CP with voted labels, highlighting that CP with voted labels does not obtain the target aggregated coverage.

coverage proportional to the tied labels. Formally, we assume that Y 1 i , . . . , Y Li i are the tied labels and compute 1/Li PLi j=1 I[Y j i C(Xi)] instead of the binary indicator I[Yi C(Xi)] to evaluate coverage. In red, we see that quite a few examples exhibit ties and the predicted prediction sets often cover only a part of the tied labels. This is an early indicator for high ambiguity in the ground truth of this dataset. On the bottom, we additionally plot aggregated coverage for (ECDF) Monte Carlo CP in comparison to CP on voted labels. Again, CP with voted labels significantly under-estimates aggregated coverage. While both Monte Carlo approaches (with and without ECDF correction) look identical in this example, they do not have to be due to randomness in how the calibration set is further split (cf. Algorithm 2). In expectation across splits, however, we found that they coincide. This also highlights that the simple variant empirically achieves aggregated coverage 1 α despite only guaranteeing 1 2α.

Figure 17 reproduces Figure 5 on our skin condition classification case study showing the advantage of using m > 1 in Monte Carlo CP to reduce variation in aggregated coverage across calibration/test splits.

Published in Transactions on Machine Learning Research (10/2023)

0.86 0.87 0.88 0.89 0.90 0.91 0.92 0.93 Empirical coverage

Variation in aggregated coverage for Monte Carlo CP

Target m = 1 m = 5 m = 10 Figure 17: Empirical aggregated coverage for Monte Carlo CP with various m. As on our toy dataset in Figure 5, higher m clearly reduces the variation in coverage across the evaluated 100 calibration/test splits despite using 50% of the 1949 examples for calibration.

Acne Diss Foll Foll Tine Sebo Absc Acan Classes

CP with voted labels

Plausibilities Confidence set (0.18 agg. coverage)

Acne Diss Foll Foll Tine Sebo Absc Acan Classes

Monte Carlo CP

Plausibilities Confidence set (0.36 agg. coverage)

Hema Onyc Onyc SK/I Tine Xero Ecze Absc Classes

CP with voted labels

Plausibilities Confidence set (0.23 agg. coverage)

Hema Onyc Onyc SK/I Tine Xero Ecze Absc Classes

Monte Carlo CP

Plausibilities Confidence set (0.38 agg. coverage)

Figure 18: Additional qualitative results corresponding to Figure 11.

Figure 18 shows two additional qualitative examples where Monte Carlo CP improves results, i.e., more conditions with significant plausibility (blue) are covered (green) but important conditions are still not covered. This can be addressed using a lower confidence level such as α = 0.1 in Figure 20.

Results of our main experiments in dermatology for α = 0.1 can be found in Figures 19 to 20.

Published in Transactions on Machine Learning Research (10/2023)

0.80 0.82 0.84 0.86 0.88 0.90 0.92 Empirical coverage

CP with voted labels

Target Voted coverage Aggregated coverage

0.88 0.90 0.92 0.94 0.96 Empirical coverage

Monte Carlo CP

Target Voted coverage Aggregated coverage

0.86 0.88 0.90 0.92 0.94 0.96 Empirical coverage

ECDF Monte Carlo CP

Target Voted coverage Aggregated coverage

7.0 7.5 8.0 8.5 9.0 9.5 Inefficiency

Average: 8.71

13.5 14.0 14.5 15.0 15.5 16.0 16.5 Inefficiency

Average: 15.02

12 14 16 18 20 Inefficiency

Average: 14.87

Figure 19: Results corresponding to Figure 10 with α = 0.1.

Hema Mela Angi Pyog Skin Mela Atyp O/E Classes

CP with voted labels

Plausibilities Confidence set (0.87 agg. coverage)

Hema Mela Angi Pyog Skin Mela Atyp O/E Classes

Monte Carlo CP

Plausibilities Confidence set (0.87 agg. coverage)

Arte Calc Cell Pyod Veno Absc Acan Acan Classes

CP with voted labels

Plausibilities Confidence set (0.80 agg. coverage)

Arte Calc Cell Pyod Veno Absc Acan Acan Classes

Monte Carlo CP

Plausibilities Confidence set (1.00 agg. coverage)

Acne Diss Foll Foll Tine Sebo Absc Acan Classes

CP with voted labels

Plausibilities Confidence set (0.55 agg. coverage)

Acne Diss Foll Foll Tine Sebo Absc Acan Classes

Monte Carlo CP

Plausibilities Confidence set (1.00 agg. coverage)

Hema Onyc Onyc SK/I Tine Xero Ecze Absc Classes

CP with voted labels

Plausibilities Confidence set (0.69 agg. coverage)

Hema Onyc Onyc SK/I Tine Xero Ecze Absc Classes

Monte Carlo CP

Plausibilities Confidence set (0.69 agg. coverage)

Figure 20: Additional qualitative results for α = 0.1 corresponding to Figures 11 and 18.