Journal of Machine Learning Research 25 (2024) 1-66. Submitted 11/23; Revised 10/24; Published 10/24.

Label Noise Robustness of Conformal Prediction

Bat-Sheva Einbinder (bat-shevab@campus.technion.ac.il), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology
Shai Feldman (shai.feldman@cs.technion.ac.il), Department of Computer Science, Technion - Israel Institute of Technology
Stephen Bates (stephenbates@mit.edu), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Anastasios N. Angelopoulos (angelopoulos@berkeley.edu), Department of Electrical Engineering and Computer Science, University of California, Berkeley
Asaf Gendler (asafgendler@campus.technion.ac.il), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology
Yaniv Romano (yromano@technion.ac.il), Departments of Electrical and Computer Engineering and of Computer Science, Technion - Israel Institute of Technology

Editor: Zaid Harchaoui

Abstract

We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. We further extend our theory and formulate the requirements for correctly controlling a general loss function, such as the false negative proportion, with noisy labels. Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other, adversarial cases, we can also correct for noise of bounded size within the conformal prediction algorithm in order to ensure achieving the correct risk over the ground truth labels without score or data regularity.
Keywords: conformal prediction, risk control, uncertainty quantification, label noise, distribution shift

* Equal contribution; these authors are listed in alphabetical order.

©2024 Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, Yaniv Romano. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1549.html.

1. Introduction

In most supervised classification and regression tasks, one assumes that the provided labels reflect the ground truth. In reality, this assumption is often violated (Cheng et al., 2022; Xu et al., 2019; Yuan et al., 2018; Lee and Barber, 2022; Cauchois et al., 2022). For example, doctors labeling the same medical image may have different subjective opinions about the diagnosis, leading to variability in the ground truth label itself. To quote Abdalla and Fine (2023): "Noise in labeling schemas and gold label annotations are pervasive in medical imaging classification and affect downstream clinical deployment." In other settings, such variability may arise due to sensor noise, data entry mistakes, the subjectivity of a human annotator, or many other sources. In other words, the labels we use to train machine learning (ML) models may often be noisy, in the sense that they are not necessarily the ground truth. Consequently, models trained on such data can lead to invalid data-driven conclusions. The above discussion emphasizes the critical need for making reliable predictions in real-world scenarios, especially when dealing with imperfect training data. An effective way to enhance the reliability of ML models is to quantify their prediction uncertainty.
Conformal prediction (Vovk et al., 2005, 1999; Angelopoulos and Bates, 2023) is a generic uncertainty quantification tool that transforms the output of any ML model into prediction sets that are guaranteed to cover the future, unknown test labels with high probability. This guarantee holds for any data distribution and sample size, under the sole assumption that the training and test data are i.i.d. However, whether valid uncertainty quantification can be obtained via conformal prediction in the presence of noisy labels remains unclear, since the noise breaks the i.i.d. assumption. In this paper, we aim to address this specific challenge and precisely characterize under what conditions conformal methods, applied to noisy data, yield prediction sets guaranteed to cover the unseen clean, ground truth label. Additionally, we analyze the effect of label noise on risk-controlling techniques, which extend the conformal prediction approach to construct uncertainty sets with guaranteed control of a general risk function (Bates et al., 2021; Angelopoulos et al., 2021, 2024). We also show that our theory can be applied to online settings, e.g., to quantify prediction uncertainty for time series data with noisy labels. Overall, we analyze the behavior of conformal prediction and risk-controlling methods for several common loss functions and noise models, highlighting their built-in robustness to dispersive, variability-increasing noise and their vulnerability to adversarial noise. Adversarial noise may reduce the apparent uncertainty of the response variable, potentially causing the prediction sets to be too small and achieve a low coverage rate. We note that a summary of this paper and its key contributions can be found in (Feldman et al., 2023a).

2. Conformal Prediction Under Label Noise

2.1 Problem Setup

Consider a calibration data set of i.i.d. observations $\{(X_i, Y_i)\}_{i=1}^n$ sampled from an arbitrary unknown distribution $P_{XY}$.
Here, $X_i \in \mathbb{R}^p$ is the feature vector that contains the $p$ features of the $i$-th sample, and $Y_i$ denotes its response, which can be discrete for classification tasks or continuous for regression tasks. Given the calibration data set, an i.i.d. test data point $(X_{\text{test}}, Y_{\text{test}})$, and a pre-trained model $\hat{f}$, conformal prediction constructs a set $\hat{C}_{\text{clean}}(X_{\text{test}})$ that contains the unknown test response $Y_{\text{test}}$ with high probability, e.g., 90%. We refer to $\hat{C}_{\text{clean}}(X_{\text{test}})$ as "clean" to underscore that the prediction set is formed by utilizing samples from the clean data distribution. That is, for a user-specified level $\alpha \in (0, 1)$,

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{clean}}(X_{\text{test}})\big) \ge 1 - \alpha. \quad (1)$$

This property is called marginal coverage, where the probability is defined over the calibration and test data. In the setting of label noise, we only observe the corrupted labels $\tilde{Y}_i = g(Y_i)$ for some corruption function $g : \mathcal{Y} \times [0, 1] \to \mathcal{Y}$, so the i.i.d. assumption and the marginal coverage guarantee are invalidated. The corruption is random; we will always take the second argument of $g$ to be a random seed $U$ uniformly distributed on $[0, 1]$. To ease notation, we leave the second argument implicit henceforth. Nonetheless, using the noisy calibration data, we seek to form a prediction set $\hat{C}_{\text{noisy}}(X_{\text{test}})$ that covers the clean, uncorrupted test label $Y_{\text{test}}$. More precisely, our goal is to delineate when it is possible to provide guarantees of the form

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \ge 1 - \alpha, \quad (2)$$

where the probability is taken jointly over the calibration data, the test data, and the corruption (this will be the case for the remainder of the paper). More formally, we use the model $\hat{f}$ to construct a score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which is engineered to be large when the model is uncertain and small otherwise. We will introduce different score functions for both classification and regression as needed in the following subsections.
Abbreviate the score on each calibration data point as $s_i = s(X_i, Y_i)$ for each $i = 1, \ldots, n$. Conformal prediction tells us that we can achieve a marginal coverage guarantee by picking $\hat{q}_{\text{clean}} = s_{(\lceil (n+1)(1-\alpha) \rceil)}$, the $\lceil (n+1)(1-\alpha) \rceil$-smallest of the calibration scores, and constructing the prediction set as

$$\hat{C}_{\text{clean}}(X_{\text{test}}) = \{ y \in \mathcal{Y} : s(X_{\text{test}}, y) \le \hat{q}_{\text{clean}} \}.$$

In this paper, we do not allow ourselves access to the clean calibration labels, only their noisy versions $\tilde{Y}_1, \ldots, \tilde{Y}_n$, so we cannot calculate $\hat{q}_{\text{clean}}$. Instead, we can calculate the noisy quantile $\hat{q}_{\text{noisy}}$ as the $\lceil (n+1)(1-\alpha) \rceil$-smallest of the noisy scores, $\tilde{s}_i = s(X_i, \tilde{Y}_i)$. The main formal question of our work is whether the resulting prediction set, $\hat{C}_{\text{noisy}}(X_{\text{test}}) = \{y : s(X_{\text{test}}, y) \le \hat{q}_{\text{noisy}}\}$, covers the clean label as in (2). We state this general recipe algorithmically for future reference:

Recipe 1 (Conformal prediction with noisy labels)
1. Consider i.i.d. data points $(X_1, Y_1), \ldots, (X_n, Y_n), (X_{\text{test}}, Y_{\text{test}})$, a corruption model $g : \mathcal{Y} \to \mathcal{Y}$, and a score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
2. Compute the conformal quantile with the corrupted labels, $\hat{q}_{\text{noisy}} = \text{Quantile}\big( \lceil (n+1)(1-\alpha) \rceil / n ;\ \{ s(X_i, \tilde{Y}_i) \}_{i=1}^n \big)$, where $\tilde{Y}_i = g(Y_i)$.
3. Construct the prediction set using the noisy conformal quantile, $\hat{C}_{\text{noisy}}(X_{\text{test}}) = \{ y : s(X_{\text{test}}, y) \le \hat{q}_{\text{noisy}} \}$.

This recipe produces prediction sets that cover the noisy label at the desired coverage rate (Vovk et al., 2005; Angelopoulos and Bates, 2023): $P(\tilde{Y}_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})) \ge 1 - \alpha$. Note that achieving (2) is impossible in general; see Proposition 3. However, we present some realistic noise models under which this goal is satisfied. In the following sections, we provide real-data experiments indicating that conformal prediction and risk-controlling methods achieve valid risk/coverage even with access only to noisy labels. In Section 2.3, we outline the precise conditions for these guarantees.
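Recipe 1 can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming (`noisy_conformal_quantile` and `prediction_set` are hypothetical helpers), not the authors' reference implementation:

```python
import numpy as np

def noisy_conformal_quantile(noisy_scores, alpha):
    """Step 2 of Recipe 1: the k-th smallest noisy calibration score,
    with k = ceil((n + 1) * (1 - alpha))."""
    n = len(noisy_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: the set must cover everything
    return np.sort(noisy_scores)[k - 1]

def prediction_set(score_fn, x_test, label_space, q_noisy):
    """Step 3 of Recipe 1: all candidate labels whose score falls below the quantile."""
    return {y for y in label_space if score_fn(x_test, y) <= q_noisy}
```

For example, with n = 100 calibration scores and alpha = 0.1, the quantile is the 91st smallest noisy score.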
There is a particular type of antidispersive noise that causes failure, and we see this both in theory and in practice. If such noise can be avoided, a user should feel safe deploying uncertainty quantification techniques even with noisy labels. Finally, we note that several works study the robustness of conformal prediction under ambiguous labels; we discuss them in more detail in Section 5.1.

2.2 Empirical Evidence

Classification. As a real-world example of label noise, we conduct an image classification experiment in which we only observe one annotator's label but seek to cover the majority vote of many annotators. For this purpose, we use the CIFAR-10H data set, used by Peterson et al. (2019); Battleday et al. (2020); Singh et al. (2020), which contains 10,000 images labeled by approximately 50 annotators. We calibrate using only a single annotator's label and seek to cover the majority vote of the 50 annotators. The single annotator differs from the ground truth labels in approximately 5% of the images. Using the noisy calibration set (i.e., a calibration set containing these noisy labels), we apply vanilla conformal prediction as if the data were i.i.d. and study the performance of the resulting prediction sets. Details regarding the training procedure can be found in Appendix B.1. The fraction of majority-vote labels covered is shown in Figure 1. This figure shows that when using the clean calibration set, the marginal coverage is 90%, as expected. When using the noisy calibration set, the coverage of the clean, unknown test point, i.e., the probability in (2), increases to approximately 93%. Figure 1 also demonstrates that prediction sets calibrated using noisy labels are larger than sets calibrated with clean labels. This experiment illustrates the main intuition behind our paper: adding noise will usually increase the variability of the labels, leading to larger prediction sets that retain the coverage property.
Figure 1: Effect of label noise on CIFAR-10. Left: distribution of average coverage on a clean test set over 30 independent experiments with target coverage 1 − α = 90%, using noisy and clean labels for calibration. We use a pre-trained ResNet-18 model, which has Top-1 accuracy of 93% and 90% on the clean and noisy test sets, respectively. The gray bar represents the interquartile range. Center and right: prediction sets achieved using noisy and clean labels for calibration; e.g., for a true label "Cat" the noisy set is {Cat, Dog} and the clean set is {Cat}, and for a true label "Car" the noisy set is {Car, Ship, Cat} and the clean set is {Car}.

Regression. In this section, we present a real-world application with a continuous response, using the Aesthetic Visual Analysis (AVA) data set, first presented by Murray et al. (2012). This data set contains pairs of images and their aesthetic scores in the range 1 to 10, obtained from approximately 200 annotators. Following Kao et al. (2015); Talebi and Milanfar (2018); Murray et al. (2012), the task is to predict the average aesthetic score of a given test image. Therefore, we consider the average aesthetic score taken over all annotators as the clean, ground truth response. The noisy response is the average aesthetic score taken over only 10 randomly selected annotators. We examine the performance of conformal prediction using two different scores: the CQR score (Romano et al., 2019), defined in (4), and the residual magnitude score (Papadopoulos et al., 2002; Lei et al., 2018), defined in (5). In our experiments, we set the normalization function of the residual magnitude score to $\hat{u}(x) = 1$. We follow Talebi and Milanfar (2018) and take a transfer learning approach, fitting the predictive model using a VGG-16 model pretrained on the ImageNet data set. Details regarding the training strategy are in Appendix B.2. Figure 2 portrays the marginal coverage and average interval length achieved using the CQR and residual magnitude scores.
As a point of reference, this figure also presents the performance of the two conformal methods when calibrated with a clean calibration set; as expected, both attain 90% coverage. By contrast, when calibrating the same predictive models with a noisy calibration set, the resulting prediction intervals tend to be wider and to over-cover the average aesthetic scores. Thus far, we have found in empirical experiments that conservative coverage is obtained in the presence of label noise. In the following sections, our objective is to establish conditions that formally guarantee label-noise robustness.

Figure 2: Results for the real-data regression experiment: predicting aesthetic visual rating. Performance of conformal prediction intervals with 90% marginal coverage based on a VGG-16 model using a noisy training set. We compare the residual magnitude and CQR methods with both noisy and clean calibration sets. Left: marginal coverage; right: interval length. The results are evaluated over 30 independent experiments and the gray bar represents the interquartile range.

2.3 Theoretical Analysis

2.3.1 General Analysis

We begin the theoretical analysis with a general statement, showing that Recipe 1 produces valid prediction sets whenever the noisy score distribution stochastically dominates the clean score distribution. The intuition is that the noise spreads out the distribution of the score function such that $\hat{q}_{\text{noisy}}$ is (stochastically) larger than $\hat{q}_{\text{clean}}$.

Theorem 1. Assume that $P(\tilde{s}_{\text{test}} \le t) \le P(s_{\text{test}} \le t)$ for all $t$. Then,

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \ge 1 - \alpha.$$

Furthermore, for any $u$ satisfying $P(\tilde{s}_{\text{test}} \le t) + u \ge P(s_{\text{test}} \le t)$ for all $t$,

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \le 1 - \alpha + \frac{1}{n+1} + u.$$
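When a small audited set of clean labels happens to be available, the stochastic-dominance condition of Theorem 1 can be checked empirically by comparing the two score ECDFs. The sketch below is our own hypothetical helper: it returns the largest violation of $P(\tilde{s}_{\text{test}} \le t) \le P(s_{\text{test}} \le t)$ over a grid of thresholds.

```python
import numpy as np

def dominance_violation(clean_scores, noisy_scores):
    """Largest gap max_t [P(s_noisy <= t) - P(s_clean <= t)] over the pooled
    grid of observed scores.  A value <= 0 means the noisy scores empirically
    stochastically dominate the clean ones, i.e., Theorem 1's first condition holds."""
    grid = np.union1d(clean_scores, noisy_scores)
    F_clean = np.searchsorted(np.sort(clean_scores), grid, side="right") / len(clean_scores)
    F_noisy = np.searchsorted(np.sort(noisy_scores), grid, side="right") / len(noisy_scores)
    return float(np.max(F_noisy - F_clean))
```

A positive return value also serves as an empirical estimate of the slack $u$ appearing in the upper bound of Theorem 1, with the roles of the two samples swapped.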
Figure 3 illustrates the idea behind Theorem 1, demonstrating that when the noisy score distribution stochastically dominates the clean score distribution, then $\hat{q}_{\text{noisy}} \ge \hat{q}_{\text{clean}}$, and thus uncertainty sets calibrated using noisy labels are more conservative. In practice, however, one does not have access to such a figure, since the scores of the clean labels are unknown. This gap requires an individual analysis for every task and its noise setup, which underscores the complexity of this study. In general, for most commonly used score functions the stochastic dominance assumption holds when the noise is dispersive, meaning it flattens the density of $Y \mid X$, e.g., when $\mathrm{Var}(\tilde{Y} \mid X) > \mathrm{Var}(Y \mid X)$. In the following subsections, we present example setups in classification and regression tasks under which this stochastic dominance holds and conformal prediction with noisy labels succeeds in covering the true, noiseless label.

Figure 3: Clean (green) and noisy (red) non-conformity scores under dispersive corruption.

The purpose of these examples is to illustrate simple and intuitive statistical settings where Theorem 1 holds; under the hood, all setups given in the following subsections are applications of Theorem 1. Though the noise can be adversarially designed to violate these assumptions and cause under-coverage (as in the impossibility result in Proposition 3), the evidence presented here suggests that in the majority of practical settings, conformal prediction can be applied without modification. The proof is given in Appendix A.1.

2.3.2 Regression

In this section, we analyze a regression task where the labels are continuous-valued and the corruption function is additive:

$$g_{\text{add}}(y) = y + Z \quad (3)$$

for some independent noise sample $Z$. We first analyze the setting where the noise $Z$ is symmetric around 0 and the density of $Y \mid X$ is symmetric unimodal.
We also assume that the estimated prediction interval contains the true median of $Y \mid X = x$, which is a very weak assumption about the fitted model. The following proposition states that such an interval achieves a conservative coverage rate over the clean labels.

Proposition 1. Assume a symmetric unimodal distribution of $Y \mid X$ and an independent additive noise that is symmetric around zero. Let $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be a prediction interval constructed according to Recipe 1 with the corruption function $g_{\text{add}}$. Further, suppose that $\hat{C}_{\text{noisy}}(x)$ contains the true median of $Y \mid X = x$ for all $x \in \mathcal{X}$. Then

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \ge 1 - \alpha.$$

Importantly, the corruption function may depend on the feature vector, as stated next.

Remark 2 (Corruptions dependent on X). Proposition 1 holds even if the noise $Z$ depends on $X$.

Furthermore, Proposition 1 applies to any non-conformity score function, in particular the two popular scores we focus on. The CQR score, developed by Romano et al. (2019), measures the distance of the estimated interval's endpoints from the corresponding label $y$, formally defined as:

$$s^{\text{CQR}}(x, y) = \max\{\hat{f}_{\text{lower}}(x) - y,\; y - \hat{f}_{\text{upper}}(x)\}. \quad (4)$$

Above, $\hat{f}_{\text{lower}}$ and $\hat{f}_{\text{upper}}$ are the estimated lower and upper interval endpoints, e.g., obtained by fitting a quantile regression model to approximate the $\alpha/2$ and $1 - \alpha/2$ conditional quantiles of $Y \mid X$ (Koenker and Bassett, 1978). The residual magnitude (RM) score (Papadopoulos et al., 2002; Lei et al., 2018) assesses the normalized prediction error:

$$s^{\text{RM}}(x, y) = |\hat{f}(x) - y| / \hat{u}(x), \quad (5)$$

where $\hat{f}$ is a regression model, such as a conditional mean estimator of $Y \mid X$, and $\hat{u}$ is some normalization function, e.g., $\hat{u}(x) \equiv 1$. Additionally, the distributional assumption on $Y \mid X$ is not necessary to guarantee label-noise robustness under this noise model; in Appendix A.2.1 we show that this distributional assumption can be relaxed at the expense of additional requirements on the estimated interval.
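The two scores in (4) and (5) translate directly to code; the following is a minimal sketch (function and argument names are our own):

```python
import numpy as np

def cqr_score(f_lower, f_upper, y):
    """CQR score (4): signed distance of y from the estimated interval
    [f_lower, f_upper]; negative when y lies inside, positive outside."""
    return np.maximum(f_lower - y, y - f_upper)

def rm_score(f_hat, y, u_hat=1.0):
    """Residual magnitude score (5) with normalization u_hat; u_hat = 1
    recovers the plain absolute residual."""
    return np.abs(f_hat - y) / u_hat
```

Both accept NumPy arrays, so the calibration scores of an entire noisy data set can be computed in one vectorized call.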
Furthermore, in Section 3.3.3 we extend this analysis and formulate bounds on the coverage rate obtained over the clean labels that require no distributional assumptions and impose no restrictions on the constructed intervals. Subsequently, in Section 4.5, we empirically evaluate the proposed bounds and show that they are informative despite their extremely weak assumptions.

2.3.3 Classification

In this section, we formulate the conditions under which conformal prediction is robust to label noise in a K-class classification setting, where the labels take one of $K$ values, i.e., $Y \in \{1, 2, \ldots, K\}$. In a nutshell, robustness is guaranteed when the corruption function pushes the label distribution towards a uniform distribution while maintaining the same ranking of labels. Such corruption increases the prediction uncertainty, which drives conformal methods to construct conservative uncertainty sets in order to achieve the nominal coverage level on the observed noisy labels. We now formalize this intuition. We begin by defining the following noise models:

Uniform noise: a noise model that fulfills the following for all $x \in \mathcal{X}$:
1. For all $i \in \{1, \ldots, K\}$: $\big| P(\tilde{Y}_{\text{test}} = i \mid X_{\text{test}} = x) - \tfrac{1}{K} \big| \le \big| P(Y_{\text{test}} = i \mid X_{\text{test}} = x) - \tfrac{1}{K} \big|$;
2. For all $i, j \in \{1, \ldots, K\}$: $P(Y_{\text{test}} = i \mid X_{\text{test}} = x) \ge P(Y_{\text{test}} = j \mid X_{\text{test}} = x) \iff P(\tilde{Y}_{\text{test}} = i \mid X_{\text{test}} = x) \ge P(\tilde{Y}_{\text{test}} = j \mid X_{\text{test}} = x)$.

Random flip: a corruption function that follows

$$g_{\text{flip}}(y) = \begin{cases} y & \text{w.p. } 1 - \epsilon \\ Y' & \text{otherwise,} \end{cases} \quad (6)$$

where $Y'$ is uniformly drawn from the set $\{1, \ldots, K\}$.

Figure 4: Clean (green) and noisy (red) class probabilities under dispersive corruption.

Proposition 2. Let $\hat{C}_{\text{noisy}}$ be constructed as in Recipe 1. Then, the coverage rate achieved over the clean labels is upper bounded by:

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \le 1 - \alpha + \frac{1}{n+1} + \frac{1}{2} \sum_{i=1}^{K} \big| P(\tilde{Y}_{\text{test}} = i) - P(Y_{\text{test}} = i) \big|.$$

Further suppose that $\hat{C}_{\text{noisy}}$ contains the most likely labels, i.e., for all $x \in \mathcal{X}$, $i \in \hat{C}_{\text{noisy}}(x)$, and $j \notin \hat{C}_{\text{noisy}}(x)$: $P(\tilde{Y}_{\text{test}} = i \mid X_{\text{test}} = x) \ge P(\tilde{Y}_{\text{test}} = j \mid X_{\text{test}} = x)$.
If the noise follows the uniform noise model, then the coverage rate is guaranteed to be valid:

$$1 - \alpha \le P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big).$$

We emphasize that the key contribution of Proposition 2 is the lower bound, as it guarantees a valid coverage rate. We also note that Barber et al. (2023) provide a sharper upper bound on the coverage rate, which also relies on the TV distance between $Y_{\text{test}}$ and $\tilde{Y}_{\text{test}}$. Nevertheless, the lower bound in Proposition 2 is tighter than the lower bound described in Barber et al. (2023). Figure 4 visualizes the essence of Proposition 2, showing that as the corruption increases the label uncertainty, the prediction sets generated by conformal prediction get larger, yielding a conservative coverage rate. It should be noted that under the above proposition, achieving valid coverage requires only knowledge of the noisy conditional distribution $P(\tilde{Y} \mid X)$, so a model trained on a large amount of noisy data should approximately attain the desired coverage rate. Moreover, one should notice that without the assumptions in Proposition 2, even the oracle model is not guaranteed to achieve valid coverage, as we will show in Section 4.1. For a more general analysis of classification problems where the noise model is an arbitrary confusion matrix, see Appendix A.3.2. We now turn to examine two conformity scores and show that applying them with noisy data leads to conservative coverage, as a result of Proposition 2. The adaptive prediction sets (APS) score, first introduced by Romano et al. (2020), is defined as

$$s^{\text{APS}}(x, y) = \sum_{y' \in \mathcal{Y}} \hat{\pi}_{y'}(x)\, \mathbb{I}\{\hat{\pi}_{y'}(x) > \hat{\pi}_{y}(x)\} + \hat{\pi}_{y}(x) \cdot U,$$

where $\mathbb{I}$ is the indicator function, $\hat{\pi}_y(x)$ is the estimated conditional probability $P(Y = y \mid X = x)$, and $U \sim \text{Unif}(0, 1)$. To make this non-random, a variant of the above with $U = 1$ is often used. The APS score is one of two popular conformal methods for classification. The other score, from Vovk et al. (2005); Lei et al.
(2013), is referred to as the homogeneous prediction sets (HPS) score,

$$s^{\text{HPS}}(x, y) = 1 - \hat{\pi}_y(x),$$

for some classifier $\hat{\pi}_y(x) \in [0, 1]$. The next corollary states that with access to an oracle classifier's ranking, conformal prediction covers the noiseless test label.

Corollary 1. Let $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be constructed as in Recipe 1 with either the APS or HPS score function, with any classifier that ranks the classes in the same order as the oracle classifier $\hat{\pi}_y(x) = P(\tilde{Y} = y \mid X = x)$. Then,

$$1 - \alpha \le P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \le 1 - \alpha + \frac{1}{n+1} + \frac{1}{2} \sum_{i=1}^{K} \big| P(\tilde{Y}_{\text{test}} = i) - P(Y_{\text{test}} = i) \big|.$$

We now turn to examine the specific random flip noise model, in which the noisy label is randomly flipped an $\epsilon$ fraction of the time. This noise model is well studied in the literature; see, for example, (Aslam and Decatur, 1996; Angluin and Laird, 1988; Ma et al., 2018; Jenni and Favaro, 2018; Jindal et al., 2016; Yuan et al., 2018).

Corollary 2. Let $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be constructed as in Recipe 1 with the corruption function $g_{\text{flip}}$ and either the APS or HPS score function, with any classifier that ranks the classes in the same order as the oracle classifier $\hat{\pi}_y(x) = P(\tilde{Y} = y \mid X = x)$. Then,

$$1 - \alpha \le P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \le 1 - \alpha + \frac{1}{n+1} + \epsilon \frac{K-1}{K}.$$

Crucially, the above corollaries apply with any score function that preserves the order of the estimated classifier, which emphasizes the generality of our theory. All proofs are given in Appendix A.3.1. Moreover, although this is not our main focus, in Appendix A.3.3 we investigate the inflation of the prediction set size in the specific case of random flip noise with APS scores and the oracle model. Table 1 summarizes the different settings we examine with their corresponding bounds. Finally, we note that in Section 3.3.1 we extend the above analysis to multi-label classification, where multiple labels may correspond to the same sample.
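For concreteness, the random flip corruption (6) and the HPS score can be simulated as follows. This is a toy sketch with our own function names; note that the uniform redraw may land on the original label, so the effective flip rate is $\epsilon (K-1)/K$, which is exactly the slack appearing in Corollary 2.

```python
import numpy as np

def random_flip(y, K, eps, rng):
    """Random flip (6): keep y w.p. 1 - eps; otherwise redraw uniformly
    from {0, ..., K-1} (the redraw may return y itself)."""
    return int(y) if rng.random() >= eps else int(rng.integers(K))

def hps_score(probs, y):
    """HPS score: one minus the estimated probability of class y."""
    return 1.0 - probs[y]
```

Applying `random_flip` to a held-out label vector makes it easy to empirically verify the over-coverage predicted by Corollary 2 as a function of eps.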
2.4 Disclaimer: Distribution-Free Results

Though the coverage guarantee holds in many realistic cases, conformal prediction may generate uncertainty sets that fail to cover the true outcome. Indeed, in the general case, conformal prediction produces invalid prediction sets and must be adjusted to account for the size of the noise. The following proposition states that for any nontrivial noise distribution, there exists a score function that breaks naïve conformal prediction.

Table 1: Summary of coverage bounds for different scores and noise models

| Task | Score | Noise model | Bounds |
|---|---|---|---|
| Regression | All | Additive symmetric: $g_{\text{add}}$ in (3) | $P(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})) \ge 1 - \alpha$ (Proposition 1) |
| Regression | All | Additive symmetric: $g_{\text{add}}$ in (3) | $P(Y \in C(x) \mid X = x) \ge P(\tilde{Y} \in C(x) \mid X = x)$ (Theorem A4) |
| Classification | All | Uniform noise | $1 - \alpha \le P(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})) \le 1 - \alpha + \frac{1}{n+1} + \frac{1}{2}\sum_{i=1}^K \vert P(\tilde{Y}_{\text{test}} = i) - P(Y_{\text{test}} = i)\vert$ (Proposition 2) |
| Classification | All | Random flip: $g_{\text{flip}}$ in (6) | $1 - \alpha \le P(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})) \le 1 - \alpha + \frac{1}{n+1} + \epsilon\frac{K-1}{K}$ (Corollary 2) |

Proposition 3 (Coverage is impossible in the general case). Take any $\tilde{Y} \neq_d Y$. Then there exists a score function $s$ that yields

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) < P\big(Y_{\text{test}} \in \hat{C}(X_{\text{test}})\big),$$

for $\hat{C}_{\text{noisy}}$ constructed using noisy samples and $\hat{C}$ constructed with clean samples.

The above proposition says that for any noise distribution, there exists an adversarially chosen score function that will disrupt coverage. Furthermore, as we discuss in Appendix A.4, with noise of a sufficient magnitude, it is possible to get arbitrarily bad violations of coverage. In Appendix A.4 we state an additional impossibility result, in which we claim that for any given score function satisfying some conditions, there is an adversarial noise that invalidates the coverage. Next, we discuss how to adjust the threshold of conformal prediction to account for noise of a known size, as measured by total variation (TV) distance from the clean label.

Corollary 3 (Corollary of Barber et al.
(2023)). Let $\tilde{Y}$ be any random variable satisfying $D_{\text{TV}}(Y, \tilde{Y}) \le \epsilon$. Take $\alpha' = \alpha - 2\frac{n}{n+1}\epsilon$. Letting $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be the output of Recipe 1 with any score function at level $\alpha'$ yields

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \ge 1 - \alpha.$$

We discuss this strategy more in Appendix A.4; the algorithm implied by Corollary 3 may not be particularly useful, as the TV distance is a badly behaved quantity that is also difficult to estimate, especially since the clean labels are inaccessible. As a final note, if the noise is bounded in TV norm, then the coverage is also not too conservative.

Corollary 4 (Corollary of Barber et al. (2023), Theorem 3). Let $\tilde{Y}$ be any random variable satisfying $D_{\text{TV}}(Y, \tilde{Y}) \le \xi$. Letting $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be the output of Recipe 1 with any score function at level $\alpha$ yields

$$P\big(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\big) \le 1 - \alpha + \frac{1}{n+1} + \frac{n}{n+1}\xi.$$

3. Risk Control Under Label Noise

3.1 Problem Setup

Up to this point, we have focused on the miscoverage loss, and the analysis conducted thus far applies exclusively to this metric. However, in real-world applications it is often desired to control metrics other than the binary loss $L^{\text{miscoverage}}(y, C) = \mathbb{I}\{y \notin C\}$, where $C$ is a set of predicted labels. Examples of such alternative losses include the F1-score or the false negative rate. The latter is particularly relevant for a high-dimensional response $Y$, as in tasks like multi-label classification or image segmentation. To address this need, researchers have developed extensions of the conformal framework that go beyond the miscoverage loss, providing a rigorous risk-control guarantee for general loss functions (Bates et al., 2021; Angelopoulos et al., 2021, 2024).
Similarly to the conformal prediction algorithm, in the risk-control setting we post-process the predictions of a model $\hat{f}$ to create a prediction set $\hat{C}_\lambda(X_{\text{test}})$ with a parameter $\lambda$ that determines its level of conservativeness: higher values yield larger, nested sets, in the sense that $\hat{C}_{\lambda_1}(\cdot) \subseteq \hat{C}_{\lambda_2}(\cdot)$ for $\lambda_1 \le \lambda_2$. For instance, if $\hat{f}_y(x)$ is an estimator of the conditional probability of $Y \mid X = x$, then the prediction sets can be defined as $\hat{C}_\lambda(X_{\text{test}}) = \{y : \hat{f}_y(X_{\text{test}}) \ge 1 - \lambda\}$. To measure the quality of $\hat{C}_\lambda(X_{\text{test}})$, we consider a loss function $L(Y_{\text{test}}, \hat{C}_\lambda(X_{\text{test}}))$, which we require to be non-increasing as a function of $\lambda$. This is analogous to conformal prediction, in which the quantile of the scores, $\hat{q}$, encodes the prediction set sizes and the error measure is simply the miscoverage loss. In the conformal prediction framework, we use a holdout set $\{(X_i, Y_i)\}_{i=1}^n$ to calibrate $\hat{q}_{\text{clean}}$ and achieve valid coverage on a new test point. Likewise, in the conformal risk control setting we aim to use the observed losses $\{L(Y_i, \hat{C}_\lambda(X_i))\}_{i=1}^n$ derived from the calibration set to find a calibrated threshold $\hat{\lambda}_{\text{clean}}$ that controls the risk of an unseen test point at a pre-specified level $\alpha$:

$$R(\hat{C}) = E\big[L(Y_{\text{test}}, \hat{C}_{\hat{\lambda}_{\text{clean}}}(X_{\text{test}}))\big] \le \alpha.$$

See Appendix C.1 for the conformal risk control procedure. Analogously to conformal prediction, these methods produce valid sets under the i.i.d. assumption, but their guarantees do not hold in the presence of label noise. Provided a noisy calibration set $\{(X_i, \tilde{Y}_i)\}_{i=1}^n$, the parameter $\hat{\lambda}_{\text{noisy}}$ is constructed using the noisy losses $\{L(\tilde{Y}_i, \hat{C}_\lambda(X_i))\}_{i=1}^n$, and therefore the risk on a new clean test point is not guaranteed to be controlled. Our main goal is to delineate when it is possible to provide a risk-control guarantee of the form

$$E\big[L(Y_{\text{test}}, \hat{C}_{\hat{\lambda}_{\text{noisy}}}(X_{\text{test}}))\big] \le \alpha.$$

In the next section, we conduct a real-data experiment, demonstrating that conformal risk control is robust to label noise.
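The calibration of $\hat{\lambda}$ described in this section can be sketched in code. This is our simplified rendering of conformal risk control for a bounded, monotone loss, under an assumed array layout of our choosing; it is not the authors' implementation.

```python
import numpy as np

def calibrate_lambda(losses, lambdas, alpha, B=1.0):
    """Pick the smallest lambda whose inflated empirical risk
    (n / (n + 1)) * Rhat(lam) + B / (n + 1) is at most alpha.
    losses[i, j] is the loss of calibration point i at lambdas[j];
    losses are assumed non-increasing in lambda and bounded by B,
    and lambdas is assumed sorted in increasing order."""
    n = losses.shape[0]
    risk = losses.mean(axis=0)
    ok = (n / (n + 1)) * risk + B / (n + 1) <= alpha
    if not ok.any():
        raise ValueError("no lambda controls the risk at level alpha")
    return lambdas[int(np.argmax(ok))]  # first (smallest) admissible lambda
```

Substituting the noisy losses for the clean ones in `losses` yields the parameter $\hat{\lambda}_{\text{noisy}}$ discussed above.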
In Section 3.3 we explain this phenomenon and specify the conditions under which conservative risk is obtained. To ease notation, we omit the subscript $\lambda$ and directly specify whether the prediction set is constructed from clean or noisy calibration data.

3.2 Empirical Evidence

Multi-label Classification. In this section, we analyze the robustness of conformal risk control (Angelopoulos et al., 2024) to label noise in a multi-label classification setting. We use the MS COCO data set (Lin et al., 2014), in which an input image may contain up to $K = 80$ positive labels, i.e., $Y \subseteq \{1, 2, \ldots, K\}$. In the following experiment, we consider the annotations in this data set as ground-truth labels. We have collected noisy labels from individual annotators who annotated 117 images in total. On average, the annotators missed or mistakenly added approximately 1.75 labels per image. See Appendix B.5 for additional details about this experimental setup and data collection. We fit a TResNet (Ridnik et al., 2021) model on 100k clean samples and calibrate it using 105 noisy samples with conformal risk control, as introduced in (Angelopoulos et al., 2024, Section 3.2), to control the false negative rate (FNR) at different levels. The FNR is the proportion of positive labels of $Y$ that are missed by the prediction set $C$, formally defined as:

$$L^{\text{FNR}}(Y, C) = 1 - \frac{|Y \cap C|}{|Y|}.$$

We measure the FNR obtained over the clean and noisy test sets, which contain 40k and 12 samples, respectively. Figure 5 displays the results, showing that the uncertainty sets attain valid risk even though they were calibrated using corrupted data. In the following section, we aim to explain these results and find the conditions under which label-noise robustness is guaranteed.

Figure 5: FNR on the MS COCO data set, achieved over noisy (red) and clean (green) test sets. The calibration scheme is applied with noisy annotations. Results are averaged over 2000 trials.
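The FNR loss defined above is straightforward to compute on label sets; a minimal sketch (the zero-loss convention for an empty positive set is our own assumption):

```python
def fnr_loss(y_true, pred_set):
    """False-negative proportion: the fraction of true positive labels
    missing from the prediction set (zero when there are no positives)."""
    y_true, pred_set = set(y_true), set(pred_set)
    if not y_true:
        return 0.0
    return 1.0 - len(y_true & pred_set) / len(y_true)
```

For instance, a sample with positives {1, 2, 3, 4} and prediction set {1, 2, 5} incurs a loss of 0.5, since two of the four positives are missed.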
Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano 3.3 Theoretical Analysis 3.3.1 Multi-label Classification In this section, we study the conditions under which conformal risk control is robust to label noise in a multi-label classification setting. Recall that each sample contains up to K positive labels, i.e., Y {1, 2, ..., K}. Here, we assume a vector-flip noise model with a binary random variable ϵi that flips the i-th label with probability P(ϵi = 1): gvector flip(y)i = yi(1 ϵi) + (1 yi)ϵi. (8) Above, yi is an indicator that takes the value 1 if the i-th label is present in y, and 0, otherwise. Notice that the random variable ϵi takes the value 1 if the i-th label in y is flipped and 0 otherwise. We further assume that the noise is not adversarial, i.e., P(ϵi = 1) < 0.5 for all i {1, 2, ..., K}. We now show that a valid FNR risk is guaranteed in the presence of label noise under the following assumptions. Proposition 4 Let b Cnoisy(Xtest) be a prediction set that contains the most likely labels, in the sense of Proposition 2, and controls the FNR risk of the noisy labels at level α. Assume the multi-label noise model gvector flip. If 1. Ytest is a deterministic function of Xtest, 2. The number of positive labels, i.e., | Ytest| | Xtest = x is a constant for all x X, then E h LFNR Ytest, b Cnoisy(Xtest) i α. Remark 3 (Dependent corruptions) Proposition 4 holds even if the elements in the noise vector ϵ depend on each other or on X. The proof and other additional theoretical results are provided in Appendices A.5.1A.5.4. Importantly, the determinism assumption on Y | X = x is reasonable as it is simply satisfied when the noiseless response is defined as the consensus outcome. Nevertheless, this assumption may not always hold in practice. Thus, in the next section, we propose alternative requirements for the validity of the FNR risk and demonstrate them in a segmentation setting. 
3.3.2 Segmentation In segmentation tasks, the goal is to assign labels to every pixel in an input image such that pixels with similar characteristics share the same label. For example, tumor segmentation can be applied to identify polyps in medical images. Here, the response is a binary matrix Y {0, 1}W H that contains the value 1 in the (i, j) pixel if it includes the object of interest and 0 otherwise. The uncertainty is represented by a prediction set C {1, ..., W} Label Noise Robustness of Conformal Prediction {1, ..., H} that includes pixels that are likely to contain the object. Similarly to the multilabel classification problem, here, we assume a vector flip noise model gvector flip, that flips the (i, j) pixel in Y , denoted as Yi,j, with probability P(ϵi,j = 1). We now show that the prediction sets constructed using a noisy calibration set are guaranteed to have conservative FNR if the clean response matrix Y and the noise variable ϵ are independent given X. Proposition 5 Let b Cnoisy(Xtest) be a prediction set that contains the most likely pixels, in the sense of Proposition 2, and controls the FNR risk of the noisy labels at level α. Suppose that: 1. The elements of the clean response matrix are independent of each other given Xtest. That is, Ytesti,j | Xtest = x Ytestm,n | Xtest = x for all (i, j) = (m, n) {1, ..., W} {1, ..., H} and x X. 2. For a given input Xtest, the noise level is the same for all response elements. 3. The noise variable ϵ is independent of Ytest | Xtest = x, and the noises of different labels are independent of each other given Xtest, similarly to condition 1. Then, E h LFNR Ytest, b Cnoisy(Xtest) i α. We note that a stronger version of this proposition that allows dependence between the elements in Ytest is given in Appendix A.5.1, and the proof is in Appendix A.5.5. The advantage of Proposition 5 over Proposition 4 is that here, the response matrix Ytest and the number of positive labels are allowed to be stochastic. 
For this reason, we believe that Proposition 5 is more suited for segmentation tasks, even though Proposition 4 applies in segmentation settings as well. 3.3.3 Regression with a General Loss This section studies the general regression setting in which Y R takes continuous values and the loss function is an arbitrary function L(y, b Cnoisy(x)) R. Here, our objective is to find tight bounds for the risk over the clean labels using the risk observed over the corrupted labels. The main result of this section accomplishes this goal while making minimal assumptions on the loss function, the noise model, and the data distribution. Proposition 6 Let b Cnoisy(Xtest) be a prediction interval that controls the risk at level α := E h L( Ytest, b Cnoisy(Xtest) i . Suppose that the second derivative of the loss L(y; C) is bounded for all y, C: q 2 y2 L(y; C) Q for some q, Q R. If the labels are corrupted by the function gadd from (3) with a noise Z that satisfies E[Z] = 0, then 2Q Var(Z) E h L(Ytest, b Cnoisy(Xtest) i α 1 If we further assume that L is convex then we obtain valid risk: E h L(Ytest, b Cnoisy(Xtest) i α. Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano The proof is detailed in Appendix A.6.1. Remarkably, Proposition 6 applies for any predictive model, calibration scheme, distribution of Y | X, and loss function that is twice differentiable. The only requirement is that the noise must be additive and with zero mean. We now demonstrate this result on a smooth approximation of the miscoverage loss, formulated as: Lsm(y, [a, b]) = 2 1 + e (2 y a b a 1) 2 (9) Corollary 5 Let b Cnoisy(Xtest) be a prediction interval. Denote the smooth miscoverage over the noisy labels as α := E h Lsm( Ytest, b Cnoisy(Xtest) i . Under the assumptions of Proposition 6, the risk of the clean labels is bounded by 2Q Var(Z) E h Lsm(Ytest, b Cnoisy(Xtest) i α 1 where q = EX h miny 2 y2 Lsm(y, b Cnoisy(X)) i and Q = EX h maxy 2 y2 Lsm(y, b Cnoisy(X)) i are known constants. 
We now build upon Corollary 5 and establish a lower bound for the coverage rate achieved by intervals calibrated with corrupted labels. Proposition 7 Let b Cnoisy(Xtest) be a prediction interval. Suppose that the labels are corrupted by the function gadd from (3) with a noise Z that satisfies E[Z] = 0, then P Ytest b Cnoisy(Xtest) 1 E[Lsm( Ytest, b Cnoisy(Xtest))] 0.5P Var(Z) where P, Q are tunable constants. In Appendix A.6.2 we give additional details about this result as well as formulate a stronger version of it that provides a tighter coverage bound. Finally, in Appendix A.6.3 we provide an additional miscoverage bound which is more informative and tight for smooth densities of Y | X = x. Table 2 summarizes all different risk-control settings with their corresponding bounds. Lastly, in Appendix A.6.4 we analyze label-noise robustness in settings where the response Y is a matrix, as in image-to-image regression tasks. 3.4 Online Learning Under Label Noise In this section, we focus on an online learning setting and show that all theoretical results presented thus far also apply to the online framework. Here, the data is given as a stream (Xt, Yt)t N in a sequential fashion. Crucially, we have access only to the noisy labels Yt, and the clean labels Yt are unavailable throughout the entire learning process. 
At time stamp t N, our goal is to construct a prediction set b Ct noisy given on all previously observed Label Noise Robustness of Conformal Prediction Table 2: Summary of coverage bounds for different risk control tasks and different noise models Task Noise model Bounds Multi-label classification Random multi-label flip: gvector-flip in (8) E h LFNR Ytest, b Cnoisy(Xtest) i α (Proposition 4) Segmentation Random multi-label flip: gvector-flip in (8) E h LFNR Ytest, b Cnoisy(Xtest) i α (Proposition 5) Regression with a general loss Additive: gadd in (3) E h L Ytest, b Cnoisy(Xtest) i α (Proposition 6) Regression with miscoverage loss Additive: gadd in (3) P Ytest b Cnoisy(Xtest) 1 E[Lsm( Ytest, b Cnoisy(Xtest))] 0.5P Var(Z) Q (Proposition 7) samples (Xt , Yt ) t 1 t =1 along with the test feature vector Xt that achieves a long-range risk controlled at a user-specified level α, i.e., R( b C) = lim T 1 T t=1 Lt(Yt, b Ct noisy(Xt)) = α. (10) Importantly, in this online learning setting, the loss function Lt might be time-dependent and may vary throughout the learning process. There have been developed calibration schemes that generate uncertainty sets with statistical guarantees in online settings in the sense of (10). A popular approach is Adaptive conformal inference (ACI), proposed by Gibbs and Candes (2021), which is an innovative online calibration scheme that constructs prediction sets with a pre-specified coverage rate, in the sense of (10) with the choice of the miscoverage loss. In contrast to ACI, Rolling risk control (Rolling RC) (Feldman et al., 2023b) extends ACI by providing a guaranteed control of a general risk that may go beyond the binary loss. The main idea behind both these methods is to tune the calibration parameter that controls the size of the prediction set according to the coverage or risk level achieved in the past. See Appendix C.2 and Appendix C.3 for more details on ACI and RRC, respectively. 
Yet, the guarantees of these approaches are invalidated when applied using corrupted data. Nonetheless, we argue that uncertainty sets constructed using corrupted data attain conservative risk in online settings under the requirements for offline label-noise robustness presented thus far. Proposition 8 Suppose that for all t N and x X we have a conservative loss: E h Lt Yt, b Ct noisy(Xt) Xt = x i E h Lt Yt, b Ct noisy(Xt) Xt = x i . t=1 Lt(Yt, b Ct noisy(Xt)) α, Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano where α is the risk computed over the noisy labels. The proof is given in Appendix A.7.1. In words, Proposition 8 states that if the expected loss at every timestamp is conservative, then the risk over long-range windows in time is guaranteed to be valid. Practically, this proposition shows that all theoretical results presented thus far apply also in the online learning setting. We now demonstrate this result in two settings: online classification and segmentation and show that valid risk is obtained under the assumptions of Proposition 2 and Proposition 5, respectively. Corollary 6 (Valid risk in online classification settings) Suppose that the distributions of Yt | Xt and Yt | Xt satisfy the assumptions in Proposition 2 for all t N. If b Ct noisy(x) contains the most likely labels, in the sense of Proposition 2, for every t N and x X, then t=1 1{Yt / b Ct noisy(Xt)} α. Corollary 7 (Valid risk in online segmentation settings) Suppose that the distributions of Yt | Xt and Yt | Xt satisfy the assumptions in Proposition 5 for all t N. If b Ct noisy(x) contains the most likely labels, in the sense of Proposition 5, for every t N and x X then t=1 LFNR(Yt, b Ct noisy(Xt)) α. Finally, in Appendix A.7.2 we analyze the effect of label noise on the miscoverage counter loss (Feldman et al., 2023b). This loss assesses conditional validity in online settings by counting occurrences of consecutive miscoverage events. 
In a nutshell, Proposition A.7 claims that with access to corrupted labels, the miscoverage counter is valid when the miscoverage risk is valid. This is an interesting result, as it connects the validity of the miscoverage counter to the validity of the miscoverage loss, where the latter is guaranteed under the conditions established in Section 2.3. 4. Experiments Software is available online at https://github.com/bat-sheva/Conformal-Label-Noise, with all code needed to reproduce the numerical experiments. 4.1 Synthetic Classification In this section, we focus on multi-class classification problems, where we study the validity of conformal prediction using different types of label noise distributions, described below. Class-independent noise. This noise model, which we call uniform flip, randomly flips the ground truth label into a different one with probability ϵ. Notice that this noise Label Noise Robustness of Conformal Prediction model slightly differs from the random flip gflip from (6), since in this uniform flip setting, a label cannot be flipped to the original label. Nonetheless, Proposition 2 states that the coverage achieved by an oracle classifier is guaranteed to increase in this setting as well. Class-dependent noise. In contrast to uniform flip noise, here, we consider a more challenging setup in which the probability of a label to be flipped depends on the ground truth class label Y . Such a noise label is often called Noisy at Random (NAR) in the literature, where certain classes are more likely to be mislabeled or confused with similar ones. Let T be a row stochastic transition matrix of size K K such that Ti,j is the probability of a point with label i to be swapped with label j. In what follows, we consider three possible strategies for building the transition matrix T. 
(1) Confusion matrix (Algan and Ulusoy, 2020): we define T as the oracle classifier s confusion matrix, up to a proper normalization to ensure the total flipping probability is ϵ. We provide a theoretical study of this case in Appendix A.3.2. (2) Rare to most frequent class (Xu et al., 2019): here, we flip the labels of the least frequent class with those of the most frequent class. This noise model is not uncommon in medical applications: imagine a setting where only a small fraction of the observations are abnormal, and thus likely to be annotated as the normal class. If switching between the rare and most frequent labels does not lead to a total probability of ϵ, we move to the next least common class, and so on. To set the stage for the experiments, we generate synthetic data with K = 10 classes as follows. The features X Rd follow a standard multivariate Gaussian distribution of dimension d = 100. The conditional distribution of Y | X is multinomial with weights wj(x) = exp((x B)j)/ PK i=1 exp((x B)i), where B Rd K whose entries are sampled independently from the standard normal distribution. In our experiments, we generate a total of 60, 000 data points, where 50, 000 are used to fit a classifier, and the remaining ones are randomly split to form calibration and test sets, each of size 5, 000. The training and calibration data are corrupted using the label noise models we defined earlier, with a fixed flipping probability of ϵ = 0.05. Of course, the test set is not corrupted and contains the ground truth labels. We apply conformal prediction using both the HPS and the APS score functions, with a target coverage level 1 α of 90%. We use two predictive models: a twolayer neural network and an oracle classifier that has access to the conditional distribution of Y | X. Finally, we report the distribution of the coverage rate as in (2) and the prediction set sizes across 100 random splits of the calibration and test data. 
As a point of reference, we repeat the same experimental protocol described above on clean data; in this case, we do not violate the i.i.d. assumption required to grant the marginal coverage guarantee in (1). The results are depicted in Figure 6. As expected, in the clean setting all conformal methods achieve the desired coverage of 90%. Under the uniform flip noise model, the coverage of the oracle classifier increases to around 94%, supporting our theoretical results from Section 2.3.3. The neural network model follows a similar trend. Although not supported by a theoretical guarantee, when corrupting the labels using the more challenging confusion matrix noise, we can see a conservative behavior similar to the uniform flip. By contrast, under the rare to most frequent class noise model, we can see a decrease in coverage, which is in line with our disclaimer from Section 2.4. Yet, observe how the Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano Neural network classifier HPS APS nominal Oracle classifier Uniform flip Confusion matrix Rare to most frequent class Average set size Uniform flip Confusion matrix Rare to most frequent class Figure 6: Effect of label noise on synthetic multi-class classification data. Performance of conformal prediction sets with target coverage 1 α = 90%, using a noisy training set and a noisy calibration set. Top: Marginal coverage; Bottom: Average size of predicted sets. The results are evaluated over 100 independent experiments and the gray bar represents the interquartile range. APS score tends to be more robust to label noise than HPS, which emphasizes the role of the score function. In Appendix B.3 we provide additional experiments with adversarial noise models that more aggressively reduce the coverage rate. Such adversarial cases are more pathological and less likely to occur in real-world settings, unless facing a malicious attacker. 
4.2 Regression Similarly to the classification experiments, we study two types of noise distributions. Label Noise Robustness of Conformal Prediction 0.0 0.01 0.1 1.0 Noise Magnitude c nominal symmetric heavy-tail symmetric light-tail asymmetric biased 0.0 0.01 0.1 1.0 Noise Magnitude c clean length Figure 7: Response-independent noise. Performance of conformal prediction intervals with target coverage 1 α = 90%, using a noisy training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals (divided by the average clean length) using symmetric, asymmetric and biased noise with a varying magnitude. The results are evaluated over 50 independent experiments, with error bars showing one standard deviation. The standard deviation of Y (without noise) and the square root of E [Var [Y | X]] are both approximately 2.6. Response-independent noise. We consider an additive noise of the form: Y = Y +c Z, where c is a parameter that allows us to control the noise level. The noise component Z is a random variable sampled from the following distributions. (1) Symmetric light tailed: standard normal distribution; (2) Symmetric heavy tailed: t-distribution with one degree of freedom; (3) Asymmetric: standard Gumbel distribution, normalized to have zero mean and unit variance; and (4) Biased: positive noise formulated as the absolute value of the symmetric heavy tailed noise above. Response-dependent noise. Analogously to the class-dependent noise from Section 4.1, we define more challenging noise models as follows. (1) Contractive: this corruption pushes the ground truth response variables towards their mean. Formally, Yi = Yi Yi 1 n Pn i=1 Yi U, where U is a random uniform variable defined on the segment [0,0.5], and n is the number of calibration points. 
(2) Dispersive: this noise introduces some form of a dispersion effect on the ground truth response, which takes the opposite form of the contractive model, given by Yi = Yi + Yi 1 n Pn i=1 Yi U. Having defined the noise models, we turn to describe the data-generating process. We simulate a 100-dimensional X whose entries are sampled independently from a uniform distribution on the segment [0, 5]. Following Romano et al. (2019), the response variable is generated as follows: Y Pois(sin2( X) + 0.1) + 0.03 X η1 + 25 η2 1 {U < 0.01} , (11) where X is the mean of the vector X, and Pois(λ) is the Poisson distribution with mean λ. Both η1 and η2 are i.i.d. standard Gaussian variables, and U is a uniform random variable on [0, 1]. The right-most term in (11) creates a few but large outliers. Figure 17 Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano in the appendix illustrates the effect of the noise models discussed earlier on data sampled from (11). We apply conformal prediction with the CQR score (Romano et al., 2019) for each noise model as follows. First, we fit a quantile random forest model on 8, 000 noisy training points; we then calibrate the model using 2, 000 fresh noisy samples; and, lastly, test the performance on additional 5, 000 clean, ground truth samples. The results are summarized in Figures 7 and 8. Observe how the prediction intervals tend to be conservative under symmetric, both for lightand heavy-tailed noise distributions, asymmetric, and dispersive corruption models. Intuitively, this is because these noise models increase the variability of Y ; in Proposition 1 we prove this formally for any symmetric independent noise model, whereas here we show this result holds more generally even for response-dependent noise. By contrast, the prediction intervals constructed under the biased and contractive corruption models tend to under-cover the response variable. 
This should not surprise us: following Figure 17(c), the biased noise shifts the data upwards , and, consequently, the prediction intervals are undesirably pushed towards the positive quadrants. Analogously, the contractive corruption model pushes the data towards the mean, leading to intervals that are too narrow. Figure 20 in the appendix illustrates the scores achieved when using the different noise models and the 90% th empirical quantile of the CQR scores. This figure supports the behavior witnessed in Figures 7, 18 and 19: over-coverage is achieved when ˆqnoisy is larger than ˆqclean, and under-coverage is obtained when ˆqnoisy is smaller. In Appendix B.4 we study the effect of the predictive model on the coverage property, for all noise models. To this end, we repeat similar experiments to the ones presented above, however, we now fit the predictive model on clean training data; the calibration data remains noisy. We also provide an additional adversarial noise model that reduces the coverage rate, but is unlikely to appear in real-world settings. Figures 18 and 19 in the appendix depict a similar behaviour for most noise models, except the biased noise for which the coverage requirement is not violated. This can be explained by the improved estimation of the low and high conditional quantiles, as these are fitted on clean data and thus less biased. 4.3 Multi-label Classification In this section, we analyze conformal risk control in a multi-label classification task. For this purpose, we use the CIFAR-100N data set (Wei et al., 2022), which contains 50K colored images. Each image belongs to one of a hundred fine classes that are grouped into twenty mutually exclusive coarse super-classes. Furthermore, every image has a noisy and a clean label, where the noise rate of the fine categories is 40% and of the coarse categories is 25%. 
We turn this single-label classification task into a multi-label classification task by merging four random images into a 2 by 2 grid. Every image is used once in each position of the grid, and therefore this new data set consists of 50K images, where each is composed of four sub-images and thus associated with up to four labels. Figure 9 displays a visualization of this new variant of the CIFAR-100N data set. Label Noise Robustness of Conformal Prediction contractive dispersive Noise noisy clean nominal contractive dispersive Noise Figure 8: Dispersive versus contractive noise regression experiment. Performance of conformal prediction intervals with target coverage 1 α = 90%, using a noisy training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals. The results are evaluated over 50 independent experiments and the gray bar represents the interquartile range. True labels: baby, mushroom, tulip, bee. Noisy labels: baby, mushroom, sweet pepper, bee. Figure 9: Visualization of the multi-label CIFAR-100N data. We fit a TRes Net (Ridnik et al., 2021) model on 40k noisy samples and calibrate it using 2K noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2). We control the false-negative rate (FNR) defined in (7) at different levels and measure the FNR obtained over clean and noisy versions of the test set, which contains 8k samples. We conducted this experiment twice: once with the fine-classed labels and once with the super-classed labels. Figure 10 presents the results in both settings, showing that the risk obtained over the clean labels is valid for every nominal level. Importantly, this corruption setting violates the assumptions of Proposition 4, as the positive label count may vary across different noise instantiations. This experiment reveals that valid risk can be achieved in the presence of label noise even when the corruption model violates the requirements of our theory. 
Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano Figure 10: FNR achieved over noisy (red) and clean (green) test sets of the on CIFAR-100N data set. Left: fine labels. Right: coarse labels. The calibration scheme is applied using noisy annotations in all settings. Results are averaged over 50 trials. 4.4 Segmentation In this section, we follow a common artificial label corruption methodology, following Zhao and Gomes (2021); Kumar et al. (2020), and analyze three corruption setups that are special cases of the vector-flip noise model from (8): independent, dependent, and partial. In the independent setting, each pixel s label is flipped with probability β, independently of the others. In the dependent setting, however, two rectangles in the image are entirely flipped with probability β, and the other pixels are flipped independently with probability β. Finally, in the partial noise setting, only one rectangle in the image is flipped with probability β, and the other pixels are unchanged. Figure 11: FNR on a polyp segmentation data set, achieved over noisy (red) and clean (green) test sets. Left: independent noise. Middle: dependent noise. Right: partial noise. The predictive model is calibrated using noisy annotations in all settings, where the noise level is set to β = 0.1. Results are averaged over 1000 trials. We experiment on a polyp segmentation task, pooling data from several polyp data sets: Kvasir, CVC-Colon DB, CVC-Clinic DB, and ETIS-Larib. We consider the annotations given in the data as ground-truth labels and artificially corrupt them according to the corruption setups described above, to generate noisy labels. We use Pra Net (Fan et al., 2020) as a base Label Noise Robustness of Conformal Prediction model and fit it over 1450 noisy samples. Then, we calibrate it using 500 noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2) to control the false-negative rate (FNR) from (7) at different levels. 
Finally, we evaluate the FNR over clean and noisy versions of the test set, which contains 298 samples, and report the results in Figure 11. This figure indicates that conformal risk control is robust to label noise, as the constructed prediction sets achieve conservative risk in all experimented noise settings. This is not a surprise, as it is guaranteed by Propositions 5 and A.4. 4.5 Regression Risk Bounds We now turn to demonstrate the coverage rate bounds derived in Section 3.3.3 on real and synthetic regression data sets. We examine two real benchmarks: meps 19 and bio used in (Romano et al., 2019), and one synthetic data set that was generated from a bimodal density function with a sharp slope, as visualized in Figure 12. The simulated data was deliberately designed to invalidate the assumptions of our label-noise robustness requirements in Proposition 1. Consequentially, prediction intervals that do not cover the two peaks of the density function might undercover the true outcome, even if the noise is dispersive. Therefore, this gap calls for our distribution-free risk bounds from Section 3.3.3, which are applicable in this setup, in contrast to Proposition 1. The former approach can be used to assess the worst risk level that may be obtained in practice. We consider the labels given Figure 12: Visualization of the marginal density function of the adversarial synthetic data. in the real and synthetic data sets as ground truth and artificially corrupt them according to the additive noise model (3). The added noise is independently sampled from a normal distribution with mean zero and variance 0.1Var(Y ). For each data set and nominal risk level α, fit a quantile regression model on 12K samples and learn the α/2, 1 α/2 conditional quantiles of the noisy labels. Then, we calibrate its outputs using another 12K samples of the data with conformal risk control, by Angelopoulos et al. (2024), to control the smooth miscoverage (9) at level α. 
Finally, we evaluate the performance on the test set which consists of 6K samples. We also compute the smooth miscoverage risk bounds according to Corollary 5 with a noise variance set to 0.1. Einbinder, Feldman, Bates, Angelopoulos, Gendler, Romano Figure 13 presents the risk bound along with the smooth miscoverage obtained over the clean and noisy versions of the test set. This figure indicates that conformal risk control generates invalid uncertainty sets when applied on the simulated noisy data, as we anticipated. Additionally, this figure shows that the proposed risk bounds are valid and tight, meaning that these are informative and effective. Moreover, this figure highlights the main advantage of the proposed risk bounds: their validity is universal across all distributions of the response variable and the noise component. Lastly, we note that in Appendix B.7 we repeat this experiment with the miscoverage loss and display the miscoverage bound derived from Corollary 7. Figure 13: Smooth miscoverage rate achieved over noisy (red) and clean (green) test sets. The calibration scheme is applied using noisy annotations to control the smooth miscoverage level. Results are averaged over 10 random splits of the calibration and test sets. 4.6 Online Learning This section studies the effect of label noise on uncertainty quantification methods in an online learning setting, as formulated in Section 3.4. We experiment on a depth estimation task (Geiger et al., 2013), where the objective is to predict a depth map given a colored image. In other words, X RW H 3 is an input RGB image of size W H and Y RW H is its corresponding depth map. We consider the original depth values given in this data as ground truth and artificially corrupt them according to the additive noise model defined in (3) to produce noisy labels. Specifically, we add to each depth pixel an independent random noise drawn from a normal distribution with zero mean and 0.7 variance. 
Here, the depth uncertainty of the i, j pixel is represented by a prediction interval Ci,j(X) R. Ideally, the estimated intervals should contain the correct depth values at a pre-specified level 1 α. In this high-dimensional setting, this requirement is formalized as controlling the image miscoverage loss, defined as: Lim(Y, C(X)) = 1 WH j=1 1{Y i,j / Ci,j(X)}. (12) In words, the image miscoverage loss measures the proportion of depth values that were not covered in a given image. Label Noise Robustness of Conformal Prediction For this purpose, we employ the calibration scheme Rolling RC (Feldman et al., 2023b), which constructs uncertainty sets in an online setting with a valid risk guarantee in the sense of (10). We follow the experimental protocol outlined in (Feldman et al., 2023b, Section 4.2) and apply Rolling RC with an exponential stretching to control the image miscoverage loss at different levels on the observed, noisy, labels. We use Le Re S (Yin et al., 2021) as a base model, which was pre-trained on a clean training set that corresponds to timestamps 1,...,6000. We continue training it and updating the calibration scheme in an online fashion on the following 2000 timestamps. We consider these samples, indexed by 6001 to 8000, as a validation set and use it to choose the calibration s hyperparameters, as explained in (Feldman et al., 2023b, Section 4.2). Finally, we continue the online procedure on the test samples whose indexes correspond to 8001 to 10000, and measure the performance on the clean and noisy versions of this test set. Figure 14 displays the risk achieved by this technique over the clean and corrupted labels. This figure indicates that Rolling RC attain valid image miscoverage over the unknown noiseless labels. 
This is not a surprise, as it is supported by Proposition A.6, which guarantees conservative image miscoverage under label noise in an offline setting, and Proposition 8, which states that the former result applies to an online learning setting as well.

Figure 14: Image miscoverage achieved by Rolling RC. Results are averaged over 10 random trials.

5. Discussion

5.1 Related Work

Label noise, independently of conformal prediction, has been well studied, especially for training more robust predictive models; see, for example, (Angluin and Laird, 1988; Frénay and Verleysen, 2013; Tanno et al., 2019; Algan and Ulusoy, 2020; Kumar et al., 2020). Recently, there has been a body of work studying the statistical properties of conformal prediction (Lei et al., 2018; Barber, 2020) and its performance under deviations from exchangeability (Tibshirani et al., 2019; Podkopaev and Ramdas, 2021). This line of work is relevant to us since the label noise setting violates the exchangeability assumption between the training and testing data, where the latter is clean while the former is noisy. However, these works cannot be applied in our setting since they assume covariate or label shift only. The work by Barber et al. (2023) covers arbitrary distribution shifts, and we build upon some of its results in our general disclaimer in Section 2.4. Another relevant work is (Farinhas et al., 2024), which studies the effect of a general distribution shift on the obtained risk of risk-controlling techniques, thus extending the work of Barber et al. (2023) to loss functions more general than the miscoverage loss. Additionally, Angelopoulos et al. (2024) analyze their proposed risk-controlling method in covariate shift and general distribution shift settings. Works close to ours include (Stutz et al., 2023), which analyzes the performance of conformal prediction under ambiguous ground truth, and Cauchois et al.
(2022), which studies conformal prediction with weak supervision, which can be interpreted as a type of label noise. Lastly, a follow-up work to ours has recently been published (Sesia et al., 2023). Similarly to our work, it begins by studying the effect of label noise on the coverage achieved by standard conformal prediction. Then, an explicit factor that captures the inflation or deflation of the coverage is estimated and used to adjust the desired coverage rate. The theoretical analysis requires no assumptions on the contamination process, but in order to obtain a practical method that can automatically adapt to label noise, some mild assumptions on the relation between the clean and observable labels are imposed. Indeed, the presented experiments demonstrate less conservative coverage compared to the standard method. However, there are two key distinctions between this work and ours. It focuses on controlling the coverage rate in classification tasks, whilst our analysis extends to regression tasks, general risk control, and online settings. Additionally, it aims to modify the calibration algorithm to account for label noise, whereas our goal is to test the limits of conformal prediction in the label noise setting and reveal the conditions on the scores, noise models, and predictive models under which the standard algorithm remains valid despite the presence of noisy labels.

5.2 Future Research Directions

Our work raises many new questions. First, one can try to define a score function that is more robust to label noise, continuing the line of Gendler et al. (2021); Frénay and Verleysen (2013); Cheng et al. (2022). Second, an important remaining question is how to achieve exact risk control on the clean labels using minimal information about the noise model.
Lastly, it would be interesting to analyze the robustness of alternative conformal methods, such as cross-conformal and jackknife+ (Vovk, 2015; Barber et al., 2021), that do not require data splitting.

Acknowledgments

Y.R., A.G., B.E., and S.F. were supported by the ISRAEL SCIENCE FOUNDATION (grant No. 729/21). Y.R. thanks the Career Advancement Fellowship, Technion, for providing research support. A.N.A. was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. S.F. thanks Aviv Adar, Idan Aviv, Ofer Bear, Tsvi Bekker, Yoav Bourla, Yotam Gilad, Dor Sirton, and Lia Tabib for annotating the MS COCO data set.

Appendix A. Mathematical Proofs

A.1 General Analysis

We begin by proving Theorem 1.

Proof [Proof of Theorem 1] Our assumption states that $\mathbb{P}(\tilde{s}_{\text{test}} \le t) \le \mathbb{P}(s_{\text{test}} \le t)$. Note that the probability is taken only over $s_{\text{test}}$. Since $\hat{q}_{\text{noisy}}$ is constant (measurable) with respect to this probability, we have that, for any $\alpha \in (0, 1)$,
$$\mathbb{P}(s_{\text{test}} \le \hat{q}_{\text{noisy}}) \ge \mathbb{P}(\tilde{s}_{\text{test}} \le \hat{q}_{\text{noisy}}) \ge 1 - \alpha.$$
This implies that $Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})$ with probability at least $1-\alpha$, completing the proof of the lower bound. Regarding the upper bound, by the same argument,
$$\mathbb{P}(s_{\text{test}} \le \hat{q}_{\text{noisy}}) \le \mathbb{P}(\tilde{s}_{\text{test}} \le \hat{q}_{\text{noisy}}) + u \le 1 - \alpha + \frac{1}{n+1} + u.$$

A.2 Regression

A.2.1 General Regression Result

Here we provide an extension of Proposition 1 and prove it.

Theorem A.1 Suppose an additive noise model $g_{\text{add}}$ with a noise that has mean 0. Denote the prediction interval by $C(x) = [a_x, b_x]$. If for all $x \in \mathcal{X}$ the density of $Y \mid X = x$ is peaked inside the interval, i.e.,
1. $\forall \varepsilon \ge 0:\ f_{Y|X=x}(b_x + \varepsilon) \le f_{Y|X=x}(b_x - \varepsilon)$, and
2. $\forall \varepsilon \ge 0:\ f_{Y|X=x}(a_x - \varepsilon) \le f_{Y|X=x}(a_x + \varepsilon)$,
then we obtain valid conditional coverage:
$$\mathbb{P}(Y \in C(x) \mid X = x) \ge \mathbb{P}(\tilde{Y} \in C(x) \mid X = x).$$

Proof For ease of notation, we omit the conditioning on $X = x$; that is, we treat $Y$ as $Y \mid X = x$ for some $x \in \mathcal{X}$. We begin by showing $\mathbb{P}(\tilde{Y} \le b) \le \mathbb{P}(Y \le b)$.
$$\begin{aligned}
\mathbb{P}(\tilde{Y} \le b) &= \mathbb{P}(Y + \varepsilon \le b) \\
&= \mathbb{P}(Y + \varepsilon \le b \mid \varepsilon \ge 0)\,\mathbb{P}(\varepsilon \ge 0) + \mathbb{P}(Y + \varepsilon \le b \mid \varepsilon \le 0)\,\mathbb{P}(\varepsilon \le 0) \\
&= \tfrac{1}{2}\,\mathbb{P}(Y + \varepsilon \le b \mid \varepsilon \ge 0) + \tfrac{1}{2}\,\mathbb{P}(Y + \varepsilon \le b \mid \varepsilon \le 0) \\
&= \tfrac{1}{2}\,\mathbb{P}(Y \le b - \varepsilon \mid \varepsilon \ge 0) + \tfrac{1}{2}\,\mathbb{P}(Y \le b + \varepsilon \mid \varepsilon \le 0) \\
&= \tfrac{1}{2}\,\mathbb{E}_{\varepsilon \ge 0}\left[\mathbb{P}(Y \le b - \varepsilon) + \mathbb{P}(Y \le b + \varepsilon)\right] \\
&= \mathbb{P}(Y \le b) + \tfrac{1}{2}\,\mathbb{E}_{\varepsilon \ge 0}\left[\mathbb{P}(Y \le b - \varepsilon) - \mathbb{P}(Y \le b) + \mathbb{P}(Y \le b + \varepsilon) - \mathbb{P}(Y \le b)\right] \\
&= \mathbb{P}(Y \le b) + \tfrac{1}{2}\,\mathbb{E}_{\varepsilon \ge 0}\left[\mathbb{P}(Y \le b + \varepsilon) - \mathbb{P}(Y \le b) - \left(\mathbb{P}(Y \le b) - \mathbb{P}(Y \le b - \varepsilon)\right)\right] \\
&= \mathbb{P}(Y \le b) + \tfrac{1}{2}\,\mathbb{E}_{\varepsilon \ge 0}\left[\mathbb{P}(b \le Y \le b + \varepsilon) - \mathbb{P}(b - \varepsilon \le Y \le b)\right] \\
&\le \mathbb{P}(Y \le b).
\end{aligned}$$
The last inequality follows from the assumption that $\forall \varepsilon \ge 0:\ f_Y(b + \varepsilon) \le f_Y(b - \varepsilon)$. The proof of $\mathbb{P}(\tilde{Y} \le a) \ge \mathbb{P}(Y \le a)$ is similar and hence omitted. We get that:
$$\mathbb{P}(\tilde{Y} \in C(x)) = \mathbb{P}(\tilde{Y} \le b) - \mathbb{P}(\tilde{Y} \le a) \le \mathbb{P}(Y \le b) - \mathbb{P}(Y \le a) \quad \text{(follows from the above)} \quad = \mathbb{P}(Y \in C(x)).$$

We now turn to prove Proposition 1 using Theorem A.1.

Proof Since the density of $Y \mid X = x$ is symmetric and unimodal for all $x \in \mathcal{X}$, it is peaked inside any prediction interval that contains its median. Therefore, the prediction interval achieves valid conditional coverage:
$$\forall x \in \mathcal{X} : \mathbb{P}(\tilde{Y} \in C(x) \mid X = x) \le \mathbb{P}(Y \in C(x) \mid X = x).$$
By taking the expectation over $X \sim P_X$ we obtain valid marginal coverage.

A.3 Classification

A.3.1 General Classification Result

We commence by providing an extension of Proposition 2 and proving it.

Theorem A.2 Suppose that $\hat{C}_{\text{noisy}}(x) \subseteq \{1, \dots, K\}$ is a prediction set. Denote $\beta := \mathbb{P}(\tilde{Y} \in \hat{C}_{\text{noisy}}(x) \mid X = x)$. First, the coverage rate achieved over the clean labels is upper bounded by:
$$\mathbb{P}(Y \in \hat{C}_{\text{noisy}}(x) \mid X = x) \le \beta + \frac{1}{2}\sum_{i=1}^{K} \left|\mathbb{P}(\tilde{Y} = i \mid X = x) - \mathbb{P}(Y = i \mid X = x)\right|.$$
Further suppose that for all $x \in \mathcal{X}$:
$$\sum_{i \in \hat{C}_{\text{noisy}}(x)} \mathbb{P}(Y = i \mid X = x) - \mathbb{P}(\tilde{Y} = i \mid X = x) \ge 0.$$
Then, the coverage rate is lower bounded by:
$$\mathbb{P}(Y \in \hat{C}_{\text{noisy}}(x) \mid X = x) \ge \mathbb{P}(\tilde{Y} \in \hat{C}_{\text{noisy}}(x) \mid X = x).$$

Proof First, for ease of notation, we omit the conditioning on $x$ and consider a prediction set $C \subseteq \{1, \dots, K\}$. Notice that if $C = \emptyset$ the proposition is trivially satisfied. We begin by proving the lower bound $\mathbb{P}(Y \in C) \ge \beta := \mathbb{P}(\tilde{Y} \in C)$. Denote $\delta_i := \mathbb{P}(Y = i) - \mathbb{P}(\tilde{Y} = i)$. Since $\sum_{i \in C} \delta_i \ge 0$ we get that:
$$\sum_{i \in C} \mathbb{P}(Y = i) = \sum_{i \in C} \mathbb{P}(\tilde{Y} = i) + \sum_{i \in C} \delta_i \ge \sum_{i \in C} \mathbb{P}(\tilde{Y} = i) + 0 = \mathbb{P}(\tilde{Y} \in C) = \beta.$$
We now turn to prove the upper bound. Since $\sum_{i=1}^{K} \delta_i = 0$,
$$\sum_{i \in C} \delta_i = -\sum_{i \in \{1,\dots,K\} \setminus C} \delta_i \le \frac{1}{2}\sum_{i=1}^{K} \left|\mathbb{P}(\tilde{Y} = i) - \mathbb{P}(Y = i)\right|.$$
This gives us:
$$\mathbb{P}(Y \in C) = \mathbb{P}(\tilde{Y} \in C) + \sum_{i \in C} \delta_i \le \mathbb{P}(\tilde{Y} \in C) + \frac{1}{2}\sum_{i=1}^{K} \left|\mathbb{P}(\tilde{Y} = i) - \mathbb{P}(Y = i)\right|,$$
and this concludes the proof.

We now turn to prove Proposition 2 using Theorem A.2.

Proof We suppose without loss of generality that the labels are ranked from the most likely to the least likely, i.e., $\mathbb{P}(Y = i) \ge \mathbb{P}(Y = i + 1)$. Since the prediction set contains the most likely labels, there exists some $1 \le m \le K$ such that the prediction set is $C = \{1, 2, \dots, m\}$. We only need to show $\sum_{i=1}^{m} \mathbb{P}(Y = i) - \mathbb{P}(\tilde{Y} = i) \ge 0$ for $m \in \{1, \dots, K\}$, and the result will follow directly from Theorem A.2. We follow the notation of the proof of Theorem A.2. First, we observe that under the assumed noise model, $\mathbb{P}(Y = i) \ge \frac{1}{K}$ implies $\delta_i \ge 0$. Denote by $1 \le g \le K$ the largest index for which $\mathbb{P}(Y = g) \ge \frac{1}{K}$. It follows that:
$$\forall i \in \{1, \dots, K\} : i \le g \implies \delta_i \ge 0.$$
If $m \le g$ then all summands in $\sum_{i=1}^{m} \delta_i$ are non-negative, since $\delta_i \ge 0$ for every $i \le m \le g$. Thus, $\sum_{i=1}^{m} \delta_i \ge 0$ as required. Otherwise,
$$\sum_{i=1}^{m} \delta_i = \sum_{i=1}^{K} \delta_i - \sum_{i=m+1}^{K} \delta_i = -\sum_{i=m+1}^{K} \delta_i \ge 0,$$
since $\delta_i \le 0$ for every $i > m \ge g$ and $\sum_{i=1}^{K} \delta_i = 0$. Therefore, from Theorem A.2 we get:
$$\beta \le \mathbb{P}(Y \in \hat{C}_{\text{noisy}}(x) \mid X = x) \le \beta + \frac{1}{2}\sum_{i=1}^{K} \left|\mathbb{P}(\tilde{Y} = i \mid X = x) - \mathbb{P}(Y = i \mid X = x)\right|.$$
By taking the expectation over $X \sim P_X$ we obtain the desired marginal coverage bounds. The only non-trivial transition is marginalizing the total-variation distance, which follows from the integral absolute value inequality:
$$\int_{x \in \mathcal{X}} \left|\mathbb{P}(\tilde{Y} = i \mid X = x) - \mathbb{P}(Y = i \mid X = x)\right| dP_X \ge \left|\int_{x \in \mathcal{X}} \mathbb{P}(\tilde{Y} = i \mid X = x) - \mathbb{P}(Y = i \mid X = x)\, dP_X\right| = \left|\mathbb{P}(\tilde{Y} = i) - \mathbb{P}(Y = i)\right|.$$

Finally, we turn to prove Corollary 2.

Proof We follow the notation and assumptions of Theorem A.2. We only need to show that $\frac{1}{2}\sum_{i=1}^{K} |\delta_i| \le \varepsilon \frac{K-1}{K}$, and the result will follow directly from Theorem A.2. In the random flip setting,
$$|\delta_i| = \left|\mathbb{P}(Y = i)\varepsilon - \frac{\varepsilon}{K}\right| = \varepsilon\left|\mathbb{P}(Y = i) - \frac{1}{K}\right|.$$
Therefore:
$$\frac{1}{2}\sum_{i=1}^{K} |\delta_i| = \frac{1}{2}\sum_{i=1}^{K} \varepsilon\left|\mathbb{P}(Y = i) - \frac{1}{K}\right| = \frac{\varepsilon}{2}\left[\sum_{i=1}^{g}\left(\mathbb{P}(Y = i) - \frac{1}{K}\right) + \sum_{i=g+1}^{K}\left(\frac{1}{K} - \mathbb{P}(Y = i)\right)\right] \le \varepsilon\frac{K-1}{K},$$
where the last inequality holds since the positive and negative parts of $\sum_i \left(\mathbb{P}(Y=i) - \frac{1}{K}\right)$ are equal and each is at most $\frac{K-1}{K}$. Thus, we get:
$$\mathbb{P}(\tilde{Y} \in C) \le \mathbb{P}(Y \in C) \le \mathbb{P}(\tilde{Y} \in C) + \frac{1}{2}\sum_{i=1}^{K}\left|\mathbb{P}(\tilde{Y} = i) - \mathbb{P}(Y = i)\right| \le \mathbb{P}(\tilde{Y} \in C) + \varepsilon\frac{K-1}{K},$$
which concludes the proof.

A.3.2 Confusion Matrix

The confusion matrix noise model is more realistic than the random flip. However, there exists a score function that causes conformal prediction to fail for any non-identity confusion matrix. We define the corruption model as follows: consider a matrix $T$ in which $(T)_{i,j} = \mathbb{P}(\tilde{Y} = j \mid Y = i)$, and let
$$g_{\text{confusion}}(y) = \begin{cases} 1 & \text{w.p. } T_{1,y} \\ \ \vdots & \\ K & \text{w.p. } T_{K,y}. \end{cases}$$

Proposition A.1 Let $\hat{C}_{\text{noisy}}$ be constructed as in Recipe 1 with any score function $s$ and the corruption function $g_{\text{confusion}}$. Then,
$$\mathbb{P}\left(Y_{\text{test}} \in \hat{C}_{\text{noisy}}(X_{\text{test}})\right) \ge 1 - \alpha$$
if and only if for all classes $j \in \{1, \dots, K\}$,
$$\sum_{j'=1}^{K} \mathbb{P}(\tilde{Y} = j' \mid Y = j)\, \mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j') \le \mathbb{P}(s \le t \mid Y = j).$$

The proof is below.

Proof By the law of total probability, $\mathbb{P}(s \le t) = \mathbb{E}\left[\mathbb{P}(s \le t \mid Y = j)\right] = \sum_{j=1}^{K} w_j\, \mathbb{P}(s \le t \mid Y = j)$, where $w_j := \mathbb{P}(Y = j)$. But under the noise model, we have instead that
$$\mathbb{P}(\tilde{s} \le t) = \mathbb{E}\left[\mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j')\right] = \sum_{j=1}^{K} w_j \sum_{j'=1}^{K} \mathbb{P}(\tilde{Y} = j' \mid Y = j)\, \mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j').$$
We can write
$$\mathbb{P}(\tilde{s} \le t) - \mathbb{P}(s \le t) = \sum_{j=1}^{K} w_j \sum_{j'=1}^{K} \mathbb{P}(\tilde{Y} = j' \mid Y = j)\, \mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j') - \sum_{j=1}^{K} w_j\, \mathbb{P}(s \le t \mid Y = j).$$
Combining the sums and factoring, the above display equals
$$\sum_{j=1}^{K} w_j\, \mathbb{P}(s \le t \mid Y = j)\left[\frac{\sum_{j'=1}^{K} \mathbb{P}(\tilde{Y} = j' \mid Y = j)\, \mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j')}{\mathbb{P}(s \le t \mid Y = j)} - 1\right].$$
The stochastic dominance condition holds uniformly over all choices of base probabilities $w_j$ if and only if for all $j \in [K]$,
$$\sum_{j'=1}^{K} \mathbb{P}(\tilde{Y} = j' \mid Y = j)\, \mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j') \le \mathbb{P}(s \le t \mid Y = j).$$
Notice that the left-hand side of the above display is a convex mixture of the quantities $\mathbb{P}(\tilde{s} \le t \mid \tilde{Y} = j')$ for $j' \in [K]$. Thus, the necessary and sufficient condition is for the noise distribution $\mathbb{P}(\tilde{Y} = j' \mid Y = j)$ to place sufficient mass on the classes $j'$ whose quantiles are larger than $\mathbb{P}(s \le t \mid Y = j)$.
But of course, without assumptions on the model and score, the latter is unknown, so it is impossible to say which noise distributions will preserve coverage.

A.3.3 Prediction Set Size Analysis

When using the non-random APS scores with the oracle model, the prediction set size for a given $x$ can be expressed as
$$k^* = \min\left\{k : \sum_{j=1}^{k} \pi_{(j)}(x) \ge 1 - \alpha\right\},$$
where $\pi_y(x) = \mathbb{P}(Y = y \mid X = x)$, the desired coverage level is $1-\alpha$, and $\pi_{(1)}(x) \ge \pi_{(2)}(x) \ge \dots \ge \pi_{(K)}(x)$ are the order statistics of $\pi_1(x), \pi_2(x), \dots, \pi_K(x)$. Under the random flip noise, the noisy conditional class probabilities are given by
$$\tilde{\pi}_y(x) = (1 - \epsilon)\pi_y(x) + \frac{\epsilon}{K},$$
where $K$ is the number of labels and $\epsilon$ is the fraction of flipped labels. Therefore, the noisy set size is given by:
$$k^{*,\text{noisy}} = \min\left\{k : \sum_{j=1}^{k} \tilde{\pi}_{(j)}(x) \ge 1 - \alpha\right\}, \quad \text{where} \quad \sum_{j=1}^{k} \tilde{\pi}_{(j)}(x) = \sum_{j=1}^{k} \pi_{(j)}(x) + \epsilon\left(\frac{k}{K} - \sum_{j=1}^{k} \pi_{(j)}(x)\right).$$
As a result, we get $k^{*,\text{noisy}} \ge k^*$, where the term that controls the inflation of the noisy set size, $\epsilon\left(\frac{k}{K} - \sum_{j=1}^{k} \pi_{(j)}(x)\right)$, is non-positive and is a function of the noise level and the oracle conditional class probabilities.

A.4 Distribution-Free Results

We begin by proving Proposition 3.

Proof [Proof of Proposition 3] For convenience, assume the existence of probability density functions $p$ and $\tilde{p}$ for $Y$ and $\tilde{Y}$, respectively (these can be taken to be probability mass functions if $Y$ is discrete). Also define the multiset of $Y$ values $E = \{Y_1, \dots, Y_n\}$ and the corresponding multiset of $\tilde{Y}$ values $\tilde{E}$. Take the set $A = \{y : \tilde{p}(y) > p(y)\}$. Since $Y \ne_d \tilde{Y}$, we know that the set $A$ is nonempty and $\mathbb{P}(\tilde{Y} \in A) = \delta_1 > \mathbb{P}(Y \in A) = \delta_2 \ge 0$. The adversarial choice of score function will be $s(x, y) = \mathbb{1}\{y \in A^c\}$; it puts high mass wherever the ground truth label is more likely than the noisy label. The crux of the argument is that this design makes the quantile smaller when it is computed on the noisy data than when it is computed on clean data, as we next show. Begin by noticing that, because $s(x, y)$ is binary, $\hat{q}_{\text{clean}}$ is also binary, and therefore, for $t \in (0, 1]$, $\hat{q}_{\text{clean}} \ge t$ if and only if $\hat{q}_{\text{clean}} = 1$.
Furthermore, $\hat{q}_{\text{clean}} = 1$ if and only if $|E \cap A| < \lceil(n + 1)(1 - \alpha)\rceil$. Thus, these events are the same, and for any $t \in (0, 1]$,
$$\mathbb{P}(\hat{q}_{\text{clean}} \ge t) = \mathbb{P}\left(|E \cap A| < \lceil(n + 1)(1 - \alpha)\rceil\right).$$
By the definition of $A$, we have that
$$\mathbb{P}\left(|E \cap A| < \lceil(n + 1)(1 - \alpha)\rceil\right) > \mathbb{P}\left(|\tilde{E} \cap A| < \lceil(n + 1)(1 - \alpha)\rceil\right).$$
Chaining the inequalities, we get
$$\mathbb{P}(\hat{q}_{\text{clean}} \ge t) > \mathbb{P}\left(|\tilde{E} \cap A| < \lceil(n + 1)(1 - \alpha)\rceil\right) = \mathbb{P}(\hat{q} \ge t).$$
Since $s_{n+1}$ is measurable with respect to $E$ and $\tilde{E}$, we can plug it in for $t$, yielding the conclusion.

Remark A.3 In the above argument, if one further assumes continuity of the (ground truth) score function and $\mathbb{P}(\tilde{Y} \in A) = \mathbb{P}(Y \in A) + \rho$ for
$$\rho = \inf\left\{\rho' > 0 : \operatorname{BinomCDF}\!\left(n, \delta_1, \lceil(n+1)(1-\alpha)\rceil - 1\right) \le 1 - \operatorname{BinomCDF}\!\left(n, \delta_2 + \rho', \lceil(n+1)(1-\alpha)\rceil - 1\right)\right\},$$
then $\mathbb{P}(s_{n+1} \le \hat{q}) < 1 - \alpha$. In other words, the noise must have some sufficient magnitude in order to disrupt coverage.

Next, we provide an additional impossibility result, similar to Proposition 3.

Proposition A.2 Consider the score $s(x, y) = |\hat{f}(x) - y|$ and let $\tilde{Y} = \beta Y$ for $\beta \in (0, 1)$. Assume further that $\hat{f}(x) = \mathbb{E}[\tilde{Y} \mid X = x]$. Then there exists $\alpha_0 \in [0, 1]$ such that for any $\alpha$ satisfying $1 - \alpha \ge 1 - \alpha_0$:
$$\mathbb{P}(Y \in \hat{C}_{\text{noisy}}(X)) \le 1 - \alpha.$$

The proof is provided below.

Proof Denote $\mu_x := \mathbb{E}[Y \mid X = x]$. Then $\hat{f}(x) = \mathbb{E}[\tilde{Y} \mid X = x] = \mathbb{E}[\beta Y \mid X = x] = \beta\,\mathbb{E}[Y \mid X = x] = \beta\mu_x$, and
$$\tilde{s} = |\hat{f}(x) - \tilde{Y}| = |\beta\mu_x - \beta Y| = \beta|\mu_x - Y|, \qquad s = |\hat{f}(x) - Y| = |\beta\mu_x - Y|.$$
We will show that $F_{\tilde{s}}(t) \ge F_s(t)$ for all relevant $t$. Notice that this is established conditional on $X$ (we do not write the conditioning explicitly to enhance clarity). Denote $G = \mu_x - Y$, $H = |G|$, and $J = \beta\mu_x - Y$, so that $\tilde{s} = \beta H$ and $s = |J|$. Then:
$$F_G(g) = \mathbb{P}(G \le g) = \mathbb{P}(\mu_x - Y \le g) = 1 - \mathbb{P}(Y \le \mu_x - g) = 1 - F_Y(\mu_x - g),$$
$$F_H(h) = \mathbb{P}(H \le h) = \mathbb{P}(|G| \le h) = \mathbb{P}(-h \le G \le h) = F_G(h) - F_G(-h),$$
$$F_J(j) = \mathbb{P}(J \le j) = \mathbb{P}(\beta\mu_x - Y \le j) = 1 - \mathbb{P}(Y \le \beta\mu_x - j) = 1 - F_Y(\beta\mu_x - j).$$
$$F_{\tilde{s}}(t) = F_H\!\left(\tfrac{t}{\beta}\right) = F_G\!\left(\tfrac{t}{\beta}\right) - F_G\!\left(-\tfrac{t}{\beta}\right) = F_Y\!\left(\mu_x + \tfrac{t}{\beta}\right) - F_Y\!\left(\mu_x - \tfrac{t}{\beta}\right),$$
$$F_s(t) = F_J(t) - F_J(-t) = F_Y(\beta\mu_x + t) - F_Y(\beta\mu_x - t).$$
Let $t \ge 0$. Then:
$$F_{\tilde{s}}(t) - F_s(t) = F_Y\!\left(\mu_x + \tfrac{t}{\beta}\right) - F_Y\!\left(\mu_x - \tfrac{t}{\beta}\right) - \left(F_Y(\beta\mu_x + t) - F_Y(\beta\mu_x - t)\right) = F_Y\!\left(\mu_x + \tfrac{t}{\beta}\right) - F_Y\!\left(\mu_x - \tfrac{t}{\beta}\right) - F_Y\!\left(\beta\!\left(\mu_x + \tfrac{t}{\beta}\right)\right) + F_Y\!\left(\beta\!\left(\mu_x - \tfrac{t}{\beta}\right)\right).$$
For $t \ge |\beta\mu_x|$, the monotonicity of $F_Y$ gives
$$F_Y\!\left(\mu_x + \tfrac{t}{\beta}\right) \ge F_Y\!\left(\beta\!\left(\mu_x + \tfrac{t}{\beta}\right)\right) \quad \text{and} \quad F_Y\!\left(\beta\!\left(\mu_x - \tfrac{t}{\beta}\right)\right) \ge F_Y\!\left(\mu_x - \tfrac{t}{\beta}\right).$$
Therefore $F_{\tilde{s}}(t) \ge F_s(t)$, or equivalently $\mathbb{P}(\tilde{s} \le t) \ge \mathbb{P}(s \le t)$. Denote $\alpha_0(x) := 1 - \mathbb{P}(\tilde{s} \le |\beta\mu_x|)$ and suppose that $\alpha$ satisfies $1 - \alpha \ge 1 - \alpha_0(x)$. Denote by $\hat{q}_{\text{noisy}}(x)$ the $1-\alpha$ quantile of the noisy scores $\tilde{s}$, conditional on $X = x$. Then we get that $\hat{q}_{\text{noisy}}(x) \ge |\beta\mu_x|$, and thus:
$$1 - \alpha = \mathbb{P}(\tilde{s} \le \hat{q}_{\text{noisy}}(x)) \ge \mathbb{P}(s \le \hat{q}_{\text{noisy}}(x)).$$
Lastly, we return to the notation with conditioning on $X$. Denote $\alpha_0 = \inf_{x \in \mathcal{X}} \alpha_0(x)$, and suppose that $\alpha \in [0, 1]$ satisfies $1 - \alpha \ge 1 - \alpha_0$. Denote by $\hat{q}_{\text{noisy}}$ the $1-\alpha$ quantile of the noisy scores $\tilde{s}$, marginally over $x \in \mathcal{X}$. Then we obtain:
$$1 - \alpha = \mathbb{P}(\tilde{Y} \in C_{\text{noisy}}(X)) = \mathbb{P}(\tilde{s} \le \hat{q}_{\text{noisy}}) \ge \mathbb{P}(s \le \hat{q}_{\text{noisy}}) = \mathbb{P}(Y \in C_{\text{noisy}}(X)),$$
where the probability is taken marginally over $x \in \mathcal{X}$.

We continue by proving Corollary 3.

Proof [Proof of Corollary 3] This is a consequence of the TV bound from Barber et al. (2023) with weights identically equal to 1.

Unfortunately, obtaining such a TV bound requires a case-by-case analysis. It is not even straightforward to get a TV bound under strong Gaussian assumptions.

Proposition A.3 (No general TV bound) Assume $Y \sim N(0, \tau^2)$ and $\tilde{Y} = Y + Z$, where $Z \sim N(0, \sigma^2)$. Then $D_{\mathrm{TV}}(Y, \tilde{Y}) \xrightarrow{\tau \to 0} 1$.

Proof
$$\mathrm{TV}\left(N(0, \tau^2), N(0, \tau^2 + \sigma^2)\right) = \frac{1}{2}\int \left|\frac{e^{-x^2/(2\tau^2)}}{\sqrt{2\pi\tau^2}} - \frac{e^{-x^2/(2(\tau^2+\sigma^2))}}{\sqrt{2\pi(\tau^2+\sigma^2)}}\right| dx \xrightarrow{\tau \to 0} 1.$$

A.5 False-Negative Risk

In this section, we prove all FNR robustness propositions. Here, we suppose that the response $Y \in \{0, 1\}^n$ is a binary vector of size $n \in \mathbb{N}$, where $Y_i = 1$ indicates that the $i$-th label is present. We further suppose the vector-flip noise model from (8), where $\varepsilon$ is a binary random vector of size $n$ as well. These notations apply to segmentation tasks as well, by flattening the response matrix into a vector.
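To make the false-negative-rate setup concrete, the FNR loss and the vector-flip corruption can be sketched as follows (the label vector, noise level, and helper names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def fnr_loss(y, pred_set):
    """False-negative proportion: the fraction of present labels (y_i = 1)
    that the prediction set fails to include."""
    present = np.flatnonzero(y)
    if present.size == 0:
        return 0.0
    missed = [i for i in present if i not in pred_set]
    return len(missed) / present.size

def flip_noise(y, beta=0.1):
    """Vector-flip noise: each coordinate is XOR-ed with an independent
    Bernoulli(beta) flip, turning present labels absent and vice versa."""
    eps = (rng.random(y.shape) < beta).astype(y.dtype)
    return np.bitwise_xor(y, eps)

y = np.array([1, 1, 0, 0, 1])
pred_set = {0, 1}                 # covers labels 0 and 1, misses label 4
print(round(fnr_loss(y, pred_set), 3))  # 0.333: one of three present labels missed
```

Applying `flip_noise` to `y` and recomputing `fnr_loss` on the corrupted vector gives the noisy risk that the propositions below compare against the clean one.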
The prediction set $C(X) \subseteq \{1, \dots, n\}$ contains a subset of the labels. We begin by providing additional theoretical results and then turn to the proofs.

A.5.1 Additional Theoretical Results

Proposition A.4 Let $\hat{C}_{\text{noisy}}(X_{\text{test}})$ be a prediction set that controls the FNR risk of the noisy labels at level $\alpha$. Suppose that:
1. The prediction set contains the most likely labels, in the sense that for all $x \in \mathcal{X}$, $k \in \hat{C}_{\text{noisy}}(x)$, $i \notin \hat{C}_{\text{noisy}}(x)$, and $m \in \mathbb{N}$:
$$\mathbb{P}\Big(Y_k = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m,\ X = x\Big) \ge \mathbb{P}\Big(Y_i = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m,\ X = x\Big).$$
2. For a given input $X = x$, the noise level of all response elements is the same, i.e., for all $i \ne j \in \{1, \dots, n\}$: $\mathbb{P}(\varepsilon_i = 1 \mid X = x) = \mathbb{P}(\varepsilon_j = 1 \mid X = x)$.
3. The noise is independent of $Y \mid X = x$, in the sense that $\varepsilon \perp Y \mid X = x$ for all $x \in \mathcal{X}$.
4. The noises of different labels are independent of each other given $X$, i.e., $\varepsilon_i \perp \varepsilon_j \mid X = x$ for all $i \ne j \in \{1, \dots, n\}$ and $x \in \mathcal{X}$.
Then $\mathbb{E}\left[L_{\text{FNR}}\left(Y_{\text{test}}, \hat{C}_{\text{noisy}}(X_{\text{test}})\right)\right] \le \alpha$.

A.5.2 General Derivations

In this section, we formulate and prove a general lemma that is used to prove label-noise robustness with the false-negative rate loss.

Lemma A.4 Suppose that $x \in \mathcal{X}$ is an input variable and $C(x)$ is a prediction set. Denote by $\beta_i$ the noise level at the $i$-th element: $\beta_i := \mathbb{P}(\varepsilon_i = 1 \mid X = x)$. We define:
$$e_{k,i}(y) := \beta_i\, y_k\, \mathbb{E}\left[\frac{1}{\sum_{j=1}^{n} \tilde{Y}_j}\ \middle|\ \varepsilon_i = 1,\ Y = y\right].$$
If for all $k \in C(x)$ and $i \notin C(x)$:
$$\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_{j=1}^{n} Y_j}\right] \ge 0,$$
then $C(x)$ achieves conservative conditional risk:
$$\mathbb{E}\left[L_{\text{FNR}}(Y, C(X)) \mid X = x\right] \le \mathbb{E}\left[L_{\text{FNR}}(\tilde{Y}, C(X)) \mid X = x\right].$$
Furthermore, when the condition holds for all $x \in \mathcal{X}$, $C(X)$ achieves valid marginal risk:
$$\mathbb{E}\left[L_{\text{FNR}}(Y, C(X))\right] \le \mathbb{E}\left[L_{\text{FNR}}(\tilde{Y}, C(X))\right].$$

Proof For ease of notation, we omit the conditioning on $X = x$; that is, we take some $x \in \mathcal{X}$ and treat $Y$ as $Y \mid X = x$. We also denote the prediction set by $C = C(x)$. Denote $\delta := L_{\text{FNR}}(\tilde{Y}, C) - L_{\text{FNR}}(Y, C)$. Our goal is to show that $\mathbb{E}[\delta] \ge 0$. If $C = \emptyset$ or $C = \{1, \dots, n\}$ then the proposition is trivially satisfied; therefore, for the rest of this proof, we assume that $C \ne \emptyset, \{1, \dots, n\}$. We begin by developing $\delta$ as follows.
Using $\tilde{Y}_j = Y_j - 2\varepsilon_j Y_j + \varepsilon_j$ and $L_{\text{FNR}}(Y, C) = 1 - \frac{\sum_{j \in C} Y_j}{\sum_j Y_j}$, we write
$$\delta = L_{\text{FNR}}(\tilde{Y}, C) - L_{\text{FNR}}(Y, C) = \frac{\sum_{j \in C} Y_j}{\sum_j Y_j} - \frac{\sum_{j \in C} \left[Y_j - 2\varepsilon_j Y_j + \varepsilon_j\right]}{\sum_j \left[Y_j - 2\varepsilon_j Y_j + \varepsilon_j\right]} = \frac{\sum_i \sum_{j \in C}\left( Y_j \tilde{Y}_i - Y_i \tilde{Y}_j\right)}{\left(\sum_j Y_j\right)\left(\sum_j \tilde{Y}_j\right)} = \frac{\left(\sum_i \varepsilon_i\right)\left(\sum_{j \in C} Y_j\right) - \left(\sum_i Y_i\right)\left(\sum_{j \in C} \varepsilon_j\right)}{\left(\sum_j Y_j\right)\left(\sum_j \left[Y_j - 2\varepsilon_j Y_j + \varepsilon_j\right]\right)},$$
after expanding the products and cancelling the symmetric terms. Without loss of generality, we assume that $C = \{1, \dots, p\}$. We now compute the expectation of each term separately, for a given vector of labels $Y = y$. For the first term, conditioning on $\varepsilon_i = 1$ (which occurs with probability $\beta_i$) and using the definition of $e_{k,i}$:
$$\mathbb{E}\left[\frac{\left(\sum_i \varepsilon_i\right)\left(\sum_{k=1}^{p} Y_k\right)}{\left(\sum_j Y_j\right)\left(\sum_j \tilde{Y}_j\right)} \,\middle|\, Y = y\right] = \sum_i \frac{\sum_{k=1}^{p} y_k}{\sum_j y_j}\, \beta_i\, \mathbb{E}\left[\frac{1}{(1 - Y_i) + \sum_{j \ne i} \tilde{Y}_j} \,\middle|\, \varepsilon_i = 1,\ Y = y\right] = \sum_i \sum_{k=1}^{p} \frac{e_{k,i}(y)}{\sum_j y_j},$$
where we used the fact that, given $\varepsilon_i = 1$, the $i$-th noisy coordinate equals $1 - Y_i$. Similarly, for the second term:
$$\mathbb{E}\left[\frac{\left(\sum_i Y_i\right)\left(\sum_{k=1}^{p} \varepsilon_k\right)}{\left(\sum_j Y_j\right)\left(\sum_j \tilde{Y}_j\right)} \,\middle|\, Y = y\right] = \sum_i \sum_{k=1}^{p} \frac{y_i}{\sum_j y_j}\, \beta_k\, \mathbb{E}\left[\frac{1}{(1 - Y_k) + \sum_{j \ne k} \tilde{Y}_j} \,\middle|\, \varepsilon_k = 1,\ Y = y\right] = \sum_i \sum_{k=1}^{p} \frac{e_{i,k}(y)}{\sum_j y_j}.$$
We now compute the expected value of $\delta$ for a given vector of labels $Y = y$:
$$\mathbb{E}[\delta \mid Y = y] = \sum_i \sum_{k=1}^{p} \frac{e_{k,i}(y) - e_{i,k}(y)}{\sum_j y_j} = \sum_{i \notin C} \sum_{k \in C} \frac{e_{k,i}(y) - e_{i,k}(y)}{\sum_j y_j},$$
since the terms with both $i, k \in C$ cancel in pairs. Finally, we get valid risk conditioned on $X = x$:
$$\mathbb{E}[\delta] = \mathbb{E}\left[\sum_{i \notin C} \sum_{k \in C} \frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right] \ge 0,$$
where the last inequality follows from the assumption of this lemma. We now marginalize the above result to obtain valid marginal risk:
$$\mathbb{E}[\delta] = \mathbb{E}_X\left[\mathbb{E}_{\tilde{Y}, Y \mid X = x}\left[\delta \mid X = x\right]\right] \ge \mathbb{E}_X[0] = 0.$$
A.5.3 Proof of Proposition A.4

Proof As in Lemma A.4, we omit the conditioning on $X$ and treat $Y$ as $Y \mid X = x$. Without loss of generality, we suppose that $C = \{1, \dots, p\}$, where $\mathbb{P}(\tilde{Y}_i = 1) \ge \mathbb{P}(\tilde{Y}_{i+1} = 1)$ for all $i \in \{1, \dots, n-1\}$. Since $\mathbb{P}(\varepsilon_i = 1) \le 0.5$ for all $i$, the order of the class probabilities is preserved under the corruption, meaning that $\mathbb{P}(Y_i = 1) \ge \mathbb{P}(Y_{i+1} = 1)$ for all $i \in \{1, \dots, n-1\}$.

We now compute $\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right]$ for $k < i$. Expanding the expectation over the realizations $y$ of $Y$, the terms with $y_i = y_k$ cancel, since in that case $e_{k,i}(y) = e_{i,k}(y)$ by the equal and independent noise levels (assumptions 2 and 4). We are left with
$$\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right] = \sum_{\{y : y_k = 1, y_i = 0\}} \mathbb{P}(Y = y)\, \frac{\beta}{\sum_j y_j}\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_k + \sum_{j \ne i,k} \tilde{Y}_j} \,\middle|\, Y = y\right] - \sum_{\{y : y_i = 1, y_k = 0\}} \mathbb{P}(Y = y)\, \frac{\beta}{\sum_j y_j}\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_i + \sum_{j \ne k,i} \tilde{Y}_j} \,\middle|\, Y = y\right],$$
where $\beta$ denotes the common noise level $\beta_i = \beta_k$. To simplify the expression, denote by $y^*$ the vector $y$ with the indexes $i, k$ swapped, that is, $y^*_j = y_j$ for $j \ne i, k$, $y^*_i = y_k$, and $y^*_k = y_i$. Notice that
$$\beta_k\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_i + \sum_{j \ne k,i} \tilde{Y}_j} \,\middle|\, Y = y\right] = \beta_i\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_k + \sum_{j \ne k,i} \tilde{Y}_j} \,\middle|\, Y = y^*\right],$$
since $\varepsilon_i \overset{d}{=} \varepsilon_k$ and these variables are independent of $Y$ and of $\varepsilon_j$ for $j \ne i, k$. Further denote
$$\gamma_m = \frac{\beta}{\sum_j y_j}\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_k + \sum_{j \ne i,k} \tilde{Y}_j} \,\middle|\, Y = y\right] \quad \text{for } \sum_{j \ne i,k} y_j = m \text{ and } y_i \ne y_k.$$
Notice that $\gamma_m$ is well defined, as the expectation has the same value for every $y$ such that $\sum_{j \ne i,k} y_j = m$, since $y$ and $\varepsilon$ are independent. Also, $\sum_j y_j = m + 1$ if $\sum_{j \ne i,k} y_j = m$ and $y_i \ne y_k$. Lastly, we denote $\gamma'_m = \gamma_m\, \mathbb{P}\big(\sum_{j \ne i,k} Y_j = m\big)$.
We continue computing $\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right]$ for $k < i$:
$$\begin{aligned}
\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right] &= \sum_{\{y : y_k = 1, y_i = 0\}} \left(\mathbb{P}(Y = y) - \mathbb{P}(Y = y^*)\right)\, \frac{\beta}{\sum_j y_j}\, \mathbb{E}\left[\frac{1}{2 - \varepsilon_k + \sum_{j \ne i,k} \tilde{Y}_j} \,\middle|\, Y = y\right] \\
&= \sum_{m} \gamma_m \sum_{\{y : y_k = 1, y_i = 0, \sum_{j \ne i,k} y_j = m\}} \left[\mathbb{P}(Y = y) - \mathbb{P}(Y = y^*)\right] \\
&= \sum_{m} \gamma_m \left[\mathbb{P}\Big(Y_k = 1, Y_i = 0, \textstyle\sum_{j \ne i,k} Y_j = m\Big) - \mathbb{P}\Big(Y_k = 0, Y_i = 1, \textstyle\sum_{j \ne i,k} Y_j = m\Big)\right] \\
&= \sum_{m} \gamma'_m \left[\mathbb{P}\Big(Y_k = 1, Y_i = 0 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big) - \mathbb{P}\Big(Y_k = 0, Y_i = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big)\right] \\
&= \sum_{m} \gamma'_m \left[\mathbb{P}\Big(Y_k = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big) - \mathbb{P}\Big(Y_i = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big)\right],
\end{aligned}$$
where the last equality follows by adding and subtracting $\mathbb{P}\big(Y_k = 1, Y_i = 1 \mid \sum_{j \ne i,k} Y_j = m\big)$. We assume that for all $m$:
$$\mathbb{P}\Big(Y_k = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big) \ge \mathbb{P}\Big(Y_i = 1 \,\Big|\, \textstyle\sum_{j \ne i,k} Y_j = m\Big),$$
and therefore:
$$\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_j Y_j}\right] \ge 0.$$
According to Lemma A.4, the above concludes the proof.

A.5.4 Multi-label Classification

Here we prove Proposition 4.

Proof [Proof of Proposition 4] As in Lemma A.4, we omit the conditioning on $X$ and treat $Y$ as $Y \mid X = x$. Without loss of generality, we suppose that $C = \{1, \dots, p\}$, where $\mathbb{P}(\tilde{Y}_i = 1) \ge \mathbb{P}(\tilde{Y}_{i+1} = 1)$ for all $i \in \{1, \dots, n-1\}$. Since $\mathbb{P}(\varepsilon_i = 1) \le 0.5$ for all $i$, the order of the class probabilities is preserved under the corruption, meaning that $\mathbb{P}(Y_i = 1) \ge \mathbb{P}(Y_{i+1} = 1)$ for all $i \in \{1, \dots, n-1\}$. Since $Y$ is a deterministic function of $X$, we get that $Y$ is some non-increasing constant binary vector. Suppose that $k \in C$ and $i \notin C$. Since $C$ contains the most likely labels, these indexes satisfy $k < i$. We now compare $e_{k,i}(y)$ to $e_{i,k}(y)$. According to the definition of $e_{k,i}$:
$$e_{k,i}(y) := \beta_i\, y_k\, \mathbb{E}\left[\frac{1}{\sum_{j=1}^{n} \tilde{Y}_j}\ \middle|\ \varepsilon_i = 1,\ Y = y\right].$$
Since $\sum_{j=1}^{n} Y_j$ is assumed to be a constant, $e_{k,i}(y)$ and $e_{i,k}(y)$ may differ only in the values of $y_k, y_i$ and $\beta_k, \beta_i$. We go over all four combinations of $y_k$ and $y_i$.
1. If $y_k = 0, y_i = 0$, then $e_{k,i}(y) = e_{i,k}(y) = 0$.
2. If $y_k = 1, y_i = 0$, then $e_{k,i}(y) \ge 0$ and $e_{i,k}(y) = 0$.
3.
If $y_k = 0, y_i = 1$, then we get a contradiction to the property of $Y$ being non-increasing, and thus this case is not possible.
4. If $y_k = 1, y_i = 1$, then $e_{k,i}(y) - e_{i,k}(y) \ge 0$, since $\beta_i \ge \beta_k$:
$$\mathbb{P}(\varepsilon_k = 0) = \mathbb{P}(\tilde{Y}_k = 1) \ge \mathbb{P}(\tilde{Y}_i = 1) = \mathbb{P}(\varepsilon_i = 0) \implies 1 - \mathbb{P}(\varepsilon_k = 1) \ge 1 - \mathbb{P}(\varepsilon_i = 1) \implies \beta_k = \mathbb{P}(\varepsilon_k = 1) \le \mathbb{P}(\varepsilon_i = 1) = \beta_i.$$
Therefore, in all cases we get that $e_{k,i}(y) \ge e_{i,k}(y)$ for all $k \in C$, $i \notin C$, and thus:
$$\mathbb{E}\left[\frac{e_{k,i}(Y) - e_{i,k}(Y)}{\sum_{j=1}^{n} Y_j}\right] \ge 0.$$
Finally, by applying Lemma A.4, we obtain valid conditional risk.

A.5.5 Segmentation

Proof [Proof of Proposition 5] This proposition is a special case of Proposition A.4, and thus valid risk follows directly from that result.

A.6 Risk Control for Regression Tasks

A.6.1 General Risk Bound for Regression Tasks

Here we prove Proposition 6.

Proof [Proof of Proposition 6] For ease of notation, we omit the conditioning on $X = x$; that is, we treat $Y$ as $Y \mid X = x$ for some $x \in \mathcal{X}$. Given a prediction set $\hat{C}(x)$, we consider the loss as a function of $Y$, where $\hat{C}(x)$ is fixed, and denote it by $L(Y) := L(Y, \hat{C}(x))$. We expand $L(Y + \varepsilon)$ using Taylor's expansion:
$$L(Y + \varepsilon) = L(Y) + \varepsilon L'(Y) + \frac{1}{2}\varepsilon^2 L''(\xi),$$
where $\xi$ is some real number between $Y$ and $Y + \varepsilon$. Therefore,
$$\delta := L(\tilde{Y}) - L(Y) = L(Y + \varepsilon) - L(Y) = \varepsilon L'(Y) + \frac{1}{2}\varepsilon^2 L''(\xi).$$
We now develop each term separately. Since $\varepsilon \perp Y$, it follows that $\varepsilon \perp L'(Y)$, so
$$\mathbb{E}\left[\varepsilon L'(Y)\right] = \mathbb{E}[\varepsilon]\, \mathbb{E}\left[L'(Y)\right] = 0 \cdot \mathbb{E}\left[L'(Y)\right] = 0.$$
Since $L''(y)$ is bounded, $q \le L''(y) \le Q$, we get the following:
$$\frac{1}{2}q\, \mathrm{Var}[\varepsilon] = \mathbb{E}\left[\frac{1}{2}q\varepsilon^2\right] \le \mathbb{E}\left[\frac{1}{2}\varepsilon^2 L''(\xi)\right] \le \mathbb{E}\left[\frac{1}{2}Q\varepsilon^2\right] = \frac{1}{2}Q\, \mathrm{Var}[\varepsilon].$$
We get that:
$$\mathbb{E}[\delta] = \mathbb{E}\left[\varepsilon L'(Y) + \frac{1}{2}\varepsilon^2 L''(\xi)\right] = 0 + \mathbb{E}\left[\frac{1}{2}\varepsilon^2 L''(\xi)\right],$$
and hence
$$\frac{1}{2}q\, \mathrm{Var}[\varepsilon] \le \mathbb{E}\left[L(\tilde{Y}, \hat{C}(X)) - L(Y, \hat{C}(X))\right] \le \frac{1}{2}Q\, \mathrm{Var}[\varepsilon].$$
We now return to the conditioning on $X = x$ and obtain marginalized bounds by taking the expectation over $X$:
$$\alpha - \frac{1}{2}Q\, \mathrm{Var}[\varepsilon] \le \mathbb{E}\left[L(Y, \hat{C}(X))\right] \le \alpha - \frac{1}{2}q\, \mathrm{Var}[\varepsilon],$$
where $\alpha := \mathbb{E}\left[L(\tilde{Y}, \hat{C}(X))\right]$, $q := \mathbb{E}_X[q_X]$, and $Q := \mathbb{E}_X[Q_X]$.
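The second-order risk bound above can be sanity-checked numerically: for a loss with constant second derivative, such as the squared loss with $L'' = 2$, the gap $\mathbb{E}[L(\tilde{Y})] - \mathbb{E}[L(Y)]$ equals $\frac{1}{2}\mathrm{Var}[\varepsilon]\,L''$ exactly, so the two sides of the bound coincide up to Monte Carlo error. A minimal sketch (the distributions, constants, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

y = rng.normal(0.0, 1.0, n)      # clean responses
eps = rng.normal(0.0, 0.5, n)    # additive label noise with Var[eps] = 0.25
loss = lambda v: (v - 0.3) ** 2  # squared loss around a fixed point; L'' = 2

gap = loss(y + eps).mean() - loss(y).mean()
bound = 0.5 * eps.var() * 2.0    # (1/2) * Var[eps] * Q with Q = q = L'' = 2
print(round(gap, 3), round(bound, 3))  # both are close to 0.25 = Var[eps]
```

With a strictly convex loss the gap is non-negative, illustrating why convexity yields conservative risk on the clean labels.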
Additionally, if $L(y, C(x))$ is convex in $y$ for all $x \in \mathcal{X}$, then $q_x \ge 0$, and we get conservative risk:
$$\mathbb{E}\left[L(Y, \hat{C}(X))\right] \le \mathbb{E}\left[L(\tilde{Y}, \hat{C}(X))\right] = \alpha.$$

A.6.2 Deriving a Miscoverage Bound from the General Risk Bound in Regression Tasks

In this section we prove Proposition 7 and show how to obtain tight coverage bounds. First, we define a parameterized smoothed miscoverage loss:
$$L^{\text{sm}}_{d,c}(y, [a, b]) = \frac{2}{1 + e^{-d\left(2\frac{y - a}{b - a} - 1\right)^{2c}}} - 1.$$
Above, $c, d \in \mathbb{R}$ are parameters that affect the loss function. We first connect the smooth miscoverage and the standard miscoverage functions:
$$L(y, [a, b]) = \mathbb{1}\left\{L^{\text{sm}}_{d,c}(y, [a, b]) \ge h\right\}. \quad (13)$$
We now invert the above equation to find $h$: evaluating $L^{\text{sm}}_{d,c}$ at the boundary points $y = a$ and $y = b$, where $\left(2\frac{y - a}{b - a} - 1\right)^{2c} = 1$, gives
$$h(d) = L^{\text{sm}}_{d,c}(a, [a, b]) = L^{\text{sm}}_{d,c}(b, [a, b]) = \frac{2}{1 + e^{-d}} - 1.$$
Therefore, $h(d)$ is a function that depends only on $d$. We now denote the minimal second derivative of the smoothed loss by
$$q_x(c, d) = \min_y \frac{\partial^2}{\partial y^2} L^{\text{sm}}_{c,d}(y, C(x)).$$
Importantly, $q_x(c, d)$ can be empirically computed by sweeping over $y \in \mathbb{R}$ and computing the second derivative of $L^{\text{sm}}_{c,d}$ at each point. We obtain an upper bound for the miscoverage of $C(x)$ by applying Markov's inequality using (13):
$$\mathbb{P}(Y \notin C(X) \mid X = x) = \mathbb{P}\left(L^{\text{sm}}_{d,c}(Y, C(X)) \ge h(d) \mid X = x\right) \le \frac{\mathbb{E}\left[L^{\text{sm}}_{c,d}(Y, C(X)) \mid X = x\right]}{h(d)}. \quad (14)$$
Next, we employ the smoothed miscoverage upper bound provided by Proposition 7:
$$\mathbb{E}\left[L^{\text{sm}}_{c,d}(Y, C(X)) \mid X = x\right] \le \mathbb{E}\left[L^{\text{sm}}_{c,d}(\tilde{Y}, C(X)) \mid X = x\right] - \frac{1}{2}q_x(c, d)\, \mathrm{Var}[\varepsilon]. \quad (15)$$
Finally, we combine (14) and (15) and derive the following miscoverage bound:
$$\mathbb{P}(Y \notin C(X) \mid X = x) \le \frac{\mathbb{E}\left[L^{\text{sm}}_{c,d}(\tilde{Y}, C(X)) \mid X = x\right] - \frac{1}{2}q_x(c, d)\, \mathrm{Var}[\varepsilon]}{h(d)},$$
which can be restated as:
$$\mathbb{P}(Y \in C(X) \mid X = x) \ge 1 - \frac{\mathbb{E}\left[L^{\text{sm}}_{c,d}(\tilde{Y}, C(X)) \mid X = x\right] - \frac{1}{2}q_x(c, d)\, \mathrm{Var}[\varepsilon]}{h(d)}.$$
Finally, we take the expectation over $X$ to obtain a marginal coverage bound:
$$\mathbb{P}(Y \in C(X)) \ge 1 - \frac{\mathbb{E}\left[L^{\text{sm}}_{c,d}(\tilde{Y}, C(X))\right] - \frac{1}{2}\mathbb{E}_X[q_x(c, d)]\, \mathrm{Var}[\varepsilon]}{h(d)}.$$
(16)

Crucially, all quantities in (16) are empirically computable, so the above lower bound is known in practice. Additionally, the parameters $c, d$ can be tuned over a validation set to obtain tighter bounds.

A.6.3 Deriving a Miscoverage Bound from the Density's Smoothness

We say that the PDF of $Y \mid X = x$ is $K_x$-Lipschitz if
$$\left|f_{Y|X=x}(y + u) - f_{Y|X=x}(y)\right| \le K_x u,$$
where $f_{Y|X=x}$ is the PDF of $Y \mid X = x$ and $K_x \in \mathbb{R}$ is a constant that depends only on $x$.

Proposition A.5 Suppose that $C(x)$ is a prediction interval. Under the additive noise model $g_{\text{add}}$ from (3), if the PDF of $Y \mid X = x$ is $K_x$-Lipschitz, then:
$$\mathbb{P}(Y \in C(X)) \ge \mathbb{P}(\tilde{Y} \in C(X)) - \mathbb{E}\left[|C(X)|\, K_X\right]\, \mathbb{E}[|Z|].$$

Proof First, by the definition of the noise model we get that:
$$f_{\tilde{Y}|X=x}(y) = \int f_{Y|X=x}(y - z)\, f_Z(z)\, dz \le \int \left(f_{Y|X=x}(y) + K_x|z|\right) f_Z(z)\, dz = f_{Y|X=x}(y) \int f_Z(z)\, dz + K_x \int |z|\, f_Z(z)\, dz = f_{Y|X=x}(y) + K_x\, \mathbb{E}[|Z|].$$
Therefore:
$$\mathbb{P}(Y \in C(X) \mid X = x) = \int_{y \in C(x)} f_{Y|X=x}(y)\, dy \ge \int_{y \in C(x)} \left(f_{\tilde{Y}|X=x}(y) - K_x\, \mathbb{E}[|Z|]\right) dy = \mathbb{P}(\tilde{Y} \in C(X) \mid X = x) - |C(x)|\, K_x\, \mathbb{E}[|Z|].$$
We marginalize the above to obtain the marginal coverage bound:
$$\mathbb{P}(Y \in C(X)) \ge \mathbb{P}(\tilde{Y} \in C(X)) - \mathbb{E}\left[|C(X)|\, K_X\right]\, \mathbb{E}[|Z|].$$

A.6.4 The Image Miscoverage Loss

In this section, we analyze the setting where the response variable is a matrix $Y \in \mathbb{R}^{W \times H}$. Here, the uncertainty is represented by a prediction interval $C^{i,j}(X)$ for each pixel $(i,j)$ of the response image, and our goal is to control the image miscoverage loss (12), defined as:
$$L^{\text{im}}(Y, C(X)) = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} \mathbb{1}\{Y^{i,j} \notin C^{i,j}(X)\}.$$
While this loss can be controlled under an i.i.d. assumption by applying the methods proposed by Angelopoulos et al. (2021, 2024), these techniques may produce invalid uncertainty sets in the presence of label noise. We now show that conservative image miscoverage risk is obtained under the assumptions of Theorem A.1.

Proposition A.6 Suppose that each element of the response matrix is corrupted according to an additive noise model $g_{\text{add}}$ with a noise that has mean 0.
Suppose further that for every pixel $(i, j)$ of the response matrix, the prediction interval $C^{i,j}(X)$ and the conditional distribution of $Y^{i,j} \mid X = x$ satisfy the assumptions of Theorem A.1 for all $x \in \mathcal{X}$. Then, we obtain valid conditional image miscoverage:
$$\forall x \in \mathcal{X} : \mathbb{E}\left[L^{\text{im}}(Y, C(X)) \mid X = x\right] \le \mathbb{E}\left[L^{\text{im}}(\tilde{Y}, C(X)) \mid X = x\right].$$

Proof For ease of notation, we suppose that $Y$ is a random vector of length $k$ and $C^i$ is the prediction interval constructed for the $i$-th element of $Y$. Suppose that $x \in \mathcal{X}$. Under the assumptions of Theorem A.1, we get that for all $i \in \{1, \dots, k\}$:
$$\mathbb{P}(Y^i \notin C^i(x) \mid X = x) \le \mathbb{P}(\tilde{Y}^i \notin C^i(x) \mid X = x).$$
Therefore:
$$\mathbb{E}\left[L^{\text{im}}(Y, C(X)) \mid X = x\right] = \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} \mathbb{1}\{Y^i \notin C^i(X)\} \,\middle|\, X = x\right] = \frac{1}{k}\sum_{i=1}^{k} \mathbb{P}\left(Y^i \notin C^i(X) \mid X = x\right) \le \frac{1}{k}\sum_{i=1}^{k} \mathbb{P}\left(\tilde{Y}^i \notin C^i(X) \mid X = x\right) = \mathbb{E}\left[L^{\text{im}}(\tilde{Y}, C(X)) \mid X = x\right].$$

A.7 Online Learning

A.7.1 Label Noise in Online Learning Settings

Here we provide the proof of Proposition 8.

Proof [Proof of Proposition 8] Suppose that for every $t \in \mathbb{N}$:
$$\mathbb{E}_{Y_t \mid X_t = x}\left[L_t(Y_t, \hat{C}_t(X_t)) \mid X_t = x\right] \le \alpha_t.$$
Draw $T$ uniformly from $\{0, 1, \dots\}$. Then, from the law of total expectation, it follows that:
$$\mathbb{E}_T\left[L_T(Y_T, \hat{C}_T(X_T))\right] = \mathbb{E}_T\left[\mathbb{E}_{X_T \mid T = t}\left[\mathbb{E}_{Y_T \mid X_T = x}\left[L_T(Y_T, \hat{C}_T(X_T)) \mid X_T = x\right] \,\middle|\, T = t\right]\right] \le \mathbb{E}_T\left[\mathbb{E}_{X_T \mid T = t}\left[\alpha_T \mid T = t\right]\right] = \alpha,$$
where $\alpha := \mathbb{E}_T[\alpha_T]$. Finally, we get:
$$\lim_{T \to \infty} \frac{1}{T}\sum_{t=0}^{T} L_t(Y_t, \hat{C}_t(X_t)) = \mathbb{E}_T\left[L_T(Y_T, \hat{C}_T(X_T))\right] \le \alpha.$$

A.7.2 Label-Noise Robustness with the Miscoverage Counter Loss

In this section, we suppose an online learning setting, where the data $\{(x_t, y_t)\}_{t=1}^{\infty}$ is given as a stream. The miscoverage counter loss (Feldman et al., 2023b) counts the number of consecutive miscoverage events that occurred until timestamp $t$.
Formally, given a series of prediction sets $\{\hat{C}_t(x_t)\}_{t=1}^{T}$ and a series of labels $\{y_t\}_{t=1}^{T}$, the miscoverage counter at timestamp $t$ is defined as:
\[
L^{\mathrm{MC}}_t(y_t, \hat{C}_t(x_t)) =
\begin{cases}
L^{\mathrm{MC}}_{t-1}(y_{t-1}, \hat{C}_{t-1}(x_{t-1})) + 1, & y_t \notin \hat{C}_t(x_t), \\
0, & \text{otherwise.}
\end{cases}
\]
We now show that a conservative miscoverage counter risk is obtained in the presence of label noise.

Proposition A.7 Suppose that for all $0 < k \in \mathbb{N}$, $1 < t \in \mathbb{N}$, and $x_t \in \mathcal{X}$:
\[
P\left( Y_t \notin \hat{C}^{\mathrm{noisy}}_t(X_t) \mid X_t = x_t, L^{\mathrm{MC}}_{t-1}(Y_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k \right)
\le P\left( \tilde{Y}_t \notin \hat{C}^{\mathrm{noisy}}_t(X_t) \mid X_t = x_t, L^{\mathrm{MC}}_{t-1}(\tilde{Y}_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k \right).
\]
Also assume that:
\[
P\left( Y_1 \notin \hat{C}^{\mathrm{noisy}}_1(X_1) \mid X_1 = x_1 \right) \le P\left( \tilde{Y}_1 \notin \hat{C}^{\mathrm{noisy}}_1(X_1) \mid X_1 = x_1 \right).
\]
If the miscoverage counter risk of the noisy labels is controlled at level $\alpha$, then the miscoverage counter risk of the clean labels is also controlled at level $\alpha$:
\[
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t))] \le \alpha.
\]
Notice that the conditions of Proposition A.7 follow from Theorem A.2 in classification tasks, or from Theorem A.1 in regression tasks. In other words, we are guaranteed to obtain a valid miscoverage counter risk when the requirements of these theorems are satisfied. We now demonstrate this result in a regression setting.

Corollary A.8 Suppose that the noise model and the conditional distributions of the clean $Y_t \mid X_t, X_{t-1}, Y_{t-1}$ and noisy $\tilde{Y}_t \mid X_t, X_{t-1}, \tilde{Y}_{t-1}$ labels satisfy the assumptions of Theorem A.1 for all $t \in \mathbb{N}$. Then, the miscoverage counter risk of the clean labels is more conservative than the risk of the noisy labels.

We now turn to prove Proposition A.7.

Proof [Proof of Proposition A.7] Suppose that $x_t \in \mathcal{X}$. Our objective is to prove that any series of intervals $\{\hat{C}_t(X_t)\}_{t=1}^{\infty}$ satisfies:
\[
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t))]
\le \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t))].
\]
First, we show by induction over $0 < t, k \in \mathbb{N}$ that:
\[
P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k] \le P[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k].
\]
Base case: for $k = 1$ and $t = 1$:
\[
P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k \mid X_t = x_t] = P[Y_t \notin \hat{C}_t(X_t) \mid X_t = x_t]
\le P[\tilde{Y}_t \notin \hat{C}_t(X_t) \mid X_t = x_t] = P[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k \mid X_t = x_t].
\]
Inductive step 1: suppose that the statement holds for $t, k$. We now show it for $k + 1$:
\begin{align*}
P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k + 1]
&= P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k + 1 \mid L^{\mathrm{MC}}_{t-1}(Y_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \, P[L^{\mathrm{MC}}_{t-1}(Y_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \\
&= P[Y_t \notin \hat{C}_t(X_t) \mid L^{\mathrm{MC}}_{t-1}(Y_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \, P[L^{\mathrm{MC}}_{t-1}(Y_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \\
&\le P[\tilde{Y}_t \notin \hat{C}_t(X_t) \mid L^{\mathrm{MC}}_{t-1}(\tilde{Y}_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \, P[L^{\mathrm{MC}}_{t-1}(\tilde{Y}_{t-1}, \hat{C}_{t-1}(X_{t-1})) = k] \\
&= P[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k + 1].
\end{align*}
Inductive step 2: suppose that the statement holds for $t, k$. We now show it for $t + 1$:
\begin{align*}
P[L^{\mathrm{MC}}_{t+1}(Y_{t+1}, \hat{C}_{t+1}(X_{t+1})) = k]
&= P[L^{\mathrm{MC}}_{t+1}(Y_{t+1}, \hat{C}_{t+1}(X_{t+1})) = k \mid L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k - 1] \, P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k - 1] \\
&= P[Y_{t+1} \notin \hat{C}_{t+1}(X_{t+1}) \mid L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k - 1] \, P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k - 1] \\
&\le P[\tilde{Y}_{t+1} \notin \hat{C}_{t+1}(X_{t+1}) \mid L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k - 1] \, P[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k - 1] \\
&= P[L^{\mathrm{MC}}_{t+1}(\tilde{Y}_{t+1}, \hat{C}_{t+1}(X_{t+1})) = k].
\end{align*}
Finally, we compute the miscoverage counter risk over the time horizon:
\begin{align*}
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t))]
&= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{\infty} k \, P[L^{\mathrm{MC}}_t(Y_t, \hat{C}_t(X_t)) = k] \\
&\le \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{\infty} k \, P[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t)) = k]
= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[L^{\mathrm{MC}}_t(\tilde{Y}_t, \hat{C}_t(X_t))].
\end{align*}

Appendix B. Additional Experimental Details and Results

B.1 Classification: Object Recognition Experiment

Here we present additional results of the classification experiment with CIFAR-10H explained in Section 2.2, but first provide further details about the data set and the training procedure. The CIFAR-10H data set contains the same 10,000 images as CIFAR-10, but with labels from a single annotator instead of a majority vote of 50 annotators. We fine-tune a ResNet-18 model pre-trained on the clean training set of CIFAR-10, which contains 50,000 samples. We then randomly select 2,000 observations from CIFAR-10H for calibration. The test set contains the remaining 8,000 samples, but with CIFAR-10 labels. We apply conformal prediction with the APS score. The marginal coverage achieved when using noisy and clean calibration sets is depicted in Figure 1.
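As an aside, the calibration procedure just described can be sketched in a few lines. This is a simplified illustration of split-conformal classification with an APS-style score (the deterministic variant, without the randomization term); the function names and the finite-sample quantile correction shown here are our assumptions, not code from the paper:

```python
import numpy as np

def aps_scores(probs, labels):
    """APS-style score: total probability mass of the classes ranked at least
    as likely as the true class (randomization term omitted for simplicity)."""
    order = np.argsort(-probs, axis=1)                   # classes by decreasing prob.
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)   # position of the true class
    return cum[np.arange(len(labels)), rank]

def calibrate(scores, alpha):
    """Finite-sample-corrected empirical quantile of the calibration scores."""
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(prob_row, tau):
    """Smallest set of top-ranked classes whose cumulative mass reaches tau."""
    order = np.argsort(-prob_row)
    cum = np.cumsum(prob_row[order])
    k = np.searchsorted(cum, tau) + 1
    return set(order[:k].tolist())
```

Calibrating on noisy labels simply means passing the noisy labels to `aps_scores`; the observation in this experiment is that doing so typically inflates the resulting threshold, and hence the sets.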
This figure shows that (i) we obtain exactly the desired coverage when using the clean calibration set; and (ii) when calibrating on noisy data, the constructed prediction sets over-cover the clean test labels. Figure 15 illustrates that the average prediction set sizes are larger when using noisy data for calibration, which leads to the higher coverage levels.

Figure 15: Effect of label noise on CIFAR-10. Distribution of average prediction set sizes over 30 independent experiments evaluated on CIFAR-10H test data, using noisy and clean labels for calibration. Other details are as in Figure 1.

B.2 Regression: Aesthetic Visual Rating

Herein, we provide additional details regarding the training of the predictive models for the real-world regression task. As explained in Section 2.2, we use a VGG-16 model pre-trained on the ImageNet data set, whose last (deepest) fully connected layer is removed. We then feed the output of the VGG-16 model to a linear fully connected layer to predict the response. We train two different models: a quantile regression model for CQR and a classic regression model for conformal prediction with the residual magnitude score. Both models are trained on 34,000 noisy samples, calibrated on 7,778 noisy holdout points, and tested on 7,778 clean samples. We train the quantile regression model for 70 epochs using the SGD optimizer with a batch size of 128 and an initial learning rate of 0.001, decayed exponentially every 20 epochs with a rate of 0.95 and a frequency of 10. We apply dropout regularization with a rate of 0.2 to avoid overfitting. We train the classic regression model for 70 epochs using the Adam optimizer with a batch size of 128 and an initial learning rate of 0.00005, decayed exponentially every 10 epochs with a rate of 0.95 and a frequency of 10. The dropout rate in this case is 0.5.
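For concreteness, the pinball (quantile) loss used to train such a quantile regression model, together with the CQR non-conformity score and its calibrated threshold, can be sketched as follows. This is a minimal NumPy sketch; the function names are ours and the exact implementation used in the experiments may differ:

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Average pinball loss at level q; minimizing it over y_hat estimates
    the q-th conditional quantile of y."""
    diff = y - y_hat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def cqr_score(y, lo, hi):
    """CQR non-conformity score: signed distance from y to the interval
    [lo, hi]; negative inside the interval, positive outside."""
    return np.maximum(lo - y, y - hi)

def cqr_threshold(scores, alpha):
    """Finite-sample-corrected empirical quantile Q of the calibration
    scores; the calibrated interval is [lo - Q, hi + Q]."""
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")
```

Under label noise, the calibration scores are computed with the noisy responses, which shifts the empirical quantile and hence the interval width.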
B.3 Synthetic Classification: Adversarial Noise Models

In contrast with the noise distributions presented in Section 4.1, here we construct adversarial noise models that intentionally reduce the coverage rate.

1. Most frequent confusion: we extract from the confusion matrix the pair of classes with the highest probability of being confused with each other, and switch their labels until reaching a total switching probability of $\epsilon$. In cases where switching the most common pair is not enough to reach $\epsilon$, we proceed by flipping the labels of the second most confused pair of labels, and so on.

2. Wrong to right: wrong predictions during calibration cause larger prediction sets at test time. Hence, making the model believe it makes fewer mistakes during calibration than it actually does can lead to under-coverage at test time. Here, we first observe the model's predictions over the calibration set, and then switch the labels only of points that were misclassified. We switch each such label to the class that is most likely to be correct according to the model, thereby making the model appear correct. We switch a suitable number of labels in order to reach a total switching probability of $\epsilon$ (this noise model assumes there are enough wrong predictions to do so).

3. Optimal adversarial: here we describe an algorithm for constructing the worst possible label noise for a specific model with a specific non-conformity score. This noise decreases the calibration threshold the most and, as a result, causes significant

Figure 16: Effect of label noise on synthetic multi-class classification data.
Performance of conformal prediction sets with target coverage $1 - \alpha = 90\%$, using a noisy training set and a noisy calibration set with adversarial noise models. Left: Marginal coverage; Right: Average size of predicted sets. The results are evaluated over 100 independent experiments.

under-coverage at test time. To do this, we perform an iterative process. In each iteration, we calculate the non-conformity scores of all calibration points with their current labels. We compute the calibration threshold as in regular conformal prediction and then, among the points whose score is above the threshold, we search for the one whose score can be reduced the most by switching its label. We switch the label of this point to the label that yields the lowest score, and then repeat the iterative process with the new set of labels. In short, at every step we make the label swap that decreases the threshold the most.

In these experiments, we apply the same settings as described in Section 4.1 and present the results in Figure 16. We can see that the optimal adversarial noise causes the largest decrease in coverage, as one would expect. The most-frequent-confusion noise decreases the neural network's coverage to approximately 89%. The wrong-to-right noise decreases the coverage to around 85% with the HPS score and to around 87% with the APS score. This gap is expected, as this noise directly reduces the HPS score. We can see that the optimal worst-case noise for each score function reduces the coverage to around 85% when using that score. This is in fact the maximal theoretically possible decrease in coverage, which supports the optimality of our iterative algorithm.

B.4 Synthetic Regression: Additional Results

Here we first illustrate in Figure 17 the data we generate in the synthetic regression experiment from Section 4.2 and the different corruptions we apply.
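The "optimal adversarial" procedure of Appendix B.3 can be sketched as a greedy loop. The following is our simplified illustration using the HPS score (one minus the estimated probability of the labeled class); the actual implementation used in the experiments may differ in details such as the stopping rule:

```python
import numpy as np

def worst_case_flip(probs, labels, alpha, n_flips):
    """Greedy adversarial label noise: repeatedly flip the label of an
    above-threshold calibration point so as to lower its HPS score the most,
    thereby driving the conformal threshold down."""
    labels = labels.copy()
    n = len(labels)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    for _ in range(n_flips):
        scores = 1.0 - probs[np.arange(n), labels]         # HPS scores
        tau = np.quantile(scores, level, method="higher")  # calibration threshold
        best = 1.0 - probs.max(axis=1)                     # score after the best flip
        gain = np.where(scores >= tau, scores - best, -np.inf)
        i = int(np.argmax(gain))
        if gain[i] <= 0:                                   # no flip lowers a high score
            break
        labels[i] = int(probs[i].argmax())                 # flip to the model's top class
    return labels
```

Each flip either lowers the current maximum above-threshold score or leaves the labels unchanged, so the calibration threshold is non-increasing over iterations.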
Figure 17: Illustration of the generated data with different corruptions. (a): Clean samples. (b): Samples with symmetric heavy-tailed noise. (c): Samples with asymmetric noise. (d): Samples with biased noise. Noise magnitude is set to 0.1. (e): Samples with contractive noise. (f): Samples with dispersive noise.

In Section 4.2 we apply several realistic noise models and examine the performance of conformal prediction with the CQR score using noisy training and calibration sets. Here we conduct additional experiments using the same settings, except that we train the models on clean data instead of noisy data. Moreover, we apply an additional adversarial noise model that differs from those presented in Section 4.2 in that it is designed to intentionally reduce the coverage level.

Wrong to right: an adversarial noise model that depends on the underlying trained regression model. To construct the noisy calibration set, we switch 7% of the responses as follows: we randomly swap outputs that are not included in the interval predicted by the model with outputs that are included.

Figures 18 and 19 depict the marginal coverage and interval length achieved when applying the different noise models. We see that the adversarial wrong-to-right noise model reduces the coverage rate to approximately 83%. Moreover, these results are similar to those achieved in Section 4.2, except for the conservative coverage attained under biased noise, which can be explained by the more accurate low and high estimated quantiles.

Figure 18: Response-independent noise. Performance of conformal prediction intervals with target coverage $1 - \alpha = 90\%$, using a clean training set and a noisy calibration set.
Left: Marginal coverage; Right: Length of predicted intervals (divided by the average clean length) using symmetric, asymmetric, and biased noise with varying magnitude. Other details are as in Figure 7.

Figure 19: Response-dependent noise. Performance of conformal prediction intervals with target coverage $1 - \alpha = 90\%$, using a clean training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals. The results are evaluated over 50 independent experiments.

Lastly, to explain the over-coverage or under-coverage achieved under some of the noise models, as depicted in Figures 7 and 8, we present in Figure 20 the CQR scores and their 90% empirical quantile. Over-coverage is achieved when the noisy scores are larger than the clean ones, as in the symmetric heavy-tailed case, and under-coverage is achieved when the noisy scores are smaller.

Figure 20: Illustration of the CQR scores. (a): Clean training and calibration sets. (b): Symmetric heavy-tailed noise. (c): Biased noise. Noise magnitude is set to 0.1. (d): Contractive noise. Other details are as in Figure 7.

B.5 The Multi-label Classification Experiment with the COCO Data Set

Here, we provide the full details of the experimental setup for the real multi-label corruptions from Section 3.2. We asked 9 annotators to annotate 117 images. Each annotator labeled approximately 15 images separately, except for two annotators who labeled 15 images as a couple. The annotators were asked to label each image in under 30 seconds, although this request was not enforced. Figure 21 presents the number of labels that are missed or mistakenly added to each image.
On average, 1.25 labels were missed and 0.5 were mistakenly added per image, meaning that each image contains 1.75 label mistakes on average.

Figure 21: Analysis of human-made label corruptions in the MS COCO data set.

B.6 Multi-label Classification with Artificial Corruptions

In this section, we study three types of corruptions that demonstrate Proposition 4, in the sense that the number of positive labels, $|\tilde{Y}|$, is a deterministic function of $X$. The data set we use in the following experiments is the MS COCO data set (Lin et al., 2014), in which an input image may contain up to $K = 80$ positive labels. We consider the given annotations as ground-truth labels and artificially corrupt them to generate noisy ones, as done in Zhao and Gomes (2021); Kumar et al. (2020). The first noise model applies dependent corruptions: a fraction $\beta$ of the positive labels are turned into different ones, according to a pre-defined transition map that was chosen at random. The second one applies independent corruptions: a fraction $\beta$ of the positive labels are turned into different ones, chosen independently and uniformly at random. Finally, the partial noise model corrupts the labels as the independent one does, except for 30 pre-defined labels that remain unchanged. In all experiments, we set $\beta = 0.45$. We fit a TResNet (Ridnik et al., 2021) model on 100k clean samples and calibrate it using 10k noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2). We control the false-negative rate (FNR), defined in (7), at different levels and measure the FNR obtained over the clean and noisy versions of the test set, which contains 30k samples. Figure 22 displays the results, showing that the risk obtained over the clean labels is more conservative than the risk obtained over the noisy labels. This is not a surprise, as it is supported by our theory in Proposition 4.
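To make the calibration scheme concrete, here is a minimal sketch of conformal risk control for the FNR (cf. Algorithm 1 in Appendix C.1, with $B = 1$); the probability-thresholding set construction, the grid search, and all function names are our illustrative assumptions:

```python
import numpy as np

def fnr(true_labels, pred_set):
    """False-negative proportion: fraction of true labels missed by the set."""
    return 1.0 - len(true_labels & pred_set) / len(true_labels)

def tune_lambda(probs, label_sets, alpha, grid):
    """Pick the smallest lambda (grid assumed sorted ascending) whose adjusted
    empirical FNR satisfies the conformal risk control condition."""
    n = len(label_sets)
    for lam in grid:
        # Sets grow with lambda, so the FNR is non-increasing in lambda.
        sets = [set(np.flatnonzero(p >= 1 - lam).tolist()) for p in probs]
        risk = np.mean([fnr(y, c) for y, c in zip(label_sets, sets)])
        if (n / (n + 1)) * risk + 1 / (n + 1) <= alpha:   # B = 1 correction
            return lam
    return grid[-1]
```

Calibrating with noisy label sets means the empirical risk above is computed against corrupted annotations; the point of the experiment is that the resulting risk on clean labels is then conservative.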
Figure 22: FNR on the MS COCO data set, achieved over noisy (red) and clean (green) test sets. The calibration scheme is applied with noisy annotations. Results are averaged over 50 trials.

B.7 Regression Miscoverage Bounds Experiment

We repeat the experimental protocol detailed in Section 4.5 with the same data sets and analyze the miscoverage bounds formulated in Section 3.3.3. Here, we apply conformalized quantile regression to control the miscoverage rate at different levels using a noisy calibration set. Furthermore, we choose the miscoverage bound hyperparameters $c, d$ from (16) by a grid search over the calibration set, with the objective of tightening the miscoverage bound. Figure 23 displays the miscoverage rate achieved on the clean and noisy versions of the test set, along with the miscoverage bound. Importantly, in contrast to Theorem A.1, this bound requires no assumptions on the distribution of $Y \mid X = x$ or on the distribution of the noise $\varepsilon$. The only requirement is that the noise is independent of the response variable and has mean 0. This great advantage compensates for the looseness of the bound.

Figure 23: Miscoverage rate achieved over noisy (red) and clean (green) test sets. The calibration scheme is applied using noisy annotations to control the miscoverage level. Results are averaged over 10 random splits of the calibration and test sets.

Appendix C. Related Algorithms

C.1 Conformal Risk Control

Here we provide pseudo-code for the conformal risk control algorithm, following (Angelopoulos et al., 2024). The prediction set $\hat{C}_{\hat{\lambda}}(X_{\mathrm{test}})$ produced by Algorithm 1 satisfies $\mathbb{E}[L(Y_{\mathrm{test}}, \hat{C}_{\hat{\lambda}}(X_{\mathrm{test}}))] \le \alpha$; for a proof, see (Angelopoulos et al., 2024).

Algorithm 1: Conformal risk control
Input: Exchangeable non-increasing losses $L_i : \Lambda \to (-\infty, B]$, $i = 1, \dots, n$, where $L_i(\lambda) = L(Y_i, \hat{C}_\lambda(X_i))$, and a desired risk level $\alpha$.
Process:
\[
\hat{R}_n(\lambda) = \frac{1}{n}\left( L_1(\lambda) + \dots + L_n(\lambda) \right), \qquad
\hat{\lambda} = \inf\left\{ \lambda : \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \right\}.
\]
Output: Uncertainty set for a new test point, $\hat{C}_{\hat{\lambda}}(X_{\mathrm{test}})$, constructed with $\hat{\lambda}$.

C.2 Adaptive Conformal Inference

Below, we provide pseudo-code for ACI, following (Gibbs and Candes, 2021, Algorithm 1).

Algorithm 2: Adaptive Conformal Inference
Input: Data $\{(X_t, Y_t)\}_{t=1}^{T} \subseteq \mathcal{X} \times \mathcal{Y}$, given as a stream; miscoverage level $\alpha \in (0, 1)$; a score function $S$; a calibration set size $n_2$; a step size $\gamma > 0$; and a learning model $\mathcal{M}$.
Process: Initialize $\alpha_0 = \alpha$.
for $t = n_2, \dots, T$ do
  Split the observed data $\{(X_i, Y_i)\}_{i=1}^{t-1}$ into a training set of size $t - 1 - n_2$, indexed by $I_1$, and a calibration set of size $n_2$, indexed by $I_2$.
  Obtain a predictive model $\mathcal{M}_t$ by fitting the initial one on the training set $\{(X_i, Y_i)\}_{i \in I_1}$.
  Construct a prediction set for the new point $X_t$:
  $\hat{C}_t(X_t) = \{y \in \mathcal{Y} : S(\mathcal{M}_t(X_t), y) \le Q_{1-\alpha_t}(\{S(\mathcal{M}_t(X_i), Y_i)\}_{i \in I_2})\}$.
  Obtain $Y_t$.
  Compute $\mathrm{err}_t = \mathbb{1}\{Y_t \notin \hat{C}_t(X_t)\}$.
  Update $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)$.
Output: Uncertainty sets $\hat{C}_t(X_t)$ for each time step $t = n_2, \dots, T$.

C.3 Rolling Risk Control

Here we provide pseudo-code for Rolling RC, first developed by Feldman et al. (2023b).

Algorithm 3: Rolling Risk Control
Input: Data $\{(X_t, Y_t)\}_{t=1}^{T} \subseteq \mathcal{X} \times \mathcal{Y}$, given as a stream; desired risk level $r \in \mathbb{R}$; a step size $\gamma > 0$; a set-constructing function $f : (\mathcal{X}, \mathbb{R}, \mathcal{M}) \to 2^{\mathcal{Y}}$; and an online learning model $\mathcal{M}$.
Process: Initialize $\theta_0 = 0$.
for $t = 1$ to $T$ do
  Construct a prediction set for the new point $X_t$: $\hat{C}_t(X_t) = f(X_t, \theta_t, \mathcal{M}_t)$.
  Obtain $Y_t$.
  Compute $l_t = L(Y_t, \hat{C}_t(X_t))$.
  Update $\theta_{t+1} = \theta_t + \gamma(l_t - r)$.
  Fit the model $\mathcal{M}_t$ on $(X_t, Y_t)$ and obtain the updated model $\mathcal{M}_{t+1}$.
Output: Uncertainty sets $\hat{C}_t(X_t)$ for each time step $t \in \{1, \dots, T\}$.

References

Mohamed Abdalla and Benjamin Fine. Hurdles to artificial intelligence deployment: Noise in schemas and gold labels. Radiology: Artificial Intelligence, 5(2):e220056, 2023.

Görkem Algan and Ilkay Ulusoy. Label noise types and their effects on deep learning.
arXiv preprint arXiv:2003.10471, 2020.

Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023. ISSN 1935-8237.

Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.

Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, 2024.

Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

Javed A. Aslam and Scott E. Decatur. On the sample complexity of noise-tolerant learning. Information Processing Letters, 57(4):189–195, 1996.

Rina Foygel Barber. Is distribution-free inference possible for binary regression? Electronic Journal of Statistics, 14(2):3487–3524, 2020.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.

Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6), September 2021. ISSN 0004-5411.

Ruairidh M. Battleday, Joshua C. Peterson, and Thomas L. Griffiths. Capturing human categorization of natural images by combining deep networks and cognitive models. Nature Communications, 11(1):1–14, 2020.

bio. Physicochemical properties of protein tertiary structure data set. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure. Accessed: January, 2019.
Maxime Cauchois, Suyash Gupta, Alnur Ali, and John Duchi. Predictive inference with weak supervision. arXiv preprint arXiv:2201.08315, 2022.

Chen Cheng, Hilal Asi, and John Duchi. How many labelers do you have? A closer look at gold-standard labels. arXiv preprint arXiv:2206.12041, 2022.

Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. PraNet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 263–273. Springer, 2020.

António Farinhas, Chrysoula Zerva, Dennis Thomas Ulmer, and André Martins. Non-exchangeable conformal risk control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=j511LaqEeP.

Shai Feldman, Bat-Sheva Einbinder, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, and Yaniv Romano. Conformal prediction is robust to dispersive label noise. In Conformal and Probabilistic Prediction with Applications, pages 624–626. PMLR, 2023a.

Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Achieving risk control in online learning settings. Transactions on Machine Learning Research, 2023b. ISSN 2835-8856.

Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 2013.

Asaf Gendler, Tsui-Wei Weng, Luca Daniel, and Yaniv Romano. Adversarially robust conformal prediction. In International Conference on Learning Representations, 2021.

Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

Simon Jenni and Paolo Favaro.
Deep bilevel learning. In Proceedings of the European Conference on Computer Vision, pages 618–633, 2018.

Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy labels with dropout regularization. In 2016 IEEE 16th International Conference on Data Mining, pages 967–972. IEEE, 2016.

Yueying Kao, Chong Wang, and Kaiqi Huang. Visual aesthetic quality assessment with a regression model. In 2015 IEEE International Conference on Image Processing, pages 1583–1587. IEEE, 2015.

Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.

Himanshu Kumar, Naresh Manwani, and P. S. Sastry. Robust learning of multi-label classifiers under label noise. In Proceedings of the ACM IKDD CoDS and COMAD, pages 90–97, 2020.

Yonghoon Lee and Rina Foygel Barber. Binary classification with corrupted labels. Electronic Journal of Statistics, 16(1):1367–1392, 2022.

Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pages 3355–3364. PMLR, 2018.

meps 19. Medical expenditure panel survey, panel 19.
https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181. Accessed: January, 2019.

Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2408–2415. IEEE, 2012.

Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: European Conference on Machine Learning, pages 345–356, 2002.

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626, 2019.

Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, pages 844–853. PMLR, 2021.

Tal Ridnik, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar Friedman. TResNet: High performance GPU-dedicated architecture. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1400–1409, 2021.

Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, pages 3543–3553, 2019.

Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581–3591, 2020.

Matteo Sesia, Y. X. Wang, and Xin Tong. Adaptive conformal classification with noisy labels. arXiv preprint arXiv:2309.05092, 2023.

Pulkit Singh, Joshua C. Peterson, Ruairidh M. Battleday, and Thomas L. Griffiths. End-to-end deep prototype and exemplar models for predicting human behavior. arXiv preprint arXiv:2007.08723, 2020.
David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, and Arnaud Doucet. Conformal prediction under ambiguous ground truth. arXiv preprint arXiv:2307.09302, 2023.

Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018.

Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11244–11253, 2019.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32, pages 2530–2540, 2019.

Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1-2):9–28, 2015.

Vladimir Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, pages 444–453, 1999.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, NY, USA, 2005.

Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning with noisy labels revisited: A study using real-world human annotations. In International Conference on Learning Representations, 2022.

Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. Advances in Neural Information Processing Systems, 32, 2019.

Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3D scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021.
Bodi Yuan, Jianyu Chen, Weidong Zhang, Hung-Shuo Tai, and Sara McMains. Iterative cross learning on noisy labels. In IEEE Winter Conference on Applications of Computer Vision, pages 757–765. IEEE, 2018.

Wenting Zhao and Carla Gomes. Evaluating multi-label classifiers with noisy labels. arXiv preprint arXiv:2102.08427, 2021.