# regretful_decisions_under_label_noise__84c19408.pdf

Published as a conference paper at ICLR 2025

REGRETFUL DECISIONS UNDER LABEL NOISE

Sujay Nagaraj (University of Toronto), Yang Liu (UC Santa Cruz), Flavio P. Calmon (Harvard SEAS), Berk Ustun (UC San Diego)

Machine learning models are routinely used to support decisions that affect individuals, be it to screen a patient for a serious illness or to gauge their response to treatment. In these tasks, we are limited to learning models from datasets with noisy labels. In this paper, we study the instance-level impact of learning under label noise. We introduce a notion of regret for this regime, which measures the number of unforeseen mistakes due to noisy labels. We show that standard approaches to learning under label noise can return models that perform well at a population level while subjecting individuals to a lottery of mistakes. We present a versatile approach to estimate the likelihood of mistakes at the individual level from a noisy dataset by training models over plausible realizations of datasets without label noise. This is supported by a comprehensive empirical study of label noise in clinical prediction tasks. Our results reveal how failure to anticipate mistakes can compromise model reliability and adoption, and we demonstrate how we can address these challenges by anticipating and avoiding regretful decisions.

1 INTRODUCTION

Machine learning models are routinely used to support or automate decisions that affect individuals, be it to screen a patient for a mental illness [52] or to assess their risk for an adverse treatment response [3]. In such tasks, we train models with labels that reflect noisy observations of the true outcome we wish to predict. In practice, such noise may arise due to measurement error [e.g., 23, 39], human annotation [30], or inherent ambiguity [39]. In all these cases, label noise can have detrimental effects on model performance [11]. Over the past decade, these issues have led to extensive work on learning from noisy datasets [see e.g., 11, 32, 40, 44, 49]. As a result, we have developed foundational results that characterize when label noise can be ignored and algorithms to mitigate its detrimental effects.

By and large, this work has focused on the impact of label noise at the population-level. In contrast, studying the effects of label noise at the instance-level has received limited attention. This oversight reflects the fact that we cannot provide meaningful guarantees on individual predictions under label noise [32]. In a best-case scenario, where we have perfectly specified distributional assumptions on label noise, we can learn a model that performs well on average, but we cannot identify where it makes mistakes; as a result, individuals are subject to a lottery of mistakes.

These effects undermine the utility of models in major real-world applications, as label noise arises in many settings where models are used to support or automate individual decisions [see, e.g., 55, for a meta-review of 72 cases in medicine]. In medical decision support tasks, our inability to identify mistakes can lead to overreliance, where physicians rely on predictions that may be incorrect [7, 29]. In automation tasks, our failure to assess the confidence of predictions can prevent us from reaping broader benefits, e.g., by abstention [10, 18].

In this work, we study how label noise affects individual predictions. Our motivation stems from the fact that, even if we cannot fully resolve the effects of label noise at the instance-level, we can mitigate harm by anticipating regretful predictions through uncertainty quantification. To this end, our main contributions are:

1. We introduce a notion of regret for learning from noisy datasets, capturing how label uncertainty affects individual predictions. We show that learning under label noise leads to inevitable regret, characterizing key limitations in a wide class of methods for learning from label noise.

[Figure 1: stylized example with rows for true labels $y_i$, noisy labels, predictions (mistakes marked where $f(x_i) \neq y_i$), and regret, shown for two noise draws (Noise Draw 1 and Noise Draw 2).]

Figure 1: Datasets with noisy labels only contain a single draw of label noise. In such settings, we can learn a model that performs well at a population level but cannot anticipate its mistakes. We characterize the number of individuals who are subjected to a lottery of mistakes in terms of regret, i.e., the difference between anticipated mistakes and actual mistakes. Here, we show a stylized classification task with 5 points, where each point with a positive label may be flipped with a probability of 30%. In this case, 4 points are subject to a lottery of mistakes and our model assigns regretful predictions to 2 points, highlighted in yellow.

2. We develop a method to flag regretful predictions by training models on plausible realizations of a clean dataset. Our approach can measure the sensitivity of individual predictions under label noise and incorporates common noise assumptions while controlling for plausibility.

3. We conduct a comprehensive empirical study on clinical prediction tasks. Our findings highlight the instance-level impact of label noise, and we demonstrate how our approach can support safer inference by flagging potential mistakes.

Related Work Our work is related to a stream of research on learning from noisy labels. We focus on applications where we cannot resolve label noise by acquiring clean labels [see e.g., 11, 49, for surveys]. Many methods learn models by hedging for uncertainty in labels [33, 40, 44]. As we show in Section 3, such approaches are robust to label noise at a population level while subjecting individuals to a lottery of mistakes. Our work highlights the limitations of this regime. In this sense, our results complement the work of Oyen et al. [43], who characterize the lack of robustness to label noise under general distributional assumptions.

We propose to mitigate these issues through a principled approach to uncertainty quantification. Our approach relates to recent work on model multiplicity, which shows how changes in the machine learning pipeline can produce models that assign conflicting predictions [see e.g., 4, 8, 20, 35, 38, 42, 53] and lead to downstream effects on fairness, explanations, and recourse [5, 17, 27, 36]. With respect to the literature on label noise, our approach is similar to the work of Reed et al. [47], who propose training an ensemble of deep neural networks by sampling alternative realizations of clean labels. In contrast, our procedure samples plausible realizations of clean labels and retrains plausible models to quantify uncertainty at an individual-level rather than predict.

2 PRELIMINARIES

We consider a classification task where we wish to learn a model $f : \mathcal{X} \to \mathcal{Y}$ to predict a label $y \in \mathcal{Y}$ from a feature vector $x \in \mathcal{X} \subseteq \mathbb{R}^d$. In a standard regime, we would be given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ where each $(x_i, y_i)$ is drawn from a joint distribution of random variables $X$ and $Y$. Given the dataset, we would learn a model that performs well in deployment, i.e., that minimizes the true risk $R(f) := \mathbb{E}_{X,Y}[\mathbb{I}[f(X) \neq Y]]$.

We consider a variant of this task where we learn a model from a noisy dataset $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^n$, where each noisy label $\tilde{y}_i$ represents a potentially corrupted true label $y_i$. In what follows, we refer to this corruption as a flip and denote it $u_i := \mathbb{I}[y_i \neq \tilde{y}_i]$. Given the flip $u_i$, we can express noisy labels in terms of true labels as $\tilde{y}_i := y_i \oplus u_i$ and vice versa as $y_i := \tilde{y}_i \oplus u_i$. Here, $a \oplus b := a + b - 2ab$ is the XOR operator. Given a noisy dataset, we represent all flips as a vector called the noise draw.
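The flip relation can be checked numerically. The following is a minimal sketch, assuming binary labels stored as 0/1 NumPy arrays; the helper `xor` and the example arrays are illustrative names and values, not part of the paper.

```python
import numpy as np

def xor(a, b):
    """Element-wise XOR for 0/1 arrays, written as a + b - 2ab."""
    a = np.asarray(a)
    b = np.asarray(b)
    return a + b - 2 * a * b

y_true = np.array([1, 0, 1, 1, 0])   # clean labels (hypothetical values)
u = np.array([1, 0, 0, 1, 0])        # noise draw: 1 marks a flipped label
y_noisy = xor(y_true, u)             # noisy labels observed in the dataset
y_recovered = xor(y_noisy, u)        # applying the same draw again undoes the flips
```

Because XOR is its own inverse, knowing the noise draw is equivalent to knowing the clean labels.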


Definition 1. Given a binary classification task with $n$ examples, the noise draw $u = [u_1, \ldots, u_n] \in \{0, 1\}^n$ is a realization of $n$ random variables $[U_1, \ldots, U_n] \in \{0, 1\}^n$.

Given an example $(x_i, y_i)$, each flip $u_i$ is drawn from a Bernoulli distribution with parameter $p_{u|y_i,x_i} := \Pr(U_i = 1 \mid X = x_i, Y = y_i)$. Thus, the noise is generated by the random process:

$$U_i \sim \mathrm{Bernoulli}(p_{u|y_i,x_i})$$

In what follows, we assume that the values $p_{u|y_i,x_i}$ are determined by a generic noise model that can take on different forms, e.g., uniform, class-level, or instance-level, as shown in Table 1. We write $p_u$ instead of $p_{u|y_i,x_i}$ when conditioning terms are irrelevant or clear from context. We assume that the model is correctly specified and that $p_u < 0.5$ for all points to ensure there are more clean than noisy labels [c.f., 1, 40, 44].
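As an illustration of this generative process, the sketch below samples a noise draw under a class-level noise model. The function `sample_noise_draw`, the dictionary `flip_rates`, and the chosen rates are assumptions made for the example only.

```python
import numpy as np

def sample_noise_draw(y, flip_rates, rng):
    """Draw u_i ~ Bernoulli(p_{u|y_i}) for each clean label under class-level noise."""
    y = np.asarray(y)
    p = np.where(y == 1, flip_rates[1], flip_rates[0])  # per-instance flip probability
    return (rng.random(y.shape[0]) < p).astype(int)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)   # hypothetical clean labels
flip_rates = {0: 0.0, 1: 0.2}         # only positive labels can flip; both rates are below 0.5
u = sample_noise_draw(y, flip_rates, rng)
```

A uniform model is the special case where both entries of `flip_rates` are equal; an instance-level model would replace the lookup with a function of the features.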

Given a noisy dataset, we denote the noise draw over all instances as the true draw $u^{\mathrm{true}} := [u^{\mathrm{true}}_1, \ldots, u^{\mathrm{true}}_n]$. In practice, the true draw $u^{\mathrm{true}}$ is fixed but unknown. From the practitioner's perspective, $u^{\mathrm{true}}$ could be any realization of the random variables $U$. If they knew $u^{\mathrm{true}}$, they could recover the true labels as $y_i = \tilde{y}_i \oplus u^{\mathrm{true}}_i$ and learn without label noise. As this is infeasible, given the noise model and a set of priors, practitioners can estimate the posterior noise model $q_{u|\tilde{y}_i,x_i} := \Pr(U_i = 1 \mid X = x_i, \tilde{Y} = \tilde{y}_i)$ to infer clean labels from observed noisy labels.

| Noise Type | PGM | Noise Model | Posterior Model | Inference Requirements | Sample Use Case |
|---|---|---|---|---|---|
| Uniform | (graph omitted) | $p_u = \Pr(U = 1)$ | $q_u = \Pr(U = 1)$ | | Screening tests with a fixed failure rate [e.g., COVID rapid tests 2]. |
| Class-Level | (graph omitted) | $p_{u \mid y} = \Pr(U = 1 \mid Y = y)$ | $q_{u \mid \tilde{y}} = \Pr(U = 1 \mid \tilde{Y} = \tilde{y})$ | $\pi_y = \Pr(Y = y)$ | Chest X-ray diagnosis where label noise changes based on the disease $Y$ [e.g., pneumonia vs COVID 15]. |
| Group-Level | (graph omitted) | $p_{u \mid y,g} = \Pr(U = 1 \mid Y = y, G = g)$ | $q_{u \mid \tilde{y},g} = \Pr(U = 1 \mid \tilde{Y} = \tilde{y}, G = g)$ | $\pi_{y,g} = \Pr(Y = y \mid G = g)$ | Diagnostic tasks where the incidence of label noise changes across subpopulations [e.g., racial bias in diagnosis 14, 51]. |
| Instance-Level | (graph omitted) | $p_{u \mid y,x} = \Pr(U = 1 \mid Y = y, X = x)$ | $q_{u \mid \tilde{y},x} = \Pr(U = 1 \mid \tilde{Y} = \tilde{y}, X = x)$ | $\pi_{y,x} = \Pr(Y = y, X = x)$ | Data-driven discovery tasks where $Y$ is an experimental outcome confirmed by a hypothesis test with type I/II error [16, 39]. |

Table 1: Common noise models that we consider in this work. We represent each model as a probability distribution with parameters $p_{u|y,x}$ and show its corresponding probabilistic graphical model (PGM). Given a noisy dataset, noise model, and prior distribution $\pi_y$, we infer noise draws from a posterior distribution with parameters $q_{u|\tilde{y},x}$.

3 REGRETFUL DECISIONS

Consider a practitioner who learns a model $f : \mathcal{X} \to \mathcal{Y}$ from a noisy dataset. In practice, they may learn a model that performs well on average. However, they cannot determine where it makes mistakes. In such tasks, individuals are subject to a lottery of mistakes. We characterize this effect in terms of regret.

Definition 2. Given a classification task where we learn a model $f : \mathcal{X} \to \mathcal{Y}$ from a noisy dataset, we define the regret for an instance $(x_i, \tilde{y}_i)$ as:

$$\mathrm{Regret}(f(x_i), \tilde{y}_i, U_i) := \mathbb{I}\left[e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) \neq e^{\mathrm{true}}(f(x_i), y_i(U_i))\right]$$

Here, $e^{\mathrm{true}}(f(x_i), y_i(U_i)) := \mathbb{I}[f(x_i) \neq y_i(U_i)]$ indicates an actual mistake with respect to the true label. We write the true label as $y_i(U_i) := \tilde{y}_i \oplus U_i$ to show that it is a random variable. $e^{\mathrm{pred}}(f(x_i), \tilde{y}_i)$ indicates that the model has made an anticipated mistake, i.e., that it appears to have made a mistake based on what we can tell during training.

In practice, $e^{\mathrm{pred}}(\cdot)$ is determined by how we account for noise, if at all. If we ignore label noise and fit a model via standard ERM on the noisy dataset, then $e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) := \mathbb{I}[f(x_i) \neq \tilde{y}_i]$. If we fit a model via noise-tolerant ERM [e.g., 40, 44], then $e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) := \tilde{\ell}_{01}(f(x_i), \tilde{y}_i)$, where $\tilde{\ell}_{01}(\cdot)$ is an unbiased loss defined such that $\mathbb{E}_U[\tilde{\ell}_{01}(f(x_i), \tilde{y}_i)] = \ell_{01}(f(x_i), y_i)$.
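To make the definition concrete, the sketch below computes instance-level regret for the case where label noise is ignored, so that the anticipated mistake is simply disagreement with the noisy label. The function name `regret_ignore` and the five-point arrays are hypothetical; in a real task the noise draw `u` is unknown, so this computation is only possible in simulation.

```python
import numpy as np

def regret_ignore(pred, y_noisy, u):
    """Regret when e_pred is the 0-1 error on noisy labels (standard ERM)."""
    y_true = np.bitwise_xor(y_noisy, u)      # clean labels implied by the noise draw
    e_pred = (pred != y_noisy).astype(int)   # anticipated mistakes
    e_true = (pred != y_true).astype(int)    # actual mistakes
    return (e_pred != e_true).astype(int)    # 1 where the two disagree

pred = np.array([1, 0, 0, 1, 0])      # model predictions (hypothetical)
y_noisy = np.array([1, 0, 0, 0, 0])   # observed noisy labels
u = np.array([0, 1, 0, 1, 0])         # simulated noise draw
regret = regret_ignore(pred, y_noisy, u)
```

In this binary setting, the regret vector coincides with the noise draw itself: a prediction is regretful exactly when its label was flipped.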


Regret captures the irreducible error we incur due to randomness. In online learning, regret arises because we cannot foresee randomness in the future. In learning from noisy labels, regret arises because we cannot infer randomness from the past. In this case, randomness undermines our ability to determine which predictions are correct. In this regime, individual predictions cannot be assumed to be accurate even on the training data. As a result, we cannot rely on predictions to support individual decisions. Moreover, we cannot rely on any downstream applications that depend on the correctness of individual predictions, e.g., model explanations [6, 48] or post-hoc analyses [25, 26, 34]. In Prop. 3, we explore the relationship between these effects and label noise.

Proposition 3. In a classification task where we learn a classifier $f$ from a noisy dataset $\tilde{D}$:

$$\mathbb{E}_{U \mid X, \tilde{Y}}\left[\mathrm{Regret}(f(X), \tilde{Y}, U)\right] = \Pr(U = 1 \mid \tilde{Y}, X).$$

Prop. 3 provides an opportunity to highlight several implications of learning from label noise at the instance-level. On the one hand, this result implies that regret is unavoidable when learning under label noise. In practice, we can only avoid it by predicting less (e.g., via selective classification) or by removing noise (e.g., via relabeling). On the other hand, the result also implies that we can estimate the expected number of regretful predictions in terms of the posterior noise rate. In practice, however, we cannot tell how these mistakes are distributed over all instances.

One of the key issues in this regime is that the value of a prediction may be compromised, as each instance where $q_{u|x,\tilde{y}} > 0$ is subject to a lottery of mistakes. Consider screening for a rare disease using a diagnostic test. In such cases, we can view the presence of the disease as a clean label $y_i$ and the test outcome as a noisy label $\tilde{y}_i$. Given a disease that affects 10% of patients and a class-level noise model that flips 10% of positive cases, an average draw of label noise would affect 1% of predictions. In practice, such conditions would undermine the value of screening because any patient with a negative test may have the disease. We characterize these effects by measuring the proportion of instances in a dataset that are susceptible to regret, i.e., that are subject to the lottery of mistakes. Given a noisy dataset $\tilde{D}$ and a posterior noise model $\Pr(U = 1 \mid X, \tilde{Y})$, the proportion of points susceptible to regret is:

$$\mathrm{Susceptibility}(\tilde{D}) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[\Pr(U = 1 \mid X = x_i, \tilde{Y} = \tilde{y}_i) > 0\right] \quad (1)$$
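The rare-disease example can be worked through numerically. The sketch below is illustrative only: the function `susceptibility`, the variable names, and the 100-patient cohort (91 negative tests, 9 positive tests, matching a 10% prevalence with 10% of positives flipped) are assumptions introduced here.

```python
import numpy as np

def susceptibility(q):
    """Proportion of instances with a strictly positive posterior flip probability."""
    q = np.asarray(q, dtype=float)
    return float(np.mean(q > 0))

prevalence, flip_pos = 0.10, 0.10   # Pr(Y = 1) and Pr(U = 1 | Y = 1); negatives never flip

# Posterior flip probability for a negative test, by Bayes' rule:
# flipped positives / (flipped positives + true negatives).
q_neg = prevalence * flip_pos / (prevalence * flip_pos + (1 - prevalence))

y_noisy = np.array([0] * 91 + [1] * 9)   # hypothetical cohort of 100 test outcomes
q = np.where(y_noisy == 0, q_neg, 0.0)   # a positive test is never a flip in this model
share = susceptibility(q)
```

Although only about 1% of labels are flipped on average, every negative test carries a small positive posterior flip probability, so 91% of the cohort is susceptible.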

On the Regret of Hedging One of the benefits of studying regret in this regime is that we can characterize when learning is feasible at both the population and instance levels. Many algorithms for learning from noisy labels are designed to hedge against label noise [46]. Given a noisy dataset and a noise model, hedging minimizes the expected risk over all possible noise draws. In some cases, algorithms may implement this strategy explicitly via ERM with a modified loss [see e.g., 37, 40]. In others, algorithms may hedge implicitly, e.g., by assigning sample weights to training instances and setting their values to minimize expected risk over all possible draws [see e.g., 33, 44, 54].

In a best-case scenario, where we correctly specify the noise model and fit a model that minimizes the average number of mistakes over all noise draws, we would still incur regret. Formally, we would expect $\mathbb{E}_{U \mid X, \tilde{Y}}[\mathrm{Error}(f, \tilde{D}, U)] = 0$ where:

$$\mathrm{Error}(f, \tilde{D}; U) := \underbrace{\sum_{i=1}^{n} e^{\mathrm{pred}}(f(x_i), \tilde{y}_i)}_{\text{Predicted Training Error}} - \underbrace{\sum_{i=1}^{n} e^{\mathrm{true}}(f(x_i), y_i(U_i))}_{\text{True Training Error}} \quad (2)$$

However, the resulting model $f$ would still incur regret: $\mathbb{E}_{U \mid X, \tilde{Y}}\left[\mathrm{Regret}(f, \tilde{D}, U)\right] > 0$. In Prop. 4, we show that the classical hedging algorithm of Natarajan et al. [40] exhibits this behavior.

Proposition 4. Consider training a model $f : \mathcal{X} \to \mathcal{Y}$ on a noisy dataset via ERM with a modified loss function $\tilde{\ell} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ such that $\mathbb{E}_U[\tilde{\ell}(f(x), \tilde{y})] = \ell(f(x), y)$ for all $(x, y)$. In this case, the model minimizes risk for an implicit noise draw $u^{\mathrm{mle}} = [u^{\mathrm{mle}}_1, \ldots, u^{\mathrm{mle}}_n]$ where $u^{\mathrm{mle}}_i$ corresponds to the most likely outcome under the posterior noise model $q_{u|\tilde{y}_i,x_i}$.

Prop. 4 implies that hedging will incur regret unless the implicit noise draw $u^{\mathrm{mle}}$ matches the true noise in the dataset $u^{\mathrm{true}}$. In practice, this event is unlikely as $\lim_{n \to \infty} \Pr(u^{\mathrm{mle}} = u^{\mathrm{true}}) = 0$ (see Appendix A).
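A short calculation suggests why this event becomes unlikely. It is an informal sketch rather than the argument of Appendix A. It assumes that flips are independent given the data, that every posterior rate satisfies $q_i < 0.5$ (so the most likely draw contains no flips), and that at least $n_s$ instances have $q_i \geq q_{\min} > 0$. The symbols $q_i$, $n_s$, and $q_{\min}$ are introduced only for this illustration.

```latex
\Pr\left(u^{\mathrm{mle}} = u^{\mathrm{true}}\right)
  = \prod_{i=1}^{n} \Pr\left(U_i = 0 \mid X = x_i, \tilde{Y} = \tilde{y}_i\right)
  = \prod_{i=1}^{n} (1 - q_i)
  \le (1 - q_{\min})^{n_s}
  \xrightarrow[n_s \to \infty]{} 0
```

For instance, with $q_{\min} = 0.1$ and $n_s = 100$ susceptible instances, the bound is $0.9^{100} \approx 2.7 \times 10^{-5}$.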


4 ANTICIPATING MISTAKES WITH PLAUSIBLE MODELS

Our results in Section 3 show how a model we learn under label noise will output regretful predictions. In this section, we develop methods to estimate the likelihood of an individual instance yielding a regretful prediction.

4.1 MOTIVATION

Our goal is to evaluate the correctness of individual predictions for models learned from noisy data. In a standard classification task, we apply an algorithm for ERM to a clean dataset, recover the model $\hat{f} \in \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[f(x_i) \neq y_i]$, and evaluate the correctness of each prediction on the training data in terms of mistakes. When we learn from a noisy dataset $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^n$, the corresponding measure is no longer deterministic:

$$\mathrm{Mistake}(x_i, Y_i, \hat{F}) = \mathbb{I}\left[\hat{F}(x_i) \neq Y_i\right] \quad (3)$$

In this case, the randomness stems from: (1) the true label $Y_i$, which is a random variable that can only be inferred from the observed noisy label $\tilde{y}_i$ and the posterior noise model; and (2) the model $\hat{F} : \mathcal{X} \to \mathcal{Y}$, which is the output of a learning algorithm on the noisy dataset.

Our proposed measure, which we call ambiguity, quantifies the expected likelihood of a learning algorithm making a mistake on the training data, i.e., the expected value of (3):

$$\mathrm{Ambiguity}(x_i, \tilde{y}_i) := \mathbb{E}_{Y_i, \hat{F} \mid \tilde{D}}\left[\mathrm{Mistake}(x_i, Y_i, \hat{F})\right] = \mathbb{E}_{u \sim U \mid \tilde{D}}\left[\mathbb{I}\left[\hat{F}(x_i) \neq (\tilde{y}_i \oplus U_i)\right]\right] \quad (4)$$

Ambiguity uses all the information we have at hand: a noisy dataset and a noise model. In Prop. 5, we show how ambiguity corresponds to regret as it correctly ranks the instances based on the likelihood of experiencing regret.

Proposition 5. Given a classification task, denote the clean label risk of a model $\hat{F}(x_i)$ on an instance $x_i$ as $e$, that is, $e := \Pr(\hat{F}(x_i) \neq y_i)$. When $e < 0.5$, that is, when a model makes more correct than incorrect predictions, a higher label noise rate for instance $x_i$ corresponds to higher $\mathrm{Ambiguity}(x_i, \tilde{y}_i)$.

Since Prop. 3 establishes that regret corresponds to the posterior noise rate, Prop. 5 suggests that ambiguity serves as a viable measure of regret, given its correspondence to the posterior noise rate.

4.2 ESTIMATION

We can construct unbiased estimates of ambiguity using Algorithm 1. Given a noisy dataset and a noise model, this procedure generates plausible realizations of a clean dataset, and then trains a set of plausible models that can be used to estimate ambiguity. In what follows, we describe this procedure in greater detail.

Sampling Plausible Draws Given a noisy dataset $\tilde{D}$, class-level noise model $p_{u|y}$, and prior distribution $\pi_y := \Pr(Y = y)$, we can sample noise draws from the posterior distribution:

$$q_{u|\tilde{y}} = \frac{(1 - \pi_{\tilde{y}}) \cdot p_{u|1-\tilde{y}}}{p_{u|1-\tilde{y}} \cdot (1 - \pi_{\tilde{y}}) + (1 - p_{u|\tilde{y}}) \cdot \pi_{\tilde{y}}} \quad (5)$$

This generalizes to different types of noise models (see e.g., Table 1). We can use these samples from the posterior distribution to estimate ambiguity directly. In practice, however, this may lead to biased estimates by returning atypical draws, i.e., noise draws that are unlikely under a given noise model (e.g., a noise draw that flips 30% of labels under a uniform noise model with a noise rate of 10%). In settings where we wish to estimate ambiguity using a limited number of draws, an atypical draw can bias our estimates and undermine their utility. Although we could moderate this bias by increasing the number of draws, this would require training a separate model for each draw. We address these issues by sampling from a set of plausible draws.
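The posterior flip rate follows from Bayes' rule: a noisy label $\tilde{y}$ is a flip exactly when the true label was $1 - \tilde{y}$ and got flipped. The sketch below implements this for binary labels and class-level noise. The function name `posterior_flip_rate`, the dictionaries `prior` and `p_flip`, and the numeric values are assumptions for illustration.

```python
def posterior_flip_rate(y_noisy, prior, p_flip):
    """Pr(U = 1 | noisy label = y_noisy) for binary labels under class-level noise.

    prior[y]  = Pr(Y = y)          (class prior on clean labels)
    p_flip[y] = Pr(U = 1 | Y = y)  (class-level noise model)
    """
    other = 1 - y_noisy
    flipped = prior[other] * p_flip[other]               # true label was `other` and flipped
    unflipped = prior[y_noisy] * (1 - p_flip[y_noisy])   # true label equals the noisy label
    return flipped / (flipped + unflipped)

prior = {0: 0.5, 1: 0.5}     # hypothetical balanced classes
p_flip = {0: 0.0, 1: 0.4}    # only positive labels flip, at a 40% rate
q0 = posterior_flip_rate(0, prior, p_flip)   # posterior for a noisy negative
q1 = posterior_flip_rate(1, prior, p_flip)   # posterior for a noisy positive
```

With these values, a noisy negative has a posterior flip probability of 0.2 / 0.7 (about 0.29), while a noisy positive is certain to be clean because negatives never flip.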


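Algorithm 1 Generate Plausible Draws, Datasets, and Models

Input: noisy dataset $\{(x_i, \tilde{y}_i)\}_{i=1}^n$, noise model $p_{u|y}$, number of models $m \geq 1$, atypicality $\epsilon \in [0, 1]$
Initialize: $\hat{F}^{\mathrm{plaus}}_\epsilon \leftarrow \{\}$
1: repeat
2:   $u_i \sim \mathrm{Bernoulli}(q_{u|\tilde{y},x})$ for $i \in [n]$   ▷ generate noise draw by posterior inference
3:   if $[u_1, \ldots, u_n] \in U_\epsilon$ then   ▷ check if draw is plausible using Def. 6
4:     $\hat{y}_i \leftarrow \tilde{y}_i \oplus u_i$ for $i \in [n]$
5:     $\hat{D} \leftarrow \{(x_i, \hat{y}_i)\}_{i=1}^n$   ▷ construct plausible clean dataset
6:     $\hat{f} \leftarrow \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}(f; \hat{D})$   ▷ train plausible model
7:     $\hat{F}^{\mathrm{plaus}}_\epsilon \leftarrow \hat{F}^{\mathrm{plaus}}_\epsilon \cup \{\hat{f}\}$   ▷ update plausible models
8:   end if
9: until $|\hat{F}^{\mathrm{plaus}}_\epsilon| = m$
Output: $\hat{F}^{\mathrm{plaus}}_\epsilon$, sample of $m$ models from the set of plausible models $F^{\mathrm{plaus}}_\epsilon$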
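The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: posterior rates are supplied per noisy class in a hypothetical dictionary `q_by_class`, the plausibility check interprets the empirical noise rate as the share of flips among instances with a given noisy label, a non-strict inequality is used so that classes with a zero posterior rate pass, a `max_tries` cap is added for safety, and `fit_centroids` is a stand-in learner that can be swapped for any training routine.

```python
import numpy as np

def is_plausible(u, y_noisy, q_by_class, eps):
    """Per noisy class, require the empirical flip rate to lie within eps * q of q."""
    for c, q in q_by_class.items():
        mask = y_noisy == c
        if mask.any() and abs(u[mask].mean() - q) > eps * q:
            return False
    return True

def fit_centroids(X, y):
    """Stand-in learner (hypothetical): predict the class of the nearest centroid."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def predict(Z):
        d0 = np.linalg.norm(Z - mu0, axis=1)
        d1 = np.linalg.norm(Z - mu1, axis=1)
        return (d1 < d0).astype(int)
    return predict

def plausible_models(X, y_noisy, q_by_class, m, eps, fit, rng, max_tries=10_000):
    """Sample plausible noise draws and train one model per plausible clean dataset."""
    q = np.array([q_by_class[int(c)] for c in y_noisy])   # per-instance posterior rate
    models, draws = [], []
    for _ in range(max_tries):
        u = (rng.random(len(y_noisy)) < q).astype(int)    # posterior noise draw
        if not is_plausible(u, y_noisy, q_by_class, eps):
            continue                                      # reject atypical draws
        y_hat = np.bitwise_xor(y_noisy, u)                # plausible clean labels
        models.append(fit(X, y_hat))                      # plausible model
        draws.append(u)
        if len(models) == m:
            break
    return models, draws
```

Returning the draws alongside the models keeps the plausible clean labels available for the ambiguity estimate, which compares each model with the dataset it was trained on.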

Definition 6. Given a noise draw $u \in \{0, 1\}^n$, denote its true posterior noise rate as $q_{u|\tilde{y}} := \Pr(U = 1 \mid \tilde{Y} = \tilde{y})$ and empirical noise rate as $\hat{q}_{u|\tilde{y}} := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[u_i = 1 \mid \tilde{y}_i = \tilde{y}]$. For any $\epsilon \in [0, 1]$, the set of plausible draws contains all draws whose empirical noise rate is within $\epsilon$ of the posterior rate:

$$U_\epsilon(\tilde{y}) := \left\{u \in \{0, 1\}^n \ \text{s.t.} \ |q_{u|\tilde{y}} - \hat{q}_{u|\tilde{y}}| < \epsilon \cdot q_{u|\tilde{y}} \ \text{for all} \ u \in \{0, 1\}\right\}.$$

The set of plausible draws is a strongly typical set [see 9]. In a classification task where $n$ is large, we can expect most (but not all) draws to concentrate in $U_\epsilon$ [see Theorem 3.1.2 in 9]. We can limit atypical draws by setting the atypicality parameter $\epsilon$, which represents the relative deviation between the true noise rate $q_{u|\tilde{y}}$ and the noise rate of sampled draws. Given a uniform noise model where $q_{u|\tilde{y}} = 0.1$, we would set $\epsilon = 0.2$ to only consider draws that flip between 8% and 12% of instances. Alternatively, we can set $\epsilon$ to ensure that $U_\epsilon(\tilde{y})$ will include a particular noise draw $u \in F^{\mathrm{plaus}}_\epsilon$ with high probability (see Prop. 9 in Appendix A). By default, we set $\epsilon = 0.1$ to consider draws within 10% of what we would expect.
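The effect of the atypicality parameter can be explored by simulation. The sketch below computes the band of admissible empirical noise rates and estimates how often a random draw falls inside it under a uniform noise model. The functions `plausible_band` and `acceptance_rate`, the closed interval, and the sample sizes are assumptions made for illustration.

```python
import numpy as np

def plausible_band(q, eps):
    """Interval of empirical noise rates within a relative deviation eps of q."""
    return q * (1 - eps), q * (1 + eps)

def acceptance_rate(q, eps, n, n_draws, rng):
    """Monte Carlo estimate of the share of noise draws whose empirical rate lies in the band."""
    lo, hi = plausible_band(q, eps)
    rates = rng.binomial(n, q, size=n_draws) / n   # empirical noise rate of each simulated draw
    return float(np.mean((rates >= lo) & (rates <= hi)))

rng = np.random.default_rng(0)
band = plausible_band(0.1, 0.2)   # roughly (0.08, 0.12), as in the example above
rate_small = acceptance_rate(0.1, 0.2, n=100, n_draws=5000, rng=rng)
rate_large = acceptance_rate(0.1, 0.2, n=1000, n_draws=5000, rng=rng)
```

Larger datasets concentrate more draws inside the band, so fewer samples are rejected when generating plausible draws.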

Estimating Ambiguity Given a plausible noise draw $u^k \in U_\epsilon(\tilde{y})$, we construct a plausible clean dataset by pairing each $x_i$ with a plausible value of the true label $\hat{y}^k_i = u^k_i \oplus \tilde{y}_i$.

Definition 7. The set of $\epsilon$-plausible models contains all models trained using $\epsilon$-plausible datasets:

$$F^{\mathrm{plaus}}_\epsilon := \left\{\hat{f} \in \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}(f, \hat{D}) \ \middle| \ \hat{D} := \{(x_i, \hat{y}^k_i)\}_{i=1}^n, \ u^k \in U_\epsilon(\tilde{y})\right\}$$

In an ideal case, where we recover a plausible draw that matches the true draw $u^k = u^{\mathrm{true}}$, our procedure returns a plausible dataset $\hat{D}^k$ and model $\hat{f}^k$ that perfectly flags all regretful predictions. Since $u^{\mathrm{true}}$ is unknown, we repeat this process $m$ times and use the $m$ plausible models to get an unbiased estimate of ambiguity for each point in our noisy dataset as:

$$\hat{\mu}(x, \tilde{y}) := \frac{1}{m} \sum_{k \in [m]} \mathbb{I}\left[\hat{f}^k(x) \neq \hat{y}^k\right].$$
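The estimate is an average of disagreement indicators across the plausible models. A minimal sketch, assuming the predictions and plausible clean labels have already been collected into two arrays of shape (m, n); the function name `estimate_ambiguity` and the small arrays are hypothetical.

```python
import numpy as np

def estimate_ambiguity(preds, clean_labels):
    """Average, over m plausible models, of whether model k disagrees with its own clean label.

    preds[k, i]        = prediction of plausible model k on instance i
    clean_labels[k, i] = plausible clean label of instance i under draw k
    """
    preds = np.asarray(preds)
    clean_labels = np.asarray(clean_labels)
    return (preds != clean_labels).mean(axis=0)

# Hypothetical outputs for m = 4 plausible models on n = 3 instances.
preds = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]])
labels = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1]])
mu_hat = estimate_ambiguity(preds, labels)
```

Here the first instance is never a mistake, the second is a mistake under one of four draws, and the third under two of four, giving estimates of 0, 0.25, and 0.5.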

In practice, we can use ambiguity as a confidence score to operationalize techniques to learn or predict reliably. We propose a few examples and demonstrate how these perform in Section 6:

Data Cleaning: We can use ambiguity to flag regretful instances in a training dataset to drop or relabel. Given the correspondence between regret and noise (Prop. 3), this approach can be used to de-noise a dataset to train models that generalize better on clean test data.

Selective Prediction: We can use ambiguity to abstain from potentially regretful predictions at test time via selective prediction [12]. This approach can be used in settings such as clinical decision support, where we only show sufficiently reliable predictions and defer uncertain predictions to a clinician.
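Both uses reduce to thresholding the ambiguity estimates. The sketch below shows one possible form; the function names `drop_most_ambiguous` and `abstain`, the choice to drop a fixed fraction, and the example values are assumptions for illustration.

```python
import numpy as np

def drop_most_ambiguous(mu_hat, drop_frac):
    """Indices of training instances kept after dropping the most ambiguous fraction."""
    mu_hat = np.asarray(mu_hat)
    n_drop = int(round(drop_frac * len(mu_hat)))
    order = np.argsort(mu_hat, kind="stable")        # least ambiguous first
    return np.sort(order[: len(mu_hat) - n_drop])

def abstain(mu_hat, tau):
    """Selective prediction: abstain when the confidence 1 - mu_hat falls below tau."""
    return (1.0 - np.asarray(mu_hat)) < tau

mu_hat = np.array([0.0, 0.6, 0.1, 0.9, 0.2])   # hypothetical ambiguity estimates
kept = drop_most_ambiguous(mu_hat, drop_frac=0.4)
deferred = abstain(mu_hat, tau=0.5)
```

In this toy example, the two instances with the highest ambiguity are removed from training, and the same two would be deferred to a clinician at a confidence threshold of 0.5.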

Discussion The main limitation of this approach is that we assume access to a correctly specified noise model. This assumption is a practical limitation, but it can be validated by comparing the noise model against distributions estimated from a noisy dataset [see e.g., 31, 33, 44]. When working with simple noise models (e.g., uniform or class-level), we may be conservative and assume a higher noise rate. Alternatively, we can hedge against misspecification by setting $\epsilon$ to capture a larger set of plausible draws. The set of plausible models can also be used in other ways to construct uncertainty measures, as demonstrated in Sections 5 and 6.

We also note that our estimates of ambiguity assume that the true noise draw, $u^{\mathrm{true}}$, is typical. In practice, although $u^{\mathrm{true}}$ is unknown, most draws can be shown to be typical; this follows from a standard application of a Chernoff bound [9].

5 EXPERIMENTS

In this section, we present an empirical study on clinical prediction tasks. Our goal is to document the effects of label noise on individual-level predictions. Supporting material and code can be found in Appendix B and on GitHub.

Setup We work with 5 classification datasets from clinical applications where models support individual medical decisions (see Table 3). We treat the labels in each dataset as true labels. We create noisy datasets by corrupting the labels using a noise draw sampled according to three class-level noise models with noise rates [5%, 20%, 40%], where label noise only affects positive instances ($y_i = 1$). We split each dataset into a training sample (80%), which we use to train a logistic regression model (LR) and a neural network (DNN) using noisy labels, and a test sample (20%), which we use to measure out-of-sample performance using true labels. We train these models using the following methods:

1. Ignore, where we ignore label noise and fit a model to predict noisy training labels; and
2. Hedge, where we hedge against label noise using the method of Natarajan et al. [40].

This yields 12 models for each dataset (3 noise regimes × 2 model classes × 2 training procedures).

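| Metric | Definition | Description |
|---|---|---|
| True Error$(f, D)$ | $\frac{1}{n} \sum_{i \in [n]} e^{\mathrm{true}}(f(x_i), y_i)$ | Error rate of $f$ on the clean training labels. |
| Anticipated Error$(f, \tilde{D})$ | $\frac{1}{n} \sum_{i \in [n]} e^{\mathrm{pred}}(f(x_i), \tilde{y}_i)$ | Error rate of $f$ on the noisy labels. |
| Susceptibility$(\tilde{D})$ | $\frac{1}{n} \sum_{i \in [n]} \mathbb{I}\left[\Pr(U = 1 \mid X = x_i, \tilde{Y} = \tilde{y}_i) > 0\right]$ | Proportion of instances in $\tilde{D}$ subject to regret. |
| Regret$(f, \tilde{D})$ | $\frac{1}{n} \sum_{i \in [n]} \mathbb{I}\left[e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) \neq e^{\mathrm{true}}(f(x_i), y_i)\right]$ | Mean regret across all instances in $\tilde{D}$. We expect Regret$(f, \tilde{D}) \approx \sum_{y} q_{u \mid y} \pi_y$ under class-level label noise. |
| Overreliance$(f, \tilde{D})$ | $\frac{1}{n} \sum_{i \in [n]} \mathbb{I}\left[e^{\mathrm{true}}(f(x_i), y_i) = 1, \ e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) = 0\right]$ | Proportion of predictions in $\tilde{D}$ that are incorrectly perceived as accurate. |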

Table 2: Overview of summary statistics in Table 3. We report these metrics for models that we train from noisy labels using a specific training procedure, model class, noise model, and dataset. We evaluate all models trained on a given dataset and noise model using a fixed noise draw.
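The metrics above are simple averages of indicator functions. The sketch below computes them for the Ignore setting, where the anticipated mistake is disagreement with the noisy label. The function name `summary_metrics` and the toy arrays are assumptions, and the true labels are only available because the noise is simulated.

```python
import numpy as np

def summary_metrics(pred, y_noisy, y_true, q):
    """Summary statistics for a model trained while ignoring label noise."""
    pred, y_noisy, y_true, q = map(np.asarray, (pred, y_noisy, y_true, q))
    e_pred = pred != y_noisy   # anticipated mistakes (error on noisy labels)
    e_true = pred != y_true    # actual mistakes (error on clean labels)
    return {
        "true_error": float(e_true.mean()),
        "anticipated_error": float(e_pred.mean()),
        "susceptibility": float((q > 0).mean()),
        "regret": float((e_pred != e_true).mean()),
        "overreliance": float((e_true & ~e_pred).mean()),
    }

# Hypothetical five-instance example with simulated clean labels.
metrics = summary_metrics(
    pred=[1, 0, 0, 1, 0],
    y_noisy=[1, 0, 0, 0, 0],
    y_true=[1, 1, 0, 1, 0],
    q=[0.0, 0.3, 0.3, 0.3, 0.3],
)
```

In this toy example, the true and anticipated error rates are both 20%, yet the mistakes fall on different instances, so 40% of predictions are regretful and 20% lead to overreliance.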

We characterize the accuracy and reliability of predictions from each model using the measures in Table 2. We report our results for LR models in Table 3 and results for DNN models in Appendix B. In what follows, we discuss all results.

On Label Noise, Regret, and Hedging Our results in Table 3 highlight several implications of learning under label noise. We confirm that our result in Prop. 3 holds empirically i.e., the expected prevalence of regretful predictions corresponds to the effective noise rate in each dataset. We observe similar effects across all datasets, model classes, and noise regimes, underscoring the need to quantify the effect of label noise on individual predictions.

Existing approaches to handle label noise (i.e., hedging) can learn models that are robust to noise at a population level but still experience regret. As shown in Table 3, we observe that Hedge can moderate the impact of label noise at a population level by reducing the true (clean label) error compared to Ignore. Even with more noise robustness, regret is unchanged and remains high across all experimental conditions. On the mortality dataset, for example, Hedge reduces the error rate by over 13% compared to Ignore for an LR model under 40% label noise. However, regret is unchanged and continues to affect 19.5% of instances. It is interesting to note that Hedge can moderate the effects of overreliance by redistributing unforeseen mistakes from instances that lead to overreliance to instances where $e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) = 1$ and $e^{\mathrm{true}}(f(x_i), y_i) = 0$, i.e., where a practitioner may fail to reap the benefits of a correct prediction because it appears to be incorrect.

| Dataset | Metric | 5% Ignore | 5% Hedge | 20% Ignore | 20% Hedge | 40% Ignore | 40% Hedge |
|---|---|---|---|---|---|---|---|
| shock_eicu ($n = 3{,}456$, $d = 104$), Pollard et al. [45] | True Error | 24.4% | 23.5% | 27.1% | 24.6% | 41.0% | 24.3% |
| | Anticipated Error | 25.7% | 25.2% | 28.3% | 29.4% | 28.2% | 33.5% |
| | Regret | 3.0% | 3.0% | 10.1% | 10.1% | 19.7% | 19.7% |
| | Overreliance | 1.1% | 0.9% | 6.3% | 3.8% | 22.6% | 7.9% |
| | Susceptibility | 52.6% | 52.6% | 59.7% | 59.7% | 69.3% | 69.3% |
| shock_mimic ($n = 15{,}254$, $d = 104$), Johnson et al. [22] | True Error | 20.8% | 20.2% | 25.0% | 20.3% | 34.9% | 20.1% |
| | Anticipated Error | 22.1% | 21.7% | 26.8% | 26.4% | 27.4% | 32.5% |
| | Regret | 2.5% | 2.5% | 10.2% | 10.2% | 19.8% | 19.8% |
| | Overreliance | 0.8% | 0.6% | 5.8% | 2.8% | 18.8% | 5.5% |
| | Susceptibility | 52.5% | 52.5% | 60.2% | 60.2% | 69.8% | 69.8% |
| lungcancer ($n = 62{,}916$, $d = 40$), NCI [41] | True Error | 31.7% | 30.8% | 33.7% | 30.8% | 43.0% | 31.1% |
| | Anticipated Error | 32.2% | 31.5% | 32.7% | 33.6% | 30.0% | 36.5% |
| | Regret | 2.5% | 2.5% | 10.0% | 10.0% | 19.7% | 19.7% |
| | Overreliance | 1.5% | 1.3% | 8.1% | 5.4% | 23.4% | 11.3% |
| | Susceptibility | 52.7% | 52.7% | 60.2% | 60.2% | 69.9% | 69.9% |
| mortality ($n = 20{,}334$, $d = 84$), Le Gall et al. [28] | True Error | 19.5% | 19.0% | 23.2% | 19.1% | 33.2% | 19.4% |
| | Anticipated Error | 20.7% | 20.4% | 25.7% | 25.0% | 27.7% | 30.9% |
| | Regret | 2.2% | 2.2% | 9.8% | 9.8% | 19.5% | 19.5% |
| | Overreliance | 0.6% | 0.5% | 4.9% | 2.6% | 17.3% | 5.8% |
| | Susceptibility | 52.2% | 52.2% | 59.8% | 59.8% | 69.5% | 69.5% |
| support ($n = 9{,}696$, $d = 114$), Knaus et al. [24] | True Error | 33.1% | 33.7% | 36.7% | 33.7% | 44.2% | 33.9% |
| | Anticipated Error | 33.4% | 34.1% | 34.1% | 36.0% | 29.7% | 38.6% |
| | Regret | 2.6% | 2.6% | 10.0% | 10.0% | 19.6% | 19.6% |
| | Overreliance | 1.8% | 1.7% | 9.6% | 6.0% | 24.3% | 12.1% |
| | Susceptibility | 52.6% | 52.6% | 60.0% | 60.0% | 69.6% | 69.6% |

Column groups correspond to the noise rates $p_{u|y=1} = 5\%$, $20\%$, and $40\%$.

Table 3: Accuracy and reliability of predictions for LR models trained on noisy datasets where we flip 5%, 20% and 40% of positive instances. We defer results for DNN models to Appendix B for clarity.

On the Lottery of Mistakes Our results highlight how a small amount of label noise can undermine common use cases for prediction by subjecting a far greater number of instances to a lottery of mistakes. In Table 3, for example, we consider a noise model where only positive instances ($y = 1$) are subject to label noise. Thus, every instance with a negative noisy label ($\tilde{y} = 0$) is subject to a lottery of mistakes. We report the proportion of points that take part in the lottery using the susceptibility metric in Eq. (1). In this case, we can see that in a task where the label noise is as low as 5%, over half of instances are subject to the lottery across all five datasets. For example, in the lungcancer dataset, even a small misdiagnosis (i.e., label noise) rate, which is inevitable, can compromise the reliability of half of all diagnoses.

On the Consequences of Blindness Our results highlight the importance of considering the effect of label noise in instance-level predictions. Many real-world practitioners assume that their training data is clean and ignore label noise, but most real-world datasets are not perfectly labeled; this inevitably leads to regretful predictions on individuals.

To demonstrate how regretful predictions can negatively impact individuals, we consider a particular flavor of regret, overreliance: the fraction of instances where a practitioner would incorrectly assume that a model assigned a correct prediction, i.e., where $e^{\mathrm{pred}}(f(x_i), \tilde{y}_i) = 0$ and $e^{\mathrm{true}}(f(x_i), y_i) = 1$. From Table 3, we consider overreliance on the lungcancer dataset under 40% noise. We observe that up to 23.4% of instances are assigned this type of prediction. In practice, such instances correspond to patients with cancer who are classified as cancer-free based on the prediction of a seemingly accurate model. These are patients for whom the model is making a mistake, but the practitioner cannot tell because it would not appear to be a mistake based on the noisy label. This highlights the importance of looking at the distribution of regretful instances across predictions. By analyzing the distribution of regretful predictions, we can adjust our reliance on model predictions, ensuring that practitioners do not blindly trust or explain away incorrect model decisions.

6 DEMONSTRATIONS

In this section, we show how the machinery developed in Section 4 can be used to promote safety at critical parts of the machine learning lifecycle in real-world applications where noisy labels are inevitable.

Data Cleaning Our approach in Section 4 can clean noisy datasets by using ambiguity to drop noisy instances from a training dataset. Given the denoised dataset, we can then train models that perform better in deployment. In Fig. 2, we demonstrate the effectiveness of this approach on the shock_mimic dataset. Here, we drop training examples using a confidence-based threshold rule of the form $\mathbb{I}[\mathrm{conf}(x_i) \leq \tau]$, where $\tau$ is a threshold set to control the number of instances to drop. We compare the performance of this strategy using two confidence scores that we can compute on training data: (1) $\mathrm{conf}(x_i) = 1 - \hat{\mu}(x_i, \tilde{y}_i)$, which is a measure based on the estimated ambiguity that we recover using Algorithm 1; and (2) $\mathrm{conf}(x_i) = \hat{p}(\tilde{y}_i \mid x_i)$, which is the predicted probability of a final model. As shown, removing instances with high ambiguity from the training dataset prior to training a final model on the cleaned data leads to improved test error on clean labels. Specifically, using ambiguity to drop uncertain instances reduces test error by 14.9% when dropping only 20% of noisy training data compared to a baseline approach.

Figure 2: Clean test error for an LR model on the shock_mimic dataset with 40% class-level label noise when dropping training instances using different confidence-based threshold rules. We show the clean test error vs. the percent of instances dropped from training using confidence measures based on predicted probabilities $\mathrm{conf}(x_i) := \hat{p}(\tilde{y}_i \mid x_i)$ or ambiguity $\mathrm{conf}(x_i) := 1 - \hat{\mu}(x_i, \tilde{y}_i)$.
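The threshold rule used for data cleaning can be sketched as follows. This is an illustrative sketch that assumes instances are dropped when their confidence is at or below the threshold; the helper names `threshold_for_drop_fraction` and `keep_indices` and the confidence values are hypothetical.

```python
def threshold_for_drop_fraction(conf, frac):
    """Pick tau so that roughly `frac` of the instances satisfy conf <= tau."""
    if not 0.0 <= frac < 1.0:
        raise ValueError("frac must be in [0, 1)")
    k = int(frac * len(conf))  # number of instances to drop
    if k == 0:
        return float("-inf")   # drop nothing
    return sorted(conf)[k - 1]


def keep_indices(conf, tau):
    """Indices of training instances that survive the rule I[conf <= tau]."""
    return [i for i, c in enumerate(conf) if c > tau]


# Hypothetical confidence scores, e.g. 1 - estimated ambiguity.
conf = [0.95, 0.60, 0.90, 0.45, 1.00]
tau = threshold_for_drop_fraction(conf, frac=0.4)
kept = keep_indices(conf, tau)
```

Here the two least confident instances (indices 1 and 3) are dropped and a final model would be retrained on the remaining ones. Ties at the threshold can drop slightly more than the requested fraction.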

Selective Classification with Cheap Labels We use our results to highlight how the machinery in Section 4 can promote safer predictions. Consider the shock_mimic dataset in Fig. 3. Here, we use the same confidence-based threshold rule of the form $\mathbb{I}[\mathrm{conf}(x_i) \leq \tau]$, where $\mathrm{conf}(x_i)$ is a confidence score and $\tau$ is a threshold value. We consider confidence scores based on standard predicted probabilities and ambiguity, where ambiguity can be measured using cheaply acquired test instances (e.g., noisy test data). We show how the selective test error on clean labels and the selective regret change as we vary the confidence threshold value $\tau \in (0, 1)$. Specifically, in a regime where 20% of the labels are noisy, abstaining on 40% of instances using ambiguity reduces selective error by 6.6% and selective regret by 5.9% compared to the standard approach on shock_mimic.
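A selective classification frontier of this kind can be traced with a few lines of code. The sketch below assumes the model abstains when confidence is at or below the threshold; `selective_frontier` and the toy inputs are illustrative names and values, not the paper's implementation.

```python
def selective_frontier(conf, mistakes, taus):
    """Return (abstention rate, selective error) pairs for each threshold.

    conf[i]     : confidence score for instance i
    mistakes[i] : 1 if the prediction on instance i is wrong, else 0
    """
    points = []
    for tau in taus:
        # Keep only predictions whose confidence exceeds the threshold.
        kept = [m for c, m in zip(conf, mistakes) if c > tau]
        abstained = 1.0 - len(kept) / len(conf)
        error = sum(kept) / len(kept) if kept else None  # None: abstain on all
        points.append((abstained, error))
    return points


# Hypothetical scores and mistake indicators.
conf = [0.9, 0.8, 0.6, 0.4, 0.3]
mistakes = [0, 0, 1, 0, 1]
frontier = selective_frontier(conf, mistakes, taus=[0.0, 0.5])
```

Replacing `mistakes` with regret indicators gives the selective regret curve in the same way.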

Selective Classification for Scientific Discovery We demonstrate how our approach can support a modern scientific discovery task in biotechnology. In such tasks, researchers perform in-vitro experiments to identify instances with desired properties [e.g., identifying new antibiotics 50]. Given a dataset of successful and unsuccessful experiments and their characteristics, we can train a model to predict which future experiments are likely to succeed, thereby accelerating discovery by prioritizing high-yield experiments.

We use the enhancer dataset from Gschwind et al. [16] to predict the outcome of experiments to discover enhancers, i.e., segments of DNA that regulate gene expression. The dataset contains $n = 992$ noisy instances $(x_i, \tilde{y}_i)$, each with $d = 13$ features (e.g., gene location, cell type).

Figure 3: Selective classification frontiers for an LR model on the shock_mimic dataset under 20% class-level noise when abstaining from uncertain predictions at test time using a confidence-based threshold rule. We plot the selective regret (left) and selective error (right) as we vary the percent of abstained predictions for confidence measures based on predicted probabilities $\mathrm{conf}(x_i) := \hat{p}(\tilde{y}_i \mid x_i)$ and ambiguity $\mathrm{conf}(x_i) := 1 - \hat{\mu}(x_i)$.


In this setting, a noisy label $\tilde{y}_i = 1$ indicates a statistically significant experiment (i.e., we reject the null hypothesis $H_0$: no effect). Here, label noise arises from the Type I error for each experiment:

$$\Pr(\tilde{y}_i = 1 \mid y_i = 0) = \Pr(\text{reject } H_0 \mid H_0 \text{ holds}) = p\text{-value for experiment } i$$

We fit an LR model $f$ and use its predictions to identify experiments that are likely to succeed in the test sample. To avoid low-confidence predictions, we use a thresholding rule of the form $\mathbb{I}[\mathrm{conf}(x_i) \leq \tau]$, where $\tau$ is a threshold value that we can set to control the number of experiments to perform. Confidence for an instance is measured with a score $\mathrm{conf}(x_i) := 1 - \mathrm{Disagreement}(x_i)$, where $\mathrm{Disagreement}(x_i)$ measures the disagreement between the predictions of $f$ and the $m$ plausible models from Algorithm 1:

$$\mathrm{Disagreement}(x_i) := \frac{1}{m} \sum_{k \in [m]} \mathbb{I}\left[\hat{f}^{k}(x_i) \neq f(x_i)\right] \qquad (6)$$

Fig. 4 shows that disagreement can reliably predict which experiments will be successful. We use two strategies for abstention: (1) a standard approach thresholding according to ˆp( yi | xi), or (2) using disagreement rates (6). Performance is measured using test hit rate (i.e., the number of successful experiments divided by the number of total experiments that we run). Our approach improves the hit rate (+10.7%) compared to standard confidence-based abstention, with a modest 20% abstention rate (Fig. 4). This demonstrates that we can optimize laboratory resource allocation and increase the discovery rate of enhancers by forgoing 20% of experiments.
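The disagreement score and the hit rate can be computed as in the following sketch. It assumes that predictions from the m plausible models are already available and that abstention happens when confidence is at or below the threshold; the function names and toy values are hypothetical.

```python
def disagreement(plausible_preds, final_pred):
    """Fraction of the m plausible models whose prediction differs from f(x_i)."""
    return sum(p != final_pred for p in plausible_preds) / len(plausible_preds)


def hit_rate(outcomes, selected):
    """Successful experiments divided by the number of experiments run."""
    if not selected:
        return None  # no experiments were run
    return sum(outcomes[i] for i in selected) / len(selected)


# Three candidate experiments, all predicted to succeed by f (toy values).
final_preds = [1, 1, 1]
plausible = [[1, 1, 1, 1], [1, 0, 1, 1], [0, 0, 1, 1]]  # m = 4 models each
conf = [1.0 - disagreement(p, f) for p, f in zip(plausible, final_preds)]

tau = 0.6
selected = [i for i, c in enumerate(conf) if c > tau]  # run only confident ones
outcomes = [1, 1, 0]  # 1 = experiment succeeded (hypothetical)
rate = hit_rate(outcomes, selected)
```

In this toy case, the third experiment is skipped because half of the plausible models disagree with the final model, and the hit rate on the remaining two experiments is 1.0.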

Figure 4: Selective classification frontiers for an LR model on the enhancer dataset when abstaining from uncertain predictions using a confidence-based threshold rule. We plot the selective hit rate (left) and selective error (right) as we increase the proportion of abstained predictions for confidence measures based on predicted probabilities $\mathrm{conf}(x_i) := \hat{p}(\tilde{y}_i \mid x_i)$ and disagreement $\mathrm{conf}(x_i) := 1 - \mathrm{Disagreement}(x_i)$.

7 CONCLUDING REMARKS

Learning under label noise is a major challenge in practice. While models may perform well on average, a model that is 99% accurate can inadvertently misclassify anyone, as label noise can subject each individual prediction to a lottery of mistakes. In this work, we studied these effects through the lens of regret and highlighted the inherent limits of learning in this regime.

Our results show that, even though regret is inevitable when learning from label noise, we can operationalize simple techniques to predict safely by quantifying the uncertainty in individual predictions, e.g., by estimating the ambiguity of individual predictions and using it to flag predictions where we should abstain or examples that we should re-label.

Our use of regret extends beyond label noise into any setting where models are trained on datasets with a single draw of noise [e.g., for probabilistic classification 13]. In such settings, regret can explicitly reveal the impact of predictions on individuals, and estimating it can act as a safeguard or signal to collect more data or avoid prediction altogether.


ACKNOWLEDGMENTS

This work was supported by funding from the National Science Foundation IIS-2040880, IIS-2313105, IIS-2007951, IIS-2143895, and the NIH Bridge2AI Center Grant U54HG012510. Sujay Nagaraj was supported by a CIHR Vanier Scholarship. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

REFERENCES

[1] Dana Angluin and Philip Laird. Learning from noisy examples. Machine learning, 2:343–370, 1988.

[2] Mayara Lisboa Bastos, Gamuchirai Tavaziva, Syed Kunal Abidi, Jonathon R Campbell, Louis-Patrick Haraoui, James C Johnston, Zhiyi Lan, Stephanie Law, Emily MacLean, Anete Trajman, et al. Diagnostic accuracy of serological tests for covid-19: systematic review and meta-analysis. bmj, 370, 2020.

[3] Atharva M Bhagwat, Kadija S Ferryman, and Jason B Gibbons. Mitigating algorithmic bias in opioid risk-score modeling to ensure equitable access to pain relief. Nature medicine, 29(4):769–770, 2023.

[4] Emily Black, Manish Raghavan, and Solon Barocas. Model multiplicity: Opportunities, concerns, and solutions. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22, pp. 850–863, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533149. URL https://doi.org/10.1145/3531146.3533149.

[5] Marc-Etienne Brunet, Ashton Anderson, and Richard Zemel. Implications of model indeterminacy for explanations of automated decisions. Advances in Neural Information Processing Systems, 35:7810–7823, 2022.

[6] Seung Hyun Cheon, Anneke Wernerfelt, Sorelle Friedler, and Berk Ustun. Feature responsiveness scores: Model-agnostic explanations for recourse. In The Thirteenth International Conference on Learning Representations, 2025.

[7] Chun-Wei Chiang and Ming Yin. You'd better stop! understanding human reliance on machine learning models under covariate shift. In Proceedings of the 13th ACM Web Science Conference 2021, pp. 120–129, 2021.

[8] Amanda Coston, Ashesh Rambachan, and Alexandra Chouldechova. Characterizing fairness over the set of good models under selective labels. CoRR, abs/2101.00352, 2021. URL https://arxiv.org/abs/2101.00352.

[9] Thomas M Cover. Elements of Information Theory. John Wiley & Sons, 1999.

[10] Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023.

[11] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on neural networks and learning systems, 25(5):845–869, 2013.

[12] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017.

[13] Erin George, Deanna Needell, and Berk Ustun. Observational multiplicity and regret. In 2025 Joint Mathematics Meetings. American Mathematical Society.

[14] Kan Z Gianattasio, Christina Prather, M Maria Glymour, Adam Ciarleglio, and Melinda C Power. Racial disparities and temporal trends in dementia misdiagnosis risk in the united states. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 5:891–898, 2019.

[15] Athanasios Giannakis, Dorottya Móré, Stella Erdmann, Laurent Kintzelé, Ralph Michael Fischer, Monika Nadja Vogel, David Lukas Mangold, Oyunbileg von Stackelberg, Paul Schnitzler, Stefan Zimmermann, et al. Covid-19 pneumonia and its lookalikes: How radiologists perform in differentiating atypical pneumonias. European Journal of Radiology, 144:110002, 2021.

[16] Andreas R Gschwind, Kristy S Mualim, Alireza Karbalayghareh, Maya U Sheth, Kushal K Dey, Evelyn Jagoda, Ramil N Nurtdinov, Wang Xi, Anthony S Tan, Hank Jones, et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv, 2023.


[17] Faisal Hamman, Erfaun Noorani, Saumitra Mishra, Daniele Magazzeni, and Sanghamitra Dutta. Robust algorithmic recourse under model multiplicity with probabilistic guarantees. IEEE Journal on Selected Areas in Information Theory, 2024.

[18] Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey. Machine Learning, pp. 1–38, 2024.

[19] SM Hollenberg. Cardiogenic shock. In Intensive Care Medicine: Annual Update 2003, pp. 447–458. Springer, 2003.

[20] Hsiang Hsu and Flavio du Pin Calmon. Rashomon capacity: A metric for predictive multiplicity in probabilistic classification, 2022. URL https://arxiv.org/abs/2206.01295.

[21] Hailey James, Chirag Nagpal, Katherine A Heller, and Berk Ustun. Participatory personalization in classification. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[22] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

[23] Jamil N Kanji, Nathan Zelyas, Clayton MacDonald, Kanti Pabbaraju, Muhammad Naeem Khan, Abhaya Prasad, Jia Hu, Mathew Diggle, Byron M Berenger, and Graham Tipples. False negative rate of covid-19 pcr testing: a discordant testing analysis. Virology journal, 18:1–6, 2021.

[24] William A Knaus, Frank E Harrell, Joanne Lynn, Lee Goldman, Russell S Phillips, Alfred F Connors, Neal V Dawson, William J Fulkerson, Robert M Califf, Norman Desbiens, et al. The support prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Annals of internal medicine, 122(3):191–203, 1995.

[25] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885–1894. PMLR, 2017.

[26] Avni Kothari, Bogdan Kulynych, Tsui-Wei Weng, and Berk Ustun. Prediction without preclusion: Recourse verification with reachable sets. In The Twelfth International Conference on Learning Representations, 2024.

[27] Bogdan Kulynych, Hsiang Hsu, Carmela Troncoso, and Flavio P Calmon. Arbitrary decisions are a hidden cost of differentially private training. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 1609–1623, 2023.

[28] Jean-Roger Le Gall, Stanley Lemeshow, and Fabienne Saulnier. A new simplified acute physiology score (saps ii) based on a european/north american multicenter study. Jama, 270(24):2957–2963, 1993.

[29] John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human factors, 46(1):50–80, 2004.

[30] Dana Li, Lea Marie Pehrson, Lea Tøttrup, Marco Fraccaro, Rasmus Bonnevie, Jakob Thrane, Peter Jagd Sørensen, Alexander Rykkje, Tobias Thostrup Andersen, Henrik Steglich-Arnholm, et al. Inter- and intraobserver agreement when using a diagnostic labeling scheme for annotating findings on chest x-rays: an early step in the development of a deep learning-based decision support system. Diagnostics, 12(12):3112, 2022.

[31] Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to-end label-noise learning without anchor points. In International conference on machine learning, pp. 6403–6413. PMLR, 2021.

[32] Yang Liu. Understanding instance-level label noise: Disparate impacts and treatments. In International Conference on Machine Learning, pp. 6725–6735. PMLR, 2021.

[33] Yang Liu and Hongyi Guo. Peer loss functions: Learning from noisy labels without knowing noise rates. ICML, 2020.

[34] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.

[35] Charles Marx, Flavio P. Calmon, and Berk Ustun. Predictive Multiplicity in Classification, 2019.

[36] Charles Marx, Youngsuk Park, Hilaf Hasson, Yuyang Wang, Stefano Ermon, and Luke Huan. But are you sure? an uncertainty-aware perspective on explainable ai. In International Conference on Artificial Intelligence and Statistics, pp. 7375–7391. PMLR, 2023.


[37] Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In International Conference on Machine Learning, pp. 125–134, 2015.

[38] Anna P Meyer, Aws Albarghouthi, and Loris D'Antoni. The dataset multiplicity problem: How unreliable data impacts predictions. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 193–204, 2023.

[39] Sujay Nagaraj, Sarah Goodday, Thomas Hartvigsen, Adrien Boch, Kopal Garg, Sindhu Gowda, Luca Foschini, Marzyeh Ghassemi, Stephen Friend, and Anna Goldenberg. Dissecting the heterogeneity of in the wild stress from multimodal sensor data. NPJ Digital Medicine, 6(1):237, 2023.

[40] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.

[41] Surveillance Research Program NCI, DCCPS. Surveillance, epidemiology, and end results (seer) program research data (1975-2016), 2019. URL www.seer.cancer.gov.

[42] Sejoon Oh, Berk Ustun, Julian McAuley, and Srijan Kumar. Rank list sensitivity of recommender systems to interaction perturbations. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 1584–1594, 2022.

[43] Diane Oyen, Michal Kucer, Nicolas Hengartner, and Har Simrat Singh. Robustness to label noise depends on the shape of the noise distribution. Advances in Neural Information Processing Systems, 35:35645–35656, 2022.

[44] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952, 2017.

[45] Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. The eicu collaborative research database, a freely available multi-center database for critical care research. Scientific data, 5(1):1–13, 2018.

[46] Ramamoorthi Ravi and Amitabh Sinha. Hedging uncertainty: Approximation algorithms for stochastic optimization problems. In International Conference on Integer Programming and Combinatorial Optimization, pp. 101–115. Springer, 2004.

[47] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

[48] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.

[49] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on neural networks and learning systems, 2022.

[50] Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackermann, et al. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.

[51] Vinith Menon Suriyakumar, Marzyeh Ghassemi, and Berk Ustun. When personalization harms performance: reconsidering the use of group attributes in prediction. In International Conference on Machine Learning, pp. 33209–33228. PMLR, 2023.

[52] Berk Ustun, Lenard A Adler, Cynthia Rudin, Stephen V Faraone, Thomas J Spencer, Patricia Berglund, Michael J Gruber, and Ronald C Kessler. The world health organization adult attention-deficit/hyperactivity disorder self-report screening scale for dsm-5. Jama psychiatry, 74(5):520–526, 2017.

[53] Jamelle Watson-Daniels, Flavio du Pin Calmon, Alexander D'Amour, Carol Long, David C Parkes, and Berk Ustun. Predictive churn with the set of good models. arXiv preprint arXiv:2402.07745, 2024.

[54] Jiaheng Wei, Hangyu Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama, and Yang Liu. To smooth or not? when label smoothing meets noisy labels. In International Conference on Machine Learning, 2022.

[55] Yishu Wei, Yu Deng, Cong Sun, Mingquan Lin, Hongmei Jiang, and Yifan Peng. Deep learning with noisy labels in medical prediction problems: a scoping review. arXiv preprint arXiv:2403.13111, 2024.


[56] Eric Yamga, Sreekar Mantena, Darin Rosen, Emily M. Bucholz, Robert W. Yeh, Leo A. Celi, Berk Ustun, and Neel M. Butala. Optimized risk score to predict mortality in patients with cardiogenic shock in the cardiac intensive care unit. Journal of the American Heart Association, 12(13), July 2023. ISSN 2047-9980. doi: 10.1161/jaha.122.029232. URL http://dx.doi.org/10.1161/JAHA.122.029232.


A OMITTED PROOFS

A.1 RESULTS FROM SECTION 3

Proof of Prop. 3. Consider any classification task with label noise. Given a point $(X, \tilde{Y})$, let $\rho_{X,\tilde{Y}} := \Pr(U = 1 \mid X, \tilde{Y})$ denote the posterior noise rate and $\ell_{01}(f(X), \tilde{Y}) := \mathbb{I}[f(X) \neq \tilde{Y}]$ its zero-one loss.

By definition, a hedging algorithm [see e.g., 40] is designed to learn a classifier f such that:

$$\mathbb{E}_{U \mid X, \tilde{Y}}\left[e^{\mathrm{pred}}(f(X), \tilde{Y})\right] = e^{\mathrm{true}}(f(X), Y).$$

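We observe that the gap between the anticipated and true error of $f$ is zero in expectation, as a result of the unbiasedness property of hedging algorithms:

$$\mathbb{E}_{X,\tilde{Y},U}\left[e^{\mathrm{pred}}(f(X), \tilde{Y}) - e^{\mathrm{true}}(f(X), Y)\right] = \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}}\left[e^{\mathrm{pred}}(f(X), \tilde{Y}) - e^{\mathrm{true}}(f(X), Y)\right] = 0.$$

The last equality follows from the fact that $Y$ is a deterministic function of $U$ given $\tilde{Y}$.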

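We now show that $f$ will still incur regret in this regime. We begin by expressing the expected regret for any point $(X, \tilde{Y})$ and any noise draw $U$ as:

$$\mathbb{E}_{X,\tilde{Y},U}\left[\mathrm{Regret}(X, \tilde{Y}, U)\right] = \mathbb{E}_{X,\tilde{Y}}\left[(1 - 2q_u)\left(e^{\mathrm{pred}}(f(X), \tilde{Y}) + \ell_{01}(f(X), \tilde{Y})\right) + 2(2q_u - 1)\, e^{\mathrm{pred}}(f(X), \tilde{Y})\, \ell_{01}(f(X), \tilde{Y}) + q_u\right]$$

$$
\begin{aligned}
\mathbb{E}_{X,\tilde{Y},U}\, \mathrm{Regret}(X, \tilde{Y}, U)
&= \mathbb{E}_{X,\tilde{Y},U}\, \mathbb{I}\left[e^{\mathrm{pred}}(f(X), \tilde{Y}) \neq \mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + (1-\tilde{Y})U\right]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}}\, \mathbb{I}\left[e^{\mathrm{pred}}(f(X), \tilde{Y}) \neq \mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + (1-\tilde{Y})U\right]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}} \Big[ e^{\mathrm{pred}}(f(X), \tilde{Y})\left(1 - \mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + (1-\tilde{Y})U\right]\right) \\
&\qquad + \left(1 - e^{\mathrm{pred}}(f(X), \tilde{Y})\right) \mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + (1-\tilde{Y})U\right] \Big] \\
&= \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}} \Big[ e^{\mathrm{pred}}(f(X), \tilde{Y})\left(1 - \mathbb{I}[f(X) \neq \tilde{Y}](1-U) - \mathbb{I}[f(X) \neq 1-\tilde{Y}]\,U\right) \\
&\qquad + \left(1 - e^{\mathrm{pred}}(f(X), \tilde{Y})\right)\left(\mathbb{I}[f(X) \neq \tilde{Y}](1-U) + \mathbb{I}[f(X) \neq 1-\tilde{Y}]\,U\right) \Big]
\end{aligned}
$$

Letting $q_u = \Pr(U = 1 \mid X, \tilde{Y})$ and $\ell_{01}(f(X), \tilde{Y}) = \mathbb{I}[f(X) \neq \tilde{Y}]$, we have:

$$
\begin{aligned}
\mathbb{E}_{X,\tilde{Y},U}\, \mathrm{Regret}(X, \tilde{Y}, U)
&= \mathbb{E}_{X,\tilde{Y}} \Big[ (1 - q_u)\left(e^{\mathrm{pred}}(f(X), \tilde{Y})\left(1 - \ell_{01}(f(X), \tilde{Y})\right) + \left(1 - e^{\mathrm{pred}}(f(X), \tilde{Y})\right)\ell_{01}(f(X), \tilde{Y})\right) \\
&\qquad + q_u\left(e^{\mathrm{pred}}(f(X), \tilde{Y})\left(1 - \ell_{01}(f(X), 1-\tilde{Y})\right) + \left(1 - e^{\mathrm{pred}}(f(X), \tilde{Y})\right)\ell_{01}(f(X), 1-\tilde{Y})\right) \Big]
\end{aligned}
$$

Since $\ell_{01}(f(X), 1-\tilde{Y}) = 1 - \ell_{01}(f(X), \tilde{Y})$ for binary labels, this simplifies to:

$$
\begin{aligned}
\mathbb{E}_{X,\tilde{Y},U}\, \mathrm{Regret}(X, \tilde{Y}, U)
&= \mathbb{E}_{X,\tilde{Y}} \Big[ (1 - 2q_u)\left(e^{\mathrm{pred}}(f(X), \tilde{Y}) + \ell_{01}(f(X), \tilde{Y})\right) \\
&\qquad + 2(2q_u - 1)\, e^{\mathrm{pred}}(f(X), \tilde{Y})\, \ell_{01}(f(X), \tilde{Y}) + q_u \Big].
\end{aligned}
$$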

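When there is no label noise, we have that $q_u = 0$ and $e^{\mathrm{pred}}(f(X), \tilde{Y}) = \ell_{01}(f(X), \tilde{Y})$ for all $X, \tilde{Y}$. Because they are binary terms, in this regime, we have:

$$\mathbb{E}_{X,\tilde{Y},U}\left[\mathrm{Regret}(X, \tilde{Y}, U)\right] = \mathbb{E}_{X,\tilde{Y}}[0] = 0$$

When there is label noise, we have that $q_u > 0$ for some $X, \tilde{Y}$. In this regime, we have:

$$\mathbb{E}_{X,\tilde{Y},U}\left[\mathrm{Regret}(X, \tilde{Y}, U)\right] = \mathbb{E}_{X,\tilde{Y}}[q_u] > 0.$$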
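The two regimes above can be checked by exact enumeration over the noise draw for a single instance. The sketch below assumes that the anticipated mistake is the zero-one loss on the noisy label; the function name `expected_regret` is illustrative.

```python
def expected_regret(f_x, y_noisy, q_u):
    """Exact E_U[ I[e_pred != e_true] ] for one instance with Pr(U = 1) = q_u.

    Assumes e_pred is the zero-one loss on the noisy label.
    """
    e_pred = int(f_x != y_noisy)  # anticipated mistake under the noisy label
    total = 0.0
    for u, prob in ((0, 1.0 - q_u), (1, q_u)):
        # The true label equals the noisy label unless the label was flipped.
        y_true = y_noisy * (1 - u) + (1 - y_noisy) * u
        e_true = int(f_x != y_true)
        total += prob * int(e_pred != e_true)
    return total


# Enumerate all binary predictions and noisy labels at several noise levels.
results = {
    (f_x, y, q): expected_regret(f_x, y, q)
    for f_x in (0, 1) for y in (0, 1) for q in (0.0, 0.2, 0.4)
}
```

For every combination, the expected regret equals the posterior noise rate: it is zero without noise and strictly positive otherwise, regardless of what the classifier predicts.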


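The proof for Prop. 4 uses the following lemma.

Lemma 8. Minimizing the expected risk under the clean label distribution is equivalent to minimizing a noise-corrected (hedged) risk under the noisy label distribution.

$$\mathbb{E}_{X,Y}\left[\mathbb{I}[f(X) \neq Y]\right] = \mathbb{E}_{X,\tilde{Y}}\left[(1 - q_u)\, \mathbb{I}[f(X) \neq \tilde{Y}] + q_u\, \mathbb{I}[f(X) \neq 1 - \tilde{Y}]\right] \qquad (7)$$

Here,

$$q_u = \frac{(1 - \pi_{\tilde{y},x})\, p_{u \mid 1-\tilde{y},x}}{p_{u \mid 1-\tilde{y},x}\,(1 - \pi_{\tilde{y},x}) + (1 - p_{u \mid \tilde{y},x})\, \pi_{\tilde{y},x}},$$

where $\pi_{\tilde{y},x} = \Pr(Y = \tilde{y} \mid X = x)$ is the clean class prior evaluated at an observed noisy label, and $p_{u \mid y,x} = \Pr(U = 1 \mid Y = y, X = x)$ is the class-level noise probability.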

The result is analogous to Lemma 1 in Natarajan et al. [40]. In what follows, we include an additional proof for the sake of completeness.

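$$
\begin{aligned}
\text{Expected Risk}(f) &= \mathbb{E}_{X,Y}\left[\mathbb{I}[f(X) \neq Y]\right] \\
&= \mathbb{E}_{X,\tilde{Y},U}\left[\mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + U(1-\tilde{Y})\right]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}}\left[\mathbb{I}\left[f(X) \neq \tilde{Y}(1-U) + U(1-\tilde{Y})\right]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\, \mathbb{E}_{U \mid X,\tilde{Y}}\left[\mathbb{I}[f(X) \neq \tilde{Y}](1-U) + \mathbb{I}[f(X) \neq 1-\tilde{Y}]\,U\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\left[\mathbb{E}_{U \mid X,\tilde{Y}}\left[\mathbb{I}[f(X) \neq \tilde{Y}](1-U)\right] + \mathbb{E}_{U \mid X,\tilde{Y}}\left[\mathbb{I}[f(X) \neq 1-\tilde{Y}]\,U\right]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\left[\Pr(U = 0 \mid \tilde{Y}, X)\, \mathbb{I}[f(X) \neq \tilde{Y}] + \Pr(U = 1 \mid \tilde{Y}, X)\, \mathbb{I}[f(X) \neq 1-\tilde{Y}]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\left[\Pr(Y = \tilde{Y} \mid \tilde{Y}, X)\, \mathbb{I}[f(X) \neq \tilde{Y}] + \Pr(Y \neq \tilde{Y} \mid \tilde{Y}, X)\, \mathbb{I}[f(X) \neq 1-\tilde{Y}]\right] \\
&= \mathbb{E}_{X,\tilde{Y}}\left[(1 - q_u)\, \mathbb{I}[f(X) \neq \tilde{Y}] + q_u\, \mathbb{I}[f(X) \neq 1-\tilde{Y}]\right]
\end{aligned}
$$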

We can recover the statement of Lemma 8 by applying Bayes' theorem to write $q_u$ in terms of the clean class priors and class-level noise probabilities.
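The identity in Lemma 8 can be verified numerically at a single fixed $x$. The following sketch computes the posterior flip probability via Bayes' theorem and compares the clean risk with the noise-corrected risk; the function names and the parameter values are assumptions chosen for illustration.

```python
def noisy_label_terms(y_noisy, pi1, p_flip):
    """Joint probabilities of (noisy label, flipped) and (noisy label, not flipped).

    pi1       : Pr(Y = 1 | x), the clean class prior
    p_flip[y] : Pr(U = 1 | Y = y, x), the class-level noise probability
    """
    prior = {1: pi1, 0: 1.0 - pi1}
    y_other = 1 - y_noisy
    flipped = prior[y_other] * p_flip[y_other]            # true label was y_other
    unflipped = prior[y_noisy] * (1.0 - p_flip[y_noisy])  # true label was y_noisy
    return flipped, unflipped


def posterior_flip_prob(y_noisy, pi1, p_flip):
    """q_u = Pr(U = 1 | noisy label, x) by Bayes' theorem."""
    flipped, unflipped = noisy_label_terms(y_noisy, pi1, p_flip)
    return flipped / (flipped + unflipped)


def clean_risk(c, pi1):
    """Pr(Y != c | x) for a constant prediction c at this x."""
    return 1.0 - pi1 if c == 1 else pi1


def corrected_risk(c, pi1, p_flip):
    """Noise-corrected risk computed only from the noisy label distribution."""
    total = 0.0
    for y_noisy in (0, 1):
        flipped, unflipped = noisy_label_terms(y_noisy, pi1, p_flip)
        q = posterior_flip_prob(y_noisy, pi1, p_flip)
        pr_noisy = flipped + unflipped  # Pr(noisy label = y_noisy | x)
        total += pr_noisy * ((1 - q) * int(c != y_noisy) + q * int(c != 1 - y_noisy))
    return total


pi1, p_flip = 0.3, {0: 0.1, 1: 0.4}  # hypothetical prior and noise rates
gaps = [abs(clean_risk(c, pi1) - corrected_risk(c, pi1, p_flip)) for c in (0, 1)]
```

Both gaps are zero up to floating-point error, which matches the claim that the two risks coincide when the posterior noise rate is known.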

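Proof of Prop. 4. We define $u^{\mathrm{mle}}$ as the noise draw for instance $(X, \tilde{Y})$ such that using $u^{\mathrm{mle}}$ to minimize the expected risk implicitly coincides with the true minimizer of the expected risk (defined in Lemma 8). That is:

$$\operatorname*{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{X,\tilde{Y}}\left[\mathbb{I}\left[f(X) \neq \tilde{Y}(1 - u^{\mathrm{mle}}) + u^{\mathrm{mle}}(1 - \tilde{Y})\right]\right] = \operatorname*{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{X,\tilde{Y}}\left[(1 - q_u)\, \mathbb{I}[f(X) \neq \tilde{Y}] + q_u\, \mathbb{I}[f(X) = \tilde{Y}]\right]$$

We can express the minimizer of the LHS as:

$$
\begin{aligned}
\tilde{f} &\in \operatorname*{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{X,\tilde{Y}}\left[\mathbb{I}\left[f(X) \neq \tilde{Y}(1 - u^{\mathrm{mle}}) + u^{\mathrm{mle}}(1 - \tilde{Y})\right]\right] && (8) \\
&= \operatorname*{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{X,\tilde{Y}}\left[(1 - u^{\mathrm{mle}})\, \mathbb{I}[f(X) \neq \tilde{Y}] + u^{\mathrm{mle}}\, \mathbb{I}[f(X) = \tilde{Y}]\right] && (9)
\end{aligned}
$$

We can denote the minimizer of the RHS:

$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}}\, \mathbb{E}_{X,\tilde{Y}}\left[(1 - q_u)\, \mathbb{I}[f(X) \neq \tilde{Y}] + q_u\, \mathbb{I}[f(X) = \tilde{Y}]\right] \qquad (10)$$

Observe that:

$$q_{u \mid \tilde{y},x} < 0.5 \implies \hat{f}(X) = \tilde{Y}$$

$$q_{u \mid \tilde{y},x} > 0.5 \implies \hat{f}(X) = 1 - \tilde{Y}$$

Thus, we have that $u^{\mathrm{mle}} := \mathbb{I}[q_u > 0.5] \implies \hat{f} = \tilde{f}$, as desired.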
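The pointwise argument can be illustrated with a small check. This sketch assumes a model class rich enough to minimize the risk separately at each point, and it excludes the tie at a posterior noise rate of exactly 0.5; both function names are hypothetical.

```python
def hedged_minimizer(y_noisy, q_u):
    """Prediction in {0, 1} that minimizes the noise-corrected pointwise risk."""
    def risk(c):
        return (1 - q_u) * int(c != y_noisy) + q_u * int(c == y_noisy)
    return min((0, 1), key=risk)


def mle_label(y_noisy, q_u):
    """Label implied by the most likely noise draw u_mle = I[q_u > 0.5]."""
    u_mle = int(q_u > 0.5)
    return y_noisy * (1 - u_mle) + (1 - y_noisy) * u_mle


# Compare both rules over noisy labels and several posterior noise rates.
agree = all(
    hedged_minimizer(y, q) == mle_label(y, q)
    for y in (0, 1) for q in (0.1, 0.4, 0.6, 0.9)
)
```

Under these assumptions, fitting to the label implied by the most likely noise draw gives the same pointwise prediction as minimizing the hedged risk.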


A.2 RESULTS FROM SECTION 4

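Proof of Prop. 5. Denote the noise rate of $\hat{F}(x_i)$ as $e$, that is, $\Pr(\hat{F}(x_i) \neq y_i) = e$.

$$
\begin{aligned}
\mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{F}(x_i) \neq \hat{Y}_i]\right]
&= \mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{F}(x_i) \neq \hat{Y}_i] \mid \hat{F}(x_i) = y_i\right](1 - e) \\
&\quad + \mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{F}(x_i) \neq \hat{Y}_i] \mid \hat{F}(x_i) \neq y_i\right] e \\
&= \mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{Y}_i \neq y_i]\right](1 - e) + \left(1 - \mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{Y}_i \neq y_i]\right]\right) e \\
&= (1 - 2e)\, \mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{Y}_i \neq y_i]\right] + e
\end{aligned}
$$

When $e < 0.5$, we can claim that the higher the $\mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{Y}_i \neq y_i]\right]$, the higher the $\mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{F}(x_i) \neq \hat{Y}_i]\right]$, the ambiguity measure. If we assume that $\mathbb{E}_{\hat{Y}_i, \hat{F} \mid \mathcal{D}}\left[\mathbb{I}[\hat{Y}_i \neq y_i]\right]$ is monotonic in the noise rates in $u_i$, which is intuitively true, we then establish that the higher the noise, the higher the ambiguity measure.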

A.3 ON CHOOSING AN ATYPICALITY PARAMETER

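In Prop. 9, we present an additional bound that can be used to set an atypicality parameter $\epsilon$ to guarantee that the set of plausible draws $\mathcal{F}^{\mathrm{plaus}}_{\epsilon}$ includes a reference noise draw with high probability.

Proposition 9. Given a set of $n_p$ instances $(x, \tilde{y})$ subject to noise rate $p_u$, we determine the minimum $\epsilon$ to ensure that a reference noise draw $u$ belongs to the set of plausible draws $\mathcal{F}^{\mathrm{plaus}}_{\epsilon}$ with high probability. That is, with probability at least $1 - \delta$, $u \in \mathcal{U}_{\epsilon}(\tilde{y})$ if $\epsilon$ obeys:

$$\epsilon \geq \frac{1}{q_{u \mid \tilde{y}}}\left(\sqrt{\frac{\ln(2/\delta)}{2 n_p}} + \left|p_u - q_{u \mid \tilde{y}}\right|\right)$$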

Here np represents the number of instances whose labels are corrupted by the same noise model. For example, under class-level noise, this bound would need to be evaluated separately using the number of instances for each class.

For example, given a dataset with $n = 10{,}000$ instances under 20% uniform label noise, a practitioner must set $\epsilon \geq 6\%$ to ensure that $u \in \mathcal{F}^{\mathrm{plaus}}_{\epsilon}$ with probability at least 90%.
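The worked example can be reproduced with a short calculation based on the final bound derived in the proof of Prop. 9. The helper name `min_epsilon` and its argument names are illustrative; the sketch assumes uniform noise, so the prior and posterior noise rates coincide unless stated otherwise.

```python
import math


def min_epsilon(n, delta, p_post, p_prior=None):
    """Smallest epsilon allowed by the Hoeffding-based bound.

    n       : number of instances that share the same noise model
    delta   : allowed failure probability
    p_post  : posterior noise rate given the noisy label
    p_prior : prior noise rate (defaults to p_post, i.e. no gap between them)
    """
    if p_prior is None:
        p_prior = p_post
    deviation = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return (deviation + abs(p_prior - p_post)) / p_post


# n = 10,000 instances, 20% uniform noise, success probability at least 90%.
eps = min_epsilon(n=10_000, delta=0.10, p_post=0.20)
```

With these inputs the bound evaluates to roughly 0.061, in line with the 6% figure quoted above. The required value shrinks as the number of instances grows and increases when the noise rate is small.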

Proof of Prop. 9. Our goal is to show that Pr (u Uϵ( y)) 1 δ. for any given 0 δ 1.

The uncertainty set Uϵ( y) defined on pu| y is a strongly typical set (see [9]) where the true mean pu|y and the empirical mean is ˆpu := 1

n Pn i=1 I [ui = 1] . Thus,

u Uϵ( y) |ˆpu pu| y| pu| y ϵ (11)

We will derive conditions to satisfy the inequality Eq. (11)

Observe that we can write

|ˆpu pu| y| = |(ˆpu pu) + (pu pu| y)|

|ˆpu pu| + |pu pu| y| (by the triangle inequality)

We require |ˆpu pu| y| pu| y ϵ. Therefore we need |ˆpu pu|+|pu pu| y| pu| y ϵ which implies that |ˆpu pu| pu| y ϵ |pu pu| y|


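We observe that $u$ is a sequence of bounded, independently sampled random variables. Thus, we can apply Hoeffding's inequality to see that:

$$\Pr\left(|\hat{p}_u - p_u| \geq \alpha\right) \leq 2 \exp(-2n\alpha^2)$$

Here, $\alpha = p_{u \mid \tilde{y}} \cdot \epsilon - |p_u - p_{u \mid \tilde{y}}|$. Rearranging, we have that:

$$
\begin{aligned}
\Pr\left(u \in \mathcal{U}_{\epsilon}(\tilde{y})\right) &\geq \Pr\left(|\hat{p}_u - p_u| \leq \alpha\right) \geq 1 - 2 \exp(-2n\alpha^2) \\
&= 1 - 2 \exp\left(-2n\left(p_{u \mid \tilde{y}} \cdot \epsilon - |p_u - p_{u \mid \tilde{y}}|\right)^2\right)
\end{aligned}
$$

We invert the bound to obtain the following statement: with probability at least $1 - \delta$, $u \in \mathcal{U}_{\epsilon}(\tilde{y})$ if the number of samples $n$ obeys:

$$n \geq \frac{\ln(2/\delta)}{2\left(p_{u \mid \tilde{y}} \cdot \epsilon - |p_u - p_{u \mid \tilde{y}}|\right)^2}$$

To conclude the proof, we rearrange for $\epsilon$, that is, given a dataset of $n$ instances, $u \in \mathcal{U}_{\epsilon}(\tilde{y})$ if $\epsilon$ obeys:

$$\epsilon \geq \frac{1}{p_{u \mid \tilde{y}}}\left(\sqrt{\frac{\ln(2/\delta)}{2n}} + |p_u - p_{u \mid \tilde{y}}|\right)$$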


B SUPPORTING MATERIAL FOR SECTION 5 AND SECTION 6

Here we include further details about the datasets used in our experimental results.

B.1 DATASETS

lungcancer We used a cohort of 120,641 lung cancer patients diagnosed between 2004 and 2016 who were monitored in the National Cancer Institute SEER study [41], and processed the dataset to match the processing in James et al. [21]. The outcome variable is death within five years from any cause, with 16.9% dying within this period. The cohort includes patients across the USA (California, Georgia, Kentucky, New Jersey, and Louisiana), excluding those lost to follow-up. Features include measures of tumor morphology and histology (e.g., size, metastasis, stage, node count and location), as well as clinical interventions at the time of diagnosis (e.g., surgery, chemotherapy, radiology).

shock_eicu & shock_mimic Cardiogenic shock is an acute cardiac condition where the heart fails to pump enough blood [19], leading to under-perfusion of vital organs. These datasets are designed to build algorithms to predict cardiogenic shock in ICU patients, as described in Yamga et al. [56]. Both datasets contain identical features, group attributes, and outcome variables, but they capture different patient populations. The shock_eicu dataset includes records from the eICU Collaborative Research Database V2.0 [45], while the shock_mimic dataset includes records from the MIMIC-III database [22]. The target variable is whether a patient with cardiogenic shock will die in the ICU. Features include vital signs and routine lab tests (e.g., systolic BP, heart rate, hemoglobin count) collected within 24 hours before the onset of cardiogenic shock.

mortality The Simplified Acute Physiology Score II (SAPS II) is a risk score designed to predict the risk of death in ICU patients. The data was collected in [28] and used in [51]. It contains records of 7,797 patients from 137 medical centers in 12 countries. The outcome variable indicates whether a patient dies in the ICU, with 12.8% of patients dying. Similar to the other datasets, mortality contains features reflecting comorbidities, vital signs, and lab measurements.

support This dataset comprises 9,105 ICU patients from five U.S. medical centers, collected during 1989-1991 and 1992-1994 [24]. Each record pertains to patients across nine disease categories: acute respiratory failure, chronic obstructive pulmonary disease, congestive heart failure, liver disease, coma, colon cancer, lung cancer, multiple organ system failure with malignancy, and multiple organ system failure with sepsis. The aim is to determine the individual-level 2- and 6-month survival rates based on physiological, demographic, and diagnostic data.

B.2 ADDITIONAL RESULTS FOR DNN MODELS

In Section 5, we include results for LR models. In this section, we include additional results for DNN models trained on the same unique noise draw as in Section 5.


| Dataset | Metric | $p_{u \mid y=1} = 5\%$: Ignore | $p_{u \mid y=1} = 5\%$: Hedge | $p_{u \mid y=1} = 20\%$: Ignore | $p_{u \mid y=1} = 20\%$: Hedge | $p_{u \mid y=1} = 40\%$: Ignore | $p_{u \mid y=1} = 40\%$: Hedge |
|---|---|---|---|---|---|---|---|
| shock_eicu, n = 3,456, d = 104, Pollard et al. [45] | True Error | 13.3% | 12.8% | 18.6% | 19.2% | 37.5% | 26.2% |
| | Anticipated Error | 14.4% | 14.0% | 20.3% | 22.0% | 25.1% | 26.7% |
| | Regret | 3.0% | 3.0% | 10.1% | 10.1% | 19.7% | 19.7% |
| | Overreliance | 1.1% | 1.0% | 5.3% | 4.7% | 21.4% | 13.1% |
| | Susceptibility | 52.6% | 52.6% | 59.7% | 59.7% | 69.3% | 69.3% |
| shock_mimic, n = 15,254, d = 104, Johnson et al. [22] | True Error | 15.6% | 15.9% | 18.8% | 16.8% | 32.7% | 22.1% |
| | Anticipated Error | 17.4% | 17.5% | 23.9% | 23.2% | 26.6% | 25.9% |
| | Regret | 2.5% | 2.5% | 10.2% | 10.2% | 19.8% | 19.8% |
| | Overreliance | 0.4% | 0.5% | 3.4% | 2.5% | 17.7% | 10.8% |
| | Susceptibility | 52.5% | 52.5% | 60.2% | 60.2% | 69.8% | 69.8% |
| lungcancer, n = 62,916, d = 40, NCI [41] | True Error | 29.8% | 29.7% | 31.5% | 30.0% | 37.7% | 29.5% |
| | Anticipated Error | 30.4% | 30.4% | 31.8% | 33.4% | 29.7% | 36.7% |
| | Regret | 2.5% | 2.5% | 10.0% | 10.0% | 19.7% | 19.7% |
| | Overreliance | 1.4% | 1.3% | 7.1% | 5.0% | 19.8% | 9.9% |
| | Susceptibility | 52.7% | 52.7% | 60.2% | 60.2% | 69.9% | 69.9% |
| mortality, n = 20,334, d = 84, Le Gall et al. [28] | True Error | 17.7% | 17.9% | 19.2% | 18.3% | 24.0% | 18.9% |
| | Anticipated Error | 19.1% | 19.4% | 23.4% | 24.0% | 26.2% | 29.5% |
| | Regret | 2.2% | 2.2% | 9.8% | 9.8% | 19.5% | 19.5% |
| | Overreliance | 0.6% | 0.5% | 3.7% | 2.7% | 11.7% | 6.3% |
| | Susceptibility | 52.2% | 52.2% | 59.8% | 59.8% | 69.5% | 69.5% |
| support, n = 9,696, d = 114, Knaus et al. [24] | True Error | 28.4% | 28.6% | 31.0% | 30.3% | 39.4% | 35.7% |
| | Anticipated Error | 28.2% | 28.2% | 28.6% | 28.7% | 25.2% | 27.8% |
| | Regret | 2.6% | 2.6% | 10.0% | 10.0% | 19.6% | 19.6% |
| | Overreliance | 2.0% | 2.1% | 8.7% | 8.1% | 22.6% | 19.1% |
| | Susceptibility | 52.6% | 52.6% | 60.0% | 60.0% | 69.6% | 69.6% |

Table 4: Overview of performance and regret for DNN models trained on all datasets under each training procedure (Ignore and Hedge) and noise level.