# Mitigating Label Noise through Data Ambiguation

Julian Lienen¹, Eyke Hüllermeier²,³
¹Department of Computer Science, Paderborn University
²Institute of Informatics, LMU Munich
³Munich Center for Machine Learning
julian.lienen@upb.de, eyke@lmu.de

Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup. In this paper, we suggest addressing the shortcomings of both methodologies by ambiguating the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming its effectiveness in detecting and correcting erroneous training labels.

## Introduction

Label noise refers to the presence of incorrect or unreliable annotations in the training data, which can negatively impact the generalization performance of learning methods. Dealing with label noise constitutes a major challenge for the application of machine learning methods to real-world applications, which often exhibit noisy annotations in the form of erroneous labels or distorted sensor values. This is no less a concern for large models in the regime of deep learning, which have become increasingly popular due to their high expressive power, and which are not immune to the harmful nature of label noise either. Existing methods addressing this issue include off-the-shelf robust loss functions, e.g., as proposed by Wang et al. (2019), and label correction, for example by means of replacing labels assumed to be corrupted (Wu et al. 2021). While the former appeals with its effortless integration into classical supervised learning setups, methods of the latter kind allow for a more effective suppression of noise by remodeling the labels.

Figure 1: For ResNet34 models trained with cross-entropy on CIFAR-10 with 25 % of corrupted instances (averaged over five seeds), the left plot shows the fractions of examples that are correctly classified, whose corrupted training label is memorized, or that are incorrectly classified with a label other than the ground truth or training label, confirming the result in (Liu et al. 2020). The right plot illustrates the predicted probability magnitudes for clean and noisy labels.
However, this comes at the cost of increased model complexity, typically reducing the efficiency of the training (Liu et al. 2020).

The training dynamics of models have been thoroughly studied in the aforementioned regime (Chang, Learned-Miller, and McCallum 2017; Arazo et al. 2019; Liu et al. 2020), and two learning phases have been identified: Initially, as shown in the left plot of Fig. 1, the model shows reasonable learning behavior by establishing correct relations between features and targets, classifying even mislabeled instances mostly correctly. In this phase, the loss minimization is dominated by the clean fraction, such that the mislabeling does not affect the learning too much. However, after the clean labels are fit sufficiently well, the learner starts to concentrate predominantly on the mislabeled part, thereby overfitting incorrect labels and harming generalization.

A closer look at the probabilistic predictions in the first learning phase reveals that erroneous training labels can be distinguished from the underlying ground-truth classes by the learner's confidence. The right plot in Fig. 1 illustrates that models are typically not incentivized to optimize for predicting the corrupted training labels in the first epochs, but infer relatively high probability scores for the (unknown) ground-truth class, at least on average. Evidently, the model itself could serve as a source for modeling the beliefs about the ground truth, taking all (mostly clean) instance-label relations into account to reason about the true labels. This idea has been adopted by label correction methods that predict pseudo-labels for instances that appear to be mislabeled, thus suggesting labels that the learner considers more plausible (Reed et al. 2015; Wu et al. 2021).

While intuitively plausible, a strategy like this is obviously not without risk. Especially in the early phase of the training, replacing a supposedly incorrect label by another one might be too hasty, and may potentially even aggravate the negative effects of label noise. Instead, as the learner is still in its infancy, it seems advisable to realize a more cautious learning behavior. More concretely, instead of discarding the original training information completely, it might be better to keep it as an option.

Following this motivation, we propose a complementary approach for modeling the learner's belief about the true label: Instead of forcing it to commit to a single label, either the original one or a presumably more plausible alternative, we allow the learner to (re-)label a training instance by a set of candidate labels, which may comprise more than a single plausible candidate. More specifically, by retaining the original label and adding other plausible candidates, we deliberately ambiguate the training information to mitigate the risk of making a mistake in an early phase of the training process. To put this idea into practice, we make use of so-called superset learning (Liu and Dietterich 2014; Hüllermeier and Cheng 2015), which allows the learner itself to disambiguate possibly ambiguous training data. More precisely, we represent the ambiguous target information in the form of so-called credal sets, i.e., sets of probability distributions, to train probabilistic classifiers via generalized risk minimization in the spirit of label relaxation (Lienen and Hüllermeier 2021a).
We realize our approach, which we dub Robust Data Ambiguation (RDA), as an easy off-the-shelf loss function that dynamically derives the target sets from the model predictions without the need for any additional model parameters: this is done implicitly in the loss calculation, without requiring any change to a conventional learning routine. This way, we combine the simplicity of robust losses with the data modeling capabilities of more complex label correction approaches. We demonstrate the effectiveness of our method on commonly used image classification datasets with both synthetic and real-world noise, confirming the adequacy of our proposed robust loss.

## Related Work

Coping with label noise in machine learning is a broad field with an extensive amount of recent literature. Here, we distinguish four views on this issue, namely, robust loss functions, regularization, sample selection, and label correction methods. For a more comprehensive overview, we refer to recent surveys by Song et al. (2020) and Wei et al. (2022b).

Robust Losses. The task of designing robust optimization criteria has a long-standing history in classical statistics, e.g., to alleviate the sensitivity towards outliers. As a prominent member of such methods, the mean absolute error (MAE) steps up to mitigate the shortcomings of the mean squared error. In the context of probabilistic classification, robustness of loss functions towards label noise is linked with the symmetry of the function (Ghosh, Kumar, and Sastry 2017), leading to adaptations of losses such as cross-entropy (Wang et al. 2019; Ma et al. 2020). A large strand of research proposes to balance MAE and cross-entropy, e.g., by the negative Box-Cox transformation (Zhang and Sabuncu 2018), or by controlling the order of the Taylor series for the categorical cross-entropy (Feng et al. 2020), whereas alternative loss formulations have also been considered (Wei and Liu 2021). Besides, methodologies accompanying classical losses for robustness have been proposed, such as gradient clipping (Menon et al. 2020) or subgradient optimization (Ma and Fattahi 2022).

Regularization. Regularizing losses for classification has also been considered as a means to cope with label noise. As one prominent example, label smoothing (LS) (Szegedy et al. 2016) has shown similar beneficial properties as loss correction when dealing with label noise (Lukasik et al. 2020). Advancements also enable applicability in high-noise regimes (Wei et al. 2022a). Among the first works building upon this observation, Arpit et al. (2017) characterize two phases in learning from data with label noise. First, the model learns to correctly classify most of the instances (including the ground-truth labels of mislabeled training instances), followed by the memorization of mislabels. This work shows that explicit regularization, for instance Dropout (Srivastava et al. 2014), is effective in combating memorization and improving generalization. Following this, Liu et al. (2020) propose a penalty term to counteract memorization which is derived from the (more correct) early learning of the model. Beyond LS, other forms of soft labeling methods have also been proposed, e.g., in so-called label distribution learning (Gao et al. 2017). In addition, sparse regularization enforces the model to predict sharp distributions with sparsity (Zhou et al. 2021b). Lastly, Iscen et al. (2022) describe a regularization term based on the consistency of an instance with its neighborhood in feature space.
Sample Selection. Designed to prevent the aforementioned memorization of mislabeled instances, a wide variety of methods rely on the so-called small-loss selection criterion (Jiang et al. 2018; Gui, Wang, and Tian 2021). It is intended to distinguish clean samples, which the model recognizes well and from which it can learn without distortion. This distinction suggests a range of approaches, including a gradual increase of the clean sample set with increasing model confidence (Shen and Sanghavi 2019; Cheng et al. 2021; Wang et al. 2022), co-training (Han et al. 2018; Yu et al. 2019; Wei et al. 2020), or re-weighting instances (Chen, Zhu, and Chen 2021). Furthermore, it makes the complete plethora of classical semi-supervised learning methods amenable to the task of learning from noisy labels. Here, the non-small-loss examples are considered as unlabeled in the first place (Li, Socher, and Hoi 2020; Nishi et al. 2021; Zheltonozhskii et al. 2022). Often, such methodology is also combined with consistency regularization (Bachman, Alsharif, and Precup 2014), as prominently used in classical semi-supervised learning (Sohn et al. 2020; Liu et al. 2020; Yao et al. 2021; Liu et al. 2022).

Label Correction. Traditionally, correcting label noise has also been approached by learning probabilistic transition matrices (Goldberger and Ben-Reuven 2017; Patrini et al. 2017; Zhu, Wang, and Liu 2022; Kye et al. 2022), e.g., to re-weight or correct losses. In this regard, it has been proposed to jointly learn a model and (re-)label based on the model's predictions (Tanaka et al. 2018). Closely related to sample selection as discussed before, Song, Kim, and Lee (2019) propose to first select clean samples in a co-teaching method, which then refurbishes noisy samples by correcting their label information. Furthermore, Arazo et al. (2019) suggest to use two-component mixture models for detecting mislabels that are to be corrected, thus serving as a clean sample criterion. Wu et al. (2021) and Tu et al. (2023) approach the problem of label correction by a meta-learning technique in the form of a two-stage optimization process over the model and the training labels. In addition, Liu et al. (2022) describe an approach to learn the individual instances' noise by an additional over-parameterization. Finally, ensembling has also been considered as a source for label correction (Lu and He 2022).

## Method

The following section outlines the motivation behind our method RDA to model targets in an ambiguous manner, introduces the theoretical foundation, and elaborates on how this approach realizes robustness against label noise.

### Motivation

Deep neural network models, often overparameterized with significantly more parameters than required to fit the training data points at hand, have become the de-facto standard choice for most practically relevant problems in deep learning. Here, we consider probabilistic models of the form $\hat{p} : \mathcal{X} \to \mathbb{P}(\mathcal{Y})$, with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{y_1, \ldots, y_K\}$ being the feature and target space, respectively, and $\mathbb{P}(\mathcal{Y})$ the class of probability distributions on $\mathcal{Y}$. We denote the training set of size $N$ as $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$, where each instance $x_i \in \mathcal{X}$ is associated with an underlying (true) label $y_i^*$.
The latter is conceived as being produced by the underlying (stochastic) data-generating process, i.e., as the realization of a random variable $Y \sim p^*(\cdot \mid x)$.¹ However, the actual label $y_i$ encountered in the training data might be corrupted, which means that $y_i \neq y_i^*$. Reasons for corruption can be manifold, including erroneous measurement and annotation by a human labeler.

¹ In applications such as image classification, the ground-truth conditional distributions $p^*(\cdot \mid x) \in \mathbb{P}(\mathcal{Y})$ are often degenerate, i.e., $p^*(y^* \mid x) = 1$ and $p^*(y \mid x) = 0$ for $y \neq y^*$.

As real-world data typically comprises noise in the annotation process, e.g., due to human labeling errors, the robustness of models towards label noise is of paramount importance. Alas, models in the regime of deep learning trained with conventional (probabilistic) losses, such as cross-entropy, lack this robustness. As thoroughly discussed in the analysis of the training dynamics of such models (Chang, Learned-Miller, and McCallum 2017; Arazo et al. 2019; Liu et al. 2020), two phases can be distinguished, namely a correct concept learning phase and a memorization (Arpit et al. 2017) or forgetting (Toneva et al. 2019) phase (cf. Fig. 1). While the model relates instances to labels in a meaningful way in the first phase, leading to correct predictions on most training instances, the transition to a more and more memorizing model harms the generalization performance.

Figure 2: Learned feature representations of the training instances observed at the penultimate layer of an MLP comprising an encoder and a classification head at different stages of the training. The data consists of correctly (blue or green, respectively) and incorrectly (red) labeled images of zeros and ones from MNIST. The dashed line depicts the linear classifier.

Looking closer at the learning dynamics in an idealized setting, one can observe that overparameterized models project instances of the same ground-truth class $y^*$ to a similar feature embedding in the first training phase, regardless of whether a correct or corrupted training label was observed. Fig. 2 illustrates the learned feature activations of the penultimate layer of a multi-layer perceptron with a classification head trained on MNIST over the course of the training (see the appendix for further experimental details). In the beginning, the learner predicts a relatively high probability $\hat{p}(y_i^* \mid x_i)$ even for noisy instances $(x_i, y_i)$, as the loss is dominated by the cross-entropy of the clean instances, which leads to a similar marginalization behavior as in the non-corrupted case. Here, the proximity of mislabeled instances to the decision boundary can be related to the hardness of instances to be learned, as also studied by Chang, Learned-Miller, and McCallum (2017). In later stages, the feature activations of the noisy instances shift in a rotating manner, successively being pushed towards the other discriminatory face of the decision boundary. This goes hand in hand with a decrease and eventual increase of the predicted probability scores $\max_{y \in \mathcal{Y}} \hat{p}(y \mid x)$, consistent with the observations made in Fig. 1. It appears natural to seek means that keep corrupted instances $(\cdot, y_i)$ close to clean samples $(\cdot, y_i^*)$ in the feature space, and not let the learning bias in the later stages pull them over towards the label corruption.
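As a concrete illustration of the quantities tracked in the left plot of Fig. 1, the following sketch computes, for the corrupted part of a training set, the fraction of instances predicted as their ground-truth class, as their (wrong) training label, or as some other class. It is a diagnostic helper written for this exposition (function name and interface are our own), not code from the paper.

```python
import numpy as np

def memorization_fractions(predictions, train_labels, true_labels):
    """Among corrupted instances (training label != true label), report the fraction
    predicted correctly, the fraction whose wrong training label is memorized,
    and the fraction predicted as some other class."""
    predictions, train_labels, true_labels = map(np.asarray, (predictions, train_labels, true_labels))
    noisy = train_labels != true_labels            # mask of corrupted instances
    correct = float(np.mean(predictions[noisy] == true_labels[noisy]))
    memorized = float(np.mean(predictions[noisy] == train_labels[noisy]))
    return {"correct": correct, "memorized": memorized, "other": 1.0 - correct - memorized}
```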
In the context of the overall optimization, the model itself represents a source of justification for considering a class label as a plausible outcome. Hence, we suggest using the model predictions alongside the training labels for modeling our beliefs about the ground truth. Consequently, we shall consult not only the individual training labels, but also, in conjunction, the complete training dataset that found its way into the current model, complementing the former as a second piece of evidence. This represents a distillation of knowledge obtained from the (mostly clean) data at hand, and we argue that the confidence $\hat{p}$ is a suitable surrogate to represent plausibility in this regard.

### Credal Labeling

Consulting the model prediction $\hat{p}(x_i)$ in addition to the training label $y_i$ may call for augmenting the (hitherto single) label by an additional plausible candidate label $\hat{y}_i \in \operatorname{argmax}_{y \in \mathcal{Y}} \hat{p}(y \mid x_i)$ with $\hat{y}_i \neq y_i$, making the target effectively ambiguous and hence less convenient for conventional point-wise classification losses. However, from a data modeling perspective, it is important to recognize that the imprecisiation of the target information pays off with higher validity, i.e., it becomes more likely that the true label is covered by the target labels. This is completely in line with the idea of data imprecisiation in the context of so-called superset learning (Lienen and Hüllermeier 2021b). Thus, ambiguating presumably corrupt training information appears to be a useful means to counter the influence of mislabeling.

Before detailing the representation of the aforementioned beliefs, we shall revisit the conventional probabilistic learning setting. To train probabilistic classifiers $\hat{p}$ as specified above, e.g., by minimizing the cross-entropy loss, traditional methods transform deterministic target labels $y_i \in \mathcal{Y}$, as commonly observed in classification datasets, into degenerate probability distributions $p_{y_i} \in \mathbb{P}(\mathcal{Y})$ with $p_{y_i}(y_i) = 1$ and $p_{y_i}(y) = 0$ for $y \neq y_i$. As a result, the predicted distribution $\hat{p}$ can be compared to $p_{y_i}$ using a probabilistic loss $L : \mathbb{P}(\mathcal{Y}) \times \mathbb{P}(\mathcal{Y}) \to \mathbb{R}$ to be minimized. It is easy to see that $p_{y_i}$ assigns full plausibility to the observed training label $y_i$, while the other labels are regarded as fully implausible.

Ambiguity in a probabilistic sense can be attained by enriching the representation of the target distribution in a set-valued way. To this end, Lienen and Hüllermeier (2021a) propose to use credal sets, i.e., sets of probability distributions, as a means to express beliefs about the true class-conditional distribution $p^*$. This allows one, for example, to not only consider $p_{y_i}$ for the training label $y_i$ as a fully plausible target distribution, but also $p_y$ for a plausible candidate label $y \neq y_i$, as well as interpolations between these extremes, i.e., distributions $p = \lambda p_{y_i} + (1 - \lambda) p_y$ assigning probability $\lambda$ to label $y_i$ and $1 - \lambda$ to $y$. More generally, a credal set can be any (convex) set of probability distributions, including, for example, relaxations of extreme distributions $p_{y_i}$, which reduce the probability for $y_i$ to $1 - \epsilon < 1$ and reassign the remaining mass of $\epsilon > 0$ to the other labels. Such distributions appear to be meaningful candidates as target distributions, because a degenerate ground truth $p^* = p_{y_i}$ is actually not very likely (Lienen and Hüllermeier 2021a).
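To make the different target representations above tangible, here is a small illustrative sketch (not part of the proposed method) of a degenerate target $p_{y_i}$, an $\epsilon$-relaxed version of it, and an interpolation between two candidate labels; the uniform redistribution of the mass $\epsilon$ is just one possible choice.

```python
import numpy as np

def degenerate(num_classes, label):
    """One-hot target p_{y_i}: full plausibility on the observed label only."""
    p = np.zeros(num_classes)
    p[label] = 1.0
    return p

def relaxed(num_classes, label, eps=0.1):
    """Reduce the probability of y_i to 1 - eps and spread eps uniformly over the rest."""
    p = np.full(num_classes, eps / (num_classes - 1))
    p[label] = 1.0 - eps
    return p

def interpolated(num_classes, label_a, label_b, lam=0.5):
    """Mixture lam * p_{y_a} + (1 - lam) * p_{y_b} between two candidate labels."""
    return lam * degenerate(num_classes, label_a) + (1 - lam) * degenerate(num_classes, label_b)
```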
Possibility theory (Dubois and Prade 2004) offers a compact representation of credal sets in the form of upper bounds on probabilities, encoded as possibility distributions $\pi : \mathcal{Y} \to [0, 1]$. Intuitively, for each outcome $y \in \mathcal{Y}$, $\pi(y)$ defines an upper bound on the true probability $p^*(y)$. More generally, a distribution $\pi$ induces a possibility measure $\Pi(Y) = \max_{y \in Y} \pi(y)$ on $\mathcal{Y}$, which in turn defines an upper bound on (the probability of) any event $Y \subseteq \mathcal{Y}$. In our case, a distribution $\pi_i$ associated with an instance $x_i$ assigns a possibility (or plausibility) to each class in $\mathcal{Y}$ for being the true outcome associated with $x_i$, and $\pi_i(y)$ can be interpreted as an upper bound on $p^*(y \mid x_i)$. Thus, the credal set of distributions considered as candidates for $p^*(\cdot \mid x_i)$ is given as follows:

$$Q_{\pi_i} := \Big\{\, p \in \mathbb{P}(\mathcal{Y}) \;\Big|\; \forall\, Y \subseteq \mathcal{Y} : \ \sum_{y \in Y} p(y) \leq \max_{y \in Y} \pi_i(y) \,\Big\} \tag{1}$$

### Data Ambiguation for Robust Learning

With the above approach to modeling ambiguous targets in a probabilistic manner, we can put the idea of a robust ambiguation method on a rigorous theoretical foundation. As a concrete model of the target label for a training instance $(x_i, y_i)$, we propose the following confidence-thresholded possibility distribution $\pi_i$:

$$\pi_i(y) = \begin{cases} 1 & \text{if } y = y_i \ \text{ or } \ \hat{p}(y \mid x_i) \geq \beta \\ \alpha & \text{otherwise} \end{cases} \tag{2}$$

where $\beta \in [0, 1]$ denotes the confidence threshold for the prediction $\hat{p}(y \mid x_i)$, and $\alpha \in [0, 1)$ is the label relaxation parameter (Lienen and Hüllermeier 2021a). Thus, while the original label $y_i$ remains fully plausible ($\pi_i(y_i) = 1$), it might be complemented by other candidates: as soon as the predicted probability for another label exceeds $\beta$, it is deemed a fully plausible candidate, too. Besides, by assigning a small degree of possibility $\alpha$ to all remaining classes, these are not completely excluded either. Both $\alpha$ and $\beta$ are treated as hyperparameters.

To learn from credal targets $Q_{\pi_i}$ induced by $\pi_i$, i.e., from the ambiguated training data $\{(x_i, Q_{\pi_i})\}$, we make use of a method for superset learning that is based on minimizing the optimistic superset loss (OSL) (Hüllermeier 2014). The latter generalizes a standard loss $L$ for point predictions as follows:

$$L^*(Q_\pi, \hat{p}) := \min_{p \in Q_\pi} L(p, \hat{p}) \, . \tag{3}$$

As argued by Hüllermeier (2014), minimizing OSL on the ambiguated training data is suitable for inducing a model while simultaneously disambiguating the data. In other words, the learner makes a most favourable selection among the candidate labels, and simultaneously fits a model to that selection (by minimizing the original loss $L$). A common choice for $L$ is the Kullback-Leibler divergence $D_{KL}$, for which (3) simplifies to

$$L^*(Q_\pi, \hat{p}) = \begin{cases} 0 & \text{if } \hat{p} \in Q_\pi \\ D_{KL}(p_r \,\|\, \hat{p}) & \text{otherwise} \end{cases} \tag{4}$$

where

$$p_r(y) = \begin{cases} (1 - \alpha) \cdot \dfrac{\hat{p}(y)}{\sum_{y' \in \mathcal{Y} : \pi_i(y') = 1} \hat{p}(y')} & \text{if } \pi_i(y) = 1 \\[2ex] \alpha \cdot \dfrac{\hat{p}(y)}{\sum_{y' \in \mathcal{Y} : \pi_i(y') = \alpha} \hat{p}(y')} & \text{otherwise} \end{cases}$$

is the projection of $\hat{p}$ onto the boundary of $Q_\pi$. This loss is provably convex and has the same computational complexity as standard losses such as cross-entropy (Lienen and Hüllermeier 2021a).

Figure 3: A barycentric visualization of the confidence-thresholded ambiguation for a corrupt training label $y_1$ and a ground truth $y_2$ in the target space $\mathcal{Y} = \{y_1, y_2, y_3\}$: Starting from a credal set $Q$ centered at $p_{y_1}$ (left plot), the prediction $\hat{p}$ assigns a probability mass greater than $\beta$ to $y_2$. Consequently, full possibility is assigned to $y_2$, leading to the enlarged credal set $Q$ as shown to the right.

Algorithm 1: Robust Data Ambiguation (RDA) Loss
Require: Training instance $(x, y) \in \mathcal{X} \times \mathcal{Y}$, model prediction $\hat{p}(x) \in \mathbb{P}(\mathcal{Y})$, confidence threshold $\beta \in [0, 1]$, relaxation parameter $\alpha \in [0, 1)$
1: Construct $\pi$ as in Eq. (2), i.e., $\pi(y') = 1$ if $y' = y$ or $\hat{p}(y' \mid x) \geq \beta$, and $\pi(y') = \alpha$ otherwise
2: Return $L^*(Q_\pi, \hat{p}(x))$ as specified in Eq. (4), where $Q_\pi$ is derived from $\pi$
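The following sketch shows how Alg. 1 could be realized as a batched loss in PyTorch. It is a minimal reading of Eqs. (2) and (4) written for this exposition rather than the authors' reference implementation; in particular, detaching the prediction when constructing the target $p_r$ and the numerical-stability constant are our own assumptions.

```python
import torch
import torch.nn.functional as F

def rda_loss(logits, targets, beta=0.5, alpha=0.1, eps=1e-12):
    """Sketch of the RDA loss (Alg. 1): ambiguate the target with every class whose
    predicted probability reaches beta, then apply the label-relaxation-style loss (Eq. 4)."""
    probs = torch.softmax(logits, dim=-1)                        # \hat{p}(. | x)
    onehot = F.one_hot(targets, num_classes=probs.size(-1)).bool()
    # pi(y) = 1 for the training label and all confidently predicted classes (Eq. 2)
    candidates = onehot | (probs.detach() >= beta)
    # \hat{p} lies in the credal set iff it puts at least 1 - alpha mass on the candidates
    cand_mass = (probs * candidates).sum(dim=-1)
    in_credal = cand_mass >= 1.0 - alpha
    # projection p_r of \hat{p} onto the boundary of Q_pi, treated here as a constant target
    p_det = probs.detach()
    cand_norm = (p_det * candidates).sum(dim=-1, keepdim=True).clamp_min(eps)
    rest_norm = (p_det * ~candidates).sum(dim=-1, keepdim=True).clamp_min(eps)
    p_r = torch.where(candidates,
                      (1.0 - alpha) * p_det / cand_norm,
                      alpha * p_det / rest_norm)
    kl = (p_r * (torch.log(p_r.clamp_min(eps)) - torch.log(probs.clamp_min(eps)))).sum(dim=-1)
    # zero loss for predictions already inside the credal set (Eq. 4)
    return torch.where(in_credal, torch.zeros_like(kl), kl).mean()
```

In a standard training loop, one would simply call `rda_loss(model(x), y, beta=..., alpha=...)` in place of the cross-entropy criterion, optionally decaying $\beta$ over the epochs as in Eq. (5) below.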
Fig. 3 illustrates the core idea of our method, which we will refer to as Robust Data Ambiguation (RDA). Imagine an instance $x$ with true label $y_2$ but corrupted training information $y_1$. Initially, we observe a probabilistic target centered at $p_{y_1}$ (and potentially relaxed by some $\alpha > 0$). Without changing the target, label relaxation would compute the loss by comparing the prediction $\hat{p}$ to a (precise) distribution $p_r$ projecting $\hat{p}$ towards $p_{y_1}$. In this example, the model predicts $y_2$ with a confidence exceeding $\beta$. With our method, full plausibility would be assigned to $y_2$ in addition to $y_1$, leading to a credal set $Q$ as shown in the right plot. To compute the loss, $\hat{p}$ is now compared to a less extreme target $p_r$ that is projected onto the larger credal set, no longer urging the learner to predict distributions close to $p_{y_1}$. A more technical description of RDA is provided in Alg. 1, confirming the simplicity of our proposal.

As can be seen, by ambiguating the target set, we relieve the learner from memorizing wrong training labels. At the same time, we remain cautious and try to avoid committing to potentially erroneous predictions. Thereby, learning from noisy labels is de-emphasized while the optimization on clean samples remains unaffected. At the same time, driven by the loss minimization on similar instances that are correctly labeled with $y_2$, the imprecisiation allows the predictions $\hat{p} \in Q$ to evolve towards the true target $p_{y_2}$.

Surpassing the threshold $\beta$ can be seen as an indication of mislabeling, i.e., the prediction $\hat{p}$ suggests that a different label $y \neq y_i$ is a plausible candidate for representing the true outcome $y^*$. As such, $\beta$ can be used to control the cautiousness in suspecting mislabeling: High values for $\beta$ ensure that only highly confident predictions adjust the target, whereas less extreme values result in a more eager addition of candidates. Conversely, since classes $y$ with $\hat{p}(y) < \beta$ are considered implausible, the threshold $\beta$ could also be interpreted as a criterion for the exclusion of candidates. As shown in a more extensive analysis of the $\beta$ parameter, the former interpretation is practically more useful in the case of robust classification.

Generally, any value in $[0, 1]$ could be chosen for $\beta$, perhaps also depending on the training progress. As a simple yet effective rule of thumb, we suggest varying $\beta$ over the course of time: As the model is relatively uncertain in the first epochs, one should not pay too much attention to the predictions. However, with further progress, $\hat{p}$ becomes more informative for the confidence-driven possibility elicitation, suggesting the use of smaller $\beta$ values. Empirically, we found the following decay function to yield good performance:

$$\beta_T = \beta_1 + \frac{1}{2} \, (\beta_0 - \beta_1) \left( 1 + \cos\!\left( \frac{T}{T_{\max}} \, \pi \right) \right) , \tag{5}$$

where $T$ and $T_{\max}$ denote the current and maximum number of epochs, respectively, while $\beta_0$ and $\beta_1$ represent the start and end values for $\beta$. Nevertheless, as will be shown in the empirical evaluation, static values for $\beta$ also work reasonably well, such that the number of additional hyperparameters can be reduced to a single one.
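Eq. (5) translates directly into a schedule function, sketched below; the start and end values are placeholders, since the concrete hyperparameter settings are only reported in the appendix.

```python
import math

def beta_schedule(t, t_max, beta_start=0.9, beta_end=0.3):
    """Cosine decay of the confidence threshold beta over training epochs (Eq. 5).
    beta_start and beta_end are illustrative values, not the paper's settings."""
    return beta_end + 0.5 * (beta_start - beta_end) * (1.0 + math.cos(t / t_max * math.pi))
```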
## Experiments

To demonstrate our method's effectiveness in coping with label noise, we conduct an empirical analysis on several image classification datasets as a practically relevant problem domain, although RDA is not specifically tailored to this domain and does not use any modality-specific components. Here, we consider CIFAR-10/-100 (Krizhevsky and Hinton 2009) as well as the large-scale datasets WebVision (Li et al. 2017) and Clothing1M (Xiao et al. 2015) as benchmarks. For the former, we model synthetic noise by both symmetrically and asymmetrically randomizing the labels for a fraction of examples, while WebVision and Clothing1M comprise real-world noise due to their underlying crawling process. Moreover, we report results on CIFAR-10(0)N (Wei et al. 2022b) as another real-world noise dataset based on human annotators. We refer to the appendix for a more detailed description of the datasets and the corruption process, as well as experiments on additional benchmark datasets.

As baselines, we take a wide range of commonly applied loss functions into account. Proceeding from the conventional cross-entropy (CE), we report results for the regularized CE adaptations label smoothing (LS) and label relaxation (LR), as well as the popular robust loss functions generalized cross-entropy (GCE) (Zhang and Sabuncu 2018), normalized cross-entropy (NCE) (Ma et al. 2020), combinations of NCE with AUL and AGCE (Zhou et al. 2021a), and CORES (Cheng et al. 2021). For completeness, we also report results for ELR (Liu et al. 2020) and SOP (Liu et al. 2022) as two state-of-the-art representatives of regularization and label correction methods, respectively. While all other losses can be used off-the-shelf, these two have additional parameters (to track the label noise per instance), giving them an arguably unfair advantage. In the appendix, we show further results for a natural baseline to our approach in the realm of superset learning, as well as experiments that combine our method with sample selection.

We follow common practice and evaluate methods against label noise in the regime of overparameterized models by training ResNet34 models on the smaller-scale datasets CIFAR-10/-100. For the larger datasets, we consider ResNet50 models pretrained on ImageNet. All models use the same training procedure and optimizer. A more thorough overview of experimental details, such as hyperparameters and their optimization, as well as the technical infrastructure, can be found in the appendix. We repeated each run five times with different random seeds, reporting the accuracy on the test splits to quantify generalization.
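For concreteness, the following sketch shows one common way to simulate the symmetric corruption described above, drawing a replacement class uniformly among the other classes; the exact corruption protocol used in the experiments (including the asymmetric variant) is specified in the appendix, so this is an assumption-laden illustration only.

```python
import numpy as np

def symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Replace a random fraction of labels by a different class chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    corrupt_idx = rng.choice(len(noisy), size=int(noise_rate * len(noisy)), replace=False)
    # shift by a random non-zero offset so the new label always differs from the original
    offsets = rng.integers(1, num_classes, size=len(corrupt_idx))
    noisy[corrupt_idx] = (noisy[corrupt_idx] + offsets) % num_classes
    return noisy, corrupt_idx
```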
### Synthetic Noise

Table 1 reports the results for the synthetic corruptions on CIFAR-10/-100. As can be seen, our approach provides consistent improvements in terms of generalization performance over the robust off-the-shelf loss functions. Cross-entropy and its regularized adaptations LS and LR appear sensitive to label noise, which confirms the need for loss robustness. Interestingly, although being slightly inferior in most cases of symmetric noise, our method still appears competitive compared to ELR and SOP, despite their increased expressivity through additional parameters. For asymmetric noise, as presented in the appendix, our method could even outperform such methods. Nevertheless, it becomes less effective with an increasing level of noise.

| Loss | Add. Param. | CIFAR-10 Sym. 25 % | CIFAR-10 Sym. 50 % | CIFAR-10 Sym. 75 % | CIFAR-100 Sym. 25 % | CIFAR-100 Sym. 50 % | CIFAR-100 Sym. 75 % |
|---|---|---|---|---|---|---|---|
| CE | | 79.05 ± 0.67 | 55.03 ± 1.02 | 30.03 ± 0.74 | 58.27 ± 0.36 | 37.16 ± 0.46 | 13.66 ± 0.45 |
| LS (α = 0.1) | | 76.66 ± 0.69 | 53.95 ± 1.47 | 29.03 ± 1.21 | 59.75 ± 0.24 | 37.61 ± 0.61 | 13.53 ± 0.51 |
| LS (α = 0.25) | | 77.48 ± 0.32 | 53.08 ± 1.95 | 28.29 ± 0.65 | 59.84 ± 0.57 | 39.80 ± 0.38 | 14.18 ± 0.44 |
| LR (α = 0.1) | | 80.53 ± 0.39 | 57.55 ± 0.95 | 29.83 ± 0.87 | 57.52 ± 0.58 | 36.77 ± 0.54 | 13.23 ± 0.14 |
| LR (α = 0.25) | | 80.43 ± 0.09 | 60.18 ± 1.01 | 31.36 ± 0.91 | 57.67 ± 0.11 | 37.15 ± 0.14 | 13.41 ± 0.24 |
| GCE | | 90.82 ± 0.10 | 83.36 ± 0.65 | 54.34 ± 0.37 | 68.06 ± 0.31 | 58.66 ± 0.28 | **26.85 ± 1.28** |
| NCE | | 79.05 ± 0.12 | 63.94 ± 1.74 | 38.23 ± 2.63 | 19.32 ± 0.81 | 11.09 ± 1.03 | 6.12 ± 7.57 |
| NCE+AGCE | | 87.57 ± 0.10 | 83.05 ± 0.81 | 51.16 ± 6.44 | 64.15 ± 0.23 | 39.64 ± 1.66 | 7.67 ± 1.25 |
| NCE+AUL | | 88.89 ± 0.29 | 84.18 ± 0.42 | **65.98 ± 1.56** | 69.76 ± 0.31 | 57.41 ± 0.41 | 17.72 ± 1.27 |
| CORES | | 88.60 ± 0.28 | 82.44 ± 0.29 | 47.32 ± 17.03 | 60.36 ± 0.67 | 46.01 ± 0.44 | 18.23 ± 0.28 |
| RDA (ours) | | **91.48 ± 0.22** | **86.47 ± 0.42** | 48.11 ± 15.41 | **70.03 ± 0.32** | **59.83 ± 1.15** | 26.75 ± 8.83 |
| ELR | ✓ | 92.45 ± 0.08 | 88.39 ± 0.36 | 72.58 ± 1.63 | 73.66 ± 1.87 | 48.72 ± 26.93 | 38.35 ± 10.26 |
| SOP | ✓ | 92.58 ± 0.08 | 89.21 ± 0.33 | 76.16 ± 4.88 | 72.04 ± 0.67 | 64.28 ± 1.44 | 40.59 ± 1.62 |

Table 1: Test accuracies and standard deviations on the test split for models trained on CIFAR-10(0) with (symmetric) synthetic noise. The results are averaged over runs with different seeds; bold entries mark the best method without any additional model parameters. Underlined results (in the original table) indicate the best method overall.

When looking at the learning dynamics of training with RDA, Fig. 4 reveals an effective ambiguation over the course of the learning process. The top plot shows the attenuation of any memorization while improving the correctness of model predictions for the wrongly labeled instances at the same time. The middle plot depicts an increase of the credal set size in terms of classes with full plausibility for the noisy instances. Together with the bottom plot showing the validity of the credal sets, i.e., the fraction of credal sets that assign full possibility to the ground-truth class, one can easily see that the ambiguation is indeed able to select the ground-truth class as training label. Furthermore, the credal set size for clean instances is barely affected, which again confirms the adequacy of our model. Notably, our method also shows self-correcting behavior after ambiguating with a wrong class midway. While the validity of the credal sets increases (roughly) monotonically, the set sizes become smaller towards the end of the training. In the appendix, we provide additional plots in other noise settings showing consistent effects.

Figure 4: The top plot shows the fraction of mislabeled training instances for which the models predict the ground truth (blue), the wrong training label (orange), or a different label (green). The middle and bottom plots show the credal set size and validity, respectively. All plots are averaged over the five runs on CIFAR-10 with 50 % synthetic symmetric noise.

### Real-World Noise

In coherence with the previous observations, our robust ambiguation loss also works reasonably well for real-world noise. As presented in Table 2, RDA leads to superior generalization performance compared to baseline losses without any additional parameters in almost every case. Moreover, it also consistently outperforms SOP on CIFAR-10N, whereas it leads to similar results for CIFAR-100N. For the large-scale datasets WebVision and Clothing1M, for which results are presented in Table 3, the differences between the baselines and our approach appear to be rather modest but are still in favor of our method.

| Loss | Add. Param. | CIFAR-10N Random 1 | CIFAR-10N Random 2 | CIFAR-10N Random 3 | CIFAR-10N Aggregate | CIFAR-10N Worst | CIFAR-100N Noisy |
|---|---|---|---|---|---|---|---|
| CE | | 82.96 ± 0.23 | 83.16 ± 0.52 | 83.49 ± 0.34 | 88.74 ± 0.13 | 64.93 ± 0.79 | 52.88 ± 0.14 |
| LS (α = 0.1) | | 82.76 ± 0.47 | 82.10 ± 0.21 | 82.12 ± 0.37 | 88.63 ± 0.11 | 63.10 ± 0.38 | 53.48 ± 0.45 |
| LS (α = 0.25) | | 82.95 ± 1.57 | 83.86 ± 2.05 | 82.61 ± 0.25 | 87.03 ± 2.29 | 66.14 ± 6.89 | 53.98 ± 0.27 |
| LR (α = 0.1) | | 83.00 ± 0.36 | 82.64 ± 0.31 | 82.82 ± 0.21 | 88.41 ± 0.29 | 66.62 ± 0.33 | 52.01 ± 0.04 |
| LR (α = 0.25) | | 82.14 ± 0.49 | 81.87 ± 0.34 | 82.46 ± 0.11 | 88.07 ± 0.45 | 66.44 ± 0.14 | 52.22 ± 0.29 |
| GCE | | 88.85 ± 0.19 | 88.96 ± 0.32 | 88.73 ± 0.11 | 90.85 ± 0.32 | 77.24 ± 0.47 | 55.43 ± 0.47 |
| NCE | | 81.88 ± 0.27 | 81.02 ± 0.32 | 81.48 ± 0.13 | 84.62 ± 0.49 | 69.40 ± 0.10 | 21.12 ± 0.67 |
| NCE+AGCE | | 89.48 ± 0.28 | 88.95 ± 0.10 | 89.25 ± 0.29 | 90.65 ± 0.44 | 81.27 ± 0.44 | 51.42 ± 0.65 |
| NCE+AUL | | 89.42 ± 0.22 | 89.36 ± 0.15 | 88.94 ± 0.55 | 90.92 ± 0.19 | 81.28 ± 0.47 | 56.58 ± 0.41 |
| CORES | | 86.09 ± 0.57 | 86.48 ± 0.27 | 86.02 ± 0.22 | 89.23 ± 0.10 | 76.80 ± 0.96 | 53.04 ± 0.29 |
| RDA (ours) | | 90.43 ± 0.03 | 90.09 ± 0.29 | 90.40 ± 0.01 | 91.71 ± 0.38 | 82.91 ± 0.83 | 59.22 ± 0.26 |
| ELR | ✓ | 91.35 ± 0.29 | 91.46 ± 0.29 | 91.39 ± 0.03 | 92.68 ± 0.03 | 84.82 ± 0.42 | 62.80 ± 0.27 |
| SOP | ✓ | 89.16 ± 0.40 | 89.02 ± 0.33 | 88.99 ± 0.31 | 90.54 ± 0.16 | 80.65 ± 0.13 | 59.32 ± 0.41 |

Table 2: Test accuracies and standard deviations on the test split for models trained on CIFAR-10(0)N with real-world noise; the scores on the clean splits can be found in the appendix. The results are averaged over runs with different seeds.

| Loss | WebVision | Clothing1M |
|---|---|---|
| CE | 66.96 | 68.04 |
| GCE | 61.76 | 69.75 |
| AGCE | 69.4 | - |
| NCE+AGCE | 67.12 | - |
| RDA (ours) | 70.23 | 71.42 |

Table 3: Large-scale test accuracies on WebVision and Clothing1M using ResNet50 models. The baseline results for the two datasets are taken from (Zhou et al. 2021a) and (Liu et al. 2020), respectively.

## Conclusion

Large models are typically prone to memorizing noisy labels in classification tasks. To address this issue, various loss functions have been proposed that enhance the robustness of conventional loss functions against label noise. Although such techniques are appealing due to their simplicity, they typically lack the capacity to incorporate additional knowledge regarding the instances, such as beliefs about the true label. In response, pseudo-labeling methods for label correction have emerged. They allow for modeling target information in a more sophisticated way, albeit at the cost of an increased training complexity. To address the shortcomings of previous methods, our approach advocates the idea of weakening (ambiguating) the training information. In a sense, it unifies the two directions pursued so far, namely, modifying the learning procedure (loss function) and modifying the data (selection, correction). By allowing the learner to (re-)label training data with credal sets of probabilistic labels, the approach becomes very flexible.
For the specific type of label ambiguation we proposed in this paper, we could show that learning from the re-labeled data is equivalent to minimizing a robustified loss function that is easy to compute. Our empirical evaluation confirms the adequacy of our proposal. Our approach suggests several directions for future research. For example, a more informed choice of β and α could be realized by quantifying the epistemic uncertainty (H ullermeier and Waegeman 2021) of individual predictions, as highly uncertain guesses should be considered with caution. Also, our method could be leveraged for various downstream tasks, e.g., to detect anomalies in mostly homogeneous data. Finally, as we are currently only considering the predictions at a time, further improvements could be achieved by taking the dynamics of the training into account. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Acknowledgements This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Center On-The-Fly Computing (CRC 901 project no. 160364472). Moreover, the authors gratefully acknowledge the funding of this project by computing time provided by the Paderborn Center for Parallel Computing (PC2). References Arazo, E.; Ortego, D.; Albert, P.; O Connor, N. E.; and Mc Guinness, K. 2019. Unsupervised Label Noise Modeling and Loss Correction. In ICML, June 9-15, Long Beach, CA, USA, volume 97, 312 321. PMLR. Arpit, D.; Jastrzebski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M. S.; Maharaj, T.; Fischer, A.; Courville, A. C.; Bengio, Y.; and Lacoste-Julien, S. 2017. A Closer Look at Memorization in Deep Networks. In ICML, Aug. 6-11, Sydney, Australia, volume 70, 233 242. PMLR. Bachman, P.; Alsharif, O.; and Precup, D. 2014. Learning with Pseudo-Ensembles. In NIPS, Dec. 8-13, Montreal, Canada, 3365 3373. Chang, H.; Learned-Miller, E. G.; and Mc Callum, A. 2017. Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples. In Neur IPS, Dec. 49, Long Beach, CA, USA, 1002 1012. Chen, W.; Zhu, C.; and Chen, Y. 2021. Sample Prior Guided Robust Model Learning to Suppress Noisy Labels. Co RR, abs/2112.01197. Cheng, H.; Zhu, Z.; Li, X.; Gong, Y.; Sun, X.; and Liu, Y. 2021. Learning with Instance-Dependent Label Noise: A Sample Sieve Approach. In ICLR, May 3-7, virtual. Open Review.net. Dubois, D.; and Prade, H. 2004. Possibility Theory, Probability Theory and Multiple-Valued Logics: A Clarification. Ann. Math. Artif. Intell., 32: 35 66. Feng, L.; Shu, S.; Lin, Z.; Lv, F.; Li, L.; and An, B. 2020. Can Cross Entropy Loss Be Robust to Label Noise? In IJCAI, 2206 2212. ijcai.org. Gao, B.; Xing, C.; Xie, C.; Wu, J.; and Geng, X. 2017. Deep Label Distribution Learning With Label Ambiguity. IEEE Trans. Image Process., 26(6): 2825 2838. Ghosh, A.; Kumar, H.; and Sastry, P. S. 2017. Robust Loss Functions under Label Noise for Deep Neural Networks. In AAAI, Feb. 4-9, San Francisco, CA, USA, 1919 1925. AAAI Press. Goldberger, J.; and Ben-Reuven, E. 2017. Training Deep Neural-Networks using a Noise Adaptation Layer. In ICLR, Apr. 24-26, Toulon, France. Open Review.net. Gui, X.; Wang, W.; and Tian, Z. 2021. Towards Understanding Deep Learning from Noisy Labels with Small-Loss Criterion. In IJCAI, Aug. 19-27, virtual, 2469 2475. ijcai.org. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I. W.; and Sugiyama, M. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Neur IPS, Dec. 
3-8, Montr eal, Canada, 8536 8546. H ullermeier, E. 2014. Learning from Imprecise and Fuzzy Observations: Data Disambiguation through Generalized Loss Minimization. Int. J. Approx. Reason., 55(7): 1519 1534. H ullermeier, E.; and Cheng, W. 2015. Superset Learning Based on Generalized Loss Minimization. In ECML PKDD, Sept. 7-11, Porto, Portugal, volume 9285 of LNCS, 260 275. Springer. H ullermeier, E.; and Waegeman, W. 2021. Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods. Mach. Learn., 110(3): 457 506. Iscen, A.; Valmadre, J.; Arnab, A.; and Schmid, C. 2022. Learning with Neighbor Consistency for Noisy Labels. In CVPR, June 18-24, New Orleans, LA, USA, 4662 4671. IEEE. Jiang, L.; Zhou, Z.; Leung, T.; Li, L.; and Fei-Fei, L. 2018. Mentor Net: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, July 10-15, Stockholm, Sweden, volume 80, 2309 2318. PMLR. Krizhevsky, A.; and Hinton, G. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Toronto, Canada. Kye, S. M.; Choi, K.; Yi, J.; and Chang, B. 2022. Learning with Noisy Labels by Efficient Transition Matrix Estimation to Combat Label Miscorrection. In ECCV, Tel Aviv, Oct. 23-27, Israel, volume 13685 of LNCS, 717 738. Springer. Li, J.; Socher, R.; and Hoi, S. C. H. 2020. Divide Mix: Learning with Noisy Labels as Semi-supervised Learning. In ICLR, Apr. 26-30, Addis Ababa, Ethiopia. Open Review.net. Li, W.; Wang, L.; Li, W.; Agustsson, E.; and Gool, L. V. 2017. Web Vision Database: Visual Learning and Understanding from Web Data. Co RR, abs/1708.02862. Lienen, J.; and H ullermeier, E. 2021a. From Label Smoothing to Label Relaxation. In AAAI, Feb. 2-9, virtual, 8583 8591. AAAI Press. Lienen, J.; and H ullermeier, E. 2021b. Instance Weighting through Data Imprecisiation. Int. J. Approx. Reason., 134: 1 14. Liu, L.; and Dietterich, T. 2014. Learnability of the Superset Label Learning Problem. In ICML. Beijing, China. Liu, S.; Niles-Weed, J.; Razavian, N.; and Fernandez Granda, C. 2020. Early-Learning Regularization Prevents Memorization of Noisy Labels. In Neur IPS, Dec. 6-12, virtual. Liu, S.; Zhu, Z.; Qu, Q.; and You, C. 2022. Robust Training under Label Noise by Over-parameterization. In ICML, July 17-23, Baltimore, MD, USA, volume 162 of MLR, 14153 14172. PMLR. Lu, Y.; and He, W. 2022. SELC: Self-Ensemble Label Correction Improves Learning with Noisy Labels. In IJCAI, July 23-29, Vienna, Austria, 3278 3284. ijcai.org. Lukasik, M.; Bhojanapalli, S.; Menon, A. K.; and Kumar, S. 2020. Does Label Smoothing Mitigate Label Noise? In ICML, July 13-18, virtual, volume 119, 6448 6458. PMLR. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Ma, J.; and Fattahi, S. 2022. Blessing of Nonconvexity in Deep Linear Models: Depth Flattens the Optimization Landscape Around the True Solution. Co RR, abs/2207.07612. Ma, X.; Huang, H.; Wang, Y.; Romano, S.; Erfani, S. M.; and Bailey, J. 2020. Normalized Loss Functions for Deep Learning with Noisy Labels. In ICML, July 13-18, virtual, volume 119, 6543 6553. PMLR. Menon, A. K.; Rawat, A. S.; Reddi, S. J.; and Kumar, S. 2020. Can Gradient Clipping Mitigate Label Noise? In ICLR, Apr. 26-30, Addis Ababa, Ethiopia. Open Review.net. Nishi, K.; Ding, Y.; Rich, A.; and H ollerer, T. 2021. Augmentation Strategies for Learning With Noisy Labels. In CVPR, June 19-25, virtual, 8022 8031. CVF / IEEE. Patrini, G.; Rozza, A.; Menon, A. K.; Nock, R.; and Qu, L. 2017. 
Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach. In CVPR, July 21-26, Honolulu, HI, USA, 2233 2241. IEEE. Reed, S. E.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; and Rabinovich, A. 2015. Training Deep Neural Networks on Noisy Labels with Bootstrapping. In ICLR, May 7-9, San Diego, CA, USA, Workshop Track Proc. Shen, Y.; and Sanghavi, S. 2019. Learning with Bad Training Data via Iterative Trimmed Loss Minimization. In ICML, June 9-15, Long Beach, CA, USA, volume 97, 5739 5748. PMLR. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.; Cubuk, E. D.; Kurakin, A.; and Li, C. 2020. Fix Match: Simplifying Semi-Supervised Learning with Consistency and Confidence. In Neur IPS, Dec. 6-12, virtual. Song, H.; Kim, M.; and Lee, J. 2019. SELFIE: Refurbishing Unclean Samples for Robust Deep Learning. In ICML, June 9-15, Long Beach, CA, USA, volume 97, 5907 5915. PMLR. Song, H.; Kim, M.; Park, D.; and Lee, J. 2020. Learning from Noisy Labels with Deep Neural Networks: A Survey. Co RR, abs/2007.08199. Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res., 15(1): 1929 1958. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR, June 27-30, Las Vegas, NV, USA, 2818 2826. IEEE. Tanaka, D.; Ikami, D.; Yamasaki, T.; and Aizawa, K. 2018. Joint Optimization Framework for Learning With Noisy Labels. In CVPR, June 18-22, Salt Lake City, UT, USA, 5552 5560. CVF / IEEE. Toneva, M.; Sordoni, A.; des Combes, R. T.; Trischler, A.; Bengio, Y.; and Gordon, G. J. 2019. An Empirical Study of Example Forgetting during Deep Neural Network Learning. In ICLR, May 6-9, New Orleans, LA, USA. Open Review.net. Tu, Y.; Zhang, B.; Li, Y.; Liu, L.; Li, J.; Wang, Y.; Wang, C.; and Zhao, C. 2023. Learning from Noisy Labels with Decoupled Meta Label Purifier. Co RR, abs/2302.06810. Wang, H.; Xiao, R.; Dong, Y.; Feng, L.; and Zhao, J. 2022. Pro Mix: Combating Label Noise via Maximizing Clean Sample Utility. Co RR, abs/2207.10276. Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; and Bailey, J. 2019. Symmetric Cross Entropy for Robust Learning With Noisy Labels. In ICCV, Oct. 27 - Nov. 2, Seoul, Korea (South), 322 330. IEEE. Wei, H.; Feng, L.; Chen, X.; and An, B. 2020. Combating Noisy Labels by Agreement: A Joint Training Method with Co-Regularization. In CVPR, June 13-19, Seattle, WA, USA, 13723 13732. CVF / IEEE. Wei, J.; Liu, H.; Liu, T.; Niu, G.; Sugiyama, M.; and Liu, Y. 2022a. To Smooth or Not? When Label Smoothing Meets Noisy Labels. In ICML, July 17-23, Baltimore, MD, USA, volume 162, 23589 23614. PMLR. Wei, J.; and Liu, Y. 2021. When Optimizing f-Divergence is Robust with Label Noise. In ICLR, May 3-7, virtual. Open Review.net. Wei, J.; Zhu, Z.; Cheng, H.; Liu, T.; Niu, G.; and Liu, Y. 2022b. Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations. In ICLR, Apr. 25-29, virtual. Open Review.net. Wu, Y.; Shu, J.; Xie, Q.; Zhao, Q.; and Meng, D. 2021. Learning to Purify Noisy Labels via Meta Soft Label Corrector. In AAAI, Feb. 2-9, virtual, 10388 10396. AAAI Press. Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; and Wang, X. 2015. Learning from Massive Noisy Labeled Data for Image Classification. In CVPR, June 7-12, Boston, MA, USA, 2691 2699. IEEE. Yao, Y.; Sun, Z.; Zhang, C.; Shen, F.; Wu, Q.; Zhang, J.; and Tang, Z. 2021. 
Jo-SRC: A Contrastive Approach for Combating Noisy Labels. In CVPR, June 19-25, virtual, 5192 5201. CVF / IEEE. Yu, X.; Han, B.; Yao, J.; Niu, G.; Tsang, I. W.; and Sugiyama, M. 2019. How does Disagreement Help Generalization against Label Corruption? In ICML, June 9-15, Long Beach, CA, USA, volume 97, 7164 7173. PMLR. Zhang, Z.; and Sabuncu, M. R. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Neur IPS, Dec. 3-8, Montr eal, Canada, 8792 8802. Zheltonozhskii, E.; Baskin, C.; Mendelson, A.; Bronstein, A. M.; and Litany, O. 2022. Contrast to Divide: Self Supervised Pre-Training for Learning with Noisy Labels. In WACV, Jan. 3-8, Waikoloa, HI, USA, 387 397. IEEE. Zhou, X.; Liu, X.; Jiang, J.; Gao, X.; and Ji, X. 2021a. Asymmetric Loss Functions for Learning with Noisy Labels. In ICML, July 18-24, virtual, volume 139, 12846 12856. PMLR. Zhou, X.; Liu, X.; Wang, C.; Zhai, D.; Jiang, J.; and Ji, X. 2021b. Learning with Noisy Labels via Sparse Regularization. In ICCV, Oct. 10-17, Montreal, Canada, 72 81. IEEE. Zhu, Z.; Wang, J.; and Liu, Y. 2022. Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower Quality Features. In ICML, July 17-23, Baltimore, MD, USA, volume 162, 27633 27653. PMLR. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)