# Label Noise: Ignorance Is Bliss

Yilun Zhu, EECS, University of Michigan, allanzhu@umich.edu
Jianxin Zhang, EECS, University of Michigan, jianxinz@umich.edu
Aditya Gangrade, ECE, Boston University, gangrade@bu.edu
Clayton Scott, EECS and Statistics, University of Michigan, clayscot@umich.edu

Abstract

We establish a new theoretical framework for learning under multi-class, instance-dependent label noise. This framework casts learning with label noise as a form of domain adaptation, in particular, domain adaptation under posterior drift. We introduce the concept of relative signal strength (RSS), a pointwise measure that quantifies the transferability from the noisy to the clean posterior. Using RSS, we establish nearly matching upper and lower bounds on the excess risk. Our theoretical findings support the simple Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which minimizes empirical risk while ignoring label noise. Finally, we translate this theoretical insight into practice: by using NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, we achieve state-of-the-art performance on the CIFAR-N data challenge.

1 Introduction

The problem of classification with label noise can be stated in terms of random variables X, Y, and Ỹ, where X is the feature vector, Y ∈ {1, . . . , K} is the true label associated with X, and Ỹ ∈ {1, . . . , K} is a noisy version of Y. The learner has access to i.i.d. realizations of (X, Ỹ), and the objective is to learn a classifier that optimizes the risk associated with (X, Y). In recent years, there has been a surge of interest in the challenging setting of instance (i.e., feature) dependent label noise, in which Ỹ can depend on both Y and X. While several algorithms have been developed, there remains relatively little theory regarding algorithm performance and the fundamental limits of this learning paradigm.

This work develops a theoretical framework for learning under multi-class, instance-dependent label noise. Our framework hinges on the concept of relative signal strength, which is a pointwise measure of noisiness in a label noise problem. Using relative signal strength to characterize the difficulty of a label noise problem, we establish nearly matching upper and lower bounds for excess risk. We further identify distributional assumptions that ensure that the lower and upper bounds tend to zero as the sample size n grows, implying that consistent learning is possible. Surprisingly, the Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which conducts empirical risk minimization as if no label noise exists, is (nearly) minimax optimal. To translate this insight into practice, we use NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, achieving state-of-the-art performance on the CIFAR-N data challenge.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

2 Literature review

Theory and algorithms for classification with label noise are often based on different probabilistic models. Such models may be categorized according to how Ỹ depends on Y and X. The simplest model is symmetric noise, where the distribution of Ỹ is independent of Y and X [Angluin and Laird, 1988]. In this case, the probability that Ỹ = k is the same for all k ≠ Y, regardless of Y and X. In this setting, it is easy to show that minimizing the noisy excess risk (associated with the 0/1 loss) implies minimizing the clean excess risk, a property known as immunity. When immunity holds, there is no need to modify the learning algorithm on account of noisy labels. In other words, the learner may be ignorant of the label noise and still learn consistently.
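To make the symmetric noise model concrete, the sketch below is our own illustration (not code from the paper; the function name and API are ours): with probability ρ each label is replaced by a class drawn uniformly from the other K − 1 classes, so P(Ỹ = k) is the same for every k ≠ Y and the overall noise rate is P(Ỹ ≠ Y) = ρ.

```python
import numpy as np

def flip_symmetric(y, rho, num_classes, rng=None):
    """Corrupt labels with symmetric noise: with probability rho, replace
    each label by a class drawn uniformly from the other K-1 classes."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    flip = rng.random(y.shape) < rho                    # which labels to corrupt
    shift = rng.integers(1, num_classes, size=y.shape)  # uniform over the K-1 other classes
    return np.where(flip, (y + shift) % num_classes, y)

# Example: 10 classes, 40% of labels corrupted on average.
y_clean = np.arange(10).repeat(1000)
y_noisy = flip_symmetric(y_clean, rho=0.4, num_classes=10, rng=0)
print((y_noisy != y_clean).mean())  # approximately 0.4
```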
A more general model is classification with label dependent noise, in which the distribution of Ỹ depends on Y, but not X. Many practical algorithms have been developed over the years, based on principles including data re-weighting [Liu and Tao, 2015], robust training [Han et al., 2018, Liu et al., 2020, Hu et al., 2020, Foret et al., 2021, Liu et al., 2022] and data cleaning [Brodley and Friedl, 1999, Northcutt et al., 2021]. Consistent learning algorithms still exist, such as those based on loss correction [Natarajan et al., 2013, Patrini et al., 2017, Van Rooyen and Williamson, 2018, Liu and Guo, 2020, Zhang et al., 2022]. These approaches assume knowledge of the noise transition probabilities, which can be estimated under some identifiability assumptions [Scott et al., 2013, Zhang et al., 2021b].

In the most general setting, that of instance dependent label noise, the distribution of Ỹ depends on both Y and X. While algorithms are emerging [Cheng et al., 2021, Zhu et al., 2021, Wang et al., 2022, Yang et al., 2023], theory has primarily focused on the binary setting. Scott [2019] establishes immunity for a Neyman-Pearson-like performance criterion under a posterior drift model, discussed in more detail below. Cannings et al. [2020] establish an upper bound for excess risk under the strong assumption that the optimal classifiers for the clean and noisy distributions are the same. Closest to our work, Im and Grigas [2023] derive excess risk upper and lower bounds, and reach a similar conclusion, that noise-ignorant ERM attains the lower bound up to a constant factor. Our results, based on the new concept of relative signal strength, provide a more refined analysis. Additional connections between our contributions and prior work are made throughout the paper.

3 Problem statement

Notation. X denotes the feature space and Y = {1, 2, . . . , K} denotes the label space, with K ∈ ℕ. The K-simplex is ∆_K := {p ∈ ℝ^K : ∀i, p_i ≥ 0, Σ_i p_i = 1}. A K × K matrix is row stochastic if all of its rows are in ∆_K. Denote the i-th element of a vector v as [v]_i, and the (i, j)-th element of a matrix M as [M]_{i,j}.

In conventional multiclass classification, we observe training data (X_1, Y_1), . . . , (X_n, Y_n) drawn i.i.d. from a joint distribution P_XY. The marginal distribution of X is denoted by P_X, and the class posterior probabilities P_{Y|X=x} are captured by a K-simplex-valued vector η : X → ∆_K, where the j-th component of the vector is [η(x)]_j = P(Y = j | X = x). A classifier f : X → Y maps an instance x to a class f(x) ∈ Y. Denote the risk of a classifier f with respect to the distribution P_XY as R(f) = E_{(X,Y)∼P_XY}[1{f(X) ≠ Y}]. The Bayes optimal classifier for P_XY is f*(x) ∈ arg max η(x). The Bayes risk, which is the minimum achievable risk, is denoted R* = R(f*) = inf_f R(f).

We consider the setting where, instead of the true class label Y, a noisy label Ỹ is observed. The training data (X_1, Ỹ_1), . . . , (X_n, Ỹ_n) can be viewed as an i.i.d. sample drawn from a noisy distribution P_{XỸ}. We define P_{Ỹ|X=x}, η̃, R̃ and f̃* analogously to the clean distribution P_XY. The goal of learning from label noise is to find a classifier that minimizes the clean test error, that is, the risk R defined w.r.t. P_XY, even though the learner has access to only corrupted training data (X_i, Ỹ_i) ∼ i.i.d. P_{XỸ}.
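As a minimal illustration of this setup (our sketch, not the paper's code; the toy posterior below is hypothetical), the snippet computes the Bayes classifier f*(x) ∈ arg max η(x) and checks the formula R* = E[1 − max_j [η(X)]_j] against a Monte Carlo estimate of the Bayes risk.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

def eta(x):
    """Toy clean class posterior eta(x) in the K-simplex, for scalar inputs x."""
    logits = np.stack([x, 0.5 * x + 1.0, -x], axis=-1)
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

X = rng.normal(size=50_000)
P = eta(X)                                          # shape (n, K)
Y = np.array([rng.choice(K, p=p) for p in P])       # clean labels Y ~ eta(X)
f_star = P.argmax(axis=-1)                          # Bayes classifier f*(x) = arg max eta(x)

bayes_risk_mc = (f_star != Y).mean()                # Monte Carlo estimate of R(f*)
bayes_risk_formula = (1.0 - P.max(axis=-1)).mean()  # R* = E[1 - max_j [eta(X)]_j]
print(bayes_risk_mc, bayes_risk_formula)            # the two estimates agree closely
```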
Noise transition perspective. Traditionally, label noise is modeled through the joint distribution of (X, Y, Ỹ). This joint distribution is governed by P_X, the clean class posterior P_{Y|X}, and a matrix-valued function E : X → {M ∈ ℝ^{K×K} : M is row stochastic}, known as the noise transition matrix. The (i, j)-th entry of the matrix is defined as [E(x)]_{i,j} = P(Ỹ = j | Y = i, X = x). This implies that the noisy and clean class posteriors are related by η̃(x) = E(x)^⊤ η(x), where ⊤ denotes the matrix transpose.

Domain adaptation perspective. Alternatively, label noise learning can be framed as a domain adaptation problem. In this view, P_{XỸ} represents the source domain, and P_XY represents the target domain. The relationship between the two domains is characterized by posterior drift, meaning that while the source and target share the same X-marginal, the class posteriors (i.e., the distribution of labels given X) may differ [Scott, 2019, Cai and Wei, 2021, Maity et al., 2023, Liu et al., 2024]. Thus, a label noise problem can also be described by a triple (P_X, η, η̃). The two perspectives are equivalent, as discussed in Appendix A.1. In this work, we emphasize the domain adaptation perspective for Sections 4 and 5, and the noise transition perspective for Section 6.

4 Relative signal strength

To study label noise, we introduce the concept of relative signal strength (RSS). This is a pointwise measure of how much signal (certainty about the label) is contained in the noisy distribution relative to the clean distribution. Previous work [Cannings et al., 2020, Cai and Wei, 2021] has examined a related concept within the context of binary classification, under the restriction that the clean and noisy Bayes classifiers are identical. Our definition covers multi-class classification and relaxes the requirement that the clean and noisy Bayes classifiers agree.

Definition 1 (Relative Signal Strength) For any class probability vectors η, η̃, define the relative signal strength (RSS) at x ∈ X as

M(x; η, η̃) = min_{j ∈ Y} ( max_i [η̃(x)]_i − [η̃(x)]_j ) / ( max_i [η(x)]_i − [η(x)]_j ),   (1)

where 0/0 := +∞. Furthermore, for κ ∈ [0, ∞), denote the set of points whose RSS exceeds κ as

A_κ(η, η̃) = {x ∈ X : M(x; η, η̃) > κ}.

M(x; η, η̃) is a pointwise measure of how much signal the noisy posterior contains about the clean posterior. To gain some intuition, first notice that if the noisy Bayes classifier predicts a different class than the clean Bayes classifier, the RSS is 0, by taking j = arg max η̃ (assuming for simplicity that the arg max is a singleton set). Now suppose the clean and noisy Bayes classifiers do make the same prediction at x, say i*, and consider a fixed j. If the ratio ([η̃(x)]_{i*} − [η̃(x)]_j) / ([η(x)]_{i*} − [η(x)]_j) is small, it means that the clean Bayes classifier is relatively certain that j is not the correct clean label, while the noisy Bayes classifier is less certain that j is not the correct noisy label. Taking the minimum over j gives the relative signal strength at x. As we formalize in the next section, a large RSS at x ensures that a small (pointwise) noisy excess risk at x implies a small (pointwise) clean excess risk. To gain more intuition, consider the following examples.

Example 1 When η(x) = [0, 1, 0]^⊤ and η̃(x) = [0.3, 0.6, 0.1]^⊤,

M(x; η, η̃) = min_{j ∈ Y} ( max_i [η̃(x)]_i − [η̃(x)]_j ) / ( max_i [η(x)]_i − [η(x)]_j ) = ( [η̃(x)]_2 − [η̃(x)]_1 ) / ( [η(x)]_2 − [η(x)]_1 ) = (0.6 − 0.3) / (1 − 0) = 0.3.

Here, first of all, arg max η = arg max η̃ = 2, i.e., the clean and noisy Bayes classifiers give the same prediction.
What's more, M(x; η, η̃) < 1 because the clean Bayes classifier is absolutely certain about its prediction, while the noisy Bayes classifier is much less certain.

Example 2 When η(x) = [0, 1, 0]^⊤ and η̃(x) = [0, 0, 1]^⊤,

M(x; η, η̃) = min_{j ∈ Y} ( max_i [η̃(x)]_i − [η̃(x)]_j ) / ( max_i [η(x)]_i − [η(x)]_j ) = ( [η̃(x)]_3 − [η̃(x)]_3 ) / ( [η(x)]_2 − [η(x)]_3 ) = (1 − 1) / (1 − 0) = 0.

The zero signal strength results from η̃ and η leading to different predictions about the arg max.

Example 3 (Comparison to KL divergence) When η(x) = [0.05, 0.7, 0.25]^⊤, and η̃^(1)(x) = [0.25, 0.7, 0.05]^⊤, η̃^(2)(x) = [0.1, 0.6, 0.3]^⊤,

D_KL(η ‖ η̃^(1)) > D_KL(η ‖ η̃^(2)), while M(x; η, η̃^(1)) > M(x; η, η̃^(2)).

Here, η̃^(2) is closer to η in terms of KL divergence, but η̃^(1) provides more information in terms of predicting the arg max of η. There is no conflict: KL divergence considers the similarity between two (whole) distributions, while the task of classification only focuses on predicting the arg max. This also illustrates why our notion of RSS is better suited to the label noise problem than other general-purpose distance measures between distributions.

A desirable learning scenario would be if A_κ(η, η̃) = X for some large κ, indicating that the signal strength is large across the entire space. Unfortunately, this ideal situation is generally not achievable. To gain some insight, consider the following result, proved in Appendix A.2.1.

Proposition 1 A_0(η, η̃) = {x ∈ X : arg max η̃(x) ⊆ arg max η(x)}.

If we assume that both arg max sets are singletons, this result indicates that A_0, the region with positive RSS, is the region where the true and noisy Bayes classifiers agree. Accordingly, X \ A_0, the zero signal region, is the region where the clean and noisy Bayes decision rules differ. The region of strong signal, A_κ, is a subset of A_0. Since the clean and noisy Bayes classifiers will typically disagree for at least some x, A_0 ≠ X in general. We note that the strong assumption that A_0 = X has been made in prior studies [Cannings et al., 2020, Cai and Wei, 2021]. Our notion of RSS relaxes this assumption and provides a unified view.

4.1 RSS in binary classification

We can express relative signal strength more explicitly in the binary setup. Let η(x) := [η(x)]_1 = P(Y = 1 | X = x) and η̃(x) := [η̃(x)]_1 = P(Ỹ = 1 | X = x). In standard binary classification, the margin [Tsybakov, 2004, Massart and Nédélec, 2006], defined as |η(x) − 1/2|, serves as a pointwise measure of signal strength. Our notion of relative signal strength (RSS) can be interpreted as an extension of this concept to the context of label noise learning.

Proposition 2 In the binary setting, for κ ≥ 0,

M(x; η, η̃) = max{ (η̃(x) − 1/2) / (η(x) − 1/2), 0 }, and A_κ(η, η̃) = {x ∈ X : (η̃(x) − 1/2) / (η(x) − 1/2) > κ}.

In other words, RSS can be viewed as a relative margin.

Example 4 Illustration of relative signal strength in a binary classification setup (Figure 1).

Figure 1: Illustration of relative signal strength for binary classification. Left: clean and noisy posteriors [η(x)]_1 = P(Y = 1 | X = x) and [η̃(x)]_1 = P(Ỹ = 1 | X = x), together with the threshold 0.5. Right: relative signal strength corresponding to these posteriors. The gray region, x ∈ (0, 5), is where the true and noisy Bayes classifiers differ, and is also the zero signal region X \ A_0. The red region is A_{0.4}, where the RSS is > 0.4. Note that as x → 0⁻, M(x; η, η̃) → ∞, which occurs since [η(x)]_1 → 1/2, while [η̃]_1 is far from 1/2. For x = 0+, the predicted labels under η and η̃ disagree, and the RSS crashes to 0.

4.2 Posterior Drift Model Class

Now putting definitions together, we consider the posterior drift model Π defined over the triple (P_X, η, η̃). Let ϵ ∈ [0, 1], κ ∈ [0, +∞), and define

Π(ϵ, κ) := {(P_X, η, η̃) : P_X(A_κ(η, η̃)) ≥ 1 − ϵ}.

This is a set of triples (label noise problems) such that A_κ, the region with RSS at least κ, covers at least 1 − ϵ of the probability mass. In the next section, we will demonstrate that the performance within A_κ can be guaranteed, whereas learning outside the region A_κ is provably challenging.
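As a sanity check on Definition 1, the following sketch (our illustration, not code from the paper) computes M(x; η, η̃) for given probability vectors and reproduces Examples 1 through 3.

```python
import numpy as np

def relative_signal_strength(eta, eta_tilde):
    """RSS of Definition 1: min over classes j of
    (max_i eta_tilde[i] - eta_tilde[j]) / (max_i eta[i] - eta[j]), with 0/0 := +inf."""
    eta, eta_tilde = np.asarray(eta, float), np.asarray(eta_tilde, float)
    num = eta_tilde.max() - eta_tilde     # numerators, one per class j
    den = eta.max() - eta                 # denominators, one per class j
    ratios = np.divide(num, den, out=np.full_like(num, np.inf), where=den != 0)
    return ratios.min()                   # zero-denominator entries stay +inf

print(relative_signal_strength([0, 1, 0], [0.3, 0.6, 0.1]))          # Example 1: 0.3
print(relative_signal_strength([0, 1, 0], [0, 0, 1]))                # Example 2: 0.0
print(relative_signal_strength([0.05, 0.7, 0.25], [0.25, 0.7, 0.05]))  # Example 3, eta_tilde^(1): ~0.692
print(relative_signal_strength([0.05, 0.7, 0.25], [0.1, 0.6, 0.3]))    # Example 3, eta_tilde^(2): ~0.667
```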
5 Upper and lower bounds

In this section, we establish both upper and lower bounds for excess risk under multi-class, instance-dependent label noise.

5.1 Minimax lower bound

Our first theorem reveals a fundamental limit: no classifier trained using noisy data can surpass the constraints imposed by relative signal strength in a minimax sense. To state the theorem, we employ the following notation and terminology. Denote the noisy training data by Z^n = {(X_i, Ỹ_i)}_{i=1}^n ∼ i.i.d. P_{XỸ}. A learning rule f̂ is an algorithm that takes Z^n and outputs a classifier. The risk R(f̂) of a learning rule is a random variable, where the randomness is due to the draw of Z^n.

Theorem 1 (Minimax Lower Bound) Let ϵ ∈ [0, 1], κ > 0. Then

inf_f̂ sup_{(P_X, η, η̃) ∈ Π(ϵ, κ)} E_{Z^n}[ R(f̂) − R(f*) ] ≥ (1 − 1/K) ϵ + (c/κ) √(V/n),

where the inf is over all learning rules, c > 0 is a universal constant, and V is the Natarajan dimension appearing in the construction; see Appendix A.2.3 for the formal statement.

The proof in Appendix A.2.3 offers insights into how label noise impacts the learning process: On the low RSS region (X \ A_κ), learning is difficult if not impossible, and the learner incurs an irreducible error of (1 − 1/K)ϵ. On the high RSS region (A_κ), the standard nonparametric rate [Devroye et al., 1996] is scaled by 1/κ. These aspects determine fundamental limits that no classifier trained only on noisy data can overcome without additional assumptions.

5.2 Upper bound

This subsection establishes an upper bound for the Noise Ignorant Empirical Risk Minimizer (NI-ERM), the empirical risk minimizer trained on noisy data. This result implies that NI-ERM is (nearly) minimax optimal, a potentially surprising result given that NI-ERM is arguably the simplest approach one might consider. We begin by presenting a general result on the excess risk of any classifier, which is proved in Appendix A.2.4.

Lemma 1 (Oracle Inequality) For any label noise problem (P_X, η, η̃) and any classifier f,

R(f) − R(f*) ≤ inf_{κ > 0} { P_X(X \ A_κ(η, η̃)) + (1/κ) ( R̃(f) − R̃(f̃*) ) }.

For (P_X, η, η̃) ∈ Π(ϵ, κ), the first term is bounded by ϵ. When f is selected by ERM over the noisy training data, conventional learning theory implies a bound on the second term. This leads to the following upper bound for NI-ERM, whose proof is in Appendix A.2.5.

Theorem 2 (Excess Risk Upper Bound of NI-ERM) Let ϵ ∈ [0, 1], κ > 0. Consider any (P_X, η, η̃) ∈ Π(ϵ, κ), assume the function class F has Natarajan dimension V, and that the noisy Bayes classifier f̃* belongs to F. Let f̂ ∈ F be the ERM trained on Z^n = {(X_i, Ỹ_i)}_{i=1}^n, i.e., f̂ = arg min_{f ∈ F} (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Ỹ_i}. Then

E_{Z^n}[ R(f̂) − R(f*) ] ≤ ϵ + Õ( (1/κ) √(V/n) ),

where Õ denotes big-O notation ignoring logarithmic factors. The Natarajan dimension is a multiclass analogue of the VC dimension.

The upper bound in Theorem 2 aligns with the minimax lower bound (Theorem 1) in both terms. For the irreducible error ϵ, there is a small gap of 1/K.
This gap arises because, in the lower bound construction, the low signal region X \ A_κ is known to the learner, whereas knowledge of X \ A_κ is not provided to NI-ERM. If A_κ were known to the learner (an unrealistic assumption), then a mixed strategy that performs NI-ERM on A_κ and randomly guesses on X \ A_κ would have an upper bound whose first term is (1 − 1/K)ϵ, exactly matching the lower bound. Regarding the second term, there is a universal constant and a logarithmic factor between the lower and upper bounds, which is a standard outcome in learning theory. This result is surprising as it indicates that the simplest possible approach, which ignores the presence of noise, is nearly optimal. No learning rule could perform significantly better in this minimax sense.

5.3 A smooth margin condition on the relative signal strength

The previous sections have analyzed learning with label noise over the class Π(ϵ, κ) = {(P_X, η, η̃) : P_X(A_κ) ≥ 1 − ϵ}. Now, the set X \ A_κ equals (X \ A_0) ∪ (A_0 \ A_κ). The first part of this decomposition is the region where the Bayes classifiers under the noisy and clean distributions differ, while the second is a region where these match, but the RSS is small. Naturally, while P_X(X \ A_0) must be incurred as an irreducible error, one may question why the class Π also limits the mass of A_0 \ A_κ. After all, with enough data, the optimal prediction in this region can be learned. This issue would be resolved if there existed a κ_0 > 0 such that P_X(A_0) = P_X(A_{κ_0}), i.e., if P_X(A_0 \ A_{κ_0}) = 0. In fact, our lower bound from Theorem 1 uses precisely such a construction.

An interesting point of comparison to this condition lies in Massart's hard-margin condition from standard supervised learning theory, which, for binary problems, demands that P_X(|[η(x)]_1 − [η(x)]_2| < h) = 0 for some h > 0, under which one obtains minimax excess risk bounds of O(V/(nh)) [Massart and Nédélec, 2006]. With this lens, we can view the condition P_X(A_0 \ A_{κ_0}) = 0 as a type of hard-margin condition on the relative signal strength M. This naturally motivates a smoothed version of this condition, inspired by Tsybakov's soft-margin condition [Tsybakov, 2004].

Definition 2 A triple (P_X, η, η̃) satisfies an (ϵ, α, C_α)-smooth relative signal margin condition, with ϵ ∈ [0, 1], α > 0, C_α > 0, if

∀κ > 0, P_X( M(x; η, η̃) ≤ κ ) ≤ C_α κ^α + ϵ.

Further, we define Π′(ϵ, α, C_α) as the set of triples (P_X, η, η̃) that satisfy an (ϵ, α, C_α)-smooth relative signal margin condition. We show in Appendix A.2.7 that the techniques of Section 5.2 yield the following result for Π′.

Theorem 3 Let ϵ ∈ [0, 1], α > 0, C_α > 0. Consider any (P_X, η, η̃) ∈ Π′(ϵ, α, C_α), assume the function class F has Natarajan dimension V, and that the noisy Bayes classifier f̃* belongs to F. Let f̂ ∈ F be the ERM trained on Z^n = {(X_i, Ỹ_i)}_{i=1}^n. Then

E_{Z^n}[ R(f̂) − R(f*) ] ≤ ϵ + inf_{κ > 0} { C_α κ^α + (1/κ) Õ( √(V/n) ) } = ϵ + Õ( n^{−α/(2+2α)} ).

Compared to Theorem 2, we see that the rate of the second term is slightly slower, which is consistent with standard learning theory, where Massart's hard margin assumption leads to faster rates than Tsybakov's. The advantage of the smooth relative margin is that the irreducible term in the above theorem is exactly P_X(X \ A_0), which has a clearer meaning as it measures the mismatch between clean and noisy Bayes classifiers. Further, notice that the NI-ERM algorithm does not need information about α, and thus the result is adaptive both to α and to the optimal κ for each value of α, as a consequence of the oracle inequality of Lemma 1.
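To see where the n^{−α/(2+2α)} rate in Theorem 3 comes from, one can balance the two terms inside the infimum. A short derivation sketch (ours, not from the paper; c denotes the constant and logarithmic factors hidden in the Õ(·) term):

```latex
% Balancing the two terms of Theorem 3 (sketch; c absorbs constants and log factors).
\[
  \inf_{\kappa>0}\Big\{ C_\alpha\kappa^{\alpha} + \frac{c}{\kappa}\sqrt{\tfrac{V}{n}} \Big\}
  \;\le\; C_\alpha\kappa_\star^{\alpha} + \frac{c}{\kappa_\star}\sqrt{\tfrac{V}{n}}
  \;=\; 2\,C_\alpha\kappa_\star^{\alpha},
  \qquad
  \kappa_\star := \Big(\tfrac{c}{C_\alpha}\sqrt{\tfrac{V}{n}}\Big)^{\frac{1}{1+\alpha}}.
\]
% Treating V, C_alpha and c as constants, kappa_star is of order n^{-1/(2+2*alpha)},
% so both terms, and hence the infimum, are of order n^{-alpha/(2+2*alpha)},
% matching the rate stated in Theorem 3.
```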
More broadly, Theorem 3 illustrates the flexibility of our conceptualization of label noise problems through RSS. The RSS M characterizes the irreducible error in label noise learning, similar to how the regression function η characterizes excess risk in standard learning. Thus, standard theoretical frameworks can be adapted to the noisy label problem via the relative signal.

6 Conditions that ensure noise immunity

The minimax lower bound in the previous section revealed a negative outcome, indicating that no method can do well in the low signal region. Nevertheless, numerous empirical successes have been observed even under significant label noise. This is not mere coincidence. In this section, we will illustrate that the high signal region A_κ can indeed cover the entire input space X even under massive label noise, albeit with the constraint A_κ ⊆ A_0 as stated in Proposition 1. This not only explains past empirical successes, but also gives a rigorous condition for the consistency of NI-ERM.

This section will delve into the study of the noise transition matrix E and establish precise conditions that lead to A_0 = X. These conditions are linear algebraic conditions on E that ensure arg max η̃(x) = arg max η(x). As a result, we can infer that in a 10-class classification problem, even with up to 90% of training labels being incorrect, NI-ERM can still asymptotically achieve Bayes accuracy. In the upcoming definition, we introduce the concept of noise immunity, wherein the optimal classifiers remain unaffected by label noise [Menon et al., 2015, Scott, 2019].

Definition 3 (Immunity) We say that a K-class classification problem (P_X, η, η̃) is immune to label noise if ∀x ∈ X, arg max η̃(x) = arg max η(x).

Notice that due to Proposition 1, if a problem is immune, then A_0 = X. We now provide necessary and sufficient conditions on the noise transition matrix E that ensure noise immunity. We begin by considering distributions P_XY with zero Bayes risk, that is, where η is one-hot almost surely. A matrix is called diagonally dominant if, for each row, the diagonal element is the unique maximum.

Theorem 4 (Immunity for Zero-error Distribution) If P_XY has Bayes risk of zero, then immunity holds if and only if for all x, the noise transition matrix E(x) is diagonally dominant.

Remark For a zero-error distribution P_XY, even corrupted with instance-dependent label noise, achieving the Bayes risk is still feasible with a noise rate P(Ỹ ≠ Y) up to (K − 1)/K. This highlights that the task of classification itself is robust to label noise, especially when the clean η is well-separated.

The above result relies on strong assumptions about the distribution P_XY. Now, we present a result that applies to any distribution, which, as a trade-off, turns out to impose more requirements on E.

Theorem 5 (Universal Immunity) For any choice of P_XY, immunity holds if and only if there exists e(x) > 0 such that, for all x ∈ X,

[E(x)]_{i,j} = 1/K + e(x) if i = j, and [E(x)]_{i,j} = 1/K − e(x)/(K − 1) if i ≠ j.

Figure 2: Data simulation that verifies noise immunity. (a) Logistic regression trained on 2D Gaussian mixture data with different levels of symmetric noise. (b) A 2-layer CNN trained on MNIST with different levels of symmetric noise. Both panels plot testing accuracy against the noise rate, together with the theoretical prediction. For binary, the turning point is at noise rate P(Ỹ ≠ Y) = 0.5. For 10-class, the turning point is at P(Ỹ ≠ Y) = 0.9.

Previous works [Ghosh and Kumar, 2017, Menon et al., 2018, Oyen et al., 2022] have established that symmetric label noise is sufficient for immunity.
Our contribution advances this understanding by demonstrating that such noise conditions are not only sufficient but also necessary. Specifically, under symmetric label noise, learning towards the Bayes classifier is feasible as long as the proportion of wrong labels is below (K − 1)/K. Furthermore, this transition is abrupt: when P(Ỹ ≠ Y) < (K − 1)/K, A_0 = X, but when P(Ỹ ≠ Y) ≥ (K − 1)/K, A_0 = ∅. Consequently, we expect to see a sudden drop in performance when the noise rate passes the threshold.

The rationale behind the necessity of E(x) taking this specific form is that it redistributes the probability mass of η in a uniform manner. This constraint arises because E(x) cannot favor any classes besides the true class. For instance, consider η(x) = [1/K + δ, 1/K − δ, 1/K, . . . , 1/K]^⊤ for some small δ > 0; a non-uniform E(x) could alter the arg max.

The above theorems demonstrate that the signal strength at x can still be high even under massive label noise P(Ỹ ≠ Y), and, in essence, it is the discrete nature of the classification problem that allows robustness to label noise. When immunity holds, the irreducible error in Theorem 2 vanishes, and therefore NI-ERM becomes a consistent learning rule. We validate this through the data simulations presented in Figure 2, where we systematically flip labels uniformly and observe the corresponding changes in the testing accuracy of NI-ERM. The simulation results align closely with the theoretical expectations: NI-ERM achieves near-Bayes risk performance until a certain noise threshold is reached, beyond which the testing performance sharply deteriorates.

7 Practical implication

The modern practice of machine learning often involves training a deep neural network. In complex tasks involving noisy labels, naïve NI-ERM is often outperformed by state-of-the-art methods by a significant margin [Li et al., 2020, Xiao et al., 2023]. This is consistent with the finding that directly training a large neural network on noisy data frequently leads to overfitting [Zhang et al., 2021a]. Yet this is not grounds for abandoning NI-ERM altogether as a practical strategy. Instead of using NI-ERM for end-to-end training of a deep neural network, we propose the following simple, two-step procedure, termed "feature extraction + NI-ERM" (FE + NI-ERM):

1. Perform feature extraction using any method (e.g., transfer learning or self-supervised learning) that does not require labels.

2. Learn a simple classifier (e.g., a linear classifier) on top of these extracted features, using the noisily labelled data, in a noise-ignorant way.

This approach has three advantages over full network training. First, it avoids the potentially negative impact of the noisy labels on the extracted features. Second, it enjoys the inherent robustness of fitting a simple model (step 2) on noisy data, which we observed in Figure 2. Third, it avoids the need to tune hyperparameters of the feature extractor using noisy labels. We note that a "self-supervised + simple" approach to learning was previously studied by Bansal et al. [2021], although their focus was on generalization properties without label noise. We also acknowledge that the practical idea of ignoring label noise is not new [Ghosh and Lan, 2021], but the full power of this approach has not been previously recognized. For example, prior works often combine this approach with additional steps or employ early stopping to mitigate the effects of noise [Zheltonozhskii et al., 2022, Xue et al., 2022]. Remarkably, this two-step approach attains extremely strong performance.
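A minimal sketch of this two-step recipe (our own illustration, not the authors' released code): step 1 extracts features once without using any labels, here with a torchvision ResNet-50 as a stand-in backbone (the paper's best results use DINOv2); step 2 fits a cross-validated linear classifier on the noisy labels, ignoring the noise. All variable names and the choice of backbone are assumptions for illustration.

```python
import torch
import torchvision
from sklearn.linear_model import LogisticRegressionCV

# Step 1: label-free feature extraction with a pretrained backbone.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()   # drop the classification head, keep 2048-d features
backbone.eval()

@torch.no_grad()
def extract_features(images):       # images: float tensor of shape (n, 3, H, W)
    return backbone(images).cpu().numpy()

# Step 2: Noise Ignorant ERM -- fit a linear classifier on the *noisy* labels as-is,
# tuning its only hyperparameter (the regularization strength) by cross-validation
# on those same noisy labels.
def fe_ni_erm(train_features, noisy_labels):
    clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=5000, n_jobs=-1)
    clf.fit(train_features, noisy_labels)
    return clf

# Usage (the tensors/arrays below are placeholders):
#   Z_train = extract_features(train_images); Z_test = extract_features(test_images)
#   clf = fe_ni_erm(Z_train, noisy_train_labels)
#   clean_test_acc = clf.score(Z_test, clean_test_labels)
```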
We conducted experiments on the CIFAR image data under two scenarios: synthetic label flipping (symmetric noise) and realistic human label errors [Wei et al., 2022], as shown in Figure 3 (code is available at https://github.com/allan-z/label_noise_ignorance). We examine three different feature extractors: the DINOv2 foundation model [Oquab et al., 2023], ResNet-50 features extracted from training on ImageNet [He et al., 2016], and a self-supervised ResNet-50 trained using a contrastive loss [Chen et al., 2020]. We also compared to a simple linear model trained on the raw pixel intensities, and a ResNet-50 trained end-to-end. We observed that the ResNet-50 exhibits degrading performance with increasing noise, consistent with previous findings [Zhang et al., 2021a, Mallinar et al., 2022]. The linear model demonstrates robustness to noise, but suffers from significant approximation error.

Figure 3: (a) CIFAR-10 with synthetic label noise; (b) CIFAR-10 with realistic (human) label noise. Both panels plot test accuracy against the label error rate for a linear model on raw pixels ("Linear Model") and for linear models on extracted features ("Linear + ResNet-50 TL", "Linear + ResNet-50 SSL", "Linear + DINOv2 SSL"). A linear model trained on features obtained from either transfer learning (a ResNet-50 pretrained on ImageNet [He et al., 2016]), self-supervised learning (a ResNet-50 trained on CIFAR-10 images with contrastive loss [Chen et al., 2020]), or the pretrained self-supervised foundation model DINOv2 [Oquab et al., 2023] significantly boosts the performance of the original linear model. In contrast, full training of a ResNet-50 leads to overfitting.

Conversely, the FE + NI-ERM approach enjoys the best of both worlds. Regardless of how the feature extraction is carried out, the resulting models exhibit robustness to label noise, while the overall accuracy depends entirely on the quality of the extracted features. This is illustrated in Figure 3, where the flatness of the accuracy curves as noise increases indicates the robustness, and the intercept at zero label noise is a measure of the feature quality. Importantly, this property holds true even under the realistic label noise of CIFAR-N [Wei et al., 2022]. In fact, we find that using the DINOv2 [Oquab et al., 2023] extracted features in our FE + NI-ERM approach yields state-of-the-art results on the CIFAR-10N and CIFAR-100N benchmarks, across the noise levels, as shown in Table 1.

We emphasize that the only hyperparameters of our model are the hyperparameters of the linear classifier, which are tuned automatically using standard cross-validation on the noisy labels. This contrasts with the implementation of many methods on the CIFAR-N leaderboard (http://noisylabels.com/; if the link is inaccessible, see the May 23, 2024 archive captured by the Wayback Machine: https://web.archive.org/web/20240523101740/http://noisylabels.com/), where the hyperparameters are hard-coded. Furthermore, our approach does not rely on data augmentation. Additional experiments, detailed in Appendix A.4, include comparisons with the "linear probing, then fine-tuning" approach [Kumar et al., 2022], the application of different robust learning strategies on DINOv2 features, and results on synthetic instance-dependent label noise. Overall, the strong performance, the simplicity of the approach, and the absence of hyperparameters that must be hand-tuned highlight the effectiveness of FE + NI-ERM, and indicate the value of further investigation into its properties.
Table 1: Performance comparison with the CIFAR-N leaderboard (http://noisylabels.com/) in terms of testing accuracy. "Aggre", "Rand1", . . . , "Noisy" denote various types of human label noise. We compare with four methods that cover the top three performances for all noise categories: ProMix [Xiao et al., 2023], ILL [Chen et al., 2023], PLS [Albert et al., 2023] and DivideMix [Li et al., 2020]. Our approach, a noise-ignorant linear model trained on features extracted by the self-supervised foundation model DINOv2 [Oquab et al., 2023], achieves new state-of-the-art results. The best result in each column is shown in bold. We employed Python's sklearn logistic regression and cross-validation functions without data augmentation; the results are deterministic and directly reproducible.

| Methods | CIFAR-10N Aggre | CIFAR-10N Rand1 | CIFAR-10N Rand2 | CIFAR-10N Rand3 | CIFAR-10N Worst | CIFAR-100N Noisy |
|---|---|---|---|---|---|---|
| ProMix | 97.65 ± 0.19 | 97.39 ± 0.16 | 97.55 ± 0.12 | 97.52 ± 0.09 | **96.34 ± 0.23** | 73.79 ± 0.28 |
| ILL | 96.40 ± 0.03 | 96.06 ± 0.07 | 95.98 ± 0.12 | 96.10 ± 0.05 | 93.55 ± 0.14 | 68.07 ± 0.33 |
| PLS | 96.09 ± 0.09 | 95.86 ± 0.26 | 95.96 ± 0.16 | 96.10 ± 0.07 | 93.78 ± 0.30 | 73.25 ± 0.12 |
| DivideMix | 95.01 ± 0.71 | 95.16 ± 0.19 | 95.23 ± 0.07 | 95.21 ± 0.14 | 92.56 ± 0.42 | 71.13 ± 0.48 |
| FE + NI-ERM | **98.69 ± 0.00** | **98.80 ± 0.00** | **98.65 ± 0.00** | **98.67 ± 0.00** | 95.71 ± 0.00 | **83.17 ± 0.00** |

Figure 4: Empirical CDF of estimated RSS for CIFAR-10N, evaluated on test data. The empirical CDF P_X(M(x; η, η̃) ≤ κ) is upper bounded by the polynomial C_α κ^α + ϵ = 1.49 κ^{1.35} + ϵ, with ϵ = 0.04.

RSS for realistic human label error. To calculate the RSS under realistic human label error, we train two linear classifiers on DINOv2 features, under clean and noisy labels respectively, and use the models' predictions as estimates for the class probabilities η and η̃. Despite the high overall noise rate of the CIFAR-10N "Worst" labels, with P(Y ≠ Ỹ) = 40.21%, we conjecture that the region where there is no signal, X \ A_0, covers only a small portion of the probability mass (ϵ ≈ 4%). Furthermore, the cumulative distribution of the estimated RSS can be upper bounded by a polynomial C_α κ^α + ϵ, supporting the validity of the smooth relative signal margin condition introduced in Section 5.3.

8 Conclusions

This work presents a rigorous theory for learning under multi-class, instance-dependent label noise. We establish nearly matching upper and lower bounds for excess risk and identify precise conditions for classifier consistency. Our theory reveals the (nearly) minimax optimality of the Noise Ignorant Empirical Risk Minimizer (NI-ERM). To make this theory practical, we provide a simple modification leveraging a feature extractor with NI-ERM, demonstrating significant performance enhancements. A limitation of this work is that our methodology warrants more extensive experimental evaluation.

Acknowledgements

This work was supported in part by the National Science Foundation under award 2008074 and the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002. The authors thank Zixuan Huang and Yihao Xue for helpful discussions, and Raj Rao Nadakuditi for feedback during a course project in which some early experiments in this paper were conducted. We also thank the anonymous reviewers for their suggestions, especially the reviewer who provided Example 3.

References

Paul Albert, Eric Arazo, Tarun Krishna, Noel E O'Connor, and Kevin McGuinness. Is your noise correction noisy? PLS: Robustness to label noise with two stage detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 118-127, 2023.

Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2:343-370, 1988.
Yamini Bansal, Gal Kaplun, and Boaz Barak. For self-supervised learning, rationality implies generalization, provably. In International Conference on Learning Representations, 2021. Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of artificial intelligence research, 11:131 167, 1999. T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100 128, 2021. Timothy I Cannings, Yingying Fan, and Richard J Samworth. Classification with imperfect training labels. Biometrika, 107(2):311 330, 2020. Hao Chen, Ankit Shah, Jindong Wang, Ran Tao, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, and Bhiksha Raj. Imprecise label learning: A unified framework for learning with various imprecise label configurations. ar Xiv preprint ar Xiv:2305.12715, 2023. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597 1607. PMLR, 2020. Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, and Yang Liu. Learning with instancedependent label noise: A sample sieve approach. In International Conference on Learning Representations, 2021. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248 255. IEEE, 2009. Luc Devroye, László Györfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, 1996. Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=6Tm1mposlr M. Aritra Ghosh and Himanshu Kumar. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, 2017. Aritra Ghosh and Andrew Lan. Contrastive learning improves model robustness under label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2703 2708, 2021. Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93 107, 2015. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems, 31, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. Hyungki Im and Paul Grigas. Binary classification with instance and label dependent label noise. ar Xiv preprint ar Xiv:2306.03402, 2023. Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022. Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semisupervised learning. 
In International Conference on Learning Representations, 2020. Jiashuo Liu, Tianyu Wang, Peng Cui, and Hongseok Namkoong. On the need for a language describing distribution shifts: Illustrations on tabular datasets. Advances in Neural Information Processing Systems, 36, 2024. Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda. Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems, 33:20331 20342, 2020. Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by overparameterization. In International Conference on Machine Learning, pages 14153 14172. PMLR, 2022. Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447 461, 2015. Yang Liu and Hongyi Guo. Peer loss functions: Learning from noisy labels without knowing noise rates. In International conference on machine learning, pages 6226 6236. PMLR, 2020. Jianhao Ma and Salar Fattahi. Blessing of depth in linear regression: Deeper models have flatter landscape around the true solution. Advances in Neural Information Processing Systems, 35: 34334 34346, 2022. Subha Maity, Diptavo Dutta, Jonathan Terhorst, Yuekai Sun, and Moulinath Banerjee. A linear adjustment based approach to posterior drift in transfer learning. Biometrika, 2023. URL https: //doi.org/10.1093/biomet/asad029. Neil Mallinar, James Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. Advances in Neural Information Processing Systems, 35:1182 1195, 2022. Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34 (5):2326 2366, 2006. Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In International conference on machine learning, pages 125 134. PMLR, 2015. Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise. Machine Learning, 107:1561 1595, 2018. Balas K Natarajan. On learning sets and functions. Machine Learning, 4:67 97, 1989. Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013. Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373 1411, 2021. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023. Diane Oyen, Michal Kucer, Nicolas Hengartner, and Har Simrat Singh. Robustness to label noise depends on the shape of the noise distribution. Advances in Neural Information Processing Systems, 35:35645 35656, 2022. Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1944 1952, 2017. Clayton Scott. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pages 738 761. 
PMLR, 2019. Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference on learning theory, pages 489 511. PMLR, 2013. Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135 166, 2004. Brendan Van Rooyen and Robert C Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1 50, 2018. V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264 280, 1971. doi: 10.1137/1116025. URL https://doi.org/10.1137/1116025. Jialu Wang, Eric Xin Wang, and Yang Liu. Estimating instance-dependent label-noise transition matrix using a deep neural network. In International Conference on Machine Learning, 2022. Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning with noisy labels revisited: A study using real-world human annotations. In International Conference on Learning Representations, 2022. Xiaobo Xia, Tongliang Liu, Bo Han, Nannan Wang, Mingming Gong, Haifeng Liu, Gang Niu, Dacheng Tao, and Masashi Sugiyama. Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems, 33:7597 7610, 2020. Ruixuan Xiao, Yiwen Dong, Haobo Wang, Lei Feng, Runze Wu, Gang Chen, and Junbo Zhao. Promix: Combating label noise via maximizing clean sample utility. In Edith Elkind, editor, Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI23, pages 4442 4450. International Joint Conferences on Artificial Intelligence Organization, 8 2023. doi: 10.24963/ijcai.2023/494. URL https://doi.org/10.24963/ijcai.2023/494. Main Track. Yihao Xue, Kyle Whitecross, and Baharan Mirzasoleiman. Investigating why contrastive learning benefits robustness against label noise. In International Conference on Machine Learning, pages 24851 24871. PMLR, 2022. Shuo Yang, Songhua Wu, Erkun Yang, Bo Han, Yang Liu, Min Xu, Gang Niu, and Tongliang Liu. A parametrical model for instance-dependent label noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021a. Jianxin Zhang, Yutong Wang, and Clayton Scott. Learning from label proportions by learning with label noise. Advances in Neural Information Processing Systems, 35:26933 26942, 2022. Mingyuan Zhang, Jane Lee, and Shivani Agarwal. Learning from noisy labels with no change to the training process. In International Conference on Machine Learning, pages 12468 12478. PMLR, 2021b. Evgenii Zheltonozhskii, Chaim Baskin, Avi Mendelson, Alex M Bronstein, and Or Litany. Contrast to divide: Self-supervised pre-training for learning with noisy labels. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1657 1667, 2022. Zhaowei Zhu, Tongliang Liu, and Yang Liu. A second-order approach to learning with instancedependent label noise. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10113 10123, 2021. 
A Appendix / supplemental material A.1 Equivalence of noise transition and domain adaptation perspectives The noise transition perspective models the joint distribution of (X, Y, e Y ), which can be characterized as: PX,Y,e Y = PX PY |X | {z } η Pe Y |Y,X | {z } E Thus, by specifying PX, η, and E, we obtain a triple (PX, η, eη) with eη(x) = E(x) η(x). In contrast, the domain adaptation perspective views label noise problems directly as a triple (PX, η, eη), bypassing the explicit modeling of the noise transition matrix E. If no assumptions are made about the form of E, the domain adaptation view remains fully expressive. Given a triple (PX, η, eη), we can always define a noise transition matrix as: E(x) = 1eη , where 1 = [1 . . . 1] . We can verify that E is row-stochastic, and eη = E(x) η = (eη1 )η = eη(1 η) = eη. Therefore, these two perspectives are equivalent. A.2.1 Proof of Proposition 1 Proposition A0(η, eη) = x X : arg max eη(x) arg max η(x) . Proof. Notice that M(x; η, eη) = 0 arg max eη(x) arg max η(x). This is because M(x; η, eη) = 0 when the numerator is zero and the denominator is non-zero, which happens when arg max eη(x) arg max η(x). An equivalent statement of this is M(x; η, eη) > 0 arg max eη(x) arg max η(x). A.2.2 Proof of Proposition 2 Proposition In the binary setting, for κ 0, M(x; η, eη) = max eη(x) 1 2 , 0 , and Aκ(η, eη) = x X : eη(x) 1 Proof. In a brute-force way, we can examine the nine cases where η(x), eη(x) is greater, equal, or smaller than 1/2. If eη(x) > 1 2, η(x) > 1 M(x; η, eη) = min j Y maxi[eη(x)]i [eη(x)]j maxi[η(x)]i [η(x)]j = eη(x) (1 eη(x)) η(x) (1 η(x)) = max eη(x) 1 If eη(x) < 1 2, η(x) < 1 2, the same argument holds. If eη(x) > 1 2, η(x) < 1 2 or eη(x) < 1 2, η(x) > 1 2, take j = arg max eη(x), we have M(x; η, eη) = min j Y maxi[eη(x)]i [eη(x)]j maxi[η(x)]i [η(x)]j = 0 = max eη(x) 1 If eη(x) = 1 2, η(x) < 1 2 or eη(x) = 1 2, η(x) < 1 2, take j = arg max η(x), we have M(x; η, eη) = min j Y maxi[eη(x)]i [eη(x)]j maxi[η(x)]i [η(x)]j = 0 = max eη(x) 1 If η(x) = 1 M(x; η, eη) = min j Y maxi[eη(x)]i [eη(x)]j maxi[η(x)]i [η(x)]j = maxi[eη(x)]i [eη(x)]j = max eη(x) 1 Note that it makes sense for RSS to be + when η(x) = 1 2, because in this case, clean excess risk R(f) R(f ) is 0 at point x for any classifier. Therefore, we can conclude that M(x; η, eη) = max eη(x) 1 by definition, we have, for κ 0 Aκ(η, eη) = {x X : M(x; η, eη) > κ} = x X : eη(x) 1 A.2.3 Proof of lower bound: Theorem 1 Now we provide a more formal statement of the minimax lower bound and its proof. We begin with the scenario where the noisy distribution PX e Y has zero Bayes risk as an introductory example. The proof for the general case follows a similar strategy but involves more complex bounding techniques. We recommend that interested readers first review the proof of the zero-error version to build a solid understanding before tackling the general case. Now consider a more restricted subset of Π(ϵ, κ): Π(ϵ, κ, V, 0) := n (PX, η, eη) : PX Aκ (η, eη) 1 ϵ, PX supported on V + 1 points, e R = 0 o . Theorem (Minimax Lower Bound: when e R = 0) Let ϵ [0, 1], κ > 0, V > 1. For any learning rule ˆf based upon Zn = (Xi, e Yi) n i=1, and n > max(V 1, 2), sup (PX,η,eη) Π(ϵ,κ) EZn h R ˆf R(f ) i sup (PX,η,eη) Π(ϵ,κ,V,0) EZn h R ˆf R(f ) i κ (V 1)(1 ϵ) We will construct a triple (PX, η, eη) that is parameterized by j, b := [b1 b2 b V 1] , and δ. First, we define PX. Pick any V + 1 distinct points x0, x1, . . . , x V , ϵ x = x0 (1 ϵ) 1 n x = x1, . . . , x V 1 (1 ϵ) 1 V 1 n x = x V . 
, this is where we need the condition that n > V 1. Then, define the clean and noisy class posteriors: If x = x0, then η(x) = ej, eη(x) = e1, j {1, 2, . . . K} (2) If x = xt, 1 t V 1, then η(x) = 1 2 + 1 2(κ+δ) ( 1)bt+1 1 2 1 2(κ+δ) ( 1)bt+1 , eη(x) = ebt, bt {1, 2}, δ > 0, If x = x V , then η(x) = 1 2 + 1 2(κ+δ) 1 2 1 2(κ+δ) 0 ... 0 , eη(x) = e1, (4) where ei denotes the one-hot vector whose i-th element is one. The triple (PX, η, eη) is thus parameterized by j, b := [b1 b2 b V 1] , and δ. This construction ensures (PX, η, eη) Π(ϵ, κ, V, 0). In particular, Aκ {x1, x2, . . . , x V }, PX(Aκ) 1 ϵ, X \ Aκ {x0}, PX(X \ Aκ) ϵ, and e R = 0 because eη(x) is one-hot for all x. For any classifier f, by definition, its risk equals R (f) = EX,Y 1f(X) =Y = EXEY |X[1f(X) =Y ] = EXEY |X[1 1f(X)=Y ] = EX 1 [η(X)]f(X) 1 [η(x)]f(x) d PX(x). Therefore, the Bayes risk and excess risk equal R(f ) = inf f EX,Y 1f(X) =Y X (1 max η(x)) d PX(x), R(f) R(f ) = Z max η(x) [η(x)]f(x) d PX(x). Under our construction of PX, R(f) can be decomposed into two parts 1 [η(x)]f(x) d PX(x) | {z } :=R0(f) {x1,...,x V } 1 [η(x)]f(x) d PX(x) | {z } :=RV (f) and so can the excess risk R(f) R(f ) = R0 (f) R0(f ) + RV (f) RV (f ) max η(x) [η(x)]f(x) d PX(x) {x1,...,x V } max η(x) [η(x)]f(x) d PX(x). Recall that in our construction, (PX, η, eη) is parameterized by j, b, and δ. Therefore sup (PX,η,eη) Π(ϵ,κ,V,0) EZn h R ˆf R(f ) i sup j,b,δ EZn h R ˆf R(f ) i = sup j,b,δ n EZn h R0 ˆf R0(f ) i + EZn h RV ˆf RV (f ) i o = sup j EZn h R0 ˆf R0(f ) i + sup b,δ EZn h RV ˆf RV (f ) i where the last equality holds because region {x0} only depends on j, while region {x1, . . . , x V } only depends on b, δ. In the remaining part of the proof, we will examine sup j EZn h R0 ˆf R0(f ) i (5) sup b,δ EZn h RV ˆf RV (f ) i (6) separately. Let s start with the first term (5), which reflects the excess risk over the low signal strength region {x0}. Since η is one-hot on {x0}, its Bayes risk over that is zero sup j EZn h R0 ˆf R0(f ) i = sup j EZn h R0 ˆf i = sup j EZn {x0} 1 ˆ f(x) =jd PX(x) To deal with supj, we use a technique called the probabilistic method [Devroye et al., 1996]: replace j with a random variable J Uniform{1, 2, . . . , K}: {x0} 1 ˆ f(x) =jd PX(x) {x0} 1 ˆ f(x) =Jd PX(x) {x0} 1 ˆ f(x) =Jd PX(x) {x0} 1 ˆ f(x) =Jd PX(x) Again, notice that J is an independent draw. Even if the point x0 is observed in Zn, the associated noisy label e Y = 1 does not give any information about the clean label Y = J. Thus {x0} 1 ˆ f(x) =Jd PX(x) {x0} 1 ˆ f(x) =Jd PX(x) {x0} EJ h 1 ˆ f(x) =J i d PX(x) Now we have the minimax lower bound for the first part (5): sup j EZn h R{x0} ˆf R{x0}(f ) i 1 1 For the second part (6), which is over {x1, . . . , x V }, due to the relative signal strength condition, and from our explicit construction in Eqn. (3) and (4), the excess risks w.r.t. true and noisy distribution are related by RV (f) RV (f ) = Z {x1,...,x V } max η(x) [η(x)]f(x) d PX(x) {x1,...,x V } max eη(x) [eη(x)]f(x) d PX(x) by construction of η, eη e RV (f) e RV ( f ) , where e RV (f) := R {x1,...,x V } 1 [eη(x)]f(x) d PX(x). Also note that f (x) = f (x) for x {x1, . . . , x V }, which is a result of our construction of η, eη. sup b,δ EZn h RV ˆf RV (f ) i = sup b,δ EZn 1 κ + δ e RV (f) e RV ( f ) . This allows us to reduce the label noise problem to a standard learning problem: we have an iid sample Zn from PX e Y and consider the risk evaluated on the same distribution PX e Y . The remainder of the proof is similar to the proof of Devroye et al. 
[1996, Theorem 14.1]. Notice that by our construction, e Y is a deterministic function of X. To be specific, e Y = f (X), where 1 x = x0, bt x = xt, 1 t V 1 1 x = x V is the noisy Bayes classifier. We use the shorthand fb := f to denote that the noisy Bayes classifier depends on b. Since the noisy Bayes risk is zero, sup b,δ EZn 1 κ + δ e RV ( ˆf) e RV ( f ) = sup b,δ 1 κ + δ EZn h e RV ( ˆf) i . Again, use the probabilistic method, replace b with B Uniform{1, 2}V 1, 1 κ + δ EZn h e RV ( ˆf) i sup δ 1 κ + δ EB,Zn h e RV ( ˆf) i 1 κ + δ EZn {x1,...,x V } 1 ˆ f(x) =f B(x)d PX(x) Since we have B Uniform{1, 2}V 1 and also Zn|B P n X e Y , then by Bayes rule (or eye-balling, since eη is one-hot), we get the posterior distribution of B|Zn, to be specific: x {x1, , x V }, If x = Xi, i {1, 2, . . . , n} , then P f B(x) = e Yi|Zn = 1, P f B(x) = e Yi|Zn = 0 If x = x V , then P (f B(x) = 1|Zn) = 1, P (f B(x) = 2|Zn) = 0 If x / {X1, . . . , Xn, x V }, then P (f B(x) = 1|Zn) = 1 2, P (f B(x) = 2|Zn) = 1 where we overload the notation P to denote conditional probability of B|Zn. Then the optimal decision rule for predicting B based on sample Zn is: e Yi x = Xi, i {1, 2, . . . , n} 1 x = x V random guess from {1, 2} x = X1, . . . , x = Xn, x = x V . Therefore, the error comes from the probability of X {x1, . . . , x V } not being one of the observed Xi: for any ˆf, EB,Zn h e RV ( ˆf) i = EZn {x1,...,x V } 1 ˆ f(x) =f B(x)d PX(x) P (X {x1, . . . , x V }, g(X; Zn) = f B(X)) error of ˆf error of g P (X = X1, . . . , X = Xn, X = x V , X {x1, . . . , x V }) t=1 P (X = X1, . . . , X = Xn, X = x V , X = xt) t=1 P (X1 = xt, . . . , Xn = xt, x V = xt, X = xt) replace all X with xt t=1 P (X1 = xt, . . . , Xn = xt, X = xt) t=1 P (X1 = xt, . . . , Xn = xt|X = xt) P (X = xt) t=1 (1 P (X = xt))n P (X = xt) 2(V 1) 1 1 ϵ = (V 1)(1 ϵ) = (V 1)(1 ϵ) 1+ϵ e 1+ϵ 1 1 ϵ n 1 ϵ e 1+ϵ 2 e 1 ϵ [0, 1] 4 = (V 1)(1 ϵ) 8en take n > 2. Now we get the minimax risk for the second part (6) sup b,δ EZn h RAκ ˆf RAκ(f ) i sup δ 1 κ + δ (V 1)(1 ϵ) κ (V 1)(1 ϵ) 8en let δ 0 Combine the two parts together, we get the final result, for n > max(V 1, 2) sup (PX,η,eη) Π(ϵ,κ,V,0) EZn h R ˆf R(f ) i K 1 κ (V 1)(1 ϵ) As for the general version of the lower bound, now consider the set of triples: Π(ϵ, κ, V, L) := n (PX, η, eη) :PX Aκ (η, eη) 1 ϵ, PX supported on V + 1 points, e RAκ f PX Aκ (η, eη) L o , where e RC(f) = R C 1 [eη(x)]f(x) d PX(x). Theorem (Minimax Lower Bound (General Version)) Let ϵ [0, 1], κ > 0, V > 1, L (0, 1/2). For any learning rule ˆf based upon Zn = (Xi, e Yi) n i=1, for n V 1 2L max n 16, 1 (1 2L)2 o sup (PX,η,eη) Π(ϵ,κ) EZn h R ˆf R(f ) i sup (PX,η,eη) Π(ϵ,κ,V,L) EZn h R ˆf R(f ) i Now we construct a triple (PX, η, eη) that is parameterized by j, b := [b1 b2 b V 1] , δ, c and p. First, we define PX. Pick any V + 1 distinct points x0, x1, . . . , x V , ϵ x = x0 (1 ϵ) p x = x1, . . . , x V 1 (1 ϵ) (1 (V 1)p) x = x V . This imposes the constraint (V 1)p 1, which will be satisfied in the end. Notice the difference compared to the previous zero-error proof: we place probability mass p, rather than 1/n, on x1, . . . , x V 1. As for the clean and noisy class probabilities, choose If x = x0, then η(x) = ej, eη(x) = e1, j {1, 2, . . . k} (7) If x = xt, 1 t V 1, then η(x) = 1 2 + c κ+δ ( 1)bt+1 1 2 c κ+δ ( 1)bt+1 1 2 + c ( 1)bt+1 1 2 c ( 1)bt+1 0 ... 0 bt {1, 2}, δ > 0, c 0, 1 If x = x V , then η(x) = 1 2 + 1 2(κ+δ) 1 2 1 2(κ+δ) 0 ... 0 , eη(x) = e1, (9) where ei denotes the one-hot vector whose i-th element is one. 
The construction for class posterior is also similar to the previous proof, except that for x = xt, t {1, . . . , V 1}, eη is no longer a one-hot vector, rather has class probability separated by 2c: [eη(x)]1 [eη(x)]2 = 2c. Therefore, the triple (PX, η, eη) can be parameterized by j, b := [b1 b2 b V 1] , δ, c and p. Again, this construction ensures (PX, η, eη) Π(ϵ, κ), to be specific: Aκ {x1, x2, . . . , x V }, PX(Aκ) 1 ϵ, X \ Aκ {x0}, PX(X \ Aκ) ϵ. For any classifier f, its risk can be decomposed into two parts 1 [η(x)]f(x) d PX(x) | {z } :=R0(f) {x1,...,x V } 1 [η(x)]f(x) d PX(x) | {z } :=RV (f) as can its excess risk R (f) R(f ) = R0 (f) R0(f ) + RV f RV (f ) . In our construction, (PX, η, eη) is parameterized by j, b := [b1 b2 b V 1] , δ, c and p, therefore sup (PX,η,eη) Π(ϵ,κ,V,L) EZn h R ˆf R(f ) i sup j EZn h R0 ˆf R0(f ) i (10) + sup b,δ,c,p EZn 1 κ + δ e RV (f) e RV ( f ) . (11) Note that we have used the fact that RV (f) RV (f ) = 1 κ + δ e RV (f) e RV ( f ) , where e RV (f) := R {x1,...,x V } 1 [eη(x)]f(x) d PX(x). The first part (10) is exactly the same as in the zero-error proof, and we have sup j EZn h R0 ˆf R0(f ) i 1 1 From this point forward, the procedure is similar to the proof of Devroye et al. [1996, Theorem 14.5]. For the second part (11), the noisy Bayes classifier is still j x = x0, bt x = xt, 1 t V 1 x = x V . We also use the shorthand fb := f to denote that the noisy Bayes classifier depends on b. Now the noisy Bayes risk is no longer zero. In fact e RV ( f ) = Z {x1,...,x V } 1 [eη(x)]f(x) d PX(x) = (V 1)(1 ϵ)p 1 What s more, PX Aκ (η, eη) e RV ( f ) PX {x1, . . . , x V } = (V 1)p 1 where the inequality holds from e RAκ( f ) = e RV ( f ) (because eη is one-hot at point x0) and PX Aκ (η, eη) PX {x1, . . . , x V } . Notice that in order to ensure that our construction (PX, η, eη) Π(ϵ, κ, V, L), by definition PX Aκ (η, eη) L, Due to the upper bound of (12), it suffices to require that 2 c = L, (13) and this ensures that (PX, η, eη) Π(ϵ, κ, V, L) upon recalling that (PX, η, eη) Π(ϵ, κ), and that PX is supported on V + 1 points. It should be noted that since (V 1)p 1 is required, and since c > 0, we have L < 1 1/2. This is the origin of our condition L < 1/2 in the statement of the theorem. Naturally, the statement can be adjusted to min(L, 1/2) instead. In any case, we are left with two nontrivial constraint on our parameters (p, c): (13) and (V 1)p 1, along with the boundary consraints p [0, 1] and c [0, 1/2]. For fixed b, plugging in the definition of eη, the excess risk over region {x1, . . . , x V } becomes e RV ( ˆf) e RV ( f ) = Z {x1,...,x V } 2c1 ˆ f(x) =fb(x)d PX(x) t=1 (1 ϵ)p1 ˆ f(xt) =fb(xt), where the inequality follows from the fact that we ignore the risk on point x V . Using the probabilistic method, replace b with B Uniform{1, 2}V 1, sup b,δ,c,p EZn 1 κ + δ e RV ( ˆf) e RV ( f ) sup δ,c,p EB,Zn 1 κ + δ e RV ( ˆf) e RV ( f ) = sup δ,c,p 1 κ + δ EZn h EB|Zn h e RV ( ˆf) e RV ( f ) ii Now, we need to calculate B|Zn, which can be calculated using Bayes rule because we have B Uniform{1, 2}V 1 and also Zn|B P n X e Y . To be specific, for any x {x0, x1, . . . , x V 1}, assume point xt is observed k times in training sample Zn, P (f B(x) = 1|Zn) = ( 1 2 x = X1, . . . , x = Xn, x = x V P (Bt = 1|Yt1, . . . , Ytk) x = xt = Xt1 = = Xtk, 1 t V 1, where Bt denotes the t-th element of vector B (that associates with xt). Next we compute P (Bt = 1|Yt1 = y1, . . . , Ytk = yk) for y1, . . . , yk {1, 2}. 
Denote the numbers of ones and twos by k1 = |{j k : yj = 1}| and k2 = |{j k : yj = 2}|. Using Bayes rule, we get P (Bt = 1|Yt1, . . . , Ytk) = P (Bt = 1 Yt1, . . . , Ytk) P (Yt1, . . . , Ytk) = P (Yt1, . . . , Ytk|Bt = 1) P (Bt = 1) P2 i=1 P (Yt1, . . . , Ytk|Bt = i) P (Bt = i) = (1/2 + c)k1(1/2 c)k2(1/2) (1/2 + c)k1(1/2 c)k2(1/2) + (1/2 + c)k2(1/2 c)k1(1/2). After some calculation, following the proof of Devroye et al. [1996, Theorem 14.5], we get sup b,δ,c,p EZn 1 κ + δ e RAκ(f) e RAκ( f ) 1 κ + δ c(V 1)(1 ϵ)pe 8n(1 ϵ)pc2 n(1 ϵ)p 1 2c κ sup c,p c(V 1)pe 8npc2 1 2c ϵ 0, take δ 0 κ sup c,p c L 1/2 ce 8npc2 1 2c , (13) (14) where the supremum is over (p, c) [0, 1] [0, 1/2] such that (V 1)p 1, and (V 1)p(1/2 c) = L. Now, suppose n is so large that n (V 1) 8L(1/2 L)2 L 1 and further that r 8n L , and p = L (V 1)(1/2 c). By our choice of c and the first condition on n above, we can conclude that L (1/2 c), and therefore, (V 1)p = L 1/2 c 1, meaning that both the constraints required on (p, c) are met by the above choice. As a consequence of this choice of c, p, we observe that npc2 = n L (V 1)(1/2 c) c2 = n L (V 1)(1/2 c) (V 1) 8n L = 1 4 8c 1 Since c 1/8 further implies that 1 1 2c 4 3, this implies that Thus, instantiating the bound (14), we conclude that sup b,δ,c,p EZn 1 κ + δ e RAκ(f) e RAκ( f ) 1 ϵ 8n L L 1/2 c e 7 Putting the two parts together sup (PX,η,eη) Π(ϵ,κ) EZn h R ˆf R(f ) i K 1 2L max n 16, 1 (1 2L)2 o . A.2.4 Proof of upper bound: Lemma 1 Lemma (Oracle Inequality under Feature-dependent Label Noise) For any (PX, η, eη) and any classifier f, we have R(f) R(f ) inf κ>0 PX X \ Aκ (η, eη) + 1 e R(f) e R f . Proof. For any κ 0, the input space X can be divided into two regions: X \ Aκ and Aκ. For any f, its risk is R (f) = EX,Y 1f(X) =Y = EXEY |X[1f(X) =Y ] = EXEY |X[1 1f(X)=Y ] = EX 1 [η(X)]f(X) 1 [η(x)]f(x) d PX(x). Therefore, its excess risk is R(f) R(f ) = Z max η(x) [η(x)]f(x) d PX(x) max η(x) [η(x)]f(x) d PX(x) max η(x) [η(x)]f(x) d PX(x) | {z } b Now examine the two terms separately, X\Aκ 1 d PX(x) = PX X \ Aκ (η, eη) , max eη(x) [eη(x)]f(x) d PX(x) by definition of relative signal strength max eη(x) [eη(x)]f(x) d PX(x) e R(f) e R( f ) by definition of e R. Since this works for any κ > 0, we then have R(f) R(f ) inf κ>0 PX X \ Aκ (η, eη) + 1 e R(f) e R f . A.2.5 Proof of upper bound: Theorem 2 To set the stage for the rate of convergence proof, we first introduce the concept of shattering in the multiclass setting and the Natarajan dimension [Natarajan, 1989], which serves as a multiclass counterpart to the VC dimension [Vapnik and Chervonenkis, 1971]. Definition 4 (Multiclass Shattering) Let H be a class of functions from X to Y = {1, 2, . . . , K}. For any set containing n distinct elements Cn = {x1, . . . , xn} X, denote HCn = {(h(x1), . . . , h(xn)) : h H} , and therefore |HCn| is the number of distinct vectors of length n that can be realized by functions in H. The nth shatter coefficient is defined as S(H, n) := max Cn |HCn| . We say that a set Cn is shattered by H if there exists f, g : Cn Y such that for every x Cn, f(x) = g(x), and HC {f(x1), g(x1)} {f(x2), g(x2)} {f(xn), g(xn)} If Y = {1, 2}, this definition reduces to the binary notion of shattering which says all labeling of points can be realized by some function in the hypothesis class H, i.e., HC = {1, 2}|C|. Note that multiclass shattering does not mean being able to realize all K possible labels for each point x C. 
Instead, multiclass shattering is more like embed the binary cube into multiclass , where every x C is allowed to pick from two of the K labels. Definition 5 (Natarajan Dimension) The Natarajan dimension of H, denoted Ndim(H), is the maximal size of a shattered set C X. Theorem (Excess Risk Upper Bound of NI-ERM) Let ϵ [0, 1], κ (0, + ). Consider any (PX, η, eη) Π(ϵ, κ), assume function class F has Natarajan dimension V , and the noisy Bayes classifier f belongs to F. Let ˆf F be the ERM trained on Zn = (Xi, e Yi) n i=1, then EZn h R ˆf R(f ) i ϵ + 1 V log n + 2V log k + 4 up to log factor. Proof. Following directly from Lemma 1, with (PX, η, eη) Π(ϵ, κ), we already have R(f) R(f ) PX X \ Aκ (η, eη) + 1 e R(f) e R f e R(f) e R f . Now replace f with NI-ERM ˆf. To bound the expected excess risk we employ a multiclass VC-style inequality. EZn h e R ˆf e R f i 16 log(8e S(H, n)) The binary version of this lemma is Corollary 12.1 in Devroye et al. [1996]. We prove the multiclass version below in Section A.2.6. Next, we bound the multiclass shattering coefficient with Natarajan dimension, using the following lemma, which can be viewed as a multiclass version of Sauer s lemma. Lemma 3 (Natarajan [1989]) Let C and Y be two finite sets and let H be a set of functions from C to Y. Then |H| |C|Ndim(H) |Y|2Ndim(H) . Letting V denote Ndim(H), we have that S(H, n) n V K2V , and therefore Lemma 2 can be upper bounded by EZn h e R ˆf e R f i 16 log (8e(n)V K2V ) log 8e + log (n V ) + log (K2V ) V log n + 2V log K + 4 Putting things together, EZn h R ˆf R(f ) i ϵ + 1 V log n + 2V log K + 4 A.2.6 Proof of Lemma 2 Theorem 6 Consider any set of multiclass classifiers F. Let (X1, Y1), . . . , (Xn, Yn) be iid draws from PXY . For any n, and any ϵ > 0, sup f F |Rn(f) R(f)| > ϵ 8S(F, n)e nϵ2/32 where the probability is with respect to the draw of the data. Proof. Apply Theorem 12.5 from Devroye et al. [1996], with the following identifications. In what follows, the left-hand side of each equation is a notation from Devroye et al. [1996], and the right-hand side is our notation. ν = PXY Z = (X, Y ) Zi = (Xi, Yi) A = {Af | f F}, where Af := {(x, y) | f(x) = y} With these identifications, we have ν(Af) = 1 R(f) i 1{Zi Af } = 1 i 1{f(Xi)=Yi} = 1 Rn(f) By Theorem 12.5 we conclude sup f F |Rn(f) R(f)| > ϵ 8s(A, n)e nϵ2/32, where s(A, n) (note the lowercase s ) is defined to be max z1,...,zn NA(z1, . . . , zn) where the max is over points z1, . . . , zn, and NA(z1, . . . , zn) is the number of distinct subsets of the form Af {z1, . . . , zn} as f ranges over F. To conclude the proof, it suffices to show that s(A, n) S(F, n), where the latter expression is the multiclass shatter coefficient defined above. We show this as follows. Consider fixed pairs zi = (xi, yi), i = 1, . . . , n. Supposed that there are N distinct subsets of the form Af {z1, . . . , zn}, and let f1, . . . , f N be the classifiers in F that realize these distinct subsets. Consider the map that sends fi to the vector of its values at x1, . . . , xn: fi 7 (fi(x1), . . . , fi(xn)) Yn. We will show that this map is injective, from which the claim follows. To see injectivity, consider classifiers fi and fj, where i = j. Since fi and fj yield different subsets, it means there is some pair (xk, yk) such that one of fi and fj classifies the pair correctly, while the other does not. This implies that fi(xk) = fj(xk), and therefore (fi(x1), . . . , fi(xn)) = (fj(x1), . . . , fj(xn)). This concludes the proof. 
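To make Definition 4 and the multiclass shatter coefficient S(F, n) used above concrete, the following brute-force check tests whether a finite hypothesis class Natarajan-shatters a given set of points. This is an illustration of ours, not part of the paper or its released code; the toy hypothesis class and all names are hypothetical, and the search is exponential in |C|, so it is only meant for small examples.

```python
from itertools import product

def restriction(H, C):
    """All distinct label vectors realized on C by hypotheses in H (the set H_C)."""
    return {tuple(h(x) for x in C) for h in H}

def natarajan_shattered(H, C):
    """Check Definition 4: C is shattered if there exist labelings f, g that disagree
    at every point of C and every mixture of f- and g-labels is realized in H_C."""
    HC = restriction(H, C)
    labels_per_point = [sorted({h(x) for h in H}) for x in C]
    for f in product(*labels_per_point):
        for g in product(*labels_per_point):
            if any(fi == gi for fi, gi in zip(f, g)):
                continue  # f and g must disagree at every point of C
            if all(v in HC for v in product(*[(fi, gi) for fi, gi in zip(f, g)])):
                return True
    return False

# Toy example with K = 3 labels on C = {0, 1}; each hypothesis maps a point index to a label.
H = [lambda x, a=a, b=b: (a, b)[x] for (a, b) in [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]]
C = [0, 1]
print(natarajan_shattered(H, C))  # True: witnessed by f = (1, 1), g = (2, 2)
print(len(restriction(H, C)))     # |H_C| = 5, within the |C|^d * K^(2d) bound of Lemma 3
```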
Now, Lemma 2 follows from the above theorem (stated in terms of the noisy data/distribution/risk) in precisely the same way that Corollary 12.1 in Devroye et al. [1996] follows from Theorem 12.6 in the same book. A.2.7 Proof of upper bound: Theorem 3 Theorem (Excess Risk Upper Bound of NI-ERM under smooth relative margin condition) Let ϵ [0, 1], α > 0, Cα > 0. Consider any (PX, η, eη) Π (ϵ, α, Cα), assume function class F has Natarajan dimension V , and the noisy Bayes classifier f belongs to F. Let ˆf F be the ERM trained on Zn = (Xi, e Yi) n i=1. Then EZn h R ˆf R (f ) i ϵ + inf κ>0 = ϵ + O n α/(2+2α) . Proof. Again, using Lemma 1, and Theorem 2, we can conclude the following, where C is some large enough constant. EZn[R( ˆf) R(f )] PX X \ Aκ (η, eη) + 1 e R(f) e R f PX X \ Aκ (η, eη) + 1 CV log(n K) Now, by definition of Π (ϵ, α, Cα), it holds that κ > 0, PX(M(x; η, eη) κ) Cακα + ϵ. Thus, we can further conclude that EZn[R( ˆf) R(f )] inf κ>0 ϵ + Cακα + 1 CV log(n K) The final statement now comes from optimizing the above bound, which is attained by taking the derivative w.r.t. κ and set to zero, we have κ = (αCα) 1p CV log(n K)/n 1/(α+1) . This yields the bound EZn[R( ˆf) R(f )] ϵ + O p V log(n K)/n α/(α+1) = ϵ + O(n α/(2α+2)). A.2.8 Proof of immunity results: Theorem 4 and 5 Here, we state the immunity theorems in an equivalent but different way, so that the proofs are easier to follow. Theorem (Immunity for one-hot vector) Denote B = {e1, . . . , e K} to be the set of one-hot vectors. η(x) B, arg max η(x) = arg max E(x) η(x) Diagonal elements of E(x) maximizes its row. Proof. Let η(x) = ey for some y, then eη(x) = E η(x) = P e Y = 1 | Y = y, X = x P e Y = 2 | Y = y, X = x ... P e Y = K | Y = y, X = x arg max eη(x) = arg max [E(x)] y,: = arg max η(x) = y for any choice of y, it is equivalent to say that the diagonal elements of E(x) maximizes its row. Theorem (Universal Immunity) Consider K-class classification, η(x), arg max η(x) = arg max E(x) η(x) e(x) s.t. x, e(x) h 0, 1 1 (K 1)e(x) e(x) e(x) e(x) 1 (K 1)e(x) e(x) ... ... ... ... e(x) e(x) 1 (K 1)e(x) =: Plug E(x) into the expression eη(x) = E η(x) 1 (K 1)e(x) e(x) e(x) e(x) 1 (K 1)e(x) e(x) ... ... ... ... e(x) e(x) 1 (K 1)e(x) 1 0 0 0 1 0 ... ... ... ... 0 0 1 = (1 K e(x)) η(x) + constant vector When e(x) [0, 1 K ), we have η(x), arg max eη(x) = arg max η(x). = : Denote T (x) := E(x) , then t11(x) t12(x) t1K(x) t21(x) t22(x) t2K(x) ... ... ... ... t K1(x) t K2(x) t KK(x) has each column sum to 1. Let us consider several choices of η(x), which pose conditions on matrix T (x). 1) If η(x) = 1 eη(x) = T (x)η(x) = 1 t11(x) + t12(x) + + t1K(x) t21(x) + t22(x) + + t2K(x) ... t K1(x) + t K2(x) + + t KK(x) arg max eη(x) = arg max η(x) = {1, 2, . . . , k} , all elements of eη(x) must be equal, i.e., each row of T (x) should sum to the same value. The sum of all elements in T (x) is K, since all column sum to 1. Therefore, each row of T (x) also sum to 1. 2) If η(x) = h 1 K 1 1 K 1 1 K 1 0 i , then eη(x) = T (x)η(x) = 1 K 1 t11(x) + t12(x) + + t1(K 1)(x) t21(x) + t22(x) + + t2(K 1)(x) ... t(K 1)1(x) + t(K 1)2(x) + + t(K 1)(K 1)(x) t K1(x) + t K2(x) + + t K(K 1)(x) 1 t1K(x) 1 t2K(x) ... 1 t(K 1)k(x) 1 t KK(x) . each row of T (x) sum to 1 arg max eη(x) = arg max η(x) = {1, 2, . . . , K 1} , the first K 1 elements of eη(x) must be equal (and larger than t KK(x)), then we have t1K(x) = t2K(x) = = t(K 1)K(x). In other words, all elements of the K-th column of T (x) are the same (except for the (K, K)-th element). 
Similarly, consider η(x) to be the vector that contains 0 in the i-th position and 1/(K−1) in every other position; the resulting condition on T(x) is that all elements of the i-th column are equal, except the i-th diagonal entry. Written explicitly,
\[
T(x) =
\begin{pmatrix}
t_{11}(x) & t_{12}(x) & t_{13}(x) & \cdots & t_{1K}(x) \\
t_{21}(x) & t_{22}(x) & t_{13}(x) & \cdots & t_{1K}(x) \\
t_{21}(x) & t_{12}(x) & t_{33}(x) & \cdots & t_{1K}(x) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{21}(x) & t_{12}(x) & t_{13}(x) & \cdots & t_{KK}(x)
\end{pmatrix}.
\]
Since each row and each column of T(x) sums to 1, we have
\[
\begin{aligned}
0 + t_{12}(x) + t_{13}(x) + \cdots + t_{1K}(x) &= (K-1)\, t_{21}(x) \qquad \text{(sum of first row = sum of first column)}\\
t_{21}(x) + 0 + t_{13}(x) + \cdots + t_{1K}(x) &= (K-1)\, t_{12}(x) \\
t_{21}(x) + t_{12}(x) + 0 + \cdots + t_{1K}(x) &= (K-1)\, t_{13}(x) \\
&\;\;\vdots \\
t_{21}(x) + t_{12}(x) + t_{13}(x) + \cdots + 0 &= (K-1)\, t_{1K}(x).
\end{aligned}
\]
Subtracting the first equation from the second gives t_{12}(x) = t_{21}(x). Repeating this for all pairs of equations, we have t_{21}(x) = t_{12}(x) = t_{13}(x) = ⋯ = t_{1K}(x). What's more, since every row sums to 1, all diagonal elements of T(x) will be equal. Thus,
\[
T(x) =
\begin{pmatrix}
1-(K-1)e(x) & e(x) & \cdots & e(x) \\
e(x) & 1-(K-1)e(x) & \cdots & e(x) \\
\vdots & \vdots & \ddots & \vdots \\
e(x) & e(x) & \cdots & 1-(K-1)e(x)
\end{pmatrix},
\]
where e(x) ∈ [0, 1/(K−1)] so that all entries of T(x) are nonnegative.
3) The final step is to determine which values e(x) can take. Take η(x) = e_y for some y; then from Theorem 4 we know that the diagonal elements of T(x) maximize their column, therefore
\[
1-(K-1)e(x) > e(x) \;\Longrightarrow\; e(x) \in \big[0, \tfrac{1}{K}\big).
\]
Finally, for any η(x), the arg max is preserved by multiplying by this specific choice of T(x). This concludes the "⟹" direction.
A.3 Experimental details
A.3.1 2D Gaussian with synthetic label noise
For the 2D Gaussian mixture data, we draw 200 points from each of two Gaussians centered at [1 1]ᵀ and [−1 −1]ᵀ, each with identity covariance, labeled Y = 1 and Y = 2 respectively. To generate noisy labels, we flip each label uniformly (i.e., to the other class) with some probability. We use scikit-learn's logistic regression (with no ℓ2 regularization). The experiment was conducted on an AMD Ryzen 5 3600 CPU. The goal of the simulation is to experimentally verify the noise immunity results in Section 6. Note that each trial corresponds to a different draw of both the instances and the noisy labels.
Table 2: Testing accuracy of logistic regression on Gaussian mixture data with uniform label noise. Noise rate refers to P(Ỹ ≠ Y), the percentage of wrong labels in the training data. As the theory in Section 6 predicts, once P(Ỹ ≠ Y) reaches 50%, there is a sharp decrease in performance.
Noise rate 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
Trial #1 93.00 92.83 92.38 92.08 91.78 91.93 92.25 92.90 91.83 92.58 74.68 25.12 9.70 7.73 7.52 7.25 7.38 7.15 7.18 7.10 7.00
Trial #2 91.73 91.60 92.05 91.63 91.78 91.78 91.68 91.63 91.55 91.48 80.40 21.10 9.93 8.55 8.38 8.22 8.20 8.35 8.33 8.40 8.28
Trial #3 92.73 92.75 92.78 92.78 92.58 92.45 91.68 88.15 82.58 59.83 49.53 35.80 21.28 14.35 9.33 8.53 8.12 7.70 7.13 7.23 7.28
Trial #4 91.55 91.58 91.60 91.63 91.68 91.60 91.25 90.98 89.98 86.38 60.53 9.95 8.75 10.00 10.45 9.08 9.00 9.53 9.20 9.03 8.45
Trial #5 91.55 91.58 91.60 91.63 91.68 91.60 91.25 90.98 89.98 86.38 60.53 9.95 8.75 10.00 10.45 9.08 9.00 9.53 9.20 9.03 8.45
Mean 92.11 92.07 92.08 91.95 91.90 91.87 91.62 90.93 89.18 83.33 65.13 20.40 11.68 10.10 9.23 8.43 8.34 8.45 8.21 8.16 7.89
Std 0.70 0.66 0.51 0.50 0.38 0.35 0.41 1.74 3.79 13.44 12.35 10.94 5.39 2.56 1.29 0.75 0.68 1.07 1.03 0.94 0.70
A.3.2 MNIST with synthetic label noise
We flip the clean training labels of MNIST (http://yann.lecun.com/exdb/mnist/) uniformly, to any of the wrong classes; a minimal sketch of this flipping step is given below. We use a shallow neural network with two convolutional layers and two fully connected layers.
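Both this experiment and the Gaussian simulation above inject noise in the same way: each training label is replaced, with probability equal to the noise rate, by a class chosen uniformly among the other K − 1 classes, so that the noise rate equals P(Ỹ ≠ Y). The sketch below illustrates this step together with the noise-ignorant fit of Section A.3.1; it is a minimal reconstruction under these assumptions, not the released code, and all function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def flip_labels_uniform(y, noise_rate, num_classes, rng):
    """With probability `noise_rate`, replace a label by a uniformly chosen
    *wrong* class, so that P(Y_tilde != Y) equals `noise_rate`."""
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < noise_rate
    offsets = rng.integers(1, num_classes, size=flip.sum())  # offset in {1, ..., K-1}
    y[flip] = (y[flip] + offsets) % num_classes               # guaranteed to differ from the clean label
    return y

# In the spirit of Section A.3.1: two Gaussians with identity covariance, 200 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, (200, 2)), rng.normal([-1, -1], 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
y_noisy = flip_labels_uniform(y, noise_rate=0.3, num_classes=2, rng=rng)

# Noise-Ignorant ERM: fit on the noisy labels as if they were clean.
# (penalty=None requires scikit-learn >= 1.2; use penalty='none' on older versions.)
clf = LogisticRegression(penalty=None).fit(X, y_noisy)
print(clf.score(X, y))  # accuracy against the clean labels (the paper evaluates on a held-out test set)
```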
We train with stochastic gradient descent with learning rate 0.01 for 10 epochs and batch size 64. We use the same hyperparameters for all tests. The experiments were conducted on a single NVIDIA GTX 1660S GPU. The goal of the simulation is to experimentally verify the noise immunity results in Section 6. Here the randomness corresponds to different realizations of the noisy labels and of stochastic gradient descent.
Table 3: Testing accuracy of a shallow CNN (two convolutional layers followed by two fully connected layers) on MNIST with uniform label noise. Noise rate refers to P(Ỹ ≠ Y), the percentage of wrong labels in the training data. As the theory in Section 6 predicts, once P(Ỹ ≠ Y) reaches 90%, there is a sharp decrease in performance.
Noise rate 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
Trial #1 98.97 98.89 98.81 98.46 98.49 98.16 98.46 98.07 97.98 97.57 97.88 97.84 97.19 97.10 96.70 95.02 89.00 83.72 11.58 0.17 0.03
Trial #2 98.88 98.73 98.94 98.55 98.72 98.66 98.50 98.24 98.15 98.23 97.86 97.98 97.70 97.10 96.91 95.76 91.99 88.49 9.99 0.08 0.04
Trial #3 99.00 99.04 98.86 98.56 98.69 98.66 98.51 98.49 98.37 98.25 98.25 97.39 97.37 97.18 96.66 94.88 92.15 81.48 6.19 0.14 0.04
Trial #4 99.04 98.86 98.70 98.76 98.83 98.65 98.34 98.42 98.58 98.47 98.00 97.41 97.63 97.09 96.46 95.94 93.19 84.78 8.68 0.19 0.01
Trial #5 99.05 98.58 98.89 98.82 98.72 98.83 98.34 98.55 98.40 98.38 98.01 97.31 97.33 96.21 96.29 94.92 90.38 85.84 8.98 0.13 0.08
Mean 98.99 98.82 98.84 98.63 98.69 98.59 98.43 98.35 98.30 98.18 98.00 97.59 97.44 96.94 96.60 95.30 91.34 84.86 9.08 0.14 0.04
Std 0.07 0.17 0.09 0.15 0.12 0.25 0.08 0.20 0.23 0.36 0.16 0.30 0.21 0.41 0.24 0.51 1.65 2.59 1.98 0.04 0.03
A.3.3 CIFAR with synthetic label noise
We flip the clean training labels of CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) uniformly (to any of the wrong classes). To allow a fair comparison between different methods, we fix the realization of the noisy labels. Following the two-step procedure described in Section 7, we use different pre-trained neural networks as feature extractors: we forward-pass the training images through the network and record the features. We then use sklearn's (https://scikit-learn.org/stable/) logistic regression to fit the (feature, noisy label) pairs in a full-batch manner. We pre-specify a range of values for the ℓ2 regularization strength ({0.0001, 0.001, 0.01, 0.1, 1, 10, 100}) and for the number of iterations of the lbfgs optimizer ({10, 20, 50, 100}), then perform cross-validation on the noisy data to pick the best hyperparameters. We use the same range of hyperparameters in all tests. The experiments were conducted on a single NVIDIA Tesla V100 GPU. The result is deterministic.
Table 4: Performance on CIFAR-10 with synthetic label noise. We apply a linear model on top of different feature extractors: ResNet-50 TL refers to using a ResNet-50 pre-trained on ImageNet [Deng et al., 2009] (available in the PyTorch model library) in a transfer learning fashion, ResNet-50 SSL refers to using a ResNet-50 pre-trained on unlabeled CIFAR data with a self-supervised loss [Chen et al., 2020] (publicly downloadable weights: https://github.com/ContrastToDivide/C2D?tab=readme-ov-file), and DINOv2 SSL refers to using the self-supervised foundation model DINOv2 [Oquab et al., 2023] (available at https://github.com/facebookresearch/dinov2) as the feature extractor. Noise rate refers to P(Ỹ ≠ Y), the percentage of wrong labels in the training data.
As theory in Section 6 predicts, when P e Y = Y reach 90%, there is a sharp decrease in performance. We employed Python s sklearn logistic regression and cross-validation functions without data augmentation. The results are deterministic. Noise rate 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.85 0.9 0.95 1 Linear 41.37 41.09 40.97 40.37 40.45 39.44 37.28 35.20 26.74 18.00 10.28 5.50 3.92 Linear + Res Net-50 TL 90.17 89.58 89.01 88.27 87.55 87.28 86.40 85.01 82.03 74.02 10.82 1.47 0.26 Linear + Res Net-50 SSL 92.48 92.26 91.74 91.46 91.13 90.33 91.07 90.99 89.11 83.89 10.08 1.31 0.34 Linear + DINOv2 SSL 99.25 99.27 99.23 99.14 99.10 99.11 99.02 98.84 95.50 76.91 10.13 0.92 0.03 A.3.4 CIFAR with human label error We load the noisy human labels provided by http://noisylabels.com/, then follow exact the same procedure as above. Table 5: Performance on CIFAR-N dataset (http://noisylabels.com/) in terms of testing accuracy. Aggre , Rand1 , ..., Noisy denote various types of human label noise. We apply linear model on top of different feature extractors: Res Net-50 TL refers to using a pre-trained Res Net-50 on Image Net [Deng et al., 2009] in a transfer learning fashion, Res Net-50 SSL refers to using a pre-trained Res Net-50 on unlabeled CIFAR data with self-supervised loss [Chen et al., 2020] and DINOv2 SSL refers to using the self-supervised foundation model DINOv2 [Oquab et al., 2023] as the feature extractor. We employed Python s sklearn logistic regression and cross-validation functions without data augmentation; the results are deterministic and directly reproducible. Methods CIFAR-10N CIFAR-100N Aggre Rand1 Rand2 Rand3 Worst Noisy Linear 40.73 40.41 40.31 40.63 38.43 16.61 Linear + Res Net-50 TL 89.18 88.63 88.61 88.66 85.32 62.89 Linear + Res Net-50 SSL 91.78 91.66 91.39 91.28 87.84 57.95 Linear + DINOv2 SSL 98.69 98.80 98.65 98.67 95.71 83.17 A.4 Additional experiments A.4.1 Linear probing, then fine tuning (LP-FT) We study whether linear probing, then fine tuning (LP-FT) [Kumar et al., 2022] works better than linear probing (LP) only, in label noise learning scenario. Table 6: Performance on CIFAR-N dataset (http://noisylabels.com/) in terms of testing accuracy. Clean refers to no label noise, Aggre , Rand1 , ..., Noisy denote various types of human label noise. We compare the testing accuracy of LP-FT versus LP only, over different feature extractors: Res Net-50 TL refers to using a pre-trained Res Net-50 on Image Net [Deng et al., 2009] in a transfer learning fashion, Res Net-50 SSL refers to using a pre-trained Res Net-50 on unlabeled CIFAR data with contrastive loss [Chen et al., 2020] and DINOv2 (small) SSL refers to using a light version of the self-supervised foundation model DINOv2 [Oquab et al., 2023] as the feature extractor. Feature Method CIFAR-10N CIFAR-100N Clean Aggre Rand1 Rand2 Rand3 Worst Clean Noisy Res Net-50 TL LP (ours) 90.17 89.18 88.63 88.61 88.66 85.32 71.79 62.89 LP-FT 95.94 92.03 88.55 87.78 87.82 71.88 82.3 63.85 Res Net-50 SSL LP (ours) 92.54 91.78 91.66 91.46 91.17 87.85 69.88 57.98 LP-FT 94.11 89.11 84.49 83.75 84.15 65.00 74.41 54.49 DINOv2 (small) SSL LP (ours) 96.09 94.8 94.39 94.42 94.35 91.14 83.82 72.46 LP-FT 98.23 93.29 88.03 87.27 86.94 67.42 89.97 64.81 A.4.2 Robust learning strategy over DINOv2 feature This section examines how different robust learning strategy works over DINOv2 feature, compared with only training with cross entropy. Table 7: Comparison of different noise robust methods on DINOv2 features. 
Training a linear classifier with cross entropy (CE) loss is the baseline. We compare it with robust losses: mean absolute error (MAE) loss [Ghosh and Kumar, 2017, Ma and Fattahi, 2022], sigmoid loss [Ghosh et al., 2015], and regularized approaches: Early-Learning Regularization (ELR) [Liu et al., 2020], Sharpness Aware Minimization (SAM) [Foret et al., 2021]. Feature Method CIFAR-10N CIFAR-100N Clean Aggre Rand1 Rand2 Rand3 Worst Clean Noisy CE 99.25 98.69 98.8 98.65 98.67 95.71 92.85 83.17 MAE 99.27 99.04 99.01 99.09 99.11 95.55 90.68 82.55 Sigmoid 99.26 98.86 98.91 98.87 98.96 96.66 92.82 82.03 ELR 99.09 98.49 98.62 98.53 98.56 95.60 89.99 82.75 SAM 99.09 97.66 98.47 98.53 98.47 95.47 89.97 82.85 A.4.3 Synthetic instance-dependent label noise Table 8: We synthetically corrupt labels of CIFAR-10 according to Xia et al. [2020], and compare our NI-ERM with the Part-dependent matrix estimation (PTD) method proposed in that same paper. Method \ Noise rate 10 % 20 % 30 % 40 % 50 % PTD 79.01 76.05 72.28 58.62 53.98 NI-ERM (ours) 99.11 98.94 98.20 93.35 74.67 Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The claims are either supported by theory statements or by reproducible experiment results. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Limitations about our practical method is described. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. 
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Assumptions are stated in the theorem statement. Full proofs are included in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Important information about the experiments are in main text. Details on the experimental setup is described in the appendix. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. 
For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code is provided, common benchmark datase were used, instructions are given, the result is easily reproducible. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: See appendix and attached code. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? 
Answer: [Yes] Justification: We have done repeated experiments for data simulation, e.g., different iid draw. For the result on CIFAR-N data challenge, our result is deterministic and thus no error bar. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: See appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The authors have read the Neur IPS Code of Ethics and confirm that this research follows the code of ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This is a theory-oriented paper towards better understanding of label noise problem. Guidelines: The answer NA means that there is no societal impact of the work performed. 
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no such risks Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Citations and urls are included. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. 
For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve with this matter. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.