# Confident Learning: Estimating Uncertainty in Dataset Labels

Journal of Artificial Intelligence Research 70 (2021) 1373-1411. Submitted 09/2020; published 04/2021.

Curtis G. Northcutt (cgn@mit.edu), Massachusetts Institute of Technology, Department of EECS, Cambridge, MA, USA

Lu Jiang (lujiang@google.com), Google Research, Mountain View, CA, USA

Isaac L. Chuang (ichuang@mit.edu), Massachusetts Institute of Technology, Department of EECS, Department of Physics, Cambridge, MA, USA

## Abstract

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews).
We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.

© 2021 AI Access Foundation. All rights reserved.

## 1. Introduction

Advances in learning with noisy labels and weak supervision usually introduce a new model or loss function. Often this model-centric approach band-aids the real question: which data is mislabeled? Yet, large datasets with noisy labels have become increasingly common. Examples span prominent benchmark datasets like ImageNet (Russakovsky et al., 2015) and MS-COCO (Lin et al., 2014) to human-centric datasets like electronic health records (Halpern et al., 2016) and educational data (Northcutt et al., 2016). The presence of noisy labels in these datasets introduces two problems. How can we identify examples with label errors and how can we learn well despite noisy labels, irrespective of the data modality or model employed? Here, we follow a data-centric approach to theoretically and experimentally investigate the premise that the key to learning with noisy labels lies in accurately and directly characterizing the uncertainty of label noise in the data.

A large body of work, which may be termed confident learning, has arisen to address the uncertainty in dataset labels, from which two aspects stand out. First, Angluin and Laird's (1988) classification noise process (CNP) provides a starting assumption that label noise is class-conditional, depending only on the latent true class, not the data. While there are exceptions, this assumption is commonly used (Goldberger and Ben-Reuven, 2017; Sukhbaatar et al., 2015) because it is reasonable for many datasets. For example, in ImageNet, a leopard is more likely to be mislabeled jaguar than bathtub.
Second, direct estimation of the joint distribution between noisy (given) labels and true (unknown) labels (see Fig. 1) can be pursued effectively based on three principled approaches used in many related studies: (a) Prune, to search for label errors, e.g. following the example of Chen et al. (2019); Patrini et al. (2017); Van Rooyen et al. (2015), using soft-pruning via loss-reweighting, to avoid the convergence pitfalls of iterative re-labeling; (b) Count, to train on clean data, avoiding error-propagation in learned model weights from reweighting the loss (Natarajan et al., 2017) with imperfect predicted probabilities, generalizing seminal work Forman (2005, 2008); Lipton et al. (2018); and (c) Rank which examples to use during training, to allow learning with unnormalized probabilities or decision boundary distances, building on well-known robustness findings (Page et al., 1997) and ideas of curriculum learning (Jiang et al., 2018).

To our knowledge, no prior work has thoroughly analyzed the direct estimation of the joint distribution between noisy and uncorrupted labels. Here, we assemble these principled approaches to generalize confident learning (CL) for this purpose. Estimating the joint distribution is challenging as it requires disambiguation of epistemic uncertainty (model predicted probabilities) from aleatoric uncertainty (noisy labels) (Chowdhary and Dupuis, 2013), but useful because its marginals yield important statistics used in the literature, including latent noise transition rates (Sukhbaatar et al., 2015; Reed et al., 2015), latent prior of uncorrupted labels (Lawrence and Schölkopf, 2001; Graepel and Herbrich, 2001), and inverse noise rates (Katz-Samuels et al., 2019). While noise rates are useful for loss reweighting (Natarajan et al., 2013), only the joint can directly estimate the number of label errors for each pair of true and noisy classes.
Removal of these errors prior to training is an effective approach for learning with noisy labels (Chen et al., 2019). The joint is also useful to discover ontological issues in datasets for dataset curation, e.g. ImageNet includes two classes for the same maillot class (cf. Table 5 in Sec. 5).

The generalized CL assembled in this paper upon the principles of pruning, counting, and ranking, is a model-agnostic family of theories and algorithms for characterizing, finding, and learning with label errors. It uses predicted probabilities and noisy labels to count examples in the unnormalized confident joint, estimate the joint distribution, and prune noisy data, producing clean data as output.

This paper makes two key contributions to prior work on finding, understanding, and learning with noisy labels. First, a proof is presented giving realistic sufficient conditions under which CL exactly finds label errors and exactly estimates the joint distribution of noisy and true labels. Second, experimental data are shared, showing that this CL algorithm is empirically performant on three tasks: (a) label noise estimation, (b) label error finding, and (c) learning with noisy labels, increasing ResNet accuracy on a cleaned ImageNet and outperforming seven recent highly competitive methods for learning with noisy labels on the CIFAR dataset. The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab¹ Python package.

These contributions are presented beginning with the formal problem specification and notation (Section 2), then defining the algorithmic methods employed for CL (Section 3) and theoretically bounding expected behavior under ideal and noisy conditions (Section 4).
Experimental benchmarks on the CIFAR, ImageNet, WebVision, and MNIST datasets, cross-comparing CL performance with that from a wide range of highly competitive approaches, including INCV (Chen et al., 2019), Mixup (Zhang et al., 2018), MentorNet (Jiang et al., 2018), and Co-Teaching (Han et al., 2018), are then presented in Section 5. Related work (Section 6) and concluding observations (Section 7) wrap up the presentation. Extended proofs of the main theorems, algorithm details, and comprehensive performance comparison data are presented in the appendices.

## 2. CL Framework and Problem Set-up

In the context of multiclass data with possibly noisy labels, let $[m]$ denote $\{1, 2, \ldots, m\}$, the set of $m$ unique class labels, and $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ denote the dataset of $n$ examples $x \in \mathbb{R}^d$ with associated observed noisy labels $\tilde{y} \in [m]$. $x$ and $\tilde{y}$ are coupled in $X$ to signify that cleaning removes data and label. While a number of relevant works address the setting where annotator labels are available (Sambasivan et al., 2021; Bouguelia et al., 2018; Tanno et al., 2019a,b; Khetan et al., 2018), this paper addresses the general setting where no annotation information is available except the observed noisy labels.

Assumptions. We assume there exists, for every example, a latent, true label $y^*$. Prior to observing $\tilde{y}$, a class-conditional classification noise process (Angluin and Laird, 1988) maps $y^* \to \tilde{y}$ such that every label in class $j \in [m]$ may be independently mislabeled as class $i \in [m]$ with probability $p(\tilde{y}=i \mid y^*=j)$. This assumption is reasonable and has been used in prior work (Goldberger and Ben-Reuven, 2017; Sukhbaatar et al., 2015).

Notation. Notation is summarized in Table 1. The discrete random variable $\tilde{y}$ takes an observed, noisy label (potentially flipped to an incorrect class), and $y^*$ takes a latent, uncorrupted label. The subset of examples in $X$ with noisy class label $i$ is denoted $X_{\tilde{y}=i}$, i.e. $X_{\tilde{y}=\text{cow}}$ is read, "examples with class label cow."
The notation $p(\tilde{y}; x)$, as opposed to $p(\tilde{y} \mid x)$, expresses our assumption that input $x$ is observed and error-free. We denote the discrete joint probability of the noisy and latent labels as $p(\tilde{y}, y^*)$, where conditionals $p(\tilde{y} \mid y^*)$ and $p(y^* \mid \tilde{y})$ denote probabilities of label flipping. We use $\hat{p}$ for predicted probabilities. In matrix notation, the $n \times m$ matrix of out-of-sample predicted probabilities is $\hat{P}_{k,i} := \hat{p}(\tilde{y}=i; x_k, \theta)$; the prior of the latent labels is $Q_{y^*} := p(y^*=i)$; the $m \times m$ joint distribution matrix is $Q_{\tilde{y},y^*} := p(\tilde{y}=i, y^*=j)$; the $m \times m$ noise transition matrix (noisy channel) of flipping rates is $Q_{\tilde{y} \mid y^*} := p(\tilde{y}=i \mid y^*=j)$; and the $m \times m$ mixing matrix is $Q_{y^* \mid \tilde{y}} := p(y^*=i \mid \tilde{y}=j)$. At times, we abbreviate $\hat{p}(\tilde{y}=i; x, \theta)$ as $\hat{p}_{x,\tilde{y}=i}$, where $\theta$ denotes the model parameters. CL assumes no specific loss function associated with $\theta$: the CL framework is model-agnostic.

Footnote 1: To foster future research in data cleaning and learning with noisy labels and to improve accessibility for newcomers, cleanlab is open-source and well-documented: https://github.com/cgnorthcutt/cleanlab/

Table 1: Notation used in confident learning.

| Notation | Definition |
| --- | --- |
| $m$ | The number of unique class labels |
| $[m]$ | The set of $m$ unique class labels |
| $\tilde{y}$ | Discrete random variable $\tilde{y} \in [m]$ takes an observed, noisy label |
| $y^*$ | Discrete random variable $y^* \in [m]$ takes the unknown, true, uncorrupted label |
| $X$ | The dataset $(x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ of $n$ examples $x \in \mathbb{R}^d$ with noisy labels |
| $x_k$ | The $k$th training data example |
| $\tilde{y}_k$ | The observed, noisy label corresponding to $x_k$ |
| $y^*_k$ | The unknown, true label corresponding to $x_k$ |
| $n$ | The cardinality of $X := (x, \tilde{y})^n$, i.e. the number of examples in the dataset |
| $\theta$ | Model parameters |
| $X_{\tilde{y}=i}$ | Subset of examples in $X$ with noisy label $i$, i.e. $X_{\tilde{y}=\text{cat}}$ is "examples labeled cat" |
| $X_{\tilde{y}=i,y^*=j}$ | Subset of examples in $X$ with noisy label $i$ and true label $j$ |
| $\hat{X}_{\tilde{y}=i,y^*=j}$ | Estimate of subset of examples in $X$ with noisy label $i$ and true label $j$ |
| $p(\tilde{y}=i, y^*=j)$ | Discrete joint probability of noisy label $i$ and true label $j$ |
| $p(\tilde{y}=i \mid y^*=j)$ | Discrete conditional probability of true label flipping, called the noise rate |
| $p(y^*=j \mid \tilde{y}=i)$ | Discrete conditional probability of noisy label flipping, called the inverse noise rate |
| $\hat{p}(\cdot)$ | Estimated or predicted probability (may replace $p(\cdot)$ in any context) |
| $Q_{y^*}$ | The prior of the latent labels |
| $\hat{Q}_{y^*}$ | Estimate of the prior of the latent labels |
| $Q_{\tilde{y},y^*}$ | The $m \times m$ joint distribution matrix for $p(\tilde{y}, y^*)$ |
| $\hat{Q}_{\tilde{y},y^*}$ | Estimate of the $m \times m$ joint distribution matrix for $p(\tilde{y}, y^*)$ |
| $Q_{\tilde{y} \mid y^*}$ | The $m \times m$ noise transition matrix (noisy channel) of flipping rates for $p(\tilde{y} \mid y^*)$ |
| $\hat{Q}_{\tilde{y} \mid y^*}$ | Estimate of the $m \times m$ noise transition matrix of flipping rates for $p(\tilde{y} \mid y^*)$ |
| $Q_{y^* \mid \tilde{y}}$ | The inverse noise matrix for $p(y^* \mid \tilde{y})$ |
| $\hat{Q}_{y^* \mid \tilde{y}}$ | Estimate of the inverse noise matrix for $p(y^* \mid \tilde{y})$ |
| $\hat{p}(\tilde{y}=i; x, \theta)$ | Predicted probability of label $\tilde{y}=i$ for example $x$ and model parameters $\theta$ |
| $\hat{p}_{x,\tilde{y}=i}$ | Shorthand abbreviation for predicted probability $\hat{p}(\tilde{y}=i; x, \theta)$ |
| $\hat{p}(\tilde{y}=i; x \in X_{\tilde{y}=i}, \theta)$ | The self-confidence of example $x$ belonging to its given label $\tilde{y}=i$ |
| $\hat{P}_{k,i}$ | $n \times m$ matrix of out-of-sample predicted probabilities $\hat{p}(\tilde{y}=i; x_k, \theta)$ |
| $C_{\tilde{y},y^*}$ | The confident joint $C_{\tilde{y},y^*} \in \mathbb{N}_{\geq 0}^{m \times m}$, an unnormalized estimate of $Q_{\tilde{y},y^*}$ |
| $C_{\text{confusion}}$ | Confusion matrix of given labels $\tilde{y}_k$ and predictions $\arg\max_{i \in [m]} \hat{p}(\tilde{y}=i; x_k, \theta)$ |
| $t_j$ | The expected (average) self-confidence for class $j$ used as a threshold in $C_{\tilde{y},y^*}$ |
| $p^*(\tilde{y}=i \mid y^*=y^*_k)$ | Ideal probability for some example $x_k$, equivalent to noise rate $p^*(\tilde{y}=i \mid y^*=j)$ |
| $p^*_{x,\tilde{y}=i}$ | Shorthand abbreviation for ideal probability $p^*(\tilde{y}=i \mid y^*=y^*_k)$ |

Goal. Our assumption of a class-conditional noise process implies the label noise transitions are data-independent, i.e., $p(\tilde{y} \mid y^*; x) = p(\tilde{y} \mid y^*)$. To characterize class-conditional label uncertainty, one must estimate $p(\tilde{y} \mid y^*)$ and $p(y^*)$, the latent prior distribution of uncorrupted labels. Unlike prior works which estimate $p(\tilde{y} \mid y^*)$ and $p(y^*)$ independently, we estimate both jointly by directly estimating the joint distribution of label noise, $p(\tilde{y}, y^*)$.
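
To make concrete why the joint is sufficient, the following minimal sketch recovers both quantities named in the Goal (the noise transition matrix and the latent prior) from a single joint matrix. It assumes the illustrative joint shown in Fig. 1 with class order dog, fox, cow; the variable names are hypothetical and are not part of cleanlab.

```python
import numpy as np

# Illustrative joint Q[i, j] = p(noisy label = i, true label = j), copied from
# the example estimate in Fig. 1 (assumed class order: dog, fox, cow).
Q_joint = np.array([
    [0.25, 0.10, 0.05],
    [0.14, 0.15, 0.00],
    [0.08, 0.03, 0.20],
])

# Marginals of the joint.
noisy_prior = Q_joint.sum(axis=1)   # p(noisy = i): row sums
latent_prior = Q_joint.sum(axis=0)  # p(true = j): column sums

# Conditionals are the joint divided by the matching marginal.
# noise_matrix[i, j] = p(noisy = i | true = j): divide column j by p(true = j).
noise_matrix = Q_joint / latent_prior
# inverse_noise_matrix[j, i] = p(true = j | noisy = i): divide row i by
# p(noisy = i), then transpose so the conditioning class indexes the columns.
inverse_noise_matrix = (Q_joint / noisy_prior[:, None]).T

print(latent_prior)
print(noise_matrix.round(3))
```

Each column of `noise_matrix` and of `inverse_noise_matrix` sums to 1, since each column is a conditional distribution.
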
Our goal is to estimate every $p(\tilde{y}, y^*)$ as a matrix $Q_{\tilde{y},y^*}$ and use $Q_{\tilde{y},y^*}$ to find all mislabeled examples $x$ in dataset $X$ where $y^* \neq \tilde{y}$. This is hard because it requires disambiguation of model error (epistemic uncertainty) from the intrinsic label noise (aleatoric uncertainty), while simultaneously estimating the joint distribution of label noise ($Q_{\tilde{y},y^*}$) without prior knowledge of the latent noise transition matrix ($Q_{\tilde{y} \mid y^*}$), the latent prior distribution of true labels ($Q_{y^*}$), or any latent, true labels ($y^*$).

Definition 1 (Sparsity). A statistic to quantify the characteristic shape of the label noise defined by fraction of zeros in the off-diagonals of $Q_{\tilde{y},y^*}$. High sparsity quantifies non-uniformity of label noise, common to real-world datasets. For example, in ImageNet, missile may have high probability of being mislabeled as projectile, but near-zero probability of being mislabeled as most other classes like wool or wine. Zero sparsity implies every noise rate in $Q_{\tilde{y},y^*}$ is non-zero. A sparsity of 1 implies no label noise because the off-diagonals of $Q_{\tilde{y},y^*}$, which encapsulate the class-conditional noise rates, must all be zero if sparsity = 1.

Definition 2 (Self-Confidence). The predicted probability for some model $\theta$ that an example $x$ belongs to its given label $\tilde{y}$, expressed as $\hat{p}(\tilde{y}=i; x \in X_{\tilde{y}=i}, \theta)$. Low self-confidence is a heuristic-likelihood of being a label error.

## 3. CL Methods

Confident learning (CL) estimates the joint distribution between the (noisy) observed labels and the (true) latent labels. CL requires two inputs: (1) the out-of-sample predicted probabilities $\hat{P}_{k,i}$ and (2) the vector of noisy labels $\tilde{y}_k$. The two inputs are linked via index $k$ for all $x_k \in X$. None of the true labels $y^*$ are available, except when $\tilde{y} = y^*$, and we do not know when that is the case.

The out-of-sample predicted probabilities $\hat{P}_{k,i}$ used as input to CL are computed beforehand (e.g. cross-validation) using a model $\theta$: so, how does $\theta$ fit into the CL framework? Prior works typically learn with noisy labels by directly modifying the model or training loss function, restricting the class of models. Instead, CL decouples the model and data cleaning procedure by working with model outputs $\hat{P}_{k,i}$, so that any model that produces a mapping $\theta : x \to \hat{p}(\tilde{y}=i; x_k, \theta)$ can be used (e.g. neural nets with a softmax output, naive Bayes, logistic regression, etc.). However, $\theta$ affects the predicted probabilities $\hat{p}(\tilde{y}=i; x_k, \theta)$ which in turn affect the performance of CL. Hence, in Section 4, we examine sufficient conditions where CL finds label errors exactly, even when $\hat{p}(\tilde{y}=i; x_k, \theta)$ is erroneous. Any model $\theta$ may be used for final training on clean data provided by CL.

CL identifies noisy labels in existing datasets to improve learning with noisy labels. The main procedure (see Fig. 1) comprises three steps: (1) estimate $\hat{Q}_{\tilde{y},y^*}$ to characterize class-conditional label noise (Sec. 3.1), (2) filter out noisy examples (Sec. 3.2), and (3) train with errors removed, reweighting the examples by class weights $\hat{Q}_{y^*}[i] \,/\, \hat{Q}_{\tilde{y},y^*}[i][i]$ for each class $i \in [m]$. In this section, we define these three steps and discuss their expected outcomes.

### 3.1 Count: Characterize and Find Label Errors using the Confident Joint

To estimate the joint distribution of noisy labels $\tilde{y}$ and true labels, $Q_{\tilde{y},y^*}$, we count examples that are likely to belong to another class and calibrate those counts so that they sum to the given count of noisy labels in each class, $|X_{\tilde{y}=i}|$. Counts are captured in the confident joint $C_{\tilde{y},y^*} \in \mathbb{Z}_{\geq 0}^{m \times m}$, a statistical data structure in CL to directly find label errors. Diagonal entries of $C_{\tilde{y},y^*}$ count correct labels and non-diagonals capture asymmetric label error counts. As an example, $C_{\tilde{y}=3,y^*=1} = 10$ is read, "Ten examples are labeled 3 but should be labeled 1."

[Figure 1 diagram: noisy inputs (noisy data $X$ and noisy predicted probabilities $\hat{p}(\tilde{y}; x, \theta)$) → confident joint $C_{\tilde{y},y^*}$ → estimate of the joint $\hat{Q}_{\tilde{y},y^*}$ (normalize rows to match prior and divide by total) → dirty data (examples with label issues). The two example matrices shown in the figure are:]

| $C_{\tilde{y},y^*}$ | $y^*$=dog | $y^*$=fox | $y^*$=cow |
| --- | --- | --- | --- |
| $\tilde{y}$=dog | 100 | 40 | 20 |
| $\tilde{y}$=fox | 56 | 60 | 0 |
| $\tilde{y}$=cow | 32 | 12 | 80 |

| $\hat{Q}_{\tilde{y},y^*}$ | $y^*$=dog | $y^*$=fox | $y^*$=cow |
| --- | --- | --- | --- |
| $\tilde{y}$=dog | 0.25 | 0.1 | 0.05 |
| $\tilde{y}$=fox | 0.14 | 0.15 | 0 |
| $\tilde{y}$=cow | 0.08 | 0.03 | 0.2 |

Figure 1: An example of the confident learning (CL) process. CL uses the confident joint, $C_{\tilde{y},y^*}$, and $\hat{Q}_{\tilde{y},y^*}$, an estimate of $Q_{\tilde{y},y^*}$, the joint distribution of noisy observed labels $\tilde{y}$ and unknown true labels $y^*$, to find examples with label errors and produce clean data for training.

In this section, we first introduce the confident joint $C_{\tilde{y},y^*}$ to partition and count label errors. Second, we show how $C_{\tilde{y},y^*}$ is used to estimate $Q_{\tilde{y},y^*}$ and characterize label noise in a dataset $X$. Finally, we provide a related baseline $C_{\text{confusion}}$ and consider its assumptions and short-comings (e.g. class-imbalance) in comparison with $C_{\tilde{y},y^*}$ and CL. CL overcomes these shortcomings using thresholding and collision handling to enable robustness to class imbalance and heterogeneity in predicted probability distributions across classes.

The confident joint $C_{\tilde{y},y^*}$. $C_{\tilde{y},y^*}$ estimates $X_{\tilde{y}=i,y^*=j}$, the set of examples with noisy label $i$ that actually have true label $j$, by partitioning $X$ into estimate bins $\hat{X}_{\tilde{y}=i,y^*=j}$. When $\hat{X}_{\tilde{y}=i,y^*=j} = X_{\tilde{y}=i,y^*=j}$, then $C_{\tilde{y},y^*}$ exactly finds label errors (proof in Sec. 4). $\hat{X}_{\tilde{y}=i,y^*=j}$ (note the hat above $\hat{X}$ to indicate $\hat{X}_{\tilde{y}=i,y^*=j}$ is an estimate of $X_{\tilde{y}=i,y^*=j}$) is the set of examples $x$ labeled $\tilde{y}=i$ with large enough $\hat{p}(\tilde{y}=j; x, \theta)$ to likely belong to class $y^*=j$, determined by a per-class threshold, $t_j$.
Formally, the definition of the confident joint is

$$C_{\tilde{y},y^*}[i][j] := |\hat{X}_{\tilde{y}=i,y^*=j}| \quad \text{where}$$

$$\hat{X}_{\tilde{y}=i,y^*=j} := \left\{ x \in X_{\tilde{y}=i} : \hat{p}(\tilde{y}=j; x, \theta) \geq t_j,\ \ j = \underset{l \in [m] :\, \hat{p}(\tilde{y}=l; x, \theta) \geq t_l}{\arg\max}\ \hat{p}(\tilde{y}=l; x, \theta) \right\} \tag{1}$$

and the threshold $t_j$ is the expected (average) self-confidence for each class

$$t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y}=j; x, \theta) \tag{2}$$

Unlike prior art, which estimates label errors under the assumption that the true labels are $y^*_k = \arg\max_{i \in [m]} \hat{p}(\tilde{y}=i; x_k, \theta)$ (Chen et al., 2019), the thresholds in this formulation improve CL uncertainty quantification robustness to (1) heterogeneous class probability distributions and (2) class-imbalance. For example, if examples labeled $i$ tend to have higher probabilities because the model is over-confident about class $i$, then $t_i$ will be proportionally larger; if some other class $j$ tends toward low probabilities, $t_j$ will be smaller. These thresholds allow us to guess $y^*$ in spite of class-imbalance, unlike prior art which may guess over-confident classes for $y^*$ because $\arg\max$ is used (Guo et al., 2017). We examine how good the probabilities produced by model $\theta$ need to be for this approach to work in Section 4.

To disentangle Eqn. 1, consider a simplified formulation:

$$\hat{X}^{(\text{simple})}_{\tilde{y}=i,y^*=j} = \{ x \in X_{\tilde{y}=i} : \hat{p}(\tilde{y}=j; x, \theta) \geq t_j \}$$

The simplified formulation, however, introduces label collisions when an example $x$ is confidently counted into more than one $\hat{X}_{\tilde{y}=i,y^*=j}$ bin. Collisions only occur along the $y^*$ dimension of $C_{\tilde{y},y^*}$ because $\tilde{y}$ is given. We handle collisions in the right-hand side of Eqn. 1 by selecting $\hat{y}^* \leftarrow \arg\max_{j \in [m]} \hat{p}(\tilde{y}=j; x, \theta)$ whenever $|\{k \in [m] : \hat{p}(\tilde{y}=k; x \in X_{\tilde{y}=i}, \theta) \geq t_k\}| > 1$ (collision). In practice with softmax, collisions sometimes occur for softmax outputs with higher temperature (more uniform probabilities), few collisions occur with lower temperature, and no collisions occur with a temperature of zero (one-hot prediction probabilities).

The definition of $C_{\tilde{y},y^*}$ in Eqn. 1 has some nice properties in certain circumstances. First, if an example has low (near-uniform) predicted probabilities across classes, then it will not be counted for any class in $C_{\tilde{y},y^*}$ so that $C_{\tilde{y},y^*}$ may be robust to pure noise or examples from an alien class not in the dataset. Second, $C_{\tilde{y},y^*}$ is intuitive: $t_j$ embodies the intuition that examples with higher probability of belonging to class $j$ than the expected probability of examples in class $j$ probably belong to class $j$. Third, thresholding allows flexibility: for example, the 90th percentile may be used in $t_j$ instead of the mean to find errors with higher confidence; despite the flexibility, we use the mean because we show (in Sec. 4) that this formulation exactly finds label errors in various settings, and we leave the study of other formulations, like a percentile-based threshold, as future work.

Complexity. We provide algorithmic implementations of Eqns. 2, 1, and 3 in the Appendix. Given predicted probabilities $\hat{P}_{k,i}$ and noisy labels $\tilde{y}$, these require $O(m^2 + nm)$ storage and arithmetic operations to compute $C_{\tilde{y},y^*}$, for $n$ training examples over $m$ classes.

Estimate the joint $\hat{Q}_{\tilde{y},y^*}$. Given the confident joint $C_{\tilde{y},y^*}$, we estimate $Q_{\tilde{y},y^*}$ as

$$\hat{Q}_{\tilde{y}=i,y^*=j} = \frac{\dfrac{C_{\tilde{y}=i,y^*=j}}{\sum_{j \in [m]} C_{\tilde{y}=i,y^*=j}} \cdot |X_{\tilde{y}=i}|}{\sum\limits_{i \in [m],\, j \in [m]} \left( \dfrac{C_{\tilde{y}=i,y^*=j}}{\sum_{j' \in [m]} C_{\tilde{y}=i,y^*=j'}} \cdot |X_{\tilde{y}=i}| \right)} \tag{3}$$

The numerator calibrates $\sum_j \hat{Q}_{\tilde{y}=i,y^*=j} = |X_i| \,/\, \sum_{i \in [m]} |X_i|,\ \forall i \in [m]$ so that row-sums match the observed marginals. The denominator calibrates $\sum_{i,j} \hat{Q}_{\tilde{y}=i,y^*=j} = 1$ so that the distribution sums to 1.

Label noise characterization. Using the observed prior $Q_{\tilde{y}=i} = |X_i| \,/\, \sum_{i \in [m]} |X_i|$ and marginals of $Q_{\tilde{y},y^*}$, we estimate the latent prior as $\hat{Q}_{y^*=j} := \sum_i \hat{Q}_{\tilde{y}=i,y^*=j},\ \forall j \in [m]$; the noise transition matrix (noisy channel) as $\hat{Q}_{\tilde{y}=i \mid y^*=j} := \hat{Q}_{\tilde{y}=i,y^*=j} \,/\, \hat{Q}_{y^*=j},\ \forall i \in [m]$; and the mixing matrix (Katz-Samuels et al., 2019) as $\hat{Q}_{y^*=j \mid \tilde{y}=i} := \hat{Q}_{\tilde{y}=i,y^*=j} \,/\, Q_{\tilde{y}=i},\ \forall i \in [m]$.
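
The counting and calibration steps of Eqns. 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data, not the Appendix algorithms or the cleanlab implementation; the function names are hypothetical, and the sketch assumes integer labels in `0..m-1`, at least one example per class, and at least one confidently counted example in every row of the confident joint.

```python
import numpy as np

def sketch_confident_joint(labels, pred_probs):
    """Count examples into C[i, j] following Eqns. 1 and 2 (illustrative)."""
    n, m = pred_probs.shape
    # Eqn. 2: t_j is the mean predicted probability of class j over the
    # examples whose given label is j (the average self-confidence).
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])
    # Eqn. 1: keep only classes at or above their threshold, then take the
    # arg max over the remaining classes, which also resolves collisions.
    above = pred_probs >= thresholds
    masked = np.where(above, pred_probs, -np.inf)
    guess = masked.argmax(axis=1)
    confident = above.any(axis=1)  # examples below every threshold are skipped
    C = np.zeros((m, m), dtype=int)
    np.add.at(C, (labels[confident], guess[confident]), 1)
    return C, thresholds

def sketch_estimate_joint(C, labels):
    """Calibrate C into an estimate of the joint following Eqn. 3."""
    label_counts = np.bincount(labels, minlength=C.shape[0])
    row_sums = C.sum(axis=1, keepdims=True)  # assumed non-zero for every row
    calibrated = C / row_sums * label_counts[:, None]
    return calibrated / calibrated.sum()

# Toy data (made up for illustration): 7 examples, 3 classes.
labels = np.array([0, 0, 0, 1, 1, 2, 2])
pred_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],  # labeled 0 but confidently predicted as class 1
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],  # below every threshold, so not counted
    [0.10, 0.10, 0.80],
    [0.30, 0.30, 0.40],  # near-uniform, so not counted
])
C, thresholds = sketch_confident_joint(labels, pred_probs)
Q_hat = sketch_estimate_joint(C, labels)
print(C)
print(Q_hat.round(3))
```

In this toy run the only off-diagonal count is `C[0, 1]`, the example labeled 0 whose class-1 probability exceeds $t_1$, and the row sums of `Q_hat` equal the observed label frequencies, as the numerator of Eqn. 3 requires.
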
As long as $\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$, each of these estimators is similarly consistent (we prove this is the case under practical conditions in Sec. 4). Whereas prior approaches compute the noise transition matrices by directly averaging error-prone predicted probabilities (Reed et al., 2015; Goldberger and Ben-Reuven, 2017), CL is one step removed from the predicted probabilities by estimating noise rates based on counts from $C_{\tilde{y},y^*}$; these counts are computed based on whether the predicted probability is greater than a threshold, relying only on the relative ranking of the predicted probability, not its exact value. This feature lends itself to the robustness of confident learning to imperfect probability estimation.

Baseline approach $C_{\text{confusion}}$. To situate our understanding of $C_{\tilde{y},y^*}$ performance in the context of prior work, we compare $C_{\tilde{y},y^*}$ with $C_{\text{confusion}}$, a baseline based on a single iteration of the performant INCV method (Chen et al., 2019). $C_{\text{confusion}}$ forms an $m \times m$ confusion matrix of counts $|\tilde{y}_k = i, y^*_k = j|$ across all examples $x_k$, assuming that model predictions, trained from noisy labels, uncover the true labels, i.e. $C_{\text{confusion}}$ simply assumes $y^*_k = \arg\max_{i \in [m]} \hat{p}(\tilde{y}=i; x_k, \theta)$. This baseline approach performs reasonably empirically (Sec. 5) and is a consistent estimator for noiseless predicted probabilities (Thm. 1), but fails when the distributions of probabilities are not similar for each class (Thm. 2), e.g. class-imbalance, or when predicted probabilities are overconfident (Guo et al., 2017).

Comparison of $C_{\tilde{y},y^*}$ (confident joint) with $C_{\text{confusion}}$ (baseline). To overcome the sensitivity of $C_{\text{confusion}}$ to class-imbalance and distribution heterogeneity, the confident joint, $C_{\tilde{y},y^*}$, uses per-class thresholding (Richard and Lippmann, 1991; Elkan, 2001) as a form of calibration (Hendrycks and Gimpel, 2017). Moreover, we prove that unlike $C_{\text{confusion}}$, the confident joint (Eqn. 1) exactly finds label errors and consistently estimates $Q_{\tilde{y},y^*}$ in more realistic settings with noisy predicted probabilities (see Sec. 4, Thm. 2).

### 3.2 Rank and Prune: Data Cleaning

Following the estimation of $C_{\tilde{y},y^*}$ and $Q_{\tilde{y},y^*}$ (Section 3.1), any rank and prune approach can be used to clean data. This modularity property allows CL to find label errors using interpretable and explainable ranking methods, whereas prior works typically couple estimation of the noise transition matrix with training loss (Goldberger and Ben-Reuven, 2017) or couple the label confidence of each example with the training loss using loss reweighting (Natarajan et al., 2013; Jiang et al., 2018). In this paper, we investigate and evaluate five rank and prune methods for finding label errors, grouped into two approaches. We provide a theoretical analysis for Method 2: $C_{\tilde{y},y^*}$ in Sec. 4 and evaluate all methods empirically in Sec. 5.

Approach 1: Use off-diagonals of $C_{\tilde{y},y^*}$ to estimate $\hat{X}_{\tilde{y}=i,y^*=j}$. We directly use the sets of examples counted in the off-diagonals of $C_{\tilde{y},y^*}$ to estimate label errors.

CL baseline 1: $C_{\text{confusion}}$. Estimate label errors as the Boolean vector $\tilde{y}_k \neq \arg\max_{j \in [m]} \hat{p}(\tilde{y}=j; x_k, \theta)$, for all $x_k \in X$, where true implies label error and false implies clean data. This is identical to using the off-diagonals of $C_{\text{confusion}}$ and similar to a single iteration of INCV (Chen et al., 2019).

CL method 2: $C_{\tilde{y},y^*}$. Estimate label errors as $\{x \in \hat{X}_{\tilde{y}=i,y^*=j} : i \neq j\}$ from the off-diagonals of $C_{\tilde{y},y^*}$.

Approach 2: Use $n \cdot \hat{Q}_{\tilde{y},y^*}$ to estimate $|\hat{X}_{\tilde{y}=i,y^*=j}|$, prune by probability ranking. These approaches calculate $n \cdot \hat{Q}_{\tilde{y},y^*}$ to estimate $|\hat{X}_{\tilde{y}=i,y^*=j}|$, the count of label errors in each partition. They either sum over the $y^*$ dimension of $|\hat{X}_{\tilde{y}=i,y^*=j}|$ to estimate and remove the number of errors in each class (prune by class), or prune for every off-diagonal partition (prune by noise rate).
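
As a small worked illustration of the counts that Approach 2 relies on, the sketch below scales the example estimate from Fig. 1 by $n$ and reads off the per-partition and per-class error counts. The value $n = 400$ is an assumption inferred from the total of the counts in that figure, and the variable names are hypothetical.

```python
import numpy as np

# Estimated joint from the Fig. 1 example (rows: given label, columns: true
# label; assumed class order: dog, fox, cow).
Q_hat = np.array([
    [0.25, 0.10, 0.05],
    [0.14, 0.15, 0.00],
    [0.08, 0.03, 0.20],
])
n = 400  # assumed dataset size: the total of the counts shown in Fig. 1

# Estimated size of every partition (given label i, true label j).
partition_counts = np.rint(n * Q_hat).astype(int)

# Prune by noise rate works on each off-diagonal entry separately.
errors_per_entry = partition_counts.copy()
np.fill_diagonal(errors_per_entry, 0)

# Prune by class sums the off-diagonal entries over the true-label dimension.
errors_per_class = errors_per_entry.sum(axis=1)

print(errors_per_entry)
print(errors_per_class)
```

Under these assumptions the sketch estimates 60, 56, and 44 label errors among the examples labeled dog, fox, and cow respectively; these counts only say how many examples to remove, not which ones.
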
The choice of which examples to remove is made by ranking the examples based on predicted probabilities. CL method 3: Prune by Class (PBC). For each class i [m], select the n P j [m]:j =i ˆQ y=i,y =j[i] examples with lowest self-confidence ˆp( y = i; x Xi) . CL method 4: Prune by Noise Rate (PBNR). For each off-diagonal entry in ˆQ y=i,y =j, i = j, select the n ˆQ y=i,y =j examples x X y=i with max margin ˆpx, y=j ˆpx, y=i. This margin is adapted from Wei et al. s (2018) normalized margin. CL method 5: C+NR. Combine the previous two methods via element-wise and , i.e. set intersection. Prune an example if both methods PBC and PBNR prune that example. Learning with Noisy Labels To train with errors removed, we account for missing data by reweighting the loss by 1 ˆp( y=i|y =i) = ˆ Qy [i] ˆ Q y,y [i][i] for each class i [m], where dividing by ˆQ y,y [i][i] normalizes out the count of clean training data and ˆQy [i] re-normalizes to the latent number of examples in class i. CL finds errors, but does not prescribe a specific training procedure using the clean data. Theoretically, CL requires no hyper-parameters to find label errors. In practice, cross-validation might introduce a hyper-parameter: k-fold. However, in our paper k = 4 is fixed in the experiments using cross-validation. Which CL method to use? Five methods are presented to clean data. By default we use CL: C y,y because it matches the conditions of Thm. 2 exactly and is experimentally performant (see Table 4). Once label errors are found, we observe ordering label errors by the normalized margin: ˆp( y=i; x, θ) maxj =i ˆp( y=j; x, θ) (Wei et al., 2018) works well. In this section, we examine sufficient conditions when (1) the confident joint exactly finds label errors and (2) ˆQ y,y is a consistent estimator for Q y,y . We first analyze CL for noiseless ˆpx, y=j, then evaluate more realistic conditions, culminating in Thm. 
2 where we prove (1) and (2) with noise in the predicted probabilities for every example. Proofs are in the Appendix (see Sec. A). As a notation reminder, $\hat{p}_{x,\tilde{y}=i}$ is shorthand for $\hat{p}(\tilde{y}=i; x, \boldsymbol{\theta})$.

In the statement of each theorem, we use $\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$, i.e., approximately equals, to account for the precision error of using the discrete, count-based $C_{\tilde{y},y^*}$ to estimate the real-valued $Q_{\tilde{y},y^*}$. For example, if a noise rate is 0.39 but the dataset has only 5 examples in that class, the nearest possible estimate by removing errors is $2/5 = 0.4 \approx 0.39$. So $\hat{Q}_{\tilde{y},y^*}$ is technically only a consistent (rather than exact) estimator for $Q_{\tilde{y},y^*}$ because of discretization error; otherwise all equalities are exact. Throughout, we assume $\boldsymbol{X}$ includes at least one example from every class.

4.1 Noiseless Predicted Probabilities

We start with the ideal condition and a non-obvious lemma that yields a closed-form expression for the threshold $t_i$ when $\hat{p}_{x,\tilde{y}=i}$ is ideal. Without some condition on $\hat{p}_{x,\tilde{y}=i}$, one cannot disambiguate label noise from model noise.

Condition 1 (Ideal). The predicted probabilities $\hat{p}(\tilde{y}; x, \boldsymbol{\theta})$ for a model $\boldsymbol{\theta}$ are ideal if $\forall x_k \in \boldsymbol{X}_{y^*=j},\ i \in [m],\ j \in [m]$, we have that

$$\hat{p}(\tilde{y}=i;\ x_k \in \boldsymbol{X}_{y^*=j}, \boldsymbol{\theta}) = p^*(\tilde{y}=i \mid y^*=y^*_k) = p^*(\tilde{y}=i \mid y^*=j).$$

The final equality follows from the class-conditional noise process assumption. The ideal condition implies error-free predicted probabilities: they match the noise rates corresponding to the $y^*$ label of $x$. We use $p^*_{x,\tilde{y}=i}$ as a shorthand.

Lemma 1 (Ideal Thresholds). For a noisy dataset $\boldsymbol{X} := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\boldsymbol{\theta}$, if $\hat{p}(\tilde{y}; x, \boldsymbol{\theta})$ is ideal, then $\forall i \in [m]$,

$$t_i = \sum_{j \in [m]} p(\tilde{y}=i \mid y^*=j)\, p(y^*=j \mid \tilde{y}=i).$$

This form of the threshold is intuitively reasonable: the contribution to the sum when $i = j$ represents the probability of correct labeling, whereas when $i \neq j$, the terms give the probabilities of mislabeling $p(\tilde{y}=i \mid y^*=j)$, weighted by the probability $p(y^*=j \mid \tilde{y}=i)$ that the mislabeling is corrected. Using Lemma 1 under the ideal condition, we prove in Thm. 1 that confident learning exactly finds label errors and that $\hat{Q}_{\tilde{y},y^*}$ is a consistent estimator for $Q_{\tilde{y},y^*}$ when each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row and column. The proof hinges on the fact that the construction of $C_{\tilde{y},y^*}$ eliminates collisions.

Theorem 1 (Exact Label Errors). For a noisy dataset $\boldsymbol{X} := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\boldsymbol{\theta}: x \rightarrow \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \boldsymbol{\theta})$ is ideal and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row and column, then $\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j} = \boldsymbol{X}_{\tilde{y}=i,y^*=j}$ and $\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$ (consistent estimator for $Q_{\tilde{y},y^*}$).

While Thm. 1 is a reasonable sanity check, observe that $y^* \leftarrow \arg\max_{i \in [m]} \hat{p}(\tilde{y}=i \mid y^*=j;\ x)$, used by $C_{\text{confusion}}$, trivially satisfies Thm. 1 if the diagonal of $Q_{\tilde{y}|y^*}$ maximizes its row and column. We highlight this because $C_{\text{confusion}}$ is the variant of CL most related to prior work (e.g., Chen et al. (2019)). We next consider relaxed conditions motivated by real-world settings (e.g., Jiang et al. (2020)) where $C_{\tilde{y},y^*}$ exactly finds label errors ($\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j} = \boldsymbol{X}_{\tilde{y}=i,y^*=j}$) and consistently estimates the joint distribution of noisy and true labels ($\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$), but $C_{\text{confusion}}$ does not.

4.2 Noisy Predicted Probabilities

Motivated by the importance of addressing class imbalance and heterogeneous class probability distributions, we consider linear combinations of noise per class. Here, we index $\hat{p}_{x,\tilde{y}=j}$ by $j$ to match the comparison $\hat{p}(\tilde{y}=j; x, \boldsymbol{\theta}) \geq t_j$ from the construction of $C_{\tilde{y},y^*}$ (see Eqn. 1).

Condition 2 (Per-Class Diffracted). $\hat{p}_{x,\tilde{y}=j}$ is per-class diffracted if there exist linear combinations of class-conditional error in the predicted probabilities s.t. $\hat{p}_{x,\tilde{y}=j} = \epsilon^{(1)}_j\, p^*_{x,\tilde{y}=j} + \epsilon^{(2)}_j$, where $\epsilon^{(1)}_j, \epsilon^{(2)}_j \in \mathbb{R}$ and $\epsilon_j$ can be any distribution. This relaxes the ideal condition with noise that is relevant for neural networks, which are known to be class-conditionally overly confident (Guo et al., 2017).

Corollary 1.1 (Per-Class Robustness). For a noisy dataset $\boldsymbol{X} := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\boldsymbol{\theta}: x \rightarrow \hat{p}(\tilde{y})$, if $\hat{p}_{x,\tilde{y}=j}$ is per-class diffracted without label collisions and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row, then $\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j} = \boldsymbol{X}_{\tilde{y}=i,y^*=j}$ and $\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$.

Cor. 1.1 shows that $C_{\tilde{y},y^*}$ in confident learning (which counts $\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j}$) is robust to any linear combination of per-class error in probabilities. This is not the case for $C_{\text{confusion}}$, because Cor. 1.1 no longer requires that the diagonal of $Q_{\tilde{y}|y^*}$ maximize its column, as it did in Thm. 1. For intuition, consider an extreme case of per-class diffraction where the probabilities of only one class are all dramatically increased. Then $C_{\text{confusion}}$, which relies on $y^*_k \leftarrow \arg\max_{i \in [m]} \hat{p}(\tilde{y}=i \mid y^*=j;\ x_k)$, will count only that one class for all $y^*$, so that all entries in $C_{\text{confusion}}$ will be zero except for one column. That is, $C_{\text{confusion}}$ cannot count entries in any other column, so $\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j} \neq \boldsymbol{X}_{\tilde{y}=i,y^*=j}$. In comparison, for $C_{\tilde{y},y^*}$, the increased probabilities of the one class are offset by the class threshold, re-normalizing the columns of the matrix, so that $C_{\tilde{y},y^*}$ satisfies Cor. 1.1, using thresholds for robustness to distributional shift and class imbalance.

Cor. 1.1 only allows for $m$ alterations in the probabilities, and there are only $m^2$ unique probabilities under the ideal condition, whereas in real-world conditions an error-prone model could potentially output $n \times m$ unique probabilities. Next, in Thm. 2, we examine a reasonable sufficient condition where CL is robust to erroneous probabilities for every example and class.

Condition 3 (Per-Example Diffracted). $\hat{p}_{x,\tilde{y}=j}$ is per-example diffracted if $\forall j \in [m],\ \forall x \in \boldsymbol{X}$, we have error as $\hat{p}_{x,\tilde{y}=j} = p^*_{x,\tilde{y}=j} + \epsilon_{x,\tilde{y}=j}$, where

$$\epsilon_{x,\tilde{y}=j} \sim \begin{cases} \mathcal{U}\big(\epsilon_j + t_j - p^*_{x,\tilde{y}=j},\ \ \epsilon_j - t_j + p^*_{x,\tilde{y}=j}\big] & p^*_{x,\tilde{y}=j} \geq t_j \\ \mathcal{U}\big[\epsilon_j - t_j + p^*_{x,\tilde{y}=j},\ \ \epsilon_j + t_j - p^*_{x,\tilde{y}=j}\big) & p^*_{x,\tilde{y}=j} < t_j \end{cases} \qquad (4)$$

where $\epsilon_j = \mathbb{E}_{x \in \boldsymbol{X}}\, \epsilon_{x,\tilde{y}=j}$ and $\mathcal{U}$ denotes a uniform distribution (we discuss a more general case in the Appendix).
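The intuition behind Cor. 1.1 can be checked on a toy example. The sketch below is illustrative only: the two-class noise rates, the +0.30 shift applied to class 0, and the helper names `confident_joint` and `argmax_confusion` are assumptions made for this example, and the thresholded count follows the construction referenced as Eqn. 1 (per-class threshold equal to the mean predicted probability of that class over examples given that label, with collisions resolved by the largest probability).

```python
import numpy as np

def confident_joint(labels, probs):
    """Count examples by (given label, confidently predicted class).

    Hypothetical helper: threshold t_j is the mean of probs[:, j] over
    examples whose given label is j; an example is counted for class j
    only if probs[:, j] >= t_j (largest such probability wins).
    """
    m = probs.shape[1]
    thresholds = np.array([probs[labels == j, j].mean() for j in range(m)])
    above = probs >= thresholds
    guess = np.where(above, probs, -np.inf).argmax(axis=1)
    keep = above.any(axis=1)  # examples below every threshold are not counted
    C = np.zeros((m, m), dtype=int)
    np.add.at(C, (labels[keep], guess[keep]), 1)
    return C

def argmax_confusion(labels, probs):
    """Confusion-matrix baseline: count (given label, argmax prediction)."""
    m = probs.shape[1]
    C = np.zeros((m, m), dtype=int)
    np.add.at(C, (labels, probs.argmax(axis=1)), 1)
    return C

# Toy ideal probabilities: row j holds p(noisy = i | true = j) for i = 0, 1.
ideal = np.array([[0.60, 0.40],
                  [0.45, 0.55]])
true = np.repeat([0, 1], 20)
# Noisy labels drawn to match the noise rates exactly (12/8 and 9/11 splits).
noisy = np.concatenate([np.repeat([0, 1], [12, 8]), np.repeat([0, 1], [9, 11])])

# Per-class diffraction: inflate every class-0 probability by 0.30.
# (Condition 2 does not require the rows to stay normalized.)
probs = ideal[true] + np.array([0.30, 0.0])

C_cl = confident_joint(noisy, probs)
C_argmax = argmax_confusion(noisy, probs)
print(C_cl)      # matches the true (noisy, true) counts
print(C_argmax)  # every example lands in column 0
```

In this toy setting the thresholded count recovers the true joint counts, while the argmax-based count collapses into a single column, which is the failure mode described above.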
Theorem 2 (Per-Example Robustness). For a noisy dataset $\boldsymbol{X} := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\boldsymbol{\theta}: x \rightarrow \hat{p}(\tilde{y})$, if $\hat{p}_{x,\tilde{y}=j}$ is per-example diffracted without label collisions and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row, then $\hat{\boldsymbol{X}}_{\tilde{y}=i,y^*=j} \approxeq \boldsymbol{X}_{\tilde{y}=i,y^*=j}$ and $\hat{Q}_{\tilde{y},y^*} \approxeq Q_{\tilde{y},y^*}$.

In Thm. 2, we observe that if each example's predicted probability resides within the residual range of the ideal probability and the threshold, then CL exactly identifies the label errors and consistently estimates $Q_{\tilde{y},y^*}$. Intuitively, if $\hat{p}_{x,\tilde{y}=j} \geq t_j$ whenever $p^*_{x,\tilde{y}=j} \geq t_j$ and $\hat{p}_{x,\tilde{y}=j} < t_j$ whenever $p^*_{x,\tilde{y}=j} < t_j$, then regardless of error in $\hat{p}_{x,\tilde{y}=j}$, CL exactly finds label errors. As an example, consider an image $x_k$ that is mislabeled as fox but is actually a dog, where $t_{fox} = 0.6$, $p^*(\tilde{y}=fox;\ x \in \boldsymbol{X}_{y^*=dog}, \boldsymbol{\theta}) = 0.2$, $t_{dog} = 0.8$, and $p^*(\tilde{y}=dog;\ x \in \boldsymbol{X}_{y^*=dog}, \boldsymbol{\theta}) = 0.9$. Then as long as $-0.4 \leq \epsilon_{x,fox} < 0.4$ and $-0.1 < \epsilon_{x,dog} \leq 0.1$, CL will surmise $y^*_k = dog$, not fox, even though $\tilde{y}_k = fox$ is given. We empirically substantiate this theoretical result in Section 5.2.

Thm. 2 addresses the epistemic uncertainty of latent label noise, via the statistic $Q_{\tilde{y},y^*}$, while accounting for the aleatoric uncertainty of inherently erroneous predicted probabilities.

5. Experiments

This section empirically validates CL on CIFAR (Krizhevsky and Hinton, 2009) and ImageNet (Russakovsky et al., 2015) benchmarks. Sec. 5.1 presents CL performance on noisy examples in CIFAR where true labels are presumed known. Sec. 5.2 shows real-world label errors found in the original, unperturbed MNIST, ImageNet, WebVision, and Amazon Reviews datasets, and shows performance advantages using cleaned data provided by CL to train ImageNet. Unless otherwise specified, we compute out-of-sample predicted probabilities $\hat{P}_{k,j}$ using four-fold cross-validation and ResNet architectures.

5.1 Asymmetric Label Noise on CIFAR-10 dataset

We evaluate CL on three criteria: (a) joint estimation (Fig. 2), (b) accuracy finding label errors (Table 4), and (c) accuracy learning with noisy labels (Table 2).

Noise Generation. Following prior work by Sukhbaatar et al. (2015) and Goldberger and Ben-Reuven (2017), we verify CL performance on the commonly used asymmetric label noise, where the labels of error-free/clean data are randomly flipped, for its resemblance to real-world noise. We generate noisy data from clean data by randomly switching some labels of training examples to different classes non-uniformly according to a randomly generated $Q_{\tilde{y}|y^*}$ noise transition matrix. We generate $Q_{\tilde{y}|y^*}$ matrices with different traces to run experiments for different noise levels. The noise matrices used in our experiments are in the Appendix in Fig. S3. We generate noise in the CIFAR-10 training dataset across varying sparsities (the fraction of off-diagonals in $Q_{\tilde{y},y^*}$ that are zero) and percentages of incorrect labels (noise). We evaluate all models on the unaltered test set.

Baselines and our method. In Table 2, we compare CL performance versus seven recent highly competitive approaches and a vanilla baseline for multiclass learning with noisy labels on CIFAR-10: INCV (Chen et al., 2019), which finds clean data with multiple iterations of cross-validation and then trains on the clean set; SCE-loss (symmetric cross entropy) (Wang et al., 2019), which adds a reverse cross entropy term for loss correction; Mixup (Zhang et al., 2018), which linearly combines examples and labels to augment data; MentorNet (Jiang et al., 2018), which uses curriculum learning to avoid noisy data in training; Co-Teaching (Han et al., 2018), which trains two models in tandem to learn from clean data; S-Model (Goldberger and Ben-Reuven, 2017), which uses an extra softmax layer to model noise during training; Reed (Reed et al., 2015), which uses loss reweighting; and a Baseline model that denotes vanilla training with the noisy labels.
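The noise-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `flip_labels`, the hand-picked 3-class matrix, and the balanced toy labels are assumptions for the example. The matrix is column-stochastic, with entry (i, j) read as $p(\tilde{y}=i \mid y^*=j)$.

```python
import numpy as np

def flip_labels(true_labels, noise_matrix, rng):
    """Sample noisy labels from a class-conditional noise matrix.

    noise_matrix[i, j] = p(noisy = i | true = j); each column must sum to 1.
    (Hypothetical helper for illustration.)
    """
    noise_matrix = np.asarray(noise_matrix, dtype=float)
    assert np.allclose(noise_matrix.sum(axis=0), 1.0)
    m = noise_matrix.shape[0]
    noisy = np.empty_like(true_labels)
    for j in range(m):
        idx = np.flatnonzero(true_labels == j)
        # The flip depends only on the true class j, never on the example itself.
        noisy[idx] = rng.choice(m, size=idx.size, p=noise_matrix[:, j])
    return noisy

# Asymmetric toy matrix: trace 2.1, and 3 of its 6 off-diagonals are zero.
Q = np.array([[0.7, 0.0, 0.0],
              [0.3, 0.8, 0.4],
              [0.0, 0.2, 0.6]])

rng = np.random.default_rng(0)
true = np.repeat(np.arange(3), 10_000)   # balanced classes
noisy = flip_labels(true, Q, rng)
print((noisy != true).mean())            # close to 1 - trace(Q)/3 = 0.3
```

With balanced classes, the expected fraction of flipped labels is $1 - \mathrm{tr}(Q_{\tilde{y}|y^*})/m$, which is one way to see why varying the trace varies the noise level; zeroing more off-diagonal entries concentrates the same amount of noise into fewer class pairs.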
Training settings. All models are trained using ResNet-50 with the common setting: learning rate 0.1 for epochs [0,150), 0.01 for epochs [150,250), 0.001 for epochs [250,350); momentum 0.9; and weight decay 0.0001, except INCV, SCE-loss, and Co-Teaching, which are trained using their official GitHub code. Settings are copied from the kuangliu/pytorch-cifar GitHub open-source code and were not tuned by hand. We report the highest score across hyper-parameters $\alpha \in \{1, 2, 4, 8\}$ for Mixup and $p \in \{0.7, 0.8, 0.9\}$ for MentorNet. For fair comparison with Co-Teaching, INCV, and MentorNet, we also train using the co-teaching approach with forget rate = 0.5 × [noise fraction], and report the max accuracy of the two trained models for each method. (We observe that dropping the last partial batch of each epoch during training improves stability by avoiding weight updates from, in some cases, a single noisy example.) Exactly the same noisy labels are used for training all models for each column of Table 2. For our method, we fix its hyper-parameter, i.e., the number of folds in cross-validation, across different noise levels, and do not tune it on the validation set.

Table 2: Test accuracy (%) of confident learning versus recent methods for learning with noisy labels in CIFAR-10. Scores reported for CL methods are averaged over ten trials, with standard deviations shown in Table 3. CL methods estimate label errors, remove them, then train on the cleaned data. Whereas other methods decrease in performance from low sparsity (e.g., 0.0) to high sparsity (e.g., 0.6), CL methods are robust across sparsity, as indicated by comparing the two column-wise red highlighted cells. Column headers give noise / sparsity.

| Method | 20% / 0 | 20% / 0.2 | 20% / 0.4 | 20% / 0.6 | 40% / 0 | 40% / 0.2 | 40% / 0.4 | 40% / 0.6 | 70% / 0 | 70% / 0.2 | 70% / 0.4 | 70% / 0.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CL: $C_{\text{confusion}}$ | 89.6 | 89.4 | 90.2 | 89.9 | 83.9 | 83.9 | 83.2 | 84.2 | 31.5 | 39.3 | 33.7 | 30.6 |
| CL: PBC | 90.5 | 90.1 | 90.6 | 90.7 | 84.8 | 85.5 | 85.3 | 86.2 | 33.7 | 40.7 | 35.1 | 31.4 |
| CL: $C_{\tilde{y},y^*}$ | 91.1 | 90.9 | 91.1 | 91.3 | 86.7 | 86.7 | 86.6 | 86.9 | 32.4 | 41.8 | 34.4 | 34.5 |
| CL: C+NR | 90.8 | 90.7 | 91.0 | 91.1 | 87.1 | 86.9 | 86.7 | 87.2 | 41.1 | 41.7 | 39.0 | 32.9 |
| CL: PBNR | 90.7 | 90.5 | 90.9 | 90.9 | 87.1 | 86.8 | 86.6 | 87.2 | 41.0 | 41.8 | 39.1 | 36.4 |
| INCV (Chen et al., 2019) | 87.8 | 88.6 | 89.6 | 89.2 | 84.4 | 76.6 | 85.4 | 73.6 | 28.3 | 25.3 | 34.8 | 29.7 |
| Mixup (Zhang et al., 2018) | 85.6 | 86.8 | 87.0 | 84.3 | 76.1 | 75.4 | 68.6 | 59.8 | 32.2 | 31.3 | 32.3 | 26.9 |
| SCE-loss (Wang et al., 2019) | 87.2 | 87.5 | 88.8 | 84.4 | 76.3 | 74.1 | 64.9 | 58.3 | 33.0 | 28.7 | 30.9 | 24.0 |
| MentorNet (Jiang et al., 2018) | 84.9 | 85.1 | 83.2 | 83.4 | 64.4 | 64.2 | 62.4 | 61.5 | 30.0 | 31.6 | 29.3 | 27.9 |
| Co-Teaching (Han et al., 2018) | 81.2 | 81.3 | 81.4 | 80.6 | 62.9 | 61.6 | 60.9 | 58.1 | 30.5 | 30.2 | 27.7 | 26.0 |
| S-Model (Goldberger and Ben-Reuven, 2017) | 80.0 | 80.0 | 79.7 | 79.1 | 58.6 | 61.2 | 59.1 | 57.5 | 28.4 | 28.5 | 27.9 | 27.3 |
| Reed (Reed et al., 2015) | 78.1 | 78.9 | 80.8 | 79.3 | 60.5 | 60.4 | 61.2 | 58.6 | 29.0 | 29.4 | 29.1 | 26.8 |
| Baseline | 78.4 | 79.2 | 79.0 | 78.2 | 60.2 | 60.8 | 59.6 | 57.3 | 27.0 | 29.7 | 28.2 | 26.8 |

For each CL method, sparsity, and noise setting, we report the mean accuracy in Table 2, averaged over ten trials, by varying the random seed and initial weights of the neural network for training. Standard deviations are reported separately in Table 3 to improve readability. For each column in Table 2, the corresponding standard deviations in Table 3 are significantly less than the performance difference between CL methods and baseline methods. Notably, all standard deviations are significantly (~10x) less than the mean performance difference between the top-performing CL method and baseline methods for each setting, averaged over random weight initialization. Standard deviations are only reported for CL methods because of difficulty reproducing consistent results for some of the other methods.

Table 3: Standard deviations (% units) associated with the mean score (over ten trials) for scores reported for CL methods in Table 2. Each trial uses a different random seed and network weight initialization. No standard deviation exceeds 2%. Column headers give noise / sparsity.

| Method | 20% / 0 | 20% / 0.2 | 20% / 0.4 | 20% / 0.6 | 40% / 0 | 40% / 0.2 | 40% / 0.4 | 40% / 0.6 | 70% / 0 | 70% / 0.2 | 70% / 0.4 | 70% / 0.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CL: $C_{\text{confusion}}$ | 0.07 | 0.10 | 0.17 | 0.08 | 0.19 | 0.22 | 0.23 | 0.20 | 0.93 | 0.24 | 0.13 | 0.26 |
| CL: PBC | 0.14 | 0.12 | 0.11 | 0.10 | 0.15 | 0.17 | 0.16 | 0.10 | 0.12 | 0.22 | 0.11 | 0.30 |
| CL: $C_{\tilde{y},y^*}$ | 0.17 | 0.09 | 0.17 | 0.11 | 0.10 | 0.20 | 0.09 | 0.13 | 1.02 | 0.15 | 0.18 | 1.63 |
| CL: C+NR | 0.09 | 0.10 | 0.08 | 0.08 | 0.11 | 0.14 | 0.16 | 0.10 | 0.42 | 0.33 | 0.26 | 1.90 |
| CL: PBNR | 0.15 | 0.09 | 0.09 | 0.10 | 0.18 | 0.10 | 0.15 | 0.12 | 0.26 | 0.28 | 0.24 | 1.43 |

Table 4: Mean accuracy, F1, precision, and recall measures of CL methods for finding label errors in CIFAR-10, averaged over ten trials. Accuracy is reported as mean ± standard deviation (both in %); F1, precision, and recall are in %. Column headers give noise / sparsity.

| Method | Acc. 20% / 0.0 | Acc. 20% / 0.6 | Acc. 40% / 0.0 | Acc. 40% / 0.6 | F1 20% / 0.0 | F1 20% / 0.6 | F1 40% / 0.0 | F1 40% / 0.6 | Prec. 20% / 0.0 | Prec. 20% / 0.6 | Prec. 40% / 0.0 | Prec. 40% / 0.6 | Rec. 20% / 0.0 | Rec. 20% / 0.6 | Rec. 40% / 0.0 | Rec. 40% / 0.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CL: $C_{\text{confusion}}$ | 84 ± 0.07 | 85 ± 0.09 | 85 ± 0.24 | 81 ± 0.21 | 71 | 72 | 84 | 79 | 56 | 58 | 74 | 70 | 98 | 97 | 97 | 90 |
| CL: $C_{\tilde{y},y^*}$ | 89 ± 0.15 | 90 ± 0.10 | 86 ± 0.15 | 84 ± 0.12 | 75 | 78 | 84 | 80 | 67 | 70 | 78 | 77 | 86 | 88 | 91 | 84 |
| CL: PBC | 88 ± 0.22 | 88 ± 0.11 | 86 ± 0.17 | 82 ± 0.13 | 76 | 76 | 84 | 79 | 64 | 65 | 76 | 74 | 96 | 93 | 94 | 85 |
| CL: PBNR | 89 ± 0.11 | 90 ± 0.08 | 88 ± 0.12 | 84 ± 0.11 | 77 | 79 | 85 | 80 | 65 | 68 | 82 | 79 | 93 | 94 | 88 | 82 |
| CL: C+NR | 90 ± 0.21 | 90 ± 0.10 | 87 ± 0.23 | 83 ± 0.14 | 78 | 78 | 84 | 78 | 67 | 69 | 82 | 79 | 93 | 90 | 87 | 78 |

We also evaluate CL's accuracy in finding label errors. In Table 4, we compare five variants of CL methods across noise and sparsity and report their precision, recall, and F1 in recovering the true label. The results show that CL is able to find the label errors with high recall and reasonable F1.

Robustness to Sparsity. Table 2 reports CIFAR test accuracy for learning with noisy labels across noise amount and sparsity, where the first five rows report our CL approaches. As shown, CL consistently performs well compared to prior art across all noise and sparsity settings.
We observe significant improvement in high-noise and/or high-sparsity regimes. The simplest CL method, CL: $C_{\text{confusion}}$, performs similarly to INCV and comparably to prior art, with best performance by $C_{\tilde{y},y^*}$ across all noise and sparsity settings. The results validate the benefit of directly modeling the joint noise distribution and show that our method is competitive with strong robust-learning methods.

To understand why CL performs well, we evaluate CL joint estimation across noise and sparsity with RMSE in Table S1 in the Appendix and the estimated $\hat{Q}_{\tilde{y},y^*}$ in Fig. S1 in the Appendix. For the 20% and 40% noise settings, on average, CL achieves an RMSE of .004 relative to the true joint $Q_{\tilde{y},y^*}$ across all sparsities. The simplest CL variant, $C_{\text{confusion}}$ normalized via Eqn. (3) to obtain $\hat{Q}_{\text{confusion}}$, achieves a slightly worse RMSE of .006.

[Figure 2 image: three heatmaps with axes "Noisy label $\tilde{y}$" and "Latent, true label $y^*$" and color scale "Joint probability ($10^{-2}$)". Panels: (a) True $Q_{\tilde{y},y^*}$ (unknown to CL); (b) CL estimated $\hat{Q}_{\tilde{y},y^*}$; (c) Absolute diff. $|Q_{\tilde{y},y^*} - \hat{Q}_{\tilde{y},y^*}|$.]

Figure 2: Our estimation of the joint distribution of noisy labels and true labels for CIFAR with 40% label noise and 60% sparsity. Observe the similarity (RMSE = .004) between (a) and (b) and the low absolute error in every entry in (c). Probabilities are scaled up by 100.

In Fig. 2, we visualize the quality of CL joint estimation in a challenging high-noise (40%), high-sparsity (60%) regime on CIFAR. Subfigure (a) demonstrates high sparsity in the latent true joint $Q_{\tilde{y},y^*}$, with over half the noise in just six noise rates. Yet, as can be seen in subfigures (b) and (c), CL still estimates over 80% of the entries of $Q_{\tilde{y},y^*}$ within an absolute difference of .005. The results empirically substantiate the theoretical bounds of Section 4.

In Table S2 (see Appendix), we report the training time required to achieve the accuracies reported in Table 2 for INCV and confident learning. As shown in Table S2, INCV training time exceeded 20 hours. In comparison, CL takes less than three hours on the same machine: an hour for cross-validation, less than a minute to find errors, and an hour to re-train.

5.2 Real-world Label Errors in ILSVRC12 ImageNet Train Dataset

Russakovsky et al. (2015) suggest label errors exist in ImageNet due to human error, but to our knowledge, few attempts have been made to find label errors in the ILSVRC 2012 training set, characterize them, or re-train without them. Here, we consider each application. We use ResNet18 and ResNet50 architectures with standard settings: 0.1 initial learning rate, 90 training epochs with 0.9 momentum.

Table 5: Ten largest non-diagonal entries in the confident joint $C_{\tilde{y},y^*}$ for the ImageNet train set, used for ontological issue discovery. A duplicated class detected by CL (maillot) is highlighted in red.

| $C_{\tilde{y},y^*}$ | $\tilde{y}$ name | $y^*$ name | $\tilde{y}$ nid | $y^*$ nid | $C_{\text{confusion}}$ | $\hat{Q}_{\tilde{y},y^*}$ |
|---|---|---|---|---|---|---|
| 645 | projectile | missile | n04008634 | n03773504 | 494 | 0.00050 |
| 539 | tub | bathtub | n04493381 | n02808440 | 400 | 0.00042 |
| 476 | breastplate | cuirass | n02895154 | n03146219 | 398 | 0.00037 |
| 437 | green_lizard | chameleon | n01693334 | n01682714 | 369 | 0.00034 |
| 435 | chameleon | green_lizard | n01682714 | n01693334 | 362 | 0.00034 |
| 433 | missile | projectile | n03773504 | n04008634 | 362 | 0.00034 |
| 417 | maillot | maillot | n03710637 | n03710721 | 338 | 0.00033 |
| 416 | horned_viper | sidewinder | n01753488 | n01756291 | 336 | 0.00033 |
| 410 | corn | ear | n12144580 | n13133613 | 333 | 0.00032 |
| 407 | keyboard | space_bar | n04505470 | n04264628 | 293 | 0.00032 |

Ontological discovery for dataset curation. Because ImageNet is a one-hot class dataset, the classes are required to be mutually exclusive. Using ImageNet as a case study, we observe auto-discovery of ontological issues at the class level in Table 5, operationalized by listing the 10 largest non-diagonal entries in $C_{\tilde{y},y^*}$. Examples include the class maillot appearing twice, is-a relationships like "bathtub is a tub", misnomers like projectile and missile, and unanticipated issues caused by words with multiple definitions like corn and ear. We include the baseline $C_{\text{confusion}}$ to show that while $C_{\text{confusion}}$ finds fewer label errors than $C_{\tilde{y},y^*}$, the two rank ontological issues similarly.

Finding label issues. Fig. 3 depicts the top 16 label issues found using CL: PBNR with ResNet50, ordered by the normalized margin. We use the term issue versus error because examples found by CL consist of a mixture of multi-label images, ontological issues, and actual label errors. Examples of each are indicated by colored borders in the figure.

Figure 3: Top 32 (ordered automatically by normalized margin) identified label issues in the 2012 ILSVRC ImageNet train set using CL: PBNR. Errors are boxed in red. Ontological issues are boxed in green. Multi-label images are boxed in blue.
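The class-level listing behind Table 5 amounts to sorting the off-diagonal entries of a count matrix. The sketch below illustrates that step under stated assumptions: the helper name `largest_off_diagonal`, the class names, and the 3x3 counts are invented for the example and are not ImageNet values. Rows index the given (noisy) label and columns the estimated true label, following the $C_{\tilde{y},y^*}$ convention.

```python
import numpy as np

def largest_off_diagonal(C, class_names, k=10):
    """Return the k largest off-diagonal entries of a count matrix as
    (count, given-label name, estimated-true-label name) tuples.
    (Hypothetical helper for illustration.)
    """
    C = np.asarray(C)
    off = C.copy()
    np.fill_diagonal(off, -1)                      # push the diagonal to the bottom
    order = np.argsort(off, axis=None)[::-1][:k]   # flat indices, largest first
    rows, cols = np.unravel_index(order, off.shape)
    return [(int(C[i, j]), class_names[i], class_names[j])
            for i, j in zip(rows, cols) if i != j]

# Toy counts (made up): C[i, j] = examples labeled i that look like class j.
C = np.array([[50, 7, 0],
              [3, 40, 9],
              [1, 0, 60]])
names = ["a", "b", "c"]

top = largest_off_diagonal(C, names, k=2)
print(top)  # the two class pairs with the most suspected label issues
```

Pairs that appear in both directions with large counts (as green_lizard/chameleon and projectile/missile do in Table 5) are natural candidates for manual review of the class definitions.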
To evaluate CL in the absence of true labels, we conducted a small-scale human validation on a random sample of 500 errors (as identified using CL: PBNR) and found 58% were either multi-label, ontological issues, or errors. ImageNet data are often presumed error-free, yet ours is the first attempt to identify label errors automatically in ImageNet training images.

Training ResNet on ImageNet with label issues removed. By providing cleaned data for training, we explore how CL can be used to achieve similar or better validation accuracy on ImageNet when trained with less data. To understand the performance differences, we train ResNet-18 (Fig. 4) on progressively less data, removing 20%, 40%, ..., 100% of the ImageNet train set label issues identified by CL and training from scratch each time. Fig. 4 depicts the top-1 validation accuracy when training with cleaned data from CL versus removing uniformly random examples, on each of (a) the entire ILSVRC validation set, (b) the 20 (noisiest) classes with the smallest diagonal in $C_{\tilde{y},y^*}$, (c) the foxhound class, which has the smallest diagonal in $C_{\tilde{y},y^*}$, and (d) the maillot class, a known erroneous class, duplicated accidentally in ImageNet, as previously published (Hoffman et al., 2015) and verified (cf. row 7 in Table 5). For readability, we plot the best performing CL method at each point and provide the individual performance of each CL method in the Appendix (see Fig. S2). For the case of a single class, as shown in Fig. 4(c) and 4(d), we show the recall using the model's top-1 prediction, hence the comparatively larger variance in classification accuracy compared to (a) and (b). We observed that CL outperforms the random removal baseline in nearly all experiments, and improves on the no-data-removal baseline accuracy, depicted by the left-most point in the subfigures, on average over the five trials for the 1,000 and 20 class settings, as shown in Fig. 4(a) and 4(b).
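The pruning protocol above (remove a growing fraction of CL-identified issues, and compare against removing the same number of uniformly random examples) can be sketched as index bookkeeping. The helper name `pruning_schedule`, the toy sizes, and the choice to remove issues in ranked order (most severe first) are assumptions for illustration; the text does not specify which issues are dropped at each fraction.

```python
import numpy as np

def pruning_schedule(n_train, ranked_issue_idx,
                     fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Build matched train-index sets for CL pruning vs. random removal.
    (Hypothetical helper; ranked_issue_idx is assumed sorted most severe first.)
    """
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_train)
    ranked_issue_idx = np.asarray(ranked_issue_idx)
    schedule = []
    for f in fractions:
        k = int(round(f * len(ranked_issue_idx)))
        cl_removed = ranked_issue_idx[:k]
        # Baseline: drop the same number of examples, chosen uniformly at random.
        random_removed = rng.choice(n_train, size=k, replace=False)
        schedule.append({
            "fraction": f,
            "cl_keep": np.setdiff1d(all_idx, cl_removed),
            "random_keep": np.setdiff1d(all_idx, random_removed),
        })
    return schedule

# Toy setting: 100 training examples, 10 of them flagged as label issues.
ranked = np.array([5, 17, 42, 63, 88, 3, 9, 11, 70, 99])
schedule = pruning_schedule(100, ranked)
for s in schedule:
    print(s["fraction"], len(s["cl_keep"]), len(s["random_keep"]))
```

Each entry of the schedule corresponds to one independent training run per method, so the two curves in a plot like Fig. 4 are always compared at equal training-set sizes.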
To verify the result is not model-specific, we repeat each experiment for a single trial with ResNet-50 (Fig. 5) and find that CL similarly outperforms the random removal baseline. These results suggest that CL can reduce the size of a real-world noisy training dataset by 10% while still moderately improving the validation accuracy (Figures 4a, 4b, 5a, 5b) and significantly improving the validation accuracy on the erroneous maillot class (Figures 4d, 5d).

[Figure 4 image: four line plots of accuracy versus number of examples removed before training (0K to 150K), comparing the pruning methods Confident Learning and Random Removal. Panels: (a) Accuracy on the ILSVRC2012 validation set; (b) Accuracy on the top 20 noisiest classes; (c) Accuracy on the noisiest class: foxhound; (d) Accuracy on known erroneous class: maillot.]

Figure 4: ResNet-18 validation accuracy on ImageNet (ILSVRC2012) when 20%, 40%, ..., 100% of the label issues found using confident learning are removed prior to training (blue, solid line) compared with random examples removed prior to training (orange, dash-dotted line). Each subplot is read from left to right as incrementally more CL-identified issues are removed prior to training (shown by the x-axis). The translucent black dotted vertical bars measure the improvement when removing examples with CL versus random examples. Each point in all subfigures represents an independent training of ResNet-18 from scratch. Each point on the graph depicts the average accuracy of 5 trials (varying random seeding and weight initialization). The capped, colored vertical bars depict the standard deviation.
While we find CL methods may improve the standard ImageNet training on clean training data by filtering out a subset of training examples, the significance of this result lies not in the magnitude of improvement, but as a warrant for exploring the use of cleaning methods when training with ImageNet, which is typically assumed to have correct labels. Whereas many of the label issues in ImageNet are due to multi-labeled examples (Yun et al., 2021), next we consider a dataset with disjoint classes.

[Figure 5 image: four line plots of accuracy versus number of examples removed before training (0K to 150K), comparing the pruning methods Confident Learning and Random Removal. Panels: (a) Accuracy on the ILSVRC2012 validation set; (b) Accuracy on the top 20 noisiest classes; (c) Accuracy on the noisiest class: foxhound; (d) Accuracy on known erroneous class: maillot.]

Figure 5: Replication of the experiments in Fig. 4 with ResNet-50. Each point in each subfigure depicts the accuracy of a single trial (due to computational limitations). Error bars, shown by the colored vertical lines, are estimated via Clopper-Pearson intervals for subfigures (a) and (b). For additional information, see the caption of Fig. 4.

5.3 Amazon Reviews Dataset: CL using logistic regression on noisy text data

The Amazon Reviews dataset is a corpus of textual reviews labeled with 1-star to 5-star ratings from Amazon customers, used to benchmark sentiment analysis models (He and McAuley, 2016). We study the 5-core (9.9 GB) variant of the dataset, i.e., the subset of data in which all users and items have at least 5 reviews. 2-star and 4-star reviews are removed due to ambiguity with 1-star and 5-star reviews, respectively. Left in the dataset, 2-star and 4-star reviews could inflate error counts, making CL appear to be more effective than it is.

This subsection serves three goals.
First, we use a logistic regression classifier, as opposed to a deep-learning model, for our experiments in this section to evaluate CL for non-deep-learning methods. Second, we seek to understand how CL may improve learning with noise in the label space of text data, but not noise in the text data itself (e.g., typos). Towards this goal, we consider non-empty reviews with more helpful up-votes than down-votes; the resulting dataset consists of approximately ten million reviews. Finally, Theorem 2 shows that CL is robust to class imbalance, but datasets like ImageNet and CIFAR-10 are balanced by construction. The Amazon Reviews dataset, however, is naturally and extremely imbalanced: the distribution of given labels (i.e., the noisy prior) is 9% 1-star reviews, 12% 3-star reviews, and 79% 5-star reviews. We seek to understand if CL can find label errors and improve performance in learning with noisy labels in this class-imbalanced setting.

Training settings. To demonstrate that non-deep-learning methods can be effective in finding label issues under the CL framework, we use a multinomial logistic regression classifier for both finding label errors and learning with noisy labels. The built-in SGD optimizer in the open-sourced fastText library (Joulin et al., 2017) is used with settings: initial learning rate = 0.1, embedding dimension = 100, and n-gram = 3. Out-of-sample predicted probabilities are obtained via 5-fold cross-validation. For input during training, a review is represented as the mean of pre-trained, tri-gram, word-level fastText embeddings (Bojanowski et al., 2017).

Finding label issues. Table 6 shows examples of label issues in the Amazon Reviews dataset found automatically using the CL: C+NR variant of confident learning.
We observe qualitatively that most label issues identified by CL in this context are reasonable except for sarcastic reviews, which appear to be poorly modeled by the bag-of-words approach.

Table 6: Top 20 CL-identified label issues in the Amazon Reviews text dataset using CL: C+NR, ordered by normalized margin. A logistic regression classifier trained on fastText embeddings is used to obtain out-of-sample predicted probabilities. Most errors are reasonable, with the exception of sarcastic reviews, which are poorly modeled by the bag-of-words model.

[Table 6 columns: Review, Given Label, CL Guess. Only the review text is recoverable; the reviews are run together below, with reviewer spelling kept as written.]

A very good addition to kindle. Cleans and scans. Very easy TO USE Buy it and enjoy a great story. Works great! I highly recommend it to everyone that enjoys singing hymns! Love it! Love it! Love it! :) . Awesome it was better than all the other my weirder school books. I love it! The best book ever.Awesome I gave this 5 stars under duress. I would rather give it 3 stars. it plays fine but it is a little boring so far. only six words: don't waist your money on this I love it so much at first I though it would be boring but turns out its fun for all ages get it Excellent read, could not put it down! Keep up the great works ms. Brown. Cannot wait to download the next one. This is one of the easiest to use games I have ever played. It is adaptable and fun. I love it. So this is what today's music has become? Sarah and Charlie, what a wonderful story. I loved this book and look forward to reading more of this series. I've had this for over a year and it works very well. I am very happy with this purchase. this show is insane and I love it. I will be ordering more seasons of it. Just what the world needs, more generic r&b. I did like the Making Of This Is movie it okay it not the best okay it not great . Tough game. But of course it has the very best sound track ever! unexpected kid on the way thanks to this shit The kids are fascinated by it, Plus my wife loves it.. I love it I love it we love it Loved this book! A great story and insight into the time period and life during those times. Highly recommend this book Great reading I could not put it down. Highly recommend reading this book. You will not be disappointed. Must read.

Table 7: Ablation study (varying train set size, test split, and epochs) comparing test accuracy (%) of CL methods versus a standard training baseline for classifying noisy, real-world Amazon reviews text data as either 1-star, 3-stars, or 5-stars. A simple multinomial logistic regression classifier is used. Mean top-1 accuracy ± standard deviation is reported over five trials. The number of estimated label errors CL methods removed prior to training is shown in the Pruned columns. Baseline training begins to overfit to noise with additional epochs trained, whereas CL test accuracy continues to increase (cf. N=1000K, Epochs: 50).

| Test split | Method | N=1000K, Epochs: 5 | N=1000K, Epochs: 20 | N=1000K, Epochs: 50 | N=1000K, Pruned | N=500K, Epochs: 5 | N=500K, Epochs: 20 | N=500K, Pruned |
|---|---|---|---|---|---|---|---|---|
| 10th | CL: $C_{\text{confusion}}$ | 85.2 ± 0.06 | 89.2 ± 0.02 | 90.0 ± 0.02 | 291K | 86.6 ± 0.03 | 86.6 ± 0.03 | 259K |
| 10th | CL: C+NR | 86.3 ± 0.04 | 89.8 ± 0.01 | 90.2 ± 0.01 | 250K | 87.5 ± 0.05 | 87.5 ± 0.03 | 244K |
| 10th | CL: $C_{\tilde{y},y^*}$ | 86.4 ± 0.01 | 89.8 ± 0.02 | 90.1 ± 0.02 | 246K | 87.5 ± 0.02 | 87.5 ± 0.02 | 243K |
| 10th | CL: PBC | 86.2 ± 0.03 | 89.7 ± 0.01 | 90.2 ± 0.01 | 257K | 87.4 ± 0.03 | 87.4 ± 0.03 | 247K |
| 10th | CL: PBNR | 86.2 ± 0.07 | 89.7 ± 0.01 | 90.2 ± 0.01 | 257K | 87.4 ± 0.05 | 87.4 ± 0.05 | 247K |
| 10th | Baseline | 83.9 ± 0.11 | 86.3 ± 0.06 | 84.4 ± 0.04 | 0K | 82.7 ± 0.07 | 82.8 ± 0.07 | 0K |
| 11th | CL: $C_{\text{confusion}}$ | 85.3 ± 0.05 | 89.3 ± 0.01 | 90.0 ± 0.0 | 294K | 86.6 ± 0.04 | 86.6 ± 0.06 | 261K |
| 11th | CL: C+NR | 86.4 ± 0.06 | 89.8 ± 0.01 | 90.2 ± 0.01 | 252K | 87.5 ± 0.04 | 87.5 ± 0.03 | 247K |
| 11th | CL: $C_{\tilde{y},y^*}$ | 86.3 ± 0.05 | 89.8 ± 0.01 | 90.1 ± 0.02 | 249K | 87.5 ± 0.03 | 87.5 ± 0.02 | 246K |
| 11th | CL: PBC | 86.2 ± 0.03 | 89.8 ± 0.01 | 90.3 ± 0.0 | 260K | 87.4 ± 0.03 | 87.4 ± 0.05 | 250K |
| 11th | CL: PBNR | 86.2 ± 0.06 | 89.8 ± 0.01 | 90.2 ± 0.02 | 260K | 87.4 ± 0.05 | 87.4 ± 0.03 | 249K |
| 11th | Baseline | 83.9 ± 0.0 | 86.3 ± 0.05 | 84.4 ± 0.12 | 0K | 82.7 ± 0.04 | 82.7 ± 0.09 | 0K |

Learning with noisy labels / weak supervision. We compare the CL methods, which prune errors from the train set and
subsequently provide clean data for training, versus a standard training baseline (denoted Baseline in Table 7), which trains on the original, uncleaned train dataset. The same training settings used to find label errors (see Subsection 5.3) are used to obtain all scores reported in Table 7 for all methods. For a fair comparison, all mean accuracies in Table 7 are reported on the same held-out test set, created by splitting the Amazon Reviews dataset into a train set and test set such that every tenth example is placed in a test set and the remaining data is available for training (the Amazon Reviews 5-core dataset provides no explicit train set and test set).

The Amazon Reviews dataset is naturally noisy, but the fraction of noise in the dataset is estimated to be less than 4% (Northcutt et al., 2021), which makes studying the benefits of providing clean data for training challenging. To increase the percentage of noisy labels without adding synthetic noise, we subsample 1 million training examples from the train set by combining the label issues identified by all five CL methods from the original training data (244K examples) and a uniformly random subsample (766K examples) of the remaining cleaner training data. This process increases the percentage of label noise to 24% (estimated) in the train set and, importantly, does not increase the percentage of noisy labels in the test set; large amounts of test set label noise have been shown to severely impact benchmark rankings (Northcutt et al., 2021).

To mitigate the bias induced by the choice of train set size, test set split, and the number of epochs trained, we conduct an ablation study shown in Table 7. For the train set size, we repeat each experiment with train set sizes of 1 million examples and 500,000 examples. For the test set split, we repeat all experiments by removing every eleventh example (instead of every tenth) in our train/test split (cf. the first column in Table 7), minimizing the overlap (9%) between the two test sets. For the number of epochs trained, we repeat each experiment with 5, 20, and 50 epochs. We omit (N = 500K, Epochs: 50) because no learning occurs after 5 epochs. Every score reported in Table 7 is the mean and standard deviation of five trials: each trial varies the randomly selected subset of training data and the initial weights of the logistic regression model used for training.

The results in Table 7 reveal three notable observations. First, all CL methods outperform the baseline method by a significant margin in all cases. Second, CL methods outperform the baseline method even with nearly half of the training data pruned (Table 7, cf. N=500K). Finally, for the train set size N = 1000K, baseline training begins to overfit to noise with additional epochs trained, whereas CL test accuracy continues to increase (cf. N=1000K, Epochs: 50), suggesting CL robustness to overfitting to noise during training. The results in Table 7 suggest CL's efficacy for noisy supervision with logistic regression in the context of text data.

5.4 Real-world Label Errors in Other Datasets

We use CL to find label errors in the purportedly "error-free" MNIST dataset, comprised of preprocessed black-and-white handwritten digits, and also in the noisy-labeled WebVision dataset (Li et al., 2017a), comprised of color images collected from online image repositories, using the search query as the noisy label.
[Figure 6 panel annotations, in extracted order. Each entry: train img # | given label (conf) | convnet guess (conf)]
59915 | given: 4 (0.0) | guess: 7 (1.0)
1604 | given: 4 (0.0) | guess: 9 (1.0)
43454 | given: 5 (0.0) | guess: 3 (1.0)
37038 | given: 1 (0.0) | guess: 2 (1.0)
40144 | given: 5 (0.0) | guess: 3 (1.0)
51944 | given: 4 (0.0) | guess: 9 (1.0)
8729 | given: 3 (0.0) | guess: 7 (0.998)
43109 | given: 8 (0.001) | guess: 1 (0.998)
51248 | given: 9 (0.001) | guess: 4 (0.999)
26748 | given: 9 (0.001) | guess: 4 (0.999)
902 | given: 9 (0.001) | guess: 0 (0.999)
25562 | given: 9 (0.001) | guess: 7 (0.999)
7080 | given: 3 (0.002) | guess: 5 (0.998)
26560 | given: 7 (0.002) | guess: 1 (0.998)
30049 | given: 9 (0.002) | guess: 5 (0.998)
44484 | given: 8 (0.002) | guess: 2 (0.994)
34750 | given: 4 (0.003) | guess: 9 (0.997)
41284 | given: 2 (0.004) | guess: 7 (0.996)
23911 | given: 1 (0.004) | guess: 7 (0.995)
54264 | given: 4 (0.006) | guess: 1 (0.994)
11210 | given: 8 (0.007) | guess: 9 (0.991)
53806 | given: 8 (0.007) | guess: 9 (0.993)
31134 | given: 1 (0.007) | guess: 2 (0.993)
10994 | given: 3 (0.008) | guess: 9 (0.982)

Figure 6: Label errors in the original, unperturbed MNIST train dataset identified using CL: PBNR.
These are the top 24 errors found by CL, ordered left-right, top-down by increasing self-confidence, denoted conf in teal. The predicted label, $\arg\max_k \hat{p}(\tilde{y} = k; x, \theta)$, is in green. Overt errors are in red. This dataset is assumed error-free in tens of thousands of studies.

To our surprise, the original, unperturbed MNIST dataset, which is predominantly assumed error-free, contains blatant label errors, highlighted by the red boxes in Fig. 6. To find label errors in MNIST, we pre-trained a simple 2-layer CNN for 50 epochs, then used cross-validation to obtain $\hat{P}_{k,i}$, the out-of-sample predicted probabilities for the train set. CL: PBNR was used to identify the errors. The top 24 label errors, ordered by self-confidence, are shown in Fig. 6. For verification, the indices of the train label errors are shown in grey.

[Figure 7 panels, each annotated with the given label, the model's guess, and an image file ID. Given → Guess, in extracted order: TORCH → SWITCH; JUNCO → BIRDHOUSE; BLACK BEAR → COUGAR; ICE CREAM → ICE POP; RADIO TELESCOPE → SOLAR THERMAL COLLECTOR; DISC BRAKE → SWITCH; BLACK BEAR → BADGER; CAROUSEL → WEBSITE; RADIO TELESCOPE → TOBACCO SHOP; BORZOI → MUZZLE; BORZOI → HIPPOPOTAMUS; CAROUSEL → WEBSITE; RADIO TELESCOPE → AIRCRAFT CARRIER; JUNCO → HUMMINGBIRD; CAROUSEL → TRIUMPHAL ARCH; REDBONE COONHOUND → KEESHOND; BLACK BEAR → MILITARY CAP; RADIO TELESCOPE → SOLAR THERMAL COLLECTOR; ICE CREAM → PLASTIC BAG; REDBONE COONHOUND → MAILBOX; ICE CREAM → BANANA; IRISH WOLFHOUND → MILITARY CAP; WHITE CAPUCHIN → OSTRICH; BLACK BEAR → VOLCANO; RADIO → MICROPHONE; ICE CREAM → PLASTIC BAG; ICE CREAM → ICE POP; WORM SNAKE → INDIAN COBRA; RADIATOR → ELECTRIC FAN; RADIATOR → ELECTRIC FAN; WHITE CAPUCHIN → SKUNK; ICE CREAM → ICE POP.]

Figure 7: Top 32 identified label issues in the WebVision train set using CL: $C_{\tilde{y}, y^*}$. Out-of-sample predicted probabilities are obtained using a model pre-trained on ImageNet, avoiding training entirely. Errors are boxed in red. Ambiguous cases or mistakes are boxed in black. Label errors are ordered automatically by normalized margin.

To find label errors in WebVision, we used a pre-trained model to obtain $\hat{P}_{k,i}$, observing two practical advantages of CL: (1) a pre-trained model can be used to obtain $\hat{P}_{k,i}$ out-of-sample instead of cross-validation, and (2) this makes CL fast. For example, finding label errors in WebVision, with over a million images and 1,000 classes, took three minutes on a laptop using a pre-trained ResNext model that had never seen the noisy WebVision train set before. We used the CL: $C_{\tilde{y}, y^*}$ method to find the label errors and ordered errors by normalized margins. Examples of WebVision label errors found by CL are shown in Fig. 7.

6. Related work

We first discuss prior work on confident learning, then review how CL relates to noise estimation and robust learning.

Confident learning. Our results build on a large body of work termed "confident learning".
Elkan (2001) and Forman (2005) pioneered counting approaches to estimate false positive and false negative rates for binary classification. We extend counting principles to the multi-class setting. To increase robustness against epistemic error in predicted probabilities and class imbalance, Elkan and Noto (2008) introduced thresholding, but required uncorrupted positive labels. CL generalizes the use of thresholds to multi-class noisy labels. CL also reweights the loss during training to adjust priors for the data removed. This choice builds on formative works (Natarajan et al., 2013; Van Rooyen et al., 2015) which used loss reweighting to prove equivalent empirical risk minimization for learning with noisy labels. More recently, Han et al. (2019) proposed an empirical deep self-supervised learning approach to avoid probabilities by using embedding layers of a neural network. In comparison, CL is non-iterative and theoretically grounded. Lipton et al. (2018) estimate label noise using approaches based on confusion matrices and cross-validation. However, unlike CL, their approach assumes a less general form of label shift than class-conditional noise. Huang et al. (2019) demonstrate the empirical efficacy of first finding label errors, then training on clean data, but the study evaluates only uniform (symmetric) and pair label noise; CL augments these empirical findings with theoretical justification for the broader class of asymmetric and class-conditional label noise.

Theory: a model-free, data-free approach. Theoretical analysis with noisy labels often assumes a restricted class of models or data to disambiguate model noise from label noise. For example, Shen and Sanghavi (2019) provide theoretical guarantees for learning with noisy labels in a more general setting than CL that includes adversarial examples and noisy data, but limit their findings to generalized linear models.
CL theory is model and dataset agnostic, instead restricting the magnitude of example-level noise. In a formative related approach, Xu et al. (2019) prove that using the loss function $\log(|\det(Q_{\tilde{y}, y^*})|)$ enables noise-robust training for any model and dataset, further justified by performant empirical results. Similar to confident learning, their approach hinges on the use of $Q_{\tilde{y}, y^*}$; however, they require that $Q_{\tilde{y}|y^*}$ is invertible and estimate $Q_{\tilde{y}, y^*}$ using $C_{\text{confusion}}$, which is sensitive to class imbalance and heterogeneous class probability distributions (see Sec. 3.1). In Sec. 4, we show sufficient conditions in Thm. 2 where $C_{\tilde{y}, y^*}$ exactly finds label errors, regardless of each class's probability distribution.

Uncertainty quantification and label noise estimation. A number of formative works developed solutions to estimate noise rates using convergence criteria (Scott, 2015), positive-unlabeled learning (Elkan and Noto, 2008), and predicted probability ratios (Northcutt et al., 2017), but are limited to binary classification. Others prove equivalent empirical risk for binary learning with noisy labels (Natarajan et al., 2013; Liu and Tao, 2015; Sugiyama et al., 2012) assuming noise rates are known, which is rarely true in practice.
Unlike these binary approaches, CL estimates label uncertainty in the multiclass setting, where prior work often falls into five categories: (1) theoretical contributions (Katz-Samuels et al., 2019), (2) loss modification for label noise robustness (Patrini et al., 2016, 2017; Sukhbaatar et al., 2015; Van Rooyen et al., 2015), (3) deep learning and model-specific approaches (Sukhbaatar et al., 2015; Patrini et al., 2016; Jindal et al., 2016), (4) crowd-sourced labels via multiple workers (Zhang et al., 2017b; Dawid and Skene, 1979; Ratner et al., 2016), and (5) factorization, distillation (Li et al., 2017b), and imputation (Amjad et al., 2018) methods, among others (Sáez et al., 2014). Unlike these approaches, CL provides a consistent estimator for exact estimation of the joint distribution of noisy and true labels directly, under practical conditions.

Label-noise robust learning. Beyond the above noise estimation approaches, extensive studies have investigated training models on noisy datasets, e.g., (Beigman and Klebanov, 2009; Brodley and Friedl, 1999). Noise-robust learning is important for deep learning because modern neural networks trained on noisy labels generalize poorly on clean validation data (Zhang et al., 2017a). A notable recent trend in noise-robust learning is benchmarking with symmetric label noise in which labels are uniformly flipped, e.g., (Goldberger and Ben-Reuven, 2017; Arazo et al., 2019). However, noise in real-world datasets is highly non-uniform and often sparse. For example, in ImageNet (Russakovsky et al., 2015), missile is likely to be mislabeled as projectile, but has a near-zero probability of being mislabeled as most other classes like wool, ox, or wine. To approximate real-world noise, an increasing number of studies examined asymmetric noise using, e.g.,
loss or label correction (Patrini et al., 2017; Reed et al., 2015; Goldberger and Ben-Reuven, 2017), per-example loss reweighting (Jiang et al., 2020, 2018; Shu et al., 2019), Co-Teaching (Han et al., 2018), semi-supervised learning (Hendrycks et al., 2018; Li et al., 2017b; Vahdat, 2017; Li et al., 2020), and symmetric cross entropy (Wang et al., 2019), among others. These approaches work by introducing new models or insightful modifications to the loss function during training. CL takes a loss-agnostic approach, instead focusing on generating clean data for training by directly estimating the joint distribution of noisy and true labels.

Comparison of the INCV Method and Confident Learning. The INCV algorithm (Chen et al., 2019) and confident learning both estimate clean data, use cross-validation, and use aspects of confusion matrices to deal with label errors in ML workflows. Due to these similarities, we discuss four key differences between confident learning and INCV. First, INCV errors are found using an iterative version of the $C_{\text{confusion}}$ confident learning baseline: any example with a different given label than its argmax prediction is considered a label error. This approach, while effective (see Table 2), fails to properly count errors under class imbalance or when a model is more confident (larger or smaller probabilities on average) for certain classes than others, as discussed in Section 4. To account for this class-level bias in predicted probabilities and enable robustness, confident learning uses theoretically-supported (see Section 4) thresholds (Elkan, 2001; Richard and Lippmann, 1991) while estimating the confident joint. Second, a major contribution of CL is finding the label errors in the presumed error-free benchmarks such as ImageNet and MNIST, whereas INCV emphasizes empirical results for learning with noisy labels.
Third, in each INCV training iteration, 2-fold cross-validation is performed. The iterative nature of INCV makes training slow (see Appendix Table S2) and uses less data during training. Unlike INCV, confident learning is not iterative. In confident learning, cross-validated probabilities are computed only once beforehand; from these, the joint distribution of noisy and true labels is directly estimated and used to identify clean data for a single pass of re-training. We demonstrate this approach is experimentally performant without iteration (see Table 2). Finally, confident learning is modular. CL approaches for training, finding label errors, and ordering label errors for removal are independent. In INCV, the procedure is iterative, and all three steps are tied together in a single looping process. A single iteration of INCV equates to the $C_{\text{confusion}}$ baseline benchmarked in this paper.

7. Conclusion and Future Work

Following the principles of confident learning, we developed a novel approach to estimate the joint distribution of label noise and explicated theoretical and experimental insights into the benefits of doing so. We demonstrated accurate uncertainty quantification in high noise and sparsity regimes, across multiple datasets, data modalities, and model architectures. We empirically evaluated three criteria: (1) uncertainty quantification via estimation of the joint distribution of label noise, (2) finding label errors, and (3) learning with noisy labels on CIFAR-10, and found that CL methods outperform recent prior art across all three. These findings emphasize the practical nature of confident learning, identifying numerous pre-existing label issues in ImageNet, Amazon Reviews, MNIST, and other datasets, and improving the performance of learning models like deep neural networks by training on a cleaned dataset.
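The non-iterative workflow summarized in the INCV comparison (compute out-of-sample probabilities once, estimate the confident joint, prune, retrain once) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the cleanlab implementation: the function names `class_thresholds` and `confident_joint`, the toy probabilities, and the choice to prune every off-diagonal example are hypothetical. Thresholds are taken to be the per-class mean self-confidence, as in Lemma 1, and collisions are broken by arg max, as in the proof of Theorem 1.

```python
def class_thresholds(probs, labels, m):
    """t_j: mean predicted probability of class j over examples labeled j.

    Assumes every class appears at least once among the given labels.
    """
    sums, counts = [0.0] * m, [0] * m
    for p, y in zip(probs, labels):
        sums[y] += p[y]
        counts[y] += 1
    return [s / c for s, c in zip(sums, counts)]


def confident_joint(probs, labels, m):
    """Count examples by (given label, confidently predicted label)."""
    t = class_thresholds(probs, labels, m)
    joint = [[0] * m for _ in range(m)]
    issues = []
    for idx, (p, y) in enumerate(zip(probs, labels)):
        confident = [j for j in range(m) if p[j] >= t[j]]
        if not confident:
            continue  # no class clears its threshold, so the example is not counted
        j_star = max(confident, key=lambda j: p[j])  # break collisions by arg max
        joint[y][j_star] += 1
        if j_star != y:
            issues.append(idx)  # off-diagonal entry: likely label error
    return joint, issues


# Hypothetical out-of-sample probabilities, computed once before any pruning.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.90, 0.10], [0.80, 0.20], [0.10, 0.90],
         [0.20, 0.80], [0.30, 0.70], [0.85, 0.15]]

joint, issues = confident_joint(probs, labels, m=2)
clean_idx = [i for i in range(len(labels)) if i not in issues]
print(joint, issues, clean_idx)
```

A single retraining pass would then use only the examples in `clean_idx`. Because finding issues, ranking them, and retraining are separate steps, any one of them can be swapped out without touching the others.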
Confident learning motivates the need for further understanding of dataset uncertainty estimation, methods to clean training and test sets, and approaches to identify ontological and label issues for dataset curation. Future directions include validation of CL methods on more datasets such as the OpenML Benchmark (Feurer et al., 2019), the multi-modal Egocentric Communications (EgoCom) benchmark (Northcutt et al., 2020), and the realistic noisy label benchmark CNWL (Jiang et al., 2020); evaluation of CL methods using other non-neural network models, such as random forests and XGBoost; examination of other threshold function formulations; examination of label errors in test sets and how they affect machine learning benchmarks at scale (Northcutt et al., 2021); assimilation of CL label error finding with pseudo-labeling and/or curriculum learning to dynamically provide clean data during training; and further exploration of iterative and/or regression-based extensions of CL methods.

Acknowledgements

We thank the following colleagues: Jonas Mueller assisted with notation. Anish Athalye suggested starting the proof in claim 1 of Theorem 1 with the identity. Tailin Wu contributed to Lemma 1. Niranjan Subrahmanya provided feedback on baselines for confident learning.

References

Amjad, M., Shah, D., and Shen, D. (2018). Robust synthetic control. Journal of Machine Learning Research (JMLR), 19(1):802–852.
Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4):343–370.
Arazo, E., Ortego, D., Albert, P., O'Connor, N. E., and McGuinness, K. (2019). Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning (ICML).
Beigman, E. and Klebanov, B. B. (2009). Learning with annotation noise. In Annual Conference of the Association for Computational Linguistics (ACL).
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017).
Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Bouguelia, M.-R., Nowaczyk, S., Santosh, K., and Verikas, A. (2018). Agreeing to disagree: active learning with noisy labels without crowdsourcing. International Journal of Machine Learning and Cybernetics, 9(8):1307–1319.
Brodley, C. E. and Friedl, M. A. (1999). Identifying mislabeled training data. Journal of Artificial Intelligence Research (JAIR), 11:131–167.
Chen, P., Liao, B. B., Chen, G., and Zhang, S. (2019). Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning (ICML).
Chowdhary, K. and Dupuis, P. (2013). Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. Mathematical Modelling and Numerical Analysis (ESAIM), 47(3):635–662.
Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28.
Elkan, C. (2001). The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence (IJCAI).
Elkan, C. and Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Feurer, M., van Rijn, J. N., Kadra, A., Gijsbers, P., Mallik, N., Ravi, S., Müller, A., Vanschoren, J., and Hutter, F. (2019). OpenML-Python: an extensible Python API for OpenML. arXiv preprint arXiv:1911.02490.
Forman, G. (2005). Counting positives accurately despite inaccurate classification. In European Conference on Computer Vision (ECCV).
Forman, G. (2008). Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206.
Goldberger, J. and Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation layer.
In International Conference on Learning Representations (ICLR).
Graepel, T. and Herbrich, R. (2001). The kernel Gibbs sampler. In Conference on Neural Information Processing Systems (NeurIPS).
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
Halpern, Y., Horng, S., Choi, Y., and Sontag, D. (2016). Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4):731–740.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Conference on Neural Information Processing Systems (NeurIPS).
Han, J., Luo, P., and Wang, X. (2019). Deep self-learning from noisy labels. In International Conference on Computer Vision (ICCV).
He, R. and McAuley, J. (2016). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In International Conference on World Wide Web (WWW).
Hendrycks, D. and Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR).
Hendrycks, D., Mazeika, M., Wilson, D., and Gimpel, K. (2018). Using trusted data to train deep networks on labels corrupted by severe noise. In Conference on Neural Information Processing Systems (NeurIPS).
Hoffman, J., Pathak, D., Darrell, T., and Saenko, K. (2015). Detector discovery in the wild: Joint multiple instance and representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR).
Huang, J., Qu, L., Jia, R., and Zhao, B. (2019). O2U-Net: A simple noisy label detection approach for deep neural networks. In International Conference on Computer Vision (ICCV).
Jiang, L., Huang, D., Liu, M., and Yang, W. (2020).
Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning (ICML).
Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. (2018). MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML).
Jindal, I., Nokleby, M., and Chen, X. (2016). Learning deep networks from noisy labels with dropout regularization. In International Conference on Data Mining (ICDM).
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017). Bag of tricks for efficient text classification. In Annual Conference of the Association for Computational Linguistics (ACL).
Katz-Samuels, J., Blanchard, G., and Scott, C. (2019). Decontamination of mutual contamination models. Journal of Machine Learning Research (JMLR), 20(41):1–57.
Khetan, A., Lipton, Z. C., and Anandkumar, A. (2018). Learning from noisy singly-labeled data. In International Conference on Learning Representations (ICLR).
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto.
Lawrence, N. D. and Schölkopf, B. (2001). Estimating a kernel Fisher discriminant in the presence of label noise. In International Conference on Machine Learning (ICML).
Li, J., Socher, R., and Hoi, S. C. (2020). DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations (ICLR).
Li, W., Wang, L., Li, W., Agustsson, E., and Van Gool, L. (2017a). WebVision database: Visual learning and understanding from web data. arXiv:1708.02862.
Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., and Li, L.-J. (2017b). Learning from noisy labels with distillation. In International Conference on Computer Vision (ICCV).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV).
Lipton, Z., Wang, Y.-X., and Smola, A. (2018). Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML).
Liu, T. and Tao, D. (2015). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(3):447–461.
Natarajan, N., Dhillon, I. S., Ravikumar, P., and Tewari, A. (2017). Cost-sensitive learning with noisy labels. Journal of Machine Learning Research (JMLR), 18:155–1.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2013). Learning with noisy labels. In Conference on Neural Information Processing Systems (NeurIPS).
Northcutt, C., Zha, S., Lovegrove, S., and Newcombe, R. (2020). EgoCom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. In International Conference on Learning Representations Workshop Track (ICLR).
Northcutt, C. G., Ho, A. D., and Chuang, I. L. (2016). Detecting and preventing multiple-account cheating in massive open online courses. Computers & Education, 100:71–80.
Northcutt, C. G., Wu, T., and Chuang, I. L. (2017). Learning with confident examples: Rank pruning for robust classification with noisy labels. In Conference on Uncertainty in Artificial Intelligence (UAI).
Page, L., Brin, S., Motwani, R., and Winograd, T. (1997). PageRank: Bringing order to the web. Technical report, Stanford Digital Libraries Working Paper.
Patrini, G., Nielsen, F., Nock, R., and Carioni, M. (2016). Loss factorization, weakly supervised learning and label noise robustness.
In International Conference on Machine Learning (ICML).
Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. (2017). Making deep neural networks robust to label noise: A loss correction approach. In Conference on Computer Vision and Pattern Recognition (CVPR).
Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., and Ré, C. (2016). Data programming: Creating large training sets, quickly. In Conference on Neural Information Processing Systems (NeurIPS).
Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. (2015). Training deep neural networks on noisy labels with bootstrapping. In International Conference on Learning Representations (ICLR).
Richard, M. D. and Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461–483.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
Sáez, J. A., Galar, M., Luengo, J., and Herrera, F. (2014). Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition. Knowledge and Information Systems, 38(1):179–206.
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., and Aroyo, L. M. (2021). "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. In Conference on Human Factors in Computing Systems (CHI).
Scott, C. (2015). A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Shen, Y. and Sanghavi, S. (2019). Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research.
Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. (2019). Meta-Weight-Net: Learning an explicit mapping for sample weighting. In Conference on Neural Information Processing Systems (NeurIPS).
Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density Ratio Estimation in ML. Cambridge University Press, New York, NY, USA, 1st edition.
Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., and Fergus, R. (2015). Training convolutional networks with noisy labels. In International Conference on Learning Representations (ICLR).
Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D. C., and Silberman, N. (2019a). Learning from noisy labels by regularized estimation of annotator confusion. In Conference on Computer Vision and Pattern Recognition (CVPR).
Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D. C., and Silberman, N. (2019b). Learning from noisy labels by regularized estimation of annotator confusion. In Conference on Computer Vision and Pattern Recognition (CVPR).
Vahdat, A. (2017). Toward robustness against label noise in training deep discriminative neural networks. In Conference on Neural Information Processing Systems (NeurIPS).
Van Rooyen, B., Menon, A., and Williamson, R. C. (2015). Learning with symmetric label noise: The importance of being unhinged. In Conference on Neural Information Processing Systems (NeurIPS).
Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. (2019). Symmetric cross entropy for robust learning with noisy labels. In International Conference on Computer Vision (ICCV).
Wei, C., Lee, J. D., Liu, Q., and Ma, T. (2018). On the margin theory of feedforward neural networks. Computing Research Repository (CoRR).
Xu, Y., Cao, P., Kong, Y., and Wang, Y. (2019). L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. In Conference on Neural Information Processing Systems (NeurIPS).
Yun, S., Oh, S.
J., Heo, B., Han, D., Choe, J., and Chun, S. (2021). Re-labeling ImageNet: from single to multi-labels, from global to localized labels. In Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017a). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR).
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR).
Zhang, J., Sheng, V. S., Li, T., and Wu, X. (2017b). Improving crowdsourced label quality using noise correction. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1675–1688.

Appendix A. Theorems and proofs for confident learning

In this section, we restate the main theorems for confident learning and provide their proofs.

Lemma 1 (Ideal Thresholds). For a noisy dataset $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta$, if $\hat{p}(\tilde{y}; x, \theta)$ is ideal, then $\forall i \in [m]$, $t_i = \sum_{j \in [m]} p(\tilde{y} = i | y^* = j)\, p(y^* = j | \tilde{y} = i)$.

Proof. We use $t_i$ to denote the thresholds used to partition $X$ into $m$ bins, each estimating one of $X_{y^*}$. By definition,

$$\forall i \in [m], \quad t_i = \mathbb{E}_{x \in X_{\tilde{y}=i}}\, \hat{p}(\tilde{y} = i; x, \theta)$$

For any $t_i$, we show the following.

$$
\begin{aligned}
t_i &= \mathbb{E}_{x \in X_{\tilde{y}=i}} \sum_{j \in [m]} \hat{p}(\tilde{y}=i | y^*=j; x, \theta)\, \hat{p}(y^*=j; x, \theta) && \text{Bayes Rule} \\
t_i &= \mathbb{E}_{x \in X_{\tilde{y}=i}} \sum_{j \in [m]} \hat{p}(\tilde{y}=i | y^*=j)\, \hat{p}(y^*=j; x, \theta) && \text{Class-conditional Noise Process (CNP)} \\
t_i &= \sum_{j \in [m]} \hat{p}(\tilde{y}=i | y^*=j)\, \mathbb{E}_{x \in X_{\tilde{y}=i}}\, \hat{p}(y^*=j; x, \theta) \\
t_i &= \sum_{j \in [m]} p(\tilde{y}=i | y^*=j)\, p(y^*=j | \tilde{y}=i) && \text{Ideal Condition}
\end{aligned}
$$

This form of the threshold is intuitively reasonable: the contributions to the sum when $i = j$ represent the probabilities of correct labeling, whereas when $i \neq j$, the terms give the probabilities of mislabeling $p(\tilde{y} = i | y^* = j)$, weighted by the probability $p(y^* = j | \tilde{y} = i)$ that the mislabeling is corrected.

Theorem 1 (Exact Label Errors).
For a noisy dataset, $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta: x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is ideal and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row and column, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$ (consistent estimator for $Q_{\tilde{y}, y^*}$).

Proof. Alg. 1 defines the construction of the confident joint. We consider Case 1, when there are collisions (trivial by the construction of Alg. 1), and Case 2, when there are no collisions (harder).

Case 1 (collisions): When a collision occurs, by the construction of the confident joint (Eqn. 1), a given example $x_k$ gets assigned bijectively into bin

$$x_k \in \hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][\arg\max_{i \in [m]} \hat{p}(\tilde{y} = i; x, \theta)]$$

Because we have that $\hat{p}(\tilde{y}; x, \theta)$ is ideal, we can rewrite this as

$$x_k \in \hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][\arg\max_{i \in [m]} \hat{p}(\tilde{y} = i | y^* = y^*_k; x)]$$

And because by assumption each diagonal entry in $Q_{\tilde{y}|y^*}$ maximizes its column, we have

$$x_k \in \hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][y^*_k]$$

Thus, any example $x \in X_{\tilde{y}=i, y^*=j}$ having a collision will be exactly assigned to $\hat{X}_{\tilde{y}=i, y^*=j}$.

Case 2 (no collisions): We want to show that $\forall i \in [m], j \in [m]$, $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$. We can partition $X_{\tilde{y}=i}$ as

$$X_{\tilde{y}=i} = X_{\tilde{y}=i, y^*=j} \cup X_{\tilde{y}=i, y^* \neq j}$$

We prove $\forall i \in [m], j \in [m]$, $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ by proving two claims:

Claim 1: $X_{\tilde{y}=i, y^*=j} \subseteq \hat{X}_{\tilde{y}=i, y^*=j}$
Claim 2: $X_{\tilde{y}=i, y^* \neq j} \nsubseteq \hat{X}_{\tilde{y}=i, y^*=j}$

We do not need to show $X_{\tilde{y} \neq i, y^*=j} \nsubseteq \hat{X}_{\tilde{y}=i, y^*=j}$ and $X_{\tilde{y} \neq i, y^* \neq j} \nsubseteq \hat{X}_{\tilde{y}=i, y^*=j}$ because the noisy labels $\tilde{y}$ are given, thus the confident joint (Eqn. 1) will never place them in the wrong bin of $\hat{X}_{\tilde{y}=i, y^*=j}$. Thus, claim 1 and claim 2 suffice to show that $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$.

Proof (Claim 1) of Case 2: Inspecting Eqn. (1) and Alg. 1, by the construction of $C_{\tilde{y}, y^*}$, we have that $\forall x \in X_{\tilde{y}=i}$, $\hat{p}(\tilde{y} = j | y^* = j; x, \theta) \geq t_j \implies X_{\tilde{y}=i, y^*=j} \subseteq \hat{X}_{\tilde{y}=i, y^*=j}$. When the left-hand side is true, all examples with noisy label $i$ and hidden, true label $j$ are counted in $\hat{X}_{\tilde{y}=i, y^*=j}$.
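Thus, it suffices to prove:

$$\forall x \in X_{\tilde{y}=i}, \quad \hat{p}(\tilde{y}=j \mid y^*=j; x, \theta) \geq t_j \tag{5}$$

Because the predicted probabilities satisfy the ideal condition, $\hat{p}(\tilde{y}=j \mid y^*=j, x) = p(\tilde{y}=j \mid y^*=j)$, $\forall x \in X_{\tilde{y}=i}$. Note the change from predicted probability, $\hat{p}$, to an exact probability, $p$. Thus by the ideal condition, the inequality in (5) can be written as $p(\tilde{y}=j \mid y^*=j) \geq t_j$, which we prove below:

$$
\begin{aligned}
p(\tilde{y}=j \mid y^*=j) &\geq p(\tilde{y}=j \mid y^*=j) \cdot 1 && \triangleright\ \text{Identity} \\
&\geq p(\tilde{y}=j \mid y^*=j) \cdot \sum_{i \in [m]} p(y^*=i \mid \tilde{y}=j) \\
&\geq \sum_{i \in [m]} p(\tilde{y}=j \mid y^*=j) \cdot p(y^*=i \mid \tilde{y}=j) && \triangleright\ \text{move product into sum} \\
&\geq \sum_{i \in [m]} p(\tilde{y}=j \mid y^*=i) \cdot p(y^*=i \mid \tilde{y}=j) && \triangleright\ \text{diagonal entry maximizes row} \\
&\geq t_j && \triangleright\ \text{Lemma 1, ideal condition}
\end{aligned}
$$

Proof (Claim 2) of Case 2: We prove $X_{\tilde{y}=i, y^* \neq j} \nsubseteq \hat{X}_{\tilde{y}=i, y^*=j}$ by contradiction. Assume there exists some example $x_k \in X_{\tilde{y}=i, y^*=z}$ for $z \neq j$ such that $x_k \in \hat{X}_{\tilde{y}=i, y^*=j}$. By Claim 1, we have that $X_{\tilde{y}=i, y^*=j} \subseteq \hat{X}_{\tilde{y}=i, y^*=j}$, therefore, $x_k \in \hat{X}_{\tilde{y}=i, y^*=z}$.

Thus, for some example $x_k$, we have that $x_k \in \hat{X}_{\tilde{y}=i, y^*=j}$ and also $x_k \in \hat{X}_{\tilde{y}=i, y^*=z}$. However, this is a collision, and when a collision occurs, the confident joint will break the tie with arg max. Because each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its row and column, this will always assign $x_k \in \hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][y^*_k]$ (the assignment from Claim 1).

This theorem also states $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$. This follows directly from the fact that $\forall i \in [m], j \in [m]$, $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, i.e., the confident joint exactly counts the partitions $X_{\tilde{y}=i, y^*=j}$ for all pairs $(i, j) \in [m] \times [m]$, thus $C_{\tilde{y}, y^*} = n Q_{\tilde{y}, y^*}$ and $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$. Omitting discretization error, the confident joint $C_{\tilde{y}, y^*}$, when normalized to $\hat{Q}_{\tilde{y}, y^*}$, is an exact estimator for $Q_{\tilde{y}, y^*}$. For example, if the noise rate is 0.39, but the dataset has only 5 examples in that class, the best possible estimate by removing errors is $2/5 = 0.4 \approx 0.39$.

Corollary 1.0 (Exact Estimation). For a noisy dataset, $(x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and $\theta: x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is ideal and each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its row and column, and if $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, then $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$.

Proof.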
The result follows directly from Theorem 1. Because the confident joint exactly counts the partitions $X_{\tilde{y}=i, y^*=j}$ for all pairs $(i, j) \in [m] \times [m]$ by Theorem 1, $C_{\tilde{y}, y^*} = n Q_{\tilde{y}, y^*}$, omitting discretization rounding errors.

In the main text, Theorem 1 includes Corollary 1.0 for brevity. We have separated out Corollary 1.0 here to make apparent that the primary contribution of Theorem 1 is to prove $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, from which the result of Corollary 1.0, namely that $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$, naturally follows, omitting discretization rounding errors.

Corollary 1.1 (Per-Class Robustness). For a noisy dataset, $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta: x \to \hat{p}(\tilde{y})$, if $\hat{p}_{x, \tilde{y}=j}$ is per-class diffracted without label collisions and each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its row, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$.

Proof. Re-stating the meaning of per-class diffracted, we wish to show that if $\hat{p}(\tilde{y}; x, \theta)$ is diffracted with class-conditional noise s.t. $\forall j \in [m]$, $\hat{p}(\tilde{y}=j; x, \theta) = \epsilon^{(1)}_j\, p^*(\tilde{y}=j \mid y^*=y^*_k) + \epsilon^{(2)}_j$, where $\epsilon^{(1)}_j \in \mathbb{R}$, $\epsilon^{(2)}_j \in \mathbb{R}$ (for any distribution), without label collisions and each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its row, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$.

First note that combining linear combinations of real-valued $\epsilon^{(1)}_j$ and $\epsilon^{(2)}_j$ with the probabilities of class $j$ for each example may result in some examples having $\hat{p}_{x, \tilde{y}=j} = \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j > 1$ or $\hat{p}_{x, \tilde{y}=j} = \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j < 0$. The proof makes no assumption about the validity of the model outputs and therefore holds when this occurs. Furthermore, confident learning does not require valid probabilities when finding label errors because confident learning depends on the rank principle, i.e., the rankings of the probabilities, not the values of the probabilities.
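As a small numerical illustration of the per-class diffraction just described, the following sketch applies a hypothetical per-class scale and shift to toy probabilities and checks that the thresholded bins do not change. The data, the names `confident_bins`, `scale`, and `shift`, and the restriction to positive scales are assumptions made for this example only; they are not taken from the paper or from cleanlab.

```python
import numpy as np

def confident_bins(probs, noisy_labels):
    """Return B[k, j] = True when example k meets the class-j threshold t_j,
    where t_j is the mean class-j probability over examples labeled j."""
    probs = np.asarray(probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    m = probs.shape[1]
    t = np.array([probs[noisy_labels == j, j].mean() for j in range(m)])
    return probs >= t

# Toy "ideal" probabilities for 6 examples and 2 classes (illustrative values).
ideal = np.array([[0.90, 0.10], [0.70, 0.30], [0.20, 0.80],
                  [0.40, 0.60], [0.10, 0.90], [0.35, 0.65]])
noisy_labels = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical per-class diffraction: p_hat = scale_j * p + shift_j.
# Assumption: scale_j > 0, so the direction of the inequality is preserved.
scale = np.array([0.5, 2.0])
shift = np.array([-0.3, 0.7])
diffracted = scale * ideal + shift  # no longer valid probabilities

same_bins = np.array_equal(confident_bins(ideal, noisy_labels),
                           confident_bins(diffracted, noisy_labels))
print(same_bins)
```

The diffracted values fall outside $[0, 1]$, yet each threshold moves by the same per-class transformation, so membership in every bin is unchanged in this toy case.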
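When there are no label collisions, the bins created by the confident joint are:

$$\hat{X}_{\tilde{y}=i, y^*=j} := \{x \in X_{\tilde{y}=i} : \hat{p}(\tilde{y}=j; x, \theta) \geq t_j\} \tag{6}$$

where

$$t_j = \mathbb{E}_{x \in X_{\tilde{y}=j}}\, \hat{p}_{x, \tilde{y}=j}$$

WLOG: we re-formulate the error $\epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j$ as $\epsilon^{(1)}_j (p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j)$.

Now, for diffracted (non-ideal) probabilities, we rearrange how the threshold $t_j$ changes for a given $\epsilon^{(1)}_j$, $\epsilon^{(2)}_j$:

$$
\begin{aligned}
t^{\epsilon_j}_j &= \mathbb{E}_{x \in X_{\tilde{y}=j}}\, \epsilon^{(1)}_j \left(p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j\right) \\
t^{\epsilon_j}_j &= \epsilon^{(1)}_j \left( \mathbb{E}_{x \in X_{\tilde{y}=j}}\, p^*_{x, \tilde{y}=j} + \mathbb{E}_{x \in X_{\tilde{y}=j}}\, \epsilon^{(2)}_j \right) \\
t^{\epsilon_j}_j &= \epsilon^{(1)}_j \left( t^*_j + \epsilon^{(2)}_j \cdot \mathbb{E}_{x \in X_{\tilde{y}=j}}\, 1 \right) \\
t^{\epsilon_j}_j &= \epsilon^{(1)}_j \left( t^*_j + \epsilon^{(2)}_j \right)
\end{aligned}
$$

Thus, for per-class diffracted (non-ideal) probabilities, Eqn. (6) becomes

$$
\begin{aligned}
\hat{X}^{\epsilon_j}_{\tilde{y}=i, y^*=j} &= \{x \in X_{\tilde{y}=i} : \epsilon^{(1)}_j (p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j) \geq \epsilon^{(1)}_j (t^*_j + \epsilon^{(2)}_j)\} \\
&= \{x \in X_{\tilde{y}=i} : p^*_{x, \tilde{y}=j} \geq t^*_j\} \\
&= X_{\tilde{y}=i, y^*=j} && \triangleright\ \text{by Theorem (1)}
\end{aligned}
$$

In the second to last step, we see that the formulation of the label errors is the formulation of $C_{\tilde{y}, y^*}$ for ideal probabilities, which we proved yields exact label errors and consistent estimation of $Q_{\tilde{y}, y^*}$ in Theorem 1, which concludes the proof. Note that we eliminate the need for the assumption that each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its column because this assumption is only used in the proof of Theorem 1 when collisions occur, but here we only consider the case when there are no collisions.

Theorem 2 (Per-Example Robustness). For a noisy dataset, $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta: x \to \hat{p}(\tilde{y})$, if $\hat{p}_{x, \tilde{y}=j}$ is per-example diffracted without label collisions and each diagonal entry of $Q_{\tilde{y} \mid y^*}$ maximizes its row, then $\hat{X}_{\tilde{y}=i, y^*=j} \approxeq X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approxeq Q_{\tilde{y}, y^*}$.

Proof. We consider the nontrivial real-world setting when a learning model $\theta: x \to \hat{p}(\tilde{y})$ outputs erroneous, non-ideal predicted probabilities with an error term added for every example, across every class, such that $\forall x \in X, \forall j \in [m]$, $\hat{p}_{x, \tilde{y}=j} = p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j}$.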
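As a notation reminder, $p^*_{x, \tilde{y}=j}$ is shorthand for the ideal probabilities $p^*(\tilde{y}=j \mid y^*=y^*_k)$ and $\hat{p}_{x, \tilde{y}=j}$ is shorthand for the predicted probabilities $\hat{p}(\tilde{y}=j; x, \theta)$. The predicted probability error $\epsilon_{x, \tilde{y}=j}$ is distributed uniformly with no other constraints. We use $\epsilon_j \in \mathbb{R}$ to represent the mean of $\epsilon_{x, \tilde{y}=j}$ per class, i.e., $\epsilon_j = \mathbb{E}_{x \in X}\, \epsilon_{x, \tilde{y}=j}$, which can be seen by looking at the form of the uniform distribution in Eqn. (4). We could add the constraint that $\epsilon_j = 0, \forall j \in [m]$, which would simplify the theorem and the proof, but it is less general, and we prove exact label error and joint estimation without this constraint.

We re-iterate the form of the error in Eqn. (4) here ($U$ denotes a uniform distribution):

$$
\epsilon_{x, \tilde{y}=j} \sim
\begin{cases}
U(\epsilon_j + t_j - p^*_{x, \tilde{y}=j},\; \epsilon_j - t_j + p^*_{x, \tilde{y}=j}] & p^*_{x, \tilde{y}=j} \geq t_j \\
U[\epsilon_j - t_j + p^*_{x, \tilde{y}=j},\; \epsilon_j + t_j - p^*_{x, \tilde{y}=j}) & p^*_{x, \tilde{y}=j} < t_j
\end{cases}
$$

When there are no label collisions, the bins created by the confident joint are:

$$\hat{X}_{\tilde{y}=i, y^*=j} := \{x \in X_{\tilde{y}=i} : \hat{p}_{x, \tilde{y}=j} \geq t_j\} \tag{7}$$

where

$$t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}_{x, \tilde{y}=j}$$

Rewriting the threshold $t_j$ to include the error terms $\epsilon_{x, \tilde{y}=j}$ and $\epsilon_j$, we have

$$
\begin{aligned}
t^{\epsilon_j}_j &= \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \left( p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \right) \\
t^{\epsilon_j}_j &= \mathbb{E}_{x \in X_{\tilde{y}=j}}\, p^*_{x, \tilde{y}=j} + \mathbb{E}_{x \in X_{\tilde{y}=j}}\, \epsilon_{x, \tilde{y}=j} \\
t^{\epsilon_j}_j &= t_j + \epsilon_j
\end{aligned}
$$

where the last step uses the fact that $\epsilon_{x, \tilde{y}=j}$ is uniformly distributed over $x \in X$ and $n \to \infty$ so that $\mathbb{E}_{x \in X_{\tilde{y}=j}}\, \epsilon_{x, \tilde{y}=j} = \mathbb{E}_{x \in X}\, \epsilon_{x, \tilde{y}=j} = \epsilon_j$. It remains to show that

$$p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j \iff p^*_{x, \tilde{y}=j} \geq t_j$$

If this statement is true, then the subsets created by the confident joint in Eqn. (7) are unaltered and therefore $\hat{X}^{\epsilon_{x, \tilde{y}=j}}_{\tilde{y}=i, y^*=j} = \hat{X}_{\tilde{y}=i, y^*=j} \overset{\text{Thm. 1}}{=} X_{\tilde{y}=i, y^*=j}$, where $\hat{X}^{\epsilon_{x, \tilde{y}=j}}_{\tilde{y}=i, y^*=j}$ denotes the confident joint subsets for $\epsilon_{x, \tilde{y}=j}$ predicted probabilities.

Now we complete the proof. From the distribution for $\epsilon_{x, \tilde{y}=j}$ (Eqn.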
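4), we have that

$$
\begin{aligned}
p^*_{x, \tilde{y}=j} < t_j &\implies \epsilon_{x, \tilde{y}=j} < \epsilon_j + t_j - p^*_{x, \tilde{y}=j} \\
p^*_{x, \tilde{y}=j} \geq t_j &\implies \epsilon_{x, \tilde{y}=j} \geq \epsilon_j + t_j - p^*_{x, \tilde{y}=j}
\end{aligned}
$$

Re-arranging

$$
\begin{aligned}
p^*_{x, \tilde{y}=j} < t_j &\implies p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} < t_j + \epsilon_j \\
p^*_{x, \tilde{y}=j} \geq t_j &\implies p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j
\end{aligned}
$$

Using the contrapositive of the first implication, we have

$$
\begin{aligned}
p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j &\implies p^*_{x, \tilde{y}=j} \geq t_j \\
p^*_{x, \tilde{y}=j} \geq t_j &\implies p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j
\end{aligned}
$$

Combining, we have

$$p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j \iff p^*_{x, \tilde{y}=j} \geq t_j$$

Therefore,

$$\hat{X}^{\epsilon_{x, \tilde{y}=j}}_{\tilde{y}=i, y^*=j} \overset{\text{Thm. 1}}{=} X_{\tilde{y}=i, y^*=j}$$

The last line follows from the fact that we have reduced $\hat{X}^{\epsilon_{x, \tilde{y}=j}}_{\tilde{y}=i, y^*=j}$ to counting the same condition ($p^*_{x, \tilde{y}=j} \geq t_j$) as the confident joint counts under ideal probabilities in Thm. (1). Thus, we maintain exact finding of label errors, and exact estimation (Corollary 1.1) holds under no label collisions. The proof applies for finite datasets because we ignore discretization error; however, for equality, the proof requires the assumption $n \to \infty$, which is used in this step: $\mathbb{E}_{x \in X_{\tilde{y}=j}}\, \epsilon_{x, \tilde{y}=j} \overset{n \to \infty}{=} \mathbb{E}_{x \in X}\, \epsilon_{x, \tilde{y}=j} = \epsilon_j$. Thus, we use approximately equals in the statement of the theorem.

Note that while we use a uniform distribution in Eqn. (4), any bounded symmetric distribution with mode $\epsilon_j = \mathbb{E}_{x \in X}\, \epsilon_{x, j}$ is sufficient. Observe that the bounds of the distribution are non-vacuous (they do not collapse to a single value $\epsilon_j$) because $t_j \neq p^*_{x, \tilde{y}=j}$ by Lemma 1.

Algorithm 1 (Confident Joint) for class-conditional label noise characterization.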
input $\hat{P}$, an $n \times m$ matrix of out-of-sample predicted probabilities $\hat{P}[i][j] := \hat{p}(\tilde{y}=j; x, \theta)$
input $\tilde{y} \in \mathbb{N}_{\geq 0}^n$, an $n \times 1$ array of noisy labels
procedure ConfidentJoint($\hat{P}$, $\tilde{y}$):
    PART 1 (Compute thresholds)
    for $j \leftarrow 1, m$ do
        $l \leftarrow$ new empty list []
        for $i \leftarrow 1, n$ do
            if $\tilde{y}[i] = j$ then
                append $\hat{P}[i][j]$ to $l$
        $t[j] \leftarrow$ average($l$)    ▷ May use percentile instead of average for more confidence
    PART 2 (Compute confident joint)
    $C \leftarrow$ $m \times m$ matrix of zeros
    for $i \leftarrow 1, n$ do
        cnt $\leftarrow 0$
        for $j \leftarrow 1, m$ do
            if $\hat{P}[i][j] \geq t[j]$ then
                cnt $\leftarrow$ cnt $+ 1$
                $y^* \leftarrow j$    ▷ guess of true label
        $\tilde{y} \leftarrow \tilde{y}[i]$
        if cnt $> 1$ then    ▷ if label collision
            $y^* \leftarrow \arg\max \hat{P}[i]$
        if cnt $> 0$ then
            $C[\tilde{y}][y^*] \leftarrow C[\tilde{y}][y^*] + 1$
output $C$, the $m \times m$ unnormalized counts matrix

Appendix B. The confident joint and joint algorithms

The confident joint is expressed succinctly in Eqn. 1 with the thresholds expressed in Eqn. 2. For clarity, we provide these equations in algorithm form (see Alg. 1 and Alg. 2).

The confident joint algorithm (Alg. 1) is an $O(m^2 + nm)$ step procedure to compute $C_{\tilde{y}, y^*}$. The algorithm takes two inputs: (1) $\hat{P}$, an $n \times m$ matrix of out-of-sample predicted probabilities $\hat{P}[i][j] := \hat{p}(\tilde{y}=j; x_i, \theta)$, and (2) the associated array of noisy labels. We typically use cross-validation to compute $\hat{P}$ for train sets, and a model trained on the train set and fine-tuned with cross-validation on the test set to compute $\hat{P}$ for a test set. Any method works as long as the $\hat{p}(\tilde{y}=j; x, \theta)$ are out-of-sample, holdout predicted probabilities.

Computation time. Finding label errors in ImageNet takes 3 minutes on an i7 CPU. Results in all tables are reproducible via the open-sourced cleanlab package.

Note that Alg. 1 embodies Eqn. 1, and Alg. 2 realizes Eqn. 3.
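To make the two parts of Alg. 1 concrete, here is a minimal NumPy sketch that follows the pseudocode step by step. It is not the cleanlab implementation: the function name `confident_joint`, the 0-indexed labels, and the assumption that every class occurs at least once among the noisy labels are choices made for this illustration.

```python
import numpy as np

def confident_joint(pred_probs, noisy_labels):
    """Sketch of Alg. 1. pred_probs is n x m (out-of-sample), noisy_labels has length n.
    Assumes labels are integers in 0..m-1 and every class appears at least once."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels, dtype=int)
    n, m = pred_probs.shape

    # PART 1: t[j] is the average predicted probability of class j
    # over the examples whose noisy label is j.
    thresholds = np.array(
        [pred_probs[noisy_labels == j, j].mean() for j in range(m)]
    )

    # PART 2: count each example in at most one (noisy label, guessed true label) bin.
    C = np.zeros((m, m), dtype=int)
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if above.size == 0:
            continue  # cnt = 0: the example is not confidently counted
        if above.size > 1:
            # Label collision: break the tie with arg max over the row, as in Alg. 1.
            true_guess = int(np.argmax(pred_probs[i]))
        else:
            true_guess = int(above[0])
        C[noisy_labels[i], true_guess] += 1
    return C, thresholds

# Toy usage: the third example is labeled 0 but confidently predicted as class 1.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
toy_labels = [0, 0, 0, 1, 1]
C, t = confident_joint(toy_probs, toy_labels)
print(C)
```

In this toy run the off-diagonal entry `C[0][1]` counts the one example with noisy label 0 that exceeds only the class-1 threshold, and the fourth example is left uncounted because it exceeds no threshold.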
Algorithm 2 (Joint) calibrates the confident joint to estimate the latent, true distribution of class-conditional label noise.

input $C_{\tilde{y}, y^*}[i][j]$, $m \times m$ unnormalized counts
input $\tilde{y}$, an $n \times 1$ array of noisy integer labels
procedure JointEstimation($C$, $\tilde{y}$):
    $\tilde{C}_{\tilde{y}=i, y^*=j} \leftarrow \dfrac{C_{\tilde{y}=i, y^*=j}}{\sum_{j \in [m]} C_{\tilde{y}=i, y^*=j}} \cdot |X_{\tilde{y}=i}|$    ▷ calibrate marginals
    $\hat{Q}_{\tilde{y}=i, y^*=j} \leftarrow \dfrac{\tilde{C}_{\tilde{y}=i, y^*=j}}{\sum_{i \in [m], j \in [m]} \tilde{C}_{\tilde{y}=i, y^*=j}}$    ▷ joint sums to 1
output $\hat{Q}_{\tilde{y}, y^*}$, joint dist. matrix $\sim p(\tilde{y}, y^*)$

Appendix C. Extended Comparison of Confident Learning Methods on CIFAR-10

Fig. S1 shows the absolute difference of the true joint $Q_{\tilde{y}, y^*}$ and the joint distribution estimated using confident learning $\hat{Q}_{\tilde{y}, y^*}$ on CIFAR-10, for 20%, 40%, and 70% label noise, 20%, 40%, and 60% sparsity, for all pairs of classes in the joint distribution of label noise. Observe that in moderate noise regimes between 20% and 40% noise, confident learning accurately estimates nearly every entry in the joint distribution of label noise. This figure provides evidence for how confident learning identifies the label errors with high accuracy, as shown in Table 2, and supports our theoretical contribution that confident learning exactly estimates the joint distribution of labels under reasonable assumptions (cf. Thm. 2).

Because we did not remove label errors from the validation set, when training on the data cleaned by CL in the train set, we may have induced a distributional shift, making the moderate increase in accuracy a more satisfying result.

In Table S1, we estimate $Q_{\tilde{y}, y^*}$ using the confusion-matrix $C_{\text{confusion}}$ approach normalized via Eqn. (3) and compare this with $\hat{Q}_{\tilde{y}, y^*}$ estimated by normalizing the confident joint $C_{\tilde{y}, y^*}$ of the CL approach, for various amounts of noise and sparsity in $Q_{\tilde{y}, y^*}$. Table S1 shows improvement using $C_{\tilde{y}, y^*}$ over $C_{\text{confusion}}$, low RMSE scores, and robustness to sparsity in moderate-noise regimes.
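The two normalization steps of Alg. 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the cleanlab code: the function name `estimate_joint` is hypothetical, labels are assumed to be integers in 0..m-1, and every row of the counts matrix is assumed to contain at least one count so that the row normalization is defined.

```python
import numpy as np

def estimate_joint(C, noisy_labels):
    """Sketch of Alg. 2: calibrate a counts matrix C (m x m) so that its row sums
    match the noisy-label counts, then normalize so the result sums to 1."""
    C = np.asarray(C, dtype=float)
    m = C.shape[0]
    # |X_{noisy label = i}|: number of examples carrying each noisy label.
    label_counts = np.bincount(np.asarray(noisy_labels, dtype=int), minlength=m)
    # Calibrate marginals (assumes no row of C sums to zero).
    row_sums = C.sum(axis=1, keepdims=True)
    calibrated = C / row_sums * label_counts[:, None]
    # Normalize so the joint sums to 1.
    return calibrated / calibrated.sum()

# Toy usage with a 2 x 2 counts matrix and five noisy labels.
Q_hat = estimate_joint([[2, 1], [0, 1]], [0, 0, 0, 1, 1])
print(Q_hat)
```

After calibration, each row of the estimate sums to the empirical fraction of examples with that noisy label (here 0.6 and 0.4). The same normalization can be applied to any counts matrix, which is how a confusion-matrix baseline could be placed on the same scale for comparison.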
[Figure S1: twelve panels, one for each combination of Noise in {0.2, 0.4, 0.7} and Sparsity in {0.0, 0.2, 0.4, 0.6}. Vertical axis: noisy label $\tilde{y}$. Horizontal axis: latent, true label $y^*$.]

Figure S1: Absolute difference of the true joint $Q_{\tilde{y}, y^*}$ and the joint distribution estimated using confident learning $\hat{Q}_{\tilde{y}, y^*}$ on CIFAR-10, for 20%, 40%, and 70% label noise, 20%, 40%, and 60% sparsity, for all pairs of classes in the joint distribution of label noise.

Table S1: RMSE error of $Q_{\tilde{y}, y^*}$ estimation on CIFAR-10 using $C_{\tilde{y}, y^*}$ to estimate $\hat{Q}_{\tilde{y}, y^*}$ compared with using the baseline approach $C_{\text{confusion}}$ to estimate $\hat{Q}_{\tilde{y}, y^*}$.

| Noise | 0.2 | 0.2 | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.4 | 0.7 | 0.7 | 0.7 | 0.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sparsity | 0 | 0.2 | 0.4 | 0.6 | 0 | 0.2 | 0.4 | 0.6 | 0 | 0.2 | 0.4 | 0.6 |
| $\lVert \hat{Q}_{\tilde{y}, y^*} - Q_{\tilde{y}, y^*} \rVert_2$ | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.005 | 0.011 | 0.010 | 0.015 | 0.017 |
| $\lVert \hat{Q}_{\text{confusion}} - Q_{\tilde{y}, y^*} \rVert_2$ | 0.006 | 0.006 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.007 | 0.011 | 0.011 | 0.015 | 0.019 |

C.1 Benchmarking INCV

We benchmarked INCV using the official GitHub code (footnote 2) on a machine with 128 GB of RAM and 4 RTX 2080 Ti GPUs. Due to memory leak issues in the implementation (as of the February 2020 open-source release, tested on a macOS laptop with 16 GB RAM and an Ubuntu 18.04 LTS Linux server with 128 GB RAM), training frequently stopped due to out-of-memory errors. For fair comparison, we restarted INCV training until all models completed at least 90 training epochs. For each experiment, Table S2 shows the total time required for training, epochs completed, and the associated accuracies. As shown in the table, training INCV may take over 20 hours because the approach requires iterative retraining.
For comparison, CL takes less than three hours on the same machine: an hour for cross-validation, less than a minute to find errors, an hour to retrain.

2. https://github.com/chenpf1025/noisy_label_understanding_utilizing

Table S2: Information about INCV benchmarks including accuracy, time, and epochs trained for various noise and sparsity settings.

| Noise | 0.2 | 0.2 | 0.2 | 0.2 | 0.4 | 0.4 | 0.4 | 0.4 | 0.7 | 0.7 | 0.7 | 0.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sparsity | 0 | 0.2 | 0.4 | 0.6 | 0 | 0.2 | 0.4 | 0.6 | 0 | 0.2 | 0.4 | 0.6 |
| Accuracy | 0.878 | 0.886 | 0.896 | 0.892 | 0.844 | 0.766 | 0.854 | 0.736 | 0.283 | 0.253 | 0.348 | 0.297 |
| Time (hours) | 9.120 | 11.350 | 10.420 | 7.220 | 7.580 | 11.720 | 20.420 | 6.180 | 16.230 | 17.250 | 16.880 | 18.300 |
| Epochs trained | 91 | 91 | 200 | 157 | 91 | 200 | 200 | 139 | 92 | 92 | 118 | 200 |

Appendix D. Additional Figures

In this section, we include additional figures that support the main manuscript. Fig. S2 explores the benchmark accuracy of the individual confident learning approaches to support Fig. 5 and Fig. 4 in the main text. The noise matrices shown in Fig. S3 were used to generate the synthetic noisy labels for the results in Tables 4 and 2.

Fig. S2 shows the top-1 accuracy on the ILSVRC validation set when removing label errors estimated by CL methods versus removing random examples. For each CL method, we plot the accuracy of training with 20%, 40%, ..., 100% of the estimated label errors removed, omitting points beyond 200k.

[Figure S2: two panels, (a) ResNet18 Validation Accuracy and (b) ResNet50 Validation Accuracy. Horizontal axis: number of examples removed before training (0K to 200K). Methods plotted: CL: $C_{\text{confusion}}$, CL: $C_{\tilde{y}, y^*}$, CL: opt, Rand remove, No Removal.]

Figure S2: Increased ResNet validation accuracy using CL methods on ImageNet with original labels (no synthetic noise added). Each point on the line for each method, from left to right, depicts the accuracy of training with 20%, 40%, ..., 100% of estimated label errors removed.
Error bars are estimated with Clopper-Pearson 95% confidence intervals. The red dash-dotted baseline captures when examples are removed uniformly randomly. The black dotted line depicts accuracy when training with all examples.

Figure S3: The CIFAR-10 noise transition matrices used to create the synthetic label errors. In the cleanlab code base, s is used in place of $\tilde{y}$ to notate the noisy observed labels and y is used in place of $y^*$ to notate the latent uncorrupted labels.