# Robust Training under Label Noise by Over-parameterization

Sheng Liu 1, Zhihui Zhu 2, Qing Qu 3, Chong You 4

Recently, over-parameterized deep networks, with increasingly more network parameters than training samples, have dominated the performance of modern machine learning. However, when the training data is corrupted, it is well known that over-parameterized networks tend to overfit and do not generalize. In this work, we propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted. The main idea is simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Specifically, we model the label noise via another sparse over-parameterization term, and exploit implicit algorithmic regularization to recover and separate the underlying corruptions. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets. Furthermore, our experimental results are corroborated by theory on simplified linear models, showing that exact separation between sparse noise and low-rank data can be achieved under incoherence conditions. The work opens many interesting directions for improving over-parameterized models by using sparse over-parameterization and implicit regularization. Code is available at https://github.com/shengliu66/SOP.

1 Center for Data Science, New York University; 2 Electrical and Computer Engineering, University of Denver; 3 Department of EECS, University of Michigan; 4 Google Research, New York City. Correspondence to: Chong You.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

One of the most important factors for the success of deep models is their large model size and high expressive power, which enable them to learn complicated input-output relations. As such, over-parameterized deep networks, or large models with more parameters than the size of the training data, have dominated the performance in computer vision, natural language processing, and so on. The adoption of large models is justified by the recent discovery that deep models exhibit a double descent (Belkin et al., 2019) and unimodal variance (Yang et al., 2020) generalization behavior, where their performance continues to improve beyond the interpolation point, extending the classical learning theory of the bias-variance trade-off. While there are infinitely many global solutions that overfit to training data, the choice of optimization algorithm imposes certain implicit regularization (Neyshabur et al., 2014) so that over-parameterized models converge to those that are generalizable.

Nonetheless, the success of over-parameterization of deep networks critically depends on the availability of clean training data, while overfitting inevitably occurs when training data is corrupted. Consider the task of image classification with a training dataset {(x_i, y_i)}_{i=1}^N, with x_i being an input image and y_i being the corresponding one-hot label. With an over-parameterized deep network f(·; θ), model training is achieved by solving an optimization problem with respect to (w.r.t.)
the network parameter θ as follows:

$$\min_{\theta}\; L(\theta) := \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta),\, y_i\big), \tag{1}$$

where ℓ(·,·) is a loss function that measures the distance between the network prediction f(x_i; θ) and the label y_i. If a proportion of the images in the training set is mislabeled (Song et al., 2020), it is well known that the network will be optimized to zero training error and hence produce f(x_i; θ) ≈ y_i for all i ∈ {1, …, N}, even for those y_i that are incorrect (Zhang et al., 2021a). Overfitting to wrong labels inevitably leads to poor generalization performance (see Fig. 1).

In this paper, we introduce a principled method to address the challenge of overfitting when over-parameterized deep networks are trained on corrupted data. We focus on the task of classification with noisy labels, a ubiquitous problem in practice due to the extreme complexity of data annotation even for experienced domain experts (Frénay & Verleysen, 2013). Our idea leverages the property that label noise is sparse, namely only a fraction of the labels are corrupted and the rest are intact. Principled methods for dealing with sparse corruption have a rich history, which can be traced back to compressed sensing (Candès & Tao, 2005), robust subspace recovery (Candès et al., 2011; Wright et al., 2008), and even earlier (Claerbout & Muir, 1973). Such methods are based on using a robust loss function, such as the ℓ1 norm, which is less sensitive to large outlying entries. While it is tempting to apply sparse modeling to the label noise problem by setting the loss ℓ(·,·) in (1) to the ℓ1 loss, such an approach cannot solve the overfitting issue, since all global solutions are still given by those that satisfy f(x_i; θ) ≈ y_i for all i ∈ {1, …, N}. Hence, handling sparse corruption with over-parameterized models requires techniques beyond the classical ℓ1 loss for sparse modeling.

Figure 1. Sparse over-parameterization prevents overfitting to label noise. Training and test accuracy of a PreActResNet18 network trained with a standard cross-entropy (CE) loss (dashed lines) and our Sparse Over-parameterization (SOP) (solid lines) for image classification on the CIFAR-10 dataset with 0%, 20%, and 40% of the labels flipped at random. SOP prevents overfitting to the wrong training labels, obtaining near 100%, 80%, and 60% training accuracy respectively, and therefore achieves better generalization on the test set without an accuracy drop at the end of training.

Overview of our method and contribution. To handle sparse corruption with over-parameterized models, our idea is simply to use an extra variable s_i to model the unknown label noise s*_i, which is the difference between the observed label y_i and the corresponding clean label. Hence, the goal is to minimize the discrepancy between f(x_i; θ) + s_i and y_i. Inspired by a line of recent work (Vaskevicius et al., 2019; Zhao et al., 2019; You et al., 2020), we enforce sparsity of s_i through the over-parameterization s_i = u_i ⊙ u_i − v_i ⊙ v_i and optimize the following training loss:

$$\min_{\theta,\{u_i,v_i\}_{i=1}^N}\; L\big(\theta, \{u_i, v_i\}_{i=1}^N\big) := \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta) + u_i \odot u_i - v_i \odot v_i,\; y_i\big), \tag{2}$$

where ⊙ denotes the entry-wise (Hadamard) product. We term our method Sparse Over-Parameterization (SOP).

At first glance, our SOP approach is seemingly problematic, because adding more learnable parameters {u_i, v_i}_{i=1}^N to an over-parameterized network f(·; θ) would aggravate rather than alleviate the overfitting issue.
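To make the parameterization in (2) concrete, the following PyTorch sketch keeps one pair (u_i, v_i) per training sample and adds s_i = u_i ⊙ u_i − v_i ⊙ v_i to the network output. The module name, its arguments, and the index-based lookup are our own illustrative choices, not the released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SparseOverparamNoise(nn.Module):
    """Per-sample noise variables s_i = u_i ⊙ u_i - v_i ⊙ v_i from Eq. (2).

    A minimal sketch: one pair (u_i, v_i) is kept for every training sample
    and looked up by the sample's index in the dataset.
    """

    def __init__(self, num_samples: int, num_classes: int, init_std: float = 1e-8):
        super().__init__()
        # Small initialization is essential for the implicit sparse regularization.
        self.u = nn.Parameter(init_std * torch.randn(num_samples, num_classes))
        self.v = nn.Parameter(init_std * torch.randn(num_samples, num_classes))

    def forward(self, logits: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        # logits: network outputs f(x_i; θ) for the batch.
        # indices: dataset indices of the batch samples, used to pick (u_i, v_i).
        s = self.u[indices] ** 2 - self.v[indices] ** 2
        return logits + s


# Usage sketch: `backbone` is any classification network returning logits.
# noise = SparseOverparamNoise(num_samples=50_000, num_classes=10)
# corrected = noise(backbone(images), indices)   # f(x; θ) + u ⊙ u - v ⊙ v
```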
In fact, one global solution to (2) sets u_i = v_i = 0 with f(x_i; θ) ≈ y_i for all i ∈ {1, …, N}, so the network still overfits the noisy labels. Here, we leverage the choice of a particular training algorithm to enforce an implicit bias towards producing the desired solutions. Technically, we run gradient descent on the objective in (2) starting from a small initialization for {u_i, v_i}_{i=1}^N:

$$\begin{aligned}
\theta &\leftarrow \theta - \tau\, \frac{\partial L(\theta, \{u_i, v_i\})}{\partial \theta}, \\
u_i &\leftarrow u_i - \alpha\tau\, \frac{\partial L(\theta, \{u_i, v_i\})}{\partial u_i}, \quad i = 1, \dots, N, \\
v_i &\leftarrow v_i - \alpha\tau\, \frac{\partial L(\theta, \{u_i, v_i\})}{\partial v_i}, \quad i = 1, \dots, N,
\end{aligned} \tag{3}$$

where α > 0 is the ratio of learning rates for the different training variables. Such a simple algorithm enables SOP to train deep image classification networks without overfitting to wrong labels and to obtain better generalization performance (see Fig. 1). A more comprehensive empirical study on a variety of datasets is presented in Section 2.

To rigorously justify our method, we analyze it on a simplified over-parameterized linear model with sparse corruptions. As justified by a line of recent work (Jacot et al., 2018; Chizat et al., 2019), over-parameterized linear models capture similar phenomena because they closely approximate over-parameterized deep networks in a linearized regime around the initial point. Under sparse corruption and certain low-rank assumptions on the data, we show that the gradient descent (3) with α below a certain threshold recovers the underlying model parameters despite the sparse corruptions. Our result is obtained by explicitly characterizing the implicit regularization induced by the term u_i ⊙ u_i − v_i ⊙ v_i. In particular, we show that it leads to an ℓ1-norm regularization on the sparse corruption, hence connecting our method to classical ℓ1-loss approaches for model robustness. For more details, we refer readers to Section 3.

In summary, our contributions are two-fold:

Method. We propose a simple yet practical SOP method that effectively prevents overfitting when learning over-parameterized deep networks from corrupted training data, demonstrated on a variety of datasets.

Theory. Under a simplified over-parameterized linear model, we rigorously justify our approach for exactly separating sparse corruption from the data.

Moreover, we believe the methodology developed here reaches far beyond the label-noise setting, with the potential for dealing with more challenging scenarios of preventing overfitting when learning modern over-parameterized models of ever-increasing size.

2. Robust Classification with Label Noise

In this section, we show how our SOP method plays out on image classification problems with noisy labels. In particular, we discuss implementation details of our method, followed by experimental demonstrations on a variety of datasets with synthetic and real label noise.

2.1. Implementation Details of SOP

We train an over-parameterized deep neural network f(·; θ) from the noisy training data {(x_i, y_i)}_{i=1}^N using the method briefly discussed in Section 1. Specifically, we train the network f(·; θ) on the objective (2) with stochastic gradient descent (SGD), i.e., a batch version of (3). Notice that there is additional prior information on the label noise s*_i associated with a sample (x_i, y_i): the positive and negative entries of s*_i must correspond to the nonzero entry and the zero entries of y_i, respectively. Moreover, all entries of s*_i must lie in the range [−1, 1].
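The update (3) amounts to running the same gradient step with two different learning rates: τ for θ and ατ for the noise variables. In PyTorch this is naturally expressed with parameter groups; below is a minimal sketch, assuming the hypothetical SparseOverparamNoise module from the previous sketch and a stand-in backbone. The concrete values mirror the CIFAR-10 settings reported later (base learning rate 0.02, weight decay 5e-4 for θ, ratio 10, no weight decay for u, v); Algorithm 1 further allows separate ratios αu and αv for u and v.

```python
import torch
import torch.nn as nn

# Stand-ins for illustration: a tiny backbone and the (hypothetical)
# SparseOverparamNoise module sketched above.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
noise = SparseOverparamNoise(num_samples=50_000, num_classes=10)

tau, alpha = 0.02, 10.0  # base learning rate τ and learning-rate ratio α from Eq. (3)

optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": tau, "weight_decay": 5e-4},
        # Noise variables are trained with step size α·τ and no weight decay.
        {"params": noise.parameters(), "lr": alpha * tau, "weight_decay": 0.0},
    ],
    momentum=0.9,
)
```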
To leverage this prior information on s*_i, we optimize a variant of (2) given by

$$\min_{\theta,\{u_i,v_i\}_{i=1}^N}\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta) + s_i,\; y_i\big) \tag{4}$$
$$\text{s.t.}\quad s_i := u_i \odot u_i \odot y_i - v_i \odot v_i \odot (1 - y_i), \tag{5}$$
$$\phantom{\text{s.t.}}\quad u_i \in [-1, 1]^K,\quad v_i \in [-1, 1]^K, \tag{6}$$

where K is the number of classes. The constraints on u_i, v_i are realized by performing a projection step after each gradient descent update.

Choice of the loss function ℓ(·,·) in (4). The most commonly used loss function for classification tasks is the cross-entropy loss ℓ_CE(·,·) (Krizhevsky et al., 2012). Because ℓ_CE(·,·) requires a probability distribution as input, we define the mapping

$$\phi(w) := \frac{\max\{w, \epsilon \mathbf{1}\}}{\|\max\{w, \epsilon \mathbf{1}\}\|_1}, \tag{7}$$

and set the loss ℓ(·,·) in (4) to be

$$L_{\mathrm{CE}}(\theta, u_i, v_i; x_i, y_i) := \ell_{\mathrm{CE}}\big(\phi(f(x_i;\theta) + s_i),\; y_i\big). \tag{8}$$

On the other hand, the cross-entropy loss cannot be used to optimize the variables {v_i} (see Section A.1 for an explanation). Hence, we use the mean squared error loss ℓ_MSE and set the loss in (4) to be

$$L_{\mathrm{MSE}}(\theta, u_i, v_i; x_i, y_i) := \ell_{\mathrm{MSE}}\big(f(x_i;\theta) + s_i,\; y_i\big) \tag{9}$$

when optimizing {v_i}.¹ We summarize our training method in Algorithm 1.

¹We also project f(x_i; θ) to a one-hot vector when using the MSE loss, which is empirically found to accelerate convergence of {v_i}.

Algorithm 1: Image classification under label noise by the method of Sparse Over-Parameterization (SOP).
1: Input: Training data {(x_i, y_i)}_{i=1}^N, network backbone f(·; θ), variables {u_i, v_i}_{i=1}^N, number of epochs T, learning rate τ, learning rate ratios αu, αv, batch size β
2: Initialization: Draw entries of u_i, v_i i.i.d. from a Gaussian distribution with zero mean and standard deviation 1e−8
3: for each t ∈ {1, …, T} do
4:   # Train network f(·; θ) with SGD
5:   for each b ∈ {1, …, N/β} do
6:     Sample a batch B ⊂ {1, …, N} with |B| = β
7:     Set θ ← θ − τ Σ_{i∈B} ∂L_CE(θ, u_i, v_i; x_i, y_i)/∂θ
8:   end for
9:   # Update {u_i, v_i}
10:  for each i ∈ {1, …, N} do
11:    Set u_i ← P_[−1,1](u_i − αu τ ∂L_CE(θ, u_i, v_i; x_i, y_i)/∂u_i)
12:    Set v_i ← P_[−1,1](v_i − αv τ ∂L_MSE(θ, u_i, v_i; x_i, y_i)/∂v_i)
13:  end for
14: end for
15: Output: Network parameters θ and {u_i, v_i}_{i=1}^N

2.2. Experiments

We experimentally demonstrate the effectiveness of the proposed SOP method on datasets with both synthetic (i.e., CIFAR-10 and CIFAR-100) and realistic (i.e., CIFAR-N, Clothing-1M, and WebVision) label noise. In addition to SOP as described in Algorithm 1, we also implement an improved version, termed SOP+, which incorporates two regularization techniques commonly used in the label noise literature, namely consistency regularization and class-balance regularization. We explain SOP+ in more detail in Appendix A.3.

Dataset descriptions. We use datasets with synthetic label noise generated from CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). Each dataset contains 50k training images and 10k test images, all with clean labels, where each image is of size 32 × 32. Following previous works (Han et al., 2018; Liu et al., 2020; Xia et al., 2020), we generate symmetric label noise by uniformly flipping labels for a percentage of the training set across all classes, as well as asymmetric label noise by flipping labels for particular pairs of classes. For datasets with realistic label noise, we test on CIFAR-10N/CIFAR-100N (Wei et al., 2021b), which contain a re-annotation of CIFAR-10/CIFAR-100 by human workers. Specifically, each image in CIFAR-10N has three submitted labels (i.e., Random 1, 2, 3), which are further combined into an Aggregate and a Worst label. Each image in CIFAR-100N has a single submitted label over the fine classes.
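To make Algorithm 1 and the loss choices (7)–(9) concrete before moving on to the remaining datasets, the sketch below performs one simplified SOP training step in PyTorch: it forms s_i as in (5), applies the clamp-and-normalize mapping φ of (7) inside the cross-entropy loss (8) for θ and u, uses the MSE loss (9) for v, and projects u, v onto [−1, 1] as in (6). Algorithm 1 alternates a full SGD epoch over θ with a separate pass over {u_i, v_i}; this sketch folds both into a single batch step for brevity, and all names (including the noise module and optimizer from the earlier sketches, and the value of ε) are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

EPS = 1e-6  # the ε in Eq. (7); the exact value used in the paper is not restated here


def phi(w):
    """Clamp-and-normalize mapping of Eq. (7): each row becomes a probability vector."""
    w = torch.clamp(w, min=EPS)
    return w / w.sum(dim=1, keepdim=True)


def sop_step(backbone, noise, optimizer, images, onehot, indices):
    """One simplified SOP step on a batch (cf. Algorithm 1).

    `optimizer` is assumed to hold θ, u, and v in parameter groups with learning
    rates τ, αu·τ, and αv·τ respectively (see the previous sketch).
    """
    logits = backbone(images)
    u, v = noise.u[indices], noise.v[indices]

    # Eq. (5) inside the CE loss (8): gradients reach θ and u only.
    s_ce = (u ** 2) * onehot - (v.detach() ** 2) * (1.0 - onehot)
    loss_ce = -(onehot * torch.log(phi(logits + s_ce))).sum(dim=1).mean()

    # Eq. (9) for v: per the paper's footnote, f(x;θ) is replaced by its one-hot
    # prediction; gradients reach v only.
    pred = F.one_hot(logits.argmax(dim=1), onehot.shape[1]).float()
    s_mse = (u.detach() ** 2) * onehot - (v ** 2) * (1.0 - onehot)
    loss_mse = ((pred + s_mse - onehot) ** 2).sum(dim=1).mean()

    optimizer.zero_grad()
    (loss_ce + loss_mse).backward()
    optimizer.step()

    with torch.no_grad():  # Eq. (6): projection onto [-1, 1]
        noise.u.clamp_(-1.0, 1.0)
        noise.v.clamp_(-1.0, 1.0)
    return loss_ce.item()
```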
We also test on Clothing1M (Xiao et al., 2015) which is a large-scale dataset with images clawed from online shopping websites and labels Robust Training under Label Noise by Over-parameterization Table 1. Test accuracy with synthetic label noise on CIFAR-10 and CIFAR-100 with {20%, 40%, 60%, 80%} percent of labels for training data randomly flipped uniformly to another class. All methods use Res Net34 as the architecture. Mean and standard deviation over 5 independent runs are reported. METHODS CIFAR-10 CIFAR-100 20% 40% 60% 80% 20% 40% 60% 80% CE 86.32 0.18 82.65 0.16 76.15 0.32 59.28 0.97 51.43 0.58 45.23 0.53 36.31 0.39 20.23 0.82 FORWARD 87.99 0.36 83.25 0.38 74.96 0.65 54.64 0.44 39.19 2.61 31.05 1.44 19.12 1.95 8.99 0.58 GCE 89.83 0.20 87.13 0.22 82.54 0.23 64.07 1.38 66.81 0.42 61.77 0.24 53.16 0.78 29.16 0.74 SL 89.83 0.32 87.13 0.26 82.81 0.61 68.12 0.81 70.38 0.13 62.27 0.22 54.82 0.57 25.91 0.44 ELR 91.16 0.08 89.15 0.17 86.12 0.49 73.86 0.61 74.21 0.22 68.28 0.31 59.28 0.67 29.78 0.56 SOP (OURS) 93.18 0.57 90.09 0.27 86.76 0.22 68.32 0.77 74.67 0.30 70.12 0.57 60.26 0.41 30.20 0.63 Table 2. Comparison with the state-of-the-art methods that use two network ensembles and semi-supervised learning on CIFAR10 and CIFAR-100 under symmetric (with 20%, 50%, 80%) and asymmetric (with 40%) label noise. All methods use Res Net34 as the architecture. CIFAR-10 CIFAR-100 SYMMETRIC ASYM SYMMETRIC ASYM 20% 50% 80% 40% 20% 50% 80% 40% CE 87.2 80.7 65.8 82.2 58.1 47.1 23.8 43.3 MIXUP 93.5 87.9 72.3 - 69.9 57.3 33.6 - DIVIDEMIX 96.1 94.6 93.2 93.4 77.1 74.6 60.2 72.1 ELR+ 95.8 94.8 93.3 93.0 77.7 73.8 60.8 77.5 SOP+ (OURS) 96.3 95.5 94.0 93.8 78.8 75.9 63.3 78.0 Table 3. Test accuracy with realistic label noise on Clothing1M and Web Vision. We use a pre-trained Res Net50 for Clothing1M and an Inception Res Net V2 for Web Vision dataset. The results of the comparing methods are taken from their respective papers. METHODS CLOTHING1M WEBVISION ILSVRC12 CE 69.1 - - FORWARD 69.8 61.1 57.3 CO-TEACHING 69.2 63.6 61.5 ELR 72.9 76.2 68.7 CORES2 73.2 - - SOP (OURS) 73.5 76.6 69.1 generated based on surrounding texts. Clothing-1M contains 1 million training images, 15k validation images, and 10k test images with clean labels. Finally, we also test on the mini Web Vision dataset (Li et al., 2017) which contains the top 50 classes from the Google image subset of Web Vision (approximately 66 thousand images). Models trained on mini Web Vision are evaluated on both Web Vision and Image Net ILSVRC12 validation set. Details on the label noise for these datasets is provided in Section A.2. Network structures & hyperparameters. We implement our method with Py Torch v1.7. For each dataset, the choices of network architectures and hyperparameters for SOP are as follows. Additional details, as well as hyper-parameters for both SOP and SOP+, can be found in Appendix A.4. CIFAR-10/100 and CIFAR-10N/100N. We follow (Liu et al., 2020) to use Res Net-34 and Pre Act Res Net18 architectures trained with SGD using a 0.9 momentum. The initial learning rate is 0.02 decayed with a factor of 10 at the 40th and 80th epochs for CIFAR-10/CIFAR-10N and at the 80th and 120th epochs for CIFAR-100/CIFAR100N, respectively. Weight decay for network parameters θ is set to 5 10 4. No weight decay is used for parameters {ui, vi}N i=1. Clothing-1M. We follow the previous work (Liu et al., 2020) to use a Res Net-50 (He et al., 2016) pre-trained on Image Net (Krizhevsky et al., 2012). 
The network is trained with batch size 64 and an initial learning rate 0.001, which is reduced by a factor of 10 after 5th epoch (10 training epochs in total). Optimization is performed using SGD with a momentum 0.9. Weight decay is 0.001 for parameters θ and is zero for parameters {ui, vi}N i=1. Mini Webvision. We use Inception Res Net V2 as the backbone architecture. All other optimization details are the same as for CIFAR-10, except that we use weight decay 0.0005 and batch size 32. Experimental results. We compare with methods based on estimation of the transition matrix (Forward (Patrini et al., 2017)), design of loss functions (GCE (Zhang & Sabuncu, 2018) and SL (Wang et al., 2019)), training two networks (Co-teaching (Han et al., 2018) and Divide Mix (Li et al., 2020a)), and label noise correction (ELR (Liu et al., 2020) and CORES2 (Cheng et al., 2021)). Table 1 reports the performance of our method on synthetically generated symmetric label noise using CIFAR-10 and CIFAR-100. To compare with state-of-the-art methods, we also report the performance of SOP+ which contains additional regularization on both symmetric and asymmetric label noise and report the results in Table 2. It can be observed that our method is robust to a fairly large amount of label noise, and compares favorably to existing techniques. We further demonstrate that our method can effectively handle datasets with realistic label noise by reporting its performance on Clothing1M & Web Vision (see Table 3) and CIFAR-N (see Table 4) datasets. We can observe a performance gain over all comparing methods. Finally, we compare the training time (on a single Nvidia V100 GPU) of our method to the baseline methods in Table 5. We observe that our algorithm SOP/SOP+ achieves the fastest speed across all baselines. Robust Training under Label Noise by Over-parameterization Table 4. Test accuracy with realistic label noise on CIFAR-N. Mean and standard deviation over 5 independent runs are reported. The results of the baseline methods are taken from (Wei et al., 2021b) which all use Res Net34 as the architecture. For SOP+, we use Pre Act Res Net18. METHODS CIFAR-10N CIFAR-100N CLEAN RANDOM 1 RANDOM 2 RANDOM 3 AGGREGATE WORST CLEAN NOISY CE 92.92 0.11 85.02 0.65 86.46 1.79 85.16 0.61 87.77 0.38 77.69 1.55 76.70 0.74 55.50 0.66 FORWARD 93.02 0.12 86.88 0.50 86.14 0.24 87.04 0.35 88.24 0.22 79.79 0.46 76.18 0.37 57.01 1.03 CO-TEACHING 93.35 0.14 90.33 0.13 90.30 0.17 90.15 0.18 91.20 0.13 83.83 0.13 73.46 0.09 60.37 0.27 ELR+ 95.39 0.05 94.43 0.41 94.20 0.24 94.34 0.22 94.83 0.10 91.09 1.60 78.57 0.12 66.72 0.07 CORES 94.16 0.11 94.45 0.14 94.88 0.31 94.74 0.03 95.25 0.09 91.66 0.09 73.87 0.16 55.72 0.42 SOP+(OURS) 96.38 0.31 95.28 0.13 95.31 0.10 95.39 0.11 95.61 0.13 93.24 0.21 78.91 0.43 67.81 0.23 Table 5. Comparison of total training time in hours on CIFAR10 with 50% symmetric label noise. CE CO-TEACHING+ DIVIDEMIX ELR+ SOP SOP+ 0.9H 4.4H 5.4H 2.3H 1.0H 2.1H 3. Theoretical Insights with Simplified Models This section provides theoretical insights into our SOP method by studying structured data recovery with sparse corruption in the context of over-parameterized linear models. We will start with model simplification, followed by our main theoretical results and experimental verification. 3.1. 
Problem Setup & Main Result

Given a highly over-parameterized network f(·; θ), recent work (Jacot et al., 2018; Kalimeris et al., 2019) suggests that the parameter θ ∈ R^p may not change much from its initialization θ_0 before reaching zero training error. Hence, a nonlinear network f(·; θ): R^n → R can be well approximated by its first-order Taylor expansion:

$$f(x;\theta) \approx f(x;\theta_0) + \langle \nabla_\theta f(x;\theta_0),\, \theta - \theta_0 \rangle, \tag{10}$$

where we consider f(·; θ) as a scalar function for simplicity. Since the bias term f(x; θ_0) − ⟨∇_θ f(x; θ_0), θ_0⟩ is constant w.r.t. θ, for simplicity we may further assume that

$$f(x;\theta) \approx \langle \nabla_\theta f(x;\theta_0),\, \theta \rangle. \tag{11}$$

Thus, for a dataset {x_i}_{i=1}^N of N points, collectively

$$\begin{bmatrix} f(x_1;\theta) \\ \vdots \\ f(x_N;\theta) \end{bmatrix} \approx \begin{bmatrix} \nabla_\theta f(x_1;\theta_0)^\top \\ \vdots \\ \nabla_\theta f(x_N;\theta_0)^\top \end{bmatrix} \theta = J\theta, \tag{12}$$

where J ∈ R^{N×p} is a Jacobian matrix. This observation motivates us to consider the following problem setup.

Problem setup. Based upon the above linearization, we assume that our corrupted observation y ∈ R^N (e.g., noisy labels) is generated by

$$y = J\theta^\star + s^\star, \tag{13}$$

where θ* ∈ R^p is the underlying ground-truth parameter, and the noise s* ∈ R^N is sparse so that only a subset of the observations (e.g., labels) is corrupted. Given J and y generated from (13), our goal is to recover both θ* and s*.

However, as we are considering the problem in an over-parameterized regime with p > N, the underdetermined system (13) admits infinitely many solutions for θ even when s* is given. Nonetheless, recent work showed that the implicit bias of gradient descent for over-parameterized linear models and deep networks tends to find minimum ℓ2-norm solutions (Zhang et al., 2021a). Motivated by these results, and to make our problem well-posed, we seek the θ* with minimum ℓ2-norm, namely,

$$\theta^\star = \arg\min_{\theta} \|\theta\|_2^2 \quad \text{s.t.} \quad y = J\theta + s^\star. \tag{14}$$

Analogous to (2), we will show that θ* and s* can be provably recovered by solving the problem

$$\min_{\theta, u, v}\; h(\theta, u, v) := \tfrac{1}{2} \|J\theta + u \odot u - v \odot v - y\|_2^2 \tag{15}$$

using the gradient descent algorithm with learning rates τ and ατ on θ and {u, v}, respectively:

$$\begin{aligned}
\theta_{k+1} &= \theta_k - \tau\, J^\top r_k, \\
u_{k+1} &= u_k - 2\alpha\tau\, u_k \odot r_k, \\
v_{k+1} &= v_k + 2\alpha\tau\, v_k \odot r_k,
\end{aligned} \tag{16}$$

where r_k := Jθ_k + u_k ⊙ u_k − v_k ⊙ v_k − y. Based on these, our result can be summarized as follows.

Theorem 3.1 (Main result, informal). Suppose J is rank-r and µ-incoherent as defined in Section 3.3, and s* is k-sparse. If k²r < N/(4µ), then with τ → 0 and a proper choice of α depending on (J, θ*, k), the gradient dynamics of (16) converges to the ground-truth solution (θ*, s*) in (13) starting from a small initialization of (θ, u, v).

We state our result at a high level, with more technical details in Section 3.2 and Section 3.3. The overall idea of the proof can be sketched in two steps. First, although the problem (15) is nonconvex, in Section 3.2 we show that it has a benign global landscape, and that the gradient descent (16) converges to particular global solutions that coincide with the solutions of a convex problem with explicit regularization on θ and s. Building upon these results, in Section 3.3 we complete the analysis by showing that θ* and s* are exactly recovered by this convex problem for a small enough value of α. Throughout the analysis, we corroborate our findings with numerical simulations.

3.2. Landscapes & Implicit Sparse Regularization

Benign global landscape. We start by characterizing the nonconvex landscape of (15), showing the following result.

Proposition 3.2.
Any critical point of (15) is either a global minimizer, or it is a strict saddle (Ge et al., 2015) whose Hessian has at least one negative eigenvalue.

For a strict saddle function, recent work (Lee et al., 2016) showed that gradient descent with random initialization almost surely escapes saddle points and converges to a local minimizer. Thus, Proposition 3.2 ensures that the algorithm in (16) almost surely converges to a global solution of (15). However, because the over-parameterized problem (15) has infinitely many global solutions and not all of them are of equal quality, convergence to a global solution alone is not sufficient to establish the correctness of our method. Nonetheless, as we show next, the particular choice of the algorithm in (16) makes it converge to a particular regularized global solution.

Implicit sparse regularization. To understand which solution the algorithm (16) converges to, we study its gradient flow counterpart, obtained by taking the stepsize τ → 0 in (16). The dynamics of this gradient flow is governed by the differential equations

$$\begin{aligned}
\dot{\theta}_t(\gamma, \alpha) &= -J^\top r_t(\gamma, \alpha), \\
\dot{u}_t(\gamma, \alpha) &= -2\alpha\, u_t(\gamma, \alpha) \odot r_t(\gamma, \alpha), \\
\dot{v}_t(\gamma, \alpha) &= 2\alpha\, v_t(\gamma, \alpha) \odot r_t(\gamma, \alpha),
\end{aligned} \tag{17}$$

where we define

$$r_t(\gamma, \alpha) = J\theta_t(\gamma, \alpha) + u_t(\gamma, \alpha) \odot u_t(\gamma, \alpha) - v_t(\gamma, \alpha) \odot v_t(\gamma, \alpha) - y. \tag{18}$$

Here, we assume that θ, u, and v are initialized at

$$\theta_0(\gamma, \alpha) = 0, \quad u_0(\gamma, \alpha) = \gamma \mathbf{1}, \quad v_0(\gamma, \alpha) = \gamma \mathbf{1}, \tag{19}$$

with some small γ > 0. Solving the differential equations in (17) gives the gradient flow

$$\theta_t(\gamma, \alpha) = J^\top \nu_t(\gamma, \alpha), \quad u_t(\gamma, \alpha) = \gamma \exp\big(2\alpha\, \nu_t(\gamma, \alpha)\big), \quad v_t(\gamma, \alpha) = \gamma \exp\big(-2\alpha\, \nu_t(\gamma, \alpha)\big), \tag{20}$$

where the exponential is applied entry-wise and

$$\nu_t(\gamma, \alpha) := -\int_0^t r_\tau(\gamma, \alpha)\, d\tau. \tag{21}$$

The following result shows that the solution which the gradient flow (θ_t(γ, α), u_t(γ, α), v_t(γ, α)) in (20) converges to as t → ∞ is a global solution to (15) that is regularized in a way determined by (γ, α).

Proposition 3.3. Consider the gradient flow in (20) with the initialization in (19).

(Global convergence) For any (γ, α), if the limit

$$\big(\theta_\infty(\gamma, \alpha), u_\infty(\gamma, \alpha), v_\infty(\gamma, \alpha)\big) := \lim_{t \to \infty} \big(\theta_t(\gamma, \alpha), u_t(\gamma, \alpha), v_t(\gamma, \alpha)\big) \tag{22}$$

exists, then it is a global solution to (15).

(Implicit regularization) Fix any λ > 0 and let α be a function of γ as

$$\alpha(\gamma) = \frac{-\log \gamma}{2\lambda}. \tag{23}$$

If the limit

$$\big(\hat{\theta}, \hat{u}, \hat{v}\big) := \lim_{\gamma \to 0} \big(\theta_\infty(\gamma, \alpha(\gamma)), u_\infty(\gamma, \alpha(\gamma)), v_\infty(\gamma, \alpha(\gamma))\big) \tag{24}$$

exists, then (θ̂, û, v̂) is a global solution to (15). In particular, let

$$\hat{s} := \hat{u} \odot \hat{u} - \hat{v} \odot \hat{v}; \tag{25}$$

then (θ̂, ŝ) is an optimal solution to the convex program

$$\min_{\theta, s}\; \frac{1}{2}\|\theta\|_2^2 + \lambda \|s\|_1 \quad \text{s.t.} \quad y = J\theta + s. \tag{26}$$

Because the limit (θ̂, ŝ) of the gradient flow (17) is also an optimal solution of (26), it is a regularized solution. In particular, the ℓ1-norm regularization on s arises from the implicit regularization of the over-parameterization s = u ⊙ u − v ⊙ v, leading to the desired sparse solution for s. On the other hand, the ℓ2 regularization on θ leads to the desired minimum ℓ2-norm solution discussed in (14). Thus, the only remaining question is whether the ground truth (θ*, s*) in (13) can be identified by solving the convex problem (26), which we discuss in Section 3.3.

Numerical verification. While Proposition 3.3 is proved for gradient flow with both the learning rate τ → 0 and the initialization scale γ → 0, we numerically show that the result also holds non-asymptotically with finitely small τ and γ. Given a tuple (N, p, r, k) of model parameters, we generate simulation data (J, θ*, s*, y) as follows.
The matrix J ∈ R^{N×p} is generated by multiplying two random matrices of shapes N × r and r × p, respectively, with entries drawn i.i.d. from a standard Gaussian distribution. The sparse vector s* ∈ R^N is generated by setting k randomly chosen entries to i.i.d. standard Gaussian values and the remaining entries to zero. Then, we generate a vector θ ∈ R^p with all entries drawn i.i.d. from a standard Gaussian distribution and let y = Jθ + s*. Finally, we set θ* as the minimum ℓ2-norm solution according to (14).

In this experiment, we fix (N, p, r, k) = (20, 40, 3, 3) for the data generation described above. With a varying learning rate ratio α ∈ [4, 4000], we compute (θ_α, s_α) as the solution provided by gradient descent in (16) with the initialization (19) and γ = e^{−8}. With a varying regularization weight λ ∈ [0.0001, 1] in (26), we compute (θ_λ, s_λ) as the solution of the convex problem (26), using the ECOS solver (Domahidi et al., 2013) provided in CVXPY (Diamond & Boyd, 2016). Figure 2 visualizes the relative difference ρ = ‖θ_α − θ_λ‖_2 / max{‖θ_α‖_2, ‖θ_λ‖_2} between θ_α and θ_λ (and likewise for s) across all pairs of (α, λ). We observe that as long as (α, λ) satisfies the relationship in (23), the relative difference ρ is small for θ, and the same holds for s. On the other hand, the relative differences can be large when (23) is not satisfied, corroborating Proposition 3.3.

Figure 2. The gradient descent in (16) and the convex problem in (26) produce the same solutions when α = −log γ / (2λ). For fixed data (J, y), the left panel shows the relative difference ‖θ_α − θ_λ‖_2 / max{‖θ_α‖_2, ‖θ_λ‖_2} between the solution θ_α of (16) with varying values of α (y-axis) and the solution θ_λ computed from (26) with varying values of λ (x-axis). Likewise, the right panel shows the relative difference for s. The blue line shows the curve α = −log γ / (2λ), where γ is fixed to exp(−8) in all experiments.

3.3. Exact Recovery under Incoherence Conditions

Given the over-parameterized model (13) with y ∈ R^N, θ ∈ R^p, and p ≥ N, there is not enough information in y to recover θ* and s*, even with the prior information that s* is sparse: any given vector y ∈ R^N can be decomposed as the sum of an arbitrary sparse vector s and a vector Jθ from the column space of J, as long as J has full row rank. For θ* and s* to be identifiable, we first assume that J is low-rank, motivated by the empirical observation that the Jacobian matrix J of a practical deep neural network f_θ is approximately low-rank (Oymak et al., 2019).² However, the low-rank condition on J alone does not guarantee identifiability, because it cannot address the separability between Jθ and s*: following an argument similar to that in (Candès et al., 2011), if any column of J has a single nonzero entry, then any s* supported on that entry cannot be recovered without ambiguity.

²The low-rank assumption simplifies our analysis, but at the cost that our model cannot overfit arbitrarily corrupted labels the way a deep neural network can. We leave the study under a more realistic approximately low-rank assumption to future work.
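As a complement to the numerical verification of Section 3.2, the following self-contained sketch generates the low-rank-plus-sparse data described there, runs the over-parameterized gradient descent (16) from the small initialization (19), and compares the result against the convex program (26) with the matching λ from (23). The step size, iteration count, regularization value, and solver call are our own illustrative choices, not settings taken from the paper; a smaller step size tracks the gradient flow (17) more closely.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p, r, k = 20, 40, 3, 3                       # as in the Figure 2 experiment
gamma, lam = np.exp(-8.0), 0.05
alpha = -np.log(gamma) / (2 * lam)              # Eq. (23): α = -log(γ) / (2λ)
tau, iters = 1e-4, 400_000                      # our own choices

# Data generation described above: low-rank J, k-sparse s*, y = Jθ + s*.
J = rng.standard_normal((N, r)) @ rng.standard_normal((r, p))
s_star = np.zeros(N)
s_star[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = J @ rng.standard_normal(p) + s_star

# Gradient descent (16) on (15), starting from the small initialization (19).
theta, u, v = np.zeros(p), gamma * np.ones(N), gamma * np.ones(N)
for _ in range(iters):
    res = J @ theta + u * u - v * v - y
    theta, u, v = (theta - tau * (J.T @ res),
                   u - 2 * alpha * tau * u * res,
                   v + 2 * alpha * tau * v * res)
s_gd = u * u - v * v

# Convex program (26) with the matching λ (the paper uses the ECOS solver;
# CVXPY's default conic solver also works for this small problem).
th, s = cp.Variable(p), cp.Variable(N)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(th) + lam * cp.norm1(s)),
           [J @ th + s == y]).solve()

rho = np.linalg.norm(theta - th.value) / max(np.linalg.norm(theta),
                                             np.linalg.norm(th.value))
print("relative difference between GD and convex solutions:", rho)
```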
To rule out such ambiguities, we further assume that the column space of J and the standard basis [e_1, …, e_N] := I ∈ R^{N×N} are incoherent, in the following sense.

Definition 3.4 (Candès et al. (2011)). Let J = UΣV^⊤ ∈ R^{N×p} be the compact SVD of J and r be the rank of J. The coherence of J (w.r.t. the standard basis) is defined as

$$\mu(J) := \frac{N}{r} \max_{1 \le i \le N} \|U^\top e_i\|_2^2. \tag{27}$$

It should be noted that the low-rank and incoherence assumptions are common in matrix recovery (Davenport & Romberg, 2016; Chi et al., 2019). Based upon the above assumptions on J and s*, we show the following.

Proposition 3.5. Let r be the rank of J and k be the number of nonzero entries of s*. If

$$k^2 r < \frac{N}{4\mu(J)}, \tag{28}$$

then the solution to (26) is (θ*, s*) for any λ > λ_0, where λ_0 > 0 is a scalar depending on (J, θ*, k).

Combining this result with Proposition 3.3, the gradient flow in (17) with initialization (19) converges to (θ*, s*) when the learning rate ratio α in (17) is smaller than a certain threshold, justifying our claim in Theorem 3.1.

Numerical verification. To corroborate Proposition 3.5, we numerically solve (26) under varying conditions of λ, r, and k. The simulated data (J, θ*, s*, y) is generated in the same way as in the experiments of Section 3.2, with N = 100 and p = 150. For a solution (θ, s) obtained by solving (26), we measure the relative recovery errors ε_θ = ‖θ − θ*‖_2 / ‖θ*‖_2 and ε_s = ‖s − s*‖_2 / ‖s*‖_2.

Effects of the parameter λ. Here, we consider recovery with varying λ ∈ [0.0001, 1]. First, we fix r = 20 and vary k ∈ {10, 20, 40, 60, 80}, showing the relative recovery errors ε_θ and ε_s in Figure 3(a). Second, we fix k = 20 and vary r ∈ {10, 20, 40, 60, 80}, showing the results in Figure 3(b). The results show a clear phase transition: correct recovery is obtained only when λ is greater than a particular threshold λ_0. Moreover, λ_0 varies with k and r, consistent with Proposition 3.5.

Figure 3. Effect of the parameter λ on exact recovery by (26): (a) varying k with fixed r = 20; (b) varying r with fixed k = 20. The y-axis is the relative error of θ (left) and s (right), defined as ‖θ − θ*‖_2 / ‖θ*‖_2 and ‖s − s*‖_2 / ‖s*‖_2, respectively, where (θ, s) is the solution to (26). The curves are averages over 10 independent trials.

Relationship between the rank r and the sparsity k. Here, we fix λ = 0.1 and plot the phase transition with respect to r and k. For each (r, k), the simulation is repeated for 20 random instances, and for each instance we declare the recovery successful if ε_θ < 0.001 and ε_s < 0.001. As shown in Figure 4, the phase transition is consistent with Proposition 3.5: successful recovery is achieved only when both k and r are small.

Figure 4. Phase transition for solving (26) over 20 trials, with fixed λ = 0.1 and varying k, r. Recovery is declared successful if ‖θ − θ*‖_2 / ‖θ*‖_2 < 0.001 (left) and ‖s − s*‖_2 / ‖s*‖_2 < 0.001 (right).

4. Related Work and Discussion

4.1. Prior Arts on Implicit Regularization

Since over-parameterized deep neural networks do not overfit (in the absence of data corruption) even without any explicit regularization (Zhang et al., 2021a), it is argued that implicit regularization of the learning algorithms enables the models to converge to desired solutions. Under the assumption of linear or deep linear models, many works characterized the mathematics of such implicit bias via explicit regularization (Soudry et al., 2018; Gunasekar et al., 2018; Li et al., 2018; Oymak & Soltanolkotabi, 2019; Arora et al., 2019; Razin & Cohen, 2020; Li et al., 2020c; Ji et al., 2020; Stöger & Soltanolkotabi, 2021; Jacot et al., 2021).
Among those, the closest related to ours include (Vaskevicius et al., 2019; Zhao et al., 2019; Woodworth et al., 2020; Li et al., 2021a; Chou et al., 2021), which studied the implicit sparse regularization induced by a term of the form u u v v. While all the above works aim to understand implicit regularization by studying linear models, the practical benefits of such studies are unclear. Our work provides an inspiring result showing that principled design with implicit regularization leads to robust learning of over-parameterized models. In particular, our model in (2) is motivated by existing studies on the implicit sparse regularization, but adds such a regularization to an (already) implicitly regularized model for handling sparse corruptions. In other words, two forms of implicit regularization are involved in our model which poses new problems in the design of the optimization algorithm and in mathematical analysis. To the best of our knowledge, the only prior works that use implicit sparse regularization for robust learning are (You et al., 2020; Ma & Fattahi, 2021; Ding et al., 2021) which studied the robust recovery of low-rank matrices and images. Among them, our work extends (You et al., 2020) to the problem of image classification with label noise, demonstrates its effectiveness, and provides dedicated theoretical analyses. Additionally, methods in (Ma & Fattahi, 2021; Ding et al., 2021) require a particular learning rate schedule that may not be compatible with commonly used schedules such as cosine annealing (Loshchilov & Hutter, 2016) in image classification. 4.2. Relationship to Existing Work on Label Noise Deep neural networks are over-parameterized hence prone to overfitting to the label noises. While many popular regularization techniques for alleviating overfitting, such as label smoothing (Szegedy et al., 2016; Lukasik et al., 2020; Wei et al., 2021a) and mixup (Zhang et al., 2018), are useful for mitigating the impact of label noise, they do not completely solve the problem due to a lack of precise noise modeling. In the following, we discuss three of the most popular line of work dedicated to the label noise problem; we refer the reader to the survey papers (Algan & Ulusoy, 2021; Song et al., 2020; Wei et al., 2021b) for a comprehensive review. Loss design. Robust loss function, such as the ℓ1 loss (Ghosh et al., 2017), is one of the most popular approaches to the label noise problem which has many recent extensions (Zhang & Sabuncu, 2018; Wang et al., 2019; Amid et al., 2019; Ma et al., 2020; Yu et al., 2020; Wei & Liu, 2021). The method is based on reducing the loss associated with large outlying entries, hence the impact of label noise. A similar idea is also explored in gradient clipping (Menon et al., 2019) and loss reweighting (Liu & Tao, 2015; Wang et al., 2017; Chang et al., 2017; Zhang et al., 2021b; Zetterqvist et al., 2021) methods. While robust loss enables the model to learn faster from correct labels, the global solution still overfits to corrupted labels with over-parameterized models. Label transition probability. Another popular line of work for label noise is based on the assumption that the noisy label is drawn from a probability distribution conditioned on Robust Training under Label Noise by Over-parameterization the true label. Here, the main task is to estimate the underlying transition probabilities. 
The early work (Chen & Gupta, 2015; Goldberger & Ben-Reuven, 2017) encapsulates the transition probabilities as a noise adaptation layer that is stacked on top of a classification network and trained jointly in an end-to-end fashion. Recent work (Patrini et al., 2017) uses separated procedures to estimate the transition probabilities, the success of which requires either the availability of a clean validation data (Hendrycks et al., 2018) or additional data assumptions (Xia et al., 2019; Zhu et al., 2021; Li et al., 2021b; Zhang et al., 2021c). Even if the underlying transition probabilities can be correctly recovered, overfitting is only prevented asymptotically, requiring sufficiently many samples of corrupted labels for each input (Patrini et al., 2017), which is not practical. Label correction. In contrast to the above methods, our method completely avoids overfitting even with finite training samples. This is achieved by the over-parameterization term u u v v in (2) which recovers the clean labels. Hence, our method is related to techniques based on noisy label detection and refurbishment. Nonetheless, existing techniques are based on heuristic argument about different behaviors of clean and corrupted samples in the training process, such as properties of learned representations (Kim et al., 2021; Ma et al., 2018; Jiang et al., 2020), prediction consistency (Reed et al., 2014; Song et al., 2019a), learning speed (Li et al., 2020b; Liu et al., 2020; 2021a), margin (Lin & Bradic, 2021), confidence (Cheng et al., 2021). They often need to be combined with engineering tricks such as moving average (Huang et al., 2020; Liu et al., 2020) and burning-in (Zheng et al., 2020) to make them work well. Finally, the work (Hu et al., 2019) introduces a variable to estimate the label noise in a way similar to (2). However, the variable is not over-parameterized to induce sparsity, and their method does not have competitive performance. 4.3. Sparsity in Deep Learning Our method is broadly related to existing efforts on introducing sparsity into deep learning (Hoefler et al., 2021), but is notably different in both the objective of introducing sparsity, the origin of sparsity, and how sparsity is enforced. First, previous exploration of sparsity primarily aims to improve training and inference efficiency with large-scale models, while our paper focuses on robust training under label noise. Second, previous introduction of sparsity is often motivated by its presence in biological brains, but there is still a lack of clean understanding of how sparsity helps with learning. In contrast, sparsity in our method has the clear mathematically meaning that the percentage of corrupted labels is small. Finally, while pruning (Liu et al., 2021b; Chen et al., 2021) is a dominant approach for obtaining sparsity, our method leverages the implicit bias of gradient descent associated with a particular sparse over-parameterization. 4.4. Limitations and Future Directions Choice of optimization algorithms. Our SOP method is based on introducing more parameters to an already overparameterized model, hence relies critically on the choice of the optimization algorithm to induce the desired implicit regularization. For vanilla gradient descent, our analysis in Section 3 shows that it has the desired implicit regularization by design. In practical deep network training, it is more common to use the stochastic gradient descent with momentum. 
While not theoretically justified, experiments in Section 2 show that our method works with such practical variants. This may not come as a surprise, because existing studies already show that stochastic gradient descent (Nacson et al., 2019) and momentum acceleration (Wang et al., 2021) have the same implicit bias as the vanilla gradient decent under certain models. We leave the extension of such results to our method as future works. Modeling of label noise. Our method is based on the assumption that the label noise matrix S = [s 1, . . . , s N], where s i is the difference between the observed label yi and the underlying true label, is a sparse matrix. We made no additional assumption on the sparsity pattern of S , other than the non-negative and non-positive constraints discussed in (4). In practice, it is usually the case that certain pairs of classes are more similar hence more easily confusing with each other than other pairs. As a result, certain blocks of S tend to have more non-zero entries than the others. When there is a prior on which blocks may have more non-zero entries, our method may be adapted by using a weighted sparse regularization for the corresponding blocks. When there is no such prior, our method may be adapted by using a group sparse regularization (Neyshabur et al., 2014; Tibshirani, 2021). Acknowledgements SL and QQ were partially supported by NSF grant DMS 2009752. SL was partially supported by NSF NRT-HDR Award 1922658 and Alzheimer s Association grant AARGNTF-21-848627. QQ also acknowledge support from NSF CAREER 2143904, NSF CCF 2212066, and ONR N0001422-1-2529. ZZ acknowledges support from NSF grants CCF 2008460 and CCF 2106881. Part of this work was done when CY was at University of California, Berkeley and was supported by Tsinghua-Berkeley Shenzhen Institute Research Fund. The authors acknowledge helpful discussion with Ryan Chan from Johns Hopkins University. Algan, G. and Ulusoy, I. Image classification with deep learning in the presence of noisy labels: A survey. Knowledge-Based Systems, 215:106771, 2021. Robust Training under Label Noise by Over-parameterization Amid, E., Warmuth, M. K., Anil, R., and Koren, T. Robust bi-tempered logistic loss based on bregman divergences. Advances in Neural Information Processing Systems, 32: 15013 15022, 2019. Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pp. 7411 7422, 2019. Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. Mixmatch: A holistic approach to semi-supervised learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203 4215, 2005. Cand es, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3): 1 37, 2011. Chang, H.-S., Learned-Miller, E., and Mc Callum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. Advances in Neural Information Processing Systems, 30:1002 1012, 2017. 
Chen, T., Zhang, Z., Balachandra, S., Ma, H., Wang, Z., Wang, Z., et al. Sparsity winning twice: Better robust generalization from more efficient training. In International Conference on Learning Representations, 2021. Chen, X. and Gupta, A. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1431 1439, 2015. Cheng, H., Zhu, Z., Li, X., Gong, Y., Sun, X., and Liu, Y. Learning with instance-dependent label noise: A sample sieve approach. In International Conference on Learning Representations, 2021. Chi, Y., Lu, Y. M., and Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239 5269, 2019. Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937 2947, 2019. Chou, H.-H., Maly, J., and Rauhut, H. More is less: Inducing sparsity via overparameterization. ar Xiv preprint ar Xiv:2112.11027, 2021. Claerbout, J. F. and Muir, F. Robust modeling with erratic data. Geophysics, 38(5):826 844, 1973. Cohen, A., Dahmen, W., and De Vore, R. Compressed sensing and best k-term approximation. Journal of the American mathematical society, 22(1):211 231, 2009. Davenport, M. A. and Romberg, J. An overview of lowrank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4): 608 622, 2016. Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016. Ding, L., Jiang, L., Chen, Y., Qu, Q., and Zhu, Z. Rank overspecified robust matrix recovery: Subgradient method and exact recovery. Advances in Neural Information Processing Systems, 34, 2021. Domahidi, A., Chu, E., and Boyd, S. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pp. 3071 3076, 2013. Fr enay, B. and Verleysen, M. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845 869, 2013. Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points online stochastic gradient for tensor decomposition. In Conference on learning theory, pp. 797 842. PMLR, 2015. Ghosh, A., Kumar, H., and Sastry, P. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Goldberger, J. and Ben-Reuven, E. Training deep neuralnetworks using a noise adaptation layer. 2017. Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop (ITA), pp. 1 10. IEEE, 2018. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Robust Training under Label Noise by Over-parameterization He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hendrycks, D., Mazeika, M., Wilson, D., and Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. 
Advances in Neural Information Processing Systems, 31:10456 10465, 2018. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., and Peste, A. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1 124, 2021. Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2019. Huang, L., Zhang, C., and Zhang, H. Self-adaptive training: beyond empirical risk minimization. ar Xiv preprint ar Xiv:2002.10319, 2020. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. ar Xiv preprint ar Xiv:1806.07572, 2018. Jacot, A., Ged, F., Gabriel, F., Simsek, B., and Hongler, C. Deep linear networks dynamics: Low-rank biases induced by initialization scale and l2 regularization. ar Xiv preprint ar Xiv:2106.15933, 2021. Ji, Z., Dud ık, M., Schapire, R. E., and Telgarsky, M. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, pp. 2109 2136. PMLR, 2020. Jiang, L., Huang, D., Liu, M., and Yang, W. Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning, pp. 4804 4815. PMLR, 2020. Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. Sgd on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32:3496 3506, 2019. Kim, T., Ko, J., Choi, J., Yun, S.-Y., et al. Fine samples for learning with noisy labels. Advances in Neural Information Processing Systems, 34, 2021. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25: 1097 1105, 2012. Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on learning theory, pp. 1246 1257. PMLR, 2016. Li, J., Wong, Y., Zhao, Q., and Kankanhalli, M. S. Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5051 5059, 2019. Li, J., Socher, R., and Hoi, S. C. Dividemix: Learning with noisy labels as semi-supervised learning. ar Xiv preprint ar Xiv:2002.07394, 2020a. Li, J., Nguyen, T., Hegde, C., and Wong, K. W. Implicit sparse regularization: The impact of depth and early stopping. Advances in Neural Information Processing Systems, 34, 2021a. Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International conference on artificial intelligence and statistics, pp. 4313 4324. PMLR, 2020b. Li, W., Wang, L., Li, W., Agustsson, E., and Van Gool, L. Webvision database: Visual learning and understanding from web data. ar Xiv preprint ar Xiv:1708.02862, 2017. Li, X., Liu, T., Han, B., Niu, G., and Sugiyama, M. Provably end-to-end label-noise learning without anchor points. ar Xiv preprint ar Xiv:2102.02400, 2021b. Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2 47, 2018. Li, Z., Luo, Y., and Lyu, K. 
Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2020c. Lin, J. Z. and Bradic, J. Learning to combat noisy labels via classification margins. ar Xiv preprint ar Xiv:2102.00751, 2021. Liu, S., Niles-Weed, J., Razavian, N., and Fernandez Granda, C. Early-learning regularization prevents memorization of noisy labels. Advances in Neural Information Processing Systems, 33, 2020. Liu, S., Liu, K., Zhu, W., Shen, Y., and Fernandez-Granda, C. Adaptive early-learning correction for segmentation from noisy annotations. Ar Xiv, abs/2110.03740, 2021a. Robust Training under Label Noise by Over-parameterization Liu, S., Yin, L., Mocanu, D. C., and Pechenizkiy, M. Do we actually need dense over-parameterization? in-time overparameterization in sparse training. In International Conference on Machine Learning, pp. 6989 7000. PMLR, 2021b. Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447 461, 2015. Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. Lukasik, M., Bhojanapalli, S., Menon, A., and Kumar, S. Does label smoothing mitigate label noise? In International Conference on Machine Learning, pp. 6448 6458. PMLR, 2020. Ma, J. and Fattahi, S. Implicit regularization of sub-gradient method in robust matrix recovery: Don t be afraid of outliers. ar Xiv preprint ar Xiv:2102.02969, 2021. Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S., Xia, S., Wijewickrema, S., and Bailey, J. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pp. 3355 3364. PMLR, 2018. Ma, X., Huang, H., Wang, Y., Romano, S., Erfani, S., and Bailey, J. Normalized loss functions for deep learning with noisy labels. In International Conference on Machine Learning, pp. 6543 6553. PMLR, 2020. Menon, A. K., Rawat, A. S., Reddi, S. J., and Kumar, S. Can gradient clipping mitigate label noise? In International Conference on Learning Representations, 2019. Nacson, M. S., Srebro, N., and Soudry, D. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3051 3059. PMLR, 2019. Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. ar Xiv preprint ar Xiv:1412.6614, 2014. Oymak, S. and Soltanolkotabi, M. Overparameterized nonlinear learning: Gradient descent takes the shortest path? In International Conference on Machine Learning, pp. 4951 4960, 2019. Oymak, S., Fabian, Z., Li, M., and Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. ar Xiv preprint ar Xiv:1906.05392, 2019. Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944 1952, 2017. Razin, N. and Cohen, N. Implicit regularization in deep learning may not be explainable by norms. Advances in Neural Information Processing Systems, 33, 2020. Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. 
ar Xiv preprint ar Xiv:1412.6596, 2014. Song, H., Kim, M., and Lee, J.-G. Selfie: Refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907 5915. PMLR, 2019a. Song, H., Kim, M., Park, D., and Lee, J.-G. Prestopping: How does early stopping help generalization against label noise? Ar Xiv, abs/1911.08059, 2019b. Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. ar Xiv preprint ar Xiv:2007.08199, 2020. Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822 2878, 2018. St oger, D. and Soltanolkotabi, M. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems, 34, 2021. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818 2826, 2016. Tanaka, D., Ikami, D., Yamasaki, T., and Aizawa, K. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552 5560, 2018. Tibshirani, R. J. Equivalences between sparse models and neural networks. 2021. Vaskevicius, T., Kanade, V., and Rebeschini, P. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pp. 2968 2979, 2019. Robust Training under Label Noise by Over-parameterization Wang, B., Meng, Q., Zhang, H., Sun, R., Chen, W., and Ma, Z.-M. Momentum doesn t change the implicit bias. ar Xiv preprint ar Xiv:2110.03891, 2021. Wang, R., Liu, T., and Tao, D. Multiclass learning with partially corrupted labels. IEEE transactions on neural networks and learning systems, 29(6):2568 2580, 2017. Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322 330, 2019. Wei, J. and Liu, Y. When optimizing f-divergence is robust with label noise. In International Conference on Learning Representations, 2021. Wei, J., Liu, H., Liu, T., Niu, G., and Liu, Y. Understanding (generalized) label smoothing whenlearning with noisy labels. ar Xiv preprint ar Xiv:2106.04149, 2021a. Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., and Liu, Y. Learning with noisy labels revisited: A study using real-world human annotations. ar Xiv preprint ar Xiv:2110.12088, 2021b. Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pp. 3635 3673. PMLR, 2020. Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., and Ma, Y. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence, 31(2):210 227, 2008. Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? Advances in Neural Information Processing Systems, 32:6838 6849, 2019. Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. 
In International Conference on Learning Representations, 2020.

Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699, 2015.

Xie, Q., Dai, Z., Hovy, E. H., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training. arXiv preprint, 2020.

Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pp. 10767–10777. PMLR, 2020.

You, C., Zhu, Z., Qu, Q., and Ma, Y. Robust recovery via implicit bias of discrepant learning rates for double over-parameterization. Advances in Neural Information Processing Systems, 33:17733–17744, 2020.

Yu, Y., Chan, K. H. R., You, C., Song, C., and Ma, Y. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33:9422–9434, 2020.

Zetterqvist, O., Jörnsten, R., and Jonasson, J. Robust neural network classification via double regularization. arXiv preprint arXiv:2112.08102, 2021.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021a.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

Zhang, H., Xing, X., and Liu, L. DualGraph: A graph-based method for reasoning about label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9654–9663, 2021b.

Zhang, Y., Niu, G., and Sugiyama, M. Learning noise transition matrix from only noisy labels via total variation regularization. arXiv preprint arXiv:2102.02414, 2021c.

Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

Zhao, P., Yang, Y., and He, Q.-C. Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367, 2019.

Zheng, S., Wu, P., Goswami, A., Goswami, M., Metaxas, D., and Chen, C. Error-bounded correction of noisy labels. In International Conference on Machine Learning, pp. 11447–11457. PMLR, 2020.

Zhu, Z., Song, Y., and Liu, Y. Clusterability as an alternative to anchor points when learning with noisy labels. arXiv preprint arXiv:2102.05291, 2021.

Appendices

This appendix is organized as follows. In Section A we provide additional details for reproducing the experimental results presented in Section 2. In Section B we provide proofs for the theoretical results presented in Section 3.

A. Training Details for Robust Classification with Label Noise

A.1. Choice of Loss Function

The cross-entropy loss in (8) cannot be used to optimize {v_i}, as we explain below. Consider a data point x with a one-hot label y. With the CE loss in (8) rewritten below for convenience,

$$L_{\mathrm{CE}}(\theta, u, v; x, y) := \ell_{\mathrm{CE}}\big(\varphi[f(x;\theta) + s],\, y\big), \quad \text{with } s := u \odot u \odot y - v \odot v \odot (1 - y), \tag{A.1}$$

we may compute its gradient with respect to (w.r.t.) v as

$$\frac{\partial L_{\mathrm{CE}}(\theta, u, v; x, y)}{\partial v} = -\frac{2\, v \odot (1 - y)}{\mathbf{1}^\top\big(f(x;\theta) + s\big)}. \tag{A.2}$$

This shows that the gradient w.r.t.
different entries of v does not depend on the output f(x;θ) of the model at all, modulo the common normalizer $\mathbf{1}^\top(f(x;\theta)+s)$ shared by all entries. Hence, v cannot correctly learn the label noise.

We now consider the MSE loss in (9), rewritten below for convenience:

$$L_{\mathrm{MSE}}(\theta, u, v; x, y) := \ell_{\mathrm{MSE}}\big(f(x;\theta) + s,\, y\big), \quad \text{with } s := u \odot u \odot y - v \odot v \odot (1 - y). \tag{A.3}$$

The gradient w.r.t. v can be computed as

$$\frac{\partial L_{\mathrm{MSE}}(\theta, u, v; x, y)}{\partial v} = -4\, \big(f(x;\theta) + s - y\big) \odot v \odot (1 - y). \tag{A.4}$$

Here the gradient w.r.t. different entries of v varies depending on how well the model prediction f(x;θ)+s matches the given label y at the corresponding entry. Hence, when the model prediction deviates from the given label, which may occur when the label is corrupted, v is able to learn the underlying corruption to the label.

A.2. Definition of Label Noise

In this paper, we consider two widely studied types of label noise, namely symmetric label noise and asymmetric label noise. For symmetric noise with noise level α, the labels are generated as

$$y = \begin{cases} y_{\mathrm{GT}} & \text{with probability } 1 - \alpha,\\ \text{a random one-hot vector} & \text{with probability } \alpha.\end{cases}$$

We consider noise levels α ∈ {0.2, 0.4, 0.5, 0.6, 0.8}. For asymmetric noise, following (Patrini et al., 2017), we flip labels TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, and CAT ↔ DOG. We randomly choose 40% of the training data and flip their labels according to this asymmetric labeling rule. For real-world datasets, the noise level of Clothing1M is estimated at around 38.5% (Song et al., 2019b), and the noise level of WebVision is estimated at around 20% (Li et al., 2017).

A.3. Implementation Details of SOP+

We consider two separate regularization terms to further boost the results and stabilize training. We describe their definitions and roles below.

Consistency regularizer L_C. We use a regularizer L_C to encourage consistency of the network prediction on an original image and the corresponding augmented image. Such a regularizer is commonly used in the semi-supervised learning and label-noise learning literature; see, e.g., (Berthelot et al., 2019; Li et al., 2019). Specifically, the consistency regularizer L_C is defined as the Kullback–Leibler (KL) divergence between the softmax predictions for the images with augmentations (described in Section A.4) and the softmax predictions for the corresponding images generated with Unsupervised Data Augmentation (UDA) (Xie et al., 2020):

$$L_C(\theta) = \sum_{i=1}^{N} D_{\mathrm{KL}}\big(f(x_i;\theta)\,\big\|\, f(\mathrm{UDA}(x_i);\theta)\big).$$

Class-balance regularizer L_B. We use a regularizer L_B to prevent the network from assigning all data points to the same class. Following (Tanaka et al., 2018), we use prior information on the probability distribution p of class labels and minimize its KL-divergence to the mean prediction over each batch B:

$$L_B(\theta) = \sum_{k=1}^{K} p_k \log\frac{p_k}{\bar f_k(x;\theta)}, \quad \text{where } \bar f_k(x;\theta) = \frac{1}{|B|}\sum_{x \in B} f_k(x;\theta),$$

and p_k stands for the prior probability of the k-th class; since $\sum_k p_k \log p_k$ is a constant, minimizing $L_B$ is equivalent to minimizing $-\sum_{k=1}^K p_k \log \bar f_k(x;\theta)$.

The final loss function for SOP+ is therefore constructed from three terms as

$$L\big(\theta, \{u_i, v_i\}\big) + \lambda_C\, L_C(\theta) + \lambda_B\, L_B(\theta),$$

where λ_C, λ_B > 0 are hyper-parameters.

Table A.1. Hyper-parameters for SOP and SOP+ on the CIFAR-10/100, Clothing-1M, and WebVision datasets.

| | CIFAR-10 (SOP) | CIFAR-10 (SOP+) | CIFAR-100 (SOP) | CIFAR-100 (SOP+) | Clothing-1M | WebVision |
|---|---|---|---|---|---|---|
| architecture | ResNet34 | PreActResNet18 | ResNet34 | PreActResNet18 | ResNet-50 (pretrained) | InceptionResNetV2 |
| batch size | 128 | 128 | 128 | 128 | 64 | 32 |
| learning rate (lr) | 0.02 | 0.02 | 0.02 | 0.02 | 0.002 | 0.02 |
| lr decay | 40th & 80th epoch | cosine annealing | 40th & 80th epoch | cosine annealing | 5th epoch | 50th epoch |
| weight decay (wd) | 5×10⁻⁴ | 5×10⁻⁴ | 5×10⁻⁴ | 5×10⁻⁴ | 1×10⁻³ | 5×10⁻⁴ |
| training epochs | 120 | 300 | 150 | 300 | 10 | 100 |
| training examples | 45,000 | 50,000 | 45,000 | 50,000 | 1,000,000 | 66,000 |
| lr for {u_i, v_i} | Sym: α_u = 10, α_v = 10; Asym: α_u = 10, α_v = 100 | Sym: α_u = 10, α_v = 10; Asym: α_u = 10, α_v = 100 | Sym: α_u = 1, α_v = 10; Asym: α_u = 1, α_v = 100 | Sym: α_u = 1, α_v = 10; Asym: α_u = 1, α_v = 100 | α_u = 0.1, α_v = 1 | α_u = 0.1, α_v = 1 |
| wd for {u_i, v_i} | 0 | 0 | 0 | 0 | 0 | 0 |
| init. std for {u_i, v_i} | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ | 10⁻⁸ |
| λ_C | 0.0 | 0.9 | 0.0 | 0.9 | 0.0 | 0.0 |
| λ_B | 0.0 | 0.1 | 0.0 | 0.1 | 0.0 | 0.0 |
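Putting Sections A.1 and A.3 together, the following PyTorch-style sketch shows how the per-batch SOP+ objective can be assembled: the MSE-based SOP term with the over-parameterization s = u⊙u⊙y − v⊙v⊙(1−y), plus the consistency and class-balance regularizers with the CIFAR SOP+ weights λ_C = 0.9, λ_B = 0.1 from Table A.1. This is only an illustrative sketch under our own assumptions (softmax outputs used for f(x;θ), per-batch slices of u and v passed in directly, and no clamping or re-normalization of f(x;θ)+s); it is not the released implementation.

```python
import torch
import torch.nn.functional as F

def sop_plus_loss(logits, logits_uda, y_onehot, u, v, prior,
                  lambda_c=0.9, lambda_b=0.1):
    """Illustrative per-batch SOP+ objective.

    logits, logits_uda: network outputs for the weakly and UDA-augmented views, (B, K).
    y_onehot: (possibly noisy) one-hot labels, (B, K).
    u, v: rows of the per-example over-parameterization variables for this batch, (B, K).
    prior: prior class probabilities p, (K,).
    """
    f_x = torch.softmax(logits, dim=1)
    f_uda = torch.softmax(logits_uda, dim=1)

    # Sparse over-parameterization of the label noise (Section A.1):
    # s = u*u*y - v*v*(1 - y) can subtract mass from a wrong given label
    # and add mass to the remaining classes.
    s = u * u * y_onehot - v * v * (1.0 - y_onehot)

    # MSE-based SOP term: match f(x; theta) + s to the given (noisy) label.
    loss_sop = ((f_x + s - y_onehot) ** 2).sum(dim=1).mean()

    # Consistency regularizer L_C = KL(f(x) || f(UDA(x))).
    loss_c = F.kl_div(f_uda.log(), f_x, reduction="batchmean")

    # Class-balance regularizer L_B = KL(prior || mean prediction over the batch).
    f_bar = f_x.mean(dim=0)
    loss_b = (prior * (prior.log() - f_bar.log())).sum()

    return loss_sop + lambda_c * loss_c + lambda_b * loss_b
```

In practice the gradients of this loss flow both into the network parameters θ and into the selected rows of u and v, which are optimized with their own learning rates as described in Section A.4.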
A.4. Experimental Settings

Data processing: For experiments on CIFAR-10/100 (Krizhevsky et al., 2009) without extra techniques, we use simple data augmentations, including random crop and horizontal flip, following previous works (Patrini et al., 2017; Liu et al., 2020). For SOP+, we use the default setting from Unsupervised Data Augmentation (Xie et al., 2020) to apply efficient data augmentation and create another view of the data for consistency training. For Clothing-1M (Xiao et al., 2015), we first resize images to 256 × 256 and then randomly crop them to 224 × 224, followed by a random horizontal flip. For WebVision (Li et al., 2017), we randomly crop the images to size 227 × 227. All images are standardized by their means and variances.

Hyper-parameters of SOP: We adopt an SGD optimizer without weight decay for U and V. We keep all the hyper-parameters fixed for different levels of noise. For fair comparison, we adopt two settings of hyper-parameters and architectures for SOP and SOP+. More details of the hyper-parameters can be found in Table A.1. Note that the method is not very sensitive to the hyper-parameters λ_C and λ_B.
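As a concrete illustration of the optimizer setup just described, the sketch below builds separate SGD optimizers: one with weight decay for the network parameters, and one without weight decay for the over-parameterization variables, using the CIFAR-10 symmetric-noise values of Table A.1 (lr 0.02, wd 5×10⁻⁴, α_u = α_v = 10, init. std 10⁻⁸). The torchvision ResNet34 stand-in, the use of momentum for the network, and storing u, v as a dense (num_train × num_classes) parameter are our assumptions, not details taken from the paper.

```python
import torch
import torchvision

# Stand-in backbone (Table A.1 uses ResNet34 / PreActResNet18 for CIFAR).
model = torchvision.models.resnet34(num_classes=10)

# One row of u and v per training example, initialized with a tiny standard
# deviation (10^-8 in Table A.1) so that s = u*u*y - v*v*(1-y) starts near zero.
num_train, num_classes = 50000, 10
u = torch.nn.Parameter(1e-8 * torch.randn(num_train, num_classes))
v = torch.nn.Parameter(1e-8 * torch.randn(num_train, num_classes))

# SGD with weight decay for the network parameters ...
opt_net = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                          weight_decay=5e-4)
# ... and SGD without weight decay, with the separate learning rates
# alpha_u and alpha_v of Table A.1, for the over-parameterization variables.
opt_uv = torch.optim.SGD([{"params": [u], "lr": 10.0},   # alpha_u
                          {"params": [v], "lr": 10.0}],  # alpha_v
                         lr=10.0, momentum=0.0, weight_decay=0.0)
```

During training, each mini-batch indexes the corresponding rows of u and v, and both optimizers are stepped after backpropagating the SOP (or SOP+) loss.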
B. Proofs for Theoretical Analysis with Linear Models

B.1. Proof of Proposition 3.2

We first present a simple but useful lemma.

Lemma B.1. Let (θ, u, v) be a critical point of (15) that is not a global minimum, i.e., $r := J\theta + u \odot u - v \odot v - y \neq 0$. Then there exists an index i such that

$$u_i = v_i = 0, \quad r_i \neq 0, \tag{B.1}$$

where $u_i$, $v_i$, and $r_i$ denote the i-th entries of u, v, and r, respectively.

Proof. We may compute the gradient of the objective function h in (15) as

$$\nabla_\theta h(\theta, u, v) = J^\top r, \quad \nabla_u h(\theta, u, v) = 2\, r \odot u, \quad \nabla_v h(\theta, u, v) = -2\, r \odot v.$$

Since $r \neq 0$ but $\nabla_u h(\theta, u, v) = \nabla_v h(\theta, u, v) = 0$, we must have $u_i = v_i = 0$ and $r_i \neq 0$ for some i.

We now prove Proposition 3.2 as follows.

Proof of Proposition 3.2. We compute the Hessian $\nabla^2 h$ of the objective function h in (15) as

$$\nabla^2 h(\theta, u, v) = \begin{bmatrix} J^\top J & 2 J^\top \operatorname{diag}(u) & -2 J^\top \operatorname{diag}(v) \\ 2 \operatorname{diag}(u)\, J & 2 \operatorname{diag}(r + 2\, u \odot u) & -4 \operatorname{diag}(v \odot u) \\ -2 \operatorname{diag}(v)\, J & -4 \operatorname{diag}(v \odot u) & -2 \operatorname{diag}(r - 2\, v \odot v) \end{bmatrix}.$$

For any direction $d = (d_\theta, d_u, d_v)$, the quadratic form of the Hessian $\nabla^2 h$ along this direction is given by

$$d^\top \nabla^2 h(\theta, u, v)\, d = \|J d_\theta\|_2^2 + 4\, \|u \odot d_u\|_2^2 + 4\, \|v \odot d_v\|_2^2 + 2\, \langle r,\ d_u \odot d_u - d_v \odot d_v \rangle + 4\, \langle J d_\theta,\ u \odot d_u - v \odot d_v \rangle - 8\, \langle u \odot d_u,\ v \odot d_v \rangle. \tag{B.2}$$

We now consider an arbitrary critical point (θ, u, v) of (15) that is not a global minimum. By Lemma B.1, there exists an i such that $r_i \neq 0$ while $u_i = v_i = 0$. We divide the discussion into two cases.

Case 1: $r_i > 0$. We set $d_\theta = 0$, $d_u = 0$, and $d_v$ to be such that all of its entries are zero except for the i-th entry, which is given by $d_v^i = 1$. Plugging this direction into (B.2), we obtain

$$d^\top \nabla^2 h(\theta, u, v)\, d = 4 \underbrace{[v_i]^2}_{v_i = 0} [d_v^i]^2 - 2\, r_i \underbrace{[d_v^i]^2}_{d_v^i = 1} = -2\, r_i < 0.$$

Case 2: $r_i < 0$. We set $d_\theta = 0$, $d_v = 0$, and $d_u$ to be such that all of its entries are zero except for the i-th entry, which is given by $d_u^i = 1$. Plugging this direction into (B.2), we obtain

$$d^\top \nabla^2 h(\theta, u, v)\, d = 4 \underbrace{[u_i]^2}_{u_i = 0} [d_u^i]^2 + 2\, r_i \underbrace{[d_u^i]^2}_{d_u^i = 1} = 2\, r_i < 0.$$

In both cases above we have constructed a direction of negative curvature, hence (θ, u, v) is a strict saddle.

B.2. Proof of Proposition 3.3

The proof is based on the following lemma, which follows directly from the KKT conditions.

Lemma B.2 (KKT condition). Given any J and y, if there exists $(\hat\theta, \hat s, \hat\nu)$ satisfying

$$y = J\hat\theta + \hat s, \quad \hat\theta = J^\top \hat\nu, \quad \text{and} \quad \hat\nu \in \lambda\, \operatorname{sign}(\hat s), \tag{B.3}$$

then $(\hat\theta, \hat s)$ is an optimal solution to (26). In the above, $\operatorname{sign}(z)$ is defined entrywise on z as

$$\operatorname{sign}(z_i) = \begin{cases} z_i / |z_i| & \text{if } z_i \neq 0, \\ [-1, 1] & \text{if } z_i = 0. \end{cases} \tag{B.4}$$

Proof of Proposition 3.3. We divide the proof into two parts.

Global convergence. In this part we show that $(\theta_\infty(\gamma, \alpha), u_\infty(\gamma, \alpha), v_\infty(\gamma, \alpha))$ is a global solution to (15) for any fixed (γ, α). Denote

$$r_\infty(\gamma, \alpha) := \lim_{t \to \infty} r_t(\gamma, \alpha). \tag{B.5}$$

It follows from (18) and (22) that the limit $r_\infty(\gamma, \alpha)$ exists and can be written as

$$r_\infty(\gamma, \alpha) = J \theta_\infty(\gamma, \alpha) + u_\infty(\gamma, \alpha) \odot u_\infty(\gamma, \alpha) - v_\infty(\gamma, \alpha) \odot v_\infty(\gamma, \alpha) - y. \tag{B.6}$$

Suppose, for the purpose of obtaining a contradiction, that $(\theta_\infty(\gamma, \alpha), u_\infty(\gamma, \alpha), v_\infty(\gamma, \alpha))$ is not a global solution to (15). It follows from Lemma B.1 that there exists an i such that

$$u^i_\infty(\gamma, \alpha) = v^i_\infty(\gamma, \alpha) = 0 \quad \text{and} \quad r^i_\infty(\gamma, \alpha) \neq 0. \tag{B.7}$$

Without loss of generality, we assume that $C := r^i_\infty(\gamma, \alpha) > 0$, so that $r^i_t(\gamma, \alpha) \to C$ as $t \to \infty$. For any $\epsilon \in (0, C)$, there exists a $t_0 > 0$ such that

$$C - \epsilon \le r^i_t(\gamma, \alpha) \le C + \epsilon, \quad \forall\, t > t_0. \tag{B.8}$$

It follows from (B.8) and (21) that

$$\nu^i_t(\gamma, \alpha) = -\int_0^t r^i_\tau(\gamma, \alpha)\, d\tau = \nu^i_{t_0}(\gamma, \alpha) - \int_{t_0}^t r^i_\tau(\gamma, \alpha)\, d\tau \in \Big[\nu^i_{t_0}(\gamma, \alpha) - (C + \epsilon)(t - t_0),\ \nu^i_{t_0}(\gamma, \alpha) - (C - \epsilon)(t - t_0)\Big], \quad \forall\, t > t_0. \tag{B.9}$$

Using this bound on $\nu^i_t(\gamma, \alpha)$ in (20), we obtain

$$v^i_t(\gamma, \alpha) = \gamma \exp\big(-2\alpha\, \nu^i_t(\gamma, \alpha)\big) \ge \gamma \exp\big(-2\alpha\, \nu^i_{t_0}(\gamma, \alpha)\big) \exp\big(2\alpha (C - \epsilon)(t - t_0)\big), \quad \forall\, t > t_0. \tag{B.10}$$

Taking the limit $t \to \infty$, we obtain $v^i_\infty(\gamma, \alpha) = \infty$, which contradicts $v^i_\infty(\gamma, \alpha) = 0$ in (B.7). Therefore, we conclude that $(\theta_\infty(\gamma, \alpha), u_\infty(\gamma, \alpha), v_\infty(\gamma, \alpha))$ is a global solution to (15).

Implicit regularization. In this part we prove that $(\hat\theta, \hat s)$ is an optimal solution to the regularized convex optimization problem in (26). Let $\nu_\infty(\gamma, \alpha)$ be the limit of $\nu_t(\gamma, \alpha)$ in (21) as $t \to \infty$, and let

$$\hat\nu := \lim_{\gamma \to 0} \nu_\infty(\gamma, \alpha(\gamma)), \tag{B.11}$$

with α(γ) defined in (23). We only need to show that the triplet $(\hat\theta, \hat s, \hat\nu)$, with $\hat\theta$ defined in (24), $\hat s$ defined in (25), and $\hat\nu$ defined in (B.11), satisfies the KKT conditions in (B.3).

1. Because $(\theta_\infty(\gamma, \alpha), u_\infty(\gamma, \alpha), v_\infty(\gamma, \alpha))$ is a global solution to (15), we have

$$J \theta_\infty(\gamma, \alpha) + u_\infty(\gamma, \alpha) \odot u_\infty(\gamma, \alpha) - v_\infty(\gamma, \alpha) \odot v_\infty(\gamma, \alpha) = y, \quad \forall\, \gamma > 0,\ \alpha > 0. \tag{B.12}$$

Taking the limit γ → 0 with α = α(γ), and noting the assumption that all limits in (24) exist, we obtain

$$J\hat\theta + \hat u \odot \hat u - \hat v \odot \hat v = y. \tag{B.13}$$

Plugging in the definition of $\hat s$ in (25), we obtain $y = J\hat\theta + \hat s$.

2. By taking the limit of the relation $\theta_t(\gamma, \alpha) = J^\top \nu_t(\gamma, \alpha)$ in (20), and noting the assumptions that all relevant limits exist, we obtain

$$\hat\theta = \lim_{\gamma \to 0} \lim_{t \to \infty} \theta_t(\gamma, \alpha(\gamma)) = \lim_{\gamma \to 0} \lim_{t \to \infty} J^\top \nu_t(\gamma, \alpha(\gamma)) = J^\top \hat\nu. \tag{B.14}$$

3. Denote $s_\infty(\gamma, \alpha) := u_\infty(\gamma, \alpha) \odot u_\infty(\gamma, \alpha) - v_\infty(\gamma, \alpha) \odot v_\infty(\gamma, \alpha)$. By (20), we have

$$s^i_\infty(\gamma, \alpha) := u^i_\infty(\gamma, \alpha)^2 - v^i_\infty(\gamma, \alpha)^2 = \gamma^2 \exp\big(4\alpha\, \nu^i_\infty(\gamma, \alpha)\big) - \gamma^2 \exp\big(-4\alpha\, \nu^i_\infty(\gamma, \alpha)\big). \tag{B.15}$$

For each entry of $\hat s = \lim_{\gamma \to 0} s_\infty(\gamma, \alpha(\gamma))$ (recall that α(γ) is defined in (23)), we may have three cases.

Case 1: $\hat s_i > 0$. From (B.15), we must have $\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma)) \to +\infty$ as γ → 0, so that

$$\lim_{\gamma \to 0} \exp\big(4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = \infty, \quad \text{and} \quad \lim_{\gamma \to 0} \exp\big(-4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = 0. \tag{B.16}$$
Consequently,

$$\lim_{\gamma \to 0} \gamma^2 \exp\big(4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = \hat s_i \;\Longrightarrow\; \lim_{\gamma \to 0} \Big[2 \log\gamma + 4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\Big] = \log \hat s_i \;\Longrightarrow\; \lim_{\gamma \to 0} \nu^i_\infty(\gamma, \alpha(\gamma)) = \lim_{\gamma \to 0} \frac{\log \hat s_i - 2\log\gamma}{4\alpha(\gamma)}.$$

Plugging in the relation $\alpha(\gamma) = -\frac{\log\gamma}{2\lambda}$ from (23), we have

$$\lim_{\gamma \to 0} \nu^i_\infty(\gamma, \alpha(\gamma)) = \lim_{\gamma \to 0} \left( \frac{\lambda\, \log \hat s_i}{-2\log\gamma} + \lambda \right) = \lambda.$$

Case 2: $\hat s_i < 0$. Similar to Case 1, from (B.15) we must have

$$\lim_{\gamma \to 0} \exp\big(4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = 0, \quad \text{and} \quad \lim_{\gamma \to 0} \exp\big(-4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = \infty. \tag{B.18}$$

Consequently,

$$\lim_{\gamma \to 0} \gamma^2 \exp\big(-4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = -\hat s_i \;\Longrightarrow\; \lim_{\gamma \to 0} \Big[2\log\gamma - 4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\Big] = \log(-\hat s_i) \;\Longrightarrow\; \lim_{\gamma \to 0} \nu^i_\infty(\gamma, \alpha(\gamma)) = \lim_{\gamma \to 0} \frac{2\log\gamma - \log(-\hat s_i)}{4\alpha(\gamma)}.$$

Plugging in the relation $\alpha(\gamma) = -\frac{\log\gamma}{2\lambda}$ from (23), we have $\lim_{\gamma \to 0} \nu^i_\infty(\gamma, \alpha(\gamma)) = -\lambda$.

Case 3: $\hat s_i = 0$. From (B.15), we must have

$$\lim_{\gamma \to 0} \gamma^2 \exp\big(4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = 0, \quad \text{and} \quad \lim_{\gamma \to 0} \gamma^2 \exp\big(-4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big) = 0. \tag{B.20}$$

Hence, for any small $\epsilon \in (0, 1)$, there exists $\gamma_0 > 0$ such that for all $\gamma \in (0, \gamma_0)$ we have

$$\gamma^2 \max\Big\{\exp\big(4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big),\ \exp\big(-4\alpha(\gamma)\, \nu^i_\infty(\gamma, \alpha(\gamma))\big)\Big\} < \epsilon \;\Longrightarrow\; 2\log\gamma + 4\alpha(\gamma)\, \big|\nu^i_\infty(\gamma, \alpha(\gamma))\big| < \log\epsilon \;\Longrightarrow\; \big|\nu^i_\infty(\gamma, \alpha(\gamma))\big| < \frac{\log\epsilon - 2\log\gamma}{4\alpha(\gamma)}.$$

Now, plugging in $\alpha(\gamma) = -\frac{\log\gamma}{2\lambda}$, we have

$$\big|\nu^i_\infty(\gamma, \alpha(\gamma))\big| < \frac{\lambda\, \log\epsilon}{-2\log\gamma} + \lambda < \lambda.$$

Therefore, we have $\big|\lim_{\gamma \to 0} \nu^i_\infty(\gamma, \alpha(\gamma))\big| \le \lambda$.

Synthesizing the three cases above, we obtain $\hat\nu \in \lambda\, \operatorname{sign}(\hat s)$.

B.3. Proof of Proposition 3.5

We begin by introducing the null space property, which is widely used to provide necessary and sufficient conditions for exact recovery of sparse signals in compressive sensing.

Definition B.3 ((Cohen et al., 2009)). We say a matrix $A \in \mathbb{R}^{m \times n}$ satisfies the null space property with constant $\rho \in (0, 1)$ relative to $S \subseteq [n]$ if

$$\|v_S\|_1 \le \rho\, \|v_{S^c}\|_1 \quad \text{for all } v \in \ker A,$$

where $\ker A$ is the null space of A.

With Definition B.3, we prove Proposition 3.5 by using the following two lemmas. The first lemma establishes correct recovery of $(\theta^\star, s^\star)$ from (26) under the null space property.

Lemma B.4. Given a matrix J and a matrix A that annihilates J on the left (i.e., such that AJ = 0), if A satisfies the null space property with constant $\rho \in (0, 1)$ relative to the support of $s^\star$, then the solution to (26) is $(\theta^\star, s^\star)$ for any $\lambda > \lambda_0$, where $\lambda_0$ is a scalar that depends only on $(J, \theta^\star, \rho)$.

The second lemma shows that the null space property is satisfied under the incoherence condition in (28).

Lemma B.5. Given a matrix J and a matrix A that annihilates J on the left, if

$$k\, \sqrt{\frac{r}{N}}\, \mu(J) \le \frac{\rho}{1 + \rho}, \tag{B.22}$$

then A satisfies the null space property with constant ρ relative to any S that satisfies |S| = k.

Proof of Proposition 3.5. Assume that the condition in (28) is satisfied. Then there exists a $\rho \in (0, 1)$ such that the condition in (B.22) holds. Hence, A satisfies the null space property with constant ρ relative to any S that satisfies |S| = k. Since $s^\star$ is k-sparse, A satisfies the null space property with constant ρ relative to the support of $s^\star$. The conclusion of Proposition 3.5 then follows from applying Lemma B.4. Finally, from Lemma B.4 we have that $\lambda_0$ is a function of $(J, \theta^\star, \rho)$, wherein ρ is determined by A (hence J) and the associated sparsity k. Hence $\lambda_0$ can be determined from a given $(J, \theta^\star, k)$.

In the rest of this section we prove Lemma B.4 and Lemma B.5.

Proof of Lemma B.4. We first introduce the following useful consequence of the null space property.

Theorem B.6 (Stability of the null space property). Suppose a matrix $A \in \mathbb{R}^{m \times n}$ satisfies the null space property with constant $\rho \in (0, 1)$ relative to $S \subseteq [n]$. Then for every vector x supported on S, we have

$$\|z - x\|_1 \le \frac{1 + \rho}{1 - \rho}\, \big(\|z\|_1 - \|x\|_1\big)$$

for any z with $Az = Ax$.

Proof of Theorem B.6.
Since $A(z - x) = 0$, i.e., $z - x \in \ker A$, the null space property of A implies

$$\|(z - x)_S\|_1 \le \rho\, \|(z - x)_{S^c}\|_1,$$

which further gives

$$\|z - x\|_1 \le (1 + \rho)\, \|(z - x)_{S^c}\|_1.$$

We now use these properties to prove the claim:

$$\|z\|_1 = \|z_S\|_1 + \|z_{S^c}\|_1 = \|(z - x + x)_S\|_1 + \|z_{S^c}\|_1 \ge \|x\|_1 - \|(z - x)_S\|_1 + \|(z - x)_{S^c}\|_1 \ge \|x\|_1 + (1 - \rho)\, \|(z - x)_{S^c}\|_1 \ge \|x\|_1 + \frac{1 - \rho}{1 + \rho}\, \|z - x\|_1,$$

where the first inequality follows from the triangle inequality and because x is supported only on S (so that $z_{S^c} = (z - x)_{S^c}$). Rearranging gives the claimed bound.

We are now ready to prove Lemma B.4. Let $J = U \Sigma V^\top$ be the compact SVD of J, and let $V_\perp$ be an orthonormal basis that complements V. Then the constraint in (26) is equivalent to

$$Ay = As, \quad \theta = V \Sigma^{-1} U^\top (y - s) + V_\perp h.$$

Thus, problem (26) is equivalent to

$$\min_{s, h}\ \frac{1}{2}\, \big\|V \Sigma^{-1} U^\top (y - s) + V_\perp h\big\|_2^2 + \lambda\, \|s\|_1 \quad \text{s.t.} \quad Ay = As, \tag{B.23}$$

which is further equivalent (by optimality of h = 0 and since $Ay = As^\star$) to

$$\min_{s}\ g(s) := \frac{1}{2}\, \big\|V \Sigma^{-1} U^\top (y - s)\big\|_2^2 + \lambda\, \|s\|_1 \quad \text{s.t.} \quad As = As^\star. \tag{B.24}$$

Assume A satisfies the null space property with constant $\rho \in (0, 1)$ relative to the support of $s^\star$. For any s with $As = As^\star$, Theorem B.6 gives

$$\|s\|_1 - \|s^\star\|_1 \ge \frac{1 - \rho}{1 + \rho}\, \|s - s^\star\|_1,$$

which ensures $s = s^\star$ if we only minimize $\|s\|_1$. The first term in (B.24) can be written as

$$\big\|V \Sigma^{-1} U^\top (y - s)\big\|_2^2 = \big\|V \Sigma^{-1} U^\top (s^\star - s + J\theta^\star)\big\|_2^2, \quad \theta^\star = V \Sigma^{-1} U^\top (y - s^\star).$$

This, together with the previous inequality, gives

$$g(s) - g(s^\star) \ge \lambda\, \frac{1 - \rho}{1 + \rho}\, \|s - s^\star\|_1 + \frac{1}{2}\, \big\|V \Sigma^{-1} U^\top (s^\star - s)\big\|_2^2 + \big\langle s^\star - s,\ U \Sigma^{-1} V^\top \theta^\star \big\rangle \ge \lambda\, \frac{1 - \rho}{1 + \rho}\, \|s - s^\star\|_1 - \big\|U \Sigma^{-1} V^\top \theta^\star\big\|_\infty\, \|s - s^\star\|_1.$$

Thus, if $\lambda > \lambda_0$ with

$$\lambda_0 = 2\, \frac{1 + \rho}{1 - \rho}\, \big\|U \Sigma^{-1} V^\top \theta^\star\big\|_\infty, \tag{B.25}$$

we have $g(s) - g(s^\star) > 0$ whenever $s \neq s^\star$.

Proof of Lemma B.5. Let $J = U \Sigma V^\top$ be the compact SVD of J. From (B.22) we have

$$(\rho + 1)\, k\, \sqrt{\frac{r}{N}}\, \mu(J) \le \rho. \tag{B.26}$$

Let $S \subseteq [N]$ with $|S| = k$, and let $a \in \mathbb{R}^r$ be an arbitrary vector. We have

$$\|[Ua]_S\|_1 = \sum_{i \in S} \big|e_i^\top U a\big| = \sum_{i \in S} \big|\langle U^\top e_i,\ a \rangle\big| \le k\, \|a\|_2\, \max_{i \in S} \|U^\top e_i\|_2 \le k\, \|a\|_2\, \sqrt{\frac{r}{N}}\, \mu(J), \tag{B.27}$$

where the last inequality is obtained from Definition 3.4. In addition, we have

$$\|Ua\|_1 \ge \|Ua\|_2 = \|a\|_2. \tag{B.28}$$

Combining (B.26), (B.27), and (B.28), we get

$$(\rho + 1)\, \|[Ua]_S\|_1 \le (\rho + 1)\, k\, \|a\|_2\, \sqrt{\frac{r}{N}}\, \mu(J) \le \rho\, \|a\|_2 \le \rho\, \|Ua\|_1, \tag{B.29}$$

which implies

$$\|[Ua]_S\|_1 \le \rho\, \|[Ua]_{S^c}\|_1. \tag{B.30}$$

Noting that $\{Ua \mid a \in \mathbb{R}^r\} = \ker A$, this finishes the proof by Definition B.3.
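To complement the analysis above, the following self-contained numpy sketch illustrates the linear-model phenomenon behind Proposition 3.3: gradient descent on $h(\theta, u, v) = \tfrac{1}{2}\|J\theta + u \odot u - v \odot v - y\|_2^2$ from a small initialization γ drives $s = u \odot u - v \odot v$ toward the sparse corruption, while a plain least-squares fit absorbs the corruption into θ. The problem sizes, step sizes, and initialization scale are illustrative choices of ours (not values from the paper), and discrete gradient descent only approximates the gradient-flow dynamics analyzed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, k = 200, 50, 10                                  # observations, parameters, corruptions

J = rng.standard_normal((N, n)) / np.sqrt(N)           # generic, incoherent J
theta_star = rng.standard_normal(n)
s_star = np.zeros(N)
support = rng.choice(N, k, replace=False)
s_star[support] = rng.choice([-5.0, 5.0], k)           # sparse label corruption
y = J @ theta_star + s_star

# Double over-parameterization s = u*u - v*v with small initialization gamma.
gamma = 1e-6
theta, u, v = np.zeros(n), gamma * np.ones(N), gamma * np.ones(N)
eta = 0.02                                             # step size (same for theta, u, v here)

for _ in range(20000):
    r = J @ theta + u * u - v * v - y                  # residual
    theta -= eta * (J.T @ r)                           # grad of 0.5*||r||^2 w.r.t. theta
    u, v = u - eta * (2 * r * u), v - eta * (-2 * r * v)

s_hat = u * u - v * v
theta_ls = np.linalg.lstsq(J, y, rcond=None)[0]        # baseline: absorbs s_star into theta

top = np.sort(np.argsort(np.abs(s_hat))[-k:])
print("support of s recovered:", np.array_equal(top, np.sort(support)))
print("theta error (over-parameterized):", np.linalg.norm(theta - theta_star))
print("theta error (least squares):", np.linalg.norm(theta_ls - theta_star))
```

In this sketch, shrinking γ (and scaling the learning rate of u and v relative to that of θ) plays the role of the regularization parameter λ in (26): the entries of s corresponding to clean observations stay at the initialization scale, while the corrupted entries are picked up by u and v.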