# Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss

Kaidi Cao (Stanford University, kaidicao@stanford.edu), Colin Wei (Stanford University, colinwei@stanford.edu), Adrien Gaidon (Toyota Research Institute, adrien.gaidon@tri.global), Nikos Arechiga (Toyota Research Institute, nikos.arechiga@tri.global), Tengyu Ma (Stanford University, tengyuma@stanford.edu)

Deep learning algorithms can fare poorly when the training dataset suffers from heavy class imbalance but the testing criterion requires good generalization on less frequent classes. We design two novel methods to improve performance in such scenarios. First, we propose a theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound. This loss replaces the standard cross-entropy objective during training and can be applied with prior strategies for training with class imbalance such as re-weighting or re-sampling. Second, we propose a simple, yet effective, training schedule that defers re-weighting until after the initial stage, allowing the model to learn an initial representation while avoiding some of the complications associated with re-weighting or re-sampling. We test our methods on several benchmark vision tasks including the real-world imbalanced dataset iNaturalist 2018. Our experiments show that either of these methods alone can already improve over existing techniques and their combination achieves even better performance gains.¹

1 Introduction

Modern real-world large-scale datasets often have long-tailed label distributions [51, 28, 34, 12, 15, 50, 40]. On these datasets, deep neural networks have been found to perform poorly on less represented classes [17, 51, 5]. This is particularly detrimental if the testing criterion places more emphasis on minority classes; for example, accuracy on a uniform label distribution and the minimum per-class accuracy are two such criteria. These are common scenarios in many applications [7, 42, 20] due to various practical concerns such as transferability to new domains, fairness, etc.

The two common approaches for learning long-tailed data are re-weighting the losses of the examples and re-sampling the examples in the SGD mini-batch (see [5, 21, 10, 17, 18, 9] and the references therein). Both devise a training loss that is, in expectation, closer to the test distribution, and therefore can achieve better trade-offs between the accuracies of the frequent classes and the minority classes. However, because we have fundamentally less information about the minority classes and the models deployed are often huge, over-fitting to the minority classes appears to be one of the main challenges in improving these methods.

We propose to regularize the minority classes more strongly than the frequent classes, so that we can improve the generalization error of minority classes without sacrificing the model's ability to fit the frequent classes.¹

¹ Code available at https://github.com/kaidic/LDAM-DRW.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: For binary classification with a linearly separable classifier, the margin $\gamma_i$ of the $i$-th class is defined to be the minimum distance of the data in the $i$-th class to the decision boundary. We show that the test error with the uniform label distribution is bounded by a quantity that scales as $\frac{1}{\gamma_1\sqrt{n_1}} + \frac{1}{\gamma_2\sqrt{n_2}}$. As illustrated here, fixing the direction of the decision boundary leads to a fixed $\gamma_1 + \gamma_2$, but the trade-off between $\gamma_1$ and $\gamma_2$ can be optimized by shifting the decision boundary. As derived in Section 3.1, the optimal trade-off is $\gamma_i \propto n_i^{-1/4}$, where $n_i$ is the sample size of the $i$-th class.
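As a quick numerical illustration of the trade-off sketched in Figure 1 (not taken from the paper; the class sizes below are made up), a simple grid search over margin splits recovers the $n_i^{-1/4}$ scaling:

```python
# Illustrative check: fix gamma1 + gamma2 = 1 and minimize the bound
# 1/(gamma1*sqrt(n1)) + 1/(gamma2*sqrt(n2)) by grid search over the split.
import numpy as np

n1, n2 = 10000, 100                      # frequent vs. minority class sizes (made-up values)
gamma1 = np.linspace(0.01, 0.99, 9999)   # candidate margins for class 1
gamma2 = 1.0 - gamma1                    # the remainder goes to class 2
bound = 1.0 / (gamma1 * np.sqrt(n1)) + 1.0 / (gamma2 * np.sqrt(n2))
best = np.argmin(bound)

print(gamma2[best] / gamma1[best])       # ~3.16: the optimal ratio gamma2/gamma1
print((n1 / n2) ** 0.25)                 # matches (n1/n2)^(1/4), i.e. gamma_i ∝ n_i^(-1/4)
```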
Implementing this general idea requires a data-dependent or label-dependent regularizer, which, in contrast to standard $\ell_2$ regularization, depends not only on the weight matrices but also on the labels, so that it can differentiate frequent and minority classes. The theoretical understanding of data-dependent regularizers is sparse (see [57, 43, 2] for a few recent works).

We explore one of the simplest and most well-understood data-dependent properties: the margins of the training examples. Encouraging a large margin can be viewed as regularization, as standard generalization error bounds (e.g., [4, 59]) depend on the inverse of the minimum margin among all the examples. Motivated by the question of generalization with respect to minority classes, we instead study the minimum margin per class and obtain per-class and uniform-label test error bounds.² Minimizing the obtained bounds gives an optimal trade-off between the margins of the classes. See Figure 1 for an illustration in the binary classification case.

² The same technique can also be used for other test label distributions as long as the test label distribution is known. See Section C.5 for some experimental results.

Inspired by the theory, we design a label-distribution-aware loss function that encourages the model to have the optimal trade-off between per-class margins. The proposed loss extends the existing soft margin loss [53] by encouraging the minority classes to have larger margins. As a label-dependent regularization technique, our modified loss function is orthogonal to the re-weighting and re-sampling approaches. In fact, we also design a deferred re-balancing optimization procedure that allows us to combine the re-weighting strategy with our loss (or other losses) in a more effective way.

In summary, our main contributions are: (i) we design a label-distribution-aware loss function that encourages larger margins for minority classes; (ii) we propose a simple deferred re-balancing optimization procedure to apply re-weighting more effectively; and (iii) our practical implementation shows significant improvements on several benchmark vision tasks, such as artificially imbalanced CIFAR and Tiny ImageNet [1], and the real-world large-scale imbalanced dataset iNaturalist 2018 [52].

2 Related Works

Most existing algorithms for learning imbalanced datasets can be divided into two categories: re-sampling and re-weighting.

Re-sampling. There are two types of re-sampling techniques: over-sampling the minority classes (see, e.g., [46, 60, 5, 6] and references therein) and under-sampling the frequent classes (see, e.g., [17, 23, 5] and the references therein). The downside of under-sampling is that it discards a large portion of the data and thus is not feasible when the data imbalance is extreme. Over-sampling is effective in many cases but can lead to over-fitting of the minority classes [9, 10]. Stronger data augmentation for minority classes can help alleviate this over-fitting [9, 61].

Re-weighting. Cost-sensitive re-weighting assigns (adaptive) weights to different classes or even different samples. The vanilla scheme re-weights classes proportionally to the inverse of their frequency [21, 22, 55]. Re-weighting methods tend to make the optimization of deep models difficult in extremely imbalanced and large-scale settings [21, 22].
Cui et al. [10] observe that re-weighting by inverse class frequency yields poor performance on frequent classes, and thus propose re-weighting by the inverse effective number of samples. This is the main prior work that we empirically compare with.

Another line of work assigns weights to each sample based on its individual properties. Focal loss [35] down-weights the well-classified examples; Li et al. [31] suggest an improved technique that down-weights examples with either very small gradients or very large gradients, because examples with small gradients are well-classified and those with large gradients tend to be outliers.

In a recent work [6], Byrd and Lipton study the effect of importance weighting and show that, empirically, importance weighting does not have a significant effect when no regularization is applied. This is consistent with the theoretical prediction in [48] that logistic regression without regularization converges to the max-margin solution. In our work, we explicitly encourage rare classes to have larger margins, and therefore we do not converge to a max-margin solution. Moreover, in our experiments, we apply non-trivial $\ell_2$-regularization to achieve the best generalization performance. We also find that deferred re-weighting (or deferred re-sampling) is more effective than re-weighting and re-sampling from the beginning of training.

In contrast to, and orthogonally to, the papers above, our main technique aims to improve the generalization of the minority classes by applying additional regularization that is orthogonal to the re-weighting scheme. We also propose a deferred re-balancing optimization procedure to improve the optimization and generalization of a generic re-weighting scheme.

Margin loss. The hinge loss is often used to obtain a max-margin classifier, most notably in SVMs [49]. Recently, Large-Margin Softmax [37], Angular Softmax [38], and Additive Margin Softmax [53] have been proposed to minimize intra-class variation in predictions and enlarge the inter-class margin by incorporating the idea of angular margin. In contrast to the class-independent margins in these papers, our approach encourages bigger margins for minority classes. Uneven margins for imbalanced datasets were also proposed and studied in [32] and the recent works [25, 33]. Our theory puts this idea on a firmer footing by providing a concrete formula for the desired margins of the classes, alongside good empirical progress.

Label shift in domain adaptation. The problem of learning imbalanced datasets can also be viewed as a label shift problem in transfer learning or domain adaptation (for which we refer the readers to the survey [54] and the references therein). In a typical label shift formulation, the difficulty is to detect and estimate the label shift; after estimating the label shift, re-weighting or re-sampling is applied. We address a largely different question: can we do better than re-weighting or re-sampling when the label shift is known? In fact, our algorithms can be used to replace the re-weighting steps of some of the recent interesting work on detecting and correcting label shift [36, 3]. Distributionally robust optimization (DRO) is another technique for domain adaptation (see [11, 16, 8] and the references therein).
However, the DRO formulation assumes no knowledge of the target label distribution beyond a bound on the amount of shift, which makes the problem very challenging. We instead assume knowledge of the test label distribution, and use it to design efficient methods that scale easily to large-scale vision datasets with significant improvements.

Meta-learning. Meta-learning has also been used to improve performance on imbalanced datasets or in few-shot learning settings. We refer the readers to [55, 47, 56] and the references therein. So far, we generally believe that our approaches, which only modify the loss, are more computationally efficient than meta-learning based approaches.

3 Main Approach

3.1 Theoretical Motivations

Problem setup and notations. We assume the input space is $\mathbb{R}^d$ and the label space is $\{1, \dots, k\}$. Let $x$ denote the input and $y$ the corresponding label. We assume that the class-conditional distribution $\mathcal{P}(x \mid y)$ is the same at training and test time. Let $\mathcal{P}_j = \mathcal{P}(x \mid y = j)$ denote the class-conditional distribution. We use $\mathcal{P}_{\mathrm{bal}}$ to denote the balanced test distribution, which first samples a class uniformly and then samples data from $\mathcal{P}_j$. For a model $f : \mathbb{R}^d \to \mathbb{R}^k$ that outputs $k$ logits, we use $L_{\mathrm{bal}}[f]$ to denote the standard 0-1 test error on the balanced data distribution:

$$L_{\mathrm{bal}}[f] = \Pr_{(x,y)\sim \mathcal{P}_{\mathrm{bal}}}\Big[f(x)_y < \max_{\ell \ne y} f(x)_\ell\Big]$$

Similarly, the error $L_j$ for class $j$ is defined as $L_j[f] = \Pr_{(x,y)\sim \mathcal{P}_j}\big[f(x)_y < \max_{\ell \ne y} f(x)_\ell\big]$.

Suppose we have a training dataset $\{(x_i, y_i)\}_{i=1}^n$. Let $n_j$ be the number of examples in class $j$, and let $S_j = \{i : y_i = j\}$ denote the example indices corresponding to class $j$. Define the margin of an example $(x, y)$ as

$$\gamma(x, y) = f(x)_y - \max_{j \ne y} f(x)_j \qquad (1)$$

and the training margin for class $j$ as

$$\gamma_j = \min_{i \in S_j} \gamma(x_i, y_i) \qquad (2)$$

We consider the separable case (meaning that all the training examples are classified correctly) because neural networks are often over-parameterized and can fit the training data well. We also note that the minimum margin over all classes, $\gamma_{\min} = \min\{\gamma_1, \dots, \gamma_k\}$, is the classical notion of training margin studied in the past [27].

Fine-grained generalization error bounds. Let $\mathcal{F}$ be the hypothesis class and let $C(\mathcal{F})$ be some proper complexity measure of $\mathcal{F}$. There is a large body of recent work on measuring the complexity of neural networks (see [4, 13, 57] and references therein), and our discussion below is orthogonal to the precise choice. When the training distribution and the test distribution are the same, typical generalization error bounds scale in $\sqrt{C(\mathcal{F})/n}$. That is, in our case, if the test distribution is as imbalanced as the training distribution, then

$$\text{imbalanced test error} \ \lesssim\ \frac{1}{\gamma_{\min}}\sqrt{\frac{C(\mathcal{F})}{n}} \qquad (3)$$

Note that this bound is oblivious to the label distribution: it only involves the minimum margin across all examples and the total number of data points. We extend such bounds to the setting with a balanced test distribution by considering the margin of each class. As we will see, the more fine-grained bound below allows us to design a new training loss function that is customized to the imbalanced dataset.

Theorem 1 (informal and simplified version of Theorem 2). With high probability $(1 - n^{-5})$ over the randomness of the training data, the error $L_j$ for class $j$ is bounded by

$$L_j[f] \ \lesssim\ \frac{1}{\gamma_j}\sqrt{\frac{C(\mathcal{F})}{n_j}} + \sqrt{\frac{\log n}{n_j}} \qquad (4)$$

where we use $\lesssim$ to hide constant factors. As a direct consequence,

$$L_{\mathrm{bal}}[f] \ \lesssim\ \frac{1}{k}\sum_{j=1}^{k}\left(\frac{1}{\gamma_j}\sqrt{\frac{C(\mathcal{F})}{n_j}} + \sqrt{\frac{\log n}{n_j}}\right) \qquad (5)$$
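To make the quantities in Eqs. (1)-(2) concrete, the following small NumPy sketch (illustrative only, not part of the paper's pipeline) computes the per-example margin and the per-class training margins from a matrix of logits:

```python
# Per-example margin gamma(x, y) = f(x)_y - max_{j != y} f(x)_j  (Eq. 1)
# Per-class training margin gamma_j = min over examples of class j  (Eq. 2)
import numpy as np

def per_class_margins(logits: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """logits: (n, k) model outputs; labels: (n,) integer labels in [0, k)."""
    idx = np.arange(len(labels))
    correct = logits[idx, labels]                     # logit of the true class
    masked = logits.copy()
    masked[idx, labels] = -np.inf                     # exclude the true class from the max
    margins = correct - masked.max(axis=1)            # Eq. (1), one value per example
    return np.array([margins[labels == j].min() if np.any(labels == j) else np.nan
                     for j in range(num_classes)])    # Eq. (2), one value per class

# Tiny example: two classes, three examples
logits = np.array([[2.0, 0.5], [1.0, 0.9], [0.2, 1.5]])
labels = np.array([0, 0, 1])
print(per_class_margins(logits, labels, 2))           # [0.1, 1.3]
```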
Class-distribution-aware margin trade-off. The generalization error bound (4) for each class suggests that if we wish to improve the generalization of minority classes (those with small $n_j$), we should aim to enforce bigger margins $\gamma_j$ for them. However, enforcing bigger margins for minority classes may hurt the margins of the frequent classes. What is the optimal trade-off between the margins of the classes? An answer for the general case may be difficult, but fortunately we can obtain the optimal trade-off for the binary classification problem.

With $k = 2$ classes, we aim to optimize the balanced generalization error bound provided in (5), which can be simplified (by removing the low-order term $\sqrt{\log n / n_j}$ and the common factor $\sqrt{C(\mathcal{F})}$) to

$$\frac{1}{\gamma_1\sqrt{n_1}} + \frac{1}{\gamma_2\sqrt{n_2}} \qquad (6)$$

At first sight, because $\gamma_1$ and $\gamma_2$ are complicated functions of the weight matrices, it appears difficult to understand the optimal margins. However, we can figure out the relative scale between $\gamma_1$ and $\gamma_2$. Suppose $\gamma_1, \gamma_2 > 0$ minimize the expression above. We observe that any $\gamma_1' = \gamma_1 - \delta$ and $\gamma_2' = \gamma_2 + \delta$ (for $\delta \in (-\gamma_2, \gamma_1)$) can be realized by the same weight matrices with a shifted bias term (see Figure 1 for an illustration). Therefore, for $\gamma_1, \gamma_2$ to be optimal, they should satisfy

$$\frac{1}{\gamma_1\sqrt{n_1}} + \frac{1}{\gamma_2\sqrt{n_2}} \ \le\ \frac{1}{(\gamma_1 - \delta)\sqrt{n_1}} + \frac{1}{(\gamma_2 + \delta)\sqrt{n_2}} \qquad (7)$$

The inequality above implies that $\gamma_1 = \frac{C}{n_1^{1/4}}$ and $\gamma_2 = \frac{C}{n_2^{1/4}}$ for some constant $C$. Please see Section A for a detailed derivation.

Fast rate vs. slow rate, and the implication on the choice of margins. The bound in Theorem 1 may not necessarily be tight. Generalization bounds that scale as $1/\sqrt{n}$ (or $1/\sqrt{n_i}$ here with imbalanced classes) are generally referred to as the "slow rate", and those that scale as $1/n$ as the "fast rate". With deep neural networks, and when the model is sufficiently big, it is possible that some of these bounds can be improved to the fast rate; see [58] for some recent developments. In those cases, the optimal trade-off of the margins becomes $\gamma_i \propto n_i^{-1/3}$.

3.2 Label-Distribution-Aware Margin Loss

Inspired by the trade-off between the class margins in Section 3.1 for two classes, we propose to enforce a class-dependent margin for multiple classes of the form

$$\gamma_j = \frac{C}{n_j^{1/4}}$$

We design a soft margin loss function to encourage the network to have the margins above. Let $(x, y)$ be an example and $f$ a model. For simplicity, we use $z_j = f(x)_j$ to denote the model's output (logit) for the $j$-th class. The most natural choice would be a multi-class extension of the hinge loss:

$$\mathcal{L}_{\mathrm{LDAM\text{-}HG}}((x, y); f) = \max\Big(\max_{j \ne y}\{z_j\} - z_y + \Delta_y,\ 0\Big) \qquad (10)$$

$$\text{where } \Delta_j = \frac{C}{n_j^{1/4}} \ \text{ for } j \in \{1, \dots, k\} \qquad (11)$$

Here $C$ is a hyper-parameter to be tuned. To make the margin easier to tune, we effectively normalize the logits (the input to the loss function) by normalizing the last hidden activation to $\ell_2$ norm 1 and normalizing the weight vectors of the last fully-connected layer to $\ell_2$ norm 1, following the previous work [53]. We then scale the logits by a constant $s = 10$, also following [53].

Empirically, the non-smoothness of the hinge loss may pose difficulties for optimization. A smooth relaxation of the hinge loss is the following cross-entropy loss with enforced margins:

$$\mathcal{L}_{\mathrm{LDAM}}((x, y); f) = -\log \frac{e^{z_y - \Delta_y}}{e^{z_y - \Delta_y} + \sum_{j \ne y} e^{z_j}} \qquad (12)$$

$$\text{where } \Delta_j = \frac{C}{n_j^{1/4}} \ \text{ for } j \in \{1, \dots, k\} \qquad (13)$$

In previous work [37, 38, 53], where the training set is usually balanced, the margin $\Delta_y$ is chosen to be a label-independent constant $C$, whereas our margin depends on the label distribution.
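For concreteness, here is a compact PyTorch-style sketch of Eqs. (12)-(13). It is a simplified illustration rather than the authors' reference implementation (the released code at https://github.com/kaidic/LDAM-DRW differs in details; in particular, the feature and classifier-weight normalization described above is assumed here to be handled inside the model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDAMLoss(nn.Module):
    """Label-distribution-aware margin loss, Eqs. (12)-(13) (sketch)."""
    def __init__(self, cls_num_list, max_m=0.5, s=10.0, weight=None):
        super().__init__()
        # Delta_j proportional to n_j^(-1/4), rescaled so the largest margin equals max_m
        # (max_m plays the role of the tuned constant C).
        m_list = 1.0 / torch.tensor(cls_num_list, dtype=torch.float).pow(0.25)
        self.m_list = m_list * (max_m / m_list.max())
        self.s = s              # logit scaling constant, following [53]
        self.weight = weight    # optional per-class weights (used in the DRW second stage)

    def forward(self, logits, target):
        # Subtract the class-dependent margin Delta_y from the true-class logit only,
        # then apply the usual (optionally re-weighted) cross-entropy.
        delta = self.m_list.to(logits.device)[target]                    # (batch,)
        one_hot = F.one_hot(target, num_classes=logits.size(1)).float()  # (batch, k)
        adjusted = logits - one_hot * delta.unsqueeze(1)
        return F.cross_entropy(self.s * adjusted, target, weight=self.weight)
```

Here `max_m` is a hyper-parameter standing in for $C$, and the optional `weight` argument lets the same loss be reused in the re-weighted second stage of Algorithm 1 below.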
Remark: Attentive readers may find the loss $\mathcal{L}_{\mathrm{LDAM}}$ somewhat reminiscent of re-weighting, because in the binary classification case, where the model outputs a single real number that is passed through a sigmoid to be converted into a probability, both approaches change the gradient of an example by a scalar factor. However, we note two key differences: the scalar factor introduced by re-weighting only depends on the class, whereas the scalar introduced by $\mathcal{L}_{\mathrm{LDAM}}$ also depends on the output of the model; and for multi-class classification problems, the proposed loss $\mathcal{L}_{\mathrm{LDAM}}$ affects the gradient of an example in a more involved way than introducing a single scalar factor. Moreover, recent work has shown that, under separability assumptions, the logistic loss with weak regularization [59] or without regularization [48] gives the max-margin solution, which by definition is not affected by any re-weighting. This further suggests that the loss $\mathcal{L}_{\mathrm{LDAM}}$ and re-weighting may complement each other, as we have seen in the experiments. (Re-weighting would affect the margin in the non-separable case, which is left for future work.)

3.3 Deferred Re-balancing Optimization Schedule

Cost-sensitive re-weighting and re-sampling are two well-known and successful strategies for coping with imbalanced datasets because, in expectation, they effectively make the imbalanced training distribution closer to the uniform test distribution. The known issues with these techniques are (a) re-sampling the examples in minority classes often causes heavy over-fitting to the minority classes when the model is a deep neural network, as pointed out in prior work (e.g., [10]), and (b) weighting up the minority classes' losses can cause difficulties and instability in optimization, especially when the classes are extremely imbalanced [10, 21]. In fact, Cui et al. [10] develop a novel and sophisticated learning rate schedule to cope with this optimization difficulty.

We observe empirically that, before annealing the learning rate, re-weighting and re-sampling are both inferior to the vanilla empirical risk minimization (ERM) algorithm (where all training examples have the same weight) in the following sense: the features produced before annealing the learning rate by re-weighting and re-sampling are worse than those produced by ERM. (See Figure 6 for an ablation study of the feature quality, performed by training linear classifiers on top of the features on a large balanced dataset.)

Inspired by this, we develop a deferred re-balancing training procedure (Algorithm 1), which first trains with vanilla ERM and the LDAM loss before annealing the learning rate, and then deploys a re-weighted LDAM loss with a smaller learning rate. Empirically, the first stage of training provides a good initialization for the second stage of training with re-weighted losses. Because the loss is non-convex and the learning rate in the second stage is relatively small, the second stage does not move the weights very far. Interestingly, with our LDAM loss and deferred re-balancing training, the vanilla re-weighting scheme (which re-weights by the inverse of the number of examples in each class) works as well as the re-weighting scheme introduced in prior work [10]. We also found that with our re-weighting scheme and LDAM, we are less sensitive to early stopping than [10].

Algorithm 1 Deferred Re-balancing Optimization with LDAM Loss

Require: Dataset $D = \{(x_i, y_i)\}_{i=1}^n$; a parameterized model $f_\theta$
1: Initialize the model parameters $\theta$ randomly
2: for $t = 1$ to $T_0$ do
3:   $B \leftarrow \mathrm{SampleMiniBatch}(D, m)$   ▷ a mini-batch of $m$ examples
4:   $\mathcal{L}(f_\theta) \leftarrow \frac{1}{m}\sum_{(x,y)\in B} \mathcal{L}_{\mathrm{LDAM}}((x, y); f_\theta)$
5:   $f_\theta \leftarrow f_\theta - \alpha\, \nabla_\theta \mathcal{L}(f_\theta)$   ▷ one SGD step
6: Optional: $\alpha \leftarrow \alpha/\tau$   ▷ anneal the learning rate by a factor $\tau$ if necessary
7: for $t = T_0$ to $T$ do
8:   $B \leftarrow \mathrm{SampleMiniBatch}(D, m)$   ▷ a mini-batch of $m$ examples
9:   $\mathcal{L}(f_\theta) \leftarrow \frac{1}{m}\sum_{(x,y)\in B} n_y^{-1}\, \mathcal{L}_{\mathrm{LDAM}}((x, y); f_\theta)$   ▷ standard re-weighting by frequency
10:  $f_\theta \leftarrow f_\theta - \alpha\, \frac{m}{\sum_{(x,y)\in B} n_y^{-1}}\, \nabla_\theta \mathcal{L}(f_\theta)$   ▷ one SGD step with re-normalized learning rate
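A schematic rendering of Algorithm 1 in code, assuming the `LDAMLoss` sketch above together with a `model`, an `optimizer`, a `train_loader`, and per-class counts `cls_num_list`; the epoch threshold `T0`, `beta`, and the learning-rate schedule are placeholders rather than the paper's exact settings:

```python
import numpy as np
import torch

def drw_class_weights(cls_num_list, beta=0.9999):
    # Class-balanced weights from [10]: inverse effective number (1 - beta^n_j)/(1 - beta),
    # re-normalized to average 1. Plain inverse frequency also works well with LDAM-DRW,
    # as noted in Section 3.3.
    effective_num = 1.0 - np.power(beta, cls_num_list)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(cls_num_list)
    return torch.tensor(weights, dtype=torch.float)

def train(model, optimizer, train_loader, cls_num_list, epochs, T0, device="cuda"):
    for epoch in range(epochs):
        # Stage 1 (epoch < T0): plain ERM with the LDAM loss.
        # Stage 2 (epoch >= T0): re-weighted LDAM loss, after the learning-rate anneal.
        weight = drw_class_weights(cls_num_list).to(device) if epoch >= T0 else None
        criterion = LDAMLoss(cls_num_list, weight=weight)
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # The learning-rate annealing around T0 is assumed to be handled by an external scheduler.
```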
4 Experiments

We evaluate our proposed algorithm on artificially created versions of the IMDB review [41], CIFAR-10, CIFAR-100 [29], and Tiny ImageNet [45, 1] datasets with controllable degrees of data imbalance, as well as on a real-world large-scale imbalanced dataset, iNaturalist 2018 [52]. Our core algorithm is developed using PyTorch [44].

Baselines. We compare our methods with standard training and with several state-of-the-art techniques and their combinations that have been widely adopted to mitigate the issues of training on imbalanced datasets: (1) Empirical risk minimization (ERM) loss: all the examples have the same weight; by default, we use the standard cross-entropy loss. (2) Re-weighting (RW): we re-weight each sample by the inverse of the sample size of its class, and then re-normalize to make the weights 1 on average in the mini-batch. (3) Re-sampling (RS): each example is sampled with probability proportional to the inverse sample size of its class. (4) CB [10]: the examples are re-weighted or re-sampled according to the inverse of the effective number of samples in each class, defined as $(1 - \beta^{n_i})/(1 - \beta)$, instead of inverse class frequencies; this idea can be combined with either re-weighting or re-sampling. (5) Focal: we use the recently proposed focal loss [35] as another baseline. (6) SGD schedule: by SGD, we refer to the standard schedule where the learning rate is decayed by a constant factor at certain steps; we use a standard learning rate decay schedule.

Our proposed algorithm and variants. We test combinations of the following techniques proposed by us. (1) DRW and DRS: following the proposed training Algorithm 1, we use the standard ERM optimization schedule until the last learning rate decay, and then apply re-weighting or re-sampling for optimization in the second stage. (2) LDAM: the proposed label-distribution-aware margin losses described in Section 3.2. When two of these methods are combined, we concatenate the acronyms with a dash in between as an abbreviation. The main algorithm we propose is LDAM-DRW. Please refer to Section B for additional implementation details.

4.1 Experimental results on the IMDB review dataset

The IMDB review dataset consists of 50,000 movie reviews for binary sentiment classification [41]. The original dataset contains an evenly distributed number of positive and negative reviews. We manually created an imbalanced training set by removing 90% of the negative reviews. We train a two-layer bidirectional LSTM with the Adam optimizer [26]. The results are reported in Table 1.

Table 1: Top-1 validation errors on the imbalanced IMDB review dataset. Our proposed approach LDAM-DRW outperforms the baselines.

| Approach | Error on positive reviews | Error on negative reviews | Mean error |
|---|---|---|---|
| ERM | 2.86 | 70.78 | 36.82 |
| RS | 7.12 | 45.88 | 26.50 |
| RW | 5.20 | 42.12 | 23.66 |
| LDAM-DRW | 4.91 | 30.77 | 17.84 |
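As one possible implementation (not necessarily the authors') of the RS baseline described above, each example can be drawn with probability proportional to the inverse size of its class using PyTorch's `WeightedRandomSampler`:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_resampling_loader(dataset, labels, batch_size=128):
    """Build a DataLoader that over-samples minority classes (RS baseline sketch)."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    sample_weights = 1.0 / class_counts[labels]     # inverse class frequency, per example
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(sample_weights, dtype=torch.double),
        num_samples=len(labels),
        replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```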
4.2 Experimental results on CIFAR

Imbalanced CIFAR-10 and CIFAR-100. The original versions of CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 validation images of size 32×32, with 10 and 100 classes, respectively. To create their imbalanced versions, we reduce the number of training examples per class and keep the validation set unchanged. To ensure that our methods apply to a variety of settings, we consider two types of imbalance: long-tailed imbalance [10] and step imbalance [5]. We use the imbalance ratio ρ to denote the ratio between the sample sizes of the most frequent and least frequent classes, i.e., ρ = max_i{n_i} / min_i{n_i}. Long-tailed imbalance follows an exponential decay in sample sizes across the classes. In the step imbalance setting, all minority classes have the same sample size, as do all frequent classes; this gives a clear distinction between minority and frequent classes, which is particularly useful for ablation studies. We further define the fraction of minority classes as µ. By default we set µ = 0.5 for all experiments.

Table 2: Top-1 validation errors of ResNet-32 on imbalanced CIFAR-10 and CIFAR-100. Columns give the dataset, imbalance type, and imbalance ratio ρ. The combination of our two techniques, LDAM-DRW, achieves the best performance, and each of them individually is beneficial when combined with other losses or schedules.

| Approach | CIFAR-10 long-tailed (100) | CIFAR-10 long-tailed (10) | CIFAR-10 step (100) | CIFAR-10 step (10) | CIFAR-100 long-tailed (100) | CIFAR-100 long-tailed (10) | CIFAR-100 step (100) | CIFAR-100 step (10) |
|---|---|---|---|---|---|---|---|---|
| ERM | 29.64 | 13.61 | 36.70 | 17.50 | 61.68 | 44.30 | 61.45 | 45.37 |
| Focal [35] | 29.62 | 13.34 | 36.09 | 16.36 | 61.59 | 44.22 | 61.43 | 46.54 |
| LDAM | 26.65 | 13.04 | 33.42 | 15.00 | 60.40 | 43.09 | 60.42 | 43.73 |
| CB RS | 29.45 | 13.21 | 38.14 | 15.41 | 66.56 | 44.94 | 66.23 | 46.92 |
| CB RW [10] | 27.63 | 13.46 | 38.06 | 16.20 | 66.01 | 42.88 | 78.69 | 47.52 |
| CB Focal [10] | 25.43 | 12.90 | 39.73 | 16.54 | 63.98 | 42.01 | 80.24 | 49.98 |
| HG-DRS | 27.16 | 14.03 | 29.93 | 14.85 | - | - | - | - |
| LDAM-HG-DRS | 24.42 | 12.72 | 24.53 | 12.82 | - | - | - | - |
| M-DRW | 24.94 | 13.57 | 27.67 | 13.17 | 59.49 | 43.78 | 58.91 | 44.72 |
| LDAM-DRW | 22.97 | 11.84 | 23.08 | 12.19 | 57.96 | 41.29 | 54.64 | 40.54 |

Table 3: Validation errors on iNaturalist 2018 of various approaches. Our proposed method LDAM-DRW demonstrates significant improvements over the previous state-of-the-art. We include ERM-DRW and LDAM-SGD for the ablation study.

| Loss | Schedule | Top-1 | Top-5 |
|---|---|---|---|
| ERM | SGD | 42.86 | 21.31 |
| CB Focal [10] | SGD | 38.88 | 18.97 |
| ERM | DRW | 36.27 | 16.55 |
| LDAM | SGD | 35.42 | 16.48 |
| LDAM | DRW | 32.00 | 14.82 |

We report the top-1 validation errors of various methods on the imbalanced versions of CIFAR-10 and CIFAR-100 in Table 2. Our proposed approach is LDAM-DRW, but we also include various combinations of our two techniques with other losses and training schedules for the ablation study. We first show that the proposed label-distribution-aware margin cross-entropy loss is superior to the pure cross-entropy loss and to one of its variants tailored for imbalanced data, focal loss, when no data-rebalancing learning schedule is applied. We also demonstrate that our full pipeline outperforms the previous state-of-the-art by a large margin. To further demonstrate that the proposed LDAM loss is essential, we compare it with regularizing by a uniform margin across all classes, under both the cross-entropy loss and the hinge loss. We use M-DRW to denote the algorithm that uses a cross-entropy loss with a uniform margin [53] in place of LDAM; that is, the Δ_j in equation (13) is chosen to be a tuned constant that does not depend on the class j. The hinge loss (HG) suffers from optimization issues with 100 classes, so we restrict its experiments to CIFAR-10.
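For concreteness, per-class training-set sizes following the two imbalance profiles described at the start of this subsection can be generated as follows (an illustrative construction; the released code may differ in details such as rounding):

```python
import numpy as np

def longtail_counts(n_max: int, num_classes: int, rho: float) -> list:
    # Exponential decay in sample size from n_max down to n_max / rho across classes.
    return [int(n_max * (1.0 / rho) ** (c / (num_classes - 1))) for c in range(num_classes)]

def step_counts(n_max: int, num_classes: int, rho: float, mu: float = 0.5) -> list:
    # Frequent classes keep n_max examples; the minority fraction mu keeps n_max / rho each.
    n_frequent = round(num_classes * (1 - mu))
    return [n_max if c < n_frequent else int(n_max / rho) for c in range(num_classes)]

print(longtail_counts(5000, 10, 100))   # 5000, ..., 50 (imbalance ratio rho = 100)
print(step_counts(5000, 10, 100))       # five classes of 5000, five classes of 50
```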
Imbalanced but known test label distribution: We also test the performance of an extension of our algorithm in the setting where the test label distribution is known but not uniform. Please see Section C.5 for details.

4.3 Visual recognition on iNaturalist 2018 and imbalanced Tiny ImageNet

We further verify the effectiveness of our method on large-scale imbalanced datasets. The iNaturalist species classification and detection dataset [52] is a real-world large-scale imbalanced dataset with 437,513 training images over 8,142 classes in its 2018 version. We adopt the official training and validation splits for our experiments. The training set has a long-tailed label distribution, while the validation set is designed to have a balanced label distribution. We use ResNet-50 as the backbone network for all iNaturalist 2018 experiments. Table 3 summarizes the top-1 validation error for iNaturalist 2018. Notably, our full pipeline outperforms the ERM baseline by 10.86% and the previous state-of-the-art by 6.88% in top-1 error. Please refer to Appendix C.2 for results on imbalanced Tiny ImageNet.

Figure 2: Per-class top-1 error on CIFAR-10 with step imbalance (ρ = 100, µ = 0.5). Classes 0-F to 4-F are frequent classes, and the rest are minority classes. Under this extremely imbalanced setting, RW suffers from under-fitting, while RS over-fits on minority examples. In contrast, the proposed algorithm exhibits strong generalization on minority classes while keeping the performance on frequent classes almost unaffected. This suggests that we succeeded in regularizing the minority classes more strongly.

Figure 3: Imbalanced training errors (dotted lines) and balanced test errors (solid lines) on CIFAR-10 with long-tailed imbalance (ρ = 100). We anneal the learning rate at epoch 160 for all algorithms. Our DRW schedule uses ERM before annealing the learning rate and thus performs worse than RW and RS before that point, as expected. However, it outperforms the others significantly after annealing the learning rate. See Section 4.4 for more analysis.

4.4 Ablation study

Evaluating generalization on minority classes. To better understand the improvement of our algorithms, we show the per-class errors of different methods on imbalanced CIFAR-10 in Figure 2. Please see the caption there for a discussion.

Evaluating the deferred re-balancing schedule. We compare the learning curves of the deferred re-balancing schedule with other baselines in Figure 3. In Figure 6 of Section C.3, we further show that even though ERM in the first stage has slightly worse or comparable balanced test error compared to RW and RS, the features (the last-but-one layer activations) learned by ERM are in fact better than those learned by RW and RS. This agrees with our intuition that the second stage of DRW, starting from better features, adjusts the decision boundary and locally fine-tunes the features.

5 Conclusion

We propose two methods for training on imbalanced datasets: the label-distribution-aware margin loss (LDAM) and a deferred re-weighting (DRW) training schedule. Our methods achieve significantly improved performance on a variety of benchmark vision tasks. Furthermore, we provide a theoretically-principled justification of LDAM by showing that it optimizes a uniform-label generalization error bound.
For DRW, we believe that deferring re-weighting lets the model avoid the drawbacks associated with re-weighting or re-sampling until after it learns a good initial representation (see some analysis in Figure 3 and Figure 6). However, the precise explanation for DRW's success is not yet fully theoretically clear, and we leave this as a direction for future work.

Acknowledgements

Toyota Research Institute ("TRI") provided funds and computational resources to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We thank Percy Liang and Michael Xie for helpful discussions in various stages of this work.

References

[1] Tiny imagenet visual recognition challenge. URL https://tiny-imagenet.herokuapp.com.
[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[3] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl0r3R9KX.
[4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240-6249, 2017.
[5] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249-259, 2018.
[6] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, 2019.
[7] Kaidi Cao, Yu Rong, Cheng Li, Xiaoou Tang, and Chen Change Loy. Pose-robust face recognition via deep residual equivariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5187-5196, 2018.
[8] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. arXiv preprint arXiv:1907.02056, 2019.
[9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.
[10] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[11] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts.
[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[13] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
[14] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[15] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87-102. Springer, 2016.
[16] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1934-1943, 2018.
[17] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263-1284, 2008.
[18] Haibo He and Yunqian Ma. Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons, 2013.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[20] J Henry Hinnefeld, Peter Cooman, Nat Mammo, and Rupert Deese. Evaluating fairness metrics in the presence of dataset bias. arXiv preprint arXiv:1809.09245, 2018.
[21] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375-5384, 2016.
[22] Chen Huang, Yining Li, Change Loy Chen, and Xiaoou Tang. Deep imbalanced learning for face recognition and attribute prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[23] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429-449, 2002.
[24] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793-800, 2009.
[25] Salman Khan, Munawar Hayat, Syed Waqas Zamir, Jianbing Shen, and Ling Shao. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 103-112, 2019.
[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50, 2002.
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017.
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[30] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[31] Buyu Li, Yu Liu, and Xiaogang Wang. Gradient harmonized single-stage detector. arXiv preprint arXiv:1811.05181, 2018.
[32] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz Kandola. The perceptron algorithm with uneven margins. In ICML, volume 2, pages 379-386, 2002.
[33] Zeju Li, Konstantinos Kamnitsas, and Ben Glocker. Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 402-410. Springer, 2019.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[35] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980-2988, 2017.
[36] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3128-3136, 2018.
[37] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.
[38] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212-220, 2017.
[39] Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.
[40] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537-2546, 2019.
[41] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142-150. Association for Computational Linguistics, 2011.
[42] Michele Merler, Nalini Ratha, Rogerio S Feris, and John R Smith. Diversity in faces. arXiv preprint arXiv:1901.10436, 2019.
[43] Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hygn2o0qKX.
[44] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[45] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[46] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In European Conference on Computer Vision, pages 467-482. Springer, 2016.
[47] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-Weight-Net: Learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379, 2019.
[48] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822-2878, 2018.
[49] Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.
[50] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
[51] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[52] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769-8778, 2018.
[53] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926-930, 2018.
[54] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135-153, 2018.
[55] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029-7039, 2017.
[56] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278-7286, 2018.
[57] Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. arXiv preprint arXiv:1905.03684, 2019.
[58] Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classification via an all-layer margin. arXiv preprint arXiv:1910.04284, 2019.
[59] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.
[60] Q Zhong, C Li, Y Zhang, H Sun, S Yang, D Xie, and S Pu. Towards good practices for recognition & detection. In CVPR Workshops, 2016.
[61] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Domain adaptation for semantic segmentation via class-balanced self-training. arXiv preprint arXiv:1810.07911, 2018.