Published as a conference paper at ICLR 2024

PARETO DEEP LONG-TAILED RECOGNITION: A CONFLICT-AVERSE SOLUTION

Zhipeng Zhou¹, Liu Liu², Peilin Zhao², Wei Gong¹
¹University of Science and Technology of China, ²Tencent AI Lab
zzp1994@mail.ustc.edu.cn, {leonliuliu, masonzhao}@tencent.com, weigong@ustc.edu.cn

ABSTRACT

Deep long-tailed recognition (DLTR) has attracted much attention due to its close connection with realistic scenarios. Recent advances have focused on re-balancing across various aspects, e.g., sampling strategies, loss re-weighting, logit adjustment, and input/parameter perturbation. However, few studies have considered dynamic re-balancing to address intrinsic optimization conflicts, which this study identifies as prevalent and critical issues. In this paper, we empirically establish the severity of the optimization conflict issue in the DLTR scenario, which degrades representation learning. This observation motivates the pursuit of Pareto optimal solutions. Unfortunately, a straightforward integration of multi-objective optimization (MOO) with DLTR methods is infeasible due to the disparity between multi-task learning (MTL) and DLTR. Therefore, we propose effective alternatives by decoupling MOO-based MTL from a temporal rather than a structural perspective. Furthermore, we enhance the integration of MOO and DLTR by investigating the generalization and convergence problems. Specifically, guided by the derived MOO-based DLTR generalization bound, we propose optimizing a variability collapse loss to improve generalization. Additionally, we anticipate worst-case optimization to ensure convergence. Building upon the proposed MOO framework, we introduce a novel method called Pareto deep LOng-Tailed recognition (PLOT).
Extensive evaluations demonstrate that our method not only generally improves mainstream pipelines, but also yields an augmented version that realizes state-of-the-art performance across multiple benchmarks. Code is available at https://github.com/zzpustc/PLOT.

1 INTRODUCTION

The recent success of machine learning (ML) techniques is largely attributed to the growing scale of training datasets, as well as the assumption that they are independent and identically distributed (i.i.d.) with the test distribution. However, such an assumption can hardly hold in many realistic scenarios where training sets exhibit an imbalanced or even long-tailed distribution, raising a critical challenge for the traditional ML community (Zhang et al., 2021b). To address this issue, recent research on deep long-tailed recognition (DLTR) has gained increasing interest; it strives to mitigate the bias toward certain categories and to generalize well on a balanced test dataset. Plenty of approaches have been proposed to realize re-balancing from various aspects in DLTR (Zhang et al., 2021b): sampling strategies (Zang et al., 2021; Cai et al., 2021), loss functions (Wang et al., 2013; Ren et al., 2020; Tan et al., 2020), logit adjustment (Cao et al., 2019; Li et al., 2022), data augmentation (Kim et al., 2020; Wang et al., 2021), input/parameter perturbation (Rangwani et al., 2022; Zhou et al., 2023), decoupled learning regimes (Kang et al., 2019), and diverse experts (Wang et al., 2020b; Guo & Wang, 2021), etc. Usually, these works design fixed re-balancing strategies according to the prior class frequency so that all categories are, in general, equally optimized.

(Corresponding authors: Liu Liu, Wei Gong. Work done while Z. Zhou was an intern at Tencent AI Lab.)

Several very recent studies (Ma et al., 2023; Sinha & Ohashi, 2023; Tan et al., 2023) empirically indicate that a dynamic re-balancing strategy is required, and achieve it by designing a quantitative
[Figure 1: Gradient conflicts among categories for six DLTR methods, including LDAM-DRW, Balanced Softmax (Bal. Softmax for short), and MiSLAS. The horizontal and vertical coordinates index the categories C0-C9, and the heat map represents the gradient similarity.]

[Figure 2: Gradient similarities during optimization (cosine similarity vs. training steps) for the same methods. Classes 0, 4, and 9 denote the corresponding categories in CIFAR10-LT, belonging to the head, medium, and tail classes, respectively. Please refer to the gradient norm examination in Section 4.5 of the Appendix.]

measurement of semantic scale imbalance, a meta module that learns from logits, etc. All these works take the instant rather than the prior imbalance into consideration, enabling them to reach competitive performance across various imbalanced scenarios. Nevertheless, a question naturally arises: Is instant imbalance enough for designing a dynamic re-balancing strategy? We then delve into the optimization of representative DLTR models and present related observations from the perspective of multi-objective optimization (MOO) in Fig. 1 and Fig. 2. As depicted, intrinsic optimization conflicts among categories are prevalent and might be aggravated by the dominating trajectories of certain categories, which would lead to sub-optimal solutions for the remaining ones.
Such an issue is rarely discussed in relation to the above question and cannot be addressed by current dynamic strategies, which were not designed for it (refer to Section 3 for more details). To fill this gap, we approach DLTR from a new angle, i.e., mitigating optimization conflicts via dynamic re-balancing, which is usually neglected in past works. We first identify the intrinsic gradient conflicts among categories in the optimization of prevailing DLTR methods and show their connection with the adopted fixed re-balancing strategies. To prevent the representation from being overwhelmed by the properties of dominant categories, we introduce MOO as used in MTL to mine the features shared among categories. Unfortunately, a naïve combination is not applicable due to the structural difference between MTL and DLTR, as illustrated in Fig. 6. Specifically, MOO-based MTL usually assumes that the model architecture consists of a backbone network and several separate task-specific branches on top of it, and strives to learn task-shared features with the backbone network via MOO algorithms, whereas DLTR targets only one task and owns only one branch. Hence a critical challenge appears: How to engage MOO in DLTR? As depicted in Fig. 6, we tackle this challenge with two key enablers: (1) regarding a multi-class classification task as multiple binary classification tasks, and (2) transforming the shared feature extraction and task-specific optimization from structural to temporal. Besides, by investigating several popular MOO approaches and choosing a stable one, we provide instructions on the integration of DLTR and MOO, propose a variability collapse loss, and anticipate worst-case optimization to ensure generalization and convergence. It should be noted that our goal is to provide a distinct angle on re-balancing rather than to design a new instant imbalance metric or MOO method; thus comparing our approach with these counterparts is beyond the scope of this paper.
Contributions: Our contributions can mainly be summarized as four-fold:

- Through the lens of MOO, we empirically identify the phenomenon of optimization conflicts among categories and establish its severity for representation learning in DLTR.
- To mitigate the above issues, we endow prevailing re-balancing models with the Pareto property by innovatively transforming MOO-based MTL from structural to temporal, enabling the application of MOO algorithms in DLTR without model architecture modifications.
- Moreover, two theoretically motivated operations, i.e., the variability collapse loss and anticipating worst-case optimization, are proposed to further ensure the generalization and convergence of MOO-based DLTR.
- Extensive evaluations have demonstrated that our method, PLOT, can significantly enhance the performance of mainstream DLTR methods and achieve state-of-the-art results across multiple benchmarks compared to its advanced counterparts.

2 PRELIMINARIES

Problem Setup: Taking a K-way classification task as an example, assume we are given a long-tailed training set S = {(x_i, y_i) | i = 1, ..., n} for the DLTR problem, with per-class sample numbers {n_1, n_2, ..., n_K}, where n = Σ_{i=1}^K n_i. Without loss of generality, we assume n_i < n_j if i < j, and usually n_K ≫ n_1. Following the general DLTR setting, all models are finally evaluated on a balanced test dataset.

Pareto Concept: Our framework hinges on MOO-based MTL, which strives to achieve the Pareto optimum in the MTL situation. Formally, assume that there are N tasks at hand with differentiable loss functions L_i(θ), i ∈ [N]. The weighted loss is L_ω = Σ_{i=1}^N ω_i L_i(θ), ω ∈ W, where θ is the parameter of the model and W is the probability simplex on [N]. A point θ′ is said to Pareto dominate θ only if ∀i, L_i(θ′) ≤ L_i(θ) (with strict inequality for at least one i). Pareto optimality is therefore the situation in which no θ′ can be found such that ∀i, L_i(θ′) ≤ L_i(θ) holds for the point θ.
All points that satisfy the above condition form the Pareto set, and their images form the so-called Pareto front. A related concept is Pareto stationarity, which requires min_{ω∈W} ‖g_ω‖ = 0, where g_ω is the weighted gradient. In this paper, since we regard the K-way classification task as K binary tasks, N is set to K.

Definition 2.1 (Gradient Similarity). Denote ϕ_ij as the angle between two task gradients g_i and g_j; then we define the gradient similarity as cos ϕ_ij, and we call the gradients conflicting when cos ϕ_ij < 0.

Definition 2.2 (Dominated Conflicting). For task gradients g_i and g_j, denote their average gradient by g_0, and assume ‖g_i‖ < ‖g_j‖. Then we call the gradients dominated conflicting when cos ϕ_0i < 0.

3 MOTIVATION AND EMPIRICAL OBSERVATIONS

Intrinsic Property in Definition and Difference from MTL: As outlined in Section 2, a DLTR model is trained on an imbalanced dataset but is expected to generalize well to all categories, which aligns with the motivation of Pareto optimality, i.e., improving all individual tasks (categories). However, unlike MTL, which employs a distinct structure where the backbone and the corresponding branches are responsible for shared feature extraction and task-specific optimization, respectively, the structures in DLTR are attributed to all categories. This difference impedes DLTR models from achieving Pareto properties. Therefore, in Section 4, we introduce the MOO-based DLTR pipeline.

Optimization Conflicts under Imbalanced Scenarios:

[Figure 3: Illustration of gradient conflict scenarios relative to g_mean: (a) balanced scenarios (conflicting vs. non-conflicting); (b) imbalanced scenarios (conflicting, dominated conflicting, and non-conflicting).]

As depicted in Fig. 3, each task exhibits improvement when optimized using the average gradient, i.e., g_mean, in balanced scenarios where conflicts arise. However, in imbalanced scenarios, the utilization of g_mean tends to favor the dominant tasks. This preference becomes particularly pronounced in extreme cases: even when g_mean and g_j are in conflict (referred to as Dominated Conflicting in Definition 2.2), the optimization of task i leads to an enhancement in its performance at the expense of task j.

Table 1: Benefits of MOO methods for mainstream DLTR models on CIFAR10-LT. We re-implement all models via their publicly released code, and all results are reported over experiments with 3 random seeds. Each MOO entry is given as "naïve integration / early-stop version" (in the original typesetting, the naïve integration version is underlined and the early-stop version is colored).

| Imb. | cRT+Mixup Vanilla | w/ EPO | w/ MGDA | w/ CAGrad | LDAM-DRW Vanilla | w/ EPO | w/ MGDA | w/ CAGrad |
|------|------|------|------|------|------|------|------|------|
| 200 | 73.06 | 33.45 / 76.24 | 68.05 / 75.98 | 75.15 / 76.02 | 71.38 | 56.04 / 73.64 | 67.18 / 74.08 | 55.80 / 73.28 |
| 100 | 79.15 | 34.27 / 79.69 | 73.71 / 79.26 | 79.58 / 80.16 | 77.71 | 66.49 / 77.25 | 73.70 / 77.79 | 66.49 / 76.86 |
| 50 | 84.21 | 36.53 / 83.79 | 79.27 / 84.15 | 83.52 / 84.49 | 81.78 | 72.60 / 81.62 | 78.24 / 81.58 | 69.26 / 81.85 |

| Imb. | Balanced Softmax Vanilla | w/ EPO | w/ MGDA | w/ CAGrad | M2m Vanilla | w/ EPO | w/ MGDA | w/ CAGrad |
|------|------|------|------|------|------|------|------|------|
| 200 | 81.33 | 45.37 / 81.40 | 74.13 / 80.90 | 79.20 / 80.93 | 73.43 | 51.90 / 73.07 | 57.14 / 72.63 | 70.95 / 73.84 |
| 100 | 84.90 | 44.33 / 85.30 | 79.06 / 85.10 | 83.77 / 85.40 | 77.55 | 57.89 / 76.57 | 52.37 / 76.48 | 76.24 / 77.95 |
| 50 | 89.17 | 41.43 / 88.97 | 79.43 / 88.90 | 88.00 / 89.27 | 80.94 | 42.07 / 81.19 | 46.38 / 80.66 | 78.19 / 81.11 |

| Imb. | MiSLAS Vanilla | w/ EPO | w/ MGDA | w/ CAGrad | GCL Vanilla | w/ EPO | w/ MGDA | w/ CAGrad |
|------|------|------|------|------|------|------|------|------|
| 200 | 76.59 | 36.62 / 76.97 | 63.40 / 76.12 | 76.30 / 77.43 | 79.25 | 62.08 / 79.73 | 75.43 / 80.03 | 78.73 / 80.08 |
| 100 | 81.33 | 39.92 / 81.22 | 68.09 / 82.00 | 82.10 / 82.47 | 82.85 | 74.78 / 82.75 | 79.01 / 82.81 | 82.48 / 83.48 |
| 50 | 85.23 | 44.78 / 84.60 | 70.20 / 84.84 | 85.20 / 85.33 | 86.00 | 78.42 / 84.55 | 81.89 / 85.58 | 85.31 / 85.90 |
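Definitions 2.1 and 2.2 are straightforward to check numerically. The following is a minimal, framework-agnostic sketch (NumPy; per-class gradients are assumed to be already flattened into vectors, and the function names are ours, not from the paper's code):

```python
import numpy as np

def gradient_similarity(grads: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity cos(phi_ij) between per-class gradients.

    grads: (K, P) array holding one flattened gradient per class.
    An entry below zero marks a conflicting pair (Definition 2.1).
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def conflicts_with_average(grads: np.ndarray) -> list:
    """Indices of classes whose gradient conflicts with the average
    gradient g0, i.e. cos(phi_0i) < 0 as in Definition 2.2."""
    g0 = grads.mean(axis=0)
    sims = grads @ g0 / (np.linalg.norm(grads, axis=1)
                         * np.linalg.norm(g0) + 1e-12)
    return [int(i) for i in np.where(sims < 0)[0]]
```

In these terms, Fig. 1 corresponds to plotting `gradient_similarity` as a heat map, and Fig. 5 to counting how often `conflicts_with_average` is non-empty over training.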
In order to investigate the presence of optimization conflict issues in DLTR, we meticulously analyze several re-balancing regimes: (1) cost-sensitive loss approaches, such as LDAM-DRW (Cao et al., 2019) and Balanced Softmax (Ren et al., 2020); (2) augmentation techniques, such as M2m (Kim et al., 2020); and (3) decoupling methods, including cRT + Mixup (Kang et al., 2019), MiSLAS (Zhong et al., 2021), and GCL (Li et al., 2022). By computing the cosine similarities among the gradients associated with different categories (Gradient Similarity in Definition 2.1), we illustrate the conflict status of these methods in Fig. 1.¹ As depicted, all the selected methods exhibit varying degrees of gradient conflicts, which persist in the early stages of training (refer to Section 4.2 in the Appendix). Furthermore, we examine the optimization preferences of DLTR models in Fig. 2, revealing that certain categories dominate the overall optimization process. Additionally, we provide a statistical analysis of the frequency of dominated conflicting instances in Fig. 5, establishing a roughly positive correlation between the frequency of dominated conflicting and the imbalance ratio.

[Figure 4: Hessian spectrum analysis (density on a log scale) of class 0 and class 9 before (vanilla) and after addressing the optimization conflict issue.]

The Benefit of Addressing Optimization Conflicts: Here we provide a preview of the advantages achieved by addressing optimization conflicts through our temporal design, as outlined in Section 4.1. We present the benefits from two perspectives: (1) representation analysis and (2) performance improvements.
To gain insights into the impact of addressing the optimization conflict issue on representation learning, we conducted a Hessian spectrum analysis (Rangwani et al., 2022) and visualized the results in Fig. 4. Our analysis reveals that addressing optimization conflicts results in flatter minima for each class, thereby mitigating the risk of being trapped at saddle points and facilitating the acquisition of more generalized representations. Furthermore, we demonstrate the performance benefits achieved by employing various MOO approaches in Table 1. The results effectively showcase the potential of integrating MOO with DLTR.

¹ Mainstream DLTR methods usually employ the SGD optimizer for implementation. Therefore, our analysis does not encompass the results obtained by utilizing alternative optimizers such as Adam.

[Figure 6: Comparison of (a) MOO-based MTL, where shared feature extraction and task-specific optimization are separated structurally (backbone vs. per-task branches for Tasks 1-3), and (b) MOO-based DLTR, where they are separated temporally (epochs 1 to E vs. epoch E+1 onward) over per-class objectives (Classes 1-3).]

[Figure 5: Statistics of dominated conflicts (dominated conflict ratio, %) for ERM, LDAM, Balanced Softmax, M2m, MiSLAS, and GCL.]

Urgency of a Conflict-Averse Strategy: Currently, our MOO-based DLTR framework can primarily be classified as a specialized dynamic re-balancing strategy, formally formulated as L(x, y) = Σ_{k=1}^K ω_k (B_k / B) l̄(x^k, y^k). Here B is the batch size, B_k and ω_k represent the frequency and the dynamic re-weighting factor of class k, respectively, and l̄(x^k, y^k) is the average loss of class k. While there are existing studies that explore the concept of dynamic re-balancing (Tan et al., 2023; Sinha & Ohashi, 2023; Ma et al., 2023), none of them address the issue of optimization conflicts from a comprehensive perspective. Consequently, the intrinsic optimization conflicts among categories cannot be effectively mitigated (see Section 4.4 in the Appendix).
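The dynamic re-balancing objective above can be sketched directly. In this illustrative NumPy version, ω is taken as an input (PLOT derives it each step from the MOO subproblem described in Section 4), and the function name is ours:

```python
import numpy as np

def dynamic_rebalanced_loss(losses, labels, omega):
    """L(x, y) = sum_k omega_k * (B_k / B) * mean loss of class k.

    losses: (B,) per-sample losses; labels: (B,) class ids in [0, K);
    omega:  (K,) dynamic re-weighting factors (supplied by the caller).
    Classes absent from the mini-batch contribute nothing.
    """
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    B = len(losses)
    total = 0.0
    for k in np.unique(labels):
        mask = labels == k
        B_k = mask.sum()                     # frequency of class k
        total += omega[k] * (B_k / B) * losses[mask].mean()
    return total
```

With ω_k ≡ 1 this reduces to the plain batch-average loss, which makes explicit that the only new degree of freedom is the per-class weight.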
Fortunately, our work bridges this gap and offers a solution to this problem.

4 PARETO DEEP LONG-TAILED RECOGNITION

Building on the aforementioned analysis, we present a detailed design for integrating MOO into DLTR, which encompasses its adaptation from MTL to DLTR and an augmented version that ensures generalization and convergence. For complete proofs of the proposed theorems, please refer to Section 1 of the Appendix.

4.1 MOO: FROM MTL TO DLTR

As stated previously, a straightforward integration of MOO is not feasible for DLTR scenarios due to differences in task properties and architectures. Revisiting the function of each component in MOO-based MTL, they can be categorized into two aspects: (1) shared feature extraction (SFE) and (2) task-specific optimization (TSO). SFE aims to extract shared representations among distinct tasks, while TSO is responsible for the corresponding task's performance with its independent branch. In this study, we treat the multi-class classification task as multiple binary classification tasks. Each binary classification task is considered a single objective in MOO, and we re-design SFE and TSO from a temporal perspective during the training stage, rather than a structural one. Specifically, we implement SFE in the first E epochs by applying MOO algorithms to DLTR models, but release it in the subsequent stages, as illustrated in Fig. 6. This approach can also be interpreted as an early stopping operation, inspired by previous research (Cadena et al., 2018; Hu et al., 2021) suggesting that the early layers of neural networks undergo fewer changes in the later stages of training. To validate the effectiveness of this design, we employ three representative MOO algorithms, i.e., MGDA (Désidéri, 2012), EPO (Mahapatra & Rajan, 2020), and CAGrad (Liu et al., 2021a), equip the aforementioned six DLTR models with them, and present the results in Table 1.
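The temporal decoupling can be summarized as a two-phase training schedule. In the sketch below, `moo_step` and `plain_step` stand in for the conflict-averse MOO update and the vanilla DLTR update; both names are assumptions of this illustration, not identifiers from the paper's released code:

```python
def train_temporal_moo(model, loader, epochs, E, moo_step, plain_step):
    """Temporal SFE/TSO decoupling (Fig. 6b, sketch).

    For the first E epochs, parameters are updated with a conflict-averse
    MOO direction over the K per-class objectives (shared feature
    extraction); afterwards the baseline DLTR update is used
    (task-specific optimization). Returns the log of applied steps.
    """
    history = []
    for epoch in range(epochs):
        for batch in loader:
            if epoch < E:
                history.append(moo_step(model, batch))    # SFE phase
            else:
                history.append(plain_step(model, batch))  # TSO phase
    return history
```

The switch at epoch E is the entire architectural change: the network itself is untouched, only the update rule alternates, which is why the method composes with any of the six baselines.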
The results demonstrate that our design enables MOO to enhance DLTR models in most cases, which also highlights the potential benefits of addressing the optimization conflict problem. Moreover, the performance deteriorates significantly without early stopping (the naïve integration version), indicating class-specific feature degradation. It is also noteworthy that CAGrad exhibits relatively stable performance across the various baselines, so we select it as the fundamental framework for further enhancement.²

Remark (No Modifications to the Model Architecture). Once again, it is important to emphasize that our approach does not involve any modifications to the model architecture. Rather, our method represents an effective learning paradigm for the application of MOO in DLTR.

4.2 CHALLENGES OF CAGRAD UNDER DLTR

Generalization Problem: Prior to delving into the additional technical designs of PLOT, it is necessary to establish learning guarantees for the MOO-based DLTR framework. To this end, we introduce ω_k into the definition of Rademacher complexity (Cortes et al., 2020), which can be reformulated as follows:

R_S(G, ω) = E_σ [ sup_{h∈H} Σ_{k=1}^K ω_k (1/m_k) Σ_{i=1}^{m_k} σ_i l(h(x_i^k), y_i^k) ]    (1)

where G is the function class associated with the hypothesis set H, G = {(x, y) ↦ l(h(x), y) : h ∈ H}, and the σ_i are independent uniformly distributed random variables taking values in {−1, +1}. From this definition, we have the following theorem:

Theorem 4.1 (MOO-based DLTR Generalization Bound). If the loss function l_k belonging to the k-th category is M_k-Lipschitz, and ∀(x, y), (x′, y′) ∈ X × Y and h ∈ H, ‖[h(x), y] − [h(x′), y′]‖ ≤ D_H, and we assume M_k D_H is bounded by M, then for any δ > 0, with probability at least 1 − δ, the following inequality holds for all h ∈ H and ω ∈ W:

L_ω(h) ≤ L̂_ω(h) + 2 R_S(G, ω) + M √( (log(1/δ) / 2) Σ_{k=1}^K ω_k² / m_k ).

The derived generalization bound indicates that we should minimize L̂_ω(h) as well as constrain the intra-class loss variability M.
With this theoretical insight, we design the following variability collapse loss:

L_vc = Σ_{k=1}^K Std( l̃(x^k, y^k) )    (2)

where Std(·) is the standard deviation function and l̃(x^k, y^k) is the set of losses of the k-th category in a mini-batch. It is worth noting that our proposed design shares a similar concept with a recent study (Liu et al., 2023), which aims to induce Neural Collapse in the DLTR setting. However, our approach is distinct in that we propose it from the perspective of MOO with theoretical analysis.

² Nonetheless, the objective of this paper is to explore the potential of the MOO framework in addressing the DLTR problem and to propose an effective algorithm for enhancing mainstream DLTR methods. Therefore, we defer the development and integration of more advanced MOO algorithms to future research.

[Figure 7: Loss landscape of LDAM (loss contours around the trained model) for a head and a tail class.]

Convergence Problem: On the other hand, although CAGrad exhibits stability, it may not always yield improvements, as evidenced by Table 1. Consequently, we delve deeper into CAGrad and demonstrate its limitations in the DLTR scenario. Based on the convergence analysis of CAGrad, we present the following theorem:

Theorem 4.2 (Convergence of CAGrad in DLTR). With a fixed step size α and the assumption of H-Lipschitz gradients, i.e., ‖∇L_i(θ) − ∇L_i(θ′)‖ ≤ H‖θ − θ′‖ for i = 1, 2, ..., K, denote d*(θ_t) as the optimization direction of CAGrad at step t; then we have:

L(θ_{t+1}) ≤ L(θ_t) − (α/2)(1 − c²)‖g_0(θ_t)‖² + (α/2)(Hα − 1)‖d*(θ_t)‖²,

where c ∈ (0, 1) and g_0(θ_t) is the corresponding average gradient at θ_t. The convergence of CAGrad is inherently influenced by the value of H, as observed in our experiments. This observation is further supported by a related study (Fernando et al., 2022), which introduces random noise into the optimization trajectory of CAGrad and demonstrates convergence failures.
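Eq. 2 amounts to summing, over the classes present in a mini-batch, the standard deviation of that class's per-sample losses. A short sketch (the population standard deviation is assumed here, since the paper does not specify the estimator):

```python
import numpy as np

def variability_collapse_loss(losses, labels):
    """L_vc = sum_k Std({l(x_i^k, y_i^k)})  (Eq. 2, sketch).

    Penalises intra-class loss spread, which the generalization bound
    of Theorem 4.1 ties to the variability constant M. Uses NumPy's
    default population std (ddof=0); a sample std would also be valid.
    """
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    return float(sum(losses[labels == k].std() for k in np.unique(labels)))
```

A class whose samples all incur the same loss contributes zero, so minimizing L_vc collapses the within-class loss distribution without forcing the losses of different classes to agree.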
Additionally, it is widely acknowledged that achieving a small value of H for the tail classes in DLTR models is often unattainable (Rangwani et al., 2022; Zhou et al., 2023). To visually illustrate this point, we provide a depiction of the LDAM-DRW loss landscape in Fig. 7 (for more visual illustrations, please refer to Section 7.2 in the Appendix). In this particular case, the loss landscape of the tail class exhibits a sharp minimum, indicating a large value of H and consequently posing challenges for convergence when integrating CAGrad with DLTR models. To solve this problem, we constrain H by anticipating worst-case optimization (Foret et al., 2020), i.e., sharpness-aware minimization (SAM):

min_θ max_{ϵ(θ)} L(θ + ϵ(θ)), where ‖ϵ(θ)‖₂ ≤ ρ,    (3)

where the inner optimization can be approximated via a first-order Taylor expansion, which results in the following solution (ρ is a hyper-parameter):

ϵ̂(θ) = ρ ∇_θ L(θ) / ‖∇_θ L(θ)‖₂.    (4)

Overall Optimization Procedure: At the t-th step of the SFE stage, we first compute the original loss L(θ) for a mini-batch and obtain the perturbed loss L_SAM = L(θ + ϵ̂(θ)) according to Eqn. 4, as well as the variability collapse loss L_vc defined in Eqn. 2 based on the perturbed loss. We then obtain the average gradient g_0 and the class-specific gradients g_i, i ∈ {1, 2, ..., K}, by back-propagating L_moo = L_SAM + L_vc. Finally, the dynamic class weights ω = {ω_1, ω_2, ..., ω_K} are obtained by solving CAGrad (Liu et al., 2021a):

min_{ω∈W} F(ω) := g_ω^⊤ g_0 + √φ ‖g_ω‖, where φ = c² ‖g_0‖²,    (5)

and the model is updated via θ_t = θ_{t−1} − α ( g_0 + (φ^{1/2} / ‖g_ω‖) g_ω ), where g_ω = Σ_{i=1}^K ω_i g_i. The overall pseudo-algorithm is summarized in Section 6.2 of the Appendix.

4.3 IMPLEMENTATION DETAILS

We implement our code with Python 3.8 and PyTorch 1.4.0, and all experiments are carried out on Tesla V100 GPUs. We train each model with a batch size of 64 (for CIFAR10-LT and CIFAR100-LT) / 128 (for Places-LT) / 256 (for ImageNet-LT and iNaturalist), using the SGD optimizer with momentum 0.9.
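The per-step procedure combines the SAM perturbation of Eq. 4 with the CAGrad subproblem of Eq. 5. Below is a toy NumPy sketch: the two-task grid search over the simplex replaces the generic solver used in practice and is an illustrative simplification, not the paper's implementation.

```python
import numpy as np

def sam_perturbation(grad, rho):
    """epsilon_hat = rho * grad / ||grad||_2  (Eq. 4, first-order SAM)."""
    return rho * grad / (np.linalg.norm(grad) + 1e-12)

def cagrad_direction(grads, c=0.5, n_grid=1001):
    """Conflict-averse direction d = g0 + sqrt(phi) * g_w / ||g_w||,
    with omega minimising F(omega) = g_w^T g0 + sqrt(phi) * ||g_w||
    and phi = c^2 * ||g0||^2 (Eq. 5). Two-task grid-search version.
    """
    assert len(grads) == 2, "grid-search sketch covers K = 2 only"
    g0 = grads.mean(axis=0)
    phi = c ** 2 * np.linalg.norm(g0) ** 2
    best, best_gw = np.inf, None
    for w in np.linspace(0.0, 1.0, n_grid):
        gw = w * grads[0] + (1 - w) * grads[1]   # weighted gradient g_w
        f = gw @ g0 + np.sqrt(phi) * np.linalg.norm(gw)
        if f < best:
            best, best_gw = f, gw
    return g0 + np.sqrt(phi) * best_gw / (np.linalg.norm(best_gw) + 1e-12)
```

By construction the returned direction stays within a ball of radius c‖g_0‖ around g_0, which is the mechanism that keeps the update close to the average gradient while de-emphasising the dominating class.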
5 EVALUATION

Following the mainstream protocols, we conduct experiments on popular DLTR benchmarks: CIFAR10-/CIFAR100-LT, Places-LT (Liu et al., 2019), ImageNet-LT (Liu et al., 2019), and iNaturalist 2018 (Van Horn et al., 2018). To show the versatility of PLOT, we equip the aforementioned popular re-balancing regimes with it under various imbalance scenarios. Moreover, PLOT achieves state-of-the-art (SOTA) performance by augmenting advanced baselines on large-scale datasets. Finally, micro-benchmarks are elaborated to show the effectiveness of each component. For a fair comparison, we exclude ensemble and pre-training models from our experiments.

Table 2: Performance on CIFAR datasets.

| Method | CIFAR10-LT 200 | CIFAR10-LT 100 | CIFAR10-LT 50 | CIFAR100-LT 200 | CIFAR100-LT 100 | CIFAR100-LT 50 |
|---|---|---|---|---|---|---|
| LDAM-DRW | 71.38 | 77.71 | 81.78 | 36.50 | 41.40 | 46.61 |
| LDAM-DRW + PLOT | 74.32 | 78.19 | 82.09 | 37.31 | 42.31 | 47.04 |
| M2m | 73.43 | 77.55 | 80.94 | 35.81 | 40.77 | 45.73 |
| M2m + PLOT | 74.48 | 78.42 | 81.79 | 38.43 | 43.00 | 47.19 |
| cRT + Mixup | 73.06 | 79.15 | 84.21 | 41.73 | 45.12 | 50.86 |
| cRT + Mixup + PLOT | 78.99 | 80.55 | 84.58 | 43.80 | 47.59 | 51.43 |
| Logit Adjustment | - | 78.01 | - | - | 43.36 | - |
| Logit Adjustment + PLOT | - | 79.40 | - | - | 44.19 | - |
| MiSLAS | 76.59 | 81.33 | 85.23 | 42.97 | 47.37 | 51.42 |
| MiSLAS + PLOT | 77.73 | 81.88 | 85.70 | 44.28 | 47.91 | 52.66 |
| GCL | 79.25 | 82.85 | 86.00 | 44.88 | 48.95 | 52.85 |
| GCL + PLOT | 80.08 | 83.35 | 85.90 | 45.61 | 49.50 | 53.05 |

5.1 VERSATILITY VERIFICATION

Table 3: Performance on large-scale datasets.
| Dataset | Method | Backbone | Overall |
|---|---|---|---|
| Places-LT | CE | ResNet-152 | 30.2 |
| Places-LT | Decouple-τ-norm | ResNet-152 | 37.9 |
| Places-LT | Balanced Softmax | ResNet-152 | 38.6 |
| Places-LT | LADE | ResNet-152 | 38.8 |
| Places-LT | RSG | ResNet-152 | 39.3 |
| Places-LT | DisAlign | ResNet-152 | 39.3 |
| Places-LT | ResLT | ResNet-152 | 39.8 |
| Places-LT | GCL | ResNet-152 | 40.6 |
| Places-LT | cRT + Mixup | ResNet-152 | 38.5 |
| Places-LT | cRT + Mixup + PLOT | ResNet-152 | 41.0 |
| Places-LT | MiSLAS | ResNet-152 | 40.2 |
| Places-LT | MiSLAS + PLOT | ResNet-152 | 40.5 |
| ImageNet-LT | CE | ResNeXt-50 | 44.4 |
| ImageNet-LT | Decouple-τ-norm | ResNet-50 | 46.7 |
| ImageNet-LT | Balanced Softmax | ResNeXt-50 | 52.3 |
| ImageNet-LT | LADE | ResNeXt-50 | 52.3 |
| ImageNet-LT | RSG | ResNeXt-50 | 51.8 |
| ImageNet-LT | DisAlign | ResNet-50 | 52.9 |
| ImageNet-LT | DisAlign | ResNeXt-50 | 53.4 |
| ImageNet-LT | ResLT | ResNeXt-50 | 52.9 |
| ImageNet-LT | LDAM-DRW + SAM | ResNet-50 | 53.1 |
| ImageNet-LT | GCL | ResNet-50 | 54.9 |
| ImageNet-LT | cRT + Mixup | ResNet-50 | 51.7 |
| ImageNet-LT | cRT + Mixup + PLOT | ResNeXt-50 | 54.3 |
| ImageNet-LT | MiSLAS | ResNet-50 | 52.7 |
| ImageNet-LT | MiSLAS + PLOT | ResNet-50 | 53.5 |
| iNaturalist 2018 | CE | ResNet-50 | 61.7 |
| iNaturalist 2018 | Decouple-τ-norm | ResNet-50 | 65.6 |
| iNaturalist 2018 | Balanced Softmax | ResNet-50 | 70.6 |
| iNaturalist 2018 | LADE | ResNet-50 | 70.0 |
| iNaturalist 2018 | RSG | ResNet-50 | 70.3 |
| iNaturalist 2018 | DisAlign | ResNet-50 | 70.6 |
| iNaturalist 2018 | ResLT | ResNet-50 | 70.2 |
| iNaturalist 2018 | LDAM-DRW + SAM | ResNet-50 | 70.1 |
| iNaturalist 2018 | GCL | ResNet-50 | 72.0 |
| iNaturalist 2018 | cRT + Mixup | ResNet-50 | 69.5 |
| iNaturalist 2018 | cRT + Mixup + PLOT | ResNet-50 | 71.3 |
| iNaturalist 2018 | MiSLAS | ResNet-50 | 71.6 |
| iNaturalist 2018 | MiSLAS + PLOT | ResNet-50 | 72.1 |

Given the inherent optimization conflicts in advanced DLTR models, PLOT can serve as a valuable augmentation technique. Our experimental results, presented in Table 2, demonstrate that PLOT brings improvements in most scenarios by addressing the problem from a new dimension that is orthogonal to current solutions. Notably, cRT + Mixup and LDAM-DRW exhibit the most conflict scenarios and gain the most from PLOT. In fact, cRT + Mixup even achieves competitive performance compared to the state of the art under certain imbalance-ratio settings, highlighting the efficacy of PLOT in addressing optimization conflict problems. We also observe a marginal effect of this augmentation, as GCL exhibits marginal improvement or even degradation, owing to the absence of significant optimization conflicts (see Fig. 1).
5.2 COMPARISONS WITH SOTA

We further evaluate the effectiveness of PLOT on large-scale datasets, i.e., Places-LT, ImageNet-LT, and iNaturalist, and compare it against mainstream methods in Table 3. By augmenting two advanced baselines (cRT + Mixup and MiSLAS), PLOT achieves state-of-the-art performance. Specifically, our approach exhibits a substantial performance advantage over other DLTR models on Places-LT and iNaturalist, two benchmarks recognized as challenging due to their high imbalance ratios.

5.3 ABLATION STUDY

Table 4: Ablation studies on CIFAR10-LT with the imbalance ratio set to 200.

| cRT + Mixup | w/ temp | w/ anti. | w/ var. | Acc. |
|---|---|---|---|---|
| ✓ | | | | 73.06 |
| ✓ | ✓ | | | 76.02 |
| ✓ | ✓ | ✓ | | 77.79 |
| ✓ | ✓ | ✓ | ✓ | 78.99 |

Our system comprises multiple components, and we aim to demonstrate their individual effectiveness. To this end, we conduct ablation studies and present the relationship between each component and the final performance in Table 4. Our results indicate that the proposed operations, i.e., the temporal design (temp), the variability collapse loss (var.), and anticipating worst-case optimization (anti.), each significantly enhance the system's performance.

5.4 OPTIMIZATION TRAJECTORY ANALYSIS

[Figure 8: Gradient similarity with the aggregated gradient before/after applying PLOT on LDAM-DRW (cosine similarity vs. training steps): (a) gradient similarities of LDAM-DRW; (b) gradient similarities of PLOT.]

We capture the optimization trajectories of different categories in LDAM-DRW + PLOT and present them in Fig. 8. In the left panel, the gradient similarity is computed between each category and the average gradient, while in the right panel it is calculated between each category and the gradient aggregated by MOO. By comparison, the original LDAM-DRW approach exhibits a dominance of head classes in representation learning, resulting in a deterioration of shared feature extraction.
In contrast, our augmented version with PLOT demonstrates relatively comparable and stable similarities among categories, indicating the potential for effective extraction of shared features across categories.

6 RELATED WORKS

6.1 DEEP LONG-TAILED RECOGNITION

Recent advancements in DLTR have been driven by three distinct design philosophies: (1) re-balancing strategies on various aspects, (2) ensemble learning, and (3) representation learning. In the first category, considerable efforts have been dedicated to re-sampling (Wang et al., 2019; Zang et al., 2021; Cai et al., 2021; Wei et al., 2021), re-weighting (Park et al., 2021; Kini et al., 2021), re-margining (Feng et al., 2021; Cao et al., 2019; Koltchinskii & Panchenko, 2002), logit adjustment (Menon et al., 2020; Zhang et al., 2021a), and information augmentation (Kim et al., 2020; Yang & Xu, 2020; Zang et al., 2021), etc. These approaches aim to manually re-balance the model by addressing sample numbers (through sampling and augmentation), cost sensitivity, prediction margins/logits, and other factors, thereby reducing the bias toward major classes. Ensemble learning-based approaches strive to leverage the expertise of multiple models, and there are several methods for aggregating them. BBN (Zhou et al., 2020) and SimCAL (Wang et al., 2020a) train experts using both long-tailed and uniformly distributed data and aggregate their outputs. On the other hand, ACE (Cai et al., 2021), ResLT (Cui et al., 2022), and BAGS (Li et al., 2020) train experts on different subsets of categories. SADE (Zhang et al., 2022) employs diverse experts, including long-tailed, uniform, and inverse long-tailed ones, and adaptively aggregates them using a self-supervised objective. Recent efforts in representation learning have centered on decoupling and contrastive learning, which employ distinct regimes to obtain general representations for all categories.
Decoupling-based methods (Kang et al., 2019; Zhong et al., 2021; Li et al., 2022) have shown that the representation learned via random sampling strategies is powerful enough, and additional effort devoted to the second stage, i.e., classifier adjustment, can help achieve advanced performance. A recent study (Liu et al., 2021b) has empirically found that contrastive learning-based methods are less sensitive to imbalanced scenarios. Thus, these methods (Cui et al., 2021; Zhu et al., 2022) extract general representations via supervised contrastive learning and achieve competitive performance. Our work takes a new approach to DLTR by developing a gradient conflict-averse solution, which is almost orthogonal to current solutions and has been verified to be effective.

6.2 MOO-BASED MTL
Multi-task learning (MTL), particularly MOO-based MTL, has garnered significant attention in the machine learning community as a fundamental and practical task. MGDA-UB (Sener & Koltun, 2018) achieves Pareto optimality by optimizing a derived upper bound in large-scale MTL scenarios. PCGrad (Yu et al., 2020) mitigates conflict challenges by projecting gradients onto the corresponding orthogonal directions. In contrast, CAGrad (Liu et al., 2021a) develops a provably convergent solution by optimizing the worst relative individual task while constraining the update around the average solution. Additionally, EPO (Mahapatra & Rajan, 2020) proposes a preference-guided method that can search for Pareto optimality tailored to the prior. Our work represents the first attempt to integrate MOO into DLTR. We bridge the gap between MTL and DLTR and propose two improvements for further augmentation. Although evaluating PLOT under the MTL setting would be worthwhile, it is beyond the scope of this paper and is left for future investigation.

7 CONCLUSION
This paper has presented a novel approach to bridging the gap between MTL and DLTR.
Specifically, we proposed a re-design of the MOO paradigm from structural to temporal, with the aim of addressing the challenge of optimization conflicts. To further ensure the convergence and generalization of the MOO algorithm, we optimized the derived MOO-based DLTR generalization bound and sought a flatter minimum. Our experimental results demonstrated the benefits of injecting the Pareto property across multiple benchmarks. We hope that our findings provide valuable insights for researchers studying the integration of MOO and DLTR, which has been shown to hold great promise. Our future work lies in developing adaptive strategies to apply MOO algorithms more efficiently.

8 ACKNOWLEDGEMENTS
We thank anonymous reviewers for their valuable comments. This work was supported by the NSFC under Grants 61932017 and 61971390.

9 REPRODUCIBILITY STATEMENT
Further implementation details can be found in Section 2 of the Appendix. Specifically, the supplementary material includes the attached code, which serves as a reference and provides additional information. The provided code demo consists of a prototype trained on the CIFAR10-/100-LT datasets.

REFERENCES
Santiago A. Cadena, Marissa A. Weis, Leon A. Gatys, Matthias Bethge, and Alexander S. Ecker. Diverse feature visualizations reveal invariances in early layers of deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217-232, 2018.
Jiarui Cai, Yizhou Wang, and Jenq-Neng Hwang. ACE: Ally complementary experts for solving long-tailed recognition in one-shot. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 112-121, 2021.
Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32, 2019.
Corinna Cortes, Mehryar Mohri, Javier Gonzalvo, and Dmitry Storcheus. Agnostic learning with multiple objectives.
Advances in Neural Information Processing Systems, 33:20485-20495, 2020.
Jiequan Cui, Zhisheng Zhong, Shu Liu, Bei Yu, and Jiaya Jia. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 715-724, 2021.
Jiequan Cui, Shu Liu, Zhuotao Tian, Zhisheng Zhong, and Jiaya Jia. ResLT: Residual learning for long-tailed recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313-318, 2012.
Chengjian Feng, Yujie Zhong, and Weilin Huang. Exploring classification equilibrium in long-tailed object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3417-3426, 2021.
Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2022.
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
Hao Guo and Song Wang. Long-tailed multi-label visual recognition by collaborative training on uniform and re-balanced samplings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15089-15098, 2021.
Jie Hu, Liujuan Cao, Tong Tong, Qixiang Ye, Shengchuan Zhang, Ke Li, Feiyue Huang, Ling Shao, and Rongrong Ji. Architecture disentanglement for deep neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 672-681, 2021.
Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.
Jaehyung Kim, Jongheon Jeong, and Jinwoo Shin. M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13896-13905, 2020.
Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, and Christos Thrampoulidis. Label-imbalanced and group-sensitive classification under overparameterization. Advances in Neural Information Processing Systems, 34:18970-18983, 2021.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50, 2002.
Mengke Li, Yiu-ming Cheung, and Yang Lu. Long-tailed visual recognition via gaussian clouded logit adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6929-6938, 2022.
Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, and Jiashi Feng. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10991-11000, 2020.
Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878-18890, 2021a.
Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, and Tengyu Ma. Self-supervised learning is more robust to dataset imbalance. arXiv preprint arXiv:2110.05025, 2021b.
Xuantong Liu, Jianfeng Zhang, Tianyang Hu, He Cao, Yuan Yao, and Lujia Pan. Inducing neural collapse in deep long-tailed learning. In International Conference on Artificial Intelligence and Statistics, pp. 11534-11544. PMLR, 2023.
Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
2537-2546, 2019.
Yanbiao Ma, Licheng Jiao, Fang Liu, Yuxin Li, Shuyuan Yang, and Xu Liu. Delving into semantic scale imbalance. The Eleventh International Conference on Learning Representations, 2023.
Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in pareto optimization. In International Conference on Machine Learning, pp. 6597-6607. PMLR, 2020.
Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314, 2020.
Seulki Park, Jongin Lim, Younghan Jeon, and Jin Young Choi. Influence-balanced loss for imbalanced visual classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 735-744, 2021.
Harsh Rangwani, Sumukh K Aithal, Mayank Mishra, et al. Escaping saddle points for effective generalization on class-imbalanced data. Advances in Neural Information Processing Systems, 35:22791-22805, 2022.
Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in Neural Information Processing Systems, 33:4175-4186, 2020.
Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.
Saptarshi Sinha and Hiroki Ohashi. Difficulty-net: Learning to predict difficulty for long-tailed recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6444-6453, 2023.
Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11662-11671, 2020.
Jingru Tan, Bo Li, Xin Lu, Yongqiang Yao, Fengwei Yu, Tong He, and Wanli Ouyang.
The equalization losses: Gradient-driven training for long-tailed object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769-8778, 2018.
Jialei Wang, Peilin Zhao, and Steven CH Hoi. Cost-sensitive online classification. IEEE Transactions on Knowledge and Data Engineering, 26(10):2425-2438, 2013.
Jianfeng Wang, Thomas Lukasiewicz, Xiaolin Hu, Jianfei Cai, and Zhenghua Xu. RSG: A simple but effective module for learning imbalanced datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3784-3793, 2021.
Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, and Jiashi Feng. The devil is in classification: A simple framework for long-tail instance segmentation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV, pp. 728-744. Springer, 2020a.
Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X Yu. Long-tailed recognition by routing diverse distribution-aware experts. arXiv preprint arXiv:2010.01809, 2020b.
Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, and Junjie Yan. Dynamic curriculum learning for imbalanced data classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5017-5026, 2019.
Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, and Fan Yang. CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10857-10866, 2021.
Yuzhe Yang and Zhi Xu. Rethinking the value of labels for improving class-imbalanced learning.
Advances in Neural Information Processing Systems, 33:19290-19301, 2020.
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824-5836, 2020.
Yuhang Zang, Chen Huang, and Chen Change Loy. FASA: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3457-3466, 2021.
Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2361-2370, 2021a.
Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. arXiv preprint arXiv:2110.04596, 2021b.
Yifan Zhang, Bryan Hooi, Lanqing Hong, and Jiashi Feng. Self-supervised aggregation of diverse experts for test-agnostic long-tailed recognition. Advances in Neural Information Processing Systems, 35:34077-34090, 2022.
Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16489-16498, 2021.
Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719-9728, 2020.
Zhipeng Zhou, Lanqing Li, Peilin Zhao, Pheng-Ann Heng, and Wei Gong. Class-conditional sharpness-aware minimization for deep long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3499-3509, June 2023.
Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, and Yu-Gang Jiang.
Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6908-6917, 2022.

Under review as a conference paper at ICLR 2024
APPENDIX FOR: PARETO DEEP LONG-TAILED RECOGNITION: A CONFLICT-AVERSE SOLUTION
Anonymous authors. Paper under double-blind review.

1 Theoretical Proofs
1.1 Proof of Theorem 4.1
1.2 Proof of Theorem 4.2
2 Details of Implementations
2.1 DLTR Methods
2.2 MOO-based MTL Methods
2.3 Datasets
3 Open Long-Tailed Recognition Evaluation
4 Additional Experiment Results
4.1 Identify Optimization Conflict under Various Conditions
4.2 Gradient Conflicts in the Early Training Stage
4.3 Frequency vs. Loss Value
4.4 Gradient Conflicts Status of Dynamic Re-weighting Approach
4.5 Gradient Norm Examination
4.6 Gradient Similarity Examination after PLOT Augmentation
4.7 Apply PLOT on Different Layers
4.8 Gradient Conflicts Status under the Balanced Setting
5 Limitation
6 Computational Complexity and Scalability Analysis
6.1 Computational Complexity
6.2 Scalability
7 Visualizations
7.1 Feature Representation Analysis
7.2 Loss Landscape of DLTR Models
7.3 Embedding Visualization
7.4 Canonical Correlation Analysis
8 Detailed Results on Mainstream Datasets

1 THEORETICAL PROOFS
1.1 PROOF OF THEOREM 4.1

Theorem 1 (MOO-based DLTR Generalization Bound). If the loss function $l_k$ belonging to the $k$-th category is $M_k$-Lipschitz, and for all $(x, y), (x', y') \in \mathcal{X} \times \mathcal{Y}$ and $h \in \mathcal{H}$: $\|[h(x), y] - [h(x'), y']\| \le D_{\mathcal{H}}$, and $M_k D_{\mathcal{H}}$ is bounded by $M$, then for any $\epsilon > 0$ and $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all $h \in \mathcal{H}$ and $\omega \in \mathcal{W}$:
$$L_\omega(h) \le \hat{L}_\omega(h) + 2\mathfrak{R}_S(\mathcal{G}, \omega) + M \sqrt{\frac{\log(1/\delta)}{2} \sum_{k=1}^{K} \frac{\omega_k^2}{m_k}}.$$

Proof. For any $\omega \in \mathcal{W}$ and sample $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, let $\Phi(S) = \sup_{h \in \mathcal{H}} L_\omega(h) - \hat{L}_\omega(h)$, and let $S'$ be another sample set that differs from $S$ in exactly one point $(x', y')$, belonging, say, to category $k$. Then:
$$\Phi(S') - \Phi(S) = \sup_{h \in \mathcal{H}}\big[L_\omega(h) - \hat{L}'_\omega(h)\big] - \sup_{h \in \mathcal{H}}\big[L_\omega(h) - \hat{L}_\omega(h)\big] \le \sup_{h \in \mathcal{H}}\big[\hat{L}_\omega(h) - \hat{L}'_\omega(h)\big]$$
$$= \sup_{h \in \mathcal{H}} \frac{\omega_k}{m_k}\big[l(x', y') - l(x_i, y_i)\big] \le \frac{\omega_k}{m_k} M_k \big\|[h(x), y] - [h(x'), y']\big\| \le \frac{\omega_k M_k D_{\mathcal{H}}}{m_k}.$$
Applying McDiarmid's inequality with these bounded differences (and $M_k D_{\mathcal{H}} \le M$, summed over the $m_k$ points of each category), for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in \mathcal{H}$:
$$L_\omega(h) \le \hat{L}_\omega(h) + \mathbb{E}\Big[\sup_{h \in \mathcal{H}} L_\omega(h) - \hat{L}_\omega(h)\Big] + M\sqrt{\frac{\log(1/\delta)}{2} \sum_{k=1}^{K} \frac{\omega_k^2}{m_k}} \le \hat{L}_\omega(h) + 2\mathfrak{R}_S(\mathcal{G}, \omega) + M\sqrt{\frac{\log(1/\delta)}{2} \sum_{k=1}^{K} \frac{\omega_k^2}{m_k}}.$$

1.2 PROOF OF THEOREM 4.2
Pseudo-code of CAGrad:

Algorithm 1: Training Paradigm of CAGrad
Input: initial model parameter $\theta$, differentiable loss functions $L_i(\theta)$, $i \in [K]$, a constant $c \in [0, 1]$, and learning rate $\alpha \in \mathbb{R}^+$.
Output: model trained with CAGrad.
while not converged do
    At the $t$-th optimization step, define $g_0 = \frac{1}{K} \sum_{i=1}^{K} \nabla_\theta L_i(\theta_{t-1})$ and $\phi = c^2 \|g_0\|^2$.
    Solve $\min_{\omega \in \mathcal{W}} F(\omega) := g_\omega^\top g_0 + \sqrt{\phi}\,\|g_\omega\|$, where $g_\omega = \sum_{i=1}^{K} \omega_i \nabla_\theta L_i(\theta_{t-1})$.
    Update $\theta_t = \theta_{t-1} - \alpha \big( g_0 + \frac{\sqrt{\phi}}{\|g_\omega\|} g_\omega \big)$.
end while

Theorem 2 (Convergence of CAGrad in DLTR). With a fixed step size $\alpha$ and the assumption of $H$-Lipschitz gradients, i.e., $\|\nabla L_i(\theta) - \nabla L_i(\theta')\| \le H \|\theta - \theta'\|$ for $i = 1, 2, \ldots, K$, denote $d^*(\theta_t)$ as the optimization direction of CAGrad at step $t$. Then we have:
$$L(\theta_{t+1}) - L(\theta_t) \le -\frac{\alpha}{2}(1 - c^2)\|g_0(\theta_t)\|^2 + \frac{\alpha}{2}(H\alpha - 1)\|d^*(\theta_t)\|^2.$$

Proof.
$$L(\theta_{t+1}) - L(\theta_t) = L(\theta_t - \alpha d^*(\theta_t)) - L(\theta_t) \le -\alpha\, g_0(\theta_t)^\top d^*(\theta_t) + \frac{H\alpha^2}{2} \|d^*(\theta_t)\|^2$$
$$= -\frac{\alpha}{2} \big( \|g_0(\theta_t)\|^2 + \|d^*(\theta_t)\|^2 - \|g_0(\theta_t) - d^*(\theta_t)\|^2 \big) + \frac{H\alpha^2}{2} \|d^*(\theta_t)\|^2$$
$$\le -\frac{\alpha}{2} \big( \|g_0(\theta_t)\|^2 + \|d^*(\theta_t)\|^2 - c^2 \|g_0(\theta_t)\|^2 \big) + \frac{H\alpha^2}{2} \|d^*(\theta_t)\|^2$$
$$= -\frac{\alpha}{2}(1 - c^2)\|g_0(\theta_t)\|^2 + \frac{\alpha}{2}(H\alpha - 1)\|d^*(\theta_t)\|^2,$$
where the second inequality uses the CAGrad constraint $\|d^*(\theta_t) - g_0(\theta_t)\| \le c \|g_0(\theta_t)\|$.

2 DETAILS OF IMPLEMENTATIONS
We conduct all experiments according to the publicly released code of each method where applicable; please refer to those codebases for more details. Our early-stop hyper-parameter E is selected from {10, 30, 50, 80}, while the anticipating worst-case optimization hyper-parameter ρ is searched over {1.0e-3, 1.0e-4, 1.0e-5}.

2.1 DLTR METHODS
LDAM-DRW. LDAM-DRW re-balances the model via logit adjustment. It enforces theoretically derived, class-frequency-related margins to achieve cost-sensitive learning. Its publicly released code can be found at https://github.com/kaidic/LDAM-DRW.
Balanced Softmax. Balanced Softmax is another cost-sensitive learning approach, which re-formulates the softmax function with a combination of a link function and Bayesian inference. Its publicly released code can be found at https://github.com/jiawei-ren/BalancedMetaSoftmax-Classification.
M2m. M2m generates adversarial samples from major to minor classes and thus re-balances via augmentation. Its publicly released code can be found at https://github.com/alinlab/M2m.
cRT + Mixup. cRT is a milestone decoupling method, which re-trains the classifier with a balanced sampling strategy in the second stage.
Although the official implementation is available, it does not include a version that utilizes ResNet-32 and is evaluated on CIFAR10-/CIFAR100-LT, which are the mainstream protocols. Therefore, we have re-implemented the method and achieved performance similar to that reported in GCL. In our implementation, we adopt the same learning rate decay strategy as GCL, which multiplies the learning rate by 0.1 after the 160th and 180th epochs.
MiSLAS. MiSLAS follows the decoupling regime and adopts a class-frequency-related label smoothing operation to improve both accuracy and calibration. Its publicly released code can be found at https://github.com/dvlab-research/MiSLAS.
GCL. Likewise, GCL is a pioneering two-stage method, which observes the problem of softmax saturation and proposes to tackle it by Gaussian perturbation of different class logits with varied amplitudes. Its publicly released code can be found at https://github.com/Keke921/GCLLoss.
Difficulty-Net. As a dynamic re-balancing method, Difficulty-Net employs meta-learning to learn the adjustment of class re-weighting from logits. Its publicly released code can be found at https://github.com/hitachi-rd-cv/Difficulty_Net.
BBN. BBN (Zhou et al., 2020) takes care of both representation learning and classifier learning by equipping two branches with a novel cumulative learning strategy. Its publicly released code can be found at https://github.com/megvii-research/BBN.
PaCo. PaCo (Cui et al., 2021) introduces a set of parametric class-wise learnable centers to re-balance from an optimization perspective, addressing the bias towards high-frequency classes in the supervised contrastive loss. Its publicly released code can be found at https://github.com/dvlab-research/Parametric-Contrastive-Learning.
BCL. BCL Zhu et al.
(2022) proposes class-averaging and class-complement methods to help form a regular simplex for representation learning in supervised contrastive learning. Its publicly released code can be found at https://github.com/FlamieZhu/Balanced-Contrastive-Learning.

2.2 MOO-BASED MTL METHODS
MGDA. MGDA is a classical baseline for MOO-based MTL. This approach is particularly appealing due to its ability to guarantee convergence to a Pareto-stationary point under mild conditions. Building upon this foundation, MGDA-UB introduces an upper bound on the multi-objective loss, aiming to optimize it and thereby achieve the Pareto-optimal solution. In practice, it performs task weighting based on the Frank-Wolfe algorithm (Jaggi, 2013). We conduct evaluations with the re-implementation in the publicly released code of CAGrad.
EPO. Different from general MOO-based MTL, EPO provides a preference-specific MOO framework, which can effectively find the expected solution on the Pareto front from the Pareto set by carefully controlled ascent that traverses the Pareto front in a principled manner. Generally, the re-balancing purpose also implies a preference for MOO. Its publicly released code can be found at https://github.com/dbmptr/EPOSearch.
CAGrad. CAGrad improves MGDA mainly through worst-case optimization and a convergence guarantee. It strikes a balance between Pareto optimality and global convergence by regulating the combined gradient in proximity to the average gradient:
$$\max_{d \in \mathbb{R}^m} \min_{\omega \in \mathcal{W}} g_\omega^\top d \quad \text{s.t.} \quad \|d - g_0\| \le c \|g_0\|, \tag{1}$$
where $d$ represents the combined gradient, $g_0$ denotes the averaged gradient, and $c$ is a hyper-parameter. Its publicly released code can be found at https://github.com/Cranial-XIX/CAGrad.
Why do we choose CAGrad as the baseline? CAGrad is widely recognized as a robust baseline in MOO-based MTL. In contrast to MGDA, which consistently favors individuals with smaller gradient norms, CAGrad achieves a delicate balance between Pareto optimality and global convergence.
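For intuition, the objective in Eq. (1) can be made concrete with a small numeric sketch (ours, not the authors' released implementation): for two objectives, the inner problem over the simplex weight is solved by naive grid search, and the constraint ‖d − g0‖ ≤ c‖g0‖ then holds by construction of the update direction.

```python
import numpy as np

def cagrad_direction(grads, c=0.5, n_grid=1001):
    """Illustrative CAGrad direction for exactly two task gradients.

    grads: (2, d) array of per-task gradients.
    Minimizes F(w) = g_w^T g_0 + sqrt(phi) * ||g_w|| over the 1-simplex
    by grid search (the original method uses a generic solver for K tasks).
    """
    g0 = grads.mean(axis=0)            # average gradient
    phi = (c ** 2) * (g0 @ g0)         # squared constraint radius
    best_w, best_f = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, n_grid):
        gw = w * grads[0] + (1.0 - w) * grads[1]
        f = gw @ g0 + np.sqrt(phi) * np.linalg.norm(gw)
        if f < best_f:
            best_f, best_w = f, w
    gw = best_w * grads[0] + (1.0 - best_w) * grads[1]
    norm = np.linalg.norm(gw)
    # d - g0 is a rescaled g_w, so ||d - g0|| = sqrt(phi) = c * ||g0||
    return g0 + (np.sqrt(phi) / norm) * gw if norm > 0 else g0
```

Note that because the correction term is rescaled to length sqrt(phi), the returned direction always sits exactly on the constraint boundary, matching the geometry described above.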
This unique characteristic allows CAGrad to preserve the Pareto property while maximizing individual progress. Conversely, EPO necessitates manual tuning of the preference hyper-parameter, which plays a crucial role in its performance but proves challenging to optimize in practical scenarios, particularly for classification tasks with a large number of categories. In comparison, CAGrad requires less effort in terms of hyper-parameter tuning.

2.3 DATASETS
CIFAR10-/CIFAR100-LT. CIFAR10-/CIFAR100-LT are subsets of CIFAR10/CIFAR100, formed by sampling from the original 50,000 training images to create a long-tailed distribution. In our evaluation, we set the imbalance ratio $\beta = n_{\max}/n_{\min}$ to {200, 100, 50}, where $n_{\max}$ and $n_{\min}$ represent the sample numbers of the most and least frequent classes, respectively.
Places-LT & ImageNet-LT. Places-LT and ImageNet-LT are long-tailed variants of Places-365 and ImageNet, respectively. Places-LT comprises a total of 62.5K training images distributed across 365 classes, resulting in an imbalance ratio of 996. Similarly, ImageNet-LT consists of 115.8K training images spanning 1000 classes, with an imbalance ratio of 256.
iNaturalist 2018. iNaturalist 2018 is a naturally occurring long-tailed classification dataset that comprises 437.5K training images distributed across 8142 categories, resulting in an imbalance ratio of 512. In our evaluations, we follow the official split.

3 OPEN LONG-TAILED RECOGNITION EVALUATION
To further demonstrate the robustness of the representations learned by PLOT, we conduct an evaluation known as open long-tailed recognition (OLTR) (Liu et al., 2019) on the Places-LT and ImageNet-LT datasets. The results are presented in Table 1. Based on the comparison of F-measures, our method achieves state-of-the-art performance in OLTR.
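The long-tailed splits described in Section 2.3 are controlled by the imbalance ratio β = n_max/n_min. A minimal sketch of the per-class counts, assuming the exponential class-size profile that is standard for CIFAR-LT constructions (Cao et al., 2019) rather than stated explicitly here:

```python
import numpy as np

def longtailed_counts(n_max, num_classes, beta):
    """Per-class sample counts under an exponential long-tailed profile.

    n_max: samples in the most frequent class.
    beta:  imbalance ratio n_max / n_min.
    The exponential decay n_k = n_max * beta^(-k / (K-1)) is an assumption
    (the common CIFAR-LT recipe), not this paper's exact sampler.
    """
    k = np.arange(num_classes)
    counts = n_max * beta ** (-k / (num_classes - 1))
    return np.floor(counts).astype(int)
```

For example, with n_max = 5000, 10 classes, and β = 200, the head class keeps 5000 images and the tail class keeps 25.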
It is worth noting that we employ a cosine similarity measurement between incoming representations and the prototypes proposed by Liu et al. (2019), which allows us to compete favorably against bells-and-whistles OLTR methods (e.g., LUNA (Cai et al., 2022)). This highlights the superiority of the mechanism proposed by PLOT for representation learning.

Table 1: Open long-tailed recognition (F-measure) on Places-LT and ImageNet-LT.
Method      | CE    | Lifted Loss | Focal Loss | Range Loss | OpenMax | OLTR  | IEM   | LUNA  | CC-SAM | cRT + Mixup + PLOT
ImageNet-LT | 0.295 | 0.374       | 0.371      | 0.373      | 0.368   | 0.474 | 0.525 | 0.579 | 0.552  | 0.563
Places-LT   | 0.366 | 0.459       | 0.453      | 0.457      | 0.458   | 0.464 | 0.486 | 0.491 | 0.510  | 0.516

4 ADDITIONAL EXPERIMENT RESULTS
4.1 IDENTIFY OPTIMIZATION CONFLICT UNDER VARIOUS CONDITIONS
Generally, the optimization objectives of samples from different classes tend to exhibit some conflicts, as they must learn class-specific features in addition to the features shared across classes. With a balanced training set, this conflict does not significantly impact the simultaneous optimization of individual classes. In an imbalanced scenario, however, model optimization becomes dominated by the majority classes, exacerbating the conflict: the performance of the minority classes is sacrificed in favor of optimizing the majority classes, which hinders the effective learning of category-sharing features. Importantly, this problem is intuitively independent of hyper-parameters, optimizers, and other related factors. To validate the aforementioned hypothesis, we conduct an extensive investigation into the gradient conflict and dominated-categories issues by thoroughly examining LDAM-DRW across various experimental settings, encompassing diverse mini-batch sizes, learning rates, and optimizers. Our findings are shown in Fig. 1 and Fig. 2. Notably, Fig.
1 provides a snapshot of the instantaneous gradient conflict status at an early stage, while Fig. 2 illustrates the continuous gradient similarity status throughout the optimization process. It is important to emphasize that Figs. 2 (a) and (b) are presented on a larger scale in their axes, resulting in a less apparent discrepancy; these subfigures do not differ significantly from the others when placed on the same scale.

Figure 1: Gradient conflict status of LDAM-DRW under various conditions. Each sub-figure is named as (Optimizer)-(Batch-Size)-(Learning-Rate): (a) SGD-128-0.1, (b) Adam-128-0.1, (c) SGD-256-0.1, (d) SGD-512-0.1, (e) SGD-128-0.01, (f) SGD-128-1e-3.
Figure 2: Continual gradient similarity status of LDAM-DRW under various conditions. Each sub-figure is named as (Optimizer)-(Batch-Size)-(Learning-Rate): (a) Adam-128-0.1, (b) Adam-128-1e-3, (c) SGD-256-0.1, (d) SGD-512-0.1, (e) SGD-128-0.01, (f) SGD-128-1e-3.

As anticipated, we observe the presence of the gradient conflict and dominated-categories issues in all tested scenarios. These findings validate the universality of the gradient conflict issue. If deemed applicable, we are fully committed to providing additional evidence and conducting further analysis in the updated version to reinforce the robustness of our findings.
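The conflict matrices visualized in Fig. 1 reduce to pairwise cosine similarities between flattened per-class gradients, where a negative entry marks a conflicting pair. A minimal sketch with synthetic stand-in gradients (the real ones come from per-class backward passes):

```python
import numpy as np

def conflict_matrix(class_grads):
    """Pairwise cosine similarities between per-class gradient vectors.

    class_grads: (K, d) array, one flattened gradient per class.
    Returns a symmetric (K, K) matrix; negative entries indicate
    conflicting class pairs.
    """
    g = np.asarray(class_grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g_unit = g / np.clip(norms, 1e-12, None)  # guard against zero gradients
    return g_unit @ g_unit.T

# toy example: class 0 and class 2 pull in nearly opposite directions
grads = np.array([[1.0, 0.0],
                  [0.7, 0.7],
                  [-1.0, 0.1]])
sim = conflict_matrix(grads)
```

Scanning the sign pattern of such a matrix over early training steps is exactly the kind of diagnostic summarized across the sub-figures above.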
Figure 3: Gradient conflicts among categories at step 50 (a-f) and step 84 (g-l).

4.2 GRADIENT CONFLICTS IN THE EARLY TRAINING STAGE
Here we further present the existence of optimization conflicts at different steps in Fig. 3. As depicted, such conflicts persist for a long time in the early stage.

4.3 FREQUENCY VS. LOSS VALUE
In Fig. 4, we present our observations of the class-wise weighted loss, denoted as $\frac{|B_k|\, l(x_k, y_k)}{|B|}$, to compare the magnitudes of $|B_k|/|B|$ and $l(x_k, y_k)$. Our results indicate that Balanced Softmax, which functions as a re-balancing method by adjusting the loss function, is still dominated by the class frequency term $|B_k|/|B|$. Conversely, LDAM-DRW exhibits a balanced weighted loss for both head and tail classes, thereby demonstrating the effectiveness of its re-balancing strategy.

Figure 4: Class-wise weighted loss comparison on CIFAR10-LT when the imbalance ratio is set to 200 and the batch size is set to 64. (a) cRT + Mixup, (b) LDAM-DRW, (c) Balanced Softmax.

4.4 GRADIENT CONFLICTS STATUS OF DYNAMIC RE-WEIGHTING APPROACH
Figure 5: Optimization conflicts of Difficulty-Net at (a) epoch 0 and (b) epoch 30.
We further investigate the optimization conflicts associated with other dynamic re-weighting approaches to showcase our distinct research perspective.
In this regard, we employ Difficulty-Net as the baseline for a verification experiment, and the results are depicted in Fig. 5. It is evident that Difficulty-Net fails to address optimization conflicts during its early training stage, primarily because its design does not explicitly account for them.

4.5 GRADIENT NORM EXAMINATION
In this section, we expand our analysis by thoroughly investigating the phenomenon of gradient norm domination within mainstream DLTR approaches. To quantify this domination, we calculate the mean gradient of the corresponding category and present the results of our examination in Figure 6(a-f). All the examined approaches demonstrate a consistent pattern, mirroring the tendency observed in Figure 2 of the main text. Specifically, we observe that the tail class is extremely under-optimized, indicating the explicit dominance of specific categories during the optimization process. These findings provide strong support for the motivation underlying our proposed method, which aims to effectively mitigate such domination and its associated adverse effects. For comparative purposes, we also provide the PLOT-augmented gradient norm examinations in Figure 6(g-l). As expected, the results reveal that no individual categories exhibit explicit domination during the early stage of representation learning.

Figure 6: Gradient norm examination for mainstream DLTR approaches (a-f).
Norm Ratio is calculated between the individuals and the mean gradient; (g-l) are the PLOT-augmented counterparts.

4.6 GRADIENT SIMILARITY EXAMINATION AFTER PLOT AUGMENTATION

To substantiate the effectiveness of PLOT, which facilitates simultaneous progress for all individuals, we compute the cosine similarities between individual class gradients and the gradient derived from MOO. These cosine similarities are then compared with the corresponding vanilla results presented in Figure 2 of the main text. The comparative results are depicted in Figure 7. As observed, the DLTR methods augmented with PLOT exhibit more balanced cosine similarities, indicating that all individuals make more equitable progress.

Figure 7: Gradient similarity examination for PLOT-augmented mainstream DLTR approaches (panels include LDAM-DRW, Balanced Softmax, and MiSLAS). Cosine similarity is calculated between the individuals and the MOO-derived gradient.

4.7 APPLYING PLOT TO DIFFERENT LAYERS

Figure 8: Impact of applying PLOT to different layers (layer1, layer2, layer3) under varying imbalance ratios.

MOO algorithms are recognized to be computationally intensive, particularly when the number of tasks scales up. To address this issue, we propose applying PLOT to selected layers rather than the entire model. In this study, we examine the impact of different layers to demonstrate the robustness of our approach. Specifically, we apply PLOT to layer1, layer2, and layer3 of ResNet-32 and present the final performance in Figure 8. Our results indicate that the last layer is the optimal choice across various evaluations.
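Restricting MOO to a single layer keeps the per-class gradient set small enough to manipulate directly. As an illustrative sketch (not the paper's Eqn. 6 solver), the code below computes the pairwise cosine similarities used in the conflict diagnostics above, and a simple PCGrad-style de-conflicted average of class-wise gradients; the function names `cosine_matrix` and `deconflicted_mean` are our own, introduced only for this example.

```python
import numpy as np

def cosine_matrix(G):
    """Pairwise cosine similarities between per-class gradients.

    G holds one flattened gradient per row (one row per class);
    a negative off-diagonal entry marks a gradient conflict
    between two classes.
    """
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    return Gn @ Gn.T

def deconflicted_mean(G):
    """Average gradient after projecting out pairwise conflicts.

    This is a PCGrad-style illustration of conflict removal, not
    the paper's Eqn. 6 update; it only shows how a conflict-averse
    direction can be formed from class-wise gradients.
    """
    out = []
    for i in range(len(G)):
        g = G[i].astype(float).copy()
        for j in range(len(G)):
            if i == j:
                continue
            dot = g @ G[j]
            if dot < 0:  # conflict: remove the opposing component
                g -= dot / (G[j] @ G[j]) * G[j]
        out.append(g)
    return np.mean(out, axis=0)
```

On two conflicting gradients such as [1, 0] and [-1, 1], the de-conflicted mean has a non-negative inner product with both, so neither class is pushed backwards by the shared update.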
4.8 GRADIENT CONFLICTS STATUS UNDER THE BALANCED SETTING

We investigate the gradient conflict and similarity status in a balanced setting to motivate the integration of MOO and DLTR. To this end, we train ERM 1 with the full (roughly balanced) CIFAR10 dataset for 200 epochs and achieve a final accuracy of 92.84%. We record the class-wise gradients and compute their similarities. As shown in Fig. 9, gradients among categories also exhibit conflict cases but play a roughly equal role during optimization 2.

Figure 9: (a) Gradient conflicts and (b) gradient similarity of ERM.

5 LIMITATION

The limitations of PLOT can be attributed to two main factors. First, the approach incurs additional computational overhead due to multiple gradient backpropagations and MOO. To address this issue, we apply PLOT to selected layers rather than the entire model, which serves as an efficient alternative; additionally, for large-scale datasets we randomly sample a small set of categories to participate in MOO. Second, the early stopping mechanism is currently controlled by a hyper-parameter, which introduces additional tuning complexity. To simplify this process, we select the hyper-parameter from a fixed set of values.

1 We utilize WideResNets (Foret et al., 2020) as the backbone network without Shake-Shake regularization or additional augmentations. Therefore, our reported result is slightly lower than the generally reported accuracy. Notably, our focus is on observing the class-wise gradient status rather than realizing a re-implementation.
2 While this conclusion is straightforward, it is worth comparing with the DLTR results.

6 COMPUTATIONAL COMPLEXITY AND SCALABILITY ANALYSIS

6.1 COMPUTATIONAL COMPLEXITY

As noted in the Limitation section, our method is more complex than its baselines.
However, we would like to emphasize that our approach applies MOO only to a subset of the model parameters and only to the classes present in a mini-batch, which is far fewer than the full label set. This results in a lower time cost than expected. Additionally, we apply MOO only in the early stage of DLTR training, for shared feature extraction. For large-scale datasets such as ImageNet-LT, we further adopt a class sampling strategy to reduce the computation cost. We believe these optimizations mitigate the complexity of our approach while maintaining its effectiveness for long-tailed recognition. Here we provide a simple comparison of the time cost on a Tesla T4 GPU:

Table 2: Training time cost comparison on a Tesla T4.

Method       | CE   | LDAM-DRW | cRT + Mixup | cRT + Mixup + PLOT | LDAM-DRW + PLOT
GPU Time (h) | 0.62 | 0.68     | 0.65        | 0.83               | 0.86

6.2 SCALABILITY

As previous research has indicated (Cadena et al., 2018; Hu et al., 2021; Zimmermann et al., 2021; Liu et al., 2019), the parameters of backbone layers tend to be more stable in the later stages of training. Therefore, our temporal-style MOO learning paradigm is designed to address the gradient conflict problem that occurs during the early stages of representation learning. This approach enables compatibility with mainstream DLTR models. Here, we provide an example of how our approach can be integrated with Decoupling using pseudo code, as shown in Algorithm 2 (please refer to Eqns. 2, 3, 4 in the main text):

Algorithm 2: Representation Training Paradigm of Decoupling + PLOT
Input: Training dataset S ~ p_s(x, y) = p_s(x|y) p_s(y)
Output: Model trained with PLOT
Stage SFE:
  Initialize θ randomly
  while epoch ≤ E do
    foreach batch B_i in S do
      Compute the empirical loss L_S with B_i and obtain its gradient g_S
      Perturb θ with g_S according to Eqn. 3 and Eqn. 4
      Compute the perturbative loss L_SAM and the variability collapse loss L_vc according to Eqn. 2
      Estimate the class-specific gradient set G = {g_1, g_2, ..., g_k} with respect to L_moo = L_SAM + L_vc
      Update θ by solving Eqn. 6: θ ← θ − α (g_0 + φ^{1/2} g_w / ∥g_w∥)
Stage TSO:
  while epoch > E do
    foreach batch B_i in S do
      Compute the empirical loss L_S with B_i
      Update θ: θ ← θ − η ∇_θ L_S(θ)

7 VISUALIZATIONS

7.1 FEATURE REPRESENTATION ANALYSIS

More class-specific Hessian spectral analyses 3 are provided in Figs. 10 and 11, which align with the conclusion in the main text, i.e., PLOT leads to flatter minima for each class, thereby reducing the risk of getting stuck at saddle points.

Figure 10: Hessian spectrum analysis (density, log scale) of vanilla cRT + Mixup for Classes 1-8.

Figure 11: Hessian spectrum analysis (density, log scale) of PLOT-augmented cRT + Mixup for Classes 1-8.

7.2 LOSS LANDSCAPE OF DLTR MODELS

To further illustrate the unattainably small value of H in the DLTR models, we include additional loss landscape visualizations in Fig. 12. These visualizations reveal that the tail classes of the models exhibit sharp minima, with the exception of M2m. Additionally, we observe that PLOT displays flat minima in both head and tail classes.
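The sharpness compared in these spectra is commonly summarized by the dominant Hessian eigenvalue. Below is a minimal numerical sketch of that quantity, assuming a small, explicitly materialized symmetric Hessian; real networks estimate it with Hessian-vector products rather than a dense matrix, so this is illustration only.

```python
import numpy as np

def top_hessian_eigenvalue(H, iters=200, seed=0):
    """Estimate the dominant eigenvalue of a symmetric Hessian H
    by power iteration; larger values indicate sharper minima."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = H @ v
        v = w / np.linalg.norm(w)
    # Rayleigh quotient at the converged direction
    return float(v @ H @ v)
```

For a diagonal Hessian diag(5, 1, 0.1) this returns 5.0, the curvature along the sharpest direction; a flatter minimum corresponds to a smaller dominant eigenvalue.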
3 https://github.com/val-iisc/Saddle-LongTail

Figure 12: Loss landscapes (loss contours around the trained model) of head and tail classes in ERM, LDAM-DRW, M2m, MiSLAS, and PLOT.

7.3 EMBEDDING VISUALIZATION

In addition, we present visualizations of the extracted embeddings from both the vanilla and PLOT-augmented approaches by projecting them onto a 2D plane using t-SNE (Van der Maaten & Hinton, 2008). The corresponding visualizations can be found in Figure 13 (a)(b). As anticipated, the tail classes of the cRT + Mixup + PLOT approach exhibit increased separability compared to the vanilla approach. This observation suggests that incorporating PLOT enhances the representation of all categories, as intended.

Figure 13: (a) and (b) are the t-SNE visualization results for cRT + Mixup and cRT + Mixup + PLOT. (c) Representation similarity (CCA coefficients) between head and tail classes for the vanilla and PLOT variants.
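The separability visible in the t-SNE plots can also be quantified numerically. The sketch below computes a Fisher-style between/within-class scatter ratio on embedding vectors; this is an illustrative proxy we introduce here for the example, not a metric used in the paper.

```python
import numpy as np

def class_separability(embeddings, labels):
    """Between-class vs. within-class scatter ratio.

    A rough numeric proxy for the separability seen in t-SNE
    plots: higher values mean class centroids sit further apart
    relative to the spread of each class.
    """
    labels = np.asarray(labels)
    mu = embeddings.mean(axis=0)          # global centroid
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        X = embeddings[labels == c]
        mu_c = X.mean(axis=0)             # class centroid
        within += ((X - mu_c) ** 2).sum()
        between += len(X) * ((mu_c - mu) ** 2).sum()
    return between / within
```

Two well-separated clusters score much higher than two overlapping ones, which matches the visual impression that PLOT-augmented tail classes form tighter, more distinct groups.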
7.4 CANONICAL CORRELATION ANALYSIS

To further investigate the shared feature representation across categories in the early stage, we examine the Canonical Correlation Analysis (CCA) scores (Raghu et al., 2017) of representations between head and tail categories in CIFAR10-LT, as learned by both the standard and the PLOT-empowered LDAM-DRW. As illustrated in Fig. 13 (c), the PLOT-augmented version exhibits a greater degree of similarity among categories in comparison to the conventional model, thereby substantiating the efficacy of our conflict-averse solution.

8 DETAILED RESULTS ON MAINSTREAM DATASETS

To demonstrate the generalization of PLOT and its impact on different subsets, we provide more detailed results in this section. Specifically, we compare PLOT with two state-of-the-art DLTR approaches, namely BBN and PaCo, by augmenting them with PLOT. The consistent improvements observed in Table 4 indicate that PLOT can effectively enhance various types of DLTR approaches. Furthermore, we present detailed results on large-scale datasets in Table 3, which empirically illustrate that PLOT successfully augments medium and tail classes without significantly compromising the performance of the many-shot classes.

Table 3: Detailed Top-1 Accuracy on Places-LT, ImageNet-LT, and iNaturalist-2018.

Places-LT
Method | Backbone | Many | Medium | Few | Overall
CE | ResNet-152 | 45.7 | 27.3 | 8.2 | 30.2
Decouple-τ-norm (Kang et al., 2019) | ResNet-152 | 37.8 | 40.7 | 31.8 | 37.9
Balanced Softmax (Ren et al., 2020) | ResNet-152 | 42.0 | 39.3 | 30.5 | 38.6
LADE (Hong et al., 2021) | ResNet-152 | 42.8 | 39.0 | 31.2 | 38.8
RSG (Wang et al., 2021) | ResNet-152 | 41.9 | 41.4 | 32.0 | 39.3
DisAlign (Zhang et al., 2021) | ResNet-152 | 40.4 | 42.4 | 30.1 | 39.3
ResLT (Cui et al., 2022) | ResNet-152 | 39.8 | 43.6 | 31.4 | 39.8
GCL (Li et al., 2022) | ResNet-152 | - | - | - | 40.6
cRT + Mixup (Kang et al., 2019) | ResNet-152 | 44.1 | 38.5 | 27.1 | 38.1
cRT + Mixup + PLOT | ResNet-152 | 40.9 | 43.1 | 37.1 | 41.0
MiSLAS (Zhong et al., 2021) | ResNet-152 | 39.2 | 43.2 | 36.5 | 40.2
MiSLAS + PLOT | ResNet-152 | 39.2 | 43.5 | 36.9 | 40.5

ImageNet-LT
Method | Backbone | Many | Medium | Few | Overall
CE | ResNeXt-50 | 65.9 | 37.5 | 7.7 | 44.4
Decouple-τ-norm (Kang et al., 2019) | ResNet-50 | 56.6 | 44.2 | 27.4 | 46.7
Balanced Softmax (Ren et al., 2020) | ResNeXt-50 | 64.1 | 48.2 | 33.4 | 52.3
LADE (Hong et al., 2021) | ResNeXt-50 | 64.4 | 47.7 | 34.3 | 52.3
RSG (Wang et al., 2021) | ResNeXt-50 | 63.2 | 48.2 | 32.2 | 51.8
DisAlign (Zhang et al., 2021) | ResNet-50 | 61.3 | 52.2 | 31.4 | 52.9
DisAlign (Zhang et al., 2021) | ResNeXt-50 | 62.7 | 52.1 | 31.4 | 53.4
ResLT (Cui et al., 2022) | ResNeXt-50 | 63.0 | 50.5 | 35.5 | 52.9
LDAM-DRW + SAM (Rangwani et al., 2022) | ResNet-50 | 62.0 | 52.1 | 34.8 | 53.1
GCL (Li et al., 2022) | ResNet-50 | - | - | - | 54.9
cRT + Mixup (Kang et al., 2019) | ResNeXt-50 | 61.5 | 49.3 | 36.8 | 51.7
cRT + Mixup + PLOT | ResNeXt-50 | 62.6 | 52.0 | 39.4 | 54.3
MiSLAS (Zhong et al., 2021) | ResNet-50 | 61.7 | 51.3 | 35.8 | 52.7
MiSLAS + PLOT | ResNet-50 | 61.4 | 52.3 | 37.5 | 53.5
BCL (Zhu et al., 2022) | ResNet-50 | 65.7 | 53.7 | 37.3 | 56.0
BCL + PLOT | ResNet-50 | 67.4 | 54.4 | 38.8 | 57.2

iNaturalist-2018
Method | Backbone | Many | Medium | Few | Overall
CE | ResNet-50 | 72.2 | 63.0 | 57.2 | 61.2
Decouple-τ-norm (Kang et al., 2019) | ResNet-50 | 65.6 | 65.3 | 65.9 | 65.6
Balanced Softmax (Ren et al., 2020) | ResNet-50 | - | - | - | 70.6
LADE (Hong et al., 2021) | ResNet-50 | - | - | - | 70.0
RSG (Wang et al., 2021) | ResNet-50 | - | - | - | 70.3
DisAlign (Zhang et al., 2021) | ResNet-50 | - | - | - | 70.6
ResLT (Cui et al., 2022) | ResNet-50 | - | - | - | 70.2
LDAM-DRW + SAM (Rangwani et al., 2022) | ResNet-50 | 64.1 | 70.5 | 71.2 | 70.1
GCL (Li et al., 2022) | ResNet-50 | - | - | - | 72.0
cRT + Mixup (Kang et al., 2019) | ResNet-50 | 74.2 | 71.1 | 68.2 | 70.2
cRT + Mixup + PLOT | ResNet-50 | 74.2 | 72.5 | 69.4 | 71.3
MiSLAS (Zhong et al., 2021) | ResNet-50 | 73.2 | 72.4 | 70.4 | 71.6
MiSLAS + PLOT | ResNet-50 | 73.1 | 72.9 | 71.2 | 72.1

Table 4: Top-1 Accuracy on CIFAR datasets (columns: imbalance ratio 200 / 100 / 50 for CIFAR10-LT, then 200 / 100 / 50 for CIFAR100-LT).
Method | CIFAR10-LT 200 | 100 | 50 | CIFAR100-LT 200 | 100 | 50
LDAM-DRW | 71.38 | 77.71 | 81.78 | 36.50 | 41.40 | 46.61
LDAM-DRW + PLOT | 74.32 | 78.19 | 82.09 | 37.31 | 42.31 | 47.04
M2m | 73.43 | 77.55 | 80.94 | 35.81 | 40.77 | 45.73
M2m + PLOT | 74.48 | 78.42 | 81.79 | 38.43 | 43.00 | 47.19
cRT + Mixup | 73.06 | 79.15 | 84.21 | 41.73 | 45.12 | 50.86
cRT + Mixup + PLOT | 78.99 | 80.55 | 84.58 | 43.80 | 47.59 | 51.43
Logit Adjustment | - | 78.01 | - | - | 43.36 | -
Logit Adjustment + PLOT | - | 79.40 | - | - | 44.19 | -
BBN | 73.52 | 77.43 | 80.19 | 36.14 | 39.77 | 45.64
BBN + PLOT | 74.34 | 78.49 | 82.44 | 36.21 | 40.29 | 46.13
MiSLAS | 76.59 | 81.33 | 85.23 | 42.97 | 47.37 | 51.42
MiSLAS + PLOT | 77.73 | 81.88 | 85.70 | 44.28 | 47.91 | 52.66
GCL | 79.25 | 82.85 | 86.00 | 44.88 | 48.95 | 52.85
GCL + PLOT | 80.08 | 83.35 | 85.90 | 45.61 | 49.50 | 53.05
PaCo | - | - | - | 47.28 | 51.71 | 55.74
PaCo + PLOT | - | - | - | 47.75 | 52.60 | 56.61

REFERENCES

Santiago A. Cadena, Marissa A. Weis, Leon A. Gatys, Matthias Bethge, and Alexander S. Ecker. Diverse feature visualizations reveal invariances in early layers of deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217-232, 2018.

Jiarui Cai, Yizhou Wang, Hung-Min Hsu, Jenq-Neng Hwang, Kelsey Magrane, and Craig S. Rose. LUNA: Localizing unfamiliarity near acquaintance for open-set long-tailed recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 131-139, 2022.

Jiequan Cui, Zhisheng Zhong, Shu Liu, Bei Yu, and Jiaya Jia. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 715-724, 2021.

Jiequan Cui, Shu Liu, Zhuotao Tian, Zhisheng Zhong, and Jiaya Jia. ResLT: Residual learning for long-tailed recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.

Youngkyu Hong, Seungju Han, Kwanghee Choi, Seokjun Seo, Beomsu Kim, and Buru Chang. Disentangling label distribution for long-tailed visual recognition.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6626-6636, 2021.

Jie Hu, Liujuan Cao, Tong Tong, Qixiang Ye, Shengchuan Zhang, Ke Li, Feiyue Huang, Ling Shao, and Rongrong Ji. Architecture disentanglement for deep neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 672-681, 2021.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427-435. PMLR, 2013.

Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.

Mengke Li, Yiu-ming Cheung, and Yang Lu. Long-tailed visual recognition via Gaussian clouded logit adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6929-6938, 2022.

Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537-2546, 2019.

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30, 2017.

Harsh Rangwani, Sumukh K. Aithal, Mayank Mishra, et al. Escaping saddle points for effective generalization on class-imbalanced data. Advances in Neural Information Processing Systems, 35:22791-22805, 2022.

Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in Neural Information Processing Systems, 33:4175-4186, 2020.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of Machine Learning Research, 9(11), 2008.

Jianfeng Wang, Thomas Lukasiewicz, Xiaolin Hu, Jianfei Cai, and Zhenghua Xu. RSG: A simple but effective module for learning imbalanced datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3784-3793, 2021.

Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2361-2370, 2021.

Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16489-16498, 2021.

Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719-9728, 2020.

Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, and Yu-Gang Jiang. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6908-6917, 2022.

Roland S. Zimmermann, Judy Borowski, Robert Geirhos, Matthias Bethge, Thomas Wallis, and Wieland Brendel. How well do feature visualizations support causal understanding of CNN activations? Advances in Neural Information Processing Systems, 34:11730-11744, 2021.