Logit Perturbation

Mengyang Li 1,2, Fengguang Su 2, Ou Wu 2*, Ji Zhang 3
1 Jiuantianxia Inc., China
2 National Center for Applied Mathematics, Tianjin University, China
3 The University of Southern Queensland, Australia
limengyang99@gmail.com, {fengguangsu, wuou}@tju.edu.cn, Ji.Zhang@usq.edu.au

Features, logits, and labels are the three primary forms of data a sample passes through in a deep neural network. Feature perturbation and label perturbation have received increasing attention in recent years, and they have proven useful in various deep learning approaches. For example, (adversarial) feature perturbation can improve the robustness or even the generalization capability of learned models. However, few studies have explicitly explored the perturbation of logit vectors. This work discusses several existing methods related to logit perturbation. Based on a unified viewpoint that connects positive/negative data augmentation with the loss variations incurred by logit perturbation, a new method is proposed to explicitly learn to perturb logits. A comparative analysis is conducted between the perturbations used in our method and those in existing methods. Extensive experiments on benchmark image classification data sets and their long-tail versions indicate the competitive performance of our learning method. In addition, existing methods can be further improved by utilizing our method.

Introduction

There are several main paradigms (which may overlap) among the numerous deep learning studies, including new network architectures, new training losses, new training data perturbation schemes, and new learning strategies (e.g., weighting). Training data perturbation mainly refers to feature and label perturbations. In feature perturbation, many data augmentation tricks can be viewed as feature perturbation methods when the input is the raw feature (i.e., raw samples).
For example, cropped or rotated images can be seen as perturbed versions of the raw images in computer vision, and sentences with modified words can be seen as perturbed texts in text classification. Another well-known feature perturbation technique is the generation of adversarial training samples (Madry et al. 2018), which has attracted great attention in various AI applications, especially computer vision (Xie et al. 2020) and natural language processing (Jin et al. 2020). Adversarial samples are samples that can fool the learned models. They can be obtained by solving the following objective function:

x_adv = x + argmax_{||δ|| ≤ ϵ} l(f(x + δ), y),   (1)

where x is the input or a hidden feature, δ is the perturbation term, ϵ is the perturbation bound, y is the label, and x_adv is the generated adversarial sample. A number of methods have been proposed to optimize Eq. (1) (Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018). Adversarial samples can be used to train more robust models.

*Corresponding author. Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In label perturbation, the labels are modified or corrected to alleviate overfitting and label noise. For example, a popular yet simple training trick, label smoothing (Szegedy et al. 2016), generates a new label for each sample according to ỹ = y + λ(I/C − y), where C is the number of categories, I is the all-ones vector, λ(I/C − y) is the perturbation term, and λ is a hyper-parameter. Other methods, such as the bootstrapping loss (Reed et al. 2015), label correction (Patrini et al. 2017; Wang et al. 2021), and the meta label corrector (Wu et al. 2021b), can also be seen as types of label perturbation. Mixup (Zhang et al. 2018) can be attributed to the combination of feature and label perturbation.

Logit vectors (or logits) are the outputs of the final feature encoding layer in almost all deep neural networks (DNNs).
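As a concrete illustration of the label-smoothing perturbation ỹ = y + λ(I/C − y) described above, the following NumPy sketch applies it to a one-hot label; the function name and example values are ours, not from the paper.

```python
import numpy as np

def smooth_labels(y_onehot, lam):
    """Label smoothing as a label perturbation: y' = y + lam * (I/C - y).

    y_onehot : (N, C) one-hot label matrix
    lam      : smoothing strength (a hyper-parameter) in [0, 1]
    """
    C = y_onehot.shape[1]
    return y_onehot + lam * (np.ones_like(y_onehot) / C - y_onehot)

y = np.array([[0.0, 0.0, 1.0, 0.0]])   # one-hot label, C = 4
y_s = smooth_labels(y, lam=0.1)
# true class keeps 1 - lam + lam/C = 0.925; every other class gets lam/C = 0.025
```

The perturbed label still sums to one, so it remains a valid distribution over classes.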
Although logits are important in the DNN data pipeline, only a handful of learning methods in data augmentation and long-tail classification directly (without optimization) or implicitly employ class-level logit perturbation. An analysis of these methods shows that the loss variations incurred by logit perturbation are highly related to the purpose of positive/negative augmentation¹ of the training data. Accordingly, a new method is proposed in this study to learn a class-level logit perturbation (LPL). A comparative analysis is conducted for two classical methods and our LPL method. Extensive experiments are run on benchmark data sets, and the results show the competitiveness of our method. The contributions of our study are as follows:

- A new method is proposed to learn to perturb logits, which can be used in both data augmentation and long-tail classification contexts.
- Experimental results show that our method outperforms existing state-of-the-art methods related to logit perturbation in both contexts.
- Several classical methods are re-discussed in terms of logit perturbation and positive/negative augmentation. The differences between existing logit perturbation methods and ours are also analyzed.

¹ In this study, negative augmentation denotes augmentation that aims to reduce the (relative) performances of some categories. Accordingly, existing augmentation methods are positive.

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Related Work

Data Augmentation

Data augmentation is prevalent in almost all deep learning approaches. In its early stage, heuristic operations on raw samples were utilized, such as image flipping, image rotation, and word replacement in sentences. Recently, more advanced tricks have been investigated, such as mixup (Zhang et al. 2018) and semantic data augmentation (Wang et al. 2019).
In mixup, given a sample {x_1, y_1}, its perturbation term is {λ(x_2 − x_1), λ(y_2 − y_1)}, where λ is a random parameter (not a hyper-parameter) and {x_2, y_2} is another randomly selected sample. Hu et al. (2019) introduced reinforcement learning to automatically augment data. In this study, existing data augmentation is called positive data augmentation. Negative data augmentation, which is proposed in this study, may be helpful when we aim to restrain the (relative) performance of certain categories (e.g., to maintain fairness in some tasks).

Long-tail Classification

Real data usually conform to a skewed or even long-tail distribution. In long-tail classification, the proportions of tail samples are considerably small compared with those of head samples. Approaches to long-tail classification can be divided into two main strategies. The first is to design new network architectures; for example, Zhou et al. (2020) designed a bilateral-branch network to learn the representations of head and tail samples. The second is to modify the training loss. Here, the weighting scheme (Fan et al. 2017) is the most common practice: relatively larger weights are exerted on the losses of the tail samples. Besides weighting, some recent studies modify the logits to change the whole loss, such as logit adjustment (LA) (Menon et al. 2021). This new path achieves higher accuracies on benchmark data corpora compared with weighting (Wu et al. 2021a).

Methodology

This section first discusses several typical learning methods related to logit perturbation. The notation is defined as follows. Let S = {x_i, y_i}_{i=1}^N be a corpus of N training samples, where x_i is the input feature and y_i is the label. Let C be the number of categories and π_c = N_c/N be the proportion of samples of the cth category in S. Without loss of generality, we assume that π_1 > ⋯ > π_c > ⋯ > π_C.
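The mixup perturbation just described can be sketched in a few lines; drawing λ from a Beta(α, α) distribution follows the original mixup paper, and the function name and toy inputs are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Mixup as a joint feature/label perturbation:
    the perturbation of (x1, y1) is (lam*(x2 - x1), lam*(y2 - y1))."""
    lam = rng.beta(alpha, alpha)   # random per-pair coefficient, not a tuned hyper-parameter
    return x1 + lam * (x2 - x1), y1 + lam * (y2 - y1), lam

x_mix, y_mix, lam = mixup(np.zeros(3), np.array([1.0, 0.0]),
                          np.ones(3), np.array([0.0, 1.0]))
# x_mix equals lam * ones(3); y_mix equals [1 - lam, lam]
```

Because both the features and the labels are shifted by the same λ, the mixed label stays consistent with the mixed feature.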
Let u_i be the logit vector of x_i, obtained by u_i = f(x_i, W), where f(·, ·) is the deep neural network with parameters W. Let δ_i be the perturbation term for x_i. Let L be the training loss and l_i be the loss of x_i. The standard cross-entropy (CE) loss is used throughout the study.

Logit Perturbation in Existing Methods

Logit adjustment (LA) (Menon et al. 2021). This method is designed for long-tail classification and achieves competitive performance on benchmark data sets (Wu et al. 2021a). The loss employed in LA is defined as follows:

L = Σ_i l(softmax(u_i + δ_i), y_i) = −Σ_i log [exp(u_{i,y_i} + λ log π_{y_i}) / Σ_c exp(u_{i,c} + λ log π_c)].   (2)

In Eq. (2), the perturbation term δ_i is as follows:

δ_i = δ = λ[log π_1, ⋯, log π_c, ⋯, log π_C]^T,   (3)

where δ denotes that δ_i is a corpus-level vector; thus, the values of δ_i are identical for all the samples in the corpus S. Eq. (2) can be rewritten as follows:

L = −Σ_i log [exp(u_{i,y_i}) / Σ_c exp(u_{i,c} + λ(log π_c − log π_{y_i}))].   (4)

Previously, we assumed that π_1 > ⋯ > π_c > ⋯ > π_C; hence, the losses of the samples in the first (head) category are decreased, while those of the samples in the last (tail) category are increased. The loss variations of the remaining categories depend on the concrete loss of each sample.

Implicit semantic data augmentation (ISDA) (Wang et al. 2019). ISDA is a recently proposed data augmentation method. Given a sample x_i, ISDA assumes that each (virtual) new sample can be drawn from a distribution N(x_i, Σ_{y_i}), where Σ_{y_i} is the covariance matrix of the y_i-th category. With M (virtual) new samples for each sample, the loss becomes

L = (1/M) Σ_i Σ_{m=1}^{M} l(f(x_{i,m}, W), y_i),   (5)

where x_{i,m} is the mth (virtual) new sample for x_i. When M → +∞, the upper bound of the loss in Eq. (5) becomes

L = −Σ_i log [exp(u_{i,y_i}) / Σ_{c=1}^{C} exp(u_{i,c} + (λ/2)(w_c − w_{y_i})^T Σ_{y_i} (w_c − w_{y_i}))],   (6)

where w_c is the network parameter for the logit vectors and u_{i,c} = w_c^T x_i + b_c, with x_i the output of the last feature encoding layer.
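A minimal NumPy sketch of the logit-adjusted CE loss in Eqs. (2)-(4): each logit vector is shifted by the corpus-level term λ log π before the softmax. The helper name and toy numbers are ours, not the authors' released code.

```python
import numpy as np

def la_cross_entropy(u, y, priors, lam=1.0):
    """Logit-adjusted CE (Eq. 2): shift every logit vector by lam * log(pi)."""
    z = u + lam * np.log(priors)               # corpus-level perturbation, Eq. (3)
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

priors = np.array([0.7, 0.2, 0.1])             # head -> tail class proportions
u = np.array([[2.0, 1.0, 0.5]])                # one sample's logits
loss_tail  = la_cross_entropy(u, np.array([2]), priors)           # tail-class label
loss_plain = la_cross_entropy(u, np.array([2]), priors, lam=0.0)  # plain CE
# the adjustment increases the loss of this tail-class sample: loss_tail > loss_plain
```

Setting lam to zero recovers the standard CE loss, which makes the head-decrease/tail-increase effect easy to verify numerically.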
In contrast with previous data augmentation methods, ISDA does not actually generate new samples or features. In Eq. (6), there is an implicit perturbation term δ_i defined as follows:

δ_i = (λ/2)[(w_1 − w_{y_i})^T Σ_{y_i} (w_1 − w_{y_i}), ⋯, (w_C − w_{y_i})^T Σ_{y_i} (w_C − w_{y_i})]^T,   (7)

which is a category-level vector. Each element of δ_i is non-negative. Therefore, the new loss of each category under the perturbation in Eq. (7) is larger than the loss from the standard CE loss.

LDAM (Cao et al. 2019). In this method, the new loss is defined as

L = −Σ_i log [exp(u_{i,y_i} − C(π_{y_i})^{−1/4}) / (exp(u_{i,y_i} − C(π_{y_i})^{−1/4}) + Σ_{c≠y_i} exp(u_{i,c}))],   (8)

where C here denotes a constant. The perturbation term δ_i is as follows:

δ_i = δ_{y_i} = λ[0, ⋯, −C(π_{y_i})^{−1/4}, ⋯, 0]^T,   (9)

which is also a category-level vector. The losses of all categories are increased in LDAM.

Figure 1: The relative loss variations of the three methods for each category on different data sets.

Our Proposed Method LPL

The losses of the three example methods analyzed in the previous subsection can all be written as follows:

L = Σ_i l(softmax(u_i + δ_{y_i}), y_i).   (10)

All three methods directly infer the perturbation term without a numerical optimization approach. Logit perturbations result in loss variations. Fig. 1 shows statistics of the relative loss variations incurred by ISDA, LA, and LDAM for each category on a balanced data set (CIFAR100) and two long-tail sets (CIFAR10-LT and CIFAR100-LT), which are introduced in the experimental section. The loss variations of all categories are positive under ISDA. ISDA achieves the worst results on CIFAR100-LT (shown in the experimental part), indicating that non-tail-priority augmentation is ineffective in long-tail problems (ISDA achieves relatively better results on CIFAR10-LT). Only the curves on CIFAR100-LT are shown for LA and LDAM because similar trends are observed on CIFAR10-LT. Under LA, the loss variations of head categories are negative and those of tail categories are positive. Under LDAM, all variations are positive, yet there is an obvious increasing trend.
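The implicit ISDA perturbation of Eq. (7) is a per-class quadratic form, which can be computed directly; this is a sketch under our own naming, with a randomly generated positive semi-definite covariance standing in for the estimated class covariance Σ_y.

```python
import numpy as np

def isda_perturbation(W, Sigma_y, y, lam=0.5):
    """Implicit ISDA perturbation (Eq. 7):
    delta[c] = (lam/2) * (w_c - w_y)^T Sigma_y (w_c - w_y)."""
    d = W - W[y]                                  # (C, D): row c holds w_c - w_y
    return 0.5 * lam * np.einsum('cd,de,ce->c', d, Sigma_y, d)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 5))        # classifier weights: C = 4 classes, D = 5 features
A = rng.normal(size=(5, 5))
Sigma = A @ A.T                    # PSD stand-in for the class covariance
delta = isda_perturbation(W, Sigma, y=2)
# every element is non-negative, and the true-class entry is exactly zero
```

The non-negativity follows from Σ_y being positive semi-definite, which is exactly why ISDA increases the loss of every category, as noted in the text.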
We propose two conjectures based on the above observations and a unified data augmentation viewpoint:

- If one aims to positively augment the samples of a category, the loss of this category should be increased. The larger the loss increment, the greater the augmentation.
- If one aims to negatively augment the samples of a category, the loss of this category should be reduced. The larger the loss decrement, the greater the negative augmentation.

The two conjectures are supported by the three aforementioned methods. To handle a long-tail problem, LA should positively augment tail samples and negatively augment head samples; hence, the losses of tail categories are increased and those of head categories are decreased. ISDA aims to positively augment samples of all categories; thus, the losses of all categories are increased. LDAM aims to positively augment tail samples more than head samples; hence, the increments of tail categories are larger than those of head categories. On the basis of our conjectures, we establish the following new learning loss with logit perturbation:

L = Σ_{c∈Na} Σ_{x_i∈S_c} min_{||δ_c|| ≤ ϵ_c} l(softmax(u_i + δ_c), c) + Σ_{c∈Pa} Σ_{x_i∈S_c} max_{||δ_c|| ≤ ϵ_c} l(softmax(u_i + δ_c), c),   (11)

where ϵ_c is the perturbation bound related to the extent of augmentation, Na is the index set of categories that should be negatively augmented, Pa is the index set of categories that should be positively augmented, and S_c is the set of samples of the cth category. The loss maximization for the Pa categories is effectively category-level adversarial learning on the logits; the loss minimization for the Na categories is the opposite. Fig. 2 illustrates the calculation of the logit perturbation-based new loss in Eq. (11).

Figure 2: Overview of the logit perturbation-based new loss. Four solid circles denote four categories. Two categories are positively augmented via loss maximization, and the other two are negatively augmented via minimization.
The split of the category set (i.e., Na and Pa) and the definition (calculation) of ϵ_c are crucial for learning with Eq. (11). The category set split determines which categories should be positively or negatively augmented, while the value of ϵ_c determines the augmentation extent.

Category set split. The split depends on the specific learning task. Two common cases are explored in this study. The first case splits categories according to their performances. In this case, Eq. (11) becomes the following compact form:

L = Σ_c S(τ − q̄_c) Σ_{x_i∈S_c} max_{||δ_c|| ≤ ϵ_c} [l(softmax(u_i + δ_c), c) S(τ − q̄_c)],   (12)

where S(·) is the signum function (if a ≥ 0, then S(a) = 1), τ is a threshold, and q̄_c is calculated by

q̄_c = (1/N_c) Σ_{x_i∈S_c} q_{i,c},  q_{i,c} = 1 − exp(u_{i,c}) / Σ_{c'} exp(u_{i,c'}).   (13)

When τ = mean(q̄_c) = Σ_{c=1}^{C} q̄_c / C, Eq. (12) indicates that if the performance of a category is below the mean performance, it will be positively augmented; when the performance is above the mean, it will be negatively augmented. When τ > max_c {q̄_c}, all the categories will be positively augmented, as in ISDA.

The second case is specific to long-tail problems, and it splits categories according to the proportion order of each category. Eq. (11) becomes the following compact form:

L = Σ_c S(c − τ) Σ_{x_i∈S_c} max_{||δ_c|| ≤ ϵ_c} [l(softmax(u_i + δ_c), c) S(c − τ)],   (14)

where τ is a threshold on the category index.

Algorithm 1: Learning to Perturb Logits (LPL)
Input: S, τ, max iteration T, hyper-parameters for the PGD-like optimization, and other conventional training hyper-parameters.
1: Randomly initialize Θ.
2: for t = 0 to T do
3:   Sample a mini-batch from S.
4:   Update τ if it is not fixed (e.g., mean(q̄_c) is used) and split the category set.
5:   Compute ϵ_c for each category using Eq. (18) if varied bounds are used.
6:   Infer δ_c for each category using a PGD-like optimization method for Eq. (12) in balanced classification or Eq. (14) in long-tail classification.
7:   Update the logits of each sample and compute the new cross-entropy loss.
8:   Update Θ with SGD.
9: end for
Output: Θ
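The performance-based split can be sketched as follows. This follows our reading of the signum convention in Eq. (12), where S(τ − q̄_c) = 1 places a category in Pa; it reproduces the stated limiting case that all categories are positively augmented when τ exceeds every q̄_c. Function and variable names are ours.

```python
import numpy as np

def split_categories(probs, labels, C, tau=None):
    """Split categories into (Pa, Na) by performance, Eqs. (12)-(13).

    probs  : (N, C) softmax outputs
    labels : (N,) ground-truth class ids
    tau    : threshold; defaults to mean(q_bar) as in the paper
    """
    q_bar = np.array([                      # per-category performance measure, Eq. (13)
        (1.0 - probs[labels == c, c]).mean() for c in range(C)
    ])
    if tau is None:
        tau = q_bar.mean()                  # tau = mean(q_bar)
    s = np.where(tau - q_bar >= 0, 1, -1)   # signum S(tau - q_bar_c)
    Pa = np.where(s == 1)[0]                # positively augmented: maximize loss
    Na = np.where(s == -1)[0]               # negatively augmented: minimize loss
    return Pa, Na

probs = np.array([[0.9, 0.05, 0.05],
                  [0.2, 0.6,  0.2 ],
                  [0.3, 0.3,  0.4 ]])
labels = np.array([0, 1, 2])
Pa, Na = split_categories(probs, labels, C=3)        # q_bar = [0.1, 0.4, 0.6]
Pa_all, Na_all = split_categories(probs, labels, C=3, tau=2.0)  # tau > max(q_bar)
```

With the default mean threshold the first category falls in Pa and the other two in Na; with a sufficiently large τ, all categories are positively augmented, matching the ISDA-like special case described in the text.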
In Eq. (14), the tail categories lie in Pa and will be positively augmented. Eqs. (12) and (14) can be solved with an optimization approach similar to PGD (Madry et al. 2018). First, we have

∂l(softmax(u_i + δ_{y_i}), y_i) / ∂δ_{y_i} |_{δ_{y_i}=0} = softmax(u_i) − ŷ_i,   (15)

where ŷ_i is the one-hot vector of y_i. In the maximization of Eqs. (12) and (14), δ_{y_i} is updated by

δ_{y_i} = (λ / N_{y_i}) Σ_{j: y_j = y_i} (softmax(u_j) − ŷ_j),   (16)

where λ is a step-size hyper-parameter. In the minimization part, δ_{y_i} is updated by

δ_{y_i} = −(λ / N_{y_i}) Σ_{j: y_j = y_i} (softmax(u_j) − ŷ_j).   (17)

Bound calculation. A category with a relatively low/high performance should be more positively/negatively augmented, and a category closer to the tail/head should be more positively/negatively augmented. We define

ϵ_c = ϵ + Δϵ |τ − q̄_c|,  or  ϵ_c = { ϵ + Δϵ (q̄_c / q̄_1)  if c ≤ τ;  ϵ + Δϵ (q̄_C / q̄_c)  if c > τ }.   (18)

In Eq. (18), the larger the difference between the performance (q̄_c) of the current category and the threshold τ, or the larger the ratio q̄_c/q̄_1 (or q̄_C/q̄_c), the larger the bound ϵ_c. This is in accordance with our previous conjectures. When Δϵ in Eq. (18) equals zero, the bound is fixed. The algorithmic steps of our LPL are given in Algorithm 1.

Figure 3: Illustrative example for ISDA. Both categories are positively augmented (new samples are virtually generated) according to feature distributions.

Figure 4: Illustrative example for LPL. Samples near the classification boundary are virtually generated or deleted.

Comparative Analysis

This subsection compares the perturbations in ISDA and our LPL in terms of data augmentation. In ISDA's rationale, new samples are (virtually rather than actually) generated based on the distribution of each category. Fig. 3 shows the samples (virtually) generated by ISDA. In the right case, the positive augmentation of the head category may further hurt the performance of the tail category; ISDA thus fails in the long-tail setting. Li et al. (2021) leveraged meta learning to adapt ISDA to the long-tail problem.
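The class-level updates of Eqs. (15)-(17) reduce to averaging the softmax gradient over a class's samples and stepping along (or against) it. A small sketch, with names and toy logits of our own choosing:

```python
import numpy as np

def softmax(u):
    z = u - u.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lpl_step(U, y_onehot, lam, maximize=True):
    """One class-level update of delta (Eqs. 16-17).

    The CE gradient w.r.t. the perturbation is softmax(u) - y_hat (Eq. 15);
    averaging it over the class's samples gives the shared class-level step.
    """
    g = (softmax(U) - y_onehot).mean(axis=0)   # (1/N_y) * sum_j (softmax(u_j) - y_hat_j)
    return lam * g if maximize else -lam * g

U = np.array([[2.0, 0.5, 0.1], [1.5, 0.2, 0.3]])   # logits of two class-0 samples
y = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
delta_up = lpl_step(U, y, lam=0.5, maximize=True)
# the ascent step raises the class's mean CE loss; the descent step lowers it
```

The same averaged gradient drives both directions, so positive and negative augmentation differ only in the sign of the step.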
In contrast with the above-mentioned methods, our proposed LPL conducts positive or negative augmentation through the directions of loss maximization and minimization. Theoretically, loss maximization forces a category to move close to the decision boundary (i.e., the category is positively augmented, or virtual samples are generated for it). By contrast, loss minimization forces a category away from the boundary (i.e., the category is negatively augmented, or samples are virtually deleted from it). Fig. 4 shows an illustrative example.

Experiments

Our proposed LPL is first evaluated on data augmentation and long-tail classification tasks. The properties of LPL are then analyzed with further experiments. All the code is available online².

² https://github.com/limengyang1992/lpl

Experiments on Data Augmentation

Datasets and competing methods. In this subsection, two benchmark image classification data sets, namely, CIFAR10 and CIFAR100, are used. Both consist of 32×32 natural images, in 10 classes for CIFAR10 and 100 classes for CIFAR100. There are 50,000 images for training and 10,000 images for testing. The training and testing configurations used in (Wang et al. 2019) are followed. Several classical and state-of-the-art robust loss functions and (semantic) data augmentation methods are compared: large-margin loss (Liu et al. 2016), DisturbLabel (Xie et al. 2016), focal loss (Lin et al. 2017), center loss (Wen et al. 2016), Lq loss (Zhang and Sabuncu 2018), CGAN (Mirza and Osindero 2014), ACGAN (Odena, Olah, and Shlens 2017), InfoGAN (Chen et al. 2016), ISDA, and ISDA + Dropout. Wide-ResNet-28-10 (Zagoruyko and Komodakis 2016) and ResNet-110 (He et al. 2016) are used as the base neural networks. Considering that the training/testing configuration is fixed for both sets, the results of the above competing methods reported in the ISDA paper (Wang et al.
2019) are directly presented (some are from their original papers). The training settings for the base neural networks also follow the ISDA paper and its released code. Our method has two variants:

LPL (mean + fixed bound). In this version, the optimization in Eq. (12) is used. "Mean" denotes that the threshold is mean(q̄_c). "Fixed bound" means that the value of ϵ_c is fixed and identical for all categories during optimization; it is searched in {0.1, 0.2, 0.3, 0.4}.

LPL (mean + varied bound). In this version, the optimization in Eq. (12) is also used. Theoretically, "varied bound" means that the value of ϵ_c varies according to Eq. (18). However, varied bounds within the same batch make the implementation more difficult and increase the training complexity. In our implementation, we instead set a varied number of updating steps for each category in our PGD-like optimization. The value of Δϵ is searched in {0.1, 0.2}.

The PGD-like optimization in Algorithm 1 contains two hyper-parameters, namely, the step size and the number of steps (#steps). Let α be the step size and K_c be the number of steps for category c. In balanced classification, α is searched in {0.01, 0.02, 0.03}, and K_c is calculated by Eq. (19). With the step size and #steps, the PGD-like optimization is detailed in Algorithm 2.

Algorithm 2: PGD-like Optimization
Input: The logit vectors (u_i) of the cth category in the current mini-batch, ϵ_c, and α.
1: Let u_i^0 = u_i for the input vectors.
2: Calculate K_c according to Eq. (19).
3: for k = 0 to K_c − 1 do
4:   Calculate ∂l(softmax(u_i^k + δ_c), c)/∂δ_c |_{δ_c=0} = softmax(u_i^k) − ĉ.
5:   Calculate δ_{y_i}^{k+1} according to Eq. (16) for maximization or Eq. (17) for minimization.
6:   u_i^{k+1} := u_i^k + δ_{y_i}^{k+1}.
7: end for
Output: δ_c = u_i^{K_c} − u_i

The Top-1 error is used as the evaluation metric. The base neural networks with the standard cross-entropy loss are re-run before running our methods to ensure a fair comparison.
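Algorithm 2 can be sketched as an iterative version of the single update step, with the accumulated perturbation kept within the bound; the clipping to an infinity-norm ball of radius ϵ_c is our reading of how the bound is enforced, and all names are ours.

```python
import numpy as np

def softmax(u):
    z = u - u.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_like(U, c, C, eps_c, alpha, K, maximize=True):
    """PGD-like optimization in the spirit of Algorithm 2, for one category c.

    U : (N, C) logits of category c's samples in the mini-batch.
    Runs K steps of size alpha; the accumulated perturbation is projected
    onto the infinity-norm ball of radius eps_c (our assumed projection).
    """
    y_hat = np.zeros(C); y_hat[c] = 1.0
    Uk = U.copy()
    for _ in range(K):
        g = (softmax(Uk) - y_hat).mean(axis=0)      # Eq. (15), averaged over the class
        Uk = Uk + (alpha * g if maximize else -alpha * g)
        delta = np.clip(Uk - U, -eps_c, eps_c)      # project onto the bound
        Uk = U + delta
    return Uk - U                                    # final perturbation delta_c

U = np.array([[2.0, 0.5, 0.1]])                      # one class-0 sample's logits
delta_c = pgd_like(U, c=0, C=3, eps_c=0.3, alpha=0.2, K=5)
```

In the maximization case the true-class logit is pushed down step by step, so the class's confidence drops while the perturbation never exceeds the bound.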
Almost identical results are obtained compared with the published results in the ISDA paper.

Results. Tables 1 and 2 present the results of all competing methods on CIFAR10 and CIFAR100, respectively. Our LPL method (both versions) achieves the best performance in nearly all cases under the two base neural networks, with ISDA second best. Only in the case of Wide-ResNet-28-10 on CIFAR100 is LPL (mean + fixed ϵ_c) inferior to ISDA; even there, it still achieves the fourth-lowest error.

Method | Wide-ResNet-28-10 | ResNet-110
Basic | 3.82 ± 0.15% | 6.76 ± 0.34%
Large Margin | 3.69 ± 0.10% | 6.46 ± 0.20%
DisturbLabel | 3.91 ± 0.10% | 6.61 ± 0.04%
Focal Loss | 3.62 ± 0.07% | 6.68 ± 0.22%
Center Loss | 3.76 ± 0.05% | 6.38 ± 0.20%
Lq Loss | 3.78 ± 0.08% | 6.69 ± 0.07%
CGAN | 3.84 ± 0.07% | 6.56 ± 0.14%
ACGAN | 3.81 ± 0.11% | 6.32 ± 0.12%
InfoGAN | 3.81 ± 0.05% | 6.59 ± 0.12%
ISDA | 3.58 ± 0.15% | 6.33 ± 0.19%
ISDA + Dropout | 3.58 ± 0.15% | 5.98 ± 0.20%
LPL (mean + fixed ϵ_c) | 3.39 ± 0.04% | 5.83 ± 0.21%
LPL (mean + varied ϵ_c) | 3.37 ± 0.04% | 5.72 ± 0.05%

Table 1: Mean values and standard deviations of the test Top-1 errors of all the involved methods on CIFAR10.

Method | Wide-ResNet-28-10 | ResNet-110
Basic | 18.53 ± 0.07% | 28.67 ± 0.44%
Large Margin | 18.48 ± 0.05% | 28.00 ± 0.09%
DisturbLabel | 18.56 ± 0.22% | 28.46 ± 0.32%
Focal Loss | 18.22 ± 0.08% | 28.28 ± 0.32%
Center Loss | 18.50 ± 0.25% | 27.85 ± 0.10%
Lq Loss | 18.43 ± 0.37% | 28.78 ± 0.35%
CGAN | 18.79 ± 0.08% | 28.25 ± 0.36%
ACGAN | 18.54 ± 0.05% | 28.48 ± 0.44%
InfoGAN | 18.44 ± 0.10% | 27.64 ± 0.14%
ISDA | 17.98 ± 0.15% | 27.57 ± 0.46%
ISDA + Dropout | 17.98 ± 0.15% | 26.35 ± 0.30%
LPL (mean + fixed ϵ_c) | 18.19 ± 0.07% | 26.09 ± 0.16%
LPL (mean + varied ϵ_c) | 17.61 ± 0.30% | 25.87 ± 0.07%

Table 2: Mean values and standard deviations of the test Top-1 errors of all the involved methods on CIFAR100.

The results of LPL with varied bounds are better than those of LPL with fixed bounds. This comparison supports our motivation that a category with relatively low (high) performance should be more positively (negatively) augmented.
In the final part of this section, further analyses compare ISDA and our method. Naturally, a varied threshold would be expected to improve performance further.

Ratio | 100:1 | 10:1
Class-balanced CE loss | 61.23% | 42.43%
Class-balanced fine-tuning | 58.50% | 42.43%
Meta-weight net | 58.39% | 41.09%
Focal loss | 61.59% | 44.22%
Class-balanced focal loss | 60.40% | 42.01%
LDAM | 59.40% | 42.71%
LDAM-DRW | 57.11% | 41.22%
ISDA + Dropout | 62.60% | 44.49%
LA | 56.11% | 41.66%
LPL (varied τ + fixed ϵ_c) | 58.03% | 41.86%
LPL (varied τ + varied ϵ_c) | 55.75% | 39.03%

Table 3: Test Top-1 errors on CIFAR100-LT (ResNet-32).

Ratio | 100:1 | 10:1
Class-balanced CE loss | 27.32% | 13.10%
Class-balanced fine-tuning | 28.66% | 16.83%
Meta-weight net | 26.43% | 12.45%
Focal loss | 29.62% | 13.34%
Class-balanced focal loss | 25.43% | 12.52%
LDAM | 26.45% | 12.68%
LDAM-DRW | 21.88% | 11.63%
ISDA + Dropout | 27.45% | 12.98%
LA | 22.33% | 11.07%
LPL (varied τ + fixed ϵ_c) | 23.97% | 11.09%
LPL (varied τ + varied ϵ_c) | 22.05% | 10.59%

Table 4: Test Top-1 errors on CIFAR10-LT (ResNet-32).

Method | CIFAR100-LT (100:1) | CIFAR10-LT (100:1) | CIFAR100-LT (10:1) | CIFAR10-LT (10:1)
LA | 56.11% | 22.33% | 41.66% | 11.07%
LPL | 55.75% (−0.36%) | 22.05% (−0.28%) | 39.03% (−2.63%) | 10.59% (−0.48%)

Table 5: The error reduction of LPL (varied τ + varied ϵ_c) over LA on the two data sets.

Experiments on Long-tail Classification

Datasets and competing methods. In this experiment, the long-tail versions of CIFAR10 and CIFAR100 compiled by Cui et al. (2019), called CIFAR10-LT and CIFAR100-LT, are used. The training and testing configurations in (Menon et al. 2021) are followed. Several classical and state-of-the-art robust loss functions and semantic data augmentation methods are compared: class-balanced CE loss (Cui et al. 2019), class-balanced fine-tuning (Cui et al. 2018), Meta-weight net (Shu et al. 2019), focal loss (Lin et al. 2017), class-balanced focal loss (Cui et al. 2019), LDAM (Cao et al. 2019), LDAM-DRW (Cao et al. 2019), ISDA, and LA. Menon et al.
(2021) released the training data for the imbalance ratio (i.e., π_1/π_C) of 100:1; hence, their data and reported results for the above competing methods are directly presented. For the 10:1 ratio, the results of ISDA + Dropout and LA are obtained by running their released code, and the results of the remaining methods are from the study by Li et al. (2021). The hyper-parameter λ in LA is searched in {0.5, 1, 1.5, 2, 2.5}, following the suggestion in (Menon et al. 2021). As in the experiments of (Menon et al. 2021), ResNet-32 (He et al. 2016) is used as the base network. The results of ISDA, LA, and LPL are averages of five repeated runs.

Method | CIFAR10-LT100 | CIFAR100-LT100
LA | 22.33% | 56.11%
LPL | 22.05% | 55.75%
LA+LPL | 21.46% | 53.89%

Table 6: Test Top-1 errors of three methods on the two data sets.

Our method has two variants: LPL (varied threshold + fixed bound) and LPL (varied threshold + varied bound). The threshold τ is searched in {0.4C, 0.5C, 0.6C}. The value of ϵ is set to 0, and Δϵ is searched in {1.5, 2.5, 5}. The step size is searched in {0.1, 0.2, 0.3} because the augmentation extent should be large in long-tail classification, and the number of steps (#steps) for category c is again calculated by Eq. (19). Only one meta-based method, Meta-weight net, is involved, because we mainly aim to compare methods that only modify the training loss; in addition, meta-based methods require an auxiliary high-quality validation set (Li et al. 2021). Methods such as BBN (Zhou et al. 2020), which focus on new network structures, are also excluded from the comparisons.

Results. The Top-1 error is again used. Table 3 shows the results of all the methods on CIFAR100-LT. At both the 100:1 and 10:1 ratios, LPL (varied τ + varied ϵ_c) yields the lowest Top-1 errors, exceeding the best competing method, LA, by 0.36% and 2.63%, respectively. Table 4 shows the results of all the methods on CIFAR10-LT.
LPL (varied τ + varied ϵ_c) still obtains the lowest Top-1 errors at both ratios. Across all comparisons, the semantic augmentation method ISDA obtains poor results; on CIFAR100-LT, it achieves the worst performance at both ratios. This result is expected because ISDA aims to positively augment all categories equally and does not favor tail categories, so the tail categories may suffer from this positive augmentation. Nevertheless, ISDA performs better on CIFAR10-LT than on CIFAR100-LT: in Fig. 1(b), its loss increments of tail categories are larger than those of the head ones, that is, larger augmentations are exerted on the tail categories.

We list the Top-1 errors of LA and LPL (varied τ + varied ϵ_c) in Table 5 to better present the comparison. When the ratio is smaller, the improvements (error reductions) are relatively larger. This result is reasonable because when the ratio becomes small, the effectiveness of LA is weakened accordingly. When the imbalance ratio is one, indicating no imbalance, LA loses its effect entirely, whereas our LPL can still augment the training data effectively.

More Analysis of Our Method

Improvements on existing methods. Our LPL method seeks the perturbation via an optimization scheme, whereas in ISDA and LA the perturbations are directly calculated rather than optimized.
A natural question thus arises: can the perturbations in existing methods be further improved via our method? To answer it, we propose a combination method with the following loss for imbalanced image classification:

L = Σ_{c∈Na} Σ_{x_i∈S_c} min_{||δ_{y_i}|| ≤ ϵ_c} l(softmax(u_i + λ log π + δ_{y_i}), y_i) + Σ_{c∈Pa} Σ_{x_i∈S_c} max_{||δ_{y_i}|| ≤ ϵ_c} l(softmax(u_i + λ log π + δ_{y_i}), y_i),

where log π is taken element-wise as in Eq. (3). When all ϵ_c are zero, the above loss becomes the loss of LA; when λ is zero, it becomes our LPL (with fixed bounds). We conducted experiments on CIFAR10-LT100 and CIFAR100-LT100, with ResNet-32 as the base model. The results are shown in Table 6. The value of λ is searched in {0.5, 1, 1.5, 2, 2.5}. The threshold τ is set to 4 on CIFAR10 and 40 on CIFAR100. Other parameters follow the settings of the previous experiments. The combination method LA+LPL achieves the lowest errors in both comparisons, indicating that our LPL can further improve the performance of existing SOTA methods. ISDA can likewise be improved in the same manner.

More comparisons with ISDA. ISDA claims that it does not increase the number of parameters compared with direct learning on the basic DNN models. Our method likewise does not increase the number of model parameters, because the perturbation terms are not used in the final prediction.

Method | #Params | CIFAR10 | CIFAR100
ResNet-32 + ISDA | 0.5M | 7.09 ± 0.12% | 30.27 ± 0.34%
ResNet-32 + LPL (mean + fixed ϵ_c) | 0.5M | 7.01 ± 0.16% | 29.59 ± 0.27%
ResNet-32 + LPL (mean + varied ϵ_c) | 0.5M | 6.66 ± 0.09% | 28.53 ± 0.16%
SE-ResNet110 + ISDA | 1.7M | 5.96 ± 0.21% | 26.63 ± 0.21%
SE-ResNet110 + LPL (mean + fixed ϵ_c) | 1.7M | 5.87 ± 0.17% | 26.12 ± 0.24%
SE-ResNet110 + LPL (mean + varied ϵ_c) | 1.7M | 5.39 ± 0.10% | 25.70 ± 0.07%
Wide-ResNet-16-8 + ISDA | 11.0M | 4.04 ± 0.29% | 19.91 ± 0.21%
Wide-ResNet-16-8 + LPL (mean + fixed ϵ_c) | 11.0M | 3.97 ± 0.09% | 19.87 ± 0.02%
Wide-ResNet-16-8 + LPL (mean + varied ϵ_c) | 11.0M | 3.93 ± 0.10% | 19.83 ± 0.09%

Table 7: Number of parameters and test Top-1 errors of ISDA and LPL with different base networks.
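The combined LA+LPL loss discussed above amounts to evaluating the CE loss under the joint perturbation λ log π + δ. A minimal sketch of that evaluation, with our own names and toy numbers, and with log π read element-wise over the class priors:

```python
import numpy as np

def softmax(u):
    z = u - u.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def la_lpl_loss(u, y, priors, delta, lam):
    """CE loss under the combined perturbation lam*log(pi) + delta.

    With delta = 0 this reduces to LA; with lam = 0 it reduces to LPL
    with a fixed bound, matching the two limiting cases in the text.
    """
    z = u + lam * np.log(priors) + delta
    p = softmax(z)
    return -np.log(p[np.arange(len(y)), y]).mean()

priors = np.array([0.7, 0.2, 0.1])
u = np.array([[2.0, 1.0, 0.5]])
y = np.array([2])
base = la_lpl_loss(u, y, priors, delta=np.zeros(3), lam=0.0)  # plain CE
la   = la_lpl_loss(u, y, priors, delta=np.zeros(3), lam=1.0)  # LA limiting case
```

In practice δ would come from the PGD-like optimization; here the zero-δ calls only verify the two limiting cases.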
Table 7 shows the comparisons between ISDA and LPL (two variants) on three additional base DNN models, namely, SE-ResNet110 (Hu, Shen, and Sun 2018), Wide-ResNet-16-8 (Zagoruyko and Komodakis 2016), and ResNet-32. The numbers of parameters are equal for ISDA and LPL. The two variants of our LPL outperform ISDA on both data sets under all five base models.

Loss variations of LPL during training. We plot the loss variations of LPL on two balanced and two long-tail data sets to assess whether our method accords with the two conjectures. The curves are shown in Fig. 5. On the balanced data, the relative loss variations are similar to those of ISDA; on the long-tail data, the losses of head categories are reduced while those of tail categories are increased, which is similar to LA. The two patterns indicate that LPL implements appropriate positive/negative augmentation according to the characteristics of the training data.

Figure 5: Relative loss variations of our LPL on two balanced and two long-tail data sets.

Performances of LPL under different τ and ϵ_c. The threshold for category set split and the bound for augmentation extent are two important hyper-parameters of LPL. Based on our experiments, the following observations are obtained. On the balanced data sets, the results are relatively stable when the bound lies in [0.1, 0.5], and the results are usually better when the threshold is searched around mean(q̄_c). On the long-tail data sets, the results are relatively stable when the bound lies in [1.5, 5.0], and the results are usually good when the threshold is searched in {0.4C, 0.5C, 0.6C}. Long-tail problems require a larger extent of data augmentation.

Conclusions

This study investigates category-level logit perturbation in deep learning. A conjecture on the relationship between (logit perturbation-incurred) loss increments/decrements and positive/negative data augmentation is proposed.
Based on this conjecture, a new method is introduced to learn to perturb logits (LPL) during DNN training. Two key components of LPL, the category-set split and the bound calculation, are investigated. The differences between our proposed LPL and two existing methods are analyzed. Extensive experiments on data augmentation (for balanced classification) and long-tail classification are conducted. LPL achieves the best performance in both settings under different base networks. Existing methods with logit perturbation (e.g., LA) can also be improved by using our method.

Acknowledgments

This work is partially supported by NSFC No. 62076178, Zhejiang Fund No. 2019KB0AB03, and Tianjin Fund No. 19JCZDJC31300. We are very grateful to Rui Wang for her helpful suggestions.

References

Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; and Ma, T. 2019. Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, 1567–1578.
Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2180–2188.
Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. In CVPR, 9268–9277.
Cui, Y.; Song, Y.; Sun, C.; Howard, A.; and Belongie, S. 2018. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 4109–4118.
Fan, Y.; Lyu, S.; Ying, Y.; and Hu, B.-G. 2017. Learning with average top-k loss. In NeurIPS, 497–505.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In ICLR.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In CVPR, 7132–7141.
Hu, Z.; Tan, B.; Salakhutdinov, R.; Mitchell, T.; and Xing, E. P. 2019. Learning data manipulation for augmentation and weighting.
In NeurIPS, 15738–15749.
Jin, D.; Jin, Z.; Zhou, J. T.; and Szolovits, P. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, 8018–8025.
Li, S.; Gong, K.; Liu, C. H.; Wang, Y.; Qiao, F.; and Cheng, X. 2021. MetaSAug: Meta semantic augmentation for long-tailed visual recognition. In CVPR, 5212–5221.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In ICCV, 2980–2988.
Liu, W.; Wen, Y.; Yu, Z.; and Yang, M. 2016. Large-margin softmax loss for convolutional neural networks. In ICML, 507–516.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards deep learning models resistant to adversarial attacks. In ICLR.
Menon, A. K.; Jayasumana, S.; Rawat, A. S.; Jain, H.; Veit, A.; and Kumar, S. 2021. Long-tail learning via logit adjustment. In ICLR.
Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv:1411.1784.
Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2642–2651.
Patrini, G.; Rozza, A.; Krishna Menon, A.; Nock, R.; and Qu, L. 2017. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 1944–1952.
Reed, S.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; and Rabinovich, A. 2015. Training deep neural networks on noisy labels with bootstrapping. In ICLR.
Shu, J.; Xie, Q.; Yi, L.; Zhao, Q.; Zhou, S.; Xu, Z.; and Meng, D. 2019. Meta-weight-net: Learning an explicit mapping for sample weighting. In NeurIPS, 1917–1928.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
Wang, X.; Hua, Y.; Kodirov, E.; Clifton, D. A.; and Robertson, N. M. 2021. ProSelfLC: Progressive self label correction for training robust deep neural networks. In CVPR, 752–761.
Wang, Y.; Pan, X.; Song, S.; Zhang, H.; Huang, G.; and Wu, C.
2019. Implicit semantic data augmentation for deep networks. In NeurIPS, 12635–12644.
Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In ECCV, 499–515. Springer.
Wu, T.; Liu, Z.; Huang, Q.; Wang, Y.; and Lin, D. 2021a. Adversarial robustness under long-tailed distribution. In CVPR, 8659–8668.
Wu, Y.; Shu, J.; Xie, Q.; Zhao, Q.; and Meng, D. 2021b. Learning to purify noisy labels via meta soft label corrector. In AAAI, 10388–10396.
Xie, C.; Tan, M.; Gong, B.; Wang, J.; Yuille, A. L.; and Le, Q. V. 2020. Adversarial examples improve image recognition. In CVPR, 819–828.
Xie, L.; Wang, J.; Wei, Z.; Wang, M.; and Tian, Q. 2016. DisturbLabel: Regularizing CNN on the loss layer. In CVPR, 4753–4762.
Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. In BMVC, 87.1–87.12.
Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. Mixup: Beyond empirical risk minimization. In ICLR.
Zhang, Z.; and Sabuncu, M. R. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, 8778–8788.
Zhou, B.; Cui, Q.; Wei, X.-S.; and Chen, Z.-M. 2020. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR, 9719–9728.