# Class-Incremental Learning via Dual Augmentation

Fei Zhu1,2, Zhen Cheng1,2, Xu-Yao Zhang1,2, Cheng-Lin Liu1,2,3
1NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3Center for Excellence of Brain Science and Intelligence Technology, CAS
{zhufei2018, chengzhen2019}@ia.ac.cn, {xyz, liucl}@nlpr.ia.ac.cn

## Abstract

Deep learning systems typically suffer from catastrophic forgetting of past knowledge when acquiring new skills continually. In this paper, we emphasize two dilemmas, representation bias and classifier bias, in class-incremental learning, and present a simple and novel approach that employs explicit class augmentation (classAug) and implicit semantic augmentation (semanAug) to address the two biases, respectively. On the one hand, we propose to address the representation bias by learning transferable and diverse representations. Specifically, we investigate the feature representations in incremental learning based on spectral analysis and present a simple technique, called classAug, that lets the model see more classes during training, so that it learns representations transferable across classes. On the other hand, to overcome the classifier bias, semanAug implicitly generates an infinite number of instances of old classes in the deep feature space, which poses tighter constraints to maintain the decision boundary of previously learned classes. Without storing any old samples, our method performs comparably with representative data-replay-based approaches.

## 1 Introduction

Deep neural networks (DNNs) have enabled great success in many machine learning tasks, based on stationary, large-scale, computationally expensive, and memory-intensive training data [1, 2, 3]. Yet the need to acquire sequential experience in dynamic and open environments [4, 5, 6] poses a serious challenge to modern deep learning systems, which only perform well on homogenized, balanced, and shuffled data [7]. Typically, DNNs suffer from drastic performance degradation on previously learned tasks after learning new knowledge, a well-documented phenomenon known as catastrophic forgetting [8, 9, 10]. Recently, incremental learning (IL), also referred to as lifelong learning or continual learning, has received extensive attention [11, 12, 13, 14] as a way to enable DNNs to preserve and extend knowledge continually. Many earlier studies focus on task-incremental learning, which uses separate output layers for different tasks and requires the task identity at inference time [11, 15, 16]. In this work, we consider the more realistic and challenging setting of class-incremental learning (Class-IL), where the model only has access to data of new classes at each stage and needs to learn a unified classifier that can classify all seen classes [13, 17, 18].

Unfortunately, the learning paradigm of Class-IL leads to two problems: representation bias and classifier bias, as shown in Figure 1. First, for representation learning, if the feature extractor is fixed after learning old classes, the learned representations are preserved but lack transferability for new classes; on the contrary, if we update the feature extractor on new classes, the updated representations are no longer suitable for old classes. Consequently, the old and new classes would easily overlap in the deep feature space.
We denote this dilemma as the representation bias. Second, to distinguish new classes from old classes, the training loss is typically calculated over all classes. Without old training data, the class weights of old classes would be ill-updated and mismatched with the updated representation space. We denote this dilemma as the classifier bias. In this work, we investigate the learning of the representation and the classifier in incremental learning and propose a simple and effective dual augmentation framework to overcome these two biases in Class-IL without storing and replaying training data of old classes.

Figure 1: Two inherent problems in Class-IL: representation bias and classifier bias.

Learning Representation for Incremental Learning. Existing works typically regularize network parameters explicitly [11, 15, 16] or implicitly [12] to reduce the representation shift when learning new classes. In this paper, instead of asking how to keep previously learned representations unchanged, we investigate the following question: What properties of learned representations could facilitate incremental learning? We hypothesize that learning transferable and diverse representations is an important requirement for incremental learning. Intuitively, with such representations, it could be easier to find a model that performs well on all tasks and improves both plasticity and stability, since different tasks would be closer in parameter space. From a spectral analysis viewpoint, we investigate which components of feature representations are more transferable and less forgettable in the incremental learning process. We find that spectral components with large eigenvalues are less forgettable. Furthermore, we exploit this finding to propose a simple technique named classAug, which enlarges the spectral components and thus introduces more diverse and transferable representations for incremental learning.

Learning Classifier for Incremental Learning. Recently, several works were proposed to alleviate the classifier bias in data-replay-based methods [18, 19, 20]. However, in the non-exemplar-based (i.e., without storing and replaying old data) Class-IL setting, the classifier bias is more serious and the above methods cannot be directly used. A straightforward way is to store instances of old classes in the deep feature space. However, this strategy is undesirable due to limited memory resources and poor scalability. This work delves into classifier learning for Class-IL and proposes an implicit semantic augmentation (semanAug) approach that generates an infinite number of instances of old classes in the deep feature space by leveraging their distribution information. SemanAug is inspired by MCF [21] and ISDA [22], which have performed semantic augmentation for linear models and DNNs, respectively. However, both the way we leverage semantic augmentation and our motivation fundamentally differ from theirs [21, 22].

Contributions. (i) We provide new insights into representation learning for incremental learning by analyzing the structural characteristics of the learned embedding space via spectral decomposition, and we find that spectral components with large eigenvalues are less forgettable and carry more transferable features.
Based on this observation, we propose a simple and effective method, classAug, to learn a better embedding space for incremental learning. (ii) For classifier learning in incremental learning, we propose semanAug, which implicitly generates an infinite number of instances of old classes in the deep feature space to maintain the decision boundary of previously learned classes. (iii) Extensive experiments on benchmark datasets demonstrate the superior performance of our dual augmentation framework in the challenging scenario of Class-IL.

## 2 Related Work

Incremental Learning. Diverse approaches have been proposed for incremental learning of DNNs. They can be roughly divided into three categories: regularization-based, data-replay-based, and architecture-based approaches. Regularization-based methods focus on weight regularization by estimating the importance of network weights and preventing the important ones from changing [11, 15, 16]. These methods differ in how they compute the importance of the parameters. However, it is hard to design a reasonable metric to measure parameter importance, and regularization strategies are known to perform poorly in the Class-IL scenario [23, 24]. Data-replay-based methods address both the representation bias and the classifier bias straightforwardly by storing a fraction of old data to jointly train the model with current data. With stored real samples, some works [17, 13, 25] use a distillation loss to prevent forgetting, while others [26, 27, 28] develop gradient-based regularization to make more efficient use of the rehearsal data. To avoid storing real data, another line of work generates pseudo-samples of all previous classes for replay using deep generative models [29, 30, 31, 32]. Nevertheless, storing real data is undesirable in resource-limited scenarios or those with privacy and safety concerns. Moreover, training large generative models for complex datasets is inefficient. Architecture-based methods dynamically extend the network structure during the course of incremental learning [33, 34, 35, 36]. However, growing the architecture is unfeasible for large numbers of tasks, and those methods are often impractical for Class-IL.

Data Augmentation. The literature is rich on data augmentation for improving the generalization of DNNs. Classical strategies commonly synthesize new positive samples in a way that is consistent with the underlying data distribution of the original dataset [3]. Recent works show that label-mixing-based methods such as Mixup [37] and CutMix [38] can greatly improve the generalization of DNNs. Complementary to the input-space augmentations mentioned above, some works have explored feature-space augmentations, which augment the learned representations in the deep embedding space to enhance classifier performance. The intuition behind those works is that certain directions in the deep feature space correspond to meaningful semantic transformations [39, 40]. For instance, deep feature interpolation [40] leverages simple interpolations in the embedding space to achieve semantic augmentation. The recently proposed ISDA [22] performs semantic augmentation by estimating and leveraging the category-wise distribution of deep representations in an online manner. Despite its simplicity, ISDA has shown its effectiveness in semi-supervised learning [22], contrastive learning [41], domain adaptation [42], and long-tailed recognition [43].
## 3 Dual Augmentation Framework for Class-Incremental Learning

We first formalize the problem of Class-IL, then introduce the proposed classAug for representation learning and semanAug for classifier learning, respectively. Finally, we present the dual augmentation framework for Class-IL by combining the two augmentations.

Problem Definition. Typically, a Class-IL problem involves the sequential learning of $T$ tasks that consist of disjoint class sets, and the model has to classify all seen classes at any given point in training. At incremental step $t \in \{1, ..., T\}$, $(x, y) \in \mathcal{D}_t$ denotes a training sample, where $x$ is a sample in the input space $\mathcal{X}$ and $y \in \mathcal{C}_t$ is its corresponding label; $\mathcal{C}_t$ is the class set of task $t$. To facilitate analysis, we represent the DNN-based model with two components: a feature extractor and a unified classifier. Specifically, the feature extractor $f_\theta: \mathcal{X} \rightarrow \mathcal{Z}$, parameterized by $\theta$, maps the input $x$ to a feature vector $z = f_\theta(x) \in \mathbb{R}^d$ in the deep feature space $\mathcal{Z}$; the unified classifier $g_\phi: \mathcal{Z} \rightarrow \mathbb{R}^{|\mathcal{C}_{1:t}|}$, parameterized by $\phi$, produces a probability distribution $g_\phi(z)$ as the prediction for $x$. Denote the overall parameters by $\Theta = (\theta, \phi)$. The general objective is to correctly classify test examples from all seen classes [44]. The key challenge of Class-IL is that data from previous tasks are assumed to be unavailable, which means that the best configuration of the model for all seen tasks must be sought by minimizing a predefined loss function $\mathcal{L}$ (e.g., cross-entropy) on the current data $\mathcal{D}_t$:

$$\underset{\theta,\phi}{\operatorname{argmin}}\;\; \mathbb{E}_{(x,y)\sim\mathcal{D}_t}\big[\mathcal{L}(g_\phi(f_\theta(x)), y)\big]. \tag{1}$$

A widely used strategy to preserve old knowledge is knowledge distillation [45], which typically matches the current model with the previous model's response to current training data using the teacher-student framework [12, 13, 19].

### 3.1 Learning Representation with Class Augmentation

As we focus on non-exemplar-based Class-IL, we intentionally avoid storing training samples of old classes. To maintain the generalizability of the learned representations for old classes, existing methods typically restrain the feature extractor from changing [11, 15, 16, 12]. However, this leads to a trade-off between plasticity and stability [5], and it becomes hard to perform long-step incremental learning. Our high-level idea is to learn transferable and diverse representations to bridge the old and new classes in a better feature space. To delve into this problem, we want to answer two questions: (1) Which part of the feature representations tends to be forgotten in incremental learning? (2) How can we facilitate representation learning for incremental learning?

#### 3.1.1 Analyzing Forgetting via Spectral Decomposition

In what follows, we explore which part of the feature representations tends to be forgotten and may not be transferable across different tasks in incremental learning. To this end, we propose to quantify the sensitivity of the model to different directions in the deep feature space by measuring the similarity of the space before and after learning new tasks. Formally, consider a feature extractor $f_{\theta,old}$ trained on a dataset $\mathcal{D}_{old} = \{(x_i, y_i)\}_{i=1}^{n}$. A new dataset $\mathcal{D}_{new}$ that contains classes disjoint from those in $\mathcal{D}_{old}$ is used to update $f_{\theta,old}$, and the updated feature extractor is denoted as $f_{\theta,new}$. For the samples in $\mathcal{D}_{old}$, we can obtain two groups of deep features, mapped by $f_{\theta,old}$ and $f_{\theta,new}$, respectively.
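For concreteness, the two groups of features can be collected with a short PyTorch-style routine. This is a minimal sketch rather than the paper's released code; the extractor and data-loader names are placeholders.

```python
import torch

@torch.no_grad()
def collect_features(extractor, loader, device="cpu"):
    """Run a frozen feature extractor over a dataset and stack the d-dimensional features."""
    extractor.eval().to(device)
    feats = [extractor(x.to(device)).cpu() for x, _ in loader]
    return torch.cat(feats)  # shape: (n, d)

# Hypothetical usage: f_old / f_new are the extractors before and after learning the
# new task, and old_loader iterates over D_old.
# Z_old = collect_features(f_old, old_loader)
# Z_new = collect_features(f_new, old_loader)
```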
Using eigenvalue decomposition, we can decompose the features mapped by the original feature extractor (i.e., $f_{\theta,old}(x_i)$), as well as the features mapped by the updated feature extractor (i.e., $f_{\theta,new}(x_i)$), into different directions as follows:

$$\frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i)\,f_\theta(x_i)^{T} = \sum_{j=1}^{d} u_j \lambda_j u_j^{T}, \tag{2}$$

where $\lambda_j$ is the eigenvalue with index $j$, $u_j$ is its eigenvector, and $d$ is the dimensionality of the feature space. Through the spectral factorization in Eq. (2), we can represent the original and new representations with two groups of eigenvectors: $\{u_{old,1}, ..., u_{old,d}\}$ and $\{u_{new,1}, ..., u_{new,d}\}$. Next, we investigate the forgetting, or transferability, of each direction. Shonkwiler [46] introduced principal angles [47] to measure the similarity of two subspaces. However, it is unreasonable to treat all eigenvectors equally when calculating principal angles, regardless of their relative eigenvalues. Inspired by [48], we instead use corresponding angles, denoted by $\psi$, to explore the distance between two subspaces in incremental learning:

Definition 1 (Corresponding Angle). Given two groups of eigenvectors $\{u_{old,1}, ..., u_{old,d}\}$ and $\{u_{new,1}, ..., u_{new,d}\}$, the corresponding angle is the angle between the two eigenvectors sharing the same eigenvalue index. The cosine of the corresponding angle is

$$\cos(\psi_j) = \frac{\langle u_{old,j},\, u_{new,j} \rangle}{\lVert u_{old,j} \rVert\, \lVert u_{new,j} \rVert}, \tag{3}$$

where $u_{old,j}$ is the eigenvector with the $j$-th largest eigenvalue in the old feature space, and similarly for $u_{new,j}$. Note that $\lVert u_{old,j} \rVert = 1$ and $\lVert u_{new,j} \rVert = 1$.

For IL, preserving old knowledge means maintaining the previously learned decision boundaries among classes. At the representation level, for an old class, the shape (i.e., covariance) of its distribution should not change too much. If an eigenvector direction changes only slightly after updating the feature extractor, the corresponding angle is small, and vice versa. Intuitively, the corresponding angle captures the representation shift between the old and updated feature extractors during incremental learning, and reflects the forgetting along certain directions in the deep feature space.

Based on the metric defined above, we explore the forgetting of different directions in Class-IL. We use LwF-MC [12, 13] as the baseline method and train a ResNet-18 [1] on CIFAR-100 [49] using SGD in a 2-step manner. Concretely, the model is first trained on the first 50 classes and then updated on the other 50 classes. Figure 2 (a) shows the absolute cosine values of the corresponding angles between the old and new eigenvectors. We observe that eigenvectors with larger eigenvalues produce larger similarity (smaller corresponding angles), which indicates that those directions are more transferable and less forgettable across different tasks. On the contrary, eigenvectors with small eigenvalues tend to move after updating the model on new tasks, and can be regarded as forgettable directions.

Figure 2: (a) Absolute cosine values of corresponding angles. (b) Distribution of eigenvalues for the baseline, Mixup [37], LS [50], and models trained with our classAug.
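The eigendecomposition of Eq. (2) and the corresponding angles of Eq. (3) can be computed directly with NumPy. This is a minimal sketch, assuming `Z_old` and `Z_new` are the $(n, d)$ feature matrices gathered in the sketch above.

```python
import numpy as np

def sorted_eigvectors(Z):
    """Eigendecomposition of (1/n) * Z^T Z, sorted by descending eigenvalue (Eq. 2)."""
    cov = Z.T @ Z / Z.shape[0]                 # (d, d), symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]   # columns are unit-norm eigenvectors

def corresponding_angles(Z_old, Z_new):
    """Absolute cosine of the corresponding angles between old/new eigenvectors (Eq. 3)."""
    _, U_old = sorted_eigvectors(Z_old)
    _, U_new = sorted_eigvectors(Z_new)
    # eigh returns unit-norm eigenvectors, so the column-wise inner product is the cosine
    return np.abs(np.sum(U_old * U_new, axis=0))

# Hypothetical usage with the feature matrices from the previous sketch:
# cosines = corresponding_angles(Z_old.numpy(), Z_new.numpy())
```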
Transferable and Diverse Representations. As demonstrated above, the directions with larger eigenvalues transfer better and suffer less forgetting. This thought-provoking observation indicates that our learned representations should have the following properties: (1) Transferability: the eigenvalues of the several most significant directions should be enlarged so that they transfer across tasks (or classes). (2) Diversity: the number of directions with significant eigenvalues should be increased. Note that these properties differ from what is desirable in the common single-task learning scenario. Indeed, reducing the number of directions with significant variance has been seen as a form of feature compression [51], which is linked to generalization by information theory [52, 53]. However, the usual notion of generalization may not be entirely appropriate for IL, since standard learning only aims to learn compact representations within the training classes, without considering generalizability to new classes. In IL, directions that are less discriminative for the current task could capture useful representations for future tasks. A recent paper [54] has shown that strongly compressed representations can actually hurt generalization in the deep metric learning setting. Therefore, to reduce forgetting and enhance the transferability of the representations, it is important to enlarge the eigenvalues and increase the number of eigenvectors with significant variance.

#### 3.1.2 Learning Representations via Class Augmentation

We now exploit the above analysis to propose a simple method for representation learning in Class-IL. Our key idea is to learn transferable and diverse representations by learning more classes at each incremental stage $t$. A direct way to do so is to introduce real classes from other datasets as auxiliaries. However, it is unrealistic to always have access to other real classes, and which datasets should be used remains unclear. Therefore, we propose class augmentation (classAug), which augments the original classes by synthesizing auxiliary classes based on $\mathcal{D}_t$. Concretely, inspired by Mixup [37], classAug randomly interpolates two samples $x_a$ and $x_b$ from two different classes $a$ and $b$ to generate a new sample $x_{ab}^{new}$ representing a new class:

$$x_{ab}^{new} = \lambda x_a + (1-\lambda)x_b, \tag{4}$$

where $\lambda$ is a random interpolation coefficient. For a $k$-class problem, we can generate $k(k-1)/2$ new classes using the above method, which can be further merged into $m$ auxiliary classes. As a result, the original $k$-class problem in the current task is extended to a $(k+m)$-class problem. Moreover, we restrict $\lambda$ to be sampled from the interval $[0.4, 0.6]$ to reduce the overlap between the augmented and original classes. At the end of each IL stage, the augmented class nodes in the classifier are removed.

Figure 3: Illustration of classAug.
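The following is a minimal PyTorch-style sketch of classAug for one mini-batch. The rule used here to merge the $k(k-1)/2$ class pairs into $m$ auxiliary labels (hashing the pair index modulo $m$) is an illustrative assumption, not necessarily the exact merging used by the authors.

```python
import torch

def class_aug(x, y, num_orig_classes, m, lam_range=(0.4, 0.6)):
    """Synthesize auxiliary-class samples by mixing pairs from different classes (Eq. 4).

    x: (B, C, H, W) batch of images; y: (B,) integer labels in [0, num_orig_classes).
    Returns mixed samples with labels in [num_orig_classes, num_orig_classes + m).
    """
    perm = torch.randperm(x.size(0))
    mask = y != y[perm]                                   # keep only cross-class pairs
    n_new = int(mask.sum())
    lam = torch.empty(n_new, 1, 1, 1).uniform_(*lam_range)
    x_new = lam * x[mask] + (1 - lam) * x[perm][mask]
    a, b = y[mask], y[perm][mask]
    pair_id = torch.minimum(a, b) * num_orig_classes + torch.maximum(a, b)
    y_new = num_orig_classes + pair_id % m                # merge pairs into m auxiliary classes
    return x_new, y_new

# Hypothetical usage inside a training step:
# x_aug, y_aug = class_aug(x, y, num_orig_classes=10, m=45)
# x_all, y_all = torch.cat([x, x_aug]), torch.cat([y, y_aug])
```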
Discussion. The proposed classAug is related to Mixup [37], which applies random interpolation to a pair of training samples and their respective one-hot labels. However, the interpolated samples in Mixup stay near the original data and the number of classes is unchanged, whereas in our method the number of classes is increased. By learning to classify more classes at each stage $t$, the model can learn more transferable and diverse representations. Figure 2 (b) displays and compares the eigenvalues of representations learned with different methods on the first 50 classes of CIFAR-100 (to visualize the distribution clearly, the largest eigenvalue is not included in the figure). The proposed classAug significantly enlarges the eigenvalues and produces more directions with significant variance compared with the other methods. On the contrary, Mixup and Label-Smoothing (LS) [50] lead to significantly smaller eigenvalues for the top several eigenvectors, which corresponds to more compact representations. Indeed, the compression effect of soft-label-based methods has also been demonstrated in [51, 50]. As shown in Section 4.3, classAug improves the performance of Class-IL significantly, while Mixup and LS have a negative effect in our experiments.

### 3.2 Learning Classifier with Semantic Augmentation

As demonstrated in Section 1, classifier bias is another problem in Class-IL. When learning new classes, the previously learned decision boundary suffers from catastrophic distortion, and thus test samples from old classes can easily be mapped to wrong classes. To overcome this issue, we propose semantic augmentation (semanAug), which leverages the distribution information (i.e., class mean and covariance) of old classes to regularize the learning of the classifier. Formally, for each old class $k \in \{1, ..., C_{old}\}$, we can generate $M$ instances in the deep feature space from its distribution, i.e., $\tilde{z}_k \sim \mathcal{N}(\mu_k, \gamma\Sigma_k)$, in which $\gamma$ is a non-negative coefficient. The generated instances of old classes and the real instances of new classes in the deep feature space can then be jointly fed to the classifier for minimizing the cross-entropy loss:

$$\underbrace{-\frac{1}{n_t}\sum_{i=1}^{n_t} \log \frac{e^{\phi_{y_i}^{T} z_i + b_{y_i}}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T} z_i + b_c}}}_{L_{t,new}:\ \text{loss on real features of new classes}} \;\; \underbrace{-\;\frac{1}{C_{old}\,M}\sum_{k=1}^{C_{old}}\sum_{m=1}^{M} \log \frac{e^{\phi_k^{T} \tilde{z}_{k,m} + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T} \tilde{z}_{k,m} + b_c}}}_{L_{t,old}:\ \text{loss on generated features of old classes}} \tag{5}$$

where $n_t$ is the number of training samples in the current task dataset $\mathcal{D}_t$, $C_{old}$ is the total number of old classes up to stage $t$, and $C_{all} = C_{old} + |\mathcal{C}_t|$ is the number of all seen classes at stage $t$. $\phi = [\phi_1, ..., \phi_{C_{all}}]^T \in \mathbb{R}^{C_{all}\times d}$ and $b = [b_1, ..., b_{C_{all}}]^T \in \mathbb{R}^{C_{all}}$ are the weight matrix and bias vector of the last fully connected layer, respectively. In Class-IL, the second term in Eq. (5), $L_{t,old}$, is computationally inefficient when $M$ and $C_{old}$ are large. In the following, we present an easy-to-compute way to implicitly generate infinite instances in the deep feature space for old classes.

Upper bound of $L_{t,old}$. Concretely, in the case of $M \rightarrow \infty$, the second term in Eq. (5) becomes

$$
\begin{aligned}
L_{t,old} &= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \mathbb{E}_{\tilde{z}_k}\left[-\log \frac{e^{\phi_k^{T}\tilde{z}_k + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T}\tilde{z}_k + b_c}}\right] \\
&= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \mathbb{E}_{\tilde{z}_k}\left[\log\left(\sum_{c=1}^{C_{all}} e^{(\phi_c^{T}-\phi_k^{T})\tilde{z}_k + (b_c - b_k)}\right)\right] \\
&\le \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log\left(\mathbb{E}_{\tilde{z}_k}\left[\sum_{c=1}^{C_{all}} e^{(\phi_c^{T}-\phi_k^{T})\tilde{z}_k + (b_c - b_k)}\right]\right) \\
&= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log\left(\sum_{c=1}^{C_{all}} e^{v_{c,k}^{T}\mu_k + (b_c - b_k) + \frac{\gamma}{2} v_{c,k}^{T}\Sigma_k v_{c,k}}\right), \tag{6}
\end{aligned}
$$

where $v_{c,k} = \phi_c - \phi_k$. The inequality follows from Jensen's inequality, $\mathbb{E}[\log(X)] \le \log \mathbb{E}[X]$, and the last equality is obtained by using the moment-generating function $\mathbb{E}[e^{tX}] = e^{t\mu + \frac{1}{2}\sigma^2 t^2}$ for $X \sim \mathcal{N}(\mu, \sigma^2)$, since $(\phi_c - \phi_k)^{T}\tilde{z}_k + (b_c - b_k)$ is a Gaussian random variable.
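Since the last equality is the only step that relies on the Gaussian assumption, it can be sanity-checked numerically. The minimal NumPy sketch below (with the bias offsets $b_c - b_k$ dropped for brevity and synthetic $v_{c,k}$, $\mu_k$, $\Sigma_k$) compares a Monte-Carlo estimate of $\mathbb{E}[\log \sum_c e^{v_{c,k}^{T}\tilde{z}}]$ with the closed form; the Monte-Carlo value should stay below the closed form, illustrating the Jensen gap.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, gamma = 8, 5, 2.0
V = rng.normal(size=(C, d))                         # rows play the role of v_{c,k}
mu = rng.normal(size=d)
A = rng.normal(size=(d, d)); Sigma = A @ A.T / d    # a random PSD covariance

z = rng.multivariate_normal(mu, gamma * Sigma, size=200_000)   # z ~ N(mu, gamma * Sigma)
inner = z @ V.T                                                # (N, C) values of v^T z

mc_expect_log = np.log(np.exp(inner).sum(axis=1)).mean()       # E[log sum_c e^{v^T z}]
closed_form = np.log(np.exp(V @ mu + 0.5 * gamma *
                            np.einsum('cd,de,ce->c', V, Sigma, V)).sum())

print(mc_expect_log, closed_form)   # Monte-Carlo estimate <= closed-form upper bound
```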
As can be seen, Eq. (6) is an upper bound of the original $L_{t,old}$, which provides an elegant and much more efficient way to implicitly generate infinite instances in the deep feature space for old classes. The upper bound in Eq. (6) can be written in the common cross-entropy loss form, which we denote $L_{t,semanAug}$:

$$L_{t,semanAug} = -\frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log \frac{e^{\phi_k^{T}\mu_k + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T}\mu_k + b_c + \frac{\gamma}{2} v_{c,k}^{T}\Sigma_k v_{c,k}}}. \tag{7}$$

Intuitively, $L_{t,old}$ implicitly performs semantic transformations of $\mu_k$ based on $\Sigma_k$. To maintain the decision boundary, $\gamma$ should be smaller if the distribution of a class is near the decision boundary, and bigger if the distance is relatively far. We set $\gamma = 2$ in our experiments. In addition, observe that when $\gamma = 0$, only the class means are used for knowledge retention.

Discussion. (1) Although the derivation of the upper bound in Eq. (6) is similar to ISDA [22], both our motivation and the way we leverage semanAug differ from ISDA. When learning new classes, we only apply semanAug to the class mean of each old class based on the memorized distribution information, while ISDA applies semantic augmentation to all training samples to improve generalization in standard supervised learning. In addition, a crucial step in ISDA is to estimate the mean and covariance matrix of each class in an online manner. In contrast, semanAug is naturally suitable for Class-IL, since the distribution of old classes can be estimated from all training samples at the end of each learning stage. (2) Using previous class statistics for IL has also been explored in IL2M [55]. However, our method differs from IL2M in both the statistics used and the way we leverage them. First, the class statistics in IL2M are prediction scores of the classifier, while ours are class distribution statistics in the deep feature space. Second, IL2M uses the class statistics to calibrate the predictions of a continual learner in a post-processing manner, while our method leverages the statistics to automatically learn a balanced classifier.

Figure 4: Illustration of our dual augmentation framework (IL2A) for Class-IL. On the one hand, the training samples of new classes in the current task are augmented via the proposed classAug. On the other hand, the distributions of old classes are retained by semanAug in the deep feature space.

### 3.3 The Dual Augmentation Learning Framework

With classAug for the representation bias and semanAug for the classifier bias, Figure 4 describes the learning process of the dual augmentation framework (IL2A). We also use the well-known knowledge distillation (KD) [19] for two reasons. First, classAug and KD are complementary and focus on different aspects of representation learning. Second, KD can reduce the change of the feature extractor, which is crucial for semanAug because it implicitly generates instances in the deep feature space from the old distributions. The total learning objective at each stage $t$ is as follows:

$$L_t = L_{t,new} + \alpha L_{t,semanAug} + \beta L_{t,kd}, \tag{8}$$

where $\alpha$ and $\beta$ are two hyper-parameters, $L_{t,new}$ and $L_{t,semanAug}$ are given in Eq. (5) and Eq. (7), respectively, and $L_{t,kd} = \frac{1}{n_t}\sum_{i=1}^{n_t} \lVert f_{\theta_{t-1}}(x_i) - f_{\theta_t}(x_i)\rVert$. Note that $L_{t,new}$ and $L_{t,semanAug}$ are applied to both the original and synthesized samples. Algorithm 1 presents the pseudocode of IL2A.
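As a complement to Algorithm 1, here is a minimal PyTorch-style sketch of the semanAug term in Eq. (7) and of how it could enter the total objective of Eq. (8). The classifier and statistics variable names are assumptions, and the new-class and distillation losses are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def seman_aug_loss(weight, bias, means, covs, gamma=2.0):
    """Implicit semantic augmentation loss for old classes (Eq. 7).

    weight: (C_all, d) classifier weights; bias: (C_all,)
    means:  (C_old, d) stored class means; covs: (C_old, d, d) stored covariances.
    """
    C_old = means.size(0)
    loss = 0.0
    for k in range(C_old):
        v = weight - weight[k]                        # v_{c,k} = phi_c - phi_k, shape (C_all, d)
        logits = weight @ means[k] + bias             # phi_c^T mu_k + b_c, shape (C_all,)
        sigma_term = 0.5 * gamma * torch.einsum('cd,de,ce->c', v, covs[k], v)
        target = torch.tensor([k], device=weight.device)
        loss = loss + F.cross_entropy((logits + sigma_term).unsqueeze(0), target)
    return loss / C_old

# Total objective of Eq. (8); `ce_new` is the cross-entropy on the current (classAug-augmented)
# batch and `kd` the feature distillation term, both assumed to be computed elsewhere:
# total = ce_new + alpha * seman_aug_loss(fc.weight, fc.bias, means, covs) + beta * kd
```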
## 4 Experiments

### 4.1 Evaluation Protocol

```
Algorithm 1: IL2A — dual augmentation algorithm

Randomly initialize Θ_0 = {θ_0, φ_0}; S_0 = ∅
for each incremental stage t ∈ {1, ..., T}:
    Input: model Θ_{t-1}, data D_t = {(x_i, y_i)}_{i=1}^{n_t}
    Output: model Θ_t
    Θ_t ← Θ_{t-1}
    build D_{t,aug} = {(x'_i, y'_i)}_{i=1}^{n'_t} via classAug
    add class nodes for the augmented classes
    if t = 1:
        train Θ_t by minimizing L(g_φ(f_θ(x')), y')
    else:
        train Θ_t by minimizing Eq. (8)
    s ← compute {μ, Σ} for each class in D_t
    S_t ← S_{t-1} ∪ s
    remove the augmented class nodes from the classifier
```

Datasets. We perform our experiments on CIFAR-100 [49] and Tiny-ImageNet [56]. A common setting is to train the model on half of the classes in the first task, and on an equal number of classes in each of the remaining incremental steps. Based on this, we split CIFAR-100 into different settings: 50 + 5 × 10, 50 + 10 × 5, and 40 + 20 × 3. For instance, 50 + 10 × 5 means that the first task contains 50 classes and there are 5 classes in each of the following 10 tasks. Similarly, the settings for Tiny-ImageNet are 100 + 5 × 20, 100 + 10 × 10, and 100 + 20 × 5. Intuitively, more classes per task requires the model to solve a harder problem at each step, while increasing the length of the task sequence challenges the model's retention.

Implementation Details. In our experiments, we follow [44] and utilize ResNet-18 [1] as our base architecture, training it from scratch in each experiment. All models are trained with the Adam [57] optimizer, using an initial learning rate of 0.001 for 100 epochs with a mini-batch size of 64. The learning rate is reduced by a factor of 10 at epochs 45 and 90. We use the same hyper-parameter values for all experiments. Specifically, we set α = 10 and β = 10 in Eq. (8). The number of augmented classes (i.e., m) depends on the number of (original) classes at the current incremental step. Taking CIFAR-100 as an example, m is 45 for the 5-phase setting, where each incremental step has 10 classes, and m is 10 for the 10-phase setting, where each incremental step has 5 classes. At the end of each incremental stage, we evaluate the model on all seen classes after removing the class nodes of the m augmented classes from the classifier. Our code is available at https://github.com/Impression2805/IL2A.

Comparison Methods. Our method (IL2A) does not store any old samples for replay when learning new classes. Therefore, we first compare IL2A with several non-exemplar-based approaches: MAS [16], LwF-MC [13], MUC [58], and LwM [59]. In addition, we also compare with several exemplar-based methods such as iCaRL [13], EEIL [18], and LUCIR [19]. Specifically, for the data-replay-based methods, we follow [13, 19] and store 20 samples per class using the herd selection technique [13].

Figure 5: Results of top-1 accuracy on CIFAR-100 and Tiny-ImageNet under different settings (5, 10, and 20 phases). Solid lines denote methods that do not store old exemplars; dashed lines denote data-replay-based methods.
We report the average top-1 accuracy over all previously seen classes up to each incremental step $t$. For iCaRL, we report the results of both CNN predictions and nearest-mean-of-exemplars classification, denoted as iCaRL-CNN and iCaRL-NME, respectively.

### 4.2 Experimental Results

Main Results. Comparative results are shown in Figure 5. Firstly, we observe that our method performs much better than non-exemplar-based methods such as LwF-MC and MUC in terms of the accuracy curves under different settings. In particular, the gap appears unbridgeable in the long-step Class-IL settings, e.g., 10 phases and 20 phases. This suggests that only constraining old parameters does not suffice to prevent forgetting; we argue that this is partly due to the unaddressed classifier bias. When compared to representative data-replay-based methods such as iCaRL, EEIL, and LUCIR, our method shows remarkably strong performance without storing old samples. The success of our method can be attributed to the proposed classAug and semanAug. Specifically, classAug is applied to the new classes of the current task, which enables the model to learn more transferable and diverse representations for future classes and, in turn, reduces the forgetting of old parameters when learning new classes. SemanAug, on the other hand, is applied to the old classes of previous tasks; it leverages the valuable distribution information of old classes to learn a unified classifier that connects the classes from different tasks to each other.

Ablation Study. To evaluate the effect of each component in IL2A, we perform an ablation study and show the results of the 10-phase setting (CIFAR-100) in Table 1. Specifically, the baseline denotes the method that does not generate pseudo-instances using semanAug, but only replays the class mean of each old class when training on new classes. By doing so, we aim to validate the effectiveness of semanAug compared with only replaying class means.

Table 1: The effect of each component in IL2A.

| Method \ Incremental stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Final |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knowledge Distillation | 78.78 | 30.18 | 20.71 | 14.61 | 11.87 | 8.80 | 7.70 | 7.23 | 7.10 | 6.05 | 6.04 |
| baseline | 78.86 | 62.85 | 56.96 | 54.66 | 51.72 | 47.33 | 43.61 | 40.12 | 40.76 | 36.55 | 34.71 |
| + semanAug | 79.16 | 69.14 | 60.68 | 58.18 | 54.77 | 50.89 | 48.45 | 46.29 | 46.97 | 44.38 | 42.09 |
| + classAug | 79.72 | 68.30 | 64.15 | 60.15 | 56.21 | 52.61 | 51.48 | 46.48 | 46.36 | 43.63 | 41.56 |
| + classAug + semanAug | 81.08 | 74.54 | 66.28 | 63.89 | 58.80 | 54.97 | 51.32 | 48.64 | 49.74 | 47.05 | 45.07 |

In summary, we observe that: (1) The baseline improves the performance of KD significantly. (2) SemanAug improves the performance of the baseline from 34.71% to 42.09%. These results indicate the effect of the distribution information for maintaining old knowledge in Class-IL. (3) ClassAug also has a remarkable effect on the baseline, and (4) the performance can be further improved by combining it with semanAug, which indicates that the two modules are complementary. Similar results are observed in the other settings on CIFAR-100 and Tiny-ImageNet. (5) As for computational complexity, classAug involves input-level sample mixing, and the augmented samples must be fed through the feature extractor, whereas semanAug performs implicit old-instance generation directly in the deep feature space. Therefore, semanAug is cheaper than classAug from a computational perspective.
### 4.3 Further Analysis

ClassAug Improves both Plasticity and Stability in Class-IL. To analyze the effectiveness of classAug more concretely, we explore how it affects the new-task accuracy and the average forgetting (CIFAR-100, 10-phase setting). Average forgetting [60] is defined to estimate the forgetting of previous tasks. The forgetting measure $f_k^i$ of the $i$-th task after training the $k$-th task is defined as

$$f_k^i = \max_{t \in \{1,...,k-1\}} \big(a_{t,i} - a_{k,i}\big), \quad \forall\, i < k,$$

where $a_{m,n}$ is the accuracy of task $n$ after training task $m$. The average forgetting measure $F_k$ is then defined as $F_k = \frac{1}{k-1}\sum_{i=1}^{k-1} f_k^i$. Intuitively, the new-task accuracy can be viewed as the plasticity of the incremental learner and the average forgetting as its stability. Figure 6 (a) and (b) report the results, from which we see that classAug simultaneously improves the new-task accuracy and reduces the average forgetting. In particular, the significant improvement in new-task accuracy implies that a model trained with classAug is a good initialization for the following tasks. Consequently, classAug effectively improves the trade-off between plasticity and stability of a continual learner.

Figure 6: (a, b) ClassAug can simultaneously improve the new-task accuracy and reduce the average forgetting. (c) Compared with classAug, Mixup and LS have a negative effect for Class-IL.

Comparing ClassAug with Other Regularizers. We compare the proposed classAug with Mixup and LS in Figure 6 (c), where the baseline (with semanAug) represents our IL2A without classAug. As can be seen, Mixup and LS have a negative effect on the final accuracy. This phenomenon can be interpreted based on the analysis in Section 3.1.1 and Figure 2 (b): those regularizers result in more compressed representations, damaging the transferability of the representations. Besides, the label smoothing strategy also affects the weights of old classes in the classifier, thus increasing the classifier bias. Similar results have also been reported in [61].

Discussion of the Covariance Matrix. In our main experiments, we use the original covariance matrix for semanAug. However, storing the original covariance matrix might be inefficient when the matrix dimension is large. An alternative is to store only the elements on the diagonal, which greatly reduces the memory cost. Figure 7 also reports the results of using the diagonal covariance matrix. Under different settings, using the original covariance matrix is slightly better than the diagonal form. This is reasonable because the original covariance matrix stores more distribution information of the old classes. However, using the diagonal covariance matrix would be more memory-efficient in practice.

Figure 7: Original vs. diagonal covariance matrix (CIFAR-100).
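To make the memory trade-off concrete, the per-class statistics used by semanAug (the {μ, Σ} step of Algorithm 1) could be computed as in the minimal NumPy sketch below, where the diagonal option stores only the vector of per-dimension variances; the function and variable names are assumptions.

```python
import numpy as np

def class_statistics(features, labels, diagonal=False):
    """Compute per-class mean and covariance of deep features.

    features: (n, d) array of deep features; labels: (n,) integer class labels.
    With diagonal=True, only the d per-dimension variances are stored per class.
    """
    stats = {}
    for k in np.unique(labels):
        feats_k = features[labels == k]
        mu = feats_k.mean(axis=0)
        cov = np.cov(feats_k, rowvar=False)            # (d, d) full covariance
        stats[int(k)] = (mu, np.diag(cov) if diagonal else cov)
    return stats

# For d = 512 (ResNet-18 features), a full float32 covariance takes about 1 MB per class,
# while the diagonal takes about 2 KB.
```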
ClassAug Improves Confidence Reliability. During the continuous use of a machine learning system in open-world applications, there are mainly three key steps [62]. The first step is out-of-distribution (OOD) detection [63], which requires the system to detect unknown samples from novel classes. The second step is to label the collected unknown samples by humans or automatic algorithms [64]. Finally, the system must scale and adapt incrementally to learn the novel classes, which is the Class-IL problem studied in this paper. Recent studies have found that DNNs are overconfident in their predictions [63, 65] and lack the ability to detect samples from unknown classes. In real-world applications, we expect a continual learner to have good OOD detection ability, so we explore the OOD detection ability of the proposed classAug. Concretely, we train a ResNet-18 on CIFAR-10, and the test samples from CIFAR-10 are in-distribution. For OOD examples, we test on MNIST [66], Fashion-MNIST [67], LSUN (resized) [68], and Tiny-ImageNet (resized). As shown in Table 2, classAug noticeably improves the OOD detection performance of the baseline [63] on commonly used metrics such as AUROC, AUPR-In, and AUPR-Out [63]. By learning to recognize synthetic samples, DNNs learn more robust and transferable representations that generalize to OOD samples. Moreover, as shown in Table 2, Mixup sometimes damages OOD detection performance, which further demonstrates the superiority of classAug.

Table 2: OOD detection results (higher is better for all metrics).

| OOD dataset | AUROC baseline | AUROC Mixup | AUROC classAug | AUPR-In baseline | AUPR-In Mixup | AUPR-In classAug | AUPR-Out baseline | AUPR-Out Mixup | AUPR-Out classAug |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | 87.02 | 92.46 | 94.99 | 79.89 | 89.00 | 93.05 | 92.26 | 95.48 | 97.20 |
| Fashion-MNIST | 90.28 | 93.37 | 94.40 | 86.18 | 89.11 | 92.43 | 94.26 | 96.19 | 96.78 |
| LSUN | 88.50 | 88.80 | 93.90 | 83.48 | 74.71 | 91.08 | 92.92 | 94.09 | 96.73 |
| Tiny-ImageNet | 88.49 | 84.96 | 93.92 | 83.84 | 64.02 | 91.77 | 92.70 | 92.19 | 96.55 |
| Mean | 88.57 | 89.90 | 94.30 | 83.35 | 79.21 | 92.08 | 93.04 | 94.49 | 96.81 |

## 5 Conclusion

In this paper, we propose a simple and effective dual augmentation framework to address the representation bias and classifier bias in Class-IL. We first investigate the transferability (or forgetting) of representations via spectral decomposition, which motivates classAug, a method that learns transferable, diverse, and less compact representations for IL. Furthermore, we propose semanAug, which implicitly generates infinite instances of old classes in the deep feature space while jointly learning the unified classifier. Experiments show that our method achieves remarkable performance compared with state-of-the-art Class-IL methods. Future work will consider the dual augmentation framework in more challenging scenarios such as Class-IL with distribution shift and OOD data, few-shot Class-IL, and federated incremental learning.

## Acknowledgements

This work has been supported by the National Key Research and Development Program under Grant No. 2018AAA0100400, the National Natural Science Foundation of China (NSFC) grants U20A20223, 61633021, 62076236, 61721004, the Key Research Program of Frontier Sciences of CAS under Grant ZDBS-LY-7004, and the Youth Innovation Promotion Association of CAS under Grant 2019141.

## References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.
[4] Gregory Ditzler, Manuel Roveri, Cesare Alippi, and Robi Polikar. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.
[5] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
[6] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[7] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 2020.
[8] Ian J. Goodfellow, M. Mirza, Xia Da, Aaron C. Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. CoRR, 2014.
[9] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, pages 109–165, 1989.
[10] Robert M. French. Interactive tandem networks and the sequential learning problem. Citeseer.
[11] J. Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, J. Veness, G. Desjardins, Andrei A. Rusu, K. Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, C. Clopath, D. Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pages 3521–3526, 2017.
[12] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell., pages 2935–2947, 2018.
[13] Sylvestre-Alvise Rebuffi, A. Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 5533–5542, 2017.
[14] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, pages 524–540, 2020.
[15] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, pages 3987–3995, 2017.
[16] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, pages 139–154, 2018.
[17] Y. Wu, Yan-Jia Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019.
[18] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In ECCV, pages 233–248, 2018.
[19] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In CVPR, pages 831–839, 2019.
[20] Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. Maintaining discrimination and fairness in class incremental learning. In CVPR, pages 13205–13214, 2020.
[21] Laurens Maaten, Minmin Chen, Stephen Tyree, and Kilian Weinberger. Learning with marginalized corrupted features. In ICML, pages 410–418, 2013.
[22] Yulin Wang, Gao Huang, Shiji Song, Xuran Pan, Yitong Xia, and Cheng Wu. Regularizing deep networks with semantic data augmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[23] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[24] Gido M. van de Ven and Andreas S. Tolias.
Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
[25] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. PODNet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, pages 86–102, 2020.
[26] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2018.
[27] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NeurIPS, 2017.
[28] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019.
[29] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NeurIPS, pages 2994–3003, 2017.
[30] Chenshen Wu, L. Herranz, X. Liu, Y. Wang, Joost van de Weijer, and B. Raducanu. Memory replay GANs: Learning to generate new categories without forgetting. In NeurIPS, pages 5962–5972, 2018.
[31] Ye Xiang, Ying Fu, Pan Ji, and Hua Huang. Incremental learning using conditional adversarial networks. In ICCV, pages 6618–6627, 2019.
[32] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. In ICLR, 2018.
[33] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[34] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, pages 7765–7773, 2018.
[35] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In ICML, pages 4548–4557, 2018.
[36] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
[37] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[38] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6023–6032, 2019.
[39] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In ICML, pages 552–560, 2013.
[40] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep feature interpolation for image content changes. In CVPR, pages 7064–7073, 2017.
[41] Qi Cai, Yu Wang, Yingwei Pan, Ting Yao, and Tao Mei. Joint contrastive learning with infinite possibilities. In NeurIPS, 2020.
[42] Shuang Li, Mixue Xie, Kaixiong Gong, Chi Harold Liu, Yulin Wang, and Wei Li. Transferable semantic augmentation for domain adaptation. arXiv preprint arXiv:2103.12562, 2021.
[43] Shuang Li, Kaixiong Gong, Chi Harold Liu, Yulin Wang, Feng Qiao, and Xinjing Cheng. MetaSAug: Meta semantic augmentation for long-tailed visual recognition. arXiv preprint arXiv:2103.12579, 2021.
[44] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In NeurIPS, 2020.
[45] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
[46] Clayton Shonkwiler. Poincaré duality angles for Riemannian manifolds with boundary. arXiv preprint arXiv:0909.1967, 2009.
[47] Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in R^n. Linear Algebra and its Applications, 171:81–98, 1992.
[48] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In ICML, pages 1081–1090, 2019.
[49] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
[50] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In NeurIPS, pages 4696–4705, 2019.
[51] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, pages 6438–6447, 2019.
[52] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
[53] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[54] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Björn Ommer, and Joseph Paul Cohen. Revisiting training strategies and generalization performance in deep metric learning. In ICML, 2020.
[55] Eden Belouadah and Adrian Popescu. IL2M: Class incremental learning with dual memory. In ICCV, pages 583–592, 2019.
[56] Leon Yao and John Miller. Tiny ImageNet classification with convolutional neural networks. CS 231N.
[57] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[58] Yu Liu, Sarah Parisot, Gregory G. Slabaugh, Xu Jia, Ales Leonardis, and Tinne Tuytelaars. More classifiers, less forgetting: A generic multi-classifier paradigm for incremental learning. In ECCV, pages 699–716, 2020.
[59] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In CVPR, pages 5138–5146, 2019.
[60] Arslan Chaudhry, P. Dokania, Thalaiyasingam Ajanthan, and P. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, pages 532–547, 2018.
[61] Sudhanshu Mittal, Silvio Galesso, and Thomas Brox. Essentials for class incremental learning. arXiv preprint arXiv:2102.09517, 2021.
[62] Xu-Yao Zhang, Cheng-Lin Liu, and Ching Y. Suen. Towards robust pattern recognition: A review. Proceedings of the IEEE, 108(6):894–922, 2020.
[63] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
[64] Kai Han, Sylvestre-Alvise Rebuffi, Sébastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. AutoNovel: Automatically discovering and learning novel visual categories. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[65] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
[66] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 2005.
[67] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[68] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao.
LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv, abs/1506.03365, 2015.