# Class-Incremental Learning via Dual Augmentation

Fei Zhu1,2, Zhen Cheng1,2, Xu-Yao Zhang1,2, Cheng-Lin Liu1,2,3
1NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3Center for Excellence of Brain Science and Intelligence Technology, CAS
{zhufei2018, chengzhen2019}@ia.ac.cn, {xyz, liucl}@nlpr.ia.ac.cn

## Abstract

Deep learning systems typically suffer from catastrophic forgetting of past knowledge when acquiring new skills continually. In this paper, we emphasize two dilemmas, representation bias and classifier bias, in class-incremental learning, and present a simple and novel approach that employs explicit class augmentation (classAug) and implicit semantic augmentation (semanAug) to address the two biases, respectively. On the one hand, we propose to address the representation bias by learning transferable and diverse representations. Specifically, we investigate the feature representations in incremental learning based on spectral analysis and present a simple technique, called classAug, that lets the model see more classes during training, so that it learns representations transferable across classes. On the other hand, to overcome the classifier bias, semanAug implicitly generates an infinite number of instances of old classes in the deep feature space, which poses tighter constraints to maintain the decision boundary of previously learned classes. Without storing any old samples, our method performs comparably with representative data-replay-based approaches.

## 1 Introduction

Deep neural networks (DNNs) have enabled great success in many machine learning tasks, based on stationary, large-scale, computationally expensive, and memory-intensive training data [1, 2, 3]. Yet the need to acquire sequential experience in dynamic and open environments [4, 5, 6] poses a serious challenge to modern deep learning systems, which only perform well on homogenized, balanced, and shuffled data [7]. Typically, DNNs suffer from drastic performance degradation on previously learned tasks after learning new knowledge, a well-documented phenomenon known as catastrophic forgetting [8, 9, 10]. Recently, incremental learning (IL), also referred to as lifelong learning or continual learning, has received extensive attention [11, 12, 13, 14] as a way to enable DNNs to preserve and extend knowledge continually. Many earlier studies focus on task-incremental learning, which uses separate output layers for different tasks and requires the task identity at inference time [11, 15, 16]. In this work, we consider the more realistic and challenging setting of class-incremental learning (Class-IL), where the model only has access to data of new classes at each stage and needs to learn a unified classifier that can classify all seen classes [13, 17, 18].

Unfortunately, the learning paradigm of Class-IL leads to two problems: representation bias and classifier bias, as shown in Figure 1. First, for representation learning, if the feature extractor is fixed after learning old classes, the learned representations are preserved but lack transferability for new classes; on the contrary, if we update the feature extractor on new classes, the updated representations are no longer suitable for old classes. Consequently, the old and new classes would easily overlap in the deep feature space.
We denote this dilemma as the representation bias. Second, to distinguish new classes from old classes, the training loss is typically calculated over all classes. Without old training data, the class weights of old classes would be ill-updated and mismatched with the updated representation space. We denote this dilemma as the classifier bias. In this work, we investigate the learning of the representation and the classifier in incremental learning and propose a simple and effective dual augmentation framework to overcome these two biases in Class-IL without storing and replaying training data of old classes.

Figure 1: Two inherent problems in Class-IL: representation bias and classifier bias.

Learning Representation for Incremental Learning. Existing works typically regularize network parameters explicitly [11, 15, 16] or implicitly [12] to reduce the representation shift when learning new classes. In this paper, instead of asking how to keep previously learned representations unchanged, we investigate the following question: What properties of learned representations could facilitate incremental learning? We hypothesize that learning transferable and diverse representations is an important requirement for incremental learning. Intuitively, with such representations, it could be easier to find a model that performs well on all tasks and improves both plasticity and stability, since different tasks would be closer in parameter space. From a spectral analysis viewpoint, we investigate which components of feature representations are more transferable and less forgettable in the incremental learning process. We find that spectral components with large eigenvalues are less forgettable. Furthermore, we exploit this finding to propose a simple technique named classAug, which enlarges the spectral components and thus introduces more diverse and transferable representations for incremental learning.

Learning Classifier for Incremental Learning. Recently, several works were proposed to alleviate the classifier bias in data-replay-based methods [18, 19, 20]. However, in the non-exemplar-based (i.e., without storing and replaying old data) Class-IL setting, the classifier bias is more serious and the above methods cannot be directly used. A straightforward way is to store instances of old classes in the deep feature space. However, this strategy is undesirable due to limited memory resources and poor scalability. This work delves into classifier learning for Class-IL and proposes an implicit semantic augmentation (semanAug) approach that generates an infinite number of instances of old classes in the deep feature space by leveraging their distribution information. SemanAug is inspired by MCF [21] and ISDA [22], which have performed semantic augmentation for linear models and DNNs, respectively. However, both the way we leverage semantic augmentation and our motivation fundamentally differ from theirs [21, 22].

Contributions. (i) We provide new insights into representation learning for incremental learning by analyzing the structural characteristics of the learned embedding space via spectral decomposition, and we find that spectral components with large eigenvalues are less forgettable and carry more transferable features.
Based on this observation, we propose a simple and effective method, classAug, to learn a better embedding space for incremental learning. (ii) For classifier learning in incremental learning, we propose semanAug, which implicitly generates an infinite number of instances of old classes in the deep feature space to maintain the decision boundary of previously learned classes. (iii) Extensive experiments on benchmark datasets demonstrate the superior performance of our dual augmentation framework in the challenging scenario of Class-IL.

## 2 Related Work

Incremental Learning. Diverse approaches have been proposed for incremental learning of DNNs. They can be roughly divided into three categories: regularization-based, data-replay-based, and architecture-based approaches. Regularization-based methods focus on weight regularization by estimating the importance of network weights and preventing the important ones from changing [11, 15, 16]. These methods differ in how they compute the importance of the parameters. However, it is hard to design a reasonable metric to measure parameter importance, and regularization strategies are known to perform poorly in the Class-IL scenario [23, 24]. Data-replay-based methods address both the representation bias and the classifier bias straightforwardly by storing a fraction of old data to jointly train the model with current data. With stored real samples, some works [17, 13, 25] use a distillation loss to prevent forgetting, while others [26, 27, 28] develop gradient-based regularization to make more efficient use of the rehearsal data. To avoid storing real data, another line of work generates pseudo-samples of all previous classes for replay using deep generative models [29, 30, 31, 32]. Nevertheless, storing real data is undesirable in resource-limited scenarios or those with privacy and safety concerns. Moreover, training large generative models for complex datasets is inefficient. Architecture-based methods dynamically extend the network structure during the course of incremental learning [33, 34, 35, 36]. However, growing the architecture is unfeasible for large numbers of tasks, and those methods are often impractical for Class-IL.

Data Augmentation. The literature is rich on data augmentation for improving the generalization of DNNs. Classical strategies commonly synthesize new positive samples in a way that is consistent with the underlying data distribution of the original dataset [3]. Recent works show that label-mixing-based methods such as Mixup [37] and CutMix [38] can greatly improve the generalization of DNNs. Complementary to the input-space augmentations mentioned above, some works have explored feature-space augmentations, which augment the learned representations in the deep embedding space to enhance classifier performance. The intuition behind those works is that certain directions in the deep feature space correspond to meaningful semantic transformations [39, 40]. For instance, deep feature interpolation [40] leverages simple interpolations in the embedding space to achieve semantic augmentation. The recently proposed ISDA [22] performs semantic augmentation by estimating and leveraging the category-wise distribution of deep representations in an online manner. Despite its simplicity, ISDA has shown its effectiveness in semi-supervised learning [22], contrastive learning [41], domain adaptation [42], and long-tailed recognition [43].
## 3 Dual Augmentation Framework for Class-Incremental Learning

We first formalize the problem of Class-IL, then introduce the proposed classAug for representation learning and semanAug for classifier learning, respectively. Finally, we present the dual augmentation framework for Class-IL by combining the two augmentations.

Problem Definition. Typically, a Class-IL problem involves the sequential learning of $T$ tasks that consist of disjoint class sets, and the model has to classify all seen classes at any given point in training. At incremental step $t \in \{1, ..., T\}$, $(x, y) \in \mathcal{D}_t$ denotes a training sample, where $x$ is a sample in the input space $\mathcal{X}$ and $y \in \mathcal{C}_t$ is its corresponding label; $\mathcal{C}_t$ is the class set of task $t$. To facilitate analysis, we represent the DNN-based model with two components: a feature extractor and a unified classifier. Specifically, the feature extractor $f_\theta: \mathcal{X} \rightarrow \mathcal{Z}$, parameterized by $\theta$, maps the input $x$ to a feature vector $z = f_\theta(x) \in \mathbb{R}^d$ in the deep feature space $\mathcal{Z}$; the unified classifier $g_\phi: \mathcal{Z} \rightarrow \mathbb{R}^{|\mathcal{C}_{1:t}|}$, parameterized by $\phi$, produces a probability distribution $g_\phi(z)$ as the prediction for $x$. Denote the overall parameters by $\Theta = (\theta, \phi)$. The general objective is to correctly classify test examples from all seen classes [44]. The key challenge of Class-IL is that data from previous tasks are assumed to be unavailable, which means that the best configuration of the model for all seen tasks must be sought by minimizing a predefined loss function $\mathcal{L}$ (e.g., cross-entropy) on the current data $\mathcal{D}_t$:

$$\underset{\theta,\phi}{\operatorname{argmin}}\;\; \mathbb{E}_{(x,y)\sim\mathcal{D}_t}\big[\mathcal{L}(g_\phi(f_\theta(x)), y)\big]. \tag{1}$$

A widely used strategy to preserve old knowledge is knowledge distillation [45], which typically matches the current model with the previous model's response to current training data using the teacher-student framework [12, 13, 19].

### 3.1 Learning Representation with Class Augmentation

As we focus on non-exemplar-based Class-IL, we intentionally avoid storing training samples of old classes. To maintain the generalizability of the learned representations for old classes, existing methods typically restrain the feature extractor from changing [11, 15, 16, 12]. However, this leads to a trade-off between plasticity and stability [5], and it becomes hard to perform long-step incremental learning. Our high-level idea is to learn transferable and diverse representations to bridge the old and new classes in a better feature space. To delve into this problem, we want to answer two questions: (1) Which part of the feature representations tends to be forgotten in incremental learning? (2) How can we facilitate representation learning for incremental learning?

#### 3.1.1 Analyzing Forgetting via Spectral Decomposition

In what follows, we explore which part of the feature representations tends to be forgotten and may not be transferable across different tasks in incremental learning. To this end, we propose to quantify the sensitivity of the model to different directions in the deep feature space by measuring the similarity of the space before and after learning new tasks. Formally, consider a feature extractor $f_{\theta,old}$ trained on a dataset $\mathcal{D}_{old} = \{(x_i, y_i)\}_{i=1}^{n}$. A new dataset $\mathcal{D}_{new}$ that contains classes disjoint from those in $\mathcal{D}_{old}$ is used to update $f_{\theta,old}$, and the updated feature extractor is denoted as $f_{\theta,new}$. For the samples in $\mathcal{D}_{old}$, we can obtain two groups of deep features, mapped by $f_{\theta,old}$ and $f_{\theta,new}$, respectively.
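For concreteness, the two groups of features can be collected with a short PyTorch-style routine. This is a minimal sketch rather than the paper's released code; the extractor and data-loader names are placeholders.

```python
import torch

@torch.no_grad()
def collect_features(extractor, loader, device="cpu"):
    """Run a frozen feature extractor over a dataset and stack the d-dimensional features."""
    extractor.eval().to(device)
    feats = [extractor(x.to(device)).cpu() for x, _ in loader]
    return torch.cat(feats)  # shape: (n, d)

# Hypothetical usage: f_old / f_new are the extractors before and after learning the
# new task, and old_loader iterates over D_old.
# Z_old = collect_features(f_old, old_loader)
# Z_new = collect_features(f_new, old_loader)
```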
Using eigenvalue decomposition, we can decompose the features mapped by the original feature extractor (i.e., $f_{\theta,old}(x_i)$), as well as the features mapped by the updated feature extractor (i.e., $f_{\theta,new}(x_i)$), into different directions as follows:

$$\frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i)\,f_\theta(x_i)^{T} = \sum_{j=1}^{d} u_j \lambda_j u_j^{T}, \tag{2}$$

where $\lambda_j$ is the eigenvalue with index $j$, $u_j$ is its eigenvector, and $d$ is the dimensionality of the feature space. Through the spectral factorization in Eq. (2), we can represent the original and new representations with two groups of eigenvectors: $\{u_{old,1}, ..., u_{old,d}\}$ and $\{u_{new,1}, ..., u_{new,d}\}$. Next, we investigate the forgetting, or transferability, of each direction. Shonkwiler [46] introduced principal angles [47] to measure the similarity of two subspaces. However, it is unreasonable to treat all eigenvectors equally when calculating principal angles, regardless of their relative eigenvalues. Inspired by [48], we instead use corresponding angles, denoted by $\psi$, to explore the distance between two subspaces in incremental learning:

Definition 1 (Corresponding Angle). Given two groups of eigenvectors $\{u_{old,1}, ..., u_{old,d}\}$ and $\{u_{new,1}, ..., u_{new,d}\}$, the corresponding angle is the angle between the two eigenvectors sharing the same eigenvalue index. The cosine of the corresponding angle is

$$\cos(\psi_j) = \frac{\langle u_{old,j},\, u_{new,j} \rangle}{\lVert u_{old,j} \rVert\, \lVert u_{new,j} \rVert}, \tag{3}$$

where $u_{old,j}$ is the eigenvector with the $j$-th largest eigenvalue in the old feature space, and similarly for $u_{new,j}$. Note that $\lVert u_{old,j} \rVert = 1$ and $\lVert u_{new,j} \rVert = 1$.

For IL, preserving old knowledge means maintaining the previously learned decision boundaries among classes. At the representation level, for an old class, the shape (i.e., covariance) of its distribution should not change too much. If an eigenvector direction changes only slightly after updating the feature extractor, the corresponding angle is small, and vice versa. Intuitively, the corresponding angle captures the representation shift between the old and updated feature extractors during incremental learning, and reflects the forgetting along certain directions in the deep feature space.

Based on the metric defined above, we explore the forgetting of different directions in Class-IL. We use LwF-MC [12, 13] as the baseline method and train a ResNet-18 [1] on CIFAR-100 [49] using SGD in a 2-step manner. Concretely, the model is first trained on the first 50 classes and then updated on the other 50 classes. Figure 2 (a) shows the absolute cosine values of the corresponding angles between the old and new eigenvectors. We observe that eigenvectors with larger eigenvalues produce larger similarity (smaller corresponding angles), which indicates that those directions are more transferable and less forgettable across different tasks. On the contrary, eigenvectors with small eigenvalues tend to move after updating the model on new tasks, and can be regarded as forgettable directions.

Figure 2: (a) Absolute cosine values of corresponding angles. (b) Distribution of eigenvalues for the baseline, Mixup [37], LS [50], and models trained with our classAug.
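The eigendecomposition of Eq. (2) and the corresponding angles of Eq. (3) can be computed directly with NumPy. This is a minimal sketch, assuming `Z_old` and `Z_new` are the $(n, d)$ feature matrices gathered in the sketch above.

```python
import numpy as np

def sorted_eigvectors(Z):
    """Eigendecomposition of (1/n) * Z^T Z, sorted by descending eigenvalue (Eq. 2)."""
    cov = Z.T @ Z / Z.shape[0]                 # (d, d), symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]   # columns are unit-norm eigenvectors

def corresponding_angles(Z_old, Z_new):
    """Absolute cosine of the corresponding angles between old/new eigenvectors (Eq. 3)."""
    _, U_old = sorted_eigvectors(Z_old)
    _, U_new = sorted_eigvectors(Z_new)
    # eigh returns unit-norm eigenvectors, so the column-wise inner product is the cosine
    return np.abs(np.sum(U_old * U_new, axis=0))

# Hypothetical usage with the feature matrices from the previous sketch:
# cosines = corresponding_angles(Z_old.numpy(), Z_new.numpy())
```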
Transferable and Diverse Representations. As demonstrated above, the directions with larger eigenvalues transfer better and suffer less forgetting. This thought-provoking observation indicates that our learned representations should have the following properties: (1) Transferability: the eigenvalues of the several most significant directions should be enlarged so that they transfer across tasks (or classes). (2) Diversity: the number of directions with significant eigenvalues should be increased. Note that these properties differ from what is desirable in the common single-task learning scenario. Indeed, reducing the number of directions with significant variance has been seen as a form of feature compression [51], which is linked to generalization by information theory [52, 53]. However, the usual notion of generalization may not be entirely appropriate for IL, since standard learning only aims to learn compact representations within the training classes, without considering generalizability to new classes. In IL, directions that are less discriminative for the current task could capture useful representations for future tasks. A recent paper [54] has shown that strongly compressed representations can actually hurt generalization in the deep metric learning setting. Therefore, to reduce forgetting and enhance the transferability of the representations, it is important to enlarge the eigenvalues and increase the number of eigenvectors with significant variance.

#### 3.1.2 Learning Representations via Class Augmentation

We now exploit the above analysis to propose a simple method for representation learning in Class-IL. Our key idea is to learn transferable and diverse representations by learning more classes at each incremental stage $t$. A direct way to do so is to introduce real classes from other datasets as auxiliaries. However, it is unrealistic to always have access to other real classes, and which datasets should be used remains unclear. Therefore, we propose class augmentation (classAug), which augments the original classes by synthesizing auxiliary classes based on $\mathcal{D}_t$. Concretely, inspired by Mixup [37], classAug randomly interpolates two samples $x_a$ and $x_b$ from two different classes $a$ and $b$ to generate a new sample $x_{ab}^{new}$ representing a new class:

$$x_{ab}^{new} = \lambda x_a + (1-\lambda)x_b, \tag{4}$$

where $\lambda$ is a random interpolation coefficient. For a $k$-class problem, we can generate $k(k-1)/2$ new classes using the above method, which can be further merged into $m$ auxiliary classes. As a result, the original $k$-class problem in the current task is extended to a $(k+m)$-class problem. Moreover, we restrict $\lambda$ to be sampled from the interval $[0.4, 0.6]$ to reduce the overlap between the augmented and original classes. At the end of each IL stage, the augmented class nodes in the classifier are removed.

Figure 3: Illustration of classAug.
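The following is a minimal PyTorch-style sketch of classAug for one mini-batch. The rule used here to merge the $k(k-1)/2$ class pairs into $m$ auxiliary labels (hashing the pair index modulo $m$) is an illustrative assumption, not necessarily the exact merging used by the authors.

```python
import torch

def class_aug(x, y, num_orig_classes, m, lam_range=(0.4, 0.6)):
    """Synthesize auxiliary-class samples by mixing pairs from different classes (Eq. 4).

    x: (B, C, H, W) batch of images; y: (B,) integer labels in [0, num_orig_classes).
    Returns mixed samples with labels in [num_orig_classes, num_orig_classes + m).
    """
    perm = torch.randperm(x.size(0))
    mask = y != y[perm]                                   # keep only cross-class pairs
    n_new = int(mask.sum())
    lam = torch.empty(n_new, 1, 1, 1).uniform_(*lam_range)
    x_new = lam * x[mask] + (1 - lam) * x[perm][mask]
    a, b = y[mask], y[perm][mask]
    pair_id = torch.minimum(a, b) * num_orig_classes + torch.maximum(a, b)
    y_new = num_orig_classes + pair_id % m                # merge pairs into m auxiliary classes
    return x_new, y_new

# Hypothetical usage inside a training step:
# x_aug, y_aug = class_aug(x, y, num_orig_classes=10, m=45)
# x_all, y_all = torch.cat([x, x_aug]), torch.cat([y, y_aug])
```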
Discussion. The proposed classAug is related to Mixup [37], which applies random interpolation to a pair of training samples and their respective one-hot labels. However, the interpolated samples in Mixup stay near the original data and the number of classes is unchanged, whereas in our method the number of classes is increased. By learning to classify more classes at each stage $t$, the model can learn more transferable and diverse representations. Figure 2 (b) displays and compares the eigenvalues of representations learned with different methods on the first 50 classes of CIFAR-100 (to visualize the distribution clearly, the largest eigenvalue is not included in the figure). The proposed classAug significantly enlarges the eigenvalues and produces more directions with significant variance compared with the other methods. On the contrary, Mixup and Label-Smoothing (LS) [50] lead to significantly smaller eigenvalues for the top several eigenvectors, which corresponds to more compact representations. Indeed, the compression effect of soft-label-based methods has also been demonstrated in [51, 50]. As shown in Section 4.3, classAug improves the performance of Class-IL significantly, while Mixup and LS have a negative effect in our experiments.

### 3.2 Learning Classifier with Semantic Augmentation

As demonstrated in Section 1, classifier bias is another problem in Class-IL. When learning new classes, the previously learned decision boundary suffers from catastrophic distortion, and thus test samples from old classes can easily be mapped to wrong classes. To overcome this issue, we propose semantic augmentation (semanAug), which leverages the distribution information (i.e., class mean and covariance) of old classes to regularize the learning of the classifier. Formally, for each old class $k \in \{1, ..., C_{old}\}$, we can generate $M$ instances in the deep feature space from its distribution, i.e., $\tilde{z}_k \sim \mathcal{N}(\mu_k, \gamma\Sigma_k)$, in which $\gamma$ is a non-negative coefficient. The generated instances of old classes and the real instances of new classes in the deep feature space can then be jointly fed to the classifier for minimizing the cross-entropy loss:

$$\underbrace{-\frac{1}{n_t}\sum_{i=1}^{n_t} \log \frac{e^{\phi_{y_i}^{T} z_i + b_{y_i}}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T} z_i + b_c}}}_{L_{t,new}:\ \text{loss on real features of new classes}} \;\; \underbrace{-\;\frac{1}{C_{old}\,M}\sum_{k=1}^{C_{old}}\sum_{m=1}^{M} \log \frac{e^{\phi_k^{T} \tilde{z}_{k,m} + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T} \tilde{z}_{k,m} + b_c}}}_{L_{t,old}:\ \text{loss on generated features of old classes}} \tag{5}$$

where $n_t$ is the number of training samples in the current task dataset $\mathcal{D}_t$, $C_{old}$ is the total number of old classes up to stage $t$, and $C_{all} = C_{old} + |\mathcal{C}_t|$ is the number of all seen classes at stage $t$. $\phi = [\phi_1, ..., \phi_{C_{all}}]^T \in \mathbb{R}^{C_{all}\times d}$ and $b = [b_1, ..., b_{C_{all}}]^T \in \mathbb{R}^{C_{all}}$ are the weight matrix and bias vector of the last fully connected layer, respectively. In Class-IL, the second term in Eq. (5), $L_{t,old}$, is computationally inefficient when $M$ and $C_{old}$ are large. In the following, we present an easy-to-compute way to implicitly generate infinite instances in the deep feature space for old classes.

Upper bound of $L_{t,old}$. Concretely, in the case of $M \rightarrow \infty$, the second term in Eq. (5) becomes

$$
\begin{aligned}
L_{t,old} &= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \mathbb{E}_{\tilde{z}_k}\left[-\log \frac{e^{\phi_k^{T}\tilde{z}_k + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T}\tilde{z}_k + b_c}}\right] \\
&= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \mathbb{E}_{\tilde{z}_k}\left[\log\left(\sum_{c=1}^{C_{all}} e^{(\phi_c^{T}-\phi_k^{T})\tilde{z}_k + (b_c - b_k)}\right)\right] \\
&\le \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log\left(\mathbb{E}_{\tilde{z}_k}\left[\sum_{c=1}^{C_{all}} e^{(\phi_c^{T}-\phi_k^{T})\tilde{z}_k + (b_c - b_k)}\right]\right) \\
&= \frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log\left(\sum_{c=1}^{C_{all}} e^{v_{c,k}^{T}\mu_k + (b_c - b_k) + \frac{\gamma}{2} v_{c,k}^{T}\Sigma_k v_{c,k}}\right), \tag{6}
\end{aligned}
$$

where $v_{c,k} = \phi_c - \phi_k$. The inequality follows from Jensen's inequality, $\mathbb{E}[\log(X)] \le \log \mathbb{E}[X]$, and the last equality is obtained by using the moment-generating function $\mathbb{E}[e^{tX}] = e^{t\mu + \frac{1}{2}\sigma^2 t^2}$ for $X \sim \mathcal{N}(\mu, \sigma^2)$, since $(\phi_c - \phi_k)^{T}\tilde{z}_k + (b_c - b_k)$ is a Gaussian random variable.
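Since the last equality is the only step that relies on the Gaussian assumption, it can be sanity-checked numerically. The minimal NumPy sketch below (with the bias offsets $b_c - b_k$ dropped for brevity and synthetic $v_{c,k}$, $\mu_k$, $\Sigma_k$) compares a Monte-Carlo estimate of $\mathbb{E}[\log \sum_c e^{v_{c,k}^{T}\tilde{z}}]$ with the closed form; the Monte-Carlo value should stay below the closed form, illustrating the Jensen gap.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, gamma = 8, 5, 2.0
V = rng.normal(size=(C, d))                         # rows play the role of v_{c,k}
mu = rng.normal(size=d)
A = rng.normal(size=(d, d)); Sigma = A @ A.T / d    # a random PSD covariance

z = rng.multivariate_normal(mu, gamma * Sigma, size=200_000)   # z ~ N(mu, gamma * Sigma)
inner = z @ V.T                                                # (N, C) values of v^T z

mc_expect_log = np.log(np.exp(inner).sum(axis=1)).mean()       # E[log sum_c e^{v^T z}]
closed_form = np.log(np.exp(V @ mu + 0.5 * gamma *
                            np.einsum('cd,de,ce->c', V, Sigma, V)).sum())

print(mc_expect_log, closed_form)   # Monte-Carlo estimate <= closed-form upper bound
```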
As can be seen, Eq. (6) is an upper bound of the original $L_{t,old}$, which provides an elegant and much more efficient way to implicitly generate infinite instances in the deep feature space for old classes. The upper bound in Eq. (6) can be written in the common cross-entropy loss form, which we denote $L_{t,semanAug}$:

$$L_{t,semanAug} = -\frac{1}{C_{old}}\sum_{k=1}^{C_{old}} \log \frac{e^{\phi_k^{T}\mu_k + b_k}}{\sum_{c=1}^{C_{all}} e^{\phi_c^{T}\mu_k + b_c + \frac{\gamma}{2} v_{c,k}^{T}\Sigma_k v_{c,k}}}. \tag{7}$$

Intuitively, $L_{t,old}$ implicitly performs semantic transformations of $\mu_k$ based on $\Sigma_k$. To maintain the decision boundary, $\gamma$ should be smaller if the distribution of a class is near the decision boundary, and bigger if the distance is relatively far. We set $\gamma = 2$ in our experiments. In addition, observe that when $\gamma = 0$, only the class means are used for knowledge retention.

Discussion. (1) Although the derivation of the upper bound in Eq. (6) is similar to ISDA [22], both our motivation and the way we leverage semanAug differ from ISDA. When learning new classes, we only apply semanAug to the class mean of each old class based on the memorized distribution information, while ISDA applies semantic augmentation to all training samples to improve generalization in standard supervised learning. In addition, a crucial step in ISDA is to estimate the mean and covariance matrix of each class in an online manner. In contrast, semanAug is naturally suitable for Class-IL, since the distribution of old classes can be estimated from all training samples at the end of each learning stage. (2) Using previous class statistics for IL has also been explored in IL2M [55]. However, our method differs from IL2M in both the statistics used and the way we leverage them. First, the class statistics in IL2M are prediction scores of the classifier, while ours are class distribution statistics in the deep feature space. Second, IL2M uses the class statistics to calibrate the predictions of a continual learner in a post-processing manner, while our method leverages the statistics to automatically learn a balanced classifier.

Figure 4: Illustration of our dual augmentation framework (IL2A) for Class-IL. On the one hand, the training samples of new classes in the current task are augmented via the proposed classAug. On the other hand, the distributions of old classes are retained by semanAug in the deep feature space.

### 3.3 The Dual Augmentation Learning Framework

With classAug for the representation bias and semanAug for the classifier bias, Figure 4 describes the learning process of the dual augmentation framework (IL2A). We also use the well-known knowledge distillation (KD) [19] for two reasons. First, classAug and KD are complementary and focus on different aspects of representation learning. Second, KD can reduce the change of the feature extractor, which is crucial for semanAug because it implicitly generates instances in the deep feature space from the old distributions. The total learning objective at each stage $t$ is as follows:

$$L_t = L_{t,new} + \alpha L_{t,semanAug} + \beta L_{t,kd}, \tag{8}$$

where $\alpha$ and $\beta$ are two hyper-parameters, $L_{t,new}$ and $L_{t,semanAug}$ are given in Eq. (5) and Eq. (7), respectively, and $L_{t,kd} = \frac{1}{n_t}\sum_{i=1}^{n_t} \lVert f_{\theta_{t-1}}(x_i) - f_{\theta_t}(x_i)\rVert$. Note that $L_{t,new}$ and $L_{t,semanAug}$ are applied to both the original and synthesized samples. Algorithm 1 presents the pseudocode of IL2A.
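As a complement to Algorithm 1, here is a minimal PyTorch-style sketch of the semanAug term in Eq. (7) and of how it could enter the total objective of Eq. (8). The classifier and statistics variable names are assumptions, and the new-class and distillation losses are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def seman_aug_loss(weight, bias, means, covs, gamma=2.0):
    """Implicit semantic augmentation loss for old classes (Eq. 7).

    weight: (C_all, d) classifier weights; bias: (C_all,)
    means:  (C_old, d) stored class means; covs: (C_old, d, d) stored covariances.
    """
    C_old = means.size(0)
    loss = 0.0
    for k in range(C_old):
        v = weight - weight[k]                        # v_{c,k} = phi_c - phi_k, shape (C_all, d)
        logits = weight @ means[k] + bias             # phi_c^T mu_k + b_c, shape (C_all,)
        sigma_term = 0.5 * gamma * torch.einsum('cd,de,ce->c', v, covs[k], v)
        target = torch.tensor([k], device=weight.device)
        loss = loss + F.cross_entropy((logits + sigma_term).unsqueeze(0), target)
    return loss / C_old

# Total objective of Eq. (8); `ce_new` is the cross-entropy on the current (classAug-augmented)
# batch and `kd` the feature distillation term, both assumed to be computed elsewhere:
# total = ce_new + alpha * seman_aug_loss(fc.weight, fc.bias, means, covs) + beta * kd
```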
## 4 Experiments

### 4.1 Evaluation Protocol

```
Algorithm 1: IL2A — dual augmentation algorithm

Randomly initialize Θ_0 = {θ_0, φ_0}; S_0 = ∅
for each incremental stage t ∈ {1, ..., T}:
    Input: model Θ_{t-1}, data D_t = {(x_i, y_i)}_{i=1}^{n_t}
    Output: model Θ_t
    Θ_t ← Θ_{t-1}
    build D_{t,aug} = {(x'_i, y'_i)}_{i=1}^{n'_t} via classAug
    add class nodes for the augmented classes
    if t = 1:
        train Θ_t by minimizing L(g_φ(f_θ(x')), y')
    else:
        train Θ_t by minimizing Eq. (8)
    s ← compute {μ, Σ} for each class in D_t
    S_t ← S_{t-1} ∪ s
    remove the augmented class nodes from the classifier
```

Datasets. We perform our experiments on CIFAR-100 [49] and Tiny-ImageNet [56]. A common setting is to train the model on half of the classes in the first task, and on an equal number of classes in each of the remaining incremental steps. Based on this, we split CIFAR-100 into different settings: 50 + 5 × 10, 50 + 10 × 5, and 40 + 20 × 3. For instance, 50 + 10 × 5 means that the first task contains 50 classes and there are 5 classes in each of the following 10 tasks. Similarly, the settings for Tiny-ImageNet are 100 + 5 × 20, 100 + 10 × 10, and 100 + 20 × 5. Intuitively, more classes per task requires the model to solve a harder problem at each step, while increasing the length of the task sequence challenges the model's retention.

Implementation Details. In our experiments, we follow [44] and utilize ResNet-18 [1] as our base architecture, training it from scratch in each experiment. All models are trained with the Adam [57] optimizer, using an initial learning rate of 0.001 for 100 epochs with a mini-batch size of 64. The learning rate is reduced by a factor of 10 at epochs 45 and 90. We use the same hyper-parameter values for all experiments. Specifically, we set α = 10 and β = 10 in Eq. (8). The number of augmented classes (i.e., m) depends on the number of (original) classes at the current incremental step. Taking CIFAR-100 as an example, m is 45 for the 5-phase setting, where each incremental step has 10 classes, and m is 10 for the 10-phase setting, where each incremental step has 5 classes. At the end of each incremental stage, we evaluate the model on all seen classes after removing the class nodes of the m augmented classes from the classifier. Our code is available at https://github.com/Impression2805/IL2A.

Comparison Methods. Our method (IL2A) does not store any old samples for replay when learning new classes. Therefore, we first compare IL2A with several non-exemplar-based approaches: MAS [16], LwF-MC [13], MUC [58], and LwM [59]. In addition, we also compare with several exemplar-based methods such as iCaRL [13], EEIL [18], and LUCIR [19]. Specifically, for the data-replay-based methods, we follow [13, 19] and store 20 samples per class using the herd selection technique [13].

Figure 5: Results of top-1 accuracy on CIFAR-100 and Tiny-ImageNet under different settings (5, 10, and 20 phases). Solid lines denote methods that do not store old exemplars; dashed lines denote data-replay-based methods.
We report the average top-1 accuracy over all previously seen classes up to each incremental step $t$. For iCaRL, we report the results of both CNN predictions and nearest-mean-of-exemplars classification, denoted as iCaRL-CNN and iCaRL-NME, respectively.

### 4.2 Experimental Results

Main Results. Comparative results are shown in Figure 5. Firstly, we observe that our method performs much better than non-exemplar-based methods such as LwF-MC and MUC in terms of the accuracy curves under different settings. In particular, the gap appears unbridgeable in the long-step Class-IL settings, e.g., 10 phases and 20 phases. This suggests that only constraining old parameters does not suffice to prevent forgetting; we argue that this is partly due to the unaddressed classifier bias. When compared to representative data-replay-based methods such as iCaRL, EEIL, and LUCIR, our method shows remarkably strong performance without storing old samples. The success of our method can be attributed to the proposed classAug and semanAug. Specifically, classAug is applied to the new classes of the current task, which enables the model to learn more transferable and diverse representations for future classes and, in turn, reduces the forgetting of old parameters when learning new classes. SemanAug, on the other hand, is applied to the old classes of previous tasks; it leverages the valuable distribution information of old classes to learn a unified classifier that connects the classes from different tasks to each other.

Ablation Study. To evaluate the effect of each component in IL2A, we perform an ablation study and show the results of the 10-phase setting (CIFAR-100) in Table 1. Specifically, the baseline denotes the method that does not generate pseudo-instances using semanAug, but only replays the class mean of each old class when training on new classes. By doing so, we aim to validate the effectiveness of semanAug compared with only replaying class means.

Table 1: The effect of each component in IL2A.

| Method \ Incremental stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Final |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knowledge Distillation | 78.78 | 30.18 | 20.71 | 14.61 | 11.87 | 8.80 | 7.70 | 7.23 | 7.10 | 6.05 | 6.04 |
| baseline | 78.86 | 62.85 | 56.96 | 54.66 | 51.72 | 47.33 | 43.61 | 40.12 | 40.76 | 36.55 | 34.71 |
| + semanAug | 79.16 | 69.14 | 60.68 | 58.18 | 54.77 | 50.89 | 48.45 | 46.29 | 46.97 | 44.38 | 42.09 |
| + classAug | 79.72 | 68.30 | 64.15 | 60.15 | 56.21 | 52.61 | 51.48 | 46.48 | 46.36 | 43.63 | 41.56 |
| + classAug + semanAug | 81.08 | 74.54 | 66.28 | 63.89 | 58.80 | 54.97 | 51.32 | 48.64 | 49.74 | 47.05 | 45.07 |

In summary, we observe that: (1) The baseline improves the performance of KD significantly. (2) SemanAug improves the performance of the baseline from 34.71% to 42.09%. These results indicate the effect of the distribution information for maintaining old knowledge in Class-IL. (3) ClassAug also has a remarkable effect on the baseline, and (4) the performance can be further improved by combining it with semanAug, which indicates that the two modules are complementary. Similar results are observed in the other settings on CIFAR-100 and Tiny-ImageNet. (5) As for computational complexity, classAug involves input-level sample mixing, and the augmented samples must be fed through the feature extractor, whereas semanAug performs implicit old-instance generation directly in the deep feature space. Therefore, semanAug is cheaper than classAug from a computational perspective.
### 4.3 Further Analysis

ClassAug Improves both Plasticity and Stability in Class-IL. To analyze the effectiveness of classAug more concretely, we explore how it affects the new-task accuracy and the average forgetting (CIFAR-100, 10-phase setting). Average forgetting [60] is defined to estimate the forgetting of previous tasks. The forgetting measure $f_k^i$ of the $i$-th task after training the $k$-th task is defined as

$$f_k^i = \max_{t \in \{1,...,k-1\}} \big(a_{t,i} - a_{k,i}\big), \quad \forall\, i < k,$$

where $a_{m,n}$ is the accuracy of task $n$ after training task $m$. The average forgetting measure $F_k$ is then defined as $F_k = \frac{1}{k-1}\sum_{i=1}^{k-1} f_k^i$. Intuitively, the new-task accuracy can be viewed as the plasticity of the incremental learner and the average forgetting as its stability. Figure 6 (a) and (b) report the results, from which we see that classAug simultaneously improves the new-task accuracy and reduces the average forgetting. In particular, the significant improvement in new-task accuracy implies that a model trained with classAug is a good initialization for the following tasks. Consequently, classAug effectively improves the trade-off between plasticity and stability of a continual learner.

Figure 6: (a, b) ClassAug can simultaneously improve the new-task accuracy and reduce the average forgetting. (c) Compared with classAug, Mixup and LS have a negative effect for Class-IL.

Comparing ClassAug with Other Regularizers. We compare the proposed classAug with Mixup and LS in Figure 6 (c), where the baseline (with semanAug) represents our IL2A without classAug. As can be seen, Mixup and LS have a negative effect on the final accuracy. This phenomenon can be interpreted based on the analysis in Section 3.1.1 and Figure 2 (b): those regularizers result in more compressed representations, damaging the transferability of the representations. Besides, the label smoothing strategy also affects the weights of old classes in the classifier, thus increasing the classifier bias. Similar results have also been reported in [61].

Discussion of the Covariance Matrix. In our main experiments, we use the original covariance matrix for semanAug. However, storing the original covariance matrix might be inefficient when the matrix dimension is large. An alternative is to store only the elements on the diagonal, which greatly reduces the memory cost. Figure 7 also reports the results of using the diagonal covariance matrix. Under different settings, using the original covariance matrix is slightly better than the diagonal form. This is reasonable because the original covariance matrix stores more distribution information of the old classes. However, using the diagonal covariance matrix would be more memory-efficient in practice.

Figure 7: Original vs. diagonal covariance matrix (CIFAR-100).
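To make the memory trade-off concrete, the per-class statistics used by semanAug (the {μ, Σ} step of Algorithm 1) could be computed as in the minimal NumPy sketch below, where the diagonal option stores only the vector of per-dimension variances; the function and variable names are assumptions.

```python
import numpy as np

def class_statistics(features, labels, diagonal=False):
    """Compute per-class mean and covariance of deep features.

    features: (n, d) array of deep features; labels: (n,) integer class labels.
    With diagonal=True, only the d per-dimension variances are stored per class.
    """
    stats = {}
    for k in np.unique(labels):
        feats_k = features[labels == k]
        mu = feats_k.mean(axis=0)
        cov = np.cov(feats_k, rowvar=False)            # (d, d) full covariance
        stats[int(k)] = (mu, np.diag(cov) if diagonal else cov)
    return stats

# For d = 512 (ResNet-18 features), a full float32 covariance takes about 1 MB per class,
# while the diagonal takes about 2 KB.
```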
ClassAug Improves Confidence Reliability. During the continuous use of a machine learning system in open-world applications, there are mainly three key steps [62]. The first step is out-of-distribution (OOD) detection [63], which requires the system to detect unknown samples from novel classes. The second step is to label the collected unknown samples by humans or automatic algorithms [64]. Finally, the system must scale and adapt incrementally to learn the novel classes, which is the Class-IL problem studied in this paper. Recent studies have found that DNNs are overconfident in their predictions [63, 65] and lack the ability to detect samples from unknown classes. In real-world applications, we expect a continual learner to have good OOD detection ability, so we explore the OOD detection ability of the proposed classAug. Concretely, we train a ResNet-18 on CIFAR-10, and the test samples from CIFAR-10 are in-distribution. For OOD examples, we test on MNIST [66], Fashion-MNIST [67], LSUN (resized) [68], and Tiny-ImageNet (resized). As shown in Table 2, classAug noticeably improves the OOD detection performance of the baseline [63] on commonly used metrics such as AUROC, AUPR-In, and AUPR-Out [63]. By learning to recognize synthetic samples, DNNs learn more robust and transferable representations that generalize to OOD samples. Moreover, as shown in Table 2, Mixup sometimes damages OOD detection performance, which further demonstrates the superiority of classAug.

Table 2: OOD detection results (higher is better for all metrics).

| OOD dataset | AUROC baseline | AUROC Mixup | AUROC classAug | AUPR-In baseline | AUPR-In Mixup | AUPR-In classAug | AUPR-Out baseline | AUPR-Out Mixup | AUPR-Out classAug |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | 87.02 | 92.46 | 94.99 | 79.89 | 89.00 | 93.05 | 92.26 | 95.48 | 97.20 |
| Fashion-MNIST | 90.28 | 93.37 | 94.40 | 86.18 | 89.11 | 92.43 | 94.26 | 96.19 | 96.78 |
| LSUN | 88.50 | 88.80 | 93.90 | 83.48 | 74.71 | 91.08 | 92.92 | 94.09 | 96.73 |
| Tiny-ImageNet | 88.49 | 84.96 | 93.92 | 83.84 | 64.02 | 91.77 | 92.70 | 92.19 | 96.55 |
| Mean | 88.57 | 89.90 | 94.30 | 83.35 | 79.21 | 92.08 | 93.04 | 94.49 | 96.81 |

## 5 Conclusion

In this paper, we propose a simple and effective dual augmentation framework to address the representation bias and classifier bias in Class-IL. We first investigate the transferability (or forgetting) of representations via spectral decomposition, which motivates classAug, a method that learns transferable, diverse, and less compact representations for IL. Furthermore, we propose semanAug, which implicitly generates infinite instances of old classes in the deep feature space while jointly learning the unified classifier. Experiments show that our method achieves remarkable performance compared with state-of-the-art Class-IL methods. Future work will consider the dual augmentation framework in more challenging scenarios such as Class-IL with distribution shift and OOD data, few-shot Class-IL, and federated incremental learning.

## Acknowledgements

This work has been supported by the National Key Research and Development Program under Grant No. 2018AAA0100400, the National Natural Science Foundation of China (NSFC) grants U20A20223, 61633021, 62076236, 61721004, the Key Research Program of Frontier Sciences of CAS under Grant ZDBS-LY-7004, and the Youth Innovation Promotion Association of CAS under Grant 2019141.

## References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.
[4] Gregory Ditzler, Manuel Roveri, Cesare Alippi, and Robi Polikar. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.
[5] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
[6] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[7] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 2020.
[8] Ian J. Goodfellow, M. Mirza, Xia Da, Aaron C. Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. CoRR, 2014.
[9] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, pages 109–165, 1989.
[10] Robert M. French. Interactive tandem networks and the sequential learning problem. Citeseer.
[11] J. Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, J. Veness, G. Desjardins, Andrei A. Rusu, K. Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, C. Clopath, D. Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pages 3521–3526, 2017.
[12] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell., pages 2935–2947, 2018.
[13] Sylvestre-Alvise Rebuffi, A. Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 5533–5542, 2017.
[14] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, pages 524–540, 2020.
[15] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, pages 3987–3995, 2017.
[16] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, pages 139–154, 2018.
[17] Y. Wu, Yan-Jia Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019.
[18] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In ECCV, pages 233–248, 2018.
[19] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In CVPR, pages 831–839, 2019.
[20] Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. Maintaining discrimination and fairness in class incremental learning. In CVPR, pages 13205–13214, 2020.
[21] Laurens Maaten, Minmin Chen, Stephen Tyree, and Kilian Weinberger. Learning with marginalized corrupted features. In ICML, pages 410–418, 2013.
[22] Yulin Wang, Gao Huang, Shiji Song, Xuran Pan, Yitong Xia, and Cheng Wu. Regularizing deep networks with semantic data augmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[23] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[24] Gido M. van de Ven and Andreas S. Tolias.
Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
[25] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. PODNet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, pages 86–102, 2020.
[26] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2018.
[27] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NeurIPS, 2017.
[28] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019.
[29] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NeurIPS, pages 2994–3003, 2017.
[30] Chenshen Wu, L. Herranz, X. Liu, Y. Wang, Joost van de Weijer, and B. Raducanu. Memory replay GANs: Learning to generate new categories without forgetting. In NeurIPS, pages 5962–5972, 2018.
[31] Ye Xiang, Ying Fu, Pan Ji, and Hua Huang. Incremental learning using conditional adversarial networks. In ICCV, pages 6618–6627, 2019.
[32] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. In ICLR, 2018.
[33] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[34] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, pages 7765–7773, 2018.
[35] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In ICML, pages 4548–4557, 2018.
[36] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
[37] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[38] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6023–6032, 2019.
[39] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In ICML, pages 552–560, 2013.
[40] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep feature interpolation for image content changes. In CVPR, pages 7064–7073, 2017.
[41] Qi Cai, Yu Wang, Yingwei Pan, Ting Yao, and Tao Mei. Joint contrastive learning with infinite possibilities. In NeurIPS, 2020.
[42] Shuang Li, Mixue Xie, Kaixiong Gong, Chi Harold Liu, Yulin Wang, and Wei Li. Transferable semantic augmentation for domain adaptation. arXiv preprint arXiv:2103.12562, 2021.
[43] Shuang Li, Kaixiong Gong, Chi Harold Liu, Yulin Wang, Feng Qiao, and Xinjing Cheng. MetaSAug: Meta semantic augmentation for long-tailed visual recognition. arXiv preprint arXiv:2103.12579, 2021.
[44] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In NeurIPS, 2020.
[45] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
[46] Clayton Shonkwiler. Poincaré duality angles for Riemannian manifolds with boundary. arXiv preprint arXiv:0909.1967, 2009.
[47] Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in R^n. Linear Algebra and its Applications, 171:81–98, 1992.
[48] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In ICML, pages 1081–1090, 2019.
[49] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
[50] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In NeurIPS, pages 4696–4705, 2019.
[51] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, pages 6438–6447, 2019.
[52] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
[53] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[54] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Björn Ommer, and Joseph Paul Cohen. Revisiting training strategies and generalization performance in deep metric learning. In ICML, 2020.
[55] Eden Belouadah and Adrian Popescu. IL2M: Class incremental learning with dual memory. In ICCV, pages 583–592, 2019.
[56] Leon Yao and John Miller. Tiny ImageNet classification with convolutional neural networks. CS 231N.
[57] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[58] Yu Liu, Sarah Parisot, Gregory G. Slabaugh, Xu Jia, Ales Leonardis, and Tinne Tuytelaars. More classifiers, less forgetting: A generic multi-classifier paradigm for incremental learning. In ECCV, pages 699–716, 2020.
[59] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In CVPR, pages 5138–5146, 2019.
[60] Arslan Chaudhry, P. Dokania, Thalaiyasingam Ajanthan, and P. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, pages 532–547, 2018.
[61] Sudhanshu Mittal, Silvio Galesso, and Thomas Brox. Essentials for class incremental learning. arXiv preprint arXiv:2102.09517, 2021.
[62] Xu-Yao Zhang, Cheng-Lin Liu, and Ching Y. Suen. Towards robust pattern recognition: A review. Proceedings of the IEEE, 108(6):894–922, 2020.
[63] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
[64] Kai Han, Sylvestre-Alvise Rebuffi, Sébastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. AutoNovel: Automatically discovering and learning novel visual categories. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[65] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
[66] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 2005.
[67] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[68] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao.
LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv, abs/1506.03365, 2015.