# Learning with Selective Forgetting

Takashi Shibata, Go Irie, Daiki Ikami and Yu Mitsuzumi
NTT Communication Science Laboratories, NTT Corporation, Japan
{t.shibata, goirie}@ieee.org, {daiki.ikami.ef, yu.mitsuzumi.ae}@hco.ntt.co.jp

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Abstract. Lifelong learning aims to train a highly expressive model for a new task while retaining all knowledge for previous tasks. However, many practical scenarios do not always require the system to remember all of the past knowledge. Instead, ethical considerations call for selective and proactive forgetting of undesirable knowledge in order to prevent privacy issues and data leakage. In this paper, we propose a new framework for lifelong learning, called Learning with Selective Forgetting, which updates a model for the new task by forgetting only the selected classes of the previous tasks while maintaining the rest. The key is to introduce a class-specific synthetic signal called mnemonic code. The codes are watermarked on all the training samples of the corresponding classes when the model is updated for a new task. This enables us to forget arbitrary classes later by using only the mnemonic codes, without the original data. Experiments on common benchmark datasets demonstrate the remarkable superiority of the proposed method over several existing methods.

## 1 Introduction

Deep learning often suffers from a phenomenon called catastrophic forgetting: when a network is updated for a new task, its performance on previous tasks dramatically degrades. To mitigate this harmful effect, lifelong learning (or continual learning) has been explored, in which the network is updated to adapt to a new task (e.g., a new set of classes or a new instance) without forgetting the results of past learning.
Major methods can be categorized into memory-replay-based [Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017], parameter-freezing-based [Mallya and Lazebnik, 2018; Mallya et al., 2018], and regularization-based [Kirkpatrick et al., 2017; Li and Hoiem, 2017; Zenke et al., 2017; Aljundi et al., 2018] approaches. Most existing methods have been designed to learn a highly expressive model for the new task while preserving all of the knowledge of the previous tasks.

Meanwhile, artificial intelligence is currently facing a new type of problem: as it has become more practical and more connected to our everyday lives, ethical issues such as privacy protection and data leakage prevention have become critical topics. This has brought new challenges to the field, covering learning from encrypted data [Gilad-Bachrach et al., 2016], preventing learning of unintended information [Wang et al., 2019], and privacy-preserving localization [Speciale et al., 2019a; Speciale et al., 2019b], to name a few. Lifelong learning cannot avoid this issue either. Retaining the complete knowledge of all previous tasks is a double-edged sword: it may lead to data leakage and invasion of privacy. Moreover, since the complete knowledge of the previous tasks is not always necessary, it is desirable to have a mechanism for forgetting knowledge that is no longer needed. For example, a face recognition system at an office entrance gate would not need to remember the faces of staff who have transferred to other departments.

These observations motivate us to propose a new lifelong learning framework called Learning with Selective Forgetting (LSF), which aims to avoid catastrophic forgetting of previous tasks while selectively forgetting only specified sets of past classes. To the best of our knowledge, our study is the first to introduce this forgetting problem to lifelong learning and to propose a solution to it.
In this paper, we focus on task-incremental learning. The challenge is to forget only the specified classes while preventing catastrophic forgetting of the rest, without using the original data of the previous tasks. Our method solves this issue by performing a special type of data augmentation that embeds a class-specific signal, called mnemonic code, in all the samples of the corresponding class when updating the model. This makes the class information tightly linked to the corresponding code, making it possible to forget arbitrary classes later on simply by discarding the codes corresponding to those classes. Experiments on common benchmark datasets demonstrate the remarkable superiority of our proposed method over existing approaches.

## 2 Related Work

### 2.1 Lifelong Learning

We briefly review the three mainstream approaches in lifelong learning, memory-replay-based, parameter-freezing-based, and regularization-based, and highlight the contributions of our work.

Memory-replay: The memory-replay-based approach uses a set of original samples of previous tasks when updating the model for a new task [Chaudhry et al., 2018b; Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017]. Instead of using the original samples, some methods train deep generative models to generate pseudo samples [Wu et al., 2018; Shin et al., 2017]. Several recent papers have proposed algorithms to solve the problem of data imbalance between the current task and the previous tasks [Zhao et al., 2020; Wu et al., 2019; Liu et al., 2020].

Parameter-freezing: The basic idea of this approach is to use different model parameters for each task.
Several strategies have been proposed, such as switching the nodes or branches to be used depending on the task [Mallya and Lazebnik, 2018; Mallya et al., 2018] or adding new nodes or branches every time a new task is learned [Rusu et al., 2016; Aljundi et al., 2017]. A hybrid of these approaches has also been proposed [Hung et al., 2019].

Regularization: This approach leverages the previous tasks' knowledge implicitly by introducing additional regularization terms. It can be grouped into data-driven [Li and Hoiem, 2017; Hou et al., 2018; Dhar et al., 2019] and weight-constraint-based [Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018; Chaudhry et al., 2018a; Lee et al., 2017; Yu et al., 2020] methods. The former utilizes knowledge distillation, while the latter introduces a prior on the model parameters. Our method belongs to the regularization-based approach.

To summarize, the existing algorithms are designed to retain all the information of the classes of the past tasks. Unlike these, the contributions of this paper are a new problem setting for lifelong learning that requires forgetting only specified classes, and a solution to this problem.

### 2.2 Machine Unlearning

The concept of Machine Unlearning (MU) was first introduced by Cao et al. [Cao and Yang, 2015]. Its typical definition is to remove the effect of specified training samples without retraining the whole model, so that the resulting model is indistinguishable from a model trained on a dataset without those samples. General approaches either train multiple small models on separate subsets of the training data to avoid retraining the whole model [Bourtoule et al., 2019] or utilize vestiges of the learning process, i.e., the stored learned model parameters and their gradients [Wu et al., 2020].
Specialized methods for some basic learning algorithms such as linear discriminant analysis [Guo et al., 2020] and k-means [Ginart et al., 2019] have also been presented. Inspired by differential privacy [Abadi et al., 2016], Eternal Sunshine of the Spotless Net [Golatkar et al., 2020] introduced a scrubbing procedure that removes information from the trained weights of deep neural networks using the Fisher information matrix. Mixed Linear Forgetting [Golatkar et al., 2021] derived a tractable optimization problem by linearly approximating the change in weights due to the addition of training data. Variational Bayesian inference also provides a compelling approach to MU [Nguyen et al., 2020].

Our work differs from these previous studies in two points. First, we focus on lifelong learning. To the best of our knowledge, this is the first work that considers the forgetting problem in the context of lifelong learning. Second, we address the problem of class-level forgetting, i.e., making a specified set of classes unrecognizable, rather than sample-level forgetting. This is a practical forgetting problem that has not yet been thoroughly studied in the MU literature.

## 3 Learning with Selective Forgetting

Let us begin with an introduction to the standard lifelong learning setting. Denote by $\{D_1, \dots, D_k, \dots, D_K\}$ a sequence of datasets, where $D_k = \{(x^i_k, y^i_k)\}_{i=1}^{n_k}$ is the dataset of the $k$-th task, $x^i_k \in \mathcal{X}$ is an input, and $y^i_k \in \mathcal{Y}$ is its class label. While observing the datasets in a streaming manner, the purpose of standard lifelong learning is to learn a model $f_\theta: \mathcal{X} \to \mathcal{Y}$ parameterized by $\theta$ so that it can map a test input $x$ of any learned task to its correct class label $y$. We now define our new problem, illustrated in Figure 1, called Learning with Selective Forgetting (LSF). In this problem, each of the learned classes is assigned to either the preservation set or the deletion set.
Formally:

- Preservation Set $C^P_k$: the set of classes learned in the past that should be preserved at the $k$-th task.
- Deletion Set $\bar{C}^P_k$: the set of classes still memorized that should be forgotten at the $k$-th task (the complement of $C^P_k$).

At the $k$-th task, we are given the dataset $D_k$ and the preservation set $C^P_k$. We use index $k$ for the new task and $p$ for the previous tasks.

Definition 1 (LSF Problem). The Learning with Selective Forgetting (LSF) problem is defined as follows:

- Objective: Learn a model $f_\theta: \mathcal{X} \to \mathcal{Y}$. This model $f_\theta$ should map a test input $x$ to its correct class label $y$ if the class of $x$ is in the preservation set $C^P$. Otherwise, $f_\theta$ should map $x$ to a wrong class label $y' \neq y$.
- Constraint: No original samples or generative models for the past tasks are available after the new task begins.

Figure 1: Problem Setting of Learning with Selective Forgetting. The goal is to carry out both selective forgetting and lifelong learning without using the original data of previous tasks.

We propose a method to solve the LSF problem. An overview of our method is shown in Fig. 2. Our method uses a multi-headed network architecture that has one head per task, which is a common architecture in lifelong learning [Li and Hoiem, 2017; Chaudhry et al., 2018a]. We first introduce the mnemonic code, which is the key to solving the LSF problem, and then we present the loss functions for learning our model. Finally, we empirically analyze the key properties of our method.

### 4.1 Mnemonic Code

The challenge of the LSF problem is to keep memorizing the classes listed in the preservation set while forgetting those in the deletion set, without accessing the original dataset. Our idea is to associate the information of each class with a fairly simple code, called mnemonic code, and to use only that code to control whether the class will be retained or forgotten. We implement this idea as a special type of data augmentation. An overview of the process is illustrated on the left-hand side of Fig. 2.

Figure 2: Overview of Our Method. We introduce the mnemonic code, a class-specific random signal that is embedded in each sample of the same class and trained to be an anchor of the class. Remembering/forgetting of the class information can be performed using only the corresponding code, without the original data of the past tasks.

When a new task is received, one synthetic image is generated per class with random pixel values as the class-specific mnemonic code and embedded in all the samples of the corresponding class. Formally, let $\{\xi_{k,c}\}$ be a set of mnemonic codes, where $\xi_{k,c}$ is the code for the $c$-th class of the $k$-th task. During training for the $k$-th task, we generate an augmented sample $\tilde{x}^i_k$ by embedding the mnemonic code $\xi_{k,c}$ into the original sample $x^i_k$ of the $c$-th class, similar to mixup [Zhang et al., 2018]:

$$\tilde{x}^i_k = \lambda x^i_k + (1 - \lambda)\,\xi_{k,c}, \qquad (1)$$

where $\lambda$ is a uniform random variable in $[0, 1]$. Besides the set of originals $\{(x^i_k, y^i_k)\}_{i=1}^{n_k}$, we also use the augmented samples $\{(\tilde{x}^i_k, y^i_k)\}_{i=1}^{n_k}$ to update our model at the $k$-th task. Once the update is done, we retain only the codes $\{\xi_{p,c}\}$ for later tasks to control remembering and forgetting the classes learned in the past tasks.

The intuition behind this procedure is as follows. By training with such augmented data, the samples of the same class are aggregated around the corresponding mnemonic code in the feature space. We can therefore control whether or not to maintain the feature distribution around the code, depending on whether or not the code is used later when updating the model for a new incoming task. This makes it possible to remember or forget arbitrary classes using only the corresponding codes, without the original samples (we show analytic results later in Sec. 4.3).
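The code generation and the embedding of Eq. (1) can be sketched in a few lines. This is an illustrative sketch only: `make_mnemonic_code` and `embed_code` are hypothetical helper names, the 4×4 grid size is an assumption (the paper does not fix it), and images are plain nested lists of RGB tuples rather than tensors.

```python
import random

def make_mnemonic_code(height, width, grid=4):
    # One random RGB color per grid cell, upsampled to the image size
    # (a 4x4 grid is an illustrative assumption).
    cells = [[tuple(random.random() for _ in range(3))
              for _ in range(grid)] for _ in range(grid)]
    return [[cells[i * grid // height][j * grid // width]
             for j in range(width)] for i in range(height)]

def embed_code(x, code, lam=None):
    # Eq. (1): x_tilde = lam * x + (1 - lam) * xi, with lam ~ U[0, 1].
    if lam is None:
        lam = random.random()
    return [[tuple(lam * px + (1 - lam) * cx for px, cx in zip(p, c))
             for p, c in zip(row_x, row_code)]
            for row_x, row_code in zip(x, code)]
```

With $\lambda = 1$ the augmented sample equals the original, and with $\lambda = 0$ it equals the code, matching the endpoints of Eq. (1).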
This idea is inspired by a human learning technique called mnemonics, which aids memory retention by associating different types of information (e.g., images and words), hence the name.

Implementation of Mnemonic Code: We use random color patterns to generate our mnemonic codes, as shown in Fig. 2. Specifically, we assign a random color to each grid cell of an image of the same size as the original sample. Other types of codes are possible; however, we argue several strengths of using such a random code: i) the random pattern can be generated easily; ii) the patterns are i.i.d. for each class and each task; and iii) unlike existing memory-based approaches that use (a part of) the original samples, the pattern itself does not directly represent any information of the raw data, which is suitable for privacy protection and data leakage prevention. Interestingly, as we will show in Sec. 5.3, the performance of our random code is comparable to that of a content-based code, i.e., an average image of the original samples within the same class, which emphasizes the advantages of our version. One limitation so far is that our code has been designed for image data, but the idea itself can be readily extended to other data types, which will be a compelling future research direction.

### 4.2 Loss Function

We train the model with our mnemonic codes. As shown in Fig. 2, the total loss function $L$ for training consists of four terms: classification loss $L_C$, mnemonic loss $L_M$, selective forgetting loss $L_{SF}$, and regularization term $L_R$. The first two are for learning a new task, and the last two are for maintaining the previous tasks:

$$L = \underbrace{L_C + L_M}_{\text{new task}} + \underbrace{L_{SF} + L_R}_{\text{previous tasks}}. \qquad (2)$$

Below we detail each of them one by one.

Classification Loss $L_C$: The classification loss for the $k$-th new task is given as

$$L_C = \frac{1}{N_k} \sum_i l(x^i_k, y^i_k), \qquad (3)$$

where $N_k$ is the number of training samples in the $k$-th task and $l(x, y)$ is a loss function for the input $x$ and its class label $y$.
A typical choice for $l(x, y)$ would be the softmax cross-entropy (CE) loss or the additive margin softmax (AMS) loss [Wang et al., 2018a; Wang et al., 2018b]. We use AMS, as we found it works better than CE.

Mnemonic Loss $L_M$: In addition to the classification loss, which uses the original samples $D_k = \{(x^i_k, y^i_k)\}_{i=1}^{n_k}$, we also use a loss on the samples augmented with our mnemonic codes, $\tilde{D}_k = \{(\tilde{x}^i_k, y^i_k)\}_{i=1}^{n_k}$, to tie each code to the corresponding class. The loss function is given by

$$L_M = \frac{1}{N_k} \sum_i l(\tilde{x}^i_k, y^i_k). \qquad (4)$$

We use AMS for $l(\cdot, \cdot)$, as in the classification loss.

Figure 3: Analysis. The accuracy of each task at each epoch (top) and t-SNE plots of the features from the last layer of the backbone network at the end of each task (bottom) are shown for (a) Vanilla through (d) Ours-E. Each color in the t-SNE plots represents the following categories (best viewed in color). Orange: belongs to the preservation set throughout all three tasks. Blue: changes from the preservation set to the deletion set after completing task 1. Green: learned as a new task (i.e., in the preservation set) in task 2, then changed to the deletion set in task 3.

Selective Forgetting Loss $L_{SF}$: The aim of this loss function is to keep remembering only the classes in the preservation set and to forget the others in the deletion set. This can be achieved by training with only the mnemonic codes corresponding to the classes in the preservation set and discarding the other codes. For convenience, let us denote by $\xi^i_p$ the mnemonic code used to generate $\tilde{x}^i_p$ (i.e., the code for the class of $x^i_p$). The loss function is

$$L_{SF} = \frac{\gamma_{SF}}{N_p} \sum_i l(\xi^i_p, y^i_p), \qquad (5)$$

where $N_p$ is the number of training samples at the $p$-th task and $\gamma_{SF}$ is a balancing weight. $l(\cdot, \cdot)$ is the AMS loss. Note that this loss function does not use any of the original samples.
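A minimal sketch of how $L_{SF}$ could be computed from the stored codes alone; `model_loss` stands in for the per-sample AMS loss, and all names here are illustrative rather than taken from the paper's implementation:

```python
def selective_forgetting_loss(codes, labels, preservation_set,
                              model_loss, gamma_sf=10.0):
    # Eq. (5) sketch: apply the classification loss only to mnemonic codes
    # whose classes are in the preservation set. Codes of the deletion set
    # are simply discarded, so those classes undergo catastrophic forgetting.
    kept = [(xi, y) for xi, y in zip(codes, labels) if y in preservation_set]
    if not kept:
        return 0.0  # nothing to preserve (e.g., first task)
    return gamma_sf * sum(model_loss(xi, y) for xi, y in kept) / len(kept)
```

The key point the sketch mirrors is that no original samples appear anywhere: only the per-class codes are replayed.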
By ignoring the codes of the classes in the deletion set, those classes experience catastrophic forgetting. This allows us to achieve selective forgetting of the previous tasks without using any of the original samples.

Regularization Term $L_R$: A regularization term is often introduced to prevent catastrophic forgetting. In this work, we consider three existing regularization terms, namely Learning without Forgetting (LwF) [Li and Hoiem, 2017], Elastic Weight Consolidation (EWC) [Kirkpatrick et al., 2017], and Memory Aware Synapses (MAS) [Aljundi et al., 2018]. LwF and EWC are originally designed to retain all classes, whereas we only need to memorize the classes included in the preservation set for our LSF problem. Thus, we make the following minor modifications to adapt LwF and EWC to our problem. The modified versions are distinguished from their original versions by an asterisk, e.g., LwF*.

- LwF* [Li and Hoiem, 2017]: The regularization term of LwF* is defined as $L_{LwF^*} = -\gamma \sum_{i \in C^P_k} y'_o(i) \log \hat{y}'_o(i)$, where $\gamma$ is the weight of the term and $i$ is the index of the class label. We change the summation to be taken only over the preservation set, i.e., $i \in C^P_k$. $y'_o$ and $\hat{y}'_o$ are the modified versions of the recorded and current probabilities as in [Li and Hoiem, 2017]¹.

- EWC* [Kirkpatrick et al., 2017]: The regularization term of EWC* is $L_{EWC^*} = \frac{\gamma}{2} \sum_{q,p} F_{q,p}(\theta_q - \hat{\theta}_{q,p})^2$, where $\gamma$ is the weight of the regularization term and $F_{q,p}$ is the diagonal component of the Fisher matrix for the $p$-th previous task corresponding to the $q$-th parameter $\hat{\theta}_{q,p}$. We change the Fisher matrix to be evaluated only on the classes of the preservation set².

¹ The modified versions of the recorded and current probabilities, i.e.,
$y'_o$ and $\hat{y}'_o$, are given by $y'_o(i) = y_o(i)^{1/T} / \sum_{j \in C^P_k} y_o(j)^{1/T}$ and $\hat{y}'_o(i) = \hat{y}_o(i)^{1/T} / \sum_{j \in C^P_k} \hat{y}_o(j)^{1/T}$, where $T$ is a temperature hyperparameter for knowledge distillation. We set $T = 2$ as in the original paper [Li and Hoiem, 2017]. The summation is taken only over the preservation set, i.e., $j \in C^P_k$.

² In the case of multiple tasks, EWC requires storing the Fisher matrix for each task independently and performing regularization on all of them together [Chaudhry et al., 2018a].

- MAS [Aljundi et al., 2018]: The regularization term $L_R$ for MAS is given by $L_{MAS} = \frac{\gamma}{2} \sum_{q,p} \Omega_{q,p}(\theta_q - \hat{\theta}_{q,p})^2$, where $\gamma$ is the regularization strength and $\Omega_{q,p}$ is the constraint strength, i.e., the importance parameter, of the $q$-th parameter for the $p$-th previous task, estimated from the sensitivity of the squared $l_2$ norm of the function output to changes in the parameters.

Beyond using each of these individually as our regularization term $L_R$, we can also consider combinations of them. Specifically, we test the following two combinations in the experiments:

$$L_R = L_{LwF^*} + L_{EWC^*}, \qquad (6)$$
$$L_R = L_{LwF^*} + L_{MAS}, \qquad (7)$$

which we denote Ours-E and Ours-M, respectively.

### 4.3 Analysis

In this preliminary analysis, we demonstrate that our mnemonic code can forget only the specified classes in the deletion set while maintaining the rest.

Setting: We use Permuted MNIST [Kirkpatrick et al., 2017], an artificial dataset often used for lifelong learning benchmarks. We prepare three tasks with different permutations, with ten digit classes per task (30 classes in total). In this analysis, we always set the three classes '0', '1', '2' of each task as the deletion set and the other seven classes as the
We compare four methods: 1) Vanilla, which only uses the classification loss LC, 2) EWC, 3) EWC , and 4) Ours E. The standard two-conv-two-FC CNN is used for all methods3. Results: Figure 3 shows the accuracy vs. epoch plots (top) and t-SNE visualization results (bottom), where the features on the final layer of the shared backbone at the end of each task are visualized. The two accuracy plots show the performance for the deletion set (left) and the preservation set (right). In contrast to the others, we can see that Ours correctly reduces the accuracy of the past three classes to be forgotten and also maintains the accuracy of the seven classes to be retained. Vanilla forgets everything of the past and EWC remembers all. EWC , which has been modified to apply regularization only to classes that should be preserved, tends to manage the task correctly, but is still inadequate. This proves that straightforward modifications of existing methods are not satisfactory and supports the unique effectiveness of our mnemonic code. Another important observation is that the mnemonic code is only embedded in the training images, which leads to a gap between the training and test images, however, this has no significant negative impact on the final classification accuracy. The t-SNE plots also show that only Ours could keep the samples of the classes to be remembered agglomerated in the feature space and quickly scattered those to be forgotten, which is the desirable behavior for the problem. This implies that our random mnemonic code, despite its simplicity, is able to tightly link the samples of each class to the corresponding code in the feature space, and as a result, can control remembering/forgetting of the classes individually. 5 Experiments 5.1 Setting Datasets: We use three widely used benchmark datasets for lifelong learning, i.e., CIFAR-100, CUB200-2011 [Wah et al., 2011], and Stanford Cars [Krause et al., 2013]. 
CUB-200-2011 has 200 classes with 5,994 training images and 5,794 test images. CIFAR-100 contains 50,000 training images and 10,000 test images overall. Stanford Cars comprises 196 car classes with 8,144 training images and 8,041 test images. Unless otherwise noted, as in the analysis in Sec. 4.3, the first 30% of the classes of each task belong to the deletion set, while the other classes belong to the preservation set.

Implementation Details: We used ResNet-18 [He et al., 2016] as the classification model. The final layer was changed to the multi-head architecture shown in Fig. 2. We trained the network for 200 epochs for each task. Mini-batch sizes were set to 128 for new tasks and 32 for past tasks in CIFAR-100, and 32 for new tasks and 8 for previous tasks in CUB-200-2011 and Stanford Cars. The weight decay was $5.0 \times 10^{-4}$. We used SGD for optimization and a standard data augmentation strategy: random crop, horizontal flip, and rotation. We used Xavier's initialization.

Baselines: We compared our proposed method with LwF [Li and Hoiem, 2017], EWC [Kirkpatrick et al., 2017], and MAS [Aljundi et al., 2018], which are popular regularization-based lifelong learning methods. In the following experiments, $\gamma$ for LwF/LwF*, EWC/EWC*, and MAS is set to 5, 100, and 5, respectively. The weight of the selective forgetting loss, $\gamma_{SF}$, is set to 10. We compare the above methods, including their combinations.

³ The detailed configuration of the CNN is: Conv(3,32) - Conv(3,64) - MaxPool(2) - Dropout(0.25) - Linear(9216,120) - Dropout(0.5) - Linear(120,10), where Conv(k,c) denotes a convolution-ReLU layer with kernel size k×k and c output channels, MaxPool(2) denotes max pooling with stride 2, and Dropout(p) denotes dropout with probability p.
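As a quick sanity check on the CNN configuration in footnote 3, the input size of Linear(9216, 120) follows from the convolution output shapes; the sketch below assumes 28×28 Permuted-MNIST inputs, stride 1, and no padding (assumptions, since the footnote does not state them):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Standard output-size formula for a square convolution or pooling layer.
    return (size + 2 * padding - kernel) // stride + 1

s = conv_out(28, 3)   # Conv(3, 32): 28 -> 26
s = conv_out(s, 3)    # Conv(3, 64): 26 -> 24
s = s // 2            # MaxPool(2):  24 -> 12
flat = 64 * s * s     # 64 channels * 12 * 12 = 9216, matching Linear(9216, 120)
```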
To sum up, the specific methods compared are as follows:

- Vanilla: trained using only the classification loss $L_C$
- LwF [Li and Hoiem, 2017]
- LwF*: modified version of LwF
- EWC [Kirkpatrick et al., 2017]
- EWC*: modified version of EWC
- EWC*+LwF*: combination of EWC* and LwF*
- MAS [Aljundi et al., 2018]
- MAS+LwF*: combination of MAS and LwF*
- Ours-E: our method with EWC* and LwF*
- Ours-M: our method with MAS and LwF*

Evaluation Metric: In our LSF setting, the goal is to forget the deletion set and preserve the preservation set. No existing metric is suitable for evaluating performance in this setting, because it involves a new criterion, selective forgetting. We therefore introduce a new metric $S$, called the Learning with Selective Forgetting Measure (LSFM). LSFM is the harmonic mean of the two standard evaluation measures for lifelong learning [Chaudhry et al., 2018b], the average accuracy $A_k$ for the preservation set and the forgetting measure $F_k$ for the deletion set:

$$S_k = \frac{2 A_k F_k}{A_k + F_k}. \qquad (8)$$

The average accuracy $A_k$ is evaluated on the preservation set after the model has been trained up until the $k$-th task. Specifically, $A_k = \frac{1}{k} \sum_{p=1}^{k} a_{k,p}$, where $a_{k,p}$ is the accuracy on the $p$-th task after training for the $k$-th task is completed; $A_k$ is evaluated only on the preservation set. Similarly, the forgetting measure $F_k$ is computed on the deletion set after completing the $k$-th task: $F_k = \frac{1}{k} \sum_{p=1}^{k} f^p_k$, where $f^p_k = \max_{l \in \{1, \dots, k-1\}} a_{l,p} - a_{k,p}$ represents the largest gap (decrease) from a past accuracy to the current accuracy on the $p$-th task; it is evaluated only on the deletion set⁴. The ranges of $A_k$ and $F_k$ are both $[0, 1]$. We report the averages of $S_k$, $A_k$ and $F_k$ over $k$ after the last task has been completed, denoted by $S$, $A$, and $F$, respectively.
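Computing $S_k$ from a matrix of per-task accuracies can be sketched as follows (an illustrative implementation; `acc_pres[l][p]` is the preservation-set accuracy on task `p` after training through task `l`, zero-based, and `acc_del` is the same for the deletion set):

```python
def lsfm(acc_pres, acc_del):
    # Eq. (8) sketch: harmonic mean of the average accuracy A_k on the
    # preservation set and the forgetting measure F_k on the deletion set,
    # evaluated after the last task k. Requires k >= 1 (at the first task,
    # no class belongs to the deletion set and F_k is undefined).
    k = len(acc_pres) - 1                    # index of the last task
    A_k = sum(acc_pres[k]) / (k + 1)         # average accuracy
    # f^p_k: largest drop from any earlier accuracy on task p to the current one.
    F_k = sum(max(acc_del[l][p] for l in range(p, k)) - acc_del[k][p]
              for p in range(k)) / k
    S_k = 2 * A_k * F_k / (A_k + F_k)        # harmonic mean
    return S_k, A_k, F_k
```

For the deletion set, a large accuracy drop (high $F_k$) is the desired outcome, which is why the harmonic mean rewards methods that score high on both measures simultaneously.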
⁴ In the first task (i.e., when the number of previous tasks is zero), no class belongs to the deletion set, so $F_k$ and $S_k$ are not defined.

Figure 4: Per-Task Performance on (a) CIFAR-100, (b) CUB-200-2011, and (c) Stanford Cars. The left, center, and right plots show the LSFM $S_k$, the average accuracy $A_k$ for the preservation set, and the forgetting measure $F_k$ for the deletion set at the end of each task, respectively. Higher is better for each.

Table 1: Results on CIFAR-100, CUB-200-2011, and Stanford Cars. Bold and underline indicate the best and second-best methods, respectively. Each cell reports S (A, F).

| Method | CIFAR-100 (5 tasks, 20 classes) | CUB-200-2011 (5 tasks, 40 classes) | Stanford Cars (4 tasks, 49 classes) |
|---|---|---|---|
| Vanilla | 51.79 (39.66, 74.62) | 42.23 (31.77, 62.93) | 48.14 (41.08, 58.12) |
| LwF | 17.23 (79.05, 9.67) | 15.20 (69.04, 8.54) | 9.93 (88.10, 5.26) |
| LwF* | 68.24 (81.32, 58.79) | 44.54 (68.27, 33.05) | 53.70 (88.22, 38.60) |
| EWC | 48.57 (36.54, 72.42) | 41.36 (32.97, 55.47) | 48.74 (47.72, 49.80) |
| EWC* | 49.61 (36.58, 77.08) | 42.08 (33.38, 56.92) | 50.71 (46.17, 56.24) |
| EWC*+LwF* | 67.64 (81.20, 57.96) | 43.42 (69.29, 31.62) | 52.79 (88.65, 37.58) |
| MAS | 47.46 (34.89, 74.17) | 45.07 (34.87, 63.71) | 48.80 (44.66, 53.80) |
| MAS+LwF* | 66.35 (81.83, 55.79) | 47.49 (69.69, 36.02) | 50.57 (89.01, 35.32) |
| Ours-M | <u>73.21</u> (72.61, 73.83) | <u>57.97</u> (63.07, 53.63) | <u>72.24</u> (84.57, 63.04) |
| Ours-E | **79.60** (75.33, 84.37) | **61.41** (65.99, 57.43) | **73.70** (85.98, 64.49) |
Table 2: Results on CIFAR-100 for varying numbers of tasks/classes. Bold and underline indicate the best and second-best methods, respectively. Each cell reports S (A, F).

| Method | 2 tasks, 50 classes | 5 tasks, 20 classes | 10 tasks, 10 classes |
|---|---|---|---|
| Vanilla | 55.87 (55.21, 56.55) | 51.79 (39.66, 74.62) | 37.88 (25.41, 74.41) |
| LwF | 9.02 (74.69, 4.80) | 17.23 (79.05, 9.67) | 22.50 (80.74, 13.07) |
| LwF* | 54.64 (76.44, 42.52) | 68.24 (81.32, 58.79) | 63.62 (82.29, 51.85) |
| EWC | 58.58 (56.73, 60.55) | 48.57 (36.54, 72.42) | 34.91 (23.07, 71.70) |
| EWC* | 57.17 (56.25, 58.13) | 49.61 (36.58, 77.08) | 36.90 (23.68, 83.52) |
| EWC*+LwF* | 53.51 (77.11, 40.98) | 67.64 (81.20, 57.96) | 69.17 (74.11, 64.85) |
| MAS | 55.44 (54.42, 56.49) | 47.46 (34.89, 74.17) | 35.26 (23.25, 72.96) |
| MAS+LwF* | 56.54 (76.85, 44.72) | 66.35 (81.83, 55.79) | 70.83 (74.63, 67.41) |
| Ours-M | <u>70.08</u> (74.89, 65.84) | <u>73.21</u> (72.61, 73.83) | <u>71.63</u> (68.56, 75.00) |
| Ours-E | **74.02** (74.93, 73.14) | **79.60** (75.33, 84.37) | **76.01** (67.93, 86.26) |

### 5.2 Comparative Results

Overall Results: Table 1 shows the comparative results of all the methods. Ours-E is clearly the best and Ours-M the second best among all the methods in terms of $S$. No other method is better in terms of both $A$ and $F$. This is mainly due to the advantage of our mnemonic codes: as verified in our preliminary analysis, the codes enable accurate control over whether each class should be retained or forgotten, on a class-by-class basis.

Per-Task Results: To visualize how performance changes over the tasks, we show $S_k$, $A_k$ and $F_k$ at each task in Fig. 4⁵. Ours-E consistently achieves the best $S$ on all the datasets. We can draw several observations from the results. First, no single existing method (LwF, EWC, or MAS) works well. While most of the methods maintain satisfactory performance in terms of $A_k$, Vanilla, EWC, and EWC* suffer from catastrophic forgetting, showing a decrease in accuracy with each new task added. These three methods look better in terms of $F_k$ than the other baselines; however, they forget all the classes, whether in the preservation set or the deletion set, which is not the desired behavior. Conversely, LwF remembers all the classes and thus has a high $A_k$ but sacrifices $F_k$. Second, even a combination of the existing methods, namely EWC*+LwF*, cannot yield satisfactory performance. This indicates that the straightforward idea of applying strong regularization terms only to the preservation set is not sufficient. These observations emphasize the difficulty of maintaining both $A_k$ and $F_k$ together. Unlike these methods, Ours-E shows consistently high scores in both $A_k$ and $F_k$, which demonstrates the effectiveness of our mnemonic code and learning strategy on this new problem.

⁵ Due to space limitations, we report only the results of Vanilla, EWC, LwF, EWC*, LwF*, and Ours-E in this figure.

Results for Varying Numbers of Tasks/Classes: Table 2 shows the results on CIFAR-100 for various numbers of tasks/classes. Ours is the best in $S$ in all cases. We also evaluated the performance when the ratio of the number of classes in the deletion set to that of all classes, $r_{del}$, is varied from 0.1 to 0.9. From the results in Table 3, the $S$ of our two methods is higher than that of the other methods for all ratios. These results show the strong robustness of the proposed method across various settings.

### 5.3 Sensitivity Analysis

We analyze the sensitivity of the performance to the hyperparameters, including the weight of the selective forgetting loss $L_{SF}$ and the mnemonic loss $L_M$. We also evaluate the performance with different types of mnemonic codes.

Effectiveness of $L_{SF}$: Figure 5 shows the performance for various $\gamma_{SF}$ in Eq. (5). First, as $\gamma_{SF}$ is decreased, the performance decreases.
This suggests that $L_{SF}$ contributes significantly to the performance. For larger $\gamma_{SF}$, the performance is high and stable, indicating that tuning this value is not difficult.

Table 3: Results on CIFAR-100 for various ratios $r_{del}$ of the deletion set. Bold and underline indicate the best and second-best methods, respectively. Each cell reports S (A, F).

| Method | $r_{del}$ = 0.1 | $r_{del}$ = 0.5 | $r_{del}$ = 0.9 |
|---|---|---|---|
| Vanilla | 46.20 (31.63, 85.62) | 52.58 (40.97, 73.40) | 69.23 (68.00, 70.50) |
| LwF | 16.13 (77.80, 9.00) | 14.96 (79.97, 8.25) | 13.51 (81.79, 7.36) |
| LwF* | 70.31 (79.14, 63.25) | 65.54 (83.30, 54.02) | 71.91 (88.50, 60.56) |
| EWC | 44.13 (30.43, 80.25) | 48.29 (36.37, 71.85) | 67.06 (66.79, 67.33) |
| EWC* | 43.19 (29.17, 83.12) | 52.98 (40.78, 75.57) | 67.88 (66.46, 69.36) |
| EWC*+LwF* | 67.04 (78.72, 58.37) | 70.07 (83.85, 60.17) | 75.47 (89.18, 65.42) |
| MAS | 45.52 (30.55, 89.25) | 51.85 (40.72, 71.35) | 68.16 (68.36, 67.97) |
| MAS+LwF* | 65.16 (77.43, 56.25) | 71.23 (82.02, 62.95) | 77.24 (88.75, 68.37) |
| Ours-M | <u>77.80</u> (76.99, 78.62) | <u>79.88</u> (79.45, 80.32) | <u>81.46</u> (87.46, 76.24) |
| Ours-E | **83.46** (75.87, 92.75) | **83.16** (81.80, 84.57) | **83.68** (86.57, 80.97) |

Figure 5: Performance sensitivity to $\gamma_{SF}$. Left: LSFM (black line). Right: average accuracy (red line) and forgetting measure (blue line).

Effectiveness of $L_M$: We compared the performance of the proposed method with and without the mnemonic loss $L_M$. The results, shown on the left side of Table 4, clearly show that $L_M$ significantly improves the performance, demonstrating its effectiveness.

Table 4: Effectiveness of the mnemonic loss $L_M$ (left) and of the mnemonic code type (right). Each cell reports S (A, F).

| | S (A, F) |
|---|---|
| w/o $L_M$ | 66.16 (56.71, 79.39) |
| w/ $L_M$ | 74.02 (74.93, 73.14) |

| | S (A, F) |
|---|---|
| Mean | 74.91 (73.20, 76.71) |
| Random | 74.02 (74.93, 73.14) |

Mnemonic Code Choice: We evaluated the effectiveness of our choice for the mnemonic code, i.e., the random pattern
code, by comparing it with the average code, i.e., the average image of each class. The right side of Table 4 shows the results. We can see that the performance of these two methods is highly comparable. As discussed in Sec. 4.1, the random code has several advantages over the average code. These results further emphasize the merit of using the random code for lifelong learning with selective forgetting.

6 Conclusion

We have opened up a new framework for lifelong learning called Learning with Selective Forgetting (LSF), which allows a model to continuously learn from new tasks while selectively forgetting undesirable class information. Our key contribution is a simple and effective idea called the mnemonic code: a class-specific random signal embedded in each sample of the same class, which makes it possible to control the remembering and forgetting of arbitrary classes without using the original samples. Thorough experiments showed that our method achieves significantly better performance than existing methods on this new problem. We believe that this paper will bring a new and practical direction of lifelong learning to the community and provide the first baseline for the new problem.

References

[Abadi et al., 2016] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proc. CCS, pages 308–318, 2016.

[Aljundi et al., 2017] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proc. CVPR, pages 3366–3375, 2017.

[Aljundi et al., 2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proc. ECCV, pages 139–154, 2018.

[Bourtoule et al., 2019] Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot.
Machine unlearning. arXiv preprint arXiv:1912.03817, 2019.

[Cao and Yang, 2015] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In Proc. S&P, pages 463–480, 2015.

[Chaudhry et al., 2018a] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proc. ECCV, pages 532–547, 2018.

[Chaudhry et al., 2018b] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In Proc. ICLR, 2018.

[Dhar et al., 2019] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In Proc. CVPR, pages 5138–5146, 2019.

[Gilad-Bachrach et al., 2016] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In Proc. ICML, pages 201–210, 2016.

[Ginart et al., 2019] A Ginart, M Guan, G Valiant, and J Zou. Making AI forget you: Data deletion in machine learning. In Proc. NeurIPS, 2019.

[Golatkar et al., 2020] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proc. CVPR, pages 9304–9312, 2020.

[Golatkar et al., 2021] Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Mixed-privacy forgetting in deep networks. In Proc. CVPR, 2021.

[Guo et al., 2020] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. In Proc. ICML, pages 3832–3842, 2020.

[He et al., 2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
[Hou et al., 2018] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Lifelong learning via progressive distillation and retrospection. In Proc. ECCV, pages 437–452, 2018.

[Hung et al., 2019] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Proc. NeurIPS, pages 13669–13679, 2019.

[Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

[Krause et al., 2013] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proc. ICCVW, pages 554–561, 2013.

[Lee et al., 2017] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Proc. NeurIPS, pages 4652–4662, 2017.

[Li and Hoiem, 2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 40(12):2935–2947, 2017.

[Liu et al., 2020] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics training: Multi-class incremental learning without forgetting. In Proc. CVPR, page 12254, 2020.

[Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NeurIPS, pages 6467–6476, 2017.

[Mallya and Lazebnik, 2018] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proc. CVPR, pages 7765–7773, 2018.

[Mallya et al., 2018] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proc.
ECCV, pages 67–82, 2018.

[Nguyen et al., 2020] Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational Bayesian unlearning. In Proc. NeurIPS, pages 16025–16036, 2020.

[Rebuffi et al., 2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, pages 2001–2010, 2017.

[Rusu et al., 2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[Shin et al., 2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Proc. NeurIPS, pages 2990–2999, 2017.

[Speciale et al., 2019a] Pablo Speciale, Johannes L Schonberger, Sing Bing Kang, Sudipta N Sinha, and Marc Pollefeys. Privacy preserving image-based localization. In Proc. CVPR, pages 5493–5503, 2019.

[Speciale et al., 2019b] Pablo Speciale, Johannes L Schonberger, Sudipta N Sinha, and Marc Pollefeys. Privacy preserving image queries for camera localization. In Proc. ICCV, pages 1486–1496, 2019.

[Wah et al., 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, 2011.

[Wang et al., 2018a] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.

[Wang et al., 2018b] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proc. CVPR, pages 5265–5274, 2018.

[Wang et al., 2019] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proc. ICCV, pages 5310–5319, 2019.
[Wu et al., 2018] Chenshen Wu, Luis Herranz, Xialei Liu, Joost van de Weijer, Bogdan Raducanu, et al. Memory replay GANs: Learning to generate new categories without forgetting. In Proc. NeurIPS, pages 5962–5972, 2018.

[Wu et al., 2019] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proc. CVPR, pages 374–382, 2019.

[Wu et al., 2020] Yinjun Wu, Edgar Dobriban, and Susan Davidson. DeltaGrad: Rapid retraining of machine learning models. In Proc. ICML, pages 10355–10366, 2020.

[Yu et al., 2020] Lu Yu, Bartlomiej Twardowski, Xialei Liu, Luis Herranz, Kai Wang, Yongmei Cheng, Shangling Jui, and Joost van de Weijer. Semantic drift compensation for class-incremental learning. In Proc. CVPR, pages 6982–6991, 2020.

[Zenke et al., 2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proc. ICML, pages 3987–3995, 2017.

[Zhang et al., 2018] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Proc. ICLR, 2018.

[Zhao et al., 2020] Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. Maintaining discrimination and fairness in class incremental learning. In Proc. CVPR, pages 13208–13217, 2020.