# Exploring Example Influence in Continual Learning

Qing Sun, Fan Lyu, Fanhua Shang, Wei Feng, Liang Wan
College of Intelligence and Computing, Tianjin University
{sssunqing, fanlyu, fhshang, wfeng, lwan}@tju.edu.cn
https://github.com/SSSunQing/Example_Influence_CL
Co-first authors. Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Continual Learning (CL) sequentially learns new tasks like human beings, with the goal of achieving better Stability (S, remembering past tasks) and Plasticity (P, adapting to new tasks). Because past training data is not available, it is valuable to explore the influence difference on S and P among training examples, which may improve the learning pattern towards better SP. Inspired by the Influence Function (IF), we first study example influence by adding a perturbation to each example's weight and computing the influence derivation. To avoid the storage and computation burden of the Hessian inverse in neural networks, we propose a simple yet effective MetaSP algorithm to simulate the two key steps in the computation of IF and obtain the S- and P-aware example influence. Moreover, we propose to fuse the two kinds of example influence by solving a dual-objective optimization problem, and obtain a fused influence towards SP Pareto optimality. The fused influence can be used to control the update of the model and to optimize the storage of rehearsal. Empirical results show that our algorithm significantly outperforms state-of-the-art methods on both task- and class-incremental benchmark CL datasets.

## 1 Introduction

By mimicking human-like learning, Continual Learning (CL) aims to enable a model to continuously learn novel knowledge (new tasks, new classes, etc.) in a sequential order. The major challenge in CL is to harness catastrophic forgetting and knowledge transition, namely the Stability-Plasticity dilemma, where Stability (S) denotes the ability to prevent performance drops on old tasks and Plasticity (P) refers to whether a new task can be learned rapidly and unimpededly. Intuitively, a robust CL system should achieve outstanding S and P throughout sequential learning.

The sequential paradigm means CL does not access past training data. Compared with traditional machine learning, the training data in CL is thus more precious, and it is valuable to explore the influence difference on S and P among training examples. Following the well-established influence chain Data → Model → Performance, exploring this difference is equivalent to tracing from performance back to example difference. With appropriate control, this may improve the learning pattern towards better SP. The goal of this paper is therefore to explore the influence of each training example on SP and to apply the example influence to CL training.

To understand example influence, one classic and successful technique is the Influence Function (IF) [20], which applies the derivation chain rule from a test objective back to training examples. However, directly applying the chain rule requires computing the inverse of the Hessian with complexity $O(nq^2 + q^3)$ ($n$ is the number of examples and $q$ is the parameter size), which is computationally intensive and may run out of memory for neural networks.
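For reference, the influence function of Koh and Liang [20] measures how up-weighting one training example by a small $\epsilon$ changes a test loss; the statement below is the standard construction in our own notation (given here only as background for the perturbation-based design that follows):

```latex
% Standard influence-function construction (Koh & Liang, 2017), in our notation.
% Up-weight a single training example x^{trn} by a small perturbation \epsilon:
\hat{\theta}_{\epsilon, x^{trn}}
  = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, \theta) + \epsilon\, \ell(x^{trn}, \theta).
% The influence of x^{trn} on the loss over a test set D^{tst} is the derivative at \epsilon = 0,
% which the chain rule turns into a Hessian-inverse-vector product:
\mathcal{I}(D^{tst}, x^{trn})
  = \left.\frac{d\, \ell(D^{tst}, \hat{\theta}_{\epsilon, x^{trn}})}{d\epsilon}\right|_{\epsilon = 0}
  = -\,\nabla_{\theta}\ell(D^{tst}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta}\ell(x^{trn}, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^{2}_{\theta}\,\ell(x_i, \hat{\theta}).
```

Forming and inverting $H_{\hat{\theta}}$ is exactly the $O(nq^2 + q^3)$ cost mentioned above, which motivates simulating the two steps (perturbed objective and derivative) instead of evaluating them in closed form.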
In this paper, we propose a novel meta-learning algorithm, called MetaSP, to compute example influence by simulating IF. We build it on the rehearsal-based CL framework, which avoids forgetting by re-training a part of the old data. First, a pseudo update is performed with example-level perturbations. Then, two validation sets sampled from seen data are used to compute the gradients with respect to the example perturbations. These gradients are regarded as the example influence on S and P. As shown in Fig. 1(a), examples can be distinguished by the value of their influence on S and P.

Figure 1: Training examples have different influences on Stability and Plasticity. Given an old task with classes cat and dog and a new task with classes Bird and Fish, we compute the influence on S and P for each example ((a) influence acquirement). Then, we fuse the two kinds of influence towards the SP Pareto front ((b) influence fusion). We also show that example influence can be used to adjust the model update and to optimize rehearsal selection under a fixed-size memory buffer ((c) influence utilization).

To leverage the two independent kinds of influence in CL, we need to take full account of the influence on both S and P. However, the influence on S and P may interfere with each other, which forces a trade-off. This can be cast as a Dual-Objective Optimization (DOO) problem, which aims to find solutions not dominated by any other solution, i.e., Pareto optimal solutions [8]. We call such solutions the example influence on SP. Following the gradient-based MGDA algorithm [12], we obtain the fused example influence on SP by satisfying the Karush-Kuhn-Tucker (KKT) conditions, as illustrated in Fig. 1(b). Finally, we show that the fused influence can be used to control the update of the model and to optimize the storage of rehearsal (Fig. 1(c)). On the one hand, the fused influence can directly control the magnitude of the training loss of each example. On the other hand, under a fixed memory budget, the fused influence can be used to decide which examples to store and which to drop, so that the rehearsal memory always keeps a larger positive influence on SP.

In summary, our contributions are four-fold: 1) Inspired by the influence function, we study CL from the perspective of example difference and propose MetaSP to compute the example influence on S and P. 2) We propose to trade off the S and P influence by solving a DOO problem and fuse them towards an SP Pareto optimum. 3) We leverage the fused influence to control the model update and to optimize the storage of rehearsal. 4) As verification, our experiments on both task- and class-incremental CL show better S and more stable P when example influence is taken into account.

## 2 Related Work

Continual Learning. Thanks to the efforts of many researchers, a large number of CL methods have been proposed, which can be grouped into three categories. Regularization-based methods [19, 11, 10] regularize the parameters corresponding to old tasks and penalize feature drift. Parameter-isolation methods [14, 26] generate task-specific parameter expansions or sub-branches. Rehearsal-based methods [29, 7, 22, 6, 1, 2, 28, 3, 25] tackle the SP dilemma by retaining a subset of old-task data in a memory buffer with bounded resources. Although the existing methods strive to achieve better SP, they do not explore what inside the training data contributes to Stability and Plasticity.
In this work, we explore the problem from the perspective of example difference, arguing that each example contributes differently to SP. We focus on the rehearsal-based CL framework so as to avoid divergence between models while evaluating the influence of old data at the same time.

Example Influence. In recent years, as Interpretable Machine Learning (IML) [27] develops, people have realized the importance of exploring the nature of data-driven machine learning. Examples are different, even when they belong to the same distribution, and because of such differences each example contributes differently to the learning pattern. In other words, example influence acquired in advance can significantly improve CL training. Some studies follow a similar idea and use influence to re-weight or drop out training data [31, 13, 36]. In contrast to complicated model designs, a model-agnostic algorithm estimates the influence of a training example by computing the derivative of a test loss with respect to the training example's weight. One typical method is the Influence Function [20], which relies on a pure second-order derivative (the Hessian) together with the chain rule. In this paper, to avoid the expensive computation of the Hessian inverse, we design a meta-learning [18] based method, which can further be used to control training.

## 3 Demystifying Example Influence on SP

### 3.1 Preliminary: Rehearsal-based CL

Given $T$ different tasks w.r.t. datasets $\{D_1, \dots, D_T\}$, Continual Learning (CL) learns them in sequence. The $t$-th dataset (task) $D_t = \{(x_t^{(n)}, y_t^{(n)})\}_{n=1}^{N_t}$ is split into a training set $D^{trn}_t$ and a test set $D^{tst}_t$, where $N_t$ is the number of examples. At any time, CL aims at learning a multi-task/multi-class predictor for all tasks/classes seen so far (task-incremental and class-incremental CL, respectively). To suppress catastrophic forgetting, rehearsal-based CL [30, 22, 32, 6, 16] builds a small memory buffer $\mathcal{M}_t$ sampled from $D^{trn}_t$ for each task (i.e., $|\mathcal{M}_t| \ll |D^{trn}_t|$). At the training phase, the data in the whole memory $\mathcal{M} = \bigcup_{k<t} \mathcal{M}_k$ is used together with the current task's training data.

In IF, $\mathcal{I}(D^{tst}, x^{trn}) > 0$ means a negative influence of the training example $x^{trn}$ on the test set $D^{tst}$; similarly, $\mathcal{I}(D^{tst}, x^{trn}) < 0$ means a positive influence on $D^{tst}$. Fortunately, the second-order derivative in IF is not necessary under the popular meta-learning paradigm such as [18]; instead, we can obtain an IF-like derivative through a one-step pseudo update. In the following, we introduce a simple yet effective meta-based method, named MetaSP, which simulates IF at each step with a two-level optimization and avoids computing the Hessian inverse.

### 4.2 Simulating IF for SP

Based on the meta-learning paradigm, we transform the example influence computation into a meta gradient descent problem, named MetaSP. For each training step in rehearsal-based CL, we have two mini-batches $B_{old}$ and $B_{new}$ for the old and new tasks, respectively. Our goal is to obtain the influence on S and P of every example in $B_{old} \cup B_{new}$.

Algorithm 1: Computation of Example Influence (MetaSP)
Input: $B_{old}$, $B_{new}$, $V_{old}$, $V_{new}$ (training batches, validation batches)
Output: $\mathcal{I}^*$ (Pareto example influence on SP)
1. $\hat{\theta}_{\mathcal{E},B} = \arg\min_{\theta}\ \ell(B_{old} \cup B_{new}, \theta) + \mathcal{E}^{\top} L(B_{old} \cup B_{new}, \theta)$ ; // pseudo update
2. $\mathcal{I}(V_{old}, B) = \nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B})$ ; // gradient from old validation loss
3. $\mathcal{I}(V_{new}, B) = \nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta}_{\mathcal{E},B})$ ; // gradient from new validation loss
4. $\gamma^* \leftarrow$ Eq. (11) ; // optimal fusion hyper-parameter
5. $\mathcal{I}^* = \gamma^*\, \mathcal{I}(V_{old}, B) + (1-\gamma^*)\, \mathcal{I}(V_{new}, B)$ ; // influence fusion
Note that both the S-aware and the P-aware influence are computed for every example, regardless of whether it comes from an old or the new task. That is, the contribution of an example is not predetermined: data of old tasks may also affect the new task positively, and vice versa. In rehearsal-based CL, we therefore compute the derivatives $\nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta})\big|_{\mathcal{E}=0}$ and $\nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta})\big|_{\mathcal{E}=0}$ as example influence. To compute these derivatives, as shown in Fig. 2(a), our MetaSP has two key steps.

(1) Pseudo update. This step simulates Eq. (5) in IF (the perturbed training objective) via a pseudo update

$$\hat{\theta}_{\mathcal{E},B} = \arg\min_{\theta}\ \ell(B_{old} \cup B_{new}, \theta) + \mathcal{E}^{\top} L(B_{old} \cup B_{new}, \theta), \qquad (7)$$

where $L$ denotes the loss vector of a mini-batch combining both old and new tasks and $\mathcal{E}$ is the vector of per-example perturbations.

(2) Compute example influence. This step computes the example influence on S and P for all training examples, simulating Eq. (6) (the influence derivative). Based on the pseudo-updated model in Eq. (7), we compute the S- and P-aware example influence via two validation sets $V_{old}$ and $V_{new}$. Because the test set $D^{tst}$ is unavailable at training time, we use two dynamic validation sets as its stand-in during CL training: one is sampled from the memory buffer ($V_{old}$), representing the old tasks, and the other from the seen training data of the current task ($V_{new}$), representing the new task. With $\mathcal{E}$ initialized to $\mathbf{0}$, the two kinds of example influence are computed as

$$\mathcal{I}(V_{old}, B) = \nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}), \qquad \mathcal{I}(V_{new}, B) = \nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}). \qquad (8)$$

Each element of the influence vectors $\mathcal{I}(V_{old}, B)$ and $\mathcal{I}(V_{new}, B)$ represents one example's influence on S and P, respectively. As in IF, a positive value means negative influence, and a negative value means positive influence.
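To make the two steps concrete, below is a minimal PyTorch-style sketch of how Eqs. (7)-(8) can be simulated with per-example perturbation weights and a one-step pseudo update. It is an illustration under our own assumptions: the function name `metasp_influence`, the explicit linear-model parameters, and the single SGD step standing in for the inner arg min are ours, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def metasp_influence(W, b, x_batch, y_batch, x_val_old, y_val_old,
                     x_val_new, y_val_new, lr=0.1):
    """One MetaSP-style step: per-example influence on S (old val) and P (new val).

    The model is a plain linear classifier with explicit parameters (W, b) so that
    the pseudo-updated weights can be used directly in a functional forward pass.
    """
    # Per-example perturbations E, initialised to zero (cf. Eqs. (7)-(8)).
    eps = torch.zeros(x_batch.size(0), requires_grad=True)

    # Perturbed training objective: mean loss + E^T (per-example loss vector).
    logits = x_batch @ W.t() + b
    per_example_loss = F.cross_entropy(logits, y_batch, reduction="none")
    obj = per_example_loss.mean() + (eps * per_example_loss).sum()

    # Pseudo update: one SGD step standing in for the inner arg-min of Eq. (7).
    # create_graph=True keeps the dependence of the new parameters on eps.
    gW, gb = torch.autograd.grad(obj, (W, b), create_graph=True)
    W_hat, b_hat = W - lr * gW, b - lr * gb

    def val_loss(x, y):
        return F.cross_entropy(x @ W_hat.t() + b_hat, y)

    # S- and P-aware influence: gradients of the two validation losses w.r.t. the
    # per-example perturbations (Eq. (8)); positive = harmful, negative = helpful.
    inf_old = torch.autograd.grad(val_loss(x_val_old, y_val_old), eps,
                                  retain_graph=True)[0]
    inf_new = torch.autograd.grad(val_loss(x_val_new, y_val_new), eps)[0]
    return inf_old, inf_new


if __name__ == "__main__":
    torch.manual_seed(0)
    d, c = 8, 4                                   # feature dim, number of classes
    W = torch.randn(c, d, requires_grad=True)
    b = torch.zeros(c, requires_grad=True)
    x = torch.randn(16, d); y = torch.randint(0, c, (16,))   # B_old ∪ B_new
    xo = torch.randn(8, d); yo = torch.randint(0, c, (8,))   # V_old
    xn = torch.randn(8, d); yn = torch.randint(0, c, (8,))   # V_new
    i_old, i_new = metasp_influence(W, b, x, y, xo, yo, xn, yn)
    print(i_old.shape, i_new.shape)               # one influence value per example
```

Because the perturbations enter the pseudo update, their gradients are non-zero even though they are initialised to zero; this is the same first-order trick that replaces the Hessian-inverse product of IF.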
## 5 Using Influence for Continual Learning

### 5.1 Before Using: Influence for SP Pareto Optimality

As shown in Eq. (8), the example influence is the derivative of the validation losses of the old and new tasks with respect to the perturbations $\mathcal{E}$. However, the two kinds of influence are independent and may interfere with each other: using only one of them may harm the other's performance. We prefer a solution that trades off the influence on both S and P. Thus, we integrate the two influences $\mathcal{I}(V_{old}, B)$ and $\mathcal{I}(V_{new}, B)$ into a DOO problem with two gradients from different objectives:

$$\min_{\mathcal{E}}\ \big\{\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}),\ \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) \,\big\}. \qquad (9)$$

The goal of Problem (9) is to obtain a fused solution that satisfies SP Pareto optimality.

Definition 3 (SP Pareto Optimality)
1. (Pareto Dominance) Let $\mathcal{E}_a, \mathcal{E}_b$ be two solutions of Problem (9). $\mathcal{E}_a$ is said to dominate $\mathcal{E}_b$ ($\mathcal{E}_a \prec \mathcal{E}_b$) if and only if $\ell(V, \hat{\theta}_{\mathcal{E}_a,B}) \le \ell(V, \hat{\theta}_{\mathcal{E}_b,B})$ for all $V \in \{V_{old}, V_{new}\}$, and $\ell(V, \hat{\theta}_{\mathcal{E}_a,B}) < \ell(V, \hat{\theta}_{\mathcal{E}_b,B})$ for some $V \in \{V_{old}, V_{new}\}$.
2. (SP Pareto Optimal) $\mathcal{E}$ is called SP Pareto optimal if no other solution achieves better values of both $\ell(V_{old}, \hat{\theta}_{\mathcal{E},B})$ and $\ell(V_{new}, \hat{\theta}_{\mathcal{E},B})$.

Inspired by the Multiple-Gradient Descent Algorithm (MGDA) [12], we transform Problem (9) into a min-norm problem. Specifically, according to the KKT conditions [15], we solve

$$\gamma^* = \arg\min_{\gamma}\ \big\| \gamma\, \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}) + (1-\gamma)\, \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) \big\|_2^2, \quad \text{s.t. } 0 \le \gamma \le 1. \qquad (10)$$

Referring to Sener et al. [34], the optimal $\gamma^*$ has the closed form

$$\gamma^* = \min\!\left(\max\!\left(\frac{\big(\nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) - \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B})\big)^{\top} \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B})}{\big\| \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) - \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}) \big\|_2^2},\ 0\right),\ 1\right). \qquad (11)$$

Thus, the SP Pareto influence of the training batch can be computed as

$$\mathcal{I}^* = \gamma^*\, \mathcal{I}(V_{old}, B) + (1-\gamma^*)\, \mathcal{I}(V_{new}, B). \qquad (12)$$

This process is illustrated in Fig. 2(a). Different from the S-aware and P-aware influence alone, the integrated influence pursues a Pareto optimum for both S and P, i.e., it reduces the negative influence on S or P while keeping the positive influence on both. The overall computation is summarized in Alg. 1; next we describe how to leverage example influence in CL training.

Algorithm 2: Using Example Influence in Rehearsal-based Continual Learning
Input: Initialized $\theta_0$, learning rate $\alpha$, training sets $\{D^{trn}_1, \dots, D^{trn}_T\}$, memory $\mathcal{M}$
Output: $\theta_T$ (final model)
1. for task $t = 1 : T$ do
2. &nbsp;&nbsp; $\theta_t$ = TrainNewTask($\theta_{t-1}$, $D^{trn}_t$, $\mathcal{M}$) (Alg. 3)
3. &nbsp;&nbsp; $C_1, C_2, \dots, C_{|\mathcal{M}|/t} \leftarrow$ K-Means($D^{trn}_t$)
4. &nbsp;&nbsp; Rank each $C_i$ by $\mathbb{E}(\mathcal{I}^*(x))$, $x \in C_i$
5. &nbsp;&nbsp; Rank $\mathcal{M}$ by $\mathbb{E}(\mathcal{I}^*(x))$, $x \in \mathcal{M}$
6. &nbsp;&nbsp; for $i = 1 : |\mathcal{M}|/t$ do
7. &nbsp;&nbsp;&nbsp;&nbsp; Pop the bottom of $\mathcal{M}$
8. &nbsp;&nbsp;&nbsp;&nbsp; Push the top of $C_i$ to $\mathcal{M}$

### 5.2 Model Update Using Example Influence

With the example influence computed for each mini-batch, we can control the model update of that mini-batch so that training moves in an overall positive direction. Given the parameters $\theta$ from the previous iteration and the step size $\alpha$, the model would be updated by traditional SGD as $\theta = \theta - \alpha \nabla_{\theta}\, \ell(B, \theta)$, where $B = B_{old} \cup B_{new}$. Regularizing the update with the example influence $\mathcal{I}^*$, we instead use

$$\theta = \theta - \alpha \nabla_{\theta} \Big( \ell(B, \theta) + (-\mathcal{I}^*)^{\top} L(B, \theta) \Big). \qquad (13)$$

Algorithm 3: Training New Task
Input: Initialized $\theta_t$, training set $D^{trn}_t$, memory $\mathcal{M}$, learning rate $\alpha$
Output: Trained $\theta_t$
1. for $i = 1 :$ ITER_NUM do
2. &nbsp;&nbsp; $B_{new} \leftarrow D^{trn}_t$
3. &nbsp;&nbsp; if $t = 1$ then
4. &nbsp;&nbsp;&nbsp;&nbsp; $\theta_t = \theta_t - \alpha \nabla_{\theta}\, \ell(B_{new}, \theta_t)$
5. &nbsp;&nbsp; else
6. &nbsp;&nbsp;&nbsp;&nbsp; $B_{old} \leftarrow \mathcal{M}$, $V_{old} \leftarrow \mathcal{M}$, $V_{new} \leftarrow D^{trn}_t$
7. &nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{I}^* \leftarrow$ METASP($B_{old}$, $B_{new}$, $V_{old}$, $V_{new}$)
8. &nbsp;&nbsp;&nbsp;&nbsp; $\theta_t = \theta_t - \alpha \nabla_{\theta}\big( \ell(B_{old} \cup B_{new}, \theta_t) + (-\mathcal{I}^*)^{\top} L(B_{old} \cup B_{new}, \theta_t) \big)$

MetaSP thus offers regularized updates at every step of rehearsal-based CL, which leads training to better SP with complexity only $O(|B|q + vq)$ ($v$ denotes the validation size), compared with $O(|B|q^2 + q^3)$ for IF. We show this application in Fig. 2(b). By updating as above, we make full use of the influence of each example: less useful examples are restrained and positive examples are emphasized, which may improve the acquisition of new knowledge and the maintenance of old knowledge simultaneously.

### 5.3 Rehearsal Selection Using Example Influence

Rehearsal under a fixed budget needs to consider both storing and dropping so that the memory $\mathcal{M}$ keeps a core set of all old tasks. Traditionally, storing and dropping are both based on random example selection, which ignores the difference in SP influence among examples. Given the influence $\mathcal{I}^*(x)$ representing the contribution of example $x$ to SP, we further use it to improve the rehearsal strategy under a fixed memory budget. The example influence above is computed at the mini-batch level; we promote it to the whole dataset according to the law of large numbers, taking the influence value of example $x$ as its expectation over batches, i.e., $\mathbb{E}(\mathcal{I}^*(x))$.

The fixed-size memory is divided evenly over the number of seen tasks. After task $t$ finishes training, we conduct our influence-aware rehearsal selection strategy as shown in Fig. 2(c). For storing, we first cluster all training data of the task into $|\mathcal{M}|/t$ groups using K-means to diversify the stored data; each group is ranked by its SP influence value, and the examples with the most positive influence on SP are selected for storage. For dropping, we rank the memory buffer by influence value and drop the $|\mathcal{M}|/t$ most negative examples. In this way, $\mathcal{M}$ always stores diverse examples with positive SP influence.
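As a concrete illustration of Eqs. (11)-(13), the sketch below performs one influence-guided update step: it computes the closed-form $\gamma^*$, fuses the two influence vectors, and re-weights the per-example losses before the real SGD step. This is a sketch under our own assumptions; the helper names `pareto_gamma` and `fused_update_step` and the plain SGD step are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def pareto_gamma(inf_old: torch.Tensor, inf_new: torch.Tensor) -> float:
    """Closed-form solution of the min-norm problem (Eq. (11)), clipped to [0, 1]."""
    diff = inf_new - inf_old
    denom = diff.pow(2).sum()
    if denom == 0:                      # identical influence vectors: any gamma works
        return 0.5
    gamma = (diff @ inf_new) / denom
    return float(gamma.clamp(0.0, 1.0))

def fused_update_step(model, optimizer, x_batch, y_batch, inf_old, inf_new):
    """One influence-regularized update (Eqs. (12)-(13)).

    inf_old / inf_new are the per-example S- and P-aware influences of this
    mini-batch (e.g., from the MetaSP sketch above); positive values indicate a
    harmful example, negative values a helpful one.
    """
    gamma = pareto_gamma(inf_old, inf_new)
    fused = gamma * inf_old + (1.0 - gamma) * inf_new          # Eq. (12)

    per_example_loss = F.cross_entropy(model(x_batch), y_batch, reduction="none")
    # Eq. (13): base loss plus (-I*)^T L, i.e. per-example weights of (1 - I*_i),
    # so helpful examples (negative influence) are emphasized and harmful ones damped.
    loss = per_example_loss.mean() + (-fused.detach() * per_example_loss).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), gamma
```

In the full algorithm, the same fused values also feed the rehearsal selection of Section 5.3, where examples are ranked by their expected influence $\mathbb{E}(\mathcal{I}^*(x))$.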
## 6 Experiments

### 6.1 Datasets and Implementation Details

We use three commonly used benchmarks for evaluation: 1) Split CIFAR-10 [37] consists of 5 tasks, with 2 distinct classes and 5,000 exemplars per class, derived from the CIFAR-10 dataset; 2) Split CIFAR-100 [37] splits the original CIFAR-100 dataset into 10 disjoint subsets, each of which is considered a separate task with 10 classes; 3) Split Mini-ImageNet [35] is a subset of 100 classes from ImageNet [9], rescaled to 32x32. Each class has 600 samples, randomly subdivided into a training set (80%) and a test set (20%); the Mini-ImageNet dataset is equally divided into 5 disjoint tasks.

We employ ResNet-18 [17] as the backbone, trained from scratch. We use the Stochastic Gradient Descent (SGD) optimizer and keep the batch size fixed at 32 to guarantee an equal number of updates; the rehearsal batch sampled from the memory buffer is also of size 32. We construct the SP validation sets in MetaSP by randomly sampling 10% of the seen data and 10% of the memory buffer at each training step. Other settings follow the ER tricks of [4], including 50 total epochs and the hyper-parameters. All results are averaged over 5 fixed seeds for fairness.

To better evaluate the CL process, we measure SP with the following four metrics, where the indicator $\mathbb{1}(y, \theta(x))$ equals 1 if the model prediction matches the ground truth and 0 otherwise, and $\theta_k$ denotes the model after training task $k$.

1) First Accuracy: $A_1 = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|D^{tst}_t|} \sum_{(x_i, y_i) \in D^{tst}_t} \mathbb{1}(y_i, \theta_t(x_i))$. For each task, we evaluate its test performance immediately after it is trained, which indicates Plasticity, i.e., the capability of learning new knowledge.

2) Final Accuracy: $A_{\infty} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|D^{tst}_t|} \sum_{(x_i, y_i) \in D^{tst}_t} \mathbb{1}(y_i, \theta_T(x_i))$. This is the final performance of each task after the whole sequence, which indicates Stability, i.e., the capability of suppressing catastrophic forgetting.

3) Mean Average Accuracy: $A_m = \frac{1}{T} \sum_{t=1}^{T} \Big( \frac{1}{t} \sum_{k=1}^{t} \frac{1}{|D^{tst}_k|} \sum_{(x_i, y_i) \in D^{tst}_k} \mathbb{1}(y_i, \theta_t(x_i)) \Big)$. This metric is computed along the CL process, indicating the SP performance after each task is trained.

4) Backward Transfer: $\mathrm{BWT} = \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{1}{|D^{tst}_t|} \sum_{(x, y) \in D^{tst}_t} \big( \mathbb{1}(y, \theta_T(x)) - \mathbb{1}(y, \theta_t(x)) \big) = \frac{T}{T-1} (A_{\infty} - A_1)$. This metric is the performance drop from the first to the final accuracy of each task.
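All four metrics can be derived from the matrix of per-task accuracies $a_{k,t}$ (accuracy on task $t$'s test set after training task $k$). Below is a minimal sketch assuming such an accuracy matrix has already been collected; the function name `cl_metrics` and the toy numbers are ours.

```python
import numpy as np

def cl_metrics(acc: np.ndarray) -> dict:
    """CL metrics from an accuracy matrix.

    acc[k, t] is the accuracy on task t's test set after training task k
    (only the lower triangle k >= t is used). Returns A1, A_inf, Am, BWT.
    """
    T = acc.shape[0]
    a1 = np.mean([acc[t, t] for t in range(T)])              # First Accuracy
    a_inf = np.mean([acc[T - 1, t] for t in range(T)])       # Final Accuracy
    # Mean Average Accuracy: average over seen tasks, measured after each task.
    am = np.mean([np.mean(acc[k, : k + 1]) for k in range(T)])
    # Backward Transfer: drop from first to final accuracy; equals T/(T-1)*(A_inf - A1).
    bwt = np.mean([acc[T - 1, t] - acc[t, t] for t in range(T - 1)])
    return {"A1": a1, "A_inf": a_inf, "Am": am, "BWT": bwt}

if __name__ == "__main__":
    # Toy 3-task example: rows are "after task k", columns are "task t".
    acc = np.array([[0.90, 0.00, 0.00],
                    [0.70, 0.88, 0.00],
                    [0.60, 0.75, 0.92]])
    print(cl_metrics(acc))
```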
### 6.2 Main Comparison Results

We compare our method against 8 rehearsal-based methods (GDUMB [28], GEM [22], AGEM [6], HAL [5], GSS [2], MIR [1], GMED [33], and ER [7]). In addition, we provide a lower bound that trains on new data directly without any forgetting-avoidance strategy (Finetune) and an upper bound given by joint training on all task data (Joint).

Table 1: Comparisons on three datasets, averaged across 5 runs (see std. in the Appendix). In the original table, red and blue values mark the best among our methods and among the compared methods, and a marker indicates that our method is significantly better than the compared method (paired t-tests at the 95% significance level).

Split CIFAR-10, class-incremental:

| Method | A1 (300) | A∞ (300) | Am (300) | BWT (300) | A1 (500) | A∞ (500) | Am (500) | BWT (500) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 19.66 | – | – | – | 19.66 | – | – |
| Joint | – | 91.79 | – | – | – | 91.79 | – | – |
| GDUMB [28] | – | 36.92 | – | – | – | 44.27 | – | – |
| GEM [22] | 93.90 | 37.51 | 55.43 | -70.48 | 92.76 | 36.95 | 57.36 | -69.76 |
| AGEM [6] | 96.57 | 20.02 | 45.57 | -95.68 | 96.56 | 20.01 | 46.52 | -95.69 |
| HAL [5] | 91.30 | 24.45 | 46.34 | -83.56 | 91.96 | 27.94 | 49.05 | -80.01 |
| MIR [1] | 96.70 | 38.53 | 56.96 | -72.72 | 96.65 | 42.65 | 59.99 | -67.50 |
| GSS [2] | 96.53 | 35.89 | 54.33 | -75.80 | 96.55 | 41.96 | 58.16 | -68.24 |
| GMED [33] | 96.65 | 38.12 | 58.92 | -73.16 | 96.65 | 43.68 | 62.56 | -66.21 |
| ER [7] | 96.73 | 34.19 | 53.72 | -78.18 | 96.74 | 40.45 | 57.69 | -70.36 |
| Ours | 96.87 | 42.42 | 63.52 | -68.05 | 96.82 | 49.16 | 67.88 | -59.57 |
| Ours+RehSel | 96.85 | 43.76 | 63.69 | -66.36 | 96.81 | 50.10 | 68.28 | -58.38 |

Split CIFAR-10, task-incremental:

| Method | A1 (300) | A∞ (300) | Am (300) | BWT (300) | A1 (500) | A∞ (500) | Am (500) | BWT (500) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 65.27 | – | – | – | 65.27 | – | – |
| Joint | – | 98.16 | – | – | – | 98.16 | – | – |
| GDUMB [28] | – | 73.22 | – | – | – | 78.06 | – | – |
| GEM [22] | 96.62 | 89.34 | 92.49 | -9.09 | 96.73 | 90.42 | 92.93 | -7.88 |
| AGEM [6] | 96.78 | 85.52 | 90.16 | -14.07 | 96.71 | 86.45 | 90.90 | -12.83 |
| HAL [5] | 91.41 | 79.90 | 83.78 | -14.39 | 92.03 | 81.84 | 84.19 | -12.73 |
| MIR [1] | 96.76 | 88.50 | 90.87 | -10.33 | 96.73 | 90.63 | 91.99 | -7.62 |
| GSS [2] | 96.56 | 88.05 | 90.60 | -10.63 | 96.57 | 90.38 | 92.19 | -7.73 |
| GMED [33] | 96.73 | 88.91 | 91.20 | -9.76 | 96.72 | 89.72 | 92.10 | -8.75 |
| ER [7] | 96.93 | 88.97 | 91.12 | -9.95 | 96.79 | 90.60 | 92.28 | -7.74 |
| Ours | 97.10 | 89.40 | 92.54 | -9.62 | 97.31 | 90.91 | 93.38 | -7.99 |
| Ours+RehSel | 97.11 | 89.91 | 92.66 | -8.99 | 97.30 | 91.41 | 93.28 | -7.36 |

Split CIFAR-100, class-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 9.14 | – | – | – | 9.14 | – | – |
| Joint | – | 71.25 | – | – | – | 71.25 | – | – |
| GDUMB [28] | – | 11.11 | – | – | – | 15.75 | – | – |
| GEM [22] | 85.28 | 15.91 | 29.38 | -77.07 | 84.28 | 22.79 | 34.09 | -68.32 |
| AGEM [6] | 85.97 | 9.31 | 24.60 | -85.18 | 85.66 | 9.27 | 24.67 | -84.88 |
| HAL [5] | 67.33 | 8.20 | 22.72 | -65.70 | 68.06 | 10.59 | 24.74 | -63.86 |
| MIR [1] | 87.38 | 13.49 | 28.88 | -82.09 | 87.39 | 17.56 | 32.48 | -77.59 |
| GSS [2] | 86.03 | 14.01 | 28.00 | -80.02 | 86.31 | 17.87 | 31.82 | -76.04 |
| GMED [33] | 87.18 | 14.56 | 33.41 | -80.68 | 87.29 | 18.67 | 38.69 | -76.23 |
| ER [7] | 87.23 | 13.75 | 28.88 | -81.64 | 87.33 | 17.56 | 32.45 | -77.52 |
| Ours | 88.13 | 18.96 | 38.62 | -76.85 | 87.58 | 24.78 | 45.20 | -69.76 |
| Ours+RehSel | 87.81 | 19.28 | 39.23 | -76.13 | 87.55 | 25.72 | 45.48 | -68.69 |

Split CIFAR-100, task-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 33.85 | – | – | – | 33.85 | – | – |
| Joint | – | 91.63 | – | – | – | 91.63 | – | – |
| GDUMB [28] | – | 36.40 | – | – | – | 43.25 | – | – |
| GEM [22] | 85.53 | 68.68 | 68.49 | -18.72 | 85.24 | 73.71 | 72.59 | -12.81 |
| AGEM [6] | 85.97 | 55.28 | 58.23 | -34.10 | 85.66 | 55.95 | 59.96 | -33.01 |
| HAL [5] | 67.64 | 44.98 | 50.79 | -25.17 | 68.62 | 50.07 | 54.01 | -20.61 |
| MIR [1] | 87.42 | 66.18 | 67.43 | -23.60 | 87.50 | 71.20 | 71.42 | -18.10 |
| GSS [2] | 86.10 | 66.80 | 66.55 | -21.44 | 86.44 | 71.98 | 71.00 | -16.06 |
| GMED [33] | 87.30 | 68.82 | 72.66 | -20.53 | 87.49 | 73.91 | 76.36 | -15.10 |
| ER [7] | 87.29 | 66.82 | 67.56 | -22.73 | 87.40 | 71.74 | 71.60 | -17.40 |
| Ours | 88.94 | 70.03 | 74.07 | -21.01 | 88.94 | 75.32 | 78.09 | -15.14 |
| Ours+RehSel | 88.58 | 70.81 | 74.24 | -19.74 | 89.03 | 76.14 | 78.27 | -14.32 |

Split Mini-ImageNet, class-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 11.12 | – | – | – | 11.12 | – | – |
| Joint | – | 44.39 | – | – | – | 44.39 | – | – |
| GDUMB [28] | – | 6.22 | – | – | – | 7.15 | – | – |
| AGEM [6] | 50.06 | 10.69 | 22.29 | -49.22 | 50.03 | 10.69 | 22.28 | -49.16 |
| MIR [1] | 51.44 | 11.07 | 23.65 | -50.46 | 51.25 | 11.32 | 24.09 | -49.92 |
| GSS [2] | 51.63 | 11.09 | 23.62 | -50.66 | 51.35 | 11.42 | 24.05 | -49.91 |
| GMED [33] | 51.21 | 11.03 | 24.47 | -50.23 | 50.87 | 11.73 | 25.50 | -48.93 |
| ER [7] | 51.68 | 11.00 | 23.71 | -50.84 | 51.41 | 11.35 | 24.08 | -50.08 |
| Ours | 51.76 | 12.48 | 26.50 | -49.10 | 50.91 | 14.43 | 28.47 | -45.59 |
| Ours+RehSel | 51.81 | 12.74 | 26.43 | -48.84 | 50.96 | 14.54 | 28.44 | -45.52 |

Split Mini-ImageNet, task-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 23.46 | – | – | – | 23.46 | – | – |
| Joint | – | 62.30 | – | – | – | 62.30 | – | – |
| GDUMB [28] | – | 16.37 | – | – | – | 17.69 | – | – |
| AGEM [6] | 50.06 | 18.34 | 28.05 | -39.65 | 50.03 | 18.78 | 28.12 | -39.05 |
| MIR [1] | 51.47 | 29.10 | 35.20 | -27.95 | 51.31 | 31.39 | 37.24 | -24.89 |
| GSS [2] | 51.64 | 28.67 | 35.22 | -28.71 | 51.40 | 31.75 | 37.23 | -24.56 |
| GMED [33] | 51.29 | 30.47 | 37.64 | -26.02 | 51.00 | 32.85 | 39.66 | -22.69 |
| ER [7] | 51.70 | 28.97 | 35.30 | -28.40 | 51.55 | 31.59 | 37.36 | -24.95 |
| Ours | 52.44 | 32.59 | 39.38 | -24.82 | 52.27 | 36.25 | 41.59 | -20.02 |
| Ours+RehSel | 51.73 | 34.36 | 40.48 | -21.70 | 51.47 | 37.20 | 42.19 | -17.83 |

Table 1 shows the quantitative results of all compared methods and the proposed MetaSP in the class-incremental and task-incremental settings. First of all, by controlling training according to the influence on SP, the proposed MetaSP outperforms the other methods on all metrics. As the memory buffer grows, all rehearsal-based CL methods improve, while the advantages of MetaSP become more obvious. In terms of the First Accuracy $A_1$, which indicates the ability to learn new tasks, our method outperforms most other methods by a small numerical margin. In terms of the Final Accuracy $A_\infty$, which measures forgetting, we obtain a clear improvement over the second-best result of 3.17 on average in the class-incremental setting and 1.77 on average in the task-incremental setting. This shows that MetaSP keeps learning the new task stably while suppressing catastrophic forgetting. This is because, although the new task tends to have larger gradients that dominate the update in all rehearsal-based CL methods, our method emphasizes examples with positive effect and restrains negative-impact examples. In terms of the Mean Average Accuracy $A_m$, which evaluates SP throughout the whole CL process, our method shows significant superiority with average improvements of over 4.44 and 1.24 w.r.t. the second-best results in the class-incremental and task-incremental settings. The complete results with std. can be found in the Appendix. Moreover, with the proposed rehearsal selection strategy (Ours+RehSel), $A_\infty$ improves further, which means the examples selected according to their influence clearly reduce catastrophic forgetting. With our Rehearsal Selection (RehSel) strategy, we obtain an improvement of 0.77 on $A_\infty$, but $A_1$ and $A_m$ fluctuate; this suggests a better memory may bring in more task conflict.

Figure 3: Top: statistics of examples with positive and negative influence on S, P, and SP ((a) the 500 memory examples in total; (b) the 10,000 new-task examples per task). Bottom: all example influences are divided equally into 5 groups, and the number of examples in each range is counted.

### 6.3 Analysis of Dataset Influence on SP

In Fig. 3, we count the examples with positive/negative influence on the old tasks (S), the new task (P), and total SP in Split CIFAR-10. At each task after task 2, we have a fixed-size memory of 500 examples and 10,000 new-task examples. We first find that most data of old tasks has a positive influence on S and a negative influence on P, while most data of new tasks has a positive influence on P and a negative influence on S. Even so, some data in both new and old tasks has the opposite influence. For the total SP influence, most memory data has a positive influence, whereas examples of new tasks split into nearly equal numbers of positive and negative SP influence. Thus, by clustering and storing examples with higher influence into the rehearsal memory buffer, the old knowledge can be kept. By dividing all example influences equally into 5 groups from the minimum to the maximum, we find that most examples have mid-level influence and serve as the main body of the dataset; the numbers of examples with large positive or negative influence are small, which means unique examples are few in the dataset. These observations suggest that example difference should be used to improve model training.

Figure 4: SP Pareto front ($A_1$ vs. $A_\infty$; curves for Ours, Ours only S, Ours only P, GSS, MIR, GMED).

### 6.4 Analysis on SP Pareto Optimum

In this paper, we convert the fusion of S-aware and P-aware influence into a DOO problem and use MGDA to guarantee that the fused solution is an SP Pareto optimum. Fig. 4 compares the First Accuracy and Final Accuracy coordinates of all compared methods. We also evaluate variants using only Stability-aware influence (Ours only S) and only Plasticity-aware influence (Ours only P). Even with a single kind of influence, our method already achieves better SP than the other methods, and integrating the two kinds of influence yields an even more balanced SP. In contrast, the existing methods cannot approach the SP Pareto front well.
### 6.5 Training Time

Table 2: Comparison of training time [s] on Split CIFAR-10.

| Method | ER | GSS | AGEM | HAL | MIR | GMED | GEM | MetaSP |
|---|---|---|---|---|---|---|---|---|
| One-step | 0.013 | 0.015 | 0.029 | 0.043 | 0.077 | 0.093 | 0.290 | 0.250 |
| Total | 2685 | 2672 | 3812 | 5029 | 7223 | 8565 | 24768 | 5898 |

We list the one-step update time and the total training overhead of all compared methods on the Split CIFAR-10 dataset. For the one-step update, we evaluate all methods with one batch on one update. Our method takes more time than the other methods except GEM, because of the pseudo update, the backward pass on the perturbations, and the influence fusion. To guarantee efficiency, we apply the proposed method only in the last 5 epochs of training, and the remaining epochs are naive fine-tuning (see details in the Appendix). The results show that this strategy is as fast as other light-weight methods while achieving a large improvement on SP. We also use this setting for the comparison in Table 1.

## 7 Conclusion

In this paper, we explored example influence on the Stability-Plasticity (SP) dilemma in rehearsal-based continual learning. To achieve this, we evaluated example influence via small perturbations instead of the computationally expensive Hessian-based influence function, and proposed a simple yet effective MetaSP algorithm. At each iteration of CL training, MetaSP performs a pseudo update and obtains the S- and P-aware example influence at the batch level. The two kinds of influence are then combined via an SP Pareto-optimal factor and used to regularize the model update. Moreover, the example influence can be used to optimize rehearsal selection. Experimental results on three popular CL datasets verify the effectiveness of the proposed method.

We also list the limitations of the proposed method. (1) It relies on rehearsal, which may affect privacy and requires extra storage. (2) It is not fast enough for online continual learning, although in most situations our training tricks can be leveraged to reduce the time. (3) It is limited by extremely small memory sizes: a larger memory means better remembering and a more accurate validation set, and the proposed method does not perform well when the memory size is extremely small.

## Acknowledgement

This work is financially supported in part by the National Key Research and Development Program of China under Grant (No. 2019YFC1520904) and the Natural Science Foundation of China (Nos. 62072334, 62276182, 61876220). The authors would like to thank the experienced reviewers and AE for their constructive and valuable suggestions.

## References

[1] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In NeurIPS, 2019.
[2] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In NeurIPS, 2019.
[3] Benedikt Bagus and Alexander Gepperth. An investigation of replay-based approaches for continual learning. In IJCNN, 2021.
[4] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking experience replay: a bag of tricks for continual learning. In ICPR, 2021.
[5] Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. In AAAI, 2021.
[6] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2018.
[7] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
[8] Kalyanmoy Deb and Himanshu Gupta. Searching for robust Pareto-optimal solutions in multi-objective optimization. In ICEMO, 2005.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Kaile Du, Linyan Li, Fan Lyu, Fuyuan Hu, Zhenping Xia, and Fenglei Xu. Class-incremental lifelong learning in multi-label classification. arXiv preprint arXiv:2207.07840, 2022.
[11] Kaile Du, Fan Lyu, Fuyuan Hu, Linyan Li, Wei Feng, Fenglei Xu, and Qiming Fu. AGCN: Augmented graph convolutional network for lifelong multi-label image recognition. In ICME, 2022.
[12] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 2012.
[13] Yang Fan, Yingce Xia, Lijun Wu, Shufang Xie, Weiqing Liu, Jiang Bian, Tao Qin, and Xiang-Yang Li. Learning to reweight with deep interactions. arXiv preprint arXiv:2007.04649, 2020.
[14] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[15] Jörg Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 2000.
[16] Yunhui Guo, Mingrui Liu, Tianbao Yang, and Tajana Rosing. Learning with long-term remembering: Following the lead of mixed stochastic gradient. arXiv preprint arXiv:1909.11763, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
[19] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
[20] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
[21] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. IEEE, 1996.
[22] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NeurIPS, 2017.
[23] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, 2020.
[24] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016.
[25] Fan Lyu, Shuai Wang, Wei Feng, Zihan Ye, Fuyuan Hu, and Song Wang. Multi-domain multi-task rehearsal for lifelong learning. In AAAI, 2021.
[26] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, 2018.
[27] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.
[28] Ameya Prabhu, Philip Torr, and Puneet Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, 2020.
[29] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 1990.
[30] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.
[31] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
[32] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
[33] Liu Risheng, Liu Yaohua, Zeng Shangzhi, and Zhang Jin. Gradient-based editing of memory examples for online task-free continual learning. In NeurIPS, 2021.
[34] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In NeurIPS, 2018.
[35] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[36] Tianyang Wang, Jun Huan, and Bo Li. Data dropout: Optimizing training data for convolutional neural networks. In ICTAI, 2018.
[37] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [No]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A] We use public datasets and open-sourced code.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]