# Exploring Example Influence in Continual Learning

Qing Sun, Fan Lyu, Fanhua Shang, Wei Feng, Liang Wan
College of Intelligence and Computing, Tianjin University
{sssunqing, fanlyu, fhshang, wfeng, lwan}@tju.edu.cn
https://github.com/SSSunQing/Example_Influence_CL
Co-first authors. Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Continual Learning (CL) sequentially learns new tasks like human beings, with the goal of achieving better Stability (S, remembering past tasks) and Plasticity (P, adapting to new tasks). Because past training data is not available, it is valuable to explore the influence difference on S and P among training examples, which may improve the learning pattern towards better SP. Inspired by the Influence Function (IF), we first study example influence by adding a perturbation to each example's weight and computing the influence derivation. To avoid the storage and computation burden of the Hessian inverse in neural networks, we propose a simple yet effective MetaSP algorithm to simulate the two key steps in the computation of IF and obtain the S- and P-aware example influence. Moreover, we propose to fuse the two kinds of example influence by solving a dual-objective optimization problem, and obtain a fused influence towards SP Pareto optimality. The fused influence can be used to control the update of the model and to optimize the storage of rehearsal. Empirical results show that our algorithm significantly outperforms state-of-the-art methods on both task- and class-incremental benchmark CL datasets.

## 1 Introduction

By mimicking human-like learning, Continual Learning (CL) aims to enable a model to continuously learn novel knowledge (new tasks, new classes, etc.) in a sequential order. The major challenge in CL is to harness catastrophic forgetting and knowledge transition, namely the Stability-Plasticity dilemma, where Stability (S) denotes the ability to prevent performance drops on old tasks and Plasticity (P) refers to whether a new task can be learned rapidly and unimpededly. Intuitively, a robust CL system should achieve outstanding S and P throughout sequential learning.

The sequential paradigm means CL does not access past training data. Compared with traditional machine learning, the training data in CL is thus more precious, and it is valuable to explore the influence difference on S and P among training examples. Following the well-established influence chain Data → Model → Performance, exploring this difference is equivalent to tracing from performance back to example difference. With appropriate control, this may improve the learning pattern towards better SP. The goal of this paper is therefore to explore the influence of each training example on SP and to apply the example influence to CL training.

To understand example influence, one classic and successful technique is the Influence Function (IF) [20], which applies the derivation chain rule from a test objective back to training examples. However, directly applying the chain rule requires computing the inverse of the Hessian with complexity $O(nq^2 + q^3)$ ($n$ is the number of examples and $q$ is the parameter size), which is computationally intensive and may run out of memory for neural networks.
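For reference, the influence function of Koh and Liang [20] measures how up-weighting one training example by a small $\epsilon$ changes a test loss; the statement below is the standard construction in our own notation (given here only as background for the perturbation-based design that follows):

```latex
% Standard influence-function construction (Koh & Liang, 2017), in our notation.
% Up-weight a single training example x^{trn} by a small perturbation \epsilon:
\hat{\theta}_{\epsilon, x^{trn}}
  = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, \theta) + \epsilon\, \ell(x^{trn}, \theta).
% The influence of x^{trn} on the loss over a test set D^{tst} is the derivative at \epsilon = 0,
% which the chain rule turns into a Hessian-inverse-vector product:
\mathcal{I}(D^{tst}, x^{trn})
  = \left.\frac{d\, \ell(D^{tst}, \hat{\theta}_{\epsilon, x^{trn}})}{d\epsilon}\right|_{\epsilon = 0}
  = -\,\nabla_{\theta}\ell(D^{tst}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta}\ell(x^{trn}, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^{2}_{\theta}\,\ell(x_i, \hat{\theta}).
```

Forming and inverting $H_{\hat{\theta}}$ is exactly the $O(nq^2 + q^3)$ cost mentioned above, which motivates simulating the two steps (perturbed objective and derivative) instead of evaluating them in closed form.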
In this paper, we propose a novel meta-learning algorithm, called MetaSP, to compute example influence by simulating IF. We build it on the rehearsal-based CL framework, which avoids forgetting by re-training a part of the old data. First, a pseudo update is performed with example-level perturbations. Then, two validation sets sampled from seen data are used to compute the gradients with respect to the example perturbations. These gradients are regarded as the example influence on S and P. As shown in Fig. 1(a), examples can be distinguished by the value of their influence on S and P.

Figure 1: Training examples have different influences on Stability and Plasticity. Given an old task with classes cat and dog and a new task with classes Bird and Fish, we compute the influence on S and P for each example ((a) influence acquirement). Then, we fuse the two kinds of influence towards the SP Pareto front ((b) influence fusion). We also show that example influence can be used to adjust the model update and to optimize rehearsal selection under a fixed-size memory buffer ((c) influence utilization).

To leverage the two independent kinds of influence in CL, we need to take full account of the influence on both S and P. However, the influence on S and P may interfere with each other, which forces a trade-off. This can be cast as a Dual-Objective Optimization (DOO) problem, which aims to find solutions not dominated by any other solution, i.e., Pareto optimal solutions [8]. We call such solutions the example influence on SP. Following the gradient-based MGDA algorithm [12], we obtain the fused example influence on SP by satisfying the Karush-Kuhn-Tucker (KKT) conditions, as illustrated in Fig. 1(b). Finally, we show that the fused influence can be used to control the update of the model and to optimize the storage of rehearsal (Fig. 1(c)). On the one hand, the fused influence can directly control the magnitude of the training loss of each example. On the other hand, under a fixed memory budget, the fused influence can be used to decide which examples to store and which to drop, so that the rehearsal memory always keeps a larger positive influence on SP.

In summary, our contributions are four-fold: 1) Inspired by the influence function, we study CL from the perspective of example difference and propose MetaSP to compute the example influence on S and P. 2) We propose to trade off the S and P influence by solving a DOO problem and fuse them towards an SP Pareto optimum. 3) We leverage the fused influence to control the model update and to optimize the storage of rehearsal. 4) As verification, our experiments on both task- and class-incremental CL show better S and more stable P when example influence is taken into account.

## 2 Related Work

Continual Learning. Thanks to the efforts of many researchers, a large number of CL methods have been proposed, which can be grouped into three categories. Regularization-based methods [19, 11, 10] regularize the parameters corresponding to old tasks and penalize feature drift. Parameter-isolation methods [14, 26] generate task-specific parameter expansions or sub-branches. Rehearsal-based methods [29, 7, 22, 6, 1, 2, 28, 3, 25] tackle the SP dilemma by retaining a subset of old-task data in a memory buffer with bounded resources. Although the existing methods strive to achieve better SP, they do not explore what inside the training data contributes to Stability and Plasticity.
In this work, we explore the problem from the perspective of example difference, arguing that each example contributes differently to SP. We focus on the rehearsal-based CL framework so as to avoid divergence between models while evaluating the influence of old data at the same time.

Example Influence. In recent years, as Interpretable Machine Learning (IML) [27] develops, people have realized the importance of exploring the nature of data-driven machine learning. Examples are different, even when they belong to the same distribution, and because of such differences each example contributes differently to the learning pattern. In other words, example influence acquired in advance can significantly improve CL training. Some studies follow a similar idea and use influence to re-weight or drop out training data [31, 13, 36]. In contrast to complicated model designs, a model-agnostic algorithm estimates the influence of a training example by computing the derivative of a test loss with respect to the training example's weight. One typical method is the Influence Function [20], which relies on a pure second-order derivative (the Hessian) together with the chain rule. In this paper, to avoid the expensive computation of the Hessian inverse, we design a meta-learning [18] based method, which can further be used to control training.

## 3 Demystifying Example Influence on SP

### 3.1 Preliminary: Rehearsal-based CL

Given $T$ different tasks w.r.t. datasets $\{D_1, \dots, D_T\}$, Continual Learning (CL) learns them in sequence. The $t$-th dataset (task) $D_t = \{(x_t^{(n)}, y_t^{(n)})\}_{n=1}^{N_t}$ is split into a training set $D^{trn}_t$ and a test set $D^{tst}_t$, where $N_t$ is the number of examples. At any time, CL aims at learning a multi-task/multi-class predictor for all tasks/classes seen so far (task-incremental and class-incremental CL, respectively). To suppress catastrophic forgetting, rehearsal-based CL [30, 22, 32, 6, 16] builds a small memory buffer $\mathcal{M}_t$ sampled from $D^{trn}_t$ for each task (i.e., $|\mathcal{M}_t| \ll |D^{trn}_t|$). At the training phase, the data in the whole memory $\mathcal{M} = \bigcup_{k<t} \mathcal{M}_k$ is used together with the current task's training data.

In IF, $\mathcal{I}(D^{tst}, x^{trn}) > 0$ means a negative influence of the training example $x^{trn}$ on the test set $D^{tst}$; similarly, $\mathcal{I}(D^{tst}, x^{trn}) < 0$ means a positive influence on $D^{tst}$. Fortunately, the second-order derivative in IF is not necessary under the popular meta-learning paradigm such as [18]; instead, we can obtain an IF-like derivative through a one-step pseudo update. In the following, we introduce a simple yet effective meta-based method, named MetaSP, which simulates IF at each step with a two-level optimization and avoids computing the Hessian inverse.

### 4.2 Simulating IF for SP

Based on the meta-learning paradigm, we transform the example influence computation into a meta gradient descent problem, named MetaSP. For each training step in rehearsal-based CL, we have two mini-batches $B_{old}$ and $B_{new}$ for the old and new tasks, respectively. Our goal is to obtain the influence on S and P of every example in $B_{old} \cup B_{new}$.

Algorithm 1: Computation of Example Influence (MetaSP)
Input: $B_{old}$, $B_{new}$, $V_{old}$, $V_{new}$ (training batches, validation batches)
Output: $\mathcal{I}^*$ (Pareto example influence on SP)
1. $\hat{\theta}_{\mathcal{E},B} = \arg\min_{\theta}\ \ell(B_{old} \cup B_{new}, \theta) + \mathcal{E}^{\top} L(B_{old} \cup B_{new}, \theta)$ ; // pseudo update
2. $\mathcal{I}(V_{old}, B) = \nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B})$ ; // gradient from old validation loss
3. $\mathcal{I}(V_{new}, B) = \nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta}_{\mathcal{E},B})$ ; // gradient from new validation loss
4. $\gamma^* \leftarrow$ Eq. (11) ; // optimal fusion hyper-parameter
5. $\mathcal{I}^* = \gamma^*\, \mathcal{I}(V_{old}, B) + (1-\gamma^*)\, \mathcal{I}(V_{new}, B)$ ; // influence fusion
Note that both the S-aware and the P-aware influence are computed for every example, regardless of whether it comes from an old or the new task. That is, the contribution of an example is not predetermined: data of old tasks may also affect the new task positively, and vice versa. In rehearsal-based CL, we therefore compute the derivatives $\nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta})\big|_{\mathcal{E}=0}$ and $\nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta})\big|_{\mathcal{E}=0}$ as example influence. To compute these derivatives, as shown in Fig. 2(a), our MetaSP has two key steps.

(1) Pseudo update. This step simulates Eq. (5) in IF (the perturbed training objective) via a pseudo update

$$\hat{\theta}_{\mathcal{E},B} = \arg\min_{\theta}\ \ell(B_{old} \cup B_{new}, \theta) + \mathcal{E}^{\top} L(B_{old} \cup B_{new}, \theta), \qquad (7)$$

where $L$ denotes the loss vector of a mini-batch combining both old and new tasks and $\mathcal{E}$ is the vector of per-example perturbations.

(2) Compute example influence. This step computes the example influence on S and P for all training examples, simulating Eq. (6) (the influence derivative). Based on the pseudo-updated model in Eq. (7), we compute the S- and P-aware example influence via two validation sets $V_{old}$ and $V_{new}$. Because the test set $D^{tst}$ is unavailable at training time, we use two dynamic validation sets as its stand-in during CL training: one is sampled from the memory buffer ($V_{old}$), representing the old tasks, and the other from the seen training data of the current task ($V_{new}$), representing the new task. With $\mathcal{E}$ initialized to $\mathbf{0}$, the two kinds of example influence are computed as

$$\mathcal{I}(V_{old}, B) = \nabla_{\mathcal{E}}\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}), \qquad \mathcal{I}(V_{new}, B) = \nabla_{\mathcal{E}}\, \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}). \qquad (8)$$

Each element of the influence vectors $\mathcal{I}(V_{old}, B)$ and $\mathcal{I}(V_{new}, B)$ represents one example's influence on S and P, respectively. As in IF, a positive value means negative influence, and a negative value means positive influence.
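To make the two steps concrete, below is a minimal PyTorch-style sketch of how Eqs. (7)-(8) can be simulated with per-example perturbation weights and a one-step pseudo update. It is an illustration under our own assumptions: the function name `metasp_influence`, the explicit linear-model parameters, and the single SGD step standing in for the inner arg min are ours, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def metasp_influence(W, b, x_batch, y_batch, x_val_old, y_val_old,
                     x_val_new, y_val_new, lr=0.1):
    """One MetaSP-style step: per-example influence on S (old val) and P (new val).

    The model is a plain linear classifier with explicit parameters (W, b) so that
    the pseudo-updated weights can be used directly in a functional forward pass.
    """
    # Per-example perturbations E, initialised to zero (cf. Eqs. (7)-(8)).
    eps = torch.zeros(x_batch.size(0), requires_grad=True)

    # Perturbed training objective: mean loss + E^T (per-example loss vector).
    logits = x_batch @ W.t() + b
    per_example_loss = F.cross_entropy(logits, y_batch, reduction="none")
    obj = per_example_loss.mean() + (eps * per_example_loss).sum()

    # Pseudo update: one SGD step standing in for the inner arg-min of Eq. (7).
    # create_graph=True keeps the dependence of the new parameters on eps.
    gW, gb = torch.autograd.grad(obj, (W, b), create_graph=True)
    W_hat, b_hat = W - lr * gW, b - lr * gb

    def val_loss(x, y):
        return F.cross_entropy(x @ W_hat.t() + b_hat, y)

    # S- and P-aware influence: gradients of the two validation losses w.r.t. the
    # per-example perturbations (Eq. (8)); positive = harmful, negative = helpful.
    inf_old = torch.autograd.grad(val_loss(x_val_old, y_val_old), eps,
                                  retain_graph=True)[0]
    inf_new = torch.autograd.grad(val_loss(x_val_new, y_val_new), eps)[0]
    return inf_old, inf_new


if __name__ == "__main__":
    torch.manual_seed(0)
    d, c = 8, 4                                   # feature dim, number of classes
    W = torch.randn(c, d, requires_grad=True)
    b = torch.zeros(c, requires_grad=True)
    x = torch.randn(16, d); y = torch.randint(0, c, (16,))   # B_old ∪ B_new
    xo = torch.randn(8, d); yo = torch.randint(0, c, (8,))   # V_old
    xn = torch.randn(8, d); yn = torch.randint(0, c, (8,))   # V_new
    i_old, i_new = metasp_influence(W, b, x, y, xo, yo, xn, yn)
    print(i_old.shape, i_new.shape)               # one influence value per example
```

Because the perturbations enter the pseudo update, their gradients are non-zero even though they are initialised to zero; this is the same first-order trick that replaces the Hessian-inverse product of IF.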
## 5 Using Influence for Continual Learning

### 5.1 Before Using: Influence for SP Pareto Optimality

As shown in Eq. (8), the example influence is the derivative of the validation losses of the old and new tasks with respect to the perturbations $\mathcal{E}$. However, the two kinds of influence are independent and may interfere with each other: using only one of them may harm the other's performance. We prefer a solution that trades off the influence on both S and P. Thus, we integrate the two influences $\mathcal{I}(V_{old}, B)$ and $\mathcal{I}(V_{new}, B)$ into a DOO problem with two gradients from different objectives:

$$\min_{\mathcal{E}}\ \big\{\, \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}),\ \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) \,\big\}. \qquad (9)$$

The goal of Problem (9) is to obtain a fused solution that satisfies SP Pareto optimality.

Definition 3 (SP Pareto Optimality)
1. (Pareto Dominance) Let $\mathcal{E}_a, \mathcal{E}_b$ be two solutions of Problem (9). $\mathcal{E}_a$ is said to dominate $\mathcal{E}_b$ ($\mathcal{E}_a \prec \mathcal{E}_b$) if and only if $\ell(V, \hat{\theta}_{\mathcal{E}_a,B}) \le \ell(V, \hat{\theta}_{\mathcal{E}_b,B})$ for all $V \in \{V_{old}, V_{new}\}$, and $\ell(V, \hat{\theta}_{\mathcal{E}_a,B}) < \ell(V, \hat{\theta}_{\mathcal{E}_b,B})$ for some $V \in \{V_{old}, V_{new}\}$.
2. (SP Pareto Optimal) $\mathcal{E}$ is called SP Pareto optimal if no other solution achieves better values of both $\ell(V_{old}, \hat{\theta}_{\mathcal{E},B})$ and $\ell(V_{new}, \hat{\theta}_{\mathcal{E},B})$.

Inspired by the Multiple-Gradient Descent Algorithm (MGDA) [12], we transform Problem (9) into a min-norm problem. Specifically, according to the KKT conditions [15], we solve

$$\gamma^* = \arg\min_{\gamma}\ \big\| \gamma\, \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}) + (1-\gamma)\, \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) \big\|_2^2, \quad \text{s.t. } 0 \le \gamma \le 1. \qquad (10)$$

Referring to Sener et al. [34], the optimal $\gamma^*$ has the closed form

$$\gamma^* = \min\!\left(\max\!\left(\frac{\big(\nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) - \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B})\big)^{\top} \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B})}{\big\| \nabla_{\mathcal{E}} \ell(V_{new}, \hat{\theta}_{\mathcal{E},B}) - \nabla_{\mathcal{E}} \ell(V_{old}, \hat{\theta}_{\mathcal{E},B}) \big\|_2^2},\ 0\right),\ 1\right). \qquad (11)$$

Thus, the SP Pareto influence of the training batch can be computed as

$$\mathcal{I}^* = \gamma^*\, \mathcal{I}(V_{old}, B) + (1-\gamma^*)\, \mathcal{I}(V_{new}, B). \qquad (12)$$

This process is illustrated in Fig. 2(a). Different from the S-aware and P-aware influence alone, the integrated influence pursues a Pareto optimum for both S and P, i.e., it reduces the negative influence on S or P while keeping the positive influence on both. The overall computation is summarized in Alg. 1; next we describe how to leverage example influence in CL training.

Algorithm 2: Using Example Influence in Rehearsal-based Continual Learning
Input: Initialized $\theta_0$, learning rate $\alpha$, training sets $\{D^{trn}_1, \dots, D^{trn}_T\}$, memory $\mathcal{M}$
Output: $\theta_T$ (final model)
1. for task $t = 1 : T$ do
2. &nbsp;&nbsp; $\theta_t$ = TrainNewTask($\theta_{t-1}$, $D^{trn}_t$, $\mathcal{M}$) (Alg. 3)
3. &nbsp;&nbsp; $C_1, C_2, \dots, C_{|\mathcal{M}|/t} \leftarrow$ K-Means($D^{trn}_t$)
4. &nbsp;&nbsp; Rank each $C_i$ by $\mathbb{E}(\mathcal{I}^*(x))$, $x \in C_i$
5. &nbsp;&nbsp; Rank $\mathcal{M}$ by $\mathbb{E}(\mathcal{I}^*(x))$, $x \in \mathcal{M}$
6. &nbsp;&nbsp; for $i = 1 : |\mathcal{M}|/t$ do
7. &nbsp;&nbsp;&nbsp;&nbsp; Pop the bottom of $\mathcal{M}$
8. &nbsp;&nbsp;&nbsp;&nbsp; Push the top of $C_i$ to $\mathcal{M}$

### 5.2 Model Update Using Example Influence

With the example influence computed for each mini-batch, we can control the model update of that mini-batch so that training moves in an overall positive direction. Given the parameters $\theta$ from the previous iteration and the step size $\alpha$, the model would be updated by traditional SGD as $\theta = \theta - \alpha \nabla_{\theta}\, \ell(B, \theta)$, where $B = B_{old} \cup B_{new}$. Regularizing the update with the example influence $\mathcal{I}^*$, we instead use

$$\theta = \theta - \alpha \nabla_{\theta} \Big( \ell(B, \theta) + (-\mathcal{I}^*)^{\top} L(B, \theta) \Big). \qquad (13)$$

Algorithm 3: Training New Task
Input: Initialized $\theta_t$, training set $D^{trn}_t$, memory $\mathcal{M}$, learning rate $\alpha$
Output: Trained $\theta_t$
1. for $i = 1 :$ ITER_NUM do
2. &nbsp;&nbsp; $B_{new} \leftarrow D^{trn}_t$
3. &nbsp;&nbsp; if $t = 1$ then
4. &nbsp;&nbsp;&nbsp;&nbsp; $\theta_t = \theta_t - \alpha \nabla_{\theta}\, \ell(B_{new}, \theta_t)$
5. &nbsp;&nbsp; else
6. &nbsp;&nbsp;&nbsp;&nbsp; $B_{old} \leftarrow \mathcal{M}$, $V_{old} \leftarrow \mathcal{M}$, $V_{new} \leftarrow D^{trn}_t$
7. &nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{I}^* \leftarrow$ METASP($B_{old}$, $B_{new}$, $V_{old}$, $V_{new}$)
8. &nbsp;&nbsp;&nbsp;&nbsp; $\theta_t = \theta_t - \alpha \nabla_{\theta}\big( \ell(B_{old} \cup B_{new}, \theta_t) + (-\mathcal{I}^*)^{\top} L(B_{old} \cup B_{new}, \theta_t) \big)$

MetaSP thus offers regularized updates at every step of rehearsal-based CL, which leads training to better SP with complexity only $O(|B|q + vq)$ ($v$ denotes the validation size), compared with $O(|B|q^2 + q^3)$ for IF. We show this application in Fig. 2(b). By updating as above, we make full use of the influence of each example: less useful examples are restrained and positive examples are emphasized, which may improve the acquisition of new knowledge and the maintenance of old knowledge simultaneously.

### 5.3 Rehearsal Selection Using Example Influence

Rehearsal under a fixed budget needs to consider both storing and dropping so that the memory $\mathcal{M}$ keeps a core set of all old tasks. Traditionally, storing and dropping are both based on random example selection, which ignores the difference in SP influence among examples. Given the influence $\mathcal{I}^*(x)$ representing the contribution of example $x$ to SP, we further use it to improve the rehearsal strategy under a fixed memory budget. The example influence above is computed at the mini-batch level; we promote it to the whole dataset according to the law of large numbers, taking the influence value of example $x$ as its expectation over batches, i.e., $\mathbb{E}(\mathcal{I}^*(x))$.

The fixed-size memory is divided evenly over the number of seen tasks. After task $t$ finishes training, we conduct our influence-aware rehearsal selection strategy as shown in Fig. 2(c). For storing, we first cluster all training data of the task into $|\mathcal{M}|/t$ groups using K-means to diversify the stored data; each group is ranked by its SP influence value, and the examples with the most positive influence on SP are selected for storage. For dropping, we rank the memory buffer by influence value and drop the $|\mathcal{M}|/t$ most negative examples. In this way, $\mathcal{M}$ always stores diverse examples with positive SP influence.
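As a concrete illustration of Eqs. (11)-(13), the sketch below performs one influence-guided update step: it computes the closed-form $\gamma^*$, fuses the two influence vectors, and re-weights the per-example losses before the real SGD step. This is a sketch under our own assumptions; the helper names `pareto_gamma` and `fused_update_step` and the plain SGD step are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def pareto_gamma(inf_old: torch.Tensor, inf_new: torch.Tensor) -> float:
    """Closed-form solution of the min-norm problem (Eq. (11)), clipped to [0, 1]."""
    diff = inf_new - inf_old
    denom = diff.pow(2).sum()
    if denom == 0:                      # identical influence vectors: any gamma works
        return 0.5
    gamma = (diff @ inf_new) / denom
    return float(gamma.clamp(0.0, 1.0))

def fused_update_step(model, optimizer, x_batch, y_batch, inf_old, inf_new):
    """One influence-regularized update (Eqs. (12)-(13)).

    inf_old / inf_new are the per-example S- and P-aware influences of this
    mini-batch (e.g., from the MetaSP sketch above); positive values indicate a
    harmful example, negative values a helpful one.
    """
    gamma = pareto_gamma(inf_old, inf_new)
    fused = gamma * inf_old + (1.0 - gamma) * inf_new          # Eq. (12)

    per_example_loss = F.cross_entropy(model(x_batch), y_batch, reduction="none")
    # Eq. (13): base loss plus (-I*)^T L, i.e. per-example weights of (1 - I*_i),
    # so helpful examples (negative influence) are emphasized and harmful ones damped.
    loss = per_example_loss.mean() + (-fused.detach() * per_example_loss).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), gamma
```

In the full algorithm, the same fused values also feed the rehearsal selection of Section 5.3, where examples are ranked by their expected influence $\mathbb{E}(\mathcal{I}^*(x))$.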
## 6 Experiments

### 6.1 Datasets and Implementation Details

We use three commonly used benchmarks for evaluation: 1) Split CIFAR-10 [37] consists of 5 tasks, with 2 distinct classes and 5,000 exemplars per class, derived from the CIFAR-10 dataset; 2) Split CIFAR-100 [37] splits the original CIFAR-100 dataset into 10 disjoint subsets, each of which is considered a separate task with 10 classes; 3) Split Mini-ImageNet [35] is a subset of 100 classes from ImageNet [9], rescaled to 32x32. Each class has 600 samples, randomly subdivided into a training set (80%) and a test set (20%); the Mini-ImageNet dataset is equally divided into 5 disjoint tasks.

We employ ResNet-18 [17] as the backbone, trained from scratch. We use the Stochastic Gradient Descent (SGD) optimizer and keep the batch size fixed at 32 to guarantee an equal number of updates; the rehearsal batch sampled from the memory buffer is also of size 32. We construct the SP validation sets in MetaSP by randomly sampling 10% of the seen data and 10% of the memory buffer at each training step. Other settings follow the ER tricks of [4], including 50 total epochs and the hyper-parameters. All results are averaged over 5 fixed seeds for fairness.

To better evaluate the CL process, we measure SP with the following four metrics, where the indicator $\mathbb{1}(y, \theta(x))$ equals 1 if the model prediction matches the ground truth and 0 otherwise, and $\theta_k$ denotes the model after training task $k$.

1) First Accuracy: $A_1 = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|D^{tst}_t|} \sum_{(x_i, y_i) \in D^{tst}_t} \mathbb{1}(y_i, \theta_t(x_i))$. For each task, we evaluate its test performance immediately after it is trained, which indicates Plasticity, i.e., the capability of learning new knowledge.

2) Final Accuracy: $A_{\infty} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|D^{tst}_t|} \sum_{(x_i, y_i) \in D^{tst}_t} \mathbb{1}(y_i, \theta_T(x_i))$. This is the final performance of each task after the whole sequence, which indicates Stability, i.e., the capability of suppressing catastrophic forgetting.

3) Mean Average Accuracy: $A_m = \frac{1}{T} \sum_{t=1}^{T} \Big( \frac{1}{t} \sum_{k=1}^{t} \frac{1}{|D^{tst}_k|} \sum_{(x_i, y_i) \in D^{tst}_k} \mathbb{1}(y_i, \theta_t(x_i)) \Big)$. This metric is computed along the CL process, indicating the SP performance after each task is trained.

4) Backward Transfer: $\mathrm{BWT} = \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{1}{|D^{tst}_t|} \sum_{(x, y) \in D^{tst}_t} \big( \mathbb{1}(y, \theta_T(x)) - \mathbb{1}(y, \theta_t(x)) \big) = \frac{T}{T-1} (A_{\infty} - A_1)$. This metric is the performance drop from the first to the final accuracy of each task.
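All four metrics can be derived from the matrix of per-task accuracies $a_{k,t}$ (accuracy on task $t$'s test set after training task $k$). Below is a minimal sketch assuming such an accuracy matrix has already been collected; the function name `cl_metrics` and the toy numbers are ours.

```python
import numpy as np

def cl_metrics(acc: np.ndarray) -> dict:
    """CL metrics from an accuracy matrix.

    acc[k, t] is the accuracy on task t's test set after training task k
    (only the lower triangle k >= t is used). Returns A1, A_inf, Am, BWT.
    """
    T = acc.shape[0]
    a1 = np.mean([acc[t, t] for t in range(T)])              # First Accuracy
    a_inf = np.mean([acc[T - 1, t] for t in range(T)])       # Final Accuracy
    # Mean Average Accuracy: average over seen tasks, measured after each task.
    am = np.mean([np.mean(acc[k, : k + 1]) for k in range(T)])
    # Backward Transfer: drop from first to final accuracy; equals T/(T-1)*(A_inf - A1).
    bwt = np.mean([acc[T - 1, t] - acc[t, t] for t in range(T - 1)])
    return {"A1": a1, "A_inf": a_inf, "Am": am, "BWT": bwt}

if __name__ == "__main__":
    # Toy 3-task example: rows are "after task k", columns are "task t".
    acc = np.array([[0.90, 0.00, 0.00],
                    [0.70, 0.88, 0.00],
                    [0.60, 0.75, 0.92]])
    print(cl_metrics(acc))
```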
### 6.2 Main Comparison Results

We compare our method against 8 rehearsal-based methods (GDUMB [28], GEM [22], AGEM [6], HAL [5], GSS [2], MIR [1], GMED [33], and ER [7]). In addition, we provide a lower bound that trains on new data directly without any forgetting-avoidance strategy (Finetune) and an upper bound given by joint training on all task data (Joint).

Table 1: Comparisons on three datasets, averaged across 5 runs (see std. in the Appendix). In the original table, red and blue values mark the best among our methods and among the compared methods, and a marker indicates that our method is significantly better than the compared method (paired t-tests at the 95% significance level).

Split CIFAR-10, class-incremental:

| Method | A1 (300) | A∞ (300) | Am (300) | BWT (300) | A1 (500) | A∞ (500) | Am (500) | BWT (500) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 19.66 | – | – | – | 19.66 | – | – |
| Joint | – | 91.79 | – | – | – | 91.79 | – | – |
| GDUMB [28] | – | 36.92 | – | – | – | 44.27 | – | – |
| GEM [22] | 93.90 | 37.51 | 55.43 | -70.48 | 92.76 | 36.95 | 57.36 | -69.76 |
| AGEM [6] | 96.57 | 20.02 | 45.57 | -95.68 | 96.56 | 20.01 | 46.52 | -95.69 |
| HAL [5] | 91.30 | 24.45 | 46.34 | -83.56 | 91.96 | 27.94 | 49.05 | -80.01 |
| MIR [1] | 96.70 | 38.53 | 56.96 | -72.72 | 96.65 | 42.65 | 59.99 | -67.50 |
| GSS [2] | 96.53 | 35.89 | 54.33 | -75.80 | 96.55 | 41.96 | 58.16 | -68.24 |
| GMED [33] | 96.65 | 38.12 | 58.92 | -73.16 | 96.65 | 43.68 | 62.56 | -66.21 |
| ER [7] | 96.73 | 34.19 | 53.72 | -78.18 | 96.74 | 40.45 | 57.69 | -70.36 |
| Ours | 96.87 | 42.42 | 63.52 | -68.05 | 96.82 | 49.16 | 67.88 | -59.57 |
| Ours+RehSel | 96.85 | 43.76 | 63.69 | -66.36 | 96.81 | 50.10 | 68.28 | -58.38 |

Split CIFAR-10, task-incremental:

| Method | A1 (300) | A∞ (300) | Am (300) | BWT (300) | A1 (500) | A∞ (500) | Am (500) | BWT (500) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 65.27 | – | – | – | 65.27 | – | – |
| Joint | – | 98.16 | – | – | – | 98.16 | – | – |
| GDUMB [28] | – | 73.22 | – | – | – | 78.06 | – | – |
| GEM [22] | 96.62 | 89.34 | 92.49 | -9.09 | 96.73 | 90.42 | 92.93 | -7.88 |
| AGEM [6] | 96.78 | 85.52 | 90.16 | -14.07 | 96.71 | 86.45 | 90.90 | -12.83 |
| HAL [5] | 91.41 | 79.90 | 83.78 | -14.39 | 92.03 | 81.84 | 84.19 | -12.73 |
| MIR [1] | 96.76 | 88.50 | 90.87 | -10.33 | 96.73 | 90.63 | 91.99 | -7.62 |
| GSS [2] | 96.56 | 88.05 | 90.60 | -10.63 | 96.57 | 90.38 | 92.19 | -7.73 |
| GMED [33] | 96.73 | 88.91 | 91.20 | -9.76 | 96.72 | 89.72 | 92.10 | -8.75 |
| ER [7] | 96.93 | 88.97 | 91.12 | -9.95 | 96.79 | 90.60 | 92.28 | -7.74 |
| Ours | 97.10 | 89.40 | 92.54 | -9.62 | 97.31 | 90.91 | 93.38 | -7.99 |
| Ours+RehSel | 97.11 | 89.91 | 92.66 | -8.99 | 97.30 | 91.41 | 93.28 | -7.36 |

Split CIFAR-100, class-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 9.14 | – | – | – | 9.14 | – | – |
| Joint | – | 71.25 | – | – | – | 71.25 | – | – |
| GDUMB [28] | – | 11.11 | – | – | – | 15.75 | – | – |
| GEM [22] | 85.28 | 15.91 | 29.38 | -77.07 | 84.28 | 22.79 | 34.09 | -68.32 |
| AGEM [6] | 85.97 | 9.31 | 24.60 | -85.18 | 85.66 | 9.27 | 24.67 | -84.88 |
| HAL [5] | 67.33 | 8.20 | 22.72 | -65.70 | 68.06 | 10.59 | 24.74 | -63.86 |
| MIR [1] | 87.38 | 13.49 | 28.88 | -82.09 | 87.39 | 17.56 | 32.48 | -77.59 |
| GSS [2] | 86.03 | 14.01 | 28.00 | -80.02 | 86.31 | 17.87 | 31.82 | -76.04 |
| GMED [33] | 87.18 | 14.56 | 33.41 | -80.68 | 87.29 | 18.67 | 38.69 | -76.23 |
| ER [7] | 87.23 | 13.75 | 28.88 | -81.64 | 87.33 | 17.56 | 32.45 | -77.52 |
| Ours | 88.13 | 18.96 | 38.62 | -76.85 | 87.58 | 24.78 | 45.20 | -69.76 |
| Ours+RehSel | 87.81 | 19.28 | 39.23 | -76.13 | 87.55 | 25.72 | 45.48 | -68.69 |

Split CIFAR-100, task-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 33.85 | – | – | – | 33.85 | – | – |
| Joint | – | 91.63 | – | – | – | 91.63 | – | – |
| GDUMB [28] | – | 36.40 | – | – | – | 43.25 | – | – |
| GEM [22] | 85.53 | 68.68 | 68.49 | -18.72 | 85.24 | 73.71 | 72.59 | -12.81 |
| AGEM [6] | 85.97 | 55.28 | 58.23 | -34.10 | 85.66 | 55.95 | 59.96 | -33.01 |
| HAL [5] | 67.64 | 44.98 | 50.79 | -25.17 | 68.62 | 50.07 | 54.01 | -20.61 |
| MIR [1] | 87.42 | 66.18 | 67.43 | -23.60 | 87.50 | 71.20 | 71.42 | -18.10 |
| GSS [2] | 86.10 | 66.80 | 66.55 | -21.44 | 86.44 | 71.98 | 71.00 | -16.06 |
| GMED [33] | 87.30 | 68.82 | 72.66 | -20.53 | 87.49 | 73.91 | 76.36 | -15.10 |
| ER [7] | 87.29 | 66.82 | 67.56 | -22.73 | 87.40 | 71.74 | 71.60 | -17.40 |
| Ours | 88.94 | 70.03 | 74.07 | -21.01 | 88.94 | 75.32 | 78.09 | -15.14 |
| Ours+RehSel | 88.58 | 70.81 | 74.24 | -19.74 | 89.03 | 76.14 | 78.27 | -14.32 |

Split Mini-ImageNet, class-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 11.12 | – | – | – | 11.12 | – | – |
| Joint | – | 44.39 | – | – | – | 44.39 | – | – |
| GDUMB [28] | – | 6.22 | – | – | – | 7.15 | – | – |
| AGEM [6] | 50.06 | 10.69 | 22.29 | -49.22 | 50.03 | 10.69 | 22.28 | -49.16 |
| MIR [1] | 51.44 | 11.07 | 23.65 | -50.46 | 51.25 | 11.32 | 24.09 | -49.92 |
| GSS [2] | 51.63 | 11.09 | 23.62 | -50.66 | 51.35 | 11.42 | 24.05 | -49.91 |
| GMED [33] | 51.21 | 11.03 | 24.47 | -50.23 | 50.87 | 11.73 | 25.50 | -48.93 |
| ER [7] | 51.68 | 11.00 | 23.71 | -50.84 | 51.41 | 11.35 | 24.08 | -50.08 |
| Ours | 51.76 | 12.48 | 26.50 | -49.10 | 50.91 | 14.43 | 28.47 | -45.59 |
| Ours+RehSel | 51.81 | 12.74 | 26.43 | -48.84 | 50.96 | 14.54 | 28.44 | -45.52 |

Split Mini-ImageNet, task-incremental:

| Method | A1 (500) | A∞ (500) | Am (500) | BWT (500) | A1 (1000) | A∞ (1000) | Am (1000) | BWT (1000) |
|---|---|---|---|---|---|---|---|---|
| Finetune | – | 23.46 | – | – | – | 23.46 | – | – |
| Joint | – | 62.30 | – | – | – | 62.30 | – | – |
| GDUMB [28] | – | 16.37 | – | – | – | 17.69 | – | – |
| AGEM [6] | 50.06 | 18.34 | 28.05 | -39.65 | 50.03 | 18.78 | 28.12 | -39.05 |
| MIR [1] | 51.47 | 29.10 | 35.20 | -27.95 | 51.31 | 31.39 | 37.24 | -24.89 |
| GSS [2] | 51.64 | 28.67 | 35.22 | -28.71 | 51.40 | 31.75 | 37.23 | -24.56 |
| GMED [33] | 51.29 | 30.47 | 37.64 | -26.02 | 51.00 | 32.85 | 39.66 | -22.69 |
| ER [7] | 51.70 | 28.97 | 35.30 | -28.40 | 51.55 | 31.59 | 37.36 | -24.95 |
| Ours | 52.44 | 32.59 | 39.38 | -24.82 | 52.27 | 36.25 | 41.59 | -20.02 |
| Ours+RehSel | 51.73 | 34.36 | 40.48 | -21.70 | 51.47 | 37.20 | 42.19 | -17.83 |

Table 1 shows the quantitative results of all compared methods and the proposed MetaSP in the class-incremental and task-incremental settings. First of all, by controlling training according to the influence on SP, the proposed MetaSP outperforms the other methods on all metrics. As the memory buffer grows, all rehearsal-based CL methods improve, while the advantages of MetaSP become more obvious. In terms of the First Accuracy $A_1$, which indicates the ability to learn new tasks, our method outperforms most other methods by a small numerical margin. In terms of the Final Accuracy $A_\infty$, which measures forgetting, we obtain a clear improvement over the second-best result of 3.17 on average in the class-incremental setting and 1.77 on average in the task-incremental setting. This shows that MetaSP keeps learning the new task stably while suppressing catastrophic forgetting. This is because, although the new task tends to have larger gradients that dominate the update in all rehearsal-based CL methods, our method emphasizes examples with positive effect and restrains negative-impact examples. In terms of the Mean Average Accuracy $A_m$, which evaluates SP throughout the whole CL process, our method shows significant superiority with average improvements of over 4.44 and 1.24 w.r.t. the second-best results in the class-incremental and task-incremental settings. The complete results with std. can be found in the Appendix. Moreover, with the proposed rehearsal selection strategy (Ours+RehSel), $A_\infty$ improves further, which means the examples selected according to their influence clearly reduce catastrophic forgetting. With our Rehearsal Selection (RehSel) strategy, we obtain an improvement of 0.77 on $A_\infty$, but $A_1$ and $A_m$ fluctuate; this suggests a better memory may bring in more task conflict.

Figure 3: Top: statistics of examples with positive and negative influence on S, P, and SP ((a) the 500 memory examples in total; (b) the 10,000 new-task examples per task). Bottom: all example influences are divided equally into 5 groups, and the number of examples in each range is counted.

### 6.3 Analysis of Dataset Influence on SP

In Fig. 3, we count the examples with positive/negative influence on the old tasks (S), the new task (P), and total SP in Split CIFAR-10. At each task after task 2, we have a fixed-size memory of 500 examples and 10,000 new-task examples. We first find that most data of old tasks has a positive influence on S and a negative influence on P, while most data of new tasks has a positive influence on P and a negative influence on S. Even so, some data in both new and old tasks has the opposite influence. For the total SP influence, most memory data has a positive influence, whereas examples of new tasks split into nearly equal numbers of positive and negative SP influence. Thus, by clustering and storing examples with higher influence into the rehearsal memory buffer, the old knowledge can be kept. By dividing all example influences equally into 5 groups from the minimum to the maximum, we find that most examples have mid-level influence and serve as the main body of the dataset; the numbers of examples with large positive or negative influence are small, which means unique examples are few in the dataset. These observations suggest that example difference should be used to improve model training.

Figure 4: SP Pareto front ($A_1$ vs. $A_\infty$; curves for Ours, Ours only S, Ours only P, GSS, MIR, GMED).

### 6.4 Analysis on SP Pareto Optimum

In this paper, we convert the fusion of S-aware and P-aware influence into a DOO problem and use MGDA to guarantee that the fused solution is an SP Pareto optimum. Fig. 4 compares the First Accuracy and Final Accuracy coordinates of all compared methods. We also evaluate variants using only Stability-aware influence (Ours only S) and only Plasticity-aware influence (Ours only P). Even with a single kind of influence, our method already achieves better SP than the other methods, and integrating the two kinds of influence yields an even more balanced SP. In contrast, the existing methods cannot approach the SP Pareto front well.
### 6.5 Training Time

Table 2: Comparison of training time [s] on Split CIFAR-10.

| Method | ER | GSS | AGEM | HAL | MIR | GMED | GEM | MetaSP |
|---|---|---|---|---|---|---|---|---|
| One-step | 0.013 | 0.015 | 0.029 | 0.043 | 0.077 | 0.093 | 0.290 | 0.250 |
| Total | 2685 | 2672 | 3812 | 5029 | 7223 | 8565 | 24768 | 5898 |

We list the one-step update time and the total training overhead of all compared methods on the Split CIFAR-10 dataset. For the one-step update, we evaluate all methods with one batch on one update. Our method takes more time than the other methods except GEM, because of the pseudo update, the backward pass on the perturbations, and the influence fusion. To guarantee efficiency, we apply the proposed method only in the last 5 epochs of training, and the remaining epochs are naive fine-tuning (see details in the Appendix). The results show that this strategy is as fast as other light-weight methods while achieving a large improvement on SP. We also use this setting for the comparison in Table 1.

## 7 Conclusion

In this paper, we explored example influence on the Stability-Plasticity (SP) dilemma in rehearsal-based continual learning. To achieve this, we evaluated example influence via small perturbations instead of the computationally expensive Hessian-based influence function, and proposed a simple yet effective MetaSP algorithm. At each iteration of CL training, MetaSP performs a pseudo update and obtains the S- and P-aware example influence at the batch level. The two kinds of influence are then combined via an SP Pareto-optimal factor and used to regularize the model update. Moreover, the example influence can be used to optimize rehearsal selection. Experimental results on three popular CL datasets verify the effectiveness of the proposed method.

We also list the limitations of the proposed method. (1) It relies on rehearsal, which may affect privacy and requires extra storage. (2) It is not fast enough for online continual learning, although in most situations our training tricks can be leveraged to reduce the time. (3) It is limited by extremely small memory sizes: a larger memory means better remembering and a more accurate validation set, and the proposed method does not perform well when the memory size is extremely small.

## Acknowledgement

This work is financially supported in part by the National Key Research and Development Program of China under Grant (No. 2019YFC1520904) and the Natural Science Foundation of China (Nos. 62072334, 62276182, 61876220). The authors would like to thank the experienced reviewers and AE for their constructive and valuable suggestions.

## References

[1] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In NeurIPS, 2019.
[2] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In NeurIPS, 2019.
[3] Benedikt Bagus and Alexander Gepperth. An investigation of replay-based approaches for continual learning. In IJCNN, 2021.
[4] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking experience replay: a bag of tricks for continual learning. In ICPR, 2021.
[5] Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. In AAAI, 2021.
[6] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2018.
[7] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
[8] Kalyanmoy Deb and Himanshu Gupta. Searching for robust Pareto-optimal solutions in multi-objective optimization. In ICEMO, 2005.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Kaile Du, Linyan Li, Fan Lyu, Fuyuan Hu, Zhenping Xia, and Fenglei Xu. Class-incremental lifelong learning in multi-label classification. arXiv preprint arXiv:2207.07840, 2022.
[11] Kaile Du, Fan Lyu, Fuyuan Hu, Linyan Li, Wei Feng, Fenglei Xu, and Qiming Fu. AGCN: Augmented graph convolutional network for lifelong multi-label image recognition. In ICME, 2022.
[12] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 2012.
[13] Yang Fan, Yingce Xia, Lijun Wu, Shufang Xie, Weiqing Liu, Jiang Bian, Tao Qin, and Xiang-Yang Li. Learning to reweight with deep interactions. arXiv preprint arXiv:2007.04649, 2020.
[14] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[15] Jörg Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 2000.
[16] Yunhui Guo, Mingrui Liu, Tianbao Yang, and Tajana Rosing. Learning with long-term remembering: Following the lead of mixed stochastic gradient. arXiv preprint arXiv:1909.11763, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
[19] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
[20] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
[21] Jan Larsen, Lars Kai Hansen, Claus Svarer, and M Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. IEEE, 1996.
[22] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NeurIPS, 2017.
[23] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, 2020.
[24] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016.
[25] Fan Lyu, Shuai Wang, Wei Feng, Zihan Ye, Fuyuan Hu, and Song Wang. Multi-domain multi-task rehearsal for lifelong learning. In AAAI, 2021.
[26] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, 2018.
[27] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.
[28] Ameya Prabhu, Philip Torr, and Puneet Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, 2020.
[29] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 1990.
[30] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.
[31] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
[32] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
[33] Liu Risheng, Liu Yaohua, Zeng Shangzhi, and Zhang Jin. Gradient-based editing of memory examples for online task-free continual learning. In NeurIPS, 2021.
[34] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In NeurIPS, 2018.
[35] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[36] Tianyang Wang, Jun Huan, and Bo Li. Data dropout: Optimizing training data for convolutional neural networks. In ICTAI, 2018.
[37] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [No]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A] We use public datasets and open-sourced code.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]