# SparCL: Sparse Continual Learning on the Edge

Zifeng Wang¹*, Zheng Zhan¹*, Yifan Gong¹, Geng Yuan¹, Wei Niu², Tong Jian¹, Bin Ren², Stratis Ioannidis¹, Yanzhi Wang¹, Jennifer Dy¹
¹Northeastern University, ²College of William and Mary
{zhan.zhe, gong.yifa, geng.yuan, yanz.wang}@northeastern.edu, {zifengwang, jian, ioannidis, jdy}@ece.neu.edu, wniu@email.wm.edu, bren@cs.wm.edu

(*) Both authors contributed equally to this work.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Existing work in continual learning (CL) focuses on mitigating catastrophic forgetting, i.e., model performance deterioration on past tasks when learning a new task. However, the training efficiency of a CL system is under-investigated, which limits the real-world application of CL systems under resource-limited scenarios. In this work, we propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates. Each of them not only improves efficiency, but also further mitigates catastrophic forgetting. SparCL consistently improves the training efficiency of existing state-of-the-art (SOTA) CL methods by at most 23× less training FLOPs, and, surprisingly, further improves SOTA accuracy by at most 1.7%. SparCL also outperforms competitive baselines obtained by adapting SOTA sparse training methods to the CL setting, in both efficiency and accuracy. We also evaluate the effectiveness of SparCL on a real mobile phone, further indicating the practical potential of our method.

## 1 Introduction

The objective of continual learning (CL) is to enable an intelligent system to accumulate knowledge from a sequence of tasks, such that it exhibits satisfying performance on both old and new tasks [32]. Recent methods mostly focus on addressing the catastrophic forgetting [43] problem, i.e., a continually learning model tends to suffer performance deterioration on previously seen tasks. However, in the real world, when CL applications are deployed on resource-limited platforms [48] such as edge devices, learning efficiency, w.r.t. both training speed and memory footprint, is also a crucial metric of interest, yet it is rarely explored in prior CL work.

Existing CL methods can be categorized into regularization-based [2, 32, 37, 68], rehearsal-based [8, 12, 50, 61], and architecture-based [31, 42, 52, 58, 59, 70]. Both regularization- and rehearsal-based methods directly train a dense model, which might be over-parameterized even for the union of all tasks [19, 39]. Though several architecture-based methods [51, 57, 64] start with a sparse sub-network of the dense model, they still grow the model size progressively to learn emerging tasks. The aforementioned methods, although striving for greater performance with less forgetting, still introduce significant memory and computation overhead during the whole CL process.
Figure 1: Left: Overview of SparCL. SparCL consists of three complementary components: task-aware dynamic masking (TDM) for weight sparsity, dynamic data removal (DDR) for data efficiency, and dynamic gradient masking (DGM) for gradient sparsity. Right: SparCL successfully preserves the accuracy and significantly improves efficiency over DER++ [8], one of the SOTA CL methods, with different sparsity ratios on the Split Tiny-ImageNet [16] dataset.

Recently, another stream of work, sparse training [4, 20, 35], has emerged as a new trend to achieve training acceleration, which embraces the promising training-on-the-edge paradigm. With sparse training, each iteration takes less time thanks to the reduction in computation achieved by sparsity. Inspired by these sparse training methods, which operate under the traditional i.i.d. learning setting, we naturally consider introducing sparse training to the field of CL. A straightforward idea is to directly combine existing sparse training methods, such as SNIP [35] and RigL [20], with a rehearsal buffer under the CL setting. However, these methods fail to consider key challenges in CL for mitigating catastrophic forgetting, for example, properly handling the transition between tasks. As a result, these sparse training methods, though enhancing training efficiency, cause a significant accuracy drop (see Section 5.2). Thus, we would like to explore a general strategy, orthogonal to existing CL methods, that not only leverages the idea of sparse training for efficiency, but also addresses key challenges in CL to preserve (or even improve) accuracy.

In this work, we propose Sparse Continual Learning (SparCL), a general framework for cost-effective continual learning, aiming at enabling practical CL on edge devices. As shown in Figure 1 (left), SparCL achieves both learning acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, to maintain a small dynamic sparse network during the whole CL process, we develop a novel task-aware dynamic masking (TDM) strategy that keeps only important weights for both the current and past tasks, with special consideration during task transitions. Moreover, we propose a dynamic data removal (DDR) scheme, which progressively removes easy-to-learn examples from training iterations; this further accelerates training and also improves CL accuracy by balancing current and past data and keeping more informative samples in the buffer. Finally, we provide an additional dynamic gradient masking (DGM) strategy that leverages gradient sparsity for even better efficiency and knowledge preservation of learned tasks, such that only a subset of the sparse weights is updated. Figure 1 (right) demonstrates that SparCL successfully preserves accuracy and significantly improves efficiency over DER++ [8], one of the SOTA CL methods, under different sparsity ratios.

SparCL is simple in concept, compatible with various existing rehearsal-based CL methods, and efficient under practical scenarios. We conduct comprehensive experiments on multiple CL benchmarks to evaluate the effectiveness of our method. We show that SparCL works collaboratively with existing CL methods, greatly accelerates the learning process under different sparsity ratios, and sometimes even improves upon state-of-the-art accuracy. We also establish competitive baselines by combining representative sparse training methods with advanced rehearsal-based CL methods.
SparCL again outperforms these baselines in terms of both efficiency and accuracy. Most importantly, we evaluate our SparCL framework on a real edge device to demonstrate the practical potential of our method. We are not aware of any prior CL work that has explored this area and considered the constraints of limited resources during training. In summary, our work makes the following contributions:

- We propose Sparse Continual Learning (SparCL), a general framework for cost-effective continual learning, which achieves learning acceleration through the synergy of weight sparsity, data efficiency, and gradient sparsity. To the best of our knowledge, our work is the first to introduce the idea of sparse training to enable efficient CL on edge devices. Our code is publicly available at https://github.com/neu-spiral/SparCL.
- SparCL shows superior performance compared to both conventional CL methods and CL-adapted sparse training methods on all benchmark datasets, leading to at most 23× less training FLOPs and, surprisingly, a 1.7% improvement over SOTA accuracy.
- We evaluate SparCL on a real mobile edge device, demonstrating the practical potential of our method and also encouraging future research on CL on the edge. The results indicate that our framework can achieve at most 3.1× training acceleration.

## 2 Related Work

### 2.1 Continual Learning

The main focus in continual learning (CL) has been mitigating catastrophic forgetting. Existing methods can be classified into three major categories. Regularization-based methods [2, 32, 37, 68] limit updates of important parameters for prior tasks by adding corresponding regularization terms. While these methods reduce catastrophic forgetting to some extent, their performance deteriorates under challenging settings [40] and on more complex benchmarks [50, 61]. Rehearsal-based methods [13, 14, 25] save examples from previous tasks into a small buffer to train the model jointly with the current task. Though simple in concept, the idea of rehearsal is very effective in practice and has been adopted by many state-of-the-art methods [8, 11, 49]. Architecture-based methods [42, 51, 57, 59, 63] isolate existing model parameters or assign additional parameters to each task to reduce interference among tasks. As mentioned in Section 1, most of these methods use a dense model without consideration of efficiency and memory footprint, and thus are not applicable to resource-limited settings. Our work, orthogonal to these methods, serves as a general framework for making existing methods efficient and enabling broader deployment, e.g., CL on edge devices.

A limited number of works explore sparsity in CL, but for different purposes. Several methods [41, 42, 53, 57] incorporate the idea of weight pruning [24] to allocate a sparse sub-network for each task to reduce inter-task interference. Nevertheless, these methods reduce the full-model sparsity progressively with every task and finally end up with a much denser model. On the contrary, SparCL maintains a sparse network throughout the whole CL process, bringing great efficiency and memory benefits both during training and for the output model. A recent work [15] aims at discovering lottery tickets [21] under CL, but still does not address efficiency. Nonetheless, the existence of lottery tickets in CL serves as a strong justification for the outstanding performance of SparCL.

### 2.2 Sparse Training

There are two main approaches to sparse training: fixed-mask sparse training and dynamic sparse training.
Fixed-mask sparse training methods [35, 54, 56, 60] first apply pruning and then execute traditional training on the sparse model with the obtained fixed mask. The pre-fixed structure limits the achievable accuracy, and the pruning stage itself still incurs substantial computation and memory consumption. To overcome these drawbacks, dynamic-mask methods [4, 17, 20, 45, 46] adjust the sparsity topology during training while maintaining a low memory footprint. These methods start with a sparse structure derived from an untrained dense model, then combine sparse topology exploration at the given sparsity ratio with sparse model training. Recent work [67] further incorporates data efficiency into sparse training for better training acceleration. However, all prior sparse training works focus on the traditional training setting, while CL is a more complicated and difficult scenario with inherent characteristics not explored by these works. In contrast to prior sparse training methods, our work explores a new learning paradigm that introduces sparse training into CL for efficiency and also addresses key challenges in CL, mitigating catastrophic forgetting.

## 3 Continual Learning Problem Setup

In supervised CL, a model $f$ learns from a sequence of tasks $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$, where each task $\mathcal{D}_t = \{(x_{t,i}, y_{t,i})\}_{i=1}^{n_t}$ consists of input-label pairs, and each task has a disjoint set of classes. Tasks arrive sequentially, and the model must adapt to them. At the $t$-th step, the model gains access to data from the $t$-th task. However, a small fixed-size rehearsal buffer $\mathcal{M}$ is allowed to save data from prior tasks. At test time, the easiest setting is to assume the task identity is known for each incoming test example, named task-incremental learning (Task-IL). If this assumption does not hold, we have the more difficult class-incremental learning (Class-IL) setting. In this work, we mainly focus on the more challenging Class-IL setting, and only report Task-IL performance for reference.

The goal of conventional CL is to train a model sequentially such that it performs well on all tasks at test time. The main evaluation metric is the average test accuracy over all tasks. In real-world resource-limited scenarios, we should further consider the training efficiency of the model. Thus, we measure the performance of the model more comprehensively by also including training FLOPs and memory footprint. A minimal sketch of this rehearsal-based sequential training setup is given below.
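To make the setup concrete, the following sketch shows one common instantiation of rehearsal-based CL: a fixed-size buffer filled by reservoir sampling [14] and a sequential loop that trains on current-task batches mixed with replayed buffer samples. The names (`ReservoirBuffer`, `train_continually`) and the simple loss mixing are illustrative assumptions, not the exact pipeline of any specific method discussed in this paper.

```python
import random
import torch
import torch.nn.functional as F


class ReservoirBuffer:
    """Fixed-size rehearsal buffer filled by reservoir sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.examples = []          # list of (x, y) tensor pairs
        self.num_seen = 0

    def add(self, x, y):
        self.num_seen += 1
        if len(self.examples) < self.capacity:
            self.examples.append((x, y))
        else:
            j = random.randint(0, self.num_seen - 1)
            if j < self.capacity:   # keep each seen example with equal probability
                self.examples[j] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.examples, min(batch_size, len(self.examples)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)


def train_continually(model, optimizer, task_loaders, buffer, device="cpu"):
    """Sequentially train on tasks D_1..D_T, replaying buffered examples."""
    for task_loader in task_loaders:               # tasks arrive one after another
        for x, y in task_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            if len(buffer.examples) > 0:           # rehearse data from past tasks
                bx, by = buffer.sample(x.size(0))
                loss = loss + F.cross_entropy(model(bx.to(device)), by.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            for xi, yi in zip(x.cpu(), y.cpu()):   # update the buffer per example
                buffer.add(xi, yi)
```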
## 4 Sparse Continual Learning (SparCL)

Our method, Sparse Continual Learning (SparCL), is a unified framework composed of three complementary components: task-aware dynamic masking for weight sparsity, dynamic data removal for data efficiency, and dynamic gradient masking for gradient sparsity. The entire framework is shown in Figure 2. We illustrate each component in detail in this section.

Figure 2: Illustration of the SparCL workflow. Three components work synergistically to improve training efficiency and further mitigate catastrophic forgetting for preserving accuracy.

### 4.1 Task-aware Dynamic Masking

To enable cost-effective CL in resource-limited scenarios, SparCL is designed to maintain a dynamic structure when learning a sequence of tasks, such that it not only achieves high efficiency, but also intelligently adapts to the data stream for better performance. Specifically, we propose a strategy named task-aware dynamic masking (TDM), which periodically removes less important weights and grows back unused weights for stronger representation power, maintaining a single binary weight mask throughout the CL process. Different from typical sparse training work, which only leverages the weight magnitude [45] or the gradient w.r.t. data from a single training task [20, 67], TDM also considers the importance of weights w.r.t. data saved in the rehearsal buffer, as well as the switches between CL tasks.

Specifically, TDM starts from a randomly initialized binary mask $M = M_0$ satisfying a given sparsity constraint $\|M\|_0 / \|\theta\|_0 = 1 - s$, where $s \in [0, 1]$ is the sparsity ratio. It then makes different intra- and inter-task adjustments to keep a dynamic sparse set of weights based on their continual weight importance (CWI). We summarize the process of task-aware dynamic masking in Algorithm 1 and elaborate its key components below.

**Continual weight importance (CWI).** For a model $f$ parameterized by $\theta$, the CWI of a weight $w$ is defined as follows:

$$\mathrm{CWI}(w) = |w| + \alpha \left|\frac{\partial \mathcal{L}'(\mathcal{D}_t; \theta)}{\partial w}\right| + \beta \left|\frac{\partial \mathcal{L}(\mathcal{M}; \theta)}{\partial w}\right|, \tag{1}$$

where $\mathcal{D}_t$ denotes the training data from the $t$-th task, $\mathcal{M}$ is the current rehearsal buffer, and $\alpha, \beta$ are coefficients that control the influence of current and buffered data, respectively. Moreover, $\mathcal{L}$ represents the cross-entropy loss for classification, while $\mathcal{L}'$ is the single-head [1] version of the cross-entropy loss, which only considers classes from the current task by masking out the logits of other classes.

Algorithm 1: Task-aware Dynamic Masking (TDM)

    Input: model weights θ; number of tasks T; training epochs K_t of the t-th task;
           binary sparse mask M; sparsity ratio s; intra-task adjustment ratio p_intra;
           inter-task adjustment ratio p_inter; update interval δk
    Initialize: θ, M, s.t. ‖M‖₀ / ‖θ‖₀ = 1 − s
    for t = 1, ..., T do
        for e = 1, ..., K_t do
            if t > 1 then                              /* Inter-task adjustment */
                if e = 1 then
                    Expand M by randomly adding unused weights, s.t. ‖M‖₀ / ‖θ‖₀ = 1 − (s − p_inter)
                if e = δk then
                    Shrink M by removing the least important weights according to Eq. (1),
                    s.t. ‖M‖₀ / ‖θ‖₀ = 1 − s
            end
            if e mod δk = 0 then                       /* Intra-task adjustment */
                Shrink M by removing the least important weights according to Eq. (1),
                s.t. ‖M‖₀ / ‖θ‖₀ = 1 − (s + p_intra)
                Expand M by randomly adding unused weights, s.t. ‖M‖₀ / ‖θ‖₀ = 1 − s
            end
            Update the unpruned weights θ ⊙ M via backpropagation
        end
    end

Intuitively, CWI ensures that we keep (1) weights of larger magnitude for output stability, (2) weights important for the current task for learning capacity, and (3) weights important for past data to mitigate catastrophic forgetting. Moreover, motivated by the classification bias in CL [1], we use the single-head cross-entropy loss when calculating the importance score w.r.t. the current task to make the importance estimation more accurate.

**Intra-task adjustment.** When training on the $t$-th task, a natural assumption is that the data distribution is consistent within the task. As a result, we would like to update the sparse model in a relatively stable way while keeping it flexible. Thus, in Algorithm 1, we update the sparsity mask $M$ in a shrink-and-expand manner every $\delta k$ epochs. We first remove the $p_{\text{intra}}$ fraction of weights with the least CWI to retain the knowledge learned so far. Then we randomly select unused weights to recover the learning capacity of the model and keep the sparsity ratio $s$ unchanged. A sketch of this CWI-based shrink-and-expand update is given below.
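The following is a minimal, single-tensor sketch of the CWI computation and the intra-task shrink-and-expand step, assuming the gradient magnitudes w.r.t. current-task data (single-head loss) and buffered data have already been accumulated. The function names, flat-tensor treatment, and default α = β = 1 are illustrative assumptions rather than the exact implementation.

```python
import torch


def continual_weight_importance(weight, grad_cur, grad_buf, alpha=1.0, beta=1.0):
    """CWI(w) = |w| + alpha * |dL'(D_t)/dw| + beta * |dL(M)/dw|  (cf. Eq. (1))."""
    return weight.abs() + alpha * grad_cur.abs() + beta * grad_buf.abs()


def intra_task_adjust(weight, mask, grad_cur, grad_buf, p_intra, alpha=1.0, beta=1.0):
    """Shrink-and-expand update of the binary mask at a fixed sparsity ratio."""
    cwi = continual_weight_importance(weight, grad_cur, grad_buf, alpha, beta)
    cwi = cwi * mask                                   # only active weights compete
    n_active = int(mask.sum().item())
    n_remove = int(p_intra * n_active)

    # 1) Shrink: drop the n_remove active weights with the lowest CWI.
    scores = cwi.view(-1).clone()
    scores[mask.view(-1) == 0] = float("inf")          # pruned weights are never "removed"
    drop_idx = torch.topk(scores, n_remove, largest=False).indices
    new_mask = mask.clone().view(-1)
    new_mask[drop_idx] = 0

    # 2) Expand: randomly grow the same number of currently unused weights,
    #    restoring the overall sparsity ratio s.
    unused_idx = (new_mask == 0).nonzero(as_tuple=True)[0]
    grow_idx = unused_idx[torch.randperm(unused_idx.numel())[:n_remove]]
    new_mask[grow_idx] = 1
    return new_mask.view_as(mask)
```

In SparCL, the first gradient term would come from the single-head loss on current-task data and the second from the standard loss on buffered data; here they are simply passed in as precomputed tensors.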
**Inter-task adjustment.** When tasks switch, on the contrary, we assume that the data distribution shifts immediately. Ideally, we would like the model to keep the knowledge learned from old tasks as much as possible, while having enough learning capacity to accommodate the new task. Thus, instead of the shrink-and-expand strategy used for intra-task adjustment, we follow an expand-and-shrink scheme. Specifically, at the beginning of the $(t+1)$-th task, we expand the sparse model by randomly adding a proportion $p_{\text{inter}}$ of unused weights. Intuitively, the additional learning capacity facilitates fast adaptation to new knowledge and reduces interference with learned knowledge. We allow our model to have smaller sparsity (i.e., larger learning capacity) temporarily for the first $\delta k$ epochs as a warm-up period, and then remove the $p_{\text{inter}}$ fraction of weights with the least CWI, following the same process as in the intra-task case, to satisfy the sparsity constraint.

### 4.2 Dynamic Data Removal

In addition to weight sparsity, decreasing the amount of training data translates directly into training time savings. Thus, we would also like to exploit data efficiency to reduce the training workload. Some prior CL works select informative examples to construct the rehearsal buffer [3, 6, 65]. However, their main purpose is not training acceleration; they either introduce excessive computational cost or consider different problem settings. Taking the characteristics of CL into account, we present a simple yet effective strategy, dynamic data removal (DDR), to reduce the training data for further acceleration.

We measure the importance of each training example by the number of times it is misclassified [55, 67] during CL. In TDM, the sparse structure of our model is updated periodically every $\delta k$ epochs, so we align the data removal process with the weight mask updates for further efficiency and training stability. In Section 4.1, we have partitioned the training process of the $t$-th task into $N_t = K_t / \delta k$ stages based on the dynamic mask updates. We therefore gradually remove training data at the end of the $i$-th stage according to the following policy: 1) calculate the total number of misclassifications $f_i(x_j)$ of each training example $x_j$ during the $i$-th stage; 2) remove the proportion $\alpha_i$ of training samples with the fewest misclassifications.

Although our main purpose is to keep the harder examples so as to consolidate the sparse model, we obtain further benefits for CL. First, removing easier examples increases the probability that harder examples are saved to the rehearsal buffer, given a common buffering strategy such as reservoir sampling [14]. Thus, we implicitly construct a more informative buffer without heavy computation. Moreover, since the buffer size is much smaller than the training set size of each task, the data from the buffer and the new task are highly imbalanced; dynamic data removal also relieves this data imbalance issue.

Formally, we set the overall data removal proportion for each task as $\alpha \in [0, 1]$ and choose a cutoff stage $N_{\text{cutoff}} \le N_t$, such that

$$\sum_{i=1}^{N_{\text{cutoff}}} \alpha_i = \alpha, \qquad \alpha_i = 0 \ \text{ for } i > N_{\text{cutoff}}. \tag{2}$$

The cutoff stage controls the trade-off between efficiency and accuracy: when we set the cutoff stage earlier, we reduce the training time for all following stages; however, when the cutoff stage is set too early, the model might underfit the removed training data. Note that when we set $\alpha_i = 0$ for all $i = 1, 2, \ldots, N_t$ and $N_{\text{cutoff}} = N_t$, we simply recover the vanilla setting without any data efficiency considerations. In our experiments, we assume $\alpha_i = \alpha / N_{\text{cutoff}}$, i.e., we remove an equal proportion of data at the end of every stage up to the cutoff, for simplicity. A minimal sketch of this removal policy appears below.
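The removal policy boils down to counting misclassifications per example during the current stage and dropping the α_i fraction with the fewest. The sketch below assumes integer example ids and simple list-based bookkeeping; the class name `DynamicDataRemoval` and its interface are illustrative, not SparCL's exact implementation.

```python
from collections import defaultdict


class DynamicDataRemoval:
    """Track per-example misclassification counts and drop the easiest examples per stage."""

    def __init__(self):
        self.miscls = defaultdict(int)   # example id -> misclassification count

    def record(self, example_ids, logits, labels):
        """Call once per training batch to accumulate misclassifications."""
        preds = logits.argmax(dim=1)
        for eid, wrong in zip(example_ids, (preds != labels).tolist()):
            if wrong:
                self.miscls[int(eid)] += 1

    def remove_easiest(self, remaining_ids, removal_fraction):
        """Return ids kept for the next stage; called at the end of each δk-epoch stage
        until the cutoff stage, with removal_fraction = alpha / N_cutoff."""
        n_remove = int(removal_fraction * len(remaining_ids))
        # The least-misclassified examples are treated as "easy" and dropped.
        ranked = sorted(remaining_ids, key=lambda eid: self.miscls[int(eid)])
        kept = ranked[n_remove:]
        self.miscls.clear()              # counts are accumulated per stage
        return kept
```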
We also conduct a comprehensive exploration of $\alpha$ and of the selection of the cutoff stage in Section 5.3 and Appendix B.3.

### 4.3 Dynamic Gradient Masking

With TDM and DDR, we already achieve weight efficiency and data efficiency during training. To further boost training efficiency, we exploit gradient sparsity and propose dynamic gradient masking (DGM) for CL. DGM reduces computational cost by applying only the most important gradients to the corresponding unpruned model parameters via a gradient mask. The gradient mask is dynamically updated along with the weight mask defined in Section 4.1. Intuitively, while targeting better training efficiency, DGM also promotes the preservation of past knowledge by preventing a fraction of weights from updating.

Formally, our goal is to find a subset of unpruned parameters (or, equivalently, a gradient mask $M_G$) to update over multiple training iterations. For a model $f$ parameterized by $\theta$, we have the corresponding gradient matrix $G$ computed during each iteration. To prevent the pruned weights from updating, the weight mask $M$ is applied to the gradient matrix as $G \odot M$ during backpropagation. Beyond the gradients of pruned weights, we additionally remove less important gradient coefficients for faster training. To achieve this, we introduce the continual gradient importance (CGI), based on the CWI, to measure the importance of weight gradients:

$$\mathrm{CGI}(w) = \alpha \left|\frac{\partial \mathcal{L}'(\mathcal{D}_t; \theta)}{\partial w}\right| + \beta \left|\frac{\partial \mathcal{L}(\mathcal{M}; \theta)}{\partial w}\right|. \tag{3}$$

We remove the proportion $q$ of non-zero gradients from $G$ with the lowest importance measured by CGI, so that $\|M_G\|_0 / \|\theta\|_0 = 1 - (s + q)$. The gradient mask $M_G$ is then applied to the gradient matrix $G$. During the entire training process, the gradient mask $M_G$ is updated at a fixed interval. A sketch of this gradient-masking step is given below.
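Below is a hedged sketch of how a gradient mask could be derived from CGI and applied before the optimizer step. The function names, the default α = β = 1, and the commented application loop are assumptions for illustration; the exact recomputation interval and per-layer treatment follow the description above only loosely.

```python
import torch


def continual_gradient_importance(grad_cur, grad_buf, alpha=1.0, beta=1.0):
    """CGI(w) = alpha * |dL'(D_t)/dw| + beta * |dL(M)/dw|  (cf. Eq. (3))."""
    return alpha * grad_cur.abs() + beta * grad_buf.abs()


def build_gradient_mask(weight_mask, grad_cur, grad_buf, q, alpha=1.0, beta=1.0):
    """Keep only the most important gradients among the unpruned weights.

    The resulting mask has density 1 - (s + q), where s is the weight sparsity
    and q is the extra fraction of gradient coefficients removed.
    """
    cgi = continual_gradient_importance(grad_cur, grad_buf, alpha, beta)
    cgi = cgi * weight_mask                       # pruned weights never receive updates
    n_total = weight_mask.numel()
    n_keep = int(weight_mask.sum().item()) - int(q * n_total)
    keep_idx = torch.topk(cgi.view(-1), max(n_keep, 0)).indices
    grad_mask = torch.zeros_like(weight_mask).view(-1)
    grad_mask[keep_idx] = 1
    return grad_mask.view_as(weight_mask)


# During training, the mask would be applied to each parameter's gradient
# before the optimizer step, e.g.:
#   for p, g_mask in zip(sparse_params, grad_masks):
#       p.grad.mul_(g_mask)
```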
## 5 Experiments

### 5.1 Experiment Setting

**Datasets.** We evaluate SparCL on two representative CL benchmarks, Split CIFAR-10 [33] and Split Tiny-ImageNet [16]. In particular, we follow [8, 68] by splitting CIFAR-10 and Tiny-ImageNet into 5 and 10 tasks, consisting of 2 and 20 classes per task, respectively. Dataset licensing information can be found in Appendix A.

Table 1: Comparison with CL methods. SparCL consistently improves the training efficiency of the corresponding CL methods while preserving (or even improving) accuracy in both class- and task-incremental settings.

| Method | Sparsity | Buffer size | Split CIFAR-10 Class-IL (↑) | Split CIFAR-10 Task-IL (↑) | Split CIFAR-10 FLOPs Train ×10¹⁵ (↓) | Split Tiny-ImageNet Class-IL (↑) | Split Tiny-ImageNet Task-IL (↑) | Split Tiny-ImageNet FLOPs Train ×10¹⁵ (↓) |
|---|---|---|---|---|---|---|---|---|
| EWC [32] | 0.00 | – | 19.49 ± 0.12 | 68.29 ± 3.92 | 8.3 | 7.58 ± 0.10 | 19.20 ± 0.31 | 13.3 |
| LwF [37] | 0.00 | – | 19.61 ± 0.05 | 63.29 ± 2.35 | 8.3 | 8.46 ± 0.22 | 15.85 ± 0.58 | 13.3 |
| PackNet [42] | 0.50 (*) | – | – | 93.73 ± 0.55 | 5.0 | – | 61.88 ± 1.01 | 7.3 |
| LPS [57] | 0.50 (*) | – | – | 94.50 ± 0.47 | 5.0 | – | 63.37 ± 0.83 | 7.3 |
| A-GEM [13] | 0.00 | 200 | 20.04 ± 0.34 | 83.88 ± 1.49 | 11.1 | 8.07 ± 0.08 | 22.77 ± 0.03 | 17.8 |
| iCaRL [50] | 0.00 | 200 | 49.02 ± 3.20 | 88.99 ± 2.13 | 11.1 | 7.53 ± 0.79 | 28.19 ± 1.47 | 17.8 |
| FDR [5] | 0.00 | 200 | 30.91 ± 2.74 | 91.01 ± 0.68 | 13.9 | 8.70 ± 0.19 | 40.36 ± 0.68 | 22.2 |
| ER [14] | 0.00 | 200 | 44.79 ± 1.86 | 91.19 ± 0.94 | 11.1 | 8.49 ± 0.16 | 38.17 ± 2.00 | 17.8 |
| DER++ [8] | 0.00 | 200 | 64.88 ± 1.17 | 91.92 ± 0.60 | 13.9 | 10.96 ± 1.17 | 40.87 ± 1.16 | 22.2 |
| SparCL-ER75 | 0.75 | 200 | 46.89 ± 0.68 | 92.02 ± 0.72 | 2.0 | 8.98 ± 0.38 | 39.14 ± 0.85 | 3.2 |
| SparCL-DER++75 | 0.75 | 200 | 66.30 ± 0.98 | 94.06 ± 0.45 | 2.5 | 12.73 ± 0.40 | 42.06 ± 0.73 | 4.0 |
| SparCL-ER90 | 0.90 | 200 | 45.81 ± 1.05 | 91.49 ± 0.47 | 0.9 | 8.67 ± 0.41 | 38.79 ± 0.39 | 1.4 |
| SparCL-DER++90 | 0.90 | 200 | 65.79 ± 1.33 | 93.73 ± 0.24 | 1.1 | 12.27 ± 1.06 | 41.17 ± 1.31 | 1.8 |
| SparCL-ER95 | 0.95 | 200 | 44.59 ± 0.23 | 91.07 ± 0.64 | 0.5 | 8.43 ± 0.09 | 38.20 ± 0.46 | 0.8 |
| SparCL-DER++95 | 0.95 | 200 | 65.18 ± 1.25 | 92.97 ± 0.37 | 0.6 | 10.76 ± 0.62 | 40.54 ± 0.98 | 1.0 |
| A-GEM [13] | 0.00 | 500 | 22.67 ± 0.57 | 89.48 ± 1.45 | 11.1 | 8.06 ± 0.04 | 25.33 ± 0.49 | 17.8 |
| iCaRL [50] | 0.00 | 500 | 47.55 ± 3.95 | 88.22 ± 2.62 | 11.1 | 9.38 ± 1.53 | 31.55 ± 3.27 | 17.8 |
| FDR [5] | 0.00 | 500 | 28.71 ± 3.23 | 93.29 ± 0.59 | 13.9 | 10.54 ± 0.21 | 49.88 ± 0.71 | 22.2 |
| ER [14] | 0.00 | 500 | 57.74 ± 0.27 | 93.61 ± 0.27 | 11.1 | 9.99 ± 0.29 | 48.64 ± 0.46 | 17.8 |
| DER++ [8] | 0.00 | 500 | 72.70 ± 1.36 | 93.88 ± 0.50 | 13.9 | 19.38 ± 1.41 | 51.91 ± 0.68 | 22.2 |
| SparCL-ER75 | 0.75 | 500 | 60.80 ± 0.22 | 93.82 ± 0.32 | 2.0 | 10.48 ± 0.29 | 50.83 ± 0.69 | 3.2 |
| SparCL-DER++75 | 0.75 | 500 | 74.09 ± 0.84 | 95.19 ± 0.34 | 2.5 | 20.75 ± 0.88 | 52.19 ± 0.43 | 4.0 |
| SparCL-ER90 | 0.90 | 500 | 59.34 ± 0.97 | 93.33 ± 0.10 | 0.9 | 10.12 ± 0.53 | 49.46 ± 1.22 | 1.4 |
| SparCL-DER++90 | 0.90 | 500 | 73.42 ± 0.95 | 94.82 ± 0.23 | 1.1 | 19.62 ± 0.67 | 51.93 ± 0.36 | 1.8 |
| SparCL-ER95 | 0.95 | 500 | 57.75 ± 0.45 | 92.73 ± 0.34 | 0.5 | 9.91 ± 0.17 | 48.57 ± 0.50 | 0.8 |
| SparCL-DER++95 | 0.95 | 500 | 72.14 ± 0.78 | 94.39 ± 0.15 | 0.6 | 19.01 ± 1.32 | 51.26 ± 0.78 | 1.0 |

(*) Since PackNet and LPS actually have decreasing sparsity as each task is learned, we use 0.50 to roughly represent the average sparsity.

**Comparing methods.** We select several CL methods, including regularization-based (EWC [32], LwF [37]), architecture-based (PackNet [42], LPS [57]), and rehearsal-based (A-GEM [13], iCaRL [50], FDR [5], ER [14], DER++ [8]) methods. Note that PackNet and LPS are only compatible with task-incremental learning. We also adapt representative sparse training methods (SNIP [35], RigL [20]) to the CL setting by combining them with DER++ (SNIP-DER++, RigL-DER++).

**Variants of our method.** To show the generality of SparCL, we combine it with DER++ (one of the SOTA CL methods) and ER (simple and widely used) as SparCL-DER++ and SparCL-ER, respectively. We also vary the weight sparsity ratio (0.75, 0.90, 0.95) of SparCL for a comprehensive evaluation.

**Evaluation metrics.** We use the average accuracy over all tasks to evaluate the performance of the final model. Moreover, we measure the training FLOPs [20] and the memory footprint [67] (including feature map pixels and model parameters during training) to demonstrate the efficiency of each method. Please see Appendix B.1 for detailed definitions of these metrics; a rough, illustrative sketch is given below.
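For reference, here is a hedged sketch of the two headline metrics: final average accuracy over tasks, and a very rough training-FLOPs estimate that scales a dense forward cost by the model's density (1 − s) and assumes the common "training ≈ 3× forward" convention used in sparse training work such as [20]. The function names and this simplified accounting are assumptions; the paper's exact definitions are in Appendix B.1 and may differ.

```python
def average_accuracy(per_task_accuracies):
    """Mean final test accuracy over all T tasks (in %)."""
    return sum(per_task_accuracies) / len(per_task_accuracies)


def approx_sparse_training_flops(dense_fwd_flops, sparsity, n_iterations):
    """Rough estimate: (forward + backward) ~ 3x forward, scaled by density.

    Ignores layer-wise differences and mask-update overhead; intended only to
    convey why higher sparsity translates into proportionally fewer FLOPs.
    """
    density = 1.0 - sparsity
    return 3 * dense_fwd_flops * density * n_iterations
```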
**Experiment details.** For fair comparison, we strictly follow the settings in prior CL work [8, 29]. We set the per-task training epochs to 50 and 100 for Split CIFAR-10 and Split Tiny-ImageNet, respectively, with a batch size of 32. For the model architecture, we follow [8, 50] and adopt ResNet-18 [26] without any pre-training. We also use the best hyperparameter settings reported in [8, 57] for the CL methods, and in [20, 35] for the CL-adapted sparse training methods. For SparCL and its competing CL-adapted sparse training methods, we adopt a uniform sparsity ratio for all convolutional layers. Please see Appendix B for further details.

### 5.2 Main Results

**Comparison with CL methods.** Table 1 summarizes the results on Split CIFAR-10 and Split Tiny-ImageNet, under both class-incremental (Class-IL) and task-incremental (Task-IL) settings. From Table 1, we can clearly see that SparCL significantly improves upon ER and DER++, while also outperforming the other CL baselines, in terms of training efficiency (measured in FLOPs). With a higher sparsity ratio, SparCL leads to fewer training FLOPs. Notably, SparCL achieves a 23× training efficiency improvement over DER++ with a sparsity ratio of 0.95. On the other hand, our framework also improves the average accuracy of ER and DER++ consistently in all cases with sparsity ratios of 0.75 and 0.90, with only a slight performance drop at the larger sparsity of 0.95. In particular, SparCL-DER++ with a 0.75 sparsity ratio sets new SOTA accuracy for all buffer sizes on both benchmarks. The outstanding performance of SparCL indicates that our proposed strategies successfully preserve accuracy by further mitigating catastrophic forgetting with a much sparser model. Moreover, the improvement that SparCL brings to two different existing CL methods shows the generalizability of SparCL as a unified framework, i.e., it has the potential to be combined with a wide array of existing methods.

We also take a closer look at PackNet and LPS, which also leverage the idea of sparsity, but to split the model among different tasks, a different motivation from training efficiency. Firstly, they are only compatible with the Task-IL setting, since they leverage task identity at both training and test time. Moreover, the model sparsity of these methods decreases as the number of tasks increases, which still leads to much larger overall training FLOPs than that of SparCL. This further demonstrates the importance of keeping a sparse model without permanent expansion throughout the CL process.

Table 2: Comparison with CL-adapted sparse training methods. All methods are combined with DER++ with a buffer size of 500. SparCL outperforms all methods in both accuracy and training efficiency, under all sparsity ratios. All three methods here can save 20%–51% memory footprint; please see Appendix B.2 for details.

| Method | Sparsity | Split CIFAR-10 Class-IL (↑) | Split CIFAR-10 FLOPs Train ×10¹⁵ (↓) | Split Tiny-ImageNet Class-IL (↑) | Split Tiny-ImageNet FLOPs Train ×10¹⁵ (↓) |
|---|---|---|---|---|---|
| DER++ [8] | 0.00 | 72.70 ± 1.36 | 13.9 | 19.38 ± 1.41 | 22.2 |
| SNIP-DER++ [35] | 0.90 | 69.82 ± 0.72 | 1.6 | 16.13 ± 0.61 | 2.5 |
| RigL-DER++ [20] | 0.90 | 69.86 ± 0.59 | 1.6 | 18.36 ± 0.49 | 2.5 |
| SparCL-DER++90 | 0.90 | 73.42 ± 0.95 | 1.1 | 19.62 ± 0.67 | 1.8 |
| SNIP-DER++ [35] | 0.95 | 66.07 ± 0.91 | 0.9 | 14.76 ± 0.52 | 1.5 |
| RigL-DER++ [20] | 0.95 | 66.53 ± 1.13 | 0.9 | 15.88 ± 0.63 | 1.5 |
| SparCL-DER++95 | 0.95 | 72.14 ± 0.78 | 0.6 | 19.01 ± 1.32 | 1.0 |

**Comparison with CL-adapted sparse training methods.**
Table 2 shows results under the more difficult Class-IL setting. SparCL outperforms all CL-adapted sparse training methods in both accuracy and training FLOPs. The performance gap between SparCL-DER++ and the other methods grows larger at higher sparsity. SNIP- and RigL-DER++ achieve training acceleration at the cost of compromised accuracy, which suggests that preserving accuracy is a non-trivial challenge for existing sparse training methods under the CL setting. SNIP generates a static initial mask right after network initialization, which does not consider the suitability of the structure across tasks. Though RigL adopts a dynamic mask, its lack of task-awareness prevents it from generalizing well to the CL setting.

### 5.3 Effectiveness of Key Components

Table 3: Ablation study on Split CIFAR-10 with a 0.75 sparsity ratio. All components contribute to the overall performance, in terms of both accuracy and efficiency (training FLOPs and memory footprint).

| TDM | DDR | DGM | Class-IL (↑) | FLOPs Train ×10¹⁵ (↓) | Memory Footprint (↓) |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 72.70 | 13.9 | 247 MB |
| ✓ | ✗ | ✗ | 73.37 | 3.6 | 180 MB |
| ✓ | ✓ | ✗ | 73.80 | 2.8 | 180 MB |
| ✓ | ✗ | ✓ | 73.97 | 3.3 | 177 MB |
| ✓ | ✓ | ✓ | 74.09 | 2.5 | 177 MB |

**Ablation study.** We provide a comprehensive ablation study in Table 3 using SparCL-DER++ with 0.75 sparsity on Split CIFAR-10. Table 3 demonstrates that all components of our method contribute to both efficiency and accuracy improvements. Comparing rows 1 and 2, we can see that the majority of the FLOPs reduction comes from TDM. Interestingly, TDM also leads to an increase in accuracy, indicating that TDM generates a sparse model that is even more suitable for learning all tasks than the full dense model. Comparing rows 2 and 3, we can see that DDR indeed further accelerates training by removing less informative examples. As discussed in Section 4.2, when we remove a certain amount of data (30% here), we reach a point where we keep as many informative samples as we need, while also balancing the current and buffered data. Comparing rows 2 and 4, DGM reduces both training FLOPs and memory footprint while improving the performance of the network. Finally, the last row demonstrates the collaborative performance of all components. We also show the same ablation study with 0.90 sparsity in Appendix B.4 for reference. Details can be found in Appendix B.1.

Figure 3: Comparison between DDR and the One-shot [67] data removal strategy w.r.t. different data removal proportions α. DDR outperforms One-shot and also achieves improved accuracy when α ≤ 30%.

**Exploration on DDR.** To understand the influence of the data removal proportion α and the cutoff stage for each task, we show the corresponding experimental results in Figure 3 and Appendix B.3, respectively. In Figure 3, we fix the cutoff stage to 4, i.e., we gradually remove an equal number of examples every 5 epochs until epoch 20, and vary α from 10% to 90%. We also compare DDR with the One-shot removal strategy [67], which removes all examples at once at the cutoff stage. DDR outperforms One-shot consistently in average accuracy across different α. Also note that since DDR removes examples gradually before the cutoff stage, DDR is more efficient than One-shot. When α ≤ 30%, we also observe increased accuracy of DDR compared with the baseline that does not remove any data. When α ≥ 40%, the accuracy becomes increasingly lower for both strategies. The intuition is that when DDR removes a proper amount of data, it removes redundant information while keeping the most informative examples. Moreover, as discussed in Section 4.2, it balances the current and buffered data, while also leaving informative samples in the buffer. When DDR removes too much data, it also loses informative examples that the model has not yet learned well before removal.

**Exploration on DGM.**
We test the efficacy of DGM at different sparsity levels. Detailed exploratory experiments are shown in Appendix B.5 for reference. The results indicate that, by setting the proportion q within an appropriate range, DGM consistently improves accuracy regardless of the weight sparsity level.

### 5.4 Mobile Device Results

Figure 4: Comparison with CL-adapted sparse training methods in terms of training acceleration rate and accuracy. The radius of each circle indicates the memory footprint.

The training acceleration results are measured on the CPU of an off-the-shelf Samsung Galaxy S20 smartphone, which uses the Qualcomm Snapdragon 865 mobile platform with a Qualcomm Kryo 585 octa-core CPU. We run each test on a batch of 32 images to measure the training speed. Details of the on-mobile compiler-level optimizations for training acceleration can be found in Appendix C.1. The acceleration results are shown in Figure 4. SparCL achieves approximately 3.1× and 2.3× training acceleration with 0.95 and 0.90 sparsity, respectively. Besides, our framework also saves 51% and 48% of the memory footprint at 0.95 and 0.90 sparsity. Furthermore, the obtained sparse models reduce storage consumption by using compressed sparse row (CSR) storage and can be further accelerated to speed up inference on the edge. We provide on-mobile inference acceleration results in Appendix C.2.

## 6 Conclusion

This paper presents a unified framework named SparCL for efficient CL that achieves both learning acceleration and accuracy preservation. It comprises three complementary strategies: task-aware dynamic masking for weight sparsity, dynamic data removal for data efficiency, and dynamic gradient masking for gradient sparsity. Extensive experiments on standard CL benchmarks and evaluations on a real edge device demonstrate that our method significantly improves upon existing CL methods and outperforms CL-adapted sparse training methods. We discuss the limitations and potential negative societal impact of our method in Sections 7 and 8, respectively.

## 7 Limitations

One limitation of our method is that we assume a rehearsal buffer is available throughout the CL process. Although this assumption is widely accepted, there are still situations in which a rehearsal buffer is not allowed. However, as a framework targeting efficiency, our work has the potential to accelerate all types of CL methods. For example, simply removing the terms related to the rehearsal buffer in Eq. (1) and Eq. (3) would yield a naive variant of our method that is compatible with non-rehearsal methods. It would be interesting to further extend SparCL to be more generic for all kinds of CL methods. Moreover, the benchmarks we use are limited to the vision domain. Although using vision-based benchmarks has been common practice in the CL community, we believe evaluating our method, as well as other CL methods, on datasets from other domains such as NLP will lead to a more comprehensive and reliable conclusion. We will keep track of newer CL benchmarks from different domains and further improve our work accordingly.

## 8 Potential Negative Societal Impact

Although SparCL is a general framework for enhancing the efficiency of various CL methods, we still need to be aware of its potential negative societal impact. For example, we need to be careful about the trade-off between accuracy and efficiency when using SparCL. If one pursues efficiency by setting the sparsity ratio too high, then even SparCL will suffer a significant accuracy drop, since the over-sparsified model does not have enough representation power.
Thus, we should pay close attention when applying SparCL to accuracy-sensitive applications such as healthcare [66]. Another consideration is that SparCL, as a powerful tool for making CL methods efficient, could also strengthen models for malicious applications [7]. Therefore, we encourage the community to develop further strategies and regulations to prevent the malicious use of artificial intelligence.

## 9 Acknowledgement

The authors gratefully acknowledge support by the National Science Foundation under grants CCF-1937500, CCF-1919117 and CNS-2112471.

## References

[1] Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. SS-IL: Separated softmax for incremental learning. In CVPR, 2021.
[2] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.
[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. NeurIPS, 2019.
[4] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In ICLR, 2018.
[5] Ari S. Benjamin, David Rolnick, and Konrad Kording. Measuring and regularizing networks in function space. arXiv preprint arXiv:1805.08289, 2018.
[6] Zalán Borsos, Mojmir Mutny, and Andreas Krause. Coresets via bilevel optimization for continual learning and streaming. NeurIPS, 2020.
[7] Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228, 2018.
[8] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In NeurIPS, 2020.
[9] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking experience replay: a bag of tricks for continual learning. In ICPR, pages 2180-2187. IEEE, 2021.
[10] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
[11] Hyuntak Cha, Jaeho Lee, and Jinwoo Shin. Co2L: Contrastive continual learning. In ICCV, 2021.
[12] Arslan Chaudhry, Albert Gordo, Puneet Kumar Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. arXiv preprint arXiv:2002.08165, 2020.
[13] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
[14] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
[15] Tianlong Chen, Zhenyu Zhang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Long live the lottery: The existence of winning tickets in lifelong learning. In ICLR, 2020.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
[17] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.
[18] Peiyan Dong, Siyue Wang, Wei Niu, Chengming Zhang, Sheng Lin, Zhengang Li, Yifan Gong, Bin Ren, Xue Lin, and Dingwen Tao. RTMobile: Beyond real-time mobile acceleration of RNNs for speech recognition. In DAC, pages 1-6. IEEE, 2020.
[19] Xuanyi Dong and Yi Yang. Network pruning via transformable architecture search. In NeurIPS, pages 759-770, 2019.
[20] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In ICML, pages 2943-2952. PMLR, 2020.
[21] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR, 2019.
[22] Yifan Gong, Geng Yuan, Zheng Zhan, Wei Niu, Zhengang Li, Pu Zhao, Yuxuan Cai, Sijia Liu, Bin Ren, Xue Lin, et al. Automatic mapping of the best-suited DNN pruning schemes for real-time mobile acceleration. ACM Transactions on Design Automation of Electronic Systems (TODAES), 27(5):1-26, 2022.
[23] Yifan Gong, Zheng Zhan, Zhengang Li, Wei Niu, Xiaolong Ma, Wenhao Wang, Bin Ren, Caiwen Ding, Xue Lin, Xiaolin Xu, et al. A privacy-preserving-oriented DNN pruning and mobile acceleration framework. In GLSVLSI, pages 119-124, 2020.
[24] Song Han, Jeff Pool, et al. Learning both weights and connections for efficient neural networks. In NeurIPS, pages 1135-1143, 2015.
[25] Tyler L. Hayes, Nathan D. Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In ICRA, 2019.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[27] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.
[28] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR, pages 4340-4349, 2019.
[29] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[30] Tong Jian, Yifan Gong, Zheng Zhan, Runbin Shi, Nasim Soltani, Zifeng Wang, Jennifer G. Dy, Kaushik Roy Chowdhury, Yanzhi Wang, and Stratis Ioannidis. Radio frequency fingerprinting on the edge. IEEE Transactions on Mobile Computing, 2021.
[31] Zixuan Ke, Bing Liu, and Xingchang Huang. Continual learning of a mixed sequence of similar and dissimilar tasks. NeurIPS, 2020.
[32] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521-3526, 2017.
[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
[34] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[35] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: Single-shot network pruning based on connection sensitivity. ICLR, 2019.
[36] Tuanhui Li, Baoyuan Wu, Yujiu Yang, Yanbo Fan, Yong Zhang, and Wei Liu. Compressing convolutional neural networks via factorized convolutional filters. In CVPR, pages 3977-3986, 2019.
[37] Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 40(12):2935-2947, 2017.
[38] Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. PConv: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In AAAI, pages 5117-5124, 2020.
[39] Xiaolong Ma, Wei Niu, Tianyun Zhang, Sijia Liu, Sheng Lin, Hongjia Li, Wujie Wen, Xiang Chen, Jian Tang, Kaisheng Ma, et al. An image enhancing pattern-based sparsity for real-time inference on mobile devices. In ECCV, pages 629-645. Springer, 2020.
[40] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. arXiv preprint arXiv:2101.10423, 2021.
[41] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, pages 67-82, 2018.
[42] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, 2018.
[43] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109-165. Elsevier, 1989.
[44] Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, and Emma Strubell. An empirical investigation of the role of pre-training in lifelong learning. ICML Workshop, 2021.
[45] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1-12, 2018.
[46] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML, pages 4646-4655. PMLR, 2019.
[47] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. arXiv preprint arXiv:2001.00138, 2020.
[48] Lorenzo Pellegrini, Vincenzo Lomonaco, Gabriele Graffieti, and Davide Maltoni. Continual learning at the edge: Real-time training on smartphone devices. arXiv preprint arXiv:2105.13127, 2021.
[49] Quang Pham, Chenghao Liu, and Steven Hoi. DualNet: Continual learning, fast and slow. NeurIPS, 2021.
[50] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 2001-2010, 2017.
[51] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[52] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In ICML, 2018.
[53] Ghada Sokar, Decebal Constantin Mocanu, and Mykola Pechenizkiy. SpaceNet: Make free space for continual learning. Neurocomputing, 439:1-11, 2021.
[54] Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. NeurIPS, 33:6377-6389, 2020.
[55] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
[56] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In ICLR, 2019.
[57] Zifeng Wang, Tong Jian, Kaushik Chowdhury, Yanzhi Wang, Jennifer Dy, and Stratis Ioannidis. Learn-prune-share for lifelong learning. In ICDM, 2020.
[58] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. ECCV, 2022.
[59] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. CVPR, 2022.
[60] Paul Wimmer, Jens Mehnert, and Alexandru Condurache. FreezeNet: Full performance by reduced storage costs. In ACCV, 2020.
[61] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374-382, 2019.
[62] Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, and Yanzhi Wang. Compiler-aware neural architecture search for on-mobile real-time super-resolution. ECCV, 2022.
[63] Shipeng Yan, Jiangwei Xie, and Xuming He. DER: Dynamically expandable representation for class incremental learning. In CVPR, pages 3014-3023, 2021.
[64] Li Yang, Sen Lin, Junshan Zhang, and Deliang Fan. GROWN: Grow only when necessary for continual learning. arXiv preprint arXiv:2110.00908, 2021.
[65] Jaehong Yoon, Divyam Madaan, Eunho Yang, and Sung Ju Hwang. Online coreset selection for rehearsal-based continual learning. arXiv preprint arXiv:2106.01085, 2021.
[66] Kun-Hsing Yu, Andrew L. Beam, and Isaac S. Kohane. Artificial intelligence in healthcare. Nature Biomedical Engineering, 2(10):719-731, 2018.
[67] Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, et al. MEST: Accurate and fast memory-economic sparse training framework on the edge. NeurIPS, 34, 2021.
[68] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.
[69] Zheng Zhan, Yifan Gong, Pu Zhao, Geng Yuan, Wei Niu, Yushu Wu, Tianyun Zhang, Malith Jayaweera, David Kaeli, Bin Ren, et al. Achieving on-mobile real-time super-resolution with neural architecture and pruning search. In ICCV, pages 4821-4831, 2021.
[70] Tingting Zhao, Zifeng Wang, Aria Masoomi, and Jennifer Dy. Deep Bayesian unsupervised lifelong learning. Neural Networks, 149:95-106, 2022.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The claims match the experimental results and are expected to generalize according to the diverse experiments stated in our paper. We include all of our code, data, and models in the supplementary materials, which can reproduce our experimental results.
   (b) Did you describe the limitations of your work? [Yes] See Sections 6 and 7.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sections 6 and 8.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We have read the ethics review guidelines and ensured that our paper conforms to them.
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A] Our paper is based on experimental results and we do not have any theoretical results.
   (b) Did you include complete proofs of all theoretical results? [N/A] Our paper is based on experimental results and we do not have any theoretical results.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 5.1 and Section 5.4; we provide code to reproduce the main experimental results.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5.1 and Section 5.4.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Table 1, Table 2, Figure 1, and Figure 3.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 5.1 and Section 5.4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We mention and cite the datasets (Split CIFAR-10 and Split Tiny-ImageNet) and all comparing methods, with their papers and GitHub repositories.
   (b) Did you mention the license of the assets? [Yes] The licenses of the used datasets/models are provided in the cited references and we state them explicitly in Appendix A.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We provide code for our proposed method in the supplement.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]