# Parameter-Level Soft-Masking for Continual Learning

Tatsuya Konishi¹, Mori Kurokawa¹, Chihiro Ono¹, Zixuan Ke², Gyuhak Kim², Bing Liu²

¹KDDI Research, Inc., Fujimino, Japan. ²University of Illinois at Chicago, Chicago, United States. Correspondence to: Tatsuya Konishi. The work was done while this author was visiting Bing Liu's group at the University of Illinois at Chicago.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** Existing research on task-incremental learning in continual learning has primarily focused on preventing catastrophic forgetting (CF). Although several techniques have achieved learning with no CF, they attain it by letting each task monopolize a sub-network in a shared network, which seriously limits knowledge transfer (KT) and causes over-consumption of the network capacity, i.e., as more tasks are learned, the performance deteriorates. The goal of this paper is threefold: (1) overcoming CF, (2) encouraging KT, and (3) tackling the capacity problem. A novel technique (called SPG) is proposed that soft-masks (partially blocks) parameter updating in training based on the importance of each parameter to old tasks. Each task still uses the full network, i.e., no part of the network is monopolized by any task, which enables maximum KT and reduces capacity usage. To our knowledge, this is the first work that soft-masks a model at the parameter level for continual learning. Extensive experiments demonstrate the effectiveness of SPG in achieving all three objectives. More notably, it attains significant transfer of knowledge not only among similar tasks (with shared knowledge) but also among dissimilar tasks (with little shared knowledge), while mitigating CF.

## 1. Introduction

Catastrophic forgetting (CF) and knowledge transfer (KT) are two key challenges of continual learning (CL), which learns a sequence of tasks incrementally. CF refers to the phenomenon where a model loses some of its performance on previous tasks once it learns a new task. KT means that tasks may help each other learn by sharing knowledge.

This work further investigates these problems in the popular CL paradigm of task-incremental learning (TIL). In TIL, each task consists of several classes of objects to be learned. Once a task is learned, its data is discarded and will not be available for later use. During testing, the task id is provided for each test sample so that the corresponding classification head of the task can be used for prediction.

Several effective approaches have been proposed for TIL that can achieve learning with little or no CF. Parameter isolation is perhaps the most successful one, in which the system learns to mask a sub-network for each task in a shared network. HAT (Serra et al., 2018) and SupSup (Wortsman et al., 2020) are two representative systems. HAT sets binary/hard masks on neurons (not parameters) that are important for each task. In learning a new task, those masks block the gradient flow through the masked neurons in the backward pass. Only the free (unmasked) neurons and their parameters are trainable. Thus, as more tasks are learned, fewer free neurons are left, making later tasks harder to learn, which results in gradual performance deterioration (see Section 4.2.1).
Further, if a neuron is masked, all the parameters feeding into it are also masked, which consumes a great deal of network capacity (hereafter referred to as the capacity problem). As the sub-networks for old tasks cannot be updated, HAT has limited knowledge transfer. CAT (Ke et al., 2020) tries to improve the KT of HAT by detecting task similarities. If the new task is found to be similar to some previous tasks, those tasks' masks are removed so that training of the new task can update those tasks' parameters in the backward pass. However, this is risky: if a dissimilar task is detected as similar, serious CF occurs, and if similar tasks are detected as dissimilar, knowledge transfer is limited. SupSup uses a randomly initialized backbone network and finds a sub-network for each task. The sub-network is represented by a mask, which is a set of binary gates indicating which parameters in the network are used. The mask for each task is saved. Since the network is not changed, SupSup has no CF or capacity problem, but since each mask is independent of the other masks, SupSup by design has no KT.

To tackle these problems, we propose a very different approach, named Soft-masking of Parameter-level Gradient flow (SPG). It is surprisingly effective and contributes in the following ways:

(1) Instead of learning hard/binary masks on neurons for each task and blocking these neurons in training a new task and in testing like HAT, SPG computes an importance score for each network parameter (not neuron) to old tasks using gradients. The reason that gradients can be used as importance is that gradients directly tell how a change to a specific parameter will affect the output classification and may cause CF. SPG uses the importance score of each parameter as a soft-mask to constrain the gradient flow in the backward pass, ensuring that parameters important to old tasks change minimally when a new task is learned, which prevents CF of previous knowledge. To our knowledge, soft-masking of parameters has not been done before.

(2) SPG has some resemblance to the popular regularization-based approach, e.g., EWC (Kirkpatrick et al., 2017), in that both use the importance of parameters to constrain changes to parameters that are important to old tasks. But there is a major difference: SPG directly controls each parameter (fine-grained), whereas EWC controls all parameters together using a regularization term in the loss that penalizes the sum of changes to all parameters in the network (rather coarse-grained). Section 4.2 shows that our soft-masking is markedly better than regularization. We believe this is an important result.

(3) In the forward pass, no masks are applied, which encourages knowledge transfer among tasks. This is better than CAT, as SPG does not need an extra mechanism for task similarity comparison. Knowledge sharing and transfer in SPG are automatic. SupSup cannot do knowledge transfer.

(4) As SPG soft-masks parameters, it does not let each task monopolize any parameters or sub-network like HAT does, and SPG's forward pass does not use any masks. This reduces the capacity problem.

Experiments with the standard CL setup have been conducted with (1) similar tasks to demonstrate SPG's better knowledge transfer, (2) dissimilar tasks to show SPG's ability to overcome CF, and (3) both to show that SPG deals with the capacity issue. None of the baselines is able to achieve all three. The code is available at https://github.com/UIC-Liu-Lab/spg.
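To make the contrast between points (1) and (2) concrete, the following minimal PyTorch-style sketch compares per-parameter soft-masking of gradients with an EWC-style regularization penalty. It is illustrative only, not the authors' released code; names such as `acc_importance`, `fisher`, and `old_params` are hypothetical.

```python
import torch

def soft_mask_gradients(model, acc_importance):
    """SPG-style idea (point (1)): scale each parameter's gradient by
    (1 - accumulated importance) so that parameters important to old tasks
    are barely updated. Call after loss.backward(), before optimizer.step()."""
    for name, p in model.named_parameters():
        if p.grad is not None and name in acc_importance:
            p.grad.mul_(1.0 - acc_importance[name])

def ewc_style_penalty(model, fisher, old_params, lam=1.0):
    """EWC-style idea (point (2)): a single scalar regularization term added
    to the loss, penalizing the summed change of all parameters at once."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty
```

The soft-mask acts on each gradient individually between the backward pass and the optimizer step, whereas the EWC-style penalty is one aggregate term added to the loss.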
## 2. Related Work

Approaches in continual learning can be grouped into three main categories. We review them below.

**Regularization-based:** This approach computes importance values of either parameters or their gradients on previous tasks and adds a regularization term to the loss to restrict changes to those important parameters in order to mitigate CF. EWC (Kirkpatrick et al., 2017) uses the Fisher information matrix to represent the importance of parameters and a regularization term to penalize the sum of changes to all parameters. SI (Zenke et al., 2017) extends EWC to reduce the complexity of computing the penalty. Many other approaches (Li & Hoiem, 2016; Zhang et al., 2020; Ahn et al., 2019) in this category have also been proposed, but they still have difficulty preventing CF. As discussed in the introduction, the proposed approach SPG has some resemblance to the regularization-based method EWC, but the coarse-grained approach of using regularization is significantly poorer at overcoming CF than the fine-grained soft-masking in SPG, as we will see in Section 4.2.

**Memory-based:** This approach introduces a small memory buffer to store data of previous tasks and replays them when learning a new task to prevent CF (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019). Some methods (Shin et al., 2017; Deja et al., 2021) build data generators for previous tasks, and the generated pseudo-samples are used instead of real samples. Although several other approaches (Rebuffi et al., 2017; Riemer et al., 2019; Aljundi et al., 2019) have been proposed, they still suffer from CF. SPG does not save replay data or generate pseudo-replay data.

**Parameter isolation-based:** This approach is the most similar to our SPG. It tries to learn a sub-network for each task (tasks may share parameters and neurons), which limits knowledge transfer. We have discussed HAT, SupSup, and CAT in Section 1. Many others take similar approaches, e.g., Progressive Networks (PGN) (Rusu et al., 2016), APD (Yoon et al., 2020), PathNet (Fernando et al., 2017), PackNet (Mallya & Lazebnik, 2018), SpaceNet (Sokar et al., 2021), and WSN (Kang et al., 2022). In particular, PGN allocates a sub-network for each task in advance and progressively concatenates previous sub-networks while freezing the parameters allocated to previous tasks. APD selectively reuses and dynamically expands the dense network. These methods, however, depend on expanding the network for their performance, which is often not acceptable when many tasks need to be learned. PathNet splits each layer into multiple sub-modules and finds the best pathway designated to each task. PackNet freezes important weights for each task, finding them based on pruning. Although PathNet and PackNet do not expand the network during continual learning, they suffer from over-consumption of the fixed capacity. To address this, SpaceNet adopts sparse training to preserve parameters for future tasks, but the performance on each task is sacrificed. WSN also allocates a sub-network within a dense network and selectively reuses the sub-networks of previous tasks without expanding the whole network. Nevertheless, these methods are still limited by the pre-allocated network size because each task monopolizes and consumes some amount of capacity, which results in poorer KT when learning many tasks. In summary, parameter isolation-based methods suffer from over-consumption of network capacity and have limited KT, both of which the proposed method tries to address at the same time.
## 3. Proposed SPG

Figure 1. When learning task $t$, SPG proceeds in two steps: (a) train task $t$ until convergence; (b) compute importance after training task $t$. Black (solid) and green (dashed) arrows represent forward and backward propagation, respectively. $H_t$ denotes the head for task $t$. (a) Training of the model: in the forward pass, nothing extra is done; in the backward pass, the gradients of the feature-extractor parameters, $g_i$, are changed to $g'_i$ based on the accumulated importance $\bar{\gamma}^{t-1}_i$, and the gradients $g_{H_t}$ of task $t$'s head are changed to $g'_{H_t}$ using the average of the accumulated importance, $\bar{\gamma}^{t-1}$. (b) Computation of the accumulated importance $\bar{\gamma}^{t}_i$.

As discussed in Section 1, the current parameter isolation approaches such as HAT (Serra et al., 2018) and SupSup (Wortsman et al., 2020) are very effective for overcoming CF, but they hinder knowledge transfer and/or consume too much learning capacity of the network. For such a model to improve knowledge transfer, it needs to decide which parameters can be shared and updated for a new task. That is the approach taken in CAT (Ke et al., 2020). CAT finds similar tasks and removes their masks for updating, but it may identify the wrong tasks as similar, which causes CF. Further, parameters, not the neurons that HAT masks, are the atomic information units. If a neuron is masked, all parameters feeding into it are masked, which costs a huge amount of learning capacity. SPG directly soft-masks parameters based on their importance to previous tasks, which is more flexible and uses much less learning space. Soft-masking also enables automatic knowledge transfer.

Figure 1 and Algorithm 1 illustrate how SPG works. In SPG, the importance of a parameter to a task is computed based on its gradient. We do so because the gradients of parameters directly and quantitatively reflect how much changing a parameter affects the final loss. Additionally, we normalize the gradients of the parameters within each layer to make their relative importance more reliable, as gradients in different layers can have different magnitudes. The normalized importance scores are accumulated, and the corresponding gradients are reduced by them in the optimization step to avoid forgetting the knowledge learned from previous tasks.

**Algorithm 1.** Continual Learning in SPG.

- For $t = 1, \ldots, T$:
  - **Training of task $t$.** $M_t$ is the model for task $t$ (see Figure 1(a)). Repeat until $M_t$ converges:
    - Compute gradients $\{g_i\}$ and $g_{H_t}$ with $M_t$ using task $t$'s data $(X_t, Y_t)$.
    - For all parameters of the $i$-th layer: $g'_i \leftarrow$ Equation (6).
    - For all parameters of task $t$'s head: $g'_{H_t} \leftarrow$ Equation (7).
    - Update $M_t$ with the modified gradients $\{g'_i\}$ and $g'_{H_t}$.
  - **Computing the importance of parameters after training task $t$** (see Figure 1(b)):
    - For $\tau = 1, \ldots, t$: compute the loss $\mathcal{L}_{t,\tau}$ in Equation (2); for all parameters of the $i$-th layer, $\gamma^{t,\tau}_i \leftarrow$ Equation (1).
    - For all parameters of the $i$-th layer: $\gamma^{t}_i \leftarrow$ Equation (4) and $\bar{\gamma}^{t}_i \leftarrow$ Equation (5).
    - Store only $\{\bar{\gamma}^{t}_i\}$ for future tasks.
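A rough sketch of the training phase of Algorithm 1 (the part corresponding to Figure 1(a)) is given below. It is not the authors' implementation: the module names (`feature_extractor`, `head`), the dictionary `gamma_bar`, and the choice to scale the head gradients by one minus the mean accumulated importance are assumptions based on the figure caption above.

```python
import torch
import torch.nn.functional as F

def spg_train_step(feature_extractor, head, gamma_bar, optimizer, x, y):
    """One training step for task t (sketch). gamma_bar maps feature-extractor
    parameter names to accumulated importance tensors in [0, 1] from tasks
    1..t-1 (empty for the first task)."""
    optimizer.zero_grad()
    logits = head(feature_extractor(x))
    loss = F.cross_entropy(logits, y)
    loss.backward()

    # Soft-mask shared (feature-extractor) gradients: the more important a
    # parameter was to previous tasks, the smaller its update.
    mean_importance = 0.0
    if gamma_bar:
        for name, p in feature_extractor.named_parameters():
            if p.grad is not None and name in gamma_bar:
                p.grad.mul_(1.0 - gamma_bar[name])
        mean_importance = torch.cat(
            [g.flatten() for g in gamma_bar.values()]).mean().item()

    # The new head has no importance of its own yet; following Figure 1, its
    # gradients are reduced using the average accumulated importance
    # (interpreted here as scaling by 1 - mean importance).
    for p in head.parameters():
        if p.grad is not None:
            p.grad.mul_(1.0 - mean_importance)

    optimizer.step()
    return loss.item()
```

The key point is that the scaling happens only on the gradients in the backward pass; the forward pass is untouched, so every task uses the full shared network.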
### 3.1. Computing the Importance of Parameters

This procedure corresponds to Figure 1(b). The importance of each parameter to task $t$ is computed right after the training of task $t$ completes, as follows. Task $t$'s training data, $(X_t, Y_t)$, is given again to the trained model of task $t$, and the gradient of each parameter in the $i$-th layer (i.e., each weight or bias of each layer) is then computed and used to compute the importance of that parameter. Note that we use $\theta_i$ (a vector) to represent all parameters of the $i$-th layer. This process does not update the model parameters.

The reason that the importance is computed after training of the current task has converged is as follows. Even after a model converges, some parameters can have larger gradients, which indicates that changing those parameters may take the model out of the (local) minimum, leading to forgetting. On the contrary, if all parameters have similar gradients (i.e., balanced directions of gradients), changing the parameters is unlikely to change the model much or to cause forgetting. Based on this assumption, we use the normalized gradients after training as a signal indicating such dangerous parameter updates.

The proposed mechanism in SPG has the merit that it keeps the model flexible, as it does not fully block parameters using an importance threshold or binary masks. While HAT completely blocks important neurons, which results in the loss of trainable parameters over time, SPG allows most parameters to remain alive even though most of them do not change much.

Figure 2. Cross-head importance (CHI). ${}^{t}h^{l}_{i}$ and $w^{l}_{ij}$ denote the output of the $i$-th neuron in the $l$-th layer just after training task $t$ and the parameter in the $l$-th layer connecting neuron ${}^{t}h^{l}_{i}$ to ${}^{t}h^{l+1}_{j}$, respectively. (a) After learning task $\tau$: the importance of $w^{l}_{ij}$ to task $\tau$ is computed based on its gradient, $\partial \mathcal{L}_{\tau,\tau} / \partial w^{l}_{ij}$, and then accumulated. (b) After learning task $t$ ($t > \tau$): the state of related parameters might have changed. To reflect the importance to task $\tau$ again with the current neuron outputs (e.g., ${}^{t}h^{l}_{i}$ rather than the old ${}^{\tau}h^{l}_{i}$), an additional loss for CHI, $\mathcal{L}_{t,\tau}$, is computed at task $\tau$'s head using task $t$'s data as unlabeled data for task $\tau$.

Additionally, computing the gradients based only on the model ($M_t$) of the current task $t$ does not deal with another issue, which we illustrate with the example in Figure 2. Just after learning task $\tau$, the gradient of a parameter is computed, normalized among the same layer's parameters, and accumulated. Even if the parameter is not changed much during the learning of task $t$ ($t > \tau$) because of its accumulated importance, at the end of learning task $t$ the state of related parameters might have changed, so the normalized importance may have become less useful. To reflect the parameter's importance to task $\tau$ again in the current network state, we introduce the cross-head importance (CHI) mechanism. In particular, an additional loss, $\mathrm{Sum}(M_\tau(X_t))$, is computed with each previous task's head by substituting task $t$'s data as unlabeled data for the previous tasks. With this loss, parameters that affect the logits of previous tasks more are regarded as more important. Finally, both the normalized importance computed for the current task's head and the ones for previous tasks' heads in CHI are combined by taking the element-wise maximum, as shown in Equation (4).
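The importance computation just described, including the CHI terms, can be sketched as follows (the precise equations are given right below). This is an illustrative PyTorch-style reading, not the released code; `feature_extractor`, `heads`, and the small numerical epsilon are assumptions.

```python
import torch
import torch.nn.functional as F

def normalized_importance(grad, eps=1e-12):
    """Per-layer importance: standardize the absolute gradients within the
    layer and squash with tanh into a soft score (cf. Equations (1), (3))."""
    g = grad.abs().flatten()
    g = (g - g.mean()) / torch.sqrt(g.var() + eps)
    return torch.tanh(g).view_as(grad)

def task_importance_with_chi(feature_extractor, heads, x, y, t):
    """Importance of each feature-extractor parameter after training task t.
    heads[0..t] are the classification heads learned so far; x, y are task
    t's training data."""
    importance = {name: torch.zeros_like(p)
                  for name, p in feature_extractor.named_parameters()}
    for tau in range(t + 1):
        feature_extractor.zero_grad()
        heads[tau].zero_grad()
        logits = heads[tau](feature_extractor(x))
        # Current head: the usual supervised loss. Previous heads (CHI):
        # the sum of logits, treating task t's data as unlabeled data.
        loss = F.cross_entropy(logits, y) if tau == t else logits.sum()
        loss.backward()
        for name, p in feature_extractor.named_parameters():
            if p.grad is not None:
                gamma = normalized_importance(p.grad)
                # Element-wise maximum across heads (cf. Equation (4)).
                importance[name] = torch.maximum(importance[name], gamma)
    return importance
```

Taking the element-wise maximum keeps a parameter protected if it matters to any head seen so far.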
To put things together, the proposed method computes the normalized importance, $\gamma^{t}_i$, of the parameters of the $i$-th layer, $\theta_i$, using each task $\tau$'s model ($1 \le \tau \le t$), $M_\tau$:

$$\gamma^{t,\tau}_i = \tanh\left(\mathrm{Norm}\left(\left|\frac{\partial \mathcal{L}_{t,\tau}}{\partial \theta_i}\right|\right)\right), \qquad (1)$$

$$\mathcal{L}_{t,\tau} = \begin{cases} \mathcal{L}\left(M_\tau(X_t),\, Y_t\right) & (\tau = t) \\ \mathrm{Sum}\left(M_\tau(X_t)\right) & (\tau < t) \end{cases}, \qquad (2)$$

$$\mathrm{Norm}(x) = \frac{x - \mathrm{Mean}(x)}{\sqrt{\mathrm{Var}(x)}}, \qquad (3)$$

$$\gamma^{t}_i = \max\left(\gamma^{t,1}_i, \ldots, \gamma^{t,t}_i\right), \qquad (4)$$

where $\max(\cdot)$ and $\mathcal{L}$ denote the element-wise maximum and a loss function, respectively. Equation (1) normalizes the gradients within the same layer to avoid the discrepancies caused by large differences in gradient magnitudes across layers. For the current task's head (i.e., $\tau = t$), a normal loss function (e.g., cross entropy) is used as $\mathcal{L}_{t,t}$ in Equation (2). However, for each previous task's head (i.e., $\tau < t$), the sum of its logits, $\mathrm{Sum}(M_\tau(X_t))$, is used instead, treating task $t$'s data as unlabeled data for task $\tau$.

To quantify how often the importance from CHI actually changes the importance used for soft-masking, we measure the following four statistics:

(1) Overwrite Frequency at each task (F-each): how often does the importance from CHI overwrite the importance from the current task's head through the maximum operation? It corresponds to cases where $\gamma^{t,\tau}_i > \gamma^{t,t}_i$ for some $\tau$ ($1 \le \tau < t$) in Equation (4).

(2) Overwrite Gap at each task (G-each): when the cases of F-each happen, how large is the difference of the overwriting on average? It is defined as the average of $\max\left(\gamma^{t,1}_i, \ldots, \gamma^{t,t-1}_i\right) - \gamma^{t,t}_i$.

(3) Overwrite Frequency in total (F-total): how often does the importance from CHI actually overwrite the accumulated importance through the maximum operation? It corresponds to cases where $\gamma^{t,\tau}_i > \bar{\gamma}^{t-1}_i$ for some $\tau$ ($1 \le \tau < t$) in Equation (5).

(4) Overwrite Gap in total (G-total): when the cases of F-total happen, how large is the difference of the overwriting on average? It is defined as the average of $\max\left(\gamma^{t,1}_i, \ldots, \gamma^{t,t-1}_i\right) - \bar{\gamma}^{t-1}_i$.

The results are presented in Table 9. We can clearly observe that CHI adds more importance to some parameters (e.g., in C-10, about 15-42% of the parameters constantly have their accumulated importance updated by the importance from CHI), which is denoted by F-total. Since we introduced CHI to further mitigate forgetting by accumulating more importance, this expectation is consistent with the observed results. Although CHI overwrites the accumulated importance in similar tasks as frequently as in dissimilar tasks (see F-each and F-total), it does so with a smaller gap overall (see G-each and G-total), which is reasonable because the tasks are similar and their parameters can therefore have similar gradients across different tasks.
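For reference, the four statistics above could be computed from per-layer importance arrays as in the following sketch (illustrative only; the array layout and the function name are assumptions, not the authors' evaluation code).

```python
import numpy as np

def chi_overwrite_stats(gamma, gamma_bar_prev):
    """gamma: array of shape (t, P), where gamma[k] is the importance of the
    P parameters of a layer computed with head k+1 after training task t
    (the last row is the current head); requires t >= 2. gamma_bar_prev:
    shape (P,), the importance accumulated up to task t-1.
    Returns (F-each, G-each, F-total, G-total)."""
    chi_max = gamma[:-1].max(axis=0)   # max importance over previous heads (CHI)
    current = gamma[-1]                # importance from the current head

    each = chi_max > current           # CHI wins the maximum in Equation (4)
    total = chi_max > gamma_bar_prev   # CHI raises the accumulation in Equation (5)

    f_each = each.mean()
    g_each = (chi_max - current)[each].mean() if each.any() else 0.0
    f_total = total.mean()
    g_total = (chi_max - gamma_bar_prev)[total].mean() if total.any() else 0.0
    return f_each, g_each, f_total, g_total
```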
## D. Representation Learning in Continual Learning

All results (i.e., the pairs of a CL and a non-CL dataset: C-10/Tiny-ImageNet, I-100/CIFAR100, and T-10/CIFAR100) are presented in Figure 6. We can see that SPG learns better representations in continual learning than the baselines in all cases.

## E. Network Size

The number of learnable parameters of each system is presented in Table 10. Note that all approaches adopt AlexNet as their backbone; the number of parameters varies depending on their additional structures, such as attention mechanisms or sub-modules. It also depends on the dataset, because each dataset has a different number of tasks and, in TIL, each task has its own classification head whose number of units depends on the number of classes in that task. It can be seen that CAT and SupSup need more parameters than SPG and the other approaches.

Table 10. The number of learnable parameters of each model. M means a million (1,000,000). C-10 through I-100 are dissimilar-task sequences; FC-10 through FE-20 are similar-task sequences.

| Model | C-10 | C-20 | T-10 | T-20 | I-100 | FC-10 | FC-20 | FE-10 | FE-20 |
|---|---|---|---|---|---|---|---|---|---|
| (MTL) | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| (ONE) | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| NCL | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| A-GEM | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| PGN | 6.7M | 6.7M | 6.7M | 6.7M | 8.3M | 6.0M | 6.6M | 7.5M | 8.9M |
| PathNet | 6.6M | 6.8M | 6.7M | 6.6M | 8.4M | 6.4M | 6.4M | 7.8M | 8.7M |
| HAT | 6.8M | 6.8M | 7.0M | 7.0M | 9.0M | 6.6M | 6.7M | 7.8M | 9.1M |
| CAT | 39.5M | 39.7M | 40.8M | 40.9M | N/A | 38.5M | 38.9M | 46.2M | 55.1M |
| SupSup | 65.2M | 130.2M | 65.4M | 130.4M | 652.0M | 65.0M | 130.1M | 65.8M | 131.7M |
| UCL | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| SI | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| TAG | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| WSN | 6.7M | 6.7M | 6.8M | 6.8M | 8.5M | 6.5M | 6.5M | 7.7M | 8.9M |
| EWC | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| EWC-GI | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| SPG-FI | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |
| SPG | 6.7M | 6.7M | 6.9M | 6.9M | 8.6M | 6.5M | 6.6M | 7.7M | 9.0M |