# Multi-Domain Multi-Task Rehearsal for Lifelong Learning

Fan Lyu1, Shuai Wang1, Wei Feng1*, Zihan Ye2, Fuyuan Hu2 and Song Wang1,3
1College of Intelligence and Computing, Tianjin University
2School of Electronic & Information Engineering, Suzhou University of Science and Technology
3Department of Computer Science and Engineering, University of South Carolina
{fanlyu, wangshuai201909, wfeng}@tju.edu.cn, {zihanye@post, fuyuanhu@mail}.usts.edu.cn, songwang@cec.sc.edu

*Corresponding Author. Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Rehearsal, which reminds the model of old knowledge by storing it during lifelong learning, is one of the most effective ways to mitigate catastrophic forgetting, i.e., the biased forgetting of previous knowledge when moving to new tasks. However, in most previous rehearsal-based methods the old tasks suffer from unpredictable domain shift when training the new task. This is because these methods ignore two significant factors. First, the Data Imbalance between the new task and the old tasks makes the domains of the old tasks prone to shift. Second, the Task Isolation among all tasks drives the domain shift in unpredictable directions. To address the unpredictable domain shift, in this paper we propose Multi-Domain Multi-Task (MDMT) rehearsal, which trains the old tasks and the new task in parallel and equally to break the isolation among tasks. Specifically, a two-level angular margin loss is proposed to encourage intra-class/task compactness and inter-class/task discrepancy, which keeps the model from domain chaos. In addition, to further address the domain shift of the old tasks, we propose an optional episodic distillation loss on the memory to anchor the knowledge of each old task. Experiments on benchmark datasets validate that the proposed approach can effectively mitigate the unpredictable domain shift.

Figure 1: (a) Traditional rehearsal-based methods construct a single-task learning architecture for the new task (data from training set D) and treat the old tasks (data from memory M) as constraints on its training. (b) The proposed MDMT rehearsal-based method trains the old tasks and the new task equally and keeps tasks from isolation via the TAM loss.

## Introduction

Lifelong learning, also known as continual learning and incremental learning, aims to continually learn new knowledge from a sequence of tasks over a lifelong time. In contrast to traditional supervised learning, the lifelong setting brings machine learning closer to realistic human learning, in which a new skill is acquired quickly from new training data. All the while, catastrophic forgetting (French 1999; Kirkpatrick et al. 2017) is the main challenge for lifelong learning, which happens when the learner forgets the knowledge of old tasks while learning a new task. To seek a balance between the old tasks and the new task, many methods have been proposed to handle catastrophic forgetting in recent years. Following (De Lange et al. 2019), these methods can be categorized into Rehearsal (Lopez-Paz and Ranzato 2017; Chaudhry et al. 2018b; Guo et al. 2019), Regularization (Li and Hoiem 2016; Chaudhry et al. 2018a; Dhar et al. 2019) and Parameter Isolation (Mallya, Davis, and
Lazebnik 2018; Yoon et al. 2017). Regularization-based and parameter isolation-based methods store no data from old tasks and rely heavily on extra regularizers or architectures, resulting in lower performance than rehearsal-based methods. Rehearsal-based methods store a small number of samples from the training set, and the model retrains the saved data when training the new task to avoid forgetting. At each step of lifelong learning (see Fig. 1(a)), most existing rehearsal-based methods (Rebuffi et al. 2017; Lopez-Paz and Ranzato 2017; Chaudhry et al. 2018b; Guo et al. 2019) focus on training the new task while treating the stored data from old tasks as constraints to preserve their performance. However, the old tasks in these methods may suffer from unpredictable domain shift, which arises from two significant factors in the lifelong learning process: 1) The Data Imbalance between the old and new tasks. The shrinkage of the training data of the old tasks makes their domains prone to shift, which manifests as catastrophic forgetting. 2) The Task Isolation among all tasks (old and new), which pushes the domain shift in unpredictable directions and weakens the boundary between any two tasks.

To address the unpredictable domain shift, in this paper we propose a Multi-Domain Multi-Task (MDMT) Rehearsal method inspired by multi-domain multi-task learning (Yang and Hospedales 2014), which considers multiple tasks w.r.t. multiple domains and trains them equally. Specifically, as shown in Fig. 1(b), we first retrain the old tasks in parallel with training the new task rather than setting them as constraints. We separate all these tasks by a Cross-Domain Softmax, which extends the softmax of each isolated task by combining the logits of all other seen tasks and separates them from each other. Then, to further alleviate the unpredictable domain shift, we propose a Two-level Angular Margin (TAM) loss to encourage intra-class/task compactness and inter-class/task discrepancy on the basis of the Cross-Domain Softmax. In addition, we present an optional Episodic Distillation (ED) loss on all memory buffers for the old tasks, which suppresses the domain shift by storing the latent representation of each sample in the memories. We evaluate our MDMT rehearsal on four popular lifelong learning datasets for image classification and achieve new state-of-the-art performance. The experimental results show that the proposed MDMT rehearsal can significantly mitigate the unpredictable domain shift.

Our contributions are three-fold: (1) We propose a Multi-Domain Multi-Task Rehearsal method for lifelong learning, which trains the old and new tasks in parallel and equally and separates them by a Cross-Domain Softmax function. (2) We propose a Two-level Angular Margin (TAM) loss for lifelong learning to further boost the Cross-Domain Softmax for the sake of intra-class/task compactness and inter-class/task discrepancy. (3) We build an optional Episodic Distillation loss to reduce the domain shift during the lifelong process.

## Related Work

### Lifelong Learning

In contrast to static machine learning (He et al. 2016; Deng et al. 2018; Lyu et al. 2019; Lyu, Feng, and Wang 2020), lifelong learning (Ring 1998; Thrun 1998) seeks to improve the self-learning ability of a machine that continually learns new knowledge. The previous solutions to catastrophic forgetting (French 1999; Kirkpatrick et al.
2017) in recent years can be categorized into regularization-based, parameter isolation-based and rehearsal-based methods (De Lange et al. 2019). Regularization-based methods (Li and Hoiem 2016; Chaudhry et al. 2018a; Dhar et al. 2019) store no data but explore extra regularization terms in the loss function to consolidate previous knowledge. Parameter isolation-based methods (Mallya, Davis, and Lazebnik 2018; Yoon et al. 2017) freeze the task-specific parameters and grow new branches for new tasks to bring in new knowledge. Rehearsal-based methods store some knowledge of the old tasks to remind the model and often achieve better performance. Existing methods can be categorized into three groups: 1) by saving raw data (Rehearsal, e.g., images) (Rebuffi et al. 2017; Lopez-Paz and Ranzato 2017; Chaudhry et al. 2018b; Guo et al. 2019), the model can retrain the saved data along with the current training; 2) by saving the latent features of selected samples (Latent rehearsal) (Pellegrini et al. 2019), the model slows down learning at the layers below the rehearsal layer and leaves the layers above free to learn at full pace; 3) by building a generative model to synthesize data (Pseudo-rehearsal) (Shen et al. 2020; van de Ven and Tolias 2018; Lesort et al. 2019), the knowledge can be saved as parameters rather than data. In this paper, we only consider native rehearsal, which stores raw data for image classification.

### Multi-domain Multi-task Learning

Multi-domain learning (Nam and Han 2016; Tang and Jia 2020) refers to sharing information about the same problem across different contextual domains, while multi-task learning (Lin et al. 2019; Sener and Koltun 2018) addresses sharing information about different problems in the same domain. By considering both multiple domains and multiple tasks, Multi-domain multi-task (MDMT) learning was first proposed in (Yang and Hospedales 2014) and has been applied to classification (Peng and Dredze 2016) and semantic segmentation (Fourure et al. 2017), etc. The common solution to the MDMT problem is to construct parallel data streams and to build correlations among tasks. Here we explain why we formulate the lifelong learning problem as an MDMT learning problem. 1) By storing some samples of each task in memory, MDMT learning can train all tasks jointly, which helps mitigate the task isolation of traditional rehearsal-based lifelong learning. 2) MDMT learning can help suppress the domain shift to some extent by making the classifiers perceive each other.

### Margin Loss and Distillation Loss

Margin-based softmax explicitly adds a margin to each logit to improve feature discrimination. L-Softmax (Liu et al. 2016) and SphereFace (Liu et al. 2017) add a multiplicative angular margin to squeeze each class. CosFace (Wang et al. 2018b,a) and ArcFace (Deng et al. 2019) add an additive cosine margin and an additive angular margin, respectively, for easier optimization. Based on ArcFace, we propose a Two-level Angular Margin loss to guarantee both intra-class/task compactness and inter-class/task discrepancy. Knowledge distillation (Hinton, Vinyals, and Dean 2015) transfers the knowledge about the smoothed probability distribution of the output layer of a teacher network to a student network. Inspired by this, we propose to build a distillation loss between the old and new models on the old tasks by storing the latent representations of the stored data.
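For readers unfamiliar with this mechanism, the following is a minimal sketch (ours, not from the cited papers) of the standard soft-target distillation objective of (Hinton, Vinyals, and Dean 2015); the function name and the temperature value are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then match the student's
    # log-probabilities to the teacher's probabilities via KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
```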
## Methodology

### Multi-domain Multi-task Rehearsal

Suppose there are T different tasks with respect to datasets $\{\mathcal{D}_1, \dots, \mathcal{D}_T\}$. For the t-th dataset (task), $\mathcal{D}_t = \{(x_{t,1}, y_{t,1}), \dots, (x_{t,N_t}, y_{t,N_t})\}$, where $x_{t,i} \in \mathcal{X}_t$ is the i-th input, $y_{t,i} \in \mathcal{Y}_t$ is the corresponding label, and $N_t$ is the number of samples. $\mathcal{D}_t$ can be split into a training set $\mathcal{D}^{trn}_t$ and a testing set $\mathcal{D}^{tst}_t$; for simplicity, we write $\mathcal{D}_t$ for $\mathcal{D}^{trn}_t$ in the following. Lifelong learning aims at learning a predictor $f_t: \mathcal{X}_k \to \mathcal{Y}_k$, $k \in \{1, \dots, t\}$, which can predict any task that has been learned at any time.

Figure 2: Training procedure of the proposed MDMT rehearsal based lifelong learning. At each step, a small number of samples is saved into memory $\mathcal{M}$ and the corresponding latent representations are saved into $\mathcal{F}$. The TAM loss guarantees the intra-class/task compactness and inter-class/task discrepancy. The Episodic Distillation loss further helps to reduce the domain shift of the old tasks. The dashed elements denote optional operations.

Rehearsal-based lifelong learning (Rebuffi et al. 2017; Lopez-Paz and Ranzato 2017; Riemer et al. 2018; Chaudhry et al. 2018b; Guo et al. 2019) builds a small memory buffer $\mathcal{M}_k \subset \mathcal{D}_k$ for each previous task k, i.e., $|\mathcal{M}_k| \ll |\mathcal{D}_k|$. Following (Lopez-Paz and Ranzato 2017), when training a task $t \in \{1, \dots, T\}$ with memories $\mathcal{M}_k$ for all $k < t$, rehearsal-based lifelong learning can be modeled as a single-objective optimization problem:

$$\arg\min_{\theta,\theta_t} \ell(f_\theta, f_{\theta_t}, \mathcal{D}_t), \quad \text{s.t. } \ell(f_\theta, f_{\theta_k}, \mathcal{M}_k) \le \ell(f^{t-1}_\theta, f^{t-1}_{\theta_k}, \mathcal{M}_k), \; \forall k < t, \tag{1}$$

where $\ell$ is the empirical loss, $\theta$ is the parameter shared across all tasks, and $\theta_k$ and $\theta_t$ are the task-specific parameters. The constraints are designed to prevent performance degradation on the previous tasks. The problem then reduces to finding an optimal gradient that benefits all tasks. To inspect the increase in the old tasks' loss, (Lopez-Paz and Ranzato 2017; Chaudhry et al. 2018b; Guo et al. 2019) compute the angle between the gradient of each old task and the proposed gradient update on the current task. However, such single-objective optimization on the current task overemphasizes the new task while ignoring the differences among tasks. In other words, the old tasks can only play the role of source domains to be transferred into the current training model. The domains of the old tasks will shift significantly under the rectified gradient, since the gradient norm of the new task is much larger than that of the old tasks, which may induce domain overlap. In contrast, this paper treats the problem as a Multi-Domain Multi-Task (MDMT) learning problem to jointly and equally improve the current task as well as the old tasks:

$$\arg_{\theta,\{\theta_1,\dots,\theta_t\}} \big\{\min \ell(f_\theta, f_{\theta_t}, \mathcal{D}_t),\; \min \ell(f_\theta, f_{\theta_k}, \mathcal{M}_k),\; \dots,\; \min \ell(f_\theta, f_{\theta_1}, \mathcal{M}_1)\big\}, \quad \text{s.t. } d(f_i, f_j) \ge d(f^{t-1}_i, f^{t-1}_j), \; \forall i, j \in [1, t],\; i \ne j, \tag{2}$$

where $f_i = f_\theta(\mathcal{D}_i)$ if $i = t$ and $f_i = f_\theta(\mathcal{M}_i)$ if $i < t$, and $d$ denotes the distance between two domains. For the t tasks w.r.t. datasets $\{\mathcal{D}_1, \dots, \mathcal{D}_t\}$, an MDMT rehearsal model trains the t tasks in parallel and equally. The constraints mean that the domain distance between any two tasks should not become smaller than in the model trained on the last task. Note that, as in common lifelong learning, we only consider the situation where the tasks are irrelevant to each other. We make two key operations to solve Eq. (2) efficiently.
First, we transform the multi-objective optimization into a single-objective optimization problem by ensembling all the objectives, as in the traditional solution to multi-task learning (Lin et al. 2019; Sener and Koltun 2018):

$$\arg\min_{\theta} \; \ell(f_\theta, f_{\theta_t}, \mathcal{D}_t) + \sum_{k=1}^{t-1} \ell(f_\theta, f_{\theta_k}, \mathcal{M}_k). \tag{3}$$

Second, directly calculating the distance between any two domains and storing the old predictors $f^{t-1}_\theta$ would be memory-costly, but we can enforce the separation in a simple yet effective way by extending the softmax function of each task $k$ as

$$-\frac{1}{N_k}\sum_{n=1}^{N_k} \log \frac{e^{(W^k_{y_n})^\top x_n + b_{y_n}}}{\sum_{j=1}^{C_k} e^{(W^k_j)^\top x_n + b_j} + \sum_{i \ne k}\sum_{j=1}^{C_i} e^{(W^i_j)^\top x_n + b_j}}. \tag{5}$$

Here $N_k$ is the batch size for task k, $W^k_j \in \mathbb{R}^d$ denotes the j-th column of the weight $W^k \in \mathbb{R}^{d \times C_k}$ in the last fully-connected layer for task k, and $C_k$ is the number of classes. We name this extension Cross-Domain Softmax (CDS). It combines the logits from the other classifiers and resembles a native softmax for a classification problem with $\sum_{k=1}^{t} C_k$ classes in total. Here we discuss the difference. In MDMT rehearsal, different tasks never share the same classifier as in common classification, i.e., the classifiers of different tasks lack mutual perception. By combining the logits from the other tasks, the tasks can perceive and separate from each other. Previous methods update the model by an optimal gradient that relies heavily on the angle between the gradients of the old and new tasks. In contrast, we directly obtain the hybrid gradient for the shared layers by ensembling the gradients of the new task and the old tasks as $g \leftarrow \sum_{k=1}^{t} g_k$.

Algorithm 1: MDMT rehearsal based lifelong learning

    Procedure TRAIN(f_θ, f_{θ_{1:T}}, {D_1^trn, ..., D_T^trn})
        M, F ← {}, {}
        for t = 1 to T do
            for (x, y) in D_t^trn do
                g, g_t ← ∇ ℓ(f_θ(x, t), y)
                if t > 1 then
                    g_ref, g_{1:t-1} ← ∇ ℓ(f_θ, f_{θ_{1:t-1}}, M)
                    g_ref ← g_ref + ∇_θ ℓ(f_θ, F)
                    g ← g + g_ref
                end if
                θ ← θ − StepSize · g
                θ_{1:t} ← θ_{1:t} − StepSize · g_{1:t}
            end for
            M, F ← STOREMEM(M, F, D_t^trn, f_θ)
        end for

    Procedure STOREMEM(M, F, D, f)
        for i = 1 to |M|/T do
            sample (x, y) from D
            M ← M + (x, y)
            F ← F + f(x)
        end for
        Return M, F

    Procedure EVAL(f_θ, f_{θ_{1:T}}, {D_1^tst, ..., D_T^tst})
        a ← 0 ∈ R^T
        for t = 1 to T do
            a_t ← 0
            for (x, y) in D_t^tst do
                a_t ← a_t + Accuracy(f_{θ_t}(f_θ(x, t)), y)
            end for
            a_t ← a_t / |D_t^tst|
        end for
        Return a

We compare our MDMT rehearsal with several well-known rehearsal-based lifelong learning works. iCaRL (Rebuffi et al. 2017) saves a small number of samples to keep the model from forgetting old classes, but it classifies samples by the nearest prototype, which is not suitable for task-incremental lifelong learning because the task-specific parameters are ignored. GEM/A-GEM (Lopez-Paz and Ranzato 2017; Chaudhry et al. 2018b) address forgetting by finding the optimal gradient that keeps the old tasks from being corrupted, but they focus on training the new task with single-objective optimization and ignore the domain shift of the old tasks. ER (Chaudhry et al. 2019a) extends Experience Replay (Rolnick et al. 2019), originally for reinforcement lifelong learning, and is shown to outperform A-GEM; however, it never considers the relations among all tasks, so the domains of the old tasks may still shift significantly. PRD (Hou et al. 2018) treats lifelong learning as a multi-task learning problem and builds a distillation module with one saved CNN expert as a teacher for each old task. Differently, we build an MDMT rehearsal that leverages the expanded softmax without saving many extra models.
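To make the Cross-Domain Softmax of Eq. (5) concrete, here is a minimal PyTorch-style sketch (our illustration, not the authors' released code): each task owns its own linear classifier head, and the loss for a batch of task k normalizes the target logit against the pooled logits of all seen tasks. The names `heads` and `cross_domain_softmax_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_domain_softmax_loss(feature, target, task_id, heads):
    """Eq. (5)-style loss: the denominator pools the logits of every seen task's
    classifier, so the tasks can perceive and separate from each other.

    feature: (B, d) shared-layer features for a batch of task `task_id`
    target:  (B,)   class labels within task `task_id`
    heads:   list of nn.Linear classifiers, one per seen task
    """
    own_logits = heads[task_id](feature)                              # (B, C_k)
    other_logits = [h(feature) for i, h in enumerate(heads) if i != task_id]
    all_logits = torch.cat([own_logits] + other_logits, dim=1)        # (B, sum_k C_k)
    # Placing the current task's logits first keeps `target` a valid column index,
    # so a standard cross-entropy over the concatenation realizes Eq. (5).
    return F.cross_entropy(all_logits, target)
```

In this sketch, on the new task the loss is evaluated on the training batch, while for each old task k it is evaluated on a batch drawn from its memory $\mathcal{M}_k$; summing the resulting losses before backpropagation yields the hybrid shared-layer gradient $g \leftarrow \sum_k g_k$ described above.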
Figure 3: On Permuted MNIST, (a) the change of the angle range between the features and the target weight center of task 1 along the lifelong learning process; (b) the angular relations among the class centers of tasks 1, 9 and 17 after training on task 17.

### Two-level Angular Margin Loss

The proposed MDMT rehearsal helps to jointly and equally train the new task and retrain the old tasks, making all tasks perceive each other. Nonetheless, the softmax loss is not efficient enough because it does not explicitly encourage intra-class compactness and inter-class discrepancy. To cope with this, large-margin softmax is widely used in recent discriminative problems (Deng et al. 2019; Liu et al. 2016). However, these methods cannot be directly applied to MDMT rehearsal based lifelong learning because they place the large margin within a single task only and cannot handle the multi-task scenario. In this paper, we propose a two-level margin, i.e., a class-level and a task-level margin, on the softmax of each task (Eq. (4)). Our work is based on the popular large-margin softmax method ArcFace (Deng et al. 2019), where the margin is added to the angle between weight and feature, which has been proven effective and efficient. Specifically, ArcFace removes the bias and transforms the logit fed into the softmax as $W_j^\top x_i = \|W_j\|\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$; then an angular margin $m$ is placed between different classes:

$$-\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1, j \ne y_i}^{n} e^{s\cos\theta_j}}, \tag{6}$$

where each weight $\|W_j\|$ is fixed to 1 by $\ell_2$ normalization and the embedding feature $\|x_i\|$ is fixed to $s$ by $\ell_2$ normalization and rescaling. The normalization of features and weights makes the prediction depend only on the angle between them. Such a geodesic-distance margin between samples and centers gives the prediction more intra-class compactness and inter-class discrepancy. Based on Eq. (6), we propose our Two-level Angular Margin (TAM) loss for task $k \in [1, t]$:

$$-\frac{1}{N_k}\sum_{n=1}^{N_k} \log \frac{e^{s\cos((\theta^k_{y_n}+m_c)+m_t)}}{\sigma_n}, \quad \sigma_n = e^{s\cos((\theta^k_{y_n}+m_c)+m_t)} + \sum_{j=1, j \ne y_n}^{C_k} e^{s\cos(\theta^k_j+m_t)} + \sum_{i \ne k}\sum_{j=1}^{C_i} e^{s\cos\theta^i_j}. \tag{7}$$

In Eq. (7), we add a class-level margin $m_c$ and a task-level margin $m_t$ to the angle. $m_c$ is similar to $m$ in Eq. (6) and controls the intra-task class compactness and discrepancy (Deng et al. 2019). $m_t$ controls the task compactness and discrepancy, which ensures that the knowledge of each task does not mix up with that of the others. As shown in Fig. 3, the proposed TAM loss brings two advantages to MDMT rehearsal based lifelong learning. First, TAM helps the model to better discriminate within a task: although CDS already yields a small angle between a feature and its target weight center, the TAM loss reduces this angle even further, which reflects the effect of $m_c$. Second, the TAM loss mitigates the domain overlap caused by the domain shift by forcing the tasks to separate: looking at the angles among the weight centers, the TAM loss significantly separates the old and new tasks, which reflects the effect of $m_t$. However, it is still difficult to eliminate the domain shift entirely because of the extreme data imbalance between the old tasks and the new task.
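As a rough PyTorch-style sketch of the TAM loss in Eq. (7) (ours, with hypothetical names and placeholder margin values, not the authors' code): the class-level margin $m_c$ is added only to the target angle, the task-level margin $m_t$ is added to every angle of the current task, and the logits of the other tasks enter the denominator without margins.

```python
import torch
import torch.nn.functional as F

def tam_loss(feature, target, task_id, weights, s=30.0, m_c=0.5, m_t=0.1):
    """Two-level Angular Margin loss following Eq. (7).

    feature: (B, d) features; weights: list of per-task weight matrices (C_i, d).
    s, m_c, m_t: scale, class-level margin and task-level margin (placeholder values).
    """
    x = F.normalize(feature, dim=1)                                   # ||x|| = 1, rescaled by s below
    w_own = F.normalize(weights[task_id], dim=1)                      # ||W_j|| = 1
    cos_own = x @ w_own.t()                                           # cos(theta_j^k), shape (B, C_k)
    theta = torch.acos(cos_own.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(target, num_classes=cos_own.size(1)).bool()
    # The target class gets m_c + m_t; the other classes of task k get only m_t.
    theta_margin = torch.where(one_hot, theta + m_c + m_t, theta + m_t)
    own_logits = s * torch.cos(theta_margin)
    # Other tasks contribute plain cosine logits (no margin), as in Eq. (7).
    other_logits = [s * (x @ F.normalize(w, dim=1).t())
                    for i, w in enumerate(weights) if i != task_id]
    all_logits = torch.cat([own_logits] + other_logits, dim=1)
    return F.cross_entropy(all_logits, target)
```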
### Episodic Distillation

To further mitigate the domain shift of the old tasks, we therefore construct an optional, simple yet effective Episodic Distillation (ED) loss for the MDMT rehearsal based lifelong process. The main role of the ED loss is to reduce the change of the feature distribution along the lifelong process as far as possible. First, apart from the sampled training data stored in memory, i.e., $\mathcal{M}_k = \{(x_{k,1}, y_{k,1}), \dots, (x_{k,|\mathcal{M}_k|}, y_{k,|\mathcal{M}_k|})\} \subset \mathcal{D}_k$, we also store the corresponding latent representations when the samples are first trained, denoted as $\mathcal{F}_k = \{f_{k,1}, \dots, f_{k,|\mathcal{M}_k|}\}$. Then, we train the model with an updated objective:

$$\arg\min_{\theta} \; \ell(f_\theta, f_{\theta_t}, \mathcal{D}_t) + \sum_{k=1}^{t-1} \big[ \ell(f_\theta, f_{\theta_k}, \mathcal{M}_k) + \ell(f_\theta, \mathcal{F}_k) \big], \tag{9}$$

$$\ell(f_\theta, \mathcal{F}_k) = \frac{1}{|\mathcal{M}_k|} \sum_i \ell_i(f_\theta(x_{k,i}), f_{k,i}). \tag{10}$$

Here $\ell_i$ is the ED loss, which can take many forms; we choose the Mean Squared Error (MSE). By training with Eq. (9) at each step, we can ease the shift effectively. The ED loss is optional and requires extra memory buffers to save the latent representation of each sample in the memories. These extra buffers do increase the memory cost to some extent, but the cost is still very small compared with the whole training set. In our implementation, we save the representation from the fully-connected layer before the last one, which is a vector of length 256 to 2048 depending on the network. Hence, the cost of the representation memory is even smaller than that of the data memory.

Figure 4: Average accuracy trend (from $A_1$ to $A_T$) on four datasets in the lifelong process: (a) Permuted MNIST, (b) Split CIFAR, (c) Split CUB, (d) Split AWA. Compared methods: VAN, MAS, EWC, GEM, A-GEM, MEGA, PROG-NN and MDMT-R.

### Total Algorithm

We follow A-GEM (Chaudhry et al. 2018b) and unite the memories of all old tasks for efficient training. Let $\mathcal{M} = \bigcup_{k<t} \mathcal{M}_k$
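Putting the pieces together, the following is a minimal sketch (ours, under the stated assumptions) of one MDMT-R training step: the new-task batch and batches drawn from the united memory are trained jointly and equally, and the stored latent representations anchor the old tasks through the episodic distillation term of Eq. (10). `task_loss_fn` stands for the CDS or TAM loss sketched earlier; all names are hypothetical.

```python
import torch.nn.functional as F

def mdmt_train_step(shared_net, heads, optimizer, new_batch, task_id, memory, task_loss_fn):
    """One MDMT rehearsal step implementing the objective of Eq. (9).

    memory: dict mapping an old task k to (x_k, y_k, feat_k), where feat_k are the
            latent representations saved when the samples first entered the memory.
    """
    optimizer.zero_grad()
    x, y = new_batch
    loss = task_loss_fn(shared_net(x), y, task_id, heads)         # new-task objective
    for k, (mx, my, mfeat) in memory.items():                     # old tasks, trained equally
        feat = shared_net(mx)
        loss = loss + task_loss_fn(feat, my, k, heads)            # rehearsal objective
        loss = loss + F.mse_loss(feat, mfeat)                     # Episodic Distillation, Eq. (10)
    loss.backward()                                               # hybrid gradient g = sum_k g_k
    optimizer.step()
    return loss.item()
```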