# continual_learning_with_recursive_gradient_optimization__c7fc40c1.pdf

Published as a conference paper at ICLR 2022

CONTINUAL LEARNING WITH RECURSIVE GRADIENT OPTIMIZATION

Hao Liu
Department of Computer Science, Tsinghua University, Beijing, China
hao-liu20@mails.tsinghua.edu.cn

Huaping Liu
Department of Computer Science, Tsinghua University, Beijing, China
hpliu@tsinghua.edu.cn

ABSTRACT

Learning multiple tasks sequentially without forgetting previous knowledge, called Continual Learning (CL), remains a long-standing challenge for neural networks. Most existing methods rely on additional network capacity or data replay. In contrast, we introduce a novel approach which we refer to as Recursive Gradient Optimization (RGO). RGO is composed of an iteratively updated optimizer that modifies the gradient to minimize forgetting without data replay, and a virtual Feature Encoding Layer (FEL) that represents different network structures with only task descriptors. Experiments demonstrate that RGO performs significantly better than the baselines on popular continual classification benchmarks and achieves new state-of-the-art performance on 20-split-CIFAR100 (82.22%) and 20-split-miniImageNet (72.63%). Achieving higher average accuracy than Single-Task Learning (STL), this method is a flexible and reliable way to provide continual learning capability to learning models that rely on gradient descent.

1 INTRODUCTION

In many application scenarios, one needs to learn a sequence of tasks without access to historical data; this setting is called continual learning. Although variants of stochastic gradient descent (SGD) have contributed significantly to the progress of neural networks in many fields, these optimizers require the mini-batches of data to satisfy the independent and identically distributed (i.i.d.) assumption. In continual learning, the violation of this requirement leads to significant degradation of performance on previous tasks, called catastrophic forgetting.

Recent works attempt to tackle this issue by modifying the training process from a variety of perspectives. Memory-based approaches use extra memory to store samples (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2020), gradients (Chaudhry et al., 2019a; 2020; Saha et al., 2021), or generative models of past data (Shin et al., 2017; Shen et al., 2020) to modify the future training process; the replay memory leads to a space complexity that grows linearly with the number of tasks. Expansion-based approaches select network parameters dynamically (Yoon et al., 2018; Rosenbaum et al., 2018; Serra et al., 2018; Kaushik et al., 2021), add components as new tasks arrive (Rusu et al., 2016; Fernando et al., 2017; Alet et al., 2018; Chang et al., 2019; Li et al., 2019), or use larger networks to generate network parameters (Aljundi et al., 2017; Yoon et al., 2019; von Oswald et al., 2019); these methods reduce interference between tasks through additional task-specific parameters. Single-Task Learning (STL), which trains a separate network for each task, can also be regarded as an expansion-based method. Regularization-based approaches encourage important parameters to stay in a close vicinity of previous solutions, either by introducing a quadratic penalty term to the loss function (Kirkpatrick et al., 2017; Zenke et al., 2017; Yin et al., 2020) or by constraining the direction of parameter updates (Farajtabar et al., 2019; Chaudhry et al., 2019a; Saha et al., 2021).
Our method is also regularization-based and combines the advantages of loss penalties and gradient constraints. In this work, we focus on continual learning in a fixed-capacity network without data replay. We aim to minimize the expected increase of the total loss of past tasks without reducing the performance of the current task. To this end, we propose an upper bound on a quadratic loss estimate and design a recursive optimization procedure that modifies the direction of the gradient toward the optimal solution under this upper bound. In addition, we introduce a trace normalization process that preserves the learning rate during training according to the current-task-first (CTF) principle. This normalization makes our approach compatible with the vast majority of existing models and learning strategies designed for single-task solutions. Since the gradient modification process is independent of data samples and previous parameters, our optimizer can be used directly in most deep architectures, just like typical single-task optimizers such as SGD.

Further, to reduce the interference between tasks, we develop a feature encoding strategy that represents the multi-modal structure of the network without additional parameters. A virtual feature encoding layer (FEL), which randomly permutes the output feature maps using the integer task descriptor as a seed, is attached after each real layer. Thus, each task obtains a specific virtual structure under the same network parameters. Since the parameter space of the network is not changed, such a strategy does not change the fitting ability of the neural network.

Experimental validation on several continual learning benchmarks shows that the proposed method has significantly less forgetting and higher accuracy than existing fixed-capacity baselines. We achieve state-of-the-art performance on 20-split-CIFAR100 (82.22%) and 20-split-miniImageNet (72.63%). In addition to minimizing forgetting, this method has comparable or better performance than single-task learning, which handles all tasks individually.

2 PRELIMINARIES

Consider $K$ sequentially arriving supervised learning tasks $\{T_k \mid k \in [K]\}$, where $[N] := \{1, 2, \ldots, N\}$ for any positive integer $N$. In each task $T_k$, there are $n_k$ data points $\{(x_{k,i}, y_{k,i}) \mid i \in [n_k]\}$ sampled from an unknown distribution $D_k$. Let $X$, $Y$, $W$ be the spaces of inputs, targets, and model parameters. Denoting the predictor as $f(x, k): X \times [K] \to Y$, the loss of $\theta \in W$ associated with a data point $(x, y)$ and task identifier $k$ can be expressed as $l(f(\theta; x, k), y): W \to \mathbb{R}$, and the empirical loss function of task $T_k$ is defined as:

$$L_k(\theta) := \frac{1}{n_k} \sum_{i=1}^{n_k} l(f(\theta; x_{k,i}, k), y_{k,i}) \qquad (1)$$

In the continual learning scenario studied in this work, the parameter space $W$ has a fixed size, and an integer task descriptor is provided at both training and testing time. Without access to past samples, we use a second-order Taylor expansion to estimate the loss function of previous tasks. Let $\theta_j^*$ be the optimal parameter of $L_j(\theta)$ produced by the gradient descent process, satisfying $\nabla_{\theta_j^*} L_j = 0$. For a new model parameter $\theta$ in the neighborhood of $\theta_j^*$, the loss of a previous task $T_j$ ($j < k$) can be estimated as:

$$L_j(\theta) = L_j(\theta_j^*) + \frac{1}{2}(\theta - \theta_j^*)^T H_j (\theta - \theta_j^*) \qquad (2)$$

where $H_j := \nabla^2 L_j(\theta_j^*)$ is the Hessian matrix.

3 PROBLEM FORMULATION & SOLUTION

In our fixed-capacity continual learning setting, the core goal is to find an appropriate joint solution that works well across the task sequence.
To this end, we introduce a novel continual learning optimization problem and a corresponding iterative optimization strategy.

3.1 OPTIMIZATION PROBLEM

In this paper, we formalize forgetting as the increment of the old task losses. As mentioned in Section 2, the total loss of the tasks before $T_k$, denoted $F_k$, can be estimated by:

$$F_k(\theta) = \sum_{j=1}^{k-1} \Big[ L_j(\theta_j^*) + \frac{1}{2}(\theta - \theta_j^*)^T H_j (\theta - \theta_j^*) \Big] \qquad (3)$$

Since Equation (3) needs the explicit values of previous model parameters, $F_k$ is too expensive to be an optimization target in continual learning. We turn to a more concise form, which we refer to as the recursive least loss (RLL):

$$F_k^{RLL}(\theta) := \frac{1}{2}(\theta - \theta_{k-1}^*)^T \Big( \sum_{j=1}^{k-1} H_j \Big) (\theta - \theta_{k-1}^*) \qquad (4)$$

In Appendix A.2, we prove that $F_k^{RLL}$ and $F_k$ are equivalent for optimization if all previous tasks are fully trained. Based on the conclusions above, the optimization problem during task $T_k$ is formalized as:

$$\theta_k^*: \ \min_\theta F_k^{RLL}(\theta), \quad \text{subject to} \ L_k(\theta) = 0 \qquad (5)$$

$F_k^{RLL}$ has the same form as the regularization term in many regularization-based methods. The optimization goals of these methods are variants of $L_k(\theta) + \lambda F_k^{RLL}(\theta)$, derived from Bayesian posterior approximation with a Gaussian prior (Kirkpatrick et al., 2017; Nguyen et al., 2018) or from an approximation of the KL-divergence used in natural gradient descent (Amari, 1998; Ritter et al., 2018; Tseran et al., 2018). The Bayesian methods try to estimate and minimize the overall loss function, while our method prioritizes the performance of the current task through $L_k(\theta) = 0$ and minimizes the expected forgetting of the past tasks $F_k^{RLL}(\theta)$.

3.2 GRADIENT MODIFICATION

For the newest task $T_k$, the optimal solution where $L_k(\theta) = 0$ is obtained by stochastic gradient descent starting from the previous optimal model parameter $\theta_{k-1}^*$ at the end of task $T_{k-1}$. Using the subscript $i$ to denote the parameters at step $i$, with initial state $\theta_0 = \theta_{k-1}^*$, the single-step update can be expressed as:

$$\theta_i = \theta_{i-1} - \eta_i \nabla L_k(\theta_{i-1})$$

Assuming that the pre-set learning rate $\eta_i$ is small enough to ignore higher-order terms, the loss after the one-step update is:

$$L_k(\theta_i) = L_k(\theta_{i-1}) - \eta_i (\nabla L_k(\theta_{i-1}))^T \nabla L_k(\theta_{i-1})$$

If we only wanted to solve task $T_k$, updating the parameters according to the gradient above would be enough. However, as mentioned above, such a method encourages the neural network to gradually forget the old tasks. Therefore, we modify the update direction to minimize the expected forgetting. To this end, we introduce a positive definite symmetric matrix $P$ with appropriate dimensions to modify the gradients ($g \to Pg$). The modified one-step update is:

$$\theta_i = \theta_{i-1} - \eta_i P \nabla L_k(\theta_{i-1}), \qquad L_k(\theta_i) = L_k(\theta_{i-1}) - \eta_i (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \qquad (6)$$

To maintain the pre-set learning rate throughout the continual learning problem and avoid repeated selection of hyper-parameters, we impose an additional constraint on the trace of the projection matrix and prove the corresponding convergence rate consistency result (Theorem 1) in Appendix A.1.

Theorem 1 (convergence rate consistency). Under the constraint $\mathrm{trace}(P) = \dim(P)$, the expected learning rate under an unknown isotropic gradient distribution is the same as that of the original optimizer.

For common network structures, we provide the space complexity of the matrix $P$ and the time complexity of the corresponding gradient modification process in Appendix B.2. As described above, the only extra memory required by our approach is the projection matrix $P$, which contains the information of previous tasks. This makes our approach space-invariant, in contrast to typical memory-based or expansion-based continual learning methods.
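To make the modified update in Equation (6) and the trace normalization of Theorem 1 concrete, here is a minimal NumPy sketch of a single step on a flattened parameter vector. The code and names are our own illustration under the stated setting, not the released implementation.

```python
import numpy as np

def rgo_step(theta, grad, P, lr):
    """One gradient step with a projection matrix P (Eq. 6).

    The projected gradient is rescaled so that the effective P satisfies
    trace(P) = dim(P), keeping the expected learning rate of the current
    task unchanged (Theorem 1).
    """
    scale = P.shape[0] / np.trace(P)   # enforce trace(P) = dim(P)
    g_hat = scale * (P @ grad)         # g -> P g, then normalize
    return theta - lr * g_hat

# Toy usage: with P = I this reduces to plain SGD.
dim = 4
theta = np.zeros(dim)
grad = np.random.randn(dim)
theta = rgo_step(theta, grad, np.eye(dim), lr=0.1)
```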
As the performance of our method is determined by the choice of $P$, the remaining question is: how do we find a good projection matrix? We answer this question in the following parts.

3.3 APPROXIMATE SOLUTION

The next step is to find a solution for problem (5). When the training process on $T_k$ is finished, the final state $\theta_k^*$ and the residual loss can be obtained by accumulating the one-step updates:

$$\theta_k^* = \theta_{k-1}^* - \sum_{i=1}^{n_k} \eta_i P \nabla L_k(\theta_{i-1}), \qquad L_k(\theta_k^*) = L_k(\theta_{k-1}^*) - \sum_{i=1}^{n_k} \eta_i (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \qquad (7)$$

In this way, for a given sample sequence and initial value $\theta_{k-1}^*$, the result $\theta_k^*$ depends on $P$. The optimization problem (5) on $\theta_k^*$ is thus transformed into an optimization problem on $P$. However, the relationship between $F_k^{RLL}(\theta_k^*)$ and $P$ is too complicated to be used directly in the optimization process. To tackle this problem, we propose an upper bound of $F_k^{RLL}(\theta_k^*)$ as a practical optimization target.

Theorem 2 (upper bound). Denoting by $\hat\sigma_m(\cdot)$ the maximum eigenvalue and by $\eta_m$ the maximum single-step learning rate, the recursive least loss has an upper bound:

$$F_k^{RLL}(\theta_k^*) \le \frac{1}{2} n_k \eta_m \hat\sigma_m(P\bar{H}) L_k(\theta_{k-1}^*) \qquad (8)$$

where $\bar{H} = \sum_{j=1}^{k-1} H_j$ is defined as the sum of the Hessian matrices of all old tasks.

We prove Theorem 2 in Appendix A.3. Discarding the constant terms, we get an alternative optimization problem for the projection matrix $P$:

$$\min_P \ \hat\sigma_m(P\bar{H}), \quad \text{subject to} \ \mathrm{trace}(P) = \dim(P) \qquad (9)$$

We provide a detailed solution of this problem in Appendix A.4. The normalized solution is:

$$P = \frac{\dim(\bar{H})}{\mathrm{trace}(\bar{H}^{-1})} \bar{H}^{-1} \qquad (10)$$

The normalized projection procedure can be described as finding a new gradient that has a similar effect on the current task while minimizing the upper bound on the old task losses. Our optimizer only modifies the direction of the gradient without reducing the search region, which keeps the fitting ability of the network consistent throughout the task sequence.

4 IMPLEMENTATION

The main concern with the method of Section 3 is its expensive space and time cost in deep neural networks. In this section, we propose two approaches that reduce the time and space complexity of the algorithm for models with forward-backward propagation structures.

4.1 VIRTUAL FEATURE ENCODING LAYER

In multi-layer networks, the output of the previous layer can be regarded as a set of features generated by feature extractors (for example, weight matrices, bias vectors, convolution kernels, etc.). In order to make the gradients generated in the back-propagation process conform to our assumption of an isotropic distribution (Theorem 1), we propose a virtual Feature Encoding Layer (FEL) that applies task-specific connections to the output of the previous layer and the input of the next layer.

Definition 4.1 (Feature Encoding Layer). A feature encoding layer applies a task-specific rearrangement to the input feature maps, the order of which is randomly generated using the task identifier as a seed.

Note that FEL is only a permutation of existing feature maps, and its order does not change during the training process. Although this feature encoding layer does not require extra space, we write it in matrix form for the convenience of theoretical analysis. The permutation matrix at layer $l$ ($l = 1, 2, \ldots, L$) is:

$$S_l(k) := \text{random permutation of } I_l \text{ with seed } k$$

where $I_l$ is an identity matrix with dimensions equal to the number of feature maps.
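As an illustration of Definition 4.1, the sketch below builds a task-seeded permutation and applies it to a batch of feature maps. The function name and shapes are our own simplification of the virtual FEL, not the authors' released code.

```python
import numpy as np

def feature_encoding_layer(features, task_id):
    """Virtual FEL: permute the feature-map (channel) axis with a fixed
    permutation generated deterministically from the task descriptor.
    `features` has shape (batch, channels); no parameters are added."""
    num_channels = features.shape[1]
    rng = np.random.RandomState(task_id)   # task identifier as seed
    perm = rng.permutation(num_channels)   # fixed order, i.e., S_l(k)
    return features[:, perm]

# The same layer output is routed differently for different tasks.
h = np.random.randn(8, 16)                 # batch of 8, 16 feature maps
h_task3 = feature_encoding_layer(h, task_id=3)
h_task7 = feature_encoding_layer(h, task_id=7)
```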
Considering that the order of the features is critical for the next layer to extract interpretable information, the recognition ability of the network degrades greatly without the correct feature order. FEL provides an effective way to eliminate the interference between tasks: features encoded with a specific task descriptor are randomly permuted in other tasks. Therefore, the same feature extractor plays different roles in different tasks. Although the gradients of different layers are strongly correlated in the current task, there is little correlation from the perspective of old tasks. This means that the current update's impact on past tasks can be regarded as an independent sum of the influences of the different local feature extractors.

4.2 LOCAL-GLOBAL EQUIVALENCE

During the backward propagation process of task $T_k$, the gradient of $L_k$ with respect to an intermediate layer $h_l$ can be calculated by the chain rule:

$$\frac{\partial L_k}{\partial h_l} = \frac{\partial L_k}{\partial h_L} \prod_{j=l}^{L-1} S_{j+1}(k) D_{j+1} W_{j+1} \qquad (11)$$

where $D_{j+1}$ is a diagonal matrix representing the derivative of the nonlinear activation function. Since previous samples cannot be accessed to recalculate the gradient, common gradient-based methods (Li et al., 2019; Azizan et al., 2019; Farajtabar et al., 2019) assume that the joint optimal parameter lies in the neighborhood of the previous optimal parameter and use previously calculated gradients as an approximation, which leads to $f(\theta; x) \approx f(\theta^*; x)$ and $\frac{\partial^2 h_L}{\partial h_l^2} \approx 0$. We follow this assumption and use the proposed optimizer to preserve it as much as possible. Thus we have:

$$\frac{\partial^2 L_k}{\partial h_l \partial h_r} = \Big(\frac{\partial h_L}{\partial h_l}\Big)^T \frac{\partial^2 L_k}{\partial h_L^2} \frac{\partial h_L}{\partial h_r} \qquad (12)$$

and the local Hessian at layer $l$ is estimated as

$$H_l = \Big(\frac{\partial h_L}{\partial h_l}\Big)^T l''(f(\theta; x); y) \frac{\partial h_L}{\partial h_l} + \alpha I_l \qquad (13)$$

where $\alpha$ is a penalty parameter that ensures the positive definiteness of $H_l$. We set $\alpha = 1$ in all subsequent sections according to the trace normalization proposed in Section 3. Further, we can decompose the global optimization problem into independent sub-problems at each layer. We introduce Theorem 3 and prove it in Appendix A.5:

Theorem 3 (Local-Global Equivalence). Under the assumption of close vicinity, the global optimization problem is equivalent to independent local optimization problems. The local optimal projection matrix of layer $l$ is $P_l = \frac{\dim(\bar{H}_l)}{\mathrm{trace}(\bar{H}_l^{-1})} \bar{H}_l^{-1}$, where $\bar{H}_l$ denotes the sum of the layer-$l$ Hessian matrices of all old tasks.

4.3 ITERATIVE UPDATE

Considering that the complexity of inverting a matrix of dimension $n$ is $O(n^3)$, it is time-consuming to calculate the inverse Hessian matrix $\bar{H}_l^{-1}$ in practice. Instead, we update the projection matrix $P_l$ iteratively at each training step, as in the recursive least squares (RLS) algorithm (Haykin, 2002) (see Appendix B.1 for more details). This makes RGO an online algorithm whose memory and single-step time complexity are linear in the number of model parameters. We summarize the gradient modification and the iterative update of the projection matrix in Algorithm 1. Note that feature extractors in the same layer share a projection matrix calculated from the average of their gradients, given their linear correlation. In this way, handling multiple gradients in the same layer does not increase the complexity of updating the projection. We list the memory size and the single-step time complexity for different kinds of feature extractors in Appendix B.2. It is worth mentioning that, because the local optimizers of different layers are independent, the gradient modification of different layers can be processed in parallel once the back-propagation gradients have been obtained, further reducing the time required.
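The recursive update of the projection matrix described above has a compact rank-1 form. The following NumPy sketch (our own illustration, with illustrative names) mirrors steps 10-11 of Algorithm 1 below for a single layer.

```python
import numpy as np

def update_projection(P, g, alpha=1.0):
    """RLS-style rank-1 update of a layer-wise projection matrix P
    with a local gradient g of the task that has just finished training:
    k = P g / (alpha + g^T P g),  P <- P - k g^T P."""
    Pg = P @ g
    k = Pg / (alpha + g @ Pg)        # gain vector
    return P - np.outer(k, Pg)       # g^T P = (P g)^T since P is symmetric

# Toy usage: start from the identity and fold in a few gradients.
dim = 8
P = np.eye(dim)
for _ in range(5):
    g = np.random.randn(dim)
    P = update_projection(P, g)
```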
Algorithm 1 Learning Algorithm of Recursive Gradient Optimization
Input: task sequence $T_k$, $k = 1, 2, \ldots$
Output: optimal parameter $\theta_k^*$
1: $P_l(0) \leftarrow I_l$; $\theta \leftarrow \theta_0^*$ is randomly initialized
2: for $k = 1, 2, \ldots$ do
3:   for $(x, y) \in D_k$ do
4:     $g_l = \frac{\partial L_k}{\partial h_l}$, $l = 1, 2, \ldots, L$   ▷ stochastic/batch gradient of the current loss
5:     $\hat{g}_l = P_l g_l \cdot \dim(P_l) / \mathrm{trace}(P_l)$   ▷ modify the original gradients
6:     $\theta \leftarrow \theta - \eta \hat{g}$   ▷ update the model parameters
7:   end for
8:   for $(x, y) \in D_k$ do
9:     $g_l = [l''(f(\theta; x); y)]^{1/2} \frac{\partial h_L}{\partial h_l}$, $l = 1, 2, \ldots, L$   ▷ get the local gradient by back-propagation
10:    $k_l = P_l g_l / (\alpha + g_l^T P_l g_l)$
11:    $P_l \leftarrow P_l - k_l g_l^T P_l$   ▷ update the projection matrix
12:  end for
13:  $\theta_k^* \leftarrow \theta$   ▷ the optimal model parameter for task $k$
14: end for

5 EXPERIMENT SETUP

Benchmarks: We evaluate the performance of our approach on four supervised continual learning benchmarks. Permuted MNIST (Goodfellow et al., 2014; Kirkpatrick et al., 2017) and Rotated MNIST (Chaudhry et al., 2020) are variants of the MNIST dataset of handwritten digits (LeCun, 1998) with 20 tasks, applying random permutations of the input pixels and random rotations of the original images, respectively. Split CIFAR100 (Zenke et al., 2017) is a random division of CIFAR100 into 20 tasks, each with 5 different classes. Split miniImageNet, introduced by Chaudhry et al. (2020), applies a similar division to a subset of the original ImageNet (Russakovsky et al., 2015) dataset.

Baselines: We perform experiments on the benchmarks above with the following fixed-capacity methods and one expansion-based method for comparison: (1) SGD, which uses the stochastic gradient descent procedure to finetune the model; (2) EWC (Kirkpatrick et al., 2017), one of the pioneering regularization methods, which uses Fisher information diagonals as importance weights; (3) A-GEM (Chaudhry et al., 2019a), which uses loss gradients of stored previous data in an inequality-constrained optimization; (4) LOS (Chaudhry et al., 2020), which constrains gradients to a low-rank orthogonal subspace; (5) ER-Ring (Chaudhry et al., 2019b), which utilizes a tiny ring memory to alleviate forgetting; (6) GPM (Saha et al., 2021), which trains new tasks in the residual gradient subspace; (7) APD (Yoon et al., 2019), a strong expansion-based method that decomposes the parameters of different tasks with a common basis; and (8) STL, which trains a separate model for each task. For the compared methods, we follow the original implementations to perform the necessary processing at the end of every task. The storage for memory-based methods is set to 1 sample per class per task, following Chaudhry et al. (2020).

Metrics: We use average accuracy (ACC) and average accuracy decline, also called backward transfer (BWT) by Lopez-Paz & Ranzato (2017), to evaluate the classification performance. Denoting the accuracy of task $k$ at the end of task $T$ as $R_{T,k}$, ACC and BWT are defined as:

$$\mathrm{ACC} = \frac{1}{T} \sum_{k=1}^{T} R_{T,k}, \qquad \mathrm{BWT} = \frac{1}{T-1} \sum_{k=1}^{T-1} (R_{T,k} - R_{k,k})$$

Architectures and training details: We evaluate all of the continual learning methods with the same network architectures. The model is a three-layer fully connected network with 256 hidden units in the MNIST experiments and a standard ResNet-18 (He et al., 2016) in the CIFAR and ImageNet experiments. For RGO, a virtual feature encoding layer is attached to each layer. MNIST variants are trained for 1000 steps, while CIFAR and miniImageNet are trained for 2000 steps. The batch size is set to 10 for all tasks. The task identifiers are provided at both training and testing time. All results are reported across 5 runs with different seeds. See Appendix C for more details.
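For concreteness, the two metrics defined above can be computed from the accuracy matrix R, where R[t, k] is the accuracy on task k+1 after training task t+1. This small helper is our own illustration, not part of the paper's code.

```python
import numpy as np

def acc_bwt(R):
    """ACC and BWT from a (T, T) accuracy matrix R."""
    acc = R[-1].mean()                             # average final accuracy
    bwt = np.mean(R[-1, :-1] - np.diag(R)[:-1])    # forgetting on tasks 1..T-1
    return acc, bwt

# Toy 3-task example.
R = np.array([[0.90, 0.00, 0.00],
              [0.88, 0.92, 0.00],
              [0.87, 0.90, 0.91]])
print(acc_bwt(R))   # ACC ~ 0.893, BWT ~ -0.025
```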
6 RESULTS & DISCUSSIONS

Table 1: Performance of different methods on Permuted MNIST / Rotated MNIST / Split CIFAR100 / Split miniImageNet. The model is trained with 20 tasks for 5 different seeds and evaluated by final average accuracy with standard deviations. ( ) denotes methods with additional network capacity. RGO-2 is a version without FEL used for ablation experiments.

| Method | Replay | Permuted MNIST ACCtest (%) | Permuted MNIST BWT (%) | Rotated MNIST ACCtest (%) | Rotated MNIST BWT (%) |
|---|---|---|---|---|---|
| RGO | N | 91.15 (±0.20) | -2.05 (±0.09) | 91.25 (±0.01) | -1.59 (±0.01) |
| RGO-2 | N | 87.95 (±0.01) | -5.65 (±0.38) | 72.26 (±0.95) | -20.74 (±0.01) |
| GPM | N | 83.29 (±0.01) | -8.45 (±0.01) | 70.02 (±0.95) | -17.95 (±0.01) |
| LOS | N | 86.56 (±0.38) | -4.10 (±0.33) | 80.21 (±1.11) | -13.44 (±1.06) |
| ER-Ring | Y | 79.84 (±0.63) | -12.88 (±0.65) | 69.20 (±0.79) | -25.93 (±0.85) |
| AGEM | Y | 72.32 (±1.04) | -19.94 (±1.02) | 53.26 (±1.00) | -41.74 (±0.96) |
| EWC | N | 67.79 (±1.60) | -24.38 (±1.53) | 43.27 (±0.66) | -50.74 (±0.76) |
| SGD | N | 46.11 (±3.91) | -46.06 (±4.00) | 44.82 (±0.01) | -50.18 (±0.01) |
| STL | N | 91.33 (±0.20) | 0.0 | 91.09 (±0.01) | 0.0 |

| Method | Replay | Split CIFAR100 ACCtest (%) | Split CIFAR100 BWT (%) | Split miniImageNet ACCtest (%) | Split miniImageNet BWT (%) |
|---|---|---|---|---|---|
| RGO | N | 73.18 (±0.51) | -1.67 (±0.29) | 70.33 (±0.87) | -1.64 (±0.41) |
| RGO-2 | N | 62.82 (±0.98) | -15.91 (±0.92) | 57.40 (±1.90) | -22.40 (±1.89) |
| GPM | N | 53.41 (±2.87) | -27.98 (±3.14) | - | - |
| LOS | Y | 56.20 (±1.12) | -25.46 (±1.14) | 43.25 (±2.29) | -34.75 (±2.69) |
| ER-Ring | Y | 53.74 (±2.13) | -28.15 (±2.02) | 45.88 (±2.39) | -29.21 (±1.63) |
| AGEM | Y | 49.56 (±2.64) | -32.10 (±2.73) | 34.67 (±0.52) | -38.06 (±0.87) |
| EWC | N | 47.71 (±1.70) | -25.17 (±1.50) | 32.61 (±3.67) | -24.95 (±3.62) |
| SGD | N | 37.02 (±1.64) | -44.34 (±1.55) | 37.69 (±1.00) | -37.23 (±0.72) |
| STL | N | 74.90 (±0.73) | 0.0 | 67.76 (±1.70) | 0.0 |

Figure 1: Evolution of average accuracy with the number of tasks during the continual learning process. (a) Permuted MNIST, (b) Rotated MNIST, (c) Split CIFAR100, (d) Split miniImageNet.

The evolution of average accuracy is shown in Figure 1, and the final results with error bars at the end of training on the indicated datasets are reported in Table 1. The proposed method shows strong average accuracy over the baselines on all benchmarks. The BWT results show that RGO significantly reduces catastrophic forgetting, especially on complex tasks and deep architectures. RGO improves considerably upon the strongest baseline: 17.0% and 24.5% absolute gain in average accuracy, and 93.4% and 94.4% reduction in forgetting, on CIFAR100 and miniImageNet, respectively. Meanwhile, on Rotated MNIST and miniImageNet, we observe that our method even exceeds the performance of STL, which is often regarded as an upper bound for continual learning methods. The ablation results show that RGO maintains good performance without FEL, falling only slightly below LOS, which has an additional task-orthogonal mapping layer, on Rotated MNIST.

In Table 2, we list results on the modified LeNet from APD (Yoon et al., 2019) as a comparison with expansion-based methods. Contrary to the forgetting observed for other methods, RGO shows positive knowledge transfer and exceeds the theoretical upper bound on the testing set under a fixed network capacity.

Table 2: Average accuracy on the 20-task Split-CIFAR100 dataset with a modified LeNet. Capacity denotes the percentage of network capacity used with respect to the original network. Relative ACC is the difference in accuracy between the corresponding method and STL under the same settings. ( ) denotes results reported from APD.
| Metric | PGN | DEN | RCL | APD | RGO |
|---|---|---|---|---|---|
| Relative ACC (%) | -10.24 ± 0.39 | -9.90 ± 0.77 | -9.01 ± 0.25 | -4.19 ± 0.33 | +6.02 ± 0.5 |
| Capacity | 271% | 191% | 184% | 130% | 100% |

Further, we test our method with more architectures. As shown in Table 3, RGO achieves higher test accuracy than STL with only 5% of the capacity, despite forgetting on the training set. RGO provides more robust features, reducing the accuracy gap between the training set and the testing set by 9% to 44%. Meanwhile, we report new state-of-the-art performance of 82.22% and 72.63% on Split-CIFAR100 (20 tasks) and Split-miniImageNet (20 tasks), respectively, without a carefully designed training schedule. Although RGO minimizes forgetting in each local optimization problem, deeper structures lead to more forgetting due to the layer-by-layer accumulation of errors. FEL uses random permutation to greatly reduce the coupling between layers, which plays an important role in alleviating this problem. From this perspective, shallow and wide structures may be beneficial for alleviating catastrophic forgetting in continual learning.

Table 3: Results of different architectures on 20-Split-CIFAR100 (top group) and 20-Split-miniImageNet (bottom group). ∆ denotes the difference between training set accuracy and testing set accuracy. All experiments are run 5 times with 20 epochs. The learning rate is set to 0.03 and 0.01 for CIFAR and miniImageNet, respectively.

20-Split-CIFAR100:

| Architecture | STL ACCtrain (%) | STL ACCtest (%) | STL ∆ (%) | RGO ACCtrain (%) | RGO ACCtest (%) | RGO ∆ (%) | RGO BWT (%) |
|---|---|---|---|---|---|---|---|
| LeNet-5 | 100.0 ± 0.00 | 75.01 ± 0.32 | 25.0 ± 0.3 | 99.35 ± 0.05 | 81.03 ± 0.51 | 18.3 ± 0.5 | -0.99 ± 0.15 |
| AlexNet-6 | 99.90 ± 0.03 | 81.60 ± 0.31 | 18.3 ± 0.3 | 98.78 ± 0.08 | 82.22 ± 0.24 | 16.6 ± 0.2 | -1.45 ± 0.13 |
| VGG-11 | 98.30 ± 0.55 | 79.51 ± 0.71 | 18.8 ± 0.6 | 95.17 ± 0.08 | 79.81 ± 0.22 | 15.4 ± 0.3 | -4.00 ± 0.19 |
| VGG-13 | 95.73 ± 0.83 | 75.64 ± 0.49 | 20.1 ± 0.8 | 93.26 ± 1.22 | 77.43 ± 0.34 | 15.8 ± 1.0 | -4.86 ± 0.94 |

20-Split-miniImageNet:

| Architecture | STL ACCtrain (%) | STL ACCtest (%) | STL ∆ (%) | RGO ACCtrain (%) | RGO ACCtest (%) | RGO ∆ (%) | RGO BWT (%) |
|---|---|---|---|---|---|---|---|
| AlexNet-7 | 99.72 ± 0.30 | 71.92 ± 0.55 | 27.8 ± 0.6 | 98.10 ± 0.11 | 72.63 ± 0.45 | 25.5 ± 0.5 | -1.98 ± 0.16 |
| VGG-11 | 99.70 ± 0.08 | 71.67 ± 0.13 | 28.0 ± 0.2 | 95.91 ± 0.22 | 71.14 ± 0.62 | 24.8 ± 0.7 | -3.06 ± 0.23 |
| VGG-13 | 97.62 ± 0.44 | 67.22 ± 0.76 | 30.4 ± 0.5 | 92.48 ± 1.10 | 66.24 ± 1.04 | 26.2 ± 0.7 | -4.79 ± 0.48 |
| ResNet-18 | 97.04 ± 0.46 | 68.78 ± 1.09 | 28.3 ± 0.8 | 86.79 ± 0.85 | 71.00 ± 0.61 | 15.8 ± 0.3 | -5.20 ± 0.62 |

7 RELATED WORK

In this section, we discuss the relationship between the adopted techniques and existing work. The starting point of our approach, like that of many other loss-constrained approaches, is to optimize the current loss and the estimated past loss at the same time. Most of the commonly used regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Teng et al., 2020) use a hyperparameter to balance the current task and past tasks; their objective functions can be expressed as $L(\theta) + \lambda F(\theta - \theta_{old})$. This type of method suffers from the trade-off between new and old tasks and requires hyperparameter search to obtain good results. In contrast, we follow the current-task-first principle and discard empirical trade-offs between tasks, which corresponds to $\lambda = 0$. Within the solution space of $\theta^* = \arg\min L$, we change the path of the gradient descent process through the matrix $P$ and find the solution that minimizes $F(\theta - \theta_{old})$. Thus, under the over-parameterization assumption, RGO optimizes current performance and forgetting simultaneously. Although the starting point is different, our gradient modification process is closely related to gradient constraint methods such as OWM (Zeng et al., 2019) and GPM (Saha et al., 2021).
OWM uses similar iteratively updated projectors derived from recursive least squares (RLS); it regards each layer as an independent linear classifier and uses the output of the previous layer to build the projection matrix. This leads to a layer-by-layer accumulation of errors in modern, complex end-to-end network architectures. Using the gradients of the loss function directly, our approach is less sensitive to the depth of the network and more compatible with existing autograd frameworks. For a single-layer linear classifier $y = Wx$, OWM and RGO are equivalent, considering that $\partial y / \partial W = x$. In addition, an extra normalization procedure in our method guarantees the learning rate of the current task; this brings the additional advantage that our method can reuse the hyperparameters of the original single-task models. GPM (Saha et al., 2021) projects the gradient of each layer into a lower-dimensional residual space of previous tasks, whereas the parameter space of RGO stays the same across tasks, so RGO maintains the network's fitting ability as the number of tasks increases. In addition, our method does not suffer from the instability of SVD decomposition and does not require hyperparameter selection.

Task encoding layers have been used to reduce interference between tasks in recent works such as LOS (Chaudhry et al., 2020) and HAT (Serra et al., 2018), which require additional network capacity. In contrast, FEL is only a permutation of the input determined by the task id. This provides an efficient task encoding and decoupling method that can be easily integrated into other continual learning methods.

8 LIMITATION

Like all methods based on gradient constraints, the analyses in this paper rely on the neighborhood assumption and the over-parameterization assumption, which may not be satisfied in narrower networks. When the number of tasks is close to the minimum number of channels in the network, this assumption fails. Although we have empirically shown that this effect is not pronounced under common experimental settings, attention should be paid to the width of the network in applications.

9 CONCLUSION

In this paper, we propose a new recursive gradient optimization method to find the optimal parameters of fixed-capacity networks, and a new feature encoding strategy to characterize the structure of the network. The feature encoding layer and the forgetting-minimizing optimizer are both compatible with typical learning models, which allows our approach to serve as a general method for adding continual learning capability to the vast majority of existing network architectures trained by variants of gradient descent, with only a constant-factor memory/time overhead compared to typical back-propagation algorithms. The theoretical derivation and experimental results show that RGO is the optimal approach under the current-task-first principle and the quadratic loss estimation for fixed-capacity networks. Experiments demonstrate that RGO achieves significantly better performance than other state-of-the-art methods on a variety of benchmarks. Without restrictions on the network structure and loss form, RGO has broad prospects in combination with other continual learning methods and in applications to other representation learning fields.

REPRODUCIBILITY STATEMENT

We provide reproducible source code in the supplementary materials and describe the implementation of the baseline methods in Appendix C.1. See Appendix C for the selection of hyperparameters.
All results can be reproduced with Python 3.6 and TensorFlow 1.4. The theorems put forward in the main text have corresponding proofs in Appendix A.

ACKNOWLEDGEMENT

The authors are with the Beijing National Research Center for Information Science and Technology, Institute for Artificial Intelligence, and the Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China. This work was supported by the National Natural Science Fund for Key International Collaboration (62120106005).

REFERENCES

Ferran Alet, Tomás Lozano-Pérez, and Leslie P. Kaelbling. Modular meta-learning. (CoRL), 2018. URL http://arxiv.org/abs/1806.10166.

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-January:7120-7129, 2017. doi: 10.1109/CVPR.2017.753.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Navid Azizan, Sahin Lale, and Babak Hassibi. Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. pp. 1-35, 2019. URL http://arxiv.org/abs/1906.03830.

Michael B. Chang, Sergey Levine, Abhishek Gupta, and Thomas L. Griffiths. Automatically composing representation transformations as a means for generalization. 7th International Conference on Learning Representations, ICLR 2019, pp. 1-23, 2019.

Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. 7th International Conference on Learning Representations, ICLR 2019, pp. 1-20, 2019a.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Puneet K. Dokania, Philip H. S. Torr, Thalaiyasingam Ajanthan, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv, pp. 1-15, 2019b. ISSN 23318422.

Arslan Chaudhry, Naeemullah Khan, Puneet K. Dokania, and Philip H. S. Torr. Continual learning in low-rank orthogonal subspaces. (NeurIPS):1-12, 2020. URL http://arxiv.org/abs/2010.11635.

Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. 2019. URL http://arxiv.org/abs/1910.07104.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. 2017. URL http://arxiv.org/abs/1701.08734.

Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, 2014.

Simon Haykin. Adaptive Filter Theory. Prentice Hall, 2002.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December:770-778, 2016. ISSN 10636919. doi: 10.1109/CVPR.2016.90.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

Prakhar Kaushik, Alex Gain, Adam Kortylewski, and Alan Yuille. Understanding catastrophic forgetting and remembering in continual learning with optimal relevance mapping. 2021. URL http://arxiv.org/abs/2102.11343.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521-3526, 2017. ISSN 10916490. doi: 10.1073/pnas.1611835114.

Richard J. Larsen and Morris L. Marx. An Introduction to Mathematical Statistics. Prentice Hall, 2005.

Yann LeCun. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.

Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to Grow: A continual structure learning framework for overcoming catastrophic forgetting. 2019. URL http://arxiv.org/abs/1904.00310.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 2017-December (NIPS):6468-6477, 2017. ISSN 10495258.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, pp. 1-18, 2018.

Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems, 2018-December:3738-3748, 2018. ISSN 10495258.

Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, pp. 1-16, 2018.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211-252, 2015. ISSN 15731405. doi: 10.1007/s11263-015-0816-y. URL http://dx.doi.org/10.1007/s11263-015-0816-y.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 2016. URL http://arxiv.org/abs/1606.04671.

Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. pp. 1-18, 2021. URL http://arxiv.org/abs/2103.09762.

Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. 35th International Conference on Machine Learning, ICML 2018, 10:7225-7234, 2018. ISSN 2640-3498.

Gehui Shen, Song Zhang, Xiang Chen, and Zhi-Hong Deng. Generative feature replay with orthogonal weight modification for continual learning. arXiv, 2020.

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in Neural Information Processing Systems, 2017-December (NIPS):2991-3000, 2017. ISSN 10495258.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1-14, 2015.

Yunfei Teng, Anna Choromanska, and Murray Campbell. Continual learning with direction-constrained optimization. 2020. URL http://arxiv.org/abs/2011.12581.

Hanna Tseran, Mohammad Emtiyaz Khan, Tatsuya Harada, and Thang Bui. Natural variational continual learning. Continual Learning Workshop at the 32nd Conference on Neural Information Processing Systems (NeurIPS), pp. 1-5, 2018.
Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. pp. 1-28, 2019. URL http://arxiv.org/abs/1906.00695.

Dong Yin, Mehrdad Farajtabar, and Ang Li. SOLA: Continual learning with second-order loss approximation. arXiv, pp. 1-18, 2020.

Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, pp. 1-11, 2018.

Jaehong Yoon, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. Scalable and order-robust continual learning with additive parameter decomposition. pp. 1-15, 2019. URL http://arxiv.org/abs/1902.09432.

Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364-372, 2019. doi: 10.1038/s42256-019-0080-x.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. 34th International Conference on Machine Learning, ICML 2017, 8:6072-6082, 2017.

Section A provides proofs of the theorems introduced in the paper. Section B provides details and approximations used in the implementation. Section C provides details of the code, architectures, hyperparameters, and resources used in the experiments.

A.1 PROOF OF THEOREM 1

As mentioned above, the distribution of the elements of the gradients of future tasks is assumed to be isotropic, and this assumption is supported by our random encoding strategy under different tasks. For an isotropic distribution, we have:

Lemma 4 (Distribution Consistency). Isotropy is invariant under orthogonal transformation (Larsen & Marx, 2005).

Then, we introduce a lemma on the trace of a matrix that is often used in matrix analysis (Horn & Johnson, 2012):

Lemma 5 (Trace Consistency). The trace of a matrix is invariant under orthogonal transformation.

As described in Section 3, at the $i$-th training step of task $k$ we have:

$$\theta_i = \theta_{i-1} - \eta_i P \nabla L_k(\theta_{i-1}), \qquad L_k(\theta_i) = L_k(\theta_{i-1}) - \eta_i (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \qquad (14)$$

For any potential gradient, the expected decline of the loss function is:

$$E[L_k(\theta_i) - L_k(\theta_{i-1})] = -\eta E[(\nabla L_k)^T P \nabla L_k] \qquad (15)$$

The optimizer degrades to the original single-task optimizer when $P$ equals the identity matrix. As a positive definite symmetric matrix, $P$ can be orthogonally diagonalized by an orthogonal matrix $V$:

$$V P V^T = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\} := \Lambda_P \qquad (16)$$

where $\{\lambda_i\}$ are the eigenvalues of the projection matrix $P$. Denote $\dim(P)$ by $n$ and $\nabla L_k(\theta)$ by $x = (x_1, x_2, \ldots, x_n)$. We assume the distribution of the unknown future gradients is isotropic and apply Lemma 4 to the expectation:

$$E[(\nabla L_k)^T P \nabla L_k] = E_x[x^T V^T \Lambda_P V x] = E_x[x^T \Lambda_P x] = \sum_{i=1}^{n} \lambda_i E_x[x_i^2] = \sum_{i=1}^{n} \lambda_i \frac{E_x[\sum_{i=1}^{n} x_i^2]}{n} = \frac{\sum_{i=1}^{n} \lambda_i}{n} E_x[x^T x] = \frac{\sum_{i=1}^{n} \lambda_i}{n} E[(\nabla L_k)^T \nabla L_k] \qquad (17)$$

According to Lemma 5, the sum of the eigenvalues of the projection matrix $P$ equals that of $\Lambda_P$, which means:

$$\sum_{i=1}^{n} \lambda_i = \mathrm{trace}(P) \qquad (18)$$

Thus, if we add the normalization constraint $\mathrm{trace}(P) = \dim(P)$ to the projection matrix $P$, every task in the continual learning procedure has a consistent expected convergence rate.
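The derivation above can be checked numerically: for isotropic gradients, any symmetric positive definite P rescaled so that trace(P) = dim(P) gives the same expected value of the quadratic form as the identity. A small self-contained check (our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Random symmetric positive definite P, rescaled so trace(P) = dim(P).
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)
P *= n / np.trace(P)

# Isotropic gradients: E[g^T P g] should match E[g^T g] = n.
g = rng.standard_normal((100_000, n))
lhs = np.einsum('bi,ij,bj->b', g, P, g).mean()
rhs = np.einsum('bi,bi->b', g, g).mean()
print(lhs, rhs)   # both close to n = 50
```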
A.2 PROOF OF LOSS FUNCTION EQUIVALENCE

Theorem 6 (Loss Equivalence). If $\theta_{k-1}^* = \arg\min_\theta F_{k-1}^{RLL}(\theta)$, then using $F_k^{RLL}$ or $F_k$ as the loss to optimize is equivalent, i.e., $\arg\min_\theta F_k^{RLL}(\theta) = \arg\min_\theta F_k(\theta)$.

We first expand $F_k$ in Equation (3) around the initial state $\theta_{k-1}^*$ of $T_k$:

$$F_k(\theta) = \sum_{j=1}^{k-1} \Big[ L_j(\theta_j^*) + \frac{1}{2}(\theta - \theta_{k-1}^* + \theta_{k-1}^* - \theta_j^*)^T H_j (\theta - \theta_{k-1}^* + \theta_{k-1}^* - \theta_j^*) \Big] = (\theta - \theta_{k-1}^*)^T \sum_{j=1}^{k-1} \big[ H_j (\theta_{k-1}^* - \theta_j^*) \big] + \frac{1}{2}(\theta - \theta_{k-1}^*)^T \Big( \sum_{j=1}^{k-1} H_j \Big) (\theta - \theta_{k-1}^*) + \text{const.}$$

For further analysis, we first introduce the following lemma, which can be proved inductively.

Lemma 7. If $\big(\sum_{j=1}^{k-1} H_j\big)(\theta_k^* - \theta_{k-1}^*) = 0$ holds for any $k$, then $\sum_{j=1}^{k-1}\big[H_j(\theta_k^* - \theta_j^*)\big] = 0$ also holds for any $k$.

Proof. We write $D(k) := \sum_{j=1}^{k-1}\big[H_j(\theta_k^* - \theta_j^*)\big]$. Then for $k = 2, 3, \ldots$ we have

$$D(k) - D(k-1) = \sum_{j=1}^{k-1}\big[H_j(\theta_k^* - \theta_j^*)\big] - \sum_{j=1}^{k-2}\big[H_j(\theta_{k-1}^* - \theta_j^*)\big] = H_{k-1}(\theta_k^* - \theta_{k-1}^*) + \Big(\sum_{j=1}^{k-2} H_j\Big)(\theta_k^* - \theta_{k-1}^*) = \Big(\sum_{j=1}^{k-1} H_j\Big)(\theta_k^* - \theta_{k-1}^*) = 0$$

Using the fact that $D(1) = 0$, $D(k) = 0$ holds for all positive integers $k$.

Ignoring the constant terms, the key to proving the theorem is the cross term in the expansion of $F_k$.

Proof. If $F_k^{RLL}(\theta_k^*) = 0$ holds for all $k \in [K]$, then

$$\Big(\sum_{j=1}^{k-1} H_j\Big)(\theta_k^* - \theta_{k-1}^*) = 0, \quad \forall k \in [K]$$

Using the result of Lemma 7, we have

$$\sum_{j=1}^{k-1}\big[H_j(\theta_{k-1}^* - \theta_j^*)\big] = 0, \quad \forall k \in [K]$$

Substituting this formula into Equation (3) and discarding the constant terms leads to

$$F_k(\theta_k^*) = F_k^{RLL}(\theta_k^*), \quad \forall k \in [K]$$

Thus, $\{F_k^{RLL}\}_{k=1}^{K}$ and $\{F_k\}_{k=1}^{K}$ are equivalent loss sequences throughout the continual learning process. Note that this equivalence is obtained for a second-order approximation of the task loss functions, so it also depends on the close-vicinity hypothesis.

A.3 PROOF OF THEOREM 2

Lemma 8 (Cauchy Inequality). For any positive integer $n$, symmetric positive definite matrix $P$, and vectors $\{v_i\}_{i=1}^{n}$, denoting the $P$-norm of $v$ by $\|v\|_P^2 = v^T P v$, we have:

$$\Big\| \sum_{i=1}^{n} v_i \Big\|_P^2 \le n \sum_{i=1}^{n} \|v_i\|_P^2$$

Denote by $\hat\sigma_m(\cdot)$ the maximum eigenvalue and by $\eta_m$ the maximum single-step learning rate. Since $\theta_k^* - \theta_{k-1}^* = -\sum_{i=1}^{n_k} \eta_i P \nabla L_k(\theta_{i-1})$, the recursive least loss satisfies

$$F_k^{RLL}(\theta_k^*) = \frac{1}{2}\Big(\sum_{i=1}^{n_k} \eta_i \nabla L_k(\theta_{i-1})\Big)^T P \bar{H} P \Big(\sum_{i=1}^{n_k} \eta_i \nabla L_k(\theta_{i-1})\Big) \le \frac{1}{2}\hat\sigma_m(P\bar{H}) \Big(\sum_{i=1}^{n_k} \eta_i \nabla L_k(\theta_{i-1})\Big)^T P \Big(\sum_{i=1}^{n_k} \eta_i \nabla L_k(\theta_{i-1})\Big)$$

According to Lemma 8, we have

$$F_k^{RLL}(\theta_k^*) \le \frac{1}{2}\hat\sigma_m(P\bar{H})\, n_k \sum_{i=1}^{n_k} \eta_i^2 (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \le \frac{1}{2} n_k \eta_m \hat\sigma_m(P\bar{H}) \sum_{i=1}^{n_k} \eta_i (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \qquad (21)$$

Considering that $L_k(\theta_k^*) \ge 0$ at the end of training and that the current-task loss before training is $L_k(\theta_{k-1}^*)$, the change of the loss function described in Equation (7) satisfies the following inequality:

$$\sum_{i=1}^{n_k} \eta_i (\nabla L_k(\theta_{i-1}))^T P \nabla L_k(\theta_{i-1}) \le L_k(\theta_{k-1}^*) \qquad (22)$$

Then we can complete the proof by combining (21) with (22):

$$F_k^{RLL}(\theta_k^*) \le \frac{1}{2} n_k \eta_m \hat\sigma_m\Big(P \sum_{j=1}^{k-1} H_j\Big) L_k(\theta_{k-1}^*) \qquad (23)$$

A.4 SOLUTION OF PROBLEM 9

At a fixed point of the model, the total Hessian matrix $\bar{H} := \sum_{j=1}^{k-1} H_j$ must be positive definite, like $P$. As positive definite symmetric matrices can be orthogonally diagonalized by orthogonal matrices (Horn & Johnson, 2012), $\bar{H}$ and $P$ can be expressed as $\bar{H} = U^T \Lambda_{\bar{H}} U$ and $P = V^T \Lambda_P V$, where $\Lambda_{\bar{H}} = \mathrm{diag}\{\sigma_1, \ldots, \sigma_n\}$ and $\Lambda_P = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$ are diagonal matrices. Using this diagonalization, optimization problem (9) can be expressed as:

$$\min_P \ \hat\sigma_m(V^T \Lambda_P V U^T \Lambda_{\bar{H}} U), \quad \text{s.t.} \ \mathrm{trace}(P) = \dim(P) \qquad (24)$$

Since orthogonal transformations preserve eigenvalues (Horn & Johnson, 2012),

$$\hat\sigma_m(V^T \Lambda_P V U^T \Lambda_{\bar{H}} U) = \hat\sigma_m(U V^T \Lambda_P V U^T \Lambda_{\bar{H}})$$

There are two main variables to be optimized: the diagonal matrix $\Lambda_P$ and the orthogonal matrix $U V^T$. To simplify the derivation, we set $U V^T$ to the simplest orthogonal matrix, $I$.
Under this assumption, the optimization problem simplifies to:

$$\{\lambda_i\}_{i=1}^{n}: \ \min_{\lambda_i} \max_i \sigma_i \lambda_i, \quad \text{s.t.} \ \sum_{i=1}^{n} \lambda_i = n \qquad (25)$$

We get the optimal eigenvalues:

$$\lambda_i = \frac{n}{\sigma_i \sum_{j=1}^{n} \frac{1}{\sigma_j}} \qquad (26)$$

$$P = V^T \Lambda_P V = \frac{n}{\sum_{j=1}^{n} \frac{1}{\sigma_j}} V^T \Lambda_{\bar{H}}^{-1} V = \frac{\dim(\bar{H})}{\mathrm{trace}(\bar{H}^{-1})} \bar{H}^{-1} \qquad (27)$$

A.5 PROOF OF THEOREM 3

Denote the gradient of the total parameter set as $g_\theta = (g_1^T, g_2^T, \ldots, g_L^T)^T$ and the blocks of the total Hessian as $\{\bar{H}\}_{l,r} = \frac{\partial^2 L_k}{\partial h_l \partial h_r}$. Diagonalizing $\bar{H}_l = U_l^T \Lambda_{\bar{H}_l} U_l$ and $P_l = V_l^T \Lambda_{P_l} V_l$ and bounding each cross term $g_l^T P_l \frac{\partial^2 L_k}{\partial h_l \partial h_r} P_r g_r$ by the corresponding maximum eigenvalues, we obtain

$$\sum_{1 \le l, r \le L} g_l^T P_l \frac{\partial^2 L_k}{\partial h_l \partial h_r} P_r g_r \le \max_l \big[\hat\sigma_m(\Lambda_{P_l} \Lambda_{\bar{H}_l})\big]\, g_\theta^T P g_\theta$$

The resulting optimization problem has the same form as the global optimization problem in Section A.4, so the solution can be obtained per layer:

$$P_l = \frac{\dim(\bar{H}_l)}{\mathrm{trace}(\bar{H}_l^{-1})} \bar{H}_l^{-1}$$

where $\bar{H}_l$ denotes the sum of the layer-$l$ Hessian matrices of all old tasks.

B IMPLEMENTATION DETAILS

B.1 DETAILS OF QUADRATIC ESTIMATION OF THE HESSIAN MATRIX

For $C$-class classification problems, $f(x; \theta)$ has $C$ logits associated with the different classes. We consider the most commonly used softmax cross-entropy loss, defined as

$$l(y, f(x; \theta)) = -\sum_{j=1}^{C} y_j \log(a_j) \qquad (30)$$

where $a_j = \exp(f_j(x; \theta)) / \sum_{c=1}^{C} \exp(f_c(x; \theta))$ is the $j$-th softmax output. The $(i, j)$-th element of the second derivative matrix of the loss with respect to $f(x; \theta)$ is then

$$l''(y; f(x, \theta))_{i,j} = a_j \varphi_{i,j} - a_i a_j \qquad (31)$$

where $\varphi_{i,j}$ equals 1 if $i = j$ and 0 otherwise. As a symmetric diagonally dominant matrix, $l''(y; f(x, \theta))$ has a matrix square root $[l''(y; f(x, \theta))]^{1/2}$. This guarantees the correctness of our algorithm. In the implementation, for convenience, we only use the diagonal element corresponding to the ground-truth label as an estimate.
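A small illustrative computation of Equation (31) and a generic matrix square root (our own sketch; it uses an eigendecomposition rather than the diagonal shortcut mentioned above):

```python
import numpy as np

def softmax_second_derivative(logits):
    """Second derivative of the softmax cross-entropy loss w.r.t. the
    logits: l''_{ij} = a_j * delta_{ij} - a_i * a_j (Eq. 31)."""
    a = np.exp(logits - logits.max())
    a /= a.sum()
    return np.diag(a) - np.outer(a, a)

def matrix_sqrt(M, eps=1e-12):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, eps, None))) @ V.T

H = softmax_second_derivative(np.array([2.0, 0.5, -1.0]))
R = matrix_sqrt(H)
print(np.allclose(R @ R, H, atol=1e-6))   # True
```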
C.2 RESOURCES All experiments of our method are completed in several hours with 4 pieces of Nvidia-2080Ti GPUs. C.3 ARCHITECTURES We provide details of architectures we used in the experiment section. Le Net-5: A modified Le Net used by Yoon et al. (2019). There are two convolutional layers with kernels size of (5,5) and channels of (20,50), followed by two hidden fully connected layer with (800,500) units. Alex Net-6: A modified Alex Net used by Saha et al. (2021). There are three convolutional layers with kernels size of (4,3,2) and channels of (64,128,256), followed by two hidden fully connected layer with (2048,2048) units. Alex Net-7: A modified Alex Net. There are four convolutional layers with kernels size of (5,4,3,3) and channels of (64,128,128,128), followed by two hidden fully connected layer with (2048,2048) units. VGG-11&VGG-13 : Original VGG11 and VGG13 proposed by Simonyan & Zisserman (2015). Res Net-18: A standard 18-layer Res Net proposed by He et al. (2016). For our approach, we remove all batch-norm layers because their parameters are not updated by gradient descent. Le Net-like and Alex Net-like architectures are attached a 2 2 maxpooling layer after each convolutional layer. C.4 HYPERPARAMETERS The learning rates of all baselines are generated by hyperparameter search in [0.003,0.01,0.03,0.1,0.3,1] to achieve better results. Other hyperparameters of EWC, A-GEM, ER-Ring and LOS follows Chaudhry et al. (2020), while those of GPM and APD follows their official implementation. Single Task Learning learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) Published as a conference paper at ICLR 2022 Recursive Gradient Optimization(Ours) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net 2000steps), 0.01(mini Image Net 20epochs) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) regularization: 10(MNIST, CIFAR100, mini Image Net) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) learningrate: 0.1(MNIST), 0.4(CIFAR100), 0.2(mini Image Net) learningrate: 0.1(MNIST), 0.03(CIFAR100, mini Image Net) threshold: 0.95 for first layer and 0.99 for other layers(MNIST) ,increase from 0.97 to 1(CIFAR), increase from 0.985 to 1(mini Image Net) dimension of representation matrices: 300(MNIST), 125(CIFAR), 100(mini Image Net)