# Residual Continual Learning

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Janghyeon Lee,¹ Donggyu Joo,¹ Hyeong Gwon Hong,² Junmo Kim¹
¹School of Electrical Engineering, KAIST; ²Graduate School of AI, KAIST
{wkdgus9305, jdg105, honggudrnjs, junmo.kim}@kaist.ac.kr

## Abstract

We propose a novel continual learning method called Residual Continual Learning (ResCL). Our method can prevent the catastrophic forgetting phenomenon in sequential learning of multiple tasks, without any source task information except the original network. ResCL reparameterizes network parameters by linearly combining each layer of the original network and a fine-tuned network; therefore, the size of the network does not increase at all. To apply the proposed method to general convolutional neural networks, the effects of batch normalization layers are also considered. By utilizing residual-learning-like reparameterization and a special weight decay loss, the trade-off between source and target performance is effectively controlled. The proposed method exhibits state-of-the-art performance in various continual learning scenarios.

## Introduction

Deep learning with artificial neural networks is now one of the most powerful artificial intelligence technologies. It exhibits state-of-the-art performance in various machine learning fields such as computer vision (He et al. 2016a), natural language processing (Wu et al. 2016), and reinforcement learning (Silver et al. 2017). However, it requires large amounts of training data and time to train such deep networks as network structures become more complicated. To alleviate this difficulty, transfer learning methods such as fine-tuning (Yosinski et al. 2014) are used to utilize source task knowledge and to boost training for target tasks. Because transfer learning methods consider only target task performance during training, most of the source task performance is lost, a side effect called the catastrophic forgetting phenomenon (French 1999; McCloskey and Cohen 1989). This is a serious problem if high performance is required for both source and target tasks. Continual learning methods should be adopted to resolve this problem.

Our main goal is to achieve good target task performance while maintaining source task performance. Specifically, we focus on image classification tasks with Convolutional Neural Networks (CNNs). Moreover, we impose two practical conditions on the problem. First, we assume that no source task information is available during target task training. In many real-world applications, source data are often too heavy to handle or do not have a public license. If they were available, joint training of source and target data would be a better solution. Further, we also assume that not only source data but also any other forms containing source task information are unavailable. Recent studies on continual learning do not use the source data directly, but they often refer to parts of the information about source data, for example in the form of generative adversarial networks (Shin et al. 2017) or Fisher information matrices (Kirkpatrick et al. 2017; Ritter, Botev, and Barber 2018), which somewhat dilutes the original purpose of continual learning. Second, the size of a network should not increase.
Without this condition, a network can be expanded while keeping the entire original network, e.g., (Terekhov, Montone, and O'Regan 2015; Rusu et al. 2016). Such network expansion methods show good performance on both source and target tasks, but it is difficult to use them with deep neural networks in practice because the network becomes heavier as the number of tasks increases.

The main features of our method are as follows:

- Residual-learning-like reparameterization allows continual learning, and a simple decay loss controls the trade-off between source and target performance.
- No information about source tasks is needed, except the original source network.
- The size of a network does not increase at all for inference (except the last task-specific linear classifiers).
- The proposed method can be applied to general CNNs including Batch Normalization (BN) (Ioffe and Szegedy 2015) layers in a natural way.
- We propose two fair measures for comparing different continual learning methods: maximum achievable average accuracy as an ideal measure and source accuracy at required target accuracy as a practical measure.

## Related Work

Learning without Forgetting (LwF) (Li and Hoiem 2018) is a simple but effective continual learning method. An LwF loss restricts the source task outputs of a new network to be close to those of the source network using only target task data. In that work, softmax layers with high temperatures are used to soften the original output distributions in the sense of knowledge distillation (Hinton, Vinyals, and Dean 2015). Our method, Residual Continual Learning (ResCL), includes the LwF method as a special case since we use an LwF loss to train a combined network; ResCL falls back to LwF if training is performed on the source network instead of a combined one, with true target labels instead of softened ones.

In Incremental Moment Matching (IMM) (Lee et al. 2017), the posterior distributions of each task and of the combined task are approximated as Gaussian distributions. The moments of the posterior distribution for the combined task are incrementally matched by mean-IMM or mode-IMM. Mean-IMM simply averages the weights of the original and new networks, minimizing the Kullback-Leibler divergence between the posterior of the combined task and the mixture of each task. Mode-IMM merges two networks with their covariance information to approximate the mode of the mixture of two Gaussian posteriors. Our method also uses a combination of two networks, as IMM does. The difference is that the coefficients of the combination in IMM are determined by a hyperparameter specifying the mixing ratios of each task in a framework of Bayesian neural networks, and there is no additional training for the combined network. As neural networks are neither linear nor convex, we cannot be sure that a combined network works properly without additional training. In ResCL, weights and combination coefficients are learnable, so we can ensure that the combined network works properly.

Network expansion methods (Terekhov, Montone, and O'Regan 2015; Rusu et al. 2016) can prevent forgetting perfectly by expanding a network while keeping the entire original network, but they increase the size of the network. (Kirkpatrick et al. 2017; Ritter, Botev, and Barber 2018) protect source task performance with a quadratic penalty loss in which the importance of each weight is measured by the Fisher information matrix; however, source data are required to calculate the Fisher information matrix. (Aljundi et al. 2018)
proposes to measure the importance of a parameter by the magnitude of the gradient, which also requires source data. (Zenke, Poole, and Ganguli 2017) also defines a quadratic penalty loss, designed with the change in the source task loss over the entire trajectory of parameters during source task training. (Mallya and Lazebnik 2018) adds multiple tasks to a single network by iterative pruning and re-training with source data.

Basically, our method uses a linear combination of filters, which has also been studied for multitask or transfer learning, as in (Rebuffi, Bilen, and Vedaldi 2017) and (Rosenfeld and Tsotsos 2018) for example. The main purpose of those methods is to learn multiple networks or parameter sets across multiple tasks with maximized parameter sharing for efficiency, since they focus on multitask or transfer learning. Therefore, when they are applied to sequential learning, every single task must retain its own network or parameters. In contrast, our method aims to reduce catastrophic forgetting in sequential learning where only a single network is allowed to represent all tasks. In (Rebuffi, Bilen, and Vedaldi 2017), different BN parameters and 1×1 filters are learned for each task while the remaining parameters are shared across all the given tasks. Hence, that solution is only applicable to networks that contain BN by definition, and it is unsuitable for cases where BN does not work properly, for example, when the minibatch size cannot be set large enough due to memory constraints. Our solution does not depend on BN or on a specific network architecture and is thus more general. In (Rosenfeld and Tsotsos 2018), newly added filters for a target task are learned in the form of a linear combination with existing filters of source tasks. However, the coefficients of the linear combination are restricted to the binary digits 0 or 1 during training. Furthermore, the coefficients are shared across all layers in the network, which could be suboptimal. In contrast, our method finds a better solution by learning optimal real-valued coefficients that differ for each layer and each filter during training.

## Method

Continual learning is essentially about reaching a good midpoint between two tasks. A simple idea is to linearly combine each layer of the source and target networks to obtain a middle network between them, where the source network is the original network trained on the source task, and the target network is a network fine-tuned from the original one for the target task. By combining them, we can obtain a network that lies between the source and target task solutions. This basic idea is similar to what IMM (Lee et al. 2017) does, and we also start from here. However, the performance of a linearly combined network is not guaranteed, as neural networks are neither linear nor convex. Therefore, after the two networks are combined, we have an additional training phase for the combined network to ensure that it works properly. Because this additional training can hurt the source knowledge, the original weights are frozen in the combined network. In this paper, we often refer to the source network as the original network.

### Linear Combination of Two Layers

Suppose that we want to combine two fully connected layers whose weight matrices are $W_s \in \mathbb{R}^{C_o \times C_i}$ and $W_t \in \mathbb{R}^{C_o \times C_i}$.
For an input $x \in \mathbb{R}^{C_i}$, our combination layer simply combines the two outputs linearly with the combination parameters $\alpha_s \in \mathbb{R}^{C_o}$ and $\alpha_t \in \mathbb{R}^{C_o}$ as follows:

$$(\mathbf{1}_{C_o} + \alpha_s) \odot (W_s x) + \alpha_t \odot (W_t x), \qquad (1)$$

where $\mathbf{1}_{C_o}$ is a $C_o$-dimensional vector of all ones, and $\odot$ denotes element-wise multiplication. The biases are omitted for brevity. Note that the combination parameters $\alpha_s$ and $\alpha_t$ are vectors of dimension $C_o$ and not scalars; thus, the combination layer can set a different importance for each feature. Moreover, $\alpha_s$ and $\alpha_t$ do not share their values, which allows the combination layer to freely manipulate the two features. Each combination layer in a network has its own values of $\alpha_s$ and $\alpha_t$ for the same reason. In the ResCL framework, $W_s$ is a weight of the source network, and $W_t$ is that of the target network fine-tuned to the target task from the source network. $W_t$ is additionally trained to refine its features so that they combine well with $W_s$, whereas $W_s$ is fixed to prevent catastrophic forgetting. Moreover, the combination parameters are also learned by backpropagation to optimally mix the two features. Therefore, the learnable parameters in the combined network are $W_t$ and $\alpha = (\alpha_s, \alpha_t)$.

Figure 1: An illustration of our method. Learnable parameters are shown in red. We begin with an original network $\mathrm{net}_s$, which was trained with source data. First, $\mathrm{net}_s$ is fine-tuned with target data to obtain $\mathrm{net}_t$. Each linear block in $\mathrm{net}_s$ and $\mathrm{net}_t$ is combined with a combination layer as in $\mathrm{net}_c$. Continual learning on $\mathrm{net}_c$ is performed with an LwF loss $\mathcal{L}_s$ for preserving performance on the source task and a distillation loss $\mathcal{L}_t$ for adapting to the target task. There is also a special decay loss $\mathcal{L}_{\mathrm{decay}}$, which is the most important loss for preventing forgetting. $D_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ refers to the Kullback-Leibler divergence with a softmax temperature of 2. Note that each task has its own last task-specific fully connected layer since source and target tasks have different class categories in general. Therefore, $\mathrm{net}_c$ has two different outputs: $\mathrm{net}_c(\cdot\,;\mathrm{task}_s)$ for the source task and $\mathrm{net}_c(\cdot\,;\mathrm{task}_t)$ for the target task.

One can easily verify that the two fully connected layers and the combination layer can be equivalently expressed as one fully connected layer whose weight is

$$\left((\mathbf{1}_{C_o} + \alpha_s)\,\mathbf{1}_{C_i}^{\top}\right) \odot W_s + \left(\alpha_t\,\mathbf{1}_{C_i}^{\top}\right) \odot W_t \qquad (2)$$

owing to their linearity. This is why we can call our method a type of reparameterization; thus, the size of the network does not increase for inference once training is finished. Any nonlinear layer such as sigmoid or ReLU should not be included in the combination, as in Fig. 1 and Fig. 2. Otherwise, those layers cannot be merged into one layer, and the network size would increase as the number of tasks increases.
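The combination in Equation 1 and the merge of Equation 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a PyTorch-style framework is an assumption, layers are assumed to carry biases, the class name `CombinedLinear` is hypothetical, and the initialization of $(\alpha_s, \alpha_t)$ follows the value given later in Algorithm 1.

```python
import torch
import torch.nn as nn

class CombinedLinear(nn.Module):
    """Combination layer of Eq. (1): (1 + alpha_s) * (W_s x) + alpha_t * (W_t x).

    `source` is the frozen layer of the original network; `target` is the
    corresponding layer of the fine-tuned network (both assumed to have biases).
    """
    def __init__(self, source: nn.Linear, target: nn.Linear):
        super().__init__()
        self.source, self.target = source, target
        for p in self.source.parameters():      # W_s is fixed to prevent forgetting
            p.requires_grad = False
        c_out = source.out_features
        # init (alpha_s, alpha_t) = (-1/2, 1/2), so the combination starts as the
        # average of the two layers (as in Algorithm 1)
        self.alpha_s = nn.Parameter(torch.full((c_out,), -0.5))
        self.alpha_t = nn.Parameter(torch.full((c_out,), 0.5))

    def forward(self, x):
        return (1 + self.alpha_s) * self.source(x) + self.alpha_t * self.target(x)

    def merge(self) -> nn.Linear:
        """Fold everything into one equivalent layer, Eq. (2)."""
        merged = nn.Linear(self.source.in_features, self.source.out_features)
        with torch.no_grad():
            merged.weight.copy_((1 + self.alpha_s)[:, None] * self.source.weight
                                + self.alpha_t[:, None] * self.target.weight)
            merged.bias.copy_((1 + self.alpha_s) * self.source.bias
                              + self.alpha_t * self.target.bias)
        return merged
```

Because `merge()` realizes exactly the same input-output map as `forward()`, the inference-time network keeps the size of a single network once training is finished.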
### Training

There are two Kullback-Leibler divergence losses for training the combined network: one for maintaining the source performance and the other for solving the target task. For the former, we adopt an LwF loss (Li and Hoiem 2018), which preserves the source information well: the outputs for the source task are constrained to be similar to those of the original network, and softmax layers with a temperature of 2 are used. For the latter, the distillation loss (Hinton, Vinyals, and Dean 2015) from the fine-tuned network, also with a temperature of 2, is used for better generalization and as a natural counterpart of the LwF loss. Therefore, the outputs for the target task are constrained to be similar to those of the fine-tuned network.

One might wonder why the combination coefficient of $W_s x$ is parameterized as $\mathbf{1} + \alpha_s$ instead of simply $\alpha_s$ in Equation 1. Since a weight decay loss is widely used as a regularization term to improve the generalization of neural networks (Krogh and Hertz 1992), we also use a weight decay loss not only for $W_t$ but also for $\alpha$. If we used simply $\alpha_s$ as the combination coefficient of $W_s x$, the combined output would be

$$\alpha_s \odot (W_s x) + \alpha_t \odot (W_t x). \qquad (3)$$

In that case, the source information is lost by the decay loss on $\alpha_s$, as it drives the coefficient of $W_s x$ towards zero. Thus, it is not a good idea to naively decay the combination coefficients; we therefore carefully design the weight decay loss of the combination layers for continual learning. In ResCL, we set the destination of the decay loss to the original network, not a zero-weight network. This is done by parameterizing the first coefficient of the combination as $\mathbf{1} + \alpha_s$ rather than just $\alpha_s$, as in Equation 1. With this modification, the decay loss tends to protect the original weights against the target distillation loss. We also found experimentally that an L1 decay loss on $\alpha$ is slightly better than L2. Although this seems to be a very simple reparameterization, it is the key feature that allows continual learning with a significant performance improvement.

This idea is very similar to that of residual learning (He et al. 2016a). Residual learning tries to learn the residual of the identity mapping by reformulating a desired mapping $h(x)$ as $f(x) + x$, where $f(x)$ is a learnable residual function. If the identity mapping is desirable, it can easily be learned by decaying the weights of $f(x)$ to zeros. Similarly, ResCL tries to learn the residual of the source layer by reparameterizing $Wx$ as $W_s x + \alpha_s \odot (W_s x) + \alpha_t \odot (W_t x)$. If returning to the original source network is desirable to recover from forgetting, it can be done easily with a decay loss $\lambda\,\|(\alpha_s, \alpha_t)\|$. As learned residual functions in a residual network tend to have small responses (He et al. 2016a), if altering some feature is not very helpful for the target task, the decay loss on $\alpha$ will settle that feature near the original feature. Only the features necessary for solving the target task will have large deviations from the original features, and the importance of each feature is automatically learned by the decay loss and implicitly controlled by the trade-off hyperparameter $\lambda$ in Algorithm 1. As this residual-learning-like reparameterization and the decay loss on $\alpha$ play a very important role in our method, we call the proposed continual learning method Residual Continual Learning. The entire procedure of the proposed ResCL method is summarized in Fig. 1 and Algorithm 1.

Algorithm 1: Residual Continual Learning

- Input: $\mathrm{net}_s(\cdot\,;\theta_s^*)$ // given source network
- Input: $\lambda$ // source-target trade-off hyperparameter
- Input: $(X_t, Y_t)$ // training data of the target task

1. $\mathrm{net}_t(\cdot\,;\theta_t) \leftarrow \mathrm{net}_s(\cdot\,;\theta_s^*)$ // init $\mathrm{net}_t$ as a copy of $\mathrm{net}_s$
2. $\theta_t^* \leftarrow \arg\min_{\theta_t} D_{\mathrm{KL}}\!\left(Y_t \,\|\, \mathrm{net}_t(X_t;\theta_t)\right) + \tfrac{1}{2}\lambda_{\mathrm{dec}}\|\theta_t\|_2^2$ // fine-tuning from the source network
3. $\hat{Y}_s \leftarrow \mathrm{net}_s(X_t;\theta_s^*)$ // source network outputs for LwF
4. $\hat{Y}_t \leftarrow \mathrm{net}_t(X_t;\theta_t^*)$ // fine-tuned network outputs for distillation
5. $(\alpha_s, \alpha_t) \leftarrow \left(-\tfrac{1}{2}\mathbf{1}, \tfrac{1}{2}\mathbf{1}\right)$ // init combination parameters
6. $\theta_t \leftarrow \theta_t^*$ // init $\theta_t$ with the fine-tuned weights
7. $\mathrm{net}_c(\cdot\,;(\alpha_s, \theta_s^*, \alpha_t, \theta_t), \mathrm{task}=\cdot) \leftarrow \mathrm{COMBINE}\!\left(\alpha_s, \mathrm{net}_s(\cdot\,;\theta_s^*), \alpha_t, \mathrm{net}_t(\cdot\,;\theta_t)\right)$ // init $\mathrm{net}_c$ as in Fig. 1 and Fig. 2
8. $(\alpha_s^*, \alpha_t^*, \theta_t^*) \leftarrow \arg\min_{\alpha_s,\alpha_t,\theta_t} \big\{ D_{\mathrm{KL}}\!\left(\hat{Y}_s \,\|\, \mathrm{net}_c(X_t;(\alpha_s,\theta_s^*,\alpha_t,\theta_t),\mathrm{task}=s)\right) + D_{\mathrm{KL}}\!\left(\hat{Y}_t \,\|\, \mathrm{net}_c(X_t;(\alpha_s,\theta_s^*,\alpha_t,\theta_t),\mathrm{task}=t)\right) + \lambda\,\|(\alpha_s,\alpha_t)\|_1 + \tfrac{1}{2}\lambda_{\mathrm{dec}}\|\theta_t\|_2^2 \big\}$ // train the combined network

- Output: $\mathrm{net}_c(\cdot\,;(\alpha_s^*, \theta_s^*, \alpha_t^*, \theta_t^*), \mathrm{task}=\cdot)$
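The final minimization in Algorithm 1 can be sketched as a per-minibatch loss as below. This is a hedged sketch rather than the authors' implementation: PyTorch is an assumption, `combined_net` with a `task` keyword and the `"alpha"` parameter-naming convention are hypothetical, and the teacher logits $\hat{Y}_s$ and $\hat{Y}_t$ are assumed to have been recorded beforehand, as in Algorithm 1.

```python
import torch
import torch.nn.functional as F

T = 2.0  # softmax temperature used by both the LwF and distillation terms

def rescl_loss(combined_net, x, y_s_hat, y_t_hat, lam, lam_dec):
    """Loss of the last step of Algorithm 1 for one minibatch (sketch).

    y_s_hat: source-network logits on x (for the LwF term)
    y_t_hat: fine-tuned-network logits on x (for the distillation term)
    lam:     multiplier of the special L1 decay on the combination parameters
    lam_dec: ordinary weight decay on the trainable weights theta_t
    """
    logits_s = combined_net(x, task="source")   # net_c(x; task_s)
    logits_t = combined_net(x, task="target")   # net_c(x; task_t)

    # KL(teacher || student) on softened distributions (temperature 2)
    loss_s = F.kl_div(F.log_softmax(logits_s / T, dim=1),
                      F.softmax(y_s_hat / T, dim=1), reduction="batchmean")
    loss_t = F.kl_div(F.log_softmax(logits_t / T, dim=1),
                      F.softmax(y_t_hat / T, dim=1), reduction="batchmean")

    # special L1 decay on (alpha_s, alpha_t): because of the (1 + alpha_s)
    # parameterization, this pulls the combined net back towards the original
    # source network rather than towards a zero-weight network
    decay = sum(p.abs().sum()
                for n, p in combined_net.named_parameters() if "alpha" in n)

    # usual L2 weight decay on the trainable target weights theta_t
    # (in practice often handled by the optimizer's weight_decay instead)
    wd = sum((p ** 2).sum()
             for n, p in combined_net.named_parameters()
             if "alpha" not in n and p.requires_grad)

    return loss_s + loss_t + lam * decay + 0.5 * lam_dec * wd
```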
### Convolution and Batch Normalization

The extension of the proposed framework to convolutional layers is straightforward. Let $W_s, W_t \in \mathbb{R}^{C_o \times C_i \times H_k \times W_k}$ be the weight tensors of two convolutional layers and $\alpha_s, \alpha_t \in \mathbb{R}^{C_o}$ be the corresponding combination parameters. Note that the combination parameters are shared across the spatial dimensions in order to take advantage of the structure of CNNs. The two outputs of the convolutional layers are then combined in the same manner as Equation 1, with convolutions instead of matrix multiplications.

Figure 2: Combination of source and target pre-activation residual units (He et al. 2016b). Comb represents a combination layer. Two paths are combined by a combination layer before every nonlinearity. Learnable layers are shown in red. In the inference phase, the combined network (c), which is equivalent to (b) and has the same network size as (a1) and (a2), is used.

The case of convolutional layers with BN (Ioffe and Szegedy 2015) must also be considered, since BN is widely used in modern CNNs (Zagoruyko and Komodakis 2016; Huang et al. 2017; Chollet 2017). A BN layer is somewhat tricky, as it behaves differently in the training and inference phases: BN normalizes an input with the statistics of the current minibatch during training, whereas the population statistics, which are not learned by gradient descent, are used for inference. If training and test data originate from the same task, this is not a significant issue because the two statistics would be very similar. However, it is problematic in continual learning, since we are dealing with multiple different tasks whose distributions are not the same in general.

Our method can be applied with BN layers in a natural way. We do not need to worry about changes in the distribution, as each subnetwork has its own BN layers for its own task. Specifically, the original BN layer (BN(s) in Fig. 2) should use its population statistics of the source task during both the additional training and test phases. Otherwise, some of the source knowledge is lost, as the combined network cannot see and make use of the original statistics during additional training. The two BN and two convolutional layers with the combination layer can also be merged into one equivalent convolutional layer after training, since a BN layer is a deterministic linear layer in the inference phase and convolution is also a linear operation.
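The inference-time merge can be sketched as follows for the common case where a BN layer directly follows its convolution; the pre-activation ordering of Fig. 2 can be folded in the same spirit, since an inference-mode BN is simply a per-channel affine map. This is a sketch under assumptions: PyTorch is assumed, the BN layers are assumed to be affine with recorded population statistics, and the function names are hypothetical.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-mode BN (affine map using population statistics)
    into the adjacent convolution."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    folded = nn.Conv2d(conv.in_channels, conv.out_channels,
                       conv.kernel_size, conv.stride, conv.padding, bias=True)
    folded.weight.copy_(conv.weight * scale[:, None, None, None])
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    folded.bias.copy_(scale * (bias - bn.running_mean) + bn.bias)
    return folded

@torch.no_grad()
def merge_branches(conv_s, bn_s, conv_t, bn_t, alpha_s, alpha_t) -> nn.Conv2d:
    """Collapse the two conv+BN branches and the combination layer into one
    convolution for inference, so the network size does not grow."""
    fs, ft = fold_conv_bn(conv_s, bn_s), fold_conv_bn(conv_t, bn_t)
    merged = nn.Conv2d(conv_s.in_channels, conv_s.out_channels,
                       conv_s.kernel_size, conv_s.stride, conv_s.padding, bias=True)
    merged.weight.copy_((1 + alpha_s)[:, None, None, None] * fs.weight
                        + alpha_t[:, None, None, None] * ft.weight)
    merged.bias.copy_((1 + alpha_s) * fs.bias + alpha_t * ft.bias)
    return merged
```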
## Maximum Achievable Average Accuracy

We evaluate our method on sequential learning of image classification tasks and compare it with other methods, including fine-tuning, LwF, and Mean-IMM, that do not refer to any source task information, for fair comparison. Mode-IMM is not compared in the experiment because it requires the Fisher information matrix, which cannot be obtained without source data.

The source and target tasks are to classify the CIFAR-10, CIFAR-100 (Krizhevsky 2009), or SVHN (Netzer et al. 2011) dataset. A pre-activation residual network of 32 layers without bottlenecks (He et al. 2016b) is used. For the CIFAR datasets, data augmentation and hyperparameter settings are the same as those in (He et al. 2016b). Training images are horizontally flipped with a probability of 0.5 and randomly cropped to 32×32 from 40×40 zero-padded images during training. SGD with a momentum of 0.9, a minibatch size of 128, and a weight decay of $\lambda_{\mathrm{dec}} = 0.0001$ optimizes the networks for 64,000 iterations. Note that this usual weight decay loss for $\theta_t$ is different from the special decay loss for the combination parameters $\alpha$. The learning rate starts from 0.1 and is multiplied by 0.1 at 32,000 and 48,000 iterations. The He initialization method (He et al. 2015) is used to initialize source networks. The combination parameters $(\alpha_s, \alpha_t)$ in ResCL are initialized to $(-\tfrac{1}{2}\mathbf{1}, \tfrac{1}{2}\mathbf{1})$ in order to balance the original and new features at the early stage of training. For the SVHN dataset, all settings are the same as above, but the training data are not augmented.

We evaluate each method by the average accuracy, which is the average of the source and target accuracies, for three sequential learning scenarios: CIFAR-10→CIFAR-100, CIFAR-100→CIFAR-10, and CIFAR-10→SVHN. Since the source and target tasks have different class categories, each task has its own last task-specific fully connected layer. In LwF, the last layer of the target task is trained first with the other weights frozen (the warm-up step in (Li and Hoiem 2018)), as in the original paper. For a fair comparison, all other methods also start with this last-layer fine-tuning step. Mean-IMM matches the moments of the last-layer fine-tuning model and the LwF model, as in (Lee et al. 2017). ResCL combines two paths before every nonlinearity, as in Fig. 2. The last layer for the target task is not reparameterized because there are no layers for the target task in the original network.

All continual learning methods must control the trade-off between source and target performance through their own trade-off hyperparameters. Since each method takes a different approach and defines its trade-off hyperparameter differently, it is not a fair comparison to use just one specific trade-off hyperparameter setting.

Table 1: Maximum achievable average accuracies [%] for each method. Means and standard deviations of four runs. The optimal trade-off hyperparameters are in parentheses.

| Task (Hyperparam.) | CIFAR-10→CIFAR-100 | CIFAR-100→CIFAR-10 | CIFAR-10→SVHN |
|---|---|---|---|
| Joint Training | 80.02 ± 0.28 | 79.63 ± 0.06 | 93.72 ± 0.10 |
| Last-layer Fine-tuning | 61.09 ± 0.34 | 69.71 ± 0.33 | 66.46 ± 0.53 |
| Fine-tuning | 62.37 ± 1.06 | 49.90 ± 0.37 | 54.23 ± 0.40 |
| LwF (Li and Hoiem 2018), $\lambda$ | 77.67 ± 0.10 ($2^0$) | 76.18 ± 0.16 ($2^1$) | 68.90 ± 0.89 ($2^0$) |
| Mean-IMM (Lee et al. 2017), $\alpha_1/\alpha_2$ | 73.98 ± 0.30 ($2^{-9}$) | 76.41 ± 0.19 ($2^{-10}$) | 79.91 ± 1.47 ($2^0$) |
| ResCL (Ours), $\lambda/10^{-4}$ | 78.80 ± 0.17 ($2^0$) | 77.13 ± 0.12 ($2^{-4}$) | 89.49 ± 0.32 ($2^5$) |

Table 2: Source and target accuracies [%] for each method.

| Task | CIFAR-10→CIFAR-100 (source / target) | CIFAR-100→CIFAR-10 (source / target) | CIFAR-10→SVHN (source / target) |
|---|---|---|---|
| Joint Training | 91.70 ± 0.19 / 68.34 ± 0.40 | 68.75 ± 0.03 / 90.51 ± 0.12 | 91.93 ± 0.17 / 95.51 ± 0.11 |
| Last-layer Fine-tuning | 91.80 ± 0.14 / 30.39 ± 0.62 | 66.98 ± 0.25 / 72.43 ± 0.48 | 91.89 ± 0.14 / 41.11 ± 1.06 |
| Fine-tuning | 56.01 ± 2.41 / 68.74 ± 0.46 | 7.14 ± 0.62 / 92.67 ± 0.19 | 12.49 ± 0.84 / 95.96 ± 0.05 |
| LwF (Li and Hoiem 2018) | 87.81 ± 0.26 / 67.53 ± 0.21 | 63.91 ± 0.14 / 88.45 ± 0.19 | 43.12 ± 1.87 / 94.68 ± 0.14 |
| Mean-IMM (Lee et al. 2017) | 91.24 ± 0.18 / 56.72 ± 0.57 | 67.42 ± 0.17 / 85.41 ± 0.26 | 81.29 ± 2.03 / 78.53 ± 1.53 |
| ResCL (Ours) | 89.48 ± 0.04 / 68.13 ± 0.32 | 66.84 ± 0.39 / 87.41 ± 0.32 | 88.66 ± 0.56 / 90.32 ± 0.23 |
Here, we propose a fair measure over different continual learning methods that does not depend on hyperparameter definitions: the maximum achievable average accuracy, where the average is taken over all source tasks a model has learned so far and the current target task. We search for the optimal trade-off hyperparameters over $\{2^0, 2^{\pm 1}, \ldots, 2^{\pm 10}\}$ for each method to obtain the maximum achievable average accuracies, where the default hyperparameter is $2^0$ and a larger hyperparameter means that the source performance is more strongly protected, for all methods. With this experimental setting, we can obtain the true capacity of each method. The results are summarized in Tables 1 and 2. The performance of the joint training method is provided as an upper bound.

LwF works only in a small range of trade-off hyperparameters, since it directly changes the magnitude of a cross-entropy loss. As the hyperparameter moves away from its default value, the balance between the source and target losses is quickly broken; accordingly, the optimal trade-off hyperparameters for LwF are almost the same as the default value $2^0$. Our method can use a wider range of the hyperparameter than LwF by changing the multiplier of the decay loss instead of the cross-entropy loss. This naturally controls how far the reparameterized network is from the original one without any modification of the source and target losses.

As a wide range of trade-off hyperparameters works effectively in ResCL, we can relate the optimal $\lambda$ in Table 1 to the difficulty of each task. First, the CIFAR-10→CIFAR-100 scenario can be thought of as continual learning from an easier task to a harder task, because the CIFAR-10 data have 10 classes to classify and the CIFAR-100 data have 100 classes. The optimal trade-off hyperparameter is the same as the default value, which means that the default hyperparameter $\lambda = 10^{-4}$ can be used for such coarse-to-fine scenarios. The second scenario, CIFAR-100→CIFAR-10, is the converse of the first. The source task is relatively more difficult than the target task; thus, there is much informative knowledge in the LwF loss, and the original network already has good features for the target task (a target accuracy of 72.43% with the last-layer fine-tuning model). Therefore, we can pay less attention to preventing catastrophic forgetting, and the optimal trade-off hyperparameter is small (1/16 times the default value). The CIFAR-10→SVHN case is more challenging. These two tasks are very different: CIFAR-10 images consist of visual objects such as dogs and trucks, whereas the classes of the SVHN dataset are digits. Thus, this scenario is vulnerable to catastrophic forgetting, and other methods perform poorly on the source task. However, ResCL still maintains the source knowledge remarkably well and therefore significantly outperforms the other methods. The optimal trade-off hyperparameter is large (32 times the default value), as the additional training on the target task can easily degrade the source performance.
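For concreteness, the ideal measure can be written as a small selection routine. This is only a sketch: `train_and_eval` is a hypothetical callable that trains a model with a given trade-off value and returns the test accuracies of every task learned so far plus the current target task; because it needs source test sets, this measure reports capacity but is not obtainable in practice.

```python
def max_achievable_average_accuracy(train_and_eval, grid):
    """Ideal measure: best average per-task test accuracy over a hyperparameter grid."""
    averages = {}
    for lam in grid:
        accs = train_and_eval(lam)            # per-task test accuracies for this value
        averages[lam] = sum(accs) / len(accs)
    best = max(averages, key=averages.get)
    return averages[best], best               # (max achievable average accuracy, optimal value)
```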
The distribution of the statistical parameters ($\mu$ and $\sigma$) of all BN layers in one specific network is shown in Fig. 3 for the LwF method. The statistics of the two tasks are very different, even though the tasks are similar. As BN layers contain the statistics of the target task only in LwF, we cannot make use of the source population statistics even though they are also an important part of the original network. In contrast, our method provides BN layers to each task and further makes use of the source population statistics during training of the combined network.

Figure 3: The distribution of the statistical parameters ($\mu$ and $\sigma$) of all BN layers in one specific network. The black line is the distribution of the original network, and the red line is the distribution after the original network is trained on the target task with LwF.

ResCL can also be used for sequential learning of more than two tasks. In addition to the experiments with the three scenarios above, we evaluate our method on sequential learning of three tasks to demonstrate the scalability of the proposed method. ResCL still exhibits remarkable performance, as indicated in Table 3. In addition, there is no difficulty in applying the ResCL method to other CNN models or large-scale datasets. Table 3 also summarizes the results for sequential learning from the ILSVRC2012 dataset (Russakovsky et al. 2015) to the Caltech-UCSD Birds-200-2011 dataset (Wah et al. 2011) with the AlexNet (Krizhevsky, Sutskever, and Hinton 2012) and VGG (Simonyan and Zisserman 2014) architectures.

Table 3: Maximum achievable average accuracies [%] for each method. The second column represents sequential learning on three tasks.

| Method | CIFAR-10→CIFAR-100→SVHN (preResNet) | ImageNet→CUB (AlexNet) | ImageNet→CUB (VGG) |
|---|---|---|---|
| Last-layer Fine-tuning | 68.76 | 50.51 | 64.78 |
| Fine-tuning | 35.30 | 46.43 | 64.54 |
| LwF (Li and Hoiem 2018) | 58.44 | 48.47 | 68.82 |
| Mean-IMM (Lee et al. 2017) | - | 52.12 | 67.88 |
| ResCL (Ours) | 78.53 | 53.51 | 68.95 |

Figure 4: As each method has different trade-off hyperparameter definitions, it is not a fair comparison to use just one specific trade-off hyperparameter setting. (a) One of the fair measures is the maximum achievable average accuracy (circled points). (b1) In practice, the trade-off hyperparameter is adjusted using the target validation set until the required target accuracy is reached (dotted circled points). (b2) The models with those hyperparameters are tested once on the source test set.

## Source Accuracy at Required Target Accuracy

The maximum achievable average accuracy is a good fair measure that does not depend on trade-off hyperparameter definitions. However, we cannot achieve this maximum average accuracy in practice, since there are no source data available for searching for the optimal trade-off hyperparameter. This ideal measure gives the true capacity of a continual learning method, but it is not a practical one.

In practical applications, we can set a lower limit on the target accuracy that a model has to achieve. The trade-off hyperparameter can then still be effectively adjusted using target validation data only, which are available, until the required target accuracy is reached. After finding the model that meets the required target accuracy, the model is tested once on the source test set (Fig. 4). We set the required target accuracy to 95% of that of the fine-tuning model, since the fine-tuning model is trained to solve only the target task well. This evaluation setting gives another fair measure, the source accuracy at required target accuracy, which can be determined in practice. The results with this measure are summarized in Tables 4 and 5.
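The practical selection procedure illustrated in Fig. 4 (b1)-(b2) can be sketched as follows. All callables are hypothetical, and sweeping from strong to weak source protection is one reasonable reading of "adjust the hyperparameter until the required target accuracy is reached", not necessarily the authors' exact procedure.

```python
def source_acc_at_required_target_acc(train, target_val_acc, source_test_acc,
                                      grid, required_target_acc):
    """Practical measure: select the trade-off value with target validation data
    only, then evaluate the selected model once on the source test set."""
    # Larger values protect the source task more strongly, so sweep downwards
    # and keep the first model that already meets the target requirement.
    for lam in sorted(grid, reverse=True):
        model = train(lam)
        if target_val_acc(model) >= required_target_acc:
            return source_test_acc(model)   # single evaluation on the source test set
    return None  # no setting reached the required target accuracy
```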
Table 4: Source accuracies at required target accuracy [%] for each method.

| | CIFAR-10→CIFAR-100 | CIFAR-100→CIFAR-10 | CIFAR-10→SVHN |
|---|---|---|---|
| Required target accuracy (95% of the fine-tuning model) | 65.30 | 88.04 | 91.17 |
| LwF (Li and Hoiem 2018) | 89.59 | 63.91 | 38.08 |
| ResCL (Ours) | 90.65 | 68.13 | 76.83 |

Table 5: Source accuracies at required target accuracy [%] for each method. The second column represents sequential learning on three tasks.

| | CIFAR-10→CIFAR-100→SVHN (preResNet) | ImageNet→CUB (AlexNet) | ImageNet→CUB (VGG) |
|---|---|---|---|
| Required target accuracy (95% of the fine-tuning model) | 90.07 | 50.83 | 67.04 |
| LwF (Li and Hoiem 2018) | 41.84 | 40.17 | 66.55 |
| Mean-IMM (Lee et al. 2017) | - | 52.33 | 68.48 |
| ResCL (Ours) | 53.24 | 53.91 | 69.73 |

## Trade-off Hyperparameter λ and Combination Parameter α

In this section, we investigate whether the hyperparameter $\lambda$, the multiplier of the decay loss on the combination parameter $\alpha$, controls the source-target performance trade-off reasonably. For the CIFAR-10→SVHN case, the source, target, and average accuracies with respect to $\lambda$ are shown in Fig. 5. As shown there, $\lambda$ acts as a reasonable trade-off hyperparameter: the source accuracy increases with $\lambda$ and becomes saturated, whereas the target accuracy is a decreasing function of $\lambda$. As a result, the average accuracy curve has a concave shape.

Figure 5: Source, target, and average accuracies with respect to the trade-off hyperparameter $\lambda$ in the CIFAR-10→SVHN scenario.

By probing the magnitude of $\alpha$, we can observe how much each feature was changed to solve the target task. Fig. 6 shows the mean absolute value of the elements of the combination parameters with respect to the depth of their layers. For all scenarios in Fig. 6, the magnitude of the changes tends to increase with depth. In deep convolutional neural networks, it is known that shallow layers learn basic features such as colors, edges, and corners, whereas deep layers learn class-specific features such as dog faces and bird legs (Zeiler and Fergus 2014). Thus, shallow layers do not need to be changed much, since their features are already common to both tasks, but deep layers show large deviations from the original ones, since the classes of the target task are different from those of the source task.

Figure 6: Mean absolute value of the elements of the combination parameter $\alpha$ with respect to the depth.

## Conclusion

We have proposed a novel continual learning method, ResCL, which exhibits state-of-the-art performance for continual learning of image classification tasks. It prevents catastrophic forgetting even if the source and target tasks are very different. ResCL can be used in practice, as no information about the source task is required except the original network, and the size of the network does not increase in the inference phase. Moreover, general CNN architectures can be adopted, since our method is designed to handle convolution and BN layers. In this study, we limited the scope of the task to sequential learning of image classification with CNNs. However, the ResCL method can be naturally extended to support other types of neural networks, such as recurrent neural networks, since it simply linearly combines the outputs of two layers. We leave the application of the ResCL method to other fields for future work.
## Acknowledgment

This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921).

## References

Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), 139-154.

Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251-1258.

French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3(4):128-135.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630-645. Springer.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700-4708.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448-456.

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521-3526.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, 950-957.

Lee, S.-W.; Kim, J.-H.; Jun, J.; Ha, J.-W.; and Zhang, B.-T. 2017. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, 4655-4665.

Li, Z., and Hoiem, D. 2018. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12):2935-2947.

Mallya, A., and Lazebnik, S. 2018. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7765-7773.

McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24:109-165.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Rebuffi, S.-A.; Bilen, H.; and Vedaldi, A. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 506-516.
Ritter, H.; Botev, A.; and Barber, D. 2018. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, 3738-3748.

Rosenfeld, A., and Tsotsos, J. K. 2018. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211-252.

Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.

Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, 2990-2999.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676):354.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Terekhov, A. V.; Montone, G.; and O'Regan, J. K. 2015. Knowledge transfer in deep block-modular neural networks. In Conference on Biomimetic and Biohybrid Systems, 268-279. Springer.

Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320-3328.

Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In Wilson, R. C.; Hancock, E. R.; and Smith, W. A. P., eds., Proceedings of the British Machine Vision Conference (BMVC), 87.1-87.12. BMVA Press.

Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818-833. Springer.

Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 3987-3995. JMLR.org.