# Overcoming Multi-model Forgetting

Yassine Benyahia\*¹, Kaicheng Yu\*², Kamil Bennani-Smires³, Martin Jaggi⁴, Anthony Davison¹, Mathieu Salzmann², Claudiu Musat³

\*Equal contribution; work done while at Swisscom Digital Lab. ¹Institute of Mathematics, EPFL; ²Computer Vision Lab, EPFL; ³Artificial Intelligence Lab, Swisscom; ⁴Machine Learning and Optimization Lab, EPFL. Correspondence to: Yassine Benyahia, Kaicheng Yu.

## Abstract

We identify a phenomenon, which we refer to as *multi-model forgetting*, that occurs when sequentially training multiple deep networks with partially-shared parameters: the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters. To overcome this, we introduce a statistically-justified weight plasticity loss that regularizes the learning of a model's shared parameters according to their importance for the previous models, and demonstrate its effectiveness when training two models sequentially and for neural architecture search. Adding weight plasticity in neural architecture search preserves the best models to the end of the search and yields improved results in both natural language processing and computer vision tasks.

## 1. Introduction

Deep neural networks have been very successful for tasks such as visual recognition (Xie & Yuille, 2017) and natural language processing (Young et al., 2017), and much recent work has addressed the training of models that can generalize across multiple tasks (Caruana, 1997). In this context, when the tasks become available sequentially, a major challenge is *catastrophic forgetting*: when a model initially trained on task A is later trained on task B, its performance on task A can decline calamitously. Several recent articles have addressed this problem (Kirkpatrick et al., 2017; Rusu et al., 2016; He & Jaeger, 2017; Li & Hoiem, 2016). In particular, Kirkpatrick et al. (2017) show how to overcome catastrophic forgetting by approximating the posterior probability $p(\theta \mid \mathcal{D}_1, \mathcal{D}_2)$, with $\theta$ the network parameters and $\mathcal{D}_1, \mathcal{D}_2$ datasets representing the tasks.

In many situations one does not train a single model for multiple tasks but multiple models for a single task. This is the scenario we tackle in this paper. When dealing with many large models, a common strategy to keep training tractable is to share a subset of the weights across the multiple models and to train them sequentially (Pham et al., 2018; Xie & Yuille, 2017; Liu et al., 2018a). This strategy has a major drawback. Figure 1 shows that for two models, A and B, the larger the number of shared weights, the more the accuracy of A drops when training B; B overwrites some of the weights of A, and this damages the performance of A. We call this *multi-model forgetting*. The benefits of weight-sharing have been emphasized in tasks like neural architecture search, where the associated speed gains have been key in making the process practical (Pham et al., 2018; Liu et al., 2018b), but its downsides remain unexplored.

In this paper we introduce an approach to overcoming multi-model forgetting. Given a dataset $\mathcal{D}$, we first consider two models $f_1(\mathcal{D}; \theta_1, \theta_s)$ and $f_2(\mathcal{D}; \theta_2, \theta_s)$ with shared weights $\theta_s$ and private weights $\theta_1$ and $\theta_2$.
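To make the setting concrete, the following is a minimal PyTorch sketch of this two-model setup; the layer sizes, data, and optimizer are illustrative placeholders, not the architectures used in our experiments.

```python
import torch
import torch.nn as nn

# Shared trunk (theta_s), reused by both models.
shared = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
# Private parts: theta_1 for model A, theta_2 for model B.
head_a = nn.Linear(256, 10)
head_b = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

model_a = nn.Sequential(shared, head_a)  # f1(D; theta_1, theta_s)
model_b = nn.Sequential(shared, head_b)  # f2(D; theta_2, theta_s)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss_fn = nn.CrossEntropyLoss()

# Phase 1: train model A (updates theta_s and theta_1).
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_a.zero_grad(); loss_fn(model_a(x), y).backward(); opt_a.step()

# Phase 2: train model B with plain SGD (updates theta_s and theta_2);
# the updates to theta_s overwrite what model A learned, which is
# exactly the multi-model forgetting illustrated in Figure 1.
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
opt_b.zero_grad(); loss_fn(model_b(x), y).backward(); opt_b.step()
```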
We formulate learning as the maximization of the posterior $p(\theta_1, \theta_2, \theta_s \mid \mathcal{D})$. Under mild assumptions we show that this posterior can be approximated and expressed using a loss, dubbed Weight Plasticity Loss (WPL), that minimizes multi-model forgetting. Our framework evaluates the importance of each weight, conditioned on the previously-trained model, and encourages the update of each shared weight to be inversely proportional to its importance. We then show that our approach extends to more than two models by exploiting it for neural architecture search.

Our work is the first to propose a solution to multi-model forgetting. We establish the merits of our approach when training two models with partially shared weights and in the context of neural architecture search. For the former, we establish the effectiveness of WPL in the *strict convergence* case, where each model is trained until convergence, and in the more realistic *loose convergence* setting, where training is stopped early. WPL can reduce the forgetting effect by 99% when model A converges fully, and by 52% in the loose convergence case.

Figure 1. (Left) Two models to be trained (A, B), where A's parameters are in green and B's in purple, and B shares some parameters with A (indicated in green during phase 2). We first train A to convergence and then train B. (Right) Accuracy of model A as the training of B progresses. The different colors correspond to different numbers of shared layers. The accuracy of A decreases dramatically, especially when more layers are shared, and we refer to the drop (the red arrow) as multi-model forgetting. This experiment was performed on MNIST (LeCun & Cortes, 2010).

For neural architecture search, we implement WPL within the efficient ENAS method of Pham et al. (2018), a state-of-the-art technique that relies on parameter sharing and corresponds to the loose convergence setting. We show that, at each iteration, the use of WPL reduces the forgetting effect by 51% on the most affected model and by 95% on average over all sampled models. Our final results on the best architecture found by the search confirm that limiting multi-model forgetting yields better results and better convergence for both language modeling (on the PTB dataset (Marcus et al., 1994)) and image classification (on the CIFAR10 dataset (Krizhevsky et al., 2009)). For language modeling, the perplexity decreases from 65.01 for ENAS without WPL to 61.9 with WPL. For image classification, WPL yields a drop in top-1 error from 4.87% to 3.81%. We also adapt our method to NAO (Luo et al., 2018) and show that it significantly reduces multi-model forgetting there as well. Our code is publicly available at https://github.com/kcyu2014/multimodel-forgetting.

## 2. Related work

**Single-model Forgetting.** The goal of training a single model to tackle multiple problems is to leverage the structures learned for one task for other tasks. This has been employed in transfer learning (Pan & Yang, 2010), multi-task learning (Caruana, 1997) and lifelong learning (Silver et al., 2013). However, sequential learning of later tasks has visible negative consequences for the initial one.
Kirkpatrick et al. (2017) selectively slow down the learning of the weights that are comparatively important for the first task, defining the importance of an individual weight via its Fisher information (Rissanen, 1996). He & Jaeger (2017) project the gradient so that directions relevant to the previous task are unaffected. Other families of methods save the older models separately to create progressive networks (Rusu et al., 2016) or use regularization to force the parameters to remain close to the values obtained on previous tasks while learning new ones (Li & Hoiem, 2016). In (Xu & Zhu, 2018), forgetting is avoided altogether by fixing the parameters of the first model while complementing the second one with additional operations found by an architecture search procedure. This work, however, does not address the multi-model forgetting that occurs during the architecture search itself. An extreme case of sequential learning is lifelong learning, for which the solution to catastrophic forgetting developed by Aljundi et al. (2018) is also to prioritize the weight updates, with smaller updates for weights that are important for previously-learned tasks. Teh et al. (2017) propose a reinforcement learning approach for the multi-task, multi-model scenario, but it relies on knowledge distillation, which assumes only two models; if it is applied to every pair of consecutive models, the knowledge of any model outside the current pair is again forgotten.

**Parameter Sharing in Neural Architecture Search.** In both sequential learning on multiple tasks and lifelong learning, the forgetting concerns an individual model. Here we tackle scenarios where one seeks to optimize a population of multiple models that share parts of their internal structure. The use of multiple models to solve a single task dates back to model ensembles (Dietterich, 2000). Recently, sharing weights between models that are candidate solutions to a problem has shown great promise in the generation of custom neural architectures, known as neural architecture search (Elsken et al., 2018). Existing neural architecture search strategies mostly divide into reinforcement learning and evolutionary techniques. For instance, Zoph & Le (2017) use reinforcement learning to explore a search space of candidate architectures, with each architecture encoded as a string by an RNN trained with REINFORCE (Williams, 1992), taking validation performance as the reward. MetaQNN (Baker et al., 2017) uses Q-learning to design CNN architectures. By contrast, neuro-evolution strategies use evolutionary algorithms (Bäck, 1996) to perform the search. An example is Liu et al. (2018a), who introduce a hierarchical representation of neural networks and use tournament selection (Goldberg & Deb, 1991) to evolve the architectures. Initial search solutions required hundreds of GPUs due to the huge search space, but recent efforts have made the search more tractable, for example via the use of neural blocks (Negrinho & Gordon, 2017; Bennani-Smires et al., 2018). Similarly, and directly related to this work, weight sharing between the candidates has allowed researchers to greatly decrease the computational cost of neural architecture search. For neuro-evolution methods, sharing is implicit. For example, Real et al. (2017) define weight inheritance as allowing the children to inherit their parents' weights whenever possible.
For RL-based techniques, weight sharing is modeled explicitly and has been shown to lead to significant gains. In particular, ENAS (Pham et al., 2018), which builds upon NAS (Zoph & Le, 2017), represents the search space as a single directed acyclic graph (DAG) in which each candidate architecture is a subgraph. EAS (Cai et al., 2018) also uses an RL strategy, growing the network depth or layer width with the function-preserving transformations of Chen et al. (2016), which initialize new models with the previous parameters. DARTS (Liu et al., 2018b) uses soft assignment to select paths, which implicitly inherit the previous weights. NAO (Luo et al., 2018) replaces the reinforcement-learning portion of ENAS with a gradient-based auto-encoder that directly exploits weight sharing. While weight sharing has proven effective, its downsides have never truly been studied. Bender et al. (2018) observed that training was unstable and proposed to circumvent this issue by randomly dropping network paths, but they did not analyze the reasons for the instability. Here, by contrast, we highlight the underlying multi-model forgetting problem and introduce a statistically-justified solution that further improves on path dropout.

## 3. Methodology

In this section we study the training of multiple models that share certain parameters. As discussed above, training the multiple models sequentially, as in (Pham et al., 2018) for example, is suboptimal, since multi-model forgetting arises. Below we derive a method to overcome this for two models, and then show how our formalism extends to multiple models in the context of neural architecture search, in particular within ENAS (Pham et al., 2018).

### 3.1. Weight Plasticity Loss: Preventing Multi-model Forgetting

Given a dataset $\mathcal{D}$, we seek to train two architectures $f_1(\mathcal{D}; \theta_1, \theta_s)$ and $f_2(\mathcal{D}; \theta_2, \theta_s)$ with shared parameters $\theta_s$ and private parameters $\theta_1$ and $\theta_2$. We suppose that the models are trained sequentially, which reflects common large-model, large-dataset scenarios and will facilitate generalization. Below, we derive a statistically-motivated framework that prevents multi-model forgetting; it stops the training of the second model from degrading the performance of the first model.

We formulate training as finding the parameters $\theta = (\theta_1, \theta_2, \theta_s)$ that maximize the posterior probability $p(\theta \mid \mathcal{D})$, which we approximate to derive our new loss function. Below we discuss the different steps of this approximation, first expressing $p(\theta \mid \mathcal{D})$ more conveniently.

**Lemma 1.** *Given a dataset $\mathcal{D}$ and two architectures with shared parameters $\theta_s$ and private parameters $\theta_1$ and $\theta_2$, and if $p(\theta_1, \theta_2 \mid \theta_s, \mathcal{D}) = p(\theta_1 \mid \theta_s, \mathcal{D})\, p(\theta_2 \mid \theta_s, \mathcal{D})$, we have*

$$p(\theta_1, \theta_2, \theta_s \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta_2, \theta_s)\, p(\theta_1, \theta_s \mid \mathcal{D})\, p(\theta_2, \theta_s)}{\int p(\mathcal{D} \mid \theta_1, \theta_s)\, p(\theta_1, \theta_s)\, d\theta_1}. \tag{1}$$

*Proof.* Provided in the appendix.

Lemma 1 presupposes that $p(\theta_1, \theta_2 \mid \theta_s, \mathcal{D}) = p(\theta_1 \mid \theta_s, \mathcal{D})\, p(\theta_2 \mid \theta_s, \mathcal{D})$, i.e., that $\theta_1$ and $\theta_2$ are conditionally independent given $\theta_s$ and the dataset $\mathcal{D}$. While this must be checked in applications, it is suitable for our setting, since we want both networks, $f_1(\mathcal{D}; \theta_1, \theta_s)$ and $f_2(\mathcal{D}; \theta_2, \theta_s)$, to train independently well.
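For intuition, here is a short sketch of the factorization behind Lemma 1 (the full proof is in the appendix). By the chain rule and the conditional-independence assumption,

$$p(\theta_1, \theta_2, \theta_s \mid \mathcal{D}) = p(\theta_1 \mid \theta_s, \mathcal{D})\, p(\theta_2 \mid \theta_s, \mathcal{D})\, p(\theta_s \mid \mathcal{D}) = \frac{p(\theta_1, \theta_s \mid \mathcal{D})\, p(\theta_2, \theta_s \mid \mathcal{D})}{p(\theta_s \mid \mathcal{D})}.$$

Applying Bayes' rule to the second factor of the numerator, $p(\theta_2, \theta_s \mid \mathcal{D}) = p(\mathcal{D} \mid \theta_2, \theta_s)\, p(\theta_2, \theta_s) / p(\mathcal{D})$, and marginalizing the denominator, $p(\theta_s \mid \mathcal{D}) = \int p(\mathcal{D} \mid \theta_1, \theta_s)\, p(\theta_1, \theta_s)\, d\theta_1 / p(\mathcal{D})$, the two factors of $p(\mathcal{D})$ cancel and equation (1) follows.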
To derive our loss we study the components on the right of equation (1). We start with the integral in the denominator, for which we seek a closed form. Suppose we have trained the first model and seek to update the parameters of the second one while avoiding forgetting. The following lemma provides an expression for the denominator of equation (1).

**Lemma 2.** *Suppose we have the maximum likelihood estimate $(\hat\theta_1, \hat\theta_s)$ for the first model, write $\mathrm{Card}(\theta_1) + \mathrm{Card}(\theta_s) = p_1 + p_s = p$, and let the negative Hessian $H_p(\hat\theta_1, \hat\theta_s)$ of the log posterior $\log p(\theta_1, \theta_s \mid \mathcal{D})$ evaluated at $(\hat\theta_1, \hat\theta_s)$ be partitioned into four blocks corresponding to $(\theta_1, \theta_s)$ as*

$$H_p(\hat\theta_1, \hat\theta_s) = \begin{pmatrix} H_{11} & H_{1s} \\ H_{s1} & H_{ss} \end{pmatrix}.$$

*If the parameters of each model follow normal distributions, i.e., $(\theta_1, \theta_s) \sim N_p(0, \sigma^2 I_p)$, with $I_p$ the $p$-dimensional identity matrix, then the denominator of equation (1), $A = \int p(\mathcal{D} \mid \theta_1, \theta_s)\, p(\theta_1, \theta_s)\, d\theta_1$, can be written as*

$$A = \exp\!\left\{ l_p(\hat\theta_1, \hat\theta_s) - \tfrac{1}{2}\, v^\top \Omega\, v \right\} (2\pi)^{p_1/2}\, \big|\det\!\big(H_{11}^{-1}\big)\big|^{1/2}, \tag{2}$$

*where $v = \theta_s - \hat\theta_s$, $l_p(\theta) = l(\theta) - \theta^\top \theta / (2\sigma^2)$, and $\Omega = H_{ss} - H_{s1} H_{11}^{-1} H_{1s}$.*

*Proof.* Provided in the appendix.

Lemma 2 requires the maximum likelihood estimate $(\hat\theta_1, \hat\theta_s)$, which can be hard to obtain with deep networks, since they have non-convex objective functions. In practice, one can train the network to convergence and treat the resulting parameters as maximum likelihood estimates. Our experiments show that even parameters obtained without optimizing to convergence can be used effectively. Moreover, Haeffele & Vidal (2017) showed that networks relying on positively homogeneous functions have critical points that are either global minimizers or saddle points, and that training to convergence yields near-optimal solutions, which correspond to true maximum likelihood estimates.

Following Lemmas 1 and 2, as shown in the appendix,

$$\log p(\theta \mid \mathcal{D}) \approx \log p(\mathcal{D} \mid \theta_2, \theta_s) + \log p(\theta_2, \theta_s) + \log p(\theta_1, \theta_s \mid \mathcal{D}) + \tfrac{1}{2}\, v^\top \Omega\, v, \tag{3}$$

apart from an additive constant.

To derive a loss function that prevents multi-model forgetting, consider equation (3). The first term on its right-hand side corresponds to the log likelihood of the second model and can be replaced by the cross-entropy $L_2(\theta_2, \theta_s)$, and if we use a Gaussian prior on the parameters, the second term encodes an L2 regularization. Since equation (3) depends only on the log likelihood of the second model $f_2(\mathcal{D}; \theta_2, \theta_s)$, the information learned from the first model $f_1(\mathcal{D}; \theta_1, \theta_s)$ must reside in the conditional posterior $\log p(\theta_1, \theta_s \mid \mathcal{D})$, and the final term, $\tfrac{1}{2} v^\top \Omega v$, must represent the interactions between the models $f_1(\mathcal{D}; \theta_1, \theta_s)$ and $f_2(\mathcal{D}; \theta_2, \theta_s)$. This term will not appear in a standard single-model forgetting scenario.

Let us examine these terms more closely. The posterior $p(\theta_1, \theta_s \mid \mathcal{D})$ is intractable, so we apply a Laplace approximation (MacKay, 1992); we approximate the log posterior by a second-order Taylor expansion around the maximum likelihood estimate $(\hat\theta_1, \hat\theta_s)$. This yields

$$\log p(\theta_1, \theta_s \mid \mathcal{D}) \approx \log p(\hat\theta_1, \hat\theta_s \mid \mathcal{D}) - \tfrac{1}{2}\, (\theta_1', \theta_s')^\top H_p\, (\theta_1', \theta_s'), \tag{4}$$

where $(\theta_1', \theta_s') = (\theta_1, \theta_s) - (\hat\theta_1, \hat\theta_s)$ and $H_p(\hat\theta_1, \hat\theta_s)$ is the negative Hessian of the log posterior evaluated at the maximum likelihood estimate (MLE). As the first derivative is evaluated at the MLE, it equals zero. Equation (4) yields a Gaussian approximation to the posterior with mean $(\hat\theta_1, \hat\theta_s)$ and covariance matrix $H_p^{-1}$, i.e.,

$$p(\theta_1, \theta_s \mid \mathcal{D}) \propto \exp\!\left\{ -\tfrac{1}{2}\, (\theta_1', \theta_s')^\top H_p\, (\theta_1', \theta_s') \right\}. \tag{5}$$

Our parameter space is too large to compute the inverse of the negative Hessian $H_p$, so we replace it with the diagonal of the Fisher information, $\mathrm{diag}(F)$. This approximation falsely presupposes that the parameters $(\theta_1, \theta_s)$ are independent, but it has already proven effective (Kirkpatrick et al., 2017; Pascanu & Bengio, 2014). One of its main advantages is that we can compute the Fisher information from the squared gradients, thereby avoiding any need for second derivatives.
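As an illustration, this diagonal Fisher estimate can be computed as follows; a minimal PyTorch sketch, assuming a trained model and a held-out data loader, with illustrative names.

```python
import torch

def diagonal_fisher(model, loss_fn, data_loader, n_batches=10):
    """Estimate diag(F) from squared first-order gradients of the
    negative log likelihood; no second derivatives are needed."""
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Average the squared gradients over the batches.
                fisher[name] += p.grad.detach() ** 2 / n_batches
    return fisher
```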
Using equation (5) and the Fisher approximation, we can express the log posterior as

$$\log p(\theta_1, \theta_s \mid \mathcal{D}) \approx -\frac{\alpha}{2} \sum_{\theta_{s_i} \in \theta_s} F_{\theta_{s_i}} \big(\theta_{s_i} - \hat\theta_{s_i}\big)^2, \tag{6}$$

where $F_{\theta_{s_i}}$ is the diagonal element corresponding to parameter $\theta_{s_i}$ in the diagonal approximation of the Fisher information matrix, which can be obtained from the trained model $f_1(\mathcal{D}; \theta_1, \theta_s)$, and $\alpha$ is a hyper-parameter.

Now consider the last term in equation (3), noting that $\Omega = H_{ss} - H_{s1} H_{11}^{-1} H_{1s}$, as defined in Lemma 2. As our previous approximation relies on the assumption of a diagonal Fisher information matrix, we have $H_{1s} = 0$, leading to $\Omega = H_{ss}$, so

$$\frac{1}{2}\, v^\top \Omega\, v = \frac{1}{2} \sum_{\theta_{s_i} \in \theta_s} F_{\theta_{s_i}} \big(\theta_{s_i} - \hat\theta_{s_i}\big)^2. \tag{7}$$

The last two terms on the right-hand side of equation (3), as expressed in equations (6) and (7), can then be grouped. Combining the result with the first two terms, discussed below equation (3), yields our Weight Plasticity Loss,

$$L_{\mathrm{WPL}}(\theta_2, \theta_s) = L_2(\theta_2, \theta_s) + \frac{\lambda}{2} \big( \|\theta_s\|^2 + \|\theta_2\|^2 \big) + \frac{\alpha}{2} \sum_{\theta_{s_i} \in \theta_s} F_{\theta_{s_i}} \big(\theta_{s_i} - \hat\theta_{s_i}\big)^2, \tag{8}$$

where $F_{\theta_{s_i}}$ is the diagonal element corresponding to parameter $\theta_{s_i}$ in the Fisher information matrix obtained from the trained first model $f_1(\mathcal{D}; \theta_1, \theta_s)$. We omit the terms depending on $\theta_1$ in equation (6) because we are optimizing with respect to $(\theta_2, \theta_s)$ at this stage. The Fisher information in the last term encodes the importance of each shared weight for the first model's performance, so WPL encourages preserving the shared parameters that were important for the first model, while allowing the others to undergo larger changes and thus to improve the accuracy of the second model.
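In code, equation (8) could look as follows; a minimal PyTorch sketch, assuming the diagonal Fisher estimates and the first model's shared-weight values $\hat\theta_s$ have been stored, with all names illustrative.

```python
def wpl_loss(task_loss, model2, fisher, theta_hat, shared_names,
             lam=1e-4, alpha=1.0):
    """Weight Plasticity Loss, equation (8): the second model's
    cross-entropy, an L2 term on all of its parameters, and a
    Fisher-weighted penalty pulling each shared weight toward its
    value after training the first model."""
    l2 = sum(p.pow(2).sum() for p in model2.parameters())
    plasticity = sum(
        (fisher[name] * (p - theta_hat[name]).pow(2)).sum()
        for name, p in model2.named_parameters() if name in shared_names
    )
    return task_loss + 0.5 * lam * l2 + 0.5 * alpha * plasticity
```

Here `task_loss` plays the role of $L_2(\theta_2, \theta_s)$, and `fisher` and `theta_hat` come from the trained first model, e.g., via the Fisher sketch above.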
#### 3.1.1. Relation to Elastic Weight Consolidation

The final loss in equation (8) may appear similar to that obtained by Kirkpatrick et al. (2017) when formulating their Elastic Weight Consolidation (EWC) to address catastrophic forgetting. However, the problem we address here is fundamentally different. Kirkpatrick et al. (2017) tackle sequential learning on different tasks, where a single model is sequentially trained on two datasets, and their goal is to maximize the posterior $p(\theta \mid \mathcal{D}) = p(\theta \mid \mathcal{D}_1, \mathcal{D}_2)$. By relying on Laplace approximations in neural networks (MacKay, 1992) and the connection between the Fisher information matrix and second-order derivatives (Pascanu & Bengio, 2014), EWC is then formulated as the loss

$$L(\theta) = L_B(\theta) + \sum_i \frac{\lambda}{2}\, F_i \big(\theta_i - \theta^*_{A,i}\big)^2,$$

where A and B refer to two different tasks, $\theta$ encodes the network parameters, and $F_i$ is the Fisher information of $\theta_i$.

Here, by contrast, we consider scenarios with a single dataset but two models with shared parameters, as shown in Figure 2, and aim to maximize the posterior $p(\theta_1, \theta_2, \theta_s \mid \mathcal{D})$. The resulting WPL combines the original loss of the second model, a Fisher-weighted MSE term on the shared parameters, and an L2 regularizer on the parameters of the second model. More importantly, the last term in equation (3), $v^\top \Omega v$, is specific to the multi-model case, since it encodes the interaction between the two models; it never appears in the EWC derivation. Because we adopt a Laplace approximation based on the diagonal Fisher information matrix, as shown in equation (7), this term can be grouped with that of equation (6). In principle, however, other approximations of $v^\top \Omega v$ could be used, such as a Laplace one with a full covariance matrix, which would yield a final loss that differs fundamentally from the EWC one.

Figure 2. Comparison between EWC and WPL. The ellipses in each subplot represent parameter regions corresponding to low error. (Top left) Both methods start with a single model, with parameters $\theta_A = \{\theta_s, \theta_1\}$, trained on a single dataset $\mathcal{D}_1$. (Bottom left) EWC regularizes all parameters based on $p(\theta_A \mid \mathcal{D}_1)$ to train the same initial model on a new dataset $\mathcal{D}_2$. (Top right) By contrast, WPL makes use of the initial dataset $\mathcal{D}_1$ and regularizes only the shared parameters $\theta_s$, based on both $p(\theta_A \mid \mathcal{D}_1)$ and $v^\top \Omega v$, while the parameters $\theta_2$ can vary freely.

### 3.2. WPL for Neural Architecture Search

In the previous section we considered only two models trained sequentially, but in practice one often seeks to train three or more models. Our approach is then unchanged, but each model shares parameters with several other models, which entails using diagonal approximations to the Fisher information matrices of all previously-trained models in equation (3). In the remainder of this section, we discuss how our approach can be used for neural architecture search.

Consider using our WPL within the ENAS strategy of Pham et al. (2018). ENAS is a reinforcement-learning-based method that consists of two training processes: 1) sequentially training sampled models with shared parameters; and 2) training a controller RNN that generates model candidates. Incorporating our WPL within ENAS affects only 1). The first step of ENAS consists of sampling a fixed number of architectures from the RNN controller and training each architecture on $B$ batches. This implies that our requirement for access to the maximum likelihood estimates of the previously-trained models is not satisfied, but we verify that in practice our WPL remains effective in this scenario. After sufficiently many epochs it is likely that all the parameters of a newly-sampled architecture are shared with previously-trained ones, so we can consider all parameters of new models to be shared.

At the beginning of the search the parameters of all models are randomly initialized, so adopting WPL directly from the start would make it hard for the process to learn anything, as it would encourage some parameters to remain random. To better satisfy our assumption that the parameters of previously-trained models should be optimal, we follow the original ENAS training strategy for $n$ epochs, with $n = 5$ for RNN search and $n = 3$ for CNN search in our experiments. We then incorporate our WPL and store the optimal parameters after each architecture is trained. We also update the Fisher information, which adds virtually no computational overhead, because $F_{\theta_i} = (\partial L / \partial \theta_i)^2$, where $L = \sum_i L_i$, with $i$ indexing the previously-sampled architectures, and the derivatives are already computed for back-propagation. To ensure that these updates use the contributions from all previously-sampled architectures, we use a momentum-based update, $F^{t}_{\theta_i} = (1 - \eta)\, F^{t-1}_{\theta_i} + \eta\, (\partial L / \partial \theta_i)^2$, with $\eta = 0.9$. Since this is not computed at the MLE of the parameters, we flush the global Fisher buffer to zero every three epochs, yielding an increasingly accurate estimate of the Fisher information as optimization proceeds. We also use a scheduled decay for $\alpha$ in equation (8).
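This running Fisher estimate might be implemented as follows; a sketch under the schedule stated above ($\eta = 0.9$, flush every three epochs), with illustrative names, reusing the gradients already computed for back-propagation.

```python
import torch

class FisherBuffer:
    """Momentum-based running estimate of diag(F) across the sampled
    architectures: F_t = (1 - eta) * F_{t-1} + eta * grad ** 2."""
    def __init__(self, shared_model, eta=0.9, flush_every=3):
        self.eta, self.flush_every = eta, flush_every
        self.fisher = {name: torch.zeros_like(p)
                       for name, p in shared_model.named_parameters()}

    def update(self, shared_model):
        # Called after backward(), so the gradients come for free.
        for name, p in shared_model.named_parameters():
            if p.grad is not None:
                self.fisher[name] = ((1 - self.eta) * self.fisher[name]
                                     + self.eta * p.grad.detach() ** 2)

    def maybe_flush(self, epoch):
        # The estimate is not computed at the MLE, so restart it
        # periodically; it grows more accurate as training proceeds.
        if epoch % self.flush_every == 0:
            for name in self.fisher:
                self.fisher[name].zero_()
```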
## 4. Experiments

We first evaluate our Weight Plasticity Loss (WPL) in the general scenario of training two models sequentially, both in the strict convergence case and when the weights of the first model are sub-optimal. We then evaluate the performance of our approach within the ENAS framework.

### 4.1. General Scenario: Training Two Models

To test WPL in the general scenario, we used the MNIST handwritten digit recognition dataset (LeCun & Cortes, 2010). We designed two feed-forward networks with 4 (Model A) and 6 (Model B) layers, respectively; all the layers of A are shared with B.

Let us first evaluate our approach in the strict convergence case. To this end, we trained A until convergence, thus obtaining a solution close to the MLE $\hat\theta_A = (\hat\theta_1, \hat\theta_s)$, since all our operations are positively homogeneous (Haeffele & Vidal, 2017). To compute the Fisher information, we used the backward gradients of $\theta_s$ calculated on 200 images from the validation set. We then initialized the $\theta_s$ of Model B, $f_B(\mathcal{D}; \theta_2, \theta_s)$, to $\hat\theta_s$ and trained B by standard SGD with respect to all its parameters. Figure 3(a) compares the performance of training Model B with and without WPL. Without WPL the performance of A degrades as the training of B progresses, but using WPL allows us to maintain the initial performance of A, indicated as Baseline in the plot. This entails no loss of performance for B, whose final accuracy is virtually the same with and without WPL.

Figure 3. From strict to loose convergence. We conduct experiments on MNIST with models A and B with shared parameters, and report the accuracy of Model A before training Model B (baseline, green) and the accuracy of Models A and B while training Model B with (orange) or without (blue) WPL. In (a) we show the results for strict convergence: A is initially trained to convergence. We then relax this assumption and train A to around 55% (b), 43% (c), and 38% (d) of its optimal accuracy. We see that WPL is highly effective when A is trained to at least 40% of optimality; below that, the Fisher information becomes too inaccurate to provide reliable importance weights. Thus WPL helps to reduce multi-model forgetting even when the weights are not optimal. WPL reduced forgetting by up to 99.99% for (a) and (b), and by up to 2% for (c).

The assumption of optimal weights is usually hard to enforce, so we now turn to the more realistic loose convergence scenario. To evaluate the influence of sub-optimal weights for Model A on our approach, we trained Model A to different, increasingly lower, top-1 accuracies. As shown in Figure 3(b) and (c), even in this setting our approach still significantly reduces multi-model forgetting. We quantify the relative reduction of such forgetting as $(d_A - d_{A+\mathrm{WPL}}) / d_A$, where $d$ denotes the drop in A's accuracy after training B, measured without ($d_A$) or with ($d_{A+\mathrm{WPL}}$) WPL. WPL can reduce multi-model forgetting by up to 99% for a converged model, and by 52% even in the loose convergence case. This suggests that the Fisher information remains a reasonable empirical approximation to the weights' importance even when our optimality assumption is not satisfied.

### 4.2. WPL for Neural Architecture Search

We demonstrate the effectiveness of WPL in a real-world application, neural architecture search. We incorporate WPL in the ENAS framework (Pham et al., 2018), which relies on weight-sharing across model candidates to speed up the search and thus, while effective, suffers from multi-model forgetting even with random dropping of weights and output dropout. To show this, we examine how previously-trained architectures are affected by the training of new ones: we evaluate the prediction error of each sampled architecture on a fraction of the validation dataset immediately after it is trained, denoted $err_1$, and again at the end of the epoch, denoted $err_2$. A positive difference $err_2 - err_1$ for a specific architecture indicates that it has been forced to forget by the others.
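For concreteness, the bookkeeping behind this $err_2 - err_1$ metric amounts to the following sketch; `train_arch` and `validation_error` are hypothetical helpers standing in for the ENAS training and evaluation steps.

```python
def epoch_error_differences(architectures, train_arch, validation_error):
    """Measure multi-model forgetting within one search epoch:
    err1 right after each architecture is trained, err2 once all
    architectures of the epoch have been trained."""
    err1 = {}
    for arch in architectures:
        train_arch(arch)
        err1[arch] = validation_error(arch)
    err2 = {arch: validation_error(arch) for arch in architectures}
    # Positive differences mean an architecture was forced to forget.
    return {arch: err2[arch] - err1[arch] for arch in architectures}
```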
We performed two experiments: RNN cell search on the PTB dataset and CNN micro-cell search on the CIFAR10 dataset. We report the mean error difference over all sampled architectures, the mean error difference over the 5 architectures with the lowest $err_1$, and the maximum error difference over all sampled architectures. Figure 4(a), (b) and (c) plot these as functions of the training epochs for the RNN case; similar plots for the CNN search are in the appendix.

Figure 4. Error difference during neural architecture search. For each architecture, we compute the RNN error differences $err_2 - err_1$, where $err_1$ is the error right after training this architecture and $err_2$ the error after all architectures are trained in the current epoch. We plot (a) the mean difference over all sampled models, (b) the mean difference over the 5 models with lowest $err_1$, and (c) the max difference over all models. The plots show that WPL reduces multi-model forgetting; the error differences are much closer to 0. Quantitatively, the reduction in forgetting can be up to 95% for (a), 59% for (b) and 51% for (c). In (d), we plot the average reward of the sampled architectures as a function of training iterations. Although WPL initially leads to lower rewards, due to a large weight $\alpha$ in equation (8), by reducing the forgetting it later allows the controller to sample better architectures, as indicated by the higher reward in the second half.

The plots show that without WPL the error differences are much larger than 0, clearly displaying the multi-model forgetting effect. This is particularly pronounced in the first half of training, which can have a dramatic effect on the final results, as this corresponds to the phase where the algorithm searches for promising architectures. WPL significantly reduces the forgetting, as shown by the much lower error differences. With WPL, these differences tend to decrease over time, emphasizing that the observed Fisher information encodes an increasingly reliable notion of weight importance as training progresses. Owing to limited computational resources we estimate the Fisher information using only small validation batches, but larger batches could further improve our results.

In Figure 4(d), we plot the average reward of all sampled architectures as a function of the training iterations. In the first half of training, the models trained with WPL tend to have lower rewards, which can be explained by the large value of $\alpha$ in equation (8) during this phase; while such a large value may prevent the best models from achieving as high a reward as possible, it has the advantage of preventing the forgetting of good models, and thus avoids their being discarded early. Indeed, in the second half of training, when we reduce $\alpha$, the mean reward of the architectures trained with WPL is higher than without it. In other words, our approach allows us to maintain better models until the end of training. When the search is over, we train the best architecture from scratch and evaluate its final accuracy.
Table 1 compares the results obtained without (ENAS) and with WPL (ENAS + WPL) with those from the original ENAS paper (ENAS\*), which were obtained after an extensive hyper-parameter search. For both datasets, using WPL improves the final model accuracy, showing the importance of overcoming multi-model forgetting. In the case of PTB, our approach even outperforms ENAS\* without extensive hyper-parameter tuning. Based on the gap between ENAS and ENAS\*, we anticipate that such a tuning procedure could further boost our results. In any event, we believe that these results already clearly show the benefits of reducing multi-model forgetting.

Table 1. Results of the best models found. We take the best model obtained during the search and train it from scratch. ENAS\* corresponds to the results of Pham et al. (2018) obtained after an extensive hyper-parameter search, while ENAS and ENAS + WPL were trained in comparable conditions. For both RNN and CNN search, our WPL gives a significant boost to ENAS, showing the importance of overcoming multi-model forgetting. In the RNN case, our approach outperforms ENAS\* without requiring extensive hyper-parameter tuning.

| Dataset | Metric | ENAS\* | ENAS | ENAS + WPL |
| --- | --- | --- | --- | --- |
| PTB | perplexity | 63.26 | 65.01 | 61.9 |
| CIFAR10 | top-1 error (%) | 3.54 | 4.87 | 3.81 |

### 4.3. Neural Architecture Optimization

Our approach is general, and its use in the context of neural architecture search is not limited to ENAS. To demonstrate this, we applied it to the neural architecture optimization (NAO) method of Luo et al. (2018), which also exploits weight-sharing in its search phase. In this context, we therefore investigate (i) whether multi-model forgetting occurs, and if so, (ii) the effectiveness of our approach in the NAO framework. Due to resource and time constraints, we focus our experiments mainly on the search phase, as training the best model found from scratch takes around 4 GPU days. To evaluate the influence of the dropout strategy of Bender et al. (2018), we test NAO with and without random path-dropping and with four output dropout rates, from 0 to 0.75 in steps of 0.25.

Figure 5. Comparison of different output dropout rates for NAO. We plot the mean validation perplexity while searching for the best architecture (top) and the best 5 models' error differences (bottom) for four different dropout rates (0, 0.25, 0.50, 0.75). Note that path dropping in NAO prevents learning shortly after model initialization at all dropout rates. At every dropout rate, our WPL achieves lower error differences, i.e., it reduces multi-model forgetting, and it also speeds up training.

As in Section 4.2, in Figure 5 we plot the mean validation perplexity and the best five models' error differences for all models sampled during a single training epoch. For random path-dropping, since Luo et al. (2018) exploit a more aggressive dropping policy than that of Bender et al. (2018), the validation perplexity quickly plateaus. Hence we do not add WPL to the path-dropping strategy, but use it in conjunction with output dropout. At all four dropout rates, WPL clearly reduces multi-model forgetting and accelerates training. The level of forgetting decreases with the dropout rate, but our loss always reduces it further.
Among the three methods, NAO with path-dropping suffers the least from forgetting, but only because it does not learn properly. By contrast, WPL reduces multi-model forgetting while still allowing the models to learn. This shows that our approach generalizes beyond ENAS for neural architecture search.

## 5. Conclusion

This paper has identified the problem of multi-model forgetting in the context of sequentially training multiple models: the shared weights of previously-trained models are overwritten during the training of subsequent models, leading to performance degradation. We have shown that the degree of degradation is linked to the proportion of shared weights, and have introduced a statistically-motivated weight plasticity loss (WPL) to overcome it. Our experiments on multi-model training and on neural architecture search clearly show the effectiveness of WPL in reducing multi-model forgetting and yielding better architectures, leading to improved results in both natural language processing and computer vision tasks. We believe that the impact of WPL goes beyond the tasks studied in this paper. In future work, we plan to integrate WPL within other neural architecture search strategies in which weight sharing occurs and to study its use in other multi-model contexts, such as ensemble learning.

## References

- Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. European Conference on Computer Vision (ECCV), 2018.
- Bäck, T. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
- Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. International Conference on Learning Representations (ICLR), Conference track, 2017.
- Bender, G., Kindermans, P.-J., Zoph, B., Vasudevan, V., and Le, Q. Understanding and simplifying one-shot architecture search. International Conference on Machine Learning (ICML), pp. 549–558, 2018.
- Bennani-Smires, K., Musat, C., Hossmann, A., and Baeriswyl, M. GitGraph - from computational subgraphs to smaller architecture search spaces. International Conference on Learning Representations (ICLR), Workshop track, 2018.
- Cai, H., Chen, T., Zhang, W., Yu, Y., and Wang, J. Efficient architecture search by network transformation. AAAI, 2018.
- Caruana, R. Multitask learning. Machine Learning, 28(1):41–75, 1997.
- Chen, T., Goodfellow, I. J., and Shlens, J. Net2Net: Accelerating learning via knowledge transfer. International Conference on Learning Representations (ICLR), Conference track, 2016.
- Dietterich, T. G. Ensemble methods in machine learning. Multiple Classifier Systems, pp. 1–15, 2000.
- Elsken, T., Hendrik Metzen, J., and Hutter, F. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
- Goldberg, D. E. and Deb, K. A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms, pp. 69–93, 1991.
- Haeffele, B. D. and Vidal, R. Global optimality in neural network training. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4390–4398, 2017.
- He, X. and Jaeger, H. Overcoming catastrophic interference by conceptors. arXiv preprint arXiv:1707.04853, 2017.
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
- Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). 2009.
- LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010.
- Li, Z. and Hoiem, D. Learning without forgetting. European Conference on Computer Vision (ECCV), pp. 614–629. Springer, 2016.
- Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. International Conference on Learning Representations (ICLR), Conference track, 2018a.
- Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018b.
- Luo, R., Tian, F., Qin, T., and Liu, T.-Y. Neural architecture optimization. arXiv preprint arXiv:1808.07233, 2018.
- MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
- Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The Penn Treebank: Annotating predicate argument structure. Proceedings of the Workshop on Human Language Technology, pp. 114–119. Association for Computational Linguistics, 1994.
- Negrinho, R. and Gordon, G. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
- Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
- Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. International Conference on Learning Representations (ICLR), Conference track, 2014.
- Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. International Conference on Machine Learning (ICML), 2018.
- Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. International Conference on Machine Learning (ICML), 2017.
- Rissanen, J. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 1996.
- Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- Silver, D., Yang, Q., and Li, L. Lifelong machine learning systems: Beyond learning algorithms. AAAI Spring Symposium Series, 2013.
- Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
- Xie, L. and Yuille, A. Genetic CNN. IEEE International Conference on Computer Vision (ICCV), 2017.
- Xu, J. and Zhu, Z. Reinforced continual learning. Advances in Neural Information Processing Systems (NIPS), 2018.
- Young, T., Hazarika, D., Poria, S., and Cambria, E. Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709, 2017.
- Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. International Conference on Learning Representations (ICLR), Conference track, 2017.