# generalized_variational_continual_learning__0b094850.pdf

Published as a conference paper at ICLR 2021

GENERALIZED VARIATIONAL CONTINUAL LEARNING

Noel Loo, Siddharth Swaroop & Richard E. Turner
University of Cambridge
{nl355,ss2163,ret26}@cam.ac.uk

ABSTRACT

Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. On larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, while also providing significantly better calibration.

1 INTRODUCTION

Continual learning methods enable learning when a set of tasks changes over time. This topic is of practical interest as many real-world applications require models to be regularly updated as new data is collected or new tasks arise. Standard machine learning models and training procedures fail in these settings (French, 1999), so bespoke architectures and fitting procedures are required.

This paper makes two main contributions to continual learning for neural networks. First, we develop a new regularization-based approach to continual learning. Regularization approaches adapt parameters to new tasks while keeping them close to settings that are appropriate for old tasks. Two popular approaches of this type are Variational Continual Learning (VCL) (Nguyen et al., 2018) and Online Elastic Weight Consolidation (Online EWC) (Kirkpatrick et al., 2017; Schwarz et al., 2018). The former is based on a variational approximation of a neural network's posterior distribution over weights, while the latter uses Laplace's approximation. In this paper, we propose Generalized Variational Continual Learning (GVCL), of which VCL and Online EWC are two special cases. Under this unified framework, we are able to combine the strengths of both approaches. GVCL is closely related to likelihood-tempered Variational Inference (VI), which has been found to improve performance in standard learning settings (Zhang et al., 2018; Osawa et al., 2019). We also see significant performance improvements in continual learning.

Our second contribution is to introduce an architectural modification to the neural network that combats the deleterious overpruning effect of VI (Trippe & Turner, 2018; Turner & Sahani, 2011). We analyze pruning in VCL and show how task-specific FiLM layers mitigate it. Combining this architectural change with GVCL results in a hybrid architectural-regularization based algorithm. This additional modification results in performance that exceeds or is within statistical error of strong baselines such as HAT (Serra et al., 2018) and PathNet (Fernando et al., 2017).

The paper is organized as follows.
Section 2 outlines the derivation of GVCL, shows how it unifies many continual learning algorithms, and describes why it might be expected to perform better than them. Section 3 introduces FiLM layers, first from the perspective of multi-task learning, and then through the lens of variational over-pruning, showing how FiLM layers mitigate this pathology of VCL. Finally, in Section 5 we test GVCL and GVCL with FiLM layers on many standard benchmarks, including ones with few samples, a regime that could benefit more from continual learning. We find that GVCL with FiLM layers outperforms existing baselines on a variety of metrics, including raw accuracy, forwards and backwards transfer, and calibration error. In Section 5.4 we show that FiLM layers provide a disproportionate improvement to variational methods, confirming our hypothesis in Section 3.[1]

[1] Code is available at https://github.com/yolky/gvcl

2 GENERALIZED VARIATIONAL CONTINUAL LEARNING

In this section, we introduce Generalized Variational Continual Learning (GVCL) as a likelihood-tempered version of VCL, with further details in Appendix C. We show how GVCL recovers Online EWC. We also discuss further links between GVCL and the Bayesian cold posterior in Appendix D.

2.1 LIKELIHOOD-TEMPERING IN VARIATIONAL CONTINUAL LEARNING

Variational Continual Learning (VCL). Bayes' rule calculates a posterior distribution over model parameters $\theta$ based on a prior distribution $p(\theta)$ and some dataset $\mathcal{D}_T = \{X_T, y_T\}$. Bayes' rule naturally supports online and continual learning by using the previous posterior $p(\theta|\mathcal{D}_{1:T-1})$ as a new prior when seeing new data (Nguyen et al., 2018). Since Bayes' rule is intractable in complicated models such as neural networks, approximations are employed, and VCL (Nguyen et al., 2018) uses one such approximation, Variational Inference (VI). VI approximates the posterior $p(\theta|\mathcal{D}_{1:T})$ with a simpler distribution $q_T(\theta)$, such as a Gaussian, found by optimizing the ELBO:

$$\mathrm{ELBO}_{\mathrm{VCL}} = \mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - D_{\mathrm{KL}}(q_T(\theta) \,\|\, q_{T-1}(\theta)), \qquad (1)$$

where $q_{T-1}(\theta)$ is the approximation to the previous task posterior. Intuitively, this objective finds a distribution over weights that balances good predictive performance (the first, expected log-likelihood term) against remaining close to the prior (the second, KL-divergence regularization term).

Likelihood-tempered VCL. Optimizing the ELBO will recover the true posterior if the approximating family is sufficiently rich. However, the simple families used in practice typically lead to poor test-set performance. Practitioners have found that performance can be improved by down-weighting the KL-divergence regularization term by a factor $\beta$, with $0 < \beta < 1$. Examples of this are seen in Zhang et al. (2018) and Osawa et al. (2019), where the latter uses a data-augmentation factor for down-weighting. In a similar vein, sampling from cold posteriors in SG-MCMC has also been shown to outperform the standard Bayes posterior, where the cold posterior is given by $p_T(\theta|\mathcal{D}) \propto p(\theta|\mathcal{D})^{1/T}$ with $T < 1$ (Wenzel et al., 2020). Values of $\beta > 1$ have also been used to improve the disentanglement of representations learned by variational autoencoders (Higgins et al., 2017). We down-weight the KL-divergence term in VCL, optimizing the $\beta$-ELBO:[2]

$$\beta\text{-ELBO} = \mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - \beta D_{\mathrm{KL}}(q_T(\theta) \,\|\, q_{T-1}(\theta)).$$

VCL is trivially recovered when $\beta = 1$.

[2] We slightly abuse notation by writing the likelihood as $p(\mathcal{D}_T|\theta)$ instead of $p(y_T|\theta, X_T)$.
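To make the objective concrete, the sketch below writes the (negated) $\beta$-ELBO for a factorized Gaussian posterior, using the reparameterization trick and the closed-form Gaussian KL. This is our minimal illustration, not the released implementation; `nll_fn` is a hypothetical stand-in for the negative log-likelihood of the current task's data.

```python
import torch

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    # Closed-form KL(q || p) between factorized Gaussians, summed over parameters.
    var_q, var_p = log_var_q.exp(), log_var_p.exp()
    return 0.5 * (log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

def neg_beta_elbo(nll_fn, mu, log_var, prior_mu, prior_log_var, beta, n_samples=1):
    # Monte Carlo estimate of E_q[-log p(D_T | theta)] plus the beta-weighted KL term.
    nll = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)
        theta = mu + eps * (0.5 * log_var).exp()  # reparameterization trick
        nll = nll + nll_fn(theta) / n_samples
    return nll + beta * gaussian_kl(mu, log_var, prior_mu, prior_log_var)
```

After finishing task $T$, the optimized `(mu, log_var)` become `(prior_mu, prior_log_var)` for task $T+1$; setting `beta = 1` recovers the VCL objective above.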
We will now show that, surprisingly, as $\beta \to 0$ we recover a special case of Online EWC. Then, by modifying the objective further as required to recover the full version of Online EWC, we will arrive at our algorithm, Generalized VCL.

2.2 ONLINE EWC IS A SPECIAL CASE OF GVCL

We analyze the effect of KL-reweighting on VCL in the case where the approximating family is restricted to Gaussian distributions over $\theta$. We consider training all tasks with a KL-reweighting factor of $\beta$, and then take the limit $\beta \to 0$, recovering Online EWC. Let the approximate posteriors at the previous and current tasks be $q_{T-1}(\theta) = \mathcal{N}(\theta; \mu_{T-1}, \Sigma_{T-1})$ and $q_T(\theta) = \mathcal{N}(\theta; \mu_T, \Sigma_T)$ respectively, where we learn $\{\mu_T, \Sigma_T\}$. The optimal $\Sigma_T$ under the $\beta$-ELBO has the form (see Appendix C)

$$\Sigma_T^{-1} = \frac{1}{\beta} \nabla_{\mu_T} \nabla_{\mu_T} \mathbb{E}_{q_T(\theta)}[-\log p(\mathcal{D}_T|\theta)] + \Sigma_{T-1}^{-1}. \qquad (2)$$

Now take the limit $\beta \to 0$. From Equation 2, $\Sigma_T \to 0$, so $q_T(\theta)$ becomes a delta function, and

$$\Sigma_T^{-1} = \frac{1}{\beta} \nabla_{\mu_T} \nabla_{\mu_T} [-\log p(\mathcal{D}_T|\theta = \mu_T)] + \Sigma_{T-1}^{-1} = \frac{1}{\beta} H_T + \Sigma_{T-1}^{-1} = \frac{1}{\beta} \sum_{t=1}^{T} H_t + \Sigma_0^{-1}, \qquad (3)$$

where $H_T$ is the $T$th task Hessian.[3] Although the learnt distribution $q_T(\theta)$ becomes a delta function (and not a full Gaussian distribution as in Laplace's approximation), we will see that a cancellation of $\beta$ factors in the $\beta$-ELBO leads to the eventual equivalence between GVCL and Online EWC. Consider the terms in the $\beta$-ELBO that involve only $\mu_T$:

$$\beta\text{-ELBO} = \mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - \frac{\beta}{2}(\mu_T - \mu_{T-1})^\top \Sigma_{T-1}^{-1}(\mu_T - \mu_{T-1})$$
$$= \log p(\mathcal{D}_T|\theta = \mu_T) - \frac{1}{2}(\mu_T - \mu_{T-1})^\top \Big(\sum_{t=1}^{T-1} H_t + \beta\Sigma_0^{-1}\Big)(\mu_T - \mu_{T-1}), \qquad (4)$$

where we have set the form of $\Sigma_{T-1}$ to be as in Equation 3. Equation 4 is an instance of the objective function used by a number of continual learning methods, most notably Online EWC[4] (Kirkpatrick et al., 2017; Schwarz et al., 2018), Online-Structured Laplace (Ritter et al., 2018), and SOLA (Yin et al., 2020). These algorithms can be recovered by restricting the approximate posterior class $\mathcal{Q}$ to Gaussians with diagonal covariance matrices, block-diagonal Kronecker-factored covariance matrices, and low-rank precision matrices, respectively (see Appendices C.4 and C.5).

[3] The actual Hessian may not be positive semidefinite while $\Sigma$ is, so here we refer to a positive semidefinite approximation of the Hessian.
[4] EWC uses the Fisher information, but our derivation results in the Hessian. The two matrices coincide when the model has near-zero training loss, as is often the case (Martens, 2020).

Based on this analysis, $\beta$ can be seen as interpolating between VCL, with $\beta = 1$, and continual learning algorithms which use point-wise approximations of curvature, as $\beta \to 0$. In Appendix A we explore how $\beta$ controls the scale of the quadratic curvature approximation, verifying with experiments on a toy dataset. Small $\beta$ values learn distributions with good local structure, while higher $\beta$ values learn distributions with a more global structure. We explore this in more detail in Appendices A and B, where we show the convergence of GVCL to Online EWC on a toy experiment.

Inference using GVCL. When performing inference with GVCL at test time, we use samples from the unmodified $q(\theta)$ distribution. This means that when $\beta = 1$, we recover the VCL predictive, and as $\beta \to 0$, the posterior collapses as described earlier, so the weight samples are effectively deterministic. This is in line with the inference procedure of Online EWC and its variants. In practice, we use values of $\beta$ between 0.05 and 0.2 in Section 5, meaning that some uncertainty is retained, but not all. We can increase the uncertainty at inference time by using an additional tempering step, which we describe, along with further generalizations, in Appendix D.
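To make the limiting behaviour of Equation 3 concrete, the following script (our illustration, not an experiment from the paper) fits $\beta$-tempered VI to a conjugate Gaussian model, where the Hessian of the negative log-likelihood is $H = n/\sigma^2$, and checks that the fitted precision matches $\frac{1}{\beta}H + \Sigma_0^{-1}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sigma2, prior_var = 0.5, 1.0
y = rng.normal(0.3, np.sqrt(sigma2), size=200)  # toy dataset, n = 200

def neg_beta_elbo(params, beta):
    m, log_v = params
    v = np.exp(log_v)
    # E_q[-log p(D | theta)] for a Gaussian likelihood, up to an additive constant
    exp_nll = ((y - m) ** 2).sum() / (2 * sigma2) + len(y) * v / (2 * sigma2)
    kl = 0.5 * (v / prior_var + m ** 2 / prior_var - 1 - log_v + np.log(prior_var))
    return exp_nll + beta * kl

for beta in [1.0, 0.1, 0.01]:
    m, log_v = minimize(neg_beta_elbo, x0=[0.0, 0.0], args=(beta,)).x
    predicted_prec = len(y) / (beta * sigma2) + 1 / prior_var  # (1/beta) H + prior precision
    print(beta, 1 / np.exp(log_v), predicted_prec)  # the two precisions agree
```

As $\beta$ shrinks, the fitted variance collapses towards zero, mirroring the delta-function limit described above.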
2.3 REINTERPRETING λ AS COLD POSTERIOR REGULARIZATION

As described above, the $\beta$-ELBO recovers instances of a number of existing second-order continual learning algorithms, including Online EWC, as special cases. However, the correspondence does not recover a key hyperparameter $\lambda$ used by these methods to up-weight the quadratic regularization term. Instead, our derivation produces an implicit value of $\lambda = 1$, i.e., equal weight between tasks of equal sample count. In practice it is found that algorithms such as Online EWC perform best when $\lambda > 1$, typically 10–1000. In this section, we view this $\lambda$ hyperparameter as a form of cold posterior regularization.

In the previous section, we showed that $\beta$ controls the length-scale over which we approximate the curvature of the posterior. However, the magnitude of the quadratic regularizer stays the same, because the $O(\beta^{-1})$ precision matrix and the $\beta$ coefficient in front of the KL-term cancel out. Taking inspiration from cold posteriors (Wenzel et al., 2020), which temper both the likelihood and the prior and improve accuracy with Bayesian neural networks, we suggest tempering the prior in GVCL. Rather than measuring the KL divergence between the posterior $q_T$ and the prior $q_{T-1}$, we suggest regularizing towards a tempered version of the prior, $q_{T-1}^\lambda$. However, this form of regularization has a problem: in continual learning, over the course of many tasks, old tasks would be increasingly (exponentially) tempered. To combat this, we also use the tempered version of the posterior, $q_T^\lambda$, in the KL divergence. This allows us to gain the benefits of tempering the prior while remaining stable over multiple tasks in continual learning. As we now show, tempering in this way recovers the $\lambda$ hyperparameter of algorithms such as Online EWC.

Note that raising a distribution to the power $\lambda$ is equivalent to tempering by $\tau = \lambda^{-1}$. For Gaussians, tempering by a temperature $\tau = \lambda^{-1}$ is the same as scaling the covariance by $\lambda^{-1}$. We can therefore expand our new KL divergence:

$$D_{\mathrm{KL}}(q_T^\lambda \,\|\, q_{T-1}^\lambda) = \frac{1}{2}\Big[(\mu_T - \mu_{T-1})^\top \lambda\Sigma_{T-1}^{-1}(\mu_T - \mu_{T-1}) + \mathrm{Tr}(\lambda\Sigma_{T-1}^{-1}\lambda^{-1}\Sigma_T) + \log\frac{|\Sigma_{T-1}|}{|\Sigma_T|} - d\Big]$$
$$= \frac{1}{2}\Big[(\mu_T - \mu_{T-1})^\top \lambda\Sigma_{T-1}^{-1}(\mu_T - \mu_{T-1}) + \mathrm{Tr}(\Sigma_{T-1}^{-1}\Sigma_T) + \log\frac{|\Sigma_{T-1}|}{|\Sigma_T|} - d\Big] =: D_{\mathrm{KL}\lambda}(q_T \,\|\, q_{T-1}).$$

In the limit $\beta \to 0$, our $\lambda$ coincides with Online EWC's $\lambda$ if the tasks have the same number of samples. However, this form of $\lambda$ has a slight problem: it also increases the regularization strength of the initial prior $\Sigma_0$ on the mean parameter update. We empirically found that this negatively affects performance. We therefore propose a different version of $\lambda$, which only up-weights the data-dependent parts of $\Sigma_{T-1}^{-1}$; this can be viewed as likelihood-tempering the previous task posterior, as opposed to tempering both the initial prior and likelihood components. This new version still converges to Online EWC as $\beta \to 0$, since the $O(1)$ prior becomes negligible compared to the $O(\beta^{-1})$ Hessian terms. We define

$$\Sigma_{T,\lambda}^{-1} := \lambda(\Sigma_T^{-1} - \Sigma_0^{-1}) + \Sigma_0^{-1},$$

which in the $\beta \to 0$ limit equals $\frac{\lambda}{\beta}\sum_{t=1}^{T} H_t + \Sigma_0^{-1}$. In practice, it is necessary to clip negative values of $\Sigma_T^{-1} - \Sigma_0^{-1}$ to keep $\Sigma_{T,\lambda}^{-1}$ positive definite. This is only required because of errors during optimization.
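In code, for the diagonal Gaussians used in practice, this clipped up-weighting might look as follows (a minimal sketch under our own naming, not the released implementation):

```python
import numpy as np

def lambda_tempered_prior_precision(prec_T, prec_0, lam):
    # Sigma_{T,lambda}^{-1} = lam * (Sigma_T^{-1} - Sigma_0^{-1}) + Sigma_0^{-1},
    # for diagonal precision vectors. The data-dependent part is clipped at zero
    # so the result stays positive definite despite optimization error.
    data_prec = np.clip(prec_T - prec_0, 0.0, None)
    return lam * data_prec + prec_0
```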
We then use the modified KL-divergence

$$\tilde{D}_{\mathrm{KL}\lambda}(q_T \,\|\, q_{T-1}) = \frac{1}{2}\Big[(\mu_T - \mu_{T-1})^\top \Sigma_{T-1,\lambda}^{-1}(\mu_T - \mu_{T-1}) + \mathrm{Tr}(\Sigma_{T-1}^{-1}\Sigma_T) + \log\frac{|\Sigma_{T-1}|}{|\Sigma_T|} - d\Big].$$

Note that Online EWC has another hyperparameter, $\gamma$, which down-weights the previous Fisher matrices. As shown in Appendix C, we can introduce this hyperparameter by taking the KL divergence between priors and posteriors at different temperatures, $q_{T-1}^\lambda$ and $q_T^{\gamma\lambda}$. However, we do not find that this approach improves performance. Combining everything, we have our objective for GVCL:

$$\mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - \beta \tilde{D}_{\mathrm{KL}\lambda}(q_T(\theta) \,\|\, q_{T-1}(\theta)).$$

3 FILM LAYERS FOR CONTINUAL LEARNING

The Generalized VCL algorithm proposed in Section 2 is applicable to any model. Here we discuss a multi-task neural network architecture that is especially well-suited to GVCL when the task ID is known at both training and inference time: neural networks with task-specific FiLM layers.

3.1 BACKGROUND TO FILM LAYERS

The most common architecture for continual learning is the multi-headed neural network. A shared set of body parameters acts as the feature extractor. For every task, features are generated in the same way before being passed to a separate head network for each task. This architecture does not allow for task-specific differentiation in the feature extractor, which is limiting (consider, for example, the different tasks of handwritten digit recognition and image recognition). FiLM layers (Perez et al., 2018) address this limitation by linearly modulating features for each specific task, so that useful features can be amplified and inappropriate ones ignored. In fully-connected layers, the transformation is applied element-wise: for a hidden layer with width $W$ and activation values $h_i$, $1 \le i \le W$, FiLM layers perform the transformation $h_i' = \gamma_i h_i + b_i$ before passing the result on to the remainder of the network. For convolutional layers, transformations are applied filter-wise. Consider a layer with $N$ filters of size $K \times K$, resulting in activations $h_{i,j,k}$, $1 \le i \le N$, $1 \le j \le W$, $1 \le k \le H$, where $W$ and $H$ are the dimensions of the resulting feature map. The transformation has the form $h_{i,j,k}' = \gamma_i h_{i,j,k} + b_i$. The number of required parameters scales with the number of filters, as opposed to the full activation dimension, making FiLM layers computationally cheap and parameter-efficient. FiLM layers have previously been shown to help with fine-tuning for transfer learning (Rebuffi et al., 2017), multi-task meta-learning (Requeima et al., 2019), and few-shot learning (Perez et al., 2018). In Appendix F, we show that FiLM layer parameters are interpretable, with similarities between FiLM layer parameters for similar tasks in a multi-task setup.

Figure 1: Visualizations of deviation from the prior distribution for filters in the first layer of a convolutional network trained on Hard-CHASY. Lighter colours indicate an active filter for that task. Models are trained either (a) sequentially using GVCL, or (b) sequentially with GVCL + FiLM. FiLM layers increase the number of active units.

3.2 COMBINING GVCL AND FILM LAYERS

It is simple to apply GVCL to models which utilize FiLM layers. Since these layers are specific to each task, they do not need the distributional treatment or regularization that is necessary to support continual learning of the shared parameters. Instead, point estimates are found by optimizing the GVCL objective function. This has a well-defined optimum, unlike joint MAP training when FiLM layers are added (see Appendix E for a discussion).
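A task-conditional FiLM layer of the kind described in Section 3.1 takes only a few lines; the sketch below (PyTorch, with a class name and interface of our own choosing) stores one point-estimated (scale, shift) pair per channel per task, so these parameters receive no KL penalty under GVCL.

```python
import torch
import torch.nn as nn

class TaskFiLM(nn.Module):
    # Per-task FiLM layer: one (scale, shift) pair per channel per task.
    # Parameters are point-estimated, so they carry no KL penalty under GVCL.
    def __init__(self, num_tasks, num_channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_tasks, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_tasks, num_channels))

    def forward(self, h, task_id):
        g, b = self.gamma[task_id], self.beta[task_id]
        if h.dim() == 4:  # conv features (batch, channels, H, W): filter-wise modulation
            g, b = g.view(1, -1, 1, 1), b.view(1, -1, 1, 1)
        return g * h + b
```

Only the row of `gamma`/`beta` selected by `task_id` receives gradients for a given task, so each task can rescale or switch off shared features without perturbing other tasks' modulations.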
We might expect improved continual learning performance from introducing task-specific FiLM layers simply because they make the multi-task model more suitable. However, when combined with GVCL, there is an additional benefit. When applied to multi-head networks, VCL tends to prune out large parts of the network (Trippe & Turner, 2018; Turner & Sahani, 2011), and GVCL inherits this behaviour. This occurs in the following way. First, the weights entering a node revert to their prior distribution due to the KL-regularization term in the ELBO. These weights then add noise to the network, harming the likelihood term of the ELBO. To avoid this, the bias concentrates at a negative value so that the ReLU activation effectively shuts off the node. In the single-task setting, this is often relatively benign and can even facilitate compression (Louizos et al., 2017; Molchanov et al., 2017). In continual learning, however, the effect is pathological: the bias remains negative due to its low variance, so the node is effectively shut off from that point forward and cannot re-activate. Ultimately, large sections of the network can be shut off after the first task and cannot be used for future tasks, wasting network capacity (see Figure 1a). In contrast, with task-specific FiLM layers, pruning can be achieved by setting the FiLM scale to 0 or the FiLM bias to a negative value. Since there is no KL-penalty on these parameters, it is optimal to prune in this way. Critically, both the incoming weights and the bias of a pruned node can then return to the prior without adding noise to the network, meaning that the node can be re-activated in later tasks. The increase in the number of unpruned units can be seen in Figure 1b. In Appendix G we provide more evidence of this mechanism.

4 RELATED WORK

Regularization-based continual learning. Many algorithms regularize network parameters based on a metric of importance. Section 2 shows how some of these methods can be seen as special cases of GVCL; we now focus on other related methods. Lee et al. (2017) proposed IMM, an extension to EWC which merges posteriors based on their Fisher information matrices. Ahn et al. (2019), like us, use regularizers based on the ELBO, but measure importance on a per-node basis rather than a per-weight one. SI (Zenke et al., 2017) measures importance using Synaptic Saliency, as opposed to methods based on approximate curvature.

Architectural approaches to continual learning. This family of methods modifies the standard neural architecture by adding components to the network. Progressive Neural Networks (Rusu et al., 2016) adds a parallel column network for every task, growing the model size over time. PathNet (Fernando et al., 2017) fixes the model size while optimizing the paths between layer columns. Architectural approaches are often used in tandem with regularization-based approaches, as in HAT (Serra et al., 2018), which uses per-task gating parameters alongside a compression-based regularizer. Adel et al. (2020) propose CLAW, which also uses variational inference alongside per-task parameters, but requires a more complex meta-learning based training procedure involving multiple splits of the dataset. GVCL with FiLM layers adds to this list of hybrid architectural-regularization based approaches. See Appendix H for a more comprehensive related work section.
5 EXPERIMENTS

We run experiments in the small-data regime (Easy-CHASY and Hard-CHASY) (Section 5.1), on Split-MNIST (Section 5.1), on the larger Split-CIFAR benchmark (Section 5.2), and on a much larger Mixed Vision benchmark consisting of 8 different image classification datasets (Section 5.3). To compare continual learning performance, we report final average accuracy, forward transfer (the improvement on the current task as the number of past tasks increases (Pan et al., 2020)), and backward transfer (the difference between a task's accuracy when it is first trained and its accuracy after the final task (Lopez-Paz & Ranzato, 2017)). We compare to many baselines but, due to space constraints, only report the best-performing baselines in the main text. We also compare to two offline methods: an upper-bound "joint" version trained on all tasks jointly, and a lower-bound "separate" version with each task trained separately (no transfer). Further baseline results are in Appendix J. The combination of GVCL with task-specific FiLM layers (GVCL-F) outperforms baselines on the smaller-scale benchmarks, and outperforms or performs within statistical error of baselines on the larger Mixed Vision benchmark. We also report calibration curves, showing that GVCL-F is well-calibrated. The full experimental protocol and hyperparameters are reported in Appendix I.

5.1 CHASY AND SPLIT-MNIST

Figure 2: Running average accuracy of (a) Easy-CHASY, (b) Hard-CHASY and (c) Split-MNIST trained continually. GVCL-F and GVCL are compared to the best performing baseline algorithm. GVCL-F and GVCL both significantly outperform HAT on Easy-CHASY. On Hard-CHASY, GVCL-F still manages to perform as well as joint MAP training, while GVCL performs as well as PathNet. On Split-MNIST, GVCL-F narrowly outperforms HAT, with both performing nearly as well as joint training.

The CHASY benchmark consists of a set of tasks specifically designed for multi-task and continual learning, with a detailed explanation in Appendix K. It is derived from the HASYv2 dataset (Thoma, 2017), which consists of 32x32 handwritten LaTeX characters. Easy-CHASY was designed to maximize transfer between tasks and consists of similar tasks, ranging from 20 classes for the first task to 11 classes for the last. Hard-CHASY represents scenarios where tasks are very distinct, with tasks ranging from 18 to 10 classes. Both versions have very few samples per class. Testing our algorithm on these datasets therefore probes two extremes of the continual learning spectrum. For these two datasets we use a small convolutional network comprising two convolutional layers and a fully-connected layer.

Figure 3: Accuracy of (a) Easy-CHASY and (b) Hard-CHASY trained models at the end of learning all 10 tasks continually. Performance of GVCL-F, GVCL and the best performing baselines (HAT and PathNet) is compared to joint and separate training. GVCL-F again strongly outperforms the baselines and performs similarly to the upper-bound VI joint training.

For our Split-MNIST experiment, in addition to the standard 5 binary classification tasks for Split-MNIST, we add 5 more binary classification tasks by taking characters from the KMNIST dataset (Clanuwat et al., 2018). For these experiments we used a 2-layer fully-connected network, as is common in the continual learning literature (Nguyen et al., 2018; Zenke et al., 2017). Figure 2 shows the raw accuracy results.
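For reference, one common formalization of the transfer metrics described above, computed from a matrix of per-task accuracies, is sketched below. This is our formalization for illustration; the paper's precise definitions and protocol are in Appendix I.

```python
import numpy as np

def transfer_metrics(acc, ref_acc):
    # acc[i, j]: test accuracy on task j after sequentially training tasks 0..i (j <= i).
    # ref_acc[j]: accuracy of a reference model trained on task j alone.
    T = acc.shape[0]
    final_acc = acc[T - 1].mean()
    # Backward transfer: change in a task's accuracy between when it was first
    # trained and after the final task (Lopez-Paz & Ranzato, 2017).
    bwt = np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])
    # Forward transfer: improvement on the current task relative to training it
    # alone, as the number of past tasks grows (Pan et al., 2020); task 0 has no past.
    fwt = np.mean([acc[j, j] - ref_acc[j] for j in range(1, T)])
    return final_acc, bwt, fwt
```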
As the CHASY datasets have very few samples per class (16 per class, so that the largest task has a training set of 320 samples), it is easy to overfit. This few-sample regime is a key practical use case for continual learning, as it is essential to transfer information between tasks. In this regime, continual learning algorithms based on MAP inference overfit, resulting in poor performance. As GVCL-F is based on a Bayesian framework, it is not as adversely affected by the low sample count, achieving 90.9% accuracy on Easy-CHASY compared to 82.6% for the best performing MAP-based CL algorithm, HAT. Hard-CHASY tells a similar story: 69.1% compared to PathNet's 64.8%. Compared to the full joint training baselines, GVCL-F achieves nearly the same accuracy (Figure 3). The gap between GVCL-F and GVCL is larger for Easy-CHASY than for Hard-CHASY, as the task-specific adaptation that FiLM layers provide is more beneficial when tasks require contrasting features, as in Hard-CHASY. With Split-MNIST, GVCL-F also reaches the same performance as joint training; however, it is difficult to distinguish approaches on this benchmark as many achieve near-maximal accuracy.

| Benchmark | Metric | GVCL-F | GVCL | HAT | PathNet | VCL | Online EWC |
|---|---|---|---|---|---|---|---|
| Easy-CHASY | ACC (%) | 90.9 ± 0.3 | 88.9 ± 0.6 | 82.6 ± 0.9 | 82.4 ± 0.9 | 78.4 ± 1.0 | 73.4 ± 3.4 |
| | BWT (%) | 0.2 ± 0.1 | 0.8 ± 0.4 | 1.6 ± 0.6 | 0.0 ± 0.0 | 4.1 ± 1.2 | 8.9 ± 2.9 |
| | FWT (%) | 0.4 ± 0.3 | 0.6 ± 0.5 | 0.4 ± 1.4 | 1.5 ± 0.9 | 7.9 ± 0.8 | 1.5 ± 0.5 |
| Hard-CHASY | ACC (%) | 69.5 ± 0.6 | 64.4 ± 0.6 | 62.5 ± 5.4 | 64.8 ± 0.8 | 45.8 ± 1.4 | 56.4 ± 1.7 |
| | BWT (%) | 0.1 ± 0.1 | 0.6 ± 0.2 | 0.8 ± 0.4 | 0.0 ± 0.0 | 11.9 ± 1.6 | 7.1 ± 1.7 |
| | FWT (%) | 1.6 ± 0.7 | 6.3 ± 0.6 | 3.7 ± 5.5 | 2.2 ± 0.8 | 13.5 ± 2.2 | 3.4 ± 1.3 |
| Split-MNIST (10 tasks) | ACC (%) | 98.6 ± 0.1 | 94.6 ± 0.7 | 98.3 ± 0.1 | 95.2 ± 1.8 | 92.4 ± 1.2 | 94.0 ± 1.4 |
| | BWT (%) | 0.0 ± 0.0 | 4.0 ± 0.7 | 0.2 ± 0.0 | 0.0 ± 0.0 | 5.5 ± 1.1 | 3.8 ± 1.4 |
| | FWT (%) | 0.1 ± 0.1 | 0.0 ± 0.0 | 0.1 ± 0.1 | 3.3 ± 1.8 | 0.8 ± 0.1 | 0.8 ± 0.1 |
| Split-CIFAR | ACC (%) | 80.0 ± 0.5 | 70.6 ± 1.7 | 77.3 ± 0.3 | 68.7 ± 0.8 | 44.2 ± 14.2 | 77.1 ± 0.2 |
| | BWT (%) | 0.3 ± 0.2 | 2.3 ± 1.4 | 0.1 ± 0.1 | 0.0 ± 0.0 | 23.9 ± 12.2 | 0.5 ± 0.3 |
| | FWT (%) | 8.8 ± 0.5 | 1.3 ± 1.0 | 6.8 ± 0.2 | 1.9 ± 0.8 | 3.5 ± 2.1 | 6.9 ± 0.3 |
| Mixed Vision Tasks | ACC (%) | 80.0 ± 1.2 | 49.0 ± 2.8 | 80.3 ± 1.0 | 76.8 ± 2.0 | 26.9 ± 2.1 | 62.8 ± 5.2 |
| | BWT (%) | 0.9 ± 1.3 | 13.1 ± 1.6 | 0.1 ± 0.1 | 0.0 ± 0.0 | 35.0 ± 5.6 | 18.7 ± 5.8 |
| | FWT (%) | 4.8 ± 1.6 | 23.5 ± 3.4 | 5.8 ± 1.0 | 9.5 ± 2.0 | 23.7 ± 3.8 | 4.8 ± 0.7 |

Table 1: Performance metrics of GVCL-F and GVCL compared to baselines (more in Appendix J). GVCL-F obtains the best accuracy and backwards/forwards transfer on many datasets/architectures.

5.2 SPLIT-CIFAR

The popular Split-CIFAR benchmark, introduced in Zenke et al. (2017), has CIFAR10 as the first task, followed by 5 tasks that are disjoint 10-way classifications from the first 50 classes of CIFAR100, giving a total of 6 tasks. We use the same architecture as other papers (Zenke et al., 2017; Pan et al., 2020). As with Easy-CHASY, jointly learning these tasks significantly outperforms networks separately trained on the tasks, indicating potential for forward and backward transfer in a continual learning algorithm. Results are in Figure 4. GVCL-F is able to achieve the same final accuracy as joint training with FiLM layers, achieving 80.0 ± 0.5% and beating all baseline algorithms by at least 2%. This confirms that our algorithm performs well in larger settings as well as the previous smaller-scale benchmarks, with minimal forgetting. While the backwards transfer metric for many of the best performing continual learning algorithms is near 0, GVCL-F has the highest forward transfer, achieving 8.5%. GVCL consistently outperforms VCL but, unlike in the CHASY experiments, it does not outperform Online EWC.
This also occurs in the Mixed Vision tasks considered next. In theory this should not happen, but GVCL's hyperparameter search found β = 0.2, which is far from the Online EWC limit. We believe this is because optimizing the GVCL cost for small β is more challenging (see Appendix B). However, since intermediate β settings result in more pruning, FiLM layers then bring significant improvement.

Figure 4: (a) Running average accuracy of Split-CIFAR and (b) final accuracies after continually training on 6 tasks, for GVCL-F, GVCL, and HAT. GVCL-F achieves the maximum amount of forwards transfer, and achieves close to the upper-bound joint performance.

5.3 MIXED VISION TASKS

Figure 5: (a) Average accuracy of the mixed vision tasks at the end of training for GVCL-F and HAT; both algorithms perform nearly equally well in this respect. (b) Relative accuracy after training on the ith task: GVCL-F gracefully forgets, with higher intermediate accuracies, while HAT has a lower initial accuracy but does not forget.

We finally test on a set of mixed vision datasets, as in Serra et al. (2018). This benchmark consists of 8 image classification datasets with 10-100 classes and a range of dataset sizes, with the order of tasks randomly permuted between runs. We use the same AlexNet architecture as in Serra et al. (2018). Average accuracies of the 8 tasks after continual training are shown in Figure 5. GVCL-F's final accuracy matches that of HAT, with similar final performances of 80.0 ± 1.2% and 80.3 ± 1.0% for the two methods, respectively. Figure 5b shows the relative accuracy of the model after training on intermediate tasks compared to its final accuracy. A positive relative accuracy after t tasks means that the method performs better on the tasks seen so far than it does on the same tasks after seeing all 8 tasks (Appendix I contains a precise definition). HAT achieves its continual learning performance by compressing earlier tasks, hindering their performance in order to reserve capacity for later tasks. In contrast, GVCL-F attempts to maximize the performance of early tasks, but allows performance to decay gradually, as shown by the gradually decreasing relative accuracy in Figure 5b. While both strategies result in good final accuracy, one could argue that pre-compressing a network in anticipation of future tasks which may or may not arrive is an impractical real-world strategy: the total number of tasks may be unknown a priori, so one does not know how much to compress the network. The approach taken by GVCL-F is then more desirable, as it ensures good performance after any number of tasks, and frees capacity by "gracefully forgetting".

Figure 6: Calibration curves for (a) CIFAR100 and (b) Facescrub, and (c) ECE on individual tasks, for GVCL-F and HAT trained on the Mixed Vision Tasks benchmark. GVCL-F achieves much lower Expected Calibration Error, attaining a value averaged across all tasks of 0.3% compared to HAT's 1.7%.

Uncertainty calibration. As GVCL-F is based on a probabilistic framework, we expect it to have good uncertainty calibration compared to other baselines. We show this for the Mixed Vision tasks in Figure 6. Overall, the average Expected Calibration Error for GVCL-F (averaged over tasks) is 0.32%, compared to HAT's 1.69%, with a better ECE on 7 of the 8 tasks. These results demonstrate that GVCL-F is generally significantly better calibrated than HAT, which can be extremely important in decision-critical problems where networks must know when they are likely to be uncertain.
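For completeness, the standard Expected Calibration Error referred to here can be computed as below. This is a generic sketch of the usual binned definition; the number of bins and the per-task evaluation protocol are our assumptions, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence and average the absolute gap between mean
    # confidence and accuracy in each bin, weighted by the fraction of samples.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```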
5.4 RELATIVE GAIN FROM ADDING FILM LAYERS

| Algorithm | Easy-CHASY | Hard-CHASY | Split-MNIST (10 tasks) | Split-CIFAR | Mixed Vision Tasks | Average |
|---|---|---|---|---|---|---|
| GVCL | 2.0 ± 0.5% | 5.1 ± 0.4% | 4.0 ± 0.7% | 9.5 ± 1.4% | 31.0 ± 2.2% | 10.3 ± 10.6% |
| VCL | 1.5 ± 1.2% | 19.2 ± 1.4% | 2.4 ± 1.5% | 12.0 ± 16.4% | 28.6 ± 3.6% | 12.8 ± 10.3% |
| Online EWC | 2.6 ± 3.3% | 0.3 ± 7.1% | 0.1 ± 1.2% | 0.1 ± 0.1% | 7.7 ± 2.1% | 2.2 ± 2.9% |

Table 2: Relative performance improvement from adding FiLM layers on several benchmarks, for VI and non-VI based algorithms. VI-based approaches see a significantly larger gain than Online EWC, suggesting that FiLM layers synergize very well with VI and address the pruning issue.

In Section 3, we suggested that adding FiLM layers to VCL in particular would result in the largest gains, since they address issues specific to VI, with FiLM parameter values automatically well allocated based on the prior. In Table 2, we compare the relative gain of adding FiLM layers to VI-based approaches and to Online EWC. We omit HAT, since it already has per-task gating mechanisms, making FiLM layers redundant. We see that the gains from adding FiLM layers to Online EWC are limited, averaging 2.2%, compared to over 10% for both VCL and GVCL. This suggests that the strength of FiLM layers lies primarily in how they interact with variational methods for continual learning. As described in Section 3, with VI we do not need any special algorithm to encourage pruning or to allocate resources: this is done automatically by VI. This contrasts with HAT, where specific regularizers and gradient modifications are necessary to encourage the use of the gating parameters.

6 CONCLUSIONS

We have developed a framework, GVCL, that generalizes Online EWC and VCL, and we combined it with task-specific FiLM layers to mitigate the effects of variational pruning. GVCL with FiLM layers outperforms strong baselines on a number of benchmarks, according to several metrics. Future research might combine GVCL with memory replay methods, or find ways to use FiLM layers when task ID information is unavailable.

REFERENCES

Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta learning. arXiv:1902.03545 [cs, stat], February 2019. URL http://arxiv.org/abs/1902.03545.

Alessandro Achille, Giovanni Paolini, and Stefano Soatto. Where is the information in a deep neural network?, 2020.

Tameem Adel, Han Zhao, and Richard E. Turner. Continual learning with adaptive weights (CLAW). In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hklso24Kwr.

Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems 32, pp. 4392-4402. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8690-uncertainty-based-continual-learning-with-adaptive-regularization.pdf.

Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. Volume 80 of Proceedings of Machine Learning Research, pp. 159-168, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.
URL http://proceedings.mlr.press/v80/alemi18a.html.

Tarin Clanuwat, Mikel Bober-Irizar, A. Kitamoto, A. Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. arXiv, abs/1812.01718, 2018.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. January 2017.

Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, April 1999. ISSN 1364-6613. doi: 10.1016/S1364-6613(99)01294-2. URL http://www.sciencedirect.com/science/article/pii/S1364661399012942.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2017.

J. Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, J. Veness, G. Desjardins, Andrei A. Rusu, K. Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, C. Clopath, D. Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521-3526, 2017.

Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems 30, pp. 4652-4662. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7051-overcoming-catastrophic-forgetting-by-incremental-moment-matching.pdf.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30, pp. 6467-6476. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, pp. 3288-3298. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6921-bayesian-compression-for-deep-learning.pdf.

James Martens. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146):1-76, 2020. URL http://jmlr.org/papers/v21/17-678.html.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. Volume 70 of Proceedings of Machine Learning Research, pp. 2498-2507, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/molchanov17a.html.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkQqq0gRb.

Manfred Opper and Cedric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21:786-792, October 2008. doi: 10.1162/neco.2008.08-07-592.

Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical deep learning with Bayesian principles.
In Advances in Neural Information Processing Systems 32, pp. 4287-4299. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8681-practical-deep-learning-with-bayesian-principles.pdf.

Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E. Turner, and Mohammad Emtiyaz Khan. Continual deep learning by functional regularisation of memorable past. arXiv:2004.14070 [cs, stat], June 2020. URL http://arxiv.org/abs/2004.14070.

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems 30, pp. 506-516. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6654-learning-multiple-visual-domains-with-residual-adapters.pdf.

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems 32, pp. 7959-7970. Curran Associates, Inc., 2019.

Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems 31, pp. 3738-3748. Curran Associates, Inc., 2018.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv:1606.04671 [cs], September 2016. URL http://arxiv.org/abs/1606.04671.

Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. Volume 80 of Proceedings of Machine Learning Research, pp. 4528-4537, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/schwarz18a.html.

Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. Volume 80 of Proceedings of Machine Learning Research, pp. 4548-4557, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/serra18a.html.

Alexander Smola, Vishy Vishwanathan, and Eleazar Eskin. Laplace propagation. January 2003.

Siddharth Swaroop, Cuong V. Nguyen, Thang D. Bui, and Richard E. Turner. Improving and understanding variational continual learning. arXiv:1905.02099 [cs, stat], May 2019. URL http://arxiv.org/abs/1905.02099.

Martin Thoma. The HASYv2 dataset. arXiv:1701.08380 [cs], January 2017. URL http://arxiv.org/abs/1701.08380.

Brian Trippe and Richard Turner. Overpruning in variational Bayesian neural networks. arXiv:1801.06230 [stat], January 2018. URL http://arxiv.org/abs/1801.06230.

R. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. 2011.
Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? CoRR, abs/2002.02405, 2020. URL https://arxiv.org/abs/2002.02405.

Dong Yin, Mehrdad Farajtabar, and Ang Li. SOLA: Continual learning with second-order loss approximation. arXiv:2006.10974 [cs, stat], June 2020. URL http://arxiv.org/abs/2006.10974.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. Volume 70 of Proceedings of Machine Learning Research, pp. 3987-3995, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zenke17a.html.

Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. Volume 80 of Proceedings of Machine Learning Research, pp. 5852-5861, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/zhang18l.html.

A LOCAL VS GLOBAL CURVATURE IN GVCL

In this section, we look at the effect of β on the approximation of local curvature found by optimizing the β-ELBO, analyzing its effect on a toy dataset. In doing so, we aim to provide intuition for why different values of β might outperform β = 1. We start from the fixed-point equation for Σ:

$$\Sigma_T^{-1} = \frac{1}{\beta} \nabla_{\mu_T}\nabla_{\mu_T} \mathbb{E}_{q_T(\theta)}[-\log p(\mathcal{D}_T|\theta)] + \Sigma_{T-1}^{-1}. \qquad (5)$$

We consider the T = 1 case. We can interpret this as roughly measuring the curvature of $\log p(\mathcal{D}_T|\theta)$ at samples of θ drawn from the distribution $q_T(\theta)$. Based on this equation, $\Sigma_T^{-1}$ increases as β decreases, so samples from $q_T(\theta)$ are more localized, meaning that the curvature is measured closer to the mean, forming a local approximation of curvature. Conversely, if β is larger, $\Sigma_T$ broadens and the curvature is approximated on a more global scale. For simplicity, we write $\nabla_{\mu_T}\nabla_{\mu_T}\mathbb{E}_{q_T(\theta)}[-\log p(\mathcal{D}_T|\theta)]$ as $\tilde{H}_T$.

To test this explanation of β, we performed β-VI on a simple toy dataset. We have a true data-generating distribution $X \sim \mathcal{N}(0, 1)$, from which we sample 1000 points to form the dataset $\mathcal{D}$. Our model is a generative model with $X \sim \mathcal{N}(f(\theta), \sigma_0^2 = 30)$, with θ the model's only parameter and f(θ) an arbitrary fixed function. With β-VI, we aim to approximate $p(\theta|\mathcal{D})$ with $q(\theta) = \mathcal{N}(\theta; \mu, \sigma^2)$, with a prior $p(\theta) = \mathcal{N}(\theta; 0, 1)$. We choose three different functions f(θ):

1. $f_1(\theta) = |\theta|^{1.6}$
2. $f_2(\theta) = \sqrt[4]{|\theta|}$
3. $f_3(\theta) = \sqrt[3]{(|\theta| - 0.5)^3 + 0.4}$

We visualize $\log p(\mathcal{D}|\theta)$ for each of these three functions in Figure 7. The data likelihoods have very distinct shapes: $f_1$ results in a likelihood that is flat locally but curves further away from the origin; $f_2$ is the opposite, with a cusp at 0 that then flattens out; $f_3$ is a mix, with high curvature at a very small scale, then a flat region, then curvature again.

Now, we perform β-VI to obtain µ and σ² for $\beta \in \{0.1, 1, 10\}$. The fitted precision $\sigma^{-2}$ plays the role of $\Sigma_T^{-1}$ in Equation 5, and we want to extract the implied curvature estimate from it, so we compute $\tilde{\sigma}^{-2} = \beta(\sigma^{-2} - 1)$, which represents our estimate of the curvature of $\log p(\mathcal{D}|\theta)$ at the mean; this operation also cancels the scaling effect of β. We then plot the approximate log-likelihood functions $\log \tilde{p}(\mathcal{D}|\theta) = \log \mathcal{N}(\theta; \mu, \tilde{\sigma}^2)$ in Figure 8.

Figure 7: True data log-likelihoods of a generative model of the form $p(x|\theta) = \mathcal{N}(x; f(\theta), \sigma_0^2)$. Curves are shifted so that they pass through the origin.
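The de-scaling step can be summarized in one line of code. This is our reading of the procedure, assuming a unit prior precision as in the experiment above:

```python
import numpy as np

def curvature_estimate(sigma2, beta, prior_prec=1.0):
    # Invert the fixed point sigma^{-2} = (1/beta) * H + prior_prec to recover
    # the implied curvature estimate H from a fitted beta-VI variance sigma2.
    return beta * (1.0 / np.asarray(sigma2) - prior_prec)
```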
From these figures, we see a clear trend: small values of β cause the approximate curvature to be measured locally while larger values cause it to be measured globally, confirming our hypothesis. Most striking is Figure 8c, where the curvature is not strictly increasing or decreasing away from the origin. Here, the curvature estimate is high for β = 0.1, flattens out for β = 1, then becomes high again for β = 10. Now imagine a parameter in continual learning whose posterior looks like Figure 8a. This parameter would be under-regularized with β = 1, so it would drift far away, significantly affecting performance. Equally, if the posterior looked like Figure 8b, β = 1 would cause the parameter to be over-regularized, limiting model capacity that in practice could be freed. In practice we found that β values of 0.05-0.2 worked best. We leave finding better ways of quantifying the posterior's variable curvature, and of selecting appropriate values of β, as future work.

Figure 8: Approximate data log-likelihoods found using β-VI for various values of β for three different generative models. Small values of β cause local approximations of curvature and large values cause global ones.

B CONVERGENCE TO ONLINE EWC ON A TOY EXAMPLE

Figure 9: Visualization of a simple 2d logistic regression clustering task. The first task is distinguishing blue and red, classes 1 and 2 respectively. The second task is distinguishing green (class 1) from yellow (class 2). The combined task is shown on the left.

Here, we demonstrate convergence of GVCL to Online EWC for small β. We consider 2d logistic regression on a toy dataset consisting of separated clusters, shown in Figure 9. The first task is separating the red/blue clusters, and the second the yellow/green clusters; blue and green are the first class and red and yellow the second. Our model is given by

$$p(y_i = 1 \mid w, b, x_i) = \sigma(w^\top x_i + b),$$

where $x_i$ are our datapoints and w and b are our parameters ($y_i = 1$ means class 2 and $y_i = 0$ means class 1). x is 2-dimensional, so we have a total of 3 parameters.

Next, we ran GVCL with decreasing values of β and compared the resulting values of w and b after the second task to the solution generated by Online EWC. For both methods, we set λ = 1. We used a unit normal prior on both w and b, and our approximating distribution was a fully factorized Gaussian. We ran this experiment for 5 random seeds (of the parameter initializations, not the clusters) and plot the results in Figure 10. Evidently, the parameter values approach those of Online EWC as we decrease β, in line with our theory.

Figure 10: Convergence of GVCL parameter values to Online EWC parameter values for decreasing values of β, for a toy 2d logistic regression problem.

However, it is worth noting that to obtain this convergent behaviour we had to run the experiment for a very long time. For the lowest β value, it took 17 minutes to converge, compared to 1.7 minutes for β = 1; a small learning rate of 1e-4 with 100000 iteration steps was necessary for the smallest β = 1e-4. If the optimization was run for less time, or too large a learning rate was used, we observed convergent behaviour for the first few values of β, but the smallest values of β resulted in completely different parameter values. This shows that while, in theory, GVCL should approach Online EWC for small β, this limit is extremely hard to achieve in practice. Given that it takes so long to achieve convergent behaviour on a model with 3 parameters, it is unsurprising that we were not able to achieve the same performance as Online EWC with our neural networks, and this explains why GVCL, despite in theory encompassing Online EWC, can sometimes perform worse.
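A sketch of the per-task objective used in this toy experiment is given below (our PyTorch rendering, with hypothetical variable names, not the paper's code). The second task simply re-uses the first task's fitted `(mu, log_var)` as `(prior_mu, prior_log_var)`.

```python
import torch

def gvcl_logreg_loss(x, y, mu, log_var, prior_mu, prior_log_var, beta, n_samples=8):
    # Negated beta-ELBO for 2-d logistic regression with a factorized Gaussian
    # posterior over theta = (w1, w2, b). x: (n, 2) floats; y: (n,) floats in {0, 1}.
    eps = torch.randn(n_samples, mu.numel())
    theta = mu + eps * (0.5 * log_var).exp()              # (n_samples, 3)
    logits = x @ theta[:, :2].T + theta[:, 2]             # (n, n_samples)
    nll = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, y.unsqueeze(1).expand_as(logits), reduction="sum") / n_samples
    var_q, var_p = log_var.exp(), prior_log_var.exp()
    kl = 0.5 * (prior_log_var - log_var + (var_q + (mu - prior_mu) ** 2) / var_p - 1).sum()
    return nll + beta * kl
```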
C FURTHER DETAILS ON RECOVERING ONLINE EWC

Here, we show the full derivation recovering Online EWC from GVCL as β → 0. First, we expand the β-ELBO, which for Gaussian priors and posteriors has the form

$$\beta\text{-ELBO} = \mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - \beta D_{\mathrm{KL}}(q_T(\theta) \,\|\, q_{T-1}(\theta))$$
$$= \mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] - \frac{\beta}{2}\Big[\log\frac{|\Sigma_{T-1}|}{|\Sigma_T|} - d + \mathrm{Tr}(\Sigma_{T-1}^{-1}\Sigma_T) + (\mu_T - \mu_{T-1})^\top \Sigma_{T-1}^{-1}(\mu_T - \mu_{T-1})\Big],$$

where $q_T(\theta)$ is our approximate distribution with mean and covariance $\mu_T$ and $\Sigma_T$, our prior distribution $q_{T-1}(\theta)$ has mean and covariance $\mu_{T-1}$ and $\Sigma_{T-1}$, $\mathcal{D}_T$ is the Tth dataset, and d is the dimension of µ. Next, take derivatives with respect to $\Sigma_T$ and set them to zero:

$$0 = \nabla_{\Sigma_T}\beta\text{-ELBO} = \nabla_{\Sigma_T}\mathbb{E}_{\theta \sim q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] + \frac{\beta}{2}\Sigma_T^{-1} - \frac{\beta}{2}\Sigma_{T-1}^{-1} \qquad (6)$$
$$= \frac{1}{2}\nabla_\mu\nabla_\mu\mathbb{E}_{q_T(\theta)}[\log p(\mathcal{D}_T|\theta)] + \frac{\beta}{2}\Sigma_T^{-1} - \frac{\beta}{2}\Sigma_{T-1}^{-1} \qquad (7)$$
$$\Rightarrow\quad \Sigma_T^{-1} = \frac{1}{\beta}\nabla_{\mu_T}\nabla_{\mu_T}\mathbb{E}_{q_T(\theta)}[-\log p(\mathcal{D}_T|\theta)] + \Sigma_{T-1}^{-1}. \qquad (8)$$

We move from Equation 6 to Equation 7 using Equation 19 of Opper & Archambeau (2008). From Equation 8, we see that as β → 0, the precision grows indefinitely, so $q_T(\theta)$ approaches a delta function centered at its mean. We give a more precise version of this argument in Appendix C.1. Then,

$$\Sigma_T^{-1} = \frac{1}{\beta}\nabla_{\mu_T}\nabla_{\mu_T}[-\log p(\mathcal{D}_T|\theta = \mu_T)] + \Sigma_{T-1}^{-1} = \frac{1}{\beta}H_T + \Sigma_{T-1}^{-1}, \qquad (9)$$

where $H_T$ is the Hessian of the Tth dataset's negative log-likelihood. Unrolling this recursion gives

$$\Sigma_T^{-1} = \frac{1}{\beta}\sum_{t=1}^{T}H_t + \Sigma_0^{-1}.$$

Now, optimizing the β-ELBO for $\mu_T$ (ignoring terms that do not depend on $\mu_T$):

$$\beta\text{-ELBO} = \mathbb{E}_{\theta \sim q(\theta)}[\log p(\mathcal{D}|\theta)] - \frac{\beta}{2}(\mu_T - \mu_{T-1})^\top\Sigma_{T-1}^{-1}(\mu_T - \mu_{T-1}) \qquad (10)$$
$$= \log p(\mathcal{D}|\theta = \mu_T) - \frac{1}{2}(\mu_T - \mu_{T-1})^\top\Big(\sum_{t=1}^{T-1}H_t + \beta\Sigma_0^{-1}\Big)(\mu_T - \mu_{T-1}), \qquad (11)$$

which is exactly the optimization problem of Laplace Propagation (Smola et al., 2003). If we note that $H_T \approx N_T F_T$ (Martens, 2020), where $N_T$ is the number of samples in the Tth dataset and $F_T$ is the Fisher information matrix, we recover Online EWC with λ = 1 when $N_1 = N_2 = \dots = N_T$ (and with γ = 1).

C.1 CLARIFICATION OF THE DELTA-FUNCTION ARGUMENT

In Appendix C, we argued that

$$\Sigma_T^{-1} = \frac{1}{\beta}\nabla_{\mu_T}\nabla_{\mu_T}\mathbb{E}_{q_T(\theta)}[-\log p(\mathcal{D}_T|\theta)] + \Sigma_{T-1}^{-1} \approx \frac{1}{\beta}H_T + \Sigma_{T-1}^{-1}$$

for small β; that is, for small β, $q(\theta)$ collapses to its mean and it is safe to treat the expectation as sampling only from the mean. In this section, we show that this argument is justified.

Lemma 1. If $q(\theta)$ has mean and covariance parameters µ and Σ, and $\Sigma^{-1} = \frac{1}{\beta}\nabla_\mu\nabla_\mu\mathbb{E}_{\theta \sim q(\theta)}[f(\theta)] + C$ with $C = O(\frac{1}{\beta})$, then for small β, $\Sigma^{-1} \approx \frac{1}{\beta}H_\mu + C$, where $H_\mu$ is the Hessian of $f(\theta)$ evaluated at µ, assuming $H_\mu = O(1)$.

Proof. We first assume that $f(\theta)$ admits a Taylor expansion around µ. For notational purposes, we define

$$T_{k_1,\dots,k_n}\big|_{\theta=\mu} = \frac{\partial^n f}{\partial\theta^{(k_1)}\cdots\partial\theta^{(k_n)}}\bigg|_{\theta=\mu}.$$

In our notation, upper indices in brackets indicate vector components (not powers), and lower indices indicate covector components. Note that $H_{\mu,i,j} = T_{i,j}|_{\theta=\mu}$ (here the µ in $H_{\mu,i,j}$ indicates that the Hessian is evaluated at µ, while i, j index its entries). A Taylor expansion centered at µ then has the form

$$f(\theta) = f(\mu) + \sum_{n=1}^{\infty}\frac{1}{n!}T_{k_1,\dots,k_n}\big|_{\theta=\mu}(\theta-\mu)^{(k_1)}\cdots(\theta-\mu)^{(k_n)},$$
where we use Einstein notation, so

$$T_{k_1,\dots,k_n}\big|_{\theta=\mu}(\theta-\mu)^{(k_1)}\cdots(\theta-\mu)^{(k_n)} = \sum_{k_1,\dots,k_n=1}^{D}T_{k_1,\dots,k_n}\big|_{\theta=\mu}(\theta-\mu)^{(k_1)}\cdots(\theta-\mu)^{(k_n)},$$

with D the dimension of θ. To denote the central moments of $q(\theta)$, we define

$$\mu^{(k_1,\dots,k_n)} := \mathbb{E}_{\theta\sim q(\theta)}\big[(\theta-\mu)^{(k_1)}\cdots(\theta-\mu)^{(k_n)}\big].$$

These moments can be computed using Isserlis' theorem; notably, for a Gaussian, $\mu^{(k_1,\dots,k_n)} = 0$ when n is odd. Now, we can compute our expectation as an infinite sum:

$$\nabla_\mu\nabla_\mu\mathbb{E}_{\theta\sim q(\theta)}[f(\theta)] = \nabla_\mu\nabla_\mu\mathbb{E}_{\theta\sim q(\theta)}\Big[\sum_{n=1}^{\infty}\frac{1}{n!}T_{k_1,\dots,k_n}\big|_{\theta=\mu}(\theta-\mu)^{(k_1)}\cdots(\theta-\mu)^{(k_n)}\Big]$$
$$= \nabla_\mu\nabla_\mu\Big[\sum_{n=1}^{\infty}\frac{1}{n!}T_{k_1,\dots,k_n}\big|_{\theta=\mu}\,\mu^{(k_1,\dots,k_n)}\Big]$$
$$= \nabla_\mu\nabla_\mu\Big[\sum_{n=1}^{\infty}\frac{1}{(2n)!}T_{k_1,\dots,k_{2n}}\big|_{\theta=\mu}\,\mu^{(k_1,\dots,k_{2n})}\Big] \quad\text{(odd moments are 0)}$$
$$=: A \quad\text{(for notational simplicity)}.$$

We can look at individual components of A:

$$A_{i,j} = \frac{\partial}{\partial\mu^{(i)}}\frac{\partial}{\partial\mu^{(j)}}\Big[\sum_{n=1}^{\infty}\frac{1}{(2n)!}T_{k_1,\dots,k_{2n}}\big|_{\theta=\mu}\,\mu^{(k_1,\dots,k_{2n})}\Big] = T_{i,j}\big|_{\theta=\mu} + \sum_{n=1}^{\infty}\frac{1}{(2n)!}T_{i,j,k_1,\dots,k_{2n}}\big|_{\theta=\mu}\,\mu^{(k_1,\dots,k_{2n})}.$$

Inserting this into the original equation and looking at individual indices,

$$\Sigma^{-1}_{i,j} = \frac{1}{\beta}A_{i,j} + C_{i,j} = \frac{1}{\beta}\Big[\underbrace{T_{i,j}\big|_{\theta=\mu}}_{O(1)} + \underbrace{\sum_{n=1}^{\infty}\frac{1}{(2n)!}T_{i,j,k_1,\dots,k_{2n}}\big|_{\theta=\mu}\,\mu^{(k_1,\dots,k_{2n})}}_{O(\beta)}\Big] + \underbrace{C_{i,j}}_{O(1/\beta)}.$$

We assumed that $H_\mu$ (and hence $T_{i,j}|_{\theta=\mu}$) is O(1), so $\Sigma^{-1}_{i,j}$ must be at least $O(\frac{1}{\beta})$. If $\Sigma^{-1} = O(\frac{1}{\beta})$, then $\Sigma = O(\beta)$. From Isserlis' theorem, $\mu^{(k_1,\dots,k_{2n})}$ is composed of products of n elements of Σ, so $\mu^{(k_1,\dots,k_{2n})} = O(\beta^n)$, while $T_{i,j,k_1,\dots,k_{2n}}|_{\theta=\mu}$ is constant with respect to β, i.e., O(1). Hence the summation is O(β), which for small β is negligible compared to the O(1) term $T_{i,j}|_{\theta=\mu}$ and can be ignored. Keeping only the $O(\frac{1}{\beta})$ terms,

$$\Sigma^{-1}_{i,j} \approx \frac{1}{\beta}T_{i,j}\big|_{\theta=\mu} + C_{i,j} = \frac{1}{\beta}H_{\mu,i,j} + C_{i,j}.$$

C.2 CORRESPONDING GVCL'S λ AND ONLINE EWC'S λ

We use $\tilde{D}_{\mathrm{KL}\lambda}$ in place of $D_{\mathrm{KL}}$, with $\tilde{D}_{\mathrm{KL}\lambda}$ defined as

$$\tilde{D}_{\mathrm{KL}\lambda}(q_T\,\|\,q_{T-1}) = \frac{1}{2}\Big[(\mu_T-\mu_{T-1})^\top\Sigma_{T-1,\lambda}^{-1}(\mu_T-\mu_{T-1}) + \mathrm{Tr}(\Sigma_{T-1}^{-1}\Sigma_T) + \log\frac{|\Sigma_{T-1}|}{|\Sigma_T|} - d\Big],$$
$$\Sigma_{T,\lambda}^{-1} := \lambda(\Sigma_T^{-1} - \Sigma_0^{-1}) + \Sigma_0^{-1} = \frac{\lambda}{\beta}\sum_{t=1}^{T}H_t + \Sigma_0^{-1}.$$

The fixed point for $\Sigma_T$ is still given by Equation 9, but the terms of the β-ELBO involving $\mu_T$ now have the form

$$\beta\text{-ELBO} = \mathbb{E}_{\theta\sim q(\theta)}[\log p(\mathcal{D}|\theta)] - \frac{\beta}{2}(\mu_T-\mu_{T-1})^\top\Sigma_{T-1,\lambda}^{-1}(\mu_T-\mu_{T-1})$$
$$= \log p(\mathcal{D}|\theta=\mu_T) - \frac{1}{2}(\mu_T-\mu_{T-1})^\top\Big(\lambda\sum_{t=1}^{T-1}H_t + \beta\Sigma_0^{-1}\Big)(\mu_T-\mu_{T-1}),$$

which up-weights the quadratic terms that depend on the data (and not the prior), as λ does in Online EWC.

C.3 RECOVERING γ FROM TEMPERING

In order to recover λ, we used the KL divergence between the tempered prior and posterior $q_{T-1}^\lambda$ and $q_T^\lambda$. Recovering γ can be done using the same trick, except the prior and posterior are tempered at different temperatures:

$$D_{\mathrm{KL}}(q_T^{\lambda}\,\|\,q_{T-1}^{\gamma\lambda}) = \frac{1}{2}\Big[(\mu_T-\mu_{T-1})^\top\lambda\Sigma_{T-1}^{-1}(\mu_T-\mu_{T-1}) + \gamma\,\mathrm{Tr}(\Sigma_{T-1}^{-1}\Sigma_T) - \log|\Sigma_T|\Big] + \text{const} =: D_{\mathrm{KL}\lambda,\gamma}(q_T\,\|\,q_{T-1}).$$

We can apply the same modification to λ as before to obtain $\tilde{D}_{\mathrm{KL}\lambda,\gamma}(q_T\,\|\,q_{T-1})$. Plugging this into the β-ELBO and solving yields the recursion $\Sigma_T^{-1} = \frac{1}{\beta}H_T + \gamma\Sigma_{T-1}^{-1}$, which is exactly that of Online EWC.

C.4 GVCL RECOVERS THE SAME APPROXIMATION OF $F_T$ AS ONLINE EWC

The earlier analysis dealt with full-rank $\Sigma_T$. In practice, however, $\Sigma_T$ is rarely full rank and we deal with approximations of $\Sigma_T$. In this subsection, we consider diagonal $\Sigma_T$, as in Online EWC, which in practice uses a diagonal approximation of $F_T$. Online EWC approximates this diagonal by matching the diagonal entries of $F_T$; there are many ways of producing a diagonal approximation of a matrix (matching the diagonal of the inverse matrix is also valid, depending on the metric used). Here, we show that the diagonal approximation of $\Sigma_T$ produced when $\mathcal{Q}$ is the family of diagonal-covariance Gaussians is the same as Online EWC's approximation of $F_T$: the diagonal of $\Sigma_{T,\mathrm{approx}}^{-1}$ matches the diagonal of $\Sigma_{T,\mathrm{true}}^{-1}$, i.e., we match diagonal precision entries, not diagonal covariance entries.
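The derivation that follows establishes this result analytically; as a quick numerical sanity check of its conclusion (our own check, not from the paper), one can verify that the KL-optimal diagonal variances are the reciprocals of the diagonal of the true precision:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
true_cov = A @ A.T + 4 * np.eye(4)
true_prec = np.linalg.inv(true_cov)

# Claimed optimum: sigma_i^2 = 1 / (Sigma_true^{-1})_{ii}
sigma2 = 1.0 / np.diag(true_prec)

# Brute-force 1-d minimization of the sigma_i^2-dependent KL terms, per coordinate
for i in range(4):
    f = lambda s2: s2 * true_prec[i, i] - np.log(s2)
    s2_opt = minimize_scalar(f, bounds=(1e-3, 1e3), method="bounded").x
    assert np.isclose(s2_opt, sigma2[i], rtol=1e-3)
print(sigma2)
```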
Let $\Sigma_{T,\mathrm{approx}} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_d^2)$, with d the dimension of the matrix. Because we are performing VI, we optimize the forwards KL divergence, i.e., $D_{\mathrm{KL}}(q_{\mathrm{approx}}\,\|\,q_{\mathrm{true}})$. Ignoring terms that do not depend on $\Sigma_{T,\mathrm{approx}}$,

$$D_{\mathrm{KL}}(q_{\mathrm{approx}}\,\|\,q_{\mathrm{true}}) = \frac{1}{2}\mathrm{Tr}(\Sigma_{T,\mathrm{approx}}\Sigma_{T,\mathrm{true}}^{-1}) - \frac{1}{2}\log|\Sigma_{T,\mathrm{approx}}| + \text{const}$$
$$= \frac{1}{2}\sum_{i=1}^{d}\big[\sigma_i^2(\Sigma_{T,\mathrm{true}}^{-1})_{i,i} - \log\sigma_i^2\big] + \text{const}.$$

Optimizing with respect to $\sigma_i^2$:

$$\frac{\partial D_{\mathrm{KL}}(q_{\mathrm{approx}}\|q_{\mathrm{true}})}{\partial\sigma_i^2} = 0 \;\Rightarrow\; (\Sigma_{T,\mathrm{true}}^{-1})_{i,i} - \frac{1}{\sigma_i^2} = 0 \;\Rightarrow\; \frac{1}{\sigma_i^2} = (\Sigma_{T,\mathrm{true}}^{-1})_{i,i}.$$

So the diagonal of $\Sigma_{T,\mathrm{approx}}^{-1}$ matches the diagonal of $\Sigma_{T,\mathrm{true}}^{-1}$.

C.5 GVCL RECOVERS THE SAME APPROXIMATION OF $H_T$ AS SOLA

SOLA approximates the Hessian with a rank-restricted matrix $\tilde{H}$ (Yin et al., 2020). We first consider a full-rank relaxation of this problem, then take the limit in which the relaxation is removed. Because we are concerned with the limit β → 0, it is sufficient to consider $\Sigma_{\mathrm{true}}^{-1}$ as H, the true Hessian. Because H is symmetric (and assuming it is positive semi-definite), we can write $H = VDV^\top = \sum_{i=1}^{p}\lambda_i x_i x_i^\top$, with D the diagonal matrix of eigenvalues $\lambda_i$, V a unitary matrix of eigenvectors $x_i$, and p the dimension of H. For $\tilde{H}$, we first consider a full-rank matrix which becomes low-rank as δ → 0:

$$\tilde{H} = \sum_{i=1}^{k}\tilde\lambda_i\tilde{x}_i\tilde{x}_i^\top + \sum_{j=k+1}^{p}\delta\,\tilde{x}_j\tilde{x}_j^\top.$$

This matrix has $\tilde\lambda_i$, $1 \le i \le k$, as its first k eigenvalues and δ as its remaining ones. We also impose $\tilde{x}_i^\top\tilde{x}_i = 1$ and $\tilde{x}_i^\top\tilde{x}_j = 0$ for $i \ne j$. With KL minimization, we aim to minimize (up to a constant and scalar factor)

$$\mathrm{KL} = \mathrm{Tr}(\Sigma_{\mathrm{approx}}\Sigma_{\mathrm{true}}^{-1}) - \log|\Sigma_{\mathrm{approx}}|.$$

In our case this is

$$\mathrm{KL} = \mathrm{Tr}(\tilde{H}^{-1}H) - \log|\tilde{H}^{-1}| = \sum_{i=1}^{k}\frac{1}{\tilde\lambda_i}\tilde{x}_i^\top H\tilde{x}_i + \sum_{j=k+1}^{p}\frac{1}{\delta}\tilde{x}_j^\top H\tilde{x}_j + \sum_{i=1}^{k}\log\tilde\lambda_i + (p-k)\log\delta.$$

Taking derivatives with respect to $\tilde\lambda_i$:

$$-\frac{1}{\tilde\lambda_i^2}\tilde{x}_i^\top H\tilde{x}_i + \frac{1}{\tilde\lambda_i} = 0 \;\Rightarrow\; \tilde\lambda_i = \tilde{x}_i^\top H\tilde{x}_i,$$

which, substituted back into the KL, gives

$$\mathrm{KL} = k + \sum_{j=k+1}^{p}\frac{1}{\delta}\tilde{x}_j^\top H\tilde{x}_j + \sum_{i=1}^{k}\log(\tilde{x}_i^\top H\tilde{x}_i) + (p-k)\log\delta$$
$$\propto \sum_{j=k+1}^{p}\frac{1}{\delta}\tilde{x}_j^\top H\tilde{x}_j + \sum_{i=1}^{k}\log(\tilde{x}_i^\top H\tilde{x}_i) \quad\text{(removing constants)}.$$

We now account for the constraints $\tilde{x}_i^\top\tilde{x}_i = 1$ and $\tilde{x}_i^\top\tilde{x}_j = 0$, $i \ne j$, by adding Lagrange multipliers to the KL cost:

$$\mathcal{L} = \sum_{j=k+1}^{p}\frac{1}{\delta}\tilde{x}_j^\top H\tilde{x}_j + \sum_{i=1}^{k}\log(\tilde{x}_i^\top H\tilde{x}_i) - \sum_{i}\phi_{i,i}(\tilde{x}_i^\top\tilde{x}_i - 1) - \sum_{i \ne j}\phi_{i,j}\tilde{x}_i^\top\tilde{x}_j.$$

Taking derivatives with respect to $\tilde{x}_i$ for $1 \le i \le k$:

$$\frac{\partial\mathcal{L}}{\partial\tilde{x}_i} = 0 = \frac{2H\tilde{x}_i}{\tilde{x}_i^\top H\tilde{x}_i} - 2\phi_{i,i}\tilde{x}_i - 2\sum_{j \ne i}\phi_{i,j}\tilde{x}_j \;\Rightarrow\; \sum_{j \ne i}\phi_{i,j}\tilde{x}_j = \Big(\frac{H}{\tilde{x}_i^\top H\tilde{x}_i} - \phi_{i,i}I_p\Big)\tilde{x}_i.$$

Here a linear combination of the $\tilde{x}_j$, $j \ne i$, would have to equal a vector involving $\tilde{x}_i$; but $\tilde{x}_i$ is orthogonal to every $\tilde{x}_j$, so no such combination exists unless $\phi_{i,j} = 0$ for all $i \ne j$, leaving

$$\frac{H\tilde{x}_i}{\tilde{x}_i^\top H\tilde{x}_i} = \phi_{i,i}\tilde{x}_i.$$

This means the $\tilde{x}_i$ are eigenvectors of H for $1 \le i \le k$.
C.5 GVCL RECOVERS THE SAME APPROXIMATION OF H_T AS SOLA

SOLA approximates the Hessian with a rank-restricted matrix H̃ (Yin et al., 2020). We first consider a full-rank relaxation of this problem, then take the limit in which the relaxation is removed. Because we are concerned with the limit β → 0, it is sufficient to consider Σ^{-1}_true as H, the true Hessian. Because H is symmetric (and assuming it is positive semi-definite), we can write H = V D V^⊤ = Σ_{i=1}^p λ_i x_i x_i^⊤, with D and V the diagonal matrix of eigenvalues and a unitary matrix of eigenvectors, respectively. These eigenvalues and eigenvectors are λ_i and x_i, and p is the dimension of H. For H̃, we first consider a full-rank matrix which becomes low-rank as δ → 0:

H̃ = Σ_{i=1}^k λ̃_i x̃_i x̃_i^⊤ + Σ_{j=k+1}^p δ x̃_j x̃_j^⊤.

This matrix has λ̃_i, 1 ≤ i ≤ k, as its first k eigenvalues and δ as the remaining ones. We also impose x̃_i^⊤ x̃_i = 1 and x̃_i^⊤ x̃_j = 0 for i ≠ j. With KL minimization, we aim to minimize (up to a constant and a scalar factor)

KL = Tr(Σ_approx Σ^{-1}_true) − log |Σ_approx|.

In our case, this is Equation 13, which we can further expand as

KL = Tr(H̃^{-1} H) − log |H̃^{-1}|   (13)
= Tr[( Σ_{i=1}^k (1/λ̃_i) x̃_i x̃_i^⊤ + Σ_{j=k+1}^p (1/δ) x̃_j x̃_j^⊤ ) H ] + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ   (14, 15)
= Σ_{i=1}^k (1/λ̃_i) x̃_i^⊤ H x̃_i + Σ_{j=k+1}^p (1/δ) x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ.   (16)

Taking derivatives with respect to λ̃_i, we have

∂KL/∂λ̃_i = −(1/λ̃²_i) x̃_i^⊤ H x̃_i + 1/λ̃_i = 0  ⟹  λ̃_i = x̃_i^⊤ H x̃_i,   (17-19)

which, when put into Equation 16, gives

KL = Σ_{i=1}^k (x̃_i^⊤ H x̃_i)/(x̃_i^⊤ H x̃_i) + (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ   (20, 21)
= (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + const   (22, 23; removing constants)
= (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i).   (24)

Now we account for the constraints x̃_i^⊤ x̃_i = 1 and x̃_i^⊤ x̃_j = 0, i ≠ j, by adding Lagrange multipliers to our KL cost:

L = (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i) − Σ_i φ_{i,i} (x̃_i^⊤ x̃_i − 1) − Σ_{i,j, i≠j} φ_{i,j} x̃_i^⊤ x̃_j.   (25)

Taking derivatives with respect to x̃_i, 1 ≤ i ≤ k:

∂L/∂x̃_i = 0 = 2 H x̃_i / (x̃_i^⊤ H x̃_i) − 2 φ_{i,i} x̃_i − 2 Σ_{j≠i} φ_{i,j} x̃_j   (26)
⟹ Σ_{j≠i} φ_{i,j} x̃_j = ( H / (x̃_i^⊤ H x̃_i) − φ_{i,i} I_p ) x̃_i.   (27)

In Equation 27, we have x̃_i expressed as a linear combination of the x̃_j, j ≠ i, but x̃_i and the x̃_j are orthogonal, so x̃_i cannot be expressed as such; hence φ_{i,j} = 0 for i ≠ j, and

H x̃_i / (x̃_i^⊤ H x̃_i) = φ_{i,i} x̃_i,   (28)

meaning the x̃_i are eigenvectors of H for 1 ≤ i ≤ k. We can use the same Lagrange multipliers to show that the x̃_i for k + 1 ≤ i ≤ p are also eigenvectors of H. This means that our cost becomes

KL = (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i)   (29)
= (1/δ) Σ_{j=k+1}^p κ̃_j + Σ_{i=1}^k log κ̃_i,   (30)

where the set (κ̃_1, κ̃_2, …, κ̃_p) is a permutation of (λ_1, λ_2, …, λ_p) and κ̃_i = λ̃_i for 1 ≤ i ≤ k. I.e., H̃ shares k eigenvalues with H, and the rest are δ. It now remains to determine which eigenvalues are shared and which are excluded. Consider only two eigenvalues λ_i and λ_j, let λ_i > λ_j ≥ 0, and let r = λ_i/λ_j. The relative cost of excluding λ_i from the set {κ̃_1, κ̃_2, …, κ̃_k} compared to including it is

Relative Cost = (λ_i − λ_j)/δ − log r.

If the relative cost is positive, then including λ_i as one of the eigenvalues of H̃ is the more optimal choice. Now solving the inequality,

Relative Cost > 0  ⟺  (λ_j/δ)(r − 1) − log r > 0,

which, for sufficiently small δ, is always true because r > 1. Thus it is always better to swap an included/excluded pair of eigenvalues if the excluded one is larger. This means that H̃ has the k largest eigenvalues of H, and we have already shown that it shares the same eigenvectors. This maximal eigenvalue/eigenvector pair selection is exactly the procedure used by SOLA.
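The selection rule can be illustrated numerically. The sketch below (a random SPD matrix of our own choosing, with a small δ) evaluates the KL cost of keeping the top-k eigenpairs versus the bottom-k, confirming that the top-k choice is cheaper:

```python
import numpy as np

# Sketch of the Appendix C.5 result: with the small-delta relaxation, the KL
# cost Tr(H_tilde^{-1} H) + log|H_tilde| is minimized by keeping the k largest
# eigenvalue/eigenvector pairs of H, the selection rule used by SOLA.

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
H = A @ A.T + 1e-2 * np.eye(6)             # random SPD "Hessian"
evals, evecs = np.linalg.eigh(H)           # eigenvalues in ascending order
k, delta = 3, 1e-4

def kl_cost(keep):
    # keep: indices of eigenpairs whose eigenvalues are retained;
    # the remaining directions are flattened to delta.
    lam = np.full(6, delta)
    lam[keep] = evals[keep]
    H_tilde = (evecs * lam) @ evecs.T
    return np.trace(np.linalg.solve(H_tilde, H)) + np.linalg.slogdet(H_tilde)[1]

print(kl_cost([3, 4, 5]))   # top-k eigenpairs: lowest cost
print(kl_cost([0, 1, 2]))   # bottom-k eigenpairs: much higher cost
```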
D COLD POSTERIOR VCL AND FURTHER GENERALIZATIONS

The use of KL-reweighting is closely related to the idea of cold posteriors, in which p_τ(θ|D) ∝ p(θ|D)^{1/τ}. Finding this cold posterior is equivalent to finding the optimal q distribution maximizing the τ-ELBO,

τ-ELBO := E_{θ∼q(θ)}[log p(D|θ) + log p(θ) − τ log q(θ)],

when Q is the set of all possible distributions over θ. This objective is the same as the standard ELBO with only the entropy term reweighted, in contrast to the β-ELBO, where both the entropy and the prior likelihood are reweighted. Here, β acts similarly to T (the temperature, not to be confused with the task number). This relationship naturally leads to the transition diagram shown in Figure 11, in which we can transition between posteriors at different temperatures by optimizing either the β-ELBO or the τ-ELBO, or by tempering the posterior.

Figure 11: Transitions between posteriors at different temperatures using tempering and optimizing either the τ-ELBO or the β-ELBO. [Diagram: a cold row (τ < 1) of tempered distributions p ∝ p(θ)^{1/τ}, p ∝ p(θ|D_1)^{1/τ}, p ∝ p(θ|D_{1:2})^{1/τ}, …, and a warm row (τ = 1) of distributions p(θ), p(θ|D_1), p(θ|D_{1:2}), …, connected vertically by tempering and horizontally by ELBO optimization.]

When Q contains all possible distributions, moving along any path results in exactly the same distribution; for example, optimizing the τ-ELBO and then tempering is the same as directly optimizing the ELBO. However, when Q is limited, this transition is not exact and the resulting posterior is path-dependent. In fact, each possible path represents a different valid method for performing continual learning. Standard VCL works by traversing the horizontal arrows, directly optimizing the ELBO, while an alternative scheme would optimize the τ-ELBO to form cold posteriors, then heat the posterior before optimizing the τ-ELBO for a new task. Inference can be done in either the warm or the cold state. Note that for Gaussians, heating the posterior is just a matter of scaling the covariance matrix by a constant factor τ after training.

While warm posteriors generated through this two-step procedure are not optimal under the ELBO, when Q is limited they may perform better for continual learning. Similar to Equation 2, the optimal Σ when optimizing the τ-ELBO is given by

Σ_T^{-1} = (1/τ) ( Σ_{t=1}^T H_t + Σ_0^{-1} ),

where H_t is the approximate curvature for a specific value of τ for task t, which coincides with the true Hessian as τ → 0, as with the β-ELBO. Here, both the prior and the data-dependent components are scaled by 1/τ, in contrast to Equation 2, where only the data-dependent component is reweighted. As discussed in Section 2.2 and further explored in Appendix A, this leads to a different scale of the quadratic approximation, which may lend itself better to continual learning. This also yields a second way to recover γ in Online EWC: first optimize the β-ELBO with β = γ, then temper by a factor of 1/γ (i.e., increase the temperature when γ < 1).
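To make the contrast between the two scalings concrete, the following toy snippet (illustrative per-task curvatures and prior only, under the quadratic small-temperature approximation described above) evaluates the two fixed-point precisions side by side: the τ-ELBO rescales the prior contribution by 1/τ, while the β-ELBO leaves it at unit weight:

```python
import numpy as np

# beta-ELBO fixed point: Sigma^{-1} = (1/beta) * sum_t H_t + Sigma_0^{-1}
# tau-ELBO fixed point:  Sigma^{-1} = (1/tau) * (sum_t H_t + Sigma_0^{-1})
# H_t and Sigma_0 are arbitrary toy choices.

H = [np.diag([2.0, 0.5]), np.diag([1.0, 1.5])]   # per-task curvatures
Sigma0_inv = np.eye(2)                           # prior precision
temp = 0.1                                       # beta = tau = 0.1

prec_beta = sum(H) / temp + Sigma0_inv           # prior contributes with weight 1
prec_tau = (sum(H) + Sigma0_inv) / temp          # prior contributes with weight 1/tau
print(np.diag(prec_beta))
print(np.diag(prec_tau))
```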
E MAP DEGENERACY WITH FILM LAYERS

Here we describe how training FiLM layers with MAP training leads to degenerate values for the weights and scales, whereas with VI training no degeneracy occurs. For simplicity, consider only the weights leading into a single node, and let there be d of them, i.e. θ has dimension d. Because we only have one node, our scale parameter γ is a single variable.

For MAP training, we have the loss function L = −log p(D|θ, γ) + (λ/2)‖θ‖², with D the dataset and λ the L2-regularization hyperparameter. Note that p(D|θ, γ) = p(D|cθ, (1/c)γ): we can scale θ arbitrarily without affecting the likelihood, so long as γ is scaled inversely. If c < 1, then (λ/2)‖cθ‖² < (λ/2)‖θ‖², so shrinking θ by a factor c while inversely scaling γ decreases the L2 penalty without changing the likelihood. Therefore the optimal setting of the scale parameter γ is arbitrarily large, while θ shrinks to 0.

At a high level, VI training (with Gaussian posteriors and priors) does not have this issue because the KL-divergence penalizes the variances of the parameters for deviating from the prior, in addition to the means, whereas MAP training only penalizes the means. Unlike with MAP training, if we downscale the weights, we also downscale their variances, which increases the KL-divergence. The variances cannot revert to the prior either: when up-scaled by the FiLM scale parameter, the noise would increase, harming the log-likelihood component of the ELBO. Therefore, there exists an optimal amount of scaling which balances the mean-squared penalty component of the KL-divergence against the variance terms.

Mathematically, we can derive this optimal scale. Consider VI training with Gaussian variational distribution and prior, where our approximate posterior q(θ) has mean and covariance µ and Σ, and our prior p(θ) has parameters µ_0 and Σ_0. First consider the scenario without FiLM layers, with loss function L = −E_{θ∼q(θ)}[log p(D|θ)] + DKL(q(θ) ‖ p(θ)). For multivariate Gaussians,

DKL(q(θ) ‖ p(θ)) = (1/2) ( log |Σ_0| − log |Σ| − d + Tr(Σ_0^{-1} Σ) + (µ − µ_0)^⊤ Σ_0^{-1} (µ − µ_0) ).

Now consider another distribution q′(θ), with mean and covariance parameters cµ and c²Σ. If q′(θ) is paired with a FiLM scale parameter γ set to 1/c, the log-likelihood component is unchanged:

E_{θ∼q(θ)}[log p(D|θ)] = E_{θ∼q′(θ)}[log p(D|θ, γ = 1/c)],

with γ being our FiLM scale parameter and p(D|θ, γ) representing a model with FiLM scale layers. Now consider DKL(q′(θ) ‖ p(θ)), and optimize c with µ and Σ fixed:

DKL(q′(θ) ‖ p(θ)) = (1/2) ( log |Σ_0| − log |c²Σ| − d + Tr(Σ_0^{-1} c²Σ) + (cµ − µ_0)^⊤ Σ_0^{-1} (cµ − µ_0) )
= (1/2) ( log |Σ_0| − log |Σ| − 2d log c − d + c² Tr(Σ_0^{-1} Σ) + (cµ − µ_0)^⊤ Σ_0^{-1} (cµ − µ_0) ),

∂DKL/∂c |_{c=c*} = 0 = −d/c* + c* Tr(Σ_0^{-1} Σ) + (c*µ − µ_0)^⊤ Σ_0^{-1} µ
0 = −d + c*² Tr(Σ_0^{-1} Σ) + c*² µ^⊤ Σ_0^{-1} µ − c* µ_0^⊤ Σ_0^{-1} µ
0 = c*² ( Tr(Σ_0^{-1} Σ) + µ^⊤ Σ_0^{-1} µ ) − c* µ_0^⊤ Σ_0^{-1} µ − d

c* = [ µ_0^⊤ Σ_0^{-1} µ ± sqrt( (µ_0^⊤ Σ_0^{-1} µ)² + 4d ( Tr(Σ_0^{-1} Σ) + µ^⊤ Σ_0^{-1} µ ) ) ] / [ 2 ( Tr(Σ_0^{-1} Σ) + µ^⊤ Σ_0^{-1} µ ) ].

Also note that c = 0 results in an infinitely large KL-divergence, so there is a barrier at c = 0; if optimized through gradient descent, c should never change sign. Furthermore, note that ∂²DKL/∂c² = d/c² + Tr(Σ_0^{-1} Σ) + µ^⊤ Σ_0^{-1} µ > 0, so the KL-divergence is convex with respect to c on either side of this barrier. Hence c* is a minimizer of DKL, and DKL(q(θ) ‖ p(θ)) ≥ DKL(q′(θ) ‖ p(θ))|_{c=c*}, which implies the optimal value of the FiLM scale parameter γ is 1/c*. While no formal data was collected, it was observed that the scale parameters do in fact reach very close to this optimal value after training.
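The closed-form root can be checked against a direct 1-D minimization. The sketch below (arbitrary toy parameter values, with the KL written up to additive constants) compares the closed-form c* with a numerically located minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical check of the optimal FiLM rescaling c* derived above: rescale a
# Gaussian posterior to q' = N(c*mu, c^2*Sigma) and minimize its KL to the
# prior over c; the minimizer should match the closed-form positive root.

rng = np.random.default_rng(2)
d = 3
mu, mu0 = rng.normal(size=d), rng.normal(size=d)
Sigma = np.diag(rng.uniform(0.1, 1.0, d))
Sigma0_inv = np.eye(d)

A = np.trace(Sigma0_inv @ Sigma) + mu @ Sigma0_inv @ mu
b = mu0 @ Sigma0_inv @ mu
c_closed = (b + np.sqrt(b**2 + 4 * d * A)) / (2 * A)   # positive root

def kl_of_c(c):
    # KL(q'||p) as a function of c, up to additive constants
    return -d * np.log(c) + 0.5 * c**2 * A - c * b

c_numeric = minimize_scalar(kl_of_c, bounds=(1e-3, 10), method='bounded').x
print(c_closed, c_numeric)   # the two should agree
```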
F CLUSTERING OF FILM PARAMETERS

Figure 12 (panel (c): shifts and scales): t-SNE of FiLM layer parameters of 58 tasks coming from different domains. Shift and scale parameters from the same domain are more similar than those from different ones.

In this section, we test the interpretability of the learned FiLM parameters. Such clustering has been done in the past with FiLM parameters, as well as with node-wise uncertainty parameters. One would intuitively expect that tasks from similar domains find similar features salient, and thus share similar FiLM parameters. To test this hypothesis, we took the 8 mixed vision tasks from Section 5.3 and split each task into multiple 5-way classification tasks, so that there were many tasks from similar domains. For example, CIFAR100, which originally had 100 classes, became 20 5-way classification tasks; TrafficSigns became 8 tasks (seven 5-way and one 8-way); and MNIST became 2 (two 5-way). Next, we trained the same architecture used in Section 5.3, except on all 58 resulting tasks jointly. Joint training was chosen over continual learning to avoid artifacts which would arise from task ordering. Figure 12 shows that the resulting scale and shift parameters can be clustered, and that FiLM parameters which arise from the same base task cluster together. As in Achille et al. (2019), this could likely be used as a means of deciding which tasks to learn continually and which tasks to separate (i.e., tasks from the same cluster would likely benefit from joint training, while tasks from different clusters should be trained separately); however, we did not explore this idea further.

G HOW FILM LAYERS INTERACT WITH PRUNING

Figure 13 ((a) weights, no FiLM layers; (b) biases, no FiLM layers; (c) weights, with FiLM layers; (d) biases, with FiLM layers): Posterior distributions of incoming weights (left) or biases (right) for a node in the first layer. Nodes are either unpruned (left within a column) or pruned (right within a column). Without FiLM layers (top row), pruned nodes have their bias concentrated at a negative value, preventing future tasks from reactivating the node. With FiLM layers, a pruned node prunes using the FiLM parameters rather than the shared ones, allowing the posteriors to revert to the prior distribution and the node to be reactivated.

In Section 3, we discussed the problem of pruning in variational continual learning and how it prevents nodes from becoming reactivated. To reiterate, pruning broadly occurs in three steps (illustrated numerically in the sketch at the end of this section):

1. Weights incoming to a node begin to revert to the prior distribution.
2. Noise from these high-variance weights affects the likelihood term in the ELBO.
3. To suppress this noise, the bias concentrates at a negative value, so that the node's output is cut off by the ReLU activation.

Later tasks are then initialized with this negative bias with low variance, meaning that the node has a difficult time being reactivated without incurring a high prior cost. This results in the effect shown in Figure 1, where after the first task effectively no more nodes are reactivated. The effect is further exacerbated with larger values of β, for which the pruning effect is stronger. Increasing λ worsens this as well, as the increased quadratic cost further prevents the already low-variance negative biases from moving. We verify that this mechanism is indeed the cause of the limited capacity use by visualizing the posteriors of the weights and biases entering a node in the first convolutional layer of a network trained on Easy-CHASY (Figure 13). Here we see that, without FiLM layers, the biases of pruned nodes do indeed concentrate at negative values. In contrast, with FiLM layers, the biases are able to revert to their prior because the FiLM parameters perform the pruning.
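The noise-suppression step can be seen in a small simulation: sampling the incoming weights from the prior makes the pre-activation noisy, and an increasingly negative bias drives the ReLU output, and hence the injected noise, toward zero. The sketch below uses arbitrary toy values and is only a qualitative illustration of the mechanism:

```python
import numpy as np

# Pruning mechanism sketch: with weights reverted to the prior, the
# pre-activation a = w^T x + b is noisy across weight samples; a
# sufficiently negative bias pushes it below zero, so the ReLU output
# (and the noise it injects into the likelihood) is cut off.

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 50))             # inputs to the node
w = rng.normal(size=(10_000, 50))             # weights resampled from the prior

for bias in [0.0, -2.0, -10.0]:
    pre = (w * x).sum(axis=1) / np.sqrt(50) + bias
    post = np.maximum(pre, 0.0)               # ReLU
    print(bias, post.var())                   # output noise shrinks with bias
```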
H RELATED WORK

Regularization-based continual learning. Many algorithms regularize network parameters based on a measure of importance. The algorithms most directly comparable to GVCL are EWC (Kirkpatrick et al., 2017), Online EWC (Schwarz et al., 2018) and VCL (Nguyen et al., 2018). EWC measures importance using the Fisher information matrix, while VCL uses an approximate posterior covariance matrix as its importance measure. Online EWC slightly modifies EWC so that there is only a single regularizer, based on the cumulative sum of Fisher information matrices. Lee et al. (2017) proposed IMM, an extension of EWC which merges posteriors based on their Fisher information matrices. Ritter et al. (2018) and Yin et al. (2020) both aim to approximate the Hessian using Kronecker-factored or low-rank forms, respectively, using the Laplace approximation to form approximate posteriors over parameters. These methods all use second-order approximations of the loss. Ahn et al. (2019), like us, use regularizers based on the ELBO, but measure importance on a per-node rather than a per-weight basis. SI (Zenke et al., 2017) measures importance using synaptic saliency, as opposed to methods based on approximate curvature.

Architectural approaches to continual and meta-learning. This family of methods modifies the standard neural architecture by adding components to the network either in parallel or in series. Progressive Neural Networks add a parallel column network for every task. PathNet (Fernando et al., 2017) can be interpreted as a parallel-network algorithm, but rather than growing the model over time, the model size remains fixed while paths between layer columns are optimized. FiLM parameters can be interpreted as adding series components to a network, and have been a mainstay of the multi-task and meta-learning literature. Requeima et al. (2019) use hypernetworks to amortize FiLM parameter learning, an approach which has been shown to be capable of continual learning. Architectural approaches are often used in tandem with regularization-based ones, as in HAT (Serra et al., 2018), which uses per-task gating parameters alongside a compression-based regularizer. Adel et al. (2020) propose CLAW, which also uses variational inference alongside per-task parameters, but requires a more complex meta-learning-based training procedure involving multiple splits of the dataset. GVCL with FiLM layers adds to this list of hybrid architectural-regularization approaches.

Cold posteriors and likelihood-tempering. As mentioned in Section 2, likelihood-tempering (or KL-reweighting) has been empirically found to improve performance when using variational inference for Bayesian neural networks across a wide range of contexts (Osawa et al., 2019; Zhang et al., 2018). Cold posteriors are closely related to likelihood tempering, except that they temper the full posterior rather than only the likelihood term; they often empirically outperform Bayesian posteriors when using MCMC sampling (Wenzel et al., 2020). From an information-theoretic perspective, KL-reweighted ELBOs have also been studied in the context of compression (Achille et al., 2020). Achille et al. (2019), like us, consider a limiting case of β and use it to measure parameter saliency, but use this information to create a task embedding rather than for continual learning. Outside of the Bayesian neural network context, values of β > 1 have also been explored (Higgins et al., 2017), and more generally different values of β trace out different points on a rate-distortion curve for VAEs (Alemi et al., 2018).

I EXPERIMENT DETAILS

I.1 REPORTED METRICS

All reported scores and figures present the mean and standard deviation across 5 runs of the algorithm with different network initializations. For Easy-CHASY and Hard-CHASY, the train/test splits are also varied across runs. For the mixed vision tasks, the permutation of the 8 tasks is also randomized between runs.

Let the matrix R_{i,j} represent the performance on the jth task after the model was trained on the ith task. Furthermore, let R^{ind}_j be the mean performance on the jth task of a network trained only on that task, and let the total number of tasks be T. Following Lopez-Paz & Ranzato (2017) and Pan et al. (2020), we define

Average Accuracy (ACC) = (1/T) Σ_{j=1}^T R_{T,j},
Forward Transfer (FWT) = (1/T) Σ_{j=1}^T (R_{j,j} − R^{ind}_j),
Backward Transfer (BWT) = (1/T) Σ_{j=1}^T (R_{T,j} − R_{j,j}).

Note that these metrics are not exactly the same as those presented in other works: our FWT and BWT sum over the indices 1 ≤ j ≤ T, whereas Lopez-Paz & Ranzato (2017) and Pan et al. (2020) sum over 2 ≤ j ≤ T and 1 ≤ j ≤ T − 1 for FWT and BWT, respectively. For FWT, our definition does not assume that R_{1,1} = R^{ind}_1, which affects algorithms such as HAT and Progressive Neural Networks, which either compress the model, resulting in lower accuracy, or use a smaller architecture for the first task. The modified BWT is equal to the other BWT metrics apart from a constant factor (T − 1)/T.

Intuitively, forward transfer measures how much continual learning has benefited a task at the time it is newly learned, while backward transfer is the accuracy drop as the network learns more tasks, relative to when the task was first learned. Furthermore, in the tables in Appendix J, we also present the net performance gain (NET), which quantifies the total gain over separate training at the end of continual training:

NET = FWT + BWT = (1/T) Σ_{j=1}^T (R_{T,j} − R^{ind}_j).

Note that for the computation of R^{ind}, we compare to models trained under the same paradigm: MAP algorithms (all baselines except VCL) are compared to a MAP-trained model, and VI algorithms (GVCL-F, GVCL and VCL) are compared to KL-reweighted VI models. This does not make a difference for most of the benchmarks, where R^{ind}_MAP ≈ R^{ind}_VI. However, for Easy and Hard CHASY, R^{ind}_MAP < R^{ind}_VI, so we compare VI to VI and MAP to MAP to obtain fair metrics.
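For concreteness, these definitions can be computed directly from the accuracy matrix R. The sketch below uses a made-up 3-task matrix purely to exercise the formulas:

```python
import numpy as np

# R[i, j]: accuracy on task j after training on task i (made-up values);
# R_ind[j]: accuracy of a separately trained model on task j.

R = np.array([[0.95, 0.10, 0.10],
              [0.93, 0.94, 0.10],
              [0.91, 0.92, 0.96]])
R_ind = np.array([0.94, 0.93, 0.95])
T = R.shape[0]

ACC = R[T - 1].mean()                     # average final accuracy
FWT = (np.diag(R) - R_ind).mean()         # gain when each task is first learned
BWT = (R[T - 1] - np.diag(R)).mean()      # drop after learning later tasks
NET = FWT + BWT                           # equals (R[T-1] - R_ind).mean()
print(ACC, FWT, BWT, NET)
```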
In Figure 5b, we plot ACC_i, which we define as

ACC_i = (1/i) Σ_{j=1}^i R_{i,j}.

This metric is useful when the tasks have very different accuracies and their permutation is randomized, as is the case with the mixed vision tasks. Note that this means R_{i,j} refers to a different task for each permutation, but we average over the 5 permutations of the runs. Empirically, if two algorithms have similar final accuracies, this metric measures how much the network forgets about the first i tasks from that point to the end of training, and it also measures how high the accuracy would have been had training been terminated after i tasks. Plotting it also captures the distinction between graceful and catastrophic forgetting: graceful forgetting shows up as a smooth downward curve, while catastrophic forgetting produces sudden drops.

I.2 OPTIMIZER AND TRAINING DETAILS

The implementation of all baseline methods was based on the GitHub repository⁶ for HAT (Serra et al., 2018), except that the implementations of IMM-Mode and EWC were modified due to an error in the computation of the Fisher information matrix in the original implementation. Baseline MAP algorithms were trained with SGD with a decaying learning rate starting at 5e-2, with a maximum of 200 epochs per task for the Split-MNIST, Split-CIFAR and mixed vision benchmarks. The maximum number of epochs for Easy-CHASY and Hard-CHASY was 1000, due to the small dataset size. Early stopping based on the validation set was used: 10% of the training set was used as validation for these methods, and for Easy and Hard CHASY, 8 samples per class form the validation set (disjoint from both the training and the test samples).

For the VI models, we used the Adam optimizer with a learning rate of 1e-4 for Split-MNIST and Mixture, and 1e-3 for Easy-CHASY, Hard-CHASY and Split-CIFAR. We briefly tested running the baseline algorithms with Adam rather than SGD and performance did not change. Easy-CHASY and Hard-CHASY were run for 1500 epochs per task, Split-MNIST for 100, Split-CIFAR for 60, and Mixture for 180. The number of epochs was chosen so that the number of gradient steps for each task was roughly equal. For Easy-CHASY, Hard-CHASY and Split-CIFAR, this means that later tasks run for more epochs, since the largest training sets come at the start. For Mixture, we ran the equivalent of 180 epochs of Facescrub; for the corresponding number of epochs on the other datasets, we refer the reader to Appendix A of Serra et al. (2018). We did not use early stopping for these VI results. While in some cases we trained for many more epochs than the baselines, the baselines used early stopping and all stopped long before the 200-epoch limit was reached, so allocating more time would not change their results. Swaroop et al. (2019) also find that allowing VI to converge is crucial for continual learning performance. We leave the discussion of improving this convergence time to future work. All experiments (both the baselines and the VI methods) use a batch size of 64.

I.3 ARCHITECTURAL DETAILS

Easy and Hard CHASY. We use a convolutional architecture with 2 convolutional layers (a deterministic sketch of this backbone with task-specific FiLM layers follows the list):
1. 3x3 convolutional layer with 16 filters, padding of 1, ReLU activations
2. 2x2 max pooling with stride 2
3. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
4. 2x2 max pooling with stride 2
5. Flattening layer
6. Fully connected layer with 100 units and ReLU activations
7. Task-specific head layers
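The sketch below lays out this backbone in PyTorch, with task-specific FiLM layers placed after each convolutional/hidden layer and before the ReLU, as in GVCL-F. Note that this is only an illustration of where the per-task parameters sit: the actual model uses mean-field Gaussian (variational) weight layers, and the class names, single input channel (HASYv2 symbols are 32x32 black/white images) and 20-classes-per-task head sizes are our own placeholder choices:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Per-task, per-channel scale and shift (all tasks stored in one tensor)."""
    def __init__(self, n_tasks, n_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_tasks, n_channels))
        self.shift = nn.Parameter(torch.zeros(n_tasks, n_channels))

    def forward(self, x, task):
        s, b = self.scale[task], self.shift[task]
        if x.dim() == 4:                      # conv features: broadcast over H, W
            s, b = s[:, None, None], b[:, None, None]
        return x * s + b

class ChasyNet(nn.Module):
    def __init__(self, n_tasks, classes_per_task):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.film1 = FiLM(n_tasks, 16)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.film2 = FiLM(n_tasks, 32)
        self.fc = nn.Linear(32 * 8 * 8, 100)
        self.film3 = FiLM(n_tasks, 100)
        self.heads = nn.ModuleList(
            [nn.Linear(100, c) for c in classes_per_task])
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()

    def forward(self, x, task):
        # FiLM sits after each conv/linear layer, before the ReLU
        x = self.pool(self.relu(self.film1(self.conv1(x), task)))
        x = self.pool(self.relu(self.film2(self.conv2(x), task)))
        x = x.flatten(1)
        x = self.relu(self.film3(self.fc(x), task))
        return self.heads[task](x)

net = ChasyNet(n_tasks=10, classes_per_task=[20] * 10)
logits = net(torch.randn(4, 1, 32, 32), task=0)
print(logits.shape)   # torch.Size([4, 20])
```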
Split-MNIST. We use a standard MLP with:
1. Fully connected layer with 256 units and ReLU activations
2. Fully connected layer with 256 units and ReLU activations
3. Task-specific head layers

Split-CIFAR. We use the same architecture as Zenke et al. (2017):
1. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
2. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
3. 2x2 max pooling with stride 2
4. 3x3 convolutional layer with 64 filters, padding of 1, ReLU activations
5. 3x3 convolutional layer with 64 filters, padding of 1, ReLU activations
6. 2x2 max pooling with stride 2
7. Flattening layer
8. Fully connected layer with 512 units and ReLU activations
9. Task-specific head layers

Mixed vision tasks. We use the same AlexNet-style architecture as Serra et al. (2018):
1. 4x4 convolutional layer with 64 filters, padding of 0, ReLU activations
2. 2x2 max pooling with stride 2
3. 3x3 convolutional layer with 128 filters, padding of 0, ReLU activations
4. 2x2 max pooling with stride 2
5. 2x2 convolutional layer with 256 filters, padding of 0, ReLU activations
6. 2x2 max pooling with stride 2
7. Flattening layer
8. Fully connected layer with 2048 units and ReLU activations
9. Fully connected layer with 2048 units and ReLU activations
10. Task-specific head layers

⁶ Repository at https://github.com/joansj/hat

For MAP models, dropout layers with probabilities of either 0.2 or 0.5 were added after the convolutional or fully connected layers. For GVCL-F, FiLM layers were inserted after convolutional/hidden layers but before the ReLU activations.

I.4 HYPERPARAMETER SELECTION

For all algorithms on Easy-CHASY, Hard-CHASY, Split-MNIST and Split-CIFAR, hyperparameter selection was done by choosing the combination which produced the best average accuracy on the first 3 tasks; the algorithms were then run on the full number of tasks. For the mixed vision tasks, the best hyperparameters for the baselines were taken from the HAT GitHub repository. For GVCL, we performed hyperparameter selection in the same way as Serra et al. (2018): we found the best hyperparameters for the average performance on the first random permutation of tasks. Note that for the mixture tasks we randomly permute the task order for each run (with permutations kept consistent between algorithms), whereas for the other 4 benchmarks the task order is fixed. Hyperparameter searches were performed using a grid search. The selected hyperparameters are shown in Table 3.

| Algorithm | Hyperparameter | Easy-CHASY | Hard-CHASY | Split-MNIST | Split-CIFAR | Mixed Vision |
|---|---|---|---|---|---|---|
| GVCL-F | β | 0.05 | 0.05 | 0.1 | 0.2 | 0.1 |
| GVCL-F | λ | 10 | 10 | 100 | 100 | 50 |
| GVCL | β | 0.05 | 0.05 | 0.1 | 0.2 | 0.1 |
| GVCL | λ | 100 | 100 | 1 | 1000 | 100 |
| HAT | λ | 1 | 1 | 0.1 | 0.025 | 0.75* |
| HAT | smax | 10 | 50 | 50 | 50 | 400* |
| PathNet | # of evolutions | 20 | 200 | 10 | 100 | 20* |
| VCL | None | - | - | - | - | - |
| Online EWC | λ | 100 | 500 | 10000 | 100 | 5 |
| Progressive | None | - | - | - | - | - |
| IMM-Mean | λ | 0.0005 | 1e-6 | 5e-4 | 1e-4 | 0.0001* |
| IMM-Mode | λ | 1e-7 | 0.1 | 0.1 | 1e-5 | 1 |
| LWF | λ | 0.5 | 0.5 | 2 | 2 | 2* |
| LWF | T | 4 | 2 | 4 | 4 | 1* |

* Best hyperparameters taken from the HAT code.

Table 3: Best (selected) hyperparameters for the continual learning experiments for the various algorithms. We fix Online EWC's γ = 1. For the Joint and Separate VI baselines, we used the same β. For the mixed vision tasks, we had to use a prior variance of 0.01 (for VCL, GVCL and GVCL-F), but for all other tasks we did not need to tune this.
J FURTHER EXPERIMENTAL RESULTS

In the following section, we present more quantitative results for the various baselines on our benchmarks. For brevity, the main text includes only the best-performing baselines and those most comparable to GVCL: HAT, PathNet, Online EWC and VCL.

J.1 EASY-CHASY ADDITIONAL RESULTS

| Algorithm | ACC (%) | BWT (%) | FWT (%) | NET (%) |
|---|---|---|---|---|
| GVCL-F | 90.9 ± 0.3 | 0.2 ± 0.1 | 0.4 ± 0.3 | 0.6 ± 0.3 |
| GVCL | 88.9 ± 0.6 | 0.8 ± 0.4 | 0.6 ± 0.5 | 1.4 ± 0.6 |
| HAT | 82.6 ± 0.9 | 1.6 ± 0.6 | 0.4 ± 1.4 | 1.3 ± 0.9 |
| PathNet | 82.4 ± 0.9 | 0.0 ± 0.0 | 1.5 ± 0.9 | 1.5 ± 0.9 |
| VCL | 78.4 ± 1.0 | 4.1 ± 1.2 | 7.9 ± 0.8 | 11.9 ± 1.0 |
| VCL-F | 79.9 ± 1.0 | 6.1 ± 0.9 | 4.3 ± 0.3 | 10.4 ± 1.0 |
| Online EWC | 73.4 ± 3.4 | 8.9 ± 2.9 | 1.5 ± 0.5 | 10.5 ± 3.4 |
| Online EWC-F | 76.0 ± 1.5 | 6.9 ± 1.6 | 1.0 ± 0.3 | 7.9 ± 1.5 |
| Progressive | 82.6 ± 0.6 | 0.0 ± 0.0 | 1.3 ± 0.6 | 1.3 ± 0.6 |
| IMM-Mean | 42.3 ± 1.0 | 1.1 ± 0.6 | 40.6 ± 1.1 | 41.6 ± 1.0 |
| IMM-Mode | 74.8 ± 1.0 | 11.2 ± 0.1 | 2.1 ± 0.9 | 9.1 ± 1.0 |
| LWF | 75.1 ± 2.4 | 12.9 ± 1.9 | 4.1 ± 0.6 | 8.8 ± 2.4 |
| SGD | 75.3 ± 1.8 | 11.1 ± 0.9 | 2.5 ± 1.0 | 8.6 ± 1.8 |
| SGD-Frozen | 81.2 ± 0.8 | 0.0 ± 0.0 | 2.7 ± 0.8 | 2.7 ± 0.8 |
| Separate (MAP) | 88.4 ± 0.8 | - | - | 0.0 ± 0.0 |
| Separate (β-VI) | 90.3 ± 0.1 | - | - | 0.0 ± 0.0 |
| Joint (MAP) | 88.6 ± 0.7 | - | - | 4.7 ± 0.7 |
| Joint (β-VI + FiLM) | 91.9 ± 0.1 | - | - | 1.6 ± 0.1 |

Table 4: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Easy-CHASY. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 14: Mean accuracy of individual tasks after training, for all approaches on Easy-CHASY.

Figure 15: Mean accuracy of individual tasks after training, for the top 5 performing approaches on Easy-CHASY.

Figure 16: Running average accuracy of individual tasks after training, for all approaches on Easy-CHASY.

Figure 17: Running average accuracy of individual tasks after training, for the top 5 approaches on Easy-CHASY.

J.2 HARD-CHASY ADDITIONAL RESULTS

| Algorithm | ACC (%) | BWT (%) | FWT (%) | NET (%) |
|---|---|---|---|---|
| GVCL-F | 69.5 ± 0.6 | 0.1 ± 0.1 | 1.6 ± 0.7 | 1.7 ± 0.6 |
| GVCL | 64.4 ± 0.6 | 0.6 ± 0.2 | 6.3 ± 0.6 | 6.8 ± 0.6 |
| HAT | 62.5 ± 5.4 | 0.8 ± 0.4 | 3.7 ± 5.5 | 4.5 ± 5.4 |
| PathNet | 64.8 ± 0.8 | 0.0 ± 0.0 | 2.2 ± 0.8 | 2.2 ± 0.8 |
| VCL | 45.8 ± 1.4 | 11.9 ± 1.6 | 13.5 ± 2.2 | 25.4 ± 1.4 |
| VCL-F | 65.0 ± 0.8 | 2.7 ± 0.8 | 3.4 ± 0.6 | 6.1 ± 0.8 |
| Online EWC | 56.4 ± 1.7 | 7.1 ± 1.7 | 3.4 ± 1.3 | 10.5 ± 1.7 |
| Online EWC-F | 56.7 ± 6.4 | 8.8 ± 5.9 | 1.4 ± 0.9 | 10.2 ± 6.4 |
| Progressive | 65.2 ± 1.6 | 0.0 ± 0.0 | 1.8 ± 1.6 | 1.8 ± 1.6 |
| IMM-Mean | 35.5 ± 0.8 | 1.0 ± 0.8 | 30.5 ± 1.2 | 31.5 ± 0.8 |
| IMM-Mode | 44.3 ± 4.3 | 22.2 ± 5.4 | 0.5 ± 1.1 | 22.7 ± 4.3 |
| LWF | 46.4 ± 2.5 | 23.0 ± 2.8 | 2.4 ± 1.0 | 20.6 ± 2.5 |
| SGD | 47.1 ± 2.2 | 21.0 ± 2.7 | 1.2 ± 0.7 | 19.8 ± 2.2 |
| SGD-Frozen | 61.6 ± 1.4 | 0.0 ± 0.0 | 5.3 ± 1.4 | 5.3 ± 1.4 |
| Separate (MAP) | 54.1 ± 1.2 | - | - | 0.0 ± 0.0 |
| Separate (β-VI) | 71.2 ± 0.5 | - | - | 0.0 ± 0.0 |
| Joint (MAP) | 66.4 ± 0.6 | - | - | 0.6 ± 0.6 |
| Joint (β-VI + FiLM) | 70.4 ± 0.8 | - | - | 0.8 ± 0.8 |

Table 5: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Hard-CHASY.
Separate and joint training results for both MAP and β-VI models are also presented.

Figure 18: Mean accuracy of individual tasks after training, for all approaches on Hard-CHASY.

Figure 19: Mean accuracy of individual tasks after training, for the top 5 performing approaches on Hard-CHASY.

Figure 20: Running average accuracy of individual tasks after training, for all approaches on Hard-CHASY.

Figure 21: Running average accuracy of individual tasks after training, for the top 5 approaches on Hard-CHASY.

J.3 SPLIT-MNIST ADDITIONAL RESULTS

| Algorithm | ACC (%) | BWT (%) | FWT (%) | NET (%) |
|---|---|---|---|---|
| GVCL-F | 98.6 ± 0.1 | 0.0 ± 0.0 | 0.1 ± 0.1 | 0.0 ± 0.1 |
| GVCL | 94.6 ± 0.7 | 4.0 ± 0.7 | 0.0 ± 0.0 | 4.1 ± 0.7 |
| HAT | 98.3 ± 0.1 | 0.2 ± 0.0 | 0.1 ± 0.1 | 0.3 ± 0.1 |
| PathNet | 95.2 ± 1.8 | 0.0 ± 0.0 | 3.3 ± 1.8 | 3.3 ± 1.8 |
| VCL | 92.4 ± 1.2 | 5.5 ± 1.1 | 0.8 ± 0.1 | 6.3 ± 1.2 |
| VCL-F | 94.8 ± 0.9 | 3.3 ± 0.9 | 0.6 ± 0.1 | 3.9 ± 0.9 |
| Online EWC | 94.0 ± 1.4 | 3.8 ± 1.4 | 0.8 ± 0.1 | 4.6 ± 1.4 |
| Online EWC-F | 94.1 ± 0.7 | 0.3 ± 0.6 | 4.1 ± 0.3 | 4.4 ± 0.7 |
| Progressive | 98.4 ± 0.0 | 0.0 ± 0.0 | 0.2 ± 0.0 | 0.2 ± 0.0 |
| IMM-Mean | 90.5 ± 1.1 | 0.5 ± 0.1 | 8.5 ± 1.2 | 8.0 ± 1.1 |
| IMM-Mode | 95.4 ± 0.2 | 1.7 ± 0.3 | 1.5 ± 0.1 | 3.1 ± 0.2 |
| LWF | 97.4 ± 0.2 | 1.1 ± 0.1 | 0.1 ± 0.1 | 1.2 ± 0.2 |
| SGD | 76.2 ± 1.7 | 22.4 ± 1.7 | 0.0 ± 0.1 | 22.4 ± 1.7 |
| SGD-Frozen | 91.7 ± 0.2 | 0.0 ± 0.0 | 6.9 ± 0.2 | 6.9 ± 0.2 |
| Separate (MAP) | 98.6 ± 0.0 | - | - | 0.0 ± 0.0 |
| Separate (β-VI) | 98.7 ± 0.0 | - | - | 0.0 ± 0.0 |
| Joint (MAP) | 98.7 ± 0.0 | - | - | 0.1 ± 0.0 |
| Joint (β-VI + FiLM) | 98.8 ± 0.0 | - | - | 0.1 ± 0.0 |

Table 6: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Split-MNIST. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 22: Mean accuracy of individual tasks after training, for all approaches on Split-MNIST.

Figure 23: Mean accuracy of individual tasks after training, for the top 5 performing approaches on Split-MNIST.

Figure 24: Running average accuracy of individual tasks after training, for all approaches on Split-MNIST.

Figure 25: Running average accuracy of individual tasks after training, for the top 5 approaches on Split-MNIST.

J.4 SPLIT-CIFAR ADDITIONAL RESULTS

| Algorithm | ACC (%) | BWT (%) | FWT (%) | NET (%) |
|---|---|---|---|---|
| GVCL-F | 80.0 ± 0.5 | 0.3 ± 0.2 | 8.8 ± 0.5 | 8.5 ± 0.5 |
| GVCL | 70.6 ± 1.7 | 2.3 ± 1.4 | 1.3 ± 1.0 | 1.0 ± 1.7 |
| HAT | 77.3 ± 0.3 | 0.1 ± 0.1 | 6.8 ± 0.2 | 6.7 ± 0.3 |
| PathNet | 68.7 ± 0.8 | 0.0 ± 0.0 | 1.9 ± 0.8 | 1.9 ± 0.8 |
| VCL | 44.2 ± 14.2 | 23.9 ± 12.2 | 3.5 ± 2.1 | 27.4 ± 14.2 |
| VCL-F | 56.2 ± 2.8 | 19.5 ± 3.2 | 4.1 ± 0.8 | 15.4 ± 2.8 |
| Online EWC | 77.1 ± 0.2 | 0.5 ± 0.3 | 6.9 ± 0.3 | 6.4 ± 0.2 |
| Online EWC-F | 77.1 ± 0.2 | 0.4 ± 0.2 | 6.9 ± 0.3 | 6.5 ± 0.2 |
| Progressive | 70.7 ± 0.8 | 0.0 ± 0.0 | 0.1 ± 0.8 | 0.1 ± 0.8 |
| IMM-Mean | 67.6 ± 0.6 | 0.2 ± 0.3 | 2.9 ± 0.8 | 3.1 ± 0.6 |
| IMM-Mode | 74.9 ± 0.3 | 6.2 ± 0.3 | 10.5 ± 0.4 | 4.3 ± 0.3 |
| LWF | 73.8 ± 0.9 | 8.0 ± 0.8 | 11.2 ± 0.2 | 3.2 ± 0.9 |
| SGD | 74.7 ± 0.4 | 6.5 ± 0.4 | 10.6 ± 0.8 | 4.1 ± 0.4 |
| SGD-Frozen | 70.3 ± 0.4 | 0.0 ± 0.0 | 0.3 ± 0.4 | 0.3 ± 0.4 |
| Separate (MAP) | 70.6 ± 0.6 | - | - | 0.0 ± 0.0 |
| Separate (β-VI) | 71.6 ± 0.2 | - | - | 0.0 ± 0.0 |
| Joint (MAP) | 80.9 ± 0.3 | - | - | 10.2 ± 0.3 |
| Joint (β-VI + FiLM) | 79.8 ± 1.0 | - | - | 8.2 ± 1.0 |

Table 7: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Split-CIFAR.
Separate and joint training results for both MAP and β-VI models are also presented.

Figure 26: Mean accuracy of individual tasks after training, for all approaches on Split-CIFAR.

Figure 27: Mean accuracy of individual tasks after training, for the top 5 performing approaches on Split-CIFAR.

Figure 28: Running average accuracy of individual tasks after training, for all approaches on Split-CIFAR.

Figure 29: Running average accuracy of individual tasks after training, for the top 5 approaches on Split-CIFAR.

J.5 MIXED VISION TASKS ADDITIONAL RESULTS

| Algorithm | ACC (%) | BWT (%) | FWT (%) | NET (%) |
|---|---|---|---|---|
| GVCL-F | 80.0 ± 1.2 | 0.9 ± 1.3 | 4.8 ± 1.6 | 5.6 ± 1.2 |
| GVCL | 49.0 ± 2.8 | 13.1 ± 1.6 | 23.5 ± 3.4 | 36.7 ± 2.8 |
| HAT | 80.3 ± 1.0 | 0.1 ± 0.1 | 5.8 ± 1.0 | 5.9 ± 1.0 |
| PathNet | 76.8 ± 2.0 | 0.0 ± 0.0 | 9.5 ± 2.0 | 9.5 ± 2.0 |
| VCL | 26.9 ± 2.1 | 35.0 ± 5.6 | 23.7 ± 3.8 | 58.8 ± 2.1 |
| VCL-F | 55.5 ± 2.0 | 18.2 ± 2.1 | 11.9 ± 2.4 | 30.1 ± 2.0 |
| Online EWC | 62.8 ± 5.2 | 18.7 ± 5.8 | 4.8 ± 0.7 | 23.4 ± 5.2 |
| Online EWC-F | 70.5 ± 4.0 | 11.8 ± 4.3 | 3.9 ± 0.5 | 15.7 ± 4.0 |
| Progressive | 77.6 ± 0.4 | 0.0 ± 0.0 | 8.6 ± 0.4 | 8.6 ± 0.4 |
| IMM-Mean | 53.8 ± 2.0 | 4.4 ± 1.7 | 28.0 ± 3.3 | 32.4 ± 2.0 |
| IMM-Mode | 36.6 ± 18.7 | 9.1 ± 7.0 | 40.5 ± 11.9 | 49.6 ± 18.7 |
| LWF | 25.8 ± 4.3 | 57.3 ± 4.5 | 3.1 ± 0.6 | 60.4 ± 4.3 |
| SGD | 35.4 ± 3.9 | 50.5 ± 3.9 | 0.4 ± 0.0 | 50.9 ± 3.9 |
| SGD-Frozen | 52.9 ± 3.9 | 0.0 ± 0.0 | 33.3 ± 3.9 | 33.3 ± 3.9 |
| Separate (MAP) | 86.3 ± 0.1 | - | - | 0.0 ± 0.0 |
| Separate (β-VI) | 85.7 ± 0.1 | - | - | 0.0 ± 0.0 |
| Joint (MAP) | 84.3 ± 0.1 | - | - | 2.0 ± 0.1 |
| Joint (β-VI + FiLM) | 83.8 ± 0.2 | - | - | 1.8 ± 0.2 |

Table 8: Performance metrics of GVCL-F, GVCL and various baseline algorithms on the mixed vision tasks. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 30: Mean accuracy of individual tasks after training, for all approaches on the mixed vision tasks.

Figure 31: Mean accuracy of individual tasks after training, for the top 5 performing approaches on the mixed vision tasks.

| | CIFAR10 | CIFAR100 | MNIST | SVHN | F-MNIST | Traffic Signs | Facescrub | NotMNIST | Average |
|---|---|---|---|---|---|---|---|---|---|
| GVCL-F | 0.79% | 0.01% | 0.04% | 0.73% | 0.25% | 0.10% | 0.11% | 0.53% | 0.32% |
| HAT | 0.12% | 0.40% | 0.13% | 2.55% | 0.94% | 0.42% | 5.05% | 3.88% | 1.69% |

Table 9: ECE on all 8 mixed vision tasks for a model trained continually using GVCL-F or HAT. F-MNIST stands for Fashion-MNIST.

Figure 32: Clusters of symbols found by performing K-means clustering with K = 20 on the embedding layer of a model trained with variational inference on a 200-way classification task over the 200 most common symbols in the HASYv2 dataset. Easy-CHASY is made by taking the first symbol from each cluster as the first task, then the second, and so on, up to 10 tasks. Hard-CHASY is made by taking the clusters with the most classes, in order (clusters 1-10).

K CLUSTERED HASYV2 (CHASY)

The HASYv2 dataset consists of 32x32 black/white handwritten LaTeX characters. There are a total of 369 classes and over 150,000 samples (Thoma, 2017). We constructed 10 classification tasks, each with a number of classes ranging from 20 down to 11. To construct these tasks, we first trained a mean-field Bayesian neural network on a 200-way classification task over the 200 classes with the most samples. To get an embedding for each class, we use the activations of the second-to-last layer. Then, we performed K-means clustering with 20 clusters on the per-class means of the embeddings generated when samples of each class were input into the network (sketched in code below). Doing this yielded the classes shown in Figure 32.
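The construction can be sketched as follows, using random stand-in embeddings in place of the network's second-to-last-layer activations (class indices and cluster sizes will therefore not match Figure 32):

```python
import numpy as np
from sklearn.cluster import KMeans

# CHASY task construction sketch: cluster per-class embedding means with
# K-means (K = 20), then draw one class per cluster per task (easy set)
# or take whole clusters as tasks (hard set).

rng = np.random.default_rng(0)
class_embeddings = rng.normal(size=(200, 64))   # one mean embedding per class

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(class_embeddings)

# Within each cluster, order classes by distance to the cluster centre.
clusters = []
for c in range(20):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(
        class_embeddings[members] - km.cluster_centers_[c], axis=1)
    clusters.append(members[np.argsort(dists)])

# Easy set: task t takes the t-th closest class from every cluster that
# still has one, so each task contains at most one symbol per cluster.
easy_tasks = [[cl[t] for cl in clusters if len(cl) > t] for t in range(10)]

# Hard set: classification within a cluster, largest clusters first.
hard_tasks = sorted(clusters, key=len, reverse=True)[:10]
print([len(t) for t in easy_tasks], [len(t) for t in hard_tasks])
```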
Now, within each cluster are classes which the network deems similar. To make the 10 classification tasks, we then took classes from each cluster sequentially (in order of proximity of each class's mean to the cluster's mean), so that each task contains at most one symbol from each cluster. Doing this ensures that tasks are similar to one another, since each task consists of classes which differ in similar ways. With the classes selected, the training set is made by taking 16 samples of each class and using the remainder as the test set. This procedure was used to generate the easy set of tasks, which should have the maximum amount of similarity between tasks.

We also constructed a second set of tasks, the hard set, in which each task is individually difficult. This was done by making each task classification within a cluster, selecting the clusters with the most symbols first; this corresponds to clusters 1-10 in Figure 32. With the classes for each task selected, 16 samples per class form the training set and the remainder the test set. Excess samples are discarded so that the test-set class distribution is also uniform within each task.

It was necessary to perform this clustering procedure, as we found it difficult to produce sizable transfer gains if we simply constructed tasks by taking the classes with the most samples. While we were able to achieve gains of up to 3% from joint training on 10 20-way classification tasks with the tasks chosen by class sample count, these gains were significantly diminished when performing MAP estimation as opposed to MLE estimation, and reduced even further when performing VI. Because one of our benchmark continual learning methods is VCL, showing transfer when training with VI is necessary.

Figure 33 ((a) average relative performance; (b) individual task relative performances): Relative test-set accuracy of models trained jointly on the easy set of tasks, relative to individual training, for MAP estimation. Figure 33a shows the means aggregated over all tasks, while Figure 33b shows the performance differences for individual tasks. Performance increases near-monotonically as more tasks are added, reaching an average gain of around 4.7% with 10 tasks.

Figure 34: Relative performance of models trained jointly on the easy set of tasks, relative to individual training, for variational inference with various KL-reweighting coefficients β. Performance gains reach around 2.0% with 10 tasks in the worst case, which is less than with MAP training but still significant.

Figures 33a and 34 show the performance gains of joint training over separate training on this new dataset, for MAP and KL-reweighted VI, respectively. Figure 33b shows how the relative test-set accuracy varies for each specific task under these training procedures.