Continual Learning via Sequential Function-Space Variational Inference

Tim G. J. Rudner 1 Freddie Bickford Smith 1 Qixuan Feng 1 Yee Whye Teh 1 Yarin Gal 1

Sequential Bayesian inference over predictive functions is a natural framework for continual learning from streams of data. However, applying it to neural networks has proved challenging in practice. Addressing the drawbacks of existing techniques, we propose an optimization objective derived by formulating continual learning as sequential function-space variational inference. In contrast to existing methods that regularize neural network parameters directly, this objective allows parameters to vary widely during training, enabling better adaptation to new tasks. Compared to objectives that directly regularize neural network predictions, the proposed objective allows for more flexible variational distributions and more effective regularization. We demonstrate that, across a range of task sequences, neural networks trained via sequential function-space variational inference achieve better predictive accuracy than networks trained with related methods while depending less on maintaining a set of representative points from previous tasks.

1. Introduction

Continual learning promises to enable applications of machine learning to settings with resource constraints, privacy concerns, or non-stationary data distributions. However, continual learning in deep neural networks remains a challenge. While progress has been made to mitigate forgetting of previously learned abilities, existing objective-based approaches to continual learning still fall short.

1 University of Oxford, Oxford, UK. Correspondence to: Tim G. J. Rudner <tim.rudner@cs.ox.ac.uk>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

A popular family of objectives penalizes changes in parameters from one task to another (Ahn et al., 2019; Aljundi et al., 2018; Chaudhry et al., 2018; Kirkpatrick et al., 2017; Lee et al., 2017; Liu et al., 2018; Loo et al., 2020; Nguyen et al., 2018; Park et al., 2019; Ritter et al., 2018; Schwarz et al., 2018; Swaroop et al., 2019; Yin et al., 2020a;b; Zenke et al., 2017). However, explicitly regularizing parameters in this way may be ineffective, since parameters are only a proxy for a neural network's predictive function. For example, predictive functions defined by overparameterized neural networks may be obtained with several different parameter configurations, and small changes in a network's parameters may cause large changes in its predictions.

An alternative approach that addresses this shortcoming is to regularize the predictive function directly (Benjamin et al., 2019; Bui et al., 2017; Buzzega et al., 2020; Jung et al., 2018; Kapoor et al., 2021; Kim et al., 2018; Li and Hoiem, 2018; Moreno-Muñoz et al., 2019; Pan et al., 2020; Titsias et al., 2020). Existing function-space regularization methods represent the state of the art among objective-based approaches to continual learning (Kapoor et al., 2021; Pan et al., 2020; Titsias et al., 2020). Yet, as we demonstrate, these methods still leave room for improvement. For example, functional regularization of the memorable past (FROMP; Pan et al., 2020) uses a Laplace approximation and as such does not directly optimize variance parameters, while functional regularization for continual learning (FRCL; Titsias et al., 2020) is constrained to linear models.

To address these limitations, we frame continual learning as sequential function-space variational inference (S-FSVI) and adapt the variational objective proposed by Rudner et al. (2021) to the continual-learning setting. The resulting variational optimization objective has three key advantages over existing alternatives. First, it is expressed purely in terms of distributions over predictive functions, which allows greater flexibility than with parameter-space regularization methods (Figure 1). Second, unlike FROMP, it allows direct optimization of variational variance parameters. Third, unlike FRCL, it can be applied to fully-stochastic neural networks, not just to Bayesian linear models.

We demonstrate that S-FSVI outperforms existing objective-based continual learning methods, in some cases by a significant margin, on a wide range of task sequences, including single-head split MNIST, multi-head split CIFAR, and multi-head sequential Omniglot. We further present empirical results that showcase the usefulness of learned variational variance parameters and demonstrate that S-FSVI is less reliant on careful selection of datapoints that summarize past tasks than other methods.

Figure 1. Schematic of how sequential function-space variational inference (S-FSVI) allows a Bayesian neural network to learn new tasks while maintaining previously learned abilities. (Top: predictive distributions.) On task 1, the model fits dataset D1 by updating an initial distribution over parameters q0(θ) to a variational posterior q1(θ), which in turn induces a distribution over functions q1(f). On task 2, the variational objective encourages the posterior distribution over functions to match q1(f) on a small set of data points from task 1 while also fitting dataset D2. The mean and two standard deviations of the distributions over functions learned on task 1 and task 2 are shown in grey and blue, respectively. (Bottom: learning trajectories.) On task 1, the distribution over functions changes by a large amount for inputs X1 (left) but by a small amount for inputs X2 (right). On task 2, the reverse is true. On both tasks, the change in the distribution over parameters (center) is decoupled from the changes in the distribution over functions (left, right).

2. Background

2.1. Continual Learning as Bayesian Inference

Consider a sequence of tasks indexed by $t \in \{1, \dots, T\}$. Each task involves making predictions on a supervised dataset $D_t = (X_t, y_t)$. Continual learning is the problem of inferring a distribution over predictive functions that fits the whole collection of datasets $\{D_1, \dots, D_T\}$ as well as possible given access to only a single full dataset at a time.

Sequential Bayesian inference over predictive functions $f$ provides a natural framework for this. Assuming we have a prior $p(f)$, the posterior distribution over $f$ at task 1 is

$$p(f \mid D_1) = p(D_1 \mid f)\, p(f) / p(D_1). \tag{1}$$

For subsequent tasks $t$, the posterior can be expressed as

$$p(f \mid D_1, \dots, D_t) \propto p(D_t \mid f)\, p(f \mid D_1, \dots, D_{t-1}), \tag{2}$$

where the posterior after task $t-1$ is treated as the prior for task $t$. Given the intractability of computing this posterior exactly, we need to use approximate inference.
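The recursion in Equation (2) can be checked exactly in a conjugate model. The sketch below is only an illustration under stated assumptions: it swaps the distribution over functions for a Beta distribution over a single Bernoulli success probability, and the helper `beta_update` and the toy datasets are hypothetical. It shows that reusing each posterior as the next prior gives the same result as conditioning on all data at once.

```python
# Sequential Bayesian updating in a conjugate Beta-Bernoulli model (toy stand-in
# for the posterior over functions; all names and data are illustrative).

def beta_update(alpha, beta, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

prior = (1.0, 1.0)  # uniform Beta(1, 1) prior
task_datasets = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]

# Sequential inference: the posterior after task t-1 is the prior for task t.
sequential = prior
for dataset in task_datasets:
    sequential = beta_update(*sequential, dataset)

# Batch inference: condition on all datasets at once.
all_data = [y for dataset in task_datasets for y in dataset]
batch = beta_update(*prior, all_data)

print(sequential, batch)
```

In this conjugate case the recursion is exact; for neural networks the posterior has no closed form, which is why the approximation discussed next is needed.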

2.2. Function-Space Variational Inference

Given a dataset $D = (X, y)$, a prior $p(f)$ and a variational family $\mathcal{Q}_f$, function-space variational inference (Burt et al., 2021; Matthews et al., 2016; Rudner et al., 2021; Sun et al., 2019) consists of finding the variational distribution $q(f) \in \mathcal{Q}_f$ that maximizes

$$\mathbb{E}_{q(f)}[\log p(y \mid f(X))] - D_{\mathrm{KL}}(q(f) \,\|\, p(f)). \tag{3}$$

This variational optimization problem presents a trade-off between fitting the data and matching a prior over functions. To address the fact that the KL divergence between distributions over functions is not in general tractable, prior works have developed estimation procedures that allow turning Equation (3) into an objective function that can be used in practice (Rudner et al., 2021; Sun et al., 2019).

3. Continual Learning via Sequential Function-Space Variational Inference

The ideas presented in Section 2 provide a starting point for our method. To approximate the posterior in Equation (2) at task $t$, we would like to find a variational distribution $q_t(f) \in \mathcal{Q}_f$ that minimizes

$$D_{\mathrm{KL}}(q_t(f) \,\|\, p_t(f \mid D_1, \dots, D_t)), \tag{4}$$

which can equivalently be expressed as maximizing

$$\mathbb{E}_{q_t(f)}[\log p(y_t \mid f(X_t))] - D_{\mathrm{KL}}(q_t(f) \,\|\, p_t(f \mid D_1, \dots, D_{t-1})).$$

Since we do not have access to $p_t(f \mid D_1, \dots, D_{t-1})$, we simplify the inference problem to maximizing the variational objective

$$\mathbb{E}_{q_t(f)}[\log p(y_t \mid f(X_t))] - D_{\mathrm{KL}}(q_t(f) \,\|\, p_t(f)), \tag{5}$$

where for $t = 1$ we assume some prior $p_1(f)$ and for $t > 1$ the prior is given by the variational posterior distribution over functions inferred on the previous task. That is,

$$p_t(f) \doteq q_{t-1}(f).$$
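The equivalence between minimizing the KL divergence in Equation (4) and maximizing the expression that follows it can be seen in one step. The derivation below is a sketch that treats the densities over functions formally, drops the subscript on $p_t$, and uses the shorthand $D_{1:t}$ for $D_1, \dots, D_t$ (a notational assumption of this note). It substitutes the normalized form of Equation (2) into the KL divergence:

```latex
\begin{align*}
D_{\mathrm{KL}}\big(q_t(f) \,\|\, p(f \mid D_{1:t})\big)
&= \mathbb{E}_{q_t(f)}\!\left[\log
   \frac{q_t(f)\, p(D_t \mid D_{1:t-1})}{p(D_t \mid f)\, p(f \mid D_{1:t-1})}\right] \\
&= -\,\mathbb{E}_{q_t(f)}\big[\log p(y_t \mid f(X_t))\big]
   + D_{\mathrm{KL}}\big(q_t(f) \,\|\, p(f \mid D_{1:t-1})\big)
   + \log p(D_t \mid D_{1:t-1}).
\end{align*}
```

The last term does not depend on $q_t$, so minimizing the left-hand side over $q_t$ is the same as maximizing the expected log-likelihood minus the KL divergence to the previous posterior.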

While this objective is in general intractable for distributions over functions induced by neural networks with stochastic parameters, Rudner et al. (2021) proposed an approximation that makes this objective amenable to gradient-based optimization and scalable to large neural networks. To perform sequential function-space variational inference, we adapt the estimation procedure proposed by Rudner et al. (2021) to the continual-learning setting:

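Proposition 1 (Sequential Function-Space Variational Inference (S-FSVI); adapted from Rudner et al. (2021)). Let $D_t$ be the number of model output dimensions for $t$ tasks, let $f : \mathcal{X} \times \mathbb{R}^P \to \mathbb{R}^{D_t}$ be a mapping defined by a neural network architecture, let $\Theta \in \mathbb{R}^P$ be a multivariate random vector of network parameters, and let $q_t(\theta) \doteq \mathcal{N}(\mu_t, \Sigma_t)$ and $q_{t-1}(\theta) \doteq \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$ be variational distributions over $\Theta$. Additionally, let $X_C$ denote a set of context points, and let $\bar{X}_t \subseteq \{X_t \cup X_C\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation (5) can be approximated by

$$\mathcal{F}(q_t, q_{t-1}, X_C, X_t, y_t) \doteq \mathbb{E}_{q_t(\theta)}[\log p(y_t \mid f(X_t; \theta))] - \sum_{k=1}^{D_t} \frac{1}{2} \left( \log \frac{|[K^{p_t}]_k|}{|[K^{q_t}]_k|} + \mathrm{Tr}\big([K^{p_t}]_k^{-1} [K^{q_t}]_k\big) - |\bar{X}_t| + \Delta(\bar{X}_t; \mu_t, \mu_{t-1})^\top [K^{p_t}]_k^{-1} \Delta(\bar{X}_t; \mu_t, \mu_{t-1}) \right), \tag{6}$$

where

$$\Delta(\bar{X}_t; \mu_t, \mu_{t-1}) \doteq [f(\bar{X}_t; \mu_t)]_k - [f(\bar{X}_t; \mu_{t-1})]_k \tag{7}$$

and

$$K^{p_t} \doteq J(\bar{X}_t, \mu_{t-1})\, \Sigma_{t-1}\, J(\bar{X}_t, \mu_{t-1})^\top \tag{8}$$

$$K^{q_t} \doteq J(\bar{X}_t, \mu_t)\, \Sigma_t\, J(\bar{X}_t, \mu_t)^\top \tag{9}$$

are covariance matrix estimates constructed from Jacobians $J(\cdot, m) \doteq \frac{\partial f(\cdot\,; \Theta)}{\partial \Theta}\big|_{\Theta = m}$ with $m = \{\mu_t, \mu_{t-1}\}$.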

Proof. See Appendix A.
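To make the Jacobian-based covariance estimates in Equations (8) and (9) concrete, the following minimal sketch builds $K = J \Sigma J^\top$ for a toy scalar network with a diagonal parameter covariance. The toy model `f`, the finite-difference Jacobian, and all numerical values are assumptions chosen for illustration. This is not the authors' implementation, which would more plausibly rely on automatic differentiation.

```python
import math

def f(x, theta):
    """Toy scalar network with one tanh hidden unit (illustrative only)."""
    w, b, v = theta
    return v * math.tanh(w * x + b)

def jacobian(xs, theta, eps=1e-5):
    """Central finite-difference Jacobian of f w.r.t. theta, one row per input."""
    rows = []
    for x in xs:
        row = []
        for p in range(len(theta)):
            up = list(theta)
            down = list(theta)
            up[p] += eps
            down[p] -= eps
            row.append((f(x, up) - f(x, down)) / (2 * eps))
        rows.append(row)
    return rows

def function_space_cov(xs, mean, variances):
    """K = J diag(variances) J^T, with J evaluated at the mean parameters."""
    J = jacobian(xs, mean)
    n = len(xs)
    num_params = len(mean)
    return [[sum(J[i][p] * variances[p] * J[j][p] for p in range(num_params))
             for j in range(n)] for i in range(n)]

context_points = [-1.0, 0.0, 2.0]   # hypothetical context set
mu_prev = [0.5, 0.1, 1.5]           # hypothetical mean parameters from task t-1
var_prev = [0.04, 0.04, 0.09]       # hypothetical diagonal of the covariance

K_prior = function_space_cov(context_points, mu_prev, var_prev)
for row in K_prior:
    print([round(value, 4) for value in row])
```

The same function evaluated at the current mean and variances would give the variational covariance estimate, so both matrices in the KL penalty come from one routine.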

Functional regularization for continual learning (FRCL; Titsias et al., 2020) and functional regularization of the memorable past (FROMP; Pan et al., 2020) use objectives conceptually similar to the objective in Equation (5) and mathematically similar to the objective in Equation (6). To highlight the differences between the S-FSVI objective above and FROMP and FRCL, respectively, we make the relationship between these two methods and S-FSVI precise in the following two propositions.

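Proposition 2 (Relationship between FROMP and S-FSVI). With the S-FSVI objective $\mathcal{F}$ defined as in Equation (6), let $\bar{X}_t = X_C$. Then, up to a multiplicative constant, the FROMP objective corresponds to the S-FSVI objective with the prior covariance given by a Laplace approximation about $\mu_{t-1}$ and the variational distribution given by a Dirac delta distribution $q^{\mathrm{FROMP}}_t(\theta) \doteq \delta(\theta - \mu_t)$. Denoting the prior covariance under a Laplace approximation about $\mu_{t-1}$ by $\hat{\Sigma}_0(\mu_{t-1})$ so that $q^{\mathrm{FROMP}}_{t-1}(\theta) \doteq \mathcal{N}(\mu_{t-1}, \hat{\Sigma}_0(\mu_{t-1}))$, the FROMP objective can be expressed as

$$\mathcal{L}^{\mathrm{FROMP}}(q^{\mathrm{FROMP}}_t, q^{\mathrm{FROMP}}_{t-1}, X_C, X_t, y_t) = \mathcal{F}(q^{\mathrm{FROMP}}_t, q^{\mathrm{FROMP}}_{t-1}, X_C, X_t, y_t) - \mathcal{V},$$

where $\mathcal{V}$ is composed of terms of the form

$$\log \frac{[\tilde{K}^{\hat{p}_t}]_k}{[\tilde{K}^{q_t}]_k} + \frac{[\tilde{K}^{q_t}]_k}{[\tilde{K}^{\hat{p}_t}]_k} - 1,$$

with $\tilde{K}$ denoting the covariance under a block-diagonalization without inter-task dependence, and

$$\tilde{K}^{\hat{p}_t} \doteq \text{block-diag}\big(J(\bar{X}_t, \mu_{t-1})\, \hat{\Sigma}_0(\mu_{t-1})\, J(\bar{X}_t, \mu_{t-1})^\top\big).$$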

Proof. See Appendix A.

Proposition 2 shows that the FROMP objective nearly corresponds to the S-FSVI objective but is missing the term in the S-FSVI objective (denoted by $\mathcal{V}$ above) that encourages learning variational variance parameters that accurately reflect the variance of the prior. This insight reflects a shortcoming of the FROMP objective: unlike the S-FSVI objective, which allows optimization over $\Sigma$, the FROMP objective is restricted to covariance estimates given by the Laplace approximation.

The FRCL objective can be related to the S-FSVI objective in a similar way:

Proposition 3 (Relationship between FRCL and S-FSVI). With the S-FSVI objective $\mathcal{F}$ defined as in Equation (6), let $\bar{X}_t = X_C$, and let $f^{\mathrm{LM}}(\cdot\,; \Theta) \doteq \Phi_\psi(\cdot)\Theta$ be a Bayesian linear model, where $\Phi_\psi(\cdot)$ is a deterministic feature map parameterized by $\psi$. Then the FRCL objective corresponds to the S-FSVI objective for the model $f^{\mathrm{LM}}(\cdot\,; \Theta)$ plus an additional weight-space KL divergence penalty. That is, for $p_t(\theta) \doteq \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$, and $q_t(\theta) \doteq \mathcal{N}(\mu_t, \Sigma_t)$,

$$\mathcal{L}^{\mathrm{FRCL}}(q^{\mathrm{FRCL}}_t, q^{\mathrm{FRCL}}_{t-1}, X_C, X_t, y_t) = \mathcal{F}(q^{\mathrm{FRCL}}_t, q^{\mathrm{FRCL}}_{t-1}, X_C, X_t, y_t) + D_{\mathrm{KL}}(q_t(\theta) \,\|\, p_t(\theta)). \tag{10}$$

Proof. See Appendix A.

Proposition 3 highlights that the FRCL objective is restricted to Bayesian linear models and does not regularize the deterministic parameters in the feature map as effectively as if they were variational parameters.
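One way to see why the linear-model case is special is the following standard fact about Gaussian linear maps, stated here as a side note rather than as part of the authors' proof. For a Bayesian linear model with Gaussian weights, the induced distribution over function values at any finite set of inputs $\bar{X}$ is exactly Gaussian, and the Jacobian with respect to $\Theta$ is the feature matrix itself, independent of the linearization point:

```latex
\begin{align*}
\Theta \sim \mathcal{N}(\mu, \Sigma)
\quad &\Longrightarrow \quad
f^{\mathrm{LM}}(\bar{X}; \Theta) = \Phi_\psi(\bar{X})\,\Theta
\sim \mathcal{N}\big(\Phi_\psi(\bar{X})\,\mu,\; \Phi_\psi(\bar{X})\,\Sigma\,\Phi_\psi(\bar{X})^\top\big), \\
J(\bar{X}, m) &= \frac{\partial\, \Phi_\psi(\bar{X})\,\Theta}{\partial \Theta}\bigg|_{\Theta = m}
= \Phi_\psi(\bar{X}) \quad \text{for every } m.
\end{align*}
```

For such a model, covariance estimates of the form $J \Sigma J^\top$ therefore coincide with the exact function-space covariance, whereas for a general neural network they are an approximation.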

Figure 2. A practical demonstration of sequential function-space variational inference (S-FSVI) on a sequence of five binary-classification tasks with 2D inputs. The neural network infers a decision boundary between the two classes while maintaining high predictive uncertainty away from the data. The experimental setup is described in detail in Appendix C. Panels: (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5.

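3.1. Simplified Sequential Function-Space VI

For ease of computation and to ensure scalability to large neural networks, we consider mean-field distributions $q^{\mathrm{MF}}_t(\theta)$ for all tasks, diagonalize the covariance matrix estimates $K^{p_t}$ and $K^{q_t}$ across input points in $\bar{X}_t$, and let $(X_B, y_B) \subseteq D_t$ be a mini-batch from the current dataset. This way, we obtain the simplified variational objective

$$\mathcal{F}(q^{\mathrm{MF}}_t, q^{\mathrm{MF}}_{t-1}, X_C, X_B, y_B) \doteq \frac{1}{S} \sum_{i=1}^{S} \log p\big(y_B \mid f(X_B; h(\mu_t, \Sigma_t, \epsilon^{(i)}))\big) - \sum_{j=1}^{|\bar{X}_t|} \sum_{k=1}^{D_t} \frac{1}{2} \left( \log \frac{[K^{p_t}]_{j,k}}{[K^{q_t}]_{j,k}} + \frac{[K^{q_t}]_{j,k}}{[K^{p_t}]_{j,k}} - 1 + \frac{\big([f(\bar{X}_t; \mu_t)]_{j,k} - [f(\bar{X}_t; \mu_{t-1})]_{j,k}\big)^2}{[K^{p_t}]_{j,k}} \right), \tag{11}$$

where $h(\mu_t, \Sigma_t, \epsilon^{(i)}) \doteq \mu_t + \Sigma_t \odot \epsilon^{(i)}$ is a reparameterization of $\Theta \in \mathbb{R}^P$ with $\epsilon^{(i)} \sim \mathcal{N}(0, I_P)$, $S$ is the number of Monte Carlo samples, $D_t$ is as defined before, and

$$K^{p_t} \doteq \mathrm{diag}\big(J(\bar{X}_t, \mu_{t-1})\, \Sigma_{t-1}\, J(\bar{X}_t, \mu_{t-1})^\top\big) \tag{12}$$

$$K^{q_t} \doteq \mathrm{diag}\big(J(\bar{X}_t, \mu_t)\, \Sigma_t\, J(\bar{X}_t, \mu_t)^\top\big). \tag{13}$$

This simplified objective does not require matrix inversion, and the time and space complexity for gradient estimation and prediction scale linearly in the number of context points $|\bar{X}_t|$ and network parameters. The context set $X_C$ can be constructed from coresets containing representative points from previous tasks.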
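Because the covariance estimates are diagonalized across input points and output dimensions, the function-space penalty reduces to a sum of KL divergences between univariate Gaussians, which is why no matrix inversion is needed. The sketch below illustrates this under stated assumptions: the function name `diagonal_function_kl` and the flat-list inputs (one entry per context point and output dimension) are hypothetical, and the means and variances are assumed to have been computed elsewhere, for example from Jacobians.

```python
import math

def diagonal_function_kl(mean_q, var_q, mean_p, var_p):
    """Sum of KL(N(m_q, v_q) || N(m_p, v_p)) over all entries.

    Each argument is a flat list with one entry per (context point, output
    dimension) pair. The cost is linear in the number of entries.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p):
        total += 0.5 * (math.log(vp / vq) + vq / vp - 1.0 + (mq - mp) ** 2 / vp)
    return total

# Hypothetical function values and variances at three context points.
prior_mean, prior_var = [0.2, -0.1, 0.7], [0.5, 0.4, 0.9]
post_mean, post_var = [0.3, -0.1, 0.5], [0.3, 0.4, 1.1]

penalty = diagonal_function_kl(post_mean, post_var, prior_mean, prior_var)
print(penalty)
```

The penalty is zero only when the new predictive means and variances match the previous ones at every context point, and both the mean mismatch and the variance mismatch contribute to it.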

We provide an empirical comparison of the simplified S-FSVI, FROMP, and FRCL objectives in Section 5 to assess the extent to which the differences described above affect continual learning.

4. Related Work

There are three main (partially overlapping) categories of methods for continual learning in a deep neural network. Objective-based approaches modify the objective function used to train the neural network. Replay-based approaches summarize past tasks using either stored data or freshly generated synthetic data. Architecture-based approaches change the neural network's structure from one task to another. For extensive reviews, see De Lange et al. (2021) and Parisi et al. (2019). As sequential function-space variational inference (S-FSVI) centers around a new training objective, we focus on objective-based approaches in this review. (Like the methods reviewed below, S-FSVI does incorporate a form of replay in that it uses context points, but the primary interest is the training objective.)

For a neural network to retain abilities it has previously learned, its predictions on data associated with past tasks must not change significantly from one task to another. One way of achieving this is to include in the training objective a form of function-space regularization to discourage important changes in the network's predictions or internal representations. Learning without forgetting (Li and Hoiem, 2018) uses a modified cross-entropy loss that penalizes the difference between the predictions of the current network on the current task data and the predictions of the previous network on the current task data. Less-forgetful learning (Jung et al., 2018) employs the same method but uses squared Euclidean distance rather than the modified cross-entropy loss and applies it to the penultimate-layer representations rather than the network's predictions. Keep and learn (Kim et al., 2018) also uses internal representations as a basis for regularization. The method subsequently proposed by Benjamin et al. (2019) involves comparing the current network with all previous versions of the network and on data from all past tasks instead of with only the most recent network on data from the current task. Each pair of networks is compared by computing the Euclidean distance between the networks' predictions. Dark experience replay (Buzzega et al., 2020) extends this method to work in a setting where task boundaries are not clearly defined.

While these approaches mitigate forgetting, they do not explicitly account for predictive uncertainty, which is an issue if the neural network is a poor fit to the data. This deficiency is addressed by probabilistic approaches to function-space regularization, which encourage a network's predictions to agree with a prior distribution over functions rather than with a single function. Functional regularization for continual learning (FRCL; Titsias et al., 2020) considers a network whose final layer is a Bayesian linear model. Based on the duality between parameter space and function space, the FRCL objective includes the KL divergence between predictive distributions at a selection of input points. This encourages similarity between the network's current predictive distribution and the distributions from past tasks. FRCL is theoretically appealing, building on a well-understood method for stochastic variational inference using inducing points, but is only applicable to Bayesian linear models. In contrast, functional regularization of the memorable past (FROMP; Pan et al., 2020) maintains a posterior distribution over all the parameters of a neural network. While FROMP achieves state-of-the-art performance on several continual-learning task sequences, it relies on a change in the underlying probabilistic model and uses a surrogate objective for optimization, which divorces it from function-space variational objectives. As we show, this results in suboptimal performance compared to sequential function-space variational inference, which maintains a stronger link to the underlying Bayesian approximation.

Table 1. Predictive accuracies of a selection of objective-based methods for continual learning. Results are reported for three task sequences: split MNIST (S-MNIST), split Fashion MNIST (S-FMNIST) and permuted MNIST (P-MNIST). In some cases, a multi-head setup (MH) is used; in others, a single-head setup (SH). Best results for identical network architectures are printed in boldface (exception: VAR-GP uses a non-parametric model). Best overall results are highlighted in gray. Each numerical entry denotes the mean accuracy across tasks at the end of training. Where possible, this accuracy is based on experiments repeated with different random seeds (10 repeats for S-FSVI), with both the mean value and standard error reported. All methods use the same architecture and coreset size unless indicated otherwise. See Appendix C for more experimental details. ^1 Accuracies computed using the best coreset-selection method (either random or k-center). ^2 Uses random coreset selection. ^3 Requires a multi-head setup with task identifiers, including for permuted MNIST. This requirement explains the missing FRCL result for S-MNIST (SH). ^4 Uses a larger MLP architecture (see Table 4 in appendix). ^5 Evaluates the KL divergence at points sampled from the empirical data distribution of the current task. ^6 Uses one sample per class as a coreset.

| Method | S-MNIST (MH) | S-FMNIST (MH) | P-MNIST (SH) | S-MNIST (SH) |
|---|---|---|---|---|
| EWC (Kirkpatrick et al., 2017) | 63.10% | - | 84.00% | - |
| SI (Zenke et al., 2017) | 98.90% | - | 86.00% | - |
| VCL (Nguyen et al., 2018)^1 | 98.40% | 98.60% ± 0.04 | 93.00% | 32.11% ± 1.16 |
| VCL (no coreset) | 97.00% | 89.60% ± 1.75 | 87.50% ± 0.61 | 17.74% ± 1.20 |
| FRCL (Titsias et al., 2020)^3 | 97.80% ± 0.22 | 97.28% ± 0.17 | 94.30% ± 0.06 | - |
| FROMP (Pan et al., 2020) | 99.00% ± 0.04 | 99.00% ± 0.03 | 94.90% ± 0.04 | 35.29% ± 0.52 |
| VAR-GP (Kapoor et al., 2021) | - | - | 97.20% ± 0.08 | 90.57% ± 1.06 |
| S-FSVI (ours)^2 | 99.54% ± 0.04 | 99.19% ± 0.02 | 95.76% ± 0.02 | 92.87% ± 0.14 |
| S-FSVI ablation study: | | | | |
| S-FSVI (larger networks)^4 | 99.76% ± 0.00 | 99.16% ± 0.03 | 97.50% ± 0.01 | 93.38% ± 0.10 |
| S-FSVI (no coreset)^5 | 99.62% ± 0.02 | 99.54% ± 0.01 | 84.06% ± 0.46 | 20.15% ± 0.52 |
| S-FSVI (minimal coreset)^6 | - | - | 89.59% ± 0.30 | 51.44% ± 1.22 |

Although our focus is on methods for training deep neural networks, for completeness, we also note methods based on Gaussian processes (GPs). Incremental variational sparse GP regression (Cheng and Boots, 2016), streaming sparse GPs (Bui et al., 2017) and online sparse multi-output GP regression (Yang et al., 2019) built on the work of Csató and Opper (2002) and Csató (2002), and are effective approaches to continual learning for regression tasks. Continual multi-task GPs (Moreno-Muñoz et al., 2019) extend to multi-output settings with non-Gaussian likelihoods. The success of variational autoregressive GPs (VAR-GP; Kapoor et al., 2021) on continual learning for task sequences with image inputs gives reason for inclusion where relevant in Section 5. However, we note that VAR-GP scales poorly with the number of tasks: the time complexity for inference is cubic in the number of context points and hence in the number of tasks, which may limit its applicability to task sequences like sequential Omniglot. In contrast, the time complexity of S-FSVI is linear in the number of context points.

Also distinct from but related to our method are a number of objective-based approaches to continual learning that directly regularize the parameters of a neural network. We briefly discuss these approaches in Appendix D.

5. Empirical Evaluation

After visualizing how S-FSVI works in practice (Section 5.1), we compare S-FSVI's performance with that of existing objective-based methods for continual learning (Sections 5.2 to 5.4). For a comprehensive comparison, we evaluate S-FSVI on a range of task sequences used in related work. Aiming to use as strong baselines as possible, we report results taken directly from the literature in most cases (and mention when we do not). Reporting baselines in this way leaves gaps in our comparison: for each existing technique, results are available for only a subset of the task sequences we consider here (e.g., Pan et al. (2020) report results for split CIFAR but not sequential Omniglot, while Titsias et al. (2020) do the reverse).

Figure 3. Effect of the coreset size and coreset-selection method on the predictive accuracy of S-FSVI. Three coreset-selection methods are presented: sampling data points with uniform probability; sampling with probability proportional to the model's predictive entropy; and sampling with probability proportional to the KL divergence between the posterior predictive distribution and the prior predictive distribution. Ten inducing points are used in each case. No coreset-selection method consistently yields higher accuracy. Panels: (a) S-MNIST (MH), (b) S-FMNIST (MH), (c) P-MNIST (SH), (d) S-MNIST (SH). Axes: coreset size (x) against accuracy in % (y). Legend: Random, Entropy, KL.

Our evaluation pays attention to two factors important in the assessment of continual-learning methods: the use of task identifiers when making predictions, and the use of coresets of data points to summarize past tasks (Farquhar and Gal, 2018). To provide some commentary on the first of these factors, we run an experiment that compares the performance of a single-head neural network (which does not use task identifiers) to that of a multi-head neural network (which uses task identifiers). Regarding the second factor, we explore how performance changes when the coreset size changes or a context set unrelated to previous tasks is used.
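The score-proportional coreset-selection schemes described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the choice to sample sequentially without replacement, and the toy predictive probabilities are all assumptions. Uniform selection corresponds to passing equal scores, and KL-based selection would pass per-point KL values in place of entropies.

```python
import math
import random

def predictive_entropy(probs):
    """Entropy (in nats) of a categorical predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample_coreset(scores, size, rng):
    """Sample `size` distinct indices with probability proportional to scores."""
    candidates = list(range(len(scores)))
    # A small constant keeps sampling valid when every score is zero.
    weights = [max(s, 0.0) + 1e-12 for s in scores]
    chosen = []
    for _ in range(min(size, len(candidates))):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen

# Hypothetical predictive distributions for four training points.
predictions = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3], [1.0, 0.0]]
entropies = [predictive_entropy(p) for p in predictions]

coreset = sample_coreset(entropies, size=2, rng=random.Random(0))
uniform_coreset = sample_coreset([1.0] * len(predictions), size=2, rng=random.Random(0))
print(coreset, uniform_coreset)
```

Points with near-deterministic predictions receive almost no weight under the entropy scheme, so the selected coreset concentrates on inputs where the model is uncertain.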

Details about the experimental setups (e.g., optimization routines and hyperparameter searches) can be found in Appendix C. Our code can be accessed at:

https://timrudner.com/sfsvi-code.

5.1. Illustrative Example

To provide intuition for how S-FSVI allows learning on new tasks while maintaining previously acquired abilities, we apply it to a task sequence based on easy-to-visualize synthetic 2D data, originally proposed by Pan et al. (2020). In this task sequence, each data point belongs to one of two classes, and more data points are revealed as the task sequence progresses. The data-generating process is assumed to reveal data from mostly non-overlapping subsets of the input space. The continual-learning problem is then to infer the decision boundary around data points revealed up to and including the current task without forgetting the decision boundary inferred on previous tasks. We use a single-head neural network.

In Figure 2, we plot the model's posterior predictive distribution after training on each of five tasks. After training on task 1, the model has low predictive uncertainty close to the data points and high uncertainty (class probabilities around 0.5) everywhere else (Figure 2a). On task 2, S-FSVI seeks to match the distribution over functions inferred on the previous task while fitting the new set of data points. S-FSVI achieves this and expands the area in input space where the model is confident in its predictions (Figure 2b).

As more tasks and data are revealed, S-FSVI allows the model to continually explore the data space and infer the decision boundary while maintaining accurate, high-confidence predictions on data points in parts of the input space where it was previously trained on observed data. Finally, after training on five tasks, the model has inferred the decision boundary between the two classes, while maintaining high predictive uncertainty in parts of the input space where no data points have been observed yet (Figure 2e). The model maintains high predictive uncertainty away from the data, which makes it easier to learn on new tasks. This is unlike deterministic neural networks, which tend to make highly confident predictions in parts of the input space where no data has been observed, or on data points that lie outside of the distribution of the training data.

5.2. Split (Fashion) MNIST & Permuted MNIST

Having established some intuition for how S-FSVI works, we demonstrate how this translates to high predictive accuracy on three task sequences commonly used to evaluate continual-learning methods. First is split MNIST (S-MNIST), in which each task consists of binary classification on a pair of MNIST classes (0 vs. 1, 2 vs. 3, and so on). Second is split Fashion MNIST (S-FMNIST), which has the same structure but uses data from Fashion MNIST, posing a harder problem. Third is permuted MNIST (P-MNIST), in which each task consists of ten-way classification on MNIST images whose pixels have been randomly reordered. A multi-head setup (MH) with task identifiers provided at prediction time is the default for S-MNIST and S-FMNIST, while a single-head setup (SH) without task identifiers is standard for P-MNIST. In addition to running the default setup for all three task sequences, we run a single-head setup for S-MNIST.
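The structure of these task sequences can be sketched in a few lines. The helpers below are hypothetical and only build task definitions (class pairs and pixel permutations); loading the actual images, the number of permuted tasks, and the seeds used in the experiments are outside this sketch. The default of 784 pixels assumes flattened 28x28 MNIST images.

```python
import random

def split_tasks(num_classes=10, classes_per_task=2):
    """Class groups for split-style sequences: (0, 1), (2, 3), and so on."""
    return [tuple(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]

def permuted_tasks(num_tasks, num_pixels=784, seed=0):
    """One fixed random pixel permutation per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations

def apply_permutation(image, perm):
    """Reorder the pixels of a flattened image according to perm."""
    return [image[i] for i in perm]

print(split_tasks())
perms = permuted_tasks(num_tasks=3, num_pixels=16)
print(apply_permutation(list(range(16)), perms[0]))
```

In the split setting each task sees a disjoint pair of classes, whereas in the permuted setting every task contains all ten classes and differs only in the fixed reordering applied to the inputs.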

With a standard configuration, S-FSVI outperforms all existing methods based on deep neural networks by a statistically significant margin on all task sequences (Table 1). As noted in Section 4, VAR-GP's conceptual connection to our method warrants its inclusion in our comparison. VAR-GP performs better than our standard configuration of S-FSVI on permuted MNIST, but this advantage disappears once a larger neural network is used with S-FSVI. Moreover, VAR-GP is unlikely to scale well to more challenging task sequences, such as those in Sections 5.3 and 5.4.

Figure 4. Predictive accuracies of S-FSVI and related methods on split CIFAR. (a) Per-task and average accuracy after training on six tasks. The result of the joint baseline is obtained using a model trained on data from all tasks at the same time. The accuracy at task t for the separate baseline is the accuracy of an independent model trained only on task t. We use the best performing method for each baseline: FROMP for "joint", S-FSVI for "separate". (b) Average accuracy after training on six tasks with different coreset sizes. Random coreset selection denotes uniform sampling from the training set. Entropy coreset selection denotes sampling from the training set with probability proportional to the entropy of the model's posterior predictive distribution. Panels: (a) Split CIFAR Accuracies After Training on All Tasks (x-axis: task 1 to 6 and average; y-axis: accuracy after task 6 in %; legend: Joint, Separate, S-FSVI, FROMP, VCL); (b) S-FSVI Accuracy as a Function of Coreset Size (x-axis: coreset size; y-axis: accuracy in %; legend: Random, Entropy).

5.3. Sequential Omniglot

Sequential Omniglot (Lake et al., 2015; Schwarz et al., 2018) provides a more challenging task sequence than those considered in Section 5.2. It consists of 50 classification tasks, where the number of classes varies between the tasks (details in Appendix C). We find that S-FSVI produces better predictive accuracy than all available baselines, including FRCL, by a statistically significant margin (Table 2). To illustrate the stability of S-FSVI across long task sequences, we plot its mean accuracy over 50 tasks in Figure 5.

Table 2. Predictive accuracies of S-FSVI and related methods on sequential Omniglot. For S-FSVI and FRCL, the coreset consists of two data points per class. All baseline results are from Titsias et al. (2020). For all methods, the mean and standard deviation over five random task permutations are reported. ^1 Li and Hoiem (2018). ^2 Schwarz et al. (2018). ^3 Schwarz et al. (2018). ^4 Coreset selected using FRCL's trace method. ^5 Details in Appendix C.

| Method | Test Accuracy |
|---|---|
| Learning Without Forgetting^1 | 62.06% ± 2.0 |
| EWC | 67.32% ± 4.7 |
| Online EWC^2 | 69.99% ± 3.2 |
| Progress & Compress^3 | 70.32% ± 3.3 |
| FRCL^4 | 81.47% ± 1.6 |
| S-FSVI (ours)^5 | 83.29% ± 1.2 |

Figure 5. Predictive accuracies of S-FSVI and FRCL on sequential Omniglot. For S-FSVI, the accuracy shown at task t is the mean accuracy across all tasks up to that point (mean ± one standard error as computed across five permutations of the task order). We were unable to reproduce the result reported in Titsias et al. (2020) using the authors' code. However, we compare against the result from the paper (only the accuracy at task 50 is reported) here to provide a strong baseline. Axes: task (x) against accuracy in % (y). Legend: S-FSVI, FRCL (paper).

5.4. Split CIFAR

Moving beyond classification tasks on grayscale images, we evaluate S-FSVI on split CIFAR (Pan et al., 2020; Zenke et al., 2017). This uses the full CIFAR-10 dataset for the first task, followed by five ten-way classification tasks drawn from CIFAR-100. Our results show S-FSVI achieving higher accuracy on all tasks than FROMP and VCL after learning all six tasks (Figure 4a). Notably, on each task except the first, S-FSVI performs close to or better than two baselines: a model trained only on that task, and a model trained on all tasks jointly. The latter is a particularly strong baseline, because all data is available during training.

As in related work (Lopez-Paz and Ranzato, 2017; Pan et al., 2020), we compute the forward transfer (FT) and backward transfer (BT) for S-FSVI on split CIFAR. FT measures how much the accuracy on the current task increases as the number of past tasks increases; BT measures how much the accuracy on previous tasks increases as more tasks are observed (see Appendix C.6 for mathematical definitions). As well as having the best overall accuracy, S-FSVI significantly outperforms all baselines in terms of FT and has BT comparable to EWC and FROMP (Table 3).

Figure 6. Predictive accuracies of S-FSVI and parameter-space variational inference (VCL) on split MNIST and permuted MNIST. The accuracy shown at task t is the mean accuracy across all tasks up to that point (mean ± one standard error as computed across ten repetitions of the experiment). With a coreset, S-FSVI outperforms VCL on both task sequences. Without a coreset, S-FSVI performs poorly on permuted MNIST. Panels: (a) S-MNIST (MH), (b) P-MNIST (SH). Axes: task (x) against accuracy in % (y). Legend: S-FSVI (coreset), S-FSVI (no coreset), VCL (coreset), VCL (no coreset), VCL (k centers).

Figure 7. Predictive accuracies of S-FSVI, FROMP and FRCL on multi-head split (Fashion) MNIST without using coresets. Inducing inputs for evaluating the KL divergence are sampled according to three different sampling schemes derived from the current task's empirical data distribution (see Appendix C for details). Using S-FSVI with images sampled from the current task's training set significantly outperforms all other methods. Panels: (a) S-MNIST (MH), (b) S-FMNIST (MH). Axes: method (x; S-FSVI, FROMP, FRCL) against accuracy in % (y). Legend: Random noise, Random pixel, Random image.

Table 3. Forward transfer (FT) and backward transfer (BT) of S-FSVI and related methods on split CIFAR. All baseline results are from Pan et al. (2020). For all methods, the mean and standard error over five repeated experiments are reported. ¹ Details in Appendix C.

| Method | Test Accuracy | FT | BT |
|---|---|---|---|
| EWC | 71.6% ± 0.4 | 0.2 ± 0.4 | -2.3 ± 0.6 |
| VCL | 67.4% ± 0.6 | 1.8 ± 1.4 | -9.2 ± 0.8 |
| FROMP | 76.2% ± 0.2 | 6.1 ± 0.3 | -2.6 ± 0.4 |
| S-FSVI (ours)¹ | 77.6% ± 0.2 | 7.3 ± 0.2 | -2.5 ± 0.2 |

5.5. Function- vs. Parameter-Space Inference

To demonstrate the importance of performing inference in function space, we compare how the accuracies of S-FSVI and VCL evolve from one task to another on split MNIST and permuted MNIST (Figure 6). We find that S-FSVI consistently outperforms VCL, whose predictive performance steadily degrades, suggesting that function-space inference may be more effective than parameter-space inference at transferring prior knowledge from one task to another, and that this may offset the information loss in the KL divergence between distributions over functions compared to the KL divergence between distributions over parameters.

5.6. Coreset Size and Selection

Similar to existing methods such as FROMP and FRCL, S-FSVI includes in the training objective a function-space regularization term that encourages matching the prior distribution over functions at a set of context points. Typically, this requires keeping a representative coreset of data points from each task, from which a context set can be constructed.

S-FSVI offers two beneﬁts with respect to coresets. First, it is insensitive to which points get included in the coresets. Whereas existing methods often require expensive procedures to select important data points from previous tasks, Figures 3 and 4b show that S-FSVI achieves strong performance while only using randomly selected coresets. Second, S-FSVI does not require large coresets to perform well. On permuted MNIST, S-FSVI achieves better predictive accuracy than EWC and SI even if the coreset used for S-FSVI consists of only a single data point per class (Table 1). On the single-head version of split MNIST, a minimal coreset (one point per class, or two points per task) allows S-FSVI to outperform VCL and FROMP, both with coresets of 40 points per task (Table 1). In some multi-head settings, S-FSVI achieves state-of-the-art predictive accuracies with randomly-generated noise coresets (Table 1 and Figure 7).
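As a rough illustration of what a randomly selected coreset with a fixed number of points per class might look like, consider the following minimal sketch. The function name `random_coreset`, the seed handling, and the toy labels are assumptions made for this example and are not taken from the authors' code.

```python
import random
from collections import defaultdict

def random_coreset(labels, points_per_class, seed=0):
    """Return indices of a coreset with `points_per_class` random points per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    coreset = []
    for label in sorted(by_class):
        # Uniform sampling without replacement within each class.
        coreset.extend(rng.sample(by_class[label], points_per_class))
    return coreset

# Toy binary task with six training points (hypothetical data).
# One point per class gives a minimal coreset of two points for the task.
labels = [0, 1, 0, 1, 1, 0]
coreset = random_coreset(labels, points_per_class=1)
```

No scoring or importance estimation is involved here, which is the sense in which random selection is cheaper than procedures that rank candidate points.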

6. Conclusion

We presented sequential function-space variational inference (S-FSVI), a method for continual learning in deep neural networks. We showed that S-FSVI improves on the predictive performance of existing objective-based continual learning methods, often by a significant margin, including on task sequences with high-dimensional inputs (split CIFAR) and large numbers of tasks (sequential Omniglot). Lastly, we demonstrated that, unlike existing function-space regularization methods, S-FSVI does not rely on careful coreset selection and, in multi-head settings, can achieve state-of-the-art performance even without coresets collected on previous tasks. We hope that this work will lead to future research into further improving function-space objectives for continual learning.

Acknowledgements

Tim G. J. Rudner and Freddie Bickford Smith are funded by the Engineering and Physical Sciences Research Council (EPSRC). Tim G. J. Rudner is also funded by the Rhodes Trust and by a Qualcomm Innovation Fellowship. We gratefully acknowledge donations of computing resources by the Alan Turing Institute.

References

Ahn, H., Cha, S., Lee, D., and Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems.

Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. (2018). Memory aware synapses: learning what (not) to forget. In European Conference on Computer Vision.

Benjamin, A., Rolnick, D., and Kording, K. (2019). Measuring and regularizing networks in function space. In International Conference on Learning Representations.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems.

Bui, T. D., Nguyen, C., and Turner, R. E. (2017). Streaming sparse Gaussian process approximations. In Advances in Neural Information Processing Systems.

Burt, D. R., Ober, S. W., Garriga-Alonso, A., and van der Wilk, M. (2021). Understanding variational inference in function-space. In Symposium on Advances in Approximate Bayesian Inference.

Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems.

Chaudhry, A., Dokania, P., Ajanthan, T., and Torr, P. (2018). Riemannian walk for incremental learning: understanding forgetting and intransigence. In European Conference on Computer Vision.

Cheng, C.-A. and Boots, B. (2016). Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.

Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis, Aston University.

Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. (2021). A continual learning survey: defying forgetting in classiﬁcation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ebrahimi, S., Elhoseiny, M., Darrell, T., and Rohrbach, M. (2020). Uncertainty-guided continual learning with Bayesian neural networks. In International Conference on Learning Representations.

Farquhar, S. and Gal, Y. (2018). Towards robust evaluations of continual learning. ICML Workshop on Lifelong Learning: A Reinforcement Learning Approach.

Ghahramani, Z. and Attias, H. (2000). Online variational Bayesian learning. In NIPS Workshop on Online Learning.

Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In International Symposium on Independent Component Analysis and Blind Signal Separation.

Jung, H., Ju, J., Jung, M., and Kim, J. (2018). Less-forgetful learning for domain expansion in deep neural networks. In AAAI Conference on Artiﬁcial Intelligence.

Kapoor, S., Karaletsos, T., and Bui, T. D. (2021). Variational auto-regressive Gaussian processes for continual learning. In International Conference on Machine Learning.

Kessler, S., Nguyen, V., Zohren, S., and Roberts, S. (2019). Hierarchical Indian buffet neural networks for Bayesian continual learning. arXiv.

Kim, H.-E., Kim, S., and Lee, J. (2018). Keep and learn: continual learning by constraining the latent space for knowledge preservation in neural networks. In Medical Image Computing and Computer Assisted Intervention.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science.

Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W., and Zhang, B.-T. (2017). Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems.

Li, Z. and Hoiem, D. (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Liu, X., Masana, M., Herranz, L., van de Weijer, J., López, A. M., and Bagdanov, A. D. (2018). Rotate your networks: better weight consolidation and less catastrophic forgetting. In International Conference on Pattern Recognition.

Loo, N., Swaroop, S., and Turner, R. E. (2020). Generalized variational continual learning. In International Conference on Learning Representations.

Lopez-Paz, D. and Ranzato, M. A. (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems.

Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In International Conference on Artiﬁcial Intelligence and Statistics.

Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. (2019). Continual multi-task Gaussian processes. arXiv.

Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2018). Variational continual learning. In International Conference on Learning Representations.

Pan, P., Swaroop, S., Immer, A., Eschenhagen, R., Turner, R., and Khan, M. E. E. (2020). Continual deep learning by functional regularisation of memorable past. In Advances in Neural Information Processing Systems.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: a review. Neural Networks.

Park, D., Hong, S., Han, B., and Lee, K. M. (2019). Continual learning by asymmetric loss approximation with single-side overestimation. In International Conference on Computer Vision.

Ritter, H., Botev, A., and Barber, D. (2018). Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems.

Rudner, T. G. J., Chen, Z., and Gal, Y. (2021). Rethinking function-space variational inference in Bayesian neural networks. In Symposium on Advances in Approximate Bayesian Inference.

Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation.

Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. (2018). Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning.

Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana and Chicago.

Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. In International Conference on Learning Representations.

Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. (2019). Improving and understanding variational continual learning. In NeurIPS Workshop on Continual Learning.

Titsias, M. K., Schwarz, J., de G. Matthews, A. G., Pascanu, R., and Teh, Y. W. (2020). Functional regularisation for continual learning with Gaussian processes. In International Conference on Learning Representations.

Yang, L., Wang, K., and Mihaylova, L. S. (2019). Online sparse multi-output Gaussian process regression and learning. IEEE Transactions on Signal and Information Processing over Networks.

Yin, D., Farajtabar, M., and Li, A. (2020a). SOLA: continual learning with second-order loss approximation. arXiv.

Yin, D., Farajtabar, M., Li, A., Levine, N., and Mott, A. (2020b). Optimization and generalization of regularization-based continual learning: a loss approximation viewpoint. arXiv.

Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In International Conference on Machine Learning.

Continual Learning via Sequential Function-Space Variational Inference Supplementary Material

Table of Contents

Appendix A: Proofs

Appendix B: Further Empirical Results

Appendix C: Experimental Details

Appendix D: Further Related Work

A.1. Variational Objective

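Proposition 1 (Sequential Function-Space Variational Inference (S-FSVI); adapted from Rudner et al. (2021)). Let $D_t$ be the number of model output dimensions for $t$ tasks, let $f : \mathcal{X} \times \mathbb{R}^P \to \mathbb{R}^{D_t}$ be a mapping defined by a neural network architecture, let $\Theta \in \mathbb{R}^P$ be a multivariate random vector of network parameters, and let $q_t(\theta) \doteq \mathcal{N}(\mu_t, \Sigma_t)$ and $q_{t-1}(\theta) \doteq \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$ be variational distributions over $\Theta$. Additionally, let $X_C$ denote a sample of context points, and let $\bar{X}_t \subseteq \{X_t \cup X_C\}$. Under a diagonal approximation of the prior and variational posterior covariance functions across output dimensions, the objective in Equation (5) can be approximated by

$$
\mathcal{F}(q_t, q_{t-1}, X_C, X_t, y_t) \doteq \mathbb{E}_{q_t(\theta)}[\log p(y_t \mid f(X_t; \theta))] - \sum_{k=1}^{D_t} \frac{1}{2} \left( \log \frac{|[K^{p_t}]_k|}{|[K^{q_t}]_k|} + \mathrm{Tr}\big([K^{p_t}]_k^{-1} [K^{q_t}]_k\big) - |\bar{X}_t| + \Delta(\bar{X}_t; \mu_t, \mu_{t-1})^\top [K^{p_t}]_k^{-1} \Delta(\bar{X}_t; \mu_t, \mu_{t-1}) \right), \tag{A.1}
$$

where

$$
\Delta(\bar{X}_t; \mu_t, \mu_{t-1}) \doteq [f(\bar{X}_t; \mu_t)]_k - [f(\bar{X}_t; \mu_{t-1})]_k \tag{A.2}
$$

and

$$
K^{p_t} \doteq \mathcal{J}(\bar{X}_t, \mu_{t-1}) \Sigma_{t-1} \mathcal{J}(\bar{X}_t, \mu_{t-1})^\top \quad \text{and} \quad K^{q_t} \doteq \mathcal{J}(\bar{X}_t, \mu_t) \Sigma_t \mathcal{J}(\bar{X}_t, \mu_t)^\top \tag{A.3}
$$

are covariance matrix estimates constructed from Jacobians $\mathcal{J}(\cdot, m) \doteq \frac{\partial f(\cdot\,; \Theta)}{\partial \Theta}\big|_{\Theta = m}$ with $m = \{\mu_t, \mu_{t-1}\}$.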

Proof. The result follows directly from the variational objective derived in Rudner et al. (2021) when setting the prior to $p \doteq q_{t-1}$ and specifying the context set to be constructed from the coreset.
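To make the structure of the function-space KL term in Proposition 1 more concrete, the sketch below evaluates it for a toy model that is linear in its parameters, so that the Jacobian is simply the feature vector and does not depend on the linearization point. It assumes diagonal parameter covariances and additionally ignores correlations between context points, which is a stronger simplification than the diagonal approximation across output dimensions used in the proposition. The feature map `features` and all numerical values are hypothetical.

```python
import math

def features(x):
    """Hypothetical fixed feature map phi(x); for this model it is also the Jacobian."""
    return [1.0, x, x * x]

def function_space_kl(context_points, mu_t, var_t, mu_prev, var_prev):
    """KL between Gaussian function distributions, evaluated pointwise.

    Assumes diagonal parameter covariances and ignores correlations between
    context points, which is a stronger approximation than in Proposition 1.
    """
    kl = 0.0
    for x in context_points:
        jac = features(x)
        k_p = sum(j * j * v for j, v in zip(jac, var_prev))  # J Sigma_{t-1} J^T
        k_q = sum(j * j * v for j, v in zip(jac, var_t))     # J Sigma_t J^T
        # Difference between the current and previous mean predictions.
        delta = sum(j * (a - b) for j, a, b in zip(jac, mu_t, mu_prev))
        kl += 0.5 * (math.log(k_p / k_q) - 1.0 + k_q / k_p + delta * delta / k_p)
    return kl

context = [-1.0, 0.0, 2.0]
mu_prev, var_prev = [0.5, -1.0, 0.2], [0.1, 0.1, 0.1]

# Identical distributions give zero KL; moving the mean gives a positive penalty.
kl_same = function_space_kl(context, mu_prev, var_prev, mu_prev, var_prev)
kl_moved = function_space_kl(context, [0.9, -1.0, 0.2], var_prev, mu_prev, var_prev)
```

For a neural network, the Jacobian would instead be computed at the respective means, for example with automatic differentiation, and the full objective would subtract this term from the expected log-likelihood.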

A.2. Derivation of Correspondence to Other Function-Space Objectives

Proposition 2 (Relationship between FROMP and S-FSVI). With the S-FSVI objective $\mathcal{F}$ defined as in Equation (6), let $\bar{X}_t = X_C$. Then, up to a multiplicative constant, the FROMP objective corresponds to the S-FSVI objective with the prior covariance given by a Laplace approximation about $\mu_{t-1}$ and the variational distribution given by a Dirac delta distribution $q^{\text{FROMP}}_t(\theta) \doteq \delta(\theta - \mu_t)$. Denoting the prior covariance under a Laplace approximation about $\mu_{t-1}$ by $\hat{\Sigma}_0(\mu_{t-1})$ so that $q^{\text{FROMP}}_{t-1}(\theta) \doteq \mathcal{N}(\mu_{t-1}, \hat{\Sigma}_0(\mu_{t-1}))$, the FROMP objective can be expressed as

$$
\mathcal{L}^{\text{FROMP}}(q^{\text{FROMP}}_t, q^{\text{FROMP}}_{t-1}, X_C, X_t, y_t) = -\mathcal{F}(q^{\text{FROMP}}_t, q^{\text{FROMP}}_{t-1}, X_C, X_t, y_t) - V, \tag{A.4}
$$

where

$$
V \doteq \sum_{k=1}^{D_t} \frac{1}{2} \left( \log \frac{[\bar{K}^{\hat{p}_t}]_k}{[\bar{K}^{q_t}]_k} + \frac{[\bar{K}^{q_t}]_k}{[\bar{K}^{\hat{p}_t}]_k} - 1 \right),
$$

with $\bar{K}$ denoting the covariance under a block-diagonalization without inter-task dependence, and

$$
\bar{K}^{\hat{p}_t} \doteq \text{block-diag}\left( \mathcal{J}(\bar{X}_t, \mu_{t-1}) \hat{\Sigma}_0(\mu_{t-1}) \mathcal{J}(\bar{X}_t, \mu_{t-1})^\top \right).
$$

Proof. By Equation (8) in Pan et al. (2020), the FROMP objective function is given by

$$
\mathcal{L}^{\text{FROMP}}(q^{\text{FROMP}}_t, q^{\text{FROMP}}_{t-1}, X_C, X_t, y_t) \doteq -\mathbb{E}_{q_t(\theta)}[\log p(y_t \mid f(X_t; \mu_t))] + \sum_{k=1}^{D_t} \frac{\tau}{2} \big([f(X_C; \mu_t)]_k - [f(X_C; \mu_{t-1})]_k\big)^\top [K^{\hat{p}_t}]_k^{-1} \big([f(X_C; \mu_t)]_k - [f(X_C; \mu_{t-1})]_k\big),
$$

with temperature parameter $\tau$. The result follows directly from the definition of $\mathcal{F}(q^{\text{FROMP}}_t, q^{\text{FROMP}}_{t-1}, X_C, X_t, y_t)$ and $\tau = 1$.

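Proposition 3 (Relationship between FRCL and S-FSVI). With the S-FSVI objective $\mathcal{F}$ defined as in Equation (6), let $\bar{X}_t = X_C$, and let $f^{\text{LM}}(\cdot\,; \Theta) \doteq \Phi_\psi(\cdot)\Theta$ be a Bayesian linear model, where $\Phi_\psi(\cdot)$ is a deterministic feature map parameterized by $\psi$. Then the FRCL objective corresponds to the S-FSVI objective for the model $f^{\text{LM}}(\cdot\,; \Theta)$ plus an additional weight-space KL divergence penalty. That is, for $p_t(\theta) \doteq \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$ and $q_t(\theta) \doteq \mathcal{N}(\mu_t, \Sigma_t)$,

$$
\mathcal{L}^{\text{FRCL}}(q^{\text{FRCL}}_t, q^{\text{FRCL}}_{t-1}, X_C, X_t, y_t) = \mathcal{F}(q^{\text{FRCL}}_t, q^{\text{FRCL}}_{t-1}, X_C, X_t, y_t) - D_{\text{KL}}(q_t(\theta) \,\|\, p_t(\theta)). \tag{A.5}
$$

Proof. By Section 2.3 in Titsias et al. (2020), the FRCL objective function is given by

$$
\mathcal{L}^{\text{FRCL}}(\mu_t, \Sigma_t, X_C, X_t, y_t) \doteq \mathbb{E}_{q_t(\theta)}[\log p(y_t \mid \Phi_\psi(X_t)\theta)] - \sum_{t'=1}^{t-1} D_{\text{KL}}\big(\tilde{q}_t(\tilde{f}(X_{C_{t'}}; \theta)) \,\|\, \tilde{p}_t(\tilde{f}(X_{C_{t'}}; \theta))\big) - D_{\text{KL}}(q_t(\theta) \,\|\, p_t(\theta)), \tag{A.6}
$$

while the S-FSVI objective for a Bayesian linear model is

$$
\mathcal{F}(\mu_t, \Sigma_t, X_C, X_t, y_t) = \mathbb{E}_{q_t(\theta)}[\log p(y_t \mid \Phi_\psi(X_t)\theta)] - D_{\text{KL}}\big(\tilde{q}_t(\tilde{f}(X_C; \theta)) \,\|\, \tilde{p}_t(\tilde{f}(X_C; \theta))\big). \tag{A.7}
$$

Let

$$
K^{p_t} \doteq \text{block-diag}\left( \mathcal{J}_{\mu_{t-1}}(X_{C_{t'}}) \Sigma_{t-1} \mathcal{J}_{\mu_{t-1}}(X_{C_{t'}})^\top \right) \quad \text{and} \quad K^{q_t} \doteq \text{block-diag}\left( \mathcal{J}_{\mu_t}(X_{C_{t'}}) \Sigma_t \mathcal{J}_{\mu_t}(X_{C_{t'}})^\top \right) \tag{A.8}
$$

be block diagonal matrices without inter-task dependence, with diagonal entries $\{K^{p_t}_1, \ldots, K^{p_t}_{t-1}\}$ and $\{K^{q_t}_1, \ldots, K^{q_t}_{t-1}\}$, respectively, computed from task-specific context points $X_{C_{t'}}$. Then, since in general for any block diagonal matrix $A \in \mathbb{R}^{Jm \times Jm}$ with diagonal entries $\{A_1, \ldots, A_J\}$ and $A_j \in \mathbb{R}^{m \times m}$, the determinant can be expressed as $\det(A) = \prod_{j=1}^{J} \det(A_j)$ and for any $x = [x_1, \ldots, x_J]$, with $x_j \in \mathbb{R}^m$, the quadratic form $x^\top A x$ can be expressed as $x^\top A x = \sum_{j=1}^{J} x_j^\top A_j x_j$, we can express the KL divergence as a sum and write the S-FSVI objective as

$$
\hat{\mathcal{F}}(\mu_t, \Sigma_t, X_C, X_t, y_t) = \mathbb{E}_{q_t(\theta)}[\log p(y_t \mid \Phi_\psi(X_t)\theta)] - \sum_{t'=1}^{t-1} D_{\text{KL}}\big(\tilde{q}_t(\tilde{f}(X_{C_{t'}}; \theta)) \,\|\, \tilde{p}_t(\tilde{f}(X_{C_{t'}}; \theta))\big), \tag{A.9}
$$

since the KL divergence between multivariate Gaussians is a sum of log-determinants, traces, and a quadratic form. The result follows immediately.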
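The decomposition step used in this proof can be written out explicitly. The following is a sketch in generic notation: the means $m^q$, $m^p$, the difference $\delta \doteq m^q - m^p$ with blocks $\delta_j$, and the block sizes $d_j$ (with $d = \sum_j d_j$) are symbols introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\mathcal{N}(m^q, K^q) \,\|\, \mathcal{N}(m^p, K^p)\big)
&= \frac{1}{2}\Big[\log\frac{\det K^p}{\det K^q} - d
   + \mathrm{Tr}\big((K^p)^{-1} K^q\big)
   + \delta^\top (K^p)^{-1} \delta\Big] \\
&= \sum_{j=1}^{J} \frac{1}{2}\Big[\log\frac{\det K^p_j}{\det K^q_j} - d_j
   + \mathrm{Tr}\big((K^p_j)^{-1} K^q_j\big)
   + \delta_j^\top (K^p_j)^{-1} \delta_j\Big] \\
&= \sum_{j=1}^{J} D_{\mathrm{KL}}\big(\mathcal{N}(m^q_j, K^q_j) \,\|\, \mathcal{N}(m^p_j, K^p_j)\big).
\end{aligned}
```

The second line holds when both $K^p$ and $K^q$ are block diagonal with matching block structure, because the log-determinant, the trace, and the quadratic form each split into a sum over blocks.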

B. Further Empirical Results

[Figure 8 plots: accuracy (%) under the induced prior covariance function versus an identity covariance function; panels (a) S-MNIST (MH), (b) S-FMNIST (MH), (c) P-MNIST (SH), (d) S-MNIST (SH).]

Figure 8. Effect of Empirical Prior Covariance. Comparison of predictive performance under the induced prior covariance function $K^{p_t} = \mathrm{diag}\big(\mathcal{J}_{\mu_{t-1}}(x) \Sigma_{t-1} \mathcal{J}_{\mu_{t-1}}(x')^\top\big)$ (left) vs. an identity covariance function (right).

[Figure 9 plots: accuracy (%) versus architecture, prior covariance (0.001 to 100.0), and number of epochs; panels (a) S-MNIST (MH), (b) S-FMNIST (MH), (c) P-MNIST (SH), (d) S-MNIST (SH).]

Figure 9. Effect of Neural-Network Size, First-Task Prior Covariance, and the Number of Training Epochs. We explore settings of neural-network size (e.g., 2 × 100 means a fully connected neural network with two hidden layers of size 100), initial prior covariance and number of training epochs for each task. To limit the computational resources required, we vary the values of one hyperparameter at a time instead of carrying out a full grid search.

[Figure 10 plots: accuracy (%) versus hidden-layer size (100 to 400); panels (a) Permuted MNIST (SH) and (b) Split MNIST (SH).]

Figure 10. Effect of Neural-Network Size under Minimal Coresets. Predictive accuracy under S-FSVI on permuted MNIST (SH) and split MNIST (SH) as a function of network width, using only a minimal coreset of one sample per class, selected randomly.

[Figure 11 plots: accuracy (%) versus prior covariance (0.001, 0.005, 0.01) and versus number of epochs (100, 200).]

Figure 11. Hyperparameter Search on Split CIFAR. We explore settings of the initial first-task prior covariance and the number of epochs for the first task. To limit the computational resources required, we vary the values of one hyperparameter at a time instead of carrying out a full grid search.

[Figure 12 plot: accuracy (%) by coreset-selection method, including highest-entropy and lowest-entropy selection.]

Figure 12. Comparison of Different Coreset-Selection Methods on Split CIFAR. For score-based coreset-selection methods, we first score each coreset point (using Equation (11) for ELBO scoring, the predictive entropy for entropy scoring, and the KL divergence in Equation (11) for KL scoring) and then sample context points from the coreset according to the probability mass function defined in Equation (C.10).

[Figure 13 plot: accuracy (%) for settings 1 and 2.]

Figure 13. Hyperparameter Search on Sequential Omniglot. We compare two settings. In the first, we always sample one context point for each previous task from the context set at each gradient step. In the second, we sample a larger number of context points (with a budget of 60 samples per gradient step) from the context set when learning on the first 25 tasks.

C. Experimental Details

Our empirical evaluation centers around six sequences of classiﬁcation tasks: a synthetic sequence of binary-classiﬁcation tasks with 2D inputs; split MNIST; split Fashion MNIST; permuted MNIST; split CIFAR; and sequential Omniglot. With the exception of permuted MNIST, each of these task sequences can be tackled by a neural network with either a multi-head setup (MH) or a single-head setup (SH). In a multi-head setup, the neural network has a separate output layer (or head) for each task, and task identiﬁers are provided at test time in order to select the appropriate head. In a single-head setup, the neural network has just one output layer shared across all tasks, and task identiﬁers are not provided. In our experiments, we use multi-head setups for split Fashion MNIST, split CIFAR and sequential Omniglot, and single-head setups for the synthetic task sequence along with permuted MNIST. For split MNIST, we run both setups.
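The difference between the two setups can be sketched in a few lines. The example below is illustrative only: the helper names, the two-dimensional feature vector standing in for the output of the shared network body, and the weight values are all hypothetical.

```python
def linear(weights, bias, features):
    """Apply one output layer: a list of per-class weight vectors plus biases."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def predict_multi_head(heads, features, task_id):
    """Multi-head setup: the task identifier selects the output layer."""
    weights, bias = heads[task_id]
    logits = linear(weights, bias, features)
    return logits.index(max(logits))

def predict_single_head(head, features):
    """Single-head setup: one shared output layer, no task identifier."""
    weights, bias = head
    logits = linear(weights, bias, features)
    return logits.index(max(logits))

# Hypothetical shared features from the network body for one input.
features = [1.0, -2.0]
heads = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # head for task 0
    1: ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]),  # head for task 1
}
shared_head = ([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]], [0.0] * 4)

pred_task0 = predict_multi_head(heads, features, task_id=0)
pred_task1 = predict_multi_head(heads, features, task_id=1)
pred_shared = predict_single_head(shared_head, features)
```

In the multi-head case the same features can map to different labels depending on the task identifier, whereas the single-head prediction ranges over the classes of all tasks at once.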

C.1. Illustrative Example

The task sequence shown in Figure 2 was created by Pan et al. (2020). Each of the five tasks in this sequence involves binary classification on 2D inputs, where the number of training examples per task is 3,600. Following Pan et al. (2020), we use a fully connected neural network with an input layer of size 2, two hidden layers of size 20 and an output layer of size 2. When running S-FSVI, we set the prior covariance as $\Sigma_0 = 0.1$ and train the neural network for 250 epochs on each task. We use the Adam optimizer with an initial learning rate of 0.0005 ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and a batch size of 128. The coreset is constructed by choosing 40 samples from the training data for each task. To evaluate the KL divergence between the posterior and the prior distributions over functions, for each previous task we sample 20 input points from the context set and generate another 30 samples by sampling each pixel uniformly from the range $[-4, 4]$. For example, when we train the model on task $t \in \{1, 2, 3, \ldots\}$, we use $20(t - 1)$ samples chosen from the context set and $30t$ white-noise samples. The noise samples encourage the neural network to preserve high predictive uncertainty in regions far from the training data.
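The bookkeeping for the context inputs in this example can be sketched as follows. This is a minimal illustration of the counts stated above, not the authors' implementation; the function name, the dictionary layout of `context_sets`, and the Gaussian stand-in data are assumptions.

```python
import random

def sample_context_inputs(context_sets, task, rng,
                          n_coreset=20, n_noise=30, low=-4.0, high=4.0):
    """Return context inputs for training on `task` (1-indexed).

    Draws `n_coreset` stored points for each previous task and
    `n_noise * task` uniform-noise points in [low, high]^2.
    """
    points = []
    for previous in range(1, task):
        points.extend(rng.sample(context_sets[previous], n_coreset))
    for _ in range(n_noise * task):
        points.append((rng.uniform(low, high), rng.uniform(low, high)))
    return points

rng = random.Random(0)
# Hypothetical stored context sets: 40 two-dimensional points per finished task.
context_sets = {
    k: [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40)]
    for k in (1, 2)
}
inputs_task1 = sample_context_inputs(context_sets, task=1, rng=rng)
inputs_task3 = sample_context_inputs(context_sets, task=3, rng=rng)
```

On the first task there are no stored points, so only noise samples are used; on the third task the call returns 40 stored points followed by 90 noise points.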

C.2. Task Sequences Based on (Fashion) MNIST

Split MNIST consists of ﬁve tasks, where each task is binary classiﬁcation on a pair of MNIST classes. Split Fashion MNIST has the same form but uses data from Fashion MNIST. Permuted MNIST comprises ten tasks, where each task involves classifying images into the ten MNIST classes after the image pixels have been randomly reordered. Unless speciﬁed otherwise, the following setups apply to Figures 3, 6, 7 and 8 and Table 1.
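A minimal sketch of how such task sequences might be constructed is given below. The pairing of consecutive classes (0/1, 2/3, and so on) and the use of a fresh random permutation for every task, including the first, are assumptions of this example; the text above only states that each split task uses a pair of classes and that pixels are randomly reordered.

```python
import random

def split_task_classes(num_classes=10, classes_per_task=2):
    """Group class labels into consecutive pairs, one pair per task."""
    return [tuple(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]

def make_pixel_permutations(num_tasks, num_pixels, seed=0):
    """Draw one fixed random pixel ordering per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_tasks):
        order = list(range(num_pixels))
        rng.shuffle(order)
        permutations.append(order)
    return permutations

def permute_image(flat_image, order):
    """Reorder the pixels of a flattened image."""
    return [flat_image[i] for i in order]

split_tasks = split_task_classes()
permutations = make_pixel_permutations(num_tasks=10, num_pixels=28 * 28)
flat_image = [i / 783 for i in range(28 * 28)]  # stand-in for one image in [0, 1]
permuted = permute_image(flat_image, permutations[0])
```

The same permutation is applied to every image within a task, so each permuted task keeps the ten-class label structure while changing the input layout.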

Dataset. In all cases, 60,000 data samples are used for training and 10,000 data samples are used for testing. The input images are converted to ﬂoating-point numbers with values in the range [0, 1].

Neural-Network Size & Coreset Size. To ensure fair comparison, all methods in Table 1 (unless explicitly indicated otherwise) use the same neural-network size and (where applicable) coreset size. As in prior work (Pan et al., 2020; Titsias et al., 2020), we use fully connected neural networks, with two hidden layers of size 100 for permuted MNIST and two hidden layers of size 256 for split (Fashion) MNIST. In all cases, the ReLU activation function is applied to non-output units. For single-head setups, we use 200 coreset points; for multi-head setups, we use 40 points.

Coreset Selection. For S-FSVI with a coreset, when training on the first task, 40 context points are generated by sampling each pixel uniformly from the range [0, 1]; during training on subsequent tasks, 40 context points are chosen randomly from the context set. For S-FSVI without a coreset, 40 context points are chosen uniformly at random from the training data of the current task (corresponding to the Random label in Figure 3).

Prior Distribution. For the ﬁrst task, S-FSVI uses a prior distribution over functions with ﬁxed mean and diagonal covariance. When using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 0.001. When not using a coreset, the prior distribution is assumed to be Gaussian with zero mean and a diagonal covariance of magnitude 100. The prior variance is optimized via hyperparameter selection on a validation set.

Optimization. We use the Adam optimizer with an initial learning rate of 0.0005 (β1 = 0.9, β2 = 0.999). The number of epochs on each task is 60 for split MNIST (MH), 60 for split Fashion MNIST (MH), 10 for permuted MNIST (SH) and 80 for split MNIST (SH). The batch size is 128.

Prediction. The predictive distribution used for computing the expected log-likelihood is estimated using ﬁve Monte Carlo samples.

Hyperparameter Selection. For S-FSVI (optimized) in Table 1, we used the optimized hyperparameters chosen on a validation set after exploring the configurations shown in Table 4. For cases where no configuration is significantly better than the rest, the default value given in Appendix C.2 is used.

Table 4. Hyperparameter selection. Optimal values (in bold) were chosen based on validation-set accuracy. Standard errors were computed across ten random seeds.

| Task Sequence | Number of Layers & Units | Magnitude of Prior Variance | Number of Epochs |
|---|---|---|---|
| Split MNIST (MH) | {1, 2} * {100, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {40, 60, 80, 120, 160} |
| Split Fashion MNIST (MH) | {4} * {50, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {40, 60, 80, 120, 160} |
| Permuted MNIST (SH) | {2} * {100, 200, 400, 500} | {0.001, 0.01, 0.1, 1, 10, 100} | {10, 20, 40, 60, 80} |
| Split MNIST (SH) | {1, 2} * {100, 200, 300, 400} | {0.001, 0.01, 0.1, 1, 10, 100} | {60, 80, 120, 160, 240} |

C.3. Split CIFAR

Split CIFAR, as described in Pan et al. (2020), consists of six tasks. The ﬁrst is ten-way classiﬁcation on the full CIFAR-10 dataset. Each of the following ﬁve is also ten-way classiﬁcation, with classes drawn from CIFAR-100. Following Pan et al. (2020), we use a neural network with four convolutional layers followed by two fully connected layers followed by multiple output heads (one for each task). For S-FSVI, we use the following setup: Adam optimizer with learning rate 0.0005, prior with covariance 0.01, random coreset selection, 200 coreset points per task, 50 context points at each task. We also use this setup (and a training duration of 2000 epochs) when training individual neural networks for the separate baseline.

C.4. Sequential Omniglot

Sequential Omniglot, as described in Schwarz et al. (2018), comprises 50 classification tasks. Each task is associated with an alphabet, and the number of characters (classes) varies between alphabets. Following Schwarz et al. (2018), we use a neural network with four convolutional layers followed by one fully connected layer. For S-FSVI, we use two coreset points per character, as used by Titsias et al. (2020). The coreset points are sampled from the training set with probability proportional to the entropy of the neural network's posterior predictive distribution. To limit memory usage, we draw no more than 25 context points from the context set at each gradient step after task 25. We use a learning rate of 0.001 and a prior covariance of 1.0. For the first task, the neural network trains for 200 epochs; for subsequent tasks, it trains for ten epochs per task. We use the same data augmentation and train-test split as Titsias et al. (2020).

C.5. Coreset-Selection Methods

We consider different distributions from which to sample points to be added to the coreset. For each of the scoring methods below, we use the scores to create a probability mass function from which points can be sampled.

Random. Points are sampled uniformly from the training data.

Predictive-Entropy Scoring. Points are scored according to the total predictive uncertainty (i.e., the predictive entropy) of the model. For a model with stochastic parameters $\Theta$, pre-likelihood outputs $f(X; \theta)$, and a likelihood function $p(y \mid f(X; \theta))$, the predictive entropy is given by $\mathcal{H}(\mathbb{E}[p(y \mid f(X; \theta))])$ (Cover and Thomas, 1991; Shannon and Weaver, 1949). The expectation is taken with respect to the model parameters. $\mathcal{H}(\cdot)$ is the entropy functional, and $I(y; \Theta)$ is the mutual information between the model parameters and its predictions.
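As an illustration, the predictive entropy of a single input can be estimated from a handful of Monte Carlo samples of the class probabilities. The sketch below assumes that each sample is a probability vector obtained from one parameter draw; the function name and the toy probabilities are hypothetical.

```python
import math

def predictive_entropy(sampled_probs):
    """Entropy of the Monte Carlo average of class-probability vectors."""
    n = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n for c in range(num_classes)]
    # Skip zero-probability classes, whose contribution to the entropy is zero.
    return -sum(p * math.log(p) for p in mean if p > 0.0)

# Parameter samples that agree on a confident prediction give low entropy.
confident = [[0.99, 0.01], [0.98, 0.02], [0.99, 0.01]]
# Parameter samples that disagree give a near-uniform average and high entropy.
disagreeing = [[0.99, 0.01], [0.01, 0.99]]

h_confident = predictive_entropy(confident)
h_disagreeing = predictive_entropy(disagreeing)
```

Because the entropy is taken after averaging, it is high both when individual predictions are uncertain and when confident predictions disagree across parameter samples.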

Evidence-Lower-Bound Scoring. Points are scored according to the value of the evidence lower bound (ELBO) given in Equation (11).

Kullback-Leibler-Divergence Scoring. Points are scored according to the value of the approximation to the function-space KL divergence given in Equation (11).

Score-Based Distributions. After scoring with the above methods, points are added to the coreset by sampling from one of the following probability mass functions:

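$$
\text{Lowest: } P(i) \doteq \frac{\bar{s}_i}{\sum_{j=1}^{N} \bar{s}_j} \quad \text{and} \quad \text{Highest: } P(i) \doteq \frac{s_i}{\sum_{j=1}^{N} s_j}, \tag{C.10}
$$

where $s_i$ is the score of the $i$-th point, $\bar{s}_i = \max_{j=1}^{N} s_j - s_i$, and $N$ is the number of candidate points.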
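The two score-based distributions can be sketched directly: the "highest" distribution normalizes the scores, while the "lowest" distribution normalizes the gap between each score and the maximum score. The example assumes non-negative scores that are not all equal (otherwise a normalizer would be zero); the function names are hypothetical, and `random.choices` samples with replacement, which may differ from how the authors draw coreset points.

```python
import random

def highest_pmf(scores):
    """Probability proportional to the score itself."""
    total = sum(scores)
    return [s / total for s in scores]

def lowest_pmf(scores):
    """Probability proportional to the gap between the maximum score and the score."""
    top = max(scores)
    flipped = [top - s for s in scores]
    total = sum(flipped)
    return [f / total for f in flipped]

scores = [0.2, 0.5, 0.3]  # hypothetical scores for three candidate points
p_high = highest_pmf(scores)
p_low = lowest_pmf(scores)

# Draw two candidate indices under the "lowest" distribution.
rng = random.Random(0)
chosen = rng.choices(range(len(scores)), weights=p_low, k=2)
```

Under the "lowest" distribution, the highest-scoring candidate receives probability zero, so it is never selected.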

C.6. Forward and Backward Transfer

In Table 3, we report forward and backward transfer metrics as defined in Pan et al. (2020). Backward transfer (BT) indicates the performance gain on past tasks when new tasks are learnt, while forward transfer (FT) quantifies how much knowledge from past tasks helps the learning of new tasks. Higher is better for both. For $T$ tasks, let $R_{i,j}$ be the accuracy of the model on task $t_j$ after training on task $t_i$, and let $R^{\text{ind}}_i$ be the accuracy of an independent model trained only on task $t_i$. Then

$$
\text{BT} \doteq \frac{1}{T - 1} \sum_{i=1}^{T-1} \big(R_{T,i} - R_{i,i}\big) \quad \text{and} \quad \text{FT} \doteq \frac{1}{T - 1} \sum_{i=2}^{T} \big(R_{i,i} - R^{\text{ind}}_i\big).
$$
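A short sketch of how these two metrics can be computed from a matrix of accuracies follows. It uses zero-based indexing, so `acc[i][j]` holds the accuracy on task `j` after training on task `i`; the accuracy values are invented for illustration.

```python
def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks after training on all T tasks."""
    num_tasks = len(acc)
    return sum(acc[num_tasks - 1][i] - acc[i][i]
               for i in range(num_tasks - 1)) / (num_tasks - 1)

def forward_transfer(acc, independent_acc):
    """Mean gain over independently trained models on tasks 2, ..., T."""
    num_tasks = len(acc)
    return sum(acc[i][i] - independent_acc[i]
               for i in range(1, num_tasks)) / (num_tasks - 1)

# Hypothetical accuracies for three tasks; entries above the diagonal are unused.
acc = [
    [0.90, 0.00, 0.00],
    [0.85, 0.80, 0.00],
    [0.80, 0.78, 0.70],
]
independent = [0.90, 0.75, 0.60]

bt = backward_transfer(acc)
ft = forward_transfer(acc, independent)
```

In this toy case BT is negative (accuracy on earlier tasks drops after later training) and FT is positive (later tasks are learned better than by independent models).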

D. Further Related Work

Objective-based approaches to continual learning involve training a neural network using a specially designed objective function. Typically the objective includes a regularization term that penalizes changes in the neural network's configuration. Whereas in Section 4 we summarise methods that regularize in function space, here we cover methods that regularize directly in terms of the parameters of a neural network. Among these, most relevant to our work are those that approximate Bayesian updating, in which the posterior from the previous task forms the prior for the current task.

A key idea is shared between many methods for parameter-space regularization: for each parameter, apply a penalty on the difference between its current setting and its prior setting, weighted by a measure of the parameter's importance. Methods vary in how they measure importance. Variational continual learning (VCL; Nguyen et al., 2018; Swaroop et al., 2019), which extends the concept of online variational inference (Broderick et al., 2013; Ghahramani and Attias, 2000; Honkela and Valpola, 2003; Sato, 2001) to deep neural networks, uses the parameter covariance matrix of the model currently serving as the prior. Elastic weight consolidation (EWC; Kirkpatrick et al., 2017) and its successors (Chaudhry et al., 2018; Lee et al., 2017; Liu et al., 2018; Schwarz et al., 2018) use a Fisher information matrix computed on each task. Online structured Laplace (Ritter et al., 2018) and second-order loss approximation (Yin et al., 2020a) respectively use Kronecker-factored and low-rank Hessians. Synaptic intelligence (SI; Zenke et al., 2017) uses a cumulative sum of the gradient of the training objective with respect to the parameters. Memory-aware synapses (MAS; Aljundi et al., 2018) use the gradient of the model output with respect to the parameters.
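The shared idea can be sketched as a generic importance-weighted quadratic penalty. This is not the exact regularizer of any single method listed above: the factor of one half, the `strength` multiplier, and the per-parameter (diagonal) importance values are assumptions of this example.

```python
def importance_weighted_penalty(params, prior_params, importance, strength=1.0):
    """Quadratic penalty on parameter changes, weighted per parameter."""
    return 0.5 * strength * sum(
        w * (p - q) ** 2
        for p, q, w in zip(params, prior_params, importance)
    )

# Hypothetical two-parameter model: the first parameter is deemed important,
# so moving it is penalized far more than moving the second.
params = [1.0, 2.0]
prior_params = [0.0, 2.5]
importance = [4.0, 0.1]

penalty = importance_weighted_penalty(params, prior_params, importance)
```

Different methods would plug different quantities into `importance`, for example diagonal entries of a Fisher information estimate.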

Other related work on parameter-space regularization includes various modifications to VCL (Ahn et al., 2019; Kessler et al., 2019), uncertainty-guided continual learning in Bayesian neural networks (Ebrahimi et al., 2020), and a variation of SI known as asymmetric loss approximation with single-side overestimation (Park et al., 2019). There have also been efforts to conceptually unify some of the approaches outlined above: Loo et al. (2020) draw a link between VCL and online EWC; Chaudhry et al. (2018) combine EWC and SI in a single method; Yin et al. (2020b) generalize EWC, online structured Laplace, SI and MAS.