# Bayesian Online Natural Gradient (BONG)

Matt Jones (University of Colorado, mcjones@colorado.edu), Peter Chang (MIT, gyuyoung@mit.edu), Kevin Murphy (Google DeepMind, kpmurphy@google.com)

**Abstract.** We propose a novel approach to sequential Bayesian inference based on variational Bayes (VB). The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior (which comes from the posterior at the previous timestep); instead we can optimize just the expected log-likelihood, performing a single step of natural gradient descent starting at the prior predictive. We prove this method recovers exact Bayesian inference if the model is conjugate. We also show how to compute an efficient deterministic approximation to the VB objective, as well as our simplified objective, when the variational distribution is Gaussian or a sub-family, including the case of a diagonal plus low-rank precision matrix. We show empirically that our method outperforms other online VB methods in the non-conjugate setting, such as online learning for neural networks, especially when controlling for computational costs.

## 1 Introduction

Bayesian methods for neural network (NN) training aim to minimize the Kullback-Leibler (KL) divergence between the true and estimated posterior distributions. This is equivalent to minimizing the variational loss (or negative ELBO)

$$\mathcal{L}(\psi) = -\mathbb{E}_{\theta \sim q_\psi}[\log p(\mathcal{D}|\theta)] + D_{KL}(q_\psi \,\|\, p_0) \tag{1}$$

Here $\theta$ are the network parameters, $\psi$ are the variational parameters of the approximate posterior $q_\psi(\theta)$, $\mathcal{D}$ is the training dataset, and $p_0(\theta)$ is the prior. The two terms in the variational loss correspond to data fit and regularization toward the prior, the latter being analogous to a regularizer $r(\theta) = -\log p_0(\theta)$ in traditional point-estimation methods like SGD.

An important set of approaches learns the variational parameters by gradient descent on $\mathcal{L}(\psi)$ [Blundell et al., 2015]. More recently, Khan and colleagues [Khan et al., 2018b, Khan and Rue, 2023, Shen et al., 2024] have proposed using the natural gradient $F_\psi^{-1} \nabla_\psi \mathcal{L}(\psi)$, where $F_\psi$ is the Fisher information matrix of the variational family evaluated at $q_\psi$. Natural gradient descent (NGD) is often more efficient than vanilla GD because it accounts for the intrinsic geometry of the variational family [Amari, 1998]. Khan and Rue [2023] call this approach the "Bayesian Learning Rule" or BLR. Using various choices for the variational distribution, generalized losses replacing the negative log-likelihood, and other approximations, they reproduce many standard optimization methods such as Adam, and derive new ones.

We study Bayesian NN optimization in online learning, where the data are observed in sequence, $\mathcal{D}_t = \{(x_k, y_k)\}_{k=1}^t$, and the algorithm maintains an approximate posterior $q_{\psi_t}(\theta_t) \approx p(\theta_t|\mathcal{D}_t)$, which it updates at each step. Fast updates (in terms of both computational speed and statistical efficiency) are critical for many online learning applications [Zhang et al., 2024]. To allow for nonstationarity in the datastream, we include a time index on $\theta_t$, to represent that the parameters may change over time, as is standard for approaches based on state-space models and the extended Kalman filter (see e.g., [Sarkka and Svensson, 2023]). The belief state is updated recursively using
the prior $q_{\psi_{t|t-1}}$ derived from the previous time step, so that the variational loss becomes

$$\mathcal{L}(\psi_t) = -\mathbb{E}_{\theta_t \sim q_{\psi_t}}[\log p(y_t|x_t, \theta_t)] + D_{KL}\big(q_{\psi_t} \,\|\, q_{\psi_{t|t-1}}\big) \tag{2}$$

One option for this online learning problem is to apply NGD on $\mathcal{L}(\psi_t)$ at each time step, iterating until $\psi_t$ converges before consuming the next observation. Our first contribution is a proposal for skipping this inner loop by (a) performing a single natural gradient step with unit learning rate and (b) omitting the $D_{KL}$ term in Eq. (2), so that learning is based only on the expected log-likelihood:

$$\psi_t = \psi_{t|t-1} + F_{\psi_{t|t-1}}^{-1} \nabla_{\psi_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\log p(y_t|x_t, \theta_t)] \tag{3}$$

These two modifications work together: instead of regularizing toward the prior explicitly using $D_{KL}(q_{\psi_t} \| q_{\psi_{t|t-1}})$, we do so implicitly by using $\psi_{t|t-1}$ as the starting point of our single natural gradient step. This may appear to be a heuristic, but we prove in Proposition 4.1 that it yields exact Bayesian inference when $q_\psi$ and $p(y|x,\theta)$ are conjugate and $q_\psi$ is an exponential family with natural parameter $\psi$. Thus our proposed update can be viewed as a relaxation of the Bayesian update to the non-conjugate variational case. As is common in work on variational inference, we view the result for the conjugate case as a motivating foundation that ensures our method is exact in certain simple settings. The experiments reported in Section 5 and Appendix B complement the theory by showing our method also works well in more general settings. We call Eq. (3) the Bayesian online natural gradient (BONG).

Our second contribution concerns ways of computing the expectation in Eqs. (1) to (3). This is intractable for NNs, even for variational distributions that are easy to compute, since the likelihood takes the form $p(y_t|x_t, \theta_t) = p(y_t|f(x_t, \theta_t))$, where $f(x_t, \theta_t)$, the function computed by the network, is a complex, nonlinear function of $\theta_t$. Many previous approaches have approximated the expected log-likelihood by sampling methods, which add variance and computation time depending on the number of samples [Blundell et al., 2015, Shen et al., 2024]. We propose a deterministic, closed-form update that applies when the variational distribution is Gaussian (or a sub-family) and the likelihood is an exponential family with natural parameter $f(x_t, \theta_t)$ and mean parameter $h(x_t, \theta_t)$ (e.g., for classification, $f$ returns the vector of class logits, $h$ returns the class probabilities, and $h = \mathrm{softmax}(f)$). This update can be derived in two equivalent ways. First, we use a local linear approximation of the network $h(x_t, \theta_t) \approx \bar{h}_t(\theta_t)$ [Immer et al., 2021a] and a Gaussian approximation of the likelihood $N(y_t|\bar{h}_t(\theta_t), R_t)$ [Ollivier, 2018, Tronarp et al., 2018]. Under these assumptions the expectation in Eq. (3) can be calculated analytically. Alternatively, we use a different linear approximation $f(x_t, \theta_t) \approx \bar{f}_t(\theta_t)$ and a delta approximation $q_{\psi_{t|t-1}}(\theta_t) \approx \delta_{\mu_{t|t-1}}(\theta_t)$, where $\mu_{t|t-1} = \mathbb{E}_{q_{\psi_{t|t-1}}}[\theta_t]$ is the prior mean, so that the expectation in Eq. (3) is replaced by a plugin prediction. The linear(h)-Gaussian approximation is previously known, but the linear(f)-delta approximation is new, and we prove in Proposition 4.2 that they yield the same update, which we call linearized BONG, or BONG-LIN. Finally, we discuss different ways of approximating the Hessian of the objective, which is needed for NGD. Our BONG framework unifies several existing methods for Bayesian online learning, and it offers new algorithms based on alternative variational families or parameterizations.
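To make the single-step update concrete, the following minimal numeric sketch (our illustration; the scalar Gaussian model and the values are assumptions, not from the paper's code) applies Eq. (3) to the simplest conjugate case: a Gaussian prior on a scalar mean with a known-variance Gaussian likelihood. Because the variational family is an exponential family, the natural-gradient step reduces to a plain gradient step in the dual (expectation) coordinates (see Eq. (5) below), and with unit learning rate it reproduces the exact conjugate posterior, as Proposition 4.1 guarantees.

```python
import numpy as np

# Conjugate toy model: theta ~ N(mu0, s0), y | theta ~ N(theta, r), all scalars.
mu0, s0 = 0.0, 2.0   # prior mean and variance (assumed values)
r = 0.5              # known observation variance
y = 1.3              # a single observation

# BONG (Eq. 3): one natural-gradient step from the prior with unit learning
# rate. In dual coordinates rho = (E[theta], E[theta^2]), the expected
# log-likelihood is -(y^2 - 2*y*rho1 + rho2)/(2r) + const, so its dual
# gradient is (y/r, -1/(2r)); the step simply adds this to the prior's
# natural parameters psi = (mu/s, -1/(2s)).
psi1 = mu0 / s0 + y / r
psi2 = -1.0 / (2.0 * s0) - 1.0 / (2.0 * r)
s_bong = -1.0 / (2.0 * psi2)     # posterior variance implied by psi
mu_bong = psi1 * s_bong          # posterior mean implied by psi

# Exact conjugate Bayesian update, for comparison.
prec = 1.0 / s0 + 1.0 / r
mu_exact, s_exact = (mu0 / s0 + y / r) / prec, 1.0 / prec

print(mu_bong, mu_exact)   # identical
print(s_bong, s_exact)     # identical
```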
We define a large space of methods by combining 4 different update rules with 4 different ways of computing the relevant expected gradients and Hessians and 3 different variational families (Gaussians with full, diagonal, and diagonal-plus-low-rank precision matrices). We conduct experiments systematically testing how these factors affect performance. We find support for all three principles of our approach (NGD, implicit regularization toward the prior, and linearization) in terms of both statistical and computational efficiency. Code for our experiments is available at https://github.com/petergchang/bong/.

## 2 Related work

Variational inference (VI) approximates the Bayesian posterior from within some suitable family in a way that bypasses the normalization term [Zellner, 1988, Jordan et al., 1999]. A common choice for the variational family is a Gaussian. For online learning, the exact update equations for Gaussian variational filtering are given by the RVGA method of [Lambert et al., 2021]. This update is implicit but can be approximated by an explicit RVGA update, which we show arises as a special case of BONG. Most applications of Gaussian VI use a mean-field approximation defined by a diagonal covariance, which scales linearly with model size. More expressive but still linear in the model size are methods that express the covariance [Tomczak et al., 2020] or precision [Mishkin et al., 2018, Lambert et al., 2023, Chang et al., 2023] as a sum of diagonal and low-rank matrices (DLR). In this paper, we consider variational families defined by full covariance, diagonal covariance, and DLR covariance.

For NNs and other complicated models, even the variational approximation can be intractable, so methods have been developed to approximately minimize the VI loss. Bayes by backprop (BBB) [Blundell et al., 2015] learns a variational distribution on NN weights by iterated GD on the VI loss of Eq. (1). They focus on mean-field Gaussian approximations, but the approach also applies to other variational families. Here we adapt BBB to online learning to compare to our methods. The Bayesian learning rule (BLR) replaces BBB's GD with NGD [Khan and Rue, 2023]. Several variants of BLR have been developed, including VON and VOGN for a mean-field Gaussian prior [Khan et al., 2018b] and SLANG for a DLR Gaussian [Mishkin et al., 2018]. BLR has also been used to derive versions of many classic optimizers, including SGD, RMSprop, and Adam [Khan et al., 2018a, Khan and Rue, 2023, Lin et al., 2024, Shen et al., 2024]. Although BLR has been applied to online learning, we are particularly interested in Bayesian filtering, including in nonstationary environments, where observations must be processed one at a time and updates are based on the posterior from the previous step, often in conjunction with parameter dynamics. We therefore develop filtering versions of BLR to compare to BONG, some of which reduce to VON, VOGN and SLANG in the batch setting, while others are novel. We also note BLR is a mature theory including several clever tricks we have not yet incorporated into our framework.

Khan and Rue [2023] observe that conjugate updating is equivalent to one step of BLR with learning rate 1. This is similar to our Proposition 4.1, except that BLR retains the KL term in the variational loss. BLR and BONG agree in this case because the gradient of the KL is zero on BLR's first iteration: $\nabla_{\psi} D_{KL}(q_\psi \| q_{\psi_{t|t-1}})\big|_{\psi = \psi_{t|t-1}} = 0$. Therefore BONG can be seen as a special case of BLR with one update step per observation and learning rate 1.
Our contribution is to recognize that doing a single update step allows the KL term to be dropped entirely, yielding a substantially simpler algorithm which our experiments show also performs better. While BLR allows alternative losses in place of the NLL in Eq. (2), one can also replace the KL term with other divergences [Knoblauch et al., 2022]. Our approach fits within that "generalized VB" framework in that it drops the divergence altogether.

Our approach of implicitly regularizing toward the prior using a single NGD step is also similar to the implicit MAP filter of [Bencomo et al., 2023], which performs truncated GD from the prior mode. The principal difference is that they perform GD on the model parameters ($\theta_t$) while we do NGD on the variational parameters ($\psi_t$). Thus BONG maintains a full prior and posterior, while IMAP is more concerned with how the choice of optimizer can substitute for explicit tracking of covariance.

We show two other ways to derive the BONG update in Appendix D, one of which is to replace the expected NLL in Eq. (2) with a linear approximation and solve the resulting equation exactly. Several past works have taken this approach, arriving at updates similar to ours. Chérief-Abdellatif et al. [2019] study streaming variational Bayes and propose solving Eq. (2) with a linearized expected NLL. When the variational family is an exponential family, their update becomes NGD [Khan and Lin, 2017] and matches the BONG update. van der Hoeven et al. [2018] show how mirror descent can be derived as a special case of Exponential Weights [Littlestone and Warmuth, 1994], which is closely related to Bayesian updating. The resulting algorithm is similar to BONG and follows from linearizing the NLL instead of the expected NLL, with an additional delta assumption at the prior mean. Lyu and Tsang [2021] study relaxed black-box optimization, where the objective is $\arg\min_\psi \mathbb{E}_{x\sim q_\psi}[f(x)]$ for some target function $f$. They use a mirror descent formulation with a linearized expected loss and KL regularizer, and show the resulting update is NGD on the expected loss, formally equivalent to our BONG update. From the perspective of this prior work, our contribution is to express the BONG update simply as NGD on the expected NLL, motivated by replacing the KL with implicit regularization, and to show how this yields a variety of known and novel algorithms for Bayesian filtering.

EKF applications to NNs apply Bayesian filtering using a local linear approximation of the network, leading to simple closed-form updates [Singhal and Wu, 1989, Puskorius and Feldkamp, 1991]. The classic EKF assumes a Gaussian observation distribution, but it has been extended to other exponential families (e.g., for classification) by matching the mean and covariance, in what we call the conditional-moments EKF (CM-EKF) [Ollivier, 2018, Tronarp et al., 2018]. Applying a KL projection to diagonal covariance yields the variational diagonal EKF (VD-EKF) [Chang et al., 2022]. Alternatively, projecting to diagonal-plus-low-rank precision using SVD gives LO-FI [Chang et al., 2023]. We derive all these methods as special cases of BONG-LIN. Further developments in this direction include the method of [Titsias et al., 2024], which does Bayesian filtering on only the final weight layer, and WoLF [Duran-Martin et al., 2024], which achieves robustness to outliers through data-dependent weighting of the log-likelihood.
## 3 Background

We study online supervised learning where the agent receives input $x_t \in \mathbb{R}^D$ and observation $y_t \in \mathbb{R}^C$ on each time step, which it aims to model with a function $f_t(\theta_t) = f(x_t, \theta_t)$, such as a NN with weights $\theta_t \in \mathbb{R}^P$. The predictions for $y_t$ are given by some observation distribution $p(y_t|f_t(\theta_t))$. For example, $f$ may compute the mean for regression or the class logits for classification.

We work in a Bayesian framework where the agent maintains an approximate posterior distribution over $\theta_t$ after observing data $\mathcal{D}_t = \{(x_k, y_k)\}_{k=1}^t$. The filtering posterior $q_{\psi_t}(\theta_t) \approx p(\theta_t|\mathcal{D}_t)$ is approximated within some parametric family indexed by the variational parameter $\psi_t$. We allow for nonstationarity by assuming $\theta$ changes over time according to some dynamics model $p(\theta_t|\theta_{t-1})$. By pushing the posterior from step $t-1$ through the dynamics, we obtain a prior for step $t$ given by $q_{\psi_{t|t-1}}(\theta_t) \approx p(\theta_t|\mathcal{D}_{t-1})$. For example, suppose the variational posterior from the previous step is Gaussian, $q_{\psi_{t-1}}(\theta_{t-1}) = N(\theta_{t-1}|\mu_{t-1}, \Sigma_{t-1})$, and the dynamics model is an Ornstein-Uhlenbeck process, as proposed in prior work [Kurle et al., 2020, Titsias et al., 2024] to handle nonstationarity, i.e., the dynamics model has the form $\theta_t \sim N(\gamma_t \theta_{t-1} + (1-\gamma_t)\mu_0, Q_t)$, where $Q_t = (1-\gamma_t^2)\Sigma_0$ is the covariance of the noise process, $0 \le \gamma_t \le 1$ is the degree of drift, and $p(\theta_0) = N(\mu_0, \Sigma_0)$ is the prior. In this case, the parameters of the prior predictive distribution are $\mu_{t|t-1} = \gamma_t \mu_{t-1} + (1-\gamma_t)\mu_0$ and $\Sigma_{t|t-1} = \gamma_t^2 \Sigma_{t-1} + Q_t$. In general the predict step may require approximation to stay in the variational family (e.g., if the dynamics are nonlinear). In this paper, our focus is the update step from $\psi_{t|t-1}$ to $\psi_t$ upon observing $(x_t, y_t)$, so for simplicity we assume constant (static) parameters, i.e., $p(\theta_t|\theta_{t-1}) = \delta(\theta_t - \theta_{t-1})$ (equivalently $\gamma_t = 1$), so $\psi_{t|t-1} = \psi_{t-1}$; however, our method can trivially handle nonstationary parameters.

Variational inference seeks an approximate posterior that minimizes the KL divergence from the exact Bayesian update of the prior. In the online setting this becomes

$$\psi_t^* = \arg\min_\psi D_{KL}\Big(q_\psi(\theta_t) \,\Big\|\, Z_t^{-1}\, q_{\psi_{t|t-1}}(\theta_t)\, p(y_t|f_t(\theta_t))\Big) = \arg\min_\psi \mathcal{L}_t(\psi) \tag{4}$$

where $\mathcal{L}_t$ is the online VI loss defined in Eq. (2), and the normalization term $Z_t$ (which depends on $x_t$) drops out as an additive constant. Our goal is an efficient approximate solution to this variational optimization problem.

We will sometimes assume the variational posterior $q_\psi$ is an exponential family distribution with natural parameter $\psi$, so that $q_{\psi_t}(\theta_t) = \exp(\psi_t^\top T(\theta_t) - \Phi(\psi_t))$, with log-partition function $\Phi$ and sufficient statistics $T(\theta_t)$. Assuming $\Phi$ is strictly convex (which holds in the cases we study), there is a bijection between $\psi_t$ and the dual (or expectation) parameter $\rho_t = \mathbb{E}_{\theta_t \sim q_{\psi_t}}[T(\theta_t)]$. Classical thermodynamic identities imply that the Fisher information matrix has the form $F_{\psi_t} = \partial \rho_t / \partial \psi_t$. This has important implications for NGD on exponential families [Khan and Rue, 2023], because it implies that for any function $\ell$ defined on the variational parameter space, the natural gradient wrt the natural parameters $\psi_t$ is the regular gradient wrt the dual parameters $\rho_t$, i.e., $F_{\psi_t}^{-1} \nabla_{\psi_t} \ell = \nabla_{\rho_t} \ell$.
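Since these identities drive everything that follows, here is a small numeric sketch (ours; the scalar Gaussian family and the test objective are assumptions) checking that $F_\psi = \partial\rho/\partial\psi$ turns the natural gradient in $\psi$ into the ordinary gradient in $\rho$:

```python
import numpy as np

# Scalar Gaussian family: natural psi = (mu/s, -1/(2s)), dual rho = (mu, mu^2 + s).
def psi_to_moments(psi):
    s = -1.0 / (2.0 * psi[1])
    return psi[0] * s, s

def rho_of_psi(psi):
    mu, s = psi_to_moments(psi)
    return np.array([mu, mu**2 + s])

def ell(psi, y=1.3, r=0.5):   # test objective: E_q[log N(y | theta, r)]
    mu, s = psi_to_moments(psi)
    return -0.5 * (y**2 - 2*y*mu + mu**2 + s) / r - 0.5 * np.log(2*np.pi*r)

psi = np.array([0.4, -0.25])  # an arbitrary valid natural parameter (mu=0.8, s=2)
eps = 1e-6

# Finite-difference Fisher F = d rho / d psi, and gradient of ell wrt psi.
F = np.column_stack([(rho_of_psi(psi + eps*e) - rho_of_psi(psi - eps*e)) / (2*eps)
                     for e in np.eye(2)])
g_psi = np.array([(ell(psi + eps*e) - ell(psi - eps*e)) / (2*eps) for e in np.eye(2)])

print(np.linalg.solve(F, g_psi))         # natural gradient: F^{-1} grad_psi ell
print(np.array([1.3/0.5, -1/(2*0.5)]))   # analytic dual gradient: (y/r, -1/(2r))
```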
## 4 Methods

We propose to approximate the variational optimization problem in Eq. (4) using the BONG update in Eq. (3). When $q_\psi$ is an exponential family, the fact that the natural gradient wrt the natural parameters $\psi_t$ is the regular gradient wrt the dual parameters $\rho_t$ implies an equivalent mirror-descent form (see Appendix D for further analysis of BONG from the MD perspective):

$$\psi_t = \psi_{t|t-1} + \nabla_{\rho_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\log p(y_t|x_t, \theta_t)] \tag{5}$$

This is NGD with unit learning rate on the variational loss in Eq. (2), but ignoring the $D_{KL}(q_\psi \| q_{\psi_{t|t-1}})$ term. In this section we first prove this method is optimal when the model is conjugate, and then describe extensions to more complex cases of practical interest.

### 4.1 Conjugate case

Our approach is motivated by the following result, which states that BONG matches exact Bayesian inference when the variational distribution and the likelihood are conjugate exponential families:

**Proposition 4.1.** Let the observation distribution (likelihood) be an exponential family with natural parameter $\theta_t$ (where $T_l(y_t) = y_t$ is the sufficient statistic for the likelihood and $A(\theta_t)$ is the log-partition function),

$$p_t(y_t|\theta_t) = \exp\big(\theta_t^\top y_t - A(\theta_t) - b(y_t)\big) \tag{6}$$

and let the prior be the conjugate exponential family

$$q_{\psi_{t|t-1}}(\theta_t) = \exp\big(\psi_{t|t-1}^\top T(\theta_t) - \Phi(\psi_{t|t-1})\big) \tag{7}$$

with $T(\theta_t) = [\theta_t; -A(\theta_t)]$. Then the exact Bayesian update agrees with Eq. (5).

The proof is in Appendix C. Writing the natural parameters of the prior as $\psi_{t|t-1} = [\chi_{t|t-1}; \nu_{t|t-1}]$, we show the Bayesian update and BONG both yield $\chi_t = \chi_{t|t-1} + y_t$ and $\nu_t = \nu_{t|t-1} + 1$. Intuitively, we are just accumulating a sum of the observed sufficient statistics, and a counter of the sample size (number of observations seen so far).

### 4.2 Variational case

In practical settings the conjugacy assumption of Proposition 4.1 will not be met, so Eqs. (3) and (5) will only approximate the Bayesian update. In this paper we restrict attention to Gaussian variational families. We refer to the unrestricted case as FC (full covariance), defined by the variational distribution

$$q_{\psi_{t|t-1}}(\theta_t) = N\big(\theta_t \,\big|\, \mu_{t|t-1}, \Sigma_{t|t-1}\big) \tag{8}$$

where $\Sigma_{t|t-1}$ can be any positive semi-definite (PSD) matrix. The natural and dual parameters are $\psi = (\Sigma^{-1}\mu, -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}))$ and $\rho = (\mu, \mathrm{vec}(\mu\mu^\top + \Sigma))$. Appendix E.1.1 shows that Eq. (5), translated back to $(\mu, \Sigma)$, gives the following BONG update for the FC case:

$$\mu_t = \mu_{t|t-1} + \Sigma_t \underbrace{\mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\big[\nabla_{\theta_t} \log p(y_t|f_t(\theta_t))\big]}_{g_t} \tag{9}$$

$$\Sigma_t^{-1} = \Sigma_{t|t-1}^{-1} - \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\big[\nabla^2_{\theta_t} \log p(y_t|f_t(\theta_t))\big] \tag{10}$$

which matches the explicit update in the RVGA method of [Lambert et al., 2021].

### 4.3 Monte Carlo approximation

The integrals over the prior $q_{\psi_{t|t-1}}$ in Eqs. (9) and (10) are generally intractable and must be approximated. One option is to use Monte Carlo, in what we call BONG-MC. Given $M$ independent samples $\hat\theta^{(m)}_t \sim q_{\psi_{t|t-1}}$, we estimate the expected gradient $g_t = \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\nabla_{\theta_t} \log p(y_t|f_t(\theta_t))]$ and expected Hessian $G_t = \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\nabla^2_{\theta_t} \log p(y_t|f_t(\theta_t))]$ as the empirical means

$$g^{MC}_t = \frac{1}{M}\sum_{m=1}^M \hat g^{(m)}_t, \qquad \hat g^{(m)}_t = \nabla_{\theta_t} \log p(y_t|f_t(\theta_t))\Big|_{\theta_t = \hat\theta^{(m)}_t} \tag{11}$$

$$G^{MC\text{-}HESS}_t = \frac{1}{M}\sum_{m=1}^M \hat G^{(m)}_t, \qquad \hat G^{(m)}_t = \nabla^2_{\theta_t} \log p(y_t|f_t(\theta_t))\Big|_{\theta_t = \hat\theta^{(m)}_t} \tag{12}$$

We use $G^{MC\text{-}HESS}$ only for small models. Otherwise we use the empirical Fisher (Section 4.5).
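As a concrete illustration, the following minimal NumPy sketch (ours; the Bernoulli logistic observation model, chosen because its per-sample gradients and Hessians are available in closed form, is an assumption and not the paper's NN setup) combines the MC estimators of Eqs. (11)-(12) with the full-covariance update of Eqs. (9)-(10):

```python
import numpy as np

rng = np.random.default_rng(0)

def bong_mc_step_fc(mu, Sigma, x, y, M=100):
    """One BONG-MC step for the FC family (Eqs. 9-12), assuming the model
    p(y=1|x,theta) = sigmoid(x @ theta) with y in {0, 1}."""
    thetas = rng.multivariate_normal(mu, Sigma, size=M)   # samples from the prior
    p = 1.0 / (1.0 + np.exp(-thetas @ x))                 # (M,) predicted probabilities
    g = np.mean((y - p)[:, None] * x[None, :], axis=0)    # Eq. (11): mean gradient
    G = np.mean(-p * (1 - p)) * np.outer(x, x)            # Eq. (12): mean Hessian
    prec_post = np.linalg.inv(Sigma) - G                  # Eq. (10)
    Sigma_post = np.linalg.inv(prec_post)
    mu_post = mu + Sigma_post @ g                         # Eq. (9)
    return mu_post, Sigma_post

mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = bong_mc_step_fc(mu, Sigma, x=np.array([1.0, -0.5]), y=1)
print(mu)
print(Sigma)
```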
### 4.4 Linearized BONG

As an alternative to BONG-MC, we propose a linear approximation, which we call BONG-LIN, that yields a deterministic, closed-form update. Assume the likelihood is an exponential family as in Proposition 4.1, but with natural parameter predicted by some function $f_t(\theta_t) = f(x_t, \theta_t)$:

$$p(y_t|x_t, \theta_t) = \exp\big(f_t(\theta_t)^\top y_t - A(f_t(\theta_t)) - b(y_t)\big) \tag{13}$$

We also define the dual (moment) parameter of the likelihood as $h_t(\theta_t) = \mathbb{E}[y_t|f_t(\theta_t)]$. In a NN, $f_t$ and $h_t$ are related by the final response layer. For example, in classification $f_t$ and $h_t$ give the class logits and probabilities, with $h_t(\theta_t) = \mathrm{softmax}(f_t(\theta_t))$ and $y_t$ being the one-hot encoding.

We now define two methods for approximating the expected gradient $g_t$ and expected Hessian $G_t$, based on linearizing the predictive model at the prior mean $\mu_{t|t-1}$ in terms of either $f_t(\theta_t)$ or $h_t(\theta_t)$, and then prove their equivalence. The linear(h)-Gaussian approximation [Ollivier, 2018, Tronarp et al., 2018] linearizes $h_t(\theta_t)$:

$$\bar h_t(\theta_t) = \hat y_t + H_t(\theta_t - \mu_{t|t-1}) \tag{14}$$

$$\hat y_t = h_t(\mu_{t|t-1}) \tag{15}$$

$$H_t = \frac{\partial h_t}{\partial \theta_t}\Big|_{\theta_t = \mu_{t|t-1}} \tag{16}$$

and approximates the likelihood by a Gaussian with variance based at $\mu_{t|t-1}$:

$$p^{LG}_t(y_t|\theta_t) = N\big(y_t \,\big|\, \bar h_t(\theta_t), R_t\big), \qquad R_t = \mathbb{V}\big[y_t \,\big|\, \theta_t = \mu_{t|t-1}\big] \tag{17}$$

The linear(f)-delta approximation linearizes $f_t(\theta_t)$ and maintains the original exponential-family likelihood distribution in Eq. (13):

$$\bar f_t(\theta_t) = f_t(\mu_{t|t-1}) + F_t(\theta_t - \mu_{t|t-1}) \tag{18}$$

$$F_t = \frac{\partial f_t}{\partial \theta_t}\Big|_{\theta_t = \mu_{t|t-1}} \tag{19}$$

$$p^{LD}_t(y_t|\theta_t) = \exp\big(\bar f_t(\theta_t)^\top y_t - A(\bar f_t(\theta_t)) - b(y_t)\big) \tag{20}$$

It also uses a plug-in approximation that replaces $q_{\psi_{t|t-1}}(\theta_t)$ with a point mass $\delta_{\mu_{t|t-1}}(\theta_t)$, so that the expected gradient and Hessian are approximated by their values at the prior mean, i.e., $\nabla_{\theta_t} \log p^{LD}_t(y_t|\theta_t)\big|_{\theta_t=\mu_{t|t-1}}$ and $\nabla^2_{\theta_t} \log p^{LD}_t(y_t|\theta_t)\big|_{\theta_t=\mu_{t|t-1}}$, rather than being sampled.

**Proposition 4.2.** Under a Gaussian variational distribution, the linear(h)-Gaussian and linear(f)-delta approximations yield the same values for the expected gradient and Hessian:

$$g^{LIN}_t = H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{21}$$

$$G^{LIN\text{-}HESS}_t = -H_t^\top R_t^{-1} H_t \tag{22}$$

See Appendix C for the proof. The main idea for the $g^{LIN}_t$ part is that the linear-Gaussian assumptions make the gradient linear in $\theta_t$, so the expected gradient equals the gradient at the mean. The main idea for the $G^{LIN\text{-}HESS}_t$ part is that eliminating the Hessian of the NN requires different linearizing assumptions for the Gaussian and delta approximations, and the remaining nonlinear terms (from the log-likelihood in Eq. (13)) agree because of the property of exponential families that the Hessian of the log-partition $A$ equals the conditional variance $R_t$.

Applying Proposition 4.2 to Eqs. (9) and (10) gives the BONG-LIN update for a FC Gaussian prior:

$$\mu_t = \mu_{t|t-1} + K_t(y_t - \hat y_t) \tag{23}$$

$$\Sigma_t = \Sigma_{t|t-1} - K_t H_t \Sigma_{t|t-1} \tag{24}$$

$$K_t = \Sigma_{t|t-1} H_t^\top \big(R_t + H_t \Sigma_{t|t-1} H_t^\top\big)^{-1} \tag{25}$$

where $K_t$ is the Kalman gain matrix (see Appendix E.1.2). This matches the CM-EKF [Tronarp et al., 2018, Ollivier, 2018].
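Here is a minimal NumPy sketch (ours) of one BONG-LIN step, Eqs. (23)-(25). To keep the Jacobian analytic, we assume a two-parameter nonlinear regression model with Gaussian noise, so that $h_t = f_t$ and $R_t$ is a fixed variance; the paper's classification setting would instead use $h = \mathrm{softmax}(f)$ and $R_t$ from Eq. (17).

```python
import numpy as np

def bong_lin_step_fc(mu, Sigma, x, y, r=0.1):
    """One BONG-LIN / CM-EKF step (Eqs. 23-25) for the assumed toy model
    f(x, theta) = theta[0] * tanh(theta[1] * x), y ~ N(f, r); here h = f
    and R_t = r is a constant observation variance."""
    a, b = mu
    t = np.tanh(b * x)
    y_hat = a * t                               # Eq. (15): prediction at the mean
    H = np.array([[t, a * x * (1 - t**2)]])     # Eq. (16): Jacobian at the mean
    R = np.array([[r]])
    S = R + H @ Sigma @ H.T                     # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)          # Eq. (25): Kalman gain
    mu_new = mu + (K * (y - y_hat)).ravel()     # Eq. (23)
    Sigma_new = Sigma - K @ H @ Sigma           # Eq. (24)
    return mu_new, Sigma_new

mu, Sigma = np.array([0.5, 0.5]), np.eye(2)
for x, y in [(0.5, 0.4), (-1.0, -0.7), (2.0, 0.9)]:
    mu, Sigma = bong_lin_step_fc(mu, Sigma, x, y)
print(mu)
print(Sigma)
```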
### 4.5 Empirical Fisher

| Name | Eqs. |
|---|---|
| MC-HESS | (11), (12) |
| LIN-HESS | (21), (22) |
| MC-EF | (11), (26) |
| LIN-EF | (28), (29) |

Table 1: The 4 Hessian approximations.

The methods in Sections 4.3 and 4.4 require explicitly computing the Hessian of the loss (MC-HESS) or the Jacobian of the network (LIN-HESS). These are too expensive for large models or high-dimensional observations. Instead we can use an empirical Fisher (EF) approximation that replaces the Hessian with the outer product of the gradient (see e.g., [Martens, 2020]). For the MC-EF variant, we make the following approximation:

$$G^{MC\text{-}EF}_t = -\frac{1}{M} \hat G^{(1:M)}_t \big(\hat G^{(1:M)}_t\big)^\top \tag{26}$$

where $\hat G^{(1:M)}_t = [\hat g^{(1)}_t, \ldots, \hat g^{(M)}_t]$ is the $P \times M$ matrix of gradients from the MC samples.

We can also consider a similar approach for the LIN-EF variant, which is Jacobian-free and sampling-free. Note that if $\hat y_t$ were the true value of $\mathbb{E}[y_t|x_t]$ (i.e., if the model were correct), then we would have $\mathbb{E}[(y_t - \hat y_t)(y_t - \hat y_t)^\top] = R_t$, implying $\mathbb{E}[g^{LIN}_t (g^{LIN}_t)^\top] = -G^{LIN\text{-}HESS}_t$. This suggests using

$$g^{LIN\text{-}EF}_t = \nabla_{\theta_t}\Big[-\tfrac12 \big(y_t - \bar h_t(\theta_t)\big)^\top R_t^{-1} \big(y_t - \bar h_t(\theta_t)\big)\Big]\Big|_{\theta_t = \mu_{t|t-1}} \tag{27}$$

$$= H_t^\top R_t^{-1}\big(y_t - h_t(\mu_{t|t-1})\big) = g^{LIN}_t \tag{28}$$

$$G^{LIN\text{-}EF}_t = -g^{LIN}_t \big(g^{LIN}_t\big)^\top \tag{29}$$

where Eq. (29) is the EF approximation to Eq. (22). A more accurate EF approximation is possible by sampling virtual observations $\tilde y_t$ from $p(\cdot|f_t(\hat\theta^{(m)}_t))$ or $p(\cdot|f_t(\mu_{t|t-1}))$ and using them for the gradients in Eq. (26) or Eq. (29) (respectively) [Martens, 2020, Kunstner et al., 2020]. However, in our experiments we use the actual observations $y_t$, which is faster and follows previous work (e.g., [Khan et al., 2018b]).

### 4.6 Update rules

| Name | Loss | Update |
|---|---|---|
| BONG | E[NLL] | NGD (I = 1) |
| BOG | E[NLL] | GD (I = 1) |
| BLR | VI | NGD (I ≥ 1) |
| BBB | VI | GD (I ≥ 1) |

Table 2: The 4 update algorithms.

In addition to the four ways of approximating the expected Hessian (summarized in Table 1), we also consider four variants of BONG, based on what kind of loss we optimize and what kind of update we perform, as we describe below. See Table 2 for a summary.

BONG (Bayesian online natural gradient) performs one step of NGD on the expected log-likelihood. We set the learning rate to $\alpha_t = 1$, since this is optimal for conjugate models. The update (for an exponential variational family) is as in Eq. (5):

$$\psi_t = \psi_{t|t-1} + \nabla_{\rho_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\log p(y_t|x_t, \theta_t)] \tag{30}$$

BOG (Bayesian online gradient) performs one step of GD (instead of NGD) on the expected log-likelihood. We include a learning rate $\alpha_t$ because GD does not have the scale-invariance of NGD:

$$\psi_t = \psi_{t|t-1} + \alpha_t \nabla_{\psi_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\log p(y_t|f_t(\theta_t))] \tag{31}$$

BLR (Bayesian learning rule, [Khan and Rue, 2023]) uses NGD (like BONG) but optimizes the VI loss using multiple iterations, instead of optimizing the expected NLL with a single step. When modified to the online setting, BLR starts an inner loop at each time step with $\psi_{t,0} = \psi_{t|t-1}$ and iterates

$$\psi_{t,i} = \psi_{t,i-1} + \alpha_t F^{-1}_{\psi_{t,i-1}} \nabla_{\psi_{t,i-1}} \Big(\mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t|f_t(\theta_t))] - D_{KL}\big(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\big)\Big) \tag{32}$$

For an exponential variational family this can be written in mirror-descent form:

$$\psi_{t,i} = \psi_{t,i-1} + \alpha_t \nabla_{\rho_{t,i-1}} \Big(\mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t|f_t(\theta_t))] - D_{KL}\big(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\big)\Big) \tag{33}$$

BBB (Bayes By Backprop, [Blundell et al., 2015]) is like BLR but uses GD instead of NGD. When adapted to online learning, it starts each time step at $\psi_{t,0} = \psi_{t|t-1}$ and iterates with GD:

$$\psi_{t,i} = \psi_{t,i-1} + \alpha_t \nabla_{\psi_{t,i-1}} \Big(\mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t|f_t(\theta_t))] - D_{KL}\big(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\big)\Big) \tag{34}$$

| Update | Hessian | Full | Diag | DLR |
|---|---|---|---|---|
| BONG | MC-EF | O(MP²) [RVGA] | O(MP) | O((R+M)²P) |
| BLR | MC-EF | O(IP³) | O(IMP) [VON] | O(I(R+M)²P) [SLANG] |
| BOG | MC-EF | O(P³) | O(MP) | O(RMP) |
| BBB | MC-EF | O(IP³) | O(IMP) [BBB] | O(IR(R+M)P) |
| BONG | LIN-HESS | O(CP²) [CM-EKF] | O(C²P) [VD-EKF] | O((R+C)²P) [LO-FI] |
| BLR | LIN-HESS | O(IP³) | O(IC²P) | O(I(2R+C)²P) |
| BOG | LIN-HESS | O(P³) | O(C²P) | O(C(C+R)P) |
| BBB | LIN-HESS | O(IP³) | O(IC²P) | O(I(C+R)RP) |
| BONG | LIN-EF | O(P²) | O(P) | O(R²P) |
| BLR | LIN-EF | O(IP³) | O(IP) | O(IR²P) |
| BOG | LIN-EF | O(P³) | O(P) | O(RP) |
| BBB | LIN-EF | O(IP³) | O(IP) | O(IR²P) |

Table 3: Time complexity of the algorithms. P: params, C: observation dim, M: MC samples, I: iterations, R: DLR rank. We assume $P \gg I, C, R, M$, so we display only the terms of leading order in P. Time complexities for MC-HESS algorithms (not shown) are always at least as great as for the corresponding MC-EF. The Full (full covariance) and Diag (diagonal covariance) columns indicate natural parameters; corresponding algorithms using moment parameters have the same complexities, except BOG-FC_MOM, which is O(MP²) for MC-EF, O(CP²) for LIN-HESS, and O(P²) for LIN-EF. [Names] correspond to the following existing methods (or variants thereof) in the literature: RVGA: [Lambert et al., 2021] (explicit update version); VON: [Khan et al., 2018b] (modified for online); SLANG: [Mishkin et al., 2018] (modified for online); BBB: [Blundell et al., 2015] (modified for online and uses moment parameters); CM-EKF: [Ollivier, 2018, Tronarp et al., 2018]; VD-EKF: [Chang et al., 2022]; LO-FI: [Chang et al., 2023].
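To illustrate the difference between the BONG and BLR updates, here is a small numeric sketch (ours; the scalar conjugate Gaussian model is an assumption, chosen so that every quantity is exact). It uses the exponential-family identity $\nabla_\rho D_{KL}(q_\psi \,\|\, q_{\psi_{t|t-1}}) = \psi - \psi_{t|t-1}$, so each BLR mirror-descent step (Eq. 33) pulls $\psi$ toward $\psi_{t|t-1} + \nabla_\rho \mathbb{E}[\log p]$, which is exactly the one-step BONG answer (Eq. 30):

```python
import numpy as np

# Same conjugate toy model as before: theta ~ N(mu0, s0), y ~ N(theta, r).
mu0, s0, r, y = 0.0, 2.0, 0.5, 1.3
psi_prior = np.array([mu0 / s0, -1.0 / (2.0 * s0)])  # prior natural parameters
g_rho = np.array([y / r, -1.0 / (2.0 * r)])          # dual grad of E[log p] (constant here)

# BONG (Eq. 30): one natural-gradient step, unit learning rate, no KL term.
psi_bong = psi_prior + g_rho

# BLR (Eq. 33): iterated mirror descent on the full VI loss; each step is a
# convex combination pulling psi toward (psi_prior + g_rho).
alpha, psi = 0.3, psi_prior.copy()
for i in range(20):
    psi = psi + alpha * (g_rho - (psi - psi_prior))
    print(i, np.abs(psi - psi_bong).max())           # geometric convergence to BONG
```

With $\alpha_t = 1$, the first BLR iterate already equals BONG; smaller learning rates converge geometrically to the same fixed point, which in this conjugate example is the exact posterior.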
### 4.7 Variational families and their parameterizations

We investigate five variational families for the posterior distribution: (1) the FC Gaussian using natural parameters $\psi = (\Sigma^{-1}\mu, -\tfrac12 \Sigma^{-1})$; (2) the FC Gaussian using central moment parameters $\psi = (\mu, \Sigma)$; (3) the diagonal Gaussian using natural parameters $\psi = (\sigma^{-2}\mu, -\tfrac12 \sigma^{-2})$ (using elementwise exponents and products); (4) the diagonal Gaussian using central moment parameters $\psi = (\mu, \sigma^2)$; and (5) the DLR Gaussian with parameters $\psi = (\mu, \Upsilon, W)$ and precision $\Sigma^{-1} = \Upsilon + WW^\top$, where $\Upsilon \in \mathbb{R}^{P\times P}$ is diagonal and $W \in \mathbb{R}^{P\times R}$ with $R \ll P$. The moment parameterizations are included to test the importance of using natural parameters per Proposition 4.1. The diagonal family allows learning of large models because it scales linearly in the model size $P$. DLR also scales linearly but is more expressive than diagonal, maintaining some of the correlation information between parameters that is lost in the mean-field (diagonal) approximation [Lambert et al., 2023, Mishkin et al., 2018, Chang et al., 2023].

Optimizing the BONG objective wrt $(\mu, \Upsilon, W)$ using NGD methods is challenging because the Fisher information matrix in this parameterization cannot be efficiently inverted. Instead we first derive the update wrt the FC natural parameters (leveraging the fact that the prior precision $\Sigma^{-1}_{t|t-1}$ is DLR to make this efficient), and then use SVD to project the posterior precision back to low-rank form, following our prior LO-FI work [Chang et al., 2023]. However, if we omit the Fisher preconditioning matrix and use GD, as in BOG and BBB, we can directly optimize the objective wrt $(\mu, \Upsilon, W)$ (see Appendix E.5).

### 4.8 Overall space of methods

Crossing the four algorithms in Table 2, the four methods of approximating the Hessian in Table 1, and the five variational families yields 80 algorithms. Table 3 shows the 36 based on the three tractable Hessian approximations and the three variational families that use natural parameters. Update equations for all the algorithms are derived in Appendix E. Pseudocode is given in Appendix A.

## 5 Experiments

This section presents our primary experimental results. These are based on MNIST ($D = 784$, $N_{train} = 60$k, $N_{test} = 10$k, $C = 10$ classes) [LeCun et al., 2010]. See Appendix B for more details on these experiments, and more results on MNIST and other datasets. We focus on training on the first $T = 2000$ examples from each dataset, since our main interest is in online learning from potentially nonstationary distributions, where rapid adaptation of a model in response to a small number of new data points is critical.

Our primary evaluation objective is the negative log predictive density (NLPD) of the test set as a function of the number of training points observed so far.¹ It is defined as

$$\mathrm{NLPD}_t = -\frac{1}{N_{test}} \sum_{i \in \mathcal{D}_{test}} \log \int p\big(y_i|f(x_i, \theta_t)\big)\, q_{\psi_t}(\theta_t)\, d\theta_t$$

We approximate this integral in two main ways: (1) using Monte Carlo sampling,² or (2) using a plugin approximation, where we replace the posterior $q_{\psi_t}(\theta_t)$ with a delta function centered at the mean, $\delta(\theta_t - \mu_t)$.
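As a concrete sketch (ours; the Bernoulli logistic model and the synthetic test set are assumptions for illustration), the two NLPD approximations can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def nlpd(mu, Sigma, X, y, method="plugin", S=100):
    """NLPD of a test set under a Gaussian posterior N(mu, Sigma), assuming
    the model p(y=1|x,theta) = sigmoid(x @ theta). 'plugin' replaces the
    posterior by delta(theta - mu); 'mc' averages the predictive density
    over S posterior samples."""
    if method == "plugin":
        p1 = 1.0 / (1.0 + np.exp(-X @ mu))                   # (N,)
    else:
        thetas = rng.multivariate_normal(mu, Sigma, size=S)  # (S, P)
        p1 = (1.0 / (1.0 + np.exp(-X @ thetas.T))).mean(axis=1)
    p_obs = np.where(y == 1, p1, 1.0 - p1)
    return -np.mean(np.log(p_obs))

X = rng.normal(size=(50, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(int)              # synthetic labels
mu, Sigma = np.array([0.8, -0.8]), 0.1 * np.eye(2)
print(nlpd(mu, Sigma, X, y, method="plugin"))
print(nlpd(mu, Sigma, X, y, method="mc"))
```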
For methods that require a learning rate (i.e., all methods except BONG), we optimize it wrt mid-way or final performance on a holdout validation set, using Bayesian optimization on the NLL. All methods require specifying the prior belief state, $p(\theta_0) = N(\mu_0, \Sigma_0 = \sigma_0^2 I)$. We optimize over $\sigma_0^2$ and sample $\mu_0$ from a standard NN initializer. As Hessian approximations, we use MC-EF with $M = 100$ samples, as well as the deterministic approximations LIN-HESS and LIN-EF. (In the appendix, we also study MC-HESS but find that it works very poorly, even with $M = 1000$.)

In Fig. 1 we compare the 4 main algorithms using the DLR family with rank $R = 10$. We apply these to a CNN with two convolutional layers (each with 16 features and a (5,5) kernel), followed by two linear layers (one with 64 features and the final one with 10 features), for a total of 57,722 parameters. Shaded regions in these and all other plots indicate 1 SE, based on 5 independent trials randomly varying in the prior mean $\mu_0$, data ordering, and MC sampling. From this figure (and additional results in Appendix B) we conclude:

- Linearization helps: LIN-HESS and LIN-EF both outperform the MC variants.
- NGD helps: BONG outperforms BOG.
- Implicit regularization helps: BONG outperforms BLR.
- The LIN-HESS approximation outperforms LIN-EF, at least for BONG.
- BBB generally does poorly.
- The BONG posterior predictive (using LIN-HESS) is slightly better calibrated than BOG, and both are much better than BLR and BBB, at least for small sample sizes, as shown in Fig. 7.
- The plugin posterior predictive is similar to the Lin-MC predictive (see Footnote 2), and both are generally much better than the simple MC predictive, as shown in Fig. 5.

In Fig. 2 we compare BONG using different variational families, and conclude:

- DLR-10 outperforms DLR-1, which is similar to diagonal (except when using BONG-LIN-EF, where DLR-1 is worse than diagonal). Also, we find (in results not reported here) that rank 5-10 often gives results as good as FC, but is much cheaper.
- Both natural and moment parameterizations for the diagonal representation perform comparably to each other, although with LIN-EF, the moment parameterization can be numerically unstable.³

Finally, in Fig. 3 we report the runtimes from these experiments and conclude:

- One-step methods (BONG and BOG) are faster than iterative methods (BLR and BBB), as expected.
- Linearized methods (LIN-HESS and LIN-EF) are faster than MC methods (MC-EF).
- EF methods (LIN-EF) are a bit faster than methods that compute the Hessian exactly (LIN-HESS), especially for the diagonal family. (This speedup is larger when the output dimensionality $C$ is big.)

[Figure 1: Performance on MNIST using the Lin-MC posterior predictive, where the posterior is computed using BONG, BOG, BBB and BLR and the 3 tractable Hessian approximations with the DLR-10 variational family. Panels show misclassification rate and NLPD vs. number of training observations.]

[Figure 2: Performance on MNIST using the Lin-MC posterior predictive, where the posterior is computed using BONG with different variational families, namely diagonal (natural and moment parameters), DLR-1, and DLR-10.]

¹ We assume the training and test sets are drawn from the same static distribution. Alternatively, if there is only one stream of data coming from a potentially nonstationary source, we can use the prequential or one-step-ahead log predictive density [Gama et al., 2013]. We leave studying the nonstationary case to future work.

² That is, we compute $S = 100$ posterior samples $\theta^s_t \sim p(\theta_t|\mathcal{D}^{train}_{1:t})$ and then use $p(y|x) \approx \frac{1}{S}\sum_{s=1}^S p(y|x, \theta^s_t)$. For efficiently sampling from a DLR Gaussian we follow Mishkin et al. [2018]. Following Immer et al. [2021b], we find it better to linearize the model (i.e., replacing $h(x_i, \theta_t)$ with $\bar h(x_i, \theta_t)$ defined in Eq. (14)) before pushing posterior samples through. We call this the Lin-MC approximation.

³ This is likely due to the fact that in the moment update in Eq. (249), we add $\mathrm{diag}(G_t)$ to the variance, whereas in the natural update in Eq. (213), we subtract $\mathrm{diag}(G_t)$ from the precision; the latter is more stable, since $G_t$ is negative semi-definite for all 4 Hessian approximations.
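The Lin-MC predictive of Footnote 2 is easy to state in code. Below is a minimal sketch (ours; the one-output tanh model and the Gaussian posterior values are assumptions) that pushes posterior samples through the linearization of Eq. (14) rather than through the nonlinear model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Lin-MC predictive for an assumed toy model h(x, theta) = theta[0]*tanh(theta[1]*x):
# push posterior samples through the linearization h(mu) + H (theta - mu) of Eq. (14)
# instead of through the nonlinear h.
mu, Sigma, x = np.array([1.0, 0.5]), 0.2 * np.eye(2), 1.5
t = np.tanh(mu[1] * x)
H = np.array([t, mu[0] * x * (1.0 - t**2)])         # Jacobian of h at the mean
thetas = rng.multivariate_normal(mu, Sigma, size=1000)

y_mc = thetas[:, 0] * np.tanh(thetas[:, 1] * x)     # standard MC predictive samples
y_linmc = mu[0] * t + (thetas - mu) @ H             # Lin-MC predictive samples
print(y_mc.mean(), y_mc.std())
print(y_linmc.mean(), y_linmc.std())                # centers on the plugin prediction h(x, mu)
```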
## 6 Conclusions, limitations and future work

Our experimental results show the benefits of BONG's three main principles: NGD, implicit regularization toward the prior, and linearization. The clear winner across datasets and variational families is BONG-LIN-HESS, which embodies all three principles. BLR-LIN-HESS nearly matches its performance but is much slower. Several of the best-performing algorithms are previously known (notably CM-EKF and LO-FI), but we explain these results within a systematic theory that also offers new methods (including BLR-LIN-HESS).

BONG is motivated by Proposition 4.1, which applies only in the idealized setting of a conjugate prior. Nevertheless we find it performs well in non-conjugate settings. On the other hand, our experiments are based on relatively small models and datasets. It will be important to test how our methods scale up, especially using the promising DLR representation.

Acknowledgments and Disclosure of Funding

Thanks to Gerardo Durán-Martín and Alex Shestopaloff for helpful input. MJ was supported by NSF grant 2020-906.

## References

S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998. URL http://dx.doi.org/10.1162/089976698300017746.
Gianluca M. Bencomo, Jake C. Snell, and Thomas L. Griffiths. Implicit maximum a posteriori filtering via adaptive optimization. arXiv preprint arXiv:2311.10580, 2023.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015. URL http://arxiv.org/abs/1505.05424.

Georges Bonnet. Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. In Annales des Télécommunications, volume 19, pages 203-220. Springer, 1964.

Peter G. Chang, Kevin Patrick Murphy, and Matt Jones. On diagonal approximations to the extended Kalman filter for online training of Bayesian neural networks. In Continual Lifelong Learning Workshop at ACML 2022, December 2022. URL https://openreview.net/forum?id=asgeEt25kk.

Peter G. Chang, Gerardo Durán-Martín, Alexander Y. Shestopaloff, Matt Jones, and Kevin Murphy. Low-rank extended Kalman filtering for online learning of neural networks from streaming data. In COLLAS, May 2023. URL http://arxiv.org/abs/2305.19535.

Badr-Eddine Chérief-Abdellatif, Pierre Alquier, and Mohammad Emtiyaz Khan. A generalization bound for online variational inference. In Asian Conference on Machine Learning, pages 662-677. PMLR, 2019.

Gerardo Duran-Martin, Matias Altamirano, Alexander Y. Shestopaloff, Leandro Sánchez-Betancourt, Jeremias Knoblauch, Matt Jones, François-Xavier Briol, and Kevin Murphy. Outlier-robust Kalman filtering through generalised Bayes. arXiv preprint arXiv:2405.05646, 2024.

João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. MLJ, 90(3):317-346, March 2013. URL https://tinyurl.com/mrxfk4ww.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Dirk van der Hoeven, Tim van Erven, and Wojciech Kotłowski. The many faces of exponential weights in online learning. In Conference on Learning Theory, pages 2067-2092. PMLR, 2018.

Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18(3):1059-1076, 1989.

Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural nets via local linearization. In Arindam Banerjee and Kenji Fukumizu, editors, AISTATS, volume 130 of Proceedings of Machine Learning Research, pages 703-711. PMLR, 2021a. URL https://proceedings.mlr.press/v130/immer21a.html.

Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural nets via local linearization. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 703-711. PMLR, 2021b. URL https://proceedings.mlr.press/v130/immer21a.html.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

Mohammad Khan and Wu Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics, pages 878-887. PMLR, 2017.

Mohammad Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning, pages 2611-2620. PMLR, 2018a.

Mohammad Emtiyaz Khan and Håvard Rue. The Bayesian learning rule. J. Mach. Learn. Res., 2023. URL http://arxiv.org/abs/2107.04562.
Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In ICML, 2018b. URL http://arxiv.org/abs/1806.04854.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1-109, 2022.

Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation for natural gradient descent, 2020.

Richard Kurle, Botond Cseke, Alexej Klushyn, Patrick van der Smagt, and Stephan Günnemann. Continual learning with Bayesian neural networks for non-stationary data. In ICLR, March 2020. URL https://openreview.net/forum?id=SJlsFpVtDB.

Marc Lambert, Silvère Bonnabel, and Francis Bach. The recursive variational Gaussian approximation (R-VGA). Stat. Comput., 32(1):10, December 2021. URL https://hal.inria.fr/hal-03086627/document.

Marc Lambert, Silvère Bonnabel, and Francis Bach. The limited-memory recursive variational Gaussian approximation (L-RVGA). Statistics and Computing, 33(3):70, 2023.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, and Alireza Makhzani. Can we remove the square-root in adaptive gradient methods? A second-order perspective. arXiv preprint arXiv:2402.03496, 2024.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

Yueming Lyu and Ivor W. Tsang. Black-box optimizer with stochastic implicit natural gradient. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13-17, 2021, Proceedings, Part III 21, pages 217-232. Springer, 2021.

James Martens. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146):1-76, 2020.

Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In NIPS, pages 6245-6255. Curran Associates, Inc., 2018.

Yann Ollivier. Online natural gradient as a Kalman filter. Electron. J. Stat., 12(2):2930-2961, 2018. URL https://projecteuclid.org/euclid.ejs/1537257630.

Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69-72, 1958.

G. V. Puskorius and L. A. Feldkamp. Decoupled extended Kalman filter training of feedforward layered networks. In International Joint Conference on Neural Networks, volume 1, pages 771-777, 1991. URL http://dx.doi.org/10.1109/IJCNN.1991.155276.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Simo Sarkka and Lennart Svensson. Bayesian Filtering and Smoothing (2nd edition). Cambridge University Press, 2023.

Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Clement Bazan, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, and Thomas Möllenhoff. Variational learning is effective for large deep networks. arXiv preprint arXiv:2402.17641, 2024.

Sharad Singhal and Lance Wu. Training multilayer perceptrons with the extended Kalman algorithm. In NIPS, volume 1, 1989.
Michalis K. Titsias, Alexandre Galashov, Amal Rannen-Triki, Razvan Pascanu, Yee Whye Teh, and Jorg Bornschein. Kalman filter for online classification of non-stationary data. In ICLR, 2024.

Marcin B. Tomczak, Siddharth Swaroop, and Richard E. Turner. Efficient low rank Gaussian variational inference for neural networks. In NIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/file/310cc7ca5a76a446f85c1a0d641ba96d-Paper.pdf.

Filip Tronarp, Ángel F. García-Fernández, and Simo Särkkä. Iterative filtering and smoothing in nonlinear and non-Gaussian systems using conditional moments. IEEE Signal Process. Lett., 25(3):408-412, 2018. URL https://acris.aalto.fi/ws/portalfiles/portal/17669270/cm_parapub.pdf.

Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael W. Mahoney. ADAHESSIAN: An adaptive second order optimizer for machine learning. In AAAI, 2021. URL http://arxiv.org/abs/2006.00719.

Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278-280, 1988.

Wenxuan Zhang, Youssef Mohamed, Bernard Ghanem, Philip Torr, Adel Bibi, and Mohamed Elhoseiny. Continual learning on a diet: Learning from sparsely labeled streams under constrained computation. In ICLR, 2024. URL https://openreview.net/pdf?id=Xvfz8NHmCj.

## A Abstract pseudocode

Algorithms 1 to 7 give pseudocode for applying the methods we study. For the predict step, we assume the dynamics model has the form $\theta_t \sim N(F_t\theta_{t-1} + b_t, Q_t)$. The update step in Algorithm 3 calls one of four grad functions (Algorithms 4 to 7) that estimate the expected gradient and Hessian using either MC or linearization, combined with either the direct Hessian (or Jacobian and observation covariance) or EF. The update step also calls an inner step function that implements BONG, BLR, BOG or BBB on some variational family corresponding to the encoding of $\psi$ (not shown).

In practice the grad-fn and inner-step-fn are not as cleanly separated, because the full matrix $G_{t,i}$ is not passed between them except when the variational family is FC. When the family is diagonal, grad-fn only needs to return $\mathrm{diag}(G_{t,i})$. When grad-fn uses EF, it only needs to return $\hat G^{(1:M)}_{t,i}$ (grad-MC-EF) or $g^{LIN}_{t,i}$ (grad-LIN-EF), and inner-step-fn will implicitly use the outer product of this output as $G_{t,i}$. Finally, note the expressions for $g^{LIN}_{t,i}$ in Algorithms 6 and 7 are equivalent ways of computing the same quantity, as explained after Eq. (28).

    for t = 1 : ∞ do
        ψ_{t|t-1} = predict(ψ_{t-1})
        ψ_t = update(ψ_{t|t-1}, x_t, y_t)
    end

Algorithm 1: Main loop.

    def predict(ψ_{t-1} = (µ_{t-1}, Σ_{t-1})):
        µ_{t|t-1} = F_t µ_{t-1} + b_t
        Σ_{t|t-1} = F_t Σ_{t-1} F_t^T + Q_t
        return ψ_{t|t-1} = (µ_{t|t-1}, Σ_{t|t-1})

Algorithm 2: Predict step.

    def update(ψ_{t|t-1}, x_t, y_t, f(), A(), grad-fn, inner-step-fn, α, I, M):
        ψ_{t,0} = ψ_{t|t-1}
        f_t(θ_t) = f(x_t, θ_t)
        h_t(θ_t) = E[y_t | x_t, θ_t] = ∇_η A(η)|_{η = f_t(θ_t)}
        V_t(θ_t) = Cov[y_t | x_t, θ_t] = ∇²_η A(η)|_{η = f_t(θ_t)}
        ℓ_t(θ_t) = log p(y_t | x_t, θ_t) = log p(y_t | f_t(θ_t))
        for i = 1 : I do
            (g_{t,i}, G_{t,i}) = grad-fn(ψ_{t,i-1}, ℓ_t, h_t, V_t, M)
            ψ_{t,i} = inner-step-fn(ψ_{t|t-1}, ψ_{t,i-1}, g_{t,i}, G_{t,i}, α)
        end
        return ψ_{t,I}

Algorithm 3: Update step. The inner-step-fn is BONG, BLR, BOG or BBB (not shown).

    def grad-MC-HESS(ψ_{t,i-1}, ℓ_t, h_t = [], V_t = [], M):
        for m = 1 : M do
            θ̂^(m)_{t,i} ∼ q_{ψ_{t,i-1}}(θ)
            ĝ^(m)_{t,i} = ∇_θ ℓ_t(θ)|_{θ = θ̂^(m)_{t,i}}
            Ĝ^(m)_{t,i} = ∇²_θ ℓ_t(θ)|_{θ = θ̂^(m)_{t,i}}
        end
        g^MC_{t,i} = (1/M) Σ_{m=1}^M ĝ^(m)_{t,i}
        G^MC-HESS_{t,i} = (1/M) Σ_{m=1}^M Ĝ^(m)_{t,i}
        return (g^MC_{t,i}, G^MC-HESS_{t,i})

Algorithm 4: MC gradient/Hessian estimator.

    def grad-MC-EF(ψ_{t,i-1}, ℓ_t, h_t = [], V_t = [], M):
        for m = 1 : M do
            θ̂^(m)_{t,i} ∼ q_{ψ_{t,i-1}}(θ)
            ĝ^(m)_{t,i} = ∇_θ ℓ_t(θ)|_{θ = θ̂^(m)_{t,i}}
        end
        g^MC_{t,i} = (1/M) Σ_{m=1}^M ĝ^(m)_{t,i}
        Ĝ^(1:M)_{t,i} = [ĝ^(1)_{t,i}, ..., ĝ^(M)_{t,i}]
        G^MC-EF_{t,i} = -(1/M) Ĝ^(1:M)_{t,i} (Ĝ^(1:M)_{t,i})^T
        return (g^MC_{t,i}, G^MC-EF_{t,i})

Algorithm 5: MC gradient/Hessian estimator with empirical Fisher.

    def grad-LIN-HESS(ψ_{t,i-1}, ℓ_t = [], h_t, V_t, M = []):
        µ_{t,i-1} = E[θ_t | ψ_{t,i-1}]
        ŷ_{t,i} = h_t(µ_{t,i-1})
        H_{t,i} = ∂h_t/∂θ_t |_{θ = µ_{t,i-1}}
        R_{t,i} = V_t(µ_{t,i-1})
        g^LIN_{t,i} = H_{t,i}^T R_{t,i}^{-1} (y_t - ŷ_{t,i})
        G^LIN-HESS_{t,i} = -H_{t,i}^T R_{t,i}^{-1} H_{t,i}
        return (g^LIN_{t,i}, G^LIN-HESS_{t,i})

Algorithm 6: Linearized gradient/Hessian estimator.

    def grad-LIN-EF(ψ_{t,i-1}, ℓ_t = [], h_t, V_t, M = []):
        µ_{t,i-1} = E[θ_t | ψ_{t,i-1}]
        R_{t,i} = V_t(µ_{t,i-1})
        g^LIN_{t,i} = ∇_θ [ -½ (y_t - h_t(θ))^T R_{t,i}^{-1} (y_t - h_t(θ)) ]|_{θ = µ_{t,i-1}}
        G^LIN-EF_{t,i} = -g^LIN_{t,i} (g^LIN_{t,i})^T
        return (g^LIN_{t,i}, G^LIN-EF_{t,i})

Algorithm 7: Linearized gradient/Hessian estimator with empirical Fisher.
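As a runnable companion to Algorithms 1-3 and 6, the following minimal NumPy sketch (ours; the three-parameter regression model, the noise level, and the data generator are all assumptions) runs the full online loop with static dynamics ($F_t = I$, $b_t = 0$, $Q_t = 0$), grad-LIN-HESS, and the BONG full-covariance inner step of Eqs. (9)-(10):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, theta):                        # assumed toy model (not the paper's NN)
    return theta[0] * np.tanh(theta[1] * x) + theta[2]

def jac_f(x, theta):                    # analytic Jacobian of f wrt theta, shape (1, 3)
    t = np.tanh(theta[1] * x)
    return np.array([[t, theta[0] * x * (1.0 - t**2), 1.0]])

def grad_lin_hess(mu, x, y, r):         # Algorithm 6 with h = f and R = r
    H = jac_f(x, mu)
    g = (H.T * ((y - f(x, mu)) / r)).ravel()
    G = -(H.T @ H) / r
    return g, G

def bong_step_fc(mu, Sigma, g, G):      # inner step, Eqs. (9)-(10)
    Sigma_new = np.linalg.inv(np.linalg.inv(Sigma) - G)
    return mu + Sigma_new @ g, Sigma_new

mu, Sigma, r = 0.1 * np.ones(3), np.eye(3), 0.05
for _ in range(500):                    # Algorithm 1 with static dynamics (gamma = 1)
    x = rng.uniform(-2.0, 2.0)
    y = 0.8 * np.tanh(1.5 * x) + 0.2 + rng.normal(scale=np.sqrt(r))
    g, G = grad_lin_hess(mu, x, y, r)
    mu, Sigma = bong_step_fc(mu, Sigma, g, G)
print(mu)   # approaches the generating parameters (0.8, 1.5, 0.2), up to tanh sign symmetry
```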
## B Additional experiment results

In this section, we give a more thorough set of experimental results.

### B.1 Running time measures

The running times of the methods for the experiments in Figs. 1 and 2, where we fit a CNN to MNIST, are shown in Fig. 3. The running times of the methods for the FC and DLR case, where we fit an MLP to a synthetic regression dataset, are shown in Fig. 4. The slower speed of BLR (even with $I = 1$) relative to BONG is at least partly attributable to the fact that BLR must compute the SVD of a larger matrix (see Appendices E.5.3 and E.5.4).

[Figure 3: Runtimes for methods on MNIST. Left: corresponding to Fig. 1, using different algorithms with the DLR-10 family. Right: corresponding to Fig. 2, using BONG on different variational families.]

[Figure 4: Running time (seconds) vs. number of parameters P (size of state space) on a synthetic regression problem. For BBB and BLR, we show results using I = 1 and I = 10 iterations per step. Hessian approximations are denoted as follows: EF0-Lin0 = MC-Hess, EF1-Lin0 = EF-Hess, EF0-Lin1 = Lin-Hess. (a) Full covariance representation. (b) DLR representation. The BLR plot is truncated due to an out-of-memory problem.]

### B.2 Detailed results for CNN on MNIST

Here we report further metrics for the experiments in Figs. 1 and 2.
We show 3 approximations to the NLPD: plugin, MC, and linearized MC.⁴ For each of these approximations to the posterior predictive, we also measure the corresponding misclassification rate, based on picking the most probable predicted class. Results are shown in Figs. 5 and 6. We see that the plugin and lin-MC approximations are similar, and both are generally much better than standard MC.

Finally, in Fig. 7 and Fig. 8, we report the test-set expected calibration error (ECE) at time steps [250, 500, 1000, 2000], computed using 20 bins. Note that among the LIN-HESS variants, the BONG-DLR-10 method is the most well-calibrated (in addition to exhibiting the strongest plugin and linearized-MC NLPD results), when compared to other DLR-10 methods as well as other BONG variants.

⁴ Lin-MC is defined in Footnote 2. The motivation for this approximation (from Immer et al. [2021b]) is the following: if we push posterior samples through a nonlinear predictive model, the results can be poor if the samples are far from the mean, but if we linearize the predictive model, extrapolations away from the mean are more sensible. This is true even if the posterior was not explicitly computed using a linear approximation.
[Figure 5: MNIST results for methods using the DLR family. Left column shows misclassification rate, right column shows NLL. First row uses the plugin approximation to the posterior predictive, second row uses the linearized MC approximation, and third row uses the standard MC approximation.]
Figure 6: MNIST results for BONG variants. Left column shows misclassification rate, right column shows NLL. First row uses the plugin approximation to the posterior predictive, second row uses the linearized MC approximation, and third row uses the standard MC approximation.

[Figure 7 (bar charts): ECE per model variant at the selected time steps.]

Figure 7: MNIST expected calibration error (ECE) results at selected time steps for methods using the DLR family.

[Figure 8 (bar charts): ECE per BONG variant at the selected time steps.]

Figure 8: MNIST expected calibration error (ECE) results at selected time steps for BONG variants.

We see that BONG and BOG are both well calibrated, at least when combined with LIN-HESS, with BONG having a slight edge, especially for small sample sizes, where there is more posterior uncertainty.

B.3 SARCOS dataset

In addition to MNIST, we report experiments on the SARCOS regression dataset (D = 22, Ntrain = 44,484, Ntest = 4,449, C = 1). This dataset derives from an inverse dynamics problem for a seven degree-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Following Rasmussen and Williams [2006], we pick a single target output dimension, so C = 1. The data is from https://gaussianprocess.org/gpml/data/. We use a small MLP of size 21-20-20-1, so there are P = 881 parameters. For optimizing learning rates for SARCOS, we use grid search on NLPD-PI. We fix the variance of the prior belief state to $\sigma_0^2 = 1.0$, which represents a mild degree of regularization.⁵ We fix the observation variance to $R_t = 0.1\hat R$, where $\hat R = \mathrm{Var}(y_{1:T})$ is the maximum-likelihood estimate based on the training sequence; we can think of this as a simple empirical Bayes approximation, and the factor of 0.1 accounts for the fact that the variance of the residuals from the learned model will be smaller than that of the unconditional baseline. We focus on the DLR approximation of rank 10. This gives similar results to full covariance, but is much faster to compute. We also focus on the plugin approximation to the NLPD, since the MC approximation gives much worse results (not shown).

B.3.1 Comparison of BONG, BLR, BBB and BOG

In Fig. 9 we show the results of using the LIN-HESS approximation. For 1 iteration per step, we see that BONG and BLR are indistinguishable in performance, and BBB and BOG are also indistinguishable, but much worse. For 10 iterations per step, we see that BBB improves significantly and approaches BONG and BLR. However, BLR and BBB are now about 10 times slower. (In practice, the slowdown is less than 10x, due to constant factors of the implementation.)
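The empirical Bayes setup just described is simple to implement. The following is a minimal sketch (our own illustration with hypothetical names, not code from the released implementation; `y_train` is assumed to hold the scalar training targets):

```python
import numpy as np

def init_sarcos_beliefs(y_train, num_params, sigma0_sq=1.0, shrink=0.1):
    """Prior and observation-noise setup described in Appendix B.3.

    The prior belief state is N(mu_0, sigma0_sq * I); the observation
    variance is fixed to R_t = shrink * Var(y_{1:T}), a crude empirical
    Bayes estimate shrunk by 0.1 because model residuals are smaller
    than the unconditional output variance.
    """
    R_hat = np.var(y_train)            # MLE of unconditional output variance
    R_t = shrink * R_hat
    prior_var = sigma0_sq * np.ones(num_params)
    return prior_var, R_t
```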
(Note that BONG and BOG always use a single iteration, so their performance does not change.) In Fig. 10 we show the results of using the MC-EF approximation with MC = 100 samples. The trends are similar to the LIN-HESS case. In particular, for I = 1, BONG and BLR are similar, with BONG having a slight edge; and for I = 10, BBB catches up with both BONG and BLR, with BOG always in last place. Finally, we see that the performance of MC-EF is slightly worse than LIN-HESS when I = 1, but it catches up when I = 10. However, in larger-scale experiments, we usually find that LIN-HESS is significantly better than MC-EF, even with I = 10.

[Figure 9 (plots): NLPD-PI vs. number of training observations on SARCOS (MLP 21-20-20-1, P = 881) for BBB, BLR, BOG, and BONG with DLR-10 and LIN-HESS; legend entries report runtime (s) and final NLPD-PI at T = 2000. Panel (a): 1 iteration. Panel (b): 10 iterations.]

Figure 9: Predictive performance on SARCOS using MLP 21-20-20-1 with DLR rank 10. Error bars represent 1 standard deviation computed from 3 random trials, randomizing over data order and initial state $\mu_0$. (a) All 4 algorithms combined with the LIN-HESS approximation and I = 1. (b) Same as (a) but with I = 10.

B.3.2 Learning rate sensitivity

In Fig. 11 we show the test-set performance of BLR (with the LIN-HESS approximation) for 5 different learning rates (namely $5\times10^{-3}$, $1\times10^{-2}$, $5\times10^{-2}$, $1\times10^{-1}$, and $5\times10^{-1}$). When using 1 iteration per step, the best learning rate is $\alpha = 0.5$, which is also the value chosen based on validation-set performance. With this value, BLR matches BONG. For other learning rates, BLR performance is much worse. When using 10 iterations per step, there are several learning rates, all of which give performance as good as BONG.

⁵This value was based on a small amount of trial and error. Using a smaller value of $\sigma_0$ results in underfitting relative to a linear least-squares baseline, and using a much larger value results in unstable posterior covariances, causing the NLPD-MC samples to produce NaNs after a few hundred steps.

[Figure 10 (plots): same layout as Fig. 9 but with the MC-EF approximation (MC = 100); legend entries report runtime (s) and final NLPD-PI at T = 2000. Panel (a): 1 iteration. Panel (b): 10 iterations.]

Figure 10: Same as Fig. 9 except we use the MC-EF approximation with MC = 100.
[Figure 11 (plots): SARCOS NLPD-PI vs. number of training observations for BLR with LIN-HESS at learning rates 0.005, 0.01, 0.05, 0.1, and 0.5, with BONG as a baseline; legend entries report runtime (s) and final NLPD-PI at T = 2000. Panel (a): 1 iteration. Panel (b): 10 iterations.]

Figure 11: Same setup as Fig. 9, except now we plot performance of BLR for 5 different learning rates. We also show BONG as a baseline, which uses a fixed learning rate of 1.0.

[Figure 12 (plots): same layout for BBB with LIN-HESS at learning rates 0.005 to 0.5, with BONG as a baseline; several learning rates diverge (infinite or exploding NLPD-PI). Panel (a): 1 iteration. Panel (b): 10 iterations.]

Figure 12: Same setup as Fig. 9, except now we plot performance of BBB for 5 different learning rates. We also show BONG as a baseline.

In Fig. 12, we show the analogous plot for BBB. When using 1 iteration per step, all learning rates result in poor performance, with many resulting in NaNs. When using 10 iterations per step, there are some learning rates that enable BBB to get close to (but still not match) the performance of BONG.

[Figure 13a (plot): BOG with LIN-HESS at learning rates 0.005 to 0.5, with BONG as a baseline.]
[Figure 13b (plot): BOG with MC-EF at learning rates 0.005 to 0.5, with BONG-MC-EF as a baseline.]

Figure 13: Performance of BOG for 5 different learning rates, with BONG as a baseline. (a) LIN-HESS approximation. (b) MC-EF approximation.

Finally, in Fig. 13a we show the analogous plot for BOG with LIN-HESS, and in Fig. 13b with MC-EF, where results are much worse. Overall we conclude that all of the methods except BONG are quite sensitive to the learning rate. In our experiments, we pick a value based on performance on a validation set, but in the truly online setting, where there is just a single data stream, picking an optimal learning rate is difficult; avoiding this tuning is an additional advantage of BONG.

C Proof of Propositions 4.1 and 4.2

Proposition 4.1. To ease notation we write the natural parameters of the prior as $\psi_{t|t-1} = [\chi_{t|t-1}; \nu_{t|t-1}]$, which can be interpreted as the prior sufficient statistics and prior sample size. Note that $x_t$ can be omitted as a constant. Based on the prior $q_{\psi_{t|t-1}}(\theta_t)$ the exact posterior is

$$p(\theta_t \mid \mathcal{D}_t) \propto q_{\psi_{t|t-1}}(\theta_t)\, p_t(y_t \mid \theta_t) \tag{35}$$
$$\propto \exp\!\left(\chi_{t|t-1}^\top \theta_t - \nu_{t|t-1} A(\theta_t)\right) \exp\!\left(\theta_t^\top y_t - A(\theta_t)\right) \tag{36}$$
$$\propto q_{\psi_t}(\theta_t) \tag{37}$$
$$\psi_t = \begin{bmatrix} \chi_{t|t-1} + y_t \\ \nu_{t|t-1} + 1 \end{bmatrix} \tag{38}$$

For BONG, we first note that the dual parameter is given by

$$\rho_{t|t-1} = \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[T(\theta_t)] \tag{39}$$
$$= \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\begin{bmatrix} \theta_t \\ -A(\theta_t) \end{bmatrix} \tag{40}$$

Therefore the natural gradient in Eq. (5) is

$$\nabla_{\rho_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}[\log p(y_t \mid \theta_t)] = \nabla_{\rho_{t|t-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\left[\theta_t^\top y_t - A(\theta_t)\right] \tag{41}$$
$$= \nabla_{\rho_{t|t-1}}\, \rho_{t|t-1}^\top \begin{bmatrix} y_t \\ 1 \end{bmatrix} \tag{42}$$
$$= \begin{bmatrix} y_t \\ 1 \end{bmatrix} \tag{43}$$

Therefore the BONG update yields $\psi_t = [\chi_{t|t-1} + y_t;\ \nu_{t|t-1} + 1]$, in agreement with Eq. (38).
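To make the conjugate case concrete, here is a self-contained sketch (our own illustration, not code from the paper) of the BONG update for a Bernoulli likelihood with its conjugate prior $q(\theta) \propto \exp(\chi\theta - \nu A(\theta))$, $A(\theta) = \log(1+e^\theta)$, which induces a Beta$(\chi, \nu-\chi)$ distribution on the success probability. The unit-step natural-gradient update $\chi \leftarrow \chi + y_t$, $\nu \leftarrow \nu + 1$ reproduces exact Bayesian updating:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=100)   # Bernoulli data stream

# Prior natural parameters [chi; nu]: Beta(1, 1) on the success probability.
chi, nu = 1.0, 2.0

for y_t in y:
    # BONG step: by Proposition 4.1 the natural gradient of the expected
    # log-likelihood wrt the dual parameters is [y_t; 1], so one unit-step
    # NGD move starting at the prior just adds the sufficient statistics.
    chi += y_t
    nu += 1.0

a, b = chi, nu - chi                  # induced Beta(a, b) posterior
assert np.isclose(a, 1 + y.sum()) and np.isclose(b, 1 + len(y) - y.sum())
print(f"posterior Beta({a:.0f}, {b:.0f})")  # matches the exact conjugate update
```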
Proposition 4.2. The intuition behind this proof is as follows. For the mean update in Eq. (9), $\nabla_{\theta_t}\ell_t$ is linear in $\theta_t$, so the expectation equals the value at the mean. For the covariance update in Eq. (10), $\nabla^2_{\theta_t}\ell_t$ is independent of $\theta_t$, so we can drop the expectation operator. The tricky part is why we need different linearizations for the Hessians to agree. It has to do with making the Hessian of the NN disappear (as in GGN). In the Gaussian approximation this happens when the predicted mean ($\hat y_t = \bar h_t(\theta_t)$) is linear in $\theta_t$. In the plugin approximation it happens when the outcome-dependent part of the log-likelihood ($\bar f_t(\theta_t)^\top y_t$) is linear in $\theta_t$. In the latter case the only nonlinear term remaining in the log-likelihood is the log-partition $A$, and the two methods end up agreeing because of the property that the Hessian of the log-partition equals the conditional variance $R_t$.

Formally, under the linear($h$)-Gaussian approximation in Eqs. (14) and (17) the expected gradient and Hessian can be calculated directly:

$$\mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\left[\nabla_{\theta_t} \log p^{\rm LG}_t(y_t \mid \theta_t)\right] = \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\left[H_t^\top R_t^{-1}\left(y_t - \hat y_t - H_t(\theta_t - \mu_{t|t-1})\right)\right] \tag{44}$$
$$= H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{45}$$
$$\mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\left[\nabla^2_{\theta_t} \log p^{\rm LG}_t(y_t \mid \theta_t)\right] = \mathbb{E}_{\theta_t \sim q_{\psi_{t|t-1}}}\!\left[-H_t^\top R_t^{-1} H_t\right] \tag{46}$$
$$= -H_t^\top R_t^{-1} H_t \tag{47}$$

For the linear($f$)-delta approximation we use the properties of exponential families that (1) the gradient of the log-partition $A$ with respect to the natural parameter $f_t(\theta_t)$ equals the expectation parameter $h_t(\theta_t)$, (2) the Hessian of the log-partition with respect to the natural parameter equals the conditional variance, and consequently (3) the Jacobian of the expectation parameter with respect to the natural parameter equals the conditional variance $R_t$:

$$\nabla_\eta A(\eta)\big|_{\eta=f_t(\mu_{t|t-1})} = h_t(\mu_{t|t-1}) = \hat y_t \tag{48}$$
$$\nabla^2_\eta A(\eta)\big|_{\eta=f_t(\mu_{t|t-1})} = \mathbb{V}\!\left[y_t \mid \theta_t = \mu_{t|t-1}\right] = R_t \tag{49}$$
$$\frac{\partial h_t(\theta_t)}{\partial f_t(\theta_t)}\bigg|_{\theta_t=\mu_{t|t-1}} = R_t \tag{50}$$

The last of these implies $F_t = R_t^{-1} H_t$, where $F_t$ and $H_t$ denote the Jacobians of $f_t$ and $h_t$ at $\mu_{t|t-1}$. Therefore the expected gradient and Hessian can be calculated as

$$\mathbb{E}_{\theta_t \sim \delta_{\mu_{t|t-1}}}\!\left[\nabla_{\theta_t} \log p^{\rm LD}_t(y_t \mid \theta_t)\right] = \nabla_{\theta_t} \log p^{\rm LD}_t(y_t \mid \theta_t)\big|_{\theta_t=\mu_{t|t-1}} \tag{51}$$
$$= F_t^\top y_t - F_t^\top \nabla_\eta A(\eta)\big|_{\eta=f_t(\mu_{t|t-1})} \tag{52}$$
$$= F_t^\top (y_t - \hat y_t) \tag{53}$$
$$= H_t^\top R_t^{-1} (y_t - \hat y_t) \tag{54}$$
$$\mathbb{E}_{\theta_t \sim \delta_{\mu_{t|t-1}}}\!\left[\nabla^2_{\theta_t} \log p^{\rm LD}_t(y_t \mid \theta_t)\right] = \nabla^2_{\theta_t} \log p^{\rm LD}_t(y_t \mid \theta_t)\big|_{\theta_t=\mu_{t|t-1}} \tag{55}$$
$$= -F_t^\top \nabla^2_\eta A(\eta)\big|_{\eta=f_t(\mu_{t|t-1})}\, F_t \tag{56}$$
$$= -F_t^\top R_t F_t \tag{57}$$
$$= -H_t^\top R_t^{-1} H_t \tag{58}$$

D Mirror descent formulation

In this section we give a more detailed derivation of BONG as mirror descent (MD) and use this to give two alternative interpretations of how BONG approximates exact VB: (1) by approximating the expected NLL as linear in the expectation parameter $\rho$, or (2) by replacing an implicit update with an explicit one. Assume the variational family is an exponential one as introduced at the end of Section 3, with natural and dual parameters $\psi$ and $\rho$, sufficient statistics $T(\theta)$, and log-partition $\Phi(\psi)$:

$$q_\psi(\theta) = \exp\!\left(\psi^\top T(\theta) - \Phi(\psi)\right) \tag{59}$$
$$\rho = \mathbb{E}_{\theta \sim q_\psi}[T(\theta)] \tag{60}$$

We first review how NGD on an exponential family is a special case of mirror descent [Khan and Rue, 2023, Martens, 2020]. The mirror map is the gradient of the log-partition, which satisfies the thermodynamic identity

$$\rho = \nabla \Phi(\psi) \tag{61}$$

This is a bijection when $\Phi$ is convex (which includes the cases we study), so we can implicitly treat $\psi$ and $\rho$ as functions of each other. Given a loss function $L(\psi)$, MD iteratively solves the local optimization problem

$$\psi_{i+1} = \arg\min_\psi\ \left\langle \nabla_{\rho_i} L(\psi_i),\, \rho \right\rangle + \frac{1}{\alpha} D_\Phi(\psi_i, \psi) \tag{62}$$

The first term is a linear (in $\rho$) approximation of $L$ about the previous iterate $\psi_i$ and the second term is the Bregman divergence

$$D_\Phi(\psi_i, \psi_{i+1}) = \Phi(\psi_i) - \Phi(\psi_{i+1}) - (\psi_i - \psi_{i+1})^\top \rho_{i+1} \tag{63}$$

The Bregman divergence acts as a regularizer toward $\psi_i$ and captures the intrinsic geometry of the parameter space because of its equivalence with the (reverse) KL divergence:

$$D_{\rm KL}\!\left(q_{\psi_{i+1}} \,\|\, q_{\psi_i}\right) = \mathbb{E}_{\theta \sim q_{\psi_{i+1}}}\!\left[(\psi_{i+1} - \psi_i)^\top T(\theta) + \Phi(\psi_i) - \Phi(\psi_{i+1})\right] \tag{64}$$
$$= D_\Phi(\psi_i, \psi_{i+1}) \tag{65}$$

Importantly, this recursive regularizer is not part of the loss; it serves only to define an iterative algorithm that converges to a local minimum of $L$. Solving Eq. (62) by differentiating with respect to $\rho$ yields the MD update

$$\psi_{i+1} = \psi_i - \alpha \nabla_{\rho_i} L(\psi_i) \tag{66}$$

Because the Fisher matrix for an exponential family is $F_\psi = \partial\rho/\partial\psi$, this is equivalent to NGD with respect to $\psi$. Khan and Rue [2023] offer this as a derivation of the BLR, when $L(\psi)$ is taken to be the variational loss from Eq. (1). By applying this analysis to the online setting, our approach can be seen to follow from two insights. First, the online variational loss in Eq. (2) already includes the KL divergence from the previous step, so we do not need the artificial regularizer in Eq. (62).
That is, if we start from the online variational problem in Eq. (4) and define $L_t(\psi)$ as the expected NLL,

$$L_t(\psi) = -\mathbb{E}_{\theta_t \sim q_\psi}[\log p(y_t \mid f_t(\theta_t))] \tag{67}$$

then replacing $L_t(\psi)$ with a linear approximation based at $\psi_{t|t-1}$ and applying Eq. (65) leads to

$$\psi_t = \arg\min_\psi\ \left\langle \nabla_{\rho_{t|t-1}} L_t(\psi_{t|t-1}),\, \rho \right\rangle + D_\Phi(\psi_{t|t-1}, \psi) \tag{68}$$

By comparing to Eq. (62) we see this defines an MD algorithm with unit learning rate that works in a single step rather than by iterating. Paralleling the derivation of Eq. (66) from Eq. (62), we get

$$\psi_t = \psi_{t|t-1} - \nabla_{\rho_{t|t-1}} L_t(\psi_{t|t-1}) \tag{69}$$

which matches the BONG update in Eq. (5). Thus BONG can be seen as an approximate solution of the online variational problem in Eq. (4), based on linearizing the expected NLL with respect to $\rho$. (Note this is different from the assumption underlying BONG-LIN that $f_t(\theta_t)$ or $h_t(\theta_t)$ is linear in $\theta_t$.) Second, in the conjugate case this linearity assumption is true: $L_t$ is linear in $\rho$ (see the proof of Proposition 4.1). Therefore Eq. (68) is equivalent to solving Eq. (4) exactly:

$$\psi_t = \arg\min_\psi\ L_t(\psi) + D_\Phi(\psi_{t|t-1}, \psi) \tag{70}$$

This recapitulates Proposition 4.1: BONG is Bayes optimal in the conjugate case. In general the exact solution to Eq. (70) is

$$\psi_t = \psi_{t|t-1} - \nabla_{\rho_t} L_t(\psi_t) \tag{71}$$

This is an implicit update because the gradient is evaluated at the (unknown) posterior, whereas Eq. (69) is an explicit update because it evaluates the gradient at the prior. (In the Gaussian case these can be shown to match the implicit and explicit RVGA updates of [Lambert et al., 2021].) Therefore BONG can also be interpreted as an approximation of exact VB that replaces the implicit update, Eq. (71), with the explicit update, Eq. (69).

E Derivations

This section derives the update equations for all 80 algorithms we investigate (Table 3 plus the MC-HESS and LIN-EF variants). In Appendix E.6 we also translate the BLR algorithms from our online setting back to the batch setting used in Khan and Rue [2023]. For an exponential variational family with natural parameters $\psi$ and dual parameters $\rho$, we can derive all 16 methods (BONG, BLR, BOG, BBB under all four Hessian approximations) from four quantities:

$$\nabla_{\rho_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] \tag{72}$$
$$\nabla_{\rho_{t,i-1}} D_{\rm KL}\!\left(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\right) \tag{73}$$
$$\nabla_{\psi_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] \tag{74}$$
$$\nabla_{\psi_{t,i-1}} D_{\rm KL}\!\left(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\right) \tag{75}$$

The NGD methods (BONG and BLR) use gradients with respect to $\rho_{t,i-1}$ while the GD methods (BOG and BBB) use gradients with respect to $\psi_{t,i-1}$. For BONG and BOG the $D_{\rm KL}$ term is not relevant, and there is no inner loop, so $\psi_{t,i-1} = \psi_{t|t-1}$, $g_{t,i} = g_t$, and $G_{t,i} = G_t$. When $\psi$ is not the natural parameter of an exponential family we must explicitly compute the inverse-Fisher preconditioner for the NGD methods. In that case the updates can be derived from these three quantities:

$$F_{\psi_{t,i-1}} \tag{76}$$
$$\nabla_{\psi_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] \tag{77}$$
$$\nabla_{\psi_{t,i-1}} D_{\rm KL}\!\left(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\right) \tag{78}$$

We will frequently use Bonnet's and Price's theorems [Bonnet, 1964, Price, 1958]:

$$\nabla_{\mu_{t,i-1}} \mathbb{E}_{\mathcal{N}(\mu_{t,i-1},\Sigma_{t,i-1})}[\log p(y_t \mid f_t(\theta_t))] = \mathbb{E}_{\mathcal{N}(\mu_{t,i-1},\Sigma_{t,i-1})}\!\left[\nabla_{\theta_t} \log p(y_t \mid f_t(\theta_t))\right] \tag{79}$$
$$= g_{t,i} \tag{80}$$
$$\nabla_{\Sigma_{t,i-1}} \mathbb{E}_{\mathcal{N}(\mu_{t,i-1},\Sigma_{t,i-1})}[\log p(y_t \mid f_t(\theta_t))] = \frac{1}{2}\,\mathbb{E}_{\mathcal{N}(\mu_{t,i-1},\Sigma_{t,i-1})}\!\left[\nabla^2_{\theta_t} \log p(y_t \mid f_t(\theta_t))\right] \tag{81}$$
$$= \frac{1}{2} G_{t,i} \tag{82}$$

For diagonal Gaussians with covariance $\mathrm{Diag}(\sigma^2)$, Price's theorem also implies⁶

$$\nabla_{\sigma^2_{t,i-1}} \mathbb{E}_{\theta_t \sim \mathcal{N}(\mu_{t,i-1},\Sigma_{t,i-1})}[\log p(y_t \mid f_t(\theta_t))] = \frac{1}{2}\,\mathrm{diag}(G_{t,i}) \tag{83}$$

Update equations for the MC-HESS, MC-EF and LIN-EF methods are displayed together in the subsections that follow, because for the most part they differ only in the choice of $G^{\rm MC\text{-}HESS}_t$, $G^{\rm MC\text{-}EF}_t$ or $G^{\rm LIN\text{-}EF}_t$ to approximate $G_t$, and $g^{\rm MC}_t$ or $g^{\rm LIN}_t$ to approximate $g_t$.
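As a quick numerical illustration of Eqs. (79) to (82) (our own sketch, not from the paper, using a toy quadratic log-likelihood for which the gradient, Hessian, and expectations are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
P = 3
c = np.array([1.0, -2.0, 0.5])
mu = np.zeros(P)
L = 0.7 * np.eye(P)                     # Sigma = L @ L.T

# Toy log-likelihood: log p(y|theta) = -0.5 * ||theta - c||^2,
# so grad = -(theta - c) and the Hessian is the constant -I.
theta = mu + rng.standard_normal((200_000, P)) @ L.T
g_mc = -(theta - c).mean(axis=0)        # Bonnet: E[grad] = grad_mu E[log p]
g_exact = -(mu - c)                     # closed form for the quadratic
assert np.allclose(g_mc, g_exact, atol=1e-2)

# Price: grad_Sigma E[log p] = 0.5 * E[Hessian] = -0.5 * I here, since the
# Hessian is constant; for a NN log-likelihood both sides need MC estimates.
print("Bonnet check error:", np.abs(g_mc - g_exact).max())
```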
We note cases where decomposing $G^{\rm MC\text{-}EF}_t = -\frac{1}{M}\hat G^{(1:M)}_t \hat G^{(1:M)\top}_t$ or $G^{\rm LIN\text{-}EF}_t = -g^{\rm LIN}_t (g^{\rm LIN}_t)^\top$ allows a more efficient update. We derive updates for BONG-LIN-HESS and BOG-LIN-HESS from the corresponding BONG-MC-HESS and BOG-MC-HESS updates using Proposition 4.2, which entails substituting

$$g_t \to H_t^\top R_t^{-1} (y_t - \hat y_t) \tag{84}$$
$$G_t \to -H_t^\top R_t^{-1} H_t \tag{85}$$

For the algorithms with inner loops (BLR and BBB) we adapt the notation of Section 4.4 as follows:

$$\hat y_{t,i} = h_t(\mu_{t,i-1}) \tag{86}$$
$$H_{t,i} = \frac{\partial h_t(\theta_t)}{\partial \theta_t}\bigg|_{\theta_t=\mu_{t,i-1}} \tag{87}$$
$$R_{t,i} = \mathbb{V}[y_t \mid \theta_t = \mu_{t,i-1}] \tag{88}$$
$$g_{t,i} = H_{t,i}^\top R_{t,i}^{-1} (y_t - \hat y_{t,i}) \tag{89}$$
$$G_{t,i} = -H_{t,i}^\top R_{t,i}^{-1} H_{t,i} \tag{90}$$

⁶We use $\mathrm{diag}(A)$ to denote the vector of diagonal elements of matrix $A$ and $\mathrm{Diag}(v)$ to denote the matrix whose diagonal entries are $v$ and off-diagonal entries are 0.

This corresponds to basing the linear($h$)-Gaussian and linear($f$)-delta approximations at $\mu_{t,i-1}$ instead of $\mu_{t|t-1}$. Thus the updates for BLR-LIN-HESS and BBB-LIN-HESS are obtained by substituting

$$g_{t,i} \to H_{t,i}^\top R_{t,i}^{-1} (y_t - \hat y_{t,i}) \tag{91}$$
$$G_{t,i} \to -H_{t,i}^\top R_{t,i}^{-1} H_{t,i} \tag{92}$$

E.1 Full covariance Gaussian, natural parameters

The natural and dual parameters for a general Gaussian are given by

$$\psi^{(1)}_{t,i-1} = \Sigma^{-1}_{t,i-1}\mu_{t,i-1} \qquad \rho^{(1)}_{t,i-1} = \mu_{t,i-1} \tag{93}$$
$$\psi^{(2)}_{t,i-1} = -\tfrac{1}{2}\Sigma^{-1}_{t,i-1} \qquad \rho^{(2)}_{t,i-1} = \mu_{t,i-1}\mu_{t,i-1}^\top + \Sigma_{t,i-1} \tag{94}$$

Inverting these relationships gives

$$\mu_{t,i-1} = -\left(2\psi^{(2)}_{t,i-1}\right)^{-1}\psi^{(1)}_{t,i-1} = \rho^{(1)}_{t,i-1} \tag{95}$$
$$\Sigma_{t,i-1} = -\left(2\psi^{(2)}_{t,i-1}\right)^{-1} = \rho^{(2)}_{t,i-1} - \rho^{(1)}_{t,i-1}\rho^{(1)\top}_{t,i-1} \tag{96}$$

The KL divergence in the VI loss is

$$D_{\rm KL}\!\left(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\right) = \tfrac{1}{2}\left(\mu_{t,i-1} - \mu_{t|t-1}\right)^\top \Sigma^{-1}_{t|t-1} \left(\mu_{t,i-1} - \mu_{t|t-1}\right) + \tfrac{1}{2}\mathrm{Tr}\!\left(\Sigma^{-1}_{t|t-1}\Sigma_{t,i-1}\right) - \tfrac{1}{2}\log|\Sigma_{t,i-1}| + \mathrm{const} \tag{97}$$

with gradients

$$\nabla_{\mu_{t,i-1}} D_{\rm KL} = \Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{98}$$
$$\nabla_{\Sigma_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\Sigma^{-1}_{t|t-1} - \Sigma^{-1}_{t,i-1}\right) \tag{99}$$

Following Appendix C of [Khan et al., 2018a], for any scalar function $\ell$ the chain rule gives

$$\nabla_{\rho^{(1)}_{t,i-1}}\ell = \frac{\partial \mu_{t,i-1}}{\partial \rho^{(1)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial \Sigma_{t,i-1}}{\partial \rho^{(1)}_{t,i-1}}\nabla_{\Sigma_{t,i-1}}\ell \tag{100}$$
$$= \nabla_{\mu_{t,i-1}}\ell - 2\left(\nabla_{\Sigma_{t,i-1}}\ell\right)\mu_{t,i-1} \tag{101}$$
$$\nabla_{\rho^{(2)}_{t,i-1}}\ell = \frac{\partial \mu_{t,i-1}}{\partial \rho^{(2)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial \Sigma_{t,i-1}}{\partial \rho^{(2)}_{t,i-1}}\nabla_{\Sigma_{t,i-1}}\ell \tag{102}$$
$$= \nabla_{\Sigma_{t,i-1}}\ell \tag{103}$$

Therefore

$$\nabla_{\rho^{(1)}_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = g_{t,i} - G_{t,i}\mu_{t,i-1} \tag{104}$$
$$\nabla_{\rho^{(2)}_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = \tfrac{1}{2}G_{t,i} \tag{105}$$

and

$$\nabla_{\rho^{(1)}_{t,i-1}} D_{\rm KL} = \Sigma^{-1}_{t,i-1}\mu_{t,i-1} - \Sigma^{-1}_{t|t-1}\mu_{t|t-1} \tag{106}$$
$$\nabla_{\rho^{(2)}_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\Sigma^{-1}_{t|t-1} - \Sigma^{-1}_{t,i-1}\right) \tag{107}$$

Following the same approach for $\psi$ gives

$$\nabla_{\psi^{(1)}_{t,i-1}}\ell = \frac{\partial \mu_{t,i-1}}{\partial \psi^{(1)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial \Sigma_{t,i-1}}{\partial \psi^{(1)}_{t,i-1}}\nabla_{\Sigma_{t,i-1}}\ell \tag{108}$$
$$= -\left(2\psi^{(2)}_{t,i-1}\right)^{-1}\nabla_{\mu_{t,i-1}}\ell \tag{109}$$
$$= \Sigma_{t,i-1}\nabla_{\mu_{t,i-1}}\ell \tag{110}$$
$$\nabla_{\psi^{(2)}_{t,i-1}}\ell = \frac{\partial \mu_{t,i-1}}{\partial \psi^{(2)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial \Sigma_{t,i-1}}{\partial \psi^{(2)}_{t,i-1}}\nabla_{\Sigma_{t,i-1}}\ell \tag{111}$$
$$= \tfrac{1}{2}\psi^{(2)-1}_{t,i-1}\left(\nabla_{\mu_{t,i-1}}\ell\right)\psi^{(1)\top}_{t,i-1}\psi^{(2)-1}_{t,i-1} + \tfrac{1}{2}\psi^{(2)-1}_{t,i-1}\left(\nabla_{\Sigma_{t,i-1}}\ell\right)\psi^{(2)-1}_{t,i-1} \tag{112}$$
$$= 2\Sigma_{t,i-1}\left(\nabla_{\mu_{t,i-1}}\ell\right)\mu_{t,i-1}^\top + 2\Sigma_{t,i-1}\left(\nabla_{\Sigma_{t,i-1}}\ell\right)\Sigma_{t,i-1} \tag{113}$$

so that

$$\nabla_{\psi^{(1)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = \Sigma_{t,i-1}g_{t,i} \tag{114}$$
$$\nabla_{\psi^{(2)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = 2\Sigma_{t,i-1}g_{t,i}\mu_{t,i-1}^\top + \Sigma_{t,i-1}G_{t,i}\Sigma_{t,i-1} \tag{115}$$
$$\nabla_{\psi^{(1)}_{t,i-1}} D_{\rm KL} = \Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{116}$$
$$\nabla_{\psi^{(2)}_{t,i-1}} D_{\rm KL} = 2\Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right)\mu_{t,i-1}^\top + \Sigma_{t,i-1}\left(\Sigma^{-1}_{t|t-1} - \Sigma^{-1}_{t,i-1}\right)\Sigma_{t,i-1} \tag{117}$$

E.1.1 BONG FC (explicit RVGA)

Substituting Eqs. (104) and (105) into Eq. (5) gives

$$\psi^{(1)}_t = \psi^{(1)}_{t|t-1} + g_t - G_t\mu_{t|t-1} \tag{118}$$
$$\psi^{(2)}_t = \psi^{(2)}_{t|t-1} + \tfrac{1}{2}G_t \tag{119}$$

Translating to $(\mu_t, \Sigma_t)$ gives the BONG-FC update

$$\mu_t = \mu_{t|t-1} + \Sigma_t g_t \tag{120}$$
$$\Sigma^{-1}_t = \Sigma^{-1}_{t|t-1} - G_t \tag{121}$$

This is equivalent to the explicit update form of RVGA [Lambert et al., 2021]. Using $G^{\rm MC\text{-}HESS}_t$ this update takes $O(P^3)$ because of the matrix inversion.
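A minimal dense implementation of Eqs. (120) and (121) looks as follows (our own sketch with illustrative names; `g` and `G` are whichever estimates of the expected gradient and Hessian are in use):

```python
import numpy as np

def bong_fc_step(mu_pred, Sigma_pred, g, G):
    """BONG-FC update, Eqs. (120)-(121): one unit-step NGD move from the prior.

    Sigma_t^{-1} = Sigma_pred^{-1} - G, then mu_t = mu_pred + Sigma_t @ g.
    Costs O(P^3); Eqs. (122)-(127) give cheaper low-rank forms.
    """
    prec = np.linalg.inv(Sigma_pred) - G
    Sigma = np.linalg.inv(prec)
    mu = mu_pred + Sigma @ g
    return mu, Sigma
```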
Using $G^{\rm MC\text{-}EF}_t$ and the Woodbury matrix identity we can write the update in a form that takes $O(MP^2 + M^3)$:

$$\mu_t = \mu_{t|t-1} + K_t 1_M \tag{122}$$
$$\Sigma_t = \Sigma_{t|t-1} - K_t \hat G^{(1:M)\top}_t \Sigma_{t|t-1} \tag{123}$$
$$K_t = \Sigma_{t|t-1} \hat G^{(1:M)}_t \left(M I_M + \hat G^{(1:M)\top}_t \Sigma_{t|t-1} \hat G^{(1:M)}_t\right)^{-1} \tag{124}$$

Likewise using $G^{\rm LIN\text{-}EF}_t$ takes $O(P^2)$:

$$\mu_t = \mu_{t|t-1} + K_t \tag{125}$$
$$\Sigma_t = \Sigma_{t|t-1} - K_t (g^{\rm LIN}_t)^\top \Sigma_{t|t-1} \tag{126}$$
$$K_t = \Sigma_{t|t-1} g^{\rm LIN}_t \left(1 + (g^{\rm LIN}_t)^\top \Sigma_{t|t-1} g^{\rm LIN}_t\right)^{-1} \tag{127}$$

E.1.2 BONG-LIN-HESS FC (CM-EKF)

Applying Proposition 4.2 to Eqs. (120) and (121) gives the BONG-LIN-HESS-FC update

$$\mu_t = \mu_{t|t-1} + \Sigma_t H_t^\top R_t^{-1} (y_t - \hat y_t) \tag{128}$$
$$\Sigma^{-1}_t = \Sigma^{-1}_{t|t-1} + H_t^\top R_t^{-1} H_t \tag{129}$$

This is equivalent to the CM-EKF [Tronarp et al., 2018, Ollivier, 2018] and can be rewritten using the Woodbury identity in a form that takes $O(CP^2 + C^3)$:

$$\mu_t = \mu_{t|t-1} + K_t(y_t - \hat y_t) \tag{130}$$
$$\Sigma_t = \Sigma_{t|t-1} - K_t H_t \Sigma_{t|t-1} \tag{131}$$
$$K_t = \Sigma_{t|t-1} H_t^\top \left(R_t + H_t \Sigma_{t|t-1} H_t^\top\right)^{-1} \tag{132}$$

E.1.3 BLR FC

Substituting Eqs. (104) to (107) into Eq. (33) gives

$$\psi^{(1)}_{t,i} = \psi^{(1)}_{t,i-1} + \alpha\left(g_{t,i} - G_{t,i}\mu_{t,i-1} - \Sigma^{-1}_{t,i-1}\mu_{t,i-1} + \Sigma^{-1}_{t|t-1}\mu_{t|t-1}\right) \tag{133}$$
$$\psi^{(2)}_{t,i} = \psi^{(2)}_{t,i-1} + \frac{\alpha}{2}\left(G_{t,i} + \Sigma^{-1}_{t,i-1} - \Sigma^{-1}_{t|t-1}\right) \tag{134}$$

Translating to $(\mu_{t,i}, \Sigma_{t,i})$ gives the BLR-FC update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma_{t,i}\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\Sigma_{t,i}g_{t,i} \tag{135}$$
$$\Sigma^{-1}_{t,i} = (1-\alpha)\Sigma^{-1}_{t,i-1} + \alpha\Sigma^{-1}_{t|t-1} - \alpha G_{t,i} \tag{136}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion. In Appendix E.1.1 we were able to use the Woodbury identity to exploit the low rank of $G^{\rm MC\text{-}EF}_t$ and $G^{\rm LIN\text{-}EF}_t$ and obtain BONG updates with complexity quadratic in $P$. This does not appear possible with Eq. (136) because of the extra precision term on the RHS (applying Woodbury would require inverting $(1-\alpha)\Sigma^{-1}_{t,i-1} + \alpha\Sigma^{-1}_{t|t-1}$). Therefore, unlike BONG-FC, BLR-FC requires time cubic in the model size, for reasons that can be traced back to the KL term in Eq. (33).

E.1.4 BLR-LIN-HESS FC

Applying Proposition 4.2 to Eqs. (135) and (136) gives the BLR-LIN-HESS-FC update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma_{t,i}\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\Sigma_{t,i}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{137}$$
$$\Sigma^{-1}_{t,i} = (1-\alpha)\Sigma^{-1}_{t,i-1} + \alpha\Sigma^{-1}_{t|t-1} + \alpha H_{t,i}^\top R_{t,i}^{-1} H_{t,i} \tag{138}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion.

E.1.5 BOG FC

Substituting Eqs. (114) and (115) into Eq. (31) gives

$$\psi^{(1)}_t = \psi^{(1)}_{t|t-1} + \alpha\Sigma_{t|t-1}g_t \tag{139}$$
$$\psi^{(2)}_t = \psi^{(2)}_{t|t-1} + 2\alpha\Sigma_{t|t-1}g_t\mu_{t|t-1}^\top + \alpha\Sigma_{t|t-1}G_t\Sigma_{t|t-1} \tag{140}$$

Translating to $(\mu_t, \Sigma_t)$ gives the BOG-FC update

$$\mu_t = \Sigma_t\Sigma^{-1}_{t|t-1}\mu_{t|t-1} + \alpha\Sigma_t\Sigma_{t|t-1}g_t \tag{141}$$
$$\Sigma^{-1}_t = \Sigma^{-1}_{t|t-1} - 4\alpha\Sigma_{t|t-1}g_t\mu_{t|t-1}^\top - 2\alpha\Sigma_{t|t-1}G_t\Sigma_{t|t-1} \tag{142}$$

This update takes $O(P^3)$ because of the matrix inversion. The greater cost of the BOG-FC update relative to BONG-FC can be traced to the difference between GD and NGD: the NLL gradients wrt $\psi_{t|t-1}$ in Eqs. (114) and (115) are more complicated than the gradients wrt $\rho_{t|t-1}$ in Eqs. (104) and (105).

E.1.6 BOG-LIN-HESS FC

Applying Proposition 4.2 to Eqs. (141) and (142) gives the BOG-LIN-HESS-FC update

$$\mu_t = \Sigma_t\Sigma^{-1}_{t|t-1}\mu_{t|t-1} + \alpha\Sigma_t\Sigma_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{143}$$
$$\Sigma^{-1}_t = \Sigma^{-1}_{t|t-1} - 4\alpha\Sigma_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t)\mu_{t|t-1}^\top + 2\alpha\Sigma_{t|t-1}H_t^\top R_t^{-1}H_t\Sigma_{t|t-1} \tag{144}$$

This update takes $O(P^3)$ because of the matrix inversion.
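The Kalman-gain form of the CM-EKF update in Eqs. (130)-(132) can be written directly (a sketch under our own naming; `H` is the Jacobian of $h$ at the prior mean and `R` the observation covariance):

```python
import numpy as np

def cm_ekf_step(mu_pred, Sigma_pred, y, y_hat, H, R):
    """BONG-LIN-HESS-FC (CM-EKF) update, Eqs. (130)-(132).

    Uses the Woodbury/Kalman-gain form, O(C P^2 + C^3) for C outputs.
    """
    S = R + H @ Sigma_pred @ H.T                        # innovation covariance, C x C
    K = np.linalg.solve(S.T, (Sigma_pred @ H.T).T).T    # K = Sigma_pred H^T S^{-1}
    mu = mu_pred + K @ (y - y_hat)
    Sigma = Sigma_pred - K @ H @ Sigma_pred
    return mu, Sigma
```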
E.1.7 BBB FC

Substituting Eqs. (114) to (117) into Eq. (34) gives

$$\psi^{(1)}_{t,i} = \psi^{(1)}_{t,i-1} + \alpha\Sigma_{t,i-1}g_{t,i} - \alpha\Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{145}$$
$$\psi^{(2)}_{t,i} = \psi^{(2)}_{t,i-1} + 2\alpha\Sigma_{t,i-1}g_{t,i}\mu_{t,i-1}^\top + \alpha\Sigma_{t,i-1}G_{t,i}\Sigma_{t,i-1} - 2\alpha\Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right)\mu_{t,i-1}^\top - \alpha\Sigma_{t,i-1}\left(\Sigma^{-1}_{t|t-1} - \Sigma^{-1}_{t,i-1}\right)\Sigma_{t,i-1} \tag{146}$$

Translating to $(\mu_{t,i}, \Sigma_{t,i})$ gives the BBB-FC update

$$\mu_{t,i} = \Sigma_{t,i}\Sigma^{-1}_{t,i-1}\mu_{t,i-1} + \alpha\Sigma_{t,i}\Sigma_{t,i-1}\left(g_{t,i} + \Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right)\right) \tag{147}$$
$$\Sigma^{-1}_{t,i} = \Sigma^{-1}_{t,i-1} - 2\alpha\Sigma_{t,i-1}\left[2\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right)\mu_{t,i-1}^\top + I_P + 2g_{t,i}\mu_{t,i-1}^\top + \left(G_{t,i} - \Sigma^{-1}_{t|t-1}\right)\Sigma_{t,i-1}\right] \tag{148}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion.

E.1.8 BBB-LIN-HESS FC

Applying Proposition 4.2 to Eqs. (147) and (148) gives the BBB-LIN-HESS-FC update

$$\mu_{t,i} = \Sigma_{t,i}\Sigma^{-1}_{t,i-1}\mu_{t,i-1} + \alpha\Sigma_{t,i}\Sigma_{t,i-1}\left(H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) + \Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right)\right) \tag{149}$$
$$\Sigma^{-1}_{t,i} = \Sigma^{-1}_{t,i-1} - 2\alpha\Sigma_{t,i-1}\left[2\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right)\mu_{t,i-1}^\top + I_P + 2H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i})\mu_{t,i-1}^\top - \left(H_{t,i}^\top R_{t,i}^{-1}H_{t,i} + \Sigma^{-1}_{t|t-1}\right)\Sigma_{t,i-1}\right] \tag{150}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion.

E.2 Full covariance Gaussian, moment parameters

The Bonnet and Price theorems give

$$\nabla_{\mu_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = g_{t,i} \tag{151}$$
$$\nabla_{\Sigma_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = \tfrac{1}{2}G_{t,i} \tag{152}$$

From Appendix E.1 we have

$$\nabla_{\mu_{t,i-1}} D_{\rm KL} = \Sigma^{-1}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{153}$$
$$\nabla_{\Sigma_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\Sigma^{-1}_{t|t-1} - \Sigma^{-1}_{t,i-1}\right) \tag{154}$$

We write the Fisher matrix with respect to the moment parameters $\psi = (\mu, \mathrm{vec}(\Sigma))$ as a block matrix:

$$F = \begin{bmatrix} F_{\mu,\mu} & F_{\mu,\Sigma} \\ F_{\Sigma,\mu} & F_{\Sigma,\Sigma} \end{bmatrix} \tag{155}$$

The blocks can be calculated by the second-order Fisher formula

$$F_{\mu,\mu} = -\mathbb{E}_{q_\psi}[\nabla_{\mu,\mu} \log q_\psi(\theta)] = \Sigma^{-1} \tag{156-157}$$
$$F_{\mu,\Sigma} = -\mathbb{E}_{q_\psi}[\nabla_{\mu,\Sigma} \log q_\psi(\theta)] = -\mathbb{E}_{q_\psi}\!\left[\nabla_\Sigma\!\left(\Sigma^{-1}(\theta - \mu)\right)\right] = 0 \tag{158-160}$$
$$F_{\Sigma,\Sigma} = -\mathbb{E}_{q_\psi}[\nabla_{\Sigma,\Sigma} \log q_\psi(\theta)] \tag{161}$$

Writing $\nabla_\Sigma \log q_\psi(\theta) = \tfrac{1}{2}\left(\Sigma^{-1}(\theta-\mu)(\theta-\mu)^\top\Sigma^{-1} - \Sigma^{-1}\right)$, differentiating again with respect to $\Sigma$, and using $\mathbb{E}[(\theta-\mu)(\theta-\mu)^\top] = \Sigma$ gives

$$F_{\Sigma,\Sigma} = \tfrac{1}{2}\,\Sigma^{-1} \otimes \Sigma^{-1} \tag{162-165}$$

In the final line we used

$$\left(\Sigma^{-1} \otimes \Sigma^{-1}\right)_{ij,k\ell} = \Sigma^{-1}_{ik}\Sigma^{-1}_{j\ell} \tag{166-167}$$

with $ij$ and $k\ell$ treated as composite indices in $[P^2]$. Therefore the preconditioner in the NGD methods is

$$F^{-1}_{\psi_{t,i-1}} = \begin{bmatrix} \Sigma_{t,i-1} & 0 \\ 0 & 2\Sigma_{t,i-1} \otimes \Sigma_{t,i-1} \end{bmatrix} \tag{168}$$

E.2.1 BONG FC, Moment

Substituting Eqs. (151), (152) and (168) into Eq. (3) gives the BONG-FC_MOM update

$$\mu_t = \mu_{t|t-1} + \Sigma_{t|t-1}g_t \tag{169}$$
$$\Sigma_t = \Sigma_{t|t-1} + \Sigma_{t|t-1}G_t\Sigma_{t|t-1} \tag{170}$$

This update takes $O(P^3)$ using $G^{\rm MC\text{-}HESS}_t$, $O(MP^2)$ using $G^{\rm MC\text{-}EF}_t$, and $O(P^2)$ using $G^{\rm LIN\text{-}EF}_t$. The update is similar to the BONG-FC update except that Eq. (170) ignores the $\hat G^{(1:M)\top}_t\Sigma_{t|t-1}\hat G^{(1:M)}_t$ term in Eq. (124) or the $(g^{\rm LIN}_t)^\top\Sigma_{t|t-1}g^{\rm LIN}_t$ term in Eq. (127), which estimate the epistemic part of predictive uncertainty.

E.2.2 BONG-LIN-HESS FC, Moment

Applying Proposition 4.2 to Eqs. (169) and (170) gives the BONG-LIN-HESS-FC_MOM update

$$\mu_t = \mu_{t|t-1} + \Sigma_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{171}$$
$$\Sigma_t = \Sigma_{t|t-1} - \Sigma_{t|t-1}H_t^\top R_t^{-1}H_t\Sigma_{t|t-1} \tag{172}$$

This update takes $O(CP^2)$.

E.2.3 BLR FC, Moment

Substituting Eqs. (151) to (154) and (168) into Eq. (32) gives the BLR-FC_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\Sigma_{t,i-1}g_{t,i} \tag{173}$$
$$\Sigma_{t,i} = (1+\alpha)\Sigma_{t,i-1} + \alpha\Sigma_{t,i-1}\left(G_{t,i} - \Sigma^{-1}_{t|t-1}\right)\Sigma_{t,i-1} \tag{174}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion. Using $G^{\rm MC\text{-}EF}_t = -\frac{1}{M}\hat G^{(1:M)}_t\hat G^{(1:M)\top}_t$ or $G^{\rm LIN\text{-}EF}_t = -g^{\rm LIN}_t(g^{\rm LIN}_t)^\top$ allows the BONG-FC_MOM covariance update in Eq. (170) to scale quadratically, but this does not help here. Instead the BLR-FC_MOM update scales cubically because of the additional $\Sigma^{-1}_{t|t-1}$ term that comes from the KL divergence in the VI objective.
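For contrast with the natural-parameter update in Eqs. (120)-(121), the moment-form BONG update of Eqs. (169)-(170) requires no matrix inversion at all; a sketch (our naming, not the released code):

```python
import numpy as np

def bong_fc_mom_step(mu_pred, Sigma_pred, g, G):
    """BONG-FC_MOM update, Eqs. (169)-(170): NGD in the moment parameterization.

    Unlike Eq. (121), there is no inversion, but the covariance update drops
    the outer-product correction term noted in Appendix E.2.1.
    """
    mu = mu_pred + Sigma_pred @ g
    Sigma = Sigma_pred + Sigma_pred @ G @ Sigma_pred
    return mu, Sigma
```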
E.2.4 BLR-LIN-HESS FC, Moment

Applying Proposition 4.2 to Eqs. (173) and (174) gives the BLR-LIN-HESS-FC_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma_{t,i-1}\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\Sigma_{t,i-1}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{175}$$
$$\Sigma_{t,i} = (1+\alpha)\Sigma_{t,i-1} - \alpha\Sigma_{t,i-1}\left(\Sigma^{-1}_{t|t-1} + H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right)\Sigma_{t,i-1} \tag{176}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion in the $\Sigma^{-1}_{t|t-1}$ term that comes from the KL divergence in the VI objective.

E.2.5 BOG FC, Moment

Substituting Eqs. (151) and (152) into Eq. (31) gives the BOG-FC_MOM update

$$\mu_t = \mu_{t|t-1} + \alpha g_t \tag{177}$$
$$\Sigma_t = \Sigma_{t|t-1} + \frac{\alpha}{2}G_t \tag{178}$$

Note the mean update is vanilla online gradient descent (OGD) and does not depend on the covariance. This update takes $O(MP^2)$ using $G^{\rm MC\text{-}HESS}_t$ or $G^{\rm MC\text{-}EF}_t$ and $O(P^2)$ using $G^{\rm LIN\text{-}EF}_t$.

E.2.6 BOG-LIN-HESS FC, Moment

Applying Proposition 4.2 to Eqs. (177) and (178) gives the BOG-LIN-HESS-FC_MOM update

$$\mu_t = \mu_{t|t-1} + \alpha H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{179}$$
$$\Sigma_t = \Sigma_{t|t-1} - \frac{\alpha}{2}H_t^\top R_t^{-1}H_t \tag{180}$$

This update takes $O(CP^2)$.

E.2.7 BBB FC, Moment

Substituting Eqs. (151) to (154) into Eq. (34) gives the BBB-FC_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha g_{t,i} \tag{181}$$
$$\Sigma_{t,i} = \Sigma_{t,i-1} + \frac{\alpha}{2}\left(\Sigma^{-1}_{t,i-1} - \Sigma^{-1}_{t|t-1} + G_{t,i}\right) \tag{182}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion, which traces back to the VI objective. Comparing to the BOG-FC_MOM update in Eqs. (177) and (178) (which has quadratic complexity in $P$), the extra terms here come from the KL part of Eq. (34).

E.2.8 BBB-LIN-HESS FC, Moment

Applying Proposition 4.2 to Eqs. (181) and (182) gives the BBB-LIN-HESS-FC_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\Sigma^{-1}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{183}$$
$$\Sigma_{t,i} = \Sigma_{t,i-1} + \frac{\alpha}{2}\left(\Sigma^{-1}_{t,i-1} - \Sigma^{-1}_{t|t-1} - H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right) \tag{184}$$

This update takes $O(P^3)$ per iteration because of the matrix inversion. Comparing to the BOG-LIN-HESS-FC_MOM update in Eqs. (179) and (180) (which has quadratic complexity in $P$), the extra terms here come from the KL part of Eq. (34).

E.3 Diagonal Gaussian, natural parameters

Throughout this subsection, vector multiplication and exponents are elementwise.
The natural and dual parameters for a diagonal Gaussian are given by

$$\psi^{(1)}_{t,i-1} = \sigma^{-2}_{t,i-1}\mu_{t,i-1} \qquad \rho^{(1)}_{t,i-1} = \mu_{t,i-1} \tag{185}$$
$$\psi^{(2)}_{t,i-1} = -\tfrac{1}{2}\sigma^{-2}_{t,i-1} \qquad \rho^{(2)}_{t,i-1} = \mu^2_{t,i-1} + \sigma^2_{t,i-1} \tag{186}$$

Inverting these relationships gives

$$\mu_{t,i-1} = -\left(2\psi^{(2)}_{t,i-1}\right)^{-1}\psi^{(1)}_{t,i-1} = \rho^{(1)}_{t,i-1} \tag{187}$$
$$\sigma^2_{t,i-1} = -\left(2\psi^{(2)}_{t,i-1}\right)^{-1} = \rho^{(2)}_{t,i-1} - \left(\rho^{(1)}_{t,i-1}\right)^2 \tag{188}$$

The KL divergence in the VI loss is

$$D_{\rm KL}\!\left(q_{\psi_{t,i-1}} \,\|\, q_{\psi_{t|t-1}}\right) = \tfrac{1}{2}\sum\left[\left(\mu_{t,i-1} - \mu_{t|t-1}\right)^2\sigma^{-2}_{t|t-1} + \sigma^{-2}_{t|t-1}\sigma^2_{t,i-1} - \log\sigma^2_{t,i-1}\right] + \mathrm{const} \tag{189}$$

with gradients

$$\nabla_{\mu_{t,i-1}} D_{\rm KL} = \sigma^{-2}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{190}$$
$$\nabla_{\sigma^2_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{191}$$

For any scalar function $\ell$ the chain rule gives

$$\nabla_{\rho^{(1)}_{t,i-1}}\ell = \frac{\partial\mu_{t,i-1}}{\partial\rho^{(1)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial\sigma^2_{t,i-1}}{\partial\rho^{(1)}_{t,i-1}}\nabla_{\sigma^2_{t,i-1}}\ell = \nabla_{\mu_{t,i-1}}\ell - 2\mu_{t,i-1}\nabla_{\sigma^2_{t,i-1}}\ell \tag{192-193}$$
$$\nabla_{\rho^{(2)}_{t,i-1}}\ell = \frac{\partial\mu_{t,i-1}}{\partial\rho^{(2)}_{t,i-1}}\nabla_{\mu_{t,i-1}}\ell + \frac{\partial\sigma^2_{t,i-1}}{\partial\rho^{(2)}_{t,i-1}}\nabla_{\sigma^2_{t,i-1}}\ell = \nabla_{\sigma^2_{t,i-1}}\ell \tag{194-195}$$

so that

$$\nabla_{\rho^{(1)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = g_{t,i} - \mathrm{diag}(G_{t,i})\,\mu_{t,i-1} \tag{196}$$
$$\nabla_{\rho^{(2)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = \tfrac{1}{2}\,\mathrm{diag}(G_{t,i}) \tag{197}$$
$$\nabla_{\rho^{(1)}_{t,i-1}} D_{\rm KL} = \sigma^{-2}_{t,i-1}\mu_{t,i-1} - \sigma^{-2}_{t|t-1}\mu_{t|t-1} \tag{198}$$
$$\nabla_{\rho^{(2)}_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{199}$$

Following the same approach for $\psi$ gives

$$\nabla_{\psi^{(1)}_{t,i-1}}\ell = -\left(2\psi^{(2)}_{t,i-1}\right)^{-1}\nabla_{\mu_{t,i-1}}\ell = \sigma^2_{t,i-1}\nabla_{\mu_{t,i-1}}\ell \tag{200-202}$$
$$\nabla_{\psi^{(2)}_{t,i-1}}\ell = 2\sigma^2_{t,i-1}\mu_{t,i-1}\nabla_{\mu_{t,i-1}}\ell + 2\sigma^4_{t,i-1}\nabla_{\sigma^2_{t,i-1}}\ell \tag{203-205}$$

so that

$$\nabla_{\psi^{(1)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = \sigma^2_{t,i-1}g_{t,i} \tag{206}$$
$$\nabla_{\psi^{(2)}_{t,i-1}} \mathbb{E}[\log p(y_t \mid f_t(\theta_t))] = 2\sigma^2_{t,i-1}\mu_{t,i-1}g_{t,i} + \sigma^4_{t,i-1}\,\mathrm{diag}(G_{t,i}) \tag{207}$$
$$\nabla_{\psi^{(1)}_{t,i-1}} D_{\rm KL} = \sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{208}$$
$$\nabla_{\psi^{(2)}_{t,i-1}} D_{\rm KL} = 2\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\mu_{t,i-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) + \sigma^4_{t,i-1}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{209}$$

Our implementations often make use of the following trick: suppose $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times n}$. Then we can efficiently compute $\mathrm{diag}(AB)$ in $O(mn)$ time using $(AB)_{ii} = \sum_{j=1}^{m} A_{ij}B_{ji}$. For MC-HESS methods, we approximate the diagonal of the Hessian for each MC sample $\hat\theta^{(m)}_t$ using Hutchinson's trace estimation method [Hutchinson, 1989], which has been used in other DNN optimization papers such as AdaHessian [Yao et al., 2021]. This involves an extra inner loop whose size we denote by $N$.

E.3.1 BONG Diag

Substituting Eqs. (196) and (197) into Eq. (5) gives

$$\psi^{(1)}_t = \psi^{(1)}_{t|t-1} + g_t - \mathrm{diag}(G_t)\,\mu_{t|t-1} \tag{210}$$
$$\psi^{(2)}_t = \psi^{(2)}_{t|t-1} + \tfrac{1}{2}\,\mathrm{diag}(G_t) \tag{211}$$

Translating to $(\mu_t, \sigma^2_t)$ gives the BONG-DIAG update

$$\mu_t = \mu_{t|t-1} + \sigma^2_t g_t \tag{212}$$
$$\sigma^{-2}_t = \sigma^{-2}_{t|t-1} - \mathrm{diag}(G_t) \tag{213}$$

This update takes $O(MP)$ to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ using $G^{\rm LIN\text{-}EF}_t$.

E.3.2 BONG-LIN-HESS Diag (VD-EKF)

Applying Proposition 4.2 to Eqs. (210) and (211) gives the BONG-LIN-HESS-DIAG update

$$\mu_t = \mu_{t|t-1} + \sigma^2_t H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{214}$$
$$\sigma^{-2}_t = \sigma^{-2}_{t|t-1} + \mathrm{diag}\!\left(H_t^\top R_t^{-1}H_t\right) \tag{215}$$

This update is equivalent to VD-EKF [Chang et al., 2022] and takes $O(C^2P)$.
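The diagonal LIN-HESS update in Eqs. (214)-(215) is only a few lines; a sketch (our naming; `var_pred` is the vector of prior variances, `H` is $C \times P$, `R_inv` is $C \times C$):

```python
import numpy as np

def bong_lin_hess_diag_step(mu_pred, var_pred, y, y_hat, H, R_inv):
    """BONG-LIN-HESS-DIAG (VD-EKF) update, Eqs. (214)-(215), in O(C^2 P)."""
    HtRinv = H.T @ R_inv                                      # P x C
    prec = 1.0 / var_pred + np.einsum('pc,cp->p', HtRinv, H)  # diag(H^T R^-1 H)
    var = 1.0 / prec
    mu = mu_pred + var * (HtRinv @ (y - y_hat))
    return mu, var
```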
E.3.3 BLR Diag (VON)

Substituting Eqs. (196) to (199) into Eq. (33) gives

$$\psi^{(1)}_{t,i} = \psi^{(1)}_{t,i-1} + \alpha\left(g_{t,i} - \mathrm{diag}(G_{t,i})\,\mu_{t,i-1} - \sigma^{-2}_{t,i-1}\mu_{t,i-1} + \sigma^{-2}_{t|t-1}\mu_{t|t-1}\right) \tag{216}$$
$$\psi^{(2)}_{t,i} = \psi^{(2)}_{t,i-1} + \frac{\alpha}{2}\left(\mathrm{diag}(G_{t,i}) + \sigma^{-2}_{t,i-1} - \sigma^{-2}_{t|t-1}\right) \tag{217}$$

Translating to $(\mu_{t,i}, \sigma^2_{t,i})$ gives the BLR-DIAG update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^2_{t,i}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\sigma^2_{t,i}g_{t,i} \tag{218}$$
$$\sigma^{-2}_{t,i} = (1-\alpha)\sigma^{-2}_{t,i-1} + \alpha\sigma^{-2}_{t|t-1} - \alpha\,\mathrm{diag}(G_{t,i}) \tag{219}$$

This update takes $O(MP)$ per iteration to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ per iteration using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ per iteration using $G^{\rm LIN\text{-}EF}_t$. The MC-HESS and MC-EF versions of this update are respectively equivalent to VON and VOGN [Khan et al., 2018b] in the batch setting where we replace $q_{\psi_{t|t-1}}$ with a spherical prior $\mathcal{N}(0, \lambda^{-1}I_P)$ (see Appendix E.6).

E.3.4 BLR-LIN-HESS Diag

Applying Proposition 4.2 to Eqs. (218) and (219) gives the BLR-LIN-HESS-DIAG update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^2_{t,i}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\sigma^2_{t,i}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{220}$$
$$\sigma^{-2}_{t,i} = (1-\alpha)\sigma^{-2}_{t,i-1} + \alpha\sigma^{-2}_{t|t-1} + \alpha\,\mathrm{diag}\!\left(H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right) \tag{221}$$

This update takes $O(C^2P)$ per iteration.

E.3.5 BOG Diag

Substituting Eqs. (206) and (207) into Eq. (31) gives

$$\psi^{(1)}_t = \psi^{(1)}_{t|t-1} + \alpha\sigma^2_{t|t-1}g_t \tag{222}$$
$$\psi^{(2)}_t = \psi^{(2)}_{t|t-1} + 2\alpha\sigma^2_{t|t-1}\mu_{t|t-1}g_t + \alpha\sigma^4_{t|t-1}\,\mathrm{diag}(G_t) \tag{223}$$

Translating to $(\mu_t, \sigma^2_t)$ gives the BOG-DIAG update

$$\mu_t = \sigma^2_t\sigma^{-2}_{t|t-1}\mu_{t|t-1} + \alpha\sigma^2_t\sigma^2_{t|t-1}g_t \tag{224}$$
$$\sigma^{-2}_t = \sigma^{-2}_{t|t-1} - 4\alpha\sigma^2_{t|t-1}\mu_{t|t-1}g_t - 2\alpha\sigma^4_{t|t-1}\,\mathrm{diag}(G_t) \tag{225}$$

This update takes $O(MP)$ to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ using $G^{\rm LIN\text{-}EF}_t$.

E.3.6 BOG-LIN-HESS Diag

Applying Proposition 4.2 to Eqs. (224) and (225) gives the BOG-LIN-HESS-DIAG update

$$\mu_t = \sigma^2_t\sigma^{-2}_{t|t-1}\mu_{t|t-1} + \alpha\sigma^2_t\sigma^2_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{226}$$
$$\sigma^{-2}_t = \sigma^{-2}_{t|t-1} - 4\alpha\sigma^2_{t|t-1}\mu_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t) + 2\alpha\sigma^4_{t|t-1}\,\mathrm{diag}\!\left(H_t^\top R_t^{-1}H_t\right) \tag{227}$$

This update takes $O(C^2P)$.

E.3.7 BBB Diag

Substituting Eqs. (206) to (209) into Eq. (34) gives

$$\psi^{(1)}_{t,i} = \psi^{(1)}_{t,i-1} + \alpha\sigma^2_{t,i-1}g_{t,i} - \alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{228}$$
$$\psi^{(2)}_{t,i} = \psi^{(2)}_{t,i-1} + 2\alpha\sigma^2_{t,i-1}\mu_{t,i-1}g_{t,i} + \alpha\sigma^4_{t,i-1}\,\mathrm{diag}(G_{t,i}) - 2\alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\mu_{t,i-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) - \alpha\sigma^4_{t,i-1}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{229}$$

Translating to $(\mu_{t,i}, \sigma^2_{t,i})$ gives the BBB-DIAG update

$$\mu_{t,i} = \sigma^2_{t,i}\sigma^{-2}_{t,i-1}\mu_{t,i-1} + \alpha\sigma^2_{t,i}\sigma^2_{t,i-1}g_{t,i} + \alpha\sigma^2_{t,i}\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) \tag{230}$$
$$\sigma^{-2}_{t,i} = \sigma^{-2}_{t,i-1} - 4\alpha\sigma^2_{t,i-1}\mu_{t,i-1}g_{t,i} - 2\alpha\sigma^4_{t,i-1}\,\mathrm{diag}(G_{t,i}) + 4\alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\mu_{t,i-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) + 2\alpha\sigma^4_{t,i-1}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{231}$$

This update takes $O(MP)$ per iteration to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ per iteration using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ per iteration using $G^{\rm LIN\text{-}EF}_t$.

E.3.8 BBB-LIN-HESS Diag

Applying Proposition 4.2 to Eqs. (230) and (231) gives the BBB-LIN-HESS-DIAG update

$$\mu_{t,i} = \sigma^2_{t,i}\sigma^{-2}_{t,i-1}\mu_{t,i-1} + \alpha\sigma^2_{t,i}\sigma^2_{t,i-1}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) + \alpha\sigma^2_{t,i}\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) \tag{232}$$
$$\sigma^{-2}_{t,i} = \sigma^{-2}_{t,i-1} - 4\alpha\sigma^2_{t,i-1}\mu_{t,i-1}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) + 2\alpha\sigma^4_{t,i-1}\,\mathrm{diag}\!\left(H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right) + 4\alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\mu_{t,i-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) + 2\alpha\sigma^4_{t,i-1}\sigma^{-2}_{t|t-1} - 2\alpha\sigma^2_{t,i-1} \tag{233}$$

This update takes $O(C^2P)$ per iteration.

E.4 Diagonal Gaussian, moment parameters

Throughout this subsection, vector multiplication and exponents are elementwise.
The Bonnet and Price theorems give

$$\nabla_{\mu_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = g_{t,i} \tag{234}$$
$$\nabla_{\sigma^2_{t,i-1}} \mathbb{E}_{\theta_t \sim q_{\psi_{t,i-1}}}[\log p(y_t \mid f_t(\theta_t))] = \tfrac{1}{2}\,\mathrm{diag}(G_{t,i}) \tag{235}$$

From Appendix E.3 we have

$$\nabla_{\mu_{t,i-1}} D_{\rm KL} = \sigma^{-2}_{t|t-1}\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{236}$$
$$\nabla_{\sigma^2_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\sigma^{-2}_{t|t-1} - \sigma^{-2}_{t,i-1}\right) \tag{237}$$

We write the Fisher matrix with respect to the moment parameters $\psi = (\mu, \sigma^2)$ as a block matrix:

$$F_\psi = \begin{bmatrix} F_{\mu,\mu} & F_{\mu,\sigma^2} \\ F_{\sigma^2,\mu} & F_{\sigma^2,\sigma^2} \end{bmatrix} \tag{238}$$

The blocks can be calculated by the second-order Fisher formula

$$F_{\mu,\mu} = -\mathbb{E}_{q_\psi}[\nabla_{\mu,\mu} \log q_\psi(\theta)] = \mathrm{Diag}\!\left(\sigma^{-2}\right) \tag{239-240}$$
$$F_{\mu,\sigma^2} = -\mathbb{E}_{q_\psi}[\nabla_{\mu,\sigma^2} \log q_\psi(\theta)] = -\mathbb{E}_{q_\psi}\!\left[\mathrm{Diag}\!\left((\theta - \mu)\sigma^{-4}\right)\right] = 0 \tag{241-243}$$
$$F_{\sigma^2,\sigma^2} = -\mathbb{E}_{q_\psi}[\nabla_{\sigma^2,\sigma^2} \log q_\psi(\theta)] = \mathbb{E}_{q_\psi}\!\left[\mathrm{Diag}\!\left((\mu - \theta)^2\sigma^{-6} - \tfrac{1}{2}\sigma^{-4}\right)\right] = \tfrac{1}{2}\mathrm{Diag}\!\left(\sigma^{-4}\right) \tag{244-246}$$

Therefore the preconditioner for the NGD methods is

$$F^{-1}_{\psi_{t,i-1}} = \begin{bmatrix} \mathrm{Diag}\!\left(\sigma^2_{t,i-1}\right) & 0 \\ 0 & 2\,\mathrm{Diag}\!\left(\sigma^4_{t,i-1}\right) \end{bmatrix} \tag{247}$$

E.4.1 BONG Diag, Moment

Substituting Eqs. (234), (235) and (247) into Eq. (3) gives the BONG-DIAG_MOM update

$$\mu_t = \mu_{t|t-1} + \sigma^2_{t|t-1}g_t \tag{248}$$
$$\sigma^2_t = \sigma^2_{t|t-1} + \sigma^4_{t|t-1}\,\mathrm{diag}(G_t) \tag{249}$$

This update takes $O(MP)$ to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ using $G^{\rm LIN\text{-}EF}_t$.

E.4.2 BONG-LIN-HESS Diag, Moment

Applying Proposition 4.2 to Eqs. (248) and (249) gives the BONG-LIN-HESS-DIAG_MOM update

$$\mu_t = \mu_{t|t-1} + \sigma^2_{t|t-1}H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{250}$$
$$\sigma^2_t = \sigma^2_{t|t-1} - \sigma^4_{t|t-1}\,\mathrm{diag}\!\left(H_t^\top R_t^{-1}H_t\right) \tag{251}$$

This update takes $O(C^2P)$.

E.4.3 BLR Diag, Moment

Substituting Eqs. (234) to (237) and (247) into Eq. (32) gives the BLR-DIAG_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\sigma^2_{t,i-1}g_{t,i} \tag{252}$$
$$\sigma^2_{t,i} = \sigma^2_{t,i-1} + \alpha\sigma^4_{t,i-1}\left(\sigma^{-2}_{t,i-1} - \sigma^{-2}_{t|t-1}\right) + \alpha\sigma^4_{t,i-1}\,\mathrm{diag}(G_{t,i}) \tag{253}$$

This update takes $O(MP)$ per iteration to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ per iteration using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ per iteration using $G^{\rm LIN\text{-}EF}_t$.

E.4.4 BLR-LIN-HESS Diag, Moment

Applying Proposition 4.2 to Eqs. (252) and (253) gives the BLR-LIN-HESS-DIAG_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^2_{t,i-1}\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha\sigma^2_{t,i-1}H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{254}$$
$$\sigma^2_{t,i} = \sigma^2_{t,i-1} + \alpha\sigma^4_{t,i-1}\left(\sigma^{-2}_{t,i-1} - \sigma^{-2}_{t|t-1}\right) - \alpha\sigma^4_{t,i-1}\,\mathrm{diag}\!\left(H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right) \tag{255}$$

This update takes $O(C^2P)$ per iteration.

E.4.5 BOG Diag, Moment

Substituting Eqs. (234) and (235) into Eq. (31) gives the BOG-DIAG_MOM update

$$\mu_t = \mu_{t|t-1} + \alpha g_t \tag{256}$$
$$\sigma^2_t = \sigma^2_{t|t-1} + \frac{\alpha}{2}\,\mathrm{diag}(G_t) \tag{257}$$

This update takes $O(MP)$ to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ using $G^{\rm LIN\text{-}EF}_t$.

E.4.6 BOG-LIN-HESS Diag, Moment

Applying Proposition 4.2 to Eqs. (256) and (257) gives the BOG-LIN-HESS-DIAG_MOM update

$$\mu_t = \mu_{t|t-1} + \alpha H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{258}$$
$$\sigma^2_t = \sigma^2_{t|t-1} - \frac{\alpha}{2}\,\mathrm{diag}\!\left(H_t^\top R_t^{-1}H_t\right) \tag{259}$$

This update takes $O(C^2P)$.

E.4.7 BBB Diag, Moment

Substituting Eqs. (234) to (237) into Eq. (34) gives the BBB-DIAG_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha g_{t,i} \tag{260}$$
$$\sigma^2_{t,i} = \sigma^2_{t,i-1} + \frac{\alpha}{2}\left(\sigma^{-2}_{t,i-1} - \sigma^{-2}_{t|t-1}\right) + \frac{\alpha}{2}\,\mathrm{diag}(G_{t,i}) \tag{261}$$

This update takes $O(MP)$ per iteration to estimate $G_t$ using $G^{\rm MC\text{-}EF}_t$, $O(NMP)$ per iteration using $G^{\rm MC\text{-}HESS}_t$ and Hutchinson's method, and $O(P)$ per iteration using $G^{\rm LIN\text{-}EF}_t$. This is similar to the original diagonal Gaussian method in [Blundell et al., 2015] except that (1) they reparameterize $\sigma = \log(1 + \exp(\rho))$ (elementwise) and do GD on $(\mu, \rho)$ instead of $(\mu, \sigma^2)$, and (2) they use the reparameterization trick instead of Price's theorem for calculating the gradient with respect to $\rho$ (via $\sigma$).
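The Hutchinson estimator used for the MC-HESS diagonal can be sketched as follows (our illustration; `hvp(theta, v)` is an assumed helper returning a Hessian-vector product of the log-likelihood, e.g. via double backprop in an autodiff framework):

```python
import numpy as np

def hutchinson_diag_hessian(hvp, theta, n_probes, rng):
    """Estimate diag(H), the Hessian diagonal at theta, via Hutchinson's method.

    diag(H) = E[v * (H v)] for Rademacher probes v, so we average n_probes
    Hessian-vector products; this is the inner loop of size N in the text.
    """
    est = np.zeros_like(theta)
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher probe
        est += v * hvp(theta, v)
    return est / n_probes
```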
E.4.8 BBB-LIN-HESS Diag, Moment

Applying Proposition 4.2 to Eqs. (260) and (261) gives the BBB-LIN-HESS-DIAG_MOM update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\sigma^{-2}_{t|t-1}\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{262}$$
$$\sigma^2_{t,i} = \sigma^2_{t,i-1} + \frac{\alpha}{2}\left(\sigma^{-2}_{t,i-1} - \sigma^{-2}_{t|t-1}\right) - \frac{\alpha}{2}\,\mathrm{diag}\!\left(H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right) \tag{263}$$

This update takes $O(C^2P)$ per iteration.

E.5 Diagonal plus low rank

Assume the prior is given by

$$q_{\psi_{t|t-1}}(\theta_t) = \mathcal{N}\!\left(\theta_t \,\Big|\, \mu_{t|t-1},\ \left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)^{-1}\right) \tag{264}$$

with $W \in \mathbb{R}^{P \times R}$ and diagonal $\Upsilon_{t|t-1} \in \mathbb{R}^{P \times P}$. We sometimes abuse notation by writing $\Upsilon_{t|t-1}$ for the vector $\mathrm{diag}(\Upsilon_{t|t-1})$ when the meaning is clear from context. Substituting the DLR form in the gradients for the KL divergence derived in Appendix E.2 gives

$$\nabla_{\mu_{t,i-1}} D_{\rm KL} = \left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t,i-1} - \mu_{t|t-1}\right) \tag{265}$$
$$\nabla_{\Sigma_{t,i-1}} D_{\rm KL} = \tfrac{1}{2}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top\right) \tag{266}$$

Writing $\Sigma_{t,i-1} = \left(\Upsilon_{t,i-1} + W_{t,i-1}W_{t,i-1}^\top\right)^{-1}$, for any function $\ell$ the chain rule gives

$$\nabla_{\Upsilon_{t,i-1}}\ell = -\mathrm{diag}\!\left(\Sigma_{t,i-1}\left(\nabla_{\Sigma_{t,i-1}}\ell\right)\Sigma_{t,i-1}\right) \tag{267}$$
$$\nabla_{W_{t,i-1}}\ell = -2\,\Sigma_{t,i-1}\left(\nabla_{\Sigma_{t,i-1}}\ell\right)\Sigma_{t,i-1}W_{t,i-1} \tag{268}$$

Therefore the gradients we need are

$$\nabla_{\mu_{t,i-1}} \mathbb{E}[\log p(y_t \mid h_t(\theta_t))] = g_{t,i} \tag{269}$$
$$\nabla_{\Upsilon_{t,i-1}} \mathbb{E}[\log p(y_t \mid h_t(\theta_t))] = -\tfrac{1}{2}\,\mathrm{diag}\!\left(\Sigma_{t,i-1}G_{t,i}\Sigma_{t,i-1}\right) \tag{270}$$
$$\nabla_{W_{t,i-1}} \mathbb{E}[\log p(y_t \mid h_t(\theta_t))] = -\Sigma_{t,i-1}G_{t,i}\Sigma_{t,i-1}W_{t,i-1} \tag{271}$$
$$\nabla_{\Upsilon_{t,i-1}} D_{\rm KL} = -\tfrac{1}{2}\,\mathrm{diag}\!\left(\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top\right)\Sigma_{t,i-1}\right) \tag{272}$$
$$\nabla_{W_{t,i-1}} D_{\rm KL} = -\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top\right)\Sigma_{t,i-1}W_{t,i-1} \tag{273}$$

The Fisher matrix can be decomposed as a block-diagonal with blocks for $\mu_{t,i}$ and for $(\Upsilon_{t,i}, W_{t,i})$, but (in contrast to the FC Gaussian case in Eq. (168)) we have not found an efficient way to analytically invert the latter block, which has size $P + RP$. To avoid the $O(R^3P^3)$ cost of brute-force inversion, we use a different strategy for BONG and BLR: we perform the update on the natural parameters of the FC Gaussian as in Appendix E.1 and project the result back to rank $R$ using SVD. Specifically, assume the updated precision from applying Eq. (121) for BONG or Eq. (136) for BLR can be written as

$$\Sigma^{-1}_{t,i} = \tilde\Upsilon_{t,i} + \tilde W_{t,i}\tilde W_{t,i}^\top \tag{274}$$

and let the SVD of $\tilde W_{t,i}$ be

$$\tilde W_{t,i} = \tilde U_{t,i}\tilde\Lambda_{t,i}\tilde V_{t,i}^\top \tag{275}$$

with $\tilde U_{t,i}, \tilde V_{t,i}$ orthogonal and $\tilde\Lambda_{t,i}$ rectangular-diagonal. Following [Mishkin et al., 2018, Chang et al., 2023] we define the update

$$W_{t,i} = \tilde U_{t,i}[:, :R]\ \tilde\Lambda_{t,i}[:R, :R] \tag{276}$$
$$\Upsilon_{t,i} = \tilde\Upsilon_{t,i} + \mathrm{diag}\!\left(\tilde W_{t,i}\tilde W_{t,i}^\top - W_{t,i}W_{t,i}^\top\right) \tag{277}$$

so that $W_{t,i}$ contains the top $R$ singular vectors and values of the FC posterior and the diagonal is preserved: $\mathrm{diag}\!\left(\Upsilon_{t,i} + W_{t,i}W_{t,i}^\top\right) = \mathrm{diag}\!\left(\Sigma^{-1}_{t,i}\right)$. This approach works for the MC-EF and LIN-EF methods but not MC-HESS, which we omit. Finally, in the MC-EF methods we sample from the DLR prior using the routine in [Mishkin et al., 2018, Lambert et al., 2023], which takes $O(R(R+M)P)$.

E.5.1 BONG DLR

Substituting the DLR prior from Eq. (264) into the FC precision update from Eq. (121) and using the MC-EF approximation yields the posterior precision

$$\Sigma^{-1}_t = \Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top + \tfrac{1}{M}\hat G^{(1:M)}_t\hat G^{(1:M)\top}_t \tag{278}$$
$$= \tilde\Upsilon_t + \tilde W_t\tilde W_t^\top \tag{279}$$
$$\tilde\Upsilon_t = \Upsilon_{t|t-1} \tag{280}$$
$$\tilde W_t = \left[W_{t|t-1},\ \tfrac{1}{\sqrt{M}}\hat G^{(1:M)}_t\right] \tag{281}$$

Note $\tilde W_t \in \mathbb{R}^{P \times (R+M)}$. Using this posterior precision in the mean update from Eq. (120) yields

$$\mu_t = \mu_{t|t-1} + \left(\tilde\Upsilon_t + \tilde W_t\tilde W_t^\top\right)^{-1} g_t \tag{282}$$
$$= \mu_{t|t-1} + \left[\tilde\Upsilon_t^{-1} - \tilde\Upsilon_t^{-1}\tilde W_t\left(I_{R+M} + \tilde W_t^\top\tilde\Upsilon_t^{-1}\tilde W_t\right)^{-1}\tilde W_t^\top\tilde\Upsilon_t^{-1}\right] g_t \tag{283}$$

where the second line comes from the Woodbury identity and can be computed in $O((R+M)^2P + (R+M)^3)$.
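The Woodbury solve in Eq. (283) only ever forms $(R+M)\times(R+M)$ matrices; a minimal sketch (our naming conventions):

```python
import numpy as np

def dlr_solve(Upsilon_diag, W, rhs):
    """Compute (Diag(Upsilon_diag) + W W^T)^{-1} @ rhs via Woodbury, Eq. (283).

    Upsilon_diag: (P,) positive diagonal; W: (P, K) low-rank factor; rhs: (P,).
    Costs O(K^2 P + K^3) instead of the O(P^3) dense solve.
    """
    Uinv_rhs = rhs / Upsilon_diag
    Uinv_W = W / Upsilon_diag[:, None]
    K = W.shape[1]
    core = np.eye(K) + W.T @ Uinv_W              # K x K capacitance matrix
    return Uinv_rhs - Uinv_W @ np.linalg.solve(core, W.T @ Uinv_rhs)
```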
Applying the SVD projection gives

$$W_t = \tilde U_t[:, :R]\ \tilde\Lambda_t[:R, :R] \tag{284}$$
$$\Upsilon_t = \tilde\Upsilon_t + \mathrm{diag}\!\left(\tilde W_t\tilde W_t^\top - W_tW_t^\top\right) \tag{285}$$
$$(\tilde U_t, \tilde\Lambda_t, \_) = \mathrm{SVD}(\tilde W_t) \tag{286}$$

which takes $O((R+M)^2P)$ for the SVD. Therefore the BONG-MC-EF-DLR update is defined by Eqs. (281) and (283) to (286) and takes $O((R+M)^2P + (R+M)^3)$. The BONG-LIN-EF-DLR update comes from replacing $\frac{1}{\sqrt{M}}\hat G^{(1:M)}_t$ with $g^{\rm LIN}_t$ in Eq. (281) and replacing $I_{R+M}$ with $I_{R+1}$ in Eq. (283). This update takes $O((R+1)^2P + (R+1)^3)$.

E.5.2 BONG-LIN-HESS DLR (LO-FI)

Applying Proposition 4.2 to Eqs. (281) and (283) gives the BONG-LIN-HESS-DLR update:

$$\mu_t = \mu_{t|t-1} + \left[\tilde\Upsilon_t^{-1} - \tilde\Upsilon_t^{-1}\tilde W_t\left(I_{R+C} + \tilde W_t^\top\tilde\Upsilon_t^{-1}\tilde W_t\right)^{-1}\tilde W_t^\top\tilde\Upsilon_t^{-1}\right] H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{287}$$
$$W_t = \tilde U_t[:, :R]\ \tilde\Lambda_t[:R, :R] \tag{288}$$
$$\Upsilon_t = \Upsilon_{t|t-1} + \mathrm{diag}\!\left(\tilde W_t\tilde W_t^\top - W_tW_t^\top\right) \tag{289}$$
$$(\tilde U_t, \tilde\Lambda_t, \_) = \mathrm{SVD}(\tilde W_t) \tag{290}$$
$$\tilde W_t = \left[W_{t|t-1},\ H_t^\top A_t^\top\right] \tag{291}$$
$$A_t = \mathrm{chol}\!\left(R_t^{-1}\right) \tag{292}$$

This is equivalent to LO-FI [Chang et al., 2023] and takes $O((R+C)^2P + (R+C)^3)$.

E.5.3 BLR DLR (SLANG)

Substituting DLR forms for $q_{\psi_{t|t-1}}$ and $q_{\psi_{t,i-1}}$ into the FC precision update from Eq. (136) and using the MC-EF approximation yields the posterior precision

$$\Sigma^{-1}_{t,i} = (1-\alpha)\left(\Upsilon_{t,i-1} + W_{t,i-1}W_{t,i-1}^\top\right) + \alpha\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right) - \alpha G^{\rm MC\text{-}EF}_{t,i} \tag{293}$$
$$= \tilde\Upsilon_{t,i} + \tilde W_{t,i}\tilde W_{t,i}^\top \tag{294}$$
$$\tilde\Upsilon_{t,i} = (1-\alpha)\Upsilon_{t,i-1} + \alpha\Upsilon_{t|t-1} \tag{295}$$
$$\tilde W_{t,i} = \left[\sqrt{1-\alpha}\,W_{t,i-1},\ \sqrt{\alpha}\,W_{t|t-1},\ \sqrt{\tfrac{\alpha}{M}}\,\hat G^{(1:M)}_{t,i}\right] \tag{296}$$

Note $\tilde W_{t,i} \in \mathbb{R}^{P \times (2R+M)}$. Using this posterior precision in the mean update from Eq. (135) yields

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\left(\tilde\Upsilon_{t,i} + \tilde W_{t,i}\tilde W_{t,i}^\top\right)^{-1}\left[\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + g_{t,i}\right] \tag{297}$$
$$= \mu_{t,i-1} + \alpha\left[\tilde\Upsilon_{t,i}^{-1} - \tilde\Upsilon_{t,i}^{-1}\tilde W_{t,i}\left(I_{2R+M} + \tilde W_{t,i}^\top\tilde\Upsilon_{t,i}^{-1}\tilde W_{t,i}\right)^{-1}\tilde W_{t,i}^\top\tilde\Upsilon_{t,i}^{-1}\right]\left[\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + g_{t,i}\right] \tag{298}$$

where the second line comes from the Woodbury identity and can be computed in $O((2R+M)^2P + (2R+M)^3)$. Applying the SVD projection gives

$$W_{t,i} = \tilde U_{t,i}[:, :R]\ \tilde\Lambda_{t,i}[:R, :R] \tag{299}$$
$$\Upsilon_{t,i} = \tilde\Upsilon_{t,i} + \mathrm{diag}\!\left(\tilde W_{t,i}\tilde W_{t,i}^\top - W_{t,i}W_{t,i}^\top\right) \tag{300}$$
$$(\tilde U_{t,i}, \tilde\Lambda_{t,i}, \_) = \mathrm{SVD}(\tilde W_{t,i}) \tag{301}$$

which takes $O((2R+M)^2P)$ for the SVD. Therefore the BLR-MC-EF-DLR update is defined by Eqs. (295), (296) and (298) to (301) and takes $O((2R+M)^2P + (2R+M)^3)$ per iteration. Notice $\tilde W$ has larger rank for BLR than for BONG ($2R+M$ vs. $R+M$) because of the extra $\sqrt{\alpha}\,W_{t|t-1}$ term that originates in the KL part of the VI loss. This difference will slow BLR-MC-EF-DLR relative to BONG-MC-EF-DLR, especially when $R$ is not small relative to $M$. This method closely resembles SLANG [Mishkin et al., 2018] in the batch setting where we replace $q_{\psi_{t|t-1}}$ with a spherical prior $\mathcal{N}(0, \lambda^{-1}I_P)$ (see Appendix E.6). We can also define a BLR-LIN-EF-DLR method by replacing $\sqrt{\alpha/M}\,\hat G^{(1:M)}_t$ with $\sqrt{\alpha}\,g^{\rm LIN}_t$ in Eq. (296) and $I_{2R+M}$ with $I_{2R+1}$ in Eq. (298). This update takes $O((2R+1)^2P + (2R+1)^3)$ per iteration.

E.5.4 BLR-LIN-HESS DLR

Applying Proposition 4.2 to Eqs. (296) and (298) gives the BLR-LIN-HESS-DLR update:

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\left[\tilde\Upsilon_{t,i}^{-1} - \tilde\Upsilon_{t,i}^{-1}\tilde W_{t,i}\left(I_{2R+C} + \tilde W_{t,i}^\top\tilde\Upsilon_{t,i}^{-1}\tilde W_{t,i}\right)^{-1}\tilde W_{t,i}^\top\tilde\Upsilon_{t,i}^{-1}\right]\left[\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i})\right] \tag{302}$$
$$W_{t,i} = \tilde U_{t,i}[:, :R]\ \tilde\Lambda_{t,i}[:R, :R] \tag{303}$$
$$\Upsilon_{t,i} = \tilde\Upsilon_{t,i} + \mathrm{diag}\!\left(\tilde W_{t,i}\tilde W_{t,i}^\top - W_{t,i}W_{t,i}^\top\right) \tag{304}$$
$$(\tilde U_{t,i}, \tilde\Lambda_{t,i}, \_) = \mathrm{SVD}(\tilde W_{t,i}) \tag{305}$$
$$\tilde W_{t,i} = \left[\sqrt{1-\alpha}\,W_{t,i-1},\ \sqrt{\alpha}\,W_{t|t-1},\ \sqrt{\alpha}\,H_{t,i}^\top A_{t,i}^\top\right] \tag{306}$$
$$A_{t,i} = \mathrm{chol}\!\left(R_{t,i}^{-1}\right) \tag{307}$$
$$\tilde\Upsilon_{t,i} = (1-\alpha)\Upsilon_{t,i-1} + \alpha\Upsilon_{t|t-1} \tag{308}$$

This update takes $O((2R+C)^2P + (2R+C)^3)$ per iteration. As with the EF versions of BLR-DLR, $\tilde W$ has larger rank for BLR-LIN-HESS-DLR than for BONG-LIN-HESS-DLR ($2R+C$ vs. $R+C$). This difference will slow BLR-LIN-HESS-DLR relative to BONG-LIN-HESS-DLR, especially when $R$ is not small relative to $C$.
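The rank-$R$ projection of Eqs. (284)-(286) (and (299)-(301)) keeps the top singular directions of the expanded factor while preserving the diagonal of the precision; a sketch (our naming):

```python
import numpy as np

def svd_project(Upsilon_diag, W_tilde, R):
    """Project a P x (R+M) precision factor back to rank R, Eqs. (284)-(286).

    Moves the discarded low-rank mass onto the diagonal so that
    diag(Upsilon + W W^T) is preserved exactly.
    """
    U, s, _ = np.linalg.svd(W_tilde, full_matrices=False)
    W = U[:, :R] * s[:R]                          # top-R singular pairs
    diag_correction = np.einsum('pk,pk->p', W_tilde, W_tilde) \
                    - np.einsum('pr,pr->p', W, W)  # diag(Wt Wt^T - W W^T)
    return Upsilon_diag + diag_correction, W
```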
E.5.5 BOG DLR

Substituting Eqs. (269) to (271) into Eq. (31) gives the BOG-DLR update

$$\mu_t = \mu_{t|t-1} + \alpha g_t \tag{309}$$
$$\Upsilon_t = \Upsilon_{t|t-1} - \frac{\alpha}{2}\,\mathrm{diag}\!\left(\Sigma_{t|t-1}G_t\Sigma_{t|t-1}\right) \tag{310}$$
$$W_t = W_{t|t-1} - \alpha\,\Sigma_{t|t-1}G_t\Sigma_{t|t-1}W_{t|t-1} \tag{311}$$

where $\Sigma_{t|t-1} = \left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)^{-1}$. Using the EF approximation and Woodbury, the BOG-MC-EF-DLR update can be rewritten as

$$\Upsilon_t = \Upsilon_{t|t-1} + \frac{\alpha}{2M}\,\mathrm{diag}\!\left(B_tB_t^\top\right) \tag{312}$$
$$W_t = W_{t|t-1} + \frac{\alpha}{M}\,B_tB_t^\top W_{t|t-1} \tag{313}$$
$$B_t = \left[\Upsilon_{t|t-1}^{-1} - \Upsilon_{t|t-1}^{-1}W_{t|t-1}\left(I_R + W_{t|t-1}^\top\Upsilon_{t|t-1}^{-1}W_{t|t-1}\right)^{-1}W_{t|t-1}^\top\Upsilon_{t|t-1}^{-1}\right]\hat G^{(1:M)}_t \tag{314}$$

which takes $O(RMP + MR^2 + R^3)$. The BOG-LIN-EF-DLR update comes from replacing $\hat G^{(1:M)}_t$ with $g^{\rm LIN}_t$ in Eq. (314) and dropping the $M^{-1}$ factors in Eqs. (312) and (313). This update takes $O(RP + R^3)$.

E.5.6 BOG-LIN-HESS DLR

Applying Proposition 4.2 to Eqs. (309) and (312) to (314) gives the BOG-LIN-HESS-DLR update

$$\mu_t = \mu_{t|t-1} + \alpha H_t^\top R_t^{-1}(y_t - \hat y_t) \tag{315}$$
$$\Upsilon_t = \Upsilon_{t|t-1} + \frac{\alpha}{2}\,\mathrm{diag}\!\left(B_tB_t^\top\right) \tag{316}$$
$$W_t = W_{t|t-1} + \alpha\,B_tB_t^\top W_{t|t-1} \tag{317}$$
$$B_t = \left[\Upsilon_{t|t-1}^{-1} - \Upsilon_{t|t-1}^{-1}W_{t|t-1}\left(I_R + W_{t|t-1}^\top\Upsilon_{t|t-1}^{-1}W_{t|t-1}\right)^{-1}W_{t|t-1}^\top\Upsilon_{t|t-1}^{-1}\right]H_t^\top A_t^\top \tag{318}$$

This update takes $O(C(C+R)P + CR^2 + R^3)$.

E.5.7 BBB DLR

Substituting Eqs. (265) and (269) to (273) into Eq. (34) gives the BBB-DLR update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha g_{t,i} \tag{319}$$
$$\Upsilon_{t,i} = \Upsilon_{t,i-1} + \frac{\alpha}{2}\,\mathrm{diag}\!\left(\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top - G_{t,i}\right)\Sigma_{t,i-1}\right) \tag{320}$$
$$W_{t,i} = W_{t,i-1} + \alpha\,\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top - G_{t,i}\right)\Sigma_{t,i-1}W_{t,i-1} \tag{321}$$

The previous covariance can be written using Woodbury as

$$\Sigma_{t,i-1} = \Upsilon_{t,i-1}^{-1} - \Upsilon_{t,i-1}^{-1}W_{t,i-1}\left(I_R + W_{t,i-1}^\top\Upsilon_{t,i-1}^{-1}W_{t,i-1}\right)^{-1}W_{t,i-1}^\top\Upsilon_{t,i-1}^{-1} \tag{322}$$

The BBB-MC-EF-DLR update can be computed efficiently by expanding terms in Eqs. (320) to (322). For example the terms involving $G_{t,i}$ can be calculated as

$$\mathrm{diag}\!\left(\Sigma_{t,i-1}G^{\rm MC\text{-}EF}_{t,i}\Sigma_{t,i-1}\right) = -\frac{1}{M}\,\mathrm{diag}\!\left(\Upsilon_{t,i-1}^{-1}\hat G^{(1:M)}_{t,i}\hat G^{(1:M)\top}_{t,i}\Upsilon_{t,i-1}^{-1} - \Upsilon_{t,i-1}^{-1}\hat G^{(1:M)}_{t,i}B_{t,i}^\top - B_{t,i}\hat G^{(1:M)\top}_{t,i}\Upsilon_{t,i-1}^{-1} + B_{t,i}B_{t,i}^\top\right) \tag{323}$$
$$\Sigma_{t,i-1}G^{\rm MC\text{-}EF}_{t,i}\Sigma_{t,i-1}W_{t,i-1} = -\frac{1}{M}\left(\Upsilon_{t,i-1}^{-1}\hat G^{(1:M)}_{t,i}\hat G^{(1:M)\top}_{t,i}\Upsilon_{t,i-1}^{-1}W_{t,i-1} - \Upsilon_{t,i-1}^{-1}\hat G^{(1:M)}_{t,i}B_{t,i}^\top W_{t,i-1} - B_{t,i}\hat G^{(1:M)\top}_{t,i}\Upsilon_{t,i-1}^{-1}W_{t,i-1} + B_{t,i}B_{t,i}^\top W_{t,i-1}\right) \tag{324}$$
$$B_{t,i} = \Upsilon_{t,i-1}^{-1}W_{t,i-1}\left(I_R + W_{t,i-1}^\top\Upsilon_{t,i-1}^{-1}W_{t,i-1}\right)^{-1}W_{t,i-1}^\top\Upsilon_{t,i-1}^{-1}\hat G^{(1:M)}_{t,i} \tag{325}$$

Using this strategy the update takes $O((R+M)RP + MR^2 + R^3)$. The BBB-LIN-EF-DLR update comes from replacing $\hat G^{(1:M)}_t$ with $g^{\rm LIN}_t$ and dropping the $M^{-1}$ factors in Eqs. (323) to (325). This update takes $O(R^2P + R^3)$.

E.5.8 BBB-LIN-HESS DLR

Applying Proposition 4.2 to Eqs. (319) to (321) gives the BBB-LIN-HESS-DLR update

$$\mu_{t,i} = \mu_{t,i-1} + \alpha\left(\Upsilon_{t|t-1} + W_{t|t-1}W_{t|t-1}^\top\right)\left(\mu_{t|t-1} - \mu_{t,i-1}\right) + \alpha H_{t,i}^\top R_{t,i}^{-1}(y_t - \hat y_{t,i}) \tag{326}$$
$$\Upsilon_{t,i} = \Upsilon_{t,i-1} + \frac{\alpha}{2}\,\mathrm{diag}\!\left(\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top + H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right)\Sigma_{t,i-1}\right) \tag{327}$$
$$W_{t,i} = W_{t,i-1} + \alpha\,\Sigma_{t,i-1}\left(\Upsilon_{t|t-1} - \Upsilon_{t,i-1} + W_{t|t-1}W_{t|t-1}^\top - W_{t,i-1}W_{t,i-1}^\top + H_{t,i}^\top R_{t,i}^{-1}H_{t,i}\right)\Sigma_{t,i-1}W_{t,i-1} \tag{328}$$

This can be computed in $O((C+R)^2P + CR^2 + R^3)$ using Eq. (322) and following a computational approach similar to Eqs. (323) to (325).

E.6 Batch BLR

It is interesting to translate the BLR updates derived here back to the batch setting where BLR was developed [Khan and Rue, 2023], by replacing $\mathcal{N}(\mu_{t|t-1}, \Sigma_{t|t-1})$ with a centered spherical prior $\mathcal{N}(0, \lambda^{-1}I_P)$.
The batch BLR-FC update becomes

$$\mu_i = \mu_{i-1} + \alpha\Sigma_i\left(g_i - \lambda\mu_{i-1}\right) \tag{329}$$
$$\Sigma^{-1}_i = (1-\alpha)\Sigma^{-1}_{i-1} + \alpha\left(\lambda I_P - G_i\right) \tag{330}$$

The batch BLR-LIN-HESS-FC update becomes

$$\mu_i = \mu_{i-1} + \alpha\Sigma_i\left(H_i^\top R_i^{-1}(y_t - \hat y_i) - \lambda\mu_{i-1}\right) \tag{331}$$
$$\Sigma^{-1}_i = (1-\alpha)\Sigma^{-1}_{i-1} + \alpha\left(\lambda I_P + H_i^\top R_i^{-1}H_i\right) \tag{332}$$

The batch BLR-FC_MOM update becomes

$$\mu_i = \mu_{i-1} + \alpha\Sigma_{i-1}\left(g_i - \lambda\mu_{i-1}\right) \tag{333}$$
$$\Sigma_i = (1+\alpha)\Sigma_{i-1} + \alpha\Sigma_{i-1}\left(G_i - \lambda I_P\right)\Sigma_{i-1} \tag{334}$$

The batch BLR-LIN-HESS-FC_MOM update becomes

$$\mu_i = \mu_{i-1} + \alpha\Sigma_{i-1}\left(H_i^\top R_i^{-1}(y_t - \hat y_i) - \lambda\mu_{i-1}\right) \tag{335}$$
$$\Sigma_i = (1+\alpha)\Sigma_{i-1} - \alpha\Sigma_{i-1}\left(\lambda I_P + H_i^\top R_i^{-1}H_i\right)\Sigma_{i-1} \tag{336}$$

The batch BLR-DIAG update becomes

$$\mu_i = \mu_{i-1} + \alpha\sigma^2_i\left(g_i - \lambda\mu_{i-1}\right) \tag{337}$$
$$\sigma^{-2}_i = (1-\alpha)\sigma^{-2}_{i-1} + \alpha\,\mathrm{diag}\!\left(\lambda I_P - G_i\right) \tag{338}$$

The MC-HESS version of this update is equivalent to VON [Khan et al., 2018b] if we use the MC approximation with $M = 1$. The MC-EF version with $M = 1$ is equivalent to VOGN [Khan et al., 2018b]. The batch BLR-LIN-HESS-DIAG update becomes

$$\mu_i = \mu_{i-1} + \alpha\sigma^2_i\left(H_i^\top R_i^{-1}(y_t - \hat y_i) - \lambda\mu_{i-1}\right) \tag{339}$$
$$\sigma^{-2}_i = (1-\alpha)\sigma^{-2}_{i-1} + \alpha\,\mathrm{diag}\!\left(\lambda I_P + H_i^\top R_i^{-1}H_i\right) \tag{340}$$

The batch BLR-DIAG_MOM update becomes

$$\mu_i = \mu_{i-1} + \alpha\sigma^2_{i-1}\left(g_i - \lambda\mu_{i-1}\right) \tag{341}$$
$$\sigma^2_i = (1+\alpha)\sigma^2_{i-1} + \alpha\sigma^4_{i-1}\,\mathrm{diag}\!\left(G_i - \lambda I_P\right) \tag{342}$$

The batch BLR-LIN-HESS-DIAG_MOM update becomes

$$\mu_i = \mu_{i-1} + \alpha\sigma^2_{i-1}\left(H_i^\top R_i^{-1}(y_t - \hat y_i) - \lambda\mu_{i-1}\right) \tag{343}$$
$$\sigma^2_i = (1+\alpha)\sigma^2_{i-1} - \alpha\sigma^4_{i-1}\,\mathrm{diag}\!\left(\lambda I_P + H_i^\top R_i^{-1}H_i\right) \tag{344}$$

The batch BLR-DLR update becomes

$$\mu_i = \mu_{i-1} + \alpha\left[\tilde\Upsilon_i^{-1} - \tilde\Upsilon_i^{-1}\tilde W_i\left(I_{R+M} + \tilde W_i^\top\tilde\Upsilon_i^{-1}\tilde W_i\right)^{-1}\tilde W_i^\top\tilde\Upsilon_i^{-1}\right]\left(g_i - \lambda\mu_{i-1}\right) \tag{345}$$
$$W_i = \tilde U_i[:, :R]\ \tilde\Lambda_i[:R, :R] \tag{346}$$
$$\Upsilon_i = \tilde\Upsilon_i + \mathrm{diag}\!\left(\tilde W_i\tilde W_i^\top - W_iW_i^\top\right) \tag{347}$$
$$(\tilde U_i, \tilde\Lambda_i, \_) = \mathrm{SVD}(\tilde W_i) \tag{348}$$
$$\tilde W_i = \left[\sqrt{1-\alpha}\,W_{i-1},\ \sqrt{\tfrac{\alpha}{M}}\,\hat G^{(1:M)}_i\right] \tag{349}$$
$$A_i = \mathrm{chol}\!\left(R_i^{-1}\right) \tag{350}$$
$$\tilde\Upsilon_i = (1-\alpha)\Upsilon_{i-1} + \alpha\lambda I_P \tag{351}$$

This is equivalent to SLANG except for the following differences. SLANG processes a minibatch of $M$ examples at each iteration, using a single sample $\hat\theta \sim q_{\psi_{i-1}}$ for each minibatch. It uses a different SVD routine which is slightly faster but stochastic, taken from [Halko et al., 2011]. Most significantly, SLANG applies the SVD before the mean update, meaning $\mu_i$ is calculated using the rank-$R$ $W_i$ and $\Upsilon_i$ instead of the rank-$(R+M)$ $\tilde W_i$ and $\tilde\Upsilon_i$, thus ignoring the non-diagonal information in the $M$ discarded singular vectors from $\tilde W_i$. The batch BLR-LIN-HESS-DLR update becomes

$$\mu_i = \mu_{i-1} + \alpha\left[\tilde\Upsilon_i^{-1} - \tilde\Upsilon_i^{-1}\tilde W_i\left(I_{R+C} + \tilde W_i^\top\tilde\Upsilon_i^{-1}\tilde W_i\right)^{-1}\tilde W_i^\top\tilde\Upsilon_i^{-1}\right]\left(H_i^\top R_i^{-1}(y_t - \hat y_i) - \lambda\mu_{i-1}\right) \tag{352}$$
$$W_i = \tilde U_i[:, :R]\ \tilde\Lambda_i[:R, :R] \tag{353}$$
$$\Upsilon_i = \tilde\Upsilon_i + \mathrm{diag}\!\left(\tilde W_i\tilde W_i^\top - W_iW_i^\top\right) \tag{354}$$
$$(\tilde U_i, \tilde\Lambda_i, \_) = \mathrm{SVD}(\tilde W_i) \tag{355}$$
$$\tilde W_i = \left[\sqrt{1-\alpha}\,W_{i-1},\ \sqrt{\alpha}\,H_i^\top A_i^\top\right] \tag{356}$$
$$A_i = \mathrm{chol}\!\left(R_i^{-1}\right) \tag{357}$$
$$\tilde\Upsilon_i = (1-\alpha)\Upsilon_{i-1} + \alpha\lambda I_P \tag{358}$$

This algorithm could be called SLANG-LIN-HESS and would be deterministic and faster than SLANG since it does not need MC sampling. We can also define SLANG-LIN-EF, which would be even faster, by replacing $\sqrt{\alpha/M}\,\hat G^{(1:M)}_i$ with $\sqrt{\alpha}\,g^{\rm LIN}_i$ in Eq. (349) and $I_{R+M}$ with $I_{R+1}$ in Eq. (345).

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: Section 1 states two main theoretical contributions (proved in Propositions 4.1 and 4.2) and summarizes the experimental findings reported in Section 5.
2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We discuss limitations in Section 6. We report asymptotic efficiency in Table 3 and actual running time in Fig. 4.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: We state two formal theorems and provide complete proofs in Appendix C, which we sketch in the main text in Sections 4.1 and 4.4.
4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We give pseudocode for the main methods in Appendix A and complete update equations for all algorithms in Appendix E. We give full details of the experimental methods in Section 5.

Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We will post our code on GitHub after the blind review period, including scripts for exactly reproducing all experiments.

Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer.
- Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Section 5 provides links to the data and reports hyperparameters and architecture details.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Plots show ±1 SE shading based on the empirical SD across trials. Trials vary randomly in network initialization, data ordering, and MC samples. This is stated in Section 5.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We report running times. The experiments all run on any standard GPU/TPU, so further details are not necessary.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have read the Code of Ethics and affirm our research adheres to all points therein.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: We believe there are no direct societal impacts (positive or negative) beyond the generic impacts of foundational work on training neural networks.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We provide citations and URLs for the two datasets we use in Section 5.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: There are no new data. As noted above, we will post our model code after the blind review period.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.