# The Bayesian Learning Rule

Journal of Machine Learning Research 24 (2023) 1-46. Submitted 3/22; Revised 6/23; Published 9/23.

**Mohammad Emtiyaz Khan** (emtiyaz.khan@riken.jp), RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

**Håvard Rue** (haavard.rue@kaust.edu.sa), CEMSE Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia

**Editor:** Justin Domke

**Abstract.** We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms, and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

**Keywords:** Bayesian methods, optimization, deep learning, graphical models.

## 1. Introduction

### 1.1 Learning Algorithms

Machine Learning (ML) methods have been extremely successful in solving many challenging problems in fields such as computer vision, natural-language processing, and artificial intelligence. The main idea is to formulate those problems as prediction problems and learn a model on existing data to predict future outcomes. For example, to recognize objects in images, we collect $N$ images $x_i \in \mathbb{R}^D$ with object labels $y_i \in \{1, 2, \dots, K\}$, and learn a model $f_\theta(x)$ with parameters $\theta \in \mathbb{R}^P$ to predict the label of a new image.
Learning algorithms are often employed to estimate $\theta$ by Empirical Risk Minimization (ERM),

$$\theta_* = \arg\min_\theta\, \bar\ell(\theta), \quad \text{where } \bar\ell(\theta) = \sum_{i=1}^N \ell(y_i, f_\theta(x_i)) + R(\theta). \tag{1}$$

Here, $\ell(y, f_\theta(x))$ is a loss function that encourages the model to predict well, and $R(\theta)$ is a regularizer that prevents it from overfitting. A wide variety of such learning algorithms exist in the literature to solve a variety of learning problems, for example, ridge regression, Kalman filters, gradient descent, and Newton's method. These algorithms play a key role in the success of modern ML.

©2023 Mohammad Emtiyaz Khan and Håvard Rue. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0291.html.

Learning algorithms are often derived by borrowing and combining ideas from a diverse set of fields, such as statistics, optimization, and computer science. For example, the field of probabilistic graphical models (Koller and Friedman, 2009; Bishop, 2006) uses popular algorithms such as ridge regression (Hoerl and Kennard, 1970), Kalman filters (Kalman, 1960), the Viterbi algorithm for Hidden Markov Models (Stratonovich, 1965), and Expectation Maximization (EM) (Dempster et al., 1977). The field of approximate inference builds upon such algorithms to perform inference on complex graphical models, using, for example, Laplace's method (Laplace, 1986; Tierney and Kadane, 1986; Rue and Martino, 2007; Rue et al., 2009, 2017), stochastic variational inference (SVI) (Hoffman et al., 2013; Sato, 2001), and variational message passing (VMP) (Winn and Bishop, 2005).
Similarly, the field of continuous optimization has its own popular methods, such as gradient descent (Cauchy, 1847), Newton's method, and mirror descent (Nemirovski and Yudin, 1978), and deep-learning (DL) methods use them to design new techniques that work with massive models and data sets, for example, stochastic-gradient descent (Robbins and Monro, 1951), RMSprop (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2015), Dropout regularization (Srivastava et al., 2014), and the Straight-Through Estimator (STE) (Bengio et al., 2013). Such mixing of algorithms from diverse fields is a strength of the ML community, and our goal here is to provide common principles to unify, generalize, and improve such algorithms.

### 1.2 The Bayesian Learning Rule

We show that a wide range of well-known learning algorithms from a variety of fields are all specific instances of a single learning algorithm derived from Bayesian principles. The starting point is to extend Eq. 1 by using the variational formulation of Zellner (1988), where we optimize over a well-defined candidate distribution $q(\theta)$, and for which the minimizer

$$q_*(\theta) = \arg\min_{q(\theta)}\; \mathbb{E}_q\Big[\textstyle\sum_{i=1}^N \ell(y_i, f_\theta(x_i))\Big] + D_{KL}[q(\theta)\,\|\,p(\theta)] \tag{2}$$

defines a generalized posterior (Zhang, 1999; Catoni, 2007; Bissiri et al., 2016) in the absence of a precise likelihood. We can rewrite the objective in terms of the entropy $H(q) = \mathbb{E}_q[-\log q(\theta)]$ by expanding the Kullback-Leibler divergence (KLD) as follows,

$$D_{KL}[q(\theta)\,\|\,p(\theta)] = \mathbb{E}_q[\log q(\theta) - \log p(\theta)] = \mathbb{E}_q[R(\theta)] - H(q),$$

up to an additive constant, where we choose the prior to be $p(\theta) \propto \exp(-R(\theta))$. Plugging this into Eq. 2, we get

$$q_*(\theta) = \arg\min_{q(\theta)}\; \mathbb{E}_q[\bar\ell(\theta)] - H(q).$$

When $\exp(-\ell(y_i, f_\theta(x_i)))$ is proportional to the likelihood of $y_i$ for all $i$, then $q_*(\theta)$ is the posterior distribution (Zellner, 1988); see App. A for a proof. We use two components to derive learning algorithms from this Bayesian formulation:

1. A (sub-)class of distributions $\mathcal{Q}$ to optimize over.
In our discussion, we will assume $\mathcal{Q}$ to be the set of a regular and minimal exponential family,

$$q_\lambda(\theta) = h(\theta)\exp\left[\langle\lambda, T(\theta)\rangle - A(\lambda)\right],$$

where $\lambda \in \Omega \subseteq \mathbb{R}^M$ are the natural (or canonical) parameters in a non-empty open set $\Omega$. The cumulant (or log-partition) function $A(\lambda)$ is finite, strictly convex, and differentiable over $\Omega$. Further, $T(\theta)$ is the sufficient statistics, $\langle\cdot,\cdot\rangle$ is an inner product, and $h(\theta)$ is some function. The expectation parameter $\mu(\lambda) = \mathbb{E}_{q_\lambda}[T(\theta)] \in \mathcal{M}$ is a (bijective) function of $\lambda$. Examples later will include the multivariate normal distribution and the Bernoulli distribution (see Table 2 for a list).

2. An optimizing algorithm, called the Bayesian learning rule (BLR), that locates the best candidate $q_{\lambda_*}(\theta)$ in $\mathcal{Q}$ by updating the candidate $q_{\lambda_t}(\theta)$ with natural parameter $\lambda_t$ at iteration $t$, using a sequence of learning rates $\rho_t > 0$:

$$\lambda_{t+1} \leftarrow \lambda_t - \rho_t\, \tilde\nabla_\lambda\left[\mathbb{E}_{q_{\lambda_t}}[\bar\ell(\theta)] - H(q_{\lambda_t})\right]. \tag{3}$$

The updates use natural gradients (Amari, 1998), denoted by $\tilde\nabla_\lambda$ and defined as

$$\tilde\nabla_\lambda\, \mathbb{E}_{q_\lambda}[\cdot] = F(\lambda_t)^{-1}\left[\nabla_\lambda\, \mathbb{E}_{q_\lambda}[\cdot]\big|_{\lambda=\lambda_t}\right] = \nabla_\mu\, \mathbb{E}_{q_\lambda}[\cdot]\big|_{\mu=\nabla_\lambda A(\lambda_t)}, \tag{4}$$

which rescale the vanilla gradients $\nabla_\lambda$ with the Fisher information matrix (FIM) $F(\lambda) = \nabla^2_\lambda A(\lambda)$ to adjust for the geometry of the natural-parameter space $\Omega$. The second equality follows from the chain rule, expressing natural gradients as vanilla gradients with respect to $\mu = \nabla_\lambda A(\lambda)$ (Malagò et al., 2011; Raskutti and Mukherjee, 2015). Throughout, we will use this property to simplify natural-gradient computations. These details are discussed in Sec. 2, where we show that the BLR update can also be seen as a mirror-descent algorithm, where the geometry of the Bregman divergence is dictated by the chosen exponential family. Throughout, we will assume $\lambda_t \in \Omega$ for all $t$, which in practice might require a line search or a projection step.
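To make the $\lambda$-$\mu$ correspondence concrete, here is a small numerical sketch (our own illustration, not from the paper) for a one-dimensional Gaussian $N(\theta|m, S^{-1})$, whose sufficient statistics are $T(\theta) = (\theta, \theta^2)$. The gradient of the log-partition function $A(\lambda)$, computed by finite differences, indeed returns the expectation parameters $\mu = (\mathbb{E}[\theta], \mathbb{E}[\theta^2])$:

```python
import numpy as np

# 1-D Gaussian N(theta | m, 1/S) as an exponential family, T(theta) = (theta, theta^2)
m, S = 0.7, 2.5                          # hypothetical mean and precision
lam = np.array([S * m, -0.5 * S])        # natural parameters

def A(l):
    # log-partition function (up to an additive constant)
    return -l[0]**2 / (4 * l[1]) - 0.5 * np.log(-2 * l[1])

# expectation parameters mu = grad A(lam), here via central finite differences
eps = 1e-6
mu = np.array([(A(lam + eps * e) - A(lam - eps * e)) / (2 * eps) for e in np.eye(2)])

# they should equal E[theta] = m and E[theta^2] = 1/S + m^2
assert np.allclose(mu, [m, 1 / S + m**2], atol=1e-5)
```

The same pattern (differentiate $A$, read off moments) underlies all natural-gradient computations below.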
The main message of this paper is that many well-known learning algorithms, such as those used in optimization, deep learning, and machine learning in general, can be derived directly following the above scheme via a single algorithm (the BLR) that optimizes Eq. 2. Different exponential families $\mathcal{Q}$ give rise to different algorithms and, within those, the various approximations to natural gradients that are needed give rise to many variants. These results are extended to mixture candidates in Sec. 3.2, and we expect them to hold in general for other classes of candidate distributions.

Our use of natural gradients here is not a matter of choice. In fact, natural gradients are inherently present in all solutions of the Bayesian objective in Eq. 2. For example, the natural parameter $\lambda_*$ of a solution $q_{\lambda_*}(\theta)$ is a fixed point of Eq. 3, and at convergence

$$\lambda_* = \lambda_* - \rho\,\tilde\nabla_\lambda\left[\mathbb{E}_{q_{\lambda_*}}[\bar\ell(\theta)] - H(q_{\lambda_*})\right] \;\implies\; \lambda_* = \tilde\nabla_\lambda\, \mathbb{E}_{q_{\lambda_*}}[-\bar\ell(\theta)]. \tag{5}$$

In the second equality, we assume that $h(\theta)$ is a constant, so that $\tilde\nabla_\lambda H(q_\lambda) = -\lambda$ (see App. B). The equality states that the natural parameter $\lambda_*$ must be equal to the natural gradient of $\mathbb{E}_{q_{\lambda_*}}[-\bar\ell(\theta)]$; therefore, estimation of $q_{\lambda_*}(\theta)$ is tightly coupled to the computation of natural gradients. Despite this, the importance of natural gradients is entirely missed in the Bayesian/variational inference literature, including textbooks, reviews, and tutorials on this topic (Bishop, 2006; Murphy, 2012; Blei et al., 2017; Zhang et al., 2018a), where natural gradients are often put in a special category. Our work fixes this issue.

We will show that natural gradients retrieve essential higher-order information about the loss landscape, which is then assigned to appropriate natural parameters using Eq. 5. This information-matching is due to the presence of the entropy term there, which is an important quantity for the optimality of Bayes in general (Jaynes, 1982; Zellner, 1988) and generally absent in non-Bayesian formulations, for instance, Eq. 1.
The entropy term in general leads to exponential weighting in Bayes' rule (Littlestone and Warmuth, 1994; Vovk, 1990). In our context, it gives rise to natural gradients and, as we will soon see, automatically determines the complexity of the derived algorithm through the complexity of the class of distributions $\mathcal{Q}$, yielding a principled way to develop new algorithms. Overall, our work demonstrates the importance of natural gradients and information geometry for algorithm design in ML. This is similar in spirit to Information Geometric Optimization (Ollivier et al., 2017), which focuses on the optimization of black-box, deterministic functions. In contrast, we derive generic learning algorithms by using a Bayesian objective to which an additional entropy term is added. The BLR we use is a generalization of the method of Khan and Lin (2017); Khan and Nielsen (2018), proposed specifically for variational inference. Here, we establish it as a general learning rule to derive many old and new learning algorithms. These include both Bayesian and non-Bayesian algorithms, going way beyond its original proposal. We do not claim that these successful algorithms work well because they are derived from the BLR. Rather, we use the BLR to simply unravel the inherent Bayesian nature of these good algorithms. In this sense, the BLR can be seen as a variant of Bayes' rule, useful for generic algorithm design.

### 1.3 Examples

To fix ideas, we will now use the BLR to derive three classical algorithms: gradient descent, Newton's method, and ridge regression. Such derivations are repeated throughout the paper to recover various algorithms in different fields. A full list of the learning algorithms derived in this paper is given in Table 1. We not only unify and generalize existing algorithms but also derive new ones.
These include: (i) a new multimodal optimization algorithm, (ii) new uncertainty-estimation algorithms (OGN and VOGN), (iii) the BayesBiNN algorithm for binary neural networks, and (iv) non-conjugate variational inference algorithms.

For notational simplicity, we will write $q_{\lambda_t}$ as $q_t$ in the rest of the paper. Throughout this section, we choose $\mathcal{Q}$ to be the set of multivariate Gaussians and use the following simplified form of the BLR,

$$\lambda_{t+1} \leftarrow (1-\rho_t)\lambda_t - \rho_t\, \nabla_\mu\, \mathbb{E}_{q_t}\left[\bar\ell(\theta) + \log h(\theta)\right]. \tag{6}$$

This is obtained by using $\tilde\nabla_\lambda H(q) = -\lambda - \nabla_\mu\,\mathbb{E}_q[\log h(\theta)]$; see the derivation in App. B.

#### 1.3.1 Gradient Descent

The gradient-descent algorithm uses the update $\theta_{t+1} \leftarrow \theta_t - \rho_t \nabla_\theta \bar\ell(\theta_t)$, using only the first-order information $\nabla_\theta\bar\ell(\theta_t)$ (the gradient evaluated at $\theta_t$). We choose $q(\theta) = N(\theta|m, I)$, a multivariate Gaussian with unknown mean $m$ and known covariance matrix set to $I$ for simplicity; a more general case is given in Eq. 35.

We can now simply plug the definitions of $\lambda$, $\mu$, and $h(\theta)$ into Eq. 6. For $q(\theta) = N(\theta|m, I)$, we have $\lambda = \mu = m$ (Table 2). We also have $\log h(\theta) = -\tfrac12 P\log(2\pi) - \tfrac12\theta^\top\theta$. Using these in Eq. 6, the update follows directly:

$$m_{t+1} \leftarrow m_t - \rho_t\, \nabla_m\, \mathbb{E}_{N(\theta|m,I)}[\bar\ell(\theta)]\big|_{m=m_t}. \tag{7}$$

This is gradient descent, but over the expected loss $\mathbb{E}_q[\bar\ell(\theta)]$. We can remove the expectation by using the first-order delta method (Dorfman, 1938; Ver Hoef, 2012) (App. C), where we approximate the expectation of a function by its value at the mean, $\nabla_m\,\mathbb{E}_q[\bar\ell(\theta)] \approx \nabla_\theta\bar\ell(\theta)\big|_{\theta=m}$. With this, the BLR equals the gradient-descent algorithm with $m_t$ as the iterate $\theta_t$. The two choices, first of the distribution $\mathcal{Q}$ and second of the delta method, give us gradient descent from the BLR.

The delta method allows us to interpret non-Bayesian solutions as crude approximations of Bayesian ones. By approximating the expectation in the BLR with a greedy approximation, we are back to minimizing a deterministic (non-Bayesian) objective. This observation suggests that Eq. 7 should perform better than gradient descent. In Sec.
4.4, we will revisit this point and see that averaging in fact leads to improved uncertainty and better robustness properties. Using the Dirac delta distribution instead of the delta method is not advisable, as its use in this context is fundamentally flawed: the entropy goes to $-\infty$, making the Bayesian objective meaningless (Welling et al., 2008).

#### 1.3.2 Newton's Method

Newton's method is a second-order method,

$$\theta_{t+1} \leftarrow \theta_t - \left[\nabla^2_\theta\bar\ell(\theta_t)\right]^{-1}\nabla_\theta\bar\ell(\theta_t), \tag{8}$$

which too can be derived from the BLR by expanding the class $\mathcal{Q}$ to $q(\theta) = N(\theta|m, S^{-1})$ with an unknown precision matrix $S$. This example illustrates a property of the BLR: the complexity of the derived algorithm is directly related to the complexity of the exponential family $\mathcal{Q}$. A fixed covariance previously gave rise to gradient descent (a first-order method); by increasing the complexity so that we also learn the precision matrix, the BLR turns into a (more complex) second-order method.

We use the simplified form of the BLR given in Eq. 6. The function $h(\theta)$ is a constant, and the natural and expectation parameters divide themselves into natural pairs,

$$\lambda^{(1)} = Sm, \quad \lambda^{(2)} = -\tfrac12 S, \quad \mu^{(1)} = \mathbb{E}_q[\theta] = m, \quad \mu^{(2)} = \mathbb{E}_q[\theta\theta^\top] = S^{-1} + mm^\top. \tag{9}$$

The natural gradients can be expressed in terms of the gradient and Hessian of $\bar\ell(\theta)$. To show this, we use the chain rule to first write gradients with respect to $\mu$ in terms of $(m, S^{-1})$ (first equality below), and then use Bonnet's and Price's theorems respectively (Bonnet, 1964; Price, 1958; Rezende et al., 2014) to get the second equality (the full derivation is in App. D),

$$\nabla_{\mu^{(1)}}\,\mathbb{E}_q[\bar\ell(\theta)] = \nabla_m\,\mathbb{E}_q[\bar\ell(\theta)] - 2\left[\nabla_{S^{-1}}\,\mathbb{E}_q[\bar\ell(\theta)]\right]m = \mathbb{E}_q[\nabla_\theta\bar\ell(\theta)] - \mathbb{E}_q[\nabla^2_\theta\bar\ell(\theta)]\,m, \tag{10}$$

$$\nabla_{\mu^{(2)}}\,\mathbb{E}_q[\bar\ell(\theta)] = \nabla_{S^{-1}}\,\mathbb{E}_q[\bar\ell(\theta)] = \tfrac12\,\mathbb{E}_q[\nabla^2_\theta\bar\ell(\theta)]. \tag{11}$$

The expressions show that the natural gradients contain the information about the first- and second-order derivatives of $\bar\ell(\theta)$.
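As a quick sanity check of the identities just used, the following sketch (our own toy example, assuming a quadratic loss $\bar\ell(\theta) = \tfrac12\theta^\top A\theta - b^\top\theta$ so that everything is available in closed form) verifies Bonnet's theorem $\nabla_m\,\mathbb{E}_q[\bar\ell] = \mathbb{E}_q[\nabla_\theta\bar\ell]$ by Monte Carlo; for this loss, $\mathbb{E}_q[\nabla_\theta\bar\ell(\theta)] = Am - b$ exactly:

```python
import numpy as np
rng = np.random.default_rng(7)

# quadratic toy loss lbar(th) = 0.5 th^T A th - b^T th (our own example)
P = 3
B = rng.standard_normal((P, P))
A = B @ B.T + P * np.eye(P)        # symmetric positive-definite curvature
b = rng.standard_normal(P)

# candidate q(th) = N(m, S^{-1})
m = rng.standard_normal(P)
S = 2.0 * np.eye(P)
L = np.linalg.cholesky(np.linalg.inv(S))
th = m + rng.standard_normal((200000, P)) @ L.T   # samples from q

# Bonnet: grad_m E_q[lbar] = E_q[grad_th lbar]; analytically this equals A m - b
mc_grad = (th @ A.T - b).mean(axis=0)             # per-sample gradients A th - b
assert np.allclose(mc_grad, A @ m - b, atol=0.1)
```

Price's theorem is trivially exact here, since the Hessian $A$ is constant and $\nabla_{S^{-1}}\mathbb{E}_q[\bar\ell] = \tfrac12 A$ analytically.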
The BLR now turns into an online variant of Newton's method, where the precision matrix contains an exponentially-smoothed average of Hessians, used as a preconditioner to update the mean,

$$m_{t+1} \leftarrow m_t - \rho_t\, S_{t+1}^{-1}\, \mathbb{E}_{q_t}[\nabla_\theta\bar\ell(\theta)] \quad \text{and} \quad S_{t+1} \leftarrow (1-\rho_t)\, S_t + \rho_t\, \mathbb{E}_{q_t}[\nabla^2_\theta\bar\ell(\theta)]. \tag{12}$$

We can recover the classical Newton's method in three steps. First, apply the delta method (see Eqs. 65 and 66 in App. C),

$$\mathbb{E}_q[\nabla_\theta\bar\ell(\theta)] \approx \nabla_\theta\bar\ell(\theta)\big|_{\theta=m} \quad \text{and} \quad \mathbb{E}_q[\nabla^2_\theta\bar\ell(\theta)] \approx \nabla^2_\theta\bar\ell(\theta)\big|_{\theta=m}. \tag{13}$$

Second, set the learning rate to 1, which is justified when the loss is strongly convex or the algorithm is initialized close to the solution. Finally, treat the mean $m_t$ as the iterate $\theta_t$.

#### 1.3.3 Ridge Regression

Why do we get a second-order method when we increase the complexity of the Gaussian? This is due to the natural gradients which, depending on the complexity of the distribution, retrieve essential higher-order information about the loss landscape. We illustrate this now through the simplest non-trivial case of ridge regression. Denote the input matrix by $X \in \mathbb{R}^{N\times D}$ and the output by $y \in \mathbb{R}^N$. The loss is quadratic, $\bar\ell(\theta) = \tfrac12(y - X\theta)^\top(y - X\theta) + \tfrac12\delta\,\theta^\top\theta$ with $\delta > 0$ as the regularization parameter, and the solution is available in closed form: $\theta_* = (X^\top X + \delta I)^{-1}X^\top y$.

Natural gradients for this case are extremely easy to derive. We first note that the expected loss is linear in $\mu$, that is,

$$\mathbb{E}_q[\bar\ell(\theta)] = -y^\top X\mu^{(1)} + \mathrm{Tr}\left[\tfrac12\left(X^\top X + \delta I\right)\mu^{(2)}\right] + \tfrac12 y^\top y,$$

and therefore the natural gradients are simply the coefficients in front of $\mu^{(1)}$ and $\mu^{(2)}$:

$$\nabla_{\mu^{(1)}}\,\mathbb{E}_q[\bar\ell(\theta)] = -X^\top y \quad \text{and} \quad \nabla_{\mu^{(2)}}\,\mathbb{E}_q[\bar\ell(\theta)] = \tfrac12\left(X^\top X + \delta I\right). \tag{14}$$

These already contain parts of the solution $\theta_*$, which we can recover by using Eq. 5,

$$\lambda_* = \nabla_\mu\, \mathbb{E}_{q_{\lambda_*}}[-\bar\ell(\theta)] \;\implies\; S_* m_* = X^\top y, \quad S_* = X^\top X + \delta I,$$

and solving to get the mean $m_* = \theta_*$. By increasing the complexity of $\mathcal{Q}$, natural gradients can retrieve appropriate higher-order information, which is then assigned to the corresponding natural parameters using Eq. 5. This is the main reason why different algorithms are obtained when we change the class $\mathcal{Q}$. We discuss this point further in Sec.
3 when relating Bayesian principles to those of optimization.

| Learning Algorithm | Posterior Approx. | Natural-Gradient Approx. | Sec. |
|---|---|---|---|
| **Optimization Algorithms** | | | |
| Gradient Descent | Gaussian (fixed cov.) | Delta method | 1.3.1 |
| Newton's method | Gaussian | | 1.3.2 |
| Multimodal optimization (New) | Mixture of Gaussians | | 3.2 |
| **Deep-Learning Algorithms** | | | |
| SGD | Gaussian (fixed cov.) | Delta method, stochastic approx. | 4.1 |
| RMSprop/Adam | Gaussian (diagonal cov.) | Delta method, stochastic approx., Hessian approx., square-root scaling, slow-moving scale vectors | 4.2 |
| Dropout | Mixture of Gaussians | Delta method, stochastic approx., responsibility approx. | 4.3 |
| STE | Bernoulli | Delta method, stochastic approx. | 4.5 |
| Online Gauss-Newton (OGN) (New) | Gaussian (diagonal cov.) | Gauss-Newton Hessian approx. in Adam & no square-root scaling | 4.4 |
| Variational OGN (New) | | Remove delta method from OGN | 4.4 |
| BayesBiNN (New) | Bernoulli | Remove delta method from STE | 4.5 |
| **Approximate Bayesian Inference Algorithms** | | | |
| Conjugate Bayes | Exponential family | Set learning rate $\rho_t = 1$ | 5.1 |
| Laplace's method | Gaussian | Delta method | 4.4 |
| Expectation Maximization | Exponential family + Gaussian | Delta method for the parameters | 5.2 |
| Stochastic Variational Inference (SVI) | Exponential family (mean-field) | Stochastic approx., local $\rho_t = 1$ | 5.3 |
| Variational Message Passing (VMP) | | $\rho_t = 1$ for all nodes | 5.3 |
| Non-Conjugate VMP | | | 5.3 |
| Non-Conjugate Variational Inference (New) | Mixture of Exponential family | None | 5.4 |

Table 1: A summary of learning algorithms derived from the BLR, along with the required approximations to the posterior and natural gradients. New algorithms are marked with "New". Abbreviations: cov. = covariance, STE = Straight-Through Estimator.

### 1.4 Outline of the Rest of the Paper

The rest of the paper is organized as follows. In Sec. 2, we give two derivations of the BLR, by using natural-gradient descent and mirror descent, respectively. In Sec.
3, we summarize the main principles for the derivation of optimization algorithms and give guidelines for the design of new multimodal-optimization algorithms. In Sec. 4, we discuss the derivation of existing deep-learning algorithms and design new algorithms for uncertainty estimation. In Sec. 5, we discuss the derivation of algorithms for Bayesian inference, for both conjugate and non-conjugate models. In Sec. 6, we conclude.

## 2. Derivation of the Bayesian Learning Rule

This section contains two derivations of the BLR. First, we interpret it as natural-gradient descent by using a second-order expansion of the KLD, which strengthens its intuition. Second, we give a more formal derivation specifically for exponential families, where we use a mirror-descent algorithm leveraging the connection to Legendre duality. This derivation is more direct and is obtained without the second-order approximation. It reveals that certain parameterizations are preferable for natural-gradient descent over Bayesian objectives.

### 2.1 Bayesian Learning Rule as Natural-Gradient Descent

By expanding the KLD term and collecting the log-prior term with the loss functions, we can write the objective in Eq. 2 as $\mathcal{L}(\lambda) = \mathbb{E}_q[\bar\ell(\theta) + \log q(\theta)]$. The classical gradient-descent algorithm to minimize $\mathcal{L}(\lambda)$ performs the update

$$\lambda_{t+1} \leftarrow \lambda_t - \rho_t\,\nabla_\lambda\mathcal{L}(\lambda_t). \tag{15}$$

The insight motivating natural-gradient algorithms is that Eq. 15 solves the following:

$$\lambda_{t+1} \leftarrow \arg\min_\lambda\; \langle\nabla_\lambda\mathcal{L}(\lambda_t), \lambda\rangle + \frac{1}{2\rho_t}\|\lambda - \lambda_t\|_2^2, \tag{16}$$

which reveals the implicit Euclidean penalty on changes in the parameters. The parameters $\lambda_t$ parameterize probability distributions, and therefore their updates should be penalized based on the distance in the space of distributions: the Euclidean distance between two parameter configurations might be a poor measure of the distance between the corresponding distributions.
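The point is easy to see numerically. In the sketch below (our own illustration), two pairs of one-dimensional Gaussians have the same Euclidean distance between their means, yet their KL divergences differ by four orders of magnitude: moving the mean by 0.1 matters a lot for a sharp distribution and hardly at all for a broad one.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # KL[N(m1, v1) || N(m2, v2)] for 1-D Gaussians
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2)**2) / v2 - 1)

# same Euclidean distance between means (0.1), very different KL divergences
kl_narrow = kl_gauss(0.0, 1e-2, 0.1, 1e-2)   # sharp distributions
kl_wide   = kl_gauss(0.0, 1e2, 0.1, 1e2)     # broad distributions
assert np.isclose(kl_narrow, 0.5)
assert kl_narrow / kl_wide > 1e3
```

This is exactly the mismatch that replacing the Euclidean penalty in Eq. 16 with a KLD penalty is meant to repair.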
Natural-gradient algorithms (Amari, 1998; Martens, 2020) instead use an alternative penalty; using the KLD, we get

$$\lambda_{t+1} \leftarrow \arg\min_\lambda\; \langle\nabla_\lambda\mathcal{L}(\lambda_t), \lambda\rangle + \frac{1}{\rho_t}\, D_{KL}[q_\lambda(\theta)\,\|\,q_{\lambda_t}(\theta)]. \tag{17}$$

A closed-form update can be obtained with a second-order expansion of the KLD term, which follows from the fact that the Hessian of the KLD with respect to $\lambda$ is equal to the FIM $F(\lambda)$ of $q_\lambda(\theta)$ (Pascanu and Bengio, 2013), that is,

$$D_{KL}[q_\lambda(\theta)\,\|\,q_{\lambda_t}(\theta)] \approx \tfrac12(\lambda - \lambda_t)^\top F(\lambda_t)(\lambda - \lambda_t).$$

Using this, we get

$$\lambda_{t+1} \leftarrow \lambda_t - \rho_t\, F(\lambda_t)^{-1}\nabla_\lambda\mathcal{L}(\lambda_t). \tag{18}$$

The descent direction $-F(\lambda_t)^{-1}\nabla_\lambda\mathcal{L}(\lambda_t)$ in this case is referred to as the natural gradient. When $\lambda_t$ is the natural parameter of $q_t(\theta)$, the above coincides with the BLR update of Eq. 3.

We stress that our use of the natural-gradient method is more general than its more popular use for maximum-likelihood estimation (Amari, 1998; Martens, 2020). There, a log-likelihood of the form $\log p(y;\theta)$ is maximized by using the Fisher $F(\theta)$ of $p(y;\theta)$,

$$\theta_{t+1} \leftarrow \theta_t - \rho_t\, F(\theta_t)^{-1}\nabla_\theta[-\log p(y;\theta)]. \tag{19}$$

This update uses the same distribution $p(y;\theta)$ to define both the objective and the Fisher, in contrast to Eq. 3, where we are free to use any loss $\bar\ell(\theta)$ in the objective, and the distribution $q(\theta)$ can also be chosen freely, irrespective of the objective. The BLR update is more general, and the above maximum-likelihood update can be obtained as a special instance by using a Gaussian $q$, as shown in Sec. 4.4.

### 2.2 Bayesian Learning Rule as Mirror Descent

The previous derivation holds for a general $q(\theta)$. In this section, we give a derivation specifically for exponential-family distributions by using a mirror-descent formulation, similar to Raskutti and Mukherjee (2015), but for a Bayesian objective. A mirror-descent step defined in the space of $\mu$ is shown to be equivalent to a natural-gradient step in the space of $\lambda$ (the update of Eq. 18).
This is a consequence of the Legendre duality between the spaces of $\lambda$ and $\mu$ (Wainwright and Jordan, 2008), and is also related to the dually-flat Riemannian structures (Amari, 2016) employed in information geometry. Unlike the previous derivation, this derivation is more direct and is obtained without any second-order approximation of the KLD, and it reveals that, for computational reasons, certain parameterizations should be preferred for natural-gradient descent over Bayesian objectives.

There is a duality between the spaces of $\lambda$ and $\mu$, since $\mu = \nabla_\lambda A(\lambda)$ is a bijection. An alternative view of this mapping is through the Legendre transform of $A(\lambda)$,

$$A^*(\mu) = \sup_{\lambda'\in\Omega}\; \langle\mu, \lambda'\rangle - A(\lambda').$$

The mapping follows by setting the gradient of the right-hand side to zero. The reverse mapping is $A(\lambda) = \sup_{\mu'\in\mathcal{M}}\langle\mu', \lambda\rangle - A^*(\mu')$, giving us $\lambda = \nabla_\mu A^*(\mu)$. The expectation parameters $\mu$ provide a dual coordinate system to specify the exponential family with natural parameter $\lambda$, and using $\mu$ we can express $q(\theta)$ as follows (see Banerjee et al. (2005) for details):

$$q(\theta) = h(\theta)\exp\left[-D_{A^*}(T(\theta)\,\|\,\mu) + A^*(T(\theta))\right], \tag{20}$$

where $D_{A^*}(\mu_1\,\|\,\mu_2) = A^*(\mu_1) - A^*(\mu_2) - \langle\mu_1 - \mu_2, \nabla_\mu A^*(\mu_2)\rangle$ is the Bregman divergence defined using the function $A^*(\mu)$. The Bregman and KL divergences are related,

$$D_{KL}[q_1(\theta)\,\|\,q_2(\theta)] = D_{A^*}(\mu_1\,\|\,\mu_2) = D_A(\lambda_2\,\|\,\lambda_1). \tag{21}$$

The Bregman divergence is equal to the KL divergence between the corresponding distributions, which can also be measured in the dual space using the (swapped order of the) natural parameters. We can then use these Bregman divergences to measure distances in the two dual spaces.

The relationship with the KLD enables us to express natural-gradient descent in one space as mirror descent in the dual space. Specifically, the following mirror-descent update in the space $\mathcal{M}$ is equivalent to the natural-gradient descent of Eq. 18 in the space $\Omega$,

$$\mu_{t+1} \leftarrow \arg\min_{\mu\in\mathcal{M}}\; \langle\nabla_\mu\bar{\mathcal{L}}(\mu_t), \mu\rangle + \frac{1}{\rho_t}\, D_{A^*}(\mu\,\|\,\mu_t). \tag{22}$$

Here, $\bar{\mathcal{L}}(\mu) = \mathcal{L}(\lambda)$ is the reparameterized objective expressed in terms of $\mu$.
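For the Bernoulli family, all of these dual quantities are available in closed form, so Eq. 21 can be checked directly. In the sketch below (our own illustration), $A(\lambda) = \log(1 + e^\lambda)$ with $\mu = \sigma(\lambda)$, its Legendre dual $A^*(\mu)$ is the negative entropy, and the Bregman divergence $D_{A^*}(\mu_1\,\|\,\mu_2)$ matches both the Bernoulli KL divergence and the swapped-order divergence $D_A(\lambda_2\,\|\,\lambda_1)$:

```python
import numpy as np

A     = lambda lam: np.log1p(np.exp(lam))                  # log-partition
Astar = lambda mu: mu*np.log(mu) + (1-mu)*np.log(1-mu)     # Legendre dual (neg. entropy)
sig   = lambda lam: 1 / (1 + np.exp(-lam))                 # mu = grad A(lambda)

def breg(f, grad_f, x1, x2):
    # Bregman divergence D_f(x1 || x2)
    return f(x1) - f(x2) - (x1 - x2) * grad_f(x2)

lam1, lam2 = 0.3, -1.2              # two Bernoulli natural parameters (logits)
mu1, mu2 = sig(lam1), sig(lam2)     # their dual (expectation) parameters

kl = mu1*np.log(mu1/mu2) + (1-mu1)*np.log((1-mu1)/(1-mu2))      # Bernoulli KL
d_astar = breg(Astar, lambda mu: np.log(mu/(1-mu)), mu1, mu2)   # D_{A*}(mu1 || mu2)
d_a     = breg(A, sig, lam2, lam1)                              # D_A(lam2 || lam1), swapped order
assert np.allclose([d_astar, d_a], kl)
```

Note that $\nabla_\mu A^*(\mu) = \log\tfrac{\mu}{1-\mu}$ is exactly the reverse mapping $\lambda = \nabla_\mu A^*(\mu)$ described above.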
The proof is straightforward: from the definition of the Bregman divergence, we find that

$$\nabla_\mu D_{A^*}(\mu\,\|\,\mu_t) = \lambda - \lambda_t, \tag{23}$$

and if we take the derivative of Eq. 22 with respect to $\mu$ and set it to zero, we get the BLR update. The derivation shows that the second-order approximation in the previous derivation (used to obtain Eq. 18) is in fact exact for exponential-family distributions. Due to this, we get an equivalence between natural-gradient descent and mirror descent. The reverse statement also holds: a mirror-descent update in the space $\Omega$ leads to natural-gradient descent in the space $\mathcal{M}$. From a Bayesian viewpoint, Eq. 22 is closer to Bayes' rule, since an addition in the natural-parameter space $\Omega$ is equivalent to a multiplication in the space $\mathcal{Q}$ (see Sec. 5.1). Natural parameters additionally encode conditional independence (e.g., for Gaussians), which is preferable from a computational viewpoint (see Malagò and Pistone (2015) for a similar suggestion). In any case, a reverse version of the BLR can always be used if the need arises.

## 3. Optimization Algorithms from the Bayesian Learning Rule

### 3.1 First- and Second-Order Methods by using Gaussian Candidates

The examples in Sec. 1.3 demonstrate the key ideas behind our recipe to derive learning algorithms:

1. Natural gradients retrieve essential higher-order information about the loss landscape and assign it to the appropriate natural parameters.
2. Both of these choices are dictated by the entropy of the chosen class $\mathcal{Q}$, which automatically determines the complexity of the derived algorithms.

These are concisely summarized in the optimality condition of the Bayes objective (derived in Eq. 5),

$$\nabla_\mu\, \mathbb{E}_{q_*}[\bar\ell(\theta)] = \nabla_\mu H(q_*). \tag{24}$$

From this, we can derive as a special case the optimality condition of the non-Bayesian problem over $\bar\ell(\theta)$ given in Eq. 1. For instance, for Gaussians $N(\theta|m, I)$, the entropy is constant. Therefore the right-hand side of Eq. 24 is zero, giving the first equality below,

$$\nabla_m\, \mathbb{E}_{q_*}[\bar\ell(\theta)]\big|_{m=m_*} = 0 \;\;\xrightarrow{\text{Bonnet's thm.}}\;\; \mathbb{E}_{q_*}[\nabla_\theta\bar\ell(\theta)] = 0 \;\;\xrightarrow{\text{delta}}\;\; \nabla_\theta\bar\ell(\theta)\big|_{\theta=\theta_*} = 0. \tag{25}$$

The simplifications shown at the right are obtained by using Bonnet's theorem (App. D) and the delta method (App. C), respectively, to finally recover the first-order optimality condition at a local minimizer $\theta_*$ of $\bar\ell(\theta)$. Clearly, the above natural gradient contains the information about the first-order derivative of the loss, giving us an equation that determines the value of the natural parameter $m_*$.

When the complexity of the Gaussian is increased to include the covariance parameters, with candidates $N(\theta|m, S^{-1})$, the entropy is no longer constant but depends on $S$: we have $H(q) = \tfrac12\log|2\pi e\, S^{-1}|$. The fixed point of the BLR (now Eq. 12) yields an additional condition, shown on the left,

$$\nabla_{S^{-1}}\, \mathbb{E}_{q_*}[\bar\ell(\theta)]\big|_{S^{-1}=S_*^{-1}} = \tfrac12 S_* \;\;\xrightarrow{\text{Price's thm.}}\;\; \mathbb{E}_{q_*}[\nabla^2_\theta\bar\ell(\theta)] = S_* \;\;\xrightarrow{\text{delta}}\;\; \nabla^2_\theta\bar\ell(\theta)\big|_{\theta=\theta_*} \succ 0. \tag{26}$$

The simplifications at the right are obtained by using Price's theorem (App. D) and the delta method (App. C), respectively, to recover the second-order optimality condition of a local minimizer $\theta_*$ of $\bar\ell(\theta)$. Here, $\succ 0$ denotes the positive-definite condition, which follows from $S_* \succ 0$. The condition above matches the second-order derivatives to the precision matrix $S_*$. In general, more complex sufficient statistics in $q(\theta)$ can enable us to retrieve higher-order derivatives of the loss through natural gradients.

### 3.2 Multimodal Optimization by using Mixture-Candidate Distributions

It is natural to expect that, by further increasing the complexity of the class $\mathcal{Q}$, we can obtain algorithms that go beyond standard Newton-like optimizers. We illustrate this point by using mixture distributions to obtain a new algorithm for multimodal optimization (Yu and Gen, 2010). The goal is to simultaneously locate multiple local minima within a single run of the algorithm (Wong, 2015).
Mixture distributions are ideal for this task: individual components can be tasked with locating different local minima, and diversity among them is encouraged by the entropy, forcing them to take responsibility for different regions. Our goal in this section is to illustrate this principle. Note that we do not aim to propose new algorithms that solve multimodal optimization in its entirety. We will rely on the work of Lin et al. (2019b,a), who derive a natural-gradient update, similar to Eq. 3, but for mixtures.

Consider, for example, the finite mixture of Gaussians $q(\theta) = \sum_{k=1}^K \pi_k\, N(\theta|m_k, S_k^{-1})$, with $m_k$ and $S_k$ as the mean and precision of the $k$-th Gaussian component, and $\pi_k$ the component probabilities with $\pi_K = 1 - \sum_{k=1}^{K-1}\pi_k$. In general, the FIM of such mixtures can be singular, making it difficult to use natural-gradient updates. However, the joint $q(\theta, z = k) = \pi_k\, N(\theta|m_k, S_k^{-1})$, where $z \in \{1, 2, \dots, K\}$ is a mixture-component indicator, has a well-defined FIM which can be used instead. We now briefly describe this for the mixture-of-Gaussians case with fixed $\pi_k$.

We start by writing the natural and expectation parameters of the joint $q(\theta, z = k)$, denoted by $\lambda_k$ and $\mu_k$ respectively. Since the $\pi_k$ are fixed, these take a form very similar to the single-Gaussian case in Eq. 9, as shown below (we denote the indicator function by $\mathbb{1}_{[z=k]}$),

$$\lambda_k^{(1)} = S_k m_k, \quad \mu_k^{(1)} = \mathbb{E}_q\big[\mathbb{1}_{[z=k]}\,\theta\big] = \pi_k m_k, \quad \mu_k^{(2)} = \mathbb{E}_q\big[\mathbb{1}_{[z=k]}\,\theta\theta^\top\big] = \pi_k\left(S_k^{-1} + m_k m_k^\top\right), \tag{27}$$

where $\lambda_k = \{\lambda_k^{(1)}, \lambda_k^{(2)}\}$ and $\mu_k = \{\mu_k^{(1)}, \mu_k^{(2)}\}$. With this, we can write a natural-gradient update for $\lambda_k$, for all $k = 1, \dots, K$, by using the gradient with respect to $\mu_k$ (Lin et al., 2019b, Theorem 3),

$$\lambda_{k,t+1} \leftarrow \lambda_{k,t} - \rho_t\, \nabla_{\mu_k}\left[\mathbb{E}_{q_t}[\bar\ell(\theta)] - H(q_t)\right], \tag{28}$$

where $\lambda_{k,t}$ denotes the value of the natural parameter $\lambda_k$ at iteration $t$, and $H(q) = \mathbb{E}_q[-\log q(\theta)]$ is the entropy of the mixture $q(\theta)$ (not of the joint $q(\theta, z)$). This is an extension of the BLR (Eq. 3) to finite mixtures.
As before, the natural gradients retrieve first- and second-order information. Specifically, as shown in App. E, the update for each $k$ takes a Newton-like form,

$$m_{k,t+1} \leftarrow m_{k,t} - \rho_t\, S_{k,t+1}^{-1}\, \mathbb{E}_{q_{k,t}}\left[\nabla_\theta\bar\ell(\theta) + \nabla_\theta\log q_t(\theta)\right], \tag{29}$$

$$S_{k,t+1} \leftarrow S_{k,t} + \rho_t\, \mathbb{E}_{q_{k,t}}\left[\nabla^2_\theta\bar\ell(\theta) + \nabla^2_\theta\log q_t(\theta)\right], \tag{30}$$

where $q_{k,t}(\theta) = N(\theta|m_{k,t}, S_{k,t}^{-1})$ is the $k$-th component at iteration $t$. The mean and precision of a component are updated using the expectations of the gradient and Hessian, respectively. Similarly to the Newton step of Eq. 12, the first update is preconditioned with the covariance. The expectation helps to focus on a region where the component has high probability mass.

The update can locate multiple solutions in a single run when each component takes responsibility for an individual minimum. For example, for a problem with two local minima $\theta_{1,*}$ and $\theta_{2,*}$, the best candidate with two components $N(\theta|m_{1,*}, S_{1,*}^{-1})$ and $N(\theta|m_{2,*}, S_{2,*}^{-1})$ satisfies the optimality condition

$$\mathbb{E}_{N(\theta|m_{k,*},\, S_{k,*}^{-1})}\Big[\nabla_\theta\bar\ell(\theta) + \underbrace{r(\theta)\, S_{1,*}(m_{1,*} - \theta) + (1 - r(\theta))\, S_{2,*}(m_{2,*} - \theta)}_{=\,\nabla_\theta\log q_*(\theta)}\Big] = 0, \tag{31}$$

for $k = 1, 2$, where we use the expression for $\nabla_\theta\log q_*(\theta)$ from Eq. 69 in App. E, and $r(\theta)$ is the responsibility function defined as

$$r(\theta) = \frac{\pi\, N(\theta|m_{1,*}, S_{1,*}^{-1})}{\pi\, N(\theta|m_{1,*}, S_{1,*}^{-1}) + (1-\pi)\, N(\theta|m_{2,*}, S_{2,*}^{-1})}. \tag{32}$$

Suppose that each component takes responsibility for one local minimum, that is, for $\theta$ sampled from the first component $r(\theta) \approx 1$, and for those sampled from the second component $r(\theta) \approx 0$ (see Eq. 74 in App. E, which gives the explicit assumptions). Under this condition, both the second and third terms in Eq. 31 are negligible, and the mean $m_{k,*}$ approximately satisfies the first-order optimality condition to qualify as a minimizer of $\bar\ell(\theta)$,

$$\mathbb{E}_{N(\theta|m_{k,*},\, S_{k,*}^{-1})}[\nabla_\theta\bar\ell(\theta)] \approx 0 \;\;\xrightarrow{\text{delta}}\;\; \nabla_\theta\bar\ell(\theta)\big|_{\theta=m_{k,*}} \approx 0, \quad \text{for } k = 1, 2. \tag{33}$$

This roughly implies that $m_{k,*} \approx \theta_{k,*}$, meaning the two components have located the two local minima.
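The updates in Eqs. 29-30 can be turned into a short working sketch. The code below is our own illustration (not the authors' implementation): it runs the two-component update on an assumed one-dimensional double-well loss $\bar\ell(\theta) = 10(\theta^2 - 1)^2$, estimating the expectations by Monte Carlo and the derivatives of $\log q$ by finite differences. Each component mean settles near one of the two minima at $\theta = \pm 1$.

```python
import numpy as np
rng = np.random.default_rng(0)

# double-well toy loss with minima at theta = +/-1 (our own example)
a = 10.0
dl  = lambda th: 4*a*th*(th**2 - 1.0)    # gradient of lbar
d2l = lambda th: a*(12*th**2 - 4.0)      # Hessian of lbar

pi = np.array([0.5, 0.5])                # fixed mixing weights
m  = np.array([-2.0, 2.0])               # one component initialised on each side
S  = np.array([20.0, 20.0])              # component precisions

def logq(th):
    comp = pi * np.sqrt(S/(2*np.pi)) * np.exp(-0.5*S*(np.asarray(th)[..., None] - m)**2)
    return np.log(comp.sum(-1))

rho, n, eps = 0.05, 800, 1e-3
for _ in range(1000):
    for k in range(2):
        th = m[k] + rng.standard_normal(n)/np.sqrt(S[k])   # samples from q_k
        gq = (logq(th + eps) - logq(th - eps))/(2*eps)     # d/dth log q
        hq = (logq(th + eps) - 2*logq(th) + logq(th - eps))/eps**2
        S[k] = S[k] + rho*np.mean(d2l(th) + hq)            # Eq. 30
        m[k] = m[k] - rho*np.mean(dl(th) + gq)/S[k]        # Eq. 29

assert m[0] < 0 < m[1] and np.all(np.abs(np.abs(m) - 1.0) < 0.2)
```

Because the components are well separated here, $\mathbb{E}_{q_k}[\nabla^2_\theta\log q] \approx -S_k$, so the precision update behaves like the smoothing in Eq. 12 and the means satisfy Eq. 33 approximately.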
The example illustrates the role of entropy in encouraging diversity through the responsibility function. Several practical difficulties remain to be overcome before a good algorithmic outcome is ensured, such as setting the correct number of components, appropriate initialization, and of course degenerate solutions where two components collapse onto each other. The example simply illustrates the usefulness of being Bayesian. Other mixture distributions can also be used, as shown in Lin et al. (2019b), who consider the following: finite mixtures of minimal exponential-families, scale mixtures of Gaussians (Andrews and Mallows, 1974; Eltoft et al., 2006), the Birnbaum-Saunders distribution (Birnbaum and Saunders, 1969), the multivariate skew-Gaussian distribution (Azzalini, 2005), the multivariate exponentially-modified Gaussian distribution (Grushka, 1972), the normal inverse Gaussian distribution (Barndorff-Nielsen, 1997), and the matrix-variate Gaussian distribution (Gupta and Nagar, 2018; Louizos and Welling, 2016).

4. Deep-Learning Algorithms from the Bayesian Learning Rule

4.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is one of the most popular deep-learning algorithms due to its computational efficiency. The computational speedup is obtained by approximating the gradient of the loss by a stochastic gradient built using a small subset $\mathcal{M}$ of $M$ examples with $M \ll N$. The subset $\mathcal{M}$, commonly referred to as a mini-batch, is randomly drawn at each iteration from the full data set using a uniform distribution, and the resulting mini-batch gradient

$\hat{\nabla}_\theta \ell(\theta) = \dfrac{N}{M} \sum_{i \in \mathcal{M}} \nabla_\theta \ell(y_i, f_\theta(x_i)) + \nabla_\theta R(\theta),$   (34)

is much faster to compute. Similarly to the gradient-descent case in Sec. 1.3.1, the SGD step can be interpreted as the BLR over a distribution $q(\theta) = \mathcal{N}(\theta|m, I)$ with unknown mean $m$ and the gradient in Eq. 7 replaced by the mini-batch gradient.
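The scaling by $N/M$ in Eq. 34 is what makes the estimator unbiased. A small check of this property, on our own toy least-squares data (not from the paper), enumerates every possible mini-batch and verifies that the average of the mini-batch gradients equals the full gradient exactly:

```python
import itertools
import numpy as np

# Check that the mini-batch gradient of Eq. 34 is unbiased: averaging
# (N/M) * sum_{i in batch} grad_i + grad_R over ALL batches of size M
# recovers the full gradient exactly.
rng = np.random.default_rng(1)
N, D, M, delta = 6, 3, 2, 0.1
X, y = rng.standard_normal((N, D)), rng.standard_normal(N)
theta = rng.standard_normal(D)

def grad_i(i):            # per-example gradient of 0.5*(y_i - x_i^T theta)^2
    return (X[i] @ theta - y[i]) * X[i]

def grad_R():             # gradient of the regularizer 0.5*delta*||theta||^2
    return delta * theta

full_grad = sum(grad_i(i) for i in range(N)) + grad_R()

batches = list(itertools.combinations(range(N), M))   # all possible mini-batches
mb_grads = [N / M * sum(grad_i(i) for i in b) + grad_R() for b in batches]
avg_mb_grad = np.mean(mb_grads, axis=0)

print(np.allclose(avg_mb_grad, full_grad))  # True: the estimator is unbiased
```

Each example appears in the same fraction of mini-batches, so the $N/M$ factor exactly compensates for the subsampling; the regularizer gradient is not subsampled and is added in full.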
The BLR update is extended to candidates $q(\theta) = \mathcal{N}(\theta|m, S^{-1})$ with any fixed $S \succ 0$ to get an SGD-like update where $S^{-1}$ is the preconditioner,

$m_{t+1} \leftarrow m_t - \rho_t\, S^{-1} \hat{\nabla}_\theta \ell(\theta)\big|_{\theta = m_t}.$   (35)

This is derived similarly to Sec. 1.3.1 by using the various quantities associated with $\mathcal{N}(\theta|m, S^{-1})$, that is, $\lambda = Sm$, $\mu = m$, and $2 \log h(\theta) = -P \log(2\pi) + \log|S| - \theta^\top S \theta$.

The above BLR interpretation sheds light on the ineffectiveness of first-order methods (like SGD) for estimating posterior approximations. This is a recent trend in Bayesian deep learning, where a preconditioned update is preferred due to the anisotropic SGD noise (Mandt et al., 2017; Chaudhari and Soatto, 2018; Maddox et al., 2019). Unfortunately, a first-order optimizer (such as Eq. 35) offers no clue about estimating $S$. SGD iterates can be used (Mandt et al., 2017; Maddox et al., 2019), but the iterates and the noise in them are affected by the choice of $S$, which makes the estimation difficult. With our Bayesian scheme, the preconditioner is estimated automatically using a second-order method (Eq. 12) when we allow $S$ as a free parameter. We discuss this next in the context of adaptive learning-rate algorithms.

4.2 Adaptive Learning-Rate Algorithms

Adaptive learning-rate algorithms, motivated by Newton's method, adapt the learning rate in SGD with a scaling vector, and we discuss their relationship to the BLR. Early adaptive variants relied on the diagonal of the mini-batch Hessian matrix to reduce the computation cost (Barto and Sutton, 1981; Becker and Le Cun, 1988), and use exponential smoothing to improve stability (Le Cun et al., 1998):

$\theta_{t+1} \leftarrow \theta_t - \alpha\, \dfrac{1}{s_{t+1}} \circ \hat{\nabla}_\theta \ell(\theta_t), \quad \text{and} \quad s_{t+1} \leftarrow (1-\beta)\, s_t + \beta\, \mathrm{diag}\big[\hat{\nabla}^2_\theta \ell(\theta_t)\big],$   (36)

where $a \circ b$ and $a/b$ denote element-wise multiplication and division respectively, and the scaling vector, denoted by $s_t$, uses $\mathrm{diag}[\hat{\nabla}^2_\theta \ell(\theta_t)]$, a vector whose $j$'th entry is the mini-batch Hessian

$\hat{\nabla}^2_{\theta_j} \ell(\theta) = \dfrac{N}{M} \sum_{i \in \mathcal{M}} \big|\nabla^2_{\theta_j} \ell(y_i, f_\theta(x_i))\big| + \nabla^2_{\theta_j} R(\theta).$
Note the use of the absolute value to make sure that the entries of $s_t$ stay positive (assuming $R(\theta)$ to be strongly convex). Exponential smoothing, a popular tool from the nonstationary time-series literature (Brown, 1959; Holt et al., 1960; Gardner Jr, 1985), works extremely well for deep learning and is adopted by many adaptive learning-rate algorithms. For example, the natural Newton method of Le Roux and Fitzgibbon (2010) and the vSGD method of Schaul et al. (2013) both use it to estimate second-order information, while more modern variants, such as RMSprop (Tieleman and Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma and Ba, 2015), use it to estimate the magnitude of the gradient (a cruder approximation of the second-order information (Bottou et al., 2018)).

The BLR naturally gives rise to the update in Eq. 36, where exponential smoothing arises as a consequence and is not something we need to invent. We optimize over candidates of the form $q(\theta) = \mathcal{N}(\theta|m, S^{-1})$ where $S$ is constrained to be a diagonal matrix with a vector $s$ as the diagonal. This choice results in the BLR update shown below (the derivation is identical to Sec. 1.3.2),

$\theta_{t+1} \leftarrow \theta_t - \rho_t\, \dfrac{1}{s_{t+1}} \circ \nabla_\theta \ell(\theta_t), \quad \text{where} \quad s_{t+1} \leftarrow (1-\rho_t)\, s_t + \rho_t\, \mathrm{diag}\big(\nabla^2_\theta \ell(\theta_t)\big),$   (37)

where we write $m_t = \theta_t$ and $\mathrm{diag}(S_t) = s_t$. Replacing the gradient and Hessian by their mini-batch approximations and then employing different learning rates for $\theta_t$ and $s_t$, we recover the update in Eq. 36. Using different learning rates is useful to compensate for the errors due to mini-batches and the diagonal approximation of the Hessian. The similarity of the BLR update can be exploited to compute uncertainty estimates for deep-learning models. Khan et al. (2018) study the relationship between Eq. 37 and modern adaptive learning-rate algorithms, such as RMSprop (Tieleman and Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma and Ba, 2015).
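The behavior of Eq. 37 can be seen on a simple quadratic (our own toy example, where gradients and Hessians are exact): the scale vector $s_t$ is an exponential moving average that converges to the diagonal Hessian, and $\theta_t$ takes Newton-like steps preconditioned by $1/s_t$.

```python
import numpy as np

# Sketch of the BLR update in Eq. 37 on a quadratic with diagonal Hessian:
# l(theta) = 0.5 * theta^T H theta - b^T theta, minimizer theta* = b / diag(H).
H_diag = np.array([2.0, 5.0])
b = np.array([2.0, 5.0])            # so theta* = [1, 1]

theta, s = np.zeros(2), np.ones(2)
rho = 0.3
for _ in range(200):
    grad = H_diag * theta - b               # exact gradient of the quadratic
    s = (1 - rho) * s + rho * H_diag        # Eq. 37: smoothed diagonal Hessian
    theta = theta - rho * grad / s          # Eq. 37: preconditioned step

print(theta, s)  # theta -> [1, 1] and s -> [2, 5], the diagonal Hessian
```

Since the Hessian is constant here, the smoothing simply forgets the initialization of $s$; in the stochastic setting it additionally averages out mini-batch noise, which is the stabilizing effect described above.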
These modern variants also use exponential smoothing, but over the gradient magnitudes $\hat{\nabla}_\theta \ell(\theta_t) \circ \hat{\nabla}_\theta \ell(\theta_t)$ instead of the Hessian. For example, RMSprop uses the following update,

$\theta_{t+1} \leftarrow \theta_t - \alpha\, \dfrac{1}{\sqrt{v_{t+1}} + c\mathbf{1}} \circ \big[\hat{\nabla}_\theta \ell(\theta_t)\big], \quad \text{where} \quad v_{t+1} \leftarrow (1-\beta)\, v_t + \beta \big[\hat{\nabla}_\theta \ell(\theta_t) \circ \hat{\nabla}_\theta \ell(\theta_t)\big].$   (38)

There are also some other minor differences to Eq. 37; for example, the scaling uses a square root to take care of the factor $1/M^2$ in the square of the mini-batch gradient, and a small $c > 0$ is added to avoid (near-)zero $v_{t+1}$ (Khan et al., 2018, Sec. 3.4). Despite these small differences, the similarity of the two updates in Eqs. 37 and 38 enables us to incorporate Bayesian principles in deep learning. Uncertainty is then computed using a slightly modified RMSprop update (Osawa et al., 2019) which, with just slight changes in the code, can exploit all the tricks-of-the-trade of deep learning, such as batch normalization, data augmentation, clever initialization, and distributed training (also see Sec. 4.4). The BLR update extends the application of adaptive learning-rate algorithms to uncertainty estimation in deep learning.

The update in Eq. 37 also demonstrates that the BLR can naturally deal with stochastic approximations. The update is an online extension of Newton's method where the scale vector uses exponential smoothing. Such smoothing is much more stable than directly using a mini-batch approximation of the Hessian in Newton's method. The BLR can lead to second-order methods that are most suitable for online, stochastic, or incremental learning.

The BLR update can also be used to recover the Adam optimizer (Kingma and Ba, 2015), which is inarguably one of the most popular optimizers. The Adam optimizer is closely related to a BLR variant which uses a momentum-based mirror descent. Momentum is a common technique to improve convergence speed in deep learning (Sutskever et al., 2013).
The classical momentum method is based on Polyak's heavy-ball method (Polyak, 1964),

$\theta_{t+1} \leftarrow \theta_t - \alpha \hat{\nabla}_\theta \ell(\theta_t) + \gamma_t (\theta_t - \theta_{t-1}),$

with a fixed momentum coefficient $\gamma_t > 0$, but the momentum used in adaptive learning-rate algorithms, such as Adam, differs in the use of scaling with $s_t$ in front of the momentum term $\theta_t - \theta_{t-1}$ (see Eq. 78 in App. F). These variants can be better explained by including a momentum term in the BLR within the mirror-descent framework of Sec. 2, which gives the following modification of the BLR (see App. F):

$\lambda_{t+1} \leftarrow \lambda_t - \rho_t\, \tilde{\nabla}_\lambda \big(\mathbb{E}_{q_t}[\ell(\theta)] - \mathcal{H}(q_t)\big) + \gamma_t (\lambda_t - \lambda_{t-1}).$

For a Gaussian distribution, this recovers both SGD and Newton's method with momentum. Variants like Adam can then be obtained by making approximations to the Hessian and by using a square-root scaling; see Eq. 81 in App. F. Finally, we also mention Aitchison (2020), who derives deep-learning optimizers using a Bayesian filtering approach with updating equations similar to the ones we presented, and further discusses modifications to justify the use of the square root in Adam and RMSprop. The approach taken by Ollivier (2018) is similar, but exploits the connection between Kalman filtering and the natural-gradient method.

4.3 Dropout

Dropout is a popular regularization technique to avoid overfitting in deep learning where hidden units are randomly dropped from the neural network (NN) during training (Srivastava et al., 2014). Gal and Ghahramani (2016) interpret SGD-with-dropout as a Bayesian approximation (called MC-dropout), but their crude approximation to the entropy term prohibits extensions to more flexible posterior approximations. In this section, we show that by using the BLR we can improve their approximation and go beyond SGD to derive MC-dropout variants of Newton's method, Adam, and RMSprop.
Denoting the $j$'th unit in the $l$'th layer by $f_{jl}(x)$ for an input $x$, we can write the $i$'th unit of the $(l+1)$'th layer as follows:

$f_{i,l+1}(x) = h\Big(\sum_{j=1}^{n_l} \theta_{ijl}\, z_{jl}\, f_{jl}(x)\Big),$

where the binary weights $z_{jl} \in \{0, 1\}$ are to be defined, $n_l$ is the number of units in the $l$'th layer, all $\theta_{ijl} \in \mathbb{R}$ are the parameters, and $h(\cdot)$ is a nonlinear activation function. For simplicity, we have not included an offset term. Without dropout, all weights $z_{jl}$ are set to one. With dropout, all $z_{jl}$'s are independent Bernoulli variables with the probability of being 1 equal to $\pi_1$ (and fixed). If $z_{jl} = 0$, the $j$'th unit of the $l$'th layer is dropped from the NN, otherwise it is retained. Let $\tilde{\theta}_{ijl} = \theta_{ijl} z_{jl}$, and let $\theta$ and $\tilde{\theta}$ denote the complete sets of parameters. The training can be performed with the following simple modification of SGD, where $\tilde{\theta}$ is used during back-propagation,

$\theta_{t+1} \leftarrow \theta_t - \alpha \hat{\nabla}_\theta \ell(\tilde{\theta}_t).$   (40)

This simple procedure has proved to be very effective in practice, and dropout has become a standard trick-of-the-trade to improve the performance of deep networks trained with SGD.

We can include dropout in the Bayesian learning rule by considering a spike-and-slab mixture distribution for $\theta_{jl} = (\theta_{1jl}, \ldots, \theta_{n_l jl})$,

$q(\theta_{jl}) = \pi_1\, \mathcal{N}(\theta_{jl}|m_{jl}, S_{jl}^{-1}) + (1 - \pi_1)\, \mathcal{N}(\theta_{jl}|0, s_0^{-1} I_{n_l}),$   (41)

where the means and covariances are unknown, but $s_0$ is fixed to a small positive value to emulate a spike at zero. Further, we assume independence for every $j$ and $l$, $q(\theta) = \prod_{jl} q(\theta_{jl})$. For this choice, we arrive in the end at the following update,

$\theta_{jl,t+1} \leftarrow \theta_{jl,t} - \rho_t\, S_{jl,t+1}^{-1} \big[\hat{\nabla}_{\theta_{jl}} \ell(\tilde{\theta}_t)\big]/\pi_1,$   (42)
$S_{jl,t+1} \leftarrow (1-\rho_t)\, S_{jl,t} + \rho_t \big[\hat{\nabla}^2_{\theta_{jl}} \ell(\tilde{\theta}_t)\big]/\pi_1.$   (43)

The derivation in App. G uses two approximations: first, the delta method (Eq. 84, also used by Gal and Ghahramani (2016)), and second, an approximation to the responsibility function (Eq. 32) which assumes that $\theta_{jl}$ is far from 0 (this can be ensured by choosing $s_0$ to be very small; see Eqs. 85 and 86). The update is a variant of Newton's method derived in Eq. 12.
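The essential mechanics of Eq. 40, evaluating the gradient at the dropped weights $\tilde{\theta} = \theta \circ z$ while updating the dense weights $\theta$, can be sketched on a toy linear model (our own setup, not the paper's; a linear model stands in for one NN layer):

```python
import numpy as np

# Minimal sketch of dropout-SGD (Eq. 40): sample Bernoulli masks z, evaluate
# the gradient at theta_tilde = theta * z, update the dense weights theta.
rng = np.random.default_rng(0)
N, D, pi1, alpha = 200, 5, 0.8, 0.05
X = rng.standard_normal((N, D))
theta_true = rng.standard_normal(D)
y = X @ theta_true

def loss(th):
    return 0.5 * np.mean((y - X @ th) ** 2)

theta = np.zeros(D)
loss0 = loss(theta)
for _ in range(500):
    z = rng.random(D) < pi1                  # Bernoulli(pi1) mask per weight
    theta_tilde = theta * z                  # dropped weights
    grad = -X.T @ (y - X @ theta_tilde) / N  # gradient evaluated at theta_tilde
    theta = theta - alpha * grad             # but the dense theta is updated

print(loss(theta) < loss0)  # training with dropout still reduces the loss
```

The same mask-then-differentiate pattern is what Eqs. 42-43 precondition with $S_{jl}^{-1}$ and rescale by $1/\pi_1$.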
The difference is that the gradients and Hessians are evaluated at the dropout weights $\tilde{\theta}_t$. By choosing $S_{jl,t}$ to be a diagonal matrix, as in Sec. 4.2, we can similarly derive RMSprop/Adam variants with dropout.

The BLR derivation enables more flexible posterior approximations than the original derivation by Gal and Ghahramani (2016). We can build on the mixture candidate model to learn $\pi_1$ and also allow it to differ across layers, although by doing so we leave the justification leading to dropout. Such extensions can be obtained using the extension to mixture distributions by Lin et al. (2019b). The update then requires gradients with respect to $\pi_1$, which can be approximated by using the Concrete distribution (Maddison et al., 2016; Jang et al., 2016), where the binary $z_{jl}$ is replaced with an appropriate variable in the unit interval; see also Sec. 4.5 for a similar approach. Gal et al. (2017) proposed a similar extension, but their approach does not allow for more flexible candidate distributions. Our approach extends to adaptive learning-rate optimizers, and can be extended to more complex candidate distributions.

4.4 Uncertainty Estimation in Deep Learning

The observation that some deep-learning algorithms are specific instances of the BLR can be leveraged to improve them. Specifically, we will show that, by improving the approximations, we get improved uncertainty estimates. The solutions will also have better robustness properties, similar to those of the flatter minima found by stochastic methods in deep learning.

Uncertainty estimation in deep learning is crucial for many applications, such as medical diagnosis and self-driving cars, but its computation is extremely challenging due to the large number of parameters and data examples. A straightforward application of Bayesian methods does not usually work.
Standard Bayesian approaches, such as Markov Chain Monte Carlo (MCMC), Stochastic Gradient Langevin Dynamics (SGLD), and Hamiltonian Monte Carlo (HMC), are either infeasible or slow (Balan et al., 2015). Even approximate inference approaches, such as Laplace's method (Mackay, 1991; Ritter et al., 2018), variational inference (Graves, 2011; Blundell et al., 2015), and expectation propagation (Hernandez Lobato and Adams, 2015), do not match the performance of standard point-estimation methods at large scale. On the ImageNet data set, for example, these Bayesian methods performed poorly until recently, when Osawa et al. (2019) applied natural-gradient algorithms (Khan et al., 2018; Zhang et al., 2018b). Such natural-gradient algorithms are in fact variants of the BLR, and we will now discuss their derivation, application, and suitability for uncertainty estimation in deep learning.

We start with Laplace's method, which is one of the simplest approaches to obtain uncertainty estimates from point estimates; see Mackay (1991) for an early application to NNs. Given a minimizer $\theta_*$ of the ERM in Eq. 1 (assume that the loss corresponds to a negative log probability over outputs $y_i$), Laplace's method employs a Gaussian approximation $\mathcal{N}(\theta|\theta_*, S_*^{-1})$ where $S_* = \nabla^2_\theta \ell(\theta_*)$. For models with billions of parameters, even forming $S_*$ is infeasible, and a diagonal approximation is often employed (Ritter et al., 2018). Still, Hessian computation after training requires a full pass through the data, which is expensive and challenging. The BLR updates solve this issue since the Hessian can be obtained during training, for example, using Eq. 37 with a mini-batch gradient and Hessian (denoted by $\hat{\nabla}$),

$\theta_{t+1} \leftarrow \theta_t - \rho_t\, \dfrac{1}{s_{t+1}} \circ \hat{\nabla}_\theta \ell(\theta_t), \quad \text{with} \quad s_{t+1} \leftarrow (1-\rho_t)\, s_t + \rho_t\, \mathrm{diag}\big(\hat{\nabla}^2_\theta \ell(\theta_t)\big).$   (44)

The vectors $s_t$ converge to an unbiased estimate of the diagonal of the Hessian at $\theta_*$. Using the (diagonal) Hessian is still problematic in deep learning because it can sometimes be negative, and the steps can diverge.
An additional issue is that its computation is cumbersome in existing software, which mainly supports first-order derivatives only. Due to this, it is tempting to simply use the RMSprop/Adam update of Eq. 38, but using $(\hat{\nabla}_{\theta_j} \ell(\theta_t))^2$ results in poor performance, as shown by Khan et al. (2018, Thm. 1). Instead, they propose to use the following Gauss-Newton approximation, which is built from first-order derivatives but gives a better estimate of uncertainty:

$\hat{\nabla}^2_{\theta_j} \ell(\theta) \approx \dfrac{N}{M} \sum_{i \in \mathcal{M}} \big[\nabla_{\theta_j} \ell(y_i, f_\theta(x_i))\big]^2 + \nabla^2_{\theta_j} R(\theta).$   (45)

This is referred to as the online Gauss-Newton (OGN) method in Osawa et al. (2019, App. C). The update of Eq. 44 is similar to RMSprop, and can be implemented with minimal changes to existing software; many practical deep-learning tricks can also be applied to boost performance. Osawa et al. (2019) use OGN with batch normalization, data augmentation, learning-rate scheduling, momentum, initialization tricks, and distributed training. With these tricks, OGN gives similar performance to the Adam optimizer, even at large scale (Osawa et al., 2019), while yielding a reasonable estimate of the variance. The compute cost is only slightly higher, but this reflects not an increase in complexity, rather the difficulty of implementing Eq. 45 in existing deep-learning software. Khan et al. (2019) further propose a slightly better (but costlier) approximation based on the FIM of the log-likelihood, to get an algorithm they call the online Generalized Gauss-Newton (OGGN) algorithm. Both OGN and OGGN are scalable BLR variants of Laplace's method for NNs.

OGN's uncertainty estimates can be improved further by relaxing the approximation made by the delta method and using Bayesian averaging through variational inference. This improves over Laplace's method by using the stationarity conditions (Eqs. 25 and 26), where Bayesian averaging is employed through an expectation over $q(\theta)$ (Opper and Archambeau, 2009).
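The quality of the squared-gradient approximation in Eq. 45 depends on the loss. A small check on our own least-squares construction (full batch, no regularizer) illustrates why it is reasonable: for $\ell_i = \tfrac{1}{2}(y_i - x_i^\top\theta)^2$ the exact diagonal Hessian is $\sum_i x_{ij}^2$ while Eq. 45 uses $\sum_i (r_i x_{ij})^2$ with residuals $r_i$, so we evaluate at a point where all residuals are exactly 1 and the two coincide:

```python
import numpy as np

# Compare the squared per-example gradients (Eq. 45, full batch, no regularizer)
# with the exact diagonal Hessian for a least-squares loss, at a point where
# every residual equals 1 so the two are identical.
rng = np.random.default_rng(2)
N, D = 8, 4
X = rng.standard_normal((N, D))
theta0 = rng.standard_normal(D)
y = X @ theta0 - 1.0                  # residuals X @ theta0 - y are all exactly 1

res = X @ theta0 - y                  # = 1 for every example
per_example_grads = res[:, None] * X  # grad of l_i at theta0 is res_i * x_i
gn_diag = np.sum(per_example_grads ** 2, axis=0)   # Eq. 45 approximation
hess_diag = np.sum(X ** 2, axis=0)                 # exact diagonal Hessian

print(np.allclose(gn_diag, hess_diag))  # True at this special point
```

Away from this special point the residuals reweight the terms, which is why Eq. 45 is an approximation; but unlike the squared mini-batch gradient it sums squares per example, which is what makes it a usable curvature estimate.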
The variational solution is slightly different from Laplace's method and is expected to be more robust due to the averaging over $q_*(\theta)$ in Eqs. 25 and 26. Two such situations are illustrated in Fig. 1: one corresponding to an asymmetric loss, where the Bayesian solution avoids a region with extremely high loss (Fig. 1(a)), and another where it seeks a more stable solution involving a wide and shallow minimum as opposed to a sharp and deep minimum (Fig. 1(b)). Such situations are hypothesised to exist in deep-learning problems (Hochreiter and Schmidhuber, 1997, 1995; Keskar et al., 2016), where the good performance of stochastic optimization methods is attributed to their ability to find shallow minima (Dziugaite and Roy, 2017). Similar strategies exist in the stochastic search and global optimization literature, where a kernel is used to convolve/smooth the objective function, as in the Gaussian homotopy continuation method (Mobahi and Fisher III, 2015), optimization by smoothing (Leordeanu and Hebert, 2008), the graduated optimization method (Hazan et al., 2016), and evolution strategies (Huning, 1976; Wierstra et al., 2014). The stochastic noise in these methods is akin to the Bayesian noise injected through the distribution $q(\theta)$. Unlike the unregulated noise in those methods, the benefit of the Bayesian approach is that the noise source $q(\theta)$ can adapt itself to the data by using the BLR.

The variational-Bayesian version is obtained by simply removing the approximation due to the delta method from Eq. 44, going back to the original Newton variant of Eq. 12 (now with mini-batch Hessians),

$\theta_{t+1} \leftarrow \theta_t - \rho_t\, \dfrac{1}{s_{t+1}} \circ \mathbb{E}_{q_t}\big[\hat{\nabla}_\theta \ell(\theta)\big], \quad \text{with} \quad s_{t+1} \leftarrow (1-\rho_t)\, s_t + \rho_t\, \mathbb{E}_{q_t}\big[\mathrm{diag}(\hat{\nabla}^2_\theta \ell(\theta))\big].$   (46)

Figure 1: Bayesian solutions have similar robustness properties to the flatter minima found in deep learning via stochastic algorithms.
Panel (a): when the minimum lies right next to a "wall", the Bayesian solution shifts away from the wall (towards the flatter side) to avoid large losses under small perturbations; this is due to the averaging over $q_*(\theta)$ in condition Eq. 25. Panel (b): given a sharp minimum versus a flat minimum, the Bayesian solution often prefers the flatter minimum, again due to the averaging over $q_*(\theta)$.

Here, the iterates $q_t(\theta) = \mathcal{N}(\theta|\theta_t, S_t^{-1})$ are defined with the precision matrix $S_t$ as a diagonal matrix with $s_t$ as the diagonal. The iterates converge to an optimal diagonal Gaussian candidate that optimizes Eq. 2. Similarly to OGN, the update Eq. 46 can be modified to use the Gauss-Newton approximation of Eq. 45, which is easier to implement and can also exploit the deep-learning tricks to find good solutions. The expectations can be implemented by a simple "weight perturbation", $\mathbb{E}_{q_t}[\hat{\nabla}_\theta \ell(\theta)] \approx \hat{\nabla}_\theta \ell(\theta_t + \epsilon_t)$ where $\epsilon_t \sim \mathcal{N}(\epsilon|0, \mathrm{diag}(s_t)^{-1})$ (multiple samples can also be used). The resulting algorithm is referred to as variational OGN, or VOGN, by Khan et al. (2018), and is shown to match the performance of Adam while giving better uncertainty estimates than Adam, OGN, and other Bayesian alternatives (Osawa et al., 2019, Fig. 1). A version with the FIM of the log-likelihood is proposed in Khan et al. (2019) (the algorithm is called VOGGN). Variants that exploit Hessian structure using Kronecker-factored (Zhang et al., 2018b) and low-rank (Mishkin et al., 2018) approximations are also considered in the literature.

4.5 Binary Neural Networks

Binary NNs have binary weights $\theta \in \{-1, +1\}^P$, hence we need to consider another class of candidate distributions than the Gaussian (we will choose Bernoulli). Binary NNs are natural candidates for resource-constrained applications, such as mobile phones and wearable and IoT devices, but their training is challenging due to the binary weights, for which gradients do not exist.
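The weight-perturbation idea can be sketched on a toy quadratic (our own setup, with exact gradients and Hessians so that Eq. 46 is implemented directly): perturbations $\epsilon$ are drawn from the current $q_t$, gradients are averaged over the samples, and $s_t$ tracks the diagonal Hessian.

```python
import numpy as np

# Sketch of the variational update in Eq. 46 with weight perturbation:
# gradients are evaluated at theta_t + eps with eps drawn from q_t.
rng = np.random.default_rng(3)
H_diag = np.array([2.0, 5.0])          # diagonal Hessian of the quadratic loss
b = np.array([2.0, 5.0])               # minimizer is b / H_diag = [1, 1]

theta, s = np.zeros(2), np.ones(2)
rho, n_samples = 0.1, 32
for _ in range(500):
    eps = rng.standard_normal((n_samples, 2)) / np.sqrt(s)  # eps ~ N(0, diag(s)^-1)
    grads = (theta + eps) * H_diag - b     # gradient at the perturbed weights
    exp_grad = grads.mean(axis=0)          # Monte Carlo estimate of E_q[grad]
    s = (1 - rho) * s + rho * H_diag       # E_q[diag Hessian] is exact here
    theta = theta - rho * exp_grad / s

print(theta, s)  # theta stays near [1, 1]; s converges to the diagonal Hessian
```

For this quadratic the averaging changes nothing in expectation, which is why the iterates hover around the minimizer; on non-quadratic losses the same averaging is what produces the shifted, flatter solutions of Fig. 1.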
Our Bayesian scheme solves this issue because gradients of the Bayesian objective are always defined with respect to the parameters of a Bernoulli distribution, turning a discrete problem into a continuous one. The discussion here is based on the results of Meng et al. (2020), who used the BLR to derive the BayesBiNN learning algorithm and connect it to the well-known Straight-Through-Estimator (STE) (Bengio et al., 2013).

The training objective of a binary NN involves a discrete optimization problem,

$\min_{\theta \in \{-1,+1\}^P}\; \sum_{i=1}^N \ell(y_i, f_\theta(x_i)).$   (47)

In theory, gradient-based methods should not work for such problems since gradients with respect to discrete variables do not exist. Still, the most popular method to train binary NNs, the Straight-Through-Estimator (STE) (Bengio et al., 2013), employs continuous optimization methods and works remarkably well (Courbariaux et al., 2015). The method is justified based on latent real-valued weights $\tilde{\theta} \in \mathbb{R}^P$ which are discretized at every iteration to get binary weights $\theta$, while the gradients used to update the latent weights $\tilde{\theta}$ are computed at the binary weights $\theta$, as shown below,

$\tilde{\theta}_{t+1} \leftarrow \tilde{\theta}_t - \alpha \sum_{i \in \mathcal{M}_t} \nabla_\theta \ell(y_i, f_{\theta_t}(x_i)), \quad \text{where} \quad \theta_t \leftarrow \mathrm{sign}(\tilde{\theta}_t).$   (48)

It is not clear why the gradients computed at the binary weights help the search for the minimum of the discrete problem, and many studies have attempted to answer this question; see, e.g., Yin et al. (2019); Alizadeh et al. (2019). The Bayesian objective in Eq. 2 rectifies this issue, since the expectation of a discrete objective is a continuous objective with respect to the parameters of the distribution, justifying the use of gradient-based methods. We will now show that using Bernoulli candidate distributions in the BLR gives rise to an algorithm similar to Eq. 48, justifying the application of the STE algorithm to solve a discrete optimization problem. Specifically, we choose $q(\theta_j = 1) = p_j$ where $p_j \in [0, 1]$ is the success probability.
We also assume a mean-field distribution to get $q(\theta) = \prod_{j=1}^P q(\theta_j)$. Since the Bernoulli distribution is a minimal exponential family with a constant $h(\theta)$, we use the BLR in Eq. 6 with $\lambda_j = \tfrac{1}{2} \log(p_j/(1-p_j))$ and $\mu_j = 2p_j - 1$, to get

$\lambda_{t+1} \leftarrow (1-\rho_t)\, \lambda_t - \rho_t\, \nabla_\mu \mathbb{E}_{q_t}\Big[\sum_{i \in \mathcal{M}_t} \ell(y_i, f_\theta(x_i))\Big].$   (49)

The challenge now is to compute the gradient with respect to $\mu$, but it turns out that approximating it with the Concrete distribution (Maddison et al., 2016; Jang et al., 2016) recovers the STE step. The Concrete distribution provides a way to relax a discrete random variable into a continuous one, such that it becomes a deterministic function of a continuous random variable. Following Maddison et al. (2016), the continuous random variables, denoted by $\hat{\theta}^{(\tau)} \in (-1, 1)$, are constructed from the binary variables $\theta \in \{-1, 1\}$ by using iid uniform random variables $\epsilon \in (0, 1)$ and the function $\tanh(\cdot)$,

$\hat{\theta}^{(\tau)} = \tanh\big((\lambda + \delta(\epsilon))/\tau\big), \quad \text{where} \quad \delta(\epsilon) = \tfrac{1}{2} \log \dfrac{\epsilon}{1-\epsilon},$   (50)

with $\tau > 0$ as the temperature parameter. As $\tau \to 0$, the function approaches the sign function used in the STE step (Eq. 48): $\hat{\theta}^{(0)} = \mathrm{sign}(\lambda + \delta(\epsilon))$. The gradient too can be approximated by the gradients with respect to the relaxed variables,

$\nabla_\mu \mathbb{E}_q[\ell(y_i, f_\theta(x_i))] \approx \nabla_\mu \ell(y_i, f_{\hat{\theta}^{(\tau)}}(x_i)) = s \circ \big[\nabla_{\hat{\theta}^{(\tau)}} \ell(y_i, f_{\hat{\theta}^{(\tau)}}(x_i))\big],$   (51)

where we used the chain rule in the last step and $s$ is a vector with entries

$s_j = \dfrac{\partial \hat{\theta}^{(\tau)}_j}{\partial \mu_j} = \dfrac{1 - (\hat{\theta}^{(\tau)}_j)^2}{\tau\, (1 - \tanh^2(\lambda_j))}.$

By renaming $\lambda$ as $\tilde{\theta}$ and $\hat{\theta}^{(\tau)}$ as $\theta$, the rule in Eq. 49 becomes

$\tilde{\theta}_{t+1} \leftarrow (1-\rho_t)\, \tilde{\theta}_t - \rho_t\, s_t \circ \sum_{i \in \mathcal{M}_t} \nabla_\theta \ell(y_i, f_{\theta_t}(x_i)), \quad \text{where} \quad \theta_t \leftarrow \tanh\big((\tilde{\theta}_t + \delta(\epsilon_t))/\tau\big),$   (52)

with $s_t \leftarrow (1 - \theta_t^2)/\big(\tau\, (1 - \tanh^2(\tilde{\theta}_t))\big)$, and $\epsilon_t$ a vector of $P$ iid samples from a uniform distribution. Meng et al. (2020) refer to this algorithm as BayesBiNN. BayesBiNN is similar to the STE shown in Eq. 48, but computes natural gradients (instead of gradients) at the relaxed parameters $\theta_t$, which are obtained by adding noise $\delta(\epsilon_t)$ to the real-valued weights $\tilde{\theta}_t$. The gradients are approximations of the expected loss and are well defined.
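The BayesBiNN-style step of Eq. 52 can be sketched on a toy loss of our own construction (not Meng et al.'s experiments): a linear loss $\ell(\theta) = -t^\top\theta$ whose discrete minimizer is a target sign pattern $t$. The latent $\lambda$ is clipped to a numerically safe range, an implementation convenience we add, not part of the algorithm.

```python
import numpy as np

# Minimal sketch of the BayesBiNN-style update (Eq. 52): relax the binary
# weights with tanh((lam + noise)/tau), rescale the gradient by s, and apply
# exponential smoothing to the latent natural parameters lam.
rng = np.random.default_rng(4)
P, tau, rho = 8, 1.0, 0.1
target = np.where(rng.random(P) < 0.5, -1.0, 1.0)   # desired binary pattern

def grad_loss(theta):          # loss l(theta) = -target . theta, so grad = -target
    return -target

lam = np.zeros(P)              # latent (natural) parameters
for _ in range(300):
    eps = rng.uniform(1e-6, 1 - 1e-6, P)
    delta = 0.5 * np.log(eps / (1 - eps))
    theta = np.tanh((lam + delta) / tau)            # relaxed binary weights
    s = (1 - theta ** 2) / (tau * (1 - np.tanh(lam) ** 2))
    lam = (1 - rho) * lam - rho * s * grad_loss(theta)
    lam = np.clip(lam, -10, 10)  # keep the latents in a numerically safe range

print(np.sign(lam))  # matches the target pattern
```

Because $s > 0$ elementwise, every update moves $\lambda$ towards a positive multiple of the target, so the signs of the latent weights lock onto the discrete solution despite the injected noise.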
By setting the noise to zero and letting the temperature $\tau \to 0$, the $\tanh(\cdot/\tau)$ becomes the sign function and we recover the gradients used in the STE. This limit is where the randomness is ignored, and is similar to the delta method used in earlier sections. The random noise $\delta(\epsilon_t)$ and the non-zero temperature $\tau$ enable a softening of the binary weights, allowing for meaningful gradients. BayesBiNN also provides meaningful latent weights $\tilde{\theta}$, which are now the natural parameters of the Bernoulli distribution. Unlike the STE, the update employs an exponential smoothing, which is a direct consequence of using the entropy term in the Bayesian objective.

With discrete weights, the optimum of Eq. 47 could be completely unrelated to an STE solution $\tilde{\theta}_*$, for example, one that yields zero gradients $\sum_i \nabla_\theta \ell(y_i, f_{\theta_*}(x_i)) = 0$. In contrast, the BayesBiNN solution minimizes the well-defined Bayes objective of Eq. 2, whose optimum is characterized by the optimality condition Eq. 24, directly relating the optimal $\tilde{\theta}_*$ to the relaxed variables $\theta_*$:

$\tilde{\theta}_* = -s_* \circ \sum_{i=1}^N \nabla_\theta \ell(y_i, f_{\theta_*}(x_i)).$   (53)

The BayesBiNN solution $\theta_*$ is in general different from the one obtained by the STE algorithm, but when the temperature $\tau \to 0$ we expect the two solutions to be similar whenever $s_* \to 0$ (as the left-hand side in the equation above then goes to zero). In general, we expect the BayesBiNN solution to have robustness properties similar to the ones discussed for Fig. 1. The exponential smoothing used in BayesBiNN is similar to the update of Helwegen et al. (2019), but their formulation lacks a well-defined objective.

5. Probabilistic Inference Algorithms from the Bayesian Learning Rule

Algorithms for inference in probabilistic graphical models can be derived from the BLR by setting the loss function to be the negative log-joint density of the data and unknowns. Unknowns could include both latent variables $z$ and model parameters $\theta$. The class $\mathcal{Q}$ is chosen based on the form of the likelihood and prior.
We will first discuss the conjugate case, where the BLR reduces to Bayes' rule, and then discuss expectation maximization and variational inference (VI), where structural approximations to the posterior are employed.

5.1 Conjugate Models and Bayes' Rule

We start with one of the most popular categories of Bayesian models, called exponential-family conjugate models. We consider $N$ i.i.d. data $y_i$ with likelihoods $p(y_i|\theta)$ (no latent variables) and a prior $p(\theta)$. Conjugacy implies the existence of sufficient statistics $T(\theta)$ such that both the likelihood and prior take the exponential form shown below,

$p(y_i|\theta) \propto e^{\langle \tilde{\lambda}_i(y_i),\, T(\theta) \rangle}, \quad \text{and} \quad p(\theta) \propto e^{\langle \lambda_0,\, T(\theta) \rangle},$

for some $\tilde{\lambda}_i(y_i)$ (a function of $y_i$) and $\lambda_0$. For example, for the ridge regression discussed in Sec. 1.3.3, the likelihoods and prior are

$p(y_i|\theta) = \mathcal{N}(y_i|x_i^\top \theta, 1) \propto e^{(y_i x_i)^\top \theta + \mathrm{Tr}(-\frac{1}{2} x_i x_i^\top (\theta\theta^\top))},$
$p(\theta) = \mathcal{N}(\theta|0, I/\delta) \propto e^{\mathrm{Tr}(-\frac{1}{2} \delta I (\theta\theta^\top))},$

respectively, and, because $T(\theta) = (\theta, \theta\theta^\top)$, we have the following pairs,

$\tilde{\lambda}^{(1)}_i(y_i) = y_i x_i, \quad \tilde{\lambda}^{(2)}_i(y_i) = -\tfrac{1}{2} x_i x_i^\top, \quad \lambda^{(1)}_0 = 0, \quad \lambda^{(2)}_0 = -\tfrac{1}{2} \delta I.$

The quantity $\tilde{\lambda}_i(y_i)$ is often interpreted as the sufficient statistics of $y_i$, but it can also be seen as the natural parameter of the likelihood with respect to $T(\theta)$. The posterior is then obtained by a simple addition of these natural parameters, as shown below,

$p(\theta|y) \propto p(\theta) \prod_i p(y_i|\theta) \propto \exp\Big(\big\langle \lambda_0 + \sum_i \tilde{\lambda}_i(y_i),\, T(\theta) \big\rangle\Big),$

where $y = (y_1, y_2, \ldots, y_N)$. The natural parameter of the posterior can be obtained from the BLR. To see this, we first note that the loss needed in the BLR is the negative of the log-joint density, $\ell(\theta) = -\log p(y, \theta) = -\log p(\theta) + \sum_{i=1}^N [-\log p(y_i|\theta)]$. The natural parameter of the posterior is recovered from the fixed point of the BLR (Eq. 5),

$\lambda_* = \nabla_\mu \mathbb{E}_{q_*}[\log p(\theta)] + \sum_i \nabla_\mu \mathbb{E}_{q_*}[\log p(y_i|\theta)] = \lambda_0 + \sum_i \tilde{\lambda}_i(y_i),$   (54)

which is equivalent to running the BLR for just one iteration with $\rho_t = 1$. For ridge regression, for instance, the above equation results in the natural gradients in Eq. 14, from which the ridge-regression solution can be recovered, as shown in Sec.
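The addition of natural parameters in Eq. 54 can be carried out directly for the ridge-regression example (our own toy data): summing the per-example likelihood parameters onto the prior's parameters and converting back recovers the usual ridge solution.

```python
import numpy as np

# Conjugate update by natural-parameter addition (Eq. 54) for ridge regression:
# prior N(0, I/delta) contributes (0, -delta*I/2); each likelihood contributes
# (y_i x_i, -x_i x_i^T / 2). Summing them gives the posterior.
rng = np.random.default_rng(5)
N, D, delta = 20, 3, 0.5
X = rng.standard_normal((N, D))
y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

lam1 = np.zeros(D)                  # natural parameter paired with T = theta
lam2 = -0.5 * delta * np.eye(D)     # natural parameter paired with T = theta theta^T

for i in range(N):                  # add each likelihood's natural parameters
    lam1 += y[i] * X[i]
    lam2 += -0.5 * np.outer(X[i], X[i])

S = -2 * lam2                       # posterior precision S = X^T X + delta*I
m = np.linalg.solve(S, lam1)        # posterior mean m = S^-1 X^T y

ridge = np.linalg.solve(X.T @ X + delta * np.eye(D), X.T @ y)
print(np.allclose(m, ridge))  # True: natural-parameter addition gives the ridge mean
```

No iteration is needed: this is exactly the "one step with $\rho_t = 1$" behavior of the BLR on conjugate models.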
1.3.3. In general, for such conjugate exponential-family models, the BLR yields the exact posterior by a direct application of Eq. 5. Applying Bayes' rule there is equivalent to a single step of Eq. 6 with $\rho_t = 1$. This includes the popular Gaussian linear models (Roweis and Ghahramani, 1999), such as the Kalman filter/smoother, probabilistic Principal Component Analysis (PCA), factor analysis, mixtures of Gaussians, and hidden Markov models, as well as nonparametric models such as Gaussian process (GP) regression (Rasmussen and Williams, 2006). For all such cases, the BLR yields the natural parameter of the posterior in just one step. From a computational point of view, it does not offer any advice on obtaining marginal properties in general, and, similarly to Bayes' rule, efficient techniques, such as message passing, may be needed to perform the necessary computation.

5.2 Expectation Maximization

We will now derive Expectation Maximization (EM) from the BLR with coordinate-wise updates with $\rho_t = 1$, similarly to the conjugate case. In the subsequent sections, this connection is used to derive new stochastic VI algorithms where sometimes $\rho_t < 1$ is used. EM is a popular algorithm for parameter estimation with latent variables (Dempster et al., 1977). Given the joint distribution $p(y, z|\theta)$ with latent variables $z$ and parameters $\theta$, the EM algorithm computes the posterior $p(z|y, \theta_*)$ as well as a parameter $\theta_*$ that maximizes the marginal likelihood $p(y|\theta)$. The algorithm can be seen as coordinate-wise BLR updates with $\rho_t = 1$ to update a candidate distribution that factorizes across $z$ and $\theta$,

$q_t(z, \theta) = q_t(z)\, q_t(\theta) \propto e^{\langle \lambda_t,\, T_z(z) \rangle}\, \mathcal{N}(\theta|\theta_t, I),$   (55)

where $\theta_t$ denotes the mean of the Gaussian. The loss function is set to $\ell(z, \theta) = -\log p(y, z|\theta)$.
To ease the presentation, assume the likelihood, prior, and joint are expressed as

$p(y|z, \theta) \propto \exp\big(\langle \tilde{\lambda}_1(y, \theta),\, T_z(z) \rangle\big), \quad p(z|\theta) \propto \exp\big(\langle \lambda_0(\theta),\, T_z(z) \rangle\big),$
$p(y, z|\theta) \propto \exp\big(\langle T_{yz}(y, z),\, \theta \rangle - A_{yz}(\theta)\big),$   (56)

for some functions $\tilde{\lambda}_1(\cdot, \cdot)$ and $\lambda_0(\cdot)$ (similarly to the previous section), where $\theta$ is the natural parameter of the joint. The likelihood and prior are expressed in terms of the sufficient statistics $T_z(z)$, while the joint uses $T_{yz}(\cdot, \cdot)$ (Winn and Bishop, 2005). The EM algorithm then takes a simple form where we iterate between updating $\lambda$ and $\theta$ (Banerjee et al., 2005),

E-step: $\lambda_{t+1} \leftarrow \tilde{\lambda}_1(y, \theta_t) + \lambda_0(\theta_t)$,
M-step: $\theta_{t+1} \leftarrow \nabla A^*_{yz}\big(\mathbb{E}_{q_{t+1}}[T_{yz}(y, z)]\big)$.

Here, $A^*_{yz}(\cdot)$ is the Legendre transform of $A_{yz}(\cdot)$, and $q_{t+1}(z) = p(z|y, \theta_{t+1})$ is the outcome of the E-step. The two steps are obtained from the BLR by using the delta method to approximate the expectation with respect to $q_t(\theta)$. For the E-step, we keep $q_t(\theta)$ fixed, and use Eqs. 5 and 54 and the delta method,

$\lambda_{t+1} \leftarrow \nabla_\mu \mathbb{E}_{q_t(z) q_t(\theta)}[-\ell(z, \theta)]\big|_{\mu = \mu_t} = \mathbb{E}_{q_t(\theta)}\big[\tilde{\lambda}_1(y, \theta) + \lambda_0(\theta)\big] \approx \tilde{\lambda}_1(y, \theta_t) + \lambda_0(\theta_t).$

For the M-step, we use a stationarity condition similar to Eq. 25, but now with the delta method with respect to $q(\theta)$ (note that $\theta_{t+1}$ is both the expectation and natural parameter),

$\nabla_\theta \mathbb{E}_{q_{t+1}(z)}[\ell(z, \theta)]\big|_{\theta = \theta_{t+1}} = 0.$

A simple calculation using the joint $p(y, z)$ will show that this reduces to the M-step. The derivation is easily extended to generalized EM iterations. The coordinate-wise strategy also need not be employed, and the BLR steps can instead be run in parallel. Such a scheme resembles the generalized EM algorithm and usually converges faster. Online schemes can be obtained by using stochastic gradients, similar to those considered in Titterington (1984); Neal and Hinton (1998); Sato (1999); Cappé and Moulines (2009); Delyon et al. (1999). More recently, Amid and Warmuth (2020) use a divergence-based approach, and Kunstner et al. (2021) prove attractive convergence properties using a mirror-descent formulation similar to ours.
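A concrete instance of the E/M iteration above is the textbook EM for a two-component Gaussian mixture (our own toy data, written in the usual responsibility notation rather than the exponential-family notation of Eq. 56): the E-step computes $q(z)$, the M-step updates the parameters from expected sufficient statistics, and the marginal log-likelihood never decreases.

```python
import numpy as np

# Classic EM for a 1-D two-component Gaussian mixture with unit variances:
# E-step gives responsibilities, M-step re-estimates means and mixing weight.
rng = np.random.default_rng(6)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

def log_lik(mu, w):
    p = w * np.exp(-0.5 * (y - mu[0]) ** 2) + (1 - w) * np.exp(-0.5 * (y - mu[1]) ** 2)
    return np.sum(np.log(p / np.sqrt(2 * np.pi)))

mu, w = np.array([-0.5, 0.5]), 0.5      # initial means and mixing weight
lls = [log_lik(mu, w)]
for _ in range(50):
    # E-step: responsibilities r_i = q(z_i = 1)
    a = w * np.exp(-0.5 * (y - mu[0]) ** 2)
    b = (1 - w) * np.exp(-0.5 * (y - mu[1]) ** 2)
    r = a / (a + b)
    # M-step: parameters from expected sufficient statistics
    mu = np.array([np.sum(r * y) / np.sum(r), np.sum((1 - r) * y) / np.sum(1 - r)])
    w = np.mean(r)
    lls.append(log_lik(mu, w))

print(min(np.diff(lls)) >= -1e-9)  # True: EM monotonically improves the likelihood
```

In the BLR view, the E-step is a coordinate update of $q(z)$ with $\rho_t = 1$, and the M-step a delta-method update of $q(\theta)$; the monotone improvement is the familiar EM guarantee.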
Next, we derive a generalization of such online schemes, called stochastic variational inference.

5.3 Stochastic Variational Inference and Variational Message Passing

Stochastic variational inference (SVI) is a generalization of EM, first considered by Sato (2001) for the conjugate case and extended by Hoffman et al. (2013) to conditionally-conjugate models (a similar strategy was also proposed by Honkela et al. (2011)). We consider the conjugate case due to its simplicity. The model is assumed to be similar to the EM case in Eq. 56, but with N iid data examples y_i, each associated with one latent vector z_i, and a conjugate prior p(θ|α) with α = (α_1, α_2) as prior parameters,

p(y_i|z_i, θ) ∝ exp(⟨λ̃_i(y_i, θ), T_i(z_i)⟩),   p(z_i|θ) ∝ exp(⟨λ_0(θ), T_i(z_i)⟩),
p(y_i, z_i|θ) ∝ exp(⟨T_yz(y_i, z_i), θ⟩ − A_yz(θ)),   p(θ|α) = h_0(θ) exp(⟨α_1, θ⟩ − α_2 A_yz(θ)).   (57)

Due to conjugacy, each conditional update can be done coordinate-wise. A common strategy is to assume a mean-field approximation

q(z_1, ..., z_N, θ) = q(θ) ∏_i q(z_i) ∝ e^{⟨λ_0^{(1)}, θ⟩ − λ_0^{(2)} A_yz(θ)} ∏_i e^{⟨λ_i, T_i(z_i)⟩},

where λ_i is the natural parameter of q(z_i), and (λ_0^{(1)}, λ_0^{(2)}) is the same for q(θ). Coordinate-wise updates applied to a general graphical model are known as variational message passing (Winn and Bishop, 2005). These updates are a special case of the BLR, applied coordinate-wise with ρ_t = 1. The derivation is almost identical to the one used for the EM case earlier and is therefore omitted. One important difference is that we do not employ the delta method for q(θ); instead, we explicitly carry out the marginalization, which has a closed-form expression due to conjugacy. The update for q(θ) is therefore replaced by its corresponding conjugate update. Stochastic variational inference employs a special update where, after every q(z_i) update with ρ_t = 1, we update q(θ) but with ρ_t < 1. Clearly, this too is covered as a special case of the BLR.
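A minimal Python sketch of such a natural-parameter update with step size ρt < 1, for the simplest conjugate model we can write down (a Gaussian mean with known unit noise and prior N(0, 1); the model, step size, and batch size are all illustrative assumptions, and there are no latent variables z_i here):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
y = rng.normal(2.0, 1.0, N)           # data with true mean 2, unit noise

# Conjugate model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1).
# Exact posterior: N(sum(y) / (N + 1), 1 / (N + 1)).
lam = np.array([0.0, -0.5])           # natural params of q = prior N(0, 1)
lam_prior = np.array([0.0, -0.5])
rho, B = 0.1, 50                      # step size < 1, mini-batch size

for t in range(500):
    yb = rng.choice(y, B)
    # Unbiased, batch-rescaled estimate of the likelihood statistics.
    lam_lik = np.array([N * np.mean(yb), -0.5 * N])
    lam = (1 - rho) * lam + rho * (lam_prior + lam_lik)

mean_q = -lam[0] / (2 * lam[1])       # Gaussian mean from natural params
print(mean_q)                          # close to sum(y) / (N + 1)
```

The fixed point of this update is the natural parameter of the exact posterior; the stochastic iterates fluctuate around it because ρ is held fixed.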
The BLR update is more general than these strategies since it applies to a wider class of non-conjugate models, as discussed next.

5.4 Non-Conjugate Variational Inference

Consider a graphical model p(x) ∝ ∏_{i∈I} f_i(x_i), where x denotes the set of latent variables z and parameters θ, and f_i(x_i) is the i'th factor operating on the subset x_i indexed by the set I. The BLR can be applied to estimate a minimal exponential-family candidate (assuming a constant h(θ)),

λ_{t+1} ← (1 − ρ_t)λ_t + ρ_t ∑_{i∈I} λ̃_i(µ_t),   where λ̃_i(µ_t) = ∇_µ E_q[log f_i(x_i)]|_{µ=µ_t},   (58)

and µ_t = ∇_λ A(λ_t). We already encountered the quantities λ̃_i(µ_t) for conjugate models in Sec. 5.1, where they are referred to as the natural parameters of the factors. They can also be interpreted as local messages that are aggregated to obtain the global natural parameter λ_{t+1}. The update is quite flexible and can also be applied to other graphical models, such as Bayesian networks, Markov random fields, and Boltzmann machines. We can also view it as an update over distributions, by multiplying by T(x) followed by exponentiation,

q_{t+1}(x) ∝ q_t(x)^{(1−ρ_t)} [ ∏_{i∈I} exp(⟨λ̃_i(µ_t), T(x)⟩) ]^{ρ_t}   (59)

and at convergence

q_*(x) ∝ ∏_{i∈I} exp(⟨λ̃_i(µ_*), T(x)⟩),

where q_*(x) is the optimal distribution with natural and expectation parameters λ_* and µ_*. This update has three important features.

1. The optimal q_*(x) has the same structure as p(x), and each term in q_*(x) yields an approximation to the corresponding term in p(x),

log f_i(x_i) ≈ ⟨λ̃_i(µ_*), T(x)⟩ + c_i,

for some constant c_i. The quantity λ̃_i(·) can be interpreted as the natural parameter of the local approximation. Such local parameters are often referred to as the site parameters in expectation propagation (EP), where their computation is often rather involved. In our formulation they are simply the natural gradients of the expected log joint density. See Chang et al. (2020, Sec. 3.4 and 3.5) for an example comparing the two.
The above local approximations are inherently superior to local variational bounds, for example, the quadratic bounds used in logistic regression (Jaakkola and Jordan, 1996; Khan et al., 2010, 2012; Khan, 2012), and also to augmentation strategies (Girolami and Rogers, 2006; Klami, 2015). Gaussian posteriors give rise to quadratic surrogates to the loss function. These can be seen as alternatives to those obtained using Taylor's expansion. The advantage of our surrogates is that they are not tight locally, but rather globally (Opper and Archambeau, 2009).

2. The update in Eq. 59 yields both the local λ̃_i(·) and global λ parameters simultaneously, which is unlike the message-passing strategies that only compute the local messages. In fact, Eq. 59 can also be reformulated entirely in terms of local parameters (denoted by λ̃_i^{(t+1)} below with a slight abuse of notation),

λ_{t+1} = ∑_{i∈I} λ̃_i^{(t+1)},   where λ̃_i^{(t+1)} ← (1 − ρ_t) λ̃_i^{(t)} + ρ_t λ̃_i(µ_t).   (60)

This formulation was discussed by Khan and Lin (2017), who show it to be a generalization of the non-conjugate variational message passing algorithm (Knowles and Minka, 2011). The form is suitable for a distributed implementation using a message-passing framework.

3. The update applies not only to both conjugate and non-conjugate factors, but also to both probabilistic and deterministic variables. Automatic differentiation tools can be easily integrated. Due to such properties, the BLR offers an attractive framework for practical probabilistic programming (van de Meent et al., 2018).

We have argued that natural gradients play an important role for probabilistic inference, both for conjugate models and within message-passing frameworks. Still, their importance is somewhat missed in the community. Natural gradients were first introduced in this context by Khan and Lin (2017); Khan et al. (2018), although Salimans and Knowles (2013) also mention a similar condition without an explicit reference to natural gradients.
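The aggregation of local messages in Eqs. 58 and 60 can be sketched in a few lines of Python. We assume three Gaussian factors, an illustrative conjugate toy for which the messages λ̃_i happen to be constant; in a non-conjugate model they would depend on µ_t and would be recomputed from the current expectation parameters:

```python
import numpy as np

# Factors f_i(theta) = N(theta | a_i, 1), so p(theta) ∝ prod_i f_i(theta).
a = np.array([-1.0, 0.5, 3.0])

# Gaussian q with natural parameters lam = (lam1, lam2) for sufficient
# statistics (theta, theta^2). For these conjugate factors the messages
# lam_i(mu) = grad_mu E_q[log f_i] are the constants (a_i, -1/2).
lam = np.array([0.0, -0.5])
rho = 0.5
for _ in range(60):
    messages = [np.array([ai, -0.5]) for ai in a]   # local natural gradients
    lam = (1 - rho) * lam + rho * sum(messages)      # Eq. 58 aggregation

mean = -lam[0] / (2 * lam[1])
print(mean)   # posterior mean = average of the a_i
```

At convergence, λ is exactly the sum of the local messages, matching the fixed-point form of Eq. 60.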
Sheth and Khardon (2016); Sheth (2019) discuss an update similar to the BLR for a specific case of two-level graphical models. Salimbeni et al. (2018) propose natural-gradient strategies for sparse GPs by using automatic differentiation, which is often slower (the original work by Hensman et al. (2013) is also a special case of the BLR). A recent generalization to structured natural-gradients by Lin et al. (2021) is also worth noting.

We end the section by discussing connections of the BLR to algorithms used in online learning, where the loss functions are observed sequentially at every iteration t,

q_{t+1}(θ) = arg min_{q(θ)} E_q[ℓ(y_t, f_θ(x_t))] + (1/ρ_t) D_KL[q(θ) ‖ q_t(θ)].

The resulting updates are called Exponential-Weight (EW) updates (Littlestone and Warmuth, 1994; Vovk, 1990), which are closely related to Bayes' rule. van der Hoeven et al. (2018) show that, by approximating the first term by a surrogate (linear/quadratic) at the posterior mean (the delta method), many existing online-learning algorithms can be obtained as special cases. This includes online gradient descent, online Newton step, and online mirror descent, among many others. This derivation is similar to the BLR which, when applied here, is slightly more general, not only due to the use of the expected loss, but also because the surrogate choice is automatically obtained by the posterior approximation. An advantage of this connection is that the theoretical guarantees derived in online learning can be translated to the BLR, and consequently to all the learning-algorithms derived from it. We also note that the local and global BLR updates presented in Eqs. 58 and 60 are similar to the greedy and lazy updates used in online learning (van der Hoeven et al., 2018). The conjugate case discussed in earlier sections is very similar to the algorithms proposed for online learning by Azoury and Warmuth (2001), which are concurrent to a similar proposal in the Bayesian community by Sato (2001).
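A sketch of this delta-method reduction in Python, assuming a sequential least-squares problem and a Gaussian q with fixed covariance (the model, step-size schedule, and noise level are all illustrative); linearizing the expected loss at the posterior mean turns the EW update into plain online gradient descent on the mean:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = np.array([1.0, -2.0])

# q_t = N(m_t, I): with a linear surrogate of E_q[loss] at the mean,
# the EW update reduces to online gradient descent on m_t.
m = np.zeros(2)
for t in range(2000):
    x = rng.normal(size=2)                  # one example arrives per round
    y = x @ theta_true + 0.1 * rng.normal()
    grad = (m @ x - y) * x                  # gradient of squared loss at the mean
    m -= grad / (t + 10)                    # decaying step size rho_t

print(m)   # approaches theta_true
```

The quadratic surrogate would instead yield an online Newton step; in the BLR the choice of surrogate follows automatically from the form of the posterior approximation.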
6. Discussion

To learn, we need to extract useful information from new data and revise our beliefs. To do this well, we must reject false information and, at the same time, not ignore any (real) information. A good learning algorithm must possess these properties. We expect the same to be true for the algorithms that have stood the test of time in terms of good performance, even if they were derived through empirical experimentation and intuition. If there exists an optimal learning-algorithm, as is often hypothesized, then we might be able to view these empirically good learning-algorithms as derived from this common origin. They might not be perfect, but we may expect them to be reasonable approximations of the optimal learning-algorithm. All in all, it is possible that all such algorithms optimize similar objectives and use similar strategies for how the optimization proceeds.

In this paper we argue for two ideas. The first is that the learning-objective is a Bayesian one and uses the variational formulation by Zellner (1988) in Eq. 2. The objective tells us how to balance new information against the old, resulting ultimately in Bayes' theorem as the optimal information-processing rule. The power of the variational formulation is that it also shows how to process the information when we have limited abilities, for example, in terms of computation, to extract the relevant information. The role of q(θ) is to define how to represent the knowledge, for which exponential families, and mixtures thereof, are natural choices that have been shown to balance complexity while still being practical. The second idea is the role natural gradients play in the learning algorithms. Bayes' rule in its original form has no opinion about them, but they are inherently present in all solutions of the Bayesian objective.
They give us information about the learning-loss landscape, sometimes in the form of derivatives of the loss and sometimes as messages in a graphical model. Our work shows that all these seemingly different ideas, in fact, have deep roots in information geometry, which is exploited by natural gradients and then by the Bayesian Learning Rule (BLR) we propose. The two ideas go together as a symbiosis in the BLR. Through a long series of examples from optimization, deep learning, and graphical models, we demonstrate that classical algorithms such as ridge regression, Newton's method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout, can all be derived from the proposed BLR. The BLR gives a unified framework for understanding what various (good) learning algorithms do, how they differ in approximating the target, and how they can be improved. The main advantage of the BLR is that we now have a principled framework to approach new learning problems, both in terms of what to aim for and how to get there.

Acknowledgement

M. E. Khan would like to thank many current and past colleagues at RIKEN-AIP, including W. Lin, D. Nielsen, X. Meng, T. Möllenhoff and P. Alquier, for many insightful discussions that helped shape parts of this paper. We also thank the anonymous reviewers for their feedback to improve the presentation.

Appendix A. Bayesian Inference as Optimization

Bayesian inference is a special case of the Bayesian learning problem. Although this follows directly from the variational formulation by Zellner (1988), it might be useful to provide some more motivation. Bayesian inference corresponds to a log-likelihood loss ℓ(y, f_θ(x)) = −log p(y|f_θ(x)). With conditionally independent data and prior p(θ), the posterior distribution is

p(θ|D) = p(θ) ∏_{i=1}^N p(y_i|f_θ(x_i)) / Z(D).

The minimizer of the Bayesian learning problem recovers this posterior distribution.
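This can be verified numerically for a discrete parameter, where all quantities are computable exactly; a small Python check (the prior and likelihood values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Discrete parameter with four values: prior and total data likelihood.
prior = np.array([0.4, 0.3, 0.2, 0.1])
lik = np.array([0.1, 0.5, 0.3, 0.9])       # prod_i p(y_i | theta) per theta

post = prior * lik / np.sum(prior * lik)   # Bayes posterior
logZ = np.log(np.sum(prior * lik))         # log marginal likelihood

def objective(q):
    # L(q) = E_q[-log likelihood] + KL(q || prior).
    return np.sum(q * -np.log(lik)) + np.sum(q * np.log(q / prior))

# The posterior attains the minimum value -log Z(D); random q's do no better.
assert np.isclose(objective(post), -logZ)
for _ in range(100):
    q = rng.dirichlet(np.ones(4))
    assert objective(q) >= objective(post) - 1e-9
print("posterior minimizes the variational objective")
```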
The claim follows directly from reorganizing the Bayesian objective Eq. 2:

L(q) = E_{q(θ)}[ −∑_{i=1}^N log p(y_i|f_θ(x_i)) ] + D_KL[q(θ) ‖ p(θ)]   (61)
     = E_{q(θ)}[ log ( q(θ) / ( p(θ) ∏_{i=1}^N p(y_i|f_θ(x_i)) / Z(D) ) ) ] − log Z(D)   (62)
     = D_KL[q(θ) ‖ p(θ|D)] − log Z(D).   (63)

Choosing q(θ) = p(θ|D) minimizes the objective since log Z(D) is a constant and the KL divergence is then zero.

Appendix B. Natural Gradient of the Entropy

Here, we show that ∇̃_λ H(q) = −λ when h(θ) is a constant. First, we write

∇̃_λ H(q) = −∇_µ E_q[log q(θ)]
          = −∇_µ [⟨λ, µ⟩ − A(λ)] − ∇_µ E_q[log h(θ)]
          = −λ − [∇²_λ A(λ)]^{−1} [µ − ∇_λ A(λ)] − ∇_µ E_q[log h(θ)]
          = −λ − ∇_µ E_q[log h(θ)],

where in the second line we use the definition of q(θ), and in the third line we use the chain rule and the fact that µ = ∇_λ A(λ). For constant h(θ), this reduces to −λ, giving us the result.

Appendix C. The Delta Method

In the delta method (Dorfman, 1938; Ver Hoef, 2012), we approximate the expectation of a function of a random variable by the expectation of the function's Taylor expansion. For example, given a distribution q(θ) with mean m, we can use the first-order Taylor approximation to get

E_q[f(θ)] ≈ E_q[ f(m) + (θ − m)ᵀ ∇_θ f(θ)|_{θ=m} ] = f(m).

We will often use this approach to approximate the expectation of the gradient/Hessian by their values at a single point (usually the mean). For example, we can use a zeroth-order expansion to get

E_q[∇_θ f(θ)] ≈ ∇_θ f(θ)|_{θ=m},   E_q[∇²_θ f(θ)] ≈ ∇²_θ f(θ)|_{θ=m}.

A better derivation (giving the same result) is to use a higher-order expansion. For example, by using the first-order Taylor expansion, we approximate below the expectation of the gradient by the gradient at the mean,

E_q[∇_θ f(θ)] = ∇_m E_q[f(θ)] ≈ ∇_m E_q[ f(m) + (θ − m)ᵀ ∇_θ f(θ)|_{θ=m} ] = ∇_θ f(θ)|_{θ=m},   (65)

where the first equality is due to Bonnet's theorem (App. D) and the approximation is simply the first-order Taylor expansion. We refer to this as the first-order delta method. Similar approximations have been used in variational inference (Blei and Lafferty, 2007; Braun and McAuliffe, 2010; Wang and Blei, 2013).
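A quick Monte Carlo check of this first-order delta method in Python (the test function and settings are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
m, sigma = 1.5, 0.1
theta = rng.normal(m, sigma, 200_000)      # samples from q = N(m, sigma^2)

f_grad = lambda t: np.exp(-t) * (np.cos(t) - np.sin(t))   # gradient of exp(-t) sin(t)

# Delta method: E_q[grad f(theta)] ≈ grad f(m), with error of order sigma^2.
mc = np.mean(f_grad(theta))
print(mc, f_grad(m))
```

The residual difference between the two printed values shrinks as O(σ²), consistent with the neglected second-order Taylor term.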
Similarly, by using the second-order Taylor expansion, we approximate below the expectation of the Hessian by the Hessian at the mean,

E_q[∇²_θ f(θ)] = 2∇_{S^{−1}} E_q[f(θ)]
              ≈ 2∇_{S^{−1}} E_q[ f(m) + (θ − m)ᵀ ∇_θ f(θ)|_{θ=m} + ½(θ − m)ᵀ ∇²_θ f(θ)|_{θ=m} (θ − m) ]
              = 2∇_{S^{−1}} [ ½ Tr( S^{−1} ∇²_θ f(θ)|_{θ=m} ) ]
              = ∇²_θ f(θ)|_{θ=m},   (66)

where the first equality is due to Price's theorem (App. D) and the approximation is simply the second-order Taylor expansion. We refer to this as the second-order delta method.

C.1 Deterministic Estimates as Crude Bayesians

In ML it is often argued that deterministic estimates/models can be seen as a special case of probabilistic ones where variables are set to their maximum-a-posteriori (MAP) estimates. One justification stems from the fact that such estimates can be seen as a posterior which is a Dirac-delta distribution at the MAP estimate; see Buntine (2002); Girolami and Kabán (2003); Gal (2016) for examples. The argument is then to use the delta distribution as a solution to the Bayesian objective (the evidence lower bound), justifying the MAP estimate as a Bayesian solution. Formally, denoting the MAP estimate by θ*, we can define a distribution q_σ(θ) = N(θ|θ*, σI). By taking σ → 0, we can view L(q_σ) as the Bayesian objective attained by a non-Bayesian MAP estimate. Unfortunately, this argument is flawed when estimating continuous parameters of deterministic models, as shown by Welling et al. (2008). This is because, as σ → 0, the entropy H(q_σ) = (P/2) log(2πeσ) → −∞, making the objective useless. Therefore, it is not correct to use delta distributions to justify deterministic solutions as probabilistic and Bayesian ones.

Instead, we use the delta method and obtain MAP estimates as a crude Bayesian approximation of the expectation. In this method, we do not set the posterior to a Dirac delta; rather, we use a posterior approximation with a pre-specified uncertainty.
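The divergence of the objective under the Dirac-delta argument is easy to see numerically; a one-dimensional Python sketch with an illustrative quadratic loss (so E_q[loss] has a closed form):

```python
import numpy as np

# Objective E_q[loss] - H(q) for q_sigma = N(theta_star, sigma^2), P = 1,
# with the quadratic loss 0.5 * theta^2 (minimum, hence MAP estimate, at 0).
theta_star = 0.0
vals = []
for sigma in [1.0, 0.1, 0.01, 0.001]:
    e_loss = 0.5 * (theta_star**2 + sigma**2)           # E_q[0.5 theta^2], closed form
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # Gaussian entropy
    vals.append(e_loss - entropy)

print(vals)   # grows without bound as sigma -> 0: the Dirac limit is useless
```

Even though the MAP estimate itself is fixed at 0, the objective L(q_σ) blows up as σ shrinks, which is exactly the flaw discussed above.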
For example, for the pre-specified uncertainty we can use a Gaussian q*(θ) = N(θ|θ*, H*^{−1}), where H* is the Hessian of ℓ̄(θ) at θ*, or a positive-definite approximation of it when the Hessian itself is not positive-definite. This is Laplace's method, where the curvature around θ* is used as an uncertainty estimate. The choice views the MAP estimate as a crude Bayesian because, by using the delta method, we can replace the expectation by the loss at the mean θ*,

E_{q*}[ℓ̄(θ)] − H(q*) ≈ ℓ̄(θ*) + ½ log |H*| − (P/2) log(2πe),

which is exactly the objective optimized by θ* (the remaining terms being constants, since the uncertainty is pre-specified). A crude Bayesian ignores the uncertainty around the estimate and simply uses the mean to approximate the expectation. A similar view was presented in Opper and Archambeau (2009), where Laplace's method was shown to be a crude approximation of the variational method. However, in our argument the posterior covariance can be an arbitrary quantity, and the posterior does not even need to be Gaussian. The delta-method view is more general than the Laplace view of Opper and Archambeau (2009).

Appendix D. Newton's Method from the BLR

We start by expressing the natural gradients in terms of the gradients with respect to the mean and covariance. Let θ be multivariate normal with mean m and covariance C = S^{−1}, and let h(m, C) be a differentiable function. First, we write m and C in terms of µ^{(1)} and µ^{(2)},

m = µ^{(1)},   C = µ^{(2)} − µ^{(1)} µ^{(1)ᵀ}.

We then express the natural gradients in terms of gradients with respect to m and C by using the chain rule. For example, the derivative with respect to µ^{(2)} is as follows,

∂h/∂µ^{(2)}_ij = ∑_k (∂h/∂m_k)(∂m_k/∂µ^{(2)}_ij) + ∑_{kl} (∂h/∂C_kl)(∂C_kl/∂µ^{(2)}_ij) = ∂h/∂C_ij,

giving us ∇_{µ^{(2)}} h = ∇_C h = ∇_{S^{−1}} h. Similarly, for the gradient with respect to µ^{(1)},

∂h/∂µ^{(1)}_i = ∂h/∂m_i − ∑_k (∂h/∂C_ki) µ^{(1)}_k − ∑_l (∂h/∂C_il) µ^{(1)}_l = ∂h/∂m_i − 2∑_l (∂h/∂C_il) m_l,

where the last equality follows from the symmetry of C. Using this, we get the desired result,

∇_{µ^{(1)}} h = ∇_m h − 2[∇_{S^{−1}} h] m,   ∇_{µ^{(2)}} h = ∇_{S^{−1}} h.

Next, we give a summary of Bonnet's and Price's theorems (Bonnet, 1964; Price, 1958); see also Opper and Archambeau (2009); Rezende et al.
(2014). Let f(θ) be a twice-differentiable function with E|f(θ)| < ∞. Bonnet's and Price's results are, respectively,

∂/∂m_i E[f(θ)] = E[∂f(θ)/∂θ_i]   and   ∂/∂C_ij E[f(θ)] = c_ij E[∂²f(θ)/∂θ_i∂θ_j],

where c_ij = 1/2 if i = j and c_ij = 1 if i ≠ j (since C is symmetric). Using these, we obtain Eq. 10 and Eq. 11.

The following derivation is also detailed in Khan et al. (2018). We now use the definitions of the natural and expectation parameters from Eq. 9 in the BLR Eq. 6. The function h(θ) is constant for the multivariate Gaussian, hence

S_{t+1} m_{t+1} ← (1 − ρ_t) S_t m_t − ρ_t ∇_{µ^{(1)}} E_{q_t}[ℓ̄(θ)],
S_{t+1} ← (1 − ρ_t) S_t + 2ρ_t ∇_{µ^{(2)}} E_{q_t}[ℓ̄(θ)].

The update for S_{t+1} can be obtained using Eq. 11,

S_{t+1} ← (1 − ρ_t) S_t + ρ_t E_{q_t}[∇²_θ ℓ̄(θ)].

We can simplify the update for the mean using Eq. 10,

S_{t+1} m_{t+1} = (1 − ρ_t) S_t m_t − ρ_t ( E_{q_t}[∇_θ ℓ̄(θ)] − E_{q_t}[∇²_θ ℓ̄(θ)] m_t ) = S_{t+1} m_t − ρ_t E_{q_t}[∇_θ ℓ̄(θ)],

and the update for m_{t+1} follows from this.

Appendix E. Natural-Gradient Updates for Mixtures of Gaussians

The joint distribution q(θ, z) is a minimal-conditional exponential-family distribution (Lin et al., 2019b), i.e., the distribution takes a minimal form conditioned on z = k, which ensures that the FIM is non-singular. Details of the update are in Lin et al. (2019b, Theorem 3). The natural and expectation parameters are given in Lin et al. (2019b, Table 1). The gradients with respect to µ_k can be written in terms of m_k and S_k^{−1}; for example,

∇_{µ_k^{(1)}} E_q[ℓ̄(θ)] = [∇_{µ_k^{(1)}} m_k] ∇_{m_k} E_q[ℓ̄(θ)] = (1/π_k) ∇_{m_k} E_q[ℓ̄(θ)].

Hence we can reuse the derivation of Newton's method, Eq. 10 and Eq. 11, and proceed similarly as in App. D, to obtain a Newton-like update for a mixture of Gaussians,

m_{k,t+1} ← m_{k,t} − (ρ/π_k) S_{k,t+1}^{−1} ∇_{m_k} E_{q_t}[ℓ̄(θ) + log q(θ)],   (67)
S_{k,t+1} ← S_{k,t} + (2ρ/π_k) ∇_{S_k^{−1}} E_{q_t}[ℓ̄(θ) + log q(θ)].   (68)

The gradients can be expressed in terms of the gradient and Hessian of the loss, similar to Bonnet's and Price's theorems in App. D; see Lin et al. (2019a, Thm. 6 and 7).
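The Newton-from-BLR derivation above can be checked numerically; a minimal Python sketch assuming a quadratic loss, for which the Gaussian expectations of the gradient and Hessian are available in closed form (so with ρ_t = 1 the BLR mean lands on the Newton solution in one step):

```python
import numpy as np

# Quadratic loss: lbar(theta) = 0.5 theta^T H theta - b^T theta.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

m, S = np.zeros(2), np.eye(2)          # Gaussian q_t = N(m, S^{-1})
rho = 1.0
for _ in range(3):
    # For a quadratic, E_q[grad] = H m - b and E_q[Hessian] = H exactly.
    S = (1 - rho) * S + rho * H                      # precision update
    m = m - rho * np.linalg.solve(S, H @ m - b)      # mean update

print(m, np.linalg.solve(H, b))   # BLR mean equals the Newton solution
```

With ρ_t < 1 the same loop interpolates the precision toward the Hessian, giving the damped Newton-like behaviour discussed in the main text.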
In particular (Lin et al., 2019a, Thms. 6 and 7),

∇_{m_k} E_q[f(θ)] = π_k E_{N(θ|m_k, S_k^{−1})}[∇_θ f(θ)],   ∇_{S_k^{−1}} E_q[f(θ)] = (π_k/2) E_{N(θ|m_k, S_k^{−1})}[∇²_θ f(θ)],

and hence the updates in Eqs. 29 and 30. The gradients of log q(θ) = log ∑_{k=1}^K π_k N(θ|m_k, S_k^{−1}), not spelled out explicitly in Eqs. 67 and 68, are as follows,

∇_θ log q(θ) = ∑_{j=1}^K r_j(θ) S_j (m_j − θ),   (69)
∇²_θ log q(θ) = −∑_{i=1}^K r_i(θ) S_i + ∑_{i=1}^K r_i(θ) A_ii(θ) − ∑_{i=1}^K ∑_{j=1}^K r_i(θ) r_j(θ) A_ij(θ).   (70)

Here, r_k(θ) is the responsibility of the k'th mixture component at θ,

r_k(θ) = q(z = k|θ) = π_k N(θ|m_k, S_k^{−1}) / ∑_{j=1}^K π_j N(θ|m_j, S_j^{−1}),   (71)

and A_ij(θ) = S_i (m_i − θ)(m_j − θ)ᵀ S_j. With two natural assumptions, we can get simplifications that help us understand how different components can take responsibility for different local minima. First, evaluating at θ_k = m_k, we have A_ij(θ_k) = 0 whenever i = k or j = k, leading to

∇_θ log q(θ_k) = ∑_{j≠k} r_j(θ_k) S_j (m_j − θ_k),   (72)
∇²_θ log q(θ_k) = −r_k(θ_k) S_k + ∑_{j≠k} r_j(θ_k) [A_jj(θ_k) − S_j] − ∑_{i≠k} ∑_{j≠k} r_i(θ_k) r_j(θ_k) A_ij(θ_k).   (73)

Second, if we assume the mixture components to be far apart from each other, then the responsibility r_j(θ_k) is approximately zero for all j ≠ k, leading to

∇_θ log q(θ_k) ≈ 0,   ∇²_θ log q(θ_k) ≈ −S_k.   (74)

Appendix F. Deep-Learning Algorithms with Momentum

Momentum is a useful technique to improve the performance of SGD (Sutskever et al., 2013). Modern adaptive-learning algorithms, such as RMSprop and Adam, often employ variants of the classical momentum to increase the learning rate (Sutskever et al., 2013). The classical momentum method is based on Polyak's heavy-ball method (Polyak, 1964), where the current update θ_t → θ_{t+1} is pushed along the direction of the previous update θ_{t−1} → θ_t:

θ_{t+1} ← θ_t − α ∇̂_θ ℓ̄(θ_t) + γ (θ_t − θ_{t−1}),   (75)

with a fixed momentum coefficient γ > 0. Why this technique can be beneficial when using noisy gradients is more transparent when we rewrite Eq. 75 as

θ_{t+1} ← θ_t − α ∑_{k=0}^t γ^k ∇̂_θ ℓ̄(θ_{t−k}) + initial conditions.   (76)

Hence, the effective gradient is an average of previous gradients with decaying weights.
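The equivalence of the recursive form Eq. 75 and the unrolled form Eq. 76 can be checked directly; a small Python sketch with a deterministic quadratic loss (the step size, momentum coefficient, and horizon are illustrative):

```python
import numpy as np

grad = lambda t: 2 * t                  # gradient of t^2 (deterministic here)
alpha, gamma, T = 0.1, 0.9, 30

# Heavy-ball recursion, Eq. 75, with theta_0 = theta_{-1} = 5.
th, th_prev = 5.0, 5.0
trace = [th]
for _ in range(T):
    th, th_prev = th - alpha * grad(th) + gamma * (th - th_prev), th
    trace.append(th)

# Unrolled form, Eq. 76: step along the decaying average of past gradients.
th2 = [5.0]
for t in range(T):
    g = sum(gamma**k * grad(th2[t - k]) for k in range(t + 1))
    th2.append(th2[t] - alpha * g)

print(np.allclose(trace, th2))   # True: the two forms coincide
```

The unrolled form makes explicit that momentum replaces the current (possibly noisy) gradient by an exponentially weighted sum of past gradients.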
The hope is that this averaging smooths out and reduces the variance of the stochastic gradient estimates, although the exact mechanisms behind the success of momentum-based methods are still an open research question.

Adaptive learning-rate algorithms, such as RMSprop and Adam, employ a variant of the classical momentum method where exponential smoothing is applied to the gradient, u_{t+1} ← γ u_t + (1 − γ) ∇̂_θ ℓ̄(θ_t), which is then used in the update while keeping the scaling unchanged (Graves, 2013; Kingma and Ba, 2015). This gives the update

θ_{t+1} ← θ_t − α u_{t+1} / (√(s_{t+1}) + c 1).   (77)

We can express this similarly to Eq. 75, as (Wilson et al., 2017)

θ_{t+1} ← θ_t − α (1 − γ) ∇̂_θ ℓ̄(θ_t) / (√(s_{t+1}) + c 1) + γ [ (√(s_t) + c 1) / (√(s_{t+1}) + c 1) ] (θ_t − θ_{t−1}),   (78)

showing a dynamic scaling of the momentum depending (essentially) on the ratio of s_{t+1} and s_t. Although this adaptive scaling is not counter-intuitive, it is a result of the somewhat arbitrary choices made. Can we justify it from Bayesian principles and derive the corresponding BLR? The problem we face is provoked by the use of stochastic gradient approximations, and the principles simply state that we should compute exact gradients instead. This might appear as a deficiency in our main argument, but we can resolve the issue in other ways.

In our view, the natural way to include a momentum term in the BLR is to do so within the mirror-descent framework discussed in Sec. 2. This results in an update where the momentum term obeys the geometry of the posterior approximation. We propose to augment Eq. 22 with a momentum term,

µ_{t+1} ← arg min_{µ∈M} ⟨∇_µ L(q_t), µ⟩ + [(1 + γ_t)/ρ_t] D_{A*}(µ ‖ µ_t) − (γ_t/ρ_t) D_{A*}(µ ‖ µ_{t−1}).   (79)

The last term penalizes the proximity to µ_{t−1}, where proximity is measured according to the KL divergence; following the suggestion by Khan et al. (2018), we may interpret this as a natural momentum term. This gives a (revised) BLR with momentum,

λ_{t+1} ← λ_t − ρ_t ∇̃_λ ( E_{q_t}[ℓ̄(θ)] − H(q_t) ) + γ_t (λ_t − λ_{t−1}).   (80)

This update recovers Eq. 75 and an alternative to Eq. 78.
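The rewriting of Eq. 77 into Eq. 78 can likewise be verified numerically; a one-dimensional Python sketch with a deterministic loss (the smoothing constants, including the RMSprop-style update for s, are illustrative assumptions):

```python
import numpy as np

grad = lambda t: 2 * (t - 3.0)          # gradient of (t - 3)^2
alpha, gamma, beta, c = 0.1, 0.9, 0.2, 1e-8

# Form of Eq. 77: exponentially smoothed gradient u; scaling s unchanged.
th_a, u, s = 0.0, 0.0, 0.0
hist_a = []
for _ in range(40):
    g = grad(th_a)
    s = (1 - beta) * s + beta * g**2
    u = gamma * u + (1 - gamma) * g
    th_a -= alpha * u / (np.sqrt(s) + c)
    hist_a.append(th_a)

# Rewritten form of Eq. 78: dynamically scaled momentum term.
th_b, th_prev, s2 = 0.0, 0.0, 0.0
hist_b = []
for _ in range(40):
    g = grad(th_b)
    s2_new = (1 - beta) * s2 + beta * g**2
    th_new = (th_b - alpha * (1 - gamma) * g / (np.sqrt(s2_new) + c)
              + gamma * (np.sqrt(s2) + c) / (np.sqrt(s2_new) + c) * (th_b - th_prev))
    th_prev, th_b, s2 = th_b, th_new, s2_new
    hist_b.append(th_b)

print(np.allclose(hist_a, hist_b))   # True: identical iterates
```

The ratio (√s_t + c)/(√s_{t+1} + c) in the second form is exactly the dynamic rescaling of the momentum discussed above.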
Choosing a Gaussian candidate distribution N(θ|m, I) (the result is invariant to a fixed covariance matrix), we recover Eq. 75 after a derivation similar to that in Sec. 1.3.1. Choosing the candidate distribution N(θ|m, diag(s)^{−1}) and again following previous derivations (Newton's method in Sec. 1.3.2) with mini-batch gradients results in the following update,

θ_{t+1} ← θ_t − ρ_t (1/s_{t+1}) ∘ ∇̂_θ ℓ̄(θ_t) + (γ_t/s_{t+1}) ∘ (s_t θ_t − s_{t−1} θ_{t−1}),
s_{t+1} ← (1 − ρ_t) s_t + ρ_t diag[∇̂²_θ ℓ̄(θ_t)] + γ_t (s_t − s_{t−1}).   (81)

Assuming s_t ≈ s_{t−1}, the last terms in the above two equations simplify, and then, by replacing s_{t+1} with √(s_{t+1}) and adding a constant c, we can recover Eq. 78. This momentum version of Eq. 37 not only justifies the use of scaling vectors for the momentum, but also the use of exponential smoothing for the scaling vectors themselves.

Appendix G. Dropout from the BLR

Deriving updates for m_jl and S_jl with dropout is similar to the Newton variant derived for mixtures of Gaussians discussed in App. E. With dropout there are only two components, one of which has fixed parameters, hence there is one set of natural and expectation parameters per weight vector θ_jl,

λ^{(1)}_jl = S_jl m_jl,   λ^{(2)}_jl = −½ S_jl,
µ^{(1)}_jl = E_q[1[z_jl=1] θ_jl] = π_1 m_jl,   µ^{(2)}_jl = E_q[1[z_jl=1] θ_jl θ_jlᵀ] = π_1 (S_jl^{−1} + m_jl m_jlᵀ).

We can reuse the updates in Eqs. 67 and 68 from App. E,

m_{jl,t+1} ← m_{jl,t} − (ρ/π_1) S_{jl,t+1}^{−1} ∇_{m_jl} ( E_{q_t}[ℓ̄(θ)] + E_{q_t}[log q(θ_jl)] ),   (82)
S_{jl,t+1} ← S_{jl,t} + (2ρ/π_1) ∇_{S_jl^{−1}} ( E_{q_t}[ℓ̄(θ)] + E_{q_t}[log q(θ_jl)] ).   (83)

We have moved the expectation inside the gradient since the entropy term only depends on θ_jl. The gradient of the loss ℓ̄(θ) may depend on the whole θ, however. By using appropriate methods to compute the gradients, we can recover the dropout method.
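The two-component structure used to recover dropout can be sketched in Python: weights are kept with probability π_1 and zeroed otherwise, and the mean of the resulting samples matches the expectation parameter µ^{(1)} = π_1 m_jl (a Monte Carlo check; all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
pi1, n_samples = 0.8, 100_000
m_jl = np.array([1.0, -2.0, 0.5])        # mean of the kept component

# theta_jl = z * m_jl with z ~ Bernoulli(pi1): the dropout evaluation
# point m_tilde used by the delta method (Gaussian noise eps ignored).
z = rng.binomial(1, pi1, size=(n_samples, 1))
theta = z * m_jl

print(theta.mean(axis=0))   # approx pi1 * m_jl, the expectation parameter
```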
Specifically, for the loss term, we use the reparameterization trick (Kingma and Welling, 2013) and approximate the gradients at samples from q(θ_jl),

θ_jl = z_jl (m_jl + S_jl^{−1/2} ϵ_{1,jl}) + (1 − z_jl) s_0^{−1/2} ϵ_{2,jl},

where z_jl is a sample from a Bernoulli distribution with probability π_1, and ϵ_{1,jl} and ϵ_{2,jl} are two independent samples from a standard normal distribution. Since weights are deterministic in dropout, we need to ignore ϵ_{1,jl} and ϵ_{2,jl} to recover dropout from the BLR. The gradients are then evaluated at the dropout mean m̃_jl = z_jl m_jl (the delta method), which leads to the following gradient approximations:

∇_{m_jl} E_{q_t}[ℓ̄(θ)] ≈ ∇_{θ_jl} ℓ̄(θ̃)|_{θ̃=m̃_t},   2∇_{S_jl^{−1}} E_{q_t}[ℓ̄(θ)] = ∇²_{m_jl m_jlᵀ} E_{q_t}[ℓ̄(θ)] ≈ ∇²_{θ_jl θ_jlᵀ} ℓ̄(θ̃)|_{θ̃=m̃_t}.   (84)

We use these gradients for the loss term in Eqs. 82 and 83. The gradient of log q(θ_jl) only depends on the variables involved in q(θ_jl), and we can use the delta method similar to Eq. 33 in Sec. 3. We make the assumption that the two mixture components are not close, which is reasonable with dropout. With this assumption, the gradient with respect to m_jl is approximately zero,

∇_{m_jl} E_q[log q(θ_jl)] = ∇_{m_jl} ∫ log q(θ_jl) [ π_1 N(θ_jl|m_jl, S_jl^{−1}) + (1 − π_1) N(θ_jl|0, s_0^{−1} I_{n_l}) ] dθ_jl
= π_1 E_{N(θ_jl|m_jl,t, S_jl,t^{−1})}[∇_{θ_jl} log q(θ_jl)]   (using Bonnet's theorem)
≈ π_1 ∇_{θ_jl} log q(θ_jl)|_{θ_jl=m_jl,t} ≈ 0,   (85)

using the delta method and Eq. 74. For the gradient with respect to S_jl^{−1} we get

2∇_{S_jl^{−1}} E_q[log q(θ_jl)] = ∇²_{m_jl m_jlᵀ} E_q[log q(θ_jl)]
= ∇_{m_jl} ( π_1 E_{N(θ_jl|m_jl,t, S_jl,t^{−1})}[∇_{θ_jl} log q(θ_jl)] )   (using the first derivative w.r.t. m_jl)
= π_1 E_{N(θ_jl|m_jl,t, S_jl,t^{−1})}[∇²_{θ_jl θ_jlᵀ} log q(θ_jl)]   (using Bonnet's theorem)
≈ π_1 ∇²_{θ_jl θ_jlᵀ} log q(θ_jl)|_{θ_jl=m_jl,t} ≈ −π_1 S_jl,t,   (86)

again using the delta method and Eq. 74. Using the gradient approximations in Eqs. 84 to 86 in Eqs. 82 and 83, we get

m_{jl,t+1} ← m_{jl,t} − (ρ/π_1) S_{jl,t+1}^{−1} ( ∇_{m_jl} ℓ̄(m̃_t) + 0 ),
S_{jl,t+1} ← S_{jl,t} + ρ ( ∇²_{m_jl m_jlᵀ} ℓ̄(m̃_t) − π_1 S_jl,t ),

which reduces to the updates in Eqs. 42 and 43.

Appendix H.
Table of Exponential-Family Distributions

Distribution | λ | T(θ) | µ | Reference
Gaussian N(θ|m, I) with fixed covariance | m | θ | m | Sec. 1.3.1
Gaussian N(θ|m, S^{−1}) with fixed covariance S^{−1} | Sm | θ | m | Sec. 4.1
Gaussian N(θ|m, S^{−1}) with diagonal S = diag(s) | (s ∘ m, −½s) | (θ, θ ∘ θ) | (m, m ∘ m + 1/s) | Secs. 4.2 and 4.4
Gaussian N(θ|m, S^{−1}) | (Sm, −½S) | (θ, θθᵀ) | (m, mmᵀ + S^{−1}) | Secs. 1.3.2 and 1.3.3
Finite mixture of Gaussians P(z = k, θ) = π_k N(θ|m_k, S_k^{−1}) with fixed π_k | (S_k m_k, −½S_k) | (1[z=k] θ, 1[z=k] θθᵀ) | (π_k m_k, π_k(m_k m_kᵀ + S_k^{−1})) | Secs. 3.2 and 4.3
Bernoulli with P(θ = 1) = p | ½ log(p/(1−p)) | 2·1[θ=1] − 1 | 2p − 1 | Sec. 4.5

Table 2: A summary of the exponential families used in the paper, showing natural parameters λ, sufficient statistics T(θ), and expectation parameters µ, along with the relevant sections.

References

L. Aitchison. Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods. Advances in Neural Information Processing Systems, 33:18173–18182, 2020.

M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal. An empirical study of binary neural networks optimisation. ICLR, 2019.

S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

S. Amari. Information geometry and its applications. Springer, 2016.

E. Amid and M. K. Warmuth. Divergence-based motivation for online EM and combining hidden variable models. In J. Peters and D. Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 81–90. PMLR, 03–06 Aug 2020.

D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pages 99–102, 1974.

K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

A. Azzalini. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159–188, 2005.

A. K. Balan, V.
Rathod, K. P. Murphy, and M. Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438–3446, 2015.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.

O. E. Barndorff-Nielsen. Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24(1):1–13, 1997.

A. G. Barto and R. S. Sutton. Goal seeking components for adaptive intelligence: An initial assessment. Technical report, Massachusetts University Amherst Dept. of Computer and Information Science, 1981.

S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, pages 29–37, 1988.

Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Z. W. Birnbaum and S. C. Saunders. A new family of life distributions. Journal of Applied Probability, 6(2):319–327, 1969.

C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.

P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. doi: 10.1111/rssb.12158. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12158.

D. M. Blei and J. D. Lafferty. A correlated topic model of science. The Annals of Applied Statistics, pages 17–35, 2007.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks.
In International Conference on Machine Learning, pages 1613–1622, 2015.

G. Bonnet. Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. In Annales des Télécommunications, volume 19, pages 203–220. Springer, 1964.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010.

R. G. Brown. Statistical forecasting for inventory control. McGraw-Hill, 1959.

W. Buntine. Variational extensions to EM and multinomial PCA. In 13th European Conference on Machine Learning, Helsinki, Finland, pages 23–34. Springer, 2002.

O. Cappé and E. Moulines. On-line expectation maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.

O. Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes, Monograph Series 56. IMS, Beachwood, OH, 2007.

A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.

P. E. Chang, W. J. Wilkinson, M. E. Khan, and A. Solin. Fast variational learning in state-space Gaussian process models. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2020.

P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2018.

M. Courbariaux, Y. Bengio, and J-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.

B.
Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics, 27(1):94 128, 1999. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1 22, 1977. R. A. Dorfman. A note on the delta-method for finding variance formulae. Biometric Bulletin, 1938. The Bayesian Learning Rule G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. ar Xiv preprint ar Xiv:1703.11008, 2017. T. Eltoft, T. Kim, and T-W. Lee. Multivariate scale mixture of Gaussians modeling. In International Conference on Independent Component Analysis and Signal Separation, pages 799 806. Springer, 2006. Y. Gal. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050 1059, 2016. Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In Advances in neural information processing systems, pages 3581 3590, 2017. E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of forecasting, 4 (1):1 28, 1985. M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790 1817, 2006. Mark Girolami and Ata Kab an. On an equivalence between PLSI and LDA. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 433 434, 2003. A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348 2356, 2011. A. Graves. Generating sequences with recurrent neural networks. ar Xiv preprint ar Xiv:1308.0850, 2013. E. Grushka. 
Characterization of exponentially modified Gaussian peaks in chromatography. Analytical Chemistry, 44(11):1733 1738, 1972. A. K. Gupta and D. K. Nagar. Matrix variate distributions, volume 104. CRC Press, 2018. E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833 1841, 2016. K. Helwegen, J. Widdicombe, L. Geiger, Z. Liu, K-T. Cheng, and R. Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 13, page 282 290, Arlington, Virginia, USA, 2013. AUAI Press. Khan and Rue J. M. Hernandez-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861 1869, 2015. S. Hochreiter and J. Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in neural information processing systems, pages 529 536, 1995. S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1 42, 1997. A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970. M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303 1347, 2013. C. C. Holt, F. Modigliani, J. F. Muth, and H. A. Simon. Planning Production, Inventories, and Work Force. Englewood Cliffs, 1960. A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235 3268, 2011. A. Huning. Evolutionsstrategie. 
Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, 1976.
T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In International Conference on Artificial Intelligence and Statistics, 1996.
E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982. doi: 10.1109/PROC.1982.12425.
R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960. ISSN 0021-9223.
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
M. E. Khan. Variational Learning for Latent Gaussian Models of Discrete Data. PhD thesis, University of British Columbia, 2012.
M. E. Khan and W. Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In International Conference on Artificial Intelligence and Statistics, pages 878–887, 2017.
M. E. Khan and D. Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models. In 2018 International Symposium on Information Theory and Its Applications (ISITA), pages 31–35. IEEE, 2018.
M. E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, 2010.
M. E. Khan, S. Mohamed, B. Marlin, and K. Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In Artificial Intelligence and Statistics, pages 610–618. PMLR, 2012.
M. E. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2611–2620. PMLR, 2018.
M. E. Khan, A. Immer, E. Abedi, and M. Korzepa. Approximate inference turns deep networks into Gaussian processes. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
A. Klami. Polya-gamma augmentations for factor models. In D. Phung and H. Li, editors, Proceedings of the Sixth Asian Conference on Machine Learning, volume 39 of Proceedings of Machine Learning Research, pages 112–128, Nha Trang City, Vietnam, 2015. PMLR.
D. A. Knowles and T. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems, pages 1701–1709, 2011.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
F. Kunstner, R. Kumar, and M. Schmidt. Homeomorphic-invariance of EM: Non-asymptotic convergence in KL divergence for exponential families via mirror descent. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3295–3303. PMLR, 2021.
P. S. Laplace. Memoir on the probability of the causes of events. Statistical Science, 1(3):364–378, 1986.
N. Le Roux and A. W. Fitzgibbon. A fast natural Newton method. In International Conference on Machine Learning, 2010.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 9–50, Berlin, Heidelberg, 1998. Springer-Verlag. ISBN 3540653112.
M. Leordeanu and M. Hebert. Smoothing-based optimization. In Computer Vision and Pattern Recognition, pages 1–8, 2008.
W. Lin, M. E. Khan, and M. Schmidt. Stein's lemma for the reparameterization trick with exponential family mixtures. arXiv preprint arXiv:1910.13398, 2019a.
W. Lin, M. E. Khan, and M. Schmidt. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 3992–4002. PMLR, 2019b.
W. Lin, F. Nielsen, M. E. Khan, and M. Schmidt. Tractable structured natural-gradient descent using local parameterizations. In Proceedings of the 38th International Conference on Machine Learning, 2021.
N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
C. Louizos and M. Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
D. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991.
C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13153–13164, 2019.
L. Malagò and G. Pistone. Information geometry of the Gaussian distribution in view of stochastic optimization. In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII, pages 150–162, 2015.
L. Malagò, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, pages 230–242, 2011.
S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18:1–35, 2017.
J. Martens. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146):1–76, 2020.
X. Meng, R. Bachmann, and M. E. Khan. Training binary neural networks using the Bayesian learning rule. In International Conference on Machine Learning, pages 6852–6861. PMLR, 2020.
A. Mishkin, F. Kunstner, D. Nielsen, M. Schmidt, and M. E. Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, volume 31, 2018.
H. Mobahi and J. W. Fisher III. A theoretical analysis of optimization by Gaussian continuation. In AAAI Conference on Artificial Intelligence, pages 1205–1211, 2015.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
A. Nemirovski and D. Yudin. On Cesàro's convergence of the gradient descent method for finding saddle points of convex-concave functions. Doklady Akademii Nauk SSSR, 239(4), 1978.
Y. Ollivier. Online natural gradient as a Kalman filter. Electronic Journal of Statistics, 12(2):2930–2961, 2018.
Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. Journal of Machine Learning Research, 18(18):1–65, 2017.
M. Opper and C. Archambeau. The variational Gaussian approximation revisited.
Neural Computation, 21(3):786–792, 2009.
K. Osawa, S. Swaroop, A. Jain, R. Eschenhagen, R. E. Turner, R. Yokota, and M. E. Khan. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, pages 4287–4299, 2019.
R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
R. Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
G. Raskutti and S. Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
H. Ritter, A. Botev, and D. Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.
H. Rue and S. Martino. Approximate Bayesian inference for hierarchical Gaussian Markov random fields models. Journal of Statistical Planning and Inference, 137(10):3177–3192, 2007. Special Issue: Bayesian Inference for Stochastic Processes.
H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71(2):319–392, 2009.
H. Rue, A. Riebler, S. H. Sørbye, J. B. Illian, D. P. Simpson, and F. K. Lindgren. Bayesian computing with INLA: A review. Annual Review of Statistics and Its Application, 4:395–421, 2017. doi: 10.1146/annurev-statistics-060116-054045.
T. Salimans and D. A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
H. Salimbeni, S. Eleftheriadis, and J. Hensman. Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In International Conference on Artificial Intelligence and Statistics, pages 689–697. PMLR, 2018.
M.-A. Sato. Fast learning of on-line EM algorithm. Technical report, ATR Human Information Processing Research Laboratories, 1999.
M.-A. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013.
R. Sheth. Algorithms and Theory for Variational Inference in Two-Level Non-conjugate Models. PhD thesis, Tufts University, 2019.
R. Sheth and R. Khardon. Monte Carlo structured SVI for two-level non-conjugate models. arXiv preprint arXiv:1612.03957, 2016.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
R. L. Stratonovich. Conditional Markov processes. In Non-linear Transformations of Stochastic Processes, pages 427–453. Elsevier, 1965.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
T. Tieleman and G. Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
D. M. Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society: Series B (Methodological), 46(2):257–267, 1984.
J.-W. van de Meent, B. Paige, H. Yang, and F. Wood. An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756, 2018.
D. van der Hoeven, T. van Erven, and W. Kotłowski. The many faces of exponential weights in online learning. In Conference On Learning Theory, pages 2067–2092. PMLR, 2018.
J. M. Ver Hoef. Who invented the delta method? The American Statistician, 66(2):124–127, 2012.
V. G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371–386, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc. ISBN 1-55860-146-5.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
C. Wang and D. M. Blei. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005–1031, 2013.
M. Welling, C. Chemudugunta, and N. Sutter. Deterministic latent variable models and their pitfalls. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 196–207. SIAM, 2008.
D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
J. Winn and C. M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694, 2005.
K.-C. Wong. Evolutionary multimodal optimization: A short survey. arXiv preprint arXiv:1508.00457, 2015.
P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin. Understanding straight-through estimator in training activation quantized neural nets. In International Conference on Learning Representations, 2019.
X. Yu and M. Gen. Introduction to Evolutionary Algorithms. Springer Science & Business Media, 2010.
M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
A. Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278–280, 1988.
C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018a.
G. Zhang, S. Sun, D. Duvenaud, and R. Grosse. Noisy natural gradient as variational inference. In International Conference on Machine Learning, pages 5852–5861. PMLR, 2018b.
T. Zhang. Theoretical analysis of a class of randomized regularization methods. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 156–163, New York, NY, USA, 1999. Association for Computing Machinery. ISBN 1581131674.