# Tractable Structured Natural-Gradient Descent Using Local Parameterizations

Wu Lin 1, Frank Nielsen 2, Mohammad Emtiyaz Khan 3, Mark Schmidt 1 4

1University of British Columbia. 2Sony Computer Science Laboratories Inc. 3RIKEN Center for Advanced Intelligence Project. 4CIFAR AI Chair, Alberta Machine Intelligence Institute. Correspondence to: Wu Lin. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using local-parameter coordinates to obtain a flexible and efficient NGD method that works well for a wide variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolution strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.

1. Introduction

A wide variety of problems that arise in the fields of optimization, inference, and search can be expressed as

min_{q(w) ∈ Q} E_{q(w)}[ℓ(w)] − γ H(q(w)),   (1)

where w is the parameter of interest, q(w) ∈ Q is a distribution, H(q(w)) is Shannon's entropy, ℓ(w) is a loss function, and γ ≥ 0. For example, in problems involving random search (Baba, 1981), stochastic optimization (Spall, 2005), and evolutionary strategies (Beyer, 2001), q(w) is the so-called search distribution used to find a global minimum of a black-box function ℓ(w). In reinforcement learning, it can be the policy distribution which minimizes the expected value-function ℓ(w) (Sutton et al., 1998), sometimes with entropy regularization (Williams & Peng, 1991; Teboulle, 1992; Mnih et al., 2016). For Bayesian problems, q(w) is the posterior distribution or its approximation and ℓ(w) is the negative log of the joint distribution (Zellner, 1986), with γ set to 1. Finally, many robust or global optimization techniques employ q(w) to smooth out local minima (Mobahi & Fisher III, 2015; Leordeanu & Hebert, 2008; Hazan et al., 2016), where often γ = 0. Developing fast and scalable algorithms for solving (1) potentially impacts all these fields.

Natural-gradient descent (NGD) is an attractive algorithm to solve (1) and can speed up the optimization by exploiting the information geometry of q(w) (Wierstra et al., 2008; Sun et al., 2009; Hoffman et al., 2013; Khan & Lin, 2017; Salimbeni et al., 2018). It also unifies a wide variety of learning algorithms, which can be seen as its instances with a specific q(w) (Khan & Rue, 2020). This includes deep learning (Khan et al., 2018), approximate inference (Khan & Lin, 2017), and optimization (Khan & Rue, 2020; Khan et al., 2017). NGD also has better convergence properties compared to methods that ignore the geometry, for example, Ranganath et al. (2014); Lezcano Casado (2019). We consider NGD where parameters of q(w) assume special structures, for example, low-rank or sparse Gaussian covariances. For such cases, NGD is often intractable and/or costly due to difficult Fisher Information Matrix (FIM) computations.
First, the FIM can be singular for restricted parametrizations (see Fig. 1(I)), which is often addressed with ad-hoc structural approximations derived on a case-by-case basis (Sun et al., 2013; Akimoto & Hansen, 2016; Li & Zhang, 2017; Mishkin et al., 2018; Tran et al., 2020) (also see Appx. D.4). Second, while we can switch parameterizations, the computation could be inefficient because the structure might be lost, for example, when switching from a sparse precision to the covariance. Using automatic differentiation could make the situation worse because such tools are often unaware of the structure (Salimbeni et al., 2018) (also see Appx. G.1). Finally, the choice of parameterizations and approximations themselves involves delicate choices to get a desired computation-accuracy trade-off. For example, for neural networks, layer-wise approximations (Sun & Nielsen, 2017; Zhang et al., 2018; Lin et al., 2019a) might be better than low-rank/diagonal structures (Mishkin et al., 2018; Tran et al., 2020; Ros & Hansen, 2008; Khan et al., 2018), but may also involve more computations. Our goal is to address these difficulties and design a flexible method that works well for a variety of structured parameterizations.

We present local-parameter coordinates to design flexible and tractable NGD for a variety of structured-parameter spaces. The method is summarized in Fig. 1(II), and involves specifying (i) a local parameter coordinate that satisfies the structural constraints of the original (global) parameters, (ii) a map to convert back to the global parameters via auxiliary parameters, and finally (iii) a tractable natural-gradient computation in the local-parameter space. This construction ensures a valid NGD update in local parameter spaces, while maintaining structures (often via matrix groups) in the auxiliary parameters. This decoupling enables a tractable NGD that exploits the structure, when these parameters and the map are chosen carefully.

Figure 1. (I) The FIM can be singular, for example, when the covariance Σ has a low-rank structure (more details in Appx. J.1.6); the two identical columns of the FIM are shown in yellow. (II) We fix such issues by using a local parameterization η (here M, an unconstrained structured matrix) which is related to the global variable τ (= Σ for the low-rank example) through an auxiliary parameter λ (= A, an invertible matrix with a specific structure to get a low-rank Σ = AA^T). The three parameter spaces are related through the maps τ = ψ(λ) = AA^T and λ = φ_{λ_old}(η), i.e., A = A_old Exp(M), and need to satisfy Assumptions 1 and 2 given in Section 3. This results in a valid NGD step (shown at the bottom) in the local-parameter space (defined at η_0 = 0 with learning rate β). (III) For a 1-D Bayesian logistic regression, our NGD is invariant to two different parameterizations, which is not the case for GD (details in Appx. D.3).

We show four applications of our method. 1. We generalize Glasmachers et al. (2010)'s method to more general distributions and structures (Section 3.1). 2.
In Section 3.2, we recover Newton-like methods derived by Lin et al. (2020) using Riemannian-gradients and by Khan et al. (2018) using the standard NGD. 3. Our approach is easily generalizable to other non Gaussian cases; see Setion 3.3 and 3.4. 4. In Section 4, we derive new 2nd-order methods for lowrank, diagonal, and sparse covariances. The methods are only slightly more costly than diagonal-covariance methods. Moreover, they can be used as structured 2nd-order methods for unconstrained optimization. We show applications to various problems for search, variational inference, and deep learning, obtaining much faster convergence than methods that ignore geometry. An example for 1-D logistic regression is shown in 1(III). Overall, our work opens a new direction to design efficient and structured geometric methods via local parameterizations. 2. Structured NGD and its Challenges The distributions q(w) Q are often parameterized, say using parameters τ Ωτ, for which we write q(w|τ). The problem can then be conveniently expressed as an optimization problem in the space Ωτ, τ = arg min τ Ωτ Eq(w|τ) [ℓ(w)] , (2) where we assume γ = 0 for simplicity (general case is in Lemma 4 of Appx. C). The NGD step is τ t+1 τ t βˆgτ t where β > 0 is the step size and natural gradients are as ˆgτ t := Fτ(τ t) 1 τEq(w|τ) [ℓ(w)] , (3) where Fτ(τ):= Eq[ τ log q(w|τ)( τ log q(w|τ))] is an invertible and well-defined FIM following the regularity condition (see Appx. C). The iterates τ t+1 may not always lie inside Ωτ and a projection step might be required. In some cases, the NGD computation may not require an explicit FIM inversion. For example, when q(w|τ) is a minimal exponential-family (EF) distribution, FIM is always invertible, and natural gradients are equal to vanilla gradients with respect to the expectation parameter (Malagò et al., 2011; Khan & Nielsen, 2018). By appropriately choosing Q, the NGD then takes forms adapted by popular algorithms (Khan & Rue, 2020), for example, for Gaussians q(w|τ) = N(w|µ, S 1) where S denotes the precision, it reduces to a Newton-like update (Khan et al., 2018), µt+1 µt βS 1 t+1Eq(w|τ t)[ wℓ(w)], St+1 St + βEq(w|τ t) 2 wℓ(w) . (4) The standard Newton update for optimization is recovered by approximating the expectation at the mean and using a step-size of 1 with γ = 1 (Khan & Rue, 2020). Several connections and extensions have been derived in the recent Tractable Structured Natural-Gradient Descent Using Local Parameterizations years establishing NGD as an important algorithm for optimization, search, and inference (Khan & Lin, 2017; Khan & Nielsen, 2018; Lin et al., 2019a; Osawa et al., 2019b). This simplification of NGD breaks down when (2) involves structured-parameter spaces Ωτ, for example, spaces with constrains such as low-rank or sparse structures. Even for the simplest Gaussian case, where covariances lie in the positive-definite space, the update (4) may violate the constraint (Khan et al., 2018). Extensions have been derived using Riemannian gradient descent (RGD) to fix this issue (Lin et al., 2020). Other solutions based on Cholesky (Sun et al., 2009; Salimbeni et al., 2018) or square-root parameterization (Glasmachers et al., 2010) have also been considered, where the problem is converted to an unconstrained parameter space. For example, Glasmachers et al. 
(2010) use a square-root parameterization q(w) = N(w|µ, AAT ), where A is the square-root of S 1, to get the update, µt+1 µt βEq(w|τ t) Atzt ℓ(w) , At+1 At Exp β 2 Eq(w|τ t) ztz T t I ℓ(w) , (5) where zt = A 1 t (w µt) and Exp(X) = I + P k=1 X k k! is the matrix exponential function. These solutions however do not easily generalize. For example, it is not obvious how to apply these updates to cases where the covariance is lowrank (Mishkin et al., 2018; Tran et al., 2020), Kronecker structured (Zhang et al., 2018; Lin et al., 2019a), or to cases involving non-Gaussian distributions such as the Wishart, univariate exponential family distributions (Lin et al., 2020) and Gaussian mixtures (Lin et al., 2019a). In fact, the issue with the structure and its effect on parameterization is a bit more involved than it might appear at first. Certain choices of the structure/parameterization can make the Fisher matrix singular which can make NGD invalid, for example, for low-rank Gaussians as shown in Fig. 1(I) where it requires new tricks such as auxiliary parameterization (Lin et al., 2019a), block approximations (Tran et al., 2020), algorithmic approximations (Mishkin et al., 2018), or damping (Zhang et al., 2018). The computational cost depends on the parameterization, the choice of which is often not obvious. Some methods exploit structure in the covariances (Glasmachers et al., 2010) while the others work with its inverse such as (4). Customized structures, such as layer-wise and Kronecker-factored covariances in deep neural nets, may work well in one parameterization but not in the other. Thus, it is essential to have a flexible method that works well for a variety of structured-parameterizations and distributions. Our goal is to propose such a method. 3. Local Parameter Coordinates We present local-parameter coordinates to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. Table 1 in Appx. A summarizes the examples and extensions we consider. We describe the method in three steps. Step 1. The first step involves specifying a local parameterization, denoted by η Ωη, so that the following assumption is satisfied (throughout, we set η0 = 0). Assumption 1: The Fisher matrix Fη(η0) is non-singular. Step 2. The second step involves specifying two maps shown below to connect to the original global parameters τ via an auxiliary parameter λ Ωλ, τ = ψ(λ) and λ = φλt(η), (6) where the first map is surjective and the second map is defined such that λt = φλt(η0), i.e., the function is tight at η0 to match the current λt. The local parameter η0 can be seen as a relative origin tied to λt. The overall map is τ = ψ φλt(η) (the map could change with iterations). Notice that we make no assumption about the non-singularity of the FIM in the auxiliary space Ωλ. The FIM in the auxiliary space Ωλ can be singular (see Section 3.1). The only restriction is a mild coordinate compatibility assumption. Assumption 2 : λt Ωλ, the map η 7 ψ φλt(η) is locally C1-diffeomorphic at an open neighborhood of η0. Assumption 2 implies that the local η has the same degrees of freedom as τ, but the auxiliary λ can have a different one (an example is in Section 3.1). Assumption 1-2, together with surjective ψ( ), imply a non-singular FIM in the global space Ωτ, so there is no need to check it for specific cases. On the other hand, if we know the non-singularity of the FIM in Ωτ beforehand, Assumption 2 together with surjective ψ( ) imply that Assumption 1 is satisfied. Step 3. 
The final step is to compute the natural gradient at η_0 in the local-parameter space to update the global τ, which can be done by using the chain rule,

ĝ^{(t)}_{η_0} = F_η(η_0)^{-1} [∇_{η_0} ψ ∘ φ_{λ_t}(η)]^T g_{τ_t},   (7)

where g_τ := ∇_τ E_{q(w|τ)}[ℓ(w)] is the vanilla gradient. An indirect computation is given in (26) in Appx. C. The above computation is most useful when the computation of ĝ^{(t)}_{η_0} is tractable, which ultimately depends on the choice of ψ ∘ φ_{λ_t}, which in turn depends on the form of q(w). Then, by using an NGD step η_0 − β ĝ^{(t)}_{η_0} in the local-parameter space, we get the following overall update for τ,

Structured NGD using local parameters:
λ_{t+1} ← φ_{λ_t}(−β ĝ^{(t)}_{η_0}),   τ_{t+1} ← ψ(λ_{t+1}),   (8)

since we assume η_0 = 0. In summary, given an auxiliary parameter λ_t, we can use the natural gradient ĝ^{(t)}_{η_0} to update τ according to (8). The NGD step using (3) is a special case of the above NGD step (see details in Appx. F).

Finally, we require the following assumption to be satisfied to ensure that the NGD step −β ĝ^{(t)}_{η_0} ∈ Ω_η in (8) (this assumption is satisfied for all examples we discuss).

Assumption 3: Ω_η has a vector-space structure so that vector addition, vector subtraction, and real scalar multiplication are valid.

We will now discuss three applications of our method where we derive existing NGD strategies as special cases.

3.1. Gaussian with square-root covariance structure

For a Gaussian distribution N(w|µ, Σ), the covariance matrix Σ is positive definite. Standard NGD steps such as (4) may violate this constraint (Khan et al., 2018). Glasmachers et al. (2010) use Σ = AA^T, where A is an invertible matrix (not a Cholesky factor), and derive an update using a specific local parameterization. We now show that their update is a special case of our method. Following Glasmachers et al. (2010), we choose the following parameterizations, where we use S^{p×p}_{++}, S^{p×p}, and R^{p×p}_{++} to denote the set of symmetric positive-definite matrices, symmetric matrices, and invertible matrices, respectively,

τ := {µ ∈ R^p, Σ ∈ S^{p×p}_{++}},  λ := {µ ∈ R^p, A ∈ R^{p×p}_{++}},  η := {δ ∈ R^p, M ∈ S^{p×p}},   (9)

where δ and M are the local parameters. The map ψ ∘ φ_{λ_t}(η) at λ_t := {µ_t, A_t} is chosen to be¹

{µ, Σ} = ψ(λ) := {µ, AA^T},  λ = φ_{λ_t}(η) := {µ_t + A_t δ, A_t Exp(½ M)}.   (10)

Finally, we can get the natural gradients (7) by using the Fisher matrix F_η(η_0) = Diag(I_p, ½ I_{p²}) (see Appx. D.2 for a derivation),

ĝ^{(t)}_{δ_0} = A_t^T g_{µ_t},  vec(ĝ^{(t)}_{M_0}) = 2 vec(A_t^T g_{Σ_t} A_t).   (11)

By plugging (10) and (11) into (7), our update can be written in the space of λ as below, where S_t^{-1} = Σ_t,

µ_{t+1} ← µ_t − β S_t^{-1} g_{µ_t},  A_{t+1} ← A_t Exp(−β A_t^T g_{Σ_t} A_t).   (12)

By the REINFORCE trick (Williams, 1992), the gradients with respect to the global parameters are

g_µ = E_{q(w|τ)}[A^{-T} z ℓ(w)],  g_Σ = ½ E_{q(w|τ)}[A^{-T}(zz^T − I)A^{-1} ℓ(w)],   (13)

where z = A^{-1}(w − µ). By plugging (13) into (12), we recover the update (5) used in Glasmachers et al. (2010). Appx. D.2 shows that Assumptions 1-2 are satisfied.

¹The 1/2 in (10) is used to match the parameterizations in Glasmachers et al. (2010), but the update in (12) remains unchanged even without it.

Parameterizations η = {δ, M} and λ = {µ, A} play distinct roles. The local parameter M is chosen to be symmetric with p(p + 1)/2 degrees of freedom so that Assumption 1 holds (also see Appx. D.1.3). The auxiliary parameter A can be an invertible matrix with p² degrees of freedom, and the Fisher matrix F_λ(λ) is singular. Note that we perform natural-gradient descent in η instead of λ.
This is in contrast with the other works (Sun et al., 2009; Salimbeni et al., 2018) that require a Cholesky structure in A with p(p+1)/2 degrees of freedom to ensure that Fλ(λ) is non-singular. Glasmachers et al. (2010) only demonstrated their method in the Gaussian case without complete derivations2 and a formal formulation. It is difficult to generalize their method without explicitly knowing the distinct roles of parameterizations η and λ. Moreover, their approach only applied to a square-root structure of the covariance and it is unclear how to generalize it to other structures (e.g., low-rank structures). Our method fixes these issues of their approach. 3.2. Connection to Newton s method We now show that the update (5) derived using local parameterization is in fact closely related to a Newton-like algorithm. Specifically, we will convert the update of At+1 in (5) to the update over St+1, as in (4), and recover the Newton s update derived by Lin et al. (2020). To do so, we need to make two changes. First, we will expand Exp βM = k! = I + βM + 1 2(βM)2 + O(β3). (14) Second, instead of using (13), we will use Stein s identity (Opper & Archambeau, 2009; Lin et al., 2019b): gµ = Eq[ wℓ(w)], gΣ = 1 2 wℓ(w) (15) Using these changes, the update over St+1 can be rewritten as a modified Newton s update proposed by Lin et al. (2020), St+1 = At+1AT t+1 1 = A T t Exp 2βAT t gΣAt A 1 t 2 wℓ(w) +β2 2 GS 1 t G + O(β3) (16) where G = Eq 2 wℓ(w) . Ignoring the red term gives us the update (4) derived by Khan et al. (2018). The term is added by Lin et al. (2020) to fix the positive-definite constraint violation, by Riemannian gradient descent. Thus, these methods can be seen as special cases of ours with an approximation of the exponential map. 2There are a few typos in their paper. The matrix A is missing in their Eq 8 and a factor 2 is missing in Eq 11. Tractable Structured Natural-Gradient Descent Using Local Parameterizations 3.3. Wishart with square-root precision structure We will now show an example that goes beyond Gaussians. We consider a Wishart distribution which is a distribution over p-by-p positive-definite matrices, Wp(W|S, n) = |W|(n p 1)/2|S|n/2 2 )2np/2 e 1 where Γp( ) is the multivariate gamma function. Here, the global parameters are based on the precision matrix S, unlike the example in Sec. 3.1. We will see that our update will automatically take care of this difference and report a similar update to the one obtained using Σ in (12). We start by specifying the parameterization, τ := n R, S Sp p ++ | n > p 1 , λ := b R, B Rp p ++ , η := δ R, M Sp p , and their respective maps defined at λt := {bt, Bt} n S = ψ(λ) := 2f(b) + p 1 (2f(b) + p 1)BB = φλt(η) := bt + δ Bt Exp (M) where f(b) = log(1 + exp(b)) is the soft-plus function3. The auxiliary parameter B here is defined as the square-root of the precision matrix S, unlike in the previous examples. Denoting the gradients by GS 1 := s 1Eq [ℓ(W)] , gn := n Eq [ℓ(W)] , (17) we can write the updates as (derivation in Appx. E): Bt+1 Bt Exp β n2 t B 1 t GS 1 t B T t bt+1 bt βct nt Tr GS 1 t S 1 t (19) where ct = 2(1+exp(bt)) nt +Dψ,p( nt 2 ) 1 and Dψ,p(x) is the multivariate trigamma function. Moreover, we can use re-parameterizable gradients (Figurnov et al., 2018; Lin et al., 2019b) for GS 1 t and gn due to the Bartlett decomposition (Smith et al., 1972) (see Appx. E.1 for details). The update (18) for B (square-root of the precision matrix) is very similar to the update for A (square-root for covariance) in (12). 
The change from covariance to precision parameterization changes the sign of the update. The step size is modified using the parameter nt. The local parameterization can automatically adjust to such changes in the parameter specification, giving rise to intuitive updates. 3We use the soft-plus function instead of the scalar exponential map for numerical stability. 3.4. Connection to Riemannian Gradient Descent We will show that the updates on the Wishart distribution is a generalization of Riemannian Gradient Descent (RGD) over the space of positive-definite matrices. Given an optimization problem min Z Sp p ++ ℓ(Z) over the space of symmetric positive-definite matrices, the RGD update with retraction can be written in terms of the inverse U = Z 1 (see Appx. E.2 for the details), Ut+1 Ut + β1 ℓ(Zt) + β2 1 2 ℓ(Zt)]U 1 t ℓ(Zt)] where is taken with respect to Z, and β1 is the step size. We now show that this is a special case of (18) where gradients (17) are approximated at the mean of the Wishart distribution as Eq [W] = n S 1. Denoting the mean by Zt, the approximation is (see the derivation in Appx. E.3), GS 1 t nt ℓ(Zt), gnt Tr ℓ(Zt)S 1 t (20) Plugging (20) into (19), b remains constant after the update, bt+1 bt βct ((((((( Tr ℓ(Zt)S 1 t ((((((( Tr ℓ(Zt)S 1 t so that bt+1 bt and nt is constant since n = 2f(b)+p 1. Resetting the step-size to be β = 1 2β1n,4 (18) becomes Bt+1 Bt Exp β1 2 B 1 t ℓ(Zt) B T t Finally, we express the update in terms of Ut := Z 1 t = Bt BT t to rewrite (21) as by using the second-order terms in the matrix exponential (14), Ut+1 Bt Exp(β1B 1 t ℓ(Zt) B T t )BT t Ut + β1 ℓ(Zt) + β2 1 2 ℓ(Zt)]U 1 t ℓ(Zt)] + O(β3 1) recovering the RGD update. Thus, the RGD update is a special case of our update, where the expectation is approximated at the mean. This is a local approximation to avoid sampling from q(W). This derivation is another instance of reduction to a local method using NGD over distributions, similar to the ones obtained by Khan & Rue (2020). 3.5. Generalizations and Extensions In previous sections, we use the matrix exponential map to define φλt(η), but other maps can be used. This is convenient since the map can be difficult to compute and numerically unstable. We propose to use another map: h(M) = I + M + 1 Map h( ) plays a key role for complexity reduction in Sec. 4, since it simplifies the natural-gradient computation in Gaussian and Wishart cases without changing the form of the 4Since n remains constant, β = 1 2β1n is a constant step-size. Tractable Structured Natural-Gradient Descent Using Local Parameterizations updates (due to Lemma 6-8 in Appx. C). For example, consider the Gaussian case in Sec. 3.1 where covariance Σ is used. Using our approach, we could easily change the parameterization to the precision S = Σ 1 instead, by changing the parameters in (9) to τ := µ Rp, S Sp p ++ λ := µ Rp, B Rp p ++ η := δ Rp, M Sp p . We can use map h( ) in the following transformations: µ S = ψ(λ) := µ BB = φλt(η) := µt + B T t δ Bth(M) An update (see (39) in Appx. D.1) almost identical to (16) is obtained with this parameterization and map. The difference only appears in a O(β3) term. Unlike the method originally described by Glasmachers et al. (2010), our formulation makes it easy for a variety of parameterizations and maps, while keeping the natural-gradient computation tractable. 
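As a concrete illustration of this precision parameterization, the following is a minimal NumPy sketch (ours, not the authors' code) of one full-matrix step using the map h(M) = I + M + ½M² in place of the matrix exponential. It assumes the gradients g_µ and g_Σ are supplied externally, e.g., via Stein's identity (15) or the REINFORCE trick (13); dense linear algebra is used purely for clarity.

```python
import numpy as np

def h(M):
    """h(M) = I + M + 0.5*M@M, the truncated exponential map from Sec. 3.5."""
    return np.eye(M.shape[0]) + M + 0.5 * (M @ M)

def gaussian_precision_step(mu, B, g_mu, g_Sigma, beta):
    """One local-parameterization NGD step for N(w | mu, S^{-1}) with S = B B^T.

    g_mu and g_Sigma are gradients of E_q[loss] w.r.t. the mean and the covariance,
    e.g. from Stein's identity: g_mu = E_q[grad l], g_Sigma = 0.5 * E_q[Hess l].
    """
    S = B @ B.T
    mu_new = mu - beta * np.linalg.solve(S, g_mu)      # mu - beta * S^{-1} g_mu
    Binv = np.linalg.inv(B)                            # fine for small p; illustrative only
    B_new = B @ h(beta * (Binv @ g_Sigma @ Binv.T))    # B h(beta * B^{-1} g_Sigma B^{-T})
    return mu_new, B_new

# Toy check on a quadratic loss l(w) = 0.5*(w - a)^T H (w - a), where Stein's
# identity is exact: g_mu = H (mu - a) and g_Sigma = 0.5 * H.
rng = np.random.default_rng(0)
p = 5
a = rng.normal(size=p)
H = np.diag(np.linspace(1.0, 10.0, p))
mu, B = np.zeros(p), np.eye(p)
for _ in range(200):
    mu, B = gaussian_precision_step(mu, B, H @ (mu - a), 0.5 * H, beta=0.1)
# The mean moves toward the minimizer a while S = B B^T accumulates the Hessian.
print(np.linalg.norm(a), np.linalg.norm(mu - a))
```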
To avoid computing the Hessian ∇²_w ℓ(w) in Gaussian cases, we could use the re-parameterization trick⁵ for the covariance matrix (Lin et al., 2019b; 2020) in the update (16):

g_Σ = ½ E_{q(w|τ)}[S (w − µ) ∇_w ℓ(w)^T].   (24)

By the identities in (24), (13), and (15), we establish the connection of our Gaussian update to variational inference via the re-parameterization trick, to numerical optimization via Stein's identity, and to black-box search via the REINFORCE trick. Our approach also gives NGD updates for common univariate exponential-family (EF) distributions via automatic differentiation (see Appx. G for details).

In practice, the FIM under the global parameter τ or the local parameter η can be singular. For example, the FIM of curved EFs (Lin et al., 2019a) and MLPs (Amari et al., 2018) can be singular. The FIM of the low-rank structured Gaussian (Tran et al., 2020; Mishkin et al., 2018) has the same issue (see Appx. J.1.6 for a discussion). We extend our approach to the following two kinds of curved EFs, where we relax Assumption 1 for local parameterizations.

In Appx. I, we adapt our local-parameterization approach to a block approximation for matrix Gaussian cases, where cross-block terms in the FIM are set to zero (see (48) in Appx. I). Our approximated FIM is guaranteed to be non-singular since the matrix Gaussian is a minimal multi-linear EF (Lin et al., 2019a). Our approach is very different from noisy-KFAC (Zhang et al., 2018). In noisy-KFAC, a KFAC approximation along with a block approximation is used, where the approximated FIM can be singular without damping. Damping introduces an extra tuning hyper-parameter.

In Appx. H, we extend our approach to mixtures such as Gaussian mixtures using the FIM defined by the joint distribution of a mixture. For mixture distributions, Lin et al. (2019a) show that the FIM of the joint distribution of a minimal conditional mixture is guaranteed to be non-singular.

⁵∇_w ℓ(w) is only required to exist almost surely.

4. NGD for Structured Matrix Groups

We now show applications to NGD on matrices with special structures. The key idea is to use the fact that the auxiliary-parameter space R^{p×p}_{++} used in Sec. 3 is a general linear group (GL group) (Belk, 2013), and structured restrictions give us its subgroups. We can specify local parameterizations for the subgroups to get a tractable NGD. We will use the Gaussian example considered in Sec. 3.5 to illustrate this idea. A similar technique could be applied to the Wishart example. We will discuss the triangular group first, and then discuss an extension inspired by the Heisenberg group.

We use Bup(k) to denote the following space of block upper-triangular p-by-p matrices as an auxiliary parameter space, where k is the block size with 0 < k < p, d_0 = p − k, and D^{d_0×d_0}_{++} is the space of diagonal and invertible matrices:

Bup(k) = { [B_A  B_B; 0  B_D] | B_A ∈ R^{k×k}_{++}, B_D ∈ D^{d_0×d_0}_{++} }.

When k = 0, Bup(k) = D^{p×p}_{++} becomes a diagonal auxiliary space. When k = p, Bup(k) = R^{p×p}_{++} becomes the full space. The following lemma shows that Bup(k) is a matrix group.

Lemma 1 Bup(k) is a matrix group that is closed under matrix multiplication.

A local parameter space for Bup(k) is defined below with fewer degrees of freedom than the local space S^{p×p} in (22):

Mup(k) = { [M_A  M_B; 0  M_D] | M_A ∈ S^{k×k}, M_D ∈ D^{d_0×d_0} },

where D^{d_0×d_0} denotes the space of diagonal matrices. Lemma 2 shows that h(·) defined in Sec. 3.5 is essential.

Lemma 2 For any M ∈ Mup(k), h(M) ∈ Bup(k).
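Both lemmas can be checked numerically for small p and k. The sketch below is ours (the helper names are not from the paper): it builds random elements of Bup(k) and Mup(k) and verifies that products, inverses, and h(M) all keep the Bup(k) zero/diagonal pattern.

```python
import numpy as np

def h(M):
    """h(M) = I + M + 0.5*M@M from Sec. 3.5."""
    return np.eye(M.shape[0]) + M + 0.5 * (M @ M)

def random_B_up(p, k, rng):
    """A random element of B_up(k): invertible k x k block B_A, arbitrary B_B,
    and a diagonal invertible (p-k) x (p-k) block B_D."""
    B = np.zeros((p, p))
    B[:k, :k] = rng.normal(size=(k, k)) + 3.0 * np.eye(k)   # keep it invertible
    B[:k, k:] = rng.normal(size=(k, p - k))
    B[k:, k:] = np.diag(rng.uniform(0.5, 2.0, size=p - k))
    return B

def random_M_up(p, k, rng):
    """A random element of M_up(k): symmetric k x k block, arbitrary upper-right
    block, diagonal lower-right block."""
    M = np.zeros((p, p))
    A = rng.normal(size=(k, k))
    M[:k, :k] = 0.5 * (A + A.T)
    M[:k, k:] = rng.normal(size=(k, p - k))
    M[k:, k:] = np.diag(rng.normal(size=p - k))
    return M

def in_B_up(X, k, tol=1e-10):
    """Check the B_up(k) pattern: zero lower-left block and diagonal lower-right block."""
    D = X[k:, k:]
    return np.allclose(X[k:, :k], 0.0, atol=tol) and np.allclose(D, np.diag(np.diag(D)), atol=tol)

rng = np.random.default_rng(0)
p, k = 6, 2
B1, B2 = random_B_up(p, k, rng), random_B_up(p, k, rng)
M = random_M_up(p, k, rng)
print(in_B_up(B1 @ B2, k))            # Lemma 1: closed under matrix multiplication
print(in_B_up(np.linalg.inv(B1), k))  # the group inverse also stays in B_up(k)
print(in_B_up(h(M), k))               # Lemma 2: h maps M_up(k) into B_up(k)
```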
Using these spaces, we specify the parametrization for the Gaussian N(w|µ, S^{-1}), where the precision S belongs to a sub-manifold⁶ of S^{p×p}_{++},

τ := {µ ∈ R^p, S = BB^T ∈ S^{p×p}_{++} | B ∈ Bup(k)},
λ := {µ ∈ R^p, B ∈ Bup(k)},
η := {δ ∈ R^p, M ∈ Mup(k)}.

⁶η locally gives a parametric representation of the sub-manifold. See (51) in Appx. J.1.3 for an equivalent global parameterization of this sub-manifold.

Figure 2. Comparison of structured Gaussian mixtures fit to an 80-dimensional mixture of Student's t distributions with 10 components. The first marginal dimension obtained by our updates is shown, where an upper-triangular structure in the precision form achieves a better approximation than a lower-triangular structure and a diagonal structure. The upper-triangular structure performs comparably to the full covariance structure at a lower computational cost. Figures 6-8 in Appx. B show more dimensions and results on other structures.

The map ψ ∘ φ_{λ_t}(η) at λ_t := {µ_t, B_t} is chosen to be the same as (23) due to Lemma 1 and Lemma 2. Lemma 3 below shows that this local parameterization is valid.

Lemma 3 Assumptions 1-2 are satisfied in this case.

The natural gradients (shown in Appx. J.1.4) are

ĝ^{(t)}_{δ_0} = B_t^{-1} g_{µ_t};  ĝ^{(t)}_{M_0} = −Cup ⊙ κup(2 B_t^{-1} g_{Σ_t} B_t^{-T}),

where ⊙ is the element-wise product, κup(X) extracts the non-zero entries of Mup(k) from X so that κup(X) ∈ Mup(k), J is a matrix of ones, and Cup is a constant matrix defined below, where the factor 1/2 appears in the symmetric part of Cup:

Cup = [½ J_A  J_B; 0  ½ I_D].

The NGD update over the auxiliary parameters is the structured update

µ_{t+1} ← µ_t − β S_t^{-1} g_{µ_t},
B_{t+1} ← B_t h(β Cup ⊙ κup(2 B_t^{-1} g_{Σ_t} B_t^{-T})),   (25)

where (25) preserves the structure: B_{t+1} ∈ Bup(k) if B_t ∈ Bup(k). When k = p, the update (25) recovers the update (38) of the example in Sec. 3.5 and connects to Newton's method in (16) (see (39) in Appx. D.1). When k < p, (25) becomes a structured update that preserves the group structure.

By exploiting the structure of B (shown in Appx. J.1.7), the update enjoys a low time complexity of O(k²p). The product S^{-1} g_µ can be computed in O(k²p). We can compute B h(M) in O(k²p) when B and h(M) are block upper-triangular matrices. The gradient g_Σ is obtained using the Hessian, where we only compute/approximate the diagonal entries of the Hessian and use O(k) Hessian-vector products for the non-zero entries of κup(2 B^{-1} g_Σ B^{-T}) (see (53) in Appx. J.1.7). We store the non-zero entries of B with space complexity O((k + 1)p). Map h(·) simplifies the computation and reduces the time complexity, whereas the exponential map suggested by Glasmachers et al. (2010) does not.

As shown in Appx. J.1.5, this parameterization induces a special structure over S_up = BB^T, which is a block arrowhead matrix (O'Leary & Stewart, 1990):

S_up = [B_A B_A^T + B_B B_B^T   B_B B_D; B_D B_B^T   B_D²],

and over Σ_up = S_up^{-1}, which is a low-rank matrix⁷:

Σ_up = U_k U_k^T + [0  0; 0  B_D^{-2}];  U_k = [B_A^{-T}; −B_D^{-1} B_B^T B_A^{-T}],

where U_k is a rank-k matrix since B_A^{-T} is invertible.

⁷The zero block in the expression of Σ_up guarantees the FIM to be non-singular (see Appx. J.1.6).

As shown in Appx. J.1.8, we obtain a similar update for a block lower-triangular group Blow(k),

Blow(k) = { [B_A  0; B_C  B_D] | B_A ∈ R^{k×k}_{++}, B_D ∈ D^{d_0×d_0}_{++} }.

Our update with a structure B ∈ Blow(k) enjoys a low-rank structure in the precision S_low = BB^T. Likewise, our update with a structure B ∈ Bup(k) has a low-rank structure in the covariance S_up^{-1} = (BB^T)^{-1}. (A minimal code sketch of one structured step with B ∈ Bup(k) is given below.)
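The sketch below (ours, not the paper's implementation) performs one structured step (25) with B ∈ Bup(k). It uses dense inverses for clarity, whereas the O(k²p) implementation described above exploits the block structure and never forms full p×p inverses; the gradients g_µ and g_Σ are again assumed to be given, and the helper names kappa_up, C_up, and structured_step are ours.

```python
import numpy as np

def h(M):
    return np.eye(M.shape[0]) + M + 0.5 * (M @ M)

def kappa_up(X, k):
    """Keep only the M_up(k) pattern of a (symmetric) matrix X: the top-left
    k x k block, the top-right block, and the diagonal of the bottom-right block."""
    M = np.zeros_like(X)
    M[:k, :k] = X[:k, :k]
    M[:k, k:] = X[:k, k:]
    M[k:, k:] = np.diag(np.diag(X[k:, k:]))
    return M

def C_up(p, k):
    """Constant weights: 1/2 on the symmetric top-left block and on the diagonal
    block, 1 on the top-right block (cf. the definition of Cup above)."""
    C = np.zeros((p, p))
    C[:k, :k] = 0.5
    C[:k, k:] = 1.0
    C[k:, k:] = 0.5 * np.eye(p - k)
    return C

def structured_step(mu, B, g_mu, g_Sigma, beta, k):
    """One step of the structured update (25) for N(w | mu, S^{-1}), S = B B^T."""
    p = mu.shape[0]
    S = B @ B.T
    mu_new = mu - beta * np.linalg.solve(S, g_mu)
    Binv = np.linalg.inv(B)
    N = C_up(p, k) * kappa_up(2.0 * Binv @ g_Sigma @ Binv.T, k)
    return mu_new, B @ h(beta * N)

# The group structure is preserved: B stays block upper-triangular with a diagonal
# lower-right block, so S = B B^T stays a block arrowhead matrix.
rng = np.random.default_rng(1)
p, k = 8, 2
B = np.triu(rng.normal(size=(p, p))) + 3.0 * np.eye(p)
B[k:, k:] = np.diag(np.diag(B[k:, k:]))              # put B into B_up(k)
A = rng.normal(size=(p, p))
g_Sigma = 0.5 * (A @ A.T + np.eye(p))                # a PSD stand-in for 0.5 * Hessian
mu, B = structured_step(np.zeros(p), B, rng.normal(size=p), g_Sigma, beta=0.1, k=k)
print(np.allclose(B[k:, :k], 0.0),
      np.allclose(B[k:, k:], np.diag(np.diag(B[k:, k:]))))
```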
These are structured secondorder updates where the precision can be seen as approximations of Hessians in Newton s method (see Sec. 3.2). An extension is to construct a hierarchical structure inspired by the Heisenberg group (Schulz & Seesanea, 2018) by replacing a diagonal group in BD with a block triangular group, where 1 < k1 + k2 < p and d0 = p k1 k2 Bup(k1, k2) = n BA BB 0 BD BD = BD1 BD2 0 BD4 where BA Rk1 k1 ++ , BD1 Dd0 d0 ++ , BD4 Rk2 k2 ++ . This group has a flexible structure and recovers the block triangular group as a special case when k2 = 0. Likewise, We can define a lower block Heisenberg group Blow(k1, k2). In Appx. J.2, we show that these groups can be used as structured parameter spaces, which could be useful for problems of interest in optimization, inference, and search. If the Hessian 2ℓ(w) has a model-specific structure, we could design a customized group to capture such structure in the precision. For example, the Hessian of layer-wise matrix weights of a neural network admits a Kronecker form 7 The zero block highlighted in red in the expression of Σup guarantees the FIM to be non-singular (see Appx. J.1.6). Tractable Structured Natural-Gradient Descent Using Local Parameterizations (see Appx. I.2). We can use a Kronecker product group to capture such structure. The Kronecker structure is preserved even when the Gauss-Newton approximation is employed. This group structure can further reduce the time complexity from the quadratic complexity to a linear complexity in k (see Appx. I.1 and Figure 4). In general, many subgroups (e.g., block (invertible) triangular Toeplitz groups and groups constructed from an existing group via the group conjugation by an element of the rotation group) of the GL group Rp p ++ can be used as structured auxiliary parameter spaces B. Our approach to construct a structured Gaussian-precision is valid if there exists a local parameter space M so that h(M) B for any M M and Assumptions 1-3 are satisfied. If these conditions hold, the inverse of FIM F 1 η (η0) using M will be easy to compute due to Lemma 11 in Appx. D.1. We can even weaken Assumption 1 as discussed in Sec. 3.5. The computational requirements are (1) group product and inverse can be efficiently implemented and (2) κ 2B 1gΣB T M can be implemented without the whole Hessian in (15), where κ( ) converts Rp p to M. 5. Numerical Results We present results on problems involving search, inference, optimization, and deep learning, where Table 1 in Appx. A summarizes our updates. We use h( ) defined in Sec. 3.5 to replace the matrix exponential map in our proposed updates. 5.1. Search with Re-parameterizable Gradients We validate our update in the metric nearness task (Brickell et al., 2008) using a Wishart distribution as a search distribution q with γ = 0 in (1). The objective function is ℓ(W) = 1 2N PN i=1 WQxi xi 2 2, where xi Rd, Q Sd d ++ and W Sd d ++ . The optimal solution is Q 1. We randomly generate xi and Q with d = 50, Ntrain = 125, 000 for training and Ntest = 25, 000 for testing. All methods are trained using mini-batches, where the size of mini-batch is 100. We use re-parameterizable gradients with 1 Monte Carlo (MC) sample in our update (referred to as our-rep ), where we update B and b. we also consider to only update B with re-parameterizable gradients (referred to as our-fixed-rep ). To numerically show the similarity between RGD and our update, we consider a case where gradients are evaluated at the mean (referred to as -mean ). 
We consider baseline methods: the RGD update for positive-definite manifolds and the Riemannian trivialization8 (Lezcano Casado, 2019), where gradients are evaluated at the mean. For the trivialization method, we consider trivializations for the positive-definite mani- 8In variational inference (VI), trivializing a parametric distribution is known as black-box VI (Ranganath et al., 2014). fold: a Cholesky factor and the matrix logarithmic function. We report the best result of the trivializations denoted by Adam , where we use Adam to perform updates in a trivialized (Euclidean) space. From Figure 3a, we can see our update performs similarly to RGD if gradients are evaluated at the mean while the trivialization method is trapped in a local mode. If we use re-parameterizable gradients, jointly updating both parameters is better than only updating B. 5.2. Variational Inference with Gaussian Mixtures We consider the Gaussian mixture approximation problem (Lin et al., 2020), where we use a Gaussian mixture with K components q(w) = 1 K PK k=1 N(w|µk, S 1 k ) as a variational distribution q with γ = 1 in (1). The goal of the problem is to approximate a mixture of d-dimensional Student s t distributions exp( ℓ(w)) = 1 C PC c=1 T (w|uc, Vc, α) with α = 2. We consider six kinds of structures of each Gaussian component: full precision (referred to as full ), diagonal precision (referred to as diag ), precision with the block upper triangular structure (referred to as Triup ), precision with the block lower triangular structure (referred to as Tri-low ), precision with the block upper Heisenberg structure (referred to as Hs-up ), precision with the block lower Heisenberg structure (referred to as Hs-low ). Each entry of uc is generated uniformly in an interval ( s, s). Each matrix Vc is generated as suggested by Lin et al. (2020). We consider a case with K = 40, C = 10, d = 80, s = 20. We update each component during training, where 10 MC samples are used to compute gradients. We compute gradients as suggested by Lin et al. (2020), where second-order information is used. For structured updates, we compute Hessian-vector products and diagonal entries of the Hessian without directly computing the Hessian 2 wℓ(w). From Figure 2, we can see an upper structure is better for inference problems9. Figure 6-8 in Appx. B show more results on dimensions and structures such as Heisenberg structures. 5.3. Structured Second-order Optimization We consider non-separable valley-shaped test functions for optimization: Rosenbrock: ℓrb(w) = 1 d Pd 1 i=1 100(wi+1 wi)2+(wi 1)2 , and Dixon-Price: ℓdp(w) = 1 d (wi 1)2+ Pd i=2 i(2w2 i wi 1)2 . We test our structured Newton s updates, where we set d = 200 and γ = 1 in (1). We consider these structures in the precision: the upper triangular structure (denoted by Tri-up ), the lower triangular structure (denoted by Tri-low ), the upper Heisenberg structure (denoted by Hs-up ), and the lower Heisenberg structure (denoted by Hs-low ), where second-order information is used. For our updates, we compute Hessian-vector products 9For variational inference, an upper structure in the precision is better than a lower structure to capture off-diagonal correlations. 
and diagonal entries of the Hessian without directly computing the Hessian. We consider baseline methods: the BFGS method provided by SciPy and the Adam optimizer, where the step-size is tuned for Adam. We evaluate gradients at the mean for all methods. Figures 3b-3c show the performance of all methods¹⁰, where our updates with a lower Heisenberg structure converge faster than BFGS and Adam.

¹⁰Empirically, we find that a lower structure in the precision performs better than an upper structure for optimization tasks, including optimization for neural networks.

Figure 3. The performance of our updates for search and optimization problems. Figure 3a shows the performance of using a Wishart distribution to search for the optimal solution of a metric nearness task, where our method evaluated at the mean behaves like RGD and converges faster than the Riemannian trivialization (Lezcano Casado, 2019) with Adam. Our updates with re-parameterizable gradients can also find a solution near the optimal solution. Figures 3b and 3c show the performance of structured Newton updates used to optimize non-separable, valley-shaped, 200-dimensional functions, where our updates only require computing diagonal entries of the Hessian and Hessian-vector products. Our updates with a lower Heisenberg structure in the precision form converge faster than BFGS and Adam.

Figure 4. The performance for optimization of a CNN using a matrix Gaussian with low rank in a Kronecker precision form, where our updates (O(k|w|)) have a linear iteration cost like Adam (O(|w|)) and are automatically parallelized by Auto-Diff. Our updates achieve higher test accuracy (75.8% on STL-10 and 85.0% on CIFAR-10) than Adam (69.5% on STL-10 and 82.3% on CIFAR-10).

5.4. Optimization for Deep Learning

We consider a CNN model with 9 hidden layers, where 6 layers are convolution layers. For a smooth objective, we use average pooling and GELU (Hendrycks & Gimpel, 2016) as activation functions. We employ L2 regularization with weight 10^{-2}. We set γ = 1 in (1) in our updates. We train the model with our updates derived from the matrix Gaussian (see Appx. I) for each layer-wise matrix weight¹¹ on the datasets CIFAR-10 and STL-10. Each Gaussian precision has a Kronecker-product group structure of two lower-triangular groups (referred to as "Tri-low") for computational-complexity reduction (see Appx. I.1). For CIFAR-10 and STL-10, we train the model with mini-batch size 20. Additional results on CIFAR-100 can be found in Figure 5 in Appx. B. We evaluate gradients at the mean and approximate the Hessian by the Gauss-Newton approximation.

¹¹W ∈ R^{c_out × c_in p²} is a weight matrix, where p, c_in, and c_out are the kernel size and the number of input and output channels, respectively.
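The structured updates in Secs. 5.2-5.4 only need Hessian-vector products and diagonal entries of the Hessian, never the full Hessian. The experiments obtain these quantities through automatic differentiation; the NumPy sketch below (ours) illustrates the same quantities with finite differences of an analytic gradient, using the standard (unscaled) Rosenbrock function as a stand-in for the Sec. 5.3 test function.

```python
import numpy as np

def rosenbrock_grad(w):
    """Analytic gradient of the standard Rosenbrock function (the paper scales it by 1/d)."""
    g = np.zeros_like(w)
    g[:-1] += -400.0 * w[:-1] * (w[1:] - w[:-1] ** 2) - 2.0 * (1.0 - w[:-1])
    g[1:] += 200.0 * (w[1:] - w[:-1] ** 2)
    return g

def hvp(grad_fn, w, v, eps=1e-6):
    """Hessian-vector product via central differences of the gradient; an autodiff
    framework would give the same quantity exactly."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def hessian_diag(grad_fn, w, eps=1e-6):
    """Diagonal of the Hessian from d Hessian-vector products with basis vectors,
    never forming the full d x d Hessian (the paper approximates this more cheaply)."""
    d = w.shape[0]
    diag = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0
        diag[i] = hvp(grad_fn, w, e, eps)[i]
    return diag

w = np.full(200, 0.5)
v = np.ones(200)
print(hvp(rosenbrock_grad, w, v)[:3])        # one Hessian-vector product
print(hessian_diag(rosenbrock_grad, w)[:3])  # diagonal entries only
```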
We compare our updates to Adam, where the step-size for each method is tuned by grid search. We use the same initialization and hyper-parameters in all methods. We report results in terms of test accuracy, where we average the results over 5 runs with distinct random seeds. From Figure 4, we can see our structured updates have a linear iteration cost like Adam while achieve higher test accuracy. 6. Conclusion We propose a tractable NGD for structured spaces. The method enables more flexible covariance structures with lower complexity than other methods. Preliminarily results show the method is promising. An interesting direction is to evaluate its performance on large-scale problems. Acknowledgements WL is supported by a UBC International Doctoral Fellowship. This research was partially supported by the Canada CIFAR AI Chair Program. Tractable Structured Natural-Gradient Descent Using Local Parameterizations Agakov, F. V. and Barber, D. An auxiliary variational method. In International Conference on Neural Information Processing, pp. 561 566. Springer, 2004. Akimoto, Y. and Hansen, N. Projection-based restricted covariance matrix adaptation for high dimension. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 197 204, 2016. Amari, S.-i., Ozeki, T., Karakida, R., Yoshida, Y., and Okada, M. Dynamics of learning in mlp: Natural gradient and singularity revisited. Neural computation, 30(1): 1 33, 2018. Baba, N. Convergence of a random optimization method for constrained optimization problems. Journal of Optimization Theory and Applications, 33(4):451 461, 1981. Belk, J. Lecture Notes: Matrix Groups. http: //faculty.bard.edu/belk/math332/ Matrix Groups.pdf, 2013. Accessed: 2021/02. Beyer, H.-G. The theory of evolution strategies. Springer Science & Business Media, 2001. Brickell, J., Dhillon, I. S., Sra, S., and Tropp, J. A. The metric nearness problem. SIAM Journal on Matrix Analysis and Applications, 30(1):375 396, 2008. Chen, S.-W., Chou, C.-N., and Chang, E. Y. Ea-cg: An approximate second-order method for training fullyconnected neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3337 3346, 2019. Dangel, F., Harmeling, S., and Hennig, P. Modular blockdiagonal curvature approximations for feedforward architectures. In International Conference on Artificial Intelligence and Statistics, pp. 799 808. PMLR, 2020. Figurnov, M., Mohamed, S., and Mnih, A. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pp. 441 452, 2018. Glasmachers, T., Schaul, T., Yi, S., Wierstra, D., and Schmidhuber, J. Exponential natural evolution strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 393 400, 2010. Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348 2356, 2011. Hazan, E., Levy, K. Y., and Shalev-Shwartz, S. On graduated optimization for stochastic non-convex problems. In International conference on machine learning, pp. 1833 1841. PMLR, 2016. Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). ar Xiv preprint ar Xiv:1606.08415, 2016. Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303 1347, 2013. Hosseini, R. and Sra, S. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pp. 910 918, 2015. Johansen, S. 
Introduction to the theory of regular exponential famelies. 1979. Khan, M. and Lin, W. Conjugate-computation variational inference: Converting variational inference in nonconjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics, pp. 878 887, 2017. Khan, M. E. and Nielsen, D. Fast yet Simple Natural Gradient Descent for Variational Inference in Complex Models. ar Xiv preprint ar Xiv:1807.04489, 2018. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. 2020. https: //emtiyaz.github.io/papers/learning_ from_bayes.pdf. Khan, M. E., Lin, W., Tangkaratt, V., Liu, Z., and Nielsen, D. Variational adaptive-Newton method for explorative learning. ar Xiv preprint ar Xiv:1711.05560, 2017. Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, pp. 2611 2620, 2018. Leordeanu, M. and Hebert, M. Smoothing-based optimization. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1 8. IEEE, 2008. Lezcano Casado, M. Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems, 32:9157 9168, 2019. Li, Z. and Zhang, Q. A simple yet efficient evolution strategy for large-scale black-box optimization. IEEE Transactions on Evolutionary Computation, 22(5):637 646, 2017. Lin, W. An upper triangular version of the cholesky decompostion. https://math.stackexchange.com/q/4114067, 2021. Accessed: 2021/04. Lin, W., Khan, M. E., and Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In International Conference on Machine Learning, pp. 3992 4002, 2019a. Tractable Structured Natural-Gradient Descent Using Local Parameterizations Lin, W., Khan, M. E., and Schmidt, M. Stein s Lemma for the Reparameterization Trick with Exponential-family Mixtures. ar Xiv preprint ar Xiv:1910.13398, 2019b. Lin, W., Schmidt, M., and Khan, M. E. Handling the positive-definite constraint in the bayesian learning rule. In International Conference on Machine Learning, pp. 6116 6126. PMLR, 2020. Malagò, L., Matteucci, M., and Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th workshop proceedings on Foundations of genetic algorithms, pp. 230 242, 2011. Mishkin, A., Kunstner, F., Nielsen, D., Schmidt, M., and Khan, M. E. SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient. In Advances in Neural Information Processing Systems, pp. 6246 6256, 2018. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Mobahi, H. and Fisher III, J. A theoretical analysis of optimization by gaussian continuation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. O leary, D. and Stewart, G. Computing the eigenvalues and eigenvectors of symmetric arrowhead matrices. Journal of Computational Physics, 90(2):497 505, 1990. Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural computation, 21(3):786 792, 2009. Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen, R., Turner, R. E., and Yokota, R. 
Practical deep learning with Bayesian principles. In Advances in neural information processing systems, pp. 4287 4299, 2019a. Osawa, K., Tsuji, Y., Ueno, Y., Naruse, A., Yokota, R., and Matsuoka, S. Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12359 12367, 2019b. Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814 822, 2014. Ros, R. and Hansen, N. A simple modification in cma-es achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, pp. 296 305. Springer, 2008. Salimbeni, H., Eleftheriadis, S., and Hensman, J. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. Schulz, E. and Seesanea, A. Extensions of the Heisenberg group by two-parameter groups of dilations. ar Xiv preprint ar Xiv:1804.10305, 2018. Smith, W., Hocking, R., et al. Wishart variate generator. Applied Statistics, 21:341 345, 1972. Spall, J. C. Introduction to stochastic search and optimization: estimation, simulation, and control, volume 65. John Wiley & Sons, 2005. Sun, K. and Nielsen, F. Relative fisher information and natural gradient for learning large modular models. In International Conference on Machine Learning, pp. 3289 3298, 2017. Sun, Y., Wierstra, D., Schaul, T., and Schmidhuber, J. Efficient natural evolution strategies. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pp. 539 546, 2009. Sun, Y., Schaul, T., Gomez, F., and Schmidhuber, J. A linear time natural evolution strategy for non-separable functions. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pp. 61 62, 2013. Sutton, R. S., Barto, A. G., et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998. Teboulle, M. Entropic proximal mappings with applications to nonlinear programming. Math. Oper. Res., 17(3): 670â A S690, August 1992. ISSN 0364-765X. Tran, M.-N., Nguyen, N., Nott, D., and Kohn, R. Bayesian deep net glm and glmm. Journal of Computational and Graphical Statistics, 29(1):97 113, 2020. Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381 3387. IEEE, 2008. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256, 1992. Tractable Structured Natural-Gradient Descent Using Local Parameterizations Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241 268, 1991. Zellner, A. Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association, 81(394):446 451, 1986. Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference on Machine Learning, pp. 5847 5856, 2018.