# Accelerating Natural Gradient with Higher-Order Invariance

Yang Song¹, Jiaming Song¹, Stefano Ermon¹

¹Computer Science Department, Stanford University. Correspondence to: Yang Song, Jiaming Song, Stefano Ermon.

An appealing property of the natural gradient is that it is invariant to arbitrary differentiable reparameterizations of the model. However, this invariance property requires infinitesimal steps and is lost in practical implementations with small but finite step sizes. In this paper, we study invariance properties from a combined perspective of Riemannian geometry and numerical differential equation solving. We define the order of invariance of a numerical method to be its convergence order to an invariant solution. We propose to use higher-order integrators and geodesic corrections to obtain more invariant optimization trajectories. We prove the numerical convergence properties of geodesic-corrected updates and show that they can be as computationally efficient as plain natural gradient. Experimentally, we demonstrate that invariance leads to faster optimization, and our techniques improve on traditional natural gradient in deep neural network training and natural policy gradient for reinforcement learning.

## 1. Introduction

Non-convex optimization is a key component of the success of deep learning. Current state-of-the-art training methods are usually variants of stochastic gradient descent (SGD), such as AdaGrad (Duchi et al., 2011), RMSProp (Hinton et al., 2012) and Adam (Kingma & Ba, 2015). While generally effective, the performance of these first-order optimizers is highly dependent on the curvature of the optimization objective. When the Hessian matrix of the objective at the optimum has a large condition number, the problem is said to have pathological curvature (Martens, 2010; Sutskever et al., 2013), and first-order methods have trouble making progress.

The curvature, however, depends on how the model is parameterized. There may be some equivalent way of parameterizing the same model which has better-behaved curvature and is thus easier to optimize with first-order methods. Model reparameterizations, such as good network architectures (Simonyan & Zisserman, 2014; He et al., 2016) and normalization techniques (LeCun et al., 2012; Ioffe & Szegedy, 2015; Salimans & Kingma, 2016), are often critical for the success of first-order methods.

The natural gradient method (Amari, 1998) takes a different perspective on the same problem. Rather than devising a different parameterization for first-order optimizers, it tries to make the optimizer itself invariant to reparameterizations by directly operating on the manifold of probabilistic models. This invariance, however, only holds in the idealized case of infinitesimal steps, i.e., for continuous-time natural gradient descent trajectories on the manifold (Ollivier, 2013; 2015). Practical implementations with small but finite step size (learning rate) are only approximately invariant. Inspired by the Newton-Raphson method, the learning rate of the natural gradient method is usually set to values near 1 in real applications (Martens, 2010; 2014), leading to a potential loss of invariance.

In this paper, we investigate invariance properties within the framework of Riemannian geometry and numerical differential equation solving.
We observe that both the exact solution of the natural gradient dynamics and its approximation obtained with the Riemannian Euler method (Bielecki, 2002) are invariant. We propose to measure the invariance of a numerical scheme by studying its rate of convergence to those idealized, truly invariant solutions. It can be shown that the traditional natural gradient update (based on the forward Euler method) converges in first order. For improvement, we first propose to use a second-order Runge-Kutta integrator. Additionally, we introduce corrections based on the geodesic equation. We argue that the Runge-Kutta integrator converges to the exact solution in second order, and that the method with geodesic corrections converges to the Riemannian Euler method in second order. Therefore, all the new methods have a higher order of invariance, and experiments verify their faster convergence in deep neural network training and policy optimization for deep reinforcement learning. Moreover, the geodesic correction update has a faster variant which keeps the second-order invariance while being roughly as time efficient as the original natural gradient update. Our new methods can be used as drop-in replacements in any situation where natural gradient may be used.

## 2. Preliminaries

### 2.1. Riemannian Geometry and Invariance

We use Einstein's summation convention throughout this paper to simplify formulas. The convention states that when any index variable appears twice in a term, once as a superscript and once as a subscript, it indicates summation of the term over all possible values of the index variable. For example, $a^\mu b_\mu \triangleq \sum_{\mu=1}^{n} a^\mu b_\mu$ when the index variable $\mu \in [n]$.

Riemannian geometry is used to study intrinsic properties of differentiable manifolds equipped with metrics. The goal of this necessarily brief section is to introduce some key concepts related to the understanding of invariance. For more details, please refer to (Petersen, 2006) and (Amari et al., 1987).

In this paper, we describe a family of probabilistic models as a manifold. Roughly speaking, a manifold $\mathcal{M}$ of dimension $n$ is a smooth space whose local regions resemble $\mathbb{R}^n$ (Carroll, 2004). Assume there exists a smooth mapping $\phi: \mathcal{M} \to \mathbb{R}^n$ in some neighborhood of $p$; for any $p \in \mathcal{M}$, $\phi(p)$ is the coordinate of $p$. As an example, if $p$ is a parameterized distribution, $\phi(p)$ will refer to its parameters. There is a linear space associated with each $p \in \mathcal{M}$ called the tangent space $T_p\mathcal{M}$. Each element $v \in T_p\mathcal{M}$ is called a vector. For any tangent space $T_p\mathcal{M}$, there exists a dual space $T_p^*\mathcal{M}$ called the cotangent space, which consists of all linear real-valued functions on the tangent space. Each element $v^*$ in the dual space $T_p^*\mathcal{M}$ is called a covector. Let $\phi(p) = (\theta^1, \theta^2, \dots, \theta^n)$ be the coordinates of $p$. It can be shown that the set of operators $\{\frac{\partial}{\partial \theta^1}, \dots, \frac{\partial}{\partial \theta^n}\}$ forms a basis for $T_p\mathcal{M}$ and is called the coordinate basis. Similarly, the dual space admits the dual coordinate basis $\{d\theta^1, \dots, d\theta^n\}$. These two sets of bases satisfy $d\theta^\mu(\partial_\nu) = \delta^\mu_\nu$, where $\delta^\mu_\nu$ is the Kronecker delta ($\delta^\mu_\nu = 1$ if $\mu = \nu$ and $0$ otherwise). Note that in this paper we abbreviate $\frac{\partial}{\partial \theta^\mu}$ to $\partial_\mu$ and often refer to an entity (e.g., a vector, covector or point on the manifold) by its coordinates. Vectors and covectors are geometric objects associated with a manifold, which exist independently of the coordinate system.
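To make the index notation concrete before moving on, the following minimal NumPy sketch (ours, not from the paper) spells out the summation convention $a^\mu b_\mu$, the metric contraction $g_{\mu\nu} a^\mu b^\nu$, and the dual-basis relation; the metric used here is just an arbitrary positive definite matrix chosen for illustration.

```python
# A minimal NumPy sketch (not from the paper) illustrating Einstein summation:
# repeated upper/lower indices are contracted, which np.einsum expresses directly.
import numpy as np

n = 3
rng = np.random.default_rng(0)
a = rng.normal(size=n)          # contravariant coefficients a^mu
b = rng.normal(size=n)          # covariant coefficients b_mu
G = rng.normal(size=(n, n))
g = G @ G.T + n * np.eye(n)     # an arbitrary positive definite metric g_{mu nu}

# a^mu b_mu  (contraction over the repeated index mu)
print(np.einsum("m,m->", a, b))

# <a, b> = g_{mu nu} a^mu b^nu for two vectors given by coefficients a^mu and b^nu
v = rng.normal(size=n)
print(np.einsum("mn,m,n->", g, a, v))

# The dual-basis relation d theta^mu(partial_nu) = delta^mu_nu is just the identity matrix.
print(np.eye(n))
```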
However, for quantitative studies we rely on coordinate representations of these vectors and covectors. Given a coordinate system, a vector $a$ (covector $a^*$) can be represented by its coefficients w.r.t. the coordinate (dual coordinate) bases, which we denote as $a^\mu$ ($a_\mu$). These coefficients depend on the specific coordinate system and will change under different parameterizations. In order for the coefficients to represent coordinate-independent entities like vectors and covectors, their change should obey appropriate transformation rules. Let the new coordinate system under a different parameterization be $\phi'(p) = (\theta^{1'}, \dots, \theta^{n'})$ and let the old one be $\phi(p) = (\theta^1, \dots, \theta^n)$. It can be shown that the new coefficients of $a \in T_p\mathcal{M}$ are given by $a^{\mu'} = a^\mu \frac{\partial \theta^{\mu'}}{\partial \theta^\mu}$, while the new coefficients of $a^* \in T_p^*\mathcal{M}$ are determined by $a_{\mu'} = a_\mu \frac{\partial \theta^\mu}{\partial \theta^{\mu'}}$. Due to this difference in transformation rules, we say $a^\mu$ is contravariant while $a_\mu$ is covariant, as indicated by superscripts and subscripts respectively. In this paper, we only use Greek letters to denote contravariant / covariant components.

Riemannian manifolds are equipped with a positive definite metric tensor $g_p$, so that distances and angles can be characterized. The inner product of two vectors $a = a^\mu \partial_\mu \in T_p\mathcal{M}$ and $b = b^\nu \partial_\nu \in T_p\mathcal{M}$ is defined as
$$\langle a, b\rangle \triangleq g_p(a, b) = g_{\mu\nu}\, d\theta^\mu \otimes d\theta^\nu (a^\alpha \partial_\alpha, b^\beta \partial_\beta) = g_{\mu\nu} a^\mu b^\nu.$$
For convenience, we denote the inverse of the metric tensor as $g^{\alpha\beta}$ using superscripts, i.e., $g^{\alpha\beta} g_{\beta\mu} = \delta^\alpha_\mu$. The inner product induces a natural map from a tangent space to its dual space. Let $a = a^\mu \partial_\mu \in T_p\mathcal{M}$; its natural correspondence in $T_p^*\mathcal{M}$ is the covector $a^* \triangleq \langle a, \cdot\rangle = a_\nu d\theta^\nu$. It can be shown that $a_\nu = a^\mu g_{\mu\nu}$ and $a^\mu = g^{\mu\nu} a_\nu$. We say the metric tensor relates the coefficients of a vector and its covector by lowering and raising indices, which effectively changes the transformation rule.

The metric structure makes it possible to define geodesics on the manifold, which are constant-speed curves $\gamma: \mathbb{R} \to \mathcal{M}$ that are locally distance minimizing. Since distances on manifolds are independent of parameterization, geodesics are invariant objects. Using a specific coordinate system, $\gamma(t)$ can be determined by solving the geodesic equation
$$\ddot{\gamma}^\mu + \Gamma^\mu_{\alpha\beta}\, \dot{\gamma}^\alpha \dot{\gamma}^\beta = 0, \tag{1}$$
where $\Gamma^\mu_{\alpha\beta}$ is the Levi-Civita connection, defined by
$$\Gamma^\mu_{\alpha\beta} = \frac{1}{2} g^{\mu\nu} (\partial_\alpha g_{\nu\beta} + \partial_\beta g_{\nu\alpha} - \partial_\nu g_{\alpha\beta}). \tag{2}$$
Note that we use $\dot{\gamma}$ to denote $\frac{d\gamma}{dt}$ and $\ddot{\gamma}$ for $\frac{d^2\gamma}{dt^2}$. Given $p \in \mathcal{M}$ and $v \in T_p\mathcal{M}$, there exists a unique geodesic satisfying $\gamma(0) = p$, $\dot{\gamma}(0) = v$. If we follow the curve $\gamma(t)$ from $p = \gamma(0)$ for a unit time $t = 1$, we reach another point $p' = \gamma(1)$ on the manifold. In this way, traveling along geodesics defines a map from $\mathcal{M} \times T\mathcal{M}$ to $\mathcal{M}$ called the exponential map,
$$\mathrm{Exp}(p, v) \triangleq \gamma(1), \tag{3}$$
where $\gamma(0) = p$ and $\dot{\gamma}(0) = v$. By simple re-scaling we also have $\mathrm{Exp}(p, hv) = \gamma(h)$. As a summary, Figure 1 provides a graphical illustration of the relevant concepts in Riemannian geometry.

[Figure 1. An illustration of Riemannian geometry concepts: tangent spaces, cotangent spaces, coordinate basis, dual coordinate basis, geodesics and the exponential map.]

Here we emphasize again the important ontological difference between an object and its coordinates. The manifold itself, along with geodesics and vectors (covectors) in its tangent (cotangent) spaces, is intrinsic and independent of coordinates. The coordinates need to be transformed correctly to describe the same objects and properties on the manifold under a different coordinate system. This is where invariance emerges.
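As an illustration of Eqs. (1)-(3), the following sketch (our own toy example, using a hypothetical two-dimensional metric) approximates the Levi-Civita connection by finite differences and the exponential map by integrating the geodesic equation with many small Euler steps.

```python
# A toy sketch (not the paper's code): given a metric function g(theta), approximate the
# Levi-Civita connection (Eq. 2) by finite differences and the exponential map (Eq. 3)
# by integrating the geodesic equation (Eq. 1) with many small Euler steps.
import numpy as np

def metric(theta):
    # Hypothetical 2-D metric used only for illustration.
    a, b = theta
    return np.array([[1.0 + a**2, 0.3], [0.3, 2.0 + b**2]])

def christoffel(theta, eps=1e-5):
    n = len(theta)
    dg = np.zeros((n, n, n))                      # dg[c, m, n] = d g_{mn} / d theta^c
    for c in range(n):
        e = np.zeros(n); e[c] = eps
        dg[c] = (metric(theta + e) - metric(theta - e)) / (2 * eps)
    g_inv = np.linalg.inv(metric(theta))
    t1 = dg.transpose(0, 2, 1)                    # d_a g_{nb}, indexed [a, b, n]
    t2 = dg.transpose(2, 0, 1)                    # d_b g_{na}, indexed [a, b, n]
    t3 = dg.transpose(1, 2, 0)                    # d_n g_{ab}, indexed [a, b, n]
    # Gamma^m_{ab} = 1/2 g^{mn} (d_a g_{nb} + d_b g_{na} - d_n g_{ab})
    return 0.5 * np.einsum("mn,abn->mab", g_inv, t1 + t2 - t3)

def exp_map(p, v, steps=1000):
    # Integrate gamma'' = -Gamma(gamma) gamma' gamma' from t = 0 to t = 1.
    x, xdot, dt = p.astype(float), v.astype(float), 1.0 / steps
    for _ in range(steps):
        acc = -np.einsum("mab,a,b->m", christoffel(x), xdot, xdot)
        x, xdot = x + dt * xdot, xdot + dt * acc
    return x

print(exp_map(np.array([0.5, 0.5]), np.array([0.1, -0.2])))
```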
### 2.2. Numerical Differential Equation Solvers

Consider the ordinary differential equation (ODE) $\dot{x}(t) = f(t, x(t))$, where $x(0) = a$ and $t \in [0, T]$. Numerical integrators trace $x(t)$ with iterative local approximations $\{x_k \mid k \in \mathbb{N}\}$. We discuss several numerical methods used in this paper.

The forward Euler method updates its approximation by $x_{k+1} = x_k + h f(t_k, x_k)$ and $t_{k+1} = t_k + h$. It can be shown that as $h \to 0$, the error $\|x_k - x(t_k)\|$ can be bounded by $O(h)$.

The midpoint integrator is a Runge-Kutta method with $O(h^2)$ error. Its update formula is
$$x_{k+1} = x_k + h f\Big(t_k + \frac{h}{2},\; x_k + \frac{h}{2} f(t_k, x_k)\Big), \qquad t_{k+1} = t_k + h.$$

The Riemannian Euler method (see pp. 3-6 in (Bielecki, 2002)) is a less common variant of the Euler method, which uses the exponential map for its updates:
$$x_{k+1} = \mathrm{Exp}(x_k, h f(t_k, x_k)), \qquad t_{k+1} = t_k + h.$$
While having the same asymptotic error $O(h)$ as forward Euler, it has more desirable invariance properties.

### 2.3. Revisiting the Natural Gradient Method

Let $r_\theta(x, t) = p_\theta(t \mid x) q(x)$ denote a probabilistic model parameterized by $\theta \in \Theta$, where $x, t$ are random variables, and $q(x)$ is the marginal distribution of $x$, assumed to be fixed. Conventionally, $x$ denotes the input and $t$ represents its label. In a differential geometric framework, the set of all possible probabilistic models $r_\theta$ constitutes a manifold $\mathcal{M}$, and the parameter vector $\theta$ provides a coordinate system. Furthermore, the infinitesimal distance between probabilistic models can be measured by the Fisher information metric
$$g_{\mu\nu} = \mathbb{E}_{x \sim q}\,\mathbb{E}_{p_\theta(t \mid x)}\big[\partial_\mu \log p_\theta(t \mid x)\, \partial_\nu \log p_\theta(t \mid x)\big].$$
Let the loss function $L(r_\theta) = \mathbb{E}_{x \sim q}[-\log p_\theta(l \mid x)]$ be the expected negative log-likelihood, where $l$ denotes the ground truth labels in the training dataset. Our learning goal is to find a model $r_\theta$ that minimizes the (empirical) loss $L(r_\theta)$.

The well-known gradient descent update rule $\theta^\mu_{k+1} = \theta^\mu_k - h\lambda\, \partial_\mu L(r_{\theta_k})$ can be viewed as approximately solving the (continuous-time) ODE
$$\dot{\theta}^\mu = -\lambda\, \partial_\mu L(r_\theta) \tag{4}$$
with the forward Euler method. Here $\lambda$ is a time scale constant, $h$ is the step size, and their product $h\lambda$ is the learning rate. Note that $\lambda$ only affects the speed, not the trajectory, of the system. It is notorious that the gradient descent ODE is not invariant to reparameterizations (Ollivier, 2013; Martens, 2010). For example, if we rescale $\theta^\mu$ to $2\theta^\mu$, $\partial_\mu L(r_\theta)$ will be downscaled to $\frac{1}{2}\partial_\mu L(r_\theta)$. This is more evident from a differential geometric point of view. As can be verified by the chain rule, $\dot{\theta}^\mu$ transforms contravariantly and can therefore be treated as a vector in $T_p\mathcal{M}$, while $\partial_\mu L(r_\theta)$ transforms covariantly, thus being a covector in $T_p^*\mathcal{M}$. Because Eq. (4) tries to relate objects in different spaces with different transformation rules, it is not an invariant relation.

Natural gradient alleviates this issue by approximately solving an invariant ODE. Recall that we can raise or lower an index given a metric tensor $g_{\mu\nu}$. By raising the index of $\partial_\nu L(r_\theta)$, the r.h.s. of the gradient descent ODE (Eq. (4)) becomes a vector in $T_p\mathcal{M}$, which solves the type mismatch problem of Eq. (4). The new ODE
$$\dot{\theta}^\mu = -\lambda\, g^{\mu\nu} \partial_\nu L(r_\theta) \tag{5}$$
is now invariant, and the forward Euler approximation becomes $\theta^\mu_{k+1} = \theta^\mu_k - h\lambda\, g^{\mu\nu} \partial_\nu L(r_{\theta_k})$, which is the traditional natural gradient update (Amari, 1998).

## 3. Higher-order Integrators

If we could integrate the learning trajectory equation $\dot{\theta}^\mu = -\lambda g^{\mu\nu}\partial_\nu L$ exactly, the optimization procedure would be invariant to reparameterizations. However, the naive linear update of natural gradient $\theta^\mu_{k+1} = \theta^\mu_k - h\lambda g^{\mu\nu}\partial_\nu L$ is only a forward Euler approximation, and converges to the invariant exact solution only in first order. Therefore, a natural improvement is to use higher-order integrators to obtain a more accurate approximation of the exact solution.
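For reference, here is a minimal sketch (ours, not the paper's implementation) of the integrators from Section 2.2 and of Eq. (5) discretized with forward Euler, i.e., the plain natural gradient update. The callables `fisher` and `grad_loss` are hypothetical stand-ins for a model's Fisher matrix and loss gradient.

```python
# Sketch of the Section 2.2 integrators, plus Eq. (5) solved with forward Euler
# (the plain natural gradient update). `fisher` and `grad_loss` are hypothetical
# placeholders for a model's Fisher matrix G(theta) and loss gradient grad L(theta).
import numpy as np

def euler_step(f, t, x, h):
    return t + h, x + h * f(t, x)

def midpoint_step(f, t, x, h):
    return t + h, x + h * f(t + h / 2, x + (h / 2) * f(t, x))

def natural_gradient_step(theta, fisher, grad_loss, h, lam):
    # Forward Euler on  d theta / dt = -lam * G(theta)^{-1} grad L(theta).
    return theta - h * lam * np.linalg.solve(fisher(theta), grad_loss(theta))

# Tiny usage example on a quadratic toy problem (illustration only).
fisher = lambda th: np.diag([1.0, 4.0])
grad_loss = lambda th: np.array([2.0 * th[0], 8.0 * th[1]])
theta = np.array([1.0, 1.0])
for _ in range(10):
    theta = natural_gradient_step(theta, fisher, grad_loss, h=0.5, lam=1.0)
print(theta)
```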
As mentioned before, the midpoint integrator has second-order convergence and should generally be more accurate. In our case, it becomes
$$\theta^\mu_{k+\frac{1}{2}} = \theta^\mu_k - \frac{1}{2} h\lambda\, g^{\mu\nu}(\theta_k)\, \partial_\nu L(r_{\theta_k}), \qquad \theta^\mu_{k+1} = \theta^\mu_k - h\lambda\, g^{\mu\nu}(\theta_{k+\frac{1}{2}})\, \partial_\nu L(r_{\theta_{k+\frac{1}{2}}}),$$
where $g^{\mu\nu}(\theta_k)$ and $g^{\mu\nu}(\theta_{k+\frac{1}{2}})$ are the inverse metrics evaluated at $\theta_k$ and $\theta_{k+\frac{1}{2}}$ respectively. Since our midpoint integrator converges to the invariant natural gradient ODE solution in second order, it preserves higher-order invariance compared to the first-order Euler integrator used in vanilla natural gradient.

## 4. Riemannian Euler Method

For solving the natural gradient ODE (Eq. (5)), the Riemannian Euler method's update rule becomes
$$\theta^\mu_{k+1} = \mathrm{Exp}\big(\theta^\mu_k,\; -h\lambda\, g^{\mu\nu}\partial_\nu L(r_{\theta_k})\big), \tag{6}$$
where $\mathrm{Exp}: \{(p, v) \mid p \in \mathcal{M}, v \in T_p\mathcal{M}\} \to \mathcal{M}$ is the exponential map defined in Section 2.1. The solution obtained by the Riemannian Euler method is invariant to reparameterizations, because $\mathrm{Exp}$ is a function independent of parameterization, and at each step the two arguments of $\mathrm{Exp}$ are both invariant.

### 4.1. Geodesic Correction

For most models, it is not tractable to compute $\mathrm{Exp}$, since it requires solving the geodesic equation (1) exactly. Nonetheless, there are two numerical methods to approximate geodesics, with different levels of accuracy. According to Section 2.1, $\mathrm{Exp}(\theta^\mu_k, -h\lambda g^{\mu\nu}\partial_\nu L(r_{\theta_k})) = \gamma^\mu_k(h)$, where $\gamma^\mu_k$ satisfies the geodesic equation (1) with $\gamma^\mu_k(0) = \theta^\mu_k$ and $\dot{\gamma}^\mu_k(0) = -\lambda g^{\mu\nu}\partial_\nu L(r_{\theta_k})$.

The first method for approximately solving $\gamma^\mu_k(t)$ ignores the geodesic equation and only uses first-derivative information, giving $\theta^\mu_{k+1} = \theta^\mu_k - h\lambda g^{\mu\nu}\partial_\nu L$, which corresponds to the naive natural gradient update rule. The more accurate method leverages second-derivative information from the geodesic equation (1). The result is
$$\theta^\mu_{k+1} = \theta^\mu_k - h\lambda\, g^{\mu\nu}\partial_\nu L(r_{\theta_k}) - \frac{1}{2} h^2\, \Gamma^\mu_{\alpha\beta}\, \dot{\gamma}^\alpha_k(0)\, \dot{\gamma}^\beta_k(0), \qquad \dot{\gamma}^\mu_k(0) = -\lambda g^{\mu\nu}\partial_\nu L(r_{\theta_k}). \tag{7}$$
The additional second-order term given by the geodesic equation (1) reduces the truncation error to third order. This is our new natural gradient update rule with geodesic correction.

### 4.2. Faster Geodesic Correction

To obtain the second-order term in the geodesic corrected update, we first need to compute $\dot{\gamma}_k(0)$, which requires inverting the Fisher information matrix. Then we have to plug in $\dot{\gamma}_k(0)$ and compute $\Gamma^\mu_{\alpha\beta}\dot{\gamma}^\alpha_k(0)\dot{\gamma}^\beta_k(0)$, which involves inverting the same Fisher information matrix again (see (2)). Matrix inversion (more precisely, solving the corresponding linear system) is expensive, and it would be beneficial to combine the natural gradient and geodesic correction terms together and do only one inversion. To this end, we propose to estimate $\dot{\gamma}_k(0)$ in $\Gamma^\mu_{\alpha\beta}\dot{\gamma}_k(0)^\alpha \dot{\gamma}_k(0)^\beta$ with $\dot{\gamma}_k(0) \approx \delta_k \triangleq (\theta_k - \theta_{k-1})/h$. Using this approximation and substituting (2) into (7) gives the following faster geodesic correction update rule:
$$\theta^\mu_{k+1} = \theta^\mu_k - h\, g^{\mu\nu}(\theta_k)\Big[\lambda\, \partial_\nu L(r_{\theta_k}) + \frac{1}{4} h\, (\partial_\alpha g_{\nu\beta} + \partial_\beta g_{\nu\alpha} - \partial_\nu g_{\alpha\beta})\, \delta^\alpha_k \delta^\beta_k\Big], \tag{8}$$
which involves only one inversion of the Fisher information matrix.

### 4.3. Convergence Theorem

We summarize the convergence properties of geodesic correction and its faster variant in the following general theorem.

Theorem 1 (Informal). Consider the initial value problem $\dot{x} = f(t, x(t))$, $x(0) = a$, $0 \le t \le T$. Let the interval $[0, T]$ be subdivided into $n$ equal parts by the grid points $0 = t_0 < t_1 < \dots < t_n = T$, with grid size $h = T/n$. Denote $x_k$ and $\hat{x}_k$ as the numerical solutions given by geodesic correction and its faster version respectively. Define the errors at each grid point by $e_k = x^0_k - x_k$ and $\hat{e}_k = x^0_k - \hat{x}_k$, where $x^0_k$ is the numerical solution given by the Riemannian Euler method. Then $\|e_k\| \le O(h^2)$ and $\|\hat{e}_k\| \le O(h^2)$ as $h \to 0$, for all $k \in [n]$.

As a corollary, both the Euler update with geodesic correction and its faster variant converge to the solution of the ODE in first order.

Proof. Please refer to Appendix A for a rigorous statement and detailed proof.
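To make the update algebra concrete, here is a hedged sketch of Eqs. (7) and (8); it is an illustration, not the paper's implementation. The callables `fisher`, `grad_loss`, `christoffel` and `dmetric` are hypothetical stand-ins for the model's Fisher matrix, loss gradient, Levi-Civita connection and metric derivatives.

```python
# Sketch of the geodesic-corrected update (Eq. 7) and its faster variant (Eq. 8).
# `fisher`, `grad_loss`, `christoffel` and `dmetric` are hypothetical callables standing in
# for G(theta), grad L(theta), Gamma^m_{ab}(theta), and dG[c, m, n] = d g_{mn} / d theta^c.
import numpy as np

def geodesic_corrected_step(theta, fisher, grad_loss, christoffel, h, lam):
    gdot = -lam * np.linalg.solve(fisher(theta), grad_loss(theta))    # gamma_dot(0)
    Gamma = christoffel(theta)                                        # Gamma^m_{ab}
    correction = np.einsum("mab,a,b->m", Gamma, gdot, gdot)
    return theta + h * gdot - 0.5 * h**2 * correction                 # Eq. (7)

def faster_geodesic_step(theta, theta_prev, fisher, grad_loss, dmetric, h, lam):
    delta = (theta - theta_prev) / h                                  # approximates gamma_dot(0)
    dG = dmetric(theta)
    # (d_a g_{nb} + d_b g_{na} - d_n g_{ab}) delta^a delta^b, contracted before the single solve
    bracket = (np.einsum("anb,a,b->n", dG, delta, delta)
               + np.einsum("bna,a,b->n", dG, delta, delta)
               - np.einsum("nab,a,b->n", dG, delta, delta))
    rhs = lam * grad_loss(theta) + 0.25 * h * bracket
    return theta - h * np.linalg.solve(fisher(theta), rhs)            # Eq. (8), one solve only
```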
The statement of Theorem 1 is general enough to hold beyond the natural gradient ODE (Eq. (5)). It shows that both geodesic correction and its faster variant converge to the invariant Riemannian Euler method in second order. In contrast, the vanilla forward Euler method, as used in traditional natural gradient, is a first-order approximation of the Riemannian Euler method. In this sense, geodesic corrected updates preserve higher-order invariance.

## 5. Geodesic Correction for Neural Networks

Adding geodesic correction requires computing the Levi-Civita connection $\Gamma^\mu_{\alpha\beta}$ (see (2)), which usually involves second-order derivatives. This is in contrast to natural gradient, where the computation of the Fisher information matrix only involves first-order derivatives of outputs. In this section, we address the computational issues of geodesic correction in optimizing deep neural networks.

In order to use natural gradient for neural network training, we first need to convert neural networks to probabilistic models. A feed-forward network can be treated as a conditional distribution $p_\theta(t \mid x)$. For regression networks, $p_\theta(t \mid x)$ is usually a family of multivariate Gaussians. For classification networks, $p_\theta(t \mid x)$ usually becomes a family of categorical distributions. The joint probability density is $q(x) p_\theta(t \mid x)$, where $q(x)$ is the data distribution and is usually approximated with the empirical distribution.

The first result in this section is the analytical formula of the Levi-Civita connection of neural networks.

Proposition 1. The Levi-Civita connection of a neural network model manifold is given by
$$\Gamma^\mu_{\alpha\beta} = g^{\mu\nu}\, \mathbb{E}_{q(x)} \mathbb{E}_{p_\theta(t \mid x)}\Big[\partial_\nu \log p_\theta(t \mid x)\Big(\partial_\alpha \partial_\beta \log p_\theta(t \mid x) + \frac{1}{2}\partial_\alpha \log p_\theta(t \mid x)\, \partial_\beta \log p_\theta(t \mid x)\Big)\Big].$$

Proof. In Appendix A.

We denote the outputs of a neural network as $y(x, \theta) = (y^1, y^2, \dots, y^o)$, an $o$-dimensional vector if there are $o$ output units. In this paper, we assume that $y(x, \theta)$ are the values after the final layer activation (e.g., softmax). For typical loss functions, the expectation with respect to the corresponding distributions can be calculated analytically. Specifically, we instantiate the Levi-Civita connection for the model distributions induced by three common losses and summarize them in the following proposition.

Proposition 2. For the squared loss, we have $p_\theta(t \mid x) = \prod_i \mathcal{N}(t^i \mid y^i, \sigma^2)$ and
$$g_{\mu\nu} = \frac{1}{\sigma^2}\,\mathbb{E}_{q(x)}[\partial_\mu y^i \partial_\nu y^i], \qquad \Gamma^\mu_{\alpha\beta} = \frac{1}{\sigma^2}\, g^{\mu\nu}\, \mathbb{E}_{q(x)}[\partial_\nu y^i\, \partial_\alpha \partial_\beta y^i].$$
For the binary cross-entropy loss, we have $p_\theta(t \mid x) = \prod_i (y^i)^{t^i}(1 - y^i)^{1 - t^i}$ and
$$g_{\mu\nu} = \mathbb{E}_{q(x)}\Big[\frac{1}{y^i(1 - y^i)}\, \partial_\mu y^i \partial_\nu y^i\Big], \qquad \Gamma^\mu_{\alpha\beta} = g^{\mu\nu}\, \mathbb{E}_{q(x)}\Big[\frac{2y^i - 1}{2 (y^i)^2 (1 - y^i)^2}\, \partial_\nu y^i \partial_\alpha y^i \partial_\beta y^i + \frac{1}{y^i(1 - y^i)}\, \partial_\nu y^i\, \partial_\alpha \partial_\beta y^i\Big].$$
In the case of the multi-class cross-entropy loss, we have $p_\theta(t \mid x) = \prod_i (y^i)^{t^i}$ and
$$g_{\mu\nu} = \mathbb{E}_{q(x)}\Big[\frac{1}{y^i}\, \partial_\mu y^i \partial_\nu y^i\Big], \qquad \Gamma^\mu_{\alpha\beta} = g^{\mu\nu}\, \mathbb{E}_{q(x)}\Big[\frac{1}{y^i}\, \partial_\nu y^i\, \partial_\alpha \partial_\beta y^i - \frac{1}{2 (y^i)^2}\, \partial_\nu y^i \partial_\alpha y^i \partial_\beta y^i\Big].$$

Proof. In Appendix B.

For geodesic correction, we only need to compute connection-vector products $\Gamma^\mu_{\alpha\beta} \dot{\gamma}^\alpha \dot{\gamma}^\beta$. This can be done with an idea similar to Hessian-vector products (Pearlmutter, 1994), for which we provide detailed derivations and pseudocode in Appendix C. It can also be easily handled with automatic differentiation frameworks. We discuss some practical considerations on how to apply these methods in real cases in Appendix D.
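As a rough illustration of these connection-vector products (not the backpropagation-style algorithm of Appendix C, which is not reproduced here), the sketch below evaluates $\Gamma^\mu_{\alpha\beta} v^\alpha v^\beta$ for the squared-loss case of Proposition 2 with $\sigma = 1$ on a tiny, hypothetical one-layer network, using finite differences for the Jacobian and for the second directional derivative of the outputs.

```python
# Sketch of the connection-vector product Gamma^mu_{ab} v^a v^b for the squared-loss case of
# Proposition 2 (sigma = 1), using finite differences on a tiny hypothetical network. This only
# checks the formula Gamma v v = G^{-1} E_q[ d_nu y^i * (second directional derivative of y^i along v) ].
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                     # a small synthetic input batch
W = rng.normal(size=(4, 3)) * 0.1                # hypothetical single-layer weights (parameters theta)

def outputs(w_flat):
    return np.tanh(X @ w_flat.reshape(4, 3))     # network outputs y(x, theta), shape (N, o)

def jacobian(w_flat, eps=1e-5):
    # d y / d theta via central differences, shape (N, o, P)
    P = w_flat.size
    J = np.zeros(outputs(w_flat).shape + (P,))
    for p in range(P):
        e = np.zeros(P); e[p] = eps
        J[..., p] = (outputs(w_flat + e) - outputs(w_flat - e)) / (2 * eps)
    return J

def connection_vector_product(w_flat, v, eps=1e-4):
    N = X.shape[0]
    J = jacobian(w_flat)
    G = np.einsum("xim,xin->mn", J, J) / N                         # Fisher metric (sigma = 1)
    ydd = (outputs(w_flat + eps * v) - 2 * outputs(w_flat)
           + outputs(w_flat - eps * v)) / eps**2                   # d^2/ds^2 y(theta + s v) at s = 0
    rhs = np.einsum("xim,xi->m", J, ydd) / N                       # E_q[ d_nu y^i * (v^a v^b d_a d_b y^i) ]
    return np.linalg.solve(G, rhs)

v = rng.normal(size=W.size)
print(connection_vector_product(W.ravel(), v))
```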
## 6. Related Work

The idea of using the geodesic equation to accelerate gradient descent on manifolds was first introduced in Transtrum et al. (2011). However, our geodesic correction has several important differences. Our framework is generally applicable to all probabilistic models. This is to be contrasted with geodesic acceleration in Transtrum et al. (2011) and Transtrum & Sethna (2012), which can only be applied to nonlinear least squares. Additionally, our geodesic correction is motivated from the perspective of preserving higher-order invariance, while in Transtrum & Sethna (2012) it is motivated as a higher-order correction to the Gauss-Newton approximation of the Hessian under the so-called "small-curvature assumption". We discuss and evaluate empirically in Appendix F why the small-curvature approximation does not hold for training deep neural networks.

There has been a resurgence of interest in applying natural gradient to neural network training. Martens (2010) and Martens & Sutskever (2011) show that Hessian-Free optimization, which is equivalent to the natural gradient method in important cases in practice (Pascanu & Bengio, 2013; Martens, 2014), is able to obtain state-of-the-art results in optimizing deep autoencoders and RNNs. To scale up natural gradient, several approximations for inverting the Fisher information matrix have been proposed, such as Krylov subspace descent (Vinyals & Povey, 2012), FANG (Grosse & Salakhutdinov, 2015) and K-FAC (Martens & Grosse, 2015; Grosse & Martens, 2016; Ba et al., 2017).

## 7. Experimental Evaluations

In this section, we demonstrate the benefit of respecting higher-order invariance through experiments on synthetic optimization problems, deep neural network optimization tasks and policy optimization in deep reinforcement learning. Algorithms have abbreviated names in figures: we use "ng" to denote the basic natural gradient, "geo" to denote the one with geodesic correction, "geof" to denote the faster geodesic correction, and "mid" to denote the natural gradient update using the midpoint integrator.

### 7.1. Invariance

In this experiment, we investigate the effect of invariance under different parameterizations of the same objective. We test different algorithms on fitting a univariate Gamma distribution via maximum log-likelihood. The problem is simple enough that we can calculate the Fisher information metric and the corresponding Levi-Civita connection exactly. Moreover, we can use ODE-solving software to numerically integrate the continuous natural gradient equation and calculate the exponential map used in the Riemannian Euler method. The pdf of the Gamma distribution is
$$p(x \mid \alpha, \beta) = \Gamma(x; \alpha, \beta) \triangleq \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x},$$
where $\alpha, \beta$ are the shape and rate parameters. Aside from the original parameterization, we test three others: 1) $\alpha = \alpha'$, $\beta = 1/\beta'$; 2) $\alpha = \alpha'$, $\beta = (\beta')^3$; and 3) $\alpha = (\alpha')^2$, $\beta = (\beta')^2$, where $\alpha', \beta'$ are the new parameters. We generate 10000 synthetic data points from $\Gamma(X; 20, 20)$. During training, $\alpha$ and $\beta$ are initialized to 1 and the learning rate is fixed to 0.5.

[Figure 2. The effect of re-parameterizations on algorithms fitting a univariate Gamma distribution. Titles indicate which parameterization was used.]

We summarize the results in Figure 2. Here "ng(exact)" is obtained by numerically integrating (5), and "geo(exact)" is obtained using the Riemannian Euler method with a numerically calculated exponential map. As predicted by the theory, both methods are exactly invariant under all parameterizations. From Figure 2 we observe that the vanilla natural gradient update is not invariant under reparameterizations, due to its finite step size.
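To reproduce the flavor of this setup, the Gamma-fitting problem is simple enough that the exact natural gradient update fits in a few lines. The sketch below is an illustrative reimplementation, not the authors' experiment script; it uses the standard Fisher information of the Gamma distribution in the shape-rate parameterization, which involves the trigamma function.

```python
# A small sketch in the spirit of Section 7.1: fit a Gamma distribution by natural gradient
# using the exact Fisher information in the (shape alpha, rate beta) parameterization,
# I(alpha, beta) = [[psi'(alpha), -1/beta], [-1/beta, alpha/beta^2]] (psi' = trigamma).
# Illustrative reimplementation only, not the authors' code.
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(0)
data = rng.gamma(shape=20.0, scale=1.0 / 20.0, size=10000)   # samples from Gamma(alpha=20, rate=20)
mean_x, mean_logx = data.mean(), np.log(data).mean()

def grad_nll(alpha, beta):
    # Gradient of the average negative log-likelihood of Gamma(alpha, rate beta).
    return np.array([digamma(alpha) - np.log(beta) - mean_logx,
                     mean_x - alpha / beta])

def fisher(alpha, beta):
    return np.array([[polygamma(1, alpha), -1.0 / beta],
                     [-1.0 / beta, alpha / beta**2]])

theta = np.array([1.0, 1.0])       # (alpha, beta) initialized to 1, as in the experiment
lr = 0.5                           # learning rate h * lambda
for _ in range(100):
    alpha, beta = theta
    theta = theta - lr * np.linalg.solve(fisher(alpha, beta), grad_nll(alpha, beta))
    theta = np.maximum(theta, 1e-3)  # keep alpha, beta positive (simple safeguard, not in the paper)
print(theta)                       # should approach (20, 20) on this synthetic dataset
```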
We observe that our midpoint natural gradient method and geodesic corrected algorithms are more resilient to re-parameterizations, and all lead to accelerated convergence of natural gradient.

### 7.2. Training Deep Neural Nets

We test our algorithms on deep autoencoding and classification problems. The datasets are CURVES, MNIST and FACES, all of which contain small gray-scale images of various objects, i.e., synthetic curves, hand-written digits and human faces. Since all deep networks use fully-connected layers and sigmoid activation functions, the tasks are non-trivial to solve even for modern deep learning optimizers, such as Adam (Kingma & Ba, 2015). Due to the high difficulty of this task, it has become a standard benchmark for neural network optimization algorithms (Hinton & Salakhutdinov, 2006; Martens, 2010; Vinyals & Povey, 2012; Sutskever et al., 2013; Martens & Grosse, 2015). Since these tasks only involve squared loss and binary cross-entropy, we additionally test multi-class cross-entropy on an MNIST classification task. Additional details can be found in Appendix E.1.

[Figure 3. Training deep auto-encoders and classifiers with different acceleration algorithms. Solid lines show performance against number of iterations (bottom axes) while dashed lines depict performance against running time (top axes).]

Figure 3 summarizes our results on all three datasets. Here "ng" is the natural gradient method (Hessian-Free) as implemented in Martens (2010). For completeness, we also add Adam (Kingma & Ba, 2015) to the comparison and denote it as "adam". The training error reported for all datasets is the squared reconstruction error. Since the test error tracks the training error very well, we only report the training error due to space limitations. It is clear that all acceleration methods lead to per-iteration improvements compared to naive natural gradient. It is also remarkable that the performance of "geof", while being roughly half as expensive as "geo" per iteration, does not degrade too much compared to "geo". For performance comparisons with respect to time, "geof" is usually the best (or comparable to the best). "mid" and "geo" are relatively slower, since they need roughly twice as much computation per iteration as "ng". Nonetheless, "mid" still has the best time performance on the MNIST classification task.

We hereby emphasize again that the geodesic correction methods are not aimed at providing more accurate solutions of the natural gradient ODE. Instead, they are higher-order approximations of an invariant solution (obtained by the Riemannian Euler method), which itself is a first-order approximation to the exact solution. The improvements of both geodesic correction and the midpoint integrator in Figure 2 and Figure 3 confirm our intuition that preserving higher-order invariance can accelerate natural gradient optimization.

### 7.3. Model-free Reinforcement Learning for Continuous Control

Finally, we evaluate our methods in reinforcement learning over six continuous control tasks (Todorov et al., 2012).
Specifically, we consider improving the ACKTR algorithm (Wu et al., 2017), an efficient variant of natural policy gradient (Kakade, 2002) which uses Kronecker factors (Martens & Grosse, 2015) to approximately compute the inverse of the Fisher information matrix. For these methods, we evaluate sample efficiency (expected rewards per episode reached within a certain number of interactions); in robotics tasks the cost of simulation often dominates the cost of the reinforcement learning algorithm, so requiring fewer interactions to achieve a certain performance has higher priority than lower optimization time per iteration. Therefore we only test the midpoint integrator and geodesic correction method for improving ACKTR, and omit the faster geodesic correction because of its less accurate approximation.

[Figure 4. Sample efficiency of model-free reinforcement learning on continuous control tasks (Todorov et al., 2012). Titles indicate the environment used in OpenAI Gym (Brockman et al., 2016).]

Figure 4 describes the results on the continuous control tasks, where we use "mid" and "geo" to denote our midpoint integrator and geodesic correction methods for ACKTR, respectively. In each environment, we consider the same constant learning rate schedule for all three methods (detailed settings in Appendix E.2). While the Fisher information matrices are approximated via Kronecker factors, our midpoint integrator and geodesic correction methods are still able to outperform ACKTR in terms of sample efficiency in most of the environments. This suggests that preserving higher-order invariance could also benefit natural policy gradients in reinforcement learning, and that our methods can be scaled to large problems via approximations of Fisher information matrices.

## 8. Conclusion

Our contributions in this paper can be summarized as follows:

- We propose to measure the invariance of numerical schemes by comparing their convergence to idealized invariant solutions.
- To the best of our knowledge, we are the first to use midpoint integrators for natural gradient optimization.
- Based on the Riemannian Euler method, we introduce geodesic corrected updates. Moreover, the faster geodesic correction has comparable time complexity to vanilla natural gradient. Computationally, we also introduce new backpropagation-type algorithms to compute connection-vector products.
- Theoretically, we provide convergence proofs for both types of geodesic corrected updates.
- Experiments confirm the benefits of invariance and demonstrate faster convergence and improved sample efficiency of our proposed algorithms in supervised learning and reinforcement learning applications.

For future research, it would be interesting to perform a thorough investigation of applications in reinforcement learning, and to study faster variants and more efficient implementations of the proposed acceleration algorithms.

## Acknowledgements

The authors would like to thank Jonathan McKinney for helpful discussions. This work was supported by NSF grants #1651565, #1522054, #1733686, Toyota Research Institute, Future of Life Institute, and Intel.

## References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265-283, 2016.

Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Amari, S.-I., Barndorff-Nielsen, O., Kass, R., Lauritzen, S., Rao, C., et al. Differential geometrical theory of statistics. In Differential Geometry in Statistical Inference, pp. 19-94. Institute of Mathematical Statistics, 1987.

Ba, J., Grosse, R., and Martens, J.
Distributed second-order optimization using Kronecker-factored approximations. 2017.

Bielecki, A. Estimation of the Euler method error on a Riemannian manifold. Communications in Numerical Methods in Engineering, 18(11):757-763, 2002.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Carroll, S. M. Spacetime and Geometry: An Introduction to General Relativity. Addison Wesley, San Francisco, CA, USA, 2004.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933-2941, 2014.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Grosse, R. and Martens, J. A Kronecker-factored approximate Fisher matrix for convolution layers. In Proceedings of The 33rd International Conference on Machine Learning, pp. 573-582, 2016.

Grosse, R. and Salakhutdinov, R. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2304-2313, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G., Srivastava, N., and Swersky, K. Lecture 6a: Overview of mini-batch gradient descent. 2012.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531-1538, 2002.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9-48. Springer, 2012.

Marquardt, D. W. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431-441, 1963.

Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735-742, 2010.

Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of The 32nd International Conference on Machine Learning, pp. 2408-2417, 2015.

Martens, J. and Sutskever, I. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033-1040, 2011.

Ollivier, Y. Riemannian metrics for neural networks I: feedforward networks. arXiv preprint arXiv:1303.0818, 2013.
Ollivier, Y. Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences. Information and Inference, 4(2):154-193, 2015.

Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

Petersen, P. Riemannian Geometry, volume 171. Springer, 2006.

Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901-909, 2016.

Schraudolph, N. N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723-1738, 2002.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of The 30th International Conference on Machine Learning, pp. 1139-1147, 2013.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026-5033. IEEE, 2012.

Transtrum, M. K. and Sethna, J. P. Geodesic acceleration and the small-curvature approximation for nonlinear least squares. arXiv preprint arXiv:1207.4999, 2012.

Transtrum, M. K., Machta, B. B., and Sethna, J. P. Geometry of nonlinear least squares with applications to sloppy models and optimization. Physical Review E, 83(3):036701, 2011.

Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In International Conference on Artificial Intelligence and Statistics, pp. 1261-1268, 2012.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285-5294, 2017.