Published as a conference paper at ICLR 2020

KERNELIZED WASSERSTEIN NATURAL GRADIENT

Michael Arbel, Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
{michael.n.arbel,arthur.gretton}@gmail.com

Wuchen Li
University of California, Los Angeles
wcli@math.ucla.edu

Guido Montúfar
University of California, Los Angeles, and Max Planck Institute for Mathematics in the Sciences
montufar@mis.mpg.de

ABSTRACT

Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging, as it requires inverting a high-dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator of the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show empirically the advantage of using such an estimator in classification tasks on Cifar10 and Cifar100.

1 INTRODUCTION

The success of machine learning algorithms relies on the quality of an underlying optimization method. Many of the current state-of-the-art methods rely on variants of Stochastic Gradient Descent (SGD) such as AdaGrad (Duchi et al., 2011), RMSProp (Hinton et al., 2012), and Adam (Kingma and Ba, 2014). While generally effective, the performance of such methods remains sensitive to the curvature of the optimization objective. When the Hessian matrix of the objective at the optimum has a large condition number, the problem is said to have pathological curvature (Martens, 2010; Sutskever et al., 2013). In this case, first-order optimization methods tend to perform poorly. Using adaptive step sizes can help when the principal directions of curvature are aligned with the coordinates of the parameter vector. Otherwise, an additional rotation of the basis is needed to achieve this alignment. One strategy is to find an alternative parametrization of the same model that has a better-behaved curvature and is thus easier to optimize with standard first-order optimization methods. Designing good network architectures (Simonyan and Zisserman, 2014; He et al., 2015) along with normalization techniques (LeCun et al., 2012; Ioffe and Szegedy, 2015; Salimans and Kingma, 2016) is often critical for the success of such optimization methods. The natural gradient method (Amari, 1998) takes a related but different perspective. Rather than reparametrizing the model, the natural gradient method makes the optimizer itself invariant to reparametrizations by directly operating on the manifold of probability distributions. This requires endowing the parameter space with a suitable notion of proximity, formalized by a metric. An important metric in this context is the Fisher information metric (Fisher and Russell, 1922; Rao, 1992), which induces the Fisher-Rao natural gradient (Amari, 1985).
Another important metric in probability space is the Wasserstein metric (Villani, 2009; Otto, 2001), which induces the Wasserstein natural gradient (Li and Montufar, 2018a;b; Li, 2018); see similar formulations in Gaussian families (Malagò et al., 2018; Modin, 2017). In spite of their numerous theoretical advantages, applying natural gradient methods is challenging in practice. Indeed, each parameter update requires inverting the metric tensor. This becomes infeasible for current deep learning models, which typically have millions of parameters. This has motivated research into finding efficient algorithms to estimate the natural gradient (Martens and Grosse, 2015; Grosse and Martens, 2016; George et al., 2018; Published as a conference paper at ICLR 2020 Heskes, 2000; Bernacchia et al., 2018). Such algorithms often address the case of the Fisher metric and either exploit a particular structure of the parametric family or rely on a low rank decomposition of the information matrix. Recently, Li et al. (2019) proposed to estimate the metric based on a dual formulation and used this estimate in a proximal method. While this avoids explicitly computing the natural gradient, the proximal method also introduces an additional optimization problem to be solved at each update of the model s parameters. The quality of the solver will thus depend on the accuracy of this additional optimization. In this paper, we use the dual formulation of the metric to directly obtain a closed form expression of the natural gradient as a solution to a convex functional optimization problem. We focus on the Wasserstein metric as it has the advantage of being well defined even when the model doesn t admit a density. The expression remains valid for general metrics including the Fisher-Rao metric. We leverage recent work on Kernel methods (Sriperumbudur et al., 2017; Arbel and Gretton, 2017; Sutherland et al., 2017; Mroueh et al., 2019) to compute an estimate of the natural gradient by restricting the functional space appearing in the dual formulation to a Reproducing Kernel Hilbert Space. We demonstrate empirically the accuracy of our estimator on toy examples, and show how it can be effectively used to approximate the trajectory of the natural gradient descent algorithm. We also analyze the effect of the dimensionality of the model on the accuracy of the proposed estimator. Finally, we illustrate the benefits of our proposed estimator for solving classification problems when the model has an ill-conditioned parametrization. The paper is organized as follows. In Section 2, after a brief description of natural gradients, we discuss Legendre duality of metrics, and provide details on the Wasserstein natural gradient. In Section 3, we present our kernel estimator of the natural gradient. In Section 4 we present experiments to evaluate the accuracy of the proposed estimator and demonstrate its effectiveness in supervised learning tasks. 2 NATURAL GRADIENT DESCENT We first briefly recall the natural gradient descent method in Section 2.1, and its relation to metrics on probability distribution spaces in Section 2.2. We next present Legendre dual formulations for metrics in Section 2.3 where we highlight the Fisher-Rao and Wasserstein metrics as important examples. 2.1 GENERAL FORMULATION It is often possible to formulate learning problems as the minimization of some cost functional ρ7 F(ρ) over probability distributions ρ from a parametric model PΘ. 
The set PΘ contains probability distributions defined on an open sample space Ω Rd and parametrized by some vector θ Θ, where Θ is an open subset of Rq. The learning problem can thus be formalized as finding an optimal value θ that locally minimizes a loss function L(θ) := F(ρθ) defined over the parameter space Θ. One convenient way to solve this problem approximately is by gradient descent, which uses the Euclidean gradient of L w.r.t. the parameter vector θ to produce a sequence of updates θt according to the following rule: θt+1=θt γt L(θt). Here the step-size γt is a positive real number. The Euclidean gradient can be viewed as the direction in parameter space that leads to the highest decrease of some linear model Mt of the cost function L per unit of change of the parameter. More precisely, the Euclidean gradient is obtained as the solution of the optimization problem: L(θt)= argmin u Rq Mt(u)+1 The linear model Mt is an approximation of the cost function L in the neighborhood of θt and is simply obtained by a first order expansion: Mt(u) = L(θt)+ L(θt) u. The quadratic term u 2 penalizes the change in the parameter and ensures that the solution remains in the neighborhood where the linear model is still a good approximation of the cost function. This particular choice of quadratic term is what defines the Euclidean gradient descent algorithm, which can often be efficiently implemented for neural network models using back-propagation. The performance of this algorithm is highly dependent on the parametrization of the model PΘ, however (Martens, 2010; Sutskever et al., 2013). To obtain an algorithm that is robust to parametrization, one can take advantage of the structure of the cost function L(θ) which is obtained as the composition of the functional F and the model θ7 ρθ and define a generalized natural gradient (Amari and Cichocki, 2010). We first provide Published as a conference paper at ICLR 2020 a conceptual description of the general approach to obtain such gradient. The starting point is to choose a divergence D between probability distributions and use it as a new penalization term: argmin u Rq Mt(u)+1 2D(ρθt,ρθt+u). (2) Here, changes in the model are penalized directly in probability space rather than parameter space as in (1). In the limit of small u, the penalization term can be replaced by a quadratic term u GD(θ)u where GD(θ) contains second order information about the model as measured by D. This leads to the following expression for the generalized natural gradient DL(θt) where the dependence in D is made explicit: DL(θt):= argmin u Rq Mt(u)+1 2u GD(θt)u. (3) From (3), it is possible to express the generalized natural gradient by means of the Euclidean gradient: DL(θt)=GD(θt) 1 L(θt). The parameter updates are then obtained by the new update rule: θt+1=θt γt GD(θt) 1 L(θt). (4) Equation (4) leads to a descent algorithm which is invariant to parametrization in the continuous-time limit: Proposition 1. Let Ψ be an invertible and smoothly differentiable re-parametrization ψ = Ψ(θ) and denote by L(ψ):=L(Ψ 1(ψ)). Consider the continuous-time natural gradient flows: θs= D θ L(θs), ψs= D ψ L(ψs), ψ0=Ψ(θ0) Then ψs and θs are related by the equation ψs=Ψ(θs) at all times s 0. This result implies that an ill-conditioned parametrization of the model has little effect on the optimization when (4) is used. 
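To make the update rule (4) concrete, here is a minimal NumPy sketch of a generalized natural gradient step for a generic information matrix G_D(θ). The quadratic toy loss and the choice of G below are illustrative placeholders, not part of the paper.

```python
import numpy as np

def natural_gradient_step(theta, grad, G, step_size, damping=1e-8):
    """One generalized natural gradient update, as in (4):
    theta_{t+1} = theta_t - step_size * G_D(theta_t)^{-1} grad L(theta_t).
    A small damping term keeps the linear solve well posed when G is near-singular."""
    q = theta.shape[0]
    return theta - step_size * np.linalg.solve(G + damping * np.eye(q), grad)

# Toy usage on an ill-conditioned quadratic loss L(theta) = 0.5 * theta^T H theta.
# Taking G = H here (a placeholder choice, purely for illustration) removes the
# pathological curvature, so a unit step size converges almost immediately,
# whereas plain gradient descent would need a step size of order 1e-4.
H = np.diag([1.0, 1e4])
theta = np.array([1.0, 1.0])
for _ in range(5):
    grad = H @ theta
    theta = natural_gradient_step(theta, grad, G=H, step_size=1.0)
print(theta)
```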
This invariance is a consequence of the transformation properties of the natural gradient under a change of parametrization, ∇^D_ψ L̃(ψ) = ∇_θΨ(θ) ∇^D_θ L(θ), which holds in general for any covariant gradient. We provide a proof of Proposition 1 in Appendix C.1 in the particular case when D is either the Kullback-Leibler divergence or the squared Wasserstein-2 distance, using notions introduced later in Section 2.3, and refer to Ollivier et al. (2011) for a detailed discussion. The approach based on (2) for defining the generalized natural gradient is purely conceptual; it can be formalized using the notion of metric tensor from differential geometry, which allows for more generality. In Section 2.2, we provide such a formal definition in the case when D is either the Kullback-Leibler divergence or the squared Wasserstein-2 distance.

2.2 INFORMATION MATRIX VIA DIFFERENTIAL GEOMETRY

When D is the Kullback-Leibler divergence, or relative entropy, (3) defines the Fisher-Rao natural gradient ∇^F L(θ) (Amari, 1985) and G_F(θ) is called the Fisher information matrix. G_F(θ) is well defined when the probability distributions in PΘ all have positive densities, and when some additional differentiability and integrability assumptions on ρθ are satisfied. In fact, it has an interpretation in Riemannian geometry as the pull-back of a metric tensor g^F defined over the set of probability distributions with positive densities and known as the Fisher-Rao metric (see Definition 4 in Appendix B.1; see also Holbrook et al. 2017):

Definition 1 (Fisher information matrix). Assume θ ↦ ρθ(x) is differentiable for all x in Ω and that ∫ ‖∇θρθ(x)‖² / ρθ(x) dx < ∞. Then the Fisher information matrix is defined as the pull-back of the Fisher-Rao metric g^F:

G_F(θ)_{ij} = g^F_{ρθ}(∂_iρθ, ∂_jρθ) := ∫ f_i(x) f_j(x) ρθ(x) dx,

where the functions f_i on Ω are given by f_i = ∂_iρθ / ρθ.

Definition 1 directly introduces G_F using the Fisher-Rao metric tensor, which captures the infinitesimal behavior of the KL divergence. This approach can be extended to any metric tensor g defined on a suitable space of probability distributions containing PΘ. In particular, when D is the squared Wasserstein-2 distance, the Wasserstein information matrix is obtained directly by means of the Wasserstein-2 metric tensor g^W (Otto and Villani, 2000; Lafferty and Wasserman, 2008), as proposed in Li and Montufar (2018a); Chen and Li (2018):

Definition 2 (Wasserstein information matrix). The Wasserstein information matrix (WIM) is defined as the pull-back of the Wasserstein-2 metric g^W:

G_W(θ)_{ij} = g^W_{ρθ}(∂_iρθ, ∂_jρθ) := ∫ φ_i(x)·φ_j(x) dρθ(x),

where the φ_i are vector-valued functions on Ω that are solutions to the partial differential equations with Neumann boundary condition:

∂_iρθ = −div(ρθ φ_i), 1 ≤ i ≤ q.

Moreover, the φ_i are required to be in the closure of the set of gradients of smooth and compactly supported functions in L2(ρθ)^d. In particular, when ρθ has a density, φ_i = ∇_x f_i for some real-valued function f_i on Ω.

The partial derivatives ∂_iρθ should be understood in the distributional sense, as discussed in more detail in Section 2.3. This allows the Wasserstein natural gradient to be defined even when the model ρθ does not admit a density. Moreover, it allows for more generality than the conceptual approach based on (2), which would require performing a first-order expansion of the Wasserstein distance in terms of its linearized version, known as the Negative Sobolev distance. We provide more discussion of these two approaches and their differences in Appendix B.3.
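Definition 1 can be checked numerically on a model with a tractable density. The sketch below is an illustration (not the paper's code): it estimates the Fisher information matrix of a one-dimensional Gaussian ρθ = N(μ, σ²), parametrized by θ = (μ, σ), via the Monte Carlo identity G_F(θ)_{ij} = E[∂_i log ρθ(X) ∂_j log ρθ(X)], and compares it with the standard closed form diag(1/σ², 2/σ²).

```python
import numpy as np

def fisher_information_gaussian_mc(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the Fisher information matrix for N(mu, sigma^2),
    parametrized by theta = (mu, sigma). Uses the score of the log-density:
      d/dmu    log rho = (x - mu) / sigma^2
      d/dsigma log rho = ((x - mu)^2 - sigma^2) / sigma^3
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n_samples)
    score = np.stack([(x - mu) / sigma**2,
                      ((x - mu)**2 - sigma**2) / sigma**3], axis=1)  # (n, 2)
    return score.T @ score / n_samples

mu, sigma = 0.5, 2.0
print(fisher_information_gaussian_mc(mu, sigma))
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # closed form, for comparison
```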
From now on, we will focus on the above two cases of the natural gradient DL(θ), namely FL(θ) and WL(θ). When the dimension of the parameter space is high, directly using equation (4) becomes impractical as it requires storing and inverting the matrix G(θ). In Section 2.3 we will see how equation (3) can be exploited along with Legendre duality to get an expression for the natural gradient that can be efficiently approximated using kernel methods. 2.3 LEGENDRE DUALITY FOR METRICS In this section we provide an expression for the natural gradient defined in (3) as the solution of a saddle-point optimization problem. It exploits Legendre duality for metrics to express the quadratic term u G(θ)u as a solution to a functional optimization problem over C c (Ω), the set of smooth and compactly supported functions on Ω. The starting point is to extend the notion of gradient ρθ which appears in Definitions 1 and 2 to the distributional sense of Definition 3 below. Definition 3. Given a parametric family PΘ of probability distributions, we say that ρθ admits a distributional gradient at point θ if there exists a linear continuous map ρθ :C c (Ω) Rq such that: Z f(x)dρθ+ϵu(x) Z f(x)dρθ(x)=ϵ ρθ(f) u+ϵδ(ϵ,f,u) f C c (Ω), u Rq where δ(ϵ,f,u) depends on f and u and converges to 0 as ϵ approaches 0. ρθ is called the distributional gradient of ρθ at point θ. When the distributions in PΘ have a density, written x7 ρθ(x) by abuse of notation, that is differentiable w.r.t. θ and with a jointly continuous gradient in θ and x then ρθ(f) is simply given by R f(x) θρθ(x)dx as shown in Proposition 12 of Appendix C.1. In this case, the Fisher-Rao natural gradient admits a formulation as a saddle point solution involving ρθ and provided in Proposition 2 with a proof in Appendix C.1. Proposition 2. Under the same assumptions as in Definition 1, the Fisher information matrix admits the dual formulation: 1 2u GF(θ)u:= sup f C c (Ω) R f(x)dρθ(x)=0 Z f(x)2dρθ(x)dx. (5) Moreover, defining Uθ(f):= L(θ)+ ρθ(f), the Fisher-Rao natural gradient satisfies: FL(θ)= argmin u Rq sup f C c (Ω) R f(x)dρθ(x)=0 Z f(x)2dρθ(x)dx, Another important case is when PΘ is defined as an implicit model. In this case, any sample x from a distribution ρθ in PΘ is obtained as x=hθ(z), where z is a sample from a fixed latent distribution ν Published as a conference paper at ICLR 2020 defined over a latent space Z and (θ,z)7 hθ(z) is a deterministic function with values in Ω. This can be written in a more compact way as the push-forward of ν by the function hθ: PΘ:={ρθ :=(hθ)#ν|θ Ω}. (6) A different expression for ρθ is obtained in the case of implicit models when θ7 hθ(z) is differentiable for ν-almost all z and hθ is square integrable under ν: ρθ(f)= Z hθ(z) xf(hθ(z))dν(z). (7) Equation (7) is also known as the re-parametrization trick (Kingma et al., 2015) and allows to derive a dual formulation of the Wasserstein natural gradient in the case of implicit models. Proposition 3 below provides such formulation under mild assumptions stated in Appendix A.2 along with a proof in Appendix C.1. Proposition 3. Assume PΘ is defined by (6) such that ρθ is given by (7). Under Assumptions (B) and (C), the Wasserstein information matrix satisfies: 1 2u GW(θ)u= sup f C c (Ω) ρθ(f) u 1 Z xf(x) 2dρθ(x) (8) and the Wasserstein natural gradient satisfies: WL(θ)= argmin u Rq sup f C c (Ω) Uθ(f) u 1 Z xf(x) 2dρθ(x). (9) The similarity between the variational formulations provided in Propositions 2 and 3 is worth noting. 
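Both dual formulations involve the distributional gradient ∇ρθ only through terms of the form ∇ρθ(f)ᵀu, which for implicit models can be estimated from latent samples via the re-parametrization trick (7). Before turning to the differences between the two formulations, here is a short numerical check of (7); the model hθ, the test function f, and the latent distribution below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(theta, z):             # hypothetical implicit model: pushforward of N(0, 1)
    return theta[0] + theta[1] * z

def grad_theta_h(theta, z):  # Jacobian of h w.r.t. theta, shape (n, 2)
    return np.stack([np.ones_like(z), z], axis=1)

f = np.sin                    # test function
grad_f = np.cos               # its spatial gradient (d = 1)

theta = np.array([0.3, 1.5])
u = np.array([0.7, -0.2])
z = rng.normal(size=500_000)

# Re-parametrization estimate of  grad rho_theta(f)^T u  as in (7)
lhs = np.mean(grad_theta_h(theta, z) @ u * grad_f(h(theta, z)))

# Finite-difference check of  d/de E[f(h_{theta + e u}(Z))]  at e = 0 (same samples)
eps = 1e-4
rhs = (np.mean(f(h(theta + eps * u, z))) - np.mean(f(h(theta - eps * u, z)))) / (2 * eps)
print(lhs, rhs)   # the two numbers should agree closely
```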
A first difference however, is that Proposition 3 doesn t require the test functions f to have 0 mean under ρθ. This is due to the form of the objective in (8) which only depends on the gradient of f. More importantly, while (8) is well defined, the expression in (5) can be infinite when ρθ is given by (7). Indeed, if the ρθ doesn t admit a density, it is always possible to find an admissible function f C c (Ω) with bounded second moment under ρθ but for which ρθ(f) is arbitrarily large. This is avoided in (8) since the quadratic term directly penalizes the gradient of functions instead. For similar reasons, the dual formulation of the Sobolev distance considered in Mroueh et al. (2019) can also be infinite in the case of implicit models as discussed in Appendix B.3 although formally similar to (8). Nevertheless, a similar estimator as in Mroueh et al. (2019) can be considered using kernel methods which is the object of Section 3. 3 KERNELIZED WASSERSTEIN NATURAL GRADIENT In this section we propose an estimator for the Wasserstein natural gradient using kernel methods and exploiting the formulation in (9). We restrict to the case of the Wasserstein natural gradient (WNG), denoted by WL(θ), as it is well defined for implicit models, but a similar approach can be used for the Fisher-Rao natural gradient in the case of models with densities. We first start by presenting the kernelized Wasserstein natural gradient (KWNG) in Section 3.1, then we introduce an efficient estimator for KWNG in Section 3.2. In Section 3.4 we provide statistical guarantees and discuss practical considerations in Section 3.3. 3.1 GENERAL FORMULATION AND MINIMAX THEOREM Consider a Reproducing Kernel Hilbert Space (RKHS) H which is a Hilbert space endowed with an inner product .,. H along with its norm . H. H has the additional property that there exists a symmetric positive semi-definite kernel k : Ω Ω7 R such that k(x,.) H for all x Ωand satisfying the Reproducing property for all functions f in H: f(x)= f,k(x,.) H, x Ω. (10) The above property is central in all kernel methods as it allows to obtain closed form expressions for some class of functional optimization problems. In order to take advantage of such property for estimating the natural gradient, we consider a new saddle problem obtained by restricting (9) to functions in the RKHS H and adding some regularization terms: WL(θ):= min u Rqsup f H Uθ(f) u 1 Z xf(x) 2dρθ(x)+1 2(ϵu D(θ)u λ f 2 H). (11) Published as a conference paper at ICLR 2020 The kernelized Wasserstein natural gradient is obtained by solving (11) and is denoted by WL(θ). Here, ϵ is a positive real numbers, λ is non-negative while D(θ) is a diagonal matrix in Rq with positive diagonal elements whose choice will be discussed in Section 3.3. The first regularization term makes the problem strongly convex in u, while the second term makes the problem strongly concave in f when λ>0. When λ=0, the problem is still concave in f. This allows to use a version of the minimax theorem (Ekeland and Témam, 1999, Proposition 2.3, Chapter VI) to exchange the order of the supremum and minimum which also holds true when λ=0. A new expression for the kernelized natural gradient is therefore obtained: Proposition 4. Assume that ϵ>0 and λ>0, then the kernelized natural gradient is given by: ϵD(θ) 1Uθ(f ), (12) where f is the unique solution to the quadratic optimization problem: inf f HJ (f):= Z xf(x) 2dρθ(x)+1 ϵUθ(f) D(θ) 1Uθ(f)+λ f 2 H. 
(13) When λ=0, f might not be well defined, still, we have: WL(θ)=limj 1 ϵD(θ) 1Uθ(fj) for any limiting sequence of (13). Proposition 4 allows to compute the kernelized natural gradient directly, provided that the functional optimization (13) can be solved. This circumvents the direct computation and inversion of the metric as suggested by (11). In Section 3.2, we propose a method to efficiently compute an approximate solution to (13) using Nyström projections. We also show in Section 3.4 that restricting the space of functions to H can still lead to a good approximation of the WNG provided that H enjoys some denseness properties. 3.2 NYSTRÖM METHODS FOR THE KERENALIZED NATURAL GRADIENT We are interested now in finding an approximate solution to (13) which will allow to compute an estimator for the WNG using Proposition 4. Here we consider N samples (Zn)1 n N from the latent distribution ν which are used to produce N samples (Xn)1 n N from ρθ using the map hθ, i.e., Xn=hθ(Zn). We also assume we have access to an estimate of the Euclidean gradient L(θ) which is denoted by L(θ). This allows to compute an empirical version of the cost function in (13), n=1 xf(Xn) 2+ 1 ϵ c Uθ(f) D(θ) 1c Uθ(f)+λ f 2 H, (14) where c Uθ(f) is given by: c Uθ(f)= L(θ)+ 1 N PN n=1 hθ(Zn)) xf(Xn). (14) has a similar structure as the empirical version of the kernel Sobolev distance introduced in Mroueh et al. (2019), it is also similar to another functional arising in the context of score estimation for infinite dimensional exponential families (Sriperumbudur et al., 2017; Sutherland et al., 2017; Arbel and Gretton, 2017). It can be shown using the generalized Representer Theorem (Schölkopf et al., 2001) that the optimal function minimizing (14) is a linear combination of functions of the form x7 ik(Xn,x) with 1 n N and 1 i d and ik(y,x) denotes the partial derivative of k w.r.t. yi. This requires to solve a system of size Nd Nd which can be prohibitive when both N and d are large. Nyström methods provide a way to improve such computational cost by further restricting the optimal solution to belong to a finite dimensional subspace HM of H called the Nyström subspace. In the context of score estimation, Sutherland et al. (2017) proposed to use a subspace formed by linear combinations of the basis functions x7 ik(Ym,x): span{x7 ik(Ym,x)|1 m M; 1 i d}, (15) where (Ym)1 m M are basis points drawn uniformly from (Xn)1 n N with M N. This further reduces the computational cost when M N but still has a cubic dependence in the dimension d since all partial derivatives of the kernel are considered to construct (15). Here, we propose to randomly sample one component of ( ik(Ym,.))1 i d for each basis point Ym. Hence, we consider M indices (im)1 m M uniformly drawn form {1,...,d} and define the Nyström subspace HM to be: HM :=span{x7 imk(Ym,x)|1 m M}. An estimator for the kernelized Wasserstein natural gradient (KWNG) is then given by: ϵD(θ) 1c Uθ(ˆf ), ˆf :=argmin f HM ˆJ (f). (16) Published as a conference paper at ICLR 2020 By definition of the Nyström subspace HM, the optimal solution ˆf is necessarily of the form: ˆf (x)= PM m=1αm imk(Ym,x), where the coefficients (αm)1 m M are obtained by solving a finite dimensional quadratic optimization problem. This allows to provide a closed form expression for (17) in Proposition 5. Proposition 5. 
The estimator in (16) is given by: Å D(θ) 1 D(θ) 1T TD(θ) 1T +λϵK+ ϵ N CC TD(θ) 1 ã L(θ), (17) where C and K are matrices in RM Nd and RM M given by Cm,(n,i)= im i+dk(Ym,Xn), Km,m = im im +dk(Ym,Ym ), (18) while T is a matrix in RM q obtained as the Jacobian of θ7 τ(θ) RM, i.e., T := τ(θ), with n=1 imk(Ym,hθ(Zn)). In (18), we used the notation i+dk(y,x) for the partial derivative of k w.r.t. xi. A proof of Proposition 5 is provided in Appendix C.2 and relies on the reproducing property (10) and its generalization for partial derivatives of functions. The estimator in Proposition 5 is in fact a low rank approximation of the natural gradient obtained from the dual representation of the metric (9). While low-rank approximations for the Fisher-Rao natural gradient were considered in the context of variational inference and for a Gaussian variational posterior (Mishkin et al., 2018), (17) can be applied as a plug-in estimator for any family PΘ obtained as an implicit model. We next discuss a numerically stable expression of (17), its computational cost and the choice of the damping term in Section 3.3. We then provide asymptotic rates of convergence for (17) in Section 3.4. 3.3 PRACTICAL CONSIDERATIONS Numerically stable expression. When λ=0, the estimator in (17) has an additional structure which can be exploited to get more accurate solutions. By the chain rule, the matrix T admits a second expression of the form T =CB where B is the Jacobian matrix of (hθ(Zn))1 n N. Although this expression is impractical to compute in general, it suggests that C can be simplified . This simplification can be achieved in practice by computing the SVD of CC =USUT and pre-multiplying T by S UT. The resulting expression is given in Proposition 6 and falls into the category of Ridgless estimators (Liang and Rakhlin (2019)). Proposition 6. Consider an SVD decomposition of CCT of the form CC =USUT, then (17) is equal to: Å D(θ) 1 D(θ) 1e T e TD(θ) 1e T + ϵ N P e TD(θ) 1 ã L(θ), (19) where P :=S S and e T :=S UTT. Choice of damping term. So far, we only required D(θ) to be a diagonal matrix with positive coefficients. While a natural choice would be the identity matrix, this doesn t necessarily represent the best choice. As discussed by Martens and Sutskever (2012, Section 8.2), using the identity breaks the self-rescaling properties enjoyed by the natural gradient. Instead, we consider a scale-sensitive choice by setting (D(θ))i= e T.,i where e T is defined in Proposition 6. When the sample-size is limited, as it is often the case when N is the size of a mini-batch, larger values for ϵ might be required. That is to prevent the KWNG from over-estimating the step-size in low curvature directions. Indeed, these directions are rescaled by the inverse of the smallest eigenvalues of the information matrix which are harder to estimate accurately. To adjust ϵ dynamically during training, we use a variant of the Levenberg-Marquardt heuristic as in Martens and Sutskever (2012) which seems to perform well in practice; see Section 4. Computational cost. The number of basis points M controls the computational cost of both (17) and (19) which is dominated by the cost of computing T and C, solving an M M linear system and performing an SVD of CCT in the case of (19). This gives an overall cost of O(d NM2+q M2+M3). In practice, M can be chosen to be small (M 20) while N corresponds to the number of samples in a mini-batch. 
Hence, in a typical deep learning model, most of the computational cost is due to computing T as the typical number of parameters q is of the order of millions. In fact, T can be computed using automatic differentiation and would require performing M backward passes on the model to compute the gradient for each component of τ. Overall, the proposed estimator can be efficiently implemented and used for typical deep learning problems as shown in Section 4. Published as a conference paper at ICLR 2020 Choice of the kernel. We found that using either a gaussian kernel or a rational quadratic kernel to work well in practice. We also propose a simple heuristic to adapt the bandwidth of those kernels to the data by setting it to σ=σ0σN,M, where σN,M is equal to the average square distance between samples (Xn)1 n N and the basis points (Ym)1 m M and σ0 is fixed a priori. Another choice is the median heuristic Garreau et al. (2018). In this section we are interested in the behavior of the estimator in the limit of large N and M and when λ > 0; we leave the case when λ = 0 for future work. We work under Assumptions (A) to (G) in Appendix A.2 which state that Ωis a non-empty subset, k is continuously twice differentiable with bounded second derivatives, hθ(z) has at most a linear growth in z and ν satisfies some standard moments conditions. Finally, we assume that the estimator of the euclidean gradient L(θ) satisfies Chebychev s concentration inequality which is often the case in Machine learning problem as discussed in Remark 1 of Appendix A.2. We distinguish two cases: the well-specified case and the miss-specified case. In the well-specified case, the vector valued functions (φi)1 i q involved in Definition 2 are assumed to be gradients of functions in H and their smoothness is controlled by some parameter α 0 with worst case being α=0. Under this assumption, we obtain smoothness dependent convergence rates as shown in Theorem 14 of Appendix C.3 using techniques from Rudi et al. (2015); Sutherland et al. (2017). Here, we will only focus on the miss-specified which relies on a weaker assumption: Assumption 1. There exists two constants C >0 and c 0 such that for all κ>0 and all 1 i q, there is a function fκ i satisfying: φi fκ i L2(ρθ) Cκ, fκ i H Cκ c. (20) The left inequality in (20) represents the accuracy of the approximation of φi by gradients of functions in H while the right inequality represents the complexity of such approximation. Thus, the parameter c characterizes the difficulty of the problem: a higher value of c means that a more accurate approximation of φi comes at a higher cost in terms of its complexity. Theorem 7 provides convergences rates for the estimator in Proposition 5 under Assumption 1: Theorem 7. Let δ be such that 0 δ 1 and b:= 1 2+c. Under Assumption 1 and Assumptions (A) to (G) listed in Appendix A.2, for N large enough, M (d N 1 2b+1log(N)), λ N 1 2b+1 and ϵ N b 2b+1, it holds with probability at least 1 δ that: ÿ WL(θ) WL(θ) 2=O N 2 4+c . A proof of Theorem 7 is provided in Appendix C.3. In the best case where c=0, we recover a convergence rate of 1 N as in the well specified case for the worst smoothness parameter value α=0. Hence, Theorem 7 is a consistent extension of the well-specified case. For harder problems where c>0 more basis points are needed, with M required to be of order d Nlog(N) in the limit when c in which case the Nyström approximation loses its computational advantage. 
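Before moving to the experiments, the ingredients of Sections 3.2 and 3.3 can be assembled into a short NumPy sketch of the estimator. It implements the closed form of Proposition 5 (the ridge version, with transposes arranged explicitly; the SVD-based stabilization of Proposition 6 and the adaptive damping and ϵ heuristics are omitted). The Gaussian-kernel derivative formulas, the identity damping, the bandwidth choice, and the toy location model used for the final check are illustrative assumptions rather than the authors' implementation (their code is at the repository referenced in Section 4).

```python
import numpy as np

def gauss_kernel_cross_derivatives(Y, X, idx, sigma):
    """Mixed second derivatives of the Gaussian kernel k(y, x) = exp(-||y-x||^2/(2 sigma^2)):
    returns C3[m, n, i] = d/dy_{idx[m]} d/dx_i k(Y[m], X[n]),
    i.e. the entries C_{m,(n,i)} of (18) before flattening."""
    M, d = Y.shape
    diff = Y[:, None, :] - X[None, :, :]                        # (M, N, d), diff = y - x
    k = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))   # (M, N)
    one_hot = np.zeros((M, d))
    one_hot[np.arange(M), idx] = 1.0                            # e_{idx[m]} for each row
    diff_sel = np.take_along_axis(diff, idx[:, None, None], axis=2)[..., 0]  # diff[m, n, idx[m]]
    # d/dy_a d/dx_b k = k * (delta_{ab}/sigma^2 - (y_a - x_a)(y_b - x_b)/sigma^4)
    return k[:, :, None] * (one_hot[:, None, :] / sigma ** 2
                            - diff_sel[:, :, None] * diff / sigma ** 4)

def kwng_direction(X, grad_L, jac_h, M=None, eps=1e-3, lam=0.0, seed=0):
    """Estimate of G_W(theta)^{-1} grad L(theta) following Proposition 5.
    X:      (N, d) samples X_n = h_theta(Z_n) from rho_theta
    grad_L: (q,)   estimate of the Euclidean gradient of the loss
    jac_h:  (N, d, q) Jacobians of theta -> h_theta(Z_n)  (used to form T)"""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    q = grad_L.shape[0]
    if M is None:
        M = min(N, int(np.ceil(d * np.sqrt(N))))     # heuristic rule from Section 4.1
    Y = X[rng.choice(N, size=M, replace=False)]      # Nystrom basis points
    idx = rng.integers(0, d, size=M)                 # one random derivative component each
    sigma = np.sqrt(np.mean(np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=2)))
    # ^ simple bandwidth choice; Section 3.3 describes the paper's heuristics
    C3 = gauss_kernel_cross_derivatives(Y, X, idx, sigma)       # (M, N, d)
    C = C3.reshape(M, N * d)                                    # C of (18)
    K = np.take_along_axis(gauss_kernel_cross_derivatives(Y, Y, idx, sigma),
                           idx[None, :, None], axis=2)[..., 0]  # K of (18)
    T = np.einsum('mni,nij->mj', C3, jac_h) / N      # Jacobian of tau(theta)
    D_inv = np.ones(q)                               # identity damping for simplicity;
                                                     # the paper uses a scale-sensitive D
    A = (T * D_inv) @ T.T + lam * eps * K + (eps / N) * (C @ C.T)
    alpha = np.linalg.solve(A, T @ (D_inv * grad_L))
    return (D_inv * grad_L - D_inv * (T.T @ alpha)) / eps

# Check on a location model h_theta(z) = theta + z, z ~ N(0, I_d): for this family the
# Wasserstein information matrix is the identity, so the exact WNG equals the Euclidean gradient.
rng = np.random.default_rng(1)
d, N = 3, 2000
theta = np.array([0.5, -1.0, 2.0])
X = theta + rng.normal(size=(N, d))
jac_h = np.tile(np.eye(d), (N, 1, 1))                # dh_theta(z)/dtheta = I for all z
grad_L = theta.copy()                                # gradient of L(theta) = 0.5 ||theta||^2
print(kwng_direction(X, grad_L, jac_h))              # should be close to grad_L
```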
4 EXPERIMENTS

This section presents an empirical evaluation of KWNG based on (19). Code for the experiments is available at https://github.com/MichaelArbel/KWNG.

4.1 CONVERGENCE ON SYNTHETIC MODELS

To empirically assess the accuracy of KWNG, we consider three choices for the parametric model PΘ: the multivariate normal model, the multivariate log-normal model, and uniform distributions on hyper-spheres. All have the advantage that the WNG can be computed in closed form (Chen and Li, 2018; Malagò et al., 2018). While the first two models admit a density, the third one does not; hence the Fisher natural gradient is not defined in this case. While this choice of models is essential to obtain closed-form expressions for the WNG, the proposed estimator is agnostic to the choice of family. We also assume we have access to the exact Euclidean gradient (EG), which is used to compute both the WNG and the KWNG.

Figure 1: Relative error of KWNG averaged over 100 runs for varying dimension from d=1 (yellow) to d=10 (dark red) for the hyper-sphere model. (a): box-plot of the relative error as d increases, with N=5000 and M=⌈d√N⌉. (b): relative error as the sample size N increases, with M=⌈d√N⌉. (c): relative error as M increases, with N=5000. A Gaussian kernel is used with a fixed bandwidth σ=1.

Figure 2: Left (a): training error per iteration for KWNG, WNG, and EG. Right (b): projection of the sequence of updates obtained using KWNG, WNG and EG along the first two PCA directions of the WNG trajectory. The dimension of the sample space is fixed to d=10. Exact values of the gradient are used for EG and WNG. For KWNG, N=128 samples and M=100 basis points are used. The regularization parameters are set to λ=0 and ϵ=10^{-10}. An optimal step-size γt is used: γt=0.1 for both KWNG and WNG, while γt=0.0001 for EG.

Figure 1 shows the evolution of the relative error w.r.t. the sample size N, the number of basis points M, and the dimension d in the case of the hyper-sphere model. As expected from the consistency results provided in Section 3.4, the relative error decreases as the sample size N increases. The behavior in the number of basis points M shows a clear threshold beyond which the estimator becomes consistent and where increasing M does not decrease the relative error anymore. This threshold increases with the dimension d, as discussed in Section 3.4. In practice, using the rule M=⌈d√N⌉ seems to be a good heuristic, as shown in Figure 1 (a). All these observations persist in the case of the normal and log-normal models, as shown in Figure 4 of Appendix D.1. In addition, we report in Figure 5 the sensitivity to the choice of the bandwidth σ, which shows robustness of the estimator over a wide range of σ. We also compare the optimization trajectory obtained using KWNG with the trajectories of both the exact WNG and EG in a simple setting: PΘ is the multivariate normal family and the loss function L(θ) is the squared Wasserstein-2 distance between ρθ and a fixed target distribution ρθ*. Figure 2 (a) shows the evolution of the loss function at every iteration. There is a clear advantage of using the WNG over EG, as larger step-sizes are allowed, leading to faster convergence. Moreover, KWNG maintains this property while being agnostic to the choice of the model. Figure 2 (b) shows the projected dynamics of the three methods along the two PCA directions of the WNG trajectory with highest variance. The dynamics of WNG are well approximated by those obtained using KWNG.
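For completeness, the loss used in this trajectory comparison, the squared Wasserstein-2 distance between multivariate Gaussians, has a well-known closed form; the short sketch below computes it (an illustration, not the authors' code; the exact WNG for this family comes from Chen and Li (2018) and Malagò et al. (2018) and is not re-derived here).

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, S1, m2, S2):
    """Closed-form squared Wasserstein-2 distance between N(m1, S1) and N(m2, S2):
    ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    r1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(r1 @ S2 @ r1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

d = 10
rng = np.random.default_rng(0)
A = rng.normal(size=(d, d))
S_target = A @ A.T + np.eye(d)
print(gaussian_w2_squared(np.zeros(d), np.eye(d), np.ones(d), S_target))
```

Gradient descent on this loss with respect to the parameters of ρθ gives the EG baseline of Figure 2, while WNG and KWNG precondition the same gradient.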
4.2 APPROXIMATE INVARIANCE TO PARAMETRIZATION

We now illustrate the approximate invariance to parametrization of KWNG and show its benefits for training deep neural networks when the model is ill-conditioned. We consider a classification task on two datasets, Cifar10 and Cifar100, with a Residual Network (He et al., 2015). To use the KWNG estimator, we view the input RGB image as a latent variable z with probability distribution ν and the output logits of the network x := hθ(z) as a sample from the model distribution ρθ ∈ PΘ, where θ denotes the weights of the network. The loss function L is given by

L(θ) := −∫ y(z)ᵀ log(SM(U hθ(z))) dν(z),

where SM is the softmax function, y(z) denotes the one-hot vector representing the class of the image z, and U is a fixed invertible diagonal matrix which controls how well the model is conditioned. We consider two cases: the well-conditioned case (WC), in which U is the identity, and the ill-conditioned case (IC), where U is chosen to have a condition number equal to 10^7. We compare the performance of the proposed method with several variants of SGD: plain SGD, SGD + Momentum, and SGD + Momentum + Weight decay. We also compare with Adam (Kingma and Ba, 2014), the KFAC optimizer (Martens and Grosse, 2015; Grosse and Martens, 2016) and eKFAC (George et al., 2018), which implements a fast approximation of the empirical Fisher natural gradient. We emphasize that gradient clipping by norm was used for all experiments and was crucial for stable optimization using KWNG. Details of the experiments are provided in Appendix D.2.

Figure 3: Test accuracy and training accuracy for classification on Cifar10 (top) and Cifar100 (bottom), in both the ill-conditioned case (left side) and well-conditioned case (right side), for different optimization methods (KWNG with different M, N, and momentum settings; SGD variants; Adam; KFAC; eKFAC). Results on Cifar10 are averaged over 5 independent runs, except for KFAC and eKFAC.

Figure 3 shows the training and test accuracy at each epoch on Cifar10 in both the (WC) and (IC) cases. While all methods achieve a similar test accuracy in the (WC) case on both datasets, methods based on the Euclidean gradient suffer a drastic drop in performance in the (IC) case. This does not happen for KWNG (red line), which achieves a similar test accuracy as in the (WC) case. Moreover, a speed-up in convergence in number of iterations can be obtained by increasing the number of basis points M (brown line). The time cost is also in favor of KWNG (Figure 6). On Cifar100, KWNG is also less affected by the ill-conditioning, albeit to a lower extent. Indeed, the larger number of classes in Cifar100 makes the estimation of KWNG harder, as discussed in Section 4.1. In this case, increasing the batch-size can substantially improve the training accuracy (pink line).
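The ill-conditioned objective of this section can be written down in a few lines. The PyTorch sketch below is illustrative only: the log-spaced construction of U and the tiny linear stand-in network are assumptions (the paper uses a ResNet and only specifies that U is diagonal with condition number 10^7), and the clipping threshold is arbitrary.

```python
import math
import torch
import torch.nn.functional as F

def make_conditioning_matrix(num_classes, condition_number=1e7):
    # Log-spaced diagonal from 1 down to 1/condition_number (an illustrative choice;
    # the paper only states that U is diagonal with condition number 10^7).
    diag = torch.logspace(0.0, -math.log10(condition_number), num_classes)
    return torch.diag(diag)

def ill_conditioned_loss(logits, targets, U):
    """Cross-entropy on the rescaled logits U h_theta(z), i.e. the empirical version of
    L(theta) = -E_nu[ y(z)^T log softmax(U h_theta(z)) ] from Section 4.2."""
    return F.cross_entropy(logits @ U.T, targets)

# Minimal usage with a stand-in network (the paper uses a ResNet on Cifar10/100).
num_classes, batch = 10, 8
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, num_classes))
U = make_conditioning_matrix(num_classes)            # (IC) case; U = identity gives (WC)
x = torch.randn(batch, 3, 32, 32)
y = torch.randint(0, num_classes, (batch,))
loss = ill_conditioned_loss(net(x), y, U)
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=5.0)  # gradient clipping by norm,
                                                                # used in all experiments
```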
Moreover, methods that are used to improve optimization using the Euclidean gradient can also be used for KWNG. For instance, using Momentum leads to an improved performance in the (WC) case (grey line). Interestingly, KFAC seems to also suffer a drop in performance in the (IC) case. This might result from the use of an isotropic damping term D(θ) = I which would be harmful in this case. We also observe a drop in performance when a different choice of damping is used for KWNG. More importantly, using only a diagonal pre-conditioning of the gradient doesn t match the performance of KWNG (Figure 7). ACKNOWLEDGEMENT GM has received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement no 757983). Published as a conference paper at ICLR 2020 REFERENCES Amari, S. and Cichocki, A. (2010). Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(No 1):183 195. Amari, S.-i. (1985). Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics. Springer-Verlag, New York. Amari, S.-i. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251 276. Ambrosio, L., Gigli, N., and Savaré, G. (2004). Gradient flows with metric and differentiable structures, and applications to the Wasserstein space. Atti della Accademia Nazionale dei Lincei. Classe di Scienze Fisiche, Matematiche e Naturali. Rendiconti Lincei. Matematica e Applicazioni, 15(3-4):327 343. Arbel, M. and Gretton, A. (2017). Kernel Conditional Exponential Family. ar Xiv:1711.05363 [stat]. ar Xiv: 1711.05363. Arbel, M., Korba, A., Salim, A., and Gretton, A. (2019). Maximum Mean Discrepancy Gradient Flow. ar Xiv:1906.04370 [cs, stat]. ar Xiv: 1906.04370. Benamou, J.-D. and Brenier, Y. (2000). A computational fluid mechanics solution to the Monge Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375 393. Bernacchia, A., Lengyel, M., and Hennequin, G. (2018). Exact natural gradient in deep linear networks and its application to the nonlinear case. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 5941 5950. Curran Associates, Inc. Chen, Y. and Li, W. (2018). Natural gradient in Wasserstein statistical manifold. ar Xiv:1805.08380 [cs, math]. ar Xiv: 1805.08380. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121 2159. Ekeland, I. and Témam, R. (1999). Convex Analysis and Variational Problems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. Fisher, R. A. and Russell, E. J. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594-604):309 368. Garreau, D., Jitkrittum, W., and Kanagawa, M. (2018). Large sample analysis of the median heuristic. ar Xiv:1707.07269 [math, stat]. ar Xiv: 1707.07269. George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vincent, P. (2018). Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis. ar Xiv:1806.03884 [cs, stat]. ar Xiv: 1806.03884. Grosse, R. and Martens, J. (2016). A Kronecker-factored Approximate Fisher Matrix for Convolution Layers. 
In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, pages 573 582. JMLR.org. event-place: New York, NY, USA. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770 778. Heskes, T. (2000). On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation, 12(4):881 901. Hinton, G., Srivastava, N., and Swersky, K. (2012). Lecture 6a overview of mini batch gradient descent. Holbrook, A., Lan, S., Streets, J., and Shahbaba, B. (2017). The nonparametric Fisher geometry and the chi-square process density prior. ar Xiv:1707.03117 [stat]. ar Xiv: 1707.03117. Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pages 448 456. JMLR.org. event-place: Lille, France. Published as a conference paper at ICLR 2020 Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. ar Xiv:1412.6980 [cs]. ar Xiv: 1412.6980. Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational Dropout and the Local Reparameterization Trick. Ar Xiv, abs/1506.02557. ar Xiv: 1506.02557. Klenke, A. (2008). Probability Theory: A Comprehensive Course. World Publishing Corporation. Lafferty, J. and Wasserman, L. (2008). Rodeo: sparse, greedy nonparametric regression. The Annals of Statistics, 36(1):28 63. Le Cun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient Back Prop. In Montavon, G., Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade: Second Edition, Lecture Notes in Computer Science, pages 9 48. Springer Berlin Heidelberg, Berlin, Heidelberg. Li, W. (2018). Geometry of probability simplex via optimal transport. ar Xiv:1803.06360 [math]. ar Xiv: 1803.06360. Li, W., Lin, A. T., and Montufar, G. (2019). Affine natural proximal learning. Li, W. and Montufar, G. (2018a). Natural gradient via optimal transport. ar Xiv:1803.07033 [cs, math]. ar Xiv: 1803.07033. Li, W. and Montufar, G. (2018b). Ricci curvature for parametric statistics via optimal transport. ar Xiv:1807.07095 [cs, math, stat]. ar Xiv: 1807.07095. Liang, T. and Rakhlin, A. (2019). Just Interpolate: Kernel "Ridgeless" Regression Can Generalize. ar Xiv:1808.00387 [cs, math, stat]. ar Xiv: 1808.00387. Malagò, L., Montrucchio, L., and Pistone, G. (2018). Wasserstein Riemannian Geometry of Positive Definite Matrices. ar Xiv:1801.09269 [math, stat]. ar Xiv: 1801.09269. Martens, J. (2010). Deep Learning via Hessian-free Optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 10, pages 735 742, USA. Omnipress. event-place: Haifa, Israel. Martens, J. and Grosse, R. (2015). Optimizing Neural Networks with Kronecker-factored Approximate Curvature. ar Xiv:1503.05671 [cs, stat]. ar Xiv: 1503.05671. Martens, J. and Sutskever, I. (2012). Training Deep and Recurrent Networks with Hessian-Free Optimization. In Montavon, G., Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade: Second Edition, Lecture Notes in Computer Science, pages 479 535. Springer Berlin Heidelberg, Berlin, Heidelberg. Mishkin, A., Kunstner, F., Nielsen, D., Schmidt, M., and Khan, M. E. (2018). 
SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient. ar Xiv:1811.04504 [cs, stat]. ar Xiv: 1811.04504. Modin, K. (2017). Geometry of Matrix Decompositions Seen Through Optimal Transport and Information Geometry. Journal of Geometric Mechanics, 9(3):335 390. ar Xiv: 1601.01875. Mroueh, Y., Sercu, T., and Raj, A. (2019). Sobolev Descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2976 2985. Ollivier, Y., Arnold, L., Auger, A., and Hansen, N. (2011). Information-geometric optimization algorithms: A unifying picture via invariance principles. J. Mach. Learn. Res., 18:18:1 18:65. Otto, F. (2001). The Geometry of Dissipative Evolution Equations: The Porous Medium Equation. Communications in Partial Differential Equations, 26(1-2):101 174. Otto, F. and Villani, C. (2000). Generalization of an Inequality by Talagrand and Links with the Logarithmic Sobolev Inequality. Journal of Functional Analysis, 173(2):361 400. Published as a conference paper at ICLR 2020 Rao, C. R. (1992). Information and the Accuracy Attainable in the Estimation of Statistical Parameters. In Kotz, S. and Johnson, N. L., editors, Breakthroughs in Statistics: Foundations and Basic Theory, Springer Series in Statistics, pages 235 247. Springer New York, New York, NY. Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. Salimans, T. and Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. ar Xiv:1602.07868 [cs]. ar Xiv: 1602.07868. Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In Helmbold, D. and Williamson, B., editors, Computational Learning Theory, pages 416 426, Berlin, Heidelberg. Springer Berlin Heidelberg. Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ar Xiv:1409.1556 [cs]. ar Xiv: 1409.1556. Sriperumbudur, B., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2017). Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research. Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition. Sutherland, D. J., Strathmann, H., Arbel, M., and Gretton, A. (2017). Efficient and principled score estimation. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139 1147. Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Soc. Google-Books-ID: R_n Wqjq89o EC. Villani, C. (2009). Optimal transport: Old and new. Technical report. Published as a conference paper at ICLR 2020 A PRELIMINARIES A.1 NOTATION We recall that Ωis an open subset of Rd while Θ is an open subset of parameters in Rq. Let Z Rp be a latent space endowed with a probability distribution ν over Z. Additionally, (θ,z)7 hθ(z) Ωis a function defined over Θ Z. We consider a parametric set of probability distributions PΘ over Ωdefined as the implicit model: PΘ:={ρθ :=(hθ)#ν ; θ Θ}, where by definition, ρθ =(hθ)#ν means that any sample x from ρθ can be written as x=hθ(z) where z is a sample from ν. We will write B to denote the jacobian of hθ w.r.t. θ viewed as a linear map from Rq to L2(ν)d without explicit reference to θ: Bu(z)= hθ(z).u; u Rq. 
As in the main text, L:Θ R is a loss functions which is assumed to be of the form L=F(ρθ), with F being a real valued functional over the set of probability distributions. L(θ) denotes the euclidean gradient of L w.r.t θ while L(θ) is an estimator of L(θ) using N samples from ρθ. We also consider a Reproducing Kernel Hilbert Space H of functions defined over Ωwith inner product .,. H and norm . H and with a kernel k : Ω Ω R. The reproducing property for the derivatives (Steinwart and Christmann, 2008, Lemma 4.34) will be important: if(x)= f, ik(x,.) H for all x Ω. It holds as long as k is differentiable. C b (Ω) denotes the space of smooth bounded real valued functions on Ω, and C c (Ω) C b (Ω) denotes the subset of compactly supported functions. For any measured space Z with probability distribution ν, we denote by L2(ν) the space of real valued and square integrable functions under ν and by L2(ν)d the space of square integrable vector valued functions under ν and with values in Rd. A.2 ASSUMPTIONS We make the following set of assumptions: (A) Ωis a non-empty open subset of Rd. (B) There exists positive constants ζ and σ such that R z pdν(z) 1 2p!ζp 2σ2 for any p 2. (C) For all θ Θ there exists C(θ) such that θhθ(z) C(θ)(1+ z ) for all z Z. (D) k is twice continuously differentiable on Ω Ω. (E) For all θ Θ it holds that R i i+dk(x,x)dpθ(x)< for all 1 i d. (F) The following quantity is finite: κ2=sup x Ω 1 i q i i+qk(x,x). (G) For all 0 δ 1, it holds with probability at least 1 δ that L(θ) L(θ) N 1 Remark 1. Assumption (G) holds if for instance L(θ) can be written as an empirical mean of i.i.d. terms with finite variance: i=1 θl(hθ(Zi)) where Zi are i.i.d. samples from the latent distribution ν where R θl(hθ(z))dν(z)=L(θ). This is often the case in the problems considered in machine-learning. In, this case, the sum of variances of the vector L(θ) along its coordinates satisfies: Z L(θ) L(θ) 2dν(z)= 1 Z θl(hθ(z)) 2dν(z):= 1 One can then conclude using Cauchy-Schwarz inequality followed by Chebychev s inequality that with probability 1 δ: L(θ) L(θ) σ Moreover, Assumption (C) is often satisfied when the implicit model is chosen to be a deep networks with Re LU non-linearity. Published as a conference paper at ICLR 2020 A.3 OPERATORS DEFINITION Differential operators. We introduce the linear L operator and its adjoint L : L:H L2(ν)d L :L2(ν)d H f 7 ( if hθ)1 i d v7 Z d X i=1 ik(hθ(z),.)vi(z)dν(z) This allows to obtain the linear operator A defined in Assumption 2 in the main text by composition A:=L L. We recall here another expression for A in terms of outer product and its regularized version for a given λ>0, i=1 ik(hθ(z),.) ik(hθ(z),.)dν(z) Aλ:=A+λI. It is easy to see that A is a symmetric positive operator. Moreover, it was established in Sriperumbudur et al. (2017) that A is also a compact operator under Assumption (E). Assume now we have access to N samples (Zn)1 n N as in the main text. We define the following objects: i=1 ik(hθ(Zn),.) ik(hθ(Zn),.), ˆAλ:= ˆA+λI. Furthermore, if v is a continuous function in L2(ν)d, then we can also consider an empirical estimator for L v: i=1 ik(hθ(Zn),.)vi(Zn). Subsampling operators. We consider the operator QM defined from H to RM by: m=1 em imk(Ym,.) (21) where (em)1 m M is an orthonormal basis of RM. QM admits a singular value decomposition of the form QM =UΣV , with V V :=PM being the orthogonal projection operator on the Nyström subspace HM. Similarly to Rudi et al. (2015); Sutherland et al. 
(2017), we define the projected inverse function GM(C) as: GM(C)=V (V CV ) 1V . We recall here some properties of GM from (Sutherland et al., 2017, Lemma 1): Lemma 8. Let A:H H be a positive operator, and define Aλ =A+λI for any λ>0. The following holds: 1. GM(A)PM =GM(A) 2. PMGM(A)=GM(A) 3. GM(Aλ)AλPM =PM 4. GM(Aλ)=(PMAPM +λI) 1PM 1 2 λGM(Aλ)A Estimators of the Wasserstein information matrix. Here we would like to express the estimator in Proposition 5 in terms of the operators introduced previously. We have the following proposition: Proposition 9. The estimator defined in Proposition 5 admits the following representation: ÿ WL(θ)=(ϵD(θ)+GM,N) 1 L(θ) where GM,N is given by: GM,N :=( L B) GM(ˆAλ) L B. Published as a conference paper at ICLR 2020 Proof. This a direct consequence of the minimax theorem (Ekeland and Témam, 1999, Proposition 2.3, Chapter VI) and applying (Sutherland et al., 2017, Lemma 3). The matrix GM,N is in fact an estimator of the Wasserstein information matrix defined in Definition 2. We will also need to consider the following population version of GM,N defined as : GM :=(L B) GM(Aλ)L B (22) B BACKGROUND IN INFORMATION GEOMETRY B.1 FISHER-RAO STATISTICAL MANIFOLD In this section we briefly introduce the non-parametric Fisher-Rao metric defined over the set P of probability distributions with positive density. More details can be found in Holbrook et al. (2017). By abuse of notation, an element ρ P will be identified with its density which will also be denoted by ρ. Consider Tρ, the set of real valued functions f defined over Ωand satisfying Z f(x)2 ρ(x) dx< ; Z f(x)ρ(x)dx=0. We have the following definition for the Fisher-Rao metric: Definition 4 (Fisher-Rao metric). The Fisher-Rao metric g F is defined for all ρ P as an inner product over Tρ of the form: g F ρ (f,g):= Z 1 ρ(x)f(x)g(x)dx, f,g Tρ Note that the choice of the set Tρ is different from the one considered in Holbrook et al. (2017) which replaces the integrability condition by a smoothness one. In fact, it can be shown that these choices result in the same metric by a density argument. B.2 WASSERSTEIN STATISTICAL MANIFOLD In this section we review the theory of Wasserstein statistical manifold introduced in Li and Montufar (2018a); Chen and Li (2018). By analogy to the Fisher-Rao metric which allows to endow the parametric model PΘ with the structure of a Riemannian manifold, it is also possible to use a different metric that is derived from the Wasserstein 2 distance. We first start by briefly introducing the Wasserstein 2 distance. Given two probability distributions ρ and ρ , we consider the set of all joint probability distributions Π(ρ,ρ ) between ρ and ρ usually called the set of couplings between ρ and ρ . Any coupling π defines a way of transporting mass from ρ to ρ . The cost of such transport can be measured as the expected distance between an element of mass of ρ at location x that is mapped to an element of mass of ρ at location y using the coupling π: Z x y 2dπ(x,y) The squared Wasserstein 2 distance between ρ and ρ is defined as the smallest transport cost over all possible couplings: W 2 2 (ρ,ρ )= inf π Π(ρ,ρ ) Z x y 2dπ(x,y). A dynamical formulation of W2 was provided by the celebrated Benamou-Brenier formula in Benamou and Brenier (2000): W 2 2 (ρ,ρ )=inf φt Z φl(x) 2dρl(x)dl where the infimum is taken over the set of vector fields φ : [0,1] Ω Rd. 
Each vector field of the potential determines a corresponding probability distribution ρl as the solution of the continuity equation: lρl+div(ρlφl)=0, ρ0=ρ,ρ1=ρ . (23) Published as a conference paper at ICLR 2020 When Ωis a compact set, a Neumann condition is added on the boundary of Ωto ensure that the total mass is conserved. Such formulation suggests that W2(ρ,ρ ) corresponds in fact to the shortest path from ρ to ρ . Indeed, given a path ρl from ρ to ρ , the infinitesimal displacement direction is given by the distribution lρl. The length | lρl| of this direction is measured by: | lρl|2 := R φl(x) 2dρl(x) Hence, W 2 2 (ρ,ρ ) can be written as: W 2 2 (ρρ )=inf ρl 0 | lρl|2dl. In fact, lρl can be seen as an element in the tangent space Tρl P2 to P2 at point ρl. To ensure that (23) is well defined, TρP2 can be defined as the set of distributions σ satisfying σ(1)=0. |σ(f)| C f L2(ρ), f C c (Ω) (24) for some positive constant C. Indeed, the condition in (24) guarantees the existence of a vector field φσ that is a solution to the PDE: σ= div(ρφσ). Moreover, | lρl|2 can be seen as an inner product of lρl with itself in Tρl P2. This inner product defines in turn a metric tensor g W on P2 called the Wasserstein metric tensor (see Otto and Villani (2000); Ambrosio et al. (2004)): Definition 5. The Wasserstein metric g W is defined for all ρ P2 as the inner product over TρP2 of the form: g W ρ (σ,σ ):= Z φσ(x) φσ (x)dρ(x), σ,σ TρP2 where φσ and φσ are solutions to the partial differential equations: σ= div(ρφσ), σ = div(ρφσ ). Moreover, φσ and φσ are required to be in the closure of gradient of smooth and compactly supported functions w.r.t. L2(ρ)d. Definition 5 allows to endow P2 with a formal Riemannian structure with W2 being its geodesic distance: W 2 2 (ρ,ρ )=inf ρl 0 gρl( lρl, lρl)dl. B.3 NEGATIVE SOBOLEV DISTANCE AND LINEARIZATION OF THE WASSERSTEIN DISTANCE To device the Wasserstein natural gradient, one can exploit a Taylor expansion of W which is given in terms of the Negative Sobolev distance ρθ+u ρθ H 1(ρθ) as done in Mroueh et al. (2019): W 2 2 (ρθ,ρθ+u)= ρθ+u ρθ 2 H 1(ρθ)+o( u 2). Further performing a Taylor expansion of ρθ+u ρθ H 1(ρθ) in u leads to a quadratic term u GW(θt)u where we call GW(θt) the Wasserstein information matrix. This two steps approach is convenient conceptually and allows to use the dual formulation of the Negative Sobolev distance to get an estimate of the quadratic term u GW(θt)u using kernel methods as proposed in Mroueh et al. (2019) for learning non-parametric models. However, with such approach, ρθ+u ρθ H 1(ρθ) needs to be well defined for u small enough. This requirement does not exploit the parametric nature of the problem and can be restrictive if ρθ+u and ρθ do not share the same support as we discuss now. As shown in (Villani, 2003, Theorem 7.26) and discussed in (Arbel et al., 2019, Proposition 17 and 18), the Wasserstein distance between two probability distributions ρ and ρ admits a first order expansion in terms of the Negative Sobolev Distance: lim ϵ 0 1 ϵW2(ρ,ρ+ϵ(ρ ρ))= ρ ρ H 1(ρ) when ρ admits a bounded density w.r.t. ρ. When such assumption fails to hold, there are cases when this first order expansion is no longer available. For instance, in the simple case when the parametric family consists of dirac distributions δθ located at a value θ, the Wasserstein distance admits a closed form expression of the form: W2(δθ,δθ+ϵ(δθ δθ))= ϵ θ θ Published as a conference paper at ICLR 2020 ϵW2(δθ,δθ +ϵ(δθ δθ)) diverges to infinity. 
One can consider a different perturbation of the model δθ+ϵu for some vector u which the one we are interested in here. In this case, the Wasserstein distance admits a well-defined asymptotic behavior: lim ϵ 1 ϵW2(δθ,δθ+ϵu)= u . On the other hand the Negative Sobolev Distance is infinite for any value of ϵ. To see this, we consider its dual formulation as in Mroueh et al. (2019): 1 2 ρ ρ H 1(ρ)= sup f C c (Ω) R f(x)dρ(x)=0 Z f(x)dρ(x) Z f(x)dρ (x) 1 Z xf(x) 2dρ(x) Evaluating this expression for δθ and δθ+ϵu for any value ϵ>0 and for any u that is non zero, on has: 1 2ϵ δθ δθ+ϵu H 1(δθ)= sup f C c (Ω) f(θ)=0 1 ϵ(f(θ) f(θ+ϵu)) 1 2 xf(θ) 2 (25) One can always find a function f such that f(θ)=0, f(θ)=0 and f(θ+ϵu) can be arbitrarily large, thus the Negative Sobolev distance is infinite. This is not the case of the metric u GW(θ)u which can be computed in closed form: 1 2u GW(θ)u= sup f C c (Ω) f(θ)=0 2 xf(θ) 2 (26) In this case, choosing f(θ)=0 and f(θ)=u achieves the supremum which is simply given by 1 2 u 2. Equation (26) can be seen as a limit case of (25) when ϵ 0: 1 2u GW(θ)u:= sup f C c (Ω) f(θ)=0 lim ϵ 0 1 ϵ(f(θ) f(θ+ϵu)) 1 However, the order between the supremum and the limit cannot be exchanged in this case, which makes the two objects behave very differently in the case of singular probability distributions. C.1 PRELIMINARY RESULTS Here we provide a proof of the invariance properties of the Fisher and Wasserstein natural gradient descent in the continuous-time limit as stated in Proposition 1. Consider an invertible and smoothly differentiable re-parametrization Ψ, satisfying ψ = Ψ(θ). Denote by ρψ = ρΨ 1(ψ) the re-parametrized model and GW(ψ) and GF(ψ) their corresponding Wasserstein and Fisher information matrices whenever they are well defined. Proof of Proposition 1. Here we only consider the case when DL(θ) is either given by the Fisher natural gradient FL(θ) or the Wasserstein Natural gradient WL(θ). We will first define eψs := Ψ(θs) and show that in fact eψs=ψs at all times s>0. First, let s differentiate eψs in time: eψs= θΨ(θs) GD(θs) 1 θL(θs) By the chain rule, we have that θL(θs)= θΨ(θs) ψ L(eψs), hence: eψs= θΨ(θs) GD(θs) 1 θΨ(θs) ψ L(eψs). It is easy to see that θΨ 1(eψs) = ( ψΨ 1(ψs)) 1 by definition of Ψ and eψs, hence by Lemma 10 one can conclude that: eψs= GD(eψs) 1 ψ L(eψs). Hence, eψs satisfies the same differential equation as ψs. Now keeping in mind that ψ0 = eψ0 = Ψ(θ0), it follows that ψ0= eψ0=Ψ(θ0) by uniqueness of differential equations. Published as a conference paper at ICLR 2020 Lemma 10. Under conditions of Propositions 2 and 3, the informations matrices GW(ψ) and GF(ψ) are related to GW(θ) and GF(θ) by the relation: GW(ψ)= ψΨ 1(ψ) GW(θ) ψΨ 1(ψ) GF(ψ)= ψΨ 1(ψ) GF(θ) ψΨ 1(ψ) Proof. Let v Rq and write u = θΨ 1(ψ)v, then by the dual formulations of GW(θ) and GF(θ) in Proposition 2 we have that: 1 2v ψΨ 1(ψ) GF(θ) ψΨ 1(ψ)v= sup f C c (Ω) R f(x)dρθ(x)=0 ρθ(f) θΨ 1(ψ)v 1 Z f(x)2dρθ(x)dx, Now recalling that ψ ρψ = θρθ ψΨ 1(ψ) by Lemma 11, it follows that: 1 2v ψΨ 1(ψ) GF(θ) ψΨ 1(ψ)v= sup f C c (Ω) R f(x)dρθ(x)=0 ψ ρψ(f) v 1 Z f(x)2dρθ(x)dx, Using again Proposition 2 for the reparametrized model ρψ, we directly have that: 1 2v GF(ψ)v= sup f C c (Ω) R f(x)dρθ(x)=0 ψ ρψ(f) v 1 Z f(x)2dρθ(x)dx, The result follows by equating both expression. The same procedure can be applied for the case the Wasserstein information matrix using Proposition 3. Lemma 11. The distributional gradients ψ ρψ and θρθ are related by the expression: ψ ρψ = θρθ ψΨ 1(ψ) Proof. 
The proof follows by considering a fixed direction u Rq and a test function f C c (Ω) and the definition of distributional gradient in Definition 3: ρψ(f) u= lim ϵ 0 1 ϵ Z f(x)d ρψ+ϵu(x) Z f(x)d ρψ(x) = lim ϵ 0 1 ϵ Z f(x)dρΨ 1(ψ+ϵu)(x) Z f(x)dρΨ 1(ψ)(x) Now by differentiability of Ψ 1, we have the following first order expansion: Ψ 1(ψ+ϵu)=Ψ 1(ψ)+ϵ Ψ 1(ψ) u+ϵυ(ϵ) where υ(ϵ) converges to 0 when ϵ 0. Now using again the definition Definition 3 for ρΨψ one has: ÅZ f(x)dρΨ 1(ψ+ϵu)(x) Z f(x)dρΨ 1(ψ)(x) ã = ρΨ 1(ψ)(f) Ψ 1(ψ) u +ϵυ(ϵ)+δ(ϵ,f,( Ψ 1(ψ)u+υ(ϵ))) The last two terms converge to 0 as ϵ 0, hence leading to the desired expression. Proposition 12. When ρθ admits a density that is continuously differentiable w.r.t θ and such that x7 ρθ(x) is continuous, then the distributional gradient is of the form: ρθ(f)= Z f(x) ρθ(x)dx, f C c (Ω) where ρθ(x) denotes the gradient of the density of ρθ(x) at x. Published as a conference paper at ICLR 2020 Proof. Let ϵ>0 and u Rq, we define the function ν(ϵ,u,f) as follows: ν(ϵ,u,f)= Z f(x) Å1 ϵ(ρθ+ϵu) ρθ ρ θ u ã dx we just need to show that ν(ϵ,u,f) 0 as ϵ 0. This follows form the differentiation lemma (Klenke, 2008, Theorem 6.28) applied to the function (θ,x) 7 f(x)ρθ(x). Indeed, this function is integrable in x for any θ in a neighborhood U of θ that is small enough, it is also differentiable on that neighborhood U and satisfies the domination inequality: |f(x) ρθ(x) u| |f(x)| sup x Supp(f),θ U ρθ(x) u|. The inequality follows from continuity of (θ,x) ρθ(x) and recalling that f is compactly supported. This concludes the proof. We fist provide a proof of the dual formulation for the Fisher information matrix. Proof of Proposition 2 . Consider the optimization problem: sup f C c (Ω) R f(x)dρθ(x)=0 ÅZ f(x) ρθ(x)dx ã u 1 Z f(x)2ρθ(x)dx (27) Recalling that the set of smooth and compactly supported functions C c ( ) is dense in L2(ρθ) and that the objective function in (27) is continuous and coercive in f, it follows that (27) admits a unique solution f in L2(ρθ) which satisfies the optimality condition: Z f(x)( ρθ(x)) udx= Z f(x)f (x)ρθ(x)dx f L2(ρθ) Hence, it is easy to see that f =( ρθ) u/ρθ and that the optimal value of (27) is given by: Z (( ρθ(x)) u)2 This is equal to u GF(θ)u by Definition 1. The next proposition ensures that the Wasserstein information matrix defined in Definition 2 is well-defined and has a dual formulation. Proposition 13. Consider the model defined in (6) and let (es)1 s q be an orthonormal basis of Rq. Under Assumptions (B) and (C), there exists an optimal solution Φ = (φs)1 s q with φs in L2(ρθ)d satisfying the PDE: sρθ = div(ρθφs) The elliptic equations also imply that L hθ =L (Φ hθ). Moreover, the Wasserstein information matrix GW(θ) on PΘ at point θ can be written as GW(θ) = Φ Φ where the inner-product is in L2(ρθ)d and satisfies: 1 2u GW(θ)u= sup f C c (Ω) R f(x)dρθ(x)=0 Z xf(hθ(z)) 2dν(z). for all u Rq. Proof. Let (es)1 s q be an orthonormal basis of Rq. For all 1 s q, we will establish the existence of an optimal solution φs in L2(ρθ)d satisfying the PDE: sρθ = div(ρθφs) (28) Published as a conference paper at ICLR 2020 Consider the variational problem: Z φ(hθ(z)) θshθ(z) 1 2 φ 2 L2(ρθ) (29) where S is a Hilbert space obtained as the closure in L2(ρθ)d of functions of the form φ = xf with f C c (Ω): S :={ xf | f C c (Ω)}L2(ρd θ). We have by Assumption (C) that: Z φ(hθ(z)) θshθ(z)dν(z) C(θ) Z (1+ z 2)dν(z) Z φ L2(ρθ). Moreover, by Assumption (B), we know that R (1+ z 2)dν(z)< . 
This implies that the objective in (29) is continuous in φ while also being convex.s It follows that (29) admits a unique solution φ s S which satisfies for all φ S: Z φ(hθ(z)) φ s(hθ(z))dν(z)= Z φ(hθ(z)) θshθ(z))dν(z) In particular, for any f C c (Ω), it holds that: Z xf(hθ(z)) φ s(hθ(z))dν(z)= Z xf(hθ(z)) θshθ(z))dν(z) which is equivalent to (28) and implies directly that LT hθ = LTΦ hθ where Φ := (φ s)1 s q. The variational expression for 1 2u GWu follows by noting that (29) admits the same optimal value as sup f C c (Ω) R f(x)dρθ(x)=0 Z xf(hθ(z)) 2dν(z). That is because S is by definition the closure in L2(ρθ)d of the set of gradients of smooth and compactly supported functions on Ω. Proof of Proposition 3. This is a consequence of Proposition 13. C.2 EXPRESSION OF THE ESTIMATOR We provide here a proof of Proposition 5 Proof of Proposition 5. Here, to simplify notations, we simply write D instead of D(θ). First consider the following optimization problem: inf f HM 1 N n=1 f(Xn) 2+λ f 2 H+ 1 ϵR(f) D 1R(f)+2 ϵR(f) D 1 L(θ) with R(f) given by R(f)= 1 N PN n=1 f(Xn) B(Zn). Now, recalling that any f HM can be written as f =PM m=1αm imk(Ym,.), and using the reproducing property if(x)= f, ik(x,.) H (Steinwart and Christmann, 2008, Lemma 4.34) , it is easy to see that: n=1 f(Xn) 2= 1 1 n N 1 i d m=1 αm im i+dk(Ym,Xn))2. 1 m,m M αmαm im im +dk(Ym,Ym ) 1 n N 1 i d 1 m M αm im i+dk(Ym,Xn)Bi(Zn) Published as a conference paper at ICLR 2020 The above can be expressed in matrix form using the matrices defined in Proposition 5: n=1 f(Xn) 2=α CC α; f 2 H=α Kα; R(f)=α CB. Hence the optimal solution ˆf is of the form ˆf =PM m=1α m imk(Ym,.), with α obtained as a solution to the finite dimensional problem in RM: min α RMα (ϵCC +ϵλK+CBD 1B C )α+2α CBD 1 L(θ) It is easy to see that α are given by: α = (ϵCCT +ϵλK+CBD 1BTCT) CBD 1 L(θ). Now recall that the estimator in Proposition 5 is given by: ÿ WL(θ) = 1 ϵD 1Uθ(ˆf ). Hence, 1 ϵD 1( L(θ) BTCTα ) The desired expression is obtained by noting that CB=T using the chain rule. C.3 CONSISTENCY RESULTS Well-specified case. Here, we assume that the vector valued functions (φi)1 i q involved in Definition 2 can be expressed as gradients of functions in H. More precisely: Assumption 2. For all 1 i q, there exits functions fi H such that φi = fi. Additionally, fi are of the form fi=Aαvi for some fixed α 0, with vi H and A being the differential covariance operator defined on H by A:f 7 R Pd i=1 ik(hθ(z),.) if(hθ(z))dν(z). The parameter α characterizes the smoothness of fi and therefore controls the statistical complexity of the estimation problem. Using a similar analysis as Sutherland et al. (2017) we obtain a convergence rate for the estimator in Proposition 5 the following convergence rates for the estimator in Proposition 5: Theorem 14. Let δ be such that 0 δ 1 and b := min(1, α + 1 2). Under Assumption 2 and Assumptions (A) to (G) listed in Appendix A.2, for N large enough, M (d N 1 2b+1log(N)), λ N 1 2b+1 and ϵ N b 2b+1, it holds with probability at least 1 δ that: ÿ WL(θ) WL(θ) 2=O N 2b 2b+1 . In the worst case where α=0, the proposed estimator needs at most M (d Nlog(N)) to achieve a convergence rate of N 1 2 . The smoothest case requires only M (d N 1 3log(N)) to achieve a rate of N 2 3 . Thus, the proposed estimator enjoys the same statistical properties as the ones proposed by Sriperumbudur et al. (2017); Sutherland et al. 
(2017) while maintaining a computational advantage1 t Now we provide a proof for Theorem 14 which relies on the same techniques used by Rudi et al. (2015); Sutherland et al. (2017). Proof of Theorem 14 . The proof is a direct consequence of Proposition 15 under Assumption 2. Proof of Theorem 7. The proof is a direct consequence of Proposition 15 under Assumption 1. Proposition 15. Under Assumptions (A) to (G) and for 0 δ 1 and N large enough, it holds with probability at least 1 δ: WL WL =O(N b 2b+1) provided that M d N 1 2b+1 log N, λ N 1 2b+1 and ϵ N b 2b+1 where b := min(1,α + 1 2) when Assumption 2 holds and b= 1 2+c when Assumption 1 holds instead. 1 The estimator proposed by Sutherland et al. (2017) also requires M to grow linearly with the dimension d although such dependence doesn t appear explicitly in the statement of Sutherland et al. 2017, Theorem 2. Published as a conference paper at ICLR 2020 Proof. Here for simplicity we assume that D(θ) = I without loss of generality and we omit the dependence in θ and write WL and L instead of WL(θ) and L(θ) and WL(θ). We also define ˆGϵ = ϵI +GM,N and Gϵ = ϵI +GW. By Proposition 9, we know that WL = ˆG 1 ϵ d L. We use the following decomposition: WL WL ˆG 1 ϵ (d L L) + ˆG 1 ϵ (GM,N GW)G 1 W L +ϵ ˆG 1 ϵ G 1 W L To control the norm of ˆG 1 ϵ we write ˆG 1 ϵ = G 1 2 ϵ (H + I) 1G 1 2 ϵ , where H is given by 2 ϵ (GM,N GW)G 1 2 ϵ . Hence, provided that µ := λmax(H), the highest eigenvalue of H, is smaller than 1, it holds that: (H+I) 1 (1 µ) 1. Moreover, since GW is positive definite, its smallest eigenvalue η is strictly positive. Hence, G 1 ϵ (η+ϵ) 1. Therefore, we have ˆG 1 ϵ (η+ϵ)(1 µ)) 1, which implies: WL WL (η+ϵ) 1 Ç d L L 1 µ +η 1 L GM,N GW +ϵη 1 L Let 0 δ 1. We have by Assumption (G) that d L L = O(N 1 2) with probability at least 1 δ. Similarly, by Proposition 16 and for N large enough, we have with probability at least 1 δ that GM,N GW =O(N b 2b+1) where b is defined in Proposition 16. Moreover, for N large enough, one can ensure that µ 1 2 so that the following bound holds with probability at least 1 δ: WL WL (η+ϵ) 1Ä 2N 1 2 +η 1 L (N b 2b+1 +ϵ) ä . Thus by setting ϵ N b 2b+1 we get the desired convergence rate. Proposition 16. For any 0 δ 1, we have with probability as least 1 δ and for N large enough that: GM,N GW =O(N b 2b+1). provided that M d N 1 2b+1 log N where b:=min(1,α+ 1 2) when Assumption 2 holds and b= 1 2+c when Assumption 1 holds instead. Proof. To control the error GM,N GW we decompose it into an estimation error GM,N GM and approximation error GM GW : GM,N GW GM GW + GM GM,N were GM is defined in (22) and is obtained by taking the number of samples N to infinity while keeping the number of basis points M fixed. The estimation error GM GM,N is controlled using Proposition 17 where, for any 0 δ 1, we have with probability at least 1 δ and as long as N M(1,λ,δ): Nλ (a N,δ+ p 2γ1κ+2γ1 λ+κ In the limit where N and λ 0, only the dominant terms in the above equation remain which leads to an error GM,N GM = O((Nλ) 1 2). Moreover, the condition on N can be expressed as λ 1logλ 1 N. To control the error approximation error GM GW we consider two cases: the well-specified case and the miss-specified case. Well-specified case. Here we work under Assumption 2 which allows to use Proposition 19. Hence, for any 0 δ 1 and if M M(d,λ,δ), it holds with probability at least 1 δ: GM GW λmin(1,α+ 1 Published as a conference paper at ICLR 2020 Miss-specified case. Here we work under Assumption 1 which allows to use Proposition 18. 
Hence, for any 0 δ 1 and if M M(d,λ,δ), it holds with probability at least 1 δ: GM GW λ 1 2+c Let s set b := min(1,α + 1 2) for the well-specified case and b = 1 2+c for the miss-specified case. In the limit where M and λ 0 the condition on M becomes: M dλ 1logλ 1. Hence, when M dλ 1logλ 1 and λ 1logλ 1 N it holds with probability as least 1 δ that GM,N GW =O(λb+(λN) 1 One can further choose λ of the form λ = N θ. This implies a condition on M of the form d Nθlog(N) M and Nθlog(N) N. After optimizing over θ to get the tightest bound, the optimal value is obtained when θ = 1/(2b+1) and the requirement on N is always satisfied once N is large enough. Moreover, one can choose M d N 1 2b+1 log N so that the requirement on M is satisfied for N large enough. In this case we get the following convergence rate: GM,N GW =O(N b 2b+1). Proposition 17. For any 0 δ 1, provided that N M(1,λ,δ), we have with probability as least 1 δ: Nλ (2a N,δ+ p 2γ1κ+2γ1 λ+κ δ + 2alog 2 Proof. For simplicity, we define E = L B L B. By definition of GM,N and GM we have the following decomposition: GM,N GM =E GM(ˆAλ)E | {z } E0 +E GM(ˆAλ)L B | {z } E1 +B LGM(ˆAλ)E | {z } E2 B LGM(Aλ)PM(ˆA A)PMGM(ˆAλ)L B | {z } E3 The first three terms can be upper-bounded in the following way: E0 = E ˆA 1 1 2 λGM(ˆAλ)ˆA E 2 ˆA 1 λ | {z } 1/λ 1 2 λGM(ˆAλ)ˆA 1 2 λ | {z } 1 E1 = E2 = E A 1 1 2 λGM(Aλ)A 1 2 λGM(ˆAλ)ˆA 1 2 λ | {z } 1 2 λ L | {z } 1 For the last term E3 , we first recall that by definition of GM(Aλ) we have: GM(Aλ)PM(ˆA A)PMGM(Aλ)=GM(Aλ)(ˆA A)GM(Aλ). Therefore, one can write: E3 = B LA 1 1 2 λGM(Aλ)A 2 λ (ˆA A)A 1 1 2 λGM(ˆAλ)ˆA 2 λ 2 | {z } 1 1 2 λGM(Aλ)A 1 2 λ | {z } 1 1 2 λGM(ˆAλ)ˆA 1 2 λ | {z } 1 1 2 λ 2 A 1 2 λ (ˆA A)A 1 1 2 λ 2 A 1 2 λ (ˆA A)A 1 Published as a conference paper at ICLR 2020 We recall now (Rudi et al., 2015, Proposition 7.) which allows to upper-bound A 1 2 λ by (1 η) 1 where η=λmax(A 1 2 λ(A ˆA)A 1 2 λ) provided that η<1. Moreover, (Rudi et al., 2015, Proposition 8.) allows to control both η and A 1 2 λ (ˆA A)A 1 2 λ under Assumption (F). Indeed, for any 0 δ 1 and provided that 0<λ A it holds with probability 1 δ that: 2 λ (ˆA A)A 1 2 λ 2γ1 1+κ/λ Nλ where γ1 and γ2 are given by: γ1=log(8Tr(A) λδ ); γ2=log(4Tr(A) Hence, for N M(1,λ,δ) we have that (1 η) 1 2 2 and one can therefore write: E3 4 B 2(2γ1 1+κ/λ E1 = E1 2 B The error E is controlled by Proposition 22 where it holds with probability greater or equal to 1 δ that: δ + 2alog 2 Finally, we have shown that provided that N M(1,λ,δ) then with probability greater than 1 δ one has: Nλ (2a N,δ+ p 2γ1κ+2γ1 λ+κ Proposition 18. Let 0 λ A and define M(d,λ,δ):= 128 9 log 4Tr(A) λδ (dκλ 1+1). Under Assumption 1 and Assumption (F), for any δ 0 such that M M(d,λ,δ) the following holds with probability 1 δ: GM GW λ 1 2+c Proof. We consider the error GM GW . Recall that GW is given by GW = Φ Φ with Φ defined in Proposition 13. Let κ be a positive real number, we know by Assumption 1 that there exists F κ := (fκ s )1 s q with fκ s H such that Φ F κ L2(ρθ) Cκ and fκ s H Cκ c for some fixed positive constant C. Therefore, we use F κ to control the error GM GW . Let s call E=Φ hθ LF κ We consider the following decomposition: GM GW =(L Φ hθ) GM(Aλ)L Φ hθ Φ Φ =E LGM(Aλ)L E | {z } E1 E E | {z } E2 +F κ L LGM(Aλ) I L Φ hθ | {z } E3 +E L GM(Aλ)L L I F κ | {z } E4 First we consider the term E1 one simply has: 2 λ | {z } 1 1 2 λGM(Aλ)A 1 2 λ | {z } 1 2 λ L | {z } 1 The second term also satisfies E1 κ2 by definition of Fκ. 
For the last two terms E3 and E4 we use Lemma 20 which allows to control the operator norm of L(GM(Aλ)L L I). Hence, for any δ 0 and M such that M M(d,λ,δ) and for κ 1 it holds with probability 1 δ that: We have shown so far that GM GW (κ2+2κ c λ). One can further optimize over κ on the interval [0,1] to get a tighter bound. The optimal value in this case is κ =min(1,(cλ 1 2) 1 2+c). By considering λ>0 such that (cλ 1 2) 1 2+c) 1, it follows directly that GM GW λ 1 2+c which shows the desired result. Published as a conference paper at ICLR 2020 Proposition 19. Let 0 λ A and define M(d,λ,δ):= 128 9 log 4Tr(A) λδ (dκλ 1+1). Under Assumption 2 and Assumption (F), for any δ 0 such that M M(d,λ,δ) the following holds with probability 1 δ: GM GW λmin(1,α+ 1 Proof. Recall that GW is given by GW =Φ Φ with Φ defined in Proposition 13. By Assumption 2, we have that Φ= (AαV ) with V :=(vs)1 s q Hq. Hence, one can write GM GW =(L Φ hθ) GM(Aλ)L Φ hθ Φ Φ =V (Aα(AGM(Aλ)A A)AαV we can therefore directly apply Lemma 20 and get GM GW λmin(1,α+ 1 2) with probability 1 δ for any δ 0 such that M M(d,λ,δ). Lemma 20. Let 0 λ A , α 0 and define M(d,λ,δ) := 128 9 log 4Tr(A) λδ (dκλ 1 + 1). Under Assumption (F), for any δ 0 such that M M(d,λ,δ) the following holds with probability 1 δ: L(GM(Aλ)L L I)Aα λmin(1,α+ 1 Proof. We have the following identities: L(GM(Aλ)L L I)Aα=L(GM(Aλ)Aλ I λGM(Aλ))Aα 1 2 λ(GM(Aλ)AλPM I)Aα | {z } E1 1 2 λGM(Aλ)A 2 λ Aα | {z } E3 1 2 λGM(Aλ)A 1 2 λ(I PM)Aα | {z } E2 For the first E1 we use (Sutherland et al., 2017, Lemma 1 (iii)) which implies that GM(Aλ)AλPM =PM. Thus E1 =LA 1 1 2 λ(PM I)Aα. Moreover, by Lemma 21 we have that A 1 2 λ(I PM) 2 λ with probability 1 δ for M > M(d,λ,δ). Therefore, recalling that (I PM)2 = I PM since PM is a projection, one can further write: 2 λ | {z } 1 1 2 λ(PM I) 2 | {z } λ 2 λ | {z } 1 1 2 λGM(Aλ)A 1 2 λ | {z } 1 1 2 λ(PM I) 2 | {z } 4λ 2 λ | {z } 1 1 2 λGM(Aλ)A 1 2 λ | {z } 1 It remains to note that A 1 2 λ Aα λα 1 2 when 0 α 1 2 and that A 1 2 λ Aα A α 1 2 for α > 1 2 which allows to conclude. C.4 AUXILIARY RESULTS Lemma 21. Let 0 λ A . Under Assumption (F), for any δ 0 such that M M(d,λ,δ):= 128 9 log 4Tr(A) λδ (κλ 1+1) the following holds with probability 1 δ: 1 2 λ(I PM) 2 Proof. The proof is an adaptation of the results in Rudi et al. (2015); Sutherland et al. (2017). Here we recall QM defined in (21). Its transpose Q M sends vectors in RM to elements in the span of the Nyström basis Published as a conference paper at ICLR 2020 points, hence PM and Q M have the same range, i.e.: range(PM)= range(Q M). We are in position to ap- ply (Rudi et al., 2015, Proposition 3.) which allows to find an upper-bound on A 1 2 λ(PM I) in terms of QM: 1 2 λ(PM I) 1 2 λ(Q MQM +λI) 1 For simplicity we write ˆAM :=Q MQM and E2:=A 1 2 λ (A ˆAM)A 1 2 λ . We also denote by β=λmax(E2) the highest eigenvalue of E2. We can therefore control A 1 2 λ(ˆAM +λI) 1 2 in terms of β using (Rudi et al., 2015, Proposition 7) provided that β<1: 1 2 λ(PM I) Now we need to make sure that β<1 for M large enough. To this end, we will apply (Rudi et al., 2015, Proposition 8.) to ˆAM. Denote by vm = d imk(Ym,.). Hence, by definition of ˆAM it follows that ˆAM = 1 M PM m=1vm vm. Moreover, (vm)1 m M are independent and identically distributed and satisfy: E[vm vm]= Z q X i=1 ik(y,.) ik(y,.)dpθ(y)=A. We also have by Assumption (F) that vm,A 1 λ vm dκ λ almost surely and for all λ>0. We can therefore apply (Rudi et al., 2015, Proposition 8.) 
which implies that for any 1 δ 0 and with probability 1 δ it holds that: Mλ with γ=log 4Tr(A) λδ provided that λ A . Thus by choosing M 128γ 9 (dκλ 1+1) we have that β 3 4 with probability 1 δ which allows to conclude. Proposition 22. There exist a>0 and σ1>0 such that for any 0 δ 1, it holds with probability greater of equal than 1 δ that: L B L B 2alog 2 Proof. denote by vn =Pd i=1 ik(Xn,.)Bi(Zn), we have that E[vn]=L B. We will apply Bernstein s inequality for sum of random vectors. For this we first need to find a>0 and σ1 >0 such that E[ zn L B p H] 1 2p!σ2 1ap 2. To simplify notations, we write x and x instead of hθ(z) and hθ(z ). We have that: E[ zn L B p H]= Z i=1 ik(x,.)Bi(z) Z d X i=1 ik(x ,.)Bi(z )dν(z ) Z ( ik(x,.) ik(x ,.))Bi(z)dν(z ) i=1 ik(x,.)(Bi(z) Bi(z ))dν(z ) | {z } E2 We used the convexity of the norm and the triangular inequality to get the last line. We introduce the notation γi(x):= ik(x,.) R ik(hθ(z ),.)dν(z ) and by Γ(x) we denote the matrix whose components are given by Γ(x)ij := γi(x),γj(x) H. The first term E1 can be upper-bounded as follows: E1= Z Tr(B(z)B(z) Γ(x)) Z B(z) 2Tr(Γ(x)2) 1 2 Published as a conference paper at ICLR 2020 Moreover, we have that Tr(Γ(x)2) 1 2 = (P 1 i,j d γi(x),γj(x) 2 H) 1 2 Pd i=1 γi(x) 2. We further have that γi(x) i i+dk(x,x) 1 2 + R i i+dk(hθ(z),hθ(z)) 1 2 dν(z) and by Assumption (F) it follows that γi(x) 2 κ. Hence, one can directly write that: E1 (2 κd)p R B(z) pdν(z). Recalling Assumptions (B) and (C) we get: κd)p C(θ)p(1+1 Similarly, we will find an upper-bound on E2. To this end, we introduce the matrix Q(x ,x ) whose components are given by Q(x ,x )i,j = i i+dk(x ,x ). One, therefore has: Z Z Tr((B(z) B(z ))(B(z) B(z )) Q(x ,x )dν(z )dν(z ) Z Z B(z) B(z ) B(z) B(z ) Tr(Q(x ,x )2) 1 2dν(z )dν(z ) Once again, we have that Tr(Q(x ,x )2) 1 2 (Pd i=1 i i+dk(x ,x )) 1 2(Pd i=1 i i+dk(x ,x )) 1 2 dκ thanks to Assumption (F). Therefore, it follows that: dκ)p Z | Z B(z) B(z ) dν(z)|pdν(z) dκ)p C(θ)p(2p+ Z z pdν(z)+ ÅZ z dν(z) ãp ) dκ)p C(θ)p(2p+ 1 2p!ζp 2σ2+ ÅZ z dν(z) ãp ). The second line is a consequence of Assumption (C) while the last line is due to Assumption (B). These calculations, show that it is possible to find constants a and σ1 such that E[ zn L B p H] 1 2p!σ2 1ap 2. Hence one concludes using Bernstein s inequality for a sum of random vectors (see for instance Rudi et al., 2015, Proposition 11). D EXPERIMENTAL DETAILS D.1 NATURAL WASSERSTEIN GRADIENT FOR THE MULTIVARIATE NORMAL MODEL Multivariate Gaussian. Consider a multivariate gaussian with mean µ Rd and covariance matrix Σ Rd Rd parametrized using its lower triangular components s=T(Σ). We denote by Σ=T 1(s) the inverse operation that maps any vector s R d(d+1) 2 to its corresponding symmetric matrix in Rd Rd. The concatenation of the mean µ and s will be denoted as θ : θ =(µ,s). Given two parameter vectors u = (m,T(S)) and v = (m ,T(S )) where m and m are vectors in Rd and S and S are symmetric matrices in Rd Rd the metric evaluated at u and v is given by: u G(θ)v=m m +Tr(AΣA ) where A and A are symmetric matrices that are solutions to the Lyapunov equation: S=AΣ+ΣA, S =A Σ+ΣA . A and A can be computed in closed form using standard routines making the evaluation of the metric easy to perform. Given a loss function L(θ) and gradient direction θL(θ) = µL(θ), s L(θ), the corresponding natural gradient W θ L(θ) can also be computed in closed form: W θ L(θ)=( µL(θ),T(Σ(A+diag(A))+(A+diag(A))Σ)), where A = T 1( s L(θ)). 
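Since A and A' solve a Lyapunov equation, the closed-form expressions above are easy to evaluate with standard linear-algebra routines. The following is a minimal NumPy/SciPy sketch of these formulas (illustrative only; the packing convention implemented by tri_to_sym and sym_to_tri for the map T is an assumption, not taken from the paper).

```python
# Minimal sketch of the closed-form Wasserstein natural gradient for the
# multivariate Gaussian model described above (illustrative only).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov


def tri_to_sym(s, d):
    """T^{-1}: map the d(d+1)/2 lower-triangular coefficients s to a symmetric matrix."""
    S = np.zeros((d, d))
    S[np.tril_indices(d)] = s
    return S + np.tril(S, -1).T


def sym_to_tri(S):
    """T: keep the lower-triangular coefficients of a symmetric matrix."""
    return S[np.tril_indices(S.shape[0])]


def gaussian_wasserstein_natural_gradient(grad_mu, grad_s, Sigma):
    """Apply the closed-form expressions above to a Euclidean gradient (grad_mu, grad_s)."""
    d = Sigma.shape[0]
    S = tri_to_sym(grad_s, d)                      # S = T^{-1}(grad_s)
    # solve_continuous_lyapunov(a, q) returns X with a X + X a^T = q; with a = Sigma
    # (symmetric) this gives the symmetric A solving A Sigma + Sigma A = S.
    A = solve_continuous_lyapunov(Sigma, S)
    B = A + np.diag(np.diag(A))                    # A + diag(A)
    return grad_mu, sym_to_tri(Sigma @ B + B @ Sigma)
```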
To use the estimator proposed in Proposition 5, we take advantage of the parametrization of the Gaussian distribution as the push-forward of a standard normal vector:

X ∼ N(µ, Σ)  ⟺  X = Σ^{1/2} Z + µ,  Z ∼ N(0, I_d).

[Figure 4: relative error of KWNG; panels (a) vs. dimension d, (b) vs. sample size N, (c) vs. number of basis points M; curves colored by dimension.]
Figure 4: Evolution of the relative error of KWNG, averaged over 100 runs, for dimensions ranging from d = 1 (yellow) to d = 10 (dark red). For each run, a random value for the parameter θ and for the Euclidean gradient ∇L(θ) is sampled from a centered Gaussian with variance 0.1. In all cases, λ = 0 and ε = 10^{-5}. Top row: multivariate normal model; bottom row: multivariate log-normal. Left (a): box-plot of the relative error as d increases, with N = 5000 and the number of basis points set to M = d√N. Middle (b): relative error as the sample size N increases, with the number of basis points set to M = d√N. Right (c): relative error as M increases, with N fixed to 5000.

[Figure 5: relative error of KWNG vs. kernel bandwidth; three panels, one per model.]
Figure 5: Relative error of KWNG for varying kernel bandwidth. Results are averaged over 100 runs for dimensions ranging from d = 1 (yellow) to d = 10 (dark red). For each run, a random value for the parameter θ and for the Euclidean gradient ∇L(θ) is sampled from a centered Gaussian with variance 0.1. In all cases, λ = ε = 10^{-10}. The sample size is fixed to N = 5000 and the number of basis points is set to M = d√N. Left: uniform distributions on a hyper-sphere; middle: multivariate normal; right: multivariate log-normal.

D.2 CLASSIFICATION ON CIFAR10 AND CIFAR100

Architecture. We use a residual network with one convolutional layer followed by 8 residual blocks and a final fully connected layer. Each residual block consists of two 3×3 convolutional layers and a ReLU nonlinearity. We use batch normalization for all methods. Details of the intermediate output shapes and kernel sizes are provided in Table 1.

Layer            Kernel size   Output shape
z                –             32×32×3
Conv             3×3           64
Residual block   [3×3]×2       64
Residual block   [3×3]×2       128
Residual block   [3×3]×2       256
Residual block   [3×3]×2       512
Linear           –             Number of classes

Table 1: Network architecture.

Hyper-parameters. For all methods, we used a batch size of 128. The optimal step-size γ was selected in {10, 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}} for each method. In the case of SGD with momentum, we used a momentum parameter of 0.9 and a weight decay of either 0 or 5×10^{-4}. For KFAC and EKFAC, we used a damping coefficient of 10^{-3} and a reparametrization frequency of 100 updates. For KWNG we set M = 5 and λ = 0, while the initial value of ε is set to ε = 10^{-5} and is adjusted using an adaptive scheme based on the Levenberg-Marquardt dynamics as in (Martens and Grosse, 2015, Section 6.5). More precisely, we use the following update equation for ε after every 5 iterations of the optimizer:

ε ← ωε if r > 3/4,   ε ← ω^{-1}ε if r < 1/4.

Here, r is the reduction ratio over the interval t_{k−1} ≤ t ≤ t_k between consecutive adjustments, which compares the observed decrease L(θ_t) − L(θ_{t+1}) to the decrease predicted by the local quadratic model; (t_k)_k are the times when the updates occur, and ω is the decay constant, chosen as ω = 0.85 (see the sketch below).

D.3 ADDITIONAL EXPERIMENTS

Figure 6 reports the time cost in seconds for each method, while Figure 7 compares KWNG to diagonal conditioning.
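For concreteness, the adaptive adjustment of ε described in Appendix D.2 can be summarized by the following sketch (illustrative only; `reduction_ratio` is assumed to be computed elsewhere from the observed and predicted decrease of the loss, and is not part of the paper's notation).

```python
# Sketch of the Levenberg-Marquardt-style adjustment of the damping parameter
# epsilon used for KWNG (illustrative only).

def adjust_epsilon(epsilon, reduction_ratio, omega=0.85):
    """Shrink epsilon when the local quadratic model is trustworthy (r > 3/4),
    enlarge it when it is not (r < 1/4), and keep it unchanged otherwise."""
    if reduction_ratio > 0.75:
        return omega * epsilon
    if reduction_ratio < 0.25:
        return epsilon / omega
    return epsilon


# Applied once every 5 optimizer iterations, as described above:
# if step % 5 == 0:
#     epsilon = adjust_epsilon(epsilon, r)
```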
[Figure 6: four panels, (a) training accuracy (IC), (b) test accuracy (IC), (c) training accuracy (WC), (d) test accuracy (WC), plotted against wall-clock time on a log scale; methods: KWNG (M=5), KWNG (M=20), SGD, SGD+Momentum, SGD+Momentum+WD, KFAC, EKFAC.]
Figure 6: Training accuracy (left) and test accuracy (right) as a function of time for classification on Cifar10, in both the ill-conditioned case (top) and the well-conditioned case (bottom), for different optimization methods.

[Figure 7: two panels, (a) training accuracy (IC) and (b) test accuracy (IC), plotted against epochs (0–300).]
Figure 7: KWNG vs. diagonal conditioning in the ill-conditioned case on Cifar10. In red and blue, the Euclidean gradient is preconditioned using a diagonal matrix D given by D_i = ‖T_{·,i}‖ or D_i = ‖T̃_{·,i}‖, where T and T̃ are defined in Propositions 5 and 6. The remaining traces are obtained using the stable version of KWNG in Proposition 6 with different choices for the damping term: D = I, D_i = ‖T_{·,i}‖, and D_i = ‖T̃_{·,i}‖. All use a Gaussian kernel except the yellow trace, which uses a rational quadratic kernel.
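As a rough schematic of the comparison in Figure 7 (under assumed shapes, not the authors' implementation): the diagonal-conditioning baselines rescale the Euclidean gradient coordinate-wise, whereas KWNG solves a small damped linear system built from the estimated Wasserstein information matrix, as in Proposition 9.

```python
# Schematic version of the two update rules compared in Figure 7 (illustrative only;
# grad and D_diag are assumed to have shape (q,), G_hat to have shape (q, q)).
import numpy as np


def diagonal_conditioning_update(grad, D_diag):
    """Baseline: precondition the Euclidean gradient with a diagonal matrix D."""
    return grad / D_diag


def kwng_update(grad, G_hat, D_diag, epsilon):
    """Damped KWNG direction: solve (epsilon * D + G_hat) x = grad, cf. Proposition 9."""
    return np.linalg.solve(epsilon * np.diag(D_diag) + G_hat, grad)
```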