# Minimum Stein Discrepancy Estimators

Alessandro Barp, Department of Mathematics, Imperial College London, a.barp16@imperial.ac.uk
François-Xavier Briol, Department of Statistical Science, University College London, f.briol@ucl.ac.uk
Andrew B. Duncan, Department of Mathematics, Imperial College London, a.duncan@imperial.ac.uk
Mark Girolami, Department of Engineering, University of Cambridge, mag92@eng.cam.ac.uk
Lester Mackey, Microsoft Research, Cambridge, MA, USA, lmackey@microsoft.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow to obtain tractable parameter estimates. We provide a unifying perspective on these techniques as minimum Stein discrepancy estimators, and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, then derive stochastic Riemannian gradient descent algorithms for their efficient optimisation. The main strength of our methodology is its flexibility, which allows us to design estimators with desirable properties for specific models at hand by carefully selecting a Stein discrepancy. We illustrate this advantage for several problems that are challenging for score matching, such as non-smooth, heavy-tailed or light-tailed densities.

## 1 Introduction

Maximum likelihood estimation [9] is a de facto standard for estimating the unknown parameters of a statistical model $\{\mathbb{P}_\theta : \theta \in \Theta\}$. However, the computation and optimisation of a likelihood typically requires access to the normalising constants of the model distributions. This poses difficulties for complex statistical models for which direct computation of the normalisation constant would entail prohibitive multidimensional integration of an unnormalised density. Examples of such models arise naturally in modelling images [27, 39], natural language [54], Markov random fields [61] and nonparametric density estimation [63, 69]. To bypass this issue, various approaches have been proposed for parametric inference in unnormalised models, including Monte Carlo maximum likelihood [22], contrastive divergence [28], minimum probability flow learning [62], noise-contrastive estimation [10, 26, 27] and score matching (SM) [34, 35]. The SM estimator is a minimum score estimator [16] based on the Hyvärinen scoring rule, which avoids normalising constants by depending on $\mathbb{P}_\theta$ only through the gradient of its log-density, $\nabla_x \log p_\theta$. SM estimators have proven widely applicable for models with unnormalised smooth positive densities, with generalisations to bounded domains [35] and compact Riemannian manifolds [51]. Despite the flexibility of this approach, SM has three important and distinct limitations. Firstly, since the Hyvärinen score depends on the Laplacian of the log-density, SM estimation is expensive in high dimension and breaks down for non-smooth models or for models in which the second derivative grows very rapidly. Secondly, as we shall demonstrate, SM estimators can behave poorly for models with heavy-tailed distributions. Thirdly, the SM estimator is not robust to outliers in many applications of interest.
Each of these situations arises naturally for energy models, particularly product-of-experts models and ICA models [33]. In a separate strand of research, new approaches have been developed to measure the discrepancy between an unnormalised distribution and a sample. In [23, 25, 50, 24], it was shown that Stein's method can be used to construct discrepancies that control weak convergence of an empirical measure to a target. In this paper we consider minimum Stein discrepancy (SD) estimators and show that SM, minimum probability flow and contrastive divergence estimators are all special cases. Within this class we focus on SDs constructed from reproducing kernel Hilbert spaces (RKHS), establishing the consistency, asymptotic normality and robustness of these estimators. We demonstrate that these SDs are appropriate for estimation of non-smooth distributions and heavy- or light-tailed distributions.

The remainder of the paper is organised as follows. In Section 2 we introduce the class of minimum SD estimators, then investigate asymptotic properties of SD estimators based on kernels in Section 3, demonstrating consistency and asymptotic normality under general conditions, as well as conditions for robustness. Section 4 presents three toy problems where SM breaks down, but our new estimators are able to recover the truth. All proofs are in the supplementary materials.

## 2 Minimum Stein Discrepancy Estimators

Let $\mathcal{P}_{\mathcal{X}}$ denote the set of Borel probability measures on $\mathcal{X}$. Given independent and identically distributed (IID) realisations from $\mathbb{Q} \in \mathcal{P}_{\mathcal{X}}$ on an open subset $\mathcal{X} \subseteq \mathbb{R}^d$, the objective is to find a sequence of measures $\mathbb{P}_n$ that approximate $\mathbb{Q}$ in an appropriate sense. More precisely, we will consider a family $\mathcal{P}_\Theta = \{\mathbb{P}_\theta : \theta \in \Theta\} \subset \mathcal{P}_{\mathcal{X}}$ together with a function $D : \mathcal{P}_{\mathcal{X}} \times \mathcal{P}_{\mathcal{X}} \to \mathbb{R}_+$ which quantifies the discrepancy between any two measures in $\mathcal{P}_{\mathcal{X}}$, and wish to estimate an optimal parameter $\theta^*$ satisfying $\theta^* \in \arg\min_{\theta \in \Theta} D(\mathbb{Q} \| \mathbb{P}_\theta)$. In practice, it is often difficult to compute the discrepancy $D$ explicitly, and it is useful to consider a random approximation $\hat{D}(\{X_i\}_{i=1}^n \| \mathbb{P}_\theta)$ based on an IID sample $X_1, \ldots, X_n \sim \mathbb{Q}$, such that $\hat{D}(\{X_i\}_{i=1}^n \| \mathbb{P}_\theta) \to D(\mathbb{Q} \| \mathbb{P}_\theta)$ almost surely as $n \to \infty$. We then consider the sequence of estimators $\hat{\theta}_n^{D} \in \arg\min_{\theta \in \Theta} \hat{D}(\{X_i\}_{i=1}^n \| \mathbb{P}_\theta)$. The choice of discrepancy will impact the consistency, efficiency and robustness of the estimators. Examples of such estimators include minimum distance estimators [4, 58], where the discrepancy is a metric on probability measures, including minimum maximum mean discrepancy (MMD) estimation [18, 42, 8] and minimum Wasserstein estimation [19, 21, 6]. More generally, minimum scoring rule estimators [16] arise from proper scoring rules, for example the Hyvärinen, Bregman and Tsallis scoring rules. These discrepancies are often statistical divergences, i.e. $D(\mathbb{Q}\|\mathbb{P}) = 0 \Leftrightarrow \mathbb{P} = \mathbb{Q}$ for all $\mathbb{P}, \mathbb{Q}$ in a subset of $\mathcal{P}_{\mathcal{X}}$.

Suppose that $\mathbb{P}_\theta$ and $\mathbb{Q}$ are absolutely continuous with respect to a common measure $\lambda$ on $\mathcal{X}$, with respective positive densities $p_\theta$ and $q$. A well-known statistical divergence is then the Kullback-Leibler (KL) divergence $\mathrm{KL}(\mathbb{Q}\|\mathbb{P}_\theta) \equiv \int_{\mathcal{X}} \log(\mathrm{d}\mathbb{Q}/\mathrm{d}\mathbb{P}_\theta)\,\mathrm{d}\mathbb{Q} = \int_{\mathcal{X}} \log q \,\mathrm{d}\mathbb{Q} - \int_{\mathcal{X}} \log p_\theta \,\mathrm{d}\mathbb{Q}$. Minimising $\mathrm{KL}(\mathbb{Q}\|\mathbb{P}_\theta)$ is equivalent to maximising $\int_{\mathcal{X}} \log p_\theta \,\mathrm{d}\mathbb{Q}$, which can be estimated (up to a $\theta$-independent term) using the log-likelihood: $\widehat{\mathrm{KL}}(\{X_i\}_{i=1}^n \| \mathbb{P}_\theta) \equiv -\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$. Informally, we see that minimising the KL divergence is equivalent to performing maximum likelihood estimation. For our purposes we are interested in discrepancies that can be evaluated when $\mathbb{P}_\theta$ is only known up to normalisation, precluding the use of the KL divergence.
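To fix ideas, the minimum discrepancy recipe above amounts to a generic optimisation over $\theta$. The sketch below is illustrative only: `discrepancy_hat` stands for any sample-based approximation $\hat{D}(\{X_i\}_{i=1}^n \| \mathbb{P}_\theta)$, such as the U-statistic DKSD of Section 2.1 or the empirical DSM objective of Section 2.2, and the function name and the use of a generic off-the-shelf optimiser are our own choices rather than part of the paper's method.

```python
import numpy as np
from scipy.optimize import minimize

def minimum_discrepancy_estimator(X, discrepancy_hat, theta_init):
    """Generic minimum discrepancy estimator: theta_hat = argmin_theta D_hat({X_i} || P_theta).

    X: (n, d) array of IID samples from Q.
    discrepancy_hat(theta, X): any sample-based approximation of D(Q || P_theta).
    """
    result = minimize(lambda theta: discrepancy_hat(theta, X),
                      np.asarray(theta_init, dtype=float),
                      method="Nelder-Mead")
    return result.x
```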
We instead consider a related class of discrepancies based on integral probability pseudometrics (IPM) [55] and Stein's method [3, 11, 65]. Let $\Gamma(\mathcal{Y}) \equiv \Gamma(\mathcal{X},\mathcal{Y}) \equiv \{f : \mathcal{X}\to\mathcal{Y}\}$. A map $\mathcal{S}_{\mathbb{P}} : \mathcal{G}\subset\Gamma(\mathbb{R}^d)\to\Gamma(\mathbb{R})$ is a Stein operator over a Stein class $\mathcal{G}$ if $\int_{\mathcal{X}}\mathcal{S}_{\mathbb{P}}[f]\,\mathrm{d}\mathbb{P} = 0$ for all $f\in\mathcal{G}$. We can then define an associated Stein discrepancy (SD) [23] as an IPM with the entry-dependent function space $\mathcal{F}\equiv\mathcal{S}_{\mathbb{P}_\theta}[\mathcal{G}]$:
$$\mathrm{SD}_{\mathcal{S}_{\mathbb{P}_\theta}[\mathcal{G}]}(\mathbb{Q}\|\mathbb{P}_\theta) \;\equiv\; \sup_{f\in\mathcal{S}_{\mathbb{P}_\theta}[\mathcal{G}]}\Big|\int_{\mathcal{X}} f\,\mathrm{d}\mathbb{Q}\Big| \;=\; \sup_{g\in\mathcal{G}}\Big|\int_{\mathcal{X}}\mathcal{S}_{\mathbb{P}_\theta}[g]\,\mathrm{d}\mathbb{Q}\Big|. \qquad (1)$$
The Stein discrepancy depends on $\mathbb{Q}$ only through expectations, and does not require the existence of a density, therefore permitting $\mathbb{Q}$ to be an empirical measure. If $\mathbb{P}$ has a $C^1$ density $p$ on $\mathcal{X}$, one can consider the Langevin Stein discrepancy arising from the Stein operator $\mathcal{T}_p[g] \equiv \langle\nabla\log p, g\rangle + \nabla\cdot g$ [23, 25]. In this case, the Stein discrepancy does not depend on the normalising constant of $p$. In this paper, for an arbitrary $m\in\Gamma(\mathbb{R}^{d\times d})$, which we call a diffusion matrix, we shall consider the more general diffusion Stein operators [25]
$$\mathcal{S}_p^m[g] \equiv \tfrac{1}{p}\nabla\cdot(p\,m\,g), \qquad \mathcal{S}_p^m[A] \equiv \tfrac{1}{p}\nabla\cdot(p\,m\,A),$$
where $g\in\Gamma(\mathbb{R}^d)$ and $A\in\Gamma(\mathbb{R}^{d\times d})$, and the associated minimum Stein discrepancy estimators which minimise (1). As we will only have access to a sample $\{X_i\}_{i=1}^n\sim\mathbb{Q}$, we will focus on estimators minimising an approximation $\widehat{\mathrm{SD}}_{\mathcal{S}_{\mathbb{P}_\theta}[\mathcal{G}]}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)$ based on a U-statistic of the $\mathbb{Q}$-integral: $\hat\theta_n^{\mathrm{Stein}} \in \arg\min_{\theta\in\Theta}\widehat{\mathrm{SD}}_{\mathcal{S}_{\mathbb{P}_\theta}[\mathcal{G}]}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)$. Related and complementary approaches to inference using SDs include the nonparametric estimator of [41], the density ratio approach of [47] and the variational inference algorithms of [49, 60]. We now highlight several instances of SDs which will be studied in detail in this paper.

### 2.1 Example 1: Diffusion Kernel Stein Discrepancy Estimators

A convenient choice of Stein class is the unit ball of a reproducing kernel Hilbert space (RKHS) [5] of a scalar kernel function $k$. For the Langevin Stein operator $\mathcal{T}_p$, the resulting kernel Stein discrepancy (KSD) first appeared in [57] and has since been considered extensively in the context of hypothesis testing, measuring sample quality and approximation of probability measures [12-14, 17, 24, 44, 46, 43]. In this paper, we consider a more general class of discrepancies based on the diffusion Stein operator and matrix-valued kernels. Consider an RKHS $\mathcal{H}^d$ of functions $f\in\Gamma(\mathbb{R}^d)$ with matrix-valued kernel $K\in\Gamma(\mathcal{X}\times\mathcal{X},\mathbb{R}^{d\times d})$, $K_x\equiv K(x,\cdot)$ (see Appendices A.3 and A.4 for further details). The Stein operator $\mathcal{S}_p^m$ induces an operator $\mathcal{S}_p^{m,2}\mathcal{S}_p^{m,1} : \Gamma(\mathcal{X}\times\mathcal{X},\mathbb{R}^{d\times d})\to\Gamma(\mathbb{R})$ which acts first on the first argument of $K$ and then on the second. We briefly mention two simple examples of matrix kernels constructed from scalar kernels. If we want the components of $f$ to be orthogonal, we can use the diagonal kernel (i) $K = \mathrm{diag}(\lambda_1 k_1,\ldots,\lambda_d k_d)$, where $\lambda_i > 0$ and $k_i$ is a $C^2$ kernel on $\mathcal{X}$ for $i = 1,\ldots,d$; alternatively, we can "correlate" the components by setting (ii) $K = Bk$, where $k$ is a scalar kernel on $\mathcal{X}$ and $B$ is a constant symmetric positive definite matrix. We propose to study diffusion kernel Stein discrepancies (DKSD) indexed by $K$ and $m$ (see Appendix B):

**Theorem 1 (Diffusion Kernel Stein Discrepancy).** For any kernel $K$, we find that $\mathcal{S}_p^m[f](x) = \langle\mathcal{S}_p^{m,1}K_x, f\rangle_{\mathcal{H}^d}$ for any $f\in\mathcal{H}^d$. Moreover, if $x\mapsto\|\mathcal{S}_p^{m,1}K_x\|_{\mathcal{H}^d}\in L^1(\mathbb{Q})$, we have
$$\mathrm{DKSD}_{K,m}(\mathbb{Q}\|\mathbb{P})^2 \;\equiv\; \Big(\sup_{h\in\mathcal{H}^d,\,\|h\|\le 1}\int_{\mathcal{X}}\mathcal{S}_p^m[h]\,\mathrm{d}\mathbb{Q}\Big)^2 \;=\; \int_{\mathcal{X}}\int_{\mathcal{X}} k^0(x,y)\,\mathrm{d}\mathbb{Q}(x)\,\mathrm{d}\mathbb{Q}(y),$$
$$k^0(x,y) \;\equiv\; \mathcal{S}_p^{m,2}\mathcal{S}_p^{m,1}K(x,y) \;=\; \frac{1}{p(x)p(y)}\,\nabla_y\cdot\nabla_x\cdot\Big(p(x)\,m(x)\,K(x,y)\,m(y)^\top p(y)\Big). \qquad (2)$$
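For intuition, in the special case $m = I$ and $K = k\,I_d$ for a scalar kernel $k$, the Stein kernel in (2) reduces to the familiar KSD Stein kernel of [46], $k^0(x,y) = s_p(x)^\top s_p(y)\,k(x,y) + s_p(x)^\top\nabla_y k(x,y) + s_p(y)^\top\nabla_x k(x,y) + \sum_i\partial_{x_i}\partial_{y_i}k(x,y)$ with $s_p = \nabla\log p$. The following minimal sketch (our own code, with illustrative function names) evaluates this quantity for the IMQ kernel used in Section 4.2; note that only the score of the possibly unnormalised model is required.

```python
import numpy as np

def imq_kernel_terms(x, y, c=1.0, beta=-0.5):
    """k(x,y), grad_x k, grad_y k and sum_i d^2 k / dx_i dy_i for the IMQ kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta with beta < 0."""
    diff = x - y
    r2 = c**2 + np.dot(diff, diff)
    k = r2**beta
    grad_x = 2.0 * beta * r2**(beta - 1.0) * diff
    grad_y = -grad_x
    trace = (-2.0 * beta * len(x) * r2**(beta - 1.0)
             - 4.0 * beta * (beta - 1.0) * np.dot(diff, diff) * r2**(beta - 2.0))
    return k, grad_x, grad_y, trace

def stein_kernel(x, y, score, c=1.0, beta=-0.5):
    """Langevin-Stein kernel k0(x, y) for m = I and K = k * I_d, where score(x)
    returns s_p(x) = grad_x log p(x); the normalising constant of p never appears."""
    k, grad_x, grad_y, trace = imq_kernel_terms(x, y, c, beta)
    sx, sy = score(x), score(y)
    return k * np.dot(sx, sy) + np.dot(sx, grad_y) + np.dot(sy, grad_x) + trace
```

For a standard Gaussian location model $p_\theta(x)\propto\exp(-\|x-\theta\|_2^2/2)$, for instance, one would pass `score = lambda x: theta - x`; the U-statistic estimator introduced next is then simply the average of `stein_kernel` over distinct pairs of samples.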
In order to use these discrepancies for minimum SD estimation, we propose the following U-statistic approximation:
$$\widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)^2 \;=\; \frac{2}{n(n-1)}\sum_{1\le i<j\le n} k^0_\theta(X_i, X_j).$$
We say that a kernel $K$ is integrally strictly positive definite (IPD) if $\int_{\mathcal{X}}\int_{\mathcal{X}}\mathrm{d}\mu(x)^\top K(x,y)\,\mathrm{d}\mu(y) > 0$ for any finite non-zero signed vector Borel measure $\mu$. From $\mathcal{S}_p^m[f](x) = \langle\mathcal{S}_p^{m,1}K_x, f\rangle_{\mathcal{H}^d}$ we have that any $f\in\mathcal{H}^d$ is in the Stein class of $\mathbb{Q}$ (i.e., $\int_{\mathcal{X}}\mathcal{S}_q^m[f]\,\mathrm{d}\mathbb{Q} = 0$) when $K$ is also in the class. Setting $s_p \equiv m^\top\nabla\log p \in\Gamma(\mathbb{R}^d)$, we have:

**Proposition 1 (DKSD as a Statistical Divergence).** Suppose $K$ is IPD and in the Stein class of $\mathbb{Q}$, and $m(x)$ is invertible. If $s_p - s_q\in L^1(\mathbb{Q})$, then $\mathrm{DKSD}_{K,m}(\mathbb{Q}\|\mathbb{P})^2 = 0$ iff $\mathbb{Q} = \mathbb{P}$.

See Appendix B.5 for the proof. Note that this proposition generalises Proposition 3.3 of [46] to a significantly larger class of SDs. For the matrix kernels introduced above, the proposition below shows that $K$ is IPD when its associated scalar kernels are, which is a well-studied problem [64].

**Proposition 2 (IPD Matrix Kernels).** (i) Let $K = \mathrm{diag}(k_1,\ldots,k_d)$. Then $K$ is IPD iff each kernel $k_i$ is IPD. (ii) Let $K = Bk$ with $B$ symmetric positive definite. Then $K$ is IPD iff $k$ is IPD.

### 2.2 Example 2: Diffusion Score Matching Estimators

A well-known family of estimators are the score matching (SM) estimators, based on the Fisher or Hyvärinen divergence [34, 35]. As will be shown below, these can be seen as special cases of minimum SD estimators. The SM discrepancy is computable for sufficiently smooth densities:
$$\mathrm{SM}(\mathbb{Q}\|\mathbb{P}) \;\equiv\; \int_{\mathcal{X}}\|\nabla\log p - \nabla\log q\|_2^2\,\mathrm{d}\mathbb{Q} \;=\; \int_{\mathcal{X}}\Big(\|\nabla\log q\|_2^2 + \|\nabla\log p\|_2^2 + 2\,\Delta\log p\Big)\,\mathrm{d}\mathbb{Q},$$
where $\Delta$ denotes the Laplacian and we have used the divergence theorem. If $\mathbb{P} = \mathbb{P}_\theta$, the first term of the right-hand integrand does not depend on $\theta$, and the remaining terms do not depend on the density of $\mathbb{Q}$, so we consider the approximation
$$\widehat{\mathrm{SM}}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta) \;\equiv\; \frac{1}{n}\sum_{i=1}^n\Big(\Delta\log p_\theta(X_i) + \tfrac{1}{2}\|\nabla\log p_\theta(X_i)\|_2^2\Big),$$
an unbiased estimator of the $\theta$-dependent part of the SM divergence (up to a factor of one half), and its estimators $\hat\theta_n^{\mathrm{SM}}\in\arg\min_{\theta\in\Theta}\widehat{\mathrm{SM}}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)$, for independent random vectors $X_i\sim\mathbb{Q}$. The SM discrepancy can also be generalised to include higher-order derivatives of the log-likelihood [48] and does not require a normalised model. We now introduce a further generalisation, which we call diffusion score matching (DSM), and which is an SD constructed from the diffusion Stein operator (see Appendix B.6):

**Theorem 2 (Diffusion Score Matching).** Let $\mathcal{X} = \mathbb{R}^d$ and consider a diffusion Stein operator $\mathcal{S}_p^m$ for some function $m\in\Gamma(\mathbb{R}^{d\times d})$ and the Stein class $\mathcal{G}\equiv\{g = (g_1,\ldots,g_d)\in C^1(\mathcal{X},\mathbb{R}^d)\cap L^2(\mathcal{X};\mathbb{Q}) : \|g\|_{L^2(\mathcal{X};\mathbb{Q})}\le 1\}$. If $p, q > 0$ are differentiable and $s_p - s_q\in L^2(\mathbb{Q})$, then we define the diffusion score matching divergence as the Stein discrepancy
$$\mathrm{DSM}_m(\mathbb{Q}\|\mathbb{P}) \;\equiv\; \Big(\sup_{f\in\mathcal{S}_p[\mathcal{G}]}\int_{\mathcal{X}} f\,\mathrm{d}\mathbb{Q}\Big)^2 \;=\; \int_{\mathcal{X}}\big\|m^\top(\nabla\log q - \nabla\log p)\big\|_2^2\,\mathrm{d}\mathbb{Q}.$$
This satisfies $\mathrm{DSM}_m(\mathbb{Q}\|\mathbb{P}) = 0$ iff $\mathbb{Q} = \mathbb{P}$ when $m(x)$ is invertible. Moreover, if $p$ is twice differentiable and $q\,mm^\top\nabla\log p,\ \nabla\cdot(q\,mm^\top\nabla\log p)\in L^1(\mathbb{R}^d)$, then Stokes' theorem gives
$$\mathrm{DSM}_m(\mathbb{Q}\|\mathbb{P}) \;=\; \int_{\mathcal{X}}\Big(\|m^\top\nabla_x\log p\|_2^2 + \|m^\top\nabla\log q\|_2^2 + 2\,\nabla\cdot\big(mm^\top\nabla\log p\big)\Big)\,\mathrm{d}\mathbb{Q}.$$

Notably, $\mathrm{DSM}_m$ recovers SM when $m(x)m(x)^\top = I$, and the (generalised) non-negative score matching estimator of [48] with the choice $m(x)\equiv\mathrm{diag}(h_1(x_1)^{1/2},\ldots,h_d(x_d)^{1/2})$. Like standard SM, DSM is only defined for distributions with sufficiently smooth densities. Since the $\theta$-dependent part of $\mathrm{DSM}_m(\mathbb{Q}\|\mathbb{P}_\theta)$ does not depend on the density of $\mathbb{Q}$, it can be estimated using an empirical mean, leading to the estimators $\hat\theta_n^{\mathrm{DSM}}\in\arg\min_{\theta\in\Theta}\widehat{\mathrm{DSM}}_m(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)$ for
$$\widehat{\mathrm{DSM}}_m(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta) \;\equiv\; \frac{1}{n}\sum_{i=1}^n\Big[\|m^\top\nabla_x\log p_\theta\|_2^2 + 2\,\nabla\cdot\big(mm^\top\nabla_x\log p_\theta\big)\Big](X_i),$$
where $\{X_i\}_{i=1}^n$ is a sample from $\mathbb{Q}$. Note that this is only possible if $m$ is independent of $\theta$, in contrast to DKSD, where $m$ can depend on $\mathcal{X}\times\Theta$, thus leading to a more flexible class of estimators.
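To make the DSM objective concrete, the sketch below (our own illustrative code, not from the paper) evaluates the empirical loss above for a scalar diffusion coefficient, i.e. diffusion matrix $m(x)I$, using the product rule $\nabla\cdot(m^2\nabla\log p_\theta) = \nabla(m^2)\cdot\nabla\log p_\theta + m^2\Delta\log p_\theta$. The callables `grad_log_p`, `laplacian_log_p`, `m` and `grad_m2` form an assumed interface; setting $m\equiv 1$ recovers (twice) the standard SM objective.

```python
import numpy as np

def dsm_loss(theta, X, grad_log_p, laplacian_log_p, m, grad_m2):
    """Empirical DSM objective (theta-dependent part) for a scalar diffusion m(x):
        (1/n) sum_i [ m^2 ||grad log p||^2
                      + 2 ( grad(m^2) . grad log p + m^2 lap log p ) ](X_i).
    grad_log_p(x, theta), laplacian_log_p(x, theta): score and Laplacian of log p_theta;
    m(x), grad_m2(x): the diffusion coefficient and the gradient of its square."""
    total = 0.0
    for x in X:
        g = grad_log_p(x, theta)
        m2 = m(x) ** 2
        total += m2 * np.dot(g, g) + 2.0 * (np.dot(grad_m2(x), g)
                                            + m2 * laplacian_log_p(x, theta))
    return total / len(X)

# A spatially decaying choice such as m(x) = 1 / (1 + ||x||^2), with the matching grad_m2,
# gives a robustified variant in the spirit of the one used in Section 4.3.
```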
An interesting remark is that the $\mathrm{DSM}_m$ discrepancy may in fact be obtained as a limit of DKSD over a sequence of target-dependent kernels: see Appendix B.6 for the complete result, which corrects and significantly generalises previously established connections between the SM divergence and KSD (such as in Sec. 5 of [46]). We conclude by commenting on the computational complexity. Evaluating the DKSD loss function requires $O(n^2 d^2)$ computation, due to the U-statistic and a matrix-matrix product. However, if $K = \mathrm{diag}(\lambda_1 k_1,\ldots,\lambda_d k_d)$ or $K = Bk$, and if $m$ is a diagonal matrix, then we can bypass expensive matrix products and the cost is $O(n^2 d)$, making it comparable to that of KSD. Although we do not consider these in this paper, recent approximations to KSD could also be adapted to DKSD to reduce the computational cost to $O(nd)$ [32, 36]. The DSM loss function has computational cost $O(nd^2)$, which is comparable to the SM loss. From a computational viewpoint, DSM will hence be preferable to DKSD for large $n$, whilst DKSD will be preferable to DSM for large $d$.

### 2.3 Further Examples: Contrastive Divergence and Minimum Probability Flow

Before analysing DKSD and DSM estimators further, we show that the class of minimum SD estimators also includes other well-known estimators for unnormalised models. Let $X_\theta^n$, $n\in\mathbb{N}$, be a Markov process with unique invariant probability measure $\mathbb{P}_\theta$, for example a Metropolis-Hastings chain. Let $P_\theta^n$ be the associated transition semigroup, i.e. $(P_\theta^n f)(x) = \mathbb{E}[f(X_\theta^n)\,|\,X_\theta^0 = x]$. Choosing the Stein operator $\mathcal{S}_p = I - P_\theta^n$ and Stein class $\mathcal{G} = \{\log p_\theta + c : c\in\mathbb{R}\}$ leads to the following SD:
$$\mathrm{CD}(\mathbb{Q}\|\mathbb{P}_\theta) \;=\; \int_{\mathcal{X}}\big(\log p_\theta - P_\theta^n\log p_\theta\big)\,\mathrm{d}\mathbb{Q} \;=\; \mathrm{KL}(\mathbb{Q}\|\mathbb{P}_\theta) - \mathrm{KL}(\mathbb{Q}_\theta^n\|\mathbb{P}_\theta),$$
where $\mathbb{Q}_\theta^n$ is the law of $X_\theta^n\,|\,X_\theta^0\sim\mathbb{Q}$ and we assume $\mathbb{Q}\ll\mathbb{P}_\theta$ and $\mathbb{Q}_\theta^n\ll\mathbb{P}_\theta$. This is the loss function associated with contrastive divergence (CD) [28, 45].

Suppose now that $\mathcal{X}$ is a finite set. Given $\theta\in\Theta$, let $\mathbf{P}_\theta$ be the transition matrix of a Markov process with unique invariant distribution $\mathbb{P}_\theta$. Suppose we observe data $\{x_i\}_{i=1}^n$ and let $q$ be the corresponding empirical distribution. Choose the Stein operator $\mathcal{S}_p = I - \mathbf{P}_\theta$ and the Stein set $\mathcal{G} = \{f\in\Gamma(\mathbb{R}) : \|f\|_\infty\le 1\}$. Then $g^*\in\arg\sup_{g\in\mathcal{G}}|\mathbb{Q}(\mathcal{S}_p[g])|$ satisfies $g^*(y) = \mathrm{sgn}\big(((I-\mathbf{P}_\theta)^\top q)_y\big)$, and the resulting Stein discrepancy is the minimum probability flow loss objective function [62]:
$$\mathrm{MPFL}(\mathbb{Q}\|\mathbb{P}_\theta) \;=\; \sum_y\Big|\big((I-\mathbf{P}_\theta)^\top q\big)_y\Big| \;=\; \sum_{y\notin\{x_i\}_{i=1}^n}\ \sum_{x\in\{x_i\}_{i=1}^n}\big|(I-\mathbf{P}_\theta)_{xy}\big|\,q_x.$$

### 2.4 Implementing Minimum SD Estimators: Stochastic Riemannian Gradient Descent

In order to implement the minimum SD estimators, we propose to use a stochastic gradient descent (SGD) algorithm associated with the information geometry induced by the SD on the parameter space. More precisely, consider a parametric family $\mathcal{P}_\Theta$ of probability measures on $\mathcal{X}$ with $\Theta\subseteq\mathbb{R}^m$. Given a discrepancy $D : \mathcal{P}_\Theta\times\mathcal{P}_\Theta\to\mathbb{R}$ satisfying $D(\mathbb{P}_\alpha\|\mathbb{P}_\theta) = 0$ iff $\mathbb{P}_\alpha = \mathbb{P}_\theta$ (called a statistical divergence), its associated information matrix field on $\Theta$ is defined as the map $\theta\mapsto g(\theta)$, where $g(\theta)$ is the symmetric bilinear form $g(\theta)_{ij} = \frac{1}{2}\big(\partial^2/\partial\alpha_i\partial\theta_j\big)D(\mathbb{P}_\alpha\|\mathbb{P}_\theta)\big|_{\alpha=\theta}$ [2]. When $g$ is positive definite, we can use it to perform (Riemannian) gradient descent on the parameter space $\Theta$. We provide below the information matrices of DKSD and DSM (hence extending results of [37]):

**Proposition 3 (Information Tensor DKSD).** Assume the conditions of Proposition 1 hold. The information tensor associated with DKSD is positive semi-definite and has components
$$g_{\mathrm{DKSD}}(\theta)_{ij} \;=\; \int_{\mathcal{X}}\int_{\mathcal{X}}\big(\nabla_x\partial_{\theta_j}\log p_\theta(x)\big)^\top m_\theta(x)\,K(x,y)\,m_\theta(y)^\top\,\nabla_y\partial_{\theta_i}\log p_\theta(y)\,\mathrm{d}\mathbb{P}_\theta(x)\,\mathrm{d}\mathbb{P}_\theta(y).$$

**Proposition 4 (Information Tensor DSM).** Assume the conditions of Theorem 2 hold.
The information tensor defined by DSM is positive semi-definite and has components
$$g_{\mathrm{DSM}}(\theta)_{ij} \;=\; \int_{\mathcal{X}}\big\langle m^\top\nabla\partial_{\theta_i}\log p_\theta,\ m^\top\nabla\partial_{\theta_j}\log p_\theta\big\rangle\,\mathrm{d}\mathbb{P}_\theta.$$

See Appendix C for the proofs. Given an (information) Riemannian metric, recall that the gradient flow of a curve $\theta(t)$ on the Riemannian manifold $\Theta$ is the solution to $\dot\theta(t) = -\nabla_{\theta(t)}\mathrm{SD}(\mathbb{Q}\|\mathbb{P}_{\theta(t)})$, where $\nabla_\theta$ denotes the Riemannian gradient at $\theta$. It is the curve that follows the direction of steepest decrease (measured with respect to the Riemannian metric) of the function $\theta\mapsto\mathrm{SD}(\mathbb{Q}\|\mathbb{P}_\theta)$ (see Appendix A.5). The well-studied natural gradient descent [1, 2] corresponds to the case in which the Riemannian manifold is $\Theta = \mathbb{R}^m$ equipped with the Fisher metric and SD is replaced by KL. When $\Theta$ is a linear manifold with coordinates $(\theta_i)$ we have $\nabla_\theta\mathrm{SD}(\mathbb{Q}\|\mathbb{P}_\theta) = g(\theta)^{-1}\mathrm{d}_\theta\mathrm{SD}(\mathbb{Q}\|\mathbb{P}_\theta)$, where $\mathrm{d}_\theta f$ denotes the tuple $(\partial_{\theta_i}f)$. We approximate this at step $t$ of the descent using the biased estimator $\hat g_{\theta_t}(\{X_i^t\}_{i=1}^n)^{-1}\,\mathrm{d}_{\theta_t}\widehat{\mathrm{SD}}(\{X_i^t\}_{i=1}^n\|\mathbb{P}_{\theta_t})$, where $\hat g_{\theta_t}(\{X_i^t\}_{i=1}^n)$ is an unbiased estimator of the information matrix $g(\theta_t)$ and $\{X_i^t\}_i\sim\mathbb{Q}$ is a sample at step $t$. In general, we have no guarantee that $\hat g_{\theta_t}$ is invertible, and so we may need a further approximation step to obtain an invertible matrix. Given a sequence $(\gamma_t)$ of step sizes, we approximate the gradient flow with
$$\hat\theta_{t+1} \;=\; \hat\theta_t - \gamma_t\,\hat g_{\hat\theta_t}(\{X_i^t\}_{i=1}^n)^{-1}\,\mathrm{d}_{\hat\theta_t}\widehat{\mathrm{SD}}(\{X_i^t\}_{i=1}^n\|\mathbb{P}_{\hat\theta_t}).$$

Minimum SD estimators hold additional appeal for exponential family models, since their densities have the form $p_\theta(x)\propto\exp(\langle\theta, T(x)\rangle_{\mathbb{R}^m})\exp(b(x))$ for natural parameters $\theta\in\mathbb{R}^m$, sufficient statistic $T\in\Gamma(\mathbb{R}^m)$ and base measure $\exp(b(x))$. For these models, the U-statistic approximations of DKSD and DSM are convex quadratics with closed-form solutions whenever $K$ and $m$ are independent of $\theta$. Moreover, since the absolute value of an affine function is convex, and the supremum of convex functions is convex, any SD with a diffusion Stein operator is convex in $\theta$, provided $m$ and the Stein class $\mathcal{G}$ are independent of $\theta$.
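The stochastic Riemannian update above can be summarised in a few lines. The sketch below is our own illustration: `grad_sd_hat`, `info_matrix_hat` and `sample_minibatch` are placeholder callables for the stochastic gradient $\mathrm{d}_\theta\widehat{\mathrm{SD}}$, the estimated information matrix $\hat g_\theta$ and the minibatch sampler, and the small `jitter` added to the metric is just one simple way of handling the case where $\hat g_\theta$ is not invertible.

```python
import numpy as np

def riemannian_sgd(theta0, sample_minibatch, grad_sd_hat, info_matrix_hat,
                   step_sizes, n_steps, jitter=1e-6):
    """Stochastic Riemannian gradient descent for a Stein discrepancy:
        theta_{t+1} = theta_t - gamma_t * g_hat(theta_t)^{-1} d_theta SD_hat."""
    theta = np.array(theta0, dtype=float)
    for t in range(n_steps):
        X = sample_minibatch()                                    # minibatch from Q at step t
        g_hat = info_matrix_hat(theta, X) + jitter * np.eye(theta.size)
        nat_grad = np.linalg.solve(g_hat, grad_sd_hat(theta, X))  # natural-gradient direction
        theta = theta - step_sizes[t] * nat_grad
    return theta
```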
## 3 Theoretical Properties for Minimum Stein Discrepancy Estimators

We now show that the DKSD and DSM estimators have many desirable properties, such as consistency, asymptotic normality and bias-robustness. These results do not only provide reassuring theoretical guarantees on the performance of our algorithms, but can also be a practical tool for choosing a Stein operator and Stein class given an inference problem of interest. We begin by establishing strong consistency for DKSD, i.e. almost sure convergence: $\hat\theta_n^{\mathrm{DKSD}}\xrightarrow{a.s.}\theta^*_{\mathrm{DKSD}}\in\arg\min_{\theta\in\Theta}\mathrm{DKSD}_{K,m}(\mathbb{Q}\|\mathbb{P}_\theta)^2$. This will be followed by a proof of asymptotic normality. We will assume we are in the well-specified setting, so that $\mathbb{Q} = \mathbb{P}_{\theta^*_{\mathrm{DKSD}}}\in\mathcal{P}_\Theta$; in the misspecified setting, we would additionally need to assume the existence of a unique minimiser.

**Theorem 3 (Strong Consistency of DKSD).** Let $\mathcal{X} = \mathbb{R}^d$ and $\Theta\subseteq\mathbb{R}^m$. Suppose that $K$ is bounded with bounded derivatives up to order 2, that $k^0_\theta(x,y)$ is continuously differentiable on an $\mathbb{R}^m$-open neighbourhood of $\Theta$, and that for any compact subset $C\subset\Theta$ there exist functions $f_1, f_2, g_1, g_2$ such that, for $\mathbb{Q}$-a.e. $x\in\mathcal{X}$ and all $\theta\in C$,
1. $\|m^\top(x)\nabla\log p_\theta(x)\|\le f_1(x)$, where $f_1\in L^1(\mathbb{Q})$ is continuous,
2. $\|\nabla_\theta\,m^\top(x)\nabla\log p_\theta(x)\|\le g_1(x)$, where $g_1\in L^1(\mathbb{Q})$ is continuous,
3. $\|m(x)\| + \|\nabla_x\cdot m(x)\|\le f_2(x)$, where $f_2\in L^1(\mathbb{Q})$ is continuous,
4. $\|\nabla_\theta m(x)\| + \|\nabla_\theta\nabla_x\cdot m(x)\|\le g_2(x)$, where $g_2\in L^1(\mathbb{Q})$ is continuous.
Assume further that $\theta\mapsto\mathbb{P}_\theta$ is injective. Then we have a unique minimiser $\theta^*_{\mathrm{DKSD}}$, and if either $\Theta$ is compact, or $\theta^*_{\mathrm{DKSD}}\in\mathrm{int}(\Theta)$ and both $\Theta$ and $\theta\mapsto\widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)^2$ are convex, then $\hat\theta_n^{\mathrm{DKSD}}$ is strongly consistent.

**Theorem 4 (Central Limit Theorem for DKSD).** Let $\mathcal{X}$ and $\Theta$ be open subsets of $\mathbb{R}^d$ and $\mathbb{R}^m$ respectively. Let $K$ be a bounded kernel with bounded derivatives up to order 2, and suppose that $\hat\theta_n^{\mathrm{DKSD}}\xrightarrow{p}\theta^*_{\mathrm{DKSD}}$ and that there exists a compact neighbourhood $N\subset\Theta$ of $\theta^*_{\mathrm{DKSD}}$ such that $\theta\mapsto\widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n\|\mathbb{P}_\theta)^2$ is twice continuously differentiable for $\theta\in N$ and, for $\mathbb{Q}$-a.e. $x\in\mathcal{X}$,
1. $\|m^\top(x)\nabla\log p_\theta(x)\| + \|\nabla_\theta\,m^\top(x)\nabla\log p_\theta(x)\|\le f_1(x)$,
2. $\|m(x)\| + \|\nabla_x\cdot m(x)\| + \|\nabla_\theta m(x)\| + \|\nabla_\theta\nabla_x\cdot m(x)\|\le f_2(x)$,
3. $\|\nabla_\theta\nabla_\theta\,m^\top(x)\nabla\log p_\theta(x)\| + \|\nabla_\theta\nabla_\theta\nabla_\theta\,m^\top(x)\nabla\log p_\theta(x)\|\le g_1(x)$,
4. $\|\nabla_\theta\nabla_\theta m(x)\| + \|\nabla_\theta\nabla_\theta\nabla_x\cdot m(x)\| + \|\nabla_\theta\nabla_\theta\nabla_\theta m(x)\| + \|\nabla_\theta\nabla_\theta\nabla_\theta\nabla_x\cdot m(x)\|\le g_2(x)$,
where $f_1, f_2\in L^2(\mathbb{Q})$ and $g_1, g_2\in L^1(\mathbb{Q})$ are continuous. Suppose also that the information tensor $g_{\mathrm{DKSD}}$ is invertible at $\theta^*_{\mathrm{DKSD}}$. Then
$$\sqrt{n}\big(\hat\theta_n^{\mathrm{DKSD}} - \theta^*_{\mathrm{DKSD}}\big)\xrightarrow{d}\mathcal{N}\Big(0,\ g_{\mathrm{DKSD}}^{-1}(\theta^*_{\mathrm{DKSD}})\,\Sigma_{\mathrm{DKSD}}\,g_{\mathrm{DKSD}}^{-1}(\theta^*_{\mathrm{DKSD}})\Big),$$
where $\Sigma_{\mathrm{DKSD}} = \int_{\mathcal{X}}\big(\int_{\mathcal{X}}\nabla_\theta k^0_{\theta^*_{\mathrm{DKSD}}}(x,y)\,\mathrm{d}\mathbb{Q}(y)\big)\big(\int_{\mathcal{X}}\nabla_\theta k^0_{\theta^*_{\mathrm{DKSD}}}(x,z)\,\mathrm{d}\mathbb{Q}(z)\big)^\top\mathrm{d}\mathbb{Q}(x)$.

See Appendix D for the proofs. For both results, the assumptions on the kernel are satisfied by most kernels common in the literature, such as the Gaussian, inverse multiquadric (IMQ) and any Matérn kernel with smoothness greater than 2. Similarly, the assumptions on the model are very weak, given that the diffusion tensor $m$ can be adapted to guarantee consistency and asymptotic normality. We now prove analogous results for DSM. This time we show weak consistency, i.e. convergence in probability: $\hat\theta_n^{\mathrm{DSM}}\xrightarrow{p}\theta^*_{\mathrm{DSM}}\in\arg\min_{\theta\in\Theta}\mathrm{DSM}_m(\mathbb{Q}\|\mathbb{P}_\theta) = \arg\min_{\theta\in\Theta}\int_{\mathcal{X}} F_\theta(x)\,\mathrm{d}\mathbb{Q}(x)$, where $F_\theta(x)\equiv\|m^\top\nabla_x\log p_\theta(x)\|_2^2 + 2\,\nabla\cdot(mm^\top\nabla_x\log p_\theta)(x)$ denotes the DSM integrand of Theorem 2. This will be a sufficient form of convergence for asymptotic normality.

**Theorem 5 (Weak Consistency of DSM).** Let $\mathcal{X}$ be an open subset of $\mathbb{R}^d$ and $\Theta\subseteq\mathbb{R}^m$. Suppose $\log p_\theta(\cdot)\in C^2(\mathcal{X})$ and $m\in C^1(\mathcal{X})$, and that $\|\nabla_x\log p_\theta(x)\|\le f_1(x)$ for $\mathbb{Q}$-a.e. $x$. Suppose also that $\|\nabla_x\nabla_x\log p_\theta(x)\|\le f_2(x)$ on any compact set $C\subset\Theta$ for $\mathbb{Q}$-a.e. $x$, where $\|m\|f_1\in L^2(\mathbb{Q})$, $\|\nabla\cdot(mm^\top)\|f_1\in L^1(\mathbb{Q})$ and $\|mm^\top\|f_2\in L^1(\mathbb{Q})$. If either $\Theta$ is compact, or $\Theta$ and $\theta\mapsto F_\theta$ are convex and $\theta^*_{\mathrm{DSM}}\in\mathrm{int}(\Theta)$, then $\hat\theta_n^{\mathrm{DSM}}$ is weakly consistent for $\theta^*_{\mathrm{DSM}}$.

**Theorem 6 (Central Limit Theorem for DSM).** Let $\mathcal{X}, \Theta$ be open subsets of $\mathbb{R}^d$ and $\mathbb{R}^m$ respectively. Suppose $\hat\theta_n^{\mathrm{DSM}}\xrightarrow{p}\theta^*_{\mathrm{DSM}}$, that $\theta\mapsto\log p_\theta(x)$ is twice continuously differentiable on a closed ball $B(\epsilon,\theta^*_{\mathrm{DSM}})\subset\Theta$, and that for $\mathbb{Q}$-a.e. $x\in\mathcal{X}$: (i) $\|m(x)m^\top(x)\| + \|\nabla_x\cdot(m(x)m^\top(x))\|\le f_1(x)$ and $\|\nabla_x\log p_\theta(x)\| + \|\nabla_\theta\nabla_x\log p_\theta(x)\| + \|\nabla_\theta\nabla_x\nabla_x\log p_\theta(x)\|\le f_2(x)$, with $f_1f_2,\ f_1f_2^2\in L^2(\mathbb{Q})$; (ii) for $\theta\in B(\epsilon,\theta^*)$, $\|\nabla_\theta\nabla_x\log p_\theta\|^2 + \|\nabla_x\log p_\theta\|\,\|\nabla_\theta\nabla_\theta\nabla_x\log p_\theta\| + \|\nabla_\theta\nabla_\theta\nabla_x\log p_\theta\| + \|\nabla_\theta\nabla_\theta\nabla_x\nabla_x\log p_\theta\|\le g_1(x)$, with $f_1g_1\in L^1(\mathbb{Q})$. Then, if the information tensor is invertible at $\theta^*_{\mathrm{DSM}}$, we have
$$\sqrt{n}\big(\hat\theta_n^{\mathrm{DSM}} - \theta^*_{\mathrm{DSM}}\big)\xrightarrow{d}\mathcal{N}\Big(0,\ g_{\mathrm{DSM}}^{-1}(\theta^*_{\mathrm{DSM}})\,\Sigma_{\mathrm{DSM}}\,g_{\mathrm{DSM}}^{-1}(\theta^*_{\mathrm{DSM}})\Big),$$
where $\Sigma_{\mathrm{DSM}} = \int_{\mathcal{X}}\nabla_\theta F_{\theta^*_{\mathrm{DSM}}}(x)\,\nabla_\theta F_{\theta^*_{\mathrm{DSM}}}(x)^\top\,\mathrm{d}\mathbb{Q}(x)$.

All of the proofs can be found in Appendix D.2. An important special case covered by our theory is that of natural exponential families, which admit densities of the form $\log p_\theta(x) = \langle\theta, T(x)\rangle_{\mathbb{R}^m} + b(x)$ up to an additive constant. If $K$ is IPD with bounded derivatives up to order 2, $T$ has linearly independent rows, $m$ is invertible, and $\|\nabla T\,m\|,\ \|\nabla_x b\,m\|,\ \|\nabla_x\cdot m\| + \|m\|\in L^2(\mathbb{Q})$, then the sequence of minimum DKSD and DSM estimators is strongly consistent and asymptotically normal (see Appendix D.3).

Before concluding this section, we turn to a concept of importance for practical inference: robustness to corrupted data [31]. We quantify the robustness of DKSD and DSM estimators in terms of their influence function, which can be interpreted as measuring the impact on the estimator of an infinitesimal perturbation of a distribution $\mathbb{P}$ by a Dirac measure located at a point $z\in\mathcal{X}$. If $\theta_{\mathbb{Q}}$ denotes the unique minimum SD estimator for $\mathbb{Q}$, then the influence function is given by $\mathrm{IF}(z,\mathbb{Q})\equiv\partial_t\theta_{\mathbb{Q}_t}|_{t=0}$, if it exists, where $\mathbb{Q}_t = (1-t)\mathbb{Q} + t\delta_z$ for $t\in[0,1]$. An estimator is said to be bias robust if $z\mapsto\mathrm{IF}(z,\mathbb{Q})$ is bounded.
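In practice, one can probe the influence function numerically by recomputing the estimator after adding a small Dirac contamination to the empirical measure. The following finite-difference sketch is ours and assumes a hypothetical weighted-estimator interface `estimator(samples, weights)`; unbounded growth of the output as $\|z\|\to\infty$ signals a non-robust estimator.

```python
import numpy as np

def influence_function_fd(estimator, X, z, t=1e-3):
    """Finite-difference approximation of IF(z, Q_n) = d/dt theta((1-t) Q_n + t delta_z) at t=0,
    where Q_n is the empirical measure of X (an (n, d) array) and z is a (d,) array."""
    n = len(X)
    base_w = np.full(n, 1.0 / n)
    theta_base = estimator(X, base_w)
    X_pert = np.vstack([X, z[None, :]])
    w_pert = np.append((1.0 - t) * base_w, t)   # weights of (1 - t) Q_n + t delta_z
    theta_pert = estimator(X_pert, w_pert)
    return (theta_pert - theta_base) / t

# For the SM estimator of a Gaussian location model the weighted estimate is the weighted mean,
# so this returns approximately z - mean(X), which is unbounded in z (cf. the discussion below).
```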
**Proposition 7 (Robustness of DKSD estimators).** Suppose that the map $\theta\mapsto\mathbb{P}_\theta$ over $\Theta$ is injective. Then
$$\mathrm{IF}(z,\mathbb{P}_\theta) \;=\; g_{\mathrm{DKSD}}(\theta)^{-1}\int_{\mathcal{X}}\nabla_\theta k^0_\theta(z,y)\,\mathrm{d}\mathbb{P}_\theta(y).$$
Moreover, suppose that $y\mapsto F(x,y)$ is $\mathbb{Q}$-integrable for any $x$, where
$$F(x,y) \;\equiv\; \Big(K(x,y)s_p(y),\ K(x,y)\nabla_\theta s_p(y),\ \nabla_x\cdot\big(K(x,y)s_p(y)\big),\ \nabla_x\cdot\big(K(x,y)\nabla_\theta s_p(y)\big),\ \nabla_y\cdot\nabla_x\cdot\big(K(x,y)m(y)\big),\ \nabla_y\cdot\nabla_x\cdot\big(K(x,y)\nabla_\theta m(y)\big)\Big).$$
Then, if $x\mapsto\big(\|s_p(x)\| + \|\nabla_\theta s_p(x)\|\big)\big\|\int_{\mathcal{X}} F(x,y)\,\mathbb{Q}(\mathrm{d}y)\big\|\big|_{\theta^*_{\mathrm{DKSD}}}$ is bounded, the DKSD estimators are bias robust: $\sup_{z\in\mathcal{X}}\|\mathrm{IF}(z,\mathbb{Q})\| < \infty$.

The analogous results for DSM estimators can be found in Appendix E. Consider a Gaussian location model, i.e. $p_\theta(x)\propto\exp(-\|x-\theta\|_2^2)$ for $\theta\in\mathbb{R}^d$. The Gaussian kernel satisfies the assumptions of Proposition 7, so that $\sup_z\|\mathrm{IF}(z,\mathbb{Q})\| < \infty$, even when $m = I$. Indeed, $\|\mathrm{IF}(z,\mathbb{P}_\theta)\|\le C(\theta)\,e^{-\|z-\theta\|^2/4}\,\|z-\theta\|$, where $z\mapsto e^{-\|z-\theta\|^2/4}\|z-\theta\|$ is uniformly bounded over $\theta$. In contrast, the SM estimator has an influence function of the form $\mathrm{IF}(z,\mathbb{Q}) = z - \int_{\mathcal{X}} x\,\mathrm{d}\mathbb{Q}(x)$, which is unbounded with respect to $z$, and is thus not robust. This clearly demonstrates the importance of carefully selecting a Stein class for use in minimum SD estimators. An alternative way of inducing robustness is to introduce a spatially decaying diffusion matrix in DSM. To this end, consider the minimum DSM estimator with scalar diffusion coefficient $m$. Then
$$\theta^*_{\mathrm{DSM}} \;=\; \Big(\int_{\mathcal{X}} m^2(x)\,\mathrm{d}\mathbb{Q}(x)\Big)^{-1}\Big(\int_{\mathcal{X}} m^2(x)\,x\,\mathrm{d}\mathbb{Q}(x) + \int_{\mathcal{X}}\nabla m^2(x)\,\mathrm{d}\mathbb{Q}(x)\Big).$$
A straightforward calculation yields that the associated influence function will be bounded if both $m(x)$ and $\nabla m(x)$ decay as $\|x\|\to\infty$. This clearly demonstrates another significant advantage provided by the flexibility of our family of diffusion SDs, where the Stein operator also plays an important role.

## 4 Numerical Experiments

In this section, we explore several examples which demonstrate worrying breakpoints for SM, and highlight how these can be straightforwardly handled using KSD, DKSD and DSM.

### 4.1 Rough densities: the symmetric Bessel distributions

A major drawback of SM is the smoothness requirement on the target density. However, this can be remedied by choosing alternative Stein classes, as will be demonstrated in the case of the symmetric Bessel distributions. Let $K_{s-d/2}$ denote the modified Bessel function of the second kind with parameter $s - d/2$. This distribution generalises the Laplace distribution [40] and has unnormalised density $p_\theta(x)\propto(\|x-\theta_1\|_2/\theta_2)^{s-d/2}\,K_{s-d/2}(\|x-\theta_1\|_2/\theta_2)$, where $\theta_1\in\mathbb{R}^d$ is a location parameter and $\theta_2 > 0$ a scale parameter. The parameter $s - d/2$ encodes smoothness. We compared SM with KSD based on a Gaussian kernel and a range of lengthscale values in Fig. 1. These results are based on $n = 500$ IID realisations in $d = 1$. The case $s = 1$ corresponds to a Laplace distribution, and we notice that both SM and KSD are able to obtain a reasonable estimate of the location. For rougher values, for example $s = 0.6$, we notice that KSD outperforms SM for certain choices of lengthscales, whereas for $s = 2$, SM and KSD are both able to recover the parameter. Analogous results for scale can be found in Appendix F.1, and Appendix F.2 illustrates the trade-off between efficiency and robustness on this problem.

Figure 1: Minimum SD estimators for the symmetric Bessel distribution. We consider the case where $\theta_1^* = 0$, $\theta_2^* = 1$ and $n = 500$, for a range of smoothness parameter values $s$ in $d = 1$.

### 4.2 Heavy-tailed distributions: the non-standardised Student-t distributions

A second drawback of standard SM is that it is inefficient for heavy-tailed distributions.
To demonstrate this, we focus on non-standardised Student-t distributions: $p_\theta(x)\propto(1/\theta_2)\big(1 + \tfrac{1}{\nu}\|x-\theta_1\|_2^2/\theta_2^2\big)^{-(\nu+1)/2}$, where $\theta_1\in\mathbb{R}$ is a location parameter and $\theta_2 > 0$ a scale parameter. The parameter $\nu$ determines the degrees of freedom: when $\nu = 1$ we have a Cauchy distribution, whereas $\nu\to\infty$ gives the Gaussian distribution. For small values of $\nu$, the Student-t distribution is heavy-tailed. We illustrate SM and KSD for $\nu = 5$ in Fig. 2, where we take an IMQ kernel $k(x,y;c,\beta) = (c^2 + \|x-y\|_2^2)^\beta$ with $c = 1$ and $\beta = -0.5$. This choice of $\nu$ guarantees that the first two moments exist, but the distribution is still heavy-tailed. In the left plot, both SM and KSD struggle to recover $\theta_1^*$ when $n = 300$, and the loss functions are far from convex. However, DKSD with $m_\theta(x) = 1 + (x-\theta_1)^2/\theta_2^2$ can estimate $\theta_1$ very accurately. In the middle-left plot, we instead estimate $\theta_2$ with SM, KSD and their corresponding non-negative versions (NNSM and NNKSD, $m(x) = x$), which are particularly well suited for scale parameters. NNSM and NNKSD provide improvements on SM and KSD, but DKSD with $m_\theta(x) = \big((x-\theta_1)/\theta_2\big)\big(1 + \tfrac{1}{\nu}(x-\theta_1)^2/\theta_2^2\big)$ provides significant further gains. On the right-hand side, we also consider the advantage of the Riemannian SGD algorithm over SGD by illustrating them on the KSD loss function with $n = 1000$. Both algorithms use constant step sizes and minibatches of size 50. As demonstrated, Riemannian SGD converges within a few dozen iterations, whereas SGD has not converged after 1000 iterations. Additional experiments on the robustness of these estimators are available in Appendix F.2.

Figure 2: Minimum SD estimators for non-standardised Student-t distributions. We consider a Student-t problem with $\nu = 5$, $\theta_1^* = 25$, $\theta_2^* = 10$ and $n = 300$.

### 4.3 Robust estimators for light-tailed distributions: the generalised Gamma distributions

Our final example demonstrates a third failure mode for SM: its lack of robustness for light-tailed distributions. We consider generalised Gamma location models with likelihoods $p_\theta(x)\propto\exp(-(x-\theta_1)^{\theta_2})$, where $\theta_1$ is a location parameter and $\theta_2$ determines how fast the tails decay. The larger $\theta_2$, the lighter the tails will be, and vice-versa. We set $n = 300$ and corrupt 80 points by setting them to the value $x = 8$. A robust estimator should obtain a good approximation of $\theta^*$ even under this corruption.

Figure 3: Minimum SD estimators for generalised Gamma distributions under corruption. We consider the case where $\theta_1^* = 0$ and $\theta_2^* = 2$ (left and middle) or $\theta_2^* = 5$ (right). Here $n = 300$.

The left plot in Fig. 3 considers a Gaussian model (i.e. $\theta_2^* = 2$); we see that SM is not robust for this very simple model, whereas DSM with $m(x) = 1/(1 + |x|^\alpha)$, $\alpha = 2$, is robust. The middle plot shows that DKSD with this same $m$ is also robust, and confirms the analytical results of the previous section. Finally, the right plot considers the case $\theta_2^* = 5$, and we see that $\alpha$ can be chosen as a function of $\theta_2$ to guarantee robustness. In general, taking $\alpha\ge\theta_2^* - 1$ will guarantee a bounded influence function. Such a choice allows us to obtain robust estimators even for models with very light tails.

### 4.4 Efficient estimators for a simple unnormalised model

Figure 4: Estimators for a simple intractable model.

Finally, we consider a simple intractable model from [47]: $p_\theta(x)\propto\exp(\eta(\theta)\cdot\psi(x))$, where $\psi(x) = \big(\sum_{i=1}^d x_i^2,\ \sum_{i=3}^d x_1 x_i,\ \tanh(x)\big)$ with $\tanh$ applied elementwise to $x$, and $\eta(\theta) = (-0.5, 0.2, 0.6, 0, 0, 0, \theta, 0)$. This model is intractable since we cannot easily compute its normalisation constant, due to the difficulty of integrating the unnormalised part of the model.
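The Stein discrepancy estimators above only require the score $\nabla_x\log p_\theta$ of this model, which is available in closed form despite the intractable normalisation constant. The short sketch below (our own helper, with $\eta$ passed in as a plain vector) computes it by differentiating $\eta(\theta)\cdot\psi(x)$; it can be plugged directly into a Stein-kernel or DSM routine such as the earlier sketches.

```python
import numpy as np

def score_unnormalised_model(x, eta):
    """Score grad_x log p_theta(x) for p_theta(x) propto exp(eta . psi(x)), with
    psi(x) = (sum_i x_i^2, sum_{i>=3} x_1 x_i, tanh(x_1), ..., tanh(x_d)).
    x: (d,) array; eta: (d + 2,) array. No normalising constant is needed."""
    d = x.shape[0]
    grad = 2.0 * eta[0] * x                      # from sum_i x_i^2
    cross = np.zeros(d)                          # from sum_{i>=3} x_1 * x_i
    cross[0] = np.sum(x[2:])
    cross[2:] = x[0]
    grad += eta[1] * cross
    grad += eta[2:] * (1.0 - np.tanh(x) ** 2)    # from the elementwise tanh terms
    return grad
```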
Our results based on $n = 200$ samples show that DKSD with $m(x) = \mathrm{diag}(1/(1+x))$ is able to recover the true parameter $\theta = 1$, whereas both SM and KSD provide less accurate estimates of the parameter. This illustrates yet again that a judicious choice of diffusion matrix can significantly improve the efficiency of our estimators.

## 5 Conclusion

This paper introduced a general approach for constructing minimum distance estimators based on Stein's method, and demonstrated that many popular inference schemes can be recovered as special cases. This class of algorithms gives us additional flexibility through the choice of an operator and function space (the Stein operator and Stein class), which can be used to tailor the inference scheme to trade off efficiency and robustness. However, this paper only scratches the surface of what is possible with minimum SD estimators. Looking ahead, it will be interesting to identify diffusion matrices which increase efficiency for important classes of problems in machine learning. One example on which we foresee progress is the product of Student-t experts model [38, 66, 68], whose heavy tails render estimation challenging for SM. Advantages could also be found for other energy models, such as large graphical models where the kernel could be adapted to the graph [67].

**Acknowledgments.** AB was supported by a Roth scholarship from the Department of Mathematics at Imperial College London. FXB was supported by the EPSRC grants [EP/L016710/1, EP/R018413/1]. AD and MG were supported by the Lloyds Register Foundation Programme on Data-Centric Engineering, the UKRI Strategic Priorities Fund under the EPSRC Grant [EP/T001569/1] and the Alan Turing Institute under the EPSRC grant [EP/N510129/1]. MG was supported by the EPSRC grants [EP/J016934/3, EP/K034154/1, EP/P020720/1, EP/R018413/1].

## References

[1] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[2] S.-I. Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.
[3] A. Barbour and L. H. Y. Chen. An Introduction to Stein's Method. Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore, 2005.
[4] A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. CRC Press, 2011.
[5] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science+Business Media, New York, 2004.
[6] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(2):235–269, 2019.
[7] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
[8] F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019.
[9] G. Casella and R. Berger. Statistical Inference. 2001.
[10] C. Ceylan and M. U. Gutmann. Conditional noise-contrastive estimation of unnormalised models. arXiv:1806.03664, 2018.
[11] L. H. Y. Chen, L. Goldstein, and Q.-M. Shao. Normal Approximation by Stein's Method. Springer, 2011.
[12] W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. J. Oates. Stein points. In Proceedings of the International Conference on Machine Learning, PMLR 80:843–852, 2018.
[13] W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. J. Oates. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, PMLR 97, pages 1011–1021, 2019.
[14] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606–2615, 2016.
[15] A. P. Dawid and M. Musio. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.
[16] A. P. Dawid, M. Musio, and L. Ventura. Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1):123–138, 2016.
[17] G. Detommaso, T. Cui, Y. Marzouk, A. Spantini, and R. Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems 31, pages 9169–9179, 2018.
[18] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence, 2015.
[19] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
[20] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, 1982.
[21] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84, pages 1608–1617, 2018.
[22] C. J. Geyer. On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):261–274, 1994.
[23] J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pages 226–234, 2015.
[24] J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the International Conference on Machine Learning, pages 1292–1301, 2017.
[25] J. Gorham, A. Duncan, L. Mackey, and S. Vollmer. Measuring sample quality with diffusions. arXiv:1506.03039. To appear in Annals of Applied Probability, 2016.
[26] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
[27] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.
[28] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[29] W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, pages 293–325, 1948.
[30] W. Hoeffding. The strong law of large numbers for U-statistics. Technical report, North Carolina State University Department of Statistics, 1961.
[31] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, 2009.
[32] J. Huggins and L. Mackey. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1899–1909, 2018.
[33] A. Hyvärinen. Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7):1739–1768, 1999.
[34] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–708, 2006.
[35] A. Hyvärinen. Some extensions of score matching. Computational Statistics and Data Analysis, 51(5):2499–2512, 2007.
[36] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pages 261–270, 2017.
[37] R. Karakida, M. Okada, and S.-I. Amari. Adaptive natural gradient learning algorithms for unnormalized statistical models. Artificial Neural Networks and Machine Learning - ICANN, 2016.
[38] D. P. Kingma and Y. LeCun. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pages 1126–1134, 2010.
[39] U. Köster and A. Hyvärinen. A two-layer model of natural stimuli estimated with score matching. Neural Computation, 22(9):2308–2333, 2010.
[40] S. Kotz, T. J. Kozubowski, and K. Podgorski. The Laplace Distribution and Generalizations. Springer, 2001.
[41] Y. Li and R. E. Turner. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
[42] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the International Conference on Machine Learning, volume 37, pages 1718–1727, 2015.
[43] C. Liu and J. Zhu. Riemannian Stein variational gradient descent for Bayesian inference. arXiv:1711.11216, 2017. URL http://arxiv.org/abs/1711.11216.
[44] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.
[45] Q. Liu and D. Wang. Learning deep energy models: Contrastive divergence vs. amortized MLE. arXiv preprint arXiv:1707.00797, 2017.
[46] Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning, pages 276–284, 2016.
[47] S. Liu, T. Kanamori, W. Jitkrittum, and Y. Chen. Fisher efficient inference of intractable models. arXiv:1805.07454, 2018.
[48] S. Lyu. Interpretation and generalization of score matching. In Conference on Uncertainty in Artificial Intelligence, pages 359–366, 2009.
[49] C. Ma and D. Barber. Black-box Stein divergence minimization for learning latent variable models. Advances in Approximate Bayesian Inference, NIPS 2017 Workshop, 2017.
[50] L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. Electronic Communications in Probability, 21, 2016.
[51] K. V. Mardia, J. T. Kent, and A. K. Laha. Score matching estimators for directional distributions. arXiv preprint arXiv:1604.08470, 2016.
[52] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.
[53] M. Micheli and J. A. Glaunes. Matrix-valued kernels for shape deformation analysis. arXiv preprint arXiv:1308.5739, 2013.
[54] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the International Conference on Machine Learning, pages 419–426, 2012.
[55] A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
[56] W. K. Newey and D. McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.
[57] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 79(3):695–718, 2017.
[58] L. Pardo. Statistical Inference Based on Divergence Measures, volume 170. Chapman and Hall/CRC, 2005.
[59] S. Pigola and A. G. Setti. Global divergence theorems in nonlinear PDEs and geometry. Ensaios Matemáticos, 26:1–77, 2014.
[60] R. Ranganath, J. Altosaar, D. Tran, and D. M. Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.
[61] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205, 2009.
[62] J. Sohl-Dickstein, P. Battaglino, and M. R. DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pages 905–912, 2011.
[63] B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(1):1830–1888, 2017.
[64] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
[65] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, pages 583–602. University of California Press, 1972.
[66] K. Swersky, M. A. Ranzato, D. Buchman, B. M. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.
[67] S. V. N. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. Graph kernels. Journal of Machine Learning Research, pages 1201–1242, 2010.
[68] M. Welling, G. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In Advances in Neural Information Processing Systems, pages 1383–1390, 2003.
[69] L. Wenliang, D. Sutherland, H. Strathmann, and A. Gretton. Learning deep kernels for exponential family densities. arXiv:1811.08357, 2018.
[70] I.-K. Yeo and R. A. Johnson. A uniform strong law of large numbers for U-statistics with application to transforming to near symmetry. Statistics & Probability Letters, 51(1):63–69, 2001.