# Nonparametric Score Estimators

Yuhao Zhou¹, Jiaxin Shi¹, Jun Zhu¹

¹ Dept. of Comp. Sci. & Tech., BNRist Center, Institute for AI, Tsinghua-Bosch ML Center, Tsinghua University. Correspondence to: Jun Zhu. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

**Abstract.** Estimating the score, i.e., the gradient of the log density function, from a set of samples generated by an unknown distribution is a fundamental task in inference and learning of probabilistic models that involve flexible yet intractable densities. Kernel estimators based on Stein's methods or score matching have shown promise; however, their theoretical properties and relationships have not been fully understood. We provide a unifying view of these estimators under the framework of regularized nonparametric regression. It allows us to analyse existing estimators and construct new ones with desirable properties by choosing different hypothesis spaces and regularizers. A unified convergence analysis is provided for such estimators. Finally, we propose score estimators based on iterative regularization that enjoy computational benefits from curl-free kernels and fast convergence.

## 1. Introduction

Intractability of density functions has long been a central challenge in probabilistic learning. It may arise in various situations, such as training implicit models like GANs (Goodfellow et al., 2014), or marginalizing over a non-conjugate hierarchical model, e.g., evaluating the output density of stochastic neural networks (Sun et al., 2019). In these situations, inference and learning often require evaluating such intractable densities or optimizing an objective that involves them. Among various solutions, one important family of methods is based on score estimation, which relies on a key step of estimating the score, i.e., the derivative of the log density $\nabla_x\log p(x)$, from a set of samples drawn from some unknown probability density $p$. These methods include parametric score matching (Hyvärinen, 2005; Sasaki et al., 2014; Song et al., 2019), its denoising variants as autoencoders (Vincent, 2011), nonparametric score matching (Sriperumbudur et al., 2017; Sutherland et al., 2018), and kernel score estimators based on Stein's methods (Li & Turner, 2018; Shi et al., 2018). They have been successfully applied to estimating gradients of mutual information for representation learning (Wen et al., 2020), score-based generative modeling (Song & Ermon, 2019; Saremi & Hyvärinen, 2019), gradient-free adaptive MCMC (Strathmann et al., 2015), learning implicit models (Warde-Farley & Bengio, 2016), and addressing intractability in approximate inference algorithms (Sun et al., 2019).

Recently, nonparametric score estimators have been growing in popularity, mainly because they are flexible, have well-studied statistical properties, and perform well when samples are very limited. Despite a common goal, they have different motivations and expositions. For example, the work of Sriperumbudur et al. (2017) is motivated by the density estimation perspective and the richness of kernel exponential families (Canu & Smola, 2006; Fukumizu, 2009), where the estimator is obtained by score matching. Li & Turner (2018) and Shi et al. (2018) are mainly motivated by Stein's methods.
The solution of Li & Turner (2018) gives the score prediction at the sample points by minimizing the kernelized Stein discrepancy (Chwialkowski et al., 2016; Liu et al., 2016), and at an out-of-sample point by adding that point to the training data, while the estimator of Shi et al. (2018) is obtained by a spectral analysis in function space. As these estimators were studied in different contexts, their relationships and theoretical properties are not fully understood. In this paper, we provide a unifying view of them under the regularized nonparametric regression framework. This framework allows us to construct new estimators with desirable properties, and to justify the consistency and improve the convergence rates of existing estimators. It also allows us to clarify the relationships between these estimators: we show that they differ only in hypothesis spaces and regularization schemes. Our contributions are both theoretical and algorithmic:

- We provide a unifying perspective of nonparametric score estimators. We show that the major distinction of the KEF estimator (Sriperumbudur et al., 2017) from the other two estimators lies in the use of curl-free kernels, while Li & Turner (2018) and Shi et al. (2018) differ mostly in regularization schemes, with the former additionally ignoring a one-dimensional subspace of the hypothesis space.
- We provide a unified convergence analysis under the framework. We justify the consistency of the Stein gradient estimator (Li & Turner, 2018), although its originally proposed out-of-sample extension is heuristic and expensive, and we provide a natural and principled out-of-sample extension derived from our framework. For both approaches we provide explicit convergence rates. From the convergence analysis we also obtain the explicit rate for Shi et al. (2018), which improves the error bound of Shi et al. (2018). Our results suggest favoring curl-free estimators in high dimensions.
- To address the scalability challenge, we propose iterative score estimators that adopt the ν-method (Engl et al., 1996) as the regularizer. We show that the structure of curl-free kernels can further accelerate such algorithms. Inspired by a similar idea, we propose a conjugate gradient solver for KEF that is significantly faster than previous approximations.

**Notation.** We always assume $\rho$ is a probability measure with probability density function $p(x)$ supported on $\mathcal X\subseteq\mathbb R^d$, and $L^2(\mathcal X,\rho;\mathbb R^d)$ is the Hilbert space of all square-integrable functions $f:\mathcal X\to\mathbb R^d$ with inner product $\langle f,g\rangle_{L^2(\mathcal X,\rho;\mathbb R^d)} = \mathbb E_{x\sim\rho}[\langle f(x),g(x)\rangle_{\mathbb R^d}]$. We denote by $\langle\cdot,\cdot\rangle_\rho$ and $\|\cdot\|_\rho$ the inner product and the norm in $L^2(\mathcal X,\rho;\mathbb R^d)$, respectively. We denote by $k$ a scalar-valued kernel, and by $K$ a matrix-valued kernel $K:\mathcal X\times\mathcal X\to\mathbb R^{d\times d}$ satisfying the following conditions: (1) $K(x,x') = K(x',x)^T$ for any $x,x'\in\mathcal X$; (2) $\sum_{i,j=1}^m c_i^TK(x_i,x_j)c_j\ge0$ for any $\{x_i\}\subset\mathcal X$ and $\{c_i\}\subset\mathbb R^d$. We denote the vector-valued reproducing kernel Hilbert space (RKHS) associated with $K$ by $\mathcal H_K$, which is the closure of $\{\sum_{i=1}^m K(x_i,\cdot)c_i : x_i\in\mathcal X,\ c_i\in\mathbb R^d,\ m\in\mathbb N\}$ under the norm induced by the inner product $\langle K(x_i,\cdot)c_i, K(x_j,\cdot)s_j\rangle := c_i^TK(x_i,x_j)s_j$. We define $K_x := K(x,\cdot)$ and $[M] := \{1,\dots,M\}$ for $M\in\mathbb Z_+$. For $A_1,\dots,A_n\in\mathbb R^{s\times t}$, we use $(A_1,\dots,A_n)$ to represent the block matrix $A\in\mathbb R^{ns\times t}$ with $A_{(i-1)s+j,k}$ being the $(j,k)$-th component of $A_i$, and we similarly define $[A_1,\dots,A_n] := (A_1^T,\dots,A_n^T)^T$.

## 2. Background

In this section, we briefly introduce the nonparametric regression method for learning vector-valued functions (Baldassarre et al., 2012).
We also review existing kernel-based approaches to score estimation.

### 2.1. Vector-Valued Learning

Supervised vector-valued learning amounts to learning a vector-valued function $f_z:\mathcal X\to\mathcal Y$ from a training set $z = \{(x_m,y_m)\}_{m\in[M]}$, where $\mathcal X\subseteq\mathbb R^d$ and $\mathcal Y\subseteq\mathbb R^q$. Here we assume the training data is sampled from an unknown distribution $\rho(x,y)$, which can be decomposed into $\rho(y|x)\rho_{\mathcal X}(x)$. A criterion for evaluating such an estimator is the mean squared error (MSE) $\mathcal E(f) := \mathbb E_{\rho(x,y)}\|f(x)-y\|_2^2$. It is well known that the conditional expectation $f_\rho(x) := \mathbb E_{\rho(y|x)}[y]$ minimizes $\mathcal E$. In practice, we minimize the empirical error $\mathcal E_z(f) := \frac1M\sum_{m=1}^M\|f(x_m)-y_m\|_2^2$ over a certain hypothesis space $\mathcal F$. However, the minimization problem is typically ill-posed for large $\mathcal F$. Hence, it is convenient to consider the regularized problem

$$f_{z,\lambda} := \arg\min_{f\in\mathcal F}\ \mathcal E_z(f) + \lambda\|f\|_{\mathcal F}^2, \tag{1}$$

where $\|\cdot\|_{\mathcal F}$ is the norm in $\mathcal F$. In the vector-valued case, it is typical to take as the hypothesis space a vector-valued RKHS $\mathcal H_K$ associated with a matrix-valued kernel $K$. Then the estimator is $f_{z,\lambda} = \sum_{m=1}^M K_{x_m}c_m$, where $K_{x_m}$ denotes the function $K(x_m,\cdot)$ and $c_m$ solves the linear system $(\frac1M\mathbf K+\lambda I)c = \frac1M y$ with $\mathbf K_{ij} = K(x_i,x_j)$, $c = (c_1,\dots,c_M)$, $y = (y_1,\dots,y_M)$.

For convenience, we define the sampling operator $S_x:\mathcal H_K\to\mathbb R^{Mq}$ as $S_x(f) := (f(x_1),\dots,f(x_M))$. Its adjoint $S_x^*:\mathbb R^{Mq}\to\mathcal H_K$, which satisfies $\langle S_x(f),c\rangle_{\mathbb R^{Mq}} = \langle f,S_x^*(c)\rangle_{\mathcal H_K}$ for all $f\in\mathcal H_K$ and $c\in\mathbb R^{Mq}$, is given by $S_x^*(c_1,\dots,c_M) = \sum_{m=1}^M K_{x_m}c_m$. Since $(\frac1MS_x^*S_x+\lambda I)f_{z,\lambda} = \frac1MS_x^*\mathbf Kc + \lambda S_x^*c = \frac1MS_x^*y$, the estimator can now be written as

$$f_{z,\lambda} = \Big(\tfrac1MS_x^*S_x+\lambda I\Big)^{-1}\tfrac1MS_x^*y.$$

In fact, if we consider the data-free limit of (1), $\arg\min_{f\in\mathcal H_K}\mathcal E(f)+\lambda\|f\|_{\mathcal H_K}^2$, the minimizer is unique when $\lambda>0$ and is given by $f_\lambda := (L_K+\lambda I)^{-1}L_Kf_\rho$, where $L_K:\mathcal H_K\to\mathcal H_K$ is the integral operator defined as $L_Kf := \int_{\mathcal X}K_xf(x)\,d\rho_{\mathcal X}$ (Smale & Zhou, 2007). It turns out that $\frac1MS_x^*S_x$ is an empirical estimate of $L_K$: $\hat L_Kf := \frac1M\sum_{m=1}^MK_{x_m}f(x_m) = \frac1MS_x^*S_xf$. It can also be shown that $\hat L_Kf_\rho = \frac1MS_x^*y$. Hence, we can write $f_{z,\lambda} = (\hat L_K+\lambda I)^{-1}\hat L_Kf_\rho$.

As mentioned above, the role of regularization is to deal with the ill-posedness. Specifically, $\hat L_K$ is not always invertible, as it has finite rank while $\mathcal H_K$ is usually infinite-dimensional. Many regularization methods have been studied in the context of solving inverse problems (Engl et al., 1996) and statistical learning theory (Bauer et al., 2007). The regularization method in (1) is the famous Tikhonov regularization, which belongs to a class of techniques called spectral regularization (Bauer et al., 2007). Specifically, spectral regularization corresponds to a family of estimators defined as $f^g_{z,\lambda} := g_\lambda(\hat L_K)\hat L_Kf_\rho$, where $g_\lambda:\mathbb R_+\to\mathbb R$ is a regularizer such that $g_\lambda(\hat L_K)$ approximates the inverse of $\hat L_K$. Noting that $\hat L_K$ can be decomposed as $\sum_i\sigma_i\langle e_i,\cdot\rangle e_i$, where $(\sigma_i,e_i)$ are eigenvalue and eigenfunction pairs, we can define $g_\lambda(\hat L_K) := \sum_ig_\lambda(\sigma_i)\langle e_i,\cdot\rangle e_i$. Tikhonov regularization corresponds to $g_\lambda(\sigma) = (\lambda+\sigma)^{-1}$. There are several other regularizers; for example, the spectral cut-off regularizer is defined by $g_\lambda(\sigma) = \sigma^{-1}$ for $\sigma\ge\lambda$ and $g_\lambda(\sigma) = 0$ otherwise. We refer the reader to Smale & Zhou (2007); Bauer et al. (2007); Baldassarre et al. (2012) for more details.
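To make the estimator above concrete, here is a minimal sketch of the regularized vector-valued regression (1) with the diagonal kernel $K(x,y)=k(x,y)I_q$, for which the block linear system factorizes into an ordinary kernel ridge regression solved once for all output dimensions. The function and variable names (`rbf_kernel`, `fit_vector_valued`) are ours, chosen for illustration; this is not a reference implementation.

```python
# Sketch of regularized vector-valued regression (Eq. 1) with K(x, y) = k(x, y) I_q.
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); Gram matrix of shape (len(X), len(Y)).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit_vector_valued(X, Y, lam=1e-3, sigma=1.0):
    """X: (M, d) inputs, Y: (M, q) targets. Returns a predictor f(x)."""
    M = X.shape[0]
    G = rbf_kernel(X, X, sigma)                          # (M, M)
    # (G / M + lam I) C = Y / M, one right-hand side per output dimension.
    C = np.linalg.solve(G / M + lam * np.eye(M), Y / M)  # (M, q) coefficients
    def predict(Xnew):
        return rbf_kernel(Xnew, X, sigma) @ C            # f(x) = sum_m k(x, x_m) c_m
    return predict

# Usage: recover a 2-d vector field from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1) + 0.05 * rng.normal(size=(200, 2))
f = fit_vector_valued(X, Y, lam=1e-2)
print(f(X[:5]))
```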

### 2.2. Related Work

We assume $\log p(x)$ is differentiable, and define the score as $s_p := \nabla\log p$. By score estimation we mean estimating $s_p$ from a set of i.i.d. samples $\{x_m\}_{m\in[M]}$ drawn from $\rho$. Many kernel-based score estimators have been studied in different contexts (Sriperumbudur et al., 2017; Sutherland et al., 2018; Li & Turner, 2018; Shi et al., 2018). Below we give a brief review of them.

**Kernel Exponential Family Estimator.** The kernel exponential family (KEF) (Canu & Smola, 2006; Fukumizu, 2009) was originally proposed as an infinite-dimensional generalization of exponential families. It was shown to be useful in density estimation, as it can approximate a broad class of densities arbitrarily well (Sriperumbudur et al., 2017). The KEF is defined as $\mathcal P_k := \{p_f(x) = e^{f(x)-A(f)} : f\in\mathcal H_k,\ e^{A(f)}<\infty\}$, where $\mathcal H_k$ is a scalar-valued RKHS and $A(f) := \log\int_{\mathcal X}e^{f(x)}\,dx$ is the normalizing constant. Since $A(f)$ is typically intractable, Sriperumbudur et al. (2017) proposed to estimate $f$ by matching the model score $\nabla\log p_f$ and the data score $s_p$; thus the KEF can naturally be used for score estimation (Strathmann et al., 2015). This approach minimizes the regularized score matching loss

$$\min_{f\in\mathcal H_k}\ J(p\,\|\,p_f) + \lambda\|f\|_{\mathcal H_k}^2, \tag{2}$$

where $J(p\,\|\,q) := \mathbb E_p\|\nabla\log p-\nabla\log q\|_2^2$ is the Fisher divergence between $p$ and $q$. Integration by parts is used to eliminate $\nabla\log p$ from $J(p\,\|\,q)$ (Hyvärinen, 2005), and the exact solution of (2) is given as follows (Sriperumbudur et al., 2017, Theorem 5):

$$\hat f_{p,\lambda} = \sum_{m=1}^M\sum_{j=1}^d c_{(m-1)d+j}\,\partial_jk(x_m,\cdot)\ -\ \frac{\hat\xi}{\lambda}, \tag{3}$$

where $\hat\xi(\cdot) = \frac1M\sum_{m=1}^M\sum_{j=1}^d\partial_j^2k(x_m,\cdot)$, and $c\in\mathbb R^{Md}$ is obtained by solving $(G+M\lambda I)c = b/\lambda$ with $G_{(m-1)d+i,(\ell-1)d+j} = \partial_i\partial_{j+d}k(x_m,x_\ell)$ and $b_{(m-1)d+i} = \frac1M\sum_{\ell=1}^M\sum_{j=1}^d\partial_i\partial_{j+d}^2k(x_m,x_\ell)$, where $\partial_{j+d}$ denotes the derivative with respect to the $j$-th component of the second argument $x_\ell$. This solution suffers from computational drawbacks due to the large linear system of size $Md\times Md$. Sutherland et al. (2018) proposed to use the Nyström method to accelerate KEF: instead of minimizing the loss over the whole RKHS, they minimize it over a low-dimensional subspace.

**Stein Gradient Estimator.** The Stein gradient estimator proposed by Li & Turner (2018) is based on inverting the following generalized Stein's identity (Stein, 1981; Gorham & Mackey, 2015):

$$\mathbb E_p\big[h(x)\nabla\log p(x)^T + \nabla h(x)\big] = 0, \tag{4}$$

where $h:\mathcal X\to\mathbb R^d$ is a test function satisfying some regularity conditions. An empirical approximation of the identity is $\frac1MHS + \overline{\nabla_xh} \approx 0$, where $H = (h(x_1),\dots,h(x_M))\in\mathbb R^{d\times M}$, $S = (\nabla\log p(x_1),\dots,\nabla\log p(x_M))\in\mathbb R^{M\times d}$, and $\overline{\nabla_xh} = \frac1M\sum_{m=1}^M\nabla_{x_m}h(x_m)$. Li & Turner (2018) proposed to minimize $\|\overline{\nabla_xh}+\frac1MHS\|_F^2 + \frac{\eta}{M^2}\|S\|_F^2$ to estimate $S$, where $\|\cdot\|_F$ denotes the Frobenius norm. The kernel trick $k(x_i,x_j) := h(x_i)^Th(x_j)$ was then exploited to obtain the estimator. From the above, we only have score estimates at the sample points. Li & Turner (2018) proposed a heuristic out-of-sample extension at $x$: add it to $\{x_m\}$ and recompute the minimizer. Such an approach is unjustified, and it is still unclear whether the estimator is consistent.
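As a concrete illustration, the following is a hedged sketch of the Stein gradient estimator at the sample points with an RBF kernel, in its well-known closed form $\hat S = -(\mathbf K+\eta I)^{-1}\langle\nabla,\mathbf K\rangle$ obtained from the minimization above; the helper names are ours, not from the reference implementation.

```python
# Sketch: Stein gradient estimator at the samples, S_hat = -(K + eta I)^{-1} <grad, K>,
# where <grad, K>_{ij} = sum_m d k(x_m, x_i) / d x_m^{(j)}.
import numpy as np

def stein_gradient_estimator(X, eta=0.1, sigma=1.0):
    """X: (M, d) i.i.d. samples. Returns (M, d) score estimates at the samples."""
    M, d = X.shape
    diff = X[:, None, :] - X[None, :, :]                  # diff[m, i] = x_m - x_i
    K = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))   # RBF Gram matrix, (M, M)
    # grad_K[i, j] = sum_m -(x_m - x_i)_j / sigma^2 * k(x_m, x_i)
    grad_K = -(K[:, :, None] * diff).sum(0) / sigma ** 2  # (M, d)
    return -np.linalg.solve(K + eta * np.eye(M), grad_K)

# Usage: for samples from N(0, I) the estimated score should be roughly -x.
X = np.random.default_rng(0).normal(size=(500, 2))
print(stein_gradient_estimator(X, eta=1.0)[:3], -X[:3])
```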
**Spectral Stein Gradient Estimator.** The spectral Stein gradient estimator (SSGE) (Shi et al., 2018) was derived from a spectral analysis of the score function. Unlike Li & Turner (2018), it was shown to have convergence guarantees and a principled out-of-sample extension. The idea is to expand each component of the score in the scalar-valued function space $L^2(\mathcal X,\rho)$: $g_i(x) = \sum_{j=1}^\infty\beta_{ij}\psi_j(x)$, where $g_i$ is the $i$-th component of the score and $\{\psi_j\}$ are the eigenfunctions of the integral operator $L_kf := \int_{\mathcal X}k(x,\cdot)f(x)\,d\rho(x)$ associated with a scalar-valued kernel $k$. Using Stein's identity (4), Shi et al. (2018) showed that $\beta_{ij} = -\mathbb E_\rho[\partial_i\psi_j(x)]$. The Nyström method (Baker, 1977; Williams & Seeger, 2001) was then used to estimate $\psi_j$:

$$\hat\psi_j(x) = \frac{\sqrt M}{\lambda_j}\sum_{m=1}^M k(x,x_m)\,w_{jm}, \tag{5}$$

where $\{x_m\}_{m\in[M]}$ are i.i.d. samples drawn from $\rho$, and $w_{jm}$ is the $m$-th component of the eigenvector of the kernel matrix constructed from $\{x_m\}_{m\in[M]}$ that corresponds to its $j$-th largest eigenvalue $\lambda_j$. The final estimator was obtained by truncating $g_i$ to $\sum_{j=1}^J\beta_{ij}\psi_j(x)$ and plugging in $\hat\psi_j$. Shi et al. (2018, Theorem 2) provided an error bound of SSGE depending on $J$ and $M$; however, the convergence rate remained unknown.

## 3. Nonparametric Score Estimators

The kernel score estimators discussed in Sec. 2.2 were proposed in different contexts. The KEF estimator is motivated from the density estimation perspective, while Stein and SSGE have no explicit density models. SSGE relies on a spectral analysis in function space, while the other two are derived by minimizing a loss function. Despite sharing a common goal, it is still unclear how these estimators relate to each other. In this section, we present a unifying framework of score estimation using regularized vector-valued regression. We show that several existing kernel score estimators are special cases under the framework, which allows us to thoroughly investigate their strengths and weaknesses.

### 3.1. A Unifying Framework

As introduced in Sec. 2.2, the goal is to estimate the score $s_p$ from a set of i.i.d. samples $\{x_m\}_{m\in[M]}$ drawn from $\rho$. We first consider the ideal case where we have the ground-truth values of $s_p$ at the sample locations. Then we can estimate $s_p$ with vector-valued regression as described in Sec. 2.1:

$$\hat s_{p,\lambda} = \arg\min_{s\in\mathcal H_K}\ \frac1M\sum_{m=1}^M\|s(x_m)-s_p(x_m)\|_2^2 + \lambda\|s\|_{\mathcal H_K}^2. \tag{6}$$

The solution is given by $\hat s_{p,\lambda} = (\hat L_K+\lambda I)^{-1}\hat L_Ks_p$. We could replace the Tikhonov regularizer with other spectral regularization, for which the general solution is

$$\hat s^g_{p,\lambda} := g_\lambda(\hat L_K)\,\hat L_Ks_p. \tag{7}$$

In reality, the values of $s_p$ at $x_{1:M}$ are unknown and we cannot compute $\hat L_Ks_p$ as $\frac1M\sum_{m=1}^M K_{x_m}s_p(x_m)$. Fortunately, we can exploit integration by parts to avoid this problem. Under some mild regularity conditions (Assumptions B.1-B.3), we have $L_Ks_p = \mathbb E_\rho[K_x\nabla\log p(x)] = -\mathbb E_\rho[\operatorname{div}_x K_x^T]$, where the divergence of $K_x^T$ is the vector-valued function whose $i$-th component is the divergence of the $i$-th column of $K_x^T$. An empirical estimate of $L_Ks_p$ is then available as $-\frac1M\sum_{m=1}^M\operatorname{div}_{x_m}K_{x_m}^T$, which leads to the following general formula of nonparametric score estimators:

$$\hat s^g_{p,\lambda} = -g_\lambda(\hat L_K)\,\hat\zeta, \qquad \hat\zeta := \frac1M\sum_{m=1}^M\operatorname{div}_{x_m}K_{x_m}^T. \tag{8}$$

### 3.2. Regularization Schemes

We now derive the final form of the estimator under three regularization schemes (Bauer et al., 2007). The choice of regularization affects the convergence rate of the estimator, which is studied in Sec. 4.

**Theorem 3.1** (Tikhonov regularization). *Let $\hat s^g_{p,\lambda}$ be defined as in (8), and $g_\lambda(\sigma) = (\sigma+\lambda)^{-1}$. Then*

$$\hat s^g_{p,\lambda}(x) = K_{xX}\,c - \hat\zeta(x)/\lambda, \tag{9}$$

*where $c$ is obtained by solving*

$$(\mathbf K + M\lambda I)\,c = h/\lambda. \tag{10}$$

*Here $c\in\mathbb R^{Md}$, $h = (\hat\zeta(x_1),\dots,\hat\zeta(x_M))\in\mathbb R^{Md}$, $K_{xX} = [K(x,x_1),\dots,K(x,x_M)]\in\mathbb R^{d\times Md}$, and $\mathbf K\in\mathbb R^{Md\times Md}$ is given by $\mathbf K_{(m-1)d+i,(\ell-1)d+j} = K(x_m,x_\ell)_{ij}$.*

The proof is given in Appendix C.4, where the general representer theorem (Sriperumbudur et al., 2017, Theorem A.2) is used to show that the solution lies in the subspace generated by

$$\{K_{x_m}c_m : m\in[M],\ c_m\in\mathbb R^d\}\cup\{\hat\zeta\}. \tag{11}$$
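For intuition, here is a sketch of the Tikhonov-regularized estimator of Theorem 3.1 specialized to the diagonal kernel $K(x,y)=k(x,y)I_d$ with an RBF $k$, for which the $Md\times Md$ system (10) factorizes into an $M\times M$ system with $d$ right-hand sides. The names (`rbf`, `zeta_hat`, `fit_tikhonov_score`) are ours, and this is an illustrative sketch rather than the paper's implementation.

```python
# Sketch of the Theorem 3.1 estimator with K(x, y) = k(x, y) I_d and RBF k.
import numpy as np

def rbf(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def zeta_hat(Xq, X, sigma=1.0):
    """zeta_hat(x)_j = (1/M) sum_m d k(x_m, x) / d x_m^{(j)} for each query x in Xq."""
    M = X.shape[0]
    K = rbf(X, Xq, sigma)                                 # K[m, i] = k(x_m, x_i)
    diff = X[:, None, :] - Xq[None, :, :]                 # (M, Nq, d)
    return -(K[:, :, None] * diff).sum(0) / (sigma ** 2 * M)   # (Nq, d)

def fit_tikhonov_score(X, lam=1e-3, sigma=1.0):
    M = X.shape[0]
    G = rbf(X, X, sigma)
    H = zeta_hat(X, X, sigma)                             # rows are zeta_hat(x_m)
    C = np.linalg.solve(G + M * lam * np.eye(M), H / lam) # Eq. (10), per output dim
    def score(Xq):                                        # Eq. (9): K_{xX} c - zeta_hat(x) / lam
        return rbf(Xq, X, sigma) @ C - zeta_hat(Xq, X, sigma) / lam
    return score

# Usage: for N(0, I) samples the estimated score should be roughly -x.
X = np.random.default_rng(1).normal(size=(400, 2))
s = fit_tikhonov_score(X, lam=0.05)
print(s(X[:3]), -X[:3])
```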
Unlike the Tikhonov regularizer, which shifts all eigenvalues simultaneously, the spectral cut-off regularization sets $g_\lambda(\sigma)=\sigma^{-1}$ for $\sigma\ge\lambda$ and $g_\lambda(\sigma)=0$ otherwise. To obtain such an estimator, we need the following lemma, which relates the spectral properties of $\mathbf K$ and $\hat L_K$.

**Lemma 3.2.** *Let $\sigma$ be a non-zero eigenvalue of $\frac1M\mathbf K$ such that $\frac1M\mathbf Ku = \sigma u$, where $u\in\mathbb R^{Md}$ is the unit eigenvector. Then $\sigma$ is an eigenvalue of $\hat L_K$ and the corresponding unit eigenfunction is $\frac{1}{\sqrt{M\sigma}}\sum_{m=1}^M K_{x_m}u^{(m)}$, where $u$ is split into $(u^{(1)},\dots,u^{(M)})$ with $u^{(i)}\in\mathbb R^d$.*

The lemma is a direct generalization of Rosasco et al. (2010, Proposition 9) to vector-valued operators.

**Theorem 3.3** (Spectral cut-off regularization). *Let $\hat s^g_{p,\lambda}$ be defined as in (8), and*

$$g_\lambda(\sigma) = \begin{cases}\sigma^{-1}, & \sigma > \lambda,\\ 0, & \sigma\le\lambda.\end{cases}$$

*Let $(\sigma_j,u_j)_{j\ge1}$ be the eigenvalue and eigenvector pairs that satisfy $\frac1M\mathbf Ku_j = \sigma_ju_j$. Then we have*

$$\hat s^g_{p,\lambda}(x) = -K_{xX}\sum_{j:\,\sigma_j>\lambda}\frac{u_ju_j^T}{M\sigma_j^2}\,h, \tag{12}$$

*where $K_{xX}$ and $h$ are defined as in Theorem 3.1.*

Apart from the above methods with closed-form solutions, early stopping of iterative solvers such as gradient descent can also play the role of regularization (Engl et al., 1996). Iterative methods replace the expensive inversion or eigendecomposition of the $Md\times Md$ kernel matrix with fast matrix-vector multiplications. In Sec. 3.5 we show that such methods can be further accelerated by exploiting the structure of our kernel matrix. We consider two iterative methods: the Landweber iteration and the ν-method (Engl et al., 1996). The Landweber iteration solves $\hat L_Ks_p = -\hat\zeta$ with the fixed-point iteration

$$\hat s^{(t+1)}_p := \hat s^{(t)}_p - \eta\,\big(\hat\zeta + \hat L_K\hat s^{(t)}_p\big), \tag{13}$$

where $\eta$ is a step-size parameter. It can be regarded as using the following regularization.

**Theorem 3.4** (Landweber iteration). *Let $\hat s^g_{p,\lambda}$ and $\hat s^{(t)}_p$ be defined as in (8) and (13), respectively. Let $\hat s^{(0)}_p = 0$ and $g_\lambda(\sigma) = \eta\sum_{i=0}^{t-1}(1-\eta\sigma)^i$, where $t := \lambda^{-1}$. Then $\hat s^g_{p,\lambda} = \hat s^{(t)}_p = -t\eta\hat\zeta + K_{xX}c_t$, where $c_0 = 0$, $c_{t+1} = (I - \eta\mathbf K/M)c_t + t\eta^2h/M$, and $\mathbf K$, $K_{xX}$, $h$ are defined as in Theorem 3.1.*

The Landweber iteration often requires a large number of iterations. An accelerated version of it is the ν-method, where ν is a parameter controlling the maximal convergence rate (see Sec. 4). The regularizer of the ν-method can be represented by a family of polynomials $g_\lambda(\sigma) = \mathrm{poly}(\sigma)$. These polynomials approximate $1/\sigma$ better than those of the Landweber iteration. As a result, the ν-method only requires a polynomial of degree $\lambda^{-1/2}$ to define $g_\lambda$, which significantly reduces the number of iterations (Engl et al., 1996; Bauer et al., 2007). The next iterate of the ν-method is generated from the current and the previous ones:

$$\hat s^{(t+1)}_p = \hat s^{(t)}_p + u_t\big(\hat s^{(t)}_p - \hat s^{(t-1)}_p\big) - \omega_t\big(\hat\zeta + \hat L_K\hat s^{(t)}_p\big),$$

where $u_t,\omega_t$ are carefully chosen constants (Engl et al., 1996, Algorithm 6.13). We describe the full algorithm in Example C.4 (Appendix C.4.3).
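The iterative schemes are easy to sketch. Below is a minimal illustration of the Landweber iteration (13) with a diagonal kernel, run directly on the score values at the sample points, where $\hat L_K$ acts as $\frac1M k(X,X)$ and the iteration count plays the role of the regularization parameter. The function name, default step size, and stopping heuristic are ours.

```python
# Sketch of the Landweber iteration (13) on the score values at the sample points:
#   S_{t+1} = S_t - eta * ((1/M) G @ S_t + H),
# with G the Gram matrix of k and H the (M, d) matrix whose rows are zeta_hat(x_m).
import numpy as np

def landweber_scores(G, H, eta=None, n_iter=100):
    """Returns (M, d) score estimates; early stopping (n_iter) acts as regularization."""
    M = G.shape[0]
    if eta is None:
        # step size below 1 / ||(1/M) G|| keeps the iteration stable
        eta = 0.9 * M / np.linalg.eigvalsh(G).max()
    S = np.zeros_like(H)
    for _ in range(n_iter):
        S = S - eta * (G @ S / M + H)
    return S

# Usage with the rbf / zeta_hat helpers sketched above (hypothetical names):
# G, H = rbf(X, X), zeta_hat(X, X)
# S_hat = landweber_scores(G, H, n_iter=50)   # fewer iterations ~ stronger regularization
```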

### 3.3. Hypothesis Spaces

In this framework, the hypothesis space is characterized by the matrix-valued kernel that induces the RKHS (Alvarez et al., 2012). Below we discuss two choices of the kernel: diagonal kernels are computationally more efficient, while curl-free kernels capture the conservative property of score vector fields.

**Diagonal Kernels.** The simplest way to define a diagonal matrix-valued kernel is $K(x,y) = k(x,y)I_d$, where $k:\mathcal X\times\mathcal X\to\mathbb R$ is a scalar-valued kernel. This induces a product RKHS $\mathcal H^d_k := \oplus_{i=1}^d\mathcal H_k$ in which all output dimensions of a function are independent. In this case the kernel matrix for $X = (x_1,\dots,x_M)$ is $\mathbf K = k(X,X)\otimes I_d$, where $k(X,X)$ denotes the Gram matrix of the scalar-valued kernel $k$. Therefore, the computational cost of matrix inversion and eigendecomposition is the same as in the scalar-valued case. On the other hand, the independence assumption may not hold for score functions, whose output dimensions are correlated as they form the gradient of the log density. As we shall see in Sec. 5, such misspecification of the hypothesis space degrades the performance in high dimensions.

**Curl-Free Kernels.** Noticing that score vector fields are gradient fields, we can use curl-free kernels (Fuselier, 2007; Macêdo & Castro, 2010) to capture this property. A curl-free kernel can be constructed from the negative Hessian of a translation-invariant kernel $k(x,y) = \phi(x-y)$:

$$K_{\mathrm{cf}}(x,y) := -\nabla^2\phi(x-y),$$

where $\phi:\mathcal X\to\mathbb R$ is twice continuously differentiable. It is easy to see that $K_{\mathrm{cf}}$ is positive definite. A nice property of $\mathcal H_{K_{\mathrm{cf}}}$ is that every element of it is the gradient of some function: the $j$-th column of $K_{\mathrm{cf}}$ is $\nabla(-\partial_j\phi)$, and each element of $\mathcal H_{K_{\mathrm{cf}}}$ is a linear combination of columns of $K_{\mathrm{cf}}$. We also note that the unnormalized log-density function can be recovered from the estimated score when using curl-free kernels (see Appendix C.3). The cost of inversion and eigendecomposition of the kernel matrix is $O(M^3d^3)$, compared to $O(M^3)$ for diagonal kernels.
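To make the curl-free construction concrete, the following sketch builds $K_{\mathrm{cf}}(x,y)=-\nabla^2\phi(x-y)$ for the RBF $\phi$ and assembles the $Md\times Md$ kernel matrix of Theorem 3.1. The names are ours, and the closed form of the Hessian is specific to the RBF choice.

```python
# Sketch of the curl-free kernel K_cf(x, y) = -Hessian phi(x - y) for the RBF
# phi(r) = exp(-||r||^2 / (2 sigma^2)), giving K_cf = (I / sigma^2 - r r^T / sigma^4) phi(r).
import numpy as np

def curl_free_kernel(x, y, sigma=1.0):
    """Returns the d x d block K_cf(x, y)."""
    r = x - y
    phi = np.exp(-r @ r / (2 * sigma ** 2))
    d = r.shape[0]
    return (np.eye(d) / sigma ** 2 - np.outer(r, r) / sigma ** 4) * phi

def curl_free_gram(X, sigma=1.0):
    """Stacks the d x d blocks into the Md x Md matrix K of Theorem 3.1."""
    M, d = X.shape
    K = np.zeros((M * d, M * d))
    for i in range(M):
        for j in range(M):
            K[i*d:(i+1)*d, j*d:(j+1)*d] = curl_free_kernel(X[i], X[j], sigma)
    return K

# Sanity check: the Gram matrix should be symmetric positive semi-definite.
X = np.random.default_rng(2).normal(size=(20, 3))
K = curl_free_gram(X)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-8)
```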

### 3.4. Examples

In the following we provide examples of nonparametric score estimators derived from the framework. We show that existing estimators can be recovered with certain types of kernels and regularization schemes (Table 1).

**Example 3.5** (KEF). Consider using curl-free kernels for the Tikhonov-regularized estimator in (9). Substituting $K_{\mathrm{cf}}(x,y) = -\nabla^2\phi(x-y)$ for $K$, we get

$$\hat s^g_{p,\lambda}(x) = -\nabla\Big(\sum_{m=1}^M\sum_{j=1}^d c_{(m-1)d+j}\,\partial_j\phi(x-x_m)\Big) - \frac{\hat\zeta_{\mathrm{cf}}(x)}{\lambda}, \qquad \hat\zeta_{\mathrm{cf}}(x)_i := \frac1M\sum_{m=1}^M\sum_{j=1}^d\partial_i\partial_j^2\phi(x-x_m).$$

Noticing that $K_{\mathrm{cf}}(x,y)_{ij} = -\partial_i\partial_j\phi(x-y) = \partial_i\partial_{j+d}k(x,y)$, we can check that the definition of $c$ here, which follows from (9), is the same as in (3). Thus, by comparing with (3), we have

$$\hat s^g_{p,\lambda}(x) = \nabla\hat f_{p,\lambda}(x) = \nabla\log p_{\hat f_{p,\lambda}}(x). \tag{14}$$

Therefore, the KEF estimator is equivalent to choosing curl-free kernels and the Tikhonov regularization in (8). We note that, although the solutions are equivalent, the space $\{\nabla\log p_f : f\in\mathcal H_k\}$ looks different from the curl-free RKHS constructed from the negative Hessian of $k$. Such equivalence of regularized minimization problems may be of independent interest.

**Example 3.6** (SSGE). Consider the estimator (12) obtained from the spectral cut-off regularization with the diagonal kernel $K(x,y) = k(x,y)I_d$. Then $\mathbf K = k(X,X)\otimes I_d$, which can be decomposed as $\sum_{m=1}^M\sum_{i=1}^d\lambda_m\,(w_mw_m^T\otimes e_ie_i^T)$, where $\{(\lambda_m,w_m)\}$ are the eigenpairs of $k(X,X)$ with $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_M$ and $\{e_i\}$ is the standard basis of $\mathbb R^d$. The estimator reduces to

$$\hat s^g_{p,\lambda}(x)_i = -M\,k(x,X)\sum_{j:\,\lambda_j/M>\lambda}\frac{w_jw_j^T}{\lambda_j^2}\,r_i, \tag{15}$$

where $r_i := (h_i, h_{d+i},\dots,h_{(M-1)d+i})$. When $\lambda$ is chosen such that exactly the $J$ largest eigenvalues are retained, simple calculations (see Appendix C.2) show that (15) equals the SSGE estimator $\hat s^g_{p,\lambda}(x)_i = -\frac1M\sum_{j=1}^J\sum_{m=1}^M\partial_i\hat\psi_j(x_m)\,\hat\psi_j(x)$, where $\hat\psi_j$ is defined as in (5). Therefore, SSGE is equivalent to choosing the diagonal kernel $K(x,y)=k(x,y)I_d$ and the spectral cut-off regularization in (8).

**Example 3.7** (Stein). We consider modifying the Tikhonov regularizer to $g_\lambda(\sigma) = (\lambda+\sigma)^{-1}\mathbb 1_{\{\sigma>0\}}$. In this case, we obtain the estimator $\hat s^g_{p,\lambda}(x) = -K_{xX}\,\mathbf K^{-1}\big(\tfrac1M\mathbf K+\lambda I\big)^{-1}h$ by Lemma C.2. At the sample points, the estimated score is $-\big(\tfrac1M\mathbf K+\lambda I\big)^{-1}h$, which coincides with the Stein gradient estimator. This suggests a principled out-of-sample extension of the Stein gradient estimator. To gain more insight, consider minimizing (6) in the subspace generated by $\{K_{x_m}c_m : m\in[M],\ c_m\in\mathbb R^d\}$. Compared with (11), the one-dimensional subspace $\mathbb R\hat\zeta$ is ignored. We can check (see Appendix C.2) that this is equivalent to using the modified regularizer above. Therefore, the Stein estimator is equivalent to using the diagonal kernel $K(x,y)=k(x,y)I_d$ and Tikhonov regularization with a one-dimensional subspace ignored.

All the above examples can be extended to use a subset of the samples via Nyström methods (Williams & Seeger, 2001). Specifically, we can modify the general formula in (8) as $\hat s^{g,Z}_{p,\lambda} = -g_\lambda(P_Z\hat L_KP_Z)P_Z\hat\zeta$, where $P_Z:\mathcal H_K\to\mathcal H_K$ is the projection onto the low-dimensional subspace generated by the subset $Z$. When the curl-free kernel and the same truncated Tikhonov regularizer as in Example 3.7 are used, this estimator is equivalent to the Nyström KEF (NKEF) (Sutherland et al., 2018). More details can be found in Appendix C.1.

### 3.5. Scalability

When using curl-free kernels, we need to deal with an $Md\times Md$ matrix. In this case, the Tikhonov and the spectral cut-off regularization cost $O(M^3d^3)$ and have difficulty scaling with the sample size and the input dimension. Fortunately, as the unifying perspective suggests, we can replace these regularization schemes with iterative methods that only require matrix-vector multiplications, e.g., the Landweber iteration and the ν-method (see Sec. 3.2).

Table 1. Existing nonparametric score estimators, their kernel types, and regularization schemes. $\phi$ is from $k(x,y)=\phi(x-y)$.

| Algorithm | Kernel | Regularizer |
|---|---|---|
| SSGE | $k(x,y)I_d$ | $\mathbb 1_{\{\sigma\ge\lambda\}}\,\sigma^{-1}$ |
| Stein | $k(x,y)I_d$ | $\mathbb 1_{\{\sigma>0\}}\,(\lambda+\sigma)^{-1}$ |
| KEF | $-\nabla^2\phi(x-y)$ | $(\lambda+\sigma)^{-1}$ |
| NKEF | $-\nabla^2\phi(x-y)$ | $\mathbb 1_{\{\sigma>0\}}\,(\lambda+\sigma)^{-1}$ |

Interestingly, we obtain further acceleration by utilizing the structure of curl-free kernels.

**Example 3.8** (Iterative curl-free estimators). When using a curl-free kernel $K_{\mathrm{cf}}$ constructed from a radial scalar-valued kernel $k(x,y) = \phi(\|x-y\|)$ (with a slight abuse of notation for $\phi$), each block $K_{\mathrm{cf}}(x,y)$ depends on $\mathbf r := x-y$ and $r := \|\mathbf r\|_2$ only through two scalar functions. In matrix-vector multiplications, for a vector $a\in\mathbb R^d$, $K_{\mathrm{cf}}(x,y)\,a$ can be computed as

$$K_{\mathrm{cf}}(x,y)\,a = \frac{\phi'(r)-r\,\phi''(r)}{r^3}\,(\mathbf r^Ta)\,\mathbf r \;-\; \frac{\phi'(r)}{r}\,a,$$

where only vector-vector multiplications are required, with time complexity $O(d)$ instead of the general $O(d^2)$. Thus, we only need $O(M^2d)$ time to compute $\mathbf Kb$ for any $b\in\mathbb R^{Md}$. In practice, we only need to store the samples used in computing $\mathbf Kb$ instead of constructing the whole kernel matrix, which reduces the memory usage from $O(M^2d^2)$ to $O(M^2d)$.

We note that the same idea as in Example 3.8 can be used to accelerate the KEF estimator if we adopt conjugate gradient methods (Van Loan & Golub, 1983) to solve (10), because we have shown that the KEF estimator is equivalent to our Tikhonov-regularized estimator with curl-free kernels. As we shall see in the experiments, this method is extremely fast in high dimensions.
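Below is a hedged sketch of the fast matrix-vector product of Example 3.8 for the curl-free RBF kernel; it never forms the $Md\times Md$ matrix and matches the $O(M^2d)$ time and memory discussed above. The names are ours.

```python
# Sketch of the Example 3.8 matvec for the curl-free RBF kernel:
#   K_cf(x_i, x_j) b_j = phi_ij * (b_j / sigma^2 - (r_ij^T b_j) r_ij / sigma^4),  r_ij = x_i - x_j.
import numpy as np

def curl_free_matvec(X, B, sigma=1.0):
    """X: (M, d) samples, B: (M, d) = b reshaped row-wise. Returns (M, d) = (K b) reshaped."""
    diff = X[:, None, :] - X[None, :, :]                   # r_ij, shape (M, M, d)
    phi = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))  # (M, M)
    rTb = (diff * B[None, :, :]).sum(-1)                   # (M, M), r_ij^T b_j
    return phi @ B / sigma ** 2 - ((phi * rTb)[:, :, None] * diff).sum(1) / sigma ** 4

# Usage: agrees with the dense construction (curl_free_gram above) up to numerical error,
# i.e., curl_free_matvec(X, B).ravel() ~= curl_free_gram(X) @ B.ravel().
rng = np.random.default_rng(3)
X, B = rng.normal(size=(50, 4)), rng.normal(size=(50, 4))
print(curl_free_matvec(X, B).shape)
```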

## 4. Theoretical Properties

In this section, we give a general theorem on the convergence rate of score estimators in our framework, which provides a tighter error bound for SSGE (Shi et al., 2018). We also investigate the case where the samples are corrupted by a small set of points, and provide the convergence rate of the heuristic out-of-sample extension proposed by Li & Turner (2018). Proofs and assumptions are deferred to Appendix B. First, we follow Bauer et al. (2007); Baldassarre et al. (2012) to characterize the regularizer.

**Definition 4.1** (Bauer et al. (2007)). We say a family of functions $g_\lambda:[0,\kappa^2]\to\mathbb R$, $0<\lambda\le\kappa^2$, is a regularizer if there are constants $B$, $D$, $\gamma$ such that $\sup_{0<\sigma\le\kappa^2}|\sigma g_\lambda(\sigma)|\le D$, $\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)|\le B/\lambda$, and $\sup_{0<\sigma\le\kappa^2}|1-\sigma g_\lambda(\sigma)|\le\gamma$. The qualification of $g_\lambda$ is the maximal $\bar r$ such that $\sup_{0<\sigma\le\kappa^2}|1-\sigma g_\lambda(\sigma)|\,\sigma^{\bar r}\le\gamma_{\bar r}\lambda^{\bar r}$, where $\gamma_{\bar r}$ does not depend on $\lambda$.

Now, we can use the idea of Bauer et al. (2007, Theorem 10) to obtain an error bound for our estimator.

**Theorem 4.2.** *Assume Assumptions B.1-B.5 hold. Let $\bar r$ be the qualification of the regularizer $g_\lambda$, and let $\hat s^g_{p,\lambda}$ be defined as in (8). Suppose there exists $f_0\in\mathcal H_K$ such that $s_p = L_K^rf_0$ for some $r\in[0,\bar r]$. Then for $\lambda = M^{-\frac{1}{2r+2}}$ we have*

$$\|\hat s^g_{p,\lambda}-s_p\|_{\mathcal H_K} = O_p\big(M^{-\frac{r}{2r+2}}\big),$$

*and for $r\in[0,\bar r-1/2]$ we have*

$$\|\hat s^g_{p,\lambda}-s_p\|_\rho = O_p\big(M^{-\frac{r+1/2}{2r+2}}\big),$$

*where $O_p$ is the big-O notation in probability.*

Note that the qualification bounds the maximal convergence rate. As the qualification of Tikhonov regularization is 1, we observe from the error bound the well-known saturation phenomenon of Tikhonov regularization (Engl et al., 1996): the convergence rate does not improve even if $s_p = L_K^rf_0$ for $r>1$. To alleviate this, we can choose a regularizer with a larger qualification. For example, the spectral cut-off regularization and the Landweber iteration have qualification $\infty$, and the ν-method has qualification $\nu$. This suggests that the ν-method is appealing, as it needs fewer iterations than the Landweber iteration and has a better maximal convergence rate than Tikhonov regularization.

**Remark 4.3** (Stein). The consistency and convergence rate of the Stein estimator and its out-of-sample extension suggested in Example 3.7 follow from Theorem 4.2. The rate in $\|\cdot\|_\rho$ is $O_p(M^{-\theta_1})$, where $\theta_1\in[1/4,1/3]$. The convergence rate of the original out-of-sample extension of Li & Turner (2018) is given in Corollary 4.7.

**Remark 4.4** (SSGE). From Theorem 4.2, the convergence rate in $\|\cdot\|_\rho$ of SSGE is $O_p(M^{-\theta_2})$, where $\theta_2\in[1/4,1/2)$, which improves Shi et al. (2018, Theorem 2). To see this, assume the eigenvalues of $L_K$ are $\mu_1>\mu_2>\cdots$ and decay as $\mu_J = O(J^{-\beta})$. The error bound provided by Shi et al. (2018) is

$$\|\hat s_{p,\lambda}-s_p\|_\rho^2 = O_p\Big(\frac{1}{\mu_J(\mu_J-\mu_{J+1})^2M} + \mu_J\Big).$$

We can choose $J = M^{\frac{1}{4(\beta+1)}}$ to obtain $\|\hat s_{p,\lambda}-s_p\|_\rho = O_p\big(M^{-\frac{\beta}{8(\beta+1)}}\big)$. This convergence rate is slower than $O_p(M^{-1/4})$, the worst case of Theorem 4.2.

**Remark 4.5** (KEF). Compared with Theorem 7(ii) in Sriperumbudur et al. (2017), where they bound the Fisher divergence, which is the square of the $\|\cdot\|_\rho$ norm in our Theorem 4.2, we see that the two results are exactly the same. The rate in this norm is $O_p(M^{-\theta_3})$, where $\theta_3\in[1/4,1/3]$.
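The ranges of $\theta_1$, $\theta_2$, $\theta_3$ quoted above follow from evaluating the $\|\cdot\|_\rho$ rate of Theorem 4.2 over the admissible smoothness range of each regularizer; the short note below spells this out and adds nothing beyond the theorem.

```latex
% rho-norm rate of Theorem 4.2 for smoothness r in [0, \bar r - 1/2]:
\[
\|\hat s^{g}_{p,\lambda}-s_p\|_{\rho}=O_p\!\big(M^{-\frac{r+1/2}{2r+2}}\big).
\]
% Tikhonov regularization (KEF, Stein): qualification \bar r = 1, so r \in [0, 1/2] and
\[
\tfrac{r+1/2}{2r+2}\in\big[\tfrac{1}{4},\tfrac{1}{3}\big]\quad(\theta_1,\ \theta_3).
\]
% Spectral cut-off (SSGE) and Landweber: qualification \bar r = \infty, so r \in [0,\infty) and
\[
\tfrac{r+1/2}{2r+2}\in\big[\tfrac{1}{4},\tfrac{1}{2}\big)\quad(\theta_2).
\]
```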
Next, we consider the case where the estimator is not obtained from i.i.d. samples. Specifically, we consider how the convergence rate is affected when the data is a mixture of a set of i.i.d. samples and a set of fixed points.

**Theorem 4.6.** *Under the same assumptions as Theorem 4.2, define $g_\lambda(\sigma) := (\lambda+\sigma)^{-1}$, and choose $Z := \{z_n\}_{n\in[N]}\subset\mathcal X$. Let $Y := \{y_m\}_{m\in[M]}$ be a set of i.i.d. samples drawn from $\rho$, and let $\hat s_{p,\lambda,Z}$ be defined as in (8) with $X = Z\cup Y$. Suppose $N = O(M^\alpha)$. Then for $\lambda = M^{-\frac{1}{2r+2}}$ we have*

$$\sup_Z\ \|\hat s_{p,\lambda,Z}-s_p\|_{\mathcal H_K} = O_p\big(M^{-\frac{r}{2r+2}}\big) + O\big(M^{\alpha-\frac{r}{r+1}}\big),$$

*where the $\sup_Z$ is taken over all $\{z_n\}_{n\in[N]}\subset\mathcal X$.*

*Proof outline.* Define $T_Z := \frac1N S_Z^*S_Z$, where $S_Z$ is the sampling operator, and let $\hat s_{p,\lambda}$ be defined as in (8) with $X = Y$. We can write the estimator as $\hat s_{p,\lambda,Z} := g_\lambda(\hat L_K+R_Z)(\hat L_K+R_Z)s_p$, where $R_Z := \frac{N}{M+N}(T_Z-\hat L_K)$, and bound $\|\hat s_{p,\lambda,Z}-\hat s_{p,\lambda}\|$ by $\|(g_\lambda(\hat L_K+R_Z)-g_\lambda(\hat L_K))\hat L_Ks_p\| + \|g_\lambda(\hat L_K+R_Z)R_Zs_p\|$. It can be shown that the first term is $O(NM^{-1}\lambda^{-2})$ and the second term is $O(NM^{-1}\lambda^{-1})$. Combining these with Theorem 4.2 finishes the proof.

From Theorem 4.6, we see that the convergence rate is not affected when the data is corrupted by at most $O(M^{\frac{r}{2r+2}})$ points. Under the same notation, the out-of-sample extension of the Stein estimator proposed in Li & Turner (2018) can be written as $\hat s_{p,\lambda,x}(x)$, which corrupts the i.i.d. data by a single test point. We then obtain the following bound for this estimator.

**Corollary 4.7.** *With the same assumptions and notation as Theorem 4.6, we have*

$$\sup_{x\in\mathcal X}\ \|\hat s_{p,\lambda,x}(x)-s_p(x)\|_2 = O_p\big(M^{-\frac{r}{2r+2}}\big).$$

## 5. Experiments

We evaluate our estimators on both synthetic and real data. In Sec. 5.1, we consider a challenging grid distribution, as in the experiments of Sutherland et al. (2018), to test the accuracy of nonparametric score estimators in high dimensions and at out-of-sample points. In Sec. 5.2, we train Wasserstein autoencoders (WAE) with score estimation and compare the accuracy and efficiency of different estimators. We mainly compare the following score estimators¹:

- Existing nonparametric estimators: Stein (Li & Turner, 2018), SSGE (Shi et al., 2018), KEF (Sriperumbudur et al., 2017), and its low-rank approximation NKEFα (Sutherland et al., 2018), where α indicates that αM/10 samples are used.
- Parametric estimators: in the WAE experiment, we also consider the sliced score matching (SSM) estimator (Song et al., 2019), which is a parametric method and requires amortized training.
- Proposed: the iterative curl-free estimator with the ν-method, and the conjugate gradient version of the KEF estimator (KEF-CG), both described in Sec. 3.5.

¹ Code is available at https://github.com/miskcoo/kscore.

### 5.1. Synthetic Distributions

We follow Sutherland et al. (2018, Sec. 5.1) to construct a $d$-dimensional grid distribution. It is a mixture of $d$ standard Gaussian distributions centered at $d$ fixed vertices of the unit hypercube. We vary $d$ and $M$ respectively to test the accuracy and the convergence of score estimators, and use 1024 samples from the grid distribution to evaluate the $\ell_2$ error. We report the results of 32 runs in Fig. 1.

*Figure 1. Normalized distance $\mathbb E[\|s_p-\hat s_{p,\lambda}\|_2^2]/d$ on grid data, comparing Stein, SSGE, KEF-CG, NKEF4, and the ν-method. In the first row, $M$ is fixed (panels (a) $M=128$ and (b) $M=512$) and $d$ varies; in the second row, $d$ is fixed (panel (d) $d=128$) and $M$ varies. Shaded areas are three times the standard deviation.*

We can see that the effect of the hypothesis space is significant. The diagonal kernels used in SSGE and Stein degrade the accuracy in high dimensions, while curl-free kernels provide better performance. In low dimensions, all estimators are comparable, and the computational cost of diagonal kernels is lower than that of curl-free kernels.
This suggests favoring diagonal kernels in low dimensions. Possibly because this dataset does not make the convergence rate saturate, we find that different regularization schemes produce similar results. The iterative score estimator based on the ν-method is among the best, and KEF-CG closely tracks it even for large $d$ and $M$.

### 5.2. Wasserstein Autoencoders

The Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) is a latent variable model $p(z,x)$ with observed and latent variables $x\in\mathcal X$ and $z\in\mathcal Z$, respectively. $p(z,x)$ is defined by a prior $p(z)$ and a distribution of $x$ conditioned on $z$, and can be written as $p(z,x) = p(z)p_\theta(x|z)$. WAEs aim to minimize the Wasserstein distance $W_c(p_X,p_G)$ between the data distribution $p_X(x)$ and $p_G(x) := \int p(z,x)\,dz$, where $c$ is a metric on $\mathcal X$. Tolstikhin et al. (2017) showed that when $p_\theta(x|z)$ maps $z$ to $x$ deterministically by a function $G:\mathcal Z\to\mathcal X$, it suffices to minimize $\mathbb E_{p_X(x)}\mathbb E_{q_\phi(z|x)}[\|x-G(z)\|_2^2] + \lambda D(q_\phi(z),p(z))$, where $D$ is a divergence between two distributions and $q_\phi(z|x)$ is a parametric approximation of the posterior. When we choose $D$ to be the KL divergence, the entropy term of $q_\phi(z) := \int q_\phi(z|x)p_X(x)\,dx$ in the loss function is intractable (Song et al., 2019). If $z$ is parameterized by $f_\phi(x)$ with $x\sim p_X$, the gradient of the entropy can be estimated using score estimators via $\mathbb E_{p_X(x)}[\nabla_z\log q_\phi(z)\,\nabla_\phi f_\phi(x)]$ (Shi et al., 2018; Song et al., 2019).

We train WAEs on MNIST and CelebA and repeat each configuration 3 times. The average negative log-likelihoods for MNIST, estimated by AIS (Neal, 2001), are reported in Table 2. The results for CelebA are reported in Appendix A.

Table 2. Negative log-likelihoods on MNIST and per-epoch time at 128 latent dimensions. All models are timed on a GeForce GTX TITAN X GPU.

| Latent dim | 8 | 32 | 64 | 128 | Time |
|---|---|---|---|---|---|
| Stein | 97.15 | 92.10 | 101.60 | 114.41 | 4.2s |
| SSGE | 97.24 | 92.24 | 101.92 | 114.57 | 9.2s |
| KEF | 97.07 | 90.93 | 91.58 | 92.40 | 201.1s |
| NKEF2 | 97.71 | 92.29 | 92.82 | 94.14 | 36.4s |
| NKEF4 | 97.59 | 91.19 | 91.80 | 92.94 | 97.5s |
| NKEF8 | 97.23 | 90.86 | 92.39 | 92.49 | 301.2s |
| KEF-CG | 97.39 | 90.77 | 92.66 | 92.05 | 13.7s |
| ν-method | 97.28 | 90.94 | 91.48 | 92.10 | 78.1s |
| SSM | 96.98 | 89.06 | 93.06 | 96.92 | 6.0s |

We can see that the performance of these estimators is close in low latent dimensions, and the parametric method is slightly better than the nonparametric ones, as continuously generated samples are available during training. However, in high dimensions, estimators based on curl-free kernels significantly outperform those based on diagonal kernels and the parametric method. This is probably due to the guarantee that the estimates at all locations form a gradient field. As discussed in Sec. 3.5, curl-free kernels are computationally expensive, as shown in Table 2 by the running time of the original KEF algorithm. Comparing the time and performance of NKEFα with α = 2, 4, 8, we see that in order to obtain a meaningful speed-up in high dimensions, the low-rank approximation has to sacrifice performance, and it is outperformed by the iterative curl-free estimator based on the ν-method. KEF-CG is the fastest curl-free method in high dimensions, while its performance is comparable to the original KEF. Fig. 2a shows the training time of KEF-CG in different latent dimensions.

*Figure 2. (a) Computational cost (time per epoch vs. latent dimension) of KEF-CG for $\lambda = 10^{-5}$ on MNIST; (b) the ratio of the maximum and minimum eigenvalues of the kernel matrices.*

Surprisingly, the speed rapidly increases with the latent dimension and then flattens out. The convergence rate of conjugate gradient is determined by the condition number, which suggests that the kernel matrix $\mathbf K$ becomes well-conditioned in high dimensions (see Fig. 2b). We found that with large $d$, SSGE required at least 97% of the eigenvalues to attain the reported likelihood.
We also ran SSGE with curl-free kernels and found that only 13% of the eigenvalues are required to attain a comparable result when $d = 8$. From these observations, a possible reason why diagonal kernels degrade the performance in high dimensions is that the distribution is complicated while the hypothesis space is simple, so the small number of eigenfunctions is insufficient to approximate the target. This can also be observed in Fig. 1, where the performance of diagonal kernels and curl-free kernels becomes closer as $M$ increases, since more eigenfunctions are provided.

## 6. Conclusion

Our contributions are twofold. Theoretically, we present a unifying view of nonparametric score estimators and clarify the relationships between existing estimators. Under this perspective, we provide a unified convergence analysis of existing estimators, which improves existing error bounds. Practically, we propose an iterative curl-free estimator with nice theoretical properties and computational benefits, and develop a fast conjugate gradient solver for the KEF estimator.

## Acknowledgements

We thank Ziyu Wang for reading an earlier version of this manuscript and providing valuable feedback. This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), the Tsinghua-Huawei Joint Research Program, the Tsinghua Institute for Guo Qiang, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. JS was also supported by a Microsoft Research Asia Fellowship.

## References

Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
Baker, C. T. The numerical treatment of integral equations. 1977.
Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Multi-output learning via spectral filtering. Machine Learning, 87(3):259–301, 2012.
Bauer, F., Pereverzev, S., and Rosasco, L. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
Canu, S. and Smola, A. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, 2006.
Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning, pp. 2606–2615, 2016.
De Vito, E., Rosasco, L., and Toigo, A. Learning sets with separating kernels. Applied and Computational Harmonic Analysis, 37(2):185–217, 2014.
Engl, H. W., Hanke, M., and Neubauer, A. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.
Fukumizu, K. Exponential manifold by reproducing kernel Hilbert spaces. Algebraic and Geometric Methods in Statistics, pp. 291–306, 2009.
Fuselier, E. J. Refined error estimates for matrix-valued radial basis functions. PhD thesis, Texas A&M University, 2007.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226–234, 2015.
Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
Li, Y. and Turner, R. E. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276–284, 2016.
Macêdo, I. and Castro, R. Learning divergence-free and curl-free vector fields with matrix-valued kernels. IMPA, 2010.
Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
Rosasco, L., Belkin, M., and De Vito, E. On learning with integral operators. Journal of Machine Learning Research, 11(Feb):905–934, 2010.
Saremi, S. and Hyvärinen, A. Neural empirical Bayes. Journal of Machine Learning Research, 20:1–23, 2019.
Sasaki, H., Hyvärinen, A., and Sugiyama, M. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19–34. Springer, 2014.
Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, pp. 4651–4660, 2018.
Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.
Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088, 2019.
Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.
Stein, C. M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151, 1981.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pp. 955–963, 2015.
Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian neural networks. In International Conference on Learning Representations, 2019.
Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with Nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pp. 652–660, 2018.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
Van Loan, C. F. and Golub, G. H. Matrix Computations. Johns Hopkins University Press, 1983.
Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
De Vito, E., Rosasco, L., Caponnetto, A., De Giovannini, U., and Odone, F. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(May):883–904, 2005.
Warde-Farley, D. and Bengio, Y. Improving generative adversarial networks with denoising feature matching. In International Conference on Learning Representations, 2016.
Wen, L., Zhou, Y., He, L., Zhou, M., and Xu, Z. Mutual information gradient estimation for representation learning. In International Conference on Learning Representations, 2020.
Williams, C. K. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pp. 682–688, 2001.