Random Feature Stein Discrepancies

Jonathan H. Huggins (Department of Biostatistics, Harvard), jhuggins@mit.edu
Lester Mackey (Microsoft Research New England), lmackey@microsoft.com
Contributed equally. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

Computable Stein discrepancies have been deployed for a variety of applications, ranging from sampler selection in posterior inference to approximate Bayesian inference to goodness-of-fit testing. Existing convergence-determining Stein discrepancies admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time Stein discrepancies have been proposed for goodness-of-fit testing, they exhibit avoidable degradations in testing power, even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (ΦSDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct ΦSDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations, random ΦSDs (RΦSDs), which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, RΦSDs perform as well as or better than quadratic-time KSDs while being orders of magnitude faster to compute.

1 Introduction

Motivated by the intractable integration problems arising from Bayesian inference and maximum likelihood estimation [9], Gorham and Mackey [10] introduced the notion of a Stein discrepancy as a quality measure that can potentially be computed even when direct integration under the distribution of interest is unavailable. Two classes of computable Stein discrepancies, the graph Stein discrepancy [10, 12] and the kernel Stein discrepancy (KSD) [7, 11, 19, 21], have since been developed to assess and tune Markov chain Monte Carlo samplers, test goodness-of-fit, train generative adversarial networks and variational autoencoders, and more [7, 10–12, 16–19, 27]. In practice, however, the cost of these Stein discrepancies grows quadratically in the size of the sample being evaluated, limiting scalability. Jitkrittum et al. [16] introduced a special form of KSD, the finite-set Stein discrepancy (FSSD), to test goodness-of-fit in linear time. However, even after an optimization-based preprocessing step to improve power, the proposed FSSD experiences an unnecessary degradation of power relative to quadratic-time tests in higher dimensions.

To address the distinct shortcomings of existing linear- and quadratic-time Stein discrepancies, we introduce a new class of Stein discrepancies we call feature Stein discrepancies (ΦSDs). We show how to construct ΦSDs that provably determine the convergence of a sample to its target, thus making them attractive for goodness-of-fit testing, measuring sample quality, and other applications. We then introduce a fast importance-sampling-based approximation we call random ΦSDs (RΦSDs). We provide conditions under which, with an appropriate choice of proposal distribution, an RΦSD is close in relative error to the ΦSD with high probability. Using an RΦSD, we show how, for any γ > 0, we can compute O_P(N^{-1/2})-precision estimates of a ΦSD in O(N^{1+γ}) (near-linear) time when the ΦSD precision is Ω(N^{-1/2}). Additionally, to enable applications to goodness-of-fit testing,
we (1) show how to construct RΦSDs that can distinguish between arbitrary distributions and (2) describe the asymptotic null distribution when the sample points are generated i.i.d. from an unknown distribution. In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter selection and fast goodness-of-fit testing, we obtain high-quality results that are comparable to or better than those produced by quadratic-time KSDs, using only ten features and orders of magnitude less computation.

Notation  For measures µ₁, µ₂ on ℝ^D and functions f : ℝ^D → ℂ and k : ℝ^D × ℝ^D → ℂ, we let µ₁(f) := ∫ f(x) µ₁(dx), (µ₁k)(x′) := ∫ k(x, x′) µ₁(dx), and (µ₁ × µ₂)(k) := ∫∫ k(x₁, x₂) µ₁(dx₁) µ₂(dx₂). We denote the generalized Fourier transform of f by f̂ or F(f) and its inverse by F⁻¹(f). For r ≥ 1, let L^r := {f : ‖f‖_{L^r} := (∫ |f(x)|^r dx)^{1/r} < ∞}, and let C^n denote the space of n-times continuously differentiable functions. We let ⇒ and →_P denote convergence in distribution and in probability, respectively. We let ā denote the complex conjugate of a. For D ∈ ℕ, define [D] := {1, . . . , D}. The symbol ≳ indicates greater than up to a universal constant.

2 Feature Stein discrepancies

When exact integration under a target distribution P is infeasible, one often appeals to a discrete measure Q_N = (1/N) Σ_{n=1}^N δ_{x_n} to approximate expectations, where the sample points x₁, . . . , x_N ∈ ℝ^D are generated from a Markov chain or quadrature rule. The aim in sample quality measurement is to quantify how well Q_N approximates the target in a manner that (a) recognizes when a sample sequence is converging to the target, (b) highlights when a sample sequence is not converging to the target, and (c) is computationally efficient. It is natural to frame this comparison in terms of an integral probability metric (IPM) [20], d_H(Q_N, P) := sup_{h∈H} |Q_N(h) − P(h)|, measuring the maximum discrepancy between target and sample expectations over a class of test functions. However, when generic integration under P is intractable, standard IPMs like the 1-Wasserstein distance and Dudley metric may not be efficiently computable. To address this need, Gorham and Mackey [10] introduced the Stein discrepancy framework for generating IPM-type quality measures with no explicit integration under P. For any approximating probability measure µ, each Stein discrepancy takes the form

d_{T,G}(µ, P) = sup_{g∈G} |µ(T g)|, where P(T g) = 0 for all g ∈ G.

Here, T is an operator that generates mean-zero functions under P, and G is the Stein set of functions on which T operates. For concreteness, we will assume that P has a C¹ density p with support ℝ^D and restrict our attention to the popular Langevin Stein operator [10, 21] defined by T g := Σ_{d=1}^D T_d g_d for (T_d g_d)(x) := p(x)⁻¹ ∂_{x_d}(p(x) g_d(x)) and g : ℝ^D → ℝ^D. To date, two classes of computable Stein discrepancies with strong convergence-determining guarantees have been identified. The graph Stein discrepancies [10, 12] impose smoothness constraints on the functions g and are computed by solving a linear program, while the kernel Stein discrepancies [7, 11, 19, 21] define G as the unit ball of a reproducing kernel Hilbert space and are computed in closed form. Both classes, however, suffer from a computational cost that grows quadratically in the number of sample points. Our aim is to develop alternative discrepancy measures that retain the theoretical and practical benefits of existing Stein discrepancies at a greatly reduced computational cost.
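To make the Langevin Stein operator concrete, here is a minimal sketch (ours, not the authors' code) of evaluating (T g)(x) via the equivalent form (T_d g_d)(x) = ∂_{x_d} log p(x) g_d(x) + ∂_{x_d} g_d(x), which requires only the score ∇ log p and hence no normalizing constant of p; the standard-Gaussian target and the test function g are illustrative placeholders.

```python
import numpy as np

# Illustrative standard-Gaussian target: score(x) = grad log p(x) = -x.
score = lambda x: -x

def langevin_stein(score, g, div_g, x):
    """Evaluate (T g)(x) = sum_d [ d_{x_d} log p(x) * g_d(x) + d_{x_d} g_d(x) ],
    which equals sum_d p(x)^{-1} d_{x_d}(p(x) g_d(x))."""
    return np.dot(score(x), g(x)) + div_g(x)

# Example test function g(x) = x, whose divergence is the dimension D.
g = lambda x: x
div_g = lambda x: float(x.size)

x = np.array([0.3, -1.2])
print(langevin_stein(score, g, div_g, x))  # averages to zero over x ~ P
```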
Our strategy is to identify a family of convergence-determining discrepancy measures that can be accurately and inexpensively approximated with random sampling. To this end, we define a new domain for the Stein operator centered around a feature function Φ : ℝ^D × ℝ^D → ℂ which, for some r ∈ [1, ∞) and all x, z ∈ ℝ^D, satisfies Φ(x, ·) ∈ L^r and Φ(·, z) ∈ C¹:

G_{Φ,r} := { g : ℝ^D → ℝ^D | g_d(x) = ∫ Φ(x, z) f_d(z) dz with Σ_{d=1}^D ‖f_d‖²_{L^s} ≤ 1 }, for s = r/(r − 1).

When combined with the Langevin Stein operator T, this feature Stein set gives rise to a feature Stein discrepancy (ΦSD) with an appealing explicit form:

ΦSD²_{Φ,r}(µ, P) := sup_{g∈G_{Φ,r}} |µ(T g)|² = sup_{g∈G_{Φ,r}} (Σ_{d=1}^D µ(T_d g_d))² = sup_{f : v_d=‖f_d‖_{L^s}, ‖v‖₂≤1} (Σ_{d=1}^D ∫ µ(T_d Φ)(z) f_d(z) dz)² = sup_{v : ‖v‖₂≤1} (Σ_{d=1}^D ‖µ(T_d Φ)‖_{L^r} v_d)² = Σ_{d=1}^D ‖µ(T_d Φ)‖²_{L^r}.   (1)

In Section 3.1, we will show how to select the feature function Φ and order r so that ΦSD_{Φ,r} provably determines convergence, in line with our desiderata (a) and (b). To achieve efficient computation, we will approximate the ΦSD in expression (1) using an importance sample of size M drawn from a proposal distribution with (Lebesgue) density ν. We call the resulting stochastic discrepancy measure a random ΦSD (RΦSD):

RΦSD²_{Φ,r,ν,M}(µ, P) := Σ_{d=1}^D ( (1/M) Σ_{m=1}^M ν(Z_m)⁻¹ |µ(T_d Φ)(Z_m)|^r )^{2/r} for Z₁, . . . , Z_M i.i.d. ∼ ν.

Importantly, when µ is the sample approximation Q_N, the RΦSD can be computed in O(MN) time by evaluating the MND rescaled random features (T_d Φ)(x_n, Z_m)/ν(Z_m)^{1/r}; this computation is also straightforwardly parallelized. In Section 3.2, we will show how to choose ν so that RΦSD_{Φ,r,ν,M} approximates ΦSD_{Φ,r} with small relative error.

Special cases  When r = 2, the ΦSD is an instance of a kernel Stein discrepancy (KSD) with base reproducing kernel k(x, y) = ∫ Φ(x, z) \overline{Φ(y, z)} dz. This follows from the definition [7, 11, 19, 21]

KSD_k(µ, P)² := Σ_{d=1}^D (µ × µ)((T_d ⊗ T_d)k) = Σ_{d=1}^D ‖µ(T_d Φ)‖²_{L²} = ΦSD_{Φ,2}(µ, P)².

However, we will see in Sections 3 and 5 that there are significant theoretical and practical benefits to using ΦSDs with r ≠ 2: namely, we will be able to approximate ΦSD_{Φ,r} with r ≠ 2 more effectively with a smaller sampling budget. If Φ(x, z) = e^{−i⟨z,x⟩} Ψ̂(z)^{1/2} and ν ∝ Ψ̂ for Ψ ∈ L², then RΦSD_{Φ,2,ν,M} is the random Fourier feature (RFF) approximation [22] to KSD_k with k(x, y) = Ψ(x − y). Chwialkowski et al. [6, Prop. 1] showed that the RFF approximation can be an undesirable choice of discrepancy measure, as there exist uncountably many pairs of distinct distributions that, with high probability, cannot be distinguished by the RFF approximation. Following Chwialkowski et al. [6] and Jitkrittum et al. [16], we show how to select Φ and ν to avoid this property in Section 4. The random finite set Stein discrepancy [FSSD-rand, 16] with proposal ν is an RΦSD_{Φ,2,ν,M} with Φ(x, z) = f(x, z) ν(z)^{1/2} for f a real analytic and C₀-universal [4, Def. 4.1] reproducing kernel. In Section 3.1, we will see that features Φ of a different form give rise to strong convergence-determining properties.

3 Selecting a Random Feature Stein Discrepancy

In this section, we provide guidance for selecting the components of an RΦSD to achieve our theoretical and computational goals. We first discuss the choice of the feature function Φ and order r and then turn our attention to the proposal distribution ν. Finally, we detail two practical choices of RΦSD that will be used in our experiments. To ease notation, we will present theoretical guarantees in terms of the sample measure Q_N, but all results continue to hold if any approximating probability measure µ is substituted for Q_N.
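To make the O(MN) feature computation of Section 2 concrete, the following sketch (ours; the Gaussian target, Gaussian-shaped feature function Φ, and Gaussian proposal ν are illustrative placeholders, not the choices studied in this paper) evaluates RΦSD²_{Φ,r,ν,M}(Q_N, P) directly from its definition.

```python
import numpy as np

def rphi_sd2(X, score, Phi, dPhi, nu_pdf, Z, r):
    """Sketch of RPhiSD^2_{Phi,r,nu,M}(Q_N, P)
       = sum_d ( (1/M) sum_m |Q_N(T_d Phi)(Z_m)|^r / nu(Z_m) )^{2/r},
    where Q_N(T_d Phi)(z) = (1/N) sum_n [score_d(x_n) Phi(x_n, z) + dPhi_d(x_n, z)]."""
    S = score(X)                                        # (N, D) score of the target at each x_n
    stats = []
    for z, w in zip(Z, nu_pdf(Z)):                      # O(M N D) total work
        phi = Phi(X, z)                                 # (N,)  feature values Phi(x_n, z)
        dphi = dPhi(X, z)                               # (N, D) d/dx_d Phi(x_n, z)
        TdPhi = (S * phi[:, None] + dphi).mean(axis=0)  # (D,)  Q_N(T_d Phi)(z)
        stats.append(np.abs(TdPhi) ** r / w)
    return np.sum(np.mean(stats, axis=0) ** (2.0 / r))

# Illustrative placeholders: standard-Gaussian target, Gaussian feature, Gaussian proposal.
score  = lambda X: -X
Phi    = lambda X, z: np.exp(-0.5 * np.sum((X - z) ** 2, axis=1))
dPhi   = lambda X, z: -(X - z) * Phi(X, z)[:, None]
nu_pdf = lambda Z: np.exp(-0.5 * np.sum(Z ** 2, axis=1)) / (2 * np.pi) ** (Z.shape[1] / 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))   # sample whose quality we assess
Z = rng.normal(size=(10, 2))     # M = 10 importance sample points drawn from nu
print(rphi_sd2(X, score, Phi, dPhi, nu_pdf, Z, r=2))
```

With N = 1000 sample points and M = 10 features, the cost is NMD feature evaluations, in line with the O(MN) claim above.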
3.1 Selecting a feature function Φ

A principal concern in selecting a feature function is ensuring that the ΦSD detects non-convergence, that is, Q_N ⇒ P whenever ΦSD_{Φ,r}(Q_N, P) → 0. To ensure this, we will construct ΦSDs that upper bound a reference KSD known to detect non-convergence. This is enabled by the following inequality, proved in Appendix A.

Proposition 3.1 (KSD-ΦSD inequality). If k(x, y) = ∫ F(Φ(x, ·))(ω) \overline{F(Φ(y, ·))(ω)} ρ(ω) dω, r ∈ [1, 2], and ρ ∈ L^t for t = r/(2 − r), then

KSD²_k(Q_N, P) ≤ ‖ρ‖_{L^t} ΦSD²_{Φ,r}(Q_N, P).   (2)

Our strategy is to first pick a KSD that detects non-convergence and then choose Φ and r such that (2) applies. Unfortunately, KSDs based on many common base kernels, like the Gaussian and Matérn, fail to detect non-convergence when D > 2 [11, Thm. 6]. A notable exception is the KSD with an inverse multiquadric (IMQ) base kernel.

Example 3.1 (IMQ kernel). The IMQ kernel is given by Ψ^{IMQ}_{c,β}(x − y) := (c² + ‖x − y‖₂²)^β, where c > 0 and β < 0. Gorham and Mackey [11, Thm. 8] proved that when β ∈ (−1, 0), KSDs with an IMQ base kernel determine weak convergence on ℝ^D whenever P ∈ 𝒫, the set of distantly dissipative distributions for which ∇ log p is Lipschitz.² (²We say P satisfies distant dissipativity [8, 12] if κ₀ := lim inf_{r→∞} κ(r) > 0 for κ(r) = inf{−2⟨∇ log p(x) − ∇ log p(y), x − y⟩ / ‖x − y‖₂² : ‖x − y‖₂ = r}.)

Let m_N := E_{X∼Q_N}[X] denote the mean of Q_N. We would like to consider a broader class of base kernels, the form of which we summarize in the following assumption.

Assumption A. The base kernel has the form k(x, y) = A_N(x) Ψ(x − y) A_N(y) for Ψ ∈ C², A ∈ C¹, and A_N(x) := A(x − m_N), where A > 0 and ∇ log A is bounded and Lipschitz.

The IMQ kernel falls within the class defined by Assumption A (let A = 1 and Ψ = Ψ^{IMQ}_{c,β}). On the other hand, our next result, proved in Appendix B, shows that tilted base kernels with A increasing sufficiently quickly also control convergence.

Theorem 3.2 (Tilted KSDs detect non-convergence). Suppose that P ∈ 𝒫, Assumption A holds, 1/A ∈ L², and H(u) := sup_{ω∈ℝ^D} e^{−‖ω‖₂²/(2u²)} / Ψ̂(ω) is finite for all u > 0. Then for any sequence of probability measures (µ_N)_{N=1}^∞, if KSD_k(µ_N, P) → 0 then µ_N ⇒ P.

Example 3.2 (Tilted hyperbolic secant kernel). The hyperbolic secant (sech) function is sech(u) := 2/(e^u + e^{−u}). For x ∈ ℝ^D and a > 0, define the sech kernel Ψ^{sech}_a(x) := Π_{d=1}^D sech(√(π/2) a x_d). Since Ψ̂^{sech}_a(ω) = Ψ^{sech}_{1/a}(ω)/a^D, the KSD_k from Theorem 3.2 detects non-convergence when Ψ = Ψ^{sech}_a and A⁻¹ ∈ L². Valid tilting functions include A(x) = Π_{d=1}^D cosh(c x_d) for any c > 0 and A(x) = (c² + ‖x‖₂²)^b for any b > D/4 (to ensure A⁻¹ ∈ L²).

With our appropriate reference KSDs in hand, we will now design upper-bounding ΦSDs. To accomplish this, we will have Φ mimic the form of the base kernels in Assumption A.

Assumption B. Assumption A holds and Φ(x, z) = A_N(x) F(x − z), where F ∈ C¹ is positive, and there exist a norm ‖·‖ and constants s, C > 0 such that |∂_{x_d} log F(x)| ≤ C(1 + ‖x‖^s), lim_{‖x‖→∞} (1 + ‖x‖^s) F(x) = 0, and F(x − z) ≤ C F(z)/F(x). In addition, there exist a constant c ∈ (0, 1] and a continuous, non-increasing function f such that c f(‖x‖) ≤ F(x) ≤ f(‖x‖).

Assumption B requires a minimal amount of regularity from F, essentially that F be sufficiently smooth and behave as if it is a function only of the norm of its argument. A conceptually straightforward choice would be to set F = F⁻¹(Ψ̂^{1/2}), that is, to be the square-root kernel of Ψ. We would then have that Ψ(x − y) = ∫ F(x − z) F(y − z) dz, so in particular ΦSD_{Φ,2} = KSD_k. Since the exact square-root kernel of a base kernel can be difficult to compute in practice, we require only that F be a suitable approximation to the square-root kernel of Ψ.
Assumption C. Assumption B holds, and there exists a smoothness parameter λ_* ∈ (1/2, 1] such that if λ ∈ (1/2, λ_*), then F̂/Ψ̂^{λ/2} ∈ L².

Requiring that F̂/Ψ̂^{λ/2} ∈ L² is equivalent to requiring that F belong to the reproducing kernel Hilbert space K_λ induced by the kernel F⁻¹(Ψ̂^λ). The smoothness of the functions in K_λ increases as λ increases; hence λ_* quantifies the smoothness of F relative to Ψ.

Finally, we would like an assurance that the ΦSD detects convergence, that is, ΦSD_{Φ,r}(Q_N, P) → 0 whenever Q_N converges to P in a suitable metric. The following result, proved in Appendix C, provides such a guarantee for both the ΦSD and the RΦSD.

Proposition 3.3. Suppose Assumption B holds with F ∈ L^r, 1/A bounded, x ↦ x/A(x) Lipschitz, and E_P[A(Z)‖Z‖₂²] < ∞. If the tilted Wasserstein distance W_{A_N}(Q_N, P) := sup_{h∈H} |Q_N(A_N h) − P(A_N h)| (for H := {h : ‖∇h(x)‖₂ ≤ 1 for all x ∈ ℝ^D}) converges to zero, then ΦSD_{Φ,r}(Q_N, P) → 0 and RΦSD_{Φ,r,ν_N,M_N}(Q_N, P) →_P 0 for any choices of r ∈ [1, 2], ν_N, and M_N ≥ 1.

Remark 3.4. When A is constant, W_{A_N} is the familiar 1-Wasserstein distance.

3.2 Selecting an importance sampling distribution

Our next goal is to select an RΦSD proposal distribution ν for which the RΦSD is close to its reference ΦSD even when the importance sample size M is small. Our strategy is to choose ν so that the second moment of each RΦSD feature, w_d(Z, Q_N) := |(Q_N T_d Φ)(Z)|^r / ν(Z), is bounded by a power of its mean.

Definition 3.5 ((C, γ) second moments). Fix a target distribution P. For Z ∼ ν, d ∈ [D], and N ≥ 1, let Y_{N,d} := w_d(Z, Q_N). If for some C > 0 and γ ∈ [0, 2] we have E[Y²_{N,d}] ≤ C E[Y_{N,d}]^{2−γ} for all d ∈ [D] and N ≥ 1, then we say (Φ, r, ν) yields (C, γ) second moments for P and Q_N.

The next proposition, proved in Appendix D, demonstrates the value of this second-moment property.

Proposition 3.6. Suppose (Φ, r, ν) yields (C, γ) second moments for P and Q_N. If M ≥ 2C E[Y_{N,d}]^{−γ} log(D/δ)/ε² for all d ∈ [D], then, with probability at least 1 − δ,

RΦSD_{Φ,r,ν,M}(Q_N, P) ≥ (1 − ε)^{1/r} ΦSD_{Φ,r}(Q_N, P).

Under the further assumptions of Proposition 3.1, if the reference KSD_k(Q_N, P) ≳ N^{−1/2},³ then a sample size M ≳ N^{γr/2} C ‖ρ‖^{γr/2}_{L^t} log(D/δ)/ε² suffices to have, with probability at least 1 − δ,

‖ρ‖^{1/2}_{L^t} RΦSD_{Φ,r,ν,M}(Q_N, P) ≥ (1 − ε)^{1/r} KSD_k(Q_N, P).

Notably, a smaller r leads to substantial gains in the sample complexity M = O(N^{γr/2}). For example, if r = 1, it suffices to choose M = O(N^{1/2}) whenever the weight function w_d is bounded (so that γ = 1); in contrast, existing analyses of random Fourier features [15, 22, 25, 26, 30] require M = Ω(N) to achieve the same error rates. We will ultimately show how to select ν so that γ is arbitrarily close to 0. First, we provide simple conditions and a choice of ν which guarantee (C, 1) second moments.

Proposition 3.7. Assume that P ∈ 𝒫, Assumptions A and B hold with s = 0, and there exists a constant C₀ > 0 such that Q_N([1 + ‖·‖]A_N) ≤ C₀ for all N ≥ 1. If ν(z) ∝ Q_N([1 + ‖·‖]Φ(·, z)), then for any r ≥ 1, (Φ, r, ν) yields (C, 1) second moments for P and Q_N.

Proposition 3.7, which is proved in Appendix E, is based on showing that the weight function w_d(z, Q_N) is uniformly bounded. In order to obtain (C, γ) moments for γ < 1, we will choose ν such that w_d(z, Q_N) decays sufficiently quickly as ‖z‖ → ∞. We achieve this by choosing an over-dispersed ν, that is, a ν with heavy tails compared to F. We also require two integrability conditions involving the Fourier transforms of Ψ and F.
Assumption D. Assumptions A and B hold, the map ω ↦ ω₁² Ψ̂(ω)^{1/2} is in L¹, and, for t = r/(2 − r), Ψ̂/F̂² ∈ L^t.

The L¹ condition is an easily satisfied technical condition, while the L^t condition ensures that the KSD-ΦSD inequality (2) applies to our chosen ΦSD.

Theorem 3.8. Assume that P ∈ 𝒫, Assumptions A to D hold, and there exists C > 0 such that

Q_N([1 + ‖·‖ + ‖· − m_N‖^s] A_N / F(· − m_N)) ≤ C for all N ≥ 1.   (3)

Then there is a constant b ∈ [0, 1) such that the following holds. For any η ∈ (0, 1 − b), c_η > 0, and θ > 2(1 − λ_*), if ν(z) ≥ c_η F(z − m_N)^{ηr}, then there exists a constant C' > 0 such that (Φ, r, ν) yields (C', γ_{η,θ}) second moments for P and Q_N, where γ_{η,θ} := θ + (2 − θ)η/(2 − b − η).

Theorem 3.8 suggests a strategy for improving the importance-sample growth rate γ of an RΦSD: increase the smoothness λ_* of F and decrease the over-dispersion parameter η of ν.

³Note that KSD_k(Q_N, P) = Ω_P(N^{−1/2}) whenever the sample points x₁, . . . , x_N are drawn i.i.d. from a distribution µ, since the scaled V-statistic N KSD²_k(Q_N, P) diverges when µ ≠ P and converges in distribution to a non-zero limit when µ = P [23, Thm. 32]. Moreover, working in a hypothesis-testing framework of shrinking alternatives, Gretton et al. [13, Thm. 13] showed that Θ(N^{−1/2}) is the smallest local departure distinguishable by an asymptotic KSD test.

Figure 1: Efficiency of RΦSDs. (a) Efficiency of L1 IMQ. (b) Efficiency of L2 Sech Exp. (c) The M necessary for std(RΦSD) ≤ ΦSD/2. The L1 IMQ RΦSD displays exceptional efficiency.

3.3 Example RΦSDs

In our experiments, we will consider two RΦSDs that determine convergence by Propositions 3.1 and 3.3 and that yield (C, γ) second moments for any γ ∈ (0, 1] using Theorem 3.8.

Example 3.3 (L2 tilted hyperbolic secant RΦSD). Mimicking the construction of the hyperbolic secant kernel in Example 3.2 and following the intuition that F should behave like the square root of Ψ, we choose F = Ψ^{sech}_{2a}. As shown in Appendix I, if we choose r = 2 and ν(z) ∝ Ψ^{sech}_{4aη}(z − m_N), we can verify all the assumptions necessary for Theorem 3.8 to hold. Moreover, the theorem holds for any b > 0, and hence any η ∈ (0, 1) may be chosen. Note that ν can be sampled from efficiently using the inverse CDF method.

Example 3.4 (L^r IMQ RΦSD). We can also parallel the construction of the reference IMQ kernel k(x, y) = Ψ^{IMQ}_{c,β}(x − y) from Example 3.1, where c > 0 and β ∈ [−D/2, 0). (Recall that A = 1 in Assumption A.) In order to construct a corresponding RΦSD, we must choose the constant λ_* ∈ (1/2, 1) that will appear in Assumption C and a minimum over-dispersion η₀ ∈ (0, 1/2) that we will be able to choose when constructing ν. We show in Appendix J that if we choose F = Ψ^{IMQ}_{c₀,β₀}, then Assumptions A to D hold when c₀ = λ_* c/2, β₀ lies in an interval determined by β, D, and η₀ (see Appendix J), r = −D/(2β₀η₀), η ∈ (η₀, 1), and ν(z) ∝ Ψ^{IMQ}_{c₀,β₀}(z − m_N)^{ηr}. A particularly simple setting is given by β₀ = −D/(2η₀), which yields r = 1. Note that ν can be sampled from efficiently since it is a multivariate t-distribution.

In the future it would be interesting to construct other RΦSDs. We can recommend the following fairly simple default procedure for choosing an RΦSD based on a reference KSD admitting the form in Assumption A. (1) Choose any γ > 0, and set θ = γ/3, λ_* = 1 − θ/2, and η = 4θ/(2 + θ). These are the settings we will use in our experiments; it may be possible to initially skip this step and reason about general choices of γ, θ, and λ_*. (2) Pick any F that satisfies F̂/Ψ̂^{λ/2} ∈ L² for some λ ∈ (1/2, λ_*) (that is, Assumption C holds) while also satisfying Ψ̂/F̂² ∈ L^t for some t ∈ [1, ∞]. The selection of t induces a choice of r via Assumption D.
A simple choice for F is F⁻¹(Ψ̂^λ). (3) Check whether Assumption B holds (it usually does if F decays no faster than a Gaussian); if it does not, a slightly different choice of F should be made. (4) Choose ν(z) ∝ F(z − m_N)^{ηr}.

4 Goodness-of-fit testing with RΦSDs

We now detail additional properties of RΦSDs relevant to testing goodness of fit. In goodness-of-fit testing, the sample points (X_n)_{n=1}^N underlying Q_N are assumed to be drawn i.i.d. from a distribution µ, and we wish to use the test statistic F_{r,N} := RΦSD²_{Φ,r,ν,M}(Q_N, P) to determine whether the null hypothesis H₀ : P = µ or the alternative hypothesis H₁ : P ≠ µ holds. To this end, we will restrict our focus to real analytic Φ and strictly positive analytic ν, since by Chwialkowski et al. [6, Prop. 2 and Lemmas 1-3], when these properties hold, with probability 1, P = µ if and only if RΦSD_{Φ,r,ν,M}(µ, P) = 0. Thus, analytic RΦSDs do not suffer from the shortcoming of RFFs, which are unable to distinguish between infinitely many distributions with high probability [6].

It remains to estimate the distribution of the test statistic F_{r,N} under the null hypothesis and to verify that the power of a test based on this distribution approaches 1 as N → ∞. To state our result, we assume that M is fixed. Let Γ_{r,N,dm}(x) := (T_d Φ)(x, Z_{N,m})/(M ν_N(Z_{N,m}))^{1/r} for r ∈ [1, 2], where Z_{N,1}, . . . , Z_{N,M} are drawn i.i.d. from ν_N independently of (X_n)_{n=1}^N, so that Γ_{r,N}(x) ∈ ℝ^{DM}. The following result, proved in Appendix K, provides the basis for our testing guarantees.

Figure 2: Hyperparameter selection for stochastic gradient Langevin dynamics (SGLD). (a) Step size selection using RΦSDs and the quadratic-time KSD baseline; with M ≥ 10, each quality measure selects a step size of ε = .01 or .005. (b) SGLD sample points with equidensity contours of p overlaid; the samples produced by SGLD with ε = .01 or .005 are noticeably better than those produced using smaller or larger step sizes.

Proposition 4.1 (Asymptotic distribution of RΦSD). Assume Σ_{r,N} := Cov_P(Γ_{r,N}) is finite for all N and Σ_r := lim_{N→∞} Σ_{r,N} exists. Let ξ ∼ N(0, Σ_r). Then as N → ∞: (1) under H₀ : P = µ, N F_{r,N} ⇒ Σ_{d=1}^D (Σ_{m=1}^M |ξ_{dm}|^r)^{2/r}; and (2) under H₁ : P ≠ µ, N F_{r,N} →_P ∞.

Remark 4.2. The condition that Σ_r := lim_{N→∞} Σ_{r,N} exists holds if ν_N = ν₀(· − m_N) for a fixed distribution ν₀.

Our second asymptotic result provides a roadmap for using RΦSDs for hypothesis testing and is similar in spirit to Theorem 3 of Jitkrittum et al. [16]. In particular, it furnishes an asymptotic null distribution and establishes asymptotically full power.

Theorem 4.3 (Goodness-of-fit testing with RΦSD). Let µ̂ := N⁻¹ Σ_{n=1}^N Γ_{r,N}(X'_n) and Σ̂ := N⁻¹ Σ_{n=1}^N Γ_{r,N}(X'_n) Γ_{r,N}(X'_n)^⊤ − µ̂ µ̂^⊤ with either X'_n = X_n or X'_n drawn i.i.d. from P. Suppose that, for the test statistic N F_{r,N}, the test threshold q_{1−α} is set to the (1 − α)-quantile of the distribution of Σ_{d=1}^D (Σ_{m=1}^M |ξ_{dm}|^r)^{2/r}, where ξ ∼ N(0, Σ̂). Then, under H₀ : P = µ, the false positive rate is asymptotically α. Under H₁ : P ≠ µ, the test power P_{H₁}(N F_{r,N} > q_{1−α}) → 1 as N → ∞.

5 Experiments

We now investigate the importance-sample and computational efficiency of our proposed RΦSDs and evaluate their benefits in MCMC hyperparameter selection and goodness-of-fit testing.⁴ (⁴See https://bitbucket.org/jhhuggins/random-feature-stein-discrepancies for our code.) In our experiments, we considered the RΦSDs described in Examples 3.3 and 3.4: the tilted sech kernel using r = 2 and a product tilting function A of the form given in Example 3.2 (L2 Sech Exp) and the inverse multiquadric kernel using r = 1 (L1 IMQ). We selected kernel parameters as follows. First we chose a target γ and then selected λ_*, θ, and η in accordance with the theory of Section 3 so that (Φ, r, ν) yielded (C_γ, γ) second moments. In particular, we chose θ = γ/3, λ_* = 1 − θ/2, and η = 4θ/(2 + θ).
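As an illustration of this parameter recipe, the sketch below (ours; the symbol names θ, λ_*, and η follow the notation used above, and the proposal's location, scale, and degrees of freedom are placeholders) computes the derived parameters from a target γ and draws importance sample points from a multivariate t proposal of the kind used by L1 IMQ (Example 3.4).

```python
import numpy as np

def rphi_sd_params(gamma):
    """Default parameter recipe: gamma -> (theta, lambda_*, eta)."""
    theta = gamma / 3.0                  # smoothness-slack parameter
    lam_star = 1.0 - theta / 2.0         # smoothness of F relative to the base kernel
    eta = 4.0 * theta / (2.0 + theta)    # over-dispersion of the proposal
    return theta, lam_star, eta

def sample_mvt(rng, M, loc, scale, df):
    """Draw M proposal points from a multivariate t distribution, constructed as
    Gaussian / sqrt(chi^2 / df); the L1 IMQ proposal is a multivariate t (Example 3.4)."""
    D = loc.shape[0]
    gauss = rng.normal(size=(M, D)) * scale
    u = rng.chisquare(df, size=(M, 1)) / df
    return loc + gauss / np.sqrt(u)

rng = np.random.default_rng(0)
print(rphi_sd_params(gamma=0.25))                      # gamma = 1/4, as in the experiments
Z = sample_mvt(rng, M=10, loc=np.zeros(10), scale=4.0, df=0.5)
```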
Except for the importance-sample efficiency experiments, where we varied γ explicitly, all experiments used γ = 1/4. Let d̂_med,u denote the estimated median of the distance between data points under the u-norm, where the estimate is based on a small subsample of the full dataset. For L2 Sech Exp, we took a⁻¹ = 2 d̂_med,1, except in the sample quality experiments, where we set a⁻¹ = 2. Finding hyperparameter settings for the L1 IMQ that were stable across dimension and appropriately controlled the size for goodness-of-fit testing required some care. However, we can offer some basic guidelines. We recommend choosing the over-dispersion parameters so that η₀/η = D/(D + df), which ensures that ν has df degrees of freedom. We specifically suggest using df ∈ [0.5, 3] so that ν is heavy-tailed no matter the dimension. For most experiments we took β = −1/2, c = 4 d̂_med,2, and df = 0.5. The exceptions were the sample quality experiments, where we set c = 1, and the restricted Boltzmann machine testing experiment, where we set c = 10 d̂_med,2 and df = 2.5. For goodness-of-fit testing, we expect that appropriate choices for c and df will depend on the properties of the null distribution.

Figure 3: Speed of the IMQ KSD vs. RΦSDs with M = 10 importance sample points (dimension D = 10). Even for moderate sample sizes N, the RΦSDs are orders of magnitude faster than the KSD.

Importance sample efficiency  To validate the importance-sample efficiency theory from Sections 3.2 and 3.3, we calculated P[RΦSD > ΦSD/4] as the importance sample size M was increased. We considered choices of the parameters for L2 Sech Exp and L1 IMQ that produced (C_γ, γ) second moments for varying choices of γ. The results, shown in Figs. 1a and 1b, indicate greater sample efficiency for L1 IMQ than for L2 Sech Exp. L1 IMQ is also more robust to the choice of γ. Fig. 1c, which plots the values of M necessary to achieve stdev(RΦSD)/ΦSD < 1/2, corroborates the greater sample efficiency of L1 IMQ.

Computational complexity  We compared the computational complexity of the RΦSDs (with M = 10) to that of the IMQ KSD. We generated datasets of dimension D = 10 with the sample size N ranging from 500 to 5000. As seen in Fig. 3, even for moderate dataset sizes, the RΦSDs are computed orders of magnitude faster than the KSD. Other RΦSDs like the FSSD and RFF obtain similar speed-ups; however, we will see the power benefits of the L1 IMQ and L2 Sech Exp RΦSDs below.

Approximate MCMC hyperparameter selection  We follow the stochastic gradient Langevin dynamics [SGLD, 28] hyperparameter selection setup of Gorham and Mackey [10, Section 5.3]. SGLD with constant step size ε is a biased MCMC algorithm that approximates the overdamped Langevin diffusion. No Metropolis-Hastings correction is used, and an unbiased estimate of the score function from a data subsample is calculated at each iteration. There is a bias-variance tradeoff in the choice of the step size parameter: the stationary distribution of SGLD deviates more from its target as ε grows, but as ε gets smaller the mixing speed of SGLD decreases. Hence, an appropriate choice of ε is critical for accurate posterior inference. We target the bimodal Gaussian mixture model (GMM) posterior of Welling and Teh [28] and compare the step size selections made by the two RΦSDs to that of the IMQ KSD [11] when N = 1000. Fig. 2a shows that L1 IMQ and L2 Sech Exp agree with the IMQ KSD (selecting ε = .005) even with just M = 10 importance samples. L1 IMQ continues to select ε = .005, while L2 Sech Exp settles on ε = .01, although its value for ε = .005 is only slightly larger.
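For reference, here is a minimal sketch (ours) of the constant-step-size SGLD update being tuned in this experiment; grad_log_prior and grad_log_lik are placeholders for the components of the GMM posterior of [28].

```python
import numpy as np

def sgld_step(theta, eps, batch, N_total, grad_log_prior, grad_log_lik, rng):
    """One SGLD update: half a stochastic gradient of the log posterior (prior gradient
    plus a rescaled minibatch likelihood gradient) plus N(0, eps I) noise,
    with no Metropolis-Hastings correction."""
    grad = grad_log_prior(theta) + (N_total / len(batch)) * sum(
        grad_log_lik(theta, x) for x in batch)
    return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=theta.shape)
```

Larger eps inflates the bias of the stationary distribution, while smaller eps slows mixing, which is exactly the tradeoff the quality measures above are used to navigate.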
Fig. 2b compares the choices ε = .005 and .01 to smaller and larger values of ε. The values of M considered all represent substantial reductions in computation, as the RΦSD replaces the DN(N + 1)/2 KSD kernel evaluations of the form ((T_d ⊗ T_d)k)(x_n, x_{n′}) with only DNM feature function evaluations of the form (T_d Φ)(x_n, z_m).

Goodness-of-fit testing  Finally, we investigated the performance of RΦSDs for goodness-of-fit testing. In our first two experiments we used a standard multivariate Gaussian p(x) = N(x | 0, I) as the null distribution while varying the dimension of the data. We explored the power of RΦSD-based tests compared to FSSD tests [16] (using the default settings in their code), RFF tests [22] (Gaussian and Cauchy kernels with bandwidth σ = d̂_med,2), and KSD-based tests [7, 11, 19] (Gaussian kernel with bandwidth σ = d̂_med,2 and IMQ kernel Ψ^{IMQ}_{1,−1/2}). We did not consider other linear-time KSD approximations due to their relatively poor empirical performance [16]. There are two types of FSSD tests: FSSD-rand uses random sample locations and fixed hyperparameters, while FSSD-opt uses a small subset of the data to optimize sample locations and hyperparameters for a power criterion. All linear-time tests used M = 10 features. The target level was α = 0.05. For each dimension D and RΦSD-based test, we chose the nominal test level by generating 200 p-values from the Gaussian asymptotic null, then setting the nominal level to the minimum of α and the 5th percentile of the generated p-values. All other tests had nominal level α. We verified the size of the FSSD, RFF, and RΦSD-based tests by generating 1000 p-values for each experimental setting in the Gaussian case (see Fig. 4a).

Figure 4: Quadratic-time KSD and linear-time RΦSD, FSSD, and RFF goodness-of-fit tests with M = 10 importance sample points (see Section 5 for more details). All experiments used N = 1000 except the multivariate t, which used N = 2000. (a) Size of tests for the Gaussian null. (b)-(d) Power of tests, including (b) Gaussian vs. Laplace and (c) Gaussian vs. multivariate t. Both RΦSDs offer competitive performance.

Our first experiment replicated the Gaussian vs. Laplace experiment of Jitkrittum et al. [16], in which, under the alternative hypothesis, q(x) = Π_{d=1}^D Lap(x_d | 0, 1/√2), a product of Laplace distributions with variance 1 (see Fig. 4b). Our second experiment, inspired by the Gaussian vs. multivariate t experiment of Chwialkowski et al. [7], tested the alternative in which q(x) = T(x | 0, 5), a standard multivariate t-distribution with 5 degrees of freedom (see Fig. 4c). Our final experiment replicated the restricted Boltzmann machine (RBM) experiment of Jitkrittum et al. [16], in which each entry of the matrix used to define the RBM was perturbed by independent additive Gaussian noise (see Fig. 4d). The amount of noise was varied from σ_per = 0 (that is, the null held) up to σ_per = 0.06. The L1 IMQ test performed well across all dimensions and experiments, with power of at least 0.93 in almost all settings; the only exceptions were the Laplace experiment with D = 20 (power 0.88) and the RBM experiment with σ_per = 0.02 (power 0.74). The L2 Sech Exp test performed comparably to or better than the FSSD and RFF tests. Despite its theoretical issues, the Cauchy RFF was competitive with the other linear-time methods except for the superior L1 IMQ. Given its superior power control and computational efficiency, we recommend the L1 IMQ over the L2 Sech Exp.
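To illustrate the testing recipe of Theorem 4.3, the sketch below (ours; it assumes a d-major layout of the DM feature coordinates) simulates the asymptotic null to obtain the rejection threshold for the statistic N·F_{r,N}, given the empirical feature covariance Σ̂.

```python
import numpy as np

def null_threshold(Sigma_hat, D, M, r, alpha=0.05, n_sims=10000, rng=None):
    """Simulate the asymptotic null of N * F_{r,N} from Theorem 4.3:
    draw xi ~ N(0, Sigma_hat) in R^{DM}, form sum_d (sum_m |xi_dm|^r)^{2/r},
    and return its (1 - alpha)-quantile as the rejection threshold."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(Sigma_hat + 1e-10 * np.eye(D * M))  # jitter for stability
    xi = rng.normal(size=(n_sims, D * M)) @ L.T                # rows ~ N(0, Sigma_hat)
    per_d = (np.abs(xi.reshape(n_sims, D, M)) ** r).sum(axis=2) ** (2.0 / r)
    return np.quantile(per_d.sum(axis=1), 1.0 - alpha)

# Usage: reject H0 : P = mu when N * F_rN > null_threshold(Sigma_hat, D, M, r).
```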
6 Discussion and related work

In this paper, we have introduced feature Stein discrepancies (ΦSDs), a family of computable Stein discrepancies that can be cheaply approximated using importance sampling. Our stochastic approximations, random feature Stein discrepancies (RΦSDs), combine the computational benefits of linear-time discrepancy measures with the convergence-determining properties of quadratic-time Stein discrepancies. We validated the benefits of RΦSDs on two applications where kernel Stein discrepancies have shown excellent performance: measuring sample quality and goodness-of-fit testing. Empirically, the L1 IMQ RΦSD performed particularly well: it outperformed existing linear-time KSD approximations in high dimensions and performed as well as or better than the state-of-the-art quadratic-time KSDs. RΦSDs could also be used as drop-in replacements for KSDs in applications to Monte Carlo variance reduction with control functionals [21], probabilistic inference using Stein variational gradient descent [18], and kernel quadrature [2, 3]. Moreover, the underlying principle used to generalize the KSD could also be used to develop fast alternatives to maximum mean discrepancies in two-sample testing applications [6, 13]. Finally, while we focused on the Langevin Stein operator, our development is compatible with any Stein operator, including diffusion Stein operators [12].

Acknowledgments  Part of this work was done while JHH was a research intern at MSR New England.

References

[1] M. Abramowitz and I. Stegun, editors. Handbook of Mathematical Functions. Dover Publications, 1964.
[2] F. Bach. On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, 18:1–38, 2017.
[3] F.-X. Briol, C. J. Oates, J. Cockayne, W. Y. Chen, and M. A. Girolami. On the Sampling Problem for Kernel Quadrature. In International Conference on Machine Learning, 2017.
[4] C. Carmeli, E. De Vito, A. Toigo, and V. Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.
[5] F. Chung and L. Lu. Complex Graphs and Networks, volume 107. American Mathematical Society, Providence, Rhode Island, 2006.
[6] K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast Two-Sample Testing with Analytic Representations of Probability Measures. In Advances in Neural Information Processing Systems, 2015.
[7] K. Chwialkowski, H. Strathmann, and A. Gretton. A Kernel Test of Goodness of Fit. In International Conference on Machine Learning, 2016.
[8] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016.
[9] C. J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. In Computing Science and Statistics, Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.
[10] J. Gorham and L. Mackey. Measuring Sample Quality with Stein's Method. In Advances in Neural Information Processing Systems, 2015.
[11] J. Gorham and L. Mackey. Measuring Sample Quality with Kernels. In International Conference on Machine Learning, 2017.
[12] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring Sample Quality with Diffusions. arXiv.org, Nov. 2016, 1611.06972v3.
[13] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13:723–773, 2012.
[14] R. Herb and P. Sally Jr. The Plancherel formula, the Plancherel theorem, and the Fourier transform of orbital integrals.
In Representation Theory and Mathematical Physics: Conference in Honor of Gregg Zuckerman's 60th Birthday, October 24–27, 2009, Yale University, volume 557, page 1. American Mathematical Soc., 2011.
[15] J. Honorio and Y.-J. Li. The Error Probability of Random Fourier Features is Dimensionality Independent. arXiv.org, Oct. 2017, 1710.09953v1.
[16] W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A Linear-Time Kernel Goodness-of-Fit Test. In Advances in Neural Information Processing Systems, 2017.
[17] Q. Liu and J. D. Lee. Black-box Importance Sampling. In International Conference on Artificial Intelligence and Statistics, 2017.
[18] Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems, 2016.
[19] Q. Liu, J. D. Lee, and M. I. Jordan. A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation. In International Conference on Machine Learning, 2016.
[20] A. Müller. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429–443, 1997.
[21] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
[22] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.
[23] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.
[24] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York, 1980.
[25] B. K. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
[26] D. J. Sutherland and J. Schneider. On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871, 2015.
[27] D. Wang and Q. Liu. Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning. arXiv, stat.ML, 2016.
[28] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In International Conference on Machine Learning, 2011.
[29] H. Wendland. Scattered Data Approximation. Cambridge University Press, New York, NY, 2005.
[30] J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345–1372, 2015.