# Measuring Sample Quality with Kernels

Jackson Gorham (Stanford University, Palo Alto, CA, USA) and Lester Mackey (Microsoft Research New England, Cambridge, MA, USA)

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.*

**Abstract.** Approximate Markov chain Monte Carlo (MCMC) offers the promise of more rapid sampling at the cost of more biased inference. Since standard MCMC diagnostics fail to detect these biases, researchers have developed computable Stein discrepancy measures that provably determine the convergence of a sample to its target distribution. This approach was recently combined with the theory of reproducing kernels to define a closed-form kernel Stein discrepancy (KSD) computable by summing kernel evaluations across pairs of sample points. We develop a theory of weak convergence for KSDs based on Stein's method, demonstrate that commonly used KSDs fail to detect non-convergence even for Gaussian targets, and show that kernels with slowly decaying tails provably determine convergence for a large class of target distributions. The resulting convergence-determining KSDs are suitable for comparing biased, exact, and deterministic sample sequences and are simpler to compute and parallelize than alternative Stein discrepancies. We use our tools to compare biased samplers, select sampler hyperparameters, and improve upon existing KSD approaches to one-sample hypothesis testing and sample quality improvement.

## 1. Introduction

When Bayesian inference and maximum likelihood estimation (Geyer, 1991) demand the evaluation of intractable expectations $\mathbb{E}_P[h(Z)] = \int p(x)\,h(x)\,dx$ under a target distribution $P$, Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) are often employed to approximate these integrals with asymptotically correct sample averages $\mathbb{E}_{Q_n}[h(X)] = \frac{1}{n}\sum_{i=1}^n h(x_i)$. However, many exact MCMC methods are computationally expensive, and recent years have seen the introduction of biased MCMC procedures (see, e.g., Welling & Teh, 2011; Ahn et al., 2012; Korattikara et al., 2014) that exchange asymptotic correctness for increased sampling speed. Since standard MCMC diagnostics, like mean and trace plots, pooled and within-chain variance measures, effective sample size, and asymptotic variance (Brooks et al., 2011), do not account for asymptotic bias, Gorham & Mackey (2015) defined a new family of sample quality measures, the Stein discrepancies, that measure how well $\mathbb{E}_{Q_n}$ approximates $\mathbb{E}_P$ while avoiding explicit integration under $P$. Gorham & Mackey (2015); Mackey & Gorham (2016); Gorham et al. (2016) further showed that specific members of this family, the graph Stein discrepancies, were (a) efficiently computable by solving a linear program and (b) convergence-determining for large classes of targets $P$. Building on the zero-mean reproducing kernel theory of Oates et al. (2016b), Chwialkowski et al. (2016) and Liu et al. (2016) later showed that other members of the Stein discrepancy family admit a closed-form solution involving the sum of kernel evaluations over pairs of sample points. This closed form represents a significant practical advantage, as no linear program solvers are necessary, and the computation of the discrepancy is easily parallelized. However, as we will see in Section 3.2, not all kernel Stein discrepancies are suitable for our setting.
In particular, in dimension $d \geq 3$, the kernel Stein discrepancies previously recommended in the literature fail to detect when a sample is not converging to the target. To address this shortcoming, we develop a theory of weak convergence for kernel Stein discrepancies analogous to that of Gorham & Mackey (2015); Mackey & Gorham (2016); Gorham et al. (2016) and design a class of kernel Stein discrepancies that provably control weak convergence for a large class of target distributions.

After formally describing our goals for measuring sample quality in Section 2, we outline our strategy, based on Stein's method, for constructing and analyzing practical quality measures at the start of Section 3. In Section 3.1, we define our family of closed-form quality measures, the kernel Stein discrepancies (KSDs), and establish several appealing practical properties of these measures. We analyze the convergence properties of KSDs in Sections 3.2 and 3.3, showing that previously proposed KSDs fail to detect non-convergence and proposing practical convergence-determining alternatives. Section 4 illustrates the value of convergence-determining kernel Stein discrepancies in a variety of applications, including hyperparameter selection, sampler selection, one-sample hypothesis testing, and sample quality improvement. Finally, in Section 5, we conclude with a discussion of related and future work.

**Notation.** We will use $\mu$ to denote a generic probability measure and $\Rightarrow$ to denote the weak convergence of a sequence of probability measures. We will use $\|\cdot\|_r$ for $r \in [1, \infty]$ to represent the $\ell^r$ norm on $\mathbb{R}^d$ and occasionally refer to a generic norm $\|\cdot\|$ with associated dual norm $\|a\|^* \triangleq \sup_{b \in \mathbb{R}^d,\,\|b\|=1} \langle a, b\rangle$ for vectors $a \in \mathbb{R}^d$. We let $e_j$ be the $j$-th standard basis vector. For any function $g : \mathbb{R}^d \to \mathbb{R}^{d'}$, we define $M_0(g) \triangleq \sup_{x \in \mathbb{R}^d} \|g(x)\|_2$, $M_1(g) \triangleq \sup_{x \neq y} \|g(x) - g(y)\|_2 / \|x - y\|_2$, and $\nabla g$ as the gradient with components $(\nabla g(x))_{jk} \triangleq \nabla_{x_k} g_j(x)$. We further let $g \in C^m$ indicate that $g$ is $m$ times continuously differentiable and $g \in C^m_0$ indicate that $g \in C^m$ and $\nabla^l g$ vanishes at infinity for all $l \in \{0, \ldots, m\}$. We define $C^{(m,m)}$ (respectively, $C^{(m,m)}_b$ and $C^{(m,m)}_0$) to be the set of functions $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $(x, y) \mapsto \nabla_x^l \nabla_y^l k(x, y)$ continuous (respectively, continuous and uniformly bounded, continuous and vanishing at infinity) for all $l \in \{0, \ldots, m\}$.

## 2. Quality measures for samples

Consider a target distribution $P$ with continuously differentiable (Lebesgue) density $p$ supported on all of $\mathbb{R}^d$. We assume that the score function $b \triangleq \nabla \log p$ can be evaluated¹ but that, for most functions of interest, direct integration under $P$ is infeasible. We will therefore approximate integration under $P$ using a weighted sample $Q_n = \sum_{i=1}^n q_n(x_i)\,\delta_{x_i}$ with sample points $x_1, \ldots, x_n \in \mathbb{R}^d$ and $q_n$ a probability mass function. We make no assumptions about the origins of the sample points; they may be the output of a Markov chain or even deterministically generated. Each $Q_n$ offers an approximation $\mathbb{E}_{Q_n}[h(X)] = \sum_{i=1}^n q_n(x_i)\,h(x_i)$ for each intractable expectation $\mathbb{E}_P[h(Z)]$, and our aim is to effectively compare the quality of the approximation offered by any two samples targeting $P$. In particular, we wish to produce a quality measure that (i) identifies when a sequence of samples is converging to the target, (ii) determines when a sequence of samples is not converging to the target, and (iii) is efficiently computable.

¹ No knowledge of the normalizing constant is needed.
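Since every quality measure developed below touches $P$ only through the score function $b = \nabla \log p$, footnote 1 is easy to see in code. The following sketch, written for this article rather than taken from the authors' released SteinDiscrepancy.jl package, evaluates the score of an unnormalized bimodal Gaussian mixture of the kind used as a target in Section 4; the unknown normalizing constant cancels in the ratio.

```julia
# Illustrative sketch (not from the authors' SteinDiscrepancy.jl package):
# the score b(x) = ∇ log p(x) of the unnormalized density
#   p(x) ∝ exp(-||x - μ1||²/2) + exp(-||x - μ2||²/2).
# The normalizing constant cancels, so b is computable exactly.
function mixture_score(x, mu1, mu2)
    w1 = exp(-sum(abs2, x .- mu1) / 2)
    w2 = exp(-sum(abs2, x .- mu2) / 2)
    return (w1 .* (mu1 .- x) .+ w2 .* (mu2 .- x)) ./ (w1 + w2)
end

mixture_score([0.3], [1.5], [-1.5])   # score of a univariate mixture at x = 0.3
```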
Since our interest is in approximating expectations, we will consider discrepancies quantifying the maximum expectation error over a class of test functions $\mathcal{H}$:

$$d_{\mathcal{H}}(Q_n, P) \triangleq \sup_{h \in \mathcal{H}} \big|\mathbb{E}_P[h(Z)] - \mathbb{E}_{Q_n}[h(X)]\big|. \tag{1}$$

When $\mathcal{H}$ is large enough, for any sequence of probability measures $(\mu_m)_{m \geq 1}$, $d_{\mathcal{H}}(\mu_m, P) \to 0$ only if $\mu_m \Rightarrow P$. In this case, we call (1) an integral probability metric (IPM) (Müller, 1997). For example, when $\mathcal{H} = \mathrm{BL}_{\|\cdot\|_2} \triangleq \{h : \mathbb{R}^d \to \mathbb{R} \mid M_0(h) + M_1(h) \leq 1\}$, the IPM $d_{\mathrm{BL}_{\|\cdot\|_2}}$ is called the bounded Lipschitz or Dudley metric and exactly metrizes convergence in distribution. Alternatively, when $\mathcal{H} = \mathcal{W}_{\|\cdot\|_2} \triangleq \{h : \mathbb{R}^d \to \mathbb{R} \mid M_1(h) \leq 1\}$ is the set of 1-Lipschitz functions, the IPM $d_{\mathcal{W}_{\|\cdot\|_2}}$ in (1) is known as the Wasserstein metric.

An apparent practical problem with using the IPM $d_{\mathcal{H}}$ as a sample quality measure is that $\mathbb{E}_P[h(Z)]$ may not be computable for $h \in \mathcal{H}$. However, if $\mathcal{H}$ were chosen such that $\mathbb{E}_P[h(Z)] = 0$ for all $h \in \mathcal{H}$, then no explicit integration under $P$ would be necessary. To generate such a class of test functions and to show that the resulting IPM still satisfies our desiderata, we follow the lead of Gorham & Mackey (2015) and consider Charles Stein's method for characterizing distributional convergence.

## 3. Stein's method with kernels

Stein's method (Stein, 1972) provides a three-step recipe for assessing convergence in distribution:

1. Identify a Stein operator $\mathcal{T}$ that maps functions $g : \mathbb{R}^d \to \mathbb{R}^d$ from a domain $\mathcal{G}$ to real-valued functions $\mathcal{T}g$ such that $\mathbb{E}_P[(\mathcal{T}g)(Z)] = 0$ for all $g \in \mathcal{G}$. For any such Stein operator and Stein set $\mathcal{G}$, Gorham & Mackey (2015) defined the Stein discrepancy as
$$\mathcal{S}(\mu, \mathcal{T}, \mathcal{G}) \triangleq \sup_{g \in \mathcal{G}} \big|\mathbb{E}_\mu[(\mathcal{T}g)(X)]\big| = d_{\mathcal{T}\mathcal{G}}(\mu, P), \tag{2}$$
which, crucially, avoids explicit integration under $P$.

2. Lower bound the Stein discrepancy by an IPM $d_{\mathcal{H}}$ known to dominate weak convergence. This can be done once for a broad class of target distributions to ensure that $\mu_m \Rightarrow P$ whenever $\mathcal{S}(\mu_m, \mathcal{T}, \mathcal{G}) \to 0$ for a sequence of probability measures $(\mu_m)_{m \geq 1}$ (Desideratum (ii)).

3. Provide an upper bound on the Stein discrepancy ensuring that $\mathcal{S}(\mu_m, \mathcal{T}, \mathcal{G}) \to 0$ under suitable convergence of $\mu_m$ to $P$ (Desideratum (i)).

While Stein's method is principally used as a mathematical tool to prove convergence in distribution, we seek, in the spirit of Gorham & Mackey (2015); Gorham et al. (2016), to harness the Stein discrepancy as a practical tool for measuring sample quality. The subsections to follow develop a specific, practical instantiation of the abstract Stein's method recipe based on reproducing kernel Hilbert spaces. An empirical analysis of the Stein discrepancies recommended by our theory follows in Section 4.

### 3.1. Selecting a Stein operator and a Stein set

A standard, widely applicable univariate Stein operator is the density method operator (see Stein et al., 2004; Chatterjee & Shao, 2011; Chen et al., 2011; Ley et al., 2017),
$$(\mathcal{T}g)(x) \triangleq \frac{1}{p(x)} \frac{d}{dx}\big(p(x)\,g(x)\big) = g(x)\,b(x) + g'(x).$$
Inspired by the generator method of Barbour (1988; 1990) and Götze (1991), Gorham & Mackey (2015) generalized this operator to multiple dimensions. The resulting Langevin Stein operator
$$(\mathcal{T}_P g)(x) \triangleq \frac{1}{p(x)} \langle \nabla, p(x)\,g(x)\rangle = \langle g(x), b(x)\rangle + \langle \nabla, g(x)\rangle$$
for functions $g : \mathbb{R}^d \to \mathbb{R}^d$ was independently developed, without connection to Stein's method, by Oates et al. (2016b) for the design of Monte Carlo control functionals. Notably, the Langevin Stein operator depends on $P$ only through its score function $b = \nabla \log p$ and hence is computable even when the normalizing constant of $p$ is not.
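As a toy illustration of Step 1 (our own check, not part of the paper), the Langevin Stein operator produces mean-zero functions under $P$. For $P = \mathcal{N}(0, I_d)$ the score is $b(x) = -x$, and for the componentwise choice $g(x) = \sin.(x)$ the divergence is $\langle \nabla, g(x)\rangle = \sum_j \cos(x_j)$, so a Monte Carlo average of $(\mathcal{T}_P g)(Z)$ should vanish.

```julia
using Statistics, Random

# Toy check of E_P[(T_P g)(Z)] = 0 for the Langevin Stein operator
# (T_P g)(x) = <g(x), b(x)> + <∇, g(x)>, with P = N(0, I_d), b(x) = -x,
# and g(x) = sin.(x) applied componentwise.
Random.seed!(1)
d, m = 3, 10^5
stein_g(x) = sum(sin.(x) .* (-x)) + sum(cos.(x))
println(mean(stein_g(randn(d)) for _ in 1:m))   # ≈ 0 up to Monte Carlo error
```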
While our work is compatible with other practical Stein operators, like the family of diffusion Stein operators defined in Gorham et al. (2016), we will focus on the Langevin operator for the sake of brevity.

Hereafter, we let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the reproducing kernel of a reproducing kernel Hilbert space (RKHS) $\mathcal{K}_k$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$. That is, $\mathcal{K}_k$ is a Hilbert space of functions such that, for all $x \in \mathbb{R}^d$, $k(x, \cdot) \in \mathcal{K}_k$ and $f(x) = \langle f, k(x, \cdot)\rangle_{\mathcal{K}_k}$ whenever $f \in \mathcal{K}_k$. We let $\|\cdot\|_{\mathcal{K}_k}$ be the norm induced by the inner product on $\mathcal{K}_k$. With this definition, we define our kernel Stein set $\mathcal{G}_{k,\|\cdot\|}$ as the set of vector-valued functions $g = (g_1, \ldots, g_d)$ such that each component function $g_j$ belongs to $\mathcal{K}_k$ and the vector of their norms $\|g_j\|_{\mathcal{K}_k}$ belongs to the $\|\cdot\|^*$ unit ball:²
$$\mathcal{G}_{k,\|\cdot\|} \triangleq \{g = (g_1, \ldots, g_d) \mid \|v\|^* \leq 1 \text{ for } v_j \triangleq \|g_j\|_{\mathcal{K}_k}\}.$$
The following result, proved in Section B, establishes that this is an acceptable domain for $\mathcal{T}_P$.

**Proposition 1** (Zero mean test functions). If $k \in C^{(1,1)}_b$ and $\mathbb{E}_P[\|\nabla \log p(Z)\|_2] < \infty$, then $\mathbb{E}_P[(\mathcal{T}_P g)(Z)] = 0$ for all $g \in \mathcal{G}_{k,\|\cdot\|}$.

² Our analyses and algorithms support each $g_j$ belonging to a different RKHS $\mathcal{K}_{k_j}$, but we will not need that flexibility here.

The Langevin Stein operator and kernel Stein set together define our quality measure of interest, the kernel Stein discrepancy (KSD) $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|})$. When $\|\cdot\| = \|\cdot\|_2$, this definition recovers the KSD proposed by Chwialkowski et al. (2016) and Liu et al. (2016). Our next result shows that, for any $\|\cdot\|$, the KSD admits a closed-form solution.

**Proposition 2** (KSD closed form). Suppose $k \in C^{(1,1)}$, and, for each $j \in \{1, \ldots, d\}$, define the Stein kernel
$$k_0^j(x, y) \triangleq \frac{1}{p(x)\,p(y)} \nabla_{x_j} \nabla_{y_j} \big(p(x)\,k(x, y)\,p(y)\big) \tag{3}$$
$$= b_j(x)\,b_j(y)\,k(x, y) + b_j(x)\,\nabla_{y_j} k(x, y) + b_j(y)\,\nabla_{x_j} k(x, y) + \nabla_{x_j}\nabla_{y_j} k(x, y).$$
If $\mathbb{E}_\mu[\sqrt{k_0^j(X, X)}] < \infty$ for each $j$, then $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) = \|w\|$, where $w_j \triangleq \sqrt{\mathbb{E}_{\mu \times \mu}[k_0^j(X, \tilde{X})]}$ with $X, \tilde{X}$ drawn independently from $\mu$.

The proof is found in Section C. Notably, when $\mu$ is the discrete measure $Q_n = \sum_{i=1}^n q_n(x_i)\,\delta_{x_i}$, the KSD reduces to evaluating each $k_0^j$ at pairs of support points, as
$$w_j = \sqrt{\textstyle\sum_{i, i' = 1}^n q_n(x_i)\,k_0^j(x_i, x_{i'})\,q_n(x_{i'})},$$
a computation which is easily parallelized over sample pairs and coordinates $j$.

Our Stein set choice was motivated by the work of Oates et al. (2016b), who used the sum of Stein kernels $k_0 = \sum_{j=1}^d k_0^j$ to develop nonparametric control variates. Each term $w_j$ in Proposition 2 can also be viewed as an instance of the maximum mean discrepancy (MMD) (Gretton et al., 2012) between $\mu$ and $P$ measured with respect to the Stein kernel $k_0^j$. In standard uses of MMD, an arbitrary kernel function is selected, and one must be able to compute expectations of the kernel function under $P$. Here, this requirement is satisfied automatically, since our induced kernels are chosen to have mean zero under $P$.

For clarity we will focus on the specific kernel Stein set choice $\mathcal{G}_k \triangleq \mathcal{G}_{k,\|\cdot\|_2}$ for the remainder of the paper, but our results extend directly to KSDs based on any $\|\cdot\|$, since all KSDs are equivalent in a strong sense:

**Proposition 3** (Kernel Stein set equivalence). Under the assumptions of Proposition 2, there are constants $c_d, c'_d > 0$ depending only on $d$ and $\|\cdot\|$ such that
$$c_d\,\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) \leq \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|_2}) \leq c'_d\,\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}).$$

The short proof is found in Section D.
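To make the closed form concrete, the following is a minimal sketch of the $\ell^2$-norm KSD computation, written for this article rather than taken from the authors' SteinDiscrepancy.jl package; the function names, the $O(n^2)$ double loop, and the defaults are our choices. It uses the inverse multiquadric base kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ recommended in Section 3.2 and returns $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) = \big(\sum_{i,i'} q_n(x_i)\,k_0(x_i, x_{i'})\,q_n(x_{i'})\big)^{1/2}$ with $k_0 = \sum_j k_0^j$.

```julia
using LinearAlgebra

# IMQ base kernel k(x, y) = (c^2 + ||x - y||_2^2)^beta and the kernel
# derivatives that the Stein kernel (3) needs.
function imq_pieces(x, y; c=1.0, beta=-0.5)
    v = x .- y
    r2 = sum(abs2, v)
    u = c^2 + r2
    k = u^beta
    gradx = (2beta * u^(beta - 1)) .* v           # ∇_x k(x, y)
    grady = -gradx                                # ∇_y k(x, y)
    tr = -2beta * length(x) * u^(beta - 1) -
         4beta * (beta - 1) * u^(beta - 2) * r2   # <∇_x, ∇_y> k(x, y)
    return k, gradx, grady, tr
end

# Summed Stein kernel k_0(x, y) = sum_j k_0^j(x, y) for score(x) = ∇ log p(x).
function stein_kernel(x, y, score; kwargs...)
    k, gx, gy, tr = imq_pieces(x, y; kwargs...)
    bx, by = score(x), score(y)
    return dot(bx, by) * k + dot(bx, gy) + dot(by, gx) + tr
end

# KSD of a weighted sample: columns of X are sample points, q sums to one.
function ksd(X::AbstractMatrix, q::AbstractVector, score; kwargs...)
    n = size(X, 2)
    total = 0.0
    for i in 1:n, j in 1:n                        # O(n^2) pair sum
        total += q[i] * q[j] * stein_kernel(X[:, i], X[:, j], score; kwargs...)
    end
    return sqrt(total)
end

# Example: standard Gaussian target has score(x) = -x.
X = randn(3, 200)
q = fill(1 / 200, 200)
println(ksd(X, q, x -> -x))
```

Each summand depends only on a pair of sample points, so the double loop parallelizes trivially across pairs and, if the per-coordinate quantities $w_j$ are wanted, across coordinates $j$ as well.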
### 3.2. Lower bounding the kernel Stein discrepancy

We next aim to establish conditions under which the KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$ (Desideratum (ii)). Recently, Gorham et al. (2016) showed that the Langevin graph Stein discrepancy dominates convergence in distribution whenever $P$ belongs to the class $\mathcal{P}$ of distantly dissipative distributions with Lipschitz score function $b$:

**Definition 4** (Distant dissipativity (Eberle, 2015; Gorham et al., 2016)). A distribution $P$ is distantly dissipative if $\kappa_0 \triangleq \liminf_{r \to \infty} \kappa(r) > 0$ for
$$\kappa(r) = \inf\Big\{ -2\,\frac{\langle b(x) - b(y),\, x - y\rangle}{\|x - y\|_2^2} : \|x - y\|_2 = r \Big\}. \tag{4}$$

Examples of distributions in $\mathcal{P}$ include finite Gaussian mixtures with common covariance and all distributions strongly log-concave outside of a compact set, including Bayesian linear, logistic, and Huber regression posteriors with Gaussian priors (see Gorham et al., 2016, Section 4). Moreover, when $d = 1$, membership in $\mathcal{P}$ is sufficient to provide a lower bound on the KSD for most common kernels, including the Gaussian, Matérn, and inverse multiquadric kernels.

**Theorem 5** (Univariate KSD detects non-convergence). Suppose that $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If $d = 1$, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$.

The proof in Section E provides a lower bound on the KSD in terms of an IPM known to dominate weak convergence. However, our next theorem shows that in higher dimensions $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ can converge to 0 without the sequence $(Q_n)_{n \geq 1}$ converging to any probability measure. This deficiency occurs even when the target is Gaussian.

**Theorem 6** (KSD fails with light kernel tails). Suppose $k \in C^{(1,1)}_b$ and define the kernel decay rate
$$\gamma(r) \triangleq \sup\{\max(|k(x, y)|,\, \|\nabla_x k(x, y)\|_2,\, |\langle \nabla_x, \nabla_y k(x, y)\rangle|) : \|x - y\|_2 \geq r\}.$$
If $d \geq 3$, $P = \mathcal{N}(0, I_d)$, and $\gamma(r) = o(r^{-\alpha})$ for $\alpha \triangleq (\tfrac{1}{2} - \tfrac{1}{d})^{-1}$, then $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ does not imply $Q_n \Rightarrow P$.

Theorem 6 implies that KSDs based on the commonly used Gaussian kernel, Matérn kernel, and compactly supported kernels of Wendland (2004, Theorem 9.13) all fail to detect non-convergence when $d \geq 3$. In addition, KSDs based on the inverse multiquadric kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for $\beta < -1$ fail to detect non-convergence for any $d > 2\beta/(\beta + 1)$. The proof in Section F shows that the violating sample sequences $(Q_n)_{n \geq 1}$ are simple to construct, and we provide an empirical demonstration of this failure to detect non-convergence in Section 4.

The failure of the KSDs in Theorem 6 can be traced to their inability to enforce uniform tightness. A sequence of probability measures $(\mu_m)_{m \geq 1}$ is uniformly tight if, for every $\epsilon > 0$, there is a finite number $R(\epsilon)$ such that $\limsup_m \mu_m(\|X\|_2 > R(\epsilon)) \leq \epsilon$. Uniform tightness implies that no mass in the sequence of probability measures escapes to infinity. When the kernel $k$ decays more rapidly than the score function grows, the KSD ignores excess mass in the tails and hence can be driven to zero by a non-tight sequence of increasingly diffuse probability measures. The following theorem demonstrates that uniform tightness is the missing piece needed to ensure weak convergence.

**Theorem 7** (KSD detects tight non-convergence). Suppose that $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If $(\mu_m)_{m \geq 1}$ is uniformly tight, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$.

Our proof in Section G explicitly lower bounds the KSD $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)$ in terms of the bounded Lipschitz metric $d_{\mathrm{BL}_{\|\cdot\|_2}}(\mu, P)$, which exactly metrizes weak convergence.

Ideally, when a sequence of probability measures is not uniformly tight, the KSD would reflect this divergence in its reported value. To achieve this, we consider the inverse multiquadric (IMQ) kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for some $\beta < 0$ and $c > 0$.
While KSDs based on IMQ kernels fail to determine convergence when $\beta < -1$ (by Theorem 6), our next theorem shows that they automatically enforce tightness and detect non-convergence whenever $\beta \in (-1, 0)$.

**Theorem 8** (IMQ KSD detects non-convergence). Suppose $P \in \mathcal{P}$ and $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for $c > 0$ and $\beta \in (-1, 0)$. If $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$, then $\mu_m \Rightarrow P$.

The proof in Section H provides a lower bound on the KSD in terms of the bounded Lipschitz metric $d_{\mathrm{BL}_{\|\cdot\|_2}}(\mu, P)$. The success of the IMQ kernel over other common characteristic kernels can be attributed to its slow decay rate. When $P \in \mathcal{P}$ and the IMQ exponent $\beta > -1$, the function class $\mathcal{T}_P \mathcal{G}_k$ contains unbounded (coercive) functions. These functions ensure that the IMQ KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k)$ goes to 0 only if $(\mu_m)_{m \geq 1}$ is uniformly tight.

### 3.3. Upper bounding the kernel Stein discrepancy

The usual goal in upper bounding the Stein discrepancy is to provide a rate of convergence to $P$ for particular approximating sequences $(\mu_m)_{m \geq 1}$. Because we aim to directly compute the KSD for arbitrary samples $Q_n$, our chief purpose in this section is to ensure that the KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k)$ will converge to zero when $\mu_m$ is converging to $P$ (Desideratum (i)).

**Proposition 9** (KSD detects convergence). If $k \in C^{(2,2)}_b$ and $\nabla \log p$ is Lipschitz with $\mathbb{E}_P[\|\nabla \log p(Z)\|_2^2] < \infty$, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ whenever the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}(\mu_m, P) \to 0$.

Proposition 9 applies to common kernels like the Gaussian, Matérn, and IMQ kernels, and its proof in Section I provides an explicit upper bound on the KSD in terms of the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}$. When $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ for sample points drawn i.i.d. from $\mu$, (Liu et al., 2016, Thm. 4.1) further implies that $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \Rightarrow \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)$ at an $O(n^{-1/2})$ rate under continuity and integrability assumptions on $\mu$.

## 4. Experiments

We next conduct an empirical evaluation of the KSD quality measures recommended by our theory, recording all timings on an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Throughout, we refer to the KSD with IMQ base kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$, exponent $\beta = -\tfrac{1}{2}$, and $c = 1$ as the IMQ KSD. Code reproducing all experiments can be found on the Julia (Bezanson et al., 2014) package site https://jgorham.github.io/SteinDiscrepancy.jl/.

### 4.1. Comparing discrepancies

Our first, simple experiment is designed to illustrate several properties of the IMQ KSD and to compare its behavior with that of two preexisting discrepancy measures: the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}$, which can be computed for simple univariate targets (Vallender, 1974), and the spanner graph Stein discrepancy of Gorham & Mackey (2015). We adopt a bimodal Gaussian mixture with $p(x) \propto e^{-\frac{1}{2}\|x - \Delta e_1\|_2^2} + e^{-\frac{1}{2}\|x + \Delta e_1\|_2^2}$ and $\Delta = 1.5$ as our target $P$ and generate a first sample point sequence i.i.d. from the target and a second sequence i.i.d. from one component of the mixture, $\mathcal{N}(-\Delta e_1, I_d)$.
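Readers following along with the earlier sketches can reproduce a scaled-down version of this comparison by combining the `ksd` and `mixture_score` functions defined above; the means $\pm\Delta e_1$ with $\Delta = 1.5$ follow the setup just described, while the sample size and seed below are arbitrary.

```julia
# Miniature version of the Section 4.1 comparison, reusing the ksd and
# mixture_score sketches above (d = 1).
using Random; Random.seed!(7)
mu1, mu2 = [1.5], [-1.5]
b(x) = mixture_score(x, mu1, mu2)
n = 1_000
draw_mix() = (rand() < 0.5 ? mu1 : mu2) .+ randn(1)
X_on  = reduce(hcat, [draw_mix() for _ in 1:n])        # i.i.d. from the mixture
X_off = reduce(hcat, [mu2 .+ randn(1) for _ in 1:n])   # i.i.d. from one component
q = fill(1 / n, n)
println((ksd(X_on, q, b), ksd(X_off, q, b)))   # the on-target value should be smaller
```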
As seen in the left panel of Figure 1, where $d = 1$, the IMQ KSD decays at an $n^{-0.51}$ rate when applied to the first $n$ points in the target sample and remains bounded away from zero when applied to the single-component sample. This desirable behavior is closely mirrored by the Wasserstein distance and the graph Stein discrepancy.

The middle panel of Figure 1 records the time consumed by the graph and kernel Stein discrepancies applied to the i.i.d. sample points from $P$. Each method is given access to $d$ cores when working in $d$ dimensions, and we use the released code of Gorham & Mackey (2015) with the default Gurobi 6.0.4 linear program solver for the graph Stein discrepancy. We find that the two methods have nearly identical runtimes when $d = 1$ but that the KSD is 10 to 1000 times faster when $d = 4$. In addition, the KSD is straightforwardly parallelized and does not require access to a linear program solver, making it an appealing practical choice for a quality measure.

Finally, the right panel displays the optimal Stein functions,
$$g_j(y) = \frac{\mathbb{E}_{Q_n}[b_j(X)\,k(X, y) + \nabla_{x_j} k(X, y)]}{\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)},$$
recovered by the IMQ KSD when $d = 1$ and $n = 10^3$. The associated test functions
$$h(y) = (\mathcal{T}_P g)(y) = \frac{\mathbb{E}_{Q_n}[k_0(X, y)]}{\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)}$$
are the mean-zero functions under $P$ that best discriminate the target $P$ and the sample $Q_n$. As might be expected, the optimal test function for the single-component sample features large-magnitude values in the oversampled region far from the missing mode.

*Figure 1. Left: for $d = 1$, comparison of discrepancy measures for samples drawn i.i.d. from either the bimodal Gaussian mixture target $P$ or a single mixture component (see Section 4.1). Middle: on-target discrepancy computation time using $d$ cores in $d$ dimensions. Right: for $n = 10^3$ and $d = 1$, the Stein functions $g$ and discriminating test functions $h = \mathcal{T}_P g$ that maximize the KSD.*

### 4.2. The importance of kernel choice

Theorem 6 established that kernels with rapidly decaying tails yield KSDs that can be driven to zero by off-target sample sequences. Our next experiment provides an empirical demonstration of this issue for a multivariate Gaussian target $P = \mathcal{N}(0, I_d)$ and KSDs based on the popular Gaussian ($k(x, y) = e^{-\|x - y\|_2^2 / 2}$) and Matérn ($k(x, y) = (1 + \sqrt{3}\|x - y\|_2)\,e^{-\sqrt{3}\|x - y\|_2}$) radial kernels. Following the proof of Theorem 6 in Section F, we construct an off-target sequence $(Q_n)_{n \geq 1}$ that sends $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ to 0 for these kernel choices whenever $d \geq 3$. Specifically, for each $n$, we let $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ where, for all $i$ and $j$, $\|x_i\|_2 \leq 2 n^{1/d} \log n$ and $\|x_i - x_j\|_2 \geq 2 \log n$. To select these sample points, we independently sample candidate points uniformly from the ball $\{x : \|x\|_2 \leq 2 n^{1/d} \log n\}$, accept any point not within $2 \log n$ Euclidean distance of any previously accepted point, and terminate when $n$ points have been accepted.

For various dimensions, Figure 2 displays the result of applying each KSD to the off-target sequence $(Q_n)_{n \geq 1}$ and to an on-target sequence of points sampled i.i.d. from $P$. For comparison, we also display the behavior of the IMQ KSD, which provably controls tightness and dominates weak convergence for this target by Theorem 8. As predicted, the Gaussian and Matérn KSDs decay to 0 under the off-target sequence and decay more rapidly as the dimension $d$ increases; the IMQ KSD remains bounded away from 0.

*Figure 2. Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to the target $P = \mathcal{N}(0, I_d)$ (see Section 4.2). The IMQ KSD does not share this deficiency.*
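The off-target construction is easy to replicate. The sketch below is our reading of the recipe just described (uniform proposals in the ball of radius $2 n^{1/d} \log n$, rejection of candidates closer than $2 \log n$ to an accepted point), not the authors' experiment code.

```julia
using LinearAlgebra, Random

# Off-target sample in the spirit of Section 4.2: n points in the ball of
# radius 2 n^{1/d} log n with pairwise separation at least 2 log n, found by
# rejection sampling (may need many proposals when the packing is tight).
function off_target_sample(n, d; rng=Random.default_rng())
    R, sep = 2 * n^(1 / d) * log(n), 2 * log(n)
    points = Vector{Vector{Float64}}()
    while length(points) < n
        x = randn(rng, d)
        x .*= R * rand(rng)^(1 / d) / norm(x)      # uniform proposal in the ball
        if all(norm(x - y) >= sep for y in points)
            push!(points, x)
        end
    end
    return reduce(hcat, points)                    # d x n matrix of sample points
end
```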
### 4.3. Selecting sampler hyperparameters

The approximate slice sampler of DuBois et al. (2014) is a biased MCMC procedure designed to accelerate inference when the target density takes the form $p(x) \propto \pi(x) \prod_{l=1}^L \pi(y_l \mid x)$ for $\pi(\cdot)$ a prior distribution on $\mathbb{R}^d$ and $\pi(y_l \mid x)$ the likelihood of a datapoint $y_l$. A standard slice sampler must evaluate the likelihood of all $L$ datapoints to draw each new sample point $x_i$. To reduce this cost, the approximate slice sampler introduces a tuning parameter $\epsilon$ which determines the number of datapoints that contribute to an approximation of the slice sampling step; an appropriate setting of this parameter is imperative for accurate inference. When $\epsilon$ is too small, relatively few sample points will be generated in a given amount of sampling time, yielding sample expectations with high Monte Carlo variance. When $\epsilon$ is too large, the large approximation error will produce biased samples that no longer resemble the target.

To assess the suitability of the KSD for tolerance parameter selection, we take as our target $P$ the bimodal Gaussian mixture model posterior of Welling & Teh (2011). For an array of $\epsilon$ values, we generated 50 independent approximate slice sampling chains with batch size 5, each with a budget of 148000 likelihood evaluations, and plotted the median IMQ KSD and effective sample size (ESS, a standard sample quality measure based on asymptotic variance (Brooks et al., 2011)) in Figure 3. ESS, which does not detect Markov chain bias, is maximized at the largest hyperparameter evaluated ($\epsilon = 10^{-1}$), while the KSD is minimized at an intermediate value ($\epsilon = 10^{-2}$). The right panel of Figure 3 shows representative samples produced by several settings of $\epsilon$. The sample produced by the ESS-selected chain is significantly overdispersed, while the sample from $\epsilon = 0$ has minimal coverage of the second mode due to its small sample size. The sample produced by the KSD-selected chain best resembles the posterior target. Using 4 cores, the longest KSD computation with $n = 10^3$ sample points took 0.16s.

*Figure 3. Left: median hyperparameter selection criteria across 50 independent approximate slice sampler sample sequences (see Section 4.3); the IMQ KSD selects $\epsilon = 10^{-2}$, while effective sample size selects $\epsilon = 10^{-1}$. Right: representative approximate slice sampler samples requiring 148000 likelihood evaluations with posterior equidensity contours overlaid; $n$ is the associated sample size.*

### 4.4. Selecting samplers

Ahn et al. (2012) developed two biased MCMC samplers for accelerated posterior inference, both called Stochastic Gradient Fisher Scoring (SGFS). In the full version of SGFS (termed SGFS-f), a $d \times d$ matrix must be inverted to draw each new sample point. Since this can be costly for large $d$, the authors developed a second sampler (termed SGFS-d) in which only a diagonal matrix must be inverted to draw each new sample point. Both samplers can be viewed as discrete-time approximations to a continuous-time Markov process that has the target $P$ as its stationary distribution; however, because no Metropolis-Hastings correction is employed, neither sampler has the target as its stationary distribution. Hence we will use the KSD, a quality measure that accounts for asymptotic bias, to evaluate and choose between these samplers.

Specifically, we evaluate the SGFS-f and SGFS-d samples produced in (Ahn et al., 2012, Sec. 5.1). The target $P$ is a Bayesian logistic regression with a flat prior, conditioned on a dataset of $10^4$ MNIST handwritten digit images. From each image, the authors extracted 50 random projections of the raw pixel values as covariates and a label indicating whether the image was a 7 or a 9. After discarding the first half of sample points as burn-in, we obtained regression coefficient samples with $5 \times 10^4$ points and $d = 51$ dimensions (including the intercept term). Figure 4 displays the IMQ KSD applied to the first $n$ points in each sample. As external validation, we follow the protocol of Ahn et al. (2012) to find the bivariate marginal means and 95% confidence ellipses of each sample that align best and worst with those of a surrogate ground truth sample obtained from a Hamiltonian Monte Carlo chain with $10^5$ iterates.
Both the KSD and the surrogate ground truth suggest that the moderate speed-up provided by SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the significant loss in inferential accuracy. However, the KSD assessment does not require access to an external trustworthy ground truth sample. The longest KSD computation took 400s using 16 cores.

*Figure 4. Left: quality comparison for Bayesian logistic regression with two SGFS samplers (see Section 4.4). Right: scatter plots of $n = 5 \times 10^4$ SGFS sample points with overlaid bivariate marginal means and 95% confidence ellipses (dashed blue) that align best and worst with the surrogate ground truth sample (solid red).*

### 4.5. Beyond sample quality comparison

While our investigation of the KSD was motivated by the desire to develop practical, trustworthy tools for sample quality comparison, the kernels recommended by our theory can serve as drop-in replacements in other inferential tasks that make use of kernel Stein discrepancies.

#### 4.5.1. One-sample hypothesis testing

Chwialkowski et al. (2016) recently used the KSD $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ to develop a hypothesis test of whether a given sample from a Markov chain was drawn from a target distribution $P$ (see also Liu et al., 2016). However, the authors noted that the KSD test with their default Gaussian base kernel $k$ experienced a considerable loss of power as the dimension $d$ increased. We recreate their experiment and show that this loss of power can be avoided by using our default IMQ kernel with $\beta = -\tfrac{1}{2}$ and $c = 1$.

Following (Chwialkowski et al., 2016, Section 4), we draw $z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and $u_i \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}[0, 1]$ to generate a sample $(x_i)_{i=1}^n$ with $x_i = z_i + u_i e_1$ for $n = 500$ and various dimensions $d$. Using the authors' code (modified to include an IMQ kernel), we compare the power of the Gaussian KSD test, the IMQ KSD test, and the standard normality test of Baringhaus & Henze (1988) (B&H) to discern whether the sample $(x_i)_{i=1}^{500}$ came from the null distribution $P = \mathcal{N}(0, I_d)$. The results, averaged over 400 simulations, are shown in Table 1. Notably, the IMQ KSD experiences no power degradation over this range of dimensions, thus improving on both the Gaussian KSD and the standard B&H normality tests.

Table 1. Power of one-sample tests for multivariate normality, averaged over 400 simulations (see Section 4.5.1).

| Test | d=2 | d=5 | d=10 | d=15 | d=20 | d=25 |
|---|---|---|---|---|---|---|
| B&H | 1.0 | 1.0 | 1.0 | 0.91 | 0.57 | 0.26 |
| Gaussian | 1.0 | 1.0 | 0.88 | 0.29 | 0.12 | 0.02 |
| IMQ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
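As a rough indication of how such a test can be assembled from the ingredients above, the sketch below bootstraps the V-statistic $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)^2$ with independent Rademacher multipliers. This is a simplified i.i.d. stand-in for the wild bootstrap of Chwialkowski et al. (2016), whose released code and Markov-chain-aware bootstrap process we do not reproduce here; `stein_kernel` is the sketch from Section 3.1.

```julia
using Random, Statistics

# Stein kernel matrix K0[i, j] = k_0(x_i, x_j) for sample columns of X,
# built with the stein_kernel sketch from Section 3.1.
stein_matrix(X, score; kwargs...) =
    [stein_kernel(X[:, i], X[:, j], score; kwargs...)
     for i in 1:size(X, 2), j in 1:size(X, 2)]

# Simplified one-sample KSD test for i.i.d. data: compare the observed
# V-statistic to its Rademacher-multiplier bootstrap distribution.
function ksd_test(K0::AbstractMatrix; nboot=1000, rng=Random.default_rng())
    n = size(K0, 1)
    stat = sum(K0) / n^2                          # observed S(Q_n)^2
    boot = [begin
                w = rand(rng, [-1.0, 1.0], n)     # Rademacher multipliers
                (w' * K0 * w) / n^2
            end for _ in 1:nboot]
    return stat, mean(boot .>= stat)              # (statistic, bootstrap p-value)
end

# Example: n = 500 points shifted along e_1, as in Section 4.5.1.
Random.seed!(3)
d, n = 10, 500
X = randn(d, n); X[1, :] .+= rand(n)
K0 = stein_matrix(X, x -> -x)                     # target P = N(0, I_d)
println(ksd_test(K0))
```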
#### 4.5.2. Improving sample quality

Liu & Lee (2016) recently used the KSD $\mathcal{S}(\tilde{Q}_n, \mathcal{T}_P, \mathcal{G}_k)$ as a means of improving the quality of a sample. Specifically, given an initial sample $Q_n$ supported on $x_1, \ldots, x_n$, they minimize $\mathcal{S}(\tilde{Q}_n, \mathcal{T}_P, \mathcal{G}_k)$ over all measures $\tilde{Q}_n$ supported on the same sample points to obtain a new sample that better approximates $P$ over the class of test functions $\mathcal{H} = \mathcal{T}_P \mathcal{G}_k$. In all experiments, Liu & Lee (2016) employ a Gaussian kernel $k(x, y) = e^{-\frac{1}{h}\|x - y\|_2^2}$ with bandwidth $h$ selected to be the median of the squared Euclidean distance between pairs of sample points. Using the authors' code, we recreate the experiment from (Liu & Lee, 2016, Fig. 2b) and introduce a KSD objective with an IMQ kernel $k(x, y) = (1 + \frac{1}{h}\|x - y\|_2^2)^{-1/2}$ with bandwidth selected in the same fashion. The starting sample is given by $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ for $n = 100$, various dimensions $d$, and each sample point drawn i.i.d. from $P = \mathcal{N}(0, I_d)$. For the initial sample and the optimized samples produced by each KSD, Figure 5 displays the mean squared error (MSE) $\frac{1}{d}\|\mathbb{E}_P[Z] - \mathbb{E}_{\tilde{Q}_n}[X]\|_2^2$ averaged across 500 independently generated initial samples. Out of the box, the IMQ kernel produces better mean estimates than the standard Gaussian.

*Figure 5. Average quality of mean estimates ($\pm 2$ standard errors) under optimized samples $\tilde{Q}_n$ for target $P = \mathcal{N}(0, I_d)$; MSE averaged over 500 independent initial samples (see Section 4.5.2).*
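To convey the flavor of this reweighting step, the following sketch minimizes the squared KSD $q^\top K_0 q$ over probability weights $q$ on the fixed support points via entropic mirror descent. It is a simple stand-in for, not a reproduction of, the black-box importance sampling procedure of Liu & Lee (2016); the step size and iteration count are arbitrary, and `stein_matrix` is the helper from the previous sketch.

```julia
# Minimize the squared KSD  q ↦ q' * K0 * q  over the probability simplex by
# entropic mirror descent (exponentiated gradient); a stand-in for the
# black-box importance sampling procedure of Liu & Lee (2016).
function minimize_ksd_weights(K0::AbstractMatrix; iters=500, step=0.1)
    n = size(K0, 1)
    q = fill(1.0 / n, n)
    for _ in 1:iters
        g = 2 .* (K0 * q)             # gradient of q -> q' * K0 * q
        q .= q .* exp.(-step .* g)
        q ./= sum(q)                  # renormalize onto the simplex
    end
    return q
end

# Example: reweight an i.i.d. N(0, I_d) sample using the IMQ Stein kernel matrix.
q_new = minimize_ksd_weights(stein_matrix(randn(5, 100), x -> -x))
```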
## 5. Related and future work

The score statistic of Fan et al. (2006) and the Gibbs sampler convergence criteria of Zellner & Min (1995) detect certain forms of non-convergence but fail to detect others due to the finite number of test functions tested. For example, when $P = \mathcal{N}(0, 1)$, the score statistic (Fan et al., 2006) only monitors sample means and variances.

For an approximation $\mu$ with continuously differentiable density $r$, Chwialkowski et al. (2016, Thm. 2.1) and Liu et al. (2016, Prop. 3.3) established that if $k$ is $C_0$-universal (Carmeli et al., 2010, Defn. 4.1) or integrally strictly positive definite (ISPD, Stewart, 1976, Sec. 6) and $\mathbb{E}_\mu[k_0(X, X) + \|\nabla \log p(X)\|_2^2] < \infty$ for $k_0 \triangleq \sum_{j=1}^d k_0^j$, then $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k) = 0$ only if $\mu = P$. However, this property is insufficient to conclude that probability measures with small KSD are close to $P$ in any traditional sense. Indeed, Gaussian and Matérn kernels are $C_0$-universal and ISPD, but, by Theorem 6, their KSDs can be driven to zero by sequences not converging to $P$. On compact domains, where tightness is no longer an issue, the combined results of (Oates et al., 2016a, Lem. 4), (Fukumizu et al., 2007, Lem. 1), and (Simon-Gabriel & Schölkopf, 2016, Thm. 55) give conditions for a KSD to dominate weak convergence.

While assessing sample quality was our chief objective, our results may hold benefits for other applications that make use of Stein discrepancies or Stein operators. In particular, our kernel recommendations could be incorporated into the Monte Carlo control functionals framework of Oates et al. (2016b); Oates & Girolami (2015), the variational inference approaches of Liu & Wang (2016); Liu & Feng (2016); Ranganath et al. (2016), and the Stein generative adversarial network approach of Wang & Liu (2016).

In the future, we aim to leverage stochastic, low-rank, and sparse approximations of the kernel matrix and score function to produce KSDs that scale better with the number of sample and data points while still guaranteeing control over weak convergence. A reader may also wonder for which distributions outside of $\mathcal{P}$ the KSD dominates weak convergence. The following theorem, proved in Section J, shows that no KSD with a $C_0$ kernel dominates weak convergence when the target has a bounded score function.

**Theorem 10** (KSD fails for bounded scores). If $\nabla \log p$ is bounded and $k \in C^{(1,1)}_0$, then $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ does not imply $Q_n \Rightarrow P$.

However, Gorham et al. (2016) developed convergence-determining graph Stein discrepancies for heavy-tailed targets by replacing the Langevin Stein operator $\mathcal{T}_P$ with diffusion Stein operators of the form
$$(\mathcal{T}g)(x) = \frac{1}{p(x)} \langle \nabla, p(x)\,(a(x) + c(x))\,g(x)\rangle.$$
An analogous construction should yield convergence-determining diffusion KSDs for $P$ outside of $\mathcal{P}$. Our results also extend to targets $P$ supported on a convex subset $\mathcal{X}$ of $\mathbb{R}^d$ by choosing $k$ to satisfy $p(x)\,k(x, \cdot) = 0$ for all $x$ on the boundary of $\mathcal{X}$.

## References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.

Bachman, G. and Narici, L. Functional Analysis. Academic Press textbooks in mathematics. Dover Publications, 1966. ISBN 9780486402512.

Baker, J. Integration of radial functions. Mathematics Magazine, 72(5):392–395, 1999.

Barbour, A. D. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.

Barbour, A. D. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN 0178-8051. doi: 10.1007/BF01197887.

Baringhaus, L. and Henze, N. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. arXiv preprint arXiv:1411.1607, 2014.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov chain Monte Carlo. CRC Press, 2011.

Carmeli, C., De Vito, E., Toigo, A., and Umanità, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.

Chatterjee, S. and Shao, Q. Nonnormal approximation by Stein's method of exchangeable pairs with application to the Curie-Weiss model. Ann. Appl. Probab., 21(2):464–483, 2011. ISSN 1050-5164. doi: 10.1214/10-AAP712.

Chen, L., Goldstein, L., and Shao, Q. Normal approximation by Stein's method. Probability and its Applications. Springer, Heidelberg, 2011. ISBN 978-3-642-15006-7. doi: 10.1007/978-3-642-15007-4.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.

DuBois, C., Korattikara, A., Welling, M., and Smyth, P. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pp. 185–193, 2014.

Eberle, A. Reflection couplings and contraction rates for diffusions. Probab. Theory Related Fields, pp. 1–36, 2015. doi: 10.1007/s00440-015-0673-1.

Fan, Y., Brooks, S. P., and Gelman, A. Output assessment for Monte Carlo simulations via the score statistic. J. Comp. Graph. Stat., 15(1), 2006.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS, volume 20, pp. 489–496, 2007.
Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Computer Science and Statistics: Proc. 23rd Symp. Interface, pp. 156–163, 1991.

Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Adv. NIPS 28, pp. 226–234. Curran Associates, Inc., 2015.

Gorham, J., Duncan, A., Vollmer, S., and Mackey, L. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.

Götze, F. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. J. Mach. Learn. Res., 13(1):723–773, 2012.

Herb, R. and Sally Jr., P. J. The Plancherel formula, the Plancherel theorem, and the Fourier transform of orbital integrals. In Representation Theory and Mathematical Physics: Conference in Honor of Gregg Zuckerman's 60th Birthday, October 24–27, 2009, Yale University, volume 557, pp. 1. American Mathematical Soc., 2011.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, ICML'14, 2014.

Ley, C., Reinert, G., and Swan, Y. Stein's method for comparison of univariate distributions. Probab. Surveys, 14:1–52, 2017. doi: 10.1214/16-PS278.

Liu, Q. and Feng, Y. Two methods for wild variational inference. arXiv preprint arXiv:1612.00081, 2016.

Liu, Q. and Lee, J. Black-box importance sampling. arXiv:1610.05247, October 2016. To appear in AISTATS 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, August 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, volume 48 of ICML, pp. 276–284, 2016.

Mackey, L. and Gorham, J. Multivariate Stein factors for a class of strongly log-concave distributions. Electron. Commun. Probab., 21:14 pp., 2016. doi: 10.1214/16-ECP15.

Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

Oates, C. and Girolami, M. Control functionals for quasi-Monte Carlo integration. arXiv:1501.03379, 2015.

Oates, C., Cockayne, J., Briol, F., and Girolami, M. Convergence rates for a class of estimators based on Stein's method. arXiv preprint arXiv:1603.03220, 2016a.

Oates, C. J., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016b. ISSN 1467-9868. doi: 10.1111/rssb.12185.

Ranganath, R., Tran, D., Altosaar, J., and Blei, D. Operator variational inference. In Advances in Neural Information Processing Systems, pp. 496–504, 2016.

Simon-Gabriel, C. and Schölkopf, B. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint arXiv:1604.05251, 2016.

Sriperumbudur, B. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11(Apr):1517–1561, 2010.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability theory, pp. 583–602. Univ. California Press, Berkeley, Calif., 1972.
Stein, C., Diaconis, P., Holmes, S., and Reinert, G. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pp. 1–26. Inst. Math. Statist., Beachwood, OH, 2004.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Stewart, J. Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math., 6(3):409–434, 1976. doi: 10.1216/RMJ-1976-6-3-409.

Vallender, S. Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl., 18(4):784–786, 1974.

Wainwright, M. High-dimensional statistics: A non-asymptotic viewpoint. 2017. URL http://www.stat.berkeley.edu/~wainwrig/nachdiplom/Chap5_Sep10_2015.pdf.

Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, November 2016.

Welling, M. and Teh, Y. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

Wendland, H. Scattered data approximation, volume 17. Cambridge University Press, 2004.

Zellner, A. and Min, C. Gibbs sampler convergence criteria. JASA, 90(431):921–927, 1995.