# Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds

Wenkai Xu (Gatsby Computational Neuroscience Unit, London, United Kingdom) and Takeru Matsuda (RIKEN Center for Brain Science, Tokyo, Japan). Correspondence to: Wenkai Xu, Takeru Matsuda.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. In many applications, we encounter data on Riemannian manifolds such as the torus and rotation groups. Standard statistical procedures for multivariate data are not applicable to such data. In this study, we develop goodness-of-fit testing and interpretable model criticism methods for general distributions on Riemannian manifolds, including those with an intractable normalization constant. The proposed methods are based on extensions of kernel Stein discrepancy, which are derived from Stein operators on Riemannian manifolds. We discuss the connections between the proposed tests and existing ones and provide a theoretical analysis of their asymptotic Bahadur efficiency. Simulation results and real data applications show the validity and usefulness of the proposed methods.

1. Introduction

In many scientific and machine learning applications, data appear in domains described by Riemannian manifolds. For example, structures of proteins and molecules are described by pairs of angular variables, which are identified with points on the torus (Singh et al., 2002). In computer vision and related studies, the orientation of a camera is represented by a $3 \times 3$ rotation matrix, which gives rise to data on the rotation group (Song et al., 2009). Other examples include the orbit of a comet (Jupp et al., 1979) and vectorcardiogram data (Downs, 1972). In addition, shape analysis (Dryden & Mardia, 2016) and compositional data analysis (Pawlowsky-Glahn & Buccianti, 2011) also deal with complex data defined on Riemannian manifolds. Recently, Klein et al. (2020) developed a graphical model on the torus to analyze phase coupling between neuronal activities. Since the usual statistical procedures for Euclidean data are not applicable in such scenarios, many studies have developed statistical models and methods tailored for data on Riemannian manifolds (Chikuse, 2012; Mardia & Jupp, 1999; Ley & Verdebout, 2017).

Statistical models on Riemannian manifolds are often given in the form of unnormalized densities with a computationally intractable normalization constant. For example, the Fisher distribution on the rotation group (Chikuse, 2012; Sei et al., 2013) is defined by

$$p(X \mid \Theta) \propto \exp(\mathrm{tr}(\Theta^\top X)), \quad (1)$$

and its normalization constant is not given in closed form. Statistical inference with such models can become computationally intensive due to the intractable normalization constant. Thus, statistical methods on Riemannian manifolds that do not require computation of the normalization constant have been developed for several tasks such as parameter estimation (Mardia et al., 2016) and sampling (Girolami et al., 2009; Ma et al., 2015). However, goodness-of-fit testing and model criticism procedures for general distributions on Riemannian manifolds are not established, to the best of our knowledge.

Kernel Stein discrepancy (KSD) (Gorham & Mackey, 2015; Ley et al., 2017) is a discrepancy measure between distributions based on Stein's method (Barbour & Chen, 2005; Chen et al., 2010) and reproducing kernel Hilbert space (RKHS) theory (Berlinet & Thomas, 2004).
KSD provides a general procedure for goodness-of-fit testing that does not require computation of the normalization constant, and it has shown state-of-the-art performance in various scenarios including Euclidean data (Chwialkowski et al., 2016; Liu et al., 2016), discrete data (Yang et al., 2018), point processes (Yang et al., 2019), directional data (Xu & Matsuda, 2020), censored data (Fernandez et al., 2020) and random graphs (Xu & Reinert, 2021). In addition, by using the technique of optimizing test power (Gretton et al., 2012; Sutherland et al., 2016), KSD-based testing procedures also enable extraction of distributional features to perform model criticism (Jitkrittum et al., 2017; 2018; Kanagawa et al., 2019; Jitkrittum et al., 2020). We note that Stein's method has recently been extended to Riemannian manifolds and studied for numerical integration (Barp et al., 2018) and Bayesian inference (Liu & Zhu, 2018).

In this paper, we develop goodness-of-fit testing and interpretable model criticism methods for general distributions on Riemannian manifolds. After briefly reviewing background topics, we first introduce several types of Stein operators on Riemannian manifolds by using Stokes' theorem. Then, we define manifold kernel Stein discrepancies (mKSD) based on them and propose goodness-of-fit testing procedures, which do not require computation of the normalization constant. We also develop mKSD-based interpretable model criticism procedures. Theoretical comparisons of test performance in terms of Bahadur efficiency are provided, and simulation results validate the claims. Finally, we provide real data applications to demonstrate the usefulness of the proposed methods.

2. Background

2.1. Distributions on Riemannian Manifolds

In this paper, we focus on distributions on a smooth Riemannian manifold $(M, g)$, where $g$ is a Riemannian metric on $M$. (In this paper, $M$ may have a non-empty boundary $\partial M$.) See Kobayashi & Nomizu (1963) for details on Riemannian geometry. Here, we give several examples that will be used in the experiments. Note that we define the probability density of each distribution by its Radon-Nikodym derivative with respect to the volume element of $(M, g)$.

Torus. Bivariate circular data $(x_1, x_2) \in [0, 2\pi)^2$ can be viewed as data on the torus $S^1 \times S^1$, where we identify $(\cos x, \sin x) \in S^1$ with $x \in [0, 2\pi)$. To describe dependence between circular variables, Singh et al. (2002) proposed the bivariate von Mises distribution:

$$p(x_1, x_2 \mid \eta) \propto \exp\big(\kappa_1 \cos(x_1 - \mu_1) + \kappa_2 \cos(x_2 - \mu_2) + \lambda_{12} \sin(x_1 - \mu_1)\sin(x_2 - \mu_2)\big), \quad (2)$$

where $\eta = (\kappa_1, \kappa_2, \mu_1, \mu_2, \lambda_{12})$, $\kappa_1 \geq 0$, $\kappa_2 \geq 0$, $0 \leq \mu_1 < 2\pi$ and $0 \leq \mu_2 < 2\pi$. Its normalization constant is not represented in closed form. We will apply this model to wind direction data in Section 8.

Rotation group. The rotation group $SO(m)$ is defined as $SO(m) = \{X \in \mathbb{R}^{m \times m} \mid X^\top X = I_m, \ \det X = 1\}$, where $I_m$ is the $m$-dimensional identity matrix. The Fisher distribution (Chikuse, 2012; Sei et al., 2013) on $SO(m)$ is defined as $p(X \mid \Theta) \propto \exp(\mathrm{tr}(\Theta^\top X))$, for which the normalization constant is not given in closed form. We will apply this model to vectorcardiogram data in Section 8. A sketch of how the unnormalized log-densities of these two models can be evaluated, together with the derivatives used later by the Stein operators, is given below.
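As a concrete illustration (our own sketch, not part of the original paper; function names are illustrative), the following Python code evaluates the unnormalized log-densities of the two running examples and the partial derivatives of the bivariate von Mises log-density. These derivatives are the only model-specific quantities that the Stein operators of Section 3 will need, so the intractable normalization constants never appear.

```python
import numpy as np

def log_bvm_unnorm(x1, x2, kappa1, kappa2, mu1, mu2, lam12):
    """Unnormalized log-density of the bivariate von Mises model, Eq. (2)."""
    return (kappa1 * np.cos(x1 - mu1) + kappa2 * np.cos(x2 - mu2)
            + lam12 * np.sin(x1 - mu1) * np.sin(x2 - mu2))

def grad_log_bvm(x1, x2, kappa1, kappa2, mu1, mu2, lam12):
    """Partial derivatives of the unnormalized log-density w.r.t. (x1, x2)."""
    d1 = -kappa1 * np.sin(x1 - mu1) + lam12 * np.cos(x1 - mu1) * np.sin(x2 - mu2)
    d2 = -kappa2 * np.sin(x2 - mu2) + lam12 * np.sin(x1 - mu1) * np.cos(x2 - mu2)
    return d1, d2

def log_fisher_unnorm(X, Theta):
    """Unnormalized log-density of the Fisher distribution on SO(m), Eq. (1)."""
    return np.trace(Theta.T @ X)
```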
Goodness-of-fit testing for general distributions on Riemannian manifolds is not established, to the best of our knowledge. For tests of uniformity, several methods have been proposed, such as the Sobolev test (Chikuse & Jupp, 2004; Giné, 1975; Jupp et al., 2008). However, they are not readily applicable to general distributions. Although there are a few methods applicable to general distributions (Jupp et al., 2005; Jupp & Kume, 2018), they require computation of the normalization constant, which is often computationally intensive. In addition, existing testing procedures cannot be applied to perform interpretable model criticism (Jitkrittum et al., 2016; Kim et al., 2016; Lloyd & Ghahramani, 2015), which would provide an intuitive clarification of the discrepancy between the model and the data.

2.2. Kernel Stein Discrepancy on $\mathbb{R}^d$

Here, we briefly review goodness-of-fit testing with kernel Stein discrepancy on $\mathbb{R}^d$. See Chwialkowski et al. (2016); Liu et al. (2016) for more detail. Let $q$ be a smooth probability density on $\mathbb{R}^d$. For a smooth function $f = (f_1, \ldots, f_d) : \mathbb{R}^d \to \mathbb{R}^d$, the Stein operator $T_q$ is defined by

$$T_q f(x) = \sum_{i=1}^d \left( f_i(x) \frac{\partial}{\partial x_i} \log q(x) + \frac{\partial}{\partial x_i} f_i(x) \right). \quad (3)$$

From integration by parts on $\mathbb{R}^d$, we obtain the equality $\mathbb{E}_q[T_q f] = 0$, i.e. Stein's identity, under mild regularity conditions. Since the Stein operator $T_q$ depends on the density $q$ only through the derivatives of $\log q$, it does not involve the normalization constant of $q$, which is a useful property for dealing with unnormalized models (Hyvärinen, 2005).

Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) on $\mathbb{R}^d$ and $\mathcal{H}^d$ be its product. By using the Stein operator, the kernel Stein discrepancy (KSD) (Gorham & Mackey, 2015; Ley et al., 2017) between two densities $p$ and $q$ is defined as

$$\mathrm{KSD}(p \,\|\, q) = \sup_{\|f\|_{\mathcal{H}^d} \leq 1} \mathbb{E}_p[T_q f].$$

It is shown that $\mathrm{KSD}(p\|q) \geq 0$ and $\mathrm{KSD}(p\|q) = 0$ if and only if $p = q$ under mild regularity conditions (Chwialkowski et al., 2016). Thus, KSD is a proper discrepancy measure between densities. After some calculation, $\mathrm{KSD}(p\|q)$ is rewritten as

$$\mathrm{KSD}^2(p\|q) = \mathbb{E}_{x, \tilde{x} \sim p}[h_q(x, \tilde{x})], \quad (4)$$

where $h_q(x, \tilde{x}) = \langle T_q k(x, \cdot), T_q k(\tilde{x}, \cdot) \rangle_{\mathcal{H}}$, which does not involve the density $p$. Given samples $x_1, \ldots, x_n$ from an unknown density $p$ on $\mathbb{R}^d$, an empirical estimate of $\mathrm{KSD}^2(p\|q)$ can be obtained by using Eq. (4) in the form of a U-statistic, and this estimate is used to test the hypothesis $H_0 : p = q$, where the critical value is determined by bootstrap. In this way, a general method of non-parametric goodness-of-fit testing on $\mathbb{R}^d$ is obtained, which does not require computation of the normalization constant.

3. Stein Operators on M

In this section, we introduce several types of Stein operators for distributions on Riemannian manifolds by using Stokes' theorem. The operators are categorized via the order of differentials of the input functions. (Note that this should be distinguished from the differentials of the unnormalized density functions.)

3.1. Differential Forms and Stokes' Theorem

To derive Stein operators on Riemannian manifolds, we need differential forms and Stokes' theorem. Here, we briefly introduce these concepts. For more detailed and rigorous treatments, see Flanders (1963); Spivak (2018). Let $M$ be a smooth $d$-dimensional Riemannian manifold and take its local coordinate system $x^1, \ldots, x^d$. We introduce symbols $dx^1, \ldots, dx^d$ and an associative and antisymmetric operation between them called the wedge product: $dx^i \wedge dx^j = -dx^j \wedge dx^i$. Note that $dx^i \wedge dx^i = 0$. Then, a $p$-form $\omega$ on $M$ ($0 \leq p \leq d$) is defined as

$$\omega = \sum_{i_1 < \cdots < i_p} f_{i_1 \cdots i_p} \, dx^{i_1} \wedge \cdots \wedge dx^{i_p},$$

where the sum is taken over all $p$-tuples $\{i_1, \ldots, i_p\} \subset \{1, \ldots, d\}$ and each $f_{i_1 \cdots i_p}$ is a smooth function on $M$. The exterior derivative $d\omega$ of $\omega$ is defined as the $(p+1)$-form given by

$$d\omega = \sum_{i_1 < \cdots < i_p} \sum_{i} \frac{\partial f_{i_1 \cdots i_p}}{\partial x^i} \, dx^i \wedge dx^{i_1} \wedge \cdots \wedge dx^{i_p}.$$

For another coordinate system $y^1, \ldots, y^d$ on $M$, the differential form is transformed by $dy^j = \sum_{i=1}^d \frac{\partial y^j}{\partial x^i} dx^i$. The volume element is defined as the $d$-form $(\det g)^{1/2} dx^1 \wedge \cdots \wedge dx^d$, where $g = g(x^1, \ldots, x^d)$ is the $d \times d$ matrix of the Riemannian metric with respect to $x^1, \ldots, x^d$. The integration of a $d$-form on a $d$-dimensional manifold is naturally defined like the usual integration on $\mathbb{R}^d$ and is invariant with respect to the choice of coordinates. Correspondingly, the integration by parts formula on $\mathbb{R}^d$ is generalized in the form of Stokes' theorem.

Proposition 1 (Stokes' theorem). Let $\partial M$ be the boundary of $M$ and $\omega$ be a $(d-1)$-form on $M$. Then,

$$\int_M d\omega = \int_{\partial M} \omega.$$

Corollary 1. If $\partial M$ is empty, then $\int_M d\omega = 0$ for any $(d-1)$-form $\omega$ on $M$.

Coordinate choice. In the following, to facilitate the derivation as well as the computation of Stein operators, we assume that there exists a coordinate system $\theta^1, \ldots, \theta^d$ on $M$ that covers $M$ almost everywhere. For example, spherical coordinates for the hyperspheres and torus, generalized Euler angles (Chikuse, 2012, Section 2.5.1) for the rotation groups, and Givens rotations (Pourzanjani et al., 2017) for the Stiefel manifolds satisfy this assumption.

3.2. First Order Stein Operator

For a smooth probability density $q$ on $M$ and a smooth function $f = (f^1, \ldots, f^d) : M \to \mathbb{R}^d$, define a function $A^{(1)}_q f : M \to \mathbb{R}$ by

$$A^{(1)}_q f = \sum_{i=1}^d \left( \frac{\partial f^i}{\partial \theta^i} + f^i \frac{\partial}{\partial \theta^i} \log(q J) \right), \quad (5)$$

where $J = (\det g)^{1/2}$ is the volume element. We refer to $A^{(1)}_q$ as the first order Stein operator. Note that Xu & Matsuda (2020) utilized this operator for goodness-of-fit testing on hyperspheres.

Theorem 1. If $\partial M$ is empty or $f^1, \ldots, f^d$ vanish on $\partial M$, then $\mathbb{E}_q[A^{(1)}_q f] = 0$.

If $M$ is a closed manifold such as the torus and the rotation group, it has no boundary by definition and thus the assumption of Theorem 1 holds. If the boundary of $M$ is non-empty, a discussion relevant to the assumption of Theorem 1 can be found in Liu & Kanamori (2019), which studies density estimation on truncated domains. Note that the assumption of Theorem 1 is similar to Assumption 4 in Barp et al. (2018).

3.3. Second Order Stein Operator

In the context of numerical integration on Riemannian manifolds, Barp et al. (2018) introduced a different type of Stein operator $A^{(2)}_q$, which we call the second order Stein operator. Specifically, for a smooth probability density $q$ on $M$ and a smooth function $f : M \to \mathbb{R}$, define $A^{(2)}_q f : M \to \mathbb{R}$ by

$$A^{(2)}_q f = \sum_{i,j} \left( g^{ij} \frac{\partial^2 f}{\partial \theta^i \partial \theta^j} + g^{ij} \frac{\partial f}{\partial \theta^j} \frac{\partial}{\partial \theta^i} \log(q J) + \frac{\partial g^{ij}}{\partial \theta^i} \frac{\partial f}{\partial \theta^j} \right), \quad (6)$$

where we denote the inverse matrix of $(g_{ij})$ by $(g^{ij})$ following the convention of Riemannian geometry.

Proposition 2 (Proposition 1 of Barp et al. (2018)). If $\partial M$ is empty or $f$ vanishes on $\partial M$, then $\mathbb{E}_q[A^{(2)}_q f] = 0$.

This also follows from Theorem 1, because the second order Stein operator in Eq. (6) can be viewed as a special case of the first order Stein operator in Eq. (5) with

$$f^i = \sum_j g^{ij} \frac{\partial f}{\partial \theta^j}. \quad (7)$$

A similar form of the second order Stein operator in Eq. (6) has been studied in Liu & Zhu (2018) for Bayesian inference. On the other hand, Le et al. (2020) arrive at a similar second order Stein operator on Riemannian manifolds via a Feller diffusion process in the context of density approximation.

3.4. Zeroth Order Stein Operator

For a smooth probability density $q$ on $M$ and a function $h : M \to \mathbb{R}$, define a function $A^{(0)}_q h : M \to \mathbb{R}$ by

$$A^{(0)}_q h = h - \mathbb{E}_q[h], \quad (8)$$

which clearly satisfies $\mathbb{E}_q[A^{(0)}_q h] = 0$. Since $A^{(0)}_q$ does not involve any differentials, we call it the zeroth order Stein operator. Compared to the first and second order Stein operators, this operator requires the normalization constant of $q$, which is often computationally intractable for Riemannian manifolds. We will show later that this operator corresponds to the maximum mean discrepancy (MMD) (Gretton et al., 2007). As a quick sanity check of the Stein identity in Theorem 1, a small numerical illustration on the circle is sketched below.
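The following toy numerical check (our own sketch, assuming a von Mises density on $S^1$ and simple numerical quadrature; it is not from the paper) illustrates Theorem 1 on the circle, where the volume element $J$ is constant so that $A^{(1)}_q f = f'(\theta) + f(\theta)\,\partial_\theta \log q(\theta)$.

```python
import numpy as np

# Verify E_q[A_q^(1) f] ~= 0 on S^1 for q(theta) proportional to exp(kappa*cos(theta)).
kappa = 1.3
theta = np.linspace(0.0, 2 * np.pi, 20001)

q_unnorm = np.exp(kappa * np.cos(theta))      # unnormalized von Mises density
q = q_unnorm / np.trapz(q_unnorm, theta)      # normalize numerically on the grid

f = np.sin(3 * theta) + 0.5 * np.cos(theta)   # an arbitrary smooth periodic test function
f_prime = 3 * np.cos(3 * theta) - 0.5 * np.sin(theta)
dlogq = -kappa * np.sin(theta)                # d/dtheta log q, no normalization constant needed

stein_term = f_prime + f * dlogq              # A_q^(1) f evaluated on the grid
print(np.trapz(stein_term * q, theta))        # close to 0, up to quadrature error
```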
4. Goodness-of-fit Tests on M

In this section, we propose goodness-of-fit testing procedures for distributions on Riemannian manifolds based on kernelized discrepancies built from the Stein operators of the previous section.

4.1. Manifold Kernel Stein Discrepancies

By using the Stein operators introduced in the previous section, we extend kernel Stein discrepancy to distributions on Riemannian manifolds. Let $\mathcal{H}$ be an RKHS on $M$ with reproducing kernel $k$ and $\mathcal{H}^d$ be its product. We define the manifold kernel Stein discrepancies (mKSD) of the first, second and zeroth order by

$$\mathrm{mKSD}^{(1)}(p\|q) = \sup_{\|f\|_{\mathcal{H}^d} \leq 1} \mathbb{E}_p[A^{(1)}_q f], \qquad \mathrm{mKSD}^{(2)}(p\|q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(2)}_q f], \qquad \mathrm{mKSD}^{(0)}(p\|q) = \sup_{\|h\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(0)}_q h],$$

respectively. We also define the Stein kernels of first, second and zeroth order by

$$h^{(1)}_q(x, \tilde{x}) = \big\langle A^{(1)}_q k(x, \cdot),\, A^{(1)}_q k(\tilde{x}, \cdot) \big\rangle, \qquad h^{(2)}_q(x, \tilde{x}) = \big\langle A^{(2)}_q k(x, \cdot),\, A^{(2)}_q k(\tilde{x}, \cdot) \big\rangle, \qquad h^{(0)}_q(x, \tilde{x}) = \big\langle A^{(0)}_q k(x, \cdot),\, A^{(0)}_q k(\tilde{x}, \cdot) \big\rangle,$$

respectively. Then, by algebraic manipulation, we obtain the following.

Theorem 2. If $p$ and $q$ are smooth densities on $M$ and the reproducing kernel $k$ of $\mathcal{H}$ is smooth, then

$$\mathrm{mKSD}^{(c)}(p\|q)^2 = \mathbb{E}_{x, \tilde{x}}[h^{(c)}_q(x, \tilde{x})] \quad (9)$$

for $c = 0, 1, 2$, where $x, \tilde{x} \sim p$ are independent.

From Theorem 2, we can estimate mKSD by using samples from $p$. This is an important property in goodness-of-fit testing. The following theorem shows that mKSD is a proper discrepancy measure between distributions on Riemannian manifolds. The proof is given in the supplementary material. Let $L(x) = (L_1(x), \ldots, L_d(x)) \in \mathbb{R}^d$ with $L_i(x) = \frac{\partial}{\partial \theta^i} \log q(x)$.

Theorem 3. Let $p$ and $q$ be smooth densities on $M$. Assume: 1) the kernel $k$ vanishes at $\partial M$ and is compact universal in the sense of Carmeli et al. (2010, Definition 2 (ii)); 2) $\mathbb{E}_{x, \tilde{x} \sim p}[h^{(c)}_q(x, \tilde{x})^2] < \infty$ for $c = 0, 1, 2$; 3) $\mathbb{E}_p \|L(x)\|^2 < \infty$. Then, $\mathrm{mKSD}^{(c)}(p\|q) \geq 0$ and $\mathrm{mKSD}^{(c)}(p\|q) = 0$ if and only if $p = q$.

Note that the different mKSDs use different RKHSs as the space of test functions. With the $d$-dimensional vector-valued RKHS $\mathcal{H}^d$, $\mathrm{mKSD}^{(1)}$ takes the supremum over a larger class of functions than $\mathrm{mKSD}^{(2)}$, capturing richer distributional features. Theoretical analysis in the testing context will be presented in Section 6.

Equivalence of $\mathrm{mKSD}^{(0)}$ and MMD. For an RKHS $\mathcal{H}$, the maximum mean discrepancy (MMD) (Gretton et al., 2007) between $p$ and $q$ is defined by $\mathrm{MMD}(p\|q)^2 = \|\mu_p - \mu_q\|^2_{\mathcal{H}}$, where $\mu_p, \mu_q$ are the kernel mean embeddings (Muandet et al., 2017) of $p$ and $q$, respectively. The following theorem shows that $\mathrm{mKSD}^{(0)}$ is equivalent to MMD.

Theorem 4. $\mathrm{mKSD}^{(0)}(p\|q) = \mathrm{MMD}(p\|q)$.

Proof. By definition, we have $\mathrm{mKSD}^{(0)}(p\|q) = \sup_{\|h\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(0)}_q h] = \sup_{\|h\|_{\mathcal{H}} \leq 1} (\mathbb{E}_p[h] - \mathbb{E}_q[h])$. Hence, taking the supremum in closed form via the reproducing property, we obtain $\mathrm{mKSD}^{(0)}(p\|q)^2 = \|\mu_p - \mu_q\|^2_{\mathcal{H}} = \mathrm{MMD}(p\|q)^2$.

In practice, this means $\mathrm{mKSD}^{(0)}$ can be estimated exactly like an MMD between the observed sample and a sample drawn from $q$; a minimal sketch is given below.
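The following sketch (ours; it assumes that samples from $q$ are available, for example by MCMC or rejection sampling, and uses a product von Mises kernel on the torus purely as an example) estimates $\mathrm{mKSD}^{(0)}(p\|q)^2 = \mathrm{MMD}^2$ with the standard unbiased U-statistic.

```python
import numpy as np

def mmd2_u(X, Y, kernel):
    """Unbiased (U-statistic) estimate of MMD^2 between samples X (from p) and Y (from q)."""
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())

def von_mises_kernel(A, B, gamma=1.0):
    """Product von Mises kernel on the torus: exp(gamma * sum_i cos(a_i - b_i))."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(gamma * np.cos(diff).sum(axis=-1))

# Usage: mmd2_u(X_data, Y_model_samples, lambda A, B: von_mises_kernel(A, B, gamma=1.0))
```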
An illustrative example. To see the differences between the Stein operators, we consider the uniform distribution $q$ on the 2-dimensional torus, using the spherical coordinates $(\theta^1, \theta^2)$. (In this case, the Jacobian $J$ is constant because the torus $S^1 \times S^1$ is a direct product of two circles, so the derivative of $\log(qJ)$ vanishes.) Thus, the first and second order Stein operators can be written explicitly as

$$A^{(1)}_q f = \frac{\partial f^1}{\partial \theta^1} + \frac{\partial f^2}{\partial \theta^2}, \qquad A^{(2)}_q f = \frac{\partial^2 f}{\partial \theta^1 \partial \theta^1} + \frac{\partial^2 f}{\partial \theta^2 \partial \theta^2},$$

respectively. This derivation echoes the interpretation of the difference between the first and second order operators via Eq. (7).

To better understand the difference in terms of mKSD, we choose $f^1, f^2, f \in \mathcal{H}_1$, where $\mathcal{H}_1$ denotes the RKHS equipped with the product von Mises kernel of unit bandwidth, $k(u, v) = \exp\{\langle u, v \rangle\} = \exp\{\cos(\theta^1_u - \theta^1_v) + \cos(\theta^2_u - \theta^2_v)\}$, where $\theta_u, \theta_v$ are the $\theta$-parametrisations of $u, v$ respectively. Then, for $\theta_u, \theta_v \sim p$, $\mathrm{mKSD}^2$ has the explicit form

$$\mathrm{mKSD}^{(1)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}\!\left[ \frac{\partial^2 k(u,v)}{\partial \theta^1_u \partial \theta^1_v} + \frac{\partial^2 k(u,v)}{\partial \theta^2_u \partial \theta^2_v} \right] = \mathbb{E}_{\theta_u, \theta_v}\big[(\xi_1 + \xi_2)\, k(u, v)\big],$$

where $\xi_1 = \cos(\theta^1_u - \theta^1_v) - \sin(\theta^1_u - \theta^1_v)^2$ and $\xi_2 = \cos(\theta^2_u - \theta^2_v) - \sin(\theta^2_u - \theta^2_v)^2$ can be seen as statistics tracking the differences between $p$ and $q$ on each $S^1$. On the other hand,

$$\mathrm{mKSD}^{(2)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}\!\left[ \frac{\partial^4 k(u,v)}{\partial (\theta^1_u)^2 \partial (\theta^1_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^2_u)^2 \partial (\theta^2_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^1_u)^2 \partial (\theta^2_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^2_u)^2 \partial (\theta^1_v)^2} \right] = \mathbb{E}_{\theta_u, \theta_v}\big[(\xi_{11} + \xi_{22} + \xi_{12} + \xi_{21})\, k(u, v)\big],$$

where $\xi_{ii} = \sin(\theta^i_u - \theta^i_v)\big(\sin(\theta^i_u - \theta^i_v)^2 - 3\cos(\theta^i_u - \theta^i_v) - 1\big)$ and $\xi_{ij} = \xi_{ii}\xi_{jj}$ for $i \neq j$.

By setting $f^1$, $f^2$ and $f$ all in the same RKHS $\mathcal{H}_1$, we can explicitly see the effect of having a larger space of test functions $f = (f^1, f^2) \in \mathcal{H}_1 \times \mathcal{H}_1$, compared to $f \in \mathcal{H}_1$, through the $\xi$-terms.

The zeroth order operator $A^{(0)}_q$ does not involve any differential operator; instead, the density $q$ enters via the expectation over the test function in Eq. (8). For the mean embedding $\mu_q(\cdot) = \mathbb{E}_q[k(x, \cdot)]$, with $\mu_q(x) = \langle k(x, \cdot), \mu_q(\cdot) \rangle_{\mathcal{H}}$ and the constant $c_q = \|\mu_q\|^2_{\mathcal{H}}$,

$$\mathrm{mKSD}^{(0)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}[k(u, v)] - 2\,\mathbb{E}_{\theta_u}[\mu_q(u)] + c_q.$$

From this derivation for $\mathrm{mKSD}^{(0)}$ (which holds both for a scalar-valued test function $h \in \mathcal{H}$ and for $(h^1, h^2) \in \mathcal{H} \times \mathcal{H}$), we see that $k(u, v)$ does not interact with any of the $\xi$-terms, as opposed to the differential-based Stein discrepancies. Instead, the characterisation is via the balancing constant $c_q$, which is representative of the density $q$.

Connections to the Euclidean Stein operator. One interesting interpretation of the differences between the proposed Stein operators $A^{(1)}_q$, $A^{(2)}_q$ and the Euclidean Stein operator in Eq. (3) is that the proposed operators diffuse along the shape/surface of the manifold, while the Stein operator on $\mathbb{R}^d$ diffuses in all directions with respect to the Cartesian coordinates. (The Cartesian coordinates form the basis directions that guide the diffusion in $\mathbb{R}^d$; hence, we can see Eq. (3) as a special case of $A^{(1)}_q$ on a Euclidean manifold.) For instance, on a 2-dimensional torus embedded in $\mathbb{R}^3$, the proposed Stein operators only diffuse on the surface of the torus, without going into or out of the torus; a Euclidean Stein operator, however, does so due to diffusion over $\mathbb{R}^3$. This is also crucial for unnormalised models: the same unnormalised density corresponds to different models when the domains differ, e.g. the unit sphere and the sphere of radius 2. Essentially, the diffusion-along-the-manifold notion makes Stein's identity hold and gives controlled type-I error for our tests.

Although $A^{(0)}_q$ was not discussed explicitly in the Euclidean setting, it remains the same for both the Euclidean and Riemannian cases. The manifold shapes and structures are taken care of by taking the expectation over the relevant distribution in Eq. (8); e.g. in the above example, the expectation is taken over the 2-dimensional torus instead of $\mathbb{R}^3$. As such, the corresponding kernel Stein test gives controlled type-I error. However, it is worth noting that the advantage in dealing with unnormalized densities no longer holds for the zeroth order operator, which requires computing expectations under $q$ in closed form. When samples from the unnormalized density are available, e.g. via MCMC, an estimate of the expectation can be obtained; by Theorem 4, the Stein goodness-of-fit test then degenerates into the MMD test for two-sample problems. We further discuss such connections and empirical results in Sections 6 and 7. As a concrete illustration of the closed forms above, a short numerical sketch of the first order mKSD estimate on the torus is given below.
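The following is a minimal numerical companion to the closed forms above (our own sketch, assuming the uniform null $q$ on the torus and the unit-bandwidth product von Mises kernel, so that $h^{(1)}_q(u, v) = (\xi_1 + \xi_2)\, k(u, v)$).

```python
import numpy as np

def mksd1_sq_torus_uniform(thetas):
    """U-statistic estimate of mKSD^(1)(p || uniform)^2 on the torus.
    thetas: (n, 2) array of angles in [0, 2*pi)."""
    d1 = thetas[:, None, 0] - thetas[None, :, 0]   # pairwise angle differences on each circle
    d2 = thetas[:, None, 1] - thetas[None, :, 1]
    k = np.exp(np.cos(d1) + np.cos(d2))            # unit-bandwidth product von Mises kernel
    xi = (np.cos(d1) - np.sin(d1) ** 2) + (np.cos(d2) - np.sin(d2) ** 2)
    h = xi * k                                     # closed-form Stein kernel h_q^(1)
    n = len(thetas)
    np.fill_diagonal(h, 0.0)                       # U-statistic: drop diagonal terms
    return h.sum() / (n * (n - 1))

# Samples concentrated around (pi, pi) give a clearly larger value than
# (approximately) uniform samples, for which the estimate is close to zero.
rng = np.random.default_rng(0)
print(mksd1_sq_torus_uniform(rng.uniform(0, 2 * np.pi, size=(200, 2))))
print(mksd1_sq_torus_uniform(np.pi + 0.3 * rng.standard_normal((200, 2))))
```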
4.2. Goodness-of-fit Testing with mKSDs

Here, we present procedures for testing $H_0 : p = q$ with significance level $\alpha$ based on samples $x_1, \ldots, x_n \sim p$. From Theorem 2, an unbiased estimate of mKSD can be obtained in the form of a U-statistic (Lee, 1990):

$$\mathrm{mKSD}^{(c)}_u(p\|q)^2 = \frac{1}{n(n-1)} \sum_{i \neq j} h^{(c)}_q(x_i, x_j). \quad (10)$$

Its asymptotic distribution is obtained via U-statistics theory (Lee, 1990; Van der Vaart, 2000) as follows. We denote convergence in distribution by $\xrightarrow{d}$.

Theorem 5. For $c = 0, 1, 2$, the following statements hold.

1. Under $H_0 : p = q$,

$$n \, \mathrm{mKSD}^{(c)}_u(p\|q)^2 \xrightarrow{d} \sum_{j=1}^{\infty} w^{(c)}_j (Z_j^2 - 1), \quad (11)$$

where the $Z_j$ are i.i.d. standard Gaussian random variables and the $w^{(c)}_j$ are the eigenvalues of the Stein kernel $h^{(c)}_q(x, \tilde{x})$ under $p(\tilde{x})$:

$$\int h^{(c)}_q(x, \tilde{x}) \phi_j(\tilde{x}) p(\tilde{x}) \, d\tilde{x} = w^{(c)}_j \phi_j(x), \quad (12)$$

where $\phi_j(x) \neq 0$ is the non-trivial eigenfunction.

2. Under $H_1 : p \neq q$,

$$\sqrt{n}\,\big( \mathrm{mKSD}^{(c)}_u(p\|q)^2 - \mathrm{mKSD}^{(c)}(p\|q)^2 \big) \xrightarrow{d} N(0, \sigma_c^2),$$

where $\sigma_c^2 = \mathrm{Var}_{x \sim p}\big[\mathbb{E}_{\tilde{x} \sim p}[h^{(c)}_q(x, \tilde{x})]\big] > 0$.

Based on Theorem 5, we propose two procedures for goodness-of-fit testing.

Wild-bootstrap test. We employ the wild-bootstrap test with the V-statistic (Chwialkowski et al., 2014). The test statistic is given by

$$\mathrm{mKSD}^{(c)}_v(p\|q)^2 = \frac{1}{n^2} \sum_{i,j} h^{(c)}_q(x_i, x_j). \quad (13)$$

To approximate its null distribution, we define the wild-bootstrap samples by

$$S_t = \frac{1}{n^2} \sum_{i,j} W_{i,t} W_{j,t} \, h^{(c)}_q(x_i, x_j), \quad (14)$$

where each $W_{i,t} \in \{-1, 1\}$ is a Rademacher variable with zero mean and unit variance. The testing procedure is outlined in Algorithm 1. We adopt this algorithm in the following experiments due to its computational efficiency.

Algorithm 1: mKSD test via wild-bootstrap.
Input: samples $x_1, \ldots, x_n \sim p$; null density $q$; kernel function $k$; test size $\alpha$; bootstrap sample size $B$.
Objective: Test $H_0 : p = q$ versus $H_1 : p \neq q$.
1: Compute the statistic $\mathrm{mKSD}^{(c)}_v(p\|q)^2$ by Eq. (13).
2: for $t = 1 : B$ do
3:   Sample Rademacher variables $W_{1,t}, \ldots, W_{n,t}$.
4:   Compute $S_t$ by Eq. (14).
5: end for
6: Determine the $(1-\alpha)$-quantile $\gamma_{1-\alpha}$ of $S_1, \ldots, S_B$.
Output: Reject $H_0$ if $\mathrm{mKSD}^{(c)}_v(p\|q)^2 > \gamma_{1-\alpha}$; otherwise do not reject $H_0$.

Spectrum test. We can also directly approximate the null distribution in Eq. (11) by using the eigenvalues of the Stein kernel matrix (Gretton et al., 2009, Theorem 1). Specifically, let $M^{(c)}$ be the $n \times n$ Stein kernel matrix defined by $(M^{(c)})_{ij} = h^{(c)}_q(x_i, x_j)$ and let $\tilde{w}^{(c)}_1, \ldots, \tilde{w}^{(c)}_n$ be its eigenvalues. Then, we generate the simulated null samples by

$$\tilde{S}_t = \frac{1}{n} \sum_{j=1}^{n} \tilde{w}^{(c)}_j (Z_{j,t}^2 - 1), \quad (15)$$

where each $Z_{j,t}$ is a standard Gaussian variable. In practice, the spectrum test is more useful when the sample size is small, where the wild-bootstrap procedure can be less accurate. However, when the sample size $n$ is large, computing the $\tilde{w}^{(c)}_j$ via eigenvalue decomposition requires $O(n^3)$ complexity, which makes the test computationally less efficient. A compact sketch of the wild-bootstrap procedure of Algorithm 1 is given below.
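The following Python sketch (ours) follows the spirit of Algorithm 1. It assumes the Stein kernel matrix $H$ with entries $h^{(c)}_q(x_i, x_j)$ has already been computed, for instance from the torus closed forms above.

```python
import numpy as np

def mksd_wild_bootstrap_test(H, alpha=0.01, B=1000, seed=0):
    """Wild-bootstrap mKSD test. Returns (reject, statistic, threshold) for H0: p = q."""
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    stat = H.sum() / n**2                        # V-statistic, Eq. (13)
    boot = np.empty(B)
    for t in range(B):
        w = rng.choice([-1.0, 1.0], size=n)      # Rademacher variables W_{i,t}
        boot[t] = w @ H @ w / n**2               # wild-bootstrap sample S_t, Eq. (14)
    threshold = np.quantile(boot, 1.0 - alpha)   # (1 - alpha)-quantile of S_1, ..., S_B
    return stat > threshold, stat, threshold
```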
Kernel choice. The performance of kernel-based testing is sensitive to the choice of kernel parameters. We choose the kernel parameters by maximizing an approximation of the test power, following Gretton et al. (2012); Jitkrittum et al. (2016); Sutherland et al. (2016). From Theorem 5, $D := \sqrt{n}\,\big(\mathrm{mKSD}^{(c)}_u(p\|q)^2 - \mathrm{mKSD}^{(c)}(p\|q)^2\big)$ is asymptotically Gaussian under the alternative hypothesis $H_1 : p \neq q$. Thus, for sufficiently large $n$, the test power is approximated as

$$P_{H_1}\big(n \, \mathrm{mKSD}^{(c)}_u(p\|q)^2 > r\big) \approx \Phi\!\left( \sqrt{n}\, \frac{\mathrm{mKSD}^{(c)}(p\|q)^2}{\sigma_c} - \frac{r}{\sqrt{n}\,\sigma_c} \right),$$

where $\Phi$ denotes the c.d.f. of the standard Gaussian distribution (Sutherland et al., 2016). Thus, we choose the kernel parameters by maximizing an estimate of $\mathrm{mKSD}^2(p\|q)/\sigma_c$ (Jitkrittum et al., 2017).

5. Model Criticism on M

Now, we propose mKSD-based model criticism procedures for distributions on Riemannian manifolds. When the proposed model does not fit the observed data well, understanding which part of the model misfits the data is of practical interest. Model criticism can be helpful for better understanding representative prototypes (Kim et al., 2016), for criticizing prior assumptions in Bayesian settings (Lloyd & Ghahramani, 2015), or for better training of generative models (Sutherland et al., 2016). With kernel-based non-parametric testing, distributional features can be extracted in the form of test locations that represent the regions which best "distinguish" distributions. The locations where two sample distributions differ the most via MMD are studied in Jitkrittum et al. (2016), and the most "misspecified" locations between samples and models via KSD are studied in Jitkrittum et al. (2017). Recently, Seth et al. (2019) studied model criticism via latent spaces, which may intrinsically correspond to Riemannian manifold structures; such a setting can be an interesting application of our development.

Let $s_p(\cdot) = \mathbb{E}_{\tilde{x} \sim p}[A^{(1)}_q k(\tilde{x}, \cdot)] \in \mathcal{H}^d$. Adapted from Jitkrittum et al. (2017), we define the manifold Finite Set Stein Discrepancy (mFSSD) as follows. For a small set of $J$ test locations $\{v_1, \ldots, v_J\} \subset M$,

$$\mathrm{mFSSD}(p\|q)^2 = \frac{1}{dJ} \sum_{j=1}^{J} \sum_{i=1}^{d} \big(s_p(v_j)\big)_i^2, \quad (16)$$

which can be computed in time linear in the sample size $n$.

Proposition 3 (Theorem 1 of Jitkrittum et al. (2017)). Let $V = \{v_1, \ldots, v_J\} \subset M$ be random vectors drawn i.i.d. from a distribution $\nu$ which has a density. Let $\mathcal{X}$ be a connected open set in $\mathbb{R}^d$. Assume the conditions in Theorem 3 hold. Then, for any $J \geq 1$, $\nu$-almost surely, $\mathrm{mFSSD}(p\|q)^2 = 0$ if and only if $p = q$.

The Stein identity of $s_p(\cdot)$ ensures $\mathrm{mFSSD}^2 = 0$ under $H_0$ almost surely. To perform model criticism, we extract test locations that give a higher detection rate (i.e., test power) than others. We choose the test locations $V = \{v_j\}_{j=1}^J$ by maximizing the approximate test power:

$$V^* = \arg\max_{V} \frac{\mathrm{mFSSD}^2}{\sigma_{H_1}}, \quad (17)$$

where $\sigma_{H_1}$ is the variance of $\mathrm{mFSSD}^2$ under $H_1$. More details are given in Propositions 4 and 5 in the supplementary material. A short sketch of evaluating the mFSSD statistic at given test locations on the torus is provided below.
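The following sketch (ours, specialized to the torus with a product von Mises kernel; on the flat torus the volume element $J$ is constant, so the null enters only through the partial derivatives of its unnormalized log-density, and the function names are illustrative) evaluates the mFSSD statistic of Eq. (16) at fixed test locations in time linear in $n$. In practice, the locations would then be optimized with the criterion in Eq. (17).

```python
import numpy as np

def mfssd_sq_torus(thetas, V, dlogq, gamma=1.0):
    """mFSSD^2 on the torus, Eq. (16).
    thetas: (n, d) sample angles; V: (J, d) test locations;
    dlogq(theta): (n, d) array of d/dtheta_i of the unnormalized log q."""
    n, d = thetas.shape
    diff = thetas[:, None, :] - V[None, :, :]            # (n, J, d)
    k = np.exp(gamma * np.cos(diff).sum(axis=-1))        # product von Mises kernel, (n, J)
    dk = -gamma * np.sin(diff) * k[:, :, None]           # d k(x, v) / d theta_x^i, (n, J, d)
    xi = dlogq(thetas)[:, None, :] * k[:, :, None] + dk  # Stein feature at (x, v), (n, J, d)
    s = xi.mean(axis=0)                                  # estimate of s_p(v_j), shape (J, d)
    return (s ** 2).sum() / (d * V.shape[0])

# For the uniform null on the torus, dlogq is identically zero:
# mfssd_sq_torus(thetas, V, dlogq=lambda t: np.zeros_like(t))
```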
6. Comparison between mKSD Tests

Bahadur efficiency. From Theorem 5, the mKSD tests are consistent against all alternatives. Thus, to understand which mKSD test is more powerful than the others, we investigate their Bahadur efficiency (Bahadur et al., 1960), which quantifies how fast the p-value goes to zero under alternatives. Here, to focus on the effect of the choice of Stein operator on test performance, we briefly present results for testing uniformity on the circle $S^1$ against the von Mises distribution. See the supplementary material for more details. The technique of the proof is adapted from Jitkrittum et al. (2017).

Theorem 6 (Scaling shift in the von Mises distribution). Let $x \in S^1$, $q(x) \propto 1$ and $p(x) \propto \exp(\kappa u^\top x)$. Choose the von Mises kernel of the form $k(x, x') = \exp(x^\top x')$. Denote the approximate Bahadur efficiency between the mKSDs with first and second order Stein operators by $E_{1,2}(\kappa) := c(\mathrm{mKSD}^{(1)})(\kappa) / c(\mathrm{mKSD}^{(2)})(\kappa)$, where $c(\cdot)$ denotes the approximate Bahadur slope and $\kappa > 0$. Then $E_{1,2}(\kappa) > 1$.

Adapting Theorem 5 of Jitkrittum et al. (2017), it suffices to show $\mathrm{mKSD}^{(1)}(p\|q) \geq \mathrm{mKSD}^{(2)}(p\|q)$ and $\mathbb{E}_{x, \tilde{x} \sim q}[h^{(2)}_q(x, \tilde{x})^2] > \mathbb{E}_{x, \tilde{x} \sim q}[h^{(1)}_q(x, \tilde{x})^2] > 0$. See the supplementary material for details.

We provide additional discussion of test efficiencies for $\mathrm{mKSD}^{(0)}$ in the supplementary material. In general, since we cannot compute the expectation under $q$ in closed form, especially for an unnormalized density, we need to perform the test with samples from $q$, and the sampling error makes the $\mathrm{mKSD}^{(0)}$ test less asymptotically efficient (Jitkrittum et al., 2017; Yang et al., 2019; Xu & Matsuda, 2020).

Computational efficiency. Since the Stein kernels $h^{(1)}_q$ and $h^{(2)}_q$ depend on $q$ only through the derivative of $\log q$, the mKSD tests with the first and second order Stein operators do not require computation of the normalization constant of $q$. This is a major computational advantage over existing goodness-of-fit tests on Riemannian manifolds. While the computational cost of $\mathrm{mKSD}^{(1)}_u$ is $O(n^2 d)$, that of $\mathrm{mKSD}^{(2)}$ is $O(n^2 d^3)$ due to the computation of the metric tensor. On the other hand, the zeroth order mKSD test is equivalent to testing whether two sets of samples are from the same distribution by using MMD (Gretton et al., 2007). Namely, to test whether $x_1, \ldots, x_n$ is from the density $q$, we draw samples $y_1, \ldots, y_m$ from $q$ and determine whether $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ are from the same distribution. This procedure requires sampling from the null distribution $q$ on the Riemannian manifold, which is computationally intensive in general. Note that the results in Theorem 5 with $c = 0$ replicate the asymptotic results for MMD (Gretton et al., 2007).

Choosing mKSD tests. Overall, testing with $\mathrm{mKSD}^{(1)}$ has the advantage of a larger space of test functions together with both asymptotic test efficiency and computational efficiency, so it is recommended when available. $\mathrm{mKSD}^{(2)}$ can be slightly easier to compute and parameterize in certain scenarios, although it may sacrifice test power and computational efficiency. $\mathrm{mKSD}^{(0)}$, namely the MMD test with simulated samples, is also applicable when it is possible to sample from the given unnormalized density model on the Riemannian manifold.

[Figure 1. Rejection rates at α = 0.01: (a)-(b) for the uniform density; (c)-(d) for the Fisher distribution on SO(3). Panels: (a) versus perturbation with n = 100; (b) versus sample size with κ = 0.35; (c) versus perturbation with n = 100; (d) versus sample size with b = 0.20. Curves: mKSD2, mKSD1, mKSD0/MMD, Sobolev.]

7. Simulation Results

In this section, we show the validity of the proposed mKSD tests by simulation on the rotation group SO(3). We use the Euler angles (Chikuse, 2012) as the coordinate system. The bootstrap sample size is set to $B = 1000$. The significance level is set to $\alpha = 0.01$. For the $\mathrm{mKSD}^{(0)}$ test (the MMD two-sample test), the number of samples from the null is set equal to the sample size $n$. We used the kernel $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$, where the parameter $\gamma$ was chosen by optimizing the approximate test power; a small sketch of this selection by grid search is given below.
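The following sketch (ours) illustrates one simple way to carry out this kernel-parameter selection by grid search over a power proxy of the form $\mathrm{mKSD}^2_u/\hat{\sigma}$. The constants in the variance estimate and the regularizer are illustrative choices rather than the exact procedure used in the paper, and `stein_kernel(gamma)` is an assumed callable returning the Stein kernel matrix computed with parameter `gamma` (ideally on a held-out split).

```python
import numpy as np

def power_criterion(H, reg=1e-4):
    """Estimate of mKSD^2 / sigma from the Stein kernel matrix H."""
    n = H.shape[0]
    H = H.copy()
    np.fill_diagonal(H, 0.0)
    stat = H.sum() / (n * (n - 1))               # U-statistic estimate of mKSD^2
    row_means = H.sum(axis=1) / (n - 1)          # estimates of E_{x~p}[h(x_i, x)]
    sigma = np.sqrt(4.0 * row_means.var() + reg) # plug-in std under H1 (up to constants)
    return stat / sigma

def select_gamma(stein_kernel, grid=(0.1, 0.3, 1.0, 3.0, 10.0)):
    """Pick the kernel parameter maximizing the power proxy over a grid."""
    return max(grid, key=lambda g: power_criterion(stein_kernel(g)))
```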
The exponential-trace kernel $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$ on the rotation group is compact universal. To see this, we rewrite the kernel in a form analogous to the Gaussian kernel: $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y)) = C \exp\!\big(-\tfrac{\gamma}{2} \|X - Y\|_F^2\big)$, where $C$ is a constant that depends only on $d$, the dimension of the matrices $X, Y \in SO(d)$, because $\mathrm{tr}(X^\top X) = \mathrm{tr}(I_d) = d$ for all $X \in SO(d)$. Since the Gaussian kernel is universal (Sriperumbudur et al., 2011) and the rotation group $SO(d)$ is a compact subset of the space of $d \times d$ matrices, the exponential-trace kernel is also compact universal by Corollary 3 of Carmeli et al. (2010).

7.1. Uniform distribution

First, we consider testing uniformity on SO(3) and compare the performance of the mKSD tests with the Sobolev test (Jupp et al., 2005). We generated samples from the exponential-trace distribution $p(X \mid \kappa) \propto \exp(\kappa\,\mathrm{tr}(X))$ by rejection sampling (Hoff, 2009); a simple sampler sketch is given at the end of this section. The uniform distribution corresponds to $\kappa = 0$. Figure 1 (a) plots the rejection rates with respect to $\kappa$ for $n = 100$. When $\kappa = 0$, the type-I errors of all tests are well controlled at the significance level $\alpha = 0.01$. The power of all tests increases with increasing $\kappa$ and converges to one. Figure 1 (b) plots the rejection rates with respect to $n$ for $\kappa = 0.35$. The power of all tests increases with $n$ and converges to one. When the model becomes increasingly different from the null, mKSD1 is more sensitive in distinguishing the difference, with higher power than the others.

7.2. Fisher distribution

Next, we consider the Fisher distribution (or matrix Langevin distribution) $p(X \mid F) \propto \exp(\mathrm{tr}(F^\top X))$ (Chikuse, 2003; Sei et al., 2013). We generated data from $p(X \mid F_0)$ and applied the mKSD tests with the null $p(X \mid F_b)$, where

$$F_b = \begin{pmatrix} 1 & b & 0 \\ b & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

and $F_0$ corresponds to $b = 0$. We compare the mKSD tests with the extended Sobolev test (Jupp et al., 2005), in which we compute the normalization constant by Monte Carlo. Figure 1 (c) plots the rejection rates with respect to $b$ for $n = 100$. Figure 1 (d) plots the rejection rates with respect to $n$ for $b = 0.2$. From the plots, we see that all tests achieve the correct test level under the null. When the model becomes increasingly different from the null, mKSD1 is more sensitive in distinguishing the difference, with higher power than the others. The MMD test has lower power than mKSD1 and mKSD2 due to the inefficiency from sampling. While the Sobolev test is useful when the null and the alternative are very different, it is not powerful for harder problems where the alternative is perturbed very little from the null.
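For completeness, a simple rejection sampler for the exponential-trace distribution used above is sketched here (our own illustration with Haar-uniform proposals, not the exact procedure of Hoff (2009)). It relies on $\mathrm{tr}(X) \leq 3$ for $X \in SO(3)$, so the acceptance probability $\exp(\kappa(\mathrm{tr}(X) - 3))$ is valid for $\kappa \geq 0$.

```python
import numpy as np

def haar_so3(rng):
    """Haar-uniform rotation via QR decomposition of a Gaussian matrix."""
    Z = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(Z)
    Q = Q * np.sign(np.diag(R))          # fix the signs so Q is Haar-distributed on O(3)
    if np.linalg.det(Q) < 0:
        Q[:, -1] *= -1                   # map the det = -1 component onto SO(3)
    return Q

def sample_exp_trace(kappa, n, seed=0):
    """Rejection sampler for p(X | kappa) proportional to exp(kappa * tr(X)) on SO(3)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        X = haar_so3(rng)
        if rng.uniform() < np.exp(kappa * (np.trace(X) - 3.0)):  # accept/reject step
            out.append(X)
    return np.stack(out)
```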
8. Real Data Applications

Finally, we apply the mKSD tests to two real datasets.

8.1. Vectorcardiogram data

As a real dataset on the rotation group SO(3), we use the vectorcardiogram data studied by Jupp et al. (2008). The data summarize vectorcardiograms from normal children, where each data point records 3 perpendicular vectors of directions QRS, PRS and T from the Frank system for electrical lead placement. Details of this dataset can be found in Downs (1972).

We fitted a Fisher distribution $p(X \mid F) \propto \exp(\mathrm{tr}(F^\top X))$ to the 28 data points of children aged between 2 and 10 (Jupp et al., 2005) and obtained the estimate

$$\hat{F} = \begin{pmatrix} 0.583 & 0.629 & 0.514 \\ 0.660 & 0.736 & 0.151 \\ 0.473 & 0.252 & 0.844 \end{pmatrix}.$$

We use this value as the null model to be tested. Table 1 presents the p-values of each test. We apply the kernel of the form $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$ as used in Section 7. All mKSD tests show strong evidence to reject the fitted model at $\alpha = 0.05$. However, the Sobolev test, with p-value 0.126, is not powerful enough to reject the null at the same test level.

Table 1. p-values for vectorcardiogram data.

| mKSD1 | mKSD2 | mKSD0/MMD | Sobolev |
| --- | --- | --- | --- |
| 0.004 | 0.000 | 0.010 | 0.126 |

8.2. Wind direction data

As real data on the torus, we consider the wind direction in Tokyo at 00:00 ($x_1$) and 12:00 ($x_2$) for each day in 2018 (data available on the Japan Meteorological Agency website, https://www.data.jma.go.jp/obd/stats/etrn/). Thus, the sample size is $n = 365$. The data were discretized into 16 directions, such as north-northeast. Figure 2 presents a 16 × 16 histogram of the raw data.

[Figure 2. Wind direction data. Left: 2D histogram of the wind directions; the colorbar shows the count of data points in each square. Middle: the 10 optimized test locations (red stars), without repetition. Right: the objective value in Eq. (17), mFSSD²/σ_{H1}, at the specified test location (i.e. setting J = 1); the higher, the darker.]

We consider goodness-of-fit testing for the bivariate von Mises distribution in Eq. (2) using mKSD. We apply the product von Mises kernel described in the illustrative example for the torus in Section 4, $k(u, v) = \exp\{\gamma_1 \cos(\theta^1_u - \theta^1_v) + \gamma_2 \cos(\theta^2_u - \theta^2_v)\}$, where the bandwidth parameters $\gamma_1$ and $\gamma_2$ were chosen by optimizing the approximate test power. (The optimization objective is $\mathrm{mKSD}^2 / \sqrt{\mathrm{Var}(\mathrm{mKSD}^2)}$, which is similar to Eq. (17) with the test statistic $\mathrm{mKSD}^2$ in place of $\mathrm{mFSSD}^2$.) In addition, we pay particular attention to positive definiteness when choosing our kernel. As Feragen et al. (2015) pointed out, not every geodesic distance $d(u, v)$ induces a positive-definite kernel $k(u, v) = \exp(-\gamma\, d(u, v))$ on a manifold. Adapting results on strictly positive definite functions on hyperspheres (Gneiting et al., 2013), the kernel chosen above is positive definite on the torus, viewed as the product of circles with the spherical coordinate on each $S^1$.

By using noise contrastive estimation (Gutmann & Hyvärinen, 2012), Uehara et al. (2020) fitted the bivariate von Mises distribution to the wind direction data and obtained the estimate of the parameter set $\eta = (\kappa_1, \kappa_2, \mu_1, \mu_2, \lambda_{12})$: $\hat{\eta} = (0.7170, 0.3954, 1.1499, 1.1499, 1.1274)$. Setting this fitted model as the null model, the p-value obtained using mKSD1 is 0.434, which indicates that the model is a good fit for the data. In addition, we fitted a simpler model with no interaction between $x_1$ and $x_2$, i.e. $\lambda_{12}$ is set to zero in Eq. (2), so that the model reduces to a product of two von Mises distributions, one for each direction. The p-value obtained by mKSD1 is 0.002, which is strong evidence to reject this null model. In other words, there is a significant interaction between the wind directions at 00:00 and 12:00.

We then carried out model criticism with the mFSSD statistic in Eq. (16), with test locations optimized by maximizing the approximate test power. Choosing the number of test locations $J = 10$, we plot the optimized locations in Figure 2. They provide information about the dependence between the wind directions at midnight and noon.

Concluding remark. In this study, we develop goodness-of-fit testing procedures and model criticism methods for general distributions on Riemannian manifolds. As mKSDs are proper discrepancy measures under mild assumptions, the connections and comparisons of the topologies induced by different mKSDs are an interesting future direction.

Acknowledgement. T.M.
was supported by JSPS KAKENHI Grant Number 19K20220 and JST Moonshot Grant Number JPMJMS2024. W.X. was supported by Gatsby Charitable Foundation. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Bahadur, R. R. et al. Stochastic comparison of tests. Annals of Mathematical Statistics, 31(2):276 295, 1960. Barbour, A. D. and Chen, L. H. Y. An introduction to Stein s method, volume 4. World Scientific, 2005. Barp, A., Oates, C., Porcu, E., and Girolami, M. A riemannian-stein kernel method. ar Xiv preprint ar Xiv:1810.04946, 2018. Berlinet, A. and Thomas, C. Reproducing kernel Hilbert spaces in Probability and Statistics. Kluwer Academic Publishers, 2004. Carmeli, C., De Vito, E., Toigo, A., and Umanitá, V. Vector valued reproducing kernel hilbert spaces and universality. Analysis and Applications, 8(01):19 61, 2010. Chen, L. H. Y., Goldstein, L., and Shao, Q. M. Normal approximation by Stein s method. Springer, 2010. Chikuse, Y. Concentrated matrix langevin distributions. Journal of Multivariate Analysis, 2(85):375 394, 2003. Chikuse, Y. Statistics on special manifolds, volume 174. Springer Science & Business Media, 2012. Chikuse, Y. and Jupp, P. E. A test of uniformity on shape spaces. Journal of multivariate analysis, 88(1):163 176, 2004. Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning, pp. 2606 2615, 2016. Chwialkowski, K. P., Sejdinovic, D., and Gretton, A. A wild bootstrap for degenerate kernel tests. In Advances in neural information processing systems, pp. 3608 3616, 2014. Downs, T. D. Orientation statistics. Biometrika, 59(3): 665 676, 1972. Dryden, I. L. and Mardia, K. V. Statistical shape analysis: with applications in R, volume 995. John Wiley & Sons, 2016. Feragen, A., Lauze, F., and Hauberg, S. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3032 3042, 2015. Fernandez, T., Rivera, N., Xu, W., and Gretton, A. Kernelized stein discrepancy tests of goodness-of-fit for timeto-event data. In International Conference on Machine Learning, pp. 3112 3122. PMLR, 2020. Flanders, H. Differential Forms with Applications to the Physical Sciences. Dover, 1963. Garreau, D., Jitkrittum, W., and Kanagawa, M. Large sample analysis of the median heuristic. ar Xiv preprint ar Xiv:1707.07269, 2017. Giné, E. Invariant tests for uniformity on compact riemannian manifolds based on sobolev norms. The Annals of statistics, pp. 1243 1266, 1975. Girolami, M., Calderhead, B., and Chin, S. A. Riemannian manifold hamiltonian monte carlo. ar Xiv preprint ar Xiv:0907.1100, 2009. Gleser, L. J. The comparison of multivariate tests of hypothesis by means of bahadur efficiency. Sankhy a: The Indian Journal of Statistics, Series A, pp. 157 174, 1966. Gneiting, T. et al. Strictly and non-strictly positive definite functions on spheres. Bernoulli, 19(4):1327 1349, 2013. Gorham, J. and Mackey, L. Measuring sample quality with stein s method. In Advances in Neural Information Processing Systems, pp. 226 234, 2015. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sampleproblem. In Advances in neural information processing systems, pp. 513 520, 2007. Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, consistent kernel two-sample test. In Advances in neural information processing systems, pp. 673 681, 2009. 
Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In Advances in neural information processing systems, pp. 1205 1213, 2012. Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307 361, 2012. Hoff, P. D. Simulation of the matrix bingham von mises fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2):438 456, 2009. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695 709, 2005. Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pp. 181 189, 2016. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 262 271, 2017. Jitkrittum, W., Kanagawa, H., Sangkloy, P., Hays, J., Schölkopf, B., and Gretton, A. Informative features for model comparison. In Advances in Neural Information Processing Systems, pp. 808 819, 2018. Jitkrittum, W., Kanagawa, H., and Schölkopf, B. Testing goodness of fit of conditional density models with kernels. ar Xiv preprint ar Xiv:2002.10271, 2020. Jupp, P. and Kume, A. Measures of goodness of fit obtained by canonical transformations on riemannian manifolds. ar Xiv preprint ar Xiv:1811.04866, 2018. Jupp, P. et al. Sobolev tests of goodness of fit of distributions on compact riemannian manifolds. The Annals of Statistics, 33(6):2957 2966, 2005. Jupp, P. et al. Data-driven sobolev tests of uniformity on compact riemannian manifolds. The Annals of Statistics, 36(3):1246 1260, 2008. Jupp, P. E., Mardia, K. V., et al. Maximum likelihood estimators for the matrix von mises-fisher and bingham distributions. The Annals of Statistics, 7(3):599 606, 1979. Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., and Gretton, A. A kernel stein test for comparing latent variable models. ar Xiv preprint ar Xiv:1907.00586, 2019. Kim, B., Khanna, R., and Koyejo, O. Examples are not enough, learn to criticize! criticism for interpretability. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2288 2296, 2016. Klein, N., Orellana, J., Brincat, S. L., Miller, E. K., Kass, R. E., et al. Torus graphs for multivariate phase coupling analysis. Annals of Applied Statistics, 14(2):635 660, 2020. Kobayashi, S. and Nomizu, K. Foundations of differential geometry, volume 1. New York, London, 1963. Le, H., Lewis, A., Bharath, K., and Fallaize, C. A diffusion approach to stein s method on riemannian manifolds. ar Xiv preprint ar Xiv:2003.11497, 2020. Lee, A. J. U-Statistics: Theory and Practice. CRC Press, 1990. Ley, C. and Verdebout, T. Modern directional statistics. Chapman and Hall/CRC, 2017. Ley, C., Reinert, G., Swan, Y., et al. Stein s method for comparison of univariate distributions. Probability Surveys, 14:1 52, 2017. Liu, C. and Zhu, J. Riemannian stein variational gradient descent for bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Liu, Q., Lee, J., and Jordan, M. 
A kernelized stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276 284, 2016. Liu, S. and Kanamori, T. Estimating density models with complex truncation boundaries. ar Xiv preprint ar Xiv:1910.03834, 2019. Lloyd, J. R. and Ghahramani, Z. Statistical model criticism using kernel two sample tests. Advances in Neural Information Processing Systems, 28:829 837, 2015. Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pp. 2917 2925, 2015. Mardia, K. V. and Jupp, P. E. Directional Statistics. Wiley, New York, NY, 1999. Mardia, K. V., Kent, J., and Laha, A. Score matching estimators for directional distributions. ar Xiv:1604.08470, 2016. Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends R in Machine Learning, 10(1-2):1 141, 2017. Pawlowsky-Glahn, V. and Buccianti, A. Compositional data analysis: Theory and applications. John Wiley & Sons, 2011. Pourzanjani, A. A., Jiang, R. M., Mitchell, B., Atzberger, P. J., and Petzold, L. R. General bayesian inference over the stiefel manifold via the givens representation. ar Xiv preprint ar Xiv:1710.09443, 2017. Sei, T., Shibata, H., Takemura, A., Ohara, K., and Takayama, N. Properties and applications of fisher distribution on the rotation group. Journal of Multivariate Analysis, 116: 440 455, 2013. Serfling, R. J. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009. Seth, S., Murray, I., Williams, C. K., et al. Model criticism in latent space. Bayesian Analysis, 14(3):703 725, 2019. Singh, H., Hnizdo, V., and Demchuk, E. Probabilistic model for two dependent circular variables. Biometrika, 89:719 723, 2002. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961 968, 2009. Spivak, M. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. CRC press, 2018. Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and rkhs embedding of measures. Journal of Machine Learning Research, 12 (Jul):2389 2410, 2011. Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. Generative models and model criticism via optimized maximum mean discrepancy. ar Xiv preprint ar Xiv:1611.04488, 2016. Uehara, M., Matsuda, T., and Kim, J. K. Imputation estimators for unnormalized models with missing data. In International Conference on Artificial Intelligence and Statistics, pp. 831 841. PMLR, 2020. Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000. Xu, W. and Matsuda, T. A stein goodness-of-fit test for directional distributions. In International Conference on Artificial Intelligence and Statistics, pp. 831 841. PMLR, 2020. Xu, W. and Reinert, G. A stein goodness-of-test for exponential random graph models. In International Conference on Artificial Intelligence and Statistics, pp. 415 423. PMLR, 2021. Yang, J., Liu, Q., Rao, V., and Neville, J. Goodness-offit testing for discrete distributions via stein discrepancy. In International Conference on Machine Learning, pp. 5557 5566, 2018. Yang, J., Rao, V., and Neville, J. 
A stein papangelou goodness-of-fit test for point processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 226 235, 2019.