# Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds

Wenkai Xu (Gatsby Computational Neuroscience Unit, London, United Kingdom) and Takeru Matsuda (RIKEN Center for Brain Science, Tokyo, Japan). Correspondence to: Wenkai Xu, Takeru Matsuda.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. In many applications, we encounter data on Riemannian manifolds such as the torus and rotation groups. Standard statistical procedures for multivariate data are not applicable to such data. In this study, we develop goodness-of-fit testing and interpretable model criticism methods for general distributions on Riemannian manifolds, including those with an intractable normalization constant. The proposed methods are based on extensions of kernel Stein discrepancy, which are derived from Stein operators on Riemannian manifolds. We discuss the connections between the proposed tests and existing ones and provide a theoretical analysis of their asymptotic Bahadur efficiency. Simulation results and real data applications show the validity and usefulness of the proposed methods.

1. Introduction

In many scientific and machine learning applications, data appear in domains described by Riemannian manifolds. For example, structures of proteins and molecules are described by pairs of angular variables, which are identified with points on the torus (Singh et al., 2002). In computer vision and related studies, the orientation of a camera is represented by a $3 \times 3$ rotation matrix, which gives rise to data on the rotation group (Song et al., 2009). Other examples include the orbit of a comet (Jupp et al., 1979) and vectorcardiogram data (Downs, 1972). In addition, shape analysis (Dryden & Mardia, 2016) and compositional data analysis (Pawlowsky-Glahn & Buccianti, 2011) also deal with complex data defined on Riemannian manifolds. Recently, Klein et al. (2020) developed a graphical model on the torus to analyze phase coupling between neuronal activities. Since the usual statistical procedures for Euclidean data are not applicable in such scenarios, many studies have developed statistical models and methods tailored for data on Riemannian manifolds (Chikuse, 2012; Mardia & Jupp, 1999; Ley & Verdebout, 2017).

Statistical models on Riemannian manifolds are often given in the form of unnormalized densities with a computationally intractable normalization constant. For example, the Fisher distribution on the rotation group (Chikuse, 2012; Sei et al., 2013) is defined by

$$p(X \mid \Theta) \propto \exp(\mathrm{tr}(\Theta^\top X)), \quad (1)$$

and its normalization constant is not given in closed form. Statistical inference with such models can become computationally intensive due to the intractable normalization constant. Thus, statistical methods on Riemannian manifolds that do not require computation of the normalization constant have been developed for several tasks such as parameter estimation (Mardia et al., 2016) and sampling (Girolami et al., 2009; Ma et al., 2015). However, goodness-of-fit testing and model criticism procedures for general distributions on Riemannian manifolds are not established, to the best of our knowledge.

Kernel Stein discrepancy (KSD) (Gorham & Mackey, 2015; Ley et al., 2017) is a discrepancy measure between distributions based on Stein's method (Barbour & Chen, 2005; Chen et al., 2010) and reproducing kernel Hilbert space (RKHS) theory (Berlinet & Thomas, 2004).
KSD provides a general procedure for goodness-of-fit testing that does not require computation of the normalization constant, and it has shown state-of-the-art performance in various scenarios including Euclidean data (Chwialkowski et al., 2016; Liu et al., 2016), discrete data (Yang et al., 2018), point processes (Yang et al., 2019), directional data (Xu & Matsuda, 2020), censored data (Fernandez et al., 2020) and random graphs (Xu & Reinert, 2021). In addition, by using the technique of optimizing test power (Gretton et al., 2012; Sutherland et al., 2016), KSD-based testing procedures also enable extraction of distributional features to perform model criticism (Jitkrittum et al., 2017; 2018; Kanagawa et al., 2019; Jitkrittum et al., 2020). We note that Stein's method has recently been extended to Riemannian manifolds and studied for numerical integration (Barp et al., 2018) and Bayesian inference (Liu & Zhu, 2018).

In this paper, we develop goodness-of-fit testing and interpretable model criticism methods for general distributions on Riemannian manifolds. After briefly reviewing background topics, we first introduce several types of Stein operators on Riemannian manifolds by using Stokes' theorem. Then, we define manifold kernel Stein discrepancies (mKSD) based on them and propose goodness-of-fit testing procedures, which do not require computation of the normalization constant. We also develop mKSD-based interpretable model criticism procedures. Theoretical comparisons of test performance in terms of Bahadur efficiency are provided, and simulation results validate the claims. Finally, we provide real data applications to demonstrate the usefulness of the proposed methods.

2. Background

2.1. Distributions on Riemannian Manifolds

In this paper, we focus on distributions on a smooth Riemannian manifold $(M, g)$, where $g$ is a Riemannian metric on $M$. (In this paper, $M$ may have a non-empty boundary $\partial M$.) See Kobayashi & Nomizu (1963) for details on Riemannian geometry. Here, we give several examples that will be used in the experiments. Note that we define the probability density of each distribution by its Radon-Nikodym derivative with respect to the volume element of $(M, g)$.

Torus. Bivariate circular data $(x_1, x_2) \in [0, 2\pi)^2$ can be viewed as data on the torus $S^1 \times S^1$, where we identify $(\cos x, \sin x) \in S^1$ with $x \in [0, 2\pi)$. To describe dependence between circular variables, Singh et al. (2002) proposed the bivariate von Mises distribution:

$$p(x_1, x_2 \mid \eta) \propto \exp\big(\kappa_1 \cos(x_1 - \mu_1) + \kappa_2 \cos(x_2 - \mu_2) + \lambda_{12} \sin(x_1 - \mu_1)\sin(x_2 - \mu_2)\big), \quad (2)$$

where $\eta = (\kappa_1, \kappa_2, \mu_1, \mu_2, \lambda_{12})$, $\kappa_1 \geq 0$, $\kappa_2 \geq 0$, $0 \leq \mu_1 < 2\pi$ and $0 \leq \mu_2 < 2\pi$. Its normalization constant is not represented in closed form. We will apply this model to wind direction data in Section 8.

Rotation group. The rotation group $SO(m)$ is defined as $SO(m) = \{X \in \mathbb{R}^{m \times m} \mid X^\top X = I_m, \ \det X = 1\}$, where $I_m$ is the $m$-dimensional identity matrix. The Fisher distribution (Chikuse, 2012; Sei et al., 2013) on $SO(m)$ is defined as $p(X \mid \Theta) \propto \exp(\mathrm{tr}(\Theta^\top X))$, for which the normalization constant is not given in closed form. We will apply this model to vectorcardiogram data in Section 8. A sketch of how the unnormalized log-densities of these two models can be evaluated, together with the derivatives used later by the Stein operators, is given below.
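As a concrete illustration (our own sketch, not part of the original paper; function names are illustrative), the following Python code evaluates the unnormalized log-densities of the two running examples and the partial derivatives of the bivariate von Mises log-density. These derivatives are the only model-specific quantities that the Stein operators of Section 3 will need, so the intractable normalization constants never appear.

```python
import numpy as np

def log_bvm_unnorm(x1, x2, kappa1, kappa2, mu1, mu2, lam12):
    """Unnormalized log-density of the bivariate von Mises model, Eq. (2)."""
    return (kappa1 * np.cos(x1 - mu1) + kappa2 * np.cos(x2 - mu2)
            + lam12 * np.sin(x1 - mu1) * np.sin(x2 - mu2))

def grad_log_bvm(x1, x2, kappa1, kappa2, mu1, mu2, lam12):
    """Partial derivatives of the unnormalized log-density w.r.t. (x1, x2)."""
    d1 = -kappa1 * np.sin(x1 - mu1) + lam12 * np.cos(x1 - mu1) * np.sin(x2 - mu2)
    d2 = -kappa2 * np.sin(x2 - mu2) + lam12 * np.sin(x1 - mu1) * np.cos(x2 - mu2)
    return d1, d2

def log_fisher_unnorm(X, Theta):
    """Unnormalized log-density of the Fisher distribution on SO(m), Eq. (1)."""
    return np.trace(Theta.T @ X)
```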
Goodness-of-fit testing for general distributions on Riemannian manifolds is not established, to the best of our knowledge. For tests of uniformity, several methods have been proposed, such as the Sobolev test (Chikuse & Jupp, 2004; Giné, 1975; Jupp et al., 2008). However, they are not readily applicable to general distributions. Although there are a few methods applicable to general distributions (Jupp et al., 2005; Jupp & Kume, 2018), they require computation of the normalization constant, which is often computationally intensive. In addition, existing testing procedures cannot be applied to perform interpretable model criticism (Jitkrittum et al., 2016; Kim et al., 2016; Lloyd & Ghahramani, 2015), which would provide an intuitive clarification of the discrepancy between the model and the data.

2.2. Kernel Stein Discrepancy on $\mathbb{R}^d$

Here, we briefly review goodness-of-fit testing with kernel Stein discrepancy on $\mathbb{R}^d$. See Chwialkowski et al. (2016); Liu et al. (2016) for more detail. Let $q$ be a smooth probability density on $\mathbb{R}^d$. For a smooth function $f = (f_1, \ldots, f_d) : \mathbb{R}^d \to \mathbb{R}^d$, the Stein operator $T_q$ is defined by

$$T_q f(x) = \sum_{i=1}^d \left( f_i(x) \frac{\partial}{\partial x_i} \log q(x) + \frac{\partial}{\partial x_i} f_i(x) \right). \quad (3)$$

From integration by parts on $\mathbb{R}^d$, we obtain the equality $\mathbb{E}_q[T_q f] = 0$, i.e. Stein's identity, under mild regularity conditions. Since the Stein operator $T_q$ depends on the density $q$ only through the derivatives of $\log q$, it does not involve the normalization constant of $q$, which is a useful property for dealing with unnormalized models (Hyvärinen, 2005).

Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) on $\mathbb{R}^d$ and $\mathcal{H}^d$ be its product. By using the Stein operator, the kernel Stein discrepancy (KSD) (Gorham & Mackey, 2015; Ley et al., 2017) between two densities $p$ and $q$ is defined as

$$\mathrm{KSD}(p \,\|\, q) = \sup_{\|f\|_{\mathcal{H}^d} \leq 1} \mathbb{E}_p[T_q f].$$

It is shown that $\mathrm{KSD}(p\|q) \geq 0$ and $\mathrm{KSD}(p\|q) = 0$ if and only if $p = q$ under mild regularity conditions (Chwialkowski et al., 2016). Thus, KSD is a proper discrepancy measure between densities. After some calculation, $\mathrm{KSD}(p\|q)$ is rewritten as

$$\mathrm{KSD}^2(p\|q) = \mathbb{E}_{x, \tilde{x} \sim p}[h_q(x, \tilde{x})], \quad (4)$$

where $h_q(x, \tilde{x}) = \langle T_q k(x, \cdot), T_q k(\tilde{x}, \cdot) \rangle_{\mathcal{H}}$, which does not involve the density $p$. Given samples $x_1, \ldots, x_n$ from an unknown density $p$ on $\mathbb{R}^d$, an empirical estimate of $\mathrm{KSD}^2(p\|q)$ can be obtained by using Eq. (4) in the form of a U-statistic, and this estimate is used to test the hypothesis $H_0 : p = q$, where the critical value is determined by bootstrap. In this way, a general method of non-parametric goodness-of-fit testing on $\mathbb{R}^d$ is obtained, which does not require computation of the normalization constant.

3. Stein Operators on M

In this section, we introduce several types of Stein operators for distributions on Riemannian manifolds by using Stokes' theorem. The operators are categorized via the order of differentials of the input functions. (Note that this should be distinguished from the differentials of the unnormalized density functions.)

3.1. Differential Forms and Stokes' Theorem

To derive Stein operators on Riemannian manifolds, we need differential forms and Stokes' theorem. Here, we briefly introduce these concepts. For more detailed and rigorous treatments, see Flanders (1963); Spivak (2018). Let $M$ be a smooth $d$-dimensional Riemannian manifold and take its local coordinate system $x^1, \ldots, x^d$. We introduce symbols $dx^1, \ldots, dx^d$ and an associative and antisymmetric operation between them called the wedge product: $dx^i \wedge dx^j = -dx^j \wedge dx^i$. Note that $dx^i \wedge dx^i = 0$. Then, a $p$-form $\omega$ on $M$ ($0 \leq p \leq d$) is defined as

$$\omega = \sum_{i_1 < \cdots < i_p} f_{i_1 \cdots i_p} \, dx^{i_1} \wedge \cdots \wedge dx^{i_p},$$

where the sum is taken over all $p$-tuples $\{i_1, \ldots, i_p\} \subset \{1, \ldots, d\}$ and each $f_{i_1 \cdots i_p}$ is a smooth function on $M$. The exterior derivative $d\omega$ of $\omega$ is defined as the $(p+1)$-form given by

$$d\omega = \sum_{i_1 < \cdots < i_p} \sum_{i} \frac{\partial f_{i_1 \cdots i_p}}{\partial x^i} \, dx^i \wedge dx^{i_1} \wedge \cdots \wedge dx^{i_p}.$$

For another coordinate system $y^1, \ldots, y^d$ on $M$, the differential form is transformed by $dy^j = \sum_{i=1}^d \frac{\partial y^j}{\partial x^i} dx^i$. The volume element is defined as the $d$-form $(\det g)^{1/2} dx^1 \wedge \cdots \wedge dx^d$, where $g = g(x^1, \ldots, x^d)$ is the $d \times d$ matrix of the Riemannian metric with respect to $x^1, \ldots, x^d$. The integration of a $d$-form on a $d$-dimensional manifold is naturally defined like the usual integration on $\mathbb{R}^d$ and is invariant with respect to the choice of coordinates. Correspondingly, the integration by parts formula on $\mathbb{R}^d$ is generalized in the form of Stokes' theorem.

Proposition 1 (Stokes' theorem). Let $\partial M$ be the boundary of $M$ and $\omega$ be a $(d-1)$-form on $M$. Then,

$$\int_M d\omega = \int_{\partial M} \omega.$$

Corollary 1. If $\partial M$ is empty, then $\int_M d\omega = 0$ for any $(d-1)$-form $\omega$ on $M$.

Coordinate choice. In the following, to facilitate the derivation as well as the computation of Stein operators, we assume that there exists a coordinate system $\theta^1, \ldots, \theta^d$ on $M$ that covers $M$ almost everywhere. For example, spherical coordinates for the hyperspheres and torus, generalized Euler angles (Chikuse, 2012, Section 2.5.1) for the rotation groups, and Givens rotations (Pourzanjani et al., 2017) for the Stiefel manifolds satisfy this assumption.

3.2. First Order Stein Operator

For a smooth probability density $q$ on $M$ and a smooth function $f = (f^1, \ldots, f^d) : M \to \mathbb{R}^d$, define a function $A^{(1)}_q f : M \to \mathbb{R}$ by

$$A^{(1)}_q f = \sum_{i=1}^d \left( \frac{\partial f^i}{\partial \theta^i} + f^i \frac{\partial}{\partial \theta^i} \log(q J) \right), \quad (5)$$

where $J = (\det g)^{1/2}$ is the volume element. We refer to $A^{(1)}_q$ as the first order Stein operator. Note that Xu & Matsuda (2020) utilized this operator for goodness-of-fit testing on hyperspheres.

Theorem 1. If $\partial M$ is empty or $f^1, \ldots, f^d$ vanish on $\partial M$, then $\mathbb{E}_q[A^{(1)}_q f] = 0$.

If $M$ is a closed manifold such as the torus and the rotation group, it has no boundary by definition and thus the assumption of Theorem 1 holds. If the boundary of $M$ is non-empty, a discussion relevant to the assumption of Theorem 1 can be found in Liu & Kanamori (2019), which studies density estimation on truncated domains. Note that the assumption of Theorem 1 is similar to Assumption 4 in Barp et al. (2018).

3.3. Second Order Stein Operator

In the context of numerical integration on Riemannian manifolds, Barp et al. (2018) introduced a different type of Stein operator $A^{(2)}_q$, which we call the second order Stein operator. Specifically, for a smooth probability density $q$ on $M$ and a smooth function $f : M \to \mathbb{R}$, define $A^{(2)}_q f : M \to \mathbb{R}$ by

$$A^{(2)}_q f = \sum_{i,j} \left( g^{ij} \frac{\partial^2 f}{\partial \theta^i \partial \theta^j} + g^{ij} \frac{\partial f}{\partial \theta^j} \frac{\partial}{\partial \theta^i} \log(q J) + \frac{\partial g^{ij}}{\partial \theta^i} \frac{\partial f}{\partial \theta^j} \right), \quad (6)$$

where we denote the inverse matrix of $(g_{ij})$ by $(g^{ij})$ following the convention of Riemannian geometry.

Proposition 2 (Proposition 1 of Barp et al. (2018)). If $\partial M$ is empty or $f$ vanishes on $\partial M$, then $\mathbb{E}_q[A^{(2)}_q f] = 0$.

This also follows from Theorem 1, because the second order Stein operator in Eq. (6) can be viewed as a special case of the first order Stein operator in Eq. (5) with

$$f^i = \sum_j g^{ij} \frac{\partial f}{\partial \theta^j}. \quad (7)$$

A similar form of the second order Stein operator in Eq. (6) has been studied in Liu & Zhu (2018) for Bayesian inference. On the other hand, Le et al. (2020) arrive at a similar second order Stein operator on Riemannian manifolds via a Feller diffusion process in the context of density approximation.

3.4. Zeroth Order Stein Operator

For a smooth probability density $q$ on $M$ and a function $h : M \to \mathbb{R}$, define a function $A^{(0)}_q h : M \to \mathbb{R}$ by

$$A^{(0)}_q h = h - \mathbb{E}_q[h], \quad (8)$$

which clearly satisfies $\mathbb{E}_q[A^{(0)}_q h] = 0$. Since $A^{(0)}_q$ does not involve any differentials, we call it the zeroth order Stein operator. Compared to the first and second order Stein operators, this operator requires the normalization constant of $q$, which is often computationally intractable for Riemannian manifolds. We will show later that this operator corresponds to the maximum mean discrepancy (MMD) (Gretton et al., 2007). As a quick sanity check of the Stein identity in Theorem 1, a small numerical illustration on the circle is sketched below.
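The following toy numerical check (our own sketch, assuming a von Mises density on $S^1$ and simple numerical quadrature; it is not from the paper) illustrates Theorem 1 on the circle, where the volume element $J$ is constant so that $A^{(1)}_q f = f'(\theta) + f(\theta)\,\partial_\theta \log q(\theta)$.

```python
import numpy as np

# Verify E_q[A_q^(1) f] ~= 0 on S^1 for q(theta) proportional to exp(kappa*cos(theta)).
kappa = 1.3
theta = np.linspace(0.0, 2 * np.pi, 20001)

q_unnorm = np.exp(kappa * np.cos(theta))      # unnormalized von Mises density
q = q_unnorm / np.trapz(q_unnorm, theta)      # normalize numerically on the grid

f = np.sin(3 * theta) + 0.5 * np.cos(theta)   # an arbitrary smooth periodic test function
f_prime = 3 * np.cos(3 * theta) - 0.5 * np.sin(theta)
dlogq = -kappa * np.sin(theta)                # d/dtheta log q, no normalization constant needed

stein_term = f_prime + f * dlogq              # A_q^(1) f evaluated on the grid
print(np.trapz(stein_term * q, theta))        # close to 0, up to quadrature error
```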
4. Goodness-of-fit Tests on M

In this section, we propose goodness-of-fit testing procedures for distributions on Riemannian manifolds based on kernelized discrepancies built from the Stein operators of the previous section.

4.1. Manifold Kernel Stein Discrepancies

By using the Stein operators introduced in the previous section, we extend kernel Stein discrepancy to distributions on Riemannian manifolds. Let $\mathcal{H}$ be an RKHS on $M$ with reproducing kernel $k$ and $\mathcal{H}^d$ be its product. We define the manifold kernel Stein discrepancies (mKSD) of the first, second and zeroth order by

$$\mathrm{mKSD}^{(1)}(p\|q) = \sup_{\|f\|_{\mathcal{H}^d} \leq 1} \mathbb{E}_p[A^{(1)}_q f], \qquad \mathrm{mKSD}^{(2)}(p\|q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(2)}_q f], \qquad \mathrm{mKSD}^{(0)}(p\|q) = \sup_{\|h\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(0)}_q h],$$

respectively. We also define the Stein kernels of first, second and zeroth order by

$$h^{(1)}_q(x, \tilde{x}) = \big\langle A^{(1)}_q k(x, \cdot),\, A^{(1)}_q k(\tilde{x}, \cdot) \big\rangle, \qquad h^{(2)}_q(x, \tilde{x}) = \big\langle A^{(2)}_q k(x, \cdot),\, A^{(2)}_q k(\tilde{x}, \cdot) \big\rangle, \qquad h^{(0)}_q(x, \tilde{x}) = \big\langle A^{(0)}_q k(x, \cdot),\, A^{(0)}_q k(\tilde{x}, \cdot) \big\rangle,$$

respectively. Then, by algebraic manipulation, we obtain the following.

Theorem 2. If $p$ and $q$ are smooth densities on $M$ and the reproducing kernel $k$ of $\mathcal{H}$ is smooth, then

$$\mathrm{mKSD}^{(c)}(p\|q)^2 = \mathbb{E}_{x, \tilde{x}}[h^{(c)}_q(x, \tilde{x})] \quad (9)$$

for $c = 0, 1, 2$, where $x, \tilde{x} \sim p$ are independent.

From Theorem 2, we can estimate mKSD by using samples from $p$. This is an important property in goodness-of-fit testing. The following theorem shows that mKSD is a proper discrepancy measure between distributions on Riemannian manifolds. The proof is given in the supplementary material. Let $L(x) = (L_1(x), \ldots, L_d(x)) \in \mathbb{R}^d$ with $L_i(x) = \frac{\partial}{\partial \theta^i} \log q(x)$.

Theorem 3. Let $p$ and $q$ be smooth densities on $M$. Assume: 1) the kernel $k$ vanishes at $\partial M$ and is compact universal in the sense of Carmeli et al. (2010, Definition 2 (ii)); 2) $\mathbb{E}_{x, \tilde{x} \sim p}[h^{(c)}_q(x, \tilde{x})^2] < \infty$ for $c = 0, 1, 2$; 3) $\mathbb{E}_p \|L(x)\|^2 < \infty$. Then, $\mathrm{mKSD}^{(c)}(p\|q) \geq 0$ and $\mathrm{mKSD}^{(c)}(p\|q) = 0$ if and only if $p = q$.

Note that the different mKSDs use different RKHSs as the space of test functions. With the $d$-dimensional vector-valued RKHS $\mathcal{H}^d$, $\mathrm{mKSD}^{(1)}$ takes the supremum over a larger class of functions than $\mathrm{mKSD}^{(2)}$, capturing richer distributional features. Theoretical analysis in the testing context will be presented in Section 6.

Equivalence of $\mathrm{mKSD}^{(0)}$ and MMD. For an RKHS $\mathcal{H}$, the maximum mean discrepancy (MMD) (Gretton et al., 2007) between $p$ and $q$ is defined by $\mathrm{MMD}(p\|q)^2 = \|\mu_p - \mu_q\|^2_{\mathcal{H}}$, where $\mu_p, \mu_q$ are the kernel mean embeddings (Muandet et al., 2017) of $p$ and $q$, respectively. The following theorem shows that $\mathrm{mKSD}^{(0)}$ is equivalent to MMD.

Theorem 4. $\mathrm{mKSD}^{(0)}(p\|q) = \mathrm{MMD}(p\|q)$.

Proof. By definition, we have $\mathrm{mKSD}^{(0)}(p\|q) = \sup_{\|h\|_{\mathcal{H}} \leq 1} \mathbb{E}_p[A^{(0)}_q h] = \sup_{\|h\|_{\mathcal{H}} \leq 1} (\mathbb{E}_p[h] - \mathbb{E}_q[h])$. Hence, taking the supremum in closed form via the reproducing property, we obtain $\mathrm{mKSD}^{(0)}(p\|q)^2 = \|\mu_p - \mu_q\|^2_{\mathcal{H}} = \mathrm{MMD}(p\|q)^2$.

In practice, this means $\mathrm{mKSD}^{(0)}$ can be estimated exactly like an MMD between the observed sample and a sample drawn from $q$; a minimal sketch is given below.
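The following sketch (ours; it assumes that samples from $q$ are available, for example by MCMC or rejection sampling, and uses a product von Mises kernel on the torus purely as an example) estimates $\mathrm{mKSD}^{(0)}(p\|q)^2 = \mathrm{MMD}^2$ with the standard unbiased U-statistic.

```python
import numpy as np

def mmd2_u(X, Y, kernel):
    """Unbiased (U-statistic) estimate of MMD^2 between samples X (from p) and Y (from q)."""
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())

def von_mises_kernel(A, B, gamma=1.0):
    """Product von Mises kernel on the torus: exp(gamma * sum_i cos(a_i - b_i))."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(gamma * np.cos(diff).sum(axis=-1))

# Usage: mmd2_u(X_data, Y_model_samples, lambda A, B: von_mises_kernel(A, B, gamma=1.0))
```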
An illustrative example. To see the differences between the Stein operators, we consider the uniform distribution $q$ on the 2-dimensional torus, using the spherical coordinates $(\theta^1, \theta^2)$. (In this case, the Jacobian $J$ is constant because the torus $S^1 \times S^1$ is a direct product of two circles, so the derivative of $\log(qJ)$ vanishes.) Thus, the first and second order Stein operators can be written explicitly as

$$A^{(1)}_q f = \frac{\partial f^1}{\partial \theta^1} + \frac{\partial f^2}{\partial \theta^2}, \qquad A^{(2)}_q f = \frac{\partial^2 f}{\partial \theta^1 \partial \theta^1} + \frac{\partial^2 f}{\partial \theta^2 \partial \theta^2},$$

respectively. This derivation echoes the interpretation of the difference between the first and second order operators via Eq. (7).

To better understand the difference in terms of mKSD, we choose $f^1, f^2, f \in \mathcal{H}_1$, where $\mathcal{H}_1$ denotes the RKHS equipped with the product von Mises kernel of unit bandwidth, $k(u, v) = \exp\{\langle u, v \rangle\} = \exp\{\cos(\theta^1_u - \theta^1_v) + \cos(\theta^2_u - \theta^2_v)\}$, where $\theta_u, \theta_v$ are the $\theta$-parametrisations of $u, v$ respectively. Then, for $\theta_u, \theta_v \sim p$, $\mathrm{mKSD}^2$ has the explicit form

$$\mathrm{mKSD}^{(1)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}\!\left[ \frac{\partial^2 k(u,v)}{\partial \theta^1_u \partial \theta^1_v} + \frac{\partial^2 k(u,v)}{\partial \theta^2_u \partial \theta^2_v} \right] = \mathbb{E}_{\theta_u, \theta_v}\big[(\xi_1 + \xi_2)\, k(u, v)\big],$$

where $\xi_1 = \cos(\theta^1_u - \theta^1_v) - \sin(\theta^1_u - \theta^1_v)^2$ and $\xi_2 = \cos(\theta^2_u - \theta^2_v) - \sin(\theta^2_u - \theta^2_v)^2$ can be seen as statistics tracking the differences between $p$ and $q$ on each $S^1$. On the other hand,

$$\mathrm{mKSD}^{(2)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}\!\left[ \frac{\partial^4 k(u,v)}{\partial (\theta^1_u)^2 \partial (\theta^1_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^2_u)^2 \partial (\theta^2_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^1_u)^2 \partial (\theta^2_v)^2} + \frac{\partial^4 k(u,v)}{\partial (\theta^2_u)^2 \partial (\theta^1_v)^2} \right] = \mathbb{E}_{\theta_u, \theta_v}\big[(\xi_{11} + \xi_{22} + \xi_{12} + \xi_{21})\, k(u, v)\big],$$

where $\xi_{ii} = \sin(\theta^i_u - \theta^i_v)\big(\sin(\theta^i_u - \theta^i_v)^2 - 3\cos(\theta^i_u - \theta^i_v) - 1\big)$ and $\xi_{ij} = \xi_{ii}\xi_{jj}$ for $i \neq j$.

By setting $f^1$, $f^2$ and $f$ all in the same RKHS $\mathcal{H}_1$, we can explicitly see the effect of having a larger space of test functions $f = (f^1, f^2) \in \mathcal{H}_1 \times \mathcal{H}_1$, compared to $f \in \mathcal{H}_1$, through the $\xi$-terms.

The zeroth order operator $A^{(0)}_q$ does not involve any differential operator; instead, the density $q$ enters via the expectation over the test function in Eq. (8). For the mean embedding $\mu_q(\cdot) = \mathbb{E}_q[k(x, \cdot)]$, with $\mu_q(x) = \langle k(x, \cdot), \mu_q(\cdot) \rangle_{\mathcal{H}}$ and the constant $c_q = \|\mu_q\|^2_{\mathcal{H}}$,

$$\mathrm{mKSD}^{(0)}(p\|q)^2 = \mathbb{E}_{\theta_u, \theta_v}[k(u, v)] - 2\,\mathbb{E}_{\theta_u}[\mu_q(u)] + c_q.$$

From this derivation for $\mathrm{mKSD}^{(0)}$ (which holds both for a scalar-valued test function $h \in \mathcal{H}$ and for $(h^1, h^2) \in \mathcal{H} \times \mathcal{H}$), we see that $k(u, v)$ does not interact with any of the $\xi$-terms, as opposed to the differential-based Stein discrepancies. Instead, the characterisation is via the balancing constant $c_q$, which is representative of the density $q$.

Connections to the Euclidean Stein operator. One interesting interpretation of the differences between the proposed Stein operators $A^{(1)}_q$, $A^{(2)}_q$ and the Euclidean Stein operator in Eq. (3) is that the proposed operators diffuse along the shape/surface of the manifold, while the Stein operator on $\mathbb{R}^d$ diffuses in all directions with respect to the Cartesian coordinates. (The Cartesian coordinates form the basis directions that guide the diffusion in $\mathbb{R}^d$; hence, we can see Eq. (3) as a special case of $A^{(1)}_q$ on a Euclidean manifold.) For instance, on a 2-dimensional torus embedded in $\mathbb{R}^3$, the proposed Stein operators only diffuse on the surface of the torus, without going into or out of the torus; a Euclidean Stein operator, however, does so due to diffusion over $\mathbb{R}^3$. This is also crucial for unnormalised models: the same unnormalised density corresponds to different models when the domains differ, e.g. the unit sphere and the sphere of radius 2. Essentially, the diffusion-along-the-manifold notion makes Stein's identity hold and gives controlled type-I error for our tests.

Although $A^{(0)}_q$ was not discussed explicitly in the Euclidean setting, it remains the same for both the Euclidean and Riemannian cases. The manifold shapes and structures are taken care of by taking the expectation over the relevant distribution in Eq. (8); e.g. in the above example, the expectation is taken over the 2-dimensional torus instead of $\mathbb{R}^3$. As such, the corresponding kernel Stein test gives controlled type-I error. However, it is worth noting that the advantage in dealing with unnormalized densities no longer holds for the zeroth order operator, which requires computing expectations under $q$ in closed form. When samples from the unnormalized density are available, e.g. via MCMC, an estimate of the expectation can be obtained; by Theorem 4, the Stein goodness-of-fit test then degenerates into the MMD test for two-sample problems. We further discuss such connections and empirical results in Sections 6 and 7. As a concrete illustration of the closed forms above, a short numerical sketch of the first order mKSD estimate on the torus is given below.
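The following is a minimal numerical companion to the closed forms above (our own sketch, assuming the uniform null $q$ on the torus and the unit-bandwidth product von Mises kernel, so that $h^{(1)}_q(u, v) = (\xi_1 + \xi_2)\, k(u, v)$).

```python
import numpy as np

def mksd1_sq_torus_uniform(thetas):
    """U-statistic estimate of mKSD^(1)(p || uniform)^2 on the torus.
    thetas: (n, 2) array of angles in [0, 2*pi)."""
    d1 = thetas[:, None, 0] - thetas[None, :, 0]   # pairwise angle differences on each circle
    d2 = thetas[:, None, 1] - thetas[None, :, 1]
    k = np.exp(np.cos(d1) + np.cos(d2))            # unit-bandwidth product von Mises kernel
    xi = (np.cos(d1) - np.sin(d1) ** 2) + (np.cos(d2) - np.sin(d2) ** 2)
    h = xi * k                                     # closed-form Stein kernel h_q^(1)
    n = len(thetas)
    np.fill_diagonal(h, 0.0)                       # U-statistic: drop diagonal terms
    return h.sum() / (n * (n - 1))

# Samples concentrated around (pi, pi) give a clearly larger value than
# (approximately) uniform samples, for which the estimate is close to zero.
rng = np.random.default_rng(0)
print(mksd1_sq_torus_uniform(rng.uniform(0, 2 * np.pi, size=(200, 2))))
print(mksd1_sq_torus_uniform(np.pi + 0.3 * rng.standard_normal((200, 2))))
```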
4.2. Goodness-of-fit Testing with mKSDs

Here, we present procedures for testing $H_0 : p = q$ with significance level $\alpha$ based on samples $x_1, \ldots, x_n \sim p$. From Theorem 2, an unbiased estimate of mKSD can be obtained in the form of a U-statistic (Lee, 1990):

$$\mathrm{mKSD}^{(c)}_u(p\|q)^2 = \frac{1}{n(n-1)} \sum_{i \neq j} h^{(c)}_q(x_i, x_j). \quad (10)$$

Its asymptotic distribution is obtained via U-statistics theory (Lee, 1990; Van der Vaart, 2000) as follows. We denote convergence in distribution by $\xrightarrow{d}$.

Theorem 5. For $c = 0, 1, 2$, the following statements hold.

1. Under $H_0 : p = q$,

$$n \, \mathrm{mKSD}^{(c)}_u(p\|q)^2 \xrightarrow{d} \sum_{j=1}^{\infty} w^{(c)}_j (Z_j^2 - 1), \quad (11)$$

where the $Z_j$ are i.i.d. standard Gaussian random variables and the $w^{(c)}_j$ are the eigenvalues of the Stein kernel $h^{(c)}_q(x, \tilde{x})$ under $p(\tilde{x})$:

$$\int h^{(c)}_q(x, \tilde{x}) \phi_j(\tilde{x}) p(\tilde{x}) \, d\tilde{x} = w^{(c)}_j \phi_j(x), \quad (12)$$

where $\phi_j(x) \neq 0$ is the non-trivial eigenfunction.

2. Under $H_1 : p \neq q$,

$$\sqrt{n}\,\big( \mathrm{mKSD}^{(c)}_u(p\|q)^2 - \mathrm{mKSD}^{(c)}(p\|q)^2 \big) \xrightarrow{d} N(0, \sigma_c^2),$$

where $\sigma_c^2 = \mathrm{Var}_{x \sim p}\big[\mathbb{E}_{\tilde{x} \sim p}[h^{(c)}_q(x, \tilde{x})]\big] > 0$.

Based on Theorem 5, we propose two procedures for goodness-of-fit testing.

Wild-bootstrap test. We employ the wild-bootstrap test with the V-statistic (Chwialkowski et al., 2014). The test statistic is given by

$$\mathrm{mKSD}^{(c)}_v(p\|q)^2 = \frac{1}{n^2} \sum_{i,j} h^{(c)}_q(x_i, x_j). \quad (13)$$

To approximate its null distribution, we define the wild-bootstrap samples by

$$S_t = \frac{1}{n^2} \sum_{i,j} W_{i,t} W_{j,t} \, h^{(c)}_q(x_i, x_j), \quad (14)$$

where each $W_{i,t} \in \{-1, 1\}$ is a Rademacher variable with zero mean and unit variance. The testing procedure is outlined in Algorithm 1. We adopt this algorithm in the following experiments due to its computational efficiency.

Algorithm 1: mKSD test via wild-bootstrap.
Input: samples $x_1, \ldots, x_n \sim p$; null density $q$; kernel function $k$; test size $\alpha$; bootstrap sample size $B$.
Objective: Test $H_0 : p = q$ versus $H_1 : p \neq q$.
1: Compute the statistic $\mathrm{mKSD}^{(c)}_v(p\|q)^2$ by Eq. (13).
2: for $t = 1 : B$ do
3:   Sample Rademacher variables $W_{1,t}, \ldots, W_{n,t}$.
4:   Compute $S_t$ by Eq. (14).
5: end for
6: Determine the $(1-\alpha)$-quantile $\gamma_{1-\alpha}$ of $S_1, \ldots, S_B$.
Output: Reject $H_0$ if $\mathrm{mKSD}^{(c)}_v(p\|q)^2 > \gamma_{1-\alpha}$; otherwise do not reject $H_0$.

Spectrum test. We can also directly approximate the null distribution in Eq. (11) by using the eigenvalues of the Stein kernel matrix (Gretton et al., 2009, Theorem 1). Specifically, let $M^{(c)}$ be the $n \times n$ Stein kernel matrix defined by $(M^{(c)})_{ij} = h^{(c)}_q(x_i, x_j)$ and let $\tilde{w}^{(c)}_1, \ldots, \tilde{w}^{(c)}_n$ be its eigenvalues. Then, we generate the simulated null samples by

$$\tilde{S}_t = \frac{1}{n} \sum_{j=1}^{n} \tilde{w}^{(c)}_j (Z_{j,t}^2 - 1), \quad (15)$$

where each $Z_{j,t}$ is a standard Gaussian variable. In practice, the spectrum test is more useful when the sample size is small, where the wild-bootstrap procedure can be less accurate. However, when the sample size $n$ is large, computing the $\tilde{w}^{(c)}_j$ via eigenvalue decomposition requires $O(n^3)$ complexity, which makes the test computationally less efficient. A compact sketch of the wild-bootstrap procedure of Algorithm 1 is given below.
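The following Python sketch (ours) follows the spirit of Algorithm 1. It assumes the Stein kernel matrix $H$ with entries $h^{(c)}_q(x_i, x_j)$ has already been computed, for instance from the torus closed forms above.

```python
import numpy as np

def mksd_wild_bootstrap_test(H, alpha=0.01, B=1000, seed=0):
    """Wild-bootstrap mKSD test. Returns (reject, statistic, threshold) for H0: p = q."""
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    stat = H.sum() / n**2                        # V-statistic, Eq. (13)
    boot = np.empty(B)
    for t in range(B):
        w = rng.choice([-1.0, 1.0], size=n)      # Rademacher variables W_{i,t}
        boot[t] = w @ H @ w / n**2               # wild-bootstrap sample S_t, Eq. (14)
    threshold = np.quantile(boot, 1.0 - alpha)   # (1 - alpha)-quantile of S_1, ..., S_B
    return stat > threshold, stat, threshold
```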
Kernel choice. The performance of kernel-based testing is sensitive to the choice of kernel parameters. We choose the kernel parameters by maximizing an approximation of the test power, following Gretton et al. (2012); Jitkrittum et al. (2016); Sutherland et al. (2016). From Theorem 5, $D := \sqrt{n}\,\big(\mathrm{mKSD}^{(c)}_u(p\|q)^2 - \mathrm{mKSD}^{(c)}(p\|q)^2\big)$ is asymptotically Gaussian under the alternative hypothesis $H_1 : p \neq q$. Thus, for sufficiently large $n$, the test power is approximated as

$$P_{H_1}\big(n \, \mathrm{mKSD}^{(c)}_u(p\|q)^2 > r\big) \approx \Phi\!\left( \sqrt{n}\, \frac{\mathrm{mKSD}^{(c)}(p\|q)^2}{\sigma_c} - \frac{r}{\sqrt{n}\,\sigma_c} \right),$$

where $\Phi$ denotes the c.d.f. of the standard Gaussian distribution (Sutherland et al., 2016). Thus, we choose the kernel parameters by maximizing an estimate of $\mathrm{mKSD}^2(p\|q)/\sigma_c$ (Jitkrittum et al., 2017).

5. Model Criticism on M

Now, we propose mKSD-based model criticism procedures for distributions on Riemannian manifolds. When the proposed model does not fit the observed data well, understanding which part of the model misfits the data is of practical interest. Model criticism can be helpful for better understanding representative prototypes (Kim et al., 2016), for criticizing prior assumptions in Bayesian settings (Lloyd & Ghahramani, 2015), or for better training of generative models (Sutherland et al., 2016). With kernel-based non-parametric testing, distributional features can be extracted in the form of test locations that represent the regions which best "distinguish" distributions. The locations where two sample distributions differ the most via MMD are studied in Jitkrittum et al. (2016), and the most "misspecified" locations between samples and models via KSD are studied in Jitkrittum et al. (2017). Recently, Seth et al. (2019) studied model criticism via latent spaces, which may intrinsically correspond to Riemannian manifold structures; such a setting can be an interesting application of our development.

Let $s_p(\cdot) = \mathbb{E}_{\tilde{x} \sim p}[A^{(1)}_q k(\tilde{x}, \cdot)] \in \mathcal{H}^d$. Adapted from Jitkrittum et al. (2017), we define the manifold Finite Set Stein Discrepancy (mFSSD) as follows. For a small set of $J$ test locations $\{v_1, \ldots, v_J\} \subset M$,

$$\mathrm{mFSSD}(p\|q)^2 = \frac{1}{dJ} \sum_{j=1}^{J} \sum_{i=1}^{d} \big(s_p(v_j)\big)_i^2, \quad (16)$$

which can be computed in time linear in the sample size $n$.

Proposition 3 (Theorem 1 of Jitkrittum et al. (2017)). Let $V = \{v_1, \ldots, v_J\} \subset M$ be random vectors drawn i.i.d. from a distribution $\nu$ which has a density. Let $\mathcal{X}$ be a connected open set in $\mathbb{R}^d$. Assume the conditions in Theorem 3 hold. Then, for any $J \geq 1$, $\nu$-almost surely, $\mathrm{mFSSD}(p\|q)^2 = 0$ if and only if $p = q$.

The Stein identity of $s_p(\cdot)$ ensures $\mathrm{mFSSD}^2 = 0$ under $H_0$ almost surely. To perform model criticism, we extract test locations that give a higher detection rate (i.e., test power) than others. We choose the test locations $V = \{v_j\}_{j=1}^J$ by maximizing the approximate test power:

$$V^* = \arg\max_{V} \frac{\mathrm{mFSSD}^2}{\sigma_{H_1}}, \quad (17)$$

where $\sigma_{H_1}$ is the variance of $\mathrm{mFSSD}^2$ under $H_1$. More details are given in Propositions 4 and 5 in the supplementary material. A short sketch of evaluating the mFSSD statistic at given test locations on the torus is provided below.
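The following sketch (ours, specialized to the torus with a product von Mises kernel; on the flat torus the volume element $J$ is constant, so the null enters only through the partial derivatives of its unnormalized log-density, and the function names are illustrative) evaluates the mFSSD statistic of Eq. (16) at fixed test locations in time linear in $n$. In practice, the locations would then be optimized with the criterion in Eq. (17).

```python
import numpy as np

def mfssd_sq_torus(thetas, V, dlogq, gamma=1.0):
    """mFSSD^2 on the torus, Eq. (16).
    thetas: (n, d) sample angles; V: (J, d) test locations;
    dlogq(theta): (n, d) array of d/dtheta_i of the unnormalized log q."""
    n, d = thetas.shape
    diff = thetas[:, None, :] - V[None, :, :]            # (n, J, d)
    k = np.exp(gamma * np.cos(diff).sum(axis=-1))        # product von Mises kernel, (n, J)
    dk = -gamma * np.sin(diff) * k[:, :, None]           # d k(x, v) / d theta_x^i, (n, J, d)
    xi = dlogq(thetas)[:, None, :] * k[:, :, None] + dk  # Stein feature at (x, v), (n, J, d)
    s = xi.mean(axis=0)                                  # estimate of s_p(v_j), shape (J, d)
    return (s ** 2).sum() / (d * V.shape[0])

# For the uniform null on the torus, dlogq is identically zero:
# mfssd_sq_torus(thetas, V, dlogq=lambda t: np.zeros_like(t))
```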
6. Comparison between mKSD Tests

Bahadur efficiency. From Theorem 5, the mKSD tests are consistent against all alternatives. Thus, to understand which mKSD test is more powerful than the others, we investigate their Bahadur efficiency (Bahadur et al., 1960), which quantifies how fast the p-value goes to zero under alternatives. Here, to focus on the effect of the choice of Stein operator on test performance, we briefly present results for testing uniformity on the circle $S^1$ against the von Mises distribution. See the supplementary material for more details. The technique of the proof is adapted from Jitkrittum et al. (2017).

Theorem 6 (Scaling shift in the von Mises distribution). Let $x \in S^1$, $q(x) \propto 1$ and $p(x) \propto \exp(\kappa u^\top x)$. Choose the von Mises kernel of the form $k(x, x') = \exp(x^\top x')$. Denote the approximate Bahadur efficiency between the mKSDs with first and second order Stein operators by $E_{1,2}(\kappa) := c(\mathrm{mKSD}^{(1)})(\kappa) / c(\mathrm{mKSD}^{(2)})(\kappa)$, where $c(\cdot)$ denotes the approximate Bahadur slope and $\kappa > 0$. Then $E_{1,2}(\kappa) > 1$.

Adapting Theorem 5 of Jitkrittum et al. (2017), it suffices to show $\mathrm{mKSD}^{(1)}(p\|q) \geq \mathrm{mKSD}^{(2)}(p\|q)$ and $\mathbb{E}_{x, \tilde{x} \sim q}[h^{(2)}_q(x, \tilde{x})^2] > \mathbb{E}_{x, \tilde{x} \sim q}[h^{(1)}_q(x, \tilde{x})^2] > 0$. See the supplementary material for details.

We provide additional discussion of test efficiencies for $\mathrm{mKSD}^{(0)}$ in the supplementary material. In general, since we cannot compute the expectation under $q$ in closed form, especially for an unnormalized density, we need to perform the test with samples from $q$, and the sampling error makes the $\mathrm{mKSD}^{(0)}$ test less asymptotically efficient (Jitkrittum et al., 2017; Yang et al., 2019; Xu & Matsuda, 2020).

Computational efficiency. Since the Stein kernels $h^{(1)}_q$ and $h^{(2)}_q$ depend on $q$ only through the derivative of $\log q$, the mKSD tests with the first and second order Stein operators do not require computation of the normalization constant of $q$. This is a major computational advantage over existing goodness-of-fit tests on Riemannian manifolds. While the computational cost of $\mathrm{mKSD}^{(1)}_u$ is $O(n^2 d)$, that of $\mathrm{mKSD}^{(2)}$ is $O(n^2 d^3)$ due to the computation of the metric tensor. On the other hand, the zeroth order mKSD test is equivalent to testing whether two sets of samples are from the same distribution by using MMD (Gretton et al., 2007). Namely, to test whether $x_1, \ldots, x_n$ is from the density $q$, we draw samples $y_1, \ldots, y_m$ from $q$ and determine whether $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ are from the same distribution. This procedure requires sampling from the null distribution $q$ on the Riemannian manifold, which is computationally intensive in general. Note that the results in Theorem 5 with $c = 0$ replicate the asymptotic results for MMD (Gretton et al., 2007).

Choosing mKSD tests. Overall, testing with $\mathrm{mKSD}^{(1)}$ has the advantage of a larger space of test functions together with both asymptotic test efficiency and computational efficiency, so it is recommended when available. $\mathrm{mKSD}^{(2)}$ can be slightly easier to compute and parameterize in certain scenarios, although it may sacrifice test power and computational efficiency. $\mathrm{mKSD}^{(0)}$, namely the MMD test with simulated samples, is also applicable when it is possible to sample from the given unnormalized density model on the Riemannian manifold.

[Figure 1. Rejection rates at α = 0.01: (a)-(b) for the uniform density; (c)-(d) for the Fisher distribution on SO(3). Panels: (a) versus perturbation with n = 100; (b) versus sample size with κ = 0.35; (c) versus perturbation with n = 100; (d) versus sample size with b = 0.20. Curves: mKSD2, mKSD1, mKSD0/MMD, Sobolev.]

7. Simulation Results

In this section, we show the validity of the proposed mKSD tests by simulation on the rotation group SO(3). We use the Euler angles (Chikuse, 2012) as the coordinate system. The bootstrap sample size is set to $B = 1000$. The significance level is set to $\alpha = 0.01$. For the $\mathrm{mKSD}^{(0)}$ test (the MMD two-sample test), the number of samples from the null is set equal to the sample size $n$. We used the kernel $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$, where the parameter $\gamma$ was chosen by optimizing the approximate test power; a small sketch of this selection by grid search is given below.
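The following sketch (ours) illustrates one simple way to carry out this kernel-parameter selection by grid search over a power proxy of the form $\mathrm{mKSD}^2_u/\hat{\sigma}$. The constants in the variance estimate and the regularizer are illustrative choices rather than the exact procedure used in the paper, and `stein_kernel(gamma)` is an assumed callable returning the Stein kernel matrix computed with parameter `gamma` (ideally on a held-out split).

```python
import numpy as np

def power_criterion(H, reg=1e-4):
    """Estimate of mKSD^2 / sigma from the Stein kernel matrix H."""
    n = H.shape[0]
    H = H.copy()
    np.fill_diagonal(H, 0.0)
    stat = H.sum() / (n * (n - 1))               # U-statistic estimate of mKSD^2
    row_means = H.sum(axis=1) / (n - 1)          # estimates of E_{x~p}[h(x_i, x)]
    sigma = np.sqrt(4.0 * row_means.var() + reg) # plug-in std under H1 (up to constants)
    return stat / sigma

def select_gamma(stein_kernel, grid=(0.1, 0.3, 1.0, 3.0, 10.0)):
    """Pick the kernel parameter maximizing the power proxy over a grid."""
    return max(grid, key=lambda g: power_criterion(stein_kernel(g)))
```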
The exponential-trace kernel $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$ on the rotation group is compact universal. To see this, we rewrite the kernel in a form analogous to the Gaussian kernel: $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y)) = C \exp\!\big(-\tfrac{\gamma}{2} \|X - Y\|_F^2\big)$, where $C$ is a constant that depends only on $d$, the dimension of the matrices $X, Y \in SO(d)$, because $\mathrm{tr}(X^\top X) = \mathrm{tr}(I_d) = d$ for all $X \in SO(d)$. Since the Gaussian kernel is universal (Sriperumbudur et al., 2011) and the rotation group $SO(d)$ is a compact subset of the space of $d \times d$ matrices, the exponential-trace kernel is also compact universal by Corollary 3 of Carmeli et al. (2010).

7.1. Uniform distribution

First, we consider testing uniformity on SO(3) and compare the performance of the mKSD tests with the Sobolev test (Jupp et al., 2005). We generated samples from the exponential-trace distribution $p(X \mid \kappa) \propto \exp(\kappa\,\mathrm{tr}(X))$ by rejection sampling (Hoff, 2009); a simple sampler sketch is given at the end of this section. The uniform distribution corresponds to $\kappa = 0$. Figure 1 (a) plots the rejection rates with respect to $\kappa$ for $n = 100$. When $\kappa = 0$, the type-I errors of all tests are well controlled at the significance level $\alpha = 0.01$. The power of all tests increases with increasing $\kappa$ and converges to one. Figure 1 (b) plots the rejection rates with respect to $n$ for $\kappa = 0.35$. The power of all tests increases with $n$ and converges to one. When the model becomes increasingly different from the null, mKSD1 is more sensitive in distinguishing the difference, with higher power than the others.

7.2. Fisher distribution

Next, we consider the Fisher distribution (or matrix Langevin distribution) $p(X \mid F) \propto \exp(\mathrm{tr}(F^\top X))$ (Chikuse, 2003; Sei et al., 2013). We generated data from $p(X \mid F_0)$ and applied the mKSD tests with the null $p(X \mid F_b)$, where

$$F_b = \begin{pmatrix} 1 & b & 0 \\ b & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

and $F_0$ corresponds to $b = 0$. We compare the mKSD tests with the extended Sobolev test (Jupp et al., 2005), in which we compute the normalization constant by Monte Carlo. Figure 1 (c) plots the rejection rates with respect to $b$ for $n = 100$. Figure 1 (d) plots the rejection rates with respect to $n$ for $b = 0.2$. From the plots, we see that all tests achieve the correct test level under the null. When the model becomes increasingly different from the null, mKSD1 is more sensitive in distinguishing the difference, with higher power than the others. The MMD test has lower power than mKSD1 and mKSD2 due to the inefficiency from sampling. While the Sobolev test is useful when the null and the alternative are very different, it is not powerful for harder problems where the alternative is perturbed very little from the null.
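For completeness, a simple rejection sampler for the exponential-trace distribution used above is sketched here (our own illustration with Haar-uniform proposals, not the exact procedure of Hoff (2009)). It relies on $\mathrm{tr}(X) \leq 3$ for $X \in SO(3)$, so the acceptance probability $\exp(\kappa(\mathrm{tr}(X) - 3))$ is valid for $\kappa \geq 0$.

```python
import numpy as np

def haar_so3(rng):
    """Haar-uniform rotation via QR decomposition of a Gaussian matrix."""
    Z = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(Z)
    Q = Q * np.sign(np.diag(R))          # fix the signs so Q is Haar-distributed on O(3)
    if np.linalg.det(Q) < 0:
        Q[:, -1] *= -1                   # map the det = -1 component onto SO(3)
    return Q

def sample_exp_trace(kappa, n, seed=0):
    """Rejection sampler for p(X | kappa) proportional to exp(kappa * tr(X)) on SO(3)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        X = haar_so3(rng)
        if rng.uniform() < np.exp(kappa * (np.trace(X) - 3.0)):  # accept/reject step
            out.append(X)
    return np.stack(out)
```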
8. Real Data Applications

Finally, we apply the mKSD tests to two real datasets.

8.1. Vectorcardiogram data

As a real dataset on the rotation group SO(3), we use the vectorcardiogram data studied by Jupp et al. (2008). The data summarize vectorcardiograms from normal children, where each data point records 3 perpendicular vectors of directions QRS, PRS and T from the Frank system for electrical lead placement. Details of this dataset can be found in Downs (1972).

We fitted a Fisher distribution $p(X \mid F) \propto \exp(\mathrm{tr}(F^\top X))$ to the 28 data points of children aged between 2 and 10 (Jupp et al., 2005) and obtained the estimate

$$\hat{F} = \begin{pmatrix} 0.583 & 0.629 & 0.514 \\ 0.660 & 0.736 & 0.151 \\ 0.473 & 0.252 & 0.844 \end{pmatrix}.$$

We use this value as the null model to be tested. Table 1 presents the p-values of each test. We apply the kernel of the form $k(X, Y) = \exp(\gamma\,\mathrm{tr}(X^\top Y))$ as used in Section 7. All mKSD tests show strong evidence to reject the fitted model at $\alpha = 0.05$. However, the Sobolev test, with p-value 0.126, is not powerful enough to reject the null at the same test level.

Table 1. p-values for vectorcardiogram data.

| mKSD1 | mKSD2 | mKSD0/MMD | Sobolev |
| --- | --- | --- | --- |
| 0.004 | 0.000 | 0.010 | 0.126 |

8.2. Wind direction data

As real data on the torus, we consider the wind direction in Tokyo at 00:00 ($x_1$) and 12:00 ($x_2$) for each day in 2018 (data available on the Japan Meteorological Agency website, https://www.data.jma.go.jp/obd/stats/etrn/). Thus, the sample size is $n = 365$. The data were discretized into 16 directions, such as north-northeast. Figure 2 presents a 16 × 16 histogram of the raw data.

[Figure 2. Wind direction data. Left: 2D histogram of the wind directions; the colorbar shows the count of data points in each square. Middle: the 10 optimized test locations (red stars), without repetition. Right: the objective value in Eq. (17), mFSSD²/σ_{H1}, at the specified test location (i.e. setting J = 1); the higher, the darker.]

We consider goodness-of-fit testing for the bivariate von Mises distribution in Eq. (2) using mKSD. We apply the product von Mises kernel described in the illustrative example for the torus in Section 4, $k(u, v) = \exp\{\gamma_1 \cos(\theta^1_u - \theta^1_v) + \gamma_2 \cos(\theta^2_u - \theta^2_v)\}$, where the bandwidth parameters $\gamma_1$ and $\gamma_2$ were chosen by optimizing the approximate test power. (The optimization objective is $\mathrm{mKSD}^2 / \sqrt{\mathrm{Var}(\mathrm{mKSD}^2)}$, which is similar to Eq. (17) with the test statistic $\mathrm{mKSD}^2$ in place of $\mathrm{mFSSD}^2$.) In addition, we pay particular attention to positive definiteness when choosing our kernel. As Feragen et al. (2015) pointed out, not every geodesic distance $d(u, v)$ induces a positive-definite kernel $k(u, v) = \exp(-\gamma\, d(u, v))$ on a manifold. Adapting results on strictly positive definite functions on hyperspheres (Gneiting et al., 2013), the kernel chosen above is positive definite on the torus, viewed as the product of circles with the spherical coordinate on each $S^1$.

By using noise contrastive estimation (Gutmann & Hyvärinen, 2012), Uehara et al. (2020) fitted the bivariate von Mises distribution to the wind direction data and obtained the estimate of the parameter set $\eta = (\kappa_1, \kappa_2, \mu_1, \mu_2, \lambda_{12})$: $\hat{\eta} = (0.7170, 0.3954, 1.1499, 1.1499, 1.1274)$. Setting this fitted model as the null model, the p-value obtained using mKSD1 is 0.434, which indicates that the model is a good fit for the data. In addition, we fitted a simpler model with no interaction between $x_1$ and $x_2$, i.e. $\lambda_{12}$ is set to zero in Eq. (2), so that the model reduces to a product of two von Mises distributions, one for each direction. The p-value obtained by mKSD1 is 0.002, which is strong evidence to reject this null model. In other words, there is a significant interaction between the wind directions at 00:00 and 12:00.

We then carried out model criticism with the mFSSD statistic in Eq. (16), with test locations optimized by maximizing the approximate test power. Choosing the number of test locations $J = 10$, we plot the optimized locations in Figure 2. They provide information about the dependence between the wind directions at midnight and noon.

Concluding remark. In this study, we develop goodness-of-fit testing procedures and model criticism methods for general distributions on Riemannian manifolds. As mKSDs are proper discrepancy measures under mild assumptions, the connections and comparisons of the topologies induced by different mKSDs are an interesting future direction.

Acknowledgement. T.M.
was supported by JSPS KAKENHI Grant Number 19K20220 and JST Moonshot Grant Number JPMJMS2024. W.X. was supported by Gatsby Charitable Foundation. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Bahadur, R. R. et al. Stochastic comparison of tests. Annals of Mathematical Statistics, 31(2):276 295, 1960. Barbour, A. D. and Chen, L. H. Y. An introduction to Stein s method, volume 4. World Scientific, 2005. Barp, A., Oates, C., Porcu, E., and Girolami, M. A riemannian-stein kernel method. ar Xiv preprint ar Xiv:1810.04946, 2018. Berlinet, A. and Thomas, C. Reproducing kernel Hilbert spaces in Probability and Statistics. Kluwer Academic Publishers, 2004. Carmeli, C., De Vito, E., Toigo, A., and Umanitá, V. Vector valued reproducing kernel hilbert spaces and universality. Analysis and Applications, 8(01):19 61, 2010. Chen, L. H. Y., Goldstein, L., and Shao, Q. M. Normal approximation by Stein s method. Springer, 2010. Chikuse, Y. Concentrated matrix langevin distributions. Journal of Multivariate Analysis, 2(85):375 394, 2003. Chikuse, Y. Statistics on special manifolds, volume 174. Springer Science & Business Media, 2012. Chikuse, Y. and Jupp, P. E. A test of uniformity on shape spaces. Journal of multivariate analysis, 88(1):163 176, 2004. Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning, pp. 2606 2615, 2016. Chwialkowski, K. P., Sejdinovic, D., and Gretton, A. A wild bootstrap for degenerate kernel tests. In Advances in neural information processing systems, pp. 3608 3616, 2014. Downs, T. D. Orientation statistics. Biometrika, 59(3): 665 676, 1972. Dryden, I. L. and Mardia, K. V. Statistical shape analysis: with applications in R, volume 995. John Wiley & Sons, 2016. Feragen, A., Lauze, F., and Hauberg, S. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3032 3042, 2015. Fernandez, T., Rivera, N., Xu, W., and Gretton, A. Kernelized stein discrepancy tests of goodness-of-fit for timeto-event data. In International Conference on Machine Learning, pp. 3112 3122. PMLR, 2020. Flanders, H. Differential Forms with Applications to the Physical Sciences. Dover, 1963. Garreau, D., Jitkrittum, W., and Kanagawa, M. Large sample analysis of the median heuristic. ar Xiv preprint ar Xiv:1707.07269, 2017. Giné, E. Invariant tests for uniformity on compact riemannian manifolds based on sobolev norms. The Annals of statistics, pp. 1243 1266, 1975. Girolami, M., Calderhead, B., and Chin, S. A. Riemannian manifold hamiltonian monte carlo. ar Xiv preprint ar Xiv:0907.1100, 2009. Gleser, L. J. The comparison of multivariate tests of hypothesis by means of bahadur efficiency. Sankhy a: The Indian Journal of Statistics, Series A, pp. 157 174, 1966. Gneiting, T. et al. Strictly and non-strictly positive definite functions on spheres. Bernoulli, 19(4):1327 1349, 2013. Gorham, J. and Mackey, L. Measuring sample quality with stein s method. In Advances in Neural Information Processing Systems, pp. 226 234, 2015. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sampleproblem. In Advances in neural information processing systems, pp. 513 520, 2007. Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, consistent kernel two-sample test. In Advances in neural information processing systems, pp. 673 681, 2009. 
Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In Advances in neural information processing systems, pp. 1205 1213, 2012. Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307 361, 2012. Hoff, P. D. Simulation of the matrix bingham von mises fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2):438 456, 2009. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695 709, 2005. Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pp. 181 189, 2016. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 262 271, 2017. Jitkrittum, W., Kanagawa, H., Sangkloy, P., Hays, J., Schölkopf, B., and Gretton, A. Informative features for model comparison. In Advances in Neural Information Processing Systems, pp. 808 819, 2018. Jitkrittum, W., Kanagawa, H., and Schölkopf, B. Testing goodness of fit of conditional density models with kernels. ar Xiv preprint ar Xiv:2002.10271, 2020. Jupp, P. and Kume, A. Measures of goodness of fit obtained by canonical transformations on riemannian manifolds. ar Xiv preprint ar Xiv:1811.04866, 2018. Jupp, P. et al. Sobolev tests of goodness of fit of distributions on compact riemannian manifolds. The Annals of Statistics, 33(6):2957 2966, 2005. Jupp, P. et al. Data-driven sobolev tests of uniformity on compact riemannian manifolds. The Annals of Statistics, 36(3):1246 1260, 2008. Jupp, P. E., Mardia, K. V., et al. Maximum likelihood estimators for the matrix von mises-fisher and bingham distributions. The Annals of Statistics, 7(3):599 606, 1979. Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., and Gretton, A. A kernel stein test for comparing latent variable models. ar Xiv preprint ar Xiv:1907.00586, 2019. Kim, B., Khanna, R., and Koyejo, O. Examples are not enough, learn to criticize! criticism for interpretability. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2288 2296, 2016. Klein, N., Orellana, J., Brincat, S. L., Miller, E. K., Kass, R. E., et al. Torus graphs for multivariate phase coupling analysis. Annals of Applied Statistics, 14(2):635 660, 2020. Kobayashi, S. and Nomizu, K. Foundations of differential geometry, volume 1. New York, London, 1963. Le, H., Lewis, A., Bharath, K., and Fallaize, C. A diffusion approach to stein s method on riemannian manifolds. ar Xiv preprint ar Xiv:2003.11497, 2020. Lee, A. J. U-Statistics: Theory and Practice. CRC Press, 1990. Ley, C. and Verdebout, T. Modern directional statistics. Chapman and Hall/CRC, 2017. Ley, C., Reinert, G., Swan, Y., et al. Stein s method for comparison of univariate distributions. Probability Surveys, 14:1 52, 2017. Liu, C. and Zhu, J. Riemannian stein variational gradient descent for bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Liu, Q., Lee, J., and Jordan, M. 
A kernelized stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276 284, 2016. Liu, S. and Kanamori, T. Estimating density models with complex truncation boundaries. ar Xiv preprint ar Xiv:1910.03834, 2019. Lloyd, J. R. and Ghahramani, Z. Statistical model criticism using kernel two sample tests. Advances in Neural Information Processing Systems, 28:829 837, 2015. Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pp. 2917 2925, 2015. Mardia, K. V. and Jupp, P. E. Directional Statistics. Wiley, New York, NY, 1999. Mardia, K. V., Kent, J., and Laha, A. Score matching estimators for directional distributions. ar Xiv:1604.08470, 2016. Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends R in Machine Learning, 10(1-2):1 141, 2017. Pawlowsky-Glahn, V. and Buccianti, A. Compositional data analysis: Theory and applications. John Wiley & Sons, 2011. Pourzanjani, A. A., Jiang, R. M., Mitchell, B., Atzberger, P. J., and Petzold, L. R. General bayesian inference over the stiefel manifold via the givens representation. ar Xiv preprint ar Xiv:1710.09443, 2017. Sei, T., Shibata, H., Takemura, A., Ohara, K., and Takayama, N. Properties and applications of fisher distribution on the rotation group. Journal of Multivariate Analysis, 116: 440 455, 2013. Serfling, R. J. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009. Seth, S., Murray, I., Williams, C. K., et al. Model criticism in latent space. Bayesian Analysis, 14(3):703 725, 2019. Singh, H., Hnizdo, V., and Demchuk, E. Probabilistic model for two dependent circular variables. Biometrika, 89:719 723, 2002. Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961 968, 2009. Spivak, M. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. CRC press, 2018. Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and rkhs embedding of measures. Journal of Machine Learning Research, 12 (Jul):2389 2410, 2011. Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. Generative models and model criticism via optimized maximum mean discrepancy. ar Xiv preprint ar Xiv:1611.04488, 2016. Uehara, M., Matsuda, T., and Kim, J. K. Imputation estimators for unnormalized models with missing data. In International Conference on Artificial Intelligence and Statistics, pp. 831 841. PMLR, 2020. Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000. Xu, W. and Matsuda, T. A stein goodness-of-fit test for directional distributions. In International Conference on Artificial Intelligence and Statistics, pp. 831 841. PMLR, 2020. Xu, W. and Reinert, G. A stein goodness-of-test for exponential random graph models. In International Conference on Artificial Intelligence and Statistics, pp. 415 423. PMLR, 2021. Yang, J., Liu, Q., Rao, V., and Neville, J. Goodness-offit testing for discrete distributions via stein discrepancy. In International Conference on Machine Learning, pp. 5557 5566, 2018. Yang, J., Rao, V., and Neville, J. 
A stein papangelou goodness-of-fit test for point processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 226 235, 2019.