# Active Slices for Sliced Stein Discrepancy

Wenbo Gong¹, Kaibo Zhang¹, Yingzhen Li², José Miguel Hernández-Lobato¹

¹Department of Engineering, University of Cambridge, Cambridge, United Kingdom. ²Department of Computing, Imperial College London, London, United Kingdom. Correspondence to: Wenbo Gong, José Miguel Hernández-Lobato.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

**Abstract.** Sliced Stein discrepancy (SSD) and its kernelized variants have demonstrated promising successes in goodness-of-fit tests and model learning in high dimensions. Despite their theoretical elegance, their empirical performance depends crucially on the search for optimal slicing directions to discriminate between two distributions. Unfortunately, previous gradient-based optimisation approaches for this task return sub-optimal results: they are computationally expensive, sensitive to initialization, and they lack theoretical guarantees of convergence. We address these issues in two steps. First, we provide theoretical results stating that the requirement of using optimal slicing directions in the kernelized version of SSD can be relaxed, validating the resulting discrepancy with finite random slicing directions. Second, given that good slicing directions are crucial for practical performance, we propose a fast algorithm for finding such slicing directions based on ideas of active sub-space construction and spectral decomposition. Experiments on goodness-of-fit tests and model learning show that our approach achieves both improved performance and faster convergence. In particular, we demonstrate a 14-80x speed-up in goodness-of-fit tests when comparing with gradient-based alternatives.

## 1. Introduction

Discrepancy measures between two distributions are critical tools in modern statistical machine learning. Among them, Stein discrepancy (SD) and its kernelized version, kernelized Stein discrepancy (KSD), have been extensively used for goodness-of-fit (GOF) testing (Liu et al., 2016; Chwialkowski et al., 2016; Huggins & Mackey, 2018; Jitkrittum et al., 2017; Gorham & Mackey, 2017) and model learning (Liu & Wang, 2016; Pu et al., 2017; Hu et al., 2018; Grathwohl et al., 2020). Despite their recent success, applying Stein discrepancies to high-dimensional distribution testing and learning remains an unsolved challenge. These curse-of-dimensionality issues have recently been addressed by the newly proposed sliced Stein discrepancy (SSD) and its kernelized variants (SKSD) (Gong et al., 2021), which have demonstrated promising results in both high-dimensional GOF tests and model learning. They work by first projecting the score function and the test inputs along two slice directions $r$ and $g_r$, and then comparing the two distributions using the resulting one-dimensional slices.

The performance of SSD and SKSD crucially depends on choosing slicing directions that are highly discriminative. Indeed, Gong et al. (2021) showed that such a discrepancy can still be valid despite the information loss caused by the projections, provided that optimal slice directions, along which the two distributions differ the most, are used. Unfortunately, gradient-based optimization for finding such optimal slices often suffers from slow convergence and sub-optimal solutions. In practice, many gradient updates may be required to obtain a reasonable set of slice directions (Gong et al., 2021).
We aim to tackle the above practical challenges by proposing an efficient algorithm to find good slice directions with statistical guarantees. Our contributions are as follows:

- We propose a computationally efficient variant of SKSD using a finite number of random slices. This relaxes the restrictive constraint of having to use optimal slices, so that convergence of the optimisation to a global optimum is no longer required.
- Given that good slices are still preferred in practice, we propose surrogate optimization tasks to find such directions. The resulting directions are called active slices and have analytic solutions that can be computed very efficiently.
- Experiments on GOF test benchmarks (including testing on restricted Boltzmann machines) show that our algorithm outperforms alternative gradient-based approaches while achieving at least a 14x speed-up.
- In the task of learning high-dimensional independent component analysis (ICA) models (Comon, 1994), our algorithm converges much faster and to significantly better solutions than other baselines.

**Road map:** First, we give a brief background on SD, SKSD and the relevant variants (Section 2). Next, we show that optimality of the slices is not necessary; instead, finitely many random slices are enough to ensure the validity of SKSD (Section 3.1). Although relaxing the optimality constraint gives us huge freedom to select slice directions, choosing an appropriate objective for finding slices is still crucial. Unfortunately, analysing SKSD in an RKHS is challenging, so we propose to analyse SSD as a surrogate objective by showing that SKSD can be well approximated by SSD (Section 3.2). Lastly, by analysing SSD, we propose algorithms to find active slices for SKSD (Sections 4, 5, 6), and demonstrate the efficacy of our proposal in experiments (Section 7). Assumptions and proofs of the theoretical results, as well as the experimental settings, can be found in the appendix.

## 2. Background

For a distribution $p$ on $\mathcal{X} \subseteq \mathbb{R}^D$ with differentiable density, we define its score function as $s_p(x) = \nabla_x \log p(x)$. We also define the Stein operator $\mathcal{A}_p$ for the distribution $p$ as
$$\mathcal{A}_p f(x) = s_p(x)^\top f(x) + \nabla_x^\top f(x), \tag{1}$$
where $f: \mathcal{X} \to \mathbb{R}^D$ is a test function. Then the Stein discrepancy (SD) (Gorham & Mackey, 2015) between two distributions $p, q$ with differentiable densities on $\mathcal{X}$ is
$$D_{SD}(q, p) = \sup_{f \in \mathcal{F}_q} \mathbb{E}_q[\mathcal{A}_p f(x)], \tag{2}$$
where $\mathcal{F}_q$ is the Stein's class of $q$, containing test functions that satisfy $\mathbb{E}_q[\mathcal{A}_q f(x)] = 0$ (see also Definition B.2 in appendix B). The supremum is attained by choosing $f \propto s_p(x) - s_q(x)$ if $\mathcal{F}_q$ is sufficiently rich (Hu et al., 2018). Chwialkowski et al. (2016) and Liu et al. (2016) further restrict the test function space $\mathcal{F}_q$ to be the unit ball of an RKHS induced by a $c_0$-universal kernel $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$. This results in the kernelized Stein discrepancy (KSD), which can be computed analytically:
$$\Big( \sup_{f\in\mathcal{H}_k,\, \|f\|_{\mathcal{H}_k}\le 1} \mathbb{E}_q[\mathcal{A}_p f(x)] \Big)^2 = \big\| \mathbb{E}_q[ s_p(x) k(x,\cdot) + \nabla_x k(x,\cdot) ] \big\|^2_{\mathcal{H}_k}, \tag{3}$$
where $\mathcal{H}_k$ is the RKHS induced by $k$ with norm $\|\cdot\|_{\mathcal{H}_k}$.
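For concreteness, the following is a minimal NumPy sketch of the squared KSD above as a V-statistic with an RBF kernel. It is an illustration rather than the implementation used in this paper; the bandwidth and the toy Gaussian example are arbitrary choices.

```python
import numpy as np

def rbf_kernel_terms(X, h):
    # pairwise quantities for k(x, x') = exp(-||x - x'||^2 / (2 h^2))
    diff = X[:, None, :] - X[None, :, :]                      # (n, n, D)
    sqdist = np.sum(diff ** 2, axis=-1)                       # (n, n)
    K = np.exp(-sqdist / (2 * h ** 2))                        # k(x, x')
    gradx_K = -diff / h ** 2 * K[..., None]                   # ∇_x k(x, x')
    trace_term = K * (X.shape[1] / h ** 2 - sqdist / h ** 4)  # trace(∇_x ∇_{x'} k)
    return K, gradx_K, trace_term

def ksd_vstat(X, score_p, h=1.0):
    """V-statistic estimate of KSD^2 between samples X ~ q and a density p given via its score s_p."""
    S = score_p(X)                                            # (n, D) scores s_p(x_i)
    K, gradx_K, trace_term = rbf_kernel_terms(X, h)
    term1 = (S @ S.T) * K                                     # s_p(x)^T s_p(x') k(x, x')
    term2 = np.einsum('id,ijd->ij', S, -gradx_K)              # s_p(x)^T ∇_{x'} k  (∇_{x'} k = -∇_x k)
    term3 = np.einsum('jd,ijd->ij', S, gradx_K)               # ∇_x k^T s_p(x')
    return np.mean(term1 + term2 + term3 + trace_term)

# toy check: p = N(0, I), samples from a shifted q = N(0.5, I)
score_gauss = lambda X: -X                                    # score of a standard Gaussian
X = np.random.randn(200, 10) + 0.5
print(ksd_vstat(X, score_gauss, h=np.sqrt(10)))
```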
### 2.1. Sliced kernelized Stein discrepancy

Despite the theoretical elegance of KSD, it often suffers from the curse of dimensionality in practice. To address this issue, Gong et al. (2021) proposed a divergence family called the sliced Stein discrepancy (SSD) and its kernelized variants, under mild assumptions on the regularity of the probability densities (Assumptions 1-4 in appendix B) and the richness of the kernel (Assumption 5 in appendix B).

The key idea is to compare the distributions on their one-dimensional slices, obtained by projecting the score $s_p$ and the test input $x$ along two directions $r$ and its corresponding $g_r$, respectively. Readers are referred to appendix C for details. Although the projections discard some of the information carried by $s_p$ and $x$, the validity of the discrepancy can be ensured by using an orthogonal basis for $r$ together with the corresponding most discriminative $g_r$ directions. The resulting valid discrepancy is called max SSD-g, which uses a set of orthogonal basis directions $r \in O_r$ and their corresponding optimal $g_r$ directions:
$$S^{\max}_{g_r}(q,p) = \sum_{r\in O_r} \sup_{h_{rg_r}\in\mathcal{F}_q,\ g_r\in S^{D-1}} \mathbb{E}_q\big[ s^r_p(x)\, h_{rg_r}(x^\top g_r) + r^\top g_r\, \nabla_{x^\top g_r} h_{rg_r}(x^\top g_r) \big], \tag{4}$$
where $h_{rg_r}: \mathbb{R} \to \mathbb{R}$ is the test function, $S^{D-1}$ is the unit sphere in $\mathbb{R}^D$, and $s^r_p(x) = s_p(x)^\top r$ is the projected score function. Under certain scenarios (Gong et al., 2021), e.g. the GOF test, one can further improve the performance of max SSD-g by replacing $\sum_{r\in O_r}$ with the optimal $\sup_{r\in S^{D-1}}$ in Eq. (4), resulting in another variant called max SSD-rg ($S^{\max}_{rg_r}$). This increase in performance is due to the higher discriminative power provided by the optimal $r$.

However, the optimal test functions $h^*_{rg_r}$ in max SSD-g (or -rg) are intractable in practice. Gong et al. (2021) further proposed kernelized variants to address this issue by letting $\mathcal{F}_q$ be the unit ball of an RKHS induced by a $c_0$-universal kernel $k_{rg_r}$. With
$$\xi_{p,r,g_r}(x,\cdot) = s^r_p(x)\, k_{rg_r}(x^\top g_r, \cdot) + r^\top g_r\, \nabla_{x^\top g_r} k_{rg_r}(x^\top g_r, \cdot), \tag{5}$$
the max SKSD-g (the kernelized version of max SSD-g) is
$$SK^{\max}_{g_r}(q,p) = \sum_{r\in O_r} \sup_{g_r\in S^{D-1}} \big\| \mathbb{E}_q[\xi_{p,r,g_r}(x,\cdot)] \big\|^2_{\mathcal{H}_{rg_r}}, \tag{6}$$
where $\mathcal{H}_{rg_r}$ is the RKHS induced by $k_{rg_r}$ with the associated norm $\|\cdot\|_{\mathcal{H}_{rg_r}}$. Similarly, a kernelized version of max SSD-rg, denoted by max SKSD-rg ($SK^{\max}_{rg_r}$), is obtained by replacing $\sum_{r\in O_r}$ with $\sup_{r\in S^{D-1}}$ in Eq. (6).
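For fixed slices, the squared RKHS norm in Eq. (6) can be expanded with the reproducing property into a V-statistic over pairs of samples, exactly as for KSD but in one dimension. The sketch below is an illustrative implementation of this expansion with a one-dimensional RBF kernel; it is not the authors' code, and the kernel, bandwidth and toy distributions are assumptions made for the example.

```python
import numpy as np

def sksd_g_vstat(X, score_p, O_r, G, h=1.0):
    """V-statistic estimate of SKSD-g for fixed slices: rows r of O_r paired with rows g_r of G."""
    S = score_p(X)                                        # (n, D) scores s_p at samples from q
    total = 0.0
    for r, g in zip(O_r, G):
        sr = S @ r                                        # projected scores s_p^r(x)
        t = X @ g                                         # projected inputs x^T g_r
        rg = r @ g                                        # scalar r^T g_r
        dt = t[:, None] - t[None, :]                      # pairwise t - t'
        K = np.exp(-dt ** 2 / (2 * h ** 2))               # k(t, t')
        dK_t = -dt / h ** 2 * K                           # ∂_t k(t, t')
        dK_tt = K * (1.0 / h ** 2 - dt ** 2 / h ** 4)     # ∂_t ∂_{t'} k(t, t')
        term = (np.outer(sr, sr) * K                      # s_p^r(x) s_p^r(x') k
                + rg * sr[:, None] * (-dK_t)              # s_p^r(x) r^T g_r ∂_{t'} k
                + rg * dK_t * sr[None, :]                 # r^T g_r ∂_t k  s_p^r(x')
                + rg ** 2 * dK_tt)                        # (r^T g_r)^2 ∂_t ∂_{t'} k
        total += term.mean()
    return total

# toy usage: p = N(0, I), shifted q, basis O_r = I, g_r at random unit vectors
D, n = 5, 300
X = np.random.randn(n, D) + 0.3
G = np.random.randn(D, D)
G /= np.linalg.norm(G, axis=1, keepdims=True)
print(sksd_g_vstat(X, lambda X: -X, np.eye(D), G))
```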
Although max SKSD-g (or -rg) addresses the intractability of the test functions, a practical challenge remains: computing the optimal slice directions $r$ and $g_r$. Gradient-based optimization for this task (Gong et al., 2021) suffers from slow convergence; even worse, it is sensitive to initialization and only returns sub-optimal solutions. In that case it is unclear whether the resulting discrepancy is still valid, leaving the correctness of the GOF test unverified.

*Figure 1. The relationship between the different SSD discrepancies, where green text indicates our contributions, red symbols indicate valid discrepancies and $\mathcal{H}_{rg_r}$ is the RKHS induced by the kernel $k_{rg_r}$. The leftmost part shows the discrepancies proposed by Gong et al. (2021), whereas the rightmost part and the central "Active Slices" are our contributions. The arrows indicate the connections between Gong et al. (2021) and our work.*

Therefore, the first important question to ask is: is the optimality of the slices a necessary condition for the validity of max SKSD-g (or -rg)? Remarkably, we show that the answer is no, under mild assumptions on the kernel (Assumptions 5-6 in appendix B). As the sliced Stein discrepancies defined above involve a sup operator, which makes them difficult to analyze, we need notation for their sub-optimal versions. For example, max SSD-g (Eq. 4) involves a sup over the slices $g_r$. We thus define SSD-g ($S_{g_r}$) as Eq. (4) with a given $g_r$ instead of the sup:
$$S_{g_r}(q,p) = \sum_{r\in O_r} \sup_{h_{rg_r}\in\mathcal{F}_q} \mathbb{E}_q\big[ s^r_p(x)\, h_{rg_r}(x^\top g_r) + r^\top g_r\, \nabla_{x^\top g_r} h_{rg_r}(x^\top g_r) \big]. \tag{7}$$
Following the same logic, we define the sub-optimal version of each discrepancy mentioned in this section as in Table 1 and appendix A.

Table 1. Notation for the sub-optimal versions of SSD & SKSD.

| Optimal form | max SSD-g ($S^{\max}_{g_r}$) | max SSD-rg ($S^{\max}_{rg_r}$) | max SKSD-g ($SK^{\max}_{g_r}$) | max SKSD-rg ($SK^{\max}_{rg_r}$) |
|---|---|---|---|---|
| Modification | Change $\sup_{g_r}$ to a given $g_r$ in Eq. 4 | Change $\sup_{r,g_r}$ to given $r, g_r$ in Eq. 37 (App. C) | Same as max SSD-g, in Eq. 6 | Same as max SSD-rg, in Eq. 41 (App. C) |
| Sub-optimal form | SSD-g ($S_{g_r}$) | SSD-rg ($S_{rg_r}$) | SKSD-g ($SK_{g_r}$) | SKSD-rg ($SK_{rg_r}$) |

## 3. Relaxing constraints for the SKSD family

### 3.1. Is optimality necessary for validity?

As mentioned before, the validity of max SKSD requires the optimality of the slice directions, which restricts its application in practice. In the following, we show that these restrictions can be substantially relaxed under mild assumptions on the kernel. All proofs can be found in appendix E.

The key idea is to use kernels such that the corresponding term $SK_{rg_r}$ is real analytic w.r.t. both $r$ and $g_r$, as detailed by Assumption 6 (appendix B). A nice property of any real analytic function is that, unless it is a constant function, the set of its roots has zero Lebesgue measure. This means that valid slices exist almost everywhere in $\mathbb{R}^D$, giving us huge freedom to choose slices without worrying about violating validity.

**Theorem 1** (Conditions for valid slices). Assume assumptions 1-4 (density regularity), 5 (richness of kernel) and 6 (real analytic kernel) in appendix B. Let $g_r \sim \eta_g$ for each $r \sim \eta_r$, where $\eta_g, \eta_r$ are distributions on $\mathbb{R}^D$ admitting a density. Then $SK_{rg_r}(q,p) = 0$ iff $p = q$, almost surely.

The above theorem tells us that a finite number of random slices is enough to make $SK_{rg_r}$ valid, without the need for optimal slices (c.f. $SK^{\max}_{rg_r}$). In practice, we often consider $r, g_r \in S^{D-1}$ instead of $\mathbb{R}^D$. Fortunately, one can easily transform arbitrary slices onto $S^{D-1}$ without violating validity: for any $r, g_r$, we (i) add Gaussian noise to them, and (ii) re-normalize the noisy $r, g_r$ to unit vectors. We refer to corollary 6.1 in appendix E.1 for details.

### 3.2. Relationship between SSD and SKSD

Theorem 1 allows us to use random slices. However, it is still beneficial to find good ones in practice. Unfortunately, $SK_{rg_r}$ is not a suitable objective for finding good slice directions. This is because, unlike a test function in a general function space ($h_{rg_r} \in \mathcal{F}_q$), the optimal kernel test function ($\mathbb{E}_q[\xi_{p,r,g_r}(x,\cdot)]$) cannot be easily analyzed for finding good slices due to its restriction to an RKHS. Instead, we propose to use $S_{g_r}$ (or $S_{rg_r}$) as the optimization objective.

To justify $S_{rg_r}$ as a good replacement for $SK_{rg_r}$, we show that $SK_{rg_r}$ approximates $S_{rg_r}$ arbitrarily well if the RKHS of the chosen kernel is dense in the space of continuous functions. A similar result for $SK_{g_r} \approx S_{g_r}$ can be derived accordingly, as the only difference between $S_{g_r}$ and $S_{rg_r}$ is the summation over the orthogonal basis $O_r$. However, $S_{rg_r}$ still involves a sup over the test functions $h_{rg_r}$, which hinders further analysis. To deal with this, we give an important proposition that is needed in almost every theoretical claim we make. This proposition characterises the optimal test function for $S_{rg_r}$ (or $S_{g_r}$).

**Proposition 1** (Optimal test function given $r, g_r$). Assume assumptions 1-4 (density regularity) and given directions $r, g_r$. Let $G_r = [a_1, \ldots, a_D]^\top$ be an arbitrary orthogonal matrix with $a_i \in S^{D-1}$ and $a_d = g_r$. Denote $x \sim q$ and $y = G_r x$, which is also a random variable with induced distribution $q_{G_r}$. Then the optimal test function for $S_{rg_r}$ is
$$h^*_{rg_r}(x^\top g_r) \propto \mathbb{E}_{q_{G_r}(y_{-d}\,|\,y_d)}\big[ s^r_p(G_r^{-1} y) - s^r_q(G_r^{-1} y) \big], \tag{8}$$
where $y_d = x^\top g_r$ and $y_{-d}$ contains the other elements of $y$.
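The two constructions used above are simple to implement. The sketch below shows (i) the noise-and-renormalize trick for mapping arbitrary slices onto $S^{D-1}$ (in the spirit of corollary 6.1), and (ii) completing $g_r$ to an orthogonal matrix $G_r$ as required by Proposition 1. The QR-based completion and the noise scale are implementation choices for illustration, not prescribed by the paper.

```python
import numpy as np

def to_unit_slices(R, noise=1e-2, rng=np.random.default_rng(0)):
    """Perturb arbitrary slices (rows of R) with Gaussian noise and re-normalise onto S^{D-1}."""
    R = R + noise * rng.standard_normal(R.shape)
    return R / np.linalg.norm(R, axis=1, keepdims=True)

def complete_to_orthogonal(g, rng=np.random.default_rng(0)):
    """Return an orthogonal matrix G_r whose last row is the unit vector g (the a_d = g_r of Proposition 1)."""
    D = g.shape[0]
    g = g / np.linalg.norm(g)
    M = np.column_stack([g, rng.standard_normal((D, D - 1))])
    Q, _ = np.linalg.qr(M)            # orthonormal columns; the first spans the same line as g
    Q[:, 0] = g                       # fix a possible sign flip from QR
    return np.roll(Q.T, -1, axis=0)   # orthonormal rows, with g moved to the last row

# rotating samples with Y = X @ complete_to_orthogonal(g).T puts x^T g_r in the last column of Y,
# which is the y_d coordinate appearing in Eq. (8)
```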
Intuitively, if $G_r$ is viewed as a rotation matrix, then $h^*_{rg_r}$ is the conditional expected score difference between the rotated $p$ and $q$. This form is very similar to the optimal test function for SD, which is simply the score difference between the original $p$ and $q$. Knowing the optimal form of $h^*_{rg_r}$, we can show that $SK_{rg_r}$ is well approximated by $S_{rg_r}$.

**Theorem 2** ($SK_{rg_r} \approx S_{rg_r}$). Assume assumptions 1-4 (density regularity) and 5 (richness of kernel). Given $r$ and $g_r$, for any $\epsilon > 0$ there exists a constant $C$ such that $0 \le S_{rg_r} - SK_{rg_r} < C\epsilon$.

As $S_{rg_r}$ approximates $SK_{rg_r}$ arbitrarily well, the hope is that good slices for $S_{rg_r}$ also correspond to good slices for $SK_{rg_r}$ in practice. Therefore, in the next section we focus on analyzing $S_{rg_r}$ instead, in order to propose an efficient algorithm for finding good slices.

## 4. Active slice direction g

Finding good slices involves alternating maximization over $r$ and $g_r$. To simplify the analysis, we focus for now on finding good directions $g_r$ given fixed $r$, e.g. $r \in O_r$ from the orthogonal basis. Finding good $g_r$ is achieved in two steps: (i) rewriting the maximization of $S_{g_r}$ w.r.t. $g_r$ as an equivalent minimization problem, called controlled approximation; (ii) establishing an upper bound on the controlled approximation objective such that its minimizer is analytic. This derivation is based on an important inequality, the Poincaré inequality, which upper-bounds the variance of a function by the magnitude of its gradient. We therefore need Assumptions 7-8 (appendix B) to ensure this inequality is valid. We call the resulting $g_r$ that minimizes the upper bound an active slice. All proofs can be found in appendix F.

### 4.1. Controlled approximation

To start, we need an upper bound on $S_{g_r}$ so that we can transform the maximization of $S_{g_r}$ into the minimization of their gap. Hence, we propose a generalization of SD (Eq. 2) called the projected Stein discrepancy (PSD):
$$\mathrm{PSD}(q, p; O_r) = \sum_{r\in O_r} \sup_{f_r\in\mathcal{F}_q} \mathbb{E}_q\big[ s^r_p(x) f_r(x) + r^\top \nabla_x f_r(x) \big], \tag{9}$$
where $f_r: \mathcal{X} \subseteq \mathbb{R}^D \to \mathbb{R}$. SD is a special case of PSD obtained by setting $O_r$ to the identity matrix $I$. In proposition 4 of appendix F.1, we show that if $\mathcal{F}_q$ contains all bounded continuous functions, then the optimal test function in PSD is
$$f^*_r(x) \propto s^r_p(x) - s^r_q(x). \tag{10}$$
It can also be shown that PSD is equivalent to the Fisher divergence, which has been extensively used in training energy-based models (Song et al., 2020; Song & Ermon, 2019) and fitting kernel exponential families (Sriperumbudur et al., 2017; Sutherland et al., 2018; Wenliang et al., 2019).

We now prove that PSD upper-bounds $S_{g_r}$, with a gap equal to the expected squared error between their optimal test functions $f^*_r$ and $h^*_{rg_r}$ (Proposition 1). Since PSD is constant w.r.t. $g_r$, maximizing $S_{g_r}$ is equivalent to a minimization task, called controlled approximation.

**Theorem 3** (Controlled approximation). Assume assumptions 1-4 (density regularity), and take the proportionality coefficients of the optimal test functions to be 1 w.l.o.g. Then $\mathrm{PSD} \ge S_{g_r}$ and
$$\mathrm{PSD} - S_{g_r} = \sum_{r\in O_r} \mathbb{E}_q\big[ \big( f^*_r(x) - h^*_{rg_r}(x^\top g_r) \big)^2 \big], \tag{11}$$
where $f^*_r$ and $h^*_{rg_r}$ are the optimal test functions for PSD and $S_{g_r}$ defined in Eq. (10) and Eq. (8), respectively.

Intuitively, minimizing the above gap can be regarded as a function approximation problem, where we want to approximate a multivariate function $f^*_r: \mathbb{R}^D \to \mathbb{R}$ by a univariate function $h^*_{rg_r}: \mathbb{R} \to \mathbb{R}$ with optimal parameters $g_r$.
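To make the PSD reference term in Theorem 3 concrete: plugging the optimal test function of Eq. (10) into Eq. (9) and using the Stein identity under $q$ gives $\mathrm{PSD}(q,p;O_r) = \sum_{r\in O_r} \mathbb{E}_q[(s^r_p(x) - s^r_q(x))^2]$ (with coefficient 1, as in Theorem 3). The sketch below is an illustrative Monte Carlo estimate of this quantity; the Gaussian example is an arbitrary choice, and it also shows numerically that PSD does not depend on the choice of orthogonal basis, a fact used again in Section 5.2.

```python
import numpy as np

def psd_estimate(X, score_p, score_q, O_r):
    """Monte Carlo estimate of PSD(q, p; O_r) = sum_r E_q[(s_p^r - s_q^r)^2]."""
    diff = score_p(X) - score_q(X)       # (n, D) score differences at samples from q
    proj = diff @ O_r.T                  # projections s_p^r - s_q^r for each row r of O_r
    return np.sum(np.mean(proj ** 2, axis=0))

# p = N(0, I), q = N(mu, I): the score difference is the constant -mu
D = 5
mu = np.full(D, 0.3)
X = np.random.randn(2000, D) + mu
sp = lambda X: -X
sq = lambda X: -(X - mu)
Q, _ = np.linalg.qr(np.random.randn(D, D))        # a random orthogonal basis
print(psd_estimate(X, sp, sq, np.eye(D)),
      psd_estimate(X, sp, sq, Q))                 # both ≈ ||mu||^2 = 0.45, independent of the basis
```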
### 4.2. Upper-bounding the error

Solving the controlled approximation task directly may be difficult in practice. Instead, we propose an upper bound on the approximation error whose minimizer $g_r$ is analytic. The inspiration comes from the active subspace method for dimensionality reduction (Constantine et al., 2014; Zahm et al., 2020); we therefore call the corresponding minimizers active slices.

**Theorem 4** (Error upper bound and active slices $g_r$). Assume assumptions 2, 4 (density regularity) and 7-8 (Poincaré inequality conditions). Then the inner part of the controlled approximation error (Eq. 11) is upper-bounded by
$$\mathbb{E}_q\Big[ \big( f^*_r(x) - h^*_{rg_r}(x^\top g_r) \big)^2 \Big] \le C_{\sup}\, \mathrm{tr}\big( G_{r\setminus d} H_r G_{r\setminus d}^\top \big), \qquad H_r = \int q(x)\, \nabla_x f^*_r(x)\, \nabla_x f^*_r(x)^\top \,dx. \tag{13}$$
Here $C_{\sup}$ is the Poincaré constant defined in assumption 8, and $G_{r\setminus d} \in \mathbb{R}^{(D-1)\times D}$ is an arbitrary orthogonal matrix $G_r$ with its $d$-th row $g_r$ removed. The orthogonal matrix has the form $G_r = [a_1, \ldots, a_D]^\top$ with $a_i \in S^{D-1}$ and $a_d = g_r$.

The above upper bound is minimized when the row space of $G_{r\setminus d}$ is the span of the first $D-1$ eigenvectors of $H_r$ (arranging the eigenvalues in ascending order). One possible choice for the active slice $g_r$ is therefore $v_D$, where $(\lambda_i, v_i)$ are the eigenpairs of $H_r$ and $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_D$. Intuitively, the active slice $g_r = v_D$ is the direction along which the test function $f^*_r$ varies the most. Indeed, we have $v_D^\top H_r v_D = \mathbb{E}_q[(\nabla_x f^*_r(x)^\top v_D)^2] = \lambda_D$, where the eigenvalue $\lambda_D$ measures the averaged gradient variation along the direction defined by $v_D$.
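Theorem 4 reduces the search for $g_r$ to an eigendecomposition of a Monte Carlo estimate of $H_r$. The sketch below assumes the gradient of the exact score difference is available in closed form (the +Ex setting used later as an ablation); in practice the paper first estimates the score difference and then differentiates the estimator. The Gaussian "diffusion"-style example is purely illustrative.

```python
import numpy as np

def active_slice_g(grad_fr, X):
    """Active slice g_r from Theorem 4: top eigenvector of H_r = E_q[grad f_r*(x) grad f_r*(x)^T],
    where grad_fr(X) returns the gradient of f_r*(x) = s_p^r(x) - s_q^r(x) at each sample, shape (n, D)."""
    Gf = grad_fr(X)
    H_r = Gf.T @ Gf / X.shape[0]              # Monte Carlo estimate of H_r
    eigval, eigvec = np.linalg.eigh(H_r)      # eigenvalues in ascending order
    return eigvec[:, -1], eigval              # g_r = eigenvector with the largest eigenvalue

# toy check: p = N(0, I), q = N(0, Sigma) where q differs from p only in the first coordinate
D = 5
Sigma = np.eye(D); Sigma[0, 0] = 0.3
Sigma_inv = np.linalg.inv(Sigma)
r = np.ones(D) / np.sqrt(D)                   # some fixed slice r
# f_r*(x) = r^T (Sigma^{-1} - I) x, so its gradient is the constant (Sigma^{-1} - I) r
grad_fr = lambda X: np.tile((Sigma_inv - np.eye(D)) @ r, (X.shape[0], 1))
X = np.random.multivariate_normal(np.zeros(D), Sigma, size=1000)
g_r, eigvals = active_slice_g(grad_fr, X)
print(g_r)    # concentrates (up to sign) on the first coordinate, where p and q actually differ
```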
## 5. Active slice direction r

The dependence of the active slice $g_r$ on $r$ motivates us to consider the possible choices of $r$. Although finitely many random slices $r$ are sufficient for obtaining a valid discrepancy, in practice using sub-optimal $r$ can result in weak discriminative power and poor active slices $g_r$. We address this issue by proposing an efficient algorithm to search for good $r$. Again, all proofs can be found in appendix G.

### 5.1. PSD maximization for searching r

Directly optimizing $S_{rg_r}$ w.r.t. $r$ is particularly difficult due to the alternating updates of $r$ and $g_r$. To simplify the analysis, we start from the task of finding a single direction $r$. Our key idea for sidestepping the alternation is based on the intuition that $S_{rg_r}$ with active slices $g_r$ should approximate $\mathrm{PSD}_r$ (PSD with a given $r$) well, by theorem 4. The independence of $\mathrm{PSD}_r$ from $g_r$ allows us to avoid the alternating updates, and the accuracy of the approximation justifies directly plugging the resulting active slices into $S_{rg_r}$. Indeed, we will prove that maximizing $\mathrm{PSD}_r$ is equivalent to maximizing a lower bound of $S_{rg_r}$.

Assume we have two slices $r_1$ and $r_2$, with given $g_{r_1}, g_{r_2}$. Then finding a good $r_1$ is equivalent to maximizing the difference $S_{r_1, g_{r_1}} - S_{r_2, g_{r_2}}$. The following proposition establishes a lower bound for this difference.

**Proposition 2** (Lower bound for the $S_{rg_r}$ gap). Assume the conditions of theorem 4 are satisfied. Then for any slices $r_1, r_2$ and $g_{r_1}, g_{r_2}$, we have
$$S_{r_1, g_{r_1}} - S_{r_2, g_{r_2}} \ge \mathrm{PSD}_{r_1} - \mathrm{PSD}_{r_2} - C_{\sup}\,\Omega, \tag{14}$$
where $C_{\sup}$ is the Poincaré constant defined in assumption 8 and $\Omega = \sum_{i=1}^D \omega_i$, with $\{\omega_i\}_{i=1}^D$ the eigenvalues of $\mathbb{E}_q[\nabla_x f^*(x)\, \nabla_x f^*(x)^\top]$ and $f^*(x) = s_p(x) - s_q(x)$.

Proposition 2 justifies the maximization of $\mathrm{PSD}_{r_1}$ w.r.t. $r_1$ as a valid surrogate. More importantly, this alternative objective admits an analytic maximizer for $r$, which is then used as the active slice direction:

**Theorem 5** (Active slice r). Assume assumptions 1-4 (density regularity). Then the maximum of $\mathrm{PSD}_r$ is achieved at $r = v_{\max}$:
$$\max_{r\in S^{D-1}} \mathbb{E}_q\big[ s^r_p(x) f^*_r(x) + r^\top \nabla_x f^*_r(x) \big] = \lambda_{\max},$$
where $(\lambda_{\max}, v_{\max})$ is the largest eigenpair of the matrix $S = \mathbb{E}_q\big[ f^*(x) f^*(x)^\top \big]$.

### 5.2. Constructing the orthogonal basis $O_r$

Under certain scenarios, e.g. model learning, we want to train the model to perform well in every direction instead of a particular one. Thus, a good orthogonal basis is preferred over a single active slice $r$. Here gradient-based optimization is ill-suited, as it breaks the orthogonality constraint. Proposition 2 is also less useful here, since PSD is invariant to the choice of $O_r$, i.e. $\mathrm{PSD}(q, p; O_{r_1}) = \mathrm{PSD}(q, p; O_{r_2})$ even when $O_{r_1} \ne O_{r_2}$. Inspired by the analysis of the single active $r$, we propose to use the eigendecomposition of $S$ to obtain a good orthogonal basis $O_r$. Theoretically, this operation also corresponds to a greedy algorithm, where step $i$ searches for the optimal direction $r_i$ that is orthogonal to the previously selected directions $\{r_j\}_{j<i}$.
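A minimal sketch of Theorem 5 and of the basis construction above, assuming access to the exact score difference (the +Ex setting of the experiments); the paper's practical variants replace this with estimated score differences (+GE, +KE). Keeping only the top-m rows corresponds to the pruning used later in Section 7.2.

```python
import numpy as np

def active_slice_r(score_diff, X, top_m=None):
    """Active slice directions r: eigenvectors of S = E_q[f*(x) f*(x)^T], f*(x) = s_p(x) - s_q(x),
    sorted by decreasing eigenvalue. Row 0 is the single active slice r of Theorem 5; all rows
    together form an orthogonal basis O_r as in Section 5.2."""
    F = score_diff(X)                          # (n, D) score differences at samples from q
    S_mat = F.T @ F / X.shape[0]               # Monte Carlo estimate of S
    eigval, eigvec = np.linalg.eigh(S_mat)     # ascending order
    order = np.argsort(eigval)[::-1]
    basis = eigvec[:, order].T                 # rows r_1, r_2, ...
    return (basis[:top_m] if top_m else basis), eigval[order]

# toy check: p = N(0, I), q = N(mu, I) with mu supported on the first two coordinates,
# so s_p(x) - s_q(x) = -mu is constant and the top direction aligns (up to sign) with mu
D = 10
mu = np.zeros(D); mu[:2] = 1.0
X = np.random.randn(500, D) + mu
O_r, lam = active_slice_r(lambda X: np.tile(-mu, (X.shape[0], 1)), X)
print(O_r[0], lam[:3])
```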
## 7. Experiments

In the GOF tests below, the decision of whether to reject the null hypothesis $H_0: p = q$ is made either with a p-value threshold or with a statistic threshold; refer to appendix D for more details.

### 7.1. Benchmark GOF tests

We demonstrate the improved test power (in terms of null rejection rates) and the significant speed-ups of the proposed active slice algorithm on 3 benchmark tasks, which have been extensively used for measuring GOF test performance (Jitkrittum et al., 2017; Huggins & Mackey, 2018; Chwialkowski et al., 2016; Gong et al., 2021). Here the test statistic is based on SKSD-g ($SK_{g_r}$) with fixed basis $O_r = I$. Two practical approaches are considered for computing the active slice $g_r$: (i) gradient estimation with the Stein gradient estimator (SKSD-g+GE), and (ii) gradient estimation with the kernel-smoothed estimator (KE), plus further gradient-based optimization (SKSD-g+KE+GO). For reference, we include a version of the algorithm with the exact score difference (SKSD-g+Ex) as an ablation for the gradient estimation approaches. In comparison, we include the following strong baselines: KSD with RBF kernel (Liu et al., 2016; Chwialkowski et al., 2016), maximum mean discrepancy with RBF kernel (MMD, Gretton et al., 2012), random feature Stein discrepancy with L1-IMQ kernel (L1-IMQ, Huggins & Mackey, 2018), and the current state-of-the-art max SKSD-g with randomly initialized $g_r$ followed by gradient optimization (SKSD-g+GO, Gong et al., 2021). For all methods requiring GO or active slices, we split the 1000 test samples from $q$ into 800 test and 200 training data points, and run GO or the active slice method on the training set. The 3 GOF test benchmarks, detailed in appendix H.1, are: (1) Laplace: $p(x) = \mathcal{N}(0, I)$, $q(x) = \prod_{d=1}^D \mathrm{Lap}(x_d \,|\, 0, 1/\sqrt{2})$; (2) Multivariate-t: $p(x) = \mathcal{N}(0, \frac{5}{3} I)$, $q(x)$ a fully factorized multivariate-t with 5 degrees of freedom, mean 0 and scale 1; (3) Diffusion: $p(x) = \mathcal{N}(0, I)$, $q(x) = \mathcal{N}(0, \Sigma_1)$, where in $q(x)$ the variance of the first dimension is 0.3 and the rest is the identity.

The upper panels of Figure 2 show the test power as the dimension $D$ increases. As expected, KSD and MMD with RBF kernels suffer from the curse of dimensionality. L1-IMQ performs relatively well on Laplace and multivariate-t but still fails on diffusion. Among the SKSD-based approaches, SKSD-g+GO with 1000 training epochs still exhibits decreasing test power on Laplace and multivariate-t, whereas SKSD-g+KE+GO with 50 training epochs has nearly optimal performance. SKSD-g+Ex and SKSD-g+GE achieve the true optimal rejection rate without any GO. Specifically, Table 2 shows that the active slice method achieves significant computational savings, with a 14x-80x speed-up over SKSD-g+GO.

*Figure 2. (Upper panels): Null rejection rate on the benchmark problems as the dimension varies; SKSD-g+Ex and SKSD-g+GE coincide at the optimal rejection rate. (Lower panels): Null rejection rate for different numbers of gradient optimization epochs.*

For the approaches that require gradient optimization, the lower panels of Figure 2 show the test power as the number of training epochs increases. SKSD-g+GO with random slice initialization requires a huge number of gradient updates to obtain reasonable test power, and 1000 epochs achieves the best balance between run-time and performance. On the other hand, SKSD-g+KE+GO with active slices achieves significant speed-ups with near-optimal test power using around 50 epochs on Laplace and multivariate-t. Remarkably, on the diffusion test, $g_r$ initialized by the active slices already achieves near-optimal results, so the later gradient refinements are not required.

### 7.2. RBM GOF test

Following Gong et al. (2021), we conduct a more complex GOF test using restricted Boltzmann machines (RBMs; Hinton & Salakhutdinov, 2006; Welling et al.). Here the $p$ distribution is an RBM:
$$p(x) = \frac{1}{Z} \sum_{h} \exp\Big( x^\top B h + b^\top x + c^\top h - \tfrac{1}{2}\|x\|^2 \Big),$$
where $x \in \mathbb{R}^D$ and $h \in \{-1, 1\}^{d_h}$ denotes the hidden variables. The $q$ distribution is also an RBM with the same $b, c$ parameters as $p$ but with the $B$ matrix perturbed by different levels of Gaussian noise. We use $D = 50$ and $d_h = 40$, and a block Gibbs sampler with 2000 burn-in steps. The test statistics for all approaches are computed on a test set containing 1000 samples from $q$. The test statistic is constructed using SKSD-rg ($SK_{rg_r}$) with $r, g_r$ obtained either by gradient-based optimization (SKSD-rg+GO) or by the active slice algorithms (+KE, +GE and +Ex) without gradient refinements. Specifically, SKSD-rg+GO runs 50 training epochs with $r$ and $g_r$ initialized to $I$. For the active slice methods, we also prune away most slices and keep only the top-3 most important $r$ slices.
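Assuming the RBM marginal written above (with $\pm 1$ hidden units), summing out $h$ gives $\log p(x) = b^\top x - \tfrac{1}{2}\|x\|^2 + \sum_j \log 2\cosh\big((B^\top x + c)_j\big) + \text{const}$, so the score needed by the Stein-based statistics has a closed form. The sketch below is illustrative; the parameter draws and the perturbation level are arbitrary choices, not the settings of the experiments.

```python
import numpy as np

def rbm_score(X, B, b, c):
    """Score grad_x log p(x) of the RBM marginal p(x) ∝ sum_h exp(x^T B h + b^T x + c^T h - 0.5||x||^2),
    h in {-1, +1}^{d_h}: after marginalising h, the score is b - x + B tanh(B^T x + c)."""
    return b - X + np.tanh(X @ B + c) @ B.T

# q: same b, c but a weight matrix perturbed by Gaussian noise, as in the test above
D, dh = 50, 40
rng = np.random.default_rng(0)
B = rng.standard_normal((D, dh)); b = rng.standard_normal(D); c = rng.standard_normal(dh)
B_q = B + 0.01 * rng.standard_normal((D, dh))     # illustrative perturbation level
print(rbm_score(rng.standard_normal((3, D)), B, b, c).shape)
```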
The left panel of Figure 3 shows that SKSD-rg+KE achieves the best null rejection rates among all baselines, except for SKSD-rg+Ex, whose performance is expected to upper-bound all other active slice methods. This shows the potential of our approach given an accurate score difference estimator. Although SKSD-rg+GO performs reasonably well, its run-time is 53x longer than that of SKSD-rg+KE, as shown in Table 4. Interestingly, SKSD-rg+GE performs worse than KSD due to significant under-estimation of the magnitude of $s_q(x)$; we therefore omit this approach in the following ablation studies.

**Ablation studies.** The first ablation study, with results shown in the middle panel of Figure 3, considers pruning the active slices at different pruning levels, where the horizontal axis indicates the number of $r$ slices used to construct the test statistic. We observe that the null rejection rates of the active slice methods peak at pruning level 3, indicating their ability to select the most important directions. Their performance decreases when more $r$ are considered since, in practice, those less important directions introduce extra noise into the test statistic. On the other hand, SKSD-rg+GO shows no pruning ability due to its sensitivity to slice initialization. Remarkably, the final performance of SKSD-rg+GO without pruning is still worse than SKSD-rg+KE with pruning, showing the importance of finding good directions rather than many average-quality ones. Another advantage of pruning is that it reduces the computational and memory costs from $O(MD)$ to $O(mD)$, where $m$ and $M$ are the numbers of pruned $r$ slices and slice initializations, respectively ($m \ll M$).

Table 2. Test power on the 100-dimensional benchmarks and time consumption. The run-time for SKSD-g+KE+GO includes both the active slice computation and the later gradient-based refinement steps. NRR stands for null rejection rate.

| Method | Laplace NRR | sec/trial | Speed-up | Multi-t NRR | sec/trial | Speed-up | Diffusion NRR | sec/trial | Speed-up |
|---|---|---|---|---|---|---|---|---|---|
| SKSD-g+Ex | 1 | 0.38 | 103x | 1 | 0.49 | 90x | 1 | 0.34 | 102x |
| SKSD-g+GO | 0.58 | 39.39 | 1x | 0.67 | 44.24 | 1x | 0.96 | 34.73 | 1x |
| SKSD-g+KE+GO | 0.99 | 2.72 | 14x | 0.97 | 2.38 | 19x | 1 | 0.43 | 81x |
| SKSD-g+GE | 1 | 0.66 | 60x | 1 | 0.67 | 66x | 1 | 0.78 | 44x |

*Figure 3. (Left): GOF test power of each method at different levels of noise perturbation. (Middle): The effect of the pruning level on the test power. (Right): The effect of the number of gradient-based optimization epochs on the test power; 3 and 50 indicate the pruning level, and KE+GO or Ex+GO denotes active slices with further gradient refinement steps.*

The second ablation study investigates the quality of the slices obtained either by gradient-based optimization or by the active slice approaches. Results are shown in the right panel of Figure 3, where the horizontal axis indicates the number of training epochs and the numbers in the legend (3 and 50) indicate the pruning level. We observe that the null rejection rate of SKSD-rg+KE+GO starts to improve only after 100 epochs, meaning that a short run of GO refinement is redundant given the good quality of the active slices. The performance decrease of SKSD-rg+Ex+GO is due to over-fitting of GO to the training set. The null rejection rate of SKSD-rg+GO gradually increases with more training epochs, as expected; however, even after 100 epochs its test power is still lower than that of active slices without any GO. In appendix H.2, another ablation study also shows the advantage of good $r$ compared to using random slices.

### 7.3. Model learning: ICA

We evaluate the performance of the active slice methods in model learning by training an independent component analysis (ICA) model, which has been extensively used to evaluate algorithms for training energy-based models (Gutmann & Hyvärinen, 2010; Hyvärinen & Dayan, 2005; Ceylan & Gutmann, 2018; Grathwohl et al., 2020). ICA follows a simple generative process: it first samples a $D$-dimensional random variable $z$ from a non-Gaussian $p_z$ (we use a multivariate-t), and then transforms $z$ to $x = Wz$ with a non-singular matrix $W \in \mathbb{R}^{D\times D}$. The log-likelihood is $\log p(x) = \log p_z(W^{-1}x) + C$, where $C$ can be ignored when training by minimizing Stein discrepancies.
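A sketch of this data-generating process and of the unnormalised log-likelihood; the specific multivariate-t parameterisation (5 degrees of freedom, identity scale) is an assumption made for illustration, and the constant $C = -\log|\det W|$ is dropped since it does not affect the score $\nabla_x \log p(x)$ used by Stein discrepancies.

```python
import numpy as np

def sample_ica(n, W, df=5, rng=np.random.default_rng(0)):
    """Generative process of the ICA model: z ~ multivariate-t_df(0, I), then x = W z."""
    D = W.shape[0]
    g = rng.standard_normal((n, D))
    u = rng.chisquare(df, size=(n, 1))
    Z = g / np.sqrt(u / df)                    # standard construction of multivariate-t samples
    return Z @ W.T

def ica_logp(X, W, df=5):
    """Unnormalised log p(x) = log p_z(W^{-1} x) + const, with the multivariate-t base density."""
    Z = X @ np.linalg.inv(W).T                 # recover z = W^{-1} x
    D = W.shape[0]
    return -0.5 * (df + D) * np.log1p(np.sum(Z ** 2, axis=1) / df)

# illustrative usage
W = np.random.default_rng(1).standard_normal((10, 10))
X = sample_ica(1000, W)
print(ica_logp(X[:5], W))
```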
Following Grathwohl et al. (2020) and Gong et al. (2021), we sample 20000 training and 5000 test datapoints from a randomly initialized ICA model. The baselines considered include KSD, SKSD-g+GO, SKSD-rg+GO and the state-of-the-art learned Stein discrepancy (LSD) (Grathwohl et al., 2020), where the test function is parametrized by a neural network. For the active slice approaches, one optimization epoch includes the following two steps: (i) finding active slices for both the orthogonal basis $O_r$ and $g_r$ at the beginning of the epoch, and (ii) refining the $g_r$ directions and the $W$ parameters in an adversarial manner with $O_r$ fixed. For SKSD-g+GO, we fix the basis $O_r = I$ and only update $g_r$ with GO. We refer to appendix H.3 for details on the setup and training procedure.

We see from Figure 4 that SKSD-g+KE+GO converges significantly faster at 150 dimensions than all baselines; moreover, it reaches a much better test NLL (Table 3). We argue that this performance gain is due to the better orthogonal basis $O_r$ found by the greedy algorithm, showing the advantage of a better $O_r$ in model learning. On the other hand, the importance of orthogonality in $O_r$ is indicated by the poor performance of SKSD-rg+GO, as gradient updates for $r$ violate the orthogonality constraint. The goal of learning is to train the model to match the data distribution along every slicing direction, and the orthogonality constraint helps prevent the model from ignoring important slicing directions.

Table 3. Test NLL of the ICA model at different dimensions.

| Dimensions | SKSD-g+KE+GO | SKSD-g+Ex+GO | SKSD-g+GO | SKSD-rg+GO | LSD | KSD |
|---|---|---|---|---|---|---|
| 10 | 7.93 ± 0.31 | 7.95 ± 0.31 | 8.06 ± 0.33 | 10.03 ± 0.61 | 7.42 ± 0.31 | 7.82 ± 0.31 |
| 80 | 7.88 ± 0.77 | 15.17 ± 0.97 | 19.03 ± 1.06 | 62.53 ± 0.92 | 6.26 ± 1.49 | 80.75 ± 1.22 |
| 100 | 6.93 ± 1.36 | 21.50 ± 1.41 | 22.22 ± 1.08 | 75.28 ± 1.63 | 17.55 ± 1.60 | 110.78 ± 1.19 |
| 150 | 11.67 ± 2.46 | 27.37 ± 3.04 | 21.63 ± 3.27 | 107.25 ± 1.93 | 32.15 ± 3.75 | 180.47 ± 1.91 |

Table 4. Test power and time consumption at the 0.01 perturbation level.

| Method | Test Power | Opt. Time | Speed-up |
|---|---|---|---|
| SKSD-rg+Ex | 0.95 | 0.04s | 254x |
| SKSD-rg+KE | 0.67 | 0.19s | 53x |
| SKSD-rg+GO | 0.45 | 10.15s | 1x |

*Figure 4. Training curves of the ICA model at 150 dimensions; the y-axis indicates the test NLL.*

Interestingly, SKSD-g+Ex+GO performs worse than +KE+GO. We hypothesize that this is because the +Ex+GO approach often focuses on directions with large discriminative power but less useful learning signal (see appendix H.3). LSD performs well on low-dimensional problems; however, in high-dimensional learning tasks it spends too much time on finding good test functions, which slows down convergence significantly.

## 8. Related Work

**Active subspace method (ASM):** ASM was initially proposed as a dimensionality reduction method that constructs a subspace with low-rank projectors (Constantine et al., 2014) according to the subspace Poincaré inequality. Zahm et al. (2020) showed promising results when applying ASM to approximating multivariate functions with lower-dimensional ones. However, they only considered the subspace Poincaré inequality under Gaussian measures; a generalization to a broader family of measures was proposed by Parente et al. (2020). Another closely related approach instead uses the logarithmic Sobolev inequality to construct the active subspace (Zahm et al., 2018), which can be interpreted as finding the optimal subspace to minimize a KL divergence. It has shown success in Bayesian inverse problems and particle inference (Chen et al., 2019).
However, as the ASM method is based on the eigendecomposition of the sensitivity matrix, there is a potential limitation when the sensitivity matrix is estimated by Monte Carlo methods. We prove this limitation in appendix I.

**Sliced discrepancies:** Existing sliced discrepancies can be roughly divided into two groups. Most of them belong to the first group, which uses the slicing idea to improve computational efficiency. For example, the sliced Wasserstein distance projects distributions onto one-dimensional slices so that the corresponding distance has an analytic form (Kolouri et al., 2019; Deshpande et al., 2019), and sliced score matching uses Hutchinson's trick to avoid the expensive computation of the Hessian matrix (Song et al., 2020). The second group focuses on the curse-of-dimensionality issue, which remains to be addressed. To the best of our knowledge, existing integral probability metrics in this category include SSD (Gong et al., 2021) and the kernelized complete conditional Stein discrepancy (KCC-SD, Singhal et al., 2019). The former is more general and requires less restrictive assumptions, while the latter requires samples from complete conditional distributions. Recent work has also investigated the statistical properties of sliced discrepancies (Nadjahi et al., 2020).

## 9. Conclusion

We have proposed the active slice method as a practical solution for finding good slices for SKSD. We first proved that the validity of the kernelized discrepancy only requires a finite number of random slices rather than optimal ones, giving us huge freedom to select slice directions. Then, by analyzing the approximation quality of SSD to SKSD, we proposed to find active slices by optimizing surrogate optimization tasks. Experiments on high-dimensional GOF tests and ICA training showed that the active slice method performed best across a number of competitive baselines in terms of both test performance and run-time. Future research directions include better score difference estimation methods, non-linear generalizations of the slice projections, and the application of the active slice method to other discrepancies.

## References

Arcones, M. A. and Gine, E. On the bootstrap of U and V statistics. The Annals of Statistics, pp. 655-674, 1992.

Ceylan, C. and Gutmann, M. U. Conditional noise-contrastive estimation of unnormalised models. In International Conference on Machine Learning, pp. 726-734. PMLR, 2018.

Chen, P., Wu, K., Chen, J., O'Leary-Roseberry, T., and Ghattas, O. Projected Stein variational Newton: a fast and scalable Bayesian inference method in high dimensions. arXiv preprint arXiv:1901.08659, 2019.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings, 2016.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.

Constantine, P. G., Dow, E., and Wang, Q. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500-A1524, 2014.

Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648-10656, 2019.

Gong, W., Li, Y., and Hernández-Lobato, J. M. Sliced kernelized Stein discrepancy. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=t0TaKv0Gx6Z.
Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226-234, 2015.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning, pp. 1292-1301. PMLR, 2017.

Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., and Zemel, R. Cutting out the middle-man: training and evaluating energy-based models without sampling. arXiv preprint arXiv:2002.05616, 2020.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304. JMLR Workshop and Conference Proceedings, 2010.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hoeffding, W. A class of statistics with asymptotically normal distribution. In Breakthroughs in Statistics, pp. 308-334. Springer, 1992.

Hu, T., Chen, Z., Sun, H., Bai, J., Ye, M., and Cheng, G. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.

Huggins, J. and Mackey, L. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pp. 1899-1909, 2018.

Huskova, M. and Janssen, P. Consistency of the generalized bootstrap for degenerate U-statistics. The Annals of Statistics, pp. 1811-1823, 1993.

Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 262-271, 2017.

Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. K. Generalized sliced Wasserstein distances. arXiv preprint arXiv:1902.00434, 2019.

Li, Y. and Turner, R. E. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: a general purpose Bayesian inference algorithm. arXiv preprint arXiv:1608.04471, 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276-284, 2016.

Mityagin, B. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.

Nadjahi, K., Durmus, A., Chizat, L., Kolouri, S., Shahrampour, S., and Şimşekli, U. Statistical and topological properties of sliced probability divergences. arXiv preprint arXiv:2003.05783, 2020.

Parente, M. T., Wallin, J., Wohlmuth, B., et al. Generalized bounds for active subspaces. Electronic Journal of Statistics, 14(1):917-943, 2020.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. arXiv preprint arXiv:1704.05155, 2017.

Saibaba, A. K., Hart, J., and van Bloemen Waanders, B. Randomized algorithms for generalized singular value decomposition with application to sensitivity analysis. Numerical Linear Algebra with Applications, pp. e2364, 2021.
Sameh, A. and Tong, Z. The trace minimization method for the symmetric generalized eigenvalue problem. Journal of Computational and Applied Mathematics, 123(1-2):155-175, 2000.

Serfling, R. J. Approximation Theorems of Mathematical Statistics, volume 162. John Wiley & Sons, 2009.

Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925, 2018.

Singhal, R., Han, X., Lahlou, S., and Ranganath, R. Kernelized complete conditional Stein discrepancy. arXiv preprint arXiv:1904.04478, 2019.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918-11930, 2019.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: a scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574-584. PMLR, 2020.

Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830-1888, 2017.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7), 2011.

Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with Nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pp. 652-660. PMLR, 2018.

Welling, M., Rosen-Zvi, M., and Hinton, G. E. Exponential family harmoniums with an application to information retrieval. Citeseer.

Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, pp. 6737-6746. PMLR, 2019.

Yu, Y., Wang, T., and Samworth, R. J. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.

Zahm, O., Cui, T., Law, K., Spantini, A., and Marzouk, Y. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.

Zahm, O., Constantine, P. G., Prieur, C., and Marzouk, Y. M. Gradient-based dimension reduction of multivariate vector-valued functions. SIAM Journal on Scientific Computing, 42(1):A534-A558, 2020.

Zhou, Y., Shi, J., and Zhu, J. Nonparametric score estimators. arXiv preprint arXiv:2005.10099, 2020.