# Kernelized Cumulants: Beyond Kernel Mean Embeddings

Patric Bonnier¹, Harald Oberhauser¹, Zoltán Szabó² — ¹Mathematical Institute, University of Oxford; ²Department of Statistics, London School of Economics. {bonnier,oberhauser}@maths.ox.ac.uk, z.szabo@lse.ac.uk

In $\mathbb{R}^d$, it is well known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower-variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.

Keywords: kernel, cumulant, mean embedding, Hilbert-Schmidt independence criterion, kernel Lancaster interaction, kernel Streitberg interaction, maximum mean discrepancy, maximum variance discrepancy

1 Introduction

The moments of a random variable are arguably the most popular all-purpose statistic. However, cumulants are often more favorable statistics than moments. For example, if $\mu_m := \mathbb{E}[X^m]$ denotes the moments of a real-valued random variable $X$, then $\mu_2 = \mu_1^2 + \mathrm{Var}(X)$, and hence the variance, which directly measures the fluctuation around the mean, is a much better statistic for scale than the second moment $\mu_2$; see Appendix A. Cumulants provide a systematic way to only record the parts of the moment sequence that are not already captured by lower-order moments. While the moment and cumulant sequences $(\mu_m)_m$ and $(\kappa_m)_m$ carry the same information, cumulants have several desirable properties that generalize to $\mathbb{R}^d$-valued random variables (McCullagh, 2018). Among these properties of cumulants, the ones that are important for our paper are that they can characterize distributions and statistical (in)dependence.

Kernel embeddings. Mean and covariance arise naturally in the context of kernel-enriched domains. Kernel techniques (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008; Saitoh and Sawano, 2016) provide a principled and powerful approach for lifting data points to a so-called reproducing kernel Hilbert space (RKHS; Aronszajn 1950). Considering the mean of this feature, referred to as the kernel mean embedding (KME; Berlinet and Thomas-Agnan 2004; Smola et al. 2007), also enables one to represent probability measures, and to induce a semi-metric referred to as the maximum mean discrepancy (MMD; Smola et al. 2007; Gretton et al. 2012) and an independence measure called the Hilbert-Schmidt independence criterion (HSIC¹; Gretton et al. 2005). It is known that MMD is a metric when the underlying kernel is characteristic (Fukumizu et al., 2008; Sriperumbudur et al., 2010). HSIC captures independence for d = 2 components with characteristic kernels (Lyons, 2013, Theorem 3.11) and for d > 2 components (Quadrianto et al., 2009; Sejdinovic et al., 2013a; Pfister et al., 2018) with universal ones (Szabó and Sriperumbudur, 2018).
MMD belongs to the family of integral probability metrics (IPM; Zolotarev 1983; Müller 1997) when in the IPM the underlying function class is chosen to be the unit ball of the RKHS. MMD and HSIC (with d = 2) are known to be equivalent (Sejdinovic et al., 2013b) to the notions of energy distance (Baringhaus and Franz, 2004; Székely and Rizzo, 2004, 2005) also called N-distance (Zinger et al., 1992; Klebanov, 2005) and distance covariance (Székely et al., 2007; Székely and Rizzo, 2009; Lyons, 2013) of the statistics literature. Both MMD and HSIC can be expressed in terms of expectations of kernel values which can be leveraged to design efficient estimators for them; we will refer to this trick as the expected kernel trick. A recent survey on mean embedding and their applications is given by Muandet et al. (2017). The closest previous work to ours is by Makigusa (2020) who considers the variance in the RKHS which in our setting can be identified as the first kernelized cumulant after the kernel mean embedding for two-sample testing like we do in parts of this paper. Unfortunately, Makigusa (2020) does not provide conditions on the validity of the resulting maximum variance discrepancy, and it is not formulated in the context of cumulant embeddings. Contribution. The main contribution of our paper is to introduce cumulants of random variables in RKHSs and to show that under mild conditions the proposed kernelized cumulants characterize distributions (Theorem 2) and independence (Theorem 3). Thanks to the RKHS formulation, kernelized cumulants have computable estimators (Lemma 2 and Lemma 3) and they show strong performance in two-sample and independence testing on various benchmarks (Section 4). Although cumulants are a classic tool in multi-variate statistics, they have not received attention in the kernel literature. The primary technical challenge to circumvent in the derivation of fundamental properties of cumulants is the rich combinatorial structure which already arises in Rd from their definition via moment-generating function which is closely linked to the partition lattice (Speed, 1983, 1984). In an RKHS, even the definition of cumulants is non-straightforward. The key insight for our extension is that the combinatorial expressions for cumulants in Rd can be generalized by using tools from tensor algebras. This in turn allows us to derive the main properties of the RKHS cumulants that underpin their statistical properties. Broader impact & limitations. We do not see any direct negative societal impact arising from the proposed new set of all-purpose kernel-based divergence and dependence measure. Choosing the underlying kernels in an optimal fashion even for MMD in two-sample testing (Hagrass et al., 2022) or goodness-of-fit testing (Hagrass et al., 2023) and showing optimal rates even for MMD with radial kernels on Rd (Tolstikhin et al., 2016) are quite challenging problems requiring dedicated analysis, and are not addressed here. Outline. The paper is structured as follows: In Section 2 we formulate the notion of cumulants of random variables in Hilbert spaces. In Section 3 we prove a kernelized version of the classical result of Rd-valued random variables on the characterization of distributions and independence using cumulants. We show that one can leverage the expected kernel trick to derive efficient estimators for our novel statistics, with MMD and HSIC arising specifically as the degree 1 objects. 
In Section 4 we demonstrate numerically that going beyond degree 1 is advantageous. We provide a technical background (on cumulants, tensor products and tensor algebras), proofs, further details on numerical experiments, and our V-statistic based estimators in the Appendices. 2 Moments and cumulants We briefly revisit classical cumulants and define cumulants of random variables in Hilbert spaces. 1HSIC is MMD with the tensor product kernel evaluated on the joint distribution and the product of the marginals, or equivalently HSIC equals to the Hilbert-Schmidt norm of the cross-covariance operator. Moments. Let γ be a probability measure on Rd and (X1, . . . , Xd) γ. The moments µ(γ) = (µi(γ))i Nd of γ are defined as µi(γ) := E Xi1 1 Xid d R, (1) where i = (i1, . . . , id) Nd denotes a d-tuple of non-negative integers (i1, . . . , id 0). The degree of an element i Nd is defined as deg(i):= i1 + +id. For m N, let µm(γ) := (µi(γ))deg(i)=m which we refer to as the m-th moments of γ with the convention that µ0(γ) = 1. Cumulants. Cumulants κ(γ) = (κi(γ))i Nd can be defined by the moment generating function as i Nd κi(γ)θi i Nd µi(γ)θi i! , θ = (θ1, . . . , θd) Rd, (2) where we denote i! = i1! id! and θi = θi1 1 θid d ; an equivalent definition of cumulants is via a combinatorial expression of partitions (elaborated in Appendix C.1). Cumulants have several attractive properties, the following forms our main motivation. Theorem 1 (Characterization of distributions with cumulants on Rd, from Proposition 1 in Jammalamadaka et al. 2006). Let γ be a probability measure on a bounded subset of Rd with cumulants κ(γ) and let (X1, . . . , Xd) γ. Then 1. γ 7 κ(γ) is injective. 2. X1, . . . , Xd are jointly independent if and only if κi(γ) = 0 for all d-tuples of positive integers i Nd +. 2.1 Moments in Hilbert spaces Instead of directly considering the law of a tuple of random variables (X1, . . . , Xd) in a product space X1 Xd, it can be advantageous to use feature maps Φi : Xi Hi and instead study the distribution of the H1 Hd-valued random variable Φ1(X1), . . . , Φd(Xd) . Motivated by this lifting, we study here moments of Hilbert-space valued random variables and assume in this subsection (with a slight abuse of notations) that one has already applied the lifting and Xi Hi where i = 1, . . . , d. In Section 3 we specialize the construction to RKHSs, and use these moments (Def. 1) to define kernelized cumulants. Moments. In the finite-dimensional case (1) we defined the moment sequence by taking expectations of products of the coordinates of the underlying random variable. For the infinite-dimensional case, it is convenient to develop a coordinate-free definition which can be accomplished by using tensors. To do so we make use of the following results about Hilbert spaces: for real Hilbert spaces H1 and H2 the tensor product H1 H2 is the Hilbert space given by completion of the tensor product of H1 and H2 as vector space; we also write H m 1 : = H1 H1 | {z } m-times . Similarly, the direct sum H1 H2 is a Hilbert space. It is natural to consider E X m 1 H m 1 as the m-th moment of a H1-valued random variable X1 where the integral in the expectation is meant in Bochner sense. Consequently the natural state space for all moments of a H1-valued random variable is the tensor algebra T1 := Q m 0 H m 1 where by convention H 0 1 := R. See Appendix B for more details on tensor products of Hilbert spaces and tensor algebras. Example 2.1 (H1 = Rd, m = 2). If X1 = X1 1, . . . 
, Xd 1 is H1 = Rd-valued then E X 2 1 (Rd) 2 can be identified with a (d d)-sized matrix whose (i, j)-th entry is E h Xi 1Xj 1 i . Since we are interested in the general case of a H1 Hd-valued random variable X = (X1, . . . , Xd) we arrive at the definition below. Definition 1 (Moments in Hilbert spaces). Let γ be a probability measure on H:= H1 Hd and let (X1, . . . , Xd) γ. We define µi(γ) := E[X i1 1 X id d ] H i, H i := H i1 1 H id d (3) for every i Nd whenever the above expectation exists. The moment sequence is defined as the element µ(γ) = (µi(γ))i Nd T := T1 Td, with Tj := Y m 0 H m j , and for m N we refer to µm(γ) = L i Nd:deg(i)=m µi(γ) as the m-moments of γ. In case of Hi = R, both definitions (1) and (3) apply for µi(γ). Henceforth, we always refer to (3) when we write µi(γ). Even in the finite-dimensional case, Def. 1 is useful, for instance when X1 H1 and X2 H2 have different state space (H1 = H2). 3 Kernelized cumulants We lift a random variable X= (X1, . . . , Xd) X = X1 Xd via a feature map Φ : X H into a Hilbert space valued random variable Φ(X). For the rest of the paper (i) X1, . . . , Xd will denote a collection of Polish spaces, but the reader is invited to think of them as finite-dimensional Euclidean spaces, (ii) H is an RKHS with kernel k and canonical feature map Φ(x) = k(x, ),2 and (iii) all kernels are assumed to be bounded.3 Our main results (Theorem 2 and Theorem 3) are that in this case the expected kernel trick applies to both items in the kernelized version of Theorem 1. The key to these results is an expression for inner products of cumulants in RKHSs (Lemma 1). A combinatorial expression of cumulants. Classical cumulants can be defined via the moment generating function or via combinatorial sums over partitions (Appendix C.1). To generalize cumulants to RKHSs the combinatorial definition is the most efficient way. A partition π of m elements is a family of non-empty, disjoint subsets π1, . . . , πb of {1, . . . , m} whose union is the whole set; formally Sb j=1 πj = {1, . . . , m} and πi πj = for i = j. We call b the number of blocks of the partition π and use the shorthand |π| to denote it. The set of all partitions of m is denoted with P(m). To formulate our main results, it is convenient to associate with a measure γ and a partition π the so-called partition measure γπ that is given by permuting the marginals of γ. Definition 2 (Partition measure). Let γ be a probability measure on X1 Xd and π P(d). Define γπ := γ|Xπ1 γ|Xπb , where Xπi denotes the product space Q j πi Xj and γ|Xπi is the corresponding marginal distribution of γ. We call γπ the partition measure induced by π. We also associate with γ and a multi-index i the so-called diagonal measure γi that is given by repeating marginals according to i. Definition 3 (Diagonal measure). Let γ be a probability measure on X1 Xd and i = (i1, . . . , id) Nd. Define γi := Law(X1, . . . , X1 | {z } i1 times , X2, . . . , X2 | {z } i2 times , . . . , Xd, . . . , Xd | {z } id times ), where (X1, . . . , Xd) γ. We call γi the diagonal measure induced by i. In general, the partition measure γπ and the diagonal measure are not probability measures on X1 Xd but on spaces that are constructed by permuting or repeating X1, . . . , Xd. Formally, γπ is a probability measure on Xπ1 Xπb and γi is a probability measure on X i1 1 X id d ; thus, γπ has d coordinates and γi has deg(i) coordinates. These two constructions can be combined, writing γi π for the measure (γi)π which makes sense whenever π P(deg(i)). 
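The partition lattice $P(m)$ and the coefficients $c_\pi = (-1)^{|\pi|-1}(|\pi|-1)!$ that appear in Definition 4 below (and in Appendix C.1) drive everything that follows, so a small self-contained sketch may help fix ideas. The snippet below is our own illustration, not code from the paper's repository (the helper names `partitions` and `c` are ours): it enumerates set partitions, reproduces the Bell numbers quoted later in the remark on computational complexity, and checks the classical $d = 1$ identity $\kappa_m = \sum_{\pi \in P(m)} c_\pi \prod_{B \in \pi} \mu_{|B|}$ against empirical moments.

```python
from math import factorial
import numpy as np

def partitions(elements):
    """Recursively enumerate all set partitions of a list of distinct elements."""
    if len(elements) == 1:
        yield [elements]
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        # put `first` into each existing block ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + smaller

def c(pi):
    """Coefficient c_pi = (-1)^(|pi|-1) * (|pi|-1)!  (cf. Definition 4)."""
    b = len(pi)
    return (-1) ** (b - 1) * factorial(b - 1)

# Bell numbers |P(m)| quoted in the paper: B_2 = 2, B_3 = 5, B_4 = 15, B_5 = 52
for m, bell in [(2, 2), (3, 5), (4, 15), (5, 52)]:
    assert sum(1 for _ in partitions(list(range(m)))) == bell

# Classical (d = 1) sanity check: kappa_m = sum_pi c_pi * prod_{B in pi} mu_{|B|}
rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)
mu = {m: np.mean(x ** m) for m in range(1, 4)}

def kappa(m):
    return sum(c(pi) * np.prod([mu[len(B)] for B in pi]) for pi in partitions(list(range(m))))

print(kappa(2), np.var(x))                     # both approximate Var(X) = kappa_2
print(kappa(3), np.mean((x - x.mean()) ** 3))  # both approximate the third cumulant
```

For $m = 2$ and $m = 3$ the partition sum reduces to the familiar formulas $\kappa_2 = \mu_2 - \mu_1^2$ and $\kappa_3 = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3$; the kernelized construction of the next section replaces the products of moments by tensor products of Hilbert-space moments.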
We can now write down our generalization of cumulants. 2k(x, ) denotes the function x 7 k(x, x ) with x X fixed. 3A kernel k : X X R is called bounded if there exists B R such that supx,x X k(x, x ) B. Definition 4 (Kernelized cumulants). Let γ be a probability measure on X1 Xd and let (H1, k1), . . . , (Hd, kd) be RKHSs on X1, . . . , Xd respectively. We define the kernelized cumulants κk1,...,kd(γ) := κi k1,...,kd(γ) κi k1,...,kd(γ) := X π P (m) cπEγiπk i((X1, . . . , Xm), ), where m = deg(i), cπ := ( 1)|π| 1(|π| 1)!, γi π = (γi)π and k i((x1, . . . , xm), (y1, . . . , ym)):= k1(x1, y1) k1(xi1, yi1) (4) kd(xm id+1, ym id+1) kd(xm, ym) is the reproducing kernel of H i where H = H1 Hd. Def. 4 is the natural generalization of the combinatorial definition of cumulants in Rd and Appendix C.2 gives an equivalent definition via a generating function analogous to (2). However, our posthoc justification that these are the "right" definitions for cumulants in an RKHS are Theorems 2 and 3 that show that these kernelized cumulants have the same powerful properties as classic cumulants in Rd (Theorem 1). Example 3.1 (Kernelized cumulants). Let γ be a probability measure on X1 X2, with the RKHSs (H1, k1), (H2, k2) given. Denote the random variables K1 = k1(X1, ), K2 = k2(X2, ) where (X1, X2) γ. Then the degree two kernelized cumulants are given as κ(2,0) k1,k2(γ) = E K 2 1 E [K1] 2, κ(1,1) k1,k2(γ) = E [K1 K2] E [K1] E [K2] , κ(0,2) k1,k2(γ) = E K 2 2 E [K2] 2. Inner products of cumulants. Computing inner products of moments is straightforward thanks to a nonlinear kernel trick, see Lemma 6 in the Appendix. For example, given two probability measures γ1, γ2 with corresponding random variables (X1, . . . , Xd) γ1, (Y1, . . . , Yd) γ2 on X1 Xd and RKHSs (H1, k1), . . . , (Hd, kd) on X1, . . . , Xd with bounded kernels, and H = H1 Hd, we can express: µi k1,...,kd(γ1), µi k1,...,kd(γ2) H i = Eγ1 γ2k1(X1, Y1)i1 kd(Xd, Yd)id, (5) where µi k1, ,kd is defined in Def. 1, and the expectation is taken over the product measure γ1 γ2. Example 3.2. In the particular case of d = 1, (5) reduces to the well-known formula for the inner product of mean embeddings µ(1) k (γ1), µ(1) k (γ2) Hk = Eγ1 γ2k(X, Y ). Lemma 1 (Inner product of cumulants). Let (H1, k1), . . . , (Hd, kd) be RKHSs with bounded kernels on X1, . . . , Xd respectively, and let γ and η two probability measures on X1 Xd, i = (i1, . . . , id) Nd such that deg(i) = m. Then κi k1,...,kd(γ), κi k1,...,kd(η) H i = X π,τ P (m) cπcτEγi π ηi τ k i((X1, . . . , Xm), (Y1, . . . , Ym)). Point separating kernels. In the classic MMD setting the injectivity of the mean embedding γ 7 EX γ[k(X, )] on probability measures (known as the characteristic property of the kernel k) is equivalent to the MMD being a metric; this property is central in applications. We formulate our theoretical results in the next section using the much weaker property of what we term pointseparating which is satisfied for essentially all popular kernels. Definition 5 (Point-separating kernel). We call a kernel k : X X R point-separating if the canonical feature map Φ : x 7 k(x, ) is injective. 3.1 (Semi-)metrics for probability measures In this section we use cumulants to characterize probability measures and show how to compute the distance between kernelized cumulants with the expected kernel trick. Theorem 2 (Characterization of distributions with cumulants). Let γ and η be two probability measures on X1 Xd, (H1, k1), . . . , (Hd, kd) RKHSs on the Polish spaces X1, . . . 
, $X_d$ such that for every $1 \le j \le d$ the kernel $k_j$ is bounded, continuous, and point-separating. Then $\gamma = \eta$ if and only if $\kappa_{k_1,\ldots,k_d}(\gamma) = \kappa_{k_1,\ldots,k_d}(\eta)$. Moreover, the expected kernel trick applies: for $i \in \mathbb{N}^d$ with $\deg(i) = m$, and $k^{\otimes i}$ and $H^{\otimes i}$ as in (4),
$$d_i(\gamma,\eta) := \big\|\kappa^i_{k_1,\ldots,k_d}(\gamma) - \kappa^i_{k_1,\ldots,k_d}(\eta)\big\|^2_{H^{\otimes i}} \tag{6}$$
$$= \sum_{\pi,\tau \in P(m)} c_\pi c_\tau \Big[ \mathbb{E}_{\gamma^i_\pi \otimes \gamma^i_\tau}\, k^{\otimes i}\big((X_1,\ldots,X_m),(Y_1,\ldots,Y_m)\big) + \mathbb{E}_{\eta^i_\pi \otimes \eta^i_\tau}\, k^{\otimes i}\big((X_1,\ldots,X_m),(Y_1,\ldots,Y_m)\big) - 2\,\mathbb{E}_{\gamma^i_\pi \otimes \eta^i_\tau}\, k^{\otimes i}\big((X_1,\ldots,X_m),(Y_1,\ldots,Y_m)\big) \Big].$$

We recall Example 3.1 and now give examples of distances between such expressions.

Example 3.3 ($m = 1$). Applied with $m = 1$ and $d = 1$, (6) becomes
$$\mathrm{MMD}^2_k(\gamma,\eta) = \big\|\kappa^{(1)}_k(\gamma) - \kappa^{(1)}_k(\eta)\big\|^2_{H_k} = \mathbb{E}\,k(X, X') + \mathbb{E}\,k(Y, Y') - 2\,\mathbb{E}\,k(X, Y),$$
where $X, X'$ denote independent copies distributed according to $\gamma$ and $Y, Y'$ denote independent copies distributed according to $\eta$.

Example 3.4 ($m = 2$). For $m = 2$ and $d = 1$, (6) reduces to
$$\big\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\big\|^2_{H^{(1,1)}} = \mathbb{E}\,k(X, X')k(X'', X''') + \mathbb{E}\,k(Y, Y')k(Y'', Y''') + \mathbb{E}\,k(X, X')^2 + \mathbb{E}\,k(Y, Y')^2 + 2\,\mathbb{E}\,k(X, Y)k(X', Y) + 2\,\mathbb{E}\,k(X, Y)k(X, Y') - 2\,\mathbb{E}\,k(X, Y)k(X', Y') - 2\,\mathbb{E}\,k(X, Y)^2 - 2\,\mathbb{E}\,k(X, X')k(X, X'') - 2\,\mathbb{E}\,k(Y, Y')k(Y, Y''),$$
where $X, X', X'', X'''$ denote independent copies distributed according to $\gamma$ and $Y, Y', Y'', Y'''$ denote independent copies distributed according to $\eta$. This expression compares the variances in the RKHS instead of the means. This is an example of the kernel variance embedding defined in the next subsection.

The price for the weak assumption of a point-separating kernel is that without any stronger assumptions one does not get a metric in general, and the all-purpose way to achieve a metric is to take an infinite sum over all the $d_i$'s. If we only use the degree $m = 1$ term, $d_i$ reduces to the well-known MMD formula, which requires characteristicness to become a metric (see Example 3.3). There are two reasons why working under weaker assumptions is useful: firstly, if the underlying kernel is not characteristic, this sum gives a structured way to incorporate finer information that discriminates the two distributions; an extreme case is the linear kernel $k(x, y) = \langle x, y \rangle$, which is point-separating, and in this case the sum reduces to the differences of classical cumulants. Secondly, under the stronger assumption of characteristicness one already has a metric after truncation at degree $m = 1$ (the classical MMD). However, in the finite-sample case adding higher-degree terms can lead to increased power. Indeed, our experiments (Section 4) show that even just going one degree further (i.e., taking $m = 2$) can lead to more powerful tests.

3.2 A characterization of independence

Here we characterize independence in terms of kernelized cumulants.

Theorem 3 (Characterization of independence with cumulants). Let $\gamma$ be a probability measure on $X_1 \times \cdots \times X_d$, and $(H_1, k_1), \ldots, (H_d, k_d)$ RKHSs on the Polish spaces $X_1, \ldots, X_d$ such that for every $1 \le j \le d$ the kernel $k_j$ is bounded, continuous, and point-separating. Then $\gamma = \gamma|_{X_1} \otimes \cdots \otimes \gamma|_{X_d}$ if and only if $\kappa^i_{k_1,\ldots,k_d}(\gamma) = 0$ for every $i \in \mathbb{N}^d_+$. Moreover, the expected kernel trick applies in the sense that for $i \in \mathbb{N}^d_+$
$$\big\|\kappa^i_{k_1,\ldots,k_d}(\gamma)\big\|^2_{H^{\otimes i}} = \sum_{\pi,\tau \in P(m)} c_\pi c_\tau\, \mathbb{E}_{\gamma^i_\pi \otimes \gamma^i_\tau}\, k^{\otimes i}\big((X_1,\ldots,X_m),(Y_1,\ldots,Y_m)\big), \tag{7}$$
where $m := \deg(i)$, and $k^{\otimes i}$ and $H^{\otimes i}$ are defined as in (4).

Applied to $i = (1, 1)$, the expression (7) reduces to the classical HSIC for two components, see Example 3.5 below. But for general $i$ this construction leads to genuinely new statistics in RKHSs.
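As a concrete reference point for (7), the following minimal sketch (our own illustration, not taken from the paper's released code; the name `hsic_vstat` is ours) computes the familiar biased V-statistic $\operatorname{Tr}(KHLH)/N^2$ that estimates the $i = (1,1)$ case, i.e. classical HSIC, from the two Gram matrices of $N$ paired samples.

```python
import numpy as np

def hsic_vstat(K, L):
    """Biased (V-statistic) estimator of ||kappa^{(1,1)}_{k1,k2}(gamma)||^2, i.e. classical HSIC,
    given Gram matrices K = [k1(x_i, x_j)] and L = [k2(y_i, y_j)] of N paired samples."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / n**2    # V-statistic of (7) for i = (1, 1)
```

The higher-degree statistics introduced below ($d^{(2)}$, $d^{(3)}$, CSIC) are estimated in the same spirit: every expectation in the partition sum becomes an average over Gram-matrix entries.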
Example 3.5 (Specific case: HSIC, kernel Lancaster interaction, kernel Streitberg interaction). If $d = 2$ there is only one index of order 2 in $\mathbb{N}^d_+$, namely $i = (1, 1)$; in this case (7) reduces to the classical HSIC equation
$$\big\|\kappa^{(1,1)}_{k_1,k_2}(\gamma)\big\|^2_{H^{\otimes(1,1)}} = \mathbb{E}\,k_1(X, X')\,k_2(Y, Y') + \mathbb{E}\,k_1(X, X')\;\mathbb{E}\,k_2(Y, Y') - 2\,\mathbb{E}\,k_1(X, X')\,k_2(Y, Y''),$$
where $(X, Y)$, $(X', Y')$ and $(X'', Y'')$ are independent copies of the same random variable following $\gamma$. More generally, with $i = \mathbf{1}_d$ one gets the kernel Streitberg interaction (Streitberg, 1990; Sejdinovic et al., 2013a; Liu et al., 2023), and specifically the kernel Lancaster interaction (Sejdinovic et al., 2013a) for $d \in \{2, 3\}$; the latter reduces to HSIC for two random variables ($d = 2$).

3.3 Finite-sample statistics

To apply Theorem 2 and Theorem 3 in practice, one needs to estimate expressions such as $\mathbb{E}\,k^{\otimes i}((X_1, \ldots, X_m), (Y_1, \ldots, Y_m))$. One could use classical estimators such as U-statistics (van der Vaart, 2000), which lead to unbiased estimators. However, we follow Gretton et al. (2008) and use a V-statistic, which is biased but conceptually simpler and efficient to compute. We note that the estimators presented here all have quadratic complexity like MMD and HSIC, see Appendix E.

A two-sample test for non-characteristic feature maps. If $k$ is characteristic then $\mathrm{MMD}_k(\gamma, \eta) = 0$ exactly when $\gamma = \eta$, but we can still increase testing power by considering the distance between the kernel variance and skewness embeddings, which leads us to use our semi-metrics $d^{(2)}(\gamma, \eta)$ and $d^{(3)}(\gamma, \eta)$ as defined in (6). An efficient estimator for $d^{(3)}$ is given in detail in Appendix E; we provide the full expression for $d^{(2)}$ here.

Lemma 2 ($d^{(2)}$ estimation, see (6)). The V-statistic for $d^{(2)}(\gamma, \eta) = \|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{H^{(1,1)}}$ is
$$\frac{1}{N^2}\operatorname{Tr}\!\big((K_x J_N)^2\big) + \frac{1}{M^2}\operatorname{Tr}\!\big((K_y J_M)^2\big) - \frac{2}{NM}\operatorname{Tr}\!\big(K_{xy} J_M K_{xy}^{\top} J_N\big),$$
where $\operatorname{Tr}$ denotes the trace, $(x_n)_{n=1}^{N} \overset{\mathrm{i.i.d.}}{\sim} \gamma$, $(y_m)_{m=1}^{M} \overset{\mathrm{i.i.d.}}{\sim} \eta$, $K_x = [k(x_i, x_j)]_{i,j=1}^{N} \in \mathbb{R}^{N \times N}$, $K_y = [k(y_i, y_j)]_{i,j=1}^{M} \in \mathbb{R}^{M \times M}$, $K_{xy} = [k(x_i, y_j)]_{i,j=1}^{N,M} \in \mathbb{R}^{N \times M}$, and $J_n = I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^{\top} \in \mathbb{R}^{n \times n}$ with $\mathbf{1}_n = (1, \ldots, 1)^{\top} \in \mathbb{R}^n$.

A kernel independence test. By Theorem 3, if $\gamma = \gamma|_{X_1} \otimes \gamma|_{X_2}$, then $\kappa^{(2,1)}(\gamma) = 0$ and $\kappa^{(1,2)}(\gamma) = 0$. We may compute the magnitude of either $\kappa^{(2,1)}(\gamma)$ or $\kappa^{(1,2)}(\gamma)$; we will refer to these quantities as the cross skewness independence criterion (CSIC). Note that these criteria are asymmetric. When $d = 2$ we have a probability measure $\gamma$ on $X_1 \times X_2$ and two kernels $k : X_1^2 \to \mathbb{R}$, $\ell : X_2^2 \to \mathbb{R}$. Assume that we have samples $(x_i, y_i)_{i=1}^{N}$ and use the shorthand notation $K = K_x$, $L = L_y$ (similarly to Lemma 2) and $H = H_N = \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^{\top} \in \mathbb{R}^{N \times N}$. Denote by $\odot$ the Hadamard product and by $\Sigma$ the sum over all elements of a matrix. Then one can derive the following CSIC estimator. (Note that matrix multiplication takes precedence over the Hadamard product.)

Lemma 3 (CSIC estimation). The V-statistic for $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{H_k^{\otimes 1} \otimes H_\ell^{\otimes 2}}$ is
K K L 4K KH L 2K K LH + 4KH K LH + 2KH HK L + 4K HK LH + K K L

Remark (computational complexity w.r.t. the degree $m$). We saw that the computational complexity of the cumulant-based measures is quadratic w.r.t. the sample size. Let $B_m = |P(m)|$ be the $m$-th Bell number, in other words the number of elements in $P(m)$. The Bell numbers follow the recursion $B_{m+1} = |P(m+1)| = \sum_{k=0}^{m} \binom{m}{k} B_k$, with the first elements of the sequence being $B_0 = B_1 = 1$, $B_2 = 2$, $B_3 = 5$, $B_4 = 15$, $B_5 = 52$. By (6)-(7), in the worst case the number of operations needed to compute $d_i(\gamma, \eta)$ or $\|\kappa^i_{k_1,\ldots,k_d}(\gamma)\|^2_{H^{\otimes i}}$ (with $m = \deg(i)$) is proportional to $B_m^2$ (it equals $3B_m^2$ and $B_m^2$, respectively). Though asymptotically $B_m$ grows quickly (de Bruijn, 1981; Lovász, 1993), for reasonably small degrees the computation is still manageable.
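To make Lemma 2 concrete, here is a minimal NumPy sketch of the $d^{(2)}$ V-statistic. This is our own illustration, not the authors' reference implementation (see the linked repository for that); the names `rbf_gram` and `d2_vstat` are ours. It uses plain matrix products for readability; Appendix D notes that the traces can be rewritten with Hadamard products so that the cost stays quadratic in the sample size.

```python
import numpy as np

def rbf_gram(x, y, sigma):
    """Gram matrix of the RBF kernel k(a, b) = exp(-||a - b||_2^2 / (2 sigma^2))."""
    sq_dists = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def d2_vstat(x, y, sigma):
    """V-statistic of Lemma 2 for d^(2)(gamma, eta) = ||kappa^(2)(gamma) - kappa^(2)(eta)||^2."""
    n, m = len(x), len(y)
    Kx, Ky, Kxy = rbf_gram(x, x, sigma), rbf_gram(y, y, sigma), rbf_gram(x, y, sigma)
    Jn = np.eye(n) - np.ones((n, n)) / n          # centering matrix J_N
    Jm = np.eye(m) - np.ones((m, m)) / m          # centering matrix J_M
    return (np.trace(Kx @ Jn @ Kx @ Jn) / n**2
            + np.trace(Ky @ Jm @ Ky @ Jm) / m**2
            - 2 * np.trace(Kxy @ Jm @ Kxy.T @ Jn) / (n * m))

# toy usage: two samples in R^1 with matching mean and variance
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = rng.normal(0, 1 / np.sqrt(3), size=(50, 1))
print(d2_vstat(x, y, sigma=0.5))
```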
In addition, merging various terms in the estimator can often be carried out, which leads to computational savings. For instance, the estimators of $d^{(2)}$ (see Lemma 2, Example E.1), CSIC (Lemma 3, Example E.2) and $d^{(3)}$ (Example E.3) consist of only 3, 11 and $10 + 2 \cdot 7 = 24$ terms, compared to the predicted worst-case counts of $3B_2^2 = 12$, $B_3^2 = 25$, and $3B_3^2 = 75$ terms, respectively. On the practical side, we found that using $m \in \{2, 3\}$ is a good compromise between the gain in sample efficiency and ease of implementation.

4 Experiments

In this section, we demonstrate the efficiency of the proposed kernelized cumulants in two-sample and independence testing.⁴

Two-sample test: Given $N \in \mathbb{N}$ samples from two probability measures $\gamma$ and $\eta$ on a space $X$, the goal was to test the null hypothesis $H_0: \gamma = \eta$ against the alternative $H_1: \gamma \neq \eta$. The compared test statistics (S) were MMD, $d^{(2)}$, and $d^{(3)}$.

Independence test: Given $N$ paired samples from a probability measure $\gamma$ on a product space $X_1 \times X_2$, the aim was to test the null hypothesis $H_0: \gamma = \gamma_1 \otimes \gamma_2$ against the alternative $H_1: \gamma \neq \gamma_1 \otimes \gamma_2$. The compared test statistics (S) were HSIC and CSIC.

In our experiments $H_1$ held, and the estimated power of the tests is reported. A permutation test was applied to approximate the null distribution and its 0.95-quantile (which corresponds to the level choice $\alpha = 0.05$): we first computed our test statistic $S$ using the given samples ($S_0 = S$), and then permuted the samples 100 times. If $S_0$ was in a high percentile ($\geq 95\%$ in our case) of the resulting distribution of $S$ under the permutations, we rejected the null. We repeated these experiments 100 times to estimate the power of the test. This procedure was in turn repeated 5 times, and the 5 samples are plotted as a box plot along with a line plot showing the mean against the number of samples ($N$) used. All experiments were performed using the RBF kernel $\mathrm{rbf}_\sigma(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$, where the parameter $\sigma$ is called the bandwidth. We performed all experiments for every bandwidth of the form $\sigma = a \cdot 10^{b}$ where $a \in \{1, 2.5, 5, 7.5\}$ and $b \in \{-5, -4, -3, -2, -1, 0\}$, and the optimal value across the bandwidths was chosen for each method and sample size. The experiments were carried out on a laptop with an i7 CPU and 16 GB of RAM.

4.1 Synthetic data

For synthetic data we designed two experiments. 2-sample test: we compared a uniform distribution with a mixture of two uniforms. Independence test: we considered the joint measure of a uniform and a correlated $\chi^2$ random variable. We also use this same benchmark to compare the efficiency of classical and kernelized cumulants in Appendix D.

Comparing a uniform with a mixture of uniforms. Even for simpler distributions like mixtures of uniform distributions it can be hard to pick up higher-order features, and $d^{(2)}$ can outperform MMD even when provided with a moderate number of samples. Here we compared the uniform distribution $U[-1, 1]$ with an equal mixture of $U[0.35, 0.778]$ and $U[-0.778, -0.35]$. The endpoints of the mixture were chosen to match the first three moments of $U[-1, 1]$. The number of samples used ranged from 5 to 50, and the results are summarized in Fig. 1. One can see that with $d^{(2)}$ the power approaches 100% much faster than with MMD.

⁴All the code replicating our experiments is available at https://github.com/PatricBonnier/Kernelized-Cumulants.
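The permutation protocol described above (100 permutations, rejection when the observed statistic exceeds the 95th percentile of the permuted ones) can be summarized in a few lines. This is our own minimal sketch, not the released experiment code; `stat` can be any of the estimators, e.g. the `d2_vstat` sketch from Section 3.3.

```python
import numpy as np

def permutation_two_sample_test(x, y, stat, n_perm=100, level=0.05, seed=0):
    """Reject H0: gamma = eta if the observed statistic exceeds the
    (1 - level)-quantile of its permutation distribution."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    n = len(x)
    s0 = stat(x, y)                                    # statistic on the original split
    perm_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(z))                  # random relabelling under H0
        perm_stats.append(stat(z[idx[:n]], z[idx[n:]]))
    return s0 > np.quantile(perm_stats, 1 - level), s0

# e.g. reject, s0 = permutation_two_sample_test(x, y, lambda a, b: d2_vstat(a, b, sigma=0.5))
```

For independence testing the same idea applies with one coordinate of the paired sample permuted, which breaks any dependence while preserving the marginals.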
Figure 1: Test power as a function of the sample size ($N$) for the two-sample test using the MMD (red) and $d^{(2)}$ (blue) statistics, comparing $U[-1, 1]$ with an equal mixture of $U[0.35, 0.778]$ and $U[-0.778, -0.35]$.

Figure 2: Test power as a function of the sample size ($N$) for independence testing between $Y_{0.5}^2$ and $X$ using the HSIC (red) and CSIC (blue) statistics.

Figure 3: Two-sample testing using MMD (red) and $d^{(2)}$ (blue) on the Seoul bicycle data set.

Independence between a uniform and a $\chi^2$. Let $X \sim U[0, 1]$ and $Z \sim N(0, 1)$ be independent of $X$. Denote by $\Phi$ the c.d.f. of the standard normal distribution and define $Y_p$ to be a mixture with weight $p$ at $\Phi^{-1}(X)$ and weight $1 - p$ at $Z$, so that $Y_0 = Z$ and $Y_1 = \Phi^{-1}(X)$. We test for independence at $p = 0.5$ between $Y_p^2$, which is $\chi^2$-distributed with 1 degree of freedom, and $X$. As the statistical dependence between $Y_p^2$ and $X$ is more complicated than a simple correlation, we expect that higher-order features of the data will help in the independence testing. The number of samples used ranged from 6 to 60, with the results summarized in Fig. 2. One can see that CSIC outperforms HSIC for every sample size, and the difference is more pronounced for smaller ones.

4.2 Real-world data

We demonstrate the efficiency of the kernelized cumulants on real-world data. We designed two experiments. 2-sample test: the goal was to test whether environmental data in two different seasons, and traffic data at different speeds, can be distinguished. Independence test: the aim was to test whether two distributions describing traffic flow and other traffic factors are independent. To improve the performance of both test statistics, all features are standardized to lie between 0 and 1.

Seoul bicycle data. The Seoul bicycle data set (E et al., 2020) consists of environmental data along with the number of bicycle rentals. The environmental data consists of 9 numerical values, 1 categorical value (season), and two binary values. We compare the distribution of the environmental data in the winter and in the fall, as we expect these distributions to be different. Concretely, we do 2-sample testing on two measures $\gamma, \eta$ on $\mathbb{R}^{11}$, and posit that $\gamma \neq \eta$, where $\gamma$ is the distribution of the environmental data in winter, and $\eta$ is that of the data in the fall. Permutation testing was performed for $N$ between 4 and 44, with results summarized in Fig. 3. As can be observed, $d^{(2)}$ outperforms MMD in terms of test power. The Type I error, i.e. the probability of falsely rejecting the null hypothesis (comparing winter data with itself), hovers between 5 and 10% for both statistics, which is admittedly slightly higher than the desired 5% due to the small sample size, but very similar for both statistics; for further details, the reader is referred to Fig. 8 in Appendix D.

Figure 4: Independence testing using HSIC (red) and CSIC (blue) on the Sao Paulo traffic dataset.

Figure 5: Two-sample testing using MMD (red) and $d^{(3)}$ (blue) on the Sao Paulo traffic dataset.

Brazilian traffic data. We used the Sao Paulo traffic benchmark (Ferreira, 2016) to perform independence testing. The dataset consists of 16 different integer-valued statistics about the hourly traffic in Sao Paulo, such as blockages, fires and other reasons that might hold up traffic.
This is combined with a number that describes the slowness of traffic at the given hour; so X1 = R16, X2 = R. One expects a strong dependence between the two sets or equivalently, for the null hypothesis to be false and for the statistics are heavily skewed towards 0 as it is naturally sparse. For independence testing we performed permutation testing for N between 4 and 40. The resulting test powers are summarized in Fig. 4. As it can be seen, HSIC and CSIC performs similarly for very low sample sizes, but for anything else CSIC is the favorable statistic in terms of test power. For two-sample testing, we sampled N between 5 and 50 and compared the distribution of slow moving traffic with the fast moving traffic. The results are summarized in Fig. 5. It is clear that d(3) performs similarly to MMD in terms of test power for very small sample sizes, but significantly better for larger ones. 5 Conclusion We defined cumulants for random variables in RKHSs by extending the algebraic characterization of cumulants on Rd. This construction results in a structured description of the law of random variables that goes beyond the classic kernelized mean and covariance. A kernel trick allows us to compute the resulting kernelized cumulants. We applied our theoretical results to two-sample and independence testing; although kernelized mean and covariance are sufficient for this task, the higherorder kernelized cumulants have the potential to increase the test power and to relax the assumptions on the kernel. Our experiments on real and synthetic data show that kernelized cumulants can indeed lead to significant improvement of the test power. A disadvantage of these higher-order statistics is that their theoretical analysis requires more mathematical machinery although we emphasize that the resulting estimators are simple V-statistics. Arcones, M. A. and Giné, E. (1992). On the bootstrap of U and V statistics. Annals of Statistics, 20:655 674. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337 404. Baringhaus, L. and Franz, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis, 88:190 206. Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer. Bogachev, V. I. (2007). Measure theory. Vol. I, II. Springer. Bonnier, P. and Oberhauser, H. (2020). Signature cumulants, ordered partitions, and independence of stochastic processes. Bernoulli, 26(4):2727 2757. Chevyrev, I. and Oberhauser, H. (2022). Signature moments to characterize laws of stochastic processes. Journal of Machine Learning Research, 23(176):1 42. Chung, E. and Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41(2):484 507. de Bruijn, N. G. (1981). Asymptotic Methods in Analysis. Dover. Dudley, R. (2004). Real Analysis and Probability. Cambridge University Press. E, S. V., Park, J., and Cho, Y. (2020). Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, 153:353 366. Ferreira, R. P. (2016). Combination of artificial intelligence techniques for prediction the behavior of urban vehicular traffic in the city of São Paulo. In Anais do 10. Congresso Brasileiro de Inteligência Computacional, pages 1 5. Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems (NIPS), pages 498 496. Gärtner, T. (2003). 
A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter, 5(1):49 58. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(25):723 773. Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory (ALT), pages 63 78. Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems (NIPS), pages 585 592. Hagrass, O., Sriperumbudur, B. K., and Li, B. (2022). Spectral regularized kernel two-sample tests. Technical report. (https://arxiv.org/abs/2212.09201). Hagrass, O., Sriperumbudur, B. K., and Li, B. (2023). Spectral regularized kernel goodness-of-fit tests. Technical report. (https://arxiv.org/abs/2308.04561). Jammalamadaka, S. R., Rao, T. S., and Terdik, G. (2006). Higher order cumulants of random vectors and applications to statistical inference and time series. Indian Journal of Statistics, 68(2):326 356. Király, F. J. and Oberhauser, H. (2019). Kernels for sequentially ordered data. Journal of Machine Learning Research, 20. Klebanov, L. (2005). N-Distances and Their Applications. Charles University, Prague. Kriege, N. M., Johansson, F. D., and Morris, C. (2020). A survey on graph kernels. Applied Network Science, 5(1). Lang, S. (2002). Algebra. Springer. Liu, Z., Peach, R. L., Mediano, P. A., and Barahona, M. (2023). Interaction measures, partition lattices and kernel tests for high-order interactions. Technical report. (https://arxiv.org/ abs/2306.00904). Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text classification using string kernels. Journal of machine learning research, 2:419 444. Lovász, L. (1993). Combinatorial Problems and Exercise. 2nd ed. Amsterdam, Netherlands: North-Holland. Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41:3284 3305. Makigusa, N. (2020). Two-sample test based on maximum variance discrepancy. Technical report. (https://arxiv.org/abs/2012.00980). Mc Cullagh, P. (2018). Tensor Methods in Statistics. Courier Dover Publications. Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1 141. Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429 443. Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2018). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5 31. Quadrianto, N., Song, L., and Smola, A. (2009). Kernelized sorting. In Advances in Neural Information Processing Systems (NIPS), pages 1289 1296. Rudin, W. (1953). Principles of mathematical analysis. Mc Graw-Hill Book Company, Inc., New York-Toronto-London. Saitoh, S. and Sawano, Y. (2016). Theory of Reproducing Kernels and Applications. Springer Singapore. Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. Sejdinovic, D., Gretton, A., and Bergsma, W. (2013a). A kernel test for three-variable interactions. In Advances in Neural Information Processing Systems (NIPS), pages 1124 1132. 
Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013b). Equivalence of distancebased and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41:2263 2291. Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT), pages 13 31. Speed, T. P. (1983). Cumulants and partition lattices. Australian Journal of Statistics, 25(2):378 388. Speed, T. P. (1984). Cumulants and partition lattices II. Australian Journal of Statistics, 4(1):34 53. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517 1561. Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer. Streitberg, B. (1990). Lancaster interactions revisited. Annals of Statistics, 18(4):1878 1885. Szabó, Z. and Sriperumbudur, B. K. (2018). Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(233):1 29. Székely, G. and Rizzo, M. (2004). Testing for equal distributions in high dimension. Inter Stat, 5:1249 1272. Székely, G. and Rizzo, M. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93:58 80. Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3:1236 1265. Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35:2769 2794. Tolstikhin, I., Sriperumbudur, B., and Schölkopf, B. (2016). Minimax estimation of maximal mean discrepancy with radial kernels. In Advances in Neural Information Processing Systems (NIPS), pages 1930 1938. Van der Waart, A. A. (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Willard, S. (1970). General Topology. Addison-Wesley. Zinger, A., Kakosyan, A., and Klebanov, L. (1992). A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Soviet Mathematics. Zolotarev, V. (1983). Probability metrics. Theory of Probability and its Applications, 28:278 302. These appendices provide additional background and elaborate on some of the finer points in the main text. In Appendix A we illustrate that cumulants have typically lower variance estimators compared to moments. Technical background on tensor products and tensor sums of Hilbert spaces, and on tensor algebras is provided in Appendix B. We present our proofs in Appendix C. In Appendix D additional details on our numerical experiments are provided. Our V-statistic based estimators are detailed in Appendix E. A Moments and cumulants Already for real-valued random variable X, moments have well-known drawbacks that make cumulants often preferable as statistics. For a detailed introduction to the use of cumulants in statistics we refer to Mc Cullagh (2018). Here we just mention that 1. the moment generating function f(t) = E[et X] = P m µmtm/m! describes the law of X with sequence (µm) of moments µm = E[Xm] R. However, since the function t 7 f(t) is the expectation of an exponential, one would often expect that f is also "exponential in t", hence g(t) = log f(t) = P m! should be simpler to describe as a power series. 
For example, for a Gaussian f(t) = et E(X)+ t2 2 V ar(X) and while µm can be in this case explicitly calculated and uneven moments vanish, the m-moments are fairly complicated compared to the power series expansion of g(t) = κ1t + κ2 t2 2 which just consists of κ1 (mean) and κ2 (variance). 2. In the moment sequence µm, lower moments can dominate higher moments. Hence, a natural idea to compensate for these "different scales" is to systematically subtract lower moments from higher moments. As mentioned in the introduction, this is in particular troublesome if finite samples are available. Even in dimension d = 1 the second moment is dominated by the squared mean, that is for a real-valued random variable X γ µ2(γ) = (µ1(γ))2 + Var(X), where Var(X) := E[(X µ1(γ))2]. It is well known that the minimum variance unbiased estimators for the variance are more efficient than that for the second moment: denoting them by c µ2 and bκ respectively, one can show (Bonnier and Oberhauser, 2020) that given N samples from X, the following holds Var c µ2 = Var (bκ) + 2 (EX)4 (EX)2 Var(X) 2Var(X)2 This means that when X has a large mean, it is more efficient to estimate its variance than its second moment since the last term in the above expression dominates. Hence, the variance Var(X) is typically a much more sensible second-order statistic than µ2(γ). However, we emphasize that there are many other reasons why cumulants can have better properties as estimators 3. Cumulants characterize laws and the independence of two random variables manifests itself simply as vanishing of cross-cumulants. In view of the above item 2, this means for example that testing independence can be preferable in terms of vanishing cumulants rather than testing if moments factor E[Xm Y n] = E[Xm]E[Xn], and similarly for testing if distributions are the same. The caveat to the above points is that it is not true that cumulants are always preferable. For example, there are distributions for which (a) the moment generating function is not naturally exponential in t, (b) lower moments do not dominate higher moments, (c) consequently independence or two-sample testing become worse with cumulants. While one can write down conditions under which for example, the variance of the kernelized cumulants is lower, the use of cumulants among statisticians is to simply regard cumulants as arising from natural motivations which leads to another estimator in their toolbox. The main idea of our paper is simply that for the same reasons that cumulants can turn out to be powerful for real or vector-valued random variables, cumulants of RKHS-valued random variables are a natural choice of statistics. The situation is more complicated since it requires formalizing momentand cumulant-generating functions in RKHS but ultimately a kernel trick allows for circumventing the computational bottleneck of working in infinite dimensions and leads to computable estimators for independence and two-sample testing. Further, we note that although cumulants are classic for vector-valued data, there seems to be not much work done about extending their properties to general structured data. Our kernelized cumulants apply to any set X where a kernel is given. This includes many practically relevant examples such as strings (Lodhi et al., 2002), graphs (Kriege et al., 2020), or general sequentially ordered data (Király and Oberhauser, 2019; Chevyrev and Oberhauser, 2022); a survey of kernels for structured data is provided by Gärtner (2003). 
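Before moving on to the technical background, the efficiency gap described in item 2 of this appendix is easy to see numerically. The following small Monte Carlo sketch is our own, purely illustrative addition (the sample size, repetition count, and mean shift of 5.0 are arbitrary choices): it compares the sampling variability of the plug-in second-moment estimator with that of the unbiased variance estimator when the mean is large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n = 20_000, 50
mean_shift = 5.0                      # large mean makes mu_1^2 dominate mu_2

second_moments, variances = [], []
for _ in range(n_rep):
    x = rng.normal(loc=mean_shift, scale=1.0, size=n)
    second_moments.append(np.mean(x**2))      # plug-in estimator of mu_2
    variances.append(np.var(x, ddof=1))       # unbiased estimator of kappa_2 = Var(X)

print("Var of second-moment estimator:", np.var(second_moments))  # dominated by the mean term
print("Var of variance estimator:    ", np.var(variances))        # much smaller when EX is large
```

With a large mean the term driven by $(\mathbb{E}X)^2$ dominates, so the second-moment estimator fluctuates far more than the variance estimator, in line with the variance comparison displayed above.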
B Technical background In Section B.1 the tensor products (Nd j=1 Hj) and direct sums of Hilbert spaces (L i I Hi) are recalled. Section B.2 is about tensor algebras over Hilbert spaces (Q B.1 Tensor products and direct sums of Banach and Hilbert spaces Tensor products of Hilbert spaces. For Hilbert spaces H, . . . , Hd and (h1, . . . , hd) H1 Hd, the multi-linear operator h1 hd H1 Hd is defined as (h1 hd)(f1, . . . , fd) = j=1 hj, fj Hj for all (f1, . . . , fd) H1 Hd. By extending the inner product a1 ad, b1 bd H1 Hd := j=1 aj, bj Hj to finite linear combinations of a1 ad-s ( n X i=1 ci d j=1 ai,j : ci R, ai,j Hj, n 1 by linearity, and taking the topological completion one arrives at H1 Hd. Specifically, if (H1, k1), . . . , (Hd, kd) are RKHSs, then so is H1 Hd = H d j=1kj (Berlinet and Thomas Agnan, 2004, Theorem 13) with the tensor product kernel d j=1 kj ((x1, . . . , xd) , (x 1, . . . , x d)) := j=1 kj xj, x j , where (x1, . . . , xd), (x 1, . . . , x d) X1 Xd. Tensor products of Banach spaces. For Banach spaces B1, . . . Bd, the construction of B1 Bd is a little more involved (Lang, 2002) as one cannot rely on an inner product. Direct sums of Hilbert and Banach spaces. Let (Hi)i I be Hilbert or Banach spaces where I is some index set. The direct sum of Hi-s written as L i I Hi consists of ordered tuples h = (hi)i I such that hi Hi for all i I and hi = 0 for all but a finite number of i I. Operations (addition, scalar multiplication) are performed coordinate-wise, and the inner product of a, b L i I Hi is defined as a, b L i I Hi = P i I aibi. B.2 Tensor algebras The tensor algebra Tj over a Hilbert space Hj is defined as the topological completion of the space M m 0 H m j . Note that it can equivalently be defined as the subset of (h0, h1, h2, . . .) Q m 0 H m j such that P m 0 hm 2 H m j < , and as such it is a Hilbert space with norm (h0, h1, h2, . . .) 2Q m 0 H m j = X m 0 hm 2 H m j . Tj is also an algebra, endowed with the tensor product over Hj as its product. For a = (a0, a1, a2, a2 . . .), b = (b0, b1, b2, b2 . . .) Tj, their product can be written down in coordinates as i=0 ai bm i For a sequence H1, . . . , Hd of Hilbert spaces, we define T := T1 Td, where Tj = Q m 0 H m j (j = 1, . . . , d). Let H = H1 Hd, and recall that given a tuple of integers i = (i1, . . . , id) Nd we define H i := H i1 1 H id d . This allows us to write down a multi-grading for T as i Nd H i. (8) Note that this gives credence to us using multi-indices i Nd to describe elements of the tensor algebra, as the multi-indices form its multi-grading. Furthermore, T is a multi-graded algebra when endowed with the (linear extension of the) following multiplication defined on the components of T : H i1 H i2 H (i1+i2), (9) (x1 xd) (y1 yd) = (x1 y1) (xd yd), so that for a = ai i Nd , b = bi i Nd T, their product can be written down as i1+i2=i ai1 bi2 (10) where addition of tuples i1, i2 Nd is defined as i1 + i2 = i1 1 + i2 1, . . . , i1 d + i2 d . With the degree of a tuple defined as deg(i) = i1 + + id, T is also a graded algebra, with the grading written down as {i Nd:deg(i)=m} H i, so that if one multiplies two elements together, the degree of their product is the sum of their degree. Finally we note that T is a unital algebra and the unit has the explicit form (1, 0, 0, . . .), i.e. the element consisting of only a 1 at degree 0. This section is dedicated to proofs. 
The equivalence between the combinatorial expressions of cumulants and the definition via a moment generating function is proved in Section C.2. The derivation of our main results (Theorem 2 and Theorem 3) are detailed in Section C.3. C.1 Equivalent definitions of cumulants in Rd Here we introduce a classical definition of cumulants via a moment generating function and its equivalence to the combinatorial expressions. If X = (X1, . . . , Xd) is an Rd-valued random variable distributed according to X γ, then recall that µi(γ) = E[Xi1 1 Xid d ] R for i = (i1, . . . , id) Nd. The following definition of the cumulants κi(γ) of γ are equivalent i Nd κi(γ) θi i Nd µi(γ) θi 2. κi(γ) = P π P (d) cπµi(γi π), where θ = (θ1, . . . , θd) Rd, cπ = ( 1)|π|(|π| 1)!. The equivalence between these two definitions of cumulants via a generating function and via their combinatorial definition, is classical (Mc Cullagh, 2018) even if our notation here is non-standard in the classical case. This equivalence is also at the heart of many proofs about properties of cumulants since some properties are easier to prove via one or the other definition. C.2 Equivalent definitions of cumulants in RKHS In the main text, we defined cumulants in RKHS by mimicking the combinatorial definition of cumulants in Rd. It is natural and useful to also have the analogous definition via a "generating function" for RKHS-valued random variables. However, to generalize the definition via the logarithm of the moment generating function to random variables in RKHS, requires to define a logarithm for tensor series of moments. In this part, we show that this can be done and that indeed the two definitions are equivalent. We use the shorthand κ(γ) := κk1,...,kd(γ), µ(γ) := µk1,...,kd(γ), and we overload the notation (X1, . . . , Xd) with (k1( , X1), . . . , kd( , Xd)). With this notation, we show that given coordinates i Nd, one may express the generalized cumulant κi(γ) as either a combinatorial sum over moments indexed by partitions, or by using the cumulant generating function. More specifically, we show that the generalized cumulant of a probability measure γ on H1 Hd defined as π P (m) cπEγiπ(X i), where cπ = ( 1)|π| 1(|π| 1)! can also be expressed as coordinates in the tensorized logarithm of the moment series. Motivated by the Taylor series expansion of the classic logarithm, we define log : T T, x 7 X where denotes the product as defined in (9) and for t T, t n is defined as t n = t t | {z } n - times , or coordinate-wise (t n)i = P i1+ +in=i ti1 tin for i Nd. Note that unlike the classical logarithm log : R+ R, the tensorized logarithm is defined on the whole space as a formal expression. This can be summarized in the following lemma: π P (m) cπEγiπ(X i) = log µ(γi) 1m, (11) where 1m = (1, . . . , 1) Nm. By iterating (10) we can express (11) as i1+ +ij=1m µi1(γi) µij(γi), and our goal is to express this as a sum over partitions. We will use the notation [n] = {1, . . . , n}. We can achieve our goal in two parts: 1. Show that for a fixed i Nd with deg(i) = m we can express (11) as a sum over all surjective functions from [m] to [j]. 2. Show that this sum over functions reduces to a sum over partitions. Part 1. Note that given i1 + + ij = 1m we may define h : [m] [j] by the relation (ih(n))n = 1, that is, we take h(n) to be the index c for which the multi-index ic is 1 at n. Note that this function is necessarily surjective since the sum is taken over non-zero multi-indices. 
Equivalently, for any surjective function h : [m] [j] we may define multi-indices by setting (ic)n = 1 if n h 1(c) 0 otherwise . Note that any such multi-index will be non-zero since the function is assumed to be surjective. With this identification we can write log µ(γi) 1m = h:[m] [j] µih 1(1)(γi) µih 1(j)(γi). Part 2. Recall that given a function h : [m] [j] we can associate it to its corresponding partition πh P(m) by considering the set {h 1(1), . . . , h 1(j)}, and there are exactly j! different functions corresponding to a given partition, which are given by re-ordering the values 1, . . . , j. This reordering of the blocks does not change the summands since the marginals of the partition measure are always copies of each other and hence self-commute, hence a product of moments like µih 1(1)(γi) µih 1(j)(γi) can always be written as µi(γi πh), the i-th coordinate of the moment sequence of the partition measure γi πh. With this in mind we can write log µ(γi) 1m = h:[m] [j] µi(γi πh) = X |π| |π|!µi(γi π) π P (m) cπµi(γi π) = X π P (m) cπEγiπ(X i). From this it immediately follows that for two probability measures γ, η we can write κi(γ), κi(η) H i = X π P (m) cπEγiπ(X i), X τ P (m) cτEηiτ (Y i) H i π,τ P (m) cπcτE(X,Y ) γiπ ηiτ X i, Y i H i. Lemma 1 then follows from the definition of the tensor products. C.3 Proof of Theorem 2 and Theorem 3 In this section we present the proofs of Theorem 2 and Theorem 3. We do this in a slightly more abstract setting where the feature maps take values in Banach spaces for clarity, until the end when we again restrict our attention to RKHSs. We start out by showing that polynomial functions of the feature maps characterize measures (Lemma 5). From this result we show that cumulants have the same property (Theorem 4), and lastly that this also holds when working directly with the kernels (Proposition 1). A monomial on separable Banach spaces B1, . . . , Bd is any expression of the form M(x1, . . . , xd) = j=1 f 1 j , x1 j=1 f d j , xd for some (i1, . . . , id) Nd, where f i j B i are elements of the dual space B i and xi Bi.5 Finite linear combinations of monomials are called the polynomials. Recall that a set of functions F on a set S is said to separate the points of S if for every x = y S there exists f F such that f(x) = f(y). Lemma 5 (Polynomial functions of feature maps characterize probability measures). Let X1, . . . , Xd be Polish spaces, B1 . . . , Bd separable Banach spaces and φi : Xi Bi be continuous, bounded, and injective functions. Then the set of functions on the Borel probability measures P Qd i=1 Xi of Qd i=1 Xi Qd i=1 Xi p φ1(x1), . . . , φd(xd) dγ(x1, . . . , xd), where p ranges over all polynomials, separates the points of P Qd i=1 Xi . Proof. We first show that the pushforward map is injective. This is done in two parts, first we show that every Borel measure on Qd i=1 Xi is a Radon measure, then we show that the pushforward map is injective on Radon measures. To see the first part, note that since X1, . . . , Xd are Polish spaces, so is their product space Qd i=1 Xi (Dudley 2004, Theorem 2.5.7; Willard 1970, Theorem 16.4c), and since Borel measures on Polish spaces are Radon measures (Bogachev, 2007, Theorem 7.1.7), any γ P(Qd i=1 Xi) must be a Radon measure. For the second part, note that (x1, . . . , xd) 7 is a norm bounded, continuous injection. 
Since Qd i=1 Bi is a Hausdorff space, Qd i=1 φi is a homeomorphism on compacts since continuous injections into Hausdorff spaces are homeomorphisms on compacts (Rudin, 1953, Theorem 4.17). Let µ, ν P Qd i=1 Xi be two Radon measures such that their pushforwards are the same Qd i=1 φi(µ) = Qd i=1 φi(ν), then for any compact C Qd i=1 Xi we have µ(C) = ν(C) as Qd i=1 φi : C Qd i=1 φi(C) is a homeomorphism. Since Radon measures are characterized by their values on compacts, this implies that µ = ν. Hence the pushforward map is injective. Denote by K the image of Qd i=1 Xi under the mapping Qd i=1 φi in Qd i=1 Bi. Note that K is a bounded Polish space. It is enough to show that the polynomials separate the points of P(K). To see this, note that the polynomials form an algebra of continuous functions that separate the points of Qd i=1 Bi, and when restricted to K they are bounded, since K is norm bounded. Since K is Polish, any Borel measure is Radon, and we can apply the Stone-Weierstrass theorem for Radon measures (Bogachev, 2007, Exercise 7.14.79) to get the assertion. In what follows we will use the following index notation for linear functionals. Fix some tuple i = (i1, . . . , id) Nd with deg(i) = m. Given separable Banach spaces B1 . . . , Bd we use the notation B i := B i1 1 B id d 5These monomials naturally extend the classical ones. and given an element x = (x1, . . . , xd) Qd i=1 Bi we write xi : = x i1 1 x id d so that xi B i. If we have functions (φi)d i=1 such that φi : Xi Bi on some Polish spaces X1, . . . , Xd, then we write φ i := φ i1 1 φ id d , φ i : i=1 Xi B i. Given a collection of linear functionals F Qd j=1 B j ij such that F = (f1, . . . , fd) we write F i := f1 fd, F i B i . Note the following trick: the monomials on Qd i=1 Bi are exactly functions of the form x 7 F i, xi for F = (f1, . . . , fd), this will be used in the proofs. We can now restate and prove the our theorem. Note that the cumulants here are defined like in Definition 4 which is a sensible definition even if the feature maps are not associated to kernels. Theorem 4 (Generalization of Theorem 2 and Theorem 3). Let X1, . . . , Xd be Polish spaces and φi : Xi Bi be continuous, bounded and injective feature maps into separable Banach spaces Bi for i = 1, . . . d. Let γ and η be probability measures on X1 Xd. Then 1. γ = η if and only if κ(γ) = κ(η). 2. γ = Nd i=1 γ|Xi if and only if the cross cumulants vanish, that is κi(γ) = 0 for all i Nd +. Proof. Item 2: We want to show that the cross cumulants vanish if and only if γ = Nd i=1 γ|Xi. By Lemma 5 it is enough to show that Eγ h p φ1(X1), . . . , φd(Xd) i = ENd i=1 γ|Xi h p φ1(X1), . . . , φd(Xd) i for any monomial function p. Let us take linear functionals F = (f1, . . . , fd) and note that F i, κi(γ) = X π P (d) cπEγiπ f1(φ1(X1)) fd(φd(Xd)) which is the classical cumulant of the vector-valued random variable (f1 φ1)(X1), . . . , (fd φd)(Xd) , where (X1, . . . , Xd) γ. Hence by classical results (Speed, 1983), all cross cumulants of (f1 φ1)(X1), . . . , (fd φd)(Xd) vanish if and only if the cross moments split, that is to say Eγ h p (f1 φ1)(X1), . . . , (fd φd)(Xd) i = ENd i=1 γ|Xi h p (f1 φ1)(X1), . . . , (fd φd)(Xd) i for any monomial p on Rd. Since f1, . . . , fd were arbitrary this holds for all monomials, which shows the assertion. Item 1: By assumption κi(γ) = κi(η) for every i Nd; this implies that Eγp(φ1, . . . , φd) = Eηp(φ1, . . . , φd) for any polynomial p, so we can apply Lemma 5. Proposition 1 (Theorem 2 and Theorem 3). Let X1, . . . 
Proposition 1 (Theorem 2 and Theorem 3). Let $X_1, \ldots, X_d$ be Polish spaces and $k_i : X_i^2 \to \mathbb R$ a collection of bounded, continuous, point-separating kernels. Let $\gamma$ and $\eta$ be probability measures on $X_1 \times \cdots \times X_d$. Then
1. $\gamma = \eta$ if and only if $\kappa_{k_1, \ldots, k_d}(\gamma) = \kappa_{k_1, \ldots, k_d}(\eta)$;
2. $\gamma = \bigotimes_{i=1}^d \gamma|_{X_i}$ if and only if $\kappa^i_{k_1, \ldots, k_d}(\gamma) = 0$ for all $i \in \mathbb N^d_+$.

Proof. We reduce the proof to checking the conditions of Theorem 4. Let $\varphi_i$ denote the canonical feature map of the kernel $k_i$, and let $B_i := H_{k_i}$ be the RKHS associated to $k_i$ ($i \in \{1, \ldots, d\}$). For all $i \in \{1, \ldots, d\}$, $\varphi_i$ is (i) bounded, by the boundedness of $k_i$, since $\|\varphi_i(x)\|^2_{H_{k_i}} = k_i(x,x) \le \sup_{x \in X_i} |k_i(x,x)| < \infty$; (ii) continuous, by the continuity of $k_i$ (Steinwart and Christmann, 2008, Lemma 4.29); (iii) injective, by the point-separating property of $k_i$. The separability of $H_{k_i}$ follows (Steinwart and Christmann, 2008, Lemma 4.33) from the separability of $X_i$ and the continuity of $k_i$ ($i \in \{1, \ldots, d\}$).

Note: Details on the expected kernel trick part of Theorem 2 and Theorem 3 are provided in Section E.

D Additional experiments and details

Here we give additional details on the experiments that were performed, and discuss some further experiments that did not fit into the main text.

Background on permutation testing. Permutation testing works by bootstrapping the distribution of a test statistic under the null hypothesis. This allows the user to estimate confidence intervals under the null, which is a powerful all-purpose approach when analytic expressions are unavailable. As an example, assume we have two probability measures $\gamma, \eta$ on $X$ with i.i.d. samples $x_1, \ldots, x_N \sim \gamma$ and $y_1, \ldots, y_N \sim \eta$. If the null hypothesis is that $\gamma = \eta$, then we may set $(z_1, \ldots, z_{2N}) := (x_1, \ldots, x_N, y_1, \ldots, y_N)$, so that for any permutation $\sigma$ on $2N$ elements we obtain two sets of i.i.d. samples from $\gamma = \eta$ by using the empirical measures $\hat\gamma_\sigma := (z_{\sigma(1)}, \ldots, z_{\sigma(N)})$ and $\hat\eta_\sigma := (z_{\sigma(N+1)}, \ldots, z_{\sigma(2N)})$, and for any statistic $S : P(X)^2 \to \mathbb R$ we may estimate the distribution of $S(\gamma, \eta)$ under the null by sampling $S(\hat\gamma_\sigma, \hat\eta_\sigma)$. If the null hypothesis were true, we would expect the statistic computed on the original split to lie in a high-probability region of this permutation distribution, and we can use this as a criterion for rejecting the null. Under fairly weak assumptions, this yields a test at the appropriate level (Chung and Romano, 2013); a short illustrative code sketch of the procedure is given at the end of this subsection.

Comparing a uniform and a mixture. Any uniform random variable over a symmetric interval has zero mean and zero skewness, so a symmetric mixture only needs to match the variance. If $X$ is a 50/50 mixture of $U[a,b]$ and $U[-b,-a]$, then $\operatorname{Var}(X) = \mathbb E[X^2] = \tfrac{1}{3}(b^2 + ba + a^2)$, so if $Y$ is distributed according to $U[-c,c]$, whose variance is $c^2/3$, then we only need to solve $b^2 + ba + a^2 = c^2$, which is straightforward for a given $a$ and $c$.

Computational complexity of estimators. The V-statistic for $d^{(2)}$ as written in Lemma 2 is bottlenecked by the matrix multiplications. We may note, however, that for two matrices $A, B$ it holds that $\operatorname{Tr}(A^\top B) = \Sigma(A \circ B)$, where $\Sigma$ denotes the sum over all elements and $\circ$ denotes the Hadamard product. We also note that for $H_n = \tfrac{1}{n}\mathbf 1_n \mathbf 1_n^\top$ one has $[A H_n]_{i,c} = \tfrac{1}{n}\sum_{c'=1}^{n} A_{i,c'}$, i.e., $A H_n$ is obtained from the row means of $A$ rather than from a matrix product. Using both of these tricks we may compute both $d^{(2)}$ and CSIC without any matrix multiplications, which brings the computational complexity down to $O(N^2)$ for both. For a comparison of actual computation times, see Fig. 6 and Fig. 7, where the average computation times of our methods are compared to those of the KME and HSIC for $N$ between 50 and 2000.
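The following is a minimal sketch of the permutation test described above (our own illustrative code; the helper names are ours, and the statistic $S$ is passed in as a callable — in the experiments it would be one of the kernelized discrepancies such as MMD or $d^{(2)}$).

```python
# Minimal sketch of a two-sample permutation test (illustrative only; function and
# variable names are ours, not from the paper's code).
import numpy as np

def permutation_test(x, y, statistic, n_permutations=500, seed=0):
    """x, y: arrays of N samples from gamma and eta; statistic: callable S(x, y) -> float.
    Returns (observed value of S, permutation p-value)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.concatenate([x, y])             # pooled sample (z_1, ..., z_2N)
    observed = statistic(x, y)
    null_values = np.empty(n_permutations)
    for b in range(n_permutations):
        sigma = rng.permutation(2 * n)     # random permutation of the pooled indices
        null_values[b] = statistic(z[sigma[:n]], z[sigma[n:]])
    # proportion of permuted statistics at least as large as the observed one
    p_value = (1 + np.sum(null_values >= observed)) / (1 + n_permutations)
    return observed, p_value

# Toy usage with a simple statistic (difference of means); in the experiments the
# statistic would be a kernelized discrepancy.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=200)
    y = rng.normal(0.3, 1.0, size=200)
    print(permutation_test(x, y, lambda a, b: abs(a.mean() - b.mean())))
```

The "+1" in the p-value is the usual finite-sample correction that keeps the permutation test valid at the nominal level.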
Type I error on the Seoul Bicycle data. The results obtained when comparing the winter data to itself are presented in Fig. 8. As we can see, the performance is similar for both estimators, and the Type I error lies between 5 and 10%.

Figure 6: Average computational time in seconds for KME (red) and $d^{(2)}$ (blue) for sample size $N$ between 50 and 2000.

Figure 7: Average computational time in seconds for HSIC (red) and CSIC (blue) for sample size $N$ between 50 and 2000.

Figure 8: Type I errors using MMD (red) and $d^{(2)}$ (blue) on the Seoul bicycle data set.

Classical vs. kernelized cumulants. Using the same distributions as in the synthetic independence testing experiment, we now compare $X$ with $Y^2 - 0.5$ to contrast independence testing with classical cumulants against their kernelized counterparts. The results are summarized in Table 1, where they are displayed as the median value $\pm$ half the difference between the 75th and 25th percentiles. We consider every combination of classical vs. kernelized, variance vs. skewness, and two different sample sizes. One can observe that the classical variance-based test performs poorly compared to a classical skewness test, that the kernelized variance test is almost as powerful as the kernelized skewness test, and that in all cases the kernelized tests deliver higher power.

Table 1: Comparison of classical and kernelized cumulants for independence testing with both variance and skewness.

N = 20        Variance        Skewness
Classical     19% ± 3.0%      56% ± 3.5%
RBF kernel    39% ± 4.5%      59% ± 3.0%

N = 30        Variance        Skewness
Classical     17% ± 0.5%      68% ± 1.0%
RBF kernel    65% ± 3.5%      79% ± 1.5%

E Kernel trick computations

Here we show how to arrive at the expressions for the V-statistics used in the experiments.

Given a real analytic function $f(x_1, \ldots, x_d) = \sum_{i \in \mathbb N^d} f_i x^i$ in $d$ variables with nonzero radius of convergence and Hilbert spaces $H_1, \ldots, H_d$, we may (formally) extend $f$ to a function
\[
\tilde f : \prod_{i=1}^d H_i \to T, \qquad \tilde f(x_1, \ldots, x_d) = \sum_{i \in \mathbb N^d} f_i\, x^i.
\]
Moreover, if the Hilbert spaces are RKHSs then we have the following result.

Lemma 6 (Nonlinear kernel trick). For any collection of RKHSs $H_1, \ldots, H_d$ with feature maps $\varphi_i : X_i \to H_i$, assume that $f$ and $g$ are real analytic functions with radii of convergence $r(f)$ and $r(g)$ such that $\max_{1 \le i \le d} \sup_{x \in X_i} \|\varphi_i(x)\| < \min(r(f), r(g))$. Then
\[
\big\langle \tilde f\big(\varphi_1(x_1), \ldots, \varphi_d(x_d)\big), \tilde g\big(\varphi_1(y_1), \ldots, \varphi_d(y_d)\big) \big\rangle_T = \sum_{i \in \mathbb N^d} f_i g_i\, k_1(x_1, y_1)^{i_1} \cdots k_d(x_d, y_d)^{i_d}.
\]
Proof. Since the images of the $\varphi_i$ lie inside the radii of convergence of $f$ and $g$, the power series converge absolutely and we can write
\[
\big\langle \tilde f\big(\varphi(x)\big), \tilde g\big(\varphi(y)\big) \big\rangle_T
= \Big\langle \sum_{i \in \mathbb N^d} f_i\, \varphi^i(x), \sum_{i \in \mathbb N^d} g_i\, \varphi^i(y) \Big\rangle_T
= \sum_{i \in \mathbb N^d} f_i g_i\, \big\langle \varphi^i(x), \varphi^i(y) \big\rangle_{H^i}
= \sum_{i \in \mathbb N^d} f_i g_i\, k_1(x_1, y_1)^{i_1} \cdots k_d(x_d, y_d)^{i_d},
\]
where $H = H_1 \otimes \cdots \otimes H_d$ and we used that elements of different degrees are orthogonal in $T$.

Using Lemma 6, we can choose kernels $k_i : X_i^2 \to \mathbb R$ with associated RKHSs $H_i$ and feature maps $\varphi_i$, and some $i \in \mathbb N^d$ with $\deg(i) = m$. We make the observation that with $X = (X_1, \ldots, X_d) \sim \gamma$, $Y = (Y_1, \ldots, Y_d) \sim \eta$ and $k^i$ and $H^i$ as in (4), one has
\[
\big\langle \kappa_i(\gamma), \kappa_i(\eta) \big\rangle_{H^i}
= \Big\langle \sum_{\pi \in P(m)} c_\pi\, \mathbb E_{\gamma_i^{\pi}}\big[\varphi^i(X)\big], \sum_{\tau \in P(m)} c_\tau\, \mathbb E_{\eta_i^{\tau}}\big[\varphi^i(Y)\big] \Big\rangle_{H^i}
= \sum_{\pi, \tau \in P(m)} c_\pi c_\tau\, \big\langle \mathbb E_{\gamma_i^{\pi}}\big[\varphi^i(X)\big], \mathbb E_{\eta_i^{\tau}}\big[\varphi^i(Y)\big] \big\rangle_{H^i}
\]
\[
= \sum_{\pi, \tau \in P(m)} c_\pi c_\tau\, \mathbb E_{\gamma_i^{\pi} \otimes \eta_i^{\tau}}\big\langle \varphi^i(X), \varphi^i(Y) \big\rangle_{H^i}
= \sum_{\pi, \tau \in P(m)} c_\pi c_\tau\, \mathbb E_{\gamma_i^{\pi} \otimes \eta_i^{\tau}}\big[k^i\big((X_1, \ldots, X_m), (Y_1, \ldots, Y_m)\big)\big].
\]
Combining this with
\[
\|\kappa_i(\gamma)\|^2_{H^i} = \langle \kappa_i(\gamma), \kappa_i(\gamma) \rangle_{H^i}, \qquad
\|\kappa_i(\gamma) - \kappa_i(\eta)\|^2_{H^i} = \langle \kappa_i(\gamma), \kappa_i(\gamma) \rangle_{H^i} + \langle \kappa_i(\eta), \kappa_i(\eta) \rangle_{H^i} - 2\,\langle \kappa_i(\gamma), \kappa_i(\eta) \rangle_{H^i},
\]
one gets the expected kernel trick statements of Theorem 2 and Theorem 3.
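As a small illustration of Lemma 6 (our own sketch, not from the paper; the Gaussian kernel and the truncation level are assumptions), take $d = 1$ and $f = g = \exp$: the tensor-algebra inner product collapses to the scalar series $\sum_i k(x,y)^i/(i!)^2$, which can be evaluated from kernel values alone and, as a sanity check, equals the modified Bessel function $I_0\big(2\sqrt{k(x,y)}\big)$.

```python
# Illustrative sketch of the nonlinear kernel trick of Lemma 6 for d = 1 with
# f = g = exp, i.e. f_i = g_i = 1/i!: the inner product in the tensor algebra reduces
# to sum_i k(x, y)^i / (i!)^2, computable from kernel evaluations only.
import math
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2)))

def exp_exp_inner_product(kxy, truncation=50):
    """Truncated series sum_{i>=0} kxy**i / (i!)**2 (the Lemma 6 series for f = g = exp)."""
    return sum(kxy ** i / math.factorial(i) ** 2 for i in range(truncation))

x = np.array([0.3, -1.2])
y = np.array([0.1, -0.7])
kxy = gaussian_kernel(x, y)
val = exp_exp_inner_product(kxy)
print(kxy, val)

# Sanity check: the series equals I_0(2*sqrt(kxy)) when SciPy is available.
try:
    from scipy.special import iv
    print(abs(val - iv(0, 2.0 * math.sqrt(kxy))))   # ~0 up to truncation error
except ImportError:
    pass
```

Note that the condition of Lemma 6 is satisfied here since $\sup_x \|\varphi(x)\| = 1$ for the Gaussian kernel while $\exp$ has infinite radius of convergence.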
We are now interested in explicitly computing the expressions $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{H_k^{\otimes 1} \otimes H_\ell^{\otimes 2}}$, $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{H^{(1,1)}}$ and $\|\kappa^{(3)}_k(\gamma) - \kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}}$, and their corresponding V-statistics. Recall that for a (w.l.o.g.) symmetric, measurable function $h(z_1, \ldots, z_m)$, the V-statistic of $h$ with $N$ samples $Z_1, \ldots, Z_N$ is defined as
\[
V(h; Z_1, \ldots, Z_N) := N^{-m} \sum_{i_1, \ldots, i_m = 1}^{N} h(Z_{i_1}, \ldots, Z_{i_m}).
\]
Under fairly general conditions, the V-statistic converges to $\mathbb E[h(Z_1, \ldots, Z_m)]$, and a well-developed theory describes this convergence (van der Vaart, 2000; Serfling, 1980; Arcones and Giné, 1992).

Example E.1 (Estimating $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{H^{(1,1)}}$). Let $X, X', X'', X'''$ denote independent copies of $\gamma$ and $Y, Y', Y'', Y'''$ denote independent copies of $\eta$. The full expression for $\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{H^{(1,1)}}$ is
\[
\|\kappa^{(2)}_k(\gamma) - \kappa^{(2)}_k(\eta)\|^2_{H^{(1,1)}}
= \mathbb E\, k(X,X')k(X'',X''') + \mathbb E\, k(Y,Y')k(Y'',Y''') \tag{12}
\]
\[
+ \mathbb E\, k(X,X')^2 + \mathbb E\, k(Y,Y')^2 + 2\,\mathbb E\, k(X,Y)k(X',Y) + 2\,\mathbb E\, k(X,Y)k(X,Y')
- 2\,\mathbb E\, k(X,Y)k(X',Y') - 2\,\mathbb E\, k(X,Y)^2 - 2\,\mathbb E\, k(X,X')k(X,X'') - 2\,\mathbb E\, k(Y,Y')k(Y,Y'').
\]
Given samples $(x_i)_{i=1}^N$, $(y_i)_{i=1}^M$ from $\gamma$ and $\eta$ respectively, the corresponding V-statistic is
\[
\frac{1}{N^4} \sum_{i,j,\kappa,l=1}^{N} k(x_i,x_j)k(x_\kappa,x_l) + \frac{1}{M^4} \sum_{i,j,\kappa,l=1}^{M} k(y_i,y_j)k(y_\kappa,y_l) \tag{13}
\]
\[
+ \frac{1}{N^2} \sum_{i,j=1}^{N} k(x_i,x_j)^2 + \frac{1}{M^2} \sum_{i,j=1}^{M} k(y_i,y_j)^2
+ \frac{2}{N^2 M} \sum_{i,\kappa=1}^{N} \sum_{j=1}^{M} k(x_i,y_j)k(x_\kappa,y_j) + \frac{2}{N M^2} \sum_{i=1}^{N} \sum_{j,\kappa=1}^{M} k(x_i,y_j)k(x_i,y_\kappa)
\]
\[
- \frac{2}{N^2 M^2} \sum_{i,\kappa=1}^{N} \sum_{j,l=1}^{M} k(x_i,y_j)k(x_\kappa,y_l) - \frac{2}{N M} \sum_{i=1}^{N} \sum_{j=1}^{M} k(x_i,y_j)^2
- \frac{2}{N^3} \sum_{i,j,\kappa=1}^{N} k(x_i,x_j)k(x_i,x_\kappa) - \frac{2}{M^3} \sum_{i,j,\kappa=1}^{M} k(y_i,y_j)k(y_i,y_\kappa).
\]
Let us define the Gram matrices $K_x = [k(x_i,x_j)]_{i,j=1}^N \in \mathbb R^{N \times N}$, $K_y = [k(y_i,y_j)]_{i,j=1}^M \in \mathbb R^{M \times M}$, $K_{xy} = [k(x_i,y_j)]_{i,j=1}^{N,M} \in \mathbb R^{N \times M}$, and let $H_N = \tfrac{1}{N}\mathbf 1_N \mathbf 1_N^\top \in \mathbb R^{N \times N}$, $H_M = \tfrac{1}{M}\mathbf 1_M \mathbf 1_M^\top \in \mathbb R^{M \times M}$, so that $I - H_N$ and $I - H_M$ are the usual centering matrices. Then (13) can be rewritten as
\[
\frac{1}{N^2} \operatorname{Tr}(H_N K_x H_N K_x) + \frac{1}{M^2} \operatorname{Tr}(H_M K_y H_M K_y) + \frac{1}{N^2} \operatorname{Tr}(K_x^2) + \frac{1}{M^2} \operatorname{Tr}(K_y^2)
+ \frac{2}{NM} \operatorname{Tr}(K_{xy}^\top H_N K_{xy}) + \frac{2}{NM} \operatorname{Tr}(K_{xy} H_M K_{xy}^\top)
\]
\[
- \frac{2}{NM} \operatorname{Tr}(H_M K_{xy}^\top H_N K_{xy}) - \frac{2}{NM} \operatorname{Tr}(K_{xy} K_{xy}^\top)
- \frac{2}{N^2} \operatorname{Tr}(K_x H_N K_x) - \frac{2}{M^2} \operatorname{Tr}(K_y H_M K_y),
\]
which simplifies to
\[
\frac{1}{N^2} \operatorname{Tr}\big((K_x(I - H_N))^2\big) + \frac{1}{M^2} \operatorname{Tr}\big((K_y(I - H_M))^2\big) - \frac{2}{NM} \operatorname{Tr}\big(K_{xy}(I - H_M) K_{xy}^\top (I - H_N)\big).
\]
This estimator can be computed in quadratic time.
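To make the final expression concrete, here is a short sketch (our own code, not the authors' implementation; the Gaussian kernel and its bandwidth are assumptions for the demo) that evaluates the simplified estimator above in quadratic time, using the identities $\operatorname{Tr}(AB) = \Sigma(A \circ B^\top)$ and the row/column-mean form of multiplication by $H$ from Appendix D.

```python
# Quadratic-time sketch of the V-statistic from Example E.1, i.e. the plug-in estimate of
# ||kappa^(2)_k(gamma) - kappa^(2)_k(eta)||^2.  Traces are evaluated as sums of Hadamard
# products, and multiplication by (I - H) only requires row/column means, so no matrix
# products are needed.
import numpy as np

def gaussian_gram(x, y, bandwidth=1.0):
    """Gram matrix [k(x_i, y_j)]_{ij} for the Gaussian kernel (an assumed kernel choice)."""
    d2 = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def d2_squared(x, y, bandwidth=1.0):
    n, m = len(x), len(y)
    Kx = gaussian_gram(x, x, bandwidth)
    Ky = gaussian_gram(y, y, bandwidth)
    Kxy = gaussian_gram(x, y, bandwidth)
    # K (I - H) = K minus its row means, replicated across columns
    KxC = Kx - Kx.mean(axis=1, keepdims=True)
    KyC = Ky - Ky.mean(axis=1, keepdims=True)
    A = Kxy - Kxy.mean(axis=1, keepdims=True)   # K_xy (I - H_M)
    B = Kxy - Kxy.mean(axis=0, keepdims=True)   # (I - H_N) K_xy; its transpose is K_xy^T (I - H_N)
    term_x = np.sum(KxC * KxC.T) / n ** 2       # Tr((K_x(I - H_N))^2) / N^2
    term_y = np.sum(KyC * KyC.T) / m ** 2       # Tr((K_y(I - H_M))^2) / M^2
    cross = np.sum(A * B) / (n * m)             # Tr(K_xy(I - H_M) K_xy^T (I - H_N)) / (NM)
    return term_x + term_y - 2.0 * cross

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_null = rng.normal(size=(200, 1))              # same distribution as x
y_alt = 1.5 * rng.normal(size=(200, 1))         # same mean, larger variance
print(d2_squared(x, y_null))                    # small
print(d2_squared(x, y_alt))                     # noticeably larger
```

A permutation test as sketched in Appendix D can then be wrapped around this statistic to obtain p-values.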
Example E.2 (Estimating $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{H_k^{\otimes 1} \otimes H_\ell^{\otimes 2}}$). Let $k$ denote the kernel on $X_1$ and $\ell$ denote the kernel on $X_2$. Let $(X,Y), (X',Y'), (X'',Y''), (X^{(3)},Y^{(3)}), (X^{(4)},Y^{(4)}), (X^{(5)},Y^{(5)})$ denote independent copies of $\gamma \in P(X_1 \times X_2)$. The full expression for $\|\kappa^{(1,2)}_{k,\ell}(\gamma)\|^2_{H_k^{\otimes 1} \otimes H_\ell^{\otimes 2}}$ is
\[
\mathbb E\, k(X,X')k(X,X')\ell(Y,Y') - 4\,\mathbb E\, k(X,X')k(X,X'')\ell(Y,Y') - 2\,\mathbb E\, k(X,X')k(X,X')\ell(Y,Y'')
+ 4\,\mathbb E\, k(X,X')k(X,X'')\ell(Y,Y^{(3)})
\]
\[
+ 2\,\mathbb E\, k(X,X')k(X'',X^{(3)})\ell(Y,Y') + 2\,\mathbb E\, k(X,X')k(X'',X^{(3)})\ell(Y,Y^{(3)}) + 4\,\mathbb E\, k(X,X')k(X',X'')\ell(Y,Y^{(3)})
+ \mathbb E\, k(X,X')k(X,X')\ell(Y'',Y^{(3)})
\]
\[
- 8\,\mathbb E\, k(X,X')k(X'',X^{(3)})\ell(Y^{(4)},Y) - 4\,\mathbb E\, k(X,X')k(X',X'')\ell(Y^{(4)},Y^{(3)}) + 4\,\mathbb E\, k(X,X')k(X'',X^{(3)})\ell(Y^{(4)},Y^{(5)}).
\]
Given samples $(x_i, y_i)_{i=1}^N$ from $\gamma$, the corresponding V-statistic for this expression is
\[
\frac{1}{N^2} \sum_{i,j} k(x_i,x_j)k(x_i,x_j)\ell(y_i,y_j)
- \frac{4}{N^3} \sum_{i,j,\kappa} k(x_i,x_j)k(x_i,x_\kappa)\ell(y_i,y_j)
- \frac{2}{N^3} \sum_{i,j,\kappa} k(x_i,x_j)k(x_i,x_j)\ell(y_i,y_\kappa)
+ \frac{4}{N^4} \sum_{i,j,\kappa,l} k(x_i,x_j)k(x_i,x_\kappa)\ell(y_i,y_l)
\]
\[
+ \frac{2}{N^4} \sum_{i,j,\kappa,l} k(x_i,x_j)k(x_\kappa,x_l)\ell(y_i,y_j)
+ \frac{2}{N^4} \sum_{i,j,\kappa,l} k(x_i,x_j)k(x_\kappa,x_l)\ell(y_i,y_l)
+ \frac{4}{N^4} \sum_{i,j,\kappa,l} k(x_i,x_j)k(x_\kappa,x_j)\ell(y_i,y_l)
+ \frac{1}{N^4} \sum_{i,j,\kappa,l} k(x_i,x_j)k(x_i,x_j)\ell(y_\kappa,y_l)
\]
\[
- \frac{8}{N^5} \sum_{i,j,\kappa,l,m} k(x_i,x_j)k(x_\kappa,x_l)\ell(y_m,y_j)
- \frac{4}{N^5} \sum_{i,j,\kappa,l,m} k(x_i,x_j)k(x_\kappa,x_j)\ell(y_m,y_l)
+ \frac{4}{N^6} \sum_{i,j,\kappa,l,m,n} k(x_i,x_j)k(x_\kappa,x_l)\ell(y_m,y_n),
\]
where all indices run from 1 to $N$. Using the shorthand notation $K = K_x = [k(x_i,x_j)]_{i,j=1}^N$, $L = L_y = [\ell(y_i,y_j)]_{i,j=1}^N$ and $H = H_N$, and denoting by $\circ$ the Hadamard product, $[A \circ B]_{i,j} = A_{i,j} B_{i,j}$, and by $\Sigma A = \sum_{i,j=1}^N A_{i,j}$ the sum over all elements of a matrix, the V-statistic above can be written in the simpler form
\[
\frac{1}{N^2}\Big[
\Sigma(K \circ K \circ L) - 4\,\Sigma(K \circ KH \circ L) - 2\,\Sigma(K \circ K \circ LH) + 4\,\Sigma(K \circ KH \circ LH)
+ 2\,\Sigma(K \circ L \circ HKH) + 2\,\Sigma(KH \circ HK \circ L)
\]
\[
+ 4\,\Sigma(K \circ HK \circ LH) + \Sigma(K \circ K \circ HLH)
- 8\,\Sigma(K \circ HKH \circ HL) - 4\,\Sigma(K \circ HK \circ HLH) + 4\,\Sigma(K \circ HKH \circ HLH)
\Big].
\]
Again this estimator can be computed in quadratic time.

Example E.3 (Estimating $\|\kappa^{(3)}_k(\gamma) - \kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}}$). In order to estimate $d^{(3)}(\gamma, \eta)$ we note that one can write
\[
\|\kappa^{(3)}_k(\gamma) - \kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}} = \|\kappa^{(3)}_k(\gamma)\|^2_{H_k^{\otimes 3}} + \|\kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}} - 2\, \big\langle \kappa^{(3)}_k(\gamma), \kappa^{(3)}_k(\eta) \big\rangle_{H_k^{\otimes 3}}.
\]
We can estimate the first two terms as in Example E.2, and the third term can be expressed as
\[
\big\langle \kappa^{(3)}_k(\gamma), \kappa^{(3)}_k(\eta) \big\rangle_{H_k^{\otimes 3}}
= \mathbb E\, k(X,Y)^3 - 3\,\mathbb E\, k(X,Y)^2 k(X,Y') - 3\,\mathbb E\, k(X,Y)^2 k(X',Y) + 6\,\mathbb E\, k(X,Y)k(X,Y')k(X',Y)
+ 3\,\mathbb E\, k(X,Y)^2 k(X',Y')
\]
\[
+ 2\,\mathbb E\, k(X,Y)k(X',Y)k(X'',Y) + 2\,\mathbb E\, k(X,Y)k(X,Y')k(X,Y'')
- 6\,\mathbb E\, k(X,Y)k(X,Y')k(X',Y'') - 6\,\mathbb E\, k(X,Y)k(X',Y)k(X'',Y')
+ 4\,\mathbb E\, k(X,Y)k(X',Y')k(X'',Y''),
\]
where $X, X', X''$ denote independent copies of $\gamma$ and $Y, Y', Y''$ independent copies of $\eta$. For simplicity we will assume that we have an equal number of samples ($N$) from both measures, $(x_i)_{i=1}^N \sim \gamma$ and $(y_i)_{i=1}^N \sim \eta$. The V-statistic for $\langle \kappa^{(3)}_k(\gamma), \kappa^{(3)}_k(\eta) \rangle_{H_k^{\otimes 3}}$ can be expressed as
\[
\frac{1}{N^2} \sum_{i,j} k(x_i,y_j)^3
- \frac{3}{N^3} \sum_{i,j,\kappa} k(x_i,y_j)^2 k(x_i,y_\kappa)
- \frac{3}{N^3} \sum_{i,j,\kappa} k(x_i,y_j)^2 k(x_\kappa,y_j)
+ \frac{6}{N^4} \sum_{i,j,\kappa,l} k(x_i,y_j)k(x_i,y_\kappa)k(x_l,y_j)
+ \frac{3}{N^4} \sum_{i,j,\kappa,l} k(x_i,y_j)^2 k(x_\kappa,y_l)
\]
\[
+ \frac{2}{N^4} \sum_{i,j,\kappa,l} k(x_i,y_j)k(x_\kappa,y_j)k(x_l,y_j)
+ \frac{2}{N^4} \sum_{i,j,\kappa,l} k(x_i,y_j)k(x_i,y_\kappa)k(x_i,y_l)
- \frac{6}{N^5} \sum_{i,j,\kappa,l,m} k(x_i,y_j)k(x_i,y_\kappa)k(x_l,y_m)
- \frac{6}{N^5} \sum_{i,j,\kappa,l,m} k(x_i,y_j)k(x_\kappa,y_j)k(x_l,y_m)
+ \frac{4}{N^6} \sum_{i,j,\kappa,l,m,n} k(x_i,y_j)k(x_\kappa,y_l)k(x_m,y_n),
\]
where all indices run from 1 to $N$. Using the notation $K_{xy} = [k(x_i,y_j)]_{i,j=1}^N$ and $H = H_N$, this estimator simplifies to
\[
\frac{1}{N^2}\Big[
\Sigma(K_{xy} \circ K_{xy} \circ K_{xy}) - 3\,\Sigma(K_{xy} \circ K_{xy} \circ K_{xy}H) - 3\,\Sigma(K_{xy} \circ K_{xy} \circ HK_{xy})
+ 6\,\Sigma(K_{xy} \circ K_{xy}H \circ HK_{xy}) + 3\,\Sigma(K_{xy} \circ K_{xy} \circ HK_{xy}H)
\]
\[
+ 2\,\Sigma(K_{xy} \circ HK_{xy} \circ HK_{xy}) + 2\,\Sigma(K_{xy} \circ K_{xy}H \circ K_{xy}H)
- 6\,\Sigma(K_{xy} \circ K_{xy}H \circ HK_{xy}H) - 6\,\Sigma(K_{xy} \circ HK_{xy} \circ HK_{xy}H)
+ 4\,\Sigma(K_{xy} \circ HK_{xy}H \circ HK_{xy}H)
\Big].
\]
We mention also that the first two terms $\|\kappa^{(3)}_k(\gamma)\|^2_{H_k^{\otimes 3}}$ and $\|\kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}}$ can be computed a little more simply than in Example E.2, since the expressions have more symmetry: using the notation $K_x = [k(x_i,x_j)]_{i,j=1}^N$ we can write down the V-statistic for $\|\kappa^{(3)}_k(\gamma)\|^2_{H_k^{\otimes 3}}$ as
\[
\frac{1}{N^2}\Big[
\Sigma(K_x \circ K_x \circ K_x) - 6\,\Sigma(K_x \circ K_x \circ K_xH) + 4\,\Sigma(K_x \circ K_xH \circ K_xH) + 3\,\Sigma(K_x \circ K_x \circ HK_xH)
+ 6\,\Sigma(K_x \circ K_xH \circ HK_x) - 12\,\Sigma(K_x \circ K_xH \circ HK_xH) + 4\,\Sigma(K_x \circ HK_xH \circ HK_xH)
\Big],
\]
with a similar expression for $\|\kappa^{(3)}_k(\eta)\|^2_{H_k^{\otimes 3}}$. The estimator can be computed in quadratic time.
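To close, here is a sketch (again our own illustrative code, with an assumed Gaussian kernel) that evaluates the Hadamard-product expressions of Example E.3 and combines them into the plug-in estimate of $d^{(3)}(\gamma,\eta)^2$; every term is an elementwise product of $K$ with its row means, column means, or grand mean, so the cost is $O(N^2)$.

```python
# Sketch (our own code, not the authors' implementation) of the quadratic-time V-statistics
# from Example E.3.  Multiplication by H = (1/N) 1 1^T is realized through row means,
# column means and the grand mean, so every term is an elementwise product summed in O(N^2).
import numpy as np

def gaussian_gram(x, y, bandwidth=1.0):
    d2 = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def _parts(K):
    """Return (KH, HK, HKH) in broadcastable form: row means, column means, grand mean."""
    return K.mean(axis=1, keepdims=True), K.mean(axis=0, keepdims=True), K.mean()

def kappa3_sq_norm(K):
    """V-statistic for ||kappa^(3)_k||^2 from a symmetric Gram matrix K."""
    n = K.shape[0]
    KH, HK, HKH = _parts(K)
    s = np.sum
    return (s(K * K * K) - 6 * s(K * K * KH) + 4 * s(K * KH * KH) + 3 * s(K * K * HKH)
            + 6 * s(K * KH * HK) - 12 * s(K * KH * HKH) + 4 * s(K * HKH * HKH)) / n ** 2

def kappa3_cross(Kxy):
    """V-statistic for <kappa^(3)_k(gamma), kappa^(3)_k(eta)> from Kxy = [k(x_i, y_j)]."""
    n = Kxy.shape[0]
    KH, HK, HKH = _parts(Kxy)
    s, K = np.sum, Kxy
    return (s(K * K * K) - 3 * s(K * K * KH) - 3 * s(K * K * HK) + 6 * s(K * KH * HK)
            + 3 * s(K * K * HKH) + 2 * s(K * HK * HK) + 2 * s(K * KH * KH)
            - 6 * s(K * KH * HKH) - 6 * s(K * HK * HKH) + 4 * s(K * HKH * HKH)) / n ** 2

def d3_squared(x, y, bandwidth=1.0):
    """Plug-in estimate of ||kappa^(3)_k(gamma) - kappa^(3)_k(eta)||^2 (equal sample sizes)."""
    Kx, Ky, Kxy = (gaussian_gram(a, b, bandwidth) for a, b in [(x, x), (y, y), (x, y)])
    return kappa3_sq_norm(Kx) + kappa3_sq_norm(Ky) - 2.0 * kappa3_cross(Kxy)

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y_null = rng.normal(size=(300, 1))            # same distribution
y_alt = rng.exponential(size=(300, 1)) - 1.0  # same mean and variance, but skewed
print(d3_squared(x, y_null))                  # small
print(d3_squared(x, y_alt))                   # noticeably larger
```

As with $d^{(2)}$, p-values for the resulting two-sample test would be obtained by wrapping this statistic in the permutation procedure of Appendix D.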