Journal of Machine Learning Research 26 (2025) 1-60 Submitted 2/24; Revised 1/25; Published 2/25

Composite Goodness-of-fit Tests with Kernels

Oscar Key* oscar.key.20@ucl.ac.uk
Centre for Artificial Intelligence, University College London

Arthur Gretton arthur.gretton@gmail.com
Gatsby Computational Neuroscience Unit, University College London

François-Xavier Briol f.briol@ucl.ac.uk
Department of Statistical Science, University College London

Tamara Fernandez* tamara.fernandez@uai.cl
Faculty of Engineering and Science, Adolfo Ibáñez University

Editor: Pierre Alquier

Abstract

We propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on kernel-based distances such as the maximum mean discrepancy. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. We also prove that the popular wild bootstrap will lead to an overly conservative test, and show that the parametric bootstrap is consistent and can lead to significantly improved performance in practice. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of a non-parametric density model, and an intractable generative model of a biological cellular network.

Keywords: bootstrap, hypothesis testing, kernel methods, minimum distance estimation, Stein's method

1. Introduction

Most statistical or machine learning algorithms are based on assumptions about the distribution of the observed data, of auxiliary variables, or of certain estimators. Crucially, the validity of such assumptions directly impacts the performance of these algorithms.
To verify whether such assumptions are reasonable, one approach is goodness-of-fit testing, which considers the problem of rejecting, or not, the hypothesis that some data was generated by a fixed distribution. More precisely, given a distribution P and some observed data x_1, . . . , x_n from some distribution Q, goodness-of-fit tests compare the null hypothesis H_0 : P = Q against the alternative hypothesis H_1 : P ≠ Q.

*. Equal contribution

©2025 Oscar Key, Arthur Gretton, François-Xavier Briol, and Tamara Fernandez. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0276.html.

There is a vast literature on goodness-of-fit testing, and the reader is referred to Chapter 14 of Lehmann and Romano (2005) for a detailed introduction. In the case of univariate real observations, a popular family of tests uses the cumulative distribution function (CDF) as the test statistic, for example the Kolmogorov-Smirnov and Anderson-Darling tests (Kolmogorov, 1933; Anderson and Darling, 1954). Various approaches have been taken to support multivariate observations, including extending these CDF tests (Justel et al., 1997), developing a statistic based on the empirical characteristic function (Jiménez-Gamero et al., 2009), or partitioning the observation space into bins and evaluating a test statistic on the resulting discrete empirical measures (Györfi and van der Meulen, 1991; Beirlant et al., 1994; Györfi and Vajda, 2002; Beirlant et al., 2001). Unfortunately, the applicability of all of these tests is limited, because they rely on statistics that are often difficult to compute for complex models or high-dimensional observation spaces.
To overcome these issues, a class of kernel-based hypothesis tests has been proposed, which construct the statistic through either the maximum mean discrepancy (MMD) (Lloyd and Ghahramani, 2015; Zhu et al., 2019; Balasubramanian et al., 2021; Kellner and Celisse, 2019), the kernel Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016), the energy distance (Székely and Rizzo, 2005, 2013), or the Hilbert-Schmidt independence criterion (Sen and Sen, 2014). These tests can be applied to a wide range of data types by choosing a test statistic based on an appropriate kernel which can be easily evaluated in the setting considered. Additionally, many of these tests are straightforward to apply even for intractable generative models for which no density exists, or for models with unnormalised likelihoods, making them particularly versatile.

In this paper, we consider the more complex question of whether our data comes from any element of some parametric family of distributions. Let {P_θ}_{θ∈Θ} denote a parametric family indexed by a parameter θ in some space Θ. Our test compares the null hypothesis H_0^C : ∃ θ_0 ∈ Θ such that P_{θ_0} = Q, against the alternative hypothesis H_1^C : Q ∉ {P_θ}_{θ∈Θ}. Such tests are known as composite goodness-of-fit tests, and can be much more challenging to construct since θ_0 is usually unknown. They have the potential to answer important questions relating to model misspecification, allowing the user to confirm whether the parametric model they have selected is appropriate for their analysis. Once again there is a significant literature tackling this problem, including tests which are specific to simple parametric families such as Gaussians (Shapiro and Wilk, 1965; Lilliefors, 1967), or composite versions of some of the non kernel-based tests mentioned above (Durbin, 1975; Neyman, 1967).
Unfortunately, these existing tests all suffer from the same drawback as their simple goodness-of-fit counterparts, in that they are only applicable in a limited range of settings. Our paper fills an important gap in the literature by proposing the first set of comprehensive kernel-based composite hypothesis tests. Given a kernel-based discrepancy D, our test statistics take the form Λ := min_{θ∈Θ} n D(P_θ, Q_n). That is, we use the smallest discrepancy between Q_n and any element of the parametric family to determine whether H_0^C should be rejected or not. In this work, we primarily consider the case where D = MMD², and include a full theoretical and empirical analysis of the test's behaviour. This analysis is challenging because, in contrast to existing kernel-based tests, the data now enters the test statistic twice: once to select the closest element of the parametric family, and a second time to estimate the discrepancy. We also include encouraging empirical results for D = KSD², but leave the extension of our theoretical framework to this test for future work.

Our tests extend the advantages of kernel-based tests to the composite setting. In particular, they can be directly applied to both generative and unnormalised models, two wide classes which cannot be tackled in full generality with classical composite tests. Within the kernel testing literature, general composite tests have not yet been proposed. Existing tests have either been limited to specific parametric families, such as multivariate Gaussians (Kellner and Celisse, 2019) or survival models with a specific survival function (Fernández et al., 2020), or required splitting the data set, thus reducing test power (Chwialkowski et al., 2016; Liu et al., 2016). Wolfer and Alquier (2022) also proposed a promising non-asymptotic composite test based on the MMD, but did not study it for generative models nor consider bootstrapping algorithms.
We also show in our experiments that this approach leads to a more conservative test.

The remainder of the paper is organised as follows. Section 2: we recall existing work on testing and parameter estimation with kernels. Section 3: we propose novel kernel-based composite hypothesis tests, and show that the MMD test statistic has well-behaved limiting distributions despite the reuse of the data for both estimation and testing. Section 4: we compare two choices of bootstrap method that we can use to implement these tests in practice. Section 5: we demonstrate these algorithms on a range of problems including an intractable generative model and non-parametric density estimation. Section 6: we discuss limitations of our work. Section 7: we highlight connections with existing tests. The code to reproduce our experiments, and implement the tests for your own models, is available at: https://github.com/oscarkey/composite-tests

2. Background

We begin by reviewing the use of kernel discrepancies for testing and estimation. We denote by X the data space and P(X) the set of Borel probability distributions on X.

2.1 Kernel-based Discrepancies

To measure the similarity of two distributions P, Q ∈ P(X), we can use a discrepancy D : P(X) × P(X) → [0, ∞). We consider discrepancies related to integral probability metrics (IPMs) (Müller, 1997). An IPM indexed by the set of functions F has the form

D_F(P, Q) = sup_{f∈F} |E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)]|,   (1)

where F is sufficiently rich so that D_F is a statistical divergence; i.e. D_F(P, Q) = 0 ⟺ P = Q. A first discrepancy is the maximum mean discrepancy (MMD) (Gretton et al., 2007, 2012). Denote by H_K the reproducing kernel Hilbert space (RKHS) associated to the kernel K : X × X → R (Berlinet and Thomas-Agnan, 2004).
The MMD is obtained by taking F_MMD = {f ∈ H_K : ||f||_{H_K} ≤ 1}, which is a convenient choice because it allows the supremum in Equation 1 to be evaluated in closed form whenever E_{X∼P}[√K(X, X)] < ∞ for all P ∈ P(X):

MMD²(P, Q) = E_{X,X'∼P}[K(X, X')] − 2 E_{X∼P, X'∼Q}[K(X, X')] + E_{X,X'∼Q}[K(X, X')].   (2)

Assuming we have access to independent and identically distributed realisations {x̃_i}_{i=1}^m i.i.d. from P and {x_j}_{j=1}^n i.i.d. from Q, we can define P_m = (1/m) Σ_{i=1}^m δ_{x̃_i} and Q_n = (1/n) Σ_{j=1}^n δ_{x_j} (where δ_x is a Dirac measure at x ∈ X), and we get the following estimate:

MMD²(P_m, Q_n) = (1/m²) Σ_{i,j=1}^m K(x̃_i, x̃_j) − (2/nm) Σ_{i=1}^m Σ_{j=1}^n K(x̃_i, x_j) + (1/n²) Σ_{i,j=1}^n K(x_i, x_j).

Furthermore, when n = m, this simplifies to

MMD²(P_n, Q_n) = (1/n²) Σ_{i,j=1}^n h_MMD((x̃_i, x_i), (x̃_j, x_j)),

where h_MMD((x̃_i, x_i), (x̃_j, x_j)) = K(x̃_i, x̃_j) + K(x_i, x_j) − K(x̃_i, x_j) − K(x̃_j, x_i). This is known as a V-statistic (Gretton et al., 2012), and h_MMD is known as the core of this statistic. The statistic is biased, but has smaller variance than alternative U-statistics for this quantity.

We also consider the kernel Stein discrepancy (KSD) (Oates et al., 2017; Chwialkowski et al., 2016; Liu et al., 2016), which is obtained by applying Stein operators to functions in some RKHS; see Anastasiou et al. (2023) for a recent review. Let X = R^d and H_K^d = H_K × · · · × H_K denote the d-dimensional tensor product of H_K. The most common instance is the Langevin KSD, for which F_KSD = {S_P[f] : ||f||_{H_K^d} ≤ 1}, where S_P[g](x) = g(x)^⊤ ∇_x log p(x) + ∇_x · g(x) is the Langevin Stein operator, p is the Lebesgue density of P, and ∇_x = (∂/∂x_1, . . . , ∂/∂x_d)^⊤. With this choice, we obtain the discrepancy:

KSD²(P, Q) = E_{X,X'∼Q}[h_KSD(X, X')],

where

h_KSD(x, x') = K(x, x') ∇_x log p(x)^⊤ ∇_{x'} log p(x') + ∇_x log p(x)^⊤ ∇_{x'} K(x, x') + ∇_{x'} log p(x')^⊤ ∇_x K(x, x') + ∇_x · ∇_{x'} K(x, x').

Given Q_n = (1/n) Σ_{j=1}^n δ_{x_j}, we obtain a V-statistic KSD²(P, Q_n) = (1/n²) Σ_{i,j=1}^n h_KSD(x_i, x_j). We choose the MMD and the KSD to implement our tests because they are applicable in complementary settings.
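The V-statistic above is straightforward to compute in practice. The following is a minimal NumPy sketch with a Gaussian kernel (the function names are ours, for illustration, and not taken from our released code):

```python
import numpy as np

def gaussian_kernel(x, y, lengthscale=1.0):
    # Pairwise K(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * lengthscale^2)).
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * lengthscale**2))

def mmd2_v(x_p, x_q, lengthscale=1.0):
    # Biased V-statistic estimate of MMD^2 between samples from P and Q.
    k_pp = gaussian_kernel(x_p, x_p, lengthscale)  # first term of Equation 2
    k_pq = gaussian_kernel(x_p, x_q, lengthscale)  # cross term
    k_qq = gaussian_kernel(x_q, x_q, lengthscale)  # last term
    return k_pp.mean() - 2 * k_pq.mean() + k_qq.mean()
```

When the two samples coincide the estimate is exactly zero, and it grows as the samples move apart.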
The MMD can be estimated whenever samples from P_θ are available. This makes the corresponding test particularly suitable for simulator-based models (also called generative models); i.e. models which can be represented through a pair (U, G_θ) such that sampling x ∼ P_θ involves sampling u ∼ U and setting x = G_θ(u). This class of models covers many models widely used in machine learning, including variational autoencoders (Kingma and Welling, 2014) and generative adversarial networks (Li et al., 2015; Dziugaite et al., 2015), but also in the sciences, including in synthetic biology (Bonassi et al., 2011) and telecommunication engineering (Bharti et al., 2021), to name just a few. The interested reader is referred to Cranmer et al. (2020) for a recent review. On the other hand, the KSD requires pointwise evaluations of ∇ log p, and the KSD-based test will therefore be particularly suitable when the density of P_θ is tractable. This includes cases where the density can only be evaluated up to a normalisation constant, such as for deep energy models or nonparametric density estimation models (Canu and Smola, 2006; Fukumizu, 2009; Sriperumbudur et al., 2017; Matsubara et al., 2022), exponential random graphs (Robins et al., 2007), and large spatial models based on lattice structures (Besag, 1974).

Throughout this paper, we will assume that our kernel-based divergences are valid statistical divergences. This can be guaranteed for the MMD by assuming the kernel K is characteristic (Sriperumbudur et al., 2010). For the KSD we require that K is integrally strictly positive definite and E_{X∼Q}[||∇_x log p(X) − ∇_x log q(X)||²] < ∞ (see for example Chwialkowski et al. (2016, Theorem 2.1) and Barp et al. (2019, Proposition 1)). Examples of kernels on R^d which satisfy these assumptions include the Gaussian, inverse multiquadric and Matérn kernels.
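To make the complementarity concrete, the KSD² V-statistic from Section 2.1 can be computed from samples and the score ∇_x log p alone, so an unnormalised density suffices. Below is a hedged NumPy sketch for a Gaussian kernel; the function names are ours, and the closed-form kernel derivatives are specific to the Gaussian kernel chosen here:

```python
import numpy as np

def ksd2_v(x, score_fn, lengthscale=1.0):
    # V-statistic estimate of the Langevin KSD^2 with a Gaussian kernel.
    # score_fn(x) returns grad_x log p(x) with shape (n, d).
    n, d = x.shape
    s = score_fn(x)                                  # scores at each point
    diff = x[:, None, :] - x[None, :, :]             # pairwise x_i - x_j, (n, n, d)
    sqd = np.sum(diff**2, axis=-1)                   # squared distances
    k = np.exp(-sqd / (2 * lengthscale**2))
    l2 = lengthscale**2
    term1 = s @ s.T                                  # s(x_i)^T s(x_j)
    term2 = np.einsum("id,ijd->ij", s, diff) / l2    # s(x_i)^T grad_{x_j} K / K
    term3 = -np.einsum("jd,ijd->ij", s, diff) / l2   # s(x_j)^T grad_{x_i} K / K
    term4 = d / l2 - sqd / l2**2                     # trace term grad.grad K / K
    return (k * (term1 + term2 + term3 + term4)).mean()

def standard_normal_score(x):
    # grad_x log p(x) for p = N(0, I); the normalising constant plays no role.
    return -x
```

With the correct score the statistic is close to zero; evaluating the score of N(0, I) on shifted data yields a much larger value.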
While these conditions ensure that the MMD and KSD can distinguish any pair of distributions as n, m → ∞, for finite samples this is not necessarily the case. The ability of the divergences to distinguish distributions depends primarily on whether an appropriate kernel has been selected for the data in question. Additionally, the KSD suffers from a failure mode when applied to multi-modal distributions, common for methods based on ∇_x log p: if Q contains isolated modes with areas of near-zero density between them, the KSD cannot distinguish Q from a distribution P that is identical except for having a different weighting of the modes (Wenliang and Kanagawa, 2021). We discuss the impacts of these failure modes on our tests in Section 6.

2.2 Goodness-of-fit Testing with Kernels

We now recall how divergences lend themselves naturally to (standard) goodness-of-fit testing. Let D : P(X) × P(X) → [0, ∞) be some statistical divergence, and recall that we are testing H_0 : P = Q. A natural approach is to compute D(P, Q) and check whether it is zero (in which case H_0 holds) or not (in which case H_1 holds). Since we only have access to independent realisations {x_j}_{j=1}^n instead of Q itself, this idealised procedure is replaced by the evaluation of D(P, Q_n). The question then becomes whether or not D(P, Q_n) is further away from zero than we would expect under H_0 given a data set of size n. Kernel-based discrepancies can be computed in a wide range of scenarios, including for intractable models. These divergences also have favourable sample complexity, in that D(P, Q_n) converges to D(P, Q) at a square-root-n rate for both the KSD (Matsubara et al., 2022, Theorem 4) and the MMD (Briol et al., 2019, Lemma 1). The MMD was first used by Gretton et al. (2007, 2012) for two-sample testing, and then studied for goodness-of-fit testing by Lloyd and Ghahramani (2015). MMD tests are closely related to many classical tests; see Sejdinovic et al. (2013). Chwialkowski et al. (2016); Liu et al.
(2016) introduced an alternative test based on KSD²(P, Q_n). To determine when H_0 should be rejected, we select an appropriate threshold c_α ∈ R, which will depend on the level of the test α ∈ [0, 1]. More precisely, c_α should be the (1 − α)-quantile of the distribution of the test statistic under H_0. This distribution will usually be unknown, but can be approximated using a bootstrap method. A common example is the wild bootstrap (Shao, 2010; Leucht and Neumann, 2013), which was specialised for kernel tests by Chwialkowski et al. (2014). We will return to this point in Section 4.

2.3 Minimum Distance Estimation with Kernels

A statistical divergence D can also be used for parameter estimation through minimum distance estimation (Wolfowitz, 1957; Parr and Schucany, 1980). Given a parametric family {P_θ}_{θ∈Θ} ⊆ P(X) indexed by Θ ⊆ R^p and some data {x_i}_{i=1}^n i.i.d. from Q, a natural estimator is

θ̂_n := argmin_{θ∈Θ} D(P_θ, Q_n).

In practice, the minimum is often obtained using numerical optimisation algorithms. Under regularity conditions on D and P_θ, the estimator approaches θ_* := argmin_{θ∈Θ} D(P_θ, Q) as n → ∞. In particular, when the model is well-specified (i.e. H_0^C holds), then θ_* = θ_0. The MMD was first used for parameter estimation for neural networks by Dziugaite et al. (2015); Li et al. (2015); Sutherland et al. (2017); Li et al. (2017); Bińkowski et al. (2018), then studied more broadly as a statistical estimator by Briol et al. (2019); Chérief-Abdellatif and Alquier (2022, 2020); Niu et al. (2021); Dellaporta et al. (2022); Bharti et al. (2023). These estimators closely relate to minimum scoring rule estimators based on the kernel scoring rule (Dawid, 2007). The use of the KSD for parameter estimation was first proposed by Barp et al. (2019), and later extended by Betsch et al. (2020); Grathwohl et al. (2020); Gong et al. (2021); Matsubara et al. (2022). These estimators are also closely related to score-matching estimators (Hyvärinen, 2006); see Barp et al.
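As a toy illustration of minimum MMD estimation: in our experiments we use gradient-based stochastic optimisation, but the idea can be sketched with a simple grid search over θ for a one-dimensional Gaussian location model (all names and the grid-search substitute below are illustrative, not our actual implementation):

```python
import numpy as np

def mmd2(x_p, x_q, ls=1.0):
    # V-statistic estimate of MMD^2 with a Gaussian kernel (1-D data).
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls**2))
    return k(x_p, x_p).mean() - 2 * k(x_p, x_q).mean() + k(x_q, x_q).mean()

def min_mmd_estimate(data, simulate, thetas, m=500, seed=0):
    # Grid-search minimum MMD estimator: argmin_theta MMD^2(P_theta, Q_n).
    # simulate(theta, m, rng) draws m samples from P_theta.
    rng = np.random.default_rng(seed)
    losses = [mmd2(simulate(t, m, rng), data) for t in thetas]
    return thetas[int(np.argmin(losses))]

# Example: Gaussian location model P_theta = N(theta, 1), data from N(2, 1).
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=400)
simulate = lambda t, m, r: r.normal(t, 1.0, size=m)
theta_hat = min_mmd_estimate(data, simulate, np.linspace(-5, 5, 101))
```

Here `theta_hat` lands near the true location 2.0, up to Monte Carlo noise and the grid resolution.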
(2019, Theorem 2) for details.

3. Methodology: The Test Statistic

We are now ready to present our composite goodness-of-fit tests. The first section describes our general framework and studies the asymptotic distribution of the MMD test statistic. In the next section, we then describe how to implement the tests through bootstrapping.

3.1 Composite Goodness-of-fit Testing with Kernels

The high-level idea behind our test statistics is to use the smallest kernel-based discrepancy value between the data Q_n and any element of the parametric family:

Λ := min_{θ∈Θ} n D(P_θ, Q_n) = min_{P∈{P_θ}_{θ∈Θ}} n D(P, Q_n).

To implement the tests, we propose a two-stage approach:

Stage 1 (Estimate): Compute θ̂_n = argmin_{θ∈Θ} D(P_θ, Q_n), so that Λ = n D(P_{θ̂_n}, Q_n).

Stage 2 (Test): Compute or estimate a test threshold c_α from the distribution of the test statistic under H_0^C. If Λ > c_α, reject H_0^C; else, do not reject.

In this paper, we will study two instances of this framework. Our primary focus will be on the MMD composite goodness-of-fit test, where D = MMD². The next section provides regularity conditions so that, as n → ∞, the statistic for this test converges to an infinite sum of χ² random variables under H_0^C but diverges under H_1^C, thus allowing us to distinguish between the two hypotheses. The MMD test is widely applicable since it only requires that we can sample from P_θ.

Figure 1: Illustration of the sources of error when estimating the test statistic under the null hypothesis. Existing non-composite tests use the test statistic n D(P, Q_n), where D is a statistical divergence, and thus encounter only test error. The composite test we introduce uses the statistic n D(P_{θ̂_n}, Q_n), and thus encounters both estimation and test error. (a) Non-composite test. (b) Composite test.

The second instance will be the KSD composite goodness-of-fit test, where D = KSD².
Although we do not provide theoretical guarantees for this case, we do conjecture that similar guarantees exist. This test is particularly interesting since it only requires a tractable (possibly unnormalised) density function for P_θ. In both tests, stage 1 requires computing the solution to the optimisation problem θ̂_n = argmin_{θ∈Θ} D(P_θ, Q_n). The best method for this depends on the model and should be selected alongside it. In this paper, for the MMD test, we use gradient-based stochastic optimisation and automatic differentiation. For the KSD test, we use the closed-form expression for the minimiser, which is available when P_θ is a member of the exponential family of distributions (see Section D.1). However, stochastic optimisation could also be used in order to work with more general families of models.

3.2 Theoretical Analysis of the MMD Composite Goodness-of-fit Test

To present our analysis of the MMD test, we first define some additional notation. Since we will be interested in asymptotic distributions, we will now consider the data to be random variables X_1, . . . , X_n i.i.d. from Q, rather than the realisations x_1, . . . , x_n introduced so far. We then overload Q_n = (1/n) Σ_{i=1}^n δ_{X_i} to define a random measure. This is done in order to simplify notation, and whether we refer to the random or realised Q_n can be inferred from the context. The results we present in this paper are based on the asymptotic distribution of n MMD²(P_{θ̂_n}, Q_n) as n → ∞. This is similar to standard kernel-based goodness-of-fit tests, which use n MMD²(P, Q_n), the main difference being that our test statistic depends on the data through both θ̂_n and Q_n. In other words, we are now using the data twice, once for estimation and once for testing. We note that this creates some non-trivial dependence and makes the derivation of theoretical guarantees significantly more challenging. We first state and discuss our assumptions, then present all results in the following subsection.
3.2.1 Assumptions

We will write C^m(X) for the set of m-times continuously differentiable functions on X, and C_b^m(X) for the subset of C^m(X) where all of the m derivatives are bounded. For bivariate functions, we will denote by C_b^{m,m}(X × X) the set of functions which are C_b^m(X × X) with respect to each input. We can now present our first assumption.

Assumption 1. The observations {x_i}_{i=1}^n are independent and identically distributed realisations from Q.

This assumption is common, and covers a very large class of applications, including all our examples in Section 5. It could be relaxed to dependent data, but this would require the development of specialised tools and notions of dependence which are beyond the scope of this paper; see for example Chérief-Abdellatif and Alquier (2022).

Assumption 2. X ⊆ R^d, X is separable, and Θ is a compact and convex subset of R^p.

These assumptions are relatively mild and are needed for the consistency of our estimators. Even when working with models where the assumptions on Θ are not satisfied, it is often possible to reparameterise the model so that the assumption holds. Our next assumption defines regularity assumptions on the model considered by the MMD test, which will have to be verified on a per-model basis.

Assumption 3. {P_θ}_{θ∈Θ} is an identifiable parametric family of models. Each element P_θ has a density p_θ with respect to a common measure λ which satisfies:
- For each fixed x ∈ X, θ ↦ p_θ(x) is in C³(Θ), and p_θ : X → R is measurable for almost all θ ∈ Θ.
- For any collection ℓ, i, j ∈ {1, . . . , p}, the following holds true:
∫_X |∂p_θ(x)/∂θ_ℓ| λ(dx) < ∞, ∫_X |∂²p_θ(x)/∂θ_ℓ∂θ_i| λ(dx) < ∞, ∫_X |∂³p_θ(x)/∂θ_ℓ∂θ_i∂θ_j| λ(dx) < ∞.

The smoothness assumptions with respect to the parameter are required to ensure convergence of θ̂_n to θ_0 in an appropriate sense under H_0^C. Note that this assumption, together with the fact that Θ is compact, allows us to deduce ∫_X sup_{θ∈Θ} p_θ(x) λ(dx) < ∞.
On the other hand, the smoothness assumptions with respect to X are needed for the divergences to be well defined. In many cases, when applying the MMD test, we will not have access to a computable density function and will only be able to sample from the model. For example, many models are defined in terms of a distribution U on some latent space and a generator function G_θ : U → X, where a sample from the model is x = G_θ(u) with u ∼ U. In this case, the results could be updated to instead place assumptions on U and G_θ rather than on the density, as was done in Briol et al. (2019). We make an additional assumption about the model family under the null hypothesis.

Assumption 4. Under H_0^C, we have that θ_0, the parameter value under the null hypothesis, belongs to the interior of Θ.

For most models this assumption will be met, or could be met by reparameterising the model. Our next assumption relates to the kernel underlying the MMD.

Assumption 5. The kernel K : X × X → R is bounded, continuous and characteristic.

This assumption is mild and will be satisfied by kernels commonly used in the literature on minimum MMD estimators, such as the Gaussian and Matérn kernels of sufficient smoothness. Our final assumption concerns the regularity of the divergences around the minimisers. This is necessary to guarantee convergence of our minimum distance estimators under H_0^C, and is a mild assumption which is common in the literature (see e.g. Theorem 2 in Briol et al. (2019) and Theorem 4 in Barp et al. (2019)).

Assumption 6. The Hessian matrix H ∈ R^{p×p} given by H_ij = ∂²/∂θ_i∂θ_j MMD²(P_θ, P_{θ_0})|_{θ=θ_0} for any i, j ∈ {1, . . . , p} is positive definite.

Remark 1. We could extend the results to the KSD test by noting that the KSD is equivalent to an MMD with a specific kernel (Barp et al., 2019). However, this kernel is often unbounded and depends on p_θ. Thus, additional assumptions on p_θ would be required. We leave this extension to future work.
In our experiments we assume that the KSD is a valid statistical divergence, which holds under the conditions discussed in Section 2.1.

3.2.2 Behaviour of the Test Statistic Under H_0^C and H_1^C

We now present results on the asymptotic distribution of our test statistic, and the power (i.e. probability of correctly rejecting H_0^C when it does not hold) of the associated tests, as n → ∞.

Theorem 2 (Convergence under the null hypothesis). Under H_0^C and Assumptions 1 to 6:

n MMD²(P_{θ̂_n}, Q_n) →_D Σ_{i=1}^∞ λ_i Z_i² as n → ∞,   (3)

where (Z_i)_{i≥1} are a collection of i.i.d. standard normal random variables and (λ_i)_{i≥1} are constants, which depend on the choice of kernel, with Σ_{i=1}^∞ λ_i < ∞.

Theorem 2 guarantees that our test statistic has a well-behaved limiting distribution under the null hypothesis H_0^C. Unfortunately, this distribution is not tractable, but knowing that it exists and is well-behaved is still necessary for guaranteeing that we can approximate quantiles with a bootstrapping method, as will be discussed in Section 4. The result is analogous to Theorem 12 in Gretton et al. (2012), though that result is only valid for testing H_0 vs. H_1 and not H_0^C vs. H_1^C. Our second result concerns the power of the test: when H_1^C holds, we show that our test is able to reject H_0^C when n is large enough.

Theorem 3 (Consistency under the alternative hypothesis). Under Assumptions 1 to 6, we have that H_1^C holds if and only if lim inf_{n→∞} MMD²(P_{θ̂_n}, Q_n) > 0.

As our test statistic, n MMD²(P_{θ̂_n}, Q_n), is scaled by n, the above result implies that the test statistic diverges as n → ∞.

4. Methodology: Practical Implementation via Bootstraps

To implement the tests, we require an estimate of the rejection threshold c_α, which should be set to the (1 − α)-quantile of the distribution of n D(P_{θ̂_n}, Q_n) under H_0^C. Since this distribution is unknown, we estimate the threshold using bootstrap algorithms.
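For intuition about the limiting law in Equation 3: if the weights (λ_i) were known, the (1 − α)-quantile of the (truncated) sum Σ_i λ_i Z_i² could be approximated by simple Monte Carlo, as in the sketch below (the function name and the truncation to finitely many λ_i are ours). In practice the λ_i are not available, which is why we rely on bootstrap methods.

```python
import numpy as np

def weighted_chi2_quantile(lams, alpha=0.05, n_draws=100_000, seed=0):
    # Monte Carlo (1 - alpha)-quantile of sum_i lam_i * Z_i^2, Z_i iid N(0, 1),
    # with the infinite sum truncated to the weights supplied in `lams`.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, len(lams)))
    samples = (np.asarray(lams) * z**2).sum(axis=1)
    return np.quantile(samples, 1 - alpha)
```

For a single weight λ_1 = 1 this recovers the 0.95-quantile of a χ²₁ variable, approximately 3.84.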
In the following section we present two possible choices, the wild bootstrap and the parametric bootstrap, and analyse their behaviour when applied to the MMD test.

4.1 The Wild Bootstrap

A natural first approach is to use the wild bootstrap, since it is the current state-of-the-art for non-composite kernel goodness-of-fit tests (Chwialkowski et al., 2016; Liu et al., 2016; Schrab et al., 2022). The algorithm is based on bootstrapped versions of the test statistics, defined for non-composite tests as

n MMD²_W(P_n, Q_n) := (1/n) Σ_{i,j=1}^n W_i W_j h_MMD((x̃_i, x_i), (x̃_j, x_j)),
n KSD²_W(P, Q_n) := (1/n) Σ_{i,j=1}^n W_i W_j h_{KSD,P}(x_i, x_j),

where W_1, . . . , W_n are i.i.d. Rademacher random variables (i.e. random variables taking values +1 and −1 with probability 1/2 each). Note that the bootstrapped version of the MMD test requires realisations of both P and Q (i.e. P_n and Q_n), whereas the bootstrapped version of the KSD test requires evaluations of the score of the distribution P and samples from Q (denoted through the empirical measure Q_n). Given fixed observations Q_n, it is possible to draw samples Λ^(1), . . . , Λ^(b) of the bootstrapped test statistics by taking fresh realisations of W_1, . . . , W_n for each sample. If we consider an MMD test, under H_0, Chwialkowski et al. (2014, Theorem 1) show that the distributions of n MMD²(P, Q_n) and n MMD²_W(P, Q_n) converge to the same distribution as n → ∞. Thus, we can estimate the threshold c_α during a non-composite test by computing the (1 − α)-quantile of {Λ^(k)}_{k=1}^b. Under H_1, Chwialkowski et al. (2014, Proposition 3.2) show that n MMD²_W(P_n, Q_n) will converge to a finite quantity, while Theorem 3 shows that n MMD²(P_n, Q_n) → ∞. This means that P(n MMD²(P_n, Q_n) > c_α) → 1 as n → ∞, and the test rejects H_0 correctly. We can extend the wild bootstrap to the composite case by first estimating the parameter and then applying the wild bootstrap to the estimated distribution.
Thus, the bootstrapped test statistics are n MMD²_W(P_{θ̂_n,n}, Q_n) and n KSD²_W(P_{θ̂_n}, Q_n), where P_{θ̂_n,n} = (1/n) Σ_{i=1}^n δ_{X̃_i} with X̃_i ∼ P_{θ̂_n}. Algorithm 1 describes this approach for the MMD, and Algorithm 2 for the KSD. Note that the KSD version of the test does not require an additional sampling step, since the KSD can be estimated with a single sample, whereas two different samples are needed to estimate the MMD.

Algorithm 1: Wild bootstrap (MMD)
Input: P_{θ̂_n}, Q_n, α, b
P_{θ̂_n,n} = (1/n) Σ_{i=1}^n δ_{x̃_i}, where x̃_1, . . . , x̃_n i.i.d. from P_{θ̂_n};
for k ∈ {1, . . . , b} do
  w^(k) = (w^(k)_1, . . . , w^(k)_n) i.i.d. Rademacher;
  Λ^(k) = n MMD²_{w^(k)}(P_{θ̂_n,n}, Q_n);
c_α = quantile({Λ^(1), . . . , Λ^(b)}, 1 − α);

Algorithm 2: Wild bootstrap (KSD)
Input: P_{θ̂_n}, Q_n, α, b
for k ∈ {1, . . . , b} do
  w^(k)_1, . . . , w^(k)_n i.i.d. Rademacher;
  Λ^(k) = n KSD²_{w^(k)}(P_{θ̂_n}, Q_n);
c_α = quantile({Λ^(1), . . . , Λ^(b)}, 1 − α);

It is not clear from the existing results in the literature whether applying the wild bootstrap to the composite test in this fashion will result in a test with a type I error rate (i.e. probability of wrongly rejecting H_0^C when it holds) smaller than or equal to α. This is because the composite test statistic contains an additional source of error in comparison to the non-composite statistic. MMD²(P_n, Q_n) contains one source of error, due to estimating MMD²(P, Q_n) through samples. However, MMD²(P_{θ̂_n,n}, Q_n) contains a second source of error, because P_{θ_0}, the true data-generating distribution under H_0^C, is estimated with P_{θ̂_n}. Figure 1 illustrates these sources of error in the non-composite and composite test statistics under the null hypothesis. If we fix an estimate θ̂_n and then apply the wild bootstrap, we fail to take into account this additional source of error. Our main result for this section shows that applying the wild bootstrap does lead to a type I error rate smaller than or equal to α in the MMD test.
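Algorithm 1 can be sketched in a few lines of NumPy for one-dimensional data (a simplified illustration with a Gaussian kernel; the function name and interface are ours):

```python
import numpy as np

def wild_bootstrap_threshold(x_model, x_data, alpha=0.05, b=500, ls=1.0, seed=0):
    # Wild-bootstrap estimate of the threshold c_alpha for the composite MMD test.
    # x_model: n samples from P_theta_hat; x_data: n observations from Q (1-D).
    rng = np.random.default_rng(seed)
    n = len(x_data)
    k = lambda a, c: np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * ls**2))
    # Core matrix h_MMD((x~_i, x_i), (x~_j, x_j)).
    h = (k(x_model, x_model) + k(x_data, x_data)
         - k(x_model, x_data) - k(x_data, x_model))
    stats = np.empty(b)
    for j in range(b):
        w = rng.choice([-1.0, 1.0], size=n)  # Rademacher weights
        stats[j] = w @ h @ w / n             # n * (1/n^2) sum_ij w_i w_j h_ij
    return np.quantile(stats, 1 - alpha)
```

Each bootstrap draw only re-samples the Rademacher weights; the kernel matrix is computed once, which is what makes this bootstrap cheap.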
However, our result shows that the (asymptotic) type I error rate is strictly below α, which is surprising, as resampling strategies usually yield asymptotically-exact type I error rates. This result suggests that our test is conservative and could potentially be improved further.

Theorem 4. Let α ∈ (0, 1), and define

q_α = inf{γ : lim sup_{n→∞} P(n MMD²_W(P_{θ̂_n,n}, Q_n) > γ) ≤ α}.

Under the null hypothesis and Assumptions 1 to 6, it holds that

lim sup_{n→∞} P(n MMD²(P_{θ̂_n}, Q_n) ≥ q_α) < α.

The root cause of the above result is that, under both H_0^C and H_1^C, MMD²(P_{θ̂_n}, Q_n) ≤ MMD²(P_{θ_0}, Q_n), because θ̂_n is chosen as the minimiser of θ ↦ MMD²(P_θ, Q_n). The impact of this on the wild bootstrap is that, for all x ∈ R,

lim sup_{n→∞} P(n MMD²(P_{θ̂_n}, Q_n) > x) < lim sup_{n→∞} P(n MMD²_W(P_{θ̂_n,n}, Q_n) > x),

as we show in the proof of Theorem 4. This means that every quantile of the asymptotic distribution of the wild bootstrap statistic is shifted away from zero compared to the test statistic, as demonstrated in Figure 2. Thus, the threshold computed from the samples of the bootstrapped test statistic is too large, and the resulting test is conservative. We conjecture that similar behaviour occurs under H_1^C, resulting in the loss of power we observe in the experiments (see e.g. Figure 3b). To achieve better performance, we can use an alternative bootstrap which does take account of estimation error.

Figure 2: Distribution of MMD²(P_{θ̂_n}, Q_n) and MMD²_W(P_{θ̂_n,n}, Q_n) under H_0^C, obtained by simulation. The respective (1 − α) quantiles are also shown, demonstrating that the wild bootstrap estimates a conservative threshold.

4.2 The Parametric Bootstrap

We now consider the parametric bootstrap (Stute et al., 1993) presented in Algorithm 3, which is commonly used in the composite testing literature (for example Kellner and Celisse (2019)).
To approximate the distribution of $n\,D(P_{\hat\theta_n}, Q_n)$ under $H^C_0$, this approach first fits the parameter to the observations. It then repeatedly resamples the observations from the fitted model, re-estimates the parameter, and recomputes the test statistic. By repeatedly re-estimating the parameter, the parametric bootstrap takes account of the estimation error.

Algorithm 3: Parametric bootstrap
  Input: $D$, $P_\theta$, $Q_n$, $\alpha$, $b$ (number of bootstrap samples)
  $\hat\theta_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q_n)$;
  for $k \in \{1, \dots, b\}$ do
    $Q^{(k)}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x^{(k)}_i}$, where $x^{(k)}_1, \dots, x^{(k)}_n \overset{\text{iid}}{\sim} P_{\hat\theta_n}$;
    $\hat\theta^{(k)}_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q^{(k)}_n)$;
    compute the bootstrap statistic $n\,D(P_{\hat\theta^{(k)}_n}, Q^{(k)}_n)$;
  $c_\alpha$ = the empirical $(1-\alpha)$-quantile of the $b$ bootstrap statistics;

Algorithm 3 generates samples of the bootstrapped test statistic $n\,D(P_{\hat\theta^*_n}, Q^*_n)$, where $\hat\theta^*_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q^*_n)$, and $Q^*_n = \frac{1}{n}\sum_{i=1}^n \delta_{X^*_i}$ with $X^*_i \sim P_{\hat\theta_n}$. We show that the parametric bootstrap is valid for the MMD test, which once again requires that the distribution of the test statistic $n\,\mathrm{MMD}^2(P_{\hat\theta_n}, Q_n)$ and the distribution of the bootstrapped test statistic $n\,\mathrm{MMD}^2(P_{\hat\theta^*_n}, Q^*_n)$ converge to the same limit as $n \to \infty$. To present the result we require some additional notation, which is defined formally in Section A.1.4: for random variables $A_n$ and $A$, we write $A_n \xrightarrow{\mathcal{D}_D} A$ to indicate that $A_n$ converges to $A$ in distribution conditioned on the data $\{X_i\}_{i=1}^n$.

Theorem 5 (Parametric bootstrap under the null). Under $H^C_0$ and Assumptions 1 to 6, we have that $n\,\mathrm{MMD}^2(P_{\hat\theta^*_n}, Q^*_n) \xrightarrow{\mathcal{D}_D} \Gamma$, where $\Gamma$ is a random variable with distribution given by Equation 3, i.e. the distribution of the test statistic under $H^C_0$.

Note that the above convergence holds conditioned on any sequence of observations, and thus holds in general. In Section 5 we find empirically that applying the parametric bootstrap to our tests results in a good type I error rate and better performance than the wild bootstrap when $n$ is small.
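The structure of Algorithm 3 can be sketched generically as below. This is a minimal illustration under our own assumptions: `fit`, `sample` and `statistic` are user-supplied stand-ins for the minimum distance estimator, the model sampler, and the statistic $n\,D(P_\theta, Q_n)$, and the function names are hypothetical.

```python
import numpy as np

def parametric_bootstrap_test(X_obs, fit, sample, statistic, alpha=0.05, b=200, seed=0):
    """Sketch of Algorithm 3.
    fit(X) -> theta_hat (a minimum distance estimator in the paper),
    sample(theta, n, rng) -> n draws from P_theta,
    statistic(theta, X) -> n * D(P_theta, empirical measure of X)."""
    rng = np.random.default_rng(seed)
    n = len(X_obs)
    theta_hat = fit(X_obs)
    stat = statistic(theta_hat, X_obs)
    boot = np.empty(b)
    for k in range(b):
        Xk = sample(theta_hat, n, rng)    # resample observations from the fitted model
        theta_k = fit(Xk)                 # re-estimate the parameter ...
        boot[k] = statistic(theta_k, Xk)  # ... and recompute the statistic
    c_alpha = np.quantile(boot, 1 - alpha)
    return stat > c_alpha, stat, c_alpha
```

Re-fitting inside the loop is exactly what accounts for the estimation error, and also what makes this bootstrap more expensive than the wild bootstrap.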
However, the parametric bootstrap is substantially more computationally intensive, because it requires repeatedly computing a minimum distance estimator (and hence the kernel matrix) on fresh data, whereas the wild bootstrap only uses a single minimum distance estimator. Assuming we compute the minimum distance estimators through $T$ steps of a numerical optimiser based on function evaluations or gradients, the computational complexity of the wild bootstrap algorithms is $O((T + b)n^2)$, whilst that of the parametric bootstrap is $O(Tbn^2)$. We therefore expect the parametric bootstrap to be significantly more expensive than the wild bootstrap. However, this extra computation is likely an issue only for large $n$, and in this high-data regime we find in our experiments that the wild bootstrap achieves similar power to the parametric bootstrap. Thus, in a high-data regime there is little penalty to using the cheaper wild bootstrap. Note that the cost could be further reduced to be linear in $n$ through alternative estimates of the MMD (see e.g. Lemma 14 of Gretton et al. (2012)), but this would likely lead to substantially lower test power.

5. Empirical Results

In this section, we study the performance of our tests in a range of settings. We consider three examples: (i) a multivariate Gaussian location and scale model, (ii) a generative model of the interaction of genes through time called a toggle-switch model, and (iii) an unnormalised nonparametric density model. For all experiments we set the test level $\alpha = 0.05$, and give full implementation details in Section D.

5.1 Gaussian Model

We begin by exploring the different configurations of our test, with regard to the underlying discrepancy and the bootstrap method, for various Gaussian models. Unless otherwise stated, we use a Gaussian kernel with lengthscale computed using the median heuristic (Gretton et al., 2007).
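The median heuristic referred to here is commonly implemented as follows; this is a standard formulation rather than the paper's exact code, and the function name is ours.

```python
import numpy as np

def median_heuristic_lengthscale(X):
    """Set the Gaussian-kernel lengthscale to the median pairwise distance
    between observations (the heuristic of Gretton et al., 2007)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)   # distinct pairs only (exclude the zero diagonal)
    return np.median(dists[iu])
```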
We test Gaussianity by considering $H^C_0: P_\theta = N(\mu, \Sigma)$ with unknown parameters $\theta = \{\mu, \Sigma\}$, for $\mu \in \mathbb{R}^d$ and positive-definite $\Sigma \in \mathbb{R}^{d\times d}$. We start with $d = 1$, and test against a particular alternative where $Q$ is a Student's t-distribution with 2 degrees of freedom. Figure 3 shows the type I error rate and power of the MMD and KSD tests using both bootstraps as $n$ increases. Considering the MMD test under $H^C_0$, the test using the wild bootstrap converges to a conservative type I error rate as $n$ becomes large, while the test using the parametric bootstrap has the correct rate. This is consistent with our analysis of the bootstraps in Section 4. Under $H^C_1$, the parametric bootstrap has higher power, once again due to the wild bootstrap being conservative, but both bootstraps achieve a power of one once $n$ is large enough. For these results, $m$, the number of samples from $P_\theta$, was set to $m = n$ as a balance between computational cost and power. Better power could be achieved by increasing $m$, although this would increase the computational cost. The figure also shows the same set of results for the KSD test. It has similar convergence behaviour, suggesting that our theoretical results may also hold for the KSD.

Figure 3: Performance of the MMD and KSD tests as $n$ increases, where $H^C_0: P = N(\mu, 1)$. (a) Type I error rate under $H^C_0$, where $Q = N(\mu_0, 1^2)$. (b) Power under $H^C_1$, where $Q$ is a Student's t-distribution with 2 degrees of freedom. Lines show the parametric bootstrap, the wild bootstrap, and the level; error bars show one standard error across 4 random seeds.

Figure 4: Power of the parametric bootstrap tests as $d$ increases, for (a) the MMD test and (b) the KSD test. $H^C_0: P = N(\mu, \sigma^2)$; $H^C_{1a}$ and $H^C_{1b}$ indicate that $Q$ is a multivariate Student's t-distribution with 3 and 5 degrees of freedom respectively.
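The KSD statistic underlying these tests can be computed directly from the model score, without sampling from $P_\theta$. The following is a minimal one-dimensional V-statistic sketch under our own simplifying assumptions (Gaussian kernel, fixed lengthscale); it is not the paper's implementation, and the closed-form estimator of $\theta$ itself (Section D.1) is a separate ingredient not shown here.

```python
import numpy as np

def ksd2_gaussian_kernel(X, score, lengthscale=1.0):
    """V-statistic estimate of KSD^2(P, Q_n) for 1-d data with a Gaussian kernel.
    score(x) = d/dx log p(x) is the only model quantity needed, so neither
    samples from P nor its normalising constant are required."""
    x = np.asarray(X, dtype=float).reshape(-1, 1)
    d = x - x.T                                  # pairwise differences x_i - x_j
    l2 = lengthscale ** 2
    K = np.exp(-d ** 2 / (2 * l2))
    s = score(x.ravel()).reshape(-1, 1)
    # Stein kernel u_p(x, y) = s(x)s(y)K + s(x) dK/dy + s(y) dK/dx + d2K/dxdy
    dK_dy = (d / l2) * K
    dK_dx = (-d / l2) * K
    d2K = (1.0 / l2 - d ** 2 / l2 ** 2) * K
    U = s * s.T * K + s * dK_dy + s.T * dK_dx + d2K
    return U.mean()
```

Because the Stein kernel is positive semi-definite, this V-statistic is non-negative, and it grows when the data are inconsistent with the model score.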
Comparing the MMD and KSD tests under $H^C_1$, we note that the KSD test has higher power for any given $n$. This is both because the KSD does not require sampling from $P_\theta$, and because the MMD estimation procedure requires stochastic optimisation, whereas for the KSD we use a closed-form estimator. Second, we examine the performance of the parametric bootstrap tests as $d$, the dimensionality of the observations, increases. We test against the family of multivariate Gaussians with known covariance, $H^C_0: P = N(\mu_d, I_{d\times d})$ with unknown parameter $\theta = \mu_d \in \mathbb{R}^d$. We consider two alternatives, $H^C_{1a}$ and $H^C_{1b}$, where we generate data from multivariate t-distributions with $\nu = 3$ and $\nu = 5$ degrees of freedom respectively. As $\nu$ becomes large, the observed data become indistinguishable from a Gaussian, hence we expect the tests to have lower power against larger $\nu$. We set the scale parameter $\Sigma = I_{d\times d}(\nu - 2)/\nu$, so that the samples have covariance $I$ and the distance between the samples does not grow more rapidly than under $H^C_0$ as $d$ increases. Figure 4 shows that both tests maintain the type I error rate of 0.05 under $H^C_0$. Under $H^C_{1a}$ and $H^C_{1b}$ we find that the power of the MMD test increases with $d$, while the power of the KSD test decreases with $d$.

Figure 5: Comparison of the MMD tests against Wolfer and Alquier (2022). $H^C_0: P = N(\mu, 1)$ and $Q = N(\mu_0, \sigma^2)$, with $n = 200$; the power of the wild bootstrap, parametric bootstrap, and Wolfer and Alquier tests is shown as the scale $\sigma$ of $Q$ varies, together with the level. Error bars show one standard error across 4 random seeds.

Next, we further examine the power of the MMD tests when using the wild and parametric bootstraps. We also include the non-asymptotic MMD test introduced by Wolfer and Alquier (2022) for comparison. We take $H^C_0$ to be a family of one-dimensional Gaussians with known variance, specifically $P = N(\mu, \sigma^2)$, where $\theta = \mu \in \mathbb{R}$ and $\sigma^2 = 1$. We generate data from $N(\mu_0, \sigma^2)$, where $\mu_0 \in \mathbb{R}$ is an arbitrary mean, and compute the power of the tests for different values of $\sigma^2$.
An ideal test would have power close to $\alpha = 0.05$ when $\sigma^2 = 1$, and power close to one otherwise. Figure 5 shows the results, with the parametric bootstrap once again having higher power than the wild bootstrap. The test introduced by Wolfer and Alquier (2022) achieves lower power than both our tests with either bootstrap.

We also consider the time taken by each bootstrap method. For $n = 100$ and $d = 1$, we found that each wild bootstrap test took approximately 1ms, while each parametric bootstrap test took approximately 5ms. These figures depend heavily on the implementation and hardware details (as given in Section D). In particular, we execute all bootstrap samples in parallel on a GPU, which is particularly beneficial for the parametric bootstrap, where each sample has a high computational cost. However, this will not be possible for larger models, because it requires keeping $b$ copies of the model parameters in memory. Additionally, it would be possible to further optimise the bootstrap implementations. Despite this, these approximate numbers illustrate that the parametric bootstrap test is much more expensive. Thus, for large $n$, where the two bootstraps achieve similar power (see Figure 3b), it may be better to use the cheaper wild bootstrap.

Figure 6 explores the robustness of the test to corrupted observations under Huber's contamination model (Huber and Ronchetti, 2011), as previously explored in the testing setting by Liu and Briol (2024) and Schrab and Kim (2024). Once again we consider $H^C_0: P = N(\mu_d, I_{d\times d})$, for $\mu_d \in \mathbb{R}^d$. We assume that $H^C_0$ holds, but that some observations have been corrupted with a large amount of noise. Thus, we generate data from $Q = (1-\epsilon)\,N(0_d, I_{d\times d}) + \epsilon\,N(10_d, 0.2 I_{d\times d})$, where $0 \le \epsilon \le 1$ controls the percentage of observations that are corrupted. A robust test would maintain the correct type I error rate as $\epsilon$ grows. We evaluate the MMD test and the KSD test with a Gaussian kernel (as defined in Appendix C.2). We also include a robust KSD test using the tilted kernel $\tilde K(x, x') = w(x) K(x, x') w(x')$, where $K$ is the Gaussian kernel, $w(x) = (1 + a^2 \|x\|_2^2)^{-b}$, and we choose $a = 1$, $b = \frac{1}{2}$ (Barp et al., 2019; Liu and Briol, 2024). The results on one-dimensional observations, $d = 1$, show that the MMD test is robust up to 2-3% corruption, the KSD test with Gaussian kernel is not robust, and the KSD test with tilted kernel is robust to more than 6% corruption. All the tests become less robust as $d$ increases.

Figure 6: Type I error rate of the tests under corrupted observations, as the corruption percentage $\epsilon$ and the dimension $d$ vary. $H^C_0: P = N(\mu_d, \Sigma_{d\times d})$ and $Q = (1-\epsilon)P + \epsilon\,N(10_d, 0.2 I_{d\times d})$, with $n = 200$. Lines show the MMD, the KSD, the KSD with robust kernel, and the level; error bars show one standard error across 4 random seeds.

To complete the experiments on Gaussian models, we compare the performance of our tests to existing specialised tests of Gaussianity. We consider three nonparametric tests: the Anderson-Darling, Lilliefors (the Kolmogorov-Smirnov test specialised to Gaussians), and Shapiro-Wilk tests. We also consider D'Agostino's $K^2$ parametric test (D'Agostino and Pearson, 1973).

Figure 7: Comparison of the MMD and KSD tests (parametric bootstrap) with the Shapiro-Wilk (S-W), Lilliefors (Lllfrs), Anderson-Darling (A-D) and D'Agostino's (D'Agso) tests, as the degrees of freedom $\nu$ of $Q$ vary. $H_0: P = N(\mu, \sigma^2)$, and $Q$ is a series of Student's t-distributions. Two kernel lengthscales are shown for our tests: $l = m$ (the median heuristic) and $l = 2$ (selected via grid search). The level is also shown.

Figure 8: Left: evolution of the $u_t$ variable of the toggle-switch model, starting from the initial state $u_0 = 10$; each line shows a different random evolution of the model. Right: histogram of the resulting values of $y$.
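The tilted kernel and the contaminated sampling distribution used in this robustness experiment can be sketched as follows; the Gaussian base kernel and lengthscale default are our illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def tilted_gaussian_kernel(X, Y, lengthscale=1.0, a=1.0, b=0.5):
    """K~(x, y) = w(x) K(x, y) w(y), with w(x) = (1 + a^2 ||x||_2^2)^(-b).
    The weight w downweights kernel evaluations at outlying points."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * lengthscale ** 2))
    w = lambda Z: (1.0 + a ** 2 * (Z ** 2).sum(-1)) ** (-b)
    return w(X)[:, None] * K * w(Y)[None, :]

def sample_contaminated(n, d, eps, rng):
    """Draw from Q = (1 - eps) N(0_d, I) + eps N(10_d, 0.2 I) (Huber contamination)."""
    corrupt = rng.random(n) < eps
    X = rng.normal(0.0, 1.0, size=(n, d))
    X[corrupt] = rng.normal(10.0, np.sqrt(0.2), size=(int(corrupt.sum()), d))
    return X
```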
For this comparison we return to testing $H^C_0: P_\theta = N(\mu, \sigma^2)$, for unknown parameters $\theta = \{\mu, \sigma^2\}$, in one dimension. We perform the test against data generated from a Student's t-distribution, and vary the degrees of freedom $\nu$. Once again, we expect the test power to fall to $\alpha = 0.05$, the specified level of the test, as $\nu$ becomes large. Figure 7 shows the results of this experiment, for both tests using the parametric bootstrap and for two choices of lengthscale. The MMD test has substantially lower power than the other tests for both lengthscales. The KSD test is an improvement, having performance closer to the specialised tests when setting the lengthscale with the median heuristic. Increasing $n$ improves the power of each test, but does not change the ordering. These results reveal that we should prefer tests specifically designed for the model in question where possible, as opposed to our generally applicable tests. The advantage of our tests is that they are applicable beyond Gaussianity, including for unnormalised or generative models. They are also applicable to multivariate data, which is not the case for the Anderson-Darling or Lilliefors tests. In this plot we also demonstrate the importance of choosing the kernel. We include the power of our test for a better choice of $l = 2$, which we found by sampling additional observations and performing a grid search. For this kernel, the KSD test has comparable power to the specialised tests. However, in practice we cannot use this method to select the kernel, as we are not able to sample additional observations. Future work could look at better strategies than the median heuristic for selecting the bandwidth parameter in composite tests.

5.2 Toggle-Switch Model

We now consider a toggle-switch model, which is a generative model with unknown likelihood and hence suitable for our composite MMD tests.
The model describes how the expression levels of two genes, $u$ and $v$, interact in a sample of cells (Bonassi et al., 2011; Bonassi and West, 2015). Sampling from the model involves two coupled discretised-time equations. Denote by $P^{\mathrm{ts}}_{\theta,T}$ a model with parameters $\theta = (\alpha_1, \alpha_2, \beta_1, \beta_2, \mu, \sigma, \gamma)$ and a discretisation consisting of $T$ steps. To sample $y \sim P^{\mathrm{ts}}_{\theta,T}$, we sample $y \sim \tilde N(\mu + u_T,\ \mu\sigma/u_T^\gamma)$, where
$u_{t+1} \sim \tilde N(\mu_{u,t}, 0.5)$, with $\mu_{u,t} = u_t + \alpha_1/(1 + v_t^{\beta_1}) - (1 + 0.03\,u_t)$,
$v_{t+1} \sim \tilde N(\mu_{v,t}, 0.5)$, with $\mu_{v,t} = v_t + \alpha_2/(1 + u_t^{\beta_2}) - (1 + 0.03\,v_t)$,
and $u_0 = v_0 = 10$. In the above, $\tilde N$ indicates a Gaussian distribution truncated to give non-negative realisations. Figure 8 demonstrates how this sampling process works, showing the evolution of $u_t$ as $t$ increases. To fit the model, we use stochastic optimisation; see Section D.3.3. Existing work considers the model to have converged for $T = 300$, but using such a large value can be computationally expensive, so a practitioner may wonder whether it is possible to use a smaller value. We apply our parametric bootstrap test to this scenario by generating $n = 500$ observations from $P^{\mathrm{ts}}_{\theta_0,300}$, and then testing $H^C_0$: there exists $\theta$ such that $P^{\mathrm{ts}}_{\theta,T} = P^{\mathrm{ts}}_{\theta_0,300}$, for values of $1 \le T \le 50$. To generate the observations we follow Bernton et al. (2019) and use $\theta_0 = (22, 12, 4, 4.5, 325, 0.25, 0.15)$. Figure 9 shows the result of the test, revealing that the test is unlikely to reject $H^C_0$ for $T \ge 40$. In Figure 10 we give examples of the fit of the model for small and large $T$.

Figure 9: Fraction out of 20 repeats for which the MMD test (parametric bootstrap) rejects $H^C_0: P = P^{\mathrm{ts}}_{\theta,T}$, comparing against data generated from $Q = P^{\mathrm{ts}}_{\theta_0,300}$.

Figure 10: Fit of the toggle-switch model for $T = 10$ (for which $H^C_0$ is rejected) and $T = 50$ (for which it is not). The densities are generated using kernel density estimation on 500 samples.
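The sampling process above can be sketched as a simple NumPy simulator. This is our own illustration, assuming the parameter ordering given in the text; in particular, the truncated Gaussians are approximated here by clipping at a small positive floor (an assumption; exact truncated sampling would instead redraw negative values).

```python
import numpy as np

def sample_toggle_switch(theta, T, n, rng):
    """Draw n observations from a sketch of the toggle-switch model P^ts_{theta,T}.
    theta = (alpha1, alpha2, beta1, beta2, mu, sigma, gamma)."""
    a1, a2, b1, b2, mu, sigma, gamma = theta
    u = np.full(n, 10.0)   # u_0 = 10
    v = np.full(n, 10.0)   # v_0 = 10
    for _ in range(T):
        mu_u = u + a1 / (1.0 + v ** b1) - (1.0 + 0.03 * u)
        mu_v = v + a2 / (1.0 + u ** b2) - (1.0 + 0.03 * v)
        # Gaussian steps with std 0.5, clipped to stay positive (truncation approximation)
        u = np.clip(mu_u + 0.5 * rng.normal(size=n), 1e-3, None)
        v = np.clip(mu_v + 0.5 * rng.normal(size=n), 1e-3, None)
    y = mu + u + (mu * sigma / u ** gamma) * rng.normal(size=n)
    return np.clip(y, 0.0, None)
```

With $\theta_0 = (22, 12, 4, 4.5, 325, 0.25, 0.15)$ this produces samples qualitatively like the histogram in Figure 8, though the clipping approximation means it is not an exact reimplementation.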
5.3 Density Estimation with a Kernel Exponential Family Model

We now consider the kernel exponential family $p_\theta(x) \propto q_{\mathrm{kef}}(x)\exp(f(x))$, where $f$ is an element of some RKHS $\mathcal{H}_\kappa$ with Gaussian kernel $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with lengthscale $l_{\mathrm{kef}}$, and $q_{\mathrm{kef}}$ is a reference density (Canu and Smola, 2006). Following Matsubara et al. (2022), we work with a finite-rank approximation $f(x) = \sum_{i=1}^p \theta_i \phi_i(x)$, where $\theta \in \Theta = \mathbb{R}^p$ and the $\phi_i$ are basis functions depending on $l_{\mathrm{kef}}$. Matsubara et al. (2022) use this model for a density estimation task on a data set of velocities of 82 galaxies (Postman et al., 1986; Roeder, 1990), and use the approximation with $p = 25$ for computational convenience. However, an open question is whether $p = 25$ is sufficient to achieve a good fit to the data. We propose to use our composite KSD test to answer this question. Figure 11 shows the fit of the model for increasing values of $p$, and whether the test rejected $H^C_0$. We use a sum of IMQ kernels; see Section D.2 for a discussion of this choice. We find that the test rejects $H^C_0$ for $p = 1, 2, 3$, but does not reject for $p = 4, 5, 25$, with the wild and parametric bootstraps producing similar results. This suggests that $p = 25$ is a suitable choice, though it would also be reasonable to use a smaller value of $p$, which would further decrease the computational cost of inference.

Figure 11: Fit of the kernel exponential family model on the galaxies data set. The grey histogram shows the data set, while the coloured lines show the density of the fitted model with varying $p$. Dashed lines indicate that the wild bootstrap test rejected $H^C_0$, while solid lines indicate that it failed to reject.

6. Limitations

While we find that our composite tests allow us to answer much more complex questions than existing kernel tests, we also observed some limitations, which we discuss below. The performance of any kernel-based test depends heavily on the choice of kernel, as this determines the ability of the discrepancy to distinguish between distributions when observations are finite. In our composite tests, this issue has a larger effect on the power of the test, since we use the discrepancy not only as a test statistic but also for parameter estimation. For example, in Appendix B, we demonstrate how a very poor choice of kernel can result in biased estimates of the parameter and thus a type I error rate larger than the level. In practice, we should therefore consider our method as a test of both the model and the performance of the estimator, rather than of the model alone. We consider this to be desirable behaviour since, in practice, a parametric model is only ever as useful as our ability to estimate its parameters. In the non-composite case, several techniques have been developed to address kernel selection, including the power-maximising kernel (Sutherland et al., 2017), Sup-MMD (Fukumizu et al., 2009), and the aggregated procedure of Schrab et al. (2023, 2022). Future work could investigate how these procedures could be adapted for composite tests. As an initial trial, in Appendix C we apply the aggregated procedure to our composite KSD test, finding that it sometimes achieves a small increase in power, and sometimes a small decrease. Future work could also consider kernels specialised to structured data such as time series, graphs, and images. A further issue that can arise is that the numerical optimiser implementing the estimator may fail to converge to a global optimum (for example, because the objective is non-convex and the optimiser only finds local minima). This failure mode is also illustrated in Appendix B.
Once again, this means we should consider our method as a test of the fitting method, in addition to the model itself. All tests based on the KSD can suffer from type II errors due to its inability to distinguish between multi-modal distributions that differ only in the weight of each mode, and this limitation also applies to our test. The limitation can be observed in Figure 11, where, for $p = 25$, the model allocates too much weight to the left mode, yet the test still fails to reject $H^C_0$. However, as $n \to \infty$ this error will no longer occur. A final limitation of the tests is due to the difficulty of approximating $c_\alpha$ for models which are expensive to fit, due to a large number of parameters and/or the size of the associated data sets. In those cases, the wild bootstrap might not provide a good enough approximation of $c_\alpha$, whilst the high computational cost of the parametric bootstrap (which requires fitting the model many times) may be prohibitive. Alternative bootstrap algorithms could be developed for this task; see for example the work of Kojadinovic and Yan (2012).

7. Connections with Existing Tests

Connections between (non-composite) kernel-based and classical tests are well studied; see Sejdinovic et al. (2013) for an extensive discussion showing that MMD tests are equivalent to tests based on the energy distance when using an appropriate kernel. Naturally, similar results also hold for composite tests, and we sketch out some of these connections below.

7.1 Likelihood-ratio Tests

The first connection is with likelihood-ratio tests (Lehmann and Romano, 2005, Section 12.4.4), which compare a null model $p_{\theta_0}$ and an alternative parametric model $q_\gamma$. Interestingly, we will show in this section that our tests are intimately linked to likelihood-ratio tests where the alternative model is parameterised in some reproducing kernel Hilbert space.
Likelihood-ratio tests are based on the score statistic $S_n(\gamma; f) = \frac{\partial}{\partial\gamma}\frac{1}{n}\sum_{i=1}^n \log q_\gamma(x_i)$, which is optimised over $\gamma$ via the maximum likelihood estimator (MLE). Now consider the alternative model given by $q_{f,\gamma}(x) \propto \exp(\gamma f(x))\,p_{\theta_0}(x)$ for some $\gamma \in \mathbb{R}$. This model is indexed by the function $f$, and consists of a perturbation of $p_{\theta_0}$ of size $\gamma$, in the sense that $q_{f,\gamma} = p_{\theta_0}$ when $\gamma = 0$. In this case,
$S_n(\gamma; f) = \frac{1}{n}\sum_{i=1}^n f(x_i) - \frac{\int_{\mathcal{X}} f(x)\exp(\gamma f(x))p_{\theta_0}(x)\,dx}{\int_{\mathcal{X}} \exp(\gamma f(x))p_{\theta_0}(x)\,dx}$,
and $S_n(0; f) = \frac{1}{n}\sum_{i=1}^n f(x_i) - \int_{\mathcal{X}} f(x)p_{\theta_0}(x)\,dx$. If the data are generated from $p_{\theta_0}$, the MLE of $\gamma$ will be close to zero, and thus $S_n(0; f) \approx 0$. Conversely, $|S_n(0; f)| > 0$ implies that, for some value $\gamma \ne 0$, the perturbation model is a better fit for $Q_n$ than $P_{\theta_0}$. This choice of alternative model allows us to recover the MMD and KSD by considering suitably defined perturbations in some RKHS:
$\sup_{f \in \mathcal{F}_{\mathrm{MMD}}} S_n(0; f) = \mathrm{MMD}(P_{\theta_0}, Q_n)$, and $\sup_{f \in \mathcal{F}_{\mathrm{KSD}}} S_n(0; f) = \mathrm{KSD}(P_{\theta_0}, Q_n)$,
where $\mathcal{F}_{\mathrm{MMD}}$ and $\mathcal{F}_{\mathrm{KSD}}$ were defined in Section 2. Therefore, MMD and KSD tests are likelihood-ratio tests with an alternative model parameterised in some RKHS. Of course, the argument above also holds if $\theta_0$ is replaced by $\hat\theta_n$. As a result, our composite tests can be thought of as a composite version of likelihood-ratio tests where $\hat\theta_n$ is a minimum distance estimator based on the MMD or KSD.

7.2 Tests Based on Characteristic Functions

The composite goodness-of-fit tests in this paper are also closely connected with hypothesis tests based on distances between characteristic functions. Recall that the characteristic function of a distribution $Q$ is defined as $\varphi_Q(\omega) = \mathbb{E}_{X\sim Q}[\exp(i X^\top \omega)]$. The following result shows that using the MMD is equivalent to comparing a weighted $L^2$ distance between characteristic functions, where the weight depends on the choice of kernel.

Theorem 6 (Theorem 3 and Corollary 4, Sriperumbudur et al. (2010)). Let $\mathcal{X} = \mathbb{R}^d$ and suppose $K(x, y) = \psi(x - y)$, where $\psi: \mathbb{R}^d \to \mathbb{R}$ is a bounded and continuous positive definite function.
Following Bochner's theorem, $\psi$ is the Fourier transform of a finite nonnegative Borel measure $\Lambda$, i.e. $\psi(x) = \int_{\mathbb{R}^d} \exp(-i x^\top \omega)\Lambda(d\omega)$, and hence the MMD can be expressed as
$\mathrm{MMD}^2(P, Q) = \int_{\mathbb{R}^d} |\varphi_P(\omega) - \varphi_Q(\omega)|^2\,\Lambda(d\omega)$.

There are a number of composite tests based on weighted distances between characteristic functions. One example is the work of Henze and Zirkler (1990), which considers testing for multivariate Gaussians with unknown mean and covariance. This test therefore uses a distance equivalent to the MMD with a specific choice of kernel. However, that paper makes use of maximum likelihood estimation rather than a minimum MMD estimator. Interestingly, this means their test is equivalent to that of Kellner and Celisse (2019), although the latter paper does not make this connection and instead proposes the test of Henze and Zirkler (1990) as a competitor in its numerical experiments. Note that this correspondence extends to the functional data case; see Wynne and Nagy (2021) for more details. Another related work is that of Koutrouvelis and Kellermeier (1981), who focus on (possibly non-Gaussian) univariate distributions and consider an approximation of the weighted $L^2$ distance between empirical characteristic functions. Here, the approximation consists of discretising the integral in the definition of the $L^2$ distance, and this discrepancy can therefore be thought of as an approximation of the MMD with a specific choice of kernel. Their approach does, however, have an additional level of approximation, due to the discretisation of the integral, which is unnecessary once the connection with the MMD is noted. Note that Koutrouvelis and Kellermeier (1981) used the same discrepancy for both estimation and testing. The connection between tests based on kernels and characteristic functions can also be extended to the case of the KSD, albeit through modified characteristic functions.
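Theorem 6 can be checked numerically in one dimension. The sketch below compares the closed-form Gaussian-kernel MMD between $P = N(0,1)$ and $Q = N(1,1)$ with a quadrature approximation of the characteristic-function integral; the lengthscale $l = 1.5$ and the integration grid are arbitrary choices of ours.

```python
import numpy as np

# Numerical check of Theorem 6 in 1-d: P = N(0, 1), Q = N(1, 1), and the Gaussian
# kernel psi(x) = exp(-x^2 / (2 l^2)), whose spectral measure Lambda is N(0, 1/l^2).
l = 1.5

# Kernel side: closed-form population MMD^2 between the two Gaussians.
c = l / np.sqrt(l ** 2 + 2.0)                  # E[k(X, X')] for X, X' ~ N(0, 1)
mmd2_kernel = 2.0 * c * (1.0 - np.exp(-1.0 / (2.0 * (l ** 2 + 2.0))))

# Characteristic-function side: integral of |phi_P - phi_Q|^2 against Lambda.
w = np.linspace(-12.0, 12.0, 200001)
dw = w[1] - w[0]
lam = np.exp(-0.5 * (l * w) ** 2) * l / np.sqrt(2.0 * np.pi)   # N(0, 1/l^2) density
integrand = np.exp(-w ** 2) * (2.0 - 2.0 * np.cos(w))          # |phi_P(w) - phi_Q(w)|^2
mmd2_cf = np.sum(integrand * lam) * dw
```

Here $\varphi_P(\omega) = e^{-\omega^2/2}$ and $\varphi_Q(\omega) = e^{i\omega - \omega^2/2}$, so $|\varphi_P(\omega) - \varphi_Q(\omega)|^2 = e^{-\omega^2}(2 - 2\cos\omega)$; the two computed quantities agree to numerical precision, as the theorem predicts.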
To state the result, define $(\mathcal{S}^l_{P,x}g)(x) = g(x)\,(\partial \log p(x)/\partial x_l) + \partial g(x)/\partial x_l$ for a sufficiently regular scalar-valued test function $g$ and $l \in \{1, \dots, d\}$. For some fixed $l$, the modified characteristic function of a distribution $Q$ takes the form $\varphi^l_Q(\omega) = \mathbb{E}_{X\sim Q}\big[\mathcal{S}^l_{P,x}\big[\exp(i X^\top \omega)\big]\big]$.

Theorem 7 (Simplified version of Theorem 4.1, Wynne et al. (2022)). Let $\mathcal{X} = \mathbb{R}^d$ and suppose $K(x, y) = \psi(x - y)$, where $\psi: \mathbb{R}^d \to \mathbb{R}$ is a bounded and continuous positive definite function, so that $\psi(x) = \int_{\mathbb{R}^d} \exp(-i x^\top \omega)\Lambda(d\omega)$. Then the KSD can be expressed as
$\mathrm{KSD}^2(P, Q) = \int \big|\varphi^l_Q(\omega)\big|^2\,\Lambda(d\omega) = \int \big|\varphi^l_Q(\omega) - \varphi^l_P(\omega)\big|^2\,\Lambda(d\omega)$.

Ebner (2021) proposed a test of normality which uses a test statistic exactly of the form in Theorem 7. Their test implicitly assumes that $\Lambda$ has a density, and they study the impact of the choice of density on the performance of the test. Through the theorem above, we can see that this is equivalent to varying the choice of kernel $K$ indexing the KSD.

8. Conclusion

This paper proposes and studies two kernel-based tests to verify whether a given data set is a realisation from any element of some parametric family of distributions. This was achieved by combining existing kernel hypothesis tests with recently developed minimum distance estimators, using either the MMD or KSD. We also studied two bootstrap algorithms to implement these tests: a wild bootstrap algorithm, which is suitable in large-data regimes, and a parametric bootstrap, which is suitable in smaller-data regimes. A number of limitations were mentioned in Section 6, and these could each be addressed in future work. For example, the issue that the tests perform poorly when the minimum distance estimators provide a poor parameter estimate could be mitigated by allowing for the use of alternative estimators specialised for the models at hand. This is straightforward to do in practice, but would require an extension of our theoretical results.
Relatedly, the performance of the composite tests is dependent on the choice of an appropriate divergence, usually through the choice of kernel. This issue could be mitigated by using the aggregate approach of Schrab et al. (2023, 2022) within our composite tests, which removes the need to carefully select a kernel. Additionally, while our paper focused on Euclidean data, our MMD and KSD tests could straightforwardly be generalised to more complex settings. For example, the cases of discrete (Yang et al., 2018), manifold (Xu and Matsuda, 2020), censored (Fernández and Gretton, 2019) or time-to-event (Fernández et al., 2020) data could all be covered by our general methodology. Furthermore, models such as point processes (Yang et al., 2019), latent variable models (Kanagawa et al., 2019) or exponential random graph models (Xu and Reinert, 2021) could also be tackled in this way. Finally, we envisage that our approach could be extended to more complex testing questions. For example, the framework could be extended to relative tests, where instead of testing whether data comes from a fixed parametric model, the relative fit of several parametric models would be compared (Bounliphone et al., 2016; Jitkrittum et al., 2018; Lim et al., 2019). Alternatively, we could extend our methodology to construct composite tests for conditional distributions (Jitkrittum et al., 2020).

Acknowledgments

We thank Antonin Schrab for his helpful instructions on implementing the aggregated version of the test, as presented in Appendix C. OK and FXB acknowledge support from the Engineering and Physical Sciences Research Council with grant numbers EP/S021566/1 and EP/Y011805/1 respectively. TF was supported by the Data Observatory Foundation - ANID Technology Center No. DO210001 and ANID FONDECYT grant No. 11221143. Arthur Gretton acknowledges support from the Gatsby Charitable Foundation.
Contents of the Appendices
A Theoretical Results for MMD
  A.1 Definitions
  A.2 Preliminary Results
  A.3 Auxiliary Results
  A.4 Proofs of Main Results
  A.5 Proofs of Auxiliary Results
B Illustration of Limitations
C Aggregated Composite Test
D Experiment Details
  D.1 Closed-Form KSD Estimator
  D.2 Kernels
  D.3 Details of Specific Figures

Appendix A. Theoretical Results for MMD

A.1 Definitions

A.1.1 General Notation

In this section we make some additional definitions which will be used in the proofs. We denote by $Q_n$ the empirical measure based on the sample $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} Q$, and we denote by $Q^*_n$ the empirical measure based on the parametric bootstrap samples $X^*_1, \dots, X^*_n \overset{\text{i.i.d.}}{\sim} P_{\hat\theta_n}$ given $\hat\theta_n$.

1. We define the functions $L_n: \Theta \to \mathbb{R}$ and $L: \Theta \to \mathbb{R}$ by $L_n(\theta) = \mathrm{MMD}^2(P_\theta, Q_n)$ and $L(\theta) = \mathrm{MMD}^2(P_\theta, Q)$. (4) Similarly, for the parametric bootstrap, we define $L^*_n: \Theta \to \mathbb{R}$ by $L^*_n(\theta) = \mathrm{MMD}^2(P_\theta, Q^*_n)$. (5) Note then that $\hat\theta_n = \arg\min_{\theta\in\Theta} L_n(\theta)$ and $\hat\theta^*_n = \arg\min_{\theta\in\Theta} L^*_n(\theta)$.

2. We denote by $H \in \mathbb{R}^{p\times p}$ the Hessian matrix given by $H_{ij} = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} L(\theta_0)$ for any $i, j \in \{1, \dots, p\}$.

3. We define the function $g: \mathcal{H} \times \Theta \to \mathbb{R}$ by $g(\omega, \theta) = \mathbb{E}_{X\sim P_\theta}(\omega(X))$.

4. We denote by $\mu_\theta$, $\mu_n$ and $\mu^*_n \in \mathcal{H}$, respectively, the kernel mean embeddings of the distribution $P_\theta$ and of the empirical distributions $Q_n$ and $Q^*_n$, given by $\mu_\theta(x) = \int_{\mathcal{X}} K(x, y)p_\theta(y)\lambda(dy)$, $\mu_n(x) = \frac{1}{n}\sum_{i=1}^n K(x, X_i)$, and $\mu^*_n(x) = \frac{1}{n}\sum_{i=1}^n K(x, X^*_i)$.

5.
We define $\varphi: \mathcal{X} \times \Theta \to \mathbb{R}^p$ by
$\varphi(x, \theta) = \nabla_\theta \mu_\theta(x) - \int_{\mathcal{X}} (\nabla_\theta \mu_\theta(y))\,p_\theta(y)\lambda(dy)$, (6)
where $\nabla_\theta \mu_\theta: \mathcal{X} \to \mathbb{R}^p$ is such that $(\nabla_\theta \mu_\theta(x))_i = \frac{\partial}{\partial\theta_i}\mu_\theta(x)$ for $i \in \{1, \dots, p\}$.

6. We define $\eta: \mathcal{H} \times \mathcal{X} \times \Theta \to \mathbb{R}$ by
$\eta(\omega, x, \theta) = \omega(x) - g(\omega, \theta) - 2\langle \nabla_\theta g(\omega, \theta_0),\ H^{-1}\varphi(x, \theta)\rangle$, (7)
where $\nabla_\theta g: \mathcal{H} \times \Theta \to \mathbb{R}^p$ is such that $(\nabla_\theta g(\omega, \theta))_i = \frac{\partial}{\partial\theta_i}g(\omega, \theta)$ for $i \in \{1, \dots, p\}$.

7. We define the functionals $S_n: \mathcal{H} \to \mathbb{R}$ and $S^*_n: \mathcal{H} \to \mathbb{R}$ by
$S_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X_i, \theta_0)$, and $S^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X^*_i, \hat\theta_n)$, (8)
where $\eta$ is defined in Equation 7.

Remark 8 (Bound for RKHS functions). Observe that for any $\omega \in \mathcal{H}$ and $x \in \mathcal{X}$, $|\omega(x)| = |\langle \omega, K_x\rangle_{\mathcal{H}}| \le \|\omega\|_{\mathcal{H}}\|K_x\|_{\mathcal{H}} = \|\omega\|_{\mathcal{H}}\sqrt{K(x,x)}$, where the inequality is due to the Cauchy-Schwarz inequality. By Assumptions 1 to 6, the kernel is bounded by some constant $C > 0$. Thus $\sup_{x\in\mathcal{X}} |\omega(x)| \le \|\omega\|_{\mathcal{H}}\sqrt{C}$.

Remark 9 (Interchange of integration and differentiation). Observe that for any fixed $\omega \in \mathcal{H}$, $g(\omega, \cdot) \in C^3(\Theta)$ and, moreover, $D_\theta g(\omega, \theta) = \int_{\mathcal{X}} \omega(x)\,D_\theta p_\theta(x)\lambda(dx)$, where $D_\theta$ is any derivative up to order three. The previous result holds by an application of Lebesgue's dominated convergence theorem, since $p_\theta(x) \in C^3(\Theta)$ for all $x \in \mathcal{X}$, and by the integrability conditions of Assumption 3. Similarly, $\mu_\theta(x) \in C^3(\Theta)$ for all $x \in \mathcal{X}$, and $D_\theta\mu_\theta(x) = \int_{\mathcal{X}} K(x, y)\,D_\theta p_\theta(y)\lambda(dy)$. Moreover, by the integrability conditions of Assumption 3, and since the kernel is bounded by a constant $C > 0$ (Assumption 5), it holds that
$\sup_{\theta\in\Theta}\sup_{x\in\mathcal{X}} \left|\int_{\mathcal{X}} K(x, y)\,D_\theta p_\theta(y)\lambda(dy)\right| \le C \int_{\mathcal{X}} \sup_{\theta\in\Theta} |D_\theta p_\theta(y)|\,\lambda(dy) < \infty$. (9)

A.1.2 Test Statistic

In our proofs we consider an alternative form of the definition of the MMD, which is equivalent to that in Equation 2:
$n\,\mathrm{MMD}^2(P_\theta, Q_n) = \sup_{\omega\in\mathcal{H}:\,\|\omega\|_{\mathcal{H}}=1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \big[\omega(X_i) - g(\omega, \theta)\big]\right)^2$.

A.1.3 Wild Bootstrap

In order to analyse the asymptotic behaviour of the wild bootstrap test statistic, we write it in terms of a supremum as follows:
$n\,\mathrm{MMD}^2_W(P_{\hat\theta_n,n}, Q_n) := \frac{1}{n}\sum_{i,j=1}^n W_i W_j\, h_{\mathrm{MMD}}((X_i, \tilde X_i), (X_j, \tilde X_j)) = \sup_{\omega\in\mathcal{H}:\,\|\omega\|_{\mathcal{H}}=1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n W_i\big(\omega(X_i) - \omega(\tilde X_i)\big)\right)^2$,
where recall that $X_1, \dots, X_n$ are i.i.d. $Q$, $\tilde X_1, \dots, \tilde X_n$ are i.i.d. $P_{\hat\theta_n}$ given $\hat\theta_n$, and $W_1, \dots, W_n$ are i.i.d.
Rademacher are independent of everything. A.1.4 Parametric bootstrap To obtain the results for the parametric bootstrap, we condition on the data X1, . . . , Xn. To this end, we use the following notation: PD( ) = P( |X1, . . . , Xn), ED( ) = E( |X1, . . . , Xn), Var D( ) = Var( |X1, . . . , Xn), and Cov D( , ) = Cov( , |X1, . . . , Xn). Define an = op D(1) if, for each ε > 0, it holds PD(|an| ε) P 0 when n . Define an = Op D(1) if for any ε > 0 there exists M > 0 such that P ({PD(|an| > M) ε}) 1. Given a random variable a, define an DD a (i.e. an converges in distribution to a given the data points X1, . . . , Xn) if and only if |ED(f(an) f(a))| P 0, for any bounded, uniformly continuous real-valued f. A.2 Preliminary Results We state some preliminary results that will be used in our proofs. The first corresponds to Theorem 1 of Fernández and Rivera (2022), and it will be used to prove Theorems 2 and 5. Composite Goodness-of-fit Tests with Kernels Theorem 10. Let (Sn)n 1 be a sequence of bounded linear test-statistics satisfying conditions G0-G2 (stated below). Then sup ω H: ω H=1 S2 n(ω) D i=1 λi Z2 i , where Z1, Z2, . . . are i.i.d. standard normal random variables, and λ1, λ2, . . . are the eigenvalues of the operator Tσ : H H defined by (Tσω)(x) = σ(ω, Kx), where σ : H H R is the bilinear form of Condition G0. We state conditions G0-G2. Condition G0 There exists a continuous bilinear form σ : H H R such that for any m N, the bounded linear test-statistic Sn(w1 + + wm) converges in distribution to a normal random variable with mean 0 and variance given by Pm i=1 Pm j=1 σ(wi, wj). Condition G1 For some orthonormal basis (ψi)i 1 of H we have P i 1 σ(ψi, ψi) < . Condition G2 For some orthonormal basis (ψi)i 1 of H, and for any ε > 0, we have that lim i lim sup n P k=i+1 S2 n(ψk) ε The following result appears in Nickl (2012). Theorem 11 (Theorem 1, Nickl (2012)). Suppose that Θ Rp is compact (i.e. bounded and closed). 
Assume that L : Θ R is a deterministic function with L C0(Θ), and that θopt is the unique minimiser of L. If sup θ Θ |Ln(θ) L(θ)| P 0 as n grows to infinity, then any solution ˆθn of minθ Θ Ln(θ) satisfies ˆθn P θopt as n . A.3 Auxiliary Results The following collection of lemmas and propositions will be used in the proofs of our main results. Their proofs are provided in Section A.5. Lemma 12. Under Assumptions 1 to 6 and HC 0 it holds that n MMD2(Pˆθn, Qn) = n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) + op(1), where φ : X Θ Rp is defined in Equation 6, and H is the Hessian matrix defined in Section A.1.1. Key*, Gretton, Briol, and Fernandez* Proposition 13 (Normal distribution for the Wild Bootstrap statistic). Define Zn : H R by Zn(ω) = 1 n i=1 Wi(ω(Xi) ω( Xi)), where X1, . . . , Xn i.i.d. Q and X1, . . . , Xn i.i.d. Pˆθn given ˆθn, and W1, . . . , Wn i.i.d. Rademacher that are independent of everything else. Then, Zn(ω) DD N(0, ν(ω, ω)), where ν(ω, ω ) = 2 Z (ω(x) g(ω, θ0))(ω (x) g(ω , θ0))pθ0(x)λ(dx). (10) Proposition 14. Under Assumptions 1 to 6 and the null hypothesis it holds that ˆθn P θ0. Proposition 15. i=1 φ(Xi, θ), and θL n(θ) = 2 i=1 φ(X i , θ). Moreover, under Assumptions 1 to 6 the following items hold. i) EX Pθ(φ(X, θ)) = 0, and ED(φ(X , ˆθn)) = 0, where X Pˆθn given ˆθn, ii) supθ Θ supx Rd φ(x, θ) < , and iii) n θLn(θ0) = Op(1) holds under the null hypothesis, and n θL n(ˆθn) = Op(1). Proposition 16. Under Assumptions 1 to 6 the following items hold. i) For any fixed θ Θ and ω H, EX Pθ(η(ω, X, θ)) = 0, ii) there exists constants C1 > 0 and C2 > 0 such that for any fixed θ Θ and ω H, EX Pθ(η2(ω, X, θ)) C1EX Pθ ω2(X) + C2 θg(ω, θ0) 2 , iii) there exists constants C1 > 0 and C2 > 0 that do not depend on ω, x nor θ, such that for all ω H, it holds supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H < . Lemma 17. Under Assumptions 1 to 6 and the null hypothesis, it holds that for any ℓ, j {1, . . . 
, p} and ε > 0 θℓ θj µθ(Xi) EX Pθ0 θℓ θj µθ(X) θℓ θj µθ(X i ) EX Pθ0 θℓ θj µθ(X) ε as n grows to infinity. Composite Goodness-of-fit Tests with Kernels Proposition 18. Define Hn and H n by θi θj Ln( θj) and (H n)ij = 2 θi θj L n( θj ) for any i, j {1, . . . , p}, where Ln and L n are defined in Equations 4 and 5, respectively, θ1, . . . , θp lie between θ0 and ˆθn, and θ1 , . . . , θp lie between ˆθn and ˆθ n. Then, under Assumptions 1 to 6 and the null hypothesis it holds Hn H = op(1) and H n H = op D(1) as n grows to infinity. Proposition 19. Under Assumptions 1 to 6 and the null hypothesis, the following holds. i) n(θ0 ˆθn) = n H 1 θLn(θ0) + op(1), and ii) n(ˆθn θ n) = n H 1 θL n(ˆθn) + op(1). Lemma 20. Under Assumptions 1 to 6, it holds i) n(ˆθn θ n) = Op(1), and ii) θ n θ0 = op(1) holds under the null hypothesis. Proposition 21. Under the null hypothesis and Assumptions 1 to 6, we have that i=1 ω(Xi) g(ω, ˆθn) = Sn(ω) + op(1), holds for every ω H, where Sn : H R is defined in Equation 8 and the error term does not depend on ω. Proposition 22. Consider Sn = (Sn(ω1), . . . , Sn(ωm)), where ωℓ H for each ℓ [m] and fixed m N. Then, under the null hypothesis, and Assumptions 1 to 6, we have Sn D Nm(0, Σ) as n grows to infinity, where for any i, j [m] we have Σij = σ(ωi, ωj) := EX Pθ0 (η(ωi, X, θ0)η(ωj, X, θ0)) , (13) and the function η is defined in Equation 7. The next two results are analogous to Proposition 21 and Proposition 22 but for the parametric bootstrap test-statistic. Note that these results are obtained conditioned on the data X1, . . . , Xn. Proposition 23. Under the null hypothesis and Assumptions 1 to 6, it holds that i=1 ω(X i ) g(ω, θ n) = S n(ω) + op(1), where S n : H R is defined in Equation 8. Proposition 24. Consider S n = (S n(ω1), . . . , S n(ωm)), where ωℓ H for each ℓ [m] and m N is fixed. 
Then, under the null hypothesis, and Assumptions 1 to 6, we have S n D Nm(0, Σ) as n grows to infinity, where Σ is defined in Equation 13 Proposition 25. Let (ψi)i 1 be any orthonormal basis of H. Then under Assumptions 1 to 6 it holds that P i 1 θg(ψi, θ0) 2 < . Key*, Gretton, Briol, and Fernandez* A.4 Proofs of Main Results This section contains proofs of results given in the main paper. Proof of Theorem 2 By the definition of the maximum mean discrepancy, it holds n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 γ2 n(ω), where γn(ω) = 1 n i=1 ω(Xi) g(ω, ˆθn). Proposition 21 yields γn(ω) = Sn(ω) + op(1), where Sn : H R is the functional defined in Equation 8, and the error term does not depend on ω. Then, by using the fact that supω H: ω H=1 S2 n(ω) converges in distribution (which will be proved next) we obtain n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 S2 n(ω) + op(1). By Slutsky s theorem, we deduce that n MMD2(Pˆθn, Qn) and supω H: ω H=1 S2 n(ω) have the same asymptotic null distribution. It is not difficult to verify that Sn(ω) = 1 n Pn i=1 η(ω, Xi, θ0) is bounded and linear in H. The linearity follows from the fact that for any fixed x X and θ Θ, η(ω, x, θ) is linear on its argument ω. Boundedness follows from item iii of Proposition 16 since supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H where C1 > 0 and C2 > 0 are constants that do not depend on ω. Then, we can use Theorem 10 to study the asymptotic null distribution of supω H: ω H=1 Sn(ω)2 and thus obtain the result. To apply Theorem 10 we need to verify conditions G0, G1 and G2, which are stated in Section A.2. Condition G0 follows directly from Proposition 22, where σ : H H R is given by σ(ω, ω ) := EX Pθ0(η(ω, X, θ0)η(ω , X, θ0)) for any ω, ω H. To verify condition G1 consider an orthonormal basis (ψi)i 1 of H and observe that by Proposition 16.ii. there exists constants C1 > 0 and C2 > 0 such that for all i 1 it holds σ(ψi, ψi) = EX Pθ0(η2(ψi, X, θ0)) < C1EX Pθ0 ψi2(X) + C2 θg(ψi, θ0) 2 . 
i 1 σ(ψi, ψi) C1 X i 1 EX Pθ0 ψi2(X) + C2 X i 1 θg(ψi, θ0) 2 < . (14) where the last inequality holds since K(x, x) = P i 1 ψ2 i (x), and the kernel is bounded by our assumptions, and by Proposition 25. Finally, we proceed to verify condition G2. For this, let ε > 0, and note k i S2 n(ψk) ε k i ε 1E S2 n(ψk) = ε 1 X k i EX Pθ0(η2(ψk, X, θ0)) k i σ(ψk, ψk), where first inequality is due to Markov s inequality, the subsequent equality follows from the definition of Sn(ω) = 1 n Pn i=1 η(ω, Xi, θ0) given in Equation 8, and the last equality is due Composite Goodness-of-fit Tests with Kernels to the definition of σ. Finally, notice that lim i lim sup n P k i Sn(ψk)2 ε k i σ(ψk, ψk) = 0, where the limit is due to Equation 14. Proof of Theorem 3 Observe that MMD(Pˆθn, Qn) = µˆθn µn H µˆθn µQ H µQ µn H , where µQ is the kernel mean embedding of Q given by µQ(x) = R K(x, y)q(y)λ(dy). Note that limn µQ µn H = 0 a.s. by the standard law of large numbers for RKHS in Chen and Zhu (2011, Theorem 1). Next, we will prove that µˆθn µQ H > 0 a.s. We proceed by contradiction. Suppose that lim infn µˆθn µQ H = 0 a.s., then there exists a collection of indices (a(n))n 1 such that the subsequence µbθa(n) satisfies µbθa(n) µQ 2 H = 0 a.s. (15) Additionally, note that since the set Θ is compact, there exists a subsequence (bθa(b(n)))n 1 and θ Θ such that limn bθa(b(n)) θ = 0, a.s. By Assumptions 3 and 5, the kernel K is bounded, pθ(x) C3(Θ) for each fixed x X, and R X supθ Θ pθ(x)λ(x)dx < . Therefore, by the Lebesgue s dominated convergence theorem we deduce µbθa(b(n)) µθ 2 X K(x, y) pbθa(b(n))(x) pθ (x) pbθa(b(n))(y) pθ (y) λ(x)λ(y)dxdy when n grows to infinity. Then, by the triangle inequality and Equations 15 and 16 we have µθ µQ H µθ µbθa(b(n)) H + µbθa(b(n)) µQ H 0 a.s. Thus µθ µQ H = MMD(Q, Pθ ) = 0, and since the kernel is characteristic by Assumption 5, we conclude Q = Pθ , but this is a contradiction since under the alternative hypothesis Q Pθ. 
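The wild bootstrap statistic of Section A.1.3 is straightforward to compute from kernel matrices. Below is a minimal numerical sketch, assuming a Gaussian kernel on one-dimensional samples; the function names and defaults are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    # Gaussian (RBF) kernel matrix: K[i, j] = exp(-(a_i - b_j)^2 / (2 l^2))
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * lengthscale**2))

def wild_bootstrap_mmd(x, x_tilde, n_boot=200, lengthscale=1.0, rng=None):
    """Wild-bootstrap replicates of n * MMD^2 between x ~ Q and
    x_tilde ~ P_{theta_hat}, using Rademacher weights W_i."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    # h[i, j] = K(x_i, x_j) - K(x_i, x~_j) - K(x~_i, x_j) + K(x~_i, x~_j)
    h = (gaussian_kernel(x, x, lengthscale)
         - gaussian_kernel(x, x_tilde, lengthscale)
         - gaussian_kernel(x_tilde, x, lengthscale)
         + gaussian_kernel(x_tilde, x_tilde, lengthscale))
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher weights
        stats[b] = w @ h @ w / n             # (1/n) sum_{i,j} W_i W_j h_ij
    return stats
```

Each replicate draws fresh Rademacher weights and evaluates the quadratic form (1/n) ∑_{i,j} W_i W_j h_ij, which matches the supremum form of the statistic given in Section A.1.3.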
Proof of Theorem 4 Observe that n MMD2 W (Pˆθn,n, Qn) = sup ω H: ω H=1 Z2 n(ω), (17) Key*, Gretton, Briol, and Fernandez* where Zn : H R is defined Zn(ω) = 1 n Pn i=1 Wi(ω(Xi) ω( Xi)). It is not difficult to see that Zn is linear on its argument ω, and that Zn is bounded since Assumptions 1 to 6 yield supx X |ω(x)| ω H supx X p K(x, x) < . Then, the asymptotic distribution of the wild bootstrap test-statistic n MMD2 W (Pˆθn,n, Qn) can be obtained from Theorem 10 and Equation 17, by verifying that Zn satisfies conditions G0-G2, which we proceed to verify. Observe that Condition G0 follows directly from Proposition 13, where we obtain that Zn(ω1 + . . . + ωm) DD N j=1 ν(ωi, ωj) for any m N and ω1, . . . , ωm H, where ν : H H R is given by ν(ω, ω ) = 2 Z (ω(x) g(ω, θ0))(ω (x) g(ω , θ0))pθ0(x)λ(dx). To verify Condition G1, we consider some orthonormal basis (ψi)i 1 of H, and verify that P i 1 ν(ψi, ψi) < holds. Note that i 1 ν(ψi, ψi) = X i 1 2 Z (ψi(x) g(ψi, θ0))2pθ0(x)λ(dx) i 1 ψi(x)2pθ0(x)λ(dx) = 2 Z K(x, x)pθ0(x)λ(dx) < , where the first inequality follows from the fact that the variance of a random variable is upper bounded by its second moment, and the last inequality holds from the fact that P i 1 ψ2 i (x) = K(x, x), and that the kernel is bounded by Assumption 5. Finally, note that limi lim supn PD P k i Z2 n(ψk) ε = 0 verifies Condition G2. By Markov s inequality k i Z2 n(ψk) ε k i ED(Z2 n(ψk)), and note that k i ED(Z2 n(ψk)) = 1 j=1 Wj(ψk(Xj) ψk( Xj)) j=1 Wj(ψk(Xj) ψk( Xj)) 2 ( Xℓ)n ℓ=1 j=1 (ψk(Xj) ψk( Xj))2 k i ED ψ2 k( X1) Composite Goodness-of-fit Tests with Kernels where the first equality follows from replacing Zn by its definition, the third equality follows from noticing that conditioned on the data (Xi, Xi)n i=1, Zn is the sum of independent random variables, E(Wi|(Xi, Xi)n i=1) = 0 and W 2 i = 1. Finally, note that the last inequality holds from noticing that (a b)2 2(a2 + b2) for any a, b R, and ( Xi)n i=1 are i.i.d. give the data (Xi)n i=1. 
By using the previous set of equations, we obtain lim i lim sup n PD k i Z2 n(ψk) ε lim i lim sup n 2 ε + lim i lim sup n 2 ε k i ψ2 k(x)pˆθn(x)λ(dx) We shall prove the previous equation is zero. For the first term observe that lim i lim sup n 2 ε lim i lim sup n 2 nε k i ψ2 k(Xj) = lim i 2 ε k i ψ2 k(x)pθ0(x)λ(dx) where the first inequality follows by Jensen s inequality, the first equality follows from the law of large numbers since P k i ψ2 k(x) P k 1 ψ2 k(x) = K(x, x) and the kernel is bounded by Assumption 5. The same argument, combined with Lebesgue s dominated convergence theorem, explains the last equality. For the second term, the Lebesgue s dominated convergence theorem yields lim i lim sup n 2 ε k i ψ2 k(x)pˆθn(x)λ(dx) = lim i 2 ε k i ψ2 k(x)pθ0(x)λ(dx) = 0. Since conditions G0-G2 are satisfied, then Theorem 10 yields that n MMD2 W (Pˆθn,n, Qn) DD i=1 λ i Z 2 i , (18) where Z 1, . . . , Z n are standard normal i.i.d. random variables, and λ 1, λ 2, . . . , are the eigenvalues of the operator Tν : H H given by (Tνω)(x) = ν(ω, Kx). On the other hand, observe that the wild bootstrap for the goodness-of-fit problem satisfies that n MMD2 W (Pθ0, Qn) DD i=1 λi Z2 i , (19) Key*, Gretton, Briol, and Fernandez* where Z1, . . . , Zn are standard normal i.i.d. random variables, and λ1, λ2, . . . , are the eigenvalues of the operator Tσ : H H given by (Tσω)(x) = σ(ω, Kx) where σ = 1 2ν (hence the eigenvalues are half of those in Equation 18). 
Therefore, we deduce that for every x R, lim sup n P(n MMD2 W (Pˆθn,n, Qn) x) = lim sup n P n MMD2 W (Pθ0, Qn) x > lim sup n P n MMD2 W (Pθ0, Qn) x (20) Replacing x = qα in the previous equation yields α = lim sup n P(n MMD2 W (Pˆθn,n, Qn) qα) > lim sup n P n MMD2 W (Pθ0, Qn) qα = lim sup n P n MMD2(Pθ0, Qn) qα , (21) where the first equality is due to the definition of qα, the first inequality holds by Equation 20 and the last equality holds since n MMD2(Pθ0, Qn) and n MMD2 W (Pθ0, Qn) have the same asymptotic null distribution (Chwialkowski et al., 2014, Theorem 1). Finally note that lim sup n P n MMD2(Pˆθn, Qn) qα = lim sup n P n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) qα < lim sup n P n MMD2(Pθ0, Qn) qα < α, where the first equality holds due to Lemma 12, the first inequality holds since both H and H 1 are positive definite matrices, and thus Pn i=1 Pn j=1 φ(Xj, θ0) H 1φ(Xi, θ0) 0, and the last inequality holds by Equation 21. Proof of Theorem 5 By the definition of the maximum mean discrepancy, we have n MMD2(Pθ n, Q n) = sup ω H: ω H=1 (γ n(ω))2, where γ n(ω) = 1 n i=1 ω(X i ) g(ω, θ n). Proposition 23 yields γ n(ω) = S n(ω) + op(1), where S n : H R is the functional defined in Equation 8, and the error term does not depend on ω. Then, by using the fact that supω H: ω H=1(S n(ω))2 converges in distribution (which is proved next) we obtain n MMD2(Pθ n, Q n) = sup ω H: ω H=1 (S n(ω))2 + op(1). By Slutsky s theorem, we conclude that n MMD2(Pθ n, Q n) and supω H: ω H=1(S n(ω))2 have the same asymptotic null distribution. We use Theorem 10 to obtain the asymptotic null distribution of supω H: ω H=1(S n(ω))2. Observe however that to apply this theorem, we need to check that S n(ω) = 1 n i=1 η(ω, X i , ˆθn) Composite Goodness-of-fit Tests with Kernels is a bounded linear functional. The linearity follows from the fact that for any fixed x X and ω Θ, η(ω, x, θ) is a linear function of ω. 
Boundedness follows from item iii of Proposition 16 since supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H where C1 > 0 and C2 > 0 are constants that do not depend on ω. We proceed to check Conditions G0, G1 and G2 of Theorem 10. Condition G0 follows directly from Proposition 24, where we identify σ(ω, ω ) = EX Pθ0(η(ω, X, θ0)η(ω , X, θ0)) for any ω, ω H. Indeed, note that it is the same bilinear form of Equation 13, and thus Condition G1 was already proved in Equation 14 in the proof of Theorem 2. Finally, to check condition G2 observe that k i (S n(ψk))2 ε k i ED (S n(ψk))2 = ε 1 X k i ED(η2(ψk, X , ˆθn)) X η2(ψk, z, ˆθn)pˆθn(z)λ(dz) where X Pˆθn|ˆθn. The first inequality follows from Markov s inequality, and the first equality follows since S n(ω) = 1 n Pn i=1 η(ω, X i , ˆθn), and recall that (X i )n i=1 are generated i.i.d. from Pˆθn|ˆθn. From Proposition 16.ii., there exists constants C1 > 0 and C2 > 0 such that for all k 1 it holds Z X η2(ψk, z, ˆθn)pˆθn(z)λ(dz) C1 X ψ2 k(z)pˆθn(z)λ(dz) + C2 θg(ψk, θ0) 2 . (23) Finally, by combining Equations 22 and 23, we get lim i lim sup n PD k i (S n(ψk))2 ε lim i lim sup n X ψ2 k(z)pˆθn(z)λ(dz) + ε 1C2 lim i k i θg(ψk, θ0) 2 . Note that the second limit of the previous equation converges to zero by Proposition 25. For the first limit, observe lim i lim sup n X ψ2 k(z)pˆθn(z)λ(dz) lim i X ψ2 k(z) sup θ Θ pθ(z)λ(dz) = 0 since P k=1 ψ2 k(y) = K(y, y), and R X supθ Θ pθ(z)λ(dz) < under Assumptions 1 to 6. A.5 Proof of Auxiliary Results We proceed to prove the auxiliary results. Key*, Gretton, Briol, and Fernandez* Proof of Lemma 12 By the definition of the MMD, it holds that n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 γ2 n(ω), where γn(ω) = 1 n i=1 ω(Xi) g(ω, ˆθn). Proposition 21 yields γn(ω) = Sn(ω) + op(1), where Sn : H R is the functional defined in Equation 8, and the error term does not depend on ω. 
Then, by using the fact that sup_{ω ∈ H: ‖ω‖_H = 1} S²_n(ω) converges in distribution (which is proved in Theorem 2) we obtain

n MMD²(P_{θ̂_n}, Q_n) = sup_{ω ∈ H: ‖ω‖_H = 1} S²_n(ω) + o_p(1).

Note that S_n(ω) = A_n(ω) + B_n(ω), where A_n : H → R and B_n : H → R are given by

A_n(ω) = (1/√n) ∑_{i=1}^n (ω(X_i) − g(ω, θ_0)) and B_n(ω) = −(2/√n) ∑_{i=1}^n ⟨∇_θ g(ω, θ_0), H⁻¹ φ(X_i, θ_0)⟩.

Then

n MMD²(P_{θ̂_n}, Q_n) = sup_{ω ∈ H: ‖ω‖_H = 1} (A_n(ω) + B_n(ω))² + o_p(1) = (A^1_n + B^1_n)(A^2_n + B^2_n)K + o_p(1) = n MMD²(P_{θ_0}, Q_n) + A^1_n B^2_n K + B^1_n A^2_n K + B^1_n B^2_n K + o_p(1), (24)

where A^i_n K with i ∈ {1, 2} denotes the application of A_n to the i-th coordinate of the kernel, and the same notation is used for B^i_n. The second equality is due to Fernández and Rivera (2022), and the third equality holds since n MMD²(P_{θ_0}, Q_n) = A^1_n A^2_n K. We proceed to analyse A^1_n B^2_n K, B^1_n A^2_n K and B^1_n B^2_n K. Observe that

B_n(ω) = −(2/√n) ∑_{i=1}^n ⟨∫_X ω(x) ∇_θ p_{θ_0}(x) λ(dx), H⁻¹ φ(X_i, θ_0)⟩,

where the equality follows by Lebesgue's dominated convergence theorem (see Remark 9). For all x ∈ X, define K_x : X → R by K_x(·) = K(x, ·). Then B^2_n K ∈ H is given by

(B^2_n K)(x) = B_n(K_x) = −(2/√n) ∑_{i=1}^n ⟨∫_X K_x(y) ∇_θ p_{θ_0}(y) λ(dy), H⁻¹ φ(X_i, θ_0)⟩. (25)

By linearity, we obtain

A^1_n B^2_n K = −(2/√n) ∑_{i=1}^n ⟨∫_X (A^1_n K)(y) ∇_θ p_{θ_0}(y) λ(dy), H⁻¹ φ(X_i, θ_0)⟩. (26)

Note that for any fixed y ∈ X, it holds

(A^1_n K)(y) = (1/√n) ∑_{j=1}^n ( K(X_j, y) − ∫_X K(x, y) p_{θ_0}(x) λ(dx) ).

Then, a simple computation shows that

∫_X (A^1_n K)(x) ∇_θ p_{θ_0}(x) λ(dx) = (1/√n) ∑_{j=1}^n ( ∇_θ µ_{θ_0}(X_j) − ∫_X (∇_θ µ_{θ_0}(x)) p_{θ_0}(x) λ(dx) ) = (1/√n) ∑_{j=1}^n φ(X_j, θ_0).

By substituting the previous equation into Equation 26, we obtain

A^1_n B^2_n K = −(2/n) ∑_{i=1}^n ∑_{j=1}^n ⟨φ(X_j, θ_0), H⁻¹ φ(X_i, θ_0)⟩ = −(2/n) ∑_{i=1}^n ∑_{j=1}^n φ(X_j, θ_0)ᵀ H⁻¹ φ(X_i, θ_0). (27)

A similar computation shows that B^1_n A^2_n K = A^1_n B^2_n K.
For the term B1 n B2 n K, we have B1 n B2 n K X (B1 n K)(x) θpθ0(x)λ(dx), H 1φ(Xi, θ0) X Kx(y) θpθ0(y)λ(dy), H 1φ(Xj, θ0) θpθ0(x)λ(dx), H 1φ(Xi, θ0) X K(x, y) θpθ0(y), H 1φ(Xj, θ0) θpθ0(x), H 1φ(Xi, θ0) λ(dy)λ(dx) m=1 (H 1φ(Xi, θ0))ℓ(H 1φ(Xj, θ0))m Hm,ℓ j=1 φ(Xj, θ0) H 1φ(Xi, θ0), (28) where the second equality holds from replacing (B2 n K)(x) by the expression given in Equation 25, and the fourth equality holds since θℓ pθ0(x) λ(dy)λ(dx). By substituting Equation 27 and Equation 28 into Equation 24, we obtain n MMD2(Pˆθn, Qn) = n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) + op(1). Key*, Gretton, Briol, and Fernandez* Proof of Proposition 13 Note that conditioned on the data e D = (Xi, Xi)n i=1, Zn(ω) is a sum of i.i.d. random variables with mean given by E e D (Zn(ω)) = E e D i=1 Wi(ω(Xi) ω( Xi)) and variance given by Var e D(Zn(ω)) = 1 n Pn i=1(ω(Xi) ω( Xi))2. We claim that i=1 (ω(Xi) ω( Xi))2 P 2 Z (ω(x) g(ω, θ0))2pθ0(x)λ(dx) = ν(ω, ω). Thus, by Lyapunov s Central limit theorem, we obtain Zn(ω) DD N(0, ν(ω, ω)) which yields the desired result. To prove the claim, consider ξn = 1 n Pn i=1(ω(Xi) ω( Xi))2. We will show that E(ξn) = ν(ω, ω) and that Var(ξn) 0. Therefore the convergence in probability follows from Markov s inequality. Consider the data D = (Xi)i 1, and define ED( ) = E( |(Xi)n i=1). Then, observe that E (ξn) = E ω(X1) ω( X1) 2 = E ED ω(X1) ω( X1) 2 = E ω2(X1) 2ω(X1)ED(ω( X1)) + ED(ω2( X1)) , (29) ED ω( X1) = Z ω(z)pˆθn(z)λ(dz) and ED ω2( X1) = Z ω2(z)pˆθn(z)λ(dz). Recall that R supθ Θ pθ(x)λ(dx) < by Assumption 3, and that ω H is bounded by Assumption 5. Then, under the null hypothesis, and by applying Lebesgue s dominated convergence theorem twice, it holds lim n E ED ω( X1) = E lim n Z ω(z)pˆθn(z)λ(dz) = Z ω(z)pθ0(z)λ(dz). By using the same arguments we can show that limn E(ED(ω2( X1))) = R ω2(z)pθ0(z)λ(dz). The previous equations together with Equation 29, yield lim n E (ξn) = 2 Z ω2(z)pθ0(z) Z ω(z)pθ0(z) 2! 
Finally, we proceed to verify that limn Var(ξn) = 0. Note that Var (ξn) = Var (ED(ξn)) + E(Var D (ξn)), and by the previous arguments limn Var (ED(ξn)) = 0 since ED(ξn) converges to a constant. Note that conditioned on the data D, the random variables (ω(Xi) ω( Xi))n i=1 are independent. Thus Var D (ξn) = 1 ω(Xi) ω( Xi) 2 0, Composite Goodness-of-fit Tests with Kernels where the limit holds since the ω is bounded. Hence we conclude that limn Var(ξn) = 0. Proof of Proposition 14 To prove this result we use Theorem 11. Note that by Assumption 2 the set Θ is bounded and closed, by Assumption 3 we have L(θ) C0(Θ), and by Assumption 5 the kernel K is characteristic, which means that the maximum mean discrepancy is a distance between probability measures. Thus, since the parametric family {Pθ}θ Θ is identifiable (Assumption 3), we deduce that L(θ) = MMD2(Pθ, Pθ0) has a unique minimiser under the null hypothesis. Observe that sup θ Θ |Ln(θ) L(θ)| = sup θ Θ µθ µn 2 H µθ µθ0 2 H 2 µθ, µθ0 µn H + µn 2 H µθ0 2 H 2 sup θ Θ µθ H µθ0 µn H + µn 2 H µθ0 2 H C µθ0 µn H + µn 2 H µθ0 2 H 0 in probability, where the last result holds by the standard law of large numbers for RKHS Berlinet and Thomas-Agnan (2004, Section 9.1.1-9.1.2). Then, by Theorem 11, we deduce that ˆθn θ0 in probability. Proof of Proposition 15 Recall the definition φ in Equation 6, then θLn(θ) = θ µn µθ 2 H = 2 θµθ, µn µθ H X ( θµθ(x)) pθ(x)λ(dx) i=1 φ(Xi, θ). By doing the same computations, we obtain θL n(θ) = θ µ n µθ 2 H = 2 n Pn i=1 φ(X i , θ), where recall that µ n is the mean kernel embedding of the empirical distribution Q n obtained from the parametric bootstrap samples X 1, . . . , X n. We proceed to prove item i). Observe that clearly, EX Pθ(φ(X, θ)) = EX Pθ X ( θµθ(y))pθ(y)λ(dy) = 0, and similarly ED(φ(X i , ˆθn)) = 0. We continue by checking ii). 
For any fixed x X and θ Θ, we have φ(x, θ) 2 = θi µθ(y) pθ(y)λ(dy) 2 4 Key*, Gretton, Briol, and Fernandez* Additionally, by similar arguments as those shown in Remark 9 we have that for any i {1, . . . , p}, it holds sup θ Θ sup x X θi µθ(x) = sup θ Θ sup x X θi pθ(y)λ(dy) < . We finish by verifying item iii). Recall that θLn(θ0) = 2 n Pn i=1 φ(Xi, θ0), where φ(X1, θ0), . . . , φ(Xn, θ0) are i.i.d. random variables. Moreover, under the null hypothesis, the items i) and ii) of this proposition deduce E(φ(Xi, θ0)) = 0 and E( φ(Xi, θ0) 2) < . Thus, E( n θLn(θ0) 2) = 4E( φ(X1, θ0) 2) < , and we obtain n θLn(θ0) = Op(1). By using the same arguments we deduce that for almost all sequences (Xi)i 1, ED( n θL n(ˆθn) 2) = 4ED( φ(X 1, ˆθn) 2) < . Thus n θL n(ˆθn) = Op(1). Proof of Proposition 16 We start with i). Observe that EX Pθ (η(ω, X, θ)) = EX Pθ (ω(X) g(ω, θ)) 2 θg(ω, θ0), H 1EX Pθ (φ(X, θ)) = 0, where the first equality follows from the linearity of expectations, and the last equality is due to g(ω, θ) = EX Pθ(ω(X)) and EX Pθ(φ(X, θ)) = 0 from Proposition 15, item i). We continue by verifying items ii) and iii). Observe that for any fixed ω H, x X and θ Θ, we have that |η(ω, x, θ)| = ω(x) g(ω, θ) 2 θg(ω, θ0), H 1φ(x, θ) |ω(x) g(ω, θ)| + 2 θg(ω, θ0) H 1 F φ(x, θ) |ω(x) g(ω, θ)| + C θg(ω, θ0) (31) where F denotes the Frobenius norm, and C > 0 is a constant that does not depend on ω, x, nor θ. In the previous set of equations, the first inequality follows from the triangle inequality and the Cauchy-Schwarz s inequality, and the second inequality holds by taking C = 2 H 1 F supθ Θ supx X φ(x, θ) < since H 1 F < , and supθ Θ supx X φ(x, θ) < by item ii) of Proposition 15. We proceed to verify item ii). From Equation 31, we have that for any fixed ω H and θ Θ, EX Pθ η2(ω, X, θ) 2EX Pθ |ω(X) g(ω, θ)|2 + 2C 2 θg(ω, θ0) 2 C1EX Pθ ω2(X) + C2 θg(ω, θ0) 2 , where C1 = 2 and C2 = 2C 2. 
The first inequality holds from the fact that for any a, b R we have that (a + b)2 2(a2 + b2), and the second inequality follows from the fact that the variance of random variable is upper bounded by its second moment. We continue with item iii). Note that for each ω H, we have supx X |ω(x)| ω H C (see Remark 8), and thus sup θ Θ |g(ω, θ)| sup θ Θ X |ω(x)|pθ(x)λ(dx) ω H Composite Goodness-of-fit Tests with Kernels Additionally, note that (see Remark 9) sup θ Θ θg(ω, θ) 2 = sup θ Θ θi pθ(x)λ(dx) 2 θi pθ(y) λ(dy) 2 < , (33) where the equality holds under Assumption 3 by the Lebesgue s dominated convergence theorem (see Remark 9), the first inequality is due to the fact that supx X |ω(x)| ω H C, and the last inequality holds by Assumption 3. Then, by combining Equations 31 to 33, we get sup θ Θ sup x X |η(ω, x, θ)| 2 ω H C + ω 2 H C θi pθ(y) λ(dy) 2 C1 ω H + C2 ω 2 H < , where C1 > 0 and C2 > 0 are constants that do not depend on ω, x or θ. Proof of Lemma 17 We begin by defining Zn(θ) = |Yn(θ) γ(θ)| and Z n(θ) = |Y n (θ) γ(θ)|, where θℓ θj µθ(Xi), γ(θ) = EX Pθ0 θℓ θj µθ(X) , and Y n (θ) = 1 θℓ θj µθ(X i ). Observe that Equation 11 is equivalent to supθ Θ Zn(θ) P 0 as n grows to infinity, and Equation 12 is equivalent to PD (supθ Θ Z n(θ) ε) P 0 holds for any ε > 0. To verify Equation 11 we use Theorem 21.9 of Davidson (1994), from which we just need to check the following properties: (i) Zn(θ) P 0 for each fixed θ Θ, and (ii) Zn( ) is stochastically equicontinuous. Analogously, for Equation 12 we need to check (i ) PD (Z n(θ) ε) P 0, and (ii ) for almost all sequences (Xi)i 1, Z n( ) is stochastically equicontinuous. We start proving (i) and (i ). Note that (i) follows from the law of large numbers since θℓ θj µθ(X) X K(X, y) 2 θℓ θj pθ(y)λ(dy) holds under Assumption 3 (see Remark 9). 
For item (i ), note that for any ε > 0 we have PD (Z n(θ) ε) = PD (|γ(θ) Y n (θ)| ε) PD (|γn(θ) Y n (θ)| ε/2) + PD (|γ(θ) γn(θ)| ε/2) , Key*, Gretton, Briol, and Fernandez* where γn(θ) = EX Pˆθn 2 θℓ θj µθ(X) . The last two terms converge to zero. To see this observe that PD |γn(θ) Y n (θ)| ε 4ED(|γn(θ) Y n (θ)|2) ε2 4 ε2n ED θℓ θj µθ(X 1) 2! where the first inequality is due to Chebyshev s inequality, and the second inequality holds since X 1, . . . , X n i.i.d. Pˆθn|ˆθn. For the second term |γn(θ) γ(θ)| = θℓ θj µθ(x)(pˆθn(x) pθ0(x))λ(dx) θℓ θj µθ(x) X |(pˆθn(x) pθ0(x))|λ(dx) now, by Taylor s theorem, for each x X, it exists θx in the line between θ0 and ˆθn such that pˆθn(x) pθ0(x) = θpθx(x) (ˆθn θ0). By the previous equality and by Holder s inequality |γn(θ) γ(θ)| sup x X θℓ θj µθ(x) X θpθx(x) 1 λ(dx) ˆθn θ0 θℓ θj µθ(x) X sup θ Θ θpθ(x) 1 λ(dx) ˆθn θ0 0, since |ˆθn θ0| = op(1) by Proposition 14, and since the other terms are finite due to Assumptions 3 and 5 and Remark 9. We continue by checking (ii) and (ii ). To check ii) we use Theorem 21.10 of Davidson (1994) from which it is enough to verify that |Zn(θ) Zn(θ )| B θ θ holds for all θ, θ Θ and for all n, where B is a constant which does not depend of n, θ and θ . To check condition ii ) we will verify that there exists B > 0 such that |Z n(θ) Z n(θ )| B θ θ , holds for every sequence (Xi)i 1. We only verify that |Zn(θ) Zn(θ )| B θ θ for some constant B since the analogous result for parametric bootstrap follows from the same arguments. By the triangle inequality, we obtain |Zn(θ) Zn(θ )| γ(θ) γ(θ ) + Yn(θ) Yn(θ ) . 
Since the kernel is bounded by a constant C > 0, it holds γ(θ) γ(θ ) = EX Pθ0 θℓ θj µθ(X) 2 θℓ θj µθ (X) X K(X, y) 2 θℓ θj (pθ(y) pθ (y))λ(dy) θℓ θj (pθ(y) pθ (y)) λ(dy), (34) Composite Goodness-of-fit Tests with Kernels Yn(θ) Yn(θ ) = X K(Xi, y) 2 θℓ θj (pθ(y) pθ (y))λ(dy) θℓ θj (pθ(y) pθ (y)) λ(dy), (35) By combining Equation 34 and Equation 35, we get |Zn(θ) Zn(θ )| 2C Z θℓ θj (pθ(y) pθ (y)) λ(dy), By the mean value theorem, we have 2 θℓ θj (pθ(y) pθ (y)) = θ 2 θℓ θj pθy(y) (θ θ ), where θy is in the line between θ and θ (which belongs to Θ by convexity of this set). By Holder s inequality, we obtain 2 θℓ θj (pθ(y) pθ (y)) θ θℓ θj pθy(y) 1 θ θ θℓ θj pθ(y) 1 Then, we conclude that |Zn(θ) Zn(θ )| 2C Z θℓ θj pθ(y) 1 λ(dy) | {z } Bℓ,j Finally, choose B = maxℓ,j [p] Bℓ,j and note it is finite by Assumption 3 together with Remark 9. Proof of Proposition 18 We start by proving Hn H = op(1). To prove this result, it suffices to show that each component of (Hn)ij converges in probability to Hij. Observe that θi θj Ln( θj) = 2 θi θj (Ln( θj) L( θj)) + 2 θi θj L( θj). A simple computation shows that for each θ Θ it holds θi θj (Ln(θ) L(θ)) = 2 θi θj µθ(X) 1 θi θj µθ(Xi) where the second equality holds by Lemma 17. Thus, (Hn)ij = op(1) + 2 θi θj L( θj). Key*, Gretton, Briol, and Fernandez* Finally, note that L(θ) = µθ µθ0 2 H C0(Θ) since µθ(x) C3(Θ) for all x X (see Assumption 3 and Remark 9). Then, the continuous mapping theorem, together with the fact that θj n P θ0 (by Proposition 14), yields θi θj L(θ0) + op(1) = Hij + op(1). We continue with the result for H n. Similarly to what we did before, we sum and subtract the term 2 θi θj L( θj ) to (Hn) ij and obtain (Hn) ij = 2 L n( θj ) L( θj ) + 2 θi θj L( θj ). Observe that for any θ Θ, under the null hypothesis, we have θi θj (L n(θ) L(θ)) = 2 θi θj µθ(X) 1 θi θj µθ(X i ) Then, the result follows from Lemma 17 and from the fact that ˆθ n θ0 = op(1) and the same arguments as before. 
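The parametric bootstrap studied throughout this appendix refits the parameter on each sample drawn from the fitted model before recomputing the test statistic. A generic sketch of that procedure is below; the callables `fit`, `sample` and `statistic` are hypothetical placeholders, and the +1 p-value correction is a common convention, not taken from the paper:

```python
import numpy as np

def parametric_bootstrap_pvalue(x, fit, sample, statistic, n_boot=200, rng=None):
    """Generic parametric-bootstrap p-value: estimate theta_hat from the data,
    then, for each bootstrap round, draw X*_1..X*_n from the fitted model,
    RE-estimate the parameter on the bootstrap sample (the key step analysed
    in the paper), and recompute the statistic."""
    rng = np.random.default_rng() if rng is None else rng
    theta_hat = fit(x)
    t_obs = statistic(x, theta_hat)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        x_star = sample(theta_hat, len(x), rng)  # X* ~ P_{theta_hat} given theta_hat
        theta_star = fit(x_star)                 # theta*_n: refit on bootstrap data
        t_boot[b] = statistic(x_star, theta_star)
    # p-value with the usual +1 correction so it is never exactly zero
    return (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)
```

Refitting on each bootstrap sample is what distinguishes this procedure from the wild bootstrap, and is what makes the resulting test consistent rather than overly conservative.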
Proof of Proposition 19 We start by proving item i). By Assumptions 1 to 6, we have that Ln(θ) = µθ µn 2 H C3(Θ) since µθ(x) C3(Θ) for all x X (see Remark 9). Then, for each j {1, . . . , p}, a first order Taylor s expansion of θj Ln(θ) around ˆθn exists and is given by θj Ln(θ) = θj Ln(ˆθn) + θ θj Ln( θj) , θ ˆθn θj Ln( θj) , θ ˆθn where θj lies in the line segment between ˆθn and θ, and θj Θ by convexity. The second equality holds since ˆθn is the minimiser of Ln, and belongs to the interior of Θ. By using the previous equation and by evaluating at θ0, we obtain θLn(θ0) = H n(θ0 ˆθn), where (Hn)ij = 2 θi θj Ln( θj) for each i, j [p]. By Proposition 18 we have that Hn converges in probability to the Hessian matrix H. Recall that H is positive definite by Assumption 6. Then, by the continuity of the determinant, we have that Hn is invertible in sets of arbitrarily large probability for all n large enough, and thus n(θ0 ˆθn) = n(H n) 1 θLn(θ0) = n(H ) 1 θLn(θ0) + n((H n) 1 H 1) θLn(θ0). By Proposition 15, n θLn(θ0) = Op(1), and since Hn P H, we have n(θ0 ˆθn) = n(H ) 1 θLn(θ0) + op(1). Composite Goodness-of-fit Tests with Kernels For item ii) we follow a very similar argument. Note that by Assumptions 1 to 6, L n(θ) C3(Θ). Then, for each j {1, . . . , p}, we perform a first order Taylor s approximation of θj L n(θ) around θ n and we get θj L n(ˆθn) = θ θj L n( θj ) , ˆθn θ n where θj lies in the line segment between θ n and ˆθn. By using the previous equation, we obtain θL n(ˆθn) = H n (ˆθn θ n), where (H n)ij = 2 θi θj L n( θj ) for each i, j [p], and all θ1 , . . . , θp lie in the line segment between ˆθn and θ n. By Proposition 18 we have that H n converges in probability to the Hessian matrix H. Then, by the continuity of the determinant, we have that H n is invertible in sets of arbitrarily large probability for all n large enough, and thus n(ˆθn θ n) = n(H n ) 1 θL n(ˆθn). 
We finish by noting that under Assumptions 1 to 6, Proposition 15 deduces that n θL n(ˆθn) = Op(1) and by Proposition 18, (H n ) 1 converges to H 1. Proof of Lemma 20 We start by proving i). Proposition 19 yields n(ˆθn θ n) = n H 1 θL n(ˆθn)+op(1), and Proposition 15 yields n θL n(ˆθn) = Op(1), which together give us that n(ˆθn θ n) = Op(1). We continue by proving ii). Observe that P(|θ n θ0| ϵ) P |θ n ˆθn| ϵ/2 + P |ˆθn θ0| ϵ/2 0, where the first probability tends to 0 by part i), and the second one by Proposition 14 (that can only be applied under the null hypothesis). Proof of Proposition 21 Define γn : H R by γn(ω) = 1 n ω(Xi) g(ω, ˆθn) . (36) By Assumptions 1 to 6, for each fixed ω H, we have g(ω, θ) C3(Θ) (see Remark 9). Then, a first order Taylor s expansion of g(ω, θ) around θ0 yields g(ω, ˆθn) = g(ω, θ0) + θg(ω, θn), ˆθn θ0 , where θn lies on the line segment between θ0 and ˆθn. Note that by the convexity of Θ, it holds that θn Θ. Then, by replacing the previous expression in Equation 36 we obtain γn(ω) = 1 n i=1 (ω(Xi) g(ω, θ0)) n D θg(ω, θn), ˆθn θ0 E . By assuming the null hypothesis, Proposition 14 yields ˆθn P θ0, and Proposition 15.iii and Proposition 19.i yield n(ˆθn θ0) = Op(1). Then, by the continuous mapping theorem we obtain γn(ω) = 1 n i=1 (ω(Xi) g(ω, θ0)) n D θg(ω, θ0), ˆθn θ0 E + op(1). (37) Key*, Gretton, Briol, and Fernandez* By combining Proposition 15 and Proposition 19.i, we have n(ˆθn θ0) = n H 1 θLn(θ0) + op(1) = 2 n i=1 H 1φ(Xi, θ0) + op(1). By replacing the previous expression in Equation 37 we obtain γn(ω) = 1 n ω(Xi) g(ω, θ0) 2 θg(ω, θ0), H 1φ(Xi, θ0) + op(1) i=1 η(ω, Xi, θ0) + op(1) = Sn(ω) + op(1). Proof of Proposition 22 First note that i=1 (η(ω1, Xi, θ0), . . . , η(ωm, Xi, θ0)), where (η(ω1, Xi, θ0), . . . , η(ωm, Xi, θ0))n i=1 are a collection of i.i.d. random vectors. By Proposition 16.i, we have that for any fixed ω H it holds EX Pθ0(η(ω, X, θ0)) = 0, and thus EX Pθ0(Sn) = 0. 
Additionally, by Proposition 16.ii, for any $\omega \in \mathcal{H}$ we have $E_{X \sim P_{\theta_0}}(\eta^2(\omega, X, \theta_0)) < \infty$. Thus, the central limit theorem yields $S_n \to^D N_m(0, \Sigma)$, where for any $i, j \in [m]$ we have $\Sigma_{ij} = \sigma(\omega_i, \omega_j) = E_{X \sim P_{\theta_0}}(\eta(\omega_i, X, \theta_0)\eta(\omega_j, X, \theta_0))$.

Proof of Proposition 23 Define $\gamma^*_n : \mathcal{H} \to \mathbb{R}$ by
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \theta^*_n)\big). \quad (38)$$
By Assumptions 1 to 6, for each $\omega \in \mathcal{H}$, $g(\omega, \theta) \in C^3(\Theta)$ (see Remark 9). By a first-order Taylor expansion of $g(\omega, \theta)$ around $\hat\theta_n$ we obtain $g(\omega, \theta^*_n) = g(\omega, \hat\theta_n) + \langle \nabla_\theta g(\omega, \bar\theta^*_n),\, \theta^*_n - \hat\theta_n\rangle$, where $\bar\theta^*_n$ belongs to the line segment between $\hat\theta_n$ and $\theta^*_n$. Then, it follows that
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \hat\theta_n)\big) - \sqrt{n}\,\big\langle \nabla_\theta g(\omega, \bar\theta^*_n),\, \theta^*_n - \hat\theta_n\big\rangle.$$
Proposition 14 yields that $\hat\theta_n - \theta_0 = o_p(1)$, and Lemma 20 yields $\theta^*_n - \theta_0 = o_p(1)$ and $\sqrt{n}(\theta^*_n - \hat\theta_n) = O_p(1)$. Thus, by the continuous mapping theorem it holds that
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \hat\theta_n)\big) - \sqrt{n}\,\big\langle \nabla_\theta g(\omega, \theta_0),\, \theta^*_n - \hat\theta_n\big\rangle + o_p(1). \quad (39)$$
Finally, by combining Propositions 15 and 19.ii we get
$$\sqrt{n}(\theta^*_n - \hat\theta_n) = -\sqrt{n}\,H^{-1}\nabla_\theta L^*_n(\hat\theta_n) + o_p(1) = -\frac{2}{\sqrt{n}}\sum_{i=1}^n H^{-1}\phi(X^*_i, \hat\theta_n) + o_p(1),$$
and thus, by replacing in Equation 39,
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \Big(\omega(X^*_i) - g(\omega, \hat\theta_n) + 2\big\langle \nabla_\theta g(\omega, \theta_0),\, H^{-1}\phi(X^*_i, \hat\theta_n)\big\rangle\Big) + o_p(1) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X^*_i, \hat\theta_n) + o_p(1).$$

Proof of Proposition 24 We will apply the Lindeberg-Feller central limit theorem for multivariate triangular arrays. Define $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_{n,i}$, where $\{Y_{n,i}\}$ is a triangular array of random vectors $Y_{n,i} = (\eta(\omega_1, X^*_{n,i}, \hat\theta_n), \ldots, \eta(\omega_m, X^*_{n,i}, \hat\theta_n))$, with $\omega_1, \ldots, \omega_m \in \mathcal{H}$ and $m \in \mathbb{N}$. Observe that, by Proposition 16, we have that
$$E_D\big(\eta(\omega, X^*_{n,i}, \hat\theta_n)\big) = \int_{\mathcal{X}} \eta(\omega, z, \hat\theta_n)\, p_{\hat\theta_n}(z)\,\lambda(dz) = 0$$
holds for any $\omega \in \mathcal{H}$, and thus $E_D(Y_{n,i}) = 0$. Consider $V_n = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}_D(Y_{n,i})$, and observe that for any $\ell, k \in [m]$, we have that
$$(V_n)_{\ell,k} = \mathrm{Cov}_D\big(\eta(\omega_\ell, X^*_{n,1}, \hat\theta_n),\, \eta(\omega_k, X^*_{n,1}, \hat\theta_n)\big) = \int_{\mathcal{X}} \eta(\omega_\ell, z, \hat\theta_n)\,\eta(\omega_k, z, \hat\theta_n)\, p_{\hat\theta_n}(z)\,\lambda(dz) \to^P \int_{\mathcal{X}} \eta(\omega_\ell, z, \theta_0)\,\eta(\omega_k, z, \theta_0)\, p_{\theta_0}(z)\,\lambda(dz).$$
The convergence in probability holds since $\hat\theta_n \to^P \theta_0$ by Proposition 14, and since the map $\theta \mapsto \int_{\mathcal{X}} \eta(\omega, z, \theta)\,\eta(\omega', z, \theta)\,p_\theta(z)\,\lambda(dz)$ is continuous for any fixed $\omega, \omega' \in \mathcal{H}$.
Thus the continuous mapping theorem yields Equation 40. We proceed to check Lindeberg's condition. Observe that for every $\varepsilon > 0$,
$$\frac{1}{n}\sum_{i=1}^n E_D\Big(\|Y_{n,i}\|^2 \mathbf{1}\{\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\}\Big) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^m E_D\Big(\eta^2(\omega_j, X^*_{n,i}, \hat\theta_n)\,\mathbf{1}\{\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\}\Big) \le \frac{C}{n}\sum_{i=1}^n P_D\big(\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\big) \to 0, \text{ a.s.}$$
The inequality holds since $\sup_{\theta \in \Theta}\sup_{x \in \mathcal{X}} \eta^2(\omega_j, x, \theta) < \infty$ by Proposition 16.iii, with $C$ denoting the resulting constant.

Proof of Proposition 25 By Remark 9,
$$\sum_{i=1}^\infty \big\|\nabla_\theta g(\psi_i, \theta_0)\big\|^2 = \sum_{\ell=1}^p \sum_{i=1}^\infty \left(\int_{\mathcal{X}} \psi_i(y)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\lambda(dy)\right)^2. \quad (41)$$
Observe that for each $\ell \in \{1, \ldots, p\}$, it holds that
$$\sum_{i=1}^\infty \left(\int_{\mathcal{X}} \psi_i(y)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\lambda(dy)\right)^2 = \int_{\mathcal{X}}\int_{\mathcal{X}} \sum_{i=1}^\infty \psi_i(y)\psi_i(x)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\partial_{\theta_\ell} p_{\theta_0}(x)\,\lambda(dy)\lambda(dx)$$
$$\le \int_{\mathcal{X}}\int_{\mathcal{X}} \sum_{i=1}^\infty |\psi_i(y)||\psi_i(x)|\,\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\big|\partial_{\theta_\ell} p_{\theta_0}(x)\big|\,\lambda(dy)\lambda(dx)$$
$$\le \int_{\mathcal{X}}\int_{\mathcal{X}} \Big(\sum_{i=1}^\infty |\psi_i(y)|^2\Big)^{1/2}\Big(\sum_{j=1}^\infty |\psi_j(x)|^2\Big)^{1/2}\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\big|\partial_{\theta_\ell} p_{\theta_0}(x)\big|\,\lambda(dy)\lambda(dx)$$
$$= \left(\int_{\mathcal{X}} \Big(\sum_{i=1}^\infty |\psi_i(y)|^2\Big)^{1/2}\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 = \left(\int_{\mathcal{X}} \sqrt{K(y,y)}\,\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 \le C\left(\int_{\mathcal{X}} \big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 < \infty, \quad (42)$$
where the second inequality is due to the Cauchy-Schwarz inequality, the second equality holds since $\sum_{i=1}^\infty \psi_i^2(y) = K(y, y)$, the third inequality follows from the fact that the kernel is bounded (Assumption 5), and the last inequality is due to Assumption 3. By combining Equations 41 and 42 we conclude that $\sum_{i=1}^\infty \|\nabla_\theta g(\psi_i, \theta_0)\|^2 < \infty$.

Appendix B. Illustration of Limitations
We illustrate two of the limitations introduced in Section 6. The first occurs when $D(P, Q_n)$ is poor at distinguishing between probability distributions; for example, this can occur when using a Gaussian kernel with a very small or very large lengthscale. Figure 12 shows the type I error rate of the composite KSD test, using the parametric bootstrap, as the lengthscale of the Gaussian kernel varies. We consider the null hypothesis case where $H_0^C$: the data is Gaussian, and the data is sampled from $Q = N(0.4, 1.4^2)$. The figure shows that the type I error rate of the composite test can exceed the level for very small lengthscales, and goes to zero for large lengthscales.
The figure also shows the type I error rate of a non-composite test with $H_0 : P = Q$ as defined above, which reveals that the non-composite test does not suffer from this problem.

Figure 12: Comparison of the type I error rate of a composite and non-composite test, as we vary the lengthscale $l$ of the kernel. The left axis (blue) shows the type I error rate of the two tests. The right axis (orange) shows the estimates of the scale parameter of the model made by the composite test over 2000 repeats. The solid blue horizontal line shows the level of the test, 0.05. The solid orange horizontal line shows the true value of the standard deviation, 1.4. The solid black vertical line shows the lengthscale selected by the median heuristic.

To investigate the issue, the figure also shows the estimates of the scale parameter. For very small lengthscales, the parameter estimates are not centred around the true parameter value, so the test is comparing the data to a poor choice of model from $\{P_\theta\}_\theta$, leading to type I errors. For large lengthscales, we suggest that the type I error rate of the composite test goes to zero faster than that of the non-composite test because the small amount of variance in the estimated parameters makes distinguishing the null and alternative hypotheses more difficult. Importantly, we note that the range of lengthscales for which the composite test has good performance is fairly wide (note the log scale on the x-axis), and the median heuristic selects a value in the middle of this range. The same behaviour can also occur for the MMD test, although, in this case, we have the additional reassurance of theoretical results which show that the test will behave correctly as $n$ becomes large.
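The Gaussian kernel and the median heuristic discussed above (both defined in Appendix D.2) can be sketched in a few lines of NumPy; this is a minimal illustration, and the function names here are ours rather than from our code base.

```python
import numpy as np

def gaussian_kernel(x1, x2, lengthscale):
    """K(x1, x2) = exp(-||x1 - x2||^2 / (2 l^2))."""
    sq_dist = np.sum((x1 - x2) ** 2)
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))

def median_heuristic(ys):
    """l_med = sqrt(median(||y_i - y_j||^2 / 2)) over all pairs i < j."""
    n = ys.shape[0]
    sq_dists = [np.sum((ys[i] - ys[j]) ** 2)
                for i in range(n) for j in range(i + 1, n)]
    return np.sqrt(np.median(sq_dists) / 2.0)
```

By construction, $K(x, x) = 1$ for any $x$, and the median heuristic places the lengthscale on the typical scale of the data, i.e. in the middle of the well-behaved range seen in Figure 12.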
The second limitation we illustrate occurs when there is large variance in the estimate of the parameter, either because the minimum distance estimator has high variance or because the optimisation process fails. Figure 13 demonstrates this limitation for the MMD test, testing $H_0^C$ that $Q = N(\theta, 1^2)$ for some $\theta \in \mathbb{R}$ against the alternative $N(0.5, 1.5^2)$. If the optimiser is configured correctly, the figure shows that the estimates of the parameter are centred around the true parameter value and have low variance, and there is clear separation between the distributions of the test statistic under $H_0^C$ and $H_1^C$. Thus, the test is able to distinguish $H_0^C$ and $H_1^C$ and achieve high power. If the optimiser is configured poorly (we use too large a learning rate), the parameter estimates have higher variance, and the distributions of the test statistic under $H_0^C$ and $H_1^C$ overlap completely. Thus, no matter where the threshold is set, it will not be possible to distinguish the two hypotheses, and the test will have very low power.

Figure 13: Demonstration of a failure mode where the parameter estimate has high variance due to a poorly configured optimiser. The first row shows the distribution of the test statistic (left) and that of the parameter estimate (right) for a well-configured optimiser, and the second row shows the distributions for a poorly configured optimiser.

Figure 14: Power of the KSD test with wild bootstrap, comparing setting the lengthscale using the median heuristic against aggregating tests over 8 different lengthscales. We set $H_0^C : P = N(\mu, \sigma^2)$, with the left and right panels showing data generated from different distributions: $Q_a = \text{Student's } t(\nu)$ for $\nu \in \{1, \ldots, 5\}$ (left) and $Q_b = (1 - \epsilon)N(0, 1) + \epsilon N(5, 1)$ for $\epsilon \in [0, 0.1]$ (right). The horizontal line
shows the level.

Appendix C. Aggregated Composite Test
Figure 14 shows an attempt at combining the aggregated test method introduced by Schrab et al. (2023, 2022) with our composite KSD test, using the wild bootstrap. The aggregated test is the combination of 8 composite tests using the Gaussian kernel, where the lengthscales are set as the {25, 32, 39, 46, 54, 61, 68, 75}th percentiles of the pairwise distances between the observations. The non-aggregated test uses a single lengthscale set using the median heuristic, i.e. the 50th percentile of the pairwise distances. The null hypothesis of the test is $H_0^C : P = N(\mu, \sigma^2)$, for $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$. To evaluate the performance of the test, we consider the two alternatives specified in the figure. Against a Student's t-distribution, we find that the median heuristic has slightly better performance than the aggregated test, while against a mixture of Gaussians the aggregated test has a small advantage. We would expect the aggregated test to have a bigger advantage in the case of the mixture alternative, as here the spread of lengthscales brings a bigger benefit than in the unimodal case of the Student's t alternative. It is somewhat unexpected that the aggregated test has slightly lower performance than the median heuristic against the Student's t alternative. Future work could investigate why this happens, and whether any modifications can be made to the aggregated testing procedure to improve its performance in the composite case.

Appendix D. Experiment Details
For all experiments we use a test level of $\alpha = 0.05$. The experiments are implemented in JAX, and executed on an Nvidia RTX 3090 GPU.

D.1 Closed-Form KSD Estimator
For any $P_\theta$ in the exponential family, we can calculate $\hat\theta_n = \arg\min_\theta \mathrm{KSD}^2(P_\theta, Q_n)$ in closed form. Here we state the expression for this estimator, which we use in our Gaussian and kernel exponential family experiments.
The detailed derivation of this result can be found in Barp et al. (2019, Appendix D3) or Matsubara et al. (2022). Let the density of a model in the exponential family be $p_\theta(x) = \exp(\eta(\theta)^\top t(x) - a(\theta) + b(x))$, with $\eta : \Theta \to \mathbb{R}^k$ an invertible map, $t : \mathbb{R}^d \to \mathbb{R}^k$ any sufficient statistic for some $k \in \mathbb{N}$, $a : \Theta \to \mathbb{R}$ and $b : \mathbb{R}^d \to \mathbb{R}$. Then the KSD estimator is given by
$$\hat\theta_n := \eta^{-1}\left(-\tfrac{1}{2}\Lambda_n^{-1}\nu_n\right),$$
where $\Lambda_n \in \mathbb{R}^{k \times k}$ and $\nu_n \in \mathbb{R}^k$ are defined as $\Lambda_n := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \Lambda(x_i, x_j)$ and $\nu_n := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \nu(x_i, x_j)$. These are based on functions $\Lambda : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{k \times k}$ and $\nu : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^k$ which depend on the specific model, defined as
$$\Lambda(x, x') := K(x, x')\,\nabla_x t(x)\,\nabla_{x'} t(x')^\top$$
and
$$\nu(x, x') := K(x, x')\,\nabla_{x'} t(x')\,\nabla_x b(x) + \nabla_x t(x)\,\nabla_{x'} K(x, x') + K(x, x')\,\nabla_x t(x)\,\nabla_{x'} b(x') + \nabla_{x'} t(x')\,\nabla_x K(x', x),$$
where $\nabla_x t(x) \in \mathbb{R}^{k \times d}$ denotes the Jacobian of $t$.

D.2 Kernels
We define the Gaussian kernel as $K(x_1, x_2) = \exp(-\|x_1 - x_2\|_2^2 / 2l^2)$, and the IMQ kernel as $K(x_1, x_2) = (1 + \|x_1 - x_2\|_2^2 / 2l^2)^{-1/2}$, where $l \in \mathbb{R}_+$ is the lengthscale, and the exponent of $-\tfrac{1}{2}$ follows Matsubara et al. (2022). Where we set the lengthscale using the median heuristic, this is defined as $l_{\mathrm{med}} = \sqrt{\mathrm{median}(\|y_i - y_j\|_2^2 / 2)}$, where the median is taken over all pairs of observations $y_i$ and $y_j$. Where we use a sum kernel, this is defined as $K(x_1, x_2) = \frac{1}{L}\sum_{i=1}^L K_{l_i}(x_1, x_2)$, where $K_{l_1}, \ldots, K_{l_L}$ are a set of $L$ Gaussian or IMQ kernels with lengthscales $l_1, \ldots, l_L$. Prior work has found that using a sum of kernels with a range of lengthscales, rather than attempting to choose the lengthscale of a single kernel, produces good empirical results (Li et al., 2015; Ren et al., 2016; Sutherland et al., 2017). Note that a sum of characteristic kernels is a characteristic kernel (Sriperumbudur et al., 2010), so this choice satisfies the assumptions we make in our theoretical results.

D.3 Details of Specific Figures
D.3.1 Distribution of Wild Bootstrap Statistic (Figure 2)
$H_0^C : P = N(\mu, \sigma^2)$ for $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$, with the estimator configured the same as in the Gaussian experiments below.
We sample observations from $Q = N(0.3, 1.0)$. We use a Gaussian kernel with lengthscale 1.0. To compute the distribution of the test statistic, we take 1000 sets of $n = 1000$ samples from $Q$, and for each we estimate $P_{\hat\theta_n}$ and compute $\mathrm{MMD}^2(P_{\hat\theta_n, n}, Q_n)$. To compute the distribution of the wild bootstrap statistic, we take $n = 1000$ samples from $Q$, estimate $P_{\hat\theta_n}$, sample 1000 sets of Rademacher variables, and thus compute 1000 samples of $\mathrm{MMD}^2_W(P_{\hat\theta_n, n}, Q_n)$.

D.3.2 Gaussian Model (Figures 3 to 7)

repeats to compute power (per seed)   1000
K                                     Gaussian, $l = l_{\mathrm{med}}$
wild bootstrap samples                500
parametric bootstrap samples          300

MMD test details For the experiments using the MMD, we minimise the objective of the minimum distance estimator using the Adam optimiser (Kingma and Ba, 2015). We use a learning rate of 0.05 and 100 iterations. When $\theta = \mu_d$ (Figures 4 to 6) we sample the initial value of $\mu_d$ from $N(0_d, I_d)$. When $\theta = \{\mu, \sigma^2\}$ (Figures 3 and 7) we sample the initial $\mu$ from $N(0, 1)$, and sample $\sigma$ from a standard normal distribution truncated between 0.5 and 1.5.

Timing We estimate the time taken by each bootstrap as follows. We run 3 tests as a warmup, then we run 2000 tests and calculate the mean time taken. The full configuration is as follows:

$H_0^C$             $P_{\mu,\sigma} = N(\mu, \sigma^2)$ for $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_+$
n                   100
D                   KSD
bootstrap samples   300
K                   Gaussian, $l = 1.0$

D.3.3 Toggle Switch (Figures 8 to 10)
The true model parameters, $\theta_0$, are:

parameter    α1     α2     β1    β2    µ      σ     γ
true value   22.0   12.0   4.0   4.5   325.0  0.25  0.15

We use stochastic gradient descent with random restarts to estimate the parameters. The algorithm is:
1. Sample 500 initial parameter values, $\theta^1_{\mathrm{init}}, \ldots, \theta^{500}_{\mathrm{init}}$.
2. Select the 15 $\theta^i_{\mathrm{init}}$ for which $\mathrm{MMD}^2(P_{\theta^i_{\mathrm{init}}}, Q_n)$ is smallest.
3. Run SGD 15 times, once for each of the selected $\theta^i_{\mathrm{init}}$, to find $\hat\theta^1_n, \ldots, \hat\theta^{15}_n$. Use learning rate 0.04 and perform 300 iterations.
4. Return the $\hat\theta^i_n$ for which $\mathrm{MMD}^2(P_{\hat\theta^i_n}, Q_n)$ is smallest.
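The steps above can be sketched as follows. This is a minimal NumPy illustration only: a generic `loss` callable stands in for $\theta \mapsto \mathrm{MMD}^2(P_\theta, Q_n)$, plain gradient descent with finite-difference gradients stands in for SGD through the simulator, and all names and defaults other than the restart counts, learning rate, and iteration count are ours.

```python
import numpy as np

def estimate_with_restarts(loss, sample_init, n_init=500, n_keep=15,
                           lr=0.04, n_steps=300, eps=1e-4):
    """Random-restart gradient descent, following steps 1-4 above.

    `loss` stands in for theta -> MMD^2(P_theta, Q_n); `sample_init`
    draws one initial parameter vector. Both are placeholders here.
    """
    # 1. Sample candidate initialisations, and 2. keep the best few.
    inits = [sample_init() for _ in range(n_init)]
    inits.sort(key=loss)
    best = None
    for theta in inits[:n_keep]:
        theta = theta.copy()
        # 3. Descend the loss from this initialisation, using central
        #    finite differences in place of the simulator's gradients.
        for _ in range(n_steps):
            grad = np.array([
                (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(theta.size)
            ])
            theta -= lr * grad
        # 4. Return the run with the smallest final loss.
        if best is None or loss(theta) < loss(best):
            best = theta
    return best
```

The point of the pre-selection in step 2 is to spend the expensive optimisation budget only on initialisations that already have a small discrepancy, which matters when each loss evaluation requires simulating from the model.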
The initial parameters are sampled from uniform distributions with the following ranges:

parameter      α1      α2      β1     β2     µ       σ      γ
lower bound    0.01    0.01    0.01   0.01   250.0   0.01   0.01
upper bound    50.00   50.00   5.00   5.00   450.0   0.50   0.40

To run the test, we use the following configuration:

K                   unweighted mixture of Gaussian kernels with lengthscales 20.0, 40.0, 80.0, 100.0, 130.0, 200.0, 400.0, 800.0, 1000.0
n                   400
bootstrap           parametric
bootstrap samples   200

To create Figure 10 we apply kernel density estimation to samples from the model. We use a Gaussian kernel, with the bandwidth set to 0.1.

D.3.4 Nonparametric Density Estimation (Figure 11)
We are testing whether the family of models considered by Matsubara et al. (2022) fits the data; thus our choices of $q_{\mathrm{kef}}$ and $l_{\mathrm{kef}}$ match the original paper.

K                        unweighted sum of IMQ kernels with lengthscales 0.6, 1.0, 1.2
n                        82
wild bootstrap samples   500
$q_{\mathrm{kef}}$       $N(0.0, 3.0^2)$
$l_{\mathrm{kef}}$       following Matsubara et al. (2022)

Following Matsubara et al. (2022), we normalise each observation in the data set as $y_i^{\mathrm{norm}} = \frac{y_i - \mu}{0.5\,\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the data set. To plot the density of the model, we estimate the normalising constant of the model using the importance sampling approach suggested by Wenliang et al. (2019, Section 3.2). As a proposal we use $N(0.0, 3.5^2)$, and we take 2000 samples.

References
A. Anastasiou, A. Barp, F.-X. Briol, B. Ebner, R. E. Gaunt, F. Ghaderinezhad, J. Gorham, A. Gretton, C. Ley, Q. Liu, L. Mackey, C. J. Oates, G. Reinert, and Y. Swan. Stein's method meets computational statistics: A review of some recent developments. Statistical Science, 2023.
T. W. Anderson and D. A. Darling. A test of goodness of fit. Journal of the American Statistical Association, 1954.
K. Balasubramanian, T. Li, and M. Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 2021.
Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey.
Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, 2019.
J. Beirlant, L. Györfi, and G. Lugosi. On the asymptotic normality of the L1- and L2-errors in histogram density estimation. The Canadian Journal of Statistics, 1994.
J. Beirlant, L. Devroye, L. Györfi, and I. Vajda. Large deviations of divergence measures on partitions. Journal of Statistical Planning and Inference, 2001.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2004.
Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), April 2019.
J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1974.
S. Betsch, B. Ebner, and B. Klar. Minimum Lq-distance estimators for non-normalized parametric models. Canadian Journal of Statistics, 2020.
Ayush Bharti, François-Xavier Briol, and Troels Pedersen. A general method for calibrating stochastic radio channel models with kernels. IEEE Transactions on Antennas and Propagation, 2021.
Ayush Bharti, Masha Naslidnyk, Oscar Key, Samuel Kaski, and François-Xavier Briol. Optimally-weighted estimators of the Maximum Mean Discrepancy for likelihood-free inference. In International Conference on Machine Learning, 2023.
M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
Fernando V. Bonassi and Mike West. Sequential Monte Carlo with adaptive weights for approximate Bayesian computation. Bayesian Analysis, 2015.
Fernando V. Bonassi, Lingchong You, and Mike West. Bayesian learning from marginal data in bionetwork models. Statistical Applications in Genetics and Molecular Biology, 2011.
W.
Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. A test of relative similarity for model selection in generative models. In International Conference on Learning Representations, 2016.
François-Xavier Briol, Alessandro Barp, Andrew B. Duncan, and Mark Girolami. Statistical inference for generative models with Maximum Mean Discrepancy. arXiv:1906.05944, 2019.
Stéphane Canu and Alexander J. Smola. Kernel methods and the exponential family. Neurocomputing, 2006.
Ying-Xia Chen and Wei-Jun Zhu. Note on the strong law of large numbers in a Hilbert space. Gen. Math., 2011.
B.-E. Chérief-Abdellatif and P. Alquier. MMD-Bayes: Robust Bayesian estimation via Maximum Mean Discrepancy. In 2nd Symposium on Advances in Approximate Bayesian Inference, 2020.
Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. Bernoulli, 2022.
Kacper Chwialkowski, Dino Sejdinovic, and Arthur Gretton. A wild bootstrap for degenerate kernel tests. In Neural Information Processing Systems, 2014.
Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, 2016.
K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences of the United States of America, 2020.
Ralph D'Agostino and E. S. Pearson. Tests for departure from normality. Empirical results for the distributions of $b_2$ and $\sqrt{b_1}$. Biometrika, 1973.
A. P. Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 2007.
C. Dellaporta, J. Knoblauch, T. Damoulas, and F.-X. Briol. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In International Conference on Artificial Intelligence and Statistics, 2022.
J. Durbin.
Kolmogorov-Smirnov tests when parameters are estimated with applications to tests of exponentiality and tests on spacings. Biometrika, 1975.
G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via Maximum Mean Discrepancy optimization. In Uncertainty in Artificial Intelligence, 2015.
Bruno Ebner. On combining the zero bias transform and the empirical characteristic function to test normality. Latin American Journal of Probability and Mathematical Statistics, 2021.
T. Fernández and A. Gretton. A Maximum-Mean-Discrepancy goodness-of-fit test for censored data. In Artificial Intelligence and Statistics, 2019.
Tamara Fernández and Nicolás Rivera. A general framework for the analysis of kernel-based tests. arXiv:2209.00124, 2022.
Tamara Fernández, Nicolás Rivera, Wenkai Xu, and Arthur Gretton. Kernelized Stein discrepancy tests of goodness-of-fit for time-to-event data. In International Conference on Machine Learning, 2020.
K. Fukumizu. Exponential manifold by reproducing kernel Hilbert spaces. In Algebraic and Geometric Methods in Statistics. 2009.
Kenji Fukumizu, Arthur Gretton, Gert Lanckriet, Bernhard Schölkopf, and Bharath K. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, 2009.
W. Gong, Y. Li, and J. M. Hernández-Lobato. Sliced Kernelized Stein Discrepancy. In International Conference on Learning Representations, 2021.
W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, and R. Zemel. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In International Conference on Machine Learning, 2020.
Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 2007.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 2012.
L. Györfi and I. Vajda. Asymptotic distributions for goodness-of-fit statistics in a sequence of multinomial models. Statistics and Probability Letters, 2002.
L. Györfi and E. C. van der Meulen. A consistent goodness-of-fit test based on the total variation distance. In Nonparametric Functional Estimation and Related Topics. 1991.
N. Henze and B. Zirkler. A class of invariant consistent tests for multivariate normality. Communications in Statistics - Theory and Methods, 1990.
Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. John Wiley & Sons, 2011.
A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2006.
M. D. Jiménez-Gamero, V. Alba-Fernández, J. Muñoz García, and Y. Chalco-Cano. Goodness-of-fit tests based on empirical characteristic functions. Computational Statistics & Data Analysis, 2009.
W. Jitkrittum, P. Sangkloy, B. Schölkopf, H. Kanagawa, J. Hays, and A. Gretton. Informative features for model comparison. In Neural Information Processing Systems, 2018.
Wittawat Jitkrittum, Heishiro Kanagawa, and Bernhard Schölkopf. Testing goodness of fit of conditional density models with kernels. In Conference on Uncertainty in Artificial Intelligence, 2020.
Ana Justel, Daniel Peña, and Rubén Zamar. A multivariate Kolmogorov-Smirnov test of goodness of fit. Statistics & Probability Letters, 1997.
H. Kanagawa, W. Jitkrittum, L. Mackey, K. Fukumizu, and A. Gretton. A kernel Stein test for comparing latent variable models. arXiv:1907.00586, 2019.
Jérémie Kellner and Alain Celisse. A one-sample test for normality with kernel methods. Bernoulli, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In International Conference on Learning Representations, 2015.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
I. Kojadinovic and J. Yan. Goodness-of-fit testing based on a weighted bootstrap: A fast large-sample alternative to the parametric bootstrap. Canadian Journal of Statistics, 2012.
A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 1933.
I. A. Koutrouvelis and J. Kellermeier. A goodness-of-fit test based on the empirical characteristic function when parameters must be estimated. Journal of the Royal Statistical Society Series B (Statistical Methodology), 1981.
Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. 3rd edition, 2005.
Anne Leucht and Michael H. Neumann. Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis, 2013.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. In International Conference on Machine Learning, 2015.
Hubert W. Lilliefors. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 1967.
J. N. Lim, M. Yamada, B. Schölkopf, and W. Jitkrittum. Kernel Stein tests for multiple model comparison. In Neural Information Processing Systems, 2019.
Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, 2016.
X. Liu and F.-X. Briol. On the robustness of kernel goodness-of-fit tests. arXiv:2408.05854, 2024.
James Robert Lloyd and Zoubin Ghahramani. Statistical model criticism using kernel two sample tests.
In Advances in Neural Information Processing Systems, 2015.
T. Matsubara, J. Knoblauch, François-Xavier Briol, and C. J. Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society B: Statistical Methodology, 2022.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 1997.
J. Neyman. A Selection of Early Statistical Papers of J. Neyman, chapter 28. University of California Press, 1967.
Richard Nickl. Statistical theory, 2012.
Z. Niu, J. Meier, and François-Xavier Briol. Discrepancy-based inference for intractable generative models using quasi-Monte Carlo. arXiv:2106.11561, 2021.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 2017.
W. C. Parr and W. R. Schucany. Minimum distance and robust estimation. Journal of the American Statistical Association, 1980.
M. Postman, J. P. Huchra, and M. J. Geller. Probes of large-scale structure in the Corona Borealis region. The Astronomical Journal, 1986.
Yong Ren, Jialian Li, Yucen Luo, and Jun Zhu. Conditional generative moment-matching networks. In Advances in Neural Information Processing Systems, 2016.
G. Robins, P. Pattison, Y. Kalish, and D. Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 2007.
Kathryn Roeder. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 1990.
A. Schrab and I. Kim. Robust kernel hypothesis testing under data corruption. arXiv:2405.19912, 2024.
Antonin Schrab, Benjamin Guedj, and Arthur Gretton. KSD aggregated goodness-of-fit test. In Advances in Neural Information Processing Systems, 2022.
Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Benjamin Guedj, and Arthur Gretton.
MMD aggregated two-sample test. Journal of Machine Learning Research, 2023.
D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 2013.
A. Sen and B. Sen. Testing independence and goodness-of-fit in linear models. Biometrika, 2014.
Xiaofeng Shao. The dependent wild bootstrap. Journal of the American Statistical Association, 2010.
Samuel Sanford Shapiro and Martin B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 1965.
B. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 2017.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 2010.
Winfried Stute, Wenceslao González Manteiga, and Manuel Presedo Quindimil. Bootstrap based goodness-of-fit-tests. Metrika, 1993.
Danica J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander J. Smola, and Arthur Gretton. Generative models and model criticism via optimized Maximum Mean Discrepancy. In International Conference on Learning Representations, 2017.
G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 2005.
Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 2013.
Li Wenliang, Danica J. Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, 2019.
Li K. Wenliang and Heishiro Kanagawa. Blindness of score-based methods to isolated components and mixing proportions. arXiv:2008.10087, 2021.
G. Wolfer and P. Alquier.
Variance-aware estimation of kernel mean embedding. arXiv:2210.06672, 2022.
J. Wolfowitz. The minimum distance method. The Annals of Mathematical Statistics, 1957.
G. Wynne and S. Nagy. Statistical depth meets machine learning: Kernel mean embeddings and depth in functional data analysis. arXiv:2105.12778, 2021.
G. Wynne, M. Kasprzak, and A. B. Duncan. A spectral representation of kernel Stein discrepancy with application to goodness-of-fit tests for measures on infinite dimensional Hilbert spaces. arXiv:2206.04552, 2022.
W. Xu and T. Matsuda. A Stein goodness-of-fit test for directional distributions. In Artificial Intelligence and Statistics, 2020.
W. Xu and G. Reinert. A Stein goodness of fit test for exponential random graph models. In Artificial Intelligence and Statistics, 2021.
J. Yang, Q. Liu, V. Rao, and J. Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, 2018.
J. Yang, V. Rao, and J. Neville. A Stein-Papangelou goodness-of-fit test for point processes. In International Conference on Artificial Intelligence and Statistics, 2019.
S. Zhu, B. Chen, P. Yang, and Z. Chen. Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of fit. In International Conference on Artificial Intelligence and Statistics, 2019.