Journal of Machine Learning Research 26 (2025) 1-60 Submitted 2/24; Revised 1/25; Published 2/25

Composite Goodness-of-fit Tests with Kernels

Oscar Key* oscar.key.20@ucl.ac.uk
Centre for Artificial Intelligence, University College London

Arthur Gretton arthur.gretton@gmail.com
Gatsby Computational Neuroscience Unit, University College London

François-Xavier Briol f.briol@ucl.ac.uk
Department of Statistical Science, University College London

Tamara Fernandez* tamara.fernandez@uai.cl
Faculty of Engineering and Science, Adolfo Ibáñez University

Editor: Pierre Alquier

Abstract

We propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on kernel-based distances such as the maximum mean discrepancy. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. We also prove that the popular wild bootstrap will lead to an overly conservative test, and show that the parametric bootstrap is consistent and can lead to significantly improved performance in practice. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of a non-parametric density model, and an intractable generative model of a biological cellular network.

Keywords: bootstrap, hypothesis testing, kernel methods, minimum distance estimation, Stein's method

1. Introduction

Most statistical or machine learning algorithms are based on assumptions about the distribution of the observed data, of auxiliary variables, or of certain estimators. Crucially, the validity of such assumptions directly impacts the performance of these algorithms.
To verify whether such assumptions are reasonable, one approach is goodness-of-fit testing, which considers the problem of rejecting, or not, the hypothesis that some data was generated by a fixed distribution. More precisely, given a distribution P and some observed data x_1, . . . , x_n from some distribution Q, goodness-of-fit tests compare the null hypothesis H_0 : P = Q against the alternative hypothesis H_1 : P ≠ Q.

*. Equal contribution

©2025 Oscar Key, Arthur Gretton, François-Xavier Briol, and Tamara Fernandez. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0276.html.

There is a vast literature on goodness-of-fit testing, and the reader is referred to Chapter 14 of Lehmann and Romano (2005) for a detailed introduction. In the case of univariate real observations, a popular family of tests uses the cumulative distribution function (CDF) as the test statistic, for example the Kolmogorov-Smirnov and Anderson-Darling tests (Kolmogorov, 1933; Anderson and Darling, 1954). Various approaches have been taken to support multivariate observations, including extending these CDF tests (Justel et al., 1997), developing a statistic based on the empirical characteristic function (Jiménez-Gamero et al., 2009), or partitioning the observation space into bins and evaluating a test statistic on the resulting discrete empirical measures (Györfi and van der Meulen, 1991; Beirlant et al., 1994; Györfi and Vajda, 2002; Beirlant et al., 2001). Unfortunately, the applicability of all of these tests is limited, because they rely on statistics that are often difficult to compute for complex models or high-dimensional observation spaces.
To overcome these issues, a class of kernel-based hypothesis tests has been proposed, which construct the statistic through either the maximum mean discrepancy (MMD) (Lloyd and Ghahramani, 2015; Zhu et al., 2019; Balasubramanian et al., 2021; Kellner and Celisse, 2019), the kernel Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016), the energy distance (Székely and Rizzo, 2005, 2013), or the Hilbert-Schmidt independence criterion (Sen and Sen, 2014). These tests can be applied to a wide range of data types by choosing a test statistic based on an appropriate kernel which can be easily evaluated in the setting considered. Additionally, many of these tests are straightforward to apply even for intractable generative models for which no density exists, or for models with unnormalised likelihoods, making them particularly versatile.

In this paper, we consider the more complex question of whether our data comes from any element of some parametric family of distributions. Let {P_θ}_{θ∈Θ} denote a parametric family indexed by a parameter θ in some space Θ. Our test compares the null hypothesis H_0^C : ∃ θ_0 ∈ Θ such that P_{θ_0} = Q, against the alternative hypothesis H_1^C : Q ∉ {P_θ}_{θ∈Θ}. Such tests are known as composite goodness-of-fit tests, and can be much more challenging to construct since θ_0 is usually unknown. They have the potential to answer important questions relating to model misspecification, allowing the user to confirm whether the parametric model they have selected is appropriate for their analysis. Once again there is a significant literature tackling this problem, including tests which are specific to simple parametric families such as Gaussians (Shapiro and Wilk, 1965; Lilliefors, 1967), or composite versions of some of the non kernel-based tests mentioned above (Durbin, 1975; Neyman, 1967).
Unfortunately, these existing tests all suffer from the same drawback as their simple goodness-of-fit counterparts, in that they are only applicable in a limited range of settings. Our paper fills an important gap in the literature by proposing the first set of comprehensive kernel-based composite hypothesis tests. Given a kernel-based discrepancy D, our test statistics take the form Λ := min_{θ∈Θ} n D(P_θ, Q_n). That is, we use the smallest discrepancy between Q_n and any element of the parametric family to determine whether H_0^C should be rejected or not. In this work, we primarily consider the case where D = MMD², and include a full theoretical and empirical analysis of the test's behaviour. This analysis is challenging because, in contrast to existing kernel-based tests, the data now enters the test statistic twice: once to select the closest element of the parametric family, and a second time to estimate the discrepancy. We also include encouraging empirical results for D = KSD², but leave the extension of our theoretical framework to this test for future work.

Our tests extend the advantages of kernel-based tests to the composite setting. In particular, they can be directly applied to both generative and unnormalised models, two wide classes which cannot be tackled in full generality with classical composite tests. Within the kernel testing literature, general composite tests have not yet been proposed. Existing tests have either been limited to specific parametric families, such as multivariate Gaussians (Kellner and Celisse, 2019) or survival models with a specific survival function (Fernández et al., 2020), or required splitting the data set, thus reducing test power (Chwialkowski et al., 2016; Liu et al., 2016). Wolfer and Alquier (2022) also proposed a promising non-asymptotic composite test based on the MMD, but did not study it for generative models nor consider bootstrapping algorithms.
We also show in our experiments that this approach leads to a more conservative test.

The remainder of the paper is organised as follows. Section 2: we recall existing work on testing and parameter estimation with kernels. Section 3: we propose novel kernel-based composite hypothesis tests, and show that the MMD test statistic has well-behaved limiting distributions despite the reuse of the data for both estimation and testing. Section 4: we compare two choices of bootstrap method that we can use to implement these tests in practice. Section 5: we demonstrate these algorithms on a range of problems including an intractable generative model and non-parametric density estimation. Section 6: we discuss limitations of our work. Section 7: we highlight connections with existing tests. The code to reproduce our experiments, and implement the tests for your own models, is available at: https://github.com/oscarkey/composite-tests

2. Background

We begin by reviewing the use of kernel discrepancies for testing and estimation. We denote by X the data space and P(X) the set of Borel probability distributions on X.

2.1 Kernel-based Discrepancies

To measure the similarity of two distributions P, Q ∈ P(X), we can use a discrepancy D : P(X) × P(X) → [0, ∞). We consider discrepancies related to integral probability metrics (IPMs) (Müller, 1997). An IPM indexed by the set of functions F has the form

D_F(P, Q) = sup_{f∈F} |E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)]|,   (1)

where F is sufficiently rich so that D_F is a statistical divergence; i.e. D_F(P, Q) = 0 ⟺ P = Q. A first discrepancy is the maximum mean discrepancy (MMD) (Gretton et al., 2007, 2012). Denote by H_K the reproducing kernel Hilbert space (RKHS) associated to the kernel K : X × X → R (Berlinet and Thomas-Agnan, 2004).
The MMD is obtained by taking F_MMD = {f ∈ H_K : ||f||_{H_K} ≤ 1}, which is a convenient choice because it allows the supremum in Equation 1 to be evaluated in closed form whenever E_{X∼P}[√K(X, X)] < ∞ for all P ∈ P(X):

MMD²(P, Q) = E_{X,X'∼P}[K(X, X')] − 2 E_{X∼P, X'∼Q}[K(X, X')] + E_{X,X'∼Q}[K(X, X')].   (2)

Assuming we have access to independent and identically distributed realisations {x̃_i}_{i=1}^m i.i.d. from P and {x_j}_{j=1}^n i.i.d. from Q, we can define P_m = (1/m) Σ_{i=1}^m δ_{x̃_i} and Q_n = (1/n) Σ_{j=1}^n δ_{x_j} (where δ_x is a Dirac measure at x ∈ X), and we get the following estimate:

MMD²(P_m, Q_n) = (1/m²) Σ_{i,j=1}^m K(x̃_i, x̃_j) − (2/nm) Σ_{i=1}^m Σ_{j=1}^n K(x̃_i, x_j) + (1/n²) Σ_{i,j=1}^n K(x_i, x_j).

Furthermore, when n = m, this simplifies to

MMD²(P_n, Q_n) = (1/n²) Σ_{i,j=1}^n h_MMD((x̃_i, x_i), (x̃_j, x_j)),

where h_MMD((x̃_i, x_i), (x̃_j, x_j)) = K(x̃_i, x̃_j) + K(x_i, x_j) − K(x̃_i, x_j) − K(x̃_j, x_i). This is known as a V-statistic (Gretton et al., 2012), and h_MMD is known as the core of this statistic. The statistic is biased, but has smaller variance than alternative U-statistics for this quantity.

We also consider the kernel Stein discrepancy (KSD) (Oates et al., 2017; Chwialkowski et al., 2016; Liu et al., 2016), which is obtained by applying Stein operators to functions in some RKHS; see Anastasiou et al. (2023) for a recent review. Let X = R^d and H_K^d = H_K × · · · × H_K denote the d-dimensional tensor product of H_K. The most common instance is the Langevin KSD, for which F_KSD = {S_P[f] : ||f||_{H_K^d} ≤ 1}, where S_P[g](x) = g(x)^⊤ ∇_x log p(x) + ∇_x · g(x) is the Langevin Stein operator, p is the Lebesgue density of P, and ∇_x = (∂/∂x_1, . . . , ∂/∂x_d)^⊤. With this choice, we obtain the discrepancy:

KSD²(P, Q) = E_{X,X'∼Q}[h_KSD(X, X')],

where

h_KSD(x, x') = K(x, x') ∇_x log p(x)^⊤ ∇_{x'} log p(x') + ∇_x log p(x)^⊤ ∇_{x'} K(x, x') + ∇_{x'} log p(x')^⊤ ∇_x K(x, x') + ∇_x · ∇_{x'} K(x, x').

Given Q_n = (1/n) Σ_{j=1}^n δ_{x_j}, we obtain a V-statistic KSD²(P, Q_n) = (1/n²) Σ_{i,j=1}^n h_KSD(x_i, x_j). We choose the MMD and the KSD to implement our tests because they are applicable in complementary settings.
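The V-statistic above is straightforward to compute in practice. The following is a minimal NumPy sketch with a Gaussian kernel (the function names are ours, for illustration, and not taken from our released code):

```python
import numpy as np

def gaussian_kernel(x, y, lengthscale=1.0):
    # Pairwise K(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * lengthscale^2)).
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * lengthscale**2))

def mmd2_v(x_p, x_q, lengthscale=1.0):
    # Biased V-statistic estimate of MMD^2 between samples from P and Q.
    k_pp = gaussian_kernel(x_p, x_p, lengthscale)  # first term of Equation 2
    k_pq = gaussian_kernel(x_p, x_q, lengthscale)  # cross term
    k_qq = gaussian_kernel(x_q, x_q, lengthscale)  # last term
    return k_pp.mean() - 2 * k_pq.mean() + k_qq.mean()
```

When the two samples coincide the estimate is exactly zero, and it grows as the samples move apart.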
The MMD can be estimated whenever samples from P_θ are available. This makes the corresponding test particularly suitable for simulator-based models (also called generative models); i.e. models which can be represented through a pair (U, G_θ) such that sampling x ∼ P_θ involves sampling u ∼ U and setting x = G_θ(u). This class of models covers many models widely used in machine learning, including variational autoencoders (Kingma and Welling, 2014) and generative adversarial networks (Li et al., 2015; Dziugaite et al., 2015), but also in the sciences, including in synthetic biology (Bonassi et al., 2011) and telecommunication engineering (Bharti et al., 2021), to name just a few. The interested reader is referred to Cranmer et al. (2020) for a recent review. On the other hand, the KSD requires pointwise evaluations of ∇ log p, and the KSD-based test will therefore be particularly suitable when the density of P_θ is tractable. This includes cases where the density can only be evaluated up to a normalisation constant, such as for deep energy models or nonparametric density estimation models (Canu and Smola, 2006; Fukumizu, 2009; Sriperumbudur et al., 2017; Matsubara et al., 2022), exponential random graphs (Robins et al., 2007), and large spatial models based on lattice structures (Besag, 1974).

Throughout this paper, we will assume that our kernel-based divergences are valid statistical divergences. This can be guaranteed for the MMD by assuming the kernel K is characteristic (Sriperumbudur et al., 2010). For the KSD we require that K is integrally strictly positive definite and E_{X∼Q}[||∇_x log p(X) − ∇_x log q(X)||²] < ∞ (see for example Chwialkowski et al. (2016, Theorem 2.1) and Barp et al. (2019, Proposition 1)). Examples of kernels on R^d which satisfy these assumptions include the Gaussian, inverse multiquadric and Matérn kernels.
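To make the complementarity concrete, the KSD² V-statistic from Section 2.1 can be computed from samples and the score ∇_x log p alone, so an unnormalised density suffices. Below is a hedged NumPy sketch for a Gaussian kernel; the function names are ours, and the closed-form kernel derivatives are specific to the Gaussian kernel chosen here:

```python
import numpy as np

def ksd2_v(x, score_fn, lengthscale=1.0):
    # V-statistic estimate of the Langevin KSD^2 with a Gaussian kernel.
    # score_fn(x) returns grad_x log p(x) with shape (n, d).
    n, d = x.shape
    s = score_fn(x)                                  # scores at each point
    diff = x[:, None, :] - x[None, :, :]             # pairwise x_i - x_j, (n, n, d)
    sqd = np.sum(diff**2, axis=-1)                   # squared distances
    k = np.exp(-sqd / (2 * lengthscale**2))
    l2 = lengthscale**2
    term1 = s @ s.T                                  # s(x_i)^T s(x_j)
    term2 = np.einsum("id,ijd->ij", s, diff) / l2    # s(x_i)^T grad_{x_j} K / K
    term3 = -np.einsum("jd,ijd->ij", s, diff) / l2   # s(x_j)^T grad_{x_i} K / K
    term4 = d / l2 - sqd / l2**2                     # trace term grad.grad K / K
    return (k * (term1 + term2 + term3 + term4)).mean()

def standard_normal_score(x):
    # grad_x log p(x) for p = N(0, I); the normalising constant plays no role.
    return -x
```

With the correct score the statistic is close to zero; evaluating the score of N(0, I) on shifted data yields a much larger value.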
While these conditions ensure that the MMD and KSD can distinguish any pair of distributions as n, m → ∞, for finite samples this is not necessarily the case. The ability of the divergences to distinguish distributions depends primarily on whether an appropriate kernel has been selected for the data in question. Additionally, the KSD suffers from a failure mode when applied to multi-modal distributions, common for methods based on ∇_x log p: if Q contains isolated modes with areas of near-zero density between them, the KSD cannot distinguish Q from a distribution P that is identical except for having a different weighting of the modes (Wenliang and Kanagawa, 2021). We discuss the impacts of these failure modes on our tests in Section 6.

2.2 Goodness-of-fit Testing with Kernels

We now recall how divergences lend themselves naturally to (standard) goodness-of-fit testing. Let D : P(X) × P(X) → [0, ∞) be some statistical divergence, and recall that we are testing H_0 : P = Q. A natural approach is to compute D(P, Q) and check whether it is zero (in which case H_0 holds) or not (in which case H_1 holds). Since we only have access to independent realisations {x_j}_{j=1}^n instead of Q itself, this idealised procedure is replaced by the evaluation of D(P, Q_n). The question then becomes whether or not D(P, Q_n) is further away from zero than we would expect under H_0 given a data set of size n. Kernel-based discrepancies can be computed in a wide range of scenarios, including for intractable models. These divergences also have favourable sample complexity, in that D(P, Q_n) converges to D(P, Q) at a square-root-n rate for both the KSD (Matsubara et al., 2022, Theorem 4) and the MMD (Briol et al., 2019, Lemma 1). The MMD was first used by Gretton et al. (2007, 2012) for two-sample testing, and then studied for goodness-of-fit testing by Lloyd and Ghahramani (2015). MMD tests are closely related to many classical tests; see Sejdinovic et al. (2013). Chwialkowski et al. (2016); Liu et al.
(2016) introduced an alternative test based on KSD²(P, Q_n). To determine when H_0 should be rejected, we select an appropriate threshold c_α ∈ R, which will depend on the level of the test α ∈ [0, 1]. More precisely, c_α should be the (1 − α)-quantile of the distribution of the test statistic under H_0. This distribution will usually be unknown, but can be approximated using a bootstrap method. A common example is the wild bootstrap (Shao, 2010; Leucht and Neumann, 2013), which was specialised for kernel tests by Chwialkowski et al. (2014). We will return to this point in Section 4.

2.3 Minimum Distance Estimation with Kernels

A statistical divergence D can also be used for parameter estimation through minimum distance estimation (Wolfowitz, 1957; Parr and Schucany, 1980). Given a parametric family {P_θ}_{θ∈Θ} ⊆ P(X) indexed by Θ ⊆ R^p and some data {x_i}_{i=1}^n i.i.d. from Q, a natural estimator is

θ̂_n := argmin_{θ∈Θ} D(P_θ, Q_n).

In practice, the minimum is often obtained using numerical optimisation algorithms. Under regularity conditions on D and P_θ, the estimator approaches θ_* := argmin_{θ∈Θ} D(P_θ, Q) as n → ∞. In particular, when the model is well-specified (i.e. H_0^C holds), then θ_* = θ_0. The MMD was first used for parameter estimation for neural networks by Dziugaite et al. (2015); Li et al. (2015); Sutherland et al. (2017); Li et al. (2017); Bińkowski et al. (2018), then studied more broadly as a statistical estimator by Briol et al. (2019); Chérief-Abdellatif and Alquier (2022, 2020); Niu et al. (2021); Dellaporta et al. (2022); Bharti et al. (2023). These estimators closely relate to minimum scoring rule estimators based on the kernel scoring rule (Dawid, 2007). The use of the KSD for parameter estimation was first proposed by Barp et al. (2019), and later extended by Betsch et al. (2020); Grathwohl et al. (2020); Gong et al. (2021); Matsubara et al. (2022). These estimators are also closely related to score-matching estimators (Hyvärinen, 2006); see Barp et al.
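As a toy illustration of minimum MMD estimation: in our experiments we use gradient-based stochastic optimisation, but the idea can be sketched with a simple grid search over θ for a one-dimensional Gaussian location model (all names and the grid-search substitute below are illustrative, not our actual implementation):

```python
import numpy as np

def mmd2(x_p, x_q, ls=1.0):
    # V-statistic estimate of MMD^2 with a Gaussian kernel (1-D data).
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls**2))
    return k(x_p, x_p).mean() - 2 * k(x_p, x_q).mean() + k(x_q, x_q).mean()

def min_mmd_estimate(data, simulate, thetas, m=500, seed=0):
    # Grid-search minimum MMD estimator: argmin_theta MMD^2(P_theta, Q_n).
    # simulate(theta, m, rng) draws m samples from P_theta.
    rng = np.random.default_rng(seed)
    losses = [mmd2(simulate(t, m, rng), data) for t in thetas]
    return thetas[int(np.argmin(losses))]

# Example: Gaussian location model P_theta = N(theta, 1), data from N(2, 1).
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=400)
simulate = lambda t, m, r: r.normal(t, 1.0, size=m)
theta_hat = min_mmd_estimate(data, simulate, np.linspace(-5, 5, 101))
```

Here `theta_hat` lands near the true location 2.0, up to Monte Carlo noise and the grid resolution.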
(2019, Theorem 2) for details.

3. Methodology: The Test Statistic

We are now ready to present our composite goodness-of-fit tests. The first section describes our general framework and studies the asymptotic distribution of the MMD test statistic. In the next section, we then describe how to implement the tests through bootstrapping.

3.1 Composite Goodness-of-fit Testing with Kernels

The high-level idea behind our test statistics is to use the smallest kernel-based discrepancy value between the data Q_n and any element of the parametric family:

Λ := min_{θ∈Θ} n D(P_θ, Q_n) = min_{P∈{P_θ}_{θ∈Θ}} n D(P, Q_n).

To implement the tests, we propose a two-stage approach:

Stage 1 (Estimate): Compute θ̂_n = argmin_{θ∈Θ} D(P_θ, Q_n), so that Λ = n D(P_{θ̂_n}, Q_n).

Stage 2 (Test): Compute or estimate a test threshold c_α from the distribution of the test statistic under H_0^C. If Λ > c_α, reject H_0^C; else, do not reject.

In this paper, we will study two instances of this framework. Our primary focus will be on the MMD composite goodness-of-fit test, where D = MMD². The next section provides regularity conditions so that, as n → ∞, the statistic for this test converges to an infinite sum of χ² random variables under H_0^C but diverges under H_1^C, thus allowing us to distinguish between the two hypotheses. The MMD test is widely applicable since it only requires that we can sample from P_θ.

Figure 1: Illustration of the sources of error when estimating the test statistic under the null hypothesis. Existing non-composite tests use the test statistic n D(P, Q_n), where D is a statistical divergence, and thus encounter only test error. The composite test we introduce uses the statistic n D(P_{θ̂_n}, Q_n), and thus encounters both estimation and test error. (a) Non-composite test. (b) Composite test.

The second instance will be the KSD composite goodness-of-fit test, where D = KSD².
Although we do not provide theoretical guarantees for this case, we do conjecture that similar guarantees exist. This test is particularly interesting since it only requires a tractable (possibly unnormalised) density function for P_θ. In both tests, stage 1 requires computing the solution to the optimisation problem θ̂_n = argmin_{θ∈Θ} D(P_θ, Q_n). The best method for this depends on the model and should be selected alongside it. In this paper, for the MMD test, we use gradient-based stochastic optimisation and automatic differentiation. For the KSD test, we use the closed-form expression for the minimiser, which is available when P_θ is a member of the exponential family of distributions (see Section D.1). However, stochastic optimisation could also be used in order to work with more general families of models.

3.2 Theoretical Analysis of the MMD Composite Goodness-of-fit Test

To present our analysis of the MMD test, we first define some additional notation. Since we will be interested in asymptotic distributions, we will now consider the data to be random variables X_1, . . . , X_n i.i.d. from Q, rather than the realisations x_1, . . . , x_n introduced so far. We then overload Q_n = (1/n) Σ_{i=1}^n δ_{X_i} to define a random measure. This is done in order to simplify notation, and whether we refer to the random or realised Q_n can be inferred from the context. The results we present in this paper are based on the asymptotic distribution of n MMD²(P_{θ̂_n}, Q_n) as n → ∞. This is similar to standard kernel-based goodness-of-fit tests, which use n MMD²(P, Q_n), the main difference being that our test statistic depends on the data through both θ̂_n and Q_n. In other words, we are now using the data twice, once for estimation and once for testing. We note that this creates some non-trivial dependence and makes the derivation of theoretical guarantees significantly more challenging. We first state and discuss our assumptions, then present all results in the following subsection.
3.2.1 Assumptions

We will write C^m(X) for the set of m-times continuously differentiable functions on X, and C_b^m(X) for the subset of C^m(X) where all of the m derivatives are bounded. For bivariate functions, we will denote by C_b^{m,m}(X × X) the set of functions which are C_b^m(X × X) with respect to each input. We can now present our first assumption.

Assumption 1. The observations {x_i}_{i=1}^n are independent and identically distributed realisations from Q.

This assumption is common, and covers a very large class of applications, including all our examples in Section 5. It could be relaxed to dependent data, but this would require the development of specialised tools and notions of dependence which are beyond the scope of this paper; see for example Chérief-Abdellatif and Alquier (2022).

Assumption 2. X ⊆ R^d, X is separable, and Θ is a compact and convex subset of R^p.

These assumptions are relatively mild and are needed for the consistency of our estimators. Even when working with models where the assumptions on Θ are not satisfied, it is often possible to reparameterise the model so that the assumption holds. Our next assumption defines regularity assumptions on the model considered by the MMD test, which will have to be verified on a per-model basis.

Assumption 3. {P_θ}_{θ∈Θ} is an identifiable parametric family of models. Each element P_θ has a density p_θ with respect to a common measure λ which satisfies:
- For each fixed x ∈ X, θ ↦ p_θ(x) is in C³(Θ), and p_θ : X → R is measurable for almost all θ ∈ Θ.
- For any collection ℓ, i, j ∈ {1, . . . , p}, the following holds true:
∫_X |∂p_θ(x)/∂θ_ℓ| λ(dx) < ∞, ∫_X |∂²p_θ(x)/∂θ_ℓ∂θ_i| λ(dx) < ∞, ∫_X |∂³p_θ(x)/∂θ_ℓ∂θ_i∂θ_j| λ(dx) < ∞.

The smoothness assumptions with respect to the parameter are required to ensure convergence of θ̂_n to θ_0 in an appropriate sense under H_0^C. Note that this assumption, together with the fact that Θ is compact, allows us to deduce ∫_X sup_{θ∈Θ} p_θ(x) λ(dx) < ∞.
On the other hand, the smoothness assumptions with respect to X are needed for the divergences to be well defined. In many cases, when applying the MMD test, we will not have access to a computable density function and will only be able to sample from the model. For example, many models are defined in terms of a distribution U on some latent space and a generator function G_θ : U → X, where a sample from the model is x = G_θ(u) with u ∼ U. In this case, the results could be updated to instead place assumptions on U and G_θ rather than on the density, as was done in Briol et al. (2019). We make an additional assumption about the model family under the null hypothesis.

Assumption 4. Under H_0^C, we have that θ_0, the parameter value under the null hypothesis, belongs to the interior of Θ.

For most models this assumption will be met, or could be met by reparameterising the model. Our next assumption relates to the kernel underlying the MMD.

Assumption 5. The kernel K : X × X → R is bounded, continuous and characteristic.

This assumption is mild and will be satisfied by kernels commonly used in the literature on minimum MMD estimators, such as the Gaussian and Matérn kernels of sufficient smoothness. Our final assumption concerns the regularity of the divergences around the minimisers. This is necessary to guarantee convergence of our minimum distance estimators under H_0^C, and is a mild assumption which is common in the literature (see e.g. Theorem 2 in Briol et al. (2019) and Theorem 4 in Barp et al. (2019)).

Assumption 6. The Hessian matrix H ∈ R^{p×p} given by H_ij = ∂²/∂θ_i∂θ_j MMD²(P_θ, P_{θ_0})|_{θ=θ_0} for any i, j ∈ {1, . . . , p} is positive definite.

Remark 1. We could extend the results to the KSD test by noting that the KSD is equivalent to an MMD with a specific kernel (Barp et al., 2019). However, this kernel is often unbounded and depends on p_θ. Thus, additional assumptions on p_θ would be required. We leave this extension to future work.
In our experiments we assume that the KSD is a valid statistical divergence, which holds under the conditions discussed in Section 2.1.

3.2.2 Behaviour of the Test Statistic Under H_0^C and H_1^C

We now present results on the asymptotic distribution of our test statistic, and the power (i.e. probability of correctly rejecting H_0^C when it does not hold) of the associated tests, as n → ∞.

Theorem 2 (Convergence under the null hypothesis). Under H_0^C and Assumptions 1 to 6:

n MMD²(P_{θ̂_n}, Q_n) →_D Σ_{i=1}^∞ λ_i Z_i² as n → ∞,   (3)

where (Z_i)_{i≥1} are a collection of i.i.d. standard normal random variables and (λ_i)_{i≥1} are constants, which depend on the choice of kernel, with Σ_{i=1}^∞ λ_i < ∞.

Theorem 2 guarantees that our test statistic has a well-behaved limiting distribution under the null hypothesis H_0^C. Unfortunately, this distribution is not tractable, but knowing that it exists and is well-behaved is still necessary for guaranteeing that we can approximate quantiles with a bootstrapping method, as will be discussed in Section 4. The result is analogous to Theorem 12 in Gretton et al. (2012), though that result is only valid for testing H_0 vs. H_1 and not H_0^C vs. H_1^C. Our second result concerns the power of the test: when H_1^C holds, we show that our test is able to reject H_0^C when n is large enough.

Theorem 3 (Consistency under the alternative hypothesis). Under Assumptions 1 to 6, we have that H_1^C holds if and only if lim inf_{n→∞} MMD²(P_{θ̂_n}, Q_n) > 0.

As our test statistic, n MMD²(P_{θ̂_n}, Q_n), is scaled by n, the above result implies that the test statistic diverges as n → ∞.

4. Methodology: Practical Implementation via Bootstraps

To implement the tests, we require an estimate of the rejection threshold c_α, which should be set to the (1 − α)-quantile of the distribution of n D(P_{θ̂_n}, Q_n) under H_0^C. Since this distribution is unknown, we estimate the threshold using bootstrap algorithms.
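For intuition about the limiting law in Equation 3: if the weights (λ_i) were known, the (1 − α)-quantile of the (truncated) sum Σ_i λ_i Z_i² could be approximated by simple Monte Carlo, as in the sketch below (the function name and the truncation to finitely many λ_i are ours). In practice the λ_i are not available, which is why we rely on bootstrap methods.

```python
import numpy as np

def weighted_chi2_quantile(lams, alpha=0.05, n_draws=100_000, seed=0):
    # Monte Carlo (1 - alpha)-quantile of sum_i lam_i * Z_i^2, Z_i iid N(0, 1),
    # with the infinite sum truncated to the weights supplied in `lams`.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, len(lams)))
    samples = (np.asarray(lams) * z**2).sum(axis=1)
    return np.quantile(samples, 1 - alpha)
```

For a single weight λ_1 = 1 this recovers the 0.95-quantile of a χ²₁ variable, approximately 3.84.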
In the following section we present two possible choices, the wild bootstrap and the parametric bootstrap, and analyse their behaviour when applied to the MMD test.

4.1 The Wild Bootstrap

A natural first approach is to use the wild bootstrap, since it is the current state-of-the-art for non-composite kernel goodness-of-fit tests (Chwialkowski et al., 2016; Liu et al., 2016; Schrab et al., 2022). The algorithm is based on bootstrapped versions of the test statistics, defined for non-composite tests as

n MMD²_W(P_n, Q_n) := (1/n) Σ_{i,j=1}^n W_i W_j h_MMD((x̃_i, x_i), (x̃_j, x_j)),
n KSD²_W(P, Q_n) := (1/n) Σ_{i,j=1}^n W_i W_j h_{KSD,P}(x_i, x_j),

where W_1, . . . , W_n are i.i.d. Rademacher random variables (i.e. random variables taking values +1 and −1 with probability 1/2 each). Note that the bootstrapped version of the MMD test requires realisations of both P and Q (i.e. P_n and Q_n), whereas the bootstrapped version of the KSD test requires evaluations of the score of the distribution P and samples from Q (denoted through the empirical measure Q_n). Given fixed observations Q_n, it is possible to draw samples Λ^(1), . . . , Λ^(b) of the bootstrapped test statistics by taking fresh realisations of W_1, . . . , W_n for each sample. If we consider an MMD test, under H_0, Chwialkowski et al. (2014, Theorem 1) show that the distributions of n MMD²(P, Q_n) and n MMD²_W(P, Q_n) converge to the same distribution as n → ∞. Thus, we can estimate the threshold c_α during a non-composite test by computing the (1 − α)-quantile of {Λ^(k)}_{k=1}^b. Under H_1, Chwialkowski et al. (2014, Proposition 3.2) show that n MMD²_W(P_n, Q_n) will converge to a finite quantity, while Theorem 3 shows that n MMD²(P_n, Q_n) → ∞. This means that P(n MMD²(P_n, Q_n) > c_α) → 1 as n → ∞, and the test rejects H_0 correctly. We can extend the wild bootstrap to the composite case by first estimating the parameter and then applying the wild bootstrap to the estimated distribution.
Thus, the bootstrapped test statistics are n MMD²_W(P_{θ̂_n,n}, Q_n) and n KSD²_W(P_{θ̂_n}, Q_n), where P_{θ̂_n,n} = (1/n) Σ_{i=1}^n δ_{X̃_i} with X̃_i ∼ P_{θ̂_n}. Algorithm 1 describes this approach for the MMD, and Algorithm 2 for the KSD. Note that the KSD version of the test does not require an additional sampling step, since the KSD can be estimated with a single sample, whereas two different samples are needed to estimate the MMD.

Algorithm 1: Wild bootstrap (MMD)
Input: P_{θ̂_n}, Q_n, α, b
P_{θ̂_n,n} = (1/n) Σ_{i=1}^n δ_{x̃_i}, where x̃_1, . . . , x̃_n i.i.d. from P_{θ̂_n};
for k ∈ {1, . . . , b} do
  w^(k) = (w^(k)_1, . . . , w^(k)_n) i.i.d. Rademacher;
  Λ^(k) = n MMD²_{w^(k)}(P_{θ̂_n,n}, Q_n);
c_α = quantile({Λ^(1), . . . , Λ^(b)}, 1 − α);

Algorithm 2: Wild bootstrap (KSD)
Input: P_{θ̂_n}, Q_n, α, b
for k ∈ {1, . . . , b} do
  w^(k)_1, . . . , w^(k)_n i.i.d. Rademacher;
  Λ^(k) = n KSD²_{w^(k)}(P_{θ̂_n}, Q_n);
c_α = quantile({Λ^(1), . . . , Λ^(b)}, 1 − α);

It is not clear from the existing results in the literature whether applying the wild bootstrap to the composite test in this fashion will result in a test with a type I error rate (i.e. probability of wrongly rejecting H_0^C when it holds) smaller than or equal to α. This is because the composite test statistic contains an additional source of error in comparison to the non-composite statistic. MMD²(P_n, Q_n) contains one source of error, due to estimating MMD²(P, Q_n) through samples. However, MMD²(P_{θ̂_n,n}, Q_n) contains a second source of error, because P_{θ_0}, the true data-generating distribution under H_0^C, is estimated with P_{θ̂_n}. Figure 1 illustrates these sources of error in the non-composite and composite test statistics under the null hypothesis. If we fix an estimate θ̂_n and then apply the wild bootstrap, we fail to take into account this additional source of error. Our main result for this section shows that applying the wild bootstrap does lead to a type I error rate smaller than or equal to α in the MMD test.
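Algorithm 1 can be sketched in a few lines of NumPy for one-dimensional data (a simplified illustration with a Gaussian kernel; the function name and interface are ours):

```python
import numpy as np

def wild_bootstrap_threshold(x_model, x_data, alpha=0.05, b=500, ls=1.0, seed=0):
    # Wild-bootstrap estimate of the threshold c_alpha for the composite MMD test.
    # x_model: n samples from P_theta_hat; x_data: n observations from Q (1-D).
    rng = np.random.default_rng(seed)
    n = len(x_data)
    k = lambda a, c: np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * ls**2))
    # Core matrix h_MMD((x~_i, x_i), (x~_j, x_j)).
    h = (k(x_model, x_model) + k(x_data, x_data)
         - k(x_model, x_data) - k(x_data, x_model))
    stats = np.empty(b)
    for j in range(b):
        w = rng.choice([-1.0, 1.0], size=n)  # Rademacher weights
        stats[j] = w @ h @ w / n             # n * (1/n^2) sum_ij w_i w_j h_ij
    return np.quantile(stats, 1 - alpha)
```

Each bootstrap draw only re-samples the Rademacher weights; the kernel matrix is computed once, which is what makes this bootstrap cheap.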
However, our result shows that the (asymptotic) type I error rate is strictly below α, which is surprising, as resampling strategies usually yield asymptotically-exact type I error rates. This result suggests that our test is conservative and could potentially be improved further.

Theorem 4. Let α ∈ (0, 1), and define

q_α = inf{γ : lim sup_{n→∞} P(n MMD²_W(P_{θ̂_n,n}, Q_n) > γ) ≤ α}.

Under the null hypothesis and Assumptions 1 to 6, it holds that

lim sup_{n→∞} P(n MMD²(P_{θ̂_n}, Q_n) ≥ q_α) < α.

The root cause of the above result is that, under both H_0^C and H_1^C, MMD²(P_{θ̂_n}, Q_n) ≤ MMD²(P_{θ_0}, Q_n), because θ̂_n is chosen as the minimiser of θ ↦ MMD²(P_θ, Q_n). The impact of this on the wild bootstrap is that, for all x ∈ R,

lim sup_{n→∞} P(n MMD²(P_{θ̂_n}, Q_n) > x) < lim sup_{n→∞} P(n MMD²_W(P_{θ̂_n,n}, Q_n) > x),

as we show in the proof of Theorem 4. This means that every quantile of the asymptotic distribution of the wild bootstrap statistic is shifted away from zero compared to the test statistic, as demonstrated in Figure 2. Thus, the threshold computed from the samples of the bootstrapped test statistic is too large, and the resulting test is conservative. We conjecture that similar behaviour occurs under H_1^C, resulting in the loss of power we observe in the experiments (see e.g. Figure 3b). To achieve better performance, we can use an alternative bootstrap which does take account of estimation error.

Figure 2: Distribution of MMD²(P_{θ̂_n}, Q_n) and MMD²_W(P_{θ̂_n,n}, Q_n) under H_0^C, obtained by simulation. The respective (1 − α) quantiles are also shown, demonstrating that the wild bootstrap estimates a conservative threshold.

4.2 The Parametric Bootstrap

We now consider the parametric bootstrap (Stute et al., 1993) presented in Algorithm 3, which is commonly used in the composite testing literature (for example Kellner and Celisse (2019)).
To approximate the distribution of $n\,D(P_{\hat\theta_n}, Q_n)$ under $H^C_0$, this approach first fits the parameter to the observations. It then repeatedly resamples the observations from the fitted model, re-estimates the parameter, and recomputes the test statistic. By repeatedly re-estimating the parameter, the parametric bootstrap takes account of the estimation error.

Algorithm 3: Parametric bootstrap
  Input: $D$, $P_\theta$, $Q_n$, $\alpha$, $b$ (number of bootstrap samples)
  $\hat\theta_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q_n)$;
  for $k \in \{1, \dots, b\}$ do
    $Q^{(k)}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x^{(k)}_i}$, where $x^{(k)}_1, \dots, x^{(k)}_n \overset{\text{iid}}{\sim} P_{\hat\theta_n}$;
    $\hat\theta^{(k)}_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q^{(k)}_n)$;
    compute the bootstrap statistic $n\,D(P_{\hat\theta^{(k)}_n}, Q^{(k)}_n)$;
  $c_\alpha$ = the empirical $(1-\alpha)$-quantile of the $b$ bootstrap statistics;

Algorithm 3 generates samples of the bootstrapped test statistic $n\,D(P_{\hat\theta^*_n}, Q^*_n)$, where $\hat\theta^*_n = \arg\min_{\theta\in\Theta} D(P_\theta, Q^*_n)$, and $Q^*_n = \frac{1}{n}\sum_{i=1}^n \delta_{X^*_i}$ with $X^*_i \sim P_{\hat\theta_n}$. We show that the parametric bootstrap is valid for the MMD test, which once again requires that the distribution of the test statistic $n\,\mathrm{MMD}^2(P_{\hat\theta_n}, Q_n)$ and the distribution of the bootstrapped test statistic $n\,\mathrm{MMD}^2(P_{\hat\theta^*_n}, Q^*_n)$ converge to the same limit as $n \to \infty$. To present the result we require some additional notation, which is defined formally in Section A.1.4: for random variables $A_n$ and $A$, we write $A_n \xrightarrow{\mathcal{D}_D} A$ to indicate that $A_n$ converges to $A$ in distribution conditioned on the data $\{X_i\}_{i=1}^n$.

Theorem 5 (Parametric bootstrap under the null). Under $H^C_0$ and Assumptions 1 to 6, we have that $n\,\mathrm{MMD}^2(P_{\hat\theta^*_n}, Q^*_n) \xrightarrow{\mathcal{D}_D} \Gamma$, where $\Gamma$ is a random variable with distribution given by Equation 3, i.e. the distribution of the test statistic under $H^C_0$.

Note that the above convergence holds conditioned on any sequence of observations, and thus holds in general. In Section 5 we find empirically that applying the parametric bootstrap to our tests results in a good type I error rate and better performance than the wild bootstrap when $n$ is small.
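The structure of Algorithm 3 can be sketched generically as below. This is a minimal illustration under our own assumptions: `fit`, `sample` and `statistic` are user-supplied stand-ins for the minimum distance estimator, the model sampler, and the statistic $n\,D(P_\theta, Q_n)$, and the function names are hypothetical.

```python
import numpy as np

def parametric_bootstrap_test(X_obs, fit, sample, statistic, alpha=0.05, b=200, seed=0):
    """Sketch of Algorithm 3.
    fit(X) -> theta_hat (a minimum distance estimator in the paper),
    sample(theta, n, rng) -> n draws from P_theta,
    statistic(theta, X) -> n * D(P_theta, empirical measure of X)."""
    rng = np.random.default_rng(seed)
    n = len(X_obs)
    theta_hat = fit(X_obs)
    stat = statistic(theta_hat, X_obs)
    boot = np.empty(b)
    for k in range(b):
        Xk = sample(theta_hat, n, rng)    # resample observations from the fitted model
        theta_k = fit(Xk)                 # re-estimate the parameter ...
        boot[k] = statistic(theta_k, Xk)  # ... and recompute the statistic
    c_alpha = np.quantile(boot, 1 - alpha)
    return stat > c_alpha, stat, c_alpha
```

Re-fitting inside the loop is exactly what accounts for the estimation error, and also what makes this bootstrap more expensive than the wild bootstrap.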
However, the parametric bootstrap is substantially more computationally intensive, because it requires repeatedly computing a minimum distance estimator (and hence the kernel matrix) on fresh data, whereas the wild bootstrap only uses a single minimum distance estimator. Assuming we compute the minimum distance estimators through $T$ steps of a numerical optimiser based on function evaluations or gradients, the computational complexity of the wild bootstrap algorithms is $O((T + b)n^2)$, whilst that of the parametric bootstrap is $O(Tbn^2)$. We therefore expect the parametric bootstrap to be significantly more expensive than the wild bootstrap. However, this extra computation is likely an issue only for large $n$, and in this high-data regime we find in our experiments that the wild bootstrap achieves similar power to the parametric bootstrap. Thus, in a high-data regime there is little penalty to using the cheaper wild bootstrap. Note that the cost could be further reduced to be linear in $n$ through alternative estimates of the MMD (see e.g. Lemma 14 of Gretton et al. (2012)), but this would likely lead to substantially lower test power.

5. Empirical Results

In this section, we study the performance of our tests in a range of settings. We consider three examples: (i) a multivariate Gaussian location and scale model, (ii) a generative model of the interaction of genes through time called a toggle-switch model, and (iii) an unnormalised nonparametric density model. For all experiments we set the test level $\alpha = 0.05$, and give full implementation details in Section D.

5.1 Gaussian Model

We begin by exploring the different configurations of our test, with regard to the underlying discrepancy and the bootstrap method, for various Gaussian models. Unless otherwise stated, we use a Gaussian kernel with lengthscale computed using the median heuristic (Gretton et al., 2007).
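The median heuristic referred to here is commonly implemented as follows; this is a standard formulation rather than the paper's exact code, and the function name is ours.

```python
import numpy as np

def median_heuristic_lengthscale(X):
    """Set the Gaussian-kernel lengthscale to the median pairwise distance
    between observations (the heuristic of Gretton et al., 2007)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)   # distinct pairs only (exclude the zero diagonal)
    return np.median(dists[iu])
```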
We test Gaussianity by considering $H^C_0: P_\theta = N(\mu, \Sigma)$ with unknown parameters $\theta = \{\mu, \Sigma\}$, for $\mu \in \mathbb{R}^d$ and positive-definite $\Sigma \in \mathbb{R}^{d\times d}$. We start with $d = 1$, and test against a particular alternative where $Q$ is a Student's t-distribution with 2 degrees of freedom. Figure 3 shows the type I error rate and power of the MMD and KSD tests using both bootstraps as $n$ increases. Considering the MMD test under $H^C_0$, the test using the wild bootstrap converges to a conservative type I error rate as $n$ becomes large, while the test using the parametric bootstrap has the correct rate. This is consistent with our analysis of the bootstraps in Section 4. Under $H^C_1$, the parametric bootstrap has higher power, once again due to the wild bootstrap being conservative, but both bootstraps achieve a power of one once $n$ is large enough. For these results, $m$, the number of samples from $P_\theta$, was set to $m = n$ as a balance between computational cost and power. Better power could be achieved by increasing $m$, although this would increase the computational cost. The figure also shows the same set of results for the KSD test. It has similar convergence behaviour, suggesting that our theoretical results may also hold for the KSD.

Figure 3: Performance of the MMD and KSD tests as $n$ increases, where $H^C_0: P = N(\mu, 1)$. (a) Type I error rate under $H^C_0$, where $Q = N(\mu_0, 1^2)$. (b) Power under $H^C_1$, where $Q$ is a Student's t-distribution with 2 degrees of freedom. Lines show the parametric bootstrap, the wild bootstrap, and the level; error bars show one standard error across 4 random seeds.

Figure 4: Power of the parametric bootstrap tests as $d$ increases, for (a) the MMD test and (b) the KSD test. $H^C_0: P = N(\mu, \sigma^2)$; $H^C_{1a}$ and $H^C_{1b}$ indicate that $Q$ is a multivariate Student's t-distribution with 3 and 5 degrees of freedom respectively.
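The KSD statistic underlying these tests can be computed directly from the model score, without sampling from $P_\theta$. The following is a minimal one-dimensional V-statistic sketch under our own simplifying assumptions (Gaussian kernel, fixed lengthscale); it is not the paper's implementation, and the closed-form estimator of $\theta$ itself (Section D.1) is a separate ingredient not shown here.

```python
import numpy as np

def ksd2_gaussian_kernel(X, score, lengthscale=1.0):
    """V-statistic estimate of KSD^2(P, Q_n) for 1-d data with a Gaussian kernel.
    score(x) = d/dx log p(x) is the only model quantity needed, so neither
    samples from P nor its normalising constant are required."""
    x = np.asarray(X, dtype=float).reshape(-1, 1)
    d = x - x.T                                  # pairwise differences x_i - x_j
    l2 = lengthscale ** 2
    K = np.exp(-d ** 2 / (2 * l2))
    s = score(x.ravel()).reshape(-1, 1)
    # Stein kernel u_p(x, y) = s(x)s(y)K + s(x) dK/dy + s(y) dK/dx + d2K/dxdy
    dK_dy = (d / l2) * K
    dK_dx = (-d / l2) * K
    d2K = (1.0 / l2 - d ** 2 / l2 ** 2) * K
    U = s * s.T * K + s * dK_dy + s.T * dK_dx + d2K
    return U.mean()
```

Because the Stein kernel is positive semi-definite, this V-statistic is non-negative, and it grows when the data are inconsistent with the model score.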
Comparing the MMD and KSD tests under $H^C_1$, we note that the KSD test has higher power for any given $n$. This is both because the KSD does not require sampling from $P_\theta$, and because the MMD estimation procedure requires stochastic optimisation, whereas for the KSD we use a closed-form estimator. Second, we examine the performance of the parametric bootstrap tests as $d$, the dimensionality of the observations, increases. We test against the family of multivariate Gaussians with known covariance, $H^C_0: P = N(\mu_d, I_{d\times d})$ with unknown parameter $\theta = \mu_d \in \mathbb{R}^d$. We consider two alternatives, $H^C_{1a}$ and $H^C_{1b}$, where we generate data from multivariate t-distributions with $\nu = 3$ and $\nu = 5$ degrees of freedom respectively. As $\nu$ becomes large, the observed data become indistinguishable from a Gaussian, hence we expect the tests to have lower power against larger $\nu$. We set the scale parameter $\Sigma = I_{d\times d}(\nu - 2)/\nu$, so that the samples have covariance $I$ and the distance between the samples does not grow more rapidly than under $H^C_0$ as $d$ increases. Figure 4 shows that both tests maintain the type I error rate of 0.05 under $H^C_0$. Under $H^C_{1a}$ and $H^C_{1b}$ we find that the power of the MMD test increases with $d$, while the power of the KSD test decreases with $d$.

Figure 5: Comparison of the MMD tests against Wolfer and Alquier (2022). $H^C_0: P = N(\mu, 1)$ and $Q = N(\mu_0, \sigma^2)$, with $n = 200$; the power of the wild bootstrap, parametric bootstrap, and Wolfer and Alquier tests is shown as the scale $\sigma$ of $Q$ varies, together with the level. Error bars show one standard error across 4 random seeds.

Next, we further examine the power of the MMD tests when using the wild and parametric bootstraps. We also include the non-asymptotic MMD test introduced by Wolfer and Alquier (2022) for comparison. We take $H^C_0$ to be a family of one-dimensional Gaussians with known variance, specifically $P = N(\mu, \sigma^2)$, where $\theta = \mu \in \mathbb{R}$ and $\sigma^2 = 1$. We generate data from $N(\mu_0, \sigma^2)$, where $\mu_0 \in \mathbb{R}$ is an arbitrary mean, and compute the power of the tests for different values of $\sigma^2$.
An ideal test would have power close to $\alpha = 0.05$ when $\sigma^2 = 1$, and power close to one otherwise. Figure 5 shows the results, with the parametric bootstrap once again having higher power than the wild bootstrap. The test introduced by Wolfer and Alquier (2022) achieves lower power than both our tests with either bootstrap.

We also consider the time taken by each bootstrap method. For $n = 100$ and $d = 1$, we found that each wild bootstrap test took approximately 1ms, while each parametric bootstrap test took approximately 5ms. These figures depend heavily on the implementation and hardware details (as given in Section D). In particular, we execute all bootstrap samples in parallel on a GPU, which is particularly beneficial for the parametric bootstrap, where each sample has a high computational cost. However, this will not be possible for larger models, because it requires keeping $b$ copies of the model parameters in memory. Additionally, it would be possible to further optimise the bootstrap implementations. Despite this, these approximate numbers illustrate that the parametric bootstrap test is much more expensive. Thus, for large $n$, where the two bootstraps achieve similar power (see Figure 3b), it may be better to use the cheaper wild bootstrap.

Figure 6 explores the robustness of the test to corrupted observations under Huber's contamination model (Huber and Ronchetti, 2011), as previously explored in the testing setting by Liu and Briol (2024) and Schrab and Kim (2024). Once again we consider $H^C_0: P = N(\mu_d, I_{d\times d})$, for $\mu_d \in \mathbb{R}^d$. We assume that $H^C_0$ holds, but that some observations have been corrupted with a large amount of noise. Thus, we generate data from $Q = (1-\epsilon)\,N(0_d, I_{d\times d}) + \epsilon\,N(10_d, 0.2 I_{d\times d})$, where $0 \le \epsilon \le 1$ controls the percentage of observations that are corrupted. A robust test would maintain the correct type I error rate as $\epsilon$ grows. We evaluate the MMD test and the KSD test with a Gaussian kernel (as defined in Appendix C.2). We also include a robust KSD test using the tilted kernel $\tilde K(x, x') = w(x) K(x, x') w(x')$, where $K$ is the Gaussian kernel, $w(x) = (1 + a^2 \|x\|_2^2)^{-b}$, and we choose $a = 1$, $b = \frac{1}{2}$ (Barp et al., 2019; Liu and Briol, 2024). The results on one-dimensional observations, $d = 1$, show that the MMD test is robust up to 2-3% corruption, the KSD test with Gaussian kernel is not robust, and the KSD test with tilted kernel is robust to more than 6% corruption. All the tests become less robust as $d$ increases.

Figure 6: Type I error rate of the tests under corrupted observations, as the corruption percentage $\epsilon$ and the dimension $d$ vary. $H^C_0: P = N(\mu_d, \Sigma_{d\times d})$ and $Q = (1-\epsilon)P + \epsilon\,N(10_d, 0.2 I_{d\times d})$, with $n = 200$. Lines show the MMD, the KSD, the KSD with robust kernel, and the level; error bars show one standard error across 4 random seeds.

To complete the experiments on Gaussian models, we compare the performance of our tests to existing specialised tests of Gaussianity. We consider three nonparametric tests: the Anderson-Darling, Lilliefors (the Kolmogorov-Smirnov test specialised to Gaussians), and Shapiro-Wilk tests. We also consider D'Agostino's $K^2$ parametric test (D'Agostino and Pearson, 1973).

Figure 7: Comparison of the MMD and KSD tests (parametric bootstrap) with the Shapiro-Wilk (S-W), Lilliefors (Lllfrs), Anderson-Darling (A-D) and D'Agostino's (D'Agso) tests, as the degrees of freedom $\nu$ of $Q$ vary. $H_0: P = N(\mu, \sigma^2)$, and $Q$ is a series of Student's t-distributions. Two kernel lengthscales are shown for our tests: $l = m$ (the median heuristic) and $l = 2$ (selected via grid search). The level is also shown.

Figure 8: Left: evolution of the $u_t$ variable of the toggle-switch model, starting from the initial state $u_0 = 10$; each line shows a different random evolution of the model. Right: histogram of the resulting values of $y$.
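The tilted kernel and the contaminated sampling distribution used in this robustness experiment can be sketched as follows; the Gaussian base kernel and lengthscale default are our illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def tilted_gaussian_kernel(X, Y, lengthscale=1.0, a=1.0, b=0.5):
    """K~(x, y) = w(x) K(x, y) w(y), with w(x) = (1 + a^2 ||x||_2^2)^(-b).
    The weight w downweights kernel evaluations at outlying points."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * lengthscale ** 2))
    w = lambda Z: (1.0 + a ** 2 * (Z ** 2).sum(-1)) ** (-b)
    return w(X)[:, None] * K * w(Y)[None, :]

def sample_contaminated(n, d, eps, rng):
    """Draw from Q = (1 - eps) N(0_d, I) + eps N(10_d, 0.2 I) (Huber contamination)."""
    corrupt = rng.random(n) < eps
    X = rng.normal(0.0, 1.0, size=(n, d))
    X[corrupt] = rng.normal(10.0, np.sqrt(0.2), size=(int(corrupt.sum()), d))
    return X
```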
For this comparison we return to testing $H^C_0: P_\theta = N(\mu, \sigma^2)$, for unknown parameters $\theta = \{\mu, \sigma^2\}$, in one dimension. We perform the test against data generated from a Student's t-distribution, and vary the degrees of freedom $\nu$. Once again, we expect the test power to fall to $\alpha = 0.05$, the specified level of the test, as $\nu$ becomes large. Figure 7 shows the results of this experiment, for both tests using the parametric bootstrap and for two choices of lengthscale. The MMD test has substantially lower power than the other tests for both lengthscales. The KSD test is an improvement, having performance closer to the specialised tests when setting the lengthscale with the median heuristic. Increasing $n$ improves the power of each test, but does not change the ordering. These results reveal that we should prefer tests specifically designed for the model in question where possible, as opposed to our generally applicable tests. The advantage of our tests is that they are applicable beyond Gaussianity, including for unnormalised or generative models. They are also applicable to multivariate data, which is not the case for the Anderson-Darling or Lilliefors tests. In this plot we also demonstrate the importance of choosing the kernel. We include the power of our test for a better choice of $l = 2$, which we found by sampling additional observations and performing a grid search. For this kernel, the KSD test has comparable power to the specialised tests. However, in practice we cannot use this method to select the kernel, as we are not able to sample additional observations. Future work could look at better strategies than the median heuristic for selecting the bandwidth parameter in composite tests.

5.2 Toggle-Switch Model

We now consider a toggle-switch model, which is a generative model with unknown likelihood and hence suitable for our composite MMD tests.
The model describes how the expression levels of two genes, $u$ and $v$, interact in a sample of cells (Bonassi et al., 2011; Bonassi and West, 2015). Sampling from the model involves two coupled discretised-time equations. Denote by $P^{\mathrm{ts}}_{\theta,T}$ a model with parameters $\theta = (\alpha_1, \alpha_2, \beta_1, \beta_2, \mu, \sigma, \gamma)$ and a discretisation consisting of $T$ steps. To sample $y \sim P^{\mathrm{ts}}_{\theta,T}$, we sample $y \sim \tilde N(\mu + u_T,\ \mu\sigma/u_T^\gamma)$, where
$u_{t+1} \sim \tilde N(\mu_{u,t}, 0.5)$, with $\mu_{u,t} = u_t + \alpha_1/(1 + v_t^{\beta_1}) - (1 + 0.03\,u_t)$,
$v_{t+1} \sim \tilde N(\mu_{v,t}, 0.5)$, with $\mu_{v,t} = v_t + \alpha_2/(1 + u_t^{\beta_2}) - (1 + 0.03\,v_t)$,
and $u_0 = v_0 = 10$. In the above, $\tilde N$ indicates a Gaussian distribution truncated to give non-negative realisations. Figure 8 demonstrates how this sampling process works, showing the evolution of $u_t$ as $t$ increases. To fit the model, we use stochastic optimisation; see Section D.3.3. Existing work considers the model to have converged for $T = 300$, but using such a large value can be computationally expensive, so a practitioner may wonder whether it is possible to use a smaller value. We apply our parametric bootstrap test to this scenario by generating $n = 500$ observations from $P^{\mathrm{ts}}_{\theta_0,300}$, and then testing $H^C_0$: there exists $\theta$ such that $P^{\mathrm{ts}}_{\theta,T} = P^{\mathrm{ts}}_{\theta_0,300}$, for values of $1 \le T \le 50$. To generate the observations we follow Bernton et al. (2019) and use $\theta_0 = (22, 12, 4, 4.5, 325, 0.25, 0.15)$. Figure 9 shows the result of the test, revealing that the test is unlikely to reject $H^C_0$ for $T \ge 40$. In Figure 10 we give examples of the fit of the model for small and large $T$.

Figure 9: Fraction out of 20 repeats for which the MMD test (parametric bootstrap) rejects $H^C_0: P = P^{\mathrm{ts}}_{\theta,T}$, comparing against data generated from $Q = P^{\mathrm{ts}}_{\theta_0,300}$.

Figure 10: Fit of the toggle-switch model for $T = 10$ (for which $H^C_0$ is rejected) and $T = 50$ (for which it is not). The densities are generated using kernel density estimation on 500 samples.
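The sampling process above can be sketched as a simple NumPy simulator. This is our own illustration, assuming the parameter ordering given in the text; in particular, the truncated Gaussians are approximated here by clipping at a small positive floor (an assumption; exact truncated sampling would instead redraw negative values).

```python
import numpy as np

def sample_toggle_switch(theta, T, n, rng):
    """Draw n observations from a sketch of the toggle-switch model P^ts_{theta,T}.
    theta = (alpha1, alpha2, beta1, beta2, mu, sigma, gamma)."""
    a1, a2, b1, b2, mu, sigma, gamma = theta
    u = np.full(n, 10.0)   # u_0 = 10
    v = np.full(n, 10.0)   # v_0 = 10
    for _ in range(T):
        mu_u = u + a1 / (1.0 + v ** b1) - (1.0 + 0.03 * u)
        mu_v = v + a2 / (1.0 + u ** b2) - (1.0 + 0.03 * v)
        # Gaussian steps with std 0.5, clipped to stay positive (truncation approximation)
        u = np.clip(mu_u + 0.5 * rng.normal(size=n), 1e-3, None)
        v = np.clip(mu_v + 0.5 * rng.normal(size=n), 1e-3, None)
    y = mu + u + (mu * sigma / u ** gamma) * rng.normal(size=n)
    return np.clip(y, 0.0, None)
```

With $\theta_0 = (22, 12, 4, 4.5, 325, 0.25, 0.15)$ this produces samples qualitatively like the histogram in Figure 8, though the clipping approximation means it is not an exact reimplementation.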
5.3 Density Estimation with a Kernel Exponential Family Model

We now consider the kernel exponential family $p_\theta(x) \propto q_{\mathrm{kef}}(x)\exp(f(x))$, where $f$ is an element of some RKHS $\mathcal{H}_\kappa$ with Gaussian kernel $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with lengthscale $l_{\mathrm{kef}}$, and $q_{\mathrm{kef}}$ is a reference density (Canu and Smola, 2006). Following Matsubara et al. (2022), we work with a finite-rank approximation $f(x) = \sum_{i=1}^p \theta_i \phi_i(x)$, where $\theta \in \Theta = \mathbb{R}^p$ and the $\phi_i$ are basis functions depending on $l_{\mathrm{kef}}$. Matsubara et al. (2022) use this model for a density estimation task on a data set of velocities of 82 galaxies (Postman et al., 1986; Roeder, 1990), and use the approximation with $p = 25$ for computational convenience. However, an open question is whether $p = 25$ is sufficient to achieve a good fit to the data. We propose to use our composite KSD test to answer this question. Figure 11 shows the fit of the model for increasing values of $p$, and whether the test rejected $H^C_0$. We use a sum of IMQ kernels; see Section D.2 for a discussion of this choice. We find that the test rejects $H^C_0$ for $p = 1, 2, 3$, but does not reject for $p = 4, 5, 25$, with the wild and parametric bootstraps producing similar results. This suggests that $p = 25$ is a suitable choice, though it would also be reasonable to use a smaller value of $p$, which would further decrease the computational cost of inference.

Figure 11: Fit of the kernel exponential family model on the galaxies data set. The grey histogram shows the data set, while the coloured lines show the density of the fitted model with varying $p$. Dashed lines indicate that the wild bootstrap test rejected $H^C_0$, while solid lines indicate that it failed to reject.

6. Limitations

While we find that our composite tests allow us to answer much more complex questions than existing kernel tests, we also observed some limitations, which we discuss below. The performance of any kernel-based test depends heavily on the choice of kernel, as this determines the ability of the discrepancy to distinguish between distributions when observations are finite. In our composite tests, this issue has a larger effect on the power of the test, since we use the discrepancy not only as a test statistic but also for parameter estimation. For example, in Appendix B, we demonstrate how a very poor choice of kernel can result in biased estimates of the parameter and thus a type I error rate larger than the level. In practice, we should therefore consider our method as a test of both the model and the performance of the estimator, rather than of the model alone. We consider this to be desirable behaviour since, in practice, a parametric model is only ever as useful as our ability to estimate its parameters. In the non-composite case, several techniques have been developed to address kernel selection, including the power-maximising kernel (Sutherland et al., 2017), Sup-MMD (Fukumizu et al., 2009), and the aggregated procedure of Schrab et al. (2023, 2022). Future work could investigate how these procedures could be adapted for composite tests. As an initial trial, in Appendix C we apply the aggregated procedure to our composite KSD test, finding that it sometimes achieves a small increase in power, and sometimes a small decrease. Future work could also consider kernels specialised to structured data such as time series, graphs, and images. A further issue that can arise is that the numerical optimiser implementing the estimator may fail to converge to a global optimum (for example, because the objective is non-convex and the optimiser only finds local minima). This failure mode is also illustrated in Appendix B.
Once again, this means we should consider our method as a test of the fitting method, in addition to the model itself. All tests based on the KSD can suffer from type II errors due to its inability to distinguish between multi-modal distributions that differ only in the weight of each mode, and this limitation also applies to our test. The limitation can be observed in Figure 11, where, for $p = 25$, the model allocates too much weight to the left mode, yet the test still fails to reject $H^C_0$. However, as $n \to \infty$ this error will no longer occur. A final limitation of the tests is due to the difficulty of approximating $c_\alpha$ for models which are expensive to fit, due to a large number of parameters and/or the size of the associated data sets. In those cases, the wild bootstrap might not provide a good enough approximation of $c_\alpha$, whilst the high computational cost of the parametric bootstrap (which requires fitting the model many times) may be prohibitive. Alternative bootstrap algorithms could be developed for this task; see for example the work of Kojadinovic and Yan (2012).

7. Connections with Existing Tests

Connections between (non-composite) kernel-based and classical tests are well studied; see Sejdinovic et al. (2013) for an extensive discussion showing that MMD tests are equivalent to tests based on the energy distance when using an appropriate kernel. Naturally, similar results also hold for composite tests, and we sketch out some of these connections below.

7.1 Likelihood-ratio Tests

The first connection is with likelihood-ratio tests (Lehmann and Romano, 2005, Section 12.4.4), which compare a null model $p_{\theta_0}$ and an alternative parametric model $q_\gamma$. Interestingly, we will show in this section that our tests are intimately linked to likelihood-ratio tests where the alternative model is parameterised in some reproducing kernel Hilbert space.
Likelihood-ratio tests are based on the score statistic $S_n(\gamma; f) = \frac{\partial}{\partial\gamma}\frac{1}{n}\sum_{i=1}^n \log q_\gamma(x_i)$, which is optimised over $\gamma$ via the maximum likelihood estimator (MLE). Now consider the alternative model given by $q_{f,\gamma}(x) \propto \exp(\gamma f(x))\,p_{\theta_0}(x)$ for some $\gamma \in \mathbb{R}$. This model is indexed by the function $f$, and consists of a perturbation of $p_{\theta_0}$ of size $\gamma$, in the sense that $q_{f,\gamma} = p_{\theta_0}$ when $\gamma = 0$. In this case,
$S_n(\gamma; f) = \frac{1}{n}\sum_{i=1}^n f(x_i) - \frac{\int_{\mathcal{X}} f(x)\exp(\gamma f(x))p_{\theta_0}(x)\,dx}{\int_{\mathcal{X}} \exp(\gamma f(x))p_{\theta_0}(x)\,dx}$,
and $S_n(0; f) = \frac{1}{n}\sum_{i=1}^n f(x_i) - \int_{\mathcal{X}} f(x)p_{\theta_0}(x)\,dx$. If the data are generated from $p_{\theta_0}$, the MLE of $\gamma$ will be close to zero, and thus $S_n(0; f) \approx 0$. Conversely, $|S_n(0; f)| > 0$ implies that, for some value $\gamma \ne 0$, the perturbation model is a better fit for $Q_n$ than $P_{\theta_0}$. This choice of alternative model allows us to recover the MMD and KSD by considering suitably defined perturbations in some RKHS:
$\sup_{f \in \mathcal{F}_{\mathrm{MMD}}} S_n(0; f) = \mathrm{MMD}(P_{\theta_0}, Q_n)$, and $\sup_{f \in \mathcal{F}_{\mathrm{KSD}}} S_n(0; f) = \mathrm{KSD}(P_{\theta_0}, Q_n)$,
where $\mathcal{F}_{\mathrm{MMD}}$ and $\mathcal{F}_{\mathrm{KSD}}$ were defined in Section 2. Therefore, MMD and KSD tests are likelihood-ratio tests with an alternative model parameterised in some RKHS. Of course, the argument above also holds if $\theta_0$ is replaced by $\hat\theta_n$. As a result, our composite tests can be thought of as a composite version of likelihood-ratio tests where $\hat\theta_n$ is a minimum distance estimator based on the MMD or KSD.

7.2 Tests Based on Characteristic Functions

The composite goodness-of-fit tests in this paper are also closely connected with hypothesis tests based on distances between characteristic functions. Recall that the characteristic function of a distribution $Q$ is defined as $\varphi_Q(\omega) = \mathbb{E}_{X\sim Q}[\exp(i X^\top \omega)]$. The following result shows that using the MMD is equivalent to comparing a weighted $L^2$ distance between characteristic functions, where the weight depends on the choice of kernel.

Theorem 6 (Theorem 3 and Corollary 4, Sriperumbudur et al. (2010)). Let $\mathcal{X} = \mathbb{R}^d$ and suppose $K(x, y) = \psi(x - y)$, where $\psi: \mathbb{R}^d \to \mathbb{R}$ is a bounded and continuous positive definite function.
Following Bochner's theorem, $\psi$ is the Fourier transform of a finite nonnegative Borel measure $\Lambda$, i.e. $\psi(x) = \int_{\mathbb{R}^d} \exp(-i x^\top \omega)\Lambda(d\omega)$, and hence the MMD can be expressed as
$\mathrm{MMD}^2(P, Q) = \int_{\mathbb{R}^d} |\varphi_P(\omega) - \varphi_Q(\omega)|^2\,\Lambda(d\omega)$.

There are a number of composite tests based on weighted distances between characteristic functions. One example is the work of Henze and Zirkler (1990), which considers testing for multivariate Gaussians with unknown mean and covariance. This test therefore uses a distance equivalent to the MMD with a specific choice of kernel. However, that paper makes use of maximum likelihood estimation rather than a minimum MMD estimator. Interestingly, this means their test is equivalent to that of Kellner and Celisse (2019), although the latter paper does not make this connection and instead proposes the test of Henze and Zirkler (1990) as a competitor in its numerical experiments. Note that this correspondence extends to the functional data case; see Wynne and Nagy (2021) for more details. Another related work is that of Koutrouvelis and Kellermeier (1981), who focus on (possibly non-Gaussian) univariate distributions and consider an approximation of the weighted $L^2$ distance between empirical characteristic functions. Here, the approximation consists of discretising the integral in the definition of the $L^2$ distance, and this discrepancy can therefore be thought of as an approximation of the MMD with a specific choice of kernel. Their approach does, however, have an additional level of approximation, due to the discretisation of the integral, which is unnecessary once the connection with the MMD is noted. Note that Koutrouvelis and Kellermeier (1981) used the same discrepancy for both estimation and testing. The connection between tests based on kernels and characteristic functions can also be extended to the case of the KSD, albeit through modified characteristic functions.
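Theorem 6 can be checked numerically in one dimension. The sketch below compares the closed-form Gaussian-kernel MMD between $P = N(0,1)$ and $Q = N(1,1)$ with a quadrature approximation of the characteristic-function integral; the lengthscale $l = 1.5$ and the integration grid are arbitrary choices of ours.

```python
import numpy as np

# Numerical check of Theorem 6 in 1-d: P = N(0, 1), Q = N(1, 1), and the Gaussian
# kernel psi(x) = exp(-x^2 / (2 l^2)), whose spectral measure Lambda is N(0, 1/l^2).
l = 1.5

# Kernel side: closed-form population MMD^2 between the two Gaussians.
c = l / np.sqrt(l ** 2 + 2.0)                  # E[k(X, X')] for X, X' ~ N(0, 1)
mmd2_kernel = 2.0 * c * (1.0 - np.exp(-1.0 / (2.0 * (l ** 2 + 2.0))))

# Characteristic-function side: integral of |phi_P - phi_Q|^2 against Lambda.
w = np.linspace(-12.0, 12.0, 200001)
dw = w[1] - w[0]
lam = np.exp(-0.5 * (l * w) ** 2) * l / np.sqrt(2.0 * np.pi)   # N(0, 1/l^2) density
integrand = np.exp(-w ** 2) * (2.0 - 2.0 * np.cos(w))          # |phi_P(w) - phi_Q(w)|^2
mmd2_cf = np.sum(integrand * lam) * dw
```

Here $\varphi_P(\omega) = e^{-\omega^2/2}$ and $\varphi_Q(\omega) = e^{i\omega - \omega^2/2}$, so $|\varphi_P(\omega) - \varphi_Q(\omega)|^2 = e^{-\omega^2}(2 - 2\cos\omega)$; the two computed quantities agree to numerical precision, as the theorem predicts.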
To state the result, define $(\mathcal{S}^l_{P,x}g)(x) = g(x)\,(\partial \log p(x)/\partial x_l) + \partial g(x)/\partial x_l$ for a sufficiently regular scalar-valued test function $g$ and $l \in \{1, \dots, d\}$. For some fixed $l$, the modified characteristic function of a distribution $Q$ takes the form $\varphi^l_Q(\omega) = \mathbb{E}_{X\sim Q}\big[\mathcal{S}^l_{P,x}\big[\exp(i X^\top \omega)\big]\big]$.

Theorem 7 (Simplified version of Theorem 4.1, Wynne et al. (2022)). Let $\mathcal{X} = \mathbb{R}^d$ and suppose $K(x, y) = \psi(x - y)$, where $\psi: \mathbb{R}^d \to \mathbb{R}$ is a bounded and continuous positive definite function, so that $\psi(x) = \int_{\mathbb{R}^d} \exp(-i x^\top \omega)\Lambda(d\omega)$. Then the KSD can be expressed as
$\mathrm{KSD}^2(P, Q) = \int \big|\varphi^l_Q(\omega)\big|^2\,\Lambda(d\omega) = \int \big|\varphi^l_Q(\omega) - \varphi^l_P(\omega)\big|^2\,\Lambda(d\omega)$.

Ebner (2021) proposed a test of normality which uses a test statistic exactly of the form in Theorem 7. Their test implicitly assumes that $\Lambda$ has a density, and they study the impact of the choice of density on the performance of the test. Through the theorem above, we can see that this is equivalent to varying the choice of kernel $K$ indexing the KSD.

8. Conclusion

This paper proposes and studies two kernel-based tests to verify whether a given data set is a realisation from any element of some parametric family of distributions. This was achieved by combining existing kernel hypothesis tests with recently developed minimum distance estimators, using either the MMD or KSD. We also studied two bootstrap algorithms to implement these tests: a wild bootstrap algorithm, which is suitable in large-data regimes, and a parametric bootstrap, which is suitable in smaller-data regimes. A number of limitations were mentioned in Section 6, and these could each be addressed in future work. For example, the issue that the tests perform poorly when the minimum distance estimators provide a poor parameter estimate could be mitigated by allowing for the use of alternative estimators specialised for the models at hand. This is straightforward to do in practice, but would require an extension of our theoretical results.
Relatedly, the performance of the composite tests is dependent on the choice of an appropriate divergence, usually through the choice of kernel. This issue could be mitigated by using the aggregate approach of Schrab et al. (2023, 2022) within our composite tests, which removes the need to carefully select a kernel. Additionally, while our paper focused on Euclidean data, our MMD and KSD tests could straightforwardly be generalised to more complex settings. For example, the cases of discrete (Yang et al., 2018), manifold (Xu and Matsuda, 2020), censored (Fernández and Gretton, 2019) or time-to-event (Fernández et al., 2020) data could all be covered by our general methodology. Furthermore, models such as point processes (Yang et al., 2019), latent variable models (Kanagawa et al., 2019) or exponential random graph models (Xu and Reinert, 2021) could also be tackled in this way. Finally, we envisage that our approach could be extended to more complex testing questions. For example, the framework could be extended to relative tests, where instead of testing whether data comes from a fixed parametric model, the relative fit of several parametric models would be compared (Bounliphone et al., 2016; Jitkrittum et al., 2018; Lim et al., 2019). Alternatively, we could extend our methodology to construct composite tests for conditional distributions (Jitkrittum et al., 2020).

Acknowledgments

We thank Antonin Schrab for his helpful instructions on implementing the aggregated version of the test, as presented in Appendix C. OK and FXB acknowledge support from the Engineering and Physical Sciences Research Council with grant numbers EP/S021566/1 and EP/Y011805/1 respectively. TF was supported by the Data Observatory Foundation - ANID Technology Center No. DO210001 and ANID FONDECYT grant No. 11221143. Arthur Gretton acknowledges support from the Gatsby Charitable Foundation.
Contents of the Appendices
A Theoretical Results for MMD
  A.1 Definitions
  A.2 Preliminary Results
  A.3 Auxiliary Results
  A.4 Proofs of Main Results
  A.5 Proofs of Auxiliary Results
B Illustration of Limitations
C Aggregated Composite Test
D Experiment Details
  D.1 Closed-Form KSD Estimator
  D.2 Kernels
  D.3 Details of Specific Figures

Appendix A. Theoretical Results for MMD

A.1 Definitions

A.1.1 General Notation

In this section we make some additional definitions which will be used in the proofs. We denote by $Q_n$ the empirical measure based on the sample $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} Q$, and we denote by $Q^*_n$ the empirical measure based on the parametric bootstrap samples $X^*_1, \dots, X^*_n \overset{\text{i.i.d.}}{\sim} P_{\hat\theta_n}$ given $\hat\theta_n$.

1. We define the functions $L_n: \Theta \to \mathbb{R}$ and $L: \Theta \to \mathbb{R}$ by $L_n(\theta) = \mathrm{MMD}^2(P_\theta, Q_n)$ and $L(\theta) = \mathrm{MMD}^2(P_\theta, Q)$. (4) Similarly, for the parametric bootstrap, we define $L^*_n: \Theta \to \mathbb{R}$ by $L^*_n(\theta) = \mathrm{MMD}^2(P_\theta, Q^*_n)$. (5) Note then that $\hat\theta_n = \arg\min_{\theta\in\Theta} L_n(\theta)$ and $\hat\theta^*_n = \arg\min_{\theta\in\Theta} L^*_n(\theta)$.

2. We denote by $H \in \mathbb{R}^{p\times p}$ the Hessian matrix given by $H_{ij} = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} L(\theta_0)$ for any $i, j \in \{1, \dots, p\}$.

3. We define the function $g: \mathcal{H} \times \Theta \to \mathbb{R}$ by $g(\omega, \theta) = \mathbb{E}_{X\sim P_\theta}(\omega(X))$.

4. We denote by $\mu_\theta$, $\mu_n$ and $\mu^*_n \in \mathcal{H}$, respectively, the kernel mean embeddings of the distribution $P_\theta$ and of the empirical distributions $Q_n$ and $Q^*_n$, given by $\mu_\theta(x) = \int_{\mathcal{X}} K(x, y)p_\theta(y)\lambda(dy)$, $\mu_n(x) = \frac{1}{n}\sum_{i=1}^n K(x, X_i)$, and $\mu^*_n(x) = \frac{1}{n}\sum_{i=1}^n K(x, X^*_i)$.

5.
We define $\varphi: \mathcal{X} \times \Theta \to \mathbb{R}^p$ by
$\varphi(x, \theta) = \nabla_\theta \mu_\theta(x) - \int_{\mathcal{X}} (\nabla_\theta \mu_\theta(y))\,p_\theta(y)\lambda(dy)$, (6)
where $\nabla_\theta \mu_\theta: \mathcal{X} \to \mathbb{R}^p$ is such that $(\nabla_\theta \mu_\theta(x))_i = \frac{\partial}{\partial\theta_i}\mu_\theta(x)$ for $i \in \{1, \dots, p\}$.

6. We define $\eta: \mathcal{H} \times \mathcal{X} \times \Theta \to \mathbb{R}$ by
$\eta(\omega, x, \theta) = \omega(x) - g(\omega, \theta) - 2\langle \nabla_\theta g(\omega, \theta_0),\ H^{-1}\varphi(x, \theta)\rangle$, (7)
where $\nabla_\theta g: \mathcal{H} \times \Theta \to \mathbb{R}^p$ is such that $(\nabla_\theta g(\omega, \theta))_i = \frac{\partial}{\partial\theta_i}g(\omega, \theta)$ for $i \in \{1, \dots, p\}$.

7. We define the functionals $S_n: \mathcal{H} \to \mathbb{R}$ and $S^*_n: \mathcal{H} \to \mathbb{R}$ by
$S_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X_i, \theta_0)$, and $S^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X^*_i, \hat\theta_n)$, (8)
where $\eta$ is defined in Equation 7.

Remark 8 (Bound for RKHS functions). Observe that for any $\omega \in \mathcal{H}$ and $x \in \mathcal{X}$, $|\omega(x)| = |\langle \omega, K_x\rangle_{\mathcal{H}}| \le \|\omega\|_{\mathcal{H}}\|K_x\|_{\mathcal{H}} = \|\omega\|_{\mathcal{H}}\sqrt{K(x,x)}$, where the inequality is due to the Cauchy-Schwarz inequality. By Assumptions 1 to 6, the kernel is bounded by some constant $C > 0$. Thus $\sup_{x\in\mathcal{X}} |\omega(x)| \le \|\omega\|_{\mathcal{H}}\sqrt{C}$.

Remark 9 (Interchange of integration and differentiation). Observe that for any fixed $\omega \in \mathcal{H}$, $g(\omega, \cdot) \in C^3(\Theta)$ and, moreover, $D_\theta g(\omega, \theta) = \int_{\mathcal{X}} \omega(x)\,D_\theta p_\theta(x)\lambda(dx)$, where $D_\theta$ is any derivative up to order three. The previous result holds by an application of Lebesgue's dominated convergence theorem, since $p_\theta(x) \in C^3(\Theta)$ for all $x \in \mathcal{X}$, and by the integrability conditions of Assumption 3. Similarly, $\mu_\theta(x) \in C^3(\Theta)$ for all $x \in \mathcal{X}$, and $D_\theta\mu_\theta(x) = \int_{\mathcal{X}} K(x, y)\,D_\theta p_\theta(y)\lambda(dy)$. Moreover, by the integrability conditions of Assumption 3, and since the kernel is bounded by a constant $C > 0$ (Assumption 5), it holds that
$\sup_{\theta\in\Theta}\sup_{x\in\mathcal{X}} \left|\int_{\mathcal{X}} K(x, y)\,D_\theta p_\theta(y)\lambda(dy)\right| \le C \int_{\mathcal{X}} \sup_{\theta\in\Theta} |D_\theta p_\theta(y)|\,\lambda(dy) < \infty$. (9)

A.1.2 Test Statistic

In our proofs we consider an alternative form of the definition of the MMD, which is equivalent to that in Equation 2:
$n\,\mathrm{MMD}^2(P_\theta, Q_n) = \sup_{\omega\in\mathcal{H}:\,\|\omega\|_{\mathcal{H}}=1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \big[\omega(X_i) - g(\omega, \theta)\big]\right)^2$.

A.1.3 Wild Bootstrap

In order to analyse the asymptotic behaviour of the wild bootstrap test statistic, we write it in terms of a supremum as follows:
$n\,\mathrm{MMD}^2_W(P_{\hat\theta_n,n}, Q_n) := \frac{1}{n}\sum_{i,j=1}^n W_i W_j\, h_{\mathrm{MMD}}((X_i, \tilde X_i), (X_j, \tilde X_j)) = \sup_{\omega\in\mathcal{H}:\,\|\omega\|_{\mathcal{H}}=1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n W_i\big(\omega(X_i) - \omega(\tilde X_i)\big)\right)^2$,
where recall that $X_1, \dots, X_n$ are i.i.d. $Q$, $\tilde X_1, \dots, \tilde X_n$ are i.i.d. $P_{\hat\theta_n}$ given $\hat\theta_n$, and $W_1, \dots, W_n$ are i.i.d.
Rademacher are independent of everything. A.1.4 Parametric bootstrap To obtain the results for the parametric bootstrap, we condition on the data X1, . . . , Xn. To this end, we use the following notation: PD( ) = P( |X1, . . . , Xn), ED( ) = E( |X1, . . . , Xn), Var D( ) = Var( |X1, . . . , Xn), and Cov D( , ) = Cov( , |X1, . . . , Xn). Define an = op D(1) if, for each ε > 0, it holds PD(|an| ε) P 0 when n . Define an = Op D(1) if for any ε > 0 there exists M > 0 such that P ({PD(|an| > M) ε}) 1. Given a random variable a, define an DD a (i.e. an converges in distribution to a given the data points X1, . . . , Xn) if and only if |ED(f(an) f(a))| P 0, for any bounded, uniformly continuous real-valued f. A.2 Preliminary Results We state some preliminary results that will be used in our proofs. The first corresponds to Theorem 1 of Fernández and Rivera (2022), and it will be used to prove Theorems 2 and 5. Composite Goodness-of-fit Tests with Kernels Theorem 10. Let (Sn)n 1 be a sequence of bounded linear test-statistics satisfying conditions G0-G2 (stated below). Then sup ω H: ω H=1 S2 n(ω) D i=1 λi Z2 i , where Z1, Z2, . . . are i.i.d. standard normal random variables, and λ1, λ2, . . . are the eigenvalues of the operator Tσ : H H defined by (Tσω)(x) = σ(ω, Kx), where σ : H H R is the bilinear form of Condition G0. We state conditions G0-G2. Condition G0 There exists a continuous bilinear form σ : H H R such that for any m N, the bounded linear test-statistic Sn(w1 + + wm) converges in distribution to a normal random variable with mean 0 and variance given by Pm i=1 Pm j=1 σ(wi, wj). Condition G1 For some orthonormal basis (ψi)i 1 of H we have P i 1 σ(ψi, ψi) < . Condition G2 For some orthonormal basis (ψi)i 1 of H, and for any ε > 0, we have that lim i lim sup n P k=i+1 S2 n(ψk) ε The following result appears in Nickl (2012). Theorem 11 (Theorem 1, Nickl (2012)). Suppose that Θ Rp is compact (i.e. bounded and closed). 
Assume that L : Θ R is a deterministic function with L C0(Θ), and that θopt is the unique minimiser of L. If sup θ Θ |Ln(θ) L(θ)| P 0 as n grows to infinity, then any solution ˆθn of minθ Θ Ln(θ) satisfies ˆθn P θopt as n . A.3 Auxiliary Results The following collection of lemmas and propositions will be used in the proofs of our main results. Their proofs are provided in Section A.5. Lemma 12. Under Assumptions 1 to 6 and HC 0 it holds that n MMD2(Pˆθn, Qn) = n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) + op(1), where φ : X Θ Rp is defined in Equation 6, and H is the Hessian matrix defined in Section A.1.1. Key*, Gretton, Briol, and Fernandez* Proposition 13 (Normal distribution for the Wild Bootstrap statistic). Define Zn : H R by Zn(ω) = 1 n i=1 Wi(ω(Xi) ω( Xi)), where X1, . . . , Xn i.i.d. Q and X1, . . . , Xn i.i.d. Pˆθn given ˆθn, and W1, . . . , Wn i.i.d. Rademacher that are independent of everything else. Then, Zn(ω) DD N(0, ν(ω, ω)), where ν(ω, ω ) = 2 Z (ω(x) g(ω, θ0))(ω (x) g(ω , θ0))pθ0(x)λ(dx). (10) Proposition 14. Under Assumptions 1 to 6 and the null hypothesis it holds that ˆθn P θ0. Proposition 15. i=1 φ(Xi, θ), and θL n(θ) = 2 i=1 φ(X i , θ). Moreover, under Assumptions 1 to 6 the following items hold. i) EX Pθ(φ(X, θ)) = 0, and ED(φ(X , ˆθn)) = 0, where X Pˆθn given ˆθn, ii) supθ Θ supx Rd φ(x, θ) < , and iii) n θLn(θ0) = Op(1) holds under the null hypothesis, and n θL n(ˆθn) = Op(1). Proposition 16. Under Assumptions 1 to 6 the following items hold. i) For any fixed θ Θ and ω H, EX Pθ(η(ω, X, θ)) = 0, ii) there exists constants C1 > 0 and C2 > 0 such that for any fixed θ Θ and ω H, EX Pθ(η2(ω, X, θ)) C1EX Pθ ω2(X) + C2 θg(ω, θ0) 2 , iii) there exists constants C1 > 0 and C2 > 0 that do not depend on ω, x nor θ, such that for all ω H, it holds supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H < . Lemma 17. Under Assumptions 1 to 6 and the null hypothesis, it holds that for any ℓ, j {1, . . . 
, p} and ε > 0 θℓ θj µθ(Xi) EX Pθ0 θℓ θj µθ(X) θℓ θj µθ(X i ) EX Pθ0 θℓ θj µθ(X) ε as n grows to infinity. Composite Goodness-of-fit Tests with Kernels Proposition 18. Define Hn and H n by θi θj Ln( θj) and (H n)ij = 2 θi θj L n( θj ) for any i, j {1, . . . , p}, where Ln and L n are defined in Equations 4 and 5, respectively, θ1, . . . , θp lie between θ0 and ˆθn, and θ1 , . . . , θp lie between ˆθn and ˆθ n. Then, under Assumptions 1 to 6 and the null hypothesis it holds Hn H = op(1) and H n H = op D(1) as n grows to infinity. Proposition 19. Under Assumptions 1 to 6 and the null hypothesis, the following holds. i) n(θ0 ˆθn) = n H 1 θLn(θ0) + op(1), and ii) n(ˆθn θ n) = n H 1 θL n(ˆθn) + op(1). Lemma 20. Under Assumptions 1 to 6, it holds i) n(ˆθn θ n) = Op(1), and ii) θ n θ0 = op(1) holds under the null hypothesis. Proposition 21. Under the null hypothesis and Assumptions 1 to 6, we have that i=1 ω(Xi) g(ω, ˆθn) = Sn(ω) + op(1), holds for every ω H, where Sn : H R is defined in Equation 8 and the error term does not depend on ω. Proposition 22. Consider Sn = (Sn(ω1), . . . , Sn(ωm)), where ωℓ H for each ℓ [m] and fixed m N. Then, under the null hypothesis, and Assumptions 1 to 6, we have Sn D Nm(0, Σ) as n grows to infinity, where for any i, j [m] we have Σij = σ(ωi, ωj) := EX Pθ0 (η(ωi, X, θ0)η(ωj, X, θ0)) , (13) and the function η is defined in Equation 7. The next two results are analogous to Proposition 21 and Proposition 22 but for the parametric bootstrap test-statistic. Note that these results are obtained conditioned on the data X1, . . . , Xn. Proposition 23. Under the null hypothesis and Assumptions 1 to 6, it holds that i=1 ω(X i ) g(ω, θ n) = S n(ω) + op(1), where S n : H R is defined in Equation 8. Proposition 24. Consider S n = (S n(ω1), . . . , S n(ωm)), where ωℓ H for each ℓ [m] and m N is fixed. 
Then, under the null hypothesis, and Assumptions 1 to 6, we have S n D Nm(0, Σ) as n grows to infinity, where Σ is defined in Equation 13 Proposition 25. Let (ψi)i 1 be any orthonormal basis of H. Then under Assumptions 1 to 6 it holds that P i 1 θg(ψi, θ0) 2 < . Key*, Gretton, Briol, and Fernandez* A.4 Proofs of Main Results This section contains proofs of results given in the main paper. Proof of Theorem 2 By the definition of the maximum mean discrepancy, it holds n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 γ2 n(ω), where γn(ω) = 1 n i=1 ω(Xi) g(ω, ˆθn). Proposition 21 yields γn(ω) = Sn(ω) + op(1), where Sn : H R is the functional defined in Equation 8, and the error term does not depend on ω. Then, by using the fact that supω H: ω H=1 S2 n(ω) converges in distribution (which will be proved next) we obtain n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 S2 n(ω) + op(1). By Slutsky s theorem, we deduce that n MMD2(Pˆθn, Qn) and supω H: ω H=1 S2 n(ω) have the same asymptotic null distribution. It is not difficult to verify that Sn(ω) = 1 n Pn i=1 η(ω, Xi, θ0) is bounded and linear in H. The linearity follows from the fact that for any fixed x X and θ Θ, η(ω, x, θ) is linear on its argument ω. Boundedness follows from item iii of Proposition 16 since supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H where C1 > 0 and C2 > 0 are constants that do not depend on ω. Then, we can use Theorem 10 to study the asymptotic null distribution of supω H: ω H=1 Sn(ω)2 and thus obtain the result. To apply Theorem 10 we need to verify conditions G0, G1 and G2, which are stated in Section A.2. Condition G0 follows directly from Proposition 22, where σ : H H R is given by σ(ω, ω ) := EX Pθ0(η(ω, X, θ0)η(ω , X, θ0)) for any ω, ω H. To verify condition G1 consider an orthonormal basis (ψi)i 1 of H and observe that by Proposition 16.ii. there exists constants C1 > 0 and C2 > 0 such that for all i 1 it holds σ(ψi, ψi) = EX Pθ0(η2(ψi, X, θ0)) < C1EX Pθ0 ψi2(X) + C2 θg(ψi, θ0) 2 . 
i 1 σ(ψi, ψi) C1 X i 1 EX Pθ0 ψi2(X) + C2 X i 1 θg(ψi, θ0) 2 < . (14) where the last inequality holds since K(x, x) = P i 1 ψ2 i (x), and the kernel is bounded by our assumptions, and by Proposition 25. Finally, we proceed to verify condition G2. For this, let ε > 0, and note k i S2 n(ψk) ε k i ε 1E S2 n(ψk) = ε 1 X k i EX Pθ0(η2(ψk, X, θ0)) k i σ(ψk, ψk), where first inequality is due to Markov s inequality, the subsequent equality follows from the definition of Sn(ω) = 1 n Pn i=1 η(ω, Xi, θ0) given in Equation 8, and the last equality is due Composite Goodness-of-fit Tests with Kernels to the definition of σ. Finally, notice that lim i lim sup n P k i Sn(ψk)2 ε k i σ(ψk, ψk) = 0, where the limit is due to Equation 14. Proof of Theorem 3 Observe that MMD(Pˆθn, Qn) = µˆθn µn H µˆθn µQ H µQ µn H , where µQ is the kernel mean embedding of Q given by µQ(x) = R K(x, y)q(y)λ(dy). Note that limn µQ µn H = 0 a.s. by the standard law of large numbers for RKHS in Chen and Zhu (2011, Theorem 1). Next, we will prove that µˆθn µQ H > 0 a.s. We proceed by contradiction. Suppose that lim infn µˆθn µQ H = 0 a.s., then there exists a collection of indices (a(n))n 1 such that the subsequence µbθa(n) satisfies µbθa(n) µQ 2 H = 0 a.s. (15) Additionally, note that since the set Θ is compact, there exists a subsequence (bθa(b(n)))n 1 and θ Θ such that limn bθa(b(n)) θ = 0, a.s. By Assumptions 3 and 5, the kernel K is bounded, pθ(x) C3(Θ) for each fixed x X, and R X supθ Θ pθ(x)λ(x)dx < . Therefore, by the Lebesgue s dominated convergence theorem we deduce µbθa(b(n)) µθ 2 X K(x, y) pbθa(b(n))(x) pθ (x) pbθa(b(n))(y) pθ (y) λ(x)λ(y)dxdy when n grows to infinity. Then, by the triangle inequality and Equations 15 and 16 we have µθ µQ H µθ µbθa(b(n)) H + µbθa(b(n)) µQ H 0 a.s. Thus µθ µQ H = MMD(Q, Pθ ) = 0, and since the kernel is characteristic by Assumption 5, we conclude Q = Pθ , but this is a contradiction since under the alternative hypothesis Q Pθ. 
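The wild bootstrap statistic of Section A.1.3 is straightforward to compute from kernel matrices. Below is a minimal numerical sketch, assuming a Gaussian kernel on one-dimensional samples; the function names and defaults are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    # Gaussian (RBF) kernel matrix: K[i, j] = exp(-(a_i - b_j)^2 / (2 l^2))
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * lengthscale**2))

def wild_bootstrap_mmd(x, x_tilde, n_boot=200, lengthscale=1.0, rng=None):
    """Wild-bootstrap replicates of n * MMD^2 between x ~ Q and
    x_tilde ~ P_{theta_hat}, using Rademacher weights W_i."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    # h[i, j] = K(x_i, x_j) - K(x_i, x~_j) - K(x~_i, x_j) + K(x~_i, x~_j)
    h = (gaussian_kernel(x, x, lengthscale)
         - gaussian_kernel(x, x_tilde, lengthscale)
         - gaussian_kernel(x_tilde, x, lengthscale)
         + gaussian_kernel(x_tilde, x_tilde, lengthscale))
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher weights
        stats[b] = w @ h @ w / n             # (1/n) sum_{i,j} W_i W_j h_ij
    return stats
```

Each replicate draws fresh Rademacher weights and evaluates the quadratic form (1/n) ∑_{i,j} W_i W_j h_ij, which matches the supremum form of the statistic given in Section A.1.3.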
Proof of Theorem 4 Observe that n MMD2 W (Pˆθn,n, Qn) = sup ω H: ω H=1 Z2 n(ω), (17) Key*, Gretton, Briol, and Fernandez* where Zn : H R is defined Zn(ω) = 1 n Pn i=1 Wi(ω(Xi) ω( Xi)). It is not difficult to see that Zn is linear on its argument ω, and that Zn is bounded since Assumptions 1 to 6 yield supx X |ω(x)| ω H supx X p K(x, x) < . Then, the asymptotic distribution of the wild bootstrap test-statistic n MMD2 W (Pˆθn,n, Qn) can be obtained from Theorem 10 and Equation 17, by verifying that Zn satisfies conditions G0-G2, which we proceed to verify. Observe that Condition G0 follows directly from Proposition 13, where we obtain that Zn(ω1 + . . . + ωm) DD N j=1 ν(ωi, ωj) for any m N and ω1, . . . , ωm H, where ν : H H R is given by ν(ω, ω ) = 2 Z (ω(x) g(ω, θ0))(ω (x) g(ω , θ0))pθ0(x)λ(dx). To verify Condition G1, we consider some orthonormal basis (ψi)i 1 of H, and verify that P i 1 ν(ψi, ψi) < holds. Note that i 1 ν(ψi, ψi) = X i 1 2 Z (ψi(x) g(ψi, θ0))2pθ0(x)λ(dx) i 1 ψi(x)2pθ0(x)λ(dx) = 2 Z K(x, x)pθ0(x)λ(dx) < , where the first inequality follows from the fact that the variance of a random variable is upper bounded by its second moment, and the last inequality holds from the fact that P i 1 ψ2 i (x) = K(x, x), and that the kernel is bounded by Assumption 5. Finally, note that limi lim supn PD P k i Z2 n(ψk) ε = 0 verifies Condition G2. By Markov s inequality k i Z2 n(ψk) ε k i ED(Z2 n(ψk)), and note that k i ED(Z2 n(ψk)) = 1 j=1 Wj(ψk(Xj) ψk( Xj)) j=1 Wj(ψk(Xj) ψk( Xj)) 2 ( Xℓ)n ℓ=1 j=1 (ψk(Xj) ψk( Xj))2 k i ED ψ2 k( X1) Composite Goodness-of-fit Tests with Kernels where the first equality follows from replacing Zn by its definition, the third equality follows from noticing that conditioned on the data (Xi, Xi)n i=1, Zn is the sum of independent random variables, E(Wi|(Xi, Xi)n i=1) = 0 and W 2 i = 1. Finally, note that the last inequality holds from noticing that (a b)2 2(a2 + b2) for any a, b R, and ( Xi)n i=1 are i.i.d. give the data (Xi)n i=1. 
By using the previous set of equations, we obtain lim i lim sup n PD k i Z2 n(ψk) ε lim i lim sup n 2 ε + lim i lim sup n 2 ε k i ψ2 k(x)pˆθn(x)λ(dx) We shall prove the previous equation is zero. For the first term observe that lim i lim sup n 2 ε lim i lim sup n 2 nε k i ψ2 k(Xj) = lim i 2 ε k i ψ2 k(x)pθ0(x)λ(dx) where the first inequality follows by Jensen s inequality, the first equality follows from the law of large numbers since P k i ψ2 k(x) P k 1 ψ2 k(x) = K(x, x) and the kernel is bounded by Assumption 5. The same argument, combined with Lebesgue s dominated convergence theorem, explains the last equality. For the second term, the Lebesgue s dominated convergence theorem yields lim i lim sup n 2 ε k i ψ2 k(x)pˆθn(x)λ(dx) = lim i 2 ε k i ψ2 k(x)pθ0(x)λ(dx) = 0. Since conditions G0-G2 are satisfied, then Theorem 10 yields that n MMD2 W (Pˆθn,n, Qn) DD i=1 λ i Z 2 i , (18) where Z 1, . . . , Z n are standard normal i.i.d. random variables, and λ 1, λ 2, . . . , are the eigenvalues of the operator Tν : H H given by (Tνω)(x) = ν(ω, Kx). On the other hand, observe that the wild bootstrap for the goodness-of-fit problem satisfies that n MMD2 W (Pθ0, Qn) DD i=1 λi Z2 i , (19) Key*, Gretton, Briol, and Fernandez* where Z1, . . . , Zn are standard normal i.i.d. random variables, and λ1, λ2, . . . , are the eigenvalues of the operator Tσ : H H given by (Tσω)(x) = σ(ω, Kx) where σ = 1 2ν (hence the eigenvalues are half of those in Equation 18). 
Therefore, we deduce that for every x R, lim sup n P(n MMD2 W (Pˆθn,n, Qn) x) = lim sup n P n MMD2 W (Pθ0, Qn) x > lim sup n P n MMD2 W (Pθ0, Qn) x (20) Replacing x = qα in the previous equation yields α = lim sup n P(n MMD2 W (Pˆθn,n, Qn) qα) > lim sup n P n MMD2 W (Pθ0, Qn) qα = lim sup n P n MMD2(Pθ0, Qn) qα , (21) where the first equality is due to the definition of qα, the first inequality holds by Equation 20 and the last equality holds since n MMD2(Pθ0, Qn) and n MMD2 W (Pθ0, Qn) have the same asymptotic null distribution (Chwialkowski et al., 2014, Theorem 1). Finally note that lim sup n P n MMD2(Pˆθn, Qn) qα = lim sup n P n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) qα < lim sup n P n MMD2(Pθ0, Qn) qα < α, where the first equality holds due to Lemma 12, the first inequality holds since both H and H 1 are positive definite matrices, and thus Pn i=1 Pn j=1 φ(Xj, θ0) H 1φ(Xi, θ0) 0, and the last inequality holds by Equation 21. Proof of Theorem 5 By the definition of the maximum mean discrepancy, we have n MMD2(Pθ n, Q n) = sup ω H: ω H=1 (γ n(ω))2, where γ n(ω) = 1 n i=1 ω(X i ) g(ω, θ n). Proposition 23 yields γ n(ω) = S n(ω) + op(1), where S n : H R is the functional defined in Equation 8, and the error term does not depend on ω. Then, by using the fact that supω H: ω H=1(S n(ω))2 converges in distribution (which is proved next) we obtain n MMD2(Pθ n, Q n) = sup ω H: ω H=1 (S n(ω))2 + op(1). By Slutsky s theorem, we conclude that n MMD2(Pθ n, Q n) and supω H: ω H=1(S n(ω))2 have the same asymptotic null distribution. We use Theorem 10 to obtain the asymptotic null distribution of supω H: ω H=1(S n(ω))2. Observe however that to apply this theorem, we need to check that S n(ω) = 1 n i=1 η(ω, X i , ˆθn) Composite Goodness-of-fit Tests with Kernels is a bounded linear functional. The linearity follows from the fact that for any fixed x X and ω Θ, η(ω, x, θ) is a linear function of ω. 
Boundedness follows from item iii of Proposition 16 since supθ Θ supx X |η(ω, x, θ)| < C1 ω H + C2 ω 2 H where C1 > 0 and C2 > 0 are constants that do not depend on ω. We proceed to check Conditions G0, G1 and G2 of Theorem 10. Condition G0 follows directly from Proposition 24, where we identify σ(ω, ω ) = EX Pθ0(η(ω, X, θ0)η(ω , X, θ0)) for any ω, ω H. Indeed, note that it is the same bilinear form of Equation 13, and thus Condition G1 was already proved in Equation 14 in the proof of Theorem 2. Finally, to check condition G2 observe that k i (S n(ψk))2 ε k i ED (S n(ψk))2 = ε 1 X k i ED(η2(ψk, X , ˆθn)) X η2(ψk, z, ˆθn)pˆθn(z)λ(dz) where X Pˆθn|ˆθn. The first inequality follows from Markov s inequality, and the first equality follows since S n(ω) = 1 n Pn i=1 η(ω, X i , ˆθn), and recall that (X i )n i=1 are generated i.i.d. from Pˆθn|ˆθn. From Proposition 16.ii., there exists constants C1 > 0 and C2 > 0 such that for all k 1 it holds Z X η2(ψk, z, ˆθn)pˆθn(z)λ(dz) C1 X ψ2 k(z)pˆθn(z)λ(dz) + C2 θg(ψk, θ0) 2 . (23) Finally, by combining Equations 22 and 23, we get lim i lim sup n PD k i (S n(ψk))2 ε lim i lim sup n X ψ2 k(z)pˆθn(z)λ(dz) + ε 1C2 lim i k i θg(ψk, θ0) 2 . Note that the second limit of the previous equation converges to zero by Proposition 25. For the first limit, observe lim i lim sup n X ψ2 k(z)pˆθn(z)λ(dz) lim i X ψ2 k(z) sup θ Θ pθ(z)λ(dz) = 0 since P k=1 ψ2 k(y) = K(y, y), and R X supθ Θ pθ(z)λ(dz) < under Assumptions 1 to 6. A.5 Proof of Auxiliary Results We proceed to prove the auxiliary results. Key*, Gretton, Briol, and Fernandez* Proof of Lemma 12 By the definition of the MMD, it holds that n MMD2(Pˆθn, Qn) = sup ω H: ω H=1 γ2 n(ω), where γn(ω) = 1 n i=1 ω(Xi) g(ω, ˆθn). Proposition 21 yields γn(ω) = Sn(ω) + op(1), where Sn : H R is the functional defined in Equation 8, and the error term does not depend on ω. 
Then, by using the fact that sup_{ω ∈ H: ‖ω‖_H = 1} S²_n(ω) converges in distribution (which is proved in Theorem 2) we obtain

n MMD²(P_{θ̂_n}, Q_n) = sup_{ω ∈ H: ‖ω‖_H = 1} S²_n(ω) + o_p(1).

Note that S_n(ω) = A_n(ω) + B_n(ω), where A_n : H → R and B_n : H → R are given by

A_n(ω) = (1/√n) ∑_{i=1}^n (ω(X_i) − g(ω, θ_0)) and B_n(ω) = −(2/√n) ∑_{i=1}^n ⟨∇_θ g(ω, θ_0), H⁻¹ φ(X_i, θ_0)⟩.

Then

n MMD²(P_{θ̂_n}, Q_n) = sup_{ω ∈ H: ‖ω‖_H = 1} (A_n(ω) + B_n(ω))² + o_p(1) = (A^1_n + B^1_n)(A^2_n + B^2_n)K + o_p(1) = n MMD²(P_{θ_0}, Q_n) + A^1_n B^2_n K + B^1_n A^2_n K + B^1_n B^2_n K + o_p(1), (24)

where A^i_n K with i ∈ {1, 2} denotes the application of A_n to the i-th coordinate of the kernel, and the same notation is used for B^i_n. The second equality is due to Fernández and Rivera (2022), and the third equality holds since n MMD²(P_{θ_0}, Q_n) = A^1_n A^2_n K. We proceed to analyse A^1_n B^2_n K, B^1_n A^2_n K and B^1_n B^2_n K. Observe that

B_n(ω) = −(2/√n) ∑_{i=1}^n ⟨∫_X ω(x) ∇_θ p_{θ_0}(x) λ(dx), H⁻¹ φ(X_i, θ_0)⟩,

where the equality follows by Lebesgue's dominated convergence theorem (see Remark 9). For all x ∈ X, define K_x : X → R by K_x(·) = K(x, ·). Then B^2_n K ∈ H is given by

(B^2_n K)(x) = B_n(K_x) = −(2/√n) ∑_{i=1}^n ⟨∫_X K_x(y) ∇_θ p_{θ_0}(y) λ(dy), H⁻¹ φ(X_i, θ_0)⟩. (25)

By linearity, we obtain

A^1_n B^2_n K = −(2/√n) ∑_{i=1}^n ⟨∫_X (A^1_n K)(y) ∇_θ p_{θ_0}(y) λ(dy), H⁻¹ φ(X_i, θ_0)⟩. (26)

Note that for any fixed y ∈ X, it holds

(A^1_n K)(y) = (1/√n) ∑_{j=1}^n ( K(X_j, y) − ∫_X K(x, y) p_{θ_0}(x) λ(dx) ).

Then, a simple computation shows that

∫_X (A^1_n K)(x) ∇_θ p_{θ_0}(x) λ(dx) = (1/√n) ∑_{j=1}^n ( ∇_θ µ_{θ_0}(X_j) − ∫_X (∇_θ µ_{θ_0}(x)) p_{θ_0}(x) λ(dx) ) = (1/√n) ∑_{j=1}^n φ(X_j, θ_0).

By substituting the previous equation into Equation 26, we obtain

A^1_n B^2_n K = −(2/n) ∑_{i=1}^n ∑_{j=1}^n ⟨φ(X_j, θ_0), H⁻¹ φ(X_i, θ_0)⟩ = −(2/n) ∑_{i=1}^n ∑_{j=1}^n φ(X_j, θ_0)ᵀ H⁻¹ φ(X_i, θ_0). (27)

A similar computation shows that B^1_n A^2_n K = A^1_n B^2_n K.
For the term B1 n B2 n K, we have B1 n B2 n K X (B1 n K)(x) θpθ0(x)λ(dx), H 1φ(Xi, θ0) X Kx(y) θpθ0(y)λ(dy), H 1φ(Xj, θ0) θpθ0(x)λ(dx), H 1φ(Xi, θ0) X K(x, y) θpθ0(y), H 1φ(Xj, θ0) θpθ0(x), H 1φ(Xi, θ0) λ(dy)λ(dx) m=1 (H 1φ(Xi, θ0))ℓ(H 1φ(Xj, θ0))m Hm,ℓ j=1 φ(Xj, θ0) H 1φ(Xi, θ0), (28) where the second equality holds from replacing (B2 n K)(x) by the expression given in Equation 25, and the fourth equality holds since θℓ pθ0(x) λ(dy)λ(dx). By substituting Equation 27 and Equation 28 into Equation 24, we obtain n MMD2(Pˆθn, Qn) = n MMD2(Pθ0, Qn) 2 j=1 φ(Xj, θ0) H 1φ(Xi, θ0) + op(1). Key*, Gretton, Briol, and Fernandez* Proof of Proposition 13 Note that conditioned on the data e D = (Xi, Xi)n i=1, Zn(ω) is a sum of i.i.d. random variables with mean given by E e D (Zn(ω)) = E e D i=1 Wi(ω(Xi) ω( Xi)) and variance given by Var e D(Zn(ω)) = 1 n Pn i=1(ω(Xi) ω( Xi))2. We claim that i=1 (ω(Xi) ω( Xi))2 P 2 Z (ω(x) g(ω, θ0))2pθ0(x)λ(dx) = ν(ω, ω). Thus, by Lyapunov s Central limit theorem, we obtain Zn(ω) DD N(0, ν(ω, ω)) which yields the desired result. To prove the claim, consider ξn = 1 n Pn i=1(ω(Xi) ω( Xi))2. We will show that E(ξn) = ν(ω, ω) and that Var(ξn) 0. Therefore the convergence in probability follows from Markov s inequality. Consider the data D = (Xi)i 1, and define ED( ) = E( |(Xi)n i=1). Then, observe that E (ξn) = E ω(X1) ω( X1) 2 = E ED ω(X1) ω( X1) 2 = E ω2(X1) 2ω(X1)ED(ω( X1)) + ED(ω2( X1)) , (29) ED ω( X1) = Z ω(z)pˆθn(z)λ(dz) and ED ω2( X1) = Z ω2(z)pˆθn(z)λ(dz). Recall that R supθ Θ pθ(x)λ(dx) < by Assumption 3, and that ω H is bounded by Assumption 5. Then, under the null hypothesis, and by applying Lebesgue s dominated convergence theorem twice, it holds lim n E ED ω( X1) = E lim n Z ω(z)pˆθn(z)λ(dz) = Z ω(z)pθ0(z)λ(dz). By using the same arguments we can show that limn E(ED(ω2( X1))) = R ω2(z)pθ0(z)λ(dz). The previous equations together with Equation 29, yield lim n E (ξn) = 2 Z ω2(z)pθ0(z) Z ω(z)pθ0(z) 2! 
Finally, we proceed to verify that limn Var(ξn) = 0. Note that Var (ξn) = Var (ED(ξn)) + E(Var D (ξn)), and by the previous arguments limn Var (ED(ξn)) = 0 since ED(ξn) converges to a constant. Note that conditioned on the data D, the random variables (ω(Xi) ω( Xi))n i=1 are independent. Thus Var D (ξn) = 1 ω(Xi) ω( Xi) 2 0, Composite Goodness-of-fit Tests with Kernels where the limit holds since the ω is bounded. Hence we conclude that limn Var(ξn) = 0. Proof of Proposition 14 To prove this result we use Theorem 11. Note that by Assumption 2 the set Θ is bounded and closed, by Assumption 3 we have L(θ) C0(Θ), and by Assumption 5 the kernel K is characteristic, which means that the maximum mean discrepancy is a distance between probability measures. Thus, since the parametric family {Pθ}θ Θ is identifiable (Assumption 3), we deduce that L(θ) = MMD2(Pθ, Pθ0) has a unique minimiser under the null hypothesis. Observe that sup θ Θ |Ln(θ) L(θ)| = sup θ Θ µθ µn 2 H µθ µθ0 2 H 2 µθ, µθ0 µn H + µn 2 H µθ0 2 H 2 sup θ Θ µθ H µθ0 µn H + µn 2 H µθ0 2 H C µθ0 µn H + µn 2 H µθ0 2 H 0 in probability, where the last result holds by the standard law of large numbers for RKHS Berlinet and Thomas-Agnan (2004, Section 9.1.1-9.1.2). Then, by Theorem 11, we deduce that ˆθn θ0 in probability. Proof of Proposition 15 Recall the definition φ in Equation 6, then θLn(θ) = θ µn µθ 2 H = 2 θµθ, µn µθ H X ( θµθ(x)) pθ(x)λ(dx) i=1 φ(Xi, θ). By doing the same computations, we obtain θL n(θ) = θ µ n µθ 2 H = 2 n Pn i=1 φ(X i , θ), where recall that µ n is the mean kernel embedding of the empirical distribution Q n obtained from the parametric bootstrap samples X 1, . . . , X n. We proceed to prove item i). Observe that clearly, EX Pθ(φ(X, θ)) = EX Pθ X ( θµθ(y))pθ(y)λ(dy) = 0, and similarly ED(φ(X i , ˆθn)) = 0. We continue by checking ii). 
For any fixed x X and θ Θ, we have φ(x, θ) 2 = θi µθ(y) pθ(y)λ(dy) 2 4 Key*, Gretton, Briol, and Fernandez* Additionally, by similar arguments as those shown in Remark 9 we have that for any i {1, . . . , p}, it holds sup θ Θ sup x X θi µθ(x) = sup θ Θ sup x X θi pθ(y)λ(dy) < . We finish by verifying item iii). Recall that θLn(θ0) = 2 n Pn i=1 φ(Xi, θ0), where φ(X1, θ0), . . . , φ(Xn, θ0) are i.i.d. random variables. Moreover, under the null hypothesis, the items i) and ii) of this proposition deduce E(φ(Xi, θ0)) = 0 and E( φ(Xi, θ0) 2) < . Thus, E( n θLn(θ0) 2) = 4E( φ(X1, θ0) 2) < , and we obtain n θLn(θ0) = Op(1). By using the same arguments we deduce that for almost all sequences (Xi)i 1, ED( n θL n(ˆθn) 2) = 4ED( φ(X 1, ˆθn) 2) < . Thus n θL n(ˆθn) = Op(1). Proof of Proposition 16 We start with i). Observe that EX Pθ (η(ω, X, θ)) = EX Pθ (ω(X) g(ω, θ)) 2 θg(ω, θ0), H 1EX Pθ (φ(X, θ)) = 0, where the first equality follows from the linearity of expectations, and the last equality is due to g(ω, θ) = EX Pθ(ω(X)) and EX Pθ(φ(X, θ)) = 0 from Proposition 15, item i). We continue by verifying items ii) and iii). Observe that for any fixed ω H, x X and θ Θ, we have that |η(ω, x, θ)| = ω(x) g(ω, θ) 2 θg(ω, θ0), H 1φ(x, θ) |ω(x) g(ω, θ)| + 2 θg(ω, θ0) H 1 F φ(x, θ) |ω(x) g(ω, θ)| + C θg(ω, θ0) (31) where F denotes the Frobenius norm, and C > 0 is a constant that does not depend on ω, x, nor θ. In the previous set of equations, the first inequality follows from the triangle inequality and the Cauchy-Schwarz s inequality, and the second inequality holds by taking C = 2 H 1 F supθ Θ supx X φ(x, θ) < since H 1 F < , and supθ Θ supx X φ(x, θ) < by item ii) of Proposition 15. We proceed to verify item ii). From Equation 31, we have that for any fixed ω H and θ Θ, EX Pθ η2(ω, X, θ) 2EX Pθ |ω(X) g(ω, θ)|2 + 2C 2 θg(ω, θ0) 2 C1EX Pθ ω2(X) + C2 θg(ω, θ0) 2 , where C1 = 2 and C2 = 2C 2. 
The first inequality holds from the fact that for any a, b R we have that (a + b)2 2(a2 + b2), and the second inequality follows from the fact that the variance of random variable is upper bounded by its second moment. We continue with item iii). Note that for each ω H, we have supx X |ω(x)| ω H C (see Remark 8), and thus sup θ Θ |g(ω, θ)| sup θ Θ X |ω(x)|pθ(x)λ(dx) ω H Composite Goodness-of-fit Tests with Kernels Additionally, note that (see Remark 9) sup θ Θ θg(ω, θ) 2 = sup θ Θ θi pθ(x)λ(dx) 2 θi pθ(y) λ(dy) 2 < , (33) where the equality holds under Assumption 3 by the Lebesgue s dominated convergence theorem (see Remark 9), the first inequality is due to the fact that supx X |ω(x)| ω H C, and the last inequality holds by Assumption 3. Then, by combining Equations 31 to 33, we get sup θ Θ sup x X |η(ω, x, θ)| 2 ω H C + ω 2 H C θi pθ(y) λ(dy) 2 C1 ω H + C2 ω 2 H < , where C1 > 0 and C2 > 0 are constants that do not depend on ω, x or θ. Proof of Lemma 17 We begin by defining Zn(θ) = |Yn(θ) γ(θ)| and Z n(θ) = |Y n (θ) γ(θ)|, where θℓ θj µθ(Xi), γ(θ) = EX Pθ0 θℓ θj µθ(X) , and Y n (θ) = 1 θℓ θj µθ(X i ). Observe that Equation 11 is equivalent to supθ Θ Zn(θ) P 0 as n grows to infinity, and Equation 12 is equivalent to PD (supθ Θ Z n(θ) ε) P 0 holds for any ε > 0. To verify Equation 11 we use Theorem 21.9 of Davidson (1994), from which we just need to check the following properties: (i) Zn(θ) P 0 for each fixed θ Θ, and (ii) Zn( ) is stochastically equicontinuous. Analogously, for Equation 12 we need to check (i ) PD (Z n(θ) ε) P 0, and (ii ) for almost all sequences (Xi)i 1, Z n( ) is stochastically equicontinuous. We start proving (i) and (i ). Note that (i) follows from the law of large numbers since θℓ θj µθ(X) X K(X, y) 2 θℓ θj pθ(y)λ(dy) holds under Assumption 3 (see Remark 9). 
For item (i ), note that for any ε > 0 we have PD (Z n(θ) ε) = PD (|γ(θ) Y n (θ)| ε) PD (|γn(θ) Y n (θ)| ε/2) + PD (|γ(θ) γn(θ)| ε/2) , Key*, Gretton, Briol, and Fernandez* where γn(θ) = EX Pˆθn 2 θℓ θj µθ(X) . The last two terms converge to zero. To see this observe that PD |γn(θ) Y n (θ)| ε 4ED(|γn(θ) Y n (θ)|2) ε2 4 ε2n ED θℓ θj µθ(X 1) 2! where the first inequality is due to Chebyshev s inequality, and the second inequality holds since X 1, . . . , X n i.i.d. Pˆθn|ˆθn. For the second term |γn(θ) γ(θ)| = θℓ θj µθ(x)(pˆθn(x) pθ0(x))λ(dx) θℓ θj µθ(x) X |(pˆθn(x) pθ0(x))|λ(dx) now, by Taylor s theorem, for each x X, it exists θx in the line between θ0 and ˆθn such that pˆθn(x) pθ0(x) = θpθx(x) (ˆθn θ0). By the previous equality and by Holder s inequality |γn(θ) γ(θ)| sup x X θℓ θj µθ(x) X θpθx(x) 1 λ(dx) ˆθn θ0 θℓ θj µθ(x) X sup θ Θ θpθ(x) 1 λ(dx) ˆθn θ0 0, since |ˆθn θ0| = op(1) by Proposition 14, and since the other terms are finite due to Assumptions 3 and 5 and Remark 9. We continue by checking (ii) and (ii ). To check ii) we use Theorem 21.10 of Davidson (1994) from which it is enough to verify that |Zn(θ) Zn(θ )| B θ θ holds for all θ, θ Θ and for all n, where B is a constant which does not depend of n, θ and θ . To check condition ii ) we will verify that there exists B > 0 such that |Z n(θ) Z n(θ )| B θ θ , holds for every sequence (Xi)i 1. We only verify that |Zn(θ) Zn(θ )| B θ θ for some constant B since the analogous result for parametric bootstrap follows from the same arguments. By the triangle inequality, we obtain |Zn(θ) Zn(θ )| γ(θ) γ(θ ) + Yn(θ) Yn(θ ) . 
Since the kernel is bounded by a constant C > 0, it holds γ(θ) γ(θ ) = EX Pθ0 θℓ θj µθ(X) 2 θℓ θj µθ (X) X K(X, y) 2 θℓ θj (pθ(y) pθ (y))λ(dy) θℓ θj (pθ(y) pθ (y)) λ(dy), (34) Composite Goodness-of-fit Tests with Kernels Yn(θ) Yn(θ ) = X K(Xi, y) 2 θℓ θj (pθ(y) pθ (y))λ(dy) θℓ θj (pθ(y) pθ (y)) λ(dy), (35) By combining Equation 34 and Equation 35, we get |Zn(θ) Zn(θ )| 2C Z θℓ θj (pθ(y) pθ (y)) λ(dy), By the mean value theorem, we have 2 θℓ θj (pθ(y) pθ (y)) = θ 2 θℓ θj pθy(y) (θ θ ), where θy is in the line between θ and θ (which belongs to Θ by convexity of this set). By Holder s inequality, we obtain 2 θℓ θj (pθ(y) pθ (y)) θ θℓ θj pθy(y) 1 θ θ θℓ θj pθ(y) 1 Then, we conclude that |Zn(θ) Zn(θ )| 2C Z θℓ θj pθ(y) 1 λ(dy) | {z } Bℓ,j Finally, choose B = maxℓ,j [p] Bℓ,j and note it is finite by Assumption 3 together with Remark 9. Proof of Proposition 18 We start by proving Hn H = op(1). To prove this result, it suffices to show that each component of (Hn)ij converges in probability to Hij. Observe that θi θj Ln( θj) = 2 θi θj (Ln( θj) L( θj)) + 2 θi θj L( θj). A simple computation shows that for each θ Θ it holds θi θj (Ln(θ) L(θ)) = 2 θi θj µθ(X) 1 θi θj µθ(Xi) where the second equality holds by Lemma 17. Thus, (Hn)ij = op(1) + 2 θi θj L( θj). Key*, Gretton, Briol, and Fernandez* Finally, note that L(θ) = µθ µθ0 2 H C0(Θ) since µθ(x) C3(Θ) for all x X (see Assumption 3 and Remark 9). Then, the continuous mapping theorem, together with the fact that θj n P θ0 (by Proposition 14), yields θi θj L(θ0) + op(1) = Hij + op(1). We continue with the result for H n. Similarly to what we did before, we sum and subtract the term 2 θi θj L( θj ) to (Hn) ij and obtain (Hn) ij = 2 L n( θj ) L( θj ) + 2 θi θj L( θj ). Observe that for any θ Θ, under the null hypothesis, we have θi θj (L n(θ) L(θ)) = 2 θi θj µθ(X) 1 θi θj µθ(X i ) Then, the result follows from Lemma 17 and from the fact that ˆθ n θ0 = op(1) and the same arguments as before. 
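The parametric bootstrap studied throughout this appendix refits the parameter on each sample drawn from the fitted model before recomputing the test statistic. A generic sketch of that procedure is below; the callables `fit`, `sample` and `statistic` are hypothetical placeholders, and the +1 p-value correction is a common convention, not taken from the paper:

```python
import numpy as np

def parametric_bootstrap_pvalue(x, fit, sample, statistic, n_boot=200, rng=None):
    """Generic parametric-bootstrap p-value: estimate theta_hat from the data,
    then, for each bootstrap round, draw X*_1..X*_n from the fitted model,
    RE-estimate the parameter on the bootstrap sample (the key step analysed
    in the paper), and recompute the statistic."""
    rng = np.random.default_rng() if rng is None else rng
    theta_hat = fit(x)
    t_obs = statistic(x, theta_hat)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        x_star = sample(theta_hat, len(x), rng)  # X* ~ P_{theta_hat} given theta_hat
        theta_star = fit(x_star)                 # theta*_n: refit on bootstrap data
        t_boot[b] = statistic(x_star, theta_star)
    # p-value with the usual +1 correction so it is never exactly zero
    return (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)
```

Refitting on each bootstrap sample is what distinguishes this procedure from the wild bootstrap, and is what makes the resulting test consistent rather than overly conservative.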
Proof of Proposition 19 We start by proving item i). By Assumptions 1 to 6, we have that Ln(θ) = µθ µn 2 H C3(Θ) since µθ(x) C3(Θ) for all x X (see Remark 9). Then, for each j {1, . . . , p}, a first order Taylor s expansion of θj Ln(θ) around ˆθn exists and is given by θj Ln(θ) = θj Ln(ˆθn) + θ θj Ln( θj) , θ ˆθn θj Ln( θj) , θ ˆθn where θj lies in the line segment between ˆθn and θ, and θj Θ by convexity. The second equality holds since ˆθn is the minimiser of Ln, and belongs to the interior of Θ. By using the previous equation and by evaluating at θ0, we obtain θLn(θ0) = H n(θ0 ˆθn), where (Hn)ij = 2 θi θj Ln( θj) for each i, j [p]. By Proposition 18 we have that Hn converges in probability to the Hessian matrix H. Recall that H is positive definite by Assumption 6. Then, by the continuity of the determinant, we have that Hn is invertible in sets of arbitrarily large probability for all n large enough, and thus n(θ0 ˆθn) = n(H n) 1 θLn(θ0) = n(H ) 1 θLn(θ0) + n((H n) 1 H 1) θLn(θ0). By Proposition 15, n θLn(θ0) = Op(1), and since Hn P H, we have n(θ0 ˆθn) = n(H ) 1 θLn(θ0) + op(1). Composite Goodness-of-fit Tests with Kernels For item ii) we follow a very similar argument. Note that by Assumptions 1 to 6, L n(θ) C3(Θ). Then, for each j {1, . . . , p}, we perform a first order Taylor s approximation of θj L n(θ) around θ n and we get θj L n(ˆθn) = θ θj L n( θj ) , ˆθn θ n where θj lies in the line segment between θ n and ˆθn. By using the previous equation, we obtain θL n(ˆθn) = H n (ˆθn θ n), where (H n)ij = 2 θi θj L n( θj ) for each i, j [p], and all θ1 , . . . , θp lie in the line segment between ˆθn and θ n. By Proposition 18 we have that H n converges in probability to the Hessian matrix H. Then, by the continuity of the determinant, we have that H n is invertible in sets of arbitrarily large probability for all n large enough, and thus n(ˆθn θ n) = n(H n ) 1 θL n(ˆθn). 
We finish by noting that under Assumptions 1 to 6, Proposition 15 deduces that n θL n(ˆθn) = Op(1) and by Proposition 18, (H n ) 1 converges to H 1. Proof of Lemma 20 We start by proving i). Proposition 19 yields n(ˆθn θ n) = n H 1 θL n(ˆθn)+op(1), and Proposition 15 yields n θL n(ˆθn) = Op(1), which together give us that n(ˆθn θ n) = Op(1). We continue by proving ii). Observe that P(|θ n θ0| ϵ) P |θ n ˆθn| ϵ/2 + P |ˆθn θ0| ϵ/2 0, where the first probability tends to 0 by part i), and the second one by Proposition 14 (that can only be applied under the null hypothesis). Proof of Proposition 21 Define γn : H R by γn(ω) = 1 n ω(Xi) g(ω, ˆθn) . (36) By Assumptions 1 to 6, for each fixed ω H, we have g(ω, θ) C3(Θ) (see Remark 9). Then, a first order Taylor s expansion of g(ω, θ) around θ0 yields g(ω, ˆθn) = g(ω, θ0) + θg(ω, θn), ˆθn θ0 , where θn lies on the line segment between θ0 and ˆθn. Note that by the convexity of Θ, it holds that θn Θ. Then, by replacing the previous expression in Equation 36 we obtain γn(ω) = 1 n i=1 (ω(Xi) g(ω, θ0)) n D θg(ω, θn), ˆθn θ0 E . By assuming the null hypothesis, Proposition 14 yields ˆθn P θ0, and Proposition 15.iii and Proposition 19.i yield n(ˆθn θ0) = Op(1). Then, by the continuous mapping theorem we obtain γn(ω) = 1 n i=1 (ω(Xi) g(ω, θ0)) n D θg(ω, θ0), ˆθn θ0 E + op(1). (37) Key*, Gretton, Briol, and Fernandez* By combining Proposition 15 and Proposition 19.i, we have n(ˆθn θ0) = n H 1 θLn(θ0) + op(1) = 2 n i=1 H 1φ(Xi, θ0) + op(1). By replacing the previous expression in Equation 37 we obtain γn(ω) = 1 n ω(Xi) g(ω, θ0) 2 θg(ω, θ0), H 1φ(Xi, θ0) + op(1) i=1 η(ω, Xi, θ0) + op(1) = Sn(ω) + op(1). Proof of Proposition 22 First note that i=1 (η(ω1, Xi, θ0), . . . , η(ωm, Xi, θ0)), where (η(ω1, Xi, θ0), . . . , η(ωm, Xi, θ0))n i=1 are a collection of i.i.d. random vectors. By Proposition 16.i, we have that for any fixed ω H it holds EX Pθ0(η(ω, X, θ0)) = 0, and thus EX Pθ0(Sn) = 0. 
Additionally, by Proposition 16.ii, for any $\omega \in \mathcal{H}$ we have $E_{X \sim P_{\theta_0}}(\eta^2(\omega, X, \theta_0)) < \infty$. Thus, the central limit theorem yields $S_n \to^D N_m(0, \Sigma)$, where for any $i, j \in [m]$ we have $\Sigma_{ij} = \sigma(\omega_i, \omega_j) = E_{X \sim P_{\theta_0}}(\eta(\omega_i, X, \theta_0)\eta(\omega_j, X, \theta_0))$.

Proof of Proposition 23 Define $\gamma^*_n : \mathcal{H} \to \mathbb{R}$ by
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \theta^*_n)\big). \quad (38)$$
By Assumptions 1 to 6, for each $\omega \in \mathcal{H}$, $g(\omega, \theta) \in C^3(\Theta)$ (see Remark 9). By a first-order Taylor expansion of $g(\omega, \theta)$ around $\hat\theta_n$ we obtain $g(\omega, \theta^*_n) = g(\omega, \hat\theta_n) + \langle \nabla_\theta g(\omega, \bar\theta^*_n),\, \theta^*_n - \hat\theta_n\rangle$, where $\bar\theta^*_n$ belongs to the line segment between $\hat\theta_n$ and $\theta^*_n$. Then, it follows that
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \hat\theta_n)\big) - \sqrt{n}\,\big\langle \nabla_\theta g(\omega, \bar\theta^*_n),\, \theta^*_n - \hat\theta_n\big\rangle.$$
Proposition 14 yields that $\hat\theta_n - \theta_0 = o_p(1)$, and Lemma 20 yields $\theta^*_n - \theta_0 = o_p(1)$ and $\sqrt{n}(\theta^*_n - \hat\theta_n) = O_p(1)$. Thus, by the continuous mapping theorem it holds that
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(\omega(X^*_i) - g(\omega, \hat\theta_n)\big) - \sqrt{n}\,\big\langle \nabla_\theta g(\omega, \theta_0),\, \theta^*_n - \hat\theta_n\big\rangle + o_p(1). \quad (39)$$
Finally, by combining Propositions 15 and 19.ii we get
$$\sqrt{n}(\theta^*_n - \hat\theta_n) = -\sqrt{n}\,H^{-1}\nabla_\theta L^*_n(\hat\theta_n) + o_p(1) = -\frac{2}{\sqrt{n}}\sum_{i=1}^n H^{-1}\phi(X^*_i, \hat\theta_n) + o_p(1),$$
and thus, by replacing in Equation 39,
$$\gamma^*_n(\omega) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \Big(\omega(X^*_i) - g(\omega, \hat\theta_n) + 2\big\langle \nabla_\theta g(\omega, \theta_0),\, H^{-1}\phi(X^*_i, \hat\theta_n)\big\rangle\Big) + o_p(1) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\omega, X^*_i, \hat\theta_n) + o_p(1).$$

Proof of Proposition 24 We will apply the Lindeberg-Feller central limit theorem for multivariate triangular arrays. Define $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_{n,i}$, where $\{Y_{n,i}\}$ is a triangular array of random vectors $Y_{n,i} = (\eta(\omega_1, X^*_{n,i}, \hat\theta_n), \ldots, \eta(\omega_m, X^*_{n,i}, \hat\theta_n))$, with $\omega_1, \ldots, \omega_m \in \mathcal{H}$ and $m \in \mathbb{N}$. Observe that, by Proposition 16, we have that
$$E_D\big(\eta(\omega, X^*_{n,i}, \hat\theta_n)\big) = \int_{\mathcal{X}} \eta(\omega, z, \hat\theta_n)\, p_{\hat\theta_n}(z)\,\lambda(dz) = 0$$
holds for any $\omega \in \mathcal{H}$, and thus $E_D(Y_{n,i}) = 0$. Consider $V_n = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}_D(Y_{n,i})$, and observe that for any $\ell, k \in [m]$, we have that
$$(V_n)_{\ell,k} = \mathrm{Cov}_D\big(\eta(\omega_\ell, X^*_{n,1}, \hat\theta_n),\, \eta(\omega_k, X^*_{n,1}, \hat\theta_n)\big) = \int_{\mathcal{X}} \eta(\omega_\ell, z, \hat\theta_n)\,\eta(\omega_k, z, \hat\theta_n)\, p_{\hat\theta_n}(z)\,\lambda(dz) \to^P \int_{\mathcal{X}} \eta(\omega_\ell, z, \theta_0)\,\eta(\omega_k, z, \theta_0)\, p_{\theta_0}(z)\,\lambda(dz).$$
The convergence in probability holds since $\hat\theta_n \to^P \theta_0$ by Proposition 14, and since the map $\theta \mapsto \int_{\mathcal{X}} \eta(\omega, z, \theta)\,\eta(\omega', z, \theta)\,p_\theta(z)\,\lambda(dz)$ is continuous for any fixed $\omega, \omega' \in \mathcal{H}$.
Thus the continuous mapping theorem yields Equation 40. We proceed to check Lindeberg's condition. Observe that for every $\varepsilon > 0$,
$$\frac{1}{n}\sum_{i=1}^n E_D\Big(\|Y_{n,i}\|^2 \mathbf{1}\{\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\}\Big) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^m E_D\Big(\eta^2(\omega_j, X^*_{n,i}, \hat\theta_n)\,\mathbf{1}\{\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\}\Big) \le \frac{C}{n}\sum_{i=1}^n P_D\big(\|Y_{n,i}\| \ge \varepsilon\sqrt{n}\big) \to 0, \text{ a.s.}$$
The inequality holds since $\sup_{\theta \in \Theta}\sup_{x \in \mathcal{X}} \eta^2(\omega_j, x, \theta) < \infty$ by Proposition 16.iii, with $C$ denoting the resulting constant.

Proof of Proposition 25 By Remark 9,
$$\sum_{i=1}^\infty \big\|\nabla_\theta g(\psi_i, \theta_0)\big\|^2 = \sum_{\ell=1}^p \sum_{i=1}^\infty \left(\int_{\mathcal{X}} \psi_i(y)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\lambda(dy)\right)^2. \quad (41)$$
Observe that for each $\ell \in \{1, \ldots, p\}$, it holds that
$$\sum_{i=1}^\infty \left(\int_{\mathcal{X}} \psi_i(y)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\lambda(dy)\right)^2 = \int_{\mathcal{X}}\int_{\mathcal{X}} \sum_{i=1}^\infty \psi_i(y)\psi_i(x)\,\partial_{\theta_\ell} p_{\theta_0}(y)\,\partial_{\theta_\ell} p_{\theta_0}(x)\,\lambda(dy)\lambda(dx)$$
$$\le \int_{\mathcal{X}}\int_{\mathcal{X}} \sum_{i=1}^\infty |\psi_i(y)||\psi_i(x)|\,\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\big|\partial_{\theta_\ell} p_{\theta_0}(x)\big|\,\lambda(dy)\lambda(dx)$$
$$\le \int_{\mathcal{X}}\int_{\mathcal{X}} \Big(\sum_{i=1}^\infty |\psi_i(y)|^2\Big)^{1/2}\Big(\sum_{j=1}^\infty |\psi_j(x)|^2\Big)^{1/2}\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\big|\partial_{\theta_\ell} p_{\theta_0}(x)\big|\,\lambda(dy)\lambda(dx)$$
$$= \left(\int_{\mathcal{X}} \Big(\sum_{i=1}^\infty |\psi_i(y)|^2\Big)^{1/2}\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 = \left(\int_{\mathcal{X}} \sqrt{K(y,y)}\,\big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 \le C\left(\int_{\mathcal{X}} \big|\partial_{\theta_\ell} p_{\theta_0}(y)\big|\,\lambda(dy)\right)^2 < \infty, \quad (42)$$
where the second inequality is due to the Cauchy-Schwarz inequality, the second equality holds since $\sum_{i=1}^\infty \psi_i^2(y) = K(y, y)$, the third inequality follows from the fact that the kernel is bounded (Assumption 5), and the last inequality is due to Assumption 3. By combining Equations 41 and 42 we conclude that $\sum_{i=1}^\infty \|\nabla_\theta g(\psi_i, \theta_0)\|^2 < \infty$.

Appendix B. Illustration of Limitations
We illustrate two of the limitations introduced in Section 6. The first occurs when $D(P, Q_n)$ is poor at distinguishing between probability distributions; for example, this can occur when using a Gaussian kernel with a very small or very large lengthscale. Figure 12 shows the type I error rate of the composite KSD test, using the parametric bootstrap, as the lengthscale of the Gaussian kernel varies. We consider the null hypothesis case where $H_0^C$: the data is Gaussian, and the data is sampled from $Q = N(0.4, 1.4^2)$. The figure shows that the type I error rate of the composite test can exceed the level for very small lengthscales, and goes to zero for large lengthscales.
The figure also shows the type I error rate of a non-composite test with $H_0 : P = Q$ as defined above, which reveals that the non-composite test does not suffer from this problem.

Figure 12: Comparison of the type I error rate of a composite and non-composite test, as we vary the lengthscale $l$ of the kernel. The left axis (blue) shows the type I error rate of the two tests. The right axis (orange) shows the estimates of the scale parameter of the model made by the composite test over 2000 repeats. The solid blue horizontal line shows the level of the test, 0.05. The solid orange horizontal line shows the true value of the standard deviation, 1.4. The solid black vertical line shows the lengthscale selected by the median heuristic.

To investigate the issue, the figure also shows the estimates of the scale parameter. For very small lengthscales, the parameter estimates are not centred around the true parameter value, so the test is comparing the data to a poor choice of model from $\{P_\theta\}_\theta$, leading to type I errors. For large lengthscales, we suggest that the type I error rate of the composite test goes to zero faster than that of the non-composite test because the small amount of variance in the estimated parameters makes distinguishing the null and alternative hypotheses more difficult. Importantly, we note that the range of lengthscales for which the composite test has good performance is fairly wide (note the log scale on the x-axis), and the median heuristic selects a value in the middle of this range. The same behaviour can also occur for the MMD test, although, in this case, we have the additional reassurance of theoretical results which show that the test will behave correctly as $n$ becomes large.
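The Gaussian kernel and the median heuristic discussed above (both defined in Appendix D.2) can be sketched in a few lines of NumPy; this is a minimal illustration, and the function names here are ours rather than from our code base.

```python
import numpy as np

def gaussian_kernel(x1, x2, lengthscale):
    """K(x1, x2) = exp(-||x1 - x2||^2 / (2 l^2))."""
    sq_dist = np.sum((x1 - x2) ** 2)
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))

def median_heuristic(ys):
    """l_med = sqrt(median(||y_i - y_j||^2 / 2)) over all pairs i < j."""
    n = ys.shape[0]
    sq_dists = [np.sum((ys[i] - ys[j]) ** 2)
                for i in range(n) for j in range(i + 1, n)]
    return np.sqrt(np.median(sq_dists) / 2.0)
```

By construction, $K(x, x) = 1$ for any $x$, and the median heuristic places the lengthscale on the typical scale of the data, i.e. in the middle of the well-behaved range seen in Figure 12.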
The second limitation we illustrate occurs when there is large variance in the estimate of the parameter, either because the minimum distance estimator has high variance or because the optimisation process fails. Figure 13 demonstrates this limitation for the MMD test, testing $H_0^C$ that $Q = N(\theta, 1^2)$ for some $\theta \in \mathbb{R}$ against the alternative $N(0.5, 1.5^2)$. If the optimiser is configured correctly, the figure shows that the estimates of the parameter are centred around the true parameter value and have low variance, and there is clear separation between the distributions of the test statistic under $H_0^C$ and $H_1^C$. Thus, the test is able to distinguish $H_0^C$ and $H_1^C$ and achieve high power. If the optimiser is configured poorly (we use too large a learning rate), the parameter estimates have higher variance, and the distributions of the test statistic under $H_0^C$ and $H_1^C$ overlap completely. Thus, no matter where the threshold is set, it will not be possible to distinguish the two hypotheses, and the test will have very low power.

Figure 13: Demonstration of a failure mode where the parameter estimate has high variance due to a poorly configured optimiser. The first row shows the distribution of the test statistic (left) and that of the parameter estimate (right) for a well-configured optimiser, and the second row shows the distributions for a poorly configured optimiser.

Figure 14: Power of the KSD test with wild bootstrap, comparing setting the lengthscale using the median heuristic against aggregating tests over 8 different lengthscales. We set $H_0^C : P = N(\mu, \sigma^2)$, with the left and right panels showing data generated from different distributions: $Q_a = \text{Student's } t(\nu)$ for $\nu \in \{1, \ldots, 5\}$ (left) and $Q_b = (1 - \epsilon)N(0, 1) + \epsilon N(5, 1)$ for $\epsilon \in [0, 0.1]$ (right). The horizontal line
shows the level.

Appendix C. Aggregated Composite Test
Figure 14 shows an attempt at combining the aggregated test method introduced by Schrab et al. (2023, 2022) with our composite KSD test, using the wild bootstrap. The aggregated test is the combination of 8 composite tests using the Gaussian kernel, where the lengthscales are set as the {25, 32, 39, 46, 54, 61, 68, 75}th percentiles of the pairwise distances between the observations. The non-aggregated test uses a single lengthscale set using the median heuristic, i.e. the 50th percentile of the pairwise distances. The null hypothesis of the test is $H_0^C : P = N(\mu, \sigma^2)$, for $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$. To evaluate the performance of the test, we consider the two alternatives specified in the figure. Against a Student's t-distribution, we find that the median heuristic has slightly better performance than the aggregated test, while against a mixture of Gaussians the aggregated test has a small advantage. We would expect the aggregated test to have a bigger advantage in the case of the mixture alternative, as here the spread of lengthscales brings a bigger benefit than in the unimodal case of the Student's t alternative. It is somewhat unexpected that the aggregated test has slightly lower performance than the median heuristic against the Student's t alternative. Future work could investigate why this happens, and whether any modifications can be made to the aggregated testing procedure to improve its performance in the composite case.

Appendix D. Experiment Details
For all experiments we use a test level of $\alpha = 0.05$. The experiments are implemented in JAX, and executed on an Nvidia RTX 3090 GPU.

D.1 Closed-Form KSD Estimator
For any $P_\theta$ in the exponential family, we can calculate $\hat\theta_n = \arg\min_\theta \mathrm{KSD}^2(P_\theta, Q_n)$ in closed form. Here we state the expression for this estimator, which we use in our Gaussian and kernel exponential family experiments.
The detailed derivation of this result can be found in Barp et al. (2019, Appendix D3) or Matsubara et al. (2022). Let the density of a model in the exponential family be $p_\theta(x) = \exp(\eta(\theta)^\top t(x) - a(\theta) + b(x))$, with $\eta : \Theta \to \mathbb{R}^k$ an invertible map, $t : \mathbb{R}^d \to \mathbb{R}^k$ any sufficient statistic for some $k \in \mathbb{N}$, $a : \Theta \to \mathbb{R}$ and $b : \mathbb{R}^d \to \mathbb{R}$. Then the KSD estimator is given by
$$\hat\theta_n := \eta^{-1}\left(-\tfrac{1}{2}\Lambda_n^{-1}\nu_n\right),$$
where $\Lambda_n \in \mathbb{R}^{k \times k}$ and $\nu_n \in \mathbb{R}^k$ are defined as $\Lambda_n := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \Lambda(x_i, x_j)$ and $\nu_n := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \nu(x_i, x_j)$. These are based on functions $\Lambda : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{k \times k}$ and $\nu : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^k$ which depend on the specific model, defined as
$$\Lambda(x, x') := K(x, x')\,\nabla_x t(x)\,\nabla_{x'} t(x')^\top$$
and
$$\nu(x, x') := K(x, x')\,\nabla_{x'} t(x')\,\nabla_x b(x) + \nabla_x t(x)\,\nabla_{x'} K(x, x') + K(x, x')\,\nabla_x t(x)\,\nabla_{x'} b(x') + \nabla_{x'} t(x')\,\nabla_x K(x', x),$$
where $\nabla_x t(x) \in \mathbb{R}^{k \times d}$ denotes the Jacobian of $t$.

D.2 Kernels
We define the Gaussian kernel as $K(x_1, x_2) = \exp(-\|x_1 - x_2\|_2^2 / 2l^2)$, and the IMQ kernel as $K(x_1, x_2) = (1 + \|x_1 - x_2\|_2^2 / 2l^2)^{-1/2}$, where $l \in \mathbb{R}_+$ is the lengthscale, and the exponent of $-\tfrac{1}{2}$ follows Matsubara et al. (2022). Where we set the lengthscale using the median heuristic, this is defined as $l_{\mathrm{med}} = \sqrt{\mathrm{median}(\|y_i - y_j\|_2^2 / 2)}$, where the median is taken over all pairs of observations $y_i$ and $y_j$. Where we use a sum kernel, this is defined as $K(x_1, x_2) = \frac{1}{L}\sum_{i=1}^L K_{l_i}(x_1, x_2)$, where $K_{l_1}, \ldots, K_{l_L}$ are a set of $L$ Gaussian or IMQ kernels with lengthscales $l_1, \ldots, l_L$. Prior work has found that using a sum of kernels with a range of lengthscales, rather than attempting to choose the lengthscale of a single kernel, produces good empirical results (Li et al., 2015; Ren et al., 2016; Sutherland et al., 2017). Note that a sum of characteristic kernels is a characteristic kernel (Sriperumbudur et al., 2010), so this choice satisfies the assumptions we make in our theoretical results.

D.3 Details of Specific Figures
D.3.1 Distribution of Wild Bootstrap Statistic (Figure 2)
$H_0^C : P = N(\mu, \sigma^2)$ for $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$, with the estimator configured the same as in the Gaussian experiments below.
We sample observations from $Q = N(0.3, 1.0)$. We use a Gaussian kernel with lengthscale 1.0. To compute the distribution of the test statistic, we take 1000 sets of $n = 1000$ samples from $Q$, and for each we estimate $P_{\hat\theta_n}$ and compute $\mathrm{MMD}^2(P_{\hat\theta_n, n}, Q_n)$. To compute the distribution of the wild bootstrap statistic, we take $n = 1000$ samples from $Q$, estimate $P_{\hat\theta_n}$, sample 1000 sets of Rademacher variables, and thus compute 1000 samples of $\mathrm{MMD}^2_W(P_{\hat\theta_n, n}, Q_n)$.

D.3.2 Gaussian Model (Figures 3 to 7)

repeats to compute power (per seed)   1000
K                                     Gaussian, $l = l_{\mathrm{med}}$
wild bootstrap samples                500
parametric bootstrap samples          300

MMD test details For the experiments using the MMD, we minimise the objective of the minimum distance estimator using the Adam optimiser (Kingma and Ba, 2015). We use a learning rate of 0.05 and 100 iterations. When $\theta = \mu_d$ (Figures 4 to 6) we sample the initial value of $\mu_d$ from $N(0_d, I_d)$. When $\theta = \{\mu, \sigma^2\}$ (Figures 3 and 7) we sample the initial $\mu$ from $N(0, 1)$, and sample $\sigma$ from a standard normal distribution truncated between 0.5 and 1.5.

Timing We estimate the time taken by each bootstrap as follows. We run 3 tests as a warmup, then we run 2000 tests and calculate the mean time taken. The full configuration is as follows:

$H_0^C$             $P_{\mu,\sigma} = N(\mu, \sigma^2)$ for $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_+$
n                   100
D                   KSD
bootstrap samples   300
K                   Gaussian, $l = 1.0$

D.3.3 Toggle Switch (Figures 8 to 10)
The true model parameters, $\theta_0$, are:

parameter    α1     α2     β1    β2    µ      σ     γ
true value   22.0   12.0   4.0   4.5   325.0  0.25  0.15

We use stochastic gradient descent with random restarts to estimate the parameters. The algorithm is:
1. Sample 500 initial parameter values, $\theta^1_{\mathrm{init}}, \ldots, \theta^{500}_{\mathrm{init}}$.
2. Select the 15 $\theta^i_{\mathrm{init}}$ for which $\mathrm{MMD}^2(P_{\theta^i_{\mathrm{init}}}, Q_n)$ is smallest.
3. Run SGD 15 times, once for each of the selected $\theta^i_{\mathrm{init}}$, to find $\hat\theta^1_n, \ldots, \hat\theta^{15}_n$. Use learning rate 0.04 and perform 300 iterations.
4. Return the $\hat\theta^i_n$ for which $\mathrm{MMD}^2(P_{\hat\theta^i_n}, Q_n)$ is smallest.
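The steps above can be sketched as follows. This is a minimal NumPy illustration only: a generic `loss` callable stands in for $\theta \mapsto \mathrm{MMD}^2(P_\theta, Q_n)$, plain gradient descent with finite-difference gradients stands in for SGD through the simulator, and all names and defaults other than the restart counts, learning rate, and iteration count are ours.

```python
import numpy as np

def estimate_with_restarts(loss, sample_init, n_init=500, n_keep=15,
                           lr=0.04, n_steps=300, eps=1e-4):
    """Random-restart gradient descent, following steps 1-4 above.

    `loss` stands in for theta -> MMD^2(P_theta, Q_n); `sample_init`
    draws one initial parameter vector. Both are placeholders here.
    """
    # 1. Sample candidate initialisations, and 2. keep the best few.
    inits = [sample_init() for _ in range(n_init)]
    inits.sort(key=loss)
    best = None
    for theta in inits[:n_keep]:
        theta = theta.copy()
        # 3. Descend the loss from this initialisation, using central
        #    finite differences in place of the simulator's gradients.
        for _ in range(n_steps):
            grad = np.array([
                (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(theta.size)
            ])
            theta -= lr * grad
        # 4. Return the run with the smallest final loss.
        if best is None or loss(theta) < loss(best):
            best = theta
    return best
```

The point of the pre-selection in step 2 is to spend the expensive optimisation budget only on initialisations that already have a small discrepancy, which matters when each loss evaluation requires simulating from the model.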
The initial parameters are sampled from uniform distributions with the following ranges:

parameter      α1      α2      β1     β2     µ       σ      γ
lower bound    0.01    0.01    0.01   0.01   250.0   0.01   0.01
upper bound    50.00   50.00   5.00   5.00   450.0   0.50   0.40

To run the test, we use the following configuration:

K                   unweighted mixture of Gaussian kernels with lengthscales 20.0, 40.0, 80.0, 100.0, 130.0, 200.0, 400.0, 800.0, 1000.0
n                   400
bootstrap           parametric
bootstrap samples   200

To create Figure 10 we apply kernel density estimation to samples from the model. We use a Gaussian kernel, with the bandwidth set to 0.1.

D.3.4 Nonparametric Density Estimation (Figure 11)
We are testing whether the family of models considered by Matsubara et al. (2022) fits the data; thus our choices of $q_{\mathrm{kef}}$ and $l_{\mathrm{kef}}$ match the original paper.

K                        unweighted sum of IMQ kernels with lengthscales 0.6, 1.0, 1.2
n                        82
wild bootstrap samples   500
$q_{\mathrm{kef}}$       $N(0.0, 3.0^2)$
$l_{\mathrm{kef}}$       following Matsubara et al. (2022)

Following Matsubara et al. (2022), we normalise each observation in the data set as $y_i^{\mathrm{norm}} = \frac{y_i - \mu}{0.5\,\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the data set. To plot the density of the model, we estimate the normalising constant of the model using the importance sampling approach suggested by Wenliang et al. (2019, Section 3.2). As a proposal we use $N(0.0, 3.5^2)$, and we take 2000 samples.

References
A. Anastasiou, A. Barp, F.-X. Briol, B. Ebner, R. E. Gaunt, F. Ghaderinezhad, J. Gorham, A. Gretton, C. Ley, Q. Liu, L. Mackey, C. J. Oates, G. Reinert, and Y. Swan. Stein's method meets computational statistics: A review of some recent developments. Statistical Science, 2023.
T. W. Anderson and D. A. Darling. A test of goodness of fit. Journal of the American Statistical Association, 1954.
K. Balasubramanian, T. Li, and M. Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 2021.
Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey.
Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, 2019.
J. Beirlant, L. Györfi, and G. Lugosi. On the asymptotic normality of the L1- and L2-errors in histogram density estimation. The Canadian Journal of Statistics, 1994.
J. Beirlant, L. Devroye, L. Györfi, and I. Vajda. Large deviations of divergence measures on partitions. Journal of Statistical Planning and Inference, 2001.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2004.
Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), April 2019.
J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1974.
S. Betsch, B. Ebner, and B. Klar. Minimum Lq-distance estimators for non-normalized parametric models. Canadian Journal of Statistics, 2020.
Ayush Bharti, François-Xavier Briol, and Troels Pedersen. A general method for calibrating stochastic radio channel models with kernels. IEEE Transactions on Antennas and Propagation, 2021.
Ayush Bharti, Masha Naslidnyk, Oscar Key, Samuel Kaski, and François-Xavier Briol. Optimally-weighted estimators of the Maximum Mean Discrepancy for likelihood-free inference. In International Conference on Machine Learning, 2023.
M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
Fernando V. Bonassi and Mike West. Sequential Monte Carlo with adaptive weights for approximate Bayesian computation. Bayesian Analysis, 2015.
Fernando V. Bonassi, Lingchong You, and Mike West. Bayesian learning from marginal data in bionetwork models. Statistical Applications in Genetics and Molecular Biology, 2011.
W.
Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. A test of relative similarity for model selection in generative models. In International Conference on Learning Representations, 2016.
François-Xavier Briol, Alessandro Barp, Andrew B. Duncan, and Mark Girolami. Statistical inference for generative models with Maximum Mean Discrepancy. arXiv:1906.05944, 2019.
Stéphane Canu and Alexander J. Smola. Kernel methods and the exponential family. Neurocomputing, 2006.
Ying-Xia Chen and Wei-Jun Zhu. Note on the strong law of large numbers in a Hilbert space. Gen. Math., 2011.
B.-E. Chérief-Abdellatif and P. Alquier. MMD-Bayes: Robust Bayesian estimation via Maximum Mean Discrepancy. In 2nd Symposium on Advances in Approximate Bayesian Inference, 2020.
Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. Bernoulli, 2022.
Kacper Chwialkowski, Dino Sejdinovic, and Arthur Gretton. A wild bootstrap for degenerate kernel tests. In Neural Information Processing Systems, 2014.
Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, 2016.
K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences of the United States of America, 2020.
Ralph D'Agostino and E. S. Pearson. Tests for departure from normality. Empirical results for the distributions of $b_2$ and $\sqrt{b_1}$. Biometrika, 1973.
A. P. Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 2007.
C. Dellaporta, J. Knoblauch, T. Damoulas, and F.-X. Briol. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In International Conference on Artificial Intelligence and Statistics, 2022.
J. Durbin.
Kolmogorov-Smirnov tests when parameters are estimated with applications to tests of exponentiality and tests on spacings. Biometrika, 1975.
G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via Maximum Mean Discrepancy optimization. In Uncertainty in Artificial Intelligence, 2015.
Bruno Ebner. On combining the zero bias transform and the empirical characteristic function to test normality. Latin American Journal of Probability and Mathematical Statistics, 2021.
T. Fernández and A. Gretton. A Maximum-Mean-Discrepancy goodness-of-fit test for censored data. In Artificial Intelligence and Statistics, 2019.
Tamara Fernández and Nicolás Rivera. A general framework for the analysis of kernel-based tests. arXiv:2209.00124, 2022.
Tamara Fernández, Nicolás Rivera, Wenkai Xu, and Arthur Gretton. Kernelized Stein discrepancy tests of goodness-of-fit for time-to-event data. In International Conference on Machine Learning, 2020.
K. Fukumizu. Exponential manifold by reproducing kernel Hilbert spaces. In Algebraic and Geometric Methods in Statistics. 2009.
Kenji Fukumizu, Arthur Gretton, Gert Lanckriet, Bernhard Schölkopf, and Bharath K. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, 2009.
W. Gong, Y. Li, and J. M. Hernández-Lobato. Sliced Kernelized Stein Discrepancy. In International Conference on Learning Representations, 2021.
W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, and R. Zemel. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In International Conference on Machine Learning, 2020.
Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 2007.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 2012.
L. Györfi and I. Vajda. Asymptotic distributions for goodness-of-fit statistics in a sequence of multinomial models. Statistics and Probability Letters, 2002.
L. Györfi and E. C. van der Meulen. A consistent goodness-of-fit test based on the total variation distance. In Nonparametric Functional Estimation and Related Topics. 1991.
N. Henze and B. Zirkler. A class of invariant consistent tests for multivariate normality. Communications in Statistics - Theory and Methods, 1990.
Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. John Wiley & Sons, 2011.
A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2006.
M. D. Jiménez-Gamero, V. Alba-Fernández, J. Muñoz García, and Y. Chalco-Cano. Goodness-of-fit tests based on empirical characteristic functions. Computational Statistics & Data Analysis, 2009.
W. Jitkrittum, P. Sangkloy, B. Schölkopf, H. Kanagawa, J. Hays, and A. Gretton. Informative features for model comparison. In Neural Information Processing Systems, 2018.
Wittawat Jitkrittum, Heishiro Kanagawa, and Bernhard Schölkopf. Testing goodness of fit of conditional density models with kernels. In Conference on Uncertainty in Artificial Intelligence, 2020.
Ana Justel, Daniel Peña, and Rubén Zamar. A multivariate Kolmogorov-Smirnov test of goodness of fit. Statistics & Probability Letters, 1997.
H. Kanagawa, W. Jitkrittum, L. Mackey, K. Fukumizu, and A. Gretton. A kernel Stein test for comparing latent variable models. arXiv:1907.00586, 2019.
Jérémie Kellner and Alain Celisse. A one-sample test for normality with kernel methods. Bernoulli, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In International Conference on Learning Representations, 2015.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
I. Kojadinovic and J. Yan. Goodness-of-fit testing based on a weighted bootstrap: A fast large-sample alternative to the parametric bootstrap. Canadian Journal of Statistics, 2012.
A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 1933.
I. A. Koutrouvelis and J. Kellermeier. A goodness-of-fit test based on the empirical characteristic function when parameters must be estimated. Journal of the Royal Statistical Society Series B (Statistical Methodology), 1981.
Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. 3rd edition, 2005.
Anne Leucht and Michael H. Neumann. Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis, 2013.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. In International Conference on Machine Learning, 2015.
Hubert W. Lilliefors. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 1967.
J. N. Lim, M. Yamada, B. Schölkopf, and W. Jitkrittum. Kernel Stein tests for multiple model comparison. In Neural Information Processing Systems, 2019.
Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, 2016.
X. Liu and F.-X. Briol. On the robustness of kernel goodness-of-fit tests. arXiv:2408.05854, 2024.
James Robert Lloyd and Zoubin Ghahramani. Statistical model criticism using kernel two sample tests.
In Advances in Neural Information Processing Systems, 2015.
T. Matsubara, J. Knoblauch, François-Xavier Briol, and C. J. Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society B: Statistical Methodology, 2022.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 1997.
J. Neyman. A Selection of Early Statistical Papers of J. Neyman, chapter 28. University of California Press, 1967.
Richard Nickl. Statistical theory, 2012.
Z. Niu, J. Meier, and François-Xavier Briol. Discrepancy-based inference for intractable generative models using quasi-Monte Carlo. arXiv:2106.11561, 2021.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 2017.
W. C. Parr and W. R. Schucany. Minimum distance and robust estimation. Journal of the American Statistical Association, 1980.
M. Postman, J. P. Huchra, and M. J. Geller. Probes of large-scale structure in the Corona Borealis region. The Astronomical Journal, 1986.
Yong Ren, Jialian Li, Yucen Luo, and Jun Zhu. Conditional generative moment-matching networks. In Advances in Neural Information Processing Systems, 2016.
G. Robins, P. Pattison, Y. Kalish, and D. Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 2007.
Kathryn Roeder. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 1990.
A. Schrab and I. Kim. Robust kernel hypothesis testing under data corruption. arXiv:2405.19912, 2024.
Antonin Schrab, Benjamin Guedj, and Arthur Gretton. KSD aggregated goodness-of-fit test. In Advances in Neural Information Processing Systems, 2022.
Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Benjamin Guedj, and Arthur Gretton.
MMD aggregated two-sample test. Journal of Machine Learning Research, 2023.
D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 2013.
A. Sen and B. Sen. Testing independence and goodness-of-fit in linear models. Biometrika, 2014.
Xiaofeng Shao. The dependent wild bootstrap. Journal of the American Statistical Association, 2010.
Samuel Sanford Shapiro and Martin B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 1965.
B. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 2017.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 2010.
Winfried Stute, Wenceslao González Manteiga, and Manuel Presedo Quindimil. Bootstrap based goodness-of-fit-tests. Metrika, 1993.
Danica J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander J. Smola, and Arthur Gretton. Generative models and model criticism via optimized Maximum Mean Discrepancy. In International Conference on Learning Representations, 2017.
G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 2005.
Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 2013.
Li Wenliang, Danica J. Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, 2019.
Li K. Wenliang and Heishiro Kanagawa. Blindness of score-based methods to isolated components and mixing proportions. arXiv:2008.10087, 2021.
G. Wolfer and P. Alquier.
Variance-aware estimation of kernel mean embedding. arXiv:2210.06672, 2022.
J. Wolfowitz. The minimum distance method. The Annals of Mathematical Statistics, 1957.
G. Wynne and S. Nagy. Statistical depth meets machine learning: Kernel mean embeddings and depth in functional data analysis. arXiv:2105.12778, 2021.
G. Wynne, M. Kasprzak, and A. B. Duncan. A spectral representation of kernel Stein discrepancy with application to goodness-of-fit tests for measures on infinite dimensional Hilbert spaces. arXiv:2206.04552, 2022.
W. Xu and T. Matsuda. A Stein goodness-of-fit test for directional distributions. In Artificial Intelligence and Statistics, 2020.
W. Xu and G. Reinert. A Stein goodness of fit test for exponential random graph models. In Artificial Intelligence and Statistics, 2021.
J. Yang, Q. Liu, V. Rao, and J. Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, 2018.
J. Yang, V. Rao, and J. Neville. A Stein-Papangelou goodness-of-fit test for point processes. In International Conference on Artificial Intelligence and Statistics, 2019.
S. Zhu, B. Chen, P. Yang, and Z. Chen. Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of fit. In International Conference on Artificial Intelligence and Statistics, 2019.