# on_the_robustness_of_kernel_goodnessoffit_tests__f875db2a.pdf

Journal of Machine Learning Research 26 (2025) 1-72 Submitted 8/24; Published 10/25

On the Robustness of Kernel Goodness-of-Fit Tests

Xing Liu xingliu97@outlook.com Quant Co

François-Xavier Briol f.briol@ucl.ac.uk Department of Statistical Science University College London

Editor: Bharath Sriperumbudur

Goodness-of-ﬁt testing is often criticized for its lack of practical relevance: since all models are wrong , the null hypothesis that the data conform to our model is ultimately always rejected as the sample size grows. Despite this, probabilistic models are still used extensively, raising the more pertinent question of whether the model is good enough for the task at hand. This question can be formalized as a robust goodness-of-ﬁt testing problem by asking whether the data were generated from a distribution that is a mild perturbation of the model. In this paper, we show that existing kernel goodness-of-ﬁt tests are not robust under common notions of robustness including both qualitative and quantitative robustness. We further show that robustiﬁcation techniques using tilted kernels, while eﬀective in the parameter estimation literature, are not suﬃcient to ensure both types of robustness in the testing setting. To address this, we propose the ﬁrst robust kernel goodness-of-ﬁt test, which resolves this open problem by using kernel Stein discrepancy (KSD) balls. This framework encompasses many well-known perturbation models, such as Huber s contamination and density-band models. Keywords: robustness, hypothesis testing, kernel methods, Stein s method

1. Introduction

Goodness-of-ﬁt (GOF) testing (D Agostino, 1986; Lehmann and Romano, 2022, Chapter 16) tackles the question of how well a given probabilistic model describes some observed data. More formally, given a model P and observations drawn independently from some distribution Q, GOF testing compares the null hypothesis H0 : Q = P against the alternative hypothesis H1 : Q = P. This process is fundamental to validating a model before it is used for predictions, decision-making, or providing probabilistic guarantees on important quantities of interest. Indeed, GOF testing is closely linked to the concept of statistical signiﬁcance, which now permeates almost all areas of science and engineering. In this paper, we focus exclusively on kernel-based GOF tests (Chwialkowski et al., 2016; Liu et al., 2016). Unlike most classical alternatives such as likelihood-ratio tests (Hogg et al., 1977, Chapter 8.7) and Kolmogorov-Smirnov tests (Kolmogorov, 1933), kernel GOF tests can be used for models whose density function is known only up to a multiplicative constant. This is a signiﬁcant advantage since it enables their use for models which are beyond the reach of other tests, including many modern ﬂexible density estimation models, energy-based models, large graphical models, and Bayesian posteriors. Kernel GOF tests achieve this through

c 2025 Xing Liu and François-Xavier Briol.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1365.html.

Liu and Briol

a tractable test statistic based on the score function (i.e., the gradient of the log-density), which is widely available since it can be evaluated without knowledge of the normalization constant of the density. This property has made kernel GOF tests popular and has led to numerous extensions specializing the approach to problems involving time-to-event data (Fernandez et al., 2020), discrete data (Yang et al., 2018), point processes (Yang et al., 2019), manifold data (Xu and Matsuda, 2020), graphs (Xu and Reinert, 2021), protein structures (Amin et al., 2023), and text documents of variable lengths (Baum et al., 2023).

A signiﬁcant drawback of GOF testing is the following conundrum: in almost all real-world applications, models are wrong, causing GOF tests to ultimately reject the null hypothesis as the sample size grows. However, there is often a diﬀerence between statistical signiﬁcance from a GOF perspective, and the practical relevance of the model. For instance, data corruption may arise in signal processing due to sensor failures (Rizzoni and Min, 1991; Sharma et al., 2010), in classiﬁcation tasks due to mislabelled entities (Frénay and Verleysen, 2013), and in radio systems due to impulsive noise (Blackard et al., 1993). In these cases, H0 may be rejected even though P is almost correct up to these mild perturbations, and thus it could still be useful for downstream tasks such as prediction or probabilistic inference.

This conundrum has led to the development of robust tests (Huber, 1965; Le Cam, 1973; Dabak and Johnson, 1994; Fauß et al., 2021), which aim to control the Type-I error rate (the probability of falsely rejecting the null hypothesis) under a composite null hypothesis HC 0 : Q P0. The set P0 is a family of probability distributions, called the uncertainty set (Fauß et al., 2021), which is constructed to include P and distributions similar to P in some appropriate notion. By designing P0 to contain potential contaminations, such as adversarial contaminations or outliers, the composite null hypothesis HC 0 can now hold in realistic scenarios, thus avoiding the aforementioned conundrum and aligning statistical and practical signiﬁcance.

As a concrete example, consider a scenario where a practitioner models a data set of sensor signals that are potentially contaminated by outliers. Suppose the ﬁtted model accurately captures the true signal in the data set. A standard GOF test might still reject the standard null hypothesis H0 : Q = P due to the presence of outliers. Instead, one can adopt a composite null hypothesis HC 0 : Q P0, where P0 is constructed to include contaminated versions of P within a certain tolerance. A GOF test designed to be calibrated against HC 0 can now reject the model with the correct test level.

Unfortunately, robust kernel-based GOF tests have only received limited attention in the existing literature. The closest work is that of Sun and Zou (2023); Schrab and Kim (2024), who proposed robust tests using uncertainty sets deﬁned by the Maximum Mean Discrepancy (MMD; Müller 1997; Gretton et al. 2012). These tests require samples from both Q and P. However, generating samples from P may be infeasible or computationally expensive, and approximating P by ﬁnite samples can also reduce statistical eﬃciency.

We propose a novel class of robust kernel GOF tests based on the kernel Stein discrepancy (KSD). These tests are appropriate in the one-sample setting, where we assume samples are available only from Q, and an unnormalized density is available for P. To guarantee robustness against mild perturbations, we use tilted kernels proposed in Barp et al. (2019); Matsubara et al. (2022), in a related construction for robust parameter estimation. We then study qualitative and quantitative robustness, two notions of robustness which capture a

On the Robustness of Kernel Goodness-of-Fit Tests

Goodness-of-Fit Test Qualitative robustness Quantitative Robustness Result

Existing KSD test + stationary kernels Theorem 1 Existing KSD test + tilted kernels Theorem 3 Novel KSD tests proposed in this paper Theorem 4

Table 1: Summary of theoretical contributions to the robustness of kernel GOF tests provided in this paper.

test s insensitivity to perturbations around P, as formally introduced in Section 2.3. Our contributions are summarized in Table 1 and detailed below:

1. In Section 3, we study the robustness properties of existing KSD tests. We show that KSD tests with stationary kernels are not necessarily robust under inﬁnitesimal model deviation (which we call qualitative robustness; see Deﬁnition 1), and never robust against ﬁxed classes of misspeciﬁed models (called quantitative robustness; see Deﬁnition 2). We then show that qualitative robustness can be guaranteed by using appropriately tilted kernels, but this is not suﬃcient to guarantee quantitative robustness.

2. In Section 4, we propose a kernel GOF test that controls the Type-I error rate against all distributions in a KSD-ball of radius θ > 0 centered at P. This test is straightforward to implement since it is identical to existing KSD tests up to a shift of the test statistic by θ. We then make recommendation on how to select θ in practice to ensure robustness to common perturbations such as Huber s contaminations (which includes outliers) or density-band contaminations.

2. Background

We now review the necessary background for this work, covering kernel Stein discrepancy and robust GOF testing. We begin by brieﬂy summarizing our notation. Let P(Rd) denote the set of probability measures on Rd. Given an integer r > 0, Cr is the set of functions f : Rd R that are r-times continuously diﬀerentiable. The gradient of f C1 is denoted f(x) = ( 1f(x), . . . , df(x)) . The set C1 b contains functions in C1

that are bounded and with bounded gradient, and the set C2 b contains f C1 b for which supx Rd | f(x)| < , where f(x) = Pd j=1 j jf(x). For a function k : Rd Rd R with two arguments, ik(x, x ) will be used to denote its gradient with respect to the i-th argument respectively for i = 1, 2, and we deﬁne 1 2k(x, x ) = Pd j=1 1j 2jk(x, x ), where

ijk(x, x ) = [ ik(x, x )]j for i = 1, 2 and j = 1, . . . , d. The set C(1,1) (resp., C(1,1) b ) denotes all functions k : Rd Rd R with (x, x ) 7 1j 2jk(x, x ) continuous (resp., continuous and bounded) for all j = 1, . . . , d.

2.1 The Kernel Stein Discrepancy

A key ingredient for kernel GOF tests is the kernel Stein discrepancy (KSD, Chwialkowski et al., 2016; Liu et al., 2016; Gorham and Mackey, 2017; Oates et al., 2017). Throughout, we

Liu and Briol

assume P, Q P(Rd), and P admits a Lebesgue density function p C1 such that p > 0 on Rd. Let k C(1,1) be a scalar-valued reproducing kernel with associated reproducing kernel Hilbert space (RKHS, Berlinet and Thomas-Agnan, 2004) denoted by Hk. The Langevin KSD gives a notion of discrepancy between Q and P and is deﬁned as

D(Q, P) := sup f B

EX Q[(Apf)(X)] = sup f B

Rd(Apf)(x)Q(dx) ,

where B := {h = (h1, . . . , hd) : hj Hk and Pd j=1 hj 2 Hk 1} is the unit ball in the d-times Cartesian product Hd k of the RKHS Hk. The operator Ap is a linear operator called the Langevin Stein operator and it maps (suitably regular) vector-valued functions f to scalar-valued ones via (Apf)(x) := sp(x) f(x) + f(x), where sp(x) := x log p(x) is known as the (Hyvärinen) score function of P (Hyvärinen, 2005). Throughout this paper, we will shorten Langevin KSD to KSD for brevity, but note that other Stein operators can also be used to construct KSDs; see Anastasiou et al. (2023) for a review. When EX P [ sp 2] < , the squared KSD admits the following double-integral form (Barp et al., 2024, Corollary 1; Gorham and Mackey, 2017)

D2(Q, P) = EX,X Q[up(X, X )] = Z

Rd up(x, x )Q(dx)Q(dx ) ,

up(x, x ) := sp(x) sp(x )k(x, x ) + sp(x) 2k(x, x ) (1a)

+ 1k(x, x ) sp(x ) + 1 2k(x, x ) , (1b)

whenever D2(Q, P) is well-deﬁned. The function up is also a reproducing kernel, known as the Stein (reproducing) kernel. The main advantage of the KSD is that it is computable even if p is unnormalized. Indeed, up depends on p only through sp, which does not depend on the normalizing constant of p (as the constant is cancelled due to the diﬀerentiation). The KSD can be straightforwardly estimated at O(n2) cost using a V-statistic estimator. It is formed by replacing Q with its empirical version Qn based on independent observations Xn := {Xi}n i=1 drawn from Q:

D2(Xn) := D2(Qn, P) = 1

j=1 up(Xi, Xj) , (2)

where we have overloaded the notation by writing D2 : (Rd)n [0, ) as a test statistic. The V-statistic is non-negative, biased but consistent. An alternative estimator is a U-statistic, which diﬀers from (2) by summing only over disjoint index pairs i = j. Although a U-statistic is unbiased, it can take negative values. Therefore, we focus on V-statistics in this paper, and leave a discussion on how to extend our results to U-statistics to Section C. When the kernel k is characteristic, the KSD is P-separating, meaning that D(Q, P) 0, with equality if and only if Q = P for all Q that ﬁnitely integrates sp 2; see, e.g., Barp et al. (2024, Theorem 3). Most characteristic kernels used in kernel GOF tests are suﬃciently smooth stationary kernels of the form k(x, x ) = h(x x ), where h C2 b and h(0) > 0. These include the squared-exponential kernel k(x, x ) = exp( x x 2/(2λ2)) and the inverse

On the Robustness of Kernel Goodness-of-Fit Tests

multi-quadric (IMQ) kernel k(x, x ) = (1 + x x 2 2/λ2) b, where b > 0, and λ > 0 is a hyperparameter known as the bandwidth. The IMQ kernel is often preferred in practice since the resulting KSD has the desirable property of P-convergence control (Gorham and Mackey, 2017; Barp et al., 2024). The bandwidth λ also plays a crucial role in the performance of kernel-based tests (Reddi et al., 2015; Ramdas et al., 2015; Huang et al., 2023), and a median-heuristic (Fukumizu et al., 2009, Section 5) is often used to select its value in practice (Liu et al., 2016; Chwialkowski et al., 2016).

2.2 Kernel Goodness-of-Fit Testing

KSD is a natural choice of test statistic for GOF testing, since, by the P-separation property of KSD, testing H0 : Q = P is equivalent to testing whether D(Q, P) = 0. As KSD is non-negative, A GOF test can therefore be constructed whereby H0 is rejected for large values of the KSD estimate (Chwialkowski et al., 2016; Liu et al., 2016). KSD tests have been used for GOF testing with unnormalized models in a wide range of applications and data structures (Yang et al., 2018; Fernandez et al., 2020; Xu and Reinert, 2021; Xu and Matsuda, 2021; Amin et al., 2023). Extensions have also been developed to address its limitations, such as high computational cost (Jitkrittum et al., 2017; Huggins and Mackey, 2018), diﬃculty in bandwidth selection (Schrab et al., 2022) and lack of test power against certain alternatives (Liu et al., 2023). A signiﬁcant practical challenge with KSD tests is determining an appropriate decision threshold. A valid threshold can be obtained by considering the quantiles of the null distribution of D2(Xn), but since this distribution is not usually tractable, bootstrapping is often employed to estimate it. One approach is to compute the empirical quantiles of bootstrap samples of the form

D2 W(Xn) = 1 n2

j=1 (Wi 1)(Wj 1)up(Xi, Xj) , (3)

where W := (W1, . . . , Wn) Multinomial(n; 1/n, . . . , 1/n). This procedure, called Efron s bootstrap or weighted bootstrap (Arcones and Gine, 1992; Janssen, 1994), assigns a multinomial weight to each observation, which mimics recomputing KSD using data sampled with replacement from Xn. Other bootstrap methods such as wild bootstraps (Leucht and Neumann, 2013; Shao, 2010) are also viable, particularly when the samples are potentially correlated (Chwialkowski et al., 2016), but we focus on the weighted bootstrap since we ﬁnd that they perform similarly in our setting with independent samples; see Section B.8. The test threshold is selected as the (1 α)-quantile of the distribution of D2 W(Xn) conditional on Xn, i.e.,

q2 ,1 α(Xn) := inf u R : 1 α Pr W D2 W(Xn) u | Xn . (4)

In practice, this quantile is approximated with Monte-Carlo estimation by ﬁrst drawing B independent copies {Wb}B b=1 of W, and then computing

q2 B,1 α(Xn) := inf

u R : 1 α 1 B + 1

1{D2(Xn) u} +

b=1 1{D2 Wb(Xn) u} )

Liu and Briol

where 1{A} denotes the indicator function for the event A. The KSD test then rejects H0 if D2(Xn) > q2 B,1 α(Xn). This test is asymptotically well-calibrated for H0 : Q = P (Liu et al., 2016, Theorem 4.3). When KSD is P-separating, this test is also consistent, meaning that, whenever Q = P, the probability of rejection approaches one as the sample size n (Chwialkowski et al., 2016; Liu et al., 2016).

2.3 Robustness for Goodness-of-Fit Testing

Robustness of GOF tests refers to the lack of sensitivity of the test outcome to small model deviations (Rieder, 1982; Lambert, 1982). Model deviations can be formalized as a neighborhood P0 around a nominal distribution P. The neighborhood P0 is often called an uncertainty set, and encodes the practitioner s uncertainty on P. One popular example for P0 is Huber s contamination model (Huber, 1964, 1965)

P(P; ϵ) := {(1 ϵ )P + ϵ R : 0 ϵ ϵ, R P(Rd)} , (6)

where ϵ [0, 1] is the maximal contamination ratio, and the probability measure R acts as arbitrary contamination. This model is appropriate when practitioners believe that P accurately describes all but a small proportion of the data. Huber s models have been studied in the context of robust GOF testing including Huber (1965); Qin and Priebe (2017), as well as in robust estimation (Hampel, 1974; Huber and Ronchetti, 2011). Special interests lie in the case when R is restricted to point masses, i.e., R = δz for some z Rd, where robustness is often called bias-robustness (Huber and Ronchetti, 2011). Beyond Huber s models, P0 can also be chosen as density-band models (Kassam, 1981), which assume the density of the data-generating distribution is close to the model density up to a small error (see Section 4.2). Moreover, P0 can be set to a ball deﬁned via a statistical divergence or metric, such as Hellinger distance (Le Cam, 1973), Wasserstein metric (Gao et al., 2018), and Maximum Mean Discrepancy (Sun and Zou, 2023). Uncertainty sets P0 provide a framework for assessing the robustness of a GOF test, by studying the rejection probability when Q deviates from the nominal distribution P but remains within P0. In the robust testing literature, a common approach is to consider a sequence {Pn 0 } n=1 of uncertainty sets with shrinking size as the sample size n increases. This is formalized in the following notion of qualitative robustness, inspired by Rieder (1982).

Deﬁnition 1 (Qualitative robustness to a sequence of neighborhood). Let P1 0 P2 0 . . . be a sequence of subsets of P(Rd) that contain P and such that n=1Pn 0 = {P}. For any positive integer n, let Tn : Rd n R be a test statistic where a large value of Tn suggests deviation from the null H0 : Q = P, and let γn : Rd n R be a function that computes the decision threshold. A sequence of hypothesis tests (indexed by n) that reject H0 when Tn > γn is qualitatively robust to {Pn 0 } n=1 if, as n ,

sup Q Pn 0 Pr Xn Q Tn(Xn) > γn(Xn) Pr X n P Tn(X n) > γn(X n) 0 . (7)

Intuitively, this notion of robustness asks how sensitive the test outcome is under inﬁnitesimally small model deviation. The shrinking-size condition ensures that the rejection probability under any Q = P does not trivially approach one due to the consistency of the test. By choosing the Pn 0 in Deﬁnition 1 to be Prokhorov balls (Prokhorov, 1956), we would

On the Robustness of Kernel Goodness-of-Fit Tests

recover the qualitative robustness introduced in Rieder (1982, Deﬁnition 2.1), which parallels the conventional notion of qualitative robustness for estimators (Rieder, 1982, Remark 2). However, distributions in Prokhorov balls do not have a simple form, posing challenges to the analysis. Our deﬁnition extends it to a general sequence of neighborhood. In Section 3, we will choose Pn 0 to be Huber s contamination models (6), which both signiﬁcantly simpliﬁes the analysis and encompasses a wide range of relevant scenarios. We will show that, within Huber s neighborhood, the standard KSD test is not necessarily qualitatively robust with stationary kernels, but is qualitatively robust with appropriately tilted kernels. Moreover, our deﬁnition of qualitative robustness is tied to a sequence of shrinking neighborhood {P n 0 } n=1. Clearly, if a test is qualitatively robust to a sequence {P n 0 } n=1, then it is also qualitatively robust to any sequences that decay faster. Therefore, loosely speaking, the rate of decay of {P n 0 } n=1 characterizes the degree of qualitative robustness. In Section 3, we will explicitly derive the rate required for the standard KSD test to retain qualitative robustness. However, qualitative robustness has its own limitations. It only concerns the insensitivity of a test to suﬃciently small model deviations, but oﬀers no guarantees for deviations of a ﬁxed size. The latter scenario is more practically pertinent, as practitioners often need to account for a speciﬁc form of model misspeciﬁcation and require the test to remain well-calibrated under that level of uncertainty. This can be formalized by relaxing the point null hypothesis H0 to a composite hypothesis HC 0 : Q P0, and requiring calibration under HC 0 . This motivates the following notion of quantitative robustness.

Deﬁnition 2 (Quantitative robustness to a single neighborhood). Given α (0, 1) and P0 P(Rd), a test is quantitatively robust to P0 at level α if its rejection probability under any Q P0 does not exceeds α.

Quantitatively robust GOF tests have been developed for various types of uncertainty sets P0, including Huber s contamination model (Huber, 1965) and neighborhoods deﬁned by Kullback-Leibler divergence (Levy, 2008; Yang and Chen, 2018) or α-divergence (Gül and Zoubir, 2016). These tests enjoy minimax optimality but require the normalizing constant of P to be known, thus not applicable to unnormalized models. Two-sample tests that are quantitatively robust to Hellinger distance (Le Cam, 1973), Wasserstein distance (Gao et al., 2018), and Maximum Mean Discrepancy (Sun and Zou, 2023; Gao et al., 2021) have also been proposed. However, to be used for GOF testing, they require approximation of P by ﬁnite samples, which incurs extra approximation error and can be computationally demanding due to the non-trivial task of sampling from P. In Section 4, we will choose P0 to be a KSD ball centered at P. We choose KSD balls over other types of neighborhood because (i) this choice naturally lends itself to a GOF test that is both easy to implement and guarantees robustness at little extra computational cost, and (ii) it is easy to select the radius of such balls so that our proposed test is quantitatively robust to various types of contamination of interest, such as Huber s contaminations or density-band contaminations. We conclude this section with a brief comparison between qualitative and quantitative robustness. Qualitative robustness concerns the limiting behavior of a test against some speciﬁc class of local alternatives. Thus, it is closely tied to the minimax separation boundary of a test (Ingster, 1987, 1993). Moreover, it oﬀers a notion of degree of robustness of a

Liu and Briol

given test via the decay rate of the uncertainty neighbouhood sequence {Pn 0 } n=1. In contrast, quantitative robustness cannot be used for this purpose, because any consistent test is not quantitatively robust by deﬁnition. However, quantitative robustness has more practical relevance, as it concerns a ﬁxed composite null set, which can be explicitly constructed to enforce robustness. This is why we have introduced two diﬀerent notions of robustness.

3. The (Lack of) Robustness of Existing Kernel Goodness-of-ﬁt Tests

We will now study the robustness of standard KSD GOF tests using stationary kernels in Section 3.1, and then their tilted counterparts which are popular in the parameter estimation literature in Section 3.2.

3.1 Existing KSD Tests with Stationary Kernels are not Qualitatively Robust

Our ﬁrst result states that contamination of Huber s type can considerably aﬀect the probability of the standard KSD test rejecting the null hypothesis H0 : Q = P. Our result holds with the bootstrap threshold deﬁned in (4), and all probabilities are taken over the randomness of both the sample and the bootstrap weights W Multinomial(n; 1/n, . . . , 1/n). The proof is in Section A.1.

Theorem 1. Assume EX P [ sp(X) 4 2] < , the function x 7 sp(x) 2 is unbounded, k(x, x ) = h(x x ) with h C2 b and h(0) > 0, and assume the integrability conditions

sup z Rd sp(z) 4 2EX P h(X z)4 < and sup z Rd sp(z) 2 2EX P h(X z) 2 2 < .

Then, for any test level α (0, 1) and any sequence {ϵn} n=1 with ϵn = o(n) 1, the following holds as n ,

sup Q P(P;ϵn)

Pr Xn Q,W D2(Xn) > q2 ,1 α(Xn) Pr X n P,W D2(X n) > q2 ,1 α(X n)

where P(P; ϵn) is the Huber s contamination model deﬁned in (6).

Theorem 1 immediately implies that the standard KSD test is not qualitatively robust to any sequence of Huber s models that satisﬁes the rate condition ϵn = o(n) 1. At ﬁrst glance, the lack of qualitative robustness might seem trivial given that the KSD test is consistent, meaning that the rejection probability under any Q = P approaches one as n . However, one subtlety is that the set P(P; ϵn) can also shrink in size, since ϵn is allowed to decay. The condition ϵn = o(n) 1 imposes a lower bound on this decay rate. Intuitively, this rate condition requires ϵnn, the expected number of contamination in the sample, to grow with n. In particular, this excludes the case where ϵnn is a constant. Moreover, since Huber s models P(P; ϵn) are contained within Prokhorov balls, Theorem 1 immediately implies non-qualitative robustness to Prokhorov neighborhoods.

Remark 1 (Unbounded Stein kernel). Theorem 1 is a consequence of the fact that the Stein kernel evaluated at x, i.e., x 7 up(x, x), is unbounded when P has an exploding

On the Robustness of Kernel Goodness-of-Fit Tests

score function. Exploding score functions are often associated with light tails. Examples of such distributions include those with a density of the form p(x) exp( x r 2) with r 2. These distributions have sub-Gaussian tails, and their score function has the form sp(x) = rx x r 2 2 , which is unbounded for r 2. On the other hand, this exploding-score condition excludes heavy-tailed models with bounded score functions, such as super-Laplacian or t-distributions (Gorham and Mackey, 2017; Barp et al., 2024).

Remark 2 (Moment conditions). The moment conditions in Theorem 1 are mild. For the example from Remark 1, namely p(x) exp( x r 2) with r 2, direct computation shows that these conditions hold if h(x) and h(x) decay at least as fast as |x (r 1) 2 so as to cancel the growth of the score sp(x) 2 = r x r 1 2 . Such kernels include squared-exponential kernels, IMQ kernels h(u) = (1+ u 2 2) s/2 with s r 1, and Matérn kernels with suﬃcient smoothness. Section 5.1 will provide numerical evidence using a Gaussian model.

Remark 3 (Connection to separation boundaries). Let Qδ := {Q P(Rd) : S(Q, P) δ}, where S is some statistical divergence or metric. The separation boundary (Ingster, 1987, 1993) of a test is the fastest decaying sequence {δn}n such that the test power under any Q Qδn still converges to 1 as n . From this perspective, Theorem 1 implies that the separation boundary of the standard KSD test must decay at least with rate ϵn = o(n) 1, whenever Qδn contains Huber-contamination models. This complements existing results on the separation boundary of KSD tests. The most relevant works are Schrab et al. (2022); Hagrass et al. (2025), where they consider alternative distributions Q that have a density with respect to either the Lebesgue measure or the target model P. In particular, their results cannot be used to derive Theorem 1, since their density assumption on Q is violated in our case. Indeed, we consider alternatives of the form Q = (1 ϵn)P + ϵnδz, which involve a Dirac delta measure and thus has no density with respect to the Lebesgue measure nor any continuous measures, including P.

3.2 Existing KSD Tests with Tilted Kernels are Qualitatively Robust

Since unbounded Stein kernels are the main cause of the lack of robustness of existing KSD tests, a natural approach to enforce robustness is to choose a suitable k so that the Stein kernel up becomes bounded. We now show that this can be achieved using tilted kernels (Barp et al., 2019; Matsubara et al., 2022). Speciﬁcally, we will show in Theorem 3 that, with a tilted kernel, a small proportion of contamination in the data has negligible impact on the outcome of the standard KSD test. We ﬁrst prove that tilted kernels give rise to bounded Stein kernels. The proof is deferred to Section A.2.

Lemma 2 (Bounded Stein kernel). Suppose k(x, x ) = w(x)h(x x )w(x ), where h C2 b is a stationary reproducing kernel, and the weighting function w C1 b satisﬁes the condition supx Rd w(x)sp(x) 2 < . Then

sup x,x Rd |up(x, x )| sup x Rd up(x, x) = τ < .

Notably, the function x 7 up(x, x) is non-negative since up is a reproducing kernel. Barp et al. (2024, Theorem 9) showed that KSD with such tilted kernels still satisﬁes P-separation,

Liu and Briol

namely D(Q, P) = 0 Q = P, provided that sp 2 grows at most root-exponentially and the spectral density of the translation-invariant kernel k which exists by Bochner s Theorem (Berlinet and Thomas-Agnan, 2004, Theorem 20) is bounded away from zero on compact sets. Most stationary kernels used in KSD tests (such as the IMQ or squared-exponential kernels) satisfy the conditions in Lemma 2. Intuitively, the weighting function w is used to counteract the growth of the score function sp. When the score grows polynomially like sp(x) 2 = O( x r 2) for some r > 0, it suﬃces to choose w(x) = (1+a2 x 2 2) b for any a > 0 and b 1/(2r). This weight function is common in the frequentist parameter estimation literature (Barp et al., 2019) and the generalized Bayesian inference literature (Matsubara et al., 2022; Altamirano et al., 2023, 2024; Duran-Martin et al., 2024). The following result states that, with a tilted kernel, the outcome of the standard KSD test will not be signiﬁcantly aﬀected by small proportions of contamination in the data. The proof is provided in Section A.3.

Theorem 3. Assume EX P [ sp(X) 2] < and that k is a tilted kernel satisfying the conditions in Lemma 2. Then, for any test level α (0, 1) and any sequence ϵn = o(n 1/2), the following holds as n ,

sup Q P(P;ϵn)

Pr Xn Q,W D2(Xn) > q2 ,1 α(Xn) Pr X n P,W D2(X n) > q2 ,1 α(X n) 0 ,

where P(P; ϵn) is the Huber s contamination model deﬁned in (6).

The LHS of the above convergence is the worst-case diﬀerence between the rejection probabilities of the standard KSD test with and without data corruptions. This result thus immediately implies that the standard KSD test is qualitative robust to any sequence of Huber s models so long as the contamination ratio decays suﬃciently fast as ϵn = o(n 1/2). The condition on ϵn allows the expected number of contaminated data nϵn to grow with n but requires the growth rate to be slow. In particular, it is met when the expected number of contaminated data is bounded, i.e., ϵn = O(n 1). This condition is not an artifact of the proof: in Section B.1, we show empirically that this rate is tight, i.e., the tilted-KSD test is no longer qualitatively robust when ϵn = n r for any r 1/2.

Remark 4. The intuition behind Theorem 3 is that, when the conditions in Lemma 2 are met so that the Stein kernel up is bounded, the impact of any single outlier on the test statistic D2(Xn) can be bounded. In contrast, under the setting of Theorem 1, where k is translation-invariant and thus up is unbounded, even a single outlier can drive D2(Xn) to inﬁnity. This is the key motivation for using a tilted kernel to guarantee robustness: by choosing a suitable weighting function w, the Stein kernel can be made bounded, thereby mitigating the eﬀect of outliers.

Remark 5. The boundedness condition on w(x)sp(x) 2 required of Lemma 2 and Theorem 3 bears similarity with the ones imposed for KSD-based robust estimation methods (Barp et al., 2024; Matsubara et al., 2022). For example, Barp et al. (2019, Proposition 7) showed that their KSD-based estimator is globally bias-robust assuming x 7 sp(x) 2 R

Rd k(x, x )sp(x ) 2Q(dx ) is bounded. This is slightly weaker than those we assume in Lemma 2 and Theorem 3, but also harder to verify in practice.

On the Robustness of Kernel Goodness-of-Fit Tests

Algorithm 1 Robust-KSD (R-KSD) test for goodness-of-ﬁt evaluation.

1: Input: Data Xn = {xi}n i=1; target distribution P; uncertainty radius θ; bootstrap sample size B; test level α.

2: Compute test statistic θ(Xn) as deﬁned in (9).

3: for b = 1, . . . , B do

4: Draw Wb Multinomial(n; 1/n, . . . , 1/n) and compute bootstrapped sample D2 Wb(Xn) by (3).

6: Compute the (non-squared) bootstrapped quantile q B,1 α(Xn) by (5).

7: Reject HC 0 : Q BKSD(P; θ) if θ(Xn) > q B,1 α(Xn).

Remark 6 (Connection to separation boundaries, continued). Theorem 3 implies that the separation boundary of the KSD tests cannot decay faster than n 1/2 when considering alternatives that include Huber-contamination models (see also Remark 3). This can be compared to existing results on the separation boundaries of tests based on Maximum Mean Discrepancy (MMD, Müller, 1997; Gretton et al., 2012). MMD is a broader family of discrepancies that recovers KSD when the kernel is chosen to be the Stein kernel up (Barp et al., 2024, Theorem 1). The separation rates of MMD tests have been studied extensively (Balasubramanian et al., 2021; Hagrass et al., 2024b), but these works assume the alternative Q has a density with respect to either the Lebesgue measure or P, which does not hold in our contamination setting. An exception is Hagrass et al. (2024a), which does not make this assumption. However, their results are limited to the two-sample setting, where the target distribution P is unknown and approximated by ﬁnite samples, again diﬀering from our setup.

Although tilted kernels enforce qualitative robustness, they are not enough to guarantee quantitative robustness. Indeed, so long as k is characteristic, the consistency of the standard KSD test implies that Pr Xn Q,W(D2(Xn) > q2 ,1 α(Xn)) 1 whenever Q = P (Liu et al. 2016, Proposition 4.2; Schrab et al. 2022, Theorem 3.3), so KSD tests with either stationary or tilted kernels are not quantitatively robust to any neighborhoods strictly larger than the singleton set {P}. We provide numerical evidence in Section 5.1.

4. Robust Kernel Goodness-of-ﬁt Tests for KSD-Ball Uncertainty Sets

We now propose a novel robust KSD test for the setting where the uncertainty set is a KSD-ball. Given P P(Rd) with density p C1, we consider the composite hypotheses

HC 0 : Q BKSD(P; θ) , HC 1 : Q BKSD(P; θ) , (8)

where θ 0 and BKSD(P; θ) := {Q : D(Q, P) θ}. The advantage of using a KSD-ball as the uncertainty set is that it lends naturally to a simple, tractable test that is robust to the contamination models we reviewed in Section 2.3. We ﬁrst describe our novel test, which we call robust-KSD test, and show that it is quantitatively robust to KSD balls in the sense of Deﬁnition 2. We then discuss in Section 4.2 how to choose the uncertainty radius θ to incorporate common contamination models, such as Huber s contamination models and density-band models.

Liu and Briol

4.1 A Robust KSD Test

Given a prescribed test level α (0, 1), our robust KSD test uses the following test statistic

θ(Xn) := max 0, D(Xn) θ , (9)

where D(Xn) is the square-root of the V-statistic (2). The test rejects HC 0 for large values of θ(Xn). The decision threshold should be chosen to control the Type-I error rate for all possible Q BKSD(P; θ), i.e., we require γ = γ(Xn) so that sup Q BKSD(P;θ) Pr Xn Q,W( n(Xn) > γ) α. One approach to construct γ is to use deviation inequalities for bounded functions as done in Gretton et al. (2012, Corollary 11). However, such approach tends to be overly conservative (Gretton et al., 2012). Instead, we propose a bootstrap procedure to construct a decision threshold. We will show that choosing γ = q ,1 α(Xn), the square-root of the (population) bootstrap quantile deﬁned in (4), gives the desired Type-I error control asymptotically. Our robust-KSD test therefore rejects HC 0 : Q BKSD(P; θ) if θ(Xn) > q ,1 α(Xn). This is summarized in Algorithm 1. The validity and consistency of our robust KSD test is stated in the following result, proved in Section A.5.2. In particular, it implies that our robust-KSD test is quantitatively robust to KSD balls in the inﬁnite-sample limit.

Theorem 4. Suppose EX P [ sp(X) 2] < and k(x, x ) = w(x)h(x x )w(x ), with stationary reproducing kernel h C2 b and weighting function w C1 b . Deﬁne the set of distributions P(Rd; w) := {Q P(Rd) : EX Q[ w(X)sp(X) 4 2] < }.

1. (Calibration) It holds that

sup Q BKSD(P;θ) P(Rd;w) lim sup n Pr Xn Q,W θ(Xn) > q ,1 α(Xn) α .

2. (Consistency) For any Q P(Rd; w)\BKSD(P; θ), it holds that

lim n Pr Xn Q,W θ(Xn) > q ,1 α(Xn) = 1 .

The condition w C1 b requires the weighting function w and its gradient to be bounded. In particular, this holds for w(x) = 1, in which case k(x, x ) = h(x x ) reduces to an untilted stationary kernel such as IMQ or squared-exponential kernels. Theorem 4 also assumes the data-generating distribution Q ﬁnitely integrates the fourth moment of the weighted score, i.e., w(x)sp(x) 4 2. This automatically holds for any Q P(Rd) when k is a tilted kernel satisfying the conditions in Lemma 2, in which case w(x)sp(x) 2 becomes bounded. In practice, the decision threshold q ,1 α is intractable, and we again use the Monte Carlo estimate q B,1 α, deﬁned as the squared root of (5), as an approximation. Compared with the standard KSD test targeting the point null H0 : Q = P, our robust-KSD test uses the same bootstrap procedure to compute the decision threshold, but has a slightly diﬀerent test statistic θ(Xn) instead of D2(Xn), given by (2), to account for the composite null. Moreover, as θ 0, the test statistic θ approaches D(Xn), the test statistic of the standard test, and the KSD-ball BKSD(P; θ) falls to the singleton {P}, so the robust-KSD test reduces to the standard test. Our robust KSD test can hence be viewed as a generalization of the

On the Robustness of Kernel Goodness-of-Fit Tests

standard KSD test to the composite hypotheses in (8). In particular, the robust KSD test is also qualitatively robust whenever the conditions on k and sp from Theorem 3 hold; see Proposition 14 in Section A.3 for a formal statement and proof. Another advantage of our robust KSD test being a direct generalization of the standard test is that no extra computation is required to guarantee robustness. For a given θ > 0, our robust test only requires a minor transformation of the test statistic of the standard KSD test. It hence has the same computational cost as the standard test, namely O(n2d). However, extra computation is potentially needed in determining θ, as it might require optimizing the Stein kernel. This will be discussed in detail in Section 4.2.

Remark 7 (Pointwise and uniform controls). The Type-I error control shown in Theorem 4 is pointwise over the null distributions Q. For robust tests, a more desirable control is often a uniform one, whereby the supremum over Q is taken before the limit over n (Lehmann and Romano, 2022, Chapter 11). For the proposed test, a uniform control can be shown if the bootstrapped quantile is replaced by the ground-truth quantile, say q 1 α(Xn). That is (see also Remark 9 in the appendix),

lim sup n sup Q BKSD(P;θ) P(Rd;w) Pr Xn Q,W θ(Xn) > q 1 α(Xn) α .

However, extending this to the bootstrapped quantile requires uniformly bounding the bootstrap approximation errors, which is non-trivial. We leave this as an open question for future research. In the literature, most kernel-based robust tests that oﬀer uniform controls are two-sample tests using either permutation (Schrab and Kim, 2024) or deviation bounds (Sun and Zou, 2023). Yet, these approaches are either not applicable to our one-sample setting or too conservative. This is why our test uses bootstrapping. Another kernel robust test that uses bootstrapping is Key et al. (2025), but their Type-I error control is also pointwise. In Section D, we propose an alternative robust test that leverages deviation bounds in a manner similar to Sun and Zou (2023) to achieve uniform controls, but at the expense of a lower test power.

4.2 Choosing the Uncertainty Radius

A crucial design choice in our robust tests is the uncertainty radius θ. This should be guided by the types of contamination that the practitioner is willing to tolerate. In this section, we discuss how to choose θ for Huber s contamination models and density-band models when using KSD-balls based on tilted kernels. Firstly, suppose we are considering Huber s contamination model P(P; ϵ0) for some ϵ0 [0, 1]. Then θ should be chosen such that the KSD-ball BKSD(P; θ) contains P(P; ϵ0). The following result, proved in Section A.6, shows how to achieve this given an upper bound on the Stein kernel τ = supx Rd up(x, x).

Proposition 5. Suppose k satisﬁes the conditions in Lemma 2 and EX P [ sp(X) 2] < . Let ϵ0 [0, 1]. Then P(P; ϵ0) BKSD(P; θ) if θ = ϵ0τ 1/2 and this bound is tight, i.e., sup Q P(P;ϵ0) D(Q, P) = ϵ0τ 1/2 .

This result suggests that, when at most a proportion ϵ0 of data is corrupted, setting θ = ϵ0τ 1/2 will ensure quantitative robustness to P(P; ϵ0). Note that using a similar proof,

Liu and Briol

Huber s contamination models can also be related to the MMD balls; see Lemma 23 in Section B.2. One practical issue is that τ , the supremum of the Stein kernel, does not always admit a closed-form expression. One way to compute it is by numerical optimization of x 7 up(x, x), but this assumes the optimizer converges and requires extra computation, thus not suitable in high dimension or when evaluation of the score function is costly. We propose an alternative approach, where τ is approximated by the maximum of the Stein kernel evaluated at the observed data, i.e., maxi=1,...,n up(Xi, Xi). This approach requires no extra computation and gives a reasonable estimate with moderate or large n. In Section 5, we use this approach and show empirically that the resulting test is still well-calibrated despite this approximation. Further discussions on how this approximation aﬀects the test performance can be found in Section B.3, where we demonstrate that this approach still controls the Type-I error rate, even with a small sample size.

Remark 8. Proposition 5 holds for Huber s contamination models of the form Q = (1 ϵ)P + ϵR with ϵ ϵ0. In particular, it holds for any contamination distribution R. This agnosticism can be useful in practice, because the exact form of contamination is often unknown. On the other hand, when prior knowledge about R is available, it is also possible to incorporate it into the proposed test. For example, if R is known to have a support bounded by some B > 0, then the bound in Proposition 5 can be tightened by replacing the worst-case bound τ = supx Rd up(x, x) with the localized version τ = sup x 2 B up(x, x). As discussed in Fauß et al. (2021), such assumptions are realistic in, e.g., highly regulated experimental environments, where any outliers are known to lie within bounded regions.

Another uncertainty model commonly studied in the literature is the density-band model (Kassam, 1981; Hafner, 1993), deﬁned as distributions whose density function lies within an error band of a nominal model with density p, i.e., {Q P(Rd) : Q has density q with |q(x) p(x)| δ(x) for all x}, for some function δ : Rd [0, ). It can be shown that density-band models can be rewritten as Huber s contamination models (6) with additional constraints on the outlier distribution (Fauß and Zoubir, 2016). However, compared with Huber s models, density-band models have the advantage of being more interpretable and more natural for certain forms of model disparity such as heavy tails. The following result suggests how the uncertainty radius θ should be chosen for such models and is proved in Section A.6.3.

Proposition 6. Suppose k satisﬁes the conditions in Lemma 2 and EX P [ sp(X) 2] < . Furthermore, assume Q, P P(Rd) admit positive densities q, p on Rd and p C1. If |q(x) p(x)| δ(x) for some function δ : Rd [0, ) such that δ0 := R

Rd δ(x) dx < ,

then D(Q, P) δ0τ 1/2 .

The bound in Proposition 6 is not tight, as its proof involves bounding an integrand that can take negative values by its absolute value. Nevertheless, as we will show in Section 5.1.4, this bound is not overly loose and can still be useful. In particular, it allows us to design a KSD test that is robust to tail misspeciﬁcation while still maintaining non-trivial power. Proposition 5 and 6 also reveal a limitation of our robust KSD test it can be overly conservative to alternatives that are not the intended contamination but still lie within the chosen KSD-ball uncertainty set. For example, to ensure robustness to Huber s contamination models (6) with tolerance ϵ0, Proposition 5 suggests setting the uncertainty radius to

On the Robustness of Kernel Goodness-of-Fit Tests

θ = ϵ0τ 1/2 . However, the resulting KSD-ball BKSD(P; θ) contains not only Huber s models, but also other distributions such as density-band models. As a result, the robust KSD test will control Type-I errors for all these distributions, regardless of whether they are the intended contamination type. This limitation is not unique to our method; it is a generic drawback of all robust tests based on uncertainty sets (Fauß et al., 2021).

5. Numerical Experiments

We will now evaluate the proposed GOF tests using both synthetic and real data. Unless otherwise mentioned, all standard KSD tests are based on an IMQ kernel k(x, x ) = h IMQ(x x ) where h IMQ(u) = (1 + u 2 2/λ2) 1/2 with a bandwidth λ2 > 0 selected via the median heuristic, i.e., λmed = Median Xi Xj 2 : 1 i < j n . All tilted-KSD and robust-KSD tests are based on a tilted IMQ kernel with weight w(x) = (1 + x a 2 2/c) b, where a Rd and c > 0. Intuitively, a and c respectively centers and scales the input. We ﬁx a = 0 and c = 1 in all experiments, as all data will always be centered and on a suitable scale. More generally, we could replace x a 2 2/c by a weighted norm of the form (x a) C(x a), where C Rd d is a pre-conditioning matrix, chosen possibly as the empirical covariance matrix or robust estimates of it. Since our experiments will focus on sub-Gaussian models, we choose b = 1/2. This ensures the Stein kernel is bounded. All tests have nominal level α = 0.05. The probability of rejection is computed by averaging over 100 repetitions, and the 95% conﬁdence intervals are reported. Our robust KSD test will be shortened as R-KSD. Code for reproducing all experiments can be found at github.com/Xing LLiu/robust-kernel-test.

5.1 Toy Gaussian Model

We ﬁrst consider a Gaussian model P = N(0, 1), in which case the score function of the model is sp(x) = x and is hence unbounded. This toy example is simplistic and not representative of the unnormalized models our test is most suited for, but it will nonetheless be helpful to study our algorithmic choices, to verify Theorem 1 and Theorem 3 numerically, and to compare against alternative robust tests.

5.1.1 Decay Rate of Weighting Function

We ﬁrst draw random samples Xn of size n = 500 from Q = (1 ϵ)P +ϵδz, for diﬀerent values of ϵ [0, 1] and z R. This setting mimics the presence of outliers at z. Given a stationary kernel h, a natural question is how does the choice of weight w aﬀect the Stein kernel and the test power? . On the left-hand side of Figure 1, we plot the Stein kernel x 7 up(x, x) for diﬀerent values of b. As b grows, the tails of the Stein kernel are progressively down-weighted. We then plot the rejection probability of the standard KSD tests using these Stein kernels when the outlier is z = 10 on the right-hand side of Figure 1. As b increases, the rejection probability decreases, suggesting that the test power is lower when the tails of the Stein kernel are overly down-weighted. This is not surprising since the Stein kernel decays faster for larger values of b, making the test insensitive to model deviations at the tails. Ideally, w should decay just enough so that up remains bounded, but not too fast as it would lose power. This highlights the trade-oﬀbetween robustness and power in the choice of w. A

Liu and Briol

0.0 0.2 0.4 Contamination ratio ϵ

Rejection rate

b = 0. b = 0.25 b = 0.5 b = 1.0 α = 0.05

Figure 1: Left. Stein kernel for P = N(0, 1) and an IMQ kernel tilted by w(x) = (1+x2) b. The larger b is, the more the tails of the function x 7 up(x, x) are down-weighted. The choice b = 0 corresponds to no weighting, reducing to an IMQ kernel. Right. The rejection probability under contamination by R = δz with z = 10.

0.0 0.2 0.4 Contamination ratio ϵ

Rejection rate

0.01 0.1 1.0 10.0 100.0 med KSDAgg

Figure 2: Rejection probability of robust KSD with diﬀerent bandwidths λ. med is the median heuristic. KSDAgg is the test of Schrab et al. (2022). The dashed line is α = 0.05. The vertical line is the maximal proportion of contamination ϵ0 = 0.05 controlled by robust-KSD.

10 1 100 101 102 Outlier value z

Rejection rate

IMQ-KSD Tilted-KSD R-KSD dc MMD MMD-Dev α = 0.05

10 1 100 101 102 Outlier value z

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Rejection rate

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Figure 3: Rejection probability under an outlier-contaminated Gaussian model with diﬀerent outlier values z and contamination ratios ϵ. The grey dotted horizontal line is the test level α = 0.05, and the black dash-dot vertical line corresponds to ϵ0 = 0.05. The KSD tests with IMQ kernel lack both qualitative and quantitative robustness since they reject even for small z or ϵ. The tilted-KSD test is more robust in cases where z or ϵ are larger, but ultimately still reject the null due to their lack of quantitative robustness.

similar trade-oﬀhas been observed in robust estimation, where an overly decaying w can impede the estimation accuracy (Barp et al., 2019, Appendix F.2). Henceforth, we choose b = 1/2 to balance this trade-oﬀ, as it is the smallest value that ensures up is bounded for sub-Gaussian models.

5.1.2 Kernel Bandwidth

We now use the same data set to investigate how the choice of the kernel bandwidth λ aﬀects the performance of the robust-KSD test. It is well-known that the bandwidth plays a crucial role in kernel-based tests and can considerably aﬀect the ability of the test to detect model disparities (Gretton et al., 2012; Ramdas et al., 2015; Reddi et al., 2015; Schrab et al., 2022, 2023). In general, a smaller bandwidth is better at detecting local diﬀerences, while a larger bandwidth is more suited for global deviations (Schrab et al., 2023). A common strategy for bandwidth selection in kernel-based testing is the median heuristic. Schrab et al. (2022, 2023) also proposed testing frameworks that aggregate the test result with multiple bandwidths

On the Robustness of Kernel Goodness-of-Fit Tests

to give the ﬁnal decision, thereby avoiding bandwidth selection and achieving higher test power. A natural question is therefore: how does the bandwidth aﬀects the robustness of the proposed test? .

We run the robust-KSD test with various choices of bandwidths λ2 Λ {λ2 med}, where Λ = {0.01, 0.1, 1, 10, 100} and the uncertainty radius θ is set to control at most ϵ = 0.05 contaminations in the sample. We also compare it with the KSDAgg test of Schrab et al. (2022) using the same tilted kernel and aggregating over the collection of bandwidths Λ. Other conﬁgurations of KSDAgg follow the recommendations in Schrab et al. (2022, Section 4.2). This experiment is repeated 200 times to reduce randomness, and the results are reported in Figure 2. For any choice of λ, the robust-KSD test is able to control the Type-I error when ϵ ϵ0, where ϵ is the proportion of contamination in the data, suggesting that robust-KSD remains well-calibrated for all bandwidths. If instead ϵ > ϵ0, all robust-KSD tests eventually reject HC 0 , with λ chosen by median heuristic achieving the highest power. The KSDAgg test is not robust and more sensitive to contamination than all robust-KSD tests at all contamination levels ϵ. This is not surprising given that KSDAgg is designed to maximize the test power over multiple bandwidths, rendering this test more sensitive to a wide range of alternatives, including contaminated models. As a result, we henceforth do not include KSDAgg in our experiments. A similar study for the standard KSD test is given in Section B.4.

5.1.3 Standard versus Robust KSD Tests

Using the same data set, we then compare KSD tests with a stationary kernel to tilted-KSD tests and our proposed robust-KSD test in Figure 3. As shown in the left two plots, with a contamination ratio ϵ of 0.01 or 0.05, the standard IMQ-KSD test rejects the point null with high probability for large outlier values z. This aligns with the lack of qualitative robustness of tests based on stationary kernels proved in Theorem 1. In contrast, the standard Tilted-KSD test rejects with lower probability for all z values, aligning with the qualitative robustness result with tilted kernels in Theorem 3. Notably, our theoretical results in both Section 3 and Section 4 apply only to ﬁxed kernels, thus excluding kernels based on observed data such as those selected via median heuristic. However, our numerical results suggest that the conclusion of our theory may hold more broadly.

The Type-I error of the tilted-KSD test still exceeds the nominal level α = 0.05, suggesting that a tilted kernel alone is not enough to enforce quantitative robustness. Notably, the rejection probability of Tilted-KSD is highest when z = 1 but declines for larger z. This is because up(z, z) with the tilted kernel peaks at around |z| 1 and then decreases with z (see Figure 1), making the test more sensitive when |z| 1 but less sensitive to large outliers. In contrast, running our robust-KSD test with a pre-set contamination control ϵ0 = 0.05, we can see from the left two plots of Figure 3 that it remains well-calibrated for all values of outlier z. The results also suggest that R-KSD is conservative since its rejection probability is close to 0; we attribute this to the fact that the robust tests are designed to control all types of contamination, of which the injected outlier is only one speciﬁc type. To demonstrate the non-trivial power of the robust tests, we show in the right two plots the rejection probability against diﬀerent contamination ratios ϵ. As ϵ grows, R-KSD eventually is capable of rejecting the null hypothesis H0 : Q BKSD(P; θ) with test power approaching one.

Liu and Briol

4 2 0 2 4 x

Log densities

ν = 3 ν = 10 ν = 50 ν = 100

Degree of freedom ν

Rejection rate

IMQ Tilted R-KSD (ν0=5)

R-KSD (ν0=10)

R-KSD (ν0=20)

Figure 4: Heavy-tailed experiment. Left. Log densities of a standard Gaussian model and scaled t-distributions with diﬀerent degree-of-freedom (dof) and moment-matched to Gaussian. Right. Rejection probability of the standard and robust tests with diﬀerent θ set to control the cases ν ν0 for diﬀerent values of ν0.

We also compared our tests with two existing robust kernel tests, namely the robust MMD tests of Schrab and Kim (2024) (denoted dc MMD) and of Sun and Zou (2023, Eq. 38) (denoted MMD-Dev). Both tests are provably quantitatively robust to Huber s contamination models, but they are two-sample tests that require extra samples from the model P. When the same number of samples from Q and from P are used and the cost of simulation is negligible, the cost of a single evaluation of MMD comparable to that of KSD. However, this cost could be prohibitively expensive when P is a more complex model. Implementation details of these tests are deferred to Section B.2. As shown in Figure 3, both tests are more conservative than our proposed KSD tests. The conservativeness of dc MMD is because it is designed for a contamination model that is stronger than Huber s model, while the conservativeness of MMD-Dev is because it uses a deviation inequality to construct the decision threshold, which is known to be conservative; see, e.g., Gretton et al. (2012).

5.1.4 Misspecified Tails

Finally, we now change the data-generating mechanism to study the impact of misspeciﬁed tails. We sample from Qν = tν p

(ν 2)/ν, where tν is the Student s t-distribution with degree-of-freedom (dof) ν > 2, and the scaling factor p

(ν 2)/ν ensures its second moment matches that of P. We set the uncertainty radius so that the R-KSD test is robust to Qν with ν ν0 for diﬀerent values of ν0. This is achieved using Proposition 6 to derive a bound on D(Qν, P), which is discussed in detail in Proposition 22 of Section A.6.4. The results are reported in Figure 4. For small values of ν, the tail of the model is misspeciﬁed, leading the robust tests with ν0 = 10 or 20 to reject with high probability because the KSD between Qν and P is large. As ν grows, Qν converges weakly to the standard Gaussian, making the model well-speciﬁed in the limit ν . Consequently, neither the standard tests or the robust tests reject the null hypothesis for large ν. The robust test with ν0 = 5 shows no power because, to be robust to tails that are so misspeciﬁed, the uncertainty radius θ computed using Proposition 6 must be very large, thus resulting in a small test statistic (cf. (9)) and low test power. Overall, the robust tests are more conservative than the standard one. This is because the robust tests control all possible contaminations within a KSD-ball, while misspeciﬁed tails are only one of the many possible such forms of contamination.

On the Robustness of Kernel Goodness-of-Fit Tests

10 5 0 5 10 x1

Data Contamination

0 200 400 Index of data

Log unnormalized density

Data Contamination

0.00 0.25 0.50 0.75 1.00 Contamination ratio ϵ

Rejection rate

IMQ-KSD Tilted-KSD R-KSD α = 0.05

Figure 5: Gaussian-Bernoulli RBM experiment. Left. Data generated from P and injected contamination in the ﬁrst two dimensions. Middle. Log unnormalized densities of the data and contamination ordered from small to large; the injected contamination is indeed abnormal since they have much lower densities. Right. Probability of rejection with a Gaussian-Bernoulli RBM against the contamination ratio ϵ. The robust tests are calibrated to control the Type-I error with no more than ϵ0 = 0.1 proportion of contamination.

5.2 Gaussian-Bernoulli Restricted Boltzmann Machines

Our next experiment is an example of energy-based model called a Gaussian-Bernoulli Restricted Boltzmann Machine (RBM) model (Cho et al., 2013). RBMs have been used in a wide range of scientiﬁc applications (see Fischer and Igel, 2014; Zhang et al., 2019, for a review), and they are a common benchmarks for assessing kernel GOF tests (Liu et al., 2016; Jitkrittum et al., 2017; Schrab et al., 2022). RBMs are latent-variable models with joint density p(x, h) exp(1

2x Bh+b x+c h 1

2 x 2 2), where x Rd is an observable variable, h { 1}d is a binary hidden variable with latent dimension d , and B Rd d , b Rd and c Rd are model parameters. Computing the normalizing constant of the marginal p(x) requires summing over 2d terms, so it becomes intractable when d is large. However, its score function has a closed-form expression sp(x) = b x + Btanh(B x + c), where tanh is applied entry-wise. We set d = 50 and d = 10, and randomly initialize B, b, c by sampling each entry independently from a N(0, 1). We then generate n = 500 samples from the model by block Gibbs sampling following Cho et al. (2013); Jitkrittum et al. (2017), and randomly replace ϵ [0, 1] proportions of the data with outliers drawn from R = N(0, 0.12Id); see the left and middle of Figure 5. As shown by the plots, the outliers are clearly abnormal since they have low density values. We plot the rejection probabilities of the tests on the right-hand side of Figure 5. The standard IMQ-KSD and Tilted-KSD tests are not well-calibrated under the composite null HC 0 : Q BKSD(P; θ), while robust-KSD with θ chosen by setting ϵ0 = 0.1 controls the rejection probability below the level α = 0.05 when the contamination ratio ϵ does not exceed ϵ0. On the other hand, when ϵ exceeds the maximal allowed proportion ϵ0, the robust test rejects HC 0 with probability approaching 1, thus showing its power. When generating the synthetic data from P, we run a Gibbs sampler for 7000 steps and discard the ﬁrst 2000 as burn-in and keep every 10-th datum to reduce correlation. This takes 58.40 seconds on average, compared with 0.84 seconds required to perform the robust-KSD test. As a result, it will clearly not be reasonable to use the robust MMD tests from Sun and Zou (2023); Schrab et al. (2023) here as simulating a large enough number of samples from P would cost orders of magnitude more than the cost of the GOF test.

Liu and Briol

0 10 20 30 Outlier mean y

Rejection rate

IMQ Tilted R-KSD α = 0.05

0 10 20 30 Outlier mean y

0.0 0.5 1.0 1.5 2.0 Offset δ

Rejection rate

IMQ Tilted R-KSD α = 0.05

10 0 10 20 Standardized galaxy velocity

10 0 10 20 Standardized galaxy velocity

10 0 10 20 Standardized galaxy velocity

0.0 0.5 1.0 2.0

Figure 6: Probability of rejection (top) and data and ﬁtted models (bottom) for the KEF experiment. (a) Extra outliers are added to the data; data and ﬁtted models with no contamination (bottom left) and contamination ratio ϵ = 0.2 (bottom right). (b) Diﬀerent oﬀset values δ are added to the ﬁtted parameter to create model deviation. The radius θ is chosen to control contamination of up to a proportion of ϵ0 = 0.2.

Moreover, the samples generated through Gibbs sampling are no longer i.i.d., thus violating the assumptions in our theoretical results. Extension to the non-independent case could potentially be made following the approach in Chérief-Abdellatif and Alquier (2022), which studied the robustness of estimators (instead of GOF tests) under correlated data.

5.3 A Kernel Exponential Family Model for Density Estimation

Our next example concerns density estimation using a kernel exponential family (KEF) model (Canu and Smola, 2006). Given a reproducing kernel k0 : Rd Rd R and a reference density p0 on Rd, we have pη(x) p0(x) exp( fη(x)), where fη lies in the RKHS Hk0 associated with k0. We follow the setting in Matsubara et al. (2022) which uses a ﬁnite-basis approximation of the RKHS function so that fη(x) = P25 l=1 ηlφl(x), where η = (η1, . . . , η25) R25 and φl(x) = (xl/

l!) exp( x2/2) for l = 1, 2, . . . , 25. We are interested in testing the goodness-of-ﬁt of the KEF model under data contamination. Suppose that the practitioners believe the observed data are subject to contamination, so that a robust estimator for η is used. Robust parameter estimation for exponential family models was studied in Barp et al. (2019), which proposed a minimum-distance estimator using KSD with tilted kernels. With a suitable weighting function, the resulting estimator is bias-robust (Barp et al., 2019, Proposition 7) and thus not susceptible to contamination. However, how to test the goodness-of-ﬁt of the estimated model in this setting remains an open question. The standard IMQ-KSD test is not suitable since it is not robust and can thus falsely reject a good model due to contamination. Using a tilted kernel is also not enough since it is not quantitatively robust and hence can suﬀer from uncontrolled Type-I error. We therefore propose to use our robust-KSD test for this task. One challenge with the KEF model is that its density does not have a closed form, and although sampling from

On the Robustness of Kernel Goodness-of-Fit Tests

P is feasible in this low-dimensional example (d = 1), it would become challenging in high dimensions, making MMD-based tests not well-suited for this model. Our robust-KSD test, however, does not suﬀer from this limitation. We use the data set as Matsubara et al. (2022); Key et al. (2025), which is a 1-dimensional data set of 82 galaxy velocities (Postman et al., 1986; Roeder, 1990). To avoid using the same data for model training and testing, we randomly split the data into equal halves, each containing ndata = 41 data points. To each half, we then add nol independent draws from R = N(z, 0.12) for some value of z for the mean contamination, so that the resulting data set has a size of n = ndata + nol and a contamination ratio of ϵ = nol/(ndata + nol). We then ﬁt the KEF model to one half of the data, and perform the tests on the other half. Both the robust minimum KSD estimation and the robust-KSD test use the same tilted kernel k(x, y) = w(x)hΛ(x y)w(y) with w(x) = (1 + x2) 1/2 and a sum of IMQ kernels hΛ(x y) = P λ2 Λ(1 + |x y|2/(2λ2)) 1/2 where Λ = {0.6, 1, 1.2}. The choice of the weighting function follows Matsubara et al. (2022) and the use of a sum kernel is recommended in Key et al. (2025). Direct computation shows that the Stein kernel up is bounded in this case. We choose ϵ0 = 0.2 when computing θ for R-KSD. The results are shown in Figure 6. We ﬁrst study the level of our tests in Figure 6a. As the outlier mean z increases, the IMQ-KSD test rejects the null H0 : Q = P with increasing probability, suggesting that this is due to the presence of outliers. Notably, this happens even though the parameter estimator is not susceptible to contamination; see the bottom row of Figure 6a for the ﬁtted model with and without outliers. When the tilted kernel was used, the standard KSD test has a lower rejection probability, although it is still above the nominal level. On the other hand, the robust-KSD test is well-calibrated for all values of z. We then study the power of KSD tests in Figure 6b. To show the robust test has nontrivial power when no outliers are present, we ﬁt the model parameters ˆη on half of the data (without contamination) and replace the ﬁrst component by ˆη1 + δ, where δ {0, 0.5, 1, 2} is an error oﬀset, so that the resulting model has a poor ﬁt. Other experimental setups including the choice of θ remain the same as before. As expected, the robust test has lower power than the standard tests; however, as the oﬀset value δ increases, the robust test eventually rejects with high probability, thus showing its power under no contamination.

5.4 Limitations for Multimodal Models

We investigate the performance of the robust test when the model has multimodality and the data are drawn from the same model with misspeciﬁed mixing ratios, thus revealing a limitation of our robust-KSD test. It is well-known that the standard KSD test performs poorly with mixture models with well-separated modes (Liu et al., 2023). This is intuitively due to the myopia of the score function, making the test blind to disparities in the mixing ratios, and it is a general issue for KSD and other score-based discrepancies (Wenliang and Kanagawa, 2020; Zhang et al., 2022). We demonstrate that our robust test can also suﬀer from this limitation, which is made even more prominent due to the relaxation of the point null to a KSD-ball. We consider a 2-dimensional mixture of Gaussian model P = P5 j=1 πj N(γµj, I2), where πj uj with u1, . . . , u5 drawn independently from Uniform(0, 1), each mean µj Rd

has entries drawn independently from Uniform( 2, 2), and γ > 0 is a scaling factor controlling the mode separation. The data-generating distribution Q has a diﬀerent set of mixture ratios

Liu and Briol

Scale γ = 2

Scale γ = 5

IMQ-KSD Tilted-KSD α = 0.05

Rejection rate

IMQ-KSD Tilted-KSD R-KSD-0.005 R-KSD-0.01 R-KSD-0.05 α = 0.05

Figure 7: Mixture-of-Gaussian experiment. Left. Contour plots of model densities. Middle. KSD value with IMQ and tilted kernels. Right. Rejection probability of the standard and robust tests. R-KSD-0.01 refers to the R-KSD test with uncertainty radius θ chosen by setting the maximal contamination tolerance to ϵ0 = 0.01, and similarly for the others.

that are randomly generated. For R-KSD, the uncertainty radius is chosen to be θ = ϵ0τ 1/2 with diﬀerent values of ϵ0, and we label the corresponding result as R-KSD-ϵ0. The results are reported in Figure 7. As the scale γ increases, the modes become more separated, leading to smaller values of D(Q, P), as shown in the left and middle plots. Consequently, Q eventually becomes indistinguishable from P under KSD, and all tests fail to reject with high probability. This is a result of the aforementioned myopia of KSD, which aﬀects both the standard and the robust tests and regardless of whether IMQ or tilted kernels are used. The opposite is observed when γ is small and the modes overlap, where all tests can correctly reject the null with high probability. Compared with the standard tests, the robust one is more conservative because it controls all possible contamination; nevertheless, it still achieves non-trivial power when ϵ0, thus also the radius θ, is set to a small value.

6. Conclusion and Discussion

This paper studied the robustness of the kernel GOF test of Liu et al. (2016); Chwialkowski et al. (2016). We showed that existing KSD-based tests lack qualitative robustness when stationary kernels are used, but achieve it when employing tilted kernels inspired by robust estimation methods (Barp et al., 2019; Matsubara et al., 2022). Since qualitative robustness alone does not ensure calibration under ﬁxed contamination levels, we proposed a novel robust KSD test that provably controls the Type-I error when the data-generating distribution lies in a KSD-ball around the reference model, while consistently rejecting alternatives outside of it. We then discussed how to select the radius of this ball under diﬀerent contamination models, namely Huber s contamination model and the density-band model. Empirical results on synthetic and real data sets demonstrate that our robust GOF test is well-calibrated and has non-trivial power. We conclude by discussing promising directions for future work. Firstly, we established consistency of the proposed robust KSD test against all alternative distributions outside the null set, i.e., HC 1 : Q BKSD(P; θ). In the broader robust testing literature, however, it is also common to restrict the alternative set to a neighborhood centered at a speciﬁc alternative P = P and study minimax properties against such hypothesis (Huber, 1965; Levy, 2008; Gül and Zoubir, 2016; Sun and Zou, 2023). Understanding whether our robust KSD test achieves minimax optimality in this setting is an interesting future work.

On the Robustness of Kernel Goodness-of-Fit Tests

Secondly, as discussed in Section 2.3, our proposed robust KSD test does not require the Stein kernel up to be bounded. A bounded Stein kernel is convenient as the resulting KSD-ball BKSD(P; θ) contains many contamination models of interest, such as Huber s model with an arbitrary noise distribution. However, this could be too strict in some applications where outliers cannot take an arbitrary value, e.g., when the domain is bounded or the data acquisition procedure is highly regulated to prevent extreme outliers. In these cases, even with an unbounded kernel, the KSD-ball could still capture all contamination models of interest. Unbounded Stein kernels possess appealing theoretical properties such as convergence-control of moments (Kanagawa et al., 2022). Exploring whether such kernels can lead to more powerful robust-KSD tests is another worthwhile future research direction.

Acknowledgments

The authors would like to thank Antonin Schrab, Andrew Duncan, Matias Altamirano, and Charita Dellaporta for helpful discussions. XL was supported by the President s Ph D Scholarships of Imperial College London, the EPSRC Stat ML CDT programme [EP/S023151/1], and the Alan Turing Enrichment Scheme Award. FXB was supported through the EPSRC grant [EP/Y011805/1].

Liu and Briol

A Proofs of Theoretical Results 24 A.1 Proof of Theorem 1 and Related Preliminary Results . . . . . . . . . . . . . . 24 A.2 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 A.3 Proof of Theorem 3 and Related Preliminary Results . . . . . . . . . . . . . . 36 A.4 P-Targeted Kernel Stein Discrepancy . . . . . . . . . . . . . . . . . . . . . . . 47 A.5 Validity of the Bootstrap Approach . . . . . . . . . . . . . . . . . . . . . . . . 49 A.6 Connections Between KSD Balls and Contamination Models . . . . . . . . . . 53

B Complementary Experiments 56 B.1 Decay Rate of Contamination Ratio . . . . . . . . . . . . . . . . . . . . . . . 56 B.2 Implementation Details for the Robust MMD Tests . . . . . . . . . . . . . . . 57 B.3 Approximating the Supremum of the Stein Kernel . . . . . . . . . . . . . . . 58 B.4 Ablation Study for Kernel Bandwidth in the Standard KSD Test . . . . . . . 59 B.5 Gaussian Mean-Shift Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 59 B.6 Scalability with Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 B.7 Diﬀerent Contamination Distributions . . . . . . . . . . . . . . . . . . . . . . 60 B.8 Weighted Bootstrap and Wild Bootstrap . . . . . . . . . . . . . . . . . . . . . 61

C Extension to U-statistics 62

D A Non-Asymptotically Valid Robust KSD Test 63

Appendix A. Proofs of Theoretical Results

This section holds proofs of the theoretical results. Throughout the appendix, given i.i.d. random variables X1, . . . , Xn Q P(Rd) and an integrable function f : (Rd)n R, we will write for brevity E[f(X1, . . . , Xn)] = R

Rd f(x1, . . . , xn)Q(dx1) Q(dxn).

A.1 Proof of Theorem 1 and Related Preliminary Results

Overview of proof. For any ϵ [0, 1] and R P(Rd), a random variable Xi drawn from Q = (1 ϵ)P +ϵR can be written as Xi = (1 ξi)X i +ξi Zi, where ξi Bernoulli(ϵ), X i P and Zi R are independent, and the equality is understood as equality in distribution. Deﬁning the random variable M = Pn i=1 ξi, then M Binomial(n, ϵ) and represents the number of contaminated data in a random sample Xn drawn from Q. We will ﬁrst show that the conclusion in Theorem 1 holds conditional on the event {M m 0} for some integer m 0 [0, n], i.e., when there are at least m contaminated data in the sample. This will be suﬃcient to show Theorem 1 by proving that this event occurs with high probability. We will use the following notation in the rest of this section: Let X n be a random sample drawn from Q. For any integer m [0, n] giving the number of corrupted data and any z Rd, we deﬁne Xn,z = X n m {z}m . The rest of this section is organized as follows:

On the Robustness of Kernel Goodness-of-Fit Tests

We ﬁrst present three preliminary results. Lemma 7 and Lemma 8 respectively bound the test statistic D2(Xn,z) and q2 ,1 α(Xn,z) for a given number of corrupted data, m , by quantities of the form anup(z, z) + bn for some an, bn. These results allow us to compare the relative rate at which D2(Xn,z) and q2 ,1 α(Xn,z) scale with z. Both results will be used to prove Proposition 9, which states that the claim in Theorem 1 holds conditionally on the number of corrupted data.

Section A.1.1, A.1.2, A.1.3 and A.1.4 provide the proofs for Lemma 7, Lemma 8, Proposition 9 and Theorem 1, respectively.

Assumption 1. Assume EX P [ sp(X) 4 2] < and k(x, x ) = h(x x ) for some h C2 b with h(0) > 0.

Lemma 7. Assume σ2 := supz Rd EX P [up(X, z)2] < and suppose Assumption 1 holds. Then, for any δ > 0 and integers 0 < m < n, the following event holds with probability at least 1 δ:

D2(Xn,z) ϵ2 m,nup(z, z) 2σ ϵm,n

where ϵm,n = m /n and m = n m .

Lemma 8. Suppose Assumption 1 holds. Assume (i) σ2 := supz Rd EX P [up(X, z)2] < and (ii) ξ2 P := EX P [up(X, X)2] < . Then there exist constants C, CP,1, and CP,2 depending only on P such that, for any δ > 0 and integers 0 < m < n, the following event holds with probability at least 1 δ:

q2 ,1 α(Xn,z) C log(1/α)

δ ϵm,nup(z, z) + CP,1

nδ + CP,2 log(1/α)

where ϵm,n = m /n and m = n m , and the probability is over the joint distribution of Xn,z and W Multinomial(n; 1/n, . . . , 1/n).

Proposition 9. There exists a constant C > 0 such that the following holds. Suppose Assumption 1 holds. Also assume the following integrability conditions

sup z Rd sp(z) 4 2EX P h(X z)4 < , and sup z Rd sp(z) 2 2EX P h(x z) 2 2 < .

Further assume that z 7 sp(z) 2 is unbounded. Then for any s [0, 1), there exists z Rd

such that, for any sample size n and integer m 1

2n1 s, n giving the number of corrupted data, the condition

Pr X n m P,W D2(Xn,z) > q2 ,1 α(Xn,z) 1 n (1 s).

The assumption m 1

2n1 s, n ensures there is a considerable amount of corrupted data in the sample, while excluding the case where all data are contaminated (m = n). The constant factor 1/2 is arbitrary.

Liu and Briol

A.1.1 Proof of Lemma 7

Proof The test statistic can be decomposed in three terms: a term with uncorrupted samples, a term with corrupted samples, and a term for interactions between these two groups:

D2(Xn,z) = 1

1 i,j n u P (Xi, Xj)

1 i,j m up(X i , X j) + 2

1 i m<j n up(X i , z) + 1

m<i,j n up(z, z)

n2 D2(X m) + 2(n m)

i=1 up(X i , z) + (n m)2

n2 up(z, z)

= (1 ϵm,n)2D2(X m) + 2ϵm,n(1 ϵm,n)Sm,z + ϵ2 m,nup(z, z) , (10)

where in the last line we have deﬁned Sm,z := m 1 Pm i=1 up(Xi, z) and ϵm,n := (n m)/n. The remainder of the proof proceeds by lower-bounding the ﬁrst two terms with high probability. The ﬁrst term can be lower-bounded by 0, since D2(X m) is a V-statistic and is hence non-negative. This lower bound will be relatively good as we would expect D2((X m) to approach zero at a root-m rate. Bounding the second term in (10) requires bounding Sm,z. Because by assumption Xi = X i P are i.i.d. for i = 1, . . . , m, the term Sm,z is a sum of i.i.d. random variables up(X i , z). Moreover, up(X i , z) are zero-mean, because the proof of Gorham and Mackey (2017, Proposition 1) shows that EX P [up(X , )] = 0 whenever EX P [ sp(X ) 2] < , which holds under Assumption 1. We can therefore compute the variance of Sm,z as

Var(Sm,z) = E S2 m,z = 1 m EX 1 P up(X 1, z)2 σ2 m , (11)

which is ﬁnite since σ2 < by assumption. We can hence apply Chebyshev s inequality. Choosing

the set A1 := |Sm,z| t holds with probability at least 1 δ. Furthermore, on A1, we have by (10) that

D2(Xn) (1 ϵm,n)2D2(X m) 2ϵm,n(1 ϵm,n)t + ϵ2 m,nup(z, z) 2ϵm,nt + ϵ2 m,nup(z, z) ,

where in the last step we have used ϵm,n [0, 1]. This shows the claim.

A.1.2 Proof of Lemma 8

To prove this lemma, we will need the following concentration bound for quadratic forms of multinomial random vectors. This is a special instance of the Hanson-Wright inequality due to Adamczak (2015) when the dependent random vectors follow multinomial distributions.

On the Robustness of Kernel Goodness-of-Fit Tests

Lemma 10 (Hanson-Wright inequality; Adamczak 2015, Theorem 2.5). Let W = (W1, . . . , Wn) Multinomial(n; 1/n, . . . , 1/n). Then there exists constant C > 0 such that, for any n n matrix A with entries aij R and any δ > 0, we have

W AW µ C A F log 1

where µ := n 1 P 1 i,j n aij + P 1 i n aii, and A F = P 1 i,j n a2 ij 1/2 is the Frobenius norm.

Proof of Lemma 10 Adamczak (2015, Remark 2.2) shows that multinomial random vectors satisfy the convex concentration inequality (Adamczak, 2015, Deﬁnition 2.2). Therefore, we can apply Adamczak (2015, Theorem 2.5) to conclude that there exists constants C1, C2 such that, for any n n matrix A and any t > 0,

Pr W W AW µ t 2 exp 1

C2 A 2 F , t A op

C2 A 2 F , t A F

where A op = sup x 2 1 Ax 2 is the operator norm, µ = EW P 1 i,j n(Wi 1)(Wj 1)aij , and the last step holds due to the well-known ordering between operator norm and Frobenius norm A op A F (Golub and van Loan, 2013, Eq. 2.3.7). Moreover, we can compute µ using explicit expressions for the joint central moments of multinomial random variables (Ouimet, 2020, Theorem 2) as follows

1 i,j n aij EW [(Wi 1)(Wj 1)]

1 i,j n aij EW [(Wi 1)(Wj 1)] + X

1 i n aii EW (Wi)2

1 i =j n aij + 1 1

1 i,j n aij + X

1 i n aii .

For any δ > 0, we claim that the choice

Liu and Briol

implies that Pr W(|W AW µ| t) δ, which would complete the proof by setting C = max C1/2 1 C1/2 2 , C1 . To see this, we note that the the following are equivalent

C1 min n4t2

C2 A 2 F , n2t A F

C2 A 2 F , n2t A F

C2 A 2 F and C1 log 2

t C1/2 1 C1/2 2 A F

and t C1 A F

C1/2 1 C1/2 2 , C1

Since log(2/δ) > 1 for δ (0, 1), the last inequality is met with t deﬁned in (13). This shows that Pr W W AW µ t δ, thus ﬁnishing the proof.

We are now ready to prove Lemma 8. Proof of Lemma 8 Denote by F the cumulative distribution function for the conditional distribution of the bootstrap sample D2 W(Xn,z) (deﬁned in (3)) given Xn,z, and denote its (1 α)-quantile as

q2 ,1 α(Xn,z) := inf{t R : 1 α F (t)} .

To bound q2 ,1 α/2(Xn,z), we use a similar argument as in Schrab et al. (2023, Appendix E4) by bounding the exceedance probability of D2 W(Xn,z). Notably, their proof cannot be applied directly here, because they use a wild bootstrap (Leucht and Neumann, 2013) that corresponds to replacing Wi 1 with independent Rademacher weights, whereas in our case Wi are correlated and multinomially distributed. To proceed, we ﬁrst note that we can write D2 W(Xn,z) as a quadratic form as D2 W(Xn,z) = n 2W AW, where A = (aij)ij with aij := up(Xi, Xj). We can then use the Hanson-Wright inequality stated in Lemma 10 applied with δ = α/2 to show that there exist positive constant C such that, for any almost sure realizations Xn,z,

D2 W(Xn,z) µ/n2 + C A F

W AW µ + C A F log 1

W AW µ C A F log 1

On the Robustness of Kernel Goodness-of-Fit Tests

where µ := n 1 P 1 i,j n aij +P 1 i n aii. This implies that the (1 α)-quantile of D2 W(Xn) can be bounded as

q2 ,1 α µ n2 + t = µ n2 + ρα A F

where we have deﬁned ρα := C log(1/α). We now further simpliﬁes this bound by bounding µ and A F. The matrix A is the Gram matrix of the Stein kernel up, thus positive semi-deﬁnite. This implies P 1 i,j n aij 0 and aii 0 for all i, so we have

1 i,j n aij + X

1 i n aii X

1 i n aii = X

1 i m aii + X

1 i m aii + (n m)up(z, z) , (15)

where the last quality holds since by assumption Xi = z for i > m. Moreover, since aii = up(Xi, Xi) 0 for any i, we can apply Markov inequality to bound the ﬁrst term on the RHS. This gives that the following holds with probability at least 1 δ/4,

1 i m aii 4

1 i m E[aii] = 4m

δ EX P [up(X , X )] 4m

δ EX P [up(X , X )2] 1

where the second step holds since by assumption Xi = X i P are i.i.d. for i = 1, . . . , m, and where we have substituted ξ2 P = EX1 P [up(X1, X1)2]. Combining with (15) implies that, with probability at least 1 δ/4,

δ ξP + (n m)up(z, z) . (16)

We now bound A 2 F. We ﬁrst show that Σ2 P := EX,X P [up(X, X )2] is ﬁnite, where X, X P are independent. Since up is a reproducing kernel (Barp et al., 2024, Theorem 1), we can use the reproducing property and the Cauchy-Schwarz inequality to yield

up(x, x ) = up( , x), up( , x ) Hu up( , x) Hu up( , x ) Hu = up(x, x) 1 2 up(x , x ) 1 2 ,

where Hu denotes the RKHS associated with up. This implies

Σ2 P = EX,X P [up(X, X )2] EX,X P [up(X, X)up(X , X )] (a) = EX P [up(X, X)] 2

(b) EX P [up(X, X)2]

where (a) holds since X, X are i.i.d. and (b) follows from Jensen s inequality. By applying Markov s inequality again, with probability at least 1 δ/4, where the probability is taken

Liu and Briol

over the randomness of X m, we have

1 i,j n up(Xi, Xj)2

1 i,j n E[up(Xi, Xj)2]

1 i,j m E[up(X i , X j)2] + 2(n m) X

1 i m E[up(X i , z)2] + (n m)2E[up(z, z)2]

m(m 1)Σ2 P + mξ2 P + 2m(n m)EX 1 P [up(X 1, z)2] + (n m)2up(z, z)2

m(m 1)Σ2 P + mξ2 P + 2m(n m)σ2 + (n m)2up(z, z)2 , (17)

where the second last line holds since X

1 i,j m E[up(X i , X j)2] = X

1 i =j m E[up(X i , X j)2] + X

1 i m E[up(X i , X i )2]

= m(m 1)E[up(X i , X j)2] + m E[up(X i , X i )2]

= m(m 1)Σ2 P + mξ2 P ,

and since we have substituted Σ2 P = EX,X P [up(X, X )2], ξ2 P = EX P [up(X, X)2] and σ2 = supz Rd EX P [up(X, z)2]. Using the above inequality and (16) to bound (14), we conclude that the following event holds with probability at least 1 δ/2 δ/4 δ/4 = 1 δ, where the randomness is taken over the joint distribution of X m and W,

q2 ,1 α(Xn,z)

4 nδξP + n m

n2 up(z, z)

4 δ m(m 1)Σ2 P + mξ2 P + 2m(n m)σ2 + (n m)2up(z, z)2

4 nδξP + n m

n2 up(z, z)

m(m 1)ΣP + mξP + p

2m(n m)σ + (n m)up(z, z) ,

4 nδξP + ϵm,n

n up(z, z) + ρα

2ΣP + ξP + p

2ϵm,nσ + ϵm,nup(z, z)

4 nδξP + 2ραϵm,n

δ up(z, z) + ρα

where the ﬁrst inequality holds by (16) and (17) (and using m n), the second inequality follows from using twice the inequality

b for any a, b 0, the third inequality holds since m n and ϵm,n = (n m)/n, and (18) holds since ϵm,n 1 and since the conditions δ (0, 1) and ρα 1 imply ϵm,n

n up(z, z) ϵm,n

δ up(z, z) ραϵm,n

δ up(z, z) .

On the Robustness of Kernel Goodness-of-Fit Tests

Substituting the deﬁnition of ρα = C log(1/α), and deﬁning the constants C := 2C and CP,1 := 4ξP and CP,2 := C(2ΣP + ξP +

2σ ), we have from (18) that

q2 ,1 α(Xn,z) C log(1/α)ϵm,n

δ up(z, z) + CP,1

nδ + CP,2 log(1/α)

A.1.3 Proof of Proposition 9

Proof We ﬁrst show that the moment conditions (i)-(ii) in Lemma 7 and Lemma 8 are met, and apply these lemmas to conclude the proof. To verify the moment conditions, we ﬁrst bound the squared Stein kernel evaluated at any x, z Rd as

= sp(x) sp(z)h(x z) sp(x) h(x z) + sp(z) h(x z) + h(x z) 2

4 h sp(x) sp(z)h(x z) 2 + sp(x) h(x z) 2 + sp(z) h(x z) 2

+ h(x z) 2i

4 h sp(x) 2 2 sp(z) 2 2h(x z)2 + sp(x) 2 2 h(x z) 2 2 + sp(z) 2 2 h(x z) 2 2 (19a)

+ h(x z) 2i , (19b)

where the ﬁrst inequality holds since (a+b+c+d)2 4(a2 +b2 +c2 +d2) for any a, b, c, d R, and the second inequality is due to Cauchy-Schwarz inequality. The assumption h C2 b immediately implies that the last term in (19) is uniformly bounded over x, z. For the ﬁrst term, taking expectation over P and supremum over z gives

sup z Rd EX P sp(X) 2 2 sp(z) 2 2h(X z)2

sup z Rd sp(z) 2 2 EX P h(X z)4 1/2 EX P sp(X) 4 2 1/2

= C1/2 P sup z Rd sp(z) 2 2 EX P h(X z)4 1/2 ,

where the ﬁrst step holds by Hölder s inequality, and CP := EX P sp(X) 4 2 . The RHS of the last line is ﬁnite by assumption. Similarly, the second term in (19) can also be bounded by Hölder s inequality as

sup z Rd EX P sp(x) 2 2 h(x z) 2 2 C1/2 P sup z Rd EX P h(x z) 4 2 1/2 ,

which is ﬁnite since h C2 b . For the third term in (19),

sup z Rd EX P sp(z) 2 2 h(x z) 2 2 = sup z Rd sp(z) 2 2EX P h(x z) 2 2 ,

Liu and Briol

which is again ﬁnite by assumption. Substituting these bounds into (19), we have shown that σ2 = supz Rd EX P [up(X, z)2] < , which veriﬁes condition (i). To show (ii), we note that a translation-invariant kernel corresponds to choosing a constant weighting function w(x) 1 in the tilted kernel of Lemma 2, so that we can leverage the derivations in the proof of Lemma 2. Setting w(x) 1 in (28) to simplify up(x, x) yields

ξ2 P := EX P up(X, X)2 = EX P h sp(X) 2 2h(0) h(0) 2i

2h(0)2EX P sp(X) 4 2 + 2 h(0) 2 < ,

where the ﬁrst inequality holds since (a + b)2 2a2 + 2b2 for any a, b R. This proves condition (ii). Having showed the moment conditions, we are now ready to prove the main theorem. Pick any s [0, 1) and any integer m 1

2n1 s, n . Deﬁne ϵm,n := m /n and m = n m . Let C, CP,1, CP,2 be the constants deﬁned in Lemma 8, and deﬁne the events

A1 := D2(Xn) ϵ2 m,nup(z, z) t1(m ) , (20a)

A2 := q2 B,1 α(Xn) C log(1/α)

δ/2 ϵm,nup(z, z) + t2(m ) , (20b)

t1(m ) := 2σ ϵm,n p

δm/2 , t2(m ) := 2CP,1

nδ + CP,2 log(1/α)

and the dependence on m in these quantities is through m = n m . It follows from Lemma 7 and Lemma 8 applied to δ/2 that the event A := A1 A2 occurs with probability Pr(A) 1 δ/2 δ/2 = 1 δ. These events allow us to bound D2(Xn,z) and q2 B,1 α(Xn,z), which then allows us to show that D2(Xn,z) > q2 B,1 α(Xn,z) for suﬃciently large z. On A, we have

D2(Xn) q2 B,1 α(Xn,z) ϵ2 m,nup(z, z) t1(m ) C log(1/α)

δ/2 ϵm,nup(z, z) t2(m )

= ϵm,nup(z, z) ϵm,n C

where in the last line we have deﬁned C :=

2C log(1/α) and T(m ) := t1(m ) + t2(m ). This implies

Pr X m P,W D2(Xn,z) > ˆq B 1 α(Xn,z)

Pr X m P,W D2(Xn,z) > ˆq B 1 α(Xn,z) | A Pr X m P,W(A)

(1 δ) Pr X m P,W D2(Xn,z) > ˆq B 1 α(Xn,z) | A

(1 δ)1 ϵm,nup(z, z) ϵm,n C

T(m ) > 0 , (21)

On the Robustness of Kernel Goodness-of-Fit Tests

where the ﬁrst step holds due to the law of total probability. We claim that the indicator function in (21) takes value 1 for all m 1

2n1 s, n when the following inequalities hold

2 > 4C , sp(z) 2 2 > 1 h(0)

4T(n 1)n2s + 1 2h(0) . (22)

Indeed, since δ > 0 was arbitrary, we can set δ = n (1 s), so we have

(a) > ϵm,n C

4C ns = ϵm,n 1 4ns

(b) ϵm,n ϵm,n

where (a) holds since the ﬁrst condition in (22) implies n 1 2 + s

2 = ns+ (1 s)

2 4C ns, and (b) holds since the assumption m 1

2n1 s implies ns n/(2m ) = ϵm,n/2. Using this and the second inequality in (22),

ϵm,nup(z, z) ϵm,n C

= ϵm,n sp(z) 2 2h(0) 1 2h(0) ϵm,n C

> ϵm,n 4T(n 1)n2s ϵm,n

ϵm,n 2T(n 1)

where the ﬁrst equality holds by setting w(x) 1 in (28) to compute up(z, z), the second line follows from the second inequality in (22) and (23), the third line holds since n2s n2s/(m )2 = ϵ 2 m,n, and the last line holds as direct computation shows T(n 1) T(m ) since m < n by assumption. To summarize, we have shown that for any n and z satisfying (22), the following holds

Pr X m P,W D2(Xn,z) > q2 ,1 α(Xn,z) 1 δ = 1 n (1 s) .

It then remains to show that there exists z not dependent on m that satisﬁes the second inequality in (22), which would imply the claimed result. Such z always exists, because, for the second inequality in (22), the RHS does not depend on z nor m and is ﬁnite since ϵm,n = m /n (n s/2, 1) by assumption, while the LHS explodes with z as we have assumed z 7 sp(z) 2 is unbounded. This concludes the proof.

A.1.4 Proof of Theorem 1

Proof It suﬃces to show the claim for any sequence of the form ϵn = n s, where s [0, 1) is arbitrary. Let the random variable M be deﬁned as at the start of Section A.1. For any random sample Xn, deﬁne the event A(Xn) = {D2(Xn) > q2 ,1 α(Xn)}. For n suﬃciently

Liu and Briol

large so that n(1 s)/2 > 2

2C log(1/α), we can apply Proposition 9 to conclude that, for all n, there exists zn Rd such that the following holds for all m 1

2n1 s, n = 1

Pr Xn Q,W A(Xn) | M = m = Pr X n m P,W D2(Xn,z) > q2 ,1 α(Xn,z)

1 n (1 s) , (24)

where we have used the shorthand notation Xn,zn = X n m {zn}m . We will show that the sequence of probability measures Q := (1 ϵn)P + ϵnδzn satisﬁes

Pr Xn Q,W D2(Xn) > q2 ,1 α(Xn) Pr X n P,W D2(X n) > q2 ,1 α(X n) 1 α , (25)

which will imply the claimed result of our theorem. Using a concentration inequality for Binomial distributions (e.g., Chung and Lu, 2002, Lemma 2.1) applied to the case of Binomial(n, ϵn), the event B := {ϵnn/2 M 3ϵnn/2} holds with high probability

Pr M (B) = 1 Pr M M ϵnn < ϵnn

1 exp (ϵnn)2

2(ϵnn + ϵnn/6)

= 1 exp ϵnn

=: 1 f(n) , (26)

where in the last line we have deﬁned f(n) = exp ( ϵnn/8) exp ( 3ϵnn/7). Deﬁne the index set In to be the set of integers in the interval [ϵnn/2, 3ϵnn/3]. Then

Pr Xn Q,W D2(Xn) > q2 ,1 α(Xn)

Pr Xn Q,W A(Xn) B

m In Pr Xn Q,W A(Xn) | M = m Pr(M = m )

min m In Pr Xn Q,W A(Xn) | M = m n X

(b) = min m In Pr Xn Q,W A(Xn) | M = m Pr(B)

(c) 1 n (1 s) 1 f(n) (d) = 1 O n (1 s) , (27)

where the ﬁrst inequality follows from the law of total probability, (a) and (b) hold by the partition B = m In{M = m } and again the law of total probability, (c) follows from (24) and (26), and (d) holds since the assumptions ϵn = n s and s [0, 1) imply f(n) = o n (1 s) . On the other hand, the assumed moment conditions imply EX P [up(X , )] = 0 as argued in the paragraph before (11) and EX 1,X 2 P [up(X 1, X 2)2] EX P [up(X , X )2] < as argued in Section A.1.3, so we can apply Arcones and Gine (1992, Theorem 3.5) with r = 2 to yield the asymptotic validity of the KSD test when Q = P, namely

Pr X n P,W D2(X n) > q2 ,1 α(X n) α .

On the Robustness of Kernel Goodness-of-Fit Tests

Combining this with (27) shows the desired convergence (25).

A.1.5 Implications for KSDs on Non-Euclidean Spaces

The core argument underpinning the proofs in Theorem 1 is that the function z 7 up(z, z) is unbounded on the support of the model, which in our case is Rd. The proof per se does not rely on the support being an Euclidean space. Consequently, this argument can potentially be extended to KSDs constructed for non-Euclidean data, such as Riemannian data (Xu and Matsuda, 2021), sequence data with varying dimensionality (Baum et al., 2023), functional data (Wynne et al., 2022), or censored time-to-event data (Fernandez et al., 2020).

A.2 Proof of Lemma 2

Proof Direct expansion of the Stein kernel in (1) with a tilted kernel k gives

up(x, x ) = w(x)sp(x), w(x )sp(x ) h(x, x )

+ w(x )sp(x ), w(x) h(x x ) + w(x )sp(x ), h(x x ) w(x)

+ w(x)sp(x), w(x ) h(x x ) w(x)sp(x), h(x x ) w(x )

+ w(x), w(x ) h(x x ) + w(x) h(x x ), w(x )

w(x), w(x ) h(x x ) w(x)w(x ) h(x x )

= sp,w(x), sp,w(x ) h(x x )

+ sp,w(x ), w(x) h(x x ) + sp,w(x ), h(x x ) w(x)

+ sp,w(x), w(x ) h(x x ) sp,w(x), h(x x ) w(x )

+ w(x), w(x ) h(x x ) + h(x x ), w(x ) w(x)

w(x), h(x x ) w(x ) w(x)w(x ) h(x x ) ,

where sp,w( ) := w( )sp( ) and h(x x ) denotes h evaluated at x x . By setting x = x

and eliminating identical terms,

= sp,w(x) 2 2h(0) + 2 sp,w(x), w(x) h(0) + w(x) 2 2h(0) w(x)2 h(0) (28)

sp,w(x) 2 2h(0) + 2 sp,w(x) 2 w(x) 2h(0) + w(x) 2 2h(0) + w(x)2| h(0)| , (29)

where the last line follows from Cauchy-Schwarz inequality. The RHS of (29) is bounded over x assuming the stated conditions on h, w and the supremum of sp(x)w(x) 2 = sp,w(x) 2. To prove the bound on up(x, x ), we denote by Hu the RKHS associated with up, which is a reproducing kernel (see, for example, Barp et al., 2024, Theorem 1). We use the reproducing property of the Stein reproducing kernel up and the Cauchy-Schwarz inequality for the corresponding RKHS norm Hu to yield

up(x, x ) = up( , x), up( , x ) Hu up( , x) Hu up( , x ) Hu = up(x, x) 1 2 up(x , x ) 1 2 . (30)

Liu and Briol

Hence, supx,x Rd up(x, x ) . In particular, up(x, x)1/2 is well-deﬁned, since up(x, x) 0, which is because the Stein kernel up is positive deﬁnite.

A.3 Proof of Theorem 3 and Related Preliminary Results

Overview of proof. We will show a general result in Proposition 14, which states that the robust-KSD test that rejects HC 0 : BKSD(P; θ) if θ(Xn) := max(0, D(Xn) θ) > q B,1 α(Xn) is qualitatively robust for any θ 0. This immediately shows Theorem 3 by setting θ = 0. To show Proposition 14, we follow a similar approach in the proof of Theorem 1, where we will ﬁrst show that the result holds conditionally on the number of contaminated data, and use a high-probability argument to complete the proof. The rest of this section is organized as follows:

Section A.3.1 shows Lemma 11, which states that the bootstrap quantiles q B,1 α(Xn) and q B,1 α(X n) computed using any two samples Xn and X n that diﬀer by at most m

elements are close to each other with high probability.

Section A.3.2 shows Lemma 12, which states that the diﬀerence in the exceedance probabilities of θ(Xn) and of θ(X n) is small.

Section A.3.3 shows Lemma 13, which states that the rejection probability of a robust KSD test using Xn is close to that of a test using X n, assuming Xn and X n diﬀer by no more than o(n1/2) elements. Its proof relies on Lemma 11 and Lemma 12.

Section A.3.4 uses Lemma 13 to show Proposition 14, which states that the robust-KSD test is qualitatively robust for any θ 0. This result immediately implies Theorem 3, the proof of which is also contained in this subsection.

Lemma 11. Assume k is a tilted kernel satisfying the conditions in Lemma 2. Then there exists absolute constants C1, C2 > 0 such that, for any δ > 0 and any (deterministic) realizations Xn = {xi}n i=1 and X n = {x i }n i=1 that diﬀer by at most m elements, we have

D2 W(Xn) D2 W(X n) τ ϵ

C1 + C2 log 8

where D2 W(Xn), deﬁned in (3), is the bootstrap sample computed using Xn, and ϵm,n = m /n. Moreover, the above inequality implies

|q ,1 α(Xn) q ,1 α(X n)| τ

1 4m,n n 1 2

C1 + C2 log 8

Lemma 12. Assume EX P [ sp(X) 2] < and that k is a tilted kernel satisfying the conditions in Lemma 2. For any δ, γ > 0 and any integer m [0, n], it holds that

sup z1,...,zm Rd

Pr X m P D X n m {zi}m i=1 > γ Pr X n P (D(X n) > γ)

Pr X n P (|D(X n) γ| tm,n,δ) + δ ,

On the Robustness of Kernel Goodness-of-Fit Tests

where tm,n,δ := 4

δn + 2τ ϵ2 m,n 1/2.

Lemma 13. Assume EX P [ sp(X) 2] < and that k is a tilted kernel satisfying the conditions in Lemma 2. For any integers θ 0, δ > 0, and any sequence fn = o(n1/2), there exists n0 such that for any n n0, we have

max m fn ω(m ) < δ ,

where the max is over all non-negative integers m fn, and

ω(m ) := Pr X m P,W θ(X m Zm ) > q ,1 α(X m Zm )

Pr X n P,W θ(X n) > q ,1 α(X n) .

Proposition 14. Assume EX P [ sp(X) 2] < and that k is a tilted kernel satisfying the conditions in Lemma 2. If θ 0 and ϵn = o(n 1/2), then as n ,

sup Q P(P;ϵn)

Pr Xn Q,W θ(Xn) > q ,1 α(Xn) Pr X n P,W θ(X n) > q ,1 α(X n) 0 ,

where P(P; ϵn) is the Huber s contamination model deﬁned in (6).

A.3.1 Proof of Lemma 11

Proof Without loss of generality, we assume Xn and X n are ordered so that xi = x i for i = 1, . . . , m. We also deﬁne the n n matrices U := (uij)1 i,j n and U := (u ij)1 i,j n, where uij := up(xi, xj) and u ij := up(x i , x j). Since up is a reproducing kernel, U and U

are Gram matrices, thus symmetric and positive semi-deﬁnite. Furthermore, we deﬁne W 0 i := Wi 1 for any i = 1, . . . , n, and deﬁne the vectors V1 := (W 0 1 , . . . , W 0 m, 0, . . . , 0)

and V2 := (0, . . . , 0, W 0 m+1, . . . , W 0 n) , so that W0 = W 1 = V1 + V2. It follows that the bootstrap sample D2 W(Xn) deﬁned in (3) can be decomposed as

D2 W(Xn) = 1 n2 X

1 i,j n W 0 i W 0 j uij = 1 n2 (W0) UW0

V 1 UV1 + V 2 UV2 + 2V 1 UV2 .

Similarly, we can decompose D2 W(X n) as

D2 W(X n) = 1 n2

V 1 U V1 + V 2 U V2 + 2V 1 U V2 .

Since xi = x i for all i m by construction, we have uij = u ij for 1 i, j m, so V 1 UV1 = P 1 i,j m W 0 i W 0 j uij = P 1 i,j m W 0 i W 0 j u ij = V 1 U V1, and we can bound the following diﬀerence as

n2 D2 W(Xn) D2 W(X n)

= V 2 UV2 + 2V 1 UV2 V 2 U V2 2V 1 U V2

|V 2 UV2| + 2|V 1 UV2| + |V 2 U V2| + 2|V 1 U V2|

|V 2 UV2| + 2|V 1 UV1| 1 2 |V 2 UV2| 1 2 + |V 2 U V2| + 2|V 1 U V1| 1 2 |V 2 U V2| 1 2 , (31)

Liu and Briol

where the last line follows from Cauchy-Schwarz inequality applied to the U-weighted inner product (x, x ) 7 x Ux and the U -weighted inner product (x, x ) 7 x U x , which are well-deﬁned since U, U are positive semi-deﬁnite. The proof proceeds by bounding each term separately. We discuss how to bound |V1UV1| and |V2UV2|, and the argument for U is identical. We ﬁrst deﬁne the n n matrix with (A1)ij = uij for i, j m and (A1)ij = 0, so that we can write V 1 UV1 = W A1W. We can apply the Hanson-Wright lemma (Lemma 10) to conclude that there exists positive constants C such that, for any δ > 0 and any almost-sure sequence Xn, the following event occurs with probability at least 1 δ/4,

|V 1 AV1| = |W A1W|

1 i,j m uij + X

1 i,j m u2 ij

1 i,j m |uij| + X

1 i m |uii| + C log 8

1 i,j m u2 ij

n τ + mτ + Cmτ log 8

2mτ + Cmτ log 8

where the second last inequality holds since |uij| supx,x Rd |up(x, x )| τ , and the last line holds since m n. Similarly, deﬁning the matrix A2 with (A2)ij = 0 for i, j m and (A2)ij = uij otherwise, we can write V 2 UV2 = W A2W, and the same argument as before shows that the following holds with probability at least 1 δ/4,

|v 2 Av2| = |W A2W|

m<i,j n uij + X

m<i,j n u2 ij

n + (n m)τ + C(n m)τ log 8

2(n m)τ + C(n m)τ log 8

Combining (32) and (33), we conclude that the following holds with probability at least 1 δ/2,

|V 2 UV2| + 2|V 1 UV1| 1 2 |V 2 UV2| 1 2

2(n m)τ + C(n m)τ log 8

+ 2 2mτ + Cmτ log 8

2 2(n m)τ + C(n m)τ log 8

= (n m) + 2m 1 2 (n m) 1 2 τ

2 + C log 8

On the Robustness of Kernel Goodness-of-Fit Tests

The same argument shows that the last two terms of (31) can be bounded by the same quantity on an event with probability at least 1 δ/2. To conclude, we have shown that with probability at least 1 δ,

D2 W(Xn) D2 W(X n) 2 n2 (n m) + 2(n m) 1 2 m 1 2 τ

2 + C log 8

n (n m) 1 2

n 1 2 + 2m 1 2

2 + C log 8

2 + C log 8

12 + 6C log 8

where in the second last line we have deﬁned ϵm,n = (n m)/n and used n m n and m n. Deﬁning C1 = 12 and C2 = 6C shows the ﬁrst claim. The second claim can be shown by ﬁrst noting that the above inequality implies that their (1 α)-quantiles must satisfy

q2 ,1 α(Xn) q2 ,1 α(X n) τ ϵ

C1 + C2 log 8

=: ρm,n,δ .

We will argue separately for the two cases (q ,1 α(Xn) + q ,1 α(X n)) ρ1/2 m,n,δ and

(q ,1 α(Xn) + q ,1 α(X n)) < ρ1/2 m,n,δ. In the former case, the above inequality implies

|q ,1 α(Xn) q ,1 α(X n)| =

q ,1 α(Xn) q ,1 α(X n) q ,1 α(Xn) + q ,1 α(X n)

q ,1 α(Xn) + q ,1 α(X n)

q2 ,1 α(Xn) q2 ,1 α(X n)

q ,1 α(Xn) + q ,1 α(X n)

1 2 m,n,δ = ρ

1 2 m,n,δ .

When instead (q ,1 α(Xn) + q ,1 α(X n)) < ρ1/2 m,n,δ, we have

|q ,1 α(Xn) q ,1 α(X n)| max(q ,1 α(Xn), q ,1 α(X n))

q ,1 α(Xn) + q ,1 α(X n)

1 2 m,n,δ ,

which combined with the previous inequality shows the second claim.

A.3.2 Proof of Lemma 12

Proof Pick {zi}m i=1 Rd, and deﬁne Zm := {zi}m i=1. Decomposing the test statistic D2(X m Zm ) using a similar approach as in (10), we have

D2(X m Zm ) = m2

n2 D2(X m) + 2

m<i,j n up(zi, zj) , (34)

Liu and Briol

where Sm := Pm i=1 Ti and Ti := Pn j=m+1 up(X i , zj). Similarly, the test statistic computed using the pure sample X n can be written as

D2(X n) = m2

n2 D2(X m) + 2

m<i,j n up(X i , X j) ,

where S m := Pm i=1 T i and T i := Pn j=m+1 up(X i , X j). Using a triangle inequality to bound their diﬀerence,

n2 D2(X m Zm ) D2(X n) = 2Sm + X

m<i,j n up(zi, zj) 2S m X

m<i,j n up(Xi, Xj)

2|Sm| + 2|S m| + X

m<i,j n |up(zi, zj)| + |up(Xi, Xj)|

2|Sm| + 2|S m| + 2(n m)2τ , (35)

where the last line holds since supx,x Rd up(x, x ) τ under the assumed kernel conditions due to Lemma 2. We then bound |Sm| and |S m| with high probability. When EX P [ sp(X ) 2] < , we have EX P [up(X , )] = 0 as argued in the paragraph before (11), so it is straightforward to see that EX i P [Ti] = 0 for all i, and thus Sm is a sum of zero-mean i.i.d. random variables Ti. Therefore,

Var(S2 m) = m EX 1 P T 2 1 = m X

m<j,l n EX 1 P [up(X 1, zj)up(X 1, zl)] m(n m)2τ 2 ,

where the last step holds again since supx,x Rd up(x, x ) τ . For any δ > 0, Chebyshev s inequality implies that the event A1 := {|Sm| m(n m)τ / p

δ/2} occurs with probability at least 1 δ/2. In other words, |Sm| can be upper bounded on the high-probability event A1. A similar argument applied to S m shows that the event A2 := |S m| m(n m)τ / p

δ/2} also occurs with probability at least 1 δ/2. On A := A1 A2, which occurs with probability at least 1 δ, we have by (35) that

D2(X m Zm ) D2(X n) 4 m(n m)τ

δ/2 + 2(n m)2τ

δ + 2(n m)2τ

δn + 2τ ϵ2 m,n =: t2 m,n,δ . (36)

where the second step holds since m n, and in the last line we have substituted ϵm,n = (n m)/n. We claim that this implies that the following holds on A

D(X m Zm ) D(X n) tm,n,δ . (37)

On the Robustness of Kernel Goodness-of-Fit Tests

To see this, we ﬁrst assume D(X m Zm ) + D(X n) tm,n,δ. In this case, (36) implies

D(X m Zm ) D(X n) =

D(X m Zm ) D(X n) D(Xn) + D(X n)

D(X m Zm ) + D(X n)

= D2(X m Zm ) D2(X n) D(X m Zm ) + D(X n) tm,n,δ .

On the other hand, when D(X m Zm ) + D(X n) < tm,n,δ, we have D(X m Zm ) D(X n) max D(X m Zm ), D(X n) D(X m Zm ) + D(X n) < tm,n,δ ,

where the ﬁrst inequality holds since D(X m Zm ), D(X n) are non-negative. Combining these two cases shows (37). It then follows that, for any γ > 0,

Pr X n P D(X m Zm ) > γ

= Pr X n P {D(X m Zm ) > γ} A + Pr X n P {D(X m Zm ) > γ} Ac

Pr X n P D(X n) + tm,n,δ > γ + δ

= Pr X n P D(X n) > γ tm,n,δ + δ ,

where the ﬁrst step follows from the law of total probability, and the second line holds since Pr X n P {D(X m Zm ) > γ} Ac Pr(Ac) δ. This implies one side of the desired inequality

Pr X n P D(X m Zm ) > γ Pr X n P D(X n) > γ

Pr X n P D(X n) > γ tm,n,δ Pr X n P D(X n) > γ + δ

= Pr X n P γ tm,n,δ D(X n) γ + δ

Pr X n P D(X n) γ tm,n,δ + δ ,

where in the last line we have used

Pr X n P γ tm,n,δ D(X n) γ Pr X n P γ tm,n,δ D(X n) γ + tm,n,δ

= Pr X n P D(X n) γ tm,n,δ .

A similar argument shows the other direction

Pr X n P D(X m Zm ) > γ Pr X n P D(X n) > γ

Pr X n P {D(X m Zm ) > γ} A Pr X n P D(X n) > γ

Pr X n P D(X n) > γ + tm,n,δ Pr X n P D(X n) > γ

= Pr X n P γ D(X n) γ + tm,n,δ

Pr X n P γ tm,n,δ D(X n) γ + tm,n,δ

= Pr X n P D(X n) γ tm,n,δ ,

where the ﬁrst step holds by the law of total probability, and the second step follows from (37).

Liu and Briol

A.3.3 Proof of Lemma 13

Proof Fix any θ, γ 0. Let m , n be any positive integers with m fn, where fn = o(n1/2). Deﬁne m = n m . Pick Zm = {zi}m i=1 Rd, and denote the LHS of the inequality by

T(z1, . . . , z m) := Pr X m P,W θ(X m Zm ) > q ,1 α(X m Zm )

Pr X n P,W θ(X n) > q ,1 α(X n) .

The LHS of the inequality can be bounded as

T(z1, . . . , z m) = Pr X m P,W D(X m Zm ) > q ,1 α(X m Zm ) + θ

Pr X n P,W D(X n) > q ,1 α(X n) + θ

Pr X m P,W D(X m Zm ) > q ,1 α(X m Zm ) + θ

Pr X n P,W D(X n) > q ,1 α(X m Zm ) + θ

+ Pr X m P,W D(X n) > q ,1 α(X m Zm ) + θ

Pr X n P,W D(X n) > q ,1 α(X n) + θ

=: T1 + T2 , (38)

where the ﬁrst equality holds since, for any γ 0, it can be checked that the inequality θ = max(0, D(Xn) θ) > γ holds if and only if D(Xn) < γ + θ. We will now bound the terms T1 and T2, respectively. Denote for brevity γz := q ,1 α(X m Zm ) + θ and γ := q ,1 α(X n) + θ. Fix any δ > 0. Applying Lemma 12 with γ = γz yields

T1 Pr X n P,W D(X n) γz tm,n,δ + δ

Pr X n P,W D(X n) γ tm,n,δ + |γz γ | + δ , (39)

where tm,n,δ := 4

δn + 2τ ϵ2 m,n 1/2 is deﬁned in Lemma 12, and (39) follows from a triangle inequality. Moreover, by Lemma 11, there exists an event, say A, with probability at least 1 δ such that, on A,

|γz γ | = q ,1 α(X m Zm ) q ,1 α(X n) τ

1 4m,n n 1 2

C1 + C2 log 8

=: ρm,n,δ .

Combining this with (39) implies that T1 can be bounded as

T1 Pr X n P,W D(X n) γ tm,n,δ + |γz γ | A + Pr(Ac) + δ

Pr X n P,W D(X n) γ tm,n,δ + ρm,n,δ + 2δ , (40)

where the second line holds since Pr(Ac) δ. To bound T2, we ﬁrst note that

T2 = Pr X n P D(X n) > γz Pr X n P D(X n) > γ

= Pr X n P min(γz, γ ) < D(X n) max(γz, γ )

Pr X n P min(γz, γ ) < D(X n) max(γz, γ ) A + Pr(Ac)

Pr X n P D(X n) γ ρm,n,δ + δ , (41)

On the Robustness of Kernel Goodness-of-Fit Tests

where the second line holds since | Pr(Y > a) Pr(Y > b)| = Pr(min(a, b) < Y max(a, b)) for any constants a, b and random variable Y , and the last line holds since Pr(Ac) δ and since on A we have

γ ρm,n,δ γ γz γ min(γz, γ ) max(γz, γ ) γ + γz γ

γ + ρm,n,δ .

Substituting the bounds (40) and (41) into (38) gives

T(z1, . . . , z m) Pr X n P,W D(X n) γ tm,n,δ + ρm,n,δ

+ Pr X n P,W D(X n) γ ρm,n,δ + 3δ

2 Pr X n P,W D(X n) γ tm,n,δ + ρm,n,δ + 3δ , (42)

where the last inequality holds since tm,n,δ 0. Since m fn and fn = o(n1/2) by assumption, we have ϵm,n = m /n fn/n = o(n 1/2). This implies

2 + 2τ ϵ2 m,n 1

which holds since τ , δ are constants, and

1 4m,n n 1 2

C1 + C2 log 8n2(1 2s) = τ

1 4m,n n 1 2

C 1 + C 2 log(n)

where we have deﬁned C 1 := C1 + C2 log 8 and C 2 := 2C2(1 2s), and the last step holds since ϵ1/4 m,n p

log(n) = o n 1/8p

log(n) = o(1). Deﬁne η(m ) := tm,n,δ + ρm,n,δ, where the dependence on m is through m = n m . Then the above derivations show that η(m ) = o(n 1/2) for all m fn. In particular, η(fn) = o(n 1/2). Moreover, substituting these into (42) gives

T(z1, . . . , z m) 2 Pr X n P,W D(X n) γ η(m ) + 3δ

2 Pr X n P,W D(X n) γ η (fn) + 3δ ,

where the last line holds since m fn by assumption and m 7 η(m ) is monotone increasing by direct computation. Taking supremum over Zm Rd and m fn gives

max m fn sup z1,...,zm Rd T(z1, . . . , z m) 2 Pr X n P,W D(X n) γ η (fn) + 3δ

It remains to show that the ﬁrst term on the RHS can be bounded by δ, from which the claimed result in Lemma 13 would follow by redeﬁning δ. To proceed, we write

Pr X n P,W | θ(X n) q ,1 α(X n)| η (fn)

= Pr X n P,W θ(X n) q ,1 α(X n) + η (fn)

Pr X n P,W θ(X n) q ,1 α(Xn) η (fn)

= Pr X n P,W n D(X n) nζn,θ + nη (fn) (44a)

Pr X n P,W n D(X n) nζn,θ nη (fn) (44b)

Liu and Briol

where the ﬁrst equality holds again because | Pr(Y > a) Pr(Y > b)| = Pr(min(a, b) < Y max(a, b)) for any constants a, b and random variable Y , the second equality holds by noting that θ(X n) = max(0, D(X n) θ) t for some t > 0 if and only if D(X n) θ t, and (44) holds by multipling n on both sides of the inequalities within the probabilities and deﬁning ζn,θ := q ,1 α(X n) + θ. Under the assumed conditions on k and assuming EX P [ sp(X) 2] < , Gorham and Mackey (2017, Proposition 1) shows that EX P [up(X , )] = 0, so the V-statistic D2(X n) is degenerate (see, e.g., Serﬂing, 2009, Chapter 6), and classic results on the asymptotics of degenerate V-statistics (Serﬂing, 2009, Theorem 6.4.1 A) shows that n D2(X n) converges weakly to a non-negative distribution. Since the square-root function x 7 x is continuous on [0, ) and continuous functions preserve weak limits by the Continuous Mapping Theorem (Van der Vaart, 2000, Theorem 2.3), the squared-root statistic, n D(X n), also converges weakly to a non-negative distribution. In particular, the scaled quantile nq ,1 α(X n), and thus also nζn,θ, converge to a positive number as n . Also since η (fn) = o(n 1/2) as argued below (43), we have nη (fn) 0. In summary, we have shown that

lim n Pr X n P,W (| θ(X n) q ,1 α(X n)| η (fn))

Pr X n P,W n D(X n) nζn,θ + nη (fn)

Pr X n P,W n D(X n) nζn,θ nη (fn)

= lim n Pr X n P,W n D(X n) nζn,θ Pr X n P,W n D(X n) nζn,θ

which completes the proof.

A.3.4 Proofs for Proposition 14 and Theorem 3

Proof of Proposition 14 For each n and R P(Rd), deﬁne Q = (1 ϵn)P + ϵn R. Let Xn and X n be random samples drawn from Q and P, respectively. As argued in Section A.1, each random variable in Xn can be written as Xi d= (1 ξi)X i + ξi Zi, where Zi R and

ξi Bernoulli(n, ϵn) are independent, and d= denotes equality in distribution. Deﬁne the random variable M = Pn i=1 ξi, then M Binomial(n, ϵn). Pick δ > 0. Given any sequence ϵn = o(n 1/2), there must exists fn = o(n1/2) such that ϵn fn/n and fn as n . Take such f and deﬁne the event B := {M ϵnn fn}. Intuitively, B is the event where the number of outliers M does not deviate from its mean ϵnn by more than fn. Our proof proceeds by ﬁrst showing that B occurs with high probability, then proving that the claimed result holds on this event. To show B occurs with high probability, we use a concentration inequality for Binomial distributions; see, e.g., Chung and Lu (2002, Lemma 2.1, Eq. 2.2). Applying this result to Binomial(n, ϵn), we have

Pr M (Bc) = Pr M M ϵnn > fn exp f2 n 2(ϵnn + fn/3)

=: tn . (45)

On the Robustness of Kernel Goodness-of-Fit Tests

Deﬁne the event A(Xn) = { θ(Xn) > q ,1 α(Xn)} and similarly for A(X n). Using the above inequality to decompose Pr Xn Q,W(A(Xn)) yields

Pr Xn Q,W(A(Xn))

Pr Xn Q,W(A(Xn) B) + Pr(Bc)

m ϵnn+fn Pr Xn Q,W(A(Xn) | M = m ) Pr(M = m ) + Pr(Bc)

m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) Pr(M = m ) + Pr(Bc)

max m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) X

m ϵnn+fn Pr(M = m ) + Pr(Bc)

= max m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) Pr(B) + Pr(Bc)

max m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) + tn ,

where in the last line we have used (45) and that Pr(B) 1. This implies

Pr Xn Q,W(A(Xn)) Pr X n P (A(X n))

max m ϵnn+fn

Pr X n m P, Zm R,W(A(X n m Zm )) Pr X n P (A(X n)) + tn

max m ϵnn+fn sup z1,...,zm Rd

Pr X m P (A(X m {zi}m i=1)) Pr X n P (A(X n)) + tn

=: max m ϵnn+fn ω(m ) + tn ,

where in the second inequality we have taken supremum over all possible values of Zm , and in the last line we have deﬁned

ω(m ) := sup z1,...,zm Rd

Pr X n m P A X n m {zi}m i=1 Pr X n P A(X n) .

Similarly, we have the lower bound

Pr Xn Q,W(A(Xn))

Pr Xn Q,W(A(Xn) B)

m ϵnn+fn Pr Xn Q,W(A(Xn) | M = m ) Pr(M = m )

m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) Pr(M = m )

min m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) X

m ϵnn+fn Pr(M = m )

= min m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) Pr(B) ,

Liu and Briol

where in the second last line we have taken inﬁmum over the values of Zm . Taking inﬁmum over z1, . . . , zm , the last line is lower bounded by

min m ϵnn+fn Pr X n m P, Zm R,W(A(X n m Zm )) Pr(B)

min m ϵnn+fn inf z1,...,zm Pr X n m P (A(X n m {zi}m i=1)) Pr(B)

min m ϵnn+fn inf z1,...,zm Pr X n m P (A(X n m {zi}m i=1)) (1 tn)

= min m ϵnn+fn inf z1,...,zm

Pr X n m P (A(X n m {zi}m i=1))

tn Pr X n m P (A(X n m {zi}m i=1))

min m ϵnn+fn inf z1,...,zm Pr X n m P (A(X n m {zi}m i=1)) tn , (46)

where the second inequality holds since (45) implies Pr(B) 1 tn, and the last line holds as a probability is always smaller than or equal to 1. This implies

Pr Xn Q,W(A(Xn)) Pr X n P (A(X n))

min m ϵnn+fn inf z1,...,zm Pr X n m P A X n m {zi}m i=1 Pr X n P A(X n) tn

max m ϵnn+fn sup z1,...,zm Rd

Pr X n m P A X n m {zi}m i=1 Pr X n P A(X n) tn

= max m ϵnn+fn ω(m ) tn .

We have therefore shown that Pr Xn Q,W(A(Xn)) Pr X n P (A(X n)) max m ϵnn+fn ω(m ) + tn . (47)

It remains to bound the two terms on the RHS of the above inequality. The second term can be bounded by noting that, since ϵn fn/n and fn = o(n1/2), it is clear that tn, deﬁned in (45), converges to 0 as n , so tn δ/2 for suﬃciently large n. To bound the ﬁrst term, since ϵnn + fn 2fn = o(n1/2) and the assumptions in Lemma 13 hold, we can apply Lemma 13 to conclude that there exists n0 such that for any n n0, we have

max m ϵnn+fn ω(m ) < δ

Using these arguments to bound (47), we have shown that, for suﬃciently large n,

Pr Xn Q,W(A(Xn)) Pr X n P (A(X n)) δ

Taking the supremum over Q P(P; ϵn) then implies, for suﬃciently large n,

sup Q P(P;ϵn)

Pr Xn Q,W θ(Xn) > q ,1 α(Xn) Pr X n P,W θ(X n) > q ,1 α(X n) δ .

Since δ > 0 was arbitrary, this shows the claim.

On the Robustness of Kernel Goodness-of-Fit Tests

Proof of Theorem 3 This result immediately follows from Proposition 14 by setting θ = 0, in which case

sup Q P(P;ϵn) Pr Xn Q,W D2(Xn) > q2 ,1 α(Xn) Pr X n P,W D2(X n) > q2 ,1 α(X n)

= sup Q P(P;ϵn) Pr Xn Q,W D(Xn) > q ,1 α(Xn) Pr X n P,W D(X n) > q ,1 α(X n)

= sup Q P(P;ϵn) Pr Xn Q,W θ(Xn) > q ,1 α(Xn) Pr X n P,W θ(X n) > q ,1 α(X n)

where the second line holds by taking the square-root on both sides of the inequalities, and the last equality holds since D(Xn) = θ(Xn) when θ = 0.

A.4 P-Targeted Kernel Stein Discrepancy

This section states preliminary results which will be needed to prove Theorem 4. The main tool we use is the following generalized deﬁnition of KSD, which was originally proposed in Shi and Mackey (2024, Deﬁnition 4).

Deﬁnition 3 (P-KSD). Given P P(Rd) with a Lebesgue density p C1 and a reproducing kernel k C(1,1) b , the P-targeted (Langevin) kernel Stein discrepancy (P-KSD) between two probability measures Q, R P(Rd) is deﬁned as

SP (Q, R) := supf B EX Q[(AP f)(X)] EY R[(AP f)(Y)] , (48)

where B := h = (h1, . . . , hd) : hj Hk and Pd j=1 hj 2 Hk 1 is the unit ball in the d-times Cartesian product Hd k of the RKHS Hk associated with k.

When any input probability measure, say Q, is an empirical measure based on a sample Xn = {xi}n i=1 Rd, we will abuse the notation by writing SP (Q, R) = SP (Xn, R) to emphasize the dependence on Xn. It can be shown that P-KSD is a Maximum Mean Discrepancy (MMD, Fortet and Mourier, 1953; Müller, 1997) with the Stein reproducing kernel up (Barp et al., 2024, Theorem 1). When either of the two arguments coincides with P, the P-KSD reduces to the standard KSD, as we will show in the next section. The main beneﬁt of working with P-KSD rather than KSD is that P-KSD satisﬁes both symmetry and a triangle inequality (Shi and Mackey, 2024, Lemma 1), i.e., SP (Q, R) = SP (R, Q) and SP (Q, R) SP (Q, R ) + SP (R , R), for any Q, R, R P(Rd) for which they are deﬁned. These properties allow us to show several lemmas, which will be crucial in proving the validity of our robust-KSD tests in later sections.

Lemma 15 shows that the standard KSD can be written as a P-KSD.

Lemma 16 shows that SP (Q, R) is equivalent to an MMD, and that it admits a closed form involving expectations over Q and R.

Lemma 17 shows that the P-KSD-projection of a probability measure onto a (standard) KSD-ball centered at P admits a closed-form expression.

Liu and Briol

Lemma 18 restates Tolstikhin et al. (2017, Proposition A.1), which is a Mc Diarmid-type inequality for MMD. It also applies to P-KSD, as an P-KSD is also an MMD. We will use it in Section D to derive a robust-KSD test that is well-calibrated for ﬁnite samples. Other deviation bounds instead of the Mc Diarmid bound could also be used to construct a similar test;see Remark 10 for a discussion.

The MMD has already been studied extensively in the context of both Bayesian and frequentist robust parameter estimation; see, e.g., Briol et al. (2019); Chérief-Abdellatif and Alquier (2020, 2022); Dellaporta et al. (2022); Dellaporta and Damoulas (2023); Alquier and Gerber (2024). A key assumption to ensure robustness in all of these papers is that the kernel is assumed to be bounded, and this is exactly what we are able to achieve with the KSD and P-KSD through Lemma 2.

A.4.1 Properties of P-KSD

Lemma 15 (KSD as P-KSD). Assume k C(1,1) b and EX P [ sp(X) 2] < . For any Q P(Rd) such that D(Q, P) < , we have SP (Q, P) = SP (P, Q) = D(Q, P).

Proof of Lemma 15 This follows directly from the symmetry of P-KSD (Shi and Mackey, 2024, Lemma 1) and that the assumed conditions imply EX P [(AP f)(X)] = 0 for all f B (Gorham and Mackey, 2017, Proposition 1), so one of the two expectations in (48) vanishes when either of the two arguments coincides with P.

Barp et al. (2024, Theorem 1) shows that a KSD is equivalent to an MMD with the Stein kernel up. Since P-KSD is a KSD by Lemma 15, we can conclude that the P-KSD is also an MMD with kernel up, and that P-KSD has a double-expectation form similar to MMD.

Lemma 16 (P-KSD closed-form). Let Q, R P(Rd) and k C(1,1) b . Denote by Hu the RKHS associated with the Stein kernel Hk. If EX Q[up(X, X)] < and EY R[up(Y, Y)] < , then the following two identities hold

S2 P (Q, R) = EX Q[up( , X)] EY R[up( , Y)] 2 Hu = EX,X Q[up(X, X )] + EY,Y R[up(Y, Y )] 2EX Q,Y R[up(X, Y)] .

Proof of Lemma 16 The assumed conditions ensure S2 P (Q, R) is well-deﬁned. Since P-KSD is an MMD with reproducing kernel up, we can apply Gretton et al. (2012, Lemma 4) to conclude the ﬁrst equality, and Gretton et al. (2012, Lemma 6) to yield the second.

Lemma 17 (P-KSD projection). Let R P(Rd) and assume EY R[up(Y, Y)] < . Further assume the conditions in Lemma 15 hold. For any θ > 0,

inf Q BKSD(P;θ) SP (R, Q) = max 0, SP (R, P) θ = max 0, D(R, P) θ .

Proof of Lemma 17 The P-KSD SP (R, P) is well-deﬁned under the assumed conditions, and hence so is SP λR + (1 λ)P, P for any λ [0, 1]. Also since P-KSD is an MMD with (potentially unbounded) reproducing kernel up, the claim then follows by Sun and Zou (2023,

On the Robustness of Kernel Goodness-of-Fit Tests

Proposition 3) and noting that their proof extends directly to potentially unbounded kernels.

Lemma 18 (P-KSD deviation bound). Assume k is a tilted kernel satisfying the conditions in Lemma 2 and P P(Rd) has a density p C1. For any random sample Xn drawn from Q P(Rd) and any α > 0,

SP (Xn, Q) > rτ

Proof of Lemma 18 Let Qn denote the empirical distribution based on Xn. By the ﬁrst equality in Lemma 16, we can write SP (Xn, Q) = EX Qn[up( , X)] EX Q[up( , X)] Hu. Under the assumptions in Lemma 2, the function up( , x) : Rd Hu is continuous for all x Rd. Moreover, up( , x) Hu = up(x, x) by the reproducing property of the Stein kernel up (see also the argument before (30)), and supx Rd up(x, x) < again by Lemma 2. We can hence apply the Mc Diarmid-type inequality for MMD in Tolstikhin et al. (2017, Proposition A.1) to conclude the claimed result.

A.5 Validity of the Bootstrap Approach

We provide a proof for the validity of the bootstrap approach in Theorem 4. We ﬁrst discuss the intuition, and then present two lemmas in Section A.5.1. The proof of Theorem 4 will be presented in Section A.5.2. The bootstrap procedure used in robust-KSD is the same as the one in the standard KSD test, and it is not immediately obvious that this gives a valid decision threshold. For the standard KSD test, the V-statistic D2(Xn) is degenerate under the point null H0 : Q = P, i.e., EX Q[up( , P)] = 0 when Q = P (Liu et al., 2016, Theorem 4.1), and classic bootstrapping methods for degenerate V-statistics (Arcones and Gine, 1992; Huskova and Janssen, 1993) show that the bootstrap quantile q ,1 α is a valid decision threshold. However, the same argument cannot be applied directly to our robust test, because our null HC 0 is a composite one that contains not only P but also other distributions Q = P, in which case D2(Xn) is no longer degenerate as shown by Liu et al. (2016, Theorem 4.1). Nevertheless, our proof shows that q ,1 α is still a correct decision threshold. Intuition of the proof. Let Xn = {Xi}n i=1 be a random sample from some Q P(Rd). We will ﬁrst show that the rejection probability of the robust-KSD test can be bounded as

Pr Xn Q θ(Xn) > q ,1 α(Xn) Pr Xn Q SP (Xn, Q) > q ,1 α(Xn) ,

where SP (Xn, Q) is the P-KSD introduced in Section A.4. We will then prove that the RHS of the above inequality converges to the prescribed test level α as n by showing that q ,1 α(Xn) is a valid bootstrap approximation for the (1 α)-quantile of the distribution of SP (Xn, Q).

Liu and Briol

To see this, we ﬁrst use the closed-form expression for P-KSD (Lemma 16) to write

S2 P (Xn, Q) = EX,X Qn[up(X, X )] 2EX Qn,Y Q[up(X, Y)] + EY,Y Q[up(Y, Y )]

= EX,X Qn[up(X, X ; Q)]

1 i,j n up(Xi, Xj; Q) , (49)

where we have deﬁned

up(x, x ; Q) := up(x, x ) EY Q[up(x, Y)] EY Q[up(Y , x )] + EY,Y Q[up(Y, Y )] .

In other words, S2 P (Xn, Q) is a V-statistic with symmetric function up( , ; Q). Our key observation is that S2 P (Xn, Q) is the second-order term in the Hoeﬀding s decomposition (Arcones and Gine, 1992, Eq. 3.2) of the V-statistic estimate D2(Xn, P) for the standard KSD deﬁned in (2). This suggests that (3) is an appropriate bootstrap sample. Crucially, this argument does not require Q = P, unlike in the proof of the standard KSD test. We formalize this argument in the next section. A naive alternative approach to bootstrap S2 P (Xn, Q) is to note that it is degenerate for any Q and use standard bootstrapping techniques for degenerate V-statistics (e.g., Arcones and Gine, 1992, Theorem 3.5). However, this approach is not applicable here since, to compute the bootstrap sample, it requires the function up(x, x ; Q) to be computable at any x, x , which is not the case since up(x, x ; Q) involves intractable expectations over Q (recall that up has mean zero under P, but not necessarily under Q).

A.5.1 Bootstrapping the P-KSD Estimate

Our proof for Theorem 4 relies on Arcones and Gine (1992, Lemma 3.4), which states that the asymptotic distribution of the second-order term in the Hoeﬀding s decomposition of a V-statistic can be approximated by a bootstrap distribution. We restate this result for the case of S2 P (Xn, Q) in Lemma 19.

Lemma 19 (Arcones and Gine (1992), Lemma 3.4). Let X = {Xi} i=1 be a random sample where Xi Q are independent, and for any n deﬁne Xn := {Xi}n i=1. Let Qn be the empirical measure based on Xn and deﬁne X n = {X i }n i=1, where X i Qn are independent conditionally on Xn. Assume EX Q[up(X, X)2] < . Then, as n , the following holds almost surely,

Pr X n Qn n S2 P (X n, Xn) t | X Pr Xn Q n S2 P (Xn, Q) t 0 .

Using this result, we can prove that a valid bootstrap approximation for the distribution of SP (Xn, Q) can be obtained by using the bootstrap sample DW(Xn) deﬁned in (3). This is summarized in the next lemma.

Lemma 20. Assume EX Q[up(X, X)2] < and let W := (W1, . . . , Wn) be a random vector distributed as Multinomial(n; 1/n, . . . , 1/n). Under the notation in Lemma 19, the following holds as n ,

Pr W n DW(Xn) t | X Pr Xn Q n SP (Xn, Q) t 0 . (50)

On the Robustness of Kernel Goodness-of-Fit Tests

Proof of Lemma 20 As argued in (49), the squared P-KSD S2 P (Xn, Q) can be written as a V-statistic with symmetric function up( , ; Q). Moreover, direct computation shows that EX Q[up( , X; Q)] 0, so the symmetric function up( , ; Q) is Q-degenerate of order 1 (Arcones and Gine, 1992, pp. 5). On the other hand, it is well-known that the weighted bootstrap form can be equivalently written as an Efron s resampled bootstrap statistic (Janssen, 1994; Dehling and Mikosch, 1994; Janssen, 1997), i.e.,

D2 W(Xn) = 1 n2 X

1 i,j n (Wi 1)(Wj 1)up(Xi, Xj)

1 i,j n Wi Wjup(Xi, Xj) Wiup(Xi, Xj) Wjup(Xi, Xj) + up(Xi, Xj)

1 i,j n up(X i , X j) up(X i , Xj) up(Xi, X j) + up(Xi, Xj)

1 i,j n up(X i , X j; Qn)

= S2 P (X n, Xn) ,

where X i and Qn are deﬁned in Lemma 19, the notation d= denotes equality in distribution, the second last line follows by the deﬁnition of up( , ; Qn), and the last line follows from Lemma 16. Under the assumed moment condition, we can apply Lemma 19 together with the above derivation to conclude that a version of (50) with the squared statistics holds, i.e.,

Pr W n D2 W(Xn) t | X Pr Xn Q n S2 P (Xn, Q) t 0 .

The claim then follows by noting that the mapping u 7 u is everywhere continuous on [0, ) and that weak convergence is preserved by continuous function by the Continuous Mapping Theorem (Van der Vaart, 2000, Theorem 2.3).

A.5.2 Proof of Theorem 4

Proof of Theorem 4 We ﬁrst show that EX Q[up(X, X)2] < , so in particular EX Q[up(X, X)] < and we can apply Lemma 17. Deﬁning sp,w(x) = w(x)sp(x), we have

EX Q[up(X, X)2] EX Q

sp,w(x) 2 2h(0) + 2 sp,w(x) 2 w(x) 2h(0) + w(x) 2 2h(0)

+ w(x)2 h(0) 2

4 EX Q sp,w(x) 4 2 h2(0) + 2EX Q sp,w(x) 2 2 w(x) 2 2 h2(0)

+ EX Q w(x) 4 2 h2(0) + EX Q w(x)4 h(0) 2 ,

Liu and Briol

where the second line follows from (29), and the last line holds since (a + b + c + d)2 4(a2 + b2 + c2 + d2) for any a, b, c, d R. The RHS of the last inequality is ﬁnite under the assumed conditions on k. We are now ready to prove the ﬁrst result of our theorem. For any Q BKSD(P; θ) P(Rd; w), we have

Pr Xn Q,W θ(Xn) > q ,1 α(Xn)

= Pr Xn Q,W max(0, D(Xn) θ) > q ,1 α(Xn)

= Pr Xn Q,W inf Q BKSD(P;θ) SP (Xn, Q ) > q ,1 α(Xn)

Pr Xn Q,W SP (Xn, Q) > q ,1 α(Xn) , (51)

where the ﬁrst equality holds by Lemma 17 and noting that D(Xn) is equivalent to the KSD between the empirical measure based on Xn and P, and the last line holds since Q BKSD(P; θ). To show the ﬁrst claim of our theorem, it suﬃces to show that the RHS of (51) converges to α as n . Deﬁning the following bootstrapping error

δn := Pr Xn Q,W SP (Xn, Q) > q ,1 α(Xn) Pr W DW(Xn) > q ,1 α(Xn) | X ,

we can write the RHS of (51) as

Pr Xn Q,W SP (Xn, Q) > q ,1 α(Xn) = Pr W DW(Xn) > q ,1 α(Xn) | X + δn = α + δn , (52)

where the last step holds since q ,1 α(Xn) is the (1 α)-quantile of the conditional distribution of DW(Xn) given X . Moreover, since we have shown that EX Q[up(X, X)2] < , we can apply Lemma 20 to conclude that δn 0 as n . Hence, the probability on the LHS of the above equation converges to α. Combining with (51), we have proved that limn Pr Xn Q,W θ(Xn) > q ,1 α(Xn) α. Taking supremum over all Q BKSD(P; θ) P(Rd; w) shows the ﬁrst claim of our theorem. Now assume Q BKSD(P; θ) and Q P(Rd; w), so in particular θ D(Q, P) < 0. We have

Pr Xn Q,W θ(Xn) > q ,1 α(Xn))

= Pr Xn Q,W D(Xn) θ > q ,1 α(Xn) (53)

= Pr Xn Q,W n D2(Xn) D2(Q, P) > n(q ,1 α(Xn) + θ)2 n D2(Q, P) , (54)

where (53) holds since it can be checked that θ(Xn) > γ if and only if D(Xn) θ > γ for any γ 0. The argument in Liu et al. (2016, Theorem 4.1) shows that that n(D2(Xn) D2(Q, P)) converges weakly to a Gaussian limit assuming EX Q[up(X, X )2] < , which holds since (30) and Jensen s inequality imply

EX,X Q[up(X, X )2] EX,X Q[up(X, X)up(X , X )] = EX Q[up(X, X)] 2

EX Q[up(X, X)2] ,

On the Robustness of Kernel Goodness-of-Fit Tests

which is ﬁnite as shown before. On the other hand, the weak convergence of n(D2(Xn) D2(Q, P)) also implies q ,1 α(Xn) 0 as n , so when D(Q, P) > θ,

lim n (q ,1 α(Xn) + θ)2 D2(Q, P) = θ2 D2(Q, P) < 0 .

Combing these arguments shows that in the probability in (54), the LHS converges to a non-degenerate distribution, while the RHS tends to . This implies (54) converges to 1, thus proving the second claim.

Remark 9. If the bootstrapped quantile q ,1 α(Xn) is replaced by the quantile q 1 α(Xn) of the distribution of the P-KSD SP (Xn, Q), then the bootstrap approximation error in (52) vanishes, i.e., δn = 0 for all n. In that case, we have the following stronger, uniform level control

lim sup n sup Q BKSD(P;θ) P(Rd;w) Pr Xn Q,W θ(Xn) > q 1 α(Xn) α .

In fact, the inequality holds for any ﬁnite n, since the RHS of (52) equals to α for all n.

A.6 Connections Between KSD Balls and Contamination Models

This section provides proofs for the results in Section 4.2.

Section A.6.1 states and proves Lemma 21, an intermediary result which provides a decomposition of the KSD under Huber s contamination models.

Section A.6.2 proves Proposition 5 using Lemma 21.

Section A.6.3 proves Proposition 6.

Section A.6.4 provides an example of using Proposition 6 to bound the KSD between tand Gaussian distributions.

A.6.1 Statement and Proof of Lemma 21

Lemma 21. Let ϵ [0, 1] and Q = (1 ϵ)P + ϵR, for any R P(Rd) such that EY R[up(Y, Y)1/2] < . Assume k C(1,1) b and EX P [ sp(X) 2] < . Then D(Q, P) = ϵD(R, P).

Proof of Lemma 21 The assumption EY R[up(Y, Y)1/2] < implies that D(R, P) is well-deﬁned by Gorham and Mackey (2017, Proposition 2). Moreover, when k C(1,1) b and EX P [ sp(X) 2] < , Gorham and Mackey (2015, Proposition 1) shows that EX P [up(X, )] = 0. Using this and the linearity of expectation,

D2(Q, P) = (1 ϵ)2EX,X P [up(X, X )] + 2(1 ϵ)ϵEX P,Y R[up(X, Y)]

+ ϵ2EY,Y R[up(Y, Y )]

= (1 ϵ)2D2(P, P) + ϵ2D2(R, P)

= ϵ2D2(R, P) .

Liu and Briol

Taking square-root of both sides gives the desired result.

A.6.2 Proof of Proposition 5

Proof The proof follows a similar approach as Chérief-Abdellatif and Alquier (2022, Lemma 3.3). Under the assumed kernel conditions, Lemma 2 shows that supx,x Rd up(x, x ) τ < , which implies D2(Q, P) = EX,X Q[up(X, X )] τ < for any probability measure Q P(Rd). In particular, D(Q, P) is well-deﬁned for any Q P(Rd) and the conditions in Lemma 21 are met. For any Q = (1 ϵ)P + ϵR with Q P(P; ϵ0), applying Lemma 21 shows that

D(Q, P) = ϵD(R, P) ϵ0D(R, P) ϵ0τ

To show that this bound is tight, we can lower bound the KSD as

sup Q P(P;ϵ0) D(Q, P) sup z Rd,ϵ [0,ϵ0] D (1 ϵ)P + ϵδz, P = sup z Rd,ϵ [0,ϵ0] ϵD(δz, P)

= ϵ0 sup z Rd up(z, z) 1 2

where the ﬁrst inequality holds since R = δz is only one possible type of perturbation, the ﬁrst equality follows from Lemma 21, the second equality holds because the supremum over ϵ is reached at ϵ0, and the last step holds as supz Rd up(z, z) = τ by deﬁnition. Combining with the upper bound (55), we have shown that sup Q P(P;ϵ0) D(Q, P) = ϵ0τ 1/2 .

A.6.3 Proof of Proposition 6

Under the assumed conditions on k, Lemma 2 shows that the Stein kernel is bounded by τ = supx Rd up(x, x) < , so in particular D(Q, P) is well-deﬁned for all Q P(Rd). Under the assumed integrability condition, we have EX P [up(X, )] = 0 as argued in the paragraph before (11). We can therefore rewrite the squared KSD as

D2(Q, P) = Z

Rd up(x, x )q(x)q(x ) dx dx

Rd up(x, x )(q(x) p(x))(q(x ) p(x )) dx dx ,

Using the assumed bound |q(x) p(x)| δ(x) and Lemma 2, we can bound the RHS of the above line by Z

Rd up(x, x )δ(x)δ(x ) dx dx Z

Rd |up(x, x )|δ(x)δ(x ) dx dx

Rd δ(x) dx 2 = τ δ2 0 .

Taking the square-root of both sides implies D(Q, P) τ 1/2 δ0.

On the Robustness of Kernel Goodness-of-Fit Tests

A.6.4 KSD Balls and Fat Tails

The next proposition serves as an example for how Proposition 6 can be applied to design a robust KSD test when the model is Gaussian but data are drawn from t-distributions that are moment-matched to the model.

Proposition 22. Let P = N(0, 1) and let Qν = tν p

(ν 2)/ν, where tν is the t-distribution with degree-of-freedom (dof) ν > 2. Denote by p and qν their probability density functions, and by F and Fν their cumulative distribution functions, respectively. Then p(x) and qν(x) have exactly two intersections a1 < a2 on (0, ). If furthermore k satisﬁes the conditions in Lemma 2, then KSD(Qν, P) 4τ 1/2 (Fν(a1) F (a1) + F (a2) Fν(a2)).

This result suggests that, to ensure quantitative robustness against the scaled t-distribution with ν degrees-of-freedom, we can choose the uncertainty radius in the robust-KSD test to be θ = 4τ 1/2 (Fν(a1) F (a1) + F (a2) Fν(a2)). The intersection points a1, a2 do not have a closed form; instead, we approximate them by numerical solvers, which is trivial for this one-dimensional problem, and the numerical error is negligible as evidenced empirically in Figure 4. Proof of Proposition 22 For ν > 2, the change-of-variable formula shows that the scaled tdistribution Qν has density function q(x) = Zνq ν(x), where Zν := Γ( ν+1

2) , q ν(x) := (1 + x2 ν 2) (ν+1)/2, and Γ( ) is the Gamma function (see, e.g., Forbes et al., 2011, Chapter 8.1). To show qν(x) intersects with p(x) at exactly two points on (0, ), we ﬁrst show that qν(x) is strong super-Gaussian (Palmer et al., 2010), i.e., x 7 log qν( x) is convex on [0, ). This is immediate by writing

log qν( x) = log Zν ν + 1

2 log 1 + x ν 2

and noting that u 7 log(1 + u) is concave on [0, ). Palmer et al. (2010, Theorem 2) shows that any symmetric and strongly super-Gaussian density belongs to the class of density cross inequalities (DC+), which are symmetric densities that cross a Gaussian density of equal variance exactly four times on R, and take higher values than that normal density at x = 0 and as x . Since in this case both Qν and P have unit variance, we conclude that qν and p intersect at exactly four points on R. By symmetry of these densities about x = 0, this shows that qν and p must intersect at exactly two points on (0, ). Call these two intersections a1, a2 and assume without loss of generality 0 < a1 < a2. It then follows that qν(x) p(x) on [0, a1) [a2, ) and qν(x) p(x) on [a1, a2). We therefore have

R |p(x) qν(x)| dx

0 (qν(x) p(x)) dx + 2 Z a2

a1 (p(x) qν(x)) dx + 2 Z

a2 (qν(x) p(x)) dx

= 2(Fν(a1) F (a1)) + 2 F (a2) F (a1) (Fν(a2) Fν(a1))

+ 2 (1 Fν(a2)) (1 F (a2))

= 2(Fν(a1) F (a1) + F (a2) F (a1) Fν(a2) + Fν(a1) + F (a2) Fν(a2))

= 4(Fν(a1) F (a1) + F (a2) Fν(a2)) ,

Liu and Briol

0 500 1000 1500 2000 Sample size n

Rejection rate

0.4 0.45 0.5 0.55 0.6 0.7 α = 0.05

Figure 8: Rejection probability (in log scale) against sample size under the univariate contaminated Gaussian model with outlier z = 10. The contamination ratio scales as ϵn = n r for diﬀerent values of r.

10 1 100 101 102 Outlier z

Rejection rate

0.01 0.1 1.0 10.0 100.0 med KSDAgg

Figure 9: Rejection probability for an IMQ kernel with bandwidth λ. med is the median heuristic. KSDAgg is the test of Schrab et al. (2022). The dashed line shows α = 0.05. All tests all reject H0 for large z.

where in the second line we have again used the symmetry of normal and t densities. Applying Proposition 6 completes the proof.

Appendix B. Complementary Experiments

This section includes additional experimental results. Section B.1 discusses the rate requirement in Theorem 3 and demonstrates that the KSD test is no longer qualitatively robust when this condition is not met. Section B.2 reviews the robust MMD tests included in Section 5.1 and presents implementation details. Section B.4 includes an ablation study on the choice of kernel bandwidths in the robust-KSD test. Section B.6 studies its scalability with dimension. Section B.8 compares robust-KSD using two diﬀerent bootstrap methods: the weighted bootstrap used throughout this work, and a wild bootstrap that is more prevalent in the kernel testing literature.

B.1 Decay Rate of Contamination Ratio

We show empirically that the rate requirement in the qualitative robustness result Theorem 3 is not an artifact of the proof, i.e., the tilted-KSD test is no longer qualitatively robust to Huber s contamination models with contamination ratio ϵn if ϵn = n r for any r 1/2, where n is the sample size. We run the standard KSD test with a tilted kernel under the contaminated Gaussian model as described in Section 5.1 with dimension d = 1 and outlier z = 10. The contamination ratio is chosen to be ϵn = n r for diﬀerent choices of r, and the probability of rejection as n grows is shown in Figure 8. The results are averaged over 400 repetitions instead of 100 repetitions as in Section 5.1 to reduce numerical uncertainty. When r > 0.5, the rejection probability converges to the prescribed test level α, which is the limit of the rejection probability without contamination. This aligns with Theorem 3. However, this rejection probability no longer converges to α when r 0.5, thus showing that the tilted-KSD test is no longer qualitatively robust.

On the Robustness of Kernel Goodness-of-Fit Tests

B.2 Implementation Details for the Robust MMD Tests

This section provides implementation details for the two robust MMD-based tests included in Section 5.1. Both tests use the Maximum Mean Discrepancy (MMD; Gretton et al., 2012, Deﬁnition 2) between the data-generating distribution Q and the model P as a measure of their disparity. Given a reproducing kernel k : Rd Rd R, the squared MMD between Q and P are deﬁned as

D2 MMD(Q, P) = EX,X Q[k(X, X )] + EY,Y P [k(Y, Y )] 2EX Q,Y P [k(X, Y)] .

Given independent samples Xn = {Xi}n i=1 from Q and Ym = {Yj}m j=1 from P, the MMD can be estimated directly by D2 MMD(Qn, Pm), where Qn and Pm are the empirical measures formed by Xn and Ym, respectively (Gretton et al., 2012). The computational cost of such estimate scales with sample size n as O((n + m)2 + m Csim), where Csim is the cost of simulating one datum from P, compared with O(n2 + n Cscore) for KSD, where Cscore denotes the cost of a single score evaluation. In our experiment in Section 5.1, P is a Gaussian model, so both simulation and score evaluation are trivial and Csim, Cscore are both small. Hence, we set m = n so that the costs of MMD and of KSD are comparable. Moreover, we will use the squared-exponential kernel k(x, x ) = exp( x x 2 2/(2γ2)) and set γ2 using the median heuristic. Squared-exponential kernels are popular for MMDs and are also used in the original papers that proposed the two robust MMD tests (Sun and Zou, 2023; Schrab and Kim, 2024).

B.2.1 The MMD-Dev Test

The robust MMD test of Sun and Zou (2023, Eq. 38), which we refer to as MMD-Dev, targets the hypotheses

HMMD 0 : Q BMMD(P; θMMD) , HMMD 1 : Q BMMD(P; θMMD) , (56)

where θMMD 0 and BMMD(P; θMMD) = {Q P : DMMD(Q, P) θMMD} is the MMD ball centered at P and with radius θMMD. Given a test level α (0, 1), MMD-Dev rejects HMMD 0 if DMMD(Qn, Pn) > θMMD + γn, where γn = p

2K/n(1 + log α). Sun and Zou (2023, Theorem 4) shows that this test controls the Type-I error for any ﬁnite n, and is asymptotically optimal against certain alternatives. The decision threshold γn is derived using a Mc Diarmid-type deviation inequality for MMDs due to Gretton et al. (2012, Theorem 8). In our experiments, we chose θMMD = ϵ0

2. This ensures that BMMD(P; θMMD) contains Huber s contamination models with contamination ratios up to ϵ0, which we show in the next result. Notably, although we can compare the MMD-Dev test with our robust-KSD test on Huber s contamination models, they are in general not directly comparable, since they target diﬀerent null hypotheses.

Lemma 23 (MMD balls and Huber s model). Let k be a reproducing kernel with 0 < k(x, x ) K < , for all x, x Rd. For any ϵ0 [0, 1], if θMMD = ϵ0

2K, then P(P; ϵ0) BMMD(P; θMMD), where P(P; ϵ0) is the Huber s contamination model deﬁned in (6).

Liu and Briol

Proof of Lemma 23 For any Q P(Rd), we deﬁne its kernel mean embedding (Berlinet and Thomas-Agnan, 2004, Chapter 4) as µQ( ) := EX Q[k(X, )], which is well-deﬁned since k is bounded. By Sriperumbudur et al. (2010, Theorem 1), the MMD between any Q, P P(Rd) can be equivalently written as DMMD(Q, P) = µQ µP Hk, where Hk is the RKHS associated with k. Pick ϵ0 [0, 1] and R P. For any ϵ [0, ϵ0],

DMMD((1 ϵ)P + ϵR, P) = µ(1 ϵ)P+ϵR µP Hk = (1 ϵ)µP + ϵµR µP Hk = ϵ µR µP Hk = ϵDMMD(R, P) ,

where the second equality holds due to the linearity of the expectation operator. Moreover,

D2 MMD(R, P) = EX,X R[k(X, X )] + EY,Y P [k(Y, Y )] 2EX R,Y P [k(X, Y)]

EX,X R[k(X, X )] + EY,Y P [k(Y, Y )]

where the last line holds since supx,x Rd k(x, x ) K by assumption. This shows that DMMD((1 ϵ)P + ϵR, P) ϵ

2K, so the claim holds.

B.2.2 The dc MMD Test

The dc MMD test (Schrab and Kim, 2024) targets diﬀerent hypotheses. Given ϵ0 [0, 1], the null assumes that the observed sample Xn is generated by ﬁrstly drawing i.i.d. random variables from some probability measure, then corrupting them by replacing at most a proportion of ϵ0 of the sample by arbitrary values. The hypotheses can be formalized as follows:

Hdc MMD 0 : At least (1 ϵ0) n random variables in Xn are i.i.d. from P.

Hdc MMD 1 : Otherwise .

The dc MMD test rejects Hdc MMD 0 if DMMD(Qn, Pn) > q MMD α + 2ϵ0

2K, where q MMD α is the empirical quantile of B permutation samples {T b MMD}B b=1, and each T b MMD is computed by (i) randomly permuting Xn Yn, (ii) partitioning the permuted set into two subsets Xb n and Yb n of size n, and (iii) computing T b MMD := DMMD(Qb n, P b n), where Qb n and P b n are the empirical measures based on Xb n and Yb n, respectively. We use B = 500, which is the default setup in Schrab and Kim (2024). The dc MMD test gives stronger calibration guarantee than Huber s contamination model because it controls contamination proportion no larger than ϵ0 for any realization of the sample Xn, rather than in expectation as required by Huber s model. However, it is not necessarily stronger than KSD-balls or MMD-balls, since it does not account for the case where all samples are under mild perturbation.

B.3 Approximating the Supremum of the Stein Kernel

As described in Section 4.2, in our experiments we approximated the intractable supremum τ = supx Rd up(x, x) by maxi=1,...,n up(xi, xi). We now study how this approximation

On the Robustness of Kernel Goodness-of-Fit Tests

Sample size n

Estimated Truth

0.0 0.2 0.4 Contamination ratio ϵ

Rejection rate

25 50 100 500 1000 α = 0.05

Figure 10: Left. Estimated τ using the trick described in Section 4.2 compared with the ground-truth; the estimate becomes more accurate with larger samples. Right. Rejection probability of robust-KSD under an outlier-contaminated Gaussian model; the Type-I error remains calibrated even for small sample sizes.

aﬀects the performance of the robust-KSD test. We run the same contaminated Gaussian experiment in Section 5.1 with d = 1 and outlier location z = 10. For this model, x 7 up(x, x) is a simple, univariate function, so we can compute the ground-truth τ accurately by numerical optimization. Clearly, the ground truth τ is equivalent to taking n = . This is evidenced by the left plot in Figure 10, which shows that the ﬁnite-sample approximation maxi=1,...,n up(xi, xi) becomes more accurate as n increases. The right plot of Figure 10 shows the rejection probability of robust-KSD with θ set to control at most ϵ0 = 0.05 proportion of contamination. Remarkably, even when this approximation under-estimates τ (which is the case for small n), robust-KSD still remains well-calibrated. This is not surprising: the supremum τ represents the maximal contribution of a single datum x taking arbitrary values in the entire sample space Rd, regardless of whether it is present in the observed data set. However, in practice, it suﬃces to control the contribution of any single datum in the observed data set.

B.4 Ablation Study for Kernel Bandwidth in the Standard KSD Test

We evaluate how the choice of the kernel bandwidth aﬀects the performance of the standard KSD test. We run the same contaminated Gaussian experiment as in the previous subsection. The standard KSD test uses an IMQ kernel with bandwidth λ2 Λ {λ2 med}, where Λ = {0.01, 0.1, 1, 10, 100} and λ2 med is the bandwidth chosen by median heuristic. We also include KSDAgg, which also uses an IMQ kernel for a fair comparison. As shown in Figure 9, all standard IMQ-KSD tests reject the point null with high probability for large outlier values, regardless of the bandwidth value. This suggests that no ﬁxed bandwidth can ensure robustness, which is consistent with Theorem 1, and is because the Stein kernel is unbounded regardless of the choice of λ, thus the non-robustness issue persists. The KSDAgg test is even more sensitive to contamination than the IMQ-KSD tests. As noted in Section 5.1, this is because KSDAgg is designed to optimally combine multiple bandwidths to boost the test power against all alternatives, including contaminated models.

B.5 Gaussian Mean-Shift Experiment

We run a Gaussian mean-shift experiment to demonstrate the performance of our robust test against model deviations other than contamination. We use a Gaussian model N(0, Id) in

Liu and Briol

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

Rejection rate

IMQ-KSD Tilted-KSD R-KSD α = 0.05

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

Figure 11: Probability of rejection against the mean-shift µ0 under a Gaussian model in d = 50 dimensions. Black dash-dot line. Uncertainty radius θ, which is set to be the KSD value D(Qµ0, P) corresponding to diﬀerent values of µ0.

dimension d = 50, and draw data from Qµ0 = N(µ0e1, Id), where e1 = (1, 0, . . . , 0) Rd

and µ0 R is a mean-shift. The uncertainty radius θ is chosen to be the Tilted-KSD value corresponding to µ0 = 0, 0.2 and 0.6, respectively. The purpose of this experiment is to demonstrate that the robust-KSD test is well-calibrated when D(Qµ0, P) θ and consistent when D(Qµ0, P) > θ. As shown in Figure 11, the standard tests reject with probability higher than robust-KSD. This is again expected since the standard tests are not robust to contamination. For mean-shift values not greater than the black vertical line, Qµ0 BKSD(P; θ), so we are under the null hypothesis and robust-KSD rejects no more frequently than the level, showing that it is well-calibrated. For larger µ0, robust-KSD rejects with probability approaching one, thus showing its power. Moreover, when µ0 = 0 so that θ = 0, robust-KSD becomes identical to the standard Tilted-KSD test, showing that robust-KSD is indeed a generalization of the standard test.

B.6 Scalability with Dimension

We run the Gaussian mean-shift experiment in Section B.5 in diﬀerent dimensions; other experimental setups remain unchanged. The uncertainty radius for R-KSD is chosen to be the (non-squared) KSD value corresponding to mean-shift µ = 0.3. Results are reported in Figure 12. As dimension increases, both the two standard tests and R-KSD have declining power in correctly rejecting for large values of µ. This shows that both the standard and the robust KSD tests suﬀer from the curse-of-dimensionality, a known issue for kernel-based tests (Huang et al., 2023; Reddi et al., 2015; Ramdas et al., 2015). Unsurprisingly, this issue is more prominent for the R-KSD test. This is because the KSD-ball BKSD(P; θ) could potentially include more distributions in higher dimensions.

B.7 Diﬀerent Contamination Distributions

In most of our experiments in Section 5, the data-generating distribution takes the form Q = (1 ϵ0)P + ϵ0R with the contamination distribution R = δz being a Dirac delta taking a single value at z. We now investigate how the choice of R aﬀects the empirical performance of R-KSD. Importantly, our results in Section 4 make no assumption on the form of R. We set P = N(0, 1) and generate data from Q = (1 ϵ)P +ϵN(10, σ2) with varying σ > 0. Following Proposition 5, we choose θ = ϵ0τ 1/2 , and ﬁx the contamination tolerance to be

On the Robustness of Kernel Goodness-of-Fit Tests

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

Rejection rate

IMQ Tilted R-KSD α = 0.05

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

0.0 0.5 1.0 1.5 2.0 Mean shift µ0

Figure 12: Probability of rejection against the mean shift under a Gaussian model in various dimensions. Grey dotted line. Test level α = 0.05. Black dash-dot line. Uncertainty radius θ, which is set to be the KSD value corresponding to µ = 0.3.

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Rejection rate

IMQ-KSD Tilted-KSD R-KSD α = 0.05

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

0 5 10 15 20 Standard deviation σ

Figure 13: Mixture-of-Gaussian experiment. Left. Probability of rejection under diﬀerent standard deviations for the contamination component. Right. KSD estimates. Grey dotted line. Test level α = 0.05. Black dash-dot line. Uncertainty radius θ = ϵ0τ 1/2 , with τ estimated following Section 4.2.

ϵ0 = 0.05. Since Proposition 5 holds for all R, we expect R-KSD to remain (asymptotically) valid in this setting.

Results are shown in Figure 13. For larger σ, all tests, including our R-KSD, saw a exhibit reduced test power. This is because, when σ is large, the noise component R becomes more dispersed, making it harder for KSD to detect discrepancies. This can be conﬁrmed by the rightmost plot, which shows that the KSD between Q and P decreases with σ. Crucially, our R-KSD test is able to control the Type-I error regardless of the value of σ.

B.8 Weighted Bootstrap and Wild Bootstrap

We compare the robust-KSD test using the weighted bootstrap described in Section 2.2 (equivalent to the Efron s bootstrap) against the wild bootstrap due to Leucht and Neumann (2013). For the standard KSD test, weighted bootstrap was used in Liu et al. (2016), while the wild bootstrap is more popular in the literature (Chwialkowski et al., 2016; Schrab et al., 2022; Liu et al., 2023). Compared with weighted bootstrap, the wild approach is more ﬂexible as it can work for dependent samples (Chwialkowski et al., 2014), and theoretical guarantees on the power of the standard test were only proved with the wild bootstrap (Schrab et al., 2022). In this work, we have used the weighted approach since we can then leverage existing theoretical results to show the validity of the test; see Section A.5.1. This is for simplicity rather than necessity.

Liu and Briol

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Rejection rate

R-KSD R-KSD-Wild α = 0.05

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Figure 14: Probability of rejection under an outlier-contaminated Gaussian model using the weighted bootstrap and the wild bootstrap. Black dash-dot vertical line. Uncertainty radius θ = ϵ0τ 1/2 , with τ estimated following Section 4.2.

We numerically compare these two bootstrap methods using the contaminated Gaussian model in Section 5.1 and show that the use of weighted bootstrap does not negatively impact the test. Figure 14 shows the rejection probability of the robust-KSD test using the weighted bootstrap (R-KSD) and wild bootstrap (R-KSD-Wild). The uncertainty radius is chosen to be θ = ϵ0τ 1/2 for diﬀerent values of ϵ0, following Section 5.1. In particular, ϵ0 = 0 corresponds to setting θ = 0, in which case robust-KSD reduces to the standard KSD. As evident from the plot, these two bootstrap methods produce almost identical results regardless of the value of θ.

Appendix C. Extension to U-statistics

In this work, we have focused primarily on using the V-statistic (2) for estimating the KSD. An alternative estimator commonly used in the literature (Liu et al., 2016; Jitkrittum et al., 2017; Schrab et al., 2022; Liu et al., 2023) is the following U-statistic 1 n(n 1) P 1 i =j n up(Xi, Xj). In this section, we discuss brieﬂy how to extend our key results to this U-statistic estimate.

Extending the qualitative (non-)robustness results: The main ingredients for proving Theorem 1 are (i) a deviation bound for the V-statistic D2(Xn) as shown in Lemma 7, and (ii) a concentration bound for the bootstrapped quantile q2 ,1 α(Xn) as shown in Lemma 8. A U-statistic counterpart of both results can be shown by following the same proof technique and replacing D2(Xn) with the U-statistic. For example, with the U-statistic, a decomposition similar to (10) can be derived by summing over only the non-diagonal terms i = j and replacing normalizing factors of the form 1/n2

to 1/(n(n 1)). Similarly, the proof of Theorem 3 can be extended to U-statistics by following the same steps in Section A.3.

Extending the robust-KSD test: To adapt the robust-KSD test to the U-statistic, we ﬁrst note that rejecting the null if θ(Xn) > q ,1 α is equivalent to rejecting the null if D2(Xn) > (q ,1 α + θ)2, as argued in (53) and the paragraph thereafter. A natural approach is therefore to replace the V-statistic D2(Xn) by the U-statistic, and the bootstrap quantile q ,1 α by the square-root of the bootstrap quantile formed with the U-statistic. However, since the bootstrap samples (3) based on U-statistics can take negative values, the bootstrap quantile can also take negative values, rendering its square-root q ,1 α undeﬁned. One solution is to never reject the null when this happens. This has little impact when D2(Q, P) is large, since the bootstrap samples

On the Robustness of Kernel Goodness-of-Fit Tests

are then likely to take positive values. However, when D2(Q, P) 0, the bootstrap samples are more prone to taking negative values, thus making the test conservative.

Appendix D. A Non-Asymptotically Valid Robust KSD Test

The robust-KSD test introduced in Section 4 is only well-calibrated when n . In this appendix, we now derive a robust test that is well-calibrated with ﬁnite samples. An immediate consequence is that a stronger, uniform Type-I error control can be achieved. The test rejects the null HC 0 in (8) if

θ(Xn) > γn ,

where γn = p

2τ (log α)/n, the constant τ is deﬁned in Lemma 2, and θ(Xn) is deﬁned in (9). We call this test robust-KSD-Dev (R-KSD-Dev). The decision threshold γn is based on the concentration bound Lemma 18, which is a deviation bound of the Mc Diarmid s type (Mc Diarmid et al., 1989). This test can be viewed as a counterpart of the robust MMD test of Sun and Zou (2023), which is also constructed using Mc Diarmid s inequality. The next result, proved at the end of this section, shows its ﬁnite-sample validity under HC 0 as well as its consistency under HC 1 .

Theorem 24. Let X = {Xi} i=1 be a sequence of independent random variables following Q. Suppose k is a tilted kernel satisfying the conditions in Lemma 2, and let θ 0. Then

1. Under HC 0 , for any n, it holds that sup Q BKSD(P;θ) Pr Xn Q θ(Xn) > γn α.

2. Under HC 1 , it holds that Pr Xn Q θ(Xn) > γn 1, as n .

Remark 10 (Alternative deviation bounds). Alternative deviation bounds to Lemma 18 can also be used to construct similar tests. More precisely, Theorem 24 holds for any threshold γn that satisﬁes

Pr Xn Q(SP (Xn, Q) > γn) α ,

where SP (Xn, Q) = EX Qn[up( , X)] EX Q[up( , Y)] Hu is a Hilbert-space norm, as shown in Lemma 16. Thus, any deviation bound for Hilbert-space norms may be applied. Examples include another Mc Diarmid s bound of Gretton et al. (2012, Theorem 7), the empirical Bernstein bounds of Wolfer and Alquier (2022, Theorem A.1) and Martinez-Taboada and Ramdas (2024, Corollary 1), as well as the Hilbert-space valued Hoeﬀding bound of Pinelis (1994, Theorem 3.5), particularly its i.i.d. variant (Chatalic et al., 2022, Lemma E.1).

Remark 11 (Justiﬁcation for Mc Diarmid-type bound). We opted for the Mc Diarmid-type bound in Lemma 18 because, in our setting, it is tighter than the bounds from Gretton et al. (2012), Wolfer and Alquier (2022) and Chatalic et al. (2022). In particular, while Wolfer and Alquier (2022) claim their empirical Bernstein bound outperforms Mc Diarmid s inequality, we ﬁnd that this only holds for their bound tailored to translation-invariant kernels (Wolfer and Alquier, 2022, Theorem 3.1), but not for the non-translation-invariant version (Wolfer and Alquier, 2022, Theorem A.1), which our setting requires since we have assumed Lemma 2. Moreover, although the bound in Lemma 18 is worse than the empirical Bernstein bound of Martinez-Taboada and Ramdas (2024), we ﬁnd that the diﬀerence is only marginal, and the former is considerably easier to implement.

Liu and Briol

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Rejection rate

IMQ-KSD Tilted-KSD R-KSD R-KSD-Dev dc MMD MMD-Dev α = 0.05

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

0.0 0.1 0.2 0.3 0.4 Contamination ratio ϵ

Figure 15: Rejection probability under an outlier-contaminated Gaussian model. Black dashdot vertical line. Uncertainty radius θ = ϵ0τ 1/2 , with τ estimated following Section 4.2.

In Figure 15, we compare R-KSD-Dev with the bootstrap-based robust-KSD test, the robust MMD test and the standard tests under the outlier-contaminated Gaussian model in Section 5.1. We set the uncertainty radius to be θ = ϵ0τ 1/2 for various values of ϵ0, where τ is estimated following Section 4.2. As expected, R-KSD-Dev controls Type-I error against Huber s contamination with maximal proportion ϵ0. However, when ϵ > ϵ0, it has lower power compared with robust-KSD and the MMD-based tests. This is because R-KSD-Dev uses a deviation bound to construct the decision threshold, which is a uniform bound over all data-generating distributions and is therefore conservative. For the same reason, we expect tests constructed using the alternative deviation bounds discussed in Remark 10 to suﬀer from similar conservativeness. Since the robust-KSD-Dev test has ﬁnite-sample guarantee on the Type-I error, it might be preferable to the bootstrap-based robust-KSD test when the sample size is small. On the other hand, the conservative nature of the robust-KSD-Dev test suggests it might only be eﬀective for identifying the most aberrant behaviors. Proof of Theorem 24 Proceeding as in the proof of Theorem 4, for any Q BKSD(P; θ), we have the inequality

Pr Xn Q θ(Xn) > γn Pr Xn Q SP (Xn, Q) > γn ,

By the deviation bound in Lemma 18, the RHS is bounded by α, thus proving the ﬁrst claim. We now ﬁx Q BKSD(P; θ) so that we are under HC 1 . Since θ(Xn) = max(D(Xn) θ, 0) D(Xn) θ, we have

Pr Xn Q θ(Xn) > γn Pr Xn Q D(Xn) θ > γn

= Pr Xn Q n D(Xn) D(Q, P) > n γn + θ D(Q, P) .

The same argument in Section A.5.2 shows that n D(Xn) D(Q, P) converges weakly to a Gaussian distribution. On the other hand, since D(Q, P) > θ under HC 1 and γn 0 as n , the term n γn + θ D(Q, P) . Therefore, the probability in the last line converges to one, thus proving the second claim.

On the Robustness of Kernel Goodness-of-Fit Tests

Radoslaw Adamczak. A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20(none):1 13, 2015.

Pierre Alquier and Mathieu Gerber. Universal robust regression via maximum mean discrepancy. Biometrika, 111(1):71 92, 2024.

Matias Altamirano, François-Xavier Briol, and Jeremias Knoblauch. Robust and scalable Bayesian online changepoint detection. In International Conference on Machine Learning, pages 642 663, 2023.

Matias Altamirano, François-Xavier Briol, and Jeremias Knoblauch. Robust and conjugate Gaussian process regression. In International Conference on Machine Learning, pages 1155 1185, 2024.

Alan Nawzad Amin, Eli N. Weinstein, and Debora Susan Marks. A kernelized Stein discrepancy for biological sequences. In International Conference of Machine Learning, pages 718 767, 2023.

Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E. Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, Lester Mackey, Chris J. Oates, Gesine Reinert, and Yvik Swan. Stein s method meets computational statistics: A review of some recent developments. Statistical Science, 38(1): 120 139, 2023.

Miguel A. Arcones and Evarist Gine. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655 674, 1992.

Krishnakumar Balasubramanian, Tong Li, and Ming Yuan. On the optimality of kernelembedding based goodness-of-ﬁt tests. Journal of Machine Learning Research, 22(1):1 45, 2021.

Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32:12964 12976, 2019.

Alessandro Barp, Carl-Johann Simon-Gabriel, Mark Girolami, and Lester Mackey. Targeted separation and convergence with kernel discrepancies. Journal of Machine Learning Research, 25(378):1 50, 2024.

Jerome Baum, Heishiro Kanagawa, and Arthur Gretton. A kernel Stein test of goodness of ﬁt for sequential models. In International Conference on Machine Learning, pages 1936 1953. PMLR, 2023.

Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, New York, 2004.

Kenneth L. Blackard, Theodore S. Rappaport, and Charles W. Bostian. Measurements and models of radio frequency impulsive noise for indoor wireless communications. IEEE Journal on selected areas in communications, 11(7):991 1001, 1993.

Liu and Briol

François-Xavier Briol, Alessandro Barp, Andrew B. Duncan, and Mark Girolami. Statistical inference for generative models with maximum mean discrepancy. ar Xiv:1906.05944, pages 1 57, 2019.

Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714 720, 2006.

Antoine Chatalic, Nicolas Schreuder, Lorenzo Rosasco, and Alessandro Rudi. Nyström kernel mean embeddings. In International Conference on Machine Learning, pages 3006 3024. PMLR, 2022.

Badr-Eddine Chérief-Abdellatif and Pierre Alquier. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Advances in Approximate Bayesian Inference (AABI), pages 1 21, 2020.

Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Finite sample properties of parametric MMD estimation: robustness to misspeciﬁcation and dependence. Bernoulli, 28(1):181 213, 2022.

Kyung Hyun Cho, Tapani Raiko, and Alexander Ilin. Gaussian-Bernoulli deep Boltzmann machine. In International Joint Conference on Neural Networks (IJCNN), pages 1 7. IEEE, 2013.

Fan Chung and Linyuan Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6(2):125 145, 2002.

Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of ﬁt. In International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606 2615, 2016.

Kacper P Chwialkowski, Dino Sejdinovic, and Arthur Gretton. A wild bootstrap for degenerate kernel tests. Advances in Neural Information Processing Systems, 27, 2014.

Anand G. Dabak and D.H. Johnson. Geometrically based robust detection. Conference on Information Sciences and Systems, Johns Hopkins University, Baltimore, pages 73 77, May 1994.

Ralph B D Agostino. Goodness-of-Fit Techniques. Routledge, 1986.

Herold Dehling and Thomas Mikosch. Random quadratic forms and the bootstrap for U-statistics. Journal of Multivariate Analysis, 51(2):392 413, 1994.

Charita Dellaporta and Theodoros Damoulas. Robust Bayesian Inference for Berkson and Classical Measurement Error Models. ar Xiv:2306.01468, 2023.

Charita Dellaporta, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In International Conference on Artiﬁcial Intelligence and Statistics, pages 943 970, 2022.

On the Robustness of Kernel Goodness-of-Fit Tests

Gerardo Duran-Martin, Matias Altamirano, Alexander Y. Shestopaloﬀ, Jeremias Knoblauch, Matt Jones, François-Xavier Briol, and Kevin Murphy. Outlier-robust Kalman ﬁltering through generalised Bayes. In International Conference on Machine Learning, pages 12138 12171, 2024.

Michael Fauß and Abdelhak M Zoubir. Old bands, new tracks revisiting the band model for robust hypothesis testing. IEEE Transactions on signal Processing, 64(22):5875 5886, 2016.

Michael Fauß, Abdelhak M. Zoubir, and H. Vincent Poor. Minimax robust detection: Classic results and recent advances. IEEE Transactions on Signal Processing, 69:2252 2283, 2021.

Tamara Fernandez, Nicolas Rivera, Wenkai Xu, and Arthur Gretton. Kernelized Stein discrepancy tests of goodness-of-ﬁt for time-to-event data. In International Conference on Machine Learning, pages 3112 3122. PMLR, 2020.

Asja Fischer and Christian Igel. Training restricted Boltzmann machines: An introduction. Pattern Recognition, 47(1):25 39, 2014.

Catherine Forbes, Merran Evans, Nicholas Hastings, and Brian Peacock. Statistical Distributions. John Wiley & Sons, 4 edition, 2011.

Robert Fortet and Edith Mourier. Convergence de la répartition empirique vers la répartition théorique. Annales scientiﬁques de l École Normale Supérieure, 3e série, 70(3):267 285, 1953.

Benoît Frénay and Michel Verleysen. Classiﬁcation in the presence of label noise: A survey. IEEE transactions on neural networks and learning systems, 25(5):845 869, 2013.

Kenji Fukumizu, Arthur Gretton, Gert Lanckriet, Bernhard Schölkopf, and Bharath K Sriperumbudur. Kernel choice and classiﬁability for RKHS embeddings of probability distributions. Advances in Neural Information Processing Systems, 22, 2009.

Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using Wasserstein uncertainty sets. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Ruize Gao, Feng Liu, Jingfeng Zhang, Bo Han, Tongliang Liu, Gang Niu, and Masashi Sugiyama. Maximum mean discrepancy test is aware of adversarial attacks. In International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3564 3575. PMLR, 18 24 Jul 2021.

Gene H. Golub and Charles F. van Loan. Matrix Computations. JHU Press, fourth edition, 2013. ISBN 1421407949 9781421407944.

Jackson Gorham and Lester Mackey. Measuring sample quality with Stein s method. Advances in Neural Information Processing Systems, 28, 2015.

Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292 1301. PMLR, 2017.

Liu and Briol

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012.

Gökhan Gül and Abdelhak M Zoubir. Robust hypothesis testing with α-divergence. IEEE Transactions on Signal Processing, 64(18):4737 4750, 2016.

Robert Hafner. Construction of minimax-tests for bounded families of probability-densities. Metrika, 40(1):1 23, 1993.

Omar Hagrass, Bharath Sriperumbudur, and Bing Li. Spectral regularized kernel two-sample tests. The Annals of Statistics, 52(3):1076 1101, 2024a.

Omar Hagrass, Bharath K. Sriperumbudur, and Bing Li. Spectral regularized kernel goodnessof-ﬁt tests. Journal of Machine Learning Research, 25(309):1 52, 2024b.

Omar Hagrass, Bharath Sriperumbudur, and Krishnakumar Balasubramanian. Minimax optimal goodness-of-ﬁt testing with kernel Stein discrepancy. ar Xiv preprint ar Xiv:2404.08278, 2025.

Frank R. Hampel. The inﬂuence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383 393, 1974.

Robert V. Hogg, Elliot A. Tanis, and Dale L. Zimmerman. Probability and Statistical Inference, volume 993. Macmillan New York, 1977.

Kevin H. Huang, Xing Liu, Andrew Duncan, and Axel Gandy. A high-dimensional convergence theorem for U-statistics with applications to kernel-based testing. In Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 3827 3918, 2023.

Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73 101, 1964.

Peter J. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, pages 1753 1758, 1965.

Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. John Wiley & Sons, 2011.

Jonathan Huggins and Lester Mackey. Random feature Stein discrepancies. Advances in Neural Information Processing Systems, 31, 2018.

Marie Huskova and Paul Janssen. Consistency of the generalized bootstrap for degenerate U-statistics. The Annals of Statistics, 21(4):1811 1823, 1993.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695 709, 2005.

Yu I Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the L_p metrics. Theory of Probability & Its Applications, 31(2):333 337, 1987.

On the Robustness of Kernel Goodness-of-Fit Tests

Yuri I. Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Math. Methods Statist, 2(2):85 114, 1993.

Paul Janssen. Weighted Bootstrapping of U-Statistics. Journal of Statistical Planning and Inference, 38(1):31 41, 1994. ISSN 0378-3758.

Paul Janssen. Bootstrapping U-statistics. South African Statistical Journal, 31(2):185 216, 1997.

Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, Kenji Fukumizu, and Arthur Gretton. A linear-time kernel goodness-of-ﬁt test. In International Conference on Neural Information Processing Systems, pages 261 270. Curran Associates Inc., 2017.

Heishiro Kanagawa, Arthur Gretton, and Lester Mackey. Controlling moments with kernel Stein discrepancies. ar Xiv preprint ar Xiv:2211.05408, 2022.

S. Kassam. Robust hypothesis testing for bounded classes of probability densities (corresp.). IEEE Transactions on Information Theory, 27(2):242 247, 1981.

Oscar Key, Arthur Gretton, François-Xavier Briol, and Tamara Fernandez. Composite goodness-of-ﬁt tests with kernels. Journal of Machine Learning Research, 26(51):1 60, 2025.

A.N. Kolmogorov. Sulla determinazione empirica di una legge didistribuzione. Giorn Dell inst Ital Degli Att, 4:89 91, 1933.

Diane Lambert. Qualitative robustness of tests. Journal of the American Statistical Association, 77(378):352 357, 1982.

Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38 53, 1973.

Erich Leo Lehmann and Joseph P Romano. Testing Statistical Hypotheses. Springer Cham, fourth edition, 2022.

Anne Leucht and Michael H. Neumann. Dependent wild bootstrap for degenerate Uand V-statistics. Journal of Multivariate Analysis, 117:257 280, 2013. ISSN 0047-259X.

Bernard C. Levy. Robust hypothesis testing with a relative entropy tolerance. IEEE Transactions on Information Theory, 55(1):413 421, 2008.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-ﬁt tests. In International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 276 284, 2016.

Xing Liu, Andrew B. Duncan, and Axel Gandy. Using perturbation to improve goodness-ofﬁt tests based on kernelized Stein discrepancy. In International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 21527 21547, 2023.

Diego Martinez-Taboada and Aaditya Ramdas. Empirical Bernstein in smooth Banach spaces. ar Xiv preprint ar Xiv:2409.06060, 2024.

Liu and Briol

Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, and Chris J Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):997 1022, 2022.

Colin Mc Diarmid et al. On the method of bounded diﬀerences. Surveys in combinatorics, 141(1):148 188, 1989.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429 443, 1997.

Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79 (3):695 718, 2017.

Frédéric Ouimet. Explicit formulas for the joint third and fourth central moments of the multinomial distribution. ar Xiv preprint ar Xiv:2006.09059, 2020.

Jason A. Palmer, Ken Kreutz-Delgado, and Scott Makeig. Strong Suband Super-Gaussianity. In Vincent Vigneron, Vicente Zarzoso, Eric Moreau, Rémi Gribonval, and Emmanuel Vincent, editors, Latent Variable Analysis and Signal Separation, pages 303 310, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-15995-4.

Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679 1706, 1994.

Marc Postman, John Peter Huchra, and Margaret J Geller. Probes of large-scale structure in the corona borealis region. Astronomical Journal (ISSN 0004-6256), vol. 92, Dec. 1986, p. 1238-1247., 92:1238 1247, 1986.

Yu V Prokhorov. Convergence of random processes and limit theorems in probability theory. Theory of Probability & Its Applications, 1(2):157 214, 1956.

Yichen Qin and Carey E Priebe. Robust hypothesis testing via Lq-likelihood. Statistica Sinica, pages 1793 1813, 2017.

Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In the AAAI Conference on Artiﬁcial Intelligence, volume 29, 2015.

Sashank Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the high dimensional power of a linear-time two sample test under mean-shift alternatives. In Artiﬁcial Intelligence and Statistics, pages 772 780. PMLR, 2015.

Helmut Rieder. Qualitative robustness of rank tests. The Annals of Statistics, 10(1):205 211, 1982.

Giorgio Rizzoni and PS Min. Detection of sensor failures in automotive engines. IEEE Transactions on Vehicular Technology, 40(2):487 500, 1991.

On the Robustness of Kernel Goodness-of-Fit Tests

Kathryn Roeder. Density estimation with conﬁdence sets exempliﬁed by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411):617 624, 1990.

Antonin Schrab and Ilmun Kim. Robust kernel hypothesis testing under data corruption. ar Xiv preprint ar Xiv:2405.19912, 2024.

Antonin Schrab, Benjamin Guedj, and Arthur Gretton. KSD aggregated goodness-of-ﬁt test. Advances in Neural Information Processing Systems, 35:32624 32638, 2022.

Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Benjamin Guedj, and Arthur Gretton. MMD aggregated two-sample test. Journal of Machine Learning Research, 24(194):1 81, 2023.

Robert J Serﬂing. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 2009.

Xiaofeng Shao. The dependent wild bootstrap. Journal of the American Statistical Association, 105(489):218 235, 2010.

Abhishek B Sharma, Leana Golubchik, and Ramesh Govindan. Sensor faults: Detection methods and prevalence in real-world datasets. ACM Transactions on Sensor Networks (TOSN), 6(3):1 39, 2010.

Jiaxin Shi and Lester Mackey. A ﬁnite-particle convergence rate for stein variational gradient descent. Advances in Neural Information Processing Systems, 36, 2024.

Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517 1561, 2010.

Zhongchang Sun and Shaofeng Zou. Kernel robust hypothesis testing. IEEE Transactions on Information Theory, 2023.

Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18(86):1 47, 2017.

Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Li K Wenliang and Heishiro Kanagawa. Blindness of score-based methods to isolated components and mixing proportions. ar Xiv preprint ar Xiv:2008.10087, 2020.

Geoﬀrey Wolfer and Pierre Alquier. Variance-aware estimation of kernel mean embedding. ar Xiv preprint ar Xiv:2210.06672, 2022.

George Wynne, Mikołaj Kasprzak, and Andrew B. Duncan. A spectral representation of kernel Stein discrepancy with application to goodness-of-ﬁt tests for measures on inﬁnite dimensional Hilbert spaces. ar Xiv preprint ar Xiv:2206.04552, 2022.

Liu and Briol

Wenkai Xu and Takeru Matsuda. A Stein goodness-of-ﬁt test for directional distributions. In International Conference on Artiﬁcial Intelligence and Statistics, pages 320 330. PMLR, 2020.

Wenkai Xu and Takeru Matsuda. Interpretable Stein goodness-of-ﬁt tests on Riemannian manifold. In International Conference on Machine Learning, pages 11502 11513. PMLR, 2021.

Wenkai Xu and Gesine Reinert. A Stein goodness-of-test for exponential random graph models. In International Conference on Artiﬁcial Intelligence and Statistics, pages 415 423. PMLR, 2021.

Jiasen Yang, Qiang Liu, Vinayak Rao, and Jennifer Neville. Goodness-of-ﬁt testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, pages 5561 5570. PMLR, 2018.

Jiasen Yang, Vinayak Rao, and Jennifer Neville. A Stein Papangelou goodness-of-ﬁt test for point processes. In International Conference on Artiﬁcial Intelligence and Statistics, pages 226 235. PMLR, 2019.

Pengfei Yang and Biao Chen. Robust Kullback-Leibler divergence and universal hypothesis testing for continuous distributions. IEEE Transactions on Information Theory, 65(4): 2360 2373, 2018.

Liangwei Zhang, Jing Lin, Bin Liu, Zhicong Zhang, Xiaohui Yan, and Muheng Wei. A review on deep learning applications in prognostics and health management. Ieee Access, 7:162415 162438, 2019.

Mingtian Zhang, Oscar Key, Peter Hayes, David Barber, Brooks Paige, and François-Xavier Briol. Towards healing the blindness of score matching. ar Xiv preprint ar Xiv:2209.07396, 2022.