# Sequential Kernel Goodness-of-fit Testing

Zhengyu Zhou 1, Weiwei Liu 1

1 School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China. Correspondence to: Weiwei Liu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. Goodness-of-fit testing, a classical statistical tool, has been extensively explored in the batch setting, where the sample size is predetermined. However, practitioners often prefer methods that adapt to the complexity of a problem rather than fixing the sample size beforehand. Classical batch tests are generally unsuitable for streaming data, as valid inference after data peeking requires multiple-testing corrections, resulting in reduced statistical power. To address this issue, we delve into the design of consistent sequential goodness-of-fit tests. Following the principle of testing by betting, we reframe this task as selecting a sequence of payoff functions that maximize the wealth of a fictitious bettor, betting against the null in a repeated game. We conduct experiments to demonstrate the adaptability of our sequential test across varying difficulty levels of problems while maintaining control over type-I errors.

1. Introduction

Goodness-of-fit tests are fundamental tools in statistical analysis, dating back to the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1948). Given observations Z sampled from a distribution q, we aim to test the null hypothesis that q matches a reference or target distribution p. Traditional goodness-of-fit measures, such as the Kolmogorov-Smirnov statistic (Kolmogorov, 1933; Smirnov, 1948) and the Cramér-von Mises criterion, are primarily applicable to univariate random variables. Gorham and Mackey propose the Stein discrepancy (Gorham & Mackey, 2015), a measure of sample quality relative to a target: a maximum discrepancy between empirical sample expectations and target expectations over a large class of test functions. It is constructed to have zero expectation under the target distribution through a Stein operator, which depends only on the derivative of log p. The Kernel Stein Discrepancy (KSD) has gained substantial attention in non-parametric goodness-of-fit testing (Liu et al., 2016; Chwialkowski et al., 2016; Jitkrittum et al., 2017; Baum et al., 2023; Liu et al., 2023).

In the existing literature, significant attention has been devoted to batch testing, which deals with predetermined sample sizes. When the sample distribution deviates from the target, the sample size needed to detect such discrepancies is not known a priori. In cases where test results show promise but remain inconclusive, such as when a p-value slightly exceeds the chosen significance level, gathering additional data and repeating the study becomes necessary. Traditional batch tests are not designed to deal with these challenges. Therefore, this paper focuses on sequential tests that allow the observed data to be scrutinized continually, with a decision at each step either to stop and reject the null hypothesis or to continue with data collection.

Problem Setup. Suppose there is a continuous stream of data denoted as $\{Z_t\}_{t \ge 1}$, where $Z_t \in \mathcal{Z}$ and $Z_t \overset{\mathrm{iid}}{\sim} q$.
Our objective is to investigate whether the distribution $q$ aligns with a known reference or target distribution $p$. In our formulation, the target distribution $p$ is assumed to be known only up to a normalization constant. We aim to design sequential tests for the following hypotheses:
$$H_0: Z_t \overset{\mathrm{iid}}{\sim} q, \; t \ge 1, \text{ and } q = p, \tag{1a}$$
$$H_1: Z_t \overset{\mathrm{iid}}{\sim} q, \; t \ge 1, \text{ and } q \ne p. \tag{1b}$$
Following the test-of-power-one framework (Darling & Robbins, 1968), we define a level-$\alpha$ sequential test as a mapping $\Phi: \cup_{t=1}^{\infty} \mathcal{Z}^t \to \{0, 1\}$ that satisfies
$$P_{H_0}\left(\exists\, t \ge 1: \Phi(Z_1, \ldots, Z_t) = 1\right) \le \alpha. \tag{2}$$
The output 0 stands for "do not reject the null yet and continue sampling", while 1 means "reject the null and stop". Additionally, defining the stopping time $\tau := \inf\{t \ge 1: \Phi(Z_1, \ldots, Z_t) = 1\}$ as the first time the test outputs 1, a level-$\alpha$ sequential test must satisfy
$$P_{H_0}(\tau < \infty) \le \alpha. \tag{3}$$
We highlight the primary limitations of existing tests that our new method addresses.

Limitations of Corrected Batch Tests. Using batch tests without corrections for multiple testing results in an inflated false-alarm rate under continuous monitoring (see Appendix A.1). Naive Bonferroni corrections restore type-I error control but generally result in low-power tests. As a result, it is necessary to design sequential tests directly rather than to correct batch tests. As a motivating illustration, we perform valid sequential goodness-of-fit tests for the target density $p(x) \propto 1/(1 + x^2)$ and observations $Z_t = \beta X_t + (1 - \beta) Y_t$, where $X_t \overset{\mathrm{iid}}{\sim} N(0, 1)$ and $Y_t \overset{\mathrm{iid}}{\sim} p$ are independent. We consider 11 values $\beta \in \{0.5, 0.55, \ldots, 1.0\}$, and for each $\beta$ we repeat the simulation 100 times. We use the Gaussian kernel $k(x, y) = \exp(-|x - y|^2/2)$ for all testing procedures. In this simulation, we compare two goodness-of-fit testing approaches: 1. the KSD-based SKGT proposed in this work (Algorithm 2); 2. the batch KSD test adapted for continuous monitoring via a Bonferroni correction. Monitoring is enabled after processing every $n$ new points from $q$, with $n \in \{10, 100\}$; that is, at the $i$-th monitoring time the bootstrapped p-value (computed over 1000 wild bootstrap samples) is compared with the rejection threshold $\alpha_i = \alpha/(i(i+1))$, $i = 1, 2, \ldots$

As illustrated in Figure 1, our test exhibits a lower average sample requirement than the other methods. Additionally, as $\beta$ approaches 1, indicating a less challenging task, the number of samples needed to reject the null hypothesis decreases. This property is not observed for the other tests in this example.

Figure 1. Average stopping time of different tests for continuous monitoring. Batch + n-step represents the batch KSD tests with Bonferroni correction applied every n steps.

Sequential tests serve as complementary tools to batch tests and are not designed to replace them. We consider two scenarios to highlight this point. When there are 2,000 data points, recourse is limited if a batch test fails to reject the null hypothesis. However, if a sequential test fails to reject, analysts can collect more data and continue testing while retaining type-I error control. Conversely, when dealing with 2 million data points, the batch KSD test may be time-consuming due to the bootstrap procedure. If the alternative hypothesis is true and the signal is strong, a sequential test may reject the null within 200 samples and terminate. In essence, the capacity of sequential tests to continuously collect and analyze data proves advantageous, particularly in challenging situations.

1.1. Related Work

The principle of testing by betting can be traced back to Ville's 1939 doctoral thesis (Ville, 1939) and has recently been popularized by Shafer (2021). Shafer (2021) primarily focuses on parametric and simple hypotheses, which is distinct from our setting. The studies most closely related to the current paper include (Shekhar & Ramdas, 2023; Shaer et al., 2023; Podkopaev et al., 2023; Grünwald et al., 2023; Podkopaev & Ramdas, 2023), which also address non-parametric hypotheses. Notably, Shekhar and Ramdas utilize testing by betting to design sequential non-parametric two-sample tests (Shekhar & Ramdas, 2023), including a state-of-the-art sequential kernel maximum mean discrepancy test. To the best of our knowledge, this paper is the first attempt to apply the principle of testing by betting to goodness-of-fit testing.

2. Sequential Kernel Goodness-of-fit Testing

This section summarizes the principle of testing by betting (Shafer, 2021; Shafer & Vovk, 2019). Given a sequence of random variables $\{Z_t\}_{t \ge 1}$ with $Z_t \in \mathcal{Z}$, a player begins with initial wealth $K_0 = 1$. At round $t$ of the game, the player selects a payoff function $f_t: \mathcal{Z} \to [-1, 1]$ that satisfies $E_{Z \sim P_Z}[f_t(Z) \mid \mathcal{F}_{t-1}] = 0$ for all $P_Z \in H_0$, where $\mathcal{F}_{t-1} = \sigma(Z_1, \ldots, Z_{t-1})$ is the $\sigma$-algebra generated by $\{Z_s: 1 \le s \le t-1\}$. Additionally, the player bets a fraction $\lambda_t K_{t-1}$ of the wealth for an $\mathcal{F}_{t-1}$-measurable $\lambda_t \in [-1, 1]$. Once $Z_t$ is revealed, the player's wealth is updated as
$$K_t = K_{t-1}\left(1 + \lambda_t f_t(Z_t)\right). \tag{4}$$
A level-$\alpha$ sequential test is then obtained using the stopping rule $\Phi(Z_1, \ldots, Z_t) = \mathbb{1}\{K_t \ge 1/\alpha\}$, i.e., the null hypothesis is rejected once the player's wealth exceeds $1/\alpha$. Under the null hypothesis, the imposed constraints on the payoff sequence $\{f_t\}_{t \ge 1}$ and betting fractions $\{\lambda_t\}_{t \ge 1}$ prevent the player from making a profit; formally, the wealth process $\{K_t\}_{t \ge 0}$ is a non-negative martingale. The validity of the resulting test is established through Ville's inequality (Ville, 1939). To ensure that the resulting test has power under the alternative hypothesis, payoffs and betting fractions have to be selected carefully. Inspired by the sequential two-sample tests proposed by Shekhar & Ramdas (2023), our approach relies on the KSD (Chwialkowski et al., 2016; Liu et al., 2016), which has a variational representation.

Kernel Stein Discrepancy. Let $\mathcal{G}$ be the reproducing kernel Hilbert space (RKHS) of real-valued functions on $\mathbb{R}^d$ with reproducing kernel $k(\cdot, \cdot)$ and inner product $\langle \cdot, \cdot \rangle_{\mathcal{G}}$. Similarly, let $\mathcal{G}^d$ denote the product RKHS consisting of elements $f := (f_1, \ldots, f_d)$ with $f_i \in \mathcal{G}$, equipped with the standard inner product $\langle f, g \rangle_{\mathcal{G}^d} = \sum_{i=1}^{d} \langle f_i, g_i \rangle_{\mathcal{G}}$. We define a Stein operator $\mathcal{T}_p$ acting on $f \in \mathcal{G}^d$ as
$$(\mathcal{T}_p f)(x) := \sum_{i=1}^{d} \left( \frac{\partial \log p(x)}{\partial x_i} f_i(x) + \frac{\partial f_i(x)}{\partial x_i} \right).$$
The operator can be expressed through a function that depends on the kernel and the gradient of the log-density:
$$\xi_p(x, \cdot) := \nabla \log p(x)\, k(x, \cdot) + \nabla k(x, \cdot), \tag{5}$$
where the gradient is taken with respect to the variable $x$. Thus, the expected inner product of $\xi_p(x, \cdot)$ with $f$ equals the expected value of the Stein operator,
$$E_{Z \sim q}[\mathcal{T}_p f(Z)] = \langle f, E_{Z \sim q}\, \xi_p(Z, \cdot) \rangle_{\mathcal{G}^d} = \sum_{i=1}^{d} \langle f_i, E_{Z \sim q}\, \xi_{p,i}(Z, \cdot) \rangle_{\mathcal{G}},$$
where $\xi_{p,i}(x, \cdot)$ is the $i$-th component of $\xi_p(x, \cdot)$. Furthermore, for samples from the target distribution, $E_{X \sim p}(\mathcal{T}_p f)(X) = 0$, which can be verified using integration by parts.
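To make the zero-mean property concrete, the following minimal sketch (our illustration, not code from the paper) checks it by Monte Carlo. The standard normal target on R^2 and the test function f(x) = tanh(x) are our own choices for this example; under p the Stein operator averages to roughly zero, while under a shifted q it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

def grad_log_p(x):
    # Standard normal target: log p(x) = -||x||^2 / 2 + const, so grad log p(x) = -x.
    return -x

def f(x):
    # A simple smooth vector-valued test function f: R^d -> R^d.
    return np.tanh(x)

def div_f(x):
    # Divergence of f: sum_i d f_i / d x_i = sum_i sech^2(x_i) for f = tanh.
    return np.sum(1.0 / np.cosh(x) ** 2, axis=-1)

def stein_operator(x):
    # (T_p f)(x) = <grad log p(x), f(x)> + div f(x).
    return np.sum(grad_log_p(x) * f(x), axis=-1) + div_f(x)

X = rng.standard_normal((200_000, d))          # samples from the target p
Z = rng.standard_normal((200_000, d)) + 1.0    # samples from a mean-shifted q != p

print("E_p[(T_p f)(X)] ~", stein_operator(X).mean())  # close to 0
print("E_q[(T_p f)(Z)] ~", stein_operator(Z).mean())  # bounded away from 0
```

The first average is near zero by the integration-by-parts argument above, while the second is not, and this gap over the unit ball of test functions is exactly what the KSD defined next measures.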
We define the KSD as: Sp(q) := sup f Gd 1 EZ q Tpf(Z) EX p(Tpf)(X) = sup f Gd 1 EZ q Tpf(Z) = sup f Gd 1 f, EZ qξp(Z, ) Gd = Eqξp(Z, ) Gd To develop powerful goodness-of-fit tests, it is imperative that the function class Gd is sufficiently expressive to guarantee Sp(q) > 0 when p = q. The following theorem established by Chwialkowski et al. (2016) affirms that the KSD can be used to distinguish between two distributions. Before presenting the theorem, it is necessary to define the following: hp(x, y) := log p(x) log p(y)k(x, y) + log p(y) xk(x, y) + log p(x) yk(x, y) + xk(x, ), yk( , y) Gd , where the last term can be written as Pd i=1 2k(x,y) Theorem 2.1. (Chwialkowski et al., 2016, Theorem 2.2) Let p, q be probability measures that Z q. If the kernel k is C0-universal (Carmeli et al., 2010, Definition 4.1), EZ qhp(Z, Z) < , and EZ q log p(Z) then Sp(q) = 0 if and only if q = p. Examples of C0-universal kernels on Rd include the Gaussian, Laplacian, inverse multiquadrics, and Mat ern class, among others. KSD-based Sequential Kernel Goodness-of-fit Testing. An element g Gd that achieves the supremum in (6), often referred to as the witness function , can be regarded as the test function in Gd that best distinguishes q from p. Thus, we consider the payoffs f(Zt) of the following form: s g, ξp(Zt, ) Gd = s ( g(Zt), log p(Zt) + g(Zt)), (8) where g = Pd i=1 g xi denotes the divergence of g, and the scaling factor s > 0 ensures that f(z) [ 1, 1], for any z Rd. Substituting g in (8) with the witness function g , we denote the resulting function as the oracle payoff f . Let the oracle wealth process {K t }t 0 be defined using f and the betting fraction, as shown below: λ = EZ q[f (Z)] EZ q[f (Z)] + EZ q[(f (Z))2]. (9) We have the following result regarding the oracle payoff function and betting fraction in (9). Theorem 2.2. Let G be the RKHS constructed from a C0-universal kernel and Gd be the corresponding product RKHS: 1. Under H0 in (1a), any payoff function of the form (8) satisfies EH0[f(Z)] = 0. 2. Under H1 in (1b), the oracle payoff function f based on the witness function g satisfies EH1[f (Z)] > 0. Further, for λ defined in (9), it holds that EH1[log(1+ λ f (Z))] > 0. Hence, K t a.s. + , which implies that the oracle test is consistent: PH1(τ < ) = 1, where τ = inf{t 1 : K t 1/α}. Sequential Kernel Goodness-of-fit Testing Remark 2.1. While the betting fraction (9) suffices to guarantee the consistency of the corresponding test, the fastest growth rate of the wealth process is ensured by considering λ best arg max λ [ 1,1] EZ q[log(1 + λf (Z))]. However, casually selecting the betting fraction may result in the wealth tending to zero almost surely, as exemplified in (Podkopaev et al., 2023, Example 2). Therefore, to construct a practical test, we must replace the oracle f and λ with predictable estimates {ft}t 1 and {λt}. This indicates that they are computed using data observed prior to a given round of the game. Assumption 2.3. Suppose the kernel k is C0-universal, non-negative and satisfies: supx Rd k(x, x) 1 and supx Rd Pd i=1 2k(x,y) Assumption 2.4. Suppose that the target distribution p satisfies: supz Rd log p(z) 2 1, where v 2 := (Pd i=1 v2 i )1/2 denotes the ℓ2-norm. The C0-universality of kernel k in Assumption 2.3 ensures that Sp(q) > 0, when p = q. The boundedness requirement in Assumption 2.3 is fulfilled by several commonly used kernels, including the Gaussian and inverse multiquadrics with appropriate scaling. 
Assumption 2.3 together with Assumption 2.4 guarantees the boundedness of the payoff function. Payoff Function ft. Considering the KSD s variational formulation, the witness function adopts a closed form: g = EZ qξp(Z, ) EZ qξp(Z, ) Gd . (10) The oracle payoff f (Zt) based on KSD is given by: 1 2 g ,ξp(Zt, ) Gd = 1 2( g (Zt), log p(Zt) + g (Zt)), (11) with the form (8) and s = 1/2. Therefore, to construct the test, we use estimators {ft}t 1 of the oracle payoff function f obtained by replacing g in (11) with the plugin estimator: ˆgt = EZ ˆqt 1ξp(Z, ) EZ ˆqt 1ξp(Z, ) Gd , (12) where ˆqt 1 is the empirical distribution of {Zs}t 1 s=1 and EZ ˆqt 1ξp(Z, ) i=1 log p(Zi)k(Zi, ) + k(Zi, ) Algorithm 1 Online Newton Step (ONS) strategy for selecting betting fractions Input: sequence of payoffs {ft(Zt)}t 1, λONS 1 = 0, a0 = 1. for t = 1, 2, . . . do Observe ft(Zt); Set zt = ft(Zt)/(1 λONS t ft(Zt)); Set at = at 1 + z2 t ; Set λONS t+1 := 1 2 2 2 log 3 zt at λONS t ; end for Algorithm 2 KSD-based SKGT Input: significance level α (0, 1), data stream {Zt}t 1, where Zt q, λONS 1 = 0. for t=1,2,... do Use Z1, . . . , Zt 1 to compute ˆgt as in (12); Compute KSD payoff ft(Zt); Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else compute λONS t+1 (Algorithm 1); end if end for Notably, in (12), the witness function s plug-in estimate is defined as an operator. Betting Fraction λt. To select betting fractions in an online fashion, we employ the approach proposed by Cutkosky & Orabona (2018). They expressed the problem of choosing the optimal betting fraction for coin betting as an online optimization problem with exp-concave losses and proposed a strategy based on the Online Newton Step (ONS) (Hazan et al., 2007), which is summarized in Algorithm 1. We conclude this section with formal guarantees regarding the time-uniform type-I error control and consistency of KSD-based SKGT. In addition, we show that the wealth process grows exponentially and characterize the wealth growth rate in terms of the true KSD. The proof is detailed in Appendix B.1. Theorem 2.5. Under Assumptions 2.3 and 2.4, the following claims hold for KSD-based SKGT (Algorithm 2): 1. Under H0 in (1a), SKGT stops with probability at most α: PH0(τ < ) α. 2. Under H1 in (1b), then it holds that K a.s. + and thus the SKGT is consistent: PH1(τ < ) = 1. Furthermore, the wealth grows exponentially, and the cor- Sequential Kernel Goodness-of-fit Testing responding growth rate satisfies the following formula: lim inf t log Kt t EH1[f (Z)] EH1[(f (Z))2] 1 (14) almost surely, where f is the oracle payoff defined in (11). Proof sketch. Time-uniform type-I error control directly follows from Ville s inequality. To establish the consistency of KSD-based SKGT, we leverage (Cutkosky & Orabona, 2018, Theorem 1), showing that t log K(λ0) for any betting fraction λ0. Hence, after selecting a specific betting fraction and employing fundamental inequalities, we further obtain 1 t Pt i=1 fi 4 0 1 t Pt i=1 fi 1 t Pt i=1 f 2 i 1 Then, the asymptotic properties of the right-hand side are investigated in Lemma B.5. The proof s core lies in the convergence of V-statistics, analyzed using the non-asymptotic convergence based on Hoeffding s study (Hoeffding, 1963). Asymptotic convergence follows from combining the nonasymptotic results with the Borel Cantelli lemma. 
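To make Algorithms 1 and 2 concrete, the sketch below implements the generic betting loop. The payoff is left as a user-supplied stub: in the actual method it is the plug-in KSD payoff from (11)-(12), whose kernel and score computations are not reproduced here. The ONS update follows the standard Cutkosky-Orabona step for maximizing the log-wealth with the fraction projected onto [-1/2, 1/2]; the exact constants and sign conventions of Algorithm 1 are hard to read in the extracted text, so treat this as an assumption rather than a verbatim transcription.

```python
import numpy as np

def ons_update(lam, a, payoff):
    # One Online Newton Step on the exp-concave loss -log(1 + lam * payoff),
    # with the betting fraction kept in [-1/2, 1/2] (Cutkosky & Orabona, 2018).
    z = payoff / (1.0 + lam * payoff)        # negative gradient of the loss at lam
    a = a + z ** 2                           # running sum of squared gradients (a_0 = 1)
    lam = lam + 2.0 / (2.0 - np.log(3.0)) * z / a
    return float(np.clip(lam, -0.5, 0.5)), a

def skgt(stream, payoff_fn, alpha=0.05, max_steps=10_000):
    # Generic testing-by-betting loop in the spirit of Algorithm 2.
    # payoff_fn(history, z) must return a value in [-1, 1] with zero conditional
    # mean under H0; for KSD-based SKGT it is the plug-in payoff f_t from (11)-(12).
    wealth, lam, a, history = 1.0, 0.0, 1.0, []
    t = 0
    for t, z in enumerate(stream, start=1):
        f_t = payoff_fn(history, z)          # f_t is built only from Z_1, ..., Z_{t-1}
        wealth *= 1.0 + lam * f_t            # wealth update (4)
        if wealth >= 1.0 / alpha:            # Ville threshold: reject H0 and stop
            return "reject H0", t, wealth
        lam, a = ons_update(lam, a, f_t)     # choose the next betting fraction
        history.append(z)
        if t >= max_steps:
            break
    return "fail to reject so far", t, wealth

# Toy usage: a mean-zero payoff under H0 keeps the wealth near 1, so the loop
# (correctly) does not stop within the horizon.
rng = np.random.default_rng(0)
print(skgt((rng.uniform(-1, 1) for _ in range(2000)), lambda h, z: z, max_steps=2000))
```

Note that the next betting fraction is computed only after the wealth has been updated with the current one, so the fraction used at each round depends only on past data; this predictability is what makes the wealth process a martingale under H0.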
Since EH1[f (Z)] = 1 2Sp(q) and f (z) [ 1, 1], as established in the proof of Theorem 2.5, Theorem 2.5 implies that: lim inf t log Kt Amongst the betting fractions that are constrained to lie in [ 1/2, 1/2], such as the ONS betting strategy, the optimal growth rate is ensured by using: λ = arg max λ [ 1/2,1/2] E[log(1 + λf (Z)]. (15) Consequently, we obtain the following results regarding the oracle test s growth rate: Proposition 2.6. The optimal log-wealth S := E[log(1 + λ f (Z))] that can be achieved by an oracle betting scheme (15), which knows f from (11) and the underlying distribution, satisfies the formula: 3E[(f (Z))2] 1 Proof. The fact that S E[f (Z)/2] trivially follows from E[log(1 + λf (Z))] λE[f (Z)] E[f (Z)]/2. Since for any x [ 1/2, 1/2], it holds that: log(1 + x) x 3x2/8, we know that: S max λ [ 1/2,1/2] λE[f (Z)] 3 8λ2E[(f (Z))2] , (17) and by assuming the maximization problem, we obtains the upper bound: S 2 (E[f (Z)])2 3E[(f (Z))2] (18) assuming E[f (Z)] E[(f (Z))2] 3/8. On the other hand, it always holds that: S E[f (Z)/2]. To obtain the claimed bound, we multiply the RHS of (18) by two, which completes the proof of (16). 3. Alternative Stein Discrepancies Practically, data is sometimes encountered in discrete spaces or bounded domains. However, the KSD is specifically designed for smooth density functions on Rd. This section introduces sequential goodness-of-fit tests based on Kernel Discrete Stein Discrepancy (KDSD) (Yang et al., 2018) and Bound-domain Kernel Stein Discrepancy (bd-KSD) (Xu, 2022). 3.1. KDSD-based Sequential Kernel Goodness-of-fit Testing We recall the KDSD defined on X = {0, . . . , L 1}d with L > 1. In place of derivative, we specify k as the cyclic forward difference with respect to k-th coordinate as follows: kf(x)=f(x1, . . . , xk, . . . , xd) f(x1, . . . , xk, . . . , xd), where xk = xk + 1 mod L, with the corresponding vectorvalued operator = ( 1, . . . , d) . The inverse operator 1 k is given by the backward difference: 1 k f(x)=f(x1, . . . , xk, . . . , xd) f(x1, . . . , xk, . . . , xd), where xk = xk 1 mod L, and 1 = ( 1 1 , . . . , 1 d ) . Then, the score is sp(x) := p(x) 1 p(x), where it is assumed that the probability mass function is strictly positive. For a vector-valued function g : X Rd, the difference Stein operator is then defined as: Apg(x) := tr[g(x)s p (x) + 1g(x)] p(x) gi(x) + 1 i gi(x). It can be shown that Ex p[Apg(x)] = 0 (Yang et al., 2018, Theorem 2). Given an RKHS Hd of vector-valued functions g : X Rd, we obtain the KDSD as: KDSD(q p) := sup g Hd, g Hd 1 Ex q[Apg(x)]. (19) Sequential Kernel Goodness-of-fit Testing Algorithm 3 KDSD-based SKGT Input: significance level α (0, 1), data stream {xt}t 1, where xt q, λONS 1 = 0. for t=1,2,. . . do Use x1, . . . , xt 1 to compute ˆgt as in (23); Compute KDSD payoff ft(xt) = s ˆgt, ηp(xt, ) Hd; Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else compute λONS t+1 (Algorithm 1); end if end for Witness Function for KDSD. Suppose that k is the reproducing kernel of H, we define the kernel embedding as ηp(x, ) := sp(x)k(x, ) + 1 x k(x, ), where 1 x indicates that the operator 1 is applied with respect to x. Based on the kernel embedding, we can express KDSD as: KDSD(q p) = sup g Hd, g Hd 1 g, Ex qηp(x, ) Hd . (20) Based on the variational formulation, the witness function of KDSD has the closed form: g = Ex qηp(x, ) Ex qηp(x, ) Hd . 
(21) The oracle payoff function f based on KDSD is given by the equation: f (x) = s g , ηp(x, ) Hd , (22) where the scaling factor s > 0 ensures f (xt) [ 1, 1]. Given a sequence of samples {xt}t 1 from q, we use estimates {ft}t 1 of the oracle payoff function obtained by replacing g in (22) with the plug-in estimator: ˆgt = Ex ˆqt 1ηp(x, ) Ex ˆqt 1ηp(x, ) Hd , (23) where ˆqt 1 is the empirical distribution of {x1, . . . , xt 1}. Assumption 3.1. Suppose that a constant Bp for the target distribution p exists, such that supx X p(x)/p(x) 2 Bp. Assumption 3.2. Suppose that the kernel k satisfies: supx X k(x, x) Bk,0, supx X 1 x k(x, x) 2 Bk,1, and supx X tr[ 1 x 1 x k(x, x )] x =x Bk,2. Considering that we are working with distributions in a discrete space, the aforementioned assumption is not too restrictive. A common choice for the kernel in discrete space is the exponential Hamming kernel: k(x, x ) = exp( H(x, x )), where H(x, x ) := 1 d Pd i=1 1{xi = x i} is the normalized Hamming distance. We have the following guarantees regarding our KDSD-based SKGT s time-uniform type-I error control, as detailed in Appendix B.2. Theorem 3.3. Assuming that Assumptions 3.1 and 3.2 are satisfied, and setting s = 1 Bk,0B2p+2Bk,1Bp+Bk,2 , then, un- der the null hypothesis H0 in (1a), the KDSD-based SKGT (Algorithm 3) ensures: PH0(τ < ) α. 3.2. bd-KSD-based Sequential Kernel Goodness-of-fit Testing Unlike densities on unbounded domains commonly assumed to vanish at infinity, densities on compact domains may not necessarily vanish at the boundary. Hence, a direct application of a Stein operator on Rd may require the knowledge of normalized density at the boundary, which defeats the purpose of KSD testing for unnormalized models. We recall the bd-KSD defined on a bounded domain V Rd. We consider a bounded smooth function h : Rd Rd such that hi( V ) = 0, i = 1, . . . , d. For the unnormalized p on V , the bounded-domain Stein operator is defined as: Tp,hg(x) = 1 p(x) Pd i=1 xi (p(x)gi(x)hi(x)). Given an RKHS Hd of vector-valued functions g : Rd Rd, the bd-KSD is defined as: bd-KSD(q p) := sup g Hd, g Hd 1 Ex q[Tp,hg(x)]. (24) Witness Function for bd-KSD. Assuming that k is the reproducing kernel of H, we define the kernel embedding ζp,h(x, ) := ( log p(x) h(x) + h(x))k(x, ) + h(x) k(x, ), where we denote u v := (u1v1, , udvd) as the element-wise multiplication. Leveraging the kernel embedding, we can express bd-KSD as: bd-KSD(q p) = sup g Hd, g Hd 1 g, Ex qζp,h(x, ) Hd . Moreover, building upon the variational formulation, the witness function of bd-KSD adopts the closed form: g = Ex qζp,h(x, ) Ex qζp,h(x, ) Hd . (26) Similar to the payoff function defined in the previous section, we use the plug-in estimator: ˆgt = Ex ˆqt 1ζp,h(x, ) Ex ˆqt 1ζp,h(x, ) Hd , (27) where ˆqt 1 is the empirical distribution of a sequence of samples {x1, . . . , xt 1}. Finally, leveraging the plug-in Sequential Kernel Goodness-of-fit Testing Algorithm 4 bd-KSD-based SKGT Input: significance level α (0, 1), data stream {xt}t 1, where xt q, λONS 1 = 0. for t=1,2,. . . do Use x1, . . . , xt 1 to compute ˆgt as in (27); Compute bd-KSD payoff ft(xt) = s ˆgt, ζp(xt, ) Hd; Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else compute λONS t+1 (Algorithm 1); end if end for estimator, we construct the payoff function as ft(xt) = s ˆgt, ζp,h(xt, ) Hd . (28) Assumption 3.4. Suppose that a constant Cp for the target distribution p exists, such that supx V log p(x) 2 Cp. Assumption 3.5. 
Suppose that the kernel k satisfies: supx V k(x, x) Ck,0, supx V k(x, x) 2 Ck,1, and supx V tr[ x xk(x, x )] x =x Ck,2. Assumption 3.6. Suppose that constants Ch,0 and Ch,1 for the function h exist, such that supx V h(x) 2 Ch,0 and supx V h(x) 2 Ch,1. Notably, as we are currently dealing with a bounded domain, the above assumption is not overly restrictive. The following result guarantees the time-uniform type-I error control of our bd-KSD-based SKGT. The proof is deferred to Appendix B.2. Theorem 3.7. Assuming that Assumptions 3.4, 3.5, and 3.6 hold, and setting s = 1 2(C2 p C2 h,0+C2 h,1)Ck,0+2(Cp Ch,0+Ch,1)Ck,1+C2 h,0Ck,2 , then, under the null hypothesis H0 in (1a), the bd-KSD-based SKGT (Algorithm 4) satisfies: PH0(τ < ) α. 4. Numerical Simulations This section describes the experiments that demonstrate our tests capacity to adapt to a problem s unknown difficulty while maintaining type-I error control. 4.1. Student s t versus Normal In our first experiment, we consider testing the null hypothesis that the observed samples come from a Student s t distribution with 1 degree of freedom. Student s t distribution has the probability density function given by the formula: f(t) 1 + t2 Figure 2. Ability of our sequential test to adapt to the problem s unknown difficulty while controlling the type-I error. Each solid curve is obtained by running 500 trials of the KSD-based SKGT (Algorithm 2) using samples from the Student s t distribution with the degrees of freedom at 1, 5, 10, and . The dashed vertical line shows our SKGT s average stopping time. df denotes degree of freedom. where ν is the number of degrees of freedom. We consider samples from distributions with 1, 5, 10, and , where is equivalent to sampling from a standard normal distribution. We employ our KSD-based SKGT (Algorithm 2) to test against the null hypothesis, utilizing the Gaussian kernel k(x, y) = exp( |x y|2 /2). As displayed in Figure 2, we observed that under the null hypothesis, the type-I error is rigorously controlled. Furthermore, as the degrees of freedom approach 1, indicating a more challenging problem, the expected stopping time increases, thereby corroborating our theoretical result presented in Theorem 2.5. 4.2. Ising Model In the second experiment, we consider testing the null hypothesis that the observed samples come from an Ising model. The Ising model (Ising, 1924) is a canonical example of a Markov random field. Consider an undirected graph G = (V, E), where each vertex i V is associated with a binary spin. The collection of spins form a random vector x = (x1, . . . , xd) { 1, +1}d, whose components xi and xj directly interacts only if (i, j) E. The probability mass function is defined as: pΘ(x) = 1 Z(Θ) exp (i,j) E θijxixj Sequential Kernel Goodness-of-fit Testing Figure 3. Our sequential test s capacity to adapt to a problem s unknown complexity while controlling the type-I error. Each solid curve is obtained by running 500 trials of the KDSD-based SKGT (Algorithm 3) using samples from the Ising models under temperatures 1, 2, 3, and 5. The dashed vertical lines show our SKGT s corresponding average stopping time. where θij are the edge potentials and Z(Θ) is the partition function which is prohibitive to compute when d is high. We consider a periodic 10-by-10 lattice, with d = 100 random variables. We focus on the ferromagnetic setting and set θij = 1/T, where T is the temperature of the system. For T0 = 5 and various values of T, we test the hypotheses H0 : T = T0 vs. 
H1 : T = T0 using data samples drawn from the model under temperature T. To draw samples from the Ising model, we apply the Metropolis algorithm. We employ our KDSD-based SKGT (Algorithm 3) to test against the null hypothesis, using the exponential hamming kernel k(x, x ) = exp( H(x, x )), where H(x, x ) := 1 d Pd i=1 1{xi = x i} is the normalized Hamming distance. As illustrated in Figure 3, under the null hypothesis, our KDSD-based SKGT does not reject the null within 1000 samples. Conversely, under the alternative hypotheses, as the temperature approaches T0, indicative of a more challenging task, the average stopping time increases. 4.3. Truncated Gaussians in Unit Ball In the third experiment, we consider testing the null hypothesis that samples come from the standard Gaussian truncated in B1(R3) = {x R3 : x 2 1}. Specifically, our testing procedure is applied to samples generated from truncated Gaussians with various mean shifts, formulated as q(x) N (µ, I3) , x B1(R3), Figure 4. Our sequential test s ability to adapt to a problem s unknown difficulty while controlling the type-I error. Each solid curve is obtained by running 500 trials of the bd-KSD-based SKGT (Algorithm 4) using samples from the truncated Gaussians with shifts 0, 0.3, 0.6, and 0.9. The dashed vertical lines show our SKGT s corresponding average stopping time. where µ = (v, v, v) and I3 is the identity matrix in R3 3. We consider different values of v {0, 0.3, 0.6, 0.9}. We employ our bd-KSD-based KDSD, equipped with the Gaussian kernel k(x, y) = exp x y 2 2 /2 . To satisfy boundary conditions and consider the rotational invariance of the unit ball, we set the auxiliary functions as hi(x) = 1 x 2 2 , i = 1, 2, 3. As depicted in Figure 4, under the null hypothesis, our bd-KSDs-based SKGT does not reject the null within 104 samples. Conversely, under the alternative hypothesis, as the mean shift decreases, indicating a more challenging task, the average stopping time increases. 5. Conclusion In this paper, we introduce the SKGT based on the principle of testing by betting. The SKGT allows for continuous monitoring of data and adaptation to a problem s unknown complexity. After that, we provide formal guarantees regarding time-uniform type-I error control and the consistency of the KSD-based SKGT. Specifically, we demonstrate that the wealth process exhibits exponential growth and characterize the rate of wealth growth in terms of the true KSD. To handle data in discrete spaces and bounded domains, we propose SKGTs based on KDSD and bd-KSD. Our experiments demonstrate the adaptability of our tests to a problem s unknown complexity while maintaining type-I error control. Sequential Kernel Goodness-of-fit Testing Impact Statement To the best of our knowledge, this work has no negative social impact. This work mainly provides sequential kernel goodness-of-fit tests. Hence, our work may promote the development of the related fields. Acknowledgements This work is supported by the National Key R&D Program of China under Grant 2023YFC3604702, and the Fundamental Research Fund Program of LIESMARS. Baum, J., Kanagawa, H., and Gretton, A. A kernel stein test of goodness of fit for sequential models. In ICML, 2023. Carmeli, C., De Vito, E., Toigo, A., and Umanit a, V. Vector valued reproducing kernel hilbert spaces and universality. Analysis and Applications, 2010. Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In ICML, 2016. Cutkosky, A. and Orabona, F. 
Black-box reductions for parameter-free online learning in banach spaces. In COLT, 2018. Darling, D. A. and Robbins, H. Some nonparametric sequential tests with power one. Proceedings of the National Academy of Sciences, 1968. Fan, X., Grama, I., and Liu, Q. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 2015. Gorham, J. and Mackey, L. Measuring sample quality with stein s method. In Neur IPS, 2015. Gr unwald, P., Henzi, A., and Lardy, T. Anytime-valid tests of conditional independence under model-x. Journal of the American Statistical Association, 2023. Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2007. Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963. Ising, E. Beitrag zur theorie des ferro-und paramagnetismus. Ph D thesis, Grefe & Tiedemann Hamburg, Germany, 1924. Jitkrittum, W., Xu, W., Szab o, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Neur IPS, 2017. Kolmogorov, A. Sulla determinazione empirica di una legge didistribuzione. Giorn Dell inst Ital Degli Att, 1933. Liu, Q., Lee, J., and Jordan, M. A kernelized stein discrepancy for goodness-of-fit tests. In ICML, 2016. Liu, X., Duncan, A. B., and Gandy, A. Using perturbation to improve goodness-of-fit tests based on kernelized stein discrepancy. In ICML, 2023. Podkopaev, A. and Ramdas, A. Sequential predictive twosample and independence testing. In Neur IPS, 2023. Podkopaev, A., Bl obaum, P., Kasiviswanathan, S., and Ramdas, A. Sequential kernelized independence testing. In ICML, 2023. Shaer, S., Maman, G., and Romano, Y. Model-x sequential testing for conditional independence via testing by betting. In AISTATS, 2023. Shafer, G. Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society Series A: Statistics in Society, 2021. Shafer, G. and Vovk, V. Game-theoretic foundations for probability and finance. 2019. Shekhar, S. and Ramdas, A. Nonparametric two-sample testing by betting. IEEE Transactions on Information Theory, 2023. Smirnov, N. Table for estimating the goodness of fit of empirical distributions. The annals of mathematical statistics, 1948. Ville, J. Etude critique de la notion de collectif. Th eses de l entre-deux-guerres. Gauthier-Villars, 1939. Xu, W. Standardisation-function kernel stein discrepancy: A unifying view on kernel stein discrepancy tests for goodness-of-fit. In AISTATS, 2022. Yang, J., Liu, Q., Rao, V., and Neville, J. Goodness-of-fit testing for discrete distributions via stein discrepancy. In ICML, 2018. Sequential Kernel Goodness-of-fit Testing A. Goodness-of-fit Testing for Streaming Data A.1. Failure of Batch KSD under Continuous Monitoring It is straightforward to estimate the squared Stein discrepancy from samples {Zi}n i=1: a quadratic time estimator is a V-statistic, and takes the form: j=1 hp(Zi, Zj). For kernel k, we choose the Gaussian kernel k(x, x ) = exp( x x 2 2/2). To conduct goodness-of-fit testing using batch KSD, we use the Wild Bootstrap Testing (Chwialkowski et al., 2016). A simple Markov chain taking values in { 1, 1}, starting from W1,n = 1, Wt,n = 1{Ut > an}Wt 1,n 1{Ut < an}Wt 1,n, where the Ut are uniform (0, 1) i.i.d. random variables and an is the probability of Wt,n changing sign. For i.i.d. data, we set an = 0.5. This leads to a bootstrapped V-statistic j=1 Wi,n Wj,nhp(Zi, Zj). 
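For concreteness, the following sketch (ours, not part of the paper) computes the quadratic-time V-statistic and its wild-bootstrap replicates just described, together with the resulting p-value described in the next sentence. It assumes the one-dimensional Gaussian kernel k(x, y) = exp(-(x - y)^2 / 2) and the standard Cauchy target used in this appendix, whose score is -2x/(1 + x^2); for i.i.d. data the sign-change probability a_n = 0.5 makes the multipliers behave like i.i.d. Rademacher signs, which is what the code uses.

```python
import numpy as np

def score_cauchy(x):
    # d/dx log p(x) for the (unnormalized) standard Cauchy density p(x) ~ 1/(1 + x^2).
    return -2.0 * x / (1.0 + x ** 2)

def stein_kernel(x, score):
    # h_p(x_i, x_j) for the Gaussian kernel k(x, y) = exp(-(x - y)^2 / 2), d = 1.
    s = score(x)
    D = x[:, None] - x[None, :]                 # pairwise differences x_i - x_j
    K = np.exp(-0.5 * D ** 2)
    return (s[:, None] * s[None, :] * K         # s(x) s(y) k(x, y)
            + s[None, :] * (-D) * K             # s(y) d/dx k(x, y)
            + s[:, None] * D * K                # s(x) d/dy k(x, y)
            + (1.0 - D ** 2) * K)               # d^2 k(x, y) / dx dy

def ksd_vstat(x, score):
    # V_n = (1 / n^2) sum_{i, j} h_p(x_i, x_j).
    return stein_kernel(x, score).mean()

def wild_bootstrap_pvalue(x, score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    H, n = stein_kernel(x, score), len(x)
    v = H.mean()
    W = rng.choice([-1.0, 1.0], size=(n_boot, n))       # Rademacher multipliers
    boot = np.einsum("bi,ij,bj->b", W, H, W) / n ** 2    # bootstrapped V-statistics
    return (boot >= v).mean()

x = np.random.default_rng(1).standard_normal(300)        # N(0, 1) data, Cauchy target
print(ksd_vstat(x, score_cauchy), wild_bootstrap_pvalue(x, score_cauchy))
```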
Having obtained wild bootstrap samples {Bn,i}D i=1, we use the p-value: P = 1 D PD i=1 1{Bn,i > Vn}, with D = 1000 wild bootstrap samples. Next, we study batch KSD under (a) fixed-time and (b) continuous monitoring. We consider a simple case when Z is from the student-t distribution with degree of freedom 1 and the target distribution is also the student-t distribution with density 1 π 1 + t2 1. We conduct a test at 12 different samples sizes: t {50, 100, . . . , 600}: (a) Under fixed-time monitoring, for each value of t, we sample a sequence Z1, . . . , Zt (150 times for each t) and conduct batch-KSD test. The goal is to confirm that batch-KSD controls type I error by tracking the standard miscoverage rate. (b) Under continuous monitoring, we sample new datapoints and re-conduct the test. We illustrate inflated type I error by tracking the cumulative miscoverage rate, that is, the fraction of times, the test falsely rejects the null. Figure 5. Inflated false discovery rate of batch KSD under continuous monitoring (CM, red line with squares) for the case when p = q. Bonferroni correction (CM, green line with triangles) restore type I error control. Type I error is controlled at a specified level under fixed-time monitoring (FTM, blue line with circles). The results are presented in Figure 5. For Bonferroni correction, we compute p-values every 50 samples and decompose the error budget as: α = P i=1 α i(i+1), that is, for i-th test we use threshold αi = α i(i+1). Sequential Kernel Goodness-of-fit Testing B.1. Proofs for Section 2 Theorem 2.2. Let G be the RKHS constructed from a C0-universal kernel and Gd be the corresponding product RKHS: 1. Under H0 in (1a), any payoff function of the form (8) satisfies EH0[f(Z)] = 0. 2. Under H1 in (1b), the oracle payoff function f based on the witness function g satisfies EH1[f (Z)] > 0. Further, for λ defined in (9), it holds that EH1[log(1 + λ f (Z))] > 0. Hence, K t a.s. + , which implies that the oracle test is consistent: PH1(τ < ) = 1, where τ = inf{t 1 : K t 1/α}. Proof. 1. Under H0 in (1a), we have that: EZ p[f (Z)] = EZ p[s ( g (Zt), log p(Zt) + g (Zt))] Rd ( g (x), log p(x) + g (x)) p(x)dx i=1 g i (x) p(x) xi dx + s Z Rd( g (x))p(x)dx i=1 p(x) g i (x) xi dx + s Z Rd( g (x))p(x)dx where equality (i) follows from integration by parts. 2. Under H1 in (1b) and the i.i.d. setting, we have E[f (Zt)|Ft 1] = E[f (Z)] = s Sp(q) > 0. Let W := f (Z), and consider the quantity EH1[log(1 + λW)]. We use the following inequality (Fan et al., 2015, Equation (4.12)): for any y 1 and λ [0, 1), we have log(1 + λy) λy + y2(log(1 λ) + λ). Hence, we get the following: EH1[log(1 + λW)]λ λEH1[W] + EH1[W 2](log(1 λ) + λ). Finally, by choosing λ = EH1[W]/(EH1[W] + EH1[W 2]) (0, 1) and the fact that log(1 x) + x x2 2(1 x) for x [0, 1), we arrive at EH1[log(1 + λ W)] 1 2 (EH1[W])2 EH1[W] + EH1[W 2] > 0. The wealth process corresponding to the oracle test satisfies: i=1 (1 + λ f (Zt)) = exp i=1 log(1 + λ f (Zi)) By the Strong Law of Large Numbers (SLLN), we have: i=1 log(1 + λ f (Zi)) a.s. EH1[log(1 + λ f (Z))] > 0. The above result implies that K t a.s. + , thereby ensuring the consistency of the oracle test. Sequential Kernel Goodness-of-fit Testing Theorem 2.5. Under Assumptions 2.3 and 2.4, the following claims hold for KSD-based SKGT (Algorithm 2): 1. Under H0 in (1a), SKGT stops with probability at most α: PH0(τ < ) α. 2. Under H1 in (1b), then it holds that K a.s. + and thus the SKGT is consistent: PH1(τ < ) = 1. 
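As a quick numeric sanity check of the final display above (a toy illustration we add here, with an arbitrary bounded payoff standing in for f*(Z) under H1), the betting fraction from (9) indeed yields a positive expected log-wealth that dominates the bound (E[W])^2 / (2(E[W] + E[W^2])):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative bounded payoff with a small positive mean, standing in for f*(Z) under H1.
W = np.clip(0.05 + 0.4 * rng.standard_normal(1_000_000), -1.0, 1.0)

m1, m2 = W.mean(), (W ** 2).mean()
lam = m1 / (m1 + m2)                    # oracle betting fraction (9)
growth = np.log1p(lam * W).mean()       # E[log(1 + lambda* W)]
lower = 0.5 * m1 ** 2 / (m1 + m2)       # lower bound from the proof

print(lam, growth, lower, growth >= lower)
```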
Furthermore, the wealth grows exponentially, and the corresponding growth rate satisfies the following formula: lim inf t log Kt t EH1[f (Z)] EH1[(f (Z))2] 1 (14) almost surely, where f is the oracle payoff defined in (11). Proof. 1. First, let us show that the predictable estimates of the oracle payoff function are bounded when the scaling factors s = 1/2 is used. Recall that: 2 ˆgt, log p(x)k(x, ) + k(x, ) Gd (30) The Cauchy-Schwartz inequality implies that 2 ˆgt Gd log p(x)k(x, ) Gd + 1 2 ˆgt Gd k(x, ) Gd 2 log p(x)k(x, ) Gd + 1 2 k(x, ) Gd , where the second equality follows from the constraint that ˆgt Gd = 1. The first term log p(x)k(x, ) Gd can be computed as: log p(x)k(x, ) Gd = i=1 xi log p(x)k(x, ), xi log p(x)k(x, ) = log p(x) 2 p where the last inequality follows from Assumption 2.3 and 2.4. The second term k(x, ) Gd is calculated as follows: k(x, ) Gd = i=1 xik(x, ), xik(x, ) G i=1 xi xik(x, ), k(x, ) G where the last inequality follows from Assumption 2.3. Hence ft(x) [ 1, 1]. Next, we show that the constructed payoff function yields a fair bet. Note that ft is constructed from {Zs}t 1 s=1; thus ft is Ft 1-measurable. Based on this fact, we have: E[ft(Zt)|Ft 1] = 1 2EZt q[ ˆgt(Zt), log p(Zt) + ˆgt(Zt)], and in particular, the above implies that EH0[ft(Zt)|Ft 1] = 0 for H0 in (1a). In the following, we show that the resulting wealth process is a non-negative martingale for all strategies for selecting betting fractions that are considered in this work. Since the betting fractions are predictable, i.e. λt is Ft 1-measurable, we have: EH0[Kt|Ft 1] = EH0[Kt 1(1 + λtft(Zt))|Ft 1] = Kt 1 + λt EH0[ft(Zt)|Ft 1] The assertion of the theorem then follows directly from Ville s inequality (Theorem B.6) when a = 1/α. Sequential Kernel Goodness-of-fit Testing 2. In the following, we establish the consistency of KSD-based SKGT with the ONS betting strategy. Under the ONS betting strategy, for any sequence of outcomes {ft}t 1, ft [ , 1, 1], i 1, it holds that (Cutkosky & Orabona, 2018, Proof of Theorem 1): log Kt(λ0) log Kt = O where Kt(λ0) is the wealth process of any constant betting strategy λ0 [ 1/2, 1/2] and Kt is the wealth process corresponding to the ONS strategy. It follows that the wealth process corresponding to the ONS strategy satisfies t log Kt(λ0) for some absolute constant C > 0. Next, let us consider: Pt i=1 fi Pt i=1 f 2 i 1 We obtain: log Kt(λ0) i=1 log(1 + λ0fi) λ0fi λ2 0f 2 i 1 t Pt i=1 fi 4 0 1 t Pt i=1 fi 1 t Pt i=1 f 2 i 1 where in (a) we use the inequality: log(1 + x) x x2 for x [ 1/2, 1/2]. From Lemma B.5, it follows for fi = fi(Zi) that: 1 t Pt i=1 fi(Zi) 1 t Pt i=1 fi(Zi) 1 t Pt i=1 (fi(Zi))2 1 a.s. E[f (Z)] E[(f (Z))2] 1 Using (32), we conclude that the growth rate of the ONS wealth process satisfies lim inf t log Kt E[(f (Z))2] 1 We conclude that the test is consistent, that is, if H1 is true, then P(τ < ) = 1. B.2. Proofs for Section 3 Theorem 3.3. Assuming that Assumptions 3.1 and 3.2 are satisfied, and setting s = 1 Bk,0B2 p+2Bk,1Bp+Bk,2 , then, under the null hypothesis H0 in (1a), the KDSD-based SKGT (Algorithm 3) ensures: PH0(τ < ) α. Proof. It suffices to show that the proposed payoff functions are bounded. Note that: ˆgt, ηp(xt, ) Hd ˆgt Hd ηp(xt, ) Hd = ηp(xt, ) Hd k(xt, xt) sp(xt) 2 2 + 2sp(xt) 1 x k(xt, xt) + tr[ 1 x 1 x k(x, x )] x =x Bk,0B2p + 2Bk,1Bp + Bk,2, Sequential Kernel Goodness-of-fit Testing where the first inequality follows from the Cauchy-Schwartz inequality and the second inequality is due to Assumption 3.1 and Assumption 3.2. 
We conclude that any predictive estimate of the oracle payoff function for KDSD satisfies |ft(xt)| 1. Next, we show that the constructed payoff function yields a fair bet. Note that ft is constructed from {xs}t 1 s=1; thus ft is Ft 1-measurable. Based on this fact, we have: E[ft(xt)|Ft 1] = s Ext q[ ˆgt(xt), sp(xt) + 1 ˆgt(xt)], and in particular, the above implies that EH0[ft(xt)|Ft 1] = 0 for H0 in (1a). In the following, we show that the resulting wealth process is a non-negative martingale for all strategies for selecting betting fractions that are considered in this work. Since the betting fractions are predictable, i.e. λt is Ft 1-measurable, we have: EH0[Kt|Ft 1] = EH0[Kt 1(1 + λtft(xt))|Ft 1] = Kt 1 + λt EH0[ft(xt)|Ft 1] The assertion of the theorem then follows directly from Ville s inequality (Theorem B.6) when a = 1/α. Theorem 3.7. Assuming that Assumptions 3.4, 3.5, and 3.6 hold, and setting s = 1 2(C2p C2 h,0+C2 h,1)Ck,0+2(Cp Ch,0+Ch,1)Ck,1+C2 h,0Ck,2 , then, under the null hypothesis H0 in (1a), the bd-KSD-based SKGT (Algorithm 4) satisfies: PH0(τ < ) α. Proof. It suffices to show that the proposed payoff functions are bounded. Note that: ˆgt, ζp,h(xt, ) Hd 2 ˆgt 2 Hd ζp,h(xt, ) 2 Hd = log p(xt) h(xt) + h(xt) 2 2 k(xt, xt) | {z } (I) + 2 log p(xt) h(xt) + h(xt), k(xt, xt) | {z } (II) i=1 h2 i (xt) 2k(x, x ) x=x =xt | {z } III where the first inequality is due to Cauchy-Schwartz inequality. For the term (I), we have log p(xt) h(xt) + h(xt) 2 2 k(xt, xt) 2( log p(xt) h(xt) 2 2 + h(xt) 2 2)k(xt, xt) 2( log p(xt) 2 2 h(xt) 2 2 + h(xt) 2 2)k(xt, xt) 2(C2 h,0C2 p + C2 h,1)Ck,0, where the first inequality follows from the triangle inequality and (x + y)2 2x2 + 2y2, x, y R, the second inequality is due to the fact that u v 2 2 u 2 2 v 2 2, and the third inequality follows from Assumptions 3.4, 3.5 and 3.6. For the term (II), we have log p(xt) h(xt) + h(xt), k(xt, xt) 2( log p(xt) h(xt) 2 + h(xt) 2) k(xt, xt) 2 2(Cp Ch,0 + Ch,1)Ck,1, (37) Sequential Kernel Goodness-of-fit Testing where the first inequality follows from the Cauchy-Schwartz inequality and the second inequality is due to our assumptions and the fact that u v 2 2 u 2 2 v 2 2. For the term (III), we have i=1 h2 i (xt) 2k(x, x ) i=1 h(xt) 2 2 2k(x, x ) C2 h,0Ck,2, where the first inequality is because u2 i u 2 2, for all i = 1, . . . , d, and the second inequality is due to our assumptions. Combining these terms, we obtain ˆgt, ζp,h(xt, ) Hd q 2(C2p C2 h,0 + C2 h,1)Ck,0 + 2(Cp Ch,0 + Ch,1)Ck,1 + C2 h,0Ck,2. By setting s = 1 2(C2p C2 h,0+C2 h,1)Ck,0+2(Cp Ch,0+Ch,1)Ck,1+C2 h,0Ck,2 , we conclude that |ft(xt)| 1. Following the same procedure as in the proof of Theorem 3.3, we can show that the resulting wealth process is a martingale. The assertion of the theorem directly follows from Ville s inequality. B.3. Supporting Lemmas To begin, let s review the definition of hp(x, y) before presenting the first result. hp(x, y) := log p(x) log p(y)k(x, y) + log p(y) xk(x, y) + log p(x) yk(x, y) + xk(x, y), yk(x, y) Gd , Lemma B.1 (Convergence of V -statistic). Suppose that Assumption 2.3 and 2.4 hold. Define Vn = 1 n2 Pn i=1 Pn i=1 hp(Zi, Zj) and S2 p(q) = EZ,Z [hp(Z, Z )], where {Zi}n i=1 is a sample from distribution q and Z is independent of Z with identical distribution q. Then for n > 1 and any δ (0, 1), it holds with probability of at least 1 δ that: Vn S2 p(q) 8 Proof. First, note that Vn is a biased estimator of S2 p(q): E[Vn] = n 1 n EZ,Z [hp(Z, Z )] + 1 n EZ[hp(Z, Z)]. 
The triangle inequality implies: |Vn S2 p(q)| |Vn E[Vn]| + 1 S2 p(q) EZ[hp(Z, Z)] Then the problem is decomposed into upper bounding the first and second term separately. (i) Upper bounding S2 p(q) EZ[hp(Z, Z)] . We start with providing an upper bound for hp(x, y). hp(x, y) can be written as: hp(x, y) = log p(x)k(x, ) + k(x, ), log p(y)k(y, ) + k(y, ) Gd Using Cauchy-Schwartz inequality, we obtain: |hp(x, y)| log p(x)k(x, ) + k(x, ) Gd log p(y)k(y, ) + k(y, ) Gd By the triangle inequality of norm Gd, we have: log p(x)k(x, ) + k(x, ) Gd log p(x)k(x, ) Gd + k(x, ) Gd The first term log p(x)k(x, ) Gd can be computed as: log p(x)k(x, ) Gd = i=1 xi log p(x)k(x, ), xi log p(x)k(x, ) = log p(x) 2 p Sequential Kernel Goodness-of-fit Testing where the last inequality follows from Assumption 2.3 and 2.4. The second term k(x, ) Gd is calculated as follows: k(x, ) Gd = i=1 xik(x, ), xik(x, ) G i=1 xi xik(x, ), k(x, ) G where the last inequality follows from Assumption 2.3. Hence log p(x)k(x, ) + k(x, ) Gd 2. The same argument also holds for log p(y)k(y, ) + k(y, ) Gd, therefore, we obtain |hp(x, y)| 4 for any x, y Rd. It implies that S2 p(q) EZ[hp(Z, Z)] 8. (38) (ii) Upper bounding |Vn E[Vn]|. As stated in the seminar work of Hoeffding (1963), any V-statistic can be written as a U-statistic : Vn = 1 n(n 1) i =j h p(Zi, Zj), where h p(Zi, Zj) = n 1 n hp(Zi, Zj) + 1 nhp(Zi, Zi) for i = j. We have shown that hp(x, y) [ 4, 4], then it also holds that h p(Zi, Zj) [ 4, 4]. We are ready to use the result of Hoeffding (1963, Equation (5.7)): P(|Vn E[Vn]| t) 2 exp( nt2/64). Choosing t = 8 q n , then we have, with probability of at least 1 δ: |Vn E[Vn]| 8 Combining (38) and (39), we have, with probability of at least 1 δ: |Vn S2 p(q)| 8 Now, we recall the definition of Eˆqt 1[ξp(Z, )]: Eˆqt 1[ξp(Z, )] := 1 t 1 i=1 ξp(Zi, ) (40) Lemma B.2. Suppose that Assumption 2.3 and 2.4 hold. For Eˆqt 1[ξp(Z, )] defined in (40), it holds that Eˆqt 1[ξp(Z, )] Gd a.s. Eq[ξp(Z, )] Gd . (41) Proof. We have Eq[ξp(Z, )] 2 Gd = S2 p(q) Eˆqt 1[ξp(Z, )] 2 Gd = 1 (t 1)2 i=1 hp(Zi, Zj) =: Vt 1, Sequential Kernel Goodness-of-fit Testing where Vt 1 is defined in Lemma B.1. From Lemma B.1 and the Borel-Cantelli lemma, it follows that: Eˆqt 1[ξp(Z, )] 2 Gd a.s. Eq[ξp(Z, )] 2 Gd . The result then follows from the continuous mapping theorem. Lemma B.3. Suppose that Assumption 2.3 and 2.4 hold. Under H1 in (1b), for the oracle (10) and plug-in (12) witness function, it holds that: ˆgt, g Gd a.s. 1. (42) As a consequence, ˆgt g Gd a.s. 0. Proof. Assuming that the alternative (1b) is true, it follows that: Eq[ξp(Z, )] Gd > 0. We aim to show that: * Eˆqt 1[ξp(Z, )] Eˆqt 1[ξp(Z, )] Gd , Eq[ξp(Z, )] Eq[ξp(Z, )] Gd From Lemma B.2, we know that Eˆqt 1[ξp(Z, )] Gd a.s. Eq[ξp(Z, )] Gd. Hence, it suffices to show that Eˆqt 1[ξp(Z, )], Eq[ξp(Z, )] Gd a.s. Eq[ξp(Z, )] 2 Gd . (43) We rewrite Eˆqt 1[ξp(Z, )], Eq[ξp(Z, )] Eˆqt 1[ξp(Z, )], EZ q[ξp(Z, )] i=1 ξp(Zi, ), EZ q[ξp(Z, )] Gd i=1 EZ q[hp(Zi, Z)]. Since we have shown in the proof of (B.1) that |hp(x, y)| 4, by the SLLN, we obtain: i=1 EZ q[hp(Zi, Z)] a.s. EZ,Z [hp(Z, Z )] = S2 p(q). Therefore, we deduce that: Eˆqt 1[ξp(Z, )], EZ q[ξp(Z, )] Gd a.s. Eq[ξp(Z, )] 2 Gd . ˆgt 1 g Gd a.s. 0 simply follows from the fact that ˆgt g Gd = q 2(1 ˆgt, g Gd). Lemma B.4. Suppose that {xt}t 1 is a sequence of numbers such that limt xt = x. Then the corresponding sequence of partial averages also converges to x, that is, limn 1 n Pn t=1 xt = x. 
That also implies that if {Xt}t 1 is a sequence of random variables such that Xt a.s. X, then 1 n Pn t=1 Xt a.s. X. Sequential Kernel Goodness-of-fit Testing Proof. Fix any ε > 0. Since {xt}t 1 is converging, then M > 0, such that: |xt x| M, t 1. Now, let n0 be such that |xt x| ε/2 for all n > n0. Further, choose any n1 > n0: Mn0/n1 ε/2. Hence, for any m > n1, it holds that: 1 m t=n0+1 (xt x) t=1 |xt x| + 1 t=n0+1 |xt x| which implies the result. Before, we state the next result, recall that KSD-based payoffs are based on the predictable estimates {gi}i 1 of the oracle witness function g and have the following form: 2( ˆgi(Zi), log p(Zi) + ˆgi(Zi)), i 1, 2( g (Zi), log p(Zi) + g (Zi)). (44) Lemma B.5. Suppose that Assumption 2.3 and 2.4 hold. Under H1 in (1b), it holds that: i=1 fi(Zi) a.s. Eq[f (Z)], (45) i=1 (fi(Zi))2 a.s. Eq[(f (Z))2]. (46) Proof. First, we prove that (45). Note that: i=1 fi(Zi) E[f (Z)] i=1 f (Zi) E[f (Z)] | {z } a.s. 0 where the second term converges almost surely to 0 by the SLLN. For the first term, we have that: i=1 f (Zi) E[f (Z)] i=1 |fi(Zi) f (Zi)| . Finally, note that |fi(Zi) f (Zi)| = 1 2 | ˆgi g , ξp(Zi) | 2 ˆgi g Gd ξp(Zi, ) Gd ˆgi g Gd a.s. 0. where the last inequality follows from the fact that ξp(Zi, ) Gd 2 and the convergence result is due to Lemma B.3. The result (45) then follows after invoking Lemma B.4. Sequential Kernel Goodness-of-fit Testing Next, we show that (46) holds. Note that: i=1 (fi(Zi))2 = 1 i=1 (fi(Zi) f (Zi) + f (Zi))2 i=1 (fi(Zi) f (Zi))2 | {z } a.s. 0 i=1 f (Zi) (fi(Zi) f (Zi)) i=1 (f (Zi))2 | {z } a.s. E[(f (Z))] where the first convergence result is due to (47) and Lemma B.4 and the second converge result is due to the SLLN. Using (47), Lemma B.4, and the fact that f is bounded by 1, we deduce that: 2 i=1 f (Zi) (fi(Zi) f (Zi)) i=1 |fi(Zi) f (Zi)| a.s. 0, and hence we conclude that the convergence (46) holds. B.4. Auxiliary Results Theorem B.6 (Ville s inequality (Ville, 1939)). Suppose that {Mt}t 0 is a non-negative supermartingale process adapted to a filtration {Ft : t 0}. Then, for any a > 0 it holds that: P( t 1 : Mt a) E[M0]
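Finally, a small simulation (ours, not from the paper) illustrates how Ville's inequality translates into time-uniform type-I error control for the betting construction. Under H0 the payoffs have zero mean, so the wealth process with any predictable betting fraction (a fixed lambda = 0.2 here, purely for simplicity) is a non-negative martingale, and the fraction of runs in which it ever crosses 1/alpha stays below alpha.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, runs, horizon, lam = 0.05, 1000, 5000, 0.2

# Mean-zero payoffs in [-1, 1] stand in for f_t(Z_t) under H0.
f = rng.uniform(-1.0, 1.0, size=(runs, horizon))
wealth = np.cumprod(1.0 + lam * f, axis=1)              # wealth paths, K_0 = 1
ever_crossed = (wealth.max(axis=1) >= 1.0 / alpha).mean()

print("estimate of P(exists t: K_t >= 1/alpha):", ever_crossed, "vs alpha =", alpha)
```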