# Sequential Predictive Two-Sample and Independence Testing

Aleksandr Podkopaev (Walmart Global Tech, sasha.podkopaev@walmart.com) · Aaditya Ramdas (Carnegie Mellon University, aramdas@cmu.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We study the problems of sequential nonparametric two-sample and independence testing. Sequential tests process data online and allow using observed data to decide whether to stop and reject the null hypothesis or to collect more data, while maintaining type I error control. We build upon the principle of (nonparametric) testing by betting, where a gambler places bets on future observations and their wealth measures evidence against the null hypothesis. While recently developed kernel-based betting strategies often work well on simple distributions, selecting a suitable kernel for high-dimensional or structured data, such as images, is often nontrivial. To address this drawback, we design prediction-based betting strategies that rely on the following fact: if a sequentially updated predictor starts to consistently determine (a) which distribution an instance is drawn from, or (b) whether an instance is drawn from the joint distribution or the product of the marginal distributions (the latter produced by external randomization), it provides evidence against the two-sample or independence nulls respectively. We empirically demonstrate the superiority of our tests over kernel-based approaches under structured settings. Our tests can be applied beyond the case of independent and identically distributed data, remaining valid and powerful even when the data distribution drifts over time.

1 Introduction

We consider two closely related problems of nonparametric two-sample and independence testing. In the former, given observations from two distributions P and Q, the goal is to test the null hypothesis that the distributions are the same, H0 : P = Q, against the alternative that they are not, H1 : P ≠ Q. In the latter, given observations from some joint distribution PXY, the goal is to test the null hypothesis that the random variables are independent, H0 : PXY = PX PY, against the alternative that they are not, H1 : PXY ≠ PX PY.

Kernel tests, such as kernel-MMD [Gretton et al., 2012] for two-sample and HSIC [Gretton et al., 2005] for independence testing, are amongst the most popular methods for solving these tasks and work well on data from simple distributions. However, their performance is sensitive to the choice of a kernel and its parameters, such as the bandwidth, and applying such tests requires additional effort. Further, selecting kernels for structured data, like images, is a nontrivial task. Lastly, kernel tests suffer from decaying power in high dimensions [Ramdas et al., 2015].

Predictive two-sample and independence tests (2STs and ITs respectively) aim to address these limitations of kernelized approaches. The idea of using classifiers for two-sample testing dates back to Friedman [2004], who proposed using the output scores as a dimension reduction method. More recent works have focused on the direct evaluation of a learned model for testing. In an initial arXiv 2016 preprint, Kim et al. [2021] proposed and analyzed predictive 2STs based on sample-splitting, namely testing whether the accuracy of a model trained on the first split of the data and estimated on the second split is significantly better than chance.
The authors established the consistency of asymptotic and exact tests in high-dimensional settings and provided rates for the case of Gaussian distributions. Inspired by this work, Lopez-Paz and Oquab [2017] soon after demonstrated empirically that predictive 2STs often outperform state-of-the-art 2STs, such as kernel-MMD. Recently, Hediger et al. [2022] proposed a related test that utilizes out-of-bag predictions for bagging-based classifiers, such as random forests. To incorporate measures of model confidence, many authors have also explored using test statistics that are based on the output scores instead of the binary class predictions [Kim et al., 2019, Liu et al., 2020, Cheng and Cloninger, 2022, Kübler et al., 2022]. The focus of the above works is on batch tests, which are calibrated to have a fixed false positive rate (say, 5%) if the sample size is specified in advance. In contrast, we focus on the setting of sequentially released data. Our tests allow on-the-fly decision-making: an analyst can use observed data to decide whether to stop and reject the null or to collect more data, without inflating the false alarm rate.

Problem Setup. First, we define the problems of sequential two-sample and independence testing.

Definition 1 (Sequential two-sample testing). Suppose that we observe a stream of i.i.d. observations ((Zt, Wt))t≥1, where Wt ∼ Rademacher(1/2), the distribution of Zt | Wt = +1 is denoted P, and that of Zt | Wt = −1 is denoted Q. The goal is to design a sequential test for
H0 : P = Q, (1a)
H1 : P ≠ Q. (1b)

Definition 2 (Sequential independence testing). Suppose that we observe a stream of observations ((Xt, Yt))t≥1, where (Xt, Yt) ∼ PXY for t ≥ 1. The goal is to design a sequential test for
H0 : (Xt, Yt) ∼ PXY and PXY = PX PY, (2a)
H1 : (Xt, Yt) ∼ PXY and PXY ≠ PX PY. (2b)

We operate in the framework of power-one tests [Darling and Robbins, 1968] and define a level-α sequential test as a mapping Φ : ∪t≥1 Z^t → {0, 1} that satisfies
PH0(∃ t ≥ 1 : Φ(Z1, . . . , Zt) = 1) ≤ α.
We refer to such a notion of type I error control as time-uniform. Here, 0 stands for "do not reject the null yet" and 1 stands for "reject the null and stop". Defining the stopping time as the first time that the test outputs 1, τ := inf{t ≥ 1 : Φ(Z1, . . . , Zt) = 1}, a sequential test must satisfy
PH0(τ < ∞) ≤ α. (3)
We aim to design consistent tests, which are guaranteed to stop if the alternative happens to be true:
PH1(τ < ∞) = 1. (4)

Related Works. Our construction follows the principle of testing by betting [Shafer, 2021]. The most closely related work is the nonparametric 2ST by betting of Shekhar and Ramdas [2023], which later inspired several follow-up works, including the sequential (marginal) kernelized independence tests of Podkopaev et al. [2023] and the sequential conditional independence tests under the model-X assumption of Grünwald et al. [2023] and Shaer et al. [2023]. We extend the line of work of Shekhar and Ramdas [2023] and of Podkopaev et al. [2023], studying predictive approaches in detail. Sequential predictive 2STs were studied by Lhéritier and Cazals [2018, 2019], but in practice, those tests were found to be inferior to the ones developed by Shekhar and Ramdas [2023]. Recently, Pandeva et al. [2022] proposed a predictive 2ST that handles unknown class proportions using ideas from [Wasserman et al., 2020].
In Section 2, we show that our tests are closely related to [Lhéritier and Cazals, 2018, 2019, Pandeva et al., 2022], but are consistent under much milder assumptions. Sequential Nonparametric Two-Sample and Independence Testing by Betting. Suppose that one observes a sequence of random variables (Zt)t 1, where Zt Z. The principle of testing by betting [Shafer and Vovk, 2019, Shafer, 2021] can be described as follows. A player starts the game with initial capital K0 = 1. At round t, she selects a payoff function ft : Z [ 1, ) that satisfies EZ PZ [ft(Z) | Ft 1] = 0 for all PZ H0, where Ft 1 = σ(Z1, . . . , Zt 1) denotes the sigma-field generated by Z1, . . . , Zt 1 with F0 being the trivial sigma-field, and bets a fraction of her wealth λt Kt 1 for an Ft 1-measurable λt [0, 1]. Once Zt is revealed, her wealth is updated as Kt = Kt 1 + λt Kt 1ft(Zt) = Kt 1 (1 + λtft(Zt)) . (5) The wealth is used to measure the evidence against H0: if a player can make money in such game, the null is rejected. Formally, for testing H0 at level α (0, 1), we use the stopping rule: τ = inf {t 1 : Kt 1/α} . (6) The validity of the test follows from Ville s inequality [Ville, 1939], a time-uniform generalization of Markov s inequality, since (Kt)t 0 is a nonnegative martingale starting at 1 under any PZ H0. To ensure high power, one has to choose (ft)t 1 and (λt)t 1 to guarantee the growth of the wealth if the alternative is true. In the context of two-sample and independence testing, Shekhar and Ramdas [2023] and Podkopaev et al. [2023] recently proposed effective betting strategies based on kernelized measures of statistical distance and dependence respectively which admit a variational representation. In a nutshell, datapoints observed prior to a given round are used to estimate the witness function one that best highlights the discrepancy between P and Q for two-sample (or between PXY and PX PY for independence) testing and a bet is formed as an estimator of a chosen measure of distance (or dependence). In contrast, our bets are based on evaluating the performance of a sequentially learned predictor that distinguishes between instances from distributions of interest. Remark 1. In practical settings, an analyst may not be able to continue collecting data forever and may adaptively stop the experiment before the wealth exceeds 1/α. In such case, one may use a different threshold for rejecting the null at a stopping time τ, namely U/α, where U is a (stochastically larger than) uniform random variable on [0, 1] drawn independently from (Ft)t 0. This choice strictly improves the power of the test without violating the validity; see [Ramdas and Manole, 2023]. Contributions. In Section 2, we develop sequential predictive two-sample and independence tests. We establish sufficient conditions for consistency of our tests and relate those to evaluation metrics of the underlying models. In Section 3, we conduct an extensive empirical study on synthetic and real data, justifying the superiority of our tests over the kernelized ones on structured data. 2 Classification-based Two-Sample Testing We begin with the two-sample testing setting outlined in Definition 5. Let G : Z [ 1, 1] denote a class of predictors used to distinguish between instances from P (labeled as +1) and Q (labeled as 1)1. We assume that: (a) if g G, then g G, (b) if g G and s [0, 1], then sg G, and (c) predictions are based on sign[g( )], and if g(z) = 0, then z is assigned to the positive class. 
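To make the betting protocol (5)-(6) concrete, the following is a minimal sketch of the generic betting loop, assuming user-supplied payoff and betting-fraction rules; the names sequential_betting_test, payoff_fn and bet_fn are illustrative and not part of any released implementation. Any payoff that is bounded below by −1 and has zero conditional mean under the null, combined with any predictable fraction in [0, 1], keeps the wealth a nonnegative martingale under the null, so the level-α guarantee is preserved.

```python
# Minimal sketch of the testing-by-betting protocol in (5)-(6).
# `payoff_fn(z, history)` must return a value in [-1, inf) with zero
# conditional mean under the null; `bet_fn(history)` must return a
# predictable betting fraction in [0, 1]. Both are supplied by the user.

def sequential_betting_test(stream, payoff_fn, bet_fn, alpha=0.05):
    wealth = 1.0                          # K_0 = 1
    history = []
    for t, z in enumerate(stream, start=1):
        lam = bet_fn(history)             # lambda_t: predictable, in [0, 1]
        f = payoff_fn(z, history)         # f_t(Z_t) >= -1, E_H0[f_t(Z_t) | F_{t-1}] = 0
        wealth *= 1.0 + lam * f           # K_t = K_{t-1} (1 + lambda_t f_t(Z_t))
        history.append(z)
        if wealth >= 1.0 / alpha:         # stopping rule (6), valid by Ville's inequality
            return "reject H0", t, wealth
    return "do not reject H0 yet", len(history), wealth
```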
Two natural evaluation metrics of a predictor g G include the misclassification and the squared risks: Rm(g) := P (W sign [g (Z)] < 0) , Rs(g) := E (g(Z) W)2 , (7) which give rise to the following measures of distance between P and Q, namely dm(P, Q) := sup g G 2 Rm(g) , ds(P, Q) := sup g G (1 Rs(g)) . (8) It is easy to check that dm(P, Q) [0, 1/2] and dm(P, Q) [0, 1]. The upper bounds hold due to the non-negativity of the risks and the lower bounds follow by considering g : g(z) = 0, z Z. Note that the misclassification risk is invariant to rescaling (Rm(sg) = Rm(g), s (0, 1]), whereas the squared risk is not, and rescaling any g to optimize the squared risk provides better contrast between P and Q. In the next result, whose proof is deferred to Appendix D.3, we present an important relationship between the squared risk of a rescaled predictor and its expected margin: E [W g(Z)]. Proposition 1. Fix an arbitrary predictor g G. The following claims hold2: 1. For the misclassification risk, we have that: sup s [0,1] 2 Rm(sg) = 1 2 Rm(g) 0 = 1 2 E [W sign [g(Z)]] 0. (9) 2. For the squared risk, we have that: sup s [0,1] (1 Rs(sg)) (E [W g(Z)] 0) E [W g(Z)] E [g2(Z)] 1 (10) Further, ds(P, Q) > 0 if and only if there exists g G such that E [W g(Z)] > 0. 1Similar argument can be applied to general scoring-based classifiers: g : Z R, e.g., SVMs, by considering G = { g : g(z) = tanh(s g(z)), g G, s > 0}, where the constant s > 0 corrects the scale of the scores. 2 and denote a maximum and a minimum respectively: a b = max{a, b}, a b = min{a, b}. Consider an arbitrary predictor g G. Note that under the null H0 in (1a), the misclassification risk Rm(g) does not depend on g, being equal to 1/2, whereas the squared risk Rs(g) does. In contrast, the lower bound (10) no longer depends on g under the null H0, being equal to 0. Oracle Test. It is a known fact that the minimizer of either the misclassification or the squared risk is g Bayes(z) = 2η(z) 1, where η(z) = P (W = +1 | Z = z). Since g Bayes may not belong to G, we consider g G, which minimizes either the misclassification or the squared risk over predictors in G, and omit superscripts for brevity. To design payoff functions, we follow Proposition 1 and consider f m (Zt, Wt) = Wt sign [g (Zt)] { 1, 1}, (11a) f s (Zt, Wt) = Wt g (Zt) [ 1, 1]. (11b) Let the oracle wealth processes based on misclassification and squared risks (Km, t )t 0 and (Ks, t )t 0 be defined by using the payoff functions (11a) and (11b) respectively, along with a predictable sequence of betting fractions (λt)t 1 selected via online Newton step (ONS) strategy [Hazan et al., 2007] (Algorithm 1), which has been studied in the context of coin-betting by Cutkosky and Orabona [2018]. If a constant betting fraction is used throughout: λt = λ, t, then E h 1 t log Ki, t i = E log(1 + λf i (Z, W)) , i {m, s} . (12) To illustrate the tightness of our results, we consider the optimal constant betting fractions which maximize the log-wealth (12) and are constrained to lie in [ 0.5, 0.5], like ONS bets: λi = arg max λ [ 0.5,0.5] E log(1 + λf i (Z, W)) , i {m, s} . (13) Algorithm 1 Online Newton step (ONS) strategy for selecting betting fractions Input: sequence of payoffs (ft)t 1, λONS 1 = 0, a0 = 1. for t = 1, 2, . . . 
do Observe ft [ 1, 1]; Set zt := ft/(1 + λONS t ft); Set at := at 1 + z2 t ; Set λONS t+1 := 1 2 0 λONS t + 2 2 log 3 zt Assuming that the 2ST null is false and a predictor with non-trivial prediction risk is used, it is easy to see that the optimal constant betting fraction has to be positive, which is why in Algorithm 1 we truncate λONS t at 0 instead of 1/2 (the latter option is used in [Cutkosky and Orabona, 2018]). We note that this is allowed and does not affect any theoretical claims of the paper. In early simulations, we also did not observe any major differences between the two truncating options. We conclude with the following result for the oracle tests, whose proof is deferred to Appendix D.3. Theorem 1. The following claims hold: 1. Suppose that H0 in (1a) is true. Then the oracle sequential test based on either (Km, t )t 0 or (Ks, t )t 0 ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (1b) is true. Then: (a) The growth rate of the oracle wealth process (Km, t )t 0 satisfies: lim inf t 1 t log Km, t a.s. 1 2 Rm (g ) 2 . (14) If Rm (g ) < 1/2, then the test based on (Km, t )t 0 is consistent: PH1 (τ < ) = 1. Further, the optimal growth rate achieved by λm in (13) satisfies: E [log(1 + λm f m (Z, W))] 16 2 Rm(g ) 2 1 2 Rm(g ) . (15) (b) The growth rate of the oracle wealth process (Ks, t )t 0 satisfies: lim inf t 1 t log Ks, t a.s. 1 4 E [W g (Z)] . (16) If E [W g (Z)] > 0, then the test based on (Ks, t )t 0 is consistent: PH1 (τ < ) = 1. Further, the optimal growth rate achieved by λs in (13) satisfies: E [log(1 + λs f s (Z, W))] 1 2 E [W g (Z)] . (17) Theorem 1 precisely characterizes the properties of the oracle wealth processes and relates those to interpretable metrics of predictive performance. Further, the proof of Theorem 1 highlights a direct impact of the variance of the payoffs on the wealth growth rate, and hence the power of the resulting sequential tests (as the null is rejected once the wealth exceeds 1/α). The second moment of the payoffs based on the misclassification risk (11a) is equal to one, resulting in a slow growth: the bound (14) is proportional to squared deviation of the misclassification risk from one half. The bound (15) shows that the growth rate with the ONS strategy matches, up to constants, that of the oracle betting fraction. Note that the second term in (15) characterizes the growth rate if Rm(g ) < 5/16 (low Bayes risk). In this regime, the growth rate of our test is at least (3/16) (1/2 Rm(g )) which is close to the optimal rate. The second moment of the payoffs based on the squared risk is more insightful. First, we present a result for the case when the oracle predictor g in (11b) is replaced by an arbitrary g G. The proof is deferred to Appendix D.3. Corollary 1. Consider an arbitrary g G with nonnegative expected margin: E [W g(Z)] 0. Then the growth rate of the corresponding wealth process (Ks t)t 0 satisfies: lim inf t 1 t log Ks t a.s. 1 4 sup s [0,1] (1 Rs (sg)) E [W g(Z)]) (18a) 1 4 (E [W g(Z)])2 , (18b) and the optimal growth rate achieved by λs in (13) satisfies: E [log(1 + λs f s(Z, W))] 4 3 sup s [0,1] (1 Rs (sg)) 1 2 E [W g(Z)] . (19) Corollary 1 states that for an arbitrary g G, the growth rate is lower bounded by the minimum of the expected margin and the (optimized) squared risk of such predictor. While the latter term is always smaller for the optimal g , this may not hold for an arbitrary g G. 
The lower bound (18b), which follows from Proposition 1, is always worse than that for g* (the expected margin is squared). The upper bound (19) shows that the growth rate with the ONS strategy matches, up to constants, that of the optimal constant betting fraction. Before presenting a practical sequential 2ST, we provide two important remarks that further contextualize the current work in the literature.

Remark 2. In practice, we learn a predictor sequentially and have to choose a learning algorithm. Note that (18a) suggests that direct margin maximization may hurt the power of the resulting 2ST: the squared risk is sensitive to miscalibrated and overconfident predictors. Kübler et al. [2022] made a similar conjecture in the context of batch two-sample testing. To optimize the power, the authors suggested minimizing the cross-entropy or the squared loss and related such an approach to maximizing the signal-to-noise ratio, a heuristic approach that was proposed earlier by Sutherland et al. [2017]³.

Remark 3. Suppose that gBayes ∈ G and consider the payoff function based on the squared risk (11b). At round t, the wealth of a player Kt−1 is multiplied by
1 + λt Wt gBayes(Zt) = (1 − λt) · 1 + λt · (1 + Wt gBayes(Zt))
= (1 − λt) · 1 + λt · [ (η(Zt))^{1{Wt=+1}} (1 − η(Zt))^{1{Wt=−1}} ] / [ (1/2)^{1{Wt=+1}} (1/2)^{1{Wt=−1}} ], (20)
and hence, the betting fractions interpolate between the regimes of not betting and betting using a likelihood ratio. From this standpoint, the 2STs of Lhéritier and Cazals [2018, 2019] and Pandeva et al. [2022] set λt = 1 for all t, and use only the second term for updating the wealth, despite the fact that the true likelihood ratio is unknown. An argument about the consistency of such a test hence requires imposing strong assumptions on the sequence of predictors (gt)t≥1 [Lhéritier and Cazals, 2018, 2019]. Our test differs in a critical way: we use a sequence of betting fractions (λt)t≥1 that adapts to the quality of the underlying predictors, yielding a consistent test under much weaker assumptions.

³ Standard CLT does not apply directly when the conditioning set grows; see [Kim and Ramdas, 2020].

Example 1. Consider P = N(0, 1) and Q = N(δ, 1) for 20 values of δ, equally spaced in [0, 0.5]. For a given δ, the Bayes-optimal predictor is
gBayes(z) = [φ(z; 0, 1) − φ(z; δ, 1)] / [φ(z; 0, 1) + φ(z; δ, 1)] ∈ [−1, 1], (21)
where φ(z; µ, σ²) denotes the density of N(µ, σ²) evaluated at z. In Figure 1a, we compare tests that use (a) the Bayes-optimal predictor, and (b) a predictor constructed with the plug-in estimates of the means and variances. While in the former case betting using a likelihood ratio (λt = 1 for all t) is indeed optimal, our test with an adaptive sequence (λt)t≥1 is superior when a predictor is learned. The difference becomes even more drastic in Figure 1b, where a (regularized) k-NN predictor is used.

Figure 1 ((a) parametric model; (b) nonparametric model, k-NN; rejection rate at time T = 5000 vs. δ): Comparison between our 2ST with adaptive betting fractions and the likelihood ratio test for Example 1. While the likelihood ratio test is better if the Bayes-optimal predictor is used, our test is superior if a predictor is learned. The results are aggregated over 500 runs for each value of δ.

Sequential Classification-based 2ST (Seq-C-2ST).
Let Ac : ( t 1(Z { 1, +1})t) G G denote a learning algorithm which maps a training dataset of any size and previously used classifier, to an updated predictor. For example, Ac may apply a single gradient descent step using the most recent observation to update a model. We start with D0 = and g1 G : g1(z) = 0, for any z Z. At round t, we use one of the payoffs: f m t (Zt, Wt) = Wt sign [gt(Zt)] { 1, 1}, (22a) f s t (Zt, Wt) = Wt gt(Zt) [ 1, 1]. (22b) After (Zt, Wt) is used for betting, we update a training dataset: Dt = Dt 1 {(Zt, Wt)}, and an existing predictor: gt+1 = Ac(Dt, gt). We summarize our sequential classification-based 2ST (Seq C-2ST) in Algorithm 2. We note that using pre-trained models (e.g., for image data) as initialization may definitely improve the performance during the early stages of testing in practice. Algorithm 2 Sequential classification-based 2ST (Seq-C-2ST) Input: level α (0, 1), data stream ((Zt, Wt))t 1, g1(z) 0, Ac, D0 = , λONS 1 = 0. for t = 1, 2, . . . do Evaluate the payoff f s t (Zt, Wt) as in (22a); Using λONS t , update the wealth process Ks t as per (5); if Ks t 1/α then Reject H0 and stop; else Update the training dataset: Dt := Dt 1 {(Zt, Wt)}; Update predictor: gt+1 = Ac(Dt, gt); Compute λONS t+1 (Algorithm 1) using f s t (Zt, Wt); While we do not need any assumptions to confirm the type I error control, we place some mild assumptions on the learning algorithm Ac to argue about the consistency. Assumption 1 (Rm-learnability). Suppose that H1 in (1b) is true. An algorithm Ac is such that the resulting sequence (gt)t 1 satisfies: lim supt 1 t Pt i=1 1 {Wi sign [gi(Zi)] < 0} a.s. < 1/2. Assumption 2 (Rs-learnability). Suppose that H1 in (1b) is true. An algorithm Ac is such that the resulting sequence (gt)t 1 satisfies: lim supt 1 t Pt i=1 (gi(Zi) Wi)2 a.s < 1. In words, the above assumptions state that a sequence of predictors (gt)t 1 is better than a chance predictor on average. We conclude with the following result, whose proof is deferred to Appendix D.3. Theorem 2. The following claims hold for Seq-C-2ST (Algorithm 2): 1. If H0 in (1a) is true, the test ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (1b) is true. Then: (a) Under Assumption 1, the test with the payoff (22a) is consistent: PH1 (τ < ) = 1. (b) Under Assumption 2, the test with the payoff (22b) is consistent: PH1 (τ < ) = 1. Remark 4. Suppose that we perform two-sample testing for d-dimensional data. At time T, the total accumulated computation for the kernelized test of Shekhar and Ramdas [2023] is O(d T 2). For our test, the answer depends on the chosen classifier and learning algorithm. Using logistic regression in combination with gradient descent for updating model parameters results in cheap updates and payoff evaluation (both are O(d) at each round, and hence the total accumulated computation at time T is O(d T)). For k-NN classifier, no parameters have to be updated, yet evaluating payoffs becomes more expensive with a growing sample size, resulting in the total accumulated computation of O(kd T 2) at time T. For more complex models like neural nets, runtime depends on the chosen architecture: the total accumulated computation at time T is O((c B + F)T), where F and B are the costs of forward-propagation and back-propagation steps respectively and c is the number of back-propagation steps applied after processing the next point (the exact cost depends on the architecture). Sequential Classification-based Independence Test (Seq-C-IT). 
Under the setting of Definition 2, a single point from PXY is revealed at each round. Following [Podkopaev et al., 2023], we bet on two points from PXY (labeled as +1) and utilize external randomization to produce instances from PX PY (labeled as −1). Let A_c^IT : (∪t≥1 ((X × Y) × {−1, +1})^t) × G → G denote a learning algorithm which maps a training dataset of any size and a previously used classifier to an updated predictor. We start with D0 = ∅ and g1 : g1(x, y) = 0 for all (x, y) ∈ X × Y. We use derandomized versions of the payoffs (22); e.g., instead of (22b), we use
f^s_t((X_{2t−1}, Y_{2t−1}), (X_{2t}, Y_{2t})) = (1/4) [g_t(X_{2t−1}, Y_{2t−1}) + g_t(X_{2t}, Y_{2t})] − (1/4) [g_t(X_{2t−1}, Y_{2t}) + g_t(X_{2t}, Y_{2t−1})]. (23)
After (X_{2t−1}, Y_{2t−1}) and (X_{2t}, Y_{2t}) have been used for betting, we update the training dataset,
Dt = Dt−1 ∪ {((X_{2t−1}, Y_{2t−1}), +1), ((X_{2t}, Y_{2t}), +1), ((X_{2t−1}, Y_{2t}), −1), ((X_{2t}, Y_{2t−1}), −1)},
and the existing predictor: gt+1 = A_c^IT(Dt, gt). Seq-C-IT inherits the time-uniform type I error control and the consistency guarantees of Theorem 2, and we omit the details for brevity.

3 Experiments

Synthetic Experiments. In our evaluation, we first consider synthetic datasets where the complexity of the independence testing setup is characterized by a single univariate parameter. We set the monitoring horizon to T = 5000 points from PXY, and for each parameter value, we aggregate the results over 200 runs. In particular, we use the following synthetic settings:

1. Spherical model. Let (Ut)t≥1 be a sequence of random vectors on the unit sphere in Rd: Ut ∼iid Unif(Sd), and let u(i) denote the i-th coordinate of u. For t ≥ 1, we take (Xt, Yt) = ((Ut)(1), (Ut)(2)). We consider d ∈ {3, . . . , 10}, where larger d defines a harder setup.

2. Hard-to-detect-dependence (HTDD) model. We sample ((Xt, Yt))t≥1 from the density
p(x, y) = (1/(4π²)) (1 + sin(wx) sin(wy)) 1{(x, y) ∈ [−π, π]²}. (24)
We consider w ∈ {0, . . . , 6}, where H0 is true (the random variables are independent) if and only if w = 0. For w > 0, Corr(X, Y) ∝ 1/w², and the setup is harder for larger w.

For the comparison, we use two predictive models to construct Seq-C-ITs:

1. Let Nt(z) := N(z, Dt−1, kt) denote the set of the kt closest points in Dt−1 to a query point z := (x, y). We consider a regularized k-NN predictor: ĝt(z) = (1/(kt + 1)) Σ_{(Z,W) ∈ Nt(z)} W. We select the number of neighbors using the square-root rule: kt = √(4(t − 1)).

2. We use a multilayer perceptron (MLP) with three hidden layers of 128, 64 and 32 neurons respectively, whose parameters are learned using an incremental training scheme.

We use the HSIC-based sequential kernelized independence test (SKIT) [Podkopaev et al., 2023] as a reference test and defer details, such as the MLP training scheme and SKIT hyperparameters, to Appendix E.1. In Figure 2, we observe that SKIT outperforms Seq-C-ITs under the spherical model (with no localized dependence structure), whereas, under the structured HTDD model, Seq-C-ITs are superior. Further, inspecting Figure 2b at w = 0 confirms that all tests control the type I error. We refer the reader to Appendix E.2 for additional experiments on synthetic data with localized dependence where Seq-C-ITs are superior. In Appendix E.2, we also provide results for the average stopping times of our tests: we empirically confirm that our tests adapt to the complexity of the problem at hand, stopping earlier on easy tasks and later on harder ones.
Figure 2 ((a) spherical model: power at time T = 5000 vs. d; (b) HTDD model: rejection rate at time T = 5000 vs. w; methods: SKIT, Seq-C-IT (MLP) and Seq-C-IT (k-NN), each with payoffs f^s_t and f^m_t): Power of different sequential independence tests on synthetic data from Section 3. Under the spherical model (no localized dependence), SKIT is better than Seq-C-ITs. Under the (structured) HTDD model, SKIT is inferior to sequential predictive independence tests.

Real Data Experiments. First, we compare sequential classification-based and kernelized 2STs using the Karolinska Directed Emotional Faces (KDEF) dataset [Lundqvist et al., 1998], which contains images of actors and actresses expressing different emotions: afraid (AF), angry (AN), disgusted (DI), happy (HA), neutral (NE), sad (SA), and surprised (SU). Following earlier works [Lopez-Paz and Oquab, 2017, Jitkrittum et al., 2016], we focus on the straight profile only and assign the HA, NE, SU emotions to the positive class (instances from P), and the AF, AN, DI emotions to the negative class (instances from Q); see Figure 3a. We remove corrupted images and obtain a dataset containing 802 images with six different emotions. The original images (562 × 762 pixels) are cropped to exclude the background, resized to 64 × 64 pixels, and converted to grayscale. For Seq-C-2ST, we use a small CNN as the underlying model and defer details about the architecture and training to Appendix E.1. As a reference kernel-based 2ST, we use the sequential MMD test of Shekhar and Ramdas [2023] and adapt it to the setting where at each round either an observation from P or one from Q is revealed; see Appendix E.1 for details. We omit the comparison to Lhéritier and Cazals [2019] since their test has been shown to be inferior to the one developed by Shekhar and Ramdas [2023]. In Figure 3b, we illustrate that while both tests achieve perfect power after processing sufficiently many observations, our Seq-C-2ST requires fewer observations to do so.

Figure 3 ((b): rejection rate vs. sample size for Seq-C-2ST and the kernel 2ST): (a) Examples of instances from P (top row) and Q (bottom row) for the KDEF dataset. (b) Rejection rates for our test (Seq-C-2ST) and the sequential kernelized 2ST. While both tests achieve perfect power with enough data, our test is superior to the kernelized approach, requiring fewer observations to do so. The results are averaged over 200 random orderings of the data.

Next, we compare two independence tests using the MNIST image dataset [LeCun et al., 1998]. To simulate the null setting, we sample pairs of random images from the entire dataset, and to simulate the alternative, we sample pairs of random images depicting the same digit (Figure 4a). For Seq-C-IT, we use an MLP with the same architecture as in the simulations on synthetic data. For SKIT, we use the median heuristic with 20 points from PXY to compute the kernel hyperparameters. In Figure 4b, we show that while both tests control the type I error under H0, SKIT is inferior to Seq-C-IT under H1, requiring twice as much data to achieve perfect power.

Figure 4 ((b): rejection rate vs. number of pairs for Seq-C-IT and SKIT under H0 and H1): (a) Instances from PXY (top row) and PX PY (bottom row) for the MNIST dataset. (b) While both independence tests control the type I error under H0, Seq-C-IT outperforms SKIT under H1, rejecting the null much sooner. The results are aggregated over 200 runs.
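To make the construction used in these experiments concrete, the following is a minimal sketch of Seq-C-IT combining the derandomized squared-risk payoff (23), the regularized k-NN predictor with the square-root rule, and the ONS betting fractions of Algorithm 1. It is a simplified stand-in for the implementation described in Appendix E.1 rather than the exact code used above; the function names are illustrative, and xs, ys are assumed to be arrays of shape (2T, d_x) and (2T, d_y).

```python
import numpy as np

def ons_update(lam, a, f):
    """One ONS step (Algorithm 1): returns the next betting fraction and the updated state a_t."""
    z = f / (1.0 + lam * f)
    a = a + z * z
    lam = lam + (2.0 / (2.0 - np.log(3.0))) * z / a
    return min(max(lam, 0.0), 0.5), a            # truncated to [0, 1/2], as in Algorithm 1

def knn_score(feats, labels, query, k):
    """Regularized k-NN predictor: average label of the k nearest training points, shrunk by 1/(k+1)."""
    if len(labels) == 0:
        return 0.0
    dists = np.linalg.norm(feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return labels[nearest].sum() / (k + 1)

def seq_c_it_knn(xs, ys, alpha=0.05):
    """Seq-C-IT sketch: at round t, bet on pairs (X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t}) via payoff (23)."""
    wealth, lam, a = 1.0, 0.0, 1.0
    feats = np.empty((0, xs.shape[1] + ys.shape[1]))
    labels = np.empty(0)
    for t in range(1, len(xs) // 2 + 1):
        x1, y1, x2, y2 = xs[2 * t - 2], ys[2 * t - 2], xs[2 * t - 1], ys[2 * t - 1]
        k = max(1, int(np.sqrt(4 * (t - 1))))    # square-root rule for the number of neighbors
        g = lambda x, y: knn_score(feats, labels, np.concatenate([x, y]), k)
        # Derandomized squared-risk payoff (23): joint pairs (+1) vs. swapped pairs (-1).
        f = 0.25 * (g(x1, y1) + g(x2, y2)) - 0.25 * (g(x1, y2) + g(x2, y1))
        wealth *= 1.0 + lam * f                  # wealth update (5)
        if wealth >= 1.0 / alpha:                # stopping rule (6)
            return True, t
        # Grow the training set: two joint pairs labeled +1, two swapped pairs labeled -1.
        feats = np.vstack([feats,
                           np.concatenate([x1, y1]), np.concatenate([x2, y2]),
                           np.concatenate([x1, y2]), np.concatenate([x2, y1])])
        labels = np.concatenate([labels, [1.0, 1.0, -1.0, -1.0]])
        lam, a = ons_update(lam, a, f)           # predictable bet for the next round
    return False, len(xs) // 2
```

The same skeleton yields Seq-C-2ST by replacing the payoff with (22b) and adding one labeled observation (Zt, Wt) to the training set per round.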
4 Conclusion While kernel methods are state-of-the-art for nonparametric two-sample and independence testing, their performance often deteriorates on complex data, e.g., high-dimensional data with localized dependence. In such settings, prediction-based tests are often much more effective. In this work, we developed sequential predictive two-sample and independence tests following the principle of testing by betting. Our tests control the type I error despite continuously monitoring the data and are consistent under weak and tractable assumptions. Further, our tests provably adapt to the complexity of a problem at hand: they stop earlier on easy tasks and later on harder ones. An additional advantage of our tests is that an analyst may modify the design choices, e.g., model architecture, on-the-fly. Through experiments on synthetic and real data, we confirm that our tests are competitive to kernel-based ones overall and outperform those under structured settings. We refer the reader to the Appendix for additional results that were not included in the main paper: 1. In Appendix A, we complement classification-based ITs with a regression-based approach. Regression-based ITs represent an alternative to the classification-based approach in settings where a data stream ((Xt, Yt))t 1 may be processed directly as feature-response pairs. 2. In Section 2, we considered the case of balanced classes, meaning that at each round, an instance from either P or Q is observed with equal chance. In Appendix B, we extend the methodology to a more general case of two-sample testing with unknown class proportions. 3. Batch two-sample and independence tests rely on either a cutoff computed using the asymptotic null distribution of a chosen test statistic (when it is tractable) or a permutation p-value, and if the distribution drifts, both approaches fail to provide the type I error control. In contrast, Seq-C-2ST and Seq-C-IT remain valid beyond the i.i.d. setting by construction (analogous to tests developed by Shekhar and Ramdas [2023], Podkopaev et al. [2023]), and we refer the reader to Appendix C for more details. Acknowledgements. The authors thank Ian Waudby-Smith and Tudor Manole for fruitful discussions. AR acknowledges support from NSF grants IIS-2229881 and DMS-2310718. Thomas B. Berrett and Richard J. Samworth. Nonparametric independence testing via mutual information. Biometrika, 2019. Xiuyuan Cheng and Alexander Cloninger. Classification logit two-sample testing by neural networks for differentiating near manifold densities. IEEE Transactions on Information Theory, 2022. Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. In Conference on Learning Theory, 2018. Donald A. Darling and Herbert E. Robbins. Some nonparametric sequential tests with power one. Proceedings of the National Academy of Sciences, 1968. Jerome H. Friedman. On multivariate goodness-of-fit and two-sample testing. Technical report, Stanford University, 2004. Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, 2005. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012. Peter Grünwald, Alexander Henzi, and Tyron Lardy. Anytime-valid tests of conditional independence under model-X. Journal of the American Statistical Associatio, 2023. 
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2007. Simon Hediger, Loris Michel, and Jeffrey Näf. On the use of random forest for two-sample testing. Computational Statistics & Data Analysis, 2022. Wittawat Jitkrittum, Zoltán Szabó, Kacper P Chwialkowski, and Arthur Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, 2016. Ilmun Kim and Aaditya Ramdas. Dimension-agnostic inference using cross U-statistics. Bernoulli, 2020. Ilmun Kim, Ann B. Lee, and Jing Lei. Global and local two-sample tests via regression. Electronic Journal of Statistics, 2019. Ilmun Kim, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. Classification accuracy as a proxy for two-sample testing. The Annals of Statistics, 2021. Jonas M. Kübler, Vincent Stimper, Simon Buchholz, Krikamol Muandet, and Bernhard Schölkopf. Auto ML two-sample test. In Advances in Neural Information Processing Systems, 2022. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. Alix Lhéritier and Frédéric Cazals. A sequential non-parametric multivariate two-sample test. IEEE Transactions on Information Theory, 2018. Alix Lhéritier and Frédéric Cazals. Low-complexity nonparametric bayesian online prediction with universal guarantees. In Advances in Neural Information Processing Systems, 2019. Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J. Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, 2020. David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017. Daniel Lundqvist, Anders Flykt, and Arne Öhman. The Karolinska Directed Emotional Faces - KDEF. Technical report, Karolinska Institutet, 1998. Teodora Pandeva, Tim Bakker, Christian A Naesseth, and Patrick Forré. E-valuating classifier two-sample tests. ar Xiv preprint: 2210.13027, 2022. Aleksandr Podkopaev, Patrick Blöbaum, Shiva Prasad Kasiviswanathan, and Aaditya Ramdas. Sequential kernelized independence testing. In International Conference on Machine Learning, 2023. Aaditya Ramdas and Tudor Manole. Randomized and exchangeable improvements of Markov s, Chebyshev s and Chernoff s inequalities. ar Xiv preprint: 2304.02611, 2023. Aaditya Ramdas, Sashank Jakkam Reddi, Barnabas Poczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. AAAI Conference on Artificial Intelligence, 2015. Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. ar Xiv preprint: 2009.03167, 2020. Shalev Shaer, Gal Maman, and Yaniv Romano. Model-free sequential testing for conditional independence via testing by betting. In Conference on Artificial Intelligence and Statistics, 2023. Glenn Shafer. Testing by betting: a strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2021. Glenn Shafer and Vladimir Vovk. Game-Theoretic Foundations for Probability and Finance. Wiley Series in Probability and Statistics. Wiley, 2019. Shubhanshu Shekhar and Aaditya Ramdas. Nonparametric two-sample testing by betting. IEEE Transactions on Information Theory, 2023. 
Danica J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017. Jean Ville. Étude critique de la notion de collectif. Thèses de l entre-deux-guerres. Gauthier-Villars, 1939. Larry Wasserman, Aaditya Ramdas, and Sivaraman Balakrishnan. Universal inference. Proceedings of the National Academy of Sciences, 2020. A Regression-based Independence Testing Regression-based independence tests represent an alternative to classification-based approaches in settings where a data stream ((Xt, Yt))t 1 may be processed directly as feature-response pairs. Suppose that one selects a functional class G : X Y for performing such prediction task, and let ℓdenote a loss function that evaluates the quality of predictions. For example, if (Yt)t 1 is a sequence of univariate random variables, one can use the squared loss: ℓ(g(x), y) = (g(x) y)2, or the absolute loss: ℓ(g(x), y) = |g(x) y|. Such tests rely on the following idea: if the alternative H1 in (2b) is true and a sequence of sequentially updated predictors (gt)t 1 has nontrivial predictive power, then the losses on random instances drawn from the joint distribution PXY are expected to be less on average than the losses on random instances from PX PY . For the t-th pair of points from PXY , we can label the losses of gt on all possible (X, Y )-pairs as L2t 1 = ℓ(gt(X2t 1), Y2t 1) , L2t = ℓ(gt(X2t), Y2t) , L 2t 1 = ℓ(gt(X2t 1), Y2t) , L 2t = ℓ(gt(X2t), Y2t 1) . (25) One can view this problem as sequential two-sample testing under distribution drift (due to incremental learning of (gt)t 1). Hence, one may use either Seq-C-2ST from Section 2 or sequential kernelized 2ST of Shekhar and Ramdas [2023] on the resulting sequence of the losses on observations from PXY and PX PY . In what follows, we analyze a direct approach where testing is performed by comparing the losses on instances drawn from the two distributions. A critical difference with a construction of Seq-C-2ST is that to design a valid betting strategy one has to ensure that the payoff functions are lower bounded by negative one. A.1 Proxy Regression-based Independence Test To avoid cases when some expected values are not well-defined, we assume for simplicity that X is a bounded subset of Rd for som d 1: X = x Rd : x 2 B1 for some B1 > 0. Similarly, we assume that Y is a bounded subset of R: Y = {y R : |y| B2} for some B2 > 0. We note that the construction of the regression-based IT will not require explicit knowledge of constants B1 and B2. First, we consider a setting where an instance either from the joint distribution or an instance from the product of the marginal distributions is observed at each round. Definition 3 (Proxy Setting). Suppose that we observe a stream of i.i.d. observations ((Xt, Yt, Wt))t 1, where Wt Rademacher(1/2), the distribution of (Xt, Yt) | Wt = +1 is PX PY , and that of (Xt, Yt) | Wt = 1 is PXY . The goal is to design a test for the following pair of hypotheses: H0 : PXY = PX PY , (26a) H1 : PXY = PX PY . (26b) Oracle Proxy Sequential Regression-based IT. To construct an oracle test, we assume having access to the oracle predictor g : X Y, e.g., the minimizer of the squared risk is g (x) = E [Y | X = x]. Formalizing the above intuition, we use E [Wℓ(g (X), Y )] as a natural way for measuring dependence between X and Y . 
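As a quick numerical illustration of this dependence measure under the proxy setting of Definition 3, the sketch below estimates E[Wℓ(g(X), Y)] with the squared loss on a toy linear model: the statistic is positive exactly when the regressor g predicts Y from X better on joint draws (W = −1) than on draws from the product of marginals (W = +1). The data-generating model, the choice g(x) = x, and all names are illustrative assumptions, not part of the procedure itself.

```python
import numpy as np

def signed_loss_statistic(g, stream, loss=lambda pred, y: (pred - y) ** 2):
    """Empirical E[W * loss(g(X), Y)] under the proxy setting of Definition 3:
    W = +1 marks a draw from P_X x P_Y, W = -1 marks a draw from P_XY."""
    vals = [w * loss(g(x), y) for (x, y, w) in stream]
    return float(np.mean(vals))

# Toy check (hypothetical data): Y depends on X under the joint, Y is independent of X otherwise.
rng = np.random.default_rng(0)
g = lambda x: x                                   # a reasonable regressor for this toy model
stream = []
for _ in range(5000):
    w = rng.choice([+1, -1])
    x = rng.normal()
    y = rng.normal() if w == +1 else x + 0.1 * rng.normal()
    stream.append((x, y, w))
print(signed_loss_statistic(g, stream))           # positive value => evidence of dependence
```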
To enforce boundedness of the payoff functions, we use ideas of the tests for symmetry from [Ramdas et al., 2020, Shekhar and Ramdas, 2023, Podkopaev et al., 2023, Shaer et al., 2023], namely we use a composition with an odd function: f r (Xt, Yt, Wt) = tanh (s Wt ℓ(g (Xt), Yt)) [ 1, 1], (27) where s > 0 is an appropriately selected scaling factor4. Since under H0 in (26a), s Wt ℓ(g (Xt), Yt) is a random variable that is symmetric around zero, it follows that E[f r (Xt, Yt, Wt)] = 4We note that rescaling is important for arguing about consistency and not the type I error control. 0, and, using the argument analogous to the proof of Theorem 1, we can easily deduce that a sequential IT based on f r controls the type I error control. The scaling factor s is selected in a way that guarantees that, if H1 in (26b) is true and if E [Wℓ(g (X), Y )] > 0, then E [f r (X, Y, W)] > 0, which is a sufficient condition for consistency of the oracle test. In particular, we show that it suffices to consider: where µ = E [Wℓ(g (X), Y )] , (28b) ν = E h (1 + W) (ℓ(g (X), Y ))3i . (28c) Without loss of generality, we assume that ν is bounded away from zero (which is a very mild assumption since ν essentially corresponds to a cubic risk of g on data drawn from the product of the marginal distributions PX PY ). Let the oracle regression-based wealth process Kr, t t 0 be defined by using the payoff function (27) with a scaling factor defined in (28a), along with a predictable sequence of betting fractions (λt)t 1 selected via the ONS strategy (Algorithm 1). We have the following result about the oracle regression-based IT, whose proof is deferred to Appendix D.4. Theorem 3. The following claims hold for the oracle sequential regression-based IT based on Kr, t 1. Suppose that H0 in (26a) is true. Then the test ever stops with probability at most α: PH1 (τ < ) α. 2. Suppose that H1 in (26b) is true. Further, suppose that: E [Wℓ(g (X), Y )] > 0. Then the test is consistent: PH1 (τ < ) = 1. Practical Proxy Sequential Regression-based IT. To construct a practical test, we use a sequence of predictors (gt)t 1 that are updated sequentially as more data are observed. We write Ar : ( t 1(X Y)t) G G to denote a chosen regressor learning algorithm which maps a training dataset of any size and previously used predictor, to an updated predictor. We start with D0 = and some initial guess g1 G. At round t, we use the payoff function: f r t (Xt, Yt, Wt) = tanh (st Wt ℓ(gt(Xt), Yt)) . (29) where a sequence of predictable scaling factors (st)t 1 is defined as follows: we set s0 = 0 and define: i=1 Wi ℓ(gi(Xi), Yi) i=1 (1 + Wi) (ℓ(gi(Xi), Yi))3 . (30c) After (Xt, Yt, Wt) has been used for betting, we update a training dataset: Dt = Dt 1 {(Xt, Yt, Wt)}, and an existing predictor: gt+1 = Ar(Dt, gt). We summarize this practical sequential 2ST in Algorithm 3. For simplicity, we consider a class of functions G := {gθ : X Y, θ Θ} for some parameter set Θ which we assume to be a subset of a metric space. In this case, a sequence of predictors (gt)t 1 is associated with the corresponding sequence of parameters (θt)t 1: for t 1, gt( ) = g( ; θt) for some θt Θ. To argue about the consistency of the resulting test, we make two assumptions. Assumption 3 (Smoothness). We assume that: Predictors in G are L1-Lipschitz smooth: sup x X |g(x; θ) g(x; θ )| L1 θ θ , θ, θ Θ. (31) Algorithm 3 Proxy Sequential Regression-based IT Input: significance level α (0, 1), data stream ((Xt, Yt, Wt))t 1, g1(z) 0, Ar, D0 = , λONS 1 = 0, s1 = 0. for t = 1, 2, . . 
. do Evaluate the payoff f r t (Xt, Yt, Wt) as in (29); Using λONS t , update the wealth process Kr t as in (5); if Kr t 1/α then Reject H0 and stop; else Update the training dataset: Dt := Dt 1 {(Xt, Yt)}; Update predictor: gt+1 = Ar(Dt, gt); Compute st+1 as in (30a); Compute λONS t+1 (Algorithm 1) using f r t (Xt, Yt, Wt); The loss function ℓis L2-Lipschitz smooth: sup x X y Y |ℓ(g(x; θ), y) ℓ(g(x; θ ), y)| L2 sup x X |g(x; θ) g(x; θ )| , θ, θ Θ. (32) In words, Assumption (31) states that the outputs of predictors, whose parameters are close, will also be close. Assumption (32) states that that the losses of two predictors, whose outputs are close, will also be close. For example, if G is a class of linear predictors: gθ(x) = θ x, x X, then Assumption 3 will be trivially satisfied for the squared and the absolute losses if X and Y are bounded. Note that we do not need an explicit knowledge of L1 or L2 for designing a test. Second, we make a learnability assumption about algorithm Ar. Assumption 4 (Learnability). Suppose that H1 in (26b) is true. We assume that the regressor learning algorithm Ar is such that for the resulting sequence of parameters (θt)t 1, it holds that θt a.s. θ , where θ is a random variable taking values in Θ and E [Wℓ(g(X; θ ), Y ) | θ ] a.s. > 0, where (X, Y, W) θ . We conclude with the following result for the practical proxy sequential regression-based IT, whose proof is deferred to Appendix D.4. Theorem 4. The following claims hold for the proxy sequential regression-based IT (Algorithm 3): 1. Suppose that H0 in (26a) is true. Then the test ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (26b) is true. Further, suppose that Assumptions 3 and 4 are satisfied. Then the test is consistent: PH1 (τ < ) = 1. Sequential Regression-based Independence Test (Seq-R-IT). Next, we instantiate this test for the sequential independence testing setting (as per Definition 2) where we observe sequence ((Xt, Yt))t 1, where (Xt, Yt) iid PXY , t 1. Analogous to Section 2, we bet on the outcome of two observations drawn from the joint distribution PXY . To proceed, we derandomize the payoff function (29) and consider f r t ((X2t 1, Y2t 1), (X2t, Y2t)) = 1 4 (tanh (st ℓ(gt(X2t 1), Y2t)) + tanh (st ℓ(gt(X2t), Y2t 1))) 4 (tanh (st ℓ(gt(X2t), Y2t)) tanh (st ℓ(gt(X2t 1), Y2t 1))) . (33) After betting on the outcome of the t-th pair of observations from PXY , we update a training dataset: Dt = Dt 1 {(X2t 1, Y2t 1), (X2t, Y2t)} , and a predictive model: ˆgt+1 = Ar(Dt, ˆgt). A.2 Synthetic Experiments To evaluate the performance of Seq-R-IT, we consider the Gaussian linear model. Let (Xt)t 1 and (εt)t 1 denote two independent sequences of i.i.d. standard Gaussian random variables. For t 1, we take (Xt, Yt) = (Xt, Xtβ + εt), where β = 0 implies nonzero linear correlation (hence dependence). We consider 20 values of β equally spaced in [0, 1/2]. For the comparison, we use: 1. Seq-R-IT with ridge regression. We use ridge regression as an underlying model: ˆgt(x) = β(t) 0 + xβ(t) 1 , where (β(t) 0 , β(t) 1 ) = arg min β0,β1 i=1 (Yi Xiβ1 β0)2 + λβ2 1. 2. Seq-C-IT with QDA. Note that PXY = N(µ, Σ+) and PX PY = N(µ, Σ ), where , Σ+ = 1 β β 1 + β2 , Σ = 1 0 0 1 + β2 For this problem, an oracle predictor which minimizes the misclassification risk is g (x, y) = φ((x, y); µ+, Σ+) φ((x, y); µ , Σ ) φ((x, y); µ , Σ ) + φ((x, y); µ+, Σ+) [ 1, 1], (34) where φ((x, y); µ, Σ) denotes the density of the Gaussian distribution N(µ, Σ) evaluated at (x, y). 
Recall that Dt 1 = {(Zi, +1)}i 2(t 1) {(Z i, 1)}i 2(t 1) denotes the training dataset that is available at round t for training a predictor ˆgt : X Y [ 1, 1]. We deploy Seq-C-IT with an estimator ˆgt of (34), obtained by using plug-in estimates of µ+, Σ+, µ , Σ , computed from Dt 1: ˆµ+ t = 1 2(t 1) (ˆµ+ t )(ˆµ+ t ) , and ˆµ t , ˆΣ t are computed similarly from D t . In addition, we also include HSIC-based SKIT to the comparison and defer the details regarding kernel hyperparameters to Appendix E.1. We set the monitoring horizon to T = 5000 points from PXY and aggregate the results over 200 sequences of observations for each value of β. We illustrate the result in Figure 5: while Seq-R-IT has high power for large values of β, we observe its inferior performance against Seq-C-IT (and SKIT) under the harder settings. Improving regression-based betting strategies, e.g., designing better scaling factors that still yield a provably consistent test, is an open question for future research. B Two-sample Testing with Unbalanced Classes In Section 2, we developed a sequential 2ST under the assumption at each round, an instance from either P or Q is revealed with equal probability. Such assumption was reasonable for designing Seq-C-IT, where external randomization produced two instances from PXY and PX PY at each round. Next, we generalize our sequential 2ST to a more general setting of unbalanced classes. Definition 4 (Sequential two-sample testing with unbalanced classes). Let π (0, 1). Suppose that we observe a stream of i.i.d. observations ((Zt, Wt))t 1, where Wt Rademacher(π), the distribution of Zt | Wt = +1 is denoted P, and that of Zt | Wt = 1 is denoted Q. We set the goal of designing a sequential test for the following pair of hypotheses: H0 : P = Q, (35a) H1 : P = Q. (35b) 0.0 0.1 0.2 0.3 0.4 0.5 β Rejection Rate at Time T = 5000 Seq-C-IT Seq-R-IT SKIT α = 0.05 0.1 0.2 0.3 0.4 0.5 β Average Stopping Time Seq-C-IT Seq-R-IT SKIT Figure 5: Comparison between Seq-R-IT, Seq-C-IT and HSIC-based SKIT under the Gaussian linear model. Inspecting Figure 5a at β = 0 confirms that all tests control the type I error. Non-surprisingly, kernel-based SKIT performs better than predictive tests under this model (no localized dependence). We also observe that Seq-C-IT performs better than Seq-R-IT. For what follows, we will focus on the payoff based on the squared risk due to its relationship to the likelihood-ratio-based test (Remark 3). In particular, after correcting the likelihood under the null in (20) to account for a general positive class proportion π, we can deduce that (see Appendix D.5): (1 λt) 1+λt (ηt(Zt))1{Wt=1} (1 ηt(Zt))1{Wt=0} (π)1{Wt=1} (1 π)1{Wt=0} = 1+λt Wt (gt(Zt) (2π 1)) 1 + Wt(2π 1) , (36) where ηt(z) = (gt(z) + 1)/2, and hence, a natural payoff function for the case with unbalanced classes is f u t (Zt, Wt) = Wt (gt(Zt) (2π 1)) 1 + Wt(2π 1) . (37) Note that the payoff for the balanced case (22b) is recovered by setting π = 1/2. It is easy to check that (see Appendix D.5): (a) f u t (z, w) 1 for any (z, w) Z { 1, 1}, and (b) if H0 in (35a) is true, then EH0 [f u t (Zt, Wt) | Ft 1] = 0, where Ft 1 = σ({(Zi, Wi)}i t 1). This in turn implies that a wealth process that relies on the payoff function f u t in (37) is a nonnegative martingale, and hence, the corresponding sequential 2ST is valid. However, the positive class proportion π, needed to use the payoff function (37), is generally unknown beforehand. First, let us consider the case when λt = 1, t 1. 
In this case, the wealth of a gambler that uses the payoff function (37) after round t is Kt = Qt i=1 (ηi(Zi))1{Wi=1} (1 ηi(Zi))1{Wi=0} Qt i=1 π1{Wi=1} (1 π)1{Wi=0} . (38) i=1 1 {Wt = 1} = arg max π [0,1] i=1 π1{Wi=1} (1 π)1{Wi=0} ! is the MLE for π computed from {Wi}i t. In particular, if we consider a process ( Kt)t 0, where Kt := Qt i=1 (ηi(Zi))1{Wi=1} (1 ηi(Zi))1{Wi=0} Qt i=1 (ˆπt)1{Wi=1} (1 ˆπt)1{Wi=0} , t 1, it follows that Kt Kt, t 1, meaning that ( Kt)t 0 is a process that is upper bounded by a nonnegative martingale with initial value one. This in turn implies that a test based on ( Kt)t 0 is a valid level-α sequential 2ST for the case of unknown class proportions. This idea underlies the running MLE sequential likelihood ratio test of Wasserman et al. [2020] and has been recently considered in the context of two-sample testing by Pandeva et al. [2022]. In case of nontrivial betting fractions: (λt)t 1, representation of the wealth process (38) no longer holds, and to proceed, we modify the rules of the game and use minibatching. A bet is placed on every b (say, 5 or 10) observations, meaning that for a given minibatch size b 1, at round t we bet on {(Zb(t 1)+i, Wb(t 1)+i)}i {1,...,b}. The MLE of π computed from the t-th minibatch is i=b(t 1)+1 1 {Wi = +1} . We consider a payoff function of the following form: f u t (Zb(t 1)+i, Wb(t 1)+i) i {1,...,b} 1 + Wigt(Zi) 1 + Wi(2ˆπt 1) In words, the above payoff essentially compares the performance of a predictor gt, trained on {(Zi, Wi)}i b(t 1) and evaluated on the t-th minibatch, to that of a trivial baseline predictor to form a bet. In particular, setting b = 1 yields a valid, yet a powerless test. Indeed, we have ˆπt = 1 {Wt = 1} = (Wt + 1)/2. In this case, the payoff (39) reduces to Wt (gt(Zt) (2ˆπt 1)) 1 + Wt(2ˆπt 1) = Wtgt(Zt) 1 a.s. [ 1, 0], implying that the wealth can not grow even if the null is false. Define a wealth processes (Ku t )t 0 based on the payoff functions (39) along with a predictable sequence of betting fractions (λt)t 1 selected via ONS strategy (Algorithm 1). Let Ft = σ({(Zi, Wi)}i bt) for t 1, with F0 denoting a trivial sigma-algebra. We conclude with the following result, whose proof is deferred to Appendix D.5. Theorem 5. Suppose that H0 in (35a) is true. Then (Ku t )t 0 is a nonnegative supermartingale adapted to (Ft)t 0. Hence, the sequential 2ST based on (Ku t )t 0 satisfies: PH0 (τ < ) α. C Testing under Distribution Drift First, we define the problem of two-sample testing when at each round instances from both distributions are observed. Definition 5 (Sequential two-sample testing). Suppose that we observe that a stream of observations: ((Xt, Yt))t 1, where (Xt, Yt) iid PX PY for t 1. The goal is to design a sequential test for H0 : (Xt, Yt) iid PX PY and PX = PY , (40a) H1 : (Xt, Yt) iid PX PY and PX = PY . (40b) Under the two-sample testing setting (Definition 5), we label observations from PY as positive (+1) and observations from PX as negative ( 1). We write A2ST c : ( t 1(X { 1, +1})t) G G to denote a chosen learning algorithm which maps a training dataset of any size and previously used predictor, to an updated predictor. We start with D0 = and g1 : g1(x) = 0, x X. At round t, we bet using derandomized versions of the payoffs (22), namely f m t (Xt, Yt) = 1 2 (sign [gt(Yt)] sign [gt(Xt)]) , (41a) f s t (Xt, Yt) = 1 2 (gt(Yt) gt(Xt)) . (41b) After (Xt, Yt) has been used for betting, we update a training dataset and an existing predictor: Dt = Dt 1 {(Yt, +1), (Xt, 1)} , gt+1 = A2ST c (Dt, gt). 
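A minimal sketch of one way to instantiate this paired-sample test is given below: the payoff (41b) combined with an online logistic-regression predictor updated by a single gradient step per round (the kind of model used again in Appendix C.1) and the ONS betting fractions of Algorithm 1. The learning rate, the update rule, and all names are illustrative choices rather than the exact configuration used in our experiments.

```python
import numpy as np

def seq_c_2st_paired(x_stream, y_stream, alpha=0.05, lr=0.1):
    """Sketch of Seq-C-2ST in the paired setting of Definition 5, payoff (41b).
    x_stream, y_stream: arrays of shape (T, d). Predictor: online logistic regression
    mapped to (-1, 1) via 2*sigmoid - 1, updated by one gradient step per round."""
    d = x_stream.shape[1]
    theta, bias = np.zeros(d), 0.0
    wealth, lam, a = 1.0, 0.0, 1.0

    def g(z):
        return 2.0 / (1.0 + np.exp(-(z @ theta + bias))) - 1.0   # score in (-1, 1)

    for t, (x, y) in enumerate(zip(x_stream, y_stream), start=1):
        f = 0.5 * (g(y) - g(x))                  # payoff (41b): Y labeled +1, X labeled -1
        wealth *= 1.0 + lam * f                  # wealth update (5)
        if wealth >= 1.0 / alpha:                # stopping rule (6)
            return True, t
        # One stochastic gradient step of the logistic loss on the two new labeled points.
        for z, w in ((y, +1.0), (x, -1.0)):
            p = 1.0 / (1.0 + np.exp(-(z @ theta + bias)))        # P(label = +1 | z)
            grad = p - (w + 1.0) / 2.0                           # d(loss)/d(logit)
            theta -= lr * grad * z
            bias -= lr * grad
        # ONS update of the betting fraction (Algorithm 1), truncated to [0, 1/2].
        zt = f / (1.0 + lam * f)
        a += zt * zt
        lam = min(max(lam + (2.0 / (2.0 - np.log(3.0))) * zt / a, 0.0), 0.5)
    return False, len(x_stream)
```

Because the payoff (41b) has zero conditional mean whenever the two distributions coincide at round t, the same loop applies unchanged under the drift setting introduced next.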
Testing under Distribution Drift. Batch two-sample and independence tests generally rely on either a cutoff computed using the asymptotic null distribution of a chosen test statistic (if tractable) or a permutation p-value. Both approaches require imposing i.i.d. (or exchangeability, for the latter option) assumption about the data distribution, and if the distribution drifts, both approaches fail to guarantee the type I error control. In contrast, Seq-C-2ST and Seq-C-IT remain valid beyond the i.i.d. setting by construction (analogous to tests developed in [Shekhar and Ramdas, 2023, Podkopaev et al., 2023]). First, we define the problems of sequential two-sample and independence testing under distribution drift. Definition 6 (Sequential two-sample testing under distribution drift). Suppose that we observe that a stream of independent observations: ((Xt, Yt))t 1, where (Xt, Yt) P (t) X P (t) Y , t 1. The goal is to design a sequential test for the following pair of hypotheses: H0 : P (t) X = P (t) Y , t, (42a) H1 : t : P (t ) X = P (t ) Y . (42b) Definition 7 (Sequential independence testing under distribution drift). Suppose that we observe that a stream of independent observations from the joint distribution which drifts over time: ((Xt, Yt))t 1, where (Xt, Yt) P (t) XY . The goal is to design a sequential test for the following pair of hypotheses: H0 : P (t) XY = P (t) X P (t) Y , t, (43a) H1 : t : P (t ) XY = P (t ) X P (t ) Y . (43b) The superscripts highlight that, in contrast to the standard i.i.d. setting (Definitions 5 and 2), the underlying distributions may drift over time. For independence testing, we need to impose an additional assumption that enables reasoning about the type I error control of Seq-C-IT. Assumption 5. Consider the setting of independence testing under distribution drift (Definition 7). We assume that for each t 1, it holds that either P (t 1) X = P (t) X or P (t 1) Y = P (t) Y , meaning that at each step either the distribution of X changes or that of Y changes, but not both simultaneously5. We have the following result about the type I error control of our tests under distribution drift. Corollary 2. The following claims hold: 1. Suppose that H0 in (42a) is true. Then Seq-C-2ST satisfies: PH0 (τ < ) α. 2. Suppose that H0 in (43a) is true. Further, suppose that Assumption 5 is satisfied. Then Seq-C-IT satisfies: PH0 (τ < ) α. The above result follows from the fact the payoff functions underlying Seq-C-2ST (41) and Seq-CIT (23) are valid under the more general null hypotheses (42a) and (43a) respectively. The rest of the proof of Corollary 2 follows the same steps as that of Theorem 2, and we omit the details. We conclude with an example which shows that Assumption 5 is necessary for the type I error control. Example 2. Consider the following case when the null H0 in (43a) is true, but Assumption 5 is not satisfied. We show that Seq-C-IT fails to control type I error (at any prespecified level α (0, 1)), and for simplicity, focus on the payoff function based on the squared risk (23). Suppose that we observe a sequence of observations: ((Xt, Yt))t 1, where (Xt, Yt) = (t+Wt, t+Vt) and Wt, Vt iid Bern(1/2). It suffices to show that there exists a sequence of predictors (gt)t 1, for which lim inf t 1 i=1 f s t ((X2t 1, Y2t 1), (X2t, Y2t)) a.s. > 0. (44) If (44) holds, then using the same argument as in the proof of Theorem 2, one can then deduce that P (τ < ) = 1. Consider the following sequence of predictors (gt)t 1: gt(x, y) = x 2t 1 2 1 1. 
We have:
$$g_t(X_{2t},Y_{2t}) = \left(\left(W_{2t}+\tfrac{1}{2}\right)\left(V_{2t}+\tfrac{1}{2}\right)\right)\wedge 1, \qquad g_t(X_{2t-1},Y_{2t-1}) = \left(W_{2t-1}-\tfrac{1}{2}\right)\left(V_{2t-1}-\tfrac{1}{2}\right),$$
$$g_t(X_{2t},Y_{2t-1}) = \left(W_{2t}+\tfrac{1}{2}\right)\left(V_{2t-1}-\tfrac{1}{2}\right), \qquad g_t(X_{2t-1},Y_{2t}) = \left(W_{2t-1}-\tfrac{1}{2}\right)\left(V_{2t}+\tfrac{1}{2}\right).$$
A simple calculation shows that $E\left[g_t(X_{2t},Y_{2t})\right]=11/16$ and $E\left[g_t(X_{2t-1},Y_{2t-1})\right]=E\left[g_t(X_{2t},Y_{2t-1})\right]=E\left[g_t(X_{2t-1},Y_{2t})\right]=0$, and hence, for all $t\ge1$, it holds that $E\left[f^s_t\left((X_{2t-1},Y_{2t-1}),(X_{2t},Y_{2t})\right)\right]=11/64>0$. This in turn implies (44), and hence we conclude that Seq-C-IT fails to control the type I error.
⁵Technically, a slightly weaker condition suffices: at odd $t$, the distribution can change arbitrarily, but at even $t$, either the distribution of $X$ changes or that of $Y$ changes, but not both; however, this weaker condition is slightly less intuitive than the stated one.
C.1 Synthetic Experiments
Under the alternative, the power of our test depends on the performance of a classifier in terms of the misclassification/squared risk. If the distribution drifts over time, then models updated via online gradient descent may retain predictive power. In such cases, our test will still have high power despite the distribution drift. Of course, if the distribution shifts adversarially, the method will not have power, but neither will any other method. The implicit goal is thus to maintain type I error control under the null despite shifting distributions (beyond the i.i.d. assumption typically made in the literature) while retaining power against reasonable (but not all) distribution drifts. To put this into context, we consider four settings:
1. $P=Q=N(0,1)$, i.e., the 2ST null is true;
2. $P=N(0,1)$ and $Q=N(0.5,1)$, i.e., the 2ST null is false;
3. Up to time $t=250$, $P^{(t)}=Q^{(t)}=N(0,1)$, but for $t>250$, we have $P^{(t)}=N(0,1)$ and $Q^{(t)}=N(0.5,1)$, i.e., there is a distribution shift and the 2ST null is false;
4. Up to time $t=250$, $P^{(t)}=Q^{(t)}=N(0,1)$, but for $t>250$, $P^{(t)}=Q^{(t)}=N(0.5,1)$, i.e., there is a change in distribution yet the 2ST null is still true.
We use a standard logistic regression model as the underlying predictor, with weights updated via online gradient descent, and monitor the tests until $T=2000$ observations. In Figure 6, we observe that our test controls the type I error whenever the 2ST null is true, even if there is a shift in distribution, and retains high power if the alternative is true.
Figure 6: Average rejection rates for the four settings: solid lines correspond to the standard i.i.d. setting and dashed lines correspond to the settings that involve distribution shift. Despite the shift in distribution, our test, which utilizes logistic regression learned via online gradient descent, controls the type I error under the more general 2ST null hypothesis and has power under the alternative. Results are aggregated over 500 random sequences from the corresponding distributions.
D.1 Auxiliary Results
Proposition 2 (Ville's inequality [Ville, 1939]). Suppose that $(M_t)_{t\ge0}$ is a nonnegative supermartingale adapted to a filtration $(\mathcal{F}_t)_{t\ge0}$. Then, for any $a>0$, it holds that:
$$P\left(\exists t\ge1: M_t\ge a\right) \le \frac{E\left[M_0\right]}{a}.$$
D.2 Supporting Lemmas
Lemma 6. Consider the sequential two-sample testing setting (Definition 1). Suppose that a predictor $g\in\mathcal{G}$ satisfies $E[f(Z,W)]>0$, where $f(z,w):=w\,g(z)$. (a) Consider the wealth process $(\mathcal{K}_t)_{t\ge0}$ based on $f$ along with the ONS strategy for selecting betting fractions (Algorithm 1).
Then we have the following lower bound on the growth rate of the wealth process: lim inf t log Kt (E [f(Z, W)])2 E [f 2(Z, W)] E [f(Z, W)] (b) For λ = arg maxλ [ 0.5,0.5] E [log(1 + λf(Z, W))], it holds that: E [log(1 + λ f(Z, W))] 4 3 (E [f(Z, W)])2 E h (f(Z, W))2i E [f(Z, W)] Analogous result holds when the payoff function f(z, w) := w sign [g(z)] is used instead. Proof. (a) Under the ONS betting strategy, for any sequence of outcomes (ft)t 1, ft [ 1, 1], it holds that (see the proof of Theorem 1 in [Cutkosky and Orabona, 2018]): log Kt(λ0) log Kt = O where Kt(λ0) is the wealth of any constant betting strategy λ0 [ 1/2, 1/2] and Kt is the wealth corresponding to the ONS betting strategy. Hence, it follows that t log Kt(λ0) for some absolute constant C > 0. Next, consider Pt i=1 fi Pt i=1 f 2 i 1 We obtain: log Kt(λ0) i=1 log(1 + λ0fi) i=1 (λ0fi λ2 0f 2 i ) 1 t Pt i=1 fi 4 0 1 t Pt i=1 fi 1 t Pt i=1 f 2 i 1 where in (a) we used that log(1 + x) x x2 for x [ 1/2, 1/2]. From (48), it then follows that: lim inf t log Kt a.s. E [f(Z, W)] 4 0 E [f(Z, W)] E [f 2(Z, W)] 1 (E [f(Z, W)])2 E [f 2(Z, W)] E [f(Z, W)] which completes the proof of the first assertion of the lemma. (b) Since log(1 + x) x 3x2/8 for any x [ 0.5, 0.5], we know that: E [log (1 + λ f(Z, W))] E λ f(Z, W) 3 8 (λ f(Z, W))2 max λ [ 0.5,0.5] λ E [f(Z, W)] 3λ2 8 E h (f(Z, W))2i . The optimizer of the above is λ = 4E [f(Z, W)] 3E h (f(Z, W))2i 1 Hence, as long as E [f(Z, W)] (3/8) E h (f(Z, W))2i , we have: E [log (1 + λ f(Z, W))] 2 3 (E [f(Z, W)])2 E h (f(Z, W))2i. (50) If however, E [f(Z, W)] > (3/8) E h (f(Z, W))2i , then we know that: E [log (1 + λ f(Z, W))] E [f(Z, W)] To bring it to a convenient form, we multiply the upper bound in (50) by two and get the bound (46), which completes the proof of the second assertion of the lemma. D.3 Proofs for Section 2 Proposition 1. Fix an arbitrary predictor g G. The following claims hold6: 1. For the misclassification risk, we have that: sup s [0,1] 2 Rm(sg) = 1 2 Rm(g) 0 = 1 2 E [W sign [g(Z)]] 0. (9) 2. For the squared risk, we have that: sup s [0,1] (1 Rs(sg)) (E [W g(Z)] 0) E [W g(Z)] E [g2(Z)] 1 (10) Further, ds(P, Q) > 0 if and only if there exists g G such that E [W g(Z)] > 0. Proof. 1. The first equality in (9) follows from two facts: (a) for any g G and any s (0, 1], it holds that Rm(sg) = Rm(g), (b) Rm(0) = 1/2. The second equality easily follows from the following fact: sign [x] /2 = 1/2 1 {x < 0}. 2. Consider an arbitrary predictor g G. Let us consider all possible scenarios: (a) If E [W g(Z)] 0, then the RHS of (10) is zero. For the LHS of (10), we have that: sup s [0,1] (1 Rs(sg)) 1 Rs(0) = 0, so the bound (10) holds. (b) Next, assume that E [W g(Z)] > 0, then it is easy to derive that: s := arg max s [0,1] (1 Rs(sg)) = E [W g(Z)] E [g2(Z)] 1. (51) A simple calculation shows that: 1 Rs(s g) E [W g(Z)] E [W g(Z)] E [g2(Z)] 1 , and hence, we conclude that the bound (10) holds. 6 and denote a maximum and a minimum respectively: a b = max{a, b}, a b = min{a, b}. To establish the second part of the statement, note that ds(P, Q) > 0 iff there is a predictor g G such that Rs(g) < 1. For the squared risk, we have: 1 Rs(g) = 2E [W g(Z)] E g2(Z) , (52) and hence, Rs(g) < 1 trivially implies that E [W g(Z)] > 0. The converse implication trivially follows from (10). Hence, the result follows. Theorem 1. The following claims hold: 1. Suppose that H0 in (1a) is true. 
Then the oracle sequential test based on either (Km, t )t 0 or (Ks, t )t 0 ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (1b) is true. Then: (a) The growth rate of the oracle wealth process (Km, t )t 0 satisfies: lim inf t 1 t log Km, t a.s. 1 2 Rm (g ) 2 . (14) If Rm (g ) < 1/2, then the test based on (Km, t )t 0 is consistent: PH1 (τ < ) = 1. Further, the optimal growth rate achieved by λm in (13) satisfies: E [log(1 + λm f m (Z, W))] 16 2 Rm(g ) 2 1 2 Rm(g ) . (15) (b) The growth rate of the oracle wealth process (Ks, t )t 0 satisfies: lim inf t 1 t log Ks, t a.s. 1 4 E [W g (Z)] . (16) If E [W g (Z)] > 0, then the test based on (Ks, t )t 0 is consistent: PH1 (τ < ) = 1. Further, the optimal growth rate achieved by λs in (13) satisfies: E [log(1 + λs f s (Z, W))] 1 2 E [W g (Z)] . (17) Proof. 1. We trivially have that the payoff functions (11a) and (11b) are bounded: (z, w) Z { 1, 1}, it holds that f m (z, w) [ 1, 1] and f s (z, w) [ 1, 1]. Further, under the null H0 in (1a), it trivially holds that EH0 [f m (Zt, Wt) | Ft 1] = EH0 [f s (Zt, Wt) | Ft 1] = 0, where Ft 1 = σ({(Zi, Wi)}i t 1). Since ONS betting fractions λONS t t 1 are predictable, we conclude that the resulting wealth process is a nonnegative martingale. The assertion of the Theorem then follows directly from Ville s inequality (Proposition 2) when a = 1/α. 2. Suppose that H1 in (1b) is true. First, we prove the results for the lower bounds: (a) Consider the wealth process based on the misclassification risk (Km, t )t 0. Note that for all t 1: E [f m (Zt, Wt)] = 2 1 2 Rm (g ) , (f m (Zt, Wt))2 = 1. Since E [f m (Zt, Wt)] [0, 1], we also have (E [f m (Zt, Wt)])2 E [f m (Zt, Wt)]. From the first part of Lemma 6, it follows that: lim inf t log Km, t t 4 (E [f m (Zt, Wt)])2 = 1 2 Rm (g ) 2 . From the second part of Lemma 6, and (46) in particular, it follows that: E [log (1 + λm f m (Z, W))] 2 Rm(g ) 2 1 The first term in the above is smaller or equal than the second one whenever Rm(g ) 5/16. We conclude that the assertion of the theorem is true. (b) Next, we consider the wealth process based on the squared error: (Ks, t )t 0. Note that: E [f s (Zt, Wt)] = E [W g (Z)] , E h (f s (Zt, Wt))2i = E g2 (Z) , and hence from Lemma 6, it follows that: lim inf t log Ks, t t (E [W g (Z)])2 E [g2 (Z)] E [W g (Z)] In the above, we assume that the following case is not possible: g (Z) a.s. = 0 (for such g , the corresponding expected margin and the growth rate of the resulting wealth process are clearly zero, and will still be highlighted in our resulting bound). Next, note that since g arg ming G Rs (g), we have that: 1 Rs (g ) = sup s [0,1] (1 Rs (sg )) , meaning that g can not be improved by scaling with s < 1. From Proposition 1, and (51) in particular, it follows that: E [W g (Z)] E [g2 (Z)] 1, (54) and hence, the bound (53) reduces to lim inf t log Ks, t t a.s. E [W g (Z)] From the second part of Lemma 6, it follows that: E [log (1 + λs f s (Z, W))] 4 3 (E [W g (Z)])2 E h (g (Z))2i E [W g (Z)] Next, we use that g satisfies (54), which implies that the second term in (55) is smaller, and hence, E [log (1 + λs f s (Z, W))] E [W g (Z)] which concludes the proof of the second part of the theorem. Corollary 1. Consider an arbitrary g G with nonnegative expected margin: E [W g(Z)] 0. Then the growth rate of the corresponding wealth process (Ks t)t 0 satisfies: lim inf t 1 t log Ks t a.s. 
1 4 sup s [0,1] (1 Rs (sg)) E [W g(Z)]) (18a) 1 4 (E [W g(Z)])2 , (18b) and the optimal growth rate achieved by λs in (13) satisfies: E [log(1 + λs f s(Z, W))] 4 3 sup s [0,1] (1 Rs (sg)) 1 2 E [W g(Z)] . (19) Proof. Following the same argument as that of the proof of Theorem 1, we can deduce that: lim inf t log Ks t t (E [W g(Z)])2 E [g2(Z)] E [W g(Z)] Hence, it suffices to argue that the lower bound (56) is equivalent to (18a). Without loss of generality, we can assume that E [W g(Z)] 0, and further, the two lower bounds are equal if E [W g(Z)] = 0. Hence, we consider the case when E [W g(Z)] > 0. First, let us consider the case when E [g2(Z)] < 1. (57) Using (51), we get that: sup s [0,1] (1 Rs (sg)) = (E [W g(Z)])2 E [g2(Z)] , (58) and hence, two bounds coincide. For the upper bound (19), we use Lemma 6, and the upper bound (46) in particular. Note that the first term in (46) is less than the second term whenever E h (g(Z))2i 3 However, in this regime we also know that (58) holds, and hence the two bounds coincide. This completes the proof. Theorem 2. The following claims hold for Seq-C-2ST (Algorithm 2): 1. If H0 in (1a) is true, the test ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (1b) is true. Then: (a) Under Assumption 1, the test with the payoff (22a) is consistent: PH1 (τ < ) = 1. (b) Under Assumption 2, the test with the payoff (22b) is consistent: PH1 (τ < ) = 1. Proof. 1. We trivially have that the payoff functions (22a) and (22b) are bounded: t 1 and (z, w) Z { 1, 1}, it holds that f m t (z, w) [ 1, 1] and f s t (z, w) [ 1, 1]. Further, under the null H0 in (1a), it trivially holds that EH0 [f m t (Zt, Wt) | Ft 1] = EH0 [f s t (Zt, Wt) | Ft 1] = 0, where Ft 1 = σ({(Zi, Wi)}i t 1). Since ONS betting fractions λONS t t 1 are predictable, we conclude that the resulting wealth process is a nonnegative martingale. The assertion of the Theorem then follows directly from Ville s inequality (Proposition 2) when a = 1/α. 2. Note that if ONS strategy for selecting betting fractions is deployed, then (49) implies that the tests will be consistent as long as lim inf t 1 i=1 fi a.s. > 0, (59) where for i 1, fi = f m i (Zi, Wi) and fi = f s i (Zi, Wi) for the payoffs based on the misclassification and the squared risks respectively. (a) Recall that f m i (Zi, Wi) = Wi sign [gi(Zi)] , and Assumption 1 states that: lim sup t 1 i=1 1 {Wi sign [gi(Zi)] < 0} a.s. < 1 Since 1 {x < 0} = (1 sign [x]) /2, we get that: lim sup t 1 2 Wi sign [gi(Zi)] which, after rearranging and multiplying by two, implies that: lim inf t 1 i=1 Wi sign [gi(Zi)] a.s. > 0. Hence, a sufficient condition for consistency (59) holds, and we conclude that the result is true. (b) Recall that f s i (Zi, Wi) = Wi gi(Zi), and Assumption 2 states that: lim sup t 1 i=1 (gi(Zi) Wi)2 a.s < 1, which is equivalent to lim sup t 1 g2 i (Zi) 2Wi gi(Zi) a.s < 0. It is easy to see that the above, in turn, implies that: lim inf t 1 i=1 Wi gi(Zi) a.s > 0. Hence, a sufficient condition for consistency (59) holds, and we conclude that the result is true. D.4 Proofs for Appendix A Theorem 3. The following claims hold for the oracle sequential regression-based IT based on Kr, t 1. Suppose that H0 in (26a) is true. Then the test ever stops with probability at most α: PH1 (τ < ) α. 2. Suppose that H1 in (26b) is true. Further, suppose that: E [Wℓ(g (X), Y )] > 0. Then the test is consistent: PH1 (τ < ) = 1. Proof. 1. 
We trivially have that the payoff function (27) is bounded: (x, y, w) X Y { 1, 1}, it holds that f r (x, y, w) [ 1, 1]. Further, under the null H0 in (26a), it trivially holds that EH0 [f r (Xt, Yt, Wt) | Ft 1] = 0, where Ft 1 = σ({(Xi, Yi, Wi)}i t 1). Since ONS betting fractions λONS t t 1 are predictable, we conclude that the resulting wealth process is a nonnegative martingale. The assertion of the Theorem then follows directly from Ville s inequality (Proposition 2) when a = 1/α. 2. Note that if ONS strategy for selecting betting fractions is deployed, then (49) implies that the tests will be consistent as long as lim inf t 1 i=1 f r (Xi, Yi, Wi) a.s. > 0. (60) i=1 f r (Xi, Yi, Wi) = 1 i=1 tanh (s Wiℓ(g (Xi), Yi)) a.s. E [tanh (s Wℓ(g (X), Y ))] . Note that for any x R : tanh(x) x 1 3 max x3, 0 . Hence, for any s > 0, it holds that: E [tanh (s Wℓ(g (X), Y ))] s E [Wℓ(g (X), Y )] 1 3E max s3 W(ℓ(g (X), Y ))3, 0 = s E [Wℓ(g (X), Y )] s3 3 E (ℓ(g (X), Y ))3 max {W, 0} = s E [Wℓ(g (X), Y )] s3 6 E (1 + W) (ℓ(g (X), Y ))3 , (61) where we used that max {W, 0} = (W + 1)/2 since W { 1, 1}. Maximizing the RHS of (61) over s > 0 yields s defined in (28a). Hence, E [tanh (s Wℓ(g (X), Y ))] s E [Wℓ(g (X), Y )] s3 6 E (1 + W) (ℓ(g (X), Y ))3 E [Wℓ(g (X), Y )] s2 6 E (1 + W) (ℓ(g (X), Y ))3 E [Wℓ(g (X), Y )] 1 3E [Wℓ(g (X), Y )] 3 E [Wℓ(g (X), Y )] > 0. Hence, we conclude that the oracle regression-based IT is consistent since the sufficient condition (62) holds. Theorem 4. The following claims hold for the proxy sequential regression-based IT (Algorithm 3): 1. Suppose that H0 in (26a) is true. Then the test ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (26b) is true. Further, suppose that Assumptions 3 and 4 are satisfied. Then the test is consistent: PH1 (τ < ) = 1. Proof. 1. We trivially have that the payoff function (29) is bounded: (x, y, w) X Y { 1, 1}, it holds that f r t (x, y, w) [ 1, 1]. Further, under the null H0 in (26a), it trivially holds that EH0 [f r t (Xt, Yt, Wt) | Ft 1] = 0, where Ft 1 = σ({(Xi, Yi, Wi)}i t 1). Since ONS betting fractions (λONS t )t 1 are predictable, we conclude that the resulting wealth process is a nonnegative martingale. The assertion of the Theorem then follows directly from Ville s inequality (Proposition 2) with a = 1/α. 2. Note that if ONS strategy for selecting betting fractions is deployed, then (49) implies that the tests will be consistent as long as lim inf t 1 i=1 f r t (Xi, Yi, Wi) a.s. > 0. (62) (a) Step 1. Consider a predictable sequence of scaling factors (st)t 1, defined in (30a), and the corresponding sequences (µt)t 1 and (νt)t 1, defined in (30b) and (30c) respectively. For t 1, let Ft := σ({(Xi, Yi, Wi)}i t). Since the losses are bounded, we have that: (Wi ℓ(g(Xi; θi), Yi) E [Wi ℓ(g(Xi; θi), Yi) | Fi 1])i 1 , is a bounded martingale difference sequence (BMDS). By the Strong Law of Large Numbers for BMDS, it follows that: i=1 (Wi ℓ(g(Xi; θi), Yi) E [Wi ℓ(g(Xi; θi), Yi) | Fi 1]) a.s. 0. Since ((Xt, Yt, Wt))t 1 is a sequence of i.i.d. observations, we can write i=1 E [Wi ℓ(g(Xi; θi), Yi) | Fi 1] = 1 i=1 E [W ℓ(g(X; θi), Y ) | θi] , where (X, Y, W) (θt)t 1, θ . Using Assumption 3, we get that: 1 i=1 E [W ℓ(g(X; θi), Y ) | θi] E [W ℓ(g(X; θ ), Y ) | θ ] i=1 sup x X y Y |ℓ(g(x; θi), y) ℓ(g(x; θ ), y)| i=1 L2 sup x X |g(x; θi) g(x; θ )| i=1 L2 L1 θi θ a.s. 0, since θi θ a.s. 0 by Assumption 4. In particular, this implies that µt a.s. E [Wℓ(g(X; θ ), Y ) | θ ]. Similar argument can be used to show that νt a.s. 
E (1 + W) (ℓ(g(X; θ ), Y ))3 | θ , and hence, 2E [Wℓ(g(X; θ ), Y ) | θ ] E [(1 + W) (ℓ(g(X; θ ), Y ))3 | θ ] =: s . (64) Note that s is a random variable which is positive (almost surely) by Assumption 4. (b) Step 2. Recall that for any x R : tanh(x) x 1 3 max x3, 0 and that max {W, 0} = (W + 1)/2 since W { 1, 1}. We have: i=1 f r i (Xi, Yi, Wi) i=1 tanh (si Wiℓ(g(Xi; θi), Yi)) si Wi ℓ(g(Xi; θi), Yi) s3 i 6 (1 + Wi) (ℓ(g(Xi; θi), Yi))3 . Note that θi and si are Fi 1-measurable (see Step 1 for the definition of Fi 1). Under a minor technical assumption that (st)t 1 is a sequence of bounded scaling factors (the lower bound is trivially zero and the upper bound also holds if νt are bounded away from zero almost surely which is reasonable given the definition of νt), we can use analogous argument regarding a BMDS in Step 1 to deduce that: lim inf t 1 i=1 f r i (Xi, Yi, Wi) lim inf t 1 si E [W ℓ(g(X; θi), Y ) | θi] s3 i 6 E (1 + W) (ℓ(g(X; θi), Y ))3 | θi . (65) Using argument analogous to (63), we can show that: i=1 E (1 + W) (ℓ(g(X; θi), Y ))3 | θi a.s. E (1 + W) (ℓ(g(X; θ ), Y ))3 | θ . (66) Combining (63), (64) and (66), we deduce that si E [W ℓ(g(X; θi), Y ) | θi] s3 i 6 E (1 + W) (ℓ(g(X; θi), Y ))3 | θi a.s. s E [W ℓ(g(X; θ ), Y ) | θ ] s3 6 E (1 + W) (ℓ(g(X; θ ), Y ))3 | θ 3 E [W ℓ(g(X; θ ), Y ) | θ ] . Hence, from (65) it follows that: lim inf t 1 i=1 f r i (Xi, Yi, Wi) 2s 3 E [W ℓ(g(X; θ ), Y ) | θ ] , where the RHS is a random variable which is positive almost surely. Hence, a sufficient condition for consistency (62) holds which concludes the proof. D.5 Proofs for Appendix B Two-Sample Testing with Unbalanced Classes. Note that (g(z) = 2η(z) 1): (1 λt) 1 + λt (η(Zt))1{Wt=1} (1 η(Zt))1 1{Wt=1} (π)1{Wt=1} (1 π)1 1{Wt=1} = (1 λt) 1 + λt 2 1{Wt=1} 1 g(Zt) 2 1 1{Wt=1} (π)1{Wt=1} (1 π)1 1{Wt=1} = (1 λt) 1 + λt 2 (1 + g(Zt))1{Wt=1} (1 g(Zt))1 1{Wt=1} (π)1{Wt=1} (1 π)1 1{Wt=1} = (1 λt) 1 + λt 2 1 + Wtg(Zt) (π)1{Wt=1} (1 π)1 1{Wt=1} = (1 λt) 1 + λt 2 2 1 + Wt(2π 1) (1 + Wtg(Zt)) = (1 λt) 1 + λt 1 + Wt(2π 1) (1 + Wtg(Zt)) = 1 + λt Wt (g(Zt) (2π 1)) 1 + Wt(2π 1) . Payoff for the Case of Unbalanced Classes (known π). To see that the payoff function (37) is lower bounded by negative one, note that: f u t (z, 1) = gt(z) (2π 1) 2π 1 (2π 1) f u t (z, 1) = gt(z) + (2π 1) 2(1 π) 1 + (2π 1) 2(1 π) = 1. To see that such payoff is fair, note that: EH0 [f u t (Zt, Wt) | Ft 1] = EP π gt(Zt) (2π 1) (1 π) gt(Zt) (2π 1) 2(1 π) | Ft 1 where Ft 1 = σ {(Zi, Wi)}i t 1 . Theorem 5. Suppose that H0 in (35a) is true. Then (Ku t )t 0 is a nonnegative supermartingale adapted to (Ft)t 0. Hence, the sequential 2ST based on (Ku t )t 0 satisfies: PH0 (τ < ) α. Proof. First, we show that (Ku t )t 0 is a nonnegative supermartingale. For any t 1, the wealth Kt 1 is multiplied at round t by 1 + λtf u t (Zb(t 1)+i, Wb(t 1)+i) i {1,...,b} = (1 λt) 1 + λt Qbt i=b(t 1)+1 (1 + Wigt(Zi)) Qb i=1 (1 + Wi(2ˆπt 1)) . Since λt [0, 0.5], we conclude that the process (Ku t )t 0 is nonnegative. Next, note that since ˆπt is the MLE of π computed from a t-th minibatch, it follows that: 1 + λtf u t (Zb(t 1)+i, Wb(t 1)+i) i {1,...,b} (1 λt) 1 + λt Qbt i=b(t 1)+1 (1 + Wigt(Zi)) Qbt i=b(t 1)+1 (1 + Wi(2π 1)) = (1 λt) 1 + λt 1 + Wigt(Zi) 1 + Wi(2π 1) Recall that Ft 1 = σ {Zi, Wi}i b(t 1) . It suffices to show that if H0 is true, then 1 + Wigt(Zi) 1 + Wi(2π 1) Note that the individual terms in the above product are independent conditional on Ft 1. 
Hence,
$$E_{H_0}\left[\prod_{i=b(t-1)+1}^{bt}\frac{1+W_ig_t(Z_i)}{1+W_i(2\pi-1)}\,\middle|\,\mathcal{F}_{t-1}\right] = \prod_{i=b(t-1)+1}^{bt}E_{H_0}\left[\frac{1+W_ig_t(Z_i)}{1+W_i(2\pi-1)}\,\middle|\,\mathcal{F}_{t-1}\right].$$
For any $i\in\{b(t-1)+1,\ldots,bt\}$, it holds that:
$$E_{H_0}\left[\frac{1+W_ig_t(Z_i)}{1+W_i(2\pi-1)}\,\middle|\,\mathcal{F}_{t-1}\right] = E_{H_0}\left[\pi\cdot\frac{1+g_t(Z_i)}{1+(2\pi-1)}+(1-\pi)\cdot\frac{1-g_t(Z_i)}{1-(2\pi-1)}\,\middle|\,\mathcal{F}_{t-1}\right] = E_{H_0}\left[\frac{1+g_t(Z_i)}{2}+\frac{1-g_t(Z_i)}{2}\,\middle|\,\mathcal{F}_{t-1}\right]=1.$$
Hence, we conclude that $(\mathcal{K}^u_t)_{t\ge0}$ is a nonnegative supermartingale adapted to $(\mathcal{F}_t)_{t\ge0}$. The time-uniform type I error control of the resulting test then follows from Ville's inequality (Proposition 2).
E Additional Experiments and Details
E.1 Modeling Details
CNN Architecture and Training. We use a CNN with 4 convolutional layers (kernel size 3×3) with 16, 32, 32, and 64 filters respectively. Each convolutional layer is followed by a max-pooling layer (2×2). After flattening, those layers are followed by one fully connected layer with 128 neurons. Dropout ($p=0.5$) and early stopping (with patience equal to ten epochs and 20% of the data used as the validation set) are used for regularization. ReLU activation functions are used in each layer. The Adam optimizer is used for training the network. We start training after processing twenty observations and update the model parameters after processing every next ten observations. The maximum number of epochs is set to 25 for each training iteration. The batch size is set to 32.
Single-stream Sequential Kernelized 2ST. The construction of this test extends the 2ST of Shekhar and Ramdas [2023] to the case when at each round an observation from only a single distribution ($P$ or $Q$) is revealed. Let $\mathcal{G}$ denote an RKHS with positive-definite kernel $k$ and canonical feature map $\varphi(\cdot)$ defined on $\mathcal{Z}$. Recall that instances from $P$ are labeled as $+1$ and instances from $Q$ are labeled as $-1$ (characterized by $W$). The mean embeddings of $P$ and $Q$ are then estimated as
$$\hat\mu^{(t)}_P = \frac{1}{N_+(t)}\sum_{i=1}^{t}\varphi(Z_i)1\{W_i=+1\},\qquad \hat\mu^{(t)}_Q = \frac{1}{N_-(t)}\sum_{i=1}^{t}\varphi(Z_i)1\{W_i=-1\},$$
where $N_+(t)=|\{i\le t: W_i=+1\}|$ and $N_-(t)=|\{i\le t: W_i=-1\}|$. The corresponding payoff function is $f^k_t(Z_{t+1},W_{t+1}) = W_{t+1}\,\hat g_t(Z_{t+1})$, where
$$\hat g_t = \frac{\hat\mu^{(t)}_P-\hat\mu^{(t)}_Q}{\left\|\hat\mu^{(t)}_P-\hat\mu^{(t)}_Q\right\|_{\mathcal{G}}}.$$
To make the test computationally efficient, it is critical to update the normalization constant efficiently. Suppose that at round $t+1$, an instance from $P$ is observed. In this case, $\hat\mu^{(t+1)}_Q = \hat\mu^{(t)}_Q$. Note that:
$$\hat\mu^{(t+1)}_P = \frac{1}{N_+(t)+1}\sum_{i=1}^{t+1}\varphi(Z_i)1\{W_i=+1\} = \frac{1}{N_+(t)+1}\varphi(Z_{t+1}) + \frac{N_+(t)}{N_+(t)+1}\hat\mu^{(t)}_P.$$
Hence, we have:
$$\left\|\hat\mu^{(t+1)}_P-\hat\mu^{(t+1)}_Q\right\|^2_{\mathcal{G}} = \left\|\hat\mu^{(t+1)}_P-\hat\mu^{(t)}_Q\right\|^2_{\mathcal{G}} = \left\|\hat\mu^{(t+1)}_P\right\|^2_{\mathcal{G}} - 2\left\langle\hat\mu^{(t+1)}_P,\hat\mu^{(t)}_Q\right\rangle_{\mathcal{G}} + \left\|\hat\mu^{(t)}_Q\right\|^2_{\mathcal{G}}.$$
In particular,
$$\left\langle\hat\mu^{(t+1)}_P,\hat\mu^{(t)}_Q\right\rangle_{\mathcal{G}} = \frac{1}{N_+(t)+1}\left\langle\varphi(Z_{t+1}),\hat\mu^{(t)}_Q\right\rangle_{\mathcal{G}} + \frac{N_+(t)}{N_+(t)+1}\left\langle\hat\mu^{(t)}_P,\hat\mu^{(t)}_Q\right\rangle_{\mathcal{G}},$$
where
$$\left\langle\varphi(Z_{t+1}),\hat\mu^{(t)}_Q\right\rangle_{\mathcal{G}} = \frac{1}{N_-(t)}\sum_{i=1}^{t}k(Z_{t+1},Z_i)1\{W_i=-1\}.$$
Next, we assume for simplicity that $k(x,x)=1$ for all $x$, which holds for the RBF kernel. Observe that:
$$\left\|\hat\mu^{(t+1)}_P\right\|^2_{\mathcal{G}} = \left\langle\hat\mu^{(t+1)}_P,\hat\mu^{(t+1)}_P\right\rangle_{\mathcal{G}} = \frac{1}{(N_+(t)+1)^2} + \frac{2N_+(t)}{(N_+(t)+1)^2}\left\langle\varphi(Z_{t+1}),\hat\mu^{(t)}_P\right\rangle_{\mathcal{G}} + \frac{(N_+(t))^2}{(N_+(t)+1)^2}\left\|\hat\mu^{(t)}_P\right\|^2_{\mathcal{G}}.$$
By caching intermediate results, we can compute the normalization constant using a number of kernel evaluations that is linear in $t$. We start betting once at least one instance has been observed from each of $P$ and $Q$. For simulations, we use the RBF kernel and the median heuristic on the first 20 instances to select the kernel hyperparameter.
MLP Training Scheme. We begin training after processing twenty datapoints from $P_{XY}$, which gives a training dataset with 40 datapoints (due to randomization).
When updating the model, we use the previous parameters as initialization. We use the following update scheme: the first update happens after the next $n_0=10$ datapoints from $P_{XY}$ are observed. Once $n_0$ becomes less than 1% of the size of the existing training dataset, we increase it by ten, that is, $n_t=n_{t-1}+10$. When we fit the model, we set the maximum number of epochs to 25 and use early stopping with a patience of 3 epochs.
Kernel Hyperparameters for Synthetic Experiments. For SKIT, we use RBF kernels:
$$k(x,x')=\exp\left(-\lambda_X\|x-x'\|^2_2\right),\qquad l(y,y')=\exp\left(-\lambda_Y\|y-y'\|^2_2\right).$$
For simulations on synthetic data, we take the kernel hyperparameters to be inversely proportional to the second moment of the underlying variables (the median heuristic yields similar results):
$$\lambda_X = \frac{1}{2E\left[\|X-X'\|^2_2\right]},\qquad \lambda_Y = \frac{1}{2E\left[\|Y-Y'\|^2_2\right]}.$$
1. Spherical model. By symmetry, we have $P_X=P_Y$, and hence we take $\lambda_X=\lambda_Y$. We have $E[(X-X')^2]=2E[X^2]=2$.
2. HTDD model. By symmetry, we have $P_X=P_Y$, and hence we take $\lambda_X=\lambda_Y$. We have $E[(X-X')^2]=2E[X^2]=2\pi^2/3$.
3. Sparse signal model. We have $E[\|X-X'\|^2_2]=2E[\|X\|^2_2]=4d$ and $E[\|Y-Y'\|^2_2]=2E[\|Y\|^2_2]=2\,\mathrm{tr}(B_sB_s^\top+I_d)=2\left(d+\sum_{i=1}^{d}\beta_i^2\right)$.
4. Gaussian model. We have $E[(X-X')^2]=2E[X^2]=2$ and $E[(Y-Y')^2]=2E[Y^2]=2(1+\beta^2)$.
Ridge Regression. We use ridge regression as the underlying predictive model: $\hat g_t(x)=\beta^{(t)}_0+x\beta^{(t)}_1$, where the coefficients are obtained by solving:
$$\left(\beta^{(t)}_0,\beta^{(t)}_1\right)=\arg\min_{\beta_0,\beta_1}\sum_{i=1}^{2(t-1)}\left(Y_i-X_i\beta_1-\beta_0\right)^2+\lambda\beta_1^2.$$
Let $\Gamma=\mathrm{diag}(0,1)$. Let $\mathbf{X}_{t-1}\in\mathbb{R}^{2(t-1)\times2}$ be such that $(\mathbf{X}_{t-1})_i=(1,X_i)$, $i\in[1,2(t-1)]$. Finally, let $\mathbf{Y}_{t-1}$ be the vector of responses: $(\mathbf{Y}_{t-1})_i=Y_i$, $i\in[1,2(t-1)]$. Then:
$$\beta^{(t)}=\arg\min_{\beta}\left\|\mathbf{Y}_{t-1}-\mathbf{X}_{t-1}\beta\right\|^2+\lambda\beta^\top\Gamma\beta=\left(\mathbf{X}_{t-1}^\top\mathbf{X}_{t-1}+\lambda\Gamma\right)^{-1}\mathbf{X}_{t-1}^\top\mathbf{Y}_{t-1}.$$
E.2 Additional Experiments for Seq-C-IT
In Figure 7, we present average stopping times for ITs under the synthetic settings from Section 3. We confirm that all tests adapt to the complexity of the problem at hand, stopping earlier on easy tasks and later on harder ones.
Figure 7: Stopping times of ITs on synthetic data from Section 3 ((a) spherical model; (b) HTDD model). Subplot (a) shows that SKIT is only marginally better than Seq-C-IT (MLP) due to slightly better sample efficiency under the spherical model (no localized dependence). Under the structured HTDD model, SKIT is inferior to Seq-C-ITs.
We also consider two additional synthetic examples where Seq-C-IT outperforms a kernelized approach:
1. Sparse signal model. Let $(X_t)_{t\ge1}$ and $(\varepsilon_t)_{t\ge1}$ be two independent sequences of standard Gaussian random vectors in $\mathbb{R}^d$: $X_t,\varepsilon_t\overset{iid}{\sim}N(0,I_d)$, $t\ge1$. We take $(X_t,Y_t)=(X_t,B_sX_t+\varepsilon_t)$, where $B_s=\mathrm{diag}(\beta_1,\ldots,\beta_d)$ and only $s=5$ of $\{\beta_i\}_{i=1}^{d}$ are nonzero, sampled from $\mathrm{Unif}([-0.5,0.5])$. We consider $d\in\{5,\ldots,50\}$.
2. Nested circles model. Let $(L_t)_{t\ge1}$, $(\Theta_t)_{t\ge1}$, $(\varepsilon^{(1)}_t)_{t\ge1}$, $(\varepsilon^{(2)}_t)_{t\ge1}$ denote sequences of random variables where $L_t\overset{iid}{\sim}\mathrm{Unif}(\{1,\ldots,l\})$ for some prespecified $l\in\mathbb{N}$, $\Theta_t\overset{iid}{\sim}\mathrm{Unif}([0,2\pi])$, and $\varepsilon^{(1)}_t,\varepsilon^{(2)}_t\overset{iid}{\sim}N(0,(1/4)^2)$. For $t\ge1$, we take
$$(X_t,Y_t)=\left(L_t\cos(\Theta_t)+\varepsilon^{(1)}_t,\ L_t\sin(\Theta_t)+\varepsilon^{(2)}_t\right). \quad (67)$$
We consider $l\in\{1,\ldots,10\}$.
In Figure 8, we show that Seq-C-ITs significantly outperform SKIT under these models. We note that the degrading performance of kernel-based tests under the nested circles model (67) has also been observed in earlier works [Berrett and Samworth, 2019, Podkopaev et al., 2023].
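For readers who wish to reproduce the data-generating process, the following Python sketch samples from the nested circles model (67) and forms the within-pair swapped points that mimic draws from the product of the marginals, mirroring the evaluations at $(X_{2t},Y_{2t-1})$ and $(X_{2t-1},Y_{2t})$ used in Example 2. The pairing of consecutive rounds and all variable names below are illustrative conventions, not taken from the paper's code.

```python
import numpy as np

# Nested circles model (67): L_t ~ Unif({1,...,l}), Theta_t ~ Unif([0, 2*pi]),
# eps^(1), eps^(2) ~ N(0, (1/4)^2).
def nested_circles(n, l=5, noise=0.25, seed=0):
    rng = np.random.default_rng(seed)
    radius = rng.integers(1, l + 1, size=n)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    eps = rng.normal(0.0, noise, size=(n, 2))
    return radius * np.cos(theta) + eps[:, 0], radius * np.sin(theta) + eps[:, 1]

x, y = nested_circles(1000, l=5)

# Keep the original pairs (X_t, Y_t) as draws from P_XY (label +1); swapping the
# Y-coordinates within consecutive rounds (2t-1, 2t) yields points that mimic
# draws from P_X x P_Y (label -1).
joint = np.column_stack([x, y])
swapped = joint.copy()
swapped[0::2, 1], swapped[1::2, 1] = y[1::2], y[0::2]
print(joint.shape, swapped.shape)  # (1000, 2) (1000, 2)
```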
E.3 Sequential vs Batch 2STs
The role of sequential tests is to complement existing batch tests, not to replace them. Generally, if one is ready to commit to a particular sample size, it is better to use a batch test. To understand the loss of power, we conducted an experiment where we take $P=N(0,1)$, $Q=N(\delta,1)$ and use a k-NN predictor as the underlying predictor. As we point out in Remark 1, if an analyst stops the experiment early, i.e., before the wealth exceeds $1/\alpha$, there is a modification that allows using the non-exhausted type I error budget to strictly improve power [Ramdas and Manole, 2023]: at the time of stopping, one may use a different threshold for rejecting the null, namely $U/\alpha$, where $U$ is an independently drawn uniform random variable on $[0,1]$ (so that $1/\alpha$ is stochastically larger than $U/\alpha$). We set the sample size to 1500 and compared three tests: batch (with half of the data used for training and half for inference), Seq-C-2ST (our sequential test), and Seq-C-2ST + Randomization (as per Remark 1). In Figure 9, we observe that the batch test has only slightly higher power than our sequential one when the sample size is specified beforehand. Yet in cases where the power is less than one, using a sequential test allows collecting more data to improve it, whereas with the batch test nothing can be done since the type I error budget is fully exhausted.
Figure 8: Rejection rates (left column) and average stopping times (right column) of sequential ITs for the synthetic datasets from Appendix E.2 ((a), (b): sparse signal model; (c), (d): nested circles model). In both cases, SKIT is inferior to Seq-C-ITs.
Figure 9: Comparison between the batch and sequential 2STs on Gaussian data when the sample size is set beforehand to 1500. First, invoking randomization improves the power of our sequential test across all settings. Second, the batch 2ST has only slightly higher power across many settings. Note that the power of Seq-C-2ST can easily be improved by collecting a bit more data, yet this is not allowed for the batch test.
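A minimal sketch of the decision rule behind "Seq-C-2ST + Randomization", under the reading of Remark 1 given above: monitor the wealth against the usual threshold $1/\alpha$, and if the sample-size budget is exhausted without a rejection, compare the final wealth to $U/\alpha$ for an independent $U\sim\mathrm{Unif}([0,1])$. The function operates on a precomputed wealth trajectory and is an illustration rather than a reference implementation of Ramdas and Manole [2023].

```python
import numpy as np

def sequential_decision(wealth_path, alpha=0.05, randomize=True, rng=None):
    """Return (reject, stopping_index) for a wealth trajectory (K_1, ..., K_T)."""
    rng = rng or np.random.default_rng()
    for t, k in enumerate(wealth_path, start=1):
        if k >= 1.0 / alpha:          # standard time-uniform rejection threshold
            return True, t
    if randomize:                     # budget exhausted: spend the leftover error budget
        u = rng.uniform()             # U ~ Unif([0, 1]), independent of the data
        return wealth_path[-1] >= u / alpha, len(wealth_path)
    return False, len(wealth_path)

# Example: a toy wealth path that grows but never reaches 1/alpha = 20.
toy_path = np.exp(np.linspace(0.0, 2.5, 1500))   # final wealth is roughly 12.2
print(sequential_decision(toy_path, alpha=0.05)) # rejects with probability ~0.61 over U
```

The non-randomized rule never rejects on this toy path, which illustrates how the randomized threshold converts leftover type I error budget into additional power at the cost of an externally randomized decision.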