Published in Transactions on Machine Learning Research (04/2024)

E-Valuating Classifier Two-Sample Tests

Teodora Pandeva, t.p.pandeva@gmail.com, University of Amsterdam
Tim Bakker, t.b.bakker@uva.nl, University of Amsterdam
Christian A. Naesseth, c.a.naesseth@uva.nl, University of Amsterdam
Patrick Forré, p.d.forre@uva.nl, University of Amsterdam

Reviewed on OpenReview: https://openreview.net/forum?id=dwFRov8xhr

We introduce a powerful deep classifier two-sample test for high-dimensional data based on e-values, called the e-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting e-values are suitable for anytime-valid sequential two-sample tests, which allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches, beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level.

1 Introduction

We consider two-sample tests, which aim to answer the statistical question of whether two independently obtained populations are statistically significantly different. Such tests are often used to distinguish real from generated data (Lopez-Paz and Oquab, 2017) or noise from data (Hastie et al., 2001; Gutmann and Hyvärinen, 2012; Mikolov et al., 2013; Goodfellow et al., 2014), and are widely used in simulation-based inference (Lueckmann et al., 2021; Miller et al., 2022). In the general setting, consider the scenario where we are given two independent samples from two possibly different distributions:

X_1^{(0)}, ..., X_{N_0}^{(0)} ~ i.i.d. P_0,   X_1^{(1)}, ..., X_{N_1}^{(1)} ~ i.i.d. P_1.
Based on these samples, we want to test whether the two distributions are equal. Thus, we can define a corresponding statistical test with null and alternative hypotheses as follows:

H_0 : P_0 = P_1   vs.   H_A : P_0 ≠ P_1.

In this paper, we consider the two-sample testing problem in the context of sequential testing, where the user accumulates data from P_0 and P_1 in a time-dependent manner. The primary goal is to evaluate, at each observed time step, whether the null hypothesis defined above remains valid. Thus, once enough evidence against the null has been acquired, we can reject the null and stop collecting data.

Related work. Two-sample testing has a long history in statistics, giving rise to classical techniques such as Student's and Welch's t-tests (Student, 1908; Welch, 1947), which compare the means of two normally distributed samples. In addition, nonparametric tests such as the Wilcoxon-Mann-Whitney test (Mann and Whitney, 1947), the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939), and the Kuiper test (Kuiper, 1960) have been established. For high-dimensional data, kernel methods have been introduced that compare the kernel embeddings of two populations (Gretton et al., 2012; Chwialkowski et al., 2015; Jitkrittum et al., 2016). However, these traditional two-sample statistical tests become less powerful when dealing with more complex data types such as images and text. Recent advances have led to the development of classifier-based two-sample tests (Kim et al., 2021) and their deep learning extensions (Lopez-Paz and Oquab, 2017; Kirchler et al., 2020; Liu et al., 2020; Cheng and Cloninger, 2022). In these methods, a model is trained to discriminate between the two populations using training data, and a statistical test is then performed on a separate test set.
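To make the classifier-based recipe concrete, the following is a minimal sketch of a two-split classifier two-sample test on toy data. Everything here (the 2d Gaussian populations, the tiny gradient-descent logistic regression standing in for a deep classifier, and the normal-approximation binomial test on held-out accuracy) is our own illustrative simplification, not the implementation of any of the cited papers:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, steps=500):
    # Tiny gradient-descent logistic regression as a stand-in for a deep classifier.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Toy populations: P0 = N(0, I), P1 = N((1, 1), I) in 2d, pooled with labels.
n = 500
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]

# Two-split C2ST: train on one half, evaluate accuracy on the held-out half.
half = n
w, b = fit_logistic(X[:half], y[:half])
acc = float(np.mean((sigmoid(X[half:] @ w + b) > 0.5) == (y[half:] == 1)))

# Under H0 the held-out accuracy is Binomial(n_test, 1/2) / n_test; a normal
# approximation gives a one-sided p-value for accuracy above chance.
n_test = 2 * n - half
z = (acc - 0.5) / np.sqrt(0.25 / n_test)
p_value = 0.5 * (1.0 - erf(z / np.sqrt(2.0)))
print(acc > 0.6, p_value < 0.05)
```

Note that the entire test statistic comes from the single held-out half; this is exactly the data-usage pattern the sequential e-value approach below is designed to improve on.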
However, all the listed methods share a common limitation: when applied sequentially, they can lead to an inflated type I error. In simpler terms, these methods assume that the sample size is known in advance, which can be a drawback in practice. To address this limitation, sequential testing procedures offer a solution by allowing practitioners to dynamically reject the null hypothesis as new batches of data arrive. Within this context, e-value-based sequential tests have been revived in the work of Shafer (2019); Ramdas et al. (2023); Grünwald et al. (2024), where they are interpreted as bets against the null hypothesis. More formally, e-variables are simply non-negative random variables E that satisfy, for all P ∈ H_0, E_P[E] ≤ 1, i.e. the expectation of E with respect to any distribution from the null hypothesis distribution class H_0 is at most one. An example of e-variables for singleton hypothesis classes are Bayes factors, i.e. we test whether the unknown probability density p equals p_0 or p_A:

H_0 : p = p_0   vs.   H_A : p = p_A.

The Bayes factor given by E(x) := p_A(x)/p_0(x) is an e-variable w.r.t. H_0 since E_{p_0}[E] = ∫ (p_A(x)/p_0(x)) p_0(x) dx = 1 ≤ 1. Note that observing a very large value of E, which we call an e-value, provides evidence against the null hypothesis. The appealing theoretical properties of e-variables (details in Section 2.2) have led to a growing body of work on e-value-based (conditional) independence tests spanning several domains, including two-sample tests for contingency tables (Turner and Grünwald, 2023), sequential data (Balsubramani and Ramdas, 2016), kernel-based approaches (Podkopaev et al., 2023; Shekhar and Ramdas, 2024), rank-based conditional independence tests (Duan et al., 2022), conditional independence tests under model-X assumptions (Grünwald et al., 2023; Shaer et al., 2023), split-likelihood two-sample tests (Lhéritier and Cazals, 2018), and the concurrent work of Podkopaev and Ramdas (2024).

Contributions.
We extend the work of Lhéritier and Cazals (2018) by introducing e-values for conditional independence testing, which is a larger class of testing problems (see Section 4). The resulting e-values combine the ideas of the split-likelihood testing procedure of Wasserman et al. (2020) (see Section 3) and the existing work on predictive conditional independence testing frameworks of Burkart and Király (2017) (see Section 4). In contrast to (Lhéritier and Cazals, 2018), our framework, when applied to two-sample testing, referred to as E-C2ST, assumes a composite null hypothesis. Furthermore, it leverages the representational capabilities of machine learning models, allowing us to design tests for complex data structures. We show that the described tests (including E-C2ST) provide non-asymptotic type I error control under the null, and are consistent (i.e., reject the null almost surely) in both a sequential and a non-sequential setting (details in Section 4). Moreover, when restricted to the two-sample test setting, we establish milder conditions on the machine learning model than those in (Lhéritier and Cazals, 2018) for the consistency guarantees to hold. In our empirical analysis in Section 6, we use the theoretical properties of E-C2ST to design sequential tests that optimize data usage by segmenting it into multiple batches. Each batch contributes to the cumulative test statistic. This method contrasts with traditional two-sample classifier tests, which derive a test statistic solely from the test set conditioned on the training data. Our approach not only achieves maximum power faster than standard methods, but also consistently keeps type I errors well below the significance level.

2 Hypothesis Testing with E-Variables

Building on the recent work of Ramdas et al. (2023); Grünwald et al.
(2024), we give a detailed introduction to e-variables and their properties in Section 2.2 and establish their connections to hypothesis testing in Sections 2.3 and 2.4. We defer all proofs to Appendix A. First, we introduce the notation used throughout this paper.

2.1 Notation

Consider a sample of data points x_1, x_2, ...,¹ reflecting realizations of random variables X_1, X_2, ..., drawn from an unknown probability distribution P ∈ P(Ω) on some unknown sample space Ω, where P(Ω) is the set of all probability measures on Ω. In hypothesis testing, we usually consider two model classes:

H_0 = {P_θ ∈ P(Ω) | θ ∈ Θ_0} (null hypothesis),   H_A = {P_θ ∈ P(Ω) | θ ∈ Θ_A} (alternative),

where Θ_0 and Θ_A represent the parameter sets of the distributions that are valid under the null and the alternative, respectively. We want to decide whether P comes from H_0 or from H_A:

H_0 : P ∈ H_0   vs.   H_A : P ∈ H_A.

In most cases, the data points come from the same space X, and we would at most observe countably many such data points X_n. In this setting, we can w.l.o.g. assume that Ω = X^ℕ. If we, furthermore, assume that the X_n, n ∈ ℕ, are drawn i.i.d. from P, then P((X_n)_{n∈ℕ}) = ⊗_{n∈ℕ} P(X_n), and we can directly incorporate the product structure into H_0 and H_A and restrict ourselves to one of those factors to state H_i. By slight abuse of notation we re-write, for i = 0, A:

H_i = {P_θ ∈ P(X) | θ ∈ Θ_i},

and implicitly assume that P_θ((X_n)_{n∈ℕ}) = ⊗_{n∈ℕ} P_θ(X_n). Moreover, we assume that our probability measures P_θ ∈ H_i are given via a density with respect to a product reference measure µ. We denote the density by p_θ(x) or p(x|θ) interchangeably in this work.

2.2 Conditional E-Variables

Now consider the more general relative framework where we allow hypothesis classes to come from a set of Markov kernels, which can be used to model conditional probability distributions, for i = 0, A:

H_i = {P_θ : Z → P(X) | θ ∈ Θ_i} ⊆ P(X)^Z,   (1)

where P(X)^Z denotes the space of all Markov kernels from Z to X, i.e.
for each P_θ ∈ H_i and fixed z ∈ Z, P_θ(·|z) is a valid probability measure on X. An example of conditional hypothesis classes is given in Section 4, where the null hypothesis class represents the set of distributions that reflect the conditional independence of two variables after observing a third one. With respect to H_0 as defined in Equation (1), we can define corresponding e-variables, which we call conditional e-variables²³:

Definition 2.1 (Conditional E-variable). A conditional e-variable w.r.t. H_0 ⊆ P(X)^Z is a non-negative measurable map:

E : X × Z → ℝ_{≥0},   (x, z) ↦ E(x|z),

such that for all P_θ ∈ H_0 and z ∈ Z we have:

E_θ[E|z] := ∫ E(x|z) P_θ(dx|z) ≤ 1.

¹In the following we will write small x if we either mean the realization of a random variable X or the argument of a function living on the same space. We use capital X for a data point if we want to stress its role as a random variable.
²A formal definition of the "unconditional" e-variables introduced in Section 1 can be easily derived from Definition 2.1 by dropping Z. Moreover, if E is an e-variable and x ∈ X a fixed point, then we call E(x) the e-value of x w.r.t. E.
³A similar definition can be found in (Grünwald et al., 2024).

One of the notable features of e-variables is their preservation under multiplication. We can easily combine (conditionally) independent e-variables by simply multiplying them, which results in a proper e-variable. This property makes e-variables appealing for meta-analysis studies (Vovk and Wang, 2021; Grünwald et al., 2024). A more general result is that backward-dependent conditional e-variables can be combined by multiplication. This property of e-values is analogous to the chain rule of probability densities. It becomes key in the development of E-C2ST in this framework, and is formally stated as:

Lemma 2.2 (Products of conditional E-variables (based on Grünwald et al. (2024))).
If E^{(1)} is a conditional e-variable w.r.t. H_0^{(1)} ⊆ P(Y)^Z and E^{(2)} a conditional e-variable w.r.t. H_0^{(2)} ⊆ P(X)^{Y×Z}, then E^{(3)} defined via their product:

E^{(3)}(x, y|z) := E^{(2)}(x|y, z) · E^{(1)}(y|z),

is a conditional e-variable w.r.t.:

H_0^{(3)} := H_0^{(2)} ⊗ H_0^{(1)} ⊆ P(X × Y)^Z,

where we define the product hypothesis as:

H_0^{(2)} ⊗ H_0^{(1)} := { P_θ ⊗ P_ψ | P_θ ∈ H_0^{(2)}, P_ψ ∈ H_0^{(1)} },

with the product Markov kernels given by:

(P_θ ⊗ P_ψ)(dx, dy|z) := P_θ(dx|y, z) P_ψ(dy|z).

2.3 Hypothesis Testing with Conditional E-Variables

In the context of statistical testing, we can evaluate an e-variable based on the given data points, which are realizations of the random variables X_1, ..., X_N. Subsequently, the decision criterion for rejecting the null hypothesis at a significance level α ∈ [0, 1] is as follows:

Reject H_0 in favor of H_A if E(X_1, ..., X_N) ≥ α⁻¹.

Lemma 2.3 tells us that with this rule the type I error, the rate of falsely rejecting H_0, is bounded by α.

Lemma 2.3 (Type I error control). Let E be a conditional e-variable w.r.t. H_0 ⊆ P(X)^Z. Then for every α ∈ [0, 1], P_θ ∈ H_0 and z ∈ Z we have:

P_θ(E ≥ α⁻¹ | z) ≤ α.

Thus, e-values can be transformed into more conservative p-values via the relation p = min{1, 1/E}, such that for P_θ ∈ H_0 it holds that P_θ(p ≤ α | z) ≤ α. Note that a valid way of constructing an e-variable from the independent random variables X_1, ..., X_N w.r.t. the observed sample points according to Lemma 2.2 is E(X_1, ..., X_N) = ∏_{n=1}^N E(X_n).

2.4 Sequential Hypothesis Testing with Conditional E-Variables

Up to this point, we have focused primarily on e-value-based tests for scenarios where the sample size N is predetermined. Now suppose that the data does not arrive all at once, but instead we observe an infinite stream of data. In this context, a new sample X_t becomes available at each time t. Consequently, we are interested in developing statistical tests that allow us to reject the null hypothesis at any given time t. These tests are called sequential tests and can be constructed using e-variables.
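The fixed-N decision rule above can be checked numerically. The sketch below (the Gaussian point null and alternative are our own toy choice, not from the paper) verifies by Monte Carlo that the Bayes-factor e-variable has expectation one under the null, and that the product test of Lemma 2.2 / Lemma 2.3 keeps the type I error at or below α:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bayes-factor e-variable for point null N(0,1) vs. point alternative N(1,1);
# the density ratio simplifies to E(x) = exp(x - 1/2).
def e_value(x):
    return np.exp(x - 0.5)

# Under the null, E_{p0}[E] = 1 exactly, so a Monte Carlo average should be ~1.
x_null = rng.normal(0.0, 1.0, size=500_000)
mean_e = float(np.mean(e_value(x_null)))

# Fixed-N test: multiply N independent e-values (Lemma 2.2) and reject
# when the product reaches alpha^{-1} (Lemma 2.3 / Markov's inequality).
alpha, N, trials = 0.05, 20, 2000
x = rng.normal(0.0, 1.0, size=(trials, N))   # repeated experiments under H0
E_prod = np.prod(e_value(x), axis=1)
type_I = float(np.mean(E_prod >= 1.0 / alpha))
print(round(mean_e, 2), type_I)              # type_I stays at or below alpha
```

In this simple setting the empirical type I error is in fact far below α, which matches the conservativeness observed for e-value tests in the paper's experiments.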
Building on the concepts introduced in the previous section, we can define a conditional e-variable E^{(t)} = E(X_t|X_1, ..., X_{t−1}), conditioned on the past observations X_1, ..., X_{t−1}, with respect to the null hypothesis H_0^{(t)} ⊆ P(X)^{X^{t−1}}. Importantly, Lemma 2.2 suggests combining all the evidence available up to time t to construct a backward-dependent e-variable. In other words, the running product E^{(≤t)} = ∏_{l=1}^t E^{(l)}, where E^{(1)} = E(X_1), proves to be a valid e-variable with respect to H_0. This sequence of e-variables, also known as an e-process (Ramdas et al., 2023), offers a theoretical advantage over non-sequential p-value-based tests by allowing what is called optional continuation (Grünwald et al., 2024). In simple terms, the user can make an informed decision at any given time t: whether to accumulate more data from additional experiments or to stop the process. This decision can be driven by, for example, the decision to reject the null hypothesis. The optional continuation property is facilitated by the following result, which ensures an anytime type-I-error bound for the process (E^{(≤t)})_{t≥1}.

Proposition 2.1 ((Ramdas et al., 2023; Grünwald et al., 2024)). Let E^{(≤t)} be the running e-variable described above. Then for all P_θ ∈ H_0 and all α ∈ (0, 1]:

P_θ(∃ t ≥ 0 such that E^{(≤t)} ≥ α⁻¹) ≤ α.

This result implies that we maintain type-I-error control not just at individual time points t but consistently throughout the entire data collection period. More precisely, the decision rule for rejecting the null hypothesis:

Reject H_0 in favor of H_A if E^{(≤t)} ≥ α⁻¹ for any t ≥ 1

has type I error bounded by α. Additionally, we consider this sequential test to be consistent if it correctly rejects the null within a finite number of steps:

P_θ(∃ t ≥ 0 such that E^{(≤t)} ≥ α⁻¹) = 1 for all P_θ ∈ H_A.

3 M-Split Likelihood Ratio Test

In general, constructing an e-variable with respect to an arbitrary H_0 is not a straightforward task.
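Before turning to such constructions, the anytime guarantee of Proposition 2.1 can be illustrated numerically. In the following sketch (the Gaussian e-process increments are our own toy choice, not the paper's construction), the probability of the running product ever crossing 1/α under the null stays at most α, while under the alternative the e-process drifts upward and crosses essentially always:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, T, trials = 0.05, 200, 2000

# Running-product e-process with i.i.d. increments exp(x - 1/2), the Bayes
# factor for N(1,1) vs. the null N(0,1); log E^{(<=t)} is a cumulative sum.
x_null = rng.normal(0.0, 1.0, size=(trials, T))
log_E = np.cumsum(x_null - 0.5, axis=1)

# Anytime validity: the chance of EVER crossing 1/alpha within the stream
# is at most alpha, no matter when the user peeks at the data.
ever_reject = float(np.mean(log_E.max(axis=1) >= np.log(1.0 / alpha)))

# Consistency: under the alternative N(1,1) the log e-process has positive
# drift and crosses the threshold almost surely.
x_alt = rng.normal(1.0, 1.0, size=(trials, T))
log_E_alt = np.cumsum(x_alt - 0.5, axis=1)
power = float(np.mean(log_E_alt.max(axis=1) >= np.log(1.0 / alpha)))
print(ever_reject, power)
```

This is exactly the "optional continuation" behavior: the stopping decision may depend on the running product without invalidating the type I error guarantee.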
There exist two main approaches. The first approach, see (Grünwald et al., 2024), is based on the reverse information projection of the hypothesis space H_A onto H_0. It is not data-dependent and can be shown to be growth-optimal in the worst case. However, the reverse information projection is not very explicit in general settings, especially when working with non-convex hypotheses H_A and H_0. The second approach is based on constructing a data-driven e-variable. By utilizing the M-split likelihood ratio test of Wasserman et al. (2020) introduced in this section, we establish an e-variable for a fixed sample size. Subsequently, we will demonstrate that the same e-variable can be adapted for sequential testing on an infinite data stream. Assume that our data set D = {X_1, ..., X_N} is of size N. We now split the index set [N] := {1, ..., N} into M ≥ 2 disjoint batches: [N] = I^{(1)} ∪ ... ∪ I^{(M)}. For m = 1, ..., M we also abbreviate I^{(<m)} […] > 0 and inf_{x∈X} p_0(x) > 0. For example, this condition can be easily satisfied for certain discrete distributions, such as the categorical distribution, as shown in Section 5. A more precise formulation of this theorem can be found in Theorem B.6. Lhéritier and Cazals (2018) discuss similar but stronger assumptions. In their work, P_A^{|x^{(1)}} is required to be strongly pointwise consistent, i.e. P_A^{|x^{(1)}}(x) → P_θ(x) almost surely as N → ∞, for which they can provide λ-consistency results (a weaker notion of consistency). They also assume that the null hypothesis is known. Next, we will show that in the case M = ∞, we can remove this condition and only assume that the learner is a better approximation of the true distribution than the estimated null one. Under similar conditions as in Theorem 3.3, we can prove that the sequential test defined in Equation (3) is consistent. The proof is deferred to Appendix B.2.

Theorem 3.4. Consider the sequence of e-variables (E^{(≤M)})_{M∈ℕ} from Equation (3) for an infinite stream of finite batches of data points.
Let the learning algorithm fit a model P_A^{|x^{(<M)}} […] ≥ r_M > 0,   sup_{x∈X} |I^{(M)}|⁻¹ |log E(x|x^{(<M)})| […], lim_{M→∞} […]. Then,

P_θ(∃ M ∈ ℕ such that E^{(≤M)} ≥ α⁻¹) = 1.

The requirement on the first sequence can be understood as guaranteeing that the learner consistently provides a better approximation of the true distribution than the estimated null distribution, averaged over a sequence of M consecutive steps. In this context, r_M could even be a decreasing null sequence, such as log(M)/M. The second condition is a milder assumption than uniform boundedness. Similar to the previous result, this condition is relatively easy to satisfy when dealing with categorical random variables or random variables with compact support.

4 Predictive Conditional Independence Testing

In this section, we combine the ideas of predictive conditional independence testing by Burkart and Király (2017) with e-variables from the M-split likelihood ratio test of Section 3, based on Wasserman et al. (2020), to derive a proper e-variable for conditional independence testing. The desired two-sample test will later be reformulated as an independence test by utilizing the theoretical results discussed in this section. As a reminder, in conditional independence testing we want to test whether a random variable X is independent of Y, or not, conditioned on Z:

H_0 : X ⊥⊥ Y | Z   vs.   H_A : X ⊥̸⊥ Y | Z,

based on data D = {(X_1, Y_1, Z_1), ..., (X_N, Y_N, Z_N)}. The corresponding (full) hypothesis spaces, in the i.i.d. setting, are:

H_0^fl = {P_θ(X|Z) ⊗ P_θ(Y|Z) ⊗ P_θ(Z) | θ ∈ Θ_0},
H_A^fl = {P_θ(X, Y, Z) | θ ∈ Θ_A} \ H_0.

If we assume that P(X, Z) is fixed for H_0 and H_A, then this simplifies to the following product hypothesis classes, i = 0, A:

H_i^fx := H_i^pd ⊗ {P(X, Z)},

where the conditional hypothesis classes H_i^pd ⊆ P(Y)^{X×Z} of predictive distributions are given by:

H_0^pd = {P_θ(Y|Z) | θ ∈ Θ_0},   H_A^pd = {P_θ(Y|X, Z) | θ ∈ Θ_A} \ H_0.   (5)

Equation (2) applied to H_i^fx under i.i.d.
assumptions leads us to the following m-th conditional e-variable, using the abbreviation w = (x, y, z):

E^{(m)}(w^{(m)}|w^{(<m)}) […]

10: if E^{(≤m)} ≥ α⁻¹ then
11:     reject and break
12: Obtain λ_{m+1} that solves (10)
13: D_train ← D_train ∪ (x_n, y_n)_{n∈I^{(m−1)}},  D_val ← (x_n, y_n)_{n∈I^{(m)}}

In this section, we formalize the classifier two-sample test using e-variables. The following test can be easily derived from the conditional independence test introduced in Section 4 by introducing a binary variable Y, along with the following abbreviations:

P(X|Y = 0) := P_0(X),   P(X|Y = 1) := P_1(X).

If we pool the data points X_n^{(y)} by augmenting them with a Y-component, (X_n^{(y)}, Y_n^{(y)}) with Y_n^{(y)} := y, then the pooled data set can be seen as one i.i.d. sample from P(X, Y) of size N := N_0 + N_1 for some unknown marginal P(Y). We can then reformulate the two-sample test as an independence test:

H_0 : X ⊥⊥ Y   vs.   H_A : X ⊥̸⊥ Y.

This allows us to use the e-variables from Section 4 (without any conditioning variable Z) for (conditional) independence testing. Furthermore, since Y is a binary variable, we can write any Markov kernel P(Y|X) as a Bernoulli distribution P_θ(Y|X = x) = Ber(σ(g_θ(x))) for some parameterized measurable function g_θ, where σ(t) := 1/(1 + exp(−t)) is the logistic sigmoid function. So our hypothesis spaces look like:

H_0 = {Ber(q_θ) | θ ∈ Θ_0}, q_θ ∈ [0, 1],   (7)
H_A = {Ber(σ(g_θ)) | θ ∈ Θ_A} \ H_0,

and the m-th conditional e-variable is given by:

E^{(m)}(y^{(m)}|x^{(m)}, x^{(<m)}) […]

For s ≥ 1 we have the inequality (Ville):

P(∃ M ∈ ℕ : S^{(M)} ≥ s) ≤ E[S^{(1)}] / s.

Proof. The sequence (E^{(≤M)})_{M∈ℕ} constitutes a non-negative super-martingale of e-variables w.r.t. the filtration F := σ(X^{(≤M)}), due to the following computation for P_θ ∈ H_0:

E_θ[E^{(≤M+1)} | x^{(≤M)}] = ∫ ∏_{m=1}^{M+1} E(X^{(m)}|x^{(<m)}) […] > 0 for some γ > 0.

Note that for a subset A ⊆ P(X) we abbreviate:

KL(A ‖ P) := inf_{Q∈A} KL(Q ‖ P).

Proof.
If P̂_N := (1/N) Σ_{n=1}^N δ_{X_n|Z} is the empirical distribution, then we get the following chain of equivalences, when conditioned on Z = z:

E^{(N)}|z ≤ α⁻¹ ⟺ ∏_{n=1}^N E(X_n|z) ≤ α⁻¹ ⟺ (1/N) Σ_{n=1}^N log E(X_n|z) ≤ −(1/N) log α =: γ_N ⟺ E_{X∼P̂_N^{|z}}[log E(X|z)] ≤ γ_N ⟺ P̂_N^{|z} ∈ A^{|z}_{γ_N}.

The bound then follows by a simple application of Sanov's theorem, see Csiszár (1984); Balsubramani (2020), for each z ∈ Z individually:

P_θ(E^{(N)} ≤ α⁻¹ | Z = z) = P_θ(P̂_N ∈ A^{|z}_{γ_N} | Z = z) ≤ exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)),

which requires the i.i.d. assumption (conditioned on Z) and that A^{|z}_{γ_N} is completely convex, which it is.

Lemma B.2. Consider the situation in Theorem B.1 and fix z ∈ Z. Then the first statement implies the second:

1. KL(A^{|z}_{γ(z)} ‖ P^{|z}_θ) > 0 for some γ(z) > 0.
2. E_{X∼P^{|z}_θ}[log E(X|z)] > 0.

If, furthermore, sup_{x∈X} |log E(x|z)| < ∞, then the set A^{|z}_{γ(z)} is TV-closed in P(X) for every γ(z) ≥ 0. In this case, the second statement also implies the first one, where we then have the implication:

0 ≤ γ(z) < E_{X∼P^{|z}_θ}[log E(X|z)] ⟹ KL(A^{|z}_{γ(z)} ‖ P^{|z}_θ) > 0.

Proof. ⟸: If E_{X∼P^{|z}_θ}[log E(X|z)] ≤ 0, then P^{|z}_θ ∈ A_0. Since A_0 ⊆ A_γ for every γ > 0, we get KL(A^{|z}_γ ‖ P^{|z}_θ) = 0 for every γ > 0.

⟹: Assume C := sup_{x∈X} |log E(x|z)| < ∞ and γ ≥ 0. Let Q ∈ P(X) be a TV-limit point of a sequence P_n ∈ A^{|z}_γ, n ∈ ℕ. Then we have the inequality:

E_{X∼Q}[log E(X|z)] = (E_{X∼Q}[log E(X|z)] − E_{X∼P_n}[log E(X|z)]) + E_{X∼P_n}[log E(X|z)] ≤ |E_{X∼Q}[log E(X|z)] − E_{X∼P_n}[log E(X|z)]| + E_{X∼P_n}[log E(X|z)] ≤ C · TV(Q, P_n) + γ.

This shows that Q ∈ A^{|z}_γ as well, and, thus, A^{|z}_γ is TV-closed. By way of contradiction, now assume that E_{X∼P^{|z}_θ}[log E(X|z)] > 0, but KL(A^{|z}_γ ‖ P^{|z}_θ) = 0 for all γ > 0. Since A^{|z}_γ is TV-closed and (completely) convex, we have that P^{|z}_θ ∈ A^{|z}_γ for all γ > 0. So we get E_{X∼P^{|z}_θ}[log E(X|z)] ≤ γ for all γ > 0, and thus E_{X∼P^{|z}_θ}[log E(X|z)] ≤ 0, which is a contradiction to our assumption.
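The exponential type II error decay these bounds promise can be observed in a small simulation. The sketch below (the toy Gaussian product e-value is our own choice, not the paper's construction) estimates the probability of failing to reach α⁻¹ under the alternative, which should shrink rapidly with N:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, trials = 0.05, 2000

# Product of Bayes-factor e-values E(x) = exp(x - 1/2), with data drawn
# from the alternative N(1,1); the null is N(0,1).
def type_II_rate(N):
    x = rng.normal(1.0, 1.0, size=(trials, N))
    log_E = np.sum(x - 0.5, axis=1)                      # log of the product e-value
    return float(np.mean(log_E < np.log(1.0 / alpha)))   # fraction failing to reject

rates = [type_II_rate(N) for N in (5, 20, 80)]
print(rates)  # decreasing toward zero as N grows
```

The decay rate here is governed by how far the alternative sits from the "bad" set A_γ in KL divergence, mirroring the Sanov-type bound above.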
The unconditional version follows from the above by using the one-point space Z = {∗} and reads:

Corollary B.3 (Type II error control for i.i.d. e-variables). Let X_1, ..., X_N be an i.i.d. sample, E : X → ℝ_{≥0} an e-variable w.r.t. H_0, and E^{(N)} := ∏_{n=1}^N E(X_n). Let α ∈ (0, 1], γ_N := −(1/N) log α ≥ 0, and for γ ∈ ℝ_{≥0} put:

A_γ := {Q ∈ P(X) | E_Q[log E] ≤ γ}.

Then for every P_θ ∈ H_A we have the following type II error bound:

P_θ(E^{(N)} ≤ α⁻¹) ≤ exp(−N · KL(A_{γ_N} ‖ P_θ)),

which converges to 0 if KL(A_γ ‖ P_θ) > 0 for some γ > 0.

Relating to the simpler unconditional case of the corollary, we can make the following clarifying remarks.

Remark B.4.

1. The condition KL(A_γ ‖ P_θ) > 0 for some γ > 0 is slightly stronger than the condition E_{P_θ}[log E] > 0. If sup_{x∈X} |log E(x)| < ∞, then one can show that both conditions are equivalent.
2. If there exist δ, γ > 0 such that for all P_θ ∈ H_A we have KL(A_γ ‖ P_θ) ≥ δ, then we easily deduce, for N ≥ −(log α)/γ, the uniform type II error bound:

sup_{P_θ∈H_A} P_θ(E^{(N)} ≤ α⁻¹) ≤ exp(−N δ).

From Theorem B.1 we can also obtain type II error control for conditional i.i.d. e-variables if we assume that the distribution of the conditioning variable is a marginal part of the hypothesis.

Corollary B.5 (Unconditional type II error control for conditional i.i.d. e-variables). Let E : X × Z → ℝ_{≥0} be a conditional e-variable w.r.t. H_0 given Z. Let Z : Ω → Z be a fixed random variable with values in Z and let X_1, ..., X_N : Ω → X be random variables that are i.i.d. conditioned on Z. Let E^{(N)} := ∏_{n=1}^N E(X_n|Z). Let α ∈ (0, 1], γ_N := −(1/N) log α ≥ 0. Then for every P_θ ∈ H_A we have the following type II error bound:

P_θ(E^{(N)} ≤ α⁻¹) ≤ E_θ[exp(−N · KL(A^{|Z}_{γ_N} ‖ P^{|Z}_θ))] ≤ exp(−N · inf_{z∈Z} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)),

where the middle term converges to 0 for N → ∞ if for P_θ(Z)-almost-all z ∈ Z there exists some γ > 0 such that KL(A^{|z}_γ ‖ P^{|z}_θ) > 0. The latter is e.g. the case if inf_{z∈Z} KL(A^{|z}_γ ‖ P^{|z}_θ) > 0 for some γ > 0.

Proof.
The inequalities directly follow from Theorem B.1 by plugging the random variable Z in for z and taking expectation values:

P_θ(E^{(N)} ≤ α⁻¹) = E_θ[P_θ(E^{(N)} ≤ α⁻¹ | Z)] ≤ E_θ[exp(−N · KL(A^{|Z}_{γ_N} ‖ P^{|Z}_θ))] ≤ sup_{z∼P_θ(Z)} exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) = exp(−N · inf_{z∼P_θ(Z)} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) ≤ exp(−N · inf_{z∈Z} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)).

Here sup_{z∼P_θ(Z)} and inf_{z∼P_θ(Z)} denote the essential supremum and essential infimum, respectively, w.r.t. P_θ(Z). The statement about convergence follows from the dominated convergence theorem and the observation that for every z ∈ Z we have the trivial bounds:

0 ≤ exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) ≤ 1.

This shows the claim.

B.1 Type II Error M = 2

Theorem B.6 (Type II error control for conditional e-variable for singleton H_0). Let H_0 = {P_0} be a singleton set. Consider a model class H_A and a learning algorithm that for every realization x = (x_n)_{n∈ℕ} ∈ X^ℕ and every number N^{(1)} ∈ ℕ fits a model P_A^{|x^{(1)}} ∈ P(X) to the first N^{(1)} entries x^{(1)} = (x_n)_{n∈I^{(1)}} of x. Assume that for every P_θ ∈ H_A and P_θ-almost every i.i.d. realization x = (x_n)_{n∈ℕ} of P_θ there exists a number N^{(1)}(x) ∈ ℕ and ϵ(x) > 0 such that for all N^{(1)} ≥ N^{(1)}(x) the model P_A^{|x^{(1)}} ∈ P(X) has a density p_A(x_n|x^{(1)}) and satisfies:

KL(P_θ ‖ P_A^{|x^{(1)}}) < KL(P_θ ‖ P_0) − ϵ(x),   sup_{x_n∈X} |log E(x_n|x^{(1)})| < ∞.

Then for every N^{(1)}, N^{(2)} ∈ ℕ we have the bound:

P_θ(E^{(N^{(2)}|N^{(1)})} ≤ α⁻¹) ≤ E_{X^{(1)}∼P_θ}[exp(−N^{(2)} · KL(A^{|X^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ))],

which converges to zero for min(N^{(1)}, N^{(2)}) → ∞.

Proof. The bound directly follows from Corollary B.5. Note that by the independence assumptions we have P^{|x^{(1)}}_θ = P_θ. Then note that for P_θ-almost-all x ∈ X^ℕ and for N^{(1)} ≥ N^{(1)}(x):

E_{X_n∼P_θ}[log E(X_n|x^{(1)})] = KL(P_θ ‖ P_0) − KL(P_θ ‖ P_A^{|x^{(1)}}) > ϵ(x) > 0.

By assumption and Lemma B.2 we have that for P_θ-almost all x ∈ X^ℕ, for N^{(1)} > N^{(1)}(x), and for γ(x^{(1)}) with:

0 ≤ γ(x^{(1)}) < ϵ(x) < E_{X_n∼P_θ}[log E(X_n|x^{(1)})],

we have KL(A^{|x^{(1)}}_{γ(x^{(1)})} ‖ P_θ) > 0. So for N^{(2)} big enough we get: γ_{N^{(2)}} := −(1/N^{(2)}) log α < ϵ(x).
This shows that for:

N^{(2)} > −(log α) / ϵ(x) =: N^{(2)}(x),

we have KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ) > 0. This shows that for P_θ-almost-all x ∈ X^ℕ we have the convergence:

exp(−N^{(2)} · KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ)) → 0, for min(N^{(1)}, N^{(2)}) → ∞.

Since we always have the trivial bounds:

0 ≤ exp(−N^{(2)} · KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ)) ≤ 1,

the theorem of dominated convergence tells us that we also have the convergence:

E_{X^{(1)}∼P_θ}[exp(−N^{(2)} · KL(A^{|X^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ))] → 0, for min(N^{(1)}, N^{(2)}) → ∞.

This shows the claim.

Lemma B.7. Let P̃_A(y|x, x^{(1)}, y^{(1)}) = (1 − λ) P_A(y|x, x^{(1)}, y^{(1)}) + λ P_0(y|y^{(2)}) for λ ∈ (0, 1) and y ∈ {0, 1}. Then the conditional e-variable defined by

Ẽ(x, y|x^{(1)}, y^{(1)}) = p̃_A(y|x, x^{(1)}, y^{(1)}) / p_0(y|y^{(2)}) = λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})   (11)

is bounded, i.e. |log Ẽ| < ∞.

Proof. For every I^{(2)} ⊆ X × Y and every (x, y) ∈ I^{(2)}:

log Ẽ(x, y|x^{(1)}, y^{(1)}) = log(λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})) ≥ log λ,

and, since the estimated null probability p_0(y|y^{(2)}) is at least 1/N^{(2)},

log Ẽ(x, y|x^{(1)}, y^{(1)}) = log(λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})) ≤ log(λ + (1 − λ) / min_{(x,y)∈I^{(2)}} p_0(y|y^{(2)})) ≤ log(λ + (1 − λ) N^{(2)}).

Corollary B.8. Let the assumptions about the learner in Theorem B.6 hold, i.e. for every realization x = (x_n)_{n∈ℕ} ∈ X^ℕ and every number N^{(1)} ∈ ℕ the learner fits a model P_A^{|x^{(1)}} ∈ P(X) to the first N^{(1)} entries x^{(1)} = (x_n)_{n∈I^{(1)}} of x. Assume that for every P_θ ∈ H_A and P_θ-almost every i.i.d. realization x = (x_n)_{n∈ℕ} of P_θ there exists a number N^{(1)}(x) ∈ ℕ and ϵ(x) > 0 such that for all N^{(1)} ≥ N^{(1)}(x) the model P_A^{|x^{(1)}} ∈ P(X) has a density p_A(x_n|x^{(1)}) and satisfies:

KL(P_θ ‖ P_A^{|x^{(1)}}) < KL(P_θ ‖ P_0) − ϵ(x).

Then the statistical test w.r.t. Ẽ(x, y|x^{(1)}, y^{(1)}) from Equation (11) is consistent.

Proof. The claim follows directly from Theorem B.6, since Ẽ(x, y|x^{(1)}, y^{(1)}) is bounded according to Lemma B.7.

B.2 Type II Error Control M = ∞

Theorem B.9 (Strong Law of Large Numbers for Martingale Difference Sequences). Consider the probability space (Ω, F, P).
Let (X_n)_{n∈ℕ} be a sequence of random variables that satisfies, for some r ≥ 1, Σ_{n∈ℕ} E_P[|X_n|^{2r}] / n^{1+r} < ∞. Consider the natural filtration F_n = σ(X_1, ..., X_n) ⊆ F. Additionally, let E_P[X_n|F_{n−1}] = 0 for all n ∈ ℕ. Then it holds that (1/n) Σ_{k=1}^n X_k → 0 almost surely.

Proof of Theorem 3.4.

Proof. Consider the stopping time T adapted to the natural filtration F_M = σ(X^{(≤M)}) given by T = inf{M ≥ 0 : E^{(≤M)} ≥ 1/α}. For this stopping time we have the equivalence:

T < ∞ ⟺ ∃ M ≥ 0 such that E^{(≤M)} ≥ α⁻¹.

The probability that the null distribution will not be rejected in favor of the alternative is given by:

P_θ(T = ∞) = P_θ(∀ M ≥ 0 : E^{(≤M)} < α⁻¹) = P_θ(∀ M : Σ_{m=1}^M log E^{(m)} < log α⁻¹).

Next, define the random variables W_m:

W_m := E_θ[log E^{(m)} | F_{m−1}] = E_θ[Σ_{i∈I^{(m)}} log E(x_i|x^{(<m)}) | F_{m−1}] ≥ r_m > 0,

where r := lim sup_{M→∞} (1/M) Σ_{m=1}^M r_m > 0. Note that the sequence can even diverge, i.e. r = +∞. Thus, since Σ_{m=1}^M log E^{(m)} = Σ_{m=1}^M (log E^{(m)} − W_m) + Σ_{m=1}^M W_m ≥ Σ_{m=1}^M (log E^{(m)} − W_m) + Σ_{m=1}^M r_m, we get:

P_θ(∀ M : Σ_{m=1}^M log E^{(m)} < log α⁻¹) ≤ P_θ(∀ M : Σ_{m=1}^M (log E^{(m)} − W_m) < log α⁻¹ − Σ_{m=1}^M r_m).

By Theorem B.9, (1/M) Σ_{m=1}^M (log E^{(m)} − W_m) → 0 almost surely, while log α⁻¹ − Σ_{m=1}^M r_m decreases at rate −rM along a subsequence, so the event on the right-hand side has probability zero. It follows that P_θ(T = ∞) = 0.

Proof of Lemma 5.1.

Proof. By Lemma B.7, the defined e-variable is bounded, with a bound depending on the batch size. With |I^{(m)}| < B for all m, the statement follows directly from Theorem 3.4.

C Expected Log Growth Rate

Theorem C.1. Consider the sequence of E-C2ST e-variables (E^{(≤M)})_{M≥1} with increments E^{(m)}, m = 1, ..., M, defined as in Equation (9). Let P(X, Y) denote the true joint distribution of X and Y with probability density function p(x, y) = p(y|x) p(x). Then it holds for all M ≥ 1:

E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(≤M)}] ≤ |I^{(≤M)}| · I(X; Y).

Proof. For any M ≥ 1 we get, due to the independence of the observations:

E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(≤M)}] = Σ_{m=1}^M E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(m)}] = Σ_{m=1}^M E_{X^{(≤m)}, Y^{(≤m)}}[log E^{(m)}] […]

[…] > 0.05. On the other hand, the way E-C2ST is designed addresses this problem, i.e., it ensures that the type I error in this case remains below the predetermined significance level. See Corollary 3.2 for these theoretical guarantees.
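To make the batchwise classifier e-value idea concrete, here is a hedged end-to-end sketch on toy data. Everything in it (the 2d Gaussian populations, the plain logistic regression standing in for a deep classifier, and the particular estimate of the null Bernoulli rate q0) is our own illustrative simplification, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, steps=500):
    # Plain gradient-descent logistic regression (stand-in for a deep net g_theta).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Two toy populations in 2d, pooled with labels: P0 = N(0, I), P1 = N((1,1), I).
n = 400
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]

# Batch 1 trains the classifier p_A(y|x); batch 2 supplies the e-value as the
# likelihood ratio of predictive vs. label-only (null) Bernoulli models.
half = n
w, b = fit_logistic(X[:half], y[:half])
p_hat = sigmoid(X[half:] @ w + b)       # predicted P(Y=1|x) on the held-out batch
q0 = float(y[:half].mean())             # null Bernoulli rate (illustrative choice)
y_te = y[half:]
log_e = float(np.sum(y_te * np.log(p_hat / q0)
                     + (1 - y_te) * np.log((1 - p_hat) / (1 - q0))))
alpha = 0.05
rejects = bool(log_e >= np.log(1.0 / alpha))
print(rejects)
```

In the sequential version, this classifier would be refit on all past batches and the per-batch log e-values accumulated into the running sum, with rejection allowed as soon as the sum crosses log(1/α).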
Layer (type)   Output Shape
Linear-1       [batch size, 30]
LayerNorm-2    [batch size, 30]
ReLU-3         [batch size, 30]
Linear-4       [batch size, 30]
LayerNorm-5    [batch size, 30]
ReLU-6         [batch size, 30]
Linear-7       [batch size, 2]

Table 2: The network architecture employed in the Blob experiment for all methods.

Layer (type)   Parameters
Conv2d-1       16 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-2    negative slope=0.2
GroupNorm-3    eps=1e-05
Conv2d-4       32 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-5    negative slope=0.2
GroupNorm-6    eps=1e-05
Conv2d-7       64 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-8    negative slope=0.2
GroupNorm-9    eps=1e-05
Conv2d-10      1 channel, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)

Table 3: The network architecture employed in the MNIST experiment for E-C2ST, L-C2ST, S-C2ST, M-C2ST.

Layer (type)   Parameters
Linear-1       size=32
LayerNorm-2
ReLU-3
Dropout-4      p=0.5
Linear-5       size=32
LayerNorm-6
ReLU-7
Dropout-8      p=0.5
Linear-9       size=1

Table 4: The network architecture employed in the KDEF experiments for all baselines.