Published in Transactions on Machine Learning Research (04/2024)

E-Valuating Classifier Two-Sample Tests

Teodora Pandeva, t.p.pandeva@gmail.com, University of Amsterdam
Tim Bakker, t.b.bakker@uva.nl, University of Amsterdam
Christian A. Naesseth, c.a.naesseth@uva.nl, University of Amsterdam
Patrick Forré, p.d.forre@uva.nl, University of Amsterdam

Reviewed on OpenReview: https://openreview.net/forum?id=dwFRov8xhr

We introduce a powerful deep classifier two-sample test for high-dimensional data based on e-values, called the e-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting e-values are suitable for anytime-valid sequential two-sample tests, which allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches, beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level.

1 Introduction

We consider two-sample tests, which aim to answer the statistical question of whether two independently obtained populations are statistically significantly different. Such tests are often used to distinguish real from generated data (Lopez-Paz and Oquab, 2017) or noise from data (Hastie et al., 2001; Gutmann and Hyvärinen, 2012; Mikolov et al., 2013; Goodfellow et al., 2014), and are widely used in simulation-based inference (Lueckmann et al., 2021; Miller et al., 2022). In the general setting, consider the scenario where we are given two independent samples from two possibly different distributions:

X_1^{(0)}, ..., X_{N_0}^{(0)} ~ i.i.d. P_0,   X_1^{(1)}, ..., X_{N_1}^{(1)} ~ i.i.d. P_1.
Based on these samples, we want to test whether the two distributions are equal. Thus, we can define a corresponding statistical test with null and alternative hypotheses as follows:

H_0 : P_0 = P_1   vs.   H_A : P_0 ≠ P_1.

In this paper, we consider the two-sample testing problem in the context of sequential testing, where the user accumulates data from P_0 and P_1 in a time-dependent manner. The primary goal is to evaluate, at each observed time step, whether the null hypothesis defined above remains valid. Thus, once enough evidence against the null has been acquired, we can reject the null and stop collecting data.

Related work. Two-sample testing has a long history in statistics, giving rise to classical techniques such as Student's and Welch's t-tests (Student, 1908; Welch, 1947), which compare the means of two normally distributed samples. In addition, nonparametric tests such as the Wilcoxon-Mann-Whitney test (Mann and Whitney, 1947), the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939), and the Kuiper test (Kuiper, 1960) have been established. For high-dimensional data, kernel methods have been introduced that compare the kernel embeddings of two populations (Gretton et al., 2012; Chwialkowski et al., 2015; Jitkrittum et al., 2016). However, these traditional two-sample statistical tests become less powerful when dealing with more complex data types such as images and text. Recent advances have led to the development of classifier-based two-sample tests (Kim et al., 2021) and their deep learning extensions (Lopez-Paz and Oquab, 2017; Kirchler et al., 2020; Liu et al., 2020; Cheng and Cloninger, 2022). In these methods, a model is trained to discriminate between the two populations using training data, and a statistical test is then performed on a separate test set.
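To make the classifier-based recipe concrete, the following is a minimal sketch of a two-split classifier two-sample test on toy data. Everything here (the 2d Gaussian populations, the tiny gradient-descent logistic regression standing in for a deep classifier, and the normal-approximation binomial test on held-out accuracy) is our own illustrative simplification, not the implementation of any of the cited papers:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, steps=500):
    # Tiny gradient-descent logistic regression as a stand-in for a deep classifier.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Toy populations: P0 = N(0, I), P1 = N((1, 1), I) in 2d, pooled with labels.
n = 500
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]

# Two-split C2ST: train on one half, evaluate accuracy on the held-out half.
half = n
w, b = fit_logistic(X[:half], y[:half])
acc = float(np.mean((sigmoid(X[half:] @ w + b) > 0.5) == (y[half:] == 1)))

# Under H0 the held-out accuracy is Binomial(n_test, 1/2) / n_test; a normal
# approximation gives a one-sided p-value for accuracy above chance.
n_test = 2 * n - half
z = (acc - 0.5) / np.sqrt(0.25 / n_test)
p_value = 0.5 * (1.0 - erf(z / np.sqrt(2.0)))
print(acc > 0.6, p_value < 0.05)
```

Note that the entire test statistic comes from the single held-out half; this is exactly the data-usage pattern the sequential e-value approach below is designed to improve on.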
However, all the listed methods share a common limitation: when applied sequentially, they can lead to an inflated type I error. In simpler terms, these methods assume that the sample size is known in advance, which can be a drawback in practice. To address this limitation, sequential testing procedures offer a solution by allowing practitioners to dynamically reject the null hypothesis as new batches of data arrive. Within this context, e-value-based sequential tests have been revived in the work of Shafer (2019); Ramdas et al. (2023); Grünwald et al. (2024), where they are interpreted as bets against the null hypothesis. More formally, e-variables are simply non-negative random variables E that satisfy, for all P ∈ H_0, E_P[E] ≤ 1, i.e. the expectation of E with respect to any distribution from the null hypothesis distribution class H_0 is at most one. An example of e-variables for singleton hypothesis classes are Bayes factors, i.e. we test whether the unknown probability density p equals p_0 or p_A:

H_0 : p = p_0   vs.   H_A : p = p_A.

The Bayes factor given by E(x) := p_A(x)/p_0(x) is an e-variable w.r.t. H_0 since E_{p_0}[E] = ∫ (p_A(x)/p_0(x)) p_0(x) dx = 1 ≤ 1. Note that observing a very large value of E, which we call an e-value, provides evidence against the null hypothesis. The appealing theoretical properties of e-variables (details in Section 2.2) have led to a growing body of work on e-value-based (conditional) independence tests spanning several domains, including two-sample tests for contingency tables (Turner and Grünwald, 2023), sequential data (Balsubramani and Ramdas, 2016), kernel-based approaches (Podkopaev et al., 2023; Shekhar and Ramdas, 2024), rank-based conditional independence tests (Duan et al., 2022), conditional independence tests under model-X assumptions (Grünwald et al., 2023; Shaer et al., 2023), split-likelihood two-sample tests (Lhéritier and Cazals, 2018), and the concurrent work of Podkopaev and Ramdas (2024).

Contributions.
We extend the work of Lhéritier and Cazals (2018) by introducing e-values for conditional independence testing, which is a larger class of testing problems (see Section 4). The resulting e-values combine the ideas of the split-likelihood testing procedure of Wasserman et al. (2020) (see Section 3) and the existing work on predictive conditional independence testing frameworks of Burkart and Király (2017) (see Section 4). In contrast to (Lhéritier and Cazals, 2018), our framework, when applied to two-sample testing, referred to as E-C2ST, assumes a composite null hypothesis. Furthermore, it leverages the representational capabilities of machine learning models, allowing us to design tests for complex data structures. We show that the described tests (including E-C2ST) provide non-asymptotic type I error control under the null, and are consistent (i.e., reject the null almost surely) in both a sequential and a non-sequential setting (details in Section 4). Moreover, when restricted to the two-sample test setting, we establish milder conditions on the machine learning model than those in (Lhéritier and Cazals, 2018) for the consistency guarantees to hold. In our empirical analysis in Section 6, we use the theoretical properties of E-C2ST to design sequential tests that optimize data usage by segmenting it into multiple batches. Each batch contributes to the cumulative test statistic. This method contrasts with traditional two-sample classifier tests, which derive a test statistic solely from the test set conditioned on the training data. Our approach not only achieves maximum power faster than standard methods, but also consistently keeps type I errors well below the significance level.

2 Hypothesis Testing with E-Variables

Building on the recent work of Ramdas et al. (2023); Grünwald et al.
(2024), we give a detailed introduction to e-variables and their properties in Section 2.2 and establish their connections to hypothesis testing in Sections 2.3 and 2.4. We defer all proofs to Appendix A. First, we introduce the notation used throughout this paper.

2.1 Notation

Consider a sample of data points x_1, x_2, ...,¹ reflecting realizations of random variables X_1, X_2, ..., drawn from an unknown probability distribution P ∈ P(Ω) on some unknown sample space Ω, where P(Ω) is the set of all probability measures on Ω. In hypothesis testing, we usually consider two model classes:

H_0 = {P_θ ∈ P(Ω) | θ ∈ Θ_0} (null hypothesis),   H_A = {P_θ ∈ P(Ω) | θ ∈ Θ_A} (alternative),

where Θ_0 and Θ_A represent the parameter sets of the distributions that are valid under the null and the alternative, respectively. We want to decide whether P comes from H_0 or from H_A:

H_0 : P ∈ H_0   vs.   H_A : P ∈ H_A.

In most cases, the data points come from the same space X, and we would at most observe countably many such data points X_n. In this setting, we can w.l.o.g. assume that Ω = X^ℕ. If we, furthermore, assume that the X_n, n ∈ ℕ, are drawn i.i.d. from P, then P((X_n)_{n∈ℕ}) = ⊗_{n∈ℕ} P(X_n), and we can directly incorporate the product structure into H_0 and H_A and restrict ourselves to one of those factors to state H_i. By slight abuse of notation we re-write, for i = 0, A:

H_i = {P_θ ∈ P(X) | θ ∈ Θ_i},

and implicitly assume that P_θ((X_n)_{n∈ℕ}) = ⊗_{n∈ℕ} P_θ(X_n). Moreover, we assume that our probability measures P_θ ∈ H_i are given via a density with respect to a product reference measure µ. We denote the density by p_θ(x) or p(x|θ) interchangeably in this work.

2.2 Conditional E-Variables

Now consider the more general relative framework where we allow hypothesis classes to come from a set of Markov kernels, which can be used to model conditional probability distributions, for i = 0, A:

H_i = {P_θ : Z → P(X) | θ ∈ Θ_i} ⊆ P(X)^Z,   (1)

where P(X)^Z denotes the space of all Markov kernels from Z to X, i.e.
for each P_θ ∈ H_i and fixed z ∈ Z, P_θ(·|z) is a valid probability measure on X. An example of conditional hypothesis classes is given in Section 4, where the null hypothesis class represents the set of distributions that reflect the conditional independence of two variables after observing a third one. With respect to H_0 as defined in Equation (1), we can define corresponding e-variables, which we call conditional e-variables²³:

Definition 2.1 (Conditional E-variable). A conditional e-variable w.r.t. H_0 ⊆ P(X)^Z is a non-negative measurable map:

E : X × Z → ℝ_{≥0},   (x, z) ↦ E(x|z),

such that for all P_θ ∈ H_0 and z ∈ Z we have:

E_θ[E|z] := ∫ E(x|z) P_θ(dx|z) ≤ 1.

¹In the following we will write small x if we either mean the realization of a random variable X or the argument of a function living on the same space. We use capital X for a data point if we want to stress its role as a random variable.
²A formal definition of the "unconditional" e-variables introduced in Section 1 can be easily derived from Definition 2.1 by dropping Z. Moreover, if E is an e-variable and x ∈ X a fixed point, then we call E(x) the e-value of x w.r.t. E.
³A similar definition can be found in (Grünwald et al., 2024).

One of the notable features of e-variables is their preservation under multiplication. We can easily combine (conditionally) independent e-variables by simply multiplying them, which results in a proper e-variable. This property makes e-variables appealing for meta-analysis studies (Vovk and Wang, 2021; Grünwald et al., 2024). A more general result is that backward-dependent conditional e-variables can be combined by multiplication. This property of e-values is analogous to the chain rule of probability densities. It becomes key in the development of E-C2ST in this framework, and is formally stated as:

Lemma 2.2 (Products of conditional E-variables (based on Grünwald et al. (2024))).
If E^{(1)} is a conditional e-variable w.r.t. H_0^{(1)} ⊆ P(Y)^Z and E^{(2)} a conditional e-variable w.r.t. H_0^{(2)} ⊆ P(X)^{Y×Z}, then E^{(3)} defined via their product:

E^{(3)}(x, y|z) := E^{(2)}(x|y, z) · E^{(1)}(y|z),

is a conditional e-variable w.r.t.:

H_0^{(3)} := H_0^{(2)} ⊗ H_0^{(1)} ⊆ P(X × Y)^Z,

where we define the product hypothesis as:

H_0^{(2)} ⊗ H_0^{(1)} := { P_θ ⊗ P_ψ | P_θ ∈ H_0^{(2)}, P_ψ ∈ H_0^{(1)} },

with the product Markov kernels given by:

(P_θ ⊗ P_ψ)(dx, dy|z) := P_θ(dx|y, z) P_ψ(dy|z).

2.3 Hypothesis Testing with Conditional E-Variables

In the context of statistical testing, we can evaluate an e-variable based on the given data points, which are realizations of the random variables X_1, ..., X_N. Subsequently, the decision criterion for rejecting the null hypothesis at a significance level α ∈ [0, 1] is as follows:

Reject H_0 in favor of H_A if E(X_1, ..., X_N) ≥ α⁻¹.

Lemma 2.3 tells us that with this rule the type I error, the rate of falsely rejecting H_0, is bounded by α.

Lemma 2.3 (Type I error control). Let E be a conditional e-variable w.r.t. H_0 ⊆ P(X)^Z. Then for every α ∈ [0, 1], P_θ ∈ H_0 and z ∈ Z we have:

P_θ(E ≥ α⁻¹ | z) ≤ α.

Thus, e-values can be transformed into more conservative p-values via the relation p = min{1, 1/E}, such that for P_θ ∈ H_0 it holds that P_θ(p ≤ α | z) ≤ α. Note that a valid way of constructing an e-variable from the independent random variables X_1, ..., X_N w.r.t. the observed sample points according to Lemma 2.2 is E(X_1, ..., X_N) = ∏_{n=1}^N E(X_n).

2.4 Sequential Hypothesis Testing with Conditional E-Variables

Up to this point, we have focused primarily on e-value-based tests for scenarios where the sample size N is predetermined. Now suppose that the data does not arrive all at once, but instead we observe an infinite stream of data. In this context, a new sample X_t becomes available at each time t. Consequently, we are interested in developing statistical tests that allow us to reject the null hypothesis at any given time t. These tests are called sequential tests and can be constructed using e-variables.
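The fixed-N decision rule above can be checked numerically. The sketch below (the Gaussian point null and alternative are our own toy choice, not from the paper) verifies by Monte Carlo that the Bayes-factor e-variable has expectation one under the null, and that the product test of Lemma 2.2 / Lemma 2.3 keeps the type I error at or below α:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bayes-factor e-variable for point null N(0,1) vs. point alternative N(1,1);
# the density ratio simplifies to E(x) = exp(x - 1/2).
def e_value(x):
    return np.exp(x - 0.5)

# Under the null, E_{p0}[E] = 1 exactly, so a Monte Carlo average should be ~1.
x_null = rng.normal(0.0, 1.0, size=500_000)
mean_e = float(np.mean(e_value(x_null)))

# Fixed-N test: multiply N independent e-values (Lemma 2.2) and reject
# when the product reaches alpha^{-1} (Lemma 2.3 / Markov's inequality).
alpha, N, trials = 0.05, 20, 2000
x = rng.normal(0.0, 1.0, size=(trials, N))   # repeated experiments under H0
E_prod = np.prod(e_value(x), axis=1)
type_I = float(np.mean(E_prod >= 1.0 / alpha))
print(round(mean_e, 2), type_I)              # type_I stays at or below alpha
```

In this simple setting the empirical type I error is in fact far below α, which matches the conservativeness observed for e-value tests in the paper's experiments.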
Building on the concepts introduced in the previous section, we can define a conditional e-variable E^{(t)} = E(X_t|X_1, ..., X_{t−1}), conditioned on the past observations X_1, ..., X_{t−1}, with respect to the null hypothesis H_0^{(t)} ⊆ P(X)^{X^{t−1}}. Importantly, Lemma 2.2 suggests combining all the evidence available up to time t to construct a backward-dependent e-variable. In other words, the running product E^{(≤t)} = ∏_{l=1}^t E^{(l)}, where E^{(1)} = E(X_1), proves to be a valid e-variable with respect to H_0. This sequence of e-variables, also known as an e-process (Ramdas et al., 2023), offers a theoretical advantage over non-sequential p-value-based tests by allowing what is called optional continuation (Grünwald et al., 2024). In simple terms, the user can make an informed decision at any given time t: whether to accumulate more data from additional experiments or to stop the process. This decision can be driven by, for example, the decision to reject the null hypothesis. The optional continuation property is facilitated by the following result, which ensures an anytime type-I-error bound for the process (E^{(≤t)})_{t≥1}.

Proposition 2.1 ((Ramdas et al., 2023; Grünwald et al., 2024)). Let E^{(≤t)} be the running e-variable described above. Then for all P_θ ∈ H_0 and all α ∈ (0, 1]:

P_θ(∃ t ≥ 0 such that E^{(≤t)} ≥ α⁻¹) ≤ α.

This result implies that we maintain type-I-error control not just at individual time points t but consistently throughout the entire data collection period. More precisely, the decision rule for rejecting the null hypothesis:

Reject H_0 in favor of H_A if E^{(≤t)} ≥ α⁻¹ for any t ≥ 1

has type I error bounded by α. Additionally, we consider this sequential test to be consistent if it correctly rejects the null within a finite number of steps:

P_θ(∃ t ≥ 0 such that E^{(≤t)} ≥ α⁻¹) = 1 for all P_θ ∈ H_A.

3 M-Split Likelihood Ratio Test

In general, constructing an e-variable with respect to an arbitrary H_0 is not a straightforward task.
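Before turning to such constructions, the anytime guarantee of Proposition 2.1 can be illustrated numerically. In the following sketch (the Gaussian e-process increments are our own toy choice, not the paper's construction), the probability of the running product ever crossing 1/α under the null stays at most α, while under the alternative the e-process drifts upward and crosses essentially always:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, T, trials = 0.05, 200, 2000

# Running-product e-process with i.i.d. increments exp(x - 1/2), the Bayes
# factor for N(1,1) vs. the null N(0,1); log E^{(<=t)} is a cumulative sum.
x_null = rng.normal(0.0, 1.0, size=(trials, T))
log_E = np.cumsum(x_null - 0.5, axis=1)

# Anytime validity: the chance of EVER crossing 1/alpha within the stream
# is at most alpha, no matter when the user peeks at the data.
ever_reject = float(np.mean(log_E.max(axis=1) >= np.log(1.0 / alpha)))

# Consistency: under the alternative N(1,1) the log e-process has positive
# drift and crosses the threshold almost surely.
x_alt = rng.normal(1.0, 1.0, size=(trials, T))
log_E_alt = np.cumsum(x_alt - 0.5, axis=1)
power = float(np.mean(log_E_alt.max(axis=1) >= np.log(1.0 / alpha)))
print(ever_reject, power)
```

This is exactly the "optional continuation" behavior: the stopping decision may depend on the running product without invalidating the type I error guarantee.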
There exist two main approaches. The first approach, see (Grünwald et al., 2024), is based on the reverse information projection of the hypothesis space H_A onto H_0. It is not data-dependent and can be shown to be growth-optimal in the worst case. However, the reverse information projection is not very explicit in general settings, especially when working with non-convex hypotheses H_A and H_0. The second approach is based on constructing a data-driven e-variable. By utilizing the M-split likelihood ratio test of Wasserman et al. (2020) introduced in this section, we establish an e-variable for a fixed sample size. Subsequently, we will demonstrate that the same e-variable can be adapted for sequential testing on an infinite data stream. Assume that our data set D = {X_1, ..., X_N} is of size N. We now split the index set [N] := {1, ..., N} into M ≥ 2 disjoint batches: [N] = I^{(1)} ∪ ... ∪ I^{(M)}. For m = 1, ..., M we also abbreviate I^{(<m)} […] > 0 and inf_{x∈X} p_0(x) > 0. For example, this condition can be easily satisfied for certain discrete distributions, such as the categorical distribution, as shown in Section 5. A more precise formulation of this theorem can be found in Theorem B.6. Lhéritier and Cazals (2018) discuss similar but stronger assumptions. In their work, P_A^{|x^{(1)}} is required to be strongly pointwise consistent, i.e. P_A^{|x^{(1)}}(x) → P_θ(x) almost surely as N → ∞, for which they can provide λ-consistency results (a weaker notion of consistency). They also assume that the null hypothesis is known. Next, we will show that in the case M = ∞, we can remove this condition and only assume that the learner is a better approximation of the true distribution than the estimated null one. Under similar conditions as in Theorem 3.3, we can prove that the sequential test defined in Equation (3) is consistent. The proof is deferred to Appendix B.2.

Theorem 3.4. Consider the sequence of e-variables (E^{(≤M)})_{M∈ℕ} from Equation (3) for an infinite stream of finite batches of data points.
Let the learning algorithm fit a model P_A^{|x^{(<M)}} […] ≥ r_M > 0,   sup_{x∈X} |I^{(M)}|⁻¹ |log E(x|x^{(<M)})| […], lim_{M→∞} […]. Then,

P_θ(∃ M ∈ ℕ such that E^{(≤M)} ≥ α⁻¹) = 1.

The requirement on the first sequence can be understood as guaranteeing that the learner consistently provides a better approximation of the true distribution than the estimated null distribution, averaged over a sequence of M consecutive steps. In this context, r_M could even be a decreasing null sequence, such as log(M)/M. The second condition is a milder assumption than uniform boundedness. Similar to the previous result, this condition is relatively easy to satisfy when dealing with categorical random variables or random variables with compact support.

4 Predictive Conditional Independence Testing

In this section, we combine the ideas of predictive conditional independence testing by Burkart and Király (2017) with e-variables from the M-split likelihood ratio test of Section 3, based on Wasserman et al. (2020), to derive a proper e-variable for conditional independence testing. The desired two-sample test will later be reformulated as an independence test by utilizing the theoretical results discussed in this section. As a reminder, in conditional independence testing we want to test whether a random variable X is independent of Y, or not, conditioned on Z:

H_0 : X ⊥⊥ Y | Z   vs.   H_A : X ⊥̸⊥ Y | Z,

based on data D = {(X_1, Y_1, Z_1), ..., (X_N, Y_N, Z_N)}. The corresponding (full) hypothesis spaces, in the i.i.d. setting, are:

H_0^fl = {P_θ(X|Z) ⊗ P_θ(Y|Z) ⊗ P_θ(Z) | θ ∈ Θ_0},
H_A^fl = {P_θ(X, Y, Z) | θ ∈ Θ_A} \ H_0.

If we assume that P(X, Z) is fixed for H_0 and H_A, then this simplifies to the following product hypothesis classes, i = 0, A:

H_i^fx := H_i^pd ⊗ {P(X, Z)},

where the conditional hypothesis classes H_i^pd ⊆ P(Y)^{X×Z} of predictive distributions are given by:

H_0^pd = {P_θ(Y|Z) | θ ∈ Θ_0},   H_A^pd = {P_θ(Y|X, Z) | θ ∈ Θ_A} \ H_0.   (5)

Equation (2) applied to H_i^fx under i.i.d.
assumptions leads us to the following m-th conditional e-variable, using the abbreviation w = (x, y, z):

E^{(m)}(w^{(m)}|w^{(<m)}) […]

10: if E^{(≤m)} ≥ α⁻¹ then
11:     reject and break
12: Obtain λ_{m+1} that solves (10)
13: D_train ← D_train ∪ (x_n, y_n)_{n∈I^{(m−1)}},  D_val ← (x_n, y_n)_{n∈I^{(m)}}

In this section, we formalize the classifier two-sample test using e-variables. The following test can be easily derived from the conditional independence test introduced in Section 4 by introducing a binary variable Y, along with the following abbreviations:

P(X|Y = 0) := P_0(X),   P(X|Y = 1) := P_1(X).

If we pool the data points X_n^{(y)} by augmenting them with a Y-component, (X_n^{(y)}, Y_n^{(y)}) with Y_n^{(y)} := y, then the pooled data set can be seen as one i.i.d. sample from P(X, Y) of size N := N_0 + N_1 for some unknown marginal P(Y). We can then reformulate the two-sample test as an independence test:

H_0 : X ⊥⊥ Y   vs.   H_A : X ⊥̸⊥ Y.

This allows us to use the e-variables from Section 4 (without any conditioning variable Z) for (conditional) independence testing. Furthermore, since Y is a binary variable, we can write any Markov kernel P(Y|X) as a Bernoulli distribution P_θ(Y|X = x) = Ber(σ(g_θ(x))) for some parameterized measurable function g_θ, where σ(t) := 1/(1 + exp(−t)) is the logistic sigmoid function. So our hypothesis spaces look like:

H_0 = {Ber(q_θ) | θ ∈ Θ_0}, q_θ ∈ [0, 1],   (7)
H_A = {Ber(σ(g_θ)) | θ ∈ Θ_A} \ H_0,

and the m-th conditional e-variable is given by:

E^{(m)}(y^{(m)}|x^{(m)}, x^{(<m)}) […]

For s ≥ 1 we have the inequality (Ville):

P(∃ M ∈ ℕ : S^{(M)} ≥ s) ≤ E[S^{(1)}] / s.

Proof. The sequence (E^{(≤M)})_{M∈ℕ} constitutes a non-negative super-martingale of e-variables w.r.t. the filtration F := σ(X^{(≤M)}), due to the following computation for P_θ ∈ H_0:

E_θ[E^{(≤M+1)} | x^{(≤M)}] = ∫ ∏_{m=1}^{M+1} E(X^{(m)}|x^{(<m)}) […] > 0 for some γ > 0.

Note that for a subset A ⊆ P(X) we abbreviate:

KL(A ‖ P) := inf_{Q∈A} KL(Q ‖ P).

Proof.
If P̂_N := (1/N) Σ_{n=1}^N δ_{X_n|Z} is the empirical distribution, then we get the following chain of equivalences, when conditioned on Z = z:

E^{(N)}|z ≤ α⁻¹ ⟺ ∏_{n=1}^N E(X_n|z) ≤ α⁻¹ ⟺ (1/N) Σ_{n=1}^N log E(X_n|z) ≤ −(1/N) log α =: γ_N ⟺ E_{X∼P̂_N^{|z}}[log E(X|z)] ≤ γ_N ⟺ P̂_N^{|z} ∈ A^{|z}_{γ_N}.

The bound then follows by a simple application of Sanov's theorem, see Csiszár (1984); Balsubramani (2020), for each z ∈ Z individually:

P_θ(E^{(N)} ≤ α⁻¹ | Z = z) = P_θ(P̂_N ∈ A^{|z}_{γ_N} | Z = z) ≤ exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)),

which requires the i.i.d. assumption (conditioned on Z) and that A^{|z}_{γ_N} is completely convex, which it is.

Lemma B.2. Consider the situation in Theorem B.1 and fix z ∈ Z. Then the first statement implies the second:

1. KL(A^{|z}_{γ(z)} ‖ P^{|z}_θ) > 0 for some γ(z) > 0.
2. E_{X∼P^{|z}_θ}[log E(X|z)] > 0.

If, furthermore, sup_{x∈X} |log E(x|z)| < ∞, then the set A^{|z}_{γ(z)} is TV-closed in P(X) for every γ(z) ≥ 0. In this case, the second statement also implies the first one, where we then have the implication:

0 ≤ γ(z) < E_{X∼P^{|z}_θ}[log E(X|z)] ⟹ KL(A^{|z}_{γ(z)} ‖ P^{|z}_θ) > 0.

Proof. ⟸: If E_{X∼P^{|z}_θ}[log E(X|z)] ≤ 0, then P^{|z}_θ ∈ A_0. Since A_0 ⊆ A_γ for every γ > 0, we get KL(A^{|z}_γ ‖ P^{|z}_θ) = 0 for every γ > 0.

⟹: Assume C := sup_{x∈X} |log E(x|z)| < ∞ and γ ≥ 0. Let Q ∈ P(X) be a TV-limit point of a sequence P_n ∈ A^{|z}_γ, n ∈ ℕ. Then we have the inequality:

E_{X∼Q}[log E(X|z)] = (E_{X∼Q}[log E(X|z)] − E_{X∼P_n}[log E(X|z)]) + E_{X∼P_n}[log E(X|z)] ≤ |E_{X∼Q}[log E(X|z)] − E_{X∼P_n}[log E(X|z)]| + E_{X∼P_n}[log E(X|z)] ≤ C · TV(Q, P_n) + γ.

This shows that Q ∈ A^{|z}_γ as well, and, thus, A^{|z}_γ is TV-closed. By way of contradiction, now assume that E_{X∼P^{|z}_θ}[log E(X|z)] > 0, but KL(A^{|z}_γ ‖ P^{|z}_θ) = 0 for all γ > 0. Since A^{|z}_γ is TV-closed and (completely) convex, we have that P^{|z}_θ ∈ A^{|z}_γ for all γ > 0. So we get E_{X∼P^{|z}_θ}[log E(X|z)] ≤ γ for all γ > 0, and thus E_{X∼P^{|z}_θ}[log E(X|z)] ≤ 0, which is a contradiction to our assumption.
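The exponential type II error decay these bounds promise can be observed in a small simulation. The sketch below (the toy Gaussian product e-value is our own choice, not the paper's construction) estimates the probability of failing to reach α⁻¹ under the alternative, which should shrink rapidly with N:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, trials = 0.05, 2000

# Product of Bayes-factor e-values E(x) = exp(x - 1/2), with data drawn
# from the alternative N(1,1); the null is N(0,1).
def type_II_rate(N):
    x = rng.normal(1.0, 1.0, size=(trials, N))
    log_E = np.sum(x - 0.5, axis=1)                      # log of the product e-value
    return float(np.mean(log_E < np.log(1.0 / alpha)))   # fraction failing to reject

rates = [type_II_rate(N) for N in (5, 20, 80)]
print(rates)  # decreasing toward zero as N grows
```

The decay rate here is governed by how far the alternative sits from the "bad" set A_γ in KL divergence, mirroring the Sanov-type bound above.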
The unconditional version follows from the above by using the one-point space Z = {∗} and reads:

Corollary B.3 (Type II error control for i.i.d. e-variables). Let X_1, ..., X_N be an i.i.d. sample, E : X → ℝ_{≥0} an e-variable w.r.t. H_0, and E^{(N)} := ∏_{n=1}^N E(X_n). Let α ∈ (0, 1], γ_N := −(1/N) log α ≥ 0, and for γ ∈ ℝ_{≥0} put:

A_γ := {Q ∈ P(X) | E_Q[log E] ≤ γ}.

Then for every P_θ ∈ H_A we have the following type II error bound:

P_θ(E^{(N)} ≤ α⁻¹) ≤ exp(−N · KL(A_{γ_N} ‖ P_θ)),

which converges to 0 if KL(A_γ ‖ P_θ) > 0 for some γ > 0.

Relating to the simpler unconditional case of the corollary, we can make the following clarifying remarks.

Remark B.4.

1. The condition KL(A_γ ‖ P_θ) > 0 for some γ > 0 is slightly stronger than the condition E_{P_θ}[log E] > 0. If sup_{x∈X} |log E(x)| < ∞, then one can show that both conditions are equivalent.
2. If there exist δ, γ > 0 such that for all P_θ ∈ H_A we have KL(A_γ ‖ P_θ) ≥ δ, then we easily deduce, for N ≥ −(log α)/γ, the uniform type II error bound:

sup_{P_θ∈H_A} P_θ(E^{(N)} ≤ α⁻¹) ≤ exp(−N δ).

From Theorem B.1 we can also obtain type II error control for conditional i.i.d. e-variables if we assume that the distribution of the conditioning variable is a marginal part of the hypothesis.

Corollary B.5 (Unconditional type II error control for conditional i.i.d. e-variables). Let E : X × Z → ℝ_{≥0} be a conditional e-variable w.r.t. H_0 given Z. Let Z : Ω → Z be a fixed random variable with values in Z and let X_1, ..., X_N : Ω → X be random variables that are i.i.d. conditioned on Z. Let E^{(N)} := ∏_{n=1}^N E(X_n|Z). Let α ∈ (0, 1], γ_N := −(1/N) log α ≥ 0. Then for every P_θ ∈ H_A we have the following type II error bound:

P_θ(E^{(N)} ≤ α⁻¹) ≤ E_θ[exp(−N · KL(A^{|Z}_{γ_N} ‖ P^{|Z}_θ))] ≤ exp(−N · inf_{z∈Z} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)),

where the middle term converges to 0 for N → ∞ if for P_θ(Z)-almost-all z ∈ Z there exists some γ > 0 such that KL(A^{|z}_γ ‖ P^{|z}_θ) > 0. The latter is e.g. the case if inf_{z∈Z} KL(A^{|z}_γ ‖ P^{|z}_θ) > 0 for some γ > 0.

Proof.
The inequalities directly follow from Theorem B.1 by plugging the random variable Z in for z and taking expectation values:

P_θ(E^{(N)} ≤ α⁻¹) = E_θ[P_θ(E^{(N)} ≤ α⁻¹ | Z)] ≤ E_θ[exp(−N · KL(A^{|Z}_{γ_N} ‖ P^{|Z}_θ))] ≤ sup_{z∼P_θ(Z)} exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) = exp(−N · inf_{z∼P_θ(Z)} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) ≤ exp(−N · inf_{z∈Z} KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)).

Here sup_{z∼P_θ(Z)} and inf_{z∼P_θ(Z)} denote the essential supremum and essential infimum, respectively, w.r.t. P_θ(Z). The statement about convergence follows from the dominated convergence theorem and the observation that for every z ∈ Z we have the trivial bounds:

0 ≤ exp(−N · KL(A^{|z}_{γ_N} ‖ P^{|z}_θ)) ≤ 1.

This shows the claim.

B.1 Type II Error M = 2

Theorem B.6 (Type II error control for conditional e-variable for singleton H_0). Let H_0 = {P_0} be a singleton set. Consider a model class H_A and a learning algorithm that for every realization x = (x_n)_{n∈ℕ} ∈ X^ℕ and every number N^{(1)} ∈ ℕ fits a model P_A^{|x^{(1)}} ∈ P(X) to the first N^{(1)} entries x^{(1)} = (x_n)_{n∈I^{(1)}} of x. Assume that for every P_θ ∈ H_A and P_θ-almost every i.i.d. realization x = (x_n)_{n∈ℕ} of P_θ there exists a number N^{(1)}(x) ∈ ℕ and ϵ(x) > 0 such that for all N^{(1)} ≥ N^{(1)}(x) the model P_A^{|x^{(1)}} ∈ P(X) has a density p_A(x_n|x^{(1)}) and satisfies:

KL(P_θ ‖ P_A^{|x^{(1)}}) < KL(P_θ ‖ P_0) − ϵ(x),   sup_{x_n∈X} |log E(x_n|x^{(1)})| < ∞.

Then for every N^{(1)}, N^{(2)} ∈ ℕ we have the bound:

P_θ(E^{(N^{(2)}|N^{(1)})} ≤ α⁻¹) ≤ E_{X^{(1)}∼P_θ}[exp(−N^{(2)} · KL(A^{|X^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ))],

which converges to zero for min(N^{(1)}, N^{(2)}) → ∞.

Proof. The bound directly follows from Corollary B.5. Note that by the independence assumptions we have P^{|x^{(1)}}_θ = P_θ. Then note that for P_θ-almost-all x ∈ X^ℕ and for N^{(1)} ≥ N^{(1)}(x):

E_{X_n∼P_θ}[log E(X_n|x^{(1)})] = KL(P_θ ‖ P_0) − KL(P_θ ‖ P_A^{|x^{(1)}}) > ϵ(x) > 0.

By assumption and Lemma B.2 we have that for P_θ-almost all x ∈ X^ℕ, for N^{(1)} > N^{(1)}(x), and for γ(x^{(1)}) with:

0 ≤ γ(x^{(1)}) < ϵ(x) < E_{X_n∼P_θ}[log E(X_n|x^{(1)})],

we have KL(A^{|x^{(1)}}_{γ(x^{(1)})} ‖ P_θ) > 0. So for N^{(2)} big enough we get: γ_{N^{(2)}} := −(1/N^{(2)}) log α < ϵ(x).
This shows that for:

N^{(2)} > −(log α) / ϵ(x) =: N^{(2)}(x),

we have KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ) > 0. This shows that for P_θ-almost-all x ∈ X^ℕ we have the convergence:

exp(−N^{(2)} · KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ)) → 0, for min(N^{(1)}, N^{(2)}) → ∞.

Since we always have the trivial bounds:

0 ≤ exp(−N^{(2)} · KL(A^{|x^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ)) ≤ 1,

the theorem of dominated convergence tells us that we also have the convergence:

E_{X^{(1)}∼P_θ}[exp(−N^{(2)} · KL(A^{|X^{(1)}}_{γ_{N^{(2)}}} ‖ P_θ))] → 0, for min(N^{(1)}, N^{(2)}) → ∞.

This shows the claim.

Lemma B.7. Let P̃_A(y|x, x^{(1)}, y^{(1)}) = (1 − λ) P_A(y|x, x^{(1)}, y^{(1)}) + λ P_0(y|y^{(2)}) for λ ∈ (0, 1) and y ∈ {0, 1}. Then the conditional e-variable defined by

Ẽ(x, y|x^{(1)}, y^{(1)}) = p̃_A(y|x, x^{(1)}, y^{(1)}) / p_0(y|y^{(2)}) = λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})   (11)

is bounded, i.e. |log Ẽ| < ∞.

Proof. For every I^{(2)} ⊆ X × Y and every (x, y) ∈ I^{(2)}:

log Ẽ(x, y|x^{(1)}, y^{(1)}) = log(λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})) ≥ log λ,

and, since the estimated null probability p_0(y|y^{(2)}) is at least 1/N^{(2)},

log Ẽ(x, y|x^{(1)}, y^{(1)}) = log(λ + (1 − λ) E(x, y|x^{(1)}, y^{(1)})) ≤ log(λ + (1 − λ) / min_{(x,y)∈I^{(2)}} p_0(y|y^{(2)})) ≤ log(λ + (1 − λ) N^{(2)}).

Corollary B.8. Let the assumptions about the learner in Theorem B.6 hold, i.e. for every realization x = (x_n)_{n∈ℕ} ∈ X^ℕ and every number N^{(1)} ∈ ℕ the learner fits a model P_A^{|x^{(1)}} ∈ P(X) to the first N^{(1)} entries x^{(1)} = (x_n)_{n∈I^{(1)}} of x. Assume that for every P_θ ∈ H_A and P_θ-almost every i.i.d. realization x = (x_n)_{n∈ℕ} of P_θ there exists a number N^{(1)}(x) ∈ ℕ and ϵ(x) > 0 such that for all N^{(1)} ≥ N^{(1)}(x) the model P_A^{|x^{(1)}} ∈ P(X) has a density p_A(x_n|x^{(1)}) and satisfies:

KL(P_θ ‖ P_A^{|x^{(1)}}) < KL(P_θ ‖ P_0) − ϵ(x).

Then the statistical test w.r.t. Ẽ(x, y|x^{(1)}, y^{(1)}) from Equation (11) is consistent.

Proof. The claim follows directly from Theorem B.6, since Ẽ(x, y|x^{(1)}, y^{(1)}) is bounded according to Lemma B.7.

B.2 Type II Error Control M = ∞

Theorem B.9 (Strong Law of Large Numbers for Martingale Difference Sequences). Consider the probability space (Ω, F, P).
Let (X_n)_{n∈ℕ} be a sequence of random variables that satisfies, for some r ≥ 1, Σ_{n∈ℕ} E_P[|X_n|^{2r}] / n^{1+r} < ∞. Consider the natural filtration F_n = σ(X_1, ..., X_n) ⊆ F. Additionally, let E_P[X_n|F_{n−1}] = 0 for all n ∈ ℕ. Then it holds that (1/n) Σ_{k=1}^n X_k → 0 almost surely.

Proof of Theorem 3.4.

Proof. Consider the stopping time T adapted to the natural filtration F_M = σ(X^{(≤M)}) given by T = inf{M ≥ 0 : E^{(≤M)} ≥ 1/α}. For this stopping time we have the equivalence:

T < ∞ ⟺ ∃ M ≥ 0 such that E^{(≤M)} ≥ α⁻¹.

The probability that the null distribution will not be rejected in favor of the alternative is given by:

P_θ(T = ∞) = P_θ(∀ M ≥ 0 : E^{(≤M)} < α⁻¹) = P_θ(∀ M : Σ_{m=1}^M log E^{(m)} < log α⁻¹).

Next, define the random variables W_m:

W_m := E_θ[log E^{(m)} | F_{m−1}] = E_θ[Σ_{i∈I^{(m)}} log E(x_i|x^{(<m)}) | F_{m−1}] ≥ r_m > 0,

where r := lim sup_{M→∞} (1/M) Σ_{m=1}^M r_m > 0. Note that the sequence can even diverge, i.e. r = +∞. Thus, since Σ_{m=1}^M log E^{(m)} = Σ_{m=1}^M (log E^{(m)} − W_m) + Σ_{m=1}^M W_m ≥ Σ_{m=1}^M (log E^{(m)} − W_m) + Σ_{m=1}^M r_m, we get:

P_θ(∀ M : Σ_{m=1}^M log E^{(m)} < log α⁻¹) ≤ P_θ(∀ M : Σ_{m=1}^M (log E^{(m)} − W_m) < log α⁻¹ − Σ_{m=1}^M r_m).

By Theorem B.9, (1/M) Σ_{m=1}^M (log E^{(m)} − W_m) → 0 almost surely, while log α⁻¹ − Σ_{m=1}^M r_m decreases at rate −rM along a subsequence, so the event on the right-hand side has probability zero. It follows that P_θ(T = ∞) = 0.

Proof of Lemma 5.1.

Proof. By Lemma B.7, the defined e-variable is bounded, with a bound depending on the batch size. With |I^{(m)}| < B for all m, the statement follows directly from Theorem 3.4.

C Expected Log Growth Rate

Theorem C.1. Consider the sequence of E-C2ST e-variables (E^{(≤M)})_{M≥1} with increments E^{(m)}, m = 1, ..., M, defined as in Equation (9). Let P(X, Y) denote the true joint distribution of X and Y with probability density function p(x, y) = p(y|x) p(x). Then it holds for all M ≥ 1:

E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(≤M)}] ≤ |I^{(≤M)}| · I(X; Y).

Proof. For any M ≥ 1 we get, due to the independence of the observations:

E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(≤M)}] = Σ_{m=1}^M E_{X^{(≤M)}, Y^{(≤M)}}[log E^{(m)}] = Σ_{m=1}^M E_{X^{(≤m)}, Y^{(≤m)}}[log E^{(m)}] […]

[…] > 0.05. On the other hand, the way E-C2ST is designed addresses this problem, i.e., it ensures that the type I error in this case remains below the predetermined significance level. See Corollary 3.2 for these theoretical guarantees.
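To make the batchwise classifier e-value idea concrete, here is a hedged end-to-end sketch on toy data. Everything in it (the 2d Gaussian populations, the plain logistic regression standing in for a deep classifier, and the particular estimate of the null Bernoulli rate q0) is our own illustrative simplification, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, steps=500):
    # Plain gradient-descent logistic regression (stand-in for a deep net g_theta).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Two toy populations in 2d, pooled with labels: P0 = N(0, I), P1 = N((1,1), I).
n = 400
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]

# Batch 1 trains the classifier p_A(y|x); batch 2 supplies the e-value as the
# likelihood ratio of predictive vs. label-only (null) Bernoulli models.
half = n
w, b = fit_logistic(X[:half], y[:half])
p_hat = sigmoid(X[half:] @ w + b)       # predicted P(Y=1|x) on the held-out batch
q0 = float(y[:half].mean())             # null Bernoulli rate (illustrative choice)
y_te = y[half:]
log_e = float(np.sum(y_te * np.log(p_hat / q0)
                     + (1 - y_te) * np.log((1 - p_hat) / (1 - q0))))
alpha = 0.05
rejects = bool(log_e >= np.log(1.0 / alpha))
print(rejects)
```

In the sequential version, this classifier would be refit on all past batches and the per-batch log e-values accumulated into the running sum, with rejection allowed as soon as the sum crosses log(1/α).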
Layer (type)   Output Shape
Linear-1       [batch size, 30]
LayerNorm-2    [batch size, 30]
ReLU-3         [batch size, 30]
Linear-4       [batch size, 30]
LayerNorm-5    [batch size, 30]
ReLU-6         [batch size, 30]
Linear-7       [batch size, 2]

Table 2: The network architecture employed in the Blob experiment for all methods.

Layer (type)   Parameters
Conv2d-1       16 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-2    negative slope=0.2
GroupNorm-3    eps=1e-05
Conv2d-4       32 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-5    negative slope=0.2
GroupNorm-6    eps=1e-05
Conv2d-7       64 channels, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)
LeakyReLU-8    negative slope=0.2
GroupNorm-9    eps=1e-05
Conv2d-10      1 channel, kernel size=(3, 3), stride=(2, 2), padding=(1, 1)

Table 3: The network architecture employed in the MNIST experiment for E-C2ST, L-C2ST, S-C2ST, M-C2ST.

Layer (type)   Parameters
Linear-1       size=32
LayerNorm-2
ReLU-3
Dropout-4      p=0.5
Linear-5       size=32
LayerNorm-6
ReLU-7
Dropout-8      p=0.5
Linear-9       size=1

Table 4: The network architecture employed in the KDEF experiments for all baselines.