# Kernel-Based Tests for Likelihood-Free Hypothesis Testing

Patrik Róbert Gerber (Department of Mathematics, MIT, Cambridge, MA 02139; prgerber@mit.edu) · Tianze Jiang (Department of Mathematics, MIT, Cambridge, MA 02139; tjiang@mit.edu) · Yury Polyanskiy (Department of EECS, MIT, Cambridge, MA 02139; yp@mit.edu) · Rui Sun (Department of Mathematics, MIT, Cambridge, MA 02139; eruisun@mit.edu)

Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

**Abstract.** Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to one of the two classes. Special cases of this problem are well known: with complete knowledge of the class distributions ($n = \infty$) the problem is solved optimally by the likelihood-ratio test; when $m = 1$ it corresponds to binary classification; and when $m \asymp n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample size $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where the unlabeled samples come from a mixture of the two classes, a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under maximum mean discrepancy (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM-generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.

## 1 Likelihood-Free Inference

The goal of likelihood-free inference (LFI) [8; 12; 17; 24], also called simulation-based inference (SBI), is to perform statistical inference in a setting where the data-generating process is a black box that can nevertheless be simulated. Given the ability to generate samples $X_\theta \sim P_\theta^{\otimes n}$ for any parameter $\theta$, and given real-world data $Z \sim P_{\theta^\star}^{\otimes m}$, we want to use our simulations to learn about the truth $\theta^\star$. LFI is particularly relevant in areas of science where we have precise but complex laws of nature, for which we can run (stochastic) forward simulations but cannot directly compute the density of $P_\theta$. The Bayesian community approached the problem under the name of Approximate Bayesian Computation (ABC) [6; 13; 47]. More recently, ML-based methods in which regressors and classifiers are used to summarize data, select regions of interest, and approximate likelihoods or likelihood ratios [14; 29; 30; 42; 49] have also emerged for this challenge. Despite empirical advances, the theoretical study of frequentist LFI is still in its infancy. We focus on the nonparametric and non-asymptotic setting, which we justify as follows. For applications where tight error control is critical one might be reluctant to rely on asymptotics. More broadly, the non-asymptotic regime can uncover new phenomena and provide insights for algorithm design. Further, parametric models are clearly at odds with the black-box assumption.
Recently, [18] proposed likelihood-free hypothesis testing (LFHT) as a simplified model and found minimax optimal tests for a range of nonparametric distribution classes, thereby identifying a fundamental simulation-experimentation trade-off between the number of simulated observations $n$ and the size of the experimental data sample $m$. Here we extend [18], and prior related work [19; 26; 27; 32; 33], to a new setting designed to model experimental setups more faithfully, and we derive sample complexity bounds (upper and lower) for kernel-based tests over nonparametric classes. While minimax optimal, the algorithms of [18; 19] are impractical as they rely on discretizing the observations on a regular grid. Thus, both in our theory and in our experiments we turn to kernel methods, which provide an empirically more powerful set of algorithms that have shown success in nonparametric testing [20; 21; 22; 31; 39].

**Contributions** Our contributions are twofold. Theoretically, we introduce mixed likelihood-free hypothesis testing (mLFHT), which is a generalization of (LFHT) and provides a better model of applications such as the search for new physics [11; 38]. We propose a robust kernel-based test and derive both upper and lower bounds on its minimax sample complexity over a large nonparametric class of densities, generalizing multiple results in [18; 19; 26; 27; 32; 33; 37]. Although the simulation-experimentation ($m$ vs $n$) trade-off has been proven in the minimax sense (that is, for some worst-case data distribution), it is not clear whether it actually occurs in real data. Our second contribution is the empirical confirmation of the existence of an asymmetric trade-off, cf. Figure 1. To this end we construct state-of-the-art tests building on the ideas of [39; 48] on learning good kernels from data. We execute this program in two settings: the Higgs boson discovery [5], and detecting diffusion [25] generated images planted in the CIFAR-10 [35] dataset.

### 1.1 LFHT and the Simulation-Experimentation Trade-off

Suppose we have i.i.d. samples $X, Y$, each of size $n$, from two unknown distributions $P_X, P_Y$ on a measurable space $\mathcal{X}$, as well as a third i.i.d. sample $Z \sim P_Z^{\otimes m}$ of size $m$. In the context of LFI, we may think of the samples $X, Y$ as being generated by our simulator, and $Z$ as the data collected in the real world. The problem we refer to as likelihood-free hypothesis testing is the task of deciding between the hypotheses
$$H_0 : P_Z = P_X \quad\text{versus}\quad H_1 : P_Z = P_Y. \tag{LFHT}$$
This problem originates in [23; 52], where the authors study the exponents of error decay for finite $\mathcal{X}$ and fixed $P_X, P_Y$ as $n, m \to \infty$; more recently [2; 18; 19; 26; 27; 32; 33; 37] it has been studied in the non-asymptotic regime. Assuming that $P_X, P_Y$ belong to a known nonparametric class of distributions $\mathcal{P}$ and are guaranteed to be $\epsilon$-separated with respect to total variation (TV) distance (i.e. $\mathrm{TV}(P_X, P_Y) \geq \epsilon$), [18] characterizes the sample sizes $n$ and $m$ required for the sum of type-I and type-II errors to be small, as a function of $\epsilon$ and for several different classes $\mathcal{P}$. Their results show, for three settings of $\mathcal{P}$, that (i) testing (LFHT) at vanishing error is possible even when $n$ is not large enough to estimate $P_X$ and $P_Y$ within total variation distance $O(\epsilon)$, and that (ii) to achieve a fixed level of error, say $\alpha$, one can trade off $m$ vs. $n$ along a curve of the form $\{\min\{n, \sqrt{mn}\} \gtrsim n_{\mathsf{TS}}(\alpha, \epsilon, \mathcal{P}),\ m \gtrsim \log(1/\alpha)/\epsilon^2\}$. Here $n_{\mathsf{TS}}$ denotes the minimax sample complexity of two-sample testing over $\mathcal{P}$, i.e.
the minimum number of observations $n$ needed from $P_X, P_Y \in \mathcal{P}$ to distinguish the cases $\mathrm{TV}(P_X, P_Y) \geq \epsilon$ versus $P_X = P_Y$. Here $\gtrsim$ suppresses dependence on constants and untracked parameters. It is unclear, however, whether predictions drawn from minimax sample complexities over specified distribution classes can be observed in real-world data. Without the theory, a natural expectation would be that the error contour $\{(m, n) : \exists\text{ a test with total error} \leq \alpha\}$ looks like that of minimax two-sample testing with unequal sample sizes, namely $\{(m, n) : \min\{n, m\} \gtrsim n_{\mathsf{TS}}(\alpha, \epsilon, \mathcal{P})\}$, i.e. $n$ and $m$ simply need to be above a certain threshold simultaneously (as is the case for, e.g., two-sample testing over smooth densities [4; 37]). However, from Figures 2 and 1 we see that there is indeed a non-trivial trade-off between $n$ and $m$: the contours are not always parallel to the axes and are not symmetric about the line $m = n$. The importance of Figure 1 is in demonstrating that said trade-off is not an artifact of a theory relying on some esoteric worst-case data distribution, but is instead a real effect observed in state-of-the-art LFI algorithms run on actual data. We remark that the $n$ used in this plot is the total number of simulated samples (most of which are used for choosing a neural-network-parameterized kernel) and is not just the $n$ occurring in Theorems 3.1 and 3.2, which apply to a fixed kernel. See Section 4 for details on sample division.

### 1.2 Mixed Likelihood-Free Hypothesis Testing

A prominent application of likelihood-free inference lies in the field of particle physics. Scientists run sophisticated experiments in the hope of finding a new particle or phenomenon. Often said phenomenon can be predicted from theory, and thus can be simulated, as was the case for the Higgs boson, whose existence was verified after nearly 50 years at the Large Hadron Collider (LHC) [3; 9]. Suppose we have $n$ simulations each from the background distribution $P_X$ and the signal distribution $P_Y$. Further, we also have $m$ (real-world) datapoints from $P_Z = (1 - \nu)P_X + \nu P_Y$, i.e. the observed data is a mixture between the background and signal distributions with rate parameter $\nu$. The goal of physicists is to construct confidence intervals for $\nu$, and a discovery corresponds to a $5\sigma$ confidence interval that excludes $\nu = 0$. We model this problem by testing
$$H_0 : \nu = 0 \quad\text{versus}\quad H_1 : \nu \geq \delta \tag{mLFHT}$$
for a fixed (usually predicted) $\delta > 0$. See the rigorous definition of (mLFHT) in Section 3. In particular, a discovery can be claimed if $H_0$ is rejected.

[Figure 1: $n$ versus $m$ trade-off (axes $\lg m$ vs $\lg n$) for the Higgs and CIFAR experiments using our test in Section 2; the color scale shows type-I plus type-II error. Error probabilities are estimated by normal approximation for Higgs and simulated for CIFAR.]

## 2 The Likelihood-Free Test Statistic

This section introduces the testing procedure based on maximum mean discrepancy (MMD) that we study throughout the paper, both theoretically and empirically. First, we introduce the necessary background on MMD in Section 2.1. Then, we define our test statistic in Section 2.2.

### 2.1 Kernel Embeddings and MMD

Given a set $\mathcal{X}$, we call a function $K : \mathcal{X}^2 \to \mathbb{R}$ a kernel if the $n \times n$ matrix with $ij$'th entry $K(x_i, x_j)$ is symmetric positive semidefinite for all choices of $x_1, \ldots, x_n \in \mathcal{X}$ and $n \geq 1$. There is a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated to $K$.
$\mathcal{H}_K$ consists of functions $\mathcal{X} \to \mathbb{R}$ and satisfies the reproducing property $\langle K(x, \cdot), f\rangle_{\mathcal{H}_K} = f(x)$ for all $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$; in particular $K(x, \cdot) \in \mathcal{H}_K$. Given a probability measure $P$ on $\mathcal{X}$, define its kernel embedding $\theta_P$ as
$$\theta_P \triangleq \mathbb{E}_{X \sim P}\, K(X, \cdot) = \int_{\mathcal{X}} K(x, \cdot)\, P(dx). \tag{1}$$
Given the kernel embeddings of two probability measures $P, Q$, we can measure their distance in the RKHS by $\mathrm{MMD}(P, Q) \triangleq \|\theta_P - \theta_Q\|_{\mathcal{H}_K}$, where MMD stands for maximum mean discrepancy. MMD has a closed form thanks to the reproducing property and linearity:
$$\mathrm{MMD}^2(P, Q) = \mathbb{E}\big[K(X, X') + K(Y, Y') - 2K(X, Y)\big], \qquad (X, X', Y, Y') \sim P^{\otimes 2} \otimes Q^{\otimes 2}.$$
In particular, if $P, Q$ are empirical measures based on observations, we can evaluate the MMD exactly, which is crucial in practice. Yet another attractive property of MMD is that (under mild integrability conditions) it is an integral probability metric (IPM), where the supremum is taken over the unit ball of the RKHS $\mathcal{H}_K$. See e.g. [41; 46] for references. The following result is a consequence of the fact that self-adjoint compact operators are diagonalizable.

**Theorem 2.1** (Hilbert–Schmidt). Suppose that $K \in L^2(\mu \otimes \mu)$ is symmetric. Then there exists a sequence $(\lambda_j)_{j \geq 1} \in \ell^2$ and an orthonormal basis $\{e_j\}_{j \geq 1}$ of $L^2(\mu)$ such that $K(x, y) = \sum_{j \geq 1} \lambda_j e_j(x) e_j(y)$, where convergence is in $L^2(\mu \otimes \mu)$.

**Assumption 1.** Unless specified otherwise, we implicitly assume a choice of a non-negative measure $\mu$ and kernel $K$ for which the conditions of Theorem 2.1 hold. Note that the $\lambda_j \geq 0$ and depend on $\mu$.

**Removing the bias** In our proofs we work with the kernel embeddings of empirical measures, for which we need to modify the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_K}$ (and thus MMD) slightly by removing the diagonal terms. Namely, given i.i.d. samples $X, Y$ of sizes $n, m$ respectively and corresponding empirical measures $\hat P_X, \hat P_Y$, we define
$$\mathrm{MMD}_u^2(\hat P_X, \hat P_Y) \triangleq \frac{1}{n(n-1)}\sum_{i \neq j} K(X_i, X_j) + \frac{1}{m(m-1)}\sum_{i \neq j} K(Y_i, Y_j) - \frac{2}{nm}\sum_{i,j} K(X_i, Y_j).$$
We also write $\langle \theta_{\hat P_X}, \theta_{\hat P_X}\rangle_{u,\mathcal{H}_K} \triangleq \|\theta_{\hat P_X}\|^2_{u,\mathcal{H}_K} \triangleq \frac{1}{n(n-1)}\sum_{i \neq j} K(X_i, X_j)$ and extend linearly. The $u$ stands for unbiased, since $\mathbb{E}\,\mathrm{MMD}_u^2(\hat P_X, \hat P_Y) = \mathrm{MMD}^2(P_X, P_Y) \neq \mathbb{E}\,\mathrm{MMD}^2(\hat P_X, \hat P_Y)$ in general.

### 2.2 Test Statistic

With Section 2.1 behind us, we are in a position to define the test statistic that we use to tackle (mLFHT). Suppose that we have samples $X, Y, Z$ of sizes $n, n, m$ from the probability measures $P_X, P_Y, P_Z$. Write $\hat P_X$ for the empirical measure of the sample $X$, and analogously for $Y, Z$. The core of our test statistic for (mLFHT) is the following:
$$T(X, Y, Z) \triangleq \langle \theta_{\hat P_Z},\ \theta_{\hat P_Y} - \theta_{\hat P_X}\rangle_{u,\mathcal{H}_K} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\big\{K(Z_j, Y_i) - K(Z_j, X_i)\big\}. \tag{3}$$
Note that $T$ is of the additive form $\frac{1}{m}\sum_{j=1}^m f(Z_j)$, where $f(z) \triangleq \theta_{\hat P_Y}(z) - \theta_{\hat P_X}(z)$ can be interpreted as the witness function of [21; 31]. Given some $\pi \in [0, 1]$ (taken to be half the predicted signal rate, $\delta/2$, in our proofs), the output of our test is
$$\Psi_\pi = \mathbb{1}\big\{T(X, Y, Z) \geq \gamma(X, Y, \pi)\big\}, \quad\text{where}\quad \gamma(X, Y, \pi) = \pi\,\mathrm{MMD}_u^2(\hat P_X, \hat P_Y) + T(X, Y, X). \tag{4}$$
The threshold $\gamma$ gives $\Psi_\pi$ a natural geometric interpretation: it checks whether the projection of $\theta_{\hat P_Z} - \theta_{\hat P_X}$ onto the vector $\theta_{\hat P_Y} - \theta_{\hat P_X}$ falls further than a $\pi$-fraction along the segment joining $\theta_{\hat P_X}$ to $\theta_{\hat P_Y}$ (up to deviations due to the omitted diagonal terms, see Section 2.1). Setting $\delta = 1$ in (mLFHT) recovers (LFHT), and the corresponding test outputs $\Psi_{\delta/2} = \Psi_{1/2} = 1$ if and only if $\mathrm{MMD}_u(\hat P_Z, \hat P_X) \geq \mathrm{MMD}_u(\hat P_Z, \hat P_Y)$. This very statistic (i.e. $\mathrm{MMD}_u(\hat P_Z, \hat P_X) - \mathrm{MMD}_u(\hat P_Z, \hat P_Y)$) has been considered in the past for relative goodness-of-fit testing [7], where its asymptotic properties are established. In the non-asymptotic setting, if MMD is replaced by the $L^2$-distance we recover the test statistic studied by [18; 27; 33]. However, we are the first to introduce $\Psi_\pi$ with $\pi \neq 1/2$ (equivalently, $\delta \neq 1$) and to study MMD-based tests for (m)LFHT in a non-asymptotic setting.
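To make (3) and (4) concrete, here is a minimal NumPy sketch of the unbiased MMD, the statistic $T$, and the resulting test $\Psi_\pi$. The Gaussian kernel and all function names are our choices for illustration; this is not the paper's released code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix [K(a_i, b_j)]_{ij} for the Gaussian kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def mmd2_u(X, Y, kernel):
    # Unbiased MMD^2: within-sample sums exclude the diagonal terms.
    n, m = len(X), len(Y)
    Kxx, Kyy = kernel(X, X), kernel(Y, Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
            - 2.0 * kernel(X, Y).mean())

def T_stat(X, Y, Z, kernel):
    # Statistic (3): projection of Z's embedding onto theta_Y - theta_X.
    return kernel(Z, Y).mean() - kernel(Z, X).mean()

def psi(X, Y, Z, kernel, pi=0.5):
    # Test (4): gamma uses the "u" inner product, so T(X, Y, X) drops the
    # diagonal of the X-X Gram block.
    n = len(X)
    Kxx = kernel(X, X)
    np.fill_diagonal(Kxx, 0.0)
    t_xyx = kernel(X, Y).mean() - Kxx.sum() / (n * (n - 1))
    gamma = pi * mmd2_u(X, Y, kernel) + t_xyx
    return int(T_stat(X, Y, Z, kernel) >= gamma)
```

With $\pi = 1/2$, expanding the threshold shows that `psi` rejects exactly when $\mathrm{MMD}_u^2(\hat P_Z, \hat P_X) \geq \mathrm{MMD}_u^2(\hat P_Z, \hat P_Y)$, matching the (LFHT) test described above; the $Z$-$Z$ terms cancel, which is the variance cancellation discussed next.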
**Variance cancellation** At first sight it may seem more natural to threshold the distance $\mathrm{MMD}_u(\hat P_Z, \hat P_X)$, resulting in rejection if, say, $\mathrm{MMD}_u(\hat P_Z, \hat P_X) \geq \frac{\delta}{2}\,\mathrm{MMD}_u(\hat P_X, \hat P_Y)$. The geometric meaning of this would be similar to the one outlined above. However, there is a crucial difference: (LFHT) (the case $\delta = 1$) is possible with very little experimental data $m$ due to the cancellation of variance. More precisely, the statistic $\mathrm{MMD}^2(\hat P_Z, \hat P_X)$ contains the term $\frac{1}{m(m-1)}\sum_{i \neq j} K(Z_i, Z_j)$, whose variance is prohibitively large and would inflate the $m$ required for reliable testing; but this term can be canceled by subtracting $\mathrm{MMD}^2(\hat P_Z, \hat P_Y)$. Our statistic $T(X, Y, Z) - \gamma(X, Y, \pi)$ simply generalizes this idea to (mLFHT).

## 3 Minimax Rates of Testing

### 3.1 Upper Bounds on the Minimax Sample Complexity of (mLFHT)

Let us start by reintroducing (mLFHT) in a rigorous fashion. Given $C, \epsilon, R \geq 0$, let $\mathcal{P}_\mu(C, \epsilon, R)$ denote the set of triples $(P_X, P_Y, P_Z)$ of distributions such that the following three conditions hold: (i) $P_X, P_Y$ and $P_Z$ have $\mu$-densities bounded by $C$; (ii) $\mathrm{MMD}(P_X, P_Y) \geq \epsilon$; and (iii) $\mathrm{MMD}(P_Z, (1 - \nu)P_X + \nu P_Y) \leq R\cdot\mathrm{MMD}(P_X, P_Y)$, where we define $\nu = \nu(P_X, P_Y, P_Z) = \arg\min_{\nu' \in \mathbb{R}} \mathrm{MMD}(P_Z, (1 - \nu')P_X + \nu' P_Y)$. For some $\delta > 0$, consider the two hypotheses
$$H_0(C, \epsilon, \delta, R) : (P_X, P_Y, P_Z) \in \mathcal{P}_\mu(C, \epsilon, R) \text{ and } \nu = 0,$$
$$H_1(C, \epsilon, \delta, R) : (P_X, P_Y, P_Z) \in \mathcal{P}_\mu(C, \epsilon, R) \text{ and } \nu \geq \delta, \tag{5}$$
which we regard as subsets of probability measures. Notice that $R$ controls the level of misspecification in the direction orthogonal to the line connecting the kernel embeddings of $P_X$ and $P_Y$. Setting $R = 0$ simply asserts that $P_Z$ is guaranteed to be a mixture of $P_X$ and $P_Y$, as is the case in prior works on LFHT. Before presenting our main result on the minimax sample complexity of mLFHT, let us define one final piece of terminology. We say that a test $\Psi$, which takes some data as input and takes values in $\{0, 1\}$, has total error probability less than $\alpha$ for the problem of testing $H_0$ vs $H_1$ if
$$\sup_{P \in H_0} P(\Psi = 1) + \sup_{Q \in H_1} Q(\Psi = 0) \leq \alpha. \tag{6}$$

**Theorem 3.1.** Suppose we observe three i.i.d. samples $X, Y, Z$ from distributions $P_X, P_Y, P_Z$, composed of $n, n, m$ observations respectively, and let $C \in (0, \infty)$, $R, \epsilon \geq 0$ and $\delta \in (0, 1)$. There exists a universal constant $c > 0$ such that $\Psi_{\delta/2}$ defined in Section 2.2 tests $H_0$ vs $H_1$, as defined in (5), at total error $\alpha$ provided
$$\min\{m, n\} \geq c\, C\, \|\lambda\|_\infty \frac{\log(1/\alpha)}{(\epsilon\delta/(1 + R))^2} \quad\text{and}\quad \min\{n, \sqrt{nm}\} \geq c\, C\, \|\lambda\|_2 \frac{\log(1/\alpha)}{\epsilon^2\delta/(1 + R)^2}.$$

[Figure 2: $n$ versus $m$ trade-off for the toy experiment, verifying Theorem 3.1. Probabilities estimated over $10^4$ runs and smoothed using Gaussian noise.]

Note that Theorem 3.1 places no assumptions on the distributions $P_X, P_Y$ beyond a bounded density with respect to the base measure $\mu$. This is different from usual results in statistics, where prior specification of distribution classes is crucial. On the other hand, instead of standard distances such as $L^p$, we assume separation with respect to MMD, and the latter is potentially harder to interpret than, say, $L^1$, i.e. total variation. We do point out that Theorem 3.1 can be used to derive results in the classical setting; we discuss this further in Section 3.4. In an appropriate regime of the parameters, the sufficient sample complexity in Theorem 3.1 exhibits a trade-off of the form $\min\{n, \sqrt{mn}\} \asymp \|\lambda\|_2 \log(1/\alpha)/(\delta\epsilon^2)$ between the number of simulation samples $n$ and real observations $m$. This trade-off is shown in Figure 2 using data from a toy problem. The trade-off is clearly asymmetric, and the relationship $m \cdot n \approx \text{const}$ also seems to appear. In this toy problem we set $R = 0$, $\delta = 1$, $\epsilon = 0.3$, $k = 100$, and $P_X = P_Z$, $P_Y$ are distributions on $\{1, 2, \ldots, k\}$ with $P_X(i) = (1 + \epsilon\,(2\cdot\mathbb{1}\{i \text{ odd}\} - 1))/k = 2/k - P_Y(i)$ for all $i = 1, 2, \ldots, k$. The kernel we take is $K(x, y) = \sum_{i=1}^k \mathbb{1}\{x = y = i\}$ and $\mu$ is simply the counting measure; the resulting MMD is simply the $\ell^2$-distance on pmfs. Figure 1 illustrates a larger-scale experiment using real data and a trained kernel. Note that there we plot the total number $n$ of simulation samples, including those used for training the kernel itself (see Section 4), which ensures that Figure 1 gives a realistic picture of data requirements. However, due to the dependence between the kernel and the data, Theorem 3.1 no longer applies. Nevertheless, we observe a trade-off similar to Figure 2.
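The toy experiment is easy to reproduce. Below is a minimal NumPy sketch under our reading of the setup above: with the indicator kernel, the test $\Psi_{1/2}$ reduces to inner products of histograms. Function names and the Monte Carlo error estimate are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 100, 0.3
i = np.arange(1, k + 1)
pX = (1 + eps * (2 * (i % 2) - 1)) / k   # odd cells bumped up, even cells down
pY = 2 / k - pX                          # mirrored pmf

def counts(s):
    # Histogram of a sample over {1, ..., k}.
    return np.bincount(s, minlength=k + 1)[1:]

def inner_u(c, n):
    # Unbiased <theta, theta>_u for a size-n sample with counts c.
    return (c * (c - 1)).sum() / (n * (n - 1))

def psi_half(n, m, null=True):
    # One run of Psi_{1/2}; Z comes from P_X under the null, P_Y otherwise.
    X = rng.choice(i, n, p=pX)
    Y = rng.choice(i, n, p=pY)
    Z = rng.choice(i, m, p=pX if null else pY)
    cx, cy, cz = counts(X), counts(Y), counts(Z)
    T = (cz @ cy - cz @ cx) / (n * m)                        # statistic (3)
    mmd2u = inner_u(cx, n) + inner_u(cy, n) - 2 * cx @ cy / n**2
    gamma = 0.5 * mmd2u + (cx @ cy / n**2 - inner_u(cx, n))  # threshold (4)
    return int(T >= gamma)

def total_error(n, m, reps=500):
    type1 = np.mean([psi_half(n, m, null=True) for _ in range(reps)])
    type2 = 1 - np.mean([psi_half(n, m, null=False) for _ in range(reps)])
    return type1 + type2
```

Sweeping `total_error` over a grid of $(n, m)$ reproduces hyperbola-like contours of the kind shown in Figure 2.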
### 3.2 Lower Bounds on the Minimax Sample Complexity of (mLFHT)

In this section we prove a minimax lower bound on the sample complexity of mLFHT, giving a partial converse to Theorem 3.1. Before we can state this result, we must make some technical definitions. Given $J \geq 2$, let $\bar\lambda^2_{2,J} \triangleq \sum_{j=2}^J \lambda_j^2$ and define
$$J^\star_\epsilon \triangleq \max\Big\{J : \sup_{\eta_j = \pm 1}\ \epsilon\,\Big\|\sum_{j=2}^J \sqrt{\lambda_j}\,\eta_j e_j\Big\|_\infty \leq \bar\lambda_{2,J}/2\Big\}.$$

**Theorem 3.2** (Lower bounds for mLFHT). Suppose that $\int_{\mathcal{X}} K(x, y)\,\mu(dx) \equiv \lambda_1$, $\mu(\mathcal{X}) = 1$ and $\sup_{x \in \mathcal{X}} K(x, x) \leq 1$. There exists a universal constant $c > 0$ such that any test of $H_0$ vs $H_1$, as defined in (5), with total error at most $\alpha$ must use a number $(n, m)$ of observations satisfying
$$m \geq c\,\lambda_2\,\frac{\log(1/\alpha)}{\epsilon^2\delta^2}, \qquad n \geq c\,\bar\lambda_{2,J^\star_\epsilon}\,\frac{\sqrt{\log(1/\alpha)}}{\epsilon^2}, \qquad \delta m + \sqrt{mn} \geq c\,\bar\lambda_{2,J^\star_\epsilon}\,\frac{\sqrt{\log(1/\alpha)}}{\epsilon^2\delta}.$$

**Remark 3.3.** Recall that the eigenvalues $\lambda$ depend on the choice of $\mu$, so that by choosing a different base measure $\mu$ one can optimize the lower bound. However, since $P_X, P_Y, P_Z$ are assumed to have bounded density with respect to $\mu$, this appears rather involved.

**Remark 3.4.** The requirements $\sup_{x \in \mathcal{X}} K(x, x) \leq 1$ and $\mu(\mathcal{X}) = 1$ are essentially without loss of generality, as $\mu$ and $K$ can be rescaled. The condition $\int_{\mathcal{X}} K(x, y)\,\mu(dx) \equiv \lambda_1$ implies that the top eigenfunction $e_1$ is equal to a constant, or equivalently, that $K(x, y)\,\mu(dy)$ defines a Markov kernel up to a normalizing constant.

### 3.3 Tightness of Theorems 3.1 and 3.2

**Dependence on $\|\lambda\|_2$** An apparent weakness of Theorem 3.2 is its reliance on the unknown value $J^\star_\epsilon$, which depends on the specifics of the kernel $K$ and base measure $\mu$; determining it is potentially highly nontrivial even for simple kernels. Slightly weakening Theorem 3.2, we obtain the following corollary, which shows that the dependence on $\|\lambda\|_2$ is tight, at least for small $\epsilon$.

**Corollary 3.5.** Suppose $J \geq 2$ is such that $\sum_{j=2}^J \lambda_j^2 \geq c^2\,\bar\lambda_2^2$ for some $c \leq 1$, where $\bar\lambda_2^2 \triangleq \sum_{j \geq 2} \lambda_j^2$. Then $\bar\lambda_{2,J^\star_\epsilon}$ can be replaced by $c\,\bar\lambda_2$ in Theorem 3.2 whenever $\epsilon \leq c\,\bar\lambda_2/(2\sqrt{J - 1})$.

**Dependence on $R$ and $\alpha$** Due to the general nature of our lower bound constructions, it is difficult to capture the dependence on the misspecification parameter $R$. As for the probability of error $\alpha$, based on related work [16] we expect the gap of size $\sqrt{\log(1/\alpha)}$ to be a shortcoming of Theorem 3.1 and not of the lower bound. Closing this gap may require a different approach, however, as tests based on empirical $L^2$ distances are known to have sub-optimal concentration [27].

**Dependence on $\delta$** The correct dependence on the signal rate $\delta$ is the most important question left open by our theoretical results.
Any method requiring $n$ larger than a function of $\delta$ irrespective of $m$ (as in Theorem 3.1) is provably sub-optimal, because taking $m \asymp 1/(\delta\epsilon)^2$ and $n$ large enough to estimate both $P_X, P_Y$ to within accuracy $\epsilon/10$ always suffices to reach a fixed level of total error.

### 3.4 Relation to Prior Results

In this section we discuss some connections of Theorem 3.1 to prior work. Specifically, we discuss how Theorem 3.1 recovers some known minimax optimal results in the literature [4; 18; 37]. Details omitted in this section are included in Appendix B.

**Binary hypothesis testing** Suppose the two distributions $P_X, P_Y$ are known, we are given $m$ i.i.d. observations $Z_1, \ldots, Z_m \sim P_Z$, and our task is to decide between the hypotheses $H_0 : P_X = P_Z$ versus $H_1 : P_Y = P_Z$. Then we may take $n = \infty$, $R = 0$, $\delta = 1$ in Theorem 3.1 to conclude that $m \geq c\, C\, \|\lambda\|_\infty \log(1/\alpha)/\epsilon^2$ observations suffice to perform the test at total error $\alpha$.

**Two-sample testing** Suppose we have two i.i.d. samples $X$ and $Y$, both of size $n$, from unknown distributions $P_X, P_Y$ respectively, and our task is to decide between $H_0 : P_X = P_Y$ against $H_1 : \mathrm{MMD}(P_X, P_Y) \geq \epsilon$. We split our $Y$ sample in half, resulting in $Y^{(1)}$ and $Y^{(2)}$, and form the statistic $\Psi_{\mathsf{TS}} \triangleq \Psi_{1/2}(X, Y^{(1)}, Y^{(2)}) - \Psi_{1/2}(Y^{(1)}, X, Y^{(2)})$, where $\Psi_{1/2}$ is defined in Section 2.2. Then $|\mathbb{E}\,\Psi_{\mathsf{TS}}|$ equals $0$ under the null hypothesis and is at least $1 - 2\alpha_1$ under the alternative, where $\alpha_1$ is the target total error probability of $\Psi_{1/2}$. Taking $\alpha_1 = 5\%$, by repeated sample splitting and majority voting we may amplify the success probability to $\alpha$ provided
$$n \geq c\, C\, \|\lambda\|_2 \frac{\log(1/\alpha)}{\epsilon^2}, \tag{7}$$
where $c > 0$ is universal (see the Appendix for details, and the code sketch at the end of this section for the reduction). The upper bound (7) partly recovers [37, Theorems 3 and 5], where the authors show that thresholding the MMD with the Gaussian kernel $G_\sigma(x, y) = \sigma^{-d}\exp(-\|x - y\|^2/\sigma^2)$ achieves the minimax optimal sample complexity $n \asymp \epsilon^{-(2\beta + d/2)/\beta}$ for two-sample testing over the class $\mathcal{P}_{\beta,d}$ of $d$-dimensional $(\beta, 2)$-Sobolev-smooth distributions (defined in Appendix B.3) under $\epsilon$-$L^2$-separation. For this, taking $\sigma \asymp \epsilon^{1/\beta}$ ensures that $\|P - Q\|_{L^2} \lesssim \mathrm{MMD}(P, Q)$ over $P, Q \in \mathcal{P}_{\beta,d}$. Taking e.g. $d\mu(x) = \exp(-\|x\|_2^2)\,dx$ as the base measure, (7) recovers the claimed sample complexity since $\|\lambda\|_2^2 = \int G_\sigma^2(x, y)\,d\mu(x)d\mu(y) = O(\sigma^{-d})$, hiding dimension-dependent constants. Our result requires a bounded density with respect to a Gaussian.

**Likelihood-free hypothesis testing** By taking $\alpha \vee R \vee \delta = \Theta(1)$ in Theorem 3.1 we can recover many of the results in [18; 32; 33]. When $\mathcal{X}$ is finite, we can take the kernel $K(x, y) = \sum_{z \in \mathcal{X}} \mathbb{1}\{x = y = z\}$ in Theorem 3.1 to obtain the results for bounded discrete distributions (defined in Appendix B.1), which state that under $\epsilon$-TV-separation the minimax optimal sample complexity is given by $m \asymp 1/\epsilon^2$ and $\min\{n, \sqrt{nm}\} \asymp \sqrt{|\mathcal{X}|}/\epsilon^2$. A similar kernel recovers the optimal result for the class of $\beta$-Hölder smooth densities on the hypercube $[0, 1]^d$ (see Appendix B.2).

**Curse of dimensionality** Using the Gaussian kernel $G_\sigma$ as for two-sample testing above, one can conclude from Theorem 3.1 that the required number of samples for (mLFHT) over the class $\mathcal{P}_{\beta,d}$ under $\epsilon$-$L^2$-separation grows at most like $(c/\epsilon)^{O(d)}/\delta^2$ for some $c > 1$, instead of the expected $(c/(\epsilon\delta))^{\Omega(d)}$. This may be interpreted as theoretical support for the success of LFI in practice, where signal and background can be rather different (cf. [5, Figures 2-3]) and the difficulty of the problem stems from the rate of signal events being small (i.e. $\epsilon \asymp 1$ but $\delta \ll 1$).
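The two-sample reduction above is mechanical enough to write down. The sketch below treats the underlying (LFHT) test as a black box (e.g. `psi` from the sketch in Section 2.2); the repeated random splitting is our simplification of the disjoint-batch majority vote of Appendix C, and all names are ours.

```python
import numpy as np

def two_sample_via_lfht(X, Y, lfht, n_votes=9, seed=0):
    """Reduce two-sample testing to (LFHT): split Y in half, use one half as
    the 'experimental' sample Z, and antisymmetrize over the two role
    assignments. `lfht(X, Y, Z)` is any {0,1}-valued test of
    H0: Z ~ P_X  vs  H1: Z ~ P_Y, e.g. Psi_{1/2} from Section 2.2."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_votes):
        p = rng.permutation(len(Y))
        Y1, Y2 = Y[p[: len(Y) // 2]], Y[p[len(Y) // 2 :]]
        votes.append(lfht(X, Y1, Y2) - lfht(Y1, X, Y2))
    # E[vote] = 0 under P_X = P_Y and >= 1 - 2*alpha_1 under the alternative,
    # so a majority threshold at 1/2 separates the hypotheses.
    return int(np.mean(votes) >= 0.5)
```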
## 4 Learning Kernels from Data

Given a fixed kernel $K$, our Theorems 3.1 and 3.2 show that the sample complexity depends heavily on the separation $\epsilon$ under the given MMD, as well as on the spectrum $\lambda = \lambda(\mu, K)$ of the kernel. Thus, to have good test performance we need to use a kernel $K$ that is well adapted to the problem at hand. In practice, however, instead of using a fixed kernel it is only natural to use part of the simulation sample to learn a good kernel. In Sections 4 and 5 we report experimental results obtained after training a kernel parameterized by a neural network on part of the simulation data. In particular, due to the dependence between the data and the kernel, Theorem 3.1 does not directly apply. Our main contribution here is showing the existence of an asymmetric simulation-experimentation trade-off (cf. Figure 1 and also Section 1.1) even in this realistic setting. Figure 1 plots the total number $n$ of simulations used, including those used for training, so as to provide a realistic view of the amount of data used. The experiments also illustrate that the (trained-)kernel-based statistic of Section 2.2 achieves state-of-the-art performance.

### 4.1 Proposed Training Algorithm

Consider splitting the data into three parts: $(X^{\mathrm{tr}}, Y^{\mathrm{tr}})$ is used for training (optimizing) the kernel; $(X^{\mathrm{ev}}, Y^{\mathrm{ev}})$ is used to evaluate our test statistic at test time; and $(X^{\mathrm{cal}}, Y^{\mathrm{cal}})$ is used for calibrating the distribution of the test statistic under the null hypothesis. We write $n_s = |X^s| = |Y^s|$ for $s \in \{\mathrm{tr}, \mathrm{ev}, \mathrm{cal}\}$. Given the training data $X^{\mathrm{tr}}, Y^{\mathrm{tr}}$ with empirical measures $\hat P_{X^{\mathrm{tr}}}, \hat P_{Y^{\mathrm{tr}}}$, we maximize the objective
$$\hat J(X^{\mathrm{tr}}, Y^{\mathrm{tr}}; K) = \frac{\mathrm{MMD}_u^2(\hat P_{X^{\mathrm{tr}}}, \hat P_{Y^{\mathrm{tr}}}; K)}{\hat\sigma(X^{\mathrm{tr}}, Y^{\mathrm{tr}}; K)}, \tag{8}$$
which was introduced in [48]. Here $\hat\sigma^2$ is an estimator of the variance of $\mathrm{MMD}_u^2(\hat P_{X^{\mathrm{tr}}}, \hat P_{Y^{\mathrm{tr}}}; K)$ and is defined in Appendix F.1.

**Algorithm 1** (mLFHT with a learned deep kernel)
- **Input:** $(X^{\mathrm{tr}}, X^{\mathrm{ev}}, X^{\mathrm{cal}})$, $(Y^{\mathrm{tr}}, Y^{\mathrm{ev}}, Y^{\mathrm{cal}})$; parameterized kernel $K_\omega$; hyperparameters and initialization.
- *Phase 1: kernel training (optimization) on $X^{\mathrm{tr}}$ and $Y^{\mathrm{tr}}$:* set $\omega \leftarrow \arg\max_\omega \hat J(X^{\mathrm{tr}}, Y^{\mathrm{tr}}; K_\omega)$ via the optimizer, maximizing the objective (8).
- *Phase 2: distributional calibration of the test statistic (under the null hypothesis):* for $r = 1, 2, \ldots, k$, sample $Z^{\mathrm{cal},r}$ as $m$ points without replacement from $X^{\mathrm{cal}}$ and set $T_r \leftarrow \frac{1}{n_{\mathrm{ev}} m}\sum_{i,j}\big[K_\omega(Z^{\mathrm{cal},r}_i, Y^{\mathrm{ev}}_j) - K_\omega(Z^{\mathrm{cal},r}_i, X^{\mathrm{ev}}_j)\big]$.
- *Phase 3: inference with input $Z$:* set $\hat T \leftarrow \frac{1}{n_{\mathrm{ev}} m}\sum_{i,j}\big[K_\omega(Z_i, Y^{\mathrm{ev}}_j) - K_\omega(Z_i, X^{\mathrm{ev}}_j)\big]$.
- **Output:** estimated p-value $\frac{1}{k}\sum_{r=1}^k \mathbb{1}\{\hat T < T_r\}$.

Intuitively, the objective $\hat J$ aims to separate $P_X$ from $P_Y$ while keeping the variance low. For a heuristic justification of its use for (mLFHT) see the Appendix. Algorithm 1 describes the training and testing procedure, which produces unbiased p-values for (mLFHT) when there is no misspecification ($R = 0$ in Theorem 3.1). During training, we use the Adam optimizer [34] with stochastic batches.

**Proposition 4.1.** When there is no misspecification ($R = 0$ in Theorem 3.1), Algorithm 1 outputs an unbiased estimate of the p-value that is consistent as $\min\{n_{\mathrm{cal}}, k\} \to \infty$.

**Time complexity** Algorithm 1 runs in three separate stages: training, calibration, and inference. The first two take $O(\#\mathrm{epochs}\cdot B^2 + k\,n_{\mathrm{ev}}\,m)$ total time, where $B$ is the batch size, whereas Phase 3 takes only $O(n_{\mathrm{ev}}\,m)$ time, which is generally much faster, especially if $n_{\mathrm{ev}} \ll n_{\mathrm{tr}}$.
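The sketch below condenses Algorithm 1 into PyTorch with the MMD-G architecture of Section 4.2. The two-layer network, the full-batch optimization (the paper uses Adam with stochastic batches), and `var_proxy`, a crude first-order stand-in for the variance estimator $\hat\sigma$ of the paper's Appendix F.1, are all our simplifications.

```python
import torch

def gram(A, B, sigma):
    # Gaussian Gram matrix between feature batches A and B.
    return torch.exp(-torch.cdist(A, B) ** 2 / sigma ** 2)

def mmd2_u(Kxx, Kyy, Kxy):
    n, m = Kxx.shape[0], Kyy.shape[0]
    return ((Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
            + (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1))
            - 2 * Kxy.mean())

def train_kernel(X_tr, Y_tr, dim_out=32, epochs=200, lr=1e-3):
    # Phase 1: maximize J = MMD_u^2 / sigma_hat over a feature net (MMD-G).
    phi = torch.nn.Sequential(
        torch.nn.Linear(X_tr.shape[1], 64), torch.nn.ReLU(),
        torch.nn.Linear(64, dim_out))
    log_sigma = torch.zeros((), requires_grad=True)
    opt = torch.optim.Adam(list(phi.parameters()) + [log_sigma], lr=lr)
    for _ in range(epochs):
        fx, fy = phi(X_tr), phi(Y_tr)
        s = log_sigma.exp()
        Kxx, Kyy, Kxy = gram(fx, fx, s), gram(fy, fy, s), gram(fx, fy, s)
        h = Kxy.mean(1) - Kxx.mean(1)          # per-sample witness values
        var_proxy = h.var() / len(h) + 1e-8    # crude variance surrogate
        loss = -mmd2_u(Kxx, Kyy, Kxy) / var_proxy.sqrt()
        opt.zero_grad(); loss.backward(); opt.step()
    return lambda A, B: gram(phi(A), phi(B), log_sigma.exp())

@torch.no_grad()
def p_value(kernel, X_ev, Y_ev, X_cal, Z, k=500):
    # Phase 2: calibrate T under H0 by subsampling m points from X_cal.
    # Phase 3: report the fraction of calibration statistics exceeding T(Z).
    m = len(Z)
    def T(Zs):
        return (kernel(Zs, Y_ev).mean() - kernel(Zs, X_ev).mean()).item()
    T_null = [T(X_cal[torch.randperm(len(X_cal))[:m]]) for _ in range(k)]
    T_obs = T(Z)
    return sum(t >= T_obs for t in T_null) / k
```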
**Sample usage** Empirically, data splitting in Algorithm 1 can have non-trivial effects on performance. Instead of training the kernel on only a fraction of the data ($\{X^{\mathrm{tr}}, Y^{\mathrm{tr}}\} \cap \{X^{\mathrm{ev}}, Y^{\mathrm{ev}}\} = \emptyset$), we discovered that taking $\{X^{\mathrm{ev}}, Y^{\mathrm{ev}}\} \subseteq \{X^{\mathrm{tr}}, Y^{\mathrm{tr}}\}$ results in more efficient use of the data. The stronger condition $\{X^{\mathrm{ev}}, Y^{\mathrm{ev}}\} = \{X^{\mathrm{tr}}, Y^{\mathrm{tr}}\}$ could also be applied; we take a strict subset to reduce time complexity. We do, however, crucially require $X^{\mathrm{cal}}, Y^{\mathrm{cal}}$ in Phase 2 to be independently sampled ("held out") for consistent p-value estimation. Finally, we remark that splitting this way is only valid in the context of Algorithm 1: for the test (4) using the data-dependent threshold $\gamma$, one needs $\{X^{\mathrm{tr}}, Y^{\mathrm{tr}}\} \cap \{X^{\mathrm{ev}}, Y^{\mathrm{ev}}\} = \emptyset$ to estimate $\gamma$.

### 4.2 Classifier-Based Tests and Other Benchmarks

Let $\phi : \mathcal{X} \to [0, 1]$ be a classifier, trained to assign small values to $P_X$ and high values to $P_Y$ by minimizing the cross-entropy loss of a classifier net. There are two natural test statistics based on $\phi$ (see the sketch after the list below).

**Scheffé's test.** The first idea, attributed to Scheffé in folklore [15, Section 6], is to take the statistic $T(Z) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\{\phi(Z_i) > t\}$, where $t$ is some (learnable) threshold.

**Approximate Neyman–Pearson / logit methods.** If $\phi$ were trained to perfection, then $\phi(z) = P(P_Y \mid z)$ would be the posterior class probability and $\phi(z)/(1 - \phi(z))$ would equal precisely the likelihood ratio between $P_Y$ and $P_X$ at $z$. This motivates the use of $T(Z) = \frac{1}{m}\sum_{i=1}^m \log\big(\phi(Z_i)/(1 - \phi(Z_i))\big)$. See also [10].

Let us list the testing procedures that we benchmark against each other in our experiments.

1. **MMD-M:** the MMD statistic (3) using $K$ with the mixing architecture $K(x, y) = \big[(1 - \tau)G_\sigma(\varphi_\omega(x), \varphi_\omega(y)) + \tau\big]\, G_{\sigma_0}\big(x + \varphi'_{\omega'}(x),\ y + \varphi'_{\omega'}(y)\big)$. Here $G_\sigma$ is the Gaussian kernel with variance $\sigma^2$; $\varphi_\omega, \varphi'_{\omega'}$ are neural networks (with parameters $\omega, \omega'$), and $\sigma, \sigma_0, \tau, \omega, \omega'$ are trained.
2. **MMD-G:** the MMD statistic (3) using the Gaussian kernel architecture $K(x, y) = G_\sigma(\varphi_\omega(x), \varphi_\omega(y))$, where $\varphi_\omega$ is the feature mapping parameterized by a trained network and $\sigma$ is a trainable parameter.
3. **MMD-O:** the MMD statistic (3) using the Gaussian kernel $K(x, y) = G_\sigma(x, y)$ with optimized bandwidth $\sigma$. First proposed in [7; 39].
4. **UME:** an interpretable model-comparison algorithm proposed by [31], which evaluates the kernel mean embedding on a chosen "witness set".
5. **SCHE, LBI:** Scheffé's test and logit-based inference [10], based on a binary classifier network $\phi$ trained via cross-entropy, as introduced above.
6. **RFM:** Recursive Feature Machines, a recently proposed kernel learning algorithm [44].

[Figure 3: Empirical performance on (9) for the CIFAR detection problem when $n_{\mathrm{tr}} = 1920$. Plots from left to right: (a) rejection rate under the alternative when rejecting at estimated p-value below 5%; (b) expected p-value [45] under the alternative; (c) the average of type-I and type-II error probabilities when thresholding at 0 (different from (4), see Appendix); and (d) ROC curves for $m \in \{64, 128, 192, 256, 384\}$ using MMD-M and Algorithm 1. Shaded area shows the standard deviation across 10 independent runs. Missing benchmarks (thresholded MMD, MMD-O, LBI, RFM) are weaker; see Appendix for the full plot.]
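For comparison, the two classifier-based statistics of Section 4.2 are one-liners given the classifier outputs $\phi(Z_i)$. A hedged NumPy sketch (names ours):

```python
import numpy as np

def scheffe_stat(phi_Z, t):
    # Scheffe's test: fraction of test points the classifier sends above t.
    return np.mean(phi_Z > t)

def logit_stat(phi_Z, eps=1e-6):
    # Approximate Neyman-Pearson: average estimated log-likelihood ratio
    # log(phi/(1-phi)), exact when phi equals the class posterior.
    p = np.clip(phi_Z, eps, 1 - eps)
    return np.mean(np.log(p / (1 - p)))
```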
### 4.3 Additive Statistics and the Thresholding Trick

Given a function $f$ (usually obtained by training) and test data $Z = (Z_1, \ldots, Z_m)$, we call a test additive if its output is obtained by thresholding $T_f(Z) \triangleq \frac{1}{m}\sum_{i=1}^m f(Z_i)$. We point out that all of MMD-M/G/O, SCHE, LBI, UME, RFM are of this form; see the Appendix for further details. Similarly to [39], we observe that any such statistic can be realized by our kernel-based approach.

**Proposition 4.2.** The kernel-based statistic defined in (3) with the kernel $K(x, y) = f(x)f(y)$ is equal to $T_f$ up to multiplicative and additive constants independent of $Z$.

Motivated by Scheffé's test, instead of directly thresholding the additive statistic $T_f(Z)$, we found empirically that replacing $f$ by $f_t(x) \triangleq \mathbb{1}\{f(x) > t\}$ can yield improved power. We set $t$ by maximizing an estimate of the significance under the null using a normal approximation, i.e. by solving
$$t_{\mathrm{opt}} \triangleq \arg\max_t\ \frac{T_{f_t}(Y^{\mathrm{opt}}) - T_{f_t}(X^{\mathrm{opt}})}{\sqrt{T_{f_t}(X^{\mathrm{opt}})\big(1 - T_{f_t}(X^{\mathrm{opt}})\big)}},$$
where $X^{\mathrm{opt}}, Y^{\mathrm{opt}}$ satisfy $\{X^{\mathrm{opt}}, Y^{\mathrm{opt}}\} \cap (\{X^{\mathrm{cal}}, Y^{\mathrm{cal}}\} \cup \{X^{\mathrm{ev}}, Y^{\mathrm{ev}}\}) = \emptyset$. This trick improves the performance of our tester on the Higgs dataset in Section 5.2 but not on the image detection problem in Section 5.1; a code sketch is given below.
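A sketch of the thresholding trick, together with the binomial-tail significance estimate it is paired with in Section 5.2 below. The quantile grid for $t$ and the function names are our choices; SciPy is used for the tails.

```python
import numpy as np
from scipy import stats

def pick_threshold(f_X, f_Y):
    # Thresholding trick (Section 4.3): choose t maximizing the normal-
    # approximation significance
    #   [T_{f_t}(Y) - T_{f_t}(X)] / sqrt(T_{f_t}(X) (1 - T_{f_t}(X))),
    # where f_X, f_Y are f evaluated on held-out samples X^opt, Y^opt.
    ts = np.quantile(np.r_[f_X, f_Y], np.linspace(0.02, 0.98, 97))
    best, best_sig = ts[0], -np.inf
    for t in ts:
        px, py = np.mean(f_X > t), np.mean(f_Y > t)
        if 0.0 < px < 1.0:
            sig = (py - px) / np.sqrt(px * (1 - px))
            if sig > best_sig:
                best, best_sig = t, sig
    return best

def binomial_significance(ft_Xcal, ft_Z):
    # Significance of discovery for a thresholded (0/1-valued) statistic via
    # the Binomial tail: compare the exceedance count on Z with Bin(m, p0).
    m, p0 = len(ft_Z), np.mean(ft_Xcal)
    pval = stats.binom.sf(int(np.sum(ft_Z)) - 1, m, p0)  # P(Bin(m,p0) >= count)
    return stats.norm.isf(pval)                          # significance in sigmas
```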
## 5 Experiments

Our code can be found at https://github.com/Sr-11/LFI.

### 5.1 Image Source Detection

Our first empirical study looks at the task of detecting whether images come from the CIFAR-10 [36] dataset or a SOTA generative model (DDPM) [25; 51]. While source detection is interesting in its own right, it turns out that detecting whether a group of images comes from the generative model versus the real dataset can be too easy (see the experiments in [31]). Therefore, we consider a mixed alternative, where the alternative hypothesis is not simply the generative model but CIFAR with planted DDPM images. Namely, our $n$ labeled images come from the following distributions:
$$P_X = \text{CIFAR} \quad\text{and}\quad P_Y = \tfrac{1}{3}\,\text{DDPM} + \tfrac{2}{3}\,\text{CIFAR}. \tag{9}$$
The goal is to test whether the $m$ unlabeled observations $Z$ have been corrupted with a $\rho$ or larger fraction of DDPM images (versus uncorrupted CIFAR); this corresponds to (LFHT) (or, equivalently, (mLFHT) with $\delta = 1$). Figure 3 shows the performance of our approach under this mixed alternative.

**Network architecture** We use a standard deep CNN for all tests; the difference is only in the final layer: for the kernel-based tests it outputs features, while for classifiers we add an extra linear layer producing logits. We see from Figure 3 that our kernel-based test outperforms the other benchmarks at a fixed training set size $n_{\mathrm{tr}}$. One potential cause is that MMD, being an IPM, contains an optimization subroutine (which it solves in closed form); this additional layer of optimization may lead to better performance at small sample sizes. The thresholding trick does not seem to improve power empirically here. We omit several benchmarks from this figure for clarity of presentation, as they do not exhibit good separating power; see the Appendix for the complete results. The bottom plot of Figure 1 shows $m$ and $n$ on log-scale against the total probability of error, exhibiting the simulation-experimentation trade-off.

### 5.2 Higgs Boson Discovery

[Figure 4: Expected significance of discovery versus training set size $2n$ (in millions) on a mixture of 1000 background and 100 signal events in the Higgs experiment, for MMD-M with $t_{\mathrm{opt}}$ and SCHE with $t_{\mathrm{opt}}$. Shaded area shows the standard deviation over 10 independent runs. See Appendix for the full plot including missing benchmarks.]

The 2012 announcement of the Higgs boson's discovery by the ATLAS and CMS experiments [1; 9] marked a significant milestone in physics. The statistical problem inherent in the experiment is well modeled by (mLFHT), using a signal rate $\delta$ predicted by theory and misspecification parameter $R = 0$ (as was assumed in the original discovery). We compare our algorithm's power against past studies in the physics literature [5], as measured by the significance of discovery. We note an important distinction from Algorithm 1 in this application.

**Estimating the significance** In physics, the threshold for claiming a discovery is usually a significance of $5\sigma$, corresponding to a p-value of $2.87 \times 10^{-7}$. Approximately $n_{\mathrm{cal}} \gtrsim (2.87)^{-1} \times 10^{7}$ samples would be necessary for Algorithm 1 to reach such precision. Fortunately, the distribution of the test statistic is customarily approximated by a Gaussian; we adopt this approach for our experiment, hereby assuming that $m$ is large enough for the CLT to apply. We use the expected significance of discovery as our metric [5], which, for the additive statistic $T_f = \frac{1}{m}\sum_{i=1}^m f(Z_i)$, is given by
$$\frac{\delta\big(T_f(Y^{\mathrm{cal}}) - T_f(X^{\mathrm{cal}})\big)}{\sqrt{\operatorname{var}(f(X^{\mathrm{cal}}))/m}}.$$
If the thresholding trick (Section 4.3) is applied, we use the more precise binomial tail, in which case the significance is estimated by $\Phi^{-1}\big(P\big(\mathrm{Bin}(m, T_{f_{t_{\mathrm{opt}}}}(X^{\mathrm{cal}})) \leq m\, T_{f_{t_{\mathrm{opt}}}}(Z)\big)\big)$, where $\Phi$ is the standard normal CDF.

**Network architecture** The architecture is a 6-layer feedforward net, identical for all tests (kernel-based and classifiers) except for the last layer. We leave further details to the Appendix. As can be seen in Figure 4, Scheffé's test and MMD-M with threshold $t_{\mathrm{opt}}$ are the best methods, achieving performance similar to the algorithm of [5] and reaching the significance level of $5\sigma$ on 2.6 million simulated datapoints and a test sample made up of a mixture of 1000 background and 100 signal events. The top plot of Figure 1 shows $m$ and $n$ on log-scale against the total probability of error for the test (4), exhibiting the asymmetric simulation-experimentation trade-off.

## 6 Conclusion

In this paper we introduced (mLFHT) as a theoretical model of real-world likelihood-free signal detection problems arising in science. We proposed a kernel-based test statistic and analyzed its minimax sample complexity, obtaining both upper (Theorem 3.1) and lower (Theorem 3.2) bounds in terms of multiple problem parameters, and we discussed their tightness (Section 3.3) and connections to prior work (Section 3.4). On the empirical side, we described a method for training a parameterized kernel and proposed a consistent p-value estimate (Algorithm 1 and Proposition 4.1). We examined the performance of our method in two experiments and found that parameterized kernels achieve state-of-the-art performance compared to relevant benchmarks from the literature. Moreover, we confirmed experimentally the existence of the asymmetric simulation-experimentation trade-off (Figure 1) suggested by the minimax analysis. We defer further special cases of Theorem 3.1, all relevant proofs, and experimental details to the Appendix.

**Acknowledgments and Disclosure of Funding** YP was supported in part by the National Science Foundation under Grant No CCF-2131115. Research was sponsored by the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
## References

[1] Aad, G., Abajyan, T., Abbott, B., Abdallah, J., Khalek, S. A., Abdelalim, A. A., Aben, R., Abi, B., Abolins, M., AbouZeid, O., et al. Observation of a new particle in the search for the standard model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.

[2] Acharya, J., Das, H., Jafarpour, A., Orlitsky, A., Pan, S., and Suresh, A. Competitive classification and closeness testing. In Conference on Learning Theory, pages 22.1–22.28. JMLR Workshop and Conference Proceedings, 2012.

[3] Adam-Bourdarios, C., Cowan, G., Germain, C., Guyon, I., Kégl, B., and Rousseau, D. The Higgs boson machine learning challenge. In NIPS 2014 Workshop on High-Energy Physics and Machine Learning, pages 19–55. PMLR, 2015.

[4] Arias-Castro, E., Pelletier, B., and Saligrama, V. Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics, 30(2):448–471, 2018.

[5] Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1):1–9, 2014.

[6] Beaumont, M. A. Approximate Bayesian computation. Annual Review of Statistics and Its Application, 6:379–403, 2019.

[7] Bounliphone, W., Belilovsky, E., Blaschko, M. B., Antonoglou, I., and Gretton, A. A test of relative similarity for model selection in generative models. In Bengio, Y. and LeCun, Y., editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[8] Brehmer, J., Louppe, G., Pavez, J., and Cranmer, K. Mining gold from implicit models to improve likelihood-free inference. Proceedings of the National Academy of Sciences, 117(10):5242–5249, 2020.

[9] Chatrchyan, S., Khachatryan, V., Sirunyan, A. M., Tumasyan, A., Adam, W., Aguilo, E., Bergauer, T., Dragicevic, M., Erö, J., Fabjan, C., et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716(1):30–61, 2012.

[10] Cheng, X. and Cloninger, A. Classification logit two-sample testing by neural networks. CoRR, abs/1909.11298, 2019.

[11] Cowan, G., Cranmer, K., Gross, E., and Vitells, O. Asymptotic formulae for likelihood-based tests of new physics. The European Physical Journal C, 71(2):1–19, 2011.

[12] Cranmer, K., Brehmer, J., and Louppe, G. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.

[13] Csilléry, K., Blum, M. G., Gaggiotti, O. E., and François, O. Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.

[14] Dalmasso, N., Izbicki, R., and Lee, A. Confidence sets and hypothesis testing in a likelihood-free inference setting. In International Conference on Machine Learning, pages 2323–2334. PMLR, 2020.

[15] Devroye, L. and Lugosi, G. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer New York, 2012. ISBN 9781461301257.

[16] Diakonikolas, I., Gouleakis, T., Kane, D. M., Peebles, J., and Price, E. Optimal testing of discrete distributions with high probability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 542–555, 2021.

[17] Diggle, P. J. and Gratton, R. J. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society: Series B (Methodological), 46(2):193–212, 1984.

[18] Gerber, P. R. and Polyanskiy, Y. Likelihood-free hypothesis testing.
CoRR, abs/2211.01126, 2022. doi: 10.48550/arXiv.2211.01126.

[19] Gerber, P. R., Han, Y., and Polyanskiy, Y. Minimax optimal testing by classification, 2023.

[20] Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 22, 2009.

[21] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

[22] Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems, 25, 2012.

[23] Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Transactions on Information Theory, 35(2):401–408, 1989.

[24] Gutmann, M. U., Dutta, R., Kaski, S., and Corander, J. Likelihood-free inference via classification. Statistics and Computing, 28:411–425, 2018.

[25] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[26] Huang, D. Hypothesis Testing and Learning with Small Samples. PhD thesis, Citeseer, 2012.

[27] Huang, D. and Meyn, S. Classification with high-dimensional sparse samples. In 2012 IEEE International Symposium on Information Theory Proceedings, pages 2586–2590. IEEE, 2012.

[28] Ingster, Y. I. Minimax testing of nonparametric hypotheses on a distribution density in the Lp metrics. Theory of Probability & Its Applications, 31(2):333–337, 1987. doi: 10.1137/1131042.

[29] Izbicki, R., Lee, A., and Schafer, C. High-dimensional density ratio estimation with extensions to approximate likelihood computation. In Artificial Intelligence and Statistics, pages 420–429. PMLR, 2014.

[30] Jiang, B., Wu, T.-y., Zheng, C., and Wong, W. H. Learning summary statistic for approximate Bayesian computation via deep neural network. Statistica Sinica, pages 1595–1618, 2017.

[31] Jitkrittum, W., Kanagawa, H., Sangkloy, P., Hays, J., Schölkopf, B., and Gretton, A. Informative features for model comparison. Advances in Neural Information Processing Systems, 31, 2018.

[32] Kelly, B. G., Tularak, T., Wagner, A. B., and Viswanath, P. Universal hypothesis testing in the learning-limited regime. In 2010 IEEE International Symposium on Information Theory, pages 1478–1482. IEEE, 2010.

[33] Kelly, B. G., Wagner, A. B., Tularak, T., and Viswanath, P. Classification of homogeneous data with large alphabets. IEEE Transactions on Information Theory, 59(2):782–795, 2012.

[34] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[35] Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto, 2009.

[36] Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset (2014). Online: http://www.cs.toronto.edu/kriz/cifar.html, 2020.

[37] Li, T. and Yuan, M. On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. arXiv preprint arXiv:1909.03302, 2019.

[38] Lista, L. Statistical Methods for Data Analysis in Particle Physics, volume 941. Springer, 2017.

[39] Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J.
Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pages 6316–6326. PMLR, 2020.

[40] Lopez-Paz, D. and Oquab, M. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.

[41] Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[42] Papamakarios, G., Sterratt, D., and Murray, I. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 837–848. PMLR, 2019.

[43] Polyanskiy, Y. and Wu, Y. Information Theory: From Coding to Learning. Cambridge University Press, 2023+.

[44] Radhakrishnan, A., Beaglehole, D., Pandit, P., and Belkin, M. Feature learning in neural networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881, 2022.

[45] Sackrowitz, H. and Samuel-Cahn, E. P values as random variables – expected P values. The American Statistician, 53(4):326–331, 1999. ISSN 00031305.

[46] Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001. ISBN 9780262256933. doi: 10.7551/mitpress/4175.001.0001.

[47] Sisson, S. A., Fan, Y., and Beaumont, M. Handbook of Approximate Bayesian Computation. CRC Press, 2018.

[48] Sutherland, D. J., Tung, H., Strathmann, H., De, S., Ramdas, A., Smola, A. J., and Gretton, A. Generative models and model criticism via optimized maximum mean discrepancy. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

[49] Thomas, O., Dutta, R., Corander, J., Kaski, S., and Gutmann, M. U. Likelihood-free inference by ratio estimation. Bayesian Analysis, 17(1):1–31, 2022.

[50] Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.

[51] von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., and Wolf, T. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.

[52] Ziv, J. On classification with empirically observed statistics and universal data compression. IEEE Transactions on Information Theory, 34(2):278–286, 1988.

## A Notation

We use $A \gtrsim B$, $A \lesssim B$, $A \asymp B$ to denote $A = \Omega(B)$, $B = \Omega(A)$ and $A = \Theta(B)$ respectively, where the hidden constants depend on untracked parameters multiplicatively.[^2] We write $\mathrm{TV}, \mathrm{KL}, \chi^2$ for total variation, KL-divergence and $\chi^2$-divergence, respectively. We write $D(P_{Y|X}\|Q_{Y|X}|P_X) = \mathbb{E}_{X \sim P_X}\, D(P_{Y|X}\|Q_{Y|X})$ for the conditional divergence, for any probability measures $P, Q$ on two variables $X, Y$ and divergence $D \in \{\mathrm{TV}, \mathrm{KL}, \chi^2\}$. We write $\ell^p$ for the usual $\ell^p$ sequence space and $L^p$ for the usual $L^p$ space with respect to the Lebesgue measure; both norms are written $\|\cdot\|_p$ if no ambiguity arises. For real numbers $a, b \in \mathbb{R}$ we write $\max\{a, b\}$ as $a \vee b$ and $\min\{a, b\}$ as $a \wedge b$. We use $1_d$ to denote a $d$-dimensional all-ones vector. For an integer $k \in \mathbb{Z}_+$, we write $[k]$ as shorthand for the set $\{1, 2, \ldots, k\}$. In the proofs of Theorems 3.1 and 3.2, we use $\overset{!}{=}$ for an equality that we are trying to prove.

[^2]: For example, the first equation in (10) means that there exists a constant $c$ independent of $\alpha, k, \epsilon, \delta, R$ such that $\min\{m, n\} \geq c\,\frac{\log(1/\alpha)(1+R)^2}{k\,\delta^2\epsilon^2}$.
## B Applications of Theorem 3.1

Usually, minimax rates of testing are proven under separation assumptions with respect to more traditional distances such as $L^p$, $p \in [1, \infty]$. In this section we show one example of how Theorem 3.1 can be used to recover known results, and we also obtain some novel results under $L^2$- and $L^1$-separation.

### B.1 Bounded Discrete Distributions Under $\ell^2/\ell^1$-Separation

**Sample complexity upper bounds** Let $\mathcal{P}_{\mathsf{Db}}(k, C)$ be the set of all discrete distributions $P$ supported on $[k] = \{1, 2, \ldots, k\}$ satisfying $\max_{1 \leq i \leq k} p(i) \leq C/k$, where $p$ is the probability mass function of $P$ (so that $\sum_{i=1}^k p(i) = 1$). For distributions $P_X, P_Y, P_Z$ we write $p_X, p_Y, p_Z$ for their probability mass functions. Let us apply Theorem 3.1 with underlying space $\mathcal{X} = [k]$ and measure $\mu = \frac{1}{k}\sum_{i=1}^k \delta_i$. Take the kernel $K(x, y) = \mathbb{1}\{x = y\} = \sum_{i=1}^k \mathbb{1}\{x = y = i\}$, and note that for any two distributions $P_X, P_Y$ we have
$$\mathrm{MMD}^2(P_X, P_Y) = \mathbb{E}\big[K(X, X') + K(Y, Y') - 2K(X, Y)\big] = \sum_i |p_X(i) - p_Y(i)|^2,$$
where $(X, X', Y, Y') \sim P_X^{\otimes 2} \otimes P_Y^{\otimes 2}$. So the corresponding MMD is the $\ell^2$-distance on probability mass functions. Note also that $K(x, y) = \sum_{i=1}^k \frac{1}{k}\big(\sqrt{k}\,\mathbb{1}\{x = i\}\big)\big(\sqrt{k}\,\mathbb{1}\{y = i\}\big)$, where $\{\sqrt{k}\,\mathbb{1}\{x = i\}\}_{i=1}^k$ forms an orthonormal basis of $L^2(\mu)$. So $K$ has only one distinct nonzero eigenvalue, namely $\lambda_1 = \lambda_2 = \cdots = \lambda_k = 1/k$, of multiplicity $k$. Suppose that we observe samples $X, Y, Z$ of sizes $n, n, m$ from $P_X, P_Y, P_Z \in \mathcal{P}_{\mathsf{Db}}(k, C)$, where $\mathrm{MMD}(P_X, P_Y) = \sqrt{\sum_i |p_X(i) - p_Y(i)|^2} \geq \epsilon$. Plugging into Theorem 3.1 shows the following.

**Proposition B.1.** For any two $P_X, P_Y \in \mathcal{P}_{\mathsf{Db}}(k, C)$ whose pmfs are at least $\epsilon$ apart in $\ell^2$-distance, testing (mLFHT) at total error $\alpha$ is possible using $n$ simulation samples and $m$ real data samples provided
$$\min\{m, n\} \gtrsim C\|\lambda\|_\infty \frac{\log(1/\alpha)(1+R)^2}{\delta^2\epsilon^2} \asymp \frac{\log(1/\alpha)(1+R)^2}{k\,\delta^2\epsilon^2}, \qquad \min\{n, \sqrt{mn}\} \gtrsim C\|\lambda\|_2 \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\epsilon^2\delta} \asymp \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\sqrt{k}\,\epsilon^2\delta}, \tag{10}$$
where $R$ is defined as in assumption (iii) of Section 3.1.

We can convert the above results to separation measured in total variation (recall $\mathrm{TV}(p, q) = \frac{1}{2}\sum_i |p(i) - q(i)| = \frac{1}{2}\|p - q\|_1$) using the AM–QM inequality $\|p_X - p_Y\|_1 \leq \sqrt{k}\,\|p_X - p_Y\|_2$. Then, taking $R \vee \alpha \vee \delta = \Theta(1)$ recovers the minimax optimal results of [18; 32; 33] for LFHT over the class $\mathcal{P}_{\mathsf{Db}}$. Note that analogous results for two-sample testing follow from the above using the reduction presented in Section 3.4.

**Sample complexity lower bounds** Recall the definition of $J^\star_\epsilon$ and note that $\bar\lambda^2_{2,J} = \min(J-1, k)/k^2$ for all $J \geq 2$. By Corollary 3.5 we see that $J^\star_\epsilon \asymp k$ as soon as $\epsilon \lesssim 1/k$. Thus, for $\epsilon \lesssim 1/k$ the necessity of
$$n \gtrsim \frac{1}{\sqrt{k}\,\epsilon^2} \quad\text{and}\quad \delta m + \sqrt{mn} \gtrsim \frac{1}{\sqrt{k}\,\epsilon^2\delta} \tag{11}$$
follows by Theorem 3.2. Here it is crucial to note that when $\delta = \Theta(1)$, we have $\delta m + \sqrt{mn} \asymp m \vee \sqrt{mn}$, and hence the upper bound (10) meets the lower bound (11) provided $R \vee \delta = \Theta(1)$. Once again, setting $R \vee \delta \vee \alpha = \Theta(1)$ we obtain the optimal lower bounds recovering the results of [18] (in the regime $\epsilon \lesssim 1/k$). In short, we also recover the following result for LFHT.

**Proposition B.2** ([18, Theorem 1, adapted]). On the class $\mathcal{P}_{\mathsf{Db}}(k, C)$, using $n$ simulation samples and $m$ real data samples, if
$$n \gtrsim \frac{1}{\sqrt{k}\,\epsilon^2}, \qquad m \gtrsim \frac{1}{k\epsilon^2}, \qquad mn \gtrsim \frac{1}{k\epsilon^4}, \tag{12}$$
then for any two distributions $P_X, P_Y \in \mathcal{P}_{\mathsf{Db}}(k, C)$ with $\|p_X - p_Y\|_2 \geq \epsilon$, testing (LFHT) is possible with a total error of 1%. Conversely, to ensure the existence of a procedure that can test (LFHT) with a total error of 1% for any $P_X, P_Y \in \mathcal{P}_{\mathsf{Db}}(k, C)$ with $\|p_X - p_Y\|_2 \geq \epsilon$, the number of observations $(n, m)$ must satisfy
$$n \gtrsim \frac{1}{\sqrt{k}\,\epsilon^2}, \qquad m \gtrsim \frac{1}{k\epsilon^2}, \qquad mn \gtrsim \frac{1}{k\epsilon^4}. \tag{13}$$
The implied constants in (12) and (13) do not depend on $k$ and $\epsilon$, but may differ.
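As a quick sanity check of the identity above, one can verify numerically that the population MMD² under the indicator kernel equals the squared $\ell^2$-distance of the pmfs. The random Dirichlet pmfs are our arbitrary choice for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50
pX, pY = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))

# Population MMD^2 for K(x,y) = 1{x=y}:
#   E K(X,X') + E K(Y,Y') - 2 E K(X,Y), with X,X' ~ pX and Y,Y' ~ pY i.i.d.
mmd2 = pX @ pX + pY @ pY - 2 * pX @ pY
assert np.isclose(mmd2, ((pX - pY) ** 2).sum())  # squared l2 distance of pmfs
```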
### B.2 $\beta$-Hölder Smooth Densities on $[0,1]^d$ Under $L^2/L^1$-Separation

**Sample complexity upper bounds** Let $\mathcal{P}_{\mathsf{H}}(\beta, d, C)$ be the set of all distributions on $[0, 1]^d$ with $\beta$-Hölder smooth Lebesgue density $p$ satisfying $\|p\|_{C^\beta} \leq C$ for some constant $C > 1$, where
$$\|p\|_{C^\beta} \triangleq \max_{0 \leq |\alpha| \leq \lceil\beta\rceil - 1} \|p^{(\alpha)}\|_\infty + \sup_{x \neq y \in [0,1]^d,\ |\alpha| = \lceil\beta\rceil - 1} \frac{|p^{(\alpha)}(x) - p^{(\alpha)}(y)|}{\|x - y\|_2^{\beta - \lceil\beta\rceil + 1}},$$
where $\lceil\beta\rceil - 1$ is the largest integer strictly smaller than $\beta$ and $|\alpha| = \sum_i \alpha_i$ is the norm of a multi-index $\alpha \in \mathbb{N}^d$. Abusing notation, we also use $\mathcal{P}_{\mathsf{H}}(\beta, d, C)$ to denote the set of all corresponding density functions. We take $K(x, y) = \kappa^d\sum_{j \in [\kappa]^d} \mathbb{1}\{x, y \in B_j\}$, where $B_j = [(j - 1_d)/\kappa,\ j/\kappa]$ is the $j$'th cell of the regular grid of size $\kappa^d$ on $[0, 1]^d$. Clearly there are $\kappa^d$ nonzero eigenvalues, each equal to 1. The following approximation result is due to Ingster [28]; see also [4, Lemma 7.2].

**Lemma B.3.** Let $f, g \in \mathcal{P}_{\mathsf{H}}(\beta, d, C)$ with $\|f - g\|_2 \geq \epsilon$. Then there exist constants $c, c'$ independent of $\epsilon$ such that for any $\kappa \geq c\,\epsilon^{-1/\beta}$, $\mathrm{MMD}(f, g) \geq c'\|f - g\|_2$.

Now, suppose that we have samples $X, Y, Z$ of sizes $n, n, m$ from $P_X, P_Y, P_Z \in \mathcal{P}_{\mathsf{H}}(\beta, d, C)$ with densities $p_X, p_Y, p_Z$ such that $\|p_X - p_Y\|_2 \geq \epsilon$. Then Theorem 3.1, combined with Lemma B.3 and the choice $\kappa \asymp \epsilon^{-1/\beta}$, shows the following.

**Proposition B.4.** Testing (mLFHT) on $\mathcal{P}_{\mathsf{H}}(\beta, d, C)$ at total error $\alpha$ using $n$ simulation and $m$ real data samples is possible provided
$$\min\{m, n\} \gtrsim C\|\lambda\|_\infty \frac{\log(1/\alpha)(1+R)^2}{\delta^2\epsilon^2} \asymp \frac{\log(1/\alpha)(1+R)^2}{\delta^2\epsilon^2}, \qquad \min\{n, \sqrt{nm}\} \gtrsim C\|\lambda\|_2 \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\epsilon^2\delta} \asymp \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\epsilon^{(2\beta+d/2)/\beta}\,\delta},$$
where $\epsilon$ is an $L^2$-distance lower bound between $P_X, P_Y$ and $R$ is defined as in assumption (iii) of Section 3.1.

Setting $R \vee \alpha \vee \delta = \Theta(1)$ recovers the optimal results of [18] for the class $\mathcal{P}_{\mathsf{H}}$. Once again, identical results under $L^1$-separation follow from Jensen's inequality $\|\cdot\|_{L^1([0,1]^d)} \leq \|\cdot\|_{L^2([0,1]^d)}$. Note that analogous results for two-sample testing follow from the above using the reduction presented in Section 3.4.

**Sample complexity lower bounds** The kernel defined in the previous paragraph is not suitable for constructing lower bounds over the class $\mathcal{P}_{\mathsf{H}}$, because its eigenfunctions do not necessarily lie in $\mathcal{P}_{\mathsf{H}}$. It would be possible to consider a different kernel that is better adapted to this problem/class, but we do not pursue this here.

### B.3 $(\beta, 2)$-Sobolev Smooth Densities on $\mathbb{R}^d$ Under $L^2$-Separation

**Sample complexity upper bounds** Let $\mathcal{P}_{\mathsf{S}}(\beta, d, C)$ be the class of distributions supported on $\mathbb{R}^d$ whose Lebesgue density $p$ satisfies $\|p\|_{\beta,2} \leq C$, where
$$\|p\|_{\beta,2} \triangleq \big\|(1 + \|\xi\|_2)^{\beta}\, \mathcal{F}[p](\xi)\big\|_{L^2} \tag{14}$$
and $\mathcal{F}$ denotes the Fourier transform. Again abusing notation, we write $\mathcal{P}_{\mathsf{S}}(\beta, d, C)$ both for the set of distributions and for the set of density functions. We take the Gaussian kernel $G_\sigma(x, y) = \sigma^{-d}\exp(-\|x - y\|_2^2/\sigma^2)$ on $\mathcal{X} = \mathbb{R}^d$ with base measure $d\mu(x) = \exp(-\|x\|_2^2)\,dx$. In [37] the authors showed that the two-sample test that thresholds the Gaussian MMD with appropriately chosen variance $\sigma^2$ achieves the minimax optimal sample complexity over $\mathcal{P}_{\mathsf{S}}$ when separation is measured in $L^2$. A key ingredient in their proof is the following inequality.

**Lemma B.5** ([37, Lemma 5]). Let $f, g \in \mathcal{P}_{\mathsf{S}}(\beta, d, C)$ with $\|f - g\|_2 \geq \epsilon$. Then there exist constants $c, c'$ independent of $\epsilon$ such that for any $\sigma \leq c\,\epsilon^{1/\beta}$, we have $\mathrm{MMD}(f, g) \geq c'\|f - g\|_2$.

Now, suppose that we have samples $X, Y, Z$ of sizes $n, n, m$ from $P_X, P_Y, P_Z \in \mathcal{P}_{\mathsf{S}}(\beta, d, C)$ for some constant $C$, with densities $p_X, p_Y, p_Z$ satisfying $\|p_X - p_Y\|_2 \geq \epsilon$. Note that the heat semigroup is an $L^2$-contraction (so $\|\lambda\|_\infty \leq 1$) and that
$$\|\lambda\|_2^2 = \int G_\sigma(x, y)^2\,d\mu(x)d\mu(y) \lesssim \sigma^{-d}$$
up to constants depending on the dimension. Theorem 3.1, combined with Lemma B.5 and the choice $\sigma \asymp \epsilon^{1/\beta}$, yields the following result.

**Proposition B.6.**
Testing (mLFHT) over the class $\mathcal{P}_{\mathsf{S}}$ with total error $\alpha$ is possible provided
$$\min\{m, n\} \gtrsim C\|\lambda\|_\infty \frac{\log(1/\alpha)(1+R)^2}{\delta^2\epsilon^2} \asymp \frac{\log(1/\alpha)(1+R)^2}{\delta^2\epsilon^2}, \qquad \min\{n, \sqrt{nm}\} \gtrsim C\|\lambda\|_2 \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\epsilon^2\delta} \asymp \frac{\sqrt{\log(1/\alpha)}(1+R)^2}{\epsilon^{(2\beta+d/2)/\beta}\,\delta},$$
where $\epsilon$ is the lower bound on the $L^2$-distance between $P_X, P_Y$ and $R$ is defined as in assumption (iii) of Section 3.1.

Taking $R \vee \delta \vee \alpha = \Theta(1)$ above, we obtain new results for LFHT, and using the reduction from two-sample testing given in Section 3.4 we partly recover [37, Theorem 5]; only partly, because the above requires a bounded density with respect to our base measure $d\mu(x) = \exp(-\|x\|_2^2)\,dx$.

**Sample complexity lower bounds** Note that our lower bound Theorem 3.2 does not apply here, because the top eigenfunction of the Gaussian kernel is not constant. Once again, a more careful choice of the base measure (or kernel) might lead to a suitable lower bound argument. We leave such a pursuit open.

## C Black-Box Boosting of the Success Probability

In this section we briefly describe how upper bounds on the minimax sample complexity in the constant-error-probability regime ($\alpha = \Theta(1)$) can be used to obtain the dependence $\log(1/\alpha)$ in the small-error regime ($\alpha = o(1)$). We argue abstractly, in a way that applies to the setting of Theorem 3.1. Suppose that from some distributions $P_1, P_2, \ldots, P_k$ we take samples $X_1, X_2, \ldots, X_k$ of sizes $n_1, n_2, \ldots, n_k$ respectively and are able to decide between two hypotheses $H_0$ and $H_1$ (fixed but arbitrary) with total error probability at most $1/3$. Call this test $\Psi(X_1, \ldots, X_k) \in \{0, 1\}$, so that $P(\Psi(X_1, \ldots, X_k) = 0 \mid H_0) \geq 2/3$ and $P(\Psi(X_1, \ldots, X_k) = 1 \mid H_1) \geq 2/3$. Now, to achieve an error of $\alpha = o(1)$, we instead take $18 n_1 \log(2/\alpha), \ldots, 18 n_k \log(2/\alpha)$ observations from $P_1$ through $P_k$, and split each sample into $18\log(2/\alpha)$ equal-sized batches $\{X_{i,j}\}_{i \in [k],\ j \in [18\log(2/\alpha)]}$. Here $18\log(2/\alpha)$ is assumed to be an integer without loss of generality. The split samples form $18\log(2/\alpha)$ independent binary random variables $A_j \triangleq \Psi(X_{1,j}, \ldots, X_{k,j})$ for $j = 1, 2, \ldots, 18\log(2/\alpha)$. We claim that the majority-voting test
$$\Psi_\alpha(\{X_{i,j}\}_{i,j}) = \begin{cases} 1 & \text{if } \bar A \geq 1/2 \\ 0 & \text{otherwise} \end{cases}, \qquad \bar A \triangleq \frac{1}{18\log(2/\alpha)}\sum_{j=1}^{18\log(2/\alpha)} A_j,$$
tests $H_0$ against $H_1$ with total probability of error at most $\alpha$. Indeed, by Hoeffding's inequality, we have
$$P\big(\bar A \geq 1/2 \mid H_0\big) \leq \alpha/2 \quad\text{and}\quad P\big(\bar A < 1/2 \mid H_1\big) \leq \alpha/2.$$
Therefore, in the remainder of our upper bound proofs, we only focus on achieving a constant probability of error ($\alpha = \Theta(1)$), as the logarithmic dependence follows by the above; a code sketch follows below.

**Remark C.1.** As mentioned in the discussion succeeding Corollary 3.5, we do conjecture the tight dependence in the upper bound to be $\sqrt{\log(\alpha^{-1})}$ instead of the $\log(\alpha^{-1})$ shown by this method.
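A minimal sketch of the boosting scheme, assuming a `base_test` with total error at most 1/3 and a `draw_batch` callback producing fresh independent batches (both names are ours):

```python
import numpy as np

def boosted_test(base_test, draw_batch, alpha):
    # Black-box boosting (Appendix C): repeat a constant-error test on
    # 18*log(2/alpha) independent batches and take a majority vote.
    reps = int(np.ceil(18 * np.log(2 / alpha)))
    votes = [base_test(*draw_batch()) for _ in range(reps)]
    # Hoeffding: P(mean >= 1/2 | H0) <= alpha/2, P(mean < 1/2 | H1) <= alpha/2.
    return int(np.mean(votes) >= 0.5)
```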
Lemma D.1. For each identity below, let $f, g, h \in L^2(\mu)$ be such that the quantity in question is well defined. Then
$$\|\theta_f\|^2_{\mathcal H_K} = \sum_\ell \lambda_\ell f_\ell^2, \tag{15}$$
$$\mathrm{MMD}^2(f, g) = \sum_\ell \lambda_\ell (f_\ell - g_\ell)^2, \tag{16}$$
$$\|K[f]\|_2^2 = \sum_\ell \lambda_\ell^2 f_\ell^2, \tag{17}$$
$$\sum_\ell \lambda_\ell f_\ell g_\ell = \langle f, K[g]\rangle = \langle K[f], g\rangle, \tag{18}$$
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, h_{\ell\ell'} f_\ell g_{\ell'} = \langle h, K[f]\,K[g]\rangle, \tag{19}$$
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, g_{\ell\ell'} f_{\ell\ell'} = \sum_\ell \lambda_\ell\, \langle f e_\ell, K[g e_\ell]\rangle. \tag{20}$$
Suppose moreover that $f, g$ are probability densities with respect to $\mu$ that are bounded by $C$. Then
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, g_{\ell\ell'} f_{\ell\ell'} \le C^2\, \|\lambda\|_2^2. \tag{21}$$

Proof. We prove each claim, starting with (15). Clearly
$$\|\theta_f\|^2_{\mathcal H_K} = \Big\|\int K(x,\cdot) f(x)\, d\mu(x)\Big\|^2_{\mathcal H_K} = \iint \langle K(x,\cdot), K(y,\cdot)\rangle_{\mathcal H_K}\, f(x) f(y)\, d\mu(x)\, d\mu(y) = \iint K(x,y)\, f(x) f(y)\, d\mu(x)\, d\mu(y) = \sum_\ell \lambda_\ell f_\ell^2,$$
as required. The second claim (16) follows immediately from (15) by definition. For (17), by orthogonality we have
$$\|K[f]\|_2^2 = \Big\|\sum_\ell \lambda_\ell f_\ell e_\ell\Big\|_2^2 = \sum_\ell \lambda_\ell^2 f_\ell^2.$$
For (18), by the definition of $K[\cdot]$ we have $\langle f, K[g]\rangle = \big\langle \sum_\ell f_\ell e_\ell, \sum_\ell \lambda_\ell g_\ell e_\ell\big\rangle = \sum_\ell \lambda_\ell f_\ell g_\ell$, and symmetrically for $\langle K[f], g\rangle$. For (19) we can write
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, h_{\ell\ell'} f_\ell g_{\ell'} = \sum_\ell \lambda_\ell f_\ell\, \langle K[g]\, h, e_\ell\rangle = \langle K[g]\, h, K[f]\rangle.$$
Finally, for (20), using $f_{\ell\ell'} = \langle f e_\ell, e_{\ell'}\rangle$ and applying (18) to the pair $(f e_\ell, g e_\ell)$, we have
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, f_{\ell\ell'} g_{\ell\ell'} = \sum_\ell \lambda_\ell \sum_{\ell'} \lambda_{\ell'} \langle f e_\ell, e_{\ell'}\rangle \langle g e_\ell, e_{\ell'}\rangle = \sum_\ell \lambda_\ell\, \langle f e_\ell, K[g e_\ell]\rangle.$$
Suppose now that $f, g$ are probability densities with respect to $\mu$ that are bounded by $C > 0$, and let $X, Y$ be independent random variables with densities $f, g$. Then
$$\sum_{\ell\ell'} \lambda_\ell \lambda_{\ell'}\, f_{\ell\ell'} g_{\ell\ell'} = \mathbb E\Big[\Big(\sum_\ell \lambda_\ell e_\ell(X) e_\ell(Y)\Big)^2\Big] = \mathbb E[K(X,Y)^2] \le C^2 \iint K(x,y)^2\, d\mu(x)\, d\mu(y) = C^2 \|\lambda\|_2^2,$$
as claimed, where the last equality uses that the $e_\ell$ are orthonormal.

D.2 Mean and Variance Computation

We take $\pi = \delta/2$. Our statistic reads
$$T(X, Y, Z) + \gamma(X, Y, \pi) = \big\langle \theta_{\hat P_Z} - \big((1-\pi)\theta_{\hat P_X} + \pi \theta_{\hat P_Y}\big),\ \theta_{\hat P_X} - \theta_{\hat P_Y}\big\rangle_{u, \mathcal H_K},$$
where the subscript $u$ indicates that the empirical inner products are evaluated in their unbiased (diagonal-removed) form, so that $T$ expands into U-statistics built from the Gram sums $\sum_{ij} k(X_i, Z_j)$, $\sum_{ij} k(Y_i, Z_j)$, and so on. Computing the mean and variance of this statistic term by term (cf. (22)) shows that there exist constants $c, c' > 0$ such that the testing problem is possible at constant error probability (say $\alpha = 5\%$), provided that the sample sizes $m, n$ satisfy the two inequalities of Theorem 3.1 with $\alpha = \Theta(1)$, i.e. with the $\log(1/\alpha)$ factors replaced by constants. By repeated sample splitting and majority voting (see Appendix C), we can boost the success probability of this test to the desired level $1 - \alpha$ by incurring a multiplicative $\Theta(\log(1/\alpha))$ factor on the sample sizes $n, m$, which yields the desired result.

E Proof of Theorem 3.2

E.1 Information theoretic tools

Our lower bounds rely on the method of two fuzzy hypotheses [50]. Given a measurable space $S$, let $\mathcal M(S)$ denote the set of all probability measures on $S$. We call subsets $H \subseteq \mathcal M(S)$ hypotheses. The following is the main technical result that our proofs rely on.

Lemma E.1. Take hypotheses $H_0, H_1 \subseteq \mathcal M(S)$ and random $P_0, P_1 \in \mathcal M(S)$ with $\mathbb P(P_i \in H_i) = 1$. Then
$$\inf_\psi \max_{i=0,1} \sup_{P \in H_i} P(\psi \ne i) \ge \frac12\big(1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1)\big),$$
where the infimum is over all tests $\psi: S \to \{0,1\}$.

Proof. For any $\psi$,
$$\max_{i=0,1} \sup_{P_i \in H_i} P_i(\psi \ne i) \ge \frac12 \sup_{P_i \in H_i} \big(P_0(\psi = 1) + P_1(\psi = 0)\big) \ge \frac12\, \mathbb E\big[P_0(\psi = 1) + P_1(\psi = 0)\big].$$
Optimizing over $\psi$, the right-hand side is at least $\frac12(1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1))$, as required.

Therefore, to prove a lower bound on the minimax sample complexity of testing with total error probability $\alpha$, we just need to construct two random measures $P_i \in H_i$ such that $1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1) = \Omega(\alpha)$. In our proofs we also use the following standard results on $f$-divergences.

Lemma E.2 ([43, Section 7]). For any probability distributions $P, Q$ the inequalities
$$1 - \mathrm{TV}(P, Q) \ge \frac12 \exp(-\mathrm{KL}(P\,\|\,Q)) \ge \frac12\, \frac{1}{1 + \chi^2(P\,\|\,Q)}$$
hold.

Lemma E.3 (Chain rule for the χ²-divergence). Let $P_{X,Y}, Q_{X,Y}$ be probability measures such that the marginals on $X$ are equal ($P_X = Q_X$). Then $\chi^2(P_{X,Y}\,\|\,Q_{X,Y}) = \chi^2(P_{Y|X}\,\|\,Q_{Y|X}\,|\,P_X)$.

Proof. Let $P_{X,Y}, Q_{X,Y}$ have densities $p, q$ with respect to some $\mu$. Then, by some abuse of notation, we have
$$\chi^2(P_{X,Y}\,\|\,Q_{X,Y}) = \int \frac{p(x,y)^2}{q(x,y)}\, d\mu(x,y) - 1 = \int p(x)\Big(\int \frac{p(y|x)^2}{q(y|x)}\, d\mu(y) - 1\Big) d\mu(x) = \chi^2(P_{Y|X}\,\|\,Q_{Y|X}\,|\,P_X).$$

E.2 Constructing hard instances

Recall that in the statement of Theorem 3.2 we assume that $\mu(\mathcal X) = 1$, $\sup_{x \in \mathcal X} K(x,x) \le 1$, and $\int K(x,y)\, \mu(dx) = \lambda_1$ for all $y$ (i.e. the top eigenfunction is the constant $e_1 \equiv 1$).
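To preview the construction spelled out next, here is a small sketch (ours) of an Ingster-type perturbation family in $d = 1$, using the cosine basis $e_j(x) = \sqrt2\cos(\pi j x)$ and a toy decaying spectrum $\lambda_j$ (both assumptions for illustration only). It checks that $f_\eta$ integrates to one, stays above $1/2$, and satisfies $\mathrm{MMD}^2(f_0, f_\eta) = \epsilon^2$, matching (25) below.

```python
# Sketch (ours) of the perturbed densities f_eta = 1 + eps * sum_j rho_j eta_j e_j.
import numpy as np

rng = np.random.default_rng(1)
J, eps = 10, 0.04
j = np.arange(2, J + 1)
lam = 1.0 / j**2                          # toy eigenvalue sequence (assumption)
lam2J = np.sqrt((lam**2).sum())           # ||lambda||_{2,J}
rho = np.sqrt(lam) / lam2J                # rho_j = sqrt(lambda_j) / ||lambda||_{2,J}
eta = rng.choice([-1.0, 1.0], size=len(j))

x = (np.arange(100000) + 0.5) / 100000    # fine midpoint grid on [0, 1]
e = np.sqrt(2) * np.cos(np.pi * j[:, None] * x[None, :])  # orthonormal basis
f_eta = 1 + eps * (rho * eta) @ e

print(f_eta.mean())                       # ~1.0: f_eta integrates to one
print(f_eta.min() > 0.5)                  # True: eps is small enough (cf. Lemma E.4)
print((lam * rho**2).sum() * eps**2)      # = eps^2 = 0.0016, matching (25)
```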
Let $f_0 \equiv 1$ and for each $\eta \in \{\pm1\}^{\mathbb N}$ define
$$f_\eta = 1 + \epsilon \sum_{j \ge 2} \rho_j \eta_j e_j,$$
where $\rho_j = \mathbb 1\{2 \le j \le J\}\, \sqrt{\lambda_j}/\|\lambda\|_{2,J}$ and $\|\lambda\|_{2,J} \triangleq \sqrt{\sum_{2 \le j \le J} \lambda_j^2}$ for some $J \ge 2$. Notice that $\int f_\eta(x)\, \mu(dx) = \mu(\mathcal X) = 1$ due to orthogonality of the eigenfunctions. Assume from here on that $J$ is chosen so that $f_\eta(x) \ge 1/2$ for all $x \in \mathcal X$ and all $\eta$. This makes $f_\eta$ a valid probability density with respect to the base measure $\mu$. Before continuing, we prove the following lemma, which gives a lower bound on the maximal $J$, denoted $J_\epsilon$, for which $f_\eta \ge 1/2$ for all $\eta$.

Lemma E.4. $J \le J_\epsilon$ holds provided $2\epsilon\sqrt{J-1} \le \|\lambda\|_{2,J}$.

Proof of Lemma E.4. Notice that
$$\|e_j\|_\infty = \sup_{x \in \mathcal X} |\langle K(x,\cdot), e_j\rangle_{\mathcal H}| \le \sup_{x \in \mathcal X} \|K(x,\cdot)\|_{\mathcal H}\, \|e_j\|_{\mathcal H} \le \lambda_j^{-1/2},$$
where we use $\|K(x,\cdot)\|_{\mathcal H} = \sqrt{K(x,x)} \le 1$ and $\|e_j\|_{\mathcal H} = \lambda_j^{-1/2}$. Similarly,
$$\epsilon\, \sup_{x \in \mathcal X}\Big|\sum_{j \ge 2} \rho_j \eta_j e_j(x)\Big| = \epsilon\, \sup_{x \in \mathcal X}\Big|\Big\langle K(x,\cdot), \sum_{j \ge 2}\rho_j\eta_j e_j\Big\rangle_{\mathcal H}\Big| \le \epsilon\, \Big\|\sum_{j \ge 2}\rho_j\eta_j e_j\Big\|_{\mathcal H} = \epsilon\sqrt{\sum_{j \ge 2} \rho_j^2/\lambda_j} = \epsilon\, \frac{\sqrt{J-1}}{\|\lambda\|_{2,J}},$$
which is at most $1/2$ by assumption, and the result follows.

Note that Lemma E.4 immediately gives us a proof of Corollary 3.5.

Proof of Corollary 3.5. Suppose that $J$ is such that $\sum_{j=2}^J \lambda_j^2 \ge c^2\|\lambda\|_2^2$. Then, by Lemma E.4, if $\epsilon \le \|\lambda\|_{2,J}/(2\sqrt{J-1})$ then $J \le J_\epsilon$. By assumption, this is implied by the inequality $\epsilon \le c\|\lambda\|_2/(2\sqrt{J-1})$, and the result follows.

Continuing with our proof, note that by construction we have
$$\mathrm{MMD}^2(f_0, f_\eta) = \epsilon^2 \sum_{j \ge 2} \lambda_j \rho_j^2 = \epsilon^2, \qquad \eta \in \{\pm1\}^{\mathbb N}. \tag{25}$$

E.2.1 Lower Bound on m

We apply Lemma E.1 with the (deterministic) construction
$$P_0 = f_0^{\otimes n} \otimes (1 + \epsilon e_2/\sqrt{\lambda_2})^{\otimes n} \otimes (1 + \delta\epsilon e_2/\sqrt{\lambda_2})^{\otimes m}, \qquad P_1 = f_0^{\otimes n} \otimes (1 + \epsilon e_2/\sqrt{\lambda_2})^{\otimes n} \otimes f_0^{\otimes m}, \tag{26}$$
(here and below we write $f_1 = f_{(1,1,\dots)}$ and similarly $g_1$). By the data-processing inequality for the $\chi^2$-divergence (also by Lemma E.3), we may drop the first $2n$ coordinates and obtain
$$\chi^2(\mathbb E P_0\,\|\,\mathbb E P_1) = \chi^2\big((1 + \delta\epsilon e_2/\sqrt{\lambda_2})^{\otimes m}\,\|\,f_0^{\otimes m}\big) = \big(1 + \delta^2\epsilon^2/\lambda_2\big)^m - 1 \le \exp(\delta^2\epsilon^2 m/\lambda_2) - 1.$$
By Lemma E.2 we have
$$1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1) \ge \frac12\,\frac{1}{1 + \chi^2(\mathbb E P_0\,\|\,\mathbb E P_1)} \ge \frac12 \exp(-\delta^2\epsilon^2 m/\lambda_2) \overset{!}{=} \Omega(\alpha).$$
The lower bound $m \gtrsim \lambda_2 \log(1/\alpha)/(\delta\epsilon)^2$ now follows readily.

E.2.2 Lower Bound on n

Once again we apply Lemma E.1, now with the construction
$$P_0 = f_0^{\otimes n} \otimes f_\eta^{\otimes n} \otimes f_0^{\otimes m}, \qquad P_1 = f_\eta^{\otimes n} \otimes f_0^{\otimes n} \otimes f_0^{\otimes m}, \tag{27}$$
where we put a uniform prior on $\eta \in \{\pm1\}^{\mathbb N}$ as before. Using the subadditivity of total variation under products, we compute
$$\mathrm{TV}(\mathbb E P_0, \mathbb E P_1) = \mathrm{TV}\big(f_0^{\otimes n} \otimes \mathbb E f_\eta^{\otimes n},\ \mathbb E[f_\eta^{\otimes n}] \otimes f_0^{\otimes n}\big) \le 2\, \mathrm{TV}(\mathbb E f_\eta^{\otimes n}, f_0^{\otimes n}).$$
Just as in Appendix E.2.3 we upper bound the total variation by the $\chi^2$-divergence to get
$$\chi^2(\mathbb E f_\eta^{\otimes n}\,\|\,f_0^{\otimes n}) + 1 = \mathbb E_{\eta\eta'} \int \prod_{i=1}^n f_\eta(x_i) f_{\eta'}(x_i)\, \mu(dx_1)\cdots\mu(dx_n) = \mathbb E\Big(1 + \epsilon^2\sum_{j \ge 2}\rho_j^2\eta_j\eta'_j\Big)^n \le \mathbb E\exp\Big(n\epsilon^2\sum_{j \ge 2}\rho_j^2\eta_j\eta'_j\Big) = \prod_{j \ge 2}\cosh(n\epsilon^2\rho_j^2) \le \exp\Big(n^2\epsilon^4\sum_{j \ge 2}\rho_j^4\Big) = \exp\big(n^2\epsilon^4/\|\lambda\|_{2,J}^2\big),$$
where we used $\sum_{j \ge 2}\rho_j^4 = 1/\|\lambda\|_{2,J}^2$. Again, by Lemma E.2 we obtain
$$1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1) \gtrsim \exp\big(-n^2\epsilon^4/\|\lambda\|_{2,J}^2\big) \overset{!}{=} \Omega(\alpha).$$
The lower bound $n \gtrsim \sqrt{\log(1/\alpha)}\,\|\lambda\|_{2,J}/\epsilon^2$ now follows readily.

E.2.3 Lower Bound on mn

We take a uniform prior on $\eta$ and consider the random measures
$$P_0 = f_0^{\otimes n} \otimes f_\eta^{\otimes n} \otimes \big((1-\delta)f_0 + \delta f_\eta\big)^{\otimes m} \qquad\text{and}\qquad P_1 = f_0^{\otimes n} \otimes f_\eta^{\otimes n} \otimes f_0^{\otimes m}. \tag{28}$$
Our goal is to apply Lemma E.1 to $P_0, P_1$. Notice that $(1-\delta)f_0 + \delta f_\eta = 1 + \delta\epsilon g_\eta$, where $g_\eta \triangleq \sum_{j \ge 2}\rho_j\eta_j e_j$. Let us write $X, Y, Z$ for the marginals of the first $n$, second $n$ and last $m$ coordinates of $P_0$ and $P_1$. By the data-processing inequality and the chain rule (Lemma E.3) we have
$$\chi^2(\mathbb E P_0\,\|\,\mathbb E P_1) = \chi^2\big((\mathbb E P_0)_{Y,Z}\,\|\,(\mathbb E P_1)_{Y,Z}\big) = \chi^2\big((\mathbb E P_0)_{Z|Y}\,\|\,(\mathbb E P_1)_{Z|Y}\,\big|\,(\mathbb E P_0)_Y\big) = \mathbb E\, \chi^2\big(\mathbb E\big[(1 + \delta\epsilon g_\eta)^{\otimes m}\,\big|\,Y\big]\,\big\|\,f_0^{\otimes m}\big) =: (\star).$$
Notice that the expectation inside the $\chi^2$-divergence is with respect to $\eta$ given the variables $Y$, in other words over the posterior of $\eta$ (with uniform prior) given $n$ observations from the density $1 + \epsilon g_\eta = f_\eta$; the outer expectation is over $Y$. Given $Y$, let $\eta$ and $\eta'$ be i.i.d. from said posterior. We get the bound
$$(\star) + 1 \le \mathbb E \int \prod_{i=1}^m (1 + \delta\epsilon g_\eta(x_i))(1 + \delta\epsilon g_{\eta'}(x_i))\, \mu(dx_i) = \mathbb E\Big(1 + \delta^2\epsilon^2\sum_{j \ge 2}\rho_j^2\eta_j\eta'_j\Big)^m \le \mathbb E\exp\Big(\delta^2\epsilon^2 m\sum_{j \ge 2}\rho_j^2\eta_j\eta'_j\Big).$$
Define the collections of variables $\eta_{-j} = \{\eta_i\}_{i \ge 2}\setminus\{\eta_j\}$ and $\eta'_{-j}$ similarly.
We shall prove the following claim:
$$\mathbb E\big[\exp(\delta^2\epsilon^2 m\rho_j^2\,\eta_j\eta'_j)\,\big|\,\eta_{-j},\eta'_{-j}\big] \le \exp\big(c\,\delta^2\epsilon^4(\delta^2 m^2 + mn)\rho_j^4\big) \tag{29}$$
for some universal constant $c > 0$. Assuming that (29) holds, by induction we can show that
$$(\star) + 1 \le \exp\Big(c\,\delta^2(\delta^2 m^2 + mn)\epsilon^4\sum_{j \ge 2}\rho_j^4\Big) = \exp\big(c\,\delta^2(\delta^2 m^2 + mn)\epsilon^4/\|\lambda\|_{2,J}^2\big).$$
Thus, if $mn + \delta^2 m^2 = o\big(\|\lambda\|_{2,J}^2/(\delta^2\epsilon^4)\big)$ then testing is impossible.

We now prove (29). Since the variable $\eta_j\eta'_j$ is either $+1$ or $-1$, we have
$$\mathbb E\big[\exp(\delta^2\epsilon^2 m\rho_j^2\eta_j\eta'_j)\,\big|\,\eta_{-j},\eta'_{-j}\big] = \big(e^{\delta^2\epsilon^2 m\rho_j^2} - e^{-\delta^2\epsilon^2 m\rho_j^2}\big)\, \mathbb P(\eta_j\eta'_j = 1\,|\,\eta_{-j},\eta'_{-j}) + e^{-\delta^2\epsilon^2 m\rho_j^2}.$$
Let us write $\eta_{1j}$ (resp. $\eta_{-1j}$) for the vector of signs equal to $\eta$ but whose $j$-th coordinate equals $+1$ (resp. $-1$). Looking at the probability above, and using the conditional independence of $\eta, \eta'$ given $Y$, we have
$$\mathbb P(\eta_j\eta'_j = 1\,|\,Y,\eta_{-j},\eta'_{-j}) = \mathbb P(\eta_j = 1\,|\,Y,\eta_{-j})^2 + \mathbb P(\eta_j = -1\,|\,Y,\eta_{-j})^2 = \frac{\big(f_{\eta_{1j}}^{\otimes n}(Y)\big)^2 + \big(f_{\eta_{-1j}}^{\otimes n}(Y)\big)^2}{\big(f_{\eta_{1j}}^{\otimes n}(Y) + f_{\eta_{-1j}}^{\otimes n}(Y)\big)^2}.$$
Taking the expectation $\mathbb E[\,\cdot\,|\,\eta_{-j},\eta'_{-j}]$ (the conditional density of $Y$ being $\frac12 f_{\eta_{1j}}^{\otimes n} + \frac12 f_{\eta_{-1j}}^{\otimes n}$) and using the HM-AM inequality $(\frac12(x+y))^{-1} \le \frac12(\frac1x + \frac1y)$, valid for all $x, y > 0$, gives
$$\mathbb P(\eta_j\eta'_j = 1\,|\,\eta_{-j},\eta'_{-j}) = \frac12\int \frac{\big(\prod_{i=1}^n f_{\eta_{1j}}(x_i)\big)^2 + \big(\prod_{i=1}^n f_{\eta_{-1j}}(x_i)\big)^2}{\prod_{i=1}^n f_{\eta_{1j}}(x_i) + \prod_{i=1}^n f_{\eta_{-1j}}(x_i)}\, \mu(dx_1)\cdots\mu(dx_n) \le \frac14 + \frac18\int \Big(\frac{\big(\prod_{i=1}^n f_{\eta_{1j}}(x_i)\big)^2}{\prod_{i=1}^n f_{\eta_{-1j}}(x_i)} + \frac{\big(\prod_{i=1}^n f_{\eta_{-1j}}(x_i)\big)^2}{\prod_{i=1}^n f_{\eta_{1j}}(x_i)}\Big)\mu(dx_1)\cdots\mu(dx_n) =: (\star\star).$$
Note that $f_{\eta_{1j}} = f_{\eta_{-1j}} + 2\epsilon\rho_j e_j$. Using the lower bound $f_{\eta_{-1j}}(x) \ge \frac12$ for all $x \in \mathcal X$ and the inequality $1 + x \le \exp(x)$, we get
$$\int \frac{f_{\eta_{1j}}^2}{f_{\eta_{-1j}}}\, d\mu = 1 + \int \frac{4\epsilon^2\rho_j^2 e_j^2(x)}{f_{\eta_{-1j}}(x)}\, \mu(dx) \le 1 + 8\epsilon^2\rho_j^2,$$
and the same bound for the term with the roles of $\eta_{1j}, \eta_{-1j}$ exchanged, so that $(\star\star) \le \frac14\big(1 + e^{8\epsilon^2 n\rho_j^2}\big)$. Recalling that the left-hand side is a probability and hence at most $1 = \frac14(1 + e^{\ln 3})$, we obtain
$$\mathbb P(\eta_j\eta'_j = 1\,|\,\eta_{-j},\eta'_{-j}) \le \frac14\big(1 + e^{8\epsilon^2 n\rho_j^2 \wedge \ln 3}\big).$$
Putting it together and applying Lemma E.5 with $a = \delta^2\epsilon^2 m\rho_j^2$ and $b = 8\epsilon^2 n\rho_j^2$, we get
$$\text{LHS of (29)} \le \big(e^{\delta^2\epsilon^2 m\rho_j^2} - e^{-\delta^2\epsilon^2 m\rho_j^2}\big)\,\frac14\big(1 + e^{8\epsilon^2 n\rho_j^2 \wedge \ln 3}\big) + e^{-\delta^2\epsilon^2 m\rho_j^2} \le e^{c\,\delta^2\epsilon^4\rho_j^4(\delta^2 m^2 + mn)}$$
for the universal constant $c = 16 > 0$. Thus, by Lemma E.2 we obtain
$$1 - \mathrm{TV}(\mathbb E P_0, \mathbb E P_1) \ge \frac12\,\frac{1}{1 + \chi^2(\mathbb E P_0\,\|\,\mathbb E P_1)} \ge \frac12\exp\big(-c\,\delta^2\epsilon^4(\delta^2 m^2 + mn)/\|\lambda\|_{2,J}^2\big) \overset{!}{=} \Omega(\alpha).$$
The necessity of
$$mn + \delta^2 m^2 \gtrsim \log(1/\alpha)\,\frac{\|\lambda\|_{2,J}^2}{\delta^2\epsilon^4}$$
follows immediately.³

Lemma E.5. For $a, b \ge 0$, the following inequality holds:
$$\tfrac14(e^a - e^{-a})\big(1 + e^{b \wedge \ln 3}\big) + e^{-a} \le e^{2(ab + a^2)}.$$

Proof. If $b \ge \ln 3$ or $a \ge 1$, the left-hand side is at most $\tfrac14(e^a - e^{-a})(1 + e^{\ln 3}) + e^{-a} = e^a$, and $e^a \le e^{2(ab + a^2)}$ since either $a \ge 1$ (so $a \le 2a^2$) or $b \ge \ln 3 > 1$ (so $a \le 2ab$). If $b < \ln 3$ and $a < 1$, we use the bounds $e^b - 1 \le 2b$ (valid on $[0, \ln 3]$ since $\ln 3 > 1$), $e^a + e^{-a} \le 2e^{a^2}$ and $e^a - e^{-a} \le 2a\,e^{a^2}$ to get
$$\tfrac14(e^a - e^{-a})(1 + e^b) + e^{-a} = \tfrac14(e^a - e^{-a})(e^b - 1) + \tfrac12(e^a + e^{-a}) \le ab\,e^{a^2} + e^{a^2} \le e^{ab + a^2} \le e^{2(ab + a^2)}.$$

³We have $mn + m^2 \le (\sqrt{mn} + m)^2 \le 2(mn + m^2)$, so $\sqrt{mn} + m \asymp \sqrt{mn + m^2}$ and the condition may equivalently be stated in that form.

F Proofs From Section 4

F.1 Computing σ̂

We follow the implementation of $\hat\sigma^2$ in [39]. Given $X_1^{\mathrm{tr}}, \dots, X_{n_{\mathrm{tr}}}^{\mathrm{tr}}$ sampled from $P_X$ and $Y_1^{\mathrm{tr}}, \dots, Y_{n_{\mathrm{tr}}}^{\mathrm{tr}}$ sampled from $P_Y$, denote
$$H_{ij} := K(X_i^{\mathrm{tr}}, X_j^{\mathrm{tr}}) + K(Y_i^{\mathrm{tr}}, Y_j^{\mathrm{tr}}) - K(X_i^{\mathrm{tr}}, Y_j^{\mathrm{tr}}) - K(Y_i^{\mathrm{tr}}, X_j^{\mathrm{tr}}), \qquad i, j \in [n_{\mathrm{tr}}]. \tag{30}$$
Then $\hat\sigma^2$ is computed via
$$\hat\sigma^2(X^{n_{\mathrm{tr}}}, Y^{n_{\mathrm{tr}}}; K) = \frac{4}{n_{\mathrm{tr}}^3}\sum_{i=1}^{n_{\mathrm{tr}}}\Big(\sum_{j=1}^{n_{\mathrm{tr}}} H_{ij}\Big)^2 - \frac{4}{n_{\mathrm{tr}}^4}\Big(\sum_{i,j=1}^{n_{\mathrm{tr}}} H_{ij}\Big)^2.$$
Note that $\hat\sigma^2$ being non-negative follows from the AM-GM inequality.

F.2 Heuristic Justification of the Objective (8)

As usual, let $X, Y, Z$ denote samples of sizes $n, n, m$ from $P_X, P_Y, P_Z$ respectively. Let us give a heuristic justification for using the training objective defined in (8) for the purpose of obtaining a kernel for LFHT/mLFHT; note that it was originally proposed as a training objective for kernels to be used in two-sample testing. Recall that our test for LFHT can be written as
$$\Psi_{1/2}(X, Y, Z) = \mathbb 1\{T_{\mathrm{LF}} \le 0\}, \qquad T_{\mathrm{LF}} = \mathrm{MMD}^2_u(\hat P_Z, \hat P_Y; K) - \mathrm{MMD}^2_u(\hat P_Z, \hat P_X; K).$$
Heuristically, to maximize the power of the test for (mLFHT), we would like to maximize the population quantity
$$J_{\mathrm{LF}} \triangleq \frac{\mathbb E_0[T_{\mathrm{LF}}] - \mathbb E_1[T_{\mathrm{LF}}]}{\sqrt{\mathrm{var}_0(T_{\mathrm{LF}})}},$$
where
$$\mathbb E_0[T_{\mathrm{LF}}] = \mathbb E_{X,Y,Z}[T_{\mathrm{LF}}\,|\,P_Z = P_X] = +\mathrm{MMD}^2(P_X, P_Y; K), \qquad \mathbb E_1[T_{\mathrm{LF}}] = \mathbb E_{X,Y,Z}[T_{\mathrm{LF}}\,|\,P_Z = P_Y] = -\mathrm{MMD}^2(P_X, P_Y; K).$$
Let $T_{\mathrm{TS}} = \mathrm{MMD}^2_u(\hat P_X, \hat P_Y)$ be the usual statistic that is thresholded for two-sample testing.
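Before completing the heuristic, here is a minimal sketch (ours) of the estimator $\hat\sigma^2$ of Appendix F.1 and of the resulting objective $\hat J$ of (8), computed from the Gram blocks of a candidate kernel. The exact constants and the small regularizer follow our reading of [39] and should be treated as assumptions.

```python
# Sketch (ours): J-hat = (unbiased MMD^2 estimate) / sigma-hat, as in (8).
import numpy as np

def objective_J(Kxx, Kyy, Kxy, reg=1e-8):
    """Training objective from the Gram blocks of a kernel K."""
    n = Kxx.shape[0]
    H = Kxx + Kyy - Kxy - Kxy.T             # H_ij as in (30)
    np.fill_diagonal(H, 0.0)                # drop diagonal for the unbiased MMD
    mmd2 = H.sum() / (n * (n - 1))          # unbiased estimate of MMD^2
    row = H.sum(axis=1)
    var = 4.0 / n**3 * (row**2).sum() - 4.0 / n**4 * H.sum()**2
    return mmd2 / np.sqrt(max(var, 0.0) + reg)

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (256, 4)), rng.normal(0.5, 1, (256, 4))
lin = lambda A, B: A @ B.T                  # toy kernel; a deep kernel in practice
print(objective_J(lin(X, X), lin(Y, Y), lin(X, Y)))
```

In practice one maximizes this quantity over the kernel parameters ω by stochastic gradient ascent, as in Phase 1 of Algorithm 1.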
A computation analogous to that in Section D.2 shows (cf. (22)) that
$$\mathrm{var}_0(T_{\mathrm{LF}}) \asymp \frac{A(K, P_X, P_Y)}{n} + \frac{A(K, P_X, P_Y)}{m} + \frac{B(K, P_X, P_Y)}{n^2} + \frac{B(K, P_X, P_Y)}{nm}, \qquad \mathrm{var}_0(T_{\mathrm{TS}}) \asymp \frac{A(K, P_X, P_Y)}{n} + \frac{B(K, P_X, P_Y)}{n^2},$$
for some $A(K)$ and $B(K)$. Therefore, we have approximately
$$J_{\mathrm{LF}} \approx \frac{2\,\mathrm{MMD}^2(P_X, P_Y; K)}{\sqrt{(1 + n/m)\,\mathrm{var}_0(T_{\mathrm{TS}})}} \approx 2\sqrt{\frac{m}{m+n}}\; \hat J(X, Y; K),$$
which only differs from our optimization objective defined in (8) by a constant factor. Second, notice that $\mathrm{MMD}^2(P_X, P_Y; K)/\sqrt{\mathrm{var}_0(T_{\mathrm{TS}})}$ depends only on the difference $P_X - P_Y$, and that $((1-\delta)P_X + \delta P_Y) - P_X = \delta(P_Y - P_X)$ is proportional to $P_Y - P_X$. Therefore it is sensible to use (8) as our training objective also for (mLFHT), and we do not even need to observe the sample $Z$.

F.3 Proof of Proposition 4.1

Proof. In this proof we regard $D \triangleq (X^{\mathrm{tr}}, X^{\mathrm{ev}}, Y^{\mathrm{tr}}, Y^{\mathrm{ev}})$ and the parameters $\omega$ of the kernel as fixed. Recall that we are looking at the problem (mLFHT) with misspecification parameter $R = 0$ (see Theorem 3.1). Given a test set $\{z_i\}_{i\in[m]}$, our test statistic is $T(\{z_i\}_{i\in[m]}) = \frac1m\sum_{i=1}^m f(z_i)$, where
$$f(z) = \frac{1}{n_{\mathrm{ev}}}\sum_{j=1}^{n_{\mathrm{ev}}}\big(K_\omega(z, Y_j^{\mathrm{ev}}) - K_\omega(z, X_j^{\mathrm{ev}})\big).$$
In Phase 3 of Algorithm 1, we observe the value $\hat T = T(Z) = \frac1m\sum_{i=1}^m f(Z_i)$ and reject the null hypothesis for large values of $\hat T$. Thus, the p-value is defined as $p = p(Z, D) \triangleq \mathbb P_{\tilde Z \sim P_X^{\otimes m}}(T(\tilde Z) > \hat T)$. Phase 2 of our Algorithm 1 produces random variables $T_1, \dots, T_k$ that all have the distribution of $T(\{\tilde Z_i\}_{i\in[m]})$, so that $\mathbb 1\{T_r \ge \hat T\}$ ($r = 1, \dots, k$) are unbiased estimates of the p-value. However, the $T_r$ are not independent, because they sample from the finite collection of calibration samples $X^{\mathrm{cal}}$. As $n_{\mathrm{cal}} \to \infty$, the covariances between $T_{r_1}, T_{r_2}$ for $r_1 \ne r_2$ tend to zero, and we obtain a consistent estimate of $p$.

F.4 Proof of Proposition 4.2

Proof. The test statistic $T(X, Y, Z)$ in (3) is given by $T(X, Y, Z) = \frac1m\sum_{i=1}^m f_K(Z_i)$, where $f_K(z) = \theta_{\hat P_Y}(z) - \theta_{\hat P_X}(z)$. For a kernel of the form $K(x, y) = f(x)f(y)$ this simplifies to
$$f_K(z) = \frac1n\Big(\sum_{j=1}^n f(Y_j) - \sum_{j=1}^n f(X_j)\Big) f(z) = C(X, Y)\, f(z),$$
where $C(X, Y)$ does not depend on $z$. Therefore, for any witness function $f$, we obtain the desired additive test.

F.5 Additive Test Statistics

In this section we verify that the test statistics of all of MMD-M/G/O, SCHE, LBI, UME and RFM are of the form $T_f(Z) = \frac1m\sum_{i=1}^m f(Z_i)$ (where $f$ may depend on $X, Y$). The test compares $T_f(Z)$ with some threshold $\gamma(X, Y)$. Note that in the setting of Algorithm 1, the $X$ and $Y$ here correspond to $X^{\mathrm{ev}}$ and $Y^{\mathrm{ev}}$.

MMD-M/G/O As described in (3), we have
$$T_f(Z) = \frac{1}{nm}\sum_{i=1}^m\sum_{j=1}^n\big(K(Z_i, Y_j) - K(Z_i, X_j)\big).$$

SCHE As described in Section 4.2, we have $T_f(Z) = \frac1m\sum_{i=1}^m \mathbb 1\{\phi(Z_i) > t\}$.

LBI As described in Section 4.2, we have $T_f(Z) = \frac1m\sum_{i=1}^m \log\frac{\phi(Z_i)}{1 - \phi(Z_i)}$.

UME As described in [31], the UME statistic evaluates the squared witness function at $J_q$ test locations $W = \{w_k\}_{k=1}^{J_q} \subseteq \mathcal X$. Formally, for any two distributions $P, Q$ we define
$$U^2(P, Q) = \|\theta_Q - \theta_P\|^2_{L^2(W)} = \frac{1}{J_q}\sum_{k=1}^{J_q}\big(\theta_Q(w_k) - \theta_P(w_k)\big)^2.$$
However, we note a crucial difference: the result of [31] only considers the case $n = m$, and their proposed estimator of $U^2(P_Z, P_X)$ cannot be naturally extended to $n \ne m$. Here we generalize it to $m \ne n$ by (conveniently) using a biased estimate of their distance. Given samples $X, Y, Z$ and a set of witness locations $W$, the test statistic is a (biased yet) consistent estimator of $U^2(P_Z, P_Y) - U^2(P_Z, P_X)$. Let $\psi_W(z) = \frac{1}{\sqrt{J_q}}(K(z, w_1), \dots, K(z, w_{J_q})) \in \mathbb R^{|W|}$ be the feature function; then
$$\hat U^2(Z, X) = \Big\|\frac1m\sum_{i=1}^m\psi_W(Z_i)\Big\|^2 + \Big\|\frac1n\sum_{j=1}^n\psi_W(X_j)\Big\|^2 - \frac{2}{mn}\sum_{1\le i\le m,\,1\le j\le n}\big\langle\psi_W(Z_i), \psi_W(X_j)\big\rangle.$$
Here $\langle\cdot,\cdot\rangle$ denotes the usual inner product. Therefore, the difference between the distances is
$$\hat U^2(Z, Y) - \hat U^2(Z, X) = \Big\langle\frac1m\sum_{i=1}^m\psi_W(Z_i),\ \frac2n\sum_{j=1}^n\big(\psi_W(X_j) - \psi_W(Y_j)\big)\Big\rangle + F(X, Y),$$
where $F$ is a function based only on $X, Y$. This is clearly an additive statistic in $Z$.
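A minimal sketch (ours) of this generalized, biased UME difference statistic follows; the Gaussian kernel and the choice of test locations $W$ are placeholder assumptions.

```python
# Sketch (ours) of the biased UME difference of Appendix F.5, valid for m != n.
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma**2)

def ume_diff(X, Y, Z, W, sigma=1.0):
    """Biased estimate of U^2(P_Z, P_Y) - U^2(P_Z, P_X)."""
    psi = lambda A: gaussian_gram(A, W, sigma) / np.sqrt(len(W))  # rows psi_W
    mz = psi(Z).mean(axis=0)              # embedding of Z at the test locations
    mx, my = psi(X).mean(axis=0), psi(Y).mean(axis=0)
    return np.sum((mz - my)**2) - np.sum((mz - mx)**2)

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (400, 2)), rng.normal(0.5, 1, (400, 2))
Z = rng.normal(0.5, 1, (150, 2))          # truly drawn from P_Y
W = np.concatenate([X[:8], Y[:8]])        # Jq = 16 witness locations
print(ume_diff(X, Y, Z, W) < 0)           # True: Z is closer to P_Y
```

Since $\|m_Z\|^2$ cancels in the difference, the statistic is linear in the features of the $Z_i$, which is exactly the additivity claimed above.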
RFM Algorithm 1 in [44] describes a method for learning a kernel from data given a binary classification task. For convenience, let us concatenate the data into $X^{\mathrm{RFM}} = (X, Y) \in \mathbb R^{2n\times d}$ with labels $y^{\mathrm{RFM}} = (0_n, 1_n) \in \mathbb R^{1\times 2n}$. Given a learned kernel $K$, we write the Gram matrix as $(K(X^{\mathrm{RFM}}, X^{\mathrm{RFM}}))_{i,j} = K(X_i^{\mathrm{RFM}}, X_j^{\mathrm{RFM}})$ ($1 \le i, j \le 2n$), and let $K(X^{\mathrm{RFM}}, z)$ be the column vector with components $K(X_i^{\mathrm{RFM}}, z)$ ($1 \le i \le 2n$). The classifier is then defined as
$$f_{\mathrm{RFM}}(z) = y^{\mathrm{RFM}}\, K(X^{\mathrm{RFM}}, X^{\mathrm{RFM}})^{-1}\, K(X^{\mathrm{RFM}}, z). \tag{32}$$
Though in [44] the kernel learned from RFM is used to construct a classifier as in Equation (32), RFM is a feature-learning method, so we also apply the RFM kernel in our MMD test, namely
$$f_{\mathrm{RFM\to MMD}}(z) = \frac1n\sum_{j=1}^n\big(K(z, Y_j) - K(z, X_j)\big).$$

G Application: Diffusion Models vs CIFAR

We defer more fine-grained details to our code submission, which includes executable programs (with PyTorch) once the data-generating script from DDPM has been run (see the README in the ./codes/CIFAR folder).

Figure 5: Data visualization for CIFAR-10 (left) vs DDPM diffusion generated images (right)

G.1 Dataset Details

We use the CIFAR-10 dataset, available online at https://www.cs.toronto.edu/~kriz/cifar.html, which contains 50000 colored images of size 32 × 32 in 10 classes. For the diffusion-generated images, we use the SOTA Hugging Face model (DDPM) that can be found at https://huggingface.co/google/ddpm-CIFAR-10-32. We generated 10000 artificial images for our experiments; the code can be found in our code supplement. For dataset balancing, we randomly shuffled the CIFAR-10 dataset and used 10000 images as data in our code. Most of our experiments take the null $P_X$ to be CIFAR images and the alternative to be $P_Y = \frac23\,\mathrm{CIFAR} + \frac13\,\mathrm{DDPM}$. To this end, we assigned 20000 CIFAR images to the alternative hypothesis and kept the remaining 30000 in the null hypothesis. For the alternative dataset, we simply sample without replacement from the 20000 + 10000 image bank. This sampled distribution is almost identical to true mixture sampling as long as the bank is large compared to the acquired data, so that each item in the alternative sample is DDPM-generated with probability close to 1/3, which is indeed the case here (a short sketch of this scheme is given at the end of this subsection).

G.2 Experiment Setup and Benchmarks

We use a standard deep conv-net [40], which has been employed for SOTA GAN-discriminator tasks in similar settings. It has four convolutional layers and one fully connected layer outputting a feature space of size (300, 1). For SCHE and LBI, we simply add a linear layer of size (300, 2) after applying ReLU to the 300-dimensional layer, and train the network with the cross-entropy loss; note that this is equivalent to first fixing the feature space and then performing logistic regression on it. For kernels, we add extra trainable parameters after the 300-dimensional feature output. For the MMD-based tests, we simply train the kernel on the neural net and evaluate our objective. For UME, we use a slightly generalized version of the original statistic in [31], which allows comparison at randomly selected witness locations under the null hypothesis with $m \ne n$ (see Appendix F.5). The kernel is trained using our heuristic (see (8) and Appendix F.2), with MMD replaced by UME; the formula for the UME variance can be found in [31]. For RFM, we use Algorithm 1 in [44] to learn a kernel on (stochastically batched) samples, and then apply our MMD test with the trained kernel.
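For concreteness, here is the promised minimal sketch (ours) of the sampling scheme of Appendix G.1; the arrays standing in for the CIFAR-10 images and the DDPM generations are random dummies, so only the bookkeeping is meaningful.

```python
# Sketch (ours) of the null/alternative dataset split of Appendix G.1.
import numpy as np

rng = np.random.default_rng(0)
cifar = rng.normal(size=(50000, 8))       # stand-in for 50000 CIFAR-10 images
ddpm = rng.normal(size=(10000, 8))        # stand-in for 10000 DDPM generations

idx = rng.permutation(50000)
null_bank = cifar[idx[:30000]]            # P_X: pure CIFAR images
alt_bank = np.concatenate([cifar[idx[30000:]], ddpm])  # 20000 CIFAR + 10000 DDPM

def sample_alt(m):
    """Draw m points from P_Y ~ (2/3) CIFAR + (1/3) DDPM, without replacement."""
    return alt_bank[rng.choice(len(alt_bank), size=m, replace=False)]

print(sample_alt(300).shape)              # (300, 8)
```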
We use 80 training epochs for most runs of the CNN architecture (for classifiers, this is well past interpolation of the training data and roughly when the validation loss stops decreasing), and a batch size of 32, which has a slight empirical benefit compared to larger batch sizes. The learning rates are tuned separately for the MMD methods, whereas for classifiers they follow the discriminator's original setting from [40]. In Phase 2 of Algorithm 1, we choose $k = 1000$ for the desired precision while not compromising runtime.

Figure 6: [Panels show, as functions of the sample size $m$: the rejection rate at $\alpha = 0.05$, the false positive rate, the expected p-value, and the error probability at threshold 0.] Relevant plots following the setting of Figure 3 (in the main text), fixing $n_{\mathrm{tr}} = 1920$ and varying the sample size $m$ on the x-axis, for the comparison with the missing benchmarks. Error bars show the standard deviation across 10 runs. We replaced part (d) of Figure 3 (in the main text) with a sanity check of our FPR when thresholding at $\alpha = 0.05$.

For each task, we run 10 independent models and report their performance as the mean and standard deviation of those 10 runs. We refer to our code implementation for the full set of hyper-parameters. Our code is implemented in Python 3.7 (PyTorch 1.1) and was run on an NVIDIA RTX 3080 GPU equipped with a standard torch library and dataset extensions. Our code setup for feature extraction is similar to that of [39]. For benchmark implementations, our code follows the original code templates provided by the cited papers.

G.3 Sample Allocation

We comment on why (4) differs from simply thresholding $\widehat{\mathrm{MMD}}^2(Z, Y^{\mathrm{tr}}) - \widehat{\mathrm{MMD}}^2(Z, X^{\mathrm{tr}})$ at 0, which is what we did in part (c) of Figure 3 (and hence the difference along the curve of MMD-M vs Figure 1.2). Our theory assumes that the samples are i.i.d. conditioned on the kernel already being chosen. In the experiments, however, the kernel depends on the training data; therefore, to evaluate the MMD estimate (between experiments), one needs extra data that does not intersect the training set. In fact, comparing Figure 1.2 and Figure 2(c) shows experimentally that reusing training data for evaluation (while reducing the sample requirement on $n_{\mathrm{ev}}$) hurts performance. Indeed, we found that when $X^{\mathrm{ev}}, Y^{\mathrm{ev}}$ do not intersect the training data, performance is (almost) always better, at the cost of increasing the overall sample complexity in $n$.

G.4 Remarks on Results

Figure 6 lists all of our benchmarks in the setting of Figure 3 (in the main text) for the missing benchmarks, where the last panel is replaced by the false positive rate when thresholding at $\alpha = 0.05$, to verify our results. As mentioned in the main text, our MMD-M method consistently outperforms the other benchmarks on both the expected p-value (under the alternative) and the rejection rate at $\alpha = 0.05$, while all of our tests exhibit an empirical false positive rate close to $\alpha = 0.05$ (part (b)), demonstrating the consistency of the methods.

H Application: Higgs-Boson Detection

H.1 Dataset Details

We use the Higgs dataset, available online at http://archive.ics.uci.edu/ml/datasets/HIGGS, produced using Monte Carlo simulations [5]. The dataset is nearly balanced, containing 5,829,122 signal instances and 5,170,877 background instances. Each instance is a 28-dimensional vector, consisting of 28 features. The first 21 features are kinematic properties measured by the detectors in the accelerator, such as momentum and energy.
The last 7 features are invariant masses, derived from the first 21 features.

Figure 7: [Histogram of the 26th feature, the invariant mass $m_{Wbb}$ (normalized), against the frequency of events.] This figure visualizes the distribution of the 26th feature, the invariant mass $m_{Wbb}$. The red and black lines are the histograms of the original dataset (background and signal). We employ MMD-M as a classifier, trained and evaluated using $n_{\mathrm{tr}} = 1.3\times10^6$ and $n_{\mathrm{ev}} = n_{\mathrm{opt}} = 2\times10^4$ through Algorithm 3. The blue (green) line represents all instances $z$ whose witness scores $f(z; X^{\mathrm{ev}}, Y^{\mathrm{ev}})$ are larger (smaller) than $t_{\mathrm{opt}}$.

H.2 Experiment Setup and Training Models

The modified Algorithm 1 is shown in Algorithm 2 and Algorithm 3. Compared with Algorithm 2, Algorithm 3 additionally implements the thresholding trick (Section 4.3).

H.2.1 Configuration and Model Architecture

We implement all methods in Python 3.9 and PyTorch 1.13 and run them on an NVIDIA Quadro RTX 8000 GPU. For all classifier-based methods in this study (SCHE and LBI), we adopt the same architecture as previously proposed in [5]: six-layer neural networks with 300 hidden units in each layer, all employing the tanh activation function. For SCHE, the output layer is a single sigmoid unit, and we use the binary cross-entropy loss for training. For LBI, the output layer is a linear unit, and we use the binary cross-entropy loss combined with a logit function (which is more numerically stable than a sigmoid layer followed by a cross-entropy loss). For all MMD-based methods (MMD-M, MMD-G, MMD-O, and UME), the feature networks are six-layer neural networks with 300 ReLU units in each layer; the feature space, which is the output of the network $\phi$, is 100-dimensional. UME has the same kernel architecture as MMD-M, and the number of test locations is set to $J_q = 4096$. For RFM, we adopt the same architecture as in [44], where the kernel is $K_M(x, y) = \exp(-\gamma\,(x-y)^T M (x-y))$ with a constant $\gamma$ and a learnable positive semi-definite matrix $M$; we set $\gamma = 1$. The neural networks are initialized using the default setting in PyTorch, and the bandwidths are initialized using the median heuristic [22]. The parameter $\tau$ is initially set to 0.5. For UME, the witness locations $W$ are initialized by random sampling from the training set. For RFM, the initial $M$ equals the median bandwidth times the identity matrix.

H.2.2 Training

The size of our training set, denoted $n_{\mathrm{tr}}$, varies from $1.0\times10^2$ to $1.6\times10^6$. For a given $n_{\mathrm{tr}}$, we select the first $n_{\mathrm{tr}}$ datapoints from each class of the Higgs dataset to form $X^{\mathrm{tr}}$ and $Y^{\mathrm{tr}}$, i.e., $|X^{\mathrm{tr}}| = |Y^{\mathrm{tr}}| = n_{\mathrm{tr}}$. Subsequently, we randomly select $n_{\mathrm{validation}} = \min(\sqrt{10\,n_{\mathrm{tr}}},\ 0.1\,n_{\mathrm{tr}})$ points from each of $X^{\mathrm{tr}}, Y^{\mathrm{tr}}$ to constitute the validation set, while the remainder of $X^{\mathrm{tr}}, Y^{\mathrm{tr}}$ is used for running gradient descent. The optimizer is minibatch SGD with a batch size of 1024, a learning rate of 0.001, and a momentum of 0.99. Training is halted once the validation loss has stopped decreasing for 10 epochs; we then choose the checkpoint (saved after each epoch) with the smallest validation loss thus far as our trained model. Beyond the general setting above, a batch size of 1024 does not work well for RFM, and we instead use a batch size of 20,000.

H.3 Evaluating the Performance

H.3.1 Evaluating the p-Value with the Methodology of Algorithm 1

We call
$$f(z; X^{\mathrm{ev}}, Y^{\mathrm{ev}}) = \frac{1}{n_{\mathrm{ev}}}\sum_{i=1}^{n_{\mathrm{ev}}}\big(k(z, Y_i^{\mathrm{ev}}) - k(z, X_i^{\mathrm{ev}})\big) \tag{33}$$
the witness score of an instance $z \in \mathcal X$. For a vector of instances $Z = (Z_1, \dots, Z_m)$, we write $f(Z; X^{\mathrm{ev}}, Y^{\mathrm{ev}}) = (f(Z_1; X^{\mathrm{ev}}, Y^{\mathrm{ev}}), \dots, f(Z_m; X^{\mathrm{ev}}, Y^{\mathrm{ev}}))$.
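A minimal sketch (ours) of the witness score (33) and of the Gaussian-approximation significance computed in Phases 2-3 of Algorithm 2 follows; the Gaussian kernel stands in for the trained kernel $K_\omega$, and the data are synthetic placeholders.

```python
# Sketch (ours): witness scores (33) and Z_discovery = (T - theta0)/(sigma0/sqrt(m)).
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma**2)

def witness(Z, Xev, Yev, sigma=1.0):
    """f(z; Xev, Yev) = mean_i k(z, Yev_i) - mean_i k(z, Xev_i), as in (33)."""
    return (gaussian_gram(Z, Yev, sigma).mean(axis=1)
            - gaussian_gram(Z, Xev, sigma).mean(axis=1))

def significance(Ztest, Xev, Yev, Xcal, sigma=1.0):
    """Gaussian-approximation significance, Phases 2-3 of Algorithm 2."""
    s0 = witness(Xcal, Xev, Yev, sigma)     # calibration scores under the null
    T = witness(Ztest, Xev, Yev, sigma).mean()
    return (T - s0.mean()) / (s0.std() / np.sqrt(len(Ztest)))

rng = np.random.default_rng(0)
Xev, Yev = rng.normal(0, 1, (2000, 4)), rng.normal(0.3, 1, (2000, 4))
Xcal = rng.normal(0, 1, (2000, 4))        # background-only calibration data
Ztest = rng.normal(0.3, 1, (1100, 4))     # data containing the signal
print(significance(Ztest, Xev, Yev, Xcal))
```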
The testing procedure is summarized in Phases 2, 3 and 4 of Algorithm 2 and Algorithm 3. In the Higgs experiment, we use the Gaussian-approximation method to determine the p-values when the witness function $f$ is not thresholded, which allows us to reach very small p-values and errors under limited computational resources. In cases where the score function $f$ is thresholded at a value $t$, using the Binomial distribution as in Algorithm 3 is more precise and also fast enough. Given a kernel $K$ trained on $X^{\mathrm{tr}}$ and $Y^{\mathrm{tr}}$, we set $X^{\mathrm{ev}} = X^{\mathrm{tr}}$ and $Y^{\mathrm{ev}} = Y^{\mathrm{tr}}$, and accordingly $n_{\mathrm{ev}} = n_{\mathrm{tr}}$. This results in a more efficient use of data (since we reuse $X^{\mathrm{tr}}, Y^{\mathrm{tr}}$ also as $X^{\mathrm{ev}}, Y^{\mathrm{ev}}$). Then, out of the untouched portion of the data, we randomly choose $n_{\mathrm{cal}} = 20{,}000$ datapoints from both classes to populate $X^{\mathrm{cal}}$ and $Y^{\mathrm{cal}}$, i.e., $|X^{\mathrm{cal}}| = |Y^{\mathrm{cal}}| = n_{\mathrm{cal}} = 20{,}000$. In addition to the general setting above, for RFM we need to solve a $2n_{\mathrm{ev}}$-dimensional linear system during inference, which arises from the matrix inverse in Equation (32) (solving $K(X^{\mathrm{RFM}}, X^{\mathrm{RFM}})\,u = (y^{\mathrm{RFM}})^T$ for $u \in \mathbb R^{2n_{\mathrm{ev}}}$). We therefore set $n_{\mathrm{ev}} = \min(n_{\mathrm{tr}}, 10{,}000)$, with $X^{\mathrm{ev}}, Y^{\mathrm{ev}}$ randomly sampled from the training set.

In order to compare the different benchmarks, we evaluate the expected significance of discovery on a mixture of 1000 background and 100 signal events. For each benchmark and each $n_{\mathrm{tr}}$, we train 10 independent models; for each trained model, we run Phases 2, 3 (and 4) of Algorithm 2 and Algorithm 3 ten times, with 10 different draws of $(X^{\mathrm{ev}}, X^{\mathrm{cal}}, X^{\mathrm{opt}}, Y^{\mathrm{ev}}, Y^{\mathrm{cal}}, Y^{\mathrm{opt}})$. The mean and standard deviation over these 100 runs are reported in Figure 8. We also display in Figure 9 the trade-offs $(m, n_{\mathrm{ev}})$ and $(m, n_{\mathrm{tr}})$ required to reach certain levels of significance of discovery with MMD-M. From the bottom-left plot, we see that the (averaged) significance is not sensitive to $n_{\mathrm{ev}}$ once $\lg n_{\mathrm{ev}}$ is large, so taking $n_{\mathrm{ev}} = 20{,}000$ is sufficient.

H.3.2 Evaluating the Error of the Test (4)

We set the parameters to $\delta = 0.1$ and $\pi = \delta/2$ in our experiments. As explained in Appendix G.3, here we no longer take $X^{\mathrm{ev}} = X^{\mathrm{tr}}$: empirically, taking $X^{\mathrm{ev}} = X^{\mathrm{tr}}$ yields a very bad threshold $\gamma(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, \pi)$.⁴ Instead, $X^{\mathrm{ev}}$ is sampled from untouched datapoints outside $X^{\mathrm{tr}}$, and the same applies to $Y$. We still take $n_{\mathrm{ev}} = n_{\mathrm{tr}}$ here, resulting in a total size of $n_{\mathrm{ev}} + n_{\mathrm{tr}} = 2n_{\mathrm{tr}}$. When $n_{\mathrm{ev}} \ge 10{,}000$, computing an $n_{\mathrm{ev}}\times n_{\mathrm{ev}}$ Gram matrix becomes computationally expensive, so we adopt a Monte Carlo method to compute $\gamma(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, \pi)$: we subsample 10,000 points from $X^{\mathrm{ev}}$ and $Y^{\mathrm{ev}}$ to calculate $\gamma$, and repeat this process 100 times. Again, we use the Gaussian approximation. Recall that the test compares $T = \frac1m\sum_{i=1}^m f(Z_i)$ with $\gamma$. The type-I and type-II errors are estimated as
$$1 - \Phi\bigg(\frac{\gamma(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, \pi) - \mathbb E[f\,|\,H_0]}{\sqrt{\mathrm{var}(f\,|\,H_0)/m}}\bigg) \qquad\text{and}\qquad \Phi\bigg(\frac{\gamma(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, \pi) - \mathbb E[f\,|\,H_1]}{\sqrt{\mathrm{var}(f\,|\,H_1)/m}}\bigg),$$
respectively, where $\Phi = \mathrm{CDF}_{N(0,1)}$ and where the means and variances of the witness function $f$ can be estimated efficiently using the calibration samples $X^{\mathrm{cal}}, Y^{\mathrm{cal}}$. We consider both the regime of fixed kernels and that of varying kernels (training the kernel based on $n$). The results are shown in the top plot of Figure 1.2 and the top plot of Figure 9.

⁴If the kernel $K(\cdot,\cdot) = K_{X^{\mathrm{tr}}, Y^{\mathrm{tr}}}(\cdot,\cdot)$ is independent of $X^{\mathrm{ev}}, Y^{\mathrm{ev}}$, then $\gamma(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, \delta/2) \approx \frac12\big(\mathbb E_{Z\sim P_X}[T(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, Z)] + \mathbb E_{Z\sim\delta P_Y + (1-\delta)P_X}[T(X^{\mathrm{ev}}, Y^{\mathrm{ev}}, Z)]\big)$. However, this is no longer true if $(X^{\mathrm{tr}}, Y^{\mathrm{tr}})$ and $(X^{\mathrm{ev}}, Y^{\mathrm{ev}})$ intersect.
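A minimal sketch (ours) of these Gaussian-approximation error estimates follows; the score vectors (witness scores of the null and mixture calibration samples) and the threshold $\gamma$ are assumed given.

```python
# Sketch (ours): estimated type-I/II errors of the test 1{(1/m) sum f(Z_i) > gamma}.
import numpy as np
from scipy.stats import norm

def estimated_errors(s0, s1, gamma, m):
    """Return (type-I, type-II) error estimates via the Gaussian approximation."""
    t1 = 1 - norm.cdf((gamma - s0.mean()) / (s0.std() / np.sqrt(m)))
    t2 = norm.cdf((gamma - s1.mean()) / (s1.std() / np.sqrt(m)))
    return t1, t2

rng = np.random.default_rng(0)
s0 = rng.normal(0.0, 1.0, 20000)          # witness scores of Xcal (null)
s1 = rng.normal(0.5, 1.0, 20000)          # witness scores of the delta-mixture
print(estimated_errors(s0, s1, gamma=0.25, m=100))  # ~ (0.006, 0.006)
```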
For each point on the plot, we train 30 independent models, test each model 10 times, and report the average over these 300 runs. In both plots, we observe the asymmetric m vs n trade-off.

Algorithm 2 Estimate the significance of discovery of an input Z^test, using the original statistic
Input: (X^tr, X^ev, X^cal), (Y^tr, Y^ev, Y^cal); parametrized kernel K_ω; input Z^test.
# Phase 1: Kernel training on X^tr and Y^tr
  ω ← argmax_ω Ĵ(X^tr, Y^tr; K_ω)            # maximize the objective Ĵ(X^tr, Y^tr; K_ω) as in (8)
# Phase 2: Distributional calibration of the test statistic
  Scores(0) ← f(X^cal; X^ev, Y^ev)           # Scores(0) has length n_cal
  Scores(1) ← f(Y^cal; X^ev, Y^ev)           # Scores(1) has length n_cal
  θ₀ ← mean(Scores(0))                       # estimate E[f(Z) | Z ~ P_X]
  θ₁ ← mean(Scores(1))                       # estimate E[f(Z) | Z ~ P_Y]
  σ₀ ← std(Scores(0))                        # estimate sqrt(var[f(Z) | Z ~ P_X])
# Phase 3: Inference on the input Z^test
  m ← length(Z^test)
  T ← T_f(Z^test; X^ev, Y^ev) = mean(f(Z^test; X^ev, Y^ev))   # compute the test statistic
  Z_discovery ← (T − θ₀) / (σ₀/√m)
Output: Estimated significance: Z_discovery

Algorithm 3 Estimate the significance of discovery of an input Z^test, applying the thresholding trick
Input: (X^tr, X^ev, X^cal, X^opt), (Y^tr, Y^ev, Y^cal, Y^opt); parametrized kernel K_ω; input Z^test.
# Phase 1: Kernel training on X^tr and Y^tr
  ω ← argmax_ω Ĵ(X^tr, Y^tr; K_ω)            # maximize the objective Ĵ(X^tr, Y^tr; K_ω) as in (8)
# Phase 2: Find the best threshold
  Scores(0) ← f(X^opt; X^ev, Y^ev)           # witness scores as in (33)
  Scores(1) ← f(Y^opt; X^ev, Y^ev)
  for i = 1, 2, ..., 2n_opt do
      t ← (Scores(0) ∪ Scores(1))[i]
      TP, TN ← mean(Scores(1) > t), mean(Scores(0) < t)   # true-positive and true-negative rates
      power_i ← (TP + TN − 1) / sqrt(TN(1 − TN))          # find t maximizing the (estimated) power
  end for
  t_opt ← (Scores(0) ∪ Scores(1))[argmax_i power_i]
# Phase 3: Distributional calibration of the test statistic (under the null hypothesis)
  Scores(0) ← (f(X^cal; X^ev, Y^ev) > t_opt)  # Scores(0) ∈ {0,1}^{n_cal}
  Scores(1) ← (f(Y^cal; X^ev, Y^ev) > t_opt)  # Scores(1) ∈ {0,1}^{n_cal}
  θ₀ ← mean(Scores(0))                        # estimate E[f_t(Z) | Z ~ P_X] ∈ [0,1]
  θ₁ ← mean(Scores(1))                        # estimate E[f_t(Z) | Z ~ P_Y] ∈ [0,1]
# Phase 4: Inference on the input Z^test
  m ← length(Z^test)
  T ← T_f(Z^test; X^ev, Y^ev) = mean(f(Z^test; X^ev, Y^ev) > t_opt)   # compute the test statistic
  Z_discovery ← CDF⁻¹_{N(0,1)}(CDF_{Bin(m, θ₀)}(T))
Output: Estimated significance: Z_discovery

Figure 8: [Significance of discovery against training set size 2n (in millions), comparing MMD-M with t_opt, MMD-M, MMD-G, MMD-O, UME, SCHE with t_opt, SCHE with t = 0.5, LBI, RFM as classifier, and RFM for MMD.] Complete image of Figure 1.2 in the main text. The mean and standard deviation are calculated over 100 runs. See Appendix H for details.

Figure 9: [Panels plot total error and significance of discovery against lg(m).] The top plot displays the (m, n_ev) trade-off required to reach certain levels of total error using n_tr = 1.3 × 10⁶ in MMD-M. The bottom figures show the trade-offs of (m, n_ev) and (m, n_tr) required to reach certain levels of significance of discovery in MMD-M. In the bottom-left figure, we fix n_tr = 1.3 × 10⁶; in the bottom-right figure, we fix n_ev = 20,000. See Appendix H for details.

I Limitations and Future Directions

Finally, we discuss several limitations of our work and raise open questions that we hope will be addressed in future work. On the theoretical side, we point out several aspects. First, our upper bound on the minimax sample complexity (Theorem 3.1) has a likely sub-optimal dependence on α and δ.
Second, it may be possible to strengthen our lower bound to a more natural form by replacing $\|\lambda\|_{2,J_\epsilon}$ with $\|\lambda\|_2$ and by removing the constraint that the top eigenfunction be constant. Third, it remains open to extend our theory to data-dependent kernels $K$, as opposed to fixed ones. Empirically, our proposed Algorithm 1 can be inefficient in Phase 2 (prior works such as [39] use permutation-based arguments for a more efficient estimate); we adopted our approach for its simplicity and universality across all benchmarks. Moreover, one might hope to extend LFHT/mLFHT to more complex applications, such as text data or videos. These questions are important directions for future work.