# mmd_aggregated_twosample_test__89b04068.pdf Journal of Machine Learning Research 24 (2023) 1-81 Submitted 10/21; Revised 3/23; Published 6/23 MMD Aggregated Two-Sample Test Antonin Schrab a.schrab@ucl.ac.uk Centre for Artificial Intelligence, University College London & Inria London Gatsby Computational Neuroscience Unit, University College London London, WC1V 6LJ, UK Ilmun Kim ilmun@yonsei.ac.kr Department of Statistics & Data Science, Department of Applied Statistics, Yonsei University Seoul, 03722, South Korea M elisande Albert melisande.albert@insa-toulouse.fr Institut de Math ematiques de Toulouse; UMR 5219, Universit e de Toulouse; CNRS, INSA; France B eatrice Laurent beatrice.laurent@insa-toulouse.fr Institut de Math ematiques de Toulouse; UMR 5219, Universit e de Toulouse; CNRS, INSA; France Benjamin Guedj b.guedj@ucl.ac.uk Centre for Artificial Intelligence, University College London & Inria London London, WC1V 6LJ, UK Arthur Gretton arthur.gretton@gmail.com Gatsby Computational Neuroscience Unit, University College London London, W1T 4JG, UK Editor: Ingo Steinwart Abstract We propose two novel nonparametric two-sample kernel tests based on the Maximum Mean Discrepancy (MMD). First, for a fixed kernel, we construct an MMD test using either permutations or a wild bootstrap, two popular numerical procedures to determine the test threshold. We prove that this test controls the probability of type I error nonasymptotically. Hence, it can be used reliably even in settings with small sample sizes as it remains well-calibrated, which differs from previous MMD tests which only guarantee correct test level asymptotically. When the difference in densities lies in a Sobolev ball, we prove minimax optimality of our MMD test with a specific kernel depending on the smoothness parameter of the Sobolev ball. In practice, this parameter is unknown and, hence, the optimal MMD test with this particular kernel cannot be used. To overcome this issue, we construct an aggregated test, called MMDAgg, which is adaptive to the smoothness parameter. The test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We prove that MMDAgg still controls the level non-asymptotically, and achieves the minimax rate over Sobolev balls, up to an iterated logarithmic term. Our guarantees are not restricted to a specific type of kernel, but hold for any product of one-dimensional translation invariant characteristic kernels. We provide a user-friendly parameter-free implementation of MMDAgg using an adaptive collection of bandwidths. We demonstrate that MMDAgg significantly outperforms alternative state-of-the-art MMD-based two-sample tests on synthetic data satisfying the Sobolev smoothness assumption, and that, on real-world image data, MMDAgg closely matches the power of tests leveraging the use of models such as neural networks. Keywords: two-sample testing, kernel methods, minimax adaptivity c 2023 Antonin Schrab, Ilmun Kim, M elisande Albert, B eatrice Laurent, Benjamin Guedj and Arthur Gretton. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1289.html. Schrab, Kim, Albert, Laurent, Guedj and Gretton 1 Introduction 3 2 Background 6 2.1 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Maximum Mean Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Construction of tests and bounds 9 3.1 Assumptions and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2 Non-asymptotic single MMD test with a fixed bandwidth . . . . . . . . . . 11 3.2.1 Permutation approach . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2.2 Wild bootstrap approach . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.3 Single MMD test: definition and level . . . . . . . . . . . . . . . . . 12 3.3 Controlling the power of the single MMD test . . . . . . . . . . . . . . . . . 14 3.4 Uniform separation rate of the single MMD test over a Sobolev ball . . . . 15 3.5 Non-asymptotic MMDAgg test aggregating multiple bandwidths . . . . . . 16 3.6 Uniform separation rate of MMDAgg over Sobolev balls . . . . . . . . . . . 20 4 Related work 22 5 Experiments 25 5.1 Weighting strategies and fixed bandwidth collections for MMDAgg . . . . . 25 5.2 Adaptive parameter-free collection of bandwidths for MMDAgg . . . . . . . 27 5.3 State-of-the-art MMD-based two-sample tests . . . . . . . . . . . . . . . . . 28 5.4 Experimental procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.5 Power experiments on synthetic data . . . . . . . . . . . . . . . . . . . . . . 31 5.6 Power experiments on the MNIST dataset . . . . . . . . . . . . . . . . . . . 35 5.7 Power experiment: continuous limit of the collection of bandwidths . . . . . 37 5.8 Power experiment for image shift detection . . . . . . . . . . . . . . . . . . 38 5.9 Overview of additional experimental results . . . . . . . . . . . . . . . . . . 39 6 Conclusion and future work 39 A Additional experiments 41 A.1 Level experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 A.2 Power experiments: widening the collection of bandwidths . . . . . . . . . . 42 A.3 Power experiments: comparing wild bootstrap and permutations . . . . . . 44 A.4 Power experiments: using unbalanced sample sizes . . . . . . . . . . . . . . 44 A.5 Power experiments: increasing sample sizes for the ost test . . . . . . . . . 45 B Relation between permutations and wild bootstrap 47 C Efficient implementation of MMDAgg (Algorithm 1) 50 D Lower bound on the minimax rate over a Sobolev ball 53 MMD Aggregated Two-Sample Test E Proofs 54 E.1 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 E.2 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 E.3 Proof of Proposition 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 E.4 Proof of Proposition 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 E.5 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 E.6 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 E.7 Proof of Corollary 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 E.8 Proof of Proposition 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 E.9 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 E.10 Proof of Corollary 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 References 78 1. Introduction We consider the problem of nonparametric two-sample testing, where we are given two independent sets of i.i.d. samples, and we want to determine whether these two samples come from the same distribution. This fundamental problem has a long history in statistics and machine learning, with numerous real-world applications in various fields, including clinical laboratory science (Miles et al., 2004), genomics (Chen and Qin, 2010), biology (Fisher et al., 2006), geology (Vermeesch, 2013) and finance (Horv ath et al., 2013). To compare samples from two probability distributions, we use a statistical test of the null hypothesis that the two distributions are equal, against the alternative hypothesis that they are different. Many such tests exist, and rely on different assumptions. If we assume that the two probability distributions are Gaussian with the same variance, then we can perform a Student s t-test (Student, 1908) to decide whether or not to reject the null hypothesis. However, the t-test is parametric in nature, and designed only for comparing two Gaussian distributions. By contrast, our interest is in nonparametric tests, which are sensitive to general alternatives, without relying on specific distributional assumptions. An example of such a nonparametric test is the Kolmogorov Smirnov test (Massey Jr, 1951) which uses as its test statistic the largest distance between empirical distribution functions of the two samples. The limitation of the Kolmogorov Smirnov test, however, is that it applies only to univariate data, and its multivariate extension is challenging (Bickel, 1969). The test statistic we consider is an estimate of the Maximum Mean Discrepancy (MMD Gretton et al., 2007, 2012a) which is a kernel-based metric on the space of probability distributions. The MMD is an integral probability metric (M uller, 1997) and hence is defined as the supremum, taken over a class of smooth functions, of the difference of their expectations under the two probability distributions. This function class is taken to be the unit ball of a characteristic Reproducing Kernel Hilbert Space (Aronszajn, 1950; Fukumizu et al., 2008; Sriperumbudur et al., 2011), so the Maximum Mean Discrepancy depends on the choice of kernel. We work with a wide range of kernels, each parametrised by their bandwidths. There exist several heuristics to choose the kernel bandwidths. In the Gaussian kernel case, for example, bandwidths are often simply set to the median distance between pairs of points from both samples (Gretton et al., 2012a). This strategy for bandwidth choice Schrab, Kim, Albert, Laurent, Guedj and Gretton does not provide any guarantee of optimality, however. In fact, existing empirical results demonstrate that the median heuristic performs poorly (i.e. it leads to low test power) when differences between the two distributions occur at a lengthscale that differs sufficiently from the median inter-sample distance (Gretton et al., 2012b, Figure 1). Ramdas et al. (2015) and Reddi et al. (2015) show that the median heuristic scales as the square root of the dimension, and that using a bandwidth of the higher order with respect to the dimension generally leads to higher power. Another approach is to split the data and learn a good kernel choice on data held out for this purpose (e.g. Gretton et al., 2012b; Liu et al., 2020), however, the resultant reduction in data for testing can reduce overall test power at smaller sample sizes. Our contributions. Having motivated the problem, we summarize our contributions. We first address the case where the smoothness parameter s of the task is known: that is, the distributions being tested have densities in Rd whose difference lies in a Sobolev ball Ss d(R) with smoothness parameter s and radius R. For this setting, we construct a single MMD test that is optimal in the minimax sense over Ss d(R), for a specific choice of bandwidths which depend on s. In practice, s is unknown, and our test must be adaptive to it. We therefore construct a test which is adaptive to s in the minimax sense, by aggregating across tests with different bandwidths, and rejecting the null hypothesis if any individual test (with appropriately corrected level) rejects it. We refer to our proposed MMD aggregated test as MMDAgg. By upper bounding the uniform separation rate of testing of MMDAgg, we prove that it is optimal (up to an iterated logarithmic term) over the Sobolev ball Ss d(R) for any s > 0 and R > 0. For the practical deployment of our test, we require numerical procedures for computing the test thresholds. We may obtain the threshold for a test of level α using either permutations or a wild bootstrap to estimate the (1 α)-quantile of the test statistic distribution under the null hypothesis. We prove that our theoretical guarantees still hold under both test threshold estimation procedures. In the process of establishing these results, we demonstrate the equivalence between using a wild bootstrap and using a restricted set of permutations, which is of independent interest. Through the use of either permutations or a wild bootstrap, we can theoretically guarantee that our proposed single MMD test and MMDAgg test both have non-asymptotic level, which differs from the original MMD test of Gretton et al. (2012a). We believe that this non-asymptotic property of our tests contributes to their real-world applications. Indeed, in practice, settings in which the sample sizes are fixed and may not be assumed to be asymptotically large are very common (e.g. medical data, seismological data, data for materials science, etc.). While existing tests relying on asymptotic may fail to be well-calibrated in those settings, ours are guaranteed to correctly control the test level non-asymptotically. We stress that the implementation of MMDAgg corresponds exactly to the test for which we prove theoretical guarantees: we do not make any further approximations in our implementation, nor do we require any prior knowledge on the underlying distribution smoothness. All our theoretical results hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Our test is, to MMD Aggregated Two-Sample Test the best of our knowledge, the first to be minimax adaptive (up to an iterated logarithmic term) for various kernels, and not only for the Gaussian kernel. Since our approach combines multiple MMD single tests across a large collection of bandwidths, it requires no tuning and our implementation is parameter-free. Furthermore, since we consider various bandwidths simultaneously, our test is adaptive: it performs well both in cases requiring the kernel to have a small bandwidth and in those necessitating a large bandwidth. This means that the same MMDAgg test can detect both local and global differences in densities, which is not the case for a single MMD test with fixed bandwidth. The key contributions of our paper can be summarised as follows. Based on either permutations or a wild bootstrap, we propose a single MMD test and theoretically prove that it has non-asymptotic level. By setting the kernel bandwidth adequately, we show that the test is minimax optimal over a Sobolev ball with known smoothness parameter. In order to be adaptive to this smoothness parameter (which is unknown in practice), we construct a two-sample aggregated MMD test, called MMDAgg, which does not require data splitting. We prove that MMDAgg controls the type I error nonasymptotically, and that it is optimal in the minimax sense (up to an iterated logarithmic term) for a wide range of kernels when using either permutations or a wild bootstrap to estimate the test threshold. In our experiments on synthetic data, on which the Sobolev smoothness assumption holds, we observe that MMDAgg obtains significantly higher power than all other state-of-the-art MMD adaptive tests considered. On real-world image data, MMDAgg almost matches the power obtained by tests leveraging the capacity of models such as neural networks to detect differences in image distributions. The power of MMDAgg is retained even when a large collection of kernels is considered (up to 12000 kernels in the experiments). Experimentally, no cost in power is incurred for aggregating more kernels, the overall power appears to match the highest power achieved with a single kernel while correctly controlling the test level. As such, in practice, the user can consider as many kernels as computationally feasible. We provide a user-friendly parameter-free implementation of MMDAgg, both in Jax and in Numpy, available at https://github.com/antoninschrab/mmdagg-paper. This repository also contains code for the reproducibility of our experiments. Related Works. We present an overview of works related to ours, more details are provided in Section 4. Our non-asymptotic aggregated test, which is minimax adaptive over the Sobolev balls Ss d(R) : s > 0, R > 0 , originates from the works of Fromont et al. (2012, 2013) and Albert et al. (2022). Fromont et al. (2012, 2013) consider the two-sample problem with sample sizes following independent Poisson processes. In this framework, they construct an aggregated test using a different kernel-based estimator with a wild bootstrap. In the multivariate setting, with some condition on the kernel in the Fourier domain, they show minimax adaptivity of their test over Sobolev and anisotropic Nikol skii-Besov balls. Albert et al. (2022) construct an aggregated independence test using Gaussian kernels. The theoretical guarantees for the Schrab, Kim, Albert, Laurent, Guedj and Gretton single MMD tests using a permutation-based threshold are related to the result of Kim et al. (2022, Proposition 8.4), also for Gaussian kernels. Besides treating the adaptive two-sample case, rather than the independence case considered by Albert et al. (2022), the present work builds on these earlier results in two important ways. First, the optimality of our aggregated test is not restricted to the use of a specific kernel; it holds more generally for many popular choices of kernels. Second, our theoretical guarantees for this adaptive test are proved to hold even under practical choices for the test threshold: namely the permutation and wild bootstrap approaches. In this paper, we propose quadratic-time aggregated tests for two-sample testing. This present work, along with those of Albert et al. (2022) and Schrab et al. (2022a) on independence and goodness-of-fit testing, respectively, have been the basis of the later work of Schrab et al. (2022b) who construct efficient (linear-time) variants of these three aggregated tests using incomplete U-statistics, and quantify the trade-offbetween computational efficiency and test power (in terms of uniform separation rate over Sobolev balls). Outline. The paper is organised as follows. In Section 2, we formalize the two-sample problem, review the theory of statistical hypothesis testing, and recall the definition of the Maximum Mean Discrepancy. In Section 3, we construct the MMD single and aggregated (MMDAgg) two-sample tests, provide pseudocode for the latter, and derive theoretical guarantees for both. Having introduced the required terminology, we then discuss how our results relate to other works in Section 4. We run various experiments in Section 5 to evaluate how well MMDAgg performs compared to alternative state-of-the-art MMD adaptive tests. The paper closes with discussions and perspectives in Section 6. Proofs, additional discussions, and further experimental results are provided in the Appendices. 2. Background First, we formalise the two-sample problem in mathematical terms. Two-sample problem. Given independent samples Xm := (Xi)1 i m and Yn := (Yj)1 j n, consisting of i.i.d. random variables with respective probability density functions p and q on Rd with respect to the Lebesgue measure, can we decide whether p = q holds? To tackle this problem, we work in the non-asymptotic framework and construct two nonparametric hypothesis tests in Section 3: a single MMD test for fixed kernel/bandwidth, and MMDAgg which aggregates multiple kernels/bandwidths. In Section 2.1, we first introduce the required notions about hypothesis testing. We then recall the definition of the Maximum Mean Discrepancy and present two estimators for it in Section 2.2. 2.1 Hypothesis testing We use the convention that Pp q denotes the probability with respect to X1, . . . , Xm iid p and Y1, . . . , Yn iid q all independent of each other. If given more random variables, say Z1, . . . , Zt iid r for some probability density or mass function r, we use the notation Pp q r. We follow similar conventions for expectations and variances. We address this two-sample problem by testing the null hypothesis H0 : p = q against the alternative hypothesis Ha : p = q. Given a test which is a function of Xm and Yn, MMD Aggregated Two-Sample Test the null hypothesis is rejected if and only if (Xm, Yn) = 1. The test is usually designed to control the probability of type I error sup p Pp p( (Xm, Yn) = 1) α for a given α (0, 1), where the supremum is taken over all probability densities on Rd. We then say that the test has level α. For all the definitions, if the test depends on other random variables, we take the probability with respect to those too. For a given fixed level α, the aim is then to construct a test with the smallest possible probability of type II error Pp q( (Xm, Yn) = 0) for specific choices of alternatives for which p = q. If this probability is bounded by some β (0, 1), we say that the test has power 1 β against that particular alternative. In the asymptotic framework, for a consistent test and a fixed alternative with p q 2 > 0, we can find large enough sample sizes m and n so that the test has power close to 1 against this alternative. In the non-asymptotic framework of this paper, the sample sizes m and n are fixed. We can then find an alternative with p q 2 small enough so that the test has power close to 0 against this alternative. Given a test , a class of functions C and some β (0, 1), one can ask what the smallest value ρ > 0 is such that the test has power at least 1 β against all alternative hypotheses satisfying p q C and p q 2 > ρ. Clearly, this depends on the sample sizes: as m and n increase, the value of ρ decreases. This motivates the definition of uniform separation rate (Baraud, 2002) ρ( , C, β, M) := inf ρ > 0 : sup (p,q) FM ρ (C) Pp q( (Xm, Yn) = 0) β where FM ρ (C) := {(p, q) : max( p , q ) M, p q C, p q 2 > ρ}. For uniform separation rates, we are mainly interested in the dependence on m + n: for example we will show upper bounds of the form a(m + n) b for positive constants a and b independent of m and n. The greatest lower bound on the uniform separation rates of all tests with non-asymptotic level α (0, 1) is called the minimax rate of testing (Baraud, 2002) ρ(C, α, β, M) := inf α ρ( α, C, β, M), where the infimum is taken over all tests α of non-asymptotic level α for testing H0 : p = q against Ha : p = q, and where we compare uniform separation rates in terms of growth rates as functions of m + n. This is a generalisation of the concept of critical radius introduced by Ingster (1993a,b) to the non-asymptotic framework. A test is optimal in the minimax sense (Baraud, 2002) if its uniform separation rate is upper-bounded up to a constant by the minimax rate of testing. As the class of functions C, we consider the Sobolev ball Ss d(R) := f L1 Rd L2 Rd : Z Rd ξ 2s 2 | bf(ξ)|2 dξ (2π)d R2 (1) with smoothness parameter s > 0, radius R > 0, and where bf denotes the Fourier transform of f, that is, bf(ξ) := R Rd f(x)e ix ξ dx for all ξ Rd. Our aim is to construct a test which Schrab, Kim, Albert, Laurent, Guedj and Gretton achieves the minimax rate of testing over Ss d(R) (up to an iterated logarithmic term) and which does not depend on the smoothness parameter s of the Sobolev ball; such a test is called minimax adaptive. As shown by Li and Yuan (2019, Theorems 3 and 5), the minimax rate of testing over the Sobolev ball Ss d(R) is lower bounded as ρ(Ss d(R), α, β, M) C0(M, d, s, R, α, β) (m + n) 2s/(4s+d) (2) for some constant C0 > 0 depending on α, β (0, 1), d N \ {0} and M, s, R (0, ). Their proof is an extension of the results of Ingster (1987, 1993b) and we provide more details in Appendix D. We later construct a test with non-asymptotic level α and show in Corollary 7 that its uniform separation rate over Ss d(R) with respect to m + n is at most (m + n) 2s/(4s+d), up to some multiplicative constant. This implies that the minimax rate of testing over the Sobolev ball Ss d(R) with respect to m + n is exactly of order (m + n) 2s/(4s+d). 2.2 Maximum Mean Discrepancy As a measure between two probability distributions, we consider the kernel-based Maximum Mean Discrepancy (MMD Gretton et al., 2007, 2012a). In detail, for a given Reproducing Kernel Hilbert Space Hk (Aronszajn, 1950) with kernel k, the MMD can be formalized as the integral probability metric (M uller, 1997) MMD(p, q; Hk) := sup f Hk : f Hk 1 |EX p[f(X)] EY q[f(Y )]|. Our particular interest is in a characteristic kernel k, which guarantees that we have MMD(p, q; Hk) = 0 if and only if p = q. We refer to the works of Fukumizu et al. (2008) and Sriperumbudur et al. (2011) for details on characteristic kernels. It can easily be shown (Gretton et al., 2012a, Lemma 4) that the MMD is the Hk-norm of the difference between the mean embeddings µp(u) := EX p [k(X, u)] and µq(u) := EY q [k(Y, u)] for u Rd. Using this fact, a natural unbiased quadratic-time estimator for MMD2(p, q; Hk) (Gretton et al., 2012a, Lemma 6) is \ MMD 2 a(Xm, Yn; Hk) := 1 m(m 1) 1 i =i m k(Xi, Xi ) + 1 n(n 1) 1 j =j n k(Yj, Yj ) j=1 k(Xi, Yj). (3) This is the minimum variance MMD estimator (Serfling, 1980, Section 5.1.4). As pointed out by Kim et al. (2022), this quadratic-time estimator can be written as a two-sample U-statistic (both of second order) (Hoeffding, 1992) \ MMD 2 a(Xm, Yn; Hk) = 1 m(m 1)n(n 1) 1 j =j n hk(Xi, Xi , Yj, Yj ) (4) MMD Aggregated Two-Sample Test where hk(x, x , y, y ) := k(x, x ) + k(y, y ) k(x, y ) k(x , y) (5) for x, y, x , y Rd. Writing the estimator \ MMD 2 a(Xm, Yn; Hk) as a two-sample U-statistic can be theoretically appealing but we stress the fact that it can be computed in quadratic time using Equation (3). The unnormalised version of the test statistic \ MMD 2 a(Xm, Yn; Hk) was also considered in the work of Fromont et al. (2012). For the special case when m = n, Gretton et al. (2012a, Lemma 6) also propose to consider a different estimator for the Maximum Mean Discrepancy which is the one-sample second-order U-statistic \ MMD 2 b(Xn, Yn; Hk) := 1 n(n 1) 1 i =j n hk(Xi, Xj, Yi, Yj). (6) Note that, unlike the estimator \ MMD 2 a(Xn, Yn; Hk), the estimator \ MMD 2 b(Xn, Yn; Hk) does not incorporate the terms {k(Xi, Yi) : i = 1, . . . , n}. This means that the ordering of Xn = (Xi)1 i n and Yn = (Yj)1 j n changes the estimator \ MMD 2 b(Xn, Yn; Hk). So, when using this estimator, we have to assume we are given a specific ordering of the samples. While \ MMD 2 b has slightly higher variance than \ MMD 2 a, computing \ MMD 2 b is computationally much faster than evaluating \ MMD 2 a, as discussed in Appendix C. The MMD depends on the choice of kernel, which we explore for our hypothesis tests. 3. Construction of tests and bounds This section contains our main contributions. In Section 3.1, we introduce some notation along with technical assumptions for our analysis. We then present in Section 3.2 two data-dependent procedures to construct a single MMD test that makes use of some specific kernel bandwidth. Section 3.3 provides sufficient conditions under which this single MMD test is powerful when the difference in densities is measured in terms of the MMD and of the L2-norm. Based on these preliminary results, we prove an upper bound on its uniform separation rate over the Sobolev ball Ss d(R) in Section 3.4, which shows that for a specific choice of bandwidth, the corresponding single MMD test is optimal in the minimax sense. However, the optimal single MMD test relies on the unknown smoothness parameter s of the Sobolev ball Ss d(R), which motivates the introduction of our aggregated MMDAgg test in Section 3.5. Finally, we prove in Section 3.6 that MMDAgg is minimax adaptive over the Sobolev balls Ss d(R) : s > 0, R > 0 . 3.1 Assumptions and notation We assume that the sample sizes m and n are balanced up to a constant factor, meaning that there exists a positive constant C > 0 such that m n and n Cm. (7) As can be seen in Equation (20), this assumption allows us to upper bound terms such as m 1 and n 1 by (m + n) 1 up to a constant depending on C. The smaller this constant C Schrab, Kim, Albert, Laurent, Guedj and Gretton is, the tighter our bounds on uniform separation rates will be. For fixed sample sizes, this condition of being balanced is always satisfied. For increasing sample sizes, the condition requires that both sample sizes increase at the same rate. In particular, under this condition, it is not possible to fix one sample size and let the other tend to infinity. In general, we write Ci(p1, . . . , pℓ) to express the dependence of a positive constant Ci on some parameters p1, . . . , pℓ. We assume that we have d characteristic kernels (x, y) 7 Ki(x y) on R R for some functions Ki : R R lying in L1(R) L2(R) and satisfying R R Ki(u)du = 1 for i = 1, . . . , d. Then, for some bandwidth λ = (λ1, . . . , λd) (0, )d, the function1 kλ(x, y) := is a characteristic kernel on Rd Rd satisfying2 Rd kλ(x, y)dx = 1 and Z Rd kλ(x, y)2dx = κ2(d) λ1 λd (8) for the constant κ2(d) defined later in Equation (21). Using Ki(u) = 1 π exp u2 for u R and i = 1, . . . , d, for example, yields the Gaussian kernel kλ. Using Ki(u) = 1 2 exp( |u|) for u R and i = 1, . . . , d yields the Laplace kernel kλ. For notation purposes, given λ = (λ1, . . . , λd) (0, )d, we also write for u Rd, so that kλ(x, y) = ϕλ(x y) for all x, y Rd. In this paper, we investigate the choice of kernel bandwidth for MMD tests. While our theoretical results on test power only hold for translation-invariant kernels on Rd Rd (as introduced above), we stress that our single and aggregated MMD tests are well-defined and have well-calibrated non-asymptotic levels (Propositions 1 and 8) on any domain and for any positive definite characteristic kernel. For clarity, we denote MMD(p, q; Hkλ), \ MMD 2 a(Xm, Yn; Hkλ), \ MMD 2 b(Xn, Yn; Hkλ) and hkλ (all defined in Section 2) simply by MMDλ(p, q), \ MMD 2 λ,a(Xm, Yn), \ MMD 2 λ,b(Xn, Yn) and hλ, respectively. When m = n, we let \ MMD 2 λ(Xm, Yn) denote the estimator \ MMD 2 λ,a(Xm, Yn). When m = n, we let \ MMD 2 λ(Xn, Yn) denote either \ MMD 2 λ,a(Xn, Yn) or \ MMD 2 λ,b(Xn, Yn). This means that, when m = n, all our results hold for both estimators. 1. Multiplying the kernel kλ by a positive constant Cλ does not affect the outputs of the single and aggregated tests as it simply scales both the test statistic and the quantile. The requirements that R R Ki(u)du = 1 for i = 1, . . . , d and the scaling term (λ1 λd) 1 in the definition of the kernel kλ are not required for our theoretical results to hold. We introduce those simply for ease of notation in our statements and proofs. 2. Detailed calculations are presented at the beginning of Appendix E. MMD Aggregated Two-Sample Test 3.2 Non-asymptotic single MMD test with a fixed bandwidth Here, we consider the bandwidth λ (0, )d to be fixed a priori. The null and alternative hypotheses for the two-sample problem are H0 : p = q against Ha : p = q, or equivalently H0 : MMD2 λ(p, q) = 0 against Ha : MMD2 λ(p, q) > 0, provided that the kernels K1, . . . , Kd are characteristic. Using the samples Xm = (Xi)1 i m and Yn = (Yj)1 j n, we calculate the test statistic \ MMD 2 λ(Xm, Yn). Since we want our test to be valid in the non-asymptotic framework, we cannot rely on the asymptotic distribution of \ MMD 2 λ under the null hypothesis to compute the required threshold which guarantees the desired level α (0, 1). Instead, we use a Monte Carlo approximation to estimate the conditional (1 α)-quantile of the permutation-based and wild bootstrap procedures given the samples Xm and Yn under the null hypothesis. For the estimator \ MMD 2 λ,a(Xm, Yn) defined in Equation (3) we use permutations, while for the estimator \ MMD 2 λ,b(Xn, Yn) defined in Equation (6) we use a wild bootstrap. In Appendix B, we provide some more in-depth discussion about the relation between those two procedures. In particular, for the estimate \ MMD 2 λ,b(Xn, Yn), we show in Proposition 11 that using a wild bootstrap corresponds exactly to using permutations which either fix or swap Xi and Yi for i = 1, . . . , n. 3.2.1 Permutation approach In this case, we consider the MMD estimator defined in Equation (3) which can be written as \ MMD 2 λ,a(Xm, Yn) = 1 m(m 1)n(n 1) 1 j =j n hλ(Ui, Ui , Um+j, Um+j ) where Ui := Xi and Um+j := Yj for i = 1, . . . , m and j = 1, . . . , n. Given a permutation function σ: {1, . . . , m + n} {1, . . . , m + n}, we can compute the MMD estimator on the permuted samples Xσ m := Uσ(i) 1 i m and Yσ n := Uσ(m+j) 1 j n to get c M σ λ := \ MMD 2 λ,a(Xσ m, Yσ n) (10) = 1 m(m 1)n(n 1) 1 j =j n hλ(Uσ(i), Uσ(i ), Uσ(m+j), Uσ(m+j )) 1 i =i m kλ(Uσ(i), Uσ(i )) + 1 n(n 1) 1 j =j n kλ(Uσ(m+j), Uσ(m+j )) j=1 kλ(Uσ(i), Uσ(m+j)). In order to estimate, with a Monte Carlo approximation, the conditional quantile of c M σ λ given Xm and Yn, we uniformly sample B i.i.d. permutations σ(1), . . . , σ(B). We denote their probability mass function by r, so that σ(b) r for b = 1, . . . , B. We introduce the notation ZB := σ(b) 1 b B and also simply write c M b λ := c M σ(b) λ for b = 1, . . . , B. We can then use the values c M b λ 1 b B to estimate the conditional quantile as explained in Section 3.2.3. Schrab, Kim, Albert, Laurent, Guedj and Gretton 3.2.2 Wild bootstrap approach In this case, we assume that m = n and we work with the MMD estimator \ MMD 2 λ,b defined in Equation (6). Recall that for this, we must assume an ordering of our samples which gives rise to a pairing (Xi, Yi) for i = 1, . . . , n. Simply using permutations as presented in Section 3.2.1 would break this pairing and our estimators would consist of a signed sum of different terms because kλ(Uσ(i), Uσ(n+i)) : i = 1, . . . , n = {kλ(Ui, Un+i) : i = 1, . . . , n} for most permutations σ: {1, . . . , 2n} {1, . . . , 2n}. The idea is then to restrict ourselves to the permutations σ which, for i = 1, . . . , n, either fix or swap Xi and Yi, in the sense that {Uσ(i), Uσ(n+i)} = {Ui, Un+i}, so that kλ(Uσ(i), Uσ(n+i)) = kλ(Ui, Un+i). We show in Proposition 11 in Appendix B that this corresponds exactly to using a wild bootstrap, which we now define. Given n i.i.d. Rademacher random variables ϵ := (ϵ1, . . . , ϵn) with values in { 1, 1}n, we let c M ϵ λ := 1 n(n 1) 1 i =j n ϵiϵjhλ(Xi, Xj, Yi, Yj). (11) As for the permutation approach, in order to obtain a Monte Carlo estimate of the conditional quantile of c M ϵ λ given Xn and Yn, for b = 1, . . . , B, we generate n i.i.d. Rademacher random variables ϵ(b) := ϵ(b) 1 , . . . , ϵ(b) n with values in { 1, 1}n and compute c M b λ := c M ϵ(b) λ . We write ZB := ϵ(b) 1 b B and denote their probability mass function as r to be consistent with the notation introduced in Section 3.2.1, so that ϵ(b) r for b = 1, . . . , B. We next show in Section 3.2.3 how to estimate the conditional quantile using c M b λ 3.2.3 Single MMD test: definition and level Depending on which MMD estimator we use, either \ MMD 2 λ,a from Equation (3) or \ MMD 2 λ,b from Equation (6), we obtain c M b λ 1 b B either as in Section 3.2.1 or as in Section 3.2.2, respectively. Inspired by the work of Romano and Wolf (2005a, Lemma 1) and Albert et al. (2022), in order to obtain the prescribed non-asymptotic test level, we also add the original MMD statistic c M B+1 λ := \ MMD 2 λ(Xm, Yn) which corresponds either to the case where the permutation is the identity or where the n Rademacher random variables are equal to 1. We can then estimate the conditional quantile of the distribution of either c M σ λ or c M ϵ λ given Xm and Yn under the null hypothesis H0 : p = q by using a Monte Carlo approximation. In particular, our estimator of the conditional (1 α)-quantile is given by bq λ,B 1 α ZB Xm, Yn := inf u R : 1 α 1 B + 1 b=1 1 c M b λ u = c M (B+1)(1 α) λ (12) where c M 1 λ c M B+1 λ denote the ordered simulated test statistics c M b λ 1 b B+1. We then define the single MMD test λ,B α for some given bandwidth λ (0, )d as λ,B α (Xm, Yn, ZB) := 1 \ MMD 2 λ(Xm, Yn) > bq λ,B 1 α ZB Xm, Yn . MMD Aggregated Two-Sample Test Intuitively, c M b λ 1 b B+1 simulate values of the MMD test statistic under the null hypoth- esis H0 : p = q, the quantile bq λ,B 1 α is defined such that only an α-proportion of the simulated test statistics are greater than bq λ,B 1 α. As such, as shown in Proposition 1, under H0, the probability that the MMD test statistic is greater than the quantile (i.e. rejecting the null) is non-asymptotically at most α. As shown in Appendix E.1, the p-value of the test can be computed as pλ val := 1 B + 1 b=1 1 c M b λ c M B+1 λ ! and satisfies the property that pλ val α \ MMD 2 λ(Xm, Yn) > bq λ,B 1 α ZB Xm, Yn . Note that computing the quantile bq λ,B 1 α ZB Xm, Yn requires sorting the simulated test statistics, while computing the p-value pλ val does not. We now prove that this single MMD test has the desired non-asymptotic level α, which we stress differs from the original asymptotic MMD test of Gretton et al. (2012a). We believe that this non-asymptotic property will contribute to the wide use of those MMDbased tests. Proposition 1 (proof in Appendix E.1) For fixed bandwidth λ (0, )d, α (0, 1) and B N \ {0}, the test λ,B α has non-asymptotic level α, that is Pp p r λ,B α (Xm, Yn, ZB) = 1 α for all probability density functions p on Rd. This single MMD test λ,B α depends on the choice of bandwidth λ. In practice, one would like to choose λ such that the test λ,B α has high power against most alternatives. In general, a smaller bandwidth gives a narrower kernel which is well suited to detect local differences between probability densities such as small perturbations. On the other hand, a larger bandwidth gives a wider kernel which is better at detecting global differences between probability densities. We verify those intuitions in our experiments presented in Section 5. While insightful, those do not tell us exactly how to choose the bandwidth. As mentioned in the introduction, in practice, there exist two common approaches to choosing the bandwidth of the single MMD test. The first one, proposed by Gretton et al. (2012a), is to set the bandwidth to be equal to the median inter-sample distance. The second approach involves splitting the data into two parts where the first half is used to choose the bandwidth that maximises the asymptotic power, and the second half is used to run the test. This was initially proposed by Gretton et al. (2012b) for the linear-time MMD estimator, and later generalised by Liu et al. (2020) to the case of the quadratic-time MMD estimator. The former approach has no theoretical guarantees, while the latter can suffer from a loss of power caused by the use of less data to run the test. Those two methods are further analysed in our experiments in Section 5. In Sections 3.3 and 3.4, we obtain theoretical guarantees for the power of the single MMD test λ,B α and specify the choice of the bandwidth that leads to minimax optimality. Schrab, Kim, Albert, Laurent, Guedj and Gretton 3.3 Controlling the power of the single MMD test We start by presenting conditions on the discrepancy measures MMDλ(p, q) and p q 2 under which the probability of type II error of the single MMD test Pp q r λ,B α (Xm, Yn, ZB) = 0 = Pp q r \ MMD 2 λ(Xm, Yn) bq λ,B 1 α ZB Xm, Yn is controlled by a small positive constant β. We then express these conditions in terms of the bandwidth λ. We find a sufficient condition on the value of MMD2 λ(p, q) which guarantees that the single MMD test λ,B α has power at least 1 β against the alternative Ha : p = q. Lemma 2 (proof in Appendix E.2) For α, β (0, 1), and B N \ {0}, the condition MMD2 λ(p, q) r 2 β varp q \ MMD 2 λ(Xm, Yn) + bq λ,B 1 α ZB Xm, Yn 1 β is sufficient to control the probability of type II error such that Pp q r λ,B α (Xm, Yn, ZB) = 0 β. If the densities p and q differ significantly in the sense that MMD2 λ(p, q) satisfies the condition of Lemma 2, then the probability of type II error of the single MMD test λ,B α against that alternative hypothesis is upper-bounded by β. The condition includes two terms: the first term depends on β as well as on the variance of \ MMD 2 λ(Xm, Yn), and the second is the conditional quantile estimated using the Monte Carlo method with either permutations or a wild bootstrap. In the next two propositions, we make this condition more concrete by providing upper bounds for the variance and the estimated conditional quantile. In particular, the upper bounds are expressed in terms of the bandwidth λ and the sample sizes m and n, which guides us towards the choice of the bandwidth with an optimal guarantee. We start with the variance term. Proposition 3 (proof in Appendix E.3) Assume that max ( p , q ) M for some M > 0. Given ϕλ as defined in Equation (9) and ψ := p q, there exists a positive constant C1(M, d) such that varp q \ MMD 2 λ(Xm, Yn) C1(M, d) ψ ϕλ 2 2 m + n + 1 (m + n)2 λ1 λd We now upper bound the estimated conditional quantile bq λ,B 1 α ZB Xm, Yn in terms of λ and m + n. Since this is a random variable, we provide a bound which holds with high probability. Proposition 4 (proof in Appendix E.4) We assume max ( p , q ) M for some M > 0, α (0, 0.5) and δ (0, 1). For all B N satisfying B 3 α2 ln 4 δ + α(1 α) , we have bq λ,B 1 α ZB Xm, Yn C2(M, d) 1 (m + n) λ1 λd for some positive constant C2(M, d). MMD Aggregated Two-Sample Test Note that while this bound looks similar to the one proposed by Albert et al. (2022, Proposition 3) for independence testing, it differs in two major aspects. Firstly, while they consider the theoretical (unknown) quantile qλ 1 α, we stress that our bound holds for the random variable bq λ,B 1 α ZB Xm, Yn , which is the conditional quantile estimated using the Monte Carlo method with either permutations or a wild bootstrap. Secondly, our bound holds for any bandwidth λ (0, )d without any additional assumptions. In particular, we do not require the restrictive condition that (m + n) λ1 λd > ln 1 α which can in some cases imply that the sample sizes need to be very large. Having obtained upper bounds for varp q \ MMD 2 λ(Xm, Yn) and bq λ,B 1 α ZB Xm, Yn , we now combine these with Lemma 2 to obtain a more concrete condition for type II error control. More specifically, the refined condition depends on λ, m+n and β, and guarantees that the probability of type II error of the single MMD test λ,B α , against the alternative (p, q) defined in terms of the L2-norm, is at most β. Theorem 5 (proof in Appendix E.5) We assume max ( p , q ) M for some M > 0, α (0, e 1), β (0, 1) and B N which satisfy B 3 α2 ln 8 β + α(1 α) . We consider ϕλ as defined in Equation (9) and let ψ := p q. Assume that λ1 λd 1. There exists a positive constant C3(M, d) such that if ψ 2 2 ψ ψ ϕλ 2 2 + C3(M, d) ln 1 β(m + n) λ1 λd , then the probability of type II error satisfies Pp q r λ,B α (Xm, Yn, ZB) = 0 β. The main condition of Theorem 5 requires p q 2 2 to be greater than the sum of two quantities. The first one is the bias term ψ ψ ϕλ 2 2 and the second one comes from the upper bounds in Propositions 3 and 4 on the variance varp q \ MMD 2 λ(Xm, Yn) and on the estimated conditional quantile bq λ,B 1 α ZB Xm, Yn . Now, we want to express the bias term ψ ψ ϕλ 2 2 explicitly in terms of the bandwidths λ. For this, we need some smoothness assumption on the difference of the probability densities. 3.4 Uniform separation rate of the single MMD test over a Sobolev ball We now assume that ψ := p q belongs to the Sobolev ball Ss d(R) defined in Equation (1). This assumption allows us to derive an upper bound on the uniform separation rate of the single MMD test in terms of the bandwidth λ and of the sum of sample sizes m + n. Theorem 6 (proof in Appendix E.6) We assume that α (0, e 1), β (0, 1), s > 0, R > 0, M > 0 and B N satisfying B 3 α2 ln 8 β +α(1 α) . Given that λ1 λd 1, the uniform separation rate of the test λ,B α over the Sobolev ball Ss d(R) can be upper bounded as follows ρ λ,B α , Ss d(R), β, M 2 C4(M, d, s, R, β) i=1 λ2s i + ln 1 (m + n) λ1 λd for some positive constant C4(M, d, s, R, β). Schrab, Kim, Albert, Laurent, Guedj and Gretton The upper bound on the uniform separation rate ρ λ,B α , Ss d(R), β, M given by Theorem 6 consists of two terms depending on the bandwidth λ (0, )d. As the bandwidth λ varies, there is a trade-offbetween those two quantities: increasing one implies decreasing the other. We can choose the optimal bandwidth λ (depending on m + n, d and s) in the sense that both terms have the same order with respect to the sum of sample sizes m + n. Corollary 7 (proof in Appendix E.7) We assume that α (0, e 1), β (0, 1), s > 0, R > 0, M > 0 and B N satisfying B 3 α2 ln 8 β + α(1 α) . The test λ ,B α for the choice of bandwidth λ i = (m + n) 2/(4s+d), i = 1, . . . , d, is optimal in the minimax sense over the Sobolev ball Ss d(R), that is ρ λ ,B α , Ss d(R), β, M C5(M, d, s, R, α, β) (m + n) 2s/(4s+d) for some positive constant C5(M, d, s, R, α, β). We have constructed the single MMD test λ ,B α and proved that it is minimax optimal over the Sobolev ball Ss d(R) without any restriction on the sample sizes m and n. However, it is worth pointing out that the optimality of the single MMD test hinges on the assumption that the smoothness parameter s is known in advance, which is not realistic. Given this limitation, our next goal is to construct a test which does not rely on the unknown smoothness parameter s of the Sobolev ball Ss d(R) and achieves the same minimax rate, up to an iterated logarithmic term, for all s > 0 and R > 0. This is the main topic of Section 3.5 below. 3.5 Non-asymptotic MMDAgg test aggregating multiple bandwidths We propose to construct an aggregated test (MMDAgg) by combining multiple single MMD tests, which allows the test to be adaptive to the unknown the smoothness parameter of the Sobolev balls. We use the powerful multiple testing correction of Romano and Wolf (2005b, Equation 9), for which we derive non-asymptotic level guarantees. Consider a finite collection Λ of bandwidths in (0, )d with an associated collection of positive weights3 (wλ)λ Λ, which will determine the importance of each single MMD test over the others when aggregating all of them. We require that P λ Λ wλ 1. For notational convenience, we let Λw denote the collection of bandwidths Λ with its associated collection of weights. Intuitively, we want to define our aggregated MMDAgg test as the test which rejects the null hypothesis H0 : p = q if one of the single MMD tests λ,B1 uαwλ λ Λ rejects the null hypothesis, where uα is defined as4 uα = sup u 0, min λ Λ w 1 λ : Pp p r \ MMD 2 λ(Xm, Yn) bq λ,B1 1 uwλ ZB1 Xm, Yn >0 α 3. We stress that this differs from the notation often used in the literature (for example, for the independence aggregated test of Albert et al., 2022) where the weights are defined as e wλ rather than as wλ. 4. Since α (0, 1) and the function u 7 Pp p r max λ Λ \ MMD 2 λ(Xm, Yn) bq λ,B1 1 uwλ ZB1 Xm, Yn > 0 is non-decreasing, tends to 0 as u tends to 0, and tends to 1 as u tends to min λ Λw 1 λ , uα is well-defined. MMD Aggregated Two-Sample Test to ensure that MMDAgg has level α. We stress that the data, as well as the choice of collections of bandwidths and weights, all affect the value of uα. In practice, the probability and the supremum in the definition of uα cannot be computed exactly. We can estimate the former using a Monte Carlo approximation and estimate the latter using the bisection method. We now explain this in more detail and provide a formal definition of our aggregated MMDAgg test. For the case of the estimator \ MMD 2 λ,a defined in Equation (3), we independently gen- erate a permutation σ(b,ℓ) r of {1, . . . , m + n} and compute c M b λ,ℓ:= c M σ(b,ℓ) λ as defined in Equation (10) for ℓ= 1, 2, b = 1, . . . , Bℓand λ Λ. When working with the estimator \ MMD 2 λ,b defined in Equation (6), we independently generate n i.i.d. Rademacher random variables ϵ(b,ℓ) = ϵ(b,ℓ) 1 , . . . , ϵ(b,ℓ) n r and compute c M b λ,ℓ:= c M ϵ(b,ℓ) λ as defined in Equation (11) for ℓ= 1, 2, b = 1, . . . , Bℓand λ Λ. For consistency between the two procedures, we let Zℓ Bℓ:= µ(b,ℓ) 1 b Bℓfor ℓ= 1, 2, where µ(b,ℓ) denotes either the permutation σ(b,ℓ) or the Rademacher random variable ϵ(b,ℓ) for ℓ= 1, 2 and b = 1, . . . , Bℓ. With a slight abuse of notation, we refer to Z1 B1 and Z2 B2 simply as ZB1 and ZB2. For both estimators, we also let c M B1+1 λ,1 := \ MMD 2 λ(Xm, Yn). We denote by c M 1 λ,1 c M B1+1 λ,1 the ordered elements c M b λ,1 We use c M b λ,1 1 b B1+1, which are computed using ZB1, Xm and Yn, to estimate the conditional (1 a)-quantile bq λ,B1 1 a ZB1 Xm, Yn := c M (B1+1)(1 a) λ,1 for any a (0, 1) as in Equation (12). As explained in Section 3.2.3, bq λ,B1 1 a is defined such that an a-proportion of the test statistics c M b λ,1 1 b B1+1 simulated under the null are greater than bq λ,B1 1 a . By Proposition 1, this ensures the single test with bandwidth λ has non-asymptotic level a. We use c M b λ,2 1 b B2, which are computed using ZB2, Xm and Yn, to estimate with a Monte Carlo approximation the probability \ MMD 2 λ(Xm, Yn) bq λ,B1 1 uwλ ZB1 Xm, Yn > 0 (13) which appears in the definition of uα. We denote the approximated quantity by u B2 α , which is formally defined as u B2 α ZB2 Xm, Yn, ZB1 := sup u 0, min λ Λ w 1 λ : 1 b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 uwλ ZB1 Xm, Yn >0 α = sup u 0, min λ Λ w 1 λ : 1 b=1 1 max λ Λ c M b λ,2 c M (B1+1)(1 uwλ) λ,1 > 0 α . Since the function u 7 1 B2 PB2 b=1 1 max λ Λ c M b λ,2 c M (B1+1)(1 uwλ) λ,1 > 0 α is increasing, u B2 α is actually the largest root of this function. As such, it can be computed in practice by Schrab, Kim, Albert, Laurent, Guedj and Gretton using the bisection method for finding the root. We let5 bu B2:3 α = bu B2:3 α ZB2 Xm, Yn, ZB1 be the lower bound of the interval obtained by performing B3 steps of the bisection method to approximate the supremum (i.e. find the root) in the definition of u B2 α . We then have u B2 α bu B2:3 α , bu B2:3 α + 2 B3 min λ Λ w 1 λ We recall that the data, the collection of bandwidths, and the weights, all affect the value of the correction uα, and hence, also the value of its estimate bu B2:3 α . For α (0, 1), we can then define our aggregated MMDAgg test5 Λw,B1:3 α as rejecting the null hypothesis, that is Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 1, if one of the tests λ,B1 bu B2:3 α wλ λ Λ rejects the null hypothesis, that is λ Λ : \ MMD 2 λ(Xm, Yn) > bq λ,B1 1 bu B2:3 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn , or equivalently λ Λ : \ MMD 2 λ(Xm, Yn) > c M l (B1+1) 1 bu B2:3 α ZB2 Xm,Yn,ZB1 wλ m λ,1 ZB1 Xm, Yn . The parameters of our MMDAgg test Λw,B1:3 α are: its level α, the finite collection Λw of bandwidths with its associated weights, and the positive integers B1, B2 and B3. We generate independent permutations or Rademacher random variables to obtain ZB1 and ZB2. In practice, we are given realisations of Xm = (Xi)1 i m and Yn = (Yj)1 j n. Hence, we are able to compute Λw,B1:3 α (Xm, Yn, ZB1, ZB2) to decide whether or not we should reject the null hypothesis H0 : p = q. This exact version of our aggregated MMDAgg test Λw,B1:3 α can be implemented in practice with no further approximation. We provide a detailed pseudocode of MMDAgg in Algorithm 1 and our code is available here. In Appendix C, we further discuss how to efficiently compute the values c M b λ,ℓfor ℓ= 1, 2, b = 1, . . . , Bℓand λ Λ (corresponding to Step 1 of Algorithm 1). The only conditions we have on our weights (wλ)λ Λ for the collection of bandwidths Λ are that they need to be positive and to satisfy P λ Λ wλ 1. We now explain why this condition on the sum of the weights is not necessarily required. In general, the two aggregated tests with weights (wλ)λ Λ and with scaled weights (w λ)λ Λ where w λ := wλ P λ Λ wλ for λ Λ are exactly the same. This is due to the way the correction of the levels of the single MMD tests is performed. In particular, making the dependence of bu B2:3 α on either Λw or Λw explicit, we have α ZB2 Xm, Yn, ZB1 = bu B2:3,Λw α ZB2 Xm, Yn, ZB1 X and so bu B2:3,Λw α w λ = bu B2:3,Λw α wλ, which implies that Λw ,B1:3 α (Xm, Yn, ZB1, ZB2) = Λw,B1:3 α (Xm, Yn, ZB1, ZB2) . (14) 5. We use the condensed notation B2:3 and B1:3 to refer to (B2, B3) and (B1, B2, B3), respectively. MMD Aggregated Two-Sample Test Algorithm 1: MMDAgg Λw,B1:3 α Inputs: samples Xm = (xi)1 i m in Rd and Yn = (yj)1 j n in Rd choice between permutations (Equation (3)) or wild bootstrap (Equation (6)) one-dimensional kernels K1, . . . , Kd satisfying the properties presented in Section 3.1 level α (0, e 1) finite collection of bandwidths Λ in (0, )d collection of positive weights (wλ)λ Λ satisfying P λ Λ wλ 1 number of simulated test statistics B1 to estimate the quantiles number of simulated test statistics B2 to estimate the level correction number of iterations B3 for the bisection method Procedure: Step 1: compute all simulated test statistics (see Appendix C for a more efficient Step 1) for ℓ= 1, 2 and b = 1, . . . , Bℓ: generate µ(b,ℓ) r as in Sections 3.2.1 or 3.2.2 (permutations or Rademacher) for λ Λ : compute c M b λ,ℓ:= c M µ(b,ℓ) λ as in Equations (10) or (11) compute c M B1+1 λ,1 := \ MMD 2 λ(Xm, Yn) as in Equations (3) or (6) c M 1 λ,1, . . . , c M B1+1 λ,1 = sort by ascending order c M 1 λ,1, . . . , c M B1+1 λ,1 Step 2: compute buα using the bisection method umin := 0 and umax := min λ Λ w 1 λ repeat B3 times: compute u := umin+umax compute Pu := 1 B2 PB2 b=1 1 max λ Λ c M b λ,2 c M (B1+1)(1 uwλ) λ,1 > 0 if Pu α then umin := u else umax := u buα := umin Step 3: output test result if c M B1+1 λ,1 > c M (B1+1)(1 buαwλ) λ,1 for some λ Λ: return 1 (reject H0) else: return 0 (fail to reject H0) Time complexity:6 O |Λ|(B1 + B2)(m + n)2 Space complexity: O (m + n)2 + (B1 + B2)(m + n) 6. The time complexity is actually O |Λ|(B1 + B2)(m + n)2 + |Λ|B1 ln(B1) + |Λ|B2B3 which under the reasonable assumption m + n > max p ln(B1), B3 gives O |Λ|(B1 + B2)(m + n)2 . Schrab, Kim, Albert, Laurent, Guedj and Gretton Consider some u 0, minλ Λw 1 λ . Note that if a single MMD test λ,B1 uwλ has a large associated weight wλ, then its adjusted level uwλ is bigger and so the estimated conditional quantile bq λ,B1 1 uwλ is smaller, which means that we reject this single test more often. Recall that if a single MMD test rejects the null hypothesis, then the aggregated MMDAgg test necessarily rejects the null as well. It follows that a single test λ,B1 uwλ with large weight wλ is viewed as more important than the other tests in the aggregated procedure. When running an experiment, putting weights on the bandwidths of the single MMD tests can be seen as incorporating prior knowledge about which bandwidths might be better suited to this specific experiment. The choice of prior, or equivalently of weights, is further explored in Section 5.1. As presented in Section 3.2, the p-value of one of the single MMD tests can be computed as pλ val := 1 B1 + 1 b=1 1 c M b λ,1 c M B1+1 λ,1 ! and, with its adjusted level bu B2:3 α wλ, it satisfies the property that pλ val bu B2:3 α wλ \ MMD 2 λ(Xm, Yn) > bq λ,B1 1 bu B2:3 α wλ. Hence, our aggregated MMDAgg test can also be expressed in terms of p-values as Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 1 \ MMD 2 λ(Xm, Yn) > bq λ,B1 1 bu B2:3 α wλ for some λ Λ = 1 pλ val bu B2:3 α wλ for some λ Λ . We now show that MMDAgg indeed has non-asymptotic level α. We emphasize the nonasymptotic nature of our aggregated test, which allows for the use of MMDAgg even in settings with small fixed sample sizes, where other asymptotic tests (such as the original MMD test of Gretton et al., 2012a) fail to control correctly the probability of type I error. Proposition 8 (proof in Appendix E.8) Consider α (0, 1) and B1, B2, B3 N \ {0}. For a collection Λ of bandwidths in (0, )d and a collection of positive weights (wλ)λ Λ satisfying P λ Λ wλ 1, the MMDAgg test Λw,B1:3 α has non-asymptotic level α, that is Pp p r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 1 α for all probability density functions p on Rd. 3.6 Uniform separation rate of MMDAgg over Sobolev balls In this section, we compute the uniform separation rate of our MMDAgg test Λw,B1:3 α over the Sobolev ball Ss d(R). We then present a collection Λw of bandwidths and associated weights for which our aggregated test Λw,B1:3 α is almost optimal in the minimax sense. First, as part of the proof of Theorem 9 in Equation (25), we have shown that the following bound holds Pp q r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 0 β 2 + min λ Λ Pp q r λ,B1 αwλ/2 ZB1 Xm, Yn = 0 . MMD Aggregated Two-Sample Test This means that we can control the probability of type II error of our MMDAgg test Λw,B1:3 α by controlling the smallest probability of type II error of the single MMD tests λ,B1 αwλ/2 λ Λ with adjusted levels. Hence, given a collection Λ of bandwidths with its associated weights (wλ)λ Λ, if for some λ Λ the single MMD test λ,B1 αwλ/2 has probability of type II error upper bounded by β/2 (0, 0.5), then the probability of type II error of our aggregated MMDAgg test Λw,B1:3 α is at most β. Intuitively, this means that even if our collection of single MMD tests consists of only one good test (in the sense that it has high power with adjusted level) and many other bad tests (in the sense that they have low power with adjusted levels), MMDAgg would still have high power. This is because when the good MMD test rejects the null hypothesis, MMDAgg also necessarily rejects it. Another point of view on this is that we do not lose any power by testing a wider range of bandwidths as long as the weight of the best test remains the same. The uniform separation rate of our MMDAgg test Λw,B1:3 α over the Sobolev ball Ss d(R) is then at most twice the lowest of the uniform separation rates of the single MMD tests λ,B1 αwλ/2 λ Λ. Combining this result with Theorem 6, we obtain the following upper bound on the uniform separation rate of MMDAgg Λw,B1:3 α over the Sobolev ball Ss d(R). Theorem 9 (proof in Appendix E.9) Consider a collection Λ of bandwidths in (0, )d such that λ1 λd 1 for all λ Λ and a collection of positive weights (wλ)λ Λ such that P λ Λ wλ 1. We assume α (0, e 1), β (0, 1), s > 0, R > 0, M > 0 and B1, B2, B3 N satisfying B1 maxλ Λ w 2 λ 12 β + α(1 α) , B2 8 α2 ln 2 β and B3 log2 4 α minλ Λw 1 λ . The uniform separation rate of the aggregated MMDAgg test Λw,B1:3 α over the Sobolev ball Ss d(R) can be upper bounded as follows ρ Λw,B1:3 α , Ss d(R), β, M 2 C6(M, d, s, R, β) min λ Λ i=1 λ2s i + ln 1 α + ln 1 wλ (m + n) λ1 λd for some positive constant C6(M, d, s, R, β). We recall from Corollary 7 that the optimal choice of bandwidth λ i =(m + n) 2/(4s+d), i = 1, . . . , d, for the single MMD test λ ,B α leads to a uniform separation rate over the Sobolev ball Ss d(R) of order(m + n) 2s/(4s+d) which is optimal in the minimax sense. However, this choice depends on the unknown smoothness parameter s and so the test cannot be run in practice with this bandwidth. We now propose a specific choice of collection Λw of bandwidths and associated weights, which does not depend on s, and derive the uniform separation rate over the Sobolev ball Ss d(R) of our aggregated MMDAgg test Λw,B1:3 α using that collection. Intuitively, the main idea is to construct a collection of bandwidths which includes a bandwidth (denoted λ ) with the property that m + n ln(ln(m + n)) 2/(4s+d) λ i m + n ln(ln(m + n)) for some a > 1 and for i = 1, . . . , d. The extra iterated logarithmic term comes from the additional weight term ln 1 wλ in Theorem 9. Schrab, Kim, Albert, Laurent, Guedj and Gretton Corollary 10 (proof in Appendix E.10) We assume α (0, e 1), β (0, 1), s > 0, R > 0, M > 0, m + n > 15 so that ln(ln(m + n)) > 1 and B1, B2, B3 N satisfying B1 3 α2 ln 8 β +α(1 α) , B2 8 α2 ln 2 β and B3 log2 2π2 3α . We consider our aggregated MMDAgg test Λw,B1:3 α with the collection of bandwidths Λ := n 2 ℓ, . . . , 2 ℓ (0, )d : ℓ n 1, . . . , l2 d log2 m + n ln(ln(m + n)) and the collection of positive weights wλ := 6 π2 ℓ2 so that P λ Λ wλ 1 for any sample sizes m and n. The uniform separation rate of the MMDAgg test Λw,B1:3 α over the Sobolev ball Ss d(R) then satisfies ρ Λw,B1:3 α , Ss d(R), β, M C7(M, d, s, R, α, β) m + n ln(ln(m + n)) for some positive constant C7(M, d, s, R, α, β). This means that the MMDAgg test Λw,B1:3 α , which does not depend on s and R, is optimal in the minimax sense up to an iterated logarithmic term over the Sobolev balls Ss d(R) : s > 0, R > 0 ; the MMDAgg test Λw,B1:3 α is minimax adaptive. Note that the choice of using negative powers of 2 for the bandwidths in Corollary 10 is arbitrary. The result holds more generally using negative powers of a for any real number a > 1. With the specific choice of bandwidths and weights of Corollary 10, we have proved that the uniform separation rate of the proposed aggregated MMDAgg test is upper bounded by ((m+n)/ ln(ln(m+n))) 2s/(4s+d). Comparing this with the minimax rate (m+n) 2s/(4s+d), we see that MMDAgg attains rate optimality over the Sobolev ball Ss d(R), up to an iterated logarithmic factor, and more importantly, the aggregated test does not depend on the prior knowledge of the smoothness parameter s. Our MMDAgg test is minimax adaptive over the Sobolev balls Ss d(R) : s > 0, R > 0 . 4. Related work In this section, we compare our results to a number of different adaptive kernel hypothesis testing approaches. Fromont et al. (2012, 2013) construct a two-sample aggregated test in a framework in which the sample sizes follow independent Poisson processes. They use a different kernelbased estimator which corresponds to an unscaled version of the classical quadratic-time MMD estimator of Gretton et al. (2012a, Lemma 6). In their Poisson setting, they derive uniform separation rates for their aggregated test which is minimax adaptive over Sobolev balls, and over anisotropic Nikol skii-Besov balls, up to an iterated logarithmic term. The quantiles they consider are estimated with a wild bootstrap. They also have an additional assumption on the kernel (condition in the Fourier domain; Fromont et al., 2013, Equation 3.7), which we do not require. Albert et al. (2022) consider the problem of testing whether two random vectors are dependent and use the kernel-based Hilbert-Schmidt Independence Criterion (HSIC Gretton MMD Aggregated Two-Sample Test et al., 2005) as a dependence measure. Similarly to our work, they propose a non-asymptotic minimax adaptive test which aggregates single (HSIC) tests, and provide theoretical guarantees: upper bounds for the uniform separation rate of testing over Sobolev and Nikol skii balls. In their independence testing setting, the information about the problem is encoded in the joint distribution over pairs of variables, with the goal of determining whether this is equal to the product of the marginals. This differs from the two-sample problem we consider, where we have samples from two separate distributions. Albert et al. (2022) define their single HSIC test using the theoretical quantile of the statistic under the null hypothesis, which is an unknown quantity in practice. To implement the test, they propose a deterministic upper bound on the theoretical quantile (Albert et al., 2022, Proposition 3). This upper bound holds in the two-sample case under the restrictive assumption (m + n) λ1 λd > ln 1 α (this condition is adapted to the twosample setting from their condition n p λ1 . . . λpµ1 . . . µq > ln 1 α for independence testing). If the bandwidth is small (as it can be in the case of the optimal bandwidth λ in the proof of Corollary 10), then this condition implies that the results would hold only for very large sample sizes. By contrast with the above bound, we use a wild bootstrap or permutations to approximate the theoretical quantiles. While the theoretical quantiles are real numbers given data, our estimated quantiles are random variables given data. This means that instead of having a deterministic upper bound on the theoretical quantiles (Albert et al., 2022, Proposition 3), we have an upper bound on our estimated conditional quantiles which holds with high probability as in Proposition 4. Our use of an estimated threshold in place of a deterministic upper bound has an important practical consequence: it allows us to drop the assumption (m + n) λ1 λd > ln 1 α entirely. Another difference is how the level correction of the single MMD tests is performed. The aggregated test of Albert et al. (2022) involves a theoretical value uα which cannot be computed in practice, we incorporate directly in our test a Monte Carlo approximation, using either a wild bootstrap or permutations, to estimate the probability under the null hypothesis, and use the bisection method to approximate the supremum. We stress that our theoretical guarantees of minimax optimality (up to an iterated logarithmic term) hold for our aggregated MMDAgg test which can be implemented without any further approximations. Finally, while the results of Albert et al. (2022) hold only for the Gaussian kernel, ours are more general and hold for any product of one-dimensional characteristic translation invariant kernels which are absolutely and square integrable. Kim et al. (2022, Section 7) propose an adaptive two-sample test for testing equality between two H older densities supported on the real d-dimensional unit ball. Instead of testing various bandwidths or kernels, they discretise the support in bins of equal sizes and aggregate tests with varying bin sizes. Each single test is a multinomial test based on the discretised data. Their strategy and the function class they use are both different from the one we consider, but they derive a similar upper bound on the uniform separation rate of testing over H older densities. Kim et al. (2022, Proposition 8.4) also mention the setting considered by Albert et al. (2022) and prove an equivalent version of our Theorem 5 for single tests, using permutations for the Gaussian kernel. We consider both the permutationbased and wild bootstrap procedures, and our results hold more generally for a wide range of kernels. With those aforementioned differences, Kim et al. (2022, Example 8.5) anticipate Schrab, Kim, Albert, Laurent, Guedj and Gretton that one can use a similar reasoning to Albert et al. (2022) to obtain minimax optimality of the single MMD tests. We provide the full statement and proof of this result in our more general setting. Li and Yuan (2019) present goodness-of-fit, two-sample and independence aggregated asymptotic tests and also establish the minimax rates over Sobolev balls for these three settings. Their tests use the Gaussian kernel and heavily rely on the asymptotic distribution of the test statistic, while our test is non-asymptotic and is not limited to a particular choice of kernel. Their tests are adaptive over Sobolev balls (which they define in a slightly different way than in our case) provided that the smoothness parameter satisfies s d/4. We do not have such a restriction. We also note that they assume that the two densities belong to a Sobolev ball, rather than assuming only that the difference of the densities lies in a Sobolev ball. We also point out the work of Tolstikhin et al. (2016) who derive lower bounds for MMD estimation based on finite samples for any radial universal kernel (Sriperumbudur et al., 2011). They establish the minimax rate optimality of the MMD estimators (V -statistic and U-statistic; Lee, 1990). Gretton et al. (2012b) address kernel adaptation for the linear-time MMD, where the test statistic is computed as a running average (this results in a statistic with greater variance, but allows the processing of larger sample sizes). They propose to choose the kernel by splitting the data, and using one part to select the bandwidth which maximises the estimated ratio of the Maximum Mean Discrepancy to the standard deviation. They show that maximizing this criterion for the linear-time setting corresponds to maximizing the asymptotic power of the test. The test is then performed on the remaining part of the data. Sutherland et al. (2017) and Liu et al. (2020) address kernel adaptation for the quadratic-time MMD using the same sample-splitting strategy, and show that the ratio of the MMD to the standard deviation under the alternative can again be used as a good proxy for test power. Liu et al. (2020) in particular propose a regularized estimator for the variance under the alternative hypothesis, which admits a convenient closed-form expression. Generally, kernel choice by sample splitting gives better results than the median heuristic, as the former is explicitly tuned to optimize the asymptotic power (or a proxy for it). The price to pay for this increase in performance, however, is that we cannot use all the data for the test. In cases where we have access to almost unlimited data this clearly would not be a problem, but in cases where we have a restricted number of samples and work in the non-asymptotic setting, the loss of data to kernel selection might actually result in a net reduction in power, even after kernel adaptation. For better data efficiency, K ubler et al. (2022a) propose an MMD test which uses held-out data not only for kernel selection, but also for choosing weights and test locations for the MMD witness function. In later work, leveraging recent advances in supervised learning and also relying on sample splitting, K ubler et al. (2022b) construct a test which learns the witness function directly by training, for a given amount of time (i.e. one minute), an Auto Gluon model (Erickson et al., 2020) which can be, for example, a neural network. K ubler et al. (2020) propose another approach to an MMD adaptive two-sample test which does not require data splitting. Using all the data, they select the linear combination of test statistics with different bandwidths (or even different kernels) which is optimal in the sense that it maximises a power proxy, they then run their test using again all the data. Using the post-selection inference framework (Fithian et al., 2014; Lee et al., 2016), they MMD Aggregated Two-Sample Test are able to correctly calibrate their test to account for the introduced dependencies. This framework requires asymptotic normality of the test statistic under the null hypothesis, however, and hence they are by design restricted to using the linear-time MMD estimate. We observe in our experiments that using this estimate results in a significant loss in power when compared to tests which use the quadratic-time statistic. Yamada et al. (2019) also use post-selection inference to obtain a feature selection method based on the MMD, where the chosen features best distinguish the samples. In a different setting, Wynne and Duncan (2022) study the efficiency of MMD-based tests when dimension increases, and propose an MMD-based two-sample test for Functional Data Analysis, a framework in which the samples consist of functions rather than of data points. In this setting, Wynne and Nagy (2021) study the connections between kernel mean embeddings and statistical depth (i.e. how representative a point is from a given measure). 5. Experiments For our aggregated MMDAgg test, we first introduce in Section 5.1 four weighting strategies and a family of collections of bandwidths motivated by Corollary 10. Those collections depend on some parameters which would usually need to be chosen by the user, by contrast, we introduce in Section 5.2 a parameter-free adaptive collection for MMDAgg, which we recommend using in practice. We then present in Section 5.3 some other state-of-the-art MMD-based two-sample tests we will compare ours to. In Section 5.4, we provide details about our experimental procedure. We show that our aggregated MMDAgg test obtains high power on both synthetic and real-world datasets in Sections 5.5 and 5.6, respectively. In Section 5.7, we observe that MMDAgg retains power even in the continuous limit of the collection of bandwidths. We show in Section 5.8 that, on image shift experiments, MMDAgg matches the power of tests using neural networks, even for large sample sizes. Finally, in Section 5.9, we briefly report the results from the additional experiments presented in Appendix A. 5.1 Weighting strategies and fixed bandwidth collections for MMDAgg The positive weights (wλ)λ Λ for the collection of bandwidths Λ are required to satisfy P λ Λ wλ 1. As noted in Equation (14), rescaling all the weights to ensure that P λ Λ wλ = 1 does not change the output of our aggregated test Λw,B1:3 α (Xm, Yn, ZB1, ZB2). For this reason, we propose weighting strategies for which P λ Λ wλ = 1 holds. For any collection Λ of N bandwidths, one can use uniform weights which we define as Using uniform weights should be prioritised if the user does not have any useful prior information to incorporate in the test. The choice of weights is entirely up to the user; the weights can be designed to reflect any given prior belief about the location of the best bandwidths in the collection. Nonetheless, we also present three standard weighting strategies for incorporating prior knowledge when dealing with a more structured collection of bandwidths. Schrab, Kim, Albert, Laurent, Guedj and Gretton 5 bandwidths uniform centred increasing decreasing 6 bandwidths Bandwidths sorted in increasing order Figure 1: Weighting strategies. Consider the case where we have some reference bandwidth λref (0, )d and we are interested in aggregating scaled versions of it, that is, we have an ordered collection of N bandwidths defined as λ(i) := ciλref, i = 1, . . . , N, for positive constants c1 < < c N. If we have no prior knowledge, then we would simply use the aforementioned uniform weights. Suppose we believe that, if the two distributions differ, then this difference would be better captured by the smaller bandwidths in our collection, in that case we would use decreasing weights wd λ(i) := 1 ℓ=1 ℓ 1 ! 1 for i = 1, . . . , N. On the contrary, if we think that the larger bandwidths in our collection are well suited to capture the difference between the two distributions, if it exists, then we would use increasing weights wi λ(i) := 1 N + 1 i ℓ=1 ℓ 1 ! 1 for i = 1, . . . , N. Finally, if our prior knowledge is that the bandwidths in the middle of our ordered collection are the most likely to detect the potential difference between the two densities, then we would use centred weights which, for N odd, are defined as wc λ(i) := 1 N+1 2 ℓ + 1 1! 1 for i = 1, . . . , N, and, for N even, as wc λ(i) := 1 N+1 for i = 1, . . . , N. MMD Aggregated Two-Sample Test All those weighting strategies are inspired from the weights of Corollary 10 which are defined as wλ(i) := i 2 P ℓ=1 ℓ 2 1 for i N\{0}, where the square exponent is required in order to have a convergent series. However, in practice, using square exponents in our weights would assign extremely small weights to some of the bandwidths in our collection. This would be almost equivalent to disregarding those bandwidths in our aggregated MMDAgg test, which is not a desired property since if we are not interested in testing some bandwidths, then we would simply not include them in our collection. For this reason, we have defined our weighting strategies without the square exponent. We provide visualisations of our four weighting strategies for collections of 5 and 6 bandwidths in Figure 1. In our experiments, we refer to our aggregated MMDAgg test with those four weighting strategies as: MMDAgg uniform, MMDAgg decreasing, MMDAgg increasing and MMDAgg centred. For these, we use B1 = 500 simulated test statistics to estimate the quantiles, B2 = 500 simulated test statistics to estimate the probability in Equation (13) for the level correction, and B3 = 100 iterations for the bisection method. Motivated by Corollary 10, those tests are used with collections of bandwidths of the form Λ(ℓ , ℓ+) := n 2ℓλmed (0, )d : ℓ ℓ , . . . , ℓ+ o (15) for ℓ , ℓ+ Z such that ℓ < ℓ+, where the median bandwidth λmed is (λmed)i := median wi w i : w, w Xm Yn, w = w for i = 1, . . . , d. We note that, for each experiment, we have chosen the values ℓ and ℓ+ which highlight the differences between the four weighting strategies. In practice, it is not clear how to choose those values, instead, we recommend using the adaptive parameter-free collection of bandwidths introduced in Section 5.2 with uniform weights. 5.2 Adaptive parameter-free collection of bandwidths for MMDAgg For radial basis function kernels (i.e. kernels k(x, y) which can be written as a function of x y for some norm ), we recommend using a collection of bandwidths which, intuitively, discretises the interval between the smallest and the largest of the inter-sample distances7 D := x y : x Xm, y Yn . More formally, we use (i 1)/(N 1) λmin/2 : i = 1, . . . , N which is a discretisation of the interval [λmin/2, 2λmax] using N points. We let λmin be the minimum value in D. If this value is smaller than 0.1, we instead use the 5% smallest value in D for λmin, if this quantity is still smaller than 0.1, we use λmin = 0.1. For λmax, we use the maximum value of D or 0.3 if this maximum is smaller than 0.3. In practice, we recommend using N = 10 points, as can be observed in Figure 6 the power remain the same when increasing N to be larger (i.e. N = 100 or N = 1000). Since in general we might not 7. In practice, D can be computed using at most 500 samples from Xm and 500 samples from Yn. Schrab, Kim, Albert, Laurent, Guedj and Gretton have prior information about the location of well suited bandwidths, we recommend using the proposed collection of bandwidths with uniform weights as defined in Section 5.1. We use B1 = 2000 and B2 = 2000 simulated test statistics to estimate the quantiles and the probability in Equation (13) for the level correction, respectively, and use B3 = 50 steps of bisection method. We refer to this test as MMDAgg and emphasize the fact that it is run with exactly the same parameters across all experiments. 5.3 State-of-the-art MMD-based two-sample tests Gretton et al. (2012a) first suggested using the median heuristic to choose the bandwidth of the MMD test with the Gaussian kernel8 corresponding to Ki(u) := 1 π exp( u2) for u R, i = 1, . . . , d, so that kλ(x, y) := = 1 λ1 λdπd/2 exp They proposed to set the bandwidth to be equal to λi := median w w 2 : w, w Xm Yn, w = w for i = 1, . . . , d. To generalise this approach to our case where K1, . . . , Kd need not all be the same, as explained in Section 5.1, we can in a similar way set the bandwidth9 to λi := median wi w i : w, w Xm Yn, w = w for i = 1, . . . , d. With this specific definition for the bandwidth, we use the notation λmed := (λ1, . . . , λd). We refer to the single MMD test with the bandwidth λmed as median in our experiments. Another common approach for selecting the bandwidth was first introduced by Gretton et al. (2012b) for the single MMD test using the linear-time MMD estimator. The method was then extended to the case of the quadratic-time MMD estimator by Sutherland et al. (2017). It consists in splitting the data in two parts and in using the first part to select the bandwidth which maximises the asymptotic power of test, or equivalently the estimated ratio (Liu et al., 2020, Equations 4 and 5) \ MMD 2 λ(Xn, Yn) bσλ(Xn, Yn) (16) bσ2 λ(Xn, Yn) := 4 j=1 hλ(Xi, Xj, Yi, Yj) j=1 hλ(Xi, Xj, Yi, Yj) 8. Gretton et al. (2012a) actually consider the unnormalised Gaussian kernel without the λ1 λdπd/2 1 term, but as pointed out in Footnote 1, this does not affect the output of the test. 9. Note that those two ways of setting the bandwidths are not equivalent for the Gaussian kernel but they are each equally valid. MMD Aggregated Two-Sample Test is a regularised positive estimator of the asymptotic variance of the quadratic-time estimator \ MMD 2 λ(Xn, Yn) under the alternative hypothesis Ha for m = n. In our experiments, we select the bandwidth of the form cλmed for various positive values of c which maximises the estimated ratio. The single MMD test with the selected bandwidth is then performed on the second part of the data. In our experiments, we refer to this test as split. Another interesting test to compare ours to, is the one which uses new data to choose the optimal bandwidth. This corresponds to running the above test which uses data splitting on twice the amount of data. In some sense, this represents the best performance the single MMD test can achieve as the test is run on the whole dataset with an optimal choice of bandwidth. As such, it is interesting to observe the difference in power between MMDAgg and this oracle test which uses extra data. In our experiments, we denote this test as oracle. A radically different approach to constructing an MMD adaptive two-sample test was recently presented by K ubler et al. (2020). They work in the asymptotic regime and require asymptotic normality under the null hypothesis of their MMD estimator, so they are restricted to using the linear-time estimator. Using all of the data, they compute the linear-time MMD estimates for several kernels (or several bandwidths of a kernel), they then select the linear combination of these which maximises a proxy for asymptotic power, and compare its value to their test threshold. They do not split the data but they are able to correctly calibrate their test for the introduced dependencies. For this, they prove and use a generalised version of the post-selection inference framework (Fithian et al., 2014; Lee et al., 2016) which holds for uncountable candidate sets (i.e. all linear combinations). In our experiments, we compare our aggregated MMDAgg test to their one-sided test (OST K ubler et al., 2020) for which we use their implementation. This test is referred to as ost in our experiments. In later work, K ubler et al. (2022b) propose an Auto ML (Automated Machine Learning) test with an implementation which is essentially parameter-free. Their test relies on sample splitting, on cross-validation, and on permuting the data, the witness function of the test is learnt by training an Auto Gluon model (Erickson et al., 2020) for some prescribed amount of time (one minute by default). Depending on the time limit and on the compute available, a different model will be chosen automatically. While such a black-box approach can certainly be convenient for everyday users, this convenience comes at the expense of reproducibility (even on the same machine it can take different amounts of time to train identical models). We refer to this test in our experiments as Auto ML. 5.4 Experimental procedure To compute the median bandwidth (λmed)i := median wi w i : w, w Xm Yn, w = w for i = 1, . . . , d, we use at most 1000 randomly selected data points from Xm and at most 1000 from Yn, since the median is robust, this is sufficient to get an accurate estimate of it. Moreover, we use a threshold so that the bandwidth is not smaller than 0.0001. This avoids division by 0 in some settings where one component of the data points is always the Schrab, Kim, Albert, Laurent, Guedj and Gretton same value, as it can be the case for the problem considered in Section 5.6 which uses the MNIST dataset, where the pixel in one corner of the images is always black for every digit. We run all our experiments with the Gaussian kernel kλ(x, y) := = 1 λ1 λdπd/2 exp for Ki(u) := 1 π exp( u2) for u R, i = 1, . . . , d, and with the Laplace kernel kλ(x, y) := = 1 λ1 λd2d exp for Ki(u) := 1 2 exp( |u|) for u R, i = 1, . . . , d. As mentioned in Footnote 1, MMDAgg does not depend on the scaling of the kernels. Hence, in our implementation we drop the scaling terms in front of the exponential functions, which is numerically more stable. We use a wild bootstrap for all our experiments, except for the one in Appendix A.3 where we compare using the permutation-based and wild bootstrap procedures, and for the one in Appendix A.4 where we must use permutations as we consider different sample sizes m = n. We use level α = 0.05 for all our experiments. We run all our tests on three different types of data: 1-dimensional and 2-dimensional perturbed uniform distributions, and the MNIST dataset. Those are introduced in Sections 5.5 and 5.6, respectively. We use sample sizes m = n = 500 for the 1-dimensional perturbed uniform distributions and for the MNIST dataset, and use larger sample sizes m = n = 2000 for the case of the 2-dimensional perturbed uniform distributions. For the split and oracle tests, we use two equal halves of the data, and oracle is run on twice the sample sizes. We choose the bandwidth which maximises the estimated ratio presented in Equation (16) out of the collection cλmed : c {0.1, 0.2, . . . , 0.9, 1} when considering perturbed uniform distributions, and when considering the MNIST dataset we select it out of the collection 2cλmed : c {10, 11, . . . , 19, 20} . Similarly to our aggregated tests of Section 5.1, for the median, split and oracle tests, we use B = 500 simulated test statistics to estimate the quantile. For MMDAgg , we also use either the Gaussian or the Laplace kernel with N = 10 bandwidths, we refer to those tests as MMDAgg Gaussian and MMDAgg Laplace. We also consider aggregating over different types of kernels, in particular, we can use both the Gaussian and Laplace kernels, each with N = 10 bandwidths. This gives a collection consisting of 2N = 20 kernels, over which MMDAgg Laplace Gaussian aggregates. Finally, we propose to aggregate 12 kernels (each with N = 10 bandwidths): the Gaussian kernel, the inverse multiquadric (IMQ) kernel, and the Mat ern kernels with the ℓ1 and ℓ2 distances for ν = 0.5, 1.5, 2.5, 3.5, 4.5 (the Laplace kernel is the Mat ern kernel with ℓ1 distance and ν = 0.5). This test aggregates over 12N = 120 kernels, we refer to it as MMDAgg All. Note that in Figure 6, we consider N = 1000 bandwidths, which means for example that the MMDAgg All aggregates over 12N = 12000 kernels, and we observe that it retains its high power. To estimate the power in our experiments, we average the test outputs of 500 repetitions, that is, 500 times, we sample some new data and run the test. We sample new data for each MMD Aggregated Two-Sample Test 0.5 1.0 0.0 0.5 0.0 1.0 0.5 0.0 1.0 Figure 2: (a) Function G, (b) 1-dimensional uniform distribution with 0, 1, 2, 3 and 4 perturbations, (c) 2-dimensional uniform distribution with 2 perturbations. test with different parameters, except when we compare using either a wild bootstrap or permutations, in which case we use the same samples. All our experiments are reproducible and our code is available here. 5.5 Power experiments on synthetic data As explained in Appendix D, a lower bound on the minimax rate of testing over the Sobolev ball Ss d(R) can be obtained by considering a d-dimensional uniform distribution and a perturbed version of it with P N \ {0} perturbations. As presented in Equation (19), the latter has density fθ(u) := 1[0,1]d(u) + cd P s X ν {1,...,P}d θν i=1 G (Pui νi) , u Rd (17) where θ = (θν)ν {1,...,P}d { 1, 1}P d, that is, θ is a vector of length P d with entries either 1 or 1, and it is indexed by the P d d-dimensional elements of {1, . . . , P}d, and G(u) := exp 1 1 (4u + 3)2 2)(u) exp 1 1 (4u + 1)2 2 ,0)(u), u R. We have added a scaling factor cd to emphasize the effect of the perturbations, in our experiments we use c1 = 2.7 and c2 = 7.3. Those values were chosen to ensure that the densities with one perturbation remain positive on [0, 1]d. The uniform density with P perturbations for P = 0, 1, 2, 3, 4 when d = 1 and for P = 2 when d = 2, as well as the function G, are plotted in Figure 2. As shown by Li and Yuan (2019), for P large enough, the difference between the uniform density and the perturbed uniform density lies in the Sobolev ball Ss d(R) for some R > 0. In our experiments, we choose the smoothness parameter of the perturbed uniform density defined in Equation (17) to be equal to s = 1. For each of the 500 repetitions used to estimate the power of a test, we sample uniformly a new value of the parameter θ { 1, 1}P d for the perturbed uniform density. In Figure 3, we consider testing n = 500 samples drawn from the 1-dimensional uniform distribution against m = 500 samples drawn from a 1-dimensional uniform distribution Schrab, Kim, Albert, Laurent, Guedj and Gretton Power Gaussian kernel Λ( 6, 2) Λ( 4, 0) Λ( 2, 2) 1 2 3 4 0.00 Power Laplace kernel 1 2 3 4 Number of perturbations MMDAgg Auto ML MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing median split oracle ost 1 2 3 4 Number of perturbations MMDAgg Laplace MMDAgg Gaussian MMDAgg Laplace Gaussian MMDAgg All Figure 3: Power experiments with 1-dimensional perturbed uniform distributions using sample sizes m = n = 500 with a wild bootstrap. with P = 1, 2, 3, 4 perturbations. We consider the same setting but in two dimensions with sample sizes m = n = 2000 with up to three perturbations in Figure 4. For the MMDAgg tests of Section 5.1 and for the ost test, we use the collections of bandwidths Λ( 6, 2), Λ( 4, 0) and Λ( 2, 2) as defined in Equation (15). As the number of perturbations increases, it becomes harder to distinguish the two distributions, this translates into a decrease in power for all the tests. Even though we consider more samples for the 2-dimensional case, the performance of all the tests degrades significantly with dimension as detecting the perturbations becomes considerably more challenging. For all the settings considered in Figures 3 and 4, we observe that MMDAgg always performs the best, with power slightly higher than the one of oracle which has access to extra data to select an optimal bandwidth. All other tests achieve significantly lower test MMD Aggregated Two-Sample Test Power Gaussian kernel Λ( 6, 2) Λ( 4, 0) Λ( 2, 2) Power Laplace kernel 1 2 3 Number of perturbations MMDAgg Auto ML MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing median split oracle ost 1 2 3 Number of perturbations MMDAgg Laplace MMDAgg Gaussian MMDAgg Laplace Gaussian MMDAgg All Figure 4: Power experiments with 2-dimensional perturbed uniform distributions using sample sizes m = n = 2000 with a wild bootstrap. power. This validates our theoretical results that our aggregated MMDAgg test is minimax optimal in settings such as the one considered in this experiment where the difference in densities lies in a Sobolev ball of unknown smoothness. In particular, with its adaptive collection of bandwidths, MMDAgg obtains higher power than the four other aggregated tests with specifically chosen collections. In the 1-dimensional case, Auto ML performs similarly to our four aggregated tests, while in the 2-dimensional case with two perturbations it obtains much lower power. Our four tests with different weighting strategies outperform the three other tests median, split and ost in most settings, and always at least match the power of the best of those three. The ost test, which is restricted to using the linear-time MMD estimate, obtains very low power compared to all the other tests using the quadratic-time estimate. Schrab, Kim, Albert, Laurent, Guedj and Gretton In all the experiments in Figures 3 and 4, using the Gaussian rather than the Laplace kernel results in higher power for our four aggregated tests, the difference is small but notable for the 1-dimensional case while it is large for the 2-dimensional case. For MMDAgg , there is no difference in Figure 3 and a small one in Figure 4, as can be seen in the bottom plots. In one dimension, the median test performs significantly better when using the Laplace kernel and outperforms the split test, which is not the case in all the other settings. In the bottom plots of Figures 3 and 4, we observe that by aggregating over both Gaussian and Laplace kernels, MMDAgg Laplace Gaussian obtains the highest power achieved by either MMDAgg Laplace or MMDAgg Gaussian. When adding 10 other kernels to the collection (each with 10 bandwidths), MMDAgg All retains the same high power, we do not observe a cost in power for considering more kernels. This is only possible due the way we perform the level correction in Section 3.5. We now discuss the relation between the four weighting strategies for MMDAgg. Recall from Section 3.5 that a single MMD test with larger associated weight is viewed as more important than one with smaller associated weight in the aggregated procedure. Recall from Section 5.1 that MMDAgg uniform puts equal weights on every bandwidths, that MMDAgg centred puts the highest weight on the bandwidth in the middle of the collection, and that MMDAgg increasing puts the highest weight on the biggest bandwidth while MMDAgg decreasing puts it on the smallest bandwidth. This allows us to interpret our results. First, let s consider the case of the collection of bandwidths Λ( 6, 2) for both one and two dimensions. We observe that MMDAgg increasing has the highest power and MMDAgg decreasing the lowest of the four aggregated tests, this means that putting the highest weight on the biggest bandwidth performs the best while putting it on the smallest bandwidth performs the worst. We can deduce that the most important bandwidth in our collection is the biggest one, which suggests that we should consider a collection consisting of larger bandwidths, say Λ( 4, 0). In this case, MMDAgg centred now obtains the highest power of our four weighting strategies. We can infer that the optimal bandwidth is close to the bandwidths in the middle of our collection. When considering a collection of even larger bandwidths Λ( 2, 2), we see the opposite trends to ones observed using Λ( 6, 2); MMDAgg decreasing and MMDAgg increasing are performing the best and worst of our four tests, respectively. This suggests that a collection consisting of smaller bandwidths than Λ( 2, 2) might be more appropriate. So, comparing our aggregated tests with different weighting strategies gives us some insights on whether the collection we have considered is appropriate, or consists of bandwidths which are either too small or too large. The uniform weighting strategy does not perform the best but it is more robust to changes in the collection of bandwidths than the other strategies. Of course, in practice, if we have access to a limited amount of data, one cannot run a hypothesis test with some parameters, observe the results and then modify those parameters to run the test again. Nonetheless, the interpretation of the results of our different weighting strategies remains an appealing feature of our tests. In practice, we recommend using the parameter-free test MMDAgg Laplace Gaussian with its collection of bandwidths chosen adaptively. MMD Aggregated Two-Sample Test Power Gaussian kernel Λ(8, 12), Λ(10, 14) Λ(10, 14), Λ(12, 16) Λ(12, 16), Λ(14, 18) Q1 Q2 Q3 Q4 Q5 0.00 Power Laplace kernel Q1 Q2 Q3 Q4 Q5 Choice of alternative MMDAgg Auto ML MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing median split oracle ost Q1 Q2 Q3 Q4 Q5 Q1 Q2 Q3 Q4 Q5 Choice of alternative MMDAgg Laplace MMDAgg Gaussian MMDAgg Laplace Gaussian MMDAgg All Figure 5: Power experiments with the MNIST dataset using sample sizes m = n = 500 with a wild bootstrap. 5.6 Power experiments on the MNIST dataset Motivated by the experiment considered by K ubler et al. (2020), we consider the MNIST dataset (Le Cun et al., 2010) down-sampled to 7 7 images. In Figure 5, we consider 500 samples drawn with replacement from the set P consisting all 70 000 images of digits P : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Schrab, Kim, Albert, Laurent, Guedj and Gretton We test these against 500 samples drawn with replacement against one of the sets Q1 : 1, 3, 5, 7, 9, Q2 : 0, 1, 3, 5, 7, 9, Q3 : 0, 1, 2, 3, 5, 7, 9, Q4 : 0, 1, 2, 3, 4, 5, 7, 9, Q5 : 0, 1, 2, 3, 4, 5, 6, 7, 9, of respective sizes 35 582, 42 485, 49 475, 56 299 and 63 175. While the samples are in dimension 49, distinguishing images of different digits reduces to a lower-dimensional problem. We consider the Gaussian kernel with the collections of bandwidths Λ(8, 12), Λ(10, 14) and Λ(12, 16) and the Laplace kernel with Λ(10, 14), Λ(12, 16) and Λ(14, 18). As i increases, distinguishing P from Qi becomes a more challenging task, which results in a decrease in power. In the experiments presented in Figure 5 on image data, MMDAgg and Auto ML achieve the same power and outperform by far all the other tests. This could be due to the fact that the collections for the other aggregated tests consist of very large bandwidths (from 28λmed to 218λmed). Nonetheless, we observe that split and oracle, which consider smaller bandwidths, also perform poorly compared to MMDAgg and Auto ML. The four aggregated tests with different weighting strategies often match, or even slightly beat, the performance of oracle which uses extra data to select an optimal bandwidth. Moreover, they outperform significantly the two adaptive tests split and ost, as well as the median test. For MMDAgg , using either the Laplace or Gaussian kernel leads to the same performance in this experiment. Furthermore, MMDAgg retains its high power when considering many more kernels as well, as can be seen in the bottom figure of Figure 5 with MMDAgg All. Contrary to the previous experiments, we observe that using the Laplace kernel rather than the Gaussian one results in substantially higher power for the four aggregated tests with different weighting strategies. We recall that in the experiments of Figures 3 and 4, we observed that using a Gaussian kernel leads to higher power. This illustrates that the optimal choice of kernel varies depending on the type of data, as such in practice we recommend using MMDAgg Laplace Gaussian since it is observed to achieve the highest power obtained by either MMDAgg Laplace or MMDAgg Gaussian in those three experiments. The test MMDAgg All also obtains the same power, but as it considers 12 types of kernels, each with 10 bandwidths, it is computationally more expensive. The pattern we observed in Figures 3 and 4 of having MMDAgg increasing and MMDAgg decreasing obtaining the highest and lowest power of our four aggregated tests for the collections of smaller and larger bandwidths, respectively, still holds to some extent in Figure 5 but the differences are less significant. For the collection of bandwidths Λ(12, 16) with the Laplace kernel, MMDAgg centred does not perform the best, it obtains slightly less power than MMDAgg uniform and MMDAgg increasing which have almost equal power. Following our interpretation, this simply means that, while Λ(12, 16) is an appropriate choice of collection, the optimal bandwidth might be slightly larger than 214λmed. Note that, except MMDAgg and Auto ML, each test obtains similar power when trying to distinguish P from either Q3 or Q4. Recall that Q3 consists of images of all the digits MMD Aggregated Two-Sample Test except 4, 6 and 8 while Q4 consists of images of all of them except 6 and 8. One possible explanation could be that these tests distinguish P from Q3 mainly by detecting if images of the digit 6 appear in the sample, this would explain why we observe similar power for Q3 and Q4, and why the power for Q5 (consisting of every digit except 8) drops significantly. We also sometimes observe in Figure 5 that our aggregated tests with different weighting strategies obtain slightly higher power for Q4 than for Q3, which might at first seem counterintuitive. This could be explained by the fact that the optimal bandwidths for distinguishing P from Q3 and from Q4 might be very different, and that the choice of collections of bandwidths presented in Figure 5 are slightly better suited for distinguishing P from Q4 than from Q3. While it is also the case in Figures 3 and 4 that the alternatives with different number of perturbations require different bandwidths to be detected, it looks like in that case considering a collection of five bandwidths which are powers of 2 is enough to adapt to those differences. For the MNIST experiment in Figure 5, it seems that the differences between the optimal bandwidths for Q3 and Q4 are more important. Using MMDAgg with its adaptive parameter-free collection of bandwidths solves this problem. An advantage of our aggregated MMDAgg tests is that, even if we fix the collection of bandwidths, they are able to detect differences at various lengthscales, this is not the case for the median and split tests as those select some specific bandwidth and are only able to detect the differences at the corresponding lengthscale. 5.7 Power experiment: continuous limit of the collection of bandwidths Our collection of bandwidths for MMDAgg is a discretisation of an interval using N = 10 points, which we formally introduced in Section 5.2. As we increase the number of points 101 102 103 0.00 d = 1, 3 perturbations m = n = 500 101 102 103 Number of bandwidths per kernel d = 2, 3 perturbations m = n = 2000 MMDAgg Laplace MMDAgg Gaussian MMDAgg Laplace Gaussian MMDAgg All 101 102 103 MNIST, Q4 m = n = 500 Figure 6: Power experiments varying the size of the collections of bandwidths using perturbed uniform d-dimensional distributions and the MNIST dataset with a wild bootstrap. For MMDAgg Laplace Gaussian, two kernels are aggregated, each with a varying number of bandwidths. For MMDAgg All, 12 kernels are considered with a varying number of bandwidths (up to 12000 kernels are aggregated). Schrab, Kim, Albert, Laurent, Guedj and Gretton N, the discretisation becomes finer and in the limit as N it corresponds to the whole continuous interval. In Figure 6, we consider the three experiments of Figures 3 to 5 presented in Sections 5.5 and 5.6 for MMDAgg Gaussian, MMDAgg Laplace, MMDAgg Laplace Gaussian, and MMDAgg All, with collections of sizes N, N, 2N and 12N, respectively (details in Section 5.4). We vary the number of bandwidths per kernel N to be 10, 100 and 1000. In particular, this means for example that MMDAgg All with N = 1000 aggregates over 12N = 12000 kernels. For the one-dimensional uniform setting, we use three perturbations and sample sizes 500. The 2-dimensional case is considered with three perturbations and m = n = 2000. For the MNIST experiment, we use Q4 as an alternative (every digit except 8 and 6) with sample sizes 500. As in the previous experiments, in Figure 6, the MMDAgg test aggregating both Laplace and Gaussian kernels obtains the highest power achieved by either MMDAgg Laplace or MMDAgg Gaussian. Considering more types of kernels with MMDAgg All does not change the test power. Since it is computationally more expensive, we recommend using MMDAgg Laplace Gaussian in practice. In Figure 6, we observe for all four tests that the test power remains the same when increasing the number of bandwidths per kernel from 10 to 1000. The case N = 1000 simulates the continuous limit of the collection of bandwidths, that is, when the full interval is considered without discretisation. First, this shows that our MMDAgg test retains high power even when aggregating up to 12000 kernels, which is only possible due to the way we perform the level correction. Second, this illustrates that the power of the aggregated test with a continuous collection of bandwidths, can also be achieved by the less computationally expensive MMDAgg test with a discretisation of N = 10 points. 5.8 Power experiment for image shift detection In this section, we consider the experiment of Rabanser et al. (2019, Table 1a) on image shift detection on the MNIST (Le Cun et al., 2010) and CIFAR-10 (Krizhevsky, 2009) datasets. Ten different types of shifts are applied to either 10%, 50% or 100% of the samples, those include adversarial shifts, class knock-outs, injecting Gaussian noise, and combining rotations/translations/zoom-ins, with different shift strengths (see Section 4 of Rabanser et al. 2019 for details). We report in Table 1 the test power obtained by our four MMDAgg tests, by our MMDAgg uniform test with Laplace kernels and with Gaussian kernels, as well as by four versions of the Auto ML test of K ubler et al. (2022b, Table 1a). We observe that the MMDAgg tests outperform the MMDAgg uniform tests. For MMDAgg , using the Laplace kernel performs the best, aggregating both Laplace and Gaussian kernels performs better than using only Gaussian kernels and almost as well as using only Laplace kernels, aggregating many more kernels with MMDAgg All results in the same power obtained by MMDAgg Laplace Gaussian. Despite using off-the-shelf kernels which are not specifically designed for images, our aggregated MMDAgg tests are still performing within only a few percentage points of stateof-the-art tests based on training models (e.g. neural networks) which excel on image data, such as the Auto ML test. MMD Aggregated Two-Sample Test Table 1: Image shift detection experiment of Rabanser et al. (2019). The numbers reported correspond to the test power averaged over 60 alternatives (each repeated 5 times): 10 different shift types applied on either 10%, 50% or 100% of the image samples drawn from either the MNIST or CIFAR-10 datasets. Test Number of samples 10 20 50 100 200 500 1000 10000 MMDAgg Laplace 0.21 0.29 0.40 0.44 0.47 0.56 0.67 0.83 MMDAgg Gaussian 0.19 0.26 0.34 0.42 0.42 0.51 0.62 0.75 MMDAgg Laplace Gaussian 0.21 0.27 0.37 0.43 0.45 0.54 0.65 0.80 MMDAgg All 0.21 0.27 0.37 0.43 0.46 0.55 0.66 0.80 MMDAgg uniform (Laplace) 0.20 0.28 0.40 0.43 0.46 0.52 0.58 0.79 MMDAgg uniform (Gaussian) 0.15 0.23 0.33 0.35 0.38 0.44 0.48 0.69 Auto ML (raw) 0.17 0.24 0.37 0.46 0.50 0.62 0.67 0.87 Auto ML (pre) 0.18 0.29 0.42 0.47 0.47 0.64 0.65 0.72 Auto ML (class) 0.19 0.19 0.38 0.46 0.52 0.61 0.67 0.87 Auto ML (bin) 0.03 0.14 0.31 0.43 0.49 0.51 0.59 0.86 5.9 Overview of additional experimental results We consider additional experiments in Appendix A, we briefly summarize them here. In Appendix A.1, we verify that all the tests we consider have well-calibrated levels. We then consider widening the collection the bandwidths for our tests MMDAgg uniform and MMDAgg centred in Appendix A.2, and observe that the associated cost in the power is relatively small. We verify in Appendix A.3 that using a wild bootstrap or permutations results in similar performance; the difference is of non-significant order and is not biased towards one or the other. In Appendix A.4, we show that if one of the sample sizes is fixed to a small number, we cannot obtain high power even if we take the other sample size to be very large. Finally, we increase the sample sizes for the ost test in Appendix A.5 and observe that we need extremely large sample sizes to match the performance of MMDAgg uniform with 500 or 2000 samples. 6. Conclusion and future work We have constructed a two-sample hypothesis test, called MMDAgg, which aggregates multiple MMD tests using different kernels/bandwidths. Our test is adaptive over Sobolev balls and does not require data splitting. We have proved that MMDAgg is optimal in the minimax sense over Sobolev balls up to an iterated logarithmic term, for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. This optimality result also holds under two popular strategies used in estimating the test thresholds, namely the wild bootstrap and permutation procedures. In practice, we propose four weighting strategies which allow the user to incorporate prior knowledge about the collection of bandwidths. We also introduce a parameter-free adaptive collection Schrab, Kim, Albert, Laurent, Guedj and Gretton of bandwidths, which we recommend using in practice while aggregating over both Gaussian and Laplace kernels, each with multiple bandwidths. This adaptive collection is the discretisation of an interval and in practice we found that using ten bandwidths per kernel performs as well as using the whole continuous interval in the limit. Our MMDAgg test obtains significantly higher power than other state-of-the-art MMD-based two-sample tests in synthetic settings where the smoothness Sobolev assumption is satisfied, this empirically validates our theoretical result of minimax optimality and adaptivity over Sobolev balls. In experiments on image data, we observe that MMDAgg almost matches the power of much more complex two-sample tests relying on training models, such as neural networks, to detect the difference in distributions. We now discuss three research directions based on this current work, two of which have been explored by Schrab et al. (2022a,b). First, it would be interesting to consider the two-sample kernel-based test of Jitkrittum et al. (2016), who use adaptive features (in the data space or in the Fourier domain) to construct a linear-time test with good test power. Jitkrittum et al. (2016) require setting aside part of the data to select the kernel bandwidths and the feature locations, by maximizing a proxy for test power. They then perform the test on the remaining data. It would be of interest to develop an approach to learning such adaptive interpretable features without data splitting. Adapting the current results of this work to that setting remain an open challenge. Second, aggregated tests that are adaptive over Sobolev balls have been constructed for several alternative testing scenarios. The independence testing problem using the Hilbert Schmidt Independence Criterion has been treated by Albert et al. (2022), which is related to the Maximum Mean Discrepancy (Gretton et al., 2012a, Section 7.4). A further setting of interest is goodness-of-fit testing, where a sample is compared against a model. Our theoretical results can directly be applied to goodness-of-fit testing using the MMD, as long as the expectation of the kernel under the model can be computed in closed form. A more challenging problem arises when this expectation cannot be easily computed. In this case, a test may be constructed based on the Kernelised Stein Discrepancy (KSD Liu et al., 2016; Chwialkowski et al., 2016). This corresponds to computing a Maximum Mean Discrepancy in a modified Reproducing Kernel Hilbert Space, consisting of functions which have zero expectation under the model. Building on the present paper, Schrab et al. (2022a) develop an adaptive aggregated KSDAgg test of goodness-of-fit for the KSD and provide conditions which guarantee high test power for KSDAgg. Third, our MMDAgg test proposed in this work, the KSDAgg test of Schrab et al. (2022a), and the aggregated HSIC test of Albert et al. (2022), are all quadratic-time hypothesis tests. While quadratic-time tests usually achieve higher power than linear-time tests, this comes at the expense of an important computational cost. To tackle this problem, relying on incomplete U-statistics, Schrab et al. (2022b) propose efficient variants (including linear-time ones) of those three aggregated tests, called MMDAgg Inc, KSDAgg Inc and HSICAgg Inc. They theoretically quantify the cost incurred in the minimax rate over Sobolev balls for this improvement in computational efficiency. MMD Aggregated Two-Sample Test Acknowledgements We would like to thank the action editor Ingo Steinwart and the anonymous referees for their thorough reviews and suggestions which have helped to significantly improve the paper. Antonin Schrab acknowledges support from the U.K. Research and Innovation under grant number EP/S021566/1. Ilmun Kim acknowledges support from the Yonsei University Research Fund of 2022-22-0289 as well as support from the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2022R1A4A1033384), and the Korea government (MSIT) RS-2023-00211073. B eatrice Laurent acknowledges the funding by ANITI ANR-19-PI3A-0004. Benjamin Guedj acknowledges partial support by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1; Benjamin Guedj also acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02. Arthur Gretton acknowledges support from the Gatsby Charitable Foundation. Overview of Appendices In Appendix A, we present results of additional experiments. In Appendix B, we explain the relation between using permutations and using the wild bootstrap. We present an efficient implementation of MMDAgg in Appendix C. We highlight the proof strategy of deriving the minimax rate over a Sobolev ball in Appendix D. Finally, we present the proofs of all our results in Appendix E. Appendix A. Additional experiments In this section, we verify the level achieved by all the tests considered (Appendix A.1), and run multiple power experiments: widening the collection of bandwidths (Appendix A.2), comparing wild bootstrap and permutations (Appendix A.3), using unbalanced sample sizes (Appendix A.4), and increasing the sample sizes for the ost test (Appendix A.5). A.1 Level experiments In Table 2, we empirically verify that all the tests we consider have the desired level α = 0.05 in the three different settings considered in Figures 3 to 5. For the aggregated MMDAgg tests of Section 5.1, we use the collection of bandwidths Λ( 4, 0) for samples drawn from a uniform distribution in one and two dimensions. For samples drawn from the set P of images of all MNIST digits, we use Λ(10, 14) and Λ(12, 16) for the Gaussian and Laplace kernels, respectively. To obtain more precise results, we use 5000 repetitions to estimate the levels. We observe in Table 2 that all the tests have well-calibrated levels, indeed all the estimated levels are relatively close to the prescribed level 0.05. We consider three different types of data and run the tests with the Gaussian and Laplace kernel using either a wild bootstrap or permutations. We note that there is no noticeable trend in the differences in the estimated levels across all those different settings. Schrab, Kim, Albert, Laurent, Guedj and Gretton Table 2: Level experiments with samples drawn either from d-dimensional uniform distributions or from the MNIST dataset using the Gaussian (G.) and Laplace (L.) kernels with either a wild bootstrap (w.b.) or permutations (p.). MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing median split ost G. w.b. 0.0476 0.052 0.0456 0.0434 0.047 0.054 0.0594 p. 0.0496 0.0532 0.0478 0.0454 0.0468 0.0528 0.0594 L. w.b. 0.0474 0.0488 0.0516 0.0504 0.0534 0.05 0.0586 p. 0.047 0.0482 0.0496 0.0494 0.0522 0.0494 0.0586 G. w.b. 0.039 0.0432 0.044 0.0496 0.0464 0.0482 0.0478 p. 0.0424 0.0446 0.0414 0.0498 0.0466 0.0472 0.0478 L. w.b. 0.0382 0.0502 0.0506 0.0478 0.0438 0.0548 0.0502 p. 0.0418 0.0474 0.0514 0.049 0.0458 0.0548 0.0502 G. w.b. 0.0478 0.0528 0.0474 0.0488 0.0526 0.0498 0.0496 p. 0.042 0.05 0.0476 0.048 0.055 0.0484 0.0496 L. w.b. 0.054 0.052 0.0424 0.0548 0.0518 0.0444 0.05 p. 0.0526 0.0532 0.0442 0.0554 0.051 0.0448 0.05 MMDAgg Laplace MMDAgg Gaussian MMDAgg Laplace & Gaussian MMDAgg All Auto ML d = 1 w.b. 0.0506 0.05 0.0496 0.0492 0.0462 p. 0.0528 0.0492 0.0496 0.0494 d = 2 w.b. 0.0458 0.0428 0.0434 0.043 0.051 p. 0.0456 0.0438 0.0438 0.0434 MNIST w.b. 0.0624 0.0624 0.0632 0.0594 0.518 p. 0.0612 0.062 0.0628 0.0622 A.2 Power experiments: widening the collection of bandwidths In practice, we might not have strong prior knowledge to guide us in the choice of a collection consisting of only a few bandwidths. For this reason, in practice, we recommend using the adaptive parameter-free collection introduced in Section 5.2 with both Laplace and Gaussian kernels. Nonetheless, it is interesting to study the properties of the family of collections of Section 5.1 consisting of the median bandwidth scaled by powers of 2. In Figure 7, we design an experiment where we start with a collection of 3 bandwidths chosen to be centred around the optimal bandwidth. We then widen the collection of bandwidths and observe how much the power deteriorates as more bandwidths are included in the collection. We consider collections ranging from 3 to 15 bandwidths for our two tests MMDAgg uniform and MMDAgg centred, for the three types of data used in Figures 3 to 5. For the perturbed uniform distributions in one and two dimensions, we use the collection of bandwidths Λ( 2 i, 2+i) for i = 1, . . . , 7 for both kernels. For the MNIST dataset with MMD Aggregated Two-Sample Test Power Gaussian kernel d = 1, 3 perturbations m = n = 500 Λ( 2 i, 2 + i) d = 2, 2 perturbations m = n = 2 000 Λ( 2 i, 2 + i) MNIST, Q3, m = n = 500 Λ(12 i, 12 + i) Λ(14 i, 14 + i) 3 5 7 9 11 13 15 0.00 Power Laplace kernel 3 5 7 9 11 13 15 Number of bandwidths in the collection (corresponding to i = 1, . . . , 7) MMDAgg uniform MMDAgg centred 3 5 7 9 11 13 15 Figure 7: Power experiments varying the size of the collections of bandwidths using perturbed uniform d-dimensional distributions and the MNIST dataset with a wild bootstrap. the Gaussian and Laplace kernels, we use the collections of bandwidths Λ(12 i, 12 + i) and Λ(14 i, 14 + i) for i = 1, . . . , 7, respectively. Those collections are centred as those corresponding to the middle columns of Figures 3 to 5 (for the case i = 2), so we expect the bandwidth in the centre of each collection to be a well-calibrated one. For this reason, it makes sense to consider only MMDAgg uniform and MMDAgg centred in those experiments. In all the settings considered in Figure 7, we observe only a very small decrease in power when considering a wider collection of bandwidths for MMDAgg centred. This is due to the fact that even though we consider more bandwidths, we still put the highest weight on the well-calibrated one in the centre of the collection. Nonetheless, the fact that almost no power is lost when considering more bandwidths for MMDAgg centred is a great feature of our test, which is only possible due to the way we perform the level correction for MMDAgg. For the MNIST dataset, we observe a slight increase in power for MMDAgg centred with the collections of nine bandwidths for both kernels. This could indicate that, as suggested in Section 5.6, the bandwidths in the centre of the collections are well-calibrated to distinguish P from Q4 but are not necessarily the best choice to distinguish P from Q3. Remarkably, the power for MMDAgg uniform, which puts equal weights on all the bandwidths, decays relatively slowly and this test does not use the information that the bandwidth in the centre of the collection is a well-calibrated one. So, we expect similar results for any collections of the same sizes which include this bandwidth but not necessarily in the centre of the collection. This means that, in practice, without any prior knowledge, one Schrab, Kim, Albert, Laurent, Guedj and Gretton Difference in power Gaussian kernel d = 1, Λ( 4, 0) m = n = 500 d = 2, Λ( 4, 0) m = n = 2 000 MNIST, Λ(10, 14), Λ(12, 16) m = n = 500 1 2 3 4 Number of perturbations Difference in power Laplace kernel 1 2 3 Number of perturbations MMDAgg median MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing split Q1 Q2 Q3 Q4 Q5 Choice of alternative Figure 8: Power experiments considering the difference between using a wild bootstrap or permutations on perturbed uniform d-dimensional distributions and on the MNIST dataset. can use uniform weights with a relatively wide collection of bandwidths without incurring a considerable loss in power. A.3 Power experiments: comparing wild bootstrap and permutations We consider the settings of the experiments presented in Figures 3 to 5 on synthetic and real-world data using the Gaussian and Laplace kernels. We run the same experiments using permutations instead of a wild bootstrap for one collection of bandwidths for each of the different settings. We then consider the power obtained using a wild bootstrap minus the one obtained using permutations, and plot this difference in Figure 8. The absolute difference in power between using a wild bootstrap or permutations is minimal, it is at most roughly 0.02 and is even considerably smaller in most cases. Furthermore, the difference overall does not seem to be biased towards using either of the two procedures. Since there is no significant difference in power, we suggest using a wild bootstrap when the sample sizes are the same since our implementation of it runs slightly faster in practice. Of course, when the sample sizes are different, one must use permutations. A.4 Power experiments: using unbalanced sample sizes In Figure 9, we consider fixing the sample size m and increasing the size n of the other sample, we use permutations since we work with different sample sizes. We consider the MMD Aggregated Two-Sample Test Power Gaussian kernel d = 1, 3 perturbations m = 100, Λ( 4, 0) d = 2, 2 perturbations m = 250, Λ( 4, 0) MNIST, Q3, m = 100 Λ(10, 14), Λ(12, 16) Power Laplace kernel Sample size n MMDAgg MMDAgg uniform MMDAgg centred MMDAgg increasing MMDAgg decreasing Figure 9: Power experiments with different sample sizes m = n on perturbed uniform ddimensional distributions and on the MNIST dataset using permutations. settings of Figures 3 to 5 with three and two perturbations for the uniform distributions in one and two dimensions, respectively. For the MNIST dataset, we use the set of images of all digits P against the set of images Q3 which does not include the digits 4, 6 and 8. We observe the same patterns across the six experiments presented in Figure 9. When fixing one of the sample sizes to be small (100 or 250), we cannot achieve power higher than 0.25 (except for MMDAgg on the MNIST data, which as in Figure 5 performs much better than all other tests) by increasing the size of the other sample to be very large (up to 5000). Indeed, we observe that a plateau is reached where considering an even larger sample size does not result in higher power. In some sense, all the information provided by the small sample has already been extracted and using more points for the other sample has almost no effect. As shown in Figures 3 to 5, we can obtain significantly higher power in all of those settings using samples of sizes m = n = 500 or m = n = 2000. Having access to even more samples overall (5100 instead of 1000) but in such an unbalanced way results in very low power. This shows the importance of having, if possible, balanced datasets with sample sizes of the same order. A.5 Power experiments: increasing sample sizes for the ost test In Figure 10, we report the results of Figures 3 to 5 for MMDAgg uniform and ost which are run using the same collections of bandwidths. As previously mentioned, the ost test Schrab, Kim, Albert, Laurent, Guedj and Gretton Power Gaussian kernel d = 1, Λ( 4, 0) d = 2, Λ( 4, 0) MNIST, Λ(10, 14), Λ(12, 16) 1 2 3 4 Number of perturbations Power Laplace kernel MMDAgg uniform 500 ost 500 ost 25 000 ost 50 000 ost 75 000 ost 100 000 1 2 3 Number of perturbations MMDAgg uniform 2 000 ost 2 000 ost 250 000 ost 500 000 ost 750 000 ost 1 000 000 Q1 Q2 Q3 Q4 Q5 Choice of alternative MMDAgg uniform 500 ost 500 ost 50 000 ost 100 000 ost 150 000 ost 200 000 ost 250 000 Figure 10: Power experiments increasing the sample sizes for the ost test on perturbed uniform d-dimensional distributions and on the MNIST dataset. The legend lists the name of the test followed by the sample sizes used. is restricted to the use of the linear-time MMD estimate, and hence obtains low power compared to the other tests which use the quadratic-time MMD estimate. We increase the sample sizes for the ost test until it matches the power of MMDAgg uniform with fixed sample sizes. For the case of 1-dimensional perturbed uniform distributions, we observe in Figure 10 that ost requires sample sizes of 75 000 and 50 000 in order to match the power obtained by MMDAgg uniform with m = n = 500 for the Gaussian and Laplace kernels, respectively. For the case of 2-dimensional perturbed uniform distributions, one million, and half a million samples are required for the Gaussian and Laplace kernels, respectively, to obtain the same power as MMDAgg uniform with 2000 samples. When working with the MNIST dataset, it takes 250 000 samples for ost to achieve similar power to the one obtained by MMDAgg uniform with 500 samples. Recall that the MNIST dataset consists of 70 000 images, so it is interesting to see that the power of ost keeps increasing for sample sizes which are more than ten times bigger than the size of the dataset. This is due to the use of the linear-time MMD estimate which MMD Aggregated Two-Sample Test for even sample sizes n = m is equal to i=1 hk(X2i 1, X2i, Y2i 1, Y2i) for hk defined as in Equation (5). The two pairs of samples (X2i 1, X2i) and (Y2i 1, Y2i) only appear together as a 4-tuple in this estimate, for i = 1, . . . , n/2. So, as long as we do not sample exactly the two same pairs of images together, this creates a new 4-tuple which is considered as new data for this estimate. This explains why considering sample sizes much larger than 70 000 still results in an increase in power. Appendix B. Relation between permutations and wild bootstrap In this section, we assume that we have equal sample sizes m = n and show the relation between using permutations and using a wild bootstrap for the estimator \ MMD 2 λ,b defined in Equation (6). First, we introduce some notation. For a matrix A =(ai,j)1 i,j 2n, we denote the sum of all its entries by A+ and denote by A the matrix A with all the entries {ai,i, an+i,n+i, an+i,i, ai,n+i : i = 1, . . . , n} set equal to 0. Note that A is composed of four (n n)-submatrices, and that A is the matrix A with the diagonal entries of those four submatrices set to 0. We let 1n Rn 1 denote the vector of length n with all entries equal to 1. We also let v := (1n, 1n) R2n 1 and note that vv = 1n1 n 1n1 n 1n1 n 1n1 n We let Kλ denote the kernel matrix (kλ(Ui, Uj))1 i,j 2n where Ui := Xi and Un+i := Yi for i = 1, . . . , n. We let Tr denote the trace operator and denote the Hadamard product. By definition of Equation (6), we have \ MMD 2 λ,b(Xn, Yn) := 1 n(n 1) 1 i =j n hλ(Xi, Xj, Yi, Yj) 1 i =j n kλ(Xi, Xj) + kλ(Yi, Yj) kλ(Xi, Yj) kλ(Yi, Xj) = 1 n(n 1) Tr K λ vv = 1 n(n 1) Tr v K λ v = 1 n(n 1) v K λ v. Schrab, Kim, Albert, Laurent, Guedj and Gretton For the wild bootstrap, as presented in Section 3.2.2, we have n i.i.d. Rademacher random variables ϵ := (ϵ1, . . . , ϵn) with values in { 1, 1}n, and let vϵ := (ϵ, ϵ) { 1, 1}2n, so that vϵv ϵ = ϵϵ ϵϵ As in Equation (11), we then have c M ϵ λ := 1 n(n 1) 1 i =j n ϵiϵjhλ(Xi, Xj, Yi, Yj) 1 i =j n ϵiϵjkλ(Xi, Xj) + ϵiϵjkλ(Yi, Yj) ϵiϵjkλ(Xi, Yj) ϵiϵjkλ(Yi, Xj) = 1 n(n 1) Tr K λ vϵv ϵ = 1 n(n 1) v ϵ K λ vϵ. We introduce more notation. For a matrix A = (ai,j)1 i,j 2n and some permutation τ : {1, . . . , 2n} {1, . . . , 2n}, we denote by A τ the matrix A with all the entries {aτ(i),τ(i), aτ(n+i),τ(n+i), aτ(n+i),τ(i), aτ(i),τ(n+i) : i = 1, . . . , n} set to be equal to 0. We denote by Aτ the permuted matrix aτ(i),τ(j) 1 i,j 2n. Similarly, for a vector w = (w1, . . . , w2n), we write the permuted vector as wτ = (wτ(1), . . . , wτ(2n)). Recall that v = (v1, . . . , v2n) = (1n, 1n) R2n 1. Similarly to Equation (10) in Sec- tion 3.2.1, but for the estimator \ MMD 2 λ,b, given a permutation σ: {1, . . . , 2n} {1, . . . , 2n}, we can define c M σ λ as 1 i =j n hλ(Uσ(i), Uσ(j), Uσ(n+i), Uσ(n+j)) 1 i =j n kλ(Uσ(i),Uσ(j))+kλ(Uσ(n+i),Uσ(n+j)) kλ(Uσ(i),Uσ(n+j)) kλ(Uσ(n+i),Uσ(j)) (K σ λ )σ vv = 1 n(n 1) Tr K σ λ vσ 1v σ 1 = 1 n(n 1) v σ 1K σ λ vσ 1. MMD Aggregated Two-Sample Test Using those formulas, we are able to prove the following proposition, but first we introduce two notions. Fix ℓ {1, . . . , n}. We say that a permutation τ : {1, . . . , 2n} {1, . . . , 2n} fixes Xℓ= Uℓand Yℓ= Un+ℓif τ(ℓ) = ℓand τ(n + ℓ) = n + ℓ. Moreover, we say that it swaps Xℓ= Uℓand Yℓ= Un+ℓif τ(ℓ) = n + ℓand τ(n + ℓ) = ℓ. We denote by P the set of all permutations which either fix or swap Xi and Yi for all i = 1, . . . , n. Since the identity belongs to P, every element in P is self-inverse, and the composition of two permutations in P gives an element in P, the set P is a subgroup of the permutation group. Proposition 11 Assume we have equal sample sizes m = n and that we work with the estimator \ MMD 2 λ,b defined in Equation (6). Then, using a wild bootstrap is equivalent to using permutations which belong to the subgroup P. There is a one-to-one correspondence between these two procedures. Proof First, note that for a permutation σ P, we either have σ(i) = i, σ(n + i) = n + i or σ(i) = n + i, σ(n + i) = i for i = 1, . . . , n. Hence, the two sets {kλ(Ui, Ui), kλ(Un+i, Un+i), kλ(Ui, Un+i), kλ(Un+i, Ui) : i = 1, . . . , n} and kλ(Uσ(i), Uσ(i)), kλ(Uσ(n+i), Uσ(n+i)), kλ(Uσ(i), Uσ(n+i)), kλ(Uσ(n+i), Uσ(i)) : i = 1, . . . , n are equal, we deduce that for σ P we have K σ λ = K λ. Moreover, since a permutation σ P either fixes or swaps Xi and Yi for i = 1, . . . , n, it must be self-inverse, that is σ 1 = σ. For the permutation σ P, we then have c M σ λ = 1 n(n 1) v σ K λvσ. for vσ = (vσ(1), . . . , vσ(2n)) where vi = 1 and vn+i = 1 for i = 1, . . . , n. We recall that for the wild bootstrap we have n i.i.d. Rademacher random variables ϵ := (ϵ1, . . . , ϵn) with values in { 1, 1}n and c M ϵ λ = 1 n(n 1) v ϵ K λ vϵ. where vϵ := (ϵ, ϵ) { 1, 1}2n. We need to show that for a given permutation σ P there exists some ϵ { 1, 1}n such that vϵ = vσ, and that for a given ϵ { 1, 1}n there exists a permutation σ P such that vσ = vϵ, and that this correspondence is one-to-one. Suppose we have a permutation σ P and let ϵ := (vσ(1), . . . , vσ(n)) { 1, 1}n. We claim that vσ = vϵ, that is, that (vσ(1), . . . , vσ(2n)) = (vσ(1), . . . , vσ(n), vσ(1), . . . , vσ(n)), so we need to prove that vσ(n+i) = vσ(i) for i = 1, . . . , n. As σ P, for i = 1, . . . , n, we either have σ(i) = i and σ(n + i) = n + i in which case vσ(n+i) = vn+i = 1 = vi = vσ(i), or σ(i) = n + i and σ(n + i) = i in which case we have vσ(n+i) = vi = 1 = vn+i = vσ(i). Schrab, Kim, Albert, Laurent, Guedj and Gretton This proves the first direction. Now, suppose we are given ϵ := (ϵ1, . . . , ϵn) i.i.d. Rademacher random variables. We have vϵ = (ϵ, ϵ) { 1, 1}2n and we need to construct σ P such that vσ = vϵ, that is, vσ(i) = ϵi and vσ(n+i) = ϵi for i = 1, . . . , n. We can construct such a permutation σ P as follows: for i = 1, . . . , n: if ϵi = 1 then let σ(i) := i and σ(n + i) := n + i (i.e. σ fixes Xi and Yi) we then have vσ(i) = vi = 1 = ϵi and vσ(n+i) = vn+i = 1 = ϵi if ϵi = 1 then let σ(i) := n + i and σ(n + i) := i (i.e. σ swaps Xi and Yi) we then have vσ(i) = vn+i = 1 = ϵi and vσ(n+i) = vi = 1 = ϵi This proves the second direction. Our two constructions show that the correspondence is one-to-one. This highlights the relation between those two procedures: using a wild bootstrap is equivalent to using a restricted set of permutations for the estimator \ MMD 2 λ,b. Appendix C. Efficient implementation of MMDAgg (Algorithm 1) We discuss how to compute Step 1 of Algorithm 1 efficiently. To compute the |Λ| kernel matrices for the Gaussian and Laplace kernels, we compute the matrix of pairwise distances only once. For each λ Λ, we compute all the values c M b λ,1 1 b B1+1 and c M b λ,2 1 b B2 together. We compute the sums over the permuted kernel matrices efficiently, in particular we do not want to explicitly permute rows and columns of the kernel matrices as this is computationally expensive. We start by considering the wild bootstrap case, so we have equal sample sizes m = n. We let Kλ denote the kernel matrix (kλ(Ui, Uj))1 i,j 2n where Ui := Xi and Un+i := Yi for i = 1, . . . , n. Note that Kλ is composed of four (n n)-submatrices, we denote by K λ the matrix Kλ with the diagonal entries of those four submatrices set to 0. As explained in Appendix B, for n i.i.d. Rademacher random variables ϵ := (ϵ1, . . . , ϵn) with values in { 1, 1}n, we have c M ϵ λ = 1 n(n 1) v ϵ K λ vϵ where vϵ := (ϵ, ϵ) { 1, 1}2n. We want to extend this to be able to compute c M ϵ(b) λ 1 b B for any B N\{0}. We can do this by letting R be the 2n B matrix consisting of stacked vectors vϵ(b) 1 b B and computing 1 n(n 1)diag R K λR . Note that, in general, given 2n B matrices A = (ai,j)1 i 2n 1 j B and C = (ci,j)1 i 2n 1 j B , we have diag A C = diag r=1 ar,icr,j 1 i B 1 j B r=1 ar,icr,i MMD Aggregated Two-Sample Test where denotes the Hadamard product and where P rows takes a matrix as input and outputs a vector which is the sum of the row vectors of the matrix. We deduce that diag R K λR = X rows R (K λR). We found that this way of computing the values c M ϵ(b) λ 1 b B is computationally faster than other alternatives. Letting 1n denote the vector of length n with all entries equal to 1, we can obtain an efficient version for Step 1 of Algorithm 1 using a wild bootstrap as follows. Efficient Step 1 of Algorithm 1 using a wild bootstrap: generate n (B1 + B2 + 1) matrix e R of Rademacher random variables concatenate R and e R to form the 2n (B1 + B2 + 1) matrix R replace the (B1 + 1)th column of R with the vector (1n, 1n) for λ Λ: compute kernel matrix K λ with zero diagonals for its four submatrices compute 1 n(n 1) P rows R (K λR) to get c M 1 λ,1, . . . , c M B1+1 λ,1 , c M 1 λ,2, . . . , c M B2 λ,2 c M 1 λ,1, . . . , c M B1+1 λ,1 = sort by ascending order c M 1 λ,1, . . . , c M B1+1 λ,1 Before tackling the case of permutations, we recall the main steps of the strategy used for the wild bootstrap case. As explained in Appendix B, we first noted that \ MMD 2 λ,b(Xn, Yn) = K λ 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) = 1 n(n 1) v K λ v. for v := (1n, 1n) R2n 1, where A+ denotes the sum all the entries of a matrix A. We then observed that it was enough to replace the vector v with vϵ := (ϵ, ϵ) { 1, 1}2n to obtain c M ϵ λ = 1 n(n 1) v ϵ K λ vϵ. The whole reasoning was based on the fact that we could rewrite the matrix 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) 1n1 n /(n(n 1)) as an outer product of vectors. Schrab, Kim, Albert, Laurent, Guedj and Gretton Now, we consider the permutation-based procedure. For a square matrix A, we let A0 denote the matrix A with its diagonal entries set equal to 0. We have \ MMD 2 λ,a(Xm, Yn) = 1 m(m 1) 1 i =i m k(Xi, Xi ) + 1 n(n 1) 1 j =j n k(Yj, Yj ) j=1 k(Xi, Yj) = K0 λ 1m1 m/(m(m 1)) 1m1 n /(mn) 1n1 m/(mn) 1n1 n /(n(n 1)) + where it is not possible to rewrite the matrix as an outer product of vectors. Instead, we break it down into a sum of three outer products of vectors. 1m1 m/(m(m 1)) 1m1 n /(mn) 1n1 m/(mn) 1n1 n /(n(n 1)) = 1 m(m 1) 1 mn 1m1 m 0m0 n 0n0 m 0n0 n + 1 n(n 1) 1 mn 0m0 m 0m0 n 0n0 m 1n1 n 1m1 m 1m1 n 1n1 m 1n1 n mn(m 1)uu + m n + 1 mn(n 1)ww + 1 mnvv where u := (1m, 0n), w := (0m, 1n) and v := (1m, 1n) of shapes (m + n) 1, with 0n denoting the vector of length n with all entries equal to 0. Using this fact, we obtain that \ MMD 2 λ,a(Xm, Yn) = n m + 1 mn(m 1)u K0 λu + m n + 1 mn(n 1)w K0 λw + 1 mnv K0 λv. For a vector a = (a1, . . . , aℓ) and a permutation τ : {1, . . . , ℓ} {1, . . . , ℓ}, we denote the permuted vector as aτ = (aτ(1), . . . , aτ(ℓ)). Recall that Ui := Xi, i = 1, . . . , m and Um+j := Yj, j = 1, . . . , n. Consider a permutation σ: {1, . . . , m + n} {1, . . . , m + n} and let Xσ m := Uσ(i) 1 i m and Yσ n := Uσ(m+j) 1 j n. Following a similar reasoning to the one presented in Appendix B, we find c M σ λ := \ MMD 2 λ,a(Xσ m, Yσ n) mn(m 1)u σ 1K0 λuσ 1 + m n + 1 mn(n 1)w σ 1K0 λwσ 1 + 1 mnv σ 1K0 λvσ 1 since kλ(Xi, Yi) : i = 1, . . . , n = kλ(Xσ(i), Yσ(i)) : i = 1, . . . , n . The aim is to compute c M σ(b) λ 1 b B efficiently for any B N \ {0}. We let U, V and W denote the (m + n) B matrices of stacked vectors uσ(b) 1 1 b B, vσ(b) 1 1 b B and wσ(b) 1 1 b B, respectively. We are then able to compute c M σ(b) λ n m + 1 mn(m 1)diag U K0 λU + m n + 1 mn(n 1)diag W K0 λW + 1 mndiag V K0 λV rows U K0 λU + m n + 1 rows W K0 λW + 1 mn rows V K0 λV . MMD Aggregated Two-Sample Test Since the inverse map for permutations is a bijection between the space of all permutations and itself, it follows that uniformly generating B permutations and taking their inverses is equivalent to directly uniformly generating B permutations. So, in practice, we can simply uniformly generate permutations τ (1), . . . , τ (B) and assume that these correspond to σ(1) 1, . . . , σ(B) 1 for uniformly generated permutations σ(1), . . . , σ(B). We can now present an efficient version for Step 1 of Algorithm 1 using permutations. Efficient Step 1 using permutations: construct (m + n) (B1 + B2 + 1) matrix U of stacked vectors of (1m, 0n) construct (m + n) (B1 + B2 + 1) matrix V of stacked vectors of (1m, 1n) construct (m + n) (B1 + B2 + 1) matrix W of stacked vectors of (0m, 1n) use B1+B2 permutations to permute the elements of the columns of U, V and W without permuting the elements of the (B1 + 1)th columns of U, V and W for λ Λ: compute kernel matrix K0 λ with zero diagonals compute c M 1 λ,1, . . . , c M B1+1 λ,1 , c M 1 λ,2, . . . , c M B2 λ,2 as n m + 1 mn(m 1) rows U K0 λU + m n + 1 rows W K0 λW + 1 mn rows V K0 λV c M 1 λ,1, . . . , c M B1+1 λ,1 = sort by ascending order c M 1 λ,1, . . . , c M B1+1 λ,1 Appendix D. Lower bound on the minimax rate over a Sobolev ball We claim that the two-sample minimax rate of testing over the Sobolev ball Ss d(R) is n 2s/(4s+d). Formally, let α, β (0, 1), d N \ {0} and M, s, R (0, ), we claim that there exists some positive constant C 0(M, d, s, R, α, β) such that ρ(Ss d(R), α, β, M) := inf α ρ( α, Ss d(R), β, M) C 0(M, d, s, R, α, β) n 2s/(4s+d) (18) where the infimum is taken over all tests α of non-asymptotic level α and where c m n C for some positive constants c and C . The proof mirrors the reasoning of Albert et al. (2022, Section 4, Theorem 4) who derive the independence minimax rate of testing n 2s/(4s+d1+d2) over the Sobolev ball Ss d1+d2(R) considering paired samples on Rd1 Rd2. While the case d = 1 is not covered by their framework, the proof extends naturally to it. We illustrate the main reasoning behind the proof as this is of interest for our experiments in Section 5.5. First, note that the minimax rate ρ(Ss d(R), α, β, M) is inf α ρ( α, Ss d(R), β, M) = inf α inf ρ > 0 : sup (p,q) FM ρ (Ss d(R)) Pp q( α(Xm, Yn) = 0) β ρ > 0 : inf α sup (p,q) FM ρ (Ss d(R)) Pp q( α(Xm, Yn) = 0) β Schrab, Kim, Albert, Laurent, Guedj and Gretton where FM ρ (Ss d(R)) := {(p, q) : max( p , q ) M, p q Ss d(R), p q 2 > ρ}. Hence, to prove Equation (18) it suffices to construct two probability densities p and q on Rd which satisfy max( p , q ) M, p q Ss d(R), p q 2 < C 1(m + n) 2s/(4s+d) for some C 1 > 0, and for which Pp q( α(Xm, Yn) = 0) > β holds for all tests α with non-asymptotic level α. Intuitively, one constructs densities which are close enough in L2-norm so that any test with non-asymptotic level α fails to distinguish them from one another. Albert et al. (2022, Section 4) show that a suitable choice of p and q is to take the uniform probability density on [0, 1]d and a perturbed version of it. To construct the latter, first define for all u R the function G(u) := exp 1 1 (4u + 3)2 2)(u) exp 1 1 (4u + 1)2 which is plotted in Figure 2 in Section 5.5. Consider P N \ {0} and some vector θ = (θν)ν {1,...,P}d { 1, 1}P d of length P d with entries either 1 or 1 which is indexed by the P d d-dimensional elements of {1, . . . , P}d. Then, the perturbed uniform density is defined as fθ(u) := 1[0,1]d(u) + P s X ν {1,...,P}d θν i=1 G (Pui νi) (19) for all u Rd. As illustrated in Figure 2, this indeed corresponds to a uniform probability density with P perturbations along each dimension. This construction by Albert et al. (2022) is a generalisation of the detailed construction of Butucea (2007, Section 5) for the 1-dimensional case for goodness-of-fit testing. We also point out the work of Li and Yuan (2019, Theorems 5, part (ii)) who use a similar construction to show that, for any alternative10 in FM ρ (Ss d(R)) with rate ρ smaller or equal to n 2s/(4s+d), there exists some α (0, 1) such that any test with asymptotic level α must have power asymptotically strictly smaller than one. The original idea behind all those constructions is due to Ingster (1987, 1993b) with his work on nonparametric minimax rates for goodness-of-fit testing. Appendix E. Proofs In this section, we prove the statements presented in Section 3. We first introduce some standard results. First, recall that we assume m n and n Cm for some constant C 1 as in Equation (7). It follows that n = 2(C + 1) 2n 2(C + 1) m + n and 1 m + n 2 m + n 1 10. To be more precise, the alternatives considered by Li and Yuan (2019) require both p and q to belong to the Sobolev ball (rather than just p q in our case) but do not require these densities to be bounded. Moreover, they use the non-homogeneous Sobolev space definition (with an extra +1 in the weighting term) while we use the homogeneous definition (without that extra term). MMD Aggregated Two-Sample Test For the kernels K1, . . . , Kd satisfying the properties presented in Section 3.1, we define the constants R |Ki(xi)|dxi < and κ2(d) := R Ki(xi)2 dxi < (21) which are well-defined as Ki L1(R) L2(R) for i = 1, . . . , d by assumption. We do not make explicit the dependence on K1, . . . , Kd in the constants as we consider those to be chosen a priori. Moreover, we often use the kernel properties of kλ presented in Equation (8), that are Rd kλ(x, y)dx = R Ki x i dx i = 1 Rd kλ(x, y)2dx = 2 dxi = 1 λ1 λd R Ki x i 2 dx i = κ2 λ1 λd . We often use in our proofs the standard result that, for a1, . . . , aℓ R, we have i=1 12 ! ℓ X which holds by Cauchy Schwarz inequality. In our proofs, we show that there exist some constants which are large enough so that our results hold. We keep track of those constants and show how they depend on each other. The aim is to show that such constants exist, we do not focus on obtaining the tightest constants possible. E.1 Proof of Proposition 1 Recall that in Sections 3.2.1 and 3.2.2 we have constructed elements c M b λ 1 b B+1 for the two MMD estimators defined in Equations (3) and (6), respectively. The first one uses permutations while the second uses a wild bootstrap. For those two cases, we first show that the elements c M b λ 1 b B+1 are exchangeable under the null hypothesis H0 : p = q. We are then able to prove that the test λ,B α has the prescribed level using the exchangeability of c M b λ Exchangeability using permutations as in Section 3.2.1. Recall that in this case we have permutations σ(1), . . . , σ(B) of {1, . . . , m+n}. We also have c M b λ := \ MMD 2 λ,a(Xσ(b) m , Yσ(b) n ) for b = 1, . . . , B and c M B+1 λ := \ MMD 2 λ,a(Xm, Yn). Following the same reasoning as in the proof of Albert et al. (2022, Proposition 1, Equation C.1), we can deduce that c M b λ 1 b B+1 are exchangeable under the null hypothesis. The only difference is that they work with the Hilbert Schmidt Independence Criterion rather than with the Maximum Mean Discrepancy, but this does not affect the reasoning of the proof. Schrab, Kim, Albert, Laurent, Guedj and Gretton Exchangeability using a wild bootstrap as in Section 3.2.2. The exchangeability under the null using the wild bootstrap follows from the one using permutations since by Proposition 11 using the wild bootstrap corresponds to using a subgroup of the permutation group. We show below how the original statistic and the permuted one can be seen to have the same distribution under the null. For b = 1, . . . , B, we have n i.i.d. Rademacher random variables ϵ(b) := ϵ(b) 1 , . . . , ϵ(b) n with values in { 1, 1}n and c M b λ := 1 n(n 1) 1 i =j n ϵ(b) i ϵ(b) j hλ(Xi, Xj, Yi, Yj) where hλ is defined in Equation (5). We also have c M B+1 λ := \ MMD 2 λ,b(Xn, Yn) = 1 n(n 1) 1 i =j n hλ(Xi, Xj, Yi, Yj). By the reproducing property of the kernel kλ, we have sup f Fλ f(Xi) f(Yi) 2 = sup f Fλ i=1 kλ(Xi, ) 1 i=1 kλ(Yi, ) i=1 kλ(Xi, ) 1 i=1 kλ(Yi, ) j=1 hλ(Xi, Xj, Yi, Yj) where Fλ := {f Hkλ : f Hkλ 1}. Under the null hypothesis H0 : p = q, all the samples (X1, . . . , Xn, Y1, . . . , Yn) are independent and identically distributed. So, the distribution of supf Fλ 1 n Pn i=1 f(Xi) f(Yi) 2 does not change if we randomly exchange Xi and Yi for each i = 1, . . . , n. This can be formalized using n i.i.d. Rademacher random variables ϵ1, . . . , ϵn, we have sup f Fλ f(Xi) f(Yi) 2 d= H0 i=1 ϵi f(Xi) f(Yi) 2 , where the notation d= H0 means that the two random variables have the same distribution under the null hypothesis H0 : p = q. Since we also have sup f Fλ i=1 ϵi f(Xi) f(Yi) 2 = sup f Fλ i=1 ϵikλ(Xi, ) 1 i=1 ϵikλ(Yi, ) i=1 ϵikλ(Xi, ) 1 i=1 ϵikλ(Yi, ) j=1 ϵiϵjhλ(Xi, Xj, Yi, Yj), MMD Aggregated Two-Sample Test j=1 hλ(Xi, Xj, Yi, Yj) d= H0 j=1 ϵiϵjhλ(Xi, Xj, Yi, Yj). Since ϵ2 i = 1 for i = 1, . . . , n, subtracting Pn i=1 hλ(Xi, Xi, Yi, Yi) from both sides, we get 1 i =j n hλ(Xi, Xj, Yi, Yj) d= H0 1 i =j n ϵiϵjhλ(Xi, Xj, Yi, Yj). Level of the test. Following a similar reasoning to the one presented by Albert et al. (2022, Proposition 1), we have λ,B α (Xm, Yn, ZB) = 1 \ MMD 2 λ(Xm, Yn) > bq λ,B 1 α ZB Xm, Yn c M B+1 λ > c M (B+1)(1 α) λ b=1 1 c M b λ < c M B+1 λ (B + 1)(1 α) b=1 1 c M b λ < c M B+1 λ B + 1 (B + 1)(1 α) b=1 1 c M b λ c M B+1 λ α(B + 1) b=1 1 c M b λ c M B+1 λ α(B + 1) b=1 1 c M b λ c M B+1 λ ! where we have used the fact that B+1 (B+1)(1 α) = α(B+1) . Using the exchangeability of c M b λ 1 b B+1, the result of Romano and Wolf (2005a, Lemma 1) guarantees that b=1 1 c M b λ c M B+1 λ ! We deduce that Pp p r λ,B α (Xm, Yn, ZB) = 1 α. E.2 Proof of Lemma 2 Let A := n \ MMD 2 λ(Xm, Yn) bq λ,B 1 α ZB Xm, Yn o Schrab, Kim, Albert, Laurent, Guedj and Gretton B := MMD2 λ(p, q) r 2 β varp q \ MMD 2 λ(Xm, Yn) + bq λ,B 1 α ZB Xm, Yn . By assumption, we have Pp q r(B) 1 β 2 , and we want to show Pp q r(A) β. Note that Pp q r(A | B) = Pp q r \ MMD 2 λ(Xm, Yn) bq λ,B 1 α ZB Xm, Yn B \ MMD 2 λ(Xm, Yn) MMD2 λ(p, q) r 2 β varp q \ MMD 2 λ(Xm, Yn) MMD2 λ(p, q) \ MMD 2 λ(Xm, Yn) r 2 β varp q \ MMD 2 λ(Xm, Yn) MMD2 λ(p, q) \ MMD 2 λ(Xm, Yn) r 2 β varp q \ MMD 2 λ(Xm, Yn) by Chebyshev s inequality (Chebyshev, 1899) as Ep q h \ MMD 2 λ(Xm, Yn) i = MMD2 λ(p, q). We then have Pp q r(A) = Pp q r(A | B)Pp q r(B) + Pp q r(A | Bc)Pp q r(Bc) E.3 Proof of Proposition 3 We prove this result separately for our two MMD estimators \ MMD 2 λ,a and \ MMD 2 λ,b defined in Equations (3) and (6), respectively. Variance bound for MMD estimator \ MMD 2 λ,a defined in Equation (3). In this case, we use the fact that \ MMD 2 λ,a can be written as a two-sample U-statistic as in Equation (4). As noted by Kim et al. (2022, Appendix E, Part 1) one can derive from the explicit variance formula of the two-sample U-statistic (Lee, 1990, Equation 2 p.38) that there exists some positive constant c0 such that varp q \ MMD 2 λ,a(Xm, Yn) c0 m + σ2 λ,0,1 σ2 λ,1,0 := var X EX ,Y,Y hλ(X, X , Y, Y ) , σ2 λ,0,1 := var Y EX,X ,Y hλ(X, X , Y, Y ) , σ2 λ,2,2 := var X,X ,Y,Y hλ(X, X , Y, Y ) , MMD Aggregated Two-Sample Test where X, X iid p and Y, Y iid q are all independent of each other. Making use of Equation (20), we deduce that there exists a positive constant c 0 such that varp q \ MMD 2 λ,a(Xm, Yn) c 0 σ2 λ,1,0 + σ2 λ,0,1 m + n + σ2 λ,2,2 (m + n)2 Recall that ϕλ(u) := Qd i=1 1 λi Ki ui λi for u Rd and that ψ := p q. Letting Gλ = ψ ϕλ, we then have for all u Rd Gλ(u) = (ψ ϕλ) (u) Rd ψ(u )ϕλ(u u )du Rd ψ(u )kλ(u, u )du Rd kλ(u, u ) p(u ) q(u ) du = EX kλ(u, X ) EY kλ(u, Y ) . EX ,Y hλ(X, X , Y, Y ) = EX ,Y kλ(X, X ) + kλ(Y, Y ) kλ(X, Y ) kλ(X , Y ) = EX kλ(X, X ) EY kλ(X, Y ) EX kλ(Y, X ) EY kλ(Y, Y ) = Gλ(X) Gλ(Y ). Hence, we get σ2 λ,1,0 := var X EX ,Y,Y hλ(X, X , Y, Y ) = var X(EY [Gλ(X) Gλ(Y )]) = var X(Gλ(X) EY [Gλ(Y )]) = var X(Gλ(X)) Rd Gλ(x)2p(x)dx Rd Gλ(x)2dx M Gλ 2 2 = M ψ ϕλ 2 2 and, similarly, we get σ2 λ,0,1 := var Y EX,X ,Y hλ(X, X , Y, Y ) M ψ ϕλ 2 2. Schrab, Kim, Albert, Laurent, Guedj and Gretton For the third term, we have σ2 λ,2,2 := var X,X ,Y,Y hλ(X, X , Y, Y ) = var X,X ,Y,Y kλ(X, X ) + kλ(Y, Y ) kλ(X, Y ) kλ(X , Y ) EX,X ,Y,Y h kλ(X, X ) + kλ(Y, Y ) kλ(X, Y ) kλ(X , Y ) 2i 4 EX,X kλ(X, X )2 + EY,Y kλ(Y, Y )2 + 2EX,Y kλ(X, Y )2 . EX,Y kλ(X, Y )2 = Z Rd kλ(x, y)2p(x)q(y) dxdy Rd kλ(x, y)2dx q(y) dy = p κ2 λ1 λd where κ2 depends on d and is defined in Equation (21). Similarly, we have EX,X kλ(X, X )2 Mκ2 λ1 λd and EY,Y kλ(Y, Y )2 Mκ2 λ1 λd . (22) We deduce that σ2 λ,2,2 := var X,X ,Y,Y hλ(X, X , Y, Y ) 16Mκ2 Letting C1(M, d) := max n 2c 0M, 16c 0Mκ2 o and combining the results, we obtain varp q \ MMD 2 λ,a(Xm, Yn) c 0 σ2 λ,1,0 + σ2 λ,0,1 m + n + σ2 λ,2,2 (m + n)2 ψ ϕλ 2 2 m + n + 1 (m + n)2 λ1 λd Variance bound for MMD estimator \ MMD 2 λ,b defined in Equation (6). The MMD estimator \ MMD 2 λ,b(Xn, Yn) := 1 n(n 1) 1 i =j n hλ(Xi, Xj, Yi, Yj) is a one-sample U-statistic of order 2. Hence, we can apply the result of Albert et al. (2022, Lemma 10) to get varp q \ MMD 2 λ,b(Xn, Yn) ec0 n + σ2 λ,2,2 MMD Aggregated Two-Sample Test for some positive constant ec0, where σ2 λ,1,1 := var X,Y EX ,Y hλ(X, X , Y, Y ) and σ2 λ,2,2 := var X,X ,Y,Y hλ(X, X , Y, Y ) 16Mκ2 λ1 λd as shown earlier. Using the above results, we get σ2 λ,1,1 := var X,Y EX ,Y hλ(X, X , Y, Y ) = var X,Y (Gλ(X) Gλ(Y )) EX,Y h (Gλ(X) Gλ(Y ))2i 2 EX Gλ(X)2 + EY Gλ(Y )2 4M ψ ϕλ 2 2. Letting e C1(M, d) := 4 max {4ec0M, 16ec0Mκ2}, we deduce that varp q \ MMD 2 λ,b(Xn, Yn) ec0 n + σ2 λ,2,2 4 e C1(M, d) ψ ϕλ 2 2 n + 1 n2λ1 λd ψ ϕλ 2 2 2n + 1 (2n)2λ1 λd E.4 Proof of Proposition 4 Recall that (c M b λ )1 b B is defined in Sections 3.2.1 and 3.2.2 for the estimators \ MMD 2 λ,a and \ MMD 2 λ,b, respectively, and that c M B+1 λ := \ MMD 2 λ(Xm, Yn) for both estimators. Let us recall that the (1 α)-quantile function of a random variable X with cumulative distribution function FX is given by q1 α = inf{x R : 1 α FX(x)}. We denote by FB and FB+1 the empirical cumulative distribution functions of (c M b λ )1 b B and (c M b λ )1 b B+1, respectively. For the case of the estimator \ MMD 2 λ,a, we denote by F the cumulative distribution function of the conditional distribution of c M σ λ (defined in Equation (10)) given Xm and Yn, where the randomness comes from the uniform choice of permutation σ among all possible permutations of {1, . . . , m + n}. For the case of the estimator \ MMD 2 λ,b, we similarly denote by F the cumulative distribution function of the conditional distribution of c M ϵ λ (defined in Equation (11)) given Xm and Yn, where the randomness comes from the n i.i.d. Rademacher variables ϵ := (ϵ1, . . . , ϵn) with values in { 1, 1}n. Schrab, Kim, Albert, Laurent, Guedj and Gretton Based on the above definitions, we can write bq λ, 1 α (Xm, Yn) = inf{u R : 1 α F (u)} bq λ,B 1 α ZB Xm, Yn = inf{u R : 1 α FB+1(u)} = inf u R : 1 α 1 B + 1 b=1 1 c M b λ u = inf u R : (B + 1)(1 α) b=1 1 c M b λ u = c M (B+1)(1 α) λ where c M 1 λ c M B+1 λ denote the ordered simulated test statistics (c M b λ )1 b B+1. Now, for any given δ > 0, define the event sup u R |FB(u) F (u)| As noted by Kim et al. (2022, Remark 2.1), Dvoretzky Kiefer Wolfowitz inequality (Dvoretzky et al., 1956), more precisely the version with the tight constant which is due to Massart (1990), then guarantees that Pr(A|Xm, Yn) 1 δ 2 for any Xm and Yn, so we deduce that Pp q r(A) 1 δ 2. We now assume that the event A holds, so the bound we derive holds with probability 1 δ 2. Notice that we cannot directly apply the Dvoretzky Kiefer Wolfowitz inequality to FB+1 since it is not based on i.i.d. samples. Nevertheless, under the event A, we have bq λ,B 1 α ZB Xm, Yn = inf{u R : 1 α FB+1(u)} = inf u R : 1 α 1 B + 1 b=1 1 c M b λ u inf u R : 1 α 1 B + 1 b=1 1 c M b λ u = inf u R : (1 α)B + 1 inf u R : (1 α)B + 1 = inf u R : (1 α)B + 1 | {z } :=1 α = bq λ, 1 α (Xm, Yn). MMD Aggregated Two-Sample Test Now, we take B large enough (only depending on α and δ) such that so that bq λ, 1 α (Xm, Yn) bq λ, 1 α/2(Xm, Yn) under the event A. By reducing this problem to a quadratic equation with respect to B, we find + α α2 2 α2(1 α)2 In particular, by upper bounding α2(1 α)2 by 0, we find that the above inequality holds as soon as Note that this condition is in particular trivially satisfied if With this choice of B, we have Pp q r bq λ,B 1 α ZB Xm, Yn bq λ, 1 α/2(Xm, Yn) 1 δ We now upper bound bq λ, 1 α (Xm, Yn) for the two estimators \ MMD 2 λ,a and \ MMD 2 λ,b separately. We then use this to prove the required upper bound on bq λ,B 1 α ZB Xm, Yn . Quantile bound for MMD estimator \ MMD 2 λ,a defined in Equation (3). In this case, we base our reasoning on the work of Kim et al. (2022, proof of Lemma C.1). Recall from Equation (7) that we assume that m n and n Cm for some positive constant C. We use the notation presented in Section 3.2.1 that Ui := Xi and Um+j := Yj for i = 1, . . . , m and j = 1, . . . , n. By the result of Kim et al. (2022, Equation 59), there exists some c1 > 0 such that bq λ, 1 α (Xm, Yn) c1 1 m2(m 1)2 X 1 i =j m+n kλ(Ui, Uj)2 ln 1 almost surely. As shown in Equation (22), for the constant κ2(d) defined in Equation (21), we have max EX,X kλ(X, X )2 , EX,Y kλ(X, Y )2 , EY,Y kλ(Y, Y )2 Mκ2 λ1 λd (23) Schrab, Kim, Albert, Laurent, Guedj and Gretton where X, X iid p and Y, Y iid q are all independent of each other. We deduce that 1 m2(m 1)2 X 1 i =j m+n kλ(Ui, Uj)2 (m + n)(m + n 1) m2(m 1)2 Mκ2 λ1 λd (m + C 1n)4 where we use the fact that n Cm. Using Markov s inequality, we get that, for any δ (0, 1), we have 1 m2(m 1)2 X 1 i =j m+n kλ(Ui, Uj)2 2 1 m2(m 1)2 X 1 i =j m+n kλ(Ui, Uj)2 1 m2(m 1)2 X 1 i =j m+n kλ(Ui, Uj)2 2 (m + n)2 λ1 λd bq λ, 1 α/2(Xm, Yn) 1 (m + n) λ1 λd bq λ, 1 α/2(Xm, Yn) 1 (m + n) λ1 λd α since α (0, 0.5). We now let C2(M, d) := 16c1C2 2Mκ2. Then, for all B N such that B 3 α2 ln 4 δ + α(1 α) , we have bq λ,B 1 α ZB Xm, Yn 1 δ C2(M, d) ln 1 (m + n) λ1 λd n bq λ,B 1 α ZB Xm, Yn bq λ, 1 α/2(Xm, Yn) o bq λ, 1 α/2(Xm, Yn) 1 δ C2(M, d) ln 1 (m + n) λ1 λd MMD Aggregated Two-Sample Test where we use the standard fact that for events B and C satisfying P(B) 1 δ1 and P(C) 1 δ2, we have P(B C) = 1 P(Bc Cc) 1 P(Bc) P(Cc) 1 δ1 δ2. Quantile bound for MMD estimator \ MMD 2 λ,b defined in Equation (6). For this case, in order to upper bound bq λ, 1 α (Xn, Yn), we can use the result of de la Pe na and Gin e (1999, Corollary 3.2.6) and Markov s inequality as done by Fromont et al. (2012, Appendix D). We obtain that there exists a positive constant ec1 such that 1 i =j n ϵiϵjhλ(Xi, Xj, Yi, Yj) 1 i =j n hλ(Xi, Xj, Yi, Yj)2 ln 2 where ϵ1, . . . , ϵn are i.i.d. Rademacher variables, and so P 1 i =j n ϵiϵjhλ(Xi, Xj, Yi, Yj) is a Rademacher chaos. We deduce that bq λ, 1 α (Xn, Yn) ec1 1 n2(n 1)2 X 1 i =j n hλ(Xi, Xj, Yi, Yj)2 ln 2 almost surely. Using Equation (23), we have 1 n2(n 1)2 X 1 i =j n hλ(Xi, Xj, Yi, Yj)2 = 1 n2(n 1)2 Ep q k(Xi, Xj) + k(Yi, Yj) k(Xi, Yj) k(Xj, Yi) 2 EX,X kλ(X, X )2 + EY,Y kλ(Y, Y )2 + 2EX,Y kλ(X, Y )2 32Mκ2 n2λ1 λd where X, X iid p and Y, Y iid q are all independent of each other, and where κ2 is the constant defined in Equation (21) depending on d. Similarly to the previous case, we can then use Markov s inequality to get that, for any δ (0, 1), we have 1 n2(n 1)2 X 1 i =j n hλ(Xi, Xj, Yi, Yj)2 2 1 n2(n 1)2 X 1 i =j n hλ(Xi, Xj, Yi, Yj)2 1 n2(n 1)2 X 1 i =j n hλ(Xi, Xj, Yi, Yj)2 2 δ 32Mκ2 n2λ1 λd bq λ, 1 α/2(Xn, Yn) 1 bq λ, 1 α/2(Xn, Yn) 1 Schrab, Kim, Albert, Laurent, Guedj and Gretton α since α (0, 0.5). Letting e C2(M, d) := 48ec1 Mκ2 and applying the same reasoning as earlier, we get bq λ,B 1 α ZB Xm, Yn 1 δ e C2(M, d) ln 1 for all B N satisfying B 3 α2 ln 4 δ + α(1 α) . E.5 Proof of Theorem 5 First, as shown by Gretton et al. (2012a, Lemma 6), the Maximum Mean Discrepancy can be written as MMD2 λ(p, q) = EX,X kλ(X, X ) 2 EX,Y [kλ(X, Y )] + EY,Y kλ(Y, Y ) Rd kλ(x, x )p(x)p(x ) dxdx Rd kλ(x, y)p(x)q(y) dxdy Rd kλ(y, y )q(y)q(y ) dydy Rd kλ(u, u ) p(u) q(u) p(u ) q(u ) dudu for X, X iid p and Y, Y iid q all independent of each other. Using the function ϕλ defined in Equation (9) and ψ := p q, we obtain MMD2 λ(p, q) = Z Rd ϕλ(u u )ψ(u)ψ(u ) dudu Rd ψ(u )ϕλ(u u ) du du Rd ψ(u) ψ ϕλ (u) du = ψ, ψ ϕλ 2 ψ 2 2 + ψ ϕλ 2 2 ψ ψ ϕλ 2 2 where the last equality is obtained by expanding ψ ψ ϕλ 2 2. By Lemma 2, a sufficient condition to ensure that Pp q r λ,B α (Xm, Yn, ZB) = 0 β is MMD2 λ(p, q) r 2 β varp q \ MMD 2 λ(Xm, Yn) + bq λ,B 1 α ZB Xm, Yn 1 β and an equivalent sufficient condition is ψ 2 2 ψ ψ ϕλ 2 2 ψ ϕλ 2 2 + 2 r 2 β varp q \ MMD 2 λ + 2bq λ,B 1 α MMD Aggregated Two-Sample Test By Proposition 3, we have varp q \ MMD 2 λ(Xm, Yn) C1(M, d) ψ ϕλ 2 2 m + n + 1 (m + n)2 λ1 λd β varp q \ MMD 2 λ(Xm, Yn) 2 β ψ ϕλ 2 2 m + n + 2C1 β(m + n)2 λ1 λd ψ ϕλ 2 2 2C1 β(m + n) + 2 2C1 β(m + n) λ1 λd ψ ϕλ 2 2 + 2C1 β(m + n) + 2 2C1 β(m + n) λ1 λd ψ ϕλ 2 2 + 2C1 + 2 2C1 β(m + n) λ1 λd ln 1 ψ ϕλ 2 2 + 6C1 β(m + n) λ1 λd ln 1 ψ ϕλ 2 2 2 r 2 β varp q \ MMD 2 λ 6C1 ln 1 β(m + n) λ1 λd where for the third inequality we used the fact that x + y x + y for all x, y > 0, for the fourth inequality we used the fact that 2 xy x + y for all x, y > 0, and for the fifth inequality we use the fact that λ1 λd 1, β (0, 1) and ln 1 α > 1. A similar reasoning has been used by Fromont et al. (2013, Theorem 1) and Albert et al. (2022, Theorem 1). Let C3(M, d) := 6C1(M, d) + 2 2C2(M, d) where C1 and C2 are the constants from Propositions 3 and 4, respectively. Assume that our condition holds, that is ψ 2 2 ψ ψ ϕλ 2 2 (2 2C2 + 6C1) ln 1 β(m + n) λ1 λd . Omitting the variables for bq λ,B 1 α ZB Xm, Yn and for \ MMD 2 λ(Xm, Yn), we then get 2bq λ,B 1 α ψ 2 2 ψ ψ ϕλ 2 2 + ψ ϕλ 2 2 2 r 2 β varp q \ MMD 2 λ 2bq λ,B 1 α (6C1 + 2 β(m + n) λ1 λd 6C1 ln 1 β(m + n) λ1 λd β(m + n) λ1 λd bq λ,B 1 α C2 (m + n) λ1 λd Schrab, Kim, Albert, Laurent, Guedj and Gretton where the third inequality holds because β (0, 1) and the last one holds by Proposition 4 since B 3 α2 ln 8 β + α(1 α) . Lemma 2 then implies that Pp q r λ,B α (Xm, Yn, ZB) = 0 β. E.6 Proof of Theorem 6 Theorem 5 gives us a condition on ψ 2 2 ψ ψ ϕλ 2 2 to control the power of the test λ,B α . We now want to upper bound ψ ψ ϕλ 2 2 in terms of the bandwidths when assuming that the difference of the densities lie in a Sobolev ball. We first prove that if ψ := p q Ss d(R) for some s > 0 and R > 0, then there exists some S (0, 1) such that ψ ψ ϕλ 2 2 S2 ψ 2 2 C 4(d, s, R) i=1 λ2s i (24) for some positive constant C 4(d, s, R). For j = 1, . . . , d, since Kj L1(R) L2(R), it follows by the Riemann-Lebesgue Lemma that its Fourier transform b Kj is continuous. For j = 1, . . . , d, note that b Kj(0) = Z R Kj(x)e ix0dx = Z R Kj(x)dx = 1 and, since Kj L1(R) L2(R), also that Kj(x)e ixξj dx = R |Kj(x)|dx =: κ1 < as defined in Equation (21). We deduce that 1 Qd i=1 b Ki(ξi) 1 + κ1 for all ξ Rd. Let us define g: Rd R by g(ξ) = 1 Qd i=1 b Ki(ξi) for ξ Rd. We have g(0, . . . , 0) = 0, so by continuity of g, there exists some t > 0 such that S := sup ξ 2 t |g(ξ)| < 1. For any s > 0, we also define Ts := sup ξ 2>t 1 Qd i=1 b Ki(ξi) ξ s 2 1 + κ1 Let Ψ := ψ ψ ϕλ. As it is a scaled product of K1, . . . , Kd L1(R) L2(R), we have ϕλ L1(Rd) L2(Rd). Since we assume that ψ Ss d(R), we have ψ L1(Rd) L2(Rd). For p {1, 2}, since ψ L1(Rd), we have ψ ϕλ p ψ 1 ϕλ p < . Hence, we deduce that Ψ L1(Rd) L2(Rd). By Plancherel s Theorem, we then have (2π)d Ψ 2 2 = ||bΨ||2 2 (2π)d ψ ψ ϕλ 2 2 = (1 c ϕλ) bψ 2 MMD Aggregated Two-Sample Test In general, for a > 0 the Fourier transform of a function x 7 1 a is ξ 7 bf(aξ). Since ϕλ(u) := Qd i=1 1 λi Ki ui λi for u Rd, we deduce that c ϕλ(ξ) = Qd i=1 b Ki(λiξi) for ξ Rd. Therefore, we have (2π)d ψ ψ ϕλ 2 2 = (1 c ϕλ) bψ 2 Rd|1 c ϕλ(ξ)|2 bψ(ξ) 2 dξ i=1 b K(λiξi) 2 bψ(ξ) 2 dξ i=1 b K(λiξi) 2 bψ(ξ) 2 dξ + Z i=1 b K(λiξi) 2 bψ(ξ) 2 dξ bψ(ξ) 2 dξ + T 2 s ξ 2>t (λ1ξ1, . . . , λdξd) 2s 2 bψ(ξ) 2 dξ i=1 λ2 i ξ2 i !s bψ(ξ) 2 dξ S2(2π)d ψ 2 2 + T 2 s !s bψ(ξ) 2 dξ = S2(2π)d ψ 2 2 + T 2 s λ 2s 2 Rd ξ 2s 2 bψ(ξ) 2 dξ S2(2π)d ψ 2 2 + T 2 s λ 2s 2 (2π)d R2 since ψ Ss d(R), and where we have used Plancherel s Theorem for ψ L1(Rd) L2(Rd). We have proved that there exists some S (0, 1) such that ψ ψ ϕλ 2 2 S2 ψ 2 2 + T 2 s R2 λ 2s 2 . If s 1, then x 7 xs is convex and, by Jensen s inequality (finite form), we have 1 d λ2 i s = ds 1 d X i=1 λ2s i d1+s d X i=1 λ2s i . If s < 1, then γ := 1 s > 1 and so, it is a standard result that γ 1. We then have λ2s i γ !1/γ = λ2s γ λ2s 1 = i=1 λ2s i d1+s d X i=1 λ2s i . Hence, for all s > 0, we have λ 2s 2 d1+s Pd i=1 λ2s i . We conclude that ψ ψ ϕλ 2 2 S2 ψ 2 2 + T 2 s R2d1+s d X Schrab, Kim, Albert, Laurent, Guedj and Gretton which proves the statement presented in Equation (24) with C 4(d, s, R) := T 2 s R2d1+s. We now consider the constant C3(M, d) from Theorem 5. Suppose we have ψ 2 2 T 2 s R2d1+s i=1 λ2s i + C3 (1 S2) ln 1 β(m + n) λ1 λd (1 S2) ψ 2 2 T 2 s R2d1+s d X i=1 λ2s i + C3 ln 1 β(m + n) λ1 λd ψ 2 2 S2 ψ 2 2 + T 2 s R2d1+s d X i=1 λ2s i + C3 ln 1 β(m + n) λ1 λd ψ 2 2 ψ ψ ϕλ 2 2 + C3 ln 1 β(m + n) λ1 λd then, by Theorem 5, we can ensure that Pp q r λ,B α (Xm, Yn, ZB) = 0 β. By definition of uniform separation rates, we deduce that ρ λ,B α , Ss d(R), β, M 2 T 2 s R2d1+s i=1 λ2s i + C3 (1 S2) ln 1 β(m + n) λ1 λd C4(M, d, s, R, β) i=1 λ2s i + ln 1 (m + n) λ1 λd for C4(M, d, s, R, β) := max n T 2 s R2d1+s 1 S2 , C3(M,d) β(1 S2) o . E.7 Proof of Corollary 7 By Theorem 6, if λ1 λd 1, we have ρ λ,B α , Ss d(R), β, M 2 C4(M, d, s, R, β) i=1 λ2s i + ln 1 (m + n) λ1 λd We want to express the bandwidths in terms of the sum of sample sizes m + n raised to some negative power such that the terms Pd i=1 λ2s i and 1 (m+n) λ1 λd have the same be- haviour in m + n. With the choice of bandwidths λ i := (m + n) 2/(4s+d) for i = 1, . . . , d, the term Pd i=1(λ i )2s has order (m + n) 4s/(4s+d) and the term 1 (m+n) λ 1 λ d has order (m + n)d/(4s+d) 1 = (m + n) 4s/(4s+d). So, indeed, this choice of bandwidths leads to the same behaviour in m+n for the two terms, which gives the smallest order of m+n possible. MMD Aggregated Two-Sample Test It is clear that λ 1 λ d < 1, we find that ρ λ ,B α , Ss d(R), β, M 2 C4(M, d, s, R, β) i=1 (λ i )2s + ln 1 (m + n) pλ 1 λ d C4(M, d, s, R, β) (m + n) 4s/(4s+d) + ln 1 (m + n) 4s/(4s+d) C4(M, d, s, R, β) ln 1 (m + n) 4s/(4s+d) = C5(M, d, s, R, α, β)2 (m + n) 4s/(4s+d) for C5(M, d, s, R, α, β) := q C4(M, d, s, R, β) ln 1 α . We deduce that ρ λ ,B α , Ss d(R), β, M C5(M, d, s, R, α, β) (m + n) 2s/(4s+d) . E.8 Proof of Proposition 8 By definition of u B2 α , we have b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 u B2 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn > 0 α. Taking the expectation on both sides, we get \ MMD 2 λ(Xm, Yn) bq λ,B1 1 u B2 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn > 0 α as under the null hypothesis H0 : p = q, we have c M b λ,2 µ(b,2) Xm, Yn 1 b B2 distributed like \ MMD 2 λ(Xm, Yn). Using the bisection search approximation which satisfies bu B2:3 α ZB2 Xm, Yn, ZB1 u B2 α ZB2 Xm, Yn, ZB1 , \ MMD 2 λ(Xm, Yn) bq λ,B1 1 bu B2:3 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn > 0 \ MMD 2 λ(Xm, Yn) bq λ,B1 1 u B2 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn > 0 We deduce that Pp p r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 1 α. Schrab, Kim, Albert, Laurent, Guedj and Gretton E.9 Proof of Theorem 9 Consider some u (0, 1) to be determined later. For b = 1, . . . , B2, let Wb µ(b,2) Xm, Yn, ZB1 := 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 , so that, following a similar argument to the one presented in Appendix E.8, we obtain that Ep q r r Wb µ(b,2) Xm, Yn, ZB1 is equal to \ MMD 2 λ(Xm, Yn) bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 . Consider the events b=1 Wb µ(b,2) Xm, Yn, ZB1 Er h Wb µ(b,2) Xm, Yn, ZB1 i b=1 Wb µ(b,2) Xm, Yn, ZB1 Ep q r r h Wb µ(b,2) Xm, Yn, ZB1 i Using Hoeffding s inequality, we obtain that Pr A Xm, Yn, ZB1 1 β 2 for any Xm, Yn and ZB1, we deduce that Pp q r r(A) 1 β 2 . First, assuming that the event A holds, we show that u B2 α ZB2 Xm, Yn, ZB1 α. Since we assume that the event A holds, the bounds we obtain hold with probability 1 β 2 . We have b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 b=1 Wb µ(b,2) Xm, Yn, ZB1 Ep p r r h Wb µ(b,2) Xm, Yn, ZB1 i + \ MMD 2 λ(Xm, Yn) bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 + n \ MMD 2 λ(Xm, Yn) > bq λ,B1 1 u wλ ZB1 Xm, Yn o! MMD Aggregated Two-Sample Test Using Boole s inequality, we obtain b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 λ Λ Pp p r \ MMD 2 λ(Xm, Yn) > bq λ,B1 1 u wλ ZB1 Xm, Yn + for u := 3α 4 , where we have used Proposition 1 and the fact that P λ Λ wλ 1. Now, for B2 8 α2 ln 2 β , we get and so, we obtain b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 u wλ ZB1 Xm, Yn > 0 α. Recall that u B2 α ZB2 Xm, Yn, ZB1 is defined as sup u 0, min λ Λ w 1 λ : 1 b=1 1 max λ Λ c M b λ,2 µ(b,2) Xm, Yn bq λ,B1 1 uwλ ZB1 Xm, Yn >0 α , we deduce that u B2 α ZB2 Xm, Yn, ZB1 u = 3α for B2 8 α2 ln 2 β when the event A holds. Under the event A, after performing B3 steps of the bisection method, we have bu B2:3 α ZB2 Xm, Yn, ZB1 u B2 α ZB2 Xm, Yn, ZB1 min λ Λ w 1 λ 4 min λ Λ w 1 λ Schrab, Kim, Albert, Laurent, Guedj and Gretton for B3 log2 4 α min λ Λ w 1 λ . We are interested in upper bounding the probability of type II error Pp q r r(B) for the event B := n Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 0 o . We have Pp q r r(B) = Pp q r r B A Pp q r r(A) + Pp q r r B Ac Pp q r r(Ac) Pp q r r B A + β Pp q r r B A = Pp q r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 0 A \ MMD 2 λ(Xm, Yn) bq λ,B1 1 bu B2:3 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn A min λ Λ Pp q r r \ MMD 2 λ(Xm, Yn) bq λ,B1 1 bu B2:3 α ZB2 Xm,Yn,ZB1 wλ ZB1 Xm, Yn A min λ Λ Pp q r \ MMD 2 λ(Xm, Yn) bq λ,B1 1 αwλ/2 ZB1 Xm, Yn = min λ Λ Pp q r λ,B1 αwλ/2 ZB1 Xm, Yn = 0 , we deduce that Pp q r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2)=0 β 2 + min λ Λ Pp q r λ,B1 αwλ/2 ZB1 Xm, Yn =0 .(25) In order to upper bound Pp q r r Λw,B1:3 α (Xm, Yn, ZB1, ZB2) = 0 by β it is sufficient to upper bound min λ Λ Pp q r λ,B1 αwλ/2(Xm, Yn, ZB1) = 0 by β 2 . By definition of uniform separation rates, it follows that ρ Λw,B1:3 α , Ss d(R), β, M 2 4 min λ Λ ρ λ,B1 αwλ/2, Ss d(R), β, M 2 . For each λ λ, since 1 αwλ/2 αwλ/2 (1 α)α as α (0, e 1) and wλ 1, we have B1 maxλ Λ w 2 λ 12 MMD Aggregated Two-Sample Test so we can apply Theorem 6 to the tests λ,B1 αwλ/2 λ Λ to obtain ρ Λw,B1:3 α , Ss d(R), β, M 2 4 min λ Λ ρ λ,B1 αwλ/2, Ss d(R), β, M 2 4C4(M, d, s, R, β) min λ Λ i=1 λ2s i + ln 2 αwλ (m + n) λ1 λd C6(M, d, s, R, β) min λ Λ i=1 λ2s i + ln 1 αwλ (m + n) λ1 λd for C6(M, d, s, R, β) := 8C4(M, d, s, R, β) where C4(M, d, s, R, β) is the constant from Theorem 6, and where we used the fact that = ln(2) + ln 1 ln(2) + 1 ln 1 as ln 1 αwλ α > 1 since α (0, e 1). E.10 Proof of Corollary 10 First, note that we indeed have and also that for all λ = (2 ℓ, . . . , 2 ℓ) Λ we have λ1 λd = 2 dℓ< 1 as ℓ, d N \ {0}. Let λ = (2 ℓ , . . . , 2 ℓ ) Λ where ℓ := 2 4s + d log2 m + n ln(ln(m + n)) m + n ln(ln(m + n)) Since min λ Λ w 1 λ = 6 π2 , we have B3 log2 4 αmin λ Λ w 1 λ , so we can apply Theorem 9 to get ρ Λw,B1:3 α , Ss d(R), β, M 2 C6(M, d, s, R, β) min λ Λ i=1 λ2s i + ln 1 α + ln 1 wλ (m + n) λ1 λd C6(M, d, s, R, β) i=1 (λ i )2s + ln 1 α + ln 1 wλ (m + n) pλ 1 λ d Note that ℓ 2 4s+d log2 m+n ln(ln(m+n)) +1 which gives λ i = 2 ℓ 2 1 m+n ln(ln(m+n)) 2/(4s+d) for i = 1, . . . , d. We get pλ 1 λ d 2 d 2 m+n ln(ln(m+n)) d/(4s+d) and so 1 pλ 1 λ d 2 d 2 m + n ln(ln(m + n)) Schrab, Kim, Albert, Laurent, Guedj and Gretton Note also that ℓ 2 4s + d log2 m + n ln(ln(m + n)) 2 4s + d log2(m + n) + 1 2 d ln(2) + 1 ln(m + n) < 4 ln(m + n) as ln(ln(m + n)) > 1 and ln(m + n) > 1. We get = 2 ln(ℓ ) + ln π2 2 ln(4 ln(m + n)) + ln π2 2 ln(4) + 1 + ln π2 ln(ln(m + n)) < 5 ln(ln(m + n)) as ln(ln(m + n)) > 1. Combining those upper bounds, we get α + ln 1 wλ (m + n) pλ 1 . . . λ d 1 (m + n) pλ 1 . . . λ d + 5 ln(ln(m + n)) + 5 ln(ln(m + n)) m + n 1 pλ 1 . . . λ d + 5 ln(ln(m + n)) m + n ln(ln(m + n)) = 2 d 2 ln 1 + 5 m + n ln(ln(m + n)) as ln(ln(m + n)) > 1. Note also that ℓ 2 4s + d log2 m + n ln(ln(m + n)) (λ i )2s = (2 ℓ )2s m + n ln(ln(m + n)) for i = 1, . . . , d. Hence, we get i=1 (λ i )2s d m + n ln(ln(m + n)) 4s/(4s+d) . MMD Aggregated Two-Sample Test ρ Λw,B1:3 α , Ss d(R), β, M 2 C6(M, d, s, R, β) i=1 (λ i )2s + ln 1 α + ln 1 wλ (m + n) pλ 1 λ d C7(M, d, s, R, α, β)2 m + n ln(ln(m + n)) where C7(M, d, s, R, α, β) := r C6(M, d, s, R, β) max n d, 2 d 2 ln 1 α + 5 o . We conclude that ρ Λw,B1:3 α , Ss d(R), β, M C7(M, d, s, R, α, β) m + n ln(ln(m + n)) 2s/(4s+d) . Hence, the test Λw,B1:3 α is optimal in the minimax sense up to an iterated logarithmic term. Since it does not depend on the unknown parameters s and R, our aggregated MMDAgg test Λw,B1:3 α is minimax adaptive over the Sobolev balls Ss d(R) : s > 0, R > 0 . Schrab, Kim, Albert, Laurent, Guedj and Gretton M. Albert, B. Laurent, A. Marrel, and A. Meynaoui. Adaptive test of independence based on HSIC measures. The Annals of Statistics, 50(2):858 879, 2022. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337 404, 1950. Y. Baraud. Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 1 (8(5):577 606), 2002. P. J. Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The Annals of Mathematical Statistics, 40(1):1 23, 1969. C. Butucea. Goodness-of-fit testing and quadratic functional estimation from indirect observations. Long version with Appendix. The Annals of Statistics, 35(5), 2007. P. L. Chebyshev. Oeuvres. Commissionaires de l Acad emie Imp eriale des Sciences, 1, 1899. S. X. Chen and Y.-L. Qin. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38(2):808 835, 2010. K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606 2615. PMLR, 2016. V. H. de la Pe na and E. Gin e. Decoupling: From Dependence to Independence. Springer Science & Business Media, 1999. A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642 669, 1956. N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. Autogluon-tabular: Robust and accurate automl for structured data. ar Xiv preprint ar Xiv:2003.06505, 2020. H. S. Fisher, B. B. Wong, and G. G. Rosenthal. Alteration of the chemical environment disrupts communication in a freshwater fish. Proceedings of the Royal Society B: Biological Sciences, 273(1591):1187 1193, 2006. W. Fithian, D. Sun, and J. Taylor. Optimal inference after model selection. ar Xiv preprint ar Xiv:1410.2597, 2014. M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret. Kernels based tests with nonasymptotic bootstrap approaches for two-sample problems. In Conference on Learning Theory, volume 23 of Journal of Machine Learning Research Proceedings, 2012. M. Fromont, B. Laurent, and P. Reynaud-Bouret. The two-sample problem for Poisson processes: Adaptive tests with a nonasymptotic wild bootstrap approach. The Annals of Statistics, 41(3):1431 1461, 2013. MMD Aggregated Two-Sample Test K. Fukumizu, A. Gretton, X. Sun, and B. Sch olkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, volume 1, pages 489 496, 2008. A. Gretton, O. Bousquet, A. Smola, and B. Sch olkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory. Springer, 2005. A. Gretton, K. Borgwardt, M. Rasch, B. Sch olkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pages 513 520, Cambridge, MA, 2007. MIT Press. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch olkopf, and A. Smola. A kernel twosample test. Journal of Machine Learning Research, 13:723 773, 2012a. A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, volume 1, pages 1205 1213, 2012b. W. Hoeffding. A class of statistics with asymptotically normal distribution. In Breakthroughs in Statistics, pages 308 334. Springer, 1992. L. Horv ath, P. Kokoszka, and R. Reeder. Estimation of the mean of functional time series and a two-sample problem. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):103 122, 2013. Y. I. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the Lp metrics. Theory of Probability & its Applications, 31(2):333 337, 1987. Y. I. Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. Journal of Soviet Mathematics, 1(44:466 476), 1993a. Y. I. Ingster. Minimax testing of the hypothesis of independence for ellipsoids in lp. Zapiski Nauchnykh Seminarov POMI, 1(207:77 97), 1993b. W. Jitkrittum, Z. Szab o, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, volume 29, pages 181 189, 2016. I. Kim, S. Balakrishnan, and L. Wasserman. Minimax optimality of permutation tests. The Annals of Statistics, 50(1):225 251, 2022. A. Krizhevsky. Learning multiple layers of features from tiny images. 2009. J. M. K ubler, W. Jitkrittum, B. Sch olkopf, and K. Muandet. Learning kernel tests without data splitting. In Advances in Neural Information Processing Systems 33, pages 6245 6255. Curran Associates, Inc., 2020. J. M. K ubler, W. Jitkrittum, B. Sch olkopf, and K. Muandet. A witness two-sample test. In International Conference on Artificial Intelligence and Statistics, pages 1403 1419. PMLR, 2022a. Schrab, Kim, Albert, Laurent, Guedj and Gretton J. M. K ubler, V. Stimper, S. Buchholz, K. Muandet, and B. Sch olkopf. Auto ML twosample test. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, 2022b. Y. Le Cun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs, 2010. URL http://yann.lecun.com/exdb/mnist/. J. Lee. U-statistics: Theory and Practice. Citeseer, 1990. J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907 927, 2016. T. Li and M. Yuan. On the optimality of gaussian kernel based nonparametric tests against smooth alternatives. ar Xiv preprint ar Xiv:1909.03302, 2019. F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, 2020. Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276 284. PMLR, 2016. P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269 1283, 1990. F. J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68 78, 1951. R. R. Miles, R. F. Roberts, A. R. Putnam, and W. L. Roberts. Comparison of serum and heparinized plasma samples for measurement of chemistry analytes. Clinical Chemistry, 50(9):1704 1706, 2004. A. M uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 1:429 443, 1997. S. Rabanser, S. G unnemann, and Z. Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019. A. Ramdas, S. J. Reddi, B. P oczos, A. Singh, and L. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. S. Reddi, A. Ramdas, B. P oczos, A. Singh, and L. Wasserman. On the high dimensional power of a linear-time two sample test under mean-shift alternatives. In Artificial Intelligence and Statistics, pages 772 780. PMLR, 2015. J. P. Romano and M. Wolf. Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469):94 108, 2005a. MMD Aggregated Two-Sample Test J. P. Romano and M. Wolf. Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237 1282, 2005b. A. Schrab, B. Guedj, and A. Gretton. KSD aggregated goodness-of-fit test. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, 2022a. A. Schrab, I. Kim, B. Guedj, and A. Gretton. Efficient aggregated kernel tests using incomplete U-statistics. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, 2022b. R. J. Serfling. Approximation theorems of mathematical statistics. John Wiley & Sons, 1980. B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(7), 2011. Student. The probable error of a mean. Biometrika, 1(1):1 25, 1908. D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017. I. Tolstikhin, B. a. K. Sriperumbudur, and B. Sch olkopf. Minimax estimation of maximum mean discrepancy with radial kernels. Advances in Neural Information Processing Systems, 29, 2016. P. Vermeesch. Multi-sample comparison of detrital age distributions. Chemical Geology, 341:140 146, 2013. G. Wynne and A. B. Duncan. A kernel two-sample test for functional data. Journal of Machine Learning Research, 23(73):1 51, 2022. G. Wynne and S. Nagy. Statistical depth meets machine learning: Kernel mean embeddings and depth in functional data analysis. ar Xiv preprint ar Xiv:2105.12778, 2021. M. Yamada, D. Wu, Y. H. Tsai, H. Ohta, R. Salakhutdinov, I. Takeuchi, and K. Fukumizu. Post selection inference with incomplete maximum mean discrepancy estimator. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.