# Replicable Uniformity Testing

Sihan Liu, University of California San Diego, La Jolla, CA, sil046@ucsd.edu
Christopher Ye, University of California San Diego, La Jolla, CA, czye@ucsd.edu

Abstract: Uniformity testing is arguably one of the most fundamental distribution testing problems. Given sample access to an unknown distribution $p$ on $[n]$, one must decide if $p$ is uniform or $\varepsilon$-far from uniform (in total variation distance). A long line of work established that uniformity testing has sample complexity $\Theta(\sqrt{n}\,\varepsilon^{-2})$. However, when the input distribution is neither uniform nor far from uniform, known algorithms may have highly non-replicable behavior. Consequently, if these algorithms are applied in scientific studies, they may lead to contradictory results that erode public trust in science. In this work, we revisit uniformity testing under the framework of algorithmic replicability [STOC '22], requiring the algorithm to be replicable under arbitrary distributions. While replicability typically incurs a $\rho^{-2}$ factor overhead in sample complexity, we obtain a replicable uniformity tester using only $\tilde{O}(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ samples. To our knowledge, this is the first replicable learning algorithm with (nearly) linear dependence on $\rho$. Lastly, we consider a class of "symmetric" algorithms [FOCS '00] whose outputs are invariant under relabeling of the domain $[n]$, which includes all existing uniformity testers (including ours). For this natural class of algorithms, we prove a nearly matching sample complexity lower bound for replicable uniformity testing.

1 Introduction

Distribution property testing (see [43, 9, 12] for surveys of the field), originating from statistical hypothesis testing [39, 36], aims at testing whether an unknown distribution satisfies a certain property or is significantly far from satisfying it, given sample access to the distribution. After the pioneering early works that formulated the field from a TCS perspective [5, 4], a number of works have achieved progress on testing a wide range of properties [6, 21, 11, 22, 10, 19].

Within the field, uniformity testing [28] is arguably one of the most fundamental distribution testing problems: given sample access to some unknown distribution $p$ on $[n] = \{1, \ldots, n\}$, one tries to decide whether $p$ is uniform or far from being uniform. When the unknown distribution is promised to be either uniform or at least $\varepsilon$-far from uniform in total variation (TV) distance, a line of work in the field [28, 40, 47, 20, 1, 16] has led to efficient testers that achieve the information-theoretically optimal sample complexity, i.e., $\Theta(\sqrt{n}/\varepsilon^2)$. However, the testers are only guaranteed to provide reliable answers when $p$ fulfills the input promise that $p$ is either $U_n$ or far from it. In practical scenarios, due to a number of reasons such as model misspecification or inaccurate measurements, such promises are rarely guaranteed [13]. Since the tester no longer has a correctness constraint to adhere to, it may have arbitrarily volatile behavior. Ideally, we hope for algorithmic stability for all distributions, not only those fulfilling the promises.

Consider the scenario where two groups of scientists, as an intermediate step of some scientific study, are trying to independently test the uniformity of some ground-truth distribution $p$.
If the two groups reach inconsistent conclusions in the end (as noted in [31, 32, 7], such replicability issues are fairly common in science), the public might naturally interpret the inconsistency as caused by procedural or human errors on the part of one (or both) team(s), thereby eroding public trust in the scientific community [37, 44, 3, 42]. However, even when both teams follow the experimental procedure precisely, inconsistencies could still arise simply due to sample variance in the case where $p$ is neither uniform nor far from uniform. By ensuring that the algorithm is replicable, we can effectively rule out sample variance as a cause of inconsistency.

In the work of [30], a formal definition of replicability for learning algorithms was proposed to mitigate non-replicability caused by statistical methods. Specifically, replicable learning algorithms are required to give identical outputs with high probability in two different runs where they are given sample access to a common, but potentially adversarially chosen, data distribution.

Definition 1.1 (Replicability [30]). A randomized algorithm $A : \mathcal{X}^n \to \mathcal{Y}$ is $\rho$-replicable if for all distributions $p$ on $\mathcal{X}$, $\Pr_{r, T, T'}\left( A(T; r) = A(T'; r) \right) \ge 1 - \rho$, where $T, T'$ are i.i.d. samples taken from $p$ and $r$ denotes the internal randomness of the algorithm $A$.

A number of works have explored the connection between replicability and other algorithmic stability notions [8, 34, 15, 38, 23], and shown efficient replicable algorithms for a wide range of machine learning tasks [8, 34, 24, 25, 35, 26, 33]. In the context of uniformity testing, if we can design testers that conform to the above replicability requirement, consistency of their outputs can be guaranteed even when there are no promises on the data distribution. This motivates the study of replicable uniformity testing.

Definition 1.2 (Replicable Uniformity Testing). Let $n \in \mathbb{Z}^+$ and $\varepsilon, \rho \in (0, 1/2)$. A randomized algorithm $A$, given sample access to some distribution $p$ on $[n]$, is said to solve $(n, \varepsilon, \rho)$-replicable uniformity testing if $A$ is $\rho$-replicable and it satisfies the following: (1) if $p$ is uniform, $A$ should accept with probability at least $1 - \rho$;¹ (2) if $p$ is $\varepsilon$-far from the uniform distribution in TV distance, $A$ should reject with probability at least $1 - \rho$.

As observed in [30], learning algorithms usually incur an additional sample complexity overhead in the replicability parameter $\rho$ compared to their non-replicable counterparts. In this work, we characterize the sample complexity of replicable uniformity testing up to polylogarithmic factors in all relevant parameters $n, \varepsilon, \rho$ (under mild assumptions on the testers).
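To make Definition 1.1 concrete, the following minimal Python sketch illustrates the shared-randomness, random-threshold paradigm of [30] in the simplest setting of testing a coin (the $n = 2$ case discussed in Section 1.2). The sample sizes, function names, and constants here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def replicable_bias_tester(flips, eps, rng):
    """Sketch of a rho-replicable fair-coin tester (the n = 2 special case).

    flips: array of {0, 1} outcomes; eps: bias gap to detect;
    rng: the internal randomness r, reused across the two runs that must agree.
    With Theta(eps^-2 rho^-2) flips, the empirical biases of two runs differ
    by O(rho * eps) with constant probability, so a threshold drawn uniformly
    from an interval of width eps/2 lands between them with probability O(rho).
    """
    threshold = 0.5 + (0.25 + 0.5 * rng.random()) * eps  # Unif[1/2 + eps/4, 1/2 + 3eps/4]
    return np.mean(flips) < threshold  # True = "accept: the coin looks fair"

# Two runs on fresh data, but with the same internal seed r = 7:
run1 = replicable_bias_tester(np.random.default_rng(1).integers(0, 2, 60_000),
                              0.05, np.random.default_rng(7))
run2 = replicable_bias_tester(np.random.default_rng(2).integers(0, 2, 60_000),
                              0.05, np.random.default_rng(7))
assert run1 == run2  # holds with probability at least 1 - rho
```

The key design choice is that the threshold depends only on the shared randomness $r$, so two runs disagree only when the threshold falls between their two empirical means.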
1.1 Relationship with Tolerant Testing

An alternative approach to address the stringency of the promises in the formulation of uniformity testing is the concept of tolerant testing [41]. At a high level, given some $0 < \xi < \varepsilon$, the promise on the unknown distribution is relaxed: it is either $\varepsilon$-far from $U_n$ or $\xi$-close to $U_n$ in TV distance. The tester is then required to reject in the former case and accept in the latter. As shown in [45, 46, 48, 13], assuming that $\varepsilon$ is some constant,² the sample complexity of tolerant testing quickly grows from strongly sublinear, i.e., $\Theta(\sqrt{n})$, to barely sublinear, i.e., $\Theta(n/\log n)$, as $\xi$ increases from $0$ to $\varepsilon/2$. In replicable uniformity testing, the testers are not required to accept or reject on intermediate distributions $p$, i.e., $p$ such that $0 < d_{TV}(p, U_n) < \varepsilon$. Instead, for all possible distributions $p$, the testers are only required to give replicable answers (with high probability). While one can construct an $(n, \varepsilon', \rho)$-replicable uniformity tester from a tolerant tester by randomly sampling a threshold $r \in (0, \varepsilon')$ and performing tolerant testing with $\xi = r - \rho\varepsilon'$ and $\varepsilon = r + \rho\varepsilon'$, the sample complexity of such an approach will be barely sublinear in $n$ even for constant $\rho$, as discussed above. Notably, the replicable algorithm in our work has a sample complexity that remains strongly sublinear in $n$.

1.2 Our Results

When $n = 2$, uniformity testing amounts to distinguishing a fair coin from an $\varepsilon$-biased coin. It is well known that this task requires $\Theta(\varepsilon^{-2})$ samples without the replicability requirement. When the tester is required to be $\rho$-replicable, [30] shows that $\Theta(\varepsilon^{-2}\rho^{-2})$ many samples are necessary and sufficient for replicable coin testing, demonstrating a quadratic blowup in $\rho$ compared to the non-replicable counterpart of the problem.

¹In this work, we do not focus on the dependency on the failure probability of the tester. The common formulation is that $A$ should succeed with some large constant probability, e.g., $2/3$. However, if $A$ is at the same time required to be replicable with probability at least $1 - \rho$, one can see that $A$ must also be correct with probability at least $1 - \rho$.

²See [13] for the dependency on $\varepsilon$ and $\xi$ over the full landscape of the problem.

For large $n$, the sample complexity of non-replicable uniformity testing has been resolved after a long line of work [28, 40, 47], and shown to be $\Theta(\sqrt{n}\,\varepsilon^{-2})$. Following this pattern, one naturally expects that replicable uniformity testing would have sample complexity $\Theta(\sqrt{n}\,\rho^{-2}\varepsilon^{-2})$. In fact, this is exactly the sample complexity reached if one views the outcome of an optimal non-replicable uniformity tester as a coin flip, and uses an optimal replicable coin tester to convert the given uniformity tester into a replicable one. Somewhat surprisingly, we show that the sample complexity of this blackbox reduction is suboptimal for replicable uniformity testing: it is possible to shave an additional $\rho$ factor and make the dominating term's dependency on $\rho$ linear. To our knowledge, this is the first replicable algorithm that has nearly linear dependence on the replicability parameter in the dominating term.

Theorem 1.3 (Replicable Uniformity Testing Upper Bound). Let $n \in \mathbb{Z}^+$ and $\varepsilon, \rho \in (0, 1/2)$. Algorithm 1 solves $(n, \varepsilon, \rho)$-replicable uniformity testing with sample complexity $\tilde{O}\left( \frac{\sqrt{n}}{\varepsilon^2 \rho} + \frac{1}{\rho^2 \varepsilon^2} \right)$.³

Remark 1.4. [27] showed that the more general problem of identity testing, i.e., testing whether an unknown distribution $p$ is equal to or far from some known distribution $q$ given explicitly, can be reduced to uniformity testing with only a constant factor loss in sample complexity. It is not hard to verify that this reduction preserves replicability. This immediately implies a replicable identity tester with the same asymptotic sample complexity (see Appendix C.3 for more detail).

As our second result, we provide a nearly matching sample complexity lower bound for a natural class of testers whose outputs are invariant under relabeling of the domain elements $[n]$. This class of testers is commonly referred to as symmetric algorithms, first studied in the work of [5].

Definition 1.5 (Symmetric Algorithms, Definition 13 of [5]). Let $f : [n]^m \to \{0, 1\}$ be a binary function. We say that $f$ is symmetric if for any sample set $(x_1, \ldots, x_m) \in [n]^m$ and any permutation $\pi \in S_n$, we have $f(x_1, \ldots, x_m) = f(\pi(x_1), \ldots, \pi(x_m))$.
We say an algorithm $A(\cdot\,; r)$ is symmetric if it computes some symmetric function $f_r$ under any fixed random seed $r$. Without the replicability requirement, one can show that assuming the algorithm is symmetric is without loss of generality for uniformity testing. This is due to the simple observation that uniformity itself is a symmetric property: if $p$ is uniform or far from the uniform distribution, the property is preserved even if we permute the labels of the elements within $[n]$. Consequently, all known optimal uniformity testers, including our replicable uniformity tester, are indeed symmetric algorithms. For this natural class of testers, we show that the sample complexity achieved is essentially optimal (up to polylogarithmic factors).

Theorem 1.6 (Symmetric Testers Lower Bound). Let $n \in \mathbb{Z}^+$ and $\varepsilon, \rho \in (0, 1/2)$. Any symmetric algorithm solving the $(n, \varepsilon, \rho)$-replicable uniformity testing problem requires $\Omega\left( \frac{\sqrt{n}}{\varepsilon^2 \rho} + \frac{1}{\rho^2 \varepsilon^2} \right)$ samples.

Unfortunately, when replicability is concerned, it is unclear whether we can still assume that the optimal algorithm is symmetric. Whether the above lower bound holds for all algorithms is left as one of the main open questions of this work.

1.3 Limitations, Discussion, and Future Work

In this paper, we present a replicable algorithm for uniformity testing using $\tilde{O}(\sqrt{n}\,\varepsilon^{-2}\rho^{-1} + \varepsilon^{-2}\rho^{-2})$ samples. We provide a matching lower bound for a natural class of symmetric uniformity testers: algorithms that essentially consider only the frequency of each element without discriminating between distinct labels. Since all known uniformity testing algorithms are symmetric, we tend to believe that the lower bound can be established unconditionally, and we leave this as an open question. We discuss some issues in generalizing our current approach to realize this goal in Appendix C.2. While uniformity testing is a central problem in distribution property testing, there are many other settings where it would be interesting to develop replicable algorithms, such as closeness and independence testing [18].

³$\tilde{O}(\cdot)$ hides polylogarithmic factors in $n$ and $\rho^{-1}$.

1.4 Technical Overview

Barriers for $\ell_2$-based Statistics. Most of the well-known non-replicable uniformity testers compute an unbiased estimator of the $\ell_2$ norm of the unknown distribution $p$, i.e., $\|p\|_2^2 = \sum_{i=1}^n p_i^2$, from the number of occurrences $X_i$ of each element $i$ among the observed samples. These include the testers from [28, 17], which are based on counting pairwise collisions, i.e., $\frac{1}{2}\sum_{i=1}^n X_i(X_i - 1)$, and the ones from [14, 1, 47, 20], which are based on variants of the $\chi^2$ statistic, i.e., $\sum_{i=1}^n \left( (X_i - m/n)^2 - X_i \right)/(m/n)$. As one can see, these estimators all rely heavily on computing the quantities $X_i^2$, which could have large variance even when there is a single heavy element. This poses serious challenges for designing replicable testers based on these statistics.⁴ In the following paragraphs we discuss the challenge for collision-based testers in more detail; a similar barrier exists for $\chi^2$-based statistics, which we defer to Appendix C.1.

Given an unknown distribution $p$, the expected number of pairwise collisions is exactly $\sum_{i=1}^n \binom{m}{2} p_i^2$. If $p$ is uniform, the expected value of the test statistic is about $\frac{m^2}{2n}$. If $p$ is $\varepsilon$-far from being uniform, the expected value is at least $\frac{m^2(1+\varepsilon^2)}{2n}$. To construct a replicable uniformity tester from the collision statistic, a natural idea is to select a random threshold $r$ between the two extrema to be the decision threshold, as sketched below.
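Here is a minimal Python sketch of this collision-count tester with a random threshold; the function names and the exact placement of the threshold are illustrative, not from the paper.

```python
import numpy as np

def collision_statistic(samples, n):
    """Pairwise collision count: sum_i X_i (X_i - 1) / 2."""
    counts = np.bincount(samples, minlength=n)
    return (counts * (counts - 1)).sum() / 2

def collision_tester(samples, n, eps, rng):
    """Random-threshold collision tester (sketch).  The threshold is drawn
    uniformly between the two expectations discussed in the text:
    ~ m^2/(2n) under uniform and ~ m^2 (1 + eps^2)/(2n) when eps-far."""
    m = len(samples)
    mean_uniform = m * (m - 1) / (2 * n)
    mean_far = mean_uniform * (1 + eps ** 2)
    threshold = mean_uniform + rng.random() * (mean_far - mean_uniform)
    return collision_statistic(samples, n) < threshold  # True = ACCEPT
```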
Consequently, the tester fails to replicate if and only if the random threshold falls between the realized values of the test statistic in two different runs. Conditioned on the test statistics computed in two different runs deviating by $\Delta$, this event happens with probability exactly $\Delta \cdot \frac{2n}{m^2\varepsilon^2}$. Hence, for the tester to be $\rho$-replicable, the test statistics need to deviate by no more than $\Delta = O\left( m^2\varepsilon^2\rho/n \right)$ with constant probability. To focus on the dependency on $\rho$, we let $\varepsilon$ be a small constant. We now construct a hard instance that makes collision-based statistics violate the above concentration requirement unless $m \gtrsim \sqrt{n}\,\rho^{-2}$. Consider a distribution with a single heavy element with probability mass $p \approx 1/\sqrt{n}$. Let $X \sim \mathrm{Binom}(m, p)$ denote the number of occurrences of this heavy element. It is not hard to see that this element contributes $\binom{X}{2}$ to the total collision count, which deviates by $\Omega\left( mp\sqrt{mp} \right) = \Omega(m^{3/2}/n^{3/4})$ from its mean with constant probability. Consequently, over two runs of the algorithm, with constant probability the numbers of collisions may differ by $\Omega(m^{3/2}/n^{3/4})$. Therefore, the tester fails to be $\rho$-replicable unless $m^{3/2}/n^{3/4} \lesssim \rho m^2/n$, or equivalently $m \gtrsim \sqrt{n}\,\rho^{-2}$.

Total Variation Distance Statistic. To make the test statistic less sensitive to the counts of heavy elements, we compute the total variation statistic, which was used in [16] to achieve optimal uniformity testing in the high-probability regime. In particular, the test statistic measures the TV distance between the empirical distribution over samples and the uniform distribution: $S = \frac{1}{2}\sum_{i=1}^n \left| X_i/m - 1/n \right|$, where $X_i$ is the number of occurrences of the $i$-th element. Unlike collision-based statistics, the TV statistic depends only linearly on each $X_i$. For a heavy element $X_i$, the contribution to the empirical total variation distance is (up to normalization) at most $X_i$, with variance $\mathrm{Var}(X_i) = \Theta(mp) = \Theta\left( m/n^{1/2} \right)$, as opposed to $\mathrm{Var}(X_i^2) = \Theta\left( m^3/n^{3/2} \right)$. Intuitively, this allows us to obtain tighter concentration bounds on the test statistic $S$, thereby improving the final sample complexity.

First, we observe that when the distribution is $\varepsilon$-far from uniform, [16] shows that the expected value of $S$ exceeds the expected value of $S$ under the uniform distribution by at least some function $f(m, n, \varepsilon)$ (see Equation (2) for the full expression). To establish a replicable tester in the super-linear case ($m = \tilde{\Theta}(\sqrt{n}\,\varepsilon^{-2}\rho^{-1}) \ge n$), we use McDiarmid's inequality to directly argue that the TV test statistic deviates by at most $\rho f(m, n, \varepsilon)$ with high probability. Hence, if we use a random threshold that lies within the gap interval, the threshold will be $\rho f(m, n, \varepsilon)$-close to the realized value of the test statistic with probability at most $O(\rho)$, ensuring replicability of the final result.

⁴Interestingly, the large variance caused by heavy elements is usually not an issue in the non-replicable setting, where the distribution is promised to be either uniform or far from uniform. This is because heavy elements can only exist in the latter case, in which the expected value of the test statistic must also be large, adding more slackness to the concentration requirement on the test statistic. Unfortunately, to obtain replicability, we need to deal with distributions that have heavy elements but are still relatively close to the uniform distribution.

The key challenge lies in obtaining the desired concentration in the sublinear regime, i.e., $m \le n$. In this regime, the expectation gap is given by $f(m, n, \varepsilon) = \varepsilon^2 m^2 / n^2$.
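For reference, the statistic $S$ is straightforward to compute; a minimal Python sketch follows (the encoding of samples as integers in $\{0, \ldots, n-1\}$ is our assumption).

```python
import numpy as np

def tv_statistic(samples, n):
    """S = (1/2) * sum_i |X_i/m - 1/n|: the TV distance between the empirical
    distribution and uniform.  Each count X_i enters only linearly, so one
    heavy element shifts S by O(X_i/m) rather than by O(X_i^2)."""
    m = len(samples)
    counts = np.bincount(samples, minlength=n)
    return 0.5 * np.abs(counts / m - 1.0 / n).sum()
```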
Unfortunately, the presence of heavy elements, i.e., elements with mass $p_i \gg 1/m$, can still make the test statistic have high variance. As a result, it becomes challenging to obtain the desired concentration of $\rho f(m, n, \varepsilon)$ for the test statistic. However, in the presence of very heavy elements (e.g., $p_i \gg m/n$), which cause the variance of the test statistic to be too large, we observe that these elements also push the expectation of the test statistic sufficiently far beyond the expectation gap, so that the tester rejects consistently, even though the distribution may not be $\varepsilon$-far from uniform. To formalize this intuition, we show that whenever the variance of the test statistic is large, so is its expected value, thereby ensuring consistent rejection. In particular, in Lemma 3.3 we show that whenever $\sqrt{\mathrm{Var}(S)} \gtrsim \rho f(m, n, \varepsilon) = \rho\varepsilon^2 m^2/n^2$, it holds that

$$\mathbb{E}[S] - \sqrt{\mathrm{Var}(S)} \;\ge\; \mu(U_n) + \varepsilon^2 m^2/n^2, \qquad (1)$$

where $\mu(U_n)$ is the expected value of $S$ under the uniform distribution $U_n$. At a high level, Equation (1) is shown by rewriting $S$ in terms of indicator variables representing whether some sample collides with another, and we use correlation inequalities to show that these collision indicators are almost independent of each other (see the proof sketch of Lemma 3.3 for more detail). Combining Equation (1) with Chebyshev's inequality, we then have that $S \ge \mu(U_n) + \varepsilon^2 m^2/n^2$ with high constant probability (which can be easily boosted to $1 - \rho$ with the standard "median trick"). As we choose our random threshold from the interval $[\mu(U_n), \mu(U_n) + \varepsilon^2 m^2/n^2]$, it follows that the tester must consistently reject. Otherwise, we have $\sqrt{\mathrm{Var}(S)} \lesssim \rho f(m, n, \varepsilon)$. Applying Chebyshev's inequality gives that $\Pr\left( |S - \mathbb{E}[S]| > \rho f(m, n, \varepsilon) \right) \ll 1$. The rest of the argument is similar to the super-linear case.

Sample Complexity Lower Bound. At a high level, we present a family of distributions $\{p(\xi)\}_\xi$, parametrized by some parameter $\xi \in [0, \varepsilon]$, satisfying the following: no symmetric tester that takes fewer than $\Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ many samples can be replicable with high probability against a random distribution from the family. Since we have fixed the testing instance distribution, using a minimax-style argument similar to [30], we can assume that the testing algorithm is deterministic.⁵ The family of distributions is constructed to satisfy the following properties: (i) $p(0)$ is the uniform distribution and $p(\varepsilon)$ is $\varepsilon$-far from uniform; (ii) for any two distributions $p(\xi), p(\xi + \delta)$ within the family, where $\delta \lesssim \varepsilon\rho$, no symmetric tester taking fewer than $\Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ many samples can reliably distinguish them. By (i) and the correctness guarantee of the algorithm, the acceptance probability of the tester should be near 1 under $p(0)$ and near 0 under $p(\varepsilon)$. This implies that there exists some $\xi^*$ within the range such that the acceptance probability is $1/2$ under $p(\xi^*)$.⁶ By property (ii), $p(\xi^*)$ cannot be distinguished from $p(\xi^* \pm O(\varepsilon\rho))$ by the tester, which immediately implies that the acceptance probability of the tester is near $1/2$ whenever $\xi = \xi^* \pm O(\varepsilon\rho)$, so that the tester is not replicable with constant probability on such $p(\xi)$. Thus, if we sample $\xi$ uniformly from $[0, \varepsilon]$, the tester fails to replicate with probability at least $\Omega(\rho)$.

The construction of $p(\xi)$ is natural and simple: half of the elements have mass $(1 + \xi)/n$ and the other half have mass $(1 - \xi)/n$; a sampler is sketched below. Property (i) follows immediately. The formal proof of property (ii) is technical, but the high-level intuition is straightforward.
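A minimal Python sketch of sampling from this construction (the even/odd convention matches Equation (4) in Section 4):

```python
import numpy as np

def sample_p_xi(xi, n, m, rng):
    """m i.i.d. samples from p(xi) of Equation (4): even-indexed elements get
    mass (1 + xi)/n, odd-indexed ones (1 - xi)/n (n is assumed even)."""
    probs = np.where(np.arange(n) % 2 == 0, (1 + xi) / n, (1 - xi) / n)
    return rng.choice(n, size=m, p=probs)
```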
If we assume the underlying tester is symmetric, the most informative quantity is essentially the number of elements that have frequency exactly 2 among the samples (elements with frequency more than 2 are rarely seen, and the numbers of elements with frequency 0 or 1 are about the same in the two cases). If the tester takes $m$ samples, it observes about $m^2\left( (1+\xi)^2 + (1-\xi)^2 \right)/n = 2m^2(1 + \xi^2)/n$ many frequency-2 elements under $p(\xi)$ in expectation. On the other hand, the standard deviation of the number of such elements is about $\sqrt{m^2/n} = m/\sqrt{n}$. Hence, for the tester to successfully distinguish the two distributions, $m$ needs to be sufficiently large so that the difference in expectations exceeds the standard deviation:

$$\frac{m^2}{n}\left( \left( 1 + (\xi + \varepsilon\rho)^2 \right) - \left( 1 + \xi^2 \right) \right) \approx \frac{m^2(\xi + \varepsilon\rho)\varepsilon\rho}{n} \;\gtrsim\; \frac{m}{\sqrt{n}},$$

which yields $m \gtrsim \frac{\sqrt{n}}{(\xi + \varepsilon\rho)\varepsilon\rho} \ge \Omega\left( \frac{\sqrt{n}}{\varepsilon^2\rho} \right)$.

⁵More precisely, we pick a good random string such that the induced deterministic tester is correct and replicable at the same time against our fixed testing instance distribution with high probability.

⁶An omitted technical detail is that our construction ensures that the acceptance probability is a continuous function of $\xi$.

To formalize the above intuition, we use an information-theoretic argument.⁷ Note that for such an argument to work, one often needs to randomize the order of heavy and light elements [18, 16]. Otherwise, a tester could simply group the elements whose mass is above $1/n$ under $p(\xi)$ into one giant bucket and reduce the problem to learning the bias of a single coin, which requires far fewer samples. To achieve this randomization, we consider the Local Swap Family (see Definition 4.2), where we pair the elements with mass $(1 + \xi)/n$ with those with mass $(1 - \xi)/n$ and randomly swap their orders. Since pairs are ordered randomly, no algorithm can hope to identify heavy/light elements without taking a significant number of samples. Thus, this creates distribution families containing $p(\xi)$ and $p(\xi + \delta)$ that are information-theoretically hard to distinguish even for asymmetric testers. Consequently, this shows the existence of some permutation $\pi_\xi$ of $[n]$ such that the permuted distributions $\pi_\xi \circ p(\xi)$ and $\pi_\xi \circ p(\xi + \delta)$ are hard to distinguish. However, recall that in the replicability argument we need to first fix some $\xi^*$ such that $\mathrm{Acc}(\xi^*) = 1/2$. Thus, we have to prove specifically that $p(\xi)$ and $p(\xi + \delta)$ themselves are hard to distinguish. Fortunately, for symmetric testers, the acceptance probabilities on $\pi_\xi \circ p(\xi)$ and $\pi_\xi \circ p(\xi + \delta)$ must be identical to those on $p(\xi)$ and $p(\xi + \delta)$, respectively. Consequently, no symmetric tester can easily distinguish $p(\xi)$ and $p(\xi + \delta)$.

⁷The use of information theory in showing lower bounds for replicability has also appeared in the manuscript [29], but our argument and construction are significantly more involved and exploit symmetries in the underlying algorithm.

2 Preliminaries

For any positive integer $n$, let $[n] = \{1, 2, \ldots, n\}$. We typically use $n$ to denote the domain size, $p$ to denote a distribution over $[n]$, and $m$ to denote the sample complexity. Given a distribution $p$ over $[n]$, let $p_x$ denote the probability of $x$ under $p$. For a subset $S \subseteq [n]$, $p_S = \sum_{x \in S} p_x$. For distributions $p, q$ over $[n]$, the total variation distance is $d_{TV}(p, q) = \frac{1}{2}\sum_{x \in [n]} |p_x - q_x| = \max_{S \subseteq [n]} p_S - q_S$. We also recall the definition of mutual information. Let $X, Y$ be random variables over domain $\mathcal{X}$. The mutual information of $X, Y$ is $I(X : Y) = \sum_{x, y \in \mathcal{X}} \Pr\left( (X, Y) = (x, y) \right) \log \frac{\Pr((X, Y) = (x, y))}{\Pr(X = x)\Pr(Y = y)}$. We use $a \gg b$ (resp. $a \ll b$) to denote that $a$ is a large (resp. small) constant multiple of $b$.

3 Replicable Uniformity Testing Algorithm

Algorithm Overview. At a high level, our algorithm computes the TV-distance statistic and compares it with a random threshold. Correctness of the algorithm largely follows from the analysis of the test statistic from [16].
To show replicability of the algorithm, we need a better understanding of the concentration properties of the test statistic when the unknown distribution is neither uniform nor $\varepsilon$-far from being uniform. In particular, we show that when the variance of the test statistic is too large, even if the input distribution itself is not $\varepsilon$-far from uniform, the test statistic is with high probability larger than any random threshold that may be chosen, leading the algorithm to replicably reject in this case. On the other hand, when the variance is sufficiently small, we have sufficiently strong concentration of the test statistic, so that a randomly chosen threshold does not land between empirical test statistics computed from independent samples.

Proof of Theorem 1.3. Our starting point is the expectation gap of the test statistic shown in [16].

Lemma 3.1 (Lemma 4 of [16]). Let $p$ be a distribution on $[n]$ such that $\xi = d_{TV}(p, U_n)$. For any distribution $p$, let $\mu(p)$ denote the expectation of the test statistic $S = \frac{1}{2}\sum_{i=1}^n |X_i/m - 1/n|$ given $m$ samples drawn from $p$. For all $m \ge 6$, $n \ge 2$, there is a constant $C$ such that

$$\mu(p) - \mu(U_n) \;\ge\; R := C \min\left( \frac{\xi^2 m^2}{n^2},\;\; \xi^2\sqrt{\frac{m}{n}},\;\; \xi \right). \qquad (2)$$

We require the following structural lemmas on the test statistic $S_{\mathrm{median}}$, whose formal proofs can be found in Appendix A. These lemmas show that the test statistic $S$ (and therefore $S_{\mathrm{median}}$) concentrates around its expectation $\mu(p) = \mathbb{E}_p[S]$ in the sublinear ($m < n$) and superlinear ($m \ge n$) cases, respectively. For the superlinear case, we bound the sensitivity of $S$ with respect to the input sample set $T$ and apply McDiarmid's inequality to obtain the desired concentration result.

Algorithm 1: RUNIFORMITYTESTER$(p, \varepsilon, \rho)$
Input: sample access to distribution $p$ on domain $[n]$.
Parameters: tolerance $\varepsilon$, replicability $\rho$.
Output: ACCEPT if $p = U_n$ is uniform; REJECT if $d_{TV}(p, U_n) \ge \varepsilon$.
1: $m \leftarrow \tilde{\Theta}\left( \frac{\sqrt{n}}{\varepsilon^2\rho} + \frac{1}{\rho^2\varepsilon^2} \right)$, $m_0 \leftarrow \Theta(\log 1/\rho)$.
2: for $1 \le j \le m_0$ do
3:   $D_j \leftarrow m$ samples from $p$.
4:   $S_j \leftarrow \frac{1}{2}\sum_{i=1}^n |X_i/m - 1/n|$, where $X_i$ is the number of occurrences of $i \in [n]$ in $D_j$.
5: $S_{\mathrm{median}} \leftarrow$ median of $\{S_j\}$.
6: $\mu(U_n) \leftarrow$ expectation of $\frac{1}{2}\sum_{i=1}^n |X_i/m - 1/n|$ under the uniform distribution.
7: Set threshold $r \leftarrow \mu(U_n) + r_0 \cdot R$ ($R$ given by Lemma 3.1 with $\xi = \varepsilon$), where $r_0 \sim \mathrm{Unif}([1/4, 3/4])$.
8: return ACCEPT if $S_{\mathrm{median}} < r$; REJECT otherwise.
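For intuition, here is a compact Python rendering of Algorithm 1. This is a sketch under stated assumptions, not the paper's exact procedure: the constants are schematic, $\mu(U_n)$ is estimated by a fixed-seed simulation rather than computed in closed form, and $R$ follows Equation (2) with $C = 1$.

```python
import numpy as np

def r_uniformity_tester(draw_samples, n, eps, rho, rng):
    """Sketch of Algorithm 1 (RUniformityTester).

    draw_samples(m): a length-m integer array of i.i.d. samples in
    {0, ..., n-1} from the unknown distribution p.
    rng: the internal randomness r, shared across the two runs that must agree.
    """
    # Sample sizes, up to the polylogarithmic factors the paper hides.
    m = int(np.sqrt(n) / (eps ** 2 * rho) + 1.0 / (rho ** 2 * eps ** 2))
    m0 = max(1, int(np.ceil(np.log(1.0 / rho))))  # median-trick repetitions

    def tv_stat(samples):
        counts = np.bincount(samples, minlength=n)
        return 0.5 * np.abs(counts / m - 1.0 / n).sum()

    s_median = np.median([tv_stat(draw_samples(m)) for _ in range(m0)])

    # mu(U_n): expectation of the statistic under uniform.  The paper computes
    # it exactly; here we estimate it with a FIXED seed so that the estimate,
    # like the threshold below, is identical across runs.
    uni = np.random.default_rng(0)
    mu_un = np.mean([tv_stat(uni.integers(0, n, size=m)) for _ in range(32)])

    # Expectation gap R of Lemma 3.1 (schematic: constant C set to 1).
    R = min(eps ** 2 * m ** 2 / n ** 2, eps ** 2 * np.sqrt(m / n), eps)

    r0 = 0.25 + 0.5 * rng.random()        # r0 ~ Unif([1/4, 3/4])
    return s_median < mu_un + r0 * R      # True = ACCEPT, False = REJECT
```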
Lemma 3.2 (Superlinear Concentration). Assume that we are in the superlinear regime (i.e., $m \ge n$), and denote by $\mu(p)$ the expectation of the test statistic $S$ under the distribution $p$. If $n \le m \le \frac{n}{\varepsilon^2}$, then $\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge \frac{\rho C}{16}\varepsilon^2\sqrt{\frac{m}{n}} \right) \le \frac{\rho}{4}$; if $m \ge \frac{n}{\varepsilon^2}$, then $\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge \frac{\rho C}{16}\varepsilon \right) \le \frac{\rho}{4}$.

For the sublinear case, we provide a proof sketch below, deferring the details to Appendix A.

Lemma 3.3 (Sublinear Concentration). Suppose $m \le n$. If $\mathrm{Var}(S) \ge (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$, then $\mathbb{E}[S] - \sqrt{\mathrm{Var}(S)} > \mu(U_n) + C\varepsilon^2 m^2 n^{-2}$, where $\mu(U_n)$ is the expectation of $S$ under the uniform distribution and $C$ is given by Lemma 3.1. As a consequence, with probability at least $1 - \rho/4$, we have $S_{\mathrm{median}} > \mu(U_n) + C\varepsilon^2 m^2 n^{-2}$. On the other hand, if $\mathrm{Var}(S) \le (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$, then it holds that $\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge (C/16)\rho\varepsilon^2 m^2 n^{-2} \right) < \rho/4$. Furthermore, for the uniform distribution $U_n$, $\mathrm{Var}(S) \le (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$.

Proof Sketch. For simplicity, we assume in this sketch that $\varepsilon$ is a small constant and ignore the dependency on it. When $\mathrm{Var}(S)$ is small, we simply apply Chebyshev's inequality to obtain the desired concentration of $S$; the concentration of $S_{\mathrm{median}}$ then follows from the standard median trick.

In the rest of the sketch we focus on the case $\mathrm{Var}(S) \ge (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$. In this case, the key is to show that

$$\mathbb{E}[S] - \sqrt{\mathrm{Var}(S)} \;>\; \mu(U_n) + C\varepsilon^2 m^2 n^{-2}, \qquad (3)$$

as the claim that $S_{\mathrm{median}} > \mu(U_n) + C\varepsilon^2 m^2 n^{-2}$ with high probability then follows almost immediately by an application of Chebyshev's inequality (and the standard analysis of the median trick). Towards Equation (3), define $X_i$ as the number of occurrences of element $i \in [n]$. We begin with an observation from [16] stating that the test statistic satisfies $S = Z/n$, where $Z = |\{i : X_i = 0\}|$ denotes the number of empty buckets. Furthermore, we can write $Z = n - m + \sum_{i=1}^m Y_i$, where $Y_i$ indicates whether the $i$-th sample collides with a previous sample $j < i$. Then $\mathrm{Var}(S) = \mathrm{Var}(Z)/n^2 = \mathrm{Var}\left( \sum_i Y_i \right)/n^2$. To bound the variance, we argue that the indicators $Y_i$ are almost negatively correlated using correlation inequalities (specifically Kleitman's theorem, Lemma A.8), so that (roughly) $\mathrm{Var}\left( \sum_i Y_i \right) \lesssim \sum_i \mathrm{Var}(Y_i) \le \sum_i \mathbb{E}[Y_i]$ (see Lemma A.5 for the accurate statement). Observe that under $U_n$ we have $\sum_i \mathbb{E}[Y_i] \lesssim m^2 n^{-1}$, so that $\mathbb{E}[S] - \mu(U_n) \ge \left( \mathbb{E}\left[ \sum_i Y_i \right] - O(m^2/n) \right)/n$. It follows that

$$\mathbb{E}[S] - \sqrt{\mathrm{Var}(S)} \;\ge\; \mu(U_n) + \left( \mathbb{E}\left[ \textstyle\sum_i Y_i \right] - \sqrt{\mathbb{E}\left[ \textstyle\sum_i Y_i \right]} - O(m^2/n) \right)\Big/ n.$$

By our assumption, $\mathbb{E}\left[ \sum_i Y_i \right] \approx n^2\,\mathrm{Var}(S) \gtrsim \rho^2\varepsilon^4 m^4 n^{-2} \gg 1$, so it suffices to show $\mathbb{E}\left[ \sum_i Y_i \right] \ge (C + 1)\,m^2/n$. Since $\mathrm{Var}(S) \ge (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$, which is in turn bounded from below by $m^2/n^3$ whenever $m \gtrsim \sqrt{n}\,\varepsilon^{-2}\rho^{-1}$, we conclude $\mathbb{E}\left[ \sum_i Y_i \right] \approx n^2\,\mathrm{Var}(S) \gtrsim m^2/n$.

We are now ready to show Theorem 1.3, our main algorithmic result. We begin with correctness for the uniform distribution. Suppose $m \le n$. By Lemma 3.3, $S_{\mathrm{median}} \le \mu(U_n) + R/16 < \mu(U_n) + R/4 \le r$ with probability at least $1 - \frac{\rho}{4}$, so the algorithm outputs ACCEPT. Otherwise, if $m \ge n$, we have by Lemma 3.2 that with probability at least $1 - \frac{\rho}{4}$, $S_{\mathrm{median}} \le \mu(U_n) + R/16 < \mu(U_n) + R/4 \le r$, so the algorithm outputs ACCEPT. On the other hand, suppose $\xi = d_{TV}(p, U_n) \ge \varepsilon$. We note that the algorithm of [16] computes the test statistic $S$ using $\Theta\left( \left( \sqrt{n\log(1/\rho)} + \log(1/\rho) \right)\varepsilon^{-2} \right)$ samples and compares $S$ with some fixed threshold $\bar{R}$ that is strictly larger than the random threshold $r$ of our choice. By the correctness guarantee of their algorithm, it holds that $\Pr[S > \bar{R}] \ge 1 - \rho/100$, which further implies that $\Pr[S_{\mathrm{median}} > \bar{R} > r] \ge 1 - \rho/4$.

We now proceed to replicability. Consider two executions of the algorithm. If $p$ is uniform or $\xi = d_{TV}(p, U_n) \ge \varepsilon$, then, by a union bound over the correctness condition, the two executions output different values with probability at most $\rho/2$. Then, suppose $m \le n$ and $\mathrm{Var}(S) \ge (C/64)\rho^2\varepsilon^4 m^4 n^{-4}$. By Lemma 3.3, both samples lead the algorithm to output REJECT with probability at least $1 - \rho$, so the algorithm is $\rho$-replicable. Otherwise, Lemma 3.2 and Lemma 3.3 guarantee strong concentration of the test statistic $S_{\mathrm{median}}$: with probability at least $1 - \rho/2$, we have $|S_{\mathrm{median}} - \mu(p)| \le \frac{\rho}{16}R$ over both samples, where $R$ is the expectation gap defined in Equation (2). In particular, whenever the random threshold $r$ does not fall in the interval $(\mu(p) - \rho R/16,\ \mu(p) + \rho R/16)$, both executions output the same result. Since $r$ is chosen uniformly at random from an interval of length $R/2$, this occurs with probability at most $(\rho R/8)/(R/2) = \rho/4$. By a union bound, we conclude that Algorithm 1 is $\rho$-replicable. Finally, the sample complexity is immediately given by our values of $m$ and $m_0$.
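The replicability guarantee can also be probed empirically. Below is a hedged Monte-Carlo sketch (the function names and trial counts are ours): it fixes one internal seed per trial, runs the tester on two independent sample sets, and reports the empirical agreement rate.

```python
import numpy as np

def estimate_replicability(tester, n, eps, rho, sampler, trials=100):
    """Monte-Carlo estimate of Pr_{r, T, T'}[A(T; r) = A(T'; r)].

    tester(draw_samples, n, eps, rho, rng) -> bool, as in the sketch above;
    sampler(m, rng) -> m samples from the (arbitrary) data distribution."""
    meta = np.random.default_rng(2024)
    agree = 0
    for _ in range(trials):
        r = int(meta.integers(1 << 30))  # shared internal randomness
        out1 = tester(lambda m: sampler(m, np.random.default_rng(meta.integers(1 << 30))),
                      n, eps, rho, np.random.default_rng(r))
        out2 = tester(lambda m: sampler(m, np.random.default_rng(meta.integers(1 << 30))),
                      n, eps, rho, np.random.default_rng(r))
        agree += (out1 == out2)
    return agree / trials  # should be >= 1 - rho for a rho-replicable tester
```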
4 Lower Bound for Replicable Uniformity Testing

In this section, we outline the important lemmas used in showing the sample complexity lower bound for $\rho$-replicable symmetric uniformity testers and give their proof sketches. The formal argument can be found in Appendix B. Note that the lower bound $\Omega\left( \varepsilon^{-2}\rho^{-2} \right)$ holds even for testing whether the bias of a coin is $1/2$ or $1/2 + \varepsilon$ (see [30]). We therefore focus on the more challenging bound of $\Omega\left( \sqrt{n}\,\varepsilon^{-2}\rho^{-1} \right)$. Consider the canonical hard instance for uniformity testing where half of the elements have probability mass $(1 + \xi)/n$ and the other half have probability mass $(1 - \xi)/n$:

$$p(\xi)_i = \begin{cases} \frac{1+\xi}{n} & \text{if } i \bmod 2 = 0, \\ \frac{1-\xi}{n} & \text{otherwise.} \end{cases} \qquad (4)$$

Our hard instance for replicable uniformity testing is as follows: we choose $\xi$ from the interval $[0, \varepsilon]$ uniformly at random, and let the tester observe samples from $p(\xi)$. Fix some uniformity tester $A_m$ that takes $m$ samples. We will argue that if $A_m$ is $\rho$-replicable and correct with probability at least $0.99$, then we must have $m = \Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$.

At a high level, we follow the framework of [30]. First, we fix some good random string $r$ such that the induced deterministic algorithm $A_m(\cdot\,; r)$ is replicable with probability at least $1 - 10\rho$, and is correct on $p(0)$ and $p(\varepsilon)$ with probability at least $0.99$. Then, consider the function $\mathrm{Acc}_m(\xi)$ that denotes the acceptance probability of $A_m(S; r)$ when the samples $S$ are taken from $p(\xi)$. Note that $\mathrm{Acc}_m(\xi)$ must be a continuous function. Moreover, by the correctness of $A_m(\cdot\,; r)$, it holds that $\mathrm{Acc}_m(0) \ge 0.99$ and $\mathrm{Acc}_m(\varepsilon) < 0.01$. Hence, there must be some value $\xi^*$ such that $\mathrm{Acc}_m(\xi^*) = 1/2$. To show the desired lower bound on $m$, it suffices to show that $\mathrm{Acc}_m(\xi^* + \delta)$ must not be too far from $\mathrm{Acc}_m(\xi^*)$ for any $\delta \lesssim \varepsilon\rho$ if $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$.

Proposition 4.1 (Lipschitz Continuity of Acceptance Probability). Assume that $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$. Let $A_m(\cdot\,; r)$ be a deterministic symmetric tester that takes $m$ samples, and define the acceptance probability function $\mathrm{Acc}_m(\xi) = \Pr_{S \sim p(\xi)^m}\left[ A_m(S; r) = 1 \right]$, where $p(\xi)$ is defined as in Equation (4). Let $\varepsilon_0 < \varepsilon_1 \in (0, \varepsilon)$ be such that $\varepsilon_1 - \varepsilon_0 < \varepsilon\rho$. Then it holds that $|\mathrm{Acc}_m(\varepsilon_0) - \mathrm{Acc}_m(\varepsilon_1)| < 0.1$.

Given the above proposition, since we choose $\xi$ from $[0, \varepsilon]$ uniformly at random, it follows that the acceptance probability of the algorithm is around $1/2$ with probability at least $\Omega(\rho)$ if $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$, implying that the algorithm is not $O(\rho)$-replicable. The proof of Proposition 4.1 is based on an information-theoretic argument building on ideas developed in [29]. We defer the formal proof of Proposition 4.1 to Appendix B.3 and give its proof outline below.

Let $m, \varepsilon_0, \varepsilon_1$ be defined as in Proposition 4.1. At a high level, we construct two families of probability distributions, which we denote by $\mathcal{M}_0$ and $\mathcal{M}_1$, that satisfy the following properties: (i) $\mathcal{M}_i$ contains distributions that are identical to $p(\varepsilon_i)$ up to relabeling of the domain elements; (ii) any tester that uses at most $m$ samples cannot effectively distinguish between a random probability distribution from $\mathcal{M}_0$ and a random one from $\mathcal{M}_1$. For the sake of contradiction, assume that there is a deterministic symmetric tester using $m$ samples whose acceptance probabilities on $p(\varepsilon_0)$ and $p(\varepsilon_1)$ differ by at least $0.1$. Since all distributions within $\mathcal{M}_0$ are identical to $p(\varepsilon_0)$ up to element relabeling, the acceptance probabilities of the symmetric tester on all distributions within $\mathcal{M}_0$ must be the same (and similarly for $\mathcal{M}_1$). This then implies that the tester can successfully distinguish a random distribution from $\mathcal{M}_0$ from a random one from $\mathcal{M}_1$, contradicting property (ii). Proposition 4.1 thereby follows.
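For intuition, the curve $\mathrm{Acc}_m(\xi)$ can be estimated by simulation; a sketch (assuming a tester whose internal random string has already been fixed, as in the argument above, so that it is deterministic in its samples):

```python
import numpy as np

def acceptance_curve(tester_fixed_r, n, m, xi_grid, runs=200):
    """Estimate Acc_m(xi) = Pr_{S ~ p(xi)^m}[tester accepts] on a grid of xi.

    tester_fixed_r(samples) -> bool: a symmetric tester with its random
    string already fixed."""
    curve = []
    for xi in xi_grid:
        probs = np.where(np.arange(n) % 2 == 0, (1 + xi) / n, (1 - xi) / n)
        hits = sum(tester_fixed_r(np.random.default_rng(t).choice(n, size=m, p=probs))
                   for t in range(runs))
        curve.append(hits / runs)
    return np.array(curve)  # continuous in xi: ~1 at xi = 0, ~0 at xi = eps
```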
It then remains to construct the two families of distributions. Recall that $\mathcal{M}_i$ contains distributions that are identical to $p(\varepsilon_i)$ up to element relabeling. We will consider all distributions that can be obtained by performing local swaps on $p(\varepsilon_i)$. In particular, we first group the elements into $n/2$ adjacent pairs, and then randomly exchange the labels within each pair.

Definition 4.2 (Local Swap Family). Let $n$ be an even number and $p$ be a probability distribution on $[n]$. We define the Local Swap Family of $p$ as the set of all distributions $\tilde{p}$ such that $(\tilde{p}_j, \tilde{p}_{j+1}) = (p_j, p_{j+1})$ or $(\tilde{p}_j, \tilde{p}_{j+1}) = (p_{j+1}, p_j)$ for all odd numbers $j \in [n]$.

Lemma 4.3 (Indistinguishable Distribution Families). Let $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$, and let $\varepsilon_0 < \varepsilon_1 \in (0, \varepsilon)$ be such that $\varepsilon_1 - \varepsilon_0 < \varepsilon\rho$. Let $p(\varepsilon_0), p(\varepsilon_1)$ be defined as in Equation (4), and let $\mathcal{M}_0, \mathcal{M}_1$ be the Local Swap Families (see Definition 4.2) of $p(\varepsilon_0), p(\varepsilon_1)$, respectively. Let $S$ be $m$ samples drawn from either a random distribution from $\mathcal{M}_0$ or a random one from $\mathcal{M}_1$. Given only $S$, no algorithm can successfully distinguish between the two cases with probability more than $0.6$.

The formal proof of Lemma 4.3 can be found in Appendix B.2. At a high level, we use an information-theoretic argument. In particular, we consider a stochastic process where a random unbiased bit $X$ controls whether we sample from a random distribution from $\mathcal{M}_0$ or from $\mathcal{M}_1$. Let $T$ be the obtained sample set. We show that $T$ and $X$ have little mutual information; a simple application of the data processing inequality then allows us to conclude the proof. To simplify the computation involved in the argument, we also apply the standard Poissonization trick: we assume that the algorithm draws $\mathrm{Poi}(m)$ many samples instead of exactly $m$ samples. The advantage of doing so is that we may then assume that the random variables counting the number of occurrences of each element are mutually independent conditioned on the probability distribution from which they are sampled. Moreover, one can show that this is without loss of generality by a standard reduction-based argument using the fact that Poisson distributions are highly concentrated. The formal statement of the mutual information bound is provided below.

Lemma 4.4. Let $\mathcal{M}_0, \mathcal{M}_1$ be defined as in Lemma 4.3. Let $X$ be a random unbiased bit, $p$ a random probability distribution from $\mathcal{M}_X$, and $S$ be $\mathrm{Poi}(m)$ many samples from $p$. Moreover, let $M_i$ be the number of occurrences of element $i$ among $S$. Then it holds that

$$I(X : M_1, \ldots, M_n) = O\left( \varepsilon^4\rho^2\,\frac{m^2}{n}\,\log^4(n) \right) + o(1).$$

To show Lemma 4.4, we first note that if we group the random variables into $n/2$ adjacent pairs, the pairs $(M_i, M_{i+1})$ are conditionally independent and identically distributed given $X$. Therefore, we can bound $I(X : M_1, \ldots, M_n)$ from above by $\frac{n}{2}\,I(X : M_1, M_2)$. To tackle $I(X : M_1, M_2)$, we consider three regimes depending on the relative sizes of $m, n, \varepsilon$: the sub-linear regime (approximately $m \le n$), the super-linear regime (approximately $n < m < n/\varepsilon^2$), and the super-learning regime (approximately $m > n/\varepsilon^2$). The formal proofs involve writing the probability distributions of the $M_i$ as Taylor expansions in $\varepsilon$. The calculations are rather technical and are therefore deferred to Appendix B.1.
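A sketch of drawing a uniformly random member of the Local Swap Family (Python; `p` is a length-$n$ probability vector with $n$ even):

```python
import numpy as np

def random_local_swap(p, rng):
    """Random member of the Local Swap Family of p (Definition 4.2): each
    adjacent pair (p_j, p_{j+1}), j odd, is swapped independently w.p. 1/2."""
    q = np.array(p, dtype=float).reshape(-1, 2)  # n/2 adjacent pairs
    flip = rng.random(len(q)) < 0.5
    q[flip] = q[flip, ::-1]                      # reverse the flipped pairs
    return q.reshape(-1)
```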
Acknowledgments and Disclosure of Funding

The authors would like to thank Max Hopkins, Russell Impagliazzo, and Daniel Kane for many helpful discussions and suggestions. Sihan Liu is supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship. Christopher Ye is supported by NSF Award AF: Medium 2212136, NSF grants 1652303, 1909046, 2112533, and HDR TRIPODS Phase II grant 2217058.

References

[1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. Advances in Neural Information Processing Systems, 28, 2015.
[2] Noga Alon and Joel H. Spencer. The Probabilistic Method. John Wiley & Sons, 2016.
[3] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016.
[4] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 442-451. IEEE, 2001.
[5] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In 41st Annual Symposium on Foundations of Computer Science, 2000.
[6] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 381-390, 2004.
[7] C. Glenn Begley and Lee M. Ellis. Raise standards for preclinical cancer research. Nature, 483(7391):531-533, 2012.
[8] Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC, pages 520-527. ACM, 2023.
[9] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Theory of Computing, pages 1-100, 2020.
[10] Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Sihan Liu. Near-optimal bounds for testing histogram distributions. arXiv preprint arXiv:2207.06596, 2022.
[11] Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Testing Bayesian networks. In Conference on Learning Theory, pages 370-448. PMLR, 2017.
[12] Clément L. Canonne et al. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends in Communications and Information Theory, 19(6):1032-1198, 2022.
[13] Clément L. Canonne, Ayush Jain, Gautam Kamath, and Jerry Li. The price of tolerance in distribution testing. In Conference on Learning Theory, pages 573-624. PMLR, 2022.
[14] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1193-1203. SIAM, 2014.
[15] Zachary Chase, Bogdan Chornomaz, Shay Moran, and Amir Yehudayoff. Local Borsuk-Ulam, stability, and replicability. arXiv preprint arXiv:2311.01599, 2023.
[16] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming, ICALP, 2018.
[17] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. Chicago Journal of Theoretical Computer Science, 25:1-21, 2019.
[18] Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685-694. IEEE, 2016.
[19] Ilias Diakonikolas, Daniel M. Kane, and Sihan Liu.
Testing closeness of multivariate distributions via Ramsey theory. arXiv preprint arXiv:2311.13154, 2023.
[20] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1841-1854. SIAM, 2014.
[21] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1183-1202. IEEE, 2015.
[22] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Near-optimal closeness testing of discrete histogram distributions. arXiv preprint arXiv:1703.01913, 2017.
[23] Peter Dixon, A. Pavan, Jason Vander Woude, and N. V. Vinodchandran. List and certificate complexities in replicable learning. Advances in Neural Information Processing Systems, 36, 2024.
[24] Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. arXiv preprint arXiv:2305.15284, 2023.
[25] Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[26] Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering. CoRR, abs/2302.10359, 2023.
[27] Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electronic Colloquium on Computational Complexity (ECCC), volume 23, page 1, 2016.
[28] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68-75, 2011.
[29] Max Hopkins, Russell Impagliazzo, Daniel M. Kane, Sihan Liu, and Christopher Ye. Replicability in high dimensional statistics. Unpublished manuscript, n.d.
[30] Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC, 2022.
[31] John P. A. Ioannidis. Contradicted and initially stronger effects in highly cited clinical research. JAMA, 294(2):218-228, 2005.
[32] John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.
[33] Alkis Kalavasis, Amin Karbasi, Kasper Green Larsen, Grigoris Velegkas, and Felix Zhou. Replicable learning of large-margin halfspaces. arXiv preprint arXiv:2402.13857, 2024.
[34] Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms. In International Conference on Machine Learning, ICML, 2023.
[35] Junpei Komiyama, Shinji Ito, Yuichi Yoshida, and Souta Koshino. Improved algorithms for replicable bandits. 2023.
[36] Erich Leo Lehmann, Joseph P. Romano, and George Casella. Testing Statistical Hypotheses, volume 3. Springer, 1986.
[37] Robert K. Merton. Priorities in scientific discovery: A chapter in the sociology of science. American Sociological Review, 22(6):635-659, 1957.
[38] Shay Moran, Hilla Schefler, and Jonathan Shafer. The Bayesian stability zoo. CoRR, abs/2310.18428, 2023.
[39] Jerzy Neyman and Egon Sharpe Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289-337, 1933.
[40] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750-4755, 2008.
[41] Michal Parnas, Dana Ron, and Ronitt Rubinfeld. Tolerant property testing and distance approximation. Journal of Computer and System Sciences, 72(6):1012-1042, 2006.
[42] Felipe Romero Toro. Novelty vs. replicability: Virtues and vices in the reward system of science. Philosophy of Science: Official Journal of the Philosophy of Science Association, 2017.
[43] Ronitt Rubinfeld. Taming big probability distributions. XRDS: Crossroads, The ACM Magazine for Students, 19(1):24-28, 2012.
[44] Stefan Schmidt. Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13(2):90-100, 2009.
[45] Gregory Valiant and Paul Valiant. A CLT and tight lower bounds for estimating entropy. In Electronic Colloquium on Computational Complexity (ECCC), volume 17, page 9, 2010.
[46] Gregory Valiant and Paul Valiant. Estimating the unseen: A sublinear-sample canonical estimator of distributions. In Electronic Colloquium on Computational Complexity (ECCC), volume 17, page 9, 2010.
[47] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429-455, 2017.
[48] Gregory Valiant and Paul Valiant. Estimating the unseen: Improved estimators for entropy and other properties. Journal of the ACM (JACM), 64(6):1-41, 2017.

A Omitted Proofs for Replicable Uniformity Testing Upper Bound

A.1 Concentration of the Total Variation Distance Statistic

We now prove the lemmas required for Theorem 1.3. This section is dedicated to the concentration properties of $S_{\mathrm{median}}$.

Superlinear Case: $m \ge n$. First, we consider the superlinear case, when $m \ge n$.

Lemma 3.2 (Superlinear Concentration). Assume that we are in the superlinear regime (i.e., $m \ge n$), and denote by $\mu(p)$ the expectation of the test statistic $S$ under the distribution $p$. If $n \le m \le \frac{n}{\varepsilon^2}$, then $\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge \frac{\rho C}{16}\varepsilon^2\sqrt{\frac{m}{n}} \right) \le \frac{\rho}{4}$; if $m \ge \frac{n}{\varepsilon^2}$, then $\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge \frac{\rho C}{16}\varepsilon \right) \le \frac{\rho}{4}$.

Proof. Recall that Algorithm 1 uses the median trick to boost the success probability. Here we focus on the test statistic $S := S_j$ computed in a single iteration $j \in [m_0]$. An essential tool in the analysis is McDiarmid's inequality.

Theorem A.1 (McDiarmid's Inequality). Let $X_1, \ldots, X_m$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^m \to \mathbb{R}$ be a function such that for all pairs of tuples $(x_1, \ldots, x_m), (x'_1, \ldots, x'_m) \in \mathcal{X}^m$ with $x_i = x'_i$ for all but one $i \in [m]$, we have $|f(x_1, \ldots, x_m) - f(x'_1, \ldots, x'_m)| \le B$. Then

$$\Pr\left( |f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]| \ge t \right) < 2\exp\left( -\frac{2t^2}{mB^2} \right).$$

Observe that the samples in $T_j$ are independent and that changing a single sample changes $S$ by at most $1/m$: at most two of the counts $X_i$ change (each by 1), so at most two terms $|X_i/m - 1/n|$ change, each by at most $1/m$, and the normalization factor $\frac{1}{2}$ gives a total change of at most $1/m$. Then, applying Theorem A.1 to $S = f(T_1, \ldots, T_m)$ with $B = 1/m$, we obtain

$$\Pr\left( |S - \mu(p)| \ge t \right) < 2\exp\left( -\frac{2t^2}{m(1/m)^2} \right) = 2\exp\left( -2mt^2 \right). \qquad (5)$$
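The bound of Equation (5) is easy to sanity-check numerically; the following simulation sketch is our own illustration (assuming numpy), not part of the proof.

```python
import numpy as np

def check_concentration(n, m, reps=400, fail=0.05):
    """Compare the empirical deviation of S = (1/2) sum_i |X_i/m - 1/n| under
    the uniform distribution against the prediction of Equation (5):
    Pr(|S - mu| >= t) < 2 exp(-2 m t^2), i.e. t = sqrt(log(2/fail)/(2m))."""
    rng = np.random.default_rng(0)
    stats = np.array([
        0.5 * np.abs(np.bincount(rng.integers(0, n, m), minlength=n) / m - 1 / n).sum()
        for _ in range(reps)
    ])
    empirical = np.quantile(np.abs(stats - stats.mean()), 1 - fail)
    predicted = np.sqrt(np.log(2 / fail) / (2 * m))
    return empirical, predicted  # empirical should sit below predicted
```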
When $n \le m \le \frac{n}{\varepsilon^2}$, setting $t = \frac{\rho C}{16}\varepsilon^2\sqrt{\frac{m}{n}}$, we can conclude

$$\Pr\left( |S - \mu(p)| \ge \frac{\rho C}{16}\varepsilon^2\sqrt{\frac{m}{n}} \right) < 2\exp\left( -\frac{C^2\rho^2\varepsilon^4 m^2}{128\,n} \right) \le \frac{1}{100}.$$

The first inequality follows from Equation (5), and the second holds whenever $m \ge \Theta\left( \frac{\sqrt{n}}{\rho\varepsilon^2} \right)$ for a sufficiently large constant factor. Finally, we show that $S_{\mathrm{median}}$ is concentrated with high probability, i.e.,

$$\Pr\left( |S_{\mathrm{median}} - \mu(p)| \ge \frac{\rho C}{16}\varepsilon^2\sqrt{\frac{m}{n}} \right) \le \frac{\rho}{4}.$$

Note that the median fails to satisfy the concentration condition only if at least half of the intermediate statistics $S_j$ fail to satisfy it. By a Chernoff bound, this occurs with probability at most $\rho/4$ for $m_0 = \Theta\left( \log\frac{1}{\rho} \right)$ with a sufficiently large constant. Now consider the case $m \ge \frac{n}{\varepsilon^2}$. Setting $t = \frac{\rho C}{16}\varepsilon$ in Equation (5),

$$\Pr\left( |S - \mu(p)| \ge \frac{\rho C}{16}\varepsilon \right) < 2\exp\left( -\frac{C^2\rho^2\varepsilon^2 m}{128} \right),$$

which is sufficiently small whenever $m \ge \Theta\left( \frac{1}{\rho^2\varepsilon^2} \right)$. Concentration of $S_{\mathrm{median}}$ then follows from an argument similar to the one above.

Sublinear Case: $m \le n$. We now consider the sublinear case. Again, we consider a single iteration $j$ and examine the concentration of the test statistic $S = S_j$. Following an observation from [16], we rewrite the test statistic as

$$S = \frac{1}{2}\sum_{i=1}^n \left| \frac{X_i}{m} - \frac{1}{n} \right| = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[X_i = 0] = \frac{1}{n}\left| \{i : X_i = 0\} \right| = \frac{Z}{n},$$

where we define the random variable $Z = |\{i : X_i = 0\}|$. Our goal now is to bound the variance of $Z$. To do so, we further define some useful random variables.

Preliminaries. Let $T \in [n]^m$ be the $m$ samples drawn, i.e., $T_i$ denotes the element corresponding to the $i$-th sample. We use $Y_1, \ldots, Y_m$ to denote whether the $i$-th sample collides with any sample $j < i$, i.e., $Y_i = 1$ if and only if there exists some $j < i$ such that $T_i = T_j$. We will consider a decomposition of the domain into heavy and light elements.

Definition A.2. Let $p$ be a distribution on $[n]$. For each $k \in [n]$, we say $k$ is heavy if $p_k \ge \frac{3}{m}\log\frac{10n}{\rho}$ and light otherwise. Let $n_H, n_L$ denote the number of heavy and light elements, respectively, and let $b_{k,p} = \mathbb{1}\left[ p_k \ge \frac{3}{m}\log\frac{10n}{\rho} \right]$ indicate whether $k$ is a heavy element in $p$. When the distribution is clear, we omit $p$ and write $b_k$.

We define $\tilde{Y}_1, \ldots, \tilde{Y}_m$ to indicate that the $i$-th sample comes from some light element and collides with some previous sample: $\tilde{Y}_i = Y_i \cdot \mathbb{1}\{b_{T_i} = 0\}$. Let $H$ denote the number of occurrences of heavy elements among the samples: $H = |\{i : b_{T_i} = 1\}|$. Finally, define $Z_H, Z_L$ to be the contributions to $Z$ from heavy and light elements, respectively: $Z_H = |\{k : b_k = 1 \text{ and } X_k = 0\}|$ and $Z_L = |\{k : b_k = 0 \text{ and } X_k = 0\}|$. We then show the following.

Lemma A.3. $\mathrm{Var}(Z) \le \frac{\rho^3}{500} + 32\log\frac{10n}{\rho}\sum_{i=1}^m \mathbb{E}[Y_i]$. In particular, $\mathrm{Var}(S) = \frac{\mathrm{Var}(Z)}{n^2} \le \frac{\rho^3}{500 n^2} + \frac{32}{n^2}\log\frac{10n}{\rho}\sum_{i=1}^m \mathbb{E}[Y_i]$.

Proof of Lemma A.3. Recall that the random variables $Y_1, \ldots, Y_m$ indicate whether the $i$-th sample collides with some sample $j < i$. Then we obtain the following identities:

$$Z = n - m + \sum_{i=1}^m Y_i = Z_H + Z_L, \qquad Z_L = n_L - (m - H) + \sum_{i=1}^m \tilde{Y}_i.$$

We bound the variances of $Z_H$ and $Z_L$ separately. First, we note that $Z_H = 0$ with high probability. In particular, the probability that an element with $b_k = 1$ does not occur among the $m$ samples is at most $(1 - p_k)^m < \left( \frac{\rho}{10n} \right)^3$, so by a union bound the probability that some heavy element does not occur is at most $\frac{\rho^3}{1000 n^2}$, showing $\Pr(Z_H > 0) < \frac{\rho^3}{1000 n^2}$. In particular,

$$\mathrm{Var}(Z_H) \le \mathbb{E}[Z_H^2] = \sum_{\ell = 0}^{n} \Pr(Z_H = \ell)\,\ell^2 \le n^2\,\Pr(Z_H > 0) \le \frac{\rho^3}{1000}.$$

In the remaining proof, we bound the variance of $Z_L$. To do so, we separately bound the variances of $H$ and $\sum_{i=1}^m \tilde{Y}_i$, since $n_L$ and $m$ are fixed constants. We begin with the first term, starting with an upper bound on the expectation of $H$: we show that the expected number of occurrences of heavy elements is at most a constant multiple of the expected number of collisions.

Lemma A.4. $4\sum_{i=1}^m \mathbb{E}[Y_i] \ge m\sum_{k : b_k = 1} p_k = \mathbb{E}[H] \ge \mathrm{Var}(H)$.

Proof of Lemma A.4. Note that $H$ is a binomial random variable, so its expectation is an upper bound on its variance. Since the $Y_i$ are non-negative random variables and $\mathbb{E}[Y_i]$ is increasing in $i$, it suffices to show the following lower bound:

$$\sum_{i=1}^m \mathbb{E}[Y_i] \ge \sum_{i = m/2}^{m} \mathbb{E}[Y_i] \ge \frac{m}{2}\,\mathbb{E}\left[ Y_{m/2} \right] \ge \frac{m}{4}\sum_{k : b_k = 1} p_k,$$

or equivalently $\mathbb{E}[Y_{m/2}] = \Pr(Y_{m/2} = 1) \ge \frac{1}{2}\sum_{k : b_k = 1} p_k$. Consider an element $k$ such that $b_k = 1$, or equivalently $p_k \ge \frac{3}{m}\log\frac{10n}{\rho}$. The probability that $k$ does not occur among the first $m/2$ samples is at most $\left( \frac{\rho}{10n} \right)^{3/2} \le \frac{\rho}{10n}$. Union bounding over at most $n$ elements, every element with $b_k = 1$ occurs among the first $m/2$ samples with probability at least $1 - \frac{\rho}{10} \ge \frac{9}{10}$. Conditioned on this event, $\Pr(Y_{m/2} = 1) \ge \sum_{k : b_k = 1} p_k$, so we conclude $\Pr(Y_{m/2} = 1) \ge \frac{9}{10}\sum_{k : b_k = 1} p_k \ge \frac{1}{2}\sum_{k : b_k = 1} p_k$.

Now we move on to the term $\sum_{i=1}^m \tilde{Y}_i$.

Lemma A.5. $\mathrm{Var}\left( \sum_{i=1}^m \tilde{Y}_i \right) \le \left( 1 + 3\log\frac{10n}{\rho} \right)\sum_{i=1}^m \mathbb{E}[\tilde{Y}_i]$.

Proof. Using linearity of expectation, we have

$$\mathbb{E}\left[ \left( \sum_{i=1}^m \tilde{Y}_i \right)^2 \right] = \sum_{i=1}^m \mathbb{E}\left[ \tilde{Y}_i^2 \right] + \sum_{i \ne j} \mathbb{E}\left[ \tilde{Y}_i \tilde{Y}_j \right].$$

For the first term, $\mathbb{E}[\tilde{Y}_i^2] = \mathbb{E}[\tilde{Y}_i]$ since $\tilde{Y}_i$ is an indicator variable. Considering the second term, we introduce the variables $Z_{i,j}$ denoting whether the $i$-th and $j$-th samples collide, i.e., $Z_{i,j} = \mathbb{1}[T_i = T_j]$. We can write

$$\mathbb{E}[\tilde{Y}_i \tilde{Y}_j] = \Pr(\tilde{Y}_i \tilde{Y}_j = 1) = \Pr(\tilde{Y}_i \tilde{Y}_j = 1,\ Z_{i,j} = 1) + \Pr(\tilde{Y}_i \tilde{Y}_j = 1,\ Z_{i,j} = 0).$$

We now bound the first term.

Lemma A.6. $\sum_{i \ne j} \Pr(\tilde{Y}_i \tilde{Y}_j = 1,\ Z_{i,j} = 1) \le 3\log\frac{10n}{\rho}\sum_{i=1}^m \mathbb{E}[\tilde{Y}_i]$.
Since the Yi are non-negative random variables and E [Yi] is increasing in i, it suffices to show the following lower bound: i=1 E [Yi] 4 i=m/2 E [Yi] 2m E Ym/2 m X or equivalently E Ym/2 = Pr(Ym/2 = 1) 1 Consider an element k such that bk = 1 or equivalently pk 3 ρ . Then, the probability that k does not occur in the first m/2 samples is at most (ρn 1/10)3/2 ρ 10n. Union bounding over at most n elements, each element with bk = 1 occurs in the first m 2 samples with probability at least 1 ρ 10 9 10. Conditioned on this, Pr(Ym/2 = 1) P bk=1 pk so that we conclude Pr(Ym/2 = 1) 9 Now, we move on to the term Pm i=1 Yi. 1 + 3 log 10n Proof. Using linearity of expectation, we have i=1 E h Y 2 i i + X i =j E h Yi Yj i . For the first term, E h Y 2 i i = E h Yi i since Yi is an indicator variable. Considering the second term, we introduce the variables Zi,j denoting whether the i-th and j-th samples collide, i.e. Zi,j = 1[Ti = Tj]. We can write E h Yi Yj i = Pr( Yi Yj = 1) = Pr( Yi Yj = 1, Zi,j = 1) + Pr( Yi Yj = 1, Zi,j = 0). We now bound the first term. i =j Pr( Yi Yj Zi,j = 1) 3 log 10n i=1 E h Yi i 3 log 10n Proof of Lemma A.6. The event Yi, Yj = 1, Zi,j = 1 happens when there is some element k [n] such that 2. both samples Ti = Tj = k 3. some samples from T µ(Un) + Cε2m2n 2 where µ(Un) is the expectation of S under the uniform distribution and C is given by Lemma 3.1. As a consequence, with probability at least 1 ρ/4, we have Smedian > µ(Un) + Cε2m2n 2. On the other hand, if Var(S) (C/64)ρ2ε4m4n 4, then it holds that Pr |Smedian µ(p)| (C/16)ρε2m2n 2 < ρ/4. Furthermore, for the uniform distribution Un, Var(S) (C/64)ρ2ε4m4n 4. Proof. Our proof considers two sub-cases. In particular, we argue that when the variance of the test statistic S is large, the algorithm replicably outputs REJECT. On the other hand, when variance is small, the test statistic S is tightly concentrated. High Variance Var(S) (C/64)ρ2ε4m4n 4 We argue that when the variance of the test statistic is high, the test statistic must also be large with high probability so that the algorithm replicably outputs REJECT. First, let us consider the expectation of the test statistic if the input distribution is uniform. Lemma A.10. Suppose m n. Let µ(Un) be the expectation of S given a sample from Un. Then n µ(Un) n m + m2 Proof of Lemma A.10. For each i, note that there are at most m distinct elements sampled before the i-th sample. In particular, E [Yi] = Pr(Yi = 1) m n for all i. Then, since in the sub-linear regime m n we have the identities We can write the expectation of S as µ(Un) = E [S] = E [Z] n = n m + Pm i=1 E [Yi] n n µ(Un) = n m + i=1 E [Yi] n m + m2 Now, we show that when the variance of S is large, with probability at least 9 10, S µ(Un)+Cε2 m2 n2 where C is the constant given by Lemma 3.1. Since S = Z n , this holds if and only if Z n µ(Un) + Cε2 m2 The expectation of Z is E [Z] = n m + i=1 E [Yi] . From Lemma A.3, we have Var(Z) = O ρ3 + log(n/ρ) Pm i=1 E [Yi] . By Chebyshev s inequality, we have that Pr Z < E [Z] p 10Var(Z) < 1 Therefore, it suffices to show E [Z] p 10Var(Z) n µ(Un) + Cε2 m2 n2 . If this holds, then S is beyond the random threshold and the algorithm outputs REJECT. 
We rearrange the desired inequality as follows: 10Var(Z) n µ(Un) + Cε2 m2 E [Z] n µ(Un) p 10Var(Z) Cε2 m2 Plugging in E [Z] n µ(Un) Pm i=1 E [Yi] m2 n , it suffices to show i=1 E [Yi] m2 10Var(Z) Cε2 m2 which is true as long as i=1 E [Yi] p 10Var(Z) 2C m2 Thus, applying Lemma A.3, it remains to show 50 + 320 log 10n i=1 E [Yi] 1 v u u t320 log 10n i=1 E [Yi] 2C m2 ρ3/50 1. Finally, we observe that by our choice of m, we have 2Cm2n 1 1 so that it suffices to show v u u t320 log 10n i=1 E [Yi] 3C m2 We simplify the left hand side with the following lemma. Lemma A.11. For any x 1280 log 10n ρ , we have 320 log 10n Proof. Suppose x 320 log 10n ρ . Then the inequality holds if and only if 320 log 10n 320 log 10n x 1280 log 10n We now lower bound Pm i=1 E [Yi]. First, from Lemma A.3 we have i=1 E [Yi] Var(Z) ρ3 500 32 log(10n/ρ) = Var(S)n2 ρ3 500 32 log(10n/ρ) . By assumption on Var(S) (C/64)ρ2ε4m4n 4, we obtain i=1 E [Yi] (C/64)ρ2ε4m4n 2 ρ3/500 32 log(10n/ρ) . Finally, since m nρ 1ε 2p log(n/ρ) for some sufficiently large constant, we have (C/64)ρ2ε4m4n 2 ρ3/500 log2(n/ρ) i=1 E [Yi] log2(n/ρ) ρ2ε4 log(n/ρ) 1280 log n Thus, we conclude by v u u t320 log 10n i=1 E [Yi] 1 where the final equality follows from i=1 E [Yi] (C/64)ρ2ε4m4n 2 ρ3/500 56 log(10n/ρ) log2(n/ρ) ρ2ε4 log(n/ρ) = log(n/ρ) n = O log(n/ρ) and if we choose an arbitrary large constant factor of m, left hand side P E [Yi] takes this constant to the 4th power while the right hand side takes this constant to the 2nd power. Thus, whenever the variance of S is large, the test statistic S lies above any random threshold with probability at least 9 10. Again applying a Chernoff bound, Smedian does not lie above any random threshold with probability at most ρ For the high variance case, it remains to show the inequality Var(S) µ(Un) + Cε2 m2 From the identities S = Z/n and Z = n m + P Yi, Lemma A.3 and Lemma A.10, we can rewrite the above equation as P E [Yi] m2/n X E [Yi] Cε2 m2 and we can lower bound the left hand side as P E [Yi] m2 where we obtain the first term by using b and ρ3/(500n2) n 2, the second term by grouping similar terms, and the final term by our assumption on Var(S) and therefore E [P Yi]. Therefore, it suffices to show X E [Yi] 1 + m2 or since m2n 1 1, X E [Yi] 4 max(C, 1)m2 n which follows from the lower bound on P E [Yi] given by Equation (8). Low Variance Var(S) (C/64)ρ2ε4m4n 4 We now show that whenever the number of expected collisions is low, the test statistic is well concentrated. By Chebyshev s inequality, we have for any iteration j that Pr |Sj µ(p)| > C = Pr |Sj µ(p)| > 4 C Using a Chernoff bound as previously, we obtain the same concentration result for Smedian with high probability. Finally, we show that for the uniform distribution Un, we have Var(S) (C/64)ρ2ε4m4n 4. We observe that for the uniform distribution, E [Yi] m n since the i-th sample can collide with at most i m elements before it and therefore Pm i=1 E [Yi] m2 n . Then, from Lemma A.3, Var(S) = Var(Z) n2 24 log(10n/ρ) i=1 E [Yi] 24m2 log(10n/ρ) We thus require 24m2 log(10n/ρ) n3 (C/64)ρ2ε4m4n 4 1536n log(10n/ρ) which is satisfied by our sample complexity m. B Omitted Proofs for Replicable Uniformity Testing Lower Bound B.1 Proof of Lemma 4.4 We begin by bounding the mutual information of the sub-linear regime. Lemma B.1. Let m n 1. Let δ = |ε0 ε1| and ε = max(ε0, ε1). Then, (Pr(M1 = a, M2 = b | X = 0) Pr(M1 = a, M2 = b | X = 1))2 Pr(M1 = a, M2 = b) O ε2δ2m2 Proof. First, let us expand the conditional probabilities for a fixed a, b. 
B Omitted Proofs for Replicable Uniformity Testing Lower Bound

B.1 Proof of Lemma 4.4

We begin by bounding the mutual information in the sub-linear regime.

Lemma B.1. Let $m \le n$. Let $\delta = |\varepsilon_0 - \varepsilon_1|$ and $\varepsilon = \max(\varepsilon_0, \varepsilon_1)$. Then,
$$\sum_{a,b}\frac{\big(\Pr(M_1 = a, M_2 = b \mid X = 0) - \Pr(M_1 = a, M_2 = b \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le O\Big(\frac{\varepsilon^2\delta^2 m^2}{n^2}\Big).$$

Proof. First, let us expand the conditional probabilities for fixed $a, b$. Expanding the definition of the random variables $M_1, M_2, X$ and applying the probability mass function of Poisson distributions, we arrive at the expressions
$$\Pr(M_1 = a, M_2 = b \mid X = 0) = \frac{1}{2\,a!\,b!}e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}\Big((1+\varepsilon_0)^a(1-\varepsilon_0)^b + (1-\varepsilon_0)^a(1+\varepsilon_0)^b\Big),$$
$$\Pr(M_1 = a, M_2 = b \mid X = 1) = \frac{1}{2\,a!\,b!}e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}\Big((1+\varepsilon_1)^a(1-\varepsilon_1)^b + (1-\varepsilon_1)^a(1+\varepsilon_1)^b\Big).$$
Define the function $f_{a,b}(x) := (1+x)^a(1-x)^b + (1-x)^a(1+x)^b$. Then we have
$$\sum_{a,b}\frac{\big(\Pr(M_1 = a, M_2 = b \mid X = 0) - \Pr(M_1 = a, M_2 = b \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} = \sum_{a,b}\frac{1}{a!\,b!}e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}\frac{\big(f_{a,b}(\varepsilon_1) - f_{a,b}(\varepsilon_0)\big)^2}{f_{a,b}(\varepsilon_1) + f_{a,b}(\varepsilon_0)}$$
$$\le \sum_{a,b}\frac{1}{a!\,b!}e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}\frac{\max_{\varepsilon_0 \le x \le \varepsilon_1}\big(\partial_x f_{a,b}(x)\big)^2}{f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1)}\,(\varepsilon_1 - \varepsilon_0)^2, \quad (10)$$
where in the last line we use the mean value theorem. The main technical step will be to bound from above the quantity $\max_{\varepsilon_0 \le x \le \varepsilon_1}\frac{(\partial_x f_{a,b}(x))^2}{f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1)}$. Specifically, we show the following technical claim.

Claim B.2. There exists an absolute constant $C$ such that
$$\max_{\varepsilon_0 \le x \le \varepsilon_1}\frac{\big(\partial_x f_{a,b}(x)\big)^2}{f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1)} \le \begin{cases} 0 & \text{if } a + b \le 1,\\ C\,(a^4 + b^4)\,\varepsilon^2 & \text{if } 2 \le a + b \le \varepsilon^{-1}/2,\\ C\,(a+b)^2\big(\tfrac{1+\varepsilon}{1-\varepsilon}\big)^{a+b} & \text{if } a + b > \varepsilon^{-1}/2. \end{cases}$$

Proof. If $a + b \le 1$, then $f_{a,b}(x) = 2$, so $\frac{d}{dx}f_{a,b}(x) = 0$, which gives the first case.

We next analyze the case $2 \le a + b \le \varepsilon^{-1}/2$. Note that $f_{a,b}(x)$ is an even function with respect to $x$, i.e., $f_{a,b}(x) = f_{a,b}(-x)$. This allows us to write $f_{a,b}(x) = \frac{1}{2}(f_{a,b}(x) + f_{a,b}(-x))$. As a result, we can conclude that $f_{a,b}(x)$ does not contain any monomial of $x$ with odd degree. Thus, we can write
$$f_{a,b}(x) = c^{(0)}_{a,b} + \sum_{d \in [a+b],\ d \text{ even}} c^{(d)}_{a,b}\,x^{d}$$
for some coefficients $c^{(d)}_{a,b}$ that depend on $a, b$ only. This implies that
$$\partial_x f_{a,b}(x) = \sum_{d \in [a+b],\ d \text{ even}} d\,c^{(d)}_{a,b}\,x^{d-1}.$$
When $2 \le a + b \le \varepsilon^{-1}/2$, the coefficient $c^{(d)}_{a,b}$ is of order $\binom{a+b}{d}$. Since $x < \varepsilon$, we must have that $|d\,c^{(d)}_{a,b}\,x^{d-1}|$ decreases exponentially fast in $d$. This shows that $|\partial_x f_{a,b}(x)|$ is dominated by the contribution of the monomial $|c^{(2)}_{a,b}\,x|$, which implies the second case.

Lastly, suppose $a + b > \varepsilon^{-1}/2$. In this case, for $\varepsilon_0 \le x \le \varepsilon_1$, we note that
$$\partial_x f_{a,b}(x) = (1+x)^a(1-x)^b\Big(\frac{a}{1+x} - \frac{b}{1-x}\Big) + (1-x)^a(1+x)^b\Big(\frac{b}{1+x} - \frac{a}{1-x}\Big).$$
It then follows that
$$\big|\partial_x f_{a,b}(x)\big| \le O(a+b)\Big((1+x)^a(1-x)^b + (1-x)^a(1+x)^b\Big) \le O(a+b)\Big((1+x)^a + (1+x)^b\Big), \quad (13)$$
where in the first inequality we use that $|u - v| \le u + v$ for any $u, v > 0$, and the second inequality simply follows from $1 - x < 1$. We can therefore conclude that
$$\max_{\varepsilon_0 \le x \le \varepsilon_1}\frac{\big(\partial_x f_{a,b}(x)\big)^2}{f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1)} \le \frac{O(1)(a+b)^2\big((1+\varepsilon_1)^a + (1+\varepsilon_1)^b\big)^2}{(1+\varepsilon_1)^a(1-\varepsilon_1)^b + (1+\varepsilon_1)^b(1-\varepsilon_1)^a} \le \frac{O(1)(a+b)^2\big((1+\varepsilon_1)^{2a} + (1+\varepsilon_1)^{2b}\big)}{(1+\varepsilon_1)^a(1-\varepsilon_1)^b + (1+\varepsilon_1)^b(1-\varepsilon_1)^a} \le O(1)(a+b)^2\Big(\frac{1+\varepsilon_1}{1-\varepsilon_1}\Big)^{a+b},$$
where in the first inequality we bound from above the numerator by the square of the right hand side of Equation (13) evaluated at $x = \varepsilon_1$ and bound from below the denominator by $f_{a,b}(\varepsilon_1)$, in the second inequality we use the fact that $(u+v)^2 \le 2u^2 + 2v^2$, and in the last inequality we use that $(1+\varepsilon_1)^a, (1-\varepsilon_1)^{-a}$ are increasing in $a$ and $(1+\varepsilon_1)^b, (1-\varepsilon_1)^{-b}$ are increasing in $b$. This concludes the proof of the third case and also Claim B.2.
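Before continuing, the middle case of Claim B.2, as reconstructed above, is easy to spot-check numerically by maximizing the derivative on a grid; the constant reported below is found empirically and serves only as a sanity check, not as part of the proof.

```python
# Numeric spot-check of the middle case of Claim B.2 (our sketch): the ratio
# max_x f'(x)^2 / (f(e0) + f(e1)) divided by (a^4 + b^4) * eps^2 should stay
# O(1) over the range 2 <= a + b <= 1/(2*eps).
def f(a, b, x):
    return (1 + x)**a * (1 - x)**b + (1 - x)**a * (1 + x)**b

def f_prime(a, b, x, h=1e-6):
    return (f(a, b, x + h) - f(a, b, x - h)) / (2 * h)

e0, e1 = 0.05, 0.10
eps = max(e0, e1)
worst = 0.0
for a in range(6):
    for b in range(6):
        if not 2 <= a + b <= 1 / (2 * eps):
            continue
        grid = [e0 + (e1 - e0) * t / 100 for t in range(101)]
        num = max(f_prime(a, b, x) ** 2 for x in grid)
        worst = max(worst, num / (f(a, b, e0) + f(a, b, e1)) / ((a**4 + b**4) * eps**2))
print(f"empirical constant ~ {worst:.2f}")  # stays O(1)
```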
It immediately follows from the claim that
$$\sum_{a,b\,:\,a+b \le 1}\frac{\big(\Pr(M_1 = a, M_2 = b \mid X = 0) - \Pr(M_1 = a, M_2 = b \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} = 0. \quad (14)$$
Consider the terms with $2 \le a + b \le \varepsilon^{-1}/2$. From Equation (10) and Claim B.2, we have that
$$\sum_{a,b\,:\,2 \le a+b \le \varepsilon^{-1}/2}\frac{\big(\Pr(\cdot \mid X = 0) - \Pr(\cdot \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le O(\delta^2\varepsilon^2)\sum_{a,b\,:\,2 \le a+b \le \varepsilon^{-1}/2}\frac{a^4 + b^4}{a!\,b!}\,e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}. \quad (15)$$
Define $A := \sum_{a,b\,:\,2 \le a+b \le \varepsilon^{-1}/2}\frac{a^4}{a!\,b!}\big(\frac{m}{n}\big)^{a+b}$ and $B := \sum_{a,b\,:\,2 \le a+b \le \varepsilon^{-1}/2}\frac{b^4}{a!\,b!}\big(\frac{m}{n}\big)^{a+b}$. One can see that the two terms are similar to each other, so we focus on the term $A$. We have that
$$A \le 2\sum_{a \ge 2}\frac{a^4}{a!}\Big(\frac{m}{n}\Big)^{a} \le O(1)\,e^{m/n}\sum_{a \ge 2}e^{-m/n}\frac{(m/n)^a}{a!}\Big(a(a-1) + a(a-1)(a-2)(a-3)\Big) \le O(1)\,e^{m/n}\,\mathbb{E}_{y \sim \mathrm{Poi}(m/n)}\big[y(y-1) + y(y-1)(y-2)(y-3)\big] \le O\Big(\frac{m^2}{n^2}\Big),$$
where in the first inequality we use the assumption $m \le n$ to bound from above the summation over $b$ by a geometric series with common ratio at most $1/2$, in the second inequality we use the fact that $a^4 = O\big(a(a-1) + a(a-1)(a-2)(a-3)\big)$ for all $a \ge 2$, in the third inequality we observe that the summation over $a$ can be bounded from above by the factorial moments of a Poisson random variable with mean $m/n$, and in the last inequality we use the fact that $\mathbb{E}_{y \sim \mathrm{Poi}(\lambda)}[y(y-1) + y(y-1)(y-2)(y-3)] = \lambda^2 + \lambda^4$ together with $m \le n$. One can show the same upper bound for $B$ via an almost identical argument. Substituting the bounds for $A, B$ into Equation (15) then gives
$$\sum_{a,b\,:\,2 \le a+b \le \varepsilon^{-1}/2}\frac{\big(\Pr(\cdot \mid X = 0) - \Pr(\cdot \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le O(\delta^2\varepsilon^2)\,\frac{m^2}{n^2}. \quad (16)$$
It then remains to analyze the terms where $a + b > \varepsilon^{-1}/2$. Again from Equation (10) and Claim B.2, we have that
$$\sum_{a,b\,:\,a+b > \varepsilon^{-1}/2}\frac{\big(\Pr(\cdot \mid X = 0) - \Pr(\cdot \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le O(\delta^2)\sum_{s > \varepsilon^{-1}/2}\,\sum_{a+b = s}\frac{(a+b)^2}{a!\,b!}\Big(\frac{m}{n}\Big)^{s}\Big(\frac{1+\varepsilon}{1-\varepsilon}\Big)^{s} \le O(\delta^2)\sum_{s > \varepsilon^{-1}/2}\frac{s^3}{\lceil s/2\rceil!}\Big(\frac{m}{n}\Big)^{s}\Big(\frac{1+\varepsilon}{1-\varepsilon}\Big)^{s} \le O\big(\delta^2\varepsilon^2(m/n)^2\big), \quad (17)$$
where in the second inequality we use the observation that we must have either $a \ge s/2$ or $b \ge s/2$ whenever $a + b = s$, in the third inequality we note that the summation over $s$ decreases exponentially and is therefore dominated by the term $s = \varepsilon^{-1}/2$, and in the last inequality we use that $s^3/\lceil s/2\rceil! \le O(1)$, $(m/n)^{\varepsilon^{-1}/2} \le (m/n)^2$, $\frac{1+\varepsilon}{2(1-\varepsilon)} < 0.9$, and that $0.9^{\varepsilon^{-1}/2} \le \mathrm{poly}(\varepsilon)$ for sufficiently small $\varepsilon$.

Combining Equations (14), (16) and (17) then concludes the proof of Lemma B.1.
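The bound of Lemma B.1 can also be checked numerically for concrete parameters. The sketch below is ours; the truncation cutoff is an assumption that is safe for these parameter values. It evaluates the chi-square-style sum directly from the Poissonized probabilities and compares it to $\varepsilon^2\delta^2 m^2/n^2$:

```python
# Numeric check of Lemma B.1 (our sketch): evaluate
# sum_{a,b} (P0(a,b) - P1(a,b))^2 / P(a,b) for the Poissonized two-bin pair
# and compare against eps^2 * delta^2 * m^2 / n^2.
import math

def f(a, b, x):
    return (1 + x)**a * (1 - x)**b + (1 - x)**a * (1 + x)**b

def joint(a, b, e, m, n):
    """Pr(M1=a, M2=b | X) as the even mixture over which bin is heavier."""
    base = math.exp(-2 * m / n) * (m / n)**(a + b)
    return base / (2 * math.factorial(a) * math.factorial(b)) * f(a, b, e)

m, n, e0, e1 = 500, 1000, 0.10, 0.12
eps, delta = max(e0, e1), abs(e0 - e1)
lhs = 0.0
for a in range(40):          # truncation: Poi(m/n) mass beyond 40 is negligible
    for b in range(40):
        p0, p1 = joint(a, b, e0, m, n), joint(a, b, e1, m, n)
        lhs += (p0 - p1)**2 / ((p0 + p1) / 2)
print(f"LHS = {lhs:.3e}, eps^2*delta^2*m^2/n^2 = {eps**2 * delta**2 * m**2 / n**2:.3e}")
```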
We now bound the mutual information for the super-linear regime.

Lemma B.3. Let $\delta = |\varepsilon_0 - \varepsilon_1|$ and $\varepsilon = \max(\varepsilon_0, \varepsilon_1)$. Let $n < m < o\big(\frac{n}{\varepsilon^2\log^2 n}\big)$. Then,
$$\sum_{a,b}\frac{\big(\Pr(M_1 = a, M_2 = b \mid X = 0) - \Pr(M_1 = a, M_2 = b \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le O\Big(\frac{\varepsilon^2\delta^2 m^2\log^4 n}{n^2}\Big) + o(1/n).$$

Proof. We note that for a Poisson random variable $y$ with mean $\lambda \ge 1$, we have $|y - \lambda| > \sqrt{\lambda}\log n$ with probability at most $o(1/n)$. We can thus focus on the terms where the counts lie in the range $[\lambda - \sqrt{\lambda}\log n,\, \lambda + \sqrt{\lambda}\log n]$ with $\lambda = m/n$; for reasons that will become clear soon, we will assume that $a, b$ both lie in this range.

As in the sub-linear case, we define the function $f_{a,b}(x) := (1+x)^a(1-x)^b + (1-x)^a(1+x)^b$. Instead of computing the derivative of $f_{a,b}$ directly, we will first rewrite it slightly:
$$f_{a,b}(x) = \exp\big(a\log(1+x) + b\log(1-x)\big) + \exp\big(a\log(1-x) + b\log(1+x)\big) = 2 + a\log(1-x^2) + b\log(1-x^2) + \sum_{i \ge 2}\frac{\big(a\log(1+x) + b\log(1-x)\big)^i}{i!} + \sum_{i \ge 2}\frac{\big(a\log(1-x) + b\log(1+x)\big)^i}{i!},$$
where in the second equality we apply the Taylor expansion of the exponential function. We will analyze the expression term by term. In particular, define
$$g_y(x) = y\log(1-x^2), \qquad h_{a,b}(x) = \sum_{i \ge 2}\frac{\big(a\log(1+x) + b\log(1-x)\big)^i}{i!}.$$
Then we have
$$f_{a,b}(x) = 2 + g_a(x) + g_b(x) + h_{a,b}(x) + h_{a,b}(-x). \quad (18)$$
We first analyze $g_a(x)$. When $a \in [\lambda - \sqrt{\lambda}\log n,\, \lambda + \sqrt{\lambda}\log n]$, we have that
$$|g_a(\varepsilon_1) - g_a(\varepsilon_0)| = a\,\Big|\log\frac{1-\varepsilon_0^2}{1-\varepsilon_1^2}\Big| \le a\,\Big|\frac{1-\varepsilon_0^2}{1-\varepsilon_1^2} - 1\Big| = \frac{a}{1-\varepsilon_1^2}\,(\varepsilon_0 + \varepsilon_1)\,|\varepsilon_0 - \varepsilon_1| \le O\Big(\frac{m}{n}\,\varepsilon\,\delta\log n\Big), \quad (19)$$
where in the first inequality we use the fact that $\log x \le x - 1$ for all $x > 0$, and in the second we use the assumptions $\max(\varepsilon_0, \varepsilon_1) \le \varepsilon$ and $|\varepsilon_0 - \varepsilon_1| \le \delta$, together with $|a - m/n| < \sqrt{m/n}\log n$. Using a similar argument, one can also derive that
$$|g_b(\varepsilon_1) - g_b(\varepsilon_0)| \le O\Big(\frac{m}{n}\,\varepsilon\,\delta\log n\Big). \quad (20)$$
We then turn to the term $h_{a,b}(x)$. The derivative with respect to $x$ is given by
$$\partial_x h_{a,b}(x) = \Big(\frac{a}{1+x} - \frac{b}{1-x}\Big)\sum_{i \ge 2}\frac{\big(a\log(1+x) + b\log(1-x)\big)^{i-1}}{(i-1)!} = \Big(\frac{a}{1+x} - \frac{b}{1-x}\Big)\Big((1+x)^a(1-x)^b - 1\Big),$$
where the second equality follows from the observation that the summation is exactly the Taylor expansion of $\exp\big(a\log(1+x) + b\log(1-x)\big)$ without the constant term.

We next proceed to bound $\big|\partial_x h_{a,b}(x)\big|$ from above. First, when $x$ is sufficiently small and $a, b$ lie in the range $\frac{m}{n} \pm \sqrt{\frac{m}{n}}\log n$, note that
$$\Big|\frac{a}{1+x} - \frac{b}{1-x}\Big| = \big|a - b - O(x)(a+b)\big| \le |a - b| + (a+b)\,O(x) \le O\Big(\sqrt{\frac{m}{n}}\log n + \frac{2m}{n}\,x\Big) \le O\Big(\sqrt{\frac{m}{n}}\log n\Big), \quad (21)$$
where in the first step we approximate $\frac{1}{1+x}$ and $\frac{1}{1-x}$ by $1 - O(x)$ and $1 + O(x)$ respectively for sufficiently small $x$, the second step follows from the triangle inequality, in the third step we use the assumption on the ranges of $a, b$, and in the last step we note that $\sqrt{\frac{m}{n}}\log n$ is the dominating term when $x < \varepsilon$ and $m < n/\varepsilon^2$. Next, under the same set of assumptions, we have that
$$\Big|(1+x)^a(1-x)^b - 1\Big| = \Big|(1-x^2)^b(1+x)^{a-b} - 1\Big| \le O\big(b\,x^2 + |a-b|\,x\big) \le O\Big(\sqrt{\frac{m}{n}}\,\varepsilon\log n\Big), \quad (22)$$
where in the second step we note that, since $b\,x^2 = O\big(\frac{m}{n}\varepsilon^2\big) \le 1$ and $|a-b|\,x = O\big(\sqrt{\frac{m}{n}}\log(n)\,\varepsilon\big) \le 1$, we can approximate $(1-x^2)^b$ and $(1+x)^{a-b}$ by $1 - O(b\,x^2)$ and $1 + O(|a-b|\,x)$ respectively, and in the last step we use the assumption on the range of $a, b$ and $x < \varepsilon$. Thus, combining Equations (21) and (22), we arrive at the upper bound
$$\max_{x \in [\varepsilon_0, \varepsilon_1]}\big|\partial_x h_{a,b}(x)\big| \le O\Big(\varepsilon\,\frac{m}{n}\log^2 n\Big) \qquad\text{for } a, b \in \Big[\frac{m}{n} - \sqrt{\frac{m}{n}}\log n,\ \frac{m}{n} + \sqrt{\frac{m}{n}}\log n\Big].$$
Therefore, by the mean value theorem, we have that
$$|h_{a,b}(\varepsilon_1) - h_{a,b}(\varepsilon_0)| \le O\Big(\varepsilon\,\delta\,\frac{m}{n}\log^2 n\Big). \quad (23)$$
Following a similar argument, one can also derive the upper bound
$$|h_{a,b}(-\varepsilon_1) - h_{a,b}(-\varepsilon_0)| \le O\Big(\varepsilon\,\delta\,\frac{m}{n}\log^2 n\Big). \quad (24)$$
Combining Equations (18) to (20), (23) and (24) then gives
$$|f_{a,b}(\varepsilon_1) - f_{a,b}(\varepsilon_0)| \le O\Big(\varepsilon\,\delta\,\frac{m}{n}\log^2 n\Big). \quad (25)$$
Recall that in the proof of Lemma B.1, we showed
$$\sum_{a,b}\frac{\big(\Pr(M_1 = a, M_2 = b \mid X = 0) - \Pr(M_1 = a, M_2 = b \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le \sum_{a,b}\frac{1}{a!\,b!}e^{-2m/n}\Big(\frac{m}{n}\Big)^{a+b}\frac{\big(f_{a,b}(\varepsilon_1) - f_{a,b}(\varepsilon_0)\big)^2}{f_{a,b}(\varepsilon_1) + f_{a,b}(\varepsilon_0)}. \quad (26)$$
Hence, we proceed to bound $f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1)$ from below. Note that we have
$$f_{a,b}(\varepsilon_0) + f_{a,b}(\varepsilon_1) \ge (1-x^2)^{\min(a,b)}(1+x)^{|a-b|} \ge 1 - O\big(\min(a,b)\,x^2\big) \ge \Omega(1), \quad (27)$$
where the first inequality follows from a case analysis on whether $a > b$, and in the second inequality we approximate $(1-x^2)^{\min(a,b)}$ by $1 - O(\min(a,b)\,x^2)$, since $\min(a,b)\,x^2 = o(1)$ for $a, b$ in the assumed range and $x < \varepsilon$. Combining Equations (25) to (27) then gives
$$\sum_{a,b}\frac{\big(\Pr(\cdot \mid X = 0) - \Pr(\cdot \mid X = 1)\big)^2}{\Pr(M_1 = a, M_2 = b)} \le \Pr_{a,b \sim \mathrm{Poi}(m/n)}\Big[\,|a - m/n|,\ |b - m/n| \le \sqrt{m/n}\log n\,\Big]\cdot O\Big(\varepsilon^2\delta^2\frac{m^2}{n^2}\log^4 n\Big) + o(1/n) \le O\Big(\frac{\varepsilon^2\delta^2 m^2\log^4 n}{n^2}\Big) + o(1/n).$$
This concludes the mutual information bound for the super-linear case.
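The decomposition in Equation (18) is an exact identity, since the two exponential series converge absolutely, and it can be verified numerically; the sketch below truncates the $h$-series at a high order, which is our own choice:

```python
# Numeric check of the exact decomposition in Equation (18):
# f_{a,b}(x) = 2 + g_a(x) + g_b(x) + h_{a,b}(x) + h_{a,b}(-x).
import math

def f(a, b, x):
    return (1 + x)**a * (1 - x)**b + (1 - x)**a * (1 + x)**b

def g(y, x):
    return y * math.log(1 - x * x)

def h(a, b, x, order=80):
    """Sum_{i>=2} (a*log(1+x) + b*log(1-x))^i / i!, truncated at `order`."""
    u = a * math.log(1 + x) + b * math.log(1 - x)
    return sum(u**i / math.factorial(i) for i in range(2, order))

a, b, x = 7, 5, 0.1
print(f(a, b, x), 2 + g(a, x) + g(b, x) + h(a, b, x) + h(a, b, -x))
# The two printed values agree to floating-point precision.
```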
Lastly, we present the mutual information bound for the super-learning parameter regime, i.e., $m \ge \Omega(n/\varepsilon^2)$.

Lemma B.4. Let $\delta = |\varepsilon_0 - \varepsilon_1|$ and $\varepsilon = \max(\varepsilon_0, \varepsilon_1)$. Let $m \ge \Omega\big(\frac{n}{\varepsilon^2\log^2 n}\big)$. Then it holds that
$$I(X : M_1, M_2) \le O\Big(\frac{\varepsilon^2\delta^2 m^2\log^2 n}{n^2}\Big).$$

We are going to use the following facts regarding KL-divergence and mutual information.

Claim B.5 (Convexity of KL-divergence). Let $w \in (0,1)$, and let $p_1, p_2, q_1, q_2$ be probability distributions over the same domain. Then it holds that
$$\mathrm{KL}\big(w p_1 + (1-w)p_2 \,\|\, w q_1 + (1-w)q_2\big) \le w\,\mathrm{KL}(p_1 \| q_1) + (1-w)\,\mathrm{KL}(p_2 \| q_2).$$

Claim B.6 (Additivity of KL-divergence). Let $p_1, p_2, q_1, q_2$ be probability distributions such that $p_1, q_1$ share the same domain and $p_2, q_2$ share the same domain. Then it holds that
$$\mathrm{KL}(p_1 \otimes p_2 \,\|\, q_1 \otimes q_2) = \mathrm{KL}(p_1 \| q_1) + \mathrm{KL}(p_2 \| q_2).$$

Claim B.7 (Mutual Information Identity). Let $X$ be an unbiased indicator variable, and let $H_0, H_1$ be a pair of distributions. Let $H$ be the random variable that follows the distribution $H_0$ if $X = 0$ and $H_1$ if $X = 1$. Then it holds that
$$I(X : H) = \frac{1}{2}\,\mathrm{KL}\Big(H_0 \,\Big\|\, \frac{H_1 + H_0}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(H_1 \,\Big\|\, \frac{H_1 + H_0}{2}\Big).$$

Proof of Lemma B.4. When $X = 0$, the pair $(M_1, M_2)$ is distributed as an even mixture of two distributions that are both products of Poisson distributions:
$$(M_1, M_2) \mid (X = 0) \;\sim\; H_0 := \frac{1}{2}\,\mathrm{Poi}\Big(\frac{m(1+\varepsilon_0)}{n}\Big)\otimes\mathrm{Poi}\Big(\frac{m(1-\varepsilon_0)}{n}\Big) + \frac{1}{2}\,\mathrm{Poi}\Big(\frac{m(1-\varepsilon_0)}{n}\Big)\otimes\mathrm{Poi}\Big(\frac{m(1+\varepsilon_0)}{n}\Big).$$
Similarly, we have
$$(M_1, M_2) \mid (X = 1) \;\sim\; H_1 := \frac{1}{2}\,\mathrm{Poi}\Big(\frac{m(1+\varepsilon_1)}{n}\Big)\otimes\mathrm{Poi}\Big(\frac{m(1-\varepsilon_1)}{n}\Big) + \frac{1}{2}\,\mathrm{Poi}\Big(\frac{m(1-\varepsilon_1)}{n}\Big)\otimes\mathrm{Poi}\Big(\frac{m(1+\varepsilon_1)}{n}\Big).$$
From Claim B.7 and Claim B.5, we know that it suffices to show that the KL-divergences between $H_0$ and $H_1$ are at most $O\big(\frac{\varepsilon^2\delta^2 m^2\log^2 n}{n^2}\big)$. We focus on the argument for $\mathrm{KL}(H_0 \| H_1)$, as the one for $\mathrm{KL}(H_1 \| H_0)$ is symmetric. By the definition of $H_0$ and $H_1$, we have that
$$\mathrm{KL}(H_0 \| H_1) \le \mathrm{KL}\Big(\mathrm{Poi}\Big(\frac{m(1+\varepsilon_0)}{n}\Big)\,\Big\|\,\mathrm{Poi}\Big(\frac{m(1+\varepsilon_1)}{n}\Big)\Big) + \mathrm{KL}\Big(\mathrm{Poi}\Big(\frac{m(1-\varepsilon_0)}{n}\Big)\,\Big\|\,\mathrm{Poi}\Big(\frac{m(1-\varepsilon_1)}{n}\Big)\Big),$$
where in the first step we use Claim B.5 and in the last step we use Claim B.6. Let $\lambda_1 > \lambda_2 > 0$. Note that we have the following closed form for the KL divergence between a pair of Poisson distributions:
$$\mathrm{KL}\big(\mathrm{Poi}(\lambda_1)\,\|\,\mathrm{Poi}(\lambda_2)\big) = \lambda_1\log\frac{\lambda_1}{\lambda_2} + \lambda_2 - \lambda_1 \le \lambda_1\,\frac{\lambda_1 - \lambda_2}{\lambda_2} + \lambda_2 - \lambda_1 = \frac{(\lambda_1 - \lambda_2)^2}{\lambda_2},$$
where in the inequality we use the bound $\log x \le x - 1$ for all $x > 0$. Hence, it follows that
$$\mathrm{KL}(H_0 \| H_1) \le O(1)\,\frac{\big(\frac{m}{n}(\varepsilon_1 - \varepsilon_0)\big)^2}{\frac{m}{n}(1 + \varepsilon_1)} + O(1)\,\frac{\big(\frac{m}{n}(\varepsilon_1 - \varepsilon_0)\big)^2}{\frac{m}{n}(1 - \varepsilon_1)} \le O\Big(\frac{m\,\delta^2}{n}\Big).$$
But notice that this is at most $O\big(\frac{\varepsilon^2\delta^2 m^2\log^2 n}{n^2}\big)$ as long as $m \ge \Omega\big(\frac{n}{2\varepsilon^2\log^2 n}\big)$, which is our assumed parameter regime in this lemma.

By setting $\delta = |\varepsilon_1 - \varepsilon_0| = O(\varepsilon\rho)$, it follows from Lemmas B.1, B.3 and B.4 that $I(X : M_1, M_2) \le O\big(\frac{\varepsilon^2\delta^2 m^2\log^4 n}{n^2}\big) + o(1/n)$. Since the pairs $(M_i, M_{i+1})$ are conditionally independent and identically distributed given $X$, it follows that
$$I(X ; M_1, \ldots, M_n) \le \frac{n}{2}\,I(X ; M_1, M_2) \le O\Big(\frac{\varepsilon^2\delta^2 m^2\log^4 n}{n}\Big) + o(1).$$
This concludes the proof of Lemma 4.4.
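The closed form and the quadratic upper bound for the Poisson KL divergence used above are easy to confirm numerically; the sketch below is ours, and the series truncation is a harmless assumption for small means:

```python
# Numeric check of KL(Poi(l1) || Poi(l2)) = l1*log(l1/l2) + l2 - l1
# and of the bound (l1 - l2)^2 / l2 used above.
import math

def kl_poisson_numeric(l1, l2, cutoff=60):
    total = 0.0
    for k in range(cutoff):  # Poi(3.3) mass beyond k = 60 is negligible
        p = math.exp(-l1) * l1**k / math.factorial(k)
        q = math.exp(-l2) * l2**k / math.factorial(k)
        total += p * math.log(p / q)
    return total

l1, l2 = 3.3, 3.0
print(f"numeric     = {kl_poisson_numeric(l1, l2):.6f}")
print(f"closed form = {l1 * math.log(l1 / l2) + l2 - l1:.6f}")
print(f"bound       = {(l1 - l2)**2 / l2:.6f}")  # dominates the closed form
```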
B.2 Proof of Lemma 4.3

Proof. Let $T$ be a set of samples. In case one, $T$ is made up of $m$ samples from a random distribution from $\mathcal{M}_0$. In case two, $T$ is made up of $m$ samples from a random distribution from $\mathcal{M}_1$. Assume that there exists a tester $A$ that can distinguish between the two cases given $T$ with probability at least $0.9$. We will show that this implies $m = \Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ up to logarithmic factors.

Recall that Lemma 4.4 has the following setup. Let $X$ be an unbiased binary random variable. If $X = 0$, $S$ consists of $\mathrm{Poi}(2m)$ samples from a random distribution from $\mathcal{M}_0$. If $X = 1$, $S$ consists of $\mathrm{Poi}(2m)$ samples from a random distribution from $\mathcal{M}_1$. Then we can use $A$ to predict $X$ given $S$ in the following way: if $S$ contains more than $m$ samples, we feed $A$ the first $m$ samples and take its output; otherwise, we declare failure. By the concentration property of Poisson distributions, $S$ contains more than $m$ samples with high constant probability. Hence, the above routine correctly predicts $X$ with probability at least $0.6$. This implies that the mutual information between $X$ and $S$ is at least $\Omega(1)$. However, Lemma 4.4 (with $\delta = O(\varepsilon\rho)$) says that the mutual information is at most $O\big(\frac{\varepsilon^4\rho^2 m^2\log^4 n}{n}\big) + o(1)$. This therefore shows that $\frac{\varepsilon^4\rho^2 m^2\log^4 n}{n} = \Omega(1)$, which further implies that $m = \Omega\big(\sqrt{n}\,\log^{-2}(n)\,\varepsilon^{-2}\rho^{-1}\big)$. This concludes the proof of Lemma 4.3.

B.3 Proof of Proposition 4.1

Proof. Let $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$. For the sake of contradiction, assume that $|\mathrm{Acc}_m(\varepsilon_0) - \mathrm{Acc}_m(\varepsilon_1)| > 0.1$ for some deterministic symmetric tester $A(\cdot\,; r)$. Let $\mathcal{M}_0, \mathcal{M}_1$ be the local swap families of $p(\varepsilon_0)$ and $p(\varepsilon_1)$ respectively. We show that this could be used to distinguish between a random distribution from $\mathcal{M}_0$ and a random distribution from $\mathcal{M}_1$, which would then contradict Lemma 4.3. In particular, since the output of $A(\cdot\,; r)$ is invariant under domain relabeling, the acceptance probability of $A(\cdot\,; r)$ when it takes samples from a random distribution from $\mathcal{M}_0$ must be exactly equal to $\mathrm{Acc}_m(\varepsilon_0)$ (and similarly for $\mathcal{M}_1$). This immediately implies a tester that takes $m$ samples and can distinguish between a random distribution from $\mathcal{M}_0$ and a random distribution from $\mathcal{M}_1$ with probability at least $1/2 + c$ for some small constant $c$. This further implies that, if we take $100mc^{-1}$ many samples, we can boost the success probability to $0.99$. Yet, since $100mc^{-1} = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$, this clearly contradicts Lemma 4.3, and hence concludes the proof of Proposition 4.1.

B.4 Proof of Theorem 1.6

Proof. The lower bound $\Omega(\varepsilon^{-2}\rho^{-2})$ follows from the fact that replicably testing whether an unknown coin has bias $1/2$ or $1/2 + \varepsilon$ requires $\Omega(\varepsilon^{-2}\rho^{-2})$ many samples, as shown in [30]. In the rest of the proof, we therefore focus on establishing the lower bound $\Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$.

Let $p(\xi)$ be the distribution instance defined in Equation (4). Let $A$ be a $\rho$-replicable uniformity tester that takes $m$ samples. We denote by $r$ the random string representing the internal randomness of $A$. Following the framework of [30], we will fix some random string $r$ such that (i) $A(\cdot\,; r)$ accepts the uniform distribution with probability at least $1 - O(\rho)$; (ii) $A(\cdot\,; r)$ rejects the distribution $p(\varepsilon)$ with probability at least $1 - O(\rho)$; (iii) $A(\cdot\,; r)$ is replicable with probability at least $1 - O(\rho)$ against $p(\xi)$ for $\xi$ sampled uniformly from $[0, \varepsilon]$. By the correctness guarantees of $A$, a large constant fraction of random strings $r$ must satisfy (i) and (ii). By the replicability requirement of $A$, a large constant fraction of random strings $r$ must satisfy (iii). By the union bound, there must exist some random string $r$ that satisfies (i), (ii) and (iii) at the same time.

Define the acceptance probability function $\mathrm{Acc}_m(\xi) = \Pr_{S \sim p(\xi)^{\otimes m}}[A(S; r) = 1]$. Note that $\mathrm{Acc}_m(\xi)$ is a continuous function of $\xi$, since the acceptance probability can be expressed as a polynomial in $\xi$. Moreover, it holds that $\mathrm{Acc}_m(0) \ge 1 - O(\rho)$ and $\mathrm{Acc}_m(\varepsilon) \le O(\rho)$. Therefore, there must exist some $\xi^*$ such that $\mathrm{Acc}_m(\xi^*) = 1/2$.

Assume for the sake of contradiction that $m = o(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$. By Proposition 4.1, it follows that $\mathrm{Acc}_m(\xi) \in [\mathrm{Acc}_m(\xi^*) - 0.1,\ \mathrm{Acc}_m(\xi^*) + 0.1] = [0.4, 0.6]$ for any $\xi$ such that $|\xi - \xi^*| \le \rho\varepsilon$. In other words, if we sample $\xi$ randomly from $[0, \varepsilon]$, whenever $\xi$ falls into the interval $[\xi^* - \rho\varepsilon,\ \xi^* + \rho\varepsilon]$, the algorithm fails to be replicable with constant probability. It is not hard to see that this interval has mass $\Omega(\rho)$ under the uniform distribution over $[0, \varepsilon]$. This would then imply that the tester is not $O(\rho)$-replicable, contradicting property (iii). This shows that $m = \Omega(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$, and hence concludes the proof of Theorem 1.6.

C Further Discussion and Omitted Proofs

In this section we give additional discussion and omitted proofs.

C.1 Barriers for χ²-Statistics

We show a similar barrier for χ²-test statistics. These statistics are used in several uniformity testing algorithms, including [14, 1, 47, 20]. The χ² test statistic (of [1], for example) is
$$\sum_{i=1}^{n}\frac{(X_i - m/n)^2 - X_i}{m/n},$$
where the algorithm takes $\mathrm{Poi}(m)$ samples and $X_i$ is the frequency of the $i$-th element among the samples. For the uniform distribution, the test statistic is expected to be at most $m\varepsilon^2/500$, while if the distribution is $\varepsilon$-far from uniform the test statistic is at least $m\varepsilon^2/5$ in expectation, leading to an expectation gap of around $m\varepsilon^2$ (Lemma 2 of [1]). Suppose a tester compares the test statistic with a random threshold sampled from the interval $[m\varepsilon^2/500,\ m\varepsilon^2/5]$.
For the tester to be $O(\rho)$-replicable, we require that the test statistic deviates by no more than $O(\rho m\varepsilon^2)$ with high constant probability; in other words, the variance must be at most $O(\rho^2 m^2\varepsilon^4)$. As before, we fix a constant $\varepsilon > 0$ so that we can ignore the dependence of the sample complexity on $\varepsilon$, and consider a hard instance given by a distribution with a single heavy element $i$ of probability $p_i = n^{-1/2}$. In particular, we will have $X_i \sim \mathrm{Poi}(m p_i) = \mathrm{Poi}(m/\sqrt{n})$. Since $m/\sqrt{n} \ge 1$, there exists some small constant $c$ such that
$$\Pr\Big[X_i > \frac{m}{\sqrt{n}} + c\,\frac{\sqrt{m}}{n^{1/4}}\Big] > c, \qquad \Pr\Big[X_i < \frac{m}{\sqrt{n}} - c\,\frac{\sqrt{m}}{n^{1/4}}\Big] > c.$$
However, in the two cases above, the contributions to the χ²-test statistic from $X_i$ differ by
$$\frac{\Big(\frac{m}{\sqrt{n}} + c\,\frac{\sqrt{m}}{n^{1/4}} - \frac{m}{n}\Big)^2 - \Big(\frac{m}{\sqrt{n}} - c\,\frac{\sqrt{m}}{n^{1/4}} - \frac{m}{n}\Big)^2}{m/n} \;\ge\; \Omega\Big(\frac{m^{3/2}/n^{3/4}}{m/n}\Big) \;=\; \Omega\big(\sqrt{m}\,n^{1/4}\big).$$
Following our previous discussion, for the tester to be $O(\rho)$-replicable we therefore require $\sqrt{m}\,n^{1/4} \le O(\rho m)$, i.e., $m \ge \Omega(\sqrt{n}\,\rho^{-2})$.
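The fluctuation scale in this argument can be observed directly in simulation. The sketch below is our illustration: the binomial sampler stands in for $\mathrm{Poi}(m/\sqrt{n})$, and the parameters are arbitrary. It measures the spread of the heavy element's contribution to the χ²-statistic:

```python
# Simulation of the chi-square barrier (our sketch): with a single heavy
# element of mass 1/sqrt(n), its contribution to sum_i ((X_i - m/n)^2 - X_i)/(m/n)
# fluctuates on the order of sqrt(m) * n^(1/4).
import math
import random
import statistics

def heavy_contribution(n: int, m: int) -> float:
    # X ~ Bin(m, 1/sqrt(n)), which approximates Poi(m/sqrt(n)) here.
    x = sum(random.random() < 1 / math.sqrt(n) for _ in range(m))
    lam = m / n
    return ((x - lam) ** 2 - x) / lam

n, m = 10_000, 2_000
spread = statistics.stdev(heavy_contribution(n, m) for _ in range(500))
print(f"empirical std ~ {spread:.0f}, sqrt(m)*n^(1/4) = {math.sqrt(m) * n**0.25:.0f}")
# Both values are of the same order (hundreds), far exceeding an
# O(rho * m * eps^2) replicability budget unless m >> sqrt(n) / rho^2.
```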
C.2 Discussion of Lower Bounds against Asymmetric Algorithms

For non-replicable uniformity testing algorithms, we can typically assume symmetry without loss of generality when analyzing the sample complexity. A common reduction that turns an asymmetric algorithm into a symmetric one is the following: the algorithm applies all permutations to the samples observed and outputs the majority answer. However, it is not clear that such a transformation preserves replicability. Below we discuss in more detail the barrier hit by our lower bound argument for asymmetric testers.

Suppose we want to prove a lower bound against general algorithms. The known techniques for lower bounds against replicable algorithms first fix a random seed $r$ such that $A(\cdot\,; r)$ is both correct and replicable with high probability given a distribution drawn from the adversarial instance (in our lower bound, the instance draws distributions $p(\xi)$ uniformly from $\xi \in [0, \varepsilon]$; also note that in our lower bound we use the local swap family to restrict the adversary to a subset of permutations, whereas in this discussion we allow a more generic adversary that could permute the domain arbitrarily). Given a fixed random seed, we now have a deterministic algorithm $A(\cdot\,; r)$ that accepts $p(0)$ and rejects any permutation of $p(\varepsilon)$. By continuity, for any fixed permutation $\pi$, there is some $\xi_\pi$ such that $A(\cdot\,; r)$ accepts with probability $1/2$ given samples drawn from $p(\xi_\pi)$ permuted according to $\pi$. It remains to argue that algorithms with low sample complexity $m \ll \sqrt{n}\,\rho^{-1}\varepsilon^{-2}$ cannot successfully distinguish nearby distributions $p(\xi_\pi), p(\xi)$ (both permuted according to $\pi$) for any $\xi \in (\xi_\pi \pm \rho\varepsilon)$, so that the acceptance probability of $A(\cdot\,; r)$ must be close to $1/2$ in this region and therefore $A(\cdot\,; r)$ fails to be replicable with probability $\Omega(\rho)$. However, our arguments in Appendix B only show that the families $p(\xi_\pi), p(\xi)$ are indistinguishable when permuted by a random permutation $\pi'$, not necessarily when permuted by the fixed permutation $\pi$. In fact, our proof only allows us to argue that for any $\xi \in [0, \varepsilon]$, the families are indistinguishable when permuted by a constant fraction of permutations $\pi'$. However, the permutations for which the families $p(\xi'), p(\xi' + \rho\varepsilon)$ are indistinguishable may not be the permutations for which $\xi_\pi, \xi$ are close.

C.3 Replicable Identity Testing

The formal definition of replicable identity testing is given below.

Definition C.1 (Replicable Identity Testing). Let $n \in \mathbb{Z}^+$ and $\varepsilon, \rho \in (0, 1/2)$. A randomized algorithm $A$, given sample access to some distribution $p$ on $[n]$, is said to solve $(n, \varepsilon, \rho)$-replicable identity testing if $A$ is $\rho$-replicable and it satisfies the following for every fixed distribution $q$: (1) if $p = q$, $A$ should accept with probability at least $1 - \rho$ (as in the case of uniformity testing, we do not focus on the dependence of the sample complexity on the error parameter); (2) if $p$ is $\varepsilon$-far from $q$ in TV distance, $A$ should reject with probability at least $1 - \rho$.

[27] gives the following reduction between uniformity and identity testing.

Theorem C.2 ([27], restated). Let $q$ be a distribution over $[n]$ with a known explicit description. There is a (randomized) algorithm $T$ such that, given $m$ independent samples from a distribution $p$ on $[n]$ and $\varepsilon > 0$, it generates $m$ independent samples from a distribution $p'$ on $[6n]$ such that: (1) if $p = q$, then $p'$ is uniform on $[6n]$; (2) if $p$ is $\varepsilon$-far from $q$ in total variation distance, then $p'$ is $\varepsilon/3$-far from uniform in total variation distance. Furthermore, this algorithm runs in $O(m)$ time.

If we want to design a replicable identity tester, we can simply transform the samples using $T$ from Theorem C.2 (note that we do not even need to share the randomness of $T$ across different runs) and then feed the transformed dataset to the replicable uniformity tester from Theorem 1.3. The correctness guarantees follow immediately. Let $p'$ be the output distribution of $T$. Since in both runs the replicable uniformity tester takes samples from $p'$, replicability of the process follows from Theorem 1.3. Lastly, it is not hard to see that the number of samples consumed is asymptotically the same as that of the uniformity tester.
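The composition described above is a one-liner once the two ingredients exist. In the sketch below, `transform` and `uniformity_tester` are assumed interfaces standing in for the reduction $T$ of Theorem C.2 and the tester of Theorem 1.3 respectively; we do not reimplement either here.

```python
# Sketch of the identity-to-uniformity composition (assumed interfaces only).
from typing import Callable, List

def replicable_identity_tester(
    samples: List[int],                                  # m samples from p on [n]
    transform: Callable[[List[int]], List[int]],         # T from Theorem C.2: -> samples on [6n]
    uniformity_tester: Callable[[List[int], int], str],  # tester from Theorem 1.3
    n: int,
) -> str:
    # Fresh randomness inside `transform` is fine across runs: replicability
    # only requires that both runs of the uniformity tester see i.i.d. samples
    # from the same transformed distribution p'.
    return uniformity_tester(transform(samples), 6 * n)
```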
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Yes, the claimed algorithms and lower bounds are proven in the paper. Furthermore, we qualify the limitation of our lower bound (it holds only against a specific but broad class of algorithms) in the abstract and introduction.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The primary limitation of our work is that the lower bound holds only for symmetric algorithms. While symmetric algorithms form a natural class and encompass all known algorithms for uniformity testing, it would be nice to have a lower bound that holds in general. We discuss this in our paper.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Yes, all claims are proven either in the main body or the appendix. For results not proven in the main body, we provide a proof sketch and leave details to the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments requiring code.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Yes, the research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: While the paper focuses on foundational research, replicable uniformity testing is aimed at building an algorithmic framework for replicability in scientific analysis, helping build an efficient procedure to verify that experimental procedures are followed correctly and building public trust in science. The authors are not aware of any potential negative impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.