# testing_with_nonidentically_distributed_samples__49dda7f5.pdf

Published in Transactions on Machine Learning Research (11/2025)

Testing with Non-identically Distributed Samples

Shivam Garg shivam.garg55@gmail.com Microsoft Research

Chirag Pabbaraju cpabbara@stanford.edu Stanford University

Kirankumar Shiragur kshiragur@microsoft.com Microsoft Research

Gregory Valiant gvaliant@stanford.edu Stanford University

Reviewed on Open Review: https: // openreview. net/ forum? id= FUzvztz Bl W

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size k, p1, p2, . . . , p T , and we obtain c independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, pavg. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with c = 1 samples from each distribution, Θ(k/ε2) samples are necessary and sufficient to learn pavg to within error ε in ℓ1 distance. To test uniformity or identity distinguishing the case that pavg is equal to some reference distribution, versus has ℓ1 distance at least ε from the reference distribution, we show that a linear number of samples in k is necessary given c = 1 samples from each distribution. In contrast, for c 2, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that O(

k/ε2 + 1/ε4) total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where ε k 1/4. Additionally, we show that in the c = 2 case, there is a constant ρ > 0 such that even in the linear regime with ρk samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same pi) can perform uniformity testing. We further extend our techniques to the problem of testing closeness of two distributions: given c = 3 independent draws from each of p1, p2, . . . , p T and q1, q2, . . . , q T , one can distinguish the case that pavg = qavg versus having ℓ1 distance at least ε using O(k2/3/ε8/3) total samples, where k is an upper bound on the support size, matching the optimal sample complexity of the i.i.d. setting up to the ε-dependence.

1 Introduction

The problem of estimating statistics of an unknown distribution, or determining whether the distribution in question possesses a property of interest has been studied by the Statistics community for well over a century. Interest in these problems from the TCS community was sparked by the seminal paper of Goldreich (Goldreich & Ron, 2000), demonstrating the surprising result that such tasks might be accomplished given a sample size that scales sublinearly with the support of the underlying distribution. Over the subsequent 20+ years, the lay of the land of sublinear sample property testing and estimation has largely been worked out: for a number of natural properties of distributions (or pairs of distributions), testing or estimation can be

Published in Transactions on Machine Learning Research (11/2025)

accomplished with a sample size that is sublinear in the support size of the distributions in question, and in most settings, the optimal sample complexities are now known to constant or subconstant factors.

More recently, these questions have been revisited with an eye towards relaxing the assumption that samples are drawn i.i.d. from the distribution in question. The majority of such departures focused on relaxing the independence assumption; this includes work from the robust statistics perspective where some portion of samples may be adversarially corrupted, as well as other models of dependent sampling.

In this work, we consider the implications of relaxing the more innocuous-seeming assumption of identical samples. Given samples drawn independently from a collection of distributions, p1, . . . , p T , there are a number of different questions one can ask, including estimating properties of how the various distributions are related. Here, however, we focus on the practically motivated question of testing or estimating properties of the average distribution pavg = 1

T P pi. To what extent is sublinear sample testing or estimation still possible in this setting, and do existing estimators suffice? Below, we describe several concrete practical settings that motivate these questions.

Example 1. The Federated Setting: Consider the setting where there are a number of users/ individuals, each with a corresponding distribution (over online purchases, language, biometrics, etc.) and the goal is to test whether the average distribution is close to some reference. There might be significant heterogeneity between users, and the number of samples (i.e., amount of data) collected from each user might be very modest, due to privacy, bandwidth, or power considerations. Analogous settings arise when compiling training datasets for LLMs, as there is significant heterogeneity between documents, websites, or clusters of websites, and lightweight tools for testing properties of the average/aggregate distribution are useful.

Example 2. Chronological Heterogeneity: Consider a setting where samples are collected chronologically, and the underlying distribution varies over time. Consider a wearable technology that periodically samples biometric statistics, or language spoken. The goal may be to test whether the average distribution is close to some reference (with the goal of providing an early warning of illness in the case of biometrics, or for early detection of dementia in the case of sampling words spoken). Ideally, this test can be successfully accomplished with far less data than would be required to learn. There is significant heterogeneity across time but samples collected close together can be regarded as being sampled from the same distribution, pi.

Example 3. Spatially Segregated Sampling: Consider a setting where the goal is to sample species (of animals, bacteria, etc.) and test whether the overall distribution is close to a reference distribution. In many such settings, the distribution of species has significant spatial variation the distribution of soil bacteria in a grassy spot is quite different from the distribution in dense forest, similarly, the distribution of wildlife observed by trail-cameras in national parks varies according to the location of the cameras. Again, the goal is to accomplish this statistical test with less data than would be required to learn the underlying average distribution.

In some of these motivating settings, one may assume some structure in how the various distributions relate to each other for example in chronological settings, for most indices i, pi is similar to pi 1. Nevertheless, it is still worth understanding whether such assumptions must be leveraged. Indeed, for learning the average distribution pavg to within total variation distance ε, the sample complexity is the same as in the i.i.d. case, O(k/ε2) where k is the support size, without any assumptions on the pis. At the highest level, our results demonstrate that as soon as one draws at least c = 2 samples from each distribution, this non-identically distributed setting recovers much of the power of sublinear-sample property testing. The caveat is that one must design new algorithms for this setting that are cognizant of which distribution each sample was drawn from; as we show, testers that return a function of the multiset of examples do not suffice.

1.1 Summary of Results

First, we establish that in the case that we draw c = 1 samples from each distribution p1, . . . , p T , one can learn the average distribution with the same sample complexity as in the i.i.d. setting. Despite this, with c = 1 sample from each distribution, sublinear-sample uniformity or identity testing is impossible:

Claim 1.1 (Learning the distribution, proof in Appendix A). Given access to T distributions, p1, . . . , p T , each supported on a common domain of size k, for any ε > 0, given c = 1 samples drawn from each

Published in Transactions on Machine Learning Research (11/2025)

pi, one can output a distribution pavg such that with probability at least 2/3, d TV(pavg, pavg) ε, provided T = O(k/ε2). Furthermore, Ω(k/ε2) samples are necessary, even if p1 = . . . = p T = pavg for such a guarantee.

Claim 1.2 (Impossibility result with c = 1, proof in Appendix B). Given access to T k/2 distributions, p1, . . . , p T , and c = 1 sample drawn from each distribution, no tester can distinguish the case that the average distribution, pavg, is the uniform distribution of support k, versus has total variation distance at least 1/4 from Unif(k), with probability of success greater than 2/3.

The above impossibility of sublinear-sample testing in the case of c = 1 sample from each distribution follows from the following simple construction, which also provides intuition behind the positive results for c 2 : Consider the instance where T = k/2, and each pi is a point mass on a distinct element, versus the instance where each pi is a uniform distribution over two distinct domain elements that are disjoint from the supports of pj for j = i. In both instances, given c = 1 sample from each distribution, one will observe T = k/2 distinct elements each observed exactly once. Despite this, in the second instance, pavg is uniform over support k, whereas in the former the average distribution has total variation distance 1/4 from Unif(k).

Our first main result is that with just c = 2 samples from each distribution, we recover the full strength of sublinear sample uniformity (and identity) testing. The key intuition is that with two samples from each distribution, we can count both the collision statistics within distributions, as well as the collisions across distributions, and build an unbiased estimator for pavg 2 2. We then leverage techniques from Diakonikolas et al. (2016) and adapt them to this setting of non-identical samples to bound the variance of our estimator to get the following sublinear sample complexity for uniformity testing.

Theorem 1.3 (Uniformity testing, proof in Section 3). There is an absolute constant, α, such that given access to T distributions, p1, . . . , p T , each supported on a common domain of size k, for any ε > 0, provided T α(

k/ε2 + 1/ε4) and given c = 2 samples drawn from each pi, there exists a testing algorithm that succeeds with probability 2/3 and distinguishes whether,

pavg = Unif(k) versus d TV(pavg, Unif(k)) ε .

Note that when ε k 1/4, our sample complexity for uniformity testing simplifies to O(

k/ε2), which matches the optimal sample complexity for testing uniformity in the i.i.d. setting. Since the setting of nonidentical samples is a generalization of the i.i.d. setting, the lower bound of Ω(

k/ε2) for testing uniformity in the i.i.d. setting also holds in our setting; hence we get optimal sample complexity in the regime where ε k 1/4.

Our results also hold for identity testing, through standard reduction techniques. In particular, we can adapt the reduction from identity testing to uniformity testing of Goldreich (2016) to our setting with non-identical distributions, yielding the following result on the reduction from an instance of identity testing to an instance of uniformity testing.

Lemma 1.4 (Identity to uniformity testing, proof in Section 4). Given an arbitrary reference distribution q supported on k elements, there exists a pair (Φq, Ψq) that maps distributions and samples over [k] to distributions and samples over [4k], and satisfies the following,

For a sequence of distributions q1, . . . , q T such that avg(q1, . . . , q T ) = q, the map Φq satisfies,

avg(Φq(q1), . . . , Φq(q T )) = Unif(4k).

For two sequences of distributions p1 1, . . . , p1 T and p2 1, . . . , p2 T , the map Φq satisfies,

d TV(avg(Φq(p1 1), . . . , Φq(p1 T )), avg(Φq(p2 1), . . . , Φq(p2 T ))) 1

4 d TV(avg(p1, . . . , p1 T ), avg(p2, . . . , p2 T )) .

The above result combined with the result for uniformity testing, immediately yields the following result for identity testing. Our bounds for identity testing are optimal in the worst case for large values of ε k 1/4.

Published in Transactions on Machine Learning Research (11/2025)

Corollary 1.5 (Identity testing). There is an absolute constant, α, such that given a reference distribution q supported on k elements and c = 2 samples from each of T distributions, p1, . . . , p T , there exists a testing algorithm that succeeds with probability 2/3 and distinguishes whether,

pavg = q versus d TV(pavg, q) ε ,

provided T α(

k/ε2 + 1/ε4).

We note that our sublinear sample testing algorithms leverage the knowledge of which distribution each sample was drawn from. As the following theorem asserts, this is necessary there is a constant ρ > 0 such that even in the linear regime with ρk samples, there is no testing algorithm that simply returns a function of the multiset of samples, throwing away the information about which samples were drawn from the same distributions.

Theorem 1.6 (Lower bound for pooled estimators, proof in Section 5). There is an absolute constant, ρ > 0, such that given access to distributions p1, . . . , pρk supported on a domain of size k, given c = 2 samples from each of p1, . . . , pρk, no tester that returns a function of the multiset of samples (i.e., ignoring the information about which samples were drawn from the same distributions) can distinguish the case

pavg = Unif(k) versus d TV(pavg, Unif(k)) 0.01

with success probability at least 2/3.

Our positive results, and the intuition behind the uniformity testing algorithm also naturally extend to testing properties of pairs of distributions. In particular, we consider the closeness testing problem in this non-i.i.d. setting: given two sequences of distributions supported on a set of size k, p1, . . . , p T and q1, . . . , q T and c i.i.d. draws from each of the distributions, how many total samples are required to distinguish between pavg = qavg versus d TV(pavg, qavg) > ε? In this setting, we show that, given c = 3 samples from each distribution, we recover the optimal sample complexity of O(k2/3) from the i.i.d. setting, up to the ε dependence (Batu et al., 2013; Chan et al., 2014). While our proof leverages c = 3 samples to simplify some of the analysis, we strongly believe that an identical result holds for c = 2 examples from each distribution.

Theorem 1.7 (Closeness testing, proof in Appendix G). There is an absolute constant, α, such that for two sequences of distributions, supported on a common domain of size k, p1, . . . , p T , and q1, . . . , q T , given c = 3 samples from each of the distributions, there exists a testing algorithm that succeeds with probability 2/3 and distinguishes whether pavg = qavg versus d TV(pavg, qavg) ε,

provided T α(k2/3/ε8/3).

Finally, we note that the interesting phase shift between c = 1 and c > 1 holds only in the non-Poisson setting. Results in the Poisson setting remain the same as that of the i.i.d. setting. In particular, if instead of c samples, we are given Poi(c) samples from each distribution, then even in the case of c = 1, we recover all testing results known in the i.i.d. setting. This follows from the following basic fact: receiving Poi(c) samples from every distribution is equivalent in distribution to receiving Poi(c T) samples from the average distribution. A more formal version of this result is stated in Appendix C for identity testing, but such a result also holds for testing other properties. To see how this bypasses the obstruction discussed in Claim 1.2, note that with Poi(1) samples per distribution, in the point-mass case, each distribution yields repeated draws of its single element, whereas when the distributions have support on two distinct elements, within-group collisions start to occur.

Proving our upper bounds mostly involves adapting the techniques from the i.i.d. case to our setting. Establishing the lower bound for pooled samples is more complex, involving a new construction that might be of independent interest. Nonetheless, more than the techniques, our main contribution is the problem definition and conceptual takeaways. Commonly, when collecting samples, heterogeneity invariably arises, particularly when these samples are gathered over various time points, geographical locations, or from different users, as highlighted in the examples above. Thus, it is natural to ask whether sublinear sample testers are robust to such departures from the i.i.d. setting. We demonstrate that while existing testers are not

Published in Transactions on Machine Learning Research (11/2025)

directly applicable (as evidenced by our lower bound for pooled samples), they can be effectively adapted, provided one collects multiple samples per distribution and leverages the source information of samples. For example, for identity testing, a standard estimator involves counting collisions, treating all collisions similarly. In contrast, our modified estimator uses a weighted collision count, with weights varying based on whether collisions occur within samples from the same distribution or not.

1.2 Future Directions

There are many possible directions for future work. From a technical standpoint, a concrete goal is to match the i.i.d. ε-dependence for uniformity and closeness. For uniformity, the gap comes from variance-reducing terms in the i.i.d. analysis that are not available when distributions differ (see Remark 5). For closeness, the current ε 8/3 dependence comes from the classical collision analysis (Batu et al., 2013); matching the i.i.d. ε 4/3 will likely require a new analysis beyond collisions (e.g., adapting the counts-based statistic of Chan et al. (2014) to the non-i.i.d. setting). It is not clear to us whether that optimal ε-dependence is achievable here.

Another direction is improving the number of samples per distribution. For closeness, we currently use c = 3 to keep the collision-based ℓ2 estimate independent of the heavy/light estimation; with a more careful reuse of the same samples and explicit dependency tracking, c = 2 may suffice. For uniformity, we show that c = 1 is impossible in the worst case; it is natural to ask for structural conditions under which c = 1 becomes feasible. If all pi equal pavg, the i.i.d. guarantees apply; more generally, if each pi is sufficiently close to pavg, we expect i.i.d. rates to persist with a small additional failure probability. As a simple check, suppose each pi is within δ/T TV distance of pavg. Then the joint distribution of the T samples is within TV distance δ of T i.i.d. samples from pavg. Hence any optimal i.i.d. tester applies with an extra failure probability δ. It would be interesting to study whether larger deviations beyond this naive bound can be tolerated.

Finally, beyond the specific problems addressed in this work, there are many obvious directions, including mapping the sample complexities for other natural property testing and estimation problems within the framework of non-identically distributed samples. Studying properties of the average distribution is a natural starting point, though other summary statistics of a collection of distributions are also worth considering. For example, in chronological settings, one might be interested in how certain properties vary over time, or predicting properties of future distributions. More broadly, probing the robustness of sublinear sample testing from other perspectives may yield further surprises with practical implications.

2 Related Work

Interest from the theoretical computer science community on distribution testing, given i.i.d. sample access, was sparked by the problem of distinguishing whether an unknown probability distribution was the uniform distribution over support k, versus has total variation distance at least ε from this uniform distribution. The early work of Goldreich & Ron (2000) proposed a collision-based estimator, demonstrating that testing this property could be accomplished given significantly fewer samples than would be required to learn the distribution in question. Since then, a line of work eventually pinned down the exact sample complexity of Θ(

k/ε2) optimal to constant factors including the dependence on the probability of the tester failing (Paninski, 2008; Valiant & Valiant, 2017; Diakonikolas et al., 2014; 2016; 2017). For a comprehensive exposition of this line of work, we refer the reader to Chapters 2 and 3 in Canonne (2022).

A closely related problem to uniformity testing is testing identity, where the task is to determine if an unknown probability distribution, p, is equal to versus far from a fixed reference distribution, q, given i.i.d sample access (Batu et al., 2001). While instance-optimal estimators estimators that are optimal for every q are known (Valiant & Valiant, 2017), among reference distributions supported on k elements, the uniform distribution is the most difficult to test. Beyond this, building off Diakonikolas & Kane (2016), Goldreich (2016) gave a reduction from identity testing to uniformity testing, showing that any uniformity testing algorithm can be leveraged in a black-box fashion for identity testing. We leverage both the high-level design of the optimal uniformity testing algorithms, and this reduction.

Published in Transactions on Machine Learning Research (11/2025)

For the problem of closeness testing, the goal is still to distinguish p = q from the case that these two distributions have distance at least ε, though now q is not a fixed reference distribution, and instead we are given samples from both distributions. In the i.i.d. sample setting, for distributions of support size k, a sample complexity of O(k2/3 log k/ϵ8/3) was established in Batu et al. (2001; 2013), with the optimal sample complexity of Θ(max(k2/3/ϵ4/3, k1/2/ϵ2)) later pinned down in Chan et al. (2014).

Beyond uniformity, identity, and closeness testing, there is an enormous body of work on testing or estimating other distribution properties or properties of pairs of distributions. Broadly speaking, the optimal sample complexities for these tasks are now pinned down to constant factors, at least in the setting where one obtains i.i.d. samples

Beyond i.i.d. samples: The questions of learning, property testing and estimation have also been considered in various settings that deviate from the idealized setting of samples drawn i.i.d. from a fixed distribution. These include works from the perspective of robust statistics, where some portion of the samples may be adversarially corrupted (see e.g., Diakonikolas et al. (2019); Diakonikolas & Kane (2019); Charikar et al. (2017), as well as the broad area of time-series analysis where observations are drawn from time-varying distributions (e.g., Chernoff & Zacks (1964); Kolar et al. (2010); Lampert (2015)).

Most similar to the present paper are works that explore learning and testing questions in various settings where there are a number of data sources, each providing a batch of samples. In Qiao & Valiant (2018) and the followup papers Jain & Orlitsky (2020); Chen et al. (2020), it is assumed that an unknown 1 ε fraction of data sources correspond to identical distributions, each providing a batch of c i.i.d. samples, and no assumptions are made about the samples in the remaining batches. The main results here are that as c increases, the batches can be leveraged for learning: specifically, the distribution of the 1 ε fraction can be learned to accuracy O(ε/ c). The papers Tian et al. (2017); Vinayak et al. (2019) explore the setting where batches of size c are drawn i.i.d. from a collection of heterogeneous distributions, and the goal is to learn the multiset of distributions. The punchline here is that the multiset of distributions can be learned given asymptotically fewer draws from each distribution than would be necessary to learn them in isolation. While the focus in these works is on the setting where the distributions have support size 2 (i.e. the setting corresponds to flipping coins with heterogeneous probabilities), the results apply more broadly. The paper of Levi et al. (2013) considered the setting of a collection of T distributions over large support size, and explored the question of testing whether all distributions in the collection are the same, versus have significant variation. They considered two sampling models a query model where one can adaptively choose which distribution to draw a sample from, and a weaker model where each sample is drawn from a distribution selected uniformly at random from the T distributions. We note that in this latter model, testing properties of the average distribution trivially reduces to the i.i.d. setting. Later works by Diakonikolas & Kane (2016); Aliakbarpour et al. (2016) improve the results obtained by Levi et al. (2013). Finally, the paper by Aliakbarpour & Silwal (2019) considers the setting with a collection of distributions, and given one sample (in expectation) from each distribution, the task is to distinguish whether all the distributions are equal to a reference distribution, or all distributions are far from it. In contrast, our focus is on distinguishing whether the average distribution is equal to or far from a reference distribution. They also consider the case where we only have sample access to the reference distribution. General non-vacuous results are not possible in their setting given a single sample from each distribution. To bypass this, they impose a structural condition on the collection of distributions. We also face a similar issue and show that it can be bypassed in our setting by drawing multiple samples from each distribution.

3 Uniformity testing from non-identical samples

In this section, we prove our sample complexity results for uniformity testing in the setting of non-identical samples. Our tester for uniformity in the setting of non-identical samples is based on collisions and constructs an unbiased estimator for pavg 2 2. As pavg 2 2 equals 1

k when pavg is uniform and is larger than (1+ε2) 1

k when pavg is ε-far from uniform, we use this separation to solve the testing problem. The main part of the proof constitutes bounding the variance of the estimator that we construct in order to show that it concentrates around its mean. We adapt the proof from Diakonikolas et al. (2016) to the setting of non-identical samples to bound the variance of our estimator.

Published in Transactions on Machine Learning Research (11/2025)

To test uniformity, we build an unbiased estimator for pavg 2 2. In the following, we provide a description of this estimator. For each t [T], let (Xt(1), Xt(2)) denote the 2 i.i.d. samples drawn from the distribution pt. For each t [T], we define,

Yt = 1[Xt(1) = Xt(2)] and Yst = 1[Xt(1) = Xs(1)] .

Note that E[Yt] = pt 2 2 and E[Yst] = pt, ps for all s, t [T]. Using these random variables, we define an estimator Z as follows,

t [T ] Yt + 2 X

s<t Yst . (1)

Z is an unbiased estimator of pavg 2 2. In the following we formally state this result and bound the variance of the estimator.

Lemma 4. The estimator Z defined in Equation (1) satisfies the following:

E[Z] = pavg 2 2 and Var[Z] 4

Proof of Lemma 4. The expectation of the random variable Z is given by

t [T ] pt 2 2 + 2 X

s<t pt, ps = pavg 2 2 .

Therefore Z is an unbiased estimator for pavg 2 2. In the remainder of the proof, we bound the variance of this estimator Z.

Var[Z] = E h Z E[Z] 2i = 1

t [T ] (Yt E[Yt]) + 2 X

s<t (Yst E[Yst]) 2i .

ˆYt = Yt E[Yt] and ˆYst = Yst E[Yst],

and note that E[ ˆYt] = 0 and E[ ˆYst] = 0. Rewriting the variance of Z in terms of these new random variables we get the following,

t [T ] ˆYt + 2 X

s<t ˆYst 2i 2

t [T ] ˆYt 2 + 4 X

s<t ˆYst 2i

t [T ] ˆYt 2i + 4 E h X

s<t ˆYst 2i (2)

In the second inequality, we used (a + b)2 2(a2 + b2). In the final inequality, we used the linearity of expectation. To bound the variance of Z, we bound the terms in the final expression separately. We start with the first term.

t [T ] ˆYt 2i = E h X

t [T ] ˆY 2 t + 2 X

s<t ˆYt ˆYs i = X

t [T ] E[ ˆY 2 t ] + 2 X

s<t E[ ˆYt]E[ ˆYs] = X

t [T ] E[ ˆY 2 t ]

t [T ] (E[Y 2 t ] E[Yt]2) X

t [T ] E[Y 2 t ] = X

t [T ] E[Yt] = X

t [T ] pt 2 2. (3)

In the first equality, we expanded the expression. In the second equality, we used the fact that the random variables Ys, Yt are independent for all s, t [T] and s = t. In the third equality, we used E[ ˆYt] = 0 for all t [T]. The fourth and fifth inequalities are immediate. The sixth equality follows because Yt is an indicator

Published in Transactions on Machine Learning Research (11/2025)

random variable. In the seventh inequality, we substituted the value of E[Yt] = pt 2 2. Similar to the above, we now bound the second term of Equation (2).

s<t ˆYst 2i = E h X

s<t Yst i!2

s<t E Y 2 st

| {z } T 2 terms

s<t a<b |{s,t,a,b}|=3

6 T 3 terms

s<t a<b |{s,t,a,b}|=4

6 T 4 terms

s<t Yst i!2

| {z } T 2 2 terms

The first term above is equal to P

s<t pt, ps we can club this together with the summands in the summation in the last term that are of the form E[Yst]2. For the third term, observe that if the indices s, t, a, b are all distinct, the random variables Yst and Yab are independent, and hence E[Yst Yab] = E[Yst]E[Yab]. But all these summands are also included in the summation in the last term, and hence will get canceled out. Finally, we are left with the second term. Observe that for any 3 fixed distinct values, there are 6 possible orderings of the indices s < t and a < b such that the multiset {s, t, a, b} has exactly these distinct values. Further, for each of these orderings, the random variable Yst Yab is non-zero if and only if the first sample at the three distinct indices is the same. Thus, we have that,

s<t a<b |{s,t,a,b}|=3

E[Yst Yab] = 6 X

ℓ=1 pa(ℓ)pb(ℓ)pc(ℓ) := 6 X

a<b<c pa, pb, pc . (5)

Putting the above together with the remaining terms corresponding to |{s, t, a, b}| = 3 in the last summation, we get

s<t ˆYst 2i = X

s<t pt, ps (1 pt, ps ) + 6 X

a<b<c pa, pb, pc

a<b<c [ pa, pb pa, pc + pa, pb pb, pc + pa, pc pb, pc ] (6)

s<t pt, ps + 6 X

a<b<c pa, pb, pc (7)

s<t pt, ps + 6 X

t [T ] pt 3

Substituting the above inequality, along with Equation (3) into Equation (2), we get the following bound for the variance of Z,

t [T ] pt 2 2 + 4 X

s<t pt, ps + 24 X

t [T ] pt 3

t [T ] pt 2

t [T ] pt 3

We can now use the variance bound from the above lemma along with Chebyshev s inequality to prove Theorem 1.3,

Proof of Theorem 1.3. We divide the proof into two cases.

Published in Transactions on Machine Learning Research (11/2025)

Uniform case: In this case, we are promised that the pavg = uk. Therefore, it is immediate that E[Z] = pavg 2 2 = 1

k and pavg 3 3 = 1 k2 , which further gives us that,

Pr h Z 1 + ε2

i = Pr h Z 1 + ε2

E[Z] i 9 ε4E[Z]2 Var[Z] = 9k2

ε4 Var[Z] ,

ε4T 2 + 432

In the first inequality, we used that E[Z] = 1/k. In the second inequality, we used the Chebyshev s inequality. In the third and fourth inequality, we substituted the values of E[Z] and Var[Z] respectively; to upper bound the variance we used bounds from Equation (8). In the final inequality, we simplified the expression. Note that when T 2600

ε4 , the second term above is smaller than 1/6, and for T 216

k ε2 , the first term is smaller

than 1/6. Therefore, we have that, for T = max 2600

k ε2 = O max

Pr h Z 1 + ε2

Therefore, Z is smaller than (1 + ε2/3) 1

k with probability at least 2/3 in the uniform case, that is, when pavg = uk.

Far from uniform case: In this case, we are promised that, pavg uk 1 ε, which further implies

E[Z] = pavg 2 2 = 1+α2

k(1 + ε2), where α2 def = k pavg uk 2 2 ε2. In the following, we upper bound the probability that Z takes a value smaller than (1 + ε2/3) 1

Pr h Z 1 + ε2

i = Pr h Z 1 + ε2

i = Pr h Z 1 + ε2

= Pr h Z 1 α2 ε2

E[Z] i Pr h Z 1 α2 α2

= Pr h Z 1 2α2

E[Z] i 9(1 + α2)2

4α4E[Z]2 Var[Z]

= 9(1 + α2)2

4α4 pavg 4 2 Var[Z] 9(1 + α2)2

T 2α4 pavg 2 2 + 108(1 + α2)2

= 9k(1 + α2)

T 2α4 + 108k2

In the first equality, we multiplied and divided by the term (1 + α2). In the second equality, we used the

definition of α. In the third equality, we used 1+ ε2

3 1+α2 = 1 α2 ε2

3 1+α2 . In the fourth inequality, we used α2 ε2

and in the fifth equality, we simplified the expression. In the sixth inequality, we used Chebyshev s inequality. In the seventh and eighth inequality, we used E[Z] = pavg 2 2 and substituted the bound on the variance of Z from Equation (8). In the final expression, we used the definition of α. To upper bound the above probability term, we next bound each of the terms in the final expression separately. The first term above is bounded by,

T 2α4 9k(1 + ε2)

T 2ε4 . (10)

In the first inequality, we used the fact that for x > 0, the function 1+x

x2 is decreasing in x and also α2 ε2. The final inequality follows because ε 1. To upper bound the second term in Equation (9), we first note that,

pavg 3 3 = pavg uk + uk 3 3 =

i=1 ((pavg(i) uk(i)) + uk(i))3

Published in Transactions on Machine Learning Research (11/2025)

h (pavg(i) uk(i))3 + uk(i)3 + 3(pavg(i) uk(i))uk(i)pavg(i) i

= pavg uk 3 3 + 1

i=1 pavg(i)2 1

= pavg uk 3 3 + 1

k pavg uk 2 2

pavg uk 3 2 + 3

k pavg uk 2 2 + 1

The first five equalities follow from simple algebraic manipulation. The sixth inequality holds because 3 2. In the final equality, we used the definition of α. Thus, the second term is bounded as,

In the final inequality, we used α ε. Substituting Equation (10) and Equation (11) in Equation (9), we get the following bound on the probability,

Pr h Z 1 + ε2

T 2ε4 + 108

T 2ε4 + 432

k Tε2 + 108

Note that for T 2600

k ε2 , the sum of the first two terms is smaller than 1/6, and for T 648

ε4 , the last term

is smaller than 1/6. Thus, we have that for T = O max

Pr h Z 1 + ε2

Therefore, Z is larger than (1 + ε2/3) 1

k with probability at least 2/3 in the far from uniform case, that is

when pavg uk 1 ε. In conclusion, for T = O

ε4 , we can test uniformity, that is we can separate

pavg = uk from pavg uk 1 ε, for any ε > 0. We conclude the proof.

Remark 5 (Optimal sample complexity). We can compare the bound on the variance in Equation (8) to the similar bound in the tight analysis (e.g., Equation (2.9), Theorem 2.1 in Canonne (2022)) of the standard i.i.d setting. It would appear that we are missing a term of the form 1

T pavg 4 2 which helps reduce the variance. In fact, if we had a term of this form, we would be able to shed the O(1/ε4) factor to obtain the optimal sample complexity of O(

k/ε2) even in the non-i.i.d setting. The main hurdle in the analysis above for getting this crucial variance-reducing term can be seen in Equation (6). In the i.i.d setting with one distribution p, the terms of the form pa, pb pa, pc are all the same and equal p 4 2 (which is at least 1 k2 ). But in our setting, these terms can even be 0 (if pa, pb, pc have disjoint supports), and hence, we cannot conveniently lower bound the contribution of these terms in reducing the variance. That is why, for lack of a better bound, we had to drop all these 6 T 3 many! terms in going from Equation (6) to Equation (7), which ultimately results in the stated sample complexity.

4 Identity testing from non-identical samples

We now prove our result which reduces the problem of identity testing to uniformity testing. Such a reduction for the i.i.d. setting is already known (Goldreich, 2016) (see also (Canonne, 2022, Section 2.2.3)), and we adapt the same reduction for the setting of non-identical samples. In particular, we use the same reduction functions from that of the i.i.d case and show that they work even in the case of a sequence of changing distributions. A more formal statement of this reduction, along with its proof for a sequence of distributions is provided below.

Published in Transactions on Machine Learning Research (11/2025)

Lemma 1.4. Given an arbitrary reference distribution q supported on k elements, there exists a pair (Φq, Ψq) that maps distributions and samples over [k] to distributions and samples over [4k], and satisfies the following,

For a sequence of distributions q1, . . . , q T such that avg(q1, . . . , q T ) = q, the map Φq satisfies,

avg(Φq(q1), . . . , Φq(q T )) = Unif(4k).

For two sequences of distributions p1 1, . . . , p1 T and p2 1, . . . , p2 T , the map Φq satisfies,

d TV(avg(Φq(p1 1), . . . , Φq(p1 T )), avg(Φq(p2 1), . . . , Φq(p2 T ))) 1

4 d TV(avg(p1, . . . , p1 T ), avg(p2, . . . , p2 T )) .

Proof. We divide the definition of our maps (Φq, Ψp) into three parts:

Φq def = Φ1 q Φ2 q Φ3 q .

Ψq def = Ψ1 q Ψ2 q Ψ3 q .

We first define the map Φq, which we divide into three parts Φ3 q, Φ2 q and Φ1 q. The map Φ3 q : k k is defined as follows,

Φ3 q(p) = 1

Let k = 4k and r = Φ3 q(q) = 1

2Unif(k). The map Φ2 q : k k+1 is defined as follows,

Φ2 q(p)(i) =

k r(i) p(i) for all i [k]

j [k] k r(j)

k r(j) p(j) for i = k + 1 . (12)

Let s = Φ2 q(r) = Φ2 q(Φ3 q(q)) and define ki = k s(i) for i [k + 1]. Let S1, . . . , Sk+1 be any partition of [k ] such that |Si| = ki for all i [k + 1]. Then the map Φ1 q : k+1 4k is defined as follows,

Φ1 q(p)(i) = 1

s(j) 1(i Sj) .

Using these definitions, we now prove the two claims stated in the theorem.

Claim 1: For any t [T], let,

xt = Φ3 q(qt), yt = Φ2 q(xt) and zt = Φ1 q(yt) .

In the following, we show that,

avg(Φq(q1), . . . , Φq(q T ))(i) = avg(z1, . . . , zt) = u4k .

We prove the above statement entry-wise, and we consider ith entry of the average distribution. We divide the proof into two cases. Let i be such that i Sj for some j [k], then note that,

avg(Φq(q1), . . . , Φq(q T ))(i) = 1

t [T ] zt(i) = 1

s(j ) 1(i Sj ),

s(j) = 1 Tk X

k r(j) xt(j)

k r(j) xt(j)

Published in Transactions on Machine Learning Research (11/2025)

1 2qt(j) + 1

2Unif(k)(j) r(j)

1 2q(j) + 1

2Unif(k)(j) r(j)

In the second equality, we substituted the value zt = Φ1 q(yt). In the third equality, we used the assumption that i Sj. In the fourth and fifth inequality, we substituted the values of yt and s respectively. We simplified the expression in the sixth equality and substituted the value of xt in the seventh equality. In the eighth equality, we used the definition of q = avg(q1, . . . , q T ). In the final equality, we used the definition of r.

Let i Sk+1, then note that,

avg(Φq(q1), . . . , Φq(q T ))(i) = 1

t [T ] zt(i) = 1

s(j ) 1(i Sj ),

s(k + 1) = 1 Tk X

j [k] k r(j)

k r(j) xt(j)

= 1 Tk 1 s(k + 1)

k r(j) xt(j)

= 1 Tk 1 s(k + 1) T

k 1 s(k + 1)

2Unif(k)(j))

k 1 s(k + 1)

k 1 s(k + 1) s(k + 1)

In the second equality, we substituted the value zt = Φ1 q(yt). In the third equality, we used the assumption that i Sk+1. In the fourth equality, we substituted the values of yt and simplified the expression in the fifth and sixth equality. In the seventh equality, substituted the value of xt and used the definition of q = avg(q1, . . . , q T ). In the eighth and ninth equality, we used the definition of r and s respectively.

Combining the above two derivations, we have that, for all i [k + 1],

avg(Φq(q1), . . . , Φq(q T ))(i) = 1

Therefore, avg(Φq(q1), . . . , Φq(q T )) = u4k .

Claim 2: For any t [T], let,

xa t = Φ3 q(pa t ), ya t = Φ2 q(xa t ) and za t = Φ1 q(ya t ) for a {1, 2} .

In the remainder of the proof, we show that

d TV(avg(z1 1, . . . , z1 T ), avg(z2 1, . . . , z2 T ) = d TV(avg(Φ(p1 1), . . . , Φ(p1 T )), avg(Φ(p2 1), . . . , Φ(p1 T ))

Published in Transactions on Machine Learning Research (11/2025)

4d TV(avg(p1 1, . . . , p1 T ), avg(p2 1, . . . , p2 T ) .

It is immediate that,

d TV(avg(z1 1, . . . , z1 T ), avg(z2 1, . . . , z2 T ) = d TV(avg(y1 1, . . . , y1 T ), avg(y2 1, . . . , y2 T )) .

Furthermore, note that for i [k],

avg(ya 1, . . . , ya T )(i) = 1

k r(i) xa t (i) = k r(i)

2pa avg(i) + 1

2Unif(k)(i))

avg(ya 1, . . . , ya T )(k + 1) = 1

k r(j) xa t (j)

2pa avg(j) + 1

2Unif(k)(j)) .

In the following we bound d TV(avg(y1 1, . . . , y1 T ), avg(y2 1, . . . , y2 T )), which in turn bounds the quantity d TV(avg(z1 1, . . . , z1 T ), avg(z2 1, . . . , z2 T ).

d TV(avg(y1 1, . . . , y1 T ), avg(y2 1, . . . , y2 T ))

2p1 avg(i) + 1

2Unif(k)(i)) k r(i)

2p2 avg(i) + 1

2Unif(k)(i)) ,

2p1 avg(j) + 1

2Unif(k)(j))

2p2 avg(j) + 1

2Unif(k)(j))

p1 avg(i) p2 avg(i) + X

p1 avg(j) p2 avg(j) ,

p1 avg(i) p2 avg(i) 1

p1 avg(i) p2 avg(i) .

In the first equality, we substituted the values of avg(ya 1, . . . , ya T ) for a {1, 2} that were computed earlier. In the second and third equality, we simplified the expression. The last inequality follows because r(i) 1 2k, k r(i) 2 and k r(i) k r(i) 1. Putting it all together, we have that,

d TV(avg(Φ(p1 1), . . . , Φ(p1 T )), avg(Φ(p2 1), . . . , Φ(p1 T ))

4d TV(avg(p1 1, . . . , p1 T ), avg(p2 1, . . . , p2 T ) ,

and we conclude the proof for this case.

Sampling: In the remainder of the proof, we state the maps Ψ3, Ψ2, Ψ1 which help us generate the samples from the mapped distributions. The definitions of these sampling maps are pretty straightforward and we state them below to conclude the proof.

Ψ3 q : Given i [k], return i with probability 1/2 and a uniformly random element of [k] otherwise.

Ψ2 q : Given i [k], return i with probability k r(i)

k r(i) and k + 1 otherwise.

Ψ1 q : Given i [k + 1], return a uniformly random element from Si.

Published in Transactions on Machine Learning Research (11/2025)

5 Lower bound for pooling-based estimators

In this section, we prove the indistinguishability result for pooling-based estimators given in Theorem 1.6. First, we observe that uniformity of the average distribution is a symmetric property, and thus, labels of the samples do not matter for a testing algorithm. Thus, without loss of generality, we can assume that the testing algorithm takes as input the pooled multiset of samples and operates only on the fingerprint of the pooled multiset (refer to Appendix E for a formal discussion on this). Here, the fingerprint is the count of elements that occur exactly zero times, once, twice etc. In what follows, we will construct two sequences of distributions: one whose average is the uniform distribution and another whose average is far from uniform. However, we show that the total variation distance between the distributions of their corresponding fingerprints is small, implying impossibility of uniformity testing with pooled samples. Below, we describe the key proof steps:

Step 1: We construct two sequences of distributions a and b (Definition 8) such that avg(a) is uniform and avg(b) is far from uniform (Claim 5.1). Moreover, these sequences are chosen in such a way that when we draw 2 samples from each distribution in the sequence, no element can be observed more than 4 times in the pooled multiset. The building blocks needed for defining these sequences of distributions are introduced in Lemma 6.

Step 2: Next, we show that when we draw 2 samples from each distribution in the sequences, the fingerprints of the pooled multisets have the same mean (Claim 5.2), and similar covariances (Claim 5.3). A minor detail here is that we show this for the collision counts instead of fingerprints, but this is not an issue, as one can use an invertible linear transform to relate the two (Appendix D).

Step 3: We want to eventually show that the distributions of fingerprints are close, and hence cannot be distinguished. For this, we will first show that the distributions of these fingerprints are close, in total variation distance, to the multivariate Gaussian distributions Ga and Gb of corresponding mean and covariance, that have been discretized by rounding all probability mass to the nearest point in the integer lattice (Lemma 11). We show this via a CLT (Lemma 10). One slight complication in proving this CLT is that in our setting, the fingerprints no longer correspond to a generalized multinomial distribution, and hence we cannot directly leverage multivariate CLTs for such distributions (e.g., Valiant & Valiant (2011)). Instead, we will show that the distribution of fingerprints is represented as a sum of independent samples, supported on the all-zero vector, the basis vectors, and the vector [2, 0, 0, . . .]. Our proof essentially splits each distribution into the component supported on the zero vector and the basis vectors, and a component supported on the zero vector and the [2, 0, 0, . . .] vector. The portion supported on the basis vectors is a generalized multinomial, and hence is close to the corresponding discretized Gaussian by the CLT in Valiant & Valiant (2011), and the remaining portion is simply a binomial, scaled by a factor of 2. We then argue that the convolution of these distributions is close to a discretized Gaussian.

Step 4: Having shown that the distributions of fingerprints are close to corresponding discretized multivariate Gaussian distributions, we are left with the tractable task of showing that the TV distance between these two (discretized) Gaussians is small. This is shown in Lemma 12 and relies on certain technical results proved in Lemma 16 and Lemma 17. Here, we use the fact that the two collision-count distributions have the same mean and similar covariances as discussed in Step 2 above. Finally, a triangle inequality yields the desired result (Lemma 13).

We now proceed towards the proof details of the above steps. We start by defining some distributions which are the building blocks for our hard instance. We will tile up these distributions over disjoint supports to form the overall hard instance

Lemma 6 (Building block). There exist distributions p1, p2, q1, q2, r1, r2, s1, s2, each supported on m elements, satisfying the following:

(1) avg(p1, p2) = avg(r1, r2) = Unif(m).

Published in Transactions on Machine Learning Research (11/2025)

(2) d TV(avg(p1, p2), avg(q1, q2)) 1 16

2 and d TV(avg(r1, r2), avg(s1, s2)) 1 16

(3) Let FP(p) = [FP(p)2, FP(p)3, FP(p)4] be the random variable representing the tuple of counts of elements observed exactly 2 times, 3 times and 4 times respectively, when we draw c = 2 samples i.i.d each from p1 and p2 (i.e. we obtain a total 4 samples). Note that FP(p) is supported on [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [2, 0, 0]. Similarly, define FP(q), FP(r), FP(s). Furthermore, let Cp = [cp 2, cp 3, cp 4] be the 3-dimensional vector of 2-way, 3-way and 4-way collisions observed when we draw c = 2 samples i.i.d each from p1 and p2

1. Similarly define Cq, Cr, Cs. There exists an α (0, 1) such that,

E[αCp + (1 α)Cr] = E[αCq + (1 α)Cs].

E[αFP(p) + (1 α)FP(r)]

= E[αFP(q) + (1 α)FP(s)] .

Proof. First, we describe the distributions p1, p2, r1, r2. Let p1 = p2 m be the uniform distribution on m elements. Let r1 m be such that it assigns mass 2 m each on the first m

2 elements of [m]. Let r2 m be such that it assigns mass 2 m each on the last m

2 elements of [m]. This is illustrated more clearly in Figure 1a. It is clear that avg(p1, p2) = avg(r1, r2) = Unif(m), and hence part (1) of the lemma holds.

Next, we describe the distributions q1, q2, s1, s2. Set ε = 1

2. Let q1 m be such that q1 has a mass 1+ε

m each on the first m

2 elements, and a mass 1 ε

m each on the remaining m

2 elements. Next, consider q2 m defined as follows: Of the first m

2 elements, q2 has mass 1 ε

m each on the first 7

8 th fraction, and a mass 1+ε

m each on the remaining 1

8 th fraction. Similarly, of the latter m

2 elements, q2 has a mass 1+ε

m each on the first

7 8 th fraction, and a mass 1 ε

m each on the remaining 1

8 th fraction. Further, we set s1 = q1 and s2 = q2. This is illustrated more clearly in Figure 1b.

We can readily see that avg(p1, p2) avg(q1, q2) 1 = avg(r1, r2) avg(s1, s2) 1 = ε

8. Recalling that ε = 1

2, part (2) of the lemma holds.

Now, we prove the last part. Fix β = 7

8. Let cq 2, cq 3, cq 4 denote the number of 2-way, 3-way and 4-way collisions respectively when we draw 2 samples from q1 and 2 samples from q2.

E[cq 2] = 2

+ 4 βm 1 + ε

E[cq 3] = 4

E[cq 4] = (1 β)m

Since q1 = s1 and q2 = s2, for any α (0, 1),

cq 2 cq 3 cq 4

cs 2 cs 3 cs 4

αcq 2 + (1 α)cs 2 αcq 3 + (1 α)cs 3 αcq 4 + (1 α)cs 4

1m-way collisions also formally defined in Appendix D.

Published in Transactions on Machine Learning Research (11/2025)

(a) Average is uniform

(b) Average far from uniform

Figure 1: Building block for the hard instance defined in Definition 8.

Now, set α = 3

4. Say we toss a coin with bias α. If we observe heads, we will draw two samples each from p1 and p2. If we observe tails, we will draw two samples each from r1 and r2. We can calculate the expected number of 2/3/4-way collisions in the 4 samples obtained via this process.

E[αcp 2 + (1 α)cr 2] = α 4 2

m + (1 α) 2

E[αcp 3 + (1 α)cr 3] = 4α

m2 = 3 m2 .

E[αcp 4 + (1 α)cr 4] = α

m3 = 3 4m3 .

These are precisely the expressions for the expectations in Equation (13). Finally, the collision counts are related linearly to the fingerprint (Lemma 14), and this completes the proof of the part (3).

Remark 7. Note that as we draw only 4 samples from the building block given above, the number of elements that occur exactly zero times and exactly once are a deterministic function of the number of elements that occur exactly 2,3 and 4 times. Furthermore, since we will tile up the above instance over disjoint supports of m elements, no element will ever occur more than 4 times. Thus, we restrict our attention only to the fingerprint of elements appearing 2,3 and 4 times.

Now, we propose two sequences of distributions a = (a1, , a T ) and b = (b1, , b T ) such that avg(a) is uniform and avg(b) is far from uniform.

Definition 8. Divide the support [k] into n disjoint blocks, each of size m, so that k = mn. Each block corresponds to the support of 2 distributions in a sequence, so that the total number of distributions T = 2n. Assume that k is large enough and m is a large constant independent of k. k being large enough and m being a constant ensures that T = Θ(k) is also large enough.

Published in Transactions on Machine Learning Research (11/2025)

First, we specify each block in the first αn many blocks (where α = 3

4 and n will be some multiple of 4) fix the ith such block of size m, where 1 i αn. Let pi 1, pi 2, qi 1, qi 2 correspond to distributions p1, p2, q1, q2 respectively (as defined in Lemma 6), supported on this block. For the first distribution supported on this block, that is for t = 2i 1, we will set a2i 1 = pi 1, b2i 1 = qi 1. For the second distribution supported on this block, that is for t = 2i, we will set a2i = pi 2, b2i = qi 2.

Now, we specify each block in the remaining (1 α)n many blocks fix the ith such block of size m, where αn+1 i n. Let ri 1, ri 2, si 1, si 2 correspond to distributions r1, r2, s1, s2 respectively (as defined in Lemma 6), supported on this block. For the first distribution supported on this block, that is for t = 2i 1, we will set a2i 1 = ri 1, b2i 1 = si 1. For the second distribution supported on this block, that is for t = 2i, we will set a2i = ri 2, b2i = si 2.

These sequences a and b are the entry-point to Step 1 in the proof steps described above.

Claim 5.1. Let a and b be the sequences constructed in Definition 8. avg(a) is uniform on [k] = [mn] and avg(b) is far from uniform with d TV(Unif(k), avg(b)) = 1 16

Proof. Observe that from Lemma 6, in every block 1 i n, avg(a2i 1, a2i) = Unif(m). Furthermore, since all blocks have disjoint supports of size m, we conclude that avg(a) = avg(a1, . . . , a2n) = umn = Unif(k).

Also, from Lemma 6, in every block 1 i n, avg(a2i 1, a2i) avg(b2i 1, b2i) 1 = 1 8

2. This immediately gives us avg(a1, a2n) avg(b1, b2n) 1 = 1 8

2. Thus, d TV(avg(a), avg(b)) = d TV(Unif(k), avg(b)) = 1 16

Next, we proceed to Step 2 in the outline above, which shows that the first two moments of the collision vectors are close.

Claim 5.2 (Means match). Let Ca and Cb be the 3-dimensional vectors of 2-way, 3-way and 4-way collisions observed for the sequences a and b respectively over all the n blocks. Then,

E [Ca] = E [Cb] .

Proof. Let Cai be the 3-dimensional vector of 2-way, 3-way and 4-way collisions observed for a in the ith

block (i.e. 2 samples each from a2i 1 and a2i). Similarly, define Cbi. Then, the total number of 2-way, 3-way and 4-way collision across all n blocks will be P i [n] Cai and P i [n] Cbi. Now, recall that exactly αn many blocks are of one kind, where each Cai = Cp (respectively, Cbi = Cq) independently, and the remaining (1 α)n many blocks are of the second kind, where each Cai = Cr (respectively, Cbi = Cs) independently. Thus we have that

= n(E[αCp + (1 α)Cr])

= n(E[αCq + (1 α)Cs]) (from Lemma 6)

Claim 5.3 (Covariances match approximately). Let Ca and Cb be the 3-dimensional vectors of 2-way, 3-way and 4-way collisions observed for the sequences a and b respectively over all the n blocks. Then the covariance matrices are as follows:

γ11/m γ12/m2 γ13/m3

γ12/m2 γ22/m2 γ23/m3

γ13/m3 γ23/m3 γ33/m3

α11/m2 α12/m3 α13/m4

α12/m3 α22/m3 α23/m4

α13/m4 α23/m4 α33/m4

Published in Transactions on Machine Learning Research (11/2025)

γ11/m γ12/m2 γ13/m3

γ12/m2 γ22/m2 γ23/m3

γ13/m3 γ23/m3 γ33/m3

α 11/m2 α 12/m3 α 13/m4

α 12/m3 α 22/m3 α 23/m4

α 13/m4 α 23/m4 α 33/m4

with |αij|, |α ij|, |γij| bounded by constants independent of m (for m large enough).

Proof. As in Claim 5.2, we have that Ca = P i [n] Cai and Cb = P i [n] Cbi, where Cai is either Cp or Cr, and Cbi is either Cq or Cs. Further, since the samples in each block are independent, the covariances add up. Thus, we only need to compute the covariance in one block, which we recall supports two distributions and contributes 4 samples. Suppressing subscripts for now, we have that

E[c2 2] E[c2]2 E[c2c3] E[c2]E[c3] E[c2c4] E[c2]E[c4] E[c2c3] E[c2]E[c3] E[c2 3] E[c3]2 E[c3c4] E[c3]E[c4] E[c2c4] E[c2]E[c4] E[c3c4] E[c3]E[c4] E[c2 4] E[c4]2

Let Yab be the indicator that the elements at indices a and b collide. Note that 1 a, b 4, since we have 4 samples in each block. Then,

E[c2 2] = E

a<b Y 2 ab + 6 X

a<b<c Yabc + 6 X

a<b<c<d Yab Ycd

+ 6 E[Y12] E[Y34]

= E[c2] + 6 E[c3] + 6 Pr[first two samples collide] Pr[second two samples collide].

In a similar manner, we can obtain the following expressions for the remaining terms:

E[c3 2] = E[c3] + 12 E[c4]

E[c2 4] = E[c4] E[c2c3] = 3 E[c3] + 12 E[c4]

E[c2c4] = 6 E[c4]

E[c3c4] = 4 E[c4].

Thus, we have expressed all the entries in the covariance matrices in terms of just the expected number of 2-way, 3-way and 4-way collisions. We have already computed explicit expressions for these for each of p, q, r and s in the proof of Lemma 6. Plugging these in, we obtain

6 m 12 m2 6 m3 12 m2 4 m2 4 m3 6 m3 4 m3 1 m3

m3 12 m3 16

4 m 0 0 0 0 0 0 0 0

8 m2 0 0 0 0 0 0 0 0

Σ[Cq] = Σ[Cs] =

11 2m 9 m2 9 2m3 9 m2 3 m2 3 m3 9 2m3 3 m3 3 4m3

2m3 9 m3 9 m4 9 4m5 33

8m4 9 4m5 9 16m6

Finally, recall that

= αnΣ[Cp] + (1 α)nΣ[Cr]

Published in Transactions on Machine Learning Research (11/2025)

= αnΣ[Cq] + (1 α)nΣ[Cs]

4, and hence, substituting the values above, we get

11 2m 9 m2 9 2m3 9 m2 3 m2 3 m3 9 2m3 3 m3 3 4m3

11 2m 9 m2 9 2m3 9 m2 3 m2 3 m3 9 2m3 3 m3 3 4m3

2m3 9 m3 9 m4 9 4m5 33

8m4 9 4m5 9 16m6

For the terms 9/m3 12/m4 and 9/m3 9/m4, we can write them as (9 12/m)

m3 and 9 9/m

m3 respectively. Thus for m large enough, we can write all the terms in the matrices in the form specified in the claim, completing the proof.

We now prove a CLT (Step 3) which helps us approximate the distributions of the fingerprints by multivariate Gaussians having the same mean and covariance.

Lemma 9. Consider a c-dimensional distribution (for c 10) supported on [0, 0, . . . , 0], the c-dimensional basis vectors, and [2, 0, . . . , 0], such that the masses on the non-origin support points are fixed positive constants, and the mass on origin [0, 0, . . . , 0] is at least 1 10 16. Then for ℓlarge enough, the distribution of the sum of ℓi.i.d samples from this distribution has total variation distance at most 0.002 from the Gaussian with corresponding mean and covariance, whose mass has been discretized to the nearest point in the integer lattice.

Proof. Let u0 = Pr([0, 0, . . . , 0]), and u1 = Pr([1, 0, . . . , 0]), u2 = Pr([0, 1, . . . , 0]), . . . , uc = Pr([0, 0, . . . , 1]) and v = Pr([2, 0, . . . , 0]). We draw a sample from this distribution as follows: first flip a fair coin, and if the coin lands heads, sample from the distribution over [0, 0, . . . , 0] and the basis vectors, given by respective probabilities 1 2 Pc i=1 ui, 2u1, 2u2, . . . , 2uc. If the coin lands tails, we sample from the distribution over [0, 0, . . . , 0], [2, 0, . . . , 0] with respective probability 1 2v, 2v. Note that this sampling procedure yields a sample from the original distribution. Here, we are using that v, Pc i=1 ui are both at most 10 16 for the probabilities to be well-defined.

Given ℓindependent samples sampled from the distribution in this way, let us consider the distribution of the sum of the samples, conditioned on exactly t of the ℓfair coins having landed heads. Note that for large ℓ, with probability at least 0.999, t [ℓ/2 4

Thus for all t in this range, if we can show that the sum of ℓi.i.d. samples has TV distance at most ϵ from the desired discretized Gaussian, this would imply a total variation distance of at most 0.001 + ϵ without conditioning on t. Thus we would focus on the distribution conditioned on t taking a value in [ℓ/2 4

Conditioned on a value of t, the distribution corresponds to the convolution of t independent draws from the multinomial distribution given by the distribution corresponding to the coin landing heads, and the distribution corresponding to the ℓ t samples from the distribution supported on [0, 0, . . . , 0] and [2, 0, . . . , 0], which is a binomial random variable in the first coordinate, scaled by a factor of 2.

Let us first reason about the t independent draws from the multinomial distribution. Leveraging a multivariate CLT, the distribution of the sum of the t draws from the multinomial will have total variation distance at most O (log t)2/3

σ1/3 from the multivariate Gaussian of corresponding mean and covariance, discretized to the

nearest point in the integer lattice ( (Valiant & Valiant, 2011, Theorem 4); see Section F.2 for a restatement). Here, σ2 denotes the smallest eigenvalue of the covariance of the sum vector of the t draws which will be Ω(t).

Published in Transactions on Machine Learning Research (11/2025)

To see this, observe that this sum vector has mean t [2u1, . . . , 2uc], and covariance

2u1(1 2u1) 4u1u2 4u1u3 . . . 4u1uc 4u1u2 2u2(1 2u2) 4u2u3 . . . 4u2uc ... . . . 4u1uc 4u2uc 4u3uc . . . 2uc(1 2uc)

The determinant of the matrix above, without the t scaling, is exactly

2u1 2u2 2uc

For constant c, and by the assumption that ui > 0 for all i and Pc i=1 ui 10 16, this determinant is a positive constant independent of ℓ. This implies that the smallest eigenvalue of this (positive semidefinite) matrix is at least (and at most) a positive constant. Together with the scaling of t, we get that σ2 = Ω(t), meaning that σ = Ω(

t). Thus, the total variation distance goes to 0 as t gets large. Consequently, we take ℓto be large enough so that for each t under consideration, this distance is at most 10 5.

Similarly instantiating the CLT on the sum vector of the ℓ t samples corresponding to the coin landing tails, which, recall is simply a scaled binomial random variable in the first coordinate (and zero elsewhere) yields that for large enough ℓ, this vector has TV distance at most 10 5 close to a Gaussian having the corresponding mean and covariance (discretized to the closest integer) in the first coordinate, and zero elsewhere. Also, note that the variance of first coordinate is O(ℓ) (for t in the assumed range) which is large for ℓlarge enough.

Thus, the sum of t vectors corresponding to coin landing heads and ℓ t vectors corresponding to coin landing tails has distance at most 2 10 5 from the convolution of the corresponding discretized Gaussians. Now we can apply Lemma 21 which gives us that the convolution of the discretized Gaussians here has distance at most 10 5 to the discretization of the convolution of two Gaussians (here we use the fact that the variance of the first coordinate of the vector corresponding to coin landing tails is large enough and all its other coordinates are zero). By a triangle inequality, we get that the sum of t vectors corresponding to coin landing heads and ℓ t vectors corresponding to coin landing tails has distance at most 3 10 5 from the discretized Gaussian of corresponding mean and covariance.

So far, we have shown that for any t [ℓ/2 4

ℓ], sum of t vectors corresponding to coin landing heads and ℓ t vectors corresponding to coin landing tails has distance at most 3 10 5 from the discretized Gaussian of corresponding mean and covariance. In Lemma 23, we show that all these Gaussians have TV distance at most 10 5 from the desired Gaussian (the Gaussian having mean and covariance same as mean and covariance of sum of ℓi.i.d. samples from our original distribution). Lemma 23 uses the fact that u0 is close to 1 and c is a small constant. This Lemma essentially involves showing that the difference between ith

coordinate of the means of the Gaussians is roughly ui

ℓwhereas the standard deviation is roughly ui

ℓ. Thus for u0 large enough, ui will be small, and the standard deviation will be much larger than the difference between the means.

Combining the above arguments, we get that sum of ℓi.i.d. samples from our distribution has TV distance at most 0.001 + 3 10 5 + 10 5 0.002 from the discretized Gaussian with corresponding mean and covariance.

Lemma 10. Consider two distributions supported on [0, 0, 0], the basis vectors, and [2, 0, 0] such that the proability of [0, 0, 0] is at least 1 10 16. Let for the first distribution, mass on all these vectors is guaranteed to be non-zero, and for the second distribution, mass on [0, 0, 0], [1, 0, 0] and [2, 0, 0] is guaranteed to be non-zero. Then for constant α (0, 1), and n large enough, the distribution of the sum of αn i.i.d samples from the first distribution and (1 α)n i.i.d samples from the second distribution has total variation distance at most 0.01 from a Gaussian with corresponding mean and covariance, where the mass has been discretized to the nearest point in the integer lattice.

Proof. Using Lemma 9, we know that the sum of αn i.i.d. samples from the first distribution and sum of (1 α)n i.i.d. samples from the second distribution has TV distance at most 0.002 from the corresponding

Published in Transactions on Machine Learning Research (11/2025)

discretized Gaussians. Here, for the first distribution we apply Lemma 9 with c = 3 and for second distribution, we use c 3 depending on whether it has non-zero mass on [0, 1, 0] and [0, 0, 1]. All that remains, is to prove that the convolution of two discretized Gaussians is close, in total variation distance, to the discretization of the convolution of the two Gaussians. This is true for any Gaussians in constant dimension such that the minimum eigenvalue of covariance of at least one of the Gaussians is super-constant (Lemma 18). In our case, the Gaussian corresponding to the sum of αn samples from the first distribution has minimum eigenvalue Θ(n). This is because the per-sample covariance for the first distribution Σ1 has det(Σ1) > 0 and Σ1 = O(1) (both independent of n), so with Σαn = αn Σ1 we get λmin(Σαn) αn det(Σ1)/ Σ1 2 = Θ(n) (see Lemma 22 for detailed proof). Taking n to be large enough, we get that convolution of the two discretized Gaussians has distance at most 10 5, to the discretization of the convolution of the two Gaussians. Applying a triangle inequality, we get that the sum of n i.i.d. samples produced as above has TV distance at most 0.004 + 10 5 0.01 from the corresponding discretized Gaussian.

Let FP(a) and FP(b) be 3 dimensional random variables containing the number of elements observed exactly 2, 3 and 4 times, for samples drawn from a and b respectively. Let Ga be a random variable corresponding to picking a sample from gaussian with same mean and covariance as FP(a) and rounding each coordinate to the nearest integer. Similarly, Gb corresponds to FP(b). Lemma 11 (Fingerprints close to discretized Gaussians). d TV(FP(a), Ga) 0.01 and d TV(FP(b), Gb) 0.01.

Proof. Recall that FP(a) and FP(b) are 3 dimensional random variables containing number of elements appearing exactly 2, 3 and 4 times in samples drawn according to a and b respectively. Due to the structure of a and b (see Definition 8), we can decompose FP(a) and FP(b) as

i=1 FP(pi) +

i=αn+1 FP(ri),

i=1 FP(qi) +

i=αn+1 FP(si).

Here, FP(pi) denotes the fingerprint vector when we obtain 2 samples from pi 1 and 2 samples from pi 2. Note that as we vary i, FP(pi) corresponds to independent samples from the same distribution. FP(qi), FP(ri) and FP(si) are defined analogously. Note that each of FP(pi), FP(qi), FP(ri) and FP(si) are supported on [0, 0, 0], the basis vectors and [2, 0, 0]. Here, FP(pi), FP(qi), FP(si) has non-zero mass on all of these vectors, and FP(ri) has non-zero mass on [0, 0, 0], [1, 0, 0], [2, 0, 0].

Also, note that FP(pi) = [0, 0, 0] corresponds to the event that when we draw 2 samples each from pi 1 and pi 2, the 4 samples so obtained are distinct. The probability of this event is equal to (1 1/m)(1 2/m)(1 3/m) which is at least 1 10 16 if we choose m to be large enough. Similarly, one can verify that for m large enough constant, for each of FP(qi), FP(ri) and FP(si), the probability of these random variables being [0, 0, 0] is at least 1 10 16.

By applying Lemma 10, we get d TV(FP(a), Ga) 0.01 and d TV(FP(b), Gb) 0.01, which completes the proof.

Finally, we conclude with the calculations required in Step 4 of the outline. Lemma 12 (Discretized Gaussians are close). d TV(Ga, Gb) 0.02.

Proof. Let G a be a random variable corresponding to picking a sample from gaussian with same mean and covariance as FP(a) (without any rounding). Similarly, G b corresponds to FP(b). By data processing inequality, rounding each coordinate can only decrease the total variation distance, d TV(Ga, Gb) d TV(G a, G b).

By Lemma 14, given 4 samples from a finite support, there exists an invertible linear map, say A, such that we can go from a vector of fingerprints to vector of collisions by applying A to the vector of fingerprints.

Published in Transactions on Machine Learning Research (11/2025)

Since we can decompose a and b into disjoint supports from each of which we are getting 4 samples, applying A also lets us go from vector of fingerprints to vector of collisions for samples obtained from a and b.

We apply A to random variables G a and G b. Since A is invertible, d TV(AG a, AG b) = d TV(G a, G b). Here AG a is a 3 dimensional Gaussian random variable with mean and covariance same as the mean and covariance of the vector containing number of 2-way, 3-way and 4-way collisions in the samples produced by a. AG b is related to b in the same fashion.

From Claim 5.2, we know that expectations of AG a and AG b match. Let us denote the expectation by µ. From Claim 5.3, we know that covariances of AG a and AG b are as follows:

Σ(AG a) = Σ1 + n

α11/m2 α12/m3 α13/m4

α12/m3 α22/m3 α23/m4

α13/m4 α23/m4 α33/m4

, Σ(AG b) = Σ1 + n

α 11/m2 α 12/m3 α 13/m4

α 12/m3 α 22/m3 α 23/m4

α 13/m4 α 23/m4 α 33/m4

γ11/m γ12/m2 γ13/m3

γ12/m2 γ22/m2 γ23/m3

γ13/m3 γ23/m3 γ33/m3

Let G1 be a random variable distributed as a Gaussian with mean µ and covariance Σ1. From Lemma 17, which is a technical lemma that bounds the TV distance between Gaussians having the same mean and similar covariances, and whose proof is given in Appendix F, we have that d TV(G1, AG a) 0.01 and d TV(G1, AG b) 0.01 (compared to the lemma statement there, covariances here have an additional factor of n but that does not change the TV distance). Using triangle inequality, we get d TV(AG a, AG b) 0.02. Since d TV(Ga, Gb) d TV(G a, G b) = d TV(AG a, AG b), this completes the proof.

Lemma 13 (Fingerprints are close). d TV(FP(a), FP(b)) 0.04.

Proof. d TV(FP(a), FP(b)) d TV(FP(a), Ga) + d TV(FP(b), Gb) + d TV(Ga, Gb) by triangle inequality which is at most 0.04 due to Lemma 11 and Lemma 12.

With this, we have shown that the TV distance between the distribution of fingerprints of the two sequences is small, even when the total variation distance between the corresponding average distributions is at least some constant. We can then invoke the Neyman-Pearson lemma (Neyman & Pearson, 1933), (Canonne, 2022, Lemma 1.4) to say that no testing algorithm can distinguish between these two cases with high probability. This concludes the proof of Theorem 1.6.

6 Conclusion

We show that sublinear sample complexity property testing extends to the non-i.i.d. setting where the samples are drawn from different distributions, and we are interested about the average distribution. In particular, natural collision-based testers with just a constant number of samples from each distribution suffice to solve the identity and closeness testing problems. While our analysis is optimal with respect to the support size, it is still sub-optimal with respect to the distance parameter ε resolving this is an interesting future direction. Another direction could be to chart the landscape of the problem when T (number of distributions) is fixed, and c (number of samples from each distribution) varies.

Acknowledgments

S.G. did part of this work at Stanford where he was supported by a Stanford Interdisciplinary Graduate Fellowship, and part of it at Harvard, where he was supported by a postdoctoral fellowship from Harvard s Digital Data Design Institute. C.P. was supported by Moses Charikar s and Gregory Valiant s Simons Investigator Awards. K.S. was supported, in part, by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. G.V. was supported, in part, by a Simons Investigator Award.

Published in Transactions on Machine Learning Research (11/2025)

Maryam Aliakbarpour and Sandeep Silwal. Testing properties of multiple distributions with few samples. ar Xiv preprint ar Xiv:1911.07324, 2019.

Maryam Aliakbarpour, Eric Blais, and Ronitt Rubinfeld. Learning and testing junta distributions. In Conference on Learning Theory, pp. 19 46. PMLR, 2016.

Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 442 451. IEEE, 2001.

Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM (JACM), 60(1):1 25, 2013.

Clément L Canonne. A short note on learning discrete distributions. ar Xiv preprint ar Xiv:2002.11457, 2020.

Clément L. Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends in Communications and Information Theory, 19(6):1032 1198, 2022. ISSN 1567-2190. doi: 10.1561/0100000114. URL http://dx.doi.org/10.1561/0100000114.

Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 1193 1203. SIAM, 2014.

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47 60, 2017.

Sitan Chen, Jerry Li, and Ankur Moitra. Learning structured distributions from untrusted batches: Faster and simpler. Advances in Neural Information Processing Systems, 33:4512 4523, 2020.

Herman Chernoff and Shelemyahu Zacks. Estimating the current mean of a normal distribution which is subjected to changes in time. The Annals of Mathematical Statistics, 35(3):999 1018, 1964.

Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 685 694. IEEE, 2016.

Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. ar Xiv preprint ar Xiv:1911.05911, 2019.

Ilias Diakonikolas, Daniel M Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pp. 1841 1854. SIAM, 2014.

Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. ar Xiv preprint ar Xiv:1611.03579, 2016.

Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Optimal identity testing with high probability. ar Xiv preprint ar Xiv:1708.02728, 2017.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48 (2):742 864, 2019.

Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electron. Colloquium Comput. Complex., volume 23, pp. 15, 2016.

Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. 2000.

Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.

Published in Transactions on Machine Learning Research (11/2025)

Ayush Jain and Alon Orlitsky. Optimal robust learning of discrete distributions from batches. In International Conference on Machine Learning, pp. 4651 4660. PMLR, 2020.

Mladen Kolar, Le Song, Amr Ahmed, and Eric P Xing. Estimating time-varying networks. The Annals of Applied Statistics, pp. 94 123, 2010.

Christoph H Lampert. Predicting the future behavior of a time-varying probability distribution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 942 950, 2015.

Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(8):295 347, 2013. doi: 10.4086/toc.2013.v009a008. URL https://theoryofcomputing.org/ articles/v009a008.

Jerzy Neyman and Egon Sharpe Pearson. Ix. on the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289 337, 1933.

Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750 4755, 2008.

Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. In Anna R. Karlin (ed.), 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, volume 94 of LIPIcs, pp. 47:1 47:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. doi: 10.4230/LIPIcs.ITCS.2018.47. URL https://doi.org/10.4230/LIPIcs.ITCS. 2018.47.

Kevin Tian, Weihao Kong, and Gregory Valiant. Learning populations of parameters. Advances in neural information processing systems, 30, 2017.

Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 685 694, 2011.

Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429 455, 2017.

Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, and Sham Kakade. Maximum likelihood estimation for learning populations of parameters. In International Conference on Machine Learning, pp. 6448 6457. PMLR, 2019.

A Learning the average distribution

In this section, we show that the sample complexity of learning the average distribution in the setting of non-identical samples is asymptotically equal to that of the i.i.d. setting.

Claim 1.1. Given access to T distributions, p1, . . . , p T , each supported on a common domain of size k, for any ε > 0, given c = 1 samples drawn from each pi, one can output a distribution pavg such that with probability at least 2/3, d TV(pavg, pavg) ε, provided T = O(k/ε2). Furthermore, Ω(k/ε2) samples are necessary, even if p1 = . . . = p T = pavg for such a guarantee.

Proof. The lower bound for the setting of non-identical samples follows directly from the lower bound of the standard i.i.d setting, that is if we let pt = p, t [T]. For the upper bound, we follow the proof in (Canonne, 2020, Theorem 1). Consider the empirical estimator given by,

pavg(i) = 1

t [T ] 1[Xt = i].

Published in Transactions on Machine Learning Research (11/2025)

We have that

E pavg pavg 2 2 = 1

t [T ] pt(i)(1 pt(i))

t [T ] pt(i)

i pavg(i) = 1

Thus, by Markov s inequality, the squared ℓ2 distance of this estimator from pavg is smaller than 3 T with

probability at least 2

3. By Cauchy-Schwartz, the ℓ1 distance of this estimator from pavg is at most q

probability at least 2

3 and thus, this distance is smaller than ε if T = O(k/ε2).

B Impossibility Result with c = 1

Claim 1.2. Given access to T k/2 distributions, p1, . . . , p T , and c = 1 sample drawn from each distribution, no tester can distinguish the case that the average distribution, pavg, is the uniform distribution of support k, versus has total variation distance at least 1/4 from Unif(k), with probability of success greater than 2/3.

We prove the following equivalent claim.

Claim B.1. For all c1 > 0, there is a k c1 such that for any T k/2, given c = 1 sample drawn from each of p1, . . . , p T , there exists no testing algorithm that succeeds with probability greater than 1/2 and tests whether pavg = Unif(k) versus d TV(pavg, Unif(k)) 1/4 .

Proof. All the logarithms in the proof below are base 2. Let k be the smallest power of 2 greater than or equal to c1, k = 2 log c1 .

For T k/2, let r = k/T, so that r 2. We set r to be the smallest power of 2 greater than or equal to r,

r = 2 log r .

Let T = k/r . By construction T T k/2, T is integral, and T /T 1/2.

We now consider two distributions D1 and D2 over sequences of distributions (p1 . . . p T ) such that for all sequences (p1 . . . p T ) in the support of D1, avg(p1 . . . p T ) = Unif(k) and for all sequences in the support of D2, d TV(avg(p1 . . . p T ), Unif(k)) 1/4. Further, we will show that the samples drawn from distribution sequences drawn from D1 are indistinguishable from samples drawn from distribution sequences drawn from D2 (that is, corresponding distributions have total variation distance 0). This implies that there exists no testing algorithm that succeeds with probability greater than 1/2 and tests whether average distribution is the uniform distribution versus average distribution has total variation distance at least 1/4 from the uniform distribution.

Here is the procedure to draw a sequence of distribution (p1 . . . p T ) from D1:

1. Draw a permutation π : [k] [k] uniformly at random.

2. Use the permutation π to partition the domain into T sets of size r each where for i T , each pi corresponds to a uniform distribution over a distinct set. That is, for 1 i T , pi corresponds to a uniform distribution over {π ((i 1) r + 1) , π ((i 1) r + 2) , , π(i r )}.

3. For T + 1 i T, pi corresponds to a uniform distribution over the whole support [k], that is pi = Unif(k).

Published in Transactions on Machine Learning Research (11/2025)

Here is the procedure to draw a sequence of distributions (p1 . . . p T ) from D2:

1. Draw a permutation π : [k] [k] uniformly at random.

2. Use the permutation π to partition the domain into T sets of size r each where for i T , each pi corresponds to a point-mass distribution supported over an element of a distinct set: For 1 i T , pi corresponds to a point-mass distribution with probability mass 1 on element π ((i 1) r + 1).

3. For T + 1 i T, pi corresponds to a uniform distribution over the whole support [k], that is pi = Unif(k).

Note that all the distributions sequences in the support of D1 satisfy avg(p1 . . . p T ) = uk. Next, we show that all the distribution sequences in the support of D2 satisfy d TV(avg(p1 . . . p T ), Unif(k)) 1/4. For such distribution sequences, avg(p1 . . . p T ) has mass 1

T on T domain elements and mass 1

T k on k T domain elements. From this, we get

d TV(avg(p1 . . . p T ), Unif(k)) = (k T )T

where for the last inequality, we used T T k/2 and T /T 1/2.

Now, we will show that the samples drawn from distribution sequences drawn from D1 are indistinguishable from samples drawn from distribution sequences drawn from D2 which will complete the proof. Let DX 1 be the distribution corresponding to c = 1 sample drawn from each of p1, . . . , p T , where (p1 . . . p T ) is drawn from D1 (with c = 1). We define DX 2 analogously. We want to show that d TV(DX 1 , DX 2 ) = 0. Note that in both DX 1 and DX 2 , each of last (T T ) samples {Xt}t [T T ] are drawn from Unif(k) independent of the first T samples {Xt}t [T ]. And the first T samples {Xt}t [T ] correspond to choosing T distinct domain elements uniformly at random. Thus DX 1 and DX 2 are the same distributions.

C Poissonized setting

In this section, we state and prove some results for the non-i.i.d. setting, where we are also allowed to use Poissonization. We will use the following standard facts about Poisson distributions.

Let Poi(λ) denote a Poisson random variable with mean parameter λ 0.

Fact C.1. For any λ1, λ2 0, if X1 and X2 are independent random variables distributed as X1 Poi(λ1) and X2 Poi(λ2), then X1 + X2 Poi(λ1 + λ2) .

Fact C.2. Given Poi(c) samples from a distribution p k, the frequency of any element j [k] follows the distribution Poi(c p(j)).

We are now ready to prove the following theorem.

Claim C.3 (Poisson identity testing). For any ε > 0 and reference distribution q k, with Poi(c) samples from each of p1, . . . , p T and T = O(

k/ε2), there exists a testing algorithm that succeeds with probability at least 2/3 and tests whether, pavg = q versus d TV(pavg, q) ε ,

for any c 1.

Proof. Note that we are given Poi(c) samples from each distribution pt for all t [T]. For each t [T], let Nt denote the number of samples drawn from pt. Note that Nt Poi(c). For i [k], let Nt(i) be the random variable that denotes the frequency of element i in samples Xt, that is, Nt(i) = |{j [Nt] : Xt(j) = i}|. By Fact C.2, we know that Nt(i) Poi(c pt(i)) .

Published in Transactions on Machine Learning Research (11/2025)

Furthermore, if we let N(i) = P

t [T ] Nt(i), then by Fact C.1, we get that,

N(i) Poi c X

t [T ] pt(i) = Poi(c T pavg(i)) .

Therefore, N(i) has the same distribution as the frequency of element i when we are given Poi(c T) i.i.d samples from distribution pavg. As c T O(

k/ε2), and since we just argued that we are under the setting of having drawn Poi(c T) samples from pavg, we can invoke existing identity testing algorithms that use Poissonization for the i.i.d setting to solve the problem with success probability 2/3, even with c = 1.

D Collisions and Fingerprints

We relate the collision statistics and the fingerprint of a sample by an invertible linear map. To recall, the number of m-way collisions in a sample S = {X1, . . . , Xn} is equal to Pn i1< <im 1[Xi1 = = Xim]]. The fingerprint of the sample is a count array, where for each i, we list the count of the number of support elements that appeared exactly i times in S.

Lemma 14 (Collisions to fingerprint). Let x1, x2, x3, x4 [m]4 for m > 4. Let c2, c3, c4 denote the number of 2-way, 3-way and 4-way collisions amongst x1, x2, x3, x4. Further, let n2, n3, n4 denote the number of elements in [m] occurring exactly 2 times, 3 times and 4 times respectively in x1, x2, x3, x4. Then, the vectors

are related by an invertible linear map. In particular,

1 3 6 0 1 4 0 0 1

Proof. Observe that

n4 = c4, n3 = c3 4 3

n4 = c3 4c4

n2 = c2 3 2

n4 = c2 3(c3 4c4) 6c4 = c2 3c3 + 6c4.

The matrix associated with the linear map has determinant 1, therefore the map is invertible.

E Sufficiency of fingerprint for pooling-based testing algorithms

We first formally state the definition of a pooling-based testing algorithm.

Definition E.1 (Pooling-based testing algorithm). Fix c and let π Perm(c T) denote drawing a uniformly random permutation π over c T elements. A pooling-based testing algorithm A for property P k with sample complexity c T is a (possibly randomized) algorithm which, when given a randomly shuffled pooled multiset S of c samples drawn independently from each of p1, . . . , p T , i.e., S = ST t=1{X(t) 1 , . . . , X(t) c pt}, π Perm(c T), S π(S), produces output A(S) {0, 1} satisfying the following

If pavg P, then Pr S,π,A[A[S] = 1] 2/3.

If d TV(pavg, P) ε, then Pr S,π,A[A[S] = 0] 2/3.

The following lemma states that the fingerprint of the combined sample captures all the information required for a pooling-based testing algorithm to test a symmetric property of the average distribution.

Published in Transactions on Machine Learning Research (11/2025)

Lemma 15. Let A be a pooling-based testing algorithm having sample complexity c T that tests a symmetric property P k of the average distribution pavg = avg(p1, . . . , p T ). Then, there exists a pooling-based algorithm A for the same task having the same sample complexity which takes as input only the fingerprint of the combined sample.

Proof. Let the fingerprint of the combined sample S be denoted by f = [f0, f1, . . . , fc T ]. Here, fi is the number of elements in [k] that occur exactly i times in the combined sample. Given as input f, the algorithm A constructs a sequence S in the following way:

1. Initially, S = ϕ, e = 1.

2. For i = 0, . . . , c T :

While fi > 0 :

Append i copies of e to S.

e e + 1, fi fi 1.

3. Draw πk Perm(k).

4. For every s S:

Set s πk(s).

5. Draw πS Perm(c T).

6. Set S πS(S).

The algorithm A then feeds S constructed as above to A, and returns A s output.

Observe that the distribution of S constructed as above is identical to the distribution of S obtained in the following way:

1. Draw πk Perm(k).

2. Draw S = ST t=1{X(t) 1 , . . . , X(t) c πk(pt)}.

3. Draw πS Perm(c T).

4. Set S πS(S).

Here, π(p) denotes permuting the probability distribution p according to π i.e. π(p)i = pπ(i). Now, for a fixed permutation πk, observe that avg(πk(p1), . . . , πk(p T )) = πk(avg(p1, . . . , p T )) = πk(pavg). Further, recall that the property P we are thinking about is symmetric. Thus, if pavg P, then πk(pavg) P for any πk Perm(k). Similarly, if pavg / P, then πk(pavg) / P for any πk Perm(k). Thus, for any fixed πk, A correctly tests if avg(πk(p1), . . . , πk(p T )) has the property P ( avg(p1, . . . , p T ) has the property P). Consequently, A (and hence A ) correctly tests for a randomly chosen πk as well.

F Technical Results for Section 5

We prove a technical lemma that bounds the TV distance between Gaussians having the same mean and similar covariances.

Published in Transactions on Machine Learning Research (11/2025)

Lemma 16. Let A and D be defined as follows:

α11/m2 α12/m3 α13/m4

α12/m3 α22/m3 α23/m4

α13/m4 α23/m4 α33/m4

and matrix D =

1/m2 0 0 0 1/m3 0 0 0 1/m4

where αij are bounded by constants independent of m. For large enough m, there exists a positive constant α such that, αD A αD.

Proof. Let v1 = [1, 1, 0], v2 = [1, 0, 1] and v3 = [0, 1, 1]. Note that,

m3 v1v T 1 + α13

m4 v2v T 2 + α23

m4 v3v T 3 + D ,

where D is a diagonal matrix, whose entries are as follows:

m4 0 0 0 α22

m4 0 0 0 α33

Note that, it is immediate that, v1v T 1 2D1, v2v T 2 2D2 and v3v T 3 2D3, where matrices D1, D2 and D3 are defined as follows,

1 0 0 0 1 0 0 0 0

1 0 0 0 0 0 0 0 1

0 0 0 0 1 0 0 0 1

This is because for any x R3, x v1v 1 x = (x1 + x2)2 2(x2 1 + x2 2) = 2 x D1x, and similarly (x1 + x3)2 2(x2 1 + x2 3) and (x2 + x3)2 2(x2 2 + x2 3).

Furthermore, as all αijs are bounded independent of m, we have that the matrix D satisfies, D c D for some positive constant c.

Combining all the above inequalities we get that,

m3 v1v T 1 + α13

m4 v2v T 2 + α23

m4 v3v T 3 + D 2|α12|

m3 D1 + 2|α13|

m4 D2 + 2|α23|

m4 D3 + c D .

The matrix in the final expression simplifies to,

c m2 + 2|α12|

m3 + 2|α13|

m4 0 0 0 c m3 + 2|α12|

m3 + 2|α23|

m4 0 0 0 c m4 + 2|α13|

m4 + 2|α23|

As all αijs are bounded by constants, we have that the above matrix is upper bounded by α D for some large constant α > 0. Therefore A α D.

The proof for the other side, that is α D A follows exactly the same argument when applied to A and we conclude the proof.

Lemma 17. Let G1 be distributed as N(µ, Σ1) and G2 be distributed as N(µ, Σ2) where

γ11/m γ12/m2 γ13/m3

γ12/m2 γ22/m2 γ23/m3

γ13/m3 γ23/m3 γ33/m3

and Σ2 = Σ1 +

α11/m2 α12/m3 α13/m4

α12/m3 α22/m3 α23/m4

α13/m4 α23/m4 α33/m4

for γij and αij such that |γij| and |αij| are bounded by constants independent of m. Then d TV(G1, G2) 0.01, for m large enough.

Published in Transactions on Machine Learning Research (11/2025)

Proof. In our proof, we show that, for a sufficiently large m, the covariance matrices Σ1 and Σ2 satisfy: (1 α/m)Σ1 Σ2 (1 + α/m)Σ1 for some constant α > 0. Equivalently, all the eigenvalues of the matrix L 1Σ2(LT ) 1 lie in the range [1 α/m, 1 + α/m], where Σ1 = LLT is the Cholesky decomposition of matrix Σ1. Note that, by the TV distance bound in Fact F.3 this result immediately implies the following upper bound on d TV(G1, G2) between the two Gaussian distributions ,

d TV(G1, G2)

max(λi, λ 1 i ) 1

2πe O(1/m) .

Here, while applying Fact F.3, we used that G1 and G2 have the same mean. Now note that, for a sufficiently large m, we get the desired upper bound on the d TV(G1, G2).

Therefore to prove our lemma, all that remains is to show that, (1 α/m)Σ1 Σ2 (1 + α/m)Σ1 for some constant α > 0. In the remainder of the proof, we prove this claim. First we show invertibility of Σ1. Since Σ1 is a covariance matrix in our construction, the diagonal coefficients γ11, γ22, γ33 are fixed positive constants (independent of m). Thus the diagonal permutation contributes γ11γ22γ33/m6 to det(Σ1), while any other permutation includes at least one off-diagonal of order 1/m2 or 1/m3 and hence has magnitude O(1/m7). Therefore, for m large enough,

det(Σ1) γ11γ22γ33

so Σ1 is invertible.

α11/m2 α12/m3 α13/m4

α12/m3 α22/m3 α23/m4

α13/m4 α23/m4 α33/m4

To prove the claim, we in turn show that, α

By Lemma 16, we know that there exist constants α1 and α2 such that, α1D E α1D and 0 Σ1 α2m D, where

1/m2 0 0 0 1/m3 0 0 0 1/m4

Here the bound 0 Σ1 α2m D follows by applying Lemma 16 to m 1Σ1, which has the same (1/m2, 1/m3, 1/m4) structure with coefficients αij = γij.

In the following, we prove E α

mΣ1 for some constant α > 0. We first show that D γ

mΣ1 for some constant γ, as E α1D, we conclude that E α mΣ1. Towards this end, we now prove that D γ mΣ1 for some constant γ. Equivalently, we wish to show that the smallest eigenvalue of the matrix F = D 1/2 1

is lower bounded by some constant. Note that F is PSD and is equal to,

γ11 γ12/m1/2 γ13/m γ12/m1/2 γ22 γ23/m1/2

γ13/m γ23/m1/2 γ33

Its trace is Tr(F) = γ11 + γ22 + γ33 = Θ(1). Moreover, the determinant satisfies

det(F) γ11γ22γ33

2 =: c0 > 0

for all sufficiently large m (the diagonal term dominates, while all off-diagonal permutations are o(1)). By Gershgorin circle theorem (Fact F.1), the spectral norm is bounded:

λmax(F) max i

|γij| m1/2 C0,

Published in Transactions on Machine Learning Research (11/2025)

for some constant C0 and all large m. Therefore, with eigenvalues λ1 λ2 λ3 0,

λmin(F) = λ3 det(F)

λ1λ2 c0 C2 0 =: c1 > 0.

Equivalently, D 1 c1 1 m Σ1, i.e. D γ mΣ1 with γ = 1/c1. Hence E α1D α1γ

m Σ1, i.e. E α mΣ1 for α := α1γ.

The proof for the other side, that is α

m Σ1 E follows exactly the same argument when applied to E and we conclude the proof.

Lemma 18 (Discretization convolution almost commute for Gaussians). Let X N(µX, ΣX) and Y N(µY , ΣY ) be independent in Rd (with d constant), and assume at least one of ΣX, ΣY is positive definite.2

Let Disc( ) be coordinate-wise rounding to the nearest integer (ties are immaterial since Gaussians put zero mass on boundaries). Then

d TV Disc(X) + Disc(Y ), Disc(X + Y )

Tr(Σ 1 X ), q

Tr(Σ 1 Y ) o .

In particular, if λmin(ΣX) or λmin(ΣY ) , the left-hand side is O(1/ λmin) and tends to 0.

Proof. Without loss of generality assume ΣY 0; otherwise swap X and Y . Let f and g be the densities of X and Y . For m Zd, write Cm := m + [ 1

2)d. Define

P(m) := Pr(Disc(X) + Disc(Y ) = m) = X

Ca f(x) dx Z

Cm a g(y) dy ,

Q(m) := Pr(Disc(X + Y ) = m) = Z

Rd f(x) g(z x) dx dz.

For x Ca, set δ(x) := a x [ 1

2]d. Changing variables z = y + a in Q(m) yields

Cm a g y + δ(x) dy f(x) dx,

P(m) Q(m) = X

g(y) g y + δ(x) dy f(x) dx. (16)

m |P(m) Q(m)| by an expectation times a norm. Summing equation 16 in absolute value over m and using that the cubes {Cm a}m are disjoint and partition Rd,

m Zd |P(m) Q(m)| X

g(y) g(y + δ(x)) dy dx

Rd f(x) g( ) g( + δ(x)) L1 dx. (17)

Cm a| | = R

Rd| |, since {Cm a}m partition Rd.)

Apply Lemma 19 with h = g and u = δ(x) pointwise in x to get g( ) g( + δ(x)) L1 δ(x) 2 g L1. (18)

2If one covariance is singular, the bound below with that matrix should be read as + ; the inequality is then applied with the other (positive definite) covariance.

Published in Transactions on Machine Learning Research (11/2025)

Plugging equation 18 into equation 17 and factoring out the constant g L1 yields X

m Zd |P(m) Q(m)| E δ(X) 2 g L1. (19)

Since X CDisc(X) almost surely, δ(X) = Disc(X) X [ 1

2]d, and thus

E δ(X) 2 sup δ 1/2 δ 2 =

Computing g L1. For Y N(µY , ΣY ),

g(y) = Σ 1 Y (y µY ) g(y), g L1 = Z Σ 1 Y (y µY ) 2 g(y) dy = E Σ 1/2 Y Z 2 ,

where Z N(0, Id). By Cauchy Schwarz,

E Z Σ 1 Y Z = q

Tr(Σ 1 Y ). (21)

Combining equation 19, equation 20, and equation 21,

m Zd |P(m) Q(m)|

Tr(Σ 1 Y ),

d TV Disc(X) + Disc(Y ), Disc(X + Y ) = 1

m |P(m) Q(m)|

Tr(Σ 1 Y ).

By symmetry, the same bound holds with ΣX; taking the minimum proves the claim.

Lemma 19. Let h : Rd R+ be locally absolutely continuous with h L1(Rd). Then for every u Rd,

h( + u) h( ) L1 u 2 h L1.

Proof. Fix y Rd and consider the path t 7 y + t u. By the fundamental theorem of calculus,

h(y + u) h(y) = Z 1

0 h(y + t u), u dt,

so |h(y + u) h(y)| u 2 R 1 0 h(y + t u) 2 dt by Cauchy Schwarz. Integrating in y and using Fubini with the change of variables y 7 y + t u (unit Jacobian) gives Z

Rd|h(y + u) h(y)| dy u 2

Rd h(y + t u) 2 dy dt = u 2 h L1.

Lemma 20. Let X1 and X2 be independent with distributions N(µ1, σ2 1) and N(µ2, σ2 2), respectively. Let t := max(σ2 1, σ2 2). Then, for t large enough,

d TV Disc(X1) + Disc(X2), Disc(X1 + X2) 0.00001.

Proof. Apply Lemma 18 with d = 1:

d TV Disc(X1) + Disc(X2), Disc(X1 + X2) 1

4 min n 1 σ1 , 1

Thus the bound is 10 5 as soon as

t 25,000, i.e. t 6.25 108.

Published in Transactions on Machine Learning Research (11/2025)

Lemma 21. Let X and Y be n-dimensional independent random variables, where X is multivariate Gaussian and Y has first coordinate distributed as N(0, t) and all other coordinates identically 0. Then, for t large enough, d TV Disc(X) + Disc(Y ), Disc(X + Y ) 0.00001,

where Disc( ) denotes coordinate-wise discretization to the nearest integer.

Proof. The idea is that Y perturbs only the first coordinate. Let X 1 denote the last n 1 coordinates of X. After discretization, the last (n 1) coordinates of Disc(X) + Disc(Y ) equal Disc(X 1), and those of Disc(X +Y ) equal Disc(X 1) as well. Thus, conditioned on X 1, the event {Disc(X)+Disc(Y ) A} depends only on the first coordinate, and likewise for {Disc(X + Y ) A}. We therefore reduce to the one-dimensional case of Lemma 18 on the first coordinate and then average over X 1. We formalize this below.

Write X = (X1, X 1) and Y = (Y1, 0, . . . , 0) with Y1 N(0, t) independent of X. For any A Zn and u Rn 1, define Au := {m1 Z : (m1, Disc(u)) A}.

Condition on X 1 and apply Lemma 18 in dimension 1 (to the pair (X1, Y1)) to get

Pr Disc(X1) + Disc(Y1) AX 1 | X 1 Pr Disc(X1 + Y1) AX 1 | X 1 1 4

In the above, we conditioned on X 1 = u, which freezes the last n 1 coordinates; thus (X1, Y1) is one dimensional with Y1 N(0, t) independent of X1. We then applied Lemma 18 in d = 1, yielding a bound 1/(4

t) uniformly in u since the right-hand side depends only on t.

Averaging over X 1 and taking the supremum over A therefore gives

d TV Disc(X) + Disc(Y ), Disc(X + Y ) 1 4

t 10 5 for t 6.25 108.

Lemma 22. Let X be a sample from a distribution supported on {[0, 0, 0], e1, e2, e3, 2e1} with probabilities u0, u1, u2, u3, v such that u1, u2, u3, v > 0 and u0 1 10 16 (all parameters independent of n). Let Σ1 = Cov(X) and let Σαn be the covariance of the sum of αn i.i.d. copies of X. Then there exist constants c , C > 0 depending only on (u1, u2, u3, v) (hence independent of n) such that

c αn λmin(Σαn) λmax(Σαn) C αn.

In particular, λmin(Σαn) = Θ(n).

Proof. Write µ = E[X] = (u1 + 2v, u2, u3) and D = E[XX ] = diag(u1 + 4v, u2, u3), so Σ1 = D µµ . By the matrix determinant lemma,

det(Σ1) = det(D) 1 µ D 1µ = (u1+4v)u2u3

1 (u1 + 2v)2

u1 + 4v u2 u3

(u1+4v)u2u3 (u0 v) =: cdet > 0,

using (u1+2v)2

u1+4v u1 + 2v and u0 = 1 (u1 + u2 + u3 + v). Also Σ1 D, hence λmax(Σ1) λmax(D) = max{u1 + 4v, u2, u3} =: c D < . If the eigenvalues of Σ1 are λ1 λ2 λ3 > 0, then

λmin(Σ1) = λ3 det(Σ1)

c 2 D =: c0 > 0.

For the sum of αn i.i.d. draws, Σαn = αn Σ1, so λmin(Σαn) = αn λmin(Σ1) αc0n and λmax(Σαn) αn λmax(Σ1) αc Dn. Take c := c0 and C := c D to complete the proof.

Published in Transactions on Machine Learning Research (11/2025)

Lemma 23. Let X1 N(µ1, Σ1), X2 N(µ2, Σ2) and X3 N(µ3, Σ3) be c-dimensional Gaussians, for c 10, where

2u1 2u2 ... 2uc

2u1(1 2u1) 4u1u2 . . . 4u1uc 4u1u2 2u2(1 2u2) . . . 4u2uc ... . . . 4u1uc 4u2uc . . . 2uc(1 2uc)

8v(1 2v) 0 . . . 0 0 0 . . . 0 ... . . . 0 0 . . . 0

u1 + 2v u2 ... uc

u1 + 4v (u1 + 2v)2 (u1 + 2v)u2 . . . (u1 + 2v)uc (u1 + 2v)u2 u2(1 u2) . . . u2uc ... . . . (u1 + 2v)uc u2uc . . . uc(1 uc)

and furthermore, 1) v + P

i ui 10 16, and 2) ui, v are positive constants. Now, let Z1 be the sum of l independent draws of X3. Let Z2 be the sum of l

2 + n independent draws of X1 and l

2 n independent draws of X2. As l gets large, for any n [ 4

d TV(Z1, Z2) 10 4.

Proof. First, note that both Z1 and Z2 are also Gaussians, because a linear combination of independent Gaussians is a Gaussian. Let µ1, Σ1 and µ2, Σ2 respectively be the mean and covariance of Z1 and Z2. Then,

Σ1 = l Σ3 µ2 = (l/2 + n)µ1 + (l/2 n)µ2 Σ2 = (l/2 + n)Σ1 + (l/2 n)Σ2

Let Z1 be the Gaussian with mean µ1 and covariance diag(Σ1), where diag(Σ1) is simply the matrix Σ1 with all non-diagonal entries zeroed out. We will first show that d TV(Z1, Z1) 10 13.

Consider the matrix diag(Σ1) 1/2 Σ1 diag(Σ1) 1/2. Let us denote this matrix by M1 and it is equal to

1 (u1+2v)u2 v1v2 . . . (u1+2v)uc v1vc (u1+2v)u2 v1v2 1 . . . u2uc v2vc ... . . . (u1+2v)uc v1vc u2uc v2vc . . . 1

where v1 = u1 + 4v (u1 + 2v)2 and vi = ui(1 ui) for i 2. Note that because each ui 10 16 and v 10 16,

u1 + 2v v1 = u1 + 2v p

u1(1 u1) + 4v(1 v) 4u1v u1 p

u1(1 u1) + 2v p

4v(1 v) 2( u1 + v)

and also, for i = 2, . . . , c,

ui vi = ui p

ui(1 ui) 2 ui.

Published in Transactions on Machine Learning Research (11/2025)

Thus, the sum of absolute values of off-diagonal elements in any row i of M1 is at most

vuj 8 c 10 16 10 14.

Then, by the Gershgorin circle theorem (Fact F.1), all the eigenvalues of this matrix are contained in [1 10 14, 1 + 10 14]. Using corollary 24 that bounds the TV distance between multivariate Gaussian distributions, we get

d TV(Z1, Z1) 10

2πe max 1 1 10 14 , 1 + 10 14 1 10 13.

Now, let Z2 be the Gaussian with mean µ2 and covariance diag(Σ1). We will next show that d TV(Z2, Z2) 10 12.

Again, consider the matrix diag(Σ1) 1/2 Σ2 diag(Σ1) 1/2. Let us denote this matrix by M2 and this is equal to

l ) 2u1(1 2u1)+(1 2n

l ) 8v(1 2v) 2u1+8v 2(u1+2v)2 1 + 2n

l 2u1u2 v1v2 . . . 1 + 2n

l 2u1uc v1vc 1 + 2n

l 2u1u2 v1v2 1 + 2n

. . . 1 + 2n

l 2u2uc v2vc ... . . .

l 2u1uc v1vc 1 + 2n

l 2u2uc v2vc . . . 1 + 2n

where v1, . . . , vc are the same as above. Again, using a similar calculation as above, the sum of absolute values of off-diagonal entries in any row i is at most

10 14 10 13,

where the last inequality holds for large enough l.

Diagonals of M2: For i = 1, write

(M2)11 = A + 2n

ℓB D , A := 2u1(1 2u1)+8v(1 2v), B := 2u1(1 2u1) 8v(1 2v), D := 2u1+8v 2(u1+2v)2.

When n = 0, a direct expansion gives A = D 2(u1 2v)2, hence

(M2)11 1 = 2(u1 2v)2

D 2(u1 + 2v)2

2u1 + 8v 2(u1 + 2v)2 12 10 16 10 13,

using u1, v 10 16. For the n dependent term, 2n

ℓ 2u1 + 8v 2u1 + 8v 2(u1 + 2v)2 16

since |n| 4

ℓand the ratio is 2 for these parameters; thus (M2)11 [1 10 13, 1 + 10 13] for ℓlarge enough. For i 2,

(M2)ii = 1 + 2n

1 ui , so (M2)ii 1 8

ℓ + ui 1 ui 8

ℓ + 2 10 16.

Hence, for sufficiently large ℓ, all diagonal entries of M2 lie in [1 10 13, 1 + 10 13].

Thus by the Gershgorin circle theorem (Fact F.1), all the eigenvalues of this matrix M2 are contained in [1 2 10 13, 1 + 2 10 13]. Again using corollary 24 that bounds the TV distance between multivariate Gaussian distributions, we get

d TV(Z2, Z2) 10

2πe max 1 1 2 10 13 , 1 + 2 10 13 1 10 12.

Published in Transactions on Machine Learning Research (11/2025)

Finally, we bound d TV(Z1, Z2). Note that both Z1 and Z2 are Gaussians with the same diagonal covariance diag(Σ1), but their means are µ1 and µ2 respectively. Since the covariance is diagonal, we can upper-bound the total variation distance between the product distributions by the sum of the total variation distances between the coordinate-wise one-dimensional Gaussians. Namely, denoting by σ2 1, . . . , σ2 c the c diagonal entries in diag(Σ1),

d TV(Z1, Z2) d TV(N(µ11, σ2 1), N(µ21, σ2 1)) + + d TV(N(µ1c, σ2 c), N(µ2c, σ2 c))

σ1 , 1 , N µ21

σ1 , 1 + + d TV

σc , 1 , N µ2c

where in the last equality, we used the property that total variation distance is invariant to affine transformations. For each of these, we can use the formula for bounding the total variation distance between one-dimensional Gaussians given in Fact F.2 to get

d TV(Z1, Z2) 1

σ1 + + |µ1c µ2c|

u1(1 2u1) + 4v(1 v) 4u1v + + 2uc p

v + u1 + + uc (using ui and v at most 10 16)

v + u1 + + uc

16 π 11 10 8

Putting everything together, we get using the triangle inequality that

d TV(Z1, Z2) d TV(Z1, Z1) + d TV(Z2, Z2) + d TV(Z1, Z2)

10 13 + 10 12 + 10 5

F.1 Other useful facts

Fact F.1 (Gershgorin circle theorem). (Horn & Johnson, 2012, Theorem 6.1.1) Let A = (aij) Cd d and set Ri := P j =i |aij|. Then all eigenvalues of A lie in

n z C : |z aii| Ri o .

In particular, if A Rd d is symmetric, all its eigenvalues are real and lie in

i=1 [ aii Ri, aii + Ri ].

Fact F.2 ((Valiant & Valiant, 2011, Fact 29)). Letting N(µ, σ2) denote the univariate Gaussian distribution,

d TV N(µ, 1), N(µ + α, 1) |α|

Published in Transactions on Machine Learning Research (11/2025)

Fact F.3 ((Valiant & Valiant, 2011, Fact 31)). Let G1 = N(µ1, Σ1) and G2 = N(µ2, Σ2) in Rm, and let Σ1 = TT be the Cholesky decomposition of Σ1. Let λi denote the ith the eigenvalue of T 1Σ2T , then

d TV(G1, G2) 1

max{λi, λ 1 i } 1 +

We use the following corollary of this fact:

Corollary 24. Let G1 = N(µ, Σ1) and G2 = N(µ, Σ2) in Rm, and suppose Σ1 is diagonal with positive entries. Then, with T = Σ1/2 1 (so T 1Σ2T = Σ 1/2 1 Σ2Σ 1/2 1 ), we get

d TV(G1, G2) 1

max{λi, λ 1 i } 1 ,

where λ1, . . . , λm are the eigenvalues of Σ 1/2 1 Σ2Σ 1/2 1 .

F.2 Restating CLT from Valiant & Valiant (2011)

Definition 25 (Generalized Multinomial). Let n 1 and k 1. For each r {1, . . . , n} let ρr,1, . . . , ρr,k [0, 1] with Pk i=1 ρr,i 1. Let X(r) {0, e1, . . . , ek} Rk be independent with

Pr X(r) = ei = ρr,i, Pr X(r) = 0 = 1

The sum S := Pn r=1 X(r) Zk 0 is said to be distributed as a generalized multinomial.

Definition 26 (Discretized Gaussian). For µ Rk and positive semidefinite Σ Rk k, let N(µ, Σ) denote the k-dimensional Gaussian. Ndisc(µ, Σ) denotes the distribution corresponding to obtaining a draw from N(µ, Σ) and rounding each coordinate to the nearest integer.

Theorem 27 (Valiant & Valiant (2011, Thm. 4)). Let S be a k-dimensional Generalized Multinomial with parameters as above, mean µ, covariance Σ, and σ2 = λmin(Σ). Then,

d TV S, Ndisc(µ, Σ) k4/3

σ1/3 2.2 3.1 + 0.83 log n 2/3.

Corollary 28. For constant k,

d TV S, Ndisc(µ, Σ) = O (log n)2/3

G Closeness Testing

Our tester for testing closeness follows the proof of Batu et al. (2001). Specifically, we divide the analysis into two parts: heavy elements B and light elements S. We show that we can estimate the distance P

i B |pavg(i) qavg(i)| up to accuracy ϵ using the provided samples. On light elements, since the norms corresponding to these elements is bounded, we can apply the ℓ2 tester to get the desired closeness tester. We divide the analysis into four sections. In the first section, we provide the ℓ2 tester, whose sample complexity depends on the norm of the underlying distributions. Therefore invoking ℓ2 tester only on light elements reduces our sample complexity as their corresponding norm is low. Further, in the second and third section we show several properties of light and heavy elements. In particular, for the heavy elements we show that the we can estimate the ℓ1 distance for these elements quite well. Furthermore, in the same section, we also show that the norm of the light elements is low, setting ourselves to use ℓ2 tester on these elements. Finally in the last section, we provide the final theorem, which invokes the results from sections one and two to get the desired bounds on the closeness testing.

Published in Transactions on Machine Learning Research (11/2025)

G.1 ℓ2 tester

To test closeness in ℓ2, we build an unbiased estimator for the terms pavg 2 2, qavg 2 2 and pavg, qavg . In the following, we provide a description of these estimator. For each t [T], let (Xt(2), Xt(3)) and (X t(2), X t(3)) denote the 2nd and 3rd i.i.d samples drawn from the distribution pt and qt respectively. For each t [T], we define,

Yt = 1[Xt(2) = Xt(3)] and Yst = 1[Xt(2) = Xs(3)] .

Y t = 1[X t(2) = X t(3)] and Y st = 1[X t(2) = X s(3)] .

Note that E[Yt] = pt 2 2, E[Y t ] = qt 2 2 and E[Yst] = pt, ps , E[Y st] = qt, qs for all s, t [T]. Using these random variables, we define an estimator Z and Z as follows,

t [T ] Yt + 2 X

t [T ] Y t + 2 X

Lemma 29. The estimator Z and Z defined in Equation (1) satisfy the following,

E[Z] = pavg 2 2 and Var[Z] 4

E[Z ] = qavg 2 2 and Var[Z ] 4

The proof of the above lemma follows immediately from Lemma 4.

Next, define

Cst = 1[Xs(2) = X t(2)].

s,t [T ] Cs,t .

Lemma 30. The estimator Q defined above satisfies the following,

E[Q] = pavg, qavg .

T 2 pavg, qavg + 1

T ( pavg, qavg, qavg + pavg, pavg, qavg ).

Proof. The expectation of the estimator Q is as follows,

s,t [T ] ps, qt = 1

s [T ] ps, 1

s [T ] qt = pavg, qavg .

In the following we bound the variance, define ˆCst = Cst E[Cst].

s,t [T ] ˆCst 2i

Published in Transactions on Machine Learning Research (11/2025)

s,t Cst i!2

s,t E C2 st

| {z } T 2 terms

s,t a,b |{s,t,a,b}|=3

O( T 3 ) terms

s,t a,b |{s,t,a,b}|=4

O( T 4 ) terms

s,t Cst i!2

| {z } T 2 2 terms

E C2 st E [Cst]2 + X

s,t a,b |{s,t,a,b}|=3

(E[Cst Cab] E[Cst]E[Cab]) + X

s,t a,b |{s,t,a,b}|=4

(E[Cst Cab] E[Cst]E[Cab])

E C2 st + X

s,t a,b |{s,t,a,b}|=3

(E[Cst Cab] E[Cst]E[Cab]) + X

s,t a,b |{s,t,a,b}|=4

(E[Cst Cab] E[Cst]E[Cab])

E C2 st + X

s,c a,c s =a

(E[Csc Cac] E[Csc]E[Cac]) + X

a,t a,b t =b

(E[Cat Cab] E[Cat]E[Cab]).

In the last step, we used the fact that E[Cst Cab] E[Cst]E[Cab] = 0 when s = a and t = b. Further, since E[Cst] 0 for all s, t, we get

s,t [T ] ˆCst 2i X

E C2 st + X

s,c a,c s =a

(E[Csc Cac]) + X

a,t a,b t =b

(E[Cat Cab]) (22)

Now, we look at individual terms in the above expression. The last two terms can be bounded as follows:

s,c a,c |s =a

E[Csc Cac] = X

ℓ=1 ps(ℓ)pa(ℓ)qc(ℓ) := X

c ps, pa, qc , (23)

T 3. pavg, pavg, qavg (24)

a,t a,b |t =b

E[Cat Cab] = X

ℓ=1 pa(ℓ)qt(ℓ)qb(ℓ) := X

a pa, qt, qb , (25)

T 3. pavg, qavg, qavg (26)

The first term can be bounded as follows: X

E C2 st = T 2 pavg, qavg (27)

Substituting these in Equation 22, and using Var[Q] = 1 T 4 E h P s,t [T ] ˆCst 2i , we get the desired variance bound.

Lemma 31. Let F = Z + Z 2Q, then note that,

E[F] = pavg qavg 2 2 .

Var[F] O(1)

T 2 ( qavg 2 2 + pavg 2 2 + pavg, qavg ) + O(1)

T ( pavg, qavg, qavg + pavg, pavg, qavg ) .

Published in Transactions on Machine Learning Research (11/2025)

Proof. The expression for E[F] follows immediately by combining bias terms in Lemma 29 and Lemma 30.

The variance bound follows by using the fact that Var[F] 8(Var(Z) + Var(Z ) + Var(Q)) and further using variance bounds on Z, Z , Q from Lemma 29 and Lemma 30.

Lemma 32. We can test if pavg qavg 2 ϵ/3 vs pavg qavg 2 > ϵ using T =

qavg 2 2+pavg 2 2+ pavg,qavg ϵ2 ) + ( pavg,qavg,qavg + pavg,pavg,qavg )

ϵ4 samples.

Proof. It is sufficient to bound the probability of P(|F E[F]| > ϵ/3). By Chebyshev s inequality, we get that P(|F E[F]| > ϵ/3) 9 Var(F)/ϵ2. Substituting the upper bound on the variance of F, we get that,

P(|F E[F]| > ϵ2/3) O(1)

T 2ϵ4 ( qavg 2 2 + pavg 2 2 + pavg, qavg )

Tϵ4 ( pavg, qavg, qavg + pavg, pavg, qavg )

The above inequality is further upper bounded by a small constant for the value of T specified in the conditions of the lemma.

G.2 Heavy elements

Let Xt(1) and X t(1) be the 1st i.i.d. sample from distribution pt and qt respectively. Let bp(i) = 1 T P

t [T ] 1[Xt(1) = i] and bq(i) = 1 T P

t [T ] 1[X t(1) = i]. We further define big elements and small elements as follows, Bp = {i [k] | bp(i) b} and Sp = {i [k] | bp(i) < b} ,

Bq = {i [k] | bq(i) b} and Sq = {i [k] | bq(i) < b} ,

where b = (ϵ/k)2/3.

Lemma 33. Let Bp and Bq be random sets as defined above and let B = Bp Bq. If n = O( 1

bϵ2 ), then the following conditions holds, P( X

i B |bp(i) pavg(i)| > ϵ/12) 1/100.

i B |bq(i) qavg(i)| > ϵ/12) 1/100.

Proof. Note that B is a random set that includes at most 2/b elements. Given any fixed subset A [n], note that,

i A |bp(i) pavg(i)| > ϵ/12) P( A A such that bp(A ) pavg(A ) > ϵ/12)

2|A| exp( O(nϵ2)) .

Recall in the above expression bp(A ) = P i A bp(i) and pavg(A ) = P i A pavg(i).

In the last inequality, we used the fact that, P(bp(A ) pavg(A ) > ϵ/12) exp( nϵ2/144) and did a union bound over all subsets of A. Therefore for any fixed A such that, |A| 2/b, we have that,

P(|bp(A) pavg(A)| > ϵ/12) 1 100 .

P(|bp(B) pavg(B)| > ϵ/12) = X

A [n]||A| 2/b P(|bp(B) pavg(B)| > ϵ/12|B = A)P(B = A) (28)

A [n]||A| 2/b P(|bp(A) pavg(A)| > ϵ/12)P(B = A) (29)

Published in Transactions on Machine Learning Research (11/2025)

A [n]||A| 2/b

1 100P(B = A) 1/100. (30)

The analogous proof also holds for qavg and we conclude the proof.

Lemma 34. Let Bp and Bq be random sets as defined above and let B = Bp Bq. If n = O( 1

bϵ2 ), then the following conditions holds,

i B |ˆp(i) ˆq(i)| X

i B |pavg(i) qavg(i)|| ϵ/6) 1/50 .

Proof. Let E = P

i B |ˆp(i) ˆq(i)|, F = P

i B |pavg(i) qavg(i)|, G = P

i B |ˆp(i) pavg(i)| and H = P

i B |ˆq(i) qavg(i)|. Note that, E F + G + H and F E + G + H. Combining these two inequalities we get that,

|E F| G + H. (31)

Using the above inequality we get that,

P(|E F| ϵ/6) P(G + H ϵ/6) P(G ϵ/12) + P(H ϵ/12) 1/50. (32)

The last expression follows from Lemma 33.

G.3 Light elements

Lemma 35 (Light elements). Let S denote the complement of B, corresponding to the random set containing elements with empirical probability less than b in both ˆp and ˆq. The following norm bounds hold:

i S pavg(i)2 > 4b/δ) δ and P( X

i S pavg(i)3 > 8b2/δ) δ ,

i S qavg(i)2 > 4b/δ) δ and P( X

i S qavg(i)3 > 8b2/δ) δ ,

i S pavg(i)qavg(i) > 4b/δ) δ, P( X

i S p2 avg(i)qavg(i) > 8b2/δ) δ and P( X

i S pavg(i)q2 avg(i) > 8b2/δ) δ .

Proof. This argument is essentially identical to the proof of Theorem 20 in Chan et al. (2014).

Let H = {i [k]|pavg(i) > 2b} and recall that S = {i [k]|ˆp(i) < b and ˆq(i) < b}. Here we provide the proof for P(P

i S pavg(i)t > 2tb/δ) δ for t {2, 3} and the proof for all the other probability statements follow the same argument.

To prove P(P

i S pavg(i)t > 2tbt 1/δ) δ for t {2, 3} we first show that,

i S pavg(i)t] 2tbt 1,

and the lemma statements follow immediately by applying Markov Inequality. Therefore, in the remainder of the proof, we focus on proving the upper bound on the expectation.

Note that, E[P

i S pavg(i)t] = E[P

i S H pavg(i)t + P

i S Hc pavg(i)t] 2t 1bt 1 + P

i S H pavg(i)t. In the last inequality, we used pavg(i) < 2b for all i Hc. Therefore all that remains is to show that, E[P

i S H pavg(i)t] 2t 1bt 1.

For i S H, let pavg(i) = xib and note that xi 2. Then we have that,

i S H pavg(i)t] = X

i H pavg(i)t P(i S),

Published in Transactions on Machine Learning Research (11/2025)

i H pavg(i)bt 1xt 1 i P(pavg(i) = xib and ˆp(i) < b).

In the last inequality, we used pavg(i) = xib and P(i S) P(pavg(i) = xib and ˆp(i) < b).

Note that, by Chernoff bound it is immediate that, P(i S) exp( Cbn) exp( Cxi/ϵ2), for some constant C just dependent on t. Substituting this probability upper bound back in the expectation we get that, E[ X

i S H pavg(i)t] cbt 1,

for some constant c dependent on t and we conclude the proof.

G.4 ℓ1 tester

Here we finally present our ℓ1 tester.

Theorem 36. For any ϵ > k 1/3, there exists an algorithm ℓ1-Distance-Test that, for O( k2/3

ϵ8/3 ) samples from p1, . . . , p T and q1, . . . , q T has the following behavior: it rejects with probability 2/3 when pavg qavg 1 ϵ, and accepts with probability 2/3 when p = q.

Proof. Let b = (ϵ/k)2/3. Note that n = O(1/(bϵ2)). Our ℓ1 tester works as follows. 1) It uses the first set of samples from p1 . . . p T and q1 . . . q T to estimate the heavy elements, which are defined as follows.

Bp = {i [k] | ˆp(i) b} and Bq = {i [k] | ˆq(i) b} .

Define B = Bp Bq. By Lemma 34, we know that with a good probability, | P

i B |ˆp(i) ˆq(i)| estimates the quantity | P

i B |pavg(i) qavg(i)| upto ϵ/6 accuracy. Therefore, if pavg qavg 1 ϵ/3, then this implies that | P

i B |ˆp(i) ˆq(i)| > ϵ/6. In this particular case, we know that we are in the case of pavg qavg 1 ϵ and we reject the instance. Therefore in the remainder of the proof we focus on the case, where | P

i B |pavg(i) qavg(i)| ϵ/3.

Here we define new distributions p 1 . . . p T and q 1 . . . q T focused on light elements as follows: Sample an element from pt. If this sample is in S output it; otherwise, output uniformly random element from [k]. Define q t similarly. We generate two samples from p 1 . . . p T and q 1 . . . q T using this procedure. Let p avg and q avg be respective averages.

Note that, for i S, p avg(i) = pavg(i) + pavg(B)/k and q avg(i) = qavg(i) + qavg(B)/k. For i B, we have that, p avg(i) = qavg(B)/k and q avg(i) = qavg(B)/k.

Note that, when pavg = qavg, we have that, p avg = q avg. Furthermore in the other case, when pavg qavg 1 ϵ and P

i B |pavg(i) qavg(i)| ϵ/3, we get that, p avg q avg 1 ϵ/3.

Note that by Lemma 35, we get that, the 2nd and 3rd order norms for p avg and q avg are bounded by O(b) and O(b2) respectively with very tiny constant probability. Therefore we get that, p avg 2 2 P

i S pavg(i)2 + O(1/k) O(b + 1/k), q avg 2 2 P

i S qavg(i)2 + O(1/k) O(b + 1/k) and pavg, qavg O(b + 1/k). Similarly, the third order norms are also bounded, p avg 3 3 P

i S pavg(i)3+3/k P

i S pavg(i)2+O(1/k2) O(b2 + b/k + 1/k2) and q avg 3 3 P

i S qavg(i)3 + +3/k P

i S qavg(i)2 + O(1/k) O(b2 + b/k + 1/k2), pavg, qavg, qavg O(b2 + b/k + 1/k2) and pavg, pavg, qavg O(b2 + b/k + 1/k2). We then invoke the ℓ2 tester with ϵ = ϵ/10

k. Further using the sample complexity bounds from Lemma 32, we get the following upper bound, O( p

b + 1/k/ϵ 2 + (b2 + b/k + 1/k2)/ϵ 4) O(k2/3/ϵ8/3) ,

which is the required sample complexity.