# Statistical-Computational Trade-offs for Density Estimation

Anders Aamand (University of Copenhagen, aamand@mit.edu), Alexandr Andoni (Columbia University, andoni@cs.columbia.edu), Justin Y. Chen (MIT, justc@mit.edu), Piotr Indyk (MIT, indyk@mit.edu), Shyam Narayanan (Citadel Securities, shyam.s.narayanan@gmail.com), Sandeep Silwal (UW-Madison, silwal@cs.wisc.edu), Haike Xu (MIT, haikexu@mit.edu)

Work done as a student at MIT. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract: We study the density estimation problem defined as follows: given k distributions p_1, ..., p_k over a discrete domain [n], as well as a collection of samples chosen from a query distribution q over [n], output p_i that is close to q. Recently, [1] gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses O(n/log^c k) samples for some constant c > 0 and polynomial space, then the query time of the data structure must be at least k^{1 − O(1)/log log k}, i.e., close to linear in the number of distributions k. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where q = p_i for some i, and when the distributions are flat (specifically, all distributions are uniform over half of the domain [n]). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.

1 Introduction

The general density estimation problem is defined as follows: given k distributions p_1, ..., p_k over a domain [n] (in this paper we focus on finite domains), build a data structure which, when queried with a collection of samples chosen from a query distribution q over [n], outputs p_i that is close to q. An ideal data structure reports the desired p_i quickly given the samples (i.e., has fast query time), uses few samples from q (i.e., has low sampling complexity), and uses little space.

In the realizable case, we know that q is equal to one of the distributions p_j, 1 ≤ j ≤ k, and the goal is to identify a (potentially different) p_i such that ‖q − p_i‖_1 ≤ ε for an error parameter ε > 0. In the more general agnostic case, q is arbitrary and the goal is to report p_i such that ‖q − p_i‖_1 ≤ C · min_j ‖q − p_j‖_1 + ε for some constant C > 1 and error parameter ε > 0.

The problem is essentially that of non-parametric learning of a distribution q ∈ F, where the family F = {p_1, ..., p_k} has no structure whatsoever. Its statistical complexity was understood already in [17], and the more modern focus has been on the structured case (when F has additional properties). Surprisingly, its computational complexity is still not fully understood. Due to its generality, density estimation is a fundamental problem with myriad applications in statistics and learning distributions. For example, the framework provides essentially the best possible sampling bounds for mixtures of Gaussians [10, 18, 12].
The framework has also been studied in private [8, 7, 13, 15] and low-space [4] settings.

| Samples | Query time | Space | Comment | Reference |
|---|---|---|---|---|
| log k | k² log k | kn | | [17, 11] |
| log k | k log k | kn | | [2] |
| log k | k | kn | | [1] |
| log k | log k | n^{O(log k/ε²)} | precompute all possible samples | folklore |
| n | nk^ρ | kn + k^{1+ρ} | any constant ρ > 0 | [16]+[14] |
| n/log(k)^{1/4} | n + k^{1 − 1/log(k)^{1/4}} | k²n | | [1] |
| n/s | k^{1 − O(ρ_u)/log s} | k^{1+ρ_u} | lower bound, for any ρ_u > 0 and sufficiently large s | this paper |
| n/s | k^{1 − Ω(ρ_u)/log s} | kn + k^{1+ρ_u} | algorithm for half-uniform distributions | this paper |

Table 1: Prior work and our results. For simplicity, the results are stated only for the realizable case, constant ε > 0, and with O(·) factors suppressed. The bound of [1] (row 6 of the table) is stated as in Theorem 3.1 of that paper. However, by adjusting the free parameters, their algorithm can easily be generalized to use n/s samples for n/s > n/polylog(n), resulting in a query time of O(n + k^{1 − Ω(ε²)/s}). Note that the term 1 − 1/s in their bound results in a larger exponent than the 1 − 1/log s in our upper bound. Furthermore, our algorithm is correct as long as n/s ≥ log k/ε², which is the information-theoretic lower bound.

Table 1 summarizes known results as well as our work. As seen in the table, the data structures are subject to statistical-computational trade-offs. On one hand, if the query time is not a concern, logarithmically many (in k) samples are sufficient [17, 11]. On the other hand, if the sampling complexity is not an issue, then one can use the algorithm of [16] to learn a distribution q̂ such that ‖q̂ − q‖_1 ≤ ε/2, and then deploy standard approximate nearest neighbor search algorithms with sublinear query time, e.g., from [14]. Unfortunately, both of these extremes require either linear (in n) sample complexity or linear (in k) query time. Thus, achieving the best performance with respect to one metric results in the worst possible performance on the other metric.
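To make the second extreme concrete, the following minimal Python sketch (our illustration, not code from [16] or [14]) learns the empirical distribution q̂ from Θ(n) samples and then runs a plain linear scan in ℓ1 distance; a sublinear-query instantiation would replace the scan with an approximate nearest neighbor index as in [14].

```python
import numpy as np

def learn_then_search(dists, samples, n):
    """Linear-sample extreme (sketch): estimate q from samples, then find the
    closest p_i in l1 distance. `dists` is a k x n array of probability vectors."""
    q_hat = np.bincount(samples, minlength=n) / len(samples)
    l1 = np.abs(dists - q_hat).sum(axis=1)  # l1 distance to each p_i
    return int(np.argmin(l1))               # index of the reported distribution
```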
The first and only known result that obtained non-trivial improvements to both the sampling complexity and the query time is due to a recent work of [1]. Their improvements, however, were quite subtle: the sample complexity was improved by a sub-logarithmic factor, while the query time was improved by a sub-polynomial factor of k^{1/(log k)^{1/4}} = 2^{(log k)^{3/4}}. This result raises the question of whether further improvements are possible. In particular, [1] asks: "To what extent can our upper bounds of query and sample complexity be improved? What are the computational-statistical tradeoffs between the sample complexity and query time?" These are the questions that we address in this paper.

Lower bound: We give the first limit to the best possible tradeoff between the query time and sampling complexity, demonstrating a novel statistical-computational tradeoff for a fundamental problem. To our knowledge, this is the first statistical-computational trade-off for a data structures problem (if we allow for superpolynomial space, logarithmic sampling and query complexity is possible). As in [5, 6, 9, 3], we focus on data structures that operate in the so-called list-of-points model, which captures all bounds depicted in Table 1 (see Section 2 for the formal definition; generally, proving data structure lower bounds requires restricting to a specific model of computation, and the list-of-points model is a standard choice for related approximate nearest neighbor problems). Suppose that the query time of the data structure is of the form k^{ρ_q}. We show that, if the data structure is allowed polynomial space kn + k^{1+ρ_u} and uses n/s samples, then the query time exponent ρ_q must be at least 1 − O(ρ_u)/log s. Therefore, if we set s = log^c k for some c > 0, as in [1], then the query time of the data structure must be at least k^{1 − O(1)/log log k}. That is, the query time can be improved over linear time in k by at most a factor of k^{O(1)/log log k} = 2^{O(log k/log log k)}. This shows that it is indeed not possible to improve the linear query time by a polynomial factor while keeping the sampling complexity below n/log^c n, for any constant c > 0.

Our lower bound instance falls within the realizable case and in the restricted setting where the data and query distributions are "half-uniform", i.e., each distribution is uniform over half of the universe [n]. Note that showing a lower bound under these restrictions automatically extends to the agnostic case with general distributions, as the former is a special case of the latter. Our construction takes lower bounds by [3] for set similarity search as a starting point. We adapt their lower bound to our setting in which queries are formed via samples from a distribution. The resulting lower bound is expressed via a complicated optimization problem and does not yield a closed-form statistical-computational trade-off. One of our technical contributions is to solve this optimization problem for a regime of interest to get our explicit lower bound.

Figure 1: Left: Trade-off between 1/s (samples as a fraction of n) and the query time exponent ρ_q for our algorithm for half-uniform distributions (solid green curve), the algorithm by [1] for general distributions (dashed green curve), our analytic lower bound (solid red curve), and a numerical evaluation of the bound from Theorem 3.2 (dashed black curve). We have fixed the space parameter ρ_u = 1/2. The plots illustrate the asymptotic behaviour proven in Theorem 3.1 and Theorem 4.2: as s → ∞, ρ_q = 1 − Θ(1/log s) both in the lower bound and for our algorithm for half-uniform distributions. Right: The same plot zoomed in to the upper left corner, with 1/s on a log scale.

Upper bound: We complement the lower bound by demonstrating a data structure for our hard instances (in the realizable case with half-uniform distributions) achieving sampling and query time bounds asymptotically matching those of the lower bound. We note that the existence of such an algorithm essentially follows from [3]. However, the algorithms presented there are quite complex. In contrast, our algorithm can be viewed as a bare-bones version of their approach, and as a result it is simple and easy to implement.
To demonstrate the last point, we present an empirical evaluation of the algorithm on the synthetic data set from [1], and compare it to the algorithm from that paper, as well as to a baseline tailored to half-uniform distributions. Our faster algorithm achieves over a 6× reduction in the number of operations needed to correctly answer 100 random queries. In Figure 1, we illustrate the trade-off between the number of samples and the query time exponent ρ_q in our upper and lower bounds.

Open Questions: The direct question left open by our work is whether there exists a data structure whose upper bounds for general distributions match our lower bounds (note that we give matching upper bounds for half-uniform distributions). [1] give an algorithm for the general case, but with a worse trade-off than that described by our lower bound. More generally, are there other data structure problems for which one can show statistical-computational tradeoffs between the trifecta of samples, query time, and space?

2 Preliminaries and Roadmap for the Lower Bound

First we introduce helpful notation used throughout the paper.

Notation: We use Bern(p) to denote the Bernoulli(p) distribution and Poi(λ) to denote the Poisson(λ) distribution. For a discrete distribution f : X → R, we use supp(f) = {x ∈ X : f(x) ≠ 0} to denote f's support and |supp(f)| to denote the size of its support. We use f^{⊗n} to denote the tensor product of n identical copies of the distribution f. We call a distribution f half-uniform if it is a uniform distribution on its support T with |T| = n/2. For a binary distribution P supported on {0, 1} with a random variable x ∼ P, we sometimes explicitly write P = (P[x = 1], P[x = 0]). Similarly, for a joint distribution PQ over {0, 1}² with (x, y) ∼ PQ, we write PQ = ( P[x = 1, y = 1], P[x = 1, y = 0] ; P[x = 0, y = 1], P[x = 0, y = 0] ). For a vector x ∈ R^n, we use x[i] to denote its i-th coordinate. We use d(p‖q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) and D(P‖Q) = Σ_x P(x) log(P(x)/Q(x)) to denote the standard KL divergence of a binary distribution and of a general discrete distribution, respectively. The KL divergence D(P‖Q) is only finite when supp(P) ⊆ supp(Q), also denoted P ≪ Q. All logarithms are natural.

We now introduce the main problem which we use to prove our statistical-computational lower bound. We state a version which generalizes half-uniform distributions.

Definition 2.1 (Uniform random density estimation problem). For a universe U = {0, 1}^n, we generate the following problem instance:

1. A dataset P is constructed by sampling k uniform distributions, where for each uniform distribution p ∈ P, every element i ∈ [n] is contained in p's support independently with probability w_u.
2. Fix a distribution p* ∈ P and take Poi(|supp(p*)|/(s·w_u)) samples from p* to get a query set q.
3. The goal of the data structure is to preprocess P such that when given the query set q, it recovers the distribution p*.

We denote this problem as URDE(w_u, s). URDE abbreviates Uniform Random Density Estimation. The name comes from the fact that the dataset distributions are uniform over randomly generated supports.
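To make Definition 2.1 concrete, here is a minimal generator for a URDE(w_u, s) instance (a sketch of ours, not code from the paper); the Poisson rate |supp(p*)|/(s·w_u) is the one appearing in Remark 3.5 and in the reduction of Theorem 3.3, so the expected number of samples is n/s.

```python
import numpy as np

def generate_urde(n, k, w_u, s, seed=None):
    """Generate a URDE(w_u, s) instance: k random supports, a planted index `star`,
    and a Poi(|supp(p*)|/(s*w_u))-sized multiset of samples drawn from p*."""
    rng = np.random.default_rng(seed)
    supports = rng.random((k, n)) < w_u              # k x n boolean support matrix
    star = int(rng.integers(k))                      # planted distribution p*
    supp_star = np.flatnonzero(supports[star])
    m = rng.poisson(len(supp_star) / (s * w_u))      # number of samples, mean ~ n/s
    query = rng.choice(supp_star, size=m, replace=True)
    return supports, query, star
```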
In Section 3, we prove a lower bound for URDE by showing that a previously studied hard problem can be reduced to URDE. The previously studied hard problem is the GapSS problem.

Definition 2.2 (Random GapSS problem [3]). For a universe U = {0, 1}^n and parameters 0 < w_q < w_u < 1, let the distributions P_U = Bern(w_u)^{⊗n}, P_Q = Bern(w_q)^{⊗n}, and P_QU = ( w_q, 0 ; w_u − w_q, 1 − w_u )^{⊗n}. A random GapSS(w_u, w_q) problem is generated by the following steps:

1. A dataset P ⊆ U is constructed by sampling k points, where p ∼ P_U for each p ∈ P.
2. A dataset point p* ∈ P is fixed and a query point q is sampled such that (q, p*) ∼ P_QU.
3. The goal of the data structure is to preprocess P such that it recovers p* when given the query point q.

We denote this problem as random GapSS(w_u, w_q). GapSS abbreviates Gap Subset Search.

To provide some intuition about how GapSS relates to URDE, let us denote the dataset P = {p_1, ..., p_k}. Then the p_i ∈ {0, 1}^n can naturally be viewed as k independently generated random subsets of [n]. For each i, p_i includes each element of [n] with probability w_u. The query point q can similarly be viewed as a random subset of [n] including each element with probability w_q, but it is correlated with some fixed p* ∈ P. Namely, p* and q are generated according to the joint distribution P_QU (with the right marginal distributions P_Q and P_U) such that q is a subset of p*. The goal in GapSS is to identify p* given q. This intuition is formalized in Section 3.

Our main goal is to study the asymptotic behavior of algorithms with sublinear samples, or specifically, the query time and memory trade-off when only sublinear samples are available. Thus, all our theorems assume the setting where both the domain size n and the number of distributions k go to infinity with n ≤ k ≤ poly(n). Sublinear samples means that 1/s = o(1) as n goes to infinity.

Our lower bounds extend and generalize lower bounds for GapSS in the list-of-points model. Thus, the lower bound we provide for URDE is also in the list-of-points model defined below (slightly adjusted from the original definition in [5] to our setting). The model captures a large class of data structures for retrieval problems such as partial match and nearest neighbor search, where one preprocesses a dataset P to answer queries q that can match a point in the dataset.

Definition 2.3 (List-of-points model). Fix a universe Q of queries, a universe U of dataset points, as well as a partial predicate ϕ : Q × U → {0, 1, ⋆}. We first define the following ϕ-retrieval problem: preprocess a dataset P ⊆ U so that given a query q ∈ Q such that there exists some p* ∈ P with ϕ(q, p*) = 1 and ϕ(q, p) = 0 for all p ∈ P \ {p*}, we must report p*. A list-of-points data structure solving the above problem is then as follows:

1. We fix (possibly random) sets A_i ⊆ U, for 1 ≤ i ≤ m; and with each possible query point q ∈ Q, we associate a (random) set of indices I(q) ⊆ [m].
2. For the given dataset P ⊆ U, we maintain m lists of points L_1, L_2, ..., L_m, where L_i = P ∩ A_i.
3. On query q ∈ Q, we scan through the lists L_i for i ∈ I(q), and check whether there exists some p ∈ L_i with ϕ(q, p) = 1. If it exists, return p.

The data structure succeeds, for a given q ∈ Q and p* ∈ P with ϕ(q, p*) = 1, if there exists i ∈ I(q) such that p* ∈ L_i. The total space is defined by S = m + Σ_{i ∈ [m]} |L_i| and the query time by T = |I(q)| + Σ_{i ∈ I(q)} |L_i|.

To see how the lower bound model relates to URDE: in our setting, the ϕ-retrieval problem is the URDE problem, U is the set of random half-uniform distributions, Q is the family of query samples, and ϕ(q, p) is 1 if the samples q were drawn from the distribution p, and 0 otherwise. (The ⋆ case corresponds to an approximate answer, considered by the earlier papers; we define the URDE problem directly so as not to have approximate solutions.)
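The following skeleton (our sketch, not an implementation from [5]) spells out Definition 2.3 in Python: the concrete scheme supplies the random sets A_i through a membership test, the index set I(q) through an explicit list, and the predicate ϕ as a callable.

```python
from typing import Callable, Hashable, Iterable, List, Optional

class ListOfPoints:
    """Minimal list-of-points data structure (Definition 2.3, sketch)."""

    def __init__(self, m: int, in_A: Callable[[int, Hashable], bool]):
        self.m = m
        self.in_A = in_A                       # in_A(i, p) decides whether p is in A_i
        self.lists: List[List[Hashable]] = [[] for _ in range(m)]

    def preprocess(self, dataset: Iterable[Hashable]) -> None:
        for p in dataset:                      # L_i = P ∩ A_i
            for i in range(self.m):
                if self.in_A(i, p):
                    self.lists[i].append(p)

    def query(self, q, I_q: Iterable[int], phi: Callable) -> Optional[Hashable]:
        for i in I_q:                          # scan the lists L_i for i in I(q)
            for p in self.lists[i]:
                if phi(q, p) == 1:             # report the matching point
                    return p
        return None
```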
We use the list-of-points model as it captures all known data-independent similarity search data structures, such as Locality-Sensitive Hashing [14]. In principle, a lower bound against this model does not rule out data-dependent hashing approaches. However, these have been useful only for datasets which are not chosen at random. In particular, [5] conjecture that data-dependency does not help on random instances, which is the setting of our theorems.

3 Lower bounds for the random half-uniform density estimation problem

In this section, we formalize our lower bound. The main theorem of the section is the following.

Theorem 3.1 (Lower bound for URDE). If a list-of-points data structure solves URDE(1/2, s) using time O(k^{ρ_q}) and space O(k^{1+ρ_u}), and succeeds with probability at least 0.99, then for sufficiently large s,

ρ_q ≥ 1 − 1/s^{1 − log 2 − o(1)} − ρ_u/(log s − 1).

To prove Theorem 3.1, our starting point is the following result of [3] that provides a lower bound for the random GapSS problem.

Theorem 3.2 (Lower bound for random GapSS, [3]). Consider any list-of-points data structure for solving the random GapSS(w_u, w_q) problem on k points which uses expected space O(k^{1+ρ_u}), has expected query time O(k^{ρ_q − o_k(1)}), and succeeds with probability at least 0.99. Then for every α ∈ [0, 1], we have that

α·ρ_q + (1 − α)·ρ_u ≥ inf_{t_q, t_u ∈ [0,1], t_u ≠ w_u} F(t_u, t_q),

where

F(t_u, t_q) = ( α·(D(T‖P) − d(t_q‖w_q)) + (1 − α)·(D(T‖P) − d(t_u‖w_u)) ) / d(t_u‖w_u),

P = ( w_q, 0 ; w_u − w_q, 1 − w_u ), and T = arg inf_{T ≪ P : E_{X∼T}[X] = (t_q, t_u)} D(T‖P).

Our proof strategy is to first give a reduction from the GapSS problem to the URDE problem. Note that the URDE problem involves a statistical step where we receive samples from an unknown distribution (our query). On the other hand, the query of GapSS is a specified vector, rather than a distribution, with no ambiguity. Our reduction bridges this and shows that GapSS is a strictly easier problem than URDE.

Theorem 3.3 (Reduction from random GapSS to URDE). If a list-of-points data structure solves the URDE(w_u, s) problem of size k in Definition 2.1 using time O(k^{ρ_q}) and space O(k^{1+ρ_u}), and succeeds with probability at least 0.99, then there exists a list-of-points data structure solving the GapSS(w_u, w_q) problem for w_q = w_u·(1 − e^{−1/(s·w_u)}) using space O(k^{1+ρ_u}) and time O(k^{ρ_q} + w_q·n), which succeeds with probability at least 0.99.

Proof. We provide a reduction from random GapSS(w_u, w_q) to URDE(w_u, s) with s = 1/(w_u·log(1/(1 − w_q/w_u))). Specifically, for each instance (P_1, p*_1, q_1) generated from GapSS(w_u, w_q) in Definition 2.2, we will construct an instance (P_2, p*_2, q_2) generated from URDE(w_u, s) satisfying Definition 2.1 for some s. For each point p_1 ∈ P_1, it is straightforward to construct a corresponding uniform distribution p_2 supported on those coordinates i where p_1[i] = 1. Let us then construct q_2 from q_1. Recall that for each i ∈ [n] with p*_1[i] = 1, we have q_1[i] = 0 with probability 1 − w_q/w_u, in which case we add no copies of element i to q_2. If q_1[i] = 1, we add Poi⁺(1/(s·w_u)) copies of element i to q_2, where P[Poi⁺(λ) = x] = P[Poi(λ) = x]/P[Poi(λ) > 0] for any x > 0. By setting s = 1/(w_u·log(1/(1 − w_q/w_u))), we have P[Poi(1/(s·w_u)) = 0] = 1 − w_q/w_u. Thus, for each element i in supp(p*_2), the number of its appearances in q_2 exactly follows the distribution Poi(1/(s·w_u)). According to the properties of the Poisson distribution, uniformly drawing Poi(|supp(p*_2)|/(s·w_u)) samples from a set of size |supp(p*_2)| is equivalent to sampling each element Poi(1/(s·w_u)) times. Therefore, the constructed instance (P_2, p*_2, q_2) is an instance of URDE(w_u, s), as stated in Definition 2.1. Equivalently, we have the relationship w_q = w_u·(1 − e^{−1/(s·w_u)}). Hence, we complete our reduction from GapSS (Definition 2.2) to URDE (Definition 2.1).
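The sample-generation step of this reduction is easy to mirror in code. The sketch below (ours, not code from the paper) converts a GapSS query vector q_1 ∈ {0,1}^n into a URDE-style sample multiset q_2 by drawing Poi⁺(1/(s·w_u)) copies of every coordinate with q_1[i] = 1, where 1/(s·w_u) = log(1/(1 − w_q/w_u)).

```python
import numpy as np

def gapss_query_to_urde_samples(q1, w_u, w_q, seed=None):
    """Theorem 3.3 reduction (sketch): build the URDE query multiset q_2 from the
    GapSS query vector q1 using truncated Poisson counts Poi^+(1/(s*w_u))."""
    rng = np.random.default_rng(seed)
    lam = np.log(1.0 / (1.0 - w_q / w_u))   # equals 1/(s*w_u) for the reduction's s
    q2 = []
    for i in np.flatnonzero(q1):
        count = 0
        while count == 0:                    # rejection-sample Poi^+(lam) > 0
            count = rng.poisson(lam)
        q2.extend([int(i)] * count)
    return q2
```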
To get the desired space-time trade-off in the sublinear sample regime, which means s → ∞ (or equivalently w_q → 0), and to get an interpretable analytic bound, we need to build upon the lower bound in Theorem 3.2. This requires explicitly solving the optimization problem in Theorem 3.2. Proving Theorem 3.4 (proof in Appendix A) is the main technical contribution of the paper.

Theorem 3.4 (Explicit lower bound for random GapSS instance). Consider any list-of-points data structure for solving the random GapSS(1/2, w_q) problem which has expected space O(k^{1+ρ_u}), uses expected query time O(k^{ρ_q − o(1)}), and succeeds with probability at least 0.99. Then we have the following lower bound for sufficiently small w_q:

ρ_q ≥ 1 − w_q^{1 − log 2 − o(1)} + ρ_u/(1 + log w_q).

Applying our reduction to the random GapSS lower bound above allows us to prove our main theorem.

Proof of Theorem 3.1. According to the reduction given in Theorem 3.3 from GapSS(w_u, w_q) to URDE(w_u, s), we have w_q = w_u·(1 − e^{−1/(s·w_u)}) = (1 − e^{−2/s})/2 ≤ 1/s for w_u = 1/2. We can therefore apply the lower bound in Theorem 3.4 and get the desired lower bound.

Remark 3.5. Note that in URDE(1/2, s), the distributions are uniform over random subsets of expected size n/2 and the query set is generated by taking Poi(2·|supp(p*)|/s) samples from one of them, p*. This is not quite saying that the query complexity is n/s. However, by standard concentration bounds, from the Poisson sample we can simulate sampling with a fixed number of samples n/s − O(√(n/s)) = (n/s)·(1 − o(1)) with high probability, and so any algorithm using this fixed number of samples must have the same lower bound on ρ_q as in Theorem 3.1.

4 A simple algorithm for the half-uniform density estimation problem

In this section, we present a simple algorithm for a special case of the density estimation problem in which the input distributions are half-uniform. The algorithm also works for the related URDE(1/2, s) problem of Theorem 3.1. A distribution p over [n] is half-uniform if there exists T ⊆ [n] with |T| = n/2 such that p[i] = 2/n if i ∈ T and p[i] = 0 otherwise. The problem we consider in this section is:

Definition 4.1 (Half-uniform density estimation problem; HUDE(s, ε)). For a domain [n], integer k, ε > 0, and s > 0, we consider the following data structure problem.

1. A dataset P of k distributions p_1, ..., p_k over [n], which are half-uniform over subsets T_1, ..., T_k ⊆ [n] each of size n/2, is given.
2. We receive a query set q consisting of n/s samples from an unknown distribution p_{i*} ∈ P satisfying ‖p_{i*} − p_j‖_1 ≥ ε for all j ≠ i*.
3. The goal of the data structure is to preprocess P such that when given the query set q, it recovers the distribution p_{i*} with probability at least 0.99.

This problem is related to the URDE(1/2, s) problem in Theorem 3.1. Indeed, with high probability, an instance of URDE(1/2, s) consists of almost half-uniform distributions with support size n/2 ± O(√(n·log k)). Moreover, two such distributions p_i, p_j have ‖p_i − p_j‖_1 = 1 − O(√((log k)/n)) with high probability. Thus, an instance of URDE(1/2, s) is essentially an instance of HUDE(s, 1). To solve HUDE(s, ε), we can essentially apply the similarity search data structure of [3], querying it with the set Q consisting of all elements that were sampled at least once. This approach obtains the optimal trade-off between ρ_u and ρ_q (at least in the list-of-points model).
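As a quick sanity check of the claim above (a numerical experiment of ours, not from the paper), two independent density-1/2 random supports have sizes n/2 ± O(√n), and the corresponding uniform distributions sit at ℓ1 distance close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.random(n) < 0.5                 # two independent random half-supports
B = rng.random(n) < 0.5
pA = A / A.sum()                        # (nearly) half-uniform distributions
pB = B / B.sum()
print(A.sum() - n // 2)                 # support-size deviation, on the order of sqrt(n)
print(np.abs(pA - pB).sum())            # l1 distance, close to 1
```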
The contribution of this section is to provide a very simple alternative algorithm with a slightly weaker trade-off between ρ_u and ρ_q. Section 5 evaluates the simplified algorithm experimentally. Our main theorem is:

Theorem 4.2. Suppose n and k are polynomially related, s ≥ 2, and that s is such that n/s ≥ C·log(k)/ε² for a sufficiently large constant C. Let ε > 0 and ρ_u > 0 be given. There exists a data structure for the HUDE(s, ε) problem using space O(k^{1+ρ_u} + nk) and with query time O(k^{1 − ερ_u/(2·log(2s))} + n/s). (The requirement n/s ≥ log(k)/ε² is the information-theoretic lower bound for the density estimation problem.)

Let us compare the upper bound of Theorem 4.2 to the lower bound in Theorem 3.1. While Theorem 4.2 is stated for half-uniform distributions, its proof is easily modified to work for the URDE(1/2, s) problem where the support size is random. Then ε = 1 − o(1) and, as s → ∞, 2/(1 − e^{−2/s}) = s·(1 + o(1)). Thus, the query time exponent in Theorem 4.2 is ρ_q = 1 − (1 + o(1))·log(2)·ρ_u/log s. URDE(1/2, s) is exactly the hard instance in Theorem 3.1, and so we know that any algorithm must have ρ_q ≥ 1 − (1 + o(1))·ρ_u/log s as s → ∞. Asymptotically, our algorithm therefore gets the right logarithmic dependence on s, but with a leading constant of log(2) ≈ 0.693 instead of 1.

Next we define the algorithm. Let ℓ and L be parameters which we will specify shortly. During preprocessing, our algorithm samples L subsets S_1, ..., S_L of [n], each of size ℓ, independently and uniformly at random. For each i ≤ L, it stores the set A_i of all indices j ∈ [k] such that S_i ⊆ T_j, namely the indices of the distributions p_j which contain S_i in their support. See Algorithm 1.

During the query phase, we receive n/s samples from p = p_{i*} for some unknown i* ∈ [k]. Our algorithm first forms the subset Q ⊆ [n] consisting of all elements that were sampled at least once. Note that |Q| ≤ n/s, as elements can be sampled several times. The algorithm proceeds in two steps. First, it goes through the L sets S_1, ..., S_L until it finds an i such that S_i ⊆ Q. Second, it scans through the indices j ∈ A_i. For each such j, it samples a set U_j, one element at a time, from Q. It stops this sampling at the first point in time where either U_j ⊄ T_j or |U_j| = C·log(n)/ε for a sufficiently large constant C. In the first case, it concludes that p_j is not the right distribution and proceeds to the next element of A_i; in the latter case, it returns the distribution p_j as the answer to the density estimation problem. See Algorithm 2. We defer the formal proofs to Appendix B.

Algorithm 1 Density estimation for half-uniform distributions (preprocessing)
1: Input: Half-uniform distributions {p_i}_{i=1}^k over [n] with support sets {T_i}_{i=1}^k.
2: Output: A data structure for the density estimation problem.
3: procedure PREPROCESSING({p_i}_{i=1}^k)
4:   for i = 1 to L do
5:     S_i ← sample of size ℓ from [n]
6:     A_i ← {j ∈ [k] | S_i ⊆ T_j}
7:   Return (S_i, A_i)_{i ∈ [L]}

Algorithm 2 Query algorithm for half-uniform distributions
1: Input: Half-uniform distributions {p_i}_{i=1}^k over [n] with support sets {T_i}_{i=1}^k, data structure (S_i, A_i)_{i ∈ [L]} ← PREPROCESSING({p_i}_{i=1}^k), and n/s samples from the query distribution p = p_{i*}.
2: Output: A distribution p_j.
3: procedure DENSITY-ESTIMATION({p_i}_{i=1}^k)
4:   Q ← {i ∈ [n] | i appeared in the n/s samples}
5:   for i = 1 to L do
6:     if S_i ⊆ Q then
7:       for j ∈ A_i do
8:         U_j ← sample from Q of size C·log(n)/ε for a large constant C
9:         if U_j ⊆ T_j then
10:          Return p_j
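For concreteness, here is a minimal Python rendering of Algorithms 1 and 2 (a sketch under our own choices for L, ℓ, and the test budget C·log(n)/ε; it is not the experimental code of Section 5).

```python
import math
import random

def preprocess(supports, n, L, ell, seed=0):
    """Algorithm 1 (sketch): L random ell-subsets S_i of [n], and for each one the
    list A_i of distributions whose support T_j contains S_i."""
    rng = random.Random(seed)
    structure = []
    for _ in range(L):
        S = set(rng.sample(range(n), ell))
        A = [j for j, T in enumerate(supports) if S <= T]
        structure.append((S, A))
    return structure

def query(structure, supports, samples, n, eps, C=4, seed=1):
    """Algorithm 2 (sketch): find an S_i contained in the sampled set Q, then test each
    candidate j in A_i with up to C*log(n)/eps random samples from Q, rejecting j as
    soon as a sampled element falls outside T_j."""
    rng = random.Random(seed)
    Q = list(set(samples))
    Q_set = set(Q)
    budget = int(C * math.log(n) / eps) + 1
    for S, A in structure:
        if S <= Q_set:
            for j in A:
                if all(rng.choice(Q) in supports[j] for _ in range(budget)):
                    return j                  # U_j ⊆ T_j: report p_j
    return None                               # no S_i ⊆ Q was found
```

In the experiments of Section 5, the scan over the candidate list A_i is handled by the Elimination subroutine described there rather than by the random test above.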
5 Experiments

We test our algorithm from Section 4 experimentally on datasets of half-uniform distributions, as in [1] and corresponding to our study in Sections 3 and 4. Given parameters k and n, the input distributions are k distributions, each uniform over a randomly chosen set of n/2 elements.

Algorithms: We compare two main algorithms: an implementation of the algorithm presented in Section 4, which we refer to as the Subset algorithm, and a baseline for half-uniform distributions, which we refer to as the Elimination algorithm. The Subset algorithm utilizes the same techniques as presented in Section 4 but with some slight changes geared towards practical usage. We do not compare to the Fast Tournament of [1] since it was computationally prohibitive; see Remark C.1.

The Subset algorithm picks L subsets of size ℓ and preprocesses the data by constructing a dictionary mapping each subset to the distributions whose support contains that subset. When a query arrives, we scan through the L subsets until we find one that is contained in the support of the query. We then restrict ourselves to solving the problem over the set of distributions mapped to by that subset and run Elimination.

The Elimination algorithm goes through the samples one at a time. It starts with a set of distributions, which is the entire input in general or a subset of the input when called as a subroutine of the Subset algorithm. To process a sample, the Elimination algorithm removes from consideration all distributions which do not contain that element in their support. When a single distribution remains, the algorithm returns that distribution as the solution. As the input distributions are random half-uniform distributions, we expect to throw away half of the remaining distributions (other than the true distribution) at each step and terminate after a number of steps logarithmic in the size of the initial set of distributions.
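A minimal version of the Elimination baseline (our sketch, not the exact code used in the experiments) looks as follows.

```python
def elimination(supports, samples, candidates=None):
    """Elimination baseline (sketch): process samples one at a time and discard every
    candidate whose support does not contain the sampled element."""
    alive = set(range(len(supports))) if candidates is None else set(candidates)
    for x in samples:
        alive = {j for j in alive if x in supports[j]}
        if len(alive) == 1:
            return alive.pop()
    return next(iter(alive)) if alive else None
```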
Experimental Setup: Our experiments compare the Subset and Elimination algorithms while varying several problem parameters: the number of distributions k, the domain size n, the number of samples S (for simplicity, we use this notation rather than n/s samples as in the rest of the paper), and the size ℓ of the subsets used by the Subset algorithm. While we vary one parameter at a time, the others are set to defaults of k = 50000, n = 500, S = 50, ℓ = 3. Given these parameters, we evaluate the Subset algorithm and the Elimination algorithm on 100 random queries, where each query corresponds to picking one of the input distributions as the true distribution to draw samples from. In all settings we test, the Elimination algorithm achieves 100% accuracy on these queries, which is to be expected as long as the number of samples is sufficiently larger than log_2 k. There is a remaining free parameter, which is the number of subsets L used in the Subset algorithm. We start with a modest value of L = 200 and increase L by a factor of 1.5 repeatedly until the Subset algorithm also achieves 100% accuracy on the queries (in reality, its failure probability will likely still be greater than that of the Elimination algorithm). The results we report correspond to this smallest value of L for which the algorithm got all the queries correct.

Figure 2: Comparison of the efficiency of the Subset (Ours) and Elimination algorithms as (a) the number of distributions k varies (other parameters set to n = 500, S = 50, ℓ = 3); (b) the domain size n varies (other parameters set to k = 50000, S = 50, ℓ = 3); (c) the number of samples S varies (other parameters set to k = 50000, n = 500, ℓ = 3); (d) the subset size ℓ varies (other parameters set to k = 50000, n = 500, S = 50).

For both algorithms, we track the average number of operations as well as the execution time of the algorithms (not counting preprocessing). A single operation corresponds to a membership query checking whether a given distribution/sample contains a specified element in its support, which is the main primitive used by both algorithms. We use code from [1] as a basis for our setup.

Results: For all parameter settings we test, the number of operations per query used by our Subset algorithm is significantly smaller than that required by the Elimination algorithm, by a factor of up to more than 6×. The average query time (measured in seconds) shows similar improvements for the Subset algorithm, though for some parameter settings it takes more time than the Elimination algorithm. While, in general, operations and time are highly correlated, these instances where they differ may depend on the specific Python data structures used to implement the algorithms, cache efficiency, or other computational factors.

As the number of distributions k increases, Figure 2a shows that both time and number of operations scale linearly. Across the board, our Subset algorithm outperforms the Elimination algorithm baseline and exhibits a greater improvement as k increases. On the other hand, as the domain size increases in Figure 2b, the efficiency of the Subset algorithm degrades while the Elimination algorithm maintains its performance. This is due to the fact that for larger domains, more subsets are needed in order to correctly answer all queries, leading to a greater runtime. In Figure 2c, we see that across the board, as we vary the number of samples, the Subset algorithm has a significant advantage over the Elimination algorithm in query operations and time. Finally, Figure 2d shows that for subset sizes ℓ ∈ {2, 3, 4}, the Subset algorithm experiences a significant improvement over the Elimination algorithm. But for ℓ = 5, the improvement (at least in terms of time) suddenly disappears: for this setting, that subset size requires many subsets in order to get high accuracy, leading to longer running times.

Overall, on flat distributions for a variety of parameters, our algorithm has significant benefits even over a baseline tailored for this case. The good performance of the Subset algorithm corresponds with our theory and validates the contribution of providing a simple algorithm for density estimation in this hard setting.

Acknowledgments: This work was supported by the Jacobs Presidential Fellowship, the MathWorks Fellowship, the NSF TRIPODS program (award DMS-2022448), and a Simons Investigator Award; it was also supported in part by NSF (CCF-2008733) and ONR (N00014-22-1-2713). Justin Chen is supported by an NSF Graduate Research Fellowship under Grant No. 174530. Shyam Narayanan is supported by an NSF Graduate Fellowship and a Google Fellowship. Anders Aamand was supported by the DFF-International Postdoc Grant 0164-00022B and by the VILLUM Foundation grants 54451 and 16582.

References

[1] Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, and Sandeep Silwal. Data structures for density estimation.
In International Conference on Machine Learning, 2023.

[2] Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Maximum selection and sorting with adversarial comparators. The Journal of Machine Learning Research, 19(1):2427–2457, 2018.

[3] Thomas D. Ahle and Jakob B. T. Knudsen. Subsets and supermajorities: Optimal hashing-based set similarity search. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 728–739. IEEE, 2020.

[4] Maryam Aliakbarpour, Mark Bun, and Adam Smith. Hypothesis selection with memory constraints. Advances in Neural Information Processing Systems, 36, 2024.

[5] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 47–66. SIAM, 2017.

[6] Alexandr Andoni, Huy L. Nguyen, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten. Approximate near neighbors for general symmetric norms. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 902–913, 2017.

[7] Mark Bun, Gautam Kamath, Thomas Steinke, and Steven Z. Wu. Private hypothesis selection. Advances in Neural Information Processing Systems, 32, 2019.

[8] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam Smith, and Jonathan Ullman. The structure of optimal private tests for simple hypotheses. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 310–321, 2019.

[9] Tobias Christiani and Rasmus Pagh. Set similarity search beyond MinHash. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1094–1107, 2017.

[10] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.

[11] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer, 2001.

[12] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.

[13] Sivakanth Gopi, Gautam Kamath, Janardhan Kulkarni, Aleksandar Nikolov, Zhiwei Steven Wu, and Huanyu Zhang. Locally private hypothesis selection. In Conference on Learning Theory, pages 1785–1816. PMLR, 2020.

[14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613, 1998.

[15] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pages 2204–2235. PMLR, 2020.

[16] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100. PMLR, 2015.

[17] Henry Scheffé. A useful convergence theorem for probability distributions. The Annals of Mathematical Statistics, 18(3):434–438, 1947.

[18] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27.
Curran Associates, Inc., 2014. A Appendix: Omitted Proofs of Section 3 Theorem 3.4 (Explicit lower bound for random Gap SS instance). Consider any list-of-points data structure for solving the random Gap SS 1 2, wq which has expected space O(k1+ρu), uses expected query time O kρq o(1) , and succeeds with probability at least 0.99. Then we have the following lower bound for sufficiently small wq: ρq 1 w1 log 2 o(1) q + ρu 1+log wq . Proof. Our proof proceeds by explicitly calculating the lower bound given in Theorem 3.2 when wu = 1 2 and wq approaches 0. Recall that Theorem 3.2 states that if a list-of-points data structure solves Gap SS(wu, wq) for k points uses expected space k1+ρu, and has expected query time kρq ok(1), then for every α [0, 1], we have that αρq + (1 α)ρu inf tq,tu [0,1] tu =wu αD(T||P) d(tq||wq) d (tu||wu) + (1 α)D(T||P) d(tu||wu) where P = wq 0 wu wq 1 wu and T = arg inf T P E X T[X]= tq tu We denote the fraction in the right hand side of Equation 1 as F(tq, tu). Our goal is to provide a lower bound in the case wu = 1/2. First, notice that to satisfy T P (i.e. supp(T) supp(P)) and E X T[X] = tq tu available choice is T = tq 0 tu tq 1 tu . Plugging this in and expanding F(tu, tq), we get F(tu, tq) = (tu tq) log tu tq wu wq + α d(tu||wu) tu log tu wu α d(tq||wq) + tq log tq wq d(tu||wu) . (2) For wu = 1/2 fixed, and for fixed wq, α, tu = 1/2, we can consider F as a function of only tq. Because F is a continuously differentiable function in terms of tq, the infimum of F (for fixed wu, wq, α, tu) can only be achieved either when F/ tq = 0, or at the boundary points (tq = 0 or tq t u ). We first consider the case when the partial derivative is 0 and handle the endpoint cases later. Calculating the partial derivative of F with respect to tq gives us F tq = log tq wq log tu tq wu wq α log tq(1 wq) wq(1 tq). When F tq = 0, we must have tu tq wu wq = tq Plugging in the relation in (3) and wu = 1 F(tu, tq) = α + (tu tq) log tu tq wu wq tu log tu wu α d(tq||wq) + tq log tq wq d(tu||wu) = α + tu log tu tq wu wq log tu tq log tu tq wu wq α d(tq||wq) + tq log tq wq d(tu||wu) = α + tu log tu tq wu wq log tu tq log tu tq wu wq + (1 α) tq log tq wq α (1 tq) log 1 tq 1 wq d(tu||wu) = α + tu log tu tq wu wq log tu 1 wq d(tu||wu) (Plugging in equation 3) = α + tu log 1 tq + log 1 1 2wq 1 wq d(tu|| 1 2) . (Plugging in wu = 1 By Lemma A.1, if we set α = 1 + 1 log wq for wq is sufficiently small, we have F(tu, tq) α w1 log 2 o(1) q , uniformly over tu, tq. Next, let s check the boundary cases. For tq = 0, Lemma A.8 proves that F(tu, 0) α w1 o(1) q α w1 log 2 o(1) q . For tq t u , because F tq is continuous for 0 tq < tu and lim tq t u F tq = + , the infimum of F cannot be achieved when tq t u . Thus, for wu = 1/2, any fixed wq sufficiently small, α = 1 + 1 log wq , and any fixed tu = 1/2 and the infimum of F across 0 tq < tu is at least α w1 log 2 o(1) q , where the o(1) term goes to 0 as wq goes to 0, uniformly across tu, tq. So in fact, F(tu, tq) α w1 log 2 o(1) q uniformly across tu, tq. Applying this bound back to our original inequality in 1 gives us the desired bound. Our goal is to now bound the fraction in the final step of the proof of Theorem 3.4. The following lemma bound this fraction. Lemma A.1. Fix any constant δ > 0. Suppose wq < 1 is smaller than a sufficiently small constant c = cδ that only depends on δ, wu = 1/2, and α = 1 + 1 log wq . Suppose these parameters satisfy the relation given in Equation 3. 
Then inf tq,tu [0,1] tu =wu tu log 1 tq + log 1 1 2wq 1 wq d(tu||wu) w1 log 2 δ q . Equivalently, inf tq,tu [0,1] tu =wu tu log 1 tq + log 1 1 2wq 1 wq d(tu||wu) w1 log 2 o(1) q for sufficiently small wq, where o(1) denotes a term that uniformly goes to 0 as wq 0 (regardless of tq, tu). In the rest of this section, we use o(1) to denote any term that goes to 0 as wq 0, uniformly over tq, tu. We will prove some auxiliary lemmas before proving Lemma A.1. We first define the following function H(x). Definition A.2. For a value x, H(x) := x log(2x) + (1 x) log(2(1 x)). It is clear that H(x) is only defined when 0 < x < 1. Moreover, we note the following basic property and provide its proof for completeness. Proposition A.3. For any x (0, 1), 2( 1 2 x)2 H(x) 16( 1 Proof. It is simple to check that H(1/2) = 0 and H (1/2) = 0. Moreover, the second derivative is H (x) = 1 x + 1 1 x = 1 x x2 . For 0 < x < 1, x x2 1/4, so H (x) 4 for all x. Thus, H (x) 2( 1 Next, we have that x x2 3 16 for x [1/4, 3/4], which means H (x) 16 3 for x [1/4, 3/4]. Thus, H(x) 8 2 x)2 for x [1/4, 3/4]. Since the first derivative H (x) = log(x) log(1 x) is negative for x < 1 2 and positive for x > 1 2, this means H(x) is maximized as x approaches either 0 or 1. But the limits equal log 2, so H(x) log 2 for all x. Since ( 1 2 x)2 1 16 for all 1 > x > 3 4 or 0 < x < 1 4, this means for any such x, H(x) log 2 16 log 2 ( 1 2 x)2 16 ( 1 2 x)2. So in either case, H(x) 16( 1 We are now ready to prove Lemma A.1. Proof of Lemma A.1. For simplicity of notation, let x = tq. Recalling the value of α, wu, and Equation 3, we have Now we denote the fraction in the lemma statement as b(x)/a(x) and we note a = H(tu) = tu log (2tu) + (1 tu) log (2 (1 tu)) , b = α log 1 x + tu log 1 x + log 1 1 2wq From Proposition A.4, it suffices to check the following cases: 1. 0 < x w1.01 q , or w0.99 q x < x , 2. w1.01 q < x < (1 + 1 log wq )wq, or (1 1 log wq )wq < x < w0.99 q , 3. and (1 + 1 log wq )wq x (1 1 log wq )wq. In Case 1, x is such that x (0, x ) is the regime of x such that a and b are well defined. Proposition A.4 further states that x only depends on wq and x w1 log 2 o(1) q as wq 0. The cases are handled in Lemmas A.5, A.6, and A.7, respectively. Combining them proves Lemma A.1. Proposition A.4. Suppose that 0 < wq < 1 3 and 0 x 1. Then, the regime of x such that a, b are well defined is x (0, x ), where 0 < x < 1 only depends on wq and x w1 log 2 o(1) q as wq 0. Proof. We must have that 0 < tu < 1, so that a = H(tu) is well-defined. If x = 0, tu = 0, and if x = 1, then log 1 x 1 wq isn t defined, so b isn t defined. If 0 < x < 1, tu is always positive, so a being well-defined is equivalent to 1 tu > 0, which is equivalent to We can rearrange this as 1 x 1/(log wq) = 1 x 1 α > 1/2 wq w1 α q (1 wq)α = e(1/2 wq) (1 wq)1+1/(log wq) . We can again rearrange this as x > C(wq) := e(1/2 wq) (1 wq)1+1/(log wq) where C(wq) is positive if wq < 1/3. This is equivalent to is equivalent to 0 < x < 1 C(wq)+1. So, we can set x = 1 C(wq)+1. Thus, a is well defined if and only if 0 < x < x , where x (0, 1) clearly holds. Moreover, if wq < 1/3, then log 1 x 1 wq and log 1 1 2wq are well-defined, and tu > x > 0 so is also well-defined. In summary, a, b are well defined if and only if 0 < x < x = Finally, we bound x for wq sufficiently small. Note that (1 wq)1+1/(log wq) = 1 + o(1) and 1/2 wq = 1/2 o(1), so C(wq) = e 2 (1 o(1)) log wq = (e/2) log wq (1 o(1)) = wlog 2 1 o(1) q . Thus, C(wq) wlog 2 1+o(1) q , which means that x w1 log 2 o(1) q . Lemma A.5. 
Suppose that wq is sufficiently small, and 0 < x w1.01 q , or w0.99 q x < x . Then, b a w1 log 2 o(1) q as wq 0. Proof. Since x is strictly positive, we can write x = w1+γ q , for some real γ with |γ| 0.01. Then, x wq = wγ q , so x wq 1 α = w γ/ log wq q = e γ. Finally, since x = o(1) and wq = o(1), this means tu = x + e γ 1 2 o(1) = 0.5 e γ o(1) (e γ + 1). Since |γ| > 0.01, this implies that, for wq sufficiently small, |tu 0.5| Ω(1), which means a = H(tu) = Ω(1) by Proposition A.3. We can also bound b as follows. Note that log 1 x 1 wq = O(x + wq), so α log 1 x 1 wq O(x + wq). Next, note that for wq sufficiently small (and thus x < x is also sufficiently small), since 0 < α, 1 α < 1, 1 x 1 wq α 0.5, and x wq 1 α x1 α x. Also, 1 6 x. Therefore, 0 x 7, which means log 1 x O(x/tu). Thus, tu log 1 x + log 1 1 2wq O(tu (x/tu +wq)) = O(x+wq). Thus, |b| O(x+wq). Since a = Ω(1) is positive, and b O(x + wq), this means that b a O(x + wq). Since we know that x < x w1 log 2 o(1) q , this implies that b a w1 log 2 o(1) q . Lemma A.6. Suppose that w1.01 q < x < (1 + 1 log wq )wq, or (1 1 log wq )wq < x < w0.99 q . Then, b a w0.99 o(1) q . Proof. We again write x = w1+γ q , where this time |γ| < 0.01. Since either x > (1 1 log wq )wq or x < (1 + 1 log wq )wq, this means that |γ| Ω 1 (log wq)2 . Then, we can write tu = O(w0.99 q ) + w γ/(log wq) q 1 2 O(w0.99 q ) = e γ 2 O(w0.99 q ) = 1 since 0.01 > |γ| Ω 1 log wq so the w0.99 q term is negligible compared to γ. So, a = H(tu) = Θ(γ2) Ω 1/(log wq)4 , by Proposition A.3. Next, to bound b, note that log 1 x 1 wq = O(x + wq) O(w0.99 q ). Also, since tu = e γ O(w0.99 q ), this means that tu = Θ(1), so tu log 1 x + log 1 1 2wq O(x + wq) = O(w0.99 q ). Overall, |b| O(w0.99 q ), so b a w0.99 o(1) q . Lemma A.7. Suppose that (1 + 1 log wq )wq x (1 1 log wq )wq. Then, b a w1 o(1) q . Proof. Suppose that x = (1 + β)wq, where |β| 1 log wq . We will look at tu from a more fine grained perspective. Note that x wq 1 α = (1 + β) 1/(log wq) = e (β O(β2))/(log wq) = 1 β log wq O β2 Moreover, 1 x 1 wq α = 1 βwq 1 wq α = 1 βwq 1 wq α O(β2w2 q). Thus, α = 1 β log wq βwq 1 wq α O β2 Thus, we can write tu = wq + βwq + 1 β log wq βwq 1 wq α O β2 2 + βwq β 1 log wq + wq 1 wq α 1 Note that tu = 1 2 Θ(β/ log wq), since the other terms in (5) are all negligible compared to β/ log wq, which means that a = Θ(β2/(log wq)2) by Proposition A.3. To bound b, we will need the more fine-grained approximation of tu from (5). Indeed, note that log 1 x 1 wq = log 1 βwq 1 wq 1 wq O(β2w2 q), and tu log 1 x + log 1 1 2wq tu log tu x tu(1 2wq) = tu log 1 + 2wqtu x tu(1 2wq) . Note that 2wqtu = 2(1/2 Θ(β/ log wq))wq = wq Θ(β wq/ log wq). Thus, 2wqtu x = Θ(β wq). Moreover, tu(1 2wq) = Θ(1). Therefore, we have that tu log 1 + 2wqtu x 2wqtu x tu(1 2wq) O 2wqtu x 1 2wq O(β2w2 q). b = α βwq 1 wq + 2wqtu x 1 2wq O(β2w2 q). Using the more fine-grained approximation of tu, we have that 2wqtu x = wq + 2βw2 q wqβ 1 log wq + wq 1 wq α (1 2wq) (wq + βwq) O(β2wq) = (2βw2 q βwq) wqβ 1 log wq + wq 1 wq α (1 2wq) O(β2wq) = wqβ(1 2wq) wqβ 1 log wq + wq 1 wq α (1 2wq) O(β2wq) = wqβ 1 1 log wq wq 1 wq α (1 2wq) O(β2wq) = wqβ α 1 wq (1 2wq) O(β2wq), where the last line uses the fact that α = 1 + 1 log wq . So, b = α βwq 1 wq wqβ α 1 wq O(β2wq) = O(β2wq). Therefore, b a O(wq (log wq)2) w1 o(1) q . Finally we check the endpoint cases of Theorem 3.4. Lemma A.8. For F(tu, tq) defined in Equation 2, and for α = 1 + 1 log wq , F(tu, 0) α w1 o(1) q where o(1) goes to 0 as wq goes to 0, uniformly over tu. Proof. 
For tq = 0, we can calculate that F = tu log tu 1/2 wq tu log(2tu) αd(tq||wq) d(tu||1/2) = α + α log(1 wq) tu log(1 2wq) d(tu||1/2) . Let us first consider the term α log(1 wq) tu log(1 2wq). If we were to set tu = 1/2, this equals 1 + 1 log wq log(1 wq) 1 2 log(1 2wq) = log(1 wq) 1 2 log(1 2wq)+ 1 log wq log(1 wq) = wq log wq O(w2 q). For sufficiently small wq, | log(1 2wq)| 3wq, which means that α log(1 wq) tu log(1 2wq) = wq log wq O(w2 q + |tu 1/2| wq). Note that wq log wq is positive, since log wq is negative. If wq is sufficiently small and |tu 1/2| c log wq for some sufficiently small constant c, then the term O(w2 q + |tu 1/2| wq) is smaller than wq log wq , which means α log(1 wq) tu log(1 2wq) 0. Because d(tu||1/2) is nonnegative, this means F α α w1 o(1) q . Alternatively, if |tu 1/2| c log wq , then we still have α log(1 wq) tu log(1 2wq) O(w2 q + |tu 1/2| wq) O(wq), and by Proposition A.3, d(tu||1/2) = H(tu) = Θ((tu 1/2)2) Ω 1 log wq 2 . So, F α O wq (log wq)2 α w1 o(1) q . So, in either case, we have that F(tu, 0) α w1 o(1) q , where the o(1) term does not depend on tu. B Appendix: Omitted proofs of section 4 Lemma B.1. The expected space usage of Algorithm 1 is O(Lℓ+ Lk2 ℓ+ nk). Proof. The algorithm has to store each of the L sets Si which requires space O(Lℓ). Furthermore, it needs to store the sets Ai which each have expected size O(2 ℓk). Indeed, the probability that a random subset of size ℓis contained in a given Tj is at most 2 ℓ. Finally, the algorithm needs to store the sets Tj which takes O(nk) space. Lemma B.2. The expected query time of Algorithm 2 is O(Lℓ+ k ε (1 ε/2)ℓ+ n Proof. First, the algorithm forms the set Q which takes O(n/s) time. Then, the algorithm goes over the L sets Si until it finds an i such that Si Q. This takes time O(Lℓ). Next, the algorithm goes through the indices j Ai. For each such j, it samples the set Uj one element at a time checking if Uj Tj. Let us first bound the expected size of Ai. We clearly have that i Ai. Indeed, Si Q Ti and Ai lists all the j such that Si Tj. Now for a j [k] with j = i , the assumption that pj pi 1 > ε, gives that |Tj Ti | n( 1 4). As the sampling of S1, . . . , SL and Q are independent, we can view Si as a random size-ℓsubset of Ti . In particular, the probability that Si Tj can be upper bounded by |Tj Ti | ℓ n (1/2 ε/4) ℓ = (1 ε/2)ℓ and so, the expected size of |Ai| is at most k(1 ε/2)ℓ. Finally, using the assumption that pi pj 1 > ε for j = i , and that n/s (log k)/ε2, by a standard concentration bound, it holds for any j = i that |Q Tj| |Q|(1 ε/4) with high probability in k. In particular, for j = i , we only expect to include O(1/ε) samples in Uj before observing that Uj Tj. In conclusion, the expected query time is O( n ε (1 ε/2)ℓ), as desired. Lemma B.3. Let L = C 2 1 e 2/s ℓ for a sufficiently large constant C. Assume that L = k O(1). Then Algorithm 2 returns i with probability at least 0.99. Proof. It is readily checked that ℓ= O(log k) = O(log n), and further, for s > 10, it holds that ℓ= O( log k log s ) = O( log n log s ). We record this for later use. Note that by standard concentration bounds, it holds with high probability in k that Q Tj only for j = i . In fact, as in the previous proof, for all j = i , |Q Tj| |Q|(1 ε/4) with high probability in k. In particular, this implies that the algorithm with high probability never returns a j = i . 
Indeed, by a union bound, the probability of this happening is at most k Pr[Uj Tj] k(1 ε/4)|Uj| = k(1 ε/4)C log n/ε kn C/4 n 10, where we used that k and n are polynomially related and C is sufficiently large. In order for the algorithm to succeed, it therefore suffices to show that there exists an i [L] such that Si Q. In that case, the algorithm will indeed return pi with high probability. The expected size of Q is and by a standard application of Azuma s inequality, it holds with high probability in n that 1 e 2/s O((n/s)0.51). (6) Note that the probability that a single set Si is contained in a fixed set Q is Using the bounds on ℓin the beginning of the proof, we find that ℓ O((n/s)0.51). In particular, conditioning on the high probability event of Equation (6), the probability in Equation (7) is at least, 1 e 2/s 2 O 1 n0.49s0.51 ℓ c 1 e 2/s for some constant c > 0. The probability that no Si is contained in Q is at most 1 c 1 e 2/s We thus choose L = 5 c 2 1 e 2/s ℓ to ensure that this probability is at most e 5 < 1/100 and the result follows. We can now prove our main theorem of Section 4. Theorem 4.2. Suppose n and k are polynomially related, s 2, and that s is such that5 n ε2 for a sufficiently large constant C. Let ε > 0 and ρu > 0 be given. There exists a data structure for the HUDE(s, ε) problem using space O(k1+ρu + nk) and with query time O k1 ερu 2 log(2s) + n/s . Proof of Theorem 4.2. Let us pick L = kρu. We further choose ℓsuch that L = C 2 1 e 2/s ℓ for some sufficiently large constant C as in Lemma B.3. In particular, ℓ ρu lg(k), so we obtain that the space bound from Lemma B.1 is O(k1+ρu + nk). On the other hand, the bound on the query time in Lemma B.2 is O kρu log k + k ε (1 ε/2)ℓ+ n/s = O kρu log k + 1 εk 1+ρu log(1 ε 2 )/log 2 1 e 2/s + n/s , as desired. Finally, by the choice of L and ℓ, it follows by Lemma B.3 that the algorithm returns i with probability at least 0.99. Note that while vastly simplified from the expression of Theorem 3.2 from [3], the expression for the query time in Theorem 4.2 is unwieldy. For a simplified version of the theorem, one can note that log(1 ε 2) ε/2 and for s 2, we have that 2 1 e 2/s 2s, resulting in the claimed bound. Remark B.4. Theorem 4.2 is stated for the promise problem of Definition 4.1 where all distributions p P with p = pi have p pi 1 ε. However, even if this condition is not met, we can still guarantee to return a distribution pj such that pj pi ε with probability 0.99 and only a slight increase in the query time. Indeed, as in the proof of Lemma B.3 as long as there exists an Si Q, the list Ai will contain pi . Moreover, as in the proof of that lemma, the algorithm will never return a pj with pj pi > ε. Thus it will either return pi when it encounters it in the list Ai, or a distribution pj with pj pi ε. The only difference is that now the bound on the number of samples included in Uj in Lemma B.2 becomes O(log n/ε) instead of O(1/ε) with a corresponding blow up by a log n factor in the query time. C Remark about experimental setting Remark C.1. The prior work of [1] also tested their Fast Tournament algorithm on random halfuniform distributions. That algorithm works for the general problem and not just flat distributions, but its generality makes it is much less efficient for the special case we consider. 
From their experiments, in the setting where k = 8192, n = 500, and S = 50, the best setting of parameters achieves less than 98% accuracy using more than 400000 operations (their notion of operation corresponds to querying the probability mass at a specific index for two distributions, while our notion of operation corresponds to querying whether a specific index is in the support of a single distribution/sample). For a comparable setting of k = 10000, n = 500, S = 50, our algorithm uses fewer than 20000 operations, more than a 20× improvement. In addition to having a much better query time than the general algorithm, our algorithm has subquadratic preprocessing time of O(kLℓ), while the tournament-based algorithm requires O(k²n) time, which is prohibitive for the parameter settings we test. For these reasons, we restrict our comparisons to our Subset algorithm and the Elimination algorithm baseline, both tailored for the half-uniform setting.