# data_structures_for_density_estimation__02add728.pdf Data Structures for Density Estimation Anders Aamand * 1 Alexandr Andoni 2 Justin Y. Chen 1 Piotr Indyk 1 Shyam Narayanan 1 Sandeep Silwal 1 Abstract We study statistical/computational tradeoffs for the following density estimation problem: given k distributions v1, . . . , vk over a discrete domain of size n, and sampling access to a distribution p, identify vi that is close to p. Our main result is the first data structure that, given a sublinear (in n) number of samples from p, identifies vi in time sublinear in k. We also give an improved version of the algorithm of (Acharya et al., 2018) that reports vi in time linear in k. The experimental evaluation of the latter algorithm shows that it achieves a significant reduction in the number of operations needed to achieve a given accuracy compared to prior work. 1. Introduction Finding the hypothesis that best matches a given set of samples, i.e., the density estimation problem, is a fundamental task in statistics and machine learning. In the finite setting, this task can be formulated as follows: given k distributions v1, . . . , vk over some domain U and access to samples from a distribution p over U, output vi that is close to p. In the proper case, we assume that p = vj for some 1 j k, and the goal is to output vi such that p vi 1 ε for some error parameter ε > 0. In the more general improper case, p is arbitrary and the goal is to report vi such that p vi 1 C min j p vj 1 + ε for some constant C > 1 and error parameter ε > 0. The density estimation problem has been studied extensively. The work of Scheffe (Scheffe, 1947), Yatracos (Yatracos, Authors listed alphabetically. 1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 2Data Science Institute, Columbia University, New York, NY 10027, USA. Correspondence to: Anders Aamand , Alexandr Andoni , Justin Y. Chen , Piotr Indyk , Shyam Narayanan , Sandeep Silwal . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1985), Devroye-Lugosi (Devroye & Lugosi, 2001) showed that O(log(k)/ε2) samples are sufficient to solve the improper version of the problem for some constant C > 1 with a constant probability in O(k2 log(k)/ε2) time. Recently Acharya et al (Acharya et al., 2018) improved the time to O(k log(k)/ε2) time. Despite the generality of the formulation, this approach yields almost optimal sampling bounds for many natural classes of distributions such as mixtures of Gaussians (Daskalakis & Kamath, 2014; Suresh et al., 2014; Diakonikolas et al., 2019), see also the survey (Diakonikolas, 2016). Furthermore, the method has been extended to enable various forms of privacy (Canonne et al., 2019; Bun et al., 2019; Gopi et al., 2020; Kamath et al., 2020). When the domain U is a discrete set of size n, the problem can be viewed as a variant of approximate nearest neighbor search over n-dimensional vectors under the L1 norm, a problem that has been studied extensively (Andoni et al., 2018). Specifically, given a set of k distributions v1, . . . , vk, the goal of C-approximate nearest neighbor search is to build a data structure that, given any query distribution p, returns vi such that p vi 1 C minj p vj 1. However, in the density estimation problem, the query distribution p is not specified fully, but instead the algorithm is given only samples from p. 
This makes the problem substantially richer and more complex, as in addition to the usual computational resources (data structure space, query time), one also needs to consider the number of samples taken from p, a statistical resource. Thus, designing data structures for this problem involves making tradeoffs between the computational and statistical resources, with the known algorithms forming the endpoints of the tradeoff curve. In particular: If the goal is to optimize the computational efficiency of the data structure, one can learn the query distribution p up to an additive error of ε using O(n/ε2) samples (Kamath et al., 2015), and use any c-approximate nearest neighbor algorithm for the L1 norm. In particular, the algorithm of (Andoni & Razenshteyn, 2015) yields (roughly) O(nk + k1+1/(2c 1)) space and O(nk1/(2c 1)) query time. This leads to an algorithm with polynomial space and sublinear (in k) query Data Structures for Density Estimation 2 time, but at the price of using a linear (in n) number of samples. If the goal is to optimize the statistical efficiency of the data structure, then the aforementioned result of (Acharya et al., 2018) yields a data structure that uses only a logarithmic number of samples, but at the price of near linear (in k) query time. These two point-wise results raise the question of whether best of both worlds data structures exists, i.e., whether there are data structures which are efficient in both statistical and computational terms. This is the focus of the paper. Our results In this paper we initiate the study of computational/statistical tradeoffs of the (discrete) density estimation problem. Our main theoretical contributions are as follows: We present the first data structure that solves the proper version of the problem with polynomial space, sublinear query time and sublinear sampling complexity. This demonstrates that one can achieve non-trivial complexity bounds on all three parameters (space, query time, sampling complexity) simultaneously. We also present an improved version of the data structure from (Acharya et al., 2018) for the improper case, reducing its query time from O(k log(k)/ε2) to O(k/ε2), i.e., linear in k, while retaining its optimal sampling complexity bound of O(log(k)/ε2). On the empirical side, we experimentally evaluate the lineartime algorithm and compare its performance to the (Acharya et al., 2018) algorithm on synthetic and on real networking data. These experiments display the practical benefits of the faster algorithm. For example, for synthetic data, we demonstrate that our faster algorithm achieves over 2x reduction in the number of comparisons needed to achieve the same level of accuracy. Similarly, our experiments on network data show up to 5x reduction in the number of comparisons. Open questions We view a major part of the contribution of this paper as initiating the study of statistical/computational tradeoffs in data structures for density estimation. We design a data structure with polynomial space and sublinear query and sampling complexity but only in the proper case and only achieving query and sampling bounds that are slightly sublinear in the input parameters. Extending these results (1) to the improper case, (2) with stronger sublinear bounds, or (3) showing lower bounds are all exciting open problems. 1.1. Preliminaries We assume the discrete distributions v1, . . . , vk over the domain [n] are fully specified. 
We assume k, the number of distributions, is much larger than n but still polynomially related to n. For instance, $n^{1.01} \le k \le n^C$ for $C = 100$ suffices. This mimics the classic nearest neighbor search (NNS) literature setting, where the aim is to get algorithms sublinear in the dataset size, which exactly corresponds to k in our setting. Let p be the unknown distribution we sample from. We actually attain results stronger than the truly proper case: we relax the equality condition and assume that there exists $v^*$ among the $v_i$'s such that $\|p - v^*\|_2 \le \frac{\varepsilon}{2\sqrt{n}}$.

We use the standard Poissonization trick and take $s' = \mathrm{Pois}(s)$ samples, where $\mathrm{Pois}(s)$ denotes a draw from a Poisson distribution with parameter s. This does not affect the sample complexity asymptotically since $s' = \Theta(s)$ with high probability. This is a standard trick used in distribution testing (Canonne, 2020; Szpankowski, 2011; Valiant & Valiant, 2011) and ensures that for all $i \in [n]$, the number of observed samples equal to i when sampling from p is distributed as $\mathrm{Pois}(s \cdot p(i))$, and the counts are independent across coordinates. We say $b \lesssim a$ for $a, b > 0$ if $b \le C \cdot a$ for some absolute positive constant C which does not depend on a or b.

We recall the guarantees of $\ell_\infty$ and $\ell_2$ NNS data structures.

Theorem 1.1 (Indyk, 2001). There exists some absolute constant C such that given a dataset $X \subset \mathbb{R}^n$ with $|X| = k$, for every $\delta \in (0, 1)$, we can solve the $\frac{C \log\log n}{\delta}$-approximate nearest neighbor problem on X in $\ell_\infty$ using $O(nk^{1+\delta})$ space and $O(n)$ query time.

The following theorem is obtained by a standard combination of the Johnson-Lindenstrauss randomized dimensionality reduction theorem and locality-sensitive hashing (Har-Peled et al., 2012).

Theorem 1.2 (Har-Peled et al., 2012). Given a dataset $X \subset \mathbb{R}^n$ with $|X| = k$, for every $c > 1$, we can solve the c-approximate nearest neighbor problem on X in $\ell_2$ using $(k^{1+\rho} + kn)\log^{O(1)} k$ space and $(k^\rho + n)\log^{O(1)} k$ query time, where $\rho \lesssim 1/c$.

Lastly, we define some convenient notation used shortly.

Definition 1.3. For a vector $x \in \mathbb{R}^n$ and a set $A \subseteq [n]$, let $x_A \in \mathbb{R}^n$ denote the vector which is equal to x on the coordinates in A and zero otherwise. $x(i)$ refers to the i-th coordinate of the vector x.
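To make the sampling model concrete, the following is a minimal Python sketch (ours, not from the paper's code release; function names are hypothetical) of Poissonized sampling, the empirical distribution, and the restriction notation of Definition 1.3, assuming a distribution represented as a length-n numpy array.

```python
import numpy as np

def poissonized_counts(p, s, rng):
    """Poissonization: the count of coordinate i is X(i) ~ Pois(s * p(i)), independently over i."""
    return rng.poisson(s * p)

def empirical(counts):
    """Empirical distribution p_hat with p_hat(i) = X(i) / s', where s' is the realized sample size."""
    total = counts.sum()
    return counts / total if total > 0 else np.zeros_like(counts, dtype=float)

def restrict(x, A):
    """x_A from Definition 1.3: equal to x on the coordinates in A and zero elsewhere."""
    out = np.zeros_like(x)
    out[list(A)] = x[list(A)]
    return out

rng = np.random.default_rng(0)
n, s = 1000, 200
p = np.full(n, 1.0 / n)            # uniform distribution over [n]
p_hat = empirical(poissonized_counts(p, s, rng))
print(np.abs(p_hat - p).max())     # coordinate-wise (l_infinity) error, in the spirit of Lemma B.3
```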
2. Motivating our algorithm

As we discuss in the next section, our algorithm works by considering the empirical sample distribution $\hat p$ (where $\hat p(i) = X(i)/s'$ if i is sampled $X(i)$ times) and delicately combines nearest neighbor search data structures for both the $\ell_2$ and $\ell_\infty$ distances. Before describing our algorithm, we first explain why simpler approaches do not work.

First, since we are interested in finding a close distribution in $\ell_1$ distance, it is a natural idea to simply construct an approximate $\ell_1$ nearest neighbor data structure on the known distributions and directly query it with $\hat p$. As the following lemma illustrates, this approach fails to return a close distribution even with access to an exact nearest neighbor data structure which returns a true nearest neighbor to the empirical distribution. Surprisingly, it can even be much worse than simply returning a uniformly random known distribution, which would succeed with probability 1/k. All proofs in this section are deferred to Appendix A.

Lemma 2.1. For any $k \le \binom{n}{n/2}$, there exist distinct distributions $p, q_1, \ldots, q_k$ over [n] such that $\|p - q_i\|_1 = 1$ for each $i \in [k]$ and, if $\hat p$ is the empirical distribution obtained from sampling $s = n/2$ elements from p, then the probability that $\|p - \hat p\|_1 \le \min_{i \in [k]} \|q_i - \hat p\|_1$ is $\exp(-\Omega(k))$. In fact, if $k \ge Cn\log n$ for a sufficiently large constant C, the probability is 0.

Another idea is to instead use an $\ell_2$ nearest neighbor data structure on the empirical distribution. Perhaps less surprisingly, this approach fails too, but whereas for $\ell_1$ the issue was the light elements, for $\ell_2$ it is the heavy elements. For simplicity, we consider the case of two distributions.

Lemma 2.2. There exist distributions p, q over [n] with $\|p - q\|_1 = 1$ such that for any sample size $s = o(n)$, with probability $\Omega(1)$, $\|p - \hat p\|_2 > \|q - \hat p\|_2$.

Finally, it should be clear that $\ell_\infty$ nearest neighbor search on the empirical distribution $\hat p$ does not work with $s = o(n)$ samples. Indeed, consider distributions where all probabilities are either 2/n or 0, and let i be the most sampled coordinate. Even with an exact nearest neighbor data structure, we could return an arbitrary distribution q such that $q(i) = 2/n$.
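To see the phenomenon behind Lemma 2.1 numerically, here is a small Monte Carlo sketch (our own illustration, not from the paper) following the construction in its proof: p is uniform over [n] and each $q_i$ is uniform over a random half of the domain, so $\|p - q_i\|_1 = 1$, yet some $q_i$ is almost always closer to $\hat p$ than p is.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 200, 500, 50
s = n // 2
p = np.full(n, 1.0 / n)

wins_for_p = 0
for _ in range(trials):
    counts = rng.multinomial(s, p)          # s = n/2 i.i.d. samples from p
    p_hat = counts / s
    d_p = np.abs(p_hat - p).sum()           # ||p_hat - p||_1
    best_q = np.inf
    for _ in range(k):
        A = rng.choice(n, size=n // 2, replace=False)
        q = np.zeros(n)
        q[A] = 2.0 / n                      # uniform over a random half; ||p - q||_1 = 1
        best_q = min(best_q, np.abs(p_hat - q).sum())
    wins_for_p += d_p <= best_q
print(wins_for_p / trials)  # close to 0: some q_i almost always beats p in l_1 distance to p_hat
```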
3. Main algorithm and proof intuition

As motivated by the previous section, our algorithm deals with heavy and light elements separately. The first challenge is to formally define the notions of heavy and light elements. The most natural choice is to declare a domain element heavy for a distribution v if the probability mass v places on the element is larger than some threshold $\gamma$. An issue arises when we want to employ this definition for p, the distribution we are sampling from. Since we only get sample access to p, we will have tiny, but non-negligible, error in our estimates of the probabilities of p, which introduces some ambiguity in the exact partition of the domain into heavy and light elements. Thus, we are motivated to seek an unequivocal definition of heavy and light.

Towards this goal, we can define a heavy/light partitioning of [n] according to one of the fixed distributions $v_i$, which we know fully, via Algorithm 1. It partitions the domain [n] with respect to a fixed threshold $\gamma$ based on the probability masses of $v_i$. When we sample from p, we wish to use one of these partitions of the domain (obtained by inputting the $v_i$'s into Algorithm 1) as the definition of heavy and light elements for p. However, we still wish to ensure that any element we define as heavy for p does not possess probability mass significantly greater than $\gamma$. Lemma B.3 proves that $\hat p$ and p are $O(1/\sqrt{n})$ close in $\ell_\infty$, where $\hat p$ is the empirical distribution after sampling from p. Let $v'$ denote the closest distribution to $\hat p$ among the $\{v_i\}_{i=1}^k$ in $\ell_\infty$ distance. As there exists a choice (namely $v^*$) which is $O(1/\sqrt{n})$ close to p (and $\hat p$) in $\ell_\infty$, we have $\|v' - p\|_\infty \le O(1/\sqrt{n})$. Hence, if we use the heavy/light partitioning of $v'$, then every domain element classified as heavy for p has probability mass at least $\gamma - O(1/\sqrt{n})$ and no light element has mass more than $\gamma + O(1/\sqrt{n})$.

The above discussion naturally leads us to our preprocessing algorithm, Algorithm 2. In Algorithm 2, we first group all the known distributions $\{v_i\}_{i=1}^k$ into k groups such that the ith group consists of all distributions which are close to $v_i$ in $\ell_\infty$ distance. Let $v'$ denote the closest distribution to $\hat p$ in $\ell_\infty$ and consider $v'$'s group, denoted $S^*$. We essentially use the heavy/light partitioning of the domain induced by $v'$ as the definition of heavy or light for p. All distributions in $v'$'s group will also share the same heavy/light partitioning as $v'$. By adjusting the log factors, we can ensure that $v^*$ is also included in $v'$'s group. Hence, we only need to consider the distributions in $S^*$. The key observation is that we can now completely disregard the heavy elements. This is because p will be sufficiently close to all distributions in $S^*$ in $\ell_1$ distance, restricted to the heavy elements. This is formalized in Lemma 3.4. Thus, it suffices to consider the distributions in $S^*$, restricted to the light domain elements. It turns out that $\ell_2$ is a suitable metric to use on the light elements. Therefore, our main algorithm simply performs an $\ell_2$ nearest neighbor search with query $\hat p$, looking for the closest distribution in $S^*$ to $\hat p$ restricted to the light domain elements. Since we know $v^*$ is in $S^*$, we can guarantee that we will find a distribution which is close in $\ell_1$ distance on the light elements. We conclude by combining this with the fact that all distributions in $S^*$ are already close to p on the heavy elements, so the total $\ell_1$ error must be small.

To summarize, the overall algorithm outline is the following. In the preprocessing step, we create an $\ell_\infty$ data structure on all the known distributions $\{v_i\}_{i=1}^k$. Then we group all of the known distributions into k groups, one for each $v_i$, with the property that all distributions in a fixed group are $O(1/\sqrt{n})$ close in $\ell_\infty$ distance. We additionally instantiate an $\ell_2$ nearest neighbor data structure within each group, but only on the light domain elements, which are defined consistently within each group (using $v_i$ for $v_i$'s group). See Algorithm 2 for details. For the final algorithm, presented formally in Algorithm 3, we take s samples from the unknown distribution p and let $\hat p$ denote the empirical distribution. We query the $\ell_\infty$ nearest neighbor search data structure on $\hat p$, which helps us narrow down to a fixed group of distributions close on the heavy elements, as defined in the preprocessing stage. We then query the $\ell_2$ NNS data structure for this group on $\hat p$, but restricted to the light domain elements in the group (to ensure independence, we actually query using an empirical distribution built from fresh samples). The returned distribution is our output for the closest distribution to p. The main guarantee of our algorithms is given below.

Theorem 3.1. Set $s = \Theta\!\left(\frac{n}{\varepsilon^2 (\log k)^{1/4}}\right) = o(n)$ in Algorithm 3 and $\gamma = 1/n^{5/12}$ in Algorithm 2. Let $\hat v$ denote the output of Algorithm 3. Then the preprocessing algorithm, Algorithm 2, runs in time polynomial in k, requires $O(nk^2)$ space, and the query time of Algorithm 3 is $O(n) + k^{1 - 1/(\log k)^{1/4}} = o(k)$. Furthermore, with probability $1 - o(1)$, Algorithm 3 returns a distribution $\hat v$ satisfying $\|p - \hat v\|_1 \le \varepsilon$.

In summary, we use polynomial preprocessing time, $s = o(n)$ samples, and our query time is $o(k)$, with the latter two quantities being sublinear in the domain size and the number of distributions, respectively. We present auxiliary lemmas in Sections 3.1 and 3.2 (proofs in Appendices B and C, respectively) and prove Theorem 3.1 in Appendix D.

Algorithm 1 Heavy-light decomposition
1: Input: An ordered list of distributions $q_1, \ldots, q_t$
2: Output: Vectors $(q_i)_H, (q_i)_L$ for every $i \in [t]$ (recall Definition 1.3)
3: procedure HEAVY-LIGHT($\{q_i\}_{i=1}^t$)
4: $H, L \leftarrow \emptyset$
5: for $j = 1$ to $n$ do
6: if $q_1(j) \ge \gamma$ then   ▷ We use $q_1$ to define the heavy and light elements for all $q_1, \ldots, q_t$.
7: $H \leftarrow H \cup \{j\}$
8: else
9: $L \leftarrow L \cup \{j\}$
10: end if   ▷ $H \cup L$ is a disjoint partition of [n].
11: end for
12: Return: vectors $\{(q_i)_H, (q_i)_L\}_{i=1}^t$   ▷ The distribution $q_1$ sets the heavy and light domain elements for all other distributions.
13: end procedure

Algorithm 2 Preprocessing
1: Input: Distributions $\{v_i\}_{i=1}^k$ according to Section 1.1
2: Output: An $\ell_\infty$ data structure and a number of $\ell_2$ NNS data structures such that each $v_i$ is associated with a unique $\ell_2$ data structure
3: procedure PREPROCESSING($\{v_i\}_{i=1}^k$)
4: $D_\infty \leftarrow$ $\ell_\infty$ data structure on $\{v_i\}_{i=1}^k$
5: for all $i \in [k]$ do
6: $S_i \leftarrow \{v \in \{v_i\}_{i=1}^k \mid \|v_i - v\|_\infty \le O((\log n)^2 (\log\log n)/\sqrt{n})\}$
7: end for
8: for $j = 1$ to $k$ do
9: Write $S_j = \{w_1, \ldots, w_t\}$ where $w_1 = v_j$
10: $\{(w_i)_H, (w_i)_L\}_{i=1}^t \leftarrow$ HEAVY-LIGHT($S_j$)
11: $D_j \leftarrow$ $\ell_2$ NNS data structure on $\{(w_i)_L\}_{i=1}^t$   ▷ $S_j$ corresponds to a well-defined partition of [n] into heavy and light elements based on $v_j$ via Algorithm 1.
12: end for
13: Return: $D_\infty$, data structures $\{D_j\}_{j=1}^k$
14: end procedure

Algorithm 3 Sublinear Time Hypothesis Selection
1: Input: Distributions $\{v_i\}_{i=1}^k$; preprocessed data structures $D_\infty, \{D_j\} \leftarrow$ PREPROCESSING($\{v_i\}_{i=1}^k$); $\mathrm{Poi}(s)$ samples from query distribution p
2: Output: A distribution $v_j$
3: procedure SUBLINEAR-HYPOTHESIS-
4: SELECTION($\{v_i\}_{i=1}^k$)
5: $\hat p \leftarrow$ empirical distribution of the first half of the samples from p
6: $v' \leftarrow$ output of $D_\infty$ on query $\hat p$ with approximation $c = O(\log(n)\log\log n)$
7: $D \leftarrow$ $\ell_2$ NNS data structure corresponding to $v'$
8: $L \leftarrow$ the light domain elements for $D$
9: $\hat p' \leftarrow$ empirical distribution of the second half of the samples from p
10: $\hat v \leftarrow$ output of $D$ on $\hat p'_L$ with approximation $c = 1 + \frac{s\varepsilon^2}{32n}$
11: Return: $\hat v$
12: end procedure

3.1. Auxiliary lemmas for heavy elements

Lemma 3.2 shows that any pair of distributions in a fixed group, as defined in the preprocessing step, are close in $\ell_1$ distance when the $\ell_1$ distance is restricted to the heavy domain elements in the group.

Lemma 3.2. Suppose $\gamma = 1/n^{5/12}$ in Algorithm 1. Consider the sets $S_j = \{w_1, \ldots, w_t\}$ defined in line 9 of Algorithm 2, and consider the vectors $(w_i)_H$, the heavy subsets of the distributions in $S_j$, as defined in line 10 of Algorithm 2. Then for all j and for all $w, w' \in S_j$, we have $\|w_H - w'_H\|_1 \le O\!\left(\frac{(\log n)\log\log n}{n^{1/12}}\right)$.

Lemma 3.3 shows that $v^*$, the distribution close to p as defined in Section 1.1, must belong to the same group as $v'$, the distribution returned after querying $\hat p$ in the $\ell_\infty$ data structure in Algorithm 3.

Lemma 3.3. Suppose $s = \Omega(n/(\log n)^{1/2})$. Consider the output $v'$ of the $\ell_\infty$ data structure $D_\infty$ on query $\hat p$, as in line 6 of Algorithm 3, and let $S^*$ denote the group of $v'$ from Algorithm 2. Assuming the event in Lemma B.3 holds, it must be that $v^* \in S^*$.

The final lemma of this subsection, Lemma 3.4, argues that $S^*$, the group of $v'$ in the preprocessing algorithm, has the property that p must be close in $\ell_\infty$ distance to every member of $S^*$. Consequently, p must be close to every distribution in $S^*$ in $\ell_1$ distance, restricted to the heavy elements of the group.

Lemma 3.4. Consider the same setting as in Lemma 3.3 and suppose $\gamma = 1/n^{5/12}$. Let H denote the heavy domain elements of group $S^*$. Then for all $w \in S^*$, we have $\|p - w\|_\infty \le O\!\left(\frac{(\log n)^2\log\log n}{\sqrt{n}}\right)$. Consequently, we also have $\|p_H - w_H\|_1 \le O\!\left(\frac{(\log n)^2\log\log n}{n^{1/12}}\right)$.

Essentially, the lemmas above ensure that $v^*$ will be in the group $S^*$, which in turn has the property that all of its distributions are close to p in $\ell_1$ on the heavy elements. In order to actually find the closest distribution in $\ell_1$, we have to switch our attention to the light elements within the group, and for this we need the lemmas of the next section.
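To make the preprocessing and the two-stage query concrete, here is a minimal Python sketch (ours; the helper names are hypothetical, and brute-force search stands in for the approximate $\ell_\infty$/$\ell_2$ NNS structures of Theorems 1.1 and 1.2). `radius` plays the role of the $\ell_\infty$ grouping threshold in line 6 of Algorithm 2, and `p_hat`, `p_hat2` are the two empirical distributions built from disjoint halves of the sample.

```python
import numpy as np

def heavy_light(q1, gamma):
    """Algorithm 1: the group's defining distribution q1 sets the heavy/light split for all members."""
    heavy = q1 >= gamma                  # H = {j : q1(j) >= gamma}
    return heavy, ~heavy

def preprocess(V, gamma, radius):
    """Algorithm 2 (sketch): one group per v_i; store the light-restricted vectors for the l_2 search."""
    groups = []
    for vi in V:
        members = [j for j, v in enumerate(V) if np.abs(vi - v).max() <= radius]
        _, light = heavy_light(vi, gamma)                        # v_i alone defines its group's partition
        light_vecs = np.array([V[j] * light for j in members])   # the vectors (w_j)_L for this group
        groups.append((members, light, light_vecs))
    return groups

def query(V, groups, p_hat, p_hat2):
    """Algorithm 3 (sketch): l_infinity search picks a group, l_2 search on light coordinates picks the output."""
    i = int(np.argmin([np.abs(v - p_hat).max() for v in V]))     # stand-in for the l_infinity NNS query
    members, light, light_vecs = groups[i]
    j = int(np.argmin(np.linalg.norm(light_vecs - p_hat2 * light, axis=1)))
    return members[j]                                            # index of the reported distribution
```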
3.2. Auxiliary lemmas for light elements

We now state auxiliary lemmas concerning light elements. Recall the definition of L from line 8 in Algorithm 3. We first show that the expected value of $\|s\hat p'_L - s v_L\|_2^2$ captures the $\ell_2$ distance between p and v, restricted to the light elements. Note that we are using the empirical distribution $\hat p'$, which does not share any samples with $\hat p$, ensuring the independence of L and $\hat p'$. We also define T in the lemma below as $T = \sum_{i \in L} p(i)$.

Lemma 3.5. $\mathbb{E}\big[\|s\hat p'_L - s v_L\|_2^2\big] = sT + s^2\|p_L - v_L\|_2^2$.

We now derive concentration of the estimator around its expected value. First we consider the case where v is sufficiently close to p.

Lemma 3.6. Suppose p satisfies $\|p_L - v_L\|_2 \le \frac{\varepsilon}{2\sqrt{n}}$. Set $t = \frac{s^2\|p_L - v_L\|_2^2}{4} + \frac{s^2\varepsilon^2}{4n}$ and $Z_1 = \|s\hat p'_L - s v_L\|_2^2$. If $s = \Omega\!\left(\max\!\left(\frac{n\|p_L\|_2}{\varepsilon^2}, \frac{n^{2/3}}{\varepsilon^{4/3}}\right)\right)$, then $\Pr[|Z_1 - \mathbb{E}[Z_1]| \ge t] \le 0.01$.

Next we consider the case where $\varepsilon/\sqrt{n} \lesssim \|p_L - v_L\|_2$. In this case, we need to obtain a stronger concentration result since we later union bound over possibly $\Omega(k)$ different distributions in the group $S^*$.

Lemma 3.7. Suppose p satisfies $\|p_L - v_L\|_2 \ge 0.75\,\frac{\varepsilon}{\sqrt{n}}$. Set $t = \frac{s^2\|p_L - v_L\|_2^2}{4} + \frac{s^2\varepsilon^2}{4n}$, suppose $\gamma = 1/n^{5/12}$, and let $Z_1 = \|s\hat p'_L - s v_L\|_2^2$ denote our estimator. If $s = \Omega\!\left(\frac{n^2\gamma^3\log(k)^2}{\varepsilon^4}\right)$, then $\Pr[|Z_1 - \mathbb{E}[Z_1]| \ge t] \le 1/\mathrm{poly}(k)$.

Putting it all together: Proof of Theorem 3.1. We give a high-level outline of the proof by showing how to combine the prior results to prove the main theorem, and defer the full proof to Appendix D. See Section 3 for additional intuition. Lemma 3.4 states that we can essentially ignore the heavy elements in the group $S^*$, and furthermore, Lemma 3.3 proves that $v^* \in S^*$. Thus we can restrict our attention to the light elements of the group belonging to $v'$. Lemma 3.5 states that the expected value of the (scaled) squared $\ell_2$ distance from $\hat p'_L$, the empirical distribution, to $v^*_L$ is at most $s + s^2\|p_L - v^*_L\|_2^2 \le s + s^2\varepsilon^2/(4n)$ (treating T as a constant for the sake of simplicity of the discussion). On the other hand, the expected value of the same statistic for any distribution v in $S^*$ which is $0.99\varepsilon$ far from p in $\ell_1$ has to be at least $s + s^2\|p_L - v_L\|_2^2 \ge s + (0.99)^2 s^2\varepsilon^2/n$. This is bigger than the corresponding value for $v^*$ by a factor of roughly $1 + \Omega(\varepsilon^2 s/n)$. These calculations only hold in expectation, so we use Lemmas 3.6 and 3.7 to prove that the quantities are close to their expected values with high probability. Since the values of the $\ell_2$ statistic are sufficiently far apart in these two cases, the $\ell_2$ NNS data structure outputs a close distribution $\hat v$ satisfying the theorem guarantees.
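As a quick sanity check of Lemma 3.5, the following sketch (ours) estimates $\mathbb{E}\big[\|s\hat p'_L - s v_L\|_2^2\big]$ by simulation under Poissonized sampling and compares it to $sT + s^2\|p_L - v_L\|_2^2$; the choice of the light set here is an arbitrary stand-in for L.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, reps = 400, 5000, 2000
p = rng.dirichlet(np.ones(n))            # an arbitrary unknown distribution p
v = rng.dirichlet(np.ones(n))            # a candidate hypothesis v
light = p < np.median(p)                 # stand-in for the light set L (any fixed subset works)
p_L, v_L = p * light, v * light
T = p_L.sum()

# Poissonized counts on L: X(i) ~ Pois(s * p_L(i)) independently; the estimator is ||X - s*v_L||_2^2.
Z = np.array([((rng.poisson(s * p_L) - s * v_L) ** 2).sum() for _ in range(reps)])
print(Z.mean())                                   # simulated E[Z_1]
print(s * T + s**2 * ((p_L - v_L) ** 2).sum())    # s*T + s^2*||p_L - v_L||_2^2, per Lemma 3.5
```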
access to the distributions in V.

In (Acharya et al., 2018) the authors showed that a simple algorithm, Quick-Select, obtains C = 9 with the same running time and sample size. Here we show how to reduce this runtime to $O\!\left(\frac{k\log(1/\delta)}{\varepsilon^2}\right)$ with a slight blow-up in the value of C. For instance, when $\delta = \Omega(1)$ this shaves a factor of $\log k$ off the running time. To obtain this bound, we allow our algorithm a simple preprocessing step: for each pair of distributions $(v_i, v_j)$, we compute and store the total variation distance between them. This can be done in total time $O(k^2 n)$.

To achieve this improvement, we recall the classic Scheffe test (Scheffe, 1947). This test can be seen as an algorithm for the hypothesis selection problem in the case k = 2. The algorithm takes as input two distributions $v_1$ and $v_2$ over [n] and samples from an unknown distribution p over [n]. Define $S = \{i \in [n] : v_1(i) > v_2(i)\}$. Upon receiving the samples, the algorithm computes $\mu_S$, the frequency of samples in S. It then outputs $v_1$ if $|v_1(S) - \mu_S| \le |v_2(S) - \mu_S|$; otherwise, it outputs $v_2$. Note that the values $v_1(S)$ and $v_2(S)$ can be computed directly from the total variation distance between $v_1$ and $v_2$. It was shown in (Devroye & Lugosi, 2001) that for k = 2 and with s samples, with probability $1 - \delta$ the Scheffe test outputs a distribution $\tilde v$ satisfying

$\|p - \tilde v\|_1 \le 3\min_{j \in \{1,2\}}\|p - v_j\|_1 + O\!\left(\sqrt{\tfrac{\log(1/\delta)}{s}}\right)$.   (1)

Except for a simple modification, our algorithm is similar to the algorithm proposed in (Suresh et al., 2014). Assume for simplicity that k is a power of two; this assumption can easily be dispensed with. Define $\delta_i = \delta/4^i$ and $s_i = \frac{10\log(1/\delta_i)}{\varepsilon^2}$. Finally, define $V_1 = V$. Our algorithm first initializes a set $C \leftarrow \emptyset$. Then for $i = 1, \ldots, \lg k$, it performs the following steps:

1. Randomly select a subset of $\min\{k^{1/3}, |V_i|\}$ elements from $V_i$ and move them to C.
2. Randomly form $|V_i|/2$ pairs of distributions in $V_i$ and run the Scheffe test on each pair using the set $S_i$ consisting of the first $s_i$ elements of the sample.
3. Define $V_{i+1}$ to be the set of $|V_i|/2$ winners.

Finally, our algorithm runs the Scheffe test on all pairs of distributions in C using a single set of $\frac{10\log\!\left(\binom{|C|}{2}/\delta\right)}{\varepsilon^2}$ new samples from p (independent of the samples used in step 2 above). It outputs the distribution $\hat v$ in C with the most wins among these comparisons (breaking ties arbitrarily).

The difference between this algorithm and the algorithm in (Suresh et al., 2014) is that we do not use the entire sample in the lower levels of the tournament tree. Intuitively, for a subtree of the tournament tree containing $2^i$ distributions, we only need to consider a sample of size $s_i$ in order to union bound over the bad events that the distribution in V closest to p loses to any of the far distributions in this subtree. As running a single Scheffe test with a sample of size s takes O(s) time, the running time used for the knockout tournament is therefore $O\!\left(\sum_{i=1}^{\lg k} s_i k/2^i\right) = O\!\left(\frac{k\log(1/\delta)}{\varepsilon^2}\right)$. We defer the full proof to Appendix E.

Theorem 4.1. Assume that $\delta \ge k^{-1/4}$. With probability $1 - O(\delta)$, the algorithm of Section 4 outputs a distribution $\hat v$ with $\|p - \hat v\|_1 \le 27\min_j\|p - v_j\|_1 + O(\varepsilon)$. The algorithm uses $s = O((\log k + \log\frac{1}{\delta})/\varepsilon^2)$ samples from p and has running time $O\!\left(\frac{k\log(1/\delta)}{\varepsilon^2}\right)$.
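For concreteness, here is a minimal Python sketch (ours; function names are hypothetical) of the Scheffe test described above and of the per-level sample sizes used by the fast tournament. In practice the values $v_1(S)$ and $v_2(S)$ would come from the precomputed total variation distances.

```python
import numpy as np

def scheffe_test(v1, v2, samples):
    """Scheffe test for k = 2: return whichever of v1, v2 better predicts the mass of S = {i : v1(i) > v2(i)}."""
    S = v1 > v2                          # the Scheffe set of the pair
    mu_S = np.mean(S[samples])           # empirical frequency of samples landing in S
    v1_S, v2_S = v1[S].sum(), v2[S].sum()
    return v1 if abs(v1_S - mu_S) <= abs(v2_S - mu_S) else v2

def level_sample_size(i, delta, eps):
    """Fast-tournament schedule: delta_i = delta / 4**i and s_i = 10*log(1/delta_i) / eps**2."""
    return int(np.ceil(10 * np.log(4**i / delta) / eps**2))
```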
5. Experiments

We experimentally evaluate the faster tournament algorithm given in Section 4 and compare its performance to the base knockout tournament (Suresh et al., 2014) on synthetic and on real networking data (code available at https://github.com/justc2/datastructdensityest). These experiments display the practical benefits of the faster algorithm. The algorithms have several key parameters which we vary throughout the experiments. nAllPairs is the number of distributions in each level of the tournament which are randomly sampled to compete in an all-pairs tournament at the end of the algorithm. This parameter is used in both the base tournament and our fast tournament. fastConst controls the number of samples used in each level of the fast tournament. In particular, at the ith level, the fast tournament uses fastConst * i samples for each Scheffe test. While adjusting these parameters slightly does not change the constant factors in the analysis in Section 4, we find they make a large impact empirically and thus test the algorithms under various parameter settings. We measure computation cost by the number of Scheffe operations performed by each tournament. One Scheffe operation is one comparison of a pair of distributions at a given sampled element (the basic computation performed during a Scheffe test). Below, we describe the experimental setup for the datasets.

5.1. Synthetic Experiments

Setup. We compare two synthetic datasets corresponding to half-uniform and Zipfian distributions. The half-uniform dataset consists of k = 8192 distributions over a domain of size n = 500. Each distribution is uniform over a random n/2-sized subset of the domain. We consider a number of samples s in {20, 30, 40, 50, 60}. The Zipfian dataset consists of k = 4096 distributions over a domain of size n = 250. Each distribution is a random permutation of the standard Zipfian distribution, where the ith element has probability proportional to 1/i. We consider a number of samples s in {20, 30, 40}. In both cases, queries are formed by taking samples from one of the distributions in the dataset, and performance is measured by the accuracy of the algorithms (i.e., the fraction of queried points for which the algorithm returned the true distribution). As runtime can always be decreased by not using all of the sampled elements, for results using s samples we also report all results using fewer than s samples. For these experiments, we grid search over fastConst in {5, 10, 15, 20} and nAllPairs in {0, 10, 20, 30}. The reported results are averaged over 5 random sets of 20 queries each.

Figure 1. Grid search on half-uniform data with 20 samples.
Figure 2. Grid search on half-uniform data with up to 30 samples.
Figure 3. Grid search on half-uniform data with up to 40 samples.
Figure 4. Grid search on half-uniform data with up to 50 samples.

Results. We will focus on the results for the half-uniform dataset, as the conclusions for the Zipfian dataset are similar (see specific comments at the end of this subsection). In Figures 1-5, we compare the accuracy and computational cost of the base tournament and our fast tournament under the various parameter settings described above. For both algorithms, the upper envelope of points (i.e., the discrete approximation of the Pareto curve for the accuracy/time tradeoff) is highlighted. As the number of samples increases, the accuracy of both algorithms dramatically increases, from 13% at 20 samples up to 98% at 60 samples. Across different levels of sampling, the fast tournament is able to attain similar accuracy to the base tournament while using significantly fewer operations.
For example, at 60 samples, the fast tournament is able to achieve accuracy of 88% using fewer than 165,000 operations while the slow tournament requires more than 400,000 operations to achieve accuracy greater than 80% (an improvement of more than 2.4 ). In general, we find that the most important factor influencing accuracy is the number of total samples. On the other hand, using fewer samples in earlier rounds of the tournament via small fast Const has a moderate to negligible impact on accuracy. Finally, the points of the right side of the plots with many operations correspond to larger values of n All Pairs. Increasing n All Pairs (at least beyond a certain point) has small impact on accuracy while significantly increasing computation due to the quadratic dependence on the all-pairs comparison at the end of the tournament. The overall results for the Zipfian datset in Figures 6-8 are very similar though the Zipfian distributions are qualitatively different from the half-uniform distributions as they contain several elements with quite large probabilities. Data Structures for Density Estimation 8 Figure 5. Grid search on half-uniform data with up to 60 samples. Figure 6. Grid search on Zipfian data with 20 samples. 5.2. Networking Experiments Setup The underlying network data we use comes from the CAIDA Anonymized Internet Trace internet traffic dataset4, which is IP traffic data collected at a backbone link of a Tier1 ISP in a data center in NYC. Within each minute, there are approximately 3.5 107 packets recorded. We split 7 minutes of the IP data into 2,148 chunks, each representing 170ms and approximately 105 packets. For each chunk, we construct a distribution corresponding to the empirical distribution over source IP addresses in that chunk. The support sizes of the chunks (and therefore of the distributions) are approximately 45,000. The goal of approximate nearest neighbor search within this context is to identify similar traffic patterns to the current chunk of data within past data. Further, algorithms for density estimation allow this computation to be done on subsampled traffic data, which is a common practice in networking to make data acquisition feasible (cis). For a series of 100 query chunks, we use the prior 2,048 chunks as the set of distributions to search over. We test 4From CAIDA internet traces 2019, https: //www.caida.org/catalog/datasets/monitors/ passive-equinix-nyc/ Figure 7. Grid search on Zipfian data with up to 30 samples. Figure 8. Grid search on Zipfian data with up to 40 samples. the algorithms in two parameter regimes: 100 samples and fast Const = 10 as well as 250 samples and fast Const = 25. In both regimes, we test with n All Pairs = 0 and n All Pairs = 5. To measure performance, we report the total variation distance of the distribution returned by the tournament algorithms as well as the true nearest neighbor distance and average distance across all 2,048 distributions. Results are averaged over 10 trials for each of the 100 queries and one standard deviation is shaded. Results In Figures 9 and 10, we plot the performance of the base and fast tournaments on the networking data across samples 100 and 250 and for n All Pairs set to 0 and 5, respectively. In all parameter settings, across the 100 queries, the fast tournament returns a distribution within roughly the same distance as the base tournament. The fast tournament uses 5 (when n All Pairs = 0) or 2.4 (when n All Pairs = 5) less number of operations. 
Interestingly, neither algorithm performs better with the inclusion of an all-pairs comparison at the end of the tournament. Even with the relatively small sampling rate compared to a domain of all possible IPs and support size of 45k, the tournament algorithms are able to recover distributions close to the nearest neighbor distance. Comparing the results with 100 and 250 samples, the distance to the distribution returned Data Structures for Density Estimation 9 Figure 9. Networking experiments with n All Pairs = 0. The base tournament uses 5 more operations than the fast tournament. by the algorithms moderately improves and the variance reduces considerably with more samples. Experiment summary In both the synthetic and realworld experiments, the tournament algorithms perform well in recovering close distributions using few samples. The most important factor in the algorithms performance is the number of total samples. On the other hand, the fast tournament is able to use very limited sample sizes for earlier rounds of the tournament in order to save up to 5x on computational cost while retaining essentially the same performance to the base tournament which uses all samples at all steps. According to the theoretical results, this computational gap will only increase for larger k. 6. Conclusion We introduce the question of sublinear time density estimation, a natural generalization of nearest neighbor search to discrete distributions. We obtain the first algorithm with sublinear sample complexity and query time and with polynomial preprocessing, in the proper case. In the improper case, we improve upon prior results of (Acharya et al., 2018) to obtain a linear time algorithm with optimal sample complexity. Our work raises a number of interesting follow up questions: To what extent can our upper bounds of query and sample complexity be improved? What are Figure 10. Networking experiments with n All Pairs = 5. The base tournament uses 2.4 more operations than the fast tournament. the computational-statistical tradeoffs between the sample complexity and query time? While we studied discrete distributions under total variation distance, it is also interesting to ask if we can obtain efficient retrieval algorithms for other distances for distributions, discrete or continuous. Acknowledgements Anders Aamand is supported by DFF-International Postdoc Grant 0164-00022B from the Independent Research Fund Denmark. Research supported in part by NSF (CCF2008733), and ONR (N00014-22-1-2713) grants. Justin Y. Chen is supported by Math Works Engineering Fellowship, GIST-MIT Research Collaboration grant, NSF award CCF-2006798, and Simons Investigator Award. Justin Y. Chen, Shyam Narayanan, and Sandeep Silwal are supported by an NSF Graduate Research Fellowship under Grant No. 1745302. Piotr Indyk was supported by the NSF TRIPODS program (award DMS-2022448), Simons Investigator Award and MIT-IBM Watson AI Lab. Shyam Narayanan is supported by a Google Ph D fellowship. Using netflow sampling to select the network traffic to track. URL https:// www.cisco.com/c/en/us/td/docs/ Data Structures for Density Estimation 10 ios-xml/ios/netflow/configuration/ xe-16/test/nf-xe-16-book/ nflow-filt-samp-traff-xe.html. Acharya, J., Falahatgar, M., Jafarpour, A., Orlitsky, A., and Suresh, A. T. Maximum selection and sorting with adversarial comparators. The Journal of Machine Learning Research, 19(1):2427 2457, 2018. Andoni, A. and Razenshteyn, I. Optimal data-dependent hashing for approximate near neighbors. 
In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 793 801, 2015. Andoni, A., Indyk, P., and Razenshteyn, I. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pp. 3287 3318. World Scientific, 2018. Bun, M., Kamath, G., Steinke, T., and Wu, S. Z. Private hypothesis selection. Advances in Neural Information Processing Systems, 32, 2019. Canonne, C. L. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, pp. 1 100, 2020. Canonne, C. L., Kamath, G., Mc Millan, A., Smith, A., and Ullman, J. The structure of optimal private tests for simple hypotheses. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 310 321, 2019. Daskalakis, C. and Kamath, G. Faster and sample nearoptimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pp. 1183 1213. PMLR, 2014. Devroye, L. and Lugosi, G. Combinatorial methods in density estimation. Springer series in statistics. Springer, 2001. ISBN 978-0-387-95117-1. Diakonikolas, I. Learning structured distributions. Handbook of Big Data, 267:10 1201, 2016. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., and Stewart, A. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742 864, 2019. Gopi, S., Kamath, G., Kulkarni, J., Nikolov, A., Wu, Z. S., and Zhang, H. Locally private hypothesis selection. In Conference on Learning Theory, pp. 1785 1816. PMLR, 2020. Har-Peled, S., Indyk, P., and Motwani, R. Approximate nearest neighbor: Towards removing the curse of dimensionality. 2012. Indyk, P. On approximate nearest neighbors under ℓ norm. Journal of Computer and System Sciences, 63(4): 627 638, 2001. Kamath, G., Singhal, V., and Ullman, J. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pp. 2204 2235. PMLR, 2020. Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In Conference on Learning Theory, pp. 1066 1100. PMLR, 2015. Scheffe, H. A Useful Convergence Theorem for Probability Distributions. The Annals of Mathematical Statistics, 18(3):434 438, 1947. doi: 10.1214/aoms/ 1177730390. URL https://doi.org/10.1214/ aoms/1177730390. Suresh, A. T., Orlitsky, A., Acharya, J., and Jafarpour, A. Near-optimal-sample estimators for spherical gaussian mixtures. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. Szpankowski, W. Average case analysis of algorithms on sequences. John Wiley & Sons, 2011. Valiant, G. and Valiant, P. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pp. 403 412. IEEE, 2011. Yatracos, Y. G. Rates of convergence of minimum distance estimators and kolmogorov s entropy. The Annals of Statistics, 13(2):768 774, 1985. Data Structures for Density Estimation 11 A. Omitted Proofs of Section 2 Lemma 2.1. For any k n n/2 , there exist distinct distributions p, q1, . . . , qk over [n] such that p qi 1 = 1 for each i [k] and if ˆp is the empirical distribution obtained from sampling s = n/2 elements from p, then the probability that p ˆp 1 mini [k]( qi ˆp 1) is exp( Ω(k)). In fact, if k Cn log n for a sufficiently large constant C, the probability is 0. Proof. Let p = (p(1), . . . 
, p(n)) where each p(i) = 1/n. Let x(i) Bin(s, 1/n) denote the number of times item i is sampled, so that ˆp(i) = x(i)/s. Fix a subset A [n] with |A| n/2 and define q A = (q A(1), . . . , q A(n)) by q A(i) = 2/n if i A and q A(i) = 0 otherwise. Note that p q A 1 = 1. Let ℓ1 = |{i A | x(i) 1}| and ℓ2 = |{i [n] \ A | x(i) 1}|. If s n/2, then for all i such that x(i) 1 it holds that |p(i) ˆp(i)| = ( x(i)/s p(i), x(i) 1 p(i), x(i) = 0, and a similar equation holds for |q A(i) ˆp(i)|. Using this, simple calculations show that ˆp p 1 = 2 ℓ1+ℓ2 n and ˆp q A 1 = 2 4ℓ1 n . In particular, ˆp q A 1 ˆp p 1 = 2 n(ℓ1 ℓ2). By symmetry, if A is sampled at random (and in particular A and Ac is sampled with the same probability), the events ˆp q A 1 > ˆp p 1 and ˆp q A 1 < ˆp p 1 occur with exactly the same probability. Moreover, conditioning on the x(i) s, it is simple to check that regardless of their values, ℓ1 = ℓ2 with probability Ω(1). It follows that the probability that ˆp p 1 ˆp q A 1 is at most 1 c for some constant c > 0. Now pick the distributions q1, . . . , q2k independently and uniformly at random by picking random A1, . . . , A2k with |Ai| = n/2 and defining qi = q Ai (note that we may have repetitions). By independence, the probability that ˆp p 1 min ˆp qi 1 is exp( Ω(k)). If k = O(n log n), it moreover holds with the same high probability that {q1, . . . , q2k} contains at least 2k different distributions and so the result follows. If on the other hand k Cn log n for C sufficiently large, then we can pick k = Cn log n distributions q1, . . . , qk satisfying the statement of the theorem. The error probability is P0 = exp( Ω(k )) but there are at most nn/2 ways to do the sampling from p. Therefore, if for at least one way of sampling from p it was the case that ˆp p 1 mini [k ] ˆp qi 1, then this event would in fact occur with probability at least n n/2. This is a contradiction since P0 = exp( Ω(k )) < n n/2 when C is sufficiently large. Lemma 2.2. There exists distributions p, q over [n] with p q 1 = 1 such that for any sample size s = o(n), with probability Ω(1), p ˆp 2 > q ˆp 2. Proof (sketch). Suppose for simplicity that n = 2n0 + 1 is odd. Let p = (p(1), . . . , p(n)) be given by 1 2, i = 1, 1 2n0 , i = 2, . . . , n0 + 1, 0, i = n0 + 2, . . . , 2n0 + 1, Simple concentration bounds show that with high probability, Pn i=2(X(i) sp(i))2 (1 + o(1))s/2. Define next q = (q(1), . . . , q(n)) by 1 2 + 1 s, i = 1, 0, i = 2, . . . , n0 + 1, 1 2 1 s n0 , i = n0 + 2, . . . , 2n0 + 1, Again with the assumption that s = o(n), standard concentration bounds show that Pn i=2(X(i) sq(i))2 (1 + o(1))s/2 with high probability. It follows from these observations that if X(1) s/2 + s, then p ˆp 2 > q ˆp 2. By standard properties of the binomial distribution B(s, 1/2), this happens with probability Ω(1). B. Omitted Proofs of Section 3.1 Remark B.1. We remark that in the proofs of this and all subsequent sections, the notation 1/poly(n) or 1/poly(k) refers to quantities of the form 1/n C or 1/k C where we can choose C to be an arbitrarily large constant. Data Structures for Density Estimation 12 The following is a standard tail bound for Poisson distributions5. Lemma B.2. If Y Poi(λ), then for any t > 0, P(|Y λ| t) 2 exp t2 2(λ+t) . Lemma B.3 below bounds the ℓ distance between ˆp, the empirical distribution, and p, the unknown distribution. Lemma B.3. Let ˆp denote the empirical distribution of Algorithm 3. With probability at least 1 1/poly(n), we have that for all i [n], |ˆp(i) p(i)| O Proof. 
Note that s ˆp(i) is distributed as Poi(s p(i)). By setting t = C max p s p(i) log n, log n in Lemma B.2 for a large enough constant C , we see that the probability s ˆp(i) deviates by t is at most 1/poly(n), where we can make the degree of the polynomial arbitrarily large by increasing C . The proof is completed by dividing by s. Lemma 3.2. Suppose γ = 1/n C in Algorithm 1. Consider the sets Sj = {w1, . . . , wt} defined in line 9 of Algorithm 2. Consider the vectors (wi)H, which are the heavy subsets of the distributions in Sj, as defined in line 10 of Algorithm 2. Then for all j and for all w, w Sj, we have w H w H 1 O (log n) log log n Proof. There can be at most 1/γ = n C heavy elements since each heavy element has probability mass at least γ, by Algorithm 1. Let H be the set of heavy elements. We know that any two distributions w and w Sj are at most O((log n) (log log n))/ n apart in ℓ . Thus, the ℓ1 difference restricted to the heavy elements is at most n C O (log n) (log log n) n = O (log n) log log n Lemma 3.3. Suppose s Ω(n/(log n)1/2). Consider the output v on ˆp when inputted into the ℓ data structure D , as done in line 6 of Algorithm 3. Let S denote the group of v from Algorithm 2. Assuming the event in Lemma B.3 holds, it must be that v S . Proof. We know p v p v 2 1 2 n by our assumption on v given in Section 1.1. Furthermore, Lemma B.3 implies that p ˆp O (log n)3/4 n so by adjusting constants, it follows from the triangle inequality that ˆp v O (log n)3/4 n . By definition of v , the distribution returned by the ℓ data structure in line 6 of Algorithm 3, we know that ˆp v c ˆp v , where c = O(log(n) log log n) set in line 6 of Algorithm 3. Hence, v v ˆp v + ˆp v (c + 1) ˆp v O (log n)1.75 (log log n) n so v must be in S from line 6 of Algorithm 2. Lemma 3.4. Consider the same setting as in Lemma 3.3 and suppose γ = 1/n C. Let H denote the heavy domain elements of group S . Then for all w S , we have p w O (log n)2 log log n n . Consequently, we also have p H w H 1 O (log n)2 log log n Proof. Take any w S . We know that ˆp w ˆp v + v w ˆp v + O (log n)2 (log log n) n 5see e.g. https://math.stackexchange.com/questions/2434883/chernoff-style-bounds-for-poisson-distribution 2434922 Data Structures for Density Estimation 13 where the last inequality holds from the construction of S in Algorithm 2. Since ˆp v c ˆp v , for c defined in line 6 of Algorithm 3, by definition, p w ˆp w + p ˆp c ˆp v + p ˆp + O (log n) (log log n) n The proof of Lemma 3.3 implies that p ˆp , ˆp v O((log n)3/4/ n), so by adjusting constant factors, the first inequality in the lemma statement holds, as O (log n)2 log log n/ n is the dominant term. The second inequality in the lemma statement holds by an identical reasoning as used in the proof of Lemma 3.2, as there are at most n C heavy coordinates. C. Omitted Proofs of Section 3.2 Recall that T = P Lemma 3.5. E[ s ˆp L s v L 2 2] = s T + s2 p L v L 2 2. Proof. Note that the light elements refer to elements in L defined in Algorithm 3. Set X(i) := s ˆp L(i) and recall that X(i) Pois(s p L(i)). We have E[ s ˆp L s v L 2 2] i=1 E[X(i)2 + s2v L(i)2 2s X(i)v L(i)] i=1 sp L(i) + s2p L(i)2 2s2p L(i)v L(i) + s2v L(i)2 = s T + s2 p L v L 2 2. We now bound the variance. The following lemma bounds the variance one term at a time. Lemma C.1. Let X(i) = s ˆp L(i). We have Var[(X(i) sv L(i))2] 4sp L(i) (sp L(i) sv L(i))2 +6(sp L(i))2 +sp L(i). Proof. Define the auxiliary variables y := sp L(i) and z := sv L(i). 
We have Var[(X(i) sv L(i))2] = E[(X(i) sv L(i))4] E[(X(i) sv L(i))2]2 = E[(X(i) sv L(i))4] (s2p L(i)2 2s2p L(i)v L(i) + s2v L(i)2 + sp L(i))2 = E[(X(i) sv L(i))4] (y + y2 2yz + z2)2. Since X(i) is a Poisson random variable with parameter y, we know that E[X(i)3] = y + 3y2 + y3, E[X(i)4] = y + 7y2 + 6y3 + y4. By expanding, we have E[(X(i) sv(i))4] = E[X(i)4] 4E[X(i)3]z + 6E[X(i)2]z2 4E[X(i)]z3 + z4 = (y + 7y2 + 6y3 + y4) 4(y + 3y2 + y3)z + 6(y + y2)z2 4yz3 + z4. Data Structures for Density Estimation 14 Putting everything together gives us Var[(X(i) sv L(i))2] = E[(X(i) sv L(i))4] (y + y2 2yz + z2)2 = (y + 7y2 + 6y3 + y4) 4(y + 3y2 + y3)z + 6(y + y2)z2 4yz3 + z4 (y + y2 2yz + z2)2 = 4y3 8y2z + 4yz2 + 6y2 + y 4yz = 4y(y z)2 + 6y2 + y 4yz 4y(y z)2 + 6y2 + y = 4sp L(i) (sp L(i) sv L(i))2 + 6(sp L(i))2 + sp L(i). The following lemma bounds the total variance. Lemma C.2. Var[ s ˆp L s v L 2 2] 4s3 p L 2 p L v L 2 2 + 6s2 p L 2 2 + s T. Proof. Summing the result of Lemma C.1 for all coordinates i gives us Var[ s ˆp L s v L 2 2] i=1 4sp L(i) (sp L(i) sv L(i))2 + 6(sp L(i))2 + sp L(i) i=1 4s3p L(i) (p L(i) v L(i))2 + 6s2 p L 2 2 + s T 4s3 p L 2 p L v L 2 2 + 6s2 p L 2 2 + s T. Lemma 3.6. Suppose p satisfies p L v L 2 ε 2 n. Set t = s2 p L v L 2 2 4 + s2ε2 4n and Z1 = s ˆp L s v L 2 2. If s = Ω max n p L 2 ε4/3 then P[|Z1 E[Z1]| t] 0.01. Proof. By Chebyshev s inequality and Lemma C.2, P[|Z1 E[Z1]| t] Var[Z1] s3 p L 2 p L v L 2 2 s4ε4/n2 + s2 p L 2 2 s4ε4/n2 + s s4ε4/n2 sε2 + n p L 2 where the last inequality holds if s C max n p L 2 ε4/3 for a sufficiently large constant C. We then consider the case where ε/ n p L v L 2. As stated in the main body, we need to obtain a stronger concentration result since later we union bound over possibly Ω(k) different distributions which are in group S . To obtain a stronger concentration result, we need the coordinates of both ˆp L and v L to be bounded. For each vi, we know that (vi)L 2γ by construction if we set γ Ω(1/ n). Lemma B.3 readily implies a similar statement for ˆp L (with high probability). Remark C.3. Therefore, in the following concentration bound, we utilize the fact that maxi{p L(i), v L(i)} O(γ), which holds with high probability. The main tool we use is Bernstein s concentration inequality on Z1 = s ˆp L s v L 2 2. Towards this, we first prove a bound on a variance like quantity. Lemma C.4. With probability 1 1/poly(k) (over the randomness in ˆp ), we have max i [n] |(s ˆp L(i) s v L(i))2 E[(s ˆp L(i) s v L(i))2]| O log k (sγ)1.5) . Data Structures for Density Estimation 15 Proof. Define X(i) = s ˆp L(i). Recall each X(i) is distributed as Pois(s p L(i)). From Lemma B.2, we have that with probability 1 1/poly(k), |X(i) sp L(i)| O max p sp L(i) log k, log k for all i. Call this event E. Now (X(i) s v L(i))2 E[(X(i) s v L(i))2] = X(i)2 2sv L(i)X(i) + s2v L(i)2 (s2p L(i)2 2s2p L(i)v L(i) + s2v L(i)2 + sp L(i)) = X(i)2 s2p L(i)2 sp L(i) + 2sv L(i)(sp L(i) X(i)). Thus conditioning on E and recalling Remark C.3, we have |X(i)2 s2p L(i)2 sp L(i)| O((sp L(i))1.5 log k) = O((sγ)1.5 log k) and |2sv L(i)(sp L(i) XL(i))| O(sγ ( sγ + 1) log k) for all i. Altogether, we have that under E, max i |(X(i) s v L(i))2 E[(X(i) s v L(i))2]| O log k (s1.5γ1.5 ). Lemma 3.7. Suppose p satisfies p L v L 2 0.75 ε n. Set t = s2 p L v L 2 2 4 + s2ε2 4n , suppose γ = 1/n5/12, and let Z1 = s ˆp L s v L 2 2 denote our estimator. If s = Ω n2γ3 log(k)2 ε4 then P[|Z1 E[Z1]| t] 1/poly(k). Proof. Define X(i) = s ˆp L(i). 
Recall Bernstein s inequality: for a random variable R = P i Ri where Ri are independent, it states P(|R E[R]| t) 2 exp t2/2 Var(R) + t M/3 where M is such that |Ri E[Ri]| M with probability 1 for every i. In our case, Z1 = P i(X(i) s v L(i))2, so we must first bound the maximum deviation of |(X(i) s v L(i))2 E[(X(i) s v L(i))2]| for all i. To do so, we condition on the event E of Lemma C.4. This gives P[|Z1 E[Z1]| t] = P[|Z1 E[Z1]| t | E] P(E) + P[|Z1 E[Z1]| t | Ec] P(Ec). We have P[|Z1 E[Z]| t | Ec] P(Ec) P(Ec) 1 poly(k) so it suffices to bound P[|Z1 E[Z1]| t | E]. We drop the conditioning E for simplicity. Note that a similar analysis as above also shows that |E[Z1] E[Z1 | E]| 1 poly(k) which is a much smaller lower order term compared to t so we ignore it for simplicity. A similar statement holds for the variance of Z1. Now Bernstein s inequality gives us P(|Z1 E[Z1]| t) exp t2/2 4s3 p L 2 p L v L 2 2 + 6s2 p L 2 2 + s + t M/3 where we have used Lemma C.2 to substitute in the variance. We can furthermore set M = O log k (sγ)1.5 ) from Lemma C.4. If we show that s3 p L 2 p L v L 2 2 , t2 s2 p L 2 2 , t2 where the constants in the Ω(log k) are sufficiently large, then it follows that t2/2 4s3 p 2 p L v L 2 2 + 6s2 p L 2 2 + s + t M/3 Ω(log k) Data Structures for Density Estimation 16 and we get the desired concentration bound. Thus, we proceed to analyze each of the four fractions individually. Using t = s2 p L v L 2 2 4 + s2ε2 4n , we see that s3 p L 2 p L v L 2 2 s p L v L 2 2 p L 2 sε2 s2 p L 2 2 s2ε4 n2 p L 2 2 , Thus, s Ω(max(n2/3 log(k)2/3/ε2/3, n p L 2 log k/ε2)) is required. We now consider t/M Recalling their values, we observe that t s1.5γ1.5 s0.5ε2 (Note that sγ 1 in our setting). If s Ω(n p 2 log k/ε2), then sε2/(nγ) = Ω(log k). To ensure s0.5ε2 nγ1.5 Ω(log k), we need to satisfy s = Ω n2γ3 log(k)2 ε4 . Recalling the value of γ, we can easily check that s Ω n2γ3 log(k)2 ε4 implies all other lower bounds we have imposed on s so far, proving the lemma. D. Proof of Theorem 3.1 Theorem 3.1. Set s = Θ n ε2(log k)1/4 = o(n) in Algorithm 3 and γ = 1/n5/12 in Algorithm 2. Let v denote the output of Algorithm 3. Then the preprocessing algorithm, Algorithm 2, runs in time polynomial in k, requires O(nk2) space, and the query time of Algorithm 3 is O(n) + k1 1/(log k)1/4 = o(k). Furthermore, with probability 1 o(1), Algorithm 3 returns distribution v satisfying p v 1 ε. Proof. Let v be the output of line 6 of Algorithm 3 and S be the group of v from Algorithm 2, the preprocessing step. Let H and L be the partition of [n] into heavy and light elements induced by v . From Lemma 3.4, we know that p is close to v on H as v S : p v 1 o(1). Thus it suffices to bound p L v L 1. Lemma 3.3 states that v S with probability 1 1/poly(n). Since we assumed that p v 2 ε/(2 n), Lemma 3.6 implies that s ˆp L s v L 2 2 s T + 5s2ε2 with probability 1 o(1) (note Lemma 3.6 is stated as holding with probability 99% but we are picking a much larger value of s than required by the lemma. Plugging into the Chebyshev inequality bound readily gives us that the failure probability is o(1)). Now we claim that v cannot satisfy v L p L 2 0.99 ε/ n. Let us assume for the sake of contradiction that v L p L 2 0.99 ε/ n. Then from Lemma 3.5, E[ s ˆp L s v L 2 2] = s T + s2 p L v L 2 2, and Lemma 3.7 implies that with probability at least 1 1/poly(k), we have s ˆp L s v L 2 2 s T + .75s2 p L v L 2 2 0.25 s2ε2 n s T + 0.48 s2ε2 where the last inequality follows from the assumption that p L v 2 2 (.99)2ε2/n. 
However, the ratio of the quantities on the right hand side of (4) and (3) is at least s T + 0.48 s2ε2 n s T + 5s2ε2 16n 1 + z/2 T + z Data Structures for Density Estimation 17 for z = 5sε2/(16n). Now if T z then z/2/(T + z) 1/4 so the ratio is at least 1.25. Otherwise, if T > z, we always know that T 1 so T + z 2 and hence the ratio is at least 1 + z 64n . Since (1 + x)1/2 1 + x/2.5 for any x [0, 1], we have that (in our regime of s) the square root of the ratio is strictly larger than 1 + sε2/(32n), by setting s appropriately. However, this contradicts the ℓ2 nearest neighbor search guarantees set in line 10 of Algorithm 3, since it means we return a v which has the property that ˆp L v L 2 is larger than ˆp L v L 2 by a factor larger than c set in line 10 (and v was also considered by the ℓ2 data structure). Therefore, we must have v L p L 2 0.99 ε/ n which implies that v L p L 1 0.99 ε. Since we already know that v H p H 1 o(1), we have that v p 1 (1 + o(1))0.99 ε < ε. This proves the approximation. For the sample complexity, note that if we set s = Θ(n/(log k)1/4) then s is larger than the values required in Lemmas 3.6 and 3.7. The prepossessing time follows from Theorem 1.1. Finally, the query time follows from 1.2 by plugging in the value of c from line 10 of Algorithm 3 (the ℓ2 NNS approximation parameter) into statement of Theorem 1.2 and noting that all log(k) factors can be absorbed into the exponent as kΘ(1/(log k)1/4) poly(log k). Note that we get an extra O(n) runtime from also querying an ℓ NNS datastructure as well in step 6 of Algorithm 3. Finally, we remark that an alternative expression for the approximation guarantee is p v 1 p v 1 + ε. This completes the proof. E. Omitted Proofs of Section 4 Theorem 4.1. Assume that δ k 1/4. With probability 1 O(δ), the algorithm of Section 4 outputs a distribution ˆv with p ˆv 1 27 minj p vj 1 + O(ε). The algorithm uses s = O((log k + log 1 δ )/ε2) samples from p and has running time O k Proof. The statement about the number of samples is clear. Indeed, for the iterative part of the algorithm, we use a total of slg k = 10 log(k2/δ) ε2 samples and for the final part, we use a further 10 log(|C|2/δ) ε2 10 log(k2/δ) ε2 samples. Regarding the running time, we already argued above that the time used for the knockout tournament is O k δ . Further, since |C| k1/3 log k, running the Scheffe test on all pairs of distributions in C takes time O sk2/3 log2 k = O k For the bound on the quality of the estimate ˆv, we define v = arg minv V p v 1 and let W0 = {v V : p v 1 3 p v 1 + ε} and W1 = V \ W0. We first want to argue that with probability 1 O(δ), some element of W0 gets added to C. For i = 1, . . . , lg k, we let Ti be the set of 2i 1 distributions that v could possibly be paired with in step 2. of the algorithm during the first i rounds of the tournament (this is the set of 2i 1 distributions in the subtree of the tournament tree rooted i steps above v ). Let Ai denote the event that there exists a distribution v in Ti W1 such that v loses the Scheffe test to v using the sample Si. By the choice of si and the bound in (1), we have that Pr[Ai] |Ti W1|δi δ i Ai] δ. We now consider two cases: Suppose first that at some stage i, |Vi W0| |Vi| log 1/δ k1/3 . In this case, the probability that no element of Vi is added to C during step 1. of iteration i of the algorithm is at most (1 log 1/δ k1/3 )k1/3 δ. 
If on the other hand, at each step i, |Vi W0| |Vi| < log 1/δ k1/3 , then the probability that v gets paired with an element of W0 in step i (conditioned on it surviving the first i 1 steps) is at most log 1/δ k1/3 . Note that if none of the events Ai occurs and v never gets paired with a distribution in W0, then v will win the tournament, and thus get added to C in some step. Thus, by a union bound, the probability that no element of W0 gets added to C is at most + δ + log k log 1/δ k1/3 2δ + (log k)2 4k1/3 = O(δ), where we used δ k 1/4 in the final step. It was shown in (Devroye & Lugosi, 2001) that running the Scheffe test on all pairs of distributions in a set C using a sample of size 10 log(( |C| 2 )/δ) ε2 and outputting the one ˆv with the most wins, we have with probability at least 1 δ that p ˆv 1 9 min v C p v 1 + 4ε. Data Structures for Density Estimation 18 Since C contains a distribution of W0 with probability 1 O(δ) we obtain that with probability 1 O(δ), p ˆv 1 9 min v C p v 1 + 4ε 9 (3 p v 1 + ε) + 4ε = 27 p v 1 + 13ε, as desired. Remark E.1. We can also handle error probabilities δ < k 1/4. In this case, we set δ0 = k 1/4 and run the iterative part of the algorithm above ℓ= log 1/δ log 1/δ0 times to obtain a candidate set of distributions C V (ℓtimes larger than before) which contains an element of W0 with probability 1 O(δ). We then again run the complete tournament on C with a sample of size 10 log(( |C| 2 )/δ) ε2 . The bound on the quality of the estimate, follows as above. Moreover, easy calculations show that the number of samples used is still O( log 1/δ ε2 ) and that the total running time is k + ℓ2k2/3 log2 k , which is O k δ except for extremely small δ. Remark E.2. In a slightly different model, we are not given the distributions of V explicitly but can only access them through sampling. In this model, we can skip the preprocessing step and instead estimate the probabilities vi(S) and vj(S) through sampling whenever we run the Scheffe test for distributions vi and vj. This comes at the cost of a larger value of the constant C. When running the algorithm above in this model, we obtain the same reduction of the running time and with no preprocessing.