# Estimating Entropy of Distributions in Constant Space

Jayadev Acharya (Cornell University, acharya@cornell.edu), Sourbh Bhadane (Cornell University, snb62@cornell.edu), Piotr Indyk (Massachusetts Institute of Technology, indyk@mit.edu), Ziteng Sun (Cornell University, zs335@cornell.edu)

We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\left(\frac{k \log^2(1/\varepsilon)}{\varepsilon^3}\right)$ samples and a constant, $O(1)$, number of memory words, and outputs a $\pm\varepsilon$ estimate of $H(p)$. Without space limitations, the sample complexity has been established as $S(k, \varepsilon) = \Theta\left(\frac{k}{\varepsilon \log k} + \frac{\log^2 k}{\varepsilon^2}\right)$, which is sub-linear in the domain size $k$, and the current algorithms that achieve optimal sample complexity also require nearly linear space in $k$. Our algorithm partitions $[0, 1]$ into intervals and estimates the entropy contribution of probability values in each interval. The intervals are designed to trade off the bias and variance of these estimates.

## 1 Introduction

Streaming Algorithms. Algorithms that require limited memory/space/storage¹ have garnered great interest over the last two decades, and are popularly known as streaming algorithms. Initially studied by [1, 2], this setting became mainstream with the seminal work of [3]. Streaming algorithms are particularly useful for handling massive datasets that cannot be stored in the memory of the system. They are also applicable in networks where data is naturally generated sequentially and the data rates are much higher than what can be stored, e.g., on a router.

The literature on streaming algorithms is large, and many problems have been studied in this model. With roots in computer science, a large fraction of this literature considers the worst-case model, where it is assumed that the input $X^n := X_1, \ldots, X_n$ is an arbitrary sequence over some domain of size $k$ (e.g., $[k] := \{1, \ldots, k\}$). The set-up is as follows: given a system with limited memory that can make a few (usually just one) passes over $X^n$, the objective is to estimate some function $f(X^n)$ of the underlying dataset. The primary objective is to solve the task with as little memory as possible; this minimal memory is called the space complexity. Some of the research closest to our task is the estimation of frequency moments of the data stream [3, 4, 5], the Shannon and Rényi entropy of the empirical distribution of the data stream [6, 7, 8, 9, 10], the heavy hitters [11, 12, 13, 14], and distinct elements [15, 16]. There has also been work on random-order streams, where one still considers a worst-case data stream $X^n$, but feeds a random permutation $X_{\sigma(1)}, \ldots, X_{\sigma(n)}$ of $X^n$ as input to the algorithm [10, 17, 18].

¹ We use space, storage, and memory interchangeably.

Statistical Estimation. At the same time, there has been great progress in the classical fields of statistical learning and distribution property estimation. The typical set-up is as follows: given independent samples $X^n$ from an unknown distribution $p$, the objective is to estimate a property $f(p)$ using the fewest samples, called the sample complexity.
Distribution property estimation literature most related to our work includes entropy estimation [19, 20, 21, 22, 23, 24, 25], support size estimation [21, 23, 26], Rényi entropy estimation [27, 28, 29], support coverage estimation [30, 31], and divergence estimation [32, 33]. In these tasks, the optimal sample complexity is sub-linear in $k$, the domain size of the distribution.

Streaming Algorithms for Statistical Estimation. While the space complexity of streaming algorithms and the sample complexity of statistical estimation have both received great attention, the problem of statistical estimation under memory constraints has received relatively little attention. Interestingly, almost half a century ago, Cover and Hellman [34, 35] studied hypothesis testing with finite memory, and [36] studied estimating the bias of a coin using a finite state machine. However, until recently, there were few works on learning with memory constraints. There has been a recent interest in space-sample trade-offs in statistical estimation [37, 38, 39, 40, 41, 42, 43, 44]. Within these, [40] is the closest to our paper. They consider estimating the integer moments of distributions, which is equivalent to estimating Rényi entropy of integer orders under memory constraints. They present natural algorithms for the problem, and, perhaps more interestingly, prove non-trivial lower bounds on the space complexity of this task. More recently, [45] obtained memory-sample trade-offs for testing discrete distributions. We initiate the study of distribution entropy estimation with space limitations, with the goal of understanding the space-sample trade-offs.

### 1.1 Problem Formulation

Let $\Delta_k$ be the class of all $k$-ary discrete distributions over the set $\mathcal{X} = [k] := \{0, 1, \ldots, k-1\}$. The Shannon entropy of $p \in \Delta_k$ is

$$H(p) := -\sum_{x \in [k]} p(x) \log p(x).$$

Entropy is a fundamental measure of randomness and a central quantity in information theory and communications. Entropy estimation is also a key primitive in various machine learning applications, such as feature selection.

Given independent samples $X^n := X_1, \ldots, X_n$ from an unknown $p \in \Delta_k$, an entropy estimator is a possibly randomized mapping $\hat{H} : [k]^n \to \mathbb{R}$. Given $\varepsilon > 0$ and $\delta > 0$, $\hat{H}$ is an $(\varepsilon, \delta)$ estimator if for all $p \in \Delta_k$,

$$\Pr_{X^n \sim p^n}\left( \left|\hat{H}(X^n) - H(p)\right| > \varepsilon \right) < \delta, \qquad (1)$$

where $p^n$ denotes the joint distribution of $n$ independent samples from $p$.

Sample Complexity. The sample complexity $S(H, k, \varepsilon, \delta)$ is the smallest $n$ for which an estimator satisfying (1) exists. Throughout this paper, we assume a constant error probability, say $\delta = 1/3$,² and exclusively study entropy estimation. We therefore denote $S(H, k, \varepsilon, 1/3)$ by $S(k, \varepsilon)$.

Memory Model and Space Complexity. The basic unit of our storage model is a word, which consists of $\log k + \log(1/\varepsilon)$ bits. This choice of storage model is motivated by the fact that at least $\log(1/\varepsilon)$ bits are needed for a precision of $\varepsilon$, and $\log k$ bits are needed to store a symbol in $[k]$. The space complexity of an algorithm is the smallest space (in words) required for its implementation.

### 1.2 Prior Work

Distribution Entropy Estimation. Entropy estimation from samples has a long history [19, 46, 47]. The most popular method is empirical plug-in estimation, which outputs the entropy of the empirical distribution of the samples. Its sample complexity [47, 20] is

$$S_e(k, \varepsilon) = \Theta\left(\frac{k}{\varepsilon} + \frac{\log^2 k}{\varepsilon^2}\right). \qquad (2)$$

Paninski [48] showed that there exists an estimator with sample complexity sub-linear in $k$.
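For concreteness, the following is a minimal sketch of the empirical plug-in estimator (our own illustrative snippet, not from the paper; the function name and the use of NumPy are assumptions):

```python
import numpy as np

def plugin_entropy(samples, k):
    """Empirical plug-in estimate of Shannon entropy (in nats).

    `samples` is an i.i.d. sequence over {0, ..., k-1}; the estimator plugs the
    empirical frequencies into H(p) = -sum_x p(x) log p(x).
    """
    counts = np.bincount(samples, minlength=k)
    freq = counts[counts > 0] / len(samples)
    return -np.sum(freq * np.log(freq))

# Toy usage: estimate the entropy of a uniform distribution over k symbols.
rng = np.random.default_rng(0)
k = 1000
samples = rng.integers(0, k, size=50_000)
print(plugin_entropy(samples, k), np.log(k))  # plug-in estimate vs. true value log(k)
```

As the $\Theta(k/\varepsilon)$ term in (2) suggests, the plug-in estimate is accurate only once the number of samples is large compared to $k$; the estimators discussed next reduce this dependence.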
A recent line of work [21, 23, 22] has characterized the optimal sample complexity as

$$S(k, \varepsilon) = \Theta\left(\frac{k}{\varepsilon \log k} + \frac{\log^2 k}{\varepsilon^2}\right). \qquad (3)$$

² For smaller $\delta$, we can apply the median trick at the expense of an extra factor of $\log(1/\delta)$ samples.

Note that the optimal sample complexity is sub-linear in $k$, while that of the empirical estimator is linear.

Estimating Entropy of Streams. There is significant work on estimating the entropy of a stream with limited memory. Here, no distributional assumptions are made on the input stream $X^n$, and the goal is to estimate $H(X^n)$, the entropy of the empirical distribution of $X^n$. [6, 49, 10, 9, 8] consider multiplicative entropy estimation. These algorithms can be modified for additive entropy estimation by noting that a $(1 \pm \varepsilon/\log n)$ multiplicative estimate is equivalent to a $\pm\varepsilon$ additive estimate. With this, [8, 10] give an algorithm requiring $O(\log^3 n / \varepsilon^2)$ words of space for a $\pm\varepsilon$ estimate of $H(X^n)$. [9] proposes an algorithm using $O(\log^2 n \log\log n / \varepsilon^2)$ words of space. A space lower bound of $\Omega(1/\varepsilon^2)$ was proved in [8] for the worst-case setting.

Another widely used notion of entropy is Rényi entropy [50]. The Rényi entropy of $p$ of order $\alpha > 0$ is $H_\alpha(p) := \frac{1}{1-\alpha}\log\left(\sum_x p(x)^\alpha\right)$. [51, 52, 27] show that the sample complexity of estimating $H_\alpha(p)$ is $\Theta(k^{1-1/\alpha}/\varepsilon^2)$ for $\alpha \in \mathbb{N}$. [40] studies the problem of estimating the collision probability, which can be seen as estimating $H_\alpha(p)$ for $\alpha = 2$, under memory constraints. They propose an algorithm whose sample complexity $n$ and memory $M$ satisfy $n \cdot M = \tilde{O}(k)$, when $n$ is at least $O(k^{1-1/\alpha})$. They also provide some (non-tight) lower bounds on the memory requirements.

### 1.3 Our Results and Techniques

We consider the problem of estimating $H(p)$ from samples $X^n \sim p$, with as little space as possible. Our motivating question is:

What is the space-sample trade-off for entropy estimation over $\Delta_k$?

The optimal sample complexity is given in (3). However, straightforward implementations of the sample-optimal schemes in [21, 23, 22] require space nearly linear in $S(k, \varepsilon)$, which is nearly linear (in $k$) words of space. At the same time, when the number of samples exceeds $S_e(k, \varepsilon)$, given in (2), the empirical entropy of $X^n$ is within $\pm\varepsilon$ of $H(p)$. We can therefore use results from the streaming literature to estimate the empirical entropy of a data stream of $n = S_e(k, \varepsilon)$ samples to within $\pm\varepsilon$, and in doing so obtain a $\pm 2\varepsilon$ estimate of $H(p)$. In particular, the algorithm of [9] requires $S_e(k, \varepsilon)$ samples and, with $O\left(\log^2(S_e(k,\varepsilon)) \log\log(S_e(k,\varepsilon)) / \varepsilon^2\right)$ words of space, estimates $H(p)$ to $\pm\varepsilon$. Note that $S_e(k, \varepsilon)$ is linear in $k$. Our work requires a constant number of words of space while maintaining a sample complexity linear in $k$.

Theorem 1. There is an algorithm that requires $O\left(k(\log(1/\varepsilon))^2/\varepsilon^3\right)$ samples and 20 words of space, and estimates $H(p)$ to $\pm\varepsilon$.

The results and the state of the art are given in Table 1. A few remarks are in order.

Remark. (1) Our algorithm can bypass the $\Omega(1/\varepsilon^2)$ space lower bound for entropy estimation of data streams since $X^n$ is generated by a distribution and is not a worst-case data stream. (2) Consider the case when $\varepsilon$ is a constant, say $\varepsilon = 1$. Then the optimal sample complexity is $\Theta(k/\log k)$ (from (3)), and all known implementations of these algorithms require $\Omega(k)$ space. The streaming literature provides an algorithm with $O(k)$ samples and $O((\log k)^2)$ memory words. We provide an algorithm with $O(k)$ samples and 20 memory words. Compared to the sample-optimal algorithms, we have a $\log k$ blow-up in the sample complexity, but an exponential reduction in space.

Table 1: Sample and space complexity for estimating $H(p)$.
| Algorithm | Samples | Space (in words) |
| --- | --- | --- |
| Sample-optimal [21], [23, 22] | $\Theta\left(\frac{k}{\varepsilon \log k} + \frac{\log^2 k}{\varepsilon^2}\right)$ | $\Theta\left(\frac{k}{\varepsilon \log k} + \frac{\log^2 k}{\varepsilon^2}\right)$ |
| Streaming [8, 9] | $O\left(\frac{k}{\varepsilon} + \frac{\log^2 k}{\varepsilon^2}\right)$ | $O\left(\frac{\log^2(k/\varepsilon)\,\log\log(k/\varepsilon)}{\varepsilon^2}\right)$ |
| Algorithm 6 (this paper) | $O\left(\frac{k(\log(1/\varepsilon))^2}{\varepsilon^3}\right)$ | 20 |

We now describe the high-level approach and techniques. We can write $H(p)$ as

$$H(p) = \sum_{x} p(x) \log\frac{1}{p(x)} = \mathbb{E}_{X \sim p}\left[-\log p(X)\right]. \qquad (4)$$

A Simple Method. We build layers of sophistication on top of a simple approach. In each iteration:

1. Obtain a sample $X \sim p$.
2. Using constant memory, over the next $N$ samples, estimate $\log(1/p(X))$.

From (4), for large enough $N$, we can obtain a good estimate $\hat{p}(X)$ of $p(X)$, and $\log(1/\hat{p}(X))$ will be an almost unbiased estimate of the entropy. We can then maintain a running average of $\log(1/\hat{p}(X))$ over $R$ iterations, where $R$ is large enough for the empirical mean of $\log(1/\hat{p}(X))$ to concentrate. The total sample requirement is $NR$. This approach is described in Algorithm 1 (in Section 2). Theorem 4 states that it requires $O(1)$ memory words, but that its sample complexity is super-linear in $k$.

Intervals for Better Sample Complexity. To improve the sample complexity, we partition $[0, 1]$ into $T$ disjoint intervals (Algorithm 1 corresponds to $T = 1$). In Lemma 7 we express $H(p)$ as a sum of entropy-like expressions defined over the probability values in these $T$ intervals. We then estimate each of the terms separately with the approach stated above. We will show that the sample complexity as a function of $k$ drops roughly as $k(\log^{(T)} k)^2$, where $\log^{(T)}$ is the $T$-th iterated logarithm, while the space complexity is still constant.

In the simple algorithm described above, we need

1. $N$, the number of samples in each iteration, to be large enough for good estimates of $p(X)$;
2. $R$, the number of iterations, to be large enough for concentration.

Note that when $p(X)$ is large, fewer samples are needed to estimate $p(X)$ (small $N$), and for small $p(X)$ more samples are needed. However, if the intervals are chosen such that small probabilities are also contained in short intervals, the number of iterations $R$ needed for these intervals can be made small (the range of the random variables in Hoeffding's inequality is smaller). Succinctly: fewer samples are needed to estimate the large probabilities, and fewer iterations are needed for convergence of the estimates of small probabilities, by choosing the intervals carefully.

Some Useful Tools. We now state two concentration inequalities that we use throughout this paper.

Lemma 2 (Hoeffding's Inequality [53]). Let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$, and let $\bar{X} = (X_1 + \cdots + X_m)/m$. Then

$$\Pr\left(\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \geq t\right) \leq 2\exp\left(-\frac{2m^2t^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\right).$$

In some algorithms we consider, $m$ itself is a random variable. In those cases, we use the following variant of Hoeffding's inequality, which is proved in Section B.

Lemma 3 (Random Hoeffding's Inequality). Let $M \sim \mathrm{Bin}(m, p)$, and let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a, b]$. Let $\bar{X} = \left(\sum_{i=1}^{M} X_i\right)/M$. Then, for any $0 < p \leq 1$,

$$\Pr\left(\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \geq t/p\right) \leq 3\exp\left(-\frac{mt^2}{8p(b-a)^2}\right).$$

Outline. In Section 2 we describe the simple approach; its performance is stated in Theorem 4. In Section 3.1 (Algorithm 5) we show how the sample complexity can be reduced from $k\log^2 k$ in Theorem 4 to $k(\log\log k)^2$ in Theorem 8 by choosing two intervals ($T = 2$). The intervals are chosen such that the number of iterations $R$ for the small interval is $\mathrm{poly}(\log\log k)$ in Algorithm 5, compared to $\mathrm{poly}(\log k)$ in Algorithm 1. The algorithm for general $T$ is described in Section 3.2, and the performance of our main algorithm is given in Theorem 1.
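Before proceeding, here is a minimal Python simulation of the simple method described above (it is fleshed out as Algorithm 1 in Section 2). It is an illustrative sketch: the stream construction, the function name, and the parameter values are our own assumptions, not the paper's tuned constants.

```python
import numpy as np

def simple_entropy_estimate(sample_stream, N, R):
    """One-interval scheme: average R estimates of log(1/p(X)).

    Each iteration draws X from the stream, counts how often X appears in the
    next N stream symbols, and uses log(N / (count + 1)) (add-1 smoothing) as
    an estimate of log(1/p(X)). Only a constant number of scalars is stored.
    """
    total = 0.0
    for _ in range(R):
        x = next(sample_stream)                       # X ~ p
        n_x = sum(next(sample_stream) == x for _ in range(N))
        total += np.log(N / (n_x + 1))                # estimate of log(1/p(x))
    return total / R

# Toy usage with illustrative (not the paper's) parameters.
rng = np.random.default_rng(1)
k = 200
p = rng.dirichlet(np.ones(k))                          # a random k-ary distribution
stream = iter(lambda: rng.choice(k, p=p), None)        # i.i.d. sample stream
print(simple_entropy_estimate(stream, N=4 * k, R=300), -np.sum(p * np.log(p)))
```

With constant memory per iteration, the scheme uses $N \cdot R$ samples in total; the interval-based algorithms of Section 3 reduce this super-linear cost.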
## 2 A Building Block: Simple Algorithm with Constant Space

We propose a simple method (Algorithm 1) with the following guarantee.

Theorem 4. Let $\varepsilon > 0$. Algorithm 1 takes $O\left(k\log^2(k/\varepsilon)/\varepsilon^3\right)$ samples from $p \in \Delta_k$, uses at most 20 words of memory, and outputs $\bar{H}$ such that, with probability at least 2/3, $|\bar{H} - H(p)| \leq \varepsilon$.

Based on (4), each iteration of Algorithm 1 obtains a sample $X$ from $p$ and estimates $\log(1/p(X))$. To avoid assigning a zero probability value to $p(X)$, we apply add-1 smoothing to our empirical estimate of $p(X)$. The bias of our estimator can be controlled by the choice of $N$.

Algorithm 1 Entropy estimation with constant space: Simple Algorithm
Require: Accuracy parameter $\varepsilon > 0$, a data stream $X_1, X_2, \ldots \sim p$
1: $R \leftarrow 4\log^2(1 + 2k/\varepsilon)/\varepsilon^2$, $N \leftarrow 2k/\varepsilon$, $S \leftarrow 0$
2: for $t = 1, \ldots, R$ do
3: &nbsp;&nbsp; Let $x \leftarrow$ the next element in the data stream
4: &nbsp;&nbsp; $N_x \leftarrow$ number of appearances of $x$ in the next $N$ symbols
5: &nbsp;&nbsp; $\hat{H}_t = \log(N/(N_x + 1))$
6: &nbsp;&nbsp; $S = S + \hat{H}_t$
7: $\bar{H} = S/R$

Performance Guarantee. Algorithm 1 only maintains a running sum at the end of each iteration. We reserve two words for $N$, $R$ and $S$. We reserve one word to store $x$ and two words to keep track of $N_x$ in each iteration. We reserve three words for the counters. Thus the algorithm uses fewer than 20 words of space.

To bound the accuracy, note that $\bar{H}$ is the mean of $R$ i.i.d. random variables $\hat{H}_1, \ldots, \hat{H}_R$. We bound the bias and prove concentration of $\bar{H}$ using Lemma 2.

Bias Bound. A larger value of $N$ provides a better estimate of $p(X)$, and therefore a smaller bias in estimation. This is captured in the next lemma, which is proved in Section C.

Lemma 5 (Bias Bound). $\left|\mathbb{E}[\bar{H}] - H(p)\right| \leq \dfrac{k}{N}$.

Concentration. Since $\hat{H}_t \in [\log(N/(N+1)), \log N]$ for all $t$, we show in the next lemma that with a large enough $R$, $\bar{H}$ concentrates. This is proved in Section C.

Lemma 6 (Concentration). For any $\mu > 0$,

$$\Pr\left(\left|\bar{H} - \mathbb{E}[\bar{H}]\right| \geq \mu\right) \leq 2\exp\left(-\frac{2R\mu^2}{\log^2(N+1)}\right).$$

The choice of $N$ implies that $|\mathbb{E}[\bar{H}] - H(p)| \leq \varepsilon/2$, and choosing $\mu = \varepsilon/2$ and $R = 4\log^2(1 + 2k/\varepsilon)/\varepsilon^2$ implies that $\bar{H}$ is within $H(p) \pm \varepsilon$ with probability at least 2/3. The total sample complexity of Algorithm 1 is $(N + 1)R = O\left(k\log^2(k/\varepsilon)/\varepsilon^3\right)$.

## 3 Interval-based Algorithms

The simple algorithm of the previous section treats every symbol equally and uses the same $N$ and $R$ for all of them. To reduce the sample complexity, we express $H(p)$ as an expectation of various conditional expectations depending on the symbol probability values. For larger probability values we use a smaller $N$, and for small probabilities we use a smaller $R$. We then estimate the terms separately to obtain the final estimate.

Entropy as a Weighted Sum of Conditional Expectations. Let $T \in \mathbb{N}$ (decided later) and $0 = a_0 < a_1 < \ldots < a_T = 1$. Let $\mathcal{I} := \{I_1, I_2, \ldots, I_T\}$, where $I_j = [a_{T-j}, a_{T-j+1})$, be a partition of $[0, 1]$ into $T$ intervals. Consider a randomized algorithm $A : [k] \to \{I_1, \ldots, I_T\}$ that takes as input $x \in [k]$ and outputs an interval in $\mathcal{I}$. Let $p_A(I_j|x) = \Pr(A(x) = I_j)$. For a distribution $p \in \Delta_k$, let

$$p_A(I_j) := \sum_{x \in [k]} p(x)\, p_A(I_j|x), \qquad p_A(x|I_j) := \frac{p(x)\, p_A(I_j|x)}{p_A(I_j)}. \qquad (6)$$

Then $p_A(I_j)$ is the probability that $A(X) = I_j$ when $X \sim p$, and $p_A(\cdot|I_j)$ is the conditional distribution over $[k]$ given $A(X) = I_j$. Then we have the following lemma.

Lemma 7. Let $H_j := \mathbb{E}_{X \sim p_A(\cdot|I_j)}\left[-\log p(X)\right]$. Then $H(p) = \sum_{j=1}^{T} p_A(I_j)\, H_j$.

Indeed,

$$H(p) = \sum_{x} p(x)\log\frac{1}{p(x)} = \sum_{j=1}^{T}\sum_{x} p(x)\, p_A(I_j|x)\log\frac{1}{p(x)} = \sum_{j=1}^{T} p_A(I_j)\sum_{x} p_A(x|I_j)\log\frac{1}{p(x)} = \sum_{j=1}^{T} p_A(I_j)\,\mathbb{E}_{X \sim p_A(\cdot|I_j)}\left[-\log p(X)\right], \qquad (7)$$

where (7) follows from (6).

We will choose the intervals and the algorithm $A$ appropriately. By estimating each term in the summation above, we will design an algorithm with $T$ intervals that uses $O\left(k(\log^{(T)} k + \log(1/\varepsilon))^2/\varepsilon^3\right)$ samples and a constant number of words of space, and estimates $H(p)$ to $\pm\varepsilon$. In Section 3.1, we provide the details for $T = 2$.
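As a quick numerical sanity check of Lemma 7 (our own illustrative snippet, not part of the paper), the identity $H(p) = \sum_j p_A(I_j) H_j$ holds for any distribution $p$ and any randomized assignment rule $A$, not just the one used by the algorithms below:

```python
import numpy as np

rng = np.random.default_rng(0)
k, T = 6, 3
p = rng.dirichlet(np.ones(k))          # a k-ary distribution
A = rng.dirichlet(np.ones(T), size=k)  # A[x, j] = p_A(I_j | x); any randomized rule works

p_A_I = p @ A                                   # p_A(I_j) = sum_x p(x) p_A(I_j | x)
H = -np.sum(p * np.log(p))                      # H(p)
H_j = np.array([                                # H_j = E_{X ~ p_A(.|I_j)}[-log p(X)]
    -np.sum(p * A[:, j] / p_A_I[j] * np.log(p)) for j in range(T)
])
print(H, np.dot(p_A_I, H_j))                    # the two values coincide (Lemma 7)
```

In the algorithms that follow, $A$ is implemented by ESTINT, which guesses the interval containing $p(x)$ from a small number of samples.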
This section fleshes out the key arguments; finally, in Section 3.2, we extend the approach to $T = \log^* k$ intervals, where $\log^* k := \min\{i : \log^{(i)} k \leq 1\}$, to further reduce the sample complexity to $O\left(k(\log(1/\varepsilon))^2/\varepsilon^3\right)$.

### 3.1 Two Intervals Algorithm

We propose Algorithm 5 with $T = 2$ and the following guarantee.

Theorem 8. Algorithm 5 uses $O(NR + N_1R_1 + N_2R_2) = O\left(k(\log(\log(k)/\varepsilon))^2/\varepsilon^3\right)$ samples and 20 words of space, and outputs a $\pm\varepsilon$ estimate of $H(p)$ with probability at least 2/3.

#### 3.1.1 Description of the Algorithm

Let $T = 2$, and let $\beta > 16$ be a constant. Consider the following partition of $[0, 1]$:

$$I_2 = [0, \lambda), \quad I_1 = [\lambda, 1], \quad \text{where } \lambda = \frac{(\log k)^\beta}{k}. \qquad (8)$$

We now specify the algorithm $A : [k] \to \{I_1, I_2\}$ to be used in Lemma 7; it is denoted ESTINT (Algorithm 2). For $x \in [k]$, it takes $N$ samples from $p$ and outputs the interval in which the empirical fraction of occurrences of $x$ lies. In other words, ESTINT tries to predict the interval in which $p(x)$ lies.

Algorithm 2 $A$: ESTINT($N$, $x$)
1: Obtain $N$ samples from $p$
2: if $x$ appears at least $N\lambda$ times, output $I_1$
3: else output $I_2$

Algorithm 3 ESTPROBINT($N$, $R$)
1: $\hat{p}_A(I_1) = 0$
2: for $t = 1$ to $R$ do
3: &nbsp;&nbsp; Sample $x \sim p$
4: &nbsp;&nbsp; if ESTINT($N$, $x$) $= I_1$ then
5: &nbsp;&nbsp;&nbsp;&nbsp; $\hat{p}_A(I_1) = \hat{p}_A(I_1) + 1/R$

By Lemma 7, $H(p) = p_A(I_1) H_1 + p_A(I_2) H_2$. We estimate the terms in this expression as follows.

Estimating the $p_A(I_j)$'s. We run ESTINT multiple times on samples generated from $p$, and output the fraction of times the output is $I_j$ as an estimate of $p_A(I_j)$. We only estimate $p_A(I_1)$, since $p_A(I_1) + p_A(I_2) = 1$. The complete procedure is specified in Algorithm 3.

Estimating the $H_j$'s. Recall that the $H_j$'s are the expectations of $-\log p(x)$ under the different conditional distributions given in (6). Since the expectations are with respect to the conditional distributions, we first sample a symbol from $p$ and then, conditioned on the event that ESTINT outputs $I_j$, we use an algorithm similar to Algorithm 1 to estimate $\log(1/p(x))$. The full procedure is given in Algorithm 4. Notice that when computing $\hat{H}_2$ in Step 8, we clip the estimate to $\log\frac{1}{4\lambda}$ if $N_{x,2} > 4\lambda N_2 - 1$. This restricts each term of $\hat{H}_2$ to the range $\left[\log\frac{1}{4\lambda}, \log N_2\right]$, which helps when proving concentration.

#### 3.1.2 Performance Guarantees

Memory Requirements. We reserve 5 words to store the parameters $R_1$, $R_2$, $N_1$, $N_2$ and $\lambda$. ESTINT uses one word to keep track of the number of occurrences of $x$. For ESTPROBINT, we use one word to store $x$ and one word to keep track of the final sum $\hat{p}_A(I_1)$. We execute CONDEXP for each interval separately and use one word each to store $x$ and to keep track of $S_i$ and $\hat{H}_i$. We use two words to store the outputs $\bar{H}_1$ and $\bar{H}_2$, and store the final output $\hat{H}_{II}$ in one of them. Hence, at most 20 words of memory are sufficient.

Algorithm 4 Estimating $H_1$ and $H_2$: CONDEXP($N_1$, $N_2$, $R_1$, $R_2$)
1: for $i = 1, 2$: set $\hat{H}_i = 0$, $S_i = 0$, do
2: &nbsp;&nbsp; for $t = 1$ to $R_i$ do
3: &nbsp;&nbsp;&nbsp;&nbsp; Sample $x \sim p$
4: &nbsp;&nbsp;&nbsp;&nbsp; if ESTINT($N$, $x$) $= I_i$ then
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $S_i = S_i + 1$
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $N_{x,i} \leftarrow$ number of occurrences of $x$ in the next $N_i$ samples
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\hat{H}_i = \hat{H}_i + \log(N_i/(N_{x,i} + 1))$ if $i = 1$
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\hat{H}_i = \hat{H}_i + \max\left\{\log(N_i/(N_{x,i} + 1)),\ \log\frac{1}{4\lambda}\right\}$ if $i = 2$
9: &nbsp;&nbsp; $\bar{H}_i = \hat{H}_i / S_i$

Algorithm 5 Entropy Estimation with constant space: Two Intervals Algorithm
Require: Accuracy parameter $\varepsilon > 0$, $\gamma = \beta/2$, a data stream $X_1, X_2, \ldots \sim p$
1: $N = N_1 = \frac{C_1 k}{\varepsilon (\log k)^\gamma}$, $R = R_1 = \frac{C_2 \log^2(k/\varepsilon)}{\varepsilon^2}$, $N_2 = \frac{C_1 k}{\varepsilon}$, $R_2 = \frac{C_2 (\log((\log k)/\varepsilon))^2}{\varepsilon^2}$
2: $\hat{p}_A(I_1) = $ ESTPROBINT($N$, $R$)
3: $\bar{H}_1, \bar{H}_2 = $ CONDEXP($N_1$, $N_2$, $R_1$, $R_2$)
4: $\hat{H}_{II} = \hat{p}_A(I_1)\,\bar{H}_1 + (1 - \hat{p}_A(I_1))\,\bar{H}_2$
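Before analyzing the guarantees, here is a compact, self-contained Python simulation sketch of the two-interval scheme (Algorithms 2-5). It is a toy illustration under our own assumptions: the distribution, the boundary $\lambda$, and the constants $N, R, N_1, R_1, N_2, R_2$ are chosen for speed rather than set to the paper's values, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stream(p):
    """Infinite i.i.d. sample stream from a k-ary distribution p."""
    k = len(p)
    while True:
        yield from rng.choice(k, size=1024, p=p)

def est_int(stream, N, x, lam):
    """Algorithm 2 (ESTINT): guess whether p(x) lies in I1 = [lam, 1] or I2 = [0, lam)."""
    count = sum(next(stream) == x for _ in range(N))
    return 1 if count >= N * lam else 2

def est_prob_int(stream, N, R, lam):
    """Algorithm 3 (ESTPROBINT): estimate p_A(I1) as the fraction of I1 outputs."""
    return sum(est_int(stream, N, next(stream), lam) == 1 for _ in range(R)) / R

def cond_exp(stream, N, lam, N_i, R_i, i):
    """Algorithm 4 (CONDEXP) for one interval i: conditional average of log(1/p_hat(x))."""
    S, total = 0, 0.0
    for _ in range(R_i):
        x = next(stream)
        if est_int(stream, N, x, lam) == i:
            S += 1
            n_x = sum(next(stream) == x for _ in range(N_i))
            est = np.log(N_i / (n_x + 1))            # add-1 smoothed estimate of log(1/p(x))
            total += est if i == 1 else max(est, np.log(1 / (4 * lam)))  # clip on I2
    return total / max(S, 1)

# Toy instance: a Zipf-like distribution so that both intervals are populated.
k = 300
p = 1.0 / np.arange(1, k + 1); p /= p.sum()
lam = 0.02                                  # illustrative boundary (the paper uses (log k)^beta / k)
stream = make_stream(p)
N = N1 = 2000; R = R1 = 400; N2 = 4000; R2 = 200   # illustrative, not the paper's constants
p1_hat = est_prob_int(stream, N, R, lam)
H1_bar = cond_exp(stream, N, lam, N1, R1, 1)
H2_bar = cond_exp(stream, N, lam, N2, R2, 2)
print(p1_hat * H1_bar + (1 - p1_hat) * H2_bar,     # estimate as in Step 4 of Algorithm 5
      -np.sum(p * np.log(p)))                      # true entropy, for comparison
```

Only a constant number of scalars (counts, running sums, and the parameters) is kept at any time, mirroring the memory accounting above; the bias and concentration arguments that follow explain how $N_1, N_2, R_1, R_2$ must actually be set.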
Sample Guarantees. Let $\hat{H}'_{II}$ be the unclipped version of the estimator, in which we do not clip in Step 8 of Algorithm 4 (all other steps remain the same). Then we can bound the estimation error by the following three terms, each of which we bound separately:

$$\left|H(p) - \hat{H}_{II}\right| \leq \underbrace{\left|H(p) - \mathbb{E}[\hat{H}'_{II}]\right|}_{\text{Unclipped Bias}} + \underbrace{\left|\mathbb{E}[\hat{H}'_{II}] - \mathbb{E}[\hat{H}_{II}]\right|}_{\text{Clipping Error}} + \underbrace{\left|\hat{H}_{II} - \mathbb{E}[\hat{H}_{II}]\right|}_{\text{Concentration}}.$$

Clipping Error. By the design of CONDEXP, $\hat{H}_2$ is clipped only when the event $E_x = \{\text{ESTINT}(N, x) = I_2,\ N_{x,2} > 4\lambda N_2 - 1\}$ occurs for some $x \in \mathcal{X}$. We bound the clipping error in the following lemma (proof in Section D.3) by showing that $\Pr(E_x)$ is small.

Lemma 9 (Clipping Error Bound). Let $\hat{H}_{II}$ be the entropy estimate of Algorithm 5 and let $\hat{H}'_{II}$ be the entropy estimate of the unclipped version of Algorithm 5. Then $\left|\mathbb{E}[\hat{H}_{II}] - \mathbb{E}[\hat{H}'_{II}]\right| \leq \varepsilon/3$.

Concentration Bound. To prove the concentration bound, we use Lemma 10 to decompose it into three terms, each of which can be viewed as the difference between an empirical mean and its true expectation, and can therefore be bounded using concentration inequalities (proof in Section D.4).

Lemma 10 (Concentration Bound). Let $\hat{H}_{II}$ be the entropy estimate of Algorithm 5, let $\bar{H}_i$ be as computed in Algorithm 4, and let $p_A(I_i)$ be as defined in (6), where $A$ is ESTINT. Then, with probability at least 2/3,

$$\sum_{i=1}^{2} p_A(I_i)\left|\bar{H}_i - \mathbb{E}[\bar{H}_i]\right| + \left|p_A(I_1) - \hat{p}_A(I_1)\right|\left|\bar{H}_1 - \bar{H}_2\right| \leq \frac{\varepsilon}{3}.$$

We provide a brief outline of the proof below. By the union bound, in order to show that the sum is less than $\varepsilon/3$ with probability at least 2/3, it is sufficient to show that each of the three terms exceeds $\varepsilon/9$ with probability at most 1/9. To bound $|p_A(I_1) - \hat{p}_A(I_1)||\bar{H}_1 - \bar{H}_2|$, we first bound the range of $|\bar{H}_1 - \bar{H}_2|$ and then use Hoeffding's inequality (Lemma 2) to obtain concentration of $\hat{p}_A(I_1)$. To bound $p_A(I_i)|\bar{H}_i - \mathbb{E}[\bar{H}_i]|$, note that we cannot obtain concentration using Hoeffding's inequality directly, because the number of terms that we average over ($S_i$ in Algorithm 4) is a random variable. Therefore we apply the Random Hoeffding Inequality (Lemma 3) to $\bar{H}_i$. Since the required number of iterations $R_i$ depends on the range of the random variables being averaged, we obtain a reduction in the sample complexity for $i = 2$ because the estimates there are clipped below at $\log\frac{1}{4\lambda}$: the range for the second interval is $\log(N_2) - \log\frac{1}{4\lambda} = O\left(\log((\log k)/\varepsilon)\right)$, implying that $R_2 = O\left((\log((\log k)/\varepsilon))^2/\varepsilon^2\right)$ suffices for the desired probability. For $i = 1$, since the range is the same as in the one-interval case, we use the same $R_1$ as in the one-interval case. Note that $R_2 < R_1$.

Bias Bound. We bound the bias of the unclipped version $\hat{H}'_{II}$ using the following lemma, which is proved in Section D.2.

Lemma 11 (Unclipped Bias Bound). Let $\hat{H}'_{II}$ be the unclipped estimate of Algorithm 5 and let $p_A(I_i|x)$ be the conditional distribution defined in (6), where $A$ is ESTINT. Then,

$$\left|\mathbb{E}[\hat{H}'_{II}] - H(p)\right| \lesssim \sum_{i=1}^{2}\sum_{x \in \mathcal{X}} \frac{p_A(I_i|x)}{N_i}.$$

Lemma 11 allows us to choose $N_1$ and $N_2$ separately to bound the bias. Interval $I_2$'s contribution is at most $k/N_2$. For interval $I_1$, we improve upon $k/N_1$ by partitioning $\mathcal{X}$ into the sets $\mathcal{X}_1 = \{x \in \mathcal{X} : p(x) < \lambda/2\}$ and $\mathcal{X}_2 = \{x \in \mathcal{X} : p(x) \geq \lambda/2\}$. For $\mathcal{X}_1$, $p_A(I_1|x)$ is small by a Chernoff bound. For $\mathcal{X}_2$, since $p(x) \geq \lambda/2$, we have $|\mathcal{X}_2| \leq 2/\lambda$, which is smaller than $k$. Hence we can choose $N_1 < N_2$.

In the sample complexity of the two-interval algorithm, observe that the term $N_2R_2$ dominates. Reducing $N_2$ is hard because it is independent of the interval length. We therefore aim to reduce $R_2$ by partitioning into intervals of smaller length. In the smallest interval, if we reduce the range of each estimate to a constant, then $O(1/\varepsilon^2)$ iterations suffice for concentration. In the next section, we make this concrete by considering an algorithm that uses multiple intervals.
### 3.2 General Intervals Algorithm

The general algorithm follows the same principles as the previous section, with a larger number of intervals, decreasing the sample requirements at each step, as discussed in Section 1.3. However, the proofs are much more involved, particularly in order to obtain an $O(k)$ upper bound on the sample complexity. We sketch some of the key points and defer the bulk of the algorithm and the details to the appendix due to lack of space.

Intervals. Let $T = \log^* k$, where $\log^* k := \min\{i : \log^{(i)} k \leq 1\}$. Consider the following partition of $[0, 1]$ into intervals $\{I_i\}_{i=1}^{T}$, where $I_1 = [l_1, h_1]$ and, for $i = 2, \ldots, T$, $I_i = [l_i, h_i)$ with

$$h_i = \frac{(\log^{(i-1)}(k))^\beta}{k} \quad (\beta > 16), \qquad l_{i-1} = h_i.$$

Defining $l_T = 0$ and $h_1 = 1$, we have

$$I_1 = \left[\frac{(\log^{(1)}(k))^\beta}{k},\ 1\right], \qquad I_T = \left[0,\ \frac{(\log^{(T-1)}(k))^\beta}{k}\right), \qquad I_i = \left[\frac{(\log^{(i)}(k))^\beta}{k},\ \frac{(\log^{(i-1)}(k))^\beta}{k}\right) \ \text{for } i = 2, \ldots, T-1.$$

In effect, we divide the bottleneck interval of the two-intervals algorithm, $I_2$, into further intervals, until the width of the smallest interval is a constant over $k$ (at most $e^\beta/k$), which implies concentration with fewer samples than before. Using Lemma 7, as in the two-intervals case, we estimate each of the $p_A(I_i)$'s and $H_i$'s independently, via Algorithm 8 (GENESTPROBINT) and Algorithm 9 (GENCONDEXP), presented in Appendix E.1. The complete algorithm for $T = \log^* k$ is presented in Algorithm 6.

Algorithm 6 Entropy Estimation with constant space: General Intervals Algorithm
Require: Accuracy parameter $\varepsilon > 0$, $\gamma = \beta/2$, a data stream $X_1, X_2, \ldots \sim p$
1: For $i = 1, \ldots, T-1$: $N_i = \frac{C_N k}{\varepsilon (\log^{(i)}(k))^\gamma}$, $R_i = \frac{C_R (\log(\log^{(i-1)}(k)/\varepsilon))^2}{\varepsilon^2}$; and $N_T = \frac{C_N k}{\varepsilon}$, $R_T = \frac{C_R (\log(\log^{(T-1)}(k)/\varepsilon))^2}{\varepsilon^2}$
2: $\{\hat{p}_A(I_i)\}_{i=1}^{T-1} = $ GENESTPROBINT$\left(\{N_i\}_{i=1}^{T-1}, \{R_i\}_{i=1}^{T-1}\right)$
3: $\{\bar{H}_i\}_{i=1}^{T} = $ GENCONDEXP$\left(\{N_i\}_{i=1}^{T}, \{R_i\}_{i=1}^{T}\right)$
4: $\hat{H}_I = \sum_{i=1}^{T-1} \hat{p}_A(I_i)\,\bar{H}_i + \left(1 - \sum_{i=1}^{T-1} \hat{p}_A(I_i)\right)\bar{H}_T$

Memory Requirements. The analysis of the memory requirement is similar to that of the two-interval case. To store the parameters (the interval endpoints and the $N_i$'s and $R_i$'s), we only store $k$, $\varepsilon$, $\gamma$, $C_N$ and $C_R$, and compute the parameters on the fly. Notice that for each interval, the executions of GENESTINT, GENESTPROBINT and GENCONDEXP require the same memory as their two-interval counterparts. The trick here is that we do not need to store all the $\hat{p}_A(I_i)$'s and $\bar{H}_i$'s, since we can run GENESTPROBINT and GENCONDEXP for one interval at a time and maintain a running sum of the $\hat{p}_A(I_i)\,\bar{H}_i$'s. Therefore, Algorithm 6 uses at most 20 words of space.

Sample Complexity. Algorithm 6 establishes the main claim of our paper, Theorem 1. The key idea for removing the extra log log factor in Theorem 8 is to progressively make the error requirements stricter for the larger-probability intervals. We denote by $\hat{H}'_I$ the final estimate without the clipping step (Step 8), all other steps remaining the same. Then the error can be bounded by the following three terms:

$$\left|H(p) - \hat{H}_I\right| \leq \underbrace{\left|H(p) - \mathbb{E}[\hat{H}'_I]\right|}_{\text{Unclipped Bias}} + \underbrace{\left|\mathbb{E}[\hat{H}'_I] - \mathbb{E}[\hat{H}_I]\right|}_{\text{Clipping Error}} + \underbrace{\left|\hat{H}_I - \mathbb{E}[\hat{H}_I]\right|}_{\text{Concentration}}. \qquad (10)$$

With the parameters defined in Algorithm 6, we can bound the unclipped bias and the clipping error in (10) by $\varepsilon/3$ each, and show that the concentration term is also bounded by $\varepsilon/3$ with probability at least 2/3. The details are given in Lemmas 13, 14, and 15 in Appendix E.
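To illustrate the interval construction (an illustrative snippet of ours, not from the paper), the following computes $T = \log^* k$ and the upper endpoints $h_i$ of the partition:

```python
import math

def log_star(k):
    """T = log* k: number of times log must be applied to bring k down to <= 1."""
    i, v = 0, float(k)
    while v > 1:
        v, i = math.log(v), i + 1
    return i

def interval_endpoints(k, beta):
    """Upper endpoints of the partition: h_1 = 1 and h_i = (log^{(i-1)} k)^beta / k.

    Interval I_i is [h_{i+1}, h_i) for i < T, and I_T = [0, h_T).
    """
    T = log_star(k)
    ends, v = [1.0], float(k)
    for _ in range(2, T + 1):
        v = math.log(v)            # v = log^{(i-1)}(k) at step i
        ends.append(v ** beta / k)
    return ends

# The paper takes beta > 16, which only matters for very large k; a small beta
# keeps the endpoints below 1 for this toy k and still shows the shrinking widths.
k, beta = 10**6, 2.0
print(log_star(k), interval_endpoints(k, beta))
```

Because $\log^{(T-1)} k \leq e$, the last (smallest) interval has width at most $e^\beta/k$, which is what allows a constant range for its estimates and hence far fewer iterations for that interval.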
## 4 Open Problems

There are several interesting questions that arise from our work. While our algorithms require only a constant number of memory words, they require a multiplicative factor of $\log k$ more samples (as a function of $k$) than the optimal sample complexity (in (3)).

Does there exist an algorithm for entropy estimation that has the optimal sample complexity and a space requirement of at most $\mathrm{poly}(\log k)$?

We are unaware of any implementation that requires sub-linear space in $k$. Designing a strictly sublinear-space (space requirement $k^c$ for some $c < 1$) sample-optimal algorithm could be a first step toward answering the question above. At the same time, there might not exist an algorithm with a small sample complexity. This leads to the following complementary question.

Prove a lower bound on the space requirement of a sample-optimal algorithm for entropy estimation.

Beyond these, obtaining sample-space trade-offs for distribution testing and property estimation tasks is an exciting future direction.

Acknowledgements. This work is supported by NSF-CCF-1657471. This research started with the support of an MIT-Shell Energy Research Fellowship to JA and PI, while JA was at MIT.

## References

[1] J. Ian Munro and Mike S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.

[2] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, 1996.

[4] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, page 189. IEEE, 2000.

[5] Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on the Theory of Computing, STOC '05, 2005.

[6] Ashwin Lall, Vyas Sekar, Mitsunori Ogihara, Jun Xu, and Hui Zhang. Data streaming algorithms for estimating entropy of network traffic. 34(1):145–156, June 2006.

[7] Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. Internet Mathematics, 3(1):63–78, 2006.

[8] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. Volume 6, pages 51:1–51:21, New York, NY, USA, July 2010. ACM.

[9] Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and streaming entropy via approximation theory. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pages 489–498, Philadelphia, PA, USA, 2008. IEEE Computer Society.

[10] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian. Sublinear estimation of entropy and information distances. ACM Transactions on Algorithms, 5(4), 2009.

[11] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer, 2002.

[12] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[13] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.

[14] Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 385–400. ACM, 2016.

[15] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream.
In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1–10. Springer, 2002.

[16] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010.

[17] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044–2059, 2009.

[18] Amit Chakrabarti, T. S. Jayram, and Mihai Pătrașcu. Tight lower bounds for selection in randomly ordered streams. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 720–729. Society for Industrial and Applied Mathematics, 2008.

[19] George A. Miller. Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, 2:95–100, 1955.

[20] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[21] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC '11, pages 685–694, New York, NY, USA, 2011.

[22] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015.

[23] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[24] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 11–21. JMLR, Inc., 2017.

[25] Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8834–8843. Curran Associates, Inc., 2018.

[26] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint arXiv:1906.03794, 2019.

[27] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38–56, January 2017.

[28] Maciej Obremski and Maciej Skorski. Rényi entropy estimation revisited. In Proceedings of the 20th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX '17, pages 20:1–20:15, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[29] Kazuto Fukuchi and Jun Sakuma. Minimax optimal estimators for additive scalar functionals of discrete distributions. In Proceedings of the 2017 IEEE International Symposium on Information Theory, ISIT '17, pages 2103–2107, Washington, DC, USA, 2017. IEEE Computer Society.

[30] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

[31] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species.
Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[32] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of KL divergence between discrete distributions. In 2016 International Symposium on Information Theory and Its Applications, ISITA '16, pages 256–260. IEEE, 2016.

[33] Jayadev Acharya. Profile maximum likelihood is optimal for estimating KL divergence. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1400–1404. IEEE, 2018.

[34] Martin E. Hellman and Thomas M. Cover. Learning with finite memory. The Annals of Mathematical Statistics, pages 765–782, 1970.

[35] Thomas M. Cover. Hypothesis testing with finite statistics. The Annals of Mathematical Statistics, pages 828–835, 1969.

[36] Thomas Leighton and Ronald Rivest. Estimating a probability using finite memory. IEEE Transactions on Information Theory, 32(6):733–742, 1986.

[37] Sudipto Guha and Andrew McGregor. Space-efficient sampling. In Artificial Intelligence and Statistics, pages 171–178, 2007.

[38] Steve Chien, Katrina Ligett, and Andrew McGregor. Space-efficient estimation of robust statistics and distribution testing. Tsinghua University Press, January 2010.

[39] Yuval Dagan and Ohad Shamir. Detecting correlations with little memory and communication. arXiv preprint arXiv:1803.01420, 2018.

[40] Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff. Stochastic streams: Sample complexity vs. space complexity. In LIPIcs–Leibniz International Proceedings in Informatics, volume 57. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2016.

[41] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1490–1516, 2016.

[42] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. In Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science, 2016.

[43] Dana Moshkovitz and Michal Moshkovitz. Mixing implies lower bounds for space bounded learning. In Proceedings of the 30th Annual Conference on Learning Theory, pages 1516–1566, 2017.

[44] Ayush Jain and Himanshu Tyagi. Effective memory shrinkage in estimation. In Proceedings of the 2018 IEEE International Symposium on Information Theory, 2018.

[45] Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, and Sankeerth Rao. Communication and memory efficient testing of discrete distributions. In Proceedings of the Conference on Learning Theory, COLT '19, 2019.

[46] Georgij P. Basharin. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications, 4(3):333–336, 1959.

[47] András Antos and Ioannis Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, October 2001.

[48] Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200–2203, 2004.

[49] Lakshminath Bhuvanagiri and Sumit Ganguly. Estimating entropy over data streams. In Proceedings of the 14th Annual European Symposium on Algorithms, volume 4168, page 148. Springer, 2006.

[50] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547–561, 1961.

[51] Oded Goldreich and Dana Ron.
On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity (ECCC), 7(20), 2000.

[52] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: Lower bounds and applications. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 266–275. ACM, 2001.

[53] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.