# Dimension-Free Empirical Entropy Estimation

Doron Cohen, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel. doronv@post.bgu.ac.il
Aryeh Kontorovich, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel. karyeh@cs.bgu.ac.il
Aaron Koolyk, Department of Computer Science, Hebrew University, Jerusalem, Israel. aaron.koolyk@mail.huji.ac.il
Geoffrey Wolfer (JSPS International Research Fellow), Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan. geo-wolfer@m2.tuat.ac.jp

Abstract

We seek an entropy estimator for discrete distributions with fully empirical accuracy bounds. As stated, this goal is infeasible without some prior assumptions on the distribution. We discover that a certain information moment assumption renders the problem feasible. We argue that the moment assumption is natural and, in some sense, minimalistic: it is weaker than finite support or tail decay conditions. Under the moment assumption, we provide the first finite-sample entropy estimates for infinite alphabets, nearly recovering the known minimax rates. Moreover, we demonstrate that our empirical bounds are significantly sharper than the state-of-the-art bounds for various natural distributions and non-trivial sample regimes. Along the way, we give a dimension-free analogue of the Cover-Thomas result on entropy continuity (with respect to total variation distance) for finite alphabets, which may be of independent interest.

1 Introduction

Estimating the entropy of a discrete distribution based on a finite iid sample is a classic problem with theoretical and practical ramifications. Considerable progress has been made in the case of a finite alphabet, and the countably infinite case has also attracted a fair amount of attention in recent years. A less-addressed issue is that of empirical accuracy estimates: data-dependent bounds adaptive to the particular distribution being sampled.

Our point of departure is the simpler (to analyze) problem of estimating a discrete distribution $\mu$ in the total variation norm $\|\cdot\|_{\mathrm{TV}} = \frac12\|\cdot\|_1$, where the most recent advance was made by Cohen et al. [2020]; see therein for a literature review. If $\mu$ is a distribution on $\mathbb{N}$ and $\hat\mu_n$ is its empirical realization based on a sample of size $n$, then Theorem 2.1 of Cohen et al. states that, with probability at least $1-\delta$,

$$\|\mu - \hat\mu_n\|_1 \le \sqrt{\frac{2}{n}}\,\|\hat\mu_n\|_{1/2}^{1/2} + 6\sqrt{\frac{\log(4/\delta)}{n}}. \tag{1}$$

This bound has the advantage of being valid for all distributions on $\mathbb{N}$, without any prior assumptions, and of being fully empirical: it yields a risk estimate that is computable from the observed sample and does not depend on any unknown quantities. (Additionally, Cohen et al. argue that (1) is near-optimal in a well-defined sense.) The question we set out to explore in this paper is: What analogues of (1) are possible for discrete entropy estimation?

When $\mu$ has support size $d < \infty$, an answer to our question is readily provided by combining (1) with Cover and Thomas [2006, Theorem 17.3.3], which asserts that, for $\|\mu - \nu\|_1 \le 1/2$, we have

$$|H(\mu) - H(\nu)| \le \|\mu - \nu\|_1 \log\frac{d}{\|\mu - \nu\|_1}, \tag{2}$$

where $H(\cdot)$ is the entropy functional defined in (3). Indeed, taking $\mu$ as in (1) and $\nu$ to be $\hat\mu_n$ yields a fully empirical estimate on $|H(\mu) - H(\hat\mu_n)|$.

A major part of this research was conducted while the author was a graduate student at Ben-Gurion University of the Negev, Israel.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
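To make the finite-support recipe above concrete, here is a minimal Python sketch (ours, not the paper's) of the plug-in estimate and the bound (2); treating the $\ell_1$ radius as an externally supplied argument keeps the sketch agnostic about the exact constants in (1). The function names are illustrative.

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (empirical) entropy H(mu_hat_n) in nats, from raw symbol counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def cover_thomas_bound(l1_radius, d):
    """Right-hand side of (2): eps * log(d / eps), valid only for eps <= 1/2.

    `l1_radius` is a high-probability bound on ||mu - mu_hat_n||_1 (e.g. the
    right-hand side of (1)); `d` is the known, finite support size.
    """
    if not 0 < l1_radius <= 0.5:
        raise ValueError("(2) requires an l1 radius in (0, 1/2]")
    return float(l1_radius * np.log(d / l1_radius))

# With probability >= 1 - delta, |H(mu) - plugin_entropy(counts)| is at most
# cover_thomas_bound(eps, d) whenever eps bounds ||mu - mu_hat_n||_1.
```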
In the large-alphabet regime, where $d$ is allowed to grow with $n$, no technique relying on the plug-in estimator can yield minimax rates [Wu and Yang, 2016]. The plug-in is, however, minimax optimal for fixed $d < \infty$ [Paninski, 2003], as well as strongly universally consistent even for $d = \infty$ [Antos and Kontoyiannis, 2001a], and is among the few methods for which explicitly computable finite-sample risk bounds are known.

The thrust of this paper is to replace the restrictive finite-support assumption with considerably more general moment conditions. It is well known that when estimating the mean of some random variable $X$, the first-moment assumption $\mathbb{E}|X| \le M$ is not sufficient to yield any finite-sample information.[2] Strengthening the assumption to $\mathbb{E}|X|^\alpha \le M$, for any $\alpha > 1$, immediately yields finite-sample empirical estimates on $|\mathbb{E}X - \frac1n\sum_{i=1}^n X_i|$ via the von Bahr and Esseen [1965] inequality.[3] In this sense, a bound on the $(1+\varepsilon)$th moment is a minimal requirement for empirical mean estimation. However, it is not immediately obvious how to apply this insight to the entropy estimation problem: the corresponding random variable is $X = \log(1/\mu(I))$, where $I \sim \mu$, but rather than being given iid samples of $X$, we are only given draws of $I$.

Our contribution. In Theorem 1, we provide a dimension-free analogue of (2), which, combined with (1), allows for empirical accuracy bounds on the plug-in entropy estimator under a minimalistic moment assumption. Moreover, for this rich class of distributions, the plug-in estimator turns out to be asymptotically optimal, as we show in Theorem 4. Our moment assumption is natural and essentially the weakest one that makes any empirical bounds feasible, as we argue in Theorem 3. As we demonstrate in Section 6, the rates provided by our empirical bound compare favorably against the state of the art.

2 Definitions and notation

Our logarithms will always be base $e$ by default. For discrete distributions, there is no loss of generality in taking the domain to be the natural numbers $\mathbb{N} = \{1, 2, 3, \ldots\}$. For $k \in \mathbb{N}$, we write $[k] := \{i \in \mathbb{N} : i \le k\}$. The set of all probability distributions on $\mathbb{N}$ will be denoted by $\Delta_{\mathbb{N}}$. For $d \in \mathbb{N}$, we write $\Delta_d \subset \Delta_{\mathbb{N}}$ to denote those $\mu$ whose support is contained in $[d]$. We define the operator $(\cdot)^{\downarrow}$, which maps any $\mu \in \Delta_{\mathbb{N}}$ to its non-increasing rearrangement $\mu^{\downarrow}$. The set of all non-increasing distributions will be denoted by $\Delta_{\mathbb{N}}^{\downarrow} := \{\mu^{\downarrow} : \mu \in \Delta_{\mathbb{N}}\}$. We write $\mathbb{R}_+ := [0, \infty)$. For any $\xi : \mathbb{N} \to \mathbb{R}_+$ and $\alpha \ge 0$, define

$$H^{(\alpha)}(\xi) := \sum_{j \in \mathbb{N}:\,\xi(j) > 0} \xi(j)\,|\log \xi(j)|^\alpha. \tag{3}$$

For $\xi \in \mathbb{R}^{\mathbb{N}}$, denote by $|\xi| \in \mathbb{R}_+^{\mathbb{N}}$ the elementwise application of $|\cdot|$ to $\xi$. When $\xi \in \Delta_{\mathbb{N}}$ and $\alpha = 1$, (3) recovers the standard definition of entropy, which we denote by $H(\xi) := H^{(1)}(\xi)$. For general $\alpha > 0$, this quantity may be referred to as the $\alpha$th moment of information. For $h \ge 0$, define

$$\Delta_{\mathbb{N}}^{(\alpha)}[h] = \left\{\mu \in \Delta_{\mathbb{N}} : H^{(\alpha)}(\mu) \le h\right\}$$

and also $\Delta_{\mathbb{N}}^{(\alpha)} := \bigcup_{h \ge 0} \Delta_{\mathbb{N}}^{(\alpha)}[h]$ and $\Delta_{\mathbb{N}}^{\downarrow(\alpha)}[h] := \Delta_{\mathbb{N}}^{\downarrow} \cap \Delta_{\mathbb{N}}^{(\alpha)}[h]$. For $n \in \mathbb{N}$ and $\mu \in \Delta_{\mathbb{N}}$, we write $X = (X_1, \ldots, X_n) \sim \mu^n$ to mean that the components of the vector $X$ are drawn iid from $\mu$. The empirical measure $\hat\mu_n \in \Delta_{\mathbb{N}}$ induced by the sample $X$ is defined by $\hat\mu_n(j) = \frac1n\sum_{i \in [n]} \mathbf{1}[X_i = j]$.

[2] Even distinguishing, for $X \ge 0$, between $\mathbb{E}X = 0$ and $\mathbb{E}X = M$ based on a finite sample is impossible with any degree of confidence. Of course, $\frac1n\sum_{i=1}^n X_i \to \mathbb{E}X$ almost surely, by the strong law of large numbers.
[3] Put $Y = X - \mathbb{E}X$; then $\mathbb{E}|Y|^\alpha \le 2^\alpha M$. For $1 < \alpha < 2$, a sharper version of the von Bahr-Esseen inequality [Pinelis, 2015] states that $\mathbb{E}\left|\sum_{i=1}^n Y_i\right|^\alpha \le 2n \cdot 2^\alpha M$, which implies tail bounds via Markov's inequality. Better rates are available via the median-of-means estimator; see Lugosi and Mendelson [2019].
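As a concrete illustration of the objects just defined, the following Python sketch (ours, not the paper's) builds the empirical measure $\hat\mu_n$ from a sample and evaluates the $\alpha$th moment of information of (3); taking $\alpha = 1$ recovers the plug-in entropy in nats.

```python
import math
from collections import Counter

def empirical_measure(sample):
    """mu_hat_n(j) = (1/n) * #{i in [n] : X_i = j}, as a dict over observed symbols."""
    n = len(sample)
    return {j: c / n for j, c in Counter(sample).items()}

def information_moment(xi, alpha=1.0):
    """H^(alpha)(xi) = sum_{j : xi(j) > 0} xi(j) * |log xi(j)|**alpha (natural log, as in (3))."""
    return sum(p * abs(math.log(p)) ** alpha for p in xi.values() if p > 0)

# Example usage: the plug-in entropy of a sample x is
#   information_moment(empirical_measure(x), alpha=1.0)
```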
For any $\xi \in \mathbb{R}^{\mathbb{N}}$ and $0 < p < \infty$, the $\ell_p$ (pseudo)norm is defined by $\|\xi\|_p^p = \sum_{j \in \mathbb{N}} |\xi(j)|^p$, and $\|\xi\|_\infty = \sup_{j \in \mathbb{N}} |\xi(j)|$. For $\alpha, h > 0$ and $n \in \mathbb{N}$, define the $L_1$ minimax risk for the $\alpha$th moment by

$$R_n^{(\alpha)}(h) := \inf_{\hat H}\,\sup_{\mu \in \Delta_{\mathbb{N}}^{(\alpha)}[h]} \mathbb{E}\,\big|\hat H(X_1, \ldots, X_n) - H(\mu)\big|, \tag{4}$$

where the infimum is over all mappings $\hat H : \mathbb{N}^n \to \mathbb{R}_+$.

3 Main results

Our first result is a dimension-free analogue of (2):

Theorem 1. For all $\alpha > 1$, $H : \Delta_{\mathbb{N}}^{(\alpha)} \to \mathbb{R}_+$ is uniformly continuous under $\ell_1$. In particular, for all $\mu, \nu \in \Delta_{\mathbb{N}}^{(\alpha)}$ satisfying $\|\mu - \nu\|_1 < 1/2$, we have

$$|H(\mu) - H(\nu)| \le \|\mu - \nu\|_1^{1 - 1/\alpha}\left(2\alpha^\alpha + H^{(\alpha)}(\mu) + H^{(\alpha)}(\nu)\right)^{1/\alpha} \le \|\mu - \nu\|_1^{1 - 1/\alpha}\left(2\alpha + H^{(\alpha)}(\mu)^{1/\alpha} + H^{(\alpha)}(\nu)^{1/\alpha}\right).$$

The requirement in Theorem 1 that $\alpha > 1$ cannot be dispensed with, as the function $H : \Delta_{\mathbb{N}}^{(1)}[h] \to \mathbb{R}_+$ is not continuous under $\ell_1$ (see the Remark following Lemma 5) and, a fortiori, is not uniformly continuous. Thus, there can be no function $F : \mathbb{R}_+^2 \to \mathbb{R}_+$ satisfying

$$|H(\mu) - H(\nu)| \le F(\|\mu - \nu\|_1, h), \qquad h > 0,\ \mu, \nu \in \Delta_{\mathbb{N}}^{(1)}[h],$$

with the additional property that for any two sequences $\mu_n, \nu_n \in \Delta_{\mathbb{N}}$ satisfying $\varepsilon_n := \|\mu_n - \nu_n\|_1 \to 0$, it holds that $F(\varepsilon_n, h) \to 0$.

Perhaps surprisingly,[4] it turns out that $H : \Delta_{\mathbb{N}}^{(\alpha)}[h] \to \mathbb{R}_+$ is uniformly continuous under $\ell_p$ for all $\alpha > 1$, $p \in [1, \infty]$:

Theorem 2. There is a function $F : \mathbb{R}_+^4 \to \mathbb{R}_+$ such that

$$|H(\mu) - H(\nu)| \le F(\|\mu - \nu\|_p, h, \alpha, p), \qquad h > 0,\ \alpha > 1,\ p \in [1, \infty],\ \mu, \nu \in \Delta_{\mathbb{N}}^{(\alpha)}[h],$$

with the additional property that whenever $\varepsilon_n := \|\mu_n - \nu_n\|_p \to 0$, we have $F(\varepsilon_n, h, \alpha, p) \to 0$.

Remark. Although Theorem 2 establishes uniform continuity, it gives no hint as to the functional dependence of the modulus of continuity $F$ on $\alpha$, $p$, $h$, and $\|\mu - \nu\|_p$. We leave this as a fascinating open problem, even though the practical applications are likely to be limited: it follows from Wyner and Foster [2003] and Theorem 4 that, for $p = \alpha = 2$ and fixed $h$, $F(\|\mu - \nu\|_2, h, 2, 2)$ cannot decay at a faster rate than $1/\log(1/\|\mu - \nu\|_2)$.

Combining Theorem 1 with (1) yields an empirical (under moment assumptions) bound for the plug-in entropy estimator:

Corollary 1. For all $\alpha > 1$, $h > 0$, $\delta \in (0, 1)$, $n \ge 2\log\frac4\delta$, and $\mu \in \Delta_{\mathbb{N}}^{(\alpha)}[h]$, we have that

$$|H(\mu) - H(\hat\mu_n)| \le \left(2\alpha^\alpha + h + H^{(\alpha)}(\hat\mu_n)\right)^{1/\alpha}\left(\sqrt{\frac{2}{n}}\,\|\hat\mu_n\|_{1/2}^{1/2} + 6\sqrt{\frac{\log(4/\delta)}{n}}\right)^{1 - 1/\alpha}$$

holds with probability at least $1 - \delta$.

In Section 6, we compare the rates implied by Corollary 1 to the state of the art on various distributions.

Next, we examine the optimality of the plug-in estimate by analyzing the minimax risk, defined in (4). It was known [Silva, 2018, Appendix A] that assuming $H(\mu) < \infty$ does not suffice to yield a minimax rate for the $L_2$ risk:

$$\inf_{\hat H : \mathbb{N}^n \to \mathbb{R}_+}\,\sup_{\mu \in \Delta_{\mathbb{N}}^{(1)}} \mathbb{E}\left[\left(\hat H(X_1, \ldots, X_n) - H(\mu)\right)^2\right] = \infty.$$

This technique yields an analogous result for the $L_1$ risk as well. We strengthen these results in two ways: (i) by lower-bounding the $L_1$ risk (rather than $L_2$, which is never smaller), and (ii) by restricting $\mu$ to $\Delta_{\mathbb{N}}^{(1)}[h]$ and obtaining a finitary, quantitative lower bound:

Theorem 3. For $\alpha = 1$, there is a universal constant $C > 0$ such that for all $h > 1$ and $n \in \mathbb{N}$, we have $R_n^{(1)}(h) \ge Ch$.

Remark. The above result complements, but is not directly comparable to, Antos and Kontoyiannis [2001a, Theorem 4]. Ours gives a quantitative dependence on $h$ but constructs an adversarial distribution for each sample size $n$; theirs is asymptotic only, but a single adversarial distribution suffices for all $n$.

Remark. Our technique immediately yields a lower bound of $Ch^2$ on the $L_2$ minimax risk.

[4] Since $\ell_1$ dominates all of the $\ell_p$ norms, continuity of a function under $\ell_p$ trivially implies continuity under $\ell_1$, but the reverse implication is generally not true.
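Returning to Corollary 1, the bound is computable from the sample alone once $(\alpha, h, \delta)$ are fixed. Below is a minimal Python sketch (ours, not the paper's); it assumes the empirical $\ell_1$ radius of (1) is computed separately and passed in as `l1_radius`, so the exact constants of (1) live outside the function.

```python
import math

def corollary1_bound(mu_hat, alpha, h, l1_radius):
    """Empirical bound of Corollary 1 (sketch): with probability >= 1 - delta,

        |H(mu) - H(mu_hat_n)| <= (2*alpha**alpha + h + H^(alpha)(mu_hat_n))**(1/alpha)
                                 * l1_radius**(1 - 1/alpha),

    where `mu_hat` is the empirical measure (dict of frequencies), `h` is the assumed
    bound on H^(alpha)(mu), and `l1_radius` is the radius of (1) for the same delta.
    """
    assert alpha > 1 and h >= 0 and l1_radius > 0
    moment = sum(p * abs(math.log(p)) ** alpha for p in mu_hat.values() if p > 0)
    return (2 * alpha ** alpha + h + moment) ** (1 / alpha) * l1_radius ** (1 - 1 / alpha)
```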
In contradistinction to the $\alpha = 1$ case, where no minimax rate exists, we show that the plug-in estimator is minimax for all $\alpha > 1$:

Theorem 4. The following bounds hold for the $L_1$ minimax risk:

(a) Upper bound: for all $h > 0$, $\alpha > 1$,

$$R_n^{(\alpha)}(h) \le \frac{1 + \log n}{\sqrt n} + \frac{2^{\alpha - 1} h}{\log^{\alpha - 1} n}, \qquad n \in \mathbb{N};$$

further, this bound is achieved by the plug-in estimate $H(\hat\mu_n)$.

(b) Lower bound: for each $\alpha > 0$, $n \in \mathbb{N}$ there is an $h > 0$ such that

$$R_n^{(\alpha)}(h) \ge \frac{h}{4 \cdot 3^\alpha \log^{\alpha - 1} n}.$$

Open problem. Close the gap in the dependence on $\alpha$ in the upper and lower bounds.

4 Related work

Continuity, convergence, moments of information. Zhang [2007] gave a sharpened version of (2) and Ho and Yeung [2010] presented analogous bounds; Audenaert [2007] proved a non-commutative generalization. Sason [2013, Theorem 5] upper-bounds $|H(\mu) - H(\nu)|$ in terms of quantities related to $\|\mu - \nu\|_1$, where (at most) one of the two distributions is allowed to have infinite support. Even though $H(\cdot)$ is not continuous on $\Delta_{\mathbb{N}}$, the plug-in estimate $H(\hat\mu_n)$ converges to $H(\mu)$ almost surely and in $L_2$ [Antos and Kontoyiannis, 2001a]. Silva [2018] studied a variety of restrictions on distributions over infinite alphabets to derive strong consistency results and rates of convergence. Moments of information were apparently first defined in Golomb [1966].

Entropy estimation. Recent surveys of entropy estimation results may be found in Jiao et al. [2015], Verdú [2019]. The finite-alphabet case is particularly well understood. For fixed alphabet size $d < \infty$, the plug-in estimate is asymptotically minimax optimal [Paninski, 2003]. Paninski [2004] non-constructively established the existence of a sublinear (in $d$) entropy estimator. The optimal dependence on $d$ (at fixed accuracy) was settled by Valiant and Valiant [2011a, 2017] as being $\Theta(d/\log d)$. The $\Theta(d/\log d)$ dependence on the alphabet size is also relevant in the so-called high-dimensional asymptotic regime, where $d$ grows with $n$. Here, the plug-in estimate is no longer optimal, and more sophisticated techniques are called for [Valiant and Valiant, 2011a,b, 2017]. The works of Wu and Yang [2016], Jiao et al. [2015], Han et al. [2015], Jiao et al. [2017] characterized the minimax rates for the high-dimensional regime: a small additive error of $\varepsilon$ requires $\Theta(d/(\varepsilon\log d))$ samples. Building off of these polynomial-approximation-based constructions, Acharya et al. [2017] design an additional optimal estimator, this one based on a profile maximum likelihood approach that can also estimate a variety of other important statistics. Fukuchi and Sakuma [2017, 2018] generalize the optimal estimators to estimate any additive functional, recovering in particular the optimal rates for entropy. Acharya et al. [2019] modify these optimal estimators with the added goal of low space complexity.

Finally, there is the infinite-alphabet case. Although here the plug-in estimate is again universally strongly consistent, control of the convergence rate requires some assumption on the sampling distribution, and Antos and Kontoyiannis [2001a] compellingly argue that moment assumptions are natural and minimalistic. Absent any prior assumptions, the $L_1$ (and hence $L_2$) convergence rate of any estimator can be made arbitrarily slow (Theorem 4 ibid.). The present paper proves a variant of this result (see Theorem 3 and the Remark following it). Antos and Kontoyiannis [2001a] further show that even under moment assumptions, there is no polynomial rate of convergence for the plug-in estimate: there is no $\beta > 0$ such that its risk decays as $O(n^{-\beta})$.
Wyner and Foster [2003] showed that the plug-in estimate achieves a rate of $O(1/\log n)$ for bounded second moment, and this is minimax optimal. Brautbar and Samorodnitsky [2007] exhibited a function of the higher moments that can be used in place of alphabet size to give a multiplicative approximation to the entropy.

5 Proofs

5.1 Proof of Theorem 1

We begin with a subadditivity result for the $\alpha$th moment of information (which we state for $\alpha > 0$, even though only the range $\alpha > 1$ will be needed).

Lemma 1. For $\alpha > 0$ and $\mu, \nu \in \Delta_{\mathbb{N}}^{(\alpha)}$, we have

$$H^{(\alpha)}(|\mu - \nu|) \le 2\alpha^\alpha + H^{(\alpha)}(\mu) + H^{(\alpha)}(\nu).$$

Proof. Define $h^{(\alpha)} : [0, 1] \to \mathbb{R}_+$ by $z \mapsto z\ln^\alpha(1/z)$, where $h^{(\alpha)}(0) = 0$. The function $h^{(\alpha)}$ is increasing on $[0, e^{-\alpha}]$ and decreasing on $[e^{-\alpha}, 1]$. The maximum is therefore achieved at $z = e^{-\alpha}$, and

$$\max_{z \in [0,1]} h^{(\alpha)}(z) = h^{(\alpha)}(e^{-\alpha}) = e^{-\alpha}\alpha^\alpha. \tag{5}$$

Now decompose $H^{(\alpha)}$:

$$H^{(\alpha)}(|\mu - \nu|) = \sum_{i:\,\mu(i)\vee\nu(i) > e^{-\alpha}} h^{(\alpha)}(|\mu(i) - \nu(i)|) + \sum_{i:\,\mu(i)\vee\nu(i) \le e^{-\alpha}} h^{(\alpha)}(|\mu(i) - \nu(i)|).$$

For the first term, since $\mu \in \Delta_{\mathbb{N}}$, it must be that $|\{i \in \mathbb{N} : \mu(i) > e^{-\alpha}\}| \le e^\alpha$, and similarly for $\nu$. Thus,

$$\sum_{i:\,\mu(i)\vee\nu(i) > e^{-\alpha}} h^{(\alpha)}(|\mu(i) - \nu(i)|) \le \Big(\big|\{i : \mu(i) > e^{-\alpha}\}\big| + \big|\{i : \nu(i) > e^{-\alpha}\}\big|\Big)\max_{z\in[0,1]} h^{(\alpha)}(z) \le 2e^\alpha \cdot e^{-\alpha}\alpha^\alpha = 2\alpha^\alpha.$$

For the second term, notice that when $\mu(i)\vee\nu(i) \le e^{-\alpha}$, the monotonicity of $h^{(\alpha)}$ implies $h^{(\alpha)}(|\mu(i) - \nu(i)|) \le h^{(\alpha)}(\mu(i)\vee\nu(i))$, and hence

$$\sum_{i:\,\mu(i)\vee\nu(i) \le e^{-\alpha}} h^{(\alpha)}(|\mu(i) - \nu(i)|) \le \sum_{i:\,\mu(i)\vee\nu(i) \le e^{-\alpha}} h^{(\alpha)}(\mu(i)\vee\nu(i)) \le \sum_{i:\,\mu(i)\vee\nu(i) \le e^{-\alpha}} \left(h^{(\alpha)}(\mu(i)) + h^{(\alpha)}(\nu(i))\right) \le H^{(\alpha)}(\mu) + H^{(\alpha)}(\nu).$$

Proof of Theorem 1. The concavity argument in the proof of Cover and Thomas [2006, Theorem 17.3.3] immediately implies $|H(\mu) - H(\nu)| \le H(|\mu - \nu|)$. Then, via an application of Hölder's inequality,

$$H(|\mu - \nu|) = \sum_{i \in \mathbb{N}} |\mu(i) - \nu(i)|\log\frac{1}{|\mu(i) - \nu(i)|} = \sum_{i \in \mathbb{N}} |\mu(i) - \nu(i)|^{1 - 1/\alpha}\cdot|\mu(i) - \nu(i)|^{1/\alpha}\log\frac{1}{|\mu(i) - \nu(i)|}$$
$$\le \left(\sum_{i \in \mathbb{N}} \left(|\mu(i) - \nu(i)|^{1 - 1/\alpha}\right)^{1/(1 - 1/\alpha)}\right)^{1 - 1/\alpha}\left(\sum_{i \in \mathbb{N}} |\mu(i) - \nu(i)|\log^\alpha\frac{1}{|\mu(i) - \nu(i)|}\right)^{1/\alpha} = \|\mu - \nu\|_1^{1 - 1/\alpha}\,H^{(\alpha)}(|\mu - \nu|)^{1/\alpha}.$$

The claim follows by invoking Lemma 1 and the subadditivity of $t \mapsto t^{1/\alpha}$ for $t \ge 0$ and $\alpha > 1$.

5.2 Proof of Theorem 3

For $h > 1$ and $n \in \mathbb{N}$, put $a_n = (1 - \frac{1}{2n})\ln(1 - \frac{1}{2n})$ and define the support size $S = S(h, n)$ by

$$S = \left\lfloor \frac{1}{2n}\exp\big(2n(h + a_n)\big) \right\rfloor.$$

Consider the distributions $\mu_0 = (1, 0, 0, \ldots)$ and $\mu_n$ defined by $\mu_n(1) = 1 - \frac{1}{2n}$ and $\mu_n(i) = \frac{1}{2nS}$ for $2 \le i \le 1 + S(h, n)$. We compute the Kullback-Leibler divergence and entropy:

$$D_{\mathrm{KL}}(\mu_0\|\mu_n) = \log\frac{1}{1 - \frac{1}{2n}} \le \frac{1}{2n}\cdot\frac{1}{1 - \frac{1}{2n}} \le \frac1n, \tag{6}$$

and $H(\mu_0) = 0 \le h$. For $x \ge 2$, we always have $\lfloor x\rfloor \ge x/2$. Additionally, from $2n|a_n| \le 1$ and $\frac{1}{2n}\exp(2nh - 1) > 2$, we obtain that $S > \frac{1}{4n}\exp(2n(h + a_n))$, hence we also have that $h \ge H(\mu_n) > h - \frac{1}{2n}\ln 2$. Since $\frac{1}{2n}\ln 2 \le \frac12$ and $h > 1$, it follows that $H(\mu_n) \ge \frac h2$, whence $|H(\mu_0) - H(\mu_n)| \ge h/2$. To bound the $L_1$ minimax risk (defined in (4)), we invoke Markov's inequality:

$$\mathbb{E}\big|\hat H(X_1, \ldots, X_n) - H(\mu)\big| \ge \frac h4\,\mathbb{P}\left(\big|\hat H(X_1, \ldots, X_n) - H(\mu)\big| > \frac h4\right).$$

It follows via Le Cam's two-point method [Tsybakov, 2008, Section 2.4.2] that

$$R_n^{(1)}(h) \ge \frac h4\,e^{-n D_{\mathrm{KL}}(\mu_0\|\mu_n)} \ge \frac{h}{4e},$$

where the second inequality stems from (6).

5.3 Proof of Theorem 4

We begin with an auxiliary lemma, of possible independent interest.

Lemma 2. For all $\mu \in \Delta_{\mathbb{N}}$ and $n \in \mathbb{N}$, we have

$$H(\mu) \ge \mathbb{E}H(\hat\mu_n) \ge H(\mu) - \inf_{0 < \varepsilon < 1}\left[\sum_{i \in \mathbb{N}:\,\mu(i) < \varepsilon} \mu(i)\log\frac{1}{\mu(i)} + \log\left(1 + \frac{1}{n\varepsilon}\right)\right].$$

Proof. The first inequality follows from Jensen's, since $H(\cdot)$ is concave and $\mathbb{E}\hat\mu_n = \mu$. To prove the second inequality, choose $\varepsilon > 0$, put $J := \{i \in \mathbb{N} : \mu(i) < \varepsilon\}$, and compute

$$\mathbb{E}H(\hat\mu_n) = \mathbb{E}\left[\sum_{i \in \mathbb{N}\setminus J} \hat\mu_n(i)\log\frac{1}{\hat\mu_n(i)} + \sum_{i \in J} \hat\mu_n(i)\log\frac{1}{\hat\mu_n(i)}\right] \ge \mathbb{E}\left[\sum_{i \in \mathbb{N}\setminus J} \hat\mu_n(i)\log\frac{1}{\hat\mu_n(i)} + \Big(\sum_{i \in J}\hat\mu_n(i)\Big)\log\frac{1}{\sum_{i \in J}\hat\mu_n(i)}\right] =: \mathbb{E}H(\tilde\mu_n),$$

where $\tilde\mu_n$ is the collapsed version of $\hat\mu_n$, in which all of the masses in $J$ have been replaced by a single mass equal to their sum, and the inequality holds because conditioning reduces entropy [Cover and Thomas, 2006, Eq. (2.157)].
We observe that $\tilde\mu_n$ has support size at most $1 + 1/\varepsilon$ and invoke Paninski [2003, Proposition 1]:

$$\mathbb{E}H(\tilde\mu_n) \ge H(\tilde\mu) - \log\left(1 + \frac{1}{n\varepsilon}\right),$$

where $\tilde\mu$ is the collapsed version of $\mu$. Now

$$H(\tilde\mu) = H(\mu) + \Big(\sum_{i \in J}\mu(i)\Big)\log\frac{1}{\sum_{i \in J}\mu(i)} - \sum_{i \in J}\mu(i)\log\frac{1}{\mu(i)} \ge H(\mu) - \sum_{i \in J}\mu(i)\log\frac{1}{\mu(i)},$$

which concludes the proof.

The first part of the theorem will follow from the following proposition.

Proposition 1. For $\alpha \ge 1$, $h > 0$, $n \in \mathbb{N}$ and $\mu \in \Delta_{\mathbb{N}}^{(\alpha)}[h]$, we have

$$\mathbb{E}|H(\mu) - H(\hat\mu_n)| \le \frac{\log n}{\sqrt n} + \inf_{0 < \varepsilon < 1}\left[\left(\log\frac1\varepsilon\right)^{1 - \alpha} h + \log\left(1 + \frac{1}{n\varepsilon}\right)\right].$$

Proof. Since, by Lemma 2, $|H(\mu) - \mathbb{E}H(\hat\mu_n)| = H(\mu) - \mathbb{E}H(\hat\mu_n)$, it follows from the triangle and Jensen inequalities that

$$\mathbb{E}|H(\mu) - H(\hat\mu_n)| \le \mathbb{E}|H(\hat\mu_n) - \mathbb{E}H(\hat\mu_n)| + H(\mu) - \mathbb{E}H(\hat\mu_n) \le \sqrt{\mathrm{Var}[H(\hat\mu_n)]} + H(\mu) - \mathbb{E}H(\hat\mu_n) \le \frac{\log n}{\sqrt n} + H(\mu) - \mathbb{E}H(\hat\mu_n), \tag{8}$$

where the variance bound is from Antos and Kontoyiannis [2001b, Proposition 1(iv)]. For any $\varepsilon > 0$, Lemma 2 implies

$$H(\mu) - \mathbb{E}H(\hat\mu_n) \le \sum_{i \in \mathbb{N}:\,\mu(i) < \varepsilon}\mu(i)\log\frac{1}{\mu(i)} + \log\left(1 + \frac{1}{n\varepsilon}\right) \le \left(\log\frac1\varepsilon\right)^{1 - \alpha}H^{(\alpha)}(\mu) + \log\left(1 + \frac{1}{n\varepsilon}\right), \tag{9}$$

where the second inequality follows from the obvious relations

$$\sum_{i:\,\mu(i) < \varepsilon}\mu(i)\log\frac{1}{\mu(i)} \le \left(\log\frac1\varepsilon\right)^{1 - \alpha}\sum_{i:\,\mu(i) < \varepsilon}\mu(i)\left(\log\frac{1}{\mu(i)}\right)^\alpha \le \left(\log\frac1\varepsilon\right)^{1 - \alpha}H^{(\alpha)}(\mu).$$

The claim follows by combining (8) with (9).

Proof of Theorem 4(a). Use the fact that $R_n^{(\alpha)}(h) \le \sup_{\mu\in\Delta_{\mathbb{N}}^{(\alpha)}[h]}\mathbb{E}|H(\mu) - H(\hat\mu_n)|$, invoke Proposition 1 with $\varepsilon = 1/\sqrt n$, and use $\log(1 + x) \le x$.

We now prove the second half of the theorem.

Proof of Theorem 4(b). Let $\alpha > 0$, $n \in \mathbb{N}$ and define two families of distributions:

$$\mathcal{U}_1 := \left\{\mu_1 = \mathrm{Uniform}([n^3])\right\}, \qquad \mathcal{U}_2 := \left\{\mu_2 = \mathrm{Uniform}(A) : A \subseteq [n^3],\ |A| = n^2\right\}.$$

Let $h := 3^\alpha\log^\alpha n$ and note that $\mathcal{U}_1 \cup \mathcal{U}_2 \subseteq \Delta_{\mathbb{N}}^{(\alpha)}[h]$. Let $E$ be the event that $X = (X_1, \ldots, X_n)$ has no repeating elements, i.e., $|\{X_1, X_2, \ldots, X_n\}| = n$. Let $\mu_1 \in \mathcal{U}_1$, $\mu_2 \in \mathcal{U}_2$ and consider the values $\mathbb{P}_{X\sim\mu_1^n}(E)$ and $\mathbb{P}_{X\sim\mu_2^n}(E)$. For $m \in \mathbb{N}$, define $X(m)$ to be the smallest number $k$ such that, when uniformly throwing $k$ balls into $m$ buckets, the probability of a collision is at least $1/2$. Since $X(m)$ is known[5] to be at least $\sqrt m$ (and hence $X(n^2) > n$), we have a lower bound of $\frac12$ on both $\mathbb{P}_{X\sim\mu_1^n}(E)$ and $\mathbb{P}_{X\sim\mu_2^n}(E)$. Define $\mu_1^n|E$ as the distribution on $\mathbb{N}^n$ induced by conditioning the product $\mu_1^n$ on the event $E$, and define $\mu_2^n|E$ analogously. Our key observation is that, conditional on $E$, (i) both are effectively distributions on ordered $n$-tuples of distinct elements from $[n^3]$, and (ii) $\mu_1^n|E$ is uniform on $([n^3])_n$ whereas, for $\mu_2 = \mathrm{Uniform}(A)$, $\mu_2^n|E$ is uniform on $(A)_n$, where

$$(J)_k := \left\{(x_1, \ldots, x_k) \in J^k : |\{x_1, \ldots, x_k\}| = k\right\}, \qquad J \subseteq \mathbb{N},\ k \in \mathbb{N}.$$

Then

$$R_n^{(\alpha)}(h) \ge \inf_{\hat H}\sup_{\mu\in\mathcal{U}_1\cup\mathcal{U}_2}\mathbb{E}_{X\sim\mu^n}\big[|\hat H(X) - H(\mu)|\big] \stackrel{(a)}{\ge} \inf_{\hat H}\sup_{\mu\in\mathcal{U}_1\cup\mathcal{U}_2}\mathbb{E}_{X\sim\mu^n|E}\big[|\hat H(X) - H(\mu)|\big]\,\mathbb{P}_{X\sim\mu^n}(E) \ge \frac12\inf_{\hat H}\sup_{\mu\in\mathcal{U}_1\cup\mathcal{U}_2}\mathbb{E}_{X\sim\mu^n|E}\big[|\hat H(X) - H(\mu)|\big]$$
$$\stackrel{(b)}{\ge} \frac14\inf_{\hat H}\left(\mathbb{E}_{X\sim\mu_1^n|E}\big[|\hat H(X) - H(\mu_1)|\big] + \sup_{\mu_2\in\mathcal{U}_2}\mathbb{E}_{X\sim\mu_2^n|E}\big[|\hat H(X) - H(\mu_2)|\big]\right) \stackrel{(c)}{\ge} \frac14\inf_{\hat H}\left(\mathbb{E}_{X\sim\mu_1^n|E}\big[|\hat H(X) - H(\mu_1)|\big] + \mathbb{E}_{\mu_2\sim\mathrm{Uniform}(\mathcal{U}_2)}\mathbb{E}_{X\sim\mu_2^n|E}\big[|\hat H(X) - H(\mu_2)|\big]\right)$$
$$\stackrel{(d)}{=} \frac14\inf_{\hat H}\left(\mathbb{E}_{X\sim\mu_1^n|E}\big[|\hat H(X) - H(\mu_1)|\big] + \mathbb{E}_{X\sim\mu_1^n|E}\big[|\hat H(X) - H(\mu_2)|\big]\right) = \frac14\inf_{\hat H}\mathbb{E}_{X\sim\mu_1^n|E}\big[|\hat H(X) - H(\mu_1)| + |\hat H(X) - H(\mu_2)|\big] \stackrel{(e)}{\ge} \frac14|H(\mu_1) - H(\mu_2)| = \frac14\log n = \frac{h}{4\cdot 3^\alpha\log^{\alpha - 1} n},$$

where (a) is from the law of total expectation (the complement of $E$ is discarded), (b) and (c) bound a supremum by an average, (d) is by observing that, by symmetry, the operators $\mathbb{E}_{\mu_2\sim\mathrm{Uniform}(\mathcal{U}_2)}\mathbb{E}_{X\sim\mu_2^n|E}[\cdot]$ and $\mathbb{E}_{X\sim\mu_1^n|E}[\cdot]$ are equivalent, and (e) is from the triangle inequality. (There is a minor abuse of notation in the transitions after (c), since we write $\mu_2$ without specifying a particular member of $\mathcal{U}_2$. However, $\mu_2$ only occurs therein as $H(\mu_2)$, and this value is identical for all $\mu_2 \in \mathcal{U}_2$.)

[5] Better bounds exist [Brink, 2012].
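As a quick numerical sanity check (ours, not part of the paper), the following Python snippet verifies the deterministic statements of Section 5.1, namely Lemma 1's subadditivity and the first bound of Theorem 1, on randomly generated finitely supported pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_moment(p, alpha):
    """H^(alpha)(p) = sum over positive entries of p_j * |log p_j|**alpha (natural log), as in (3)."""
    p = p[p > 0]
    return float((p * np.abs(np.log(p)) ** alpha).sum())

for _ in range(1000):
    k = int(rng.integers(2, 200))
    alpha = float(rng.uniform(1.1, 4.0))
    mu = rng.dirichlet(np.ones(k))
    nu = rng.dirichlet(np.ones(k))
    diff = np.abs(mu - nu)
    # Lemma 1: H^(alpha)(|mu - nu|) <= 2*alpha^alpha + H^(alpha)(mu) + H^(alpha)(nu)
    lhs = info_moment(diff, alpha)
    rhs = 2 * alpha ** alpha + info_moment(mu, alpha) + info_moment(nu, alpha)
    assert lhs <= rhs + 1e-9
    # Theorem 1 (first inequality), applicable when ||mu - nu||_1 < 1/2
    if diff.sum() < 0.5:
        bound = diff.sum() ** (1 - 1 / alpha) * rhs ** (1 / alpha)
        assert abs(info_moment(mu, 1.0) - info_moment(nu, 1.0)) <= bound + 1e-9
```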
6 Empirical comparison

Our bounds have the crucial characteristic of being empirical (under moment assumptions). When we observe favorable distributions (even without a priori knowledge of this fact), we benefit from tighter bounds. This entails some cost, and in the worst case our bounds will be sub-optimal. In this section, we illustrate these trade-offs for various natural classes of distributions.

For the class of all finite-alphabet distributions, our bound is sub-optimal. The MLE (plug-in estimator) is competitive with the optimal estimator up to logarithmic factors in $d$, but our bounds on the MLE are loose, nearly quadratically in $d/n$, in the worst case. The convergence of the empirical distribution on a finite alphabet in $\ell_1$ occurs at rate $\Theta(\sqrt{d/n})$, whereas the MLE entropy estimator converges at rate $O\Big(\sqrt{(d/n)^2 + (\log^2 d)/n}\Big)$, as follows from Wu and Yang [2016, Proposition 1]. So any approach that upper-bounds the entropy risk via $\ell_1$ (as our Theorem 1, or Section 4 of Ho and Yeung [2010], does) will be worst-case suboptimal for this class of distributions.

Nevertheless, for certain classes of distributions our bounds (Theorem 1 and Corollary 1) can significantly outperform the state of the art for small and moderate-sized samples. To calculate the expected rate of our approach, we apply Hölder's inequality, as in the proof of Theorem 1:

$$\mathbb{E}|H(\hat\mu_n) - H(\mu)| \le \left(\mathbb{E}\left[2\alpha^\alpha + H^{(\alpha)}(\mu) + H^{(\alpha)}(\hat\mu_n)\right]\right)^{1/\alpha}\left(\mathbb{E}\|\hat\mu_n - \mu\|_1\right)^{1 - 1/\alpha}. \tag{10}$$

Now, as in the proof of Lemma 1 (recall that $h^{(\alpha)}(z) := z\ln^\alpha(1/z)$), the term $\mathbb{E}H^{(\alpha)}(\hat\mu_n) = \sum_{i \in [d]}\mathbb{E}\,h^{(\alpha)}(\hat\mu_n(i))$ can be controlled via the maximum (5) of $h^{(\alpha)}$; notice that the resulting bound is insensitive to $p$. For a fair comparison to (10), our estimator's only a priori knowledge of $\mu$ is that its support is of size at most $d + D$. By Proposition 2, we have $\max_{\mu\in\Delta_K} H^{(\alpha)}(\mu) \le \max\{\alpha, \log K\}^\alpha + (\alpha/e)^\alpha$. This allows us to optimize over $\alpha$ for each $n$:

$$\mathrm{OUR}(d, D, p, n) := \inf_{\alpha > 1}\left(e + 2\alpha^\alpha + 2\max\{\alpha, \log(d + D)\}^\alpha + 2(\alpha/e)^\alpha\right)^{1/\alpha}\Lambda_n(\mu)^{1 - 1/\alpha},$$

where $\Lambda_n(\mu)$ denotes an upper bound on $\mathbb{E}\|\hat\mu_n - \mu\|_1$. Since $\mu$ has finite support, the Cover-Thomas inequality (2) also applies and yields an adaptive estimate when combined with (1). As $t \mapsto t\log(1/t)$ is concave, the latter has the form

$$\mathbb{E}|H(\hat\mu_n) - H(\mu)| \le \mathbb{E}\left[\|\hat\mu_n - \mu\|_1\log\frac{d + D}{\|\hat\mu_n - \mu\|_1}\right] \le \Lambda_n(\mu)\log\frac{d + D}{\Lambda_n(\mu)} =: \mathrm{CT}(d, D, p, n).$$

The comparisons are plotted in Figure 1 (Left).

Infinite support. In some cases our bound is nearly tight (at least for the plug-in estimate), such as for the family of zeta distributions $\mu_q(i) \propto 1/i^q$ with parameter $q > 1$. For this family, Antos and Kontoyiannis [2001a, Theorem 7] establish a lower bound of order $n^{-(q-1)/q}$ on $\mathbb{E}|H(\hat\mu_n) - H(\mu_q)|$. It is straightforward to verify[6] that $\mu_q \in \Delta_{\mathbb{N}}^{(\alpha)}$ for all $q, \alpha > 1$. Thus, we can optimize our bound in (10) over all $\alpha > 1$; the results are presented in Figure 1 (Right).

[6] One can, for example, apply Cauchy's condensation test, followed by the ratio test.
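To make the comparison concrete, the following Python sketch (ours, not the paper's experimental code) draws a sample from a truncated zeta distribution, evaluates the plug-in estimate, and optimizes the Corollary 1 bound over a grid of $\alpha$, with the $\ell_1$ radius taken from (1) as stated above. The truncation level, the grid, the sample size, and the use of the true $\mu_q$ to supply the moment bound $h$ are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def zeta_pmf(q, support=10**6):
    """Truncated zeta distribution mu_q(i) proportional to i**(-q) on {1, ..., support}."""
    i = np.arange(1, support + 1, dtype=float)
    w = i ** (-q)
    return w / w.sum()

def info_moment(p, alpha):
    """H^(alpha)(p), as in (3), over the positive entries of p."""
    p = p[p > 0]
    return float((p * np.abs(np.log(p)) ** alpha).sum())

def empirical_bound(sample, h_of_alpha, delta=0.05, alphas=np.linspace(1.05, 6.0, 100)):
    """Optimize the Corollary 1 bound over a grid of alpha, with the l1 radius taken from (1)."""
    n = len(sample)
    mu_hat = np.bincount(sample) / n
    mu_hat = mu_hat[mu_hat > 0]
    radius = np.sqrt(2.0 / n) * np.sqrt(mu_hat).sum() + 6.0 * np.sqrt(np.log(4.0 / delta) / n)
    vals = [(2 * a ** a + h_of_alpha(a) + info_moment(mu_hat, a)) ** (1 / a)
            * radius ** (1 - 1 / a) for a in alphas]
    return min(vals)

mu = zeta_pmf(q=2.0)
sample = rng.choice(len(mu), size=5000, p=mu)   # symbol labels are irrelevant for entropy
h_of_alpha = lambda a: info_moment(mu, a)       # illustration only: h computed from mu_q itself
true_H = info_moment(mu, 1.0)
plugin_H = info_moment(np.bincount(sample) / len(sample), 1.0)
print(f"H(mu_q) = {true_H:.4f}, plug-in = {plugin_H:.4f}, "
      f"empirical bound = {empirical_bound(sample, h_of_alpha):.4f}")
```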
Acknowledgments and Disclosure of Funding

This research was partially supported by the Israel Science Foundation (grant No. 1602/19) and an Amazon Research Award. We thank Ioannis Kontoyiannis, Igal Sason, and Sergio Verdú for enlightening conversations.

References

J. Acharya, H. Das, A. Orlitsky, and A. T. Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 11-21. PMLR, 2017. URL http://proceedings.mlr.press/v70/acharya17a.html.

J. Acharya, S. Bhadane, P. Indyk, and Z. Sun. Estimating entropy of distributions in constant space. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.

A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163-193, 2001a.

A. Antos and I. Kontoyiannis. Estimating the entropy of discrete distributions. In IEEE International Symposium on Information Theory, page 45, 2001b.

K. M. R. Audenaert. A sharp continuity estimate for the von Neumann entropy. J. Phys. A, 40(28):8127-8136, 2007. ISSN 1751-8113. doi: 10.1088/1751-8113/40/28/S18. URL https://doi.org/10.1088/1751-8113/40/28/S18.

D. Berend and A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254-1259, 2013. ISSN 0167-7152. doi: 10.1016/j.spl.2013.01.023. URL https://www.sciencedirect.com/science/article/pii/S0167715213000242.

D. Berend, A. Kontorovich, and G. Zagdanski. The expected missing mass under an entropy constraint. Entropy, 19(7):315, 2017. doi: 10.3390/e19070315. URL https://doi.org/10.3390/e19070315.

M. Brautbar and A. Samorodnitsky. Approximating entropy from sublinear samples. In N. Bansal, K. Pruhs, and C. Stein, editors, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 366-375. SIAM, 2007. URL http://dl.acm.org/citation.cfm?id=1283383.1283422.

D. Brink. A (probably) exact solution to the birthday problem. The Ramanujan Journal, 28, 2012. doi: 10.1007/s11139-011-9343-9.

D. Cohen, A. Kontorovich, and G. Wolfer. Learning discrete distributions with infinite support. In Neural Information Processing Systems (NIPS), 2020.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006. ISBN 978-0-471-24195-9; 0-471-24195-4.

K. Fukuchi and J. Sakuma. Minimax optimal estimators for additive scalar functionals of discrete distributions. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2103-2107, 2017. doi: 10.1109/ISIT.2017.8006900.

K. Fukuchi and J. Sakuma. Minimax optimal additive functional estimation with discrete distribution: Slow divergence speed case. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1041-1045, 2018. doi: 10.1109/ISIT.2018.8437725.

S. Golomb. The information generating function of a probability distribution (corresp.). IEEE Transactions on Information Theory, 12(1):75-77, 1966. doi: 10.1109/TIT.1966.1053843.

Y. Han, J. Jiao, and T. Weissman. Adaptive estimation of Shannon entropy. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1372-1376. IEEE, 2015.

Y. Hao and A. Orlitsky. Data amplification: Instance-optimal property estimation. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4049-4059. PMLR, 2020. URL http://proceedings.mlr.press/v119/hao20a.html.

Y. Hao, A. Orlitsky, A. T. Suresh, and Y. Wu. Data amplification: A unified and competitive approach to property estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/a753a43564c29148df3150afb4475440-Paper.pdf.

S.-W. Ho and R. W. Yeung.
The interplay between entropy and variational distance. IEEE Transactions on Information Theory, 56(12):5906-5929, 2010.

J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835-2885, 2015.

J. Jiao, K. Venkat, Y. Han, and T. Weissman. Maximum likelihood estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 63(10):6774-6798, 2017. doi: 10.1109/TIT.2017.2733537.

H. Jürgensen and D. E. Matthews. Entropy and higher moments of information. Journal of Universal Computer Science, 16(5):749-794, 2010.

N. Kusolitsch. Why the theorem of Scheffé should be rather called a theorem of Riesz. Period. Math. Hungar., 61(1-2):225-229, 2010. ISSN 0031-5303. doi: 10.1007/s10998-010-3225-6. URL https://doi.org/10.1007/s10998-010-3225-6.

E. H. Lieb and M. Loss. Analysis, volume 14 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, second edition, 2001. ISBN 0-8218-2783-9. doi: 10.1090/gsm/014. URL https://doi.org/10.1090/gsm/014.

P.-S. Loh. Convexity. 2013. URL https://www.math.cmu.edu/~ploh/docs/math/mop2008/convexity-soln.pdf.

G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Found. Comput. Math., 19(5):1145-1190, 2019. doi: 10.1007/s10208-019-09427-x. URL https://doi.org/10.1007/s10208-019-09427-x.

P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab., 18(3):1269-1283, 1990. ISSN 0091-1798. URL http://links.jstor.org/sici?sici=0091-1798(199007)18:3<1269:TTCITD>2.0.CO;2-Q&origin=MSN.

L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191-1253, 2003. doi: 10.1162/089976603321780272.

L. Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200-2203, 2004. doi: 10.1109/TIT.2004.833360.

I. Pinelis. Best possible bounds of the von Bahr-Esseen type. Annals of Functional Analysis, 6(4):1-29, 2015. doi: 10.15352/afa/06-4-1. URL https://doi.org/10.15352/afa/06-4-1.

I. Sason. Entropy bounds for discrete random variables via maximal coupling. IEEE Trans. Inf. Theory, 59(11):7118-7131, 2013. doi: 10.1109/TIT.2013.2274515. URL https://doi.org/10.1109/TIT.2013.2274515.

H. Scheffé. A useful convergence theorem for probability distributions. Ann. Math. Statistics, 18:434-438, 1947. ISSN 0003-4851. doi: 10.1214/aoms/1177730390. URL https://doi.org/10.1214/aoms/1177730390.

J. F. Silva. Shannon entropy estimation in ∞-alphabets from convergence results: studying plug-in estimators. Entropy, 20(6):397, 2018.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

G. Valiant and P. Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, pages 685-694, New York, NY, USA, 2011a. Association for Computing Machinery. ISBN 9781450306911. doi: 10.1145/1993636.1993727. URL https://doi.org/10.1145/1993636.1993727.

G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403-412, 2011b. doi: 10.1109/FOCS.2011.81.

G. Valiant and P. Valiant. Estimating the unseen: Improved estimators for entropy and other properties. J. ACM, 64(6), Oct. 2017. ISSN 0004-5411. doi: 10.1145/3125643. URL https://doi.org/10.1145/3125643.
S. Verdú. Empirical estimation of information measures: A literature guide. Entropy, 21(8):720, 2019.

B. von Bahr and C.-G. Esseen. Inequalities for the rth Absolute Moment of a Sum of Random Variables, 1 ≤ r ≤ 2. The Annals of Mathematical Statistics, 36(1):299-303, 1965. doi: 10.1214/aoms/1177700291. URL https://doi.org/10.1214/aoms/1177700291.

Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, 2016.

A. J. Wyner and D. Foster. On the lower limits of entropy estimation. IEEE Transactions on Information Theory, submitted for publication, 2003.

Z. Zhang. Estimating mutual information via Kolmogorov distance. IEEE Transactions on Information Theory, 53(9):3280-3282, 2007. doi: 10.1109/TIT.2007.903122.