# Data Amplification: Instance-Optimal Property Estimation

Yi Hao¹  Alon Orlitsky¹

¹Department of Electrical and Computer Engineering, University of California, San Diego, USA. Correspondence to: Yi Hao, Alon Orlitsky.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

## Abstract

The best-known and most commonly used technique for distribution-property estimation uses a plug-in estimator, with the empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly amplify the effective amount of data available. For a large variety of distribution properties, including four of the most popular ones, and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic factor more samples. Specifically, for Shannon entropy and a broad class of Lipschitz properties including the L1 distance to a fixed distribution, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.

## 1. Introduction

Recent years have seen significant interest in estimating properties of distributions over large domains (Valiant & Valiant, 2013; Jiao et al., 2015; 2018; Wu & Yang, 2016; Orlitsky et al., 2016; Acharya et al., 2017a; Hao et al., 2018; Wu & Yang, 2019; Hao & Orlitsky, 2019a;c; Charikar et al., 2019; Hao & Li, 2020). Chief among these properties are support size and coverage, Shannon entropy, and L1 distance to a known distribution. The main achievement of these papers is essentially estimating properties of distributions with alphabet size $k$ using just $k/\log k$ samples.

In practice, however, the underlying distributions are often simple, and their properties can be accurately estimated with significantly fewer than $k/\log k$ samples. For example, if the distribution is concentrated on a small part of the domain, or is exponential, very few samples may suffice to estimate the property.

To address this discrepancy, Hao et al. (2018) took the following competitive approach. The best-known distribution-property estimator is the empirical estimator, which replaces the unknown underlying distribution by the observed empirical distribution. For example, with $n$ samples, it estimates entropy by the formula $-\sum_i (N_i/n)\log(N_i/n)$, where $N_i$ is the number of times symbol $i$ appeared. Besides its simple and intuitive form, the empirical estimator is also consistent, stable, and universal. It is therefore the most commonly used property estimator for data-science applications. The estimator in Hao et al. (2018) uses $n$ samples and, for any underlying distribution, achieves the same performance that the empirical estimator would achieve with $n\sqrt{\log n}$ samples.
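To make this baseline concrete, here is a minimal sketch (ours, not the paper's code) of the empirical plug-in entropy estimator described above; the function name, the use of numpy, and the uniform test distribution are arbitrary choices.

```python
# Minimal sketch (ours): the empirical plug-in entropy estimator
# -sum_i (N_i/n) log(N_i/n), computed from a sample of symbols.
import numpy as np

def empirical_entropy(samples):
    """Plug-in entropy estimate in nats."""
    _, counts = np.unique(samples, return_counts=True)
    freqs = counts / counts.sum()          # empirical distribution
    return -np.sum(freqs * np.log(freqs))  # zero-count symbols never appear

# Example: 1,000 samples from a uniform distribution over 100 symbols.
rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=1000)
print(empirical_entropy(x), np.log(100))   # estimate vs. true entropy
```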
The Hao et al. (2018) estimator therefore provides an effective way to amplify the amount of data available by a factor of $\sqrt{\log n}$, regardless of the domain or structure of the underlying distribution.

In this paper we present novel estimators that increase the amplification factor, for all sufficiently smooth properties including those mentioned above, from $\sqrt{\log n}$ to the information-theoretic bound of $\log n$. Namely, for every distribution, their expected estimation error with $n$ samples is that of the empirical estimator with $n\log n$ samples, and no further uniform amplification is possible. It can further be shown (Valiant & Valiant, 2013; Jiao et al., 2015; Acharya et al., 2017a; Wu & Yang, 2019) that the empirical estimator estimates all of the aforementioned four properties with linearly many samples, hence the sample size required by the new estimators is always at most the $k/\log k$ guaranteed by the state-of-the-art estimators.

The current formulation has several additional advantages over previous approaches, which we illustrate as follows.

**Fewer assumptions.** It eliminates the need for some commonly used assumptions. For example, support size cannot be estimated with any number of samples, as arbitrarily many low-probability symbols may be missed. Hence previous research (Acharya et al., 2017a; Wu & Yang, 2019) unrealistically assumed prior knowledge of the alphabet size $k$, and additionally that all positive probabilities exceed $1/k$. By contrast, the current formulation does not need these assumptions. Intuitively, if a symbol's probability is so small that it won't be detected even with $n\log n$ samples, we do not need to worry about it.

**Refined bounds.** For some properties, our results are more refined than previously shown. For example, existing results estimate the support size to within $\epsilon k$, rendering the estimates rather inaccurate when the true support size $S$ is much smaller than $k$. By contrast, the new estimation errors are bounded by $\epsilon S$, and are therefore accurate regardless of the support size. A similar improvement holds for the support coverage that we introduce below.

**Graceful degradation.** For the previous results to work, one needs at least $k/\log k$ samples. With fewer samples, the estimators have no guarantees. By contrast, the guarantees of the new estimators hold for any sample size $n$. If $n < k/\log k$, the performance may degrade, but it will still track that of the empirical estimators with a factor $\log n$ more samples. See Theorem 1 for an example.

**Instance optimality.** With the recent exception of Hao et al. (2018), all modern property-estimation research took a min-max-related approach, evaluating the estimation improvement based on the worst possible distribution for the property. In reality, practical distributions are rarely the worst possible and often quite simple, rendering the min-max approach overly pessimistic and its estimators typically suboptimal in practice. In fact, for this very reason, practical distribution estimators do not use min-max-based approaches (Gale & Sampson, 1995). By contrast, our competitive, or instance-optimal, approach provably ensures amplification for every underlying distribution, regardless of its complexity or support size.

In addition, the proposed estimators run in time near-linear in the sample size, and the constants involved are very small, attributes shared by some, though not all, existing estimators.

Below, we formalize the foregoing discussion in definitions. Let $\Delta_k$ denote the collection of discrete distributions over $[k] := \{1, \ldots, k\}$.
A distribution property is a mapping $F: \Delta_k \to \mathbb{R}$. It is additive if it can be written as $F(p) = \sum_{i\in[k]} f_i(p_i)$, where the $f_i: [0,1]\to\mathbb{R}$ are real functions. Many important distribution properties are additive:

Shannon entropy, $H(p) := -\sum_{i\in[k]} p_i\log p_i$, is the principal measure of information (Cover & Thomas, 2012), and arises in a variety of machine-learning (Chow & Liu, 1968; Quinn et al., 2013; Bresler, 2015), neuroscience (Mainen & Sejnowski, 1995; Steveninck et al., 1997; Gerstner & Kistler, 2002), and other applications.

L1 distance, $D_q(p) := \sum_{i\in[k]} |p_i - q_i|$, where $q$ is a given distribution, is one of the most basic and well-studied properties in the field of distribution property testing (Batu et al., 2000; Ron, 2010; Valiant & Valiant, 2016; Canonne, 2017).

Support size, $S(p) := \sum_{i\in[k]} \mathbb{1}_{p_i>0}$, is a fundamental quantity for discrete distributions, and plays an important role in vocabulary-size (McNeil, 1973; Efron & Thisted, 1976; Thisted & Efron, 1987) and population estimation (Good, 1953; Mao & Lindsay, 2007).

Support coverage, $C(p) := \sum_{i\in[k]} \big(1 - (1-p_i)^m\big)$ for a given $m$, represents the number of distinct elements we would expect to see in $m$ independent samples, and arises in many ecological (Chao, 1984; Chao & Lee, 1992; Colwell et al., 2012; Chao & Chiu, 2014), biological (Chao, 1984; Kroes et al., 1999), genomic (Ionita-Laza et al., 2009), as well as database (Haas et al., 1995) studies.

## 2. Prior and New Results

Given an additive property $F$ and sample access to an unknown distribution $p$, we would like to estimate the value of $F(p)$ as accurately as possible. Let $[k]^n$ denote the collection of all length-$n$ sequences. An estimator is a function $\hat F: [k]^n \to \mathbb{R}$ that maps a sample sequence $X^n\sim p$ to a property estimate $\hat F(X^n)$. We evaluate the performance of $\hat F$ in estimating $F(p)$ via its mean absolute error (MAE),

$$L_{\hat F}(p, n) := \mathbb{E}_{X^n\sim p}\big|\hat F(X^n) - F(p)\big|.$$

(As we aim to estimate only a single property value, the estimators in this paper all have negligible variances, e.g., $O(1/n^{0.9})$. Hence, for our purposes, the MAE serves the same role as the MSE, and we choose the former because it induces cleaner expressions.)

Since we do not know $p$, the common approach is to consider the worst-case MAE of $\hat F$ over $\Delta_k$,

$$L_{\hat F}(n) := \max_{p\in\Delta_k} L_{\hat F}(p, n).$$

The best-known and most commonly used property estimator is the empirical plug-in estimator. Upon observing $X^n$, let $N_i$ denote the number of times symbol $i\in[k]$ appears in $X^n$. The empirical estimator estimates $F(p)$ by

$$\hat F^E(X^n) := \sum_{i\in[k]} f_i\!\left(\frac{N_i}{n}\right).$$

Starting with Shannon entropy, it has been shown (Wu & Yang, 2016) that for $n\gtrsim k$, the worst-case (max) MAE of the empirical estimator $\hat H^E$ is

$$L_{\hat H^E}(n) = \Theta\!\left(\frac{k}{n} + \frac{\log k}{\sqrt{n}}\right). \tag{1}$$

On the other hand, Jiao et al. (2015); Wu & Yang (2016); Acharya et al. (2017a); Hao & Orlitsky (2019a;c) showed that for $n\gtrsim k/\log k$, more sophisticated estimators achieve the best min-max performance of

$$L(n) := \min_{\hat F} L_{\hat F}(n) = \Theta\!\left(\frac{k}{n\log n} + \frac{\log k}{\sqrt{n}}\right). \tag{2}$$

Hence, up to constant factors, for the worst distributions, the MAE of these estimators with $n$ samples equals that of the empirical estimator with $n\log n$ samples. A similar relation holds for the other three properties we consider.

However, the min-max formulation is pessimistic, as it evaluates the estimator's performance on the worst distributions. In many practical applications, the underlying distribution is fairly simple and does not attain this worst-case loss; rather, a much smaller MAE can be achieved.
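As a concrete illustration (ours, not from the paper), the sketch below evaluates the four additive properties on a known distribution and estimates the MAE $L_{\hat H^E}(p,n)$ of the empirical entropy estimator by Monte Carlo; all names and the choice of a uniform $p$ are arbitrary.

```python
# Sketch (ours): the four additive properties on a known p, and a Monte Carlo
# estimate of the empirical entropy estimator's MAE  L(p, n) = E |H_emp - H(p)|.
import numpy as np

def entropy(p):              return -np.sum(p[p > 0] * np.log(p[p > 0]))
def l1_distance(p, q):       return np.sum(np.abs(p - q))
def support_size(p):         return int(np.count_nonzero(p))
def support_coverage(p, m):  return np.sum(1.0 - (1.0 - p) ** m)

def mae_empirical_entropy(p, n, trials=200, seed=0):
    rng, true_h, errs = np.random.default_rng(seed), entropy(p), []
    for _ in range(trials):
        counts = rng.multinomial(n, p)
        freqs = counts[counts > 0] / n
        errs.append(abs(-np.sum(freqs * np.log(freqs)) - true_h))
    return float(np.mean(errs))

p = np.full(1000, 1e-3)                             # uniform over k = 1000 symbols
q = np.full(1000, 1e-3)
print(entropy(p), l1_distance(p, q), support_size(p), support_coverage(p, m=500))
print(mae_empirical_entropy(p, n=500))              # sizeable error when n < k
```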
Several recent works have therefore gone beyond worst-case analysis and designed algorithms that perform well for all distributions, not just those with the worst performance (Orlitsky & Suresh, 2015; Valiant & Valiant, 2016; Hao & Orlitsky, 2019b). For property estimation, Hao et al. (2018) designed an estimator $\hat F^A$ that, for any underlying distribution, uses $n$ samples to achieve the performance of the $n\sqrt{\log n}$-sample empirical estimator, hence effectively multiplying the data size by a $\sqrt{\log n}$ amplification factor.

**Lemma 1** (Hao et al. (2018)). For every property $F$ in a large class including the aforementioned four properties, there is an absolute constant $c_F$ such that for all distributions $p$, all $\varepsilon\le 1$, and all $n\ge 1$,

$$L_{\hat F^A}(p, n)\ \le\ L_{\hat F^E}\!\big(p,\ \varepsilon n\sqrt{\log n}\big) + c_F\,\varepsilon.$$

In this work, we fully strengthen the above result and establish the limits of data amplification for all sufficiently smooth additive properties, including the four most important ones and all that are appropriately Lipschitz. Using Shannon entropy as an example, we achieve a $\log n$ amplification factor. Equations (1) and (2) imply that the improvement over the empirical estimator cannot always exceed $O(\log n)$, hence up to an absolute constant, this amplification factor is information-theoretically optimal. Similar optimality arguments hold for our results on the other three properties (e.g., see Table 1 in Acharya et al. (2017a)).

Specifically, we derive efficient estimators $\hat H$, $\hat D$, $\hat S$, $\hat C$, and $\hat F$ for Shannon entropy, L1 distance, support size, support coverage, and a broad class of additive properties that we refer to as Lipschitz properties. These estimators run in near-linear time, take a single parameter $\varepsilon$, and given samples $X^n\sim p$, amplify the data as described below. For brevity, we henceforth write $a\wedge b$ and $a\lesssim b$ instead of $\min\{a,b\}$ and $a = O(b)$, respectively, and abbreviate the support size $S(p)$ by $S_p$ and the coverage $C(p)$ by $C_p$. The following five theorems hold for all $\varepsilon\le 1$, all distributions $p$, and all $n\ge 1$.

**Theorem 1** (Shannon entropy).

$$L_{\hat H}(p, n)\ \lesssim\ L_{\hat H^E}\!\big(p,\ \varepsilon n\log n\big)\ +\ \varepsilon\wedge\frac{S_p}{n}\ +\ \frac{1}{n^{0.49}}.$$

Note that the estimator requires no knowledge of $S_p$ or $k$. When $\varepsilon = 1$, the estimator amplifies the data by a factor of $\log n$. As $\varepsilon$ decreases, the amplification factor decreases, and so does the extra additive inaccuracy. One can also set $\varepsilon$ to be a vanishing function of $n$, e.g., $\varepsilon = 1/\log\log n$.

This result may be interpreted as follows. For distributions with support sizes so large that the min-max estimators provide no or only very weak guarantees, our estimator with $n$ samples always tracks the performance of the $n\log n$-sample empirical estimator. On the other hand, for distributions with relatively small support sizes, our estimator achieves a near-optimal $O(S_p/n)$ error rate.

Similarly, for the L1 distance to a fixed distribution $q$,

**Theorem 2** (L1 distance). For any $q$, we can construct an estimator $\hat D_q$ for $D_q$ such that

$$L_{\hat D_q}(p, n)\ \lesssim\ L_{\hat D_q^E}\!\big(p,\ \varepsilon^2 n\log n\big)\ +\ \frac{\varepsilon}{\sqrt{n}}\ +\ \frac{1}{n^{0.49}}.$$

Besides having an interpretation similar to that of Theorem 1, the above result shows that for each $q$ and each $p$, we can use just $n$ samples to achieve the performance of the $n\log n$-sample empirical estimator. More generally, consider any additive property $F(p) := \sum_{i\in[k]} f_i(p_i)$ that satisfies the simple condition that $f_i$ is $O(1)$-Lipschitz for all $i$.

**Theorem 3** (General additive properties). Given such an $F$, we can construct an estimator $\hat F$ such that

$$L_{\hat F}(p, n)\ \lesssim\ L_{\hat F^E}\!\big(p,\ \varepsilon^2 n\log n\big)\ +\ \frac{\varepsilon}{\sqrt{n}}\ +\ \frac{1}{n^{0.49}}.$$
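For concreteness, the following sketch (ours) spells out the generic empirical plug-in $\hat F^E(X^n) = \sum_i f_i(N_i/n)$ that Theorems 2 and 3 use as their benchmark, instantiated with the 1-Lipschitz per-symbol functions of the L1 distance to a fixed $q$; the names and parameters are illustrative only.

```python
# Sketch (ours): the generic empirical plug-in for an additive property
# F(p) = sum_i f_i(p_i), specialized to the O(1)-Lipschitz property
# D_q(p) = sum_i |p_i - q_i| of Theorem 2.
import numpy as np

def additive_plugin(counts, f):
    """F_hat^E(X^n) = sum_i f(i, N_i / n) for per-symbol functions f_i."""
    n = counts.sum()
    return sum(f(i, c / n) for i, c in enumerate(counts))

k = 500
q = np.full(k, 1.0 / k)                       # fixed reference distribution
f_l1 = lambda i, x: abs(x - q[i])             # 1-Lipschitz in x for every i

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(k))                 # "unknown" distribution
counts = rng.multinomial(2000, p)             # symbol counts N_i from n = 2000 samples
print(additive_plugin(counts, f_l1), np.abs(p - q).sum())   # estimate vs. truth
```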
The results in Kamath et al. (2015) show that no plug-in estimator can provide the theoretical guarantees presented in Theorems 2 and 3.

Henceforth, we refer to the above collection of distribution properties as the class of Lipschitz properties. The L1 distance $D_q$, for any $q$, is in this class.

Lipschitz properties are essentially bounded by absolute constants, and Shannon entropy grows at most logarithmically in the support size; hence we were able to approximate all of them to within an additive error. Support size and support coverage, by contrast, can grow linearly in $k$ and $m$, and can be approximated only multiplicatively. We therefore evaluate the estimators' performance normalized by the property value. Note that for both properties, the amplification factor is logarithmic in the property value, which can be arbitrarily larger than the sample size $n$. The following two theorems hold for all $\varepsilon\le e^{-2}$.

**Theorem 4** (Support size).

$$\frac{L_{\hat S}(p, n)}{S_p}\ \lesssim\ \frac{L_{\hat S^E}\!\big(p,\ n\log S_p\big)}{S_p}\ +\ \frac{1}{|\log\varepsilon|}\wedge 1.$$

To make the slack term vanish, one can simply set $\varepsilon$ to be a vanishing function of $n$ (or $S_p$), e.g., $\varepsilon = 1/\log n$. Note that in this case, the slack term modifies the multiplicative error in estimating $S_p$ by only $o(1)$, which is negligible in most applications. Similarly, for support coverage,

**Theorem 5** (Support coverage).

$$\frac{L_{\hat C}(p, n)}{C_p}\ \lesssim\ \frac{L_{\hat C^E}\!\big(p,\ n\log C_p\big)}{C_p}\ +\ \frac{1}{|\log\varepsilon|}\wedge 1.$$

The next section presents implications of these results.

## 3. Implications

**Data amplification.** Numerous modern scientific applications, such as those emerging in social networks and genomics, deal with properties of distributions whose support size $S_p$ is comparable to or even larger than the sample size $n$. In this data-sparse regime, the estimation error of the empirical estimator often decays at a slow rate, e.g., $1/\log^c n$ for some $c\in(0,1)$, hence the proposed estimators yield a much more accurate estimate, paralleling that of the empirical estimator with $n\log n$ samples. For applications where $n\ge 25{,}000$, and regardless of the distribution structure, our approach amplifies the number of samples by at least a factor of 10, known to practitioners as an "order of magnitude". As for the data-rich regime where $n\gg S_p$, our method essentially recovers the standard $\sqrt{S_p/n}$ rate of maximum-likelihood methods, without knowing $S_p$.

**Instance optimality.** With just $n$ samples, our method emulates the performance of the $n\log n$-sample empirical estimator for every distribution instance. The method hence possesses the vital ability of strengthening all MAE guarantees of the empirical estimator by a logarithmic factor, which is optimal in many settings. The significance of such instance optimality arises from three facts: 1) empirical estimators are simple and easy to analyze; 2) there is a rich literature on their estimation attributes, e.g., Bustamante (2017) and the references therein; 3) empirical estimators are the best known and most used. Consequently, we can work on a simple problem, analyzing the performance of the empirical estimator, and immediately strengthen the result we obtain by a logarithmic factor using the theorems in this paper. In many cases, the strengthened results are challenging to establish via other statistical methods. We present two examples below.

**Entropy.** Consider entropy estimation over $\Delta_k$. As Equation (2) shows, the min-max MAE is known for $n\gtrsim k/\log k$, and essentially becomes a constant when $n$ gets close to the $k/\log k$ lower bound. Nevertheless, over an alphabet of size $k$, the value of the entropy can go up to $\log k$.
Hence, it is still possible to get meaningful estimation results in the $n = o(k/\log k)$ large-alphabet regime. We follow the above strategy to solidify this statement. First, for the empirical estimator $\hat H^E$, Paninski (2003, Proposition 1) provides a short argument showing that its worst-case MAE, for all $n$ and $k$, satisfies

$$L_{\hat H^E}(n)\ \le\ \log\!\left(1 + \frac{k-1}{n}\right) + \frac{\log n}{\sqrt{n}}.$$

Consolidating this inequality with Theorem 1 then implies

**Corollary 1.** In the $n = o(k/\log k)$ large-alphabet regime, the min-max MAE of estimating Shannon entropy satisfies

$$L(n)\ \le\ (1 + o(1))\,\log\!\left(1 + \frac{k-1}{n\log n}\right).$$

**Lipschitz properties.** The same type of argument applies to any Lipschitz property $F$. Again, we begin by characterizing the performance of the empirical estimator $\hat F^E$. By Lemma 3 and the Cauchy-Schwarz inequality, the bias of $\hat F^E$ is at most $O(\sqrt{k/n})$. By the Efron-Stein inequality, the standard deviation is no more than $O(1/\sqrt{n})$. It then follows from Theorem 3 that $\hat F$ estimates $F$ over $\Delta_k$ to an MAE of $\varepsilon$ with $O(k/(\varepsilon^3\log k))$ samples. Note that 1) this yields the first estimator for Lipschitz properties with optimal sample dependence on $k$; 2) after a draft of this paper became available online, Hao & Orlitsky (2019c) improved the sample dependence on $\varepsilon$ to the optimal $\varepsilon^{-2}$.

## 4. Estimator Construction and Analysis

For clarity, we focus on the proof of Theorem 1 about entropy estimation, and explain only the modifications necessary for similar arguments to go through for the other properties. We begin by relating the empirical entropy estimator to the Bernstein polynomial of the function $-x\log x$.

**Notation.** For a sampling parameter $n$ and accuracy $\varepsilon\le 1$, define the amplification factor as $a := \varepsilon\log n$. Without loss of generality, assume that $\varepsilon\ge 1/\log n$ and hence $a\ge 1$. For simplicity, write $h(x) := -x\log x$, $m := na$, $\tau_n := \Theta(\log n/n)$, and $d_n := \Theta(\log n)$, where the asymptotic notations hide only properly chosen absolute constants.

### 4.1. Bernstein Polynomial

Drawing i.i.d. samples $Y^m$ from any distribution $p$, the expected value of the empirical estimator for entropy is

$$\mathbb{E}[\hat H^E(Y^m)] = \sum_{i\in[k]} \mathbb{E}_{M_i\sim \mathrm{bin}(m, p_i)}\!\left[h\!\left(\frac{M_i}{m}\right)\right].$$

Note that for any function $f$, $m\in\mathbb{N}$, and $x\in[0,1]$, the degree-$m$ Bernstein polynomial of $f$ is

$$B_m(f, x) := \sum_{j=0}^{m} \binom{m}{j} f\!\left(\frac{j}{m}\right) x^j (1-x)^{m-j}.$$

Therefore, we can express the expectation of the empirical entropy estimator as

$$\mathbb{E}_{Y^m\sim p}[\hat H^E(Y^m)] = \sum_{i\in[k]} B_m(h, p_i).$$

As modifying a single sample changes the value of $\hat H^E(Y^m)$ by at most $2\log m/m$, the Efron-Stein inequality bounds its variance by $2\log^2 m/m$, which is negligible in magnitude. Hence, for our purpose, we focus on finding a good approximation of each $B_m(h, p_i)$.

### 4.2. Estimator Construction and Computation

Given i.i.d. samples $X^n\sim p$, we construct our estimator as follows; the subsequent sections fill in the details. For simplicity, substitute $2n$ for $n$. Following Section 4.4, we first split the samples into two halves, $X_1^n$ and $X_{n+1}^{2n}$, and denote by $N_i$ and $N'_i$ the empirical counts of each symbol $i\in[k]$ in the first and second half, respectively. Then, we follow Dobrushin (1958) and classify the symbols into two categories, decomposing the sum

$$\mathbb{E}_{Y^m\sim p}[\hat H^E(Y^m)] = \sum_{i\in[k]} B_m(h, p_i)$$

into two parts by thresholding the second-half empirical counts $N'_i$ at level $1/\varepsilon$.

The first part, $V_L := \sum_{i\in[k]} B_m(h, p_i)\,\mathbb{1}_{N'_i > 1/\varepsilon}$, corresponds to the contribution of symbols with potentially large probabilities. As illustrated in Section 4.3, this quantity is well approximated by the large-probability estimator, to an MAE of $2(\varepsilon\wedge S_p/n)$.
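Before turning to the small-probability part, here is a quick numerical check (ours, not from the paper) of the Section 4.1 identity $\mathbb{E}[\hat H^E(Y^m)] = \sum_i B_m(h, p_i)$; the distribution, sample size, and use of scipy are arbitrary choices.

```python
# Numerical check (ours) of E[H_emp(Y^m)] = sum_i B_m(h, p_i), h(x) = -x log x.
import numpy as np
from scipy.stats import binom

def h(x):
    x = np.asarray(x, dtype=float)
    return -x * np.log(np.where(x > 0, x, 1.0))      # h(0) = 0 without warnings

def bernstein(f, m, x):
    """B_m(f, x) = sum_j C(m, j) x^j (1-x)^(m-j) f(j/m)."""
    j = np.arange(m + 1)
    return np.sum(binom.pmf(j, m, x) * f(j / m))

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(50))                        # random distribution, k = 50
m = 200

exact = sum(bernstein(h, m, pi) for pi in p)          # sum_i B_m(h, p_i)
mc = np.mean([np.sum(h(rng.multinomial(m, p) / m)) for _ in range(2000)])
print(exact, mc)                                      # agree up to Monte Carlo noise
```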
As for the small-probability part,

$$V_S := \sum_{i\in[k]} B_m(h, p_i)\,\mathbb{1}_{N'_i \le 1/\varepsilon},$$

we follow the arguments in Sections 4.4 and 4.5 to learn each summand adaptively (to the magnitude of the probability) and compute the summation. Concretely, recall that $\tau_n = \Theta(\log n/n)$ and $d_n = \Theta(\log n)$. For a given function and domain, the polynomial achieving the minimal maximum deviation from the function over the domain is the min-max polynomial. Denote by $\tilde h_m(x) = \sum_{t=0}^{d_n} b_t x^t$ the degree-$d_n$ min-max polynomial of $B'_m(h, x)$ over the interval $I_n := [0, \tau_n]$. The small-probability estimator is

$$\hat V_S := \sum_{i\in[k]} \left(\sum_{t=0}^{d_n} \frac{b_t}{t+1}\cdot\frac{N_i^{\underline{t+1}}}{n^{\underline{t+1}}}\right) \mathbb{1}_{N_i\le\log n}\,\mathbb{1}_{N'_i\le 1/\varepsilon},$$

where for each symbol $i$, the term in the parentheses is an unbiased estimator for $\tilde H_m(p_i) := \int_0^{p_i} \tilde h_m(s)\,ds$ (here $A^{\underline{B}}$ denotes the falling factorial defined in Section 4.4). Next, we illustrate the technique and intuition behind the construction.

**Differential smoothing.** The construction of $\hat V_S$ relies on a generic method for designing a polynomial $\tilde G$ that closely approximates a given differentiable function $G$ with pointwise error bounds. More precisely, for a fixed interval $I := [0, \tau]$ and degree bound $d\in\mathbb{N}$, we want to find a polynomial $\tilde G$ of degree at most $d$ satisfying

$$|\tilde G(x) - G(x)| \le c\,x \quad \text{for all } x\in I,$$

for a number $c\ge 0$ that is as small as possible. We propose a novel method, differential smoothing, that addresses this approximation problem and operates as follows.

1. Compute $G'(x)$ and write $g := G'$.
2. Approximate $g$ by its min-max polynomial $\tilde g$ over $I$.
3. Let $c$ be the min-max approximation error in Step 2.
4. Compute $\tilde G(x) := \int_0^x \tilde g(t)\,dt$.

By the triangle inequality for integrals, the resulting $c$ and $\tilde G$ satisfy the desired inequality. Moreover, Steps 2 and 3 can be jointly performed using the well-known Remez algorithm (Pachón & Trefethen, 2009; Trefethen, 2013).

Turning back to our estimator $\hat V_S$: by the reasoning in Sections 4.6 and 4.7, the min-max polynomial $\tilde h_m(x)$ approximates $B'_m(h, x)$ to within $O(\varepsilon)$ over $I_n$. Hence, applying the method of differential smoothing yields

$$|B_m(h, x) - \tilde H_m(x)| \lesssim \varepsilon\,x.$$

Further relating this inequality to the expectation of the empirical entropy estimator implies

$$\left|\mathbb{E}_{Y^m\sim p}[\hat H^E(Y^m)] - \sum_{i\in[k]} \tilde H_m(p_i)\right|\ \lesssim\ \sum_{i\in[k]} \varepsilon\,p_i\ =\ \varepsilon.$$

In Section 6.1 of the supplementary material, we prove that the absolute bias is also at most $O(S_p/n)$, which requires some additional work. Finally, Section 7.1 bounds the mean absolute deviation of the estimator by $O(1/n^{0.49})$. Consequently, we approximate $H(p)$ by $\hat H := \hat V_L + \hat V_S$.

**Computational complexity.** The dominant computation step is finding the min-max polynomial of $B'_m(h, x)$, for which we utilize the well-known Remez algorithm (Pachón & Trefethen, 2009; Trefethen, 2013). In Section 9 of the supplementary material, we argue that the algorithm takes only $O(n)$ time to approximate $B'_m(h, x)$ well.

### 4.3. Large-Probability Estimator

Following the previous arguments, we say that $i\in[k]$ is a large-probability symbol if $N'_i > 1/\varepsilon$. To the expectation of the $m$-sample empirical estimator, these symbols contribute

$$V_L = \sum_{i\in[k]} B_m(h, p_i)\,\mathbb{1}_{N'_i > 1/\varepsilon}.$$

We estimate $V_L$ by restricting the empirical estimator of the first-half samples to these symbols:

$$\hat V_L := \sum_{i\in[k]} h\!\left(\frac{N_i}{n}\right)\mathbb{1}_{N'_i > 1/\varepsilon}.$$

To bound the estimation bias, we leverage the next lemma, stating that the Bernstein polynomial of $h$ closely approximates the function over $[0, 1]$.

**Lemma 2.** For any $t\in\mathbb{Z}^+$ and $x\in[0,1]$,

$$-\frac{1}{t}\ \le\ B_t(h, x) - h(x)\ \le\ 0.$$

The number of symbols satisfying $N'_i > 1/\varepsilon$ is at most $n\varepsilon$. Together with the lemma and the triangle inequality, this yields

$$\big|\mathbb{E}[V_L] - \mathbb{E}[\hat V_L]\big|\ \le\ \frac{2}{n}\sum_{i\in[k]} \mathbb{E}\big[\mathbb{1}_{N'_i > 1/\varepsilon}\big]\ \le\ 2\varepsilon.$$

Furthermore, the number of such symbols is also at most $S_p$, implying an alternative upper bound of $2S_p/n$.
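Returning briefly to the differential-smoothing recipe of Section 4.2: it can be sketched in a few lines (ours), with Chebyshev interpolation standing in for the Remez min-max fit of Step 2 and the toy target $G(x) = \log(1+x)$ on $[0,\tau]$ chosen only for illustration.

```python
# Sketch (ours) of differential smoothing: approximate g = G' by a low-degree
# polynomial, integrate it from 0, and obtain a pointwise |G~(x) - G(x)| <= c*x
# bound.  Chebyshev interpolation is a stand-in for the Remez min-max fit.
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

tau, d = 0.1, 4                                   # interval [0, tau], degree bound d
G = lambda x: np.log1p(x)                         # toy target with G(0) = 0
g = lambda x: 1.0 / (1.0 + x)                     # its derivative

g_tilde = Chebyshev.interpolate(g, d, domain=[0, tau])   # Step 2 (proxy for Remez)
grid = np.linspace(0.0, tau, 2001)
c = np.max(np.abs(g(grid) - g_tilde(grid)))              # Step 3: sup error for g
G_tilde = g_tilde.integ(lbnd=0)                          # Step 4: integrate from 0

err = np.abs(G_tilde(grid) - G(grid))
print("c =", c)
print("max |G~ - G| / (c x):", np.max(err[1:] / (c * grid[1:])))  # at most about 1
```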
For Shannon entropy, we note that adding $1/(2n)$ to the empirical estimate $h(N_i/n)$ may reduce its bias. This method, known as the Miller-Madow estimator, appears in Miller (1955) and eliminates the first-order term of $B_n(h, x) - h(x)$. Applying it would introduce extra complications in the analysis, and hence for entropy and general non-differentiable properties we employ the original empirical estimator. On the other hand, substituting the Miller-Madow estimate into our algorithm retains the theoretical guarantee of Theorem 1.

For Lipschitz properties, the rich literature on Bernstein operators provides the following bound.

**Lemma 3** (Bustamante (2017), Proposition 4.9). For any $t\in\mathbb{Z}^+$, $x\in[0,1]$, and $c$-Lipschitz function $f$,

$$|B_t(f, x) - f(x)|\ \le\ c\,\sqrt{\frac{x(1-x)}{t}}.$$

Combined with the Cauchy-Schwarz inequality, the lemma shows that the estimation bias of the respective $\hat V_L$ admits

$$\big|\mathbb{E}[V_L] - \mathbb{E}[\hat V_L]\big|\ \le\ 2\sqrt{\frac{S_p}{n}}.$$

This completes the bias analysis of the large-probability estimator; Section 6.2 in the supplementary material provides additional technical details. For the variance analysis, see Section 7.2. The following three sections construct the small-probability estimator and introduce fundamental results from polynomial approximation theory.

### 4.4. Choice of Parameters and Sample Splitting

Section 4.1 calls for estimating $B_m(h, x)$. Applying the method of differential smoothing from Section 4.2, we first choose a domain $I = [0, \tau]$ and degree $d$, and approximate $B'_m(h, x)$ by its min-max polynomial $\tilde h_m(x) = \sum_{t=0}^{d} b_t x^t$ over $I$. Then, we approximate $B_m(h, x)$ by

$$\tilde H_m(x) = \int_0^x \tilde h_m(t)\,dt = \sum_{t=0}^{d} \frac{b_t}{t+1}\,x^{t+1}.$$

To estimate $\tilde H_m(x)$, note that given a binomial variable $X\sim\mathrm{bin}(n, x)$, an unbiased estimator for $x^t$ is $X^{\underline{t}}/n^{\underline{t}}$, where $t\in\mathbb{N}$ and $A^{\underline{B}}$ denotes the $B$-th order falling factorial of $A$. Hence, we employ an unbiased estimator for $\tilde H_m(x)$; this is the parenthetical component in the expression for $\hat V_S$.

Next, we illustrate the intuition behind our choice of $\tau$ and $d$. For any $X\sim\mathrm{bin}(n, x)$, the variance of $\hat H_m(X)$ generally grows as the degree $d$ increases. On the other hand, a higher-degree polynomial achieves a lower approximation error. To balance this bias-variance trade-off, we want to reduce both the interval length $\tau$ and the polynomial degree $d$ while maintaining the approximation power. As in Section 4.2, we set the parameter $\tau = \tau_n = \Theta(\log n/n)$, since below this threshold, sample statistics are insufficient for inferring the relative magnitudes of the underlying probabilities with high confidence. Regarding the degree parameter $d = d_n = \Theta(\log n)$: below the $\log n$ threshold, the approximation $\tilde H_m$ loses the $\varepsilon\,x$ guarantee; above the threshold, the final estimator may no longer possess a vanishing variance. For more details, see the derivations in Section 7.1 and Appendix A of the supplementary material.

One issue that follows the construction of $\tilde H_m$ and $\hat H_m$ is how to apply these approximations only to probabilities of order $\tau_n$. The issue arises because we observe symbol counts, not the actual probability values. It is straightforward to deal with such uncertainty by inferring the magnitudes of the unknown probabilities from the concentration of the counting statistics: binomial random variables are sums of independent indicator variables and possess Gaussian-type tail bounds.
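The unbiasedness claim above is easy to check numerically; the sketch below (ours, with arbitrary parameters) verifies that $X^{\underline{t}}/n^{\underline{t}}$ has expectation $x^t$ for $X\sim\mathrm{bin}(n,x)$, which is what makes the parenthetical term in $\hat V_S$ an unbiased estimator of the polynomial $\tilde H_m$.

```python
# Check (ours): for X ~ bin(n, x), the falling-factorial ratio X^(t)/n^(t)
# is an unbiased estimator of x^t.
import numpy as np

def falling(a, t):
    """t-th order falling factorial a (a-1) ... (a-t+1); equals 1 when t = 0."""
    return np.prod([a - s for s in range(t)], axis=0)

rng = np.random.default_rng(4)
n, x, t = 100, 0.03, 3
X = rng.binomial(n, x, size=200_000)

estimates = falling(X, t) / falling(n, t)
print(np.mean(estimates), x ** t)        # sample mean is close to x^t
```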
To avoid introducing additional statistical dependency, we 1) split the sample sequence into two halves of equal length; 2) denote the empirical counts of each symbol $i$ in the first and second halves by $N_i$ and $N'_i$, respectively (slightly abusing notation); and 3) classify each $i\in[k]$ as a large- or small-probability symbol by thresholding the count $N'_i$ at $1/\varepsilon$. The supplementary material presents the relevant details in Sections 5 and 6.2.

In the literature, the above procedure is often referred to as sample splitting. The idea of classifying the symbols of the alphabet into two categories dates back to Dobrushin (1958), and has been applied to estimate a variety of specific distribution properties in the past decade (Acharya et al., 2014; Jiao et al., 2015; Wu & Yang, 2016; Hao et al., 2018). Recently, Hao & Orlitsky (2019c) generalized this idea to estimate general properties by partitioning the unit interval into $\Theta(\sqrt{n})$ pieces, and Hao & Orlitsky (2019b) applied the method to derive state-of-the-art distribution estimators.

Sample splitting and the additivity of the property enable us to estimate the contributions of the large and small probabilities separately. The remaining sections assume this separation and address the small-probability approximation error.

### 4.5. Min-Max Polynomial

Polynomials have extensive applications in statistical inference, ranging from approximating the norms of Gaussian parameters (Cai & Low, 2011) to learning structured distributions (Chan et al., 2014; Acharya et al., 2017b; Hao & Orlitsky, 2019b) to estimating properties of distributions (Jiao et al., 2015; Orlitsky et al., 2016; Wu & Yang, 2016; Hao et al., 2018; Hao & Orlitsky, 2019c). As illustrated in Sections 4.2 and 4.4, we aim to find a polynomial $\tilde h_m(x)$ of degree $d_n = \Theta(\log n)$ that satisfies the pointwise bound $|B'_m(h, x) - \tilde h_m(x)|\lesssim\varepsilon$ over $I_n = [0, \tau_n]$. The task naturally calls for the polynomial achieving the minimal maximum deviation from $B'_m(h, x)$, commonly known as the respective min-max polynomial. Moreover, direct computation shows that $B'_m(h, x)$ is the order-$(m-1)$ Bernstein polynomial of another function:

$$B'_m(h, x) = B_{m-1}(h_m, x),$$

where the function $h_m$ is defined as

$$h_m(y) := \log\frac{m}{m-1} + (m-1)\left[h\!\left(y + \frac{1}{m-1}\right) - h(y)\right].$$

Hence, our objective reduces to bounding the error of min-max polynomial approximations of $B_{m-1}(h_m, x)$ over $I_n$. As one might expect, the analysis gets more involved, since 1) $B_{m-1}(h_m, x)$ is a high-degree polynomial with transcendental coefficients; and 2) in general, there are no closed-form formulas for the min-max polynomials of a real function.

Though sophisticated in form, the function $B_{m-1}(h_m, x)$ is continuous and relatively smooth, as hinted by Lemma 2. This simple observation serves as the starting point for our subsequent analysis. In the next section, we dive into approximation theory and present fundamental connections between the smoothness of a function (characterized by specific quantities) and its min-max polynomial approximation error over a given interval. The desired result then follows by a sequence of inequalities and simplifications that enable us to gauge the smoothness of $B_{m-1}(h_m, x)$. For the proof of the derivative identity on $h_m$, and for a more straightforward argument leading to a weaker result, see Sections 4 and 5 of the supplementary material.

### 4.6. Moduli of Smoothness

In this section, we introduce some notable results in approximation theory (Ditzian & Totik, 2012) that are crucial for simplifying the problem. Denote $\varphi(x) := \sqrt{x(1-x)}$.
For any function $f: [0,1]\to\mathbb{R}$, the first- and second-order Ditzian-Totik moduli of smoothness of $f$ are

$$\omega_\varphi^1(f, t) := \sup\left\{|f(u) - f(v)| : u, v\in[0,1],\ |u - v|\le t\,\varphi\!\left(\frac{u+v}{2}\right)\right\}$$

and

$$\omega_\varphi^2(f, t) := \sup\left\{\left|f(u) + f(v) - 2f\!\left(\frac{u+v}{2}\right)\right| : u, v\in[0,1],\ |u - v|\le 2t\,\varphi\!\left(\frac{u+v}{2}\right)\right\},$$

respectively. Let $\mathcal{P}_d$ denote the collection of polynomials with real coefficients and degree at most $d$. For any $d\in\mathbb{Z}^+$, interval $I\subseteq\mathbb{R}$, and function $f: I\to\mathbb{R}$, denote by

$$E_d[f, I] := \min_{\tilde f\in\mathcal{P}_d}\ \max_{x\in I}\ |f(x) - \tilde f(x)|$$

the best approximation error of the degree-$d$ min-max polynomial of $f$ over $I$. For a bounded domain $I$, we can always shift and rescale $f$ to make it a real function over $[0,1]$. Hence, without loss of generality, it suffices to consider and analyze $E_d[f] := E_d[f, [0,1]]$. The connection between the best polynomial-approximation error $E_d[f]$ of a continuous function $f$ and the second-order Ditzian-Totik modulus of smoothness $\omega_\varphi^2(f, t)$ is established in the following lemma (Ditzian & Totik, 2012).

**Lemma 4.** There are absolute constants $C_1$ and $C_2$ such that for any continuous function $f$ over $[0,1]$ and $d > 2$,

$$E_d[f]\ \le\ C_1\,\omega_\varphi^2(f, d^{-1}), \quad\text{and}\quad \frac{1}{d^2}\sum_{t=0}^{d}(t+1)E_t[f]\ \ge\ C_2\,\omega_\varphi^2(f, d^{-1}).$$

The above lemma shows that the second-order smoothness quantity $\omega_\varphi^2(f, \cdot)$ essentially characterizes $E_\cdot[f]$, and thus transforms the problem of showing

$$|\tilde h_m(x) - B_{m-1}(h_m, x)|\ \lesssim\ \varepsilon, \quad x\in I_n,$$

into that of establishing

$$\omega_\varphi^2\big(B_{m-1}(h_m, \tau_n x),\ d_n^{-1}\big)\ \lesssim\ \varepsilon,$$

where $\tau_n = \Theta(\log n/n)$ and $d_n = \Theta(\log n)$ by definition.

### 4.7. Simplification via Poissonization

The last block in our analysis is Poissonization, which helps decompose and simplify the function to approximate. For any $y\in[0,\infty)$, define two functions:

$$f_1(y) := \mathbb{E}_{X\sim\mathrm{Poi}(y)}[h(X)] = -e^{-y}\sum_{j\ge 0}\frac{y^j}{j!}\,(j\log j)$$

and

$$f_2(y) := \mathbb{E}_{X\sim\mathrm{Poi}(y)}[h(X+1)].$$

Let $z(x) := (m-1)x$ for simplicity. The following lemma, appearing in Appendix A.1 of the supplementary material, relates $B_{m-1}(h_m, x)$ to these functions and the base function $h(x)$.

**Lemma 5.** For any $m\in\mathbb{Z}^+$ and $x\in[0, \log^4 m/m]$,

$$h_m(x) - B_{m-1}(h_m, x) = \big[h(z(x)+1) - f_2(z(x))\big] - \big[h(z(x)) - f_1(z(x))\big] + O\!\left(\frac{1}{m}\right).$$

In particular, the above equation holds for any sufficiently large $n$ and $x\in I_n = [0, \tau_n]$. Since $1/m\le 1/(na-1)\lesssim\min\{1/\log n,\ S_p/n\}$, the last term on the right-hand side is negligible. These results, together with the function-wise triangle inequality for $\omega_\varphi^2$, further reduce the last inequality in Section 4.6 to bounds of the form

$$\omega_\varphi^2\big(g(x),\ d_n^{-1}\big)\ \lesssim\ \varepsilon,$$

for $g(x)$ being $h_m(\tau_n x)$, $h(z(x))$, $h(z(x)+1)$, $f_1(z(x))$, and $f_2(z(x))$, respectively. We prove these bounds in Appendices A.2 and A.3 of the supplementary material. In Appendix B, a similar yet more involved argument extends the result to all Lipschitz properties. One reason for the extra complication is the absence of a concrete expression, as we impose only the Lipschitz condition. While these proofs are technical, a critical insight is that the optimization problems induced by computing $\omega_\varphi^2$ for the above choices of $g$ are all convex. Consequently, it suffices to consider only the boundary cases of the parameters.
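The Ditzian-Totik quantity $\omega_\varphi^2$ from Section 4.6 is easy to explore numerically; the brute-force grid search below (ours, and only a crude approximation of the supremum) evaluates it for $h(x) = -x\log x$ at a few values of $t$.

```python
# Crude numerical illustration (ours) of the second-order Ditzian-Totik modulus
#   w2_phi(f, t) = sup { |f(u)+f(v)-2 f((u+v)/2)| : |u-v| <= 2 t phi((u+v)/2) },
# approximated by a grid search over (u, v) in [0, 1]^2.
import numpy as np

def w2_phi(f, t, grid_size=800):
    phi = lambda x: np.sqrt(x * (1.0 - x))
    u = np.linspace(0.0, 1.0, grid_size)
    U, V = np.meshgrid(u, u)
    mid = (U + V) / 2.0
    feasible = np.abs(U - V) <= 2.0 * t * phi(mid)
    vals = np.abs(f(U) + f(V) - 2.0 * f(mid))
    return float(np.max(np.where(feasible, vals, 0.0)))

h = lambda x: -x * np.log(np.where(x > 0, x, 1.0))     # h(x) = -x log x, h(0) = 0
for t in (0.2, 0.1, 0.05):
    print(t, w2_phi(h, t))                             # the modulus shrinks with t
```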
## 5. Experiments

We demonstrate the efficacy of the proposed estimators by comparing their performance to that of two state-of-the-art estimators (Wu & Yang, 2016; 2019) and of empirical estimators with logarithmically larger sample sizes. Due to method similarity, we present only the results for entropy and support size. Additional estimators for both properties were compared in Orlitsky et al. (2016); Wu & Yang (2016; 2019); Hao et al. (2018); Hao & Orlitsky (2019a) and found to perform similarly to or worse than the estimators we tested, hence we exclude them here. For each property, we considered nine natural and synthetic distributions, shown in Figures 1 and 2.

**Settings.** We experimented with nine distributions having support size $S = 10{,}000$: the uniform distribution; a two-step distribution with probability values $0.5S^{-1}$ and $1.5S^{-1}$; the Zipf distribution with power $1/2$; the Zipf distribution with power $1$; the binomial distribution with success probability $0.3$; the geometric distribution with success probability $0.9$; the Poisson distribution with mean $0.3S$; a distribution drawn from the Dirichlet prior with parameter $1$; and a distribution drawn from the Dirichlet prior with parameter $1/2$. The geometric, Poisson, and Zipf distributions were truncated at $S$ and re-normalized. The horizontal axis shows the number of samples, $n$, ranging from $S^{0.2}$ to $S$. Each experiment was repeated 100 times, and the reported results, shown on the vertical axis, reflect their mean values and standard deviations. Specifically, the real property value is drawn as a dashed black line, and the other estimators are color/shape coded, with the solid line displaying their mean estimate and the shaded area corresponding to one standard deviation.

We compared the estimators' performance with $n$ samples to that of two other recent estimators, as well as to the empirical estimator with $n$, $n\sqrt{\log A}$, and $n\log A$ samples, where $A = n$ for Shannon entropy and $A = S_p$, the actual distribution support size (which is $S$), for support size. We chose the parameter $\varepsilon = 1$. The graphs denote our proposed estimator by Proposed, $\hat F^E$ with $n$ samples by Empirical, $\hat F^E$ with $n\sqrt{\log A}$ samples by Empirical+, $\hat F^E$ with $n\log A$ samples by Empirical++, and the entropy and support-size estimators of Wu & Yang (2016) and Wu & Yang (2019) by WY.

**Results.** As Theorems 1 and 4 would imply, and as the experiments confirmed, for both properties the proposed estimators with $n$ samples achieved the accuracy of the empirical estimators with at least $n\log n$ samples for entropy and $n\log S_p$ samples for support size. In particular, for entropy, the proposed estimator with $n$ samples performed significantly better than the $n\log n$-sample empirical estimator, for all tested distributions and all values of the sample size $n$. For both properties, the proposed estimators outperformed the state-of-the-art estimators in terms of accuracy and of stability across distribution structures.

[Figure 1: nine panels (including Binomial 0.3, Geometric 0.9, Poisson 0.3, Dirichlet-1 prior, and Dirichlet-1/2 prior), horizontal axis $n$ from $10^1$ to $10^4$, curves Truth, Empirical, Empirical+, Empirical++, WY, and Proposed.]

Figure 1. Shannon entropy estimation. For clarity, the horizontal axis is in logarithmic scale. The WY curve is flipped vertically around Truth so that all the curves have similar trends. Besides the samples, the WY estimator takes as input an upper bound on the support size, which is set to the actual support size in the experiments. The vertical axis shows only nonnegative values.

[Figure 2: nine panels and curves as in Figure 1.]

Figure 2. Support size estimation. For clarity, the horizontal axis is in logarithmic scale. Besides the samples, the WY estimator takes as input a lower bound on the smallest positive probability $p^+_{\min}$, which is set to $\max\{1/(10S),\ 4p^+_{\min}\}$ in the experiments. Here, $1/(10S)$ is used to avoid division by zero in the numerical computation, and the factor 4 represents a reasonable uncertainty about $p^+_{\min}$. For several distributions, such as uniform and geometric, knowing $p^+_{\min}$ yields full knowledge of the entire probability multiset.

Finally, while estimator WY's bias is slightly lower on a few distributions, the corresponding standard deviation is too high to be acceptable.
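For readers who want to reproduce the setup, the sketch below (ours) generates the nine test distributions described above; the exact parameterizations of the binomial and Poisson laws (number of trials $S$, support $\{1,\dots,S\}$) and the random seed are our own reading of the settings rather than the paper's code.

```python
# Sketch (ours) of the nine test distributions with support size S; the
# geometric, Poisson, and Zipf laws are truncated at S and renormalized.
import numpy as np
from scipy.stats import binom, poisson

def test_distributions(S=10_000, seed=5):
    rng = np.random.default_rng(seed)
    i = np.arange(1, S + 1)
    raw = {
        "uniform":       np.full(S, 1.0 / S),
        "two-step":      np.r_[np.full(S // 2, 0.5 / S), np.full(S - S // 2, 1.5 / S)],
        "zipf-1/2":      1.0 / np.sqrt(i),
        "zipf-1":        1.0 / i,
        "binomial-0.3":  binom.pmf(i, S, 0.3),
        "geometric-0.9": 0.9 * 0.1 ** (i - 1.0),
        "poisson-0.3S":  poisson.pmf(i, 0.3 * S),
        "dirichlet-1":   rng.dirichlet(np.ones(S)),
        "dirichlet-1/2": rng.dirichlet(np.full(S, 0.5)),
    }
    return {name: p / p.sum() for name, p in raw.items()}

dists = test_distributions()
for name, p in dists.items():
    print(name, round(-np.sum(p[p > 0] * np.log(p[p > 0])), 3))   # true entropies
```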
## Acknowledgements

We thank the reviewers for helpful comments, and are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.

## References

Acharya, J., Orlitsky, A., Suresh, A. T., and Tyagi, H. The complexity of estimating Rényi entropy. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1855–1869. SIAM, 2014.

Acharya, J., Das, H., Orlitsky, A., and Suresh, A. T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pp. 11–21, 2017a.

Acharya, J., Diakonikolas, I., Li, J., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1278–1289. SIAM, 2017b.

Batu, T., Fortnow, L., Rubinfeld, R., Smith, W. D., and White, P. Testing that distributions are close. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 259–269. IEEE, 2000.

Bresler, G. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pp. 771–782, 2015.

Bustamante, J. Bernstein operators and their properties. Springer, 2017.

Cai, T. T. and Low, M. G. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics, 39(2):1012–1041, 2011.

Canonne, C. L. A survey on distribution testing: Your data is big. But is it blue?, 2017.

Chan, S.-O., Diakonikolas, I., Servedio, R. A., and Sun, X. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pp. 604–613, 2014.

Chao, A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, pp. 265–270, 1984.

Chao, A. and Chiu, C.-H. Species richness: Estimation and comparison. Wiley StatsRef: Statistics Reference Online, pp. 1–26, 2014.

Chao, A. and Lee, S.-M. Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417):210–217, 1992.

Charikar, M., Shiragur, K., and Sidford, A. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 780–791, 2019.

Chow, C. and Liu, C. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

Colwell, R. K., Chao, A., Gotelli, N. J., Lin, S.-Y., Mao, C. X., Chazdon, R. L., and Longino, J. T. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012.

Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.

Ditzian, Z. and Totik, V. Moduli of smoothness, volume 9. Springer Science & Business Media, 2012.

Dobrushin, R. L. A simplified method of experimentally evaluating the entropy of a stationary sequence. Theory of Probability & Its Applications, 3(4):428–430, 1958.

Efron, B. and Thisted, R.
Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

Gale, W. A. and Sampson, G. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.

Gerstner, W. and Kistler, W. M. Spiking neuron models: Single neurons, populations, plasticity. Cambridge University Press, 2002.

Good, I. J. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

Haas, P. J., Naughton, J. F., Seshadri, S., and Stokes, L. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 311–322. Morgan Kaufmann Publishers Inc., 1995.

Hao, Y. and Li, P. Bessel smoothing and multi-distribution property estimation. In Conference on Learning Theory, pp. 1817–1876, 2020.

Hao, Y. and Orlitsky, A. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems, pp. 10991–11003, 2019a.

Hao, Y. and Orlitsky, A. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pp. 2614–2623, 2019b.

Hao, Y. and Orlitsky, A. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems, pp. 11104–11114, 2019c.

Hao, Y., Orlitsky, A., Suresh, A. T., and Wu, Y. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pp. 8834–8843, 2018.

Ionita-Laza, I., Lange, C., and Laird, N. M. Estimating the number of unseen variants in the human genome. Proceedings of the National Academy of Sciences, 106(13):5008–5013, 2009.

Jiao, J., Venkat, K., Han, Y., and Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

Jiao, J., Han, Y., and Weissman, T. Minimax estimation of the L1 distance. IEEE Transactions on Information Theory, 64(10):6672–6706, 2018.

Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In Conference on Learning Theory, pp. 1066–1100, 2015.

Kroes, I., Lepp, P. W., and Relman, D. A. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.

Mainen, Z. F. and Sejnowski, T. J. Reliability of spike timing in neocortical neurons. Science, 268(5216):1503–1506, 1995.

Mao, C. X. and Lindsay, B. G. Estimating the number of classes. The Annals of Statistics, pp. 917–930, 2007.

McNeil, D. R. Estimating an author's vocabulary. Journal of the American Statistical Association, 68(341):92–96, 1973.

Miller, G. Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, 1955.

Orlitsky, A. and Suresh, A. T. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pp. 2143–2151, 2015.

Orlitsky, A., Suresh, A. T., and Wu, Y. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

Pachón, R. and Trefethen, L. N. Barycentric-Remez algorithms for best polynomial approximation in the Chebfun system. BIT Numerical Mathematics, 49(4):721, 2009.

Paninski, L. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

Quinn, C. J., Kiyavash, N., and Coleman, T. P.
Efficient methods to compute optimal tree approximations of directed information graphs. IEEE Transactions on Signal Processing, 61(12):3173–3182, 2013.

Ron, D. Algorithmic and analysis techniques in property testing. Foundations and Trends in Theoretical Computer Science, 5(2):73–205, 2010.

Steveninck, R., Lewen, G. D., Strong, S. P., Koberle, R., and Bialek, W. Reproducibility and variability in neural spike trains. Science, 275(5307):1805–1808, 1997.

Thisted, R. and Efron, B. Did Shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.

Trefethen, L. N. Approximation theory and approximation practice, volume 128. SIAM, 2013.

Valiant, G. and Valiant, P. Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pp. 2157–2165, 2013.

Valiant, G. and Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, pp. 142–155, 2016.

Wu, Y. and Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

Wu, Y. and Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.