Unified Sample-Optimal Property Estimation in Near-Linear Time

Yi Hao
Dept. of Electrical and Computer Engineering
University of California, San Diego
yih179@ucsd.edu

Alon Orlitsky
Dept. of Electrical and Computer Engineering
University of California, San Diego
alon@ucsd.edu

We consider the fundamental learning problem of estimating properties of distributions over large domains. Using a novel piecewise-polynomial approximation technique, we derive the first unified methodology for constructing sample- and time-efficient estimators for all sufficiently smooth, symmetric and non-symmetric, additive properties. This technique yields near-linear-time computable estimators whose approximation values are asymptotically optimal and highly concentrated, resulting in the first: 1) estimators achieving the $O(k/(\varepsilon^2\log k))$ min-max $\varepsilon$-error sample complexity for all $k$-symbol Lipschitz properties; 2) unified near-optimal differentially private estimators for a variety of properties; 3) unified estimator achieving optimal bias and near-optimal variance for five important properties; 4) near-optimal sample-complexity estimators for several important symmetric properties over both domain sizes and confidence levels.

1 Introduction

Let $\Delta_k$ be the collection of distributions over the alphabet $[k] := \{1, 2, \ldots, k\}$, and $[k]^*$ be the set of finite sequences over $[k]$. In many learning applications, we are given i.i.d. samples $X^n := X_1, X_2, \ldots, X_n$ from an unknown distribution $p := (p_1, p_2, \ldots, p_k) \in \Delta_k$, and using these samples we would like to infer certain distribution properties. A distribution property is a mapping $f : \Delta_k \to \mathbb{R}$. Often, these properties are symmetric and additive, namely, $f(p) = \sum_{i\in[k]} f(p_i)$; examples include Shannon entropy, support size, and three more properties in Table 1. Many other important properties are additive but not necessarily symmetric, namely, $f(p) = \sum_{i\in[k]} f_i(p_i)$; examples include the Kullback-Leibler (KL) divergence or $\ell_1$ distance to a fixed distribution $q$, and distances weighted by observations $x_i$, e.g., $\sum_{i\in[k]} x_i\,|p_i - q_i|$.

A property estimator is a mapping $\hat f : [k]^* \to \mathbb{R}$, where $\hat f(X^n)$ approximates $f(p)$. Property estimation has attracted significant attention due to its many applications in various disciplines: Shannon entropy is the principal information measure in numerous machine-learning [6] and neuroscience [13] algorithms; support size is essential in population [4] and vocabulary-size [29] estimation; support coverage arises in ecological [7], genomic [23], and database [14] studies; $\ell_1$ distance is useful in hypothesis testing [24] and classification [10]; KL divergence reflects the performance of investment [8], compression [9], and online learning [20].

For data containing sensitive information, we may need to design special property estimators that preserve individual privacy. Perhaps the most notable notion of privacy is differential privacy (DP). In the context of property estimation [11], we say that an estimator $\hat f$ is $\alpha$-differentially private if for any $X^n$ and $Y^n$ that differ in at most one symbol, $\Pr(\hat f(X^n)\in S)/\Pr(\hat f(Y^n)\in S) \le e^{\alpha}$ for any measurable set $S\subseteq\mathbb{R}$. We consider designing property estimators that achieve small estimation error $\varepsilon$, with probability at least $2/3$, while maintaining $\alpha$-privacy.

Preprint. To appear at NeurIPS 2019.

The next section formalizes our discussion and presents some of the major results in the area.
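Before that, for concreteness, two instances of the additive form introduced above, the first symmetric and the second not:
\[
H(p) = \sum_{i\in[k]} p_i \log\frac{1}{p_i}, \qquad f_i(x) = x\log\frac{1}{x}\ \text{ for all } i \quad\text{(Shannon entropy, symmetric)},
\]
\[
D(p\,\|\,q) = \sum_{i\in[k]} p_i \log\frac{p_i}{q_i}, \qquad f_i(x) = x\log\frac{x}{q_i} \quad\text{(KL divergence to a fixed $q$; $f_i$ depends on $i$ through $q_i$)}.
\]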
1.1 Problem formulation and prior results

Property estimation Let $f$ be a property over $\Delta_k$. The $(\varepsilon,\delta)$-sample complexity of an estimator $\hat f$ for $f$ is the smallest number of samples required to estimate $f(p)$ with accuracy $\varepsilon$ and confidence $1-\delta$, for all distributions in $\Delta_k$,
\[
C_f(\hat f,\varepsilon,\delta) := \min\big\{n : \Pr_{X^n\sim p}\big(|\hat f(X^n)-f(p)|>\varepsilon\big)\le\delta,\ \forall p\in\Delta_k\big\}.
\]
The $(\varepsilon,\delta)$-sample complexity of estimating $f$ is the lowest $(\varepsilon,\delta)$-sample complexity of any estimator,
\[
C_f(\varepsilon,\delta) := \min_{\hat f} C_f(\hat f,\varepsilon,\delta).
\]
Ignoring constant factors and assuming $k$ is large, Table 1 summarizes some of the previous results [2, 22, 26, 31, 33-35] for $\delta = 1/3$. Following the formulation in [2, 31, 34], for support size, we normalize it by $k$ and replace $\Delta_k$ by the collection of distributions $p\in\Delta_k$ satisfying $p_i\ge 1/k$ for all $i\in[k]$. For support coverage [2, 26], the expected number of distinct symbols in $m$ samples, we normalize it by the given parameter $m$ and assume that $m$ is sufficiently large.

Table 1: $C_f(\varepsilon,1/3)$ for some properties
  Shannon entropy: $f_i(p_i)=p_i\log\frac{1}{p_i}$, with $C_f(\varepsilon,1/3)\asymp \frac{k}{\varepsilon\log k}$
  Power sum of order $a<1$: $f_i(p_i)=p_i^a$, with $C_f(\varepsilon,1/3)\asymp \frac{k^{1/a}}{\varepsilon^{1/a}\log k}$
  Distance to uniformity: $f_i(p_i)=|p_i-\frac1k|$, with $C_f(\varepsilon,1/3)\asymp \frac{k}{\varepsilon^{2}\log k}$
  Normalized support size: $f_i(p_i)=\frac1k\mathbb{1}_{p_i>0}$, with $C_f(\varepsilon,1/3)\asymp \frac{k}{\log k}\log^2\frac{1}{\varepsilon}$
  Normalized support coverage: $f_i(p_i)=\frac{1-(1-p_i)^m}{m}$, with $C_f(\varepsilon,1/3)\asymp \frac{m}{\log m}\log\frac{1}{\varepsilon}$

Min-max MSE A closely related characterization of an estimator's performance is the min-max mean squared error (MSE). For any unknown distribution $p\in\Delta_k$, the MSE of a property estimator $\hat f$ in estimating $f(p)$, using $n$ samples from $p$, is
\[
R_n(\hat f,f,p) := \mathbb{E}_{X^n\sim p}\big(\hat f(X^n)-f(p)\big)^2.
\]
Since $p$ is unknown, we consider the minimal possible worst-case MSE, or the min-max MSE, of any property estimator in estimating property $f$,
\[
R_n(f,\Delta_k) := \min_{\hat f}\max_{p\in\Delta_k} R_n(\hat f,f,p).
\]
The property estimator $\hat f^{\,m}$ achieving the min-max MSE is the min-max estimator [21, 22, 34, 35]. Letting $p_{\max} := \arg\max_{p\in\Delta_k} R_n(\hat f^{\,m},f,p)$ be the worst-case distribution for $\hat f^{\,m}$, we can express the min-max MSE as the sum of two quantities: the min-max squared bias,
\[
\mathrm{Bias}_n^2(\hat f^{\,m}) := \big(\mathbb{E}_{X^n\sim p_{\max}}[\hat f^{\,m}(X^n)]-f(p_{\max})\big)^2,
\]
and the min-max variance, $\mathrm{Var}_n(\hat f^{\,m}) := \mathrm{Var}_{X^n\sim p_{\max}}(\hat f^{\,m}(X^n))$.

Private property estimation Analogous to the non-private setting above, for an estimator $\hat f$ of $f$, let its $(\varepsilon,\delta,\alpha)$-private sample complexity $C_f(\hat f,\varepsilon,\delta,\alpha)$ be the smallest number of samples that $\hat f$ requires to estimate $f(p)$ with accuracy $\varepsilon$ and confidence $1-\delta$, while maintaining $\alpha$-privacy, for all distributions $p\in\Delta_k$. The $(\varepsilon,\delta,\alpha)$-private sample complexity of estimating $f$ is then $C_f(\varepsilon,\delta,\alpha) := \min_{\hat f} C_f(\hat f,\varepsilon,\delta,\alpha)$. For Shannon entropy, normalized support size, and normalized support coverage, the work of [3] derived tight lower and upper bounds on $C_f(\varepsilon,1/3,\alpha)$.

1.2 Existing methods

There are mainly two types of methods for estimating distribution properties, plug-in and approximation-empirical, which we briefly discuss below.

Plug-in Major existing plug-in estimators work only for symmetric properties, and in general achieve neither the min-max MSEs nor the optimal $(\varepsilon,\delta)$-sample complexities. More specifically, the linear-programming-based methods initiated by [12], and analyzed and extended in [31-33], achieve the optimal sample complexities only for distance to uniformity and entropy, and only for relatively large $\varepsilon$. The method essentially learns the moments of the underlying distribution from its samples, and finds a distribution whose (low-order) moments are consistent with these estimates.
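For reference, the simplest member of the plug-in family is the naive empirical-frequency estimator that later sections use as a baseline. A minimal sketch (Python with NumPy; the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def empirical_plugin(samples, f, k):
    """Naive plug-in estimate of an additive property f(p) = sum_i f(p_i).

    samples: i.i.d. symbols encoded as integers in {0, ..., k-1}
    f: the real function applied to each empirical probability
    """
    counts = np.bincount(np.asarray(samples), minlength=k)
    p_hat = counts / counts.sum()
    # Plug the empirical frequencies directly into the property.
    return float(sum(f(p) for p in p_hat))

# Example: empirical entropy, which is biased downward when k is comparable to n.
# h_hat = empirical_plugin(x, lambda p: p * np.log(1 / p) if p > 0 else 0.0, k)
```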
A locally refined version of the linear-programming estimator [15] achieves optimal sample complexities for entropy, power sums, and normalized support size, but requires polynomial time to compute. This version yields a bias guarantee similar to ours over symmetric properties, yet its variance guarantee is often worse. Recently, the work of [2] showed that the profile maximum likelihood (PML) estimator [25], an estimator that finds a distribution maximizing the probability of observing the multiset of empirical frequencies, is sample-optimal for estimating entropy, distance to uniformity, and normalized support size and coverage. After the initial submission of the current work, the paper [18] showed that the PML approach and its near-linear-time computable variant [5] are sample-optimal for any property that is symmetric, additive, and appropriately Lipschitz, including the four properties just mentioned. This establishes the PML estimator as the first universally sample-optimal plug-in approach for estimating symmetric properties. In comparison, the current work provides a unified property-dependent approach that is sample-optimal for several symmetric and non-symmetric properties.

Approximation-empirical The approximation-empirical method [21, 22, 28, 34, 35] identifies a non-smooth part of the underlying function $f_i$, replaces $f_i$ there by a polynomial, and estimates the value of $p_i$ by its empirical frequency $\hat p_i$. Depending on whether $\hat p_i$ belongs to the non-smooth part or not, the method estimates $f_i(p_i)$ by either an unbiased estimator of the approximating polynomial or the empirical plug-in estimate $f_i(\hat p_i)$. However, due to its strong dependency on both the function's structure and the empirical estimator's performance, the method requires significantly different case-by-case modification and analysis, and may not work optimally for general additive properties. Specifically: 1) the efficacy of this method relies on the accuracy of the empirical plug-in estimator over the smooth segments, which needs to be verified individually for each property; 2) different functions often have non-smooth segments of different number, locations, and sizes; 3) combining the estimators for the non-smooth and smooth segments requires additional care: it sometimes needs knowledge of $k$, and sometimes even a third estimator to ensure a smooth transition. In addition, the method has not been shown to achieve optimal results for general Lipschitz properties, or for many of the other properties covered by the new method in this paper.

2 New methodology

The preceding discussion showed that no existing generic method efficiently estimates general additive properties. Motivated by recent advances in the field [2, 15, 16, 19], we derive the first generic method for constructing sample-efficient estimators for all sufficiently smooth additive properties. We start by approximating functions of an unknown Bernoulli success probability from its i.i.d. samples. For a wide class of real functions, we propose a piecewise-polynomial approximation technique and show that it yields small-bias estimators that are exponentially concentrated around their expected values. This provides a different view of property estimation that allows us to simplify the proofs and broaden the range of the results. For details please see Section 4.

High-level idea The idea behind this methodology is natural. By the Chernoff bound for binomial random variables, the empirical count of a symbol in a given sample sequence will not differ from its mean value by too much.
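Quantitatively, for a symbol of probability $p_i$ whose count $N_i$ in $n$ samples is binomially distributed, one standard multiplicative form of the Chernoff bound (stated here with relative deviation $\eta$ to avoid clashing with the confidence parameter $\delta$) reads
\[
\Pr\big(|N_i-np_i|\ge \eta\, np_i\big)\le 2\exp\!\left(-\frac{\eta^2 np_i}{3}\right),\qquad 0<\eta\le 1,
\]
so with probability $1-n^{-\Theta(1)}$ the count stays within roughly $\sqrt{np_i\log n}$ of its mean.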
Hence, based on the empirical frequency, we can roughly infer which tiny piece of $[0,1]$ the corresponding probability lies in. However, due to randomness, a symbol's empirical frequency may differ from the true probability by a small quantity, and plugging it into the function will incur a certain amount of bias. To correct this bias, we first replace the function by its low-degree polynomial approximation over that tiny piece, and then compute an unbiased estimator of this polynomial. In other words, we use this polynomial as a proxy for the estimation task. We want the degree of the polynomial to be small, since this generally reduces the unbiased estimator's variance; we approximate only a tiny piece of the function, because this reduces the polynomial's approximation error (bias). Given any additive property $f(p)=\sum_{i\in[k]} f_i(p_i)$, we apply this technique to each real function $f_i$ and use the corresponding sum to estimate $f(p)$. Henceforth we use $\hat f^*$ to denote this explicit estimator.

3 Implications and new results

Because of its conceptual simplicity, the methodology described in the last section has strong implications for estimating all sufficiently smooth additive properties, which we present as theorems. Theorem 5 in Section 5 is the root of all the following results, but it is more abstract and illustrating it requires more effort. Hence, for clarity, we begin by presenting several more concrete results.

Correct asymptotics For most of the properties considered in this paper, even the naive empirical-frequency estimator is sample-optimal in the large-sample regime (termed the "simple regime" in [34]) where the number of samples $n$ far exceeds the alphabet size $k$. The interesting regime, addressed in numerous recent publications [15, 16, 18, 19, 21, 31, 33, 35], is where $n$ and $k$ are comparable, e.g., differing by at most a logarithmic factor. In this range, $n$ is sufficiently small that sophisticated techniques can help, yet not so small that nothing can be estimated. Since $n$ and $k$ are given, one can decide whether the naive estimator suffices or sophisticated estimators are needed. For most of the results presented here, the technical significance stems from their nontriviality in this large-alphabet regime. For this reason, we also assume that $\log k\asymp\log n$ throughout the paper.

Implication 1: Lipschitz property estimation An additive property $f(p)=\sum_i f_i(p_i)$ is $L$-Lipschitz if all functions $f_i$ have Lipschitz constants uniformly bounded by $L$. Many important properties are Lipschitz, but except for a few isolated examples, it was not known until very recently [16, 19] that general Lipschitz properties can be estimated with sub-linearly many samples. In particular, the result in [16] implies a sample-complexity upper bound of $O(L^3 k/(\varepsilon^3\log k))$. We improve this bound to $C_f(\varepsilon,1/3)\lesssim L^2 k/(\varepsilon^2\log k)$:

Theorem 1. If $f$ is an $L$-Lipschitz property, then for any $p\in\Delta_k$ and $X^n\sim p$,
\[
\Big|\mathbb{E}\big[\hat f^*(X^n)\big]-f(p)\Big| \lesssim \sum_{i\in[k]} L\sqrt{\frac{p_i}{n\log n}}
\qquad\text{and}\qquad
\mathrm{Var}\big(\hat f^*(X^n)\big) \le O\!\left(\frac{L^2}{n^{1-4\lambda}}\right),
\]
where $\lambda\in(0,1/4)$ is the tuning parameter chosen in Section 4.3.

This theorem is essentially optimal: even for relatively simple Lipschitz properties, e.g., distance to uniformity (see Table 1 and [22]), the bias bound is optimal up to constant factors, and the variance bound is near-optimal and cannot be smaller than $\Omega(L^2/n)$.
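To see how Theorem 1 yields the claimed sample-complexity bound, we fill in the intermediate step: by the Cauchy-Schwarz inequality,
\[
\sum_{i\in[k]} L\sqrt{\frac{p_i}{n\log n}}\;\le\;L\sqrt{\frac{k}{n\log n}\sum_{i\in[k]}p_i}\;=\;L\sqrt{\frac{k}{n\log n}},
\]
which is at most $\varepsilon$ once $n\gtrsim L^2k/(\varepsilon^2\log k)$ (recall $\log n\asymp\log k$); in this regime the variance bound is of lower order for small $\lambda$ and large $k$, so Chebyshev's inequality yields error at most $\varepsilon$ with probability at least $2/3$, giving $C_f(\varepsilon,1/3)\lesssim L^2k/(\varepsilon^2\log k)$.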
Implication 2: High-confidence property estimation Surprisingly, the $(\varepsilon,\delta)$-sample complexity has not been fully characterized even for some important properties. A commonly used approach to constructing an estimator with an $(\varepsilon,\delta)$-guarantee is to choose an $(\varepsilon,1/3)$-estimator and boost the learning confidence by taking the median of $O(\log\frac1\delta)$ of its independent estimates. This well-known median trick yields the upper bound
\[
C_f(\varepsilon,\delta) \lesssim \log\frac1\delta\cdot C_f(\varepsilon,1/3).
\]
For example, for Shannon entropy,
\[
C_f(\varepsilon,\delta) \lesssim \log\frac1\delta\cdot\left(\frac{k}{\varepsilon\log k}+\frac{1}{\varepsilon^{2}}\right).
\]
By contrast, we show that our estimator satisfies
\[
C_f(\hat f^*,\varepsilon,\delta) \lesssim \frac{k}{\varepsilon\log k}+\frac{\log\frac1\delta}{\varepsilon^{2}}.
\]
To see optimality, Theorem 2 below shows that this upper bound is nearly tight. In the high-probability regime, namely when $\delta$ is small, the new upper bound obtained using our method can be significantly smaller than the one obtained from the median trick. Theorem 2 shows that this phenomenon also holds for other properties such as normalized support size and power sums.

Theorem 2. Ignoring constant factors, Table 2 summarizes relatively tight lower and upper bounds on $C_f(\varepsilon,\delta,k)$ for different properties. In addition, all the upper bounds can be achieved by the estimator $\hat f^*$.

Table 2: Bounds on $C_f(\varepsilon,\delta,k)$ for several properties
  Shannon entropy: lower bound $\frac{k}{\varepsilon\log k}+\frac{\log\frac1\delta}{\varepsilon^{2}}$; upper bound $\frac{k}{\varepsilon\log k}+\frac{\log\frac1\delta}{\varepsilon^{2}}$
  Power sum of order $a$: lower bound $\frac{k^{1/a}}{\varepsilon^{1/a}\log k}+\left(\frac{\log\frac1\delta}{\varepsilon^{2}}\right)^{\frac{1}{2a-1}}$; upper bound $\frac{k^{1/a}}{\varepsilon^{1/a}\log k}+\left[\left(\frac{\log\frac1\delta}{\varepsilon^{2}}\right)^{\frac{1}{2a-1}}\right]^{1+\beta}$
  Normalized support size: lower bound $\frac{k}{\log k}\log^2\frac{1}{\varepsilon}$; upper bound $\frac{k}{\log k}\log^2\frac{1}{\varepsilon}$

Remarks on Table 2: The parameter $\beta$ can be any fixed absolute constant in $(0,1)$. The lower and upper bounds for power sums hold for $a\in(1/2,1)$. For normalized support size, we require $\delta>\exp(-k^{1/3})$ and $\varepsilon\in(k^{-0.33},k^{-0.01})$. Note that the restriction on $\varepsilon$ for support-size estimation is imposed only to yield a simple sample-complexity expression. It is not required by our algorithm, which is also sample-optimal for $\varepsilon\ge k^{-0.01}$. It is possible that other algorithms can achieve similar upper bounds; our main point is to demonstrate that our single, unified method has many desired attributes.

Implication 3: Optimal bias and near-optimal variance The min-max MSEs of several important properties have been determined up to constant factors, yet there was no explicit and executable scheme for designing estimators achieving them. We show that $\hat f^*$ achieves optimal squared bias and near-optimal variance in estimating a variety of properties.

Theorem 3. Up to constant factors, the estimator $\hat f^*$ achieves the optimal (min-max) squared bias and near-optimal variance for estimating Shannon entropy, normalized support size, distance to uniformity, and power sums, as well as $\ell_1$ distance to a fixed distribution.

Remarks on Theorem 3: For power sums, we consider the case where the order is less than 1. For normalized support size, we again assume that the minimum nonzero probability of the underlying distribution is at least $1/k$. As noted previously, we consider the parameter regime where $n$ and $k$ are comparable and $k$ is large. In particular, besides the general assumption $\log k\asymp\log n$, we assume that $n\lesssim k^{1/a}/\log k$ for power sums, $n\lesssim k/\log k$ for entropy, and $k/\log k\lesssim n\lesssim k\log k$ for normalized support size. The proof of the theorem follows naturally from Theorem 5.

Implication 4: Private property estimation Privacy is of increasing concern in modern data science. We show that our estimates are exponentially concentrated around the underlying property value. Using this attribute, we derive a near-optimal differentially private estimator $\hat f_{\mathrm{DP}}$ for several important properties by adding Laplace noise to $\hat f^*$.
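The privatization step is the standard Laplace mechanism of [11]: if replacing one sample can change the estimate by at most $\Delta$ (for the estimator above, of order $\max_i S^*_{f_i}(n)\,n^{\lambda-1}$ up to constants, by the concentration argument of Section 4.4), then adding $\mathrm{Lap}(\Delta/\alpha)$ noise makes the released value $\alpha$-differentially private. A minimal sketch (Python; the function name and interface are ours, not the paper's):

```python
import numpy as np

def laplace_privatize(estimate, sensitivity, alpha, rng=None):
    """Laplace mechanism: release estimate + Lap(sensitivity / alpha).

    sensitivity: an upper bound on how much `estimate` can change when a single
    sample of X^n is replaced; alpha: the differential-privacy parameter.
    """
    rng = rng or np.random.default_rng()
    return estimate + rng.laplace(loc=0.0, scale=sensitivity / alpha)

# Usage sketch: f_dp = laplace_privatize(f_star_estimate, sensitivity_bound, alpha)
```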
As an example, for Shannon entropy and properly chosen algorithm hyper-parameters,
\[
C_f(\hat f_{\mathrm{DP}},\varepsilon,1/3,\alpha) \lesssim \frac{k}{\varepsilon\log k}+\frac{1}{\varepsilon^{2.01}}+\frac{1}{(\alpha\varepsilon)^{1.01}}.
\]
This essentially recovers the sample-complexity upper bound in [3], which is nearly tight [3] for all parameters. Hence for large domains, one can achieve strong differential-privacy guarantees with only a marginal increase in the sample size compared with the $k/(\varepsilon\log k)$ samples required for non-private estimation. An analogous argument shows that $\hat f_{\mathrm{DP}}$ is also near-optimal for the private estimation of support size and many other properties. Section 2.3 of the supplementary material provides more detail as well as unified bounds on the differentially private sample complexities of general additive properties.

Outline The rest of the paper is organized as follows. In Section 4 we construct an estimator that approximates the function value of an unknown Bernoulli success probability, and characterize its performance in Theorem 4. In Section 5 we apply this function estimator to estimating properties of distributions and provide analogous guarantees. Section 6 concludes the paper and presents possible future directions. We postpone all proof details to the supplementary material.

4 Estimating functions of Bernoulli probabilities

4.1 Problem formulation

We begin with a simple problem that involves just a single unknown parameter. Let $g$ be a continuous real function over the unit interval whose absolute value is uniformly bounded by an absolute constant. Given i.i.d. samples $X^n := X_1,\ldots,X_n$ from a Bernoulli distribution with unknown success probability $p$, we would like to estimate the function value $g(p)$. A function estimator is a mapping $\hat g : \{0,1\}^*\to\mathbb{R}$. We characterize the performance of the estimator $\hat g(X^n)$ in estimating $g(p)$ by its absolute bias
\[
\mathrm{Bias}(\hat g) := \big|\mathbb{E}[\hat g(X^n)]-g(p)\big|,
\]
and its deviation probability
\[
P(\varepsilon) := \Pr\big(|\hat g(X^n)-\mathbb{E}[\hat g(X^n)]|>\varepsilon\big),
\]
which implies the variance and provides additional information useful for property estimation. Our objective is to find an estimator that has near-optimal small bias and a Gaussian-type deviation probability $\exp(-n^{\Theta(1)}\varepsilon^2)$ for all possible values of $p\in[0,1]$. As could be expected, our results are closely related to the smoothness of the function $g$.

4.2 Smoothness of real functions

Effective derivative Let $I$ be an interval and $h\in(0,|I|)$ a step size, where $|I|$ denotes the interval's length. The effective derivative of $g$ over $I$ is the Lipschitz-type ratio
\[
L_g(h,I) := \sup_{x,y\in I,\ |y-x|\le h}\frac{|g(y)-g(x)|}{h}.
\]
This simple smoothness measure does not fully capture the smoothness of $g$. For example, $g$ could be a zigzag function that has a high effective derivative locally, but overall fluctuates within only a very small range, and hence is close to a smooth function in maximum deviation. We therefore define a second smoothness measure as the maximum deviation between $g$ and a fixed-degree polynomial. Besides being smooth and having derivatives of all orders, polynomials, by the Weierstrass approximation theorem, can also uniformly approximate any continuous $g$.

Min-max deviation Let $\mathcal{P}_d$ be the collection of polynomials of degree at most $d$. The min-max deviation in approximating $g$ over an interval $I$ by a polynomial in $\mathcal{P}_d$ is
\[
D_g(d,I) := \min_{q\in\mathcal{P}_d}\max_{x\in I}|g(x)-q(x)|.
\]
The minimizing polynomial is the degree-$d$ min-max polynomial approximation of $g$ over $I$. For simplicity we abbreviate $L_g(h):=L_g(h,[0,1])$ and $D_g(d):=D_g(d,[0,1])$.
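As a simple illustration of the second quantity (a classical worked example, not from the paper): for $g(x)=x^2$,
\[
D_g(1,[0,1])=\min_{q\in\mathcal P_1}\max_{x\in[0,1]}\big|x^2-q(x)\big|=\frac18,
\]
attained by $q(x)=x-\frac18$, whose error $x^2-x+\frac18$ equioscillates between $+\frac18$ and $-\frac18$ at $x=0,\frac12,1$. On a sub-interval of length $\ell$ the same computation gives $\ell^2/8$, which is one way to see why approximating $g$ only over a tiny piece of $[0,1]$, as in the construction below, drives the bias down.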
4.3 Estimator construction

For simplicity, assume that the sampling parameter is an even number $2n$. Given i.i.d. samples $X^{2n}\sim\mathrm{Bern}(p)$, we let $N_i$ denote the number of times symbol $i\in\{0,1\}$ appears in $X^{2n}$. We first describe a simplified version of our estimator and provide a non-rigorous analysis relating its performance to the smoothness quantities just defined. The actual, more involved estimator and a rigorous performance analysis are presented in Section 1 of the supplementary material.

High-level description On a high level, the empirical estimator estimates $g(p)$ by $g(N_1/(2n))$ and often incurs a large bias. To address this, we first partition the unit interval into roughly $\sqrt{n/\log n}$ sub-intervals. Then, we split $X^{2n}$ into two halves of equal length $n$ and use the empirical probability of symbol 1 in the first half to identify a sub-interval $I$ and its two neighbors in the partition such that $p$ is contained in one of them with high confidence. Finally, we replace $g$ by a low-degree min-max polynomial approximation over $I$ and its four neighbors, and estimate $g(p)$ from the second half of the sample sequence by applying a near-unbiased estimator of this polynomial's value at $p$.

Step 1: Partitioning the unit interval Let $\alpha[a,b]$ denote the interval $[\alpha a,\alpha b]$. For an absolute positive constant $c$, define $c_n := c\log n/n$ and a sequence of increasing-length intervals
\[
I_j := c_n\big[(j-1)^2,\, j^2\big],\qquad j\ge 1.
\]
Observe that the first $M_n := \lceil 1/\sqrt{c_n}\,\rceil$ intervals partition the unit interval $[0,1]$. For any $x\ge 0$, we let $j_x$ denote the index $j$ such that $x\in I_j$. This unit-interval partition is directly motivated by the Chernoff bounds. A very similar construction appears in [1], and the exact one appears in [15, 17].

Step 2: Splitting the sample sequence and locating the probability value Split the sample sequence $X^{2n}$ into two equal halves, and let $\hat p_1$ and $\hat p_2$ denote the empirical probabilities of symbol 1 in the first and second half, respectively. By the Chernoff bound for binomial random variables, for a sufficiently large constant $c$, the intervals $I_1,\ldots,I_{M_n}$ form essentially the finest partition of $[0,1]$ with the following property: letting
\[
I_j' := \bigcup_{j'=j-1}^{j+1} I_{j'}
\qquad\text{and}\qquad
I_j'' := \bigcup_{j'=j-2}^{j+2} I_{j'},
\]
for every underlying $p\notin I_j'$ we have $\Pr(\hat p_1\in I_j)\le n^{-3}$, and for every underlying $p$ and every $j$, $\Pr(\hat p_1\in I_j \text{ and } \hat p_2\notin I_j'')\le n^{-3}$. It follows that if $\hat p_1\in I_j$, then with high confidence we can assume that $p\in I_j'$.

Step 3: Min-max polynomial estimation Let $\lambda$ be a universal constant in $(0,1/4)$ that balances the bias and variance of our estimator. Given the sampling parameter $n$, define
\[
d_n := \max\big\{d\in\mathbb{N} : d\cdot 2^{4.5d+2}\le n^{\lambda}\big\}.
\]
For each $j$, let the min-max polynomial of $g$ be the degree-$d_n$ polynomial $\tilde g_j$ minimizing the largest absolute deviation from $g$ over $I_j''$. For each interval $I_j$ we then create a piecewise polynomial $\tilde g_j^*$ that approximates $g$ over the entire unit interval: it consists of $\tilde g_j$ over $I_j''$, and of $\tilde g_{j'}$ over $I_{j'}$ for $j'\notin[j-2,j+2]$. Finally, to estimate $g(p)$, let $j$ be the index such that $\hat p_1\in I_j$, and approximate $\tilde g_j^*(p)$ by plugging in unbiased estimators of $p^t$, constructed from $\hat p_2$, for all powers $t\le d_n$. Note that a standard unbiased estimator of $p^t$ is
\[
\prod_{i=0}^{t-1}\frac{\hat p_2 - i/n}{1 - i/n},
\]
and the rest follows from the linearity of expectation.
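A minimal sketch of the product formula in Step 3 (Python/NumPy; the function name is ours):

```python
import numpy as np

def unbiased_power(p2_hat, n, t):
    """Unbiased estimate of p^t from the second-half empirical frequency
    p2_hat = N/n of a count N ~ Bin(n, p), via
    prod_{i=0}^{t-1} (p2_hat - i/n) / (1 - i/n).  Requires t <= n."""
    i = np.arange(t)
    return float(np.prod((p2_hat - i / n) / (1 - i / n)))

# Simulation check of unbiasedness (illustrative): with n = 200, p = 0.3, t = 3,
# averaging unbiased_power(rng.binomial(n, p) / n, n, t) over many trials
# should approach p**3 = 0.027.
```

The variance of this estimate grows with the degree $t$, which is one reason the construction caps the degree at the slowly growing $d_n$.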
Computational complexity A well-known result in approximation theory states that the degree-$d$ truncated Chebyshev series (or polynomial) of a function $g$ often closely approximates the degree-$d$ min-max polynomial of $g$. The Remez algorithm [27, 30] is a popular method for finding such Chebyshev-type approximations of degree $d$, and it is often very efficient in practice. Under certain conditions on the function to approximate, running the algorithm for $\log t$ iterations leads to an error of $O(\exp(-\Theta(t)))$. Indeed, many state-of-the-art property estimators, e.g., [16, 21, 22, 34, 35], use the Remez algorithm to approximate the min-max polynomials, and have implementations that are near-linear-time computable.

4.4 Final estimator and its characterization

The estimator Consolidating the above results, we estimate $g(p)$ by the estimator
\[
\hat g(\hat p_1,\hat p_2) := \sum_{j}\hat g_j(\hat p_2)\,\mathbb{1}_{\hat p_1\in I_j}.
\]
The exact form and construction of this estimator appear in Section 1.2 of the supplementary material.

Characterization The theorem below characterizes the bias, variance, and mean-deviation probability of the estimator. We sketch its proof here and leave the details to the supplementary material. According to the reasoning in the last section, for all $p\in I_j'$, the absolute bias of the estimator $\hat g_j(\hat p_2)$ in estimating $g(p)$ is essentially upper bounded by $D_g(d_n,I_j'')$. Normalizing by the input's precision $1/n$, we define the (normalized) local and global min-max deviations, respectively, as
\[
D^*_g(2n,x) := n\cdot\max_{j\in[j_x-1,\,j_x+1]} D_g(d_n,I_j'')
\qquad\text{and}\qquad
D^*_g(2n) := \max_{x\in[0,1]} D^*_g(2n,x).
\]
Hence the bias of $\hat g(\hat p_1,\hat p_2)$ in estimating $g(p)$ is essentially upper bounded by $D^*_g(2n,p)/n\le D^*_g(2n)/n$. A similar argument yields a variance bound on $\hat g(\hat p_1,\hat p_2)$ in which $D^*_g(2n,p)$ is replaced by the local effective derivative,
\[
L^*_g(2n,p) := \max_{j\in[j_p-1,\,j_p+1]} L_g(1/n,I_j'').
\]
Analogously, define $L^*_g(2n) := \max_{x\in[0,1]} L^*_g(2n,x)$ as the global effective derivative. The mean-deviation probability of this estimator is characterized by
\[
S^*_g(2n) := L^*_g(2n) + D^*_g(2n).
\]
Specifically, changing one sample in $X^{2n}$ changes the value of $\hat g(\hat p_1,\hat p_2)$ by at most $\Theta(S^*_g(2n)\,n^{\lambda-1})$. Therefore, by McDiarmid's inequality, for any error parameter $\varepsilon$,
\[
\Pr\big(|\hat g(\hat p_1,\hat p_2)-\mathbb{E}[\hat g(\hat p_1,\hat p_2)]|>\varepsilon\big)\le
\exp\!\left(-\Theta\!\left(\frac{\varepsilon^2 n^{1-2\lambda}}{(S^*_g(2n))^2}\right)\right).
\]

Theorem 4. For any bounded function $g$ over $[0,1]$, $X^n\sim\mathrm{Bern}(p)$, and error parameter $\varepsilon>0$,
\[
\big|\mathbb{E}[\hat g(\hat p_1,\hat p_2)]-g(p)\big| \lesssim \frac{p}{n^{3}}+\frac{D^*_g(n,p)}{n},
\qquad
\mathrm{Var}\big(\hat g(\hat p_1,\hat p_2)\big) \lesssim \frac{p}{n^{5}}+\frac{\big(L^*_g(n,p)\big)^2\,p}{n^{1-4\lambda}},
\]
and
\[
\Pr\big(|\hat g(\hat p_1,\hat p_2)-\mathbb{E}[\hat g(\hat p_1,\hat p_2)]|>\varepsilon\big)\le
\exp\!\left(-\Theta\!\left(\frac{\varepsilon^2 n^{1-2\lambda}}{(S^*_g(n))^2}\right)\right).
\]
Next we use this theorem to derive tight bounds for estimating general additive properties.

5 A unified piecewise-polynomial approach to property estimation

Let $f$ be an arbitrary additive property over $\Delta_k$ such that $|f_i(x)|$ is uniformly bounded by some absolute constant for all $i\in[k]$, and let $L^*(\cdot)$, $D^*(\cdot)$, and $S^*(\cdot)$ be the smoothness quantities defined in Sections 4.2 and 4.4. Let $X^n$ be an i.i.d. sample sequence from an unknown distribution $p\in\Delta_k$. Splitting $X^n$ into two sub-sequences of equal length, we denote by $\hat p_{i,1}$ and $\hat p_{i,2}$ the empirical probabilities of symbol $i\in[k]$ in the first and second sub-sequence, respectively. Applying the technique presented in Section 4, we estimate the additive property $f(p)=\sum_{i\in[k]} f_i(p_i)$ by the estimator
\[
\hat f^*(X^n) := \sum_{i\in[k]} \hat f_i(\hat p_{i,1},\hat p_{i,2}).
\]
Theorem 4 can then be used to show that $\hat f^*$ performs well for all sufficiently smooth additive properties:

Theorem 5. For any $\varepsilon>0$, $p\in\Delta_k$, and $X^n\sim p$,
\[
\Big|\mathbb{E}\big[\hat f^*(X^n)\big]-f(p)\Big| \lesssim \frac1n\sum_{i\in[k]} D^*_{f_i}(n,p_i),
\qquad
\mathrm{Var}\big(\hat f^*(X^n)\big) \lesssim \frac{1}{n^{5}}+\frac{1}{n^{1-4\lambda}}\sum_{i\in[k]}\big(L^*_{f_i}(n,p_i)\big)^2 p_i,
\]
and
\[
\Pr\Big(\big|\hat f^*(X^n)-\mathbb{E}[\hat f^*(X^n)]\big|>\varepsilon\Big)\le
\exp\!\left(-\Theta\!\left(\frac{\varepsilon^2 n^{1-2\lambda}}{\max_{i\in[k]}\big(S^*_{f_i}(n)\big)^2}\right)\right).
\]
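To make the construction concrete, below is a highly simplified sketch of the per-symbol estimator of Section 4 and the resulting property estimator (Python/NumPy). It is our own illustration under several stated simplifications, not the authors' implementation: Chebyshev interpolation on $I''_j$ stands in for the min-max (Remez) polynomial, the piecewise smoothing across neighboring intervals and the boundary cases are omitted, and the constants $c$ and $\lambda$ are illustrative.

```python
import numpy as np

def degree_dn(n, lam):
    # d_n = max{d : d * 2^(4.5 d + 2) <= n^lam}, clamped to at least 1 for this sketch.
    d = 0
    while (d + 1) * 2 ** (4.5 * (d + 1) + 2) <= n ** lam:
        d += 1
    return max(d, 1)

def estimate_g(g, N1, N2, n, c=2.0, lam=0.01):
    """Sketch of g_hat(p1_hat, p2_hat): estimate g(p) for one symbol from its
    counts N1, N2 in the two sample halves (each of length n)."""
    cn = c * np.log(n) / n
    p1_hat, p2_hat = N1 / n, N2 / n
    j = max(1, int(np.ceil(np.sqrt(p1_hat / cn))))                 # p1_hat lies in I_j
    lo, hi = cn * max(j - 3, 0) ** 2, min(cn * (j + 2) ** 2, 1.0)  # I''_j, clipped to [0, 1]
    d = degree_dn(n, lam)
    # Chebyshev interpolation nodes on I''_j (a proxy for the min-max polynomial).
    nodes = np.cos(np.pi * (np.arange(d + 1) + 0.5) / (d + 1))
    xs = lo + (hi - lo) * (nodes + 1) / 2
    coeffs = np.polynomial.polynomial.polyfit(xs, [g(x) for x in xs], d)  # a_0, ..., a_d
    # Replace each power p^t by its unbiased estimate built from the second half.
    i = np.arange(d)
    factors = (p2_hat - i / n) / (1 - i / n)
    p_powers = np.concatenate(([1.0], np.cumprod(factors)))        # estimates of p^0, ..., p^d
    return float(np.dot(coeffs, p_powers))

def estimate_property(sample, fs, k, c=2.0, lam=0.01):
    """f_hat*(X^{2n}) = sum_i f_hat_i(p_hat_{i,1}, p_hat_{i,2}) for fs = [f_1, ..., f_k]."""
    n = len(sample) // 2
    c1 = np.bincount(np.asarray(sample[:n]), minlength=k)
    c2 = np.bincount(np.asarray(sample[n:2 * n]), minlength=k)
    return sum(estimate_g(fs[i], c1[i], c2[i], n, c, lam) for i in range(k))
```

Applied to Shannon entropy, one would take $f_i(x)=x\log(1/x)$ for every $i$ (with the convention $0\cdot\log(1/0)=0$), matching the example discussed next.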
Discussions While the significance of the theorem may not be immediately apparent, note that the three bounds characterize the estimator's bias, variance, and higher-order moments in terms of the local min-max deviation $D^*_{f_i}(n,p_i)$, the local effective derivative $L^*_{f_i}(n,p_i)$, and the sum of the maximum possible values of the two, $S^*_{f_i}(n)$, respectively. The smoother the function $f_i$ is, the smaller $D^*_{f_i}(\cdot)$ and $L^*_{f_i}(\cdot)$ will be. In particular, for simple smooth functions, the values of $D^*$, $L^*$, and $S^*$ can easily be shown to be small, implying that $\hat f^*$ is nearly optimal under all three criteria. For example, for Shannon entropy, where $f_i(p_i)=p_i\log\frac{1}{p_i}$ for all $i$, we can show that $D^*_{f_i}(n,p_i)$ and $L^*_{f_i}(n,p_i)$ are at most $O(1/\log n)$ and $O(\log n)$, respectively. Hence the bias and variance bounds in Theorem 5 become $k/(n\log n)$ and $\log^2 n/n^{1-4\lambda}$, and the tail bound simplifies to $\exp(-\Theta(\varepsilon^2 n^{1-2\lambda}/\log^2 n))$, where $\lambda$ is an arbitrary absolute constant in $(0,1/4)$, e.g., $\lambda=0.01$. According to Theorem 2 and the results in [21, 35], all these bounds are optimal.

Computational complexity We briefly illustrate how our estimator can be computed efficiently, in time near-linear in the sample size $n$. As stated in Section 4.3, over each of the $O(\sqrt{n/\log n})$ intervals we constructed, we find the min-max polynomial of the underlying function over that particular interval; for many properties an approximation suffices, and the computation takes only $\mathrm{poly}(\log n)$ time using the Remez algorithm, as previously noted. Though our construction uses $O(\sqrt{n/\log n})$ such polynomials, for each symbol $i$ appearing in the sample sequence $X^n$ we need to compute just one of them to estimate $f_i(p_i)$. The number of distinct symbols appearing in $X^n$ is trivially at most $n$, hence the total time complexity is $O(n\,\mathrm{poly}(\log n))$, which is near-linear in $n$. In fact, the computation of all the $O(k\sqrt{n/\log n})$ possible polynomials can even be performed off-line (without samples), and this does not affect our estimator's time complexity.

6 Conclusion and future directions

We introduced a piecewise min-max polynomial methodology for approximating additive distribution properties. This method yields the first generic approach to constructing sample- and time-efficient estimators for all sufficiently smooth properties. The approach provides the first: 1) sublinear-sample estimators for general Lipschitz properties; 2) general near-optimal private estimators; 3) unified min-max-MSE-achieving estimators for six important properties; 4) near-optimal high-confidence estimators. Unlike previous works, our method covers both symmetric and non-symmetric, differentiable and non-differentiable properties, under both private and non-private settings. Two natural extensions are of interest: 1) generalizing the results to properties involving multiple unknown distributions, such as distributional divergences; 2) extending the techniques to derive a similarly unified approach for the closely related field of distribution property testing. Besides the results we established for piecewise-polynomial estimators under the min-max estimation framework, the works of [16, 19] recently proposed and studied a different formulation of competitive property estimation that aims to emulate the instance-by-instance performance of the widely used empirical plug-in estimator using a logarithmically smaller sample size. It would also be quite meaningful to investigate the performance of our technique under this new formulation.

Acknowledgments We are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.
References

[1] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh. Optimal probability estimation with applications to prediction and classification. In Conference on Learning Theory, pages 764-796, 2013.
[2] J. Acharya, H. Das, A. Orlitsky, and A. T. Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11-21, 2017.
[3] J. Acharya, G. Kamath, Z. Sun, and H. Zhang. INSPECTRE: Privately estimating the unseen. arXiv preprint arXiv:1803.00008, 2018.
[4] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, pages 265-270, 1984.
[5] M. S. Charikar, K. Shiragur, and A. Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 780-791. ACM, 2019.
[6] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.
[7] R. K. Colwell, A. Chao, N. J. Gotelli, S. Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3-21, 2012.
[8] T. M. Cover. Universal portfolios. The Kelly Capital Growth Investment Criterion: Theory and Practice, pages 181-209, 2011.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, Berlin, Heidelberg, 2006.
[12] B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435-447, 1976.
[13] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[14] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. VLDB, 95:311-322, 1995.
[15] Y. Han, J. Jiao, and T. Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.
[16] Y. Hao and A. Orlitsky. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019.
[17] Y. Hao and A. Orlitsky. Doubly-competitive distribution estimation. In International Conference on Machine Learning, pages 2614-2623, 2019.
[18] Y. Hao and A. Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint arXiv:1906.03794, 2019.
[19] Y. Hao, A. Orlitsky, A. T. Suresh, and Y. Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8848-8857, 2018.
[20] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856-864, 2010.
[21] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835-2885, 2015.
[22] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the ℓ1 distance. IEEE Transactions on Information Theory, 2018.
[23] I. Kroes, P. W. Lepp, and D. A. Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547-14552, 1999.
[24] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
[25] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. In UAI '04, pages 426-435, 2004.
[26] A. Orlitsky, A. T. Suresh, and Y. Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283-13288, 2016.
[27] R. Pachón and L. N. Trefethen. Barycentric-Remez algorithms for best polynomial approximation in the Chebfun system. BIT Numerical Mathematics, 49(4):721, 2009.
[28] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191-1254, 2003.
[29] R. Thisted and B. Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74:445-455, 1987.
[30] L. N. Trefethen. Approximation Theory and Approximation Practice, volume 128. SIAM, 2013.
[31] G. Valiant and P. Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy, support size, and other distribution properties, with a proof of optimality via two new central limit theorems. In STOC '11: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, 2011.
[32] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 403-412, 2011.
[33] G. Valiant and P. Valiant. Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157-2165, 2013.
[34] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv preprint arXiv:1504.01227, 2015.
[35] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, 2016.