# Doubly-Competitive Distribution Estimation

Yi Hao, Alon Orlitsky

Department of Electrical and Computer Engineering, University of California, San Diego, USA. Correspondence to: Yi Hao, Alon Orlitsky. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

**Abstract** Distribution estimation is a statistical-learning cornerstone. Its classical min-max formulation minimizes the estimation error for the worst distribution, hence under-performs for practical distributions that, like power-law, are often rather simple. Modern research has therefore focused on two frameworks: structural estimation that improves learning accuracy by assuming a simple structure of the underlying distribution; and competitive, or instance-optimal, estimation that achieves the performance of a genie-aided estimator up to a small excess error that vanishes as the sample size grows, regardless of the distribution. This paper combines and strengthens the two frameworks. It designs a single estimator whose excess error vanishes both at a universal rate as the sample size grows, as well as when the (unknown) distribution gets simpler. We show that the resulting algorithm significantly improves the performance guarantees for numerous competitive- and structural-estimation results. The algorithm runs in near-linear time and is robust to model misspecification and domain-symbol permutations.

## 1. Introduction

Estimating large-alphabet distributions from their samples is a fundamental statistical-learning staple. Over the past few decades, distribution estimation has found numerous applications, ranging from language modeling (Chen & Goodman, 1999) to biological studies (Armañanzas et al., 2008), and has been extensively studied. In the following subsections, we formalize the discussion and present the major research frameworks used in the field.

### 1.1. Distribution Estimation

Let $\Delta_k$ denote the collection of distributions over the discrete alphabet $[k] := \{1,\ldots,k\}$, and let $[k]^*$ be the set of finite-length sequences over $[k]$. An estimator is a mapping $\hat p : [k]^* \to \Delta_k$ that associates with every sequence $x^n$ a distribution $\hat p(x^n)\in\Delta_k$. Let $X^n := X_1,\ldots,X_n$ be an i.i.d. sample sequence from an unknown $p\in\Delta_k$. Our objective is to find an estimator $\hat p$ such that $\hat p(X^n)$ approximates $p$ well.

Specifically, for two distributions $p,q\in\Delta_k$, let $\ell(p,q)$ be the loss when approximating distribution $p$ by estimate $q$. The loss of estimating $p$ by $\hat p(X^n)$ is therefore $\ell(p,\hat p(X^n))$. We also consider the expected loss, known as risk,
$$r^\ell_n(p,\hat p) := \mathbb{E}_{X^n\sim p}\,\ell(p,\hat p(X^n)).$$
The two most important losses for distribution estimation are the KL-divergence $D(p\|q) := \sum_{x\in[k]} p_x\log\frac{p_x}{q_x}$ and the $\ell_1$-distance $|p-q|_1 := \sum_{x\in[k]}|p_x-q_x|$. We study mainly the KL-loss, hence abbreviate $r^{\mathrm{KL}}$ as simply $r$.

Next, we formalize the uncertainty about the distribution and the three common measures for the approximation quality: min-max, structural, and competitive estimation.

### 1.2. Previous Works

While the underlying distribution $p$ is unknown, it often belongs to a known distribution collection $\mathcal{P}$. The worst-case risk of an estimator $\hat p$ over all distributions in $\mathcal{P}$ is
$$r^\ell_n(\mathcal{P},\hat p) := \max_{p\in\mathcal{P}} r^\ell_n(p,\hat p),$$
and the minimal possible worst-case risk for $\mathcal{P}$, incurred by any estimator, is the min-max risk,
$$r^\ell_n(\mathcal{P}) := \min_{\hat p} r^\ell_n(\mathcal{P},\hat p) = \min_{\hat p}\max_{p\in\mathcal{P}} r^\ell_n(p,\hat p).$$
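To make the definitions above concrete, here is a minimal numpy sketch of the two losses and a Monte-Carlo approximation of the risk $r_n(p,\hat p)$. The function names (`kl_divergence`, `l1_distance`, `kl_risk`) are ours, not the paper's, and the Monte-Carlo helper is only an illustration of the expectation in the risk definition.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_x p_x log(p_x / q_x); assumes q_x > 0 wherever p_x > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def l1_distance(p, q):
    """|p - q|_1 = sum_x |p_x - q_x|."""
    return float(np.sum(np.abs(p - q)))

def kl_risk(p, estimator, n, trials=1000, seed=0):
    """Monte-Carlo approximation of the KL-risk r_n(p, p_hat) = E[ D(p || p_hat(X^n)) ]."""
    rng = np.random.default_rng(seed)
    k = len(p)
    losses = [kl_divergence(p, estimator(rng.choice(k, size=n, p=p), k))
              for _ in range(trials)]
    return float(np.mean(losses))
```

Here `estimator(xn, k)` is any map from a sample sequence to a distribution over $[k]$, matching the definition of an estimator above.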
The most classical and widely-studied class of distributions is simply the set $\Delta_k$ of all discrete distributions over $[k]$. The problem of determining $r^\ell_n(\Delta_k)$ up to the first order was introduced by (Cover, 1972) and studied in a sequence of papers (Krichevsky & Trofimov, 1981; Braess et al., 2002; Paninski, 2005). Among the many results on the topic, (Braess & Sauer, 2004) showed that for KL-divergence, as $n/k\to\infty$, the min-max KL-risk satisfies
$$r_n(\Delta_k) = (1+o(1))\,\frac{k-1}{2n},$$
achieved by a variant of the add-3/4 estimator. On the other hand, (Paninski, 2005) proved that as $k/n\to\infty$, the optimal KL-risk becomes $r_n(\Delta_k) = (1+o(1))\log\frac{k}{n}$, which is achieved by add-constant estimators. Similar results for other loss measures like $\ell_1$-distance can be found in (Kamath et al., 2015).

**Beyond min-max** The success of add-constant estimators in achieving the classical min-max risks does not extend to practical applications. One possible explanation is that practical distributions, like power-law or Poisson, are often rather simple and can be estimated more efficiently and accurately than the worst distribution targeted by the min-max paradigm. The desire to construct estimators that perform better on practical distributions has led to the following two frameworks.

Instead of considering arbitrary underlying distributions, the structural approach focuses on learning distributions that possess a natural structure, such as monotonicity, log-concavity, and m-modality. In many cases, structural assumptions lead to more effective estimators that provably perform better on the corresponding distribution classes. For example, (Kamath et al., 2015) showed that for fixed $k$, as $n$ increases, the empirical estimator achieves the min-max $\ell_1$-risk over $\Delta_k$,
$$r^{\ell_1}_n(\Delta_k) = (1+o(1))\sqrt{\frac{2(k-1)}{\pi n}}.$$
In many practical applications the alphabet $[k]$ is large, hence several papers considered structured distributions (Acharya et al., 2017; Diakonikolas et al., 2016; Kamath, 2014; Chan et al., 2013; Daskalakis et al., 2012; Jankowski & Wellner, 2009; Feldman et al., 2008). For example, for the collection $\mathcal{M}^{t,m}_k$ of $t$-mixture $m$-modal distributions over $[k]$, more sophisticated estimators, e.g., (Acharya et al., 2017), attain
$$r^{\ell_1}_n(\mathcal{M}^{t,m}_k) = \Theta\!\left(\left(\frac{tm\log k}{n}\right)^{1/3}\right),$$
which for $k/\log k \gtrsim n^{1/3}(tm)^{2/3}$ is lower than $r^{\ell_1}_n(\Delta_k)$.

**Drawbacks** The structural approach leverages the structure assumptions to design more efficient estimators, and thus has the drawback of relying on the hypothesized models. For example, to learn $t$-mixture $m$-modal distributions efficiently as above, one needs to ensure the correctness of the structure assumption and know both $t$ and $m$ up to constant factors. While it may seem possible to use hypothesis testing to find the best parameters, existing work on distribution property testing shows that even testing whether a distribution is $m$-modal requires a non-trivial number of samples (Canonne et al., 2018). Hence, when $t$ and $m$ are relatively large, finding the best parameters may require many samples. In addition, many structures possessed by real-world distributions, for example, mixtures of log-concave and log-convex, have not been addressed before.

**Competitive** Instead of relying on often-uncertain structural assumptions, the competitive distribution estimation framework takes a different view and aims to design universally near-optimal estimators.
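As a concrete reference point for the add-constant estimators discussed above, the sketch below implements the add-$\beta$ family ($\beta = 3/4$ gives an add-3/4 variant); `add_constant_estimator` is our own name, and it plugs into the `kl_risk` helper sketched in Section 1.1.

```python
import numpy as np

def add_constant_estimator(xn, k, beta=0.75):
    """Add-beta estimator: p_hat_x = (N_x + beta) / (n + beta * k)."""
    counts = np.bincount(np.asarray(xn), minlength=k).astype(float)
    return (counts + beta) / (counts.sum() + beta * k)

# Illustrative usage with the earlier kl_risk helper, for uniform p over k = 100 and n = 1000:
#   p = np.full(100, 0.01)
#   print(kl_risk(p, add_constant_estimator, n=1000))
# For n >> k the result should be on the order of (k - 1) / (2n) ~ 0.05, matching the
# first-order min-max rate quoted above.
```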
Any reasonable estimator for i.i.d. distributions would assign the same probability to all symbols appearing the same number of times in the sample; we let $\mathcal{Q}_{\mathrm{nat}}$ denote this collection of natural estimators. Our objective is to design a distribution estimator $\hat p$ that estimates every distribution nearly as well as the best estimator that is designed with prior knowledge of the true distribution $p$ but is restricted to be natural. Specifically, for any distribution $p\in\Delta_k$, the lowest risk of a natural estimator knowing $p$ is
$$r^\ell_n(p,\mathcal{Q}_{\mathrm{nat}}) := \min_{\hat p'\in\mathcal{Q}_{\mathrm{nat}}} r^\ell_n(p,\hat p'),$$
and the excess risk of an arbitrary estimator $\hat p$ is
$$\tilde r^\ell_n(p,\hat p) := r^\ell_n(p,\hat p) - r^\ell_n(p,\mathcal{Q}_{\mathrm{nat}}).$$
Therefore, the worst-case excess risk, or competitive risk, of the estimator $\hat p$ over all distributions in $\Delta_k$ is
$$\tilde r^\ell_n(\hat p) := \max_{p\in\Delta_k}\tilde r^\ell_n(p,\hat p).$$
This formulation was introduced in (Orlitsky & Suresh, 2015), who showed that a simple variant of the Good-Turing estimator $\hat p^{\mathrm{GT}}$ achieves a vanishing competitive KL-risk of $\tilde r_n(\hat p^{\mathrm{GT}})\le(3+o(1))/n^{1/3}$ regardless of the alphabet size, and that a more involved estimator $\hat p^{\mathrm{MI}}$ achieves $\widetilde\Theta(\min\{k/n,\,1/\sqrt n\})$. For $\ell_1$-distance, (Valiant & Valiant, 2016) designed a linear-programming-based estimator $\hat p^{\mathrm{LP}}$ and proved $\tilde r^{\ell_1}_n(\hat p^{\mathrm{LP}}) = O(1/\mathrm{polylog}(n))$.

**Drawbacks** The upper bounds provided by the competitive approach apply to all distributions and, similar to the min-max approach, track the excess error of the worst distribution. As we now show, they are too lax for many practical distributions. Consider the following generalization of the ubiquitous power-law distributions. For $c>0$, $\alpha>1$, and large alphabet size $k$, define the enveloped distribution collection
$$\mathcal{P}^{\alpha,c}_k := \{p\in\Delta_k : p_x\le c\cdot x^{-\alpha}\}.$$
It can be shown that for $n\in[k^{0.1},k^2]$ there is a constant $C_{\alpha,c}$, depending on $\alpha$ and $c$, such that the min-max KL-risk of $\mathcal{P}^{\alpha,c}_k$ satisfies
$$r_n(\mathcal{P}^{\alpha,c}_k) = C_{\alpha,c}\cdot n^{-\frac{\alpha-1}{\alpha}}.$$
By simple algebra, for $\alpha>2$ and large $n$, this term is smaller than $\Theta(\min\{k/n,\,1/\sqrt n\})$, the lowest competitive risk of any estimator (Orlitsky & Suresh, 2015). Hence the guarantees the competitive framework provides do not suffice to address relatively simple common distributions.

## 2. New Results

The foregoing section reviewed the merits and drawbacks of classical and modern approaches to distribution estimation. It noted that the min-max approach is pessimistic and often performs sub-optimally in both theory and practice. Of the modern frameworks, the structural approach works well if the structural assumptions are both correct and accurate, but fails otherwise; hence this approach is "local" but not "global". The competitive approach constructs universally near-optimal estimators, but provides the same guarantees regardless of the distribution's structure, potentially resulting in sub-optimal estimators for practical distributions; hence this approach is "global" but not "local".

This raises the question of whether a single estimator can be both "global" and "local". Namely, without any assumptions on the distribution, can it provide universal excess-loss guarantees for general distributions, and stronger excess-loss guarantees for simple distributions? For example, an estimator $\hat p$ such that for any distribution $p$, $\tilde r_n(p,\hat p)\lesssim n^{-1/2}$, and yet if the distribution $p$ happens to be in the enveloped power-law class $\mathcal{P}^{3,c}_k$, then $\tilde r_n(p,\hat p)\lesssim n^{-3/4}$. We answer this question in the affirmative, and present the first competitive and structural distribution estimator.
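For reference, the following is a simplified sketch of a natural estimator in the spirit of the Good-Turing variant from Section 1.2: symbols with the same multiplicity receive the same probability, with the per-multiplicity mass set by the Good-Turing formula. This is our own illustration under those assumptions, not the exact modified estimator of (Orlitsky & Suresh, 2015).

```python
import numpy as np

def good_turing_natural(xn, k):
    """A natural estimator: symbols appearing mu times share the Good-Turing
    mass estimate (mu + 1) * Phi_{mu+1} / n, split equally among them."""
    n = len(xn)
    counts = np.bincount(np.asarray(xn), minlength=k)
    phi = np.bincount(counts, minlength=n + 2)        # phi[mu] = number of symbols seen mu times
    p_hat = np.zeros(k)
    for mu in np.unique(counts):
        mass = (mu + 1) * phi[mu + 1] / n             # Good-Turing mass of multiplicity mu
        if mass == 0:                                 # crude fallback when Phi_{mu+1} = 0
            mass = max(mu, 1) * phi[mu] / n
        p_hat[counts == mu] = mass / phi[mu]          # equal split: the "natural" property
    return p_hat / p_hat.sum()                        # renormalize to a distribution
```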
### 2.1. Definitions

**Instance competitive loss** For concreteness, instantiate the loss $\ell$ as the KL-divergence, i.e., for $p,q\in\Delta_k$, $\ell(p,q) := D(p\|q)$. Let $p\in\Delta_k$ be an unknown discrete distribution, and let $x^n$ be a realization of $X^n\sim p$. The best natural estimator, knowing both $p$ and $x^n$, incurs the minimal possible loss
$$\ell_{x^n}(p,\mathcal{Q}_{\mathrm{nat}}) := \min_{\hat p'\in\mathcal{Q}_{\mathrm{nat}}}\ell(p,\hat p'(x^n)),$$
and for this particular pair $(p,x^n)$, the excess loss of an arbitrary estimator $\hat p$ is
$$\tilde\ell_{x^n}(p,\hat p) := \ell(p,\hat p(x^n)) - \ell_{x^n}(p,\mathcal{Q}_{\mathrm{nat}}).$$
Hence for sequence $x^n$, the worst-case excess loss of $\hat p$ over $\Delta_k$, or simply the instance competitive loss of $\hat p$, is
$$\tilde\ell_{x^n}(\hat p) := \max_{p\in\Delta_k}\tilde\ell_{x^n}(p,\hat p).$$

**Permutation class** For any distribution $p\in\Delta_k$, we denote by $\langle p\rangle$ the collection of distributions in $\Delta_k$ that are equal to $p$ up to some permutation over $[k]$. Knowing $\langle p\rangle$ is equivalent to knowing the probability multiset of $p$ but not $p$ itself.

**General notation** For $X^n\sim p\in\Delta_k$, the multiplicity of a symbol $x\in[k]$ is $N_x := \sum_{i=1}^n\mathbb{1}_{X_i=x}$, the number of times $x$ appears in $X^n$. The prevalence of an integer $\mu$ is $\Phi_\mu := \sum_{x\in[k]}\mathbb{1}_{N_x=\mu}$, the number of symbols that appear $\mu$ times. Let $D := \sum_{\mu>0}\Phi_\mu$ be the number of distinct symbols in $X^n$, and let $D_\Phi := \sum_{\mu>0}\mathbb{1}_{\Phi_\mu>0}$ be the number of distinct positive multiplicities. Clearly, $D\ge D_\Phi$, and typically $D\gg D_\Phi$. For example, if all symbols in the sequence $X^n$ are distinct, then $D=n$, while $D_\Phi$ is just 1.

### 2.2. Main Results

We construct an explicit, near-linear-time computable distribution estimator $\hat p^*$ such that

**Theorem 1.** For any distribution $p$, let $X^n\sim p$; then with probability at least $1-n^{-8}$,
$$\tilde\ell_{X^n}(\hat p^*)\le\widetilde O\!\left(\frac{D_\Phi}{n}\right).$$

Note that the right-hand side is determined by just $X^n$; its computation requires no additional information about $p$. The exact form of $\hat p^*$ can be found in Section 5, and the proof of Theorem 1 appears in the supplemental material. Our main theorem implies the following new results and improvements on existing ones.

**Global competitiveness** In Section 3 we show that our estimator provides stronger estimation guarantees than many existing estimators: adaptive estimators (Corollary 3) such as the robust absolute discounting estimator (Ohannessian & Dahleh, 2012; Ben-Hamou et al., 2017); competitive estimators (Corollary 4) such as the modified Good-Turing estimator (Orlitsky & Suresh, 2015); and min-max estimators (Corollary 5). Example: Section 3 shows that $D_\Phi\le\min\{\sqrt{2n},k\}$. Corollary 4 then concludes that the excess loss $\tilde\ell_{X^n}(\hat p^*)$ is always at most $\widetilde O(\min\{\sqrt n,k\}/n)$, providing a guarantee not only stronger than the $n^{-1/3}$ rate of the modified Good-Turing estimator, but also as strong as the more involved estimator in (Acharya et al., 2013; Orlitsky & Suresh, 2015).

**Local competitiveness** In Section 4, we use the theorem to establish eight new results on learning important structured distributions. We show that our estimator has strong excess-loss bounds for three important structured distribution families: T-value (Corollaries 7 and 8), log-concave (Corollaries 9 and 10), and log-convex (Corollaries 11, 12, and 13). Many common distributions are covered by these three classes. In particular, our results for power-law distributions (Corollary 12) are uniformly stronger than those in (Falahatgar et al., 2017) for all parameter regimes. Example: Corollary 8 shows that for all uniform distributions, $\mathbb{E}[D_\Phi]$ is bounded above by $O(n^{1/3})$, hence the algorithm's excess risk is at most $O(n^{-2/3})$.
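A small numpy sketch of the quantities in the General notation paragraph (the multiplicities $N_x$, prevalences $\Phi_\mu$, and the counts $D$ and $D_\Phi$ that drive Theorem 1); `profile_statistics` is our own helper name.

```python
import numpy as np

def profile_statistics(xn, k):
    """Return (N_x, Phi_mu, D, D_Phi) for a sample sequence xn over alphabet [k]."""
    counts = np.bincount(np.asarray(xn), minlength=k)   # multiplicities N_x
    positive = counts[counts > 0]
    prevalences = np.bincount(counts)                   # Phi_mu for mu = 0, 1, ..., max N_x
    D = positive.size                                   # number of distinct observed symbols
    D_Phi = np.unique(positive).size                    # number of distinct positive multiplicities
    return counts, prevalences, D, D_Phi

# The text's extreme example: if all observed symbols are distinct, D = n while D_Phi = 1.
rng = np.random.default_rng(0)
xn = rng.choice(10_000, size=1_000, replace=False)      # 1000 distinct symbols from [10000]
_, _, D, D_Phi = profile_statistics(xn, k=10_000)
print(D, D_Phi)                                         # -> 1000 1
```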
**Robustness to model misspecification** The structural approach often uses different estimators for different distribution classes. By contrast, our single estimator provides robust and adaptive guarantees for a variety of structural classes without any modification. Example: over uniform distributions, $\hat p^*$ achieves an excess risk of $O(n^{-2/3})$ (Corollary 8), while for power-law distributions with power parameter 1.5, the same estimator achieves an excess risk of $O(n^{-3/5})$ (Corollary 11).

**Robustness to domain permutations** The structural approach often assumes that we know how to order the symbols so that the underlying distribution exhibits a certain structure (such as power-law). As discussed in Section 4, this assumption may be impractical. By contrast, since the distribution of $D_\Phi$ is the same for all $p'\in\langle p\rangle$, the excess loss/risk guarantees of our algorithm are invariant under any permutation of the domain symbols. Example: if under some unknown ordering of the domain symbols the underlying distribution is a power-law with power parameter 1.5, then Corollary 11 implies that our estimator achieves an excess risk of $O(n^{-3/5})$.

**Outline** Besides Sections 3 and 4 mentioned above, we present the exact form of our estimator in Section 5.

## 3. Global Competitiveness

In this section, we present several implications of Theorem 1 for the universal estimation guarantees of $\hat p^*$. In particular, we show that $\hat p^*$ is near-optimal under various classical and modern distribution-learning frameworks, including the min-max and competitive frameworks mentioned above.

**Corollary 1.** For any distribution $p$,
$$\tilde r_n(p,\hat p^*)\le\widetilde O\!\left(\frac{\mathbb{E}[D_\Phi]}{n}\right).$$

As shown in the proof of Theorem 1, $\tilde\ell_{X^n}(p,\hat p^*)\le O(\log n)$ always. The corollary then follows from Theorem 1 itself.

Analogous to the previous definition of competitive distribution estimation, we can consider competing with an estimator that knows the probability multiset. Specifically, for any distribution $p\in\Delta_k$, the lowest worst-case risk of a natural estimator knowing the multiset of $p$ is
$$r^\ell_n(\langle p\rangle) := \min_{\hat p'}\max_{p'\in\langle p\rangle} r^\ell_n(p',\hat p'),$$
and an arbitrary estimator $\hat p$ has the multiset excess risk
$$\bar r^\ell_n(p,\hat p) := r^\ell_n(p,\hat p) - r^\ell_n(\langle p\rangle).$$
For KL-divergence, the following lemma relates $\bar r^\ell_n$ to $\tilde r^\ell_n$.

**Lemma 1.** (Orlitsky & Suresh, 2015) For any distribution $p\in\Delta_k$ and estimator $\hat p$,
$$\max_{p'\in\langle p\rangle}\bar r_n(p',\hat p)\le\tilde r_n(p,\hat p).$$

Together with Corollary 1, the lemma yields

**Corollary 2.** For any distribution $p$,
$$\max_{p'\in\langle p\rangle}\bar r_n(p',\hat p^*)\le\widetilde O\!\left(\frac{\mathbb{E}[D_\Phi]}{n}\right).$$

**Adaptive optimality** The min-max results (Krichevsky & Trofimov, 1981) imply that for any estimator, learning an arbitrary $k$-symbol distribution up to a certain KL-risk requires $\Omega(k)$ samples in the worst case. Since modern data science often considers applications over large alphabets, this is normally viewed as a negative result. However, as experience suggests, many practical distributions have small "effective alphabet sizes". For example, if we draw 10 samples from a geometric distribution with success probability 0.9, although the support size is infinite, with high probability we shall observe at most 3 distinct symbols. To formalize this intuition, for a given $n$, let the effective alphabet size of a distribution $p$ be the expected number $\mathbb{E}[D]$ of distinct symbols that appear in $X^n\sim p$. As in (Falahatgar et al., 2017), given $n$, $k$, and $d$, let $\mathcal{P}_d$ be the collection of distributions in $\Delta_k$ satisfying $\mathbb{E}[D]\le d$. By Corollary 2, the performance of $\hat p^*$ over $\mathcal{P}_d$ is adaptive to $d$:

**Corollary 3.** For all $d\ge2$ and every distribution $p\in\mathcal{P}_d$,
$$r_n(p,\hat p^*)\le\frac{d}{n}\log k + \widetilde O\!\left(\frac{d}{n}\right).$$
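The geometric example above is easy to check by simulation. The sketch below estimates the effective alphabet size $\mathbb{E}[D]$ and the probability of seeing at most 3 distinct symbols for 10 draws from a geometric distribution with success probability 0.9; this is our own illustration, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000
samples = rng.geometric(p=0.9, size=(trials, n))     # geometric(0.9): p_x = 0.9 * 0.1**(x-1)
distinct = np.array([np.unique(row).size for row in samples])
print("estimated E[D]:", distinct.mean())            # roughly 1.7 distinct symbols
print("P(D <= 3):", (distinct <= 3).mean())          # close to 1, as claimed in the text
```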
The following lemma shows the optimality of Corollary 3.

**Lemma 2.** (Falahatgar et al., 2017) Let $\alpha$ be any constant greater than 1. There exist constants $c_0>0$ and $n_0$ such that for $d = n^{\frac1\alpha}$, any estimator $\hat p$, all $n>n_0$, and all $k>\max\{3n,\ 1.2^{\frac{1}{\alpha-1}}\,n^{\frac1\alpha}\}$,
$$\max_{p\in\mathcal{P}_d} r_n(p,\hat p)\ge c_0\,\frac{d}{n}\log k - \widetilde O\!\left(\frac{d}{n}\right).$$

Here we present two immediate implications. First, to learn a $k$-symbol distribution up to a certain KL-risk, the number of samples we need is at most $O(\mathbb{E}[D]\log k)$, which is often much smaller than $\Omega(k)$. Second, in the extreme case when $k/n\to\infty$, the upper bound on $r_n(p,\hat p^*)$ is at most $(1+o(1))\log k$. Hence, our estimator achieves the min-max KL-risk over $\Delta_k$ up to the right constant.

**Competitive optimality** Now we show that $\hat p^*$ is near-optimal under the competitive formulation described in Section 1.2. We begin by finding a simple upper bound for $D_\Phi$, the number of distinct positive multiplicities. Since different multiplicities correspond to distinct symbols, $D_\Phi$ is at most the alphabet size $k$. On the other hand, since only distinct positive multiplicities count, $\sum_{\mu=1}^{D_\Phi}\mu\le n$. Hence, $D_\Phi\le\min\{k,\sqrt{2n}\}$, which together with Corollary 2 yields

**Corollary 4.** For any distribution $p$,
$$\tilde r_n(p,\hat p^*)\le\widetilde O\!\left(\frac{\min\{k,\sqrt n\}}{n}\right).$$

The following lemma shows the optimality of Corollary 4.

**Lemma 3.** (Orlitsky & Suresh, 2015) For any estimator $\hat p$,
$$\max_{p\in\Delta_k}\tilde r_n(p,\hat p)\ge\widetilde\Omega\!\left(\frac{\min\{k,\sqrt n\}}{n}\right).$$

**Min-max optimality** The previous results show that $\hat p^*$ often achieves the min-max KL-risk $r_n(\Delta_k)$ up to the right constant. Specifically,

**Corollary 5.** Let $\alpha_0$ be any constant greater than $1/2$. For any $\alpha>\alpha_0$ and $k>n^\alpha$,
$$r_n(\Delta_k,\hat p^*) = (1+o_n(1))\,r_n(\Delta_k).$$

## 4. Local Competitiveness

We use Corollaries 2 and 3 to establish eight new results on learning important structured distributions. We show that our estimator has strong excess-loss bounds for three important structured distribution families: T-value (Corollaries 7 and 8), log-concave (Corollaries 9 and 10), and log-convex (Corollaries 11, 12, and 13). Many common distributions are covered by these three classes.

### 4.1. A Simple Bound on $\mathbb{E}[D_\Phi]$

By Corollary 2, the excess KL-risk $\tilde r_n(p,\hat p^*)$ of $\hat p^*$ in estimating $p$ is upper bounded by $\widetilde O(\mathbb{E}[D_\Phi]/n)$. Perhaps the most natural question to ask is: given $n$ and $p$, how large is $\mathbb{E}[D_\Phi]$? To get a relatively simple closed-form expression for $\mathbb{E}[D_\Phi]$, we adopt the conventional Poisson-sampling technique, where the sample size is an independent Poisson variable with mean $n$. By doing so, the multiplicities satisfy $N_x\sim\mathrm{Poi}(np_x)$, independently of each other. Under Poisson sampling, the linearity of expectation implies
$$\mathbb{E}[D_\Phi] = \sum_{\mu\ge1}\Pr(\Phi_\mu>0) = \sum_{\mu\ge1}\left(1-\prod_{x\in[k]}\left(1-e^{-np_x}\,\frac{(np_x)^\mu}{\mu!}\right)\right).$$
Expanding the right-hand side would give an expression consisting of roughly $n\cdot(2^k-1)$ terms, which is hard to analyze. Hence, instead of evaluating $\mathbb{E}[D_\Phi]$ directly, we work with simple upper bounds on it.

Given sampling parameter $n$, we partition the unit-length interval $(0,1]$ into a sequence of sub-intervals
$$I_j := \left(\frac{(j-1)^2\log n}{n},\ \frac{j^2\log n}{n}\right],\qquad 1\le j\le\sqrt{\frac{n}{\log n}}.$$
For any distribution $p$, denote by $p_{I_j}$ the number of probabilities $p_x$ in $I_j$. Then,

**Lemma 4.** For any distribution $p$,
$$\mathbb{E}[D_\Phi]\le O\!\left(\sum_{j\ge1}\min\{p_{I_j},\,j\}\right)\cdot\log n.$$

In addition, since every probability in $I_j$ exceeds $\frac{(j-1)^2\log n}{n}$ and $p$ is a distribution, $p_{I_j}\cdot\frac{(j-1)^2\log n}{n}\le1$, which in turn implies
$$\min\{p_{I_j},\,j\}\le\min\left\{\frac{n}{(j-1)^2\log n},\,j\right\}<n^{\frac13}.$$
More generally, let $P_{I_j}$ denote the sum of the probabilities $p_x$ in $I_j$. Then,
$$\min\{p_{I_j},\,j\}\le\min\left\{\frac{n\,P_{I_j}}{(j-1)^2\log n},\,j\right\}<(n\,P_{I_j})^{\frac13}.$$
Combined, Corollary 2 and Lemma 4 yield

**Corollary 6.** For any distribution $p$,
$$\tilde r_n(p,\hat p^*)\le\widetilde O\!\left(\frac1n\sum_{j\ge1}\min\{p_{I_j},\,j\}\right).$$

To illustrate the corollary's significance, we present its implications for various distribution-learning problems.
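The sketch below evaluates the quantity $\sum_j\min\{p_{I_j},j\}$ from Lemma 4 for a given distribution and compares it with a Monte-Carlo estimate of $\mathbb{E}[D_\Phi]$ under Poisson sampling. The constants hidden in Lemma 4's $O(\cdot)\log n$ factor are not spelled out in the text, so the two printed numbers should only be compared up to logarithmic factors; the helper names are ours.

```python
import numpy as np

def interval_sum(p, n):
    """sum_j min{p_Ij, j} for the partition I_j = ((j-1)^2 log n / n, j^2 log n / n]."""
    log_n = np.log(n)
    j_max = int(np.ceil(np.sqrt(n / log_n)))
    edges = np.arange(j_max + 1) ** 2 * log_n / n        # interval endpoints; last one >= 1
    p_Ij = np.histogram(p[p > 0], bins=edges)[0]         # number of probabilities in each I_j
    return int(np.sum(np.minimum(p_Ij, np.arange(1, j_max + 1))))

def expected_D_Phi(p, n, trials=2000, seed=0):
    """Monte-Carlo estimate of E[D_Phi] under Poisson sampling, N_x ~ Poi(n p_x)."""
    rng = np.random.default_rng(seed)
    vals = [np.unique(c[c > 0]).size for c in rng.poisson(n * p, size=(trials, len(p)))]
    return float(np.mean(vals))

p = np.full(100, 0.01)                                   # uniform over 100 symbols
n = 10_000
print("E[D_Phi] (simulated):", expected_D_Phi(p, n))     # a few dozen
print("sum_j min{p_Ij, j}:  ", interval_sum(p, n))       # Lemma 4 controls E[D_Phi] up to O(log n)
```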
### 4.2. T-Value Distributions

A uniform distribution can be described as a distribution whose positive probabilities take only a single value. As a generalization of this formulation, we call a distribution $p$ a T-value distribution if its positive probabilities $p_x$ can take $T$ different values. Note that T-value distributions over $[k]$ can be viewed as mixtures of $T$ uniform distributions over different subsets of $[k]$, and that these distributions generalize $T$-piecewise histogram distributions. Intuitively, for smaller values of $T$, we would expect the task of learning an unknown T-value distribution to be easier. The following corollary confirms this intuition.

**Corollary 7.** For any T-value distribution $p$ and $p'\in\langle p\rangle$,
$$\tilde r_n(p',\hat p^*)\le O\!\left(\frac{\min\{T^{\frac23},\,n^{\frac16}\}}{n^{\frac23}}\right).$$

Note that $p\in\langle p\rangle$. To prove the corollary, observe that by our previous result, for all $j$, $\min\{p_{I_j},j\}<(nP_{I_j})^{\frac13}$. Note also that for a T-value distribution, $p_{I_j}\ne0$ for at most $T$ different values of $j$, say $j_1,\ldots,j_T$. By the above inequality and Corollary 6,
$$\tilde r_n(p,\hat p^*)\le\widetilde O\!\left(\frac1n\sum_{i=1}^T (nP_{I_{j_i}})^{\frac13}\right)\le\widetilde O\!\left(\frac{T\,(n/T)^{\frac13}}{n}\right);$$
combined with Corollary 4, this completes the proof.

**Uniform distributions** Now we consider the collection $\mathcal{U}_k$ of 1-value distributions, i.e., uniform distributions over non-empty subsets of $[k]$. Our objective is to derive a result stronger than Corollary 7. Let $S_p$ denote the support size of a distribution $p\in\mathcal{U}_k$. For all $x\in[k]$, $p_x$ is either $0$ or $S_p^{-1}$. Since $\{I_j, j\ge1\}$ forms a partition of $(0,1]$, there exists a unique $j^*$ such that $S_p^{-1}\in I_{j^*}$, i.e.,
$$S_p^{-1}\in\left(\frac{(j^*-1)^2\log n}{n},\ \frac{(j^*)^2\log n}{n}\right],$$
which further implies $j^*\le1+\sqrt{n/(S_p\log n)}$. Together with $D_\Phi\le D\le S_p$ and Corollary 6, this shows

**Corollary 8.** Let $p$ be an arbitrary distribution in $\mathcal{U}_k$; then
$$\tilde r_n(p,\hat p^*)\le O\!\left(\frac1n\min\left\{S_p,\ \sqrt{\frac{n}{S_p\log n}}\right\}\right).$$

Note that the right-hand side is no more than $O(n^{-2/3})$. Furthermore, in both the small-alphabet regime where $S_p=O(1)$ and the large-alphabet regime where $S_p=\Omega(n)$, we have $\tilde r_n(p,\hat p^*)\le O(n^{-1})$, which is fairly tight.

### 4.3. Log-Concave Distributions

The class of discrete log-concave distributions covers a variety of well-known distribution classes including binomial, Poisson, negative binomial, geometric, hypergeometric, hyper-Poisson, Skellam, and Pólya-Eggenberger (Qu et al., 1990). We say a discrete distribution $p\in\Delta_k$ is log-concave if for all $x\in[k]$, $p_x^2\ge p_{x-1}\,p_{x+1}$, and denote the collection of all such distributions by $\mathcal{L}_k$. Further, for all $\sigma>0$, let $\mathcal{L}^{n,\sigma}_k$ denote the collection of $p\in\mathcal{L}_k$ whose standard deviation lies in $(\sigma/\log n,\ \sigma]$. Intuitively, one would expect the learning task over $\mathcal{L}^{n,\sigma}_k$ to be easier for smaller values of $\sigma$. The following corollary demonstrates the correctness of this intuition and shows the competitive performance of our estimator. Due to space considerations, we postpone its proof to the supplemental material.

**Corollary 9.** For any distribution $p\in\mathcal{L}^{n,\sigma}_k$ and $p'\in\langle p\rangle$,
$$\tilde r_n(p',\hat p^*)\le O\!\left((\sigma n)^{-\frac12}\right).$$

For any $\sigma\ge1$, the right-hand side is uniformly smaller than the bound $O(\min\{k,\sqrt n\}\cdot n^{-1})$ in Corollary 4. For mixtures of distributions in $\mathcal{L}^{n,\sigma}_k$, an analogous argument gives the following result.

**Corollary 10.** Let $p$ be a $t$-mixture of distributions in $\mathcal{L}^{n,\sigma}_k$ and $p'$ be any distribution in $\langle p\rangle$; then
$$\tilde r_n(p',\hat p^*)\le O\!\left(\sqrt{t}\,(\sigma n)^{-\frac12}\right).$$

### 4.4. Log-Convex Distributions

While the T-value and log-concave families cover many common distributions, there are certainly more distribution classes to be explored. For example, a truncated power-law distribution is always log-convex. In this section, we consider two generic classes of log-convex distributions: power-law and Hurwitz-Lerch Zeta distribution families.
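A quick numeric check of the discrete log-concavity condition $p_x^2\ge p_{x-1}p_{x+1}$ from Section 4.3, applied to a binomial, a Poisson, and a truncated power-law (the last one is log-convex, leading into Section 4.4). This is our own illustration using scipy's pmf helpers; `is_log_concave` is not a function from the paper.

```python
import numpy as np
from scipy.stats import binom, poisson

def is_log_concave(p, tol=1e-12):
    """Check the discrete log-concavity condition p_x^2 >= p_{x-1} * p_{x+1} on [k]."""
    return bool(np.all(p[1:-1] ** 2 >= p[:-2] * p[2:] - tol))

k = 50
print(is_log_concave(binom.pmf(np.arange(k), k - 1, 0.3)))   # True: binomial is log-concave
print(is_log_concave(poisson.pmf(np.arange(k), 5.0)))        # True: Poisson is log-concave
zipf = 1.0 / np.arange(1, k + 1) ** 2.0
print(is_log_concave(zipf / zipf.sum()))                     # False: a power law is log-convex
```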
**Enveloped power-law distributions** Consider the collection $\mathcal{P}^{\alpha,c}_k := \{p\in\Delta_k : p_x\le c\cdot x^{-\alpha}\}$ of enveloped (truncated) power-law distributions. Note that this definition generalizes power-law families, and that distributions in $\mathcal{P}^{\alpha,c}_k$ are not necessarily log-convex. We have the following result, whose proof appears in the supplemental material.

**Corollary 11.** For any distribution $p\in\mathcal{P}^{\alpha,c}_k$ and $p'\in\langle p\rangle$,
$$\tilde r_n(p',\hat p^*)\le O_{c,\alpha}\!\left(n^{-\max\{\frac{\alpha}{\alpha+1},\,\frac12\}}\right).$$

The distribution collection $\mathcal{P}^{\alpha,c}_k$ has the interesting property that it is closed under mixtures. Hence, Corollary 11 also covers mixtures of enveloped power-law distributions.

**Implications of Corollary 11** Let $p^\alpha\in\Delta_k$ be the truncated power-law distribution with power $\alpha$ that is truncated at $k$, i.e., $p^\alpha_x\propto x^{-\alpha}$, $x\in[k]$. Clearly, $p^\alpha\in\mathcal{P}^{\alpha,c}_k$ for all $c\ge1$. The recent work of (Falahatgar et al., 2017) shows that for $k>\max\{n,\,n^{\frac{1}{\alpha-1}}\}$ and any distribution $p'\in\langle p^\alpha\rangle$, the estimator $\hat q$ proposed in (Ohannessian & Dahleh, 2012) satisfies
$$\bar r_n(p',\hat q)\le O_{c,\alpha}\!\left(n^{-\frac{2\alpha-1}{2\alpha+1}}\right).$$
A simple combination of Lemma 1 and Corollary 11 yields

**Corollary 12.** For any distribution $p'\in\langle p^\alpha\rangle$,
$$\bar r_n(p',\hat p^*)\le\tilde r_n(p^\alpha,\hat p^*)\le O_{c,\alpha}\!\left(n^{-\max\{\frac{\alpha}{\alpha+1},\,\frac12\}}\right).$$

Our approach has the following three advantages over the previous result in (Falahatgar et al., 2017). First, for all $\alpha>0$, $n^{-\frac{\alpha}{\alpha+1}}<n^{-\frac{2\alpha-1}{2\alpha+1}}$, hence our guarantee is uniformly better than the previous one. Second, the previous result requires $k>\max\{n,\,n^{\frac{1}{\alpha-1}}\}$ to hold, which can be unrealistic for $\alpha$ close to 1. In comparison, our result does not require such conditions at all. Third, for small $\alpha<1/2$, the previous result only implies a multiset excess risk of $O(n^{\Theta(1)})$, while Corollary 12 always yields $O(n^{-1/2})$ regardless of $\alpha$.

**Enveloped Hurwitz-Lerch Zeta distributions** For any distribution $p\in\Delta_k$, $p$ is a (truncated) Hurwitz-Lerch Zeta (HLZ) distribution (Gupta et al., 2008) if
$$p_x = \frac{1}{T(\theta,s,a,k)}\cdot\frac{\theta^x}{(a+x)^{s+1}}$$
for some parameters $s\ge0$, $a\in[0,1]$, and $\theta\in(0,1]$, where the normalization factor $T(\theta,s,a,k) := \sum_{x\in[k]}\theta^x/(a+x)^{s+1}$. Analogously, consider the collection
$$\mathcal{H}^{\theta,s,a,c}_k := \{p\in\Delta_k : p_x\le c\,\theta^x/(a+x)^{s+1}\}$$
of enveloped HLZ distributions. HLZ distributions include the well-known Riemann Zeta, Zipf-Mandelbrot, Lotka, Good, logarithmic-series, and Estoup distributions. These distributions have various applications in many fields. For example, the Good distribution (Zornig & Altmann, 1995) can be used to model species frequencies and to estimate population parameters.

Note that $\mathcal{H}^{\theta,s,a,c}_k\subseteq\mathcal{P}^{s+1,c}_k$, since $\theta\le1$ and $a\ge0$. Hence, by Corollary 11, for any distribution $p\in\mathcal{H}^{\theta,s,a,c}_k$ and $p'\in\langle p\rangle$,
$$\tilde r_n(p',\hat p^*)\le O_{c,s}\!\left(n^{-\frac{s+1}{s+2}}\right).$$
Let $x_1$ be the threshold parameter such that $c\,\theta^{x_1}=n^{-1}$. Direct computation gives $x_1=\log(cn)/\log\frac1\theta$. The symbols $x\in[k]$ that are no larger than $x_1$ contribute at most $x_1$ to $\mathbb{E}[D_\Phi]$. Furthermore, the proof of Lemma 4 essentially shows that symbols with probability no larger than $n^{-1}$ contribute at most $O(\log n)$ to $\mathbb{E}[D_\Phi]$. Therefore, we conclude that
$$\mathbb{E}[D_\Phi]\le O\!\left(\frac{\log(cn)}{\log\frac1\theta}+\log n\right).$$
Corollary 2 combines the above results and yields

**Corollary 13.** For any $p\in\mathcal{H}^{\theta,s,a,c}_k$ and $p'\in\langle p\rangle$,
$$\tilde r_n(p',\hat p^*)\le O_{c,s}\!\left(\min\left\{n^{-\frac{s+1}{s+2}},\ \left(1+\frac{1}{\log\frac1\theta}\right)\frac1n\right\}\right).$$

Note that the right-hand side is the minimum of two quantities. For $\theta\in(1-n^{-\frac{1}{s+2}},\,1]$, the upper bound reduces to $O_{c,s}(n^{-\frac{s+1}{s+2}})$. On the other hand, for $\theta\in(0,\,1-n^{-\frac{1}{s+2}}]$, the upper bound becomes $O_{c,s}\!\left(\left(1+\frac{1}{\log\frac1\theta}\right)n^{-1}\right)$.
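The inclusion $\mathcal{H}^{\theta,s,a,c}_k\subseteq\mathcal{P}^{s+1,c}_k$ used above is easy to verify numerically. The sketch below builds a truncated HLZ distribution and checks that it sits inside the power-law envelope with exponent $s+1$ and $c=1/T(\theta,s,a,k)$; the function names are ours.

```python
import numpy as np

def hlz_distribution(k, theta, s, a):
    """Truncated Hurwitz-Lerch Zeta: p_x proportional to theta^x / (a + x)^(s+1), x in [k]."""
    x = np.arange(1, k + 1, dtype=float)
    w = theta ** x / (a + x) ** (s + 1)
    return w / w.sum(), 1.0 / w.sum()       # distribution and c = 1 / T(theta, s, a, k)

def in_power_law_envelope(p, c, alpha):
    """Membership test for P^{alpha,c}_k: p_x <= c * x^(-alpha) for every x."""
    x = np.arange(1, len(p) + 1, dtype=float)
    return bool(np.all(p <= c * x ** (-alpha)))

k, theta, s, a = 1000, 0.99, 1.0, 0.5
p, c = hlz_distribution(k, theta, s, a)
# Since theta <= 1 and a >= 0, p_x = theta^x / ((a+x)^(s+1) T) <= (1/T) * x^(-(s+1)).
print(in_power_law_envelope(p, c, alpha=s + 1))    # True
```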
### 4.5. Robustness to Domain Permutations

Our results on learning structured distribution families differ significantly from nearly all existing ones. Prior work has mainly considered unknown distributions with a certain structure over a known and ordered domain. In our formulation, we assume that the underlying distribution has a certain structure under some particular ordering of the domain elements, and this ordering is unknown to the estimator. Below we illustrate this with a concrete example.

Let $F$ be a finite discrete domain of size $k$. Consider learning an unknown log-concave distribution $P$ over $F$ from its sample sequence $Y^n$. Traditional formulations like (Chan et al., 2013) assume that we know an exact bijective mapping $\sigma$ from $F$ to $[k]$ such that reordering the probabilities of $P$ according to $\sigma$ yields a log-concave distribution $p\in\Delta_k$. Further applying $\sigma$ to $Y^n$ and denoting the resulting sequence by $X^n$ transforms the problem into learning $p$ from a sample sequence $X^n\sim p$. Here, the assumption that $p$ is log-concave is equivalent to requiring $p_x^2\ge p_{x-1}\,p_{x+1}$ for all $x\in[k]\setminus\{1,k\}$.

Such a formulation may be impractical. For example, in natural language processing, the observed samples are words and punctuation marks. Even if we know these samples come from a log-concave distribution, we don't know how to order the alphabet, i.e., find the right mapping $\sigma$, so that the corresponding distribution $p\in\Delta_k$ would be log-concave.

## 5. The Estimator

Let $p$ be an arbitrary distribution in $\Delta_k$, and let $X^n$ be a length-$n$ sample sequence from $p$. For simplicity, abbreviate $\mathbb{1}^\mu_x := \mathbb{1}_{N_x=\mu}$. For any natural number $\mu$, denote the total probability mass of the symbols that appear $\mu$ times by
$$M_\mu := \sum_{x\in[k]} p_x\,\mathbb{1}^\mu_x.$$
After observing $X^n$, an estimator $\hat p$ approximates $M_\mu$ by
$$\hat M_\mu := \sum_{x:\,N_x=\mu}\hat p_x(X^n).$$
Assume that $\hat p$ is a natural estimator. By (Orlitsky & Suresh, 2015), the excess loss of $\hat p$ over the best natural estimator that knows the underlying distribution $p$ is
$$\tilde\ell_{X^n}(p,\hat p) = D(M\|\hat M) := \sum_{\mu\ge0} M_\mu\log\frac{M_\mu}{\hat M_\mu}.$$
The above characterization of $\tilde\ell_{X^n}(p,\hat p)$ converts the problem of finding good natural estimators for the underlying distribution to that of finding good estimators for $M := (M_0,\ldots,M_n)$.

**Intuition** We first motivate the estimator, whose form is similar to that in (Acharya et al., 2013), but with some modifications. Since the estimator is natural, it needs to approximate only $M := (M_0,\ldots,M_n)$. The construction is guided by analyzing the estimator's bias and concentration properties for various multiplicities $\mu$. To estimate $M_0$, we use the provably near-optimal (Rajaraman et al., 2017) Good-Turing estimator. For the remaining multiplicities, analysis shows that for moderate yet frequent multiplicities, namely $\mu=O(\log n)$ and $\Phi_\mu=\Omega(\log^2 n)$, the Good-Turing estimator performs nearly optimally. For infrequent multiplicities, the empirical estimator performs better. For the remaining multiplicities, both estimates are sub-optimal. Applying polynomial approximation techniques, we construct a more involved estimator that approximates the behavior of a genie that knows the expected $M_\mu$ values. The estimator is slightly simpler than that in (Acharya et al., 2013), yet achieves better performance.
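A minimal numpy sketch of the reduction above: for a natural estimator, the excess instance loss equals $D(M\|\hat M)$, where $M_\mu$ is the true mass of the symbols appearing $\mu$ times and $\hat M_\mu$ the mass the estimator assigns to them. The helper names are ours, and in practice $M$ is unknown; it is only computable in simulations where $p$ is known.

```python
import numpy as np

def multiplicity_masses(weights, xn, n):
    """Total weight (true p_x or estimated p_hat_x) of the symbols appearing mu times, mu = 0..n."""
    counts = np.bincount(np.asarray(xn), minlength=len(weights))   # N_x for each symbol
    M = np.zeros(n + 1)
    np.add.at(M, counts, weights)                                  # bucket weights by multiplicity
    return M

def excess_loss(M, M_hat):
    """D(M || M_hat) = sum_mu M_mu log(M_mu / M_hat_mu), the excess loss of a natural estimator."""
    mask = M > 0
    return float(np.sum(M[mask] * np.log(M[mask] / M_hat[mask])))

# Usage sketch (p known, p_hat produced by any natural estimator on the same sample xn):
#   M     = multiplicity_masses(p,     xn, n)
#   M_hat = multiplicity_masses(p_hat, xn, n)
#   print(excess_loss(M, M_hat))
```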
**Details** Since our estimator $\hat p^*$ is natural, we simply specify $\hat M^*_\mu := \sum_{x:N_x=\mu}\hat p^*_x(X^n)$. To simplify the analysis, we adopt the standard Poisson-sampling technique and make the sample size a Poisson variable $N$ with mean value $n$. For $N<n\log n$, let $c_1$, $c_2$, and $c_3$ be properly chosen absolute constants. For any two natural numbers $\mu\le\mu'$, denote $a^{\mu'}_\mu := \mu'!/\mu!$ and $E^{\mu'}_{x,\mu} := \mathbb{1}^{\mu'}_x\,a^{\mu'}_\mu\,(N_x)_{\mu'-\mu}$, where $(A)_B$ is the falling factorial of $A$ of order $B$. Let $E_{x,\mu} := \sum_{\mu'=\mu}^{\mu+c_1\log n} E^{\mu'}_{x,\mu}$. We can show that $E_\mu := \sum_{x\in[k]} E_{x,\mu}$ is an unbiased estimator of $\mathbb{E}[\Phi_\mu]$.

The empirical-frequency estimator estimates $M_\mu$ by $\hat\phi_\mu := \frac{\Phi_\mu\,\mu}{n}$, while Good-Turing estimates it by $\hat G_\mu := \frac{\Phi_{\mu+1}\,(\mu+1)}{n}$. To avoid zero probability estimates, slightly modify the Good-Turing estimator to $\hat G'_\mu := \max\{1/n,\hat G_\mu\}$, and let $\hat O_\mu := \frac{\Phi_\mu\,(\mu+1)}{n}$ and similarly set $\hat O'_\mu := \min\{\max\{1/n,\hat O_\mu\},\log^2 n\}$. For $\mu<n\log n$, our estimator is
$$\hat M^*_\mu := \begin{cases}
\hat G'_\mu & \text{if } \mu = 0,\\
\hat\phi_\mu & \text{if } \mu\ge1 \text{ and } \Phi_\mu\le c_2\log^2 n,\\
\hat O'_\mu & \text{if } \mu>c_3\log n \text{ and } \Phi_\mu>c_2\log^2 n,\\
\hat G'_\mu & \text{if } 1\le\mu\le c_3\log n \text{ and } \Phi_\mu>c_2\log^2 n.
\end{cases}$$
As Poisson variables are concentrated around their mean, for $N\ge n\log n$, which rarely happens, and $\mu\in[0,N]$, we simply set $\hat M^*_\mu=1/(N+1)$. If these probability estimates do not sum to 1, we normalize them by their sum. Finally, for each $x\in[k]$, our distribution estimator is
$$\hat p^*_x(X^n) = \frac{\hat M^*_{N_x}}{\Phi_{N_x}}.$$

## 6. Numerical Experiments

The estimator is easy to implement. In Section 1 of the supplemental material, we present experimental results on a variety of distributions and show that the proposed estimator indeed outperforms the improved Good-Turing estimator in (Orlitsky & Suresh, 2015).

## 7. Future Directions

The results obtained in this paper strengthen and extend the competitive approach to distribution estimation taken in (Orlitsky & Suresh, 2015). It would be of interest to obtain similar results for distribution estimation under $\ell_1$ distance. (Kamath et al., 2015) showed that the simple empirical estimator achieves the min-max $\ell_1$-risk $r^{\ell_1}_n(\Delta_k)=(1+o(1))\sqrt{2(k-1)/(\pi n)}$. Yet the excess risk of the estimator in the nice work of (Valiant & Valiant, 2016) is $O(1/\mathrm{polylog}(n))$. Hence, for $k\le O(n)$, this guarantee does not improve on that of the empirical estimator, raising the possibility of strengthening the competitive results.

A similar approach can be applied to the related property-estimation task. A property, e.g., Shannon entropy, is simply a mapping $f:\Delta_k\to\mathbb{R}$. Most existing property-estimation results are worst-case (min-max) in nature. Yet practical and natural distributions are rarely the worst possible, and often possess a simple structure. To address this discrepancy, recent works (Hao et al., 2018; Hao & Orlitsky, 2019) took a competitive approach, constructing estimators whose performance is adaptive to the simplicity of the underlying distribution. Specifically, the widely-used empirical estimator estimates property values by evaluating the property at the empirical distribution. For every property in a broad class and every distribution in $\Delta_k$, the expected error of the estimator in (Hao & Orlitsky, 2019) with sample size $n/\log n$ is at most that of the empirical estimator with sample size $n$, plus a distribution-free vanishing function of $n$. These results cover several well-known properties such as entropy and support size, for which the $\log n$ factor is optimal up to constants, and also apply to any property of the form $\sum_x f_x(p_x)$, such as the $\ell_1$ distance to a given distribution, where $f_x$ is 1-Lipschitz for all $x\in[k]$. It would be of interest to construct a doubly-competitive estimator for property estimation as well.

## Acknowledgements

We thank Ananda Theertha Suresh for helpful comments and discussions, and are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.

## References

Acharya, J., Jafarpour, A., Orlitsky, A., and Suresh, A. T. Optimal probability estimation with applications to prediction and classification. In Conference on Learning Theory, pp. 764–796, 2013.
Acharya, J., Diakonikolas, I., Li, J., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1278–1289. Society for Industrial and Applied Mathematics, 2017.

Armañanzas, R., Inza, I., Santana, R., Saeys, Y., Flores, J. L., Lozano, J. A., ..., and Larrañaga, P. A review of estimation of distribution algorithms in bioinformatics. BioData Mining, 1(6), 2008.

Ben-Hamou, A., Boucheron, S., and Ohannessian, M. I. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 23(1):249–287, 2017.

Braess, D. and Sauer, T. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.

Braess, D., Forster, J., Sauer, T., and Simon, H. U. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pp. 380–394, Berlin, Heidelberg, 2002. Springer.

Canonne, C. L., Diakonikolas, I., Gouleakis, T., and Rubinfeld, R. Testing shape restrictions of discrete distributions. Theory of Computing Systems, 62(1):4–62, 2018.

Chan, S. O., Diakonikolas, I., Servedio, R. A., and Sun, X. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1380–1394. Society for Industrial and Applied Mathematics, 2013.

Chen, S. F. and Goodman, J. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.

Cover, T. Admissibility properties of Gilbert's encoding for unknown source probabilities (corresp.). IEEE Transactions on Information Theory, 18(1):216–217, 1972.

Daskalakis, C., Diakonikolas, I., and Servedio, R. A. Learning k-modal distributions via testing. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1371–1385. Society for Industrial and Applied Mathematics, 2012.

Diakonikolas, I., Kane, D. M., and Stewart, A. Optimal learning via the Fourier transform for sums of independent integer random variables. In Conference on Learning Theory, pp. 831–849, 2016.

Falahatgar, M., Ohannessian, M. I., Orlitsky, A., and Pichapati, V. The power of absolute discounting: All-dimensional distribution estimation. In Advances in Neural Information Processing Systems, pp. 6660–6669, 2017.

Feldman, J., O'Donnell, R., and Servedio, R. A. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.

Gupta, P. L., Gupta, R. C., Ong, S. H., and Srivastava, H. M. A class of Hurwitz-Lerch zeta distributions and their applications in reliability. Applied Mathematics and Computation, 196(2):521–531, 2008.

Hao, Y. and Orlitsky, A. Data amplification: Instance-optimal property estimation. arXiv preprint arXiv:1903.01432, 2019.

Hao, Y., Orlitsky, A., Suresh, A. T., and Wu, Y. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pp. 8848–8857, 2018.

Jankowski, H. K. and Wellner, J. A. Estimation of a discrete monotone distribution. Electronic Journal of Statistics, 3:1567, 2009.

Kamath, G. On learning and covering structured distributions. Doctoral dissertation, Massachusetts Institute of Technology, 2014.
Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In Conference on Learning Theory, pp. 1066–1100, 2015.

Krichevsky, R. and Trofimov, V. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.

Ohannessian, M. I. and Dahleh, M. A. Rare probability estimation under regularly varying heavy tails. In Conference on Learning Theory, pp. 21.1–21.24, 2012.

Orlitsky, A. and Suresh, A. T. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pp. 2143–2151, 2015.

Paninski, L. Variational minimax estimation of discrete distributions under KL loss. In Advances in Neural Information Processing Systems, pp. 1033–1040, 2005.

Qu, Y., Beck, G. J., and Williams, G. W. Polya-Eggenberger distribution: Parameter estimation and hypothesis tests. Biometrical Journal, 32(2):229–242, 1990.

Rajaraman, N., Thangaraj, A., and Suresh, A. T. Minimax risk for missing mass estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 3025–3029. IEEE, 2017.

Valiant, G. and Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 142–155. ACM, 2016.

Zornig, P. and Altmann, G. Unified representation of Zipf distributions. Computational Statistics & Data Analysis, 19(4):461–473, 1995.