Compressed Maximum Likelihood

Yi Hao 1    Alon Orlitsky 1

Maximum likelihood (ML) is one of the most fundamental and general statistical estimation techniques. Inspired by recent advances in estimating distribution functionals, we propose compressed maximum likelihood (CML), which applies ML to compressed samples. We show that CML is sample-efficient for several fundamental learning tasks over both discrete and continuous domains, including learning structured densities, estimating probability multisets, and inferring symmetric distribution functionals.

1. Introduction

Maximum likelihood (ML) is a strikingly simple yet fundamental statistical inference paradigm for estimating parameters in probabilistic data generation models. Over the past century, it has been applied to derive numerous important results in mathematics, statistics, and machine learning (Friedman et al., 2001).

At its core, the ML principle favors the model that maximizes the observed data's probability. Consider, for example, flipping a coin with unknown heads probability $p$ ten times and observing six heads and four tails. The observed sequence's probability is $p^6(1-p)^4$, and ML suggests that the best guess of $p$ is $\hat p = 0.6$, the value maximizing this probability.

ML is so natural and intuitive that it may be stumbled upon without realizing its depth and significance. Yet it can be shown to be near-optimal under relatively mild and general regularity conditions (Van der Vaart, 2000). In particular, $\hat p$ tends to the actual value of $p$ as the number of observations increases, and enjoys both consistency and efficiency.

This paper concerns functional estimation, where $\mathcal P$ is a distribution collection over a domain space $\mathcal Z$, and $f:\mathcal P\to\mathcal Q$ is a functional, mapping distributions in $\mathcal P$ into a space $\mathcal Q$ equipped with a pseudo-metric $d$.

1 Department of Electrical and Computer Engineering, University of California, San Diego, USA. Correspondence to: Yi Hao. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

An estimator for $f$ is a function $\hat f:\mathcal Z\to\mathcal Q$ such that when $Z\sim p$, the loss $d(\hat f(Z), f(p))$ is small. The ML plug-in estimator of $f$ first finds the ML distribution estimate
$$p_z := \arg\max_{p\in\mathcal P}\ p(z)$$
and then estimates the functional's value as $f(p_Z)$. For example, if $\mathcal P$ is a parametric family indexed by $\Theta$, and $f$ maps each $p_\theta\in\mathcal P$ to its index $\theta\in\Theta$, the task becomes the classical parameter estimation problem, and ML maps the sample $Z$ to the parameter $\theta$ maximizing $p_\theta(Z)$.
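As a small illustration of the plug-in principle (ours, not part of the paper), the sketch below brute-forces the ML distribution over a finite candidate grid and plugs it into a functional; the grid of biased coins and the entropy functional are illustrative choices only.

```python
import numpy as np

def ml_plugin(sample, candidates, functional):
    """Brute-force ML plug-in: pick the candidate maximizing the sample's
    probability, then report the functional's value at that candidate."""
    loglik = [np.sum(np.log(p[sample])) for p in candidates]  # i.i.d. log-likelihoods
    p_ml = candidates[int(np.argmax(loglik))]
    return p_ml, functional(p_ml)

entropy = lambda p: -np.sum(p * np.log(p))

# Coin example from the text: six heads (symbol 1) and four tails (symbol 0).
sample = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
grid = [np.array([1 - b, b]) for b in np.linspace(0.05, 0.95, 19)]
p_hat, ent_hat = ml_plugin(sample, grid, entropy)
print(p_hat)    # [0.4, 0.6], i.e. the ML estimate p_hat = 0.6
print(ent_hat)  # plug-in estimate f(p_hat) of the entropy
```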
1.1. Statistical Guarantees of ML Methods

ML is particularly significant in modern scientific applications that often involve complex structures in high dimensions. Deriving a bespoke estimator for each application may be an arduous task. Yet ML provides a simple universal methodology that, in principle, can be applied to any problem. Unfortunately, it is well known that for finite samples, ML estimators may be suboptimal, and other estimators often concentrate faster around the true parameters. Hence, establishing finite-sample guarantees for ML-based methods is of fundamental importance.

While the ML principle is quite natural, showing its finite-sample efficiency is often not easy. Recent work by Acharya et al. (2017a) provided a simple argument for bypassing the difficulty in analyzing ML methods. Their primary lemma, stated next, relies only on the fact that $p_z(z)\ge p(z)$ for any $z\in\mathcal Z$. For simplicity, we write $d_f(p,q)$ for $d(f(p), f(q))$.

Lemma 1. For any $\mathcal Z$, $\mathcal P$, estimator $\hat f$, and accuracy $\varepsilon>0$,
$$\max_{p\in\mathcal P}\ \Pr_{Z\sim p}\big(d_f(p, p_Z) > 2\varepsilon\big)\ \le\ |\mathcal Z|\cdot\max_{p\in\mathcal P}\ \Pr_{Z\sim p}\big(d(f(p), \hat f(Z)) > \varepsilon\big).$$

Namely, the ML plug-in estimator is competitive with all other estimators in that if any other estimator achieves loss at most $\varepsilon$ with probability $1-\delta$, then the ML plug-in estimator achieves loss at most $2\varepsilon$ with probability at least $1-|\mathcal Z|\,\delta$. Leveraging this lemma, the paper showed that the ML-based profile maximum likelihood (PML) probability-multiset estimator in Orlitsky et al. (2004) is sample-optimal for several symmetric functionals, including entropy and support size. The profile of a sample represents the number of elements appearing any given number of times, and hence is a sufficient statistic for symmetric functionals. Therefore, instead of maximizing the sample's probability, Acharya et al. (2017a) maximized the probability of the sample's profile.

1.2. Compressed Maximum Likelihood

Motivated by the above rationale, our paper develops a general framework for applying and analyzing ML for several statistical and machine-learning problems. It does so by adding a compressor between the sample and the ML learner. Specifically, given a domain $\mathcal Z$, a compressor is a function $\varphi$ that maps each $z\in\mathcal Z$ to an element $\varphi(z)$ in some codomain $\Phi$. In general, we allow $\varphi$ to possess randomness independent of the samples. Given compressor $\varphi$, we define the $\varphi$-compressed maximum likelihood (CML) estimator as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ p(\varphi(Z)),$$
where for notational brevity, we suppress $Z$ in $p_\varphi(Z)$.

2. Main Results

2.1. Compression for Learning

Consider an arbitrary compressor $\varphi$, mapping elements in $\mathcal Z$ to a codomain $\Phi$. We evaluate the compressor's quality for our learning objective via the following two criteria.

Typicality   A compressor is $(m,\gamma)$-typical for an integer $m$ and probability threshold $\gamma\in(0,1)$ if for every $p\in\mathcal P$, there is an $m$-element subset $T\subseteq\Phi$ with probability $p(T)\ge 1-\gamma$. Intuitively, smaller $m$ corresponds to a better compressor.

Learnability   Given error parameters $\varepsilon,\delta$, we say that the compressor enables $(\varepsilon,\delta)$-learning under pseudo-metric $d$ if there is an algorithm $\mathcal A:\Phi\to\mathcal Q$ satisfying
$$\Pr_{Z\sim p}\big(d(f(p), \mathcal A(\varphi(Z))) > \varepsilon\big)\ \le\ \delta,\qquad\forall\, p\in\mathcal P.$$
Intuitively, the smaller the error parameters, the better the compressor is for our learning tasks.

The theorem below shows that for any good-quality compressor, the respective CML plug-in estimate will also be accurate, with high probability.

Theorem 1. For any compressor $\varphi$ that is $(m,\gamma)$-typical and enables $(\varepsilon,\delta)$-learning, distribution $p\in\mathcal P$, and $Z\sim p$,
$$\Pr\big(d_f(p, p_\varphi(Z)) > 2\varepsilon\big)\ \le\ \gamma + m\,\delta.$$

Proof. We prove the theorem by classifying all possible patterns $\phi\in\Phi$ into three categories. Given any distribution $p\in\mathcal P$, let $T$ be the smallest set with $p(T)\ge 1-\gamma$. By definition, $|T|\le m$. For any pattern $\phi\in T$ that satisfies $p(\phi)>\delta$: since the compressor enables $(\varepsilon,\delta)$-learning with an algorithm $\mathcal A$, we must have $d(f(p), \mathcal A(\phi))\le\varepsilon$. By the definition of CML, $p_\varphi(\phi)\ge p(\phi)>\delta$, hence $d(f(p_\varphi), \mathcal A(\phi))\le\varepsilon$. The triangle inequality combines both and yields $d_f(p_\varphi, p)\le 2\varepsilon$. Next, consider $\phi\in T$ satisfying $p(\phi)\le\delta$. By $|T|\le m$, the total probability of all such patterns is at most $m\delta$. Finally, the total probability of patterns $\phi\notin T$ is at most $\gamma$ since $p(T)\ge 1-\gamma$. Therefore, the probability that $d_f(p_\varphi, p)>2\varepsilon$ is at most $m\delta+\gamma$, which completes the proof.
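The definitions above translate directly into a brute-force procedure on small discrete instances. The sketch below is our own illustration, under toy assumptions (a finite class $\mathcal P$, a 3-symbol alphabet, and sample size 5, so exhaustive enumeration is feasible); it computes $p(\varphi(z))$ exactly and returns the CML distribution. The profile compressor used here anticipates the PML estimator of Section 6.

```python
import itertools
import numpy as np

def compressed_likelihood(p, phi_value, phi, k, n):
    """Pr_{Y^n ~ p}(phi(Y^n) = phi_value), by exhaustive enumeration (tiny n only)."""
    total = 0.0
    for y in itertools.product(range(k), repeat=n):
        if phi(y) == phi_value:
            total += np.prod([p[s] for s in y])
    return total

def cml(sample, candidates, phi, k):
    """Compressed maximum likelihood: maximize the probability of phi(sample)."""
    target = phi(tuple(sample))
    scores = [compressed_likelihood(p, target, phi, k, len(sample)) for p in candidates]
    return candidates[int(np.argmax(scores))]

# Example compressor: the sample's profile (multiset of multiplicities),
# which turns CML into the PML estimator discussed later in the paper.
def profile(y):
    return tuple(sorted(y.count(s) for s in set(y)))

k = 3
sample = [0, 1, 0, 2, 0]
candidates = [np.ones(k) / k,
              np.array([0.6, 0.2, 0.2]),
              np.array([0.8, 0.1, 0.1])]
print(cml(sample, candidates, profile, k))
```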
Similar to Acharya et al. (2017a), it suffices to obtain a CML approximation for the competitive guarantees to hold.

Definition 1 (Approximate CML (ACML)). For any $\beta\le 1$, $z\in\mathcal Z$, and compressor $\varphi$, a distribution $p^\beta_\varphi(z)\in\mathcal P$ is a $\beta$-approximate CML if $p^\beta_\varphi(z)(\varphi(z))\ \ge\ \beta\cdot p_\varphi(z)(\varphi(z))$.

Corollary 1. For any $(m,\gamma)$-typical compressor $\varphi$ that enables $(\varepsilon,\delta)$-learning, distribution $p\in\mathcal P$, $Z\sim p$, and a $\beta$-approximate CML $p^\beta_\varphi(Z)$,
$$\Pr\big(d_f(p, p^\beta_\varphi(Z)) > 2\varepsilon\big)\ \le\ \gamma + m\,\delta/\beta.$$

Building on this framework, we design and analyze CML estimators for various applications. For each, we add a compressor between the samples and the estimator. The main challenges are finding a good compressor $\varphi$ that works well with ML and an effective algorithm $\mathcal A$ for the task.

2.2. CML Estimators

The previous sections presented the general CML framework, and Theorem 1 demonstrated its statistical competitiveness. The remaining sections describe concrete CML estimators for four statistical inference tasks, over both continuous and discrete domains, described below.

Continuous distributions   Section 4 applies the CML method to learn structured continuous i.i.d. distributions. We consider a broad distribution class $\mathcal P$ in which the difference $p-q$ between any two distributions $p,q\in\mathcal P$ has essentially at most $s$ sign changes. This class encompasses numerous essential distributions such as log-concave densities, piecewise polynomials, and Gaussian mixtures. Theorem 2 shows that for any $p\in\mathcal P$, with sample size $\Theta(s\log(s/\varepsilon)/\varepsilon^3)$, the CML estimator achieves $\Pr(\|p-p_\varphi\|_1>\varepsilon)\le O(\varepsilon)$.

Note that Yatracos' method, described in Section 3, finds the approximate minimizer of the $\mathcal A_k$-distance to the empirical distribution and achieves better sample efficiency. However, in general, no efficient algorithm is known for finding such a minimizer (see, e.g., Theorem 6.4 in Devroye & Lugosi, 2012). By contrast, the CML approach transforms the learning task into a maximum likelihood problem. This establishes the efficiency of ML methods for learning structured distributions and enables the use of numerous ML optimization algorithms. Some of these algorithms, like the EM algorithm (Bishop, 2006), are heuristic, while others are rigorous. Learning distributions with bounded crossings was also considered in Acharya et al. (2017b). However, they assume access to an $\mathcal A_k$-projection oracle that finds the approximate $\mathcal A_k$-distance minimizers to the empirical distribution.

Discrete distributions   Section 5 applies CML to learn discrete i.i.d. distributions where, as above, every two distributions cross at most $s$ times. We present two different CML formulations. The first formulation, in Section 5.1, applies the compressor of Section 4 along with a random mapping that transforms the sample from a discrete distribution into a continuous analog while maintaining the structural properties. The second formulation constructs CML directly from the discrete sample. The approach essentially performs maximum likelihood on the quantized empirical distribution, with a well-tuned quantization level. The resulting ML objective, presented in Section 5.2, resembles that of the PML estimator in Acharya et al. (2017a). Theorem 3 shows that for any distribution $p$ with support $\{1,\ldots,N\}$ satisfying $N\le s/\varepsilon$, given a sample from $p$ of size $n=\Theta(s\log(N)/\varepsilon^3)$, with probability at least $1-e^{-s}$, $\|p-p_\varphi\|_1\le O(\varepsilon)$.

Probability multisets   Section 6 is geared towards the original PML method, a special CML whose compressor $\varphi$ maps samples to profiles (Orlitsky et al., 2004; 2011; Das, 2012; Acharya et al., 2012; 2017a; Hao & Orlitsky, 2019a; 2020b; Charikar et al., 2019a;b; Han & Shiragur, 2021). PML yields a natural estimate for the distribution's probability multiset.
Given support bound $N$ and desired accuracy $\varepsilon$, it is necessary (in the worst case) to obtain $\Theta(N/(\varepsilon^2\log N))$ observations from the distribution to estimate its probability multiset under the sorted $\ell_1$-distance (Valiant & Valiant, 2011c; 2013; Han et al., 2018; Hao & Orlitsky, 2019a). Over the years, a sequence of works established the optimality of PML for multiset learning in sorted $\ell_1$-distance at different accuracy levels (Das, 2012; Hao & Orlitsky, 2019a; Han & Shiragur, 2021).

A different multiset-learning guarantee, stated in terms of the earth-mover's distance, was considered in Valiant & Valiant (2016). Theorem 4 shows that PML also enjoys this learning guarantee, hence is sample-optimal for both the sorted $\ell_1$-distance and the $\tau$-truncated relative earth-mover distance. For any level $\tau\in[0,1]$, the latter metric is
$$R_\tau(p,q) := \inf_{\gamma\in\Gamma_{p,q}}\ \mathbb{E}_{(X,Y)\sim\gamma}\left[\,\left|\log\frac{\max\{p(X),\tau\}}{\max\{q(Y),\tau\}}\right|\,\right],$$
where $\Gamma_{p,q}$ represents all possible couplings of $p$ and $q$. Section 6.2 shows that with probability at least $1-O(1/n)$, for any $w\in[1,\log n]$, the compressed estimator $p_\varphi$ achieves $R_{\frac{w}{n\log n}}(p_\varphi, p)=O\big(1/\sqrt{w}\big)$.

Distribution functionals   Section 7 explores the application of CML to functional estimation. The compressor maps each sample sequence to the multiset of multiplicities that are small relative to $n$, leading to a unified algorithm that optimally learns several symmetric functionals.

3. Preliminaries

General notation   The rest of the paper considers two types of univariate domains $\mathcal X$: continuous, $\mathbb R$, and discrete, $[N]:=\{1,\ldots,N\}$. We consider distributions $p$ over $\mathcal X$. Given a domain partition $\mathcal I$, the quantized distribution $p_{\mathcal I}$ assigns to each part $I\in\mathcal I$ the probability $p(I)$. More generally, $\mathcal I$ can be any collection of disjoint measurable subsets of $\mathcal X=\mathbb R$, and $f$ can be any real measurable function over $\mathbb R$. Then $f_{\mathcal I}$ becomes the quantized function that assigns to any $I\in\mathcal I$ the value $\int_{x\in I} f(x)\,dx$, and to the remaining set $\mathcal X\setminus\bigcup_{I\in\mathcal I} I$ a probability mass of $0$. Slightly abusing notation, we use $f_{\mathcal I}$ also to represent the flattened distribution that assigns to every $x$ in any $I\in\mathcal I$ of positive measure $|I|>0$ the value $f_{\mathcal I}(x):=f_{\mathcal I}(I)/|I|$. Finally, write $a\wedge b$ for $\min\{a,b\}$ and $a\vee b$ for $\max\{a,b\}$.

Function norms   We utilize a few norms of functions. The $\ell_1$-norm evaluates $\|f\|_1:=\int|f(x)|\,dx$. Another norm that will come in handy is the $\mathcal A_k$-norm. Specifically, for any positive integer $k$, let $\mathbb I_k$ denote the collection of all unions of $k$ disjoint intervals. Then, for any measurable function $f:\mathbb R\to\mathbb R$, the $\mathcal A_k$-norm of $f$ is
$$\|f\|_{\mathcal A_k} := \sup_{\mathcal I\in\mathbb I_k}\ \|f_{\mathcal I}\|_1,$$
where $f_{\mathcal I}$ quantizes $f$ over the $k$ intervals forming $\mathcal I$. Note that the above notions naturally extend to $\mathcal X=[N]$ if we replace the integrals with the respective finite sums.

VC inequality   A well-known convergence bound associated with the $\mathcal A_k$-norm is the VC inequality (Devroye & Lugosi, 2012). Often, the inequality appears as an upper bound on the expected $\mathcal A_k$-norm of the difference between the empirical and actual distributions. We leverage the following variant from Acharya et al. (2017b), which makes the error probability explicit.

Lemma 2 (VC inequality). For any $\varepsilon,\delta>0$ and continuous distribution $p$, draw a sample $X^n\sim p$ and let $\hat p$ denote the empirical distribution. For $n=\Theta((k+\log(1/\delta))/\varepsilon^2)$, with probability at least $1-\delta$, $\|\hat p-p\|_{\mathcal A_k}\le\varepsilon$.
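Since the $\mathcal A_k$-norm drives much of the subsequent analysis, the following sketch computes it exactly on the discrete domain $[N]$ with a small dynamic program; the discrete restriction and the DP formulation are our own simplification, not a procedure from the paper.

```python
import numpy as np

def ak_norm(f, k):
    """A_k-norm of a function f on [N]: the largest total |mass| captured by at
    most k disjoint intervals, i.e. max over unions I of sum_i |f(I_i)|.
    Equivalently, maximize sum_j s_j * (sum of f over interval j) with s_j = +-1."""
    NEG = -np.inf
    # best[j, state]: best value so far with j intervals opened;
    # state 0 = outside an interval, 1 = inside a '+' interval, 2 = inside a '-' interval.
    best = np.full((k + 1, 3), NEG)
    best[0, 0] = 0.0
    for x in f:
        new = np.full((k + 1, 3), NEG)
        for j in range(k + 1):
            for state in range(3):
                v = best[j, state]
                if v == NEG:
                    continue
                # skip x (closing any open interval)
                new[j, 0] = max(new[j, 0], v)
                # extend the currently open interval through x
                if state == 1:
                    new[j, 1] = max(new[j, 1], v + x)
                if state == 2:
                    new[j, 2] = max(new[j, 2], v - x)
                # open a new interval starting at x (uses one of the k slots)
                if j < k:
                    new[j + 1, 1] = max(new[j + 1, 1], v + x)
                    new[j + 1, 2] = max(new[j + 1, 2], v - x)
        best = new
    return float(best.max())

p = np.array([0.5, 0.1, 0.1, 0.3])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(ak_norm(p - q, 1), ak_norm(p - q, 2))  # A_1 and A_2 distances between p and q
```

For two distributions over $[N]$, $\|p-q\|_{\mathcal A_k}$ coincides with $\|p-q\|_1$ once $k$ exceeds the number of sign changes of $p-q$, which is how the crossing-number assumption enters the analysis.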
4. Structured Continuous Distributions

A distribution collection $\mathcal C$ is an $\varepsilon$-cover for $\mathcal P$ under the $\ell_1$-distance if every distribution in $\mathcal P$ is within $\ell_1$-distance $\varepsilon$ of some distribution in $\mathcal C$. The crossing number of two real functions $f$ and $g$ is the number of times they cross each other, or equivalently, the number of sign changes of $f(x)-g(x)$. The crossing number of a distribution collection is the largest crossing number of any pair of distributions in the collection.

Numerous important distribution families have $\varepsilon$-covers whose crossing number grows moderately as $\varepsilon$ decreases. For example, $t$-piecewise, degree-$d$ polynomials have $0$-covers with crossing number $O(t(d+1))$, and $t$-mixtures of univariate Gaussians have $\varepsilon$-covers with crossing number $O(t\log(1/\varepsilon))$, growing only logarithmically in $1/\varepsilon$.

4.1. Compressor

Let $\mathcal P$ have an $\varepsilon$-cover with crossing number $s$. Draw a sample $Y^n$ from an unknown $p\in\mathcal P$ and let $\hat q$ be the empirical distribution. Partition the real line into $t:=s/\varepsilon$ intervals, $\mathcal I:=(I_i)_{i=1}^t$, such that $\hat q_{\mathcal I}$ is uniform, where for simplicity we assume that $s/\varepsilon$ is an integer. Next, draw an independent sample $X^n\sim p$. Let $\hat p$ denote its empirical distribution, and let the compressor $\varphi$ be its flattened version over $\mathcal I$, $\varphi(X^n):=\hat p_{\mathcal I}$. Note that $\varphi(X^n)$ depends implicitly on $Y^n$, and its definition consists of both the $Y^n$-induced partition and the empirical probabilities of each part according to $X^n$.

4.2. CML Estimator

Given a sample $Y^n$ and the corresponding interval partition $\mathcal I$, the respective CML estimator is
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\,\big|\,\mathcal I\big).$$
Note that the probability term corresponds to a multinomial distribution. We can rewrite the CML estimator as
$$p_\varphi = \arg\min_{p\in\mathcal P}\ H(\hat p_{\mathcal I},\, p_{\mathcal I}).$$
Hence, CML minimizes the cross-entropy between the quantized empirical distribution and the actual one.

Theorem 2. For any $\varepsilon$ and $p\in\mathcal P$, draw $(X^n, Y^n)\sim p$. If $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, then with probability at least $1-2\varepsilon$, $\|p-p_\varphi\|_1\le 34\varepsilon$.

Note that by traditional VC theory, the empirical $\mathcal A_k$-norm minimizer based on $Y^n$ alone achieves a sample complexity of $\Theta(t/\varepsilon)$. However, this method is usually not used for learning structured densities. The theorem, on the other hand, shows that the practically ubiquitous ML technique can also be applied to learn structured continuous distributions with a rigorous guarantee. It thereby demonstrates a different, continuous, application of Theorem 1 beyond the original application to discrete domains. The use of the samples $X^n$, while not strictly necessary, facilitates the application of the competitive argument. The rest of the section is devoted to the theorem's proof.

4.3. Typical Events

First, we leverage standard concentration inequalities to establish several claims, each of which holds with high probability. The subsequent analysis will assume that all the claims hold.

By construction, each interval $I_i\in\mathcal I$ carries an empirical probability mass of $\hat q(I_i)=\varepsilon/s$. The respective probability mass under the actual density, $p(I_i)$, turns out to follow a $\mathrm{Beta}(n/t,\ n(1-1/t))$ distribution. The following inequality (Hao et al., 2020) characterizes the typical values of $p(I_i)$.

Lemma 3 (Hao et al., 2020). For any $\alpha\in[0,1]$ and $1\le t\le n$, the probability that $p(I_i)$ deviates from its mean $1/t$ by more than $\alpha$ is at most $2\exp(-\Omega(n\alpha^2 t))$.

For any $\gamma\in(0,1)$ and $n\ge 12t\log(2t/\gamma)$, choosing $\alpha=1/(2t)$ in the lemma yields
$$\Pr\big(\forall\, i\in[t]:\ p(I_i)\in(1/(2t),\,3/(2t))\big)\ \ge\ 1-\gamma.$$
Henceforth, we assume that $n\ge 12t\log(2t/\gamma)$ and that for each $i$, the actual probability mass of $I_i$ falls in $(1/(2t), 3/(2t))$.

Next, consider the second sample $X^n$ and an arbitrary index $i\in[t]$. By independence, $n\hat p(I_i)\sim\mathrm{bin}(n, p(I_i))$, where $p(I_i)>1/(2t)$ as above. From the binomial Chernoff and union bounds, with probability at least $1-\gamma$, for all $i\in[t]$,
$$|\hat p(I_i)-p(I_i)|\ \le\ 2\sqrt{p(I_i)\log(2t/\gamma)/n}.$$
Our lower bounds on $n$ and $p(I_i)$ also imply that for all $i$,
$$\hat p(I_i)\ \le\ p(I_i)+2\sqrt{p(I_i)\log(2t/\gamma)/n}\ <\ \frac{3}{t}.$$
Below we assume that $\hat p$ satisfies these inequalities.
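Before turning to the guarantees, the following minimal sketch (ours; it uses a Gaussian location family as a stand-in for $\mathcal P$ and a grid search in place of a generic ML solver) illustrates the Section 4.1 compressor and the cross-entropy form of the CML objective.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def build_partition(y_sample, t):
    """Section 4.1 compressor, step 1: split the line into t intervals carrying
    equal empirical mass under the first sample Y^n (empirical quantiles)."""
    qs = np.quantile(y_sample, np.linspace(0, 1, t + 1))
    qs[0], qs[-1] = -np.inf, np.inf
    return qs

def flatten(x_sample, qs):
    """Step 2: the compressed statistic, i.e. the second sample's empirical mass
    in each interval (the flattened empirical distribution over I)."""
    idx = np.searchsorted(qs, x_sample, side="right") - 1
    counts = np.bincount(idx, minlength=len(qs) - 1)
    return counts / counts.sum()

def gaussian_cdf(z, mu):
    return 0.5 * (1.0 + erf((z - mu) / sqrt(2.0)))

def cml_objective(mu, phi_hat, qs):
    """Cross-entropy H(phi_hat, p_I): the quantity the CML distribution minimizes."""
    p_mass = np.diff(np.array([gaussian_cdf(z, mu) for z in qs]))
    return -np.sum(phi_hat * np.log(p_mass + 1e-12))

# Toy run: data from N(0,1); the candidate class is {N(mu,1)} on a grid.
n, t = 2000, 50
y, x = rng.normal(size=n), rng.normal(size=n)
qs = build_partition(y, t)
phi_hat = flatten(x, qs)
best_mu = min(np.linspace(-1, 1, 41), key=lambda m: cml_objective(m, phi_hat, qs))
print(best_mu)  # should land near 0
```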
4.4. Guarantees of CML

Given $Y^n$'s empirical distribution $\hat q$, hence also $\mathcal I$, there are finitely many flattened $X^n$-distributions $\hat p_{\mathcal I}$ over $\mathcal I$. Given a probability threshold $\delta$, we consider two disjoint cases according to the probability of a particular distribution $\hat p_{\mathcal I}$.

Likely $\hat p_{\mathcal I}$: The probability $p(\hat p_{\mathcal I})>\delta$. We show that for such $\hat p_{\mathcal I}$, with high probability, the CML distribution $p_\varphi$ is close to $p$. The following argument assumes that the claims made in Section 4.3 hold. First, since $\hat p(I_i)=\hat p_{\mathcal I}(I_i)<3/t$ for all $i$, for any distribution $p'$ satisfying $\|p'-\hat p_{\mathcal I}\|_{\mathcal A_s}>7\varepsilon$,
$$\|p'-\hat p\|_{\mathcal A_s}\ \ge\ \|p'-\hat p_{\mathcal I}\|_{\mathcal A_s}-\|\hat p_{\mathcal I}-\hat p\|_{\mathcal A_s}\ >\ 7\varepsilon-2s\cdot\frac{3}{t}\ =\ \varepsilon.$$
By the VC inequality, this event happens with probability at most $\delta/2$ for a sample size of $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, contradicting our assumption that $p(\hat p_{\mathcal I})>\delta$. Therefore, $\|p-\hat p_{\mathcal I}\|_{\mathcal A_s}\le 7\varepsilon$. On the other hand, a fundamental attribute of the ML method is that $p_\varphi(\hat p_{\mathcal I})\ge p(\hat p_{\mathcal I})>\delta$. Hence, $\|p_\varphi-\hat p_{\mathcal I}\|_{\mathcal A_s}\le 7\varepsilon$ as well.

For any distribution $q\in\mathcal P$, we denote by $q'$ the closest distribution to $q$ in the $\varepsilon$-cover mentioned above under the $\ell_1$-distance. From the above and the triangle inequality,
$$\|p_\varphi-p\|_1\ \le\ \|p'_\varphi-p'\|_1+2\varepsilon\ \le\ 2\big(\|p-\hat p_{\mathcal I}\|_{\mathcal A_s}+\|p_\varphi-\hat p_{\mathcal I}\|_{\mathcal A_s}\big)+6\varepsilon\ \le\ 34\varepsilon,$$
where the second inequality follows as $\|q_1-q_2\|_1=2\sup_{A\subseteq\mathbb R}|q_1(A)-q_2(A)|$.

Unlikely $\hat p_{\mathcal I}$: The probability $p(\hat p_{\mathcal I})\le\delta$. Note that the number of possible $\hat p_{\mathcal I}$'s is at most
$$\binom{t+n-1}{t}\ \le\ e^{\,t\log(e(1+n/t))}.$$
Denote by $L_{t,n}$ the exponent of the last term. Setting $\delta=e^{-2L_{t,n}}$ and $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, the total probability of $\hat p_{\mathcal I}$'s falling into this category is at most
$$e^{L_{t,n}}\cdot e^{-2L_{t,n}}\ =\ e^{-L_{t,n}}\ \le\ \varepsilon/t,$$
where the last step follows by choosing a sufficiently large absolute constant in the above asymptotic expression for $n$.

Summary   Choosing $\gamma=\varepsilon/2$ in Section 4.3, if $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, the union bound implies that with probability at least $1-2\varepsilon$, $\|p_\varphi-p\|_1\le 34\varepsilon$.

5. Structured Discrete Distributions

Moving from $\mathbb R$ to the discrete domain $[N]$, we apply CML to learn structured discrete distributions. Again, we assume that $\mathcal P$ has an $\varepsilon$-cover $\mathcal C$ with crossing number at most $s$.

5.1. Reduction from Continuous CML

First, we present a concrete random mapping from discrete to continuous domains that enables the use of the continuous CML estimator. For any distribution $p$ over $[k]$, we define its flattened version $\bar p$ as the continuous distribution that assigns $\bar p(x):=p(\lceil x\rceil)$ for $x\in(0,k]$ and $\bar p(x):=0$ for $x\notin(0,k]$. This yields a bijective mapping that maintains the number of sign changes of the difference between any two distributions.

Note that we only have sample access to the underlying distribution $p$. To apply the CML estimator of Section 4, we need to simulate a sample from $\bar p$ given $X^n\sim p$. A random mapping $S$ for this purpose counts the number of times each symbol $i\in[N]$ is observed in $X^n$, say $n_i$, and respectively draws $n_i$ independent sample points from the uniform distribution over $(i-1, i]$. Now, it is straightforward to apply the continuous CML. Specifically, draw samples $(X^n, Y^n)\sim p$, define the compressor $\varphi$ based on $S(Y^n)$, and let the CML estimate be
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(S(X^n))\,\big|\,\mathcal I\big).$$

5.2. Discrete CML

Next, we present an alternative CML method that applies directly to discrete samples.

Compressor   Draw a size-$n$ sample $X^n\sim p$, and denote by $\hat p$ the empirical distribution. Then, sort the elements of $X^n$ in non-descending order, say $X_{(1)},\ldots,X_{(n)}$, such that $X_{(i)}\le X_{(j)}$ for $i<j$. Without loss of generality, assume that $t:=s/\varepsilon$ is an integer and that $n$ is divisible by $t$. Sequentially partition the sequence into $t$ sub-sequences, such that the $i$-th sub-sequence contains the observations with indices from $(i-1)n/t+1$ to $in/t$. The compressor $\varphi$ maps every sample $X^n$ to the boundary symbols of these sub-sequences. Specifically,
$$\varphi(X^n) := \big(X_{(in/t)}\big)_{i=1}^{t},$$
which is essentially $\hat p_{\mathcal I}$ with $\mathcal I:=(I_i)_{i=1}^t$ and $I_i:=[X_{((i-1)n/t)}:X_{(in/t)}]$, $i\in[t]$.
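For concreteness, here is a minimal sketch (our own illustration, using a toy distribution over $[4]$) of this compressor and of the Section 5.1 random mapping $S$.

```python
import numpy as np

def discrete_compressor(sample, t):
    """Section 5.2 compressor: sort the sample and keep the t boundary symbols
    X_(n/t), X_(2n/t), ..., X_(n); for simplicity the sketch assumes t divides n."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    assert n % t == 0
    return x[np.arange(1, t + 1) * (n // t) - 1]    # 0-indexed boundary positions

def random_flattening(sample):
    """Section 5.1 mapping S: replace each occurrence of symbol i by an
    independent uniform draw from (i-1, i], simulating the flattened density."""
    rng = np.random.default_rng(1)
    x = np.asarray(sample, dtype=float)
    return x - rng.uniform(0.0, 1.0, size=len(x))

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.2, 0.1])                  # a toy distribution over [4]
sample = rng.choice(np.arange(1, 5), size=100, p=p)
print(discrete_compressor(sample, t=5))             # the 5 boundary symbols
print(random_flattening(sample)[:5])                # a continuous surrogate sample
```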
CML estimator   Consequently, the CML estimator for the above compressor takes the form
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\,\big|\,\mathcal I\big).$$
Equivalently, the CML estimator can be written as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
The estimator often performs well, as described below.

Theorem 3. For $p\in\mathcal P$ and $X^n\sim p$, if $n=\Theta\big((t\vee N)\log(t\vee N)/\varepsilon^2\big)$, then with probability at least $1-e^{-(t\vee N)}$, $\|p-p_\varphi\|_1=O(\varepsilon)$.

5.3. Hypothesis Selection Algorithm

First, we utilize the assumptions and the standard VC inequality to show the existence of a hypothesis selection algorithm that, with high probability, finds an accurate estimate of the underlying distribution. By the VC inequality, for any $\varepsilon,\delta>0$ and $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, with probability at least $1-\delta/2$,
$$\|\hat p-p\|_{\mathcal A_s}\ \le\ \varepsilon.$$
Denote by $\mathbb I_s$ the set of unions of at most $s$ disjoint intervals. By our construction, for any set $I\in\mathbb I_s$,
$$|\hat p(I)-\hat p_{\mathcal I}(I)|\ \le\ 2\varepsilon.$$
For any $q\in\mathcal P_{s,\varepsilon}$, let $q'$ be a distribution in the $\varepsilon$-cover $\mathcal C$ with minimal $\ell_1$-distance to $q$. For any distribution pair $q_1,q_2\in\mathcal P_{s,\varepsilon}$, define the following variant of the Scheffé set (Devroye & Lugosi, 2012) as $S'_{12}:=\{x\in\mathbb R:\ q'_1(x)>q'_2(x)\}$, and the original Scheffé set as $S_{12}$. Given the previous inequality and the distribution $\hat p_{\mathcal I}$, we can approximate $\hat p(q_1>q_2):=\hat p(S_{12})$ to a $3\varepsilon$ additive error by summing up $\hat p_{\mathcal I}(I_i)$ over the indices $i$ satisfying $I_i\subseteq S'_{12}$. Note that we assumed that $q'_1-q'_2$ has at most $s$ sign changes.

Next, we show that if the VC inequality holds, there exists a selection algorithm that uses $\hat p_{\mathcal I}$ to find a density $p^*\in\mathcal C$ satisfying $\|p^*-p\|_1\le O(\varepsilon)$. In addition, this algorithm is a variant of that in Theorem 6.4 of Devroye & Lugosi (2012). Let $\mathcal S$ denote the collection of all the modified Scheffé sets induced by distributions in $\mathcal C$. Let $p^*$ be the distribution $q$ in $\mathcal C$ minimizing $\sup_{S\in\mathcal S}|\hat p_{\mathcal I}(S)-q(S)|$ up to an additive $4\varepsilon$ error, where we enlarge the error bound by $\varepsilon$ to guarantee the existence of such a distribution. Then, the triangle inequality yields
$$\|p^*-p\|_1\ \le\ \|p-p'\|_1+\|p'-p^*\|_1.$$
For any pair $(q_1,q_2)$ of distributions, define $D_{\mathcal S}(q_1,q_2):=\sup_{S\in\mathcal S}|q_1(S)-q_2(S)|$. Consider the last term in the above inequality. By a recursive application of the triangle inequality and properties embedded in the prior constructions,
$$\begin{aligned}
\|p'-p^*\|_1 &= 2D_{\mathcal S}(p^*,p')\\
&\le 2D_{\mathcal S}(p^*,\hat p)+2D_{\mathcal S}(\hat p,p')\\
&\le 2D_{\mathcal S}(p^*,\hat p_{\mathcal I})+2D_{\mathcal S}(\hat p_{\mathcal I},p')+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',\hat p_{\mathcal I})+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',\hat p)+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',p)+4D_{\mathcal S}(p,\hat p)+O(\varepsilon)\\
&\le 2\|p'-p\|_1+4D_{\mathcal S}(p,\hat p)+O(\varepsilon).
\end{aligned}$$
Recall that $\mathcal C$ is an $\varepsilon$-cover of $\mathcal P_{s,\varepsilon}$, so $\|p'-p\|_1\le\varepsilon$. In addition, the collection $\mathcal S$ of modified Scheffé sets has VC dimension at most $2s+2$. Therefore, by the VC inequality, for a sample size of $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, with probability at least $1-\delta/2$, $D_{\mathcal S}(p,\hat p)\le\varepsilon$. Consolidating these results shows that with probability at least $1-\delta$, the selected hypothesis $p^*$ achieves $\|p^*-p\|_1=O(\varepsilon)$. Note, however, that the above selection algorithm does not provide a method for finding $p^*$.
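The selection rule used in this argument can be made concrete for small instances. The sketch below is our own simplification: it scans the plain empirical distribution $\hat p$ rather than the quantized $\hat p_{\mathcal I}$ and uses the ordinary Scheffé sets of the candidates, in the spirit of the minimum-distance estimate of Devroye & Lugosi (2012).

```python
import numpy as np

def scheffe_select(candidates, emp):
    """Minimum-distance selection: return the candidate q minimizing
    max over Scheffe sets S_12 = {x : q1(x) > q2(x)} of |emp(S) - q(S)|."""
    sets = []
    for i, q1 in enumerate(candidates):
        for j, q2 in enumerate(candidates):
            if i != j:
                sets.append(q1 > q2)            # boolean mask over the domain [N]
    def score(q):
        return max(abs(emp[S].sum() - q[S].sum()) for S in sets)
    return min(candidates, key=score)

rng = np.random.default_rng(0)
N, n = 20, 5000
truth = np.ones(N) / N
sample = rng.choice(N, size=n, p=truth)
emp = np.bincount(sample, minlength=N) / n

candidates = [np.ones(N) / N,                               # the true distribution
              np.linspace(1, 2, N) / np.linspace(1, 2, N).sum(),
              np.linspace(2, 1, N) / np.linspace(2, 1, N).sum()]
best = scheffe_select(candidates, emp)
print(np.abs(best - truth).sum())   # l1 error of the selected hypothesis
```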
5.4. Guarantees of CML

There are only finitely many possible $\hat p_{\mathcal I}$, or equivalently, $\varphi(X^n)$. Below, we consider two disjoint cases according to the probability of observing a particular pattern of $\varphi(X^n)$ under the actual distribution.

Likely pattern: A pattern $\phi$ whose probability $p(\phi):=\Pr_{X^n\sim p}(\varphi(X^n)=\phi)>\delta$. By the argument in Section 5.3, there exists a selection algorithm that maps this pattern to a distribution $p^*$ whose error probability in achieving $\|p-p^*\|_1=O(\varepsilon)$ is at most $\delta$. Since $p(\phi)>\delta$ by assumption, the selected hypothesis must satisfy the error bound above. It is straightforward to combine these error bounds through Theorem 1 and the triangle inequality:
$$\|p-p_\varphi\|_1\ \le\ \|p_\varphi-p^*\|_1+\|p^*-p\|_1\ =\ O(\varepsilon).$$

Unlikely pattern: A pattern $\phi$ whose probability $p(\phi)\le\delta$. Since $\phi$ can only be a sorted integer sequence with values in $[N]$, the number of possible patterns equals the number of ways to pick $t$ elements from $[N]$ with repetition, which is
$$\binom{t+N-1}{t}\ \le\ \exp\Big((t\vee N)\log\Big(e\Big(1+\frac{N}{t}\Big)\Big)\Big).$$
Denote by $L_{t,N}$ the exponent of the last term. Setting $\delta=e^{-2L_{t,N}}$ and $n=\Theta(L_{t,N}/\varepsilon^2)$, the total probability of patterns falling into this category is at most
$$e^{L_{t,N}}\cdot e^{-2L_{t,N}}\ =\ e^{-L_{t,N}}\ \le\ e^{-(t\vee N)},$$
where the last step follows by choosing a sufficiently large absolute constant in the expression for $n$.

6. Probability Multisets

In this section, we address probability multiset estimation under permutation invariance.

6.1. Compressor and PML

For any discrete distribution $p$ and sample $X^n\sim p$, let the compressor $\varphi$ map the sequence to its profile $\varphi(X^n)$, defined as the multiplicity multiset of the symbols appearing in $X^n$. The respective CML estimator was introduced in Orlitsky et al. (2004) as the PML estimator. Specifically, PML computes
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
For any $\tau\in[0,1]$, define the $\tau$-truncated relative earth-mover distance (Valiant & Valiant, 2015) between two discrete distributions $p$ and $q$ as
$$R_\tau(p,q) := \inf_{\gamma\in\Gamma_{p,q}}\ \mathbb{E}_{(X,Y)\sim\gamma}\left[\,\left|\log\frac{p(X)\vee\tau}{q(Y)\vee\tau}\right|\,\right],$$
where $\Gamma_{p,q}$ represents all the possible couplings of these two distributions. In the following, we show that the PML distribution estimator satisfies

Theorem 4. For any discrete distribution $p$, draw a sample $X^n\sim p$. With probability at least $1-4\exp(-\Omega(n^{1/3}))$, for any $w\in[1,\log n]$,
$$R_{\frac{w}{n\log n}}(p_\varphi, p)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$
For any $\tau\in[0,1]$, define the $\tau$-truncated sorted $\ell_1$-distance between two distributions $p, q$ over $\mathcal X$ as
$$\ell_\tau(p,q) := \min_{p':\ \{p'\}=\{p\}}\ \sum_x\big|\,p'(x)\vee\tau-q(x)\vee\tau\,\big|,$$
where $\{p\}$ denotes the probability multiset of $p$. By Fact 1 in Valiant & Valiant (2015), for any distributions $p,q$ over $\mathcal X$, $\ell_\tau(p,q)\le 2R_\tau(p,q)$, implying

Corollary 2. Under the same conditions as Theorem 4,
$$\ell_{\frac{w}{n\log n}}(p_\varphi, p)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$

6.2. Proof of Theorem 4

This section provides a sketch of the proof of Theorem 4. We relegate most technical details to Appendix A. Part of the proof is adapted from Theorem 2 in Valiant & Valiant (2015). The original reasoning is not sufficient for our purpose, as the error probability derived there is too large to invoke the competitiveness of PML. For this reason, we modify the linear program used in that paper, carefully separate the analysis of the estimators for large and small probabilities, and provide a refined analysis with tighter probability bounds by reducing union-bound arguments.

First, we define histograms and the relative earth-moving cost, and give an operational meaning to $R_\tau$. For a distribution $p$, the histogram of a multiset $A\subseteq\{p\}$ is a mapping, denoted $h_A:(0,1]\to\mathbb Z_{\ge0}$, that maps each number $y\in(0,1]$ to the number of times it appears in $A$. Note that every $y$ corresponds to a probability mass of $y\cdot h_A(y)$. More generally, we also allow generalized histograms $h$ with non-integral values $h(y)\in\mathbb R_{\ge0}$. For any $y_1,y_2\in(0,1]$, generalized histogram $h$, and nonnegative $m<y_1 h(y_1)$, we can move a probability mass of $m$ from location $y_1$ to $y_2$ by reassigning $h(y_1)-m/y_1$ to $y_1$ and $h(y_2)+m/y_2$ to $y_2$. Given $\tau\in[0,1]$, we define the cost associated with this operation as
$$c_{\tau,m}(y_1,y_2) := m\left|\log\frac{y_1\vee\tau}{y_2\vee\tau}\right|,$$
and term it the $\tau$-truncated earth-moving cost. The cost of multiple operations is additive. Note that $R_\tau(p,q)$ is the minimal total $\tau$-truncated earth-moving cost associated with any scheme of operations moving $h_{\{p\}}$ to yield $h_{\{q\}}$.
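To make these quantities concrete, the following sketch (our own illustration, with an arbitrary toy distribution) computes a sample's profile and the $\tau$-truncated sorted $\ell_1$ distance between a known distribution and the empirical one; by Fact 1 quoted above, the latter is at most $2R_\tau$.

```python
import numpy as np
from collections import Counter

def profile(sample):
    """The PML compressor: the multiset of multiplicities, ignoring symbol labels."""
    return tuple(sorted(Counter(sample).values()))

def sorted_l1_truncated(p, q, tau):
    """l_tau(p, q): match the sorted, tau-truncated probability values of p and q.
    Sorting both sides attains the minimum over relabelings of p."""
    a = np.sort(np.maximum(p, tau))
    b = np.sort(np.maximum(q, tau))
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(0)
N, n = 6, 40
p = np.array([0.3, 0.25, 0.2, 0.1, 0.1, 0.05])
sample = rng.choice(N, size=n, p=p)
print(profile(sample))

emp = np.bincount(sample, minlength=N) / n       # empirical probability multiset
w = 4.0
tau = w / (n * np.log(n))
print(sorted_l1_truncated(p, emp, tau))          # l_tau(p, emp) <= 2 R_tau(p, emp)
```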
For simplicity, we suppress $X^n$ in $\varphi_i(X^n)$ and $\mu_s(X^n)$, representing respectively the number of symbols appearing $i$ times and the number of times symbol $s$ appears. For absolute constants $B$ and $C$ satisfying $0.1>B>C>B/2>0$, define $x_n:=\frac{n^{B}+n^{C}}{n}$ and $S:=\{\frac{1}{n^2},\ldots,x_n\}$. Consider the following linear program (LP), shown in Figure 1: for each $x\in S$ it has an associated variable $v_x\ge 0$; its objective measures the discrepancy between the expected fingerprint counts $\sum_{x\in S}\mathrm{bin}(n,x,i)\,v_x$ and the observed $\varphi_i$; and the variables satisfy the mass constraint $\sum_{x\in S}x\,v_x=X$.

EXISTENCE OF A GOOD FEASIBLE POINT

Let $p$ be the underlying distribution and $h$ be its histogram. First, we show that with high probability, the linear program LP has a feasible point $(v_x)$ that is good in the following sense: 1) the corresponding objective value is relatively small; and 2) for $\tau\ge n^{-3/2}$, the generalized histogram $h_0:x\mapsto v_x$ is close to $h_n:y\mapsto h(y)\,\mathbb 1_{y\le x_n}$, admitting a low $\tau$-truncated earth-mover cost. In the appendix, we leverage the Chernoff bound and the union bound to show that with probability at least $1-\exp(-n^{1/3+\kappa})$, the objective value of the feasible point $(v_x)$ is at most
$$n^{B}\cdot O\big(n^{2/3+\kappa}+1\big)\ =\ O\big(n^{B+2/3+\kappa}\big).$$
For any $\tau\ge n^{-3/2}$, the minimal $\tau$-truncated earth-moving cost of moving the generalized histogram $h_0$ corresponding to $(v_x)$ and the histogram $h_n:y\mapsto h(y)\,\mathbb 1_{y\le x_n}$, so that they differ from each other only at $x=x_n$, is at most $\frac{\log n}{n^{3/2}}+\frac{1}{n^{2}}$.

ALL SOLUTIONS ARE GOOD SOLUTIONS

Let $(v_x)$ be the feasible point described above. The appendix then proceeds to show that for any solution $(v'_x)$ to LP whose objective value is $O(n^{B+2/3+\kappa})$, the generalized histogram $h_1$ corresponding to $(v'_x)$ is close to $h_0$. Specifically, the proof establishes an $O(1/\log n)$ discrepancy bound whenever $B=1.5C=10\kappa=0.01$.

Consolidating the previous results: for $w\in[1,\log n]$ and $\tau=w/(n\log n)$, with probability at least $1-\exp(-n^{1/3+\kappa})$, the solution to LP yields a generalized histogram $h_1$ such that the minimal $\tau$-truncated earth-moving cost of moving $h_1$ and $h_n$ so that they differ only at $x_n$ is $O(1/\sqrt{w})$.

COMPETITIVENESS OF PML

For the PML distribution associated with a sample satisfying our prior assumptions, denote by $h^{\mathrm{PML}}_n$ the histogram corresponding to its entries that are at most $x_n$. By a recent result of Han & Shiragur (2021), for any $w\in[1,\log n]$ and $\tau=w/(n\log n)$, the minimal $\tau$-truncated cost of moving $h^{\mathrm{PML}}_n$ and $h_n$ so that they differ only at $x_n$ is $O(1/\sqrt{w})$, with probability at least $1-2\exp(-\Omega(n^{1/3}))$.

PROPERTIES OF THE EMPIRICAL HISTOGRAM

Denote by $h^{\mathrm{EMP}}$ the empirical histogram. By the Chernoff bound, for any symbol $s$, the probability that $|n\,p(s)-\mu_s|\ge\mu_s^{3/4}$ and $\mu_s>n^{B}+2n^{C}$ is at most $2n\,p(s)\exp(-\Omega(n^{2C-B}))$; similarly, $n\,p(s)\ge n^{B}+4n^{C}$ and $\mu_s\le n^{B}+2n^{C}$ will happen with probability at most $2\exp(-\Omega(n^{2C-B}))$. Hence, we assume that $|n\,p(s)-\mu_s|<\mu_s^{3/4}$ for all symbols $s$ appearing more than $n^{B}+2n^{C}$ times, and that any symbol $s$ with probability $p(s)\ge(n^{B}+4n^{C})/n$ appears more than $n^{B}+2n^{C}$ times. By the union bound, we will be correct with probability at least $1-4n\exp(-\Omega(n^{2C-B}))$. Let $y_n:=(n^{B}+4n^{C})/n$ for notational convenience.
If for each symbol $s$ satisfying $\mu_s>n^{B}+2n^{C}$ we move a $\mu_s/n$ probability mass of $h^{\mathrm{EMP}}$ from $\mu_s/n$ to $p(s)$, then at all locations $y\ge y_n$, the total discrepancy between the resulting generalized histogram and the actual one is at most
$$\sum_{j>n^{B}+2n^{C}}\varphi_j\,\frac{j^{3/4}}{n}\ =\ \frac{1}{n}\sum_{j>n^{B}+2n^{C}}\varphi_j^{1/4}\,(\varphi_j\, j)^{3/4}\ \le\ n^{-B/4},$$
where the last step follows by Hölder's inequality. Moreover, the associated total earth-moving cost is at most
$$\sum_{s:\ \mu_s>n^{B}+2n^{C}}\frac{\mu_s}{n}\left|\log\frac{n\,p(s)}{\mu_s}\right|\ \le\ \sum_{j>n^{B}+2n^{C}}\varphi_j\,\frac{2j^{3/4}}{n}\ \le\ 2n^{-B/4}.$$
The $y_n$-truncated earth-mover distance between $h^{\mathrm{EMP}}$ and $h$ is thus at most $2n^{-B/4}\log n=O(n^{-B/5})$, which, together with the above error-probability bound, upper-bounds the expected value of $R_{y_n}(h^{\mathrm{EMP}},h)$ by
$$O(n^{-B/5})+4n\exp(-\Omega(n^{2C-B}))\cdot\log n\ =\ O(n^{-B/5}).$$
Moreover, changing any element of the sample sequence changes the value of $R_{y_n}(h^{\mathrm{EMP}},h)$ by at most $(\log n)/n$. Hence, by McDiarmid's inequality, with probability at least $1-2\exp(-2\sqrt{n})$, the value of $R_{y_n}(h^{\mathrm{EMP}},h)$ is less than $O(n^{-B/4}\log n)=O(n^{-B/5})$.

COMPETITIVENESS OF PML

Consider the PML histogram $h^{\mathrm{PML}}$ and its $y_n$-truncated earth-mover distance to $h$. Since there are at most $\exp(3\sqrt{n})$ different profiles (Hardy & Ramanujan, 1918), with probability at least $1-2\exp(-\Omega(\sqrt{n}))$,
$$R_{y_n}(h^{\mathrm{PML}},h)\ \le\ O(n^{-B/5}).$$

PERFORMANCE OF PML

We consolidate the previous results. For any $w\in[1,\log n]$ and $\tau=w/(n\log n)$, we design an earth-moving scheme that moves $h^{\mathrm{PML}}$ to $h$. First, with probability at least $1-2\exp(-\Omega(n^{1/3}))$, we can move the probability mass of $h^{\mathrm{PML}}_n$ so that it differs from $h_n$ only at $x_n=(n^{B}+n^{C})/n$, while incurring a $\tau$-truncated earth-mover cost of at most $O(1/\sqrt{w})$. Second, by the empirical-histogram argument and $x_n<y_n$, with probability at least $1-2\exp(-\Omega(n^{1/3}))$, we can further move the probability mass of $h^{\mathrm{PML}}$ at locations $y\ge x_n$ to coincide with $h$ above $y_n$, while incurring a loss of $O(n^{-B/5})$. After the previous two steps, the modified PML histogram differs from the actual histogram $h$ only at locations $y\in I_n:=[x_n,y_n]$. Note that the cost of moving a unit mass within $I_n$ is at most $3n^{C-B}$, implying that with probability at least $1-2\exp(-\Omega(n^{1/3}))$,
$$R_\tau(h^{\mathrm{PML}},h)\ \le\ O\Big(\frac{1}{\sqrt{w}}+n^{-B/5}+3n^{C-B}\Big)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$

7. Symmetric Functionals

In this section, we apply the idea of CML to estimate multisets of low probabilities and leverage the respective plug-in estimator to approximate several distribution functionals. The study of functional estimation dates back more than half a century (Carlton, 1969; Good, 1953; Good & Toulmin, 1956) and has steadily grown over the years. While the empirical-distribution plug-in estimator performs well in the large-sample regime, modern data-science applications often study high-dimensional data, for which more sophisticated methods lead to estimators that possess better guarantees (Jiao et al., 2015; Orlitsky et al., 2016; Valiant & Valiant, 2011a;b; Wu & Yang, 2016; Hao et al., 2018; Wu & Yang, 2019; Hao & Orlitsky, 2019b; 2020a;b).

Recently, a line of research works studied the use of PML in functional estimation (Orlitsky et al., 2004; 2011; Das, 2012; Acharya et al., 2012; 2017a; Hao & Orlitsky, 2019a; 2020b; Charikar et al., 2019a;b; Han & Shiragur, 2021). For the functionals addressed in this section, the plain PML is known to be sample-optimal only for additive errors $\varepsilon\ge 1/n^{1/3}$. Concurrently with our result, Charikar et al. (2019c) proposed a different PML-type functional estimation method with optimal sample complexity down to accuracy $\varepsilon\ge 1/\sqrt{n}$. We note that some specialized estimators also achieve the better accuracy of order $1/\sqrt{n}$.
Recently, an efficient algorithm (Anari et al., 2020) was proposed to compute the PML-type estimators in Charikar et al. (2019c). The techniques also apply to the algorithms presented in this section.

7.1. Compressor

For any distribution $p$ over $\mathcal X$, sample $X^n\sim p$, and multiplicity $i$, recall that $\varphi_i(X^n)$ denotes the number of symbols appearing exactly $i$ times. Our objective is to extract information about the low probabilities of $p$, regardless of symbol permutations. It is therefore natural to consider, for an integer $t\le n$, the compressor
$$\varphi(X^n) := \big(\varphi_i(X^n)\big)_{i=1}^{t},$$
where $t$ determines the inference horizon.

7.2. CML Estimator

Similar to previous sections, for a fixed $t$ and sample $X^n\sim p$, the CML estimator for the above compressor takes the form
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\big).$$
Equivalently, the CML estimator can be written as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
For space considerations, we relegate the rest of this section and the paper's appendix to the supplementary material.

The paper proposes a simple, novel, and unified compressed maximum likelihood (CML) approach for several fundamental tasks. The new technique bridges algorithms and results in several research directions, over both discrete and continuous domains.

Acknowledgements

We thank the reviewers for their helpful comments and are grateful to the National Science Foundation for supporting this work through grants CIF-1564355 and CIF-1619448. We also thank an anonymous reviewer for an excellent summary of our contribution, incorporated into Section 2.1, and another anonymous reviewer for suggesting the approximate CML estimator (Definition 1).

References

Acharya, J., Das, H., Jafarpour, A., Orlitsky, A., and Pan, S. Estimating multiple concurrent processes. In Proceedings of the 2012 IEEE International Symposium on Information Theory (ISIT), pp. 1628–1632. IEEE, 2012.

Acharya, J., Das, H., Orlitsky, A., and Suresh, A. T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pp. 11–21, 2017a.

Acharya, J., Diakonikolas, I., Li, J., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1278–1289. SIAM, 2017b.

Anari, N., Charikar, M., Shiragur, K., and Sidford, A. Instance based approximations to profile maximum likelihood. In Advances in Neural Information Processing Systems, 2020.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Carlton, A. G. On the bias of information estimates. Psychological Bulletin, 71(2):108, 1969.

Charikar, M., Shiragur, K., and Sidford, A. The Bethe approximation for structured matrices: an improved approximation for the profile maximum likelihood. In NeurIPS 2019 Workshop on Information Theory and Machine Learning, 2019a.

Charikar, M., Shiragur, K., and Sidford, A. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 780–791, 2019b.

Charikar, M., Shiragur, K., and Sidford, A. A general framework for symmetric property estimation. In Advances in Neural Information Processing Systems, 2019c.

Das, H. Competitive tests and estimators for properties of distributions. PhD thesis, UC San Diego, 2012.

Devroye, L. and Lugosi, G. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2012.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001.
Good, I. J. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

Good, I. J. and Toulmin, G. H. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2):45–63, 1956.

Han, Y. and Shiragur, K. On the competitive analysis and high accuracy optimality of profile maximum likelihood. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1317–1336. SIAM, 2021.

Han, Y., Jiao, J., and Weissman, T. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference On Learning Theory, pp. 3189–3221, 2018.

Hao, Y. and Orlitsky, A. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10989–11001, 2019a.

Hao, Y. and Orlitsky, A. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems (NeurIPS), pp. 11106–11116, 2019b.

Hao, Y. and Orlitsky, A. Data amplification: Instance-optimal property estimation. In International Conference on Machine Learning (ICML), pp. 4049–4059. PMLR, 2020a.

Hao, Y. and Orlitsky, A. Profile entropy: A fundamental measure for the learnability and compressibility of discrete distributions. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Hao, Y., Orlitsky, A., Suresh, A. T., and Wu, Y. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8834–8843, 2018.

Hao, Y., Jain, A., Orlitsky, A., and Ravindrakumar, V. SURF: A simple, universal, robust, fast distribution learning algorithm. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Hardy, G. H. and Ramanujan, S. Asymptotic formulae in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75–115, 1918.

Jiao, J., Venkat, K., Han, Y., and Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

Orlitsky, A., Santhanam, N. P., Viswanathan, K., and Zhang, J. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 426–435. AUAI Press, 2004.

Orlitsky, A., Santhanam, N. P., Viswanathan, K., and Zhang, J. On estimating the probability multiset. Online draft, 2011. URL http://alon.ucsd.edu/papers/pml1.pdf.

Orlitsky, A., Suresh, A. T., and Wu, Y. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

Valiant, G. and Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), pp. 685–694. ACM, 2011a.

Valiant, G. and Valiant, P. The power of linear estimators. In Proceedings of the 52nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 403–412. IEEE, 2011b.

Valiant, G. and Valiant, P. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 685–694, 2011c.

Valiant, G. and Valiant, P. Instance optimal learning. arXiv preprint arXiv:1504.05321, 2015.
Valiant, G. and Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 142–155, 2016.

Valiant, P. and Valiant, G. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pp. 2157–2165, 2013.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Wu, Y. and Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

Wu, Y. and Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.