Compressed Maximum Likelihood

Yi Hao 1    Alon Orlitsky 1

Maximum likelihood (ML) is one of the most fundamental and general statistical estimation techniques. Inspired by recent advances in estimating distribution functionals, we propose compressed maximum likelihood (CML), which applies ML to compressed samples. We show that CML is sample-efficient for several fundamental learning tasks over both discrete and continuous domains, including learning structured densities, estimating probability multisets, and inferring symmetric distribution functionals.

1. Introduction

Maximum likelihood (ML) is a strikingly simple yet fundamental statistical inference paradigm for estimating parameters in probabilistic data generation models. Over the past century, it has been applied to derive numerous important results in mathematics, statistics, and machine learning (Friedman et al., 2001).

At its core, the ML principle favors the model that maximizes the observed data's probability. Consider, for example, flipping a coin with unknown heads probability $p$ ten times and observing six heads and four tails. The observed sequence's probability is $p^6(1-p)^4$, and ML suggests that the best guess of $p$ is $\hat p = 0.6$, the value maximizing this probability.

ML is so natural and intuitive that it may be stumbled upon without realizing its depth and significance. Yet it can be shown to be near-optimal under relatively mild and general regularity conditions (Van der Vaart, 2000). In particular, $\hat p$ tends to the actual value of $p$ as the number of observations increases, and enjoys both consistency and efficiency.

This paper concerns functional estimation, where $\mathcal P$ is a distribution collection over a domain space $\mathcal Z$, and $f:\mathcal P\to\mathcal Q$ is a functional, mapping distributions in $\mathcal P$ into a space $\mathcal Q$ equipped with a pseudo-metric $d$.

1 Department of Electrical and Computer Engineering, University of California, San Diego, USA. Correspondence to: Yi Hao. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

An estimator for $f$ is a function $\hat f:\mathcal Z\to\mathcal Q$ such that when $Z\sim p$, the loss $d(\hat f(Z), f(p))$ is small. The ML plug-in estimator of $f$ first finds the ML distribution estimate
$$p_z := \arg\max_{p\in\mathcal P}\ p(z)$$
and then estimates the functional's value as $f(p_Z)$. For example, if $\mathcal P$ is a parametric family indexed by $\Theta$, and $f$ maps each $p_\theta\in\mathcal P$ to its index $\theta\in\Theta$, the task becomes the classical parameter estimation problem, and ML maps the sample $Z$ to the parameter $\theta$ maximizing $p_\theta(Z)$.
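As a small illustration of the plug-in principle (ours, not part of the paper), the sketch below brute-forces the ML distribution over a finite candidate grid and plugs it into a functional; the grid of biased coins and the entropy functional are illustrative choices only.

```python
import numpy as np

def ml_plugin(sample, candidates, functional):
    """Brute-force ML plug-in: pick the candidate maximizing the sample's
    probability, then report the functional's value at that candidate."""
    loglik = [np.sum(np.log(p[sample])) for p in candidates]  # i.i.d. log-likelihoods
    p_ml = candidates[int(np.argmax(loglik))]
    return p_ml, functional(p_ml)

entropy = lambda p: -np.sum(p * np.log(p))

# Coin example from the text: six heads (symbol 1) and four tails (symbol 0).
sample = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
grid = [np.array([1 - b, b]) for b in np.linspace(0.05, 0.95, 19)]
p_hat, ent_hat = ml_plugin(sample, grid, entropy)
print(p_hat)    # [0.4, 0.6], i.e. the ML estimate p_hat = 0.6
print(ent_hat)  # plug-in estimate f(p_hat) of the entropy
```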
1.1. Statistical Guarantees of ML Methods

ML is particularly significant in modern scientific applications that often involve complex structures in high dimensions. Deriving a bespoke estimator for each application may be an arduous task. Yet ML provides a simple universal methodology that, in principle, can be applied to any problem. Unfortunately, it is well known that for finite samples, ML estimators may be suboptimal, and other estimators often concentrate faster around the true parameters. Hence, establishing finite-sample guarantees for ML-based methods is of fundamental importance.

While the ML principle is quite natural, showing its finite-sample efficiency is often not easy. Recent work by Acharya et al. (2017a) provided a simple argument for bypassing the difficulty in analyzing ML methods. Their primary lemma, stated next, relies only on the fact that $p_z(z)\ge p(z)$ for any $z\in\mathcal Z$. For simplicity, we write $d_f(p,q)$ for $d(f(p), f(q))$.

Lemma 1. For any $\mathcal Z$, $\mathcal P$, estimator $\hat f$, and accuracy $\varepsilon>0$,
$$\max_{p\in\mathcal P}\ \Pr_{Z\sim p}\big(d_f(p, p_Z) > 2\varepsilon\big)\ \le\ |\mathcal Z|\cdot\max_{p\in\mathcal P}\ \Pr_{Z\sim p}\big(d(f(p), \hat f(Z)) > \varepsilon\big).$$

Namely, the ML plug-in estimator is competitive with all other estimators in that if any other estimator achieves loss at most $\varepsilon$ with probability $1-\delta$, then the ML plug-in estimator achieves loss at most $2\varepsilon$ with probability at least $1-|\mathcal Z|\,\delta$. Leveraging this lemma, the paper showed that the ML-based profile maximum likelihood (PML) probability-multiset estimator in Orlitsky et al. (2004) is sample-optimal for several symmetric functionals, including entropy and support size. The profile of a sample represents the number of elements appearing any given number of times, and hence is a sufficient statistic for symmetric functionals. Therefore, instead of maximizing the sample's probability, Acharya et al. (2017a) maximized the probability of the sample's profile.

1.2. Compressed Maximum Likelihood

Motivated by the above rationale, our paper develops a general framework for applying and analyzing ML for several statistical and machine-learning problems. It does so by adding a compressor between the sample and the ML learner. Specifically, given a domain $\mathcal Z$, a compressor is a function $\varphi$ that maps each $z\in\mathcal Z$ to an element $\varphi(z)$ in some codomain $\Phi$. In general, we allow $\varphi$ to possess randomness independent of the samples. Given compressor $\varphi$, we define the $\varphi$-compressed maximum likelihood (CML) estimator as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ p(\varphi(Z)),$$
where for notational brevity, we suppress $Z$ in $p_\varphi(Z)$.

2. Main Results

2.1. Compression for Learning

Consider an arbitrary compressor $\varphi$, mapping elements in $\mathcal Z$ to a codomain $\Phi$. We evaluate the compressor's quality for our learning objective via the following two criteria.

Typicality   A compressor is $(m,\gamma)$-typical for an integer $m$ and probability threshold $\gamma\in(0,1)$ if for every $p\in\mathcal P$, there is an $m$-element subset $T\subseteq\Phi$ with probability $p(T)\ge 1-\gamma$. Intuitively, smaller $m$ corresponds to a better compressor.

Learnability   Given error parameters $\varepsilon,\delta$, we say that the compressor enables $(\varepsilon,\delta)$-learning under pseudo-metric $d$ if there is an algorithm $\mathcal A:\Phi\to\mathcal Q$ satisfying
$$\Pr_{Z\sim p}\big(d(f(p), \mathcal A(\varphi(Z))) > \varepsilon\big)\ \le\ \delta,\qquad\forall\, p\in\mathcal P.$$
Intuitively, the smaller the error parameters, the better the compressor is for our learning tasks.

The theorem below shows that for any good-quality compressor, the respective CML plug-in estimate will also be accurate, with high probability.

Theorem 1. For any compressor $\varphi$ that is $(m,\gamma)$-typical and enables $(\varepsilon,\delta)$-learning, distribution $p\in\mathcal P$, and $Z\sim p$,
$$\Pr\big(d_f(p, p_\varphi(Z)) > 2\varepsilon\big)\ \le\ \gamma + m\,\delta.$$

Proof. We prove the theorem by classifying all possible patterns $\phi\in\Phi$ into three categories. Given any distribution $p\in\mathcal P$, let $T$ be the smallest set with $p(T)\ge 1-\gamma$. By definition, $|T|\le m$. For any pattern $\phi\in T$ that satisfies $p(\phi)>\delta$: since the compressor enables $(\varepsilon,\delta)$-learning with an algorithm $\mathcal A$, we must have $d(f(p), \mathcal A(\phi))\le\varepsilon$. By the definition of CML, $p_\varphi(\phi)\ge p(\phi)>\delta$, hence $d(f(p_\varphi), \mathcal A(\phi))\le\varepsilon$. The triangle inequality combines both and yields $d_f(p_\varphi, p)\le 2\varepsilon$. Next, consider $\phi\in T$ satisfying $p(\phi)\le\delta$. By $|T|\le m$, the total probability of all such patterns is at most $m\delta$. Finally, the total probability of patterns $\phi\notin T$ is at most $\gamma$ since $p(T)\ge 1-\gamma$. Therefore, the probability that $d_f(p_\varphi, p)>2\varepsilon$ is at most $m\delta+\gamma$, which completes the proof.
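The definitions above translate directly into a brute-force procedure on small discrete instances. The sketch below is our own illustration, under toy assumptions (a finite class $\mathcal P$, a 3-symbol alphabet, and sample size 5, so exhaustive enumeration is feasible); it computes $p(\varphi(z))$ exactly and returns the CML distribution. The profile compressor used here anticipates the PML estimator of Section 6.

```python
import itertools
import numpy as np

def compressed_likelihood(p, phi_value, phi, k, n):
    """Pr_{Y^n ~ p}(phi(Y^n) = phi_value), by exhaustive enumeration (tiny n only)."""
    total = 0.0
    for y in itertools.product(range(k), repeat=n):
        if phi(y) == phi_value:
            total += np.prod([p[s] for s in y])
    return total

def cml(sample, candidates, phi, k):
    """Compressed maximum likelihood: maximize the probability of phi(sample)."""
    target = phi(tuple(sample))
    scores = [compressed_likelihood(p, target, phi, k, len(sample)) for p in candidates]
    return candidates[int(np.argmax(scores))]

# Example compressor: the sample's profile (multiset of multiplicities),
# which turns CML into the PML estimator discussed later in the paper.
def profile(y):
    return tuple(sorted(y.count(s) for s in set(y)))

k = 3
sample = [0, 1, 0, 2, 0]
candidates = [np.ones(k) / k,
              np.array([0.6, 0.2, 0.2]),
              np.array([0.8, 0.1, 0.1])]
print(cml(sample, candidates, profile, k))
```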
Similar to Acharya et al. (2017a), it suffices to obtain a CML approximation for the competitive guarantees to hold.

Definition 1 (Approximate CML (ACML)). For any $\beta\le 1$, $z\in\mathcal Z$, and compressor $\varphi$, a distribution $p^\beta_\varphi(z)\in\mathcal P$ is a $\beta$-approximate CML if $p^\beta_\varphi(z)(\varphi(z))\ \ge\ \beta\cdot p_\varphi(z)(\varphi(z))$.

Corollary 1. For any $(m,\gamma)$-typical compressor $\varphi$ that enables $(\varepsilon,\delta)$-learning, distribution $p\in\mathcal P$, $Z\sim p$, and a $\beta$-approximate CML $p^\beta_\varphi(Z)$,
$$\Pr\big(d_f(p, p^\beta_\varphi(Z)) > 2\varepsilon\big)\ \le\ \gamma + m\,\delta/\beta.$$

Building on this framework, we design and analyze CML estimators for various applications. For each, we add a compressor between the samples and the estimator. The main challenges are finding a good compressor $\varphi$ that works well with ML and an effective algorithm $\mathcal A$ for the task.

2.2. CML Estimators

The previous sections presented the general CML framework, and Theorem 1 demonstrated its statistical competitiveness. The remaining sections describe concrete CML estimators for four statistical inference tasks, over both continuous and discrete domains, described below.

Continuous distributions   Section 4 applies the CML method to learn structured continuous i.i.d. distributions. We consider a broad distribution class $\mathcal P$ in which the difference $p-q$ between any two distributions $p,q\in\mathcal P$ has essentially at most $s$ sign changes. This class encompasses numerous essential distributions such as log-concave densities, piecewise polynomials, and Gaussian mixtures. Theorem 2 shows that for any $p\in\mathcal P$, with sample size $\Theta(s\log(s/\varepsilon)/\varepsilon^3)$, the CML estimator achieves $\Pr(\|p-p_\varphi\|_1>\varepsilon)\le O(\varepsilon)$.

Note that Yatracos' method, described in Section 3, finds the approximate minimizer of the $\mathcal A_k$-distance to the empirical distribution and achieves better sample efficiency. However, in general, no efficient algorithm is known for finding such a minimizer (see, e.g., Theorem 6.4 in Devroye & Lugosi, 2012). By contrast, the CML approach transforms the learning task into a maximum likelihood problem. This establishes the efficiency of ML methods for learning structured distributions and enables the use of numerous ML optimization algorithms. Some of these algorithms, like the EM algorithm (Bishop, 2006), are heuristic, while others are rigorous. Learning distributions with bounded crossings was also considered in Acharya et al. (2017b). However, they assume access to an $\mathcal A_k$-projection oracle that finds the approximate $\mathcal A_k$-distance minimizers to the empirical distribution.

Discrete distributions   Section 5 applies CML to learn discrete i.i.d. distributions where, as above, every two distributions cross at most $s$ times. We present two different CML formulations. The first formulation, in Section 5.1, applies the compressor of Section 4 along with a random mapping that transforms the sample from a discrete distribution into a continuous analog while maintaining the structural properties. The second formulation constructs CML directly from the discrete sample. The approach essentially performs maximum likelihood on the quantized empirical distribution, with a well-tuned quantization level. The resulting ML objective, presented in Section 5.2, resembles that of the PML estimator in Acharya et al. (2017a). Theorem 3 shows that for any distribution $p$ with support $\{1,\ldots,N\}$ satisfying $N\le s/\varepsilon$, given a sample from $p$ of size $n=\Theta(s\log(N)/\varepsilon^3)$, with probability at least $1-e^{-s}$, $\|p-p_\varphi\|_1\le O(\varepsilon)$.

Probability multisets   Section 6 is geared towards the original PML method, a special CML whose compressor $\varphi$ maps samples to profiles (Orlitsky et al., 2004; 2011; Das, 2012; Acharya et al., 2012; 2017a; Hao & Orlitsky, 2019a; 2020b; Charikar et al., 2019a;b; Han & Shiragur, 2021). PML yields a natural estimate for the distribution's probability multiset.
Given support bound $N$ and desired accuracy $\varepsilon$, it is necessary (in the worst case) to obtain $\Theta(N/(\varepsilon^2\log N))$ observations from the distribution to estimate its probability multiset under the sorted $\ell_1$-distance (Valiant & Valiant, 2011c; 2013; Han et al., 2018; Hao & Orlitsky, 2019a). Over the years, a sequence of works established the optimality of PML for multiset learning in sorted $\ell_1$-distance at different accuracy levels (Das, 2012; Hao & Orlitsky, 2019a; Han & Shiragur, 2021).

A different multiset-learning guarantee, stated in terms of the earth-mover's distance, was considered in Valiant & Valiant (2016). Theorem 4 shows that PML also enjoys this learning guarantee, hence is sample-optimal for both the sorted $\ell_1$-distance and the $\tau$-truncated relative earth-mover distance. For any level $\tau\in[0,1]$, the latter metric is
$$R_\tau(p,q) := \inf_{\gamma\in\Gamma_{p,q}}\ \mathbb{E}_{(X,Y)\sim\gamma}\left[\,\left|\log\frac{\max\{p(X),\tau\}}{\max\{q(Y),\tau\}}\right|\,\right],$$
where $\Gamma_{p,q}$ represents all possible couplings of $p$ and $q$. Section 6.2 shows that with probability at least $1-O(1/n)$, for any $w\in[1,\log n]$, the compressed estimator $p_\varphi$ achieves $R_{\frac{w}{n\log n}}(p_\varphi, p)=O\big(1/\sqrt{w}\big)$.

Distribution functionals   Section 7 explores the application of CML to functional estimation. The compressor maps each sample sequence to the multiset of multiplicities that are small relative to $n$, leading to a unified algorithm that optimally learns several symmetric functionals.

3. Preliminaries

General notation   The rest of the paper considers two types of univariate domains $\mathcal X$: continuous, $\mathbb R$, and discrete, $[N]:=\{1,\ldots,N\}$. We consider distributions $p$ over $\mathcal X$. Given a domain partition $\mathcal I$, the quantized distribution $p_{\mathcal I}$ assigns to each part $I\in\mathcal I$ the probability $p(I)$. More generally, $\mathcal I$ can be any collection of disjoint measurable subsets of $\mathcal X=\mathbb R$, and $f$ can be any real measurable function over $\mathbb R$. Then $f_{\mathcal I}$ becomes the quantized function that assigns to any $I\in\mathcal I$ the value $\int_{x\in I} f(x)\,dx$, and to the remaining set $\mathcal X\setminus\bigcup_{I\in\mathcal I} I$ a probability mass of $0$. Slightly abusing notation, we use $f_{\mathcal I}$ also to represent the flattened distribution that assigns to every $x$ in any $I\in\mathcal I$ of positive measure $|I|>0$ the value $f_{\mathcal I}(x):=f_{\mathcal I}(I)/|I|$. Finally, write $a\wedge b$ for $\min\{a,b\}$ and $a\vee b$ for $\max\{a,b\}$.

Function norms   We utilize a few norms of functions. The $\ell_1$-norm evaluates $\|f\|_1:=\int|f(x)|\,dx$. Another norm that will come in handy is the $\mathcal A_k$-norm. Specifically, for any positive integer $k$, let $\mathbb I_k$ denote the collection of all unions of $k$ disjoint intervals. Then, for any measurable function $f:\mathbb R\to\mathbb R$, the $\mathcal A_k$-norm of $f$ is
$$\|f\|_{\mathcal A_k} := \sup_{\mathcal I\in\mathbb I_k}\ \|f_{\mathcal I}\|_1,$$
where $f_{\mathcal I}$ quantizes $f$ over the $k$ intervals forming $\mathcal I$. Note that the above notions naturally extend to $\mathcal X=[N]$ if we replace the integrals with the respective finite sums.

VC inequality   A well-known convergence bound associated with the $\mathcal A_k$-norm is the VC inequality (Devroye & Lugosi, 2012). Often, the inequality appears as an upper bound on the expected $\mathcal A_k$-norm of the difference between the empirical and actual distributions. We leverage the following variant from Acharya et al. (2017b), which makes the error probability explicit.

Lemma 2 (VC inequality). For any $\varepsilon,\delta>0$ and continuous distribution $p$, draw a sample $X^n\sim p$ and let $\hat p$ denote the empirical distribution. For $n=\Theta((k+\log(1/\delta))/\varepsilon^2)$, with probability at least $1-\delta$, $\|\hat p-p\|_{\mathcal A_k}\le\varepsilon$.
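Since the $\mathcal A_k$-norm drives much of the subsequent analysis, the following sketch computes it exactly on the discrete domain $[N]$ with a small dynamic program; the discrete restriction and the DP formulation are our own simplification, not a procedure from the paper.

```python
import numpy as np

def ak_norm(f, k):
    """A_k-norm of a function f on [N]: the largest total |mass| captured by at
    most k disjoint intervals, i.e. max over unions I of sum_i |f(I_i)|.
    Equivalently, maximize sum_j s_j * (sum of f over interval j) with s_j = +-1."""
    NEG = -np.inf
    # best[j, state]: best value so far with j intervals opened;
    # state 0 = outside an interval, 1 = inside a '+' interval, 2 = inside a '-' interval.
    best = np.full((k + 1, 3), NEG)
    best[0, 0] = 0.0
    for x in f:
        new = np.full((k + 1, 3), NEG)
        for j in range(k + 1):
            for state in range(3):
                v = best[j, state]
                if v == NEG:
                    continue
                # skip x (closing any open interval)
                new[j, 0] = max(new[j, 0], v)
                # extend the currently open interval through x
                if state == 1:
                    new[j, 1] = max(new[j, 1], v + x)
                if state == 2:
                    new[j, 2] = max(new[j, 2], v - x)
                # open a new interval starting at x (uses one of the k slots)
                if j < k:
                    new[j + 1, 1] = max(new[j + 1, 1], v + x)
                    new[j + 1, 2] = max(new[j + 1, 2], v - x)
        best = new
    return float(best.max())

p = np.array([0.5, 0.1, 0.1, 0.3])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(ak_norm(p - q, 1), ak_norm(p - q, 2))  # A_1 and A_2 distances between p and q
```

For two distributions over $[N]$, $\|p-q\|_{\mathcal A_k}$ coincides with $\|p-q\|_1$ once $k$ exceeds the number of sign changes of $p-q$, which is how the crossing-number assumption enters the analysis.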
4. Structured Continuous Distributions

A distribution collection $\mathcal C$ is an $\varepsilon$-cover for $\mathcal P$ under the $\ell_1$-distance if every distribution in $\mathcal P$ is within $\ell_1$-distance $\varepsilon$ of some distribution in $\mathcal C$. The crossing number of two real functions $f$ and $g$ is the number of times they cross each other, or equivalently, the number of sign changes of $f(x)-g(x)$. The crossing number of a distribution collection is the largest crossing number of any pair of distributions in the collection.

Numerous important distribution families have $\varepsilon$-covers whose crossing number grows moderately as $\varepsilon$ decreases. For example, $t$-piecewise, degree-$d$ polynomials have $0$-covers with crossing number $O(t(d+1))$, and $t$-mixtures of univariate Gaussians have $\varepsilon$-covers with crossing number $O(t\log(1/\varepsilon))$, growing only logarithmically in $1/\varepsilon$.

4.1. Compressor

Let $\mathcal P$ have an $\varepsilon$-cover with crossing number $s$. Draw a sample $Y^n$ from an unknown $p\in\mathcal P$ and let $\hat q$ be the empirical distribution. Partition the real line into $t:=s/\varepsilon$ intervals, $\mathcal I:=(I_i)_{i=1}^t$, such that $\hat q_{\mathcal I}$ is uniform, where for simplicity we assume that $s/\varepsilon$ is an integer. Next, draw an independent sample $X^n\sim p$. Let $\hat p$ denote its empirical distribution, and let the compressor $\varphi$ be its flattened version over $\mathcal I$, $\varphi(X^n):=\hat p_{\mathcal I}$. Note that $\varphi(X^n)$ depends implicitly on $Y^n$, and its definition consists of both the $Y^n$-induced partition and the empirical probabilities of each part according to $X^n$.

4.2. CML Estimator

Given a sample $Y^n$ and the corresponding interval partition $\mathcal I$, the respective CML estimator is
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\,\big|\,\mathcal I\big).$$
Note that the probability term corresponds to a multinomial distribution. We can rewrite the CML estimator as
$$p_\varphi = \arg\min_{p\in\mathcal P}\ H(\hat p_{\mathcal I},\, p_{\mathcal I}).$$
Hence, CML minimizes the cross-entropy between the quantized empirical distribution and the actual one.

Theorem 2. For any $\varepsilon$ and $p\in\mathcal P$, draw $(X^n, Y^n)\sim p$. If $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, then with probability at least $1-2\varepsilon$, $\|p-p_\varphi\|_1\le 34\varepsilon$.

Note that by traditional VC theory, the empirical $\mathcal A_k$-norm minimizer based on $Y^n$ alone achieves a sample complexity of $\Theta(t/\varepsilon)$. However, this method is usually not used for learning structured densities. The theorem, on the other hand, shows that the practically ubiquitous ML technique can also be applied to learn structured continuous distributions with a rigorous guarantee. It thereby demonstrates a different, continuous, application of Theorem 1 beyond the original application to discrete domains. The use of the samples $X^n$, while not strictly necessary, facilitates the application of the competitive argument. The rest of the section is devoted to the theorem's proof.

4.3. Typical Events

First, we leverage standard concentration inequalities to establish several claims, each of which holds with high probability. The subsequent analysis will assume that all the claims hold.

By construction, each interval $I_i\in\mathcal I$ carries an empirical probability mass of $\hat q(I_i)=\varepsilon/s$. The respective probability mass under the actual density, $p(I_i)$, turns out to follow a $\mathrm{Beta}(n/t,\ n(1-1/t))$ distribution. The following inequality (Hao et al., 2020) characterizes the typical values of $p(I_i)$.

Lemma 3 (Hao et al., 2020). For any $\alpha\in[0,1]$ and $1\le t\le n$, the probability that $p(I_i)$ deviates from its mean $1/t$ by more than $\alpha$ is at most $2\exp(-\Omega(n\alpha^2 t))$.

For any $\gamma\in(0,1)$ and $n\ge 12t\log(2t/\gamma)$, choosing $\alpha=1/(2t)$ in the lemma yields
$$\Pr\big(\forall\, i\in[t]:\ p(I_i)\in(1/(2t),\,3/(2t))\big)\ \ge\ 1-\gamma.$$
Henceforth, we assume that $n\ge 12t\log(2t/\gamma)$ and that for each $i$, the actual probability mass of $I_i$ falls in $(1/(2t), 3/(2t))$.

Next, consider the second sample $X^n$ and an arbitrary index $i\in[t]$. By independence, $n\hat p(I_i)\sim\mathrm{bin}(n, p(I_i))$, where $p(I_i)>1/(2t)$ as above. From the binomial Chernoff and union bounds, with probability at least $1-\gamma$, for all $i\in[t]$,
$$|\hat p(I_i)-p(I_i)|\ \le\ 2\sqrt{p(I_i)\log(2t/\gamma)/n}.$$
Our lower bounds on $n$ and $p(I_i)$ also imply that for all $i$,
$$\hat p(I_i)\ \le\ p(I_i)+2\sqrt{p(I_i)\log(2t/\gamma)/n}\ <\ \frac{3}{t}.$$
Below we assume that $\hat p$ satisfies these inequalities.
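Before turning to the guarantees, the following minimal sketch (ours; it uses a Gaussian location family as a stand-in for $\mathcal P$ and a grid search in place of a generic ML solver) illustrates the Section 4.1 compressor and the cross-entropy form of the CML objective.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def build_partition(y_sample, t):
    """Section 4.1 compressor, step 1: split the line into t intervals carrying
    equal empirical mass under the first sample Y^n (empirical quantiles)."""
    qs = np.quantile(y_sample, np.linspace(0, 1, t + 1))
    qs[0], qs[-1] = -np.inf, np.inf
    return qs

def flatten(x_sample, qs):
    """Step 2: the compressed statistic, i.e. the second sample's empirical mass
    in each interval (the flattened empirical distribution over I)."""
    idx = np.searchsorted(qs, x_sample, side="right") - 1
    counts = np.bincount(idx, minlength=len(qs) - 1)
    return counts / counts.sum()

def gaussian_cdf(z, mu):
    return 0.5 * (1.0 + erf((z - mu) / sqrt(2.0)))

def cml_objective(mu, phi_hat, qs):
    """Cross-entropy H(phi_hat, p_I): the quantity the CML distribution minimizes."""
    p_mass = np.diff(np.array([gaussian_cdf(z, mu) for z in qs]))
    return -np.sum(phi_hat * np.log(p_mass + 1e-12))

# Toy run: data from N(0,1); the candidate class is {N(mu,1)} on a grid.
n, t = 2000, 50
y, x = rng.normal(size=n), rng.normal(size=n)
qs = build_partition(y, t)
phi_hat = flatten(x, qs)
best_mu = min(np.linspace(-1, 1, 41), key=lambda m: cml_objective(m, phi_hat, qs))
print(best_mu)  # should land near 0
```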
4.4. Guarantees of CML

Given $Y^n$'s empirical distribution $\hat q$, hence also $\mathcal I$, there are finitely many flattened $X^n$-distributions $\hat p_{\mathcal I}$ over $\mathcal I$. Given a probability threshold $\delta$, we consider two disjoint cases according to the probability of a particular distribution $\hat p_{\mathcal I}$.

Likely $\hat p_{\mathcal I}$: The probability $p(\hat p_{\mathcal I})>\delta$. We show that for such $\hat p_{\mathcal I}$, with high probability, the CML distribution $p_\varphi$ is close to $p$. The following argument assumes that the claims made in Section 4.3 hold. First, since $\hat p(I_i)=\hat p_{\mathcal I}(I_i)<3/t$ for all $i$, for any distribution $p'$ satisfying $\|p'-\hat p_{\mathcal I}\|_{\mathcal A_s}>7\varepsilon$,
$$\|p'-\hat p\|_{\mathcal A_s}\ \ge\ \|p'-\hat p_{\mathcal I}\|_{\mathcal A_s}-\|\hat p_{\mathcal I}-\hat p\|_{\mathcal A_s}\ >\ 7\varepsilon-2s\cdot\frac{3}{t}\ =\ \varepsilon.$$
By the VC inequality, this event happens with probability at most $\delta/2$ for a sample size of $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, contradicting our assumption that $p(\hat p_{\mathcal I})>\delta$. Therefore, $\|p-\hat p_{\mathcal I}\|_{\mathcal A_s}\le 7\varepsilon$. On the other hand, a fundamental attribute of the ML method is that $p_\varphi(\hat p_{\mathcal I})\ge p(\hat p_{\mathcal I})>\delta$. Hence, $\|p_\varphi-\hat p_{\mathcal I}\|_{\mathcal A_s}\le 7\varepsilon$ as well.

For any distribution $q\in\mathcal P$, we denote by $q'$ the closest distribution to $q$ in the $\varepsilon$-cover mentioned above under the $\ell_1$-distance. From the above and the triangle inequality,
$$\|p_\varphi-p\|_1\ \le\ \|p'_\varphi-p'\|_1+2\varepsilon\ \le\ 2\big(\|p-\hat p_{\mathcal I}\|_{\mathcal A_s}+\|p_\varphi-\hat p_{\mathcal I}\|_{\mathcal A_s}\big)+6\varepsilon\ \le\ 34\varepsilon,$$
where the second inequality follows as $\|q_1-q_2\|_1=2\sup_{A\subseteq\mathbb R}|q_1(A)-q_2(A)|$.

Unlikely $\hat p_{\mathcal I}$: The probability $p(\hat p_{\mathcal I})\le\delta$. Note that the number of possible $\hat p_{\mathcal I}$'s is at most
$$\binom{t+n-1}{t}\ \le\ e^{\,t\log(e(1+n/t))}.$$
Denote by $L_{t,n}$ the exponent of the last term. Setting $\delta=e^{-2L_{t,n}}$ and $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, the total probability of $\hat p_{\mathcal I}$'s falling into this category is at most
$$e^{L_{t,n}}\cdot e^{-2L_{t,n}}\ =\ e^{-L_{t,n}}\ \le\ \varepsilon/t,$$
where the last step follows by choosing a sufficiently large absolute constant in the above asymptotic expression for $n$.

Summary   Choosing $\gamma=\varepsilon/2$ in Section 4.3, if $n=\Theta(t\log(t/\varepsilon)/\varepsilon^2)$, the union bound implies that with probability at least $1-2\varepsilon$, $\|p_\varphi-p\|_1\le 34\varepsilon$.

5. Structured Discrete Distributions

Moving from $\mathbb R$ to the discrete domain $[N]$, we apply CML to learn structured discrete distributions. Again, we assume that $\mathcal P$ has an $\varepsilon$-cover $\mathcal C$ with crossing number at most $s$.

5.1. Reduction from Continuous CML

First, we present a concrete random mapping from discrete to continuous domains that enables the use of the continuous CML estimator. For any distribution $p$ over $[k]$, we define its flattened version $\bar p$ as the continuous distribution that assigns $\bar p(x):=p(\lceil x\rceil)$ for $x\in(0,k]$ and $\bar p(x):=0$ for $x\notin(0,k]$. This yields a bijective mapping that maintains the number of sign changes of the difference between any two distributions.

Note that we only have sample access to the underlying distribution $p$. To apply the CML estimator of Section 4, we need to simulate a sample from $\bar p$ given $X^n\sim p$. A random mapping $S$ for this purpose counts the number of times each symbol $i\in[N]$ is observed in $X^n$, say $n_i$, and respectively draws $n_i$ independent sample points from the uniform distribution over $(i-1, i]$. Now, it is straightforward to apply the continuous CML. Specifically, draw samples $(X^n, Y^n)\sim p$, define the compressor $\varphi$ based on $S(Y^n)$, and let the CML estimate be
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(S(X^n))\,\big|\,\mathcal I\big).$$

5.2. Discrete CML

Next, we present an alternative CML method that applies directly to discrete samples.

Compressor   Draw a size-$n$ sample $X^n\sim p$, and denote by $\hat p$ the empirical distribution. Then, sort the elements of $X^n$ in non-descending order, say $X_{(1)},\ldots,X_{(n)}$, such that $X_{(i)}\le X_{(j)}$ for $i<j$. Without loss of generality, assume that $t:=s/\varepsilon$ is an integer and that $n$ is divisible by $t$. Sequentially partition the sequence into $t$ sub-sequences, such that the $i$-th sub-sequence contains the observations with indices from $(i-1)n/t+1$ to $in/t$. The compressor $\varphi$ maps every sample $X^n$ to the boundary symbols of these sub-sequences. Specifically,
$$\varphi(X^n) := \big(X_{(in/t)}\big)_{i=1}^{t},$$
which is essentially $\hat p_{\mathcal I}$ with $\mathcal I:=(I_i)_{i=1}^t$ and $I_i:=[X_{((i-1)n/t)}:X_{(in/t)}]$, $i\in[t]$.
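For concreteness, here is a minimal sketch (our own illustration, using a toy distribution over $[4]$) of this compressor and of the Section 5.1 random mapping $S$.

```python
import numpy as np

def discrete_compressor(sample, t):
    """Section 5.2 compressor: sort the sample and keep the t boundary symbols
    X_(n/t), X_(2n/t), ..., X_(n); for simplicity the sketch assumes t divides n."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    assert n % t == 0
    return x[np.arange(1, t + 1) * (n // t) - 1]    # 0-indexed boundary positions

def random_flattening(sample):
    """Section 5.1 mapping S: replace each occurrence of symbol i by an
    independent uniform draw from (i-1, i], simulating the flattened density."""
    rng = np.random.default_rng(1)
    x = np.asarray(sample, dtype=float)
    return x - rng.uniform(0.0, 1.0, size=len(x))

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.2, 0.1])                  # a toy distribution over [4]
sample = rng.choice(np.arange(1, 5), size=100, p=p)
print(discrete_compressor(sample, t=5))             # the 5 boundary symbols
print(random_flattening(sample)[:5])                # a continuous surrogate sample
```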
CML estimator   Consequently, the CML estimator for the above compressor takes the form
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\,\big|\,\mathcal I\big).$$
Equivalently, the CML estimator can be written as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
The estimator often performs well, as described below.

Theorem 3. For $p\in\mathcal P$ and $X^n\sim p$, if $n=\Theta\big((t\vee N)\log(t\vee N)/\varepsilon^2\big)$, then with probability at least $1-e^{-(t\vee N)}$, $\|p-p_\varphi\|_1=O(\varepsilon)$.

5.3. Hypothesis Selection Algorithm

First, we utilize the assumptions and the standard VC inequality to show the existence of a hypothesis selection algorithm that, with high probability, finds an accurate estimate of the underlying distribution. By the VC inequality, for any $\varepsilon,\delta>0$ and $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, with probability at least $1-\delta/2$,
$$\|\hat p-p\|_{\mathcal A_s}\ \le\ \varepsilon.$$
Denote by $\mathbb I_s$ the set of unions of at most $s$ disjoint intervals. By our construction, for any set $I\in\mathbb I_s$,
$$|\hat p(I)-\hat p_{\mathcal I}(I)|\ \le\ 2\varepsilon.$$
For any $q\in\mathcal P_{s,\varepsilon}$, let $q'$ be a distribution in the $\varepsilon$-cover $\mathcal C$ with minimal $\ell_1$-distance to $q$. For any distribution pair $q_1,q_2\in\mathcal P_{s,\varepsilon}$, define the following variant of the Scheffé set (Devroye & Lugosi, 2012) as $S'_{12}:=\{x\in\mathbb R:\ q'_1(x)>q'_2(x)\}$, and the original Scheffé set as $S_{12}$. Given the previous inequality and the distribution $\hat p_{\mathcal I}$, we can approximate $\hat p(q_1>q_2):=\hat p(S_{12})$ to a $3\varepsilon$ additive error by summing up $\hat p_{\mathcal I}(I_i)$ over the indices $i$ satisfying $I_i\subseteq S'_{12}$. Note that we assumed that $q'_1-q'_2$ has at most $s$ sign changes.

Next, we show that if the VC inequality holds, there exists a selection algorithm that uses $\hat p_{\mathcal I}$ to find a density $p^*\in\mathcal C$ satisfying $\|p^*-p\|_1\le O(\varepsilon)$. In addition, this algorithm is a variant of that in Theorem 6.4 of Devroye & Lugosi (2012). Let $\mathcal S$ denote the collection of all the modified Scheffé sets induced by distributions in $\mathcal C$. Let $p^*$ be the distribution $q$ in $\mathcal C$ minimizing $\sup_{S\in\mathcal S}|\hat p_{\mathcal I}(S)-q(S)|$ up to an additive $4\varepsilon$ error, where we enlarge the error bound by $\varepsilon$ to guarantee the existence of such a distribution. Then, the triangle inequality yields
$$\|p^*-p\|_1\ \le\ \|p-p'\|_1+\|p'-p^*\|_1.$$
For any pair $(q_1,q_2)$ of distributions, define $D_{\mathcal S}(q_1,q_2):=\sup_{S\in\mathcal S}|q_1(S)-q_2(S)|$. Consider the last term in the above inequality. By a recursive application of the triangle inequality and properties embedded in the prior constructions,
$$\begin{aligned}
\|p'-p^*\|_1 &= 2D_{\mathcal S}(p^*,p')\\
&\le 2D_{\mathcal S}(p^*,\hat p)+2D_{\mathcal S}(\hat p,p')\\
&\le 2D_{\mathcal S}(p^*,\hat p_{\mathcal I})+2D_{\mathcal S}(\hat p_{\mathcal I},p')+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',\hat p_{\mathcal I})+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',\hat p)+O(\varepsilon)\\
&\le 4D_{\mathcal S}(p',p)+4D_{\mathcal S}(p,\hat p)+O(\varepsilon)\\
&\le 2\|p'-p\|_1+4D_{\mathcal S}(p,\hat p)+O(\varepsilon).
\end{aligned}$$
Recall that $\mathcal C$ is an $\varepsilon$-cover of $\mathcal P_{s,\varepsilon}$, so $\|p'-p\|_1\le\varepsilon$. In addition, the collection $\mathcal S$ of modified Scheffé sets has VC dimension at most $2s+2$. Therefore, by the VC inequality, for a sample size of $n=\Theta((s+\log(2/\delta))/\varepsilon^2)$, with probability at least $1-\delta/2$, $D_{\mathcal S}(p,\hat p)\le\varepsilon$. Consolidating these results shows that with probability at least $1-\delta$, the selected hypothesis $p^*$ achieves $\|p^*-p\|_1=O(\varepsilon)$. Note, however, that the above selection algorithm does not provide a method for finding $p^*$.
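The selection rule used in this argument can be made concrete for small instances. The sketch below is our own simplification: it scans the plain empirical distribution $\hat p$ rather than the quantized $\hat p_{\mathcal I}$ and uses the ordinary Scheffé sets of the candidates, in the spirit of the minimum-distance estimate of Devroye & Lugosi (2012).

```python
import numpy as np

def scheffe_select(candidates, emp):
    """Minimum-distance selection: return the candidate q minimizing
    max over Scheffe sets S_12 = {x : q1(x) > q2(x)} of |emp(S) - q(S)|."""
    sets = []
    for i, q1 in enumerate(candidates):
        for j, q2 in enumerate(candidates):
            if i != j:
                sets.append(q1 > q2)            # boolean mask over the domain [N]
    def score(q):
        return max(abs(emp[S].sum() - q[S].sum()) for S in sets)
    return min(candidates, key=score)

rng = np.random.default_rng(0)
N, n = 20, 5000
truth = np.ones(N) / N
sample = rng.choice(N, size=n, p=truth)
emp = np.bincount(sample, minlength=N) / n

candidates = [np.ones(N) / N,                               # the true distribution
              np.linspace(1, 2, N) / np.linspace(1, 2, N).sum(),
              np.linspace(2, 1, N) / np.linspace(2, 1, N).sum()]
best = scheffe_select(candidates, emp)
print(np.abs(best - truth).sum())   # l1 error of the selected hypothesis
```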
5.4. Guarantees of CML

There are only finitely many possible $\hat p_{\mathcal I}$, or equivalently, $\varphi(X^n)$. Below, we consider two disjoint cases according to the probability of observing a particular pattern of $\varphi(X^n)$ under the actual distribution.

Likely pattern: A pattern $\phi$ whose probability $p(\phi):=\Pr_{X^n\sim p}(\varphi(X^n)=\phi)>\delta$. By the argument in Section 5.3, there exists a selection algorithm that maps this pattern to a distribution $p^*$ whose error probability in achieving $\|p-p^*\|_1=O(\varepsilon)$ is at most $\delta$. Since $p(\phi)>\delta$ by assumption, the selected hypothesis must satisfy the error bound above. It is straightforward to combine these error bounds through Theorem 1 and the triangle inequality:
$$\|p-p_\varphi\|_1\ \le\ \|p_\varphi-p^*\|_1+\|p^*-p\|_1\ =\ O(\varepsilon).$$

Unlikely pattern: A pattern $\phi$ whose probability $p(\phi)\le\delta$. Since $\phi$ can only be a sorted integer sequence with values in $[N]$, the number of possible patterns equals the number of ways to pick $t$ elements from $[N]$ with repetition, which is
$$\binom{t+N-1}{t}\ \le\ \exp\Big((t\vee N)\log\Big(e\Big(1+\frac{N}{t}\Big)\Big)\Big).$$
Denote by $L_{t,N}$ the exponent of the last term. Setting $\delta=e^{-2L_{t,N}}$ and $n=\Theta(L_{t,N}/\varepsilon^2)$, the total probability of patterns falling into this category is at most
$$e^{L_{t,N}}\cdot e^{-2L_{t,N}}\ =\ e^{-L_{t,N}}\ \le\ e^{-(t\vee N)},$$
where the last step follows by choosing a sufficiently large absolute constant in the expression for $n$.

6. Probability Multisets

In this section, we address probability multiset estimation under permutation invariance.

6.1. Compressor and PML

For any discrete distribution $p$ and sample $X^n\sim p$, let the compressor $\varphi$ map the sequence to its profile $\varphi(X^n)$, defined as the multiplicity multiset of the symbols appearing in $X^n$. The respective CML estimator was introduced in Orlitsky et al. (2004) as the PML estimator. Specifically, PML computes
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
For any $\tau\in[0,1]$, define the $\tau$-truncated relative earth-mover distance (Valiant & Valiant, 2015) between two discrete distributions $p$ and $q$ as
$$R_\tau(p,q) := \inf_{\gamma\in\Gamma_{p,q}}\ \mathbb{E}_{(X,Y)\sim\gamma}\left[\,\left|\log\frac{p(X)\vee\tau}{q(Y)\vee\tau}\right|\,\right],$$
where $\Gamma_{p,q}$ represents all the possible couplings of these two distributions. In the following, we show that the PML distribution estimator satisfies

Theorem 4. For any discrete distribution $p$, draw a sample $X^n\sim p$. With probability at least $1-4\exp(-\Omega(n^{1/3}))$, for any $w\in[1,\log n]$,
$$R_{\frac{w}{n\log n}}(p_\varphi, p)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$
For any $\tau\in[0,1]$, define the $\tau$-truncated sorted $\ell_1$-distance between two distributions $p, q$ over $\mathcal X$ as
$$\ell_\tau(p,q) := \min_{p':\ \{p'\}=\{p\}}\ \sum_x\big|\,p'(x)\vee\tau-q(x)\vee\tau\,\big|,$$
where $\{p\}$ denotes the probability multiset of $p$. By Fact 1 in Valiant & Valiant (2015), for any distributions $p,q$ over $\mathcal X$, $\ell_\tau(p,q)\le 2R_\tau(p,q)$, implying

Corollary 2. Under the same conditions as Theorem 4,
$$\ell_{\frac{w}{n\log n}}(p_\varphi, p)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$

6.2. Proof of Theorem 4

This section provides a sketch of the proof of Theorem 4. We relegate most technical details to Appendix A. Part of the proof is adapted from Theorem 2 in Valiant & Valiant (2015). The original reasoning is not sufficient for our purpose, as the error probability derived there is too large to invoke the competitiveness of PML. For this reason, we modify the linear program used in that paper, carefully separate the analysis of the estimators for large and small probabilities, and provide a refined analysis with tighter probability bounds by reducing union-bound arguments.

First, we define histograms and the relative earth-moving cost, and give an operational meaning to $R_\tau$. For a distribution $p$, the histogram of a multiset $A\subseteq\{p\}$ is a mapping, denoted $h_A:(0,1]\to\mathbb Z_{\ge0}$, that maps each number $y\in(0,1]$ to the number of times it appears in $A$. Note that every $y$ corresponds to a probability mass of $y\cdot h_A(y)$. More generally, we also allow generalized histograms $h$ with non-integral values $h(y)\in\mathbb R_{\ge0}$. For any $y_1,y_2\in(0,1]$, generalized histogram $h$, and nonnegative $m<y_1 h(y_1)$, we can move a probability mass of $m$ from location $y_1$ to $y_2$ by reassigning $h(y_1)-m/y_1$ to $y_1$ and $h(y_2)+m/y_2$ to $y_2$. Given $\tau\in[0,1]$, we define the cost associated with this operation as
$$c_{\tau,m}(y_1,y_2) := m\left|\log\frac{y_1\vee\tau}{y_2\vee\tau}\right|,$$
and term it the $\tau$-truncated earth-moving cost. The cost of multiple operations is additive. Note that $R_\tau(p,q)$ is the minimal total $\tau$-truncated earth-moving cost associated with any scheme of operations moving $h_{\{p\}}$ to yield $h_{\{q\}}$.
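To make these quantities concrete, the following sketch (our own illustration, with an arbitrary toy distribution) computes a sample's profile and the $\tau$-truncated sorted $\ell_1$ distance between a known distribution and the empirical one; by Fact 1 quoted above, the latter is at most $2R_\tau$.

```python
import numpy as np
from collections import Counter

def profile(sample):
    """The PML compressor: the multiset of multiplicities, ignoring symbol labels."""
    return tuple(sorted(Counter(sample).values()))

def sorted_l1_truncated(p, q, tau):
    """l_tau(p, q): match the sorted, tau-truncated probability values of p and q.
    Sorting both sides attains the minimum over relabelings of p."""
    a = np.sort(np.maximum(p, tau))
    b = np.sort(np.maximum(q, tau))
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(0)
N, n = 6, 40
p = np.array([0.3, 0.25, 0.2, 0.1, 0.1, 0.05])
sample = rng.choice(N, size=n, p=p)
print(profile(sample))

emp = np.bincount(sample, minlength=N) / n       # empirical probability multiset
w = 4.0
tau = w / (n * np.log(n))
print(sorted_l1_truncated(p, emp, tau))          # l_tau(p, emp) <= 2 R_tau(p, emp)
```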
For simplicity, we suppress $X^n$ in $\varphi_i(X^n)$ and $\mu_s(X^n)$, representing respectively the number of symbols appearing $i$ times and the number of times symbol $s$ appears. For absolute constants $B$ and $C$ satisfying $0.1>B>C>B/2>0$, define $x_n:=\frac{n^{B}+n^{C}}{n}$ and $S:=\{\frac{1}{n^2},\ldots,x_n\}$. Consider the following linear program (LP), shown in Figure 1: for each $x\in S$ it has an associated variable $v_x\ge 0$; its objective measures the discrepancy between the expected fingerprint counts $\sum_{x\in S}\mathrm{bin}(n,x,i)\,v_x$ and the observed $\varphi_i$; and the variables satisfy the mass constraint $\sum_{x\in S}x\,v_x=X$.

EXISTENCE OF A GOOD FEASIBLE POINT

Let $p$ be the underlying distribution and $h$ be its histogram. First, we show that with high probability, the linear program LP has a feasible point $(v_x)$ that is good in the following sense: 1) the corresponding objective value is relatively small; and 2) for $\tau\ge n^{-3/2}$, the generalized histogram $h_0:x\mapsto v_x$ is close to $h_n:y\mapsto h(y)\,\mathbb 1_{y\le x_n}$, admitting a low $\tau$-truncated earth-mover cost. In the appendix, we leverage the Chernoff bound and the union bound to show that with probability at least $1-\exp(-n^{1/3+\kappa})$, the objective value of the feasible point $(v_x)$ is at most
$$n^{B}\cdot O\big(n^{2/3+\kappa}+1\big)\ =\ O\big(n^{B+2/3+\kappa}\big).$$
For any $\tau\ge n^{-3/2}$, the minimal $\tau$-truncated earth-moving cost of moving the generalized histogram $h_0$ corresponding to $(v_x)$ and the histogram $h_n:y\mapsto h(y)\,\mathbb 1_{y\le x_n}$, so that they differ from each other only at $x=x_n$, is at most $\frac{\log n}{n^{3/2}}+\frac{1}{n^{2}}$.

ALL SOLUTIONS ARE GOOD SOLUTIONS

Let $(v_x)$ be the feasible point described above. The appendix then proceeds to show that for any solution $(v'_x)$ to LP whose objective value is $O(n^{B+2/3+\kappa})$, the generalized histogram $h_1$ corresponding to $(v'_x)$ is close to $h_0$. Specifically, the proof establishes an $O(1/\log n)$ discrepancy bound whenever $B=1.5C=10\kappa=0.01$.

Consolidating the previous results: for $w\in[1,\log n]$ and $\tau=w/(n\log n)$, with probability at least $1-\exp(-n^{1/3+\kappa})$, the solution to LP yields a generalized histogram $h_1$ such that the minimal $\tau$-truncated earth-moving cost of moving $h_1$ and $h_n$ so that they differ only at $x_n$ is $O(1/\sqrt{w})$.

COMPETITIVENESS OF PML

For the PML distribution associated with a sample satisfying our prior assumptions, denote by $h^{\mathrm{PML}}_n$ the histogram corresponding to its entries that are at most $x_n$. By a recent result of Han & Shiragur (2021), for any $w\in[1,\log n]$ and $\tau=w/(n\log n)$, the minimal $\tau$-truncated cost of moving $h^{\mathrm{PML}}_n$ and $h_n$ so that they differ only at $x_n$ is $O(1/\sqrt{w})$, with probability at least $1-2\exp(-\Omega(n^{1/3}))$.

PROPERTIES OF THE EMPIRICAL HISTOGRAM

Denote by $h^{\mathrm{EMP}}$ the empirical histogram. By the Chernoff bound, for any symbol $s$, the probability that $|n\,p(s)-\mu_s|\ge\mu_s^{3/4}$ and $\mu_s>n^{B}+2n^{C}$ is at most $2n\,p(s)\exp(-\Omega(n^{2C-B}))$; similarly, $n\,p(s)\ge n^{B}+4n^{C}$ and $\mu_s\le n^{B}+2n^{C}$ will happen with probability at most $2\exp(-\Omega(n^{2C-B}))$. Hence, we assume that $|n\,p(s)-\mu_s|<\mu_s^{3/4}$ for all symbols $s$ appearing more than $n^{B}+2n^{C}$ times, and that any symbol $s$ with probability $p(s)\ge(n^{B}+4n^{C})/n$ appears more than $n^{B}+2n^{C}$ times. By the union bound, we will be correct with probability at least $1-4n\exp(-\Omega(n^{2C-B}))$. Let $y_n:=(n^{B}+4n^{C})/n$ for notational convenience.
If for each symbol $s$ satisfying $\mu_s>n^{B}+2n^{C}$ we move a $\mu_s/n$ probability mass of $h^{\mathrm{EMP}}$ from $\mu_s/n$ to $p(s)$, then at all locations $y\ge y_n$, the total discrepancy between the resulting generalized histogram and the actual one is at most
$$\sum_{j>n^{B}+2n^{C}}\varphi_j\,\frac{j^{3/4}}{n}\ =\ \frac{1}{n}\sum_{j>n^{B}+2n^{C}}\varphi_j^{1/4}\,(\varphi_j\, j)^{3/4}\ \le\ n^{-B/4},$$
where the last step follows by Hölder's inequality. Moreover, the associated total earth-moving cost is at most
$$\sum_{s:\ \mu_s>n^{B}+2n^{C}}\frac{\mu_s}{n}\left|\log\frac{n\,p(s)}{\mu_s}\right|\ \le\ \sum_{j>n^{B}+2n^{C}}\varphi_j\,\frac{2j^{3/4}}{n}\ \le\ 2n^{-B/4}.$$
The $y_n$-truncated earth-mover distance between $h^{\mathrm{EMP}}$ and $h$ is thus at most $2n^{-B/4}\log n=O(n^{-B/5})$, which, together with the above error-probability bound, upper-bounds the expected value of $R_{y_n}(h^{\mathrm{EMP}},h)$ by
$$O(n^{-B/5})+4n\exp(-\Omega(n^{2C-B}))\cdot\log n\ =\ O(n^{-B/5}).$$
Moreover, changing any element of the sample sequence changes the value of $R_{y_n}(h^{\mathrm{EMP}},h)$ by at most $(\log n)/n$. Hence, by McDiarmid's inequality, with probability at least $1-2\exp(-2\sqrt{n})$, the value of $R_{y_n}(h^{\mathrm{EMP}},h)$ is less than $O(n^{-B/4}\log n)=O(n^{-B/5})$.

COMPETITIVENESS OF PML

Consider the PML histogram $h^{\mathrm{PML}}$ and its $y_n$-truncated earth-mover distance to $h$. Since there are at most $\exp(3\sqrt{n})$ different profiles (Hardy & Ramanujan, 1918), with probability at least $1-2\exp(-\Omega(\sqrt{n}))$,
$$R_{y_n}(h^{\mathrm{PML}},h)\ \le\ O(n^{-B/5}).$$

PERFORMANCE OF PML

We consolidate the previous results. For any $w\in[1,\log n]$ and $\tau=w/(n\log n)$, we design an earth-moving scheme that moves $h^{\mathrm{PML}}$ to $h$. First, with probability at least $1-2\exp(-\Omega(n^{1/3}))$, we can move the probability mass of $h^{\mathrm{PML}}_n$ so that it differs from $h_n$ only at $x_n=(n^{B}+n^{C})/n$, while incurring a $\tau$-truncated earth-mover cost of at most $O(1/\sqrt{w})$. Second, by the empirical-histogram argument and $x_n<y_n$, with probability at least $1-2\exp(-\Omega(n^{1/3}))$, we can further move the probability mass of $h^{\mathrm{PML}}$ at locations $y\ge x_n$ to coincide with $h$ above $y_n$, while incurring a loss of $O(n^{-B/5})$. After the previous two steps, the modified PML histogram differs from the actual histogram $h$ only at locations $y\in I_n:=[x_n,y_n]$. Note that the cost of moving a unit mass within $I_n$ is at most $3n^{C-B}$, implying that with probability at least $1-2\exp(-\Omega(n^{1/3}))$,
$$R_\tau(h^{\mathrm{PML}},h)\ \le\ O\Big(\frac{1}{\sqrt{w}}+n^{-B/5}+3n^{C-B}\Big)\ =\ O\Big(\frac{1}{\sqrt{w}}\Big).$$

7. Symmetric Functionals

In this section, we apply the idea of CML to estimate multisets of low probabilities and leverage the respective plug-in estimator to approximate several distribution functionals. The study of functional estimation dates back more than half a century (Carlton, 1969; Good, 1953; Good & Toulmin, 1956) and has steadily grown over the years. While the empirical-distribution plug-in estimator performs well in the large-sample regime, modern data-science applications often study high-dimensional data, for which more sophisticated methods lead to estimators that possess better guarantees (Jiao et al., 2015; Orlitsky et al., 2016; Valiant & Valiant, 2011a;b; Wu & Yang, 2016; Hao et al., 2018; Wu & Yang, 2019; Hao & Orlitsky, 2019b; 2020a;b).

Recently, a line of research works studied the use of PML in functional estimation (Orlitsky et al., 2004; 2011; Das, 2012; Acharya et al., 2012; 2017a; Hao & Orlitsky, 2019a; 2020b; Charikar et al., 2019a;b; Han & Shiragur, 2021). For the functionals addressed in this section, the plain PML is known to be sample-optimal only for additive errors $\varepsilon\ge 1/n^{1/3}$. Concurrently with our result, Charikar et al. (2019c) proposed a different PML-type functional estimation method with optimal sample complexity down to accuracy $\varepsilon\ge 1/\sqrt{n}$. We note that some specialized estimators also achieve the better accuracy of order $1/\sqrt{n}$.
Recently, an efficient algorithm (Anari et al., 2020) was proposed to compute the PML-type estimators in Charikar et al. (2019c). The techniques also apply to the algorithms presented in this section.

7.1. Compressor

For any distribution $p$ over $\mathcal X$, sample $X^n\sim p$, and multiplicity $i$, recall that $\varphi_i(X^n)$ denotes the number of symbols appearing exactly $i$ times. Our objective is to extract information about the low probabilities of $p$, regardless of symbol permutations. It is therefore natural to consider, for an integer $t\le n$, the compressor
$$\varphi(X^n) := \big(\varphi_i(X^n)\big)_{i=1}^{t},$$
where $t$ determines the inference horizon.

7.2. CML Estimator

Similar to previous sections, for a fixed $t$ and sample $X^n\sim p$, the CML estimator for the above compressor takes the form
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \Pr_{X^n\sim p}\big(\varphi(X^n)\big).$$
Equivalently, the CML estimator can be written as
$$p_\varphi := \arg\max_{p\in\mathcal P}\ \sum_{y^n:\ \varphi(y^n)=\varphi(X^n)} p(y^n).$$
For space considerations, we relegate the rest of this section and the paper's appendix to the supplementary material.

The paper proposes a simple, novel, and unified compressed maximum likelihood (CML) approach for several fundamental tasks. The new technique bridges algorithms and results in several research directions, over both discrete and continuous domains.

Acknowledgements

We thank the reviewers for their helpful comments and are grateful to the National Science Foundation for supporting this work through grants CIF-1564355 and CIF-1619448. We also thank an anonymous reviewer for an excellent summary of our contribution, incorporated into Section 2.1, and another anonymous reviewer for suggesting the approximate CML estimator (Definition 1).

References

Acharya, J., Das, H., Jafarpour, A., Orlitsky, A., and Pan, S. Estimating multiple concurrent processes. In Proceedings of the 2012 IEEE International Symposium on Information Theory (ISIT), pp. 1628–1632. IEEE, 2012.

Acharya, J., Das, H., Orlitsky, A., and Suresh, A. T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pp. 11–21, 2017a.

Acharya, J., Diakonikolas, I., Li, J., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1278–1289. SIAM, 2017b.

Anari, N., Charikar, M., Shiragur, K., and Sidford, A. Instance based approximations to profile maximum likelihood. In Advances in Neural Information Processing Systems, 2020.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Carlton, A. G. On the bias of information estimates. Psychological Bulletin, 71(2):108, 1969.

Charikar, M., Shiragur, K., and Sidford, A. The Bethe approximation for structured matrices: an improved approximation for the profile maximum likelihood. In NeurIPS 2019 Workshop on Information Theory and Machine Learning, 2019a.

Charikar, M., Shiragur, K., and Sidford, A. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 780–791, 2019b.

Charikar, M., Shiragur, K., and Sidford, A. A general framework for symmetric property estimation. In Advances in Neural Information Processing Systems, 2019c.

Das, H. Competitive tests and estimators for properties of distributions. PhD thesis, UC San Diego, 2012.

Devroye, L. and Lugosi, G. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2012.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001.
Good, I. J. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

Good, I. J. and Toulmin, G. H. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2):45–63, 1956.

Han, Y. and Shiragur, K. On the competitive analysis and high accuracy optimality of profile maximum likelihood. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1317–1336. SIAM, 2021.

Han, Y., Jiao, J., and Weissman, T. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference On Learning Theory, pp. 3189–3221, 2018.

Hao, Y. and Orlitsky, A. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10989–11001, 2019a.

Hao, Y. and Orlitsky, A. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems (NeurIPS), pp. 11106–11116, 2019b.

Hao, Y. and Orlitsky, A. Data amplification: Instance-optimal property estimation. In International Conference on Machine Learning (ICML), pp. 4049–4059. PMLR, 2020a.

Hao, Y. and Orlitsky, A. Profile entropy: A fundamental measure for the learnability and compressibility of discrete distributions. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Hao, Y., Orlitsky, A., Suresh, A. T., and Wu, Y. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8834–8843, 2018.

Hao, Y., Jain, A., Orlitsky, A., and Ravindrakumar, V. SURF: A simple, universal, robust, fast distribution learning algorithm. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Hardy, G. H. and Ramanujan, S. Asymptotic formulae in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75–115, 1918.

Jiao, J., Venkat, K., Han, Y., and Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

Orlitsky, A., Santhanam, N. P., Viswanathan, K., and Zhang, J. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 426–435. AUAI Press, 2004.

Orlitsky, A., Santhanam, N. P., Viswanathan, K., and Zhang, J. On estimating the probability multiset. Online draft, 2011. URL http://alon.ucsd.edu/papers/pml1.pdf.

Orlitsky, A., Suresh, A. T., and Wu, Y. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

Valiant, G. and Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), pp. 685–694. ACM, 2011a.

Valiant, G. and Valiant, P. The power of linear estimators. In Proceedings of the 52nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 403–412. IEEE, 2011b.

Valiant, G. and Valiant, P. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 685–694, 2011c.

Valiant, G. and Valiant, P. Instance optimal learning. arXiv preprint arXiv:1504.05321, 2015.
Valiant, G. and Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 142–155, 2016.

Valiant, P. and Valiant, G. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pp. 2157–2165, 2013.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Wu, Y. and Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

Wu, Y. and Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.