Instance Based Approximations to Profile Maximum Likelihood

Nima Anari (Stanford University) anari@stanford.edu, Moses Charikar (Stanford University) moses@cs.stanford.edu, Kirankumar Shiragur (Stanford University) shiragur@stanford.edu, Aaron Sidford (Stanford University) sidford@stanford.edu

In this paper we provide a new efficient algorithm for approximately computing the profile maximum likelihood (PML) distribution, a prominent quantity in symmetric property estimation. We provide an algorithm which matches the previous best known efficient algorithms for computing approximate PML distributions and improves when the number of distinct observed frequencies in the given instance is small. We achieve this result by exploiting new sparsity structure in approximate PML distributions and providing a new matrix rounding algorithm, of independent interest. Leveraging this result, we obtain the first provably computationally efficient implementation of Pseudo PML, a general framework for estimating a broad class of symmetric properties. Additionally, we obtain efficient PML-based estimators for distributions with small profile entropy, a natural instance-based complexity measure. Further, we provide a simpler and more practical Pseudo PML implementation that matches the best-known theoretical guarantees of such an estimator, and we evaluate this method empirically.

1 Introduction

In this paper we consider the fundamental problem of symmetric property estimation: given access to $n$ i.i.d. samples from an unknown distribution, estimate the value of a given symmetric property (i.e. one invariant to label permutation). This is an incredibly well-studied problem with numerous applications [Cha84, BF93, CCG+12, TE87, Für05, KLR99, PBG+01, DS13, RCS+09, GTPB07, HHRB01] and proposed property-specific estimators, e.g.
for support [VV11b, WY15], support coverage [ZVV+16, OSW16], entropy [VV11b, WY16a, JVHW15], and distance to uniformity [VV11a, JHW16]. However, in a striking recent line of work it was shown that there is a universal approach to achieving sample optimal^1 estimators for a broad class of symmetric properties, including those above. [ADOS16] showed that the value of the property on a distribution that (approximately) maximizes the likelihood of the observed profile (i.e. multiset of observed frequencies) is an optimal estimator up to accuracy^2 $\epsilon \gg n^{-1/4}$. Further, [ACSS20], which in turn built on [ADOS16, CSS19a], provided a polynomial time algorithm to compute an $\exp(-O(\sqrt{n}\log n))$-approximate profile maximum likelihood (PML) distribution. Together, these results yield efficient sample optimal estimators for various symmetric properties up to accuracy $\epsilon \gg n^{-1/4}$.

^1 Sample optimality is up to constant factors. See [ADOS16] for details.
^2 We use $\epsilon \gg n^{-c}$ to denote $\epsilon > n^{-c+\alpha}$ for any constant $\alpha > 0$.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Despite this seemingly complete picture of the complexity of PML, recent work has shown that there is value in obtaining improved approximate PML distributions. In [CSS19b, HO19] it was shown that variants of PML called Pseudo PML and truncated PML respectively, which compute an approximate PML distribution on a subset of the coordinates, yield sample optimal estimators in broader error regimes for a wide range of symmetric properties. Further, in [HO20] an instance dependent quantity known as profile entropy was shown to govern the accuracy achievable by PML, and their analysis holds for all symmetric properties with no additional assumption on the structure of the property. Additionally, in [HS20] it was shown that PML distributions yield a sample optimal universal estimator up to error $\epsilon \gg n^{-1/3}$ for a broad class of symmetric properties.
However, the inability to obtain approximate PML distributions with approximation error better than $\exp(-O(\sqrt{n}\log n))$ has limited the provably efficient implementation of these methods. In this paper we enable many of these applications by providing improved efficient approximations to PML distributions. Our main theoretical contribution is a polynomial time algorithm that computes an $\exp(-O(k\log n))$-approximate PML distribution, where $k$ is the number of distinct observed frequencies. As $k = O(\sqrt{n})$ always, our work generalizes the previous best known result of [ACSS20], which computed an $\exp(-O(\sqrt{n}\log n))$-approximate PML. Leveraging this result, our work provides the first provably efficient implementation of Pseudo PML. Further, our work also yields the first provably efficient estimator for profile entropy and efficient estimators with instance-based high-accuracy guarantees via profile entropy. We obtain our approximate PML result by leveraging interesting sparsity structure in convex relaxations of PML [ACSS20, CSS19a] and additionally provide a novel matrix rounding algorithm that we believe is of independent interest. Finally, beyond the above theoretical results, we provide a simplified instantiation of these results that is sufficient for implementing Pseudo PML. We believe this result is a key step towards practical Pseudo PML. We provide preliminary experiments in which we perform entropy estimation using the Pseudo PML approach implemented via our simpler rounding algorithm. Our results match other state-of-the-art estimators for entropy, some of which are property specific.

Notation and basic definitions: Throughout this paper we assume we receive a sequence of $n$ independent samples from an underlying distribution $p \in \Delta^{\mathcal{D}}$, where $\mathcal{D}$ is a domain of elements and $\Delta^{\mathcal{D}}$ is the set of all discrete distributions supported on this domain.
We let $[a,b]$ and $[a,b]_{\mathbb{R}}$ denote the intervals of integers and reals between $a$ and $b$ respectively, so $\Delta^{\mathcal{D}} \stackrel{\mathrm{def}}{=} \{q \in [0,1]^{\mathcal{D}}_{\mathbb{R}} \mid \|q\|_1 = 1\}$. Let $\mathcal{D}^n$ be the set of all length-$n$ sequences and $y^n \in \mathcal{D}^n$ be one such sequence, with $y^n_i$ denoting its $i$th element. Let $\mathrm{f}(y^n, x) \stackrel{\mathrm{def}}{=} |\{i \in [n] \mid y^n_i = x\}|$ and $p_x$ be the frequency and probability of $x \in \mathcal{D}$ respectively. For a sequence $y^n \in \mathcal{D}^n$, let $M = \{\mathrm{f}(y^n, x)\}_{x \in \mathcal{D}} \setminus \{0\}$ be the set of all its non-zero distinct frequencies and $m_1, m_2, \ldots, m_{|M|}$ be these distinct frequencies. The profile of a sequence $y^n$, denoted $\phi = \Phi(y^n)$, is a vector in $\mathbb{Z}^{|M|}_{+}$, where $\phi_j \stackrel{\mathrm{def}}{=} |\{x \in \mathcal{D} \mid \mathrm{f}(y^n, x) = m_j\}|$ is the number of domain elements with frequency $m_j$. We call $n$ the length of profile $\phi$ and let $\Phi^n$ denote the set of all profiles of length $n$. The probabilities of observing sequence $y^n$ and profile $\phi$ with respect to a distribution $p$ are

$$\mathbb{P}(p, y^n) = \prod_{x \in \mathcal{D}} p_x^{\mathrm{f}(y^n, x)} \qquad \text{and} \qquad \mathbb{P}(p, \phi) = \sum_{\{y^n \in \mathcal{D}^n \mid \Phi(y^n) = \phi\}} \mathbb{P}(p, y^n)\,.$$

For a profile $\phi \in \Phi^n$, $p_\phi$ is a profile maximum likelihood (PML) distribution if $p_\phi \in \arg\max_{p \in \Delta^{\mathcal{D}}} \mathbb{P}(p, \phi)$. Further, a distribution $p^{\beta}_{\phi}$ is a $\beta$-approximate PML distribution if $\mathbb{P}(p^{\beta}_{\phi}, \phi) \geq \beta \cdot \mathbb{P}(p_\phi, \phi)$. For a distribution $p$ and $n$, let $X$ be a random variable that takes value $\phi \in \Phi^n$ with probability $\mathbb{P}(p, \phi)$. The distribution of $X$ depends only on $p$ and $n$, and we call $H(X)$ (the entropy of $X$) the profile entropy with respect to $(p, n)$ and denote it by $H(\Phi^n, p)$. We use $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$ notation to hide all polylogarithmic factors in $n$ and $N$.

Paper organization: In Section 2 we formally state our results. In Section 3, we provide the convex relaxation [CSS19a, ACSS20] for the PML objective. Using this convex relaxation, in Section 4 we state our algorithm that computes an $\exp(-O(k \log n))$-approximate PML and sketch its proof. Finally, in Section 5, we provide a simpler algorithm that provably implements the Pseudo PML approach; we implement this algorithm and provide experiments in the same section. Due to space constraints, we defer most of the proofs to the appendix.
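To make the notation concrete, the following Python sketch computes the profile $\Phi(y^n)$ of a sample and the i.i.d. sequence probability $\mathbb{P}(p, y^n)$; the function names are ours, not the paper's.

```python
from collections import Counter

def profile(sample):
    """Return the profile as a sorted list of (m_j, phi_j) pairs, where
    phi_j counts the domain elements that appear exactly m_j times."""
    freqs = Counter(sample)           # f(y^n, x) for each observed x
    mult = Counter(freqs.values())    # how many elements share each distinct frequency
    return sorted(mult.items())

def sequence_probability(p, sample):
    """P(p, y^n) = prod_x p_x^{f(y^n, x)} for a distribution given as a dict."""
    prob = 1.0
    for x, m in Counter(sample).items():
        prob *= p[x] ** m
    return prob

sample = ['a', 'b', 'a', 'c', 'b', 'a']   # f: a -> 3, b -> 2, c -> 1
print(profile(sample))                    # [(1, 1), (2, 1), (3, 1)]
```

Note that the profile forgets the labels entirely: any relabeling of `sample` yields the same output, which is exactly the symmetry PML exploits.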
2 Results

Here we provide the main results of our paper on computing approximations to PML where the approximation quality depends on the number of distinct frequencies, as well as efficiently implementing results on profile entropy and Pseudo PML.

Distinct frequencies: Our main approximate PML result is the following.

Theorem 2.1 (Approximate PML). There is an algorithm that, given a profile $\phi \in \Phi^n$ with $k$ distinct frequencies, computes an $\exp(-O(k \log n))$-approximate PML distribution in time polynomial in $n$.

Our result generalizes [ACSS20], which computes an $\exp(-O(\sqrt{n} \log n))$-approximate PML. Through [ADOS16], our result also provides efficient optimal estimators for a class of symmetric properties when $\epsilon \gg n^{-1/4}$. Further, for distributions that with high probability output a profile with $O(n^{1/3})$ distinct frequencies, through [HS20] our algorithm enables efficient optimal estimators for the same class of properties when $\epsilon \gg n^{-1/3}$. In Section 4 we provide a proof sketch for the above theorem and defer all the proof details to Appendix A.

Profile entropy: One key application of our instance-based, i.e. distinct-frequency-based, approximation algorithm is the efficient implementation of the following approximate PML version of the profile entropy result from [HO20].^3 See Section 1 for the definition of profile entropy.

Lemma 2.2 (Theorem 3 in [HO20]). Let $f$ be a symmetric property. For any $p \in \Delta^{\mathcal{D}}$ and a profile $\phi \sim p$ of length $n$ with $k$ distinct frequencies, with probability at least $1 - O(1/\sqrt{n})$,
$$|f(p) - f(p^{\beta}_{\phi})| \leq 2\,\epsilon_f\!\left(\tilde{\Omega}\!\left(\frac{n}{H(\Phi^n, p)}\right)\right),$$
where $p^{\beta}_{\phi}$ is any $\beta$-approximate PML distribution for $\beta > \exp(-O(k \log n))$ and $\epsilon_f(n)$ is the smallest error that can be achieved by any estimator with sample size $n$ and success probability^4 $9/10$.

As the above result requires an $\exp(-O(k \log n))$-approximate PML, our Theorem 2.1 immediately provides an efficient implementation of it. Lemma 2.2 holds for any symmetric property with no additional assumptions on its structure.

^3 Theorem 3 in [HO20] discusses exact PML, and the authors discuss the approximate PML case in the comments; we confirmed the sufficiency of approximate PML claimed in the theorem through private communication with the authors.
^4 Please refer to [HO20] for the general success probability $1 - \delta$; our work also holds for the general case.
Further, it trivially implies a weaker result in [ADOS16], where $H(\Phi^n, p)$ is replaced by $\sqrt{n}$. For further details and motivation, see [HO20].

Pseudo PML: Our approximate PML algorithm also enables the efficient implementation of Pseudo PML [CSS19b, HO19]. Using Pseudo PML, the authors in [CSS19b, HO19] provide a general estimation framework that is sample optimal for many properties in wider parameter regimes than previous universal approaches. At a high level, in this framework the samples are split into two parts based on the element frequencies. The empirical estimate is used for the first part, and for the second part they compute the estimate corresponding to an approximate PML. Efficiently implementing the Pseudo PML approach requires efficient algorithms with either strong or instance dependent approximation guarantees, and our result (Theorem 2.1) achieves the latter. We first state a lemma that relates approximate PML computation to Pseudo PML.

Lemma 2.3 (Pseudo PML). Let $\phi \in \Phi^n$ be a profile with $k$ distinct frequencies and $\ell, u \in [0,1]$. Suppose there exists an algorithm that runs in time $T(n, k, u, \ell)$ and returns a distribution $p'$ such that
$$\mathbb{P}(p', \phi) \geq \exp\left(-O((u - \ell)\, n \log n + k \log n)\right) \max_{q \in \Delta^{\mathcal{D}}_{[\ell,u]}} \mathbb{P}(q, \phi)\,, \tag{1}$$
where $\Delta^{\mathcal{D}}_{[\ell,u]} \stackrel{\mathrm{def}}{=} \{p \in \Delta^{\mathcal{D}} \mid p_x \in [\ell, u] \;\forall x \in \mathcal{D}\}$. Then we can implement the Pseudo PML approach with the following guarantees:

For entropy, when the error parameter $\epsilon > \Omega\left(\frac{\log N}{N^{1-\alpha}}\right)$ for any constant $\alpha > 0$, the estimator is sample complexity optimal and runs in $T(n, O(\log n), O(\log n / n), 1/\mathrm{poly}(n))$ time.
For distance to uniformity, when $\epsilon > \Omega\left(\frac{1}{N^{1-\alpha}}\right)$ for any constant $\alpha > 0$, the estimator is sample complexity optimal and runs in $T(n, \tilde{O}(1/\epsilon), O(1/N), \Omega(1/N))$ time.

The proof of the lemma is divided into two main steps. In the first step, we relate (1) to conditions considered in the Pseudo PML literature. In the second step, we leverage this relationship and the analysis in [CSS19b, HO19] to obtain the result. See Appendix B.3 for the proof of the lemma and other details. As discussed in [CSS19b, HO19], the above results are interesting because we have a general framework (the Pseudo PML approach) that is sample optimal in a broad range of non-trivial estimation settings; for instance, when $\epsilon < \frac{\log N}{N}$ for entropy and $\epsilon < \frac{1}{N^{C}}$ for distance to uniformity, where $C > 0$ is a constant, we know that the empirical estimate is optimal. As our approximate PML algorithm (Theorem 2.1) runs in time polynomial in $n$ (for all values of $k$) and returns a distribution that satisfies the condition of the above lemma, we immediately obtain an efficient implementation of the results in Lemma 2.3. However, for practical purposes, we present a simpler and faster algorithm that outputs a distribution which suffices for the application of Pseudo PML. We summarize this result in the following theorem.

Theorem 2.4 (Efficient Pseudo PML). There exists an algorithm that implements Lemma 2.3 in time $T(n, k, u, \ell) = \tilde{O}\left(n\, k^{\omega - 1} \log\frac{u}{\ell}\right)$, where $\omega$ is the matrix multiplication constant. Consequently, this provides estimators for entropy and distance to uniformity in time $\tilde{O}(n)$ and $\tilde{O}(n/\epsilon^{\omega - 1})$ under their respective error parameter restrictions.

See Section 5 for a description of the algorithm and a proof sketch. The running time in the above result involves solving a convex program, $n/k$ linear system solves of $k \times k$ matrices, and other lower order terms for the remaining steps. In our implementation we use CVX [GB14] with the package CVXQUAD [FSP17] to solve the convex program.
We use a couple of heuristics to make our algorithm more practical, and we discuss them in Appendix B.4.

2.1 Related work

PML was introduced by [OSS+04]. Since then, many heuristic approaches [OSS+04, ADM+10, PJW17, Von12, Von14] have been proposed to compute an approximate PML distribution. Recent work of [CSS19a] gave the first provably efficient algorithm to compute a non-trivial $\exp(-O(n^{2/3} \log n))$-approximate PML distribution. The proof of this result is broadly divided into three steps. In the first step, the authors in [CSS19a] provide a convex program that approximates the probability of a profile for a fixed distribution. In the second step, they perform minor modifications to this convex program to reformulate it as instead maximizing over all distributions while maintaining the convexity of the optimization problem. The feasible solutions to the modified convex program represent fractional distributions, and in the third step a rounding algorithm is applied to obtain a valid distribution. The approximation quality of this approach is governed by the first and last steps, and [CSS19a] showed a loss of $\exp(-O(n^{2/3} \log n))$ for each, thereby obtaining an $\exp(-O(n^{2/3} \log n))$-approximate PML distribution. In follow up work, [ACSS20] improved the analysis for the first step and then provided a better rounding algorithm in the third step to output an $\exp(-O(\sqrt{n} \log n))$-approximate PML distribution. The authors in [ACSS20] showed that the convex program considered in the first step by [CSS19a] approximates the probability of a profile for a fixed distribution up to accuracy $\exp(-O(k \log n))$, where $k$ is the number of distinct observed frequencies in the profile. However, they incurred a loss of $\exp(-O(\sqrt{n} \log n))$ in the rounding step, thus returning an $\exp(-O(\sqrt{n} \log n))$-approximate PML distribution. To prove these results, [CSS19a] used a combinatorial view of the PML problem while [ACSS20] analyzed the Bethe/Sinkhorn approximation to the permanent [Von12, Von14].
Leveraging the connection between PML and symmetric property estimation, [CSS19a] and [ACSS20] gave efficient optimal universal estimators for various symmetric properties when $\epsilon \gg n^{-1/6}$ and $\epsilon \gg n^{-1/4}$ respectively. The broad applicability of PML in property testing and for estimating other symmetric properties was later studied in [HO19]. [HS20] showed interesting continuity properties of PML distributions and proved their optimality for sorted $\ell_1$ distance and other symmetric properties when $\epsilon \gg n^{-1/3}$; no efficient version of this result is known yet. There have been other approaches for designing universal estimators, e.g. [VV11b] based on [ET76], [HJW18] based on local moment matching, and variants of PML by [CSS19b, HO19] that weakly depend on the property. Optimal sample complexities for estimating many symmetric properties were also obtained by constructing property specific estimators, e.g. sorted $\ell_1$ distance [VV11a, HJW18], Rényi entropy [AOST14, AOST17], KL divergence [BZLV16, HJW16] and others.

2.2 Overview of techniques

Here we provide a brief overview of the proof that computes an $\exp(-O(k \log n))$-approximate PML distribution. As discussed in the related work, both [CSS19a, ACSS20] analyzed the same convex program; [ACSS20] showed that this convex program approximates the probability of a profile for a fixed distribution up to a multiplicative factor of $\exp(-O(k \log n))$. However, in the rounding step their algorithms incurred losses of $\exp(-O(n^{2/3} \log n))$ and $\exp(-O(\sqrt{n} \log n))$ respectively. Computing an improved $\exp(-O(k \log n))$-approximate PML distribution required a better rounding algorithm, which in turn posed several challenges. We address these challenges by leveraging interesting sparsity structure in the convex relaxation of PML [ACSS20, CSS19a] (Lemma 4.3) and providing a novel matrix rounding algorithm (Theorem 4.4).
In our rounding algorithm, we first leverage homogeneity in the convex relaxation of PML and properties of basic feasible solutions of a linear program to efficiently obtain a sparse approximate solution to the convex relaxation. This reduces the problem of computing the desired approximate PML distribution to a particular matrix rounding problem, where we need to round down a matrix of non-negative reals to another one with integral row and column sums without changing the entries too much ($O(k)$ overall) in $\ell_1$. Perhaps surprisingly, we show that this is always possible by reduction to a combinatorial problem, which we solve by combining seemingly disparate theorems from combinatorics and graph theory. Further, we show that this rounding can be computed efficiently by employing algorithms for enumerating near-minimum cuts of a graph [KS96].

3 Convex Relaxation to PML

Here we define the convex program that approximates the PML objective. This convex program was initially introduced in [CSS19a] and rigorously analyzed in [CSS19a, ACSS20]. We first describe the notation and later state the theorem from [ACSS20] that captures the guarantees of the convex program.

Probability discretization: Let $R \stackrel{\mathrm{def}}{=} \{r_i\}_{i \in [1,\ell]}$ be a finite discretization of the probability space, where $r_i = \frac{1}{2n^2}(1+\alpha)^i$ for all $i \in [1, \ell-1]$, $r_\ell = 1$, and $\ell \stackrel{\mathrm{def}}{=} |R|$ is such that $\frac{1}{2n^2}(1+\alpha)^{\ell} > 1$; therefore $\ell = O(\frac{\log n}{\alpha})$. Let $r \in \mathbb{R}^{\ell}_{+}$ be the vector whose $i$th element equals $r_i$. We call $q \in [0,1]^{\mathcal{D}}_{\mathbb{R}}$ a pseudo-distribution if $\|q\|_1 \leq 1$, and a discrete pseudo-distribution with respect to $R$ if additionally all its entries are in $R$. We use $\Delta^{\mathcal{D}}_{pseudo}$ and $\Delta^{\mathcal{D}}_{R}$ to denote the sets of all pseudo-distributions and discrete pseudo-distributions with respect to $R$ respectively. For all probability terms defined involving distributions $p$, we extend those definitions to pseudo-distributions $q$ by replacing $p_x$ with $q_x$ everywhere. The effect of discretization is captured by the following lemma.

Lemma 3.1 (Lemma 4.4 in [CSS19a]).
For any profile $\phi \in \Phi^n$ and distribution $p \in \Delta^{\mathcal{D}}$, there exists $q \in \Delta^{\mathcal{D}}_{R}$ that satisfies
$$\mathbb{P}(p, \phi) \geq \mathbb{P}(q, \phi) \geq \exp(-\alpha n - 6)\, \mathbb{P}(p, \phi)$$
and therefore,
$$\max_{p \in \Delta^{\mathcal{D}}} \mathbb{P}(p, \phi) \geq \max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi) \geq \exp(-\alpha n - 6) \max_{p \in \Delta^{\mathcal{D}}} \mathbb{P}(p, \phi)\,.$$

For any probability discretization set $R$, profile $\phi$ and $q \in \Delta^{\mathcal{D}}_{R}$, we define the following sets that help lower and upper bound the PML objective by a convex program:
$$Z^{\phi}_{R} \stackrel{\mathrm{def}}{=} \left\{ S \in \mathbb{R}^{\ell \times [0,k]}_{\geq 0} \;\middle|\; S\mathbf{1} \in \mathbb{Z}^{\ell}_{+},\; [S^{\top}\mathbf{1}]_j = \phi_j \text{ for all } j \in [1,k],\text{ and } r^{\top} S \mathbf{1} \leq 1 \right\}, \tag{2}$$
$$Z^{\phi,frac}_{R} \stackrel{\mathrm{def}}{=} \left\{ S \in \mathbb{R}^{\ell \times [0,k]}_{\geq 0} \;\middle|\; [S^{\top}\mathbf{1}]_j = \phi_j \text{ for all } j \in [1,k] \text{ and } r^{\top} S \mathbf{1} \leq 1 \right\}, \tag{3}$$
where in the above definitions the $0$th column corresponds to domain elements with frequency $0$ (unseen) and we use $m_0 \stackrel{\mathrm{def}}{=} 0$. We next define the objective of the convex program. Let $C_{ij} \stackrel{\mathrm{def}}{=} m_j \log r_i$ and for any $S \in \mathbb{R}^{\ell \times [0,k]}_{\geq 0}$ define
$$g(S) \stackrel{\mathrm{def}}{=} \exp\left( \sum_{i \in [1,\ell],\, j \in [0,k]} \left[ C_{ij} S_{ij} - S_{ij} \log S_{ij} \right] + \sum_{i \in [1,\ell]} [S\mathbf{1}]_i \log [S\mathbf{1}]_i \right). \tag{4}$$
The function $g(S)$ approximates the $\mathbb{P}(q, \phi)$ term, and the following theorem summarizes this result.

Theorem 3.2 (Theorem 6.7 and Lemma 6.9 in [ACSS20]). Let $R$ be a probability discretization set. Given a profile $\phi \in \Phi^n$ with $k$ distinct frequencies, the following inequalities hold:
$$\exp(-O(k \log n))\, C_{\phi} \max_{S \in Z^{\phi}_{R}} g(S) \leq \max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi) \leq \exp(O(k \log n))\, C_{\phi} \max_{S \in Z^{\phi}_{R}} g(S)\,, \tag{5}$$
$$\max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi) \leq \exp(O(k \log n))\, C_{\phi} \max_{S \in Z^{\phi,frac}_{R}} g(S)\,, \tag{6}$$
where $C_{\phi} \stackrel{\mathrm{def}}{=} \frac{n!}{\prod_{j \in [1,k]} (m_j!)^{\phi_j}}$ is a term that only depends on the profile.

See Appendix A.1 for citations related to the convexity of the function $g(S)$ and the running time to solve the convex program. For any $S \in Z^{\phi}_{R}$, we define pseudo-distributions associated with it as follows.

Definition 3.3. For any $S \in Z^{\phi}_{R}$, the discrete pseudo-distribution $q_S$ associated with $S$ and $R$ is defined as follows: for each $i \in [1,\ell]$, assign probability $r_i$ to any arbitrary $\sum_{j \in [0,k]} S_{i,j}$ domain elements. Further, $p_S \stackrel{\mathrm{def}}{=} q_S / \|q_S\|_1$ is the distribution associated with $S$ and $R$. Note that $q_S$ is a valid pseudo-distribution because of the third condition in Equation (2), and these pseudo-distributions $p_S$ and $q_S$ satisfy the following lemma.
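The discretization set $R$ and the objective $\log g(S)$ from (4) are both straightforward to compute; the following numpy sketch (our code, with the convention $0 \log 0 = 0$) illustrates them.

```python
import math
import numpy as np

def discretization(n, alpha):
    """R = {(1/(2n^2))(1+alpha)^i} for i = 1, 2, ... while below 1, plus r_ell = 1.
    The size of R is O(log(n) / alpha)."""
    R, r = [], (1 + alpha) / (2 * n * n)   # r_1
    while r <= 1:
        R.append(r)
        r *= 1 + alpha
    R.append(1.0)                          # r_ell = 1
    return R

def log_g(S, R, m):
    """log g(S) = sum_{i,j} [ m_j log(r_i) S_ij - S_ij log S_ij ]
                 + sum_i [S 1]_i log [S 1]_i, with 0 log 0 := 0."""
    S = np.asarray(S, float)
    C = np.outer(np.log(R), m)             # C_ij = m_j log r_i
    with np.errstate(divide='ignore', invalid='ignore'):
        ent = np.where(S > 0, S * np.log(S), 0.0)
    rows = S.sum(axis=1)                   # [S 1]_i
    row_ent = np.where(rows > 0, rows * np.log(rows), 0.0)
    return (C * S - ent).sum() + row_ent.sum()
```

Working in log space (maximizing $\log g$ rather than $g$) matches how the convex program is solved in Algorithms 1 and 2 and avoids overflow.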
Lemma 3.4 (Theorem 6.7 in [ACSS20]). Let $R$ and $\phi \in \Phi^n$ be any probability discretization set and profile respectively. For any $S \in Z^{\phi}_{R}$, the discrete pseudo-distribution $q_S$ and distribution $p_S$ associated with $S$ and $R$ satisfy:
$$\exp(-O(k \log n))\, C_{\phi}\, g(S) \leq \mathbb{P}(q_S, \phi) \leq \mathbb{P}(p_S, \phi)\,.$$

4 Algorithm and Proof Sketch of Theorem 2.1

Here we provide the algorithm to compute an $\exp(-O(k \log n))$-approximate PML distribution, where $k$ is the number of distinct frequencies. We use the convex relaxation from Section 3; the maximizer of this convex program is a matrix $S' \in Z^{\phi,frac}_{R}$ whose $i$th row sum denotes the number of domain elements with probability $r_i$. As the row sums are not necessarily integral, we wish to round $S'$ to a new matrix $S''$ that has integral row sums and $S'' \in Z^{\phi}_{R'}$ for some probability discretization set $R'$. Our algorithm does this rounding and incurs only a loss of $\exp(-O(k \log n))$ in the objective; finally, the distribution associated with $S''$ and $R'$ is the desired $\exp(-O(k \log n))$-approximate PML. We first provide a general algorithm that holds for any probability discretization set $R$; its guarantees are stated below.

Theorem 4.1. Given a profile $\phi \in \Phi^n$ with $k$ distinct observed frequencies and $R$, there exists an algorithm that runs in time polynomial in $n$ and $|R|$ and returns a distribution $p'$ that satisfies
$$\mathbb{P}(p', \phi) \geq \exp(-O(k \log n)) \max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi)\,.$$

For an appropriately chosen $R$, the above theorem immediately proves Theorem 2.1, and we defer its proof to Appendix A.4. In the remainder of this section we focus our attention on the proof of Theorem 4.1, and we next provide the algorithm that satisfies the guarantees of this theorem.

Algorithm 1 ApproximatePML($\phi$, $R$)
1: Solve $S' = \arg\max_{S \in Z^{\phi,frac}_{R}} \log g(S)$.
2: $S'' = \mathrm{Sparse}(S')$.
3: $(S''', B') = \mathrm{MatrixRound}(S'')$.
4: $(S^{ext}, R^{ext}) = \mathrm{CreateNewProbabilityValues}(S''', B', R)$.
5: Return distribution $p'$ with respect to $S^{ext}$ and $R^{ext}$ (see Definition 3.3).
We divide the analysis of the above algorithm into 5 main steps. See Lemma 3.4 for the guarantees of Step 5; here we state results for the remaining steps and later combine them all to prove Theorem 4.1.

Lemma 4.2 ([CSS19a, ACSS20]). Step 1 of the algorithm can be implemented in $\tilde{O}(|R|\, k^2)$ time, and the maximizer $S'$ satisfies $C_{\phi}\, g(S') \geq \exp(-O(k \log n)) \max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi)$.

The running time follows from Theorem 4.17 in [CSS19a] and the guarantee of the maximizer follows from Lemma 6.9 in [ACSS20]. The lemma statements for the remaining steps are written in a general setting; we later invoke each of these lemmas in the context of the algorithm to prove Theorem 4.1.

Lemma 4.3 (Sparse solution). For any $A \in Z^{\phi,frac}_{R}$, the algorithm $\mathrm{Sparse}(A)$ runs in $\tilde{O}(|R|\, k^{\omega})$ time and returns a solution $A' \in Z^{\phi,frac}_{R}$ such that $g(A') \geq g(A)$ and $|\{i \in [1,\ell] \mid [A'\mathbf{1}]_i > 0\}| \leq k + 1$.

We defer the description of the algorithm $\mathrm{Sparse}(\cdot)$ and the proof to Appendix A.1. In the proof, we use homogeneity of the convex program to write an LP whose optimal basic feasible solution satisfies the lemma conditions.

Theorem 4.4. For a matrix $A \in \mathbb{R}^{s \times t}_{\geq 0}$, the algorithm $\mathrm{MatrixRound}(A)$ runs in time polynomial in $s, t$ and returns a matrix $B \in \mathbb{R}^{s \times t}_{\geq 0}$ such that $B_{ij} \leq A_{ij}$ for all $i \in [s], j \in [t]$, $B\mathbf{1} \in \mathbb{Z}^{s}_{+}$, $B^{\top}\mathbf{1} \in \mathbb{Z}^{t}_{+}$, and $\sum_{i,j}(A_{ij} - B_{ij}) \leq O(s' + t')$, where $s', t'$ denote the number of non-zero rows and columns.

For continuity of reading, we defer the description of $\mathrm{MatrixRound}(A)$ and its proof to Section 4.1.

Lemma 4.5 (Lemma 6.13 in [ACSS20]). Let $A \in Z^{\phi,frac}_{R} \subseteq \mathbb{R}^{\ell \times [0,k]}_{\geq 0}$ and $B \in \mathbb{R}^{\ell \times [0,k]}_{\geq 0}$ be such that $B_{ij} \leq A_{ij}$ for all $i \in [\ell], j \in [0,k]$, $B\mathbf{1} \in \mathbb{Z}^{\ell}_{+}$, $B^{\top}\mathbf{1} \in \mathbb{Z}^{[0,k]}_{+}$ and $\sum_{i \in [\ell], j \in [0,k]}(A_{ij} - B_{ij}) \leq t$. The algorithm $\mathrm{CreateNewProbabilityValues}(A, B, R)$ runs in polynomial time and returns a solution $A'$ and a probability discretization set $R'$ such that $A' \in Z^{\phi}_{R'}$ and $g(A') \geq \exp(-O(t \log n))\, g(A)$.
The algorithm $\mathrm{CreateNewProbabilityValues}$ is the same algorithm from [ACSS20], and the above lemma is a simplified version of Lemma 6.13 in [ACSS20]; see Appendix A.3 for its proof. The proof of Theorem 4.1 follows by combining the results for each step, and we defer it to Appendix A.4.

4.1 Matrix rounding algorithm and proof sketch of Theorem 4.4

In this section we prove Theorem 4.4. Given a matrix $A \in \mathbb{R}^{s \times t}_{\geq 0}$, our goal is to produce a rounded-down matrix $B$ with integer row and column sums, such that $0 \leq B \leq A$ (entrywise) and the total amount of rounding $\sum_{i,j}(A_{ij} - B_{ij})$ is bounded by $O(s' + t')$, where $s', t'$ are the number of nonzero rows and columns respectively. For simplicity we may assume $s = s'$ and $t = t'$ by simply dropping the zero rows and columns from $A$ and re-appending them to the resulting $B$. As our first step, we reduce the problem to a statement about graphs. Below we use $\deg_F(v)$ to denote the number of edges adjacent to a vertex $v$ within a set of edges $F$.

Lemma 4.6. Suppose that $G = (V, E)$ is a bipartite graph and $k$ is a positive integer. There exists a polynomial time algorithm that outputs a subgraph $F \subseteq E$ such that $\deg_F(v) \equiv 0 \pmod{k}$ for every vertex $v$, and $|E \setminus F| \leq O(k|V|)$.

Proof of Lemma 4.6 $\Rightarrow$ Theorem 4.4. Let $k = \min(s, t)$. Given $A$ we produce a bipartite graph with $s$ and $t$ vertices on the two sides; for every entry $A_{ij}$ we round down to the nearest integer multiple of $1/k$, say $c_{ij}/k$, and introduce $c_{ij}$ parallel edges between vertices $i$ and $j$ of the bipartite graph. Now Lemma 4.6 produces a subgraph $F$, and we let $B_{ij}$ be $1/k$ times the number of edges left in $F$ between $i, j$. By Lemma 4.6, $B$ will have integer row and column sums, and $0 \leq B \leq A$. We next show that the total amount of rounding is bounded by $O(s + t)$. Notice that when rounding each entry of $A$ down to $c_{ij}/k$, the total amount of change is at most $st/k = O(s + t)$. By the guarantee that $|E \setminus F| \leq O(k|V|)$, the total amount of rounding in the second step is also bounded by $O(k(s + t))/k = O(s + t)$.
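As an illustration, the first step of this reduction (rounding each entry down to a multiple of $1/k$ and encoding the result as a bipartite multigraph) can be sketched in Python; representing the multigraph by its integer edge-multiplicity matrix is our choice, not the paper's.

```python
import numpy as np

def to_multigraph(A):
    """Round each A_ij down to c_ij / k with k = min(s, t). The integer
    matrix c encodes c_ij parallel edges between row-vertex i and
    column-vertex j of a bipartite multigraph. The rounding loss
    sum_ij (A_ij - c_ij/k) is at most s*t/k <= max(s, t) <= s + t."""
    A = np.asarray(A, float)
    s, t = A.shape
    k = min(s, t)
    c = np.floor(A * k).astype(int)   # c_ij / k: largest multiple of 1/k below A_ij
    loss = (A - c / k).sum()
    assert loss <= s * t / k + 1e-9   # each entry loses < 1/k
    return c, k, loss
```

Applying Lemma 4.6 to this multigraph and dividing the surviving edge multiplicities by $k$ then yields the matrix $B$ of Theorem 4.4.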
So it remains to prove Lemma 4.6. As our main tool, we will use a result from [Tho14], which was obtained by reduction to an earlier result from [LTWZ13]. Roughly, this result says that as long as $G$ is sufficiently connected, we can choose a subgraph whose degrees are arbitrary values modulo $k$.

Lemma 4.7 ([Tho14, Theorem 1]). Suppose that $G = (V, E)$ is a bipartite $(3k-3)$-edge-connected graph. Suppose that $f : V \to \{0, \ldots, k-1\}$ is an arbitrary function, with the restriction that the sum of $f$ on either side of the bipartite graph $G$ yields the same result modulo $k$. Then, there is a subgraph $F \subseteq E$ such that for each vertex $v$, $\deg_F(v) \equiv f(v) \pmod{k}$.

Note that $(3k-3)$-edge-connectivity means that for every cut, i.e., every partitioning of the vertices into two nonempty sets $S, S^c$, the number of edges between $S$ and $S^c$ is at least $3k-3$. We show that Lemma 4.7 can also be made constructive, giving the polynomial time guarantee for Lemma 4.6.

Lemma 4.8. There is a polynomial time algorithm that produces the subgraph of Lemma 4.7.

We defer the proof of Lemma 4.8 to Appendix A.2. At a high level, the proof of Lemma 4.7 works by formulating an assumption about the graph that is more general and more nuanced than edge-connectivity; instead of a constant lower bound on every cut, this assumption puts a cut-specific lower bound on each cut, the details of which can be found in Appendix A.2. The rest of the argument follows a clever induction. To make this argument constructive, we show how to check the nuanced variant of edge-connectivity in polynomial time. We do this by proving that only cuts of size smaller than a constant multiple of the minimum cut have to be checked, and these can be enumerated in polynomial time [KS96]. Note that Lemma 4.7 does not guarantee anything about $|E \setminus F|$, even when $f$ is the zero function (the empty subgraph is actually a valid answer in that case). We will fix this using a theorem of [NW61].
We will first prove Lemma 4.6 with the extra assumption that $G$ is $6k$-edge-connected, and then prove the general case.

Proof of Lemma 4.6 when $G$ is $6k$-edge-connected. By a famous theorem due to [NW61], a $6k$-edge-connected graph contains $6k/2 = 3k$ edge-disjoint spanning trees. Moreover, the union of these $3k$ edge-disjoint spanning trees can be found in polynomial time by matroid partitioning algorithms [GW92]. Let $H$ be the subgraph formed by these $3k$ edge-disjoint spanning trees. We will ensure that all edges outside $H$ are included in $F$; as a consequence, we automatically get that $|E \setminus F|$ is bounded by the number of edges in $H$, which is at most $3k(|V| - 1) = O(k|V|)$. Let $H^c$ denote the complement of $H$ in $G$. Define the function $f : V \to \{0, \ldots, k-1\}$ in the following way: let $f(v)$ be $-\deg_{H^c}(v)$ modulo $k$. Note that $f$ has the same sum on either side of the bipartite graph, modulo $k$. We will apply Lemmas 4.7 and 4.8 to the graph $H$ (which is $3k \geq (3k-3)$-edge-connected) and the function $f$. Then we take the union of the subgraph returned by Lemma 4.8 and $H^c$ and output the result as $F$. Then $\deg_F(v) \equiv \deg_{H^c}(v) + f(v) \equiv 0 \pmod{k}$ for every vertex $v$. Note again that since we only deleted edges in $H$ to get $F$, the total number of edges we have removed can be at most $O(k|V|)$.

We have shown Lemma 4.6 for highly-connected graphs, and the proof for the general case follows by partitioning the graph into a union of vertex-disjoint highly-connected subgraphs while removing a small number of edges. We defer the proof for this general case to Appendix A.2.

5 Algorithm, Proof Sketch of Theorem 2.4 and Experiments

Here we present a simpler rounding algorithm that further provides a faster implementation of the Pseudo PML approach with provable guarantees. Similar to Section 4, we first provide an algorithm with respect to a probability discretization set $R$ that proves Theorem 5.1; we later choose the discretization set carefully to prove Theorem 2.4.
We perform experiments in Section 5.1 to analyze the performance of this rounding algorithm empirically. We defer all remaining details to Appendix B.

Theorem 5.1. Given a probability discretization set $R$ ($\ell \stackrel{\mathrm{def}}{=} |R|$) and a profile $\phi \in \Phi^n$ with $k$ distinct frequencies, there is an algorithm that runs in time $\tilde{O}(\ell\, k^{\omega})$ and returns a distribution $p'$ such that
$$\mathbb{P}(p', \phi) \geq \exp\left(-O((r_{\max} - r_{\min})\, n + k \log(\ell n))\right) \max_{q \in \Delta^{\mathcal{D}}_{R}} \mathbb{P}(q, \phi)\,.$$

For an appropriately chosen $R$, the above theorem immediately proves Theorem 2.4, and we defer both proofs to Appendix B.1. We now present the algorithm that proves Theorem 5.1.

Algorithm 2 ApproximatePML2($\phi$, $R$)
1: Solve $X = \arg\max_{S \in Z^{\phi,frac}_{R}} \log g(S)$ and let $X' = \mathrm{Sparse}(X)$.
2: Let $S'$ be the submatrix of $X'$ corresponding to its non-zero rows.
3: Let $R'$ denote the elements in $R$ corresponding to the non-zero rows of $X'$. Let $\ell' \stackrel{\mathrm{def}}{=} |R'|$.
4: for $i = 1 \ldots \ell' - 1$ do
5:   $S^{ext}_{i,j} = S'_{i,j} \cdot \lfloor \|S'_i\|_1 \rfloor / \|S'_i\|_1$ for all $j \in [0,k]$.
6:   $S'_{i+1,j} = S'_{i+1,j} + (S'_{i,j} - S^{ext}_{i,j})$ for all $j \in [0,k]$.
7: end for
8: $S^{ext}_{\ell',j} = S'_{\ell',j} \cdot \lfloor \|S'_{\ell'}\|_1 \rfloor / \|S'_{\ell'}\|_1$ for all $j \in [0,k]$.
9: Let $c = \sum_{i \in [1,\ell']} r'_i\, \|S^{ext}_i\|_1$, where $r'_i$ are the elements of $R'$.
10: Define $R^{ext} = \{r''_i\}_{i \in [1,\ell']}$, where $r''_i = r'_i / c$ for all $i \in [1,\ell']$.
11: Return distribution $p'$ with respect to $S^{ext}$ and $R^{ext}$ (see Definition 3.3).

5.1 Experiments

Here we present experimental results for entropy estimation. We analyze the performance of the Pseudo PML approach implemented using our rounding algorithm against other state-of-the-art estimators. Each plot depicts the performance of various algorithms for estimating entropy of different distributions with domain size $N = 10^5$. The x-axis corresponds to the sample size (in logarithmic scale) and the y-axis denotes the root mean square error (RMSE). Each data point represents 50 random trials.
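The rounding loop of Algorithm 2 (Steps 4-10) can be transcribed directly in numpy; this is our reading of the listing, and in particular we apply the same floor-based scaling to the last row as to the earlier rows.

```python
import numpy as np

def round_rows(S, Rp):
    """Steps 4-10 of Algorithm 2: scale each row so its sum becomes the
    floor of the current row sum, pass the removed mass to the next row,
    then rescale the probability values so the total mass is 1."""
    S = np.asarray(S, float).copy()
    Rp = np.asarray(Rp, float)
    Sext = np.zeros_like(S)
    L = S.shape[0]
    for i in range(L):
        row = S[i].sum()
        if row > 0:
            Sext[i] = S[i] * (np.floor(row) / row)
        if i + 1 < L:
            S[i + 1] += S[i] - Sext[i]         # carry leftover mass downward
    c = (Rp * Sext.sum(axis=1)).sum()          # c = sum_i r'_i ||Sext_i||_1
    Rext = Rp / c                              # r''_i = r'_i / c
    return Sext, Rext
```

By construction each row sum of `Sext` is an integer, the carry preserves the column sums (up to the mass dropped at the last row), and the rescaled values satisfy $\sum_i r''_i \|S^{ext}_i\|_1 = 1$, so `Sext` and `Rext` define a valid distribution via Definition 3.3.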
Mix 2 Uniforms is a mixture of two uniform distributions, with half the probability mass on the first N/10 symbols and the remaining mass on the last 9N/10 symbols, and Zipf(α) assigns probability proportional to 1/i^α for i ∈ [N]. MLE is the naive approach of using the empirical distribution with bias correction; all the remaining algorithms are denoted by their bibliographic citations.

Figure 1: Experimental results for entropy estimation. The three panels plot RMSE against sample size (10^2 to 10^6) for Mix 2 Uniforms, Zipf(−1) and Zipf(−0.5), comparing Our Work, JVHW15, VV11b, PJW17 and MLE.

In the above experiment, note that the error achieved by our estimator is competitive with the other state-of-the-art estimators. As for running times in practice, the other approaches tend to perform better than the current implementation of our algorithm. Further improving the running time of our approach, or of any other provable PML-based approach, requires building an efficient practical solver for the convex optimization problem [CSS19a, ACSS20] stated in the first step⁵ of our Algorithm 1; we think building such a solver is an important research direction. In Appendix B.4, we provide experiments for other distributions, compare the performance of the Pseudo PML approach implemented using our algorithm with a heuristic approximate PML algorithm [PJW17], and provide all the implementation details.

⁵In our current implementation, we use CVX [GB14] with the package CVXQUAD [FSP17] to solve the convex program stated in the first step of Algorithm 1.

Broader Impact

Symmetric property estimation has a broad range of applications, ranging from ecology [Cha84, CL92, BF93, CCG+12], to physics [VBB+12], to neuroscience [RWdRvSB99], and beyond [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a].
By providing new, broadly applicable, computationally efficient tools for obtaining higher accuracy estimates of symmetric properties, this work could enable a variety of applications in machine learning and the sciences more broadly. Though the primary contributions of this work are theoretical, the preliminary experimental results suggest that this work could ultimately lead to higher quality answers to statistical questions at lower computational cost, with less manual tuning to the particular statistical question of interest. This could ultimately help save time, energy, and the many costs associated with data collection. There are always risks in the widespread application of statistical tools; however, we are unaware of any particular biases or harms arising from the methods proposed. Further research may be required before the results of this paper can have a broad societal impact.

Acknowledgments We thank Alon Orlitsky and Yi Hao for helpful clarifications and discussions.

Sources of Funding Researchers on this project were supported by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a Simons Investigator Award, a Google Faculty Research Award, an Amazon Research Award, a PayPal research gift, a Sloan Research Fellowship, a Stanford Data Science Scholarship and a Dantzig-Lieberman Operations Research Fellowship.

Competing Interests The authors declare no competing interests.

[ACSS20] Nima Anari, Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Bethe and Sinkhorn permanents of low rank matrices and implications for profile maximum likelihood, 2020. [ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010. [ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation.
CoRR, abs/1611.02960, 2016. [AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1855–1869, 2014. [AOST17] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Trans. Inf. Theory, 63(1):38–56, January 2017. [AV20] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. arXiv e-prints, arXiv:2010.05846, October 2020. [BF93] John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993. [BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of KL divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016. [CCG+12] Robert K Colwell, Anne Chao, Nicholas J Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L Chazdon, and John T Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012. [Cha84] Anne Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265–270, 1984. [CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417):210–217, 1992. [CSS19a] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780–791, New York, NY, USA, 2019. ACM. [CSS19b] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford.
A general framework for symmetric property estimation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 12447–12457. Curran Associates, Inc., 2019. [DS13] Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4):325, 2013. [ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976. [FSP17] H. Fawzi, J. Saunderson, and P. A. Parrilo. Semidefinite approximations of the matrix logarithm. arXiv e-prints, May 2017. [Für05] Johannes Fürnkranz. Web mining. In Data Mining and Knowledge Discovery Handbook, pages 899–920. Springer, 2005. [GB14] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014. [GTPB07] Zhan Gao, Chi-hong Tseng, Zhiheng Pei, and Martin J Blaser. Molecular analysis of human forearm superficial skin bacterial biota. Proceedings of the National Academy of Sciences, 104(8):2927–2932, 2007. [GW92] Harold N Gabow and Herbert H Westermann. Forests, frames, and games: algorithms for matroid sums and applications. Algorithmica, 7(1-6):465, 1992. [HHRB01] Jennifer B Hughes, Jessica J Hellmann, Taylor H Ricketts, and Brendan JM Bohannan. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67(10):4399–4406, 2001. [HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On estimation of L_r-norms in Gaussian white noise models. arXiv e-prints, arXiv:1710.03863, October 2017. [HJW16] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of KL divergence between discrete distributions. CoRR, abs/1605.09124, 2016. [HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman.
Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018. [HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv e-prints, arXiv:1711.02141, November 2017. [HO19] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. arXiv e-prints, arXiv:1906.03794, June 2019. [HO20] Yi Hao and Alon Orlitsky. Profile entropy: A fundamental measure for the learnability and compressibility of discrete distributions, 2020. [HS20] Yanjun Han and Kirankumar Shiragur. The optimality of profile maximum likelihood in estimating sorted discrete distributions, 2020. [JHW16] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the L1 distance. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 750–754, July 2016. [JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015. [Kar00] David R Karger. Minimum cuts in near-linear time. Journal of the ACM (JACM), 47(1):46–76, 2000. [KLR99] Ian Kroes, Paul W Lepp, and David A Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999. [KS96] David R Karger and Clifford Stein. A new approach to the minimum cut problem. Journal of the ACM (JACM), 43(4):601–640, 1996. [LG14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC '14, pages 296–303, New York, NY, USA, 2014. Association for Computing Machinery. [LTWZ13] László Miklós Lovász, Carsten Thomassen, Yezhou Wu, and Cun-Quan Zhang. Nowhere-zero 3-flows and modulo k-orientations. Journal of Combinatorial Theory, Series B, 103(5):587–598, 2013.
[NW61] C. St. J. A. Nash-Williams. Edge-disjoint spanning trees of finite graphs. Journal of the London Mathematical Society, 1(1):445–450, 1961. [OSS+04] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 304–304, 2004. [OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016. [PBG+01] Bruce J Paster, Susan K Boches, Jamie L Galvin, Rebecca E Ericson, Carol N Lau, Valerie A Levanos, Ashish Sahasrabudhe, and Floyd E Dewhirst. Bacterial diversity in human subgingival plaque. Journal of Bacteriology, 183(12):3770–3783, 2001. [PJW17] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate profile maximum likelihood. arXiv e-prints, December 2017. [RCS+09] Harlan S Robins, Paulo V Campregher, Santosh K Srivastava, Abigail Wacher, Cameron J Turtle, Orsalem Kahsai, Stanley R Riddell, Edus H Warren, and Christopher S Carlson. Comprehensive assessment of t-cell receptor β-chain diversity in αβ t cells. Blood, 114(19):4099–4107, 2009. [RRSS07] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pages 559–569, October 2007. [RVZ17] Aditi Raghunathan, Gregory Valiant, and James Zou. Estimating the unseen from multiple populations. CoRR, abs/1707.03854, 2017. [RWdRvSB99] Fred Rieke, David Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999. [TE87] Ronald Thisted and Bradley Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987. [Tho14] Carsten Thomassen. Graph factors modulo k.
Journal of Combinatorial Theory, Series B, 106:174–177, 2014. [VBB+12] Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz. Estimation of the entropy based on its polynomial representation. Phys. Rev. E, 85:051139, May 2012. [Von12] Pascal O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. Pages 2012–2016, July 2012. [Von14] P. O. Vontobel. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant–Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10, February 2014. [VV11a] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412, October 2011. [VV11b] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC '11, pages 685–694, New York, NY, USA, 2011. ACM. [Wil12] Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC '12, pages 887–898, New York, NY, USA, 2012. Association for Computing Machinery. [WY15] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv e-prints, April 2015. [WY16a] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, June 2016. [WY16b] Yihong Wu and Pengkun Yang. Sample complexity of the distinct elements problem. arXiv e-prints, arXiv:1612.03375, December 2016. [ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur.
Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293, October 2016.