Journal of Machine Learning Research 20 (2019) 1-55. Submitted 2/18; Revised 7/19; Published 7/19.

Approximate Profile Maximum Likelihood

Dmitri S. Pavlichin (dmitrip@stanford.edu), Department of Applied Physics, Stanford University, Stanford, CA 94305, USA

Jiantao Jiao (jiantao@eecs.berkeley.edu), Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA

Tsachy Weissman (tsachy@stanford.edu), Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Editor: Sara van de Geer

Abstract

We propose an efficient algorithm for approximate computation of the profile maximum likelihood (PML), a variant of maximum likelihood that maximizes the probability of observing a sufficient statistic rather than the empirical sample. The PML has appealing theoretical properties, but is difficult to compute exactly. Inspired by observations gleaned from exactly solvable cases, we look for an approximate PML solution, which, intuitively, clumps comparably frequent symbols into one symbol. This amounts to lower-bounding a certain matrix permanent by summing over a subgroup of the symmetric group rather than the whole group during the computation. We experiment extensively with the approximate solution, and the empirical performance of our approach is competitive with, and sometimes significantly better than, state-of-the-art performance on various estimation problems.

Keywords: profile maximum likelihood, dynamic programming, sufficient statistic, partition of multi-partite numbers, integer partition

1. Introduction

The maximum likelihood principle, proposed by Ronald Fisher, has proved to be a powerful and versatile method used throughout nearly all scientific fields. However, it is still the source of considerable controversy in the statistics and machine learning community.
Indeed, quoting Efron (Efron, 1982): "The controversy centers on the relationship between decision theory and maximum likelihood. Beginning with the Neyman-Pearson lemma, decision theory has reshaped the theory and practice of hypothesis testing..." Indeed, Wald's statistical decision theory (Wald, 1950) provided a comprehensive framework for evaluating and proposing statistical procedures.

©2019 Dmitri S. Pavlichin, Jiantao Jiao, and Tsachy Weissman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-075.html.

The maximum likelihood approach, which can be proved to be asymptotically efficient under mild conditions in the Hájek-Le Cam theory (Van der Vaart, 2000, Chap. 9), lacks a finite-sample justification. Indeed, as Le Cam (Le Cam, 1979) argued, even in the asymptotic regime the maximum likelihood approach may have weird behavior and is not always consistent. The recent work (Acharya et al., 2017) provided an elegant justification of the maximum likelihood approach which is intimately connected to decision theory. It was shown in (Acharya et al., 2017) that plugging the profile maximum likelihood (PML) distribution estimator into a variety of functionals achieves the optimal sample complexity in estimating those functionals, which the generic sequence maximum likelihood (SML) fails to achieve. We also refer the reader to (Anevski et al., 2017) for statistical properties of the PML. To compute the profile maximum likelihood estimator we must solve the following optimization problem: given n samples with empirical distribution ˆp = (ˆp_1, ˆp_2, . . .), maximize the probability of observing ˆp up to relabeling, where σˆp ≡ (ˆp_{σ(1)}, ˆp_{σ(2)}, . . .) denotes the relabeling of the empirical distribution by a permutation σ.
This amounts to computing the PML distribution p*:

p* ∈ argmax_p Σ_σ P_p(σˆp) = argmax_p Σ_σ e^{−n D(σˆp ‖ p)}    (1)

where the max is over all discrete distributions p with a known support set size (we treat the unknown case later; it is similar), the sum is over all permutations σ of the support set of distribution p, and D(·‖·) is the Kullback-Leibler divergence. The profile maximum likelihood estimator is computationally challenging to find, as it can be shown to be equivalent to optimizing a certain matrix permanent, and the best known algorithm has running time exponential in the support size of the unknown discrete distribution. Our main contributions are the following:

1. We present an efficiently computable approximation for the PML distribution. The approximation idea is motivated by observations for the exact PML for small alphabets, where the PML distribution tends to put symbols whose empirical counts are close (within O(√n)) into the same level set¹. The idea is to lower bound the sum in (1) by summing over only the permutations that contribute the most, namely the subgroup of permutations that only mix symbols within the level sets of p. This leads to the objective function in the argmax below (a lower bound to the objective function in (1)) used to define the approximate PML (APML) distribution p̃*:

p̃* ∈ argmax_p e^{−n D(ˆp ‖ p)} ∏_{α ∈ A(p)} |α|!    (2)

where A(p) is the partition of the support set of p into the level sets α of p. The second factor of the objective function rewards clumping many symbols together into a few large level sets, while the first factor encourages similarity to the empirical distribution and dominates as the sample size n → ∞.

1. A level set of distribution p is {x : p_x = u} for some value u.

We present a dynamic programming approach to compute the approximate PML distribution p̃*. Compared with existing approximations to the PML (Orlitsky et al., 2004b; Vontobel, 2012), our algorithm has no tuning parameters, is deterministic, runs
in at most linear time in the sample size (usually computing the empirical histogram is the slowest part), and achieves state-of-the-art empirical performance in the various estimation tasks that we detail below. For the case of unknown support set size, we state the appropriate generalization of the PML distribution and its approximation. We modify our dynamic programming algorithm slightly to handle this case, preserving a linear worst-case runtime. It may happen that the PML and our approximate PML distributions have a discrete part with finite support and a continuous part, in the terminology of (Orlitsky et al., 2004b); we are able to detect this case and find the discrete part of the approximate PML distribution and the probability mass of the continuous part.

2. Given the result in (Acharya et al., 2017) that plugging the PML distribution into various functionals achieves the optimal sample complexity, we experiment extensively with the plug-in estimator using our APML distribution. We estimate the entropy, Rényi entropy, support set size, L1 distance to uniformity, and the sorted probability vector of a discrete distribution, and compare with the state-of-the-art approaches with available code for estimating those functionals. We find that the performance of plugging our APML distribution into those functionals is consistently competitive, and sometimes much better than state-of-the-art packages.

3. We extend our approximation scheme to the multi-dimensional PML problem, which is intimately connected to estimating divergence functions of discrete distributions. Utilizing results on partitions of bipartite numbers, we show that solving the two-dimensional PML problem leads to estimators that achieve the optimal sample complexity for estimating a variety of functionals such as the KL divergence, the L1 distance, the squared Hellinger distance, and the χ² divergence.
The multi-dimensional APML distribution turns out to be harder to compute than the one-dimensional APML, so we settle for a greedy heuristic to approximate the multi-dimensional APML distribution.

4. We provide extensive experimental results on the plug-in approach for estimating divergence functions using our approximation to the two-dimensional PML. This achieves competitive and sometimes much better results than state-of-the-art approaches in KL divergence and L1 distance estimation.

5. We generalize the PML idea to general group actions, leading to the generalized PML approach for mutual information estimation. We analyze several candidates for the PML solution and show that one candidate makes the key arguments in (Acharya et al., 2017) fail, while the other succeeds.

We provide code at https://github.com/dmitrip/PML and (Pavlichin et al., 2017) for computing the approximate PML distributions.

This work is organized as follows. Section 2 reviews some results on profile maximum likelihood and functional estimation. Section 3 specializes the discussion of Section 2 to the setting of estimating functionals of discrete distributions, and defines the PML and multi-dimensional PML distributions. Section 4 presents our APML approach and the observations about the exact PML that inspired it. Section 5 shows the results of numerical experiments using the APML to estimate symmetric functionals of a distribution: the sorted probability vector, entropy, Rényi entropy, L1 distance to uniformity, and support set size. Section 6 shows the results of numerical experiments using the APML to estimate symmetric functionals of multiple distributions: the KL divergence and L1 distance. The appendices contain proofs, examples, most of the notation and algorithms related to the multi-dimensional APML (Appendix H), and an application of the PML to estimation of mutual information (Appendix I).
2. The principle of profile maximum likelihood and partial sufficiency

We rephrase the key result in (Acharya et al., 2017) regarding the maximum likelihood principle in Theorem 3; similar ideas appeared in (Acharya et al., 2011). We define a statistical model

E = {X, P_θ, θ ∈ Θ},    (3)

where the observation X ∼ P_θ for some θ ∈ Θ, X ∈ X. Let F(θ) : Θ → Y be a measurable function. Let d(F, ˆF) : Y × Y → R_{≥0} be the loss function, which is also assumed to be a metric. Given observation x, upon asserting that a statistic T is partially sufficient (Basu, 2011) for estimating F(θ), the general profile maximum likelihood approach aims at maximizing the probability that t = T(x) appears, rather than the probability of x, which is the aim of traditional maximum likelihood. Then, we simply plug the PML estimator into the function F(·) to obtain an estimate of the functional F(θ). The profile maximum likelihood estimator of θ is defined as follows.

Definition 1 (Profile maximum likelihood) The profile maximum likelihood estimator of θ is defined as

ˆθ_T(t) ∈ argmax_θ P_θ(T(X) = t).    (4)

Remark 2 The notion of profile likelihood has appeared in the statistics literature before (Murphy and Van der Vaart, 2000), bearing a different meaning from the one we adopt. The principle of profile maximum likelihood we introduced in Definition 1 is known in the statistics community as restricted maximum likelihood, which was introduced in (Patterson and Thompson, 1971) and surveyed in (Harville, 1977).

The following theorem provides a general performance guarantee for the PML algorithm introduced in Definition 1 in terms of estimating any functional F(θ).

Theorem 3 (Acharya et al., 2017) We fix a statistical model E = {X, P_θ : θ ∈ Θ}. Let T = T(X) be a statistic such that T : X → T and |T| < ∞. Let F(θ) : Θ → Y be a measurable function. Suppose there exists an estimator ˆF : T → Y such that

sup_{θ∈Θ} P_θ( d(F(θ), ˆF(T)) > ϵ ) < δ.    (5)

Then

sup_{θ∈Θ} P_θ( d(F(θ), F(ˆθ_T)) > 2ϵ ) ≤ δ |T|.
(6)

The usefulness of Theorem 3 rests on two factors: the existence of a good estimator ˆF(T) in terms of worst-case risk, and the small cardinality of the set T in which the statistic T lies. As we discussed above, it would be sensible to employ the PML approach if we can assert that the statistic T is sufficient for estimating F(θ). However, what is the right definition of sufficiency in this context? This is one of the most basic problems in decision theory, and it turns out to be a highly non-trivial question. It is usually termed "partial sufficiency", and we refer the reader to (Basu, 2011) for a survey. To illustrate the subtleties of this definition, we review some historical developments below. The question was considered by Kolmogorov (Kolmogoroff, 1942), who proposed the following definition:

Definition 4 (Kolmogorov's definition of partial sufficiency) (Kolmogoroff, 1942) A statistic T = T(X) is called partially sufficient for F(θ) if the posterior distribution of F(θ) given X = x depends only on T = t and on the prior distribution of θ.

It was later shown by Hájek (Hájek, 1967) that the Kolmogorov definition is void in the sense that if F(θ) is not a constant, and T is partially sufficient for F(θ) in the sense of Kolmogorov, then T is sufficient for θ in the usual sense as well. In this context, Hájek proposed another definition of partial sufficiency.²

Definition 5 (Hájek, 1967, Def. 2.2) A statistic T = T(X) is called partially sufficient for F(θ) if there exists some other functional R(θ) such that F(θ) = F_1(R(θ)), and the following is satisfied:

1. The distribution of T depends on R(θ) only, that is,

P_θ(dT) = P_{R(θ)}(dT);    (7)

2. There exists a distribution Q_R ∈ P_R such that T is sufficient for the family {Q_R}, where P_R is the convex hull of the distributions {P_θ : R(θ) = R}.

It was shown in (Hájek, 1967) that under the partial sufficiency in Def.
5, an analogue of the Rao-Blackwell theorem in terms of the worst-case risk can be proved.³ This begs the question: in general, how could one prove that a certain statistic T is partially sufficient for F(θ)? The following equivalence relation induced by group transformations is common in practice. Let G = {g} be a group of one-to-one transformations of the space X onto itself. We shall say that an event A is G-invariant if gA = A for all g ∈ G. The set of G-invariant events is a sub-σ-algebra B, and a measurable function f is B-measurable if and only if f(gx) = f(x) for all g ∈ G, x ∈ X.

Theorem 6 (Hájek, 1967) Suppose the parameter θ can be represented as θ = (τ, g), where g ∈ G, and G is a group of one-to-one transformations of the space X onto itself. Suppose there exists a family of probability distributions {P_τ} indexed by τ such that

P_θ(A) = P_τ(gX ∈ A),    (8)

where X ∼ P_τ. Then, under mild conditions⁴, the sub-σ-algebra of G-invariant events is partially sufficient for τ under the statistical model {X, P_θ} in the sense of Def. 5.

2. For other definitions of sufficiency, see (Le, 1964; Birnbaum, 1962; Blackwell and Girshick, 1979).
3. Note that it is weaker than the usual Rao-Blackwell theorem, which holds pointwise for the risk function.

Theorem 6 can be interpreted in the following way. We introduce the equivalence relation x_1 ∼ x_2 if and only if there exists some g ∈ G such that x_1 = gx_2. Then, Theorem 6 states that it is partially sufficient for τ to look at the equivalence classes [a] = {x ∈ X : x ∼ a}.

3. Discrete distributions up to relabeling

In this section, we specialize Theorem 6 to several different settings, centered around the problem of estimating functionals of discrete distributions.

3.1. Permutation group and single sorted probability vector

The standard PML problem can be viewed as a special case of Theorem 6 with the group being the permutation group.
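For intuition, Definition 1 can be brute-forced in the simplest instance of this setting: n i.i.d. Bernoulli(θ) samples, where a label-invariant statistic is T = min(X, n − X) with X the number of successes. The function name and the grid search below are ours, not the paper's algorithm; this is a toy sketch of the definition only.

```python
from math import comb

def pml_bernoulli(n, t, grid):
    """Brute-force PML estimator (Definition 1): maximize P_theta(T = t) over a grid,
    where T = min(X, n - X) and X ~ Binomial(n, theta) counts successes."""
    def prob_T(theta):
        p = comb(n, t) * theta**t * (1 - theta)**(n - t)
        if t != n - t:                       # T = t pools the outcomes X = t and X = n - t
            p += comb(n, n - t) * theta**(n - t) * (1 - theta)**t
        return p
    return max(grid, key=prob_T)

grid = [i / 1000 for i in range(1001)]
assert pml_bernoulli(30, 15, grid) == 0.5    # balanced counts: the fair coin wins
assert pml_bernoulli(30, 5, grid) != 0.5     # very skewed counts: a biased coin wins
```

Even this toy example exhibits the clumping behavior studied below: moderately unbalanced counts still yield the uniform estimate, while strongly unbalanced counts do not.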
Concretely, let G be the permutation group S_X on X, let p be a distribution supported on the set X (that is, p_x > 0 for all x ∈ X), and let τ be the sorted non-increasing probability vector (p_(1), p_(2), . . . , p_(|X|)) of p. It is clear that the label-invariant properties of a distribution, such as the entropy, Rényi entropy, and support set size, depend on p only through τ. Suppose we observe n i.i.d. samples x_1^n = (x_1, . . . , x_n) with distribution p. Denote by ˆp = ˆp(x_1^n) the empirical distribution:

ˆp = (ˆp_x)_{x∈X},  ˆp_x = (1/n) Σ_{i=1}^n 1(x_i = x).

It is well known that the empirical distribution is the minimum complete sufficient statistic for p (Lehmann and Casella, 1998). The probability of observing a specific empirical distribution is given by

P_p(ˆp) = ( n! / ∏_{x∈X} (nˆp_x)! ) ∏_{x∈X} p_x^{nˆp_x}    (10)

where the prefactor in (10) is a multinomial coefficient⁵. Applying Theorem 6 with θ = p = (τ, g), we obtain that the fingerprint statistic (Valiant and Valiant, 2011b, 2013)⁶ is partially sufficient for τ. Concretely, the fingerprint F = (F_i)_{i≥0} is defined so that F_i is the number of symbols observed exactly i times in x_1^n:

F = F(ˆp(x_1^n)) = (F_i)_{i≥0} ≡ (|{x ∈ X : nˆp_x = i}|)_{i≥0}    (11)

4. Precisely, Condition 3.1 in (Hájek, 1967).
5. The multinomial coefficient is n! ∏_{x∈X} 1/(nˆp_x)!.
6. The fingerprint is also called the profile (Orlitsky et al., 2004b), histogram order statistics (Paninski, 2003), and histogram of histograms (Batu et al., 2000).

Below, we compute the probability of a specific fingerprint for the case that p is supported on a finite alphabet X = Supp(p) with K ≡ |X|, and the empirical distribution ˆp is supported on the empirical alphabet

ˆX ≡ Supp(ˆp) = {x ∈ X : ˆp_x > 0} ⊆ X, with ˆK ≡ |ˆX|.    (13)

Then F_0 = |X \ ˆX| = K − ˆK counts the number of unseen symbols, and is thus unknown if the support set size K is unknown.
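The observed part of the fingerprint (11), i.e. F_i for i ≥ 1, is a two-line computation over the sample; F_0 requires knowing the support set size. A minimal sketch (function name ours):

```python
from collections import Counter

def fingerprint(samples):
    """F_i = number of distinct symbols appearing exactly i times, as in (11).
    Only i >= 1 is observable; F_0 = K - K_hat needs the support set size K."""
    counts = Counter(samples)              # empirical counts n * p_hat_x
    return dict(Counter(counts.values()))  # multiplicity -> number of symbols

# 'a' seen 3 times; 'b' and 'c' twice each; 'd' once
assert fingerprint("aaabbccd") == {3: 1, 2: 2, 1: 1}
```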
In Appendix A we derive the following expression for the probability of a specific fingerprint for the case of finite X:

P_p(F) = ( n! / (F_0! ∏_{i≥1} (i!)^{F_i} F_i!) ) · perm( (p_x^{nˆp_{x'}})_{x,x'∈X} )    (14)

where perm(A) denotes the matrix permanent of the K × K matrix A:

perm(A) = Σ_{σ∈S_X} ∏_{x∈X} A_{x,σ(x)}    (15)

where the sum is over the symmetric group S_X on X. To simplify notation, we let Q denote the K × K matrix in (14). Note that to evaluate expression (14) we need to know the support set X, both to evaluate perm(Q) and F_0 = |X \ ˆX|. Appendix B shows examples of the computation of P_p(F) over small alphabets. For completeness, Appendix A also contains expression (87) for P_p(F) for the case of countably infinite X, though that expression is not used elsewhere in this work, for the reason stated in the last paragraph of Section 3.1.

For a given collection of distributions P, the profile maximum likelihood distribution is defined as

p* ∈ argmax_{p∈P} P_p(F) = argmax_{p∈P} perm(Q) / (K − ˆK)!,    (16)

where in the second equality we discarded all p-independent factors of P_p(F) (14); F_0! = (K − ˆK)! depends on p through its support set size. Note that P_p(F) is invariant under relabeling of the components of p, so we can choose p to be non-increasing in the same ordering as we choose for the support set X. Note that the set P is not necessarily the same as the set of all discrete distributions. If the collection of distributions P includes distributions with different support set sizes (for example, all finite support set sizes at least as large as ˆK), then we can estimate the support set size by breaking up the optimization in (16) into two steps:

max_K ( 1/(K − ˆK)! ) max_{p∈P_K} perm(Q)    (17)

whenever the max over K exists, where P_K ≡ {p ∈ P : |Supp(p)| = K}. The maximizer K* usually exists because increasing K makes the first factor in (17) smaller, but makes the K × K matrix Q bigger, boosting the number K! of permutations to sum over in computing the permanent.
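Since perm is just a sum over permutations, (14)–(15) can be checked directly on tiny alphabets. A brute-force sketch (function names ours; exponential time, illustration only):

```python
import itertools
from math import prod

def permanent(A):
    """Matrix permanent by direct enumeration of the symmetric group, as in (15).
    Exponential in the matrix size; usable only for tiny alphabets."""
    n = len(A)
    return sum(prod(A[i][sigma[i]] for i in range(n))
               for sigma in itertools.permutations(range(n)))

def Q_matrix(p, counts):
    """The K x K matrix Q of (14): Q[x][x'] = p_x ** (count of x'), counts = n * p_hat."""
    return [[p_x ** c for c in counts] for p_x in p]

# sanity checks: perm(I) = 1 and perm of the all-ones 2x2 matrix = 2
assert permanent([[1.0, 0.0], [0.0, 1.0]]) == 1.0
assert permanent([[1.0, 1.0], [1.0, 1.0]]) == 2.0

Q = Q_matrix([0.5, 0.3, 0.2], [2, 1, 0])  # 3 symbols, empirical counts (2, 1, 0)
```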
It may happen that K* does not exist⁷, in which case we are still able to define a PML distribution in terms of a discrete part and a continuous part, in the terminology of (Orlitsky et al., 2004b). (Orlitsky et al., 2004b) showed that for any fingerprint F there exists a distribution p* maximizing P_p(F), but this distribution may assign some symbols to the continuous part, and others to the discrete part with finite support. The intuition is that if there are sufficiently many symbols that appear exactly once in the sample, then the PML distribution p* assigns discrete probability 0 to infinitely many symbols, to maximize the probability that each of them is seen only once. Then we can define K*_d to be the optimal support set size of the discrete part of p*. (Orlitsky et al., 2004b) further derived conditions lower- and upper-bounding K*_d in terms of max_x ˆp_x and min_x ˆp_x, thus showing that K*_d exists, and upper-bounding the size of the continuous part in terms of F_1 (the number of symbols occurring once). (Acharya et al., 2009) computed p* for all sample sizes up to n = 7. Note that either K* exists or the PML distribution p* has a continuous part, but it is never the case that p* has countably infinite support. For this reason we never use expression (87) for the probability of a fingerprint when X is countably infinite.

3.2. Permutation group and multiple probability vectors

Suppose we have two distributions p, q on the same alphabet X, such that p_x + q_x > 0 for all x ∈ X. This condition ensures that there are no symbols x such that p_x = q_x = 0, which simplifies the expressions below. Draw n samples i.i.d. from p with empirical distribution ˆp, and draw m samples i.i.d. from q with empirical distribution ˆq.
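In this two-sample setting the relevant statistic, defined formally in (18) below, tabulates joint counts. It is as easy to compute as the one-dimensional fingerprint; a minimal sketch over the observed symbols (function name ours; the (0, 0) entry, like F_0, needs the support set size):

```python
from collections import Counter

def fingerprint_2d(samples_p, samples_q):
    """F_{i,j} = number of symbols seen i times in the first sample and j times
    in the second; the unobservable (0, 0) entry is omitted."""
    cp, cq = Counter(samples_p), Counter(samples_q)
    return dict(Counter((cp[x], cq[x]) for x in set(cp) | set(cq)))

# 'a' has counts (2, 1), 'b' has counts (1, 2)
assert fingerprint_2d("aab", "abb") == {(2, 1): 1, (1, 2): 1}
```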
It is clear that the label-invariant properties of two distributions, such as the Kullback-Leibler divergence, the L1 distance, and the general family of divergences Σ_{x∈X} f(p_x, q_x), are invariant to the permutation group S_X acting on pairs of distributions as σ(p, q) ≡ (σp, σq) for all σ ∈ S_X. Denote the 2-D fingerprint (Raghunathan et al., 2017) by F(ˆp, ˆq), analogous to (11):

F = F(ˆp, ˆq) = (F_{i,j})_{i,j≥0} ≡ (|{x ∈ X : nˆp_x = i, mˆq_x = j}|)_{i,j≥0}    (18)

The probability of drawing the 2-D fingerprint under p, q is, up to (p, q)-independent combinatorial factors,

P_{p,q}(F) ∝ (1/F_{0,0}!) perm( (p_x^{nˆp_{x'}} q_x^{mˆq_{x'}})_{x,x'∈X} )    (19)

Following the general profile maximum likelihood methodology, for a collection P of pairs of distributions on alphabet X, the PML distributions p*, q* are:

(p*, q*) ∈ argmax_{(p,q)∈P} P_{p,q}(F)    (20)

Note that finding the PML pair (p*, q*) is not equivalent to solving two independent optimization problems, since the matrix permanent in (19) does not factor into two terms. As in the 1-D PML case, if the support set size K = |X| is unknown, then we can attempt to estimate it analogously to (17); see Appendix H for a precise statement. Appendix H generalizes the argument above in a straightforward manner to D-dimensional fingerprints (Raghunathan et al., 2017). This allows estimation of functionals of more than two distributions.

7. For example, if n ≥ 2 and each symbol occurs exactly once in the sample, so ˆp_x = 1/n for all x ∈ ˆX.

4. An approximation to the profile maximum likelihood distribution

4.1. Our approximate PML algorithm: an overview

It seems computationally challenging to compute the profile maximum likelihood distribution in all the cases discussed in Section 3, though the computational complexity of this problem is not known. Indeed, it involves computing the matrix permanent in (16), and computing permanents is #P-hard (Valiant, 1979).
Since the computation of permanents with the best-known algorithms is exponentially slow in the support set size |X|, one may lower the target and hope to find an efficient approximate algorithm for all the profile maximum likelihood problems in Section 3. Some approximate PML algorithms have been developed: (Orlitsky et al., 2004a) proposes an expectation-maximization algorithm and (Vontobel, 2012) proposes a Bethe approximation to the PML distribution. None of these works, including ours, provides guarantees on the nearness of the approximate and exact PML distributions. Here we develop a PML distribution approximation inspired by some empirical observations in the small examples where this distribution can be computed exactly in reasonable time. We provide a fast algorithm for computing this approximation. The algorithm has no tunable parameters and runs in time at most linear in the sample size for the standard PML problem (Section 3.1), its run time usually dominated by the computation of the empirical distribution ˆp. It is also efficient in solving the divergence estimation problems described in Section 3.2.

Observations from solving the PML exactly in small cases (see Section 4.2) yield the intuition that the PML distribution p* assigns equal mass to ("clumps together") symbols whose empirical counts are close, differing on the order of √n, where n is the sample size. For such a clumped distribution p*, we further guess that the value of the matrix permanent in P_{p*}(F) ∝ perm(Q)/F_0! (14) is dominated by a subset of terms in the summation over the symmetric group S_X, namely the subgroup of permutations that only mix symbols within the same clump (a level set of p*) but not between different clumps. Armed with this intuition, we replace maximization of the matrix permanent, which seems hard, with maximization of a lower bound involving a sum over only this subgroup of permutations, which is much easier.
Section 4.3 gives a precise statement of these remarks and further motivates our approach. The remainder of this section is devoted to an overview of our approximation algorithm. Let A(p) = {α} be the partition of X into the level sets of distribution p:

A(p) ≡ {{x ∈ X : p_x = u} : u ∈ {p_x : x ∈ X}}    (21)

That is, x and x' belong to the same α ∈ A(p) if and only if p_x = p_{x'}.

Then our lower bound V to the log permanent is defined by summing over a subgroup of the symmetric group, and satisfies (see the derivation in Section 4.3):

log(perm(Q)) ≥ V(p) = Σ_{α∈A(p)} log(|α|!) − n (D(ˆp ‖ p) + H(ˆp))    (22)

where D(ˆp ‖ p) = Σ_{x∈X} ˆp_x log(ˆp_x/p_x) is the Kullback-Leibler divergence and H(ˆp) = −Σ_{x∈X} ˆp_x log(ˆp_x) is the entropy of the empirical distribution. The approximate PML (APML) distribution is then the maximizer of the lower bound V:

p̃* ∈ argmax_{p∈P} V(p)    (23)

The right-hand side of (22) yields some intuition for the approximate PML distribution. Suppose P contains only distributions with support set size K = |X|, so F_0 = K − ˆK is fixed in (16). Then the first term on the right-hand side of (22) encourages p to clump many symbols together into a few large clumps (since log(|α|!) ≈ |α| log(|α|) is superlinear in |α|), while the second term encourages p to be similar to the empirical distribution ˆp. As the sample size n → ∞, the second term dominates, and we have p̃* → ˆp, consistent with our intuition for the large sample size limit and consistent with the result of (Orlitsky et al., 2004b) (Theorem 16) for the exact PML distribution. It turns out that computation of the approximate PML distribution p̃* is equivalent to optimization over all partitions of X.
This is because, as we show, any distribution p̃* maximizing V satisfies an averaging property with respect to the empirical distribution ˆp for all x ∈ X: letting α(x, p̃*) = {x' ∈ X : p̃*_{x'} = p̃*_x} ∈ A(p̃*) be the level set of p̃* containing the symbol x,

p̃*_x = (1/|α(x, p̃*)|) Σ_{x'∈α(x,p̃*)} ˆp_{x'}    (24)

Therefore p̃* is determined by its partition of the alphabet X into level sets, so we can replace maximization of V over p with maximization over all partitions A. Let A* denote the optimal partition:

A* ∈ argmax_{A : partition of X} V(A)    (25)

where V(A) ≡ V(p_A(ˆp)) and p_A(ˆp) is the distribution obtained by averaging the empirical distribution over the partition elements, that is, satisfying property (24) for all x ∈ X. The approximate PML distribution is then p̃* = p_{A*}(ˆp). We derive constraints on the optimal partition A* that enable an efficient⁸ dynamic programming algorithm to compute A* and the approximate PML distribution p̃*. This algorithm is presented in Section 4.5.

For the case of D ≥ 2 probability vectors and D-dimensional fingerprints (for example, when estimating the KL divergence, corresponding to D = 2 as in Section 3.2), we are unable to give an efficient algorithm for maximizing our lower bound V suitably generalized to the D-dimensional case. The difficulty is that there is no natural ordering on N^D for D ≥ 2, so the dynamic programming approach for D = 1 does not work here. In this case we settle for another layer of approximation, using a greedy heuristic to iteratively merge clumps of symbols: we enlarge the first term in (22) until the second term becomes too large. This algorithm is presented in Appendix H.

8. Running in time O(|Supp(F)|²). Any empirical distribution ˆp satisfies |Supp(F(ˆp))| ≤ (√(8n + 1) + 1)/2 (see the discussion at the end of Section 4.5), where n is the sample size, so the running time is O(n). The run time of our algorithm is usually dominated by computation of the empirical distribution ˆp.
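The averaging property (24) and the partition objective V(A) can be sketched directly. The names below are ours; since the H(ˆp) term in (22) does not depend on p, we drop it and evaluate Σ_α log(|α|!) − n D(ˆp ‖ p_A), which is the log of the objective in (2):

```python
from math import lgamma, log

def average_over_partition(p_hat, partition):
    """Property (24): every symbol in a level set gets the set's average empirical mass."""
    p = {}
    for alpha in partition:
        u = sum(p_hat[x] for x in alpha) / len(alpha)
        for x in alpha:
            p[x] = u
    return p

def V(p_hat, partition, n):
    """sum_alpha log(|alpha|!) - n * D(p_hat || p_A), with p_A obtained from (24)."""
    p = average_over_partition(p_hat, partition)
    out = sum(lgamma(len(alpha) + 1) for alpha in partition)   # log(|alpha|!)
    out -= n * sum(ph * log(ph / p[x]) for x, ph in p_hat.items() if ph > 0)
    return out

p_hat = {'a': 0.5, 'b': 0.25, 'c': 0.25}
assert average_over_partition(p_hat, [('a', 'b'), ('c',)]) == {'a': 0.375, 'b': 0.375, 'c': 0.25}
# with few samples, clumping 'a' and 'b' beats the trivial (singleton) partition...
assert V(p_hat, [('a', 'b'), ('c',)], n=4) > V(p_hat, [('a',), ('b',), ('c',)], n=4)
# ...but as n grows the KL term dominates and the clumped partition loses
assert V(p_hat, [('a', 'b'), ('c',)], n=100) < 0
```

The trivial partition always has V = 0 (every log(1!) vanishes and D(ˆp ‖ ˆp) = 0), which makes the clumping-versus-fidelity trade-off in (22) easy to see numerically.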
Figures 1 and 2 show the approximate PML distributions for the cases D = 1 and D = 2, respectively. For D = 1, the approximate PML distribution is as defined in (23). For D ≥ 2, we approximately maximize our lower bound V to the log matrix permanent as described in Appendix H. Figure 14 in Appendix H shows the case D = 3.

Figure 1: (left) An empirical histogram ˆp sorted in non-increasing order (gray squares), drawn from an underlying distribution p (black area), and the approximate PML (APML) distribution p̃* (23) (red line), scaled by the sample size n = 200. The alphabet is X = {1, . . . , 40}, assumed unknown in computing p̃*. (right) Computation of the log permanent lower bound V(p̃*) (31) involves summing over all permutations that mix symbols only within the level sets of p̃*, corresponding to all permutation matrices with nonzero entries within the black blocks.

4.2. Intuitions from solving the PML exactly

For small alphabets, we can numerically, and sometimes analytically, optimize the permanent in (16) and (20). We observe that the PML distribution tends to assign equal mass to symbols whose empirical counts are close, differing on the order of √n, where n is the sample size. This observation motivates our approximate PML scheme in Section 4.3.

4.2.1. Exact solution for size-2 alphabet

Suppose the alphabet size is |X| = 2. Given sample size n and empirical distribution ˆp = (ˆp_1, ˆp_2) = (ˆp_1, 1 − ˆp_1), without loss of generality one may assume ˆp_1 ≥ 1/n (otherwise the PML is given by Bern(1)) and ˆp_1 ≤ 1/2.
The exact PML distribution p* was computed in (Orlitsky et al., 2004b):

Figure 2: (Left) empirical distributions ˆp and ˆq (gray squares) drawn from underlying distributions p and q (black area) with n = m = 400 samples. The approximate PML distributions (p̃*, q̃*) are computed by clumping entries of a 2-D fingerprint according to the greedy heuristic described in Appendix H. The approximate PML distributions are ordered in the same way as the empirical distributions (with a distinct color for each level set corresponding to the subplot on the right). All distributions are plotted scaled by the sample size. (Right) the 2-D fingerprint F_{i,j} at coordinates (i, j). Marker areas are roughly proportional to F_{i,j}. All points within the colored convex hulls correspond to a level set of the approximate PML distributions. We can see three large level sets corresponding to the three distinct values of (p_x, q_x) for x ∈ X.

Theorem 7 ((Orlitsky et al., 2004b)) For all n,

p* = (1/2, 1/2) if |ˆp_1 − ˆp_2| ≤ 1/√n;  p* = (1/(1+p̄), p̄/(1+p̄)) if |ˆp_1 − ˆp_2| > 1/√n,    (26)

where p̄ is the unique root in (0, 1) of the polynomial

ˆp_1 p̄^{n(1−2ˆp_1)+1} − ˆp_2 p̄^{n(1−2ˆp_1)} + ˆp_2 p̄ − ˆp_1.    (27)

Theorem 7 states that if |ˆp_1 − ˆp_2| ≤ 1/√n then p* is uniform, which confirms our intuition that the relative ranking of two bins is resolvable only if their empirical counts are more than about √n apart, since the empirical counts are marginally binomially distributed with standard deviation proportional to √n. Figure 3 summarizes these observations for the size-2 alphabet.

4.2.2. Size-3 alphabet

Next let us consider the case of |X| = 3 bins.
Given empirical distribution ˆp = (ˆp_1, ˆp_2, ˆp_3), we use an expectation-maximization (EM) algorithm to numerically search for the PML distribution. The EM algorithm starts from an initial point p ← ˆp and converges to a local maximum of perm(Q) (14). We sort the components of this local maximum in non-increasing order and denote the result by p̃*_EM. Details of the EM algorithm can be found in Appendix D. We summarize our findings in Figure 4.

Figure 3: Phase diagram for the PML distribution p* = (p*_1, 1 − p*_1) on the binary alphabet X = {1, 2} and empirical distribution (ˆp_1, 1 − ˆp_1) on n = 30 samples (plotting the first component). Circles correspond to the possible histograms (ˆp_1 ∈ {0/30, 1/30, . . . , 30/30}). Solid black lines correspond to the solution for arbitrary ˆp_1 ∈ [0, 1], obtained numerically. The PML distribution p* is uniform when |ˆp_1 − ˆp_2| ≤ 1/√n. Outside this middle region there are two PML branches, since any permutation of the PML distribution is another PML distribution. Diagonal dashed lines correspond to the lines (ˆp_1, ˆp_1) and (ˆp_1, 1 − ˆp_1).

4.2.3. Exact solution for size-2 alphabet with D distributions

Suppose we have D distributions ((p^{(d)}_1, 1 − p^{(d)}_1))_{d=1}^D on the same alphabet X = {1, 2} and draw (n_d)_{d=1}^D samples from each, obtaining empirical distributions ((ˆp^{(d)}_1, 1 − ˆp^{(d)}_1))_{d=1}^D. Then the D-dimensional PML distribution (p*^{(d)})_{d=1}^D (defined for D = 2 in Section 3.2 and for general D in Appendix H) is stated in Theorem 8.

Theorem 8 The D-tuple of PML distributions (p*^{(d)})_{d=1}^D on a binary alphabet satisfies:

Σ_{d=1}^D 4 n_d (ˆp^{(d)}_1 − 1/2)² < 1  ⟹  (p*^{(d)})_{d=1}^D = ((1/2, 1/2))_{d=1}^D    (28)

See Appendix E for a proof. We conjecture that the converse of statement (28) holds. Theorem 8 extends the results of (Orlitsky et al., 2004b) and (Fernandes and Kashyap, 2013) on the uniformity of the D-dimensional distribution beyond D = 1.
The left side of (28) involves an ellipsoid in the space of distributions centered on $(\tfrac12)_{d=1}^D$, shown in Figure 5 for the case $D = 2$: outside the ellipsoid, the PML distributions are not all uniform. Note that (28) is not consistent with $D$ copies of condition (26) for $D \ge 2$; this reflects the fact that the probability of a $D$-dimensional fingerprint does not decompose into a product of $D$ terms (see (19) for the case $D = 2$ and Appendix H for the general case).

Figure 4: Phase diagram for the PML distribution approximation $p^\star_{\mathrm{EM}}$ for alphabet size $|X| = 3$ bins and $n = 30$ samples (analogous to Figure 3), computed by an expectation-maximization (EM) algorithm as discussed in Section 4.2.2 and Appendix D. We plot a marker for each possible empirical distribution (a type) on the simplex. Each type has three coordinates, which we then project onto the 2-dimensional image shown (so, e.g., the uniform type $n\hat p = (10, 10, 10)$ is in the center). The shape of the marker at position $n\hat p$ corresponds to the level-set decomposition of the distribution $p^\star_{\mathrm{EM}}$ (one of $p^\star_1 = p^\star_2 = p^\star_3$, $p^\star_1 > p^\star_2 = p^\star_3$, $p^\star_1 = p^\star_2 > p^\star_3$, or $p^\star_1 > p^\star_2 > p^\star_3$), with components sorted in non-increasing order.

4.2.4. Empirical observations of the exact solutions

Figures 3, 4, and 5 suggest that the PML distribution tends to cluster together similar entries of the empirical histogram $\hat p$, rather than smooth them out. If $n|\hat p_x - \hat p_{x'}|$ is smaller than approximately $\sqrt{n\hat p_x}$, then the relative ranking of the true distribution's $x$ and $x'$ components is not well resolved statistically, and the permanent of $Q$ (14) tends to be maximized with $p^\star_x = p^\star_{x'}$. On the other hand, if $n|\hat p_x - \hat p_{x'}|$ is larger than approximately $\sqrt{n\hat p_x}$, then it tends to happen that $p^\star_x \neq p^\star_{x'}$. A similar phenomenon of a phase transition between a uniform and nonuniform PML distribution (for a different approximation of the PML (Vontobel, 2012)) has been reported by (Fernandes and Kashyap, 2013; Chan et al., 2015).
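The resolvability heuristic above is a one-line check on a pair of empirical counts; a minimal sketch (the $\sqrt{n\hat p_x}$ threshold is the rough statistical criterion from the text, not an exact rule):

```python
import math

def likely_clumped(cx, cy):
    """Rough criterion from Section 4.2.4: empirical counts cx and cy
    (cx = n * phat_x) are expected to share a PML level set when they differ
    by less than about one binomial standard deviation, sqrt(n*phat) = sqrt(count)."""
    return abs(cx - cy) < math.sqrt(max(cx, cy))
```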
These observations lead to the intuition that permutations $\sigma \in S_X$ that exchange sufficiently different components of $\hat p$ do not contribute much to the permanent $\mathrm{perm}(Q)$. This intuition motivates our idea of simplifying the computation of the permanent by summing over only those permutations that contribute a lot to the sum. Section 4.3 develops this idea.

Figure 5: Phase diagram for the 2-D PML pair of distributions $p^\star, q^\star$ (20) on binary alphabet $X = \{1,2\}$ and empirical distributions $(\hat p_1, 1 - \hat p_1)$, $(\hat q_1, 1 - \hat q_1)$ on $n = 10$, $m = 20$ samples. $p^\star, q^\star$ are not both uniform whenever the empirical distribution components $\hat p_1, \hat q_1$ both lie outside the shaded ellipse $4n(\hat p_1 - \tfrac12)^2 + 4m(\hat q_1 - \tfrac12)^2 \le 1$.

4.3. Approximate PML: the case of a single distribution

In the sequel, denote by $A(p) = \{\alpha\}$ the partition of $X$ into the level sets of $p$:
$$A(p) \triangleq \{\{x \in X : p_x = u\} : u \in \{p_x : x \in X\}\} \qquad (29)$$
In this Section we assume that the underlying distribution $p$ is supported on a finite set $X$ of known cardinality. The case of unknown or infinite support set size is treated in Section 4.6. If the support set size is known, then so is the number of unseen symbols $F_0$, so from (16) we conclude that the PML distribution $p^\star$ is a maximizer of the function:
$$V(p) \triangleq \log(\mathrm{perm}(Q)) \qquad (30)$$
We lower bound the log permanent $V(p)$ by summing over a subset of the terms in the summation, a subgroup of the symmetric group $S_X$. Denote the lower bound by $\underline V(p)$:
$$V(p) = \log \sum_{\sigma \in S_X} \prod_{x \in X} p_x^{n\hat p_{\sigma(x)}} \;\ge\; \log \sum_{\sigma \in S_{X,p}} \prod_{x \in X} p_x^{n\hat p_{\sigma(x)}} = \underline V(p) \qquad (31)$$
where $S_{X,p}$ is a subgroup of $S_X$ consisting of all permutations that exchange only those alphabet symbols $x, x' \in X$ that are within the same level set of $p$.
Equivalently, $S_{X,p}$ is the stabilizer of distribution $p$ under the action of relabeling its components:
$$S_{X,p} \triangleq \{\sigma \in S_X : \sigma p = p\} \cong \prod_{\alpha \in A(p)} S_\alpha \qquad (32)$$
where $(\sigma p)_x \triangleq p_{\sigma(x)}$, $\cong$ denotes group isomorphism, $\prod$ denotes the direct product of groups, and $S_\alpha$ is the symmetric group acting on level set $\alpha \subseteq X$. Then $|S_{X,p}| = \prod_{\alpha \in A(p)} |\alpha|!$.

Our approximate profile maximum likelihood distribution $\tilde p^\star$ maximizes the lower bound $\underline V(p)$. In other words, for the case of every distribution in collection $P$ having the same support set size,
$$\tilde p^\star \in \operatorname*{argmax}_{p \in P} \underline V(p) \qquad (33)$$
The case of unknown support set size is treated in Section 4.6.

Some intuition for the computational difficulty of the proposed approximation: the product structure (32) of $S_{X,p}$ allows us to lower bound the log matrix permanent $V(p) = \log(\mathrm{perm}(Q)) \ge \log(\mathrm{perm}(\tilde Q)) = \underline V(p)$, where $\tilde Q$ is the block-diagonal matrix $\tilde Q_{x,x'} = Q_{x,x'}\,\mathbb 1(p_x = p_{x'})$, whose permanent is the product of the permanents of its blocks. Since $p$ is constant within each block, the permanent of each block is easy to evaluate, so evaluation of the lower bound $\underline V(p)$ is dramatically simpler than evaluation of $V(p)$, taking time $O(|X|)$. It moreover turns out to be possible to optimize $\underline V(p)$ computationally efficiently to find the approximate PML distribution $\tilde p^\star$; this is done below. The restriction to summing over $S_{X,p}$ rather than $S_X$ seems severe, but the hope is that $\tilde p^\star$ clusters together only comparably frequent symbols (that is, $\tilde p^\star_x = \tilde p^\star_{x'}$ implies $n|\hat p_x - \hat p_{x'}|$ is less than about $\sqrt{n\hat p_x}$), so we hope that $\underline V(\tilde p^\star) \approx V(\tilde p^\star)$ and that $\tilde p^\star$ is close to $p^\star$, the exact profile maximum likelihood (16). Figure 1 shows the approximate PML distribution $\tilde p^\star$ clustering together similar enough bins of $\hat p$.
We thus expect our approximate PML scheme to perform best when data is drawn from a distribution with a few well-separated level sets (like a uniform distribution or a mixture of uniforms) and less well for more smoothly-varying distributions (like a Zipf distribution); this intuition is borne out by the numerical experiments of Section 5.

We can rewrite our log permanent lower bound $\underline V(p)$ (31) as (see proof in Appendix F):
$$\underline V(p) = \log \sum_{\sigma \in S_{X,p}} \prod_{x \in X} p_x^{n\hat p_{\sigma(x)}} = -n\left(D(\hat p\,\|\,p) + H(\hat p)\right) + \sum_{\alpha \in A(p)} \log(|\alpha|!) \qquad (35)$$
where $A(p)$ (29) is the partition of $X$ into level sets of $p$, $D(\hat p\|p)$ is the Kullback-Leibler divergence, and $H(\hat p)$ is the entropy of the empirical distribution. Expression (35) yields some intuition about the approximate PML distribution $\tilde p^\star$, a maximizer of $\underline V(p)$. A distribution $p$ that clumps many symbols together (that is, has a few large level sets $\alpha$) boosts the second term (the summation over $\alpha \in A(p)$) in (35), but is very different from $\hat p$, thus lowering the first term. On the other hand, by setting $p = \hat p$, we maximize the first term, but reduce the second term. As the sample size $n \to \infty$, the contribution of the second term in (35) vanishes relative to the first term, and we have $\tilde p^\star \to \hat p$, the ML distribution, in this limit.

Section 4.4 establishes some properties of the approximate PML solution $\tilde p^\star$ (33) that enable us to compute it efficiently via a dynamic programming algorithm in Section 4.5.

4.4. Properties of the approximate PML distribution

Let $\alpha \subseteq X$ and denote by $\hat p_\alpha$ the average value of the empirical distribution $\hat p$ over $\alpha$:
$$\hat p_\alpha \triangleq \frac{1}{|\alpha|}\sum_{x \in \alpha} \hat p_x \qquad (36)$$
We can show that the PML distribution approximation $\tilde p^\star$ (33) satisfies, for all $x \in X$,
$$\tilde p^\star_x = \hat p_{\alpha(x, \tilde p^\star)} \qquad (37)$$
where $\alpha(x, \tilde p^\star) = \{x' \in X : \tilde p^\star_{x'} = \tilde p^\star_x\}$ is the level set of $\tilde p^\star$ containing $x$. That is, $\tilde p^\star_x$ (37) is equal to the average value of the empirical histogram $\hat p$ over the symbols clumped with $x$ into the same level set of $\tilde p^\star$. This follows from the fact that $D(\hat p\|p)$ is a Bregman divergence (Jiao et al., 2017, Lemma 4).
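Identity (35) is easy to sanity-check numerically: the sum over the stabilizer subgroup consists of $|S_{X,p}|$ identical terms, so it can be compared directly against the closed form. A minimal sketch for a toy alphabet:

```python
import math
from itertools import permutations

def V_lower_direct(p, phat, n):
    """log of the sum over the stabilizer subgroup S_{X,p} of
    prod_x p_x^(n * phat_sigma(x)), as in (31)."""
    X = range(len(p))
    total = 0.0
    for sigma in permutations(X):
        if any(p[sigma[x]] != p[x] for x in X):  # sigma must fix level sets of p
            continue
        total += math.prod(p[x] ** (n * phat[sigma[x]]) for x in X)
    return math.log(total)

def V_lower_closed_form(p, phat, n):
    """Closed form (35): -n*(D(phat||p) + H(phat)) + sum_alpha log(|alpha|!),
    using -n*(D + H) = n * sum_x phat_x * log(p_x)."""
    s = n * sum(ph * math.log(px) for ph, px in zip(phat, p) if ph > 0)
    for level in set(p):
        size = sum(1 for px in p if px == level)
        s += math.lgamma(size + 1)  # log(size!)
    return s
```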
The averaging property (37) determines our approximate PML distribution $\tilde p^\star$ in terms of its partition of $X$ into level sets. Therefore, instead of maximizing $\underline V(p)$, we maximize $\underline V(A)$, defined:
$$\underline V(A) \triangleq \sum_{\alpha \in A} \underline V(\alpha) \qquad (38)$$
where for $\alpha \subseteq X$
$$\underline V(\alpha) \triangleq \log(|\alpha|!) + n|\alpha|\,\hat p_\alpha \log(\hat p_\alpha) \qquad (39)$$
We can check that $\underline V(A) = \underline V(\hat p_A)$, where $\hat p_A$ is the distribution obtained by averaging the empirical distribution $\hat p$ within each partition element $\alpha \in A$. Let $A^\star$ denote the optimal partition:
$$A^\star \in \operatorname*{argmax}_{A:\ \text{partition of } X} \underline V(A) \qquad (40)$$
Now optimizing the lower bound to the permanent $\underline V(p)$ (31) is equivalent to optimizing $\underline V(A)$ (38) over partitions of $X$, since $\underline V(\tilde p^\star) = \underline V(A^\star)$ and $\tilde p^\star = \hat p_{A^\star}$.

We make two observations for the solution of the approximate PML problem (40) that allow us to restrict the set of partitions over which we optimize. Let $A$ be a partition of $X$. We say $A$ has the iso-clumping or convexity properties defined below:

1. Iso-clumping property. Symbols with the same empirical probabilities are clumped together: for all $x, x' \in X$,
$$\hat p_x = \hat p_{x'} \implies \alpha(x) = \alpha(x') \qquad (41)$$
where $\alpha(x) \in A$ is the partition element containing $x$.

2. Convexity property. For all $x, x', x'' \in X$,
$$\hat p_x < \hat p_{x'} < \hat p_{x''} \text{ and } \alpha(x) = \alpha(x'') \implies \alpha(x) = \alpha(x') = \alpha(x'') \qquad (42)$$

Theorem 9 Let $A^\star$ (40) be the partition of $X$ into level sets of the approximate PML distribution $\tilde p^\star$ (33) and let $\hat p$ be the empirical distribution. Then $A^\star$ has the iso-clumping and convexity properties (41) and (42).

See Appendix G for a proof.

4.5. A dynamic programming computation of the approximate PML distribution

Theorem 9 lets us efficiently maximize $\underline V(A)$ (38) by restricting to only those partitions of $X$ that have the iso-clumping and convexity properties. Once we find $A^\star$ (40), the averaging property (37) lets us compute the approximate PML distribution $\tilde p^\star$. Let $F_+ \triangleq (F_i)_{i \ge 1}$. Let $0 = m_0 < m_1 < m_2 < \cdots$
$< m_{F_+} \le n \qquad (43)$
where
$$\mathrm{Supp}(F_+) = \{m \ge 1 : F_m > 0\} = \{m_i\}_{i=1}^{F_+} \qquad (44)$$
(abusing notation, we also write $F_+$ for $|\mathrm{Supp}(F_+)|$ when it appears as an index bound). Then any distribution whose level set partition satisfies the iso-clumping (41) and convexity (42) properties has all level sets of the form $X_{i:j}$ with $0 \le i \le j \le F_+$:
$$X_{i:j} \triangleq \{x \in X : m_i \le n\hat p_x \le m_j\}. \qquad (45)$$
Thus we optimize $\underline V(A)$ (38) over partitions of $X$ into level sets of the form (45). Note that if $F_0 > 0$, then the set $X_{0:0} = \{x \in X : \hat p_x = 0\}$ of unseen symbols cannot appear in the optimal partition $A^\star$, because if it did, then its probability mass under $\hat p_A$ would be 0, violating our assumption that $p$ has support $X$. If $F_0 = 0$, then $X_{0:0} = \emptyset$.

This optimization can be done by a dynamic programming algorithm. For integers $i \le j$, let $[i,j] \triangleq \{k \in \mathbb N : i \le k \le j\}$. Let $\underline V_i$ denote, for $0 \le i \le F_+$:
$$\underline V_i \triangleq \max_{j \in [i \vee 1,\, F_+]}\left(\underline V(X_{i:j}) + \underline V_{j+1}\right) \qquad (46)$$
$$\stackrel{(a)}{=} \max_{A:\ \text{partition of } X_{i:F_+}} \underline V(A) \qquad (47)$$
with boundary condition $\underline V_{F_+ + 1} \triangleq 0$, where $i \vee j$ denotes the greater of $i$ and $j$, and where $\underline V(X_{i:j})$ is as in (39). (a) follows by induction on $i$ downwards from $F_+$. We let $j \in [i \vee 1, F_+]$ in (46) (rather than $j \in [i, F_+]$) in order to optimize over only those partitions of $X$ that do not contain the set $X_{0:0}$ of unseen symbols; this restriction forces the approximate PML distribution to have support set size $|X|$, as we assumed in the beginning of Section 4.3. Then the PML distribution approximation $\tilde p^\star = \operatorname{argmax}_p \underline V(p)$ satisfies (setting $i = 0$ and using $X_{0:F_+} = X$ in (47)):
$$\underline V(\tilde p^\star) = \max_{A:\ \text{partition of } X} \underline V(A) = \underline V_0 \qquad (48)$$
We can compute the term $\underline V(X_{i:j})$ in (46), corresponding to clustering all symbols of $X_{i:j}$ into a level set of $p$, using (39):
$$\underline V(X_{i:j}) = \log(|X_{i:j}|!) + n|X_{i:j}|\,\hat p_{X_{i:j}} \log \hat p_{X_{i:j}} \qquad (49)$$
$$= \log(|X_{i:j}|!) + \left(\sum_{k=i}^{j} m_k F_{m_k}\right) \log \frac{\sum_{k=i}^{j} m_k F_{m_k}}{n \sum_{k=i}^{j} F_{m_k}} \qquad (50)$$
where we used $|X_{i:j}| = \sum_{k=i}^{j} F_{m_k}$ and $\hat p_{X_{i:j}} = \frac{1}{n|X_{i:j}|}\sum_{k=i}^{j} m_k F_{m_k}$.

Relations (46) and (50) allow us to compute the level set decomposition of $\tilde p^\star$ by filling out an $(F_+ + 1) \times F_+$ array $(\underline V(X_{i:j}))_{i \in [0,F_+],\, j \in [1,F_+]}$ and keeping track of the maximizing index each time we compute $\underline V_i$.
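The recursion (46) with segment scores (49)-(50) can be sketched in a few lines. The sketch below assumes $F_0 = 0$ (all symbols seen), so the unseen-symbol bookkeeping discussed above is omitted:

```python
import math

def apml_level_sets(fingerprint, n):
    """Dynamic program (46): optimal contiguous clumping of the fingerprint.
    fingerprint: dict mapping count m >= 1 to F_m > 0 (assumes F_0 = 0).
    Returns (optimal value, level sets as (i, j) index pairs into the sorted
    counts, and the APML probability of each level set)."""
    ms = sorted(fingerprint)               # m_1 < ... < m_{F+}
    T = len(ms)

    def V_seg(i, j):                       # score (49)-(50) for clumping X_{i:j}
        size = sum(fingerprint[ms[k]] for k in range(i, j + 1))
        mass = sum(ms[k] * fingerprint[ms[k]] for k in range(i, j + 1))
        return math.lgamma(size + 1) + mass * math.log(mass / (n * size))

    best = [0.0] * (T + 1)                 # best[i] = V_i, boundary V_{T+1} = 0
    choice = [0] * T
    for i in range(T - 1, -1, -1):
        best[i], choice[i] = max(
            (V_seg(i, j) + best[j + 1], j) for j in range(i, T))

    segs, probs, i = [], [], 0             # recover the partition and the
    while i < T:                           # averaged probabilities (37)
        j = choice[i]
        size = sum(fingerprint[ms[k]] for k in range(i, j + 1))
        mass = sum(ms[k] * fingerprint[ms[k]] for k in range(i, j + 1))
        segs.append((i, j))
        probs.append(mass / (n * size))
        i = j + 1
    return best[0], segs, probs
```

The two segment sums can be precomputed as prefix sums, as noted below, but the quadratic scan over $(i, j)$ already achieves the stated $O(|\mathrm{Supp}(F)|^2)$ run time.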
Once we have the optimal level set decomposition $A^\star = \{X_{i:j}\}$ of $\tilde p^\star$, we set, for all $x \in X_{i:j}$ and all $i, j$,
$$\tilde p^\star_x = \hat p_{X_{i:j}} \qquad (51)$$
For example, in Figure 1, we have $(m_i)_{i=0}^{F_+ = 12} = (0, 1, 2, 3, 4, 5, 7, 9, 12, 13, 15, 16, 20)$ and the approximate PML distribution $\tilde p^\star$ (red line) has level sets $\{X_{0:6}, X_{7:12}\}$.

The running time of this dynamic programming algorithm is $O(|\mathrm{Supp}(F)|^2)$, where $|\mathrm{Supp}(F)| \le F_+ + 1$. In terms of the sample size $n$, any empirical distribution $\hat p$ satisfies $|\mathrm{Supp}(F(\hat p))| \le \frac12\left(\sqrt{8n+1} - 1\right)$, with equality achieved by the empirical distribution $\hat p_i = \frac{2i}{|X|(|X|+1)}$ for $i \in [0, |X|]$ with $|X| = |\mathrm{Supp}(F)|$ and $n = \frac12 |X|(|X|+1)$. Thus the worst-case run time for the dynamic programming algorithm is $O\left(\left(\sqrt n\right)^2\right) = O(n)$. We usually have $|\mathrm{Supp}(F)|$ much smaller than $\sqrt n$, so this is a pessimistic estimate for the typical case. The run time of our approximate PML scheme is usually dominated by computation of $\hat p$. An implementation optimization is to pre-compute the sums $\sum_{k=i}^{j} F_{m_k}$ and $\sum_{k=i}^{j} m_k F_{m_k}$ for all $i \le j$ in computing $\underline V(X_{i:j})$.

4.6. Unknown or infinite support set size

If the support set size $K = |X|$ is unknown, then we can attempt to infer it. Our estimator $K^\star$ for the support set size is the same as (17), but replaces the permanent with our lower bound (31), $e^{\underline V(p)} \le \mathrm{perm}(Q)$:
$$K^\star \in \operatorname*{argmax}_{K \ge \hat K}\left(\frac{1}{(K - \hat K)!} \max_{p \in P_K} e^{\underline V(p)}\right) \qquad (52)$$
whenever the max over $K$ exists, where $P_K \triangleq \{p \in P : |\mathrm{Supp}(p)| = K\}$ and $\hat K = |\hat X|$ is the support set size of the empirical distribution $\hat p$. Then the approximate PML distribution is $\tilde p^\star$ with support set size $K^\star$. The max over $p$ is obtained via our approximate PML algorithm in Section 4.5. The maximizer $K^\star$ usually exists because a larger $K$ boosts the value of the log permanent lower bound $\underline V(p)$ (since there are more permutations to sum over), but reduces the first factor in (52). It may happen that $K^\star$ does not exist, in which case we say our approximate PML distribution has a continuous part, in the terminology of (Orlitsky et al., 2004b). In general the function in the argmax in (52) is multimodal in $K$.
It turns out we can efficiently compute $K^\star$ if it exists and compute the corresponding approximate PML distribution; if $K^\star$ does not exist, then we can efficiently detect this case and compute the corresponding approximate PML distribution, which in this case has a continuous part. We state and derive our observations in Appendix J. The idea is to optimize over all possible ways to clump the unseen symbols $X_{0:0}$ with the other symbols. Since the optimal clumping satisfies the iso-clumping and convexity properties, we show that there are only $|\mathrm{Supp}(F_+)| \le \sqrt{2n} + 1$ possibilities to check, and that the value of each possibility can be checked quickly, reusing the work done in the dynamic programming approach described in the previous Section, so this can be done efficiently.

4.7. Approximate PML: the case of multiple distributions (see Appendix H)

We generalize the PML distribution and approximate PML distribution to the case of $(n_d)_{d=1}^D$ samples drawn from distributions $(p^{(d)})_{d=1}^D$ on the same alphabet (see Section 3.2 for the case $D = 2$, used in estimating functionals of pairs of distributions). This requires some more notation that we delegate to Appendix H. For $D \ge 2$, we are unable to give an efficient algorithm to compute the $D$-dimensional approximate PML. The difficulty is due to the lack of a natural ordering on $\mathbb N^D$ for $D \ge 2$, so the dynamic programming approach we use for $D = 1$ does not work for $D \ge 2$. We settle for approximately computing our approximate PML distribution via a greedy heuristic: we iteratively enlarge level sets of a candidate solution by merging pairs of distinct level sets until the objective function $\underline V$ (suitably generalized to $D$ distributions) stops increasing. See Appendix H for details.

4.8. Comparison to the (exact) PML distribution for size 2 alphabet

Recall the discussion of Section 4.2.1 on the PML distribution for alphabet size $|X| = 2$.
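For the binary comparison that follows, the approximate PML reduces to comparing just two candidate partitions of $X = \{1, 2\}$ under (39): clumping both symbols into one level set, or keeping them separate. A minimal sketch:

```python
import math

def H2(p1):
    """Binary entropy in nats."""
    return -sum(p * math.log(p) for p in (p1, 1 - p1) if p > 0)

def apml_binary_is_uniform(p1hat, n):
    """Compare the two candidate partitions of X = {1, 2} via (39):
    clump both symbols (uniform APML) vs. keep them separate
    (APML equal to the empirical distribution)."""
    v_clump = math.log(2) + n * math.log(0.5)  # log(2!) + n*2*(1/2)*log(1/2)
    v_split = -n * H2(p1hat)                   # n*(p1*log(p1) + p2*log(p2))
    return v_clump >= v_split
```

Rearranging `v_clump >= v_split` gives the entropy form of the uniformity condition, $H(\hat p) \ge \log(2)(1 - 1/n)$.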
A computation establishes the condition for uniformity of the APML distribution:
$$\tilde p^\star = \begin{cases} \left(\tfrac12, \tfrac12\right) & : H(\hat p) \ge \log(2)\left(1 - \tfrac1n\right) \\ (\hat p_1, 1 - \hat p_1) & : \text{otherwise} \end{cases} \qquad (53)$$
where $H(\hat p)$ is the entropy of the empirical distribution and $n$ is the sample size. Let us consider the large $n$ limit. Let $\hat p^{\mathrm{unif}}_1$ be the critical value at which the APML distribution switches between uniform and nonuniform (the least value of $\hat p_1$ satisfying the uniformity condition in (53)). Then a computation shows that
$$\hat p^{\mathrm{unif}}_1 = \frac12 - \frac12\sqrt{\frac{\log(4)}{n}} + O\!\left(\frac1n\right). \qquad (54)$$
Comparing this to condition (26) for the (exact) PML, we see that in the APML case, $\hat p^{\mathrm{unif}}_1$ carries an extra factor of $\sqrt{\log(4)} \approx 1.18$ in the $1/\sqrt n$ term relative to the uniform/nonuniform critical value of $\hat p_1$ for the PML. Figure 6 compares the APML and PML distributions for a size-2 alphabet. Both the APML and PML are uniform when $|\hat p_1 - \hat p_2|$ is small enough, though at different points, as remarked above. Whenever the APML solution is not uniform, it exactly matches (up to relabeling of components) the empirical distribution, unlike the PML.

5. Estimating symmetric functionals of a single discrete distribution

In the standard PML setting discussed in Section 3.1, we argued the sufficiency of the fingerprint $F$ (11) in estimating the sorted probability vector of a discrete distribution, which implies the sufficiency of the fingerprint for estimating any functional $F$ of the sorted probability vector of the form
$$F(p) = \sum_{x \in X} f(p_x), \qquad (55)$$

Figure 6: Phase diagram for the exact PML (hollow circles) and APML distributions (red circles) on binary alphabet with sample size $n$ indicated above each subplot (compare to Figure 3). Solid black lines correspond to the solution for arbitrary $\hat p_1 \in [0,1]$, obtained numerically.
The PML distribution $p^\star$ is uniform when $|\hat p_1 - \hat p_2| \le 1/\sqrt n$, while the APML distribution is uniform when $|\hat p_1 - \hat p_2| \le \sqrt{\log(4)}/\sqrt n + O(1/n)$. Outside this middle region there are two PML (APML) branches, since any permutation of the PML (APML) distribution is another PML (APML) distribution. The APML distribution exactly matches the empirical distribution outside the uniform region. Diagonal dashed lines correspond to the lines $(\hat p_1, \hat p_1)$ and $(\hat p_1, 1 - \hat p_1)$.

In (55) we constrain $f(0) = 0$ to accommodate the unknown alphabet setting. Note that the fingerprint is also partially sufficient for functionals of the type $\sum_{x,y \in X} f(p_x, p_y)$, etc. In order to apply Theorem 3, one also needs to show that the cardinality of the fingerprint $F$ is small. The fingerprint $F = (F_i)_{i \ge 0}$ satisfies
$$\sum_{i \ge 0} i\,F_i = n, \qquad (56)$$
and each fingerprint corresponds to an unordered integer partition of the integer $n$. We now recall the Hardy-Ramanujan result on integer partitions.

Theorem 10 (Hardy and Ramanujan, 1918; Bárány and Vershik, 1992, Theorem 2) The cardinality of the set of fingerprints (11) on $n$ samples is given by
$$|\{(F_i)_{i \ge 1}\}| = e^{\pi\sqrt{2n/3}\,(1 + o(1))}. \qquad (57)$$

The key observation is that the cardinality of the fingerprint is sub-exponential in $n$. Then, combining with Theorem 3 and the fact that there exist estimators for various symmetric functionals with near exponential measure concentration, (Acharya et al., 2017) showed that plugging in the PML achieves the optimal sample complexity in estimating the Shannon entropy, the support size, the support coverage, and the L1 distance to uniformity. In this section, we extensively test the performance of symmetric functional estimation via plugging our approximate PML (APML) distribution into the functionals. Recall that the PML distribution $p^\star$ is defined as in (16) for empirical distribution $\hat p$ on $n$ samples with fingerprint $F(\hat p)$:
$$p^\star = \operatorname*{argmax}_{p \in P} \mathbb P_p(F) = \operatorname*{argmax}_{p \in P} \frac{\mathrm{perm}\left(\left(p_x^{n\hat p_{x'}}\right)_{x,x' \in X}\right)}{(K - \hat K)!}.$$
(58)
where the optimization is over distributions of possibly different support set sizes, $X$ denotes the support set of distribution $p$, $K = |X|$, and $\hat K$ denotes the support set size of the empirical distribution $\hat p$. $p^\star$ is difficult to compute exactly, so we compute the APML distribution $\tilde p^\star$ (33), (52):
$$\tilde p^\star = \operatorname*{argmax}_{p \in P} \frac{e^{\underline V(p)}}{(K - \hat K)!} \qquad (59)$$
which maximizes a lower bound to the probability $\mathbb P_p(F)$, since $\underline V(p)$ (31) lower bounds the log permanent in (58). Then our estimator for a functional $F$ of the form (55) is the plugin $F(\tilde p^\star)$. Our approximate PML estimator performs comparably well to the competition (Valiant and Valiant, 2011a, 2013; Jiao et al., 2015; Wu and Yang, 2015, 2016) across different functionals $F$ and distributions, and performs significantly better when the true distribution is uniform. This good performance for the uniform case makes some intuitive sense: the approximate PML distribution $\tilde p^\star$ maximizes a lower bound (22) to a matrix permanent obtained by summing over only those permutations that mix symbols within level sets of $\tilde p^\star$. If there is only one level set, that is, if $\tilde p^\star$ is uniform, then we have exactly optimized the matrix permanent (or at least found a local maximum), rather than a lower bound to the permanent.

5.1. Sorted L1 loss

Given $n$ samples from a distribution $p$, the usual L1 loss in measuring the accuracy of an estimate $q$ of $p$ is given by
$$\sum_{x \in X} |p_x - q_x|. \qquad (60)$$
The sorted L1 loss measures the shape difference between the reconstructed distribution $q$ and the true distribution $p$. In other words, we first sort the distributions $p, q$ into non-increasing probability vectors $(p_{(1)}, p_{(2)}, \ldots, p_{(K)})$, $(q_{(1)}, q_{(2)}, \ldots, q_{(K)})$, and then measure the loss by
$$\sum_{1 \le i \le K} |p_{(i)} - q_{(i)}|. \qquad (61)$$
The sorted L1 loss measures the error in reconstructing the sorted probability vector, for which the fingerprints are partially sufficient as shown by Theorem 6.
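Both the fingerprint statistic with its identity (56) and the sorted L1 loss (61) are a few lines each; a minimal sketch:

```python
from collections import Counter

def fingerprint(samples):
    """Fingerprint (11): F_i = number of symbols appearing exactly i times."""
    return Counter(Counter(samples).values())

def sorted_l1(p, q):
    """Sorted L1 loss (61): L1 distance between non-increasingly sorted
    probability vectors of equal length K."""
    return sum(abs(a - b) for a, b in
               zip(sorted(p, reverse=True), sorted(q, reverse=True)))
```

For example, the string "abracadabra" has symbol counts a:5, b:2, r:2, c:1, d:1, so its fingerprint is $F_1 = 2$, $F_2 = 2$, $F_5 = 1$, and $\sum_i i F_i = 11 = n$.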
The sorted L1 loss can also be viewed as a Wasserstein distance (Vallender, 1974), via a dual (Kantorovich) representation in which the supremum is over all Lipschitz functions with Lipschitz constant one. We estimate the unknown distribution under the sorted L1 loss using the plugin $q = \tilde p^\star$, the APML distribution (59), computed over the collection of distributions
$$\{p : |\mathrm{Supp}(p)| < \infty\}, \qquad (63)$$
the set of all discrete distributions with finite support set size, and over the collection of distributions
$$\{p : |\mathrm{Supp}(p)| = K\}, \qquad (64)$$
the set of all discrete distributions with support set size $K$.

Figure 7 shows the sorted probability vector inferred by our approximate PML distribution in (63), that is, the case of unknown support set size, along with the sorted ML (empirical) distribution and the distribution of (Valiant and Valiant, 2013). We see that both the approximate PML distribution and the distribution of (Valiant and Valiant, 2013) perform much better than the sorted ML distribution, and that the PML distribution prefers to have fewer and larger level sets than the distribution of (Valiant and Valiant, 2013).

Figure 8 shows the performance of our approximate PML distribution in (64), that is, the case of known support set size, as a plugin estimator for the sorted L1 loss, along with the estimator of (Valiant and Valiant, 2011a) and the empirical distribution plugin estimator (MLE) (see caption for performance test parameters). We see that the approximate PML distribution performs well for the uniform and mixture-of-two-uniforms distributions, but more poorly for the Zipf distributions, consistent with the remarks in the previous Section. Good performance on near-uniform distributions is observed for other performance tests as well (see below).
Figure 7: Inferring the sorted probability vector given samples from an underlying distribution (plotted in gray). The distribution is held constant in each column and the sample size $n$ is held constant in each row of the figure. In all cases $K = |X| = 1000$. Uniform is uniform on $X$, Mix 2 Uniforms is a mixture of two uniform distributions, with half the probability mass on the first $K/5$ symbols and the other half on the remaining symbols, and Zipf($\alpha$) has $p_i \propto i^\alpha$ with $i \in \{1, \ldots, K\}$. ML denotes the sorted empirical distribution. VV is (Valiant and Valiant, 2013).

Figure 8: Sorted L1 loss in estimating an unknown distribution $p$ with known support set size $K = |X|$. In all cases $K = 10^4$. Uniform is uniform on $X$, Mix 2 Uniforms is a mixture of two uniform distributions, with half the probability mass on the first $K/5$ symbols and the other half on the remaining symbols, and Zipf($\alpha$) has $p_i \propto i^\alpha$ with $i \in \{1, \ldots, K\}$. MLE denotes the ML plugin ("naive") approach of using the sorted empirical distribution in (61). VV is (Valiant and Valiant, 2013). Each data point represents 100 random trials, with 2 standard error bars smaller than the plot marker for most points.

5.2.
Entropy and Rényi entropy estimation

For entropy $H(p) = -\sum_{x \in X} p_x \log(p_x)$ and Rényi entropy $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_{x \in X} p_x^\alpha$ of distribution $p$, the corresponding approximate PML estimator is defined as the plugin estimator $H(\tilde p^\star)$, $H_\alpha(\tilde p^\star)$, where $\tilde p^\star$ is as in (59), optimized over the collection of distributions (63) for both the entropy and the Rényi entropy. Additionally, to enable direct comparison with the entropy estimator of (Wu and Yang, 2016), which requires a support set size as input, we also find the approximate PML distribution optimized over the collection (64). Figure 9 shows the performance of our approximate PML scheme for estimating the entropy and the Rényi entropy with $\alpha \in \{2, 1.5, 0.8\}$. Overall, our approximation looks competitive with the other approaches, and is particularly strong for the uniform distribution. Adding in knowledge of the true support set size in estimating the entropy did not make much of a difference, except for an improvement in the case of the uniform distribution.

Figure 9: Root mean squared error of several estimators of entropy (first column) and Rényi entropy (last three columns, for $\alpha \in \{2, 1.5, 0.8\}$). In all cases except Geometric we set the alphabet size $|X| = 10^4$. Uniform is uniform on $X$, Zipf($\alpha$) has $p_i \propto i^\alpha$ with $i \in \{1, \ldots, K\}$, Geometric is the geometric distribution with infinite support and mean $K$.
MLE denotes the ML plugin ("naive") approach of computing $H(\hat p)$. VV is (Valiant and Valiant, 2013). JVHW is (Jiao et al., 2015). WY is (Wu and Yang, 2016); this estimator requires the support set size $|X|$ as an input. Our estimator optionally accepts the support set size $|X|$ (shown in dark red). Each data point represents 100 random trials, with 2 standard error bars smaller than the plot marker for most points. Log base 2.

5.3. L1 distance to uniformity

For estimating the L1 distance to uniformity, which is $\sum_{x \in X} |p_x - \frac1K|$, the corresponding PML estimator is (Acharya et al., 2017) the plugin of $p^\star$ (58) optimized over the collection of distributions (64) (the case of known support set size). The APML estimator is the plugin of $\tilde p^\star$ (59) optimized over the same collection. Figure 10 shows the performance of our approximate PML scheme for estimating the L1 distance to a uniform distribution with known support set size. Our approach looks competitive, and is by far the best if the true distribution is uniform.

Figure 10: Root mean squared error for estimation of L1 distance to uniformity with known support set size $K = |X|$. In all cases $K = 10^4$. Uniform is uniform on $X$, Mix 2 Uniforms is a mixture of two uniform distributions, with half the probability mass on the first $K/5$ symbols and the other half on the remaining symbols, and Zipf($\alpha$) has $p_i \propto i^\alpha$ with $i \in \{1, \ldots, K\}$. MLE denotes the ML plugin ("naive") approach of computing $\sum_{x \in X} |\hat p_x - \frac1K|$. VV is (Valiant and Valiant, 2013). Each data point represents 100 random trials, with 2 standard error bars smaller than the plot marker for most points.

5.4.
Support set size estimation

For support set size estimation, the corresponding PML estimator is (Acharya et al., 2017) the plugin of $p^\star$ (58) optimized over the collection of distributions
$$\left\{p : p_x \ge \tfrac1K \text{ for all } x \in \mathrm{Supp}(p)\right\}, \qquad (65)$$
the set of all discrete distributions whose minimum nonzero probability is at least $\frac1K$. The approximate PML estimator is the plugin of $\tilde p^\star$ (59) optimized over the same collection. Figure 11 shows the performance of our approximate PML scheme for estimating the support set size. Here we plot the mean and standard error of the support set size inferred by the different estimation schemes, rather than a root mean squared error. Overall, our approach is comparable to the others, performing worst on the Zipf distributions and best on the uniform distribution.

Figure 11: Support set size estimation for the same cases as in (Wu and Yang, 2015), plotting the mean and 2 standard deviations of the distribution of inferred support set sizes in 100 trials vs. sample size $n$. In all cases the support set size $K = |X|$ is chosen such that $\min_{i \in \{1,\ldots,K\}} p_i \ge 10^{-6}$. Uniform is uniform on $\{1, \ldots, K\}$, Zipf($\alpha$) has $p_i \propto i^\alpha$ with $i \in \{1, \ldots, K\}$, and Mix geom. and Zipf is a mixture of $p_i \propto 1/i$ for $i \in \{1, \ldots, K/2\}$ and $p_i \propto (1 - 2/K)^{i - K/2 - 1}$ for $i \in \{K/2 + 1, \ldots, K\}$, normalized so that half the mass is on the first $K/2$ symbols. MLE denotes the ML plugin ("naive") approach of using the empirical support set size. VV is (Valiant and Valiant, 2013). WY is (Wu and Yang, 2015).

6.
Estimating the divergence function between two distributions

In the 2-D PML setting discussed in Section 3.2, we argued the sufficiency of the fingerprint (18) in estimating functions of two discrete distributions of the form
$$F(p, q) = \sum_{x \in X} f(p_x, q_x). \qquad (66)$$
In order to apply Theorem 3, one needs to show that the cardinality of the sufficient statistic is not too big. Below we present an argument that applies to the $D$-dimensional PML. Suppose we have $D$ discrete distributions on the same alphabet $X$, denoted $p^{(1)}, p^{(2)}, \ldots, p^{(D)}$. Suppose we obtain $n_d$ samples from each distribution $p^{(d)}$, with empirical distribution $\hat p^{(d)}$. Let the joint fingerprint be defined for every $(i_1, i_2, \ldots, i_D) \ge 0$,
$$F_{i_1, i_2, \ldots, i_D} = \left|\left\{x \in X : (n_d \hat p^{(d)}_x)_{d=1}^D = (i_d)_{d=1}^D\right\}\right|. \qquad (67)$$
We now argue that the joint fingerprint $(F_{i_1, \ldots, i_D})_{(i_1, \ldots, i_D) \ge 0}$ corresponds to a partition of multipartite numbers. Indeed, the joint fingerprint satisfies the following equation:
$$\sum_{(i_1, i_2, \ldots, i_D) \ge 0} F_{i_1, i_2, \ldots, i_D} \cdot (i_1, i_2, \ldots, i_D) = (n_1, n_2, \ldots, n_D). \qquad (68)$$
Theorem 11 quantifies the size of the sufficient statistic.

Theorem 11 (Auluck, 1953; Bárány and Vershik, 1992; Acharya et al., 2011) Suppose $D = 2$. If $n_1 = n_2 = n$, then the cardinality of the set of 2-D fingerprints on $n$ samples is given by
$$|\{(F_{ij})_{(i,j) \ne (0,0)}\}| = e^{3(\zeta(3))^{1/3} n^{2/3}(1 + o(1))}. \qquad (69)$$
For general $D > 2$, if $n_d \ge 2^{D+1}$, $1 \le d \le D$, we have
$$|\{(F_{i_1, \ldots, i_D})_{(i_1, \ldots, i_D) \ne \mathbf 0}\}| \le \exp\left(c_D \left(\prod_{d=1}^{D} n_d\right)^{\frac{1}{D+1}}\right), \qquad (70)$$
where $\mathbf 0 = (0, \ldots, 0)$ and $c_D$ depends only on $D$. Moreover, if $n_d = n$ for all $d \in \{1, \ldots, D\}$, we have
$$|\{(F_{i_1, \ldots, i_D})_{(i_1, \ldots, i_D) \ne \mathbf 0}\}| = \exp\left((D+1)(\zeta(D+1))^{1/(D+1)} n^{\frac{D}{D+1}}(1 - o(1))\right). \qquad (71)$$
Here $\zeta(D+1) = \sum_{k=1}^{\infty} k^{-(D+1)}$ is the Riemann zeta function. Fortunately, the cardinality of the $D$-dimensional fingerprint is still sub-exponential in the sample size for all $D$.
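The joint fingerprint (67) and its multipartite-partition identity (68) can be computed directly from the two empirical count tables; a minimal sketch for $D = 2$:

```python
from collections import Counter

def joint_fingerprint(counts_p, counts_q):
    """2-D fingerprint (67): F_{i,j} = number of symbols seen i times in the
    first sample and j times in the second. counts_p, counts_q map symbols
    to empirical counts (missing symbols have count 0)."""
    symbols = set(counts_p) | set(counts_q)
    return Counter((counts_p.get(x, 0), counts_q.get(x, 0)) for x in symbols)
```

Summing $i \cdot F_{i,j}$ recovers $n_1$ and summing $j \cdot F_{i,j}$ recovers $n_2$, which is exactly the identity (68) coordinate by coordinate.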
Then, combining with Theorem 3 and the fact that there exist estimators for the aforementioned divergence functions with near exponential measure concentration (which can be shown in a straightforward manner as in (Acharya et al., 2017; Han et al., 2016a,b; Bu et al., 2016; Valiant and Valiant, 2011a; Jiao et al., 2018)), we know that plugging in the 2-D PML achieves the optimal sample complexity in estimating the Kullback-Leibler divergence, the L1 distance, the squared Hellinger distance, and the $\chi^2$ divergence. The proof of these results is to be reported elsewhere. In this section, we extensively test the performance of divergence estimation via plugging in our approximate 2-D PML. We mainly compete with the released code of the KL divergence estimator of (Han et al., 2016a,b).

The concrete divergence functional estimation algorithm is as follows. Suppose we observe $n$ samples from distribution $p$ with empirical distribution $\hat p$ and $m$ samples from distribution $q$ with empirical distribution $\hat q$. Let the 2-D PML estimator be as in (20):
$$(p^\star, q^\star) = \operatorname*{argmax}_{(p,q) \in P} \mathbb P_{p,q}(F), \qquad (72)$$
where $F$ is the 2-D fingerprint (18) and $P$ is a collection of pairs of distributions on the same alphabet $X$. The APML distributions maximize a lower bound to $\mathbb P_{p,q}(F)$:
$$(\tilde p^\star, \tilde q^\star) = \operatorname*{argmax}_{(p,q) \in P} \frac{e^{\underline V(p,q)}}{(K - \hat K)!} \qquad (73)$$
where $\underline V$ is defined for the $D$-dimensional case in Appendix H, $X = \mathrm{Supp}(p) \cup \mathrm{Supp}(q)$, $K = |X|$, and $\hat K = |\mathrm{Supp}(\hat p) \cup \mathrm{Supp}(\hat q)|$. As discussed in Section 4.7, there is no natural ordering on the bins of the 2-D fingerprint, so we do not give a dynamic programming algorithm like the one in Section 4.5 to find the APML distributions. Instead we use a greedy heuristic, presented in Appendix H, to approximately maximize $\underline V(p, q)$; with a slight abuse of notation we denote the resulting approximation to the already-approximate PML distributions also by
$$(\tilde p^\star, \tilde q^\star). \qquad (74)$$
Then our estimator for a functional $F$ of the form (66) is $F(\tilde p^\star, \tilde q^\star)$.
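The greedy heuristic can be sketched as follows. The per-level-set objective used here assumes the natural 2-D generalization of (39), with the $\log(|\alpha|!)$ term plus one averaged log-likelihood term per distribution; the precise form of $\underline V$ is the one in Appendix H, which we do not reproduce:

```python
import math

def V_set(pairs, n, m):
    """Score of one level set, assuming the natural 2-D generalization of (39):
    log(|alpha|!) plus one averaged log-likelihood term per distribution.
    `pairs` lists (count in sample 1, count in sample 2) per symbol."""
    size = len(pairs)
    s = math.lgamma(size + 1)
    for d, n_d in enumerate((n, m)):
        mass = sum(c[d] for c in pairs)
        if mass > 0:
            s += mass * math.log(mass / (n_d * size))
    return s

def greedy_2d_clump(pairs, n, m):
    """Greedy heuristic in the spirit of Appendix H (sketch): start from
    singleton level sets and keep merging the pair of sets with the best
    score gain, stopping when no merge increases the objective."""
    sets = [[c] for c in pairs]
    while len(sets) > 1:
        best_gain, best = 0.0, None
        for a in range(len(sets)):
            for b in range(a + 1, len(sets)):
                gain = (V_set(sets[a] + sets[b], n, m)
                        - V_set(sets[a], n, m) - V_set(sets[b], n, m))
                if gain > best_gain:
                    best_gain, best = gain, (a, b)
        if best is None:
            return sets
        a, b = best
        sets[a] += sets[b]
        del sets[b]
    return sets
```

On symbols with identical count pairs the merge gain is a positive binomial coefficient term, so such symbols are always clumped, while well-separated count pairs are left in distinct level sets.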
Our approximate PML estimator performs best overall relative to the competition, which for the KL divergence consists of (Han et al., 2016a,b) and the MLE plug-in estimator, and for the $L_1$ distance consists of only the MLE plug-in estimator. (Valiant and Valiant, 2013) and (Valiant, 2008) generalize their approach to the 2-D fingerprint setting, but do not release code to repeat their experiments.

6.1. KL divergence estimation

For the KL divergence $D(p\|q) = \sum_{x \in X} p_x \log \frac{p_x}{q_x}$, the corresponding approximate PML estimator is the plug-in $D(\bar{\bar p}^* \| \bar{\bar q}^*)$, where $\bar{\bar p}^*, \bar{\bar q}^*$ are as in (74), optimized over the collection of distributions $\mathcal{P} = \mathcal{P}_\rho$, where

$\mathcal{P}_\rho = \{(p, q) : \sup_x (p_x / q_x) \le \rho\}$ (75)

denotes the set of all pairs of discrete distributions with finite support and maximum ratio $\rho$. Figure 12 shows the performance of our approximate PML scheme for estimating the KL divergence. The approximate PML estimator mostly outperforms (Han et al., 2016a,b) and the MLE. All three estimators show non-monotonicity of their root mean squared error in the sample size. Qualitatively, the approximate PML estimator's performance is the best-behaved among the three in the sense of being roughly monotonically decaying in the sample size.

6.2. $L_1$ distance

For the $L_1$ distance $\|p - q\|_1 = \sum_{x \in X} |p_x - q_x|$, the corresponding approximate PML estimator is the plug-in $\|\bar{\bar p}^* - \bar{\bar q}^*\|_1$, where $\bar{\bar p}^*, \bar{\bar q}^*$ are as in (74), optimized over the collection $\mathcal{P}$ of all pairs of discrete distributions with finite support (possibly different support set sizes within the pair). Figure 13 shows the performance of our approximate PML scheme for estimating the $L_1$ distance between two distributions. The approximate PML estimator is overall stronger than the MLE plug-in estimator for all distributions and sample sizes, and performs best when the two unknown distributions $p$ and $q$ are the same (corresponding to the diagonal of the matrix of plots in the figure).
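The role of the ratio constraint (75) can be illustrated with a tiny plug-in sketch. This is ours, not the paper's estimator: here we simply project $q$ onto the constraint set by flooring each $q_x$ at $p_x/\rho$ and renormalizing, whereas the paper optimizes the PML objective over $\mathcal{P}_\rho$; the point is only that the constraint keeps the plug-in KL finite:

```python
import math

def kl_plugin(p, q, rho):
    """Plug-in KL estimate D(p||q), with q adjusted so that the ratio
    p_x/q_x stays (approximately) below rho. A simple projection for
    illustration only; the APML instead optimizes over P_rho."""
    q = [max(qx, px / rho) for px, qx in zip(p, q)]  # floor the ratio
    s = sum(q)
    q = [qx / s for qx in q]                          # renormalize
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Ratio constraint inactive: the plain plug-in value is returned.
val = kl_plugin([0.9, 0.1], [0.1, 0.9], rho=10)
assert abs(val - (0.9 * math.log(9) + 0.1 * math.log(1 / 9))) < 1e-12

# With a zero in q, the unconstrained plug-in would be infinite;
# the constrained version stays finite.
assert math.isfinite(kl_plugin([0.5, 0.5], [1.0, 0.0], rho=10))
```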
Although (Valiant and Valiant, 2013) did not release their code for estimating the $L_1$ distance, the leftmost pane of their Figure 2 corresponds exactly in its parameters to the top-left pane of our Figure 13; from roughly checking a few sample sizes visually, the approximate PML seems to perform significantly better.

Acknowledgments

We are grateful to Huacheng Yu for providing the proof of Theorem 12. We would like to thank Jayadev Acharya, Nick Loehr, Greta Panova, Richard Stanley, and Arthur Yang for helpful discussions on the size of the sufficient statistic.

Appendix A. Proof of expression (14) for the probability of a fingerprint

Let $S_X$ denote the symmetric group on the discrete, possibly infinite set $X$, and let $S_X \hat p$ denote the orbit of the empirical distribution $\hat p$ under the action of $S_X$ by relabeling the components of $\hat p$. That is, if $\sigma \in S_X$, then $(\sigma \hat p)_x \equiv \hat p_{\sigma(x)}$, and $S_X \hat p = \{\sigma \hat p\}_{\sigma \in S_X}$. Each distribution in the orbit $S_X \hat p$ has the same fingerprint, and distinct orbits have distinct fingerprints, so the set of distinct fingerprints $F$ is in bijection with the set of distinct orbits under the action of $S_X$; the probability of a specific fingerprint $F$ is therefore given by

$\mathbb{P}_p(F) = \mathbb{P}_p(S_X \hat p) = \sum_{\hat p' \in S_X \hat p} \mathbb{P}_p(\hat p') = \sum_{\hat p' \in S_X \hat p} \frac{n!}{\prod_{x \in X} (n \hat p'_x)!} \prod_{x \in X} p_x^{n \hat p'_x}$, (76)

where the last equality follows from (10). When the alphabet $X$ is finite, we can replace the summation over the orbit $S_X \hat p$ in (76) with a summation over the finite symmetric group $S_X$, taking care to avoid double-counting, since $|S_X \hat p| \le |S_X|$ with strict inequality if there exist distinct $x, x' \in X$ such that $\hat p_x = \hat p_{x'}$ (see Appendix B for an example of this).
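The double-counting correction is easy to check numerically. The following sketch (ours, with an arbitrary choice of $p$) verifies both the orbit-size formula (77) and the fingerprint probability for the example worked out in Appendix B, where only the symbol $b$ is observed in four samples over $X = \{a, b, c\}$:

```python
from itertools import permutations
from math import factorial

def perm(M):
    """Permanent of a small square matrix, directly from the definition."""
    total = 0.0
    for sigma in permutations(range(len(M))):
        prod = 1.0
        for i, j in enumerate(sigma):
            prod *= M[i][j]
        total += prod
    return total

# Appendix B's second example: observe x = (b,b,b,b) over X = {a,b,c},
# so the counts are (0, 4, 0) and the fingerprint is F_0 = 2, F_4 = 1.
p = [0.2, 0.5, 0.3]                      # an arbitrary distribution on X
counts = [0, 4, 0]
n = sum(counts)

# Orbit-size check (77): |S_X phat| = |X|!/(F_0! F_4!) = 3!/2! = 3.
orbit = {tuple(counts[i] for i in sig) for sig in permutations(range(3))}
assert len(orbit) == factorial(3) // (factorial(2) * factorial(1)) == 3

# Probability of the fingerprint: prefactors 1/(F_0! F_4!) and the
# multinomial coefficient 4!/(0! 4! 0!) = 1, times perm(Q).
Q = [[px ** c for c in counts] for px in p]
multinom = factorial(n)
for c in counts:
    multinom //= factorial(c)
prob = multinom * perm(Q) / (factorial(2) * factorial(1))
assert abs(prob - sum(px ** 4 for px in p)) < 1e-12   # p_a^4 + p_b^4 + p_c^4
```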
Figure 12: Root mean squared error for estimation of KL divergence between distributions indicated in row and column names (that is, $D(p_{\mathrm{row}} \| p_{\mathrm{column}})$). In all cases $K = |X| = 10^4$. Uniform is uniform on $X$; Mix 2 Uniforms is a mixture of two uniform distributions, with half the probability mass on the first $K/5$ symbols and the other half on the remaining symbols; and Zipf($\alpha$) $\propto 1/i^\alpha$ with $i \in \{1, \ldots, K\}$. MLE denotes the ML plug-in ("naive") approach of computing $D(\hat p \| \hat q)$. HJW is (Han et al., 2016a,b). Each data point represents 100 random trials, with 2 standard error bars smaller than the plot marker for most points. Log base 2.

Figure 13: Root mean squared error for estimation of $L_1$ distance between distributions indicated in row and column names. Same parameters as in Figure 12. Since the $L_1$ distance is symmetric in its arguments, the performance matrix is symmetric.

The fingerprint notation lets us conveniently express $|S_X \hat p|$:

$|S_X \hat p| = \frac{|X|!}{\prod_{i \ge 0} F_i!}$. (77)

The probability of a specific fingerprint $F$ under a distribution $p$ with finite support $X$ is given by

$\mathbb{P}_p(F) = \mathbb{P}_p(S_X \hat p) = \sum_{\hat p' \in S_X \hat p} \mathbb{P}_p(\hat p')$ (78)

$\overset{(a)}{=} \frac{|S_X \hat p|}{|S_X|} \sum_{\sigma \in S_X} \mathbb{P}_p(\sigma \hat p)$ (79)

$\overset{(b)}{=} \frac{1}{\prod_{i \ge 0} F_i!} \sum_{\sigma \in S_X} \mathbb{P}_p(\sigma \hat p)$ (80)

$\overset{(c)}{=} \frac{1}{\prod_{i \ge 0} F_i!} \frac{n!}{\prod_{x \in X} (n \hat p_x)!} \sum_{\sigma \in S_X} \prod_{x \in X} p_x^{n \hat p_{\sigma(x)}}$ (81)

$= \frac{1}{\prod_{i \ge 0} F_i!} \frac{n!}{\prod_{x \in X} (n \hat p_x)!} \; \mathrm{perm} \underbrace{\left( p_x^{n \hat p_{x'}} \right)_{x, x' \in X}}_{Q}$ (82)

$\overset{(d)}{=} \frac{1}{\prod_{i \ge 1} F_i!} \frac{n!}{\prod_{x \in X} (n \hat p_x)!} \frac{1}{|X \setminus \mathrm{Supp}(\hat p)|!} \; \mathrm{perm}(Q)$ (83)

where in (a) the prefactor corrects for multiple-counting points in the orbit $S_X \hat p$ when we sum over $\sigma \in S_X$, (b) follows from (77) and $|S_X| = |X|!$, (c) follows from (10), and (d) follows since $F_0 = |X \setminus \mathrm{Supp}(\hat p)|$.

When the alphabet $X$ is countably infinite, we instead group the terms in the sum (76) by the support set $\mathrm{Supp}(\hat p')$. Let $\hat K = |\mathrm{Supp}(\hat p)|$ (13) denote the empirical support set size and let $\Pi_{\hat K}(X)$ denote the set of all subsets of $X$ of size $\hat K$.
Then

$\mathbb{P}_p(F) = \sum_{\hat p' \in S_X \hat p} \mathbb{P}_p(\hat p')$ (84)

$= \sum_{X' \in \Pi_{\hat K}(X)} \; \sum_{\substack{\hat p' \in S_X \hat p : \\ \mathrm{Supp}(\hat p') = X'}} \mathbb{P}_p(\hat p')$ (85)

$\overset{(a)}{=} \sum_{X' \in \Pi_{\hat K}(X)} \frac{1}{\prod_{i \ge 1} F_i!} \sum_{\sigma \in S_{X'}} \mathbb{P}_p(\sigma \hat p)$ (86)

$\overset{(b)}{=} \frac{1}{\prod_{i \ge 1} F_i!} \frac{n!}{\prod_x (n \hat p_x)!} \sum_{X' \in \Pi_{\hat K}(X)} \mathrm{perm}\left( p_x^{n \hat p_{x'}} \right)_{x, x' \in X'}$ (87)

where in (a) the prefactor corrects for multiple-counting points in the orbit $S_X \hat p$ when we sum over $\sigma \in S_{X'}$, and (b) follows from (10) and the fact that $\hat p'$ is supported on $X'$ (so there is no $F_0$ term, unlike the case when $X$ is finite).

Appendix B. Illustration of computing the probability of a fingerprint

To illustrate the computation of the probability of the fingerprint, suppose that we have an alphabet of 3 symbols $X = \{a, b, c\}$ and consider the sequence $x_1^7 = (a,b,b,a,b,b,c)$. Then the empirical distribution is $\hat p(x_1^7) = (\hat p_a, \hat p_b, \hat p_c) = \frac{1}{7}(2, 4, 1)$ and the fingerprint $F(\hat p)$ is $F_i = 1$ for $i \in \{1, 2, 4\}$ and $F_i = 0$ otherwise. The probability under $p = (p_a, p_b, p_c)$ of $F(\hat p)$ is

$\mathbb{P}_p(F(\hat p)) = \frac{1}{1!\,1!\,1!} \frac{7!}{2!\,4!\,1!} \; \mathrm{perm} \begin{pmatrix} p_a^2 & p_a^4 & p_a^1 \\ p_b^2 & p_b^4 & p_b^1 \\ p_c^2 & p_c^4 & p_c^1 \end{pmatrix}$ (88)

where each term in the permanent expansion is computed via (10) and the two prefactors are those in (83).

Suppose now that we happen to observe only the symbol $b$ in $x_1^4 = (b,b,b,b)$, but know that $X$ consists of 3 symbols as in the previous example. Then $\hat p(x_1^4) = \frac{1}{4}(0, 1, 0)$ and the fingerprint is $F_0 = 2$, $F_4 = 1$, and $F_i = 0$ otherwise. The computation of the probability under $p = (p_a, p_b, p_c)$ of $F(\hat p)$ is shown in (90):

$\mathbb{P}_p(F(\hat p)) = \frac{1}{2!\,1!} \frac{4!}{0!\,4!\,0!} \; \mathrm{perm} \begin{pmatrix} 1 & p_a^4 & 1 \\ 1 & p_b^4 & 1 \\ 1 & p_c^4 & 1 \end{pmatrix}$ (90)

$= p_a^4 + p_b^4 + p_c^4$. (91)

Note that there are now fewer than $|S_X| = |X|!$ distinct permutations to sum over, reflected in the prefactor $1/F_0!$ with $F_0 = 2$, since there are two unseen symbols.

Appendix C. Proof of Theorem 12

Let $Z$ be the number in Theorem 12. We first prove the lower bound. Fix a (constant) parameter $\alpha \in (0, 1)$. Consider all 0-1 assignments to an $\alpha n \times \alpha n$ array such that the total number of 1's is exactly $n$.
The number of such assignments is $\binom{\alpha^2 n^2}{n}$. On the other hand, each equivalence class generates at most $((\alpha n)!)^2 \le (\alpha n)^{2 \alpha n}$ such assignments. Therefore, the number of equivalence classes $Z$ satisfies

$Z \ge \binom{\alpha^2 n^2}{n} \Big/ (\alpha n)^{2 \alpha n} \ge \frac{(\alpha^2 n)^n}{(\alpha n)^{2 \alpha n}} = n^{(1 - 2\alpha) n} \alpha^{2 n (1 - \alpha)}$.

The above inequality holds for any $\alpha > 0$. In particular, for any constant $\alpha$ we have $Z \ge n^{(1 - 2\alpha - o(1)) n}$; letting $\alpha \to 0$ implies the claimed bound.

Now we prove the upper bound. Consider the following procedure for generating an assignment to an $n \times n$ array.
1. Choose an integer $1 \le m \le n$ arbitrarily;
2. Choose $m$ entries from the $n^2$ entries arbitrarily;
3. For each of the $m$ chosen entries from the last step, fill in a positive integer such that the $m$ entries sum up to $n$.

It is easy to verify that for every equivalence class, some assignment in the class can be generated by the above procedure. On the other hand, the number of different assignments that can be generated is at most

$n \cdot \max_{1 \le m \le n} \binom{n^2}{m} \cdot 2^n \le n \, 2^n (e n)^n = n^{(1 + o(1)) n}$.

This proves the upper bound.

Appendix D. The EM algorithm used to numerically solve the PML for small alphabets in Section 4

D.1. 1-D case

We treat $p$ as the unknown parameter to be inferred and assume that $p$ is sorted. Recall that for a fixed alphabet size, the PML distribution is

$p^* = \mathrm{argmax}_p \sum_{\sigma \in S_X} \prod_{x \in X} p_x^{n \hat p_{\sigma(x)}}$. (92)

We treat the permutation $\sigma$ as the hidden data, drawn uniformly on $S_X$, and the observed data $\hat p$ as drawn from the permuted distribution $\sigma p$. Then if $p_{(t)}$ is our estimate for $p$ after $t$ iterations of the EM algorithm, the next value is

$p_{(t+1)} \propto \sum_{\sigma \in S_X} (\sigma \hat p) \prod_{x \in X} p_{(t),x}^{n \hat p_{\sigma(x)}}$, (93)

where $\sigma \hat p$ denotes the permuted empirical histogram and $\propto$ denotes equality up to normalization. For the starting point we choose $p_{(0)} = \hat p$.

D.2. 2-D case

Analogously to the 1-D case above, let $p_{(t)}, q_{(t)}$ be our estimates for $p, q$ after $t$ iterations of the EM algorithm. Then the next values are

$p_{(t+1)} \propto \sum_{\sigma \in S_X} (\sigma \hat p) \, w_\sigma$ (94)

$q_{(t+1)} \propto \sum_{\sigma \in S_X} (\sigma \hat q) \, w_\sigma$ (95)

$w_\sigma = \prod_{x \in X} p_{(t),x}^{n \hat p_{\sigma(x)}} \, q_{(t),x}^{m \hat q_{\sigma(x)}}$ (96)

and $\propto$ denotes equality up to normalization.
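A minimal sketch of the 1-D update (93) for a tiny alphabet (ours; the sum over $S_X$ makes each iteration exponential in $|X|$, so this is for illustration only). For the observation $(a, a, b)$ over a binary alphabet, the exact PML objective reduces to $p_1 p_2$, so the PML distribution is uniform, and the iteration indeed converges there:

```python
from itertools import permutations

def em_pml(phat, n, iters=50):
    """EM iteration (93): p_{t+1} propto sum_sigma (sigma phat) *
    prod_x p_{t,x}^{n phat_{sigma(x)}}. Starts at p_{(0)} = phat."""
    K = len(phat)
    p = list(phat)
    for _ in range(iters):
        new = [0.0] * K
        for sigma in permutations(range(K)):
            w = 1.0
            for x in range(K):
                w *= p[x] ** (n * phat[sigma[x]])   # weight of sigma
            for x in range(K):
                new[x] += w * phat[sigma[x]]        # (sigma phat) * w
        s = sum(new)
        p = [v / s for v in new]                    # normalize
    return p

# Observe (a, a, b): phat = (2/3, 1/3), n = 3.
p = em_pml([2 / 3, 1 / 3], 3)
assert abs(p[0] - 0.5) < 1e-6   # converges to the uniform PML
```

For this example the update simplifies to $p_{(t+1),1} = (1 + p_{(t),1})/3$, a contraction toward $1/2$, so convergence is linear with rate $1/3$.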
For the starting point we choose $p_{(0)} = \hat p$, $q_{(0)} = \hat q$.

Appendix E. Proof of Theorem 8 ($D$-dimensional PML distribution on a binary alphabet)

The $D$-dimensional PML distributions $\mathbf{p}^* \equiv (p^{(d)*})_{d=1}^D$ are (see Appendix H for a derivation):

$\mathbf{p}^* \in \mathrm{argmax}_{\mathbf{p} \in \mathcal{P}} \; \mathrm{perm}\left( \prod_{d=1}^D (p^{(d)}_x)^{n_d \hat p^{(d)}_{x'}} \right)_{x, x'}$ (97)

$= \mathrm{argmax}_{\mathbf{p} \in \mathcal{P}} \left\{ \prod_{d=1}^D (p^{(d)}_1)^{n_d \hat p^{(d)}_1} (1 - p^{(d)}_1)^{n_d (1 - \hat p^{(d)}_1)} + \prod_{d=1}^D (p^{(d)}_1)^{n_d (1 - \hat p^{(d)}_1)} (1 - p^{(d)}_1)^{n_d \hat p^{(d)}_1} \right\}$ (98)

where $\mathcal{P}$ is a collection of $D$-tuples of distributions on the binary alphabet $X = \{1, 2\}$. To lighten notation, let

$r_d \equiv p^{(d)}_1, \qquad \Delta_d \equiv n_d \left( \hat p^{(d)}_1 - \tfrac{1}{2} \right)$. (99)

Then $\mathbf{p}^* = ((r^*_d, 1 - r^*_d))_{d=1}^D$, where

$(r^*_d)_{d=1}^D = \mathrm{argmax}_{(r_d)_{d=1}^D \in [0,1]^D} \left\{ \prod_{d=1}^D r_d^{n_d/2 + \Delta_d} (1 - r_d)^{n_d/2 - \Delta_d} + \prod_{d=1}^D r_d^{n_d/2 - \Delta_d} (1 - r_d)^{n_d/2 + \Delta_d} \right\}$ (101)

$= \mathrm{argmin}_{(r_d)_{d=1}^D \in [0,1]^D} U(r_1, \ldots, r_D)$, (102)

where $U$ denotes the negative of the bracketed sum. If $r^*_d = 1/2$ for all $d$, then $U$ must have a positive semidefinite Hessian $H_U$ at $(r_d)_{d=1}^D = (1/2)_{d=1}^D$. Let us compute the Hessian at $r_d = 1/2$ for all $d$, letting $[D] \equiv \{1, \ldots, D\}$:

$H^*_U \equiv H_U \big|_{(r_d) = (1/2)} = \left( \frac{\partial^2 U}{\partial r_d \, \partial r_{d'}} \right)_{(r_d) = (1/2)} = c \left( (\Delta_d \Delta_{d'})_{d, d' \in [D]} - \mathrm{diag}\left( \frac{n_d}{4} \right) \right)$ (104)

where $c = -2^{5 - \sum_{d=1}^D n_d}$ and $\mathrm{diag}(\cdot)$ denotes the diagonal matrix with entry $n_d / 4$ in the $d$-th row and $d$-th column. The matrix $H^*_U$ is positive definite when $\Delta_d = 0$ for all $d$, with eigenvalues $(-c \, n_d / 4)_{d=1}^D$. Since the eigenvalues of $H^*_U$ vary continuously in $(\Delta_d)_{d=1}^D$, we conclude that $H^*_U$ is positive definite when $(\Delta_d)_{d=1}^D$ is in the neighborhood of $(0)_{d=1}^D$ determined by

$\sum_{d=1}^D \frac{4 \Delta_d^2}{n_d} < 1$. (106)

Thus $H^*_U$ is not positive definite at $(r_d)_{d=1}^D = (1/2)_{d=1}^D$ when the empirical distribution components $(\hat p^{(d)}_1)_{d=1}^D$ lie outside the ellipse determined by (106), so the $D$ PML distributions $(p^{(d)*})_{d=1}^D$ are not all uniform in this case.

Appendix F.
Proof of permanent lower bound identity (35)

Given a distribution $p$ supported on a set $X$, denote by $X_u(p)$ a level set of $p$:

$X_u(p) \equiv \{x \in X : p_x = u\}$. (107)

Denote by $U(p)$ the set of unique entries of $p$:

$U(p) \equiv \{p_x : x \in X\}$, (108)

and denote by $A(p)$ the partition of $X$ into level sets of $p$ (equivalent to the definition (29)):

$A(p) \equiv \{X_u(p) : u \in U(p)\}$. (109)

It is convenient to express the group $S_{X,p}$ (32) as isomorphic to a product:

$S_{X,p} \cong \prod_{\alpha \in A(p)} S_\alpha = \prod_{u \in U(p)} S_{X_u}$. (110)

Then, with $V(p) = \log(\mathrm{perm}(Q))$,

$V(p) = \log \sum_{\sigma \in S_X} \prod_{x \in X} p_x^{n \hat p_{\sigma(x)}}$ (111)

$\overset{(a)}{=} \log \sum_{\sigma \in S_X} \prod_{x \in X} p_{\sigma(x)}^{n \hat p_x} \overset{(b)}{\ge} \log \sum_{\sigma \in S_{X,p}} \prod_{x \in X} p_{\sigma(x)}^{n \hat p_x}$

$\overset{(c)}{=} \log \prod_{u \in U(p)} \left( |S_{X_u}| \, u^{\,n \sum_{x \in X_u} \hat p_x} \right)$

$\overset{(d)}{=} \sum_{u \in U(p)} \left( \log(|S_{X_u}|) + n \sum_{x \in X_u} \hat p_x \log(u) \right)$ (117)

$\overset{(e)}{=} \sum_{u \in U(p)} \left( \log(|X_u|!) + n \sum_{x \in X_u} \hat p_x \log(p_x) \right)$ (119)

$\overset{(f)}{=} \sum_{\alpha \in A(p)} \log(|\alpha|!) - n \left( D(\hat p \| p) + H(\hat p) \right) \equiv \bar V(p)$ (121)

where in (a) we changed the order of summation over the group $S_X$; in (b) we restricted the sum to the subgroup $S_{X,p}$; in (c) we used the product structure (110) of $S_{X,p}$ and the fact that $p_x = u$ for all $x \in X_u$ (107), so that every summand equals the same product; in (d) we moved the product into the logarithm; in (e) we used $|S_{X_u}| = |X_u|!$ and $u = p_x$ for all $x \in X_u$; and in (f) we used $n \sum_x \hat p_x \log p_x = -n(D(\hat p \| p) + H(\hat p))$ and the correspondence $\alpha \in A(p) \leftrightarrow u \in U(p) : X_u = \alpha$.

Appendix G. Proof of Theorem 9 (properties of the approximate PML distribution's level set partition $\bar A^*$)

G.1. Proof of iso-clumping property (41)

Suppose $\bar A^*$ does not satisfy property (41). Then there exist $x, y \in X$ such that $\hat p_x = \hat p_y$, but $\alpha(x) \ne \alpha(y)$. We show that the partition obtained by either reassigning $y$ to the same level set as $x$, or reassigning $x$ to the same level set as $y$, achieves a higher value of the log permanent lower bound $\bar V(\cdot)$ (38), contradicting the optimality of $\bar A^*$ (40). Let $m \equiv n \hat p_x = n \hat p_y$ and let $N_{\alpha(x)} \equiv \sum_{x' \in \alpha(x)} n \hat p_{x'}$. For $\delta \in [-1, 1]$, let

$V_\delta \equiv \log((|\alpha(x)| + \delta)!) + (N_{\alpha(x)} + m\delta) \log \frac{N_{\alpha(x)} + m\delta}{n (|\alpha(x)| + \delta)} + \log((|\alpha(y)| - \delta)!)$
$+ \; (N_{\alpha(y)} - m\delta) \log \frac{N_{\alpha(y)} - m\delta}{n (|\alpha(y)| - \delta)}$.

Then $V_0 = \bar V(\alpha(x)) + \bar V(\alpha(y))$ (39) corresponds to two terms in $\bar V(\bar A^*)$ (38), and $V_1$ (resp. $V_{-1}$) corresponds to two terms in the value of the partition of $X$ obtained by reassigning symbol $y$ to the same level set as $x$ (resp. reassigning symbol $x$ to the same level set as $y$). We can check that $V_\delta$ is strictly convex in $\delta \in [-1, 1]$. A strictly convex function attains its maximum on the boundary of a convex set, so either $V_{-1} > V_0$ or $V_1 > V_0$, contradicting the optimality of partition $\bar A^*$ and thus establishing the iso-clumping property.

G.2. Proof of convexity property (42)

Suppose $\bar A^*$ does not satisfy property (42). Then there exist $x_1, x_2, x_3 \in X$ such that $\hat p_{x_1} < \hat p_{x_2} < \hat p_{x_3}$, but $\alpha(x_1) = \alpha(x_3) \ne \alpha(x_2)$. For all $x \in X$, let $N_{\alpha(x)} \equiv \sum_{x' \in \alpha(x)} n \hat p_{x'}$. If $\frac{N_{\alpha(x_1)}}{|\alpha(x_1)|} \ge \frac{N_{\alpha(x_2)}}{|\alpha(x_2)|}$, then let $x = x_2$ and $y = x_1$; otherwise let $x = x_3$, $y = x_2$. Let $\Delta \equiv n (\hat p_x - \hat p_y)$. Then $\Delta > 0$ and

$\frac{N_{\alpha(y)}}{|\alpha(y)|} \ge \frac{N_{\alpha(x)}}{|\alpha(x)|}$. (123)

Let $A'$ denote the partition of $X$ obtained by swapping the set memberships of $x$ and $y$ in $\bar A^*$. We show that $\bar V(A') > \bar V(\bar A^*)$, contradicting the optimality of $\bar A^*$ ($A'$ might not satisfy the iso-clumping property (41)). Since $A'$ is obtained from $\bar A^*$ by swapping the level set memberships of $x$ and $y$, the elements of $A'$ and $\bar A^*$ have the same sizes, so $\sum_{\alpha \in A'} \log(|\alpha|!) = \sum_{\alpha \in \bar A^*} \log(|\alpha|!)$, and using (38) and (39) we write

$\bar V(A') - \bar V(\bar A^*) = (N_{\alpha(x)} - \Delta) \log \frac{N_{\alpha(x)} - \Delta}{n |\alpha(x)|} + (N_{\alpha(y)} + \Delta) \log \frac{N_{\alpha(y)} + \Delta}{n |\alpha(y)|} - N_{\alpha(x)} \log \frac{N_{\alpha(x)}}{n |\alpha(x)|} - N_{\alpha(y)} \log \frac{N_{\alpha(y)}}{n |\alpha(y)|}$ (124)

$= \Delta \log \left( \frac{N_{\alpha(y)}}{|\alpha(y)|} \cdot \frac{|\alpha(x)|}{N_{\alpha(x)}} \right) + f(N_{\alpha(x)}, \Delta) + f(N_{\alpha(y)}, -\Delta)$ (125)

$\overset{(a)}{>} 0$ (126)

where $f(N, \Delta) \equiv \Delta + (N - \Delta) \log \frac{N - \Delta}{N}$, with $f(N, N) \equiv \lim_{\Delta \to N} f(N, \Delta) = N$. (a) follows because we can show that $f(N, \Delta) \ge 0$ for $\Delta \le N$, with strict inequality iff $\Delta \ne 0$, so the last two terms in (125) are positive since $\Delta > 0$. The first term in (125) is non-negative since $\Delta > 0$ and by (123). Thus $\bar V(A') > \bar V(\bar A^*)$, contradicting the optimality of partition $\bar A^*$ and thus establishing the convexity property.

Appendix H.
Our approximate PML algorithm: the case of multiple distributions

We generalize the discussion of Section 3.2 to $D$-dimensional fingerprints. In this case we are not able to exactly optimize a $D$-dimensional analogue of our objective function $\bar V$ (31), and we instead settle for an approximation of our already-approximate PML distribution. We use a greedy heuristic to approximate the optimal partition $\bar A^*$ (40) by iteratively, greedily merging partition elements (enlarging the second term in a $D$-dimensional analogue of (35) until the first term becomes too large); see Section H.4. Even this doubly-approximate algorithm turns out to be too slow to run on examples of practical interest, so we introduce a further heuristic to speed things up (only trying to merge partition elements that are close in Euclidean distance), which comes at the cost of adding some tunable knobs to the algorithm.

H.1. Some notation

Suppose we have $D$ distributions $(p^{(d)})_{d=1}^D$ on the same alphabet $X$ and for each $d$ draw $n_d$ samples i.i.d. from $p^{(d)}$, with empirical distribution $\hat p^{(d)}$. Let bold symbols denote $D$-component collections:

$\mathbf{p} \equiv (p^{(d)})_{d=1}^D, \quad \mathbf{n} \equiv (n_d)_{d=1}^D, \quad \hat{\mathbf{p}} \equiv (\hat p^{(d)})_{d=1}^D, \quad \mathbf{n} \hat{\mathbf{p}} \equiv (n_d \hat p^{(d)})_{d=1}^D$. (127)

We assume that the $D$ distributions $\mathbf{p}$ satisfy

$\sum_{d=1}^D p^{(d)}_x > 0 \quad \forall x \in X$. (128)

This condition ensures that there are no symbols $x$ such that $p^{(d)}_x = 0$ for all $d$, which simplifies the expressions below. Denote by $\hat X$ the empirical support set:

$\hat X \equiv \bigcup_{d=1}^D \mathrm{Supp}(\hat p^{(d)}) = \{x \in X : \sum_{d=1}^D \hat p^{(d)}_x > 0\}$. (129)

Finally let $K \equiv |X|$ and $\hat K \equiv |\hat X|$. The probability to draw the empirical distributions $\hat{\mathbf{p}}$ under distributions $\mathbf{p}$ is (using the independence of the $D$ empirical distributions and (10)):

$\mathbb{P}_{\mathbf{p}}(\hat{\mathbf{p}}) = \prod_{d=1}^D \frac{n_d!}{\prod_{x \in X} (n_d \hat p^{(d)}_x)!} \prod_{x \in X} (p^{(d)}_x)^{n_d \hat p^{(d)}_x}$, (130)

where the first factor in each term of the product over $d$ is the multinomial coefficient in the notation of (10). Hereon denote by $F = F(\hat p^{(1)}, \ldots$
$, \hat p^{(D)})$ the $D$-dimensional fingerprint (Raghunathan et al., 2017), indexed by the vector $\mathbf{i} = (i_d)_{d=1}^D \in \mathbb{N}^D$ (we write $\mathbf{i} \ge \mathbf{0}$ to mean $i_d \ge 0$ for all $d$, where $\mathbf{0} \equiv (0)_{d=1}^D$):

$F = F(\hat{\mathbf{p}}) = (F_{\mathbf{i}})_{\mathbf{i} \ge \mathbf{0}}, \qquad F_{\mathbf{i}} = |\{x \in X : (n_d \hat p^{(d)}_x)_{d=1}^D = \mathbf{i}\}|$. (131)

Note that $F_{\mathbf{0}} = |X \setminus \hat X| = K - \hat K$ counts the number of symbols unseen in any of the $D$ samples, and is thus unknown if the support set size $K$ is unknown. Then the probability to draw the $D$-dimensional fingerprint $F$ under $\mathbf{p}$ is (obtained analogously to the $D = 1$ case in Appendix A):

$\mathbb{P}_{\mathbf{p}}(F) \propto \sum_{\sigma \in S_X} \prod_{x \in X} \prod_{d=1}^D (p^{(d)}_x)^{n_d \hat p^{(d)}_{\sigma(x)}}$ (132)

$\mathbb{P}_{\mathbf{p}}(F) = \frac{1}{(K - \hat K)!} \frac{1}{\prod_{\mathbf{i}} F_{\mathbf{i}}!} \left( \prod_{d=1}^D \frac{n_d!}{\prod_{x} (n_d \hat p^{(d)}_x)!} \right) \mathrm{perm} \underbrace{\left( \prod_{d=1}^D (p^{(d)}_x)^{n_d \hat p^{(d)}_{x'}} \right)_{x, x' \in X}}_{Q}$ (133)

where in (133) the product over $\mathbf{i}$ runs over all $D$-dimensional indices such that $\min_d i_d \ge 0$ and $\max_d i_d > 0$.

H.2. The PML distributions in $D$ dimensions

For a given collection $\mathcal{P}$ of $D$-tuples of distributions, the $D$-dimensional PML distributions $\mathbf{p}^*$ are:

$\mathbf{p}^* \equiv (p^{(d)*})_{d=1}^D \in \mathrm{argmax}_{\mathbf{p} \in \mathcal{P}} \; \mathbb{P}_{\mathbf{p}}(F(\hat{\mathbf{p}}))$ (134)

$= \mathrm{argmax}_{\mathbf{p} \in \mathcal{P}} \; \frac{\mathrm{perm}(Q)}{(K - \hat K)!}$, (135)

where in (135) we discarded all $\mathbf{p}$-independent factors of $\mathbb{P}_{\mathbf{p}}(F)$ (133); $F_{\mathbf{0}} = K - \hat K = |X \setminus \hat X|$ depends on $\mathbf{p}$ through its support set size. Note that $\mathbb{P}_{\mathbf{p}}(F)$ is invariant under relabeling of the components of $\mathbf{p}$, so we can choose $\mathbf{p}$ to be non-increasing in the same ordering as we choose for the support set $X$. Note that the set $\mathcal{P}$ is not necessarily the same as the set of all discrete distributions, and that finding the $D$-dimensional PML distributions (135) is not equivalent to finding $D$ 1-dimensional PML distributions, since $\mathbb{P}_{\mathbf{p}}(F)$ (133) does not factor into $D$ pieces.

If the collection of $D$-tuples of distributions $\mathcal{P}$ includes distributions with different support set sizes, then we can estimate the support set size by breaking the optimization in (135) into two steps:

$\max_K \frac{1}{(K - \hat K)!} \max_{\mathbf{p} \in \mathcal{P}_K} \mathrm{perm}(Q)$ (136)

whenever the max over $K$ exists, where $\mathcal{P}_K \equiv \{\mathbf{p} \in \mathcal{P} : |\bigcup_{d=1}^D \mathrm{Supp}(p^{(d)})| = K\}$.

The case $D \ge 2$ is more complicated than the case $D = 1$ because the $D$ support sets $(\mathrm{Supp}(p^{(d)}))_{d=1}^D$ can overlap partially.
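The joint fingerprint (131) is straightforward to compute from the per-symbol count vectors; a small sketch (ours, with hypothetical counts):

```python
from collections import Counter

def joint_fingerprint(count_vectors):
    """F_i (131): the number of symbols x whose count vector
    (n_d * phat^{(d)}_x)_{d=1..D} equals i, for i != 0.
    (F_0, the number of unseen symbols, additionally requires |X|.)"""
    return dict(Counter(tuple(v) for v in count_vectors if any(v)))

# D = 2 distributions over X = {a, b, c, d}; one count column per symbol.
counts_p = [3, 1, 0, 0]   # n_1 = 4 samples from p
counts_q = [2, 0, 2, 0]   # n_2 = 4 samples from q
F = joint_fingerprint(zip(counts_p, counts_q))

# Symbol a has counts (3,2), b has (1,0), c has (0,2); d is unseen.
assert F == {(3, 2): 1, (1, 0): 1, (0, 2): 1}
```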
If $X$ is the joint support set, that is, $X = \bigcup_{d=1}^D \mathrm{Supp}(p^{(d)})$, then we can associate with each symbol $x \in X$ the indicator values $\phi_d(x) = \mathbb{1}(x \in \mathrm{Supp}(p^{(d)}))$, so there are $2^D - 1$ possible values of $(\phi_d(x))_{d=1}^D$ for each $x \in X$.⁹

9. One less than $2^D$, since $\phi_d(x) = 0$ for all $d$ implies $x \notin X$.

We could then be more ambitious and estimate not just the joint support set size $K = |X|$, but also the optimal value of $(\phi_d(x))_{d=1}^D$ for all symbols. Our proposed approximate PML scheme for $D \ge 2$ is less ambitious and only supports estimation of $|X|$, but not of the finer structure of the joint support set $X$ in terms of $D$ partially overlapping support sets $(\mathrm{Supp}(p^{(d)}))_{d=1}^D$.

H.3. Approximate PML in $D$ dimensions: the case of known support set size

In this section we assume the support set size $K = |X|$ is known. Analogously to the notation in the $D = 1$ case of Section 4.3 and the proof of the permanent lower bound for $D = 1$ in Appendix F, define the level sets $X_{\mathbf{u}}(\mathbf{p})$ of $\mathbf{p}$, indexed by the $D$-dimensional real vector $\mathbf{u} = (u_d)_{d=1}^D$:

$X_{\mathbf{u}}(\mathbf{p}) \equiv \{x \in X : (p^{(d)}_x)_{d=1}^D = \mathbf{u}\}$, (137)

the unique $D$-tuple entries of $\mathbf{p}$ by $U(\mathbf{p})$:

$U(\mathbf{p}) \equiv \{(p^{(d)}_x)_{d=1}^D : x \in X\}$, (138)

and the partition $A(\mathbf{p})$ of $X$ into level sets of $\mathbf{p}$:

$A(\mathbf{p}) \equiv \{X_{\mathbf{u}}(\mathbf{p}) : \mathbf{u} \in U(\mathbf{p})\}$. (139)

We assume $K = |X|$ is known, so $\mathbf{p}^*$ (135) is a maximizer of

$V(\mathbf{p}) \equiv \log(\mathrm{perm}(Q))$. (140)

As in the $D = 1$ case, we do not know how to computationally efficiently maximize $V(\mathbf{p})$ over $\mathbf{p}$, so we settle for maximizing a lower bound $\bar V(\mathbf{p})$, analogous to (31) and motivated by our empirical observations for the case $D = 2$ in Section 4.2:

$V(\mathbf{p}) \ge \bar V(\mathbf{p}) \equiv \log \sum_{\sigma \in S_{X,\mathbf{p}}} \prod_{x \in X} \prod_{d=1}^D (p^{(d)}_x)^{n_d \hat p^{(d)}_{\sigma(x)}}$ (141)

$= \sum_{\alpha \in A(\mathbf{p})} \log(|\alpha|!) - \sum_{d=1}^D n_d \left( D(\hat p^{(d)} \| p^{(d)}) + H(\hat p^{(d)}) \right)$ (142)

where $S_{X,\mathbf{p}}$ is, analogously to (32), the subgroup of $S_X$ consisting of all permutations that exchange only those alphabet symbols that are in the same level set of $\mathbf{p}$:

$S_{X,\mathbf{p}} \equiv \{\sigma \in S_X : \sigma \mathbf{p} \equiv (\sigma p^{(d)})_{d=1}^D = \mathbf{p}\}$. (143)

(142) follows from (141) by an argument analogous to the one given in Appendix F for the case $D = 1$.
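The bound (141)-(142) can be checked numerically in the $D = 1$ case, where it reduces to the 1-D identity of Appendix F. A small sketch of ours, with arbitrary numbers, showing that the bound is tight when $p$ has a single level set (so that $S_{X,p} = S_X$):

```python
from itertools import permutations
from math import log, factorial
from collections import Counter

def log_perm(Q):
    """log of the permanent of a small square matrix, by brute force."""
    total = 0.0
    for sigma in permutations(range(len(Q))):
        prod = 1.0
        for i, j in enumerate(sigma):
            prod *= Q[i][j]
        total += prod
    return log(total)

def V_bar(p, phat, n):
    """(142) with D = 1: sum_alpha log(|alpha|!) - n(D(phat||p) + H(phat)),
    i.e. log-factorials of level set sizes plus n * sum_x phat_x log p_x."""
    out = sum(log(factorial(sz)) for sz in Counter(p).values())
    out += n * sum(ph * log(px) for ph, px in zip(phat, p) if ph > 0)
    return out

n = 6
phat = [3 / 6, 2 / 6, 1 / 6]
Q = lambda p: [[px ** (n * ph) for ph in phat] for px in p]

p_two_levels = [0.4, 0.3, 0.3]   # level sets {x1}, {x2, x3}: bound is strict
p_uniform = [1 / 3, 1 / 3, 1 / 3]  # one level set: bound is an equality

assert V_bar(p_two_levels, phat, n) <= log_perm(Q(p_two_levels))
assert abs(V_bar(p_uniform, phat, n) - log_perm(Q(p_uniform))) < 1e-9
```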
As in the $D = 1$ case, the intuition is that we approximate the permanent by summing over only those permutations that contribute a lot to its value, and then maximize this lower bound $\bar V(\mathbf{p})$ over $\mathbf{p}$. Our approximate PML distributions are the $D$-tuple

$\bar{\mathbf{p}}^* = (\bar p^{(d)*})_{d=1}^D \in \mathrm{argmax}_{\mathbf{p} \in \mathcal{P}} \bar V(\mathbf{p})$. (144)

As in the 1-D case, the first term in (142) encourages $\mathbf{p}$ to clump many symbols together, while the second term encourages $\mathbf{p}$ to be similar to $\hat{\mathbf{p}}$, dominating as $n \to \infty$. Note that optimizing $\bar V(\mathbf{p})$ over $\mathbf{p}$ is not equivalent to solving $D$ independent optimization problems, because the $D$ distributions interact through the first summation in expression (142) (the summation over $A(\mathbf{p})$). As in the 1-D case, we can show that for all $d$ the approximate PML distribution $\bar p^{(d)*}$ (144) satisfies the averaging property (37). Therefore, as in the 1-D case, $\bar{\mathbf{p}}^*$ is determined by its partition of $X$ into level sets. Thus we seek to maximize the quantity

$\bar V(A) \equiv \sum_{\alpha \in A} \bar V(\alpha)$ (145)

where for $\alpha \subseteq X$

$\bar V(\alpha) \equiv \log(|\alpha|!) + |\alpha| \sum_{d=1}^D n_d \, \hat p^{(d)}_\alpha \log(\hat p^{(d)}_\alpha)$, (146)

with $\hat p^{(d)}_\alpha$ the average of $\hat p^{(d)}$ over $\alpha$ (36). The optimal partition is

$\bar A^* \in \mathrm{argmax}_{A : \text{ partition of } X} \bar V(A)$. (147)

Finally $\bar{\mathbf{p}}^*$ is obtained by averaging the empirical distributions $\hat{\mathbf{p}}$ within each partition element $\alpha \in \bar A^*$. As in the 1-dimensional case, the approximate PML partition $\bar A^*$ has the iso-clumping property (generalizing (41) to $D$ dimensions):

$(\hat p^{(d)}_x)_{d=1}^D = (\hat p^{(d)}_{x'})_{d=1}^D \implies \alpha(x) = \alpha(x')$ (148)

for all $x, x' \in X$, where $\alpha(x), \alpha(x') \in \bar A^*$ are the partition elements containing $x, x'$, respectively. This can be shown by adapting the proof for the 1-D case in Appendix G.1. We do not know whether $\bar A^*$ satisfies a convexity property analogous to the 1-D case (42).

H.4. A greedy heuristic to approximately maximize the permanent for $D \ge 2$ distributions

Unlike the $D = 1$ case, we do not solve the optimization problem (144) exactly for $D \ge 2$.
The difficulty is that there is no natural ordering on $\mathbb{N}^D$ for $D \ge 2$, so we can no longer define a structured collection of subsets of $X$ from which to build the approximate PML partition $\bar A^*$ (147), and we do not offer a dynamic programming algorithm like the one described in Section 4.3. Our search for an optimal partition of $X$ is similar in flavor to a clustering problem, so the added difficulty in $D \ge 2$ dimensions makes some sense. For example, $k$-means clustering in 1 dimension is exactly solvable by a dynamic programming algorithm (Wang and Song, 2011) similar to the one we present, but no such algorithm is available for dimension greater than 1 (indeed, $k$-means is NP-hard for dimension greater than 1).

We settle for a greedy heuristic to approximately maximize $\bar V(A)$, and call the resulting approximate optimizer $\bar{\bar A}^*$:

$\bar{\bar A}^* \approx \mathrm{argmax}_{A : \text{ partition of } X} \bar V(A)$ (149)

the double bar signaling the two levels of approximation (an approximation to the approximate PML partition $\bar A^*$ (147)). The doubly-approximate PML distribution $\bar{\bar{\mathbf{p}}}^*$ is obtained by averaging the empirical distributions $\hat{\mathbf{p}}$ within each partition element $\alpha \in \bar{\bar A}^*$. The greedy heuristic used to compute $\bar{\bar A}^*$ is given in Algorithm 1. In words, we start with the initial partition $A(\hat{\mathbf{p}})$ of $X$ into the level sets of the empirical distributions $\hat{\mathbf{p}}$ (we clump equally empirically frequent symbols into the same level sets since we know that $\bar A^*$ satisfies the iso-clumping property; see Section H.3), and then iteratively merge partition elements until it is no longer profitable to do so in terms of increasing the objective function $\bar V$. At each step we merge the two partition elements that most boost the value of the objective function.
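This merging loop can be sketched directly from (146); the following is our own minimal implementation, without the efficiency tricks discussed below:

```python
from math import log, factorial
from itertools import combinations

def V_bar(alpha, counts, n):
    """(146): log(|alpha|!) + |alpha| * sum_d n_d * pbar_d * log(pbar_d),
    where pbar_d is the average empirical probability over alpha."""
    out = log(factorial(len(alpha)))
    for d, nd in enumerate(n):
        pbar = sum(counts[x][d] for x in alpha) / (nd * len(alpha))
        if pbar > 0:
            out += len(alpha) * nd * pbar * log(pbar)
    return out

def greedy_partition(counts, n):
    """Greedy merging: start from the level sets of the empirical
    distributions, then repeatedly merge the pair of clumps with the
    largest positive gain in V_bar."""
    levels = {}
    for x, c in enumerate(counts):            # initial iso-clumping
        levels.setdefault(tuple(c), set()).add(x)
    A = [frozenset(s) for s in levels.values()]
    while True:
        pairs = [(V_bar(a1 | a2, counts, n)
                  - V_bar(a1, counts, n) - V_bar(a2, counts, n), a1, a2)
                 for a1, a2 in combinations(A, 2)]
        if not pairs:
            return A
        gain, a1, a2 = max(pairs, key=lambda t: t[0])
        if gain <= 0:
            return A
        A = [a for a in A if a not in (a1, a2)] + [a1 | a2]

# D = 1 toy example: symbol counts (8, 3, 3, 2) out of n = 16 samples.
# The three rare symbols get clumped; the frequent one stays alone.
A = greedy_partition([(8,), (3,), (3,), (2,)], (16,))
assert sorted(map(sorted, A)) == [[0], [1, 2, 3]]
```

This sketch rechecks all pairs on every pass; the paper's implementation reuses previously computed values via (150) and restricts attention to promising pairs.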
Algorithm 1 Approximation to the level set partition of the approximate $D$-dimensional PML distribution
1: function $\bar{\bar A}^*(\hat{\mathbf{p}}, \mathbf{n}, |X|)$ ▷ Input: $D$ empirical distributions, $D$ sample sizes, and assumed support set size $|X|$
2: $A \leftarrow A(\hat{\mathbf{p}})$ ▷ Initial partition into level sets of the empirical distribution $\hat{\mathbf{p}}$
3: while $\max_{\alpha_1, \alpha_2 \in A : \alpha_1 \ne \alpha_2} \bar V(\alpha_1 \cup \alpha_2) - (\bar V(\alpha_1) + \bar V(\alpha_2)) > 0$ do
4:   $(\alpha^*_1, \alpha^*_2) \leftarrow \mathrm{argmax}_{\alpha_1, \alpha_2 \in A : \alpha_1 \ne \alpha_2} \bar V(\alpha_1 \cup \alpha_2) - (\bar V(\alpha_1) + \bar V(\alpha_2))$
5:   $A \leftarrow (A \cup \{\alpha^*_1 \cup \alpha^*_2\}) \setminus \{\alpha^*_1, \alpha^*_2\}$ ▷ Merge two level sets into one
6: end while
7: return $A$ ▷ Output: $\bar{\bar A}^*(\hat{\mathbf{p}})$
8: end function

The inputs $\hat{\mathbf{p}}$ and $\mathbf{n}$ in Algorithm 1 are used when evaluating $\bar V(\alpha)$ (146) for $\alpha \subseteq X$. The input support set size $|X|$ is used in line 2 to initially assign the $|X| - |\hat X|$ unobserved symbols to a partition element. The loop in Algorithm 1 always terminates, since we only merge distinct partition elements of the finite support set $X$. We can implement Algorithm 1 efficiently by observing that for disjoint $\alpha_1, \alpha_2 \subseteq X$:

$\hat{\mathbf{p}}_{\alpha_1 \cup \alpha_2} = \frac{|\alpha_1| \hat{\mathbf{p}}_{\alpha_1} + |\alpha_2| \hat{\mathbf{p}}_{\alpha_2}}{|\alpha_1| + |\alpha_2|}$. (150)

Relation (150) lets us evaluate $\bar V(\alpha_1 \cup \alpha_2)$ without summing over all points in $\alpha_1 \cup \alpha_2$, so we can reuse the work done when we previously computed $\bar V(\alpha_1)$ and $\bar V(\alpha_2)$. We obtain the doubly-approximate set of $D$ distributions $\bar{\bar{\mathbf{p}}}^*$ by setting, for all $x \in X$ and $d \in \{1, \ldots, D\}$,

$\bar{\bar p}^{(d)*}_x = \hat p^{(d)}_{\alpha(x)}$, (151)

where $\alpha(x) \in \bar{\bar A}^*$ is the partition element containing $x$, and $\hat p^{(d)}_\alpha$ (36) is the average of $\hat p^{(d)}$ over $\alpha \subseteq X$.

We know that the partition $\bar{\bar A}^*$ computed by Algorithm 1 can be suboptimal (that is, it need not maximize $\bar V$ (147)) because it only tries to merge pairs of existing partition elements rather than triplets or larger collections; it is possible to construct an example where the optimal partition consists of a single set (all of $X$) but no pair of partition elements of $A(\hat{\mathbf{p}})$ is worth merging in the sense of satisfying the loop condition in line 3.

The running time of the heuristic in Algorithm 1 is $O(|\mathrm{Supp}(F)|^2) = O(n)$, where $n = \sum_{d=1}^D n_d$: in line 2, we have $|A(\hat{\mathbf{p}})| = |\mathrm{Supp}(F)| = O(\sqrt{n})$.
To evaluate the max over $\alpha_1, \alpha_2 \in A$ for the first time in line 3, we consider $O(|A(\hat{\mathbf{p}})|^2) = O(n)$ pairs of partition elements. On every subsequent iteration of the loop, we add one new partition element ($\alpha^*_1 \cup \alpha^*_2$) to $A$ and remove two partition elements ($\alpha^*_1$ and $\alpha^*_2$) from $A$, so there are $O(|A|) = O(\sqrt{n})$ new pairs of distinct partition elements to check in computing the max in line 3. The size of $A$ decrements by one each time the loop runs, so the loop runs at most $O(|A(\hat{\mathbf{p}})|) = O(\sqrt{n})$ times, and the overall complexity of Algorithm 1 is $O(n) + O(\sqrt{n}) \cdot O(\sqrt{n}) = O(n)$.

If the support set size $|X|$ is unknown, then we vary the assumed support set size $|X|$ to optimize $\bar V(\bar{\bar A}^*(\hat{\mathbf{p}}, \mathbf{n}, |X|))$ using a bisection search on the range $[|\hat X|, \max_{d \in [1,D]} n_d^2]$, increasing the overall running time of the algorithm (optimizing over both the level set partition $A$ and the support set size) to $O(n \log(n))$. Numerical experiments show that $\bar V(\bar{\bar A}^*(\hat{\mathbf{p}}, \mathbf{n}, |X|))$ is not unimodal in $|X|$, so this heuristic might choose a suboptimal support set size.

The $O(n)$ worst-case running time of Algorithm 1 turns out to be slow for examples of practical interest (in making the performance plots of Section 6), so we introduce another heuristic to speed things up, at the cost of adding some tunable knobs to the algorithm. The heuristic is to avoid checking all pairs of partition elements $\alpha_1, \alpha_2 \in A$, instead checking only the promising pairs. A pair $\alpha_1, \alpha_2 \in A$ is called promising if the point $(\hat p^{(d)}_{\alpha_1})_{d=1}^D \in \mathbb{R}^D$ is one of the $k$ nearest neighbors of the point $(\hat p^{(d)}_{\alpha_2})_{d=1}^D \in \mathbb{R}^D$ in Euclidean distance.¹⁰ This heuristic is motivated by the observation that the level sets that get merged in the loop of Algorithm 1 tend to correspond to nearby points in Euclidean distance. We then modify Algorithm 1 to first compute the set of $k$ nearest neighbors (doable in time $O(|\mathrm{Supp}(F)| \log(|\mathrm{Supp}(F)|)) = O(\sqrt{n} \log(n))$) and then to update the set of nearest neighbors after each loop execution.
This updating can be done heuristically by taking the $k$ nearest neighbors of $\hat{\mathbf{p}}_{\alpha_1 \cup \alpha_2}$ to be the $k$ nearest points chosen from among only the union of the $k$ nearest neighbors of $(\hat p^{(d)}_{\alpha_1})_{d=1}^D$ and the $k$ nearest neighbors of $(\hat p^{(d)}_{\alpha_2})_{d=1}^D$ (rather than the $k$ nearest points chosen from among $(\hat p^{(d)}_\alpha)_{d=1}^D$ for all $\alpha \in A$). The loop is executed $O(\sqrt{n})$ times, so the overall running time of the algorithm is $O(\sqrt{n} \log(n) + k \sqrt{n})$. The time to compute the empirical histograms $\hat{\mathbf{p}}$ is $O(n)$, asymptotically dominating the runtime of our algorithm for large $n$, but it tends to take less time in our use cases than running Algorithm 1 modified with our $k$-nearest-neighbor heuristic. For the performance plots of Section 6 we used $k = 5$.

Figures 2 and 14 show the doubly-approximate PML level set partition $\bar{\bar A}^*$ (149) for $D = 2$ and $D = 3$, respectively.

10. The nearest neighbor relation is not symmetric, so $\alpha_1, \alpha_2$ being a promising pair does not imply that $\alpha_2, \alpha_1$ is a promising pair.

Figure 14: Computing the doubly-approximate PML distribution level set partition $\bar{\bar A}^*$ (149) by clumping entries of a 3-D fingerprint $F_{i,j,k}$ using Algorithm 1 modified with the 5-nearest-neighbors heuristic as described in Section H.4, shown in three projections. The underlying distributions are $p^{(1)} = p^{(2)}$ uniform on $\{1, \ldots, 100\}$ and $p^{(3)}$ a mixture of two uniforms, with half its mass uniform on $\{1, \ldots, 20\}$ and the other half uniform on $\{21, \ldots, 100\}$. The sample size is 500 for all three empirical distributions $(\hat p^{(d)})_{d=1}^3$. Colored convex hulls correspond to symbols assigned to the same clump (partition element). The marker size of point $(i, j, k)$ is roughly proportional to $F_{i,j,k}$. We can see two large clumps, corresponding to symbols $\{1, \ldots, 20\}$ and $\{21, \ldots, 100\}$.

Appendix I. Mutual and Lautum information

For the case of estimating the mutual information $I(X; Y)$, defined as

$I(X; Y) = \sum_{x \in X, y \in Y} p_{x,y} \log \frac{p_{x,y}}{p_x p_y}$,
or the lautum information (Palomar and Verdú, 2008), defined as

$L(X; Y) = \sum_{x \in X, y \in Y} p_x p_y \log \frac{p_x p_y}{p_{x,y}}$,

we have $n$ samples from a joint distribution $p$ on $X \times Y$ with joint histogram $\hat p$. Here $p_x = \sum_y p_{x,y}$ and $p_y = \sum_x p_{x,y}$ are the marginal distributions of $X$ and $Y$, respectively.

I.1. Product of permutation groups: one candidate for PML

It is clear that the mutual information and the lautum information are invariant under the product of two permutation groups $S_X \times S_Y$, acting on $p$ as

$((\sigma_X, \sigma_Y) p)_{x,y} \equiv p_{\sigma_X(x), \sigma_Y(y)}$ (154)

for all $(\sigma_X, \sigma_Y) \in S_X \times S_Y$. Let $(S_X \times S_Y) p$ denote the orbit of $p$ under the group action. Following the general profile maximum likelihood methodology for the product of these two permutation groups, the PML distribution $p^*$ maximizes the following quantity over a set of distributions $\mathcal{P}$ on $X \times Y$:

$\mathbb{P}_p((S_X \times S_Y) \hat p) = \sum_{\hat p' \in (S_X \times S_Y) \hat p} \mathbb{P}_p(\hat p')$ (155)

$\overset{(a)}{=} \frac{|(S_X \times S_Y) \hat p|}{|S_X \times S_Y|} \sum_{(\sigma_X, \sigma_Y) \in S_X \times S_Y} \mathbb{P}_p((\sigma_X, \sigma_Y) \hat p)$ (156)

$\propto \sum_{(\sigma_X, \sigma_Y) \in S_X \times S_Y} \prod_{(x, y) \in X \times Y} p_{x,y}^{n \hat p_{\sigma_X(x), \sigma_Y(y)}}$ (157)

where in (a) the prefactor corrects for multiple-counting points of the orbit $(S_X \times S_Y) \hat p$ when we sum over $(\sigma_X, \sigma_Y) \in S_X \times S_Y$, and where $(F_i)_{i \ge 0} = (|\{(x, y) \in X \times Y : n \hat p_{x,y} = i\}|)_{i \ge 0}$ is the fingerprint of the distribution when one treats $X \times Y$ as a single alphabet. Note that in this setting $(F_i)_{i \ge 0}$ is not partially sufficient for estimating the mutual (lautum) information: the mutual (lautum) information cannot be written as a functional of the sorted probabilities $p_{x,y}$. In this setting, it follows from Theorem 6 that the partially sufficient statistic can be defined via the following equivalence class representation: for any empirical distributions $\hat p^{(1)} = (\hat p^{(1)}_{x,y})$ and $\hat p^{(2)} = (\hat p^{(2)}_{x,y})$, we say $\hat p^{(1)}$ is equivalent to $\hat p^{(2)}$ if there exists an element $g \in S_X \times S_Y$ such that $g \hat p^{(1)} = \hat p^{(2)}$. The partially sufficient statistic is the corresponding equivalence class.
In this PML formulation, we analyze the cardinality of the corresponding partially sufficient statistic below. This number appeared as OEIS sequence A007716, and its precise asymptotics is obtained in the following theorem.

Theorem 12 For any two empirical distributions $\hat p^{(1)} = (\hat p^{(1)}_{x,y})$ and $\hat p^{(2)} = (\hat p^{(2)}_{x,y})$ with identical sample size n, we say $\hat p^{(1)}$ is equivalent to $\hat p^{(2)}$ if there exists an element $(\sigma_{\mathcal{X}}, \sigma_{\mathcal{Y}}) \in S_{\mathcal{X}} \times S_{\mathcal{Y}}$ such that $(\sigma_{\mathcal{X}}, \sigma_{\mathcal{Y}})\hat p^{(1)} = \hat p^{(2)}$, where $S_{\mathcal{X}}, S_{\mathcal{Y}}$ are the permutation groups on $\mathcal{X}, \mathcal{Y}$, respectively. Then the number of equivalence classes is given by $e^{(1+o(1))\,n \log n}$.

Theorem 12 shows that the cardinality of the partially sufficient statistic is no longer sub-exponential. This negative result immediately renders Theorem 3 inapplicable, which leaves us with no guarantee on the performance of this type of PML for estimating mutual information or lautum information.

I.2. Three permutation groups: the other candidate for the PML

The mutual information I(X; Y) can be rewritten as

$$I(X;Y) = H(P_X) + H(P_Y) - H(P_{XY}) \qquad (159)$$

$$= \sum_{x \in \mathcal{X}} p_x \log \frac{1}{p_x} + \sum_{y \in \mathcal{Y}} p_y \log \frac{1}{p_y} - \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p_{x,y} \log \frac{1}{p_{x,y}}, \qquad (160)$$

where $H(P_X) = \sum_x p_x \log \frac{1}{p_x}$ denotes the Shannon entropy of the discrete distribution $P_X$. Following the discussion in Section 3.1, we introduce

$$(F^{XY}_i)_{i \ge 0} = \left( \left| \{(x,y) \in \mathcal{X} \times \mathcal{Y} : n\hat p_{x,y} = i\} \right| \right)_{i \ge 0} \qquad (161)$$

$$(F^{X}_i)_{i \ge 0} = \left( \left| \{x \in \mathcal{X} : n\hat p_x = i\} \right| \right)_{i \ge 0} \qquad (162)$$

$$(F^{Y}_i)_{i \ge 0} = \left( \left| \{y \in \mathcal{Y} : n\hat p_y = i\} \right| \right)_{i \ge 0}, \qquad (163)$$

and it is clear that $(F^{XY}_i)_{i \ge 0}$ is partially sufficient for $H(P_{XY})$, $(F^X_i)_{i \ge 0}$ is partially sufficient for $H(P_X)$, and $(F^Y_i)_{i \ge 0}$ is partially sufficient for $H(P_Y)$. Since $I(X;Y) = H(P_X) + H(P_Y) - H(P_{XY})$, this motivates the following PML definition:

$$p^\star \in \operatorname*{argmax}_{p \in \mathcal{P}} \mathbb{P}_p\left( F^{XY}, F^X, F^Y \right). \qquad (164)$$

In other words, the PML distribution $(p^\star_{x,y})$ on $\mathcal{X} \times \mathcal{Y}$ maximizes the probability of observing the three joint fingerprints $(F^{XY}, F^X, F^Y)$. This PML formulation, which only applies to mutual information but not lautum information, does not suffer from the super-exponential cardinality problem.
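The three fingerprints (161)-(163) are cheap to compute directly from the joint histogram; a minimal sketch (the function names are ours):

```python
import numpy as np
from collections import Counter

def fingerprint(counts):
    """F_i = number of symbols appearing exactly i >= 1 times, as in (161)-(163)."""
    return Counter(int(c) for c in counts if c > 0)

def fingerprint_triplet(N):
    """Joint, X-marginal, and Y-marginal fingerprints of an integer-valued
    joint histogram N[x, y] (entries n * p_hat_{x,y})."""
    N = np.asarray(N)
    return (fingerprint(N.ravel()),      # F^{XY}: treat X x Y as one alphabet
            fingerprint(N.sum(axis=1)),  # F^X
            fingerprint(N.sum(axis=0)))  # F^Y

# A made-up sample of size n = 10 over a 2 x 3 alphabet.
N = np.array([[1, 2, 0],
              [3, 0, 4]])
FXY, FX, FY = fingerprint_triplet(N)
print(dict(FXY))  # {1: 1, 2: 1, 3: 1, 4: 1}
print(dict(FX))   # {3: 1, 7: 1}
print(dict(FY))   # {4: 2, 2: 1}
```

Representing each fingerprint sparsely as a counter (count value -> multiplicity) keeps the cost linear in the histogram size.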
Applying Theorem 10, it follows that the cardinality of the statistic is at most $e^{3\pi\sqrt{2n/3}}$, which is still sub-exponential. However, it seems to be a non-trivial task to even approximately solve the PML formulation (164). Moreover, this PML formulation does not apply to lautum information.

Note that there may not exist a group G (a subgroup of the symmetric group $S_{\mathcal{X} \times \mathcal{Y}}$) acting on the joint empirical distributions $\hat p$ such that the set of distinct triplets $(F^{XY}, F^X, F^Y)$ is in bijection with the set of distinct orbits under the action of G. For example, let $\hat p$, $\hat p'$, and $\hat p''$ be three empirical distributions on $\mathcal{X} = \{x_1, x_2\}$, $\mathcal{Y} = \{y_1, y_2, y_3\}$ with sample size n = 1110, where rows (columns) of the joint histograms correspond to elements of $\mathcal{X}$ ($\mathcal{Y}$). The distributions $\hat p$, $\hat p'$, and $\hat p''$ have the same X- and Y-marginal distributions ((105, 1005)/1110 and (3, 7, 1100)/1110, respectively) and the same set of entries, so their fingerprint triplets $(F^{XY}, F^X, F^Y)$ are the same. Suppose there exists a group $G \le S_{\mathcal{X} \times \mathcal{Y}}$ such that the orbit $G\hat p$ contains exactly those distributions with the same fingerprint triplet $(F^{XY}, F^X, F^Y)$. Then $\hat p', \hat p'' \in G\hat p$. Let $\sigma \in G$ be such that $\hat p' = \sigma \hat p$. Then $\sigma$ switches $x_1$ and $x_2$ if and only if $y = y_3$:

$$(\sigma \hat p)_{x,y} = \begin{cases} \hat p_{x_2, y} & : x = x_1,\ y = y_3 \\ \hat p_{x_1, y} & : x = x_2,\ y = y_3 \\ \hat p_{x, y} & : \text{otherwise.} \end{cases}$$

On the other hand, $\sigma \hat p''$ has X-marginal (1004, 106)/1110, and thus a different $F^X$, so the action of $\sigma$ does not preserve the triplet $(F^{XY}, F^X, F^Y)$; hence the group G does not exist.

Appendix J. Estimating the support set size of the approximate PML distribution

We first state our results in the notation introduced in Sections 4.5 and 4.6.

Theorem 13 Let $p^\star$ be the approximate PML distribution. Then $\mathcal{X}_{0:i^\star} \in \mathcal{A}(p^\star)$ is a level set of $p^\star$, where

$$i^\star \in \operatorname*{argmax}_{i \in [1, F_+]} \sup_{F_0 \in \mathbb{N}} \left\{ -\log(F_0!) + V(\mathcal{X}_{0:i}) + V_{i+1} \right\}. \qquad (169)$$

For each value of $i \in [1, F_+]$ the supremum over $F_0$ in (169) is finite, and we compute it below. Let $K_i \equiv |\mathcal{X}_{1:i}|$ and $N_i \equiv \sum_{x \in \mathcal{X}_{1:i}} n\hat p_x$. If $i = 1$ and $F_1 \ge 1$, then $N_1 = K_1 = F_1 \ge 1$ is the number of symbols seen exactly once in the sample, and the supremum in (169) is equal to $N_1 \log \frac{N_1}{n}$.
If $F_1 > 1$, then the supremum is approached as $F_0 \to \infty$ and is not achieved by any $F_0 \in \mathbb{N}$. If $F_1 = 1$, then the supremum is achieved by all $F_0 \in \mathbb{N}$, so we arbitrarily choose $F_0^{(1)} = 0$ as an achiever. If $i > 1$ or $F_1 = 0$, then $N_i > K_i$ and the supremum in (169) is achieved by $F_0^{(i)} \in \mathbb{N}$, where

$$F_0^{(i)} \in \operatorname*{argmax}_{F_0 \in \mathbb{N}} \left[ \log \frac{(K_i + F_0)!}{F_0!} - N_i \log(K_i + F_0) \right] \begin{cases} = 0 & : N_i \ge K_i H_{K_i}, \\ \in \left[ 0,\ \frac{K_i^2 - N_i}{N_i - K_i} + 1 \right] & : K_i H_{K_i} > N_i > K_i, \end{cases} \qquad (170)$$

where $H_k \equiv \sum_{j=1}^{k} \frac{1}{j}$ is the k-th harmonic number.

If the supremum in (169) is achieved by finite $F_0 = F_0^{(i^\star)}$ (this is the case unless $i^\star = 1$ and $F_1 > 1$), then the approximate PML distribution $p^\star$ has finite support set size $K^\star = F_0^{(i^\star)} + \hat K$, where $\hat K$ is the support set size of the empirical distribution $\hat p$. Then we compute $p^\star$ as in Section 4.5 with known support set size $K^\star$. If the supremum is not achieved by finite $F_0$ (this is the case if $i^\star = 1$ and $F_1 > 1$), then we say $p^\star$ has a continuous part of mass $N_1/n$ and a discrete part $p^\star_d$ of mass $1 - N_1/n$ supported on $\mathcal{X}_{2:F_+} = \mathcal{X} \setminus \mathcal{X}_{0:1}$. Then we compute $p^\star_d$ as in Section 4.5 with known support set size $|\mathcal{X}_{2:F_+}|$.

Since the function in the argmax in (170) is unimodal in $F_0$ and $0 \le F_0^{(i)} \le \frac{K_i^2 - N_i}{N_i - K_i} + 1 \le N_i^2 + 2$ (where the last inequality follows from $N_i \ge K_i + 1$), we can compute the optimizing integer $F_0^{(i)}$ in $O(\log(N_i))$ iterations of a bisection search. We must do this bisection search $F_+ = |\operatorname{Supp}(F)|$ times. As remarked earlier, since $|\operatorname{Supp}(F)| \le \sqrt{2n} + 1$ and $N_i \le n$ for all i, the total run time of our support set size optimization scheme is $O(\sqrt{n} \log n)$.

Thus to estimate the support set size of the approximate PML distribution we solve $F_+$ optimization problems, each corresponding to estimation of the support set size of a uniform distribution (since all symbols in $\mathcal{X}_{0:i}$ are assigned to a single level set) given a sample of size $N_i$ with $K_i$ distinct symbols.

Proof Optimizing over the support set size K, the approximate PML distribution $p^\star$ satisfies

$$\log(\mathbb{P}_{p^\star}(F)) = \sup_K \left\{ -\log((K - \hat K)!) + \max_{p \in \mathcal{P}_K} V(p) \right\} \qquad (171)$$

$$= \sup_{F_0} \left\{ -\log(F_0!) + \max_{p \in \mathcal{P}_{\hat K + F_0}} V(p) \right\} \qquad (172)$$

$$\stackrel{(a)}{=} \sup_{F_0} \left\{ -\log(F_0!) + V_0 \right\} \qquad (173)$$

$$\stackrel{(b)}{=} \sup_{F_0} \left\{ -\log(F_0!) + \max_{i \in [1, F_+]} \left[ V(\mathcal{X}_{0:i}) + V_{i+1} \right] \right\} \qquad (174)$$

$$\stackrel{(c)}{=} \max_{i \in [1, F_+]} \left\{ V_{i+1} + \sup_{F_0} \left[ V(\mathcal{X}_{0:i}) - \log(F_0!) \right] \right\} \qquad (175)$$

$$\stackrel{(d)}{=} \max_{i \in [1, F_+]} \left\{ V_{i+1} + N_i \log \frac{N_i}{n} + \sup_{F_0} \underbrace{\left[ \log \frac{(K_i + F_0)!}{F_0!} - N_i \log(K_i + F_0) \right]}_{f(F_0, K_i, N_i)} \right\}, \qquad (176)$$

where (a) follows from (48), (b) follows from (46), (c) follows since $V_i$ is independent of $F_0$ (and hence of K) for $i \ge 1$, and (d) follows from (39) and (50). We use $\sup_{F_0}$ (rather than $\max_{F_0}$) since a maximizing value of $F_0$ might not exist. We later show that the supremum in (176) is finite, so we may interchange the order of the max and the supremum. The function $f(F_0, K, N)$ is defined for non-negative arguments as in (176). Note that we do not allow $i = 0$ in the optimization (176), in order to exclude $\mathcal{X}_{0:0}$ (the set of unseen symbols) as a level set of $p^\star$; otherwise, due to the averaging-over-clumps property (37), we would have $p^\star$ vanish on $\mathcal{X}_{0:0}$, violating the assumption that the support set size of $p^\star$ is K. It is possible to have $K = \hat K = |\operatorname{Supp}(\hat p)|$, corresponding to the case $F_0 = 0$, so that $\mathcal{X}_{0:i} = \mathcal{X}_{1:i}$ for all i.

Let us compute the supremum in (176). We can check that $f(F_0, N, N) < 0$ for all $N > 1$ and $F_0 \ge 0$, and

$$\lim_{F_0 \to \infty} f(F_0, N, N) = 0, \qquad (177)$$

so if $i = 1$ and $F_1 > 1$, then $N_1 = K_1 = F_1 > 1$ and the supremum in (176) is equal to 0, is approached as $F_0 \to \infty$, and is not achieved by any $F_0 \in \mathbb{N}$. We can check that $f(F_0, 1, 1) = 0$ for all $F_0$, so if $i = 1$ and $F_1 = 1$, then $N_1 = K_1 = F_1 = 1$ and the supremum in (176) is equal to 0 and is achieved by all $F_0 \in \mathbb{N}$, so we arbitrarily choose $F_0^{(1)} = 0$ as an achiever.

If $i > 1$ or $F_1 = 0$, then $N_i > K_i$. When $N > K$ the function $f(F_0, K, N)$ is continuous in $F_0$ (replacing $x!$ with $\Gamma(x+1)$, where $\Gamma$ is the gamma function) and unimodal with a global maximum. Solving

$$\frac{\partial}{\partial F_0} f(F_0, K, N^\star) \Big|_{F_0 = 0} = 0 \qquad (178)$$

for $N^\star$ we find $N^\star = K H_K$, where $H_K$ is the K-th harmonic number. Thus if $N_i \ge K_i H_{K_i}$, then the supremum in (176) is achieved by $F_0 = 0$.
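The properties of $f$ used above ($f(\cdot, 1, 1) \equiv 0$; $f(\cdot, N, N) < 0$ with limit 0; the threshold $N^\star = K H_K$) can be checked numerically. The sketch below takes $f(F_0, K, N) = \log\frac{(K+F_0)!}{F_0!} - N\log(K+F_0)$ as in (170), extends $x!$ to $\Gamma(x+1)$ via `lgamma`, and, for simplicity in this toy setting, scans the bracket $\left[0, \frac{K^2-N}{N-K}+1\right]$ directly instead of by bisection:

```python
from math import lgamma, log

def f(F0, K, N):
    """f(F0, K, N) = log((K + F0)! / F0!) - N log(K + F0),
    with x! extended to Gamma(x + 1) for real arguments."""
    return lgamma(K + F0 + 1) - lgamma(F0 + 1) - N * log(K + F0)

def harmonic(K):
    """K-th harmonic number H_K."""
    return sum(1.0 / j for j in range(1, K + 1))

def best_F0(K, N):
    """Maximize f over integer F0 >= 0 when N > K.  For N >= K H_K the
    maximum is at 0; otherwise it lies in [0, (K^2 - N)/(N - K) + 1] and,
    since f is unimodal, could be found by bisection; here we scan the
    (small) bracket directly."""
    assert N > K
    if N >= K * harmonic(K):
        return 0
    hi = int((K**2 - N) / (N - K)) + 1
    return max(range(hi + 1), key=lambda F0: f(F0, K, N))

# Properties used in the proof:
assert all(abs(f(F0, 1, 1)) < 1e-12 for F0 in range(50))   # f(., 1, 1) = 0
assert all(f(F0, 5, 5) < 0 for F0 in range(50))            # f(., N, N) < 0
assert f(10**6, 5, 5) > -1e-4                              # -> 0 as F0 -> inf
print(best_F0(3, 6), best_F0(3, 4))                        # prints: 0 2
```

For large $N_i$ one would replace the final scan with the $O(\log N_i)$ bisection search described above.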
If $N_i < K_i H_{K_i}$, then we can upper bound the supremizing value of $F_0$ by finding the unique inflection point of $f$ in $F_0$ beyond the global maximum. Setting the discrete second derivative to 0 (we take one derivative discretely because the difference $f(F_0 + 1, K, N) - f(F_0, K, N)$ is easier to work with for integer $F_0$):

$$\frac{\partial}{\partial F_0} \left( f(F_0 + 1, K, N) - f(F_0, K, N) \right) \Big|_{F_0 = F_0'} = 0, \qquad (179)$$

we find $F_0' = \frac{K^2 - N}{N - K}$. If $K < N < K H_K$, then $K > 1$ (since $1 = H_1$) and $N < K^2$, so $F_0' > 0$. Thus we can upper bound the inflection point of $f$ by $\frac{K^2 - N}{N - K} + 1$, so if $K_i < N_i < K_i H_{K_i}$, then $F_0^{(i)} \in \left[ 0, \frac{K_i^2 - N_i}{N_i - K_i} + 1 \right]$.

References

Jayadev Acharya, Alon Orlitsky, and Shengjun Pan. The maximum likelihood probability of unique-singleton, ternary, and length-7 patterns. In Information Theory (ISIT), 2009 IEEE International Symposium on, pages 1135-1139. IEEE, 2009.
Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In Proceedings of the 24th Annual Conference on Learning Theory, pages 47-68, 2011.
Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11-21, 2017.
Dragi Anevski, Richard D. Gill, and Stefan Zohren. Estimating a probability mass function with unknown labels. The Annals of Statistics, 45(6):2708-2735, 2017.
F. C. Auluck. On partitions of bipartite numbers. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 49, pages 72-83. Cambridge University Press, 1953.
I. Bárány and A. M. Vershik. On the number of convex lattice polytopes. Geometric and Functional Analysis, 2(4):381-393, 1992.
D. Basu. On partial sufficiency: A review. In Selected Works of Debabrata Basu, pages 291-303. Springer, 2011.
Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White.
Testing that distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 259-269. IEEE, 2000.
Allan Birnbaum. On the foundations of statistical inference. Journal of the American Statistical Association, 57(298):269-306, 1962.
David A. Blackwell and Meyer A. Girshick. Theory of games and statistical decisions. Courier Corporation, 1979.
Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V. Veeravalli. Estimation of KL divergence: Optimal minimax rate. arXiv preprint arXiv:1607.02653, 2016.
Chun Lam Chan, Winston Fernandes, Navin Kashyap, and Manjunath Krishnapur. Phase transitions for the uniform distribution in the pattern maximum likelihood problem and its Bethe approximation. SIAM Journal on Discrete Mathematics, 31:597-631, 2015.
Bradley Efron. Maximum likelihood and decision theory. The Annals of Statistics, pages 340-356, 1982.
Winston Fernandes and Navin Kashyap. A phase transition for the uniform distribution in the pattern maximum likelihood problem. In ITW, 2013.
Jaroslav Hájek. On basic concepts of statistics. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 139-162, 1967.
Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of divergences between discrete distributions. arXiv preprint arXiv:1605.09124, 2016a.
Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of KL divergence between discrete distributions. In Information Theory and Its Applications (ISITA), 2016 International Symposium on, pages 256-260. IEEE, 2016b.
Godfrey H. Hardy and Srinivasa Ramanujan. Asymptotic formulæ in combinatory analysis. Proceedings of the London Mathematical Society, 2(1):75-115, 1918.
David A. Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358):320-338, 1977.
Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. Information Theory, IEEE Transactions on, 61(5):2835-2885, 2015.
Jiantao Jiao, Kartik Venkat, and Tsachy Weissman. Relations between information and estimation in discrete-time Lévy channels. IEEE Transactions on Information Theory, 63(6):3579-3594, 2017.
Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the L1 distance. IEEE Transactions on Information Theory, 64(10):6672-6706, 2018.
A. Kolmogoroff. Sur l'estimation statistique des paramètres de la loi de Gauss. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 6(1):3-32, 1942.
Lucien Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419-1455, 1964.
Lucien Le Cam. Maximum likelihood: an introduction. Statistics Branch, Department of Mathematics, University of Maryland, 1979.
Erich Leo Lehmann and George Casella. Theory of point estimation, volume 31. Springer, 1998.
Susan A. Murphy and Aad W. Van der Vaart. On profile likelihood. Journal of the American Statistical Association, 95(450):449-465, 2000.
OEIS Foundation Inc. The On-Line Encyclopedia of Integer Sequences (OEIS), A007716. https://oeis.org/A007716. Accessed: 2017-05-16.
Alon Orlitsky, Sajama, Narayana Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In Information Theory (ISIT), 2004 IEEE International Symposium on, page 304. IEEE, 2004a.
Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 426-435. AUAI Press, 2004b.
Daniel P. Palomar and Sergio Verdú. Lautum information. IEEE Transactions on Information Theory, 54(3):964-975, 2008.
Liam Paninski. Estimation of entropy and mutual information.
Neural Computation, 15(6):1191-1253, 2003.
H. Desmond Patterson and Robin Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545-554, 1971.
Dmitri S. Pavlichin, Jiantao Jiao, and Tsachy Weissman. Approximate profile maximum likelihood (software), 2017. URL https://doi.org/10.5281/zenodo.1043617. URL https://github.com/dmitrip/PML.
Aditi Raghunathan, Greg Valiant, and James Zou. Estimating the unseen from multiple populations. arXiv preprint arXiv:1707.03854, 2017.
Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403-412. IEEE, 2011a.
Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011b.
Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157-2165, 2013.
L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8:189-201, 1979.
Paul Valiant. Testing Symmetric Properties of Distributions. PhD thesis, Massachusetts Institute of Technology, 2008.
S. S. Vallender. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784-786, 1974.
Aad W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
Pascal O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In ISIT, 2012.
Abraham Wald. Statistical decision functions. Wiley, 1950.
Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2):29-33, 2011.
Yihong Wu and Pengkun Yang.
Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv preprint arXiv:1504.01227, 2015.
Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, 2016.