Journal of Machine Learning Research 26 (2025) 1-30. Submitted 7/24; Revised 1/25; Published 7/25.

Distribution Estimation under the Infinity Norm

Aryeh Kontorovich, karyeh@cs.bgu.ac.il, Department of Computer Science, Ben Gurion University of the Negev, Beer Sheva, Israel

Amichai Painsky, amichaip@tauex.tau.ac.il, School of Industrial Engineering, Tel Aviv University, Tel Aviv, Israel

Editor: Mladen Kolar

©2025 Aryeh Kontorovich and Amichai Painsky. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1166.html.

We present novel bounds for estimating discrete probability distributions under the ℓ∞ norm. These are nearly optimal in various precise senses, including a kind of instance-optimality. Our data-dependent convergence guarantees for the maximum likelihood estimator significantly improve upon the currently known results. A variety of techniques are utilized and innovated upon, including Chernoff-type inequalities and empirical Bernstein bounds. We illustrate our results in synthetic and real-world experiments. Finally, we apply our proposed framework to a basic selective inference problem, where we estimate the most frequent probabilities in a sample.

Keywords: Distribution Estimation, Multinomial Distribution, Count Data

1. Introduction

Consider a probability distribution p over N = {1, 2, . . .}. Let Xⁿ be a sample of n independent observations from p. In this work we study the basic problem of estimating p from Xⁿ. We focus our attention on the infinity norm, which is formally defined in (4). Also known as the uniform or supremum norm, this popular metric over distributions has a number of important applications, in addition to being a fundamental object of independent interest (Boucheron et al., 2003; Van Handel, 2014). Among the applications is a selective inference scheme for multinomial proportions, as discussed below.

Our reference point is the following simple and classic bound, whose proof is an easy consequence of McDiarmid's inequality: for all δ ∈ (0, 1),

    sup_{i ∈ N} |p_i − p̂_i(Xⁿ)| ≲ sqrt(log(1/δ) / n)    (1)

holds with probability at least 1 − δ, where p̂_i(Xⁿ) is the maximum likelihood estimator (MLE) defined below and ≲ hides small absolute constants. This rate is known to be tight in the worst case (Lemma 23), but can certainly be improved upon for benign distributions. For example, when p = (1 − θ, θ) is the Bernoulli distribution, Bernstein's inequality (Boucheron et al., 2003, Corollary 2.11) yields

    |θ − θ̂| ≲ sqrt(θ(1 − θ) log(1/δ) / n) + log(1/δ) / n    (2)

where θ̂ is the MLE. Furthermore, (2) has an empirical Bernstein version (Dasgupta and Hsu, 2008a, Lemma 5), in which the unknown quantity θ is replaced in the right-hand side by the empirically computable θ̂. Drawing inspiration from (2) and its empirical version, we might expect something like

    sup_{i ∈ N} |p_i − p̂_i(Xⁿ)| ≲? sqrt(v* log(1/δ) / n) + log(1/δ) / n    (3)

where v* = max_{i ∈ N} p_i(1 − p_i), or, even more ambitiously, some version of (3) with v* replaced by its empirical version v̂*. It will turn out that (3) is too optimistic. Absent an oracle that tells us the index of the largest mass, some additional cost must be incurred for estimating many symbol probabilities simultaneously.

Our contributions. Our main result amounts to nearly achieving the ultimate goal. First, we derive a Chernoff-type upper bound in Theorem 1, which improves upon (1). Theorem 2 introduces its data-dependent counterpart, which demonstrates a significant improvement in small sample regimes.
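As a quick numerical illustration of why (1) is pessimistic for benign distributions (the gap that (2) and (3) aim to close), consider the following Monte Carlo sketch. It is ours, not part of the paper's experiments, assumes NumPy, and the parameter choices (n, θ, δ) are arbitrary; it compares the empirical 1 − δ error quantile of the MLE for a skewed Bernoulli with the worst-case benchmark and a Bernstein-type quantity with constants dropped.

```python
# Minimal Monte Carlo illustration (not from the paper): for a skewed Bernoulli
# p = (1 - theta, theta), the worst-case benchmark (1) is loose, while a
# Bernstein-type quantity of the form (2) tracks the true error scale.
import numpy as np

rng = np.random.default_rng(0)
n, theta, delta, reps = 10_000, 0.01, 0.05, 20_000

# Empirical 1 - delta quantile of |theta_hat - theta| for the MLE theta_hat.
theta_hat = rng.binomial(n, theta, size=reps) / n
err_quantile = np.quantile(np.abs(theta_hat - theta), 1 - delta)

benchmark = np.sqrt(np.log(1 / delta) / (2 * n))            # McDiarmid-type rate (1)
bernstein = (np.sqrt(theta * (1 - theta) * np.log(1 / delta) / n)
             + np.log(1 / delta) / n)                        # Bernstein-type rate (2), constants dropped

print(f"empirical 1-delta error quantile: {err_quantile:.5f}")
print(f"benchmark (1):                    {benchmark:.5f}")
print(f"Bernstein-type (2):               {bernstein:.5f}")
```

For these (illustrative) parameters the benchmark exceeds the actual error quantile by roughly a factor of six, whereas the Bernstein-type quantity is on the right scale.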
Next, we establish in Theorem 4 a version of (3) where v* is replaced by v* log(1/v*), and provide even sharper bounds therein. These are matched by nearly optimal lower bounds, in distinct senses made precise below. Finally, we apply our results to the important problem of selective inference. Specifically, we study the basic problem of inferring the most frequent events in a sample and achieve a significant improvement over currently known schemes.

2. Definitions and Problem Statement

Consider a probability distribution p over N, which induces the random variable X ∼ p. The support size of X, ‖p‖₀, is also the alphabet size, and unless stated otherwise, our results hold even when these are infinite. Let Xⁿ = (X₁, ..., Xₙ) be a sample consisting of n independent copies of X. Let cᵢ(Xⁿ) be the count (number of appearances) of the ith symbol in the sample. Let p̂(Xⁿ) be the maximum likelihood estimator (MLE) of p; namely, p̂ᵢ(Xⁿ) = cᵢ(Xⁿ)/n for every i ∈ N. In this work, we study empirical distribution estimation of p under the infinity norm. That is, given a prescribed δ > 0, we seek a random variable Tδ(Xⁿ) such that

    ‖p − p̂(Xⁿ)‖∞ = sup_{i ∈ N} |p_i − p̂_i(Xⁿ)| ≤ Tδ(Xⁿ)    (4)

with probability of at least 1 − δ. In light of (1), we also require that, for fixed δ, Tδ(Xⁿ) → 0 as n → ∞, in some appropriate sense. Notice that (4) may also be viewed as a multinomial inference scheme. Specifically, (4) implies that p_i ∈ [p̂_i − Tδ(Xⁿ), p̂_i + Tδ(Xⁿ)] simultaneously for all i ∈ N. This confidence region (CR) defines a hypercube around the MLE, which covers p with a confidence level of 1 − δ.

3. Related Work

Discrete probability estimation is a fundamental problem in many fields. It is extensively studied under a variety of merits such as total variation (Jiao et al., 2017; Cohen et al., 2020), KL divergence (Orlitsky and Suresh, 2015), Hellinger distance (Hellinger, 1909), Wasserstein metric (Kantorovich, 1960), Kolmogorov-Smirnov distance (Smirnov, 1948) and others. The interested reader is referred to (Rice, 2006; Painsky and Wornell, 2019; Painsky, 2023b) for a comprehensive discussion.

In this work we focus on the infinity norm. Here, the baseline is the bound implicit in (1), where, in the language of (4), Tδ(Xⁿ) = sqrt(log(1/δ)/(2n)). The infinity norm is difficult to analyze in the general case (4). In fact, it is later shown (Section 4) that (1) is only asymptotically tight and only for the worst-case distribution, and can be significantly improved in a limited sample regime. On the other hand, the binomial case, ‖p‖₀ = 2, is fairly exhaustively understood as far as minimax optimal and fully empirical bounds. In particular, if Y ∼ Bin(n, θ) is a Binomial random variable and θ̂ = Y/n its MLE, then McAllester and Schapire (2000); Bousquet et al. (2003) and later Dasgupta and Hsu (2008b) showed that

    |θ − θ̂| ≲ sqrt(θ̂(1 − θ̂) log(1/δ) / n) + log(1/δ) / n    (5)

with probability of at least 1 − δ. A closely related line of work appears in the statistics literature. Let F(y; n, θ) be the cumulative distribution function of Y. For a given y and n, let θ_l and θ_u be the solutions (with respect to θ) of F(y; n, θ) = 1 − δ/2 and F(y; n, θ) = δ/2, respectively. Clopper and Pearson (1934) showed that

    P(θ ∈ [θ_l, θ_u]) ≥ 1 − δ    (6)

for every θ ∈ [0, 1]. The interval [θ_l, θ_u] is widely known as the exact Clopper-Pearson (CP) confidence interval (CI). The exact notion refers to the fact that (6) holds for every n, as opposed to alternative approximations. In fact, CP is also known to be the shortest possible CI, for most setups of interest.
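For concreteness, the CP endpoints can be computed from Beta quantiles, which is the standard way of solving the defining equations of (6) numerically. The following is a minimal sketch (ours, assuming SciPy is available); note that for y = 1 and n = 100 at the 95% level it reproduces the lower endpoint of about 2.5 × 10⁻⁴ that reappears in the selective inference example of Section 6.

```python
# Sketch: exact Clopper-Pearson interval for a binomial proportion via the
# standard Beta-quantile representation (assumes scipy is available).
from scipy.stats import beta

def clopper_pearson(y: int, n: int, delta: float = 0.05):
    """Return (theta_l, theta_u) with P(theta in [theta_l, theta_u]) >= 1 - delta."""
    theta_l = 0.0 if y == 0 else beta.ppf(delta / 2, y, n - y + 1)
    theta_u = 1.0 if y == n else beta.ppf(1 - delta / 2, y + 1, n - y)
    return theta_l, theta_u

# Example: y = 1 success out of n = 100 draws at the 95% level.
print(clopper_pearson(1, 100))   # roughly (0.00025, 0.0545)
```

The precise sense in which this interval is the shortest is formalized next.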
Specifically, let T be a collection of intervals [tl, tu] that satisfy P (θ [tl, tu]) 1 δ, for every [tl, tu] T . The shortest CI for θ is defined as the intersection of all intervals in T . This notion also implies minimal expected length and minimal false coverage probability, uniformly. Wang (2006) showed that for n log(δ/2)/ log 0.5, the CP CI is the shortest. Notice that for a nominal level of δ = 0.05, this condition corresponds to n 6. Hence, for practical setups of interest, the CP interval [θl, θu] is the shortest possible CI for θ. Unfortunately, CP does not hold a closed-form expression. Yet, Thulin (2014) showed that for every y {1, ..., n 1}, θl = ˆθ z δ 1 2ˆθ z2 δ/2 1 ˆθ θu = ˆθ + z δ 1 2ˆθ z2 δ/2 + 2 ˆθ Kontorovich and Painsky up to additive terms of order n 3/2, where zδ/2 is the upper δ/2 quantile of the standard normal distribution. This result implies that for every y {1, ..., n 1}, the shortest possible CI length for θ is θu θl = 2zδ/2 ˆθ(1 ˆθ)/n + 1/n + O(n 3/2). (7) Moreover, we have |θ ˆθ| max{|ˆθ θl|, |ˆθ θu|} (8) with probability of at least 1 δ. Importantly, it can be shown that zδ/2 behaves asymptotically like p 2 log(2/δ). Comparing (1) to (8) (and (5)) , we observe that its sample complexity, 1/ n, is tight. However, there may still be room for improvement by utilizing a data-dependent scheme. The Clopper-Pearson interval (8) provides a tight solution for the binomial case ||p||0 = 2. Yet, the problem becomes more involved in the multinomial setting (4). Currently known methods focus on two basic regimes. The first considers an asymptotic setup, where n is much greater than the alphabet size (Quesenberry and Hurst, 1964; Goodman et al., 1964; Sison and Glaz, 1995). The second addresses the case where both n and ||p||0 are small (Chafai and Concordet, 2009; Malloy et al., 2020). Notice that while some of these methods provide rectangular CR (Quesenberry and Hurst, 1964; Goodman et al., 1964; Painsky, 2023a), others focus on hyper-cubes (Sison and Glaz, 1995). Yet, all of these methods assume a finite alphabet where performance guarantees are limited to relatively small ||p||0. To the best of our knowledge, no method considers the case where ||p||0 may be infinite. 4. Main Results We begin our analysis by considering a data-indepedent bound under the infinity norm. Our proposed bound generalizes (1) by utilizing a Chernoff-like concentration bound. Theorem 1 Let p = pi N be a distribution over N. Let Xn be a sample of n independent observations from p. Let ˆp(Xn) be the MLE of p. Then, with probability at least 1 δ, p ˆp = sup i N |pi ˆpi(Xn)| (9) k=1 km knk X i pk i (1 pi)k m/2 δ1/m exp 1 for every even m > 0. Theorem 1 relies on Markov s inequality for higher-order moments of the infinity norm. In addition, it applies higher-order properties of the MLE and the binomial distribution. The detailed proof appears Section 7.1. Next, similarly to Chernoffinequality, we minimize (9) with respect to m to obtain tighter convergence guarantees. Specifically, we minimize the leading term of (9) to obtain m/2 δ1/m exp 1 Distribution Estimation under the Infinity Norm for the choice of m = 2 log(1/δ) + 2. Hence the infimum of the bound is given by r 1 + log (1/δ) n 1 2 + 1 m Unfortunately, this is not a great improvement over the benchmark (1). However, it is shown in Section 7.1 that as m increases, the second inequality in (9) becomes tight for a worst-case case distribution p = [1/2, 1/2, 0, . . . , 0]. This distribution is quite unlikely in a large alphabet regime. 
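A quick simulation (ours, illustrative only, assuming NumPy; the sample size and alphabet size are arbitrary choices) makes the contrast vivid: for the two-point worst case the ℓ∞ error of the MLE is on the benchmark scale sqrt(log(1/δ)/(2n)), whereas for a uniform distribution over a large alphabet it is an order of magnitude smaller.

```python
# Sketch (illustrative, not from the paper): the sup-norm error of the MLE for
# the worst-case distribution p = [1/2, 1/2] is on the benchmark scale
# sqrt(log(1/delta)/(2n)), while for a uniform distribution over a large
# alphabet it is far smaller.
import numpy as np

rng = np.random.default_rng(1)
n, delta, reps = 2_000, 0.05, 5_000

def sup_error_quantile(p, n, reps):
    counts = rng.multinomial(n, p, size=reps)          # reps x |p| count matrix
    errs = np.abs(counts / n - p).max(axis=1)          # sup-norm error per repetition
    return np.quantile(errs, 1 - delta)

p_worst = np.array([0.5, 0.5])
p_unif = np.full(1_000, 1 / 1_000)

print("benchmark   :", np.sqrt(np.log(1 / delta) / (2 * n)))
print("two-point   :", sup_error_quantile(p_worst, n, reps))
print("uniform 1000:", sup_error_quantile(p_unif, n, reps))
```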
On the other hand, if we assume a more likely uniform distribution over a finite alphabet size A := ||p||0 < , we obtain m/2 δ1/m A 1 for every even m > 0. Minimizing the leading term with respect to m yields m/2 δ1/m A 1 2 log(A) , (12) for a choice of m = 2 log(A/δ). Notice the above vanishes with A. This result motivates our quest for a data-dependent bound, which considers an empirical estimate of p and does not assume a worst-case distribution as in (1) and (11). Theorem 2 below improves upon (9) and introduces a data-dependent bound which further accounts for ˆp. Theorem 2 Let δ1 > 0 and δ2 > 0. Let m be a positive even number. Then, with probability at least 1 δ1 δ2, k=1 km k(n + 1)k(ˆpi(1 ˆpi))k + ϵn+1 for every even m, where 2n log(1/δ2) k=1 km knk k n 4k 1 + k(k 1) To prove Theorem 2 we utilize the first inequality of (9) with δ = δ1. Then, we apply Mc Diarmid s inequality to obtain a concentration bound for P i N pk i (1 pi)k around its empirical counterpart, with probability 1 δ2. Finally, we apply the union bound to obtain (13). The detailed proof is provided in Section 7.2. To further clarify the proposed bound we introduce the following simplified corollary, whose proof is located in Section 7.3 Corollary 3 Let δ1 > 0 and δ2 > 0. Let m be a positive even number. Then, with probability at least 1 δ1 δ2, i (nˆpi(1 ˆpi))k Kontorovich and Painsky for every even m, where a = p exp(1/e). Furthermore, δ1/m 1 = e log(1/δ1), where the infimum is obtained for a choice of m = log(1/δ1). Let us compare Corollary 3 with the benchmark scheme (1) and our previous data independent bound (Theorem 1). First, we notice a similar sample complexity of order of p 1/n in all the three schemes. This is not entirely surprising, given (5). However, Corollary 3 demonstrates an improved dependency on the underlying distribution, which now depends on ˆp and does not assume a worst-case distribution. Unfortunately, the dependency in ˆp is somewhat involved, and does not hold the desired form of (3). Finally, we compare the dependency in the confidence level δ. Here, the data independent bounds introduce a squared root logarithmic dependency in 1/δ. On the other hand, Corollary 3 only attains a logarithmic dependency in 1/δ1 (where δ1 < δ) for the right choice of m. This difference is typically negligible compared to the other terms, especially in a fixed δ regime as later discussed. To conclude, Theorem 2 and Corollary 3 introduce data-dependent upper bounds to the infinity norm. The proposed bounds generalize (1), as they both utilizes high-moments and, more importantly, do not consider a worst-case distribution. In that sense, the proposed results generalize and improve upon (1), which may be considered as their special case. Our experiments in Section 5 numerically demonstrate this improvement. Despite the above, we would still like to find more intuitive upper bounds, which are less complex and also introduce a dependency in p that is closer to the desired form (3). Hence, we present our second main result. First, we introduce some additional notation. Notation. For any distribution pi N, define v = v(p) by vi = pi(1 pi) and v = maxi N vi as above. Define the functional V (p) = sup i N vi(p ) log(i + 1), (14) where p is p sorted in non-increasing order. Define ϕ(t) = t log 1 t , 0 t 1. (15) Theorem 4 Let p = pi N be a distribution over N and put v = v (p), V = V (p). 
For n 81 and δ (0, 1), we have that 3n log 2(n + 1) 3n log 2(n + 1) v log(n + 1) 3n log 2(n + 1) holds with probability at least 1 δ 81/n. Distribution Estimation under the Infinity Norm Remark 5 It seems that the log(n)/n term can be improved to log(n)/(n log log log n); we shall explore this in a future work. It is instructive to compare Theorem 4 with our ambitious dream (3). The loosest bound therein, (18), features the desired dependency on v , but at the cost of a log n factor. The sharper bound (17) replaces the log n with v log 1 v . Finally, (16) gives the optimal (at least for the MLE, cf. Proposition 9) quantity V . The proof of Theorem 4 is located in Section 7.4. It relies on techniques from empirical process theory and large deviations. Next, we provide the empirical counterpart of Theorem 4, which depends on ˆv = supi N ˆpi(1 ˆpi). Theorem 6 Let p = pi N be a distribution over N. Let ˆp be the MLE of p. Define 3n log 2(n + 1) Then, with probability at least 1 δ 81/n, p ˆp 3a/2 + ab/2 + 5b2/4 + 7b ˆv /4. (19) Remark 7 Note that the estimate in (19) is of order q n , matching, up to constants, the form of (18). The proof of Theorem 6 follows the proof of Dasgupta and Hsu (2008a, Lemma 5) and is left for Section 7.5. Open problem. The estimate in Theorem 6 gives an empirical analog of (18). We conjecture that some empirical analog of (16) should be possible as well: a bound of the general Near-optimality. To argue the near-optimality1 of the above bounds we introduce our lower bounds on supi N |ˆpi pi| for some fixed constant δ > 0; this is equivalent to lower bounding E supi N |ˆpi pi|. Understanding the correct dependence on δ is left for future work. For a fixed δ, the upper bound in (16) consists of two terms: one of order p V /n and another one of order log(n)/n. We shall argue below that the first is tight and the second nearly so, albeit in different senses. The near-optimality of the log(n)/n term is proved in the following result, whose proof is provided in Section 7.6. Note that the lower bound obtained for this term is of a minimax type, meaning that it holds for any estimator, not just the MLE. Proposition 8 There is an absolute constant c > 0 such that the following holds for all sufficiently large n. For any estimator p(Xn), there is a distribution pi N on N such that E p p c log n n log log n (20) for all sufficiently large n. 1. We use the term instance-optimality in the spirit of Theorems 2.3 and 2.4 of Cohen et al. (2020): fully empirical data-dependent bounds that cannot be significantly improved upon. Kontorovich and Painsky Our lower bound matching the p V /n term will be limited to the MLE, but will have the advantage of holding pointwise for any given distribution, in constradistinction to the minimax bound in Proposition 8, which only holds for some adversarial distribution. The proof of Proposition 9 is provided in Section 7.7. Proposition 9 For any distribution pi N and its corresponding MLE ˆp, we have lim inf n n E p ˆp c p where c > 0 is an absolute constant. Remark 10 We show in Section 7.7 that the lower bound is necessarily only asymptotic (rather than finite-sample, in the sense of holding for all n), as a consequence of previous results (Berend and Kontorovich, 2013) Finally, the following is a straightforward consequence of the Neyman-Pearson (Lemma 23): Proposition 11 For any estimator p there exists a distribution pi N such that for all sufficiently large n, where c > 0 is an absolute constant. 
The proof of Proposition 11 is provided in Section 7.8. It is instructive to compare Propositions 9 and 11. The former holds for any fixed distribution p and the bound is stronger (since V ≳ v*), but only for the MLE estimate. The latter holds for all estimators, but the distribution can be adversarially chosen for each sample size n and the bound is weaker. We conjecture that the lower bound in Proposition 9 holds for all estimators and not just the MLE.

5. Experiments

Let us now demonstrate our proposed bounds. We focus on two benchmark distributions which represent two extreme cases. That is, we study the Zipf's law and the uniform distributions. The Zipf's law distribution is a typical benchmark in large alphabet probability estimation; it is a commonly used heavy-tailed distribution, mostly for modeling natural (real-world) quantities in physical and social sciences, linguistics, economics and other fields (Saichev et al., 2009). The Zipf's law distribution follows p_i = i^(−s) / Σ_{r=1}^{A} r^(−s), where A is the alphabet size and s is a skewness parameter. We set s = 1.1 throughout our experiments.

In each experiment we draw n samples from a distribution over an alphabet of size A to evaluate the proposed bounds for a given confidence level 1 − δ. We repeat this process 10⁴ times and report the average bound and coverage rate (that is, the fraction of repetitions in which the infinity norm is not greater than the bound). In the first experiment we focus on A = 100 and δ = 0.05. We examine three bounds. First, we consider the bound from Theorem 2 with δ₁ = 0.99δ, δ₂ = 0.01δ. We set m to minimize (13) over the worst-case distribution (see (10)). This results in m = 2 log(1/δ₁) + 2 ≈ 8. Further, we examine Theorem 6 and the benchmark bound (1). To further assess the tightness of our results we introduce an Oracle lower bound (OLB). The OLB knows the true distribution and evaluates the 1 − δ quantile of the desired infinity norm.

Figure 1 summarizes the results we achieve. First, we observe that Theorem 2 outperforms both Theorem 6 and the benchmark. It is also relatively close to the OLB, especially as n increases. We emphasize that although Theorem 6 demonstrates a steep descent, it does not outperform Theorem 2, even for a relatively large n = 10⁵. The reason for this phenomenon is the fixed δ regime, in which Theorem 2 is favorable. Importantly, all the examined bounds attain the prescribed coverage rate as desired.

[Figure 1: The proposed bounds compared to the benchmark and to an Oracle, as n grows and δ = 0.05. Left panel: Uniform; right panel: Zipf's law. Curves: Theorem 4, Benchmark, Theorem 2, Oracle.]

Next, we examine the performance of our proposed schemes for a decaying confidence level, δ = 1/n². As above, we set A = 100 and focus on the two benchmark distributions. Figure 2 demonstrates the results we achieve. Here, we see the advantage of Theorem 6, as it outperforms the alternatives for relatively large n. Once again, all bounds attain the prescribed confidence level.

To conclude, both Theorem 2 and Theorem 6 improve upon the benchmark (1) in the studied regimes. While Theorem 2 shows favorable performance in the fixed δ regime, Theorem 6 demonstrates improved results in a decaying δ regime. It is also important to emphasize that while Theorem 6 is more intuitive and easier to apply, Theorem 2 is more complex and designed to attain the tightest results, even for smaller n.
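For reproducibility, the following is a condensed sketch of the evaluation protocol described above (Zipf's law with s = 1.1, A = 100, 10⁴ repetitions). The sketch is ours and assumes NumPy; for brevity it only evaluates the benchmark bound (1) and the oracle quantile, and the data-dependent bounds of Theorems 2 and 6 would be computed from each sampled count vector and averaged in the same way.

```python
# Sketch of the evaluation protocol described above (assumption: only the
# benchmark bound (1) is evaluated here; the Theorem 2 / Theorem 6 bounds would
# be plugged in per repetition in the same way). Reports the benchmark bound,
# the oracle 1-delta quantile of the infinity norm, and the empirical coverage.
import numpy as np

rng = np.random.default_rng(2)
A, s, delta, reps = 100, 1.1, 0.05, 10_000

ranks = np.arange(1, A + 1)
p = ranks ** (-s)
p /= p.sum()                                           # Zipf's law: p_i proportional to i^{-s}

for n in (1_000, 3_000, 5_000):
    counts = rng.multinomial(n, p, size=reps)
    errs = np.abs(counts / n - p).max(axis=1)          # ||p - p_hat||_inf per repetition
    benchmark = np.sqrt(np.log(1 / delta) / (2 * n))   # data-independent bound (1)
    oracle = np.quantile(errs, 1 - delta)              # oracle: true 1-delta quantile
    coverage = np.mean(errs <= benchmark)
    print(f"n={n}: benchmark={benchmark:.4f}  oracle={oracle:.4f}  coverage={coverage:.3f}")
```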
Notice that Theorems 1 and 4 are omitted from our experiments, as they require the unknown underlying distribution p.

[Figure 2: The proposed bounds compared to the benchmark and to an Oracle, as n grows and δ = 1/n². Left panel: Uniform; right panel: Zipf's law. Curves: Theorem 4, Benchmark, Theorem 2, Oracle.]

6. Application to Inference of Frequent Events

We now introduce an important application of our proposed scheme. Consider a survey asking individuals for their favorite food. We would like to report the k most popular foods along with their associated CIs. The common approach is to construct k marginal (binomial) intervals of confidence level 1 − δ each. This approach is genuinely wrong. For example, consider the case of k = 1, n = 100 and a uniform distribution over an alphabet of size A = ‖p‖₀. By definition, the most popular food in the sample would attain at least a single vote. Therefore, its exact lower bound CI (of level 1 − δ = 0.95) is at least θ_l = 2.5 × 10⁻⁴. This means that for A > 4000, we attain zero coverage rate (!).

This phenomenon is not all that surprising. Traditional (frequentist) inference assumes a fixed and unknown parameter θ. Here, the inferred parameter is data-dependent, as it corresponds to the most frequent symbols in the sample. That is, we may obtain different k most popular foods for different samples. This type of inference problem is known as selective inference (Ben-Hamou et al., 2017). Selective inference is a complicated task which has been extensively studied in recent years (Tibshirani et al., 2016; Berk et al., 2013). One of the first major contributions to the problem is due to Benjamini and Yekutieli (2005). In their work, they showed that conditional coverage, following any selection rule for any set of (unknown) values for the parameters, is impossible to achieve. This means we cannot simply infer on the chosen parameters, given that they were selected.

Naturally, by controlling the infinity norm we implicitly control the k most frequent events. That is, assume we found Tδ(Xⁿ) that satisfies (4). Then, we have |p_i − p̂_i(Xⁿ)| ≤ Tδ(Xⁿ) for every i ∈ N with probability 1 − δ, including the k most frequent events. However, it is of a natural concern that such an approach is not tight enough, as it is oblivious to k. In the following we study this claim and discuss the tightness of the infinity norm with respect to the k most frequent events.

Theorem 12 Let p = (p_i)_{i ∈ N} be a distribution over N. Let p̂ be the MLE of p. Let j = argmax_i p̂_i be the most frequent symbol in the sample, so that p̂_j = max_i p̂_i. Assume there exists Uδ(Xⁿ) such that

    P(|p_j − p̂_j| ≥ Uδ(Xⁿ)) ≤ δ.    (21)

Then,

    E(Uδ(Xⁿ)) ≥ z_{δ/2} sqrt(p_[1](1 − p_[1]) / n)    (22)

for sufficiently large n, where p_[1] = max_i p_i.

The proof of Theorem 12 is provided in Section 7.9. It utilizes the optimality of the CP CI and additional asymptotic properties. Let us now consider the k most frequent events. We would like to refrain from multiplicity corrections, so we seek an interval for sup_{i ∈ X_k(Xⁿ)} |p_i − p̂_i(Xⁿ)|, where X_k(Xⁿ) is the collection of the k most frequent events in the sample. This set naturally contains the single most frequent event, so a CI of average length (22) is inevitable.

Let us compare Theorem 12 with our proposed bounds. For this purpose we turn to real-world data sets. Notice that in real-world settings, the true underlying probability is unknown. Hence, we treat the empirical distribution of the full data set as the underlying distribution and sample from it accordingly.
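Before turning to the real-world data sets, the zero-coverage phenomenon described above is easy to reproduce. The following sketch is ours (assuming NumPy and SciPy, with arbitrary parameter choices A = 5000 and n = 100): it builds a marginal 95% CP interval around the most frequent sampled symbol and contrasts it with the simultaneous infinity-norm guarantee, here instantiated with the benchmark bound (1).

```python
# Sketch reproducing the failure described above: with p uniform over a large
# alphabet, a marginal 95% CP interval around the most frequent *sampled*
# symbol misses its true probability essentially always, while a simultaneous
# (infinity-norm) bound still covers it. Assumes scipy; illustrative only.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
A, n, delta, reps = 5_000, 100, 0.05, 2_000
p = np.full(A, 1 / A)

marginal_hits = simultaneous_hits = 0
t_delta = np.sqrt(np.log(1 / delta) / (2 * n))               # benchmark bound (1)
for _ in range(reps):
    counts = rng.multinomial(n, p)
    j = counts.argmax()                                       # most frequent symbol in the sample
    y = counts[j]
    lower = beta.ppf(delta / 2, y, n - y + 1)                 # CP lower limit (y >= 1 here)
    upper = beta.ppf(1 - delta / 2, y + 1, n - y)
    marginal_hits += lower <= p[j] <= upper
    simultaneous_hits += abs(counts[j] / n - p[j]) <= t_delta

print("marginal CP coverage of p_j  :", marginal_hits / reps)      # ~0 for A > 4000
print("infinity-norm bound coverage :", simultaneous_hits / reps)  # ~1
```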
We begin with a census data; we consider the 2000 United States Census (Bureau, 2014), which lists the frequency of the top 1000 most common last names in the United States. We randomly sample n names (with replacement) and examine the studied bounds for δ = 0.05. In addition, we present the Oracle CIs for the single most frequent symbol and the infinity norm. The left chart of Figure 3 demonstrates the results we achieve. As we can see, the Oracle CIs are very close to each other and the difference between them and Theorem 12 is also negligible. This shows that the infinity norm is a very good proxy to the k most frequent symbols in the alphabet. As we further examine our results, we see that for a typical experiment of n = 10000, the top k = 5 surnames are Smith, Johnson, Williams, Brown and Jones with ˆpi = [0.0213, 0.0165, 0.0137, 0.0134, 0.0130] respectively. Theorem 2 attains a bound of 0.0108 while the benchmark is about three times greater, 0.0345. Next, we consider a corpus linguistic experiment. The popular Broadway play Hamilton consists of 20,520 words, of which 3,578 are distinct. We randomly sample n words (with replacement), and evaluate the corresponding bounds. The right chart of Figure 3 demonstrates the results we achieve. Once again,it is quite evident that Theorem 2 outperforms its alternatives in this fixed δ = 0.05 regime. Further, we observe that the the infinity norm is a tight proxy to the k most frequent symbols in the alphabet. 7.1 Proof of Theorem 1 First, notice we have E sup i |pi ˆpi(Xn)| m (i)=E sup i (pi ˆpi(Xn))m (ii) E i (pi ˆpi(Xn))m ! i E(ni npi)m (iii) 1 nm X k=1 km k(npi(1 pi))k Kontorovich and Painsky 2000 4000 6000 8000 10000 0 0.3 Surnames Theorem 4 Benchmark Theorem 2 Oracle Oracle - Most Frequent Symbol LB - Most Frequent Symbol 2000 4000 6000 8000 10000 0 0.16 Hamilton Figure 3: The proposed bounds compared to the benchmark and to an Oracle, as n grows. The lower bound for the most frequent symbol corresponds to Theorem 12 where d = m/2 and (i) follows from the monotonicity of the power function. (ii) The supremum of non-negative elements is bounded from above by their sum (Maddox, 1988). (iii) follows from Theorem 4 of (Skorski, 2020) Applying Markov s inequality we obtain P sup i |pi ˆpi(Xn)| a 1 am E sup i |pi ˆpi(Xn)| m (24) 1 am 1 nm X k=1 km k(npi(1 pi))k. Setting the right hand side to equal δ yields k=1 km k(npi(1 pi))k !1/m Therefore, with probability 1 δ, we have sup i |pi ˆpi(Xn)| 1 k=1 km knk X i pk i (1 pi)k !1/m Distribution Estimation under the Infinity Norm Let us further bound from above the right hand side of (25). We have, i pk i (1 pi)k (i) max t [0,1] tk 1(1 t)k (ii) k 1 k (iii) (26) 2k 1 = 1 + k + 1 2k 1 (iv) exp( k + 1) (i) follows from P i piψ(pi) maxt [0,1] ψ(t) for ψ(pi) = pk 1 i (1 pi)k. That is, the mean of a random variable not greater than its maximum. (ii) simple derivation shows that the maximum of tk 1(1 t)k is attained for t = k 1 (iii) is due to k 1 2k 1 k 1 k 2k 1 k 1 . (iv) follows from Bernoulli inequality. Importantly, notice that for a choice of p = [1/2, 1/2, 0, . . . , 0] we have i pk i (1 pi)k = 2 1 2k 1 , (27) which approaches the term on the right hand side of inequality (iii), as k increases. Finally, plugging (26) to (25) we obtain sup i |pi ˆpi(Xn)| 1 k=1 km knk exp( k + 1) for every even m > 0. 7.2 Proof of Theorem 2 We begin with the following proposition. Proposition 13 Let δ2 > 0. 
Then, with probability 1 δ2, k=1 km k(n(pi(1 pi))k X k=1 km k((n + 1)ˆpi(1 ˆpi))k + ϵn+1 for every even m, where 2n log(1/δ2) k=1 km knk k n 4k 1 + k(k 1) Kontorovich and Painsky Proof Define ψ(n, d, ˆp) = P i Pd k=1 km k(nˆpi(1 ˆpi))k. Mc Diarmid s inequality yields P (ψ(n, d, ˆp) E (ψ(n, d, ˆp)) ϵn) exp 2ϵ2 n Pn j=1 c2 j where cj is defined as ψ(n, d, ˆp) ψ(n, d, ˆp ) cj. (31) Here, ˆp is the MLE over the same sample xn, but with a different jth observation, x j. First, let us find cj. We have ψ(n, d, ˆp) ψ(n, d, ˆp ) (32) sup p [0,1 1/n] 2 d X k=1 km k (np(1 p))k k=1 km k (n(p + 1/n)(1 (p + 1/n)))k) = sup p [0,1 1/n] 2 d X k=1 km knk (p(1 p))k ((p + 1/n)(1 (p + 1/n)))k) , independently of j, where the inequality follows from the fact that changing a single observation effects only two symbols (for example, ˆpl and ˆpt), where the change is 1/n. Now, we would like to bound from above (32). Denote fk(p) = pk(1 p)k. Applying Taylor series to fk(p + 1/n) around fk(p) yields = fk(p) + 1 nf k(p) + r(p) where r(p) = 1 2! 1 n2 f (c) is the residual and c [p, p + 1/n] (Stromberg, 2015). We have f k(p) = k (p(1 p))k 1 (1 2p) k (p(1 p))k 1 (33) f k (p) = k(k 1)(p(1 p))k 2(1 2p)2 2k(p(1 p))k 1 k(k 1)(p(1 p))k 2. sup p [0,1 1/n] (p(1 p))k ((p + 1/n)(1 (p + 1/n)))k) = (34) sup p [0,1 1/n] 2! 1 n2 f (c) sup p [0,1 1/n] 2! 1 n2 f (c) (i) sup p [0,1 1/n] k n (p(1 p))k 1 + k(k 1) 2n2 (p(1 p))k 2 (ii) k n 4k 1 + k(k 1) (i) follows from (33). Distribution Estimation under the Infinity Norm (ii) follows from the concavity of (p(1 p))k for k 1. Therefore, we have ψ(n, d, ˆp) ψ(n, d, ˆp ) 2 k=1 km knk k n 4k 1 + k(k 1) E(ψ(n, d, ˆp)) X k=1 km knk (E(ˆpi(1 ˆpi)))k = k=1 km knk 1 1 pi(1 pi) k = k=1 km k((n 1)pi(1 pi))k = ψ(n 1, d, p) (36) where the first inequality follows from Jensen Inequality and the equality that follows is due to E(ˆpi(1 ˆpi)) = E(ˆpi) Var(ˆpi) E2(ˆpi) = p(1 p)(1 1/n). Going back to Mc Diarmid s inequality, we have P (Eψ(n, d, ˆp) ψ(n, d, ˆp) + ϵn) exp 2ϵ2 n nc2 j In word, the probability that the random variable Z = ψ(n, d, ˆp) is smaller than a constant C = E(ψ(n, d, ˆp)) ϵn is not greater that ν = exp 2ϵ2/Pn j=1 c2 j . Therefore, it necessarily means that the probability that Z is smaller than a constant smaller than C, is also not greater than ν. Hence, plugging (36) we obtain P (ψ(n 1, d, p) ψ(n, d, ˆp) + ϵn) exp 2ϵ2 n P j c2 j Setting the right hand side to equal δ2 we get 2n log(1/δ2) k=1 km knk k n 4k 1 + k(k 1) and with probability 1 δ2, k=1 km k((n 1)pi(1 pi))k k=1 km k(nˆpi(1 ˆpi))k + ϵn Plugging n + 1 to the above yields the desired result. Finally, we apply the union bound to (25) with δ = δ1 and Proposition 13 to obtain the stated Theorem 1. Kontorovich and Painsky 7.3 A Proof for Corollary 3 We prove the Corollary with two propositions. Proposition 14 Let δ1 > 0. Then, with probability 1 δ1, sup i |pi ˆpi(Xn)| m k=1 (npi(1 pi))k for every even m > 0. Proof First, we have E sup i |pi ˆpi(Xn)| m (i) 1 k=1 km k(npi(1 pi))k (ii) d k=1 (npi(1 pi))k where d = m/2 and (i) follows from (23). (ii) follows from km k dm for every k {1, ..., d}. Applying Markov s inequality we obtain P sup i |pi ˆpi(Xn)| a 1 am E sup i |pi ˆpi(Xn)| m (41) k=1 (npi(1 pi))k. Setting the right hand side to equal δ1 yields k=1 (npi(1 pi))k !1/m k=1 (npi(1 pi))k Proposition 15 Let δ2 > 0. Then, with probability 1 δ2, k=1 (npi(1 pi))k (42) k=1 (nˆpi(1 ˆpi))k + d 1 2 log(1/δ2) 2nd 1/2 + 2nd 3/2 ! for every even m. 
Distribution Estimation under the Infinity Norm Proof Mc Diarmid s inequality suggests that k=1 (nˆpi(1 ˆpi))k E k=1 (nˆpi(1 ˆpi))k ! 2ϵ2 n Pn j=1 c2 j where cj follows k=1 (nˆpi(1 ˆpi))k X nˆp i(1 ˆp i) k) cj. (43) First, let us find cj. We have k=1 (nˆpi(1 ˆpi))k X nˆp i(1 ˆp i) k) (i) (44) 2 sup p [0,1 1/n] k=1 (np(1 p))k k=1 (n(p + 1/n)(1 (p + 1/n)))k) = 2 sup p [0,1 1/n] k=1 nk (p(1 p))k ((p + 1/n)(1 (p + 1/n)))k) k=1 nk sup p [0,1 1/n] (p(1 p))k ((p + 1/n)(1 (p + 1/n)))k) (ii) = k=1 nk k n 4k 1 + k(k 1) 4k 1 + k(k 1) (iii) 2dnd 1 + 2dnd 2 (i) Changing a single observation effects only two symbols (for example, ˆpl and ˆpt), where the change is 1/n. (ii) Follows from (34). (iii) Follows from Pd k=1 k 4k 1 = 4 Pd k=1 k 4k d and 4k 2 d max k [1,d] k2 4k 2 2d (45) where the maximum is obtain for k = 2/ log(4). Next, we have k=1 (nˆpi(1 ˆpi))k X k=1 (E(nˆpi(1 ˆpi)))k = (46) pi(1 pi) k 1 1 k=1 ((n 1)pi(1 pi))k Kontorovich and Painsky Going back to Mc Diarmid s inequality, we have k=1 (nˆpi(1 ˆpi))k ! k=1 (nˆpi(1 ˆpi))k + ϵn 2ϵ2 n Pn j=1 c2 j Plugging (46) we obtain k=1 (npi(1 pi))k X k=1 (nˆpi(1 ˆpi))k + ϵn 2ϵ2 n P j c2 j Setting the right hand side to equal δ2 we get 2 log(1/δ2) 2dnd 1 + 2dnd 2 and with probability 1 δ2, k=1 (npi(1 pi))k (48) k=1 (nˆpi(1 ˆpi))k + d 1 2 log(1/δ2) 2dnd 1/2 + 2dnd 3/2 ! Finally, we apply the union bound to Propositions 14 and 15 to obtain sup i |pi ˆpi(Xn)| k=1 (nˆpi(1 ˆpi))k + d 1 2 log(1/δ2) 2nd 1/2 + 2d 3/2 !!1/m k=1 (nˆpi(1 ˆpi))k 1/2 (m/2)1/m 1 1/2m 2n 1 2 1 2m + 2n 1 2 3 with probability 1 δ1 δ2. Notice that p 2. Define g(m, δ1) = m/( 2δ1/m 1 ). Further, it is immediate to show that (m/2)1/m p exp(1/ exp(1)) and (1/2)1/2m 1/ 2. Hence, with probability 1 δ1 δ2, sup i |pi ˆpi(Xn)| g(m, δ1) k=1 (nˆpi(1 ˆpi))k bg(m, δ1)(log(1/δ2))1/2m n 1 for every even m, where b = p 2 exp(1/ exp(1)). Finally, we would like to choose m which minimizes g(m, δ1). It is immediate to show that infm g(m, δ1) = (exp(1)/2) log(1/δ1)/ 2, where and the infimum is obtained for a choice of m = log(1/δ1). Distribution Estimation under the Infinity Norm 7.4 A Proof of Theorem 4 Let us first introduce some auxiliary results and background 7.4.1 Auxiliary Results Lemma 16 (contained in the proof of Lemma 10, Cohen and Kontorovich (2023)) Let Yi I N be random variables such that, for each i I, there are vi > 0 and ai 0 satisfying P (Yi ε) exp ε2 2(vi + aiε) , ε 0. (49) v := supi I vi, V := supi I vi log(i + 1), a := supi I ai, A := supi I ai log(i + 1). (50) sup i I Yi 2 V + v log 1 δ + 4A + 4a log 1 Remark 17 When considering the random variable Z = supi N |ˆpi pi|, there is no loss of generality in assuming that pi 1/2, i N. Indeed, |Yi| = |ˆpi pi| is distributed as |n 1 Bin(n, pi) pi|, and the latter distribution is invariant under the transformation pi 7 1 pi. This is easily verified via the calculation P(Bin(n, p) = k) = P(Bin(n, 1 p) = n k). Lemma 18 For any distribution pi N, V (p) ϕ(v (p)). Proof (This elegant proof idea is due to Václav Voráček.) There is no loss of generality in assuming p = p . The claim then amounts to sup i N vi log(i + 1) v log 1 The monotonicity of the pi implies pi (p1 + . . . + pi)/i 1/i. Now x 1/i = x(1 x) 1/(i + 1) for i N, and hence vi 1/(i + 1). Thus, vi log(i + 1) vi log 1 vi . Finally, since x log(1/x) is increasing on [0, 1/4], which is the range of the vi, we have supi N vi log 1 Remark 19 There is no reverse inequality of the form ϕ(v (p)) F(V (p)), for any fixed F : R+ R+. 
This can be seen by considering p supported on [k], with p1 = log(k)/k and the remaining masses uniform. Then V (p) log(k)/k while ϕ(v (p)) log(k) log(k/ log k)/k. Proposition 20 Let n 10 and β = log(n). Then, f(n) := β βn2 n β Kontorovich and Painsky Proof To prove the above, we show that f(n) is decreasing for n > 200. This means that the maximum of f(n) may be numerically evaluated in the range n {10, ..., 200}. Finally, we verify that the maximum of f(n) is attained for n = 33, and is bounded from above by 81/2 as desired. It remains to verify that f(n) is decreasing for n > 200. Since f(n) is non-negative, it is enough to show that g(n) = log f(n) is decreasing. Denote g(n) = β log β + 2 log n + (n β) log(n β) + (n β) log n log(2β 2). (51) Taking the derivative of g(n) we have, g (n) = (52) n(log β + 1) + 2 ( log(n β) 1 + log n) + n β (n 1) log n n β log β β + 2 2β log 2 n log n n β log β β + 2 log 2 1 n β log β β + 2 log 2 = n β log β + 2 log 2 , where the first inequality follows from log(n/(n β)) 1 and 2β/(2β 2) 1, while the second inequality is due to Bernoulli s inequality, (n/(n β))n exp(nβ/(n β)). Finally, it is easy to show that β2/(n β) is decreasing for n 10. This means that β2/(n β) (log 10)2/(10 log(10)) and g (n) < 0 for n > 200. Lemma 21 (generalized Fano method (Yu, 1997), Lemma 3) For r 2, let Mr be a collection of r probability measures ν1, ν2, ..., νr with some parameter of interest θ(ν) taking values in pseudo-metric space (Θ, ρ) such that for all j = k, we have ρ(θ(νj), θ(νk)) α and D(νj νk) β. inf ˆθ max j [d] EZ µjρ(ˆθ(Z), θ(νj)) α 1 β + log 2 where the infimum is over all estimators ˆθ : Z 7 Θ. Proposition 22 Let p and q be two distributions with support size n. Define p by p1 = log n 2n log log n, pi = 1 p1 n 1 , i > 1, and q by q2 = p1, and qi = p2 for i = 2. Then, Distribution Estimation under the Infinity Norm (i) p q c log n n log log n for some c > 0 and all n sufficiently large. (ii) limn n log n D(p||q) = 1 Proof For the first part, it is enough to show that |p1 p2| c log(n)/n log log n for some c > 0 and sufficiently large n. First, we show that p1 p2 for n (log n)2. That is, n 1 = np1 1 n 1 > 0 (53) for np1 > 1. Next, fix 0 < c 1/2. We have, |p1 p2| c log(n) n log log n =ap1 1 n 1 c log n n log log n = (54) log n 2 log log n 1 n 1 n c log n log log n 1 (n 1)2 log log n log n 1 n 1 n 2c 2 log log n > 0 where the last inequality holds for c(n 1)/n < 1/2 and sufficiently large n, as desired. We now proceed to the second part of the proof. n log n D(p||q) = n log n q1 + p2 log p2 = n log n(p1 p2) log p1 First, we have n log n(p1 p2) = n log n n log n log n/2n log log n 1 n 1 = n n 1 1 2 log log n 1 log n p2 = log(n 1) + log p1 1 p1 = log(n 1) + log log n 2n log log n log n = (57) log(n 1) + log log n 2 log(2n log log n log n). Kontorovich and Painsky Putting it all together we obtain n log n D(p||q) = (58) 1 2 log log n 1 log n (log(n 1) + log log n 2 log(2n log log n log n)) = 2 log log n log(n 1) 2 log log n log(2n log log n log n) 2 log log n + log(2n log log n log n) 2 + log(n 1) log(2n log log n log n) 2 log log n + log(2n log log n log n) log(n 1) log n log log n It is straightforward to show that the last three terms in the parenthesis above converge to zero for sufficiently large n, which leads to the stated result. Lemma 23 (Peres (2017)) When estimating a single Bernoulli parameter in the range [0, p0] for p0 1 2, Θ(p0ε 2 log(1/δ)) draws are both necessary and sufficient to achieve additive accuracy ε with probability at least 1 δ. 
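As a sanity check (ours, not part of the proof, assuming NumPy and with arbitrary parameter choices), the Θ(p₀ ε⁻² log(1/δ)) rate in Lemma 23 is easy to probe numerically: with n proportional to p₀ ε⁻² log(1/δ) the MLE rarely misses by more than ε, while a tenth of that sample size fails frequently.

```python
# Illustrative Monte Carlo check of the Theta(p0 * eps^-2 * log(1/delta))
# sample complexity in Lemma 23 (a sanity sketch only).
import numpy as np

rng = np.random.default_rng(4)
p0, eps, delta, reps = 0.1, 0.02, 0.05, 20_000
theta = p0                                    # hardest parameter in [0, p0]

for scale in (1.0, 0.1):
    n = int(scale * 2 * p0 * np.log(2 / delta) / eps ** 2)   # ~ p0 * eps^-2 * log(1/delta)
    theta_hat = rng.binomial(n, theta, size=reps) / n
    fail = np.mean(np.abs(theta_hat - theta) > eps)
    print(f"n={n}: empirical failure probability = {fail:.3f}")
```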
7.4.2 Bernstein inequalities Background: Let Y Bin(n, θ) be a Binomial random variable and let ˆθ = Y/n be the its MLE. Classic Bernstein (Boucheron et al., 2003): P ˆθ θ ε exp nε2 2θ(1 θ) + ε/3 with an analogous bound for the left tail. This implies: Empirical Bernstein (Dasgupta and Hsu, 2008a, Lemma 5): We are now ready to present the proof of Theorem 4. We assume without loss of generality that p is sorted in descending order: p1 p2 . . . and further, as per Remark 17, that p1 1/2. The estimate ˆpi is just the MLE based on n iid draws. Distribution Estimation under the Infinity Norm Our strategy for analyzing supi N |ˆpi pi| will be to break up p into the heavy masses, where we apply a maximal Bernstein-type inequality, and the light masses, where we apply a multiplicative Chernoff-type bound. We define the heavy masses as those with pi 1/n. Denote by I N the set of corresponding indices and note that |I| n. For i I, put Yi = ˆpi pi. Then (59) implies that each Yi satisfies (49) with vi = pi(1 pi)/n and ai = 1/(3n); trivially, maxi I ai log(i + 1) = log(n + 1)/(3n). Invoking Lemma 16 twice (once for Yi and again for Yi) together with the union bound, we have, with probability 1 δ, max i I |ˆpi pi| 2 δ + 4 log(n + 1) Next, we analyze the light masses. Our first segment consisted of the pi [n 1, 1]; these were the heavy masses. We take the next segment to consist of pi [(2n) 1, n 1], of which there are at most 2n atoms. The segment after that will be in the range [(4n) 1, (2n) 1], and, in general, the kth segment is in the range [(2kn) 1, (2k 1n) 1], and will contain at most 2kn atoms. To the kth segment, we apply the Chernoffbound P(ˆp p+ε) exp( n D(p+ε||p)), where p = (2kn) 1 and ε = εk = 2kpβ p, for some β > 1 to be specified below. [Note that D(αp||p) is monotonically increasing in p for fixed α, so we are justified in taking the left endpoint.] For this choice, in the kth segment we have D(p + ε||p) = D(2kpβ||p) = D β = (n β) log 2k(n β) + β log 2kβ (n β) log n β n + β log 2kβ since neglecting the 1/2k additive term in the denominator decreases the expression. Let E be the event that any of the pis in any of the segments k = 1, 2, . . . has a corresponding ˆpi that exceeds β/n. Then k=1 2kn exp (n β) log n β β log 2kβ = 2β βn n β For the choice β = log n, we have P(E) 2β βn n β n , n 10, (63) which is proved in Proposition 20. Now E is the event that supi:pi<1/n(ˆpi pi) log(n)/n. Since pi < 1/n, there is no need to consider the left-tail deviation at this scale, as all of the probabilities will be zero. Combining (62) with (63) yields (16). Since Lemma 18 implies that V ϕ(v ), (17) follows from (16). Finally, (18) follows from (16) via the obvious relation V log(n + 1)v . Kontorovich and Painsky 7.5 Proof of Theorem 6 We begin with an elementary observation: for N N and a, b [0, 1]N, we have max i [N] ai(1 ai) max i [N] bi(1 bi) max i [N] |ai bi| , and this also carries over to a, b [0, 1]N. Let us denote v := supi N pi(1 pi) and ˆv := supi N ˆpi(1 ˆpi). Together with (18), this implies |v ˆv | p ˆp a + b a = 4 3n log 2(n + 1) Following the proof of Lemma 5 of Dasgupta and Hsu (2008a), |v ˆv | a + b ˆv + |v ˆv | where we used v ˆv + |v ˆv | and x + y x + y. Now we have an expression of the form where A = |v ˆv |, B = b, C = a + b ˆv , which implies A B2 + B |v ˆv | b2 + a + b Using x + y x + y and xy (x + y)/2, |v ˆv | b2 + a + b ˆv + b a + b q ˆv + b a + b(b + = a + 3b2/2 + b a + 3b We still have ˆv + b2/2 + |v ˆv |/2, where xy (x + y)/2 was used. 
Hence, with probability 1 δ, ˆv + b2/2 + |v ˆv |/2 ˆv + b2/2 + (a + 3b2/2 + b a + 3b = 3a/2 + ab/2 + 5b2/4 + 7b Distribution Estimation under the Infinity Norm 7.6 A Proof of Proposition 8 The necessity of an additive term of order log(n)/(n log log n) can be intuited via a ballsin-bins analysis: If n balls are uniformly thrown into n bins, we expect a maximal load of about log(n)/(n log log n) balls in one of the bins (Raab and Steger, 1998). We proceed to formalize this intuition. Let µ1, . . . , µn be probability measures on N with support contained in [n], defined as follows. For a := log(n)/(2n log log n) and b := (1 a)/(n 1), we define, for i [n], µi(i) = a, for j = i, µi(j) = b. For each i [n], define the probability νi on Nn as the n-fold product measure µn i . It follows from Proposition 22 that log n = n D(µj||µk) log n n 1 2 and µj µk α := c log(n)/(n log log n) for j = k and n sufficiently large. Invoking Lemma 21 with r = n, θ(νj) = µj, ρ = , and β = 1 2 log n + o(log n) completes the proof. 7.7 A Proof of Proposition 9 The analysis relies on a result of Cohen and Kontorovich (2023, Theorem 2). Let Xi Bin(n, pi), i N be a sequence of independent binomials with 1/2 p1 p2 . . . and define Yi := n 1Xi pi. Then Cohen and Kontorovich (2023) showed2 that S lim inf n n E sup i N Yi, (64) where S := supi N pi log(i + 1) and c > 0 is an absolute constant. Since t 2t(1 t) for t [0, 1/2] and p1 1/2 (as per Remark 17), we have that V S/2. However, the Cohen and Kontorovich (2023) lower bound is not immediately applicable to our case, because (64) requires the binomials to be independent. Fortunately, their dependence is of the negative association type (Dubhashi and Ranjan, 1998, Theorem 14), which futher implies negative right orthant dependence (Proposition 5, ibid.). Finally, (Kontorovich, 2023, Proposition 4) shows that E sup i N Yi 1 2E sup i N Yi, (65) where the Yi are mutually independent and each one is distributed identically to its corresponding Yi. This completes the proof. Remark 24 The lower bound is only asymptotic (rather than finite-sample, in the sense of holding for all n) necessarily so. This is because even for a single binomial Y Bin(n, p), the behavior of E|Y np| is roughly np(1 p) for p / [1/n, 1 1/n] and p np(1 p) elsewhere (Berend and Kontorovich, 2013, Theorem 1). This precludes any finite-sample lower bound of the form E p ˆp c p 2. The theorem therein claimed this for E supi N |Yi| but in fact the proof shows this for E supi N Yi. Kontorovich and Painsky 7.8 A Proof of Proposition 11 It follows from Lemma 23 that E|p ˆp| c p p0/n for some universal c > 0. Since E supi N |pi ˆpi| supi N E|pi ˆpi|, this completes the proof. 7.9 Proof Theorem 12 We begin with the following proposition. Proposition 25 Let j = arg maxi ˆpi. Assume there exists Vδ(Xn) such that P |pj ˆpj| Vδ(Xn)|pj = p[1] δ, (66) where the conditioning pj = p[1] implies that the probability of the most frequent event in the sample is the maximal probability, p[1]. Then, E(Vδ(Xn)) zδ/2 p[1](1 p[1]) Proof Let j = argmaxiˆpi. Assume there exists Vδ(Xn) that satisfies (66) and E(Vδ(Xn)) < zδ/2 p[1](1 p[1]) From (66), we have that P |pj ˆpj| Vδ(Xn)|pj = p[1] = P |p[1] ˆpj| Vδ(Xn)|pj = p[1] δ. (67) Now, consider Y Bin(n, p[1]). Let Y n be a sample of n independent observations. Notice we can always extend the Binomial case to a multinomial setup with parameters p, over any alphabet size ||p||0. 
That is, given a sample Y n, we may replace every Y = 0 (or Y = 1) with a sample from a multinomial distribution over an alphabet size ||p||0 1. Further, we may focus on samples for which p[1] is the most likely event in the alphabet, and construct a CI for p[1] following (67). This means that we found a CI for p[1] with an expected length that is shorter than the CP CI, which contradicts its optimality. Now, assume there exists Uδ(Xn) that satisfies P (|pj ˆpj| Uδ(Xn)) δ. (68) E(Uδ(Xn)) < zδ/2 p[1](1 p[1]) For simplicity of notation, denote v = arg maxi pi as the symbol with the greatest probability in the alphabet. That is, pv = p[1]. We implicitly assume that v is unique, although the Distribution Estimation under the Infinity Norm proof holds in case of several maxima as well. We have that P (|pj ˆpj| Uδ(Xn)) = X u P (|pj ˆpj| Uδ(Xn)|j = u) P(j = u) = (70) P |p[1] ˆpj| Uδ(Xn)|j = v P(j = v)+ X u =v P (|pj ˆpj| Uδ(Xn)|j = u) P(j = u). Proposition 25 together with assumption (69) suggest that P |p[1] ˆpj| Uδ(Xn)|j = v > δ. On the other hand, it is well-known that ˆp[1] p[1] for sufficiently large n (Gelfand et al., 1992; Shifeng and Guoying, 2005; Xiong and Li, 2009). This means that P(j = v) 1 and (70) is bounded from below by δ, for sufficiently large n. This contradicts (67) as desired. 8. Discussion and Conclusions In this work we study distribution estimation under the ℓ norm. We introduce two datadependent upper bounds for the MLE, which significantly improve upon currently known results. Our first bound (Theorem 2) demonstrates favorable performance in small sample size and fixed δ regimes. However, its dependency in the data is somewhat involved, compared to our dream result (3). Our second bound (Theorem 6) improves the explicit dependency in the data, and demonstrates favorable performance in larger sample regimes, where δ decays with n. The above upper bounds are matched by nearly-optimal lower bounds, demonstrating the tightness of our analysis. Finally, we introduce an important application to our work in selective inference. We show that by utilizing ℓ results, we provide relatively tight confidence interval for the most frequent events in the sample. Acknowledgments A.K. is supported in part by the Israel Science Foundation (grant No. 1602/19), an Amazon Research Award and the Ben-Gurion University Data Science Research Center. A.P. is supported in part by the Israel Science Foundation (grant No. 963/21) Anna Ben-Hamou, Stéphane Boucheron, Mesrob I Ohannessian, et al. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 23(1):249 287, 2017. Yoav Benjamini and Daniel Yekutieli. False discovery rate adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100 (469):71 81, 2005. Daniel Berend and Aryeh Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254 1259, 2013. Kontorovich and Painsky Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid postselection inference. The Annals of Statistics, pages 802 837, 2013. Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer school on machine learning, pages 208 240. Springer, 2003. Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Summer school on machine learning, pages 169 207. Springer, 2003. 
US Census Bureau. Frequently occurring surnames from the census 2000. 2014. Djalil Chafai and Didier Concordet. Confidence regions for the multinomial parameter with small sample size. Journal of the American Statistical Association, 104(487):1071 1079, 2009. Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404 413, 1934. Doron Cohen and Aryeh Kontorovich. Local glivenko-cantelli. In The Thirty Sixth Annual Conference on Learning Theory, COLT, volume 195, page 715, 2023. Doron Cohen, Aryeh Kontorovich, and Geoffrey Wolfer. Learning discrete distributions with infinite support. In Advances in Neural Information Processing Systems 33, 6-12, 2020. Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pages 208 215, 2008a. Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208 215, 2008b. Devdatt Dubhashi and Desh Ranjan. Balls and bins: a study in negative dependence. Random Struct. Algorithms, 13(2):99 124, September 1998. ISSN 1042-9832. AE Gelfand, J Glaz, L Kuo, and T-M Lee. Inference for the maximum cell probability under multinomial sampling. Naval Research Logistics (NRL), 39(1):97 114, 1992. Leo A Goodman et al. Simultaneous confidence intervals for contrasts among multinomial populations. The Annals of Mathematical Statistics, 35(2):716 725, 1964. Ernst Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik, 1909(136):210 271, 1909. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Maximum likelihood estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 63(10):6774 6798, 2017. Leonid V Kantorovich. Mathematical methods of organizing and planning production. Management science, 6(4):366 422, 1960. Distribution Estimation under the Infinity Norm Aryeh Kontorovich. Decoupling maximal inequalities, 2023. Ivor John Maddox. Elements of functional analysis. CUP Archive, 1988. Matthew L Malloy, Ardhendu Tripathy, and Robert D Nowak. Optimal confidence regions for the multinomial parameter. ar Xiv preprint ar Xiv:2002.01044, 2020. David A Mc Allester and Robert E Schapire. On the convergence rate of Good-Turing estimators. In COLT, pages 1 6, 2000. Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pages 2143 2151, 2015. Amichai Painsky. Large alphabet inference. Information and Inference: A Journal of the IMA, 12(4), 2023a. Amichai Painsky. Quality assessment and evaluation criteria in supervised learning. Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook, pages 171 195, 2023b. Amichai Painsky and Gregory W Wornell. Bregman divergence bounds and universality properties of the logarithmic loss. IEEE Transactions on Information Theory, 66(3): 1658 1673, 2019. Yuval Peres. Learning a coin s bias (localized). Theoretical Computer Science Stack Exchange, 2017. https://cstheory.stackexchange.com/q/38931 (version: 2017-08-28). Charles P Quesenberry and DC Hurst. Large sample simultaneous confidence intervals for multinomial proportions. Technometrics, 6(2):191 195, 1964. 
Martin Raab and Angelika Steger. Balls into Bins - A simple and tight analysis. In Michael Luby, José D. P. Rolim, and Maria J. Serna, editors, Randomization and Approximation Techniques in Computer Science, volume 1518, pages 159 170. Springer, 1998. John A Rice. Mathematical statistics and data analysis. Cengage Learning, 2006. Alexander I Saichev, Yannick Malevergne, and Didier Sornette. Theory of Zipf s law and beyond, volume 632. Springer Science & Business Media, 2009. Xiong Shifeng and Li Guoying. Testing for the maximum cell probabilities in multinomial distributions. Science in China Series A: Mathematics, 48:972 985, 2005. Cristina P Sison and Joseph Glaz. Simultaneous confidence intervals and sample size determination for multinomial proportions. Journal of the American Statistical Association, 90(429):366 369, 1995. Maciej Skorski. Handy formulas for binomial moments. ar Xiv preprint ar Xiv:2012.06270, 2020. Kontorovich and Painsky Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The annals of mathematical statistics, 19(2):279 281, 1948. Karl R Stromberg. An introduction to classical real analysis, volume 376. American Mathematical Soc., 2015. Måns Thulin. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817 840, 2014. Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact postselection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600 620, 2016. Ramon Van Handel. Probability in high dimension. Lecture Notes (Princeton University), 2014. Weizhen Wang. Smallest confidence intervals for one binomial proportion. Journal of Statistical Planning and Inference, 136(12):4293 4306, 2006. Shi Feng Xiong and Guo Ying Li. Inference for ordered parameters in multinomial distributions. Science in China Series A: Mathematics, 52(3):526 538, 2009. Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: research papers in probability and statistics, pages 423 435, 1997.