Learning discrete distributions with infinite support

Doron Cohen
Department of Computer Science
Ben-Gurion University of the Negev
Beer-Sheva, Israel
doronv@post.bgu.ac.il

Aryeh Kontorovich
Department of Computer Science
Ben-Gurion University of the Negev
Beer-Sheva, Israel
karyeh@cs.bgu.ac.il

Geoffrey Wolfer
Department of Computer Science
Ben-Gurion University of the Negev
Beer-Sheva, Israel
geoffrey@post.bgu.ac.il

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. In a departure from the established paradigm, we make no structural assumptions whatsoever on the sampling distribution. In such a setting, distribution-free risk bounds are impossible, and the best one could hope for is a fully empirical, data-dependent bound. We derive precisely such bounds, and demonstrate that these are, in a well-defined sense, the best possible. Our main discovery is that the half-norm of the empirical distribution provides tight upper and lower estimates on the empirical risk. Furthermore, this quantity decays at a nearly optimal rate as a function of the true distribution. The optimality follows from a minimax result of possible independent interest. Additional structural results are provided, including an exact Rademacher complexity calculation and what is apparently the first connection between the total variation risk and the missing mass.

1 Introduction

Estimating a discrete distribution in the total variation (TV) metric is a central problem in computer science and statistics (see, e.g., Han et al. [2015], Kamath et al. [2015], Orlitsky and Suresh [2015] and the references therein). The TV metric, which we use throughout the paper, is a natural and abundantly motivated choice [Devroye and Lugosi, 2001]. For support size $d$, a sample of size $O(d/\varepsilon^2)$ suffices for the maximum-likelihood estimator (MLE) to be $\varepsilon$-close (with constant probability) to the unknown target distribution. A matching lower bound is known [Anthony and Bartlett, 1999], and has been computed down to the exact constants [Kamath et al., 2015]. Classic VC theory and, in particular, the aforementioned results imply that for infinite support, no distribution-free sample complexity bound is possible. If $\mu$ is the target distribution and $\hat\mu_m$ is its empirical (i.e., MLE) estimate based on $m$ iid samples, then Berend and Kontorovich [2013] showed that

$$\frac{1}{4}\Lambda_m(\mu) - \frac{1}{4\sqrt{m}} \;\le\; \mathbb{E}\left[\|\mu - \hat\mu_m\|_{\mathrm{TV}}\right] \;\le\; \Lambda_m(\mu), \qquad m \ge 2, \tag{1}$$

where

$$\Lambda_m(\mu) := \sum_{j \in \mathbb{N}:\, \mu(j) < 1/m} \mu(j) \;+\; \frac{1}{2\sqrt{m}} \sum_{j \in \mathbb{N}:\, \mu(j) \ge 1/m} \sqrt{\mu(j)}. \tag{2}$$

The quantity $\Lambda_m(\mu)$ has the advantage of always being finite and of decaying to $0$ as $m \to \infty$. The bound in (1) suggests that $\Lambda_m(\mu)$, or a closely related measure, controls the sample complexity for learning discrete distributions in TV. Further supporting the foregoing intuition is the observation that for finite support size $d$ and $m \ge 1$, we have $\Lambda_m(\mu) \le \sqrt{d/m}$, recovering the known minimax rate. Additionally, a closely related measure turns out to control a minimax risk rate in a sense made precise in Theorem 2.5. One shortcoming of (1) is that the lower bound only holds for the MLE, leaving open the possibility that a different estimator could achieve significantly improved bounds. Another shortcoming of (1) and related estimates is that they are not empirical, in that they depend on the unknown quantity we are trying to estimate.
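To make the quantity in (2) concrete, here is a small numerical sketch (ours, not part of the paper) evaluating $\Lambda_m(\mu)$ for the geometric distribution $\mu(j) = 2^{-j}$. The truncation at 60 atoms is an implementation convenience: the neglected tail mass, about $2^{-60}$, is below machine precision.

```python
import numpy as np

def lambda_m(mu: np.ndarray, m: int) -> float:
    """Lambda_m(mu) as in eq. (2): atoms below 1/m contribute their mass;
    atoms at or above 1/m contribute sqrt(mu(j)) / (2 sqrt(m))."""
    small = mu < 1.0 / m
    return mu[small].sum() + np.sqrt(mu[~small]).sum() / (2.0 * np.sqrt(m))

# Geometric target mu(j) = 2^{-j}, truncated far into the tail.
mu = 0.5 ** np.arange(1, 61)

for m in [10, 100, 1000, 10000]:
    print(m, lambda_m(mu, m))  # decays roughly like 1/sqrt(m) for this mu
```

For this target the sum of $\sqrt{\mu(j)}$ over the large atoms is bounded by a constant, so $\Lambda_m(\mu)$ decays at the parametric rate $1/\sqrt{m}$ even though the support is infinite.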
A fully empirical bound, on the other hand, would give a high-probability estimate on $\|\mu - \hat\mu_m\|_{\mathrm{TV}}$ solely in terms of observable quantities such as $\hat\mu_m$. Of course, such a bound should also be non-trivial, in the sense of improving with growing sample size and approaching $0$ as $m \to \infty$. A further desideratum might be something akin to instance optimality: we would like the rate at which the empirical bound decays to be the best possible for the given $\mu$, in an appropriate sense. Our analogue of instance optimality is inspired by, but distinct from, that of Valiant and Valiant [2016], as discussed in detail in Related work below.

Our contributions. We address the shortcomings of existing estimators detailed above by providing a fully empirical bound on $\|\mu - \hat\mu_m\|_{\mathrm{TV}}$. Our main discovery is that the quantity $\Phi_m(\hat\mu_m) := \frac{1}{\sqrt{m}} \sum_{j \in \mathbb{N}} \sqrt{\hat\mu_m(j)}$ satisfies all of the desiderata posed above for an empirical bound. As we show in Theorems 2.1 and 2.2, $\Phi_m(\hat\mu_m)$ provides tight, high-probability upper and lower bounds on $\|\mu - \hat\mu_m\|_{\mathrm{TV}}$. Further, Theorem 2.3 shows that $\mathbb{E}[\Phi_m(\hat\mu_m)]$ behaves as $\Lambda_m(\mu)$ defined in (2). Finally, a result in the spirit of instance optimality, Theorem 2.4, shows that no other estimator-bound pair can improve upon $(\hat\mu_m, \Phi_m)$, other than by small constants. The latter follows from a minimax bound of independent interest, Theorem 2.5. Additional structural results are provided, including an exact Rademacher complexity calculation and a connection (apparently the first) between the total variation risk and the missing mass.

Definitions, notation and setting. As we are dealing with discrete distributions, there is no loss of generality in taking our sample space to be the natural numbers $\mathbb{N} = \{1, 2, 3, \ldots\}$. For $k \in \mathbb{N}$, we write $[k] := \{i \in \mathbb{N} : i \le k\}$. The set of all distributions on $\mathbb{N}$ will be denoted by $\Delta_{\mathbb{N}}$, which we enlarge to include the deficient distributions: $\Delta_{\mathbb{N}} \subseteq \bar\Delta_{\mathbb{N}} := \{\mu \in [0,1]^{\mathbb{N}} : \sum_{i \in \mathbb{N}} \mu(i) \le 1\}$. For $d \in \mathbb{N}$, we write $\Delta_d \subseteq \Delta_{\mathbb{N}}$ to denote those $\mu$ whose support is contained in $[d]$. For $\mu \in \bar\Delta_{\mathbb{N}}$ and $I \subseteq \mathbb{N}$, we write $\mu(I) = \sum_{i \in I} \mu(i)$. We define the decreasing permutation of $\mu \in \bar\Delta_{\mathbb{N}}$, denoted by $\mu^\downarrow$, to be the sequence $(\mu(i))_{i \in \mathbb{N}}$ sorted in non-increasing order, achieved by a permutation¹ $\Pi_\mu : \mathbb{N} \to \mathbb{N}$; thus, $\mu^\downarrow(i) = \mu(\Pi_\mu(i))$. For $0 < \eta < 1$, define $T_\mu(\eta) \in \mathbb{N}$ as the least $t$ for which $\sum_{i > t} \mu^\downarrow(i) < \eta$. This induces a truncation of $\mu$, denoted by $\mu[\eta] \in \bar\Delta_{\mathbb{N}}$ and defined by $\mu[\eta](i) = \mathbb{1}[\Pi_\mu(i) \le T_\mu(\eta)]\,\mu(i)$. For $\mu, \nu \in \bar\Delta_{\mathbb{N}}$, we define the total variation distance in terms of the $\ell_1$ norm:

$$\|\mu - \nu\|_{\mathrm{TV}} := \frac{1}{2}\|\mu - \nu\|_1 = \frac{1}{2}\sum_{i \in \mathbb{N}} |\mu(i) - \nu(i)|. \tag{3}$$

For $\mu \in \bar\Delta_{\mathbb{N}}$, we also define the half-norm²

$$\|\mu\|_{1/2} := \left(\sum_{i \in \mathbb{N}} \sqrt{\mu(i)}\right)^2; \tag{4}$$

note that while $\|\mu\|_{1/2}$ may be infinite, we have $\|\mu\|_{1/2} \le \|\mu\|_0$, where the latter denotes the support size. For $m \in \mathbb{N}$ and $\mu \in \Delta_{\mathbb{N}}$, we write $X = (X_1, \ldots, X_m) \sim \mu^m$ to mean that the components of the vector $X$ are drawn iid from $\mu$. We reserve $\hat\mu_m \in \Delta_{\mathbb{N}}$ for the empirical measure induced by the sample $X$, i.e., $\hat\mu_m(i) := \frac{1}{m}\sum_{t \in [m]} \mathbb{1}[X_t = i]$; the term MLE will be used interchangeably.

¹While $\mu^\downarrow$ is uniquely defined, $\Pi_\mu$ is not. Uniqueness could be ensured by taking the lexicographically first permutation, but this will not be needed for our results.
²The half-norm is not a proper vector-space norm, as it lacks sub-additivity.
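The following minimal sketch (ours, for illustration) computes the objects just defined from a sample represented as a list of positive integers; the dictionary representation of measures is an implementation choice, not the paper's notation.

```python
import numpy as np
from collections import Counter

def empirical_measure(sample):
    """The MLE: maps each observed atom to its relative frequency."""
    m = len(sample)
    return {x: c / m for x, c in Counter(sample).items()}

def half_norm(mu):
    """||mu||_{1/2} = (sum_i sqrt(mu(i)))^2, eq. (4), for a dict-valued measure."""
    return np.sqrt(np.array(list(mu.values()))).sum() ** 2

def tv_distance(mu, nu):
    """Total variation distance, eq. (3)."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(i, 0.0) - nu.get(i, 0.0)) for i in support)

def phi_m(sample):
    """Phi_m(mu_hat) = (1 / sqrt(m)) * sum_j sqrt(mu_hat(j))."""
    mu_hat = empirical_measure(sample)
    return np.sqrt(np.array(list(mu_hat.values()))).sum() / np.sqrt(len(sample))
```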
For the class of boolean functions over the integers $\{f : \mathbb{N} \to \{0,1\}\}$, which we denote by $\{0,1\}^{\mathbb{N}}$, recall the definition of the empirical Rademacher complexity [Mohri et al., 2012, Definition 3.1] conditional on the sample $X$:

$$\hat{\mathfrak{R}}_m(X) := \mathbb{E}_\sigma\left[\sup_{f \in \{0,1\}^{\mathbb{N}}} \frac{1}{m}\sum_{t=1}^m \sigma_t f(X_t)\right], \tag{5}$$

where $\sigma = (\sigma_1, \ldots, \sigma_m) \sim \mathrm{Uniform}(\{-1, 1\}^m)$. The expectation of this random quantity is the Rademacher complexity [Mohri et al., 2012, Definition 3.2]:

$$\mathfrak{R}_m := \mathbb{E}_{X \sim \mu^m}\left[\hat{\mathfrak{R}}_m(X)\right]. \tag{6}$$
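Both quantities are computable from the sample alone. The following sketch (ours, for illustration) estimates $\hat{\mathfrak{R}}_m(X)$ by Monte Carlo over the signs $\sigma$. It uses the elementary fact that, for fixed signs, an optimal indicator function selects exactly the atoms whose sign-sums are positive; this reduction to per-atom sign-sums also underlies Lemma 3.2 below.

```python
import numpy as np
from collections import Counter

def empirical_rademacher(sample, n_sigma=5000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity (5) of
    {0,1}^N on the sample.  For fixed signs sigma,
        sup_f sum_t sigma_t f(X_t) = sum_x max(0, sum_{t: X_t = x} sigma_t),
    since an optimal f includes atom x iff its sign-sum is positive."""
    rng = np.random.default_rng(seed)
    m = len(sample)
    counts = list(Counter(sample).values())
    total = 0.0
    for _ in range(n_sigma):
        # one independent sign per occurrence, grouped by atom
        total += sum(max(0, rng.choice([-1, 1], size=c).sum()) for c in counts)
    return total / (n_sigma * m)
```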
Related work. Given the classical nature of the problem, a comprehensive literature survey is beyond our scope; the standard texts Devroye and Györfi [1985], Devroye and Lugosi [2001] provide much of the requisite background. Chapter 6.5 of the latter makes a compelling case for the TV metric used in this paper, but see Waggoner [2015] and the works cited therein for results on other $\ell_p$ norms. Though surveying all of the relevant literature is a formidable task, a relatively streamlined narrative may be distilled. Conceptually, the simplest case is that of $\|\mu\|_0 < \infty$ (i.e., finite support). Since learning a distribution over $[d]$ in TV is equivalent to agnostically learning the function class $\{0,1\}^d$, standard VC theory [Anthony and Bartlett, 1999, Kontorovich and Pinelis, 2019] entails that the MLE achieves the minimax risk rate of $\sqrt{d/m}$ over all $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu\|_0 \le d$. An immediate consequence is that in order to obtain quantitative risk rates for the case of infinite support, one must assume some sort of structure [Diakonikolas, 2016]. One can, for example, obtain minimax rates for $\mu$ with bounded entropy [Han et al., 2015] or, say, bounded half-norm (as we do here). Alternatively, one can restrict one's attention to a finite class $Q \subset \Delta_{\mathbb{N}}$; here too, optimal results are known [Bousquet et al., 2019]. Berend and Kontorovich [2013] was one of the few works that made no assumptions on $\mu \in \Delta_{\mathbb{N}}$, but it only gave non-empirical bounds. Our work departs from the paradigm of a-priori constraints on the unknown sampling distribution. Instead, our estimates hold for all $\mu \in \Delta_{\mathbb{N}}$. Of course, this must come at a price: no a-priori sample complexity bounds are possible in this setting. Absent any prior knowledge regarding $\mu$, one can only hope for sample-dependent empirical bounds, and we indeed obtain these. Further, our empirical bounds are essentially the best possible, as formalized in Theorem 2.4. The latter result may be thought of as a learning-theoretic analogue of being instance-optimal, as introduced by Valiant and Valiant [2017] in the testing framework. Instance optimality is a very natural notion in the context of testing whether an unknown sampling distribution $\mu$ is identical to or $\varepsilon$-far from a given reference one, $\mu_0$. For example, Valiant and Valiant discovered that a truncated $2/3$-norm of $\mu_0$, i.e., a quantity closely related to $\|\mu_0\|_{2/3}$, controls the complexity of the testing problem in TV distance. Instance optimality is more difficult to formalize for distribution learning, since for any given $\mu \in \Delta_{\mathbb{N}}$, there is a trivial learner with $\mu$ hard-coded inside. Valiant and Valiant [2016] defined this notion in terms of competing against an oracle who knows the distribution up to a permutation of the atoms, and did not provide empirical confidence intervals. We do derive fully empirical bounds, and further show that they are impossible to improve upon by any estimator, other than by constants. Our results suggest that the half-norm $\|\mu\|_{1/2}$ plays a role in learning analogous to that of $\|\mu_0\|_{2/3}$ in testing. As an intriguing aside, we note that the half-norm corresponds to the Tsallis $q$-entropy with $q = 1/2$, which was shown to be an optimal regularizer in some stochastic and adversarial bandit settings [Zimmert and Seldin, 2019]. We leave the question of investigating a deeper connection between the two results for future work.

2 Main results

In this section, we formally state our main results. Recall from the Definitions that the sample $X = (X_1, \ldots, X_m) \sim \mu^m$ induces the empirical measure (MLE) $\hat\mu_m$, and that a key quantity in our bounds is

$$\Phi_m(\hat\mu_m) = \frac{1}{\sqrt{m}}\,\|\hat\mu_m\|_{1/2}^{1/2} = \frac{1}{\sqrt{m}}\sum_{j \in \mathbb{N}} \sqrt{\hat\mu_m(j)}. \tag{7}$$

Our first result is a fully empirical, high-probability upper bound on $\|\hat\mu_m - \mu\|_{\mathrm{TV}}$ in terms of $\Phi_m(\hat\mu_m)$:

Theorem 2.1. For all $m \in \mathbb{N}$, $\delta \in (0,1)$, and $\mu \in \Delta_{\mathbb{N}}$, we have that

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \le \Phi_m(\hat\mu_m) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}$$

holds with probability at least $1 - \delta$. We also have $\mathbb{E}[\|\hat\mu_m - \mu\|_{\mathrm{TV}}] \le \mathbb{E}[\Phi_m(\hat\mu_m)]$.

Since $\|\hat\mu_m\|_{1/2} \le \|\hat\mu_m\|_0 \le \|\mu\|_0$, this recovers the minimax rate of $\sqrt{d/m}$ for $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu\|_0 \le d$. We also provide a matching lower bound:

Theorem 2.2. For all $m \in \mathbb{N}$, $\delta \in (0,1)$, and $\mu \in \Delta_{\mathbb{N}}$, we have that

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \ge \frac{1}{4\sqrt{2}}\,\Phi_m(\hat\mu_m) - \frac{7}{2}\sqrt{\frac{\ln(2/\delta)}{2m}}$$

holds with probability at least $1 - \delta$.

Our empirical measure $\Phi_m(\hat\mu_m)$ is never much worse than the non-empirical $\Lambda_m(\mu)$, defined in (2):

Theorem 2.3. For all $m \in \mathbb{N}$ and $\mu \in \Delta_{\mathbb{N}}$ we have $\mathbb{E}[\Phi_m(\hat\mu_m)] \le 2\Lambda_m(\mu)$ and, with probability at least $1 - \delta$, $\Phi_m(\hat\mu_m) \le 2\Lambda_m(\mu) + \sqrt{2\ln(1/\delta)/m}$.

Furthermore, no other estimator-bound pair $(\tilde\mu_m, \Psi_m)$ can improve upon $(\hat\mu_m, \Phi_m)$, other than by a constant. This is the instance optimality result alluded to above:

Theorem 2.4. There exist universal constants $a, b > 0$ such that the following holds. For any estimator-bound pair $(\tilde\mu_m, \Psi_m)$ and any continuous function $\theta : \mathbb{R}_+ \to \mathbb{R}_+$ such that

$$\mathbb{E}[\|\tilde\mu_m - \mu\|_{\mathrm{TV}}] \le \mathbb{E}[\Psi_m(\tilde\mu_m)] \le \theta\!\left(\mathbb{E}[\Phi_m(\hat\mu_m)]\right)$$

holds for all $\mu \in \Delta_{\mathbb{N}}$, $\theta$ necessarily verifies $\inf_{0 < r < b} \theta(r)/r \ge 1/a$.

Theorem 2.5. There is a universal constant $C > 0$ such that for all $\Lambda \ge 2$ and $0 < \varepsilon, \delta < 1$, the MLE $\hat\mu_m$ verifies the following optimality property. For all $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu[2\varepsilon\delta/9]\|_{1/2} \le \Lambda$, we have

$$m \ge C\varepsilon^{-2}\max\{\Lambda, \log(1/\delta)\} \implies \mathbb{P}\left(\|\hat\mu_m - \mu\|_{\mathrm{TV}} < \varepsilon\right) \ge 1 - \delta.$$

On the other hand, for any estimator $\tilde\mu_m : \mathbb{N}^m \to \Delta_{\mathbb{N}}$ there is a $\mu \in \Delta_{\mathbb{N}}$ with $\max\left\{\|\mu[\varepsilon/18]\|_{1/2},\, \|\mu[2\varepsilon\delta/9]\|_{1/2}\right\} \le \Lambda$ such that

$$m < C\varepsilon^{-2}\min\{\Lambda, \log(1/\delta)\} \implies \mathbb{P}\left(\|\tilde\mu_m - \mu\|_{\mathrm{TV}} < \varepsilon\right) < \min\{3/4,\, 1 - \delta\}.$$

The above is a simplified statement chosen for brevity; a considerably refined version is stated and proved in Theorem 3.1.
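As a quick illustration of Theorems 2.1 and 2.2 (a sketch of ours, not from the paper; the constants are taken from the statements above), the following snippet draws a sample from a geometric target, computes the exact TV error of the MLE, and compares it with the empirical bounds at confidence $\delta = 0.05$. All parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, delta = 2000, 0.05

# Sample from the geometric target mu(j) = 2^{-j}, j = 1, 2, ...
sample = rng.geometric(p=0.5, size=m)
atoms, counts = np.unique(sample, return_counts=True)
mu_hat = counts / m

# Exact TV error: atoms never sampled contribute their full (known) mass.
mu_true = 0.5 ** atoms.astype(float)
tv = 0.5 * (np.abs(mu_hat - mu_true).sum() + (1.0 - mu_true.sum()))

phi = np.sqrt(mu_hat).sum() / np.sqrt(m)
dev = np.sqrt(np.log(2 / delta) / (2 * m))
upper = phi + 3 * dev                       # Theorem 2.1
lower = phi / (4 * np.sqrt(2)) - 3.5 * dev  # Theorem 2.2

print(f"TV = {tv:.4f} in [{max(lower, 0.0):.4f}, {upper:.4f}]")
```

Note that the lower bound only becomes informative once $m$ is large enough that the deviation term is smaller than $\Phi_m(\hat\mu_m)/(4\sqrt{2})$.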
3.1 Proof of Theorem 2.1

The proof consists of two parts. The first is contained in Lemma 3.1, which provides a high-probability empirical upper bound and an expectation bound, similar to Theorem 2.1 but in terms of $\hat{\mathfrak{R}}_m(X)$ instead of $\Phi_m(\hat\mu_m)$. The second part, contained in Lemma 3.2, provides an estimate of $\hat{\mathfrak{R}}_m(X)$ in terms of $\Phi_m(\hat\mu_m)$.

Lemma 3.1. For all $m \in \mathbb{N}$, $\delta \in (0,1)$, and $\mu \in \Delta_{\mathbb{N}}$, we have that

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \le 2\hat{\mathfrak{R}}_m(X) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}$$

holds with probability at least $1 - \delta$. We also have

$$\mathbb{E}[\|\hat\mu_m - \mu\|_{\mathrm{TV}}] \le 2\mathfrak{R}_m. \tag{8}$$

Proof. The high-probability bound follows from the observation

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} = \sup_{A \subseteq \mathbb{N}} \left(\mu(A) - \hat\mu_m(A)\right) = \sup_{f \in \mathcal{F}} \left(\mathbb{E}_{X \sim \mu}[f(X)] - \frac{1}{m}\sum_{t=1}^m f(X_t)\right), \qquad \mathcal{F} := \{\mathbb{1}_A \mid A \subseteq \mathbb{N}\} = \{0,1\}^{\mathbb{N}}, \tag{9}$$

combined with [Mohri et al., 2012, Theorem 3.3], which states: let $\mathcal{G}$ be a family of functions from $Z$ to $[0,1]$ and let $\nu$ be a distribution supported on a subset of $Z$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over $Z = (Z_1, \ldots, Z_m) \sim \nu^m$, the following holds for all $g \in \mathcal{G}$:

$$\mathbb{E}_{Z \sim \nu}[g(Z)] \le \frac{1}{m}\sum_{i=1}^m g(Z_i) + 2\hat{\mathfrak{R}}_m(Z) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

Plugging in $\mathcal{F}$ for $\mathcal{G}$ and $\mu$ for $\nu$ in the above theorem completes the proof of the high-probability bound. The expectation bound (8) follows from the observation (9) and a symmetrization argument [Mohri et al., 2012, eqs. (3.8) to (3.13)]. In order to complete the proof of Theorem 2.1, we apply:

Lemma 3.2 (Empirical Rademacher estimates). Let $X = (X_1, \ldots, X_m)$ and let $\hat\mu_m$ be the empirical measure constructed from the sample $X$. Then,

$$\frac{1}{2\sqrt{2}}\,\Phi_m(\hat\mu_m) \;\le\; \hat{\mathfrak{R}}_m(X) \;\le\; \frac{1}{2}\,\Phi_m(\hat\mu_m).$$

Proof. The proof is based on an argument that was also developed in [Scott and Nowak, 2006, Section 7.1, Appendix E] in the context of histograms and dyadic decision trees, and that was credited to Gilles Blanchard. Let $\hat S = \{X_i \mid i \in [m]\}$ be the empirical support of the sample $X = (X_1, \ldots, X_m)$. Then,

$$m\hat{\mathfrak{R}}_m(X) = \mathbb{E}_\sigma\left[\sup_{f \in \{0,1\}^{\mathbb{N}}} \sum_{i=1}^m \sigma_i f(X_i)\right] = \mathbb{E}_\sigma\left[\sup_{A \subseteq \mathbb{N}} \sum_{i=1}^m \sigma_i \mathbb{1}_A(X_i)\right] = \sum_{x \in \hat S} \mathbb{E}_\sigma\left[\max\left\{0, \sum_{i:\, X_i = x} \sigma_i\right\}\right] = \frac{1}{2}\sum_{x \in \hat S} \mathbb{E}_\sigma\left|\sum_{i:\, X_i = x} \sigma_i\right|,$$

where the third equality follows from grouping the indices $\{i : X_i = x\}$ by atom (an optimal $A$ includes exactly those atoms whose sign-sums are positive), and the last from the symmetry of the random variable $\sum_{i=1}^n \sigma_i$ for every $n \in \mathbb{N}$, so that $\mathbb{E}\max\{0, S\} = \frac{1}{2}\mathbb{E}|S|$ for symmetric $S$. Now, by Khintchine's inequality, for $0 < p < \infty$ and $x_1, \ldots, x_m \in \mathbb{C}$ we have

$$A_p\left(\sum_{i=1}^m |x_i|^2\right)^{1/2} \le \left(\mathbb{E}_\sigma\left|\sum_{i=1}^m \sigma_i x_i\right|^p\right)^{1/p} \le B_p\left(\sum_{i=1}^m |x_i|^2\right)^{1/2},$$

where $A_p, B_p > 0$ are constants depending only on $p$. Sharp values for $A_p, B_p$ were found by Haagerup [1981]; in particular, for $p = 1$ he found $A_1 = 1/\sqrt{2}$ and $B_1 = 1$. Applying Khintchine's inequality to each $\mathbb{E}_\sigma\left|\sum_{i=1}^{m\hat\mu_m(x)} \sigma_i\right|$ with these constants, we get

$$\frac{1}{\sqrt{2}}\sqrt{m\hat\mu_m(x)} \;\le\; \mathbb{E}_\sigma\left|\sum_{i=1}^{m\hat\mu_m(x)} \sigma_i\right| \;\le\; \sqrt{m\hat\mu_m(x)},$$

and hence

$$\frac{1}{2\sqrt{2}}\sum_{x \in \hat S}\sqrt{m\hat\mu_m(x)} \;\le\; m\hat{\mathfrak{R}}_m(X) \;\le\; \frac{1}{2}\sum_{x \in \hat S}\sqrt{m\hat\mu_m(x)}.$$

Dividing by $m$ completes the proof.

Remark: We also give an exact expression for $\hat{\mathfrak{R}}_m(X)$ in Lemma A.1, and show in Corollary A.1, via a more delicate analysis, that $\hat{\mathfrak{R}}_m(X)$ is sandwiched between quantities of the form $\frac{1}{\sqrt{2\pi}}\,\|\hat\mu_m\|_{1/2}^{1/2}/\sqrt{m}$, up to correction terms of order $m^{-3/2}$.

3.2 Proof of Theorem 2.2

The proof follows from applying the lower bound of Lemma 3.2 to the following lemma:

Lemma 3.3 (Lower bound by empirical Rademacher complexity). For all $m \in \mathbb{N}$, $\delta \in (0,1)$, and $\mu \in \Delta_{\mathbb{N}}$, we have that

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \ge \frac{1}{2}\hat{\mathfrak{R}}_m(X) - \frac{7}{2}\sqrt{\frac{\ln(2/\delta)}{2m}}$$

holds with probability at least $1 - \delta$.

Proof. The proof is closely based on [Wainwright, 2019, Proposition 4.12], which states: let $Y = (Y_1, \ldots, Y_m) \sim \nu^m$ for some distribution $\nu$ on $Z$, let $\mathcal{G} \subseteq [-b, b]^Z$ be a function class, and let $\sigma = (\sigma_1, \ldots, \sigma_m) \sim \mathrm{Uniform}(\{-1,1\}^m)$. Then, for any $t > 0$,

$$\sup_{g \in \mathcal{G}}\left|\mathbb{E}_{Y \sim \nu}[g(Y)] - \frac{1}{m}\sum_{i=1}^m g(Y_i)\right| \;\ge\; \frac{1}{2}\mathfrak{R}_m(\mathcal{G}) - \frac{\sup_{g \in \mathcal{G}} \left|\mathbb{E}_{Y \sim \nu}[g(Y)]\right|}{2\sqrt{m}} - t \tag{10}$$

holds with probability at least $1 - e^{-mt^2/(2b^2)}$. Plugging in $X$ for $Y$, $\mu$ for $\nu$, $\mathbb{N}$ for $Z$, $1$ for $b$, and $\mathcal{F} := \{\mathbb{1}_A \mid A \subseteq \mathbb{N}\} = \{0,1\}^{\mathbb{N}}$ for $\mathcal{G}$ in (10), together with observing that

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} = \sup_{A \subseteq \mathbb{N}}\left(\mu(A) - \hat\mu_m(A)\right) = \sup_{f \in \mathcal{F}}\left(\mathbb{E}_{X \sim \mu}[f(X)] - \frac{1}{m}\sum_{i=1}^m f(X_i)\right) \quad\text{and}\quad \sup_{f \in \mathcal{F}} \mathbb{E}_{X \sim \mu}[f(X)] = 1,$$

followed by the choice $t = \sqrt{2\ln(2/\delta)/m}$ and some algebraic manipulation, we get

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \ge \frac{1}{2}\mathfrak{R}_m - \frac{1}{2\sqrt{m}} - \sqrt{\frac{2\ln(2/\delta)}{m}} \tag{11}$$

with probability at least $1 - \delta/2$. Applying McDiarmid's inequality to the $1/m$-bounded-differences function $\hat{\mathfrak{R}}_m(X)$ (similarly to [Mohri et al., 2012, eq. (3.14)]), we get

$$\mathfrak{R}_m \ge \hat{\mathfrak{R}}_m(X) - \sqrt{\frac{\ln(2/\delta)}{2m}} \tag{12}$$

with probability at least $1 - \delta/2$. To conclude the proof, combine (11) and (12) with the union bound to get

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \ge \frac{1}{2}\hat{\mathfrak{R}}_m(X) - \frac{5}{2}\sqrt{\frac{\ln(2/\delta)}{2m}} - \frac{1}{2\sqrt{m}}$$

with probability at least $1 - \delta$, and use the fact that $\frac{1}{2\sqrt{m}} \le \sqrt{\frac{\ln(2/\delta)}{2m}}$ for all $m \in \mathbb{N}$, $\delta \in (0, 1)$, since $\ln(2/\delta) > 1/2$.

Remark 3.1. We note that by using a more careful analysis, the constants of Theorem 2.2 can be improved to yield, under the same assumptions,

$$\|\hat\mu_m - \mu\|_{\mathrm{TV}} \ge \frac{1}{2}\hat{\mathfrak{R}}_m(X) - \frac{1}{4\sqrt{m}} - 3\sqrt{\frac{\ln(2/\delta)}{2m}}$$

with probability at least $1 - \delta$.

3.3 Proof of Theorem 2.3

Invoking Fubini's theorem, we write

$$\mathbb{E}[\Phi_m(\hat\mu_m)] = \frac{1}{\sqrt{m}}\,\mathbb{E}\left[\|\hat\mu_m\|_{1/2}^{1/2}\right] = \frac{1}{m}\sum_{i=1}^\infty \mathbb{E}_{X \sim \mathrm{Bin}(m, \mu(i))}\left[\sqrt{X}\right].$$

Since $X \in \{0, 1, 2, \ldots\}$, we have $\sqrt{X} \le X$ and hence $\mathbb{E}[\sqrt{X}] \le \mathbb{E}[X]$. On the other hand, Jensen's inequality implies $\mathbb{E}[\sqrt{X}] \le \sqrt{\mathbb{E}[X]}$, whence

$$\frac{1}{\sqrt{m}}\,\mathbb{E}\left[\|\hat\mu_m\|_{1/2}^{1/2}\right] \le \frac{1}{m}\sum_{i=1}^\infty \min\left\{m\mu(i),\, \sqrt{m\mu(i)}\right\} \tag{13}$$

$$= \sum_{i:\, \mu(i) \le 1/m} \mu(i) + \frac{1}{\sqrt{m}}\sum_{i:\, \mu(i) > 1/m} \sqrt{\mu(i)} \;\le\; 2\Lambda_m(\mu). \tag{14}$$

The high-probability bound follows from applying McDiarmid's inequality to the $2/m$-bounded-differences function $\Phi_m$: for all $\delta \in (0,1)$, with probability at least $1 - \delta$ we have $\Phi_m(\hat\mu_m) \le \mathbb{E}[\Phi_m(\hat\mu_m)] + \sqrt{2\ln(1/\delta)/m}$, which together with (14) completes the proof.
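The two elementary moment bounds driving (13) are easy to check numerically. The following sketch (ours, for illustration) compares $\mathbb{E}[\sqrt{X}]$ with $\min\{\mathbb{E}X, \sqrt{\mathbb{E}X}\}$ for binomial $X$ across the two regimes $\mu(i) < 1/m$ and $\mu(i) > 1/m$:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1000

# E[sqrt(X)] <= min(E[X], sqrt(E[X])) for X ~ Bin(m, p): the two regimes
# of display (13), with the crossover at p = 1/m.
for p in [0.0005, 0.001, 0.01, 0.3]:
    x = rng.binomial(m, p, size=200_000)
    print(f"p = {p}: E[sqrt(X)] = {np.sqrt(x).mean():.4f}"
          f" <= {min(m * p, np.sqrt(m * p)):.4f}")
```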
3.4 Statement and proof of the refined version of Theorem 2.5

Theorem 3.1. There is a universal constant $C > 0$ such that for all $\Lambda \ge 2$ and $0 < \varepsilon, \delta < 1$, the MLE verifies the following optimality property. For all $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu[2\varepsilon\delta/9]\|_{1/2} \le \Lambda$, if $(X_1, \ldots, X_m) \sim \mu^m$ and $m \ge \frac{C}{\varepsilon^2}\max\{\Lambda, \ln\delta^{-1}\}$, then $\|\hat\mu_m - \mu\|_{\mathrm{TV}} < \varepsilon$ holds with probability at least $1 - \delta$. On the other hand, for all $\Lambda \ge 2$, $0 < \varepsilon < 1/16$, and $0 < \delta < 1$: for any estimator $\tilde\mu : \mathbb{N}^m \to \Delta_{\mathbb{N}}$ there is a $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu[\varepsilon/18]\|_{1/2} \le \Lambda$ such that $\tilde\mu$ requires at least $m \ge \frac{C}{\varepsilon^2}\Lambda$ samples in order for $\|\tilde\mu - \mu\|_{\mathrm{TV}} < \varepsilon$ to hold with probability at least $3/4$; and for any estimator $\tilde\nu : \mathbb{N}^m \to \Delta_{\mathbb{N}}$ there is a $\nu \in \Delta_{\mathbb{N}}$ with $\|\nu[2\varepsilon\delta/9]\|_{1/2} \le \Lambda$ such that $\tilde\nu$ requires at least $m \ge \frac{C}{\varepsilon^2}\ln\delta^{-1}$ samples in order for $\|\tilde\nu - \nu\|_{\mathrm{TV}} < \varepsilon$ to hold with probability at least $1 - \delta$.

Minimax risk. For any $\Lambda \in [2, \infty)$ and $0 < \varepsilon, \delta < 1$, we define the minimax risk

$$\mathcal{R}_m(\Lambda, \varepsilon, \delta) := \inf_{\tilde\mu}\; \sup_{\mu:\, \|\mu[2\varepsilon\delta/9]\|_{1/2} < \Lambda} \mathbb{P}_{X \sim \mu^m}\left(\|\tilde\mu - \mu\|_{\mathrm{TV}} > \varepsilon\right),$$

where the infimum is taken over all estimators $\tilde\mu : \mathbb{N}^m \to \Delta_{\mathbb{N}}$, and the supremum is taken over the subset of distributions satisfying $\|\mu[2\varepsilon\delta/9]\|_{1/2} < \Lambda$.

Upper bound. Let $\Lambda \in [2, \infty)$, $0 < \varepsilon, \delta < 1$, and $\mu \in \Delta_{\mathbb{N}}$ with $\|\mu[2\varepsilon\delta/9]\|_{1/2} \le \Lambda$; let $m \in \mathbb{N}$, $(X_1, \ldots, X_m) \sim \mu^m$, and let $\hat\mu_m$ be the MLE. Set $\eta := 2\varepsilon\delta/9$ and consider the two truncated distributions $\mu[\eta]$ and $\hat\mu'_m$, where we define the latter by $\hat\mu'_m(i) := \hat\mu_m(i)\,\mathbb{1}[\mu[\eta](i) > 0]$, $i \in \mathbb{N}$. By the triangle inequality, $\mathbb{P}(\|\hat\mu_m - \mu\|_{\mathrm{TV}} > \varepsilon) \le \mathbb{P}(E_1 + E_2 + E_3 > \varepsilon)$, where

$$E_1 := \|\hat\mu_m - \hat\mu'_m\|_{\mathrm{TV}}, \qquad E_2 := \|\hat\mu'_m - \mu[2\varepsilon\delta/9]\|_{\mathrm{TV}}, \qquad E_3 := \|\mu[2\varepsilon\delta/9] - \mu\|_{\mathrm{TV}}.$$

By Markov's inequality,

$$\mathbb{P}(E_1 > \varepsilon/3) \le \frac{3}{\varepsilon}\,\mathbb{E}\left[\|\hat\mu_m - \hat\mu'_m\|_{\mathrm{TV}}\right] = \frac{3}{2\varepsilon}\,\mathbb{E}\left[\sum_{i \in \mathbb{N}:\, \Pi_\mu(i) > T_\mu(\eta)} \hat\mu_m(i)\right] = \frac{3}{2\varepsilon}\,\mathbb{E}\left[\frac{1}{m}\sum_{t=1}^m \mathbb{1}[\Pi_\mu(X_t) > T_\mu(\eta)]\right] = \frac{3}{2\varepsilon}\,\mathbb{P}\left(\Pi_\mu(X_1) > T_\mu(\eta)\right) < \frac{3}{2\varepsilon}\cdot\frac{2\varepsilon\delta}{9} = \frac{\delta}{3}.$$

Moreover, $E_3 = \frac{1}{2}\sum_{i > T_\mu(\eta)} \mu^\downarrow(i) < \frac{\varepsilon\delta}{9} \le \frac{\varepsilon}{3}$. In order to apply the union bound, it remains to handle $\mathbb{P}(E_2 > \varepsilon/3)$. This is achieved in two standard steps. The first follows an argument similar to that of [Berend and Kontorovich, 2013, Lemma 5], which bounds the quantity from above in expectation using Jensen's inequality:

$$\mathbb{E}[E_2] \le \frac{\left\|\mu[2\varepsilon\delta/9]\right\|_{1/2}^{1/2}}{\sqrt{m}} \le \sqrt{\frac{\Lambda}{m}}.$$

An application of McDiarmid's inequality controls the fluctuations around the expectation [Berend and Kontorovich, 2013, (7.5)] and concludes the proof.

Sample complexity lower bound $m = \Omega(\log(\delta^{-1})/\varepsilon^2)$. See Lemma B.1.

Sample complexity lower bound $m = \Omega(\Lambda/\varepsilon^2)$. Let $\varepsilon \in (0, 1/16)$ and $\Lambda > 2$. First observe that $\Lambda/2 \le 2\lfloor\Lambda/2\rfloor \le \Lambda$ and $2\lfloor\Lambda/2\rfloor \in 2\mathbb{N}$. As a result,

$$\mathcal{R}_m(\Lambda, \varepsilon, \delta) \overset{(i)}{\ge} \inf_{\tilde\mu}\;\sup_{\mu:\, \|\mu[2\varepsilon\delta/9]\|_{1/2} \le 2\lfloor\Lambda/2\rfloor} \mathbb{P}_{X \sim \mu^m}\left(\|\tilde\mu - \mu\|_{\mathrm{TV}} > \varepsilon\right) \overset{(ii)}{\ge} \inf_{\tilde\mu}\;\sup_{\mu \in \Delta_{2\lfloor\Lambda/2\rfloor}} \mathbb{P}_{X \sim \mu^m}\left(\|\tilde\mu - \mu\|_{\mathrm{TV}} > \varepsilon\right) \overset{(iii)}{\ge} \frac{1}{2}\left(1 - \frac{Cm\varepsilon^2}{2\lfloor\Lambda/2\rfloor}\right),$$

where (i) and (ii) follow from taking the supremum over increasingly smaller sets, (iii) is Lemma B.2 invoked for $2\lfloor\Lambda/2\rfloor \in \mathbb{N}$, and $C > 0$ is a universal constant. To conclude, since $2\lfloor\Lambda/2\rfloor \ge \Lambda/2$,

$$m \le \frac{\Lambda}{4C\varepsilon^2} \implies \mathcal{R}_m(\Lambda, \varepsilon, \delta) \ge 1/4,$$

which yields the second lower bound.

Remark: The universal constant in the lower bound obtained by Tsybakov's method in Lemma B.2 is suboptimal, and we give a short proof in the appendix for completeness. We refer the reader to the more involved methods of Kamath et al. [2015] for obtaining tighter bounds.

3.5 Proof of Theorem 2.4

Let $d \in 2\mathbb{N}$ and $m \in \mathbb{N}$, and restrict the problem to $\mu \in \Delta_d$. Let $\varepsilon \in (0, 1/16)$. By Lemma B.2,

$$\mathcal{R}_m(d, \varepsilon) := \inf_{\tilde\mu}\sup_{\mu \in \Delta_d} \mathbb{P}\left(\|\tilde\mu - \mu\|_{\mathrm{TV}} > \varepsilon\right) \ge \frac{1}{2}\left(1 - \frac{Cm\varepsilon^2}{d}\right)$$

for some universal $C > 0$, whence Markov's inequality yields

$$\varepsilon\,\mathcal{R}_m(d, \varepsilon) \le \inf_{\tilde\mu}\sup_{\mu \in \Delta_d} \mathbb{E}\left[\|\tilde\mu - \mu\|_{\mathrm{TV}}\right].$$

Restrict $m \ge d/b^2$, with $b := \sqrt{3C}/16$, and set $\varepsilon = \sqrt{\frac{d}{3Cm}}$ (so that $\varepsilon \le 1/16$ and $\mathcal{R}_m(d, \varepsilon) \ge 1/3$), so that

$$\inf_{\tilde\mu}\sup_{\mu \in \Delta_d} \mathbb{E}\left[\|\tilde\mu - \mu\|_{\mathrm{TV}}\right] \ge \frac{1}{a}\sqrt{\frac{d}{m}}, \qquad \text{where } a := 3\sqrt{3C}. \tag{15}$$

Suppose that $\theta(\sqrt{d/m}) < \frac{1}{a}\sqrt{d/m}$. Then, by hypothesis and since $\mathbb{E}[\Phi_m(\hat\mu_m)] \le \sqrt{d/m}$ for all $\mu \in \Delta_d$,

$$\inf_{\tilde\mu}\sup_{\mu \in \Delta_d} \mathbb{E}\left[\|\tilde\mu - \mu\|_{\mathrm{TV}}\right] \le \sup_{\mu \in \Delta_d} \mathbb{E}\left[\Psi_m(\tilde\mu_m)\right] \le \sup_{\mu \in \Delta_d} \theta\left(\mathbb{E}[\Phi_m(\hat\mu_m)]\right) \le \theta\left(\sqrt{d/m}\right) < \frac{1}{a}\sqrt{\frac{d}{m}},$$

which contradicts (15). We have therefore established the lower bound $\theta(r) \ge r/a$ for all $r = \sqrt{d/m}$ with $(m, d) \in \mathbb{N} \times 2\mathbb{N}$, $m \ge d/b^2$. We extend the lower bound to the open interval $(0, b)$ by observing that the set of such ratios $\sqrt{d/m}$ is dense in $(0, b)$, followed by a continuity argument.
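As a numerical sanity check of the $\sqrt{d/m}$ rate appearing in (15) (our sketch; the choice of $d$ and the trial count are arbitrary), for the uniform distribution on $[d]$ the empirical quantity $\Phi_m(\hat\mu_m)$ indeed concentrates near $\sqrt{d/m}$ once $m \gg d$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, trials = 50, 200

for m in [100, 400, 1600, 6400]:
    phis = [
        np.sqrt(np.bincount(rng.integers(0, d, size=m), minlength=d) / m).sum()
        / np.sqrt(m)
        for _ in range(trials)
    ]
    print(f"m = {m}: mean Phi_m = {np.mean(phis):.3f}"
          f" vs sqrt(d/m) = {np.sqrt(d / m):.3f}")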
Broader Impact

This work is of a purely theoretical nature and does not present any foreseeable societal consequence.

Acknowledgments and Disclosure of Funding

We are thankful to Clayton Scott for the insightful conversations, and to the anonymous referees for their valuable comments. This research was partially supported by the Israel Science Foundation (grant No. 1602/19) and Google Research.

References

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 0-521-57353-X. doi: 10.1017/CBO9780511624216. URL http://dx.doi.org/10.1017/CBO9780511624216.

D. Berend and A. Kontorovich. The missing mass problem. Statistics & Probability Letters, 82(6):1102-1110, 2012. URL https://EconPapers.repec.org/RePEc:eee:stapro:v:82:y:2012:i:6:p:1102-1110.

D. Berend and A. Kontorovich. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254-1259, 2013.

O. Bousquet, D. Kane, and S. Moran. The optimal approximation factor in density estimation. In A. Beygelzimer and D. Hsu, editors, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 318-341. PMLR, 2019. URL http://proceedings.mlr.press/v99/bousquet19a.html.

F. Deutsch and H. Hundal. Slow convergence of sequences of linear operators I: Almost arbitrarily slow convergence. Journal of Approximation Theory, 162(9):1701-1716, 2010a.

F. Deutsch and H. Hundal. Slow convergence of sequences of linear operators II: Arbitrarily slow convergence. Journal of Approximation Theory, 162(9):1717-1738, 2010b.

L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985. ISBN 0-471-81646-9.

L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001. ISBN 0-387-95117-2. doi: 10.1007/978-1-4613-0125-7. URL http://dx.doi.org/10.1007/978-1-4613-0125-7.

I. Diakonikolas. Learning structured distributions. In P. Bühlmann, P. Drineas, M. J. Kane, and M. J. van der Laan, editors, Handbook of Big Data, pages 267-283. Chapman and Hall/CRC, 2016. URL http://www.crcnetbase.com/doi/abs/10.1201/b19567-21.

S. R. Dunbar. Topics in Probability Theory and Stochastic Processes. Department of Mathematics, University of Nebraska, Lincoln, NE, 2009.

U. Haagerup. The best constants in the Khintchine inequality. Studia Mathematica, 70(3):231-283, 1981. URL http://eudml.org/doc/218383.

Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343-6354, Nov 2015. ISSN 0018-9448. doi: 10.1109/TIT.2015.2478816.

M. B. Handelsman. Solution to problem 436: "Distributing heads minus tails". The College Mathematics Journal, 22(5):444-449, 1991. ISSN 07468342, 19311346. URL http://www.jstor.org/stable/2686609.

S. Kamath, A. Orlitsky, D. Pichapati, and A. T. Suresh. On learning distributions from their samples. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 1066-1100, 2015. URL http://jmlr.org/proceedings/papers/v40/Kamath15.html.

A. Kontorovich and I. Pinelis. Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model. Annals of Statistics, 47(5):2822-2854, 2019. ISSN 0090-5364. doi: 10.1214/18-AOS1766.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X.

A. Orlitsky and A. T. Suresh. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2143-2151, 2015. URL http://papers.nips.cc/paper/5762-competitive-distribution-estimation-why-is-good-turing-good.

C. D. Scott and R. D. Nowak. Learning minimum volume sets. Journal of Machine Learning Research, 7:665-704, Dec. 2006. ISSN 1532-4435.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. URL https://doi.org/10.1007/b13794. Revised and extended from the 2004 French original; translated by Vladimir Zaiats.
G. Valiant and P. Valiant. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 142-155, 2016.

G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429-455, 2017.

B. Waggoner. Lp testing and learning of discrete distributions. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015, pages 347-356, 2015. doi: 10.1145/2688073.2688095. URL http://doi.acm.org/10.1145/2688073.2688095.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, 48. Cambridge University Press, Cambridge, United Kingdom, 2019. ISBN 9781108498029.

J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In K. Chaudhuri and M. Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 467-475. PMLR, 2019. URL http://proceedings.mlr.press/v89/zimmert19a.html.