Journal of Machine Learning Research 23 (2022) 1-55 Submitted 6/20; Revised 12/21; Published 2/22

Data-Derived Weak Universal Consistency

Narayana Santhanam (nsanthan@hawaii.edu)
Department of Electrical Engineering, University of Hawaii, Manoa, Honolulu, HI 96822, USA

Venkat Anantharam (ananth@eecs.berkeley.edu)
Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA

Wojciech Szpankowski (szpan@purdue.edu)
Department of Computer Science, Purdue University, W. Lafayette, IN 47907, USA

Editor: Nicholas Vayatis

Abstract

Many current applications in data science need rich model classes to adequately represent the statistics that may be driving the observations. Such rich model classes may be too complex to admit uniformly consistent estimators. In such cases, it is conventional to settle for estimators with guarantees on convergence rate whose performance can be bounded in a model-dependent way, i.e. pointwise consistent estimators. But this viewpoint has the practical drawback that estimator performance is a function of the unknown model within the model class that is being estimated: even if an estimator is consistent, how well it is doing at any given time may not be clear, no matter what the sample size of the observations. In these cases, a line of analysis favors sample-dependent guarantees. We explore this framework by studying rich model classes that may only admit pointwise consistency guarantees, yet for which enough information about the unknown model driving the observations can be inferred from the sample at hand to gauge estimator accuracy. In this paper we obtain a novel characterization of lossless compression problems over a countable alphabet in the data-derived framework, in terms of what we term deceptive distributions.
We also show that the ability to estimate the redundancy of compressing memoryless sources is equivalent to learning the underlying single-letter marginal in a data-derived fashion. We expect that the methodology underlying such characterizations in a data-derived estimation framework will be broadly applicable to a wide range of estimation problems, enabling a more systematic approach to data-derived guarantees.

Keywords: compression, sample-derived bounds, learning marginals, data-derived framework

(c) 2022 Narayana Santhanam, Venkatachalam Anantharam, and Wojciech Szpankowski. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-644.html.

1. Introduction and Motivation

Many of the most challenging problems in the data sciences stem from one or more of the following characteristics associated with data: high dimensionality; extreme scale (typically requiring that the data reside on multiple storage nodes); sparsity; patterns in the data that manifest at multiple scales; dynamic, temporal, and heterogeneous structure; complex dependencies between different parts of the data; and noise/missing data. Tasks such as image recognition, classification, control, and many others, which are built on such data sources, depend on estimating the relevant underlying structure in the data. Rich model classes, i.e. rich collections of probabilistic models, such as the collection of all probability distributions over a large or countably infinite support, or the set of long-memory, slowly mixing Markov processes, are often required to adequately model the complex characteristics of these data sources. Indeed, in bringing rigorous theory to bear on data science, an important question we face is related to model selection.
There is often a tension between the need for rich model classes to better represent data and our ability to handle these collections from a mathematical point of view. The richness of a model class is often quantified by metrics such as its VC-dimension (Bishop, 2006), Rademacher complexity (Koltchinskii, 2001; Bartlett et al., 2002; Bartlett and Mendelson, 2002), or, what is most relevant in the context of universal compression, its asymptotic per-symbol redundancy (Shtarkov, 1987; Fittingoff, 1972; Krichevsky and Trofimov, 1981; Rissanen, 1984; Barron et al., 1998; Drmota and Szpankowski, 2004; Szpankowski and Weinberger, 2012). The traditional uniform consistency paradigm would want an estimation algorithm with a model-agnostic guarantee on its performance, depending only on the sample size. Many applications, particularly in the big data regime, force us to consider model collections that are too complex to admit estimators with such traditional model-agnostic uniformly consistent guarantees. When the model classes we are interested in are too complex to admit uniformly consistent estimators, the common belief is that the best we can do is to have estimators with convergence guarantees that depend not just on the sample size but also on the underlying model in the model class that governs the statistics of the observations. These are pointwise consistent estimators (see Davisson, 1973, in the context of universal compression). This is often difficult (and, as we will see, sometimes impossible) to use predictively, as one cannot necessarily verify whether the estimator has converged until the underlying model, the very quantity being estimated, becomes known. To tackle these rich classes, several approaches consider obtaining guarantees that hold samplewise, for example, bounds from the PAC-Bayes approaches (McAllester (1999); Catoni (2007)) for rich classification tasks, data-dependent structural risk minimization (e.g.
Ben-David and Shalev-Schwartz (2012)) as well as its development via the luckiness framework (Grunwald (2007)), or as in Asadi et al. (2014) for slow mixing Markov setups. We adopt the same philosophy: we express any estimator accuracy or confidence using empirically observed quantities. Our notion of data-derived consistency is also closely related to other formulations in compression, statistics and learning theory. In particular, we note hierarchical universal compression in Merhav and Feder (1998) and the more general framework of making finitely many errors along the lines of Cover (1973); Dembo and Peres (1994); Kulkarni and Tse (1994). We have approached this angle under the framework of regularization in Wu and Santhanam (2021b,a). To get a flavor of the results in this line of work, for example, Cover (1973) asks whether one can determine, from finitely many samples, whether the mean of a Bernoulli process is rational, showing that the answer is affirmative if the mean comes from a Baire first category set with Lebesgue measure 1 that also contains every rational number; see Koplowitz et al. (1995); Wu and Santhanam (2021a) for extensions. Fundamental to all these approaches is the need to balance the sample complexity of learning with the desire for richer model collections (or hypothesis collections, as the case may be). This paper builds a natural information theoretic framework in the ambit of this philosophy: however we choose to obtain the data-derived bounds, when can they be made strong enough to answer convergence questions with arbitrary pre-specified confidence? Or, equivalently, when is the data a sufficient statistic for the convergence rate of the estimator (or a non-trivial bound on it)? To retain focus in understanding this data-derived consistency, in this paper we concentrate on universal compression to bring out the salient features of this framework.
We also make connections to a related prediction problem that was analyzed by us earlier in Santhanam and Anantharam (2015), and is now seen to fit into this broader framework. We note also that universal compression of i.i.d. data is equivalent to finding the marginal from samples with data-derived guarantees. We illustrate the salient aspect of the data-derived setup we consider with a simple example below.

Example 1. (Hiding entropy) For ϵ > 0 and M ∈ N, where N is the set of natural numbers, let p_{ϵ,M} be the probability distribution that assigns probability 1 − ϵ to the natural number 1 and probability ϵ/M to each of the natural numbers 2 through M + 1. Denote the probability distribution that assigns probability 1 to the natural number 1 by p_0. Let W be the set comprised of the probability distributions p_{ϵ,M} for ϵ > 0 and M ∈ N, as well as p_0. Our task is to estimate the Shannon entropy of a probability distribution in W using i.i.d. samples from it. However, we do not know which probability distribution in W is governing the law of the observed samples. The natural plug-in estimator assigns to a sample X_1, . . . , X_n the entropy of its empirical distribution. Since every probability distribution in W has finite support, the plug-in estimate is consistent almost surely, no matter which underlying distribution from W is generating the observations. But at what point do we know that the plug-in estimate is close to the correct answer? Indeed, can we, at any point, get an upper bound for the true entropy using the plug-in estimate with, say, a confidence probability 3/4, regardless of what the true probability distribution in W is? It turns out that it is impossible to provide such guarantees for W. To see why, suppose we have a sequence of n successive 1s. This could have come from p_0, or, with high probability, from any probability distribution p_{ϵ,M} with 0 < ϵ ≤ 1/n.
What is worse, for any upper bound ĥ we may provide, however large, even if 0 < ϵ ≤ 1/n, the entropy of p_{ϵ,M} with M ≥ 2^{ĥ/ϵ} is h(ϵ) + ϵ log M ≥ ĥ, where h(·) denotes the binary entropy function. Every such p_{ϵ,M} gives the sample of n successive 1s a probability greater than 3/4 if ϵ is sufficiently small, so our upper bound fails. This argument applies whether we obtained ĥ from the plug-in estimator or from any other estimator of the entropy. No upper bound that we propose on the entropy based on any finite sequence of 1s can hold with confidence probability 3/4 under all probability distributions in W. To make matters worse, the sequence of all 1s occurs with probability 1 when the underlying model in force is p_0. Therefore, even though we can estimate the entropy consistently, we can never obtain even a trivial upper bound on the entropy with a confidence probability 3/4. □

Universal compression posits that we have a model class of source probability measures, while we are required to come up with a universal probability measure that attempts to compress any source in the model class as well as possible without prior knowledge of the source. Since the universal probability measure is not exactly matched to any single source probability measure in the model class, it incurs a redundancy, measured using the Kullback-Leibler (KL) divergence, against any source in the model class when compressing a sequence of observed samples whose statistics are governed by this source. The uniform consistency setup in this case corresponds to what is commonly known as the strong compression formulation, where we find universal probability measures whose per-symbol redundancy incurred against any source in the model class can be uniformly bounded over the entire model class and, in addition, diminishes to 0 as the sample size grows to infinity.
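The arithmetic behind Example 1 is easy to check numerically. The following sketch (purely illustrative; the function names are ours, not from the paper) verifies that with ϵ = 10⁻³ and log₂ M = ĥ/ϵ, a run of n = 100 ones has probability above 3/4 under p_{ϵ,M}, while the entropy of p_{ϵ,M} exceeds ĥ = 10 bits. Note that M = 2^{10000} is far too large to enumerate, so the entropy is computed from the closed form h(ϵ) + ϵ log₂ M rather than from an explicit distribution.

```python
import math

def h2(eps):
    """Binary entropy h(eps), in bits."""
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

def entropy_eps_M(eps, log2M):
    """Entropy of p_{eps,M} (mass 1-eps on 1, eps/M on 2..M+1): h(eps) + eps*log2(M)."""
    return h2(eps) + eps * log2M

n, h_hat, eps = 100, 10.0, 1e-3
log2M = h_hat / eps                 # choosing M >= 2**(h_hat/eps) hides the entropy
prob_all_ones = (1 - eps) ** n      # probability of observing n successive 1s

print(prob_all_ones > 3 / 4)                     # True: the sample looks like p0
print(entropy_eps_M(eps, log2M) >= h_hat)        # True: yet the entropy exceeds h_hat
```

So no finite run of 1s can certify any finite entropy bound with confidence 3/4 over all of W, exactly as the example argues.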
The pointwise consistency setup in this case corresponds to what is commonly known as the weak compression formulation, and is one where the universal probability measure incurs asymptotically zero per-symbol redundancy against each source in the model class, but the convergence to zero is not necessarily uniform over the entire model class. The data-derived weak compression formulation (d.w.c.) identifies when, in the weak compression setup, we can also estimate from the sample the redundancy of the universal probability measure relative to the underlying source model generating the data. Broadly speaking, we aim to find a universal estimator/encoding with a given accuracy as well as a corresponding stopping rule that allows us to find out at what point the KL divergence from the true source becomes, and remains, small. We also prove that this characterization is completely equivalent to that of estimating, in a data-derived fashion, a distribution q over the naturals that is within a specified accuracy of the underlying marginal.¹ To characterize the classes of probability distributions on N that are data-derived weakly compressible, we shall introduce the notion of what it means for a probability distribution in the class to be deceptive relative to the class. At a high level, a source probability distribution, viewed as a member of a collection of probability distributions, is deceptive if the asymptotic per-symbol redundancy of neighborhoods of the source within the model class is bounded away from 0, in the limit as the neighborhood shrinks to 0. Then, in our main finding, Theorem 20, we show that a collection of probability measures is data-derived weakly compressible iff no source in the model class is deceptive. As we delve deeper into this formulation, we will see that data-derived consistency changes how we think of model classes.
It shifts the focus away from the global complexity of the model class to some form of local complexity of each model within the model class, viewed as a member of the model class.

The paper is organized as follows. In the next section we develop our data-derived approach. Section 3 recalls some of the central prior results on universal compression that we build on in our work. Section 4 discusses our main result (Theorem 20), which completely characterizes d.w.c. model classes of i.i.d. probability distributions on a countable set. Theorem 9 and Appendix C contain an equivalent formulation, that of estimating in a data-derived fashion a distribution q over the naturals that is within a specified KL divergence of the underlying marginal. We then illustrate several nuances in our formulation and results using several examples in Section 5. Sections 6 and 7 are devoted to proving the main result. The main thread of the discussion is supported by several appendices. Appendix A reconciles the traditional definitions of strong and weak compressibility with those we work with in this paper. Appendix B gathers several basic results on entropy and redundancy that we draw upon throughout the paper. Appendix C, as mentioned, proves an operational equivalence between our notion of data-derived compressibility and a natural definition of learnability of a class of probability distributions (Definition 8).² Appendix D contains the details of the proof of the claims made regarding one of the examples in Section 5. Appendix E proves a lemma needed for the proof of the sufficiency part of the main theorem. The last bit of the proof of the necessity part of the main theorem is in Appendix F, and that of the sufficiency part is in Appendix G.

¹ We thank the anonymous reviewer for suggesting this comparison.
Finally, Appendix H corrects an erroneous claim made in passing in the concluding remarks of Santhanam and Anantharam (2015) (which does not in any way affect the rest of that paper), and in addition illustrates why, in general, finite unions of d.w.c. classes, while weakly universal, need not be d.w.c.

2. Formulation of the Problem

We consider here the lossless compression problem for collections of large alphabet i.i.d. sources. The main contribution of this work is to characterize when data-derived guarantees for estimation problems can be made sufficiently strong. The large alphabet i.i.d. compression problem is the vehicle we have used to do this, but this framework leads to interesting developments in other problems as well. We compare, in Example 10, with the problem of estimating percentiles of the probability distribution defining the source; this has been studied in depth in Santhanam and Anantharam (2015), and here we show that this estimation task also lies in the data-derived framework proposed in this document. Another example is entropy estimation (see Example 11), which we have studied in Wu and Santhanam (2020) from a related, almost-sure hypothesis testing framework.

2.1 Notations

Before embarking on the discussion, we introduce the notational conventions adopted in the paper. The symbol :=, and occasionally =:, is used to denote equality by definition. We write log for logarithms to base 2 and ln for logarithms to the natural base.

Strings, sets and types: The set of natural numbers, denoted N, is the set {1, 2, . . .}, thought of as endowed with its usual σ-algebra comprised of all subsets of N. For n ≥ 1, we use N^n to denote the set of strings of length n of natural numbers, with the product σ-algebra. The set of infinite sequences of natural numbers is denoted N^∞, and is thought of as endowed with the corresponding product σ-algebra.
We will adopt the convention of thinking of a probability measure on N as defined by a distribution, which assigns a probability to each natural number. A string of integers (x_1, . . . , x_n) ∈ N^n will be denoted by x, or by x^n when it seems important to emphasize the specific length of the string. The type of a string of integers x := (x_1, . . . , x_n) ∈ N^n will refer to the pair (n, t), where n is the sequence length and t its empirical distribution. N^∗ denotes the set of strings of naturals of finite length, including the empty string. For the purposes of this paper it suffices to think of N^∗ as a set with no additional structure. Similarly, {0, 1}^∗ denotes the set of binary strings of finite length. The notation {0, 1}^∗ \ {∅} is used for the set of binary strings of finite length, excluding the empty string ∅. For b ∈ {0, 1}^∗ \ {∅}, the length of b is denoted by l(b). For 1 ≤ m ≤ n and strings y ∈ N^m and x ∈ N^n, we write y ⪯ x to denote that y is a prefix of x. We also use this notation when y ∈ N^m and x ∈ N^∞. The length of a finite string x ∈ N^n is denoted by |x|.

Probability measures and distributions: Let P be a collection of probability distributions over N. Given P, we let P^∞ denote the collection of probability measures on N^∞ induced by i.i.d. assignments from the individual probability distributions in P. We will use the term source to denote either p ∈ P or p^∞ ∈ P^∞ as appropriate. For notational simplicity, and following the convention in the literature, we will often drop the superscript in p^∞ and use p both for the probability distribution on N and for the corresponding i.i.d. probability measure induced on N^∞. Further, for n ≥ 1 and a string of natural numbers x := (x_1, . . . , x_n) =: x^n ∈ N^n, we will write p(x) or p(x^n) for ∏_{i=1}^n p(x_i).

² We are grateful to the anonymous reviewer for observing and suggesting one direction of this useful connection.
Here p can be thought of as a simplified notation for the product probability measure p^n on N^n corresponding to the probability distribution p on N. For a probability measure q on N^∞, given n ≥ 1 and a string x ∈ N^n, we write q(x) for the probability under q of the set of strings in N^∞ whose prefix of length n is x. In effect, we are treating x as also denoting an event in N^∞. Note that, for p ∈ P, n ≥ 1, and x ∈ N^n, this notational convention is consistent with the earlier conventions of writing p for both p ∈ P and for the product probability measure on N^n corresponding to p. It is a standard fact that a probability measure q on N^∞ is completely specified by q(x) for all x ∈ N^n for all n ≥ 1, subject to the consistency conditions q(x) = ∑_{y ∈ N^m : x ⪯ y} q(y) for all 1 ≤ n ≤ m and x ∈ N^n. We write 1(A) to denote the indicator of an event A. It is convenient to state some of the supporting results in this document at a level of generality where the underlying set is a countable set, in which case we denote such a set by X. Also, we will state some results that apply to arbitrary collections of probability measures on N^∞, i.e. not necessarily of the form P^∞ for some collection of probability distributions P on N. In such cases, we denote such a collection of probability measures on N^∞ by Λ. If q and r are arbitrary probability measures on N^∞, then D_n(q||r) := E_q log (q(X^n)/r(X^n)) denotes the KL divergence over length-n strings of q with respect to r. If p and p̃ are probability distributions on N, then D(p||p̃) denotes the KL divergence of p with respect to p̃, which is E_p log (p(X)/p̃(X)). Note that, with our conventions, the expression D_n(p||p̃) is also well-defined, and can be viewed as a shorthand notation for D_n(p^∞||p̃^∞). We thus have D_n(p||p̃) = n D(p||p̃) for all n ∈ N, since p^∞ and p̃^∞ are i.i.d. probability measures on N^∞. KL divergence is also called relative entropy. For probability distributions p and p̃ on N, their ℓ1 distance is ||p − p̃||_1 := ∑_{i ∈ N} |p(i) − p̃(i)|.
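To make the single-letter conventions concrete, a small Python sketch (ours, purely illustrative, with distributions represented as dicts over a finite support) computes D(p||q) and the ℓ1 distance, and checks the tensorization identity D_n(p||q) = n · D(p||q) for i.i.d. measures by brute-force enumeration of all length-n strings.

```python
import itertools
import math

def kl(p, q):
    """Single-letter KL divergence D(p||q), in bits."""
    return sum(pi * math.log2(pi / q[i]) for i, pi in p.items() if pi > 0)

def l1(p, q):
    """The l1 distance ||p - q||_1 over the union of the supports."""
    support = set(p) | set(q)
    return sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)

p = {1: 0.5, 2: 0.3, 3: 0.2}
q = {1: 0.4, 2: 0.4, 3: 0.2}

# D_n(p||q) = n * D(p||q) for i.i.d. measures: verify for n = 5 by summing
# p(x^n) * log(p(x^n)/q(x^n)) over all 3^5 strings from the common support.
n = 5
Dn = sum(
    math.prod(p[x] for x in xs) * sum(math.log2(p[x] / q[x]) for x in xs)
    for xs in itertools.product(p, repeat=n)
)
print(abs(Dn - n * kl(p, q)) < 1e-9)  # True
print(round(l1(p, q), 2))             # 0.2
```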
2.2 Strong and Weak Compressibility

In the lossless data compression problem for the collection of probability measures P^∞ on N^∞ corresponding to a collection of probability distributions P on N, our estimator is a probability measure q on N^∞.³ The problem formulation can be understood by thinking of the loss L(p, q, x) incurred by the estimator q against a source p, given the length-n observation x ∈ N^n, as being the excess codelength,

L(p, q, x) := log (p(x)/q(x)).

The terminology is justified by thinking of log (1/p(x)) as an indication of the length of the binary string one would want to use to represent x in an ideal prefix-free scheme for compressing strings of length n from the source p if one knew what p was, and thinking of log (1/q(x)) as the length of the binary string one would be led to use for representing x in the prefix-free compression scheme suggested by the estimator q. For more on this, see the discussion in Appendix A on how strong and weak compressibility are typically defined in the literature. With this loss function in mind, we now make the following definitions.

Definition 2. Let P be a collection of probability distributions on N, and P^∞ the corresponding collection of probability measures on N^∞ induced by i.i.d. assignments from the individual probability distributions in P. Then P^∞, or equivalently P, is called strongly compressible if there is a probability measure q on N^∞ satisfying

lim sup_{n→∞} sup_{p∈P} (1/n) E_p log (p(X^n)/q(X^n)) = 0.    (1)

The preceding definition may seem unusual relative to the definition of strong compressibility that is traditionally encountered in the literature on data compression (see Fittingoff, 1972; Davisson, 1973). In Appendix A we establish that it is identical to the traditional definition. Discussions of data compression in the literature are often framed in the language of redundancy. We formalize this notion in the following definition.

Definition 3. Let Λ be any collection of probability measures on N^∞.
The length-n redundancy of Λ is defined to be

R_n(Λ) := inf_q sup_{r∈Λ} E_r log (r(X^n)/q(X^n)),

where the outer infimum is taken over all probability measures on N^∞, or equivalently over all probability measures on N^n. The redundancy in the special case n = 1 is called the single letter redundancy of Λ, and R_n(Λ)/n is called the per-symbol length-n redundancy of Λ. The asymptotic per-symbol redundancy of Λ is lim sup_{n→∞} R_n(Λ)/n.

More generally, given a probability measure q̂_n on N^n, one can define the length-n redundancy of Λ with respect to q̂_n to be sup_{r∈Λ} E_r log (r(X^n)/q̂_n(X^n)), and similarly for the per-symbol length-n redundancy of Λ with respect to q̂_n. Given a probability measure q on N^∞, one can define the asymptotic per-symbol redundancy of Λ with respect to q to be lim sup_{n→∞} (1/n) sup_{r∈Λ} E_r log (r(X^n)/q(X^n)). Even more generally, given a probability measure q̂_n on N^n, one can define the length-n redundancy of r ∈ Λ with respect to q̂_n to be E_r log (r(X^n)/q̂_n(X^n)), and define the per-symbol length-n redundancy of r ∈ Λ with respect to q̂_n similarly. Given a probability measure q on N^∞, one can define the asymptotic per-symbol redundancy of r ∈ Λ with respect to q to be lim sup_{n→∞} (1/n) E_r log (r(X^n)/q(X^n)). When P is a collection of probability distributions on N, and P^∞ the corresponding collection of probability measures on N^∞ induced by i.i.d. assignments from the individual probability distributions in P, we will talk about each of the redundancy quantities as properties of P when in fact they are defined for P^∞. Similarly, given a probability measure q̂_n on N^n or a probability measure q on N^∞, we will talk about each of the redundancy quantities for a given p ∈ P with respect to q̂_n or q (as appropriate) when we mean the corresponding quantities for the p^∞ ∈ P^∞ corresponding to p.

³ It is not required that the probability measure q be a product measure.
□

It is worth noting that a collection of probability distributions on N is strongly compressible iff its asymptotic per-symbol redundancy is zero. For completeness, we give a proof of this claim in Lemma 34 in Appendix A. We also observe that the asymptotic per-symbol redundancy of a collection of probability measures Λ on N^∞ can also be written as

lim sup_{n→∞} R_n(Λ)/n = lim sup_{n→∞} (1/n) inf_q sup_{r∈Λ} E_r log (r(X^n)/q(X^n)) = inf_q lim sup_{n→∞} (1/n) sup_{r∈Λ} E_r log (r(X^n)/q(X^n)),    (2)

where the infimum on both sides of the equality is over probability measures q on N^∞. Namely, the lim sup_{n→∞} can be interchanged with the inf_q. A proof of this is given in Lemma 40 in Appendix B. We can allow for much richer collections of probability distributions if we work with a weaker notion of compressibility.

Definition 4. Let P be a collection of probability distributions on N, and P^∞ the collection of probability measures on N^∞ induced by i.i.d. assignments from the individual probability distributions in P. Then P^∞, or equivalently P, is called weakly compressible if there exists a probability measure q over N^∞ such that, for all p ∈ P with finite entropy rate, we have

lim sup_{n→∞} (1/n) E_p log (p(X^n)/q(X^n)) = 0.    (3)

One artifact of the above definition is that any collection of probability distributions on N in which every source has infinite entropy is vacuously weakly compressible. In Appendix A we establish that this definition of weak compressibility is identical to the definition of weak compressibility commonly encountered in the literature on data compression; see, for example, Kieffer (1978). Also, in Lemma 35 of Appendix A we formally establish the essentially tautological fact that a collection of probability distributions P on N is weakly compressible iff there exists a probability measure q on N^∞ such that every p ∈ P with finite entropy has vanishing asymptotic per-symbol redundancy with respect to q.
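As a concrete and classical instance of weak compressibility (a standard mixture-code construction, not specific to this paper): for a countable class {p_k}, the mixture q = ∑_k 2^{-k} p_k^∞ satisfies q(x^n) ≥ 2^{-k} p_k(x^n), so the per-symbol redundancy of p_k with respect to q is at most k/n → 0, although not uniformly in k. The sketch below (names are ours) checks this bound numerically for a two-component truncation.

```python
import math

def mixture_prob(xs, components):
    """q(x^n) for the countable mixture q = sum_k 2^{-k} p_k (k = 1, 2, ...),
    truncated here to finitely many components (dicts over N)."""
    return sum(
        2.0 ** -k * math.prod(p.get(x, 0.0) for x in xs)
        for k, p in enumerate(components, start=1)
    )

p1 = {1: 0.9, 2: 0.1}
p2 = {1: 0.5, 2: 0.5}
xs = [1, 1, 2, 1] * 25          # a length-100 sample
n, k = len(xs), 1               # p1 is component k = 1, with mixture weight 2^{-1}

# q(x^n) >= 2^{-k} p1(x^n) gives log p1(x^n) - log q(x^n) <= k, so the
# per-symbol excess codelength against p1 is at most k/n bits.
excess = (math.log2(math.prod(p1[x] for x in xs))
          - math.log2(mixture_prob(xs, [p1, p2]))) / n
print(0 <= excess <= k / n)  # True
```

The bound k/n depends on which component k generated the data, which is exactly the non-uniformity that separates weak from strong compressibility.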
2.3 Compression in the Data-Derived Sense

Working with collections of probability distributions on N that are weakly compressible gives us a richer class of models than working with those that are strongly compressible. Weak compressibility of a collection P of probability distributions on N ensures that there is a probability measure q on N^∞ such that q is essentially as good an encoder as the underlying p for long enough strings of natural numbers drawn i.i.d. from p, where goodness is measured in terms of the number of bits used per symbol encoded. This is what it means to say that the asymptotic per-symbol redundancy of every p ∈ P with respect to q is 0. But observe that what one means by "long enough" depends on the unknown p, since convergence to the limit in (3) need not be uniform over p ∈ P. The main contribution of our work is to come to grips with this issue without having to back off all the way to being able to deal only with strongly compressible collections of probability distributions.

2.3.1 Stopping Rule

Our ideas are built around the notion of a universal stopping rule, which we introduce next. Recall that a stopping rule is a function of observed strings where the decision to stop or not at any given time is based only on what has been observed thus far. We formalize a stopping rule by a function τ from N^∗, the set of all finite strings of naturals, to the set {0, 1},

τ : N^∗ → {0, 1}.

When τ assigns value 0 to a finite string x^n, possibly the empty string, it indicates that the stopping rule is still waiting after having observed x^n. A string x^n, possibly the empty string, is assigned 1 if the stopping rule has stopped on some prefix of x^n. From a notational point of view, since τ quantifies a stopping rule, we will have τ(x^n) ≥ τ(x^m) for all strings x^n with prefix x^m.
To align with the common definition of a stopping time T defined on the standard filtration on {N^n}_{n≥1}, τ is a binary (0-1) process that assigns to X^m the value 1 if X^m ∈ {T ≤ m}, and 0 otherwise. The stopping rule τ is required to be universal for P. In other words, the stopping rule cannot change depending on the unknown probabilistic model p ∈ P that is generating the observations. In the formulation that we will develop in this paper, given a threshold δ > 0, a stopping rule (call it τ for now) will be based on some fixed probability measure q on N^∞, and will signify when the sequence length is long enough that the normalized KL divergence between the underlying source distribution and the probability measure q has fallen below δ and will remain below δ henceforth. We will insist that τ stops at a finite time for all p ∈ P, i.e.,

p( lim_{n→∞} τ(X^n) = 1 ) = 1, for all p ∈ P.    (4)

We will include the condition in (4) in the concept of what we mean by a universal stopping rule. To understand this requirement better, fix a probability measure q on N^∞, and for p ∈ P let

N_{p,δ;q} := { n : (1/n) E_p log (p(X^n)/q(X^n)) > δ }.

Thus N_{p,δ;q} is the set of all lengths n ≥ 1 such that the length-n KL divergence of the i.i.d. probability measure p^∞ corresponding to p with respect to the probability measure q is worse than the accuracy required. Now consider the set N_{δ;q} := ∪_{p∈P} N_{p,δ;q}. In the trivial case where N_{δ;q} is a finite set, let N denote the largest element of N_{δ;q}. Then, for all n > N, we have

(1/n) E_p log (p(X^n)/q(X^n)) ≤ δ.

Clearly we can choose the stopping rule to be 0 for all sequences of length n ≤ N and 1 for all sequences of length > N, and this rule is universal.

2.3.2 δ-Premature Rules

It is more interesting when N_{δ;q} defined above is not a finite set. Even in this case, the stopping rule τ has to stop at a finite time almost surely, no matter which source is governing the observations.
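The trivial rule just described, waiting until the sequence length exceeds N = max N_{δ;q}, can be sketched as follows (a hypothetical illustration with names of our choosing; the paper treats τ abstractly). Note the sketch only captures the rule's mechanics; the value of N comes from the analysis above, not from the data.

```python
def make_threshold_rule(N):
    """Universal stopping rule for the trivial case where N_{delta;q} is finite:
    tau(x) = 1 exactly when the observed length exceeds N = max N_{delta;q}.
    The rule depends only on the length of the observed string, never on p."""
    def tau(xs):
        return 1 if len(xs) > N else 0
    return tau

tau = make_threshold_rule(N=50)
print(tau([1] * 50))   # 0: still waiting at length N
print(tau([1] * 51))   # 1: stopped; note tau is monotone along prefixes
```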
Naturally, no matter when τ stops waiting, the sequence length may not be long enough for some sources in P, so τ fails on such sequences. More formally, for δ > 0, τ fails with respect to q, or is δ-premature with respect to q, for a source p ∈ P at time i if there is some string x^i such that τ(x^i) = 1 and

(1/i) E_p log (p(X^i)/q(X^i)) > δ.    (5)

For p ∈ P, consider the subset of N^∞ defined as

{ x^∞ ∈ N^∞ : ∃ i such that τ(x^i) = 1 and (1/i) ∑_{y^i ∈ N^i} p(y^i) log (p(y^i)/q(y^i)) > δ }.    (6)

For p ∈ P, the above set is the set of strings on which τ is δ-premature with respect to q. While this set depends on which p ∈ P is driving the observations, it is an event in the product σ-algebra on N^∞ whatever the underlying p ∈ P. To see this, note that it is a countable union of sets of the form { x^∞ ∈ N^∞ : τ(x^i) = 1 }, i ≥ 1 (which of the component sets lie in the union is determined, for the fixed probability measure q on N^∞, by the underlying source probability distribution p). While the set in (6) may not be an empty set, we can at least try to ensure that its probability under p is small. This thought process leads to what we mean by a collection of probability distributions on N being weakly compressible in the data-derived sense, formalized below. This is the central concept investigated in this paper.

Definition 5. Let P be a collection of probability distributions on N and P^∞ the associated collection of probability measures on N^∞ got by i.i.d. assignments from the individual distributions in P. We say that P^∞, or equivalently P, is weakly compressible in the data-derived sense, or data-derived weakly compressible (d.w.c.),
if there is a probability measure q on N^∞ such that, for any accuracy δ > 0 and confidence probability 0 < 1 − η < 1, there is a universal stopping rule τ_{δ,η} with the property that, no matter what p ∈ P is in force, we have

p( τ_{δ,η} is δ-premature with respect to q for p )    (7)
:= p( ∃ i such that τ_{δ,η}(X^i) = 1 and (1/i) ∑_{y^i ∈ N^i} p(y^i) log (p(y^i)/q(y^i)) > δ ) < η,

where in the second statement the random variables X_i are generated i.i.d. p. □

While the above definition recalls the compression/information-theoretic angle of our problem, we also note that characterizing d.w.c. classes will be equivalent to characterizing when we can learn the underlying marginals of the generating distribution, with a certificate that assures us that the estimate is accurate. We state this formally in Section 2.4, Definition 8 and Theorem 9, as the operational interpretation of the above definition of d.w.c. classes.

Claim 6. (Strongly compressible implies d.w.c.) Suppose P is a collection of probability distributions on N that is strongly compressible, namely there exists a probability measure q on N^∞ that satisfies (1). It follows then that, for all δ > 0, the sets

N_{δ;q} := { n : sup_{p∈P} (1/n) E_p log (p(X^n)/q(X^n)) > δ }

are finite. For any η > 0, if we set τ_{δ,η}(x^i) = 1 when i > max N_{δ;q} and 0 otherwise, we obtain for all p ∈ P that

p( τ_{δ,η} is δ-premature with respect to q ) = 0.

Thus every strongly compressible collection of probability distributions on N is d.w.c. □

Claim 7. (d.w.c. implies weakly compressible) Suppose P is a collection of probability distributions on N that is d.w.c., as in Definition 5. Let q be a probability measure on N^∞ such that, for every accuracy δ > 0 and confidence probability 0 < 1 − η < 1, there is a universal stopping rule τ_{δ,η} satisfying (7) for every p ∈ P. Fix p ∈ P. From (7) we conclude that, for all i ≥ 1, we have

p( τ_{δ,η}(X^i) = 1 ) · 1( (1/i) ∑_{y^i ∈ N^i} p(y^i) log (p(y^i)/q(y^i)) > δ ) < η.

However, since the stopping rule τ_{δ,η} is universal, it must satisfy (4), i.e. it stops eventually.
Hence we have lim_{i→∞} p(τ_{δ,η}(X_1^i) = 1) = 1. From this, it follows that

$$\limsup_{i \to \infty} \frac{1}{i} \sum_{y^i \in \mathbb{N}^i} p(y^i) \log \frac{p(y^i)}{q(y^i)} \le \delta$$

(in fact, for this to hold, it suffices to have the condition in (7) hold for some 0 < 1 − η < 1, and not necessarily for all η > 0, for the given δ > 0). Letting δ → 0, we see that the condition in (3) holds, for the given probability measure q on ℕ^∞, for all p ∈ P. This means, by definition, that P is weakly compressible. □

Claims 6 and 7 imply that

strongly compressible ⊆ d.w.c. ⊆ weakly compressible.

In Section 5.1 we will see examples of model classes demonstrating that each of these inclusions is strict. Note also that the distinctions between the definitions arise only when the underlying alphabet is infinite; for finite alphabets, all versions are equivalent. As can be seen from the preceding discussion, our formulation of d.w.c. model classes is aimed at the most interesting case from a statistical modeling viewpoint, namely the case where P is weakly compressible but not strongly compressible. Typically, we need global constraints on the collection of sources that comprise a model class to render the model class strongly compressible: for example, the square root of the Fisher information should be integrable over the model class for it to be strongly compressible (Rissanen (1984)). By contrast, as we will see, data-derived weak compressibility does not depend on controlling the entire class P^∞, but requires only that local neighborhoods of each p ∈ P, viewed as a member of P^∞, be simple. Indeed, one of the main contributions of this paper is a condition that is both necessary and sufficient for an i.i.d. collection P^∞ to be d.w.c.

2.4 Operational Characterization of Data-Derived Compressibility

We provide the following operational perspective on d.w.c. from a learning-theoretic angle.
Let P(ℕ) be the set of all probability distributions on ℕ and, as before, let P ⊆ P(ℕ) be a collection of probability distributions on ℕ. Let X_1, X_2, ... be i.i.d. samples generated by an unknown p ∈ P and let q̂_{X_1,...,X_n} be a distribution on ℕ that is considered to be an estimate of the underlying distribution p obtained using the samples X_1, ..., X_n. Abbreviating q̂_{X_1^n} by q̂, the loss incurred by the estimate q̂ is the single-letter divergence D_1(p‖q̂).

Definition 8. P is learnable if for all η > 0 and δ > 0 there is an estimator q̂ : ℕ^* → P(ℕ) and a universal stopping rule τ_{δ,η} for P such that for all p ∈ P,

$$p\left( \exists\, i \text{ s.t. } \tau_{\delta,\eta}(X_1^i) = 1 \text{ and } D_1(p \| \hat{q}_{X_1^i}) > \delta \right) < \eta,$$

where X_1, X_2, ... above are generated i.i.d. p. (The left hand side of the preceding equation will be abbreviated as p(τ_{δ,η} = 1 and D_1(p‖q̂) > δ).) □

Theorem 9. P is learnable iff P is d.w.c.

Proof Please see Appendix C. □

2.5 Other Examples of Data-Derived Problem Formulations

To clarify that the ideas in our framework have the potential to apply much more broadly to estimation problems other than the lossless compression problem that we have focused on in this document, we highlight in this section data-derived formulations for two other estimation problems. The first is a prediction task from Santhanam and Anantharam (2015), which we call the insurance problem, while the second is an entropy estimation task. In later sections, we will also make some comparisons between the insurance problem and the universal lossless compression problem studied here.

Example 10. (Insurability) Suppose we have a collection P^∞ of i.i.d. measures over ℕ^∞. Given a finite sample (X_1, ..., X_n) with i.i.d. marginals from an unknown p ∈ P, we want to estimate a finite upper bound on the next symbol X_{n+1} in a data-derived sense. If there are p ∈ P with unbounded support, then any finite upper bound we propose has, under such p, a positive probability of being violated.
In our data-derived formulation, we therefore want to provide an estimated upper bound Φ(X_1^n), together with a universal stopping rule τ that tells us from what point onwards we should believe that our estimates Φ(X_1^n) are at least as big as X_{n+1}, while allowing for some probability of being wrong. Formally, given a confidence probability 0 < 1 − η < 1, we seek a mapping Φ : ℕ^* → ℝ and a stopping rule τ such that, for all p ∈ P, we have

$$p\left( \exists\, i \in \mathbb{N} \text{ such that } \Phi(X_1^i) < X_{i+1} \text{ and } \tau(X_1^i) = 1 \right) < \eta.$$

If this is possible, we say that the model class P is insurable. In prior work, Santhanam and Anantharam (2015) completely characterized the collections P that are insurable. See Corollary 22 and Corollary 23 for more details and connections with the results developed in this document. □

Example 11. (Entropy estimation) Let P be a collection of probability distributions on ℕ. Given a finite sample (X_1, ..., X_n) sampled i.i.d. from an unknown p ∈ P, we want to provide a data-derived finite upper bound Ĥ on the entropy of p. Formally, given a confidence probability 0 < 1 − η < 1, we would like to come up with a mapping Ĥ : ℕ^* → ℝ and a universal stopping rule τ such that, for all p ∈ P, we have

$$p\left( \exists\, i \in \mathbb{N} \text{ such that } \hat{H}(X_1^i) < H(p) \text{ and } \tau(X_1^i) = 1 \right) < \eta. \qquad \Box$$

While this problem remains open, we have worked on a related formulation in Wu and Santhanam (2020) on entropy property testing: namely, given a set A ⊆ ℝ, to determine whether H(p) ∈ A or not. We show there that, under mild conditions on the underlying distribution, we can resolve the property testing problem in finitely many samples iff A and A^c are F_σ separable, adding to a related line of work developed in Cover (1973); Dembo and Peres (1994); Kulkarni and Tse (1994).

3. Background

This section highlights some interesting prior results on universal compression that will be used in this paper. Readers can skip the proofs in this section if they are willing to take the results here at face value when they are referred to.
We have collected in this section the more interesting prior results we use. Other, more basic, prior results that we also use are collected in Appendix B.

3.1 Weak Compression

Let P be a collection of probability distributions on ℕ and P^∞ the collection of probability measures on ℕ^∞ induced by i.i.d. assignments from the individual probability distributions in P. In Appendix A we have demonstrated that the notion of weak compressibility of P^∞ in the sense of Kieffer (Kieffer (1978)) is identical to the definition of weak compressibility of P that we have made in Definition 4. The following lemma gives a useful characterization of weak compressibility.

Lemma 12. Let P be a collection of probability distributions on ℕ and P^∞ the associated set of i.i.d. probability measures on ℕ^∞. Then P is weakly compressible iff there exists a distribution q on ℕ such that for all p ∈ P with finite entropy we have

$$\sum_{x \in \mathbb{N}} p(x) \log \frac{1}{q(x)} < \infty. \qquad (8)$$

Proof From (Kieffer, 1978, Theorem 1) we know that P is weakly compressible iff there is a countable set Q := {q_1, q_2, ...} of probability distributions on ℕ such that for all p ∈ P with finite entropy there is some q_i ∈ Q satisfying

$$\sum_{x \in \mathbb{N}} p(x) \log \frac{1}{q_i(x)} < \infty.$$

Therefore, if there is a probability distribution q on ℕ satisfying (8) for all p ∈ P, we can immediately conclude that P is weakly compressible. It remains to show the converse. To do this, suppose that P is weakly compressible and let Q be a choice of the countable set of probability distributions on ℕ guaranteed by (Kieffer, 1978, Theorem 1). Fix some enumeration of Q as Q = {q_1, q_2, ...}. Consider the probability distribution q on ℕ given by

$$q(n) := \frac{\sum_{i=1}^{|Q|} \frac{q_i(n)}{i(i+1)}}{\sum_{j=1}^{|Q|} \frac{1}{j(j+1)}}, \qquad n \in \mathbb{N},$$

where the upper limit of the summation is understood to be ∞ if Q is countably infinite. Observe that, for all i and for all n, we have q(n) ≥ q_i(n)/(i(i+1)).
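As a quick sanity check on the mixture construction just used, the domination q(n) ≥ q_i(n)/(i(i+1)) can be verified with exact rational arithmetic. The three component distributions below are hypothetical, chosen only for illustration:

```python
from fractions import Fraction

# Three hypothetical component distributions q1, q2, q3 on the alphabet {0, 1, 2, 3}
qs = [
    [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)],
    [Fraction(1, 4)] * 4,
    [Fraction(1, 8), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2)],
]

# Normalizing constant Z = sum_{j=1}^{|Q|} 1/(j(j+1)); note that Z <= 1 always
Z = sum(Fraction(1, j * (j + 1)) for j in range(1, len(qs) + 1))

def q(n):
    # q(n) = (1/Z) * sum_i q_i(n) / (i(i+1)), as in the proof of Lemma 12
    return sum(qs[i - 1][n] / (i * (i + 1)) for i in range(1, len(qs) + 1)) / Z

assert sum(q(n) for n in range(4)) == 1  # q is a probability distribution
for n in range(4):
    for i in range(1, len(qs) + 1):
        # the domination used in the proof: q(n) >= q_i(n) / (i(i+1))
        assert q(n) >= qs[i - 1][n] / (i * (i + 1))
```

The key point the check exercises is that the normalizing constant is at most 1, so dividing by it can only increase each mixture weight.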
Therefore, for all p ∈ P with finite entropy and all q_i ∈ Q, we have

$$\sum_{x \in \mathbb{N}} p(x) \log \frac{1}{q(x)} \le \sum_{x \in \mathbb{N}} p(x) \log \frac{i(i+1)}{q_i(x)} = \log(i(i+1)) + \sum_{x \in \mathbb{N}} p(x) \log \frac{1}{q_i(x)}.$$

Since the right hand side of the preceding equation is finite for at least one q_i ∈ Q, this completes the proof. □

3.2 Tightness, Percentiles and Relevance to Redundancy

Let us recall the definition of tightness of a collection of probability distributions on ℕ.

Definition 13. A collection P of probability distributions on ℕ is said to be tight if for every γ > 0 there is a natural number M_γ such that

$$\sup_{p \in P} p(X > M_\gamma) < \gamma. \qquad \Box$$

Informally, tightness says that all percentiles of distributions in P can be uniformly bounded over P. Formally, to define percentiles, we will use the linearly interpolated cumulative distribution function of a probability distribution on ℕ, defined as follows.

Definition 14. For a probability distribution q on ℕ and n ∈ ℕ ∪ {0}, the linearly interpolated cumulative distribution F_q(n) follows the standard definition of the cumulative distribution function, i.e.

$$F_q(n) := P(X \le n), \qquad (9)$$

where X is a random variable distributed according to q. For n ∈ ℕ ∪ {0} and a real number n ≤ x ≤ n + 1, however, we define

$$F_q(x) := (n + 1 - x)\, F_q(n) + (x - n)\, F_q(n+1).$$

Note that F_q is a nondecreasing function whose domain is the nonnegative real numbers and whose range is either [0, 1] or [0, 1). For t ∈ [0, 1), we define F_q^{-1}(t) to be the right continuous inverse of F_q, i.e.

$$F_q^{-1}(t) := \sup\{ x \ge 0 : F_q(x) \le t \}. \qquad \Box$$

Proposition 15. For all distributions p over ℕ, if X_1, X_2, ... are generated i.i.d. p, and t_n is the empirical distribution of X_1, ..., X_n, then for all 0 < γ ≤ 1,

$$F_{t_n}^{-1}(1 - \gamma) \to F_p^{-1}(1 - \gamma) \quad \text{a.s.}$$

Proof For any probability distribution q on ℕ, any 0 < γ ≤ 1, and any positive real number x > 0 that is not an integer (so ⌈x⌉ − ⌊x⌋ = 1), we have

$$F_q^{-1}(1-\gamma) < x \iff F_q(x) > 1 - \gamma \iff (\lceil x \rceil - x) F_q(\lfloor x \rfloor) + (x - \lfloor x \rfloor) F_q(\lceil x \rceil) > 1 - \gamma.$$

Also, for M ∈ ℕ, we have

$$F_q^{-1}(1-\gamma) < M + 1 \iff F_q(M+1) > 1 - \gamma.$$
Hence the claim is a consequence of the fact that for all integers M we have F_{t_n}(M) → F_p(M) a.s., which in turn follows from the strong law of large numbers. □

We now show that tightness of a collection of probability distributions on ℕ is implied by finiteness of the single letter redundancy of the collection. The result we present is a well-known folk theorem, see for example (Haussler, 1997, Lemma 4). Here we give an elementary proof of this result.

Lemma 16. Let P be a collection of probability distributions on ℕ. If the single letter redundancy of P is finite, then P is tight.

Proof We prove the contrapositive. If P is not tight then, for some ϵ > 0, there is a sequence of probability distributions p_n in P such that p_n(X > n) ≥ ϵ. For any probability distribution q over ℕ and any positive real number R, there is some natural number M such that q(X > M) < ϵ 2^{−R}. Thus we have

$$D(p_M \| q) \ge p_M(X \le M) \log \frac{p_M(X \le M)}{q(X \le M)} + p_M(X > M) \log \frac{p_M(X > M)}{q(X > M)} \ge \epsilon R - 1.$$

Noting that R can be made arbitrarily large, we conclude that the single letter redundancy of P is infinite. □

3.3 Bounds on Redundancy

The following technical lemma is used in Example 26 and in Example 30. Its roots go back to Merhav and Feder (1998).

Lemma 17. Let X be a countable set, and P be a collection of probability distributions on X. For i ranging over the finite set of indices {1, ..., M} or over all indices i ≥ 1, let S_i ⊆ X be subsets of X, and assume that these sets are pairwise disjoint. Suppose that for some δ > 0 and each i there exists p_i ∈ P such that p_i(S_i) ≥ δ. Then, for all probability distributions q on X, we have

$$\sup_{p \in P} D(p \| q) \ge \delta \log(M) - 1,$$

if the number of subsets in the collection is finite, equal to M, and

$$\sup_{p \in P} D(p \| q) = \infty,$$

if the number of subsets in the collection is infinite.

Proof This is a simplified formulation of the distinguishability concept in Merhav and Feder (1998).
To prove the claim, note that for any m at most equal to the number of subsets in the collection, we must have q(S_i) ≤ 1/m for some i. For such a choice of i we can write

$$D(p_i \| q) = \sum_{x \in S_i} p_i(x) \log \frac{p_i(x)}{q(x)} + \sum_{x \in S_i^c} p_i(x) \log \frac{p_i(x)}{q(x)} \overset{(a)}{\ge} p_i(S_i) \log \frac{p_i(S_i)}{q(S_i)} + p_i(S_i^c) \log \frac{p_i(S_i^c)}{q(S_i^c)} \ge p_i(S_i) \log \frac{1}{q(S_i)} + p_i(S_i^c) \log \frac{1}{q(S_i^c)} - 1 \ge \delta \log m - 1,$$

where step (a) is from the log sum inequality. This completes the proof. □

4. Characterization of d.w.c. Model Classes

In this section we state our primary result, which is a necessary and sufficient condition for a model class comprised of a collection of probability distributions P on ℕ to be data-derived weakly compressible. We will see that whether a model class P is d.w.c. is decided by a local property of the probability distributions in P, viewed as members of P^∞. Namely, the characterization of data-derived weak compressibility is based on considering a property of local neighborhoods, as defined in Section 4.1, of the individual probability distributions in the model class. Distributions having bad local neighborhoods are what we call deceptive distributions, defined and studied in detail in Section 4.2. The notion of deceptive distributions lies at the heart of our characterization, in Theorem 20, of which model classes are d.w.c.

4.1 Local Neighborhoods

We will see in this section what makes the local neighborhoods of a probability distribution p ∈ P bad and kills d.w.c.: when a stopping rule is forced by p ∈ P into certifying the accuracy of the estimate at some time (which must happen, since the stopping rule has to stop with probability 1 under p), it will nevertheless be the case that there are other probability distributions in P, potentially arbitrarily close to p, which induce inadequate performance on the estimator. We now proceed to make this vague description of the underlying ideas precise.

Definition 18.
An ϵ-neighborhood of p ∈ P is the set B(p, ϵ; P) of all p′ ∈ P such that ‖p − p′‖_1 < ϵ, where ‖·‖_1 denotes the ℓ1 distance. □

4.2 Deceptive Distributions

Data-derived compressibility of a collection P^∞ is captured by how neighborhoods of measures in P^∞ can be compressed. To formalize this, we define the notion of deceptive measures, which have very complex neighborhoods.

Definition 19. p ∈ P is said to be deceptive if the asymptotic per-symbol redundancy of neighborhoods of p is bounded away from 0 in the limit as the neighborhood shrinks to 0. More precisely, we define p ∈ P^∞, or equivalently p ∈ P, to be deceptive if

$$\lim_{\epsilon \to 0}\, \inf_{q}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q) > 0. \qquad (10)$$

In the above, the infimum is over all q that are probability measures on ℕ^∞ (not necessarily obtained by i.i.d. assignments). The verbal description of this condition in terms of the asymptotic per-symbol redundancy of the neighborhoods of p is justified by Lemma 40, which is proved in Appendix B. □

Our main result is the following Theorem 20. The necessity part of this theorem is proved in Section 6 and the sufficiency part in Section 7.

Theorem 20. Let P be a collection of probability distributions on ℕ and P^∞ the associated collection of probability measures on ℕ^∞ obtained by i.i.d. assignments. Then P is d.w.c. iff no p ∈ P is deceptive. □

In the rest of this section we explore the concept of deceptive distributions to flesh out a few properties of such distributions and their neighborhoods. This will help to better understand condition (10) in Definition 19 and will set the stage for understanding the proof of Theorem 20.

4.2.1 A Simpler Characterization of Deceptive Distributions

In determining whether a source p ∈ P is deceptive, (10) allows us to choose q depending on ϵ. We now show that this degree of freedom is unnecessary.

Lemma 21. If p ∈ P is not deceptive, then there is a single probability measure q* on ℕ^∞ such that

$$\lim_{\epsilon \to 0}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q^*) = 0.$$
On the other hand, we have that p is deceptive iff

$$\inf_{q}\, \lim_{\epsilon \to 0}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q) > 0,$$

where the infimum is over all probability measures q on ℕ^∞.

Proof Because p is not deceptive, there exist a sequence (δ_m > 0, m ≥ 1), with lim_m δ_m = 0, and a sequence of probability measures (q_m, m ≥ 1) on ℕ^∞ such that, for all sufficiently large m ≥ 1, we have

$$\limsup_{n \to \infty}\, \sup_{p' \in B(p, 1/m; P)} \frac{1}{n} D_n(p' \| q_m) \le \delta_m.$$

Define the probability measure q* on ℕ^∞ that, for each n ≥ 1 and x ∈ ℕ^n, assigns to the string x the probability

$$q^*(x) := \sum_{m \ge 1} \frac{q_m(x)}{m(m+1)}.$$

For all m ≥ 1, n ≥ 1 and p′ ∈ B(p, 1/m; P), we have

$$\frac{1}{n} D_n(p' \| q^*) \le \frac{1}{n} D_n(p' \| q_m) + \frac{\log(m(m+1))}{n}.$$

This implies that

$$\limsup_{n \to \infty}\, \sup_{p' \in B(p, 1/m; P)} \frac{1}{n} D_n(p' \| q^*) \le \delta_m + \lim_{n \to \infty} \frac{\log(m(m+1))}{n} = \delta_m,$$

so that

$$\lim_{\epsilon \to 0}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q^*) = \lim_{m \to \infty}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, 1/m; P)} \frac{1}{n} D_n(p' \| q^*) \le \lim_{m \to \infty} \delta_m = 0.$$

On the other hand, if p is deceptive, then

$$\inf_{q}\, \lim_{\epsilon \to 0}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q) \ge \lim_{\epsilon \to 0}\, \inf_{q}\, \limsup_{n \to \infty}\, \sup_{p' \in B(p, \epsilon; P)} \frac{1}{n} D_n(p' \| q) > 0.$$

The converse follows from the first part of the Lemma. □

4.2.2 Neighborhoods of Non-Deceptive Distributions are Tight

Recall the definition of tightness of a collection of probability distributions on ℕ from Definition 13. The following corollary is immediate.

Corollary 22. If p ∈ P is not deceptive, then some neighborhood of p is tight.

Proof If p ∈ P is not deceptive then, for some ϵ > 0, there exist n ≥ 1 and a probability measure q on ℕ^∞ such that

$$\sup_{p' \in B(p, \epsilon)} D_n(p' \| q) < \infty.$$

From Proposition 37 in Appendix B, it follows that the single letter redundancy of the neighborhood B(p, ϵ) is finite, which implies that B(p, ϵ) is tight, by Lemma 16. □

The above corollary helps to make a connection between two data-derived formulations: d.w.c., which is considered in this document, and insurability, from Example 10. We showed in Santhanam and Anantharam (2015) that a collection of i.i.d. probability measures P^∞ on ℕ^∞ is insurable iff some neighborhood, exactly as defined here, of every p ∈ P is tight.
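To make the tightness notion of Definition 13 concrete, here is a small numeric illustration. The geometric family below is our own illustrative choice, not a class studied in this paper: a family of geometric distributions whose success probability is bounded below is tight (one percentile bound M_γ works uniformly), while letting the parameter approach 0 destroys any uniform bound:

```python
def tail(theta, M):
    # P(X > M) for X geometric on {1, 2, ...} with success probability theta
    return (1 - theta) ** M

def M_gamma(thetas, gamma):
    # Smallest M with sup over the family of P(X > M) < gamma; for this
    # family the supremum is attained at the smallest theta.
    worst = min(thetas)
    M = 0
    while tail(worst, M) >= gamma:
        M += 1
    return M

# {Geom(theta): theta >= 0.2} is tight: a single M_gamma covers every member.
M = M_gamma([0.2, 0.5, 0.9], 0.01)
assert all(tail(t, M) < 0.01 for t in (0.2, 0.5, 0.9))

# As theta decreases toward 0 the required M blows up, so
# {Geom(theta): theta > 0} is not tight.
assert M_gamma([0.01], 0.01) > M_gamma([0.1], 0.01) > M_gamma([0.5], 0.01)
```

This is exactly the failure mode exploited in the non-tight neighborhoods of Examples 24 and 25: mixing in distributions whose mass escapes past any fixed percentile bound.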
We therefore obtain

Corollary 23. Let P be a collection of probability distributions on ℕ and let P^∞ denote the associated collection of i.i.d. probability measures on ℕ^∞. If P is d.w.c., then P^∞ is insurable. □

In both cases, note that the condition relies on some neighborhood within the model class of every model being simple. We expect this kind of locality to appear as a feature of the characterization of which model classes admit data-derived estimators in most data-derived formulations.

5. Examples

We now discuss a series of examples that highlight various aspects of our formulation. These examples also help flesh out the notion of what it means for a probability distribution to be deceptive.

5.1 Strongly Compressible ⊊ d.w.c. ⊊ Weakly Compressible

We first give examples showing that weakly compressible collections of probability distributions on ℕ are a strictly richer class of models than d.w.c. collections. We also show that there are collections of probability distributions on ℕ that are d.w.c. but are not strongly compressible.

5.1.1 Weakly Compressible but Not d.w.c.

We consider two examples in this category. A monotone probability distribution p on ℕ is one that satisfies p(y) ≥ p(y+1) for all y ∈ ℕ. Let M denote the collection of all monotone probability distributions on ℕ and M^∞ be the corresponding collection of i.i.d. probability measures on ℕ^∞.

Figure 1: Summary of examples: M^∞_h is strongly compressible (hence d.w.c., insurable and weakly compressible), U^∞ and F^∞_h are d.w.c. (hence insurable and weakly compressible), B is weakly compressible and insurable but not d.w.c., N and M are weakly compressible, but not insurable nor d.w.c., while I is insurable but not weakly compressible. Note that Corollary 23 shows that all d.w.c. collections are insurable, while Claim 6 and Claim 7 show that strong compressibility implies d.w.c. and that d.w.c. implies weak compressibility, respectively.

Example 24.
(M is weakly compressible but not d.w.c.) To see that M is weakly compressible (Elias (1975)) note that, for all p ∈ M and all n ∈ ℕ, we have p(n) ≤ 1/n. It follows that every p ∈ M with finite entropy must satisfy

$$\sum_{n \ge 1} p(n) \log n \le \sum_{n \ge 1} p(n) \log \frac{1}{p(n)} < \infty. \qquad (11)$$

Now consider the probability distribution q on ℕ assigning probability q(n) = 6/(π² n²) to n ∈ ℕ. From (11) we see that, for all p ∈ M with finite entropy, we have

$$\sum_{n \ge 1} p(n) \log \frac{1}{q(n)} < \infty.$$

From Lemma 12 we conclude that M is weakly compressible.

It turns out that all the probability distributions p ∈ M are deceptive. To conclude this, we show that no neighborhood around any p ∈ M is tight and then appeal to Corollary 22. This then implies, by Theorem 20, that M is not d.w.c. In fact, it would have been enough to show that there exists some p ∈ M such that no neighborhood of p is tight.

Let U denote the collection of all uniform distributions over finite supports of the form {m, m+1, ..., M}, where m and M are positive integers with m ≤ M. For p ∈ M and ϵ > 0, consider the collection

$$M(p, \epsilon) := \{ p' : p' = (1 - \alpha) p + \alpha q \text{ for } q \in U \cap M \text{ and } 0 \le \alpha < \epsilon \}. \qquad (12)$$

In (12), q can be any monotone uniform distribution, namely a uniform distribution with support {1, ..., M} for some M > 0. Clearly M(p, ϵ) ⊆ M. Note also that M(p, ϵ) is a subset of an ℓ1 neighborhood of p corresponding to ℓ1 distance 2ϵ. We will show that M(p, ϵ) is not tight for all p and all ϵ > 0. By the definition of neighborhoods in Definition 18, it then follows that no neighborhood of any p ∈ M is tight.

For 0 < α < ϵ, let 0 < δ < α and n ≥ 1. Observe that if the support {1, ..., M} of a uniform distribution q ∈ U ∩ M satisfies M ≥ n/(1 − δ/α), then we have

$$q(\{ j : j > n \}) = 1 - \frac{n}{M} \ge \frac{\delta}{\alpha}.$$

Thus, given any p ∈ M, we have a distribution p′ = (1 − α)p + αq ∈ M(p, ϵ) that satisfies p′({j : j > n}) ≥ δ. Therefore, M(p, ϵ) is not tight. This completes the argument. □

For our second example, we consider the set N^∞_1 of all i.i.d.
probability measures on ℕ^∞ corresponding to the set, denoted N_1, of all probability distributions p on ℕ such that E_p X < ∞.

Example 25. (N^∞_1 is weakly compressible but not d.w.c.) Note that every p ∈ N_1 has finite entropy. Also, by definition, all p ∈ N_1 satisfy Σ_{i≥1} i p_i < ∞. Therefore the simplified version of Kieffer's condition for weak compressibility, as stated in Lemma 12, is satisfied by the distribution q(i) := 1/2^i (i ≥ 1). Thus we conclude that N_1 is weakly compressible.

We can show that every p ∈ N_1 is deceptive by showing that no neighborhood of any p ∈ N_1 is tight. The approach is similar to that in Example 24. Given ϵ > 0, consider distributions of the form p′ = (1 − α)p + αq, where q ∈ U is a uniform distribution over a support of the form {m, m+1, ..., M}, and 0 < α < ϵ. Since q has finite support, we have p′ ∈ N_1. As in Example 24 we observe that (i) the ℓ1 distance between p and p′ is strictly less than 2ϵ; and (ii) for all 0 < δ < α and n ≥ 1, we can pick q ∈ U, with U defined as in Example 24, whose support satisfies M ≥ n/(1 − δ/α), which then implies that the (1 − δ) percentile of p′ := (1 − α)p + αq can be made to lie above n. Since the above construction works for arbitrary n ≥ 1, and in view of the way in which neighborhoods are defined in Definition 18, no neighborhood of any p ∈ N_1 is tight, which shows that every p ∈ N_1 is deceptive and hence, by Theorem 20, that N_1 cannot be d.w.c. As in Example 24, to apply Theorem 20 it would have been enough to show that there is at least one p ∈ N_1 which is deceptive. □

5.1.2 d.w.c. but Not Strongly Compressible

The example we consider in this category is U, which was defined in Example 24. Let U^∞ denote the collection of all i.i.d. probability measures on ℕ^∞ corresponding to U.

Example 26. (U^∞ is not strongly compressible but is d.w.c.) We first show that U has infinite single letter redundancy. To see this, we partition ℕ into disjoint subsets (T_i, i ≥ 0), where T_i := {2^i, ..., 2^{i+1} − 1}.
For each T_i there is an associated distribution p_i ∈ U such that p_i(T_i) = 1. Since the number of these disjoint sets T_i is infinite, we conclude from Lemma 17 that the single letter redundancy of U is ∞. From the second part of Proposition 37 we can now conclude that the length-n redundancy of U is ∞ for all n ≥ 1, so its asymptotic per-symbol redundancy is also ∞, which means, by Lemma 34, that U is not strongly compressible.

To see that U is d.w.c., note that around each probability distribution p ∈ U there is an ℓ1-neighborhood that contains no other probability distribution in U. Such a neighborhood has length-n redundancy equal to 0 for all n, because the only distribution in the neighborhood is p. Hence the asymptotic per-symbol redundancy of all sufficiently small neighborhoods of each p ∈ U is zero, which means, by definition, that no p ∈ U is deceptive; see Definition 19. □

5.1.3 Strongly Compressible and d.w.c.

For completeness we next give an example of a collection of probability distributions on ℕ which is strongly compressible, hence automatically d.w.c. For h > 0, we consider the set M_h ⊆ M of all monotone probability distributions on ℕ for which the second moment of the self-information satisfies the bound

$$E_p \log^2 \frac{1}{p(X)} \le h.$$

Let M^∞_h denote the set of all i.i.d. probability measures on ℕ^∞ corresponding to M_h.

Example 27. (M^∞_h is strongly compressible, hence d.w.c.) Note that for any monotone probability distribution p on ℕ and all i ≥ 1 we have p(i) ≤ 1/i. Therefore for any p ∈ M_h, if X is a random variable taking values in ℕ with the probability distribution p, we have

$$E_p \log^2(X) \le E_p \log^2 \frac{1}{p(X)} \le h.$$

Therefore, for all p ∈ M_h, we have by the Cauchy-Schwarz inequality that E_p log X ≤ √h.
Now, for the probability distribution q on ℕ given by q(i) = 1/(i(i+1)), i ≥ 1, we have

$$\sup_{p \in M_h} E_p \log \frac{1}{q(X)} = \sup_{p \in M_h} E_p \log(X^2 + X) \le \sup_{p \in M_h} E_p (2 \log X + 2) \le \sup_{p \in M_h} \sqrt{E_p (2 \log X + 2)^2} \le 2(\sqrt{h} + 1),$$

where the last inequality follows because, for all p ∈ M_h, we have

$$E_p (2 \log X + 2)^2 = 4\, E_p \left( \log^2 X + 2 \log X + 1 \right) \le 4(h + 2\sqrt{h} + 1) = 4(\sqrt{h} + 1)^2.$$

Therefore (see Appendix D for a proof), we can construct a probability measure q* on ℕ^∞ for which sup_{p∈M_h} (1/n) D_n(p‖q*) → 0 as n → ∞. From this it follows that the collection M^∞_h is strongly compressible, and therefore d.w.c., trivially from Claim 6. □

Comparing Examples 24 and 27, we observe that countable unions of d.w.c. model classes need not be d.w.c. In fact, as we will see in Example 30, even finite unions of d.w.c. model classes need not be d.w.c.

5.2 d.w.c. Collections

Thus far, we have seen two d.w.c. classes, U^∞ and M^∞_h, but neither is completely satisfying. In the collection U above, there was a neighborhood around each probability distribution p ∈ U with no other element of U, so U trivially satisfied the local condition characterizing d.w.c. in Theorem 20. The M_h case falls into the other extreme: the entire model collection M_h is strongly compressible, and therefore the condition characterizing d.w.c. in Theorem 20 was again satisfied in a trivial way. We therefore now construct two additional examples of d.w.c. model classes that are much more interesting. Our first example is of d.w.c. model classes F_h in which neither of the two extreme situations mentioned above holds. Our second example is of a d.w.c. model class H with a source none of whose neighborhoods is strongly compressible, but where the asymptotic per-symbol redundancy diminishes to 0 as the neighborhood shrinks to the defining probability distribution.

5.2.1 More Interesting d.w.c. Model Classes

For a probability distribution p on ℕ and a number M > 0, define the probability distribution p^{(M)} by

$$p^{(M)}(n) := \begin{cases} p(n - M), & n \ge M + 1, \\ 0, & \text{else.} \end{cases}$$

Namely, p^{(M)} shifts p to the right by M.
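As a minimal sketch of the shift operation just defined (function and variable names are ours, not the paper's), a finitely supported distribution can be represented as a dict from support points to masses:

```python
def shift(p, M):
    # p^{(M)}(n) = p(n - M) for n >= M + 1, and 0 otherwise: the whole
    # distribution is translated M places to the right.
    return {n + M: mass for n, mass in p.items()}

p2 = {1: 0.5, 2: 0.25, 3: 0.25}   # a finitely supported distribution
shifted = shift(p2, 4)            # p2^{(4)}
assert shifted == {5: 0.5, 6: 0.25, 7: 0.25}
assert sum(shifted.values()) == 1.0  # shifting preserves total mass
```

In the classes F_h defined next, this operation is what places the monotone component entirely to the right of the uniform component, so that the two pieces of the mixture have disjoint supports.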
Furthermore, let the span of any probability distribution p on ℕ having finite support be defined to be the largest natural number which has non-zero probability under p. For h > 0, we consider the model classes

$$F_h := \left\{ (1 - \epsilon)\, p_1 + \epsilon\, p_2^{(\mathrm{span}(p_1) + 1)} : p_1 \in U,\ p_2 \in M_h \text{ and } 0 < \epsilon < 1 \right\}.$$

As usual, let F^∞_h denote the set of i.i.d. probability measures on ℕ^∞ associated to F_h. Note that the initial uniform component of any p ∈ F_h is uniquely determined.

Example 28. F^∞_h is d.w.c.

Proof Let the base of any probability distribution over the naturals be the smallest natural number which has non-zero probability. Consider any probability distribution p = (1 − ϵ)p_1 + ϵ p_2^{(span(p_1)+1)} ∈ F_h with p_1 ∈ U, p_2 ∈ M_h, and 0 < ϵ < 1. Let m denote base(p) (which clearly equals base(p_1)), and let m + M − 1 denote span(p_1), where M ≥ 1. Thus |support(p_1)| = M. Consider any probability distribution u ∈ F_h, written as u = (1 − ϵ′)u_1 + ϵ′ u_2^{(span(u_1)+1)}, where u_1 ∈ U, u_2 ∈ M_h, and 0 < ϵ′ < 1. Suppose that u is within ℓ1 distance (1 − ϵ)²/(M(M+1)) of p. We show that span(u_1) ≤ m + M/(1 − ϵ). To see this, suppose to the contrary that span(u_1) > m + M/(1 − ϵ). If base(u_1) ≤ m, then |support(u_1)| > M/(1 − ϵ) + 1, so all elements in the support of p_1 are assigned probability at most 1/(M/(1−ϵ) + 1) by u. If base(u_1) > m, then u(base(p_1)) = 0. Thus, in either case, we have u(base(p_1)) ≤ 1/(M/(1−ϵ) + 1). We can now lower bound the ℓ1 distance between p and u by

$$\frac{1-\epsilon}{M} - \frac{1}{M/(1-\epsilon) + 1} = \frac{(1-\epsilon)^2}{M(M + 1 - \epsilon)} > \frac{(1-\epsilon)^2}{M(M+1)}.$$

This contradiction proves the claim. Now, for fixed numbers m′ and M′, consider the collection P_{m′,M′} ⊆ F_h of all probability distributions with base m′ and whose initial uniform component has support of size M′. Recall that M_h was shown to be strongly compressible in Example 27. Observe that the redundancy of P_{m′,M′} will be at most the redundancy of M_h plus 1. Therefore we must also have that P_{m′,M′} is strongly compressible.
The set of all probability distributions in the ℓ1 neighborhood of p ∈ F_h with radius (1 − ϵ)²/(M(M+1)) can be decomposed into the finite union of the collections P_{m′,M′} that intersect it: by the span bound established above, only finitely many pairs (m′, M′) can occur. Each component of the finite union is strongly compressible. Therefore it follows that this neighborhood of p ∈ F_h is strongly compressible. Thus no p ∈ F_h is deceptive and the collection is d.w.c. □

We next construct a d.w.c. collection H in which one of the probability distributions has no neighborhood of non-zero radius that is strongly compressible. We again partition ℕ into (T_i, i ≥ 0) as before, where T_i = {2^i, ..., 2^{i+1} − 1} for i ≥ 0. Let H contain the probability distribution p_0 that assigns probability 1/((i+1)(i+2)) to 2^i for all i ≥ 0. We will construct H in such a way that, while p_0 is not deceptive in H, no neighborhood of p_0 in H will be strongly compressible.

We construct H in several steps. We first fix a sequence (ϵ_m, m ≥ 2) such that 0 < ϵ_m < 1/2 and lim_{m→∞} ϵ_m = 0. Next, for m ≥ 2, k ≥ m, and j ∈ {2^k + 1, ..., 2^k + ⌈2^{kϵ_m}⌉}, we define the probability distribution

$$p_{m,k,j}(r) := \begin{cases} p_0(r), & \text{if } 1 \le r \le 2^{m-1} - 1, \\ \frac{1}{m} - \frac{1}{k+1}, & \text{if } r = 2^{m-1} + 1, \\ \frac{1}{k+1}, & \text{if } r = j, \\ 0, & \text{else.} \end{cases}$$

Now, for m ≥ 2 and k ≥ m, let

$$H_{m,k} := \left\{ p_{m,k,j} : 2^k + 1 \le j \le 2^k + \lceil 2^{k\epsilon_m} \rceil \right\},$$

let H_m := ∪_{k≥m} H_{m,k}, and, finally, let H := {p_0} ∪ (∪_{m≥2} H_m).

A few observations about our construction. For all m ≥ 2, all the probability distributions in H_m assign probabilities exactly as p_0 does to every element in ∪_{i=0}^{m−2} T_i, and the rest of their support is disjoint from that of p_0. It follows that, for all m ≥ 2 and all p ∈ H_m, we have

$$\| p - p_0 \|_1 = \frac{2}{m}. \qquad (13)$$

Hence, for all m ≥ 2, the set of probability distributions in H within ℓ1 distance at most 2/m from p_0 is precisely {p_0} ∪ (∪_{r≥m} H_r). Around any probability distribution in H other than p_0, there is a non-zero neighborhood containing no other probability distribution that belongs to H. Therefore, none of the probability distributions in H other than p_0 can possibly be deceptive. Hence, to show that H is d.w.c., we have to prove that p_0 is not deceptive.
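The bookkeeping in the construction of H can be checked numerically. The sketch below (names are ours, and the truncation level is an implementation convenience) builds p_0 and one member p_{m,k,j} of H_m with exact rational arithmetic, and confirms that the ℓ1 distance between them equals 2/m, as used in the construction:

```python
from fractions import Fraction as F

IMAX = 20  # powers of 2 retained; p0's tail mass beyond 2^(IMAX-1) is 1/(IMAX+1)

def p0(r):
    # p0 puts mass 1/((i+1)(i+2)) on 2^i for i >= 0; these masses sum to 1
    for i in range(IMAX):
        if r == 2 ** i:
            return F(1, (i + 1) * (i + 2))
    return F(0)

def p_mkj(r, m, k, j):
    # agrees with p0 on {1, ..., 2^(m-1) - 1}, then two atoms, as in the construction
    if r <= 2 ** (m - 1) - 1:
        return p0(r)
    if r == 2 ** (m - 1) + 1:
        return F(1, m) - F(1, k + 1)
    if r == j:
        return F(1, k + 1)
    return F(0)

m, k = 3, 5
j = 2 ** k + 1  # a legal choice of j in {2^k + 1, ...}
pts = sorted({2 ** i for i in range(IMAX)} | {2 ** (m - 1) + 1, j})
l1 = sum(abs(p_mkj(r, m, k, j) - p0(r)) for r in pts)
l1 += F(1, IMAX + 1)  # p0's remaining tail mass, on which p_mkj vanishes
assert l1 == F(2, m)  # the l1 distance claimed in the construction of H
```

The distance splits exactly as in the argument: p_0's mass 1/m on the powers 2^i with i ≥ m − 1, plus the mass 1/m that p_{m,k,j} places on its two atoms, neither of which is a power of 2.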
Example 29. (None of the neighborhoods of p_0 ∈ H is strongly compressible.) We show that for all m ≥ 2 the collection of probability distributions H_m is not strongly compressible, i.e., its asymptotic per-symbol redundancy is bounded away from zero. To see this, for 2^k + 1 ≤ j ≤ 2^k + ⌈2^{kϵ_m}⌉, let S_j ⊆ ℕ^{k+1} be the set of all length-(k+1) sequences all of whose symbols but one are from ∪_{i=0}^{m−1} T_i, with exactly one occurrence of the number j in the sequence. Clearly, for distinct j, the sets S_j are disjoint. Observe that

$$p_{m,k,j}(S_j) = \left( 1 - \frac{1}{k+1} \right)^k \ge e^{-1}.$$

Therefore, from Lemma 17, we have that the length-(k+1) redundancy of H_{m,k}, which we denote by R_{k+1}(H_{m,k}), satisfies

$$R_{k+1}(H_{m,k}) \ge e^{-1} \log \lceil 2^{k\epsilon_m} \rceil - 1 \ge \frac{k \epsilon_m}{e} - 1.$$

Since for all k ≥ m ≥ 2 we have H_{m,k} ⊆ H_m, it follows that for m ≥ 2 the length-n redundancy of H_m, for n ≥ m + 1, which we denote by R_n(H_m), satisfies

$$R_n(H_m) \ge R_n(H_{m,n-1}) \ge \frac{(n-1)\epsilon_m}{e} - 1.$$

Hence, the asymptotic per-symbol redundancy of H_m satisfies

$$\limsup_{n \to \infty} \frac{R_n(H_m)}{n} \ge \frac{\epsilon_m}{e}.$$

Thus H_m is not strongly compressible and, in particular, neither is any ℓ1 neighborhood of p_0, since any such neighborhood contains H_m for all sufficiently large m. Nevertheless, we can show that p_0 is not deceptive. We will verify that, as m → ∞, the asymptotic per-symbol redundancy of an ℓ1 neighborhood of radius 2(m+1)/m² around p_0 goes to 0. To do so, observe from Proposition 38 that the asymptotic per-symbol redundancy of any collection of probability distributions on ℕ is upper bounded by the single-letter redundancy of the collection. Recall that for m ≥ 2 the ℓ1 neighborhood of radius 2(m+1)/m² around p_0 is the collection {p_0} ∪ (∪_{l≥m} H_l). We will verify that the single-letter redundancy of {p_0} ∪ (∪_{l≥m} H_l) diminishes to 0 as m → ∞, which will then imply that p_0 is not deceptive, using Proposition 38. For m ≥ 2, let q_m be the probability distribution on ℕ defined by

$$q_m(r) := \begin{cases} p_0(r), & \text{if } 1 \le r \le 2^{m-1} - 1, \\ \frac{1}{m} - \frac{1}{m+1}, & \text{if } r = 2^{m-1} + 1, \\ \frac{1}{(k+1)(k+2)} \cdot \frac{1}{\lceil 2^{k\epsilon_m} \rceil}, & \text{if } r \in \{2^k + 1, \ldots, 2^k + \lceil 2^{k\epsilon_m} \rceil\},\ k \ge m, \\ 0, & \text{else.} \end{cases}$$

Let l ≥ m ≥ 2. Then, for every k ≥ l and j ∈ {2^k + 1, . . .
$2^k + 2^{\lceil k\epsilon_l\rceil}\}$, note that $p_{l,k,j} \in \mathcal{H}_{l,k}$ and $q_l$ assign the same probabilities as those assigned by $p_0$ to every number $\le 2^{l-1}-1$. It follows that

$$D(p_{l,k,j}\|q_l) = p_{l,k,j}(2^{l-1}+1)\log\frac{p_{l,k,j}(2^{l-1}+1)}{q_l(2^{l-1}+1)} + p_{l,k,j}(j)\log\frac{p_{l,k,j}(j)}{q_l(j)} \le \frac{\log(l+1)}{l} + \frac{\log(k+2)}{k+1} + \frac{\lceil k\epsilon_l\rceil}{k+1} \le \frac{2\log(l+1)}{l} + \epsilon_l + \frac{1}{l+1}. \tag{14}$$

Now, for $m \ge 2$, consider the mixture probability distribution $\bar q_m$ on $\mathbb{N}$ given by

$$\bar q_m(r) := \sum_{l\ge m} \frac{m}{l(l+1)}\, q_l(r).$$

Fix $m \ge 2$. We have seen that any probability distribution in $\mathcal{H}$ in the $\ell_1$ neighborhood of radius $\frac{2(m+1)}{m^2}$ around $p_0$ must belong to $\{p_0\} \cup (\cup_{l\ge m}\mathcal{H}_l)$. For every $k \ge l \ge m$ and $j \in \{2^k+1, \ldots, 2^k+2^{\lceil k\epsilon_l\rceil}\}$, we observe that $p_{l,k,j} \in \mathcal{H}_{l,k}$ and $\bar q_m$ assign the same probabilities as those assigned by $p_0$ to every number $\le 2^{m-1}-1$. Also, $p_0$ and $\bar q_m$ agree on every number $\le 2^{m-1}-1$. We will now use this observation to find upper bounds for $D(p_{m,k,j}\|\bar q_m)$ for $k \ge m$ and $j \in \{2^k+1, \ldots, 2^k+2^{\lceil k\epsilon_m\rceil}\}$, then for $D(p_{l,k,j}\|\bar q_m)$ for $k \ge l \ge m+1$ and $j \in \{2^k+1, \ldots, 2^k+2^{\lceil k\epsilon_l\rceil}\}$, and finally for $D(p_0\|\bar q_m)$.

4. The choice of radius $\frac{2(m+1)}{m^2}$ is made since it satisfies $\frac2m < \frac{2(m+1)}{m^2} < \frac{2}{m-1}$ for $m \ge 2$, and we defined $\ell_1$ neighborhoods to be open sets.

For $k \ge m$ and $j \in \{2^k+1, \ldots, 2^k+2^{\lceil k\epsilon_m\rceil}\}$, we write

$$D(p_{m,k,j}\|\bar q_m) = p_{m,k,j}(2^{m-1}+1)\log\frac{p_{m,k,j}(2^{m-1}+1)}{\bar q_m(2^{m-1}+1)} + p_{m,k,j}(j)\log\frac{p_{m,k,j}(j)}{\bar q_m(j)} \le p_{m,k,j}(2^{m-1}+1)\log\frac{(m+1)\,p_{m,k,j}(2^{m-1}+1)}{q_m(2^{m-1}+1)} + p_{m,k,j}(j)\log\frac{(m+1)\,p_{m,k,j}(j)}{q_m(j)} \le \frac{3\log(m+1)}{m} + \epsilon_m + \frac{1}{m+1}, \tag{15}$$

where the middle inequality holds because $\bar q_m \ge \frac{m}{m(m+1)}\,q_m = \frac{q_m}{m+1}$ pointwise, and the last step uses (14) for the choice $l = m$, together with the fact that $p_{m,k,j}$ places total mass $\frac1m$ on the two points $2^{m-1}+1$ and $j$. For $k \ge l \ge m+1$ and $j \in \{2^k+1, \ldots, 2^k+2^{\lceil k\epsilon_l\rceil}\}$, we write

$$D(p_{l,k,j}\|\bar q_m) = \sum_{n=m-1}^{l-2} p_{l,k,j}(2^n)\log\frac{p_{l,k,j}(2^n)}{\bar q_m(2^n)} + \sum_{r\ge 2^{l-1}} p_{l,k,j}(r)\log\frac{p_{l,k,j}(r)}{\bar q_m(r)}$$
$$\le \sum_{n=m-1}^{l-2} p_{l,k,j}(2^n)\log\frac{(n+2)(n+3)\,p_{l,k,j}(2^n)}{m\,q_{n+2}(2^n)} + \sum_{r\ge 2^{l-1}} p_{l,k,j}(r)\log\frac{l(l+1)\,p_{l,k,j}(r)}{m\,q_l(r)}$$
$$\overset{(a)}{\le} \sum_{n=m-1}^{\infty} \frac{\log\big((n+2)(n+3)\big)}{(n+1)(n+2)} + \sum_{r\ge 2^{l-1}} p_{l,k,j}(r)\log\frac{p_{l,k,j}(r)}{q_l(r)} + \frac{2\log(l+1)}{l}$$
$$\le \sum_{n=m-1}^{\infty} \frac{\log\big((n+2)(n+3)\big)}{(n+1)(n+2)} + \epsilon_l + \frac{4\log(l+1)}{l} + \frac{1}{l+1}$$
$$\le \sum_{n=m-1}^{\infty} \frac{\log\big((n+2)(n+3)\big)}{(n+1)(n+2)} + \epsilon_m + \frac{4\log(m+1)}{m} + \frac{1}{m+1}, \tag{16}$$

where (a) uses the bound $\log\frac{l(l+1)}{m} \le 2\log(l+1)$, observes that $q_{n+2}(2^n) = p_0(2^n) = p_{l,k,j}(2^n)$, and the next step uses (14).

To bound $D(p_0\|\bar q_m)$ from above, note that $\bar q_m(2^n) = \frac{m}{n+2}\,p_0(2^n)$ for $n \ge m-1$. Therefore we have

$$D(p_0\|\bar q_m) = \sum_{n=m-1}^{\infty} p_0(2^n)\log\frac{p_0(2^n)}{\bar q_m(2^n)} = \sum_{n=m-1}^{\infty}\frac{1}{(n+1)(n+2)}\log\frac{n+2}{m} \le \sum_{n=m-1}^{\infty}\frac{\log(n+1)}{(n+1)(n+2)}. \tag{17}$$

From (15), (16), and (17), the single-letter redundancy of all sources around $p_0$ within $\ell_1$ distance $\frac{2(m+1)}{m^2}$ of $p_0$ satisfies the upper bound

$$\sup_{p \in \{p_0\}\cup(\cup_{l\ge m}\mathcal{H}_l)} D(p\|\bar q_m) \le \sum_{n=m-1}^{\infty}\frac{\log\big((n+2)(n+3)\big)}{(n+1)(n+2)} + \epsilon_m + \frac{4\log(m+1)}{m} + \frac{1}{m+1}. \tag{18}$$

Note that

$$\sum_{n\ge1}\frac{\log\big((n+2)(n+3)\big)}{(n+1)(n+2)} < \infty.$$

Hence, as $m \to \infty$, each of the terms on the right side of (18) converges to $0$. Since the single-letter redundancy of $\{p_0\}\cup(\cup_{l\ge m}\mathcal{H}_l)$ diminishes to $0$ as $m \to \infty$, from Proposition 38, the asymptotic per-symbol redundancy of $\{p_0\}\cup(\cup_{l\ge m}\mathcal{H}_l)$ also diminishes to zero as $m \to \infty$. Therefore $p_0$ is not deceptive.

In conclusion, none of the neighborhoods of $p_0$ is strongly compressible, from (13), since the asymptotic per-symbol redundancy of the $\ell_1$ neighborhood of $p_0$ of radius $\frac{2(m+1)}{m^2}$ is lower bounded by $\epsilon_m/e > 0$. Yet, as we showed above, $p_0$ is not deceptive. As noted above, no other probability distribution in $\mathcal{H}$ can possibly be deceptive, since each has a neighborhood of nonzero radius around it containing no other probability distribution from $\mathcal{H}$. Therefore, $\mathcal{H}$ is d.w.c. $\Box$

5.3 Non-d.w.c. Collections

We now construct two examples of non-d.w.c.
model classes to illustrate some additional points. In Example 30 we define a model class $\mathcal{B}$ in which exactly one source is deceptive. This means that $\mathcal{B}$ is not d.w.c. However, removing the single deceptive source renders the rest of the model class d.w.c. Put another way, adding a single source to a d.w.c. model class may make the resulting bigger model class non-d.w.c. Since a model class with one source is trivially d.w.c., it follows that even finite unions of d.w.c. classes may not be d.w.c. The second example we give here is an insurable model class $\mathcal{I}$ that is not d.w.c.; see Example 10 for the definition of insurability of a model class.

Partition $\mathbb{N}$ into $(T_i, i\ge0)$, where $T_i := \{2^i,\ldots,2^{i+1}-1\}$, $i \ge 0$. For $0 < \epsilon < 1$, let $n_\epsilon = \lceil \frac1\epsilon \rceil$. Note that $\epsilon$ lies in the range $[\frac{1}{n_\epsilon}, \frac{1}{n_\epsilon - 1})$. For $1 \le j \le 2^{n_\epsilon}$, let $p_{\epsilon,j}$ be the probability distribution on $\mathbb{N}$ that assigns probability $1-\epsilon$ to the natural number $1$ (or equivalently, to the set $T_0$), and $\epsilon$ to the natural number $2^{n_\epsilon}+j-1$. Finally, let $p_0$ be the singleton probability distribution assigning probability $1$ to the natural number $1$. Now, let $\mathcal{B}$ (mnemonic for binary, since every probability distribution in $\mathcal{B}$ has support of cardinality at most $2$) be the collection of probability distributions on $\mathbb{N}$ defined by

$$\mathcal{B} := \{p_{\epsilon,j} : 0 < \epsilon < 1,\ 1 \le j \le 2^{n_\epsilon}\} \cup \{p_0\}.$$

As usual, $\mathcal{B}^\infty$ denotes the set of i.i.d. probability measures on $\mathbb{N}^\infty$ corresponding to $\mathcal{B}$.

Example 30. ($p_0$ is the unique probability distribution in $\mathcal{B}$ that is deceptive.) An $\ell_1$ neighborhood of radius $\delta$ around $p_0$ is comprised of $p_0$ and the $p_{\epsilon,j}$ for all $0 < \epsilon < \delta/2$ and all $1 \le j \le 2^{n_\epsilon}$. For all $n \ge 1$ and $j \in T_n$, let $S_{n,j}$ denote the set of all length-$n$ strings of natural numbers with exactly one appearance of $j$ and the remaining $n-1$ elements of the string being $1$. Then, writing $p_{\frac1n,j}$ for the probability distribution in $\mathcal{B}$ assigning probability $\frac1n$ to $j \in T_n$, we have

$$p^\infty_{\frac1n,j}(S_{n,j}) = \left(1-\frac1n\right)^{n-1} \ge e^{-1}.$$

For each $n \ge 1$, the sets $S_{n,j}$ are disjoint as $j$ ranges over $T_n$. Further, they are subsets of $\mathbb{N}^n$.
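The probability just computed can be checked directly (a minimal sketch; the function name is ours): a length-$n$ string lies in $S_{n,j}$ exactly when one of its $n$ positions holds $j$ and the rest hold $1$.

```python
from math import e

def prob_S(n):
    # Under the i.i.d. source with mass 1/n on j and 1 - 1/n on the symbol 1,
    # P(exactly one j in n draws) = n * (1/n) * (1 - 1/n)^(n-1).
    return n * (1 / n) * (1 - 1 / n) ** (n - 1)

for n in (2, 10, 100, 10**4):
    assert prob_S(n) >= 1 / e      # the bound (1 - 1/n)^(n-1) >= 1/e
assert abs(prob_S(10**6) - 1 / e) < 1e-3   # and it converges to 1/e
```

The inequality $(1-\frac1n)^{n-1} \ge e^{-1}$ holds for every $n \ge 1$, with the limit $e^{-1}$ approached from above.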
Therefore, Lemma 17 implies that the length-$n$ redundancy of the collection $\{p^\infty_{\frac1n,j} : j \in T_n\}$ is lower bounded by $\frac{n}{e}$. Therefore, for all $n > \frac2\delta$, the length-$n$ redundancy of the $\ell_1$ neighborhood of radius $\delta$ is bounded below by $n e^{-1}$. This implies that the asymptotic per-symbol redundancy of the $\ell_1$ neighborhood of radius $\delta$ is bounded below by $\frac1e$. From the second part of Lemma 21, we conclude that $p_0$ is deceptive.

On the other hand, for $0 < \epsilon < 1$, around every other probability distribution $p_{\epsilon,j} \in \mathcal{B}$ there is an $\ell_1$ neighborhood of radius $\frac{1}{n_\epsilon}$ that contains only probability distributions in $\mathcal{B}$ that have support equal to $\{1, 2^{n_\epsilon}+j-1\}$. For $n \ge 1$, let $\hat r_n$ denote the probability measure on $\mathbb{N}^n$ giving probability $\frac{1}{(n+1)\binom{n}{k}}$ to each of the strings in $\mathbb{N}^n$ comprised of $k$ occurrences of $2^{n_\epsilon}+j-1$ and $n-k$ occurrences of $1$, $0 \le k \le n$. Let $r_n$ be the probability measure on $\mathbb{N}^\infty$ corresponding to $\hat r_n$, as in Lemma 32. Then, for every $p \in \mathcal{B}$ in this $\ell_1$ neighborhood of $p_{\epsilon,j}$, we have for all $n$

$$D_n(p^\infty\|r_n) \le \log(n+1).$$

Noting that the probability measure $r := \sum_{m\ge1}\frac{r_m}{m(m+1)}$ on $\mathbb{N}^\infty$ satisfies

$$\limsup_{n\to\infty}\ \sup_{p:\|p-p_{\epsilon,j}\|_1 < \frac{1}{n_\epsilon}} \frac1n D_n(p^\infty\|r) \le \lim_{n\to\infty}\frac{\log(n+1)+\log\big(n(n+1)\big)}{n} = 0,$$

we conclude that for every $p_{\epsilon,j} \in \mathcal{B}$ there is an $\ell_1$ neighborhood of $p_{\epsilon,j}$ that has zero asymptotic per-symbol redundancy. We conclude that, while $p_0$ is deceptive, no other probability distribution in $\mathcal{B}$ is deceptive.

Indeed, this is quite intuitive when we think about what is involved operationally in compressing strings of integers whose statistics are i.i.d. and governed by a probability distribution in $\mathcal{B}$. If at any point we see two distinct symbols in such a string, there is no ambiguity about what the underlying distribution is from that point on, and very little ambiguity in the probabilities of the two distinct symbols seen, of which one must be the symbol $1$.
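The bound $D_n(p^\infty\|r_n) \le \log(n+1)$ used above can be verified numerically (a sketch; the function name is ours): grouping strings by the number $k$ of occurrences of the second symbol, the divergence equals $\log(n+1)$ minus the entropy of the binomial count, hence is at most $\log(n+1)$.

```python
from math import comb, log2

def D_n_bits(eps, n):
    # D(p^n || r_hat_n) in bits for a two-point i.i.d. source (mass eps on the
    # second symbol). r_hat_n gives mass 1/((n+1) * C(n,k)) to each string with
    # k occurrences, so summing per type: sum_k b_k * log2(b_k * (n+1)),
    # where b_k is the binomial probability of k occurrences.
    total = 0.0
    for k in range(n + 1):
        b = comb(n, k) * eps**k * (1 - eps) ** (n - k)
        if b > 0:
            total += b * log2(b * (n + 1))
    return total

for n in (5, 20, 100):
    for eps in (0.01, 0.3, 0.5):
        assert D_n_bits(eps, n) <= log2(n + 1)
```

Note the bound is uniform in $\epsilon$, which is exactly what makes $r_n$ useful over the whole two-point neighborhood.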
But if we see a string of all $1$s we can never be sure, no matter what the length of the string, what the underlying source is. One possibility is that the source is $p_0$. But having seen a string of $1$s of length $m$, there is also a reasonable chance that the underlying source could be $p_{\epsilon,j}$ for some $\epsilon \le \frac1m$ and any $j \in T_{n_\epsilon}$. There are $2^{n_\epsilon}$ such possible values $j$ can take in $T_{n_\epsilon}$, so any description of $j$ requires an additional $n_\epsilon \ge m$ bits. However, if we remove $p_0$ from the collection, we have no such trouble. We have no obligation to stop on any finite length string of all $1$s, no matter how long it is, since the sequence of all $1$s has probability $0$ under every source in $\mathcal{B}$ other than $p_0$. $\Box$

The last example is a collection $\mathcal{I}$ of probability distributions on $\mathbb{N}$ that is insurable but not d.w.c. In fact $\mathcal{I}$ is not even weakly compressible. Partition $\mathbb{N}$ into the sets $(T_i, i \ge 0)$ as before, where $T_i := \{2^i,\ldots,2^{i+1}-1\}$. For each $i \ge 1$, pick exactly one element of $T_i$ and assign it probability $\frac{1}{i(i+1)}$. We define $\mathcal{I}$ to be the collection of all probability distributions on $\mathbb{N}$ that can be formed in this way. $\mathcal{I}^\infty$ denotes the set of i.i.d. probability measures on $\mathbb{N}^\infty$ corresponding to $\mathcal{I}$.

Example 31. ($\mathcal{I}$ is insurable but not weakly compressible, hence not d.w.c.) For all $p \in \mathcal{I}$ and all $k \ge 1$, we have

$$\sum_{n \ge 2^k} p(n) = \frac1k.$$

This means that the entire set $\mathcal{I}$ is tight. By (Santhanam and Anantharam, 2015, Theorem 1), we can therefore conclude that $\mathcal{I}$ is insurable. On the other hand, for every probability distribution $q$ on $\mathbb{N}$ and all $i \ge 1$, there is $x_i \in T_i$ such that $q(x_i) \le 2^{-i}$. By the definition of $\mathcal{I}$, there is a probability distribution $p \in \mathcal{I}$ that has support $\{x_i : i \ge 1\}$. Note that $D(p\|q) = \infty$. Since every probability distribution in $\mathcal{I}$ has finite entropy (in fact they all have the same entropy), from Lemma 12 we conclude that $\mathcal{I}$ is not weakly compressible. In particular, $\mathcal{I}$ is not d.w.c. $\Box$

6. Necessity Part of Theorem 20

In this section we prove the necessity part of Theorem 20.
Namely, we prove that the existence of deceptive distributions precludes d.w.c. More precisely, we prove that if $\mathcal{P}$ is a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the associated collection of i.i.d. probability measures on $\mathbb{N}^\infty$, then $\mathcal{P}$ is d.w.c. only if no $p \in \mathcal{P}$ is deceptive.

To prove this, suppose $p \in \mathcal{P}$ is deceptive. Then, by the second part of Lemma 21, for every probability measure $q$ on $\mathbb{N}^\infty$ we can find $\delta > 0$ such that

$$\lim_{\epsilon'\to0}\ \limsup_{n\to\infty}\ \sup_{p' \in B(p,\epsilon';\mathcal{P})} \frac1n D_n(p'^\infty\|q) > \delta.$$

Pick any $0 < \eta < 1$, and let $\tau$ be a stopping rule. We will demonstrate that there is some $\tilde p \in \mathcal{P}$ such that

$$\tilde p^\infty\big(\tau \text{ is } \delta\text{-premature with respect to } q \text{ for } \tilde p\big) > \eta,$$

where we refer to the discussion around (5) to recall what it means for a stopping rule to be $\delta$-premature for the probability distribution $\tilde p \in \mathcal{P}$, with respect to the probability measure $q$ on $\mathbb{N}^\infty$. In order to do this, for all $n \ge 1$ let $A_n := \{x^n \in \mathbb{N}^n : \tau(x^n) = 1\}$ denote the set of sequences of length $n$ on which $\tau$ has entered. Note that $p^\infty(A_n)$ is increasing in $n$ and $\lim_{n\to\infty} p^\infty(A_n) = 1$. We can therefore pick $n \ge \frac{4}{1-\eta}$ large enough that $p^\infty(A_n) \ge \frac{1+\eta}{2}$.

Let $\epsilon := \frac{1}{2n^4}$.⁵ Applying Lemma 42 in Appendix B to i.i.d. probability measures over length-$n$ strings, we see that for all $p' \in \mathcal{P}$ such that $\|p'-p\|_1 \le \epsilon$, we have $p'^\infty(A_n) > \frac{1+\eta}{2} - \frac2n \ge \eta$, and for all $m \ge n$, since $(A_m)$ is an increasing sequence of events in $m$, $p'^\infty(A_m) \ge p'^\infty(A_n)$. Since $\limsup_m \sup_{p'\in B(p,\epsilon';\mathcal{P})} \frac1m D_m(p'^\infty\|q)$ is nondecreasing as $\epsilon'$ increases, we can choose $\tilde p \in B(p,\epsilon;\mathcal{P})$ such that for some $m \ge n$ we have $\tilde p^\infty(A_m) > \eta$ and $\frac1m D_m(\tilde p^\infty\|q) > \delta$. This in turn means, for the choice of $\eta$ and $\delta$ above, that $\tilde p^\infty(\tau \text{ is } \delta\text{-premature with respect to } q \text{ for } \tilde p) > \eta$. This completes the proof of the necessity part of Theorem 20.

As a caveat regarding the structure of this proof, we remark that the presence of a deceptive distribution $p \in \mathcal{P}$ does not automatically imply that any other probability distribution in any neighborhood of the deceptive distribution $p$ is also deceptive.
For example, the class $\mathcal{B}$ in Example 30 has only $p_0$ deceptive, while no other distribution in its neighborhood is.

7. Sufficiency Part of Theorem 20

In this section we prove the sufficiency part of Theorem 20. Namely, we prove that if a collection $\mathcal{P}$ of probability distributions on $\mathbb{N}$ does not contain any deceptive distributions, then $\mathcal{P}$ is d.w.c. We do this by explicitly constructing a probability measure $q^*$ on $\mathbb{N}^\infty$ such that, given any desired confidence probability $0 < 1-\eta < 1$ and accuracy $\delta > 0$, there is a stopping rule $\tau$ such that, for every $p \in \mathcal{P}$, under $p^\infty$, $\tau$ is $\delta$-premature with respect to $q^*$ for $p$, as defined in (5), with probability at most $\eta$. Note that it suffices to prove this for all $\delta$ of the form $\frac1m$ for $m \ge 1$. We will therefore restrict attention to this case, set $\delta = \frac1m$ for the rest of the proof, and denote the corresponding stopping rule we construct by $\tau_{\eta,m}$.

We proceed in three steps. Using the fact that $\mathcal{P}$ does not have deceptive distributions, in Section 7.1 we cover $\mathcal{P}$ by countably many $\ell_1$ neighborhoods, each of which has asymptotic per-symbol redundancy $< \frac1m$. In the second step, in Section 7.2, we construct a universal measure $q^*$ of the kind desired by taking advantage of the countable covering. In the third step, in Section 7.3, we use the type of the sequence generated to estimate which of the neighborhoods from the first step the underlying source may be in. If we get the neighborhood right, note that in that neighborhood the asymptotic per-symbol redundancy is bounded by $\frac1m$ uniformly over all sources in the neighborhood. This allows us to get the compression rate down to the desired accuracy by pretending that the marginal distribution in force is the one determining the neighborhood, i.e. its centroid.

5. Please note that in the interest of simplicity, we have not attempted to provide the best scaling for $\epsilon$ or the tightest possible bounds.
Ideally, we would like to be able to identify one of the neighborhoods from the first step that covers the underlying source (a good neighborhood). This requires some care, since different neighborhoods may have different sizes, and the rate of convergence of the empirical statistics to the source statistics is in general only pointwise, not uniform. But when there are no deceptive distributions, given any confidence, a stopping rule can be constructed that certifies against prematurely deciding on a bad neighborhood, to the required confidence.

7.1 Covering $\mathcal{P}$ by Countably Many Neighborhoods

Using the fact that $\mathcal{P}$ does not have deceptive distributions, we cover $\mathcal{P}$ by countably many neighborhoods, each of which has asymptotic per-symbol redundancy $< \frac1m$. Suppose $p \in \mathcal{P}$ is not deceptive. From Lemma 21, there is a probability measure $q_p$ on $\mathbb{N}^\infty$ such that for all $m \ge 1$ we can pick $\epsilon_{p,m} > 0$ satisfying

$$\limsup_{n\to\infty}\ \sup_{p' \in B(p,\epsilon_{p,m};\mathcal{P})} \frac1n D_n(p'^\infty\|q_p) < \frac1m. \tag{19}$$

We fix such an $\epsilon_{p,m} > 0$ for each $p \in \mathcal{P}$ and $m \ge 1$.

Reach of $p \in \mathcal{P}$. For $\delta \ge 1$ let $m = 1$, and for $0 < \delta < 1$ let $m = \lceil 1/\delta\rceil$; thus $m$ is the natural number such that $\frac1m \le \delta < \frac{1}{m-1}$. For any $\delta > 0$, we call $\epsilon_{p,\lceil 1/\delta\rceil}$ the $\delta$-reach of $p$. In particular, $\epsilon_{p,m} > 0$ is the $\frac1m$-reach of $p$. We do not require any regularity of $\epsilon_{p,m}$ over $p \in \mathcal{P}$; in particular $\inf_{p\in\mathcal{P}} \epsilon_{p,m}$ can be $0$.

Zone of $p \in \mathcal{P}$. Given $m \ge 1$, the zone $Q_{p,m}$ of a probability distribution $p \in \mathcal{P}$ is defined to be the set of probability distributions $u$ on $\mathbb{N}$ given by

$$Q_{p,m} \overset{\text{def}}{=} \left\{ u : \|p-u\|_1 < \frac{\epsilon_{p,m}}{2} \right\}, \tag{20}$$

where $\epsilon_{p,m}$ is the $\frac1m$-reach of $p$. Note that the probability distributions in $Q_{p,m}$ are not necessarily in $\mathcal{P}$.

Countable cover of $\mathcal{P}$. The zone $Q_{p,m}$ satisfies $Q_{p,m} \cap \mathcal{P} \subseteq B(p,\epsilon_{p,m};\mathcal{P})$. Trivially $p \in Q_{p,m} \cap \mathcal{P}$. Therefore we have $\mathcal{P} = \cup_{p\in\mathcal{P}}(Q_{p,m}\cap\mathcal{P})$. Further, since $Q_{p,m}$ is open in the $\ell_1$ topology, each of the intersections $Q_{p,m}\cap\mathcal{P}$ is relatively open in the $\ell_1$ topology on $\mathcal{P}$. Since $\mathcal{P}$ under the $\ell_1$ topology is second countable, it is also Lindelöf (see (Santhanam and Anantharam, 2015, Sec.
6.1) for a proof), i.e., there is a countable set $\mathcal{P}_m \subseteq \mathcal{P}$ such that $\mathcal{P}$ is covered by the collection of relatively open sets $(Q_{\tilde p,m}\cap\mathcal{P},\ \tilde p \in \mathcal{P}_m)$, i.e. we have

$$\mathcal{P} = \cup_{\tilde p\in\mathcal{P}_m}(Q_{\tilde p,m}\cap\mathcal{P}). \tag{21}$$

For any fixed $m \ge 1$, we will make a choice of such a $\mathcal{P}_m$ and refer to it as the quantization of $\mathcal{P}$, and to the elements of $\mathcal{P}_m$ as the centroids of the quantization, borrowing commonly used terminology. We index the countable set of centroids $\mathcal{P}_m$ by $\iota_m : \mathcal{P}_m \to \mathbb{N}$.

7.2 Construction of the Probability Measure $q^*$ on $\mathbb{N}^\infty$

We now construct a probability measure $q^*$ on $\mathbb{N}^\infty$ and, for each $0 < \eta < 1$ and $m \ge 1$, a stopping rule $\tau_{\eta,m}$, such that the pair $q^*$ and $\tau_{\eta,m}$ will together satisfy the required guarantee that for every $p \in \mathcal{P}$, the probability that the stopping rule $\tau_{\eta,m}$ is $\frac1m$-premature with respect to $q^*$ for $p$ is at most $\eta$. This section details the construction of $q^*$, while Section 7.3 details the construction of $\tau_{\eta,m}$. Note that while the stopping rule $\tau_{\eta,m}$ depends on the confidence $\eta$ and accuracy threshold $\frac1m$, the measure $q^*$ is universal over all choices of the confidence and accuracy.

First, fix $\eta$ and $m$. We construct the universal $q^*$ using the partition in (21), which holds regardless of what the confidence $\eta$ is. Therefore, our construction of the measure does not depend on the confidence $\eta$, and is automatically universal over the choice of $\eta$. For each $\tilde p \in \mathcal{P}_m$ there is a probability measure $q_{\tilde p}$ on $\mathbb{N}^\infty$ satisfying (19) for $\tilde p$, with $\epsilon_{\tilde p,m}$ denoting the $\frac1m$-reach of $\tilde p$. Let $\mathcal{Q}_m := \{q_{\tilde p} : \tilde p \in \mathcal{P}_m\}$ denote the collection of these probability measures as $\tilde p$ ranges over $\mathcal{P}_m$. Note that $\mathcal{Q}_m$ is countable and is a collection of not necessarily i.i.d. probability measures on $\mathbb{N}^\infty$. For $\tilde q \in \mathcal{Q}_m$, set the index $\iota_m(\tilde q)$ to be equal to the index assigned to the corresponding centroid $\tilde p$ in the enumeration of $\mathcal{P}_m$. Then define a probability measure $q^*_m$ on $\mathbb{N}^\infty$ by extending the following assignment for each $n \ge 1$ and each $x \in \mathbb{N}^n$:

$$q^*_m(x) := \sum_{\tilde q\in\mathcal{Q}_m} \frac{\tilde q(x)}{\iota_m(\tilde q)\,(\iota_m(\tilde q)+1)}.$$
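The weights $\frac{1}{\iota(\iota+1)}$ telescope to total mass $1$, and mixing in this way costs at most an additive $\log\big(\iota(\iota+1)\big)$ bits against the $\iota$-th component, a per-symbol penalty that vanishes with the sequence length. A minimal numeric sketch of this mechanism (function name ours):

```python
from math import log2

# telescoping: sum_{i>=1} 1/(i(i+1)) = 1 (truncated here at 10**5)
weights = [1 / (i * (i + 1)) for i in range(1, 10**5)]
assert abs(sum(weights) - 1) < 1e-4

def per_symbol_penalty(index, n):
    # the mixture dominates its index-th component up to a factor
    # 1/(index (index+1)), so the extra codelength against that component
    # is at most log2(index (index+1)) bits total, i.e. /n per symbol
    return log2(index * (index + 1)) / n

assert all(per_symbol_penalty(i, 10**4) < 0.01 for i in (1, 10, 1000))
```

The same trick is applied once more over $m$ below, which is why the limit superior in (22) is unaffected by the mixing.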
Similarly, to remove the dependence on $m$, let $q^*$ be the probability measure on $\mathbb{N}^\infty$ extending the following assignment for each $n \ge 1$ and each $x \in \mathbb{N}^n$:

$$q^*(x) := \sum_{m\ge1} \frac{q^*_m(x)}{m(m+1)}.$$

Now, for all $\tilde p \in \mathcal{P}_m$, we have

$$\limsup_{n\to\infty} \sup_{p'\in B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})} \frac1n D_n(p'^\infty\|q^*) = \limsup_{n\to\infty} \sup_{p'\in B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})} \frac1n D_n(p'^\infty\|q^*_m) = \limsup_{n\to\infty} \sup_{p'\in B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})} \frac1n D_n(p'^\infty\|q_{\tilde p}) < \frac1m. \tag{22}$$

7.3 Description of the Stopping Rule $\tau_{\eta,m}$

We turn next to constructing a stopping rule $\tau_{\eta,m}$ having the property that, for all $p \in \mathcal{P}$, we have

$$p^\infty\big(\tau_{\eta,m} \text{ is } \tfrac1m\text{-premature with respect to } q^* \text{ for } p\big) < \eta.$$

The essential task in this section is to use the type of the empirically observed sequence, say $(n,t)$, to identify one of the centroids in the covering (21) that contains the underlying source, say $p$, within its $\frac1m$-reach.

7.3.1 Summary

This is a variant of a hypothesis testing problem over a countable number of hypotheses, where we either choose one of the hypotheses (guaranteeing that at the point of choice our estimate will be accurate with the required confidence) or defer the decision to a point where we have more data. To summarize the approach, initialize $\tau_{\eta,m}$ to be $0$ on the empty string. Given a string $X_1^T$, if for any $i < T$ we have $\tau_{\eta,m}(X_1^i) = 1$, then set $\tau_{\eta,m}(X_1^T) = 1$. (Here $X_1^0$ denotes the empty string.) Else, we consider the following tests, one for each centroid $\tilde p \in \mathcal{P}_m$: test whether the empirical distribution $t \in Q_{\tilde p,m}$ and, if so, additionally test for Equations (24) and (25) below. If any centroid passes all the tests, we choose the first centroid (according to the enumeration of $\mathcal{P}_m$ chosen in Section 7.1) among them, and we also determine $\tau_{\eta,m}(X_1^n)$ for all $n \ge 1$, as explained in the detailed description of the scheme below. If none do, we defer the decision to a future point. Testing whether a centroid contains the observed type in its zone is clearly a natural thing to do.
Since the empirical distribution converges to the underlying probability distribution at a rate that is only pointwise and cannot in general be uniformly bounded over $\mathcal{P}$, it could happen on any finite sequence (say, with type $(n,t)$) that certain centroids close to the empirical distribution $t$ do not contain the generating distribution $p$ within their $\frac1m$-reach. Resolving which centroids are misleading and which are not cannot always be done to arbitrary confidence using finite sequences. However, if $\mathcal{P}$ has no deceptive distributions, imposing the additional tests in (24) and (25) enables us to guarantee that the probability that the type generated by a source $p$ is captured by any centroid in $\mathcal{P}_m$ that does not have $p$ within its reach is $< \eta$. At this point, we have identified, with the desired confidence, a centroid $\hat p$ that contains the generating source $p$ within its $\frac1m$-reach. We can therefore identify, based on the uniform convergence of per-symbol redundancy within the $\frac1m$-reach of $\hat p$, when the per-symbol redundancy drops below $\frac1m$ and stays below this threshold.

7.3.2 Detailed Construction

Fix $0 < \eta < 1$ and $m \ge 1$. Let $p \in \mathcal{P}$ be the probability distribution in force, which is unknown. Consider a length-$n$ sequence $x^n$ on which we have not yet decided that $\tau_{\eta,m}(x^r) = 1$ for any $1 \le r < n$. Let $x^n$ have type $(n,t)$, where $t$ is the empirical distribution. The set of centroids in $\mathcal{P}_m$ that can potentially capture the type is defined to be

$$\mathcal{P}_{m,t} := \{\tilde p \in \mathcal{P}_m : t \in Q_{\tilde p,m}\}.$$

Not every centroid in $\mathcal{P}_{m,t}$ is necessarily benign: some of the centroids in $\mathcal{P}_{m,t}$ may not have the generating probability distribution $p$ within their $\frac1m$-reach. Therefore, when $\mathcal{P}_{m,t} \ne \emptyset$, we refine $\mathcal{P}_{m,t}$ further to a set of safe centroids $\hat{\mathcal{P}}_{m,t} \subseteq \mathcal{P}_{m,t}$ in a way that will allow us to use Lemma 43 to bound the probability of wrong capture. To counter the possibility that the convergence of the empirical distribution is not necessarily uniform over $\mathcal{P}$, we use a modified convergence result, Lemma 43.
This is a distribution-free bound on the probability that the empirical distribution is far from the underlying $p$, but only for empirical distributions that are top heavy (namely, those with at least a certain probability mass within the first $k$ symbols). To set this up, for every $\tilde p \in \mathcal{P}_m$, with $\frac1m$-reach $\epsilon_{\tilde p,m}$, let

$$\gamma_{\tilde p,m} := \frac{\epsilon_{\tilde p,m}}{2}.$$

The quantity above plays the role of $\gamma$ when using Lemma 43. Note that $\frac{\epsilon_{p,m}}{2}$ also played a role in defining the zone $Q_{p,m}$ (for given $m \ge 1$) of an arbitrary probability distribution (not just a centroid) $p \in \mathcal{P}$, in (20). To understand the core of our sufficiency proof, consider what happens when the underlying $p$ happens to be outside the $\frac1m$-reach of some $\tilde p \in \mathcal{P}_{m,t}$. Since $p$ is far from $\tilde p$ (out of its $\frac1m$-reach), but $\tilde p$ is close to the empirical distribution $t$ of the observed sequence, the triangle inequality lower bounds the distance of $t$ from the underlying $p$ by $\gamma_{\tilde p,m}$.

The centroids in $\mathcal{P}_{m,t}$ that get placed into the safe set $\hat{\mathcal{P}}_{m,t}$ are those that satisfy (24) and (25) in addition. In what follows, the quantity $\log C(\tilde p,m)$ of a centroid $\tilde p \in \mathcal{P}_{m,t}$ plays the role of the effective support size of $\tilde p$, corresponding to the number $k$ of Lemma 43. Given $\tilde p \in \mathcal{P}_m$, we define $C(\tilde p,m)$ via

$$C(\tilde p,m) := 2^{\,3\sup_{r\in B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})} F_r^{-1}(1-\gamma_{\tilde p,m}/6)}, \tag{23}$$

and we note that $C(\tilde p,m)$ is finite from the tightness result in Lemma 16. This is because we have

$$\limsup_{n\to\infty} \sup_{r\in B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})} \frac1n D_n(r^\infty\|q^*) < \frac1m$$

from (22), which implies that for sufficiently large $n$ the single-letter redundancy of the family of $n$-fold product measures on $\mathbb{N}^n$ corresponding to the probability distributions in $B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})$ is finite, which, by Lemma 16, implies that this family of $n$-fold product measures on $\mathbb{N}^n$ is tight, which in turn implies that the family of probability distributions $B(\tilde p,\epsilon_{\tilde p,m};\mathcal{P})$ is tight.

With $C(\tilde p,m)$ for $\tilde p \in \mathcal{P}_{m,t}$ defined as in (23), the conditions we require on $\tilde p \in \mathcal{P}_{m,t}$ in order to place it in $\hat{\mathcal{P}}_{m,t}$ are

$$e^{-n\gamma_{\tilde p,m}^2/18} \le \frac{\eta}{2\,C(\tilde p,m)\,\iota_m(\tilde p)^2\,n(n+1)} \tag{24}$$

and

$$2\,F_t^{-1}(1-\gamma_{\tilde p,m}/6) \le \log C(\tilde p,m). \tag{25}$$

These criteria will then be translated into a bound on the probability of wrong capture. It is also worth remarking that the proof of sufficiency of the necessary and sufficient condition for the insurability of a model class in (Santhanam and Anantharam, 2015, Thm. 1) also uses a similar criterion to bound the probability of wrong capture.

We are now in a position to specify the stopping rule $\tau_{\eta,m}$. Consider a sequence of natural numbers $x^n$ having type $(n,t)$. Assume that we have not yet specified $\tau_{\eta,m}$ for any prefix $x^l$ of the sequence $x^n$, $1 \le l \le n$. If $\hat{\mathcal{P}}_{m,t} = \emptyset$, we move on to all the possible single-letter extensions of the sequence $x^n$. If $\hat{\mathcal{P}}_{m,t} \ne \emptyset$, let $\hat p$ denote the probability distribution in $\hat{\mathcal{P}}_{m,t}$ with the smallest index. All suffixes of $x^n$ are then said to be trapped by $\hat p$, which means that they are assigned to $\hat p \in \hat{\mathcal{P}}_{m,t}$. From (22), we have

$$\limsup_{n\to\infty} \sup_{r\in B(\hat p,\epsilon_{\hat p,m};\mathcal{P})} \frac1n D_n(r^\infty\|q^*) < \frac1m.$$

This means that the set

$$N_{\hat p} := \left\{ n : \sup_{r\in B(\hat p,\epsilon_{\hat p,m};\mathcal{P})} \frac1n D_n(r^\infty\|q^*) \ge \frac1m \right\}$$

is finite. For any suffix $\bar x_N$ of $x^n$, we set

$$\tau_{\eta,m}(\bar x_N) = \begin{cases} 1, & N > \max N_{\hat p},\\ 0, & \text{else.}\end{cases}$$

Finally, for each finite string $x^n$ for which the value of $\tau_{\eta,m}(x^n)$ has not yet been decided, we set this value to be $0$. It can be checked that $\tau_{\eta,m}$ so defined is a stopping rule. This is because if $\tau_{\eta,m}(x^n) = 0$ for any sequence $x^n \in \mathbb{N}^n$, then we also have $\tau_{\eta,m}(x^m) = 0$ for $1 \le m \le n$, i.e. for all its prefixes.

7.3.3 $\tau_{\eta,m}$ Enters With Probability 1

This is proved in Appendix F, using an argument similar to that used in the sufficiency proof in Santhanam and Anantharam (2015).

7.3.4 The Probability That $\tau_{\eta,m}$ is $\frac1m$-Premature is $< \eta$

Consider any $p \in \mathcal{P}$. Among sequences of natural numbers on which $\tau_{\eta,m}$ has entered, we will distinguish between those that are in good traps and those in bad traps.
If a sequence $x^n$ is trapped by $\hat p \in \mathcal{P}_m$ such that $p \in B(\hat p, \epsilon_{\hat p,m};\mathcal{P})$, we call $\hat p$ a good trap for that sequence. Conversely, if $p \notin B(\hat p, \epsilon_{\hat p,m};\mathcal{P})$, then $\hat p$ is called a bad trap for that sequence.

(Good traps) Suppose a length-$n$ sequence $x^n$ is in a good trap, namely it is trapped by a probability distribution $\hat p \in \mathcal{P}_m$ such that $p \in B(\hat p,\epsilon_{\hat p,m};\mathcal{P})$. Then, if $\tau_{\eta,m}(x^n) = 1$, it must be the case that $\frac1n D_n(p^\infty\|q^*) < \frac1m$. Thus such sequences cannot contribute to the probability under $p^\infty$ of $\tau_{\eta,m}$ being $\frac1m$-premature with respect to $q^*$ for $p$.

(Bad traps) We can show that the probability with which sequences generated by $p^\infty$ fall into bad traps is strictly less than $\eta$, using an argument essentially identical to the one used in Santhanam and Anantharam (2015). This argument is reproduced in Appendix G for the sake of completeness. Pessimistically, we assume that $\tau_{\eta,m}$ is $\frac1m$-premature with respect to $q^*$ for $p$ on every sequence that falls into a bad trap.

This completes the proof of the sufficiency part of Theorem 20.

Acknowledgments

We thank the anonymous reviewer for several suggestions that helped streamline the proofs and improve the readability of the document. We also thank the reviewer for suggesting one direction of the connection to learnability (as noted in two footnotes and in Appendix C). This work was in part supported by the NSF Science & Technology Center for Science of Information Grant number CCF-0939370. In addition, Santhanam was also supported by NSF Grants CCF-1065632 and CCF-1619452; Anantharam was also supported by the ARO MURI grant W911NF08-1-0233, "Tools for the Analysis and Design of Complex Multi-Scale Networks", Marvell Semiconductor Inc., the U.C. Discovery program, the William and Flora Hewlett Foundation supported Center for Long Term Cybersecurity at Berkeley, and the NSF grants CNS-0910702, ECCS-1343998, CNS-1527846, CCF-1618145 and CCF-1901004; Szpankowski was also supported by NSF Grants CCF-2006440 and CCF-2007238.

Appendix A.
Alternate Definitions of Strong and Weak Compressibility

We first establish the following elementary result.

Lemma 32. For $n \ge 1$, let $\hat q_n$ be a probability measure on $\mathbb{N}^n$. Then there is a probability measure $q_n$ on $\mathbb{N}^\infty$ such that, for all $x \in \mathbb{N}^n$, we have $q_n(x) = \hat q_n(x)$.

Proof. We define $q_n$ by specifying $q_n(y)$ for all $y \in \mathbb{N}^m$ for all $m \ge 1$. If $1 \le m \le n$ and $y \in \mathbb{N}^m$, let

$$q_n(y) := \sum_{x'\in\mathbb{N}^n :\, y \preceq x'} \hat q_n(x').$$

For $m \ge n$ and $y \in \mathbb{N}^m$, if $y$ is some $x \in \mathbb{N}^n$ followed by a string of $1$s, let $q_n(y) := \hat q_n(x)$; else let $q_n(y) := 0$. It can be checked that $q_n$, defined in this way, satisfies the consistency conditions $q_n(z) = \sum_{y\in\mathbb{N}^m :\, z \preceq y} q_n(y)$ for all $1 \le l \le m$ and $z \in \mathbb{N}^l$. Hence $q_n$ defines a probability measure on $\mathbb{N}^\infty$. It can also be checked that $q_n$ satisfies the requirement in the statement of the lemma. $\Box$

Using Lemma 32, we now get the following result, which will help establish the equivalence of our definitions of strong and weak compressibility with those common in the literature.

Lemma 33. Let $\Lambda$ be any collection of probability measures on $\mathbb{N}^\infty$ (not necessarily i.i.d.). Suppose there exists a sequence of probability measures $\hat q_n$ on $\mathbb{N}^n$ such that

$$\limsup_{n\to\infty} \sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{\hat q_n(X^n)} = 0.$$

Then there is a probability measure $q$ on $\mathbb{N}^\infty$ such that

$$\limsup_{n\to\infty} \sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{q(X^n)} = 0.$$

Proof. For each $n \ge 1$, let the probability measure $q_n$ on $\mathbb{N}^\infty$ be constructed to match the probability measure $\hat q_n$ on $\mathbb{N}^n$, as in Lemma 32. Define the probability measure $q$ on $\mathbb{N}^\infty$ that, for each $n \ge 1$ and $x \in \mathbb{N}^n$, assigns to $x$ the probability

$$q(x) := \sum_{i\ge1} \frac{q_i(x)}{i(i+1)}.$$

For all $n \ge 1$ we therefore have

$$\sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{q(X^n)} \le \sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{q_n(X^n)} + \frac{\log\big(n(n+1)\big)}{n} = \sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{\hat q_n(X^n)} + \frac{\log\big(n(n+1)\big)}{n},$$

whence

$$\limsup_{n\to\infty} \sup_{r\in\Lambda} \frac1n E_r \log \frac{r(X^n)}{q(X^n)} = 0. \qquad \Box$$

Let $\mathcal{P}$ be a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the collection of probability measures on $\mathbb{N}^\infty$ induced by i.i.d. assignments from the individual probability distributions in $\mathcal{P}$.
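Lemma 32's extension can be sketched concretely over a finite sub-alphabet (the restriction to a finite alphabet and the uniform choice of $\hat q_n$ are ours, purely for illustration): pad length-$n$ strings with $1$s to go longer, marginalize to go shorter.

```python
from itertools import product
from fractions import Fraction

n, alphabet = 2, (1, 2, 3)
# hat_qn: any probability measure on strings of length n (uniform here)
hat_qn = {x: Fraction(1, len(alphabet) ** n) for x in product(alphabet, repeat=n)}

def qn(y):
    if len(y) <= n:
        # marginal of hat_qn: total mass of its extensions to length n
        return sum(v for x, v in hat_qn.items() if x[:len(y)] == y)
    # longer strings carry mass only when they are some x padded with 1s
    head, pad = y[:n], y[n:]
    return hat_qn[head] if all(s == 1 for s in pad) else Fraction(0)

assert all(qn(x) == hat_qn[x] for x in hat_qn)           # matches on length n
assert all(qn(y) == sum(qn(y + (s,)) for s in alphabet)  # Kolmogorov consistency
           for m in (1, 2, 3) for y in product(alphabet, repeat=m))
```

The consistency check is exactly the condition in the proof: the mass of a string equals the total mass of its one-symbol extensions.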
In most prior work (Fittingoff (1972); Davisson (1973); Kieffer (1978)) the collection $\mathcal{P}^\infty$ is called strongly compressible if there is a sequence of probability measures $\hat q_n$ on $\mathbb{N}^n$ such that

$$\limsup_{n\to\infty} \sup_{p\in\mathcal{P}} \frac1n E_p \log \frac{p(X^n)}{\hat q_n(X^n)} = 0.$$

Lemma 33 immediately establishes that this definition is equivalent to the definition of strong compressibility that we have made in Definition 2.

The most commonly used definition of weak compressibility in prior work is due to Kieffer (Kieffer (1978)), and is framed in the language of length functions of compression schemes. Let $\Lambda$ be any collection of stationary ergodic probability measures on $\mathbb{N}^\infty$ (not necessarily i.i.d.). A compression scheme is a sequence of mappings $\phi_n : \mathbb{N}^n \to \{0,1\}^* \setminus \{\lambda\}$ (where $\lambda$ denotes the empty string) whose image satisfies the prefix condition, i.e. for any two distinct elements in the domain the image of the first is not a prefix of the image of the second. The collection $\Lambda$ is called weakly compressible if there is a compression scheme $(\phi_n, n \ge 1)$ such that, for all $r \in \Lambda$, we have

$$\lim_{n\to\infty} \frac1n E_r\, l(\phi_n(X^n)) = H(r),$$

where $H(r)$ denotes the entropy rate of $r$.

Let $\mathcal{P}$ be a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the corresponding collection of i.i.d. probability measures on $\mathbb{N}^\infty$. Note that $\mathcal{P}^\infty$ is a collection of stationary ergodic probability measures. We now show that the definition of weak compressibility of $\mathcal{P}^\infty$ in the sense of Kieffer (Kieffer (1978)) is identical to the definition of weak compressibility of $\mathcal{P}^\infty$ that we have made in Definition 4.

Suppose first that $\mathcal{P}^\infty$ is weakly compressible in the sense of Definition 4. If every probability distribution in $\mathcal{P}$ has infinite entropy, consider an arbitrary compression scheme $(\phi_n, n\ge1)$, for instance the one obtained by defining $\phi_n(x^n)$ as the symbol-by-symbol concatenation of representations of $i \in \mathbb{N}$ by bit strings of length $\lceil \log\big((i+1)(i+2)\big)\rceil$ coming from a prefix code for $\mathbb{N}$ corresponding to the probability distribution assigning probability $\frac{1}{(i+1)(i+2)}$ to $i \in \mathbb{N}$.
Then we have

$$\frac1n E_p\, l(\phi_n(X^n)) \overset{(a)}{\ge} \frac1n E_p \log \frac{1}{p(X^n)} = \infty, \tag{27}$$

and so $\lim_{n\to\infty} \frac1n E_p\, l(\phi_n(X^n)) = H(p)$ for all $p \in \mathcal{P}$. Here (a) in (27) can be seen by picking a probability measure $q_n$ on $\mathbb{N}^n$ that satisfies $l(\phi_n(x^n)) \ge \log\frac{1}{q_n(x^n)}$ and observing that $E_p \log\frac{p(X^n)}{q_n(X^n)} \ge 0$.

If there are probability distributions in $\mathcal{P}$ with finite entropy, let $q$ be a probability measure on $\mathbb{N}^\infty$ verifying the requirements in Definition 4. For $n \ge 1$, let $\hat q_n$ denote the probability measure on $\mathbb{N}^n$ resulting from restricting $q$ to $\mathbb{N}^n$. We can then define a compression scheme $(\phi_n, n\ge1)$ such that $l(\phi_n(x)) = \lceil\log\frac{1}{\hat q_n(x)}\rceil$ for all $x \in \mathbb{N}^n$ and all $n \ge 1$. Hence, for every $p \in \mathcal{P}$, we have

$$\frac1n E_p\, l(\phi_n(X^n)) \le \frac1n E_p \log\frac{1}{\hat q_n(X^n)} + \frac1n = \frac1n E_p \log\frac{1}{q(X^n)} + \frac1n.$$

Suppose $H(p) = \infty$. By the same argument as that used in (27) we conclude that $\frac1n E_p\, l(\phi_n(X^n)) = \infty$ for all $n \ge 1$, and so, for all such $p$, we have $\lim_{n\to\infty}\frac1n E_p\, l(\phi_n(X^n)) = H(p)$. On the other hand, if $H(p) < \infty$, we have

$$\frac1n E_p\, l(\phi_n(X^n)) \le \frac1n E_p \log\frac{1}{q(X^n)} + \frac1n = \frac1n E_p \log\frac{p(X^n)}{q(X^n)} + H(p) + \frac1n,$$

and so, letting $n \to \infty$, we see that $\lim_{n\to\infty}\frac1n E_p\, l(\phi_n(X^n)) = H(p)$ also holds for such $p$. We have established that $\mathcal{P}^\infty$ is also weakly compressible in the sense of Kieffer (Kieffer (1978)), irrespective of whether $\mathcal{P}$ is comprised entirely of probability distributions with infinite entropy or also contains probability distributions with finite entropy.

For the converse, suppose that $\mathcal{P}^\infty$ is weakly compressible in the sense of Kieffer (Kieffer (1978)). For each $n \ge 1$ we can find a probability measure $\hat q_n$ on $\mathbb{N}^n$ such that $\hat q_n(x) \ge 2^{-l(\phi_n(x))}$ for all $x \in \mathbb{N}^n$, where $(\phi_n, n\ge1)$ is a compression scheme verifying the weak compressibility of $\mathcal{P}^\infty$ in the sense of Kieffer (Kieffer (1978)). For each $n \ge 1$ we define the probability measure $q_n$ on $\mathbb{N}^\infty$ in terms of $\hat q_n$ as in Lemma 32, and we define the probability measure $q$ on $\mathbb{N}^\infty$ which, for each $n \ge 1$ and $x \in \mathbb{N}^n$, assigns to $x$ the probability

$$q(x) := \sum_{i\ge1}\frac{q_i(x)}{i(i+1)}.$$
For each $p \in \mathcal{P}$ with finite entropy, we have
$\frac{1}{n} E_p \log \frac{p(X^n)}{q(X^n)} \le \frac{1}{n} E_p \log \frac{p(X^n)}{q_n(X^n)} + \frac{1}{n}\log n(n+1) = \frac{1}{n} E_p \log \frac{p(X^n)}{\hat q_n(X^n)} + \frac{1}{n}\log n(n+1) \le \frac{1}{n} E_p l(\phi_n(X^n)) - H(p) + \frac{1}{n}\log n(n+1),$
and so, from $\lim_n \frac{1}{n} E_p l(\phi_n(X^n)) = H(p)$, we conclude that $\limsup_n \frac{1}{n} E_p \log \frac{p(X^n)}{q(X^n)} = 0$. This proves that $\mathcal{P}$ is weakly compressible in the sense of Definition 4. To close this section, we give proofs of two statements that allow us to think about strong compressibility and weak compressibility, respectively, in terms of vanishing asymptotic per-symbol redundancy.

Lemma 34. Let $\mathcal{P}$ be a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the collection of probability measures on $\mathbb{N}^\infty$ induced by i.i.d. sampling from the individual probability distributions in $\mathcal{P}$. Then $\mathcal{P}$ is strongly compressible iff it has zero asymptotic per-symbol redundancy.

Proof If $\mathcal{P}$ is strongly compressible, then taking the probability measure $q$ on $\mathbb{N}^\infty$ which verifies the strong compressibility condition in (1) from Definition 2 as the $q$ in (2) from Definition 3 for each $n \ge 1$ immediately implies that $\mathcal{P}$ has zero asymptotic per-symbol redundancy. Conversely, suppose $\mathcal{P}$ has zero asymptotic per-symbol redundancy. Given $\epsilon > 0$, for each $n \ge 1$ let $q_n$ be a probability measure on $\mathbb{N}^n$ for which $\sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q_n(X^n)} \le R_n + \epsilon$, and define the probability measure $q$ on $\mathbb{N}^\infty$ which, for each $n \ge 1$ and $x \in \mathbb{N}^n$, assigns to $x$ the probability $\sum_{i \ge 1} \frac{q_i(x)}{i(i+1)}$, with each $q_i$ extended from $\mathbb{N}^i$ to a probability measure on $\mathbb{N}^\infty$ as in Lemma 32. Then we have
$\frac{1}{n} \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q(X^n)} \le \frac{1}{n} \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q_n(X^n)} + \frac{1}{n}\log n(n+1) \le \frac{1}{n}(R_n + \epsilon) + \frac{1}{n}\log n(n+1),$
so that $\limsup_n \frac{1}{n} \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q(X^n)} \le \limsup_n \frac{1}{n} R_n = 0$. Letting $\epsilon \to 0$ shows that $\mathcal{P}$ is strongly compressible. □

Lemma 35. Let $\mathcal{P}$ be a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the collection of probability measures on $\mathbb{N}^\infty$ induced by i.i.d. sampling from the individual probability distributions in $\mathcal{P}$. Then $\mathcal{P}$ is weakly compressible iff there is a probability measure $q$ on $\mathbb{N}^\infty$ such that for every $p \in \mathcal{P}$ with finite entropy the corresponding $p^\infty \in \mathcal{P}^\infty$ has zero asymptotic per-symbol redundancy with respect to $q$.
Proof The claim is vacuously true if all the probability distributions in $\mathcal{P}$ have infinite entropy. If there are distributions in $\mathcal{P}$ with finite entropy and $\mathcal{P}$ is weakly compressible, then consider the probability measure $q$ on $\mathbb{N}^\infty$ which verifies the weak compressibility condition in (3) from Definition 4. By definition, with respect to this $q$, every $p \in \mathcal{P}$ with finite entropy is such that the corresponding $p^\infty \in \mathcal{P}^\infty$ has zero asymptotic per-symbol redundancy. Conversely, if there are distributions in $\mathcal{P}$ with finite entropy and there is a probability measure $q$ on $\mathbb{N}^\infty$ such that for every $p \in \mathcal{P}$ with finite entropy the corresponding $p^\infty \in \mathcal{P}^\infty$ has zero asymptotic per-symbol redundancy with respect to $q$, then, by definition, this $q$ satisfies the condition in (3) from Definition 4 for all $p \in \mathcal{P}$ with finite entropy. This establishes that $\mathcal{P}$ is weakly compressible. □

Appendix B. Basic Properties of Relative Entropy and Redundancy

In this appendix we gather some basic results on the KL divergence and redundancy, which are used at various points in the document.

Proposition 36. Let $p$ and $q$ be two probability distributions on a countable set $X$. Then
$\sum_{x \in X} p(x) \left| \log \frac{p(x)}{q(x)} \right| \le D(p||q) + \frac{2\log e}{e}.$

Proof Let $S \subseteq X$ be the set of all elements $x \in X$ such that $p(x) \le q(x)$; we may assume $S$ is nonempty, and note that $q(S) > 0$. We have
$\sum_{x \in X} p(x)\left|\log \frac{p(x)}{q(x)}\right| = D(p||q) - 2\sum_{x \in S} p(x)\log\frac{p(x)}{q(x)} \stackrel{(a)}{\le} D(p||q) - 2p(S)\log\frac{p(S)}{q(S)} \le D(p||q) - 2p(S)\log p(S) \le D(p||q) + \frac{2\log e}{e},$
where step (a) is from the log-sum inequality and the last step uses $-x\log x \le (\log e)/e$ for $x \in [0,1]$. The proposition follows. □

Proposition 37. For all probability measures $r$ and $q$ on $\mathbb{N}^\infty$ and all $1 \le m \le n$, we have $D_m(r||q) \le D_n(r||q)$. In particular, for any collection of probability distributions $\mathcal{P}$ on $\mathbb{N}$, if $\mathcal{P}^\infty$ denotes the associated collection of i.i.d. probability measures on $\mathbb{N}^\infty$, we will have
$R_m(\mathcal{P}^\infty) := \inf_q \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^m)}{q(X^m)} \le \inf_q \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q(X^n)} =: R_n(\mathcal{P}^\infty),$
where the outer infimum on both sides is taken over all probability measures $q$ on $\mathbb{N}^\infty$, and so $R_m(\mathcal{P}^\infty)$ and $R_n(\mathcal{P}^\infty)$ are the length-$m$ redundancy and the length-$n$ redundancy of $\mathcal{P}^\infty$, respectively.
Proof The first part of the claim follows from convexity (the log-sum inequality), because, for all $y^m \in \mathbb{N}^m$, we have
$r(y^m) = \sum_{x^n :\, y^m \preceq x^n} r(x^n) \quad \text{and} \quad q(y^m) = \sum_{x^n :\, y^m \preceq x^n} q(x^n),$
where $y^m \preceq x^n$ means that $y^m$ is a prefix of $x^n$. For the second part of the claim, for any $\epsilon > 0$ pick a probability measure $q^*$ on $\mathbb{N}^\infty$ such that
$\sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q^*(X^n)} < R_n(\mathcal{P}^\infty) + \epsilon.$
It then follows from the first part of the claim that
$R_m(\mathcal{P}^\infty) \le \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^m)}{q^*(X^m)} < R_n(\mathcal{P}^\infty) + \epsilon.$
We let $\epsilon \to 0$ to complete the proof. □

Proposition 38. Let $\mathcal{P}$ be a collection of probability distributions on $\mathbb{N}$ and $\mathcal{P}^\infty$ the corresponding collection of probability measures on $\mathbb{N}^\infty$ obtained by i.i.d. sampling from the individual probability distributions in $\mathcal{P}$. For $n \ge 1$, let $R_n$ denote the length-$n$ redundancy of $\mathcal{P}^\infty$, as defined in (2). Then, for all $n \ge 1$, the per-symbol length-$n$ redundancy of $\mathcal{P}^\infty$ satisfies $R_n/n \le R_1$.

Proof Let $\epsilon > 0$. Let $p^*$ be a probability distribution on $\mathbb{N}$ such that the single-letter redundancy of $\mathcal{P}$ with respect to $p^*$ is strictly less than $R_1 + \epsilon$. With the usual abuse of notation, let $p^*$ also denote the i.i.d. probability measure on $\mathbb{N}^\infty$ corresponding to $p^*$. Then, for all $p \in \mathcal{P}$, we have
$\frac{1}{n} E_p \log \frac{p(X^n)}{p^*(X^n)} = E_p \log \frac{p(X)}{p^*(X)} < R_1 + \epsilon.$
By letting $\epsilon \to 0$, the proposition follows. □

Corollary 39. Let $\mathcal{P}$ be any collection of distributions over $\mathbb{N}$ and let $\mathcal{P}^\infty$ be the set of probability measures obtained by i.i.d. sampling from distributions in $\mathcal{P}$. Then $\limsup_n \frac{1}{n} R_n(\mathcal{P}^\infty) < \infty$ iff $R_1 < \infty$.

Proof Immediate from Propositions 37 and 38, since for all $n$,
$\frac{1}{n} R_1 \le \frac{1}{n} R_n(\mathcal{P}^\infty) \le R_1.$ □

Lemma 40. Let $\Lambda$ be a collection of probability measures on $\mathbb{N}^\infty$. Then we have
$\limsup_n \frac{1}{n} \inf_q \sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)} = \inf_q \limsup_n \frac{1}{n} \sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)}, \qquad (28)$
where the infimum is taken over all probability measures $q$ on $\mathbb{N}^\infty$. Namely, the $\limsup_n$ can be interchanged with the $\inf_q$ in the definition of the asymptotic per-symbol redundancy of $\Lambda$.

Proof Fix $\epsilon > 0$.
For $n \ge 1$, let $q_n$ be a probability measure on $\mathbb{N}^n$ such that
$\frac{1}{n} \sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q_n(X^n)} \le \frac{1}{n} R_n + \epsilon,$
where $R_n := \inf_q \sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)}$. Define the probability measure $\tilde q$ on $\mathbb{N}^\infty$ that, for each $n \ge 1$ and $x \in \mathbb{N}^n$, assigns to $x$ the probability $\sum_{i \ge 1}\frac{q_i(x)}{i(i+1)}$, where, as usual, $q_i(x)$ is the probability under $q_i$ of the event in $\mathbb{N}^\infty$ comprised of the sequences having the prefix $x$, each $q_i$ being extended to $\mathbb{N}^\infty$ as in Lemma 32. We then have
$\frac{1}{n}\sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{\tilde q(X^n)} \le \frac{1}{n}\sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q_n(X^n)} + \frac{1}{n}\log n(n+1) \le \frac{1}{n} R_n + \epsilon + \frac{1}{n}\log n(n+1),$
so that
$\inf_q \limsup_n \frac{1}{n}\sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)} \le \limsup_n \frac{1}{n}\sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{\tilde q(X^n)} \le \limsup_n \frac{1}{n} R_n + \epsilon.$
Letting $\epsilon \to 0$, we see that the term on the right hand side of (28) is no bigger than the term on its left hand side. Showing the inequality in the other direction is straightforward, since
$\frac{1}{n}\inf_q \sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)} \le \frac{1}{n}\sup_{r \in \Lambda} E_r \log \frac{r(X^n)}{q(X^n)}$
for each probability measure $q$ on $\mathbb{N}^\infty$. This completes the proof. □

The following lemma will be needed in Appendix C.

Lemma 41. Let $\mathcal{P}_1, \ldots, \mathcal{P}_L$ be classes of probability distributions on $\mathbb{N}$. Let $\mathcal{P} := \cup_{l=1}^L \mathcal{P}_l$ denote their union. Then, for each $n \ge 1$ we have
$\max_l R_n(\mathcal{P}_l) \le R_n(\mathcal{P}) \le \max_l R_n(\mathcal{P}_l) + \log L.$

Proof The first inequality is immediate since $\mathcal{P}_l \subseteq \mathcal{P}$ for each $l$. For any $\epsilon > 0$, for each $1 \le l \le L$, let $q_l$ be a probability measure on $\mathbb{N}^\infty$ such that
$\sup_{p \in \mathcal{P}_l} E_p \log \frac{p(X^n)}{q_l(X^n)} \le R_n(\mathcal{P}_l) + \epsilon.$
Let $\bar q := \frac{1}{L}\sum_{l=1}^L q_l$. Then we have
$R_n(\mathcal{P}) = \inf_q \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{q(X^n)} \le \sup_{p \in \mathcal{P}} E_p \log \frac{p(X^n)}{\bar q(X^n)} = \max_l \sup_{p \in \mathcal{P}_l} E_p \log \frac{L\, p(X^n)}{q_1(X^n) + \cdots + q_L(X^n)} \le \max_l \sup_{p \in \mathcal{P}_l} E_p \log \frac{p(X^n)}{q_l(X^n)} + \log L \le \max_l R_n(\mathcal{P}_l) + \epsilon + \log L,$
where the infimum in the first line is over probability measures $q$ on $\mathbb{N}^\infty$. Letting $\epsilon \to 0$ completes the proof. □

The following variation of a result from Santhanam and Anantharam (2015) will be needed to prove the necessity part of Theorem 20.

Lemma 42. Fix $\epsilon > 0$. Let $p$ and $q$ be probability distributions on $\mathbb{N}$ with $||p - q||_1 \le \epsilon$. Fix $n \in \mathbb{N}$ with $2n^2\epsilon \le 1$. Consider the probability measures on $\mathbb{N}^n$ obtained by i.i.d. sampling from $p$ and $q$ respectively, which we continue to denote by $p$ and $q$ respectively, following our convention. Suppose $A_n \subseteq \mathbb{N}^n$ is a subset for which $p(A_n) \ge 1 - \alpha$, for some $\alpha > 0$.
Then we have
$q(A_n) > (1 - \alpha - 2n^3\epsilon)\left(1 - \frac{1}{n}\right).$

Proof Let
$B_1 := \left\{ i \in \mathbb{N} : \frac{q(i)}{p(i)} < 1 - \frac{1}{n^2} \right\} \quad \text{and} \quad B_2 := \left\{ i \in \mathbb{N} : \frac{p(i)}{q(i)} < 1 - \frac{1}{n^2} \right\}.$
We are given $||p - q||_1 \le \epsilon$, and in addition, we have
$\sum_{x \in B_1} (p(x) - q(x)) \ge \frac{p(B_1)}{n^2} \quad \text{and similarly} \quad \sum_{x \in B_2} (q(x) - p(x)) \ge \frac{q(B_2)}{n^2}.$
From the preceding inequalities, it follows that
$p(B_1 \cup B_2) \le 2n^2\epsilon \quad \text{and} \quad q(B_1 \cup B_2) \le 2n^2\epsilon. \qquad (29)$
Let $S := \mathbb{N} \setminus (B_1 \cup B_2)$. For all $x \in S$ we have
$q(x) \ge p(x)\left(1 - \frac{1}{n^2}\right). \qquad (30)$
In addition, from (29) we have $p(S) \ge 1 - 2n^2\epsilon$. Let $S^n \subseteq \mathbb{N}^n$ denote the set of all length-$n$ strings of symbols from $S$. Clearly, since $2n^2\epsilon \le 1$,
$p(S^n) \ge (1 - 2n^2\epsilon)^n > 1 - 2n^3\epsilon.$
Thus we have $p(A_n \cap S^n) > 1 - 2n^3\epsilon - \alpha$. From (30), for all $x^n \in S^n$, we have
$q(x^n) \ge p(x^n)\left(1 - \frac{1}{n^2}\right)^n > p(x^n)\left(1 - \frac{1}{n}\right),$
and therefore
$q(A_n) \ge q(A_n \cap S^n) > (1 - \alpha - 2n^3\epsilon)\left(1 - \frac{1}{n}\right).$ □

Appendix C. Operational Formulation of the Problem

Recall that $\mathcal{P}(\mathbb{N})$ denotes the set of probability distributions on $\mathbb{N}$ and $\mathcal{P} \subseteq \mathcal{P}(\mathbb{N})$ a collection of probability distributions on $\mathbb{N}$. We prove Theorem 9 in this section, i.e. that $\mathcal{P}$ is learnable (see Definition 8) iff it is d.w.c.

C.1 Learnable ⟹ d.w.c.

To prove that if $\mathcal{P}$ is learnable then it is d.w.c., we use the equivalence of d.w.c. and the absence of deceptive distributions, which was proved in Theorem 20. Specifically, we show that if $\mathcal{P}$ is learnable, then there cannot be any deceptive distributions in $\mathcal{P}$. Suppose, to the contrary, that $\mathcal{P}$ is learnable but that $p \in \mathcal{P}$ is deceptive. Then, by the definition of what it means to be deceptive, see Definition 19, we can find $\delta > 0$ such that
$\inf_q \limsup_n \sup_{p' \in B(p,\epsilon';\mathcal{P})} \frac{1}{n} D_n(p'||q) > \delta, \qquad (31)$
for all $\epsilon' > 0$ and hence, by Lemma 40 in Appendix B, we have
$\limsup_n \inf_q \sup_{p' \in B(p,\epsilon';\mathcal{P})} \frac{1}{n} D_n(p'||q) > \delta, \qquad (32)$
for all $\epsilon' > 0$. In both (31) and (32) the infimum is over all probability measures $q$ on $\mathbb{N}^\infty$. Since $\mathcal{P}$ is assumed to be learnable, from Definition 8 there must certainly be some $\eta > 0$, a stopping rule $\tau$, and $\hat q : \mathbb{N}^* \to \mathcal{P}(\mathbb{N})$ such that for all $p \in \mathcal{P}$ we have
$p(\tau \text{ enters and } D_1(p||\hat q) > \delta) < \eta.$
For all $n \ge 1$ let $A_n := \{x^n \in \mathbb{N}^n : \tau(x^n) = 1\}$ denote the set of sequences of length $n$ on which $\tau$ has entered.
Note that $p(A_n)$ is nondecreasing in $n$ and $\lim_n p(A_n) = 1$. We can therefore pick $n \ge 4/(1-\eta)$ large enough, and a finite set $S_n \subseteq A_n$, such that $p(S_n) \ge (1+\eta)/2$. Let $\epsilon^* := \frac{1}{2n^4}$. Applying Lemma 42 in Appendix B to the i.i.d. probability measures over length-$n$ strings, we see that for all $p' \in \mathcal{P}$ such that $||p - p'||_1 \le \epsilon^*$, we have
$p'(S_n) > \left(\frac{1+\eta}{2} - 2n^3\epsilon^*\right)\left(1 - \frac{1}{n}\right) \ge \frac{1+\eta}{2} - \frac{2}{n} \ge \eta,$
where the last inequality uses $n \ge 4/(1-\eta)$. From (31) and (32) respectively it then follows that for all $0 < \epsilon < \epsilon^*$ we have
$\inf_q \limsup_m \sup_{p' \in B(p,\epsilon;\mathcal{P})} \frac{1}{m} D_m(p'||q) > \delta, \qquad (33)$
and
$\limsup_m \inf_q \sup_{p' \in B(p,\epsilon;\mathcal{P})} \frac{1}{m} D_m(p'||q) > \delta, \qquad (34)$
and, of course, $p'(S_n) \ge \eta$ for all $p' \in B(p,\epsilon;\mathcal{P})$. Fix some $0 < \epsilon < \epsilon^*$. Since $S_n$ is finite by choice, for each $p' \in B(p,\epsilon;\mathcal{P})$ we can choose $y(p') \in S_n$ such that
$y(p') = \arg\min_{y \in S_n} D(p'||q_y), \quad \text{where } q_y = \hat q(\cdot|y).$
Let $B_y = \{p' \in B(p,\epsilon;\mathcal{P}) : y(p') = y\}$. Therefore,
$B(p,\epsilon;\mathcal{P}) = \cup_{y \in S_n} B_y, \qquad (35)$
where the union above is finite. From (34) we have that the asymptotic per-symbol redundancy of $B(p,\epsilon;\mathcal{P})$ is strictly bigger than $\delta$. Since the union in (35) is finite, from Lemma 41 in Appendix B it follows that there is some $y^* \in S_n$ such that the asymptotic per-symbol redundancy of $B_{y^*}$ is strictly bigger than $\delta$. Hence, from Proposition 38, we have that the single-letter redundancy of $B_{y^*}$ is strictly bigger than $\delta$, which in turn implies that $\sup_{p' \in B_{y^*}} D(p'||q_{y^*}) > \delta$. Thus we conclude that there is some $p' \in B_{y^*}$ such that $D(p'||q_{y^*}) > \delta$; by the choice of $y(p')$, we then have $D(p'||q_y) > \delta$ for every $y \in S_n$. Therefore, if $p'$ were in force, then with probability at least $\eta$ the rule $\tau$ would enter at a string in $S_n$ and we would have $D_1(p'||\hat q) \ge D_1(p'||q_{y^*}) > \delta$, which violates the assumption that $\mathcal{P}$ is learnable. This completes the proof of the necessity part of Theorem 9.

C.2 d.w.c. ⟹ Learnable

We thank the anonymous reviewer for observing this direction of the connection. Suppose for all $\delta' > 0$, $\eta' > 0$, we have a stopping rule $\tau_{\delta',\eta'}$ and a universal measure $q^*$ such that $\tau_{\delta',\eta'}$ certifies with confidence $1 - \eta'$ when the per-symbol redundancy of $q^*$ falls (and remains) below $\delta'$.
Then for any given $\delta > 0$ and $\eta > 0$ we construct a new stopping rule $\sigma_{\delta,\eta}$ and an estimator $\hat q : \mathbb{N}^* \to \mathcal{P}(\mathbb{N})$ that satisfies, for all $p \in \mathcal{P}$,
$p(\sigma_{\delta,\eta} = 1 \text{ and } D(p||\hat q) > \delta) < \eta. \qquad (36)$
According to Definition 8, this will establish that $\mathcal{P}$ is learnable. To see this, let $\delta' = \delta\eta/2$ and $\eta' = \eta/2$. Let
$T := \min\{t \ge 1 : \tau_{\delta',\eta'}(X_1^t) = 1\},$
and note that $X_{T+1}, \ldots, X_{2T-1}$ are the $T-1$ subsequent samples. Set $\sigma_{\delta,\eta}(X^{2T-1}) = 1$ (regardless of what $X_{T+1}^{2T-1}$ is) and output the estimate $\hat q^* \in \mathcal{P}(\mathbb{N})$, where for all $x \in \mathbb{N}$
$\hat q^*(x) = \frac{1}{T}\sum_{i=0}^{T-1} q^*\left(x\,|\,X_{T+1}^{T+i}\right).$
In the above, $X_{T+1}^{T}$ is understood to be the empty string. Note that $\hat q^*$ does not use the observations $X_1, \ldots, X_T$ and, given $T$, $\hat q^*$ is conditionally independent of $X^T$. Rather, $\hat q^*$ applies the conditional single-letter distributions of $q^*$, given prefixes of length $i \le T-1$, to the observations $X_{T+1}, \ldots, X_{2T-1}$. To complete the definition of $\hat q$ as a function from $\mathbb{N}^*$ to $\mathcal{P}(\mathbb{N})$, we define it arbitrarily on the finite sequences of naturals on which $\sigma_{\delta,\eta}$ equals 0, and on those on which $\sigma_{\delta,\eta}$ equals 1 we define it to be $\hat q^*$. We claim that the stopping rule $\sigma_{\delta,\eta}$ and estimator $\hat q : \mathbb{N}^* \to \mathcal{P}(\mathbb{N})$ as defined above satisfy (36). To prove the claim, fix $p \in \mathcal{P}$. Note that, by convexity of the KL divergence,
$D_1(p||\hat q^*) = D_1\left(p \,\Big\|\, \frac{1}{T}\sum_{i=0}^{T-1} q^*(\cdot|X_{T+1}^{T+i})\right) \le \frac{1}{T}\sum_{i=0}^{T-1} D_1\left(p\,||\,q^*(\cdot|X_{T+1}^{T+i})\right).$
Further, we have for any $X^T$ that
$E\left[\frac{1}{T}\sum_{i=0}^{T-1} D_1\left(p\,||\,q^*(\cdot|X_{T+1}^{T+i})\right) \,\Big|\, X^T\right] = \frac{1}{T}\, E\left[\sum_{i=0}^{T-1} D_1\left(p\,||\,q^*(\cdot|X_{T+1}^{T+i})\right) \,\Big|\, T\right] = \frac{1}{T} D_T(p||q^*), \qquad (37)$
where the first equality holds because (i) $p$ is i.i.d., and (ii) the single-letter distributions within any of the KL divergences only depend on the length $T$, and not on the values of $X_1, \ldots, X_T$. In the last expression, $\frac{1}{T} D_T(p||q^*)$ denotes $\frac{1}{m} D_m(p||q^*)$ evaluated at $m = T$. Observe that since $\mathcal{P}$ is d.w.c. and $q^*$ is a weak universal measure, there exists $N_p$ such that $\frac{1}{m} D_m(p||q^*) \le \delta'$ for all $m \ge N_p$. The conditional expectation in (37) is a random variable that only depends on $T$, and whenever $T \ge N_p$, we have
$E\left[\frac{1}{T}\sum_{i=0}^{T-1} D_1\left(p\,||\,q^*(\cdot|X_{T+1}^{T+i})\right) \,\Big|\, X^T\right] = \frac{1}{T} D_T(p||q^*) \le \delta'.$
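The estimator $\hat q^*$ above averages the one-step conditionals of $q^*$ along the fresh block. A minimal sketch follows; the universal measure $q^*$ is not pinned down by the text, so as a hypothetical stand-in we use add-one (Laplace) smoothing over a toy finite alphabet for its conditionals. Only the averaging construction is the one described above.

```python
from collections import Counter

# Sketch of \hat q^* (alphabet, stand-in conditional, and names are ours).

ALPHABET = range(1, 6)  # toy alphabet, purely illustrative

def laplace_conditional(prefix):
    """Hypothetical stand-in for q*( . | prefix): add-one smoothing."""
    counts = Counter(prefix)
    total = len(prefix) + len(ALPHABET)
    return {x: (counts[x] + 1) / total for x in ALPHABET}

def qhat_star(fresh):
    """Average the T conditionals given the prefixes of the fresh block.
    fresh holds X_{T+1}, ..., X_{2T-1}, i.e. T-1 samples."""
    T = len(fresh) + 1
    dists = [laplace_conditional(fresh[:i]) for i in range(T)]
    return {x: sum(d[x] for d in dists) / T for x in ALPHABET}

est = qhat_star([2, 2, 3, 2])
assert abs(sum(est.values()) - 1.0) < 1e-12  # a valid distribution
```

By convexity of the KL divergence, the divergence from $p$ to this average is at most the average of the divergences, which is the step used in the proof.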
When $T \ge N_p$, therefore, Markov's inequality implies that with probability at least $1 - \delta'/\delta$, conditioned on $X^T$, we have
$D_1(p||\hat q^*) \le \frac{1}{T}\sum_{i=0}^{T-1} D_1\left(p\,||\,q^*(\cdot|X_{T+1}^{T+i})\right) \le \delta.$
Since $\mathcal{P}$ is d.w.c., if we take $N_p$ to be the smallest such integer, we know $p(T \ge N_p) \ge 1 - \eta'$. Hence, with probability under $p$ at least
$(1 - \eta')\left(1 - \frac{\delta'}{\delta}\right) = \left(1 - \frac{\eta}{2}\right)^2 \ge 1 - \eta,$
we have $D_1(p||\hat q^*) \le \delta$. □

Appendix D. Length-$n$ Per-Symbol Redundancy of $\mathcal{M}_h$

We construct a probability measure $q$ on $\mathbb{N}^\infty$ such that for $\mathcal{M}_h$ we have
$\frac{1}{n} D_n(p||q) \le 2h^{1/4}(\sqrt{h}+1)\sqrt{\frac{\ln 2}{\ln n}} + \pi\sqrt{\frac{2}{3n}}\log e.$
This implies that the per-symbol length-$n$ redundancy of $\mathcal{M}_h$ diminishes to 0 as $n \to \infty$. Hence $\mathcal{M}_h$ is strongly compressible. Consider the probability distribution $q$ on $\mathbb{N}$ defined by $q(i) = \frac{1}{i(i+1)}$, $i \ge 1$. As observed in Example 27, we have
$\sup_{p \in \mathcal{M}_h} E_p \left(\log \frac{1}{q(X)}\right)^2 < 4(\sqrt{h}+1)^2. \qquad (38)$
We consider a scheme that encodes patterns (Orlitsky et al. (2004b)) of symbols (i.e. natural numbers in our case) first, followed by an encoding using $\lceil \log \frac{1}{q(x)} \rceil$ bits to describe every symbol $x$ that appeared in the string, in the order in which the symbols arrived. To clarify, recall that the pattern of a sequence of symbols from $\mathbb{N}$ replaces each symbol by $k \in \mathbb{N}$ if the symbol was the $k$-th new symbol to appear in the sequence. For example, the pattern of the sequence of natural numbers $(2, 3, 17, 4, 3, 3, 1, 2, 4)$ is $(1, 2, 3, 4, 2, 2, 5, 1, 4)$. If, in addition to the pattern of a finite sequence of natural numbers in which there are $l$ distinct symbols, one knows which symbol was the $k$-th to appear for each $1 \le k \le l$, one learns the sequence of symbols. The expected (not normalized by $n$) additional number of bits to encode the pattern of a sequence of symbols of length $n$ from any $p \in \mathcal{M}_h$ is at most $\pi\sqrt{\frac{2n}{3}}\log e$, using the results in Orlitsky et al. (2004b), while the expected number of bits to describe the symbols of length-$n$ strings using a prefix code based on the probability distribution $q$ on $\mathbb{N}$ is at most
$\sum_{i \in \mathbb{N}} \left(1 - (1 - p(i))^n\right) \log \frac{1}{q(i)}.$
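The pattern map just defined is a direct transcription of the rule "the $k$-th distinct symbol to appear is renamed $k$"; a minimal sketch (function name ours):

```python
# Pattern of a sequence: replace each symbol by k if it was the k-th
# new symbol to appear. Matches the worked example in the text.

def pattern(seq):
    index = {}
    out = []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1
        out.append(index[s])
    return out

assert pattern([2, 3, 17, 4, 3, 3, 1, 2, 4]) == [1, 2, 3, 4, 2, 2, 5, 1, 4]
```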
Note that the distinct symbols appearing in the string will need to be specified in the order in which they arrived. Let $M_n$ denote the number of distinct symbols that appear in a sequence of length $n$. Then the expected number of extra bits the scheme uses for length-$n$ strings is (without normalizing by $n$) at most $\pi\sqrt{\frac{2n}{3}}\log e$ plus at most
$\sum_{i \in \mathbb{N}} (1-(1-p_i)^n)\log\frac{1}{q(i)} \stackrel{(a)}{\le} \sqrt{\sum_{i \in \mathbb{N}}(1-(1-p_i)^n)}\;\sqrt{\sum_{j \in \mathbb{N}}(1-(1-p_j)^n)\left(\log\frac{1}{q(j)}\right)^2} \stackrel{(b)}{\le} \sqrt{E[M_n]}\;\sqrt{\sum_{j \in \mathbb{N}} n p_j \left(\log\frac{1}{q(j)}\right)^2} \le 2(\sqrt{h}+1)\sqrt{n}\,\sqrt{E[M_n]} \stackrel{(c)}{\le} 2nh^{1/4}(\sqrt{h}+1)\sqrt{\frac{\ln 2}{\ln n}}.$
Here (a) follows from the Cauchy-Schwarz inequality, while (b) follows from (38), the definition of $M_n$, and the bound $1-(1-p_j)^n \le np_j$. As for (c), a result similar to (c) can be found in Orlitsky et al. (2004a), but we justify (c) below for completeness. We observe that for all $i \in \mathbb{N}$ we have
$1-(1-p_i)^n = p_i\sum_{j=0}^{n-1}(1-p_i)^j \stackrel{(a')}{\le} p_i \cdot \frac{n}{\sum_{k=1}^{n}\frac{1}{k}}\sum_{j=0}^{n-1}\frac{(1-p_i)^j}{j+1}, \quad \text{with} \quad \sum_{k=1}^{n}\frac{1}{k} \ge \ln n.$
Combining the above with the bound on the entropy of any $p \in \mathcal{M}_h$, which was shown in Example 27, proves (c) in the previous set of equations. In the above, inequality (a') follows from Chebyshev's sum inequality, which says that if $x_i$ and $y_i$ ($0 \le i \le n-1$) are both decreasing positive sequences, then $n\sum_i x_i y_i \ge \sum_j x_j \sum_k y_k$. This is easily proved by noting $\sum_j x_j \sum_k y_k = \sum_m \sum_i x_i y_{(i+m) \bmod n}$ and that $\sum_i x_i y_i \ge \sum_i x_i y_{(i+m)\bmod n}$ for all $0 \le m \le n-1$. The claim about the per-symbol length-$n$ redundancy of $\mathcal{M}_h$ follows after normalization by $n$.

Appendix E. Typicality of Top Heavy Empirical Distributions

In this section we prove a useful result quantifying how close the empirical distribution of a sample drawn i.i.d. from a probability distribution $p$ on $\mathbb{N}$ is to $p$, when the alphabet of symbols showing up in the sample is not too spread out. A somewhat similar lemma appears in Ho and Yeung (2010). The difference between the result in Lemma 43 and that in Ho and Yeung (2010) is that the right side of the inequality in (39) does not depend on $p$.
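The phenomenon quantified in this section, that the empirical distribution of an i.i.d. sample concentrates around $p$ in $L_1$ distance, can be checked exactly for a toy distribution (ours); $E|t(i) - p_i|$ is computed from the Binomial$(n, p_i)$ law of the count of symbol $i$.

```python
import math

# Exact expected L1 distance between the empirical distribution t of n i.i.d.
# samples and the (toy, illustrative) source distribution p. It shrinks as n
# grows, which is the concentration that Lemma 43 makes uniform in p.

p = [0.5, 0.25, 0.125, 0.125]

def expected_l1(n: int) -> float:
    total = 0.0
    for pi in p:
        total += sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k)
                     * abs(k / n - pi) for k in range(n + 1))
    return total

vals = [expected_l1(n) for n in (10, 40, 160)]
assert vals[0] > vals[1] > vals[2]
```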
The result of Lemma 43 will be used in the sufficiency proof in Appendix G, and this property is crucial for its use.

Lemma 43. Let $p$ be any probability distribution on $\mathbb{N}$. Let $\gamma > 0$ and let $k \ge 2$ be an integer. Let $X_1^n$ be a sequence generated i.i.d. with marginals $p$ and let $t(X^n)$ be the empirical distribution of $X_1^n$. Then
$p\left(|t(X^n) - p|_1 > \gamma \text{ and } 2F_t^{-1}(1-\gamma/6) \le k\right) \le (2^k - 2)\exp\left(-\frac{n\gamma^2}{18}\right). \qquad (39)$

Proof For any probability distribution $p'$ on $\mathbb{N}$ with finite support of size $L$ we have the following well-known result (e.g., (Weissman et al., 2005, Proposition 1)):
$p'\left(|t'_{X^n} - p'|_1 \ge \alpha\right) \le (2^L - 2)\exp\left(-\frac{n\alpha^2}{2}\right), \qquad (40)$
where $t'_{X^n}$ is the empirical distribution of $X^n$ generated i.i.d. with marginal distribution $p'$. The above is easily seen by recalling
$|t'_{X^n} - p'|_1 = 2\sup_{E \subseteq [L]} |t'(E) - p'(E)| = 2\sup_{E \subseteq [L],\,|E| \le L/2} |t'(E) - p'(E)|,$
and that for any $E \subseteq [L]$, from Hoeffding's inequality,
$p'\left(|t'_{X^n}(E) - p'(E)| \ge \frac{\alpha}{2}\right) \le 2\exp\left(-\frac{n\alpha^2}{2}\right).$
A union bound over all non-empty subsets of size at most $L/2$ yields (40). Consider the probability distributions $p'$ and $t'$ on $A$ obtained from $p$ and $t$ respectively via the mapping from $\mathbb{N}$ to $A := \{1, \ldots, k-1\} \cup \{-1\}$ which maps $i$ to $i$ for $1 \le i \le k-1$ and maps all the other natural numbers to $-1$. Thus, we have
$p'(i) = \begin{cases} p(i), & \text{if } 1 \le i \le k-1, \\ \sum_{j \ge k} p(j), & \text{if } i = -1. \end{cases}$
Further, sequences of natural numbers generated i.i.d. with marginal distribution $p$ and with empirical distribution $t$ are mapped to sequences from $A$ that are i.i.d. with marginal distribution $p'$ and have empirical distribution $t'$. Applying (40) to $p'$, we have
$p'\left(|p' - t'|_1 > \gamma/3\right) \le (2^k - 2)\exp\left(-\frac{n\gamma^2}{18}\right). \qquad (41)$
We first argue that all sequences generated by $p$ with empirical distributions $t$ satisfying $|p - t|_1 > \gamma$ and $2F_t^{-1}(1-\gamma/6) \le k$ are mapped into sequences generated by $p'$ with empirical distribution $t'$ satisfying $|p' - t'|_1 > \gamma/3$ and $t'(-1) \le \gamma/3$. This follows from writing
$|p - t|_1 = \sum_{i=1}^{k-1}|p(i) - t(i)| + \sum_{j \ge k}|p(j) - t(j)| \le \sum_{i=1}^{k-1}|p(i) - t(i)| + |p'(-1) - t'(-1)| + 2\sum_{j \ge k} t(j) \le \sum_{i=1}^{k-1}|p(i) - t(i)| + |p'(-1) - t'(-1)| + \gamma/3,$
where the last inequality above follows from the fact that $2F_t^{-1}(1-\gamma/6) \le k$ implies $F_t(k-1) \ge 1-\gamma/6$, i.e. $\sum_{j \ge k} t(j) \le \gamma/6$.
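The collapsing map used in this proof can be sketched directly; the sample and distributions below are toy data (ours). Merging all symbols $\ge k$ into the single extra symbol $-1$ can only shrink the $L_1$ distance between two distributions, which is what lets the finite-alphabet bound (40) be applied.

```python
from collections import Counter

# Collapsing map from the proof of Lemma 43 (sample data ours): keep
# symbols 1..k-1, merge all symbols >= k into the extra symbol -1.

def collapse(dist, k):
    out = {i: dist.get(i, 0.0) for i in range(1, k)}
    out[-1] = sum(v for i, v in dist.items() if i >= k)
    return out

def l1(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in keys)

sample = [2, 3, 17, 4, 3, 3, 1, 2, 4]                         # toy sample
t = {i: c / len(sample) for i, c in Counter(sample).items()}  # empirical dist.
p = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 17: 0.2}                 # toy source

k = 5
# Merging cells never increases the L1 distance (triangle inequality).
assert l1(collapse(p, k), collapse(t, k)) <= l1(p, t) + 1e-12
```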
Hence we have
$\sum_{i=1}^{k-1}|p(i) - t(i)| + |p'(-1) - t'(-1)| \ge |p - t|_1 - \gamma/3 > 2\gamma/3 > \gamma/3,$
because $|p - t|_1 > \gamma$. Thus, from (41), we will have
$p\left(|t(X^n) - p|_1 > \gamma \text{ and } 2F_t^{-1}(1-\gamma/6) \le k\right) \le p'\left(|t' - p'|_1 > \gamma/3 \text{ and } t'(-1) \le \gamma/3\right) \le (2^k - 2)\exp\left(-\frac{n\gamma^2}{18}\right).$
This completes the proof of the lemma. □

Appendix F. τ Enters With Probability 1

We reproduce the argument from Santhanam and Anantharam (2015) here for completeness. Every probability distribution $p \in \mathcal{P}$ is contained in at least one of the elements of the cover $(Q_{\bar p, m},\ \bar p \in \mathcal{P}_m)$, where $Q_{\bar p,m}$ denotes the zone of $\bar p \in \mathcal{P}_m$. Recall the enumeration of $\mathcal{P}_m$. Let $p^*$ be the centroid with the smallest index among all centroids in $\mathcal{P}_m$ whose zones contain $p$. With probability 1, sequences generated by $p$ will eventually have their empirical distribution within $Q_{p^*,m}$ (see Chung (1961) for a proof). Next, note that for all $n$ sufficiently large the analog of (24) (which makes sense for all $\bar p \in \mathcal{P}_m$) will hold. This follows since the right hand side of (24) diminishes to zero polynomially in $n$ while the left hand side diminishes to zero exponentially fast in $n$. Next, the analog of (25) will also hold eventually with probability 1 since, if $t$ denotes the empirical distribution of a sequence of length $n$ generated by $p$, then from Proposition 15
$F_t^{-1}(1 - \gamma_{p^*,m}/6) \to F_p^{-1}(1 - \gamma_{p^*,m}/6) \qquad (42)$
with probability 1 as $n \to \infty$, where we note that the quantity on the left hand side of (42) is actually a random variable and $t$ depends on $n$. Furthermore, we will also have, after finitely many samples, that
$2F_t^{-1}(1-\gamma_{p^*,m}/6) < 3F_p^{-1}(1-\gamma_{p^*,m}/6) \le 3\sup_{r \in B(p^*,\epsilon_{p^*,m};\mathcal{P})} F_r^{-1}(1-\gamma_{p^*,m}/6) = \log C(p^*, m),$
where the second inequality follows since $p$ is in the $\frac{1}{m}$-reach of $p^*$. Note that the 3 in the inequalities above can be replaced by any number strictly bigger than 2, or by an additive constant. Therefore, both (24) and (25) will eventually hold with probability 1. Furthermore, long enough sequences generated by $p$ fall into the zone of $p^*$ with probability 1.
This implies in turn that $\tau_{\eta,m}$ enters with probability 1. Note that it is entirely possible that some other centroid traps strings before they can be trapped by $p^*$, but that does not take away from the fact that $\tau_{\eta,m}$ will enter with probability 1.

Appendix G. Probability of Falling Into Bad Traps

Let $t$ be any length-$n$ empirical distribution trapped by $\hat p$, which we recall has $\frac{1}{m}$-reach $\epsilon_{\hat p,m}$, such that $p \notin B(\hat p, \epsilon_{\hat p,m}; \mathcal{P})$. Then we have $||\hat p - p||_1 \ge \epsilon_{\hat p,m}$, because $p \notin B(\hat p, \epsilon_{\hat p,m}; \mathcal{P})$, and $||\hat p - t||_1 \le \epsilon_{\hat p,m} - \gamma_{\hat p,m}$, because $t$ has to be in the zone $Q_{\hat p,m}$ in order to be captured by $\hat p$. Using the triangle inequality for $\ell_1$ norms, we get $||p - t||_1 \ge \gamma_{\hat p,m}$. This means that for every $p \in \mathcal{P}$, the probability that length-$n$ sequences with empirical distribution $t$ are trapped by a bad $\hat p$ can be bounded from above as
$p\left(|t - p|_1 \ge \gamma_{\hat p,m} \text{ and } 2F_t^{-1}\left(1 - \frac{\gamma_{\hat p,m}}{6}\right) \le \log C(\hat p,m)\right) \stackrel{(a)}{\le} (C(\hat p,m) - 2)\exp\left(-\frac{n\gamma_{\hat p,m}^2}{18}\right) \stackrel{(b)}{\le} \frac{\eta\,(C(\hat p,m)-2)}{2C(\hat p,m)\,\iota(\hat p)^2\, n(n+1)} \le \frac{\eta}{2\iota(\hat p)^2 n(n+1)},$
where the inequality (a) follows from Lemma 43 and (b) from (24). Therefore, the probability of sequences falling into bad traps is bounded above by
$\sum_{\hat p \in \mathcal{P}_m}\sum_{n \ge 1} \frac{\eta}{2\iota(\hat p)^2 n(n+1)} = \frac{\eta\pi^2}{12},$
since $\sum_{\hat p \in \mathcal{P}_m} \frac{1}{\iota(\hat p)^2} = \frac{\pi^2}{6}$ and $\sum_{n \ge 1}\frac{1}{n(n+1)} = 1$.

Appendix H. A Fake Proof

In this section we give a fake proof of the following mistaken claim: if $\mathcal{P}_1$ and $\mathcal{P}_2$ are d.w.c., then $\mathcal{P}_1 \cup \mathcal{P}_2$ is also d.w.c. We then explain why it is wrong. In the concluding remarks of Santhanam and Anantharam (2015) it was stated, in passing, that if $\mathcal{P}_1$ and $\mathcal{P}_2$ are insurable then $\mathcal{P}_1 \cup \mathcal{P}_2$ is also insurable. This statement is false, for the reasons explained in this section. This does not affect any of the results in Santhanam and Anantharam (2015). The argument proceeds as follows. Since $\mathcal{P}_i$ is d.w.c.
for each $i = 1, 2$, there is a probability measure $q_i$ on $\mathbb{N}^\infty$ such that for every $m \ge 1$, $0 < 1-\eta < 1$ and $i = 1, 2$ there is a universal stopping rule $\tau^{(i)}_{\eta,m}$ such that, for all $p \in \mathcal{P}_i$, we have
$p\left(\exists n \text{ such that } \frac{1}{n} D_n(p||q_i) > \frac{1}{m} \text{ and } \tau^{(i)}_{\eta,m}(X^n) = 1\right) < \eta.$
Let $q := (q_1 + q_2)/2$ and, for accuracy $\frac{1}{m} > 0$ and confidence $0 < 1-\eta < 1$, define
$\tau_{\eta,m}(x) := 1\left(\tau^{(1)}_{\eta,2m}(x) = 1\right)\, 1\left(\tau^{(2)}_{\eta,2m}(x) = 1\right)\, 1(|x| > 2m). \qquad (43)$
Now, suppose $p \in \mathcal{P}_1 \cup \mathcal{P}_2$. Without loss of generality, assume that $p \in \mathcal{P}_1$. Now, if $n > 2m$ and we have $\frac{1}{n} D_n(p||q) > \frac{1}{m}$, then, since $q \ge q_1/2$ implies $D_n(p||q_1) \ge D_n(p||q) - 1$, we have
$\frac{1}{n} D_n(p||q_1) > \frac{1}{m} - \frac{1}{n} > \frac{1}{2m}.$
Further, from (43), if $\tau_{\eta,m}(x) = 1$, then we have $\tau^{(1)}_{\eta,2m}(x) = 1$ as well. Therefore
$p\left(\exists n \text{ such that } \frac{1}{n} D_n(p||q) > \frac{1}{m} \text{ and } \tau_{\eta,m}(X^n) = 1\right) \le p\left(\exists n \text{ such that } n > 2m,\ \frac{1}{n} D_n(p||q_1) > \frac{1}{2m} \text{ and } \tau^{(1)}_{\eta,2m}(X_1^n) = 1\right) < \eta,$
where we have used (43) to see that the event whose probability is being evaluated on the left hand side of the preceding equation cannot occur unless $n > 2m$. Since the above holds for all $p \in \mathcal{P}_1$ and we can use a similar argument for all $p \in \mathcal{P}_2$, we are "done".

The flaw in the above proof is that $\tau_{\eta,m}$, as defined in (43), does not necessarily eventually equal 1 almost surely for all sources in $\mathcal{P}_1 \cup \mathcal{P}_2$, which would mean that it is not a universal stopping rule for the model class $\mathcal{P}_1 \cup \mathcal{P}_2$. To see why this issue might arise, note that $\tau^{(i)}_{\eta,2m}$ is known to eventually equal 1 almost surely only for sources in $\mathcal{P}_i$. Thus, if it happens to be the case that there is some event $A \subseteq \mathbb{N}^\infty$ and $p_1 \in \mathcal{P}_1$ with $p_1(A) > 0$ for which we have $p_2(A) = 0$ for every source $p_2 \in \mathcal{P}_2$, then $\tau^{(2)}_{\eta,2m}$ might never stop waiting on the sequences in $A$. This doesn't stop $\mathcal{P}_2$ from being d.w.c. But when we introduce sources from $\mathcal{P}_1$, in particular $p_1$, we find that $\tau_{\eta,m}$, as defined in (43), will never stop waiting under $p_1$. The stopping rule $\tau_{\eta,m}$ would then not be a universal stopping rule for the model class $\mathcal{P}_1 \cup \mathcal{P}_2$.

References

M. Asadi, R. Paravi, and N. Santhanam. Stationary and transition probabilities in slow mixing, long memory Markov processes.
IEEE Transactions on Information Theory, 60(9), September 2014.

A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, October 1998.

P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.

S. Ben-David and S. Shalev-Shwartz. Understanding Machine Learning. Cambridge University Press, 2014.

C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Olivier Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Full version from https://arxiv.org/abs/0712.0248, 2007.

K.L. Chung. A note on the ergodic theorem of information theory. Annals of Mathematical Statistics, 32:612–614, 1961.

Thomas Cover. On determining the irrationality of the mean of a random variable. The Annals of Statistics, 1(5):862–871, 1973.

L.D. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, November 1973.

Amir Dembo and Yuval Peres. A topological criterion for hypothesis testing. The Annals of Statistics, pages 106–117, 1994.

M. Drmota and W. Szpankowski. Precise minimax redundancy and regrets. IEEE Transactions on Information Theory, 50:2686–2707, 2004.

P. Elias. Universal codeword sets and representations of integers. IEEE Transactions on Information Theory, 21(2):194–203, March 1975.

B. Fittingoff. Universal methods of coding for the case of unknown statistics. In Proceedings of the 5th Symposium on Information Theory, pages 129–135. Moscow-Gorky, 1972.

P. Grunwald. The Minimum Description Length Principle. MIT Press, 2007.

David Haussler. A general minimax result for relative entropy. IEEE Transactions on Information Theory, 43(4):1276–1280, 1997.

S.
Ho and R. Yeung. On information divergence measures and joint typicality. IEEE Transactions on Information Theory, 56(12):5893–5905, 2010.

J.C. Kieffer. A unified approach to weak universal source coding. IEEE Transactions on Information Theory, 24(6):674–682, November 1978.

V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47:1902–1914, 2001.

Jack Koplowitz, Jeffrey E. Steif, and Olle Nerman. On Cover's consistent estimator. Scandinavian Journal of Statistics, pages 395–397, 1995.

R.E. Krichevsky and V.K. Trofimov. The performance of universal coding. IEEE Transactions on Information Theory, 27(2):199–207, March 1981.

Sanjeev R. Kulkarni and David N. C. Tse. A paradigm for class identification problems. IEEE Transactions on Information Theory, 40(3):696–705, 1994.

D. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, July 1999.

N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, October 1998.

A. Orlitsky, N.P. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. In Uncertainty in Artificial Intelligence, 2004a.

A. Orlitsky, N.P. Santhanam, and J. Zhang. Universal compression of memoryless sources over unknown alphabets. IEEE Transactions on Information Theory, 50(7):1469–1481, July 2004b.

J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, July 1984.

N. Santhanam and V. Anantharam. Agnostic insurance of model classes. Journal of Machine Learning Research, pages 2329–2355, 2015. Full version available from arXiv:1212.3866.

Y.M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, 1987.

W. Szpankowski and M. Weinberger.
Minimax pointwise redundancy for memoryless models over large alphabets. IEEE Transactions on Information Theory, 58:4094–4104, 2012.

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. Weinberger. Universal discrete denoising: known channel. IEEE Transactions on Information Theory, 51(1):5–28, 2005. See also HP Labs Tech Report HPL-2003-29, February 2003.

C. Wu and N. Santhanam. Entropy property testing with finitely many errors. In Proceedings of the IEEE Symposium on Information Theory, virtual conference, 2020.

C. Wu and N. Santhanam. Prediction with finitely many errors almost surely. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021a.

C. Wu and N. Santhanam. Non-uniform consistency of online learning with random sampling. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021b.