# Agnostic Insurability of Model Classes

Journal of Machine Learning Research 16 (2015) 2329-2355. Submitted 4/13; Revised 2/14; Published 8/15.

Narayana Santhanam (nsanthan@hawaii.edu), Dept. of Electrical Engineering, University of Hawaii at Manoa, Honolulu, HI 96822

Venkat Anantharam (ananth@eecs.berkeley.edu), EECS Dept., University of California, Berkeley, Berkeley, CA 94720

Editor: Gabor Lugosi

© 2015 Narayana Santhanam and Venkat Anantharam.

Abstract

Motivated by problems in insurance, our task is to predict finite upper bounds on a future draw from an unknown distribution p over the natural numbers. We can only use past observations generated independently and identically distributed (i.i.d.) according to p. While p is unknown, it is known to belong to a given collection P of probability distributions on the natural numbers. The support of the distributions p ∈ P may be unbounded, and the prediction game goes on for infinitely many draws. We are allowed to make observations without predicting upper bounds for some time. But we must, with probability 1, start and then continue to predict upper bounds after a finite time, irrespective of which p ∈ P governs the data. If it is possible, without knowledge of p and for any prescribed confidence however close to 1, to come up with a sequence of upper bounds that is never violated over an infinite time window with confidence at least as big as prescribed, we say the model class P is insurable. We completely characterize the insurability of any class P of distributions over the natural numbers by means of a condition on the neighborhoods of distributions in P, one that is both necessary and sufficient.

Keywords: insurance, ℓ1 topology of probability distributions over countable sets, non-parametric approaches, prediction of quantiles of distributions, universal compression

1. Introduction

Insurance is a means of managing risk by transferring a potential sequence of losses to an insurer for a price paid on a regular basis, the premium. The insurer attempts to break even by balancing the possible loss that may be suffered by a few with the guaranteed premiums of many. We aim to study the fundamentals of this problem when the losses can be unbounded and a precise model for the probability distribution of the aggregate loss in each period either does not exist or is infeasible to obtain.

A systematic, theoretical, as opposed to empirical, study of insurance goes back to 1903, when Filip Lundberg defined a natural probabilistic setting as part of his thesis; see, e.g., the chapter on Lundberg in Englund and Martin-Löf (2001). In particular, Lundberg formulated a collective risk problem, pooling together the risk of all the insured parties into a single entity, which we call the insured. A substantial part of statistical work on insurance depends on working with specific models for the loss distribution, e.g., compound Poisson models, after which questions of interest in practice, such as the relation between the size of the premiums charged and the probability of the insurer going bankrupt, can be analyzed. A rather comprehensive theory of insurance along these lines has evolved; see Cramér (1969) and, more recently, Asmussen and Albrecher (2010). This theory is able to incorporate several model classes for the distribution of the losses over time other than compound Poisson processes, including some heavy tailed distribution classes. We will outline our approach in the next subsection.
In addition, from a learning theory perspective, our formulation is positioned in between the notions of uniform and pointwise consistency as we outline in subsection 1.2. Finally, in subsection 1.3 we compare our approach to the closely connected notions of universal compression and regret. 1.1 Approach Our approach departs from the existing literature on insurance in two important respects. No upper bound on loss. The first departure relates to the practice among insurers to limit payments to a predetermined ceiling, even if the loss suffered by the insured exceeds this ceiling. In both the insurance industry and the legal regulatory framework surrounding it, this is assumed to be common sense. But is it always necessary to impose such ceilings? Moreover, in scenarios such as reinsurance, a ceiling on compensation is not only undesirable, but may also limit the very utility of the business. As we will see, we may be able to handle scenarios where the loss can be unbounded. Universal approach. The second aspect of our approach arises from our motivation to deal with several new settings for which some sort of insurance is desirable, but where insurers are hesitant to enter the market due to lack of sufficient data. Examples of such settings include insuring against network outages or attacks against future smart grids, where the cascade effect of outages or attacks could be catastrophic. In these settings, it is not clear today what should constitute a reasonable risk model because of the absence of usable information about what might cause the outages or motivate the attacks. We address the second issue by working with a class of models, i.e., a set of probability laws over loss sequences that adheres to any assumptions the insurer may want to make or any information it may already have. In this paper we will only consider loss models that are independent and identically distributed (i.i.d.) from period to period, so we can equivalently think of a model class as defined in terms of its one dimensional marginals. As an example, we may want to consider the set of all finite moment probability distributions over the nonnegative integers as our class of possible models for the loss distribution in each period. Now, we ask the question: what classes of models are the ones on which the insurer can learn from observations and set premiums so as to remain solvent? In this paper, we completely answer this question by giving a necessary and sufficient condition that characterizes what classes of models lend themselves to this insurance task. This setup for insurability is very reminiscent of the universal compression/estimation/prediction approaches as seen in Shtarkov (1987); Fittingoff(1972); Rissanen (1984) and Ryabko (2008). There is also extensive work regarding learning from experts that has a related flavor, see Cesa-Bianchi and Lugosi (2006) for a survey. We will discuss the compression angle in more depth shortly as well. Formulation. Formally, we adopt the collective risk approach, namely, we abstract the problem to include just two agents, the insurer and the insured. Losses incurred by the insured are considered to Insurability form a discrete time sequence of random variables, with the sequence of losses denoted by {Xi, i 1}, and we assume that Xi N for all i 1, where N denotes the set of natural numbers, {0, 1, 2, . . .}. 
A model class P∞ is a collection of measures on infinite length loss sequences, and is to be thought of as the set of all potential probability laws governing the loss sequence. Each element of P∞ is a model for the sequence of losses. Any prior knowledge on the structure of the problem is accounted for in the definition of P∞. We focus on measures corresponding to i.i.d. samples, i.e., each member of P∞ induces marginals that are product distributions. We denote by P the set of distributions on N obtained as one dimensional marginals of P∞. Since there is no risk of confusion, we will also refer to the distributions in P as models and to P as the model class. The actual model in P∞ governing the law of the loss in each period remains unknown to the insurer.

We assume no ceiling on the loss, and require the insurer to compensate the insured in full for the loss in each period at the end of that period. The insurer is assumed to start with some initial capital Π_0 ∈ R+, a nonnegative real number. The insurer then sets a sequence of premiums based on the past losses: at time i, the insurer collects a premium Π(X_1^{i-1}) at the beginning of the period, and pays out full compensation for the loss X_i at the end of the period. If the capital built up till step i (including Π(X_1^{i-1}), and after having paid out all past losses) is less than X_i, the insurer is said to be bankrupted. Given a class P∞ of loss models, we ask if, for every prescribed upper bound η > 0 on the probability of bankruptcy, the insurer can set (finite) premiums at every time step, based only on the loss sequence observed thus far and with no further knowledge of which law p ∈ P∞ governs the loss sequence, while ensuring that the insurer remains solvent with probability bigger than 1 − η under p, irrespective of which p ∈ P∞ is in effect. If the probability of the insurer ever going bankrupt over an infinite time window can be made arbitrarily small in this sense, the class of i.i.d. loss measures P∞ is said to be insurable.

A couple of clarifications are in order here. First, to make the problem non-trivial, we allow the insurer to observe the loss sequence for some arbitrary finite length of time without having to provide compensation. We require that the insurer eventually provide insurance with probability 1 no matter which p ∈ P∞ is in effect. The insurer cannot quit providing insurance once it has entered into the insurance contract with the insured. Premiums set before the entry time can be thought of as being 0, and the question of bankruptcy only arises after the insurer has entered into the contract. Secondly, at this point of research, we do not concern ourselves with incentive compatibility issues on the part of the insured, and assume that the insured will accept the contract once the insurer has entered, agreeing to pay the premiums as set by the insurer.

It turns out that the fact that the capital available to the insurer at any time is built up from past premiums does not play any role in whether a model class is insurable or not. In fact, the problem is basically one of finding a sequence of finite upper bounds Φ(X_1^{i-1}) on the loss X_i for all i ≥ 1. We refer to the sequence {Φ(X_1^{i-1}), i ≥ 1} as the loss dominating sequence and call Φ(X_1^{i-1}) the loss dominant at step i.
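To make the game concrete, the following minimal sketch (our own illustration, not from the paper) simulates a loss-domination scheme that observes for a fixed number of periods and then sets the loss dominant to twice the largest loss seen so far, a rule of the kind used later for the class of uniform distributions in Example 3. The particular loss distribution, entry time, horizon, and function names are illustrative assumptions.

```python
import random

def simulate_bankruptcy(sample_loss, entry_time, horizon, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the loss-domination scheme
    Phi(x_1^{i-1}) = 2 * max(observed losses) is ever violated.

    sample_loss: function returning one i.i.d. loss in {0, 1, 2, ...}
    entry_time:  number of periods observed before finite dominants are set (>= 1)
    horizon:     number of periods simulated (stands in for "infinite" time)
    """
    rng = random.Random(seed)
    bankrupt = 0
    for _ in range(trials):
        losses, ruined = [], False
        for i in range(horizon):
            x = sample_loss(rng)
            if i >= entry_time:
                phi = 2 * max(losses)      # loss dominant set before seeing x
                if x > phi:                # dominant violated: bankruptcy
                    ruined = True
                    break
            losses.append(x)
        bankrupt += ruined
    return bankrupt / trials

# Illustration: losses uniform on {3, ..., 40}, a member of the class U below.
loss = lambda rng: rng.randint(3, 40)
print(simulate_bankruptcy(loss, entry_time=8, horizon=500))
```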
The notion of insurability of a model class P comes down to whether for each η > 0 there is a way of choosing the loss dominants such that the probability of the loss Xi ever exceeding the loss dominant Φ(Xi 1 1 ) is smaller than η irrespective of which model p in the model class P is in effect. Here again we allow some initial finite number of periods for which the loss dominant can be set to , but it must become finite with probability 1 under each p P and stay finite from that point onwards. Results. For a model class to be insurable, roughly speaking, close distributions must have comparable percentiles. Distributions in the model class that, in every neighborhood, have some other distribution with arbitrarily different percentiles are said to be deceptive. In Section 3, we define what Santhanam and Anantharam it means for distributions to be close, and what it means for distributions to have comparable percentiles. In Section 4, we provide several examples of insurable and non-insurable model classes. Our main result is Theorem 1 of Section 3, which states that P is insurable iffit has no deceptive distributions. We prove this theorem in Sections 5 and 6. In the rest of this section, we discuss the problem of insurability in the broader contexts of uniform and pointwise convergence of estimators and universal compression. 1.2 Pointwise vs. Uniform Convergence Theoretically, the flexibility we have permitted regarding when to start proposing finite loss dominants allows us to categorize the insurance problem formulated above as one that admits what we call dataderived pointwise convergent estimators. When dealing with large alphabets or high dimensions, it may be too restrictive to only deal with model classes or problem formulations that admit uniformly convergent estimators. We are particularly interested in richer cases where uniform learnability may not be possible. In such cases, often guarantees on estimators are provided that hold pointwise for all models. Typically, the results proven in such cases are of consistency, or bounds on rate of convergence of estimators that depend on the parameters of the model in force of course, the model by itself is usually not known a-priori. Therefore the practical issue with such pointwise guarantees is that they may not say much about what is happening with the specific sample at hand. Namely, if we know that an estimator for a problem is consistent and we have model dependent convergence bounds, for a given sample (from an unknown model) there may be no way of telling how good or bad the estimate currently is. It can be shown that the insurance problem outlined here is equivalent to learning an upper bound on every percentile of the unknown distribution from P, using only samples of i.i.d. draws from the distribution. However, we allow for model classes that are rich enough that there may be no bound on a given percentile that holds uniformly over the entire class P. Yet, we can still salvage the situation if for any given finite sample, we had some way to tell from the sample if the estimate was doing well or not relative to the true unknown model. In other words, we ask for model dependent convergence bound that depends on the model only through the sample that we observe. Data-driven pointwise convergence is at the heart of insurability as well, and it shows up specifically because we provide a finite (but not bounded uniformly over all models and observations) observation window before the insurer enters the game. 
While we consider complex P where there may be no possible way to bound any given percentile over the entire class, we can tell when the bounds obtained from a sample are good against the model that generated the sample. The point at which we decide our bounds are good against the model underlying the observed sample is related to the entry point defined in the formulation above. In Section 4 we provide several examples of classes that are insurable and classes that are not.

This principle of using the sample to gauge the performance of a pointwise consistent estimate has been incorporated into several other setups as well. A few examples that stand out are the notion of luckiness NML in compression proposed in Grünwald (2007), the PAC-Bayesian bounds for classification in McAllester (2013), and the estimation of slow mixing Markov processes in Asadi et al. (2014). For the luckiness NML formulation, an appropriate slack function can be interpreted as a bound on how far we are from the code length of the underlying source. The PAC-Bayesian bounds of McAllester (2013) can also be used to provide data-dependent confidence bounds for classifiers in scenarios where classes may be very rich. In Asadi et al. (2014), again, data-dependent confidence bounds are provided for the estimates of transition and stationary probabilities of underlying slow mixing, long memory Markov sources.

To illustrate this concept, we provide a simple running example below that we will also use to demonstrate connections with compression in Subsection 1.3.

Example 1 [Birthday Problem] Consider the problem of estimating the size of a discrete, finite set S ⊂ N (specifically, there is no upper bound on the size of S) if we can draw as many random samples from S as needed. If we have independent, uniform draws from S, one simple way to estimate the size of S is to keep sampling till some element from S is drawn twice. This is the first repeat. A simple back of the envelope calculation, analogous to the Birthday Problem, shows that if N_1 is the sample size when the first repeat occurred, then a good estimate of the set size is N_1^2/2. One can then provide PAC-learning style bounds showing that, with some confidence, the above estimate based on the first repeat is accurate to a certain level.

This is an example where there can be no uniformly convergent estimator of the set size. Given a fixed sample of size n, if the size of S is Ω(n^2), the sample consists of n distinct symbols with probability close to 1. If we can assume no further structure on S, there is no way to distinguish between samples obtained from any two sets of size Ω(n^2); thus no estimator can distinguish between these large sets. It is therefore futile, with a finite sample size, to expect an estimator that can estimate the set size to any non-trivial accuracy no matter what the set is. Equivalently, there can be no uniformly convergent estimator of the set size.

The simple Birthday Problem estimator above only converges pointwise. It may take an arbitrarily long sample for some models before it gives an answer; that is the nature of the problem. But the estimator is imbued with a very useful property: with a guaranteed confidence, the estimator does not make a mistake even though it may not always have an answer. If the sample has no repeats yet, the estimator does not overreach and volunteer a wrong answer. Hence, we can tell from the sample whether we are doing well or not. □
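The first-repeat estimator is easy to state in code. The sketch below is our own illustration under arbitrary choices of set sizes and sampling budget (the helper name birthday_estimate is hypothetical): it keeps drawing uniform samples until the first repeat and returns N_1^2/2, or declines to answer if the budget runs out before a repeat occurs.

```python
import random

def birthday_estimate(draw, budget, rng):
    """Estimate |S| from uniform draws using the first repeat.

    Returns N1**2 / 2, where N1 is the number of draws at the first repeat,
    or None if no repeat occurs within `budget` draws (no answer yet).
    """
    seen = set()
    for n in range(1, budget + 1):
        x = draw(rng)
        if x in seen:
            return n * n / 2.0        # first repeat at sample size n
        seen.add(x)
    return None                       # the estimator declines to answer

rng = random.Random(1)
for true_size in (100, 10_000):
    draw = lambda r: r.randrange(true_size)   # uniform over S = {0, ..., true_size - 1}
    estimates = [birthday_estimate(draw, budget=1000, rng=rng) for _ in range(5)]
    print(true_size, estimates)
```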
1.3 Universal Compression

The approach we take is not unconnected with the universal compression literature, as well as certain learning formulations involving regret with log-loss. There are many variations in how compression problems are formulated, and the following example illustrates the main variants studied. It also serves to provide an example of the data-derived pointwise estimation introduced in the last section, and hence places insurability in the context of the universal compression literature.

Example 2 We will study the so-called Birthday problem from the previous example in a little more depth. Let B denote the collection of all distributions p_M (M ≥ 1), where p_M is the uniform distribution over {0, . . . , M}. We use this example to distinguish between uniformly good compressors (strongly universal) and compressors that are only good pointwise (weakly universal).

Suppose we consider one draw from an unknown distribution p ∈ B. The worst case regret, or worst-case redundancy, quantifies the minimum possible excess code length of a universal distribution q over N relative to the (unknown) distribution p:

    R(B) := inf_q sup_{p ∈ B} sup_{x ∈ N} log ( p(x) / q(x) ).    (1)

Note that the regret of any class of distributions is always ≥ 0. We could, of course, consider a sequence of n independent draws from an unknown p ∈ B, and ask now for a measure q over infinite sequences of numbers that is universal for all the i.i.d. measures corresponding to distributions in B. We then concentrate on the redundancy

    R_n(B) := inf_q sup_{p ∈ B} sup_{x^n ∈ N^n} (1/n) log ( p(x^n) / q(x^n) ),

where we abuse notation and write p(x^n), for p ∈ B, for the probability assigned to x^n by n independent draws from p. Of course, we could similarly define redundancy for length-n sequences for any collection of measures over infinite sequences from a countable alphabet, not necessarily i.i.d. Strongly compressible classes are those sets P∞ of measures over infinite sequences satisfying lim_{n→∞} R_n(P∞) = 0.

For the single-letter formulation in (1), clearly the optimal universal distribution gives any number x a probability proportional to the highest probability that number gets from any model in B, followed by a normalization. But the highest probability a model in B gives any x ∈ N is 1/(x + 1), which is not summable over x. Thus the redundancy is infinite here; equivalently, no matter what universal distribution q we choose and no matter how large a number M we pick, there is a p ∈ B and a number x ∈ N such that log ( p(x) / q(x) ) > M. In this case, we will therefore not have redundancy bounds holding uniformly for the model class. We say B is not strongly compressible. With a very similar argument, it is easy to see that R_n(B∞) = ∞ for all n, where we use B∞ to denote the set of i.i.d. measures over infinite sequences constructed as above from marginals in B.

But we can say something more. Consider again compressing sequences of numbers drawn i.i.d. from an unknown distribution in B. Noting that B is countable, we focus on a measure q_w over infinite sequences that gives a sequence x^n the probability

    q_w(x^n) = Σ_{i ≥ 1} 1/( i(i + 1) ) p_i(x^n).

It is easy to verify that q_w above satisfies

    sup_{p ∈ B} lim_{n→∞} sup_{x^n ∈ N^n} (1/n) log ( p(x^n) / q_w(x^n) ) = 0,    (2)

or that q_w matches every p pointwise over the model class B. Such classes of sources are weakly universal. The code length of the universal measure q_w matches that of p for every p ∈ B, but possibly at an arbitrarily slow rate for some sources (since the class cannot be strongly compressed). A couple of points. Note that admittedly it has been easy to define q_w here since B was countable to begin with; a small numerical illustration of this mixture appears below.
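As a quick numerical check (our own illustration, with an arbitrary truncation of the infinite sum and an arbitrary choice of M), the sketch below compares the ideal code length under the mixture q_w against that of the true uniform source p_M, showing the per-symbol excess (1/n) log( p(x^n) / q_w(x^n) ) shrinking as n grows for a fixed M, even though no bound holds uniformly over all M.

```python
import math, random

def log2_qw(xs, trunc=1000):
    """log2 of q_w(x^n) = sum_{i>=1} p_i(x^n) / (i(i+1)), where p_i is the
    i.i.d. measure with marginal uniform on {0,...,i}.  Only terms with
    i >= max(xs) are nonzero; the rapidly decaying series is truncated."""
    n, m = len(xs), max(max(xs), 1)
    # log2 of each nonzero term: 1/(i(i+1)) * (1/(i+1))**n
    logs = [-math.log2(i) - (n + 1) * math.log2(i + 1) for i in range(m, m + trunc)]
    top = max(logs)                                  # log-sum-exp in base 2
    return top + math.log2(sum(2.0 ** (l - top) for l in logs))

rng = random.Random(0)
M = 20                                               # true source: uniform on {0,...,M}
xs = []
for n in (10, 100, 1000):
    while len(xs) < n:
        xs.append(rng.randint(0, M))
    excess = (-log2_qw(xs) - n * math.log2(M + 1)) / n
    print(n, round(excess, 4))                       # per-symbol excess tends to 0
```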
If not, a condition reminiscent of countability above is necessary and sufficient for a class to be weakly universal as shown in (Kieffer, 1978). To emphasize, qw is guaranteed to satisfy (2), Insurability implying that it does not underestimate relative to any p for sufficiently long sequences. But how long is sufficiently long depends on p, and for a given length-n sequence without knowing p, it may not be possible to say if qw is doing well or not. This second aspect of knowing when an estimator is good is crucial to insurance formulations. One could also replace the sup over xn with expectation, and get average case versions for both strong and weak universality. 2 Strong compression is well known and is the more studied version of universal compression and regret formulations involving log loss. As one might expect, we show in (Santhanam and Anantharam, 2012) that strong compression implies insurability but not vice-versa. However, the insurance formulation has more in common with weak compression and pointwise convergence, rather than strong compression and uniform convergence. The connection between insurability and weak compression turns out to be rather interesting. In (Santhanam and Anantharam, 2012), we show classes of models that can be weakly compressed but are not insurable. At the same time, we also construct classes of models in (Santhanam and Anantharam, 2012) that are insurable, but cannot be weakly compressed. To summarize, our formulation is interesting precisely in cases where the strong notions of (worstcase or average-case) redundancy fail. Namely, classes of distributions whose redundancy is not finite. The universal compression formulation closer to our notion of insurability here is in the idea of weak universal compression in Kieffer (1978). However, weak compression formulations thus far have never included the aspect of determining from the data at hand when a compressor is doing well a crucial part of our problem. There may be one insight that we conjecture can be generalized beyond insurability to all problems with the flavor of data-derived pointwise convergence of estimates. What matters seems to be the local complexity as opposed to global complexity of model classes. Insurability of model classes does not depend on global complexity measures of model classes as with the (strong) redundancy of model classes (which is determined by the integral of the square root of absolute Fisher information over the entire model class) or the Rademacher complexity. Instead, insurability is related to how local neighborhoods look; in particular it depends on local tightness as we will see in Section 3. 2. Precise Formulation of Insurability We model the loss at each time by a random variable taking values in N = {0, 1, . . .}. Denote the sequence of losses by X1, X2 . . . where Xi N. Let N be the set of all finite length sequences from N, including the empty sequence. We will write xn for the sequence x1, . . . ,xn. Where it appears, x0 denotes the empty sequence. A loss distribution is a probability distribution on N. Let P be a set of loss distributions. P is the collection of i.i.d. measures over infinite sequences of symbols from N such that the set of one dimensional marginals over N they induce is P. We write R+ for the set of nonnegative real numbers and use := for equality by definition. Consider an insurer with an initial capital Π0 R+. An insurance scheme for P is comprised of a pair (τ, Π). Here τ : N 7 {0, 1} satisfies τ(x1, . . . , xn) = 1 = τ(x1, . . . 
, xn+1) = 1 for all xn and also p(supn τ(Xn) = 1) = 1 for all p P . τ should be thought of as defining an entry time for the insurer with the property that once the insurer has entered it stays entered and that the insurer enters with probability 1 irrespective of which p P is in effect. Here we say the insurer enters after seeing the sequence xn N (possibly the empty sequence) if τ(xn) = 1. The other ingredient of an insurance Santhanam and Anantharam scheme is the premium setting scheme Π : N R+, satisfying Π(xn) = 0 if τ(xn) = 0, with Π(xn) being interpreted as the premium demanded by the insurer from the insured after the loss sequence xn N is observed. Let 1( ) denote the indicator function of its argument. The event that the insurer goes bankrupt is the event that i=1 (Π(Xi 1) Xi)1(τ(Xi 1) = 1) < 0 for some n 1 . In words, this is the event that in some period n 1 after the insurer has entered, the loss Xn incurred by the insured exceeds the built up capital of the insurer, namely the sum of its initial capital and all the premiums it has collected after it has entered (including the currently charged premium Π(Xn 1)) less all the losses paid out so far. Definition 1 A class P of laws on loss sequences is called insurable by an insurer with initial capital Π0 R+ if η > 0, there exists an insurance scheme (τ, Π) such that p P , p((τ, Π) goes bankrupt ) < η . We should remark that despite the apparent role of the initial capital of the insurer in this definition, it plays no role from a mathematical point of view. To see this note first that if a model class P is insurable by an insurer with capital Π0 it is clearly insurable by all insurers with initial capital at least Π0, since such an insurer can use the same entry time and premium setting scheme as the insurer with initial capital Π0. On the other hand, an insurer with initial capital less than Π0 can use the same entry time as an insurer with initial capital Π0 and simply charge an additional premium at the time of entry which in effect builds up its initial capital to Π0, and then proceed with the same premium setting scheme as that used by the insurer with initial capital Π0. This feature is an artifact of the complete flexibility we give the insurer in setting premiums; for more on this see the concluding remarks in Section 7. As indicated in the introductory Section 1, we will first show that whether a model class of loss distributions is insurable is equivalent to whether we can find suitable loss-domination sequences for the sequence of losses. We next make this connection and the associated terminology precise. Definition 2 A loss-domination scheme for P is a mapping Φ : N 7 R+ { }, where for xn N , we interpret Φ(xn) as an estimated upper bound on xn+1. We call {Φ(Xi 1), i 1} the loss-domination sequence and Φ(Xi 1) the loss dominant at step i. We require for all xn N that Φ(x1, . . . , xn) < = Φ(x1, . . . , xn+1) < and also that for all p P , p(inf n 1 Φ(Xn) < ) = 1. 2 We think of Φ(xn) = as saying that the scheme has not yet committed to proposing finite loss dominants after having seen the sequence xn, while if Φ(xn) < it has. Once the scheme commits to proposing finite loss dominants it has to continue to propose finite loss dominants from that point onwards. Further, with probability 1 under every p P , the scheme has to eventually start proposing finite loss dominants. 
Definition 3 Given our motivation from the insurance problem, we will say the loss-domination scheme Φ goes bankrupt if Φ(X_1^{n-1}) < X_n for some n ≥ 1. □

The connection between the insurance problem and the problem of selecting loss dominants can now be made precise as follows.

Observation 1 Let P∞ be a model class and η > 0. Let Π_0 ∈ R+. An insurer with initial capital Π_0 can find an insurance scheme (τ, Π) such that the probability of remaining solvent is bigger than 1 − η irrespective of which p ∈ P∞ is in effect if and only if there is a loss-domination scheme Φ such that the probability of it going bankrupt is less than η irrespective of which p ∈ P∞ is in effect.

Proof Given an insurance scheme (τ, Π), consider the loss-domination scheme Φ that has Φ(x^n) := ∞ if τ(x^n) = 0, and

    Φ(x^{n-1}) := Π_0 + Σ_{i=1}^{n-1} (Π(x_1^{i-1}) − x_i) 1(τ(x_1^{i-1}) = 1) + Π(x_1^{n-1}),  if τ(x^{n-1}) = 1.

Since τ enters (becomes equal to 1) with probability 1 under each p ∈ P∞ and stays equal to 1 once it has become 1, Φ becomes finite with probability 1 under each p ∈ P∞ and stays finite once it has become finite. Thus Φ is indeed a loss-domination scheme. It is straightforward to check that if the insurance scheme (τ, Π) stays solvent with probability bigger than 1 − η irrespective of which p ∈ P∞ is in effect, then the loss-domination scheme Φ goes bankrupt with probability less than η irrespective of which p ∈ P∞ is in effect.

Conversely, given a loss-domination scheme Φ, define the insurance scheme (τ, Π) by setting τ(x^n) := 0 iff Φ(x^n) = ∞ (and τ(x^n) := 1 iff Φ(x^n) < ∞), and defining Π(x^n) := 0 if Φ(x^n) = ∞ and Π(x^n) := Φ(x^n) if Φ(x^n) < ∞. One sees that τ as defined becomes 1 with probability 1 under each p ∈ P∞ and stays equal to 1 once it becomes 1. Further, the premiums set at each time are finite and equal to 0 till the entry time. Thus (τ, Π) as defined is indeed an insurance scheme. It is straightforward to check that if Φ goes bankrupt with probability less than η irrespective of which p ∈ P∞ is in effect, then (τ, Π) stays solvent with probability bigger than 1 − η irrespective of which p ∈ P∞ is in effect. Hence the above observation. □

We may therefore conclude that a model class P∞ is insurable iff for all η > 0 there is a loss-domination scheme Φ such that the probability of going bankrupt under Φ is less than η irrespective of which p ∈ P∞ is in effect. In the rest of the paper we will therefore focus mainly on whether the model class P∞ is such that for every η > 0 a loss-domination scheme Φ exists with its probability of bankruptcy being less than η irrespective of which model in the model class governs the sequence of losses. In Theorem 1, we provide a condition on P that is both necessary and sufficient for insurability.

3. Statement of Main Result

We go through a few technical points before spelling out the results in detail in Subsection 3.3.

3.1 Close Distributions

Insurability of P∞ depends on the neighborhoods of the probability distributions among its one dimensional marginals P. The relevant measure of closeness between distributions in P that decides the neighborhoods is

    J(p, q) := D( p || (p + q)/2 ) + D( q || (p + q)/2 ).

Note that the above is the Jensen-Shannon divergence (with an additional factor of 2) and is not a true distance. Here D(p||q) denotes the relative entropy of p with respect to q, where p and q are probability distributions on N, defined by

    D(p||q) := Σ_{y ∈ N} p(y) log ( p(y) / q(y) ).

The logarithm is assumed to be taken to base 2 (we use ln for the logarithm to the natural base).
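The quantity J(p, q) is straightforward to compute for distributions with countable support. The sketch below is our own illustration (the dictionary representation and the example pair are arbitrary choices): it evaluates J for two finitely supported distributions and checks the two bounds relating it to the ℓ1 distance stated just below and proved in Lemma 4. The pair chosen also hints at the phenomenon behind deceptive distributions defined in Subsection 3.3: the two distributions are close in ℓ1 yet have very different upper percentiles.

```python
import math

def J(p, q):
    """J(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2), logs in base 2.
    p, q: dicts mapping outcomes in N to probabilities."""
    total = 0.0
    for y in set(p) | set(q):
        py, qy = p.get(y, 0.0), q.get(y, 0.0)
        m = (py + qy) / 2.0
        if py > 0:
            total += py * math.log2(py / m)
        if qy > 0:
            total += qy * math.log2(qy / m)
    return total

def l1(p, q):
    return sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in set(p) | set(q))

p = {0: 0.5, 1: 0.5}
q = {0: 0.5, 1: 0.25, 10**6: 0.25}   # close in l1, very different 75th percentile
d = l1(p, q)
# lower bound, J itself, upper bound (Lemma 4)
print(d ** 2 / (4 * math.log(2)), J(p, q), d / math.log(2))
```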
The reason for choosing the Jensen-Shannon (JS) divergence is that it has two convenient properties: (i) for the necessary part, it becomes easy to quantify how close distributions yield very similar measures on sequences; (ii) for the sufficient part, we bound the JS divergence with the ℓ1 norm in Lemma 4, which in turn lets us work with the ℓ1 topology induced on the class P of distributions. Specifically, we show that if p and q are probability distributions on N, then

    (1/(4 ln 2)) |p − q|_1^2 ≤ J(p, q) ≤ (1/ln 2) |p − q|_1.

3.2 Cumulative Distribution Function

Since we would like to discuss percentiles, it is convenient to use a non-standard definition for the cumulative distribution function of a probability distribution on N. For our purposes, the cumulative distribution function of any probability distribution p on N is a function F_p : R+ ∪ {∞} → [0, 1] defined in an unconventional way. We obtain F_p by first defining F_p on points in the support of p in the way cumulative distribution functions are normally defined. We define F_p for all other nonnegative real numbers by linearly interpolating between the values on the support of p. Finally, F_p(∞) := 1. Let F_p^{-1} : [0, 1] → R+ ∪ {∞} denote the inverse function of F_p. Then F_p^{-1}(x) = 0 for all 0 ≤ x < F_p(0). If p has infinite support then F_p^{-1}(1) = ∞, else F_p^{-1}(1) is the smallest natural number y such that F_p(y) = 1.

Two simple and useful observations can now be made. Consider a probability distribution p with support A ⊆ N. For δ > 0, let (T for tail)

    T_{p,δ} := {y ∈ A : y ≥ F_p^{-1}(1 − δ)},

and let (H for head)

    H_{p,δ} := {y ∈ A : y ≤ 2 F_p^{-1}(1 − δ/2)}.

It is easy to see that

    p(T_{p,δ}) > δ    (3)

and that

    p(H_{p,δ}) > 1 − δ.    (4)

Suppose that for some δ > 0 we have F_p^{-1}(1 − δ) > 0 and the loss dominant at the beginning of period i ≥ 1 happens to be set below F_p^{-1}(1 − δ); then the probability under p of the loss in period i exceeding the loss dominant is bigger than δ. If the loss dominant at the beginning of period i happens to be set to 2 F_p^{-1}(1 − δ/2), then the probability that the loss in period i exceeds the loss dominant is less than δ. We will use these observations in the proofs to follow.

3.3 Condition that is Necessary and Sufficient for Insurability

Existence of close distributions with very different quantiles is what kills insurability. A loss-domination scheme could be deceived by some process p ∈ P∞ into setting low loss dominants, while a close enough distribution hits the scheme with too high a loss. The conditions for insurability of P∞ are phrased in terms of the set of its one dimensional marginals, P. Formally, a distribution p in P is not deceptive if some neighborhood of P around p is tight. For a more complete development of the concept of tightness, see, e.g., Billingsley (1995). Specifically, p is not deceptive if there exists ϵ_p > 0 such that for all δ > 0 there is an f(δ) ∈ R such that all distributions q ∈ P with J(p, q) < ϵ_p satisfy F_q^{-1}(1 − δ) ≤ f(δ). Equivalently, a probability distribution p in P is deceptive if no neighborhood of P around p is tight: for every ϵ > 0 there exists δ > 0 such that, no matter what f(δ) ∈ R+ is chosen, there is a (bad) distribution q ∈ P such that J(p, q) < ϵ and F_q^{-1}(1 − δ) > f(δ).

In the above definition, f(δ) is simply an arbitrary nonnegative real number. However, it is useful to think of this number as the evaluation of a function f : (0, 1) → R at δ. Our main theorem is the following, which we prove in Sections 5 and 6.

Theorem 1 P∞ is insurable iff no p ∈ P is deceptive. □

4. Examples

Consider U, the collection of all uniform distributions over finite supports of the form {m, m + 1, . . .
,M} for all positive integers m and M with m M. Let the sequence of losses be i.i.d. samples from distributions in U call the resulting model class over infinite loss sequences U . Note that no distribution in U is deceptive. Around each distribution in U is a neighborhood that that contains no other distribution of U. Example 3 U is insurable. Proof If the threshold probability of ruin is η, choose the loss-domination scheme Φ as follows. For all sequences xn with n log 1 η + 1 set Φ(xn) = . For all sequences xn with n > log 1 η + 1, the loss dominant Φ(xn) is set to be twice the largest loss observed thus far. It is easy to see that this scheme is bankrupted with probability less than η irrespective of which p U is in effect. 2 Consider the set N 1 of all i.i.d. processes such that the one dimensional marginals have finite first moment. Namely, p N 1 , Ep X < where X N is distributed according to the single letter marginal of p. If N1 is the collection of single letter marginals from N 1 , it is easy to verify as below that every distribution in N1 is deceptive. Santhanam and Anantharam Example 4 N 1 is not insurable. Proof Note that the loss process that puts probability 1 on the all zero sequence exists in N 1 , since it corresponds to the one dimensional marginal loss distribution that produces loss 0 in each period. Since every loss-domination scheme enters with probability 1 no matter which p N 1 is in force, every loss-domination scheme must enter after seeing some finite number of zeros. Fix any loss-domination scheme Φ. Suppose the scheme starts to set finite loss dominants after seeing N losses of size 0. To show that N 1 is not insurable, we show that η > 0 and p N 1 such that p( Φ goes bankrupt ) η. Fix δ = 1 η. Let ϵ be small enough that (1 ϵ)N > 1 δ/2, and let M be a number large enough that (1 ϵ)M < δ/2. Note that since 1 δ/2 δ/2, we have N < M. Let L be greater than any of the loss dominants set by Φ for the sequences 0N, 0N+1, . . . 0M. Let p N 1 satisfy, for all i, ( 1 ϵ if Xi = 0 ϵ if Xi = L. For the i.i.d. loss process having the law p, the insurer is bankrupted on all sequences that contain loss L in between the N-th and M-th steps. These sequences, 0NL, 0N+1L, . . . ,0M 1L, have respective probabilities (under p) (1 ϵ)Nϵ, (1 ϵ)N+1ϵ, . . . , (1 ϵ)M 1, and they also form a prefix free set. Therefore, summing up the geometric series and using the assumptions on ϵ above, p( Φ is bankrupted ) (1 ϵ)N (1 ϵ)M 1 δ/2 δ/2 = η. 2 A reading of the proof above shows that we can say something much stronger. The distributions that break insurability have all their moments finite. Suppose N is the collection of measures each of whose single letter marginal has all moments finite. Namely for all p N and all finite r 0, Ep Xr 1 < . It follows that Example 5 N is not insurable. 2 Consider the collection of all i.i.d. loss distributions with monotone one dimensional marginals. A monotone probability distribution p on N is one that satisfies p(y + 1) p(y) for all y N. Let M be the set of all i.i.d. loss processes, with one dimensional marginal distribution from M, the collection of all monotone probability distributions over N. Again, it is easily shown that every distribution in M is deceptive. It follows from Theorem 1 that Example 6 M is not insurable. Proof To see that any distribution p M is deceptive, consider distributions of form p = (1 ϵ)p+ϵq, where q U is a monotone uniform distribution and ϵ > 0. 
Clearly, the ℓ1 distance between p and q is 2ϵ (and therefore so is the JS divergence, up to a constant factor). But for all M > 0 and δ < ϵ, we can pick q U over a sufficiently large support that the 1 δ percentile of p can be made M. Therefore, no neighborhood around p is tight. 2 Insurability Note also that if p has finite entropy, so does every p obtained by the above construction. Let M M be the collection of finite entropy monotone distributions. The above example also implies that Example 7 M is not insurable. 2 Now for h > 0, we consider the set Mh M of all monotone distributions over N whose entropy is bounded above by h. Let M h be the set of all i.i.d. loss processes with one dimensional marginals from Mh. Then Example 8 M h is insurable. Proof From Markov inequality, if p Mh and X p, p(X > M) = p(log(X + 1) > log(M + 1)) < Ep log(X + 1) log(M + 1) Ep log 1 p(X) log(M + 1) h log(M + 1). To see the second inequality above, note that p is monotone therefore for i N, p(i) 1 i+1. Therefore, for all p Mh, F 1 p (1 δ) 2 h δ . Thus no p Mh is deceptive, and M h is insurable. 2 In the class U above, there was a neighborhood around each distribution p U with no other model from U, Hence U trivially satisfied the local tightness condition that we will prove is necessary and sufficient for insurability. The above case is another extreme the entire model class Mh is tight. The following example illustrates a insurable class of models where neither extreme holds. For a distribution q over N, let q(R)(i + R) = q(i) for all i N. Furthermore let the span of any finite support probability distribution over naturals be the largest natural number which has non-zero probability. Then, let Fh = n (1 ϵ)p1 + ϵp(span(p1)+1) 2 : p1 U, p2 Mh and 1 > ϵ > 0 o . As always F h is the set of measures on infinite sequences obtained by i.i.d. sampling from distributions in Fh. Example 9 F h is insurable. Proof Let the base of any probability distribution over the naturals be the smallest natural number which has non-zero probability. Consider any distribution p = (1 ϵ)p1 + ϵp(span(p1)+1) 2 Fh with p1 U, p2 Mh, and 1 > ϵ > 0. Let m denote base(p), and m + M 1 denote span(p1). Thus |support(p1)| = M, and we have M 1. Of course, we also have base(p1) = base(p). Consider any other distribution q = (1 ϵ )q1 + ϵ q(span(q1)+1) 2 Fh, where q1 U, q2 Mh, and 1 > ϵ > 0. Let m denote base(q) (which equals base(q1)) and let m + M 1 denote span(q1), so |support(q1)| = M , and we have M 1. It suffices to show that there is an ℓ1 ball around p of sufficiently small radius, such that for all δ > 0 we can find a uniform bound on the (1 δ)-th percentile of all q in this ball. If m > m, then the ℓ1 distance between p and q is at least 1 ϵ M . Hence, whenever the ℓ1 distance between p and q is strictly less than 1 ϵ M we must have m m. Thus we may assume that m m. Santhanam and Anantharam Suppose m + M 1 m + 2M 1 ϵ. Then M 2 , from which, because support(p1) support(q1), we can conclude that the ℓ1 distance between q and p is at least 1 ϵ 2 . Thus we may assume that m + M 1 < m + 2M 1 ϵ. Now, for any i 0, we have q(m + M + i) = ϵ q2(i) ϵ 1 i + 1 1 i + 1 . Thus for any K 0 we have q(X > m + 2M 1 ϵ + K) q(X > m + M + K) h log(K + 1) by an argument similar to that in the preceding example, which gives the desired conclusion that no p F h is deceptive, and hence that F h is insurable. 2 5. Necessary Condition for Insurability Theorem 2 shows that lack of deceptive distributions in P is necessary for insurability of P . 
Theorem 2 If P is insurable, then no p P is deceptive. Proof To keep notation simple, we will denote by p (or q) both a measure in P as well as the corresponding one dimensional marginal distribution, which is a member of P. The context will clarify which of the two is meant. We prove the contrapositive of the theorem: if some p P is deceptive, then P is not insurable. Pick α > 0. Suppose p P is deceptive. Let Φ be any loss-domination scheme. Recall that Φ enters on p with probability 1, in the sense that the loss dominants set by Φ will eventually become finite with probability 1 under p. For all n 1, let Rn := {xn : Φ(xn) < } be the set of sequences of length n on which Φ has entered and let N 1 be a number such that p(RN) > 1 α/2. (5) Fix 0 < η < 1 N )(1 1/e2). We prove that P is not insurable by finding, for each lossdomination scheme Φ, a probability distribution q P such that q( Φ goes bankrupt ) η. The basic idea is that because Φ has to enter with probability 1 under p, it would have been forced to set premiums that are too low for q. For any sequence xn, let A(xn) be the set of symbols that appear in it. Recall that the head of the distribution p, Hp,γ, was defined in Section 3.2 to be the set {y Ap : y 2F 1 p (1 γ/2)}, where Ap is the support of p. Further, define for all γ > 0 Rp,γ,n := {xn Rn : A(xn) Hp,γ)}. Given δ > 0, pick γp(δ) so small that (1 γp(δ))N+1/δ 1 α/2. (6) Insurability A word about this parameter γp(δ), since it may not be immediately apparent why this should be defined. The advantage of doing so is technical we will be able to handle p (and q which will be chosen later) as though they were distributions with finite span. Set1 ϵ = 1 16(ln 2)N8 . Applying Lemma 5 to distributions over length-N sequences induced by the distributions p, q P such that J (p, q) ϵ, we have for each δ > 0 that q(XN Rp,γp(δ),N) 1 α 2 This implies that Φ has entered with probability (under q) at least 1 α 2 N for length N sequences. We will find a suitable δ > 0 and a suitable q such that J (p, q) ϵ, and such that the scheme Φ is bankrupted with probability > η. Since p is deceptive, there exists δ > 0 such that sup q P: J (p,q)<ϵ F 1 q (1 δ ) = . Define p,ϵ = {δ : sup q:J (p,q)<ϵ F 1 q (1 δ ) = }. In connection with this definition, note that if the δ tails of distributions in the ϵ neighborhood of p are not bounded, neither are the δ tails for all δ < δ . Furthermore if sup p,ϵ 1/2, we are done since for some δ 1/2, there is a q satisfying J (p, q) ϵ and F 1 q (1 δ) max x N Rp,γp,N Φ(x N). Therefore, conditioned on the event {XN Rp,γp(δ),N}, this q will be bankrupted with probability 1/2. From (7) above we thus have q( Φ goes bankrupt ) 1 α 2 If not, we have p,ϵ < 1/2. Pick δ and r = 2δ such that δ p,ϵ, but r / p,ϵ. For convenience, let M = 1 δ . We consider now a set S of strings of lengths ranging from N to N + M defined by the following properties: every string in S has its prefix of length N belonging to Rp,γp(δ),N; every string of length k in S, N + 1 k N + M, has all its symbols at times N + 1 through to k belonging to Hq ,2r for some q P such that J (p, q ) < ϵ. To clarify this definition, we recall once again that Hq ,2r is the {y Aq : y 2F 1 q (1 r)}, where Aq denotes the support of q . Note that since r / p,ϵ, S is finite. Again, we pick q satisfying J (p, q) ϵ and F 1 q (1 δ) max xk S N k N+M 1 Φ(xk). 1. 
Please note that in the interest of simplicity, we have not attempted to provide the best scaling for ϵ or the tightest possible bounds in arguments below Santhanam and Anantharam Therefore if a symbol from Tq,δ follows any string in S, the scheme goes bankrupt under q. There may be symbols in the complement of Tq,δ that also bankrupt the scheme if they follow a string in S. Taking a different perspective, it could happen that no string in S contains symbols in Tq,δ, or that strings S contain symbols from Tq,δ. We consider both these variations below. Let q1 be the probability (under q) of all sequences in Rp,γp,N under which the scheme Φ has not yet been bankrupted, and let q2 be the probability (under q) of all sequences in Rp,γp,N where Φ has already been bankrupted. Therefore q1 + q2 = q(Rp,γp,N). Define a sequence in S to be a survivor if the loss-domination scheme Φ has not yet been bankrupted on this sequence under q. Thus, for instance, q2 is the probability, under q, of survivor sequences of length N. To continue, we need to consider two cases: Case (i). Strings in S do not contain symbols of Tq,δ (when supq :J (p,q )<ϵ 2F 1 q (1 r) F 1 q (1 δ)). In this case, Hq,2r is contained in the complement of Tq,δ and we show the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ . Given that the sequence seen so far is a survivor sequence, one can classify the next symbol in one of four ways: (a) it is in Tq,δ (which automatically implies bankruptcy); (b) it is in the complement of Tq,δ Hq,2r which we ignore; (c) it is in Hq,2r and results in bankruptcy; (d) it is in Hq,2r and does not result in bankruptcy. We ignore case (b) since we are only interested in a lower bound. In case (c) the contribution to the conditional probability of bankruptcy given a survivor sequence is 1, but we lower bound it for survivor sequences of length N + l (0 l M 1) by the running sum δ + (1 r)δ + . . . + (1 r)M l 1δ . This has the effect that the lower bound looks as though symbols in Hq,2r never contribute to bankruptcy (and hence always lead to survivor sequences). Case (ii). Strings in S could contain symbols of Tq,δ (when supq :J (p,q )<ϵ 2F 1 q (1 r) > F 1 q (1 δ).) We show here that the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 δ)δ + . . . + (1 δ)Mδ . This can be seen, as in the preceding case, by classifying the next symbol following a survivor sequence into three types: (a) it is in Tq,δ (which implies bankruptcy); (b) it is the complement of Tq,δ and results in bankruptcy; (c) it is the complement of Tq,δ and does not result in bankruptcy. In case (b), the conditional probability of bankruptcy given a survivor sequence of length N +l (0 l M 1) is 1, but, as before we lower bound the contribution to by the running sum δ + (1 δ)δ + . . . + (1 δ)M l 1δ . Similar to the prior case, this has the effect that the lower bound looks as though symbols in the complement of Tq,δ never contribute to bankruptcy (and hence always lead to survivor sequences). However, we have 1 δ 1 r, so once again in case (ii) the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ . Insurability Thus we see that irrespective of which case is in force, under q the loss-domination scheme Φ is bankrupted with probability at least q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ 1 (1 2δ) 1/δ 2q(Rp,γp,N) 1 (1 2δ) 1/δ The theorem follows. 2 6. 
Sufficient Condition for Insurability When no p P is deceptive, for any η > 0 Theorem 3 constructs a loss-domination scheme that goes bankrupt with probability η. If no p P is deceptive, there is for each p P a number ϵp > 0 such that, for every percentile δ > 0, there is a uniform bound on the δ-percentile over the set of probability distributions in the neighborhood {p P : J (p , p) < ϵp}, . We pick such an ϵp for each p P and call it the reach of p. For p P, the set Bp = {p P : J (p, p ) < ϵp}, where ϵp is the reach of p, will play the role of the set of probability distributions in P for which it will be okay to eventually set loss dominants assuming p is in force. To prove that P is insurable if no distribution among its one dimensional marginals P is deceptive, we will need to find a way to cover P with countably many sets of the form Bp above. Unfortunately, J (p, q) is not a metric, so it is not immediately clear how to go about doing this. On the other hand note that J (p , p) |p p |1/ ln 2, where |p p |1 denotes the ℓ1 distance between p and p (see Lemma 4 in the Appendix). Therefore, we can instead bootstrap offan understanding of the topology induced on P by the ℓ1 metric. 6.1 Topology of P with ℓ1 Metric The topology induced on P by the ℓ1 metric is Lindel of, i.e. any covering of P with open sets in the ℓ1 topology has a countable subcover. See e.g., (Dugundji, 1970, Defn. 6.4) for definitions and properties of Lindel of topological spaces. We can show that P with the ℓ1 topology is Lindel of by appealing to the fact that the set of all probability distributions on N with the ℓ1 topology, is second countable, i.e. that it has a countable basis. The set of all distributions on N along with ℓ1 topology has a countable basis because it has a countable norm-dense set (consider the set of all probability distributions on N with finite support and with all probabilities being rational). Now, P, as a topological subspace of a second countable topological space is also second countable (Dugundji, 1970, Theorem 6.2(2)). Finally, every second countable topological space is Lindel of (Dugundji, 1970, Thm. 6.3), hence P is Lindel of. Santhanam and Anantharam 6.2 Sufficient Condition We now have the machinery required to prove that if no p P is deceptive, then P is insurable, which is the other direction of Theorem 1, as stated next. Theorem 3 If no p P is deceptive, then P is insurable. Proof The proof is constructive. For any 0 < η < 1, we obtain a loss-domination scheme Φ such that for all p P , p(Φ goes bankrupt ) < η. For p P, let Qp = q : |p q|1 < ϵp2(ln 2)2 where ϵp is the reach of p. We will call Qp as the zone of p. The set Qp is non-empty when ϵp > 0. For large enough n, the set of loss sequences of length n with empirical distribution in Qp will ensure that the loss-domination scheme Φ to be proposed enters with probability 1 when p is in force. Note that if ϵp > 0 is small enough then Qp P Bp we will assume wolog that ϵp > 0 is always taken so that Qp P Bp. Since no p P is deceptive, none of the zones Qp are empty and the space P of distributions can be covered by the sets Qp P, namely P = p P(Qp P). From Section 6.1, we know that P is Lindel of under the ℓ1 topology. Thus, there is a countable set P P, such that P is covered by the collection of relatively open sets {Q p P : p P}. We let the above collection be denoted by Q P. 
We will refer to P as the quantization of P and to elements of P as centroids of the quantization, borrowing from commonly used literature in classification. We index the countable set of centroids, P (and reuse the index for the corresponding elements of Q P) by ι : P N. We now describe the loss-domination scheme Φ having the property that for all p P , p(Φ goes bankrupt ) < η. Preliminaries. Consider a length-n sequence xn on which Φ has not entered thus far. Let the empirical distribution of the sequence be q, and let P q := {p P : q Qp } be the set of centroids in the quantization of P (elements of P) which can potentially capture q. Note that q in general need not belong to P or P. If P q = , we will further refine the set of distributions that could capture q further to Pq P q as described below. Refining P q to Pq ensures that models in P q do not prematurely capture loss sequences. Let p be the model in force, which remains unknown. The idea is that we want sequences generated by (unknown) p to be captured by those centroids of the quantization P that have p in their reach. We will require (8) below to ensure that the probability (under the unknown p) of all sequences that may Insurability get captured by centroids p Pq not having p in its reach remains small. In addition, we impose (9) as well to resolve a technical issue since q need not, in general, belong to P. For p P q, let the reach of p be ϵp , and define Dp := ϵp 4(ln 2)4 In case the underlying distribution p happens to be out of the reach of p (wrong capture), the quantity Dp will later lower bound the distance of the empirical q in question from the underlying p. Specifically, we place p in Pq if n satisfies exp n Dp /18 η 2C(p )ι(p )2n(n + 1), (8) and 2F 1 q (1 q Dp /6) log C(p ), (9) where C(p ) is C(p ) := 2 2 supr Bp F 1 r (1 q Note that C(p ) is finite since p is not deceptive. Comparison with Lemma 7 will give a hint as to why the equations above look the way they do. Description of Φ. For the sequence xn with type q, if Pq = , the scheme does not enter yet. If Pq = , let pq denote the distribution in Pq with the smallest index. All sequences with prefix xn (namely sequences obtained by concatenating xn with by any other sequence of symbols) are then said to be trapped by pq namely, loss dominants will be based on pq. The loss dominant assigned for a length-m sequence trapped by pq is η 4n(n + 1) := 2 sup r Bpq F 1 r 1 η 4n(n + 1) Φ enters with probability 1. First, we verify that the scheme enters with probability 1, no matter what distribution p P is in force. Every distribution p P is contained in at least one of the elements of the cover Q P. Recall the enumeration of P. Let p be centroid with the smallest index among all centroids in P whose zones contain p. Let Q be the zone of p . There is thus some γ > 0 such that the neighborhood around p given by I(p, γ) := {q : |p q|1 < γ} satisfies I(p, γ) Q. Note in particular that p is in the reach of p . With probability 1, sequences generated by p will have their empirical distribution within I(p, γ). For the proof of this well known result in the countably infinite alphabet case, see Chung (1961) or for an alternate approach, see Lemma 7. Next (8) will hold for all sequences whose empirical distributions that fall in I(p, γ) whose length n is large enough since C(p ) and ι(p ) do not change with n, the right hand side diminishes to zero polynomially with n while the left hand side diminishes exponentially to zero. 
Thus we conclude (8) will be satisfied with probability 1. Santhanam and Anantharam Next, (9) will also hold almost surely, for if q is the empirical probability of sequences generated by p, then (with a little abuse of notation) Dp /6) F 1 p (1 q with probability 1. Note that the quantity on the left is actually a random variable that is sequence dependent (since q is the empirical distribution of the sequence). Furthermore, we also have 2F 1 p (1 q sup r Bp F 1 r (1 q = log C(p ), where the first inequality follows since p is in the reach of p . Thus the scheme enters with probability 1 no matter which p P is in force. Probability of bankruptcy η. We now analyze the scheme. Consider any p P. Among sequences on which Φ has entered, we will distinguish between those that are in good traps and those in bad traps. If a sequence xn is trapped by p such that p Bp , p is a good trap. Conversely, if p / Bp , p is a bad trap. (Good traps) Suppose a length-n sequence xn is in a good trap, namely, it is trapped by a distribution p such that p Bp . Recall that the loss dominant assigned is 2gp η 4n(n + 1) 1 η 4n(n + 1) where the inequality follows because p is not deceptive, and p is within the reach of p . Therefore from (4), given any sequence in a good trap the scheme is bankrupted with conditional probability at most δ = η/2n(n + 1) in the next step. Therefore, summing over all n, sequences in good traps contribute at most η/2 to the probability of bankruptcy. (Bad traps) We will show that the probability with which sequences generated by p fall into bad traps η/2. Pessimistically, the conditional probability of bankruptcy in the very next step given a sequence falls into a bad trap is going to be bounded above by 1. Thus the contribution to bankruptcy by sequences in bad traps is at most η/2. Let q be any length-n empirical distribution trapped by p with reach ϵ such that p / B p. If p is far from p (because p is not in p s reach), namely J ( p, p) ϵ, but q is close to p (because q has to be in p s zone to be captured by it), namely | p q|1 < ϵ2(ln 2)2 then we would like q to be far from p. That is exactly what we obtain from the triangle-inequality like Lemma 6, namely that J (p, q) ϵ2 ln 2 Insurability and hence, for all q trapped by p that |p q|2 1 J 2(p, q)(ln 2)2 ϵ4(ln 2)4 256 = D2 p. We need not be concerned that the right side above depends on p, and there may be actually no way to lower bound the rhs as a function of just p. Rather, we take care of this issue by setting the entry point appropriately via (8). Thus, for p P , the probability length-n sequences with empirical distribution q is trapped by a bad p is, using (8) and (9) p |q p|2 D p and 2F 1 q (1 6 ) log C( p) (a) (C( p) 2) exp n D p (b) η(C( p) 2) 2C( p)ι( p)2n(n + 1) η 2ι( p)2n(n + 1), where the inequality (a) follows from Lemma 7 and (b) from (8). Therefore, the probability of sequences falling into bad traps η 2ι( p)2n(n + 1) η/2 since P p P 1 ι( p)2 P n 1 1 n(n+1) = 1. The theorem follows. 2 7. Concluding Remarks 7.1 Observations We make a few observations about insurability that, while evident from the proofs and approaches we have taken, are interesting in themselves and worth highlighting. Finite unions of insurable classes. The first is relative simple finite unions of insurable model classes are insurable in themselves. If P1, . . . ,Pm are m insurable model collections, then m i=1Pi is insurable. Countable unions of insurable classes. 
7. Concluding Remarks

7.1 Observations

We make a few observations about insurability that, while evident from the proofs and approaches we have taken, are interesting in themselves and worth highlighting.

Finite unions of insurable classes. The first is relatively simple: finite unions of insurable model classes are themselves insurable. If P_1, . . . , P_m are m insurable model collections, then ∪_{i=1}^m P_i is insurable.

Countable unions of insurable classes. The union of countably infinitely many classes of models, each of which is insurable, need not be insurable. As we have seen, the collection M_h of monotone distributions with entropy h is insurable for all h ∈ N. However, the collection M = ∪_{h ∈ N} M_h is not insurable.

Countable unions of tight sets. Note that from our arguments while proving the sufficiency criterion, it follows that every insurable model class is a countable union of tight sets. The converse, however, is not true. Note that M_h is a tight set for any h > 0, yet M = ∪_{h ∈ N} M_h is not insurable.

7.2 General Remarks

The loss-domination problem formulated and solved in this paper appears to be of natural interest. However, there are several features of the insurance problem formulated here that might appear troubling even to the casual reader.

In practice, an insured party entering into an insurance contract would expect some stability in the premiums that are expected to be paid. A natural direction for further research is therefore to study how the notion of insurability of a model class changes when one imposes restrictions on how much the premium set by the insurer can vary from period to period.

Another obvious shortcoming of the formulation of the insurance problem studied here is the assumption that the insured will accept any contract issued by the insurer. Since the insured in our model represents an aggregate of individual insured parties, a natural direction in which to make the framework more realistic would be to think of the insured parties as being of different types. This would in effect make the total realized premium from the insured (the aggregate of the insured parties) and the distribution of the realized loss in each period a function of the size of the premium per insured party set by the insurer in that period. Characterizing which model classes are insurable when the realized premium and the realized loss are functions of a set premium per insured party would be of considerable interest.

Both for the loss-domination problem and for the insurance problem, working with model classes for the loss sequence that allow for dependencies in the loss from period to period, for instance Markovian dependencies, would be another interesting direction for further research. Considering models with multiple, possibly competing insurers, as well as considering an insurer operating in multiple markets, where losses in one market can be offset by gains in another, also seem to be useful directions to investigate.

Acknowledgments

We are very grateful to the anonymous reviewers whose insightful comments have made this paper much better. We thank C. Nair (Chinese Univ of Hong Kong) and K. Viswanathan (HP Labs) for helpful discussions. N. Santhanam was supported by NSF Grants CCF-1065632, CCF-1018984 and EECS-1029081. V. Anantharam was supported by the ARO MURI grant W911NF-08-1-0233, Tools for the Analysis and Design of Complex Multi-Scale Networks, the NSF grants CNS-0910702 and EECS-1343398, the NSF Science & Technology Center grant CCF-0939370, Science of Information, Marvell Semiconductor Inc., and the U.C. Discovery program.

Lemma 4 Let p and q be probability distributions on N. Then

(1/(4 ln 2)) |p − q|_1² ≤ J(p, q) ≤ (1/ln 2) |p − q|_1.

If, in addition, r is a probability distribution on N, then

J(p, q) + J(q, r) ≥ (ln 2 / 8) J²(p, r).

Proof The lower bound in the first statement follows from Pinsker's inequality on KL divergences,

D(p ‖ (p+q)/2) ≥ (1/(2 ln 2)) |p − (p+q)/2|_1² = (1/(2 ln 2)) (1/4) |p − q|_1²,

and similarly for D(q ‖ (p+q)/2); summing the two bounds gives the claimed lower bound on J(p, q). See, e.g., Cover and Thomas (1991) for more details about Pinsker's inequality.

Since ln(1 + z) ≤ z for all z ≥ 0, the upper bound in the first statement follows as below:

J(p, q) ≤ (1/ln 2) Σ_{x : p(x) ≥ q(x)} p(x) (p(x) − q(x)) / (p(x) + q(x)) + (1/ln 2) Σ_{x′ : q(x′) ≥ p(x′)} q(x′) (q(x′) − p(x′)) / (p(x′) + q(x′)) ≤ (1/ln 2) |p − q|_1.

To prove the triangle-like inequality, note that

J(p, q) + J(q, r) ≥ (1/(4 ln 2)) (|p − q|_1² + |q − r|_1²) ≥ (1/(8 ln 2)) (|p − q|_1 + |q − r|_1)² ≥ (1/(8 ln 2)) |p − r|_1² ≥ (ln 2 / 8) J²(p, r),

where the last inequality follows from the upper bound on J(p, r) already proved. □
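As a numerical sanity check on Lemma 4 (not part of the paper), the snippet below evaluates the three inequalities on randomly drawn finite-support distributions, taking J(p, q) to be D(p ‖ (p+q)/2) + D(q ‖ (p+q)/2) with base-2 logarithms, which is the form the Pinsker step above suggests.

```python
import math
import random

def kl_bits(a, b):
    # KL divergence D(a || b) in bits; here b(x) > 0 wherever a(x) > 0 by construction
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def J(p, q):
    # Assumed form of J: D(p || (p+q)/2) + D(q || (p+q)/2) with base-2 logarithms
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return kl_bits(p, m) + kl_bits(q, m)

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q, r = (random_dist(8) for _ in range(3))
    d = l1(p, q)
    assert d ** 2 / (4 * math.log(2)) <= J(p, q) + 1e-9        # lower bound in Lemma 4
    assert J(p, q) <= d / math.log(2) + 1e-9                   # upper bound in Lemma 4
    assert J(p, q) + J(q, r) + 1e-9 >= (math.log(2) / 8) * J(p, r) ** 2   # triangle-like bound
print("Lemma 4 inequalities held on all random trials")
```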
Lemma 5 Let p and q be probability distributions on a countable set A with J(p, q) ≤ ε. Let p^N and q^N be the distributions over A^N obtained by i.i.d. sampling from p and q respectively (the distributions induced by the product measure). For any R_N ⊆ A^N and α > 0, if p^N(R_N) ≥ 1 − α, then

q^N(R_N) ≥ (1 − α − 2N³ √(4 ε ln 2)) (1 − 1/N).

Proof Let

B_1 = {i ∈ A : q(i) ≤ p(i)(1 − 1/N²)} and B_2 = {i ∈ A : p(i) ≤ q(i)(1 − 1/N²)}.

If J(p, q) ≤ ε, then from Lemma 4 we have

√(4 ε ln 2) ≥ √(4 J(p, q) ln 2) ≥ |p − q|_1.

It can then be easily seen that

p(B_1 ∪ B_2) ≤ 2N² √(4 ε ln 2) and q(B_1 ∪ B_2) ≤ 2N² √(4 ε ln 2),    (10)

since Σ_{x ∈ B_1} (p(x) − q(x)) ≥ p(B_1)/N², and similarly N² |p − q|_1 ≥ q(B_2) ≥ p(B_2).

Let S = A \ (B_1 ∪ B_2). We have for all x ∈ S,

q(x) ≥ p(x)(1 − 1/N²),    (11)

and from (10) we have p(S) ≥ 1 − 2N² √(4 ε ln 2). Now, we focus on the set S^N ⊆ A^N containing all length-N strings of symbols from S. Clearly

p^N(S^N) ≥ 1 − 2N³ √(4 ε ln 2).

Thus we have

p^N(R_N ∩ S^N) ≥ 1 − 2N³ √(4 ε ln 2) − α.

From (11), for all x^N ∈ S^N,

q^N(x^N) ≥ p^N(x^N)(1 − 1/N²)^N ≥ p^N(x^N)(1 − 1/N),

so that

q^N(R_N) ≥ q^N(R_N ∩ S^N) ≥ (1 − 2N³ √(4 ε ln 2) − α)(1 − 1/N). □

Lemma 6 Let ε_0 > 0. If |p_0 − q|_1 ≤ ε_0² (ln 2)²/16, then for all p ∈ P with J(p, p_0) ≥ ε_0, we have J(p, q) ≥ ε_0² ln 2/16.

Proof Since |p_0 − q|_1 ≤ ε_0² (ln 2)²/16, Lemma 4 implies that J(p_0, q) ≤ ε_0² ln 2/16. Further, Lemma 4 then implies that

J(p, q) + ε_0² ln 2/16 ≥ J(p, q) + J(p_0, q) ≥ J²(p, p_0) ln 2/8 ≥ ε_0² ln 2/8,

where the last inequality follows since J(p, p_0) ≥ ε_0. □

Lemma 7 Let p be any probability distribution on N. Let δ > 0 and let k ≥ 2 be an integer. Let X_1^n be a sequence generated i.i.d. with marginals p, and let q(X^n) be the empirical distribution of X_1^n. Then

p(|q(X^n) − p|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k) ≤ (2^k − 2) exp(−n δ²/18).

Remark There is a lemma that looks somewhat similar in Ho and Yeung (2010). The difference from Ho and Yeung (2010) is that the right side of the inequality above does not depend on p, and this property is crucial for its use here. □

Proof The starting point is the following result. Suppose p̃ is a probability distribution on N with finite support of size L. Then from Weissman et al. (2005), if we consider length-n sequences,

p̃(|q(X^n) − p̃|_1 ≤ t) ≥ 1 − (2^L − 2) exp(−n t²/2).    (12)

Since k ≥ 2, consider the distributions p′ and q′ with support A = {1, . . . , k − 1} ∪ {−1}, obtained as

p′(i) = p(i) for 1 ≤ i < k, and p′(−1) = Σ_{j ≥ k} p(j),

and similarly for q′. From (12),

p(|p′ − q′|_1 > δ/3) ≤ (2^k − 2) exp(−n δ²/18).

We will see that all sequences generated by p with empirical distributions q satisfying |p − q|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k are now mapped into sequences generated by p′ with empirical q′ satisfying

|p′ − q′|_1 > δ/3 and q′(−1) ≤ δ/3.    (13)

Thus, we will have

p(|q(X^n) − p|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k) ≤ p(|p′ − q′|_1 > δ/3 and q′(−1) ≤ δ/3) ≤ (2^k − 2) exp(−n δ²/18).

Finally we observe (13) as in Ho and Yeung (2010):

|p − q|_1 = Σ_{l=1}^{k−1} |p(l) − q(l)| + Σ_{j ≥ k} |p(j) − q(j)| ≤ Σ_{l=1}^{k−1} |p(l) − q(l)| + Σ_{j ≥ k} (p(j) + q(j)) ≤ Σ_{l=1}^{k−1} |p(l) − q(l)| + |p′(−1) − q′(−1)| + 2δ/3,

where the last inequality above follows from (4). Since p(l) = p′(l) and q(l) = q′(l) for all l = 1, . . . , k − 1, we have |p′ − q′|_1 ≥ |p − q|_1 − 2δ/3. If |p − q|_1 > δ in addition, then |p′ − q′|_1 > δ/3. □
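The heart of the proof of Lemma 7 is the map that lumps all mass at or above k into one extra symbol so that the finite-support bound (12) applies. The sketch below (illustrative only, with hypothetical toy numbers and the symbol −1 standing for the lumped tail, as in the proof) shows the map and the ℓ1 comparison behind (13).

```python
from collections import Counter

def collapse(dist, k):
    # Lump all probability mass at symbols >= k into a single extra symbol, written -1 here,
    # mirroring the construction of p' and q' in the proof of Lemma 7.
    out = {i: v for i, v in dist.items() if i < k}
    out[-1] = sum(v for i, v in dist.items() if i >= k)
    return out

def l1(a, b):
    support = set(a) | set(b)
    return sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in support)

# Toy example: a (normalized) geometric-like p, and an empirical q from a few samples.
raw = {i: 0.5 ** i for i in range(1, 30)}
Z = sum(raw.values())
p = {i: v / Z for i, v in raw.items()}
samples = [1, 1, 2, 3, 1, 2, 4, 1, 6, 2, 1, 3]
q = {i: c / len(samples) for i, c in Counter(samples).items()}

k = 6
p_k, q_k = collapse(p, k), collapse(q, k)
# Collapsing can shrink the l1 distance by at most twice the smaller lumped tail mass,
# which is the step giving |p' - q'|_1 >= |p - q|_1 - 2*delta/3 in the proof.
print(l1(p, q), l1(p_k, q_k), q_k[-1])
assert l1(p_k, q_k) >= l1(p, q) - 2 * min(p_k[-1], q_k[-1]) - 1e-12
```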
References

M. Asadi, R. Paravi, and N. Santhanam. Stationary and transition probabilities in slow mixing, long memory Markov processes. IEEE Transactions on Information Theory, September 2014.

S. Asmussen and H. Albrecher. Ruin Probabilities. World Scientific Publishing Company, 2nd edition, 2010.

P. Billingsley. Probability and Measure. Wiley Interscience, 1995.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.

K.L. Chung. A note on the ergodic theorem of information theory. Annals of Mathematical Statistics, 32:612–614, 1961.

T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.

H. Cramér. Historical review of Filip Lundberg's work on risk theory. Skandinavisk Aktuarietidskrift (Suppl.), 52:6–12, 1969. Reprinted in The Collected Works of Harald Cramér, edited by Anders Martin-Löf, 2 volumes, Springer, 1994.

J. Dugundji. Topology. Allyn and Bacon Inc., Boston, 1970.

K. Englund and A. Martin-Löf. Statisticians of the Centuries, chapter Ernst Filip Oskar Lundberg, pages 308–311. New York: Springer, 2001.

B. Fittingoff. Universal methods of coding for the case of unknown statistics. In Proceedings of the 5th Symposium on Information Theory, pages 129–135. Moscow-Gorky, 1972.

P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

S. Ho and R. Yeung. On information divergence measures and joint typicality. IEEE Transactions on Information Theory, 56(12):5893–5905, 2010.

J.C. Kieffer. A unified approach to weak universal source coding. IEEE Transactions on Information Theory, 24(6):674–682, November 1978.

D. McAllester. A PAC-Bayesian tutorial with a dropout bound. Available from arXiv, id 1307.2118, 2013.

J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, July 1984.

B. Ryabko. Compression-based methods for non-parametric online prediction, regression, classification and density estimation. Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday, pages 271–288, 2008.

N. Santhanam and V. Anantharam. Agnostic insurance tasks and their relation to compression. In International Conference on Signal Processing and Communications (SPCOM), 2012.

Y.M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, 1987.

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. Weinberger. Universal discrete denoising: known channel. IEEE Transactions on Information Theory, 51(1):5–28, 2005. See also HP Labs Tech Report HPL-2003-29, Feb 2003.