# Agnostic Insurability of Model Classes

Journal of Machine Learning Research 16 (2015) 2329-2355. Submitted 4/13; Revised 2/14; Published 8/15.

Narayana Santhanam (nsanthan@hawaii.edu), Dept. of Electrical Engineering, University of Hawaii at Manoa, Honolulu, HI 96822

Venkat Anantharam (ananth@eecs.berkeley.edu), EECS Dept., University of California, Berkeley, Berkeley, CA 94720

Editor: Gabor Lugosi

© 2015 Narayana Santhanam and Venkat Anantharam.

Abstract

Motivated by problems in insurance, our task is to predict finite upper bounds on a future draw from an unknown distribution p over the natural numbers. We can only use past observations generated independently and identically distributed (i.i.d.) according to p. While p is unknown, it is known to belong to a given collection P of probability distributions on the natural numbers. The support of the distributions p ∈ P may be unbounded, and the prediction game goes on for infinitely many draws. We are allowed to make observations without predicting upper bounds for some time. But we must, with probability 1, start and then continue to predict upper bounds after a finite time, irrespective of which p ∈ P governs the data. If it is possible, without knowledge of p and for any prescribed confidence however close to 1, to come up with a sequence of upper bounds that is never violated over an infinite time window with confidence at least as big as prescribed, we say the model class P is insurable. We completely characterize the insurability of any class P of distributions over the natural numbers by means of a condition on the neighborhoods of distributions in P, one that is both necessary and sufficient.

Keywords: insurance, ℓ1 topology of probability distributions over countable sets, non-parametric approaches, prediction of quantiles of distributions, universal compression

1. Introduction

Insurance is a means of managing risk by transferring a potential sequence of losses to an insurer for a price paid on a regular basis, the premium. The insurer attempts to break even by balancing the possible loss that may be suffered by a few with the guaranteed premiums of many. We aim to study the fundamentals of this problem when the losses can be unbounded and a precise model for the probability distribution of the aggregate loss in each period either does not exist or is infeasible to obtain.

A systematic, theoretical, as opposed to empirical, study of insurance goes back to 1903, when Filip Lundberg defined a natural probabilistic setting as part of his thesis; see, e.g., the chapter on Lundberg in Englund and Martin-Löf (2001). In particular, Lundberg formulated a collective risk problem, pooling together the risk of all the insured parties into a single entity, which we call the insured. A substantial part of statistical work on insurance depends on working with specific models for the loss distribution, e.g., compound Poisson models, after which questions of interest in practice, such as the relation between the size of the premiums charged and the probability of the insurer going bankrupt, can be analyzed. A rather comprehensive theory of insurance along these lines has evolved; see Cramér (1969) and, more recently, Asmussen and Albrecher (2010). This theory is able to incorporate several model classes for the distribution of the losses over time other than compound Poisson processes, including some heavy tailed distribution classes. We will outline our approach in the next subsection.
In addition, from a learning theory perspective, our formulation is positioned in between the notions of uniform and pointwise consistency as we outline in subsection 1.2. Finally, in subsection 1.3 we compare our approach to the closely connected notions of universal compression and regret. 1.1 Approach Our approach departs from the existing literature on insurance in two important respects. No upper bound on loss. The first departure relates to the practice among insurers to limit payments to a predetermined ceiling, even if the loss suffered by the insured exceeds this ceiling. In both the insurance industry and the legal regulatory framework surrounding it, this is assumed to be common sense. But is it always necessary to impose such ceilings? Moreover, in scenarios such as reinsurance, a ceiling on compensation is not only undesirable, but may also limit the very utility of the business. As we will see, we may be able to handle scenarios where the loss can be unbounded. Universal approach. The second aspect of our approach arises from our motivation to deal with several new settings for which some sort of insurance is desirable, but where insurers are hesitant to enter the market due to lack of sufficient data. Examples of such settings include insuring against network outages or attacks against future smart grids, where the cascade effect of outages or attacks could be catastrophic. In these settings, it is not clear today what should constitute a reasonable risk model because of the absence of usable information about what might cause the outages or motivate the attacks. We address the second issue by working with a class of models, i.e., a set of probability laws over loss sequences that adheres to any assumptions the insurer may want to make or any information it may already have. In this paper we will only consider loss models that are independent and identically distributed (i.i.d.) from period to period, so we can equivalently think of a model class as defined in terms of its one dimensional marginals. As an example, we may want to consider the set of all finite moment probability distributions over the nonnegative integers as our class of possible models for the loss distribution in each period. Now, we ask the question: what classes of models are the ones on which the insurer can learn from observations and set premiums so as to remain solvent? In this paper, we completely answer this question by giving a necessary and sufficient condition that characterizes what classes of models lend themselves to this insurance task. This setup for insurability is very reminiscent of the universal compression/estimation/prediction approaches as seen in Shtarkov (1987); Fittingoff(1972); Rissanen (1984) and Ryabko (2008). There is also extensive work regarding learning from experts that has a related flavor, see Cesa-Bianchi and Lugosi (2006) for a survey. We will discuss the compression angle in more depth shortly as well. Formulation. Formally, we adopt the collective risk approach, namely, we abstract the problem to include just two agents, the insurer and the insured. Losses incurred by the insured are considered to Insurability form a discrete time sequence of random variables, with the sequence of losses denoted by {Xi, i 1}, and we assume that Xi N for all i 1, where N denotes the set of natural numbers, {0, 1, 2, . . .}. 
A model class P∞ is a collection of measures on infinite length loss sequences, and is to be thought of as the set of all potential probability laws governing the loss sequence. Each element of P∞ is a model for the sequence of losses. Any prior knowledge on the structure of the problem is accounted for in the definition of P∞. We focus on measures corresponding to i.i.d. samples, i.e., each member of P∞ induces marginals that are product distributions. We denote by P the set of distributions on N obtained as one dimensional marginals of P∞. Since there is no risk of confusion, we will also refer to the distributions in P as models and to P as the model class. The actual model in P∞ governing the law of the loss in each period remains unknown to the insurer.

We assume no ceiling on the loss, and require the insurer to compensate the insured in full for the loss in each period at the end of that period. The insurer is assumed to start with some initial capital Π_0 ∈ R+, a nonnegative real number. The insurer then sets a sequence of premiums based on the past losses: at time i, the insurer collects a premium Π(X_1^{i-1}) at the beginning of the period, and pays out full compensation for the loss X_i at the end of the period. If the capital built up till step i (including Π(X_1^{i-1}), and after having paid out all past losses) is less than X_i, the insurer is said to be bankrupted. Given a class P∞ of loss models, we ask if, for every prescribed upper bound η > 0 on the probability of bankruptcy, the insurer can set (finite) premiums at every time step, based only on the loss sequence observed thus far and with no further knowledge of which law p ∈ P∞ governs the loss sequence, while ensuring that the insurer remains solvent with probability bigger than 1 − η under p, irrespective of which p ∈ P∞ is in effect. If the probability of the insurer ever going bankrupt over an infinite time window can be made arbitrarily small in this sense, the class of i.i.d. loss measures P∞ is said to be insurable.

A couple of clarifications are in order here. First, to make the problem non-trivial, we allow the insurer to observe the loss sequence for some arbitrary finite length of time without having to provide compensation. We require that the insurer eventually provide insurance with probability 1 no matter which p ∈ P∞ is in effect. The insurer cannot quit providing insurance once it has entered into the insurance contract with the insured. Premiums set before the entry time can be thought of as being 0, and the question of bankruptcy only arises after the insurer has entered into the contract. Secondly, at this point of research, we do not concern ourselves with incentive compatibility issues on the part of the insured, and assume that the insured will accept the contract once the insurer has entered, agreeing to pay the premiums as set by the insurer.

It turns out that the fact that the capital available to the insurer at any time is built up from past premiums does not play any role in whether a model class is insurable or not. In fact, the problem is basically one of finding a sequence of finite upper bounds Φ(X_1^{i-1}) on the loss X_i for all i ≥ 1. We refer to the sequence {Φ(X_1^{i-1}), i ≥ 1} as the loss dominating sequence and call Φ(X_1^{i-1}) the loss dominant at step i.
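To make the game concrete, the following minimal sketch (our own illustration, not from the paper) simulates a loss-domination scheme that observes for a fixed number of periods and then sets the loss dominant to twice the largest loss seen so far, a rule of the kind used later for the class of uniform distributions in Example 3. The particular loss distribution, entry time, horizon, and function names are illustrative assumptions.

```python
import random

def simulate_bankruptcy(sample_loss, entry_time, horizon, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the loss-domination scheme
    Phi(x_1^{i-1}) = 2 * max(observed losses) is ever violated.

    sample_loss: function returning one i.i.d. loss in {0, 1, 2, ...}
    entry_time:  number of periods observed before finite dominants are set (>= 1)
    horizon:     number of periods simulated (stands in for "infinite" time)
    """
    rng = random.Random(seed)
    bankrupt = 0
    for _ in range(trials):
        losses, ruined = [], False
        for i in range(horizon):
            x = sample_loss(rng)
            if i >= entry_time:
                phi = 2 * max(losses)      # loss dominant set before seeing x
                if x > phi:                # dominant violated: bankruptcy
                    ruined = True
                    break
            losses.append(x)
        bankrupt += ruined
    return bankrupt / trials

# Illustration: losses uniform on {3, ..., 40}, a member of the class U below.
loss = lambda rng: rng.randint(3, 40)
print(simulate_bankruptcy(loss, entry_time=8, horizon=500))
```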
The notion of insurability of a model class P comes down to whether for each η > 0 there is a way of choosing the loss dominants such that the probability of the loss Xi ever exceeding the loss dominant Φ(Xi 1 1 ) is smaller than η irrespective of which model p in the model class P is in effect. Here again we allow some initial finite number of periods for which the loss dominant can be set to , but it must become finite with probability 1 under each p P and stay finite from that point onwards. Results. For a model class to be insurable, roughly speaking, close distributions must have comparable percentiles. Distributions in the model class that, in every neighborhood, have some other distribution with arbitrarily different percentiles are said to be deceptive. In Section 3, we define what Santhanam and Anantharam it means for distributions to be close, and what it means for distributions to have comparable percentiles. In Section 4, we provide several examples of insurable and non-insurable model classes. Our main result is Theorem 1 of Section 3, which states that P is insurable iffit has no deceptive distributions. We prove this theorem in Sections 5 and 6. In the rest of this section, we discuss the problem of insurability in the broader contexts of uniform and pointwise convergence of estimators and universal compression. 1.2 Pointwise vs. Uniform Convergence Theoretically, the flexibility we have permitted regarding when to start proposing finite loss dominants allows us to categorize the insurance problem formulated above as one that admits what we call dataderived pointwise convergent estimators. When dealing with large alphabets or high dimensions, it may be too restrictive to only deal with model classes or problem formulations that admit uniformly convergent estimators. We are particularly interested in richer cases where uniform learnability may not be possible. In such cases, often guarantees on estimators are provided that hold pointwise for all models. Typically, the results proven in such cases are of consistency, or bounds on rate of convergence of estimators that depend on the parameters of the model in force of course, the model by itself is usually not known a-priori. Therefore the practical issue with such pointwise guarantees is that they may not say much about what is happening with the specific sample at hand. Namely, if we know that an estimator for a problem is consistent and we have model dependent convergence bounds, for a given sample (from an unknown model) there may be no way of telling how good or bad the estimate currently is. It can be shown that the insurance problem outlined here is equivalent to learning an upper bound on every percentile of the unknown distribution from P, using only samples of i.i.d. draws from the distribution. However, we allow for model classes that are rich enough that there may be no bound on a given percentile that holds uniformly over the entire class P. Yet, we can still salvage the situation if for any given finite sample, we had some way to tell from the sample if the estimate was doing well or not relative to the true unknown model. In other words, we ask for model dependent convergence bound that depends on the model only through the sample that we observe. Data-driven pointwise convergence is at the heart of insurability as well, and it shows up specifically because we provide a finite (but not bounded uniformly over all models and observations) observation window before the insurer enters the game. 
While we consider complex P where there may be no possible way to bound any given percentile over the entire class, we can tell when the bounds obtained from a sample are good against the model that generated the sample. The point at which we decide our bounds are good against the model underlying the observed sample is related to the entry point defined in the formulation above. In Section 4 we provide several examples of classes that are insurable and classes that are not.

This principle of using the sample to gauge the performance of a pointwise consistent estimate has been incorporated into several other setups as well. A few examples that stand out are the notion of luckiness NML in compression proposed in Grünwald (2007), the PAC-Bayesian bounds for classification in McAllester (2013), and the estimation of slow mixing Markov processes in Asadi et al. (2014). For the luckiness NML formulation, an appropriate slack function can be interpreted as a bound on how far we are from the code length of the underlying source. The PAC-Bayesian bounds of McAllester (2013) can also be used to provide data-dependent confidence bounds for classifiers in scenarios where classes may be very rich. In Asadi et al. (2014), again, data-dependent confidence bounds are provided for the estimates of transition and stationary probabilities of underlying slow mixing, long memory Markov sources.

To illustrate this concept, we provide a simple running example below that we will also use to demonstrate connections with compression in Subsection 1.3.

Example 1 [Birthday Problem] Consider the problem of estimating the size of a discrete, finite set S ⊂ N (specifically, there is no upper bound on the size of S) if we can draw as many random samples from S as needed. If we have independent, uniform draws from S, one simple way to estimate the size of S is to keep sampling till some element from S is drawn twice. This is the first repeat. A simple back of the envelope calculation, analogous to the Birthday Problem, shows that if N_1 is the sample size when the first repeat occurred, then a good estimate of the set size is N_1^2/2. One can then provide PAC-learning style bounds showing that, with some confidence, the above estimate based on the first repeat is accurate to a certain level.

This is an example where there can be no uniformly convergent estimator of the set size. Given a fixed sample of size n, if the size of S is Ω(n^2), the sample consists of n distinct symbols with probability close to 1. If we can assume no further structure on S, there is no way to distinguish between samples obtained from any two sets of size Ω(n^2); thus no estimator can distinguish between these large sets. It is therefore futile, with a finite sample size, to expect an estimator that can estimate the set size to any non-trivial accuracy no matter what the set is. Equivalently, there can be no uniformly convergent estimator of the set size.

The simple Birthday Problem estimator above only converges pointwise. It may take an arbitrarily long sample for some models before it gives an answer; that is the nature of the problem. But the estimator is imbued with a very useful property: with a guaranteed confidence, the estimator does not make a mistake even though it may not always have an answer. If the sample has no repeats yet, the estimator does not overreach and volunteer a wrong answer. Hence, we can tell from the sample whether we are doing well or not. □
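The first-repeat estimator is easy to state in code. The sketch below is our own illustration under arbitrary choices of set sizes and sampling budget (the helper name birthday_estimate is hypothetical): it keeps drawing uniform samples until the first repeat and returns N_1^2/2, or declines to answer if the budget runs out before a repeat occurs.

```python
import random

def birthday_estimate(draw, budget, rng):
    """Estimate |S| from uniform draws using the first repeat.

    Returns N1**2 / 2, where N1 is the number of draws at the first repeat,
    or None if no repeat occurs within `budget` draws (no answer yet).
    """
    seen = set()
    for n in range(1, budget + 1):
        x = draw(rng)
        if x in seen:
            return n * n / 2.0        # first repeat at sample size n
        seen.add(x)
    return None                       # the estimator declines to answer

rng = random.Random(1)
for true_size in (100, 10_000):
    draw = lambda r: r.randrange(true_size)   # uniform over S = {0, ..., true_size - 1}
    estimates = [birthday_estimate(draw, budget=1000, rng=rng) for _ in range(5)]
    print(true_size, estimates)
```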
1.3 Universal Compression

The approach we take is not unconnected with the universal compression literature, as well as certain learning formulations involving regret with log-loss. There are many variations in how compression problems are formulated, and the following example illustrates the main variants studied. It also serves to provide an example of the data-derived pointwise estimation introduced in the last section, and hence places insurability in the context of the universal compression literature.

Example 2 We will study the so-called Birthday problem from the previous example in a little more depth. Let B denote the collection of all distributions p_M (M ≥ 1), where p_M is the uniform distribution over {0, . . . , M}. We use this example to distinguish between uniformly good compressors (strongly universal) and compressors that are only good pointwise (weakly universal).

Suppose we consider one draw from an unknown distribution p ∈ B. The worst case regret, or worst-case redundancy, quantifies the minimum possible excess code length of a universal distribution q over N relative to the (unknown) distribution p:

    R(B) := inf_q sup_{p ∈ B} sup_{x ∈ N} log ( p(x) / q(x) ).    (1)

Note that the regret of any class of distributions is always ≥ 0. We could, of course, consider a sequence of n independent draws from an unknown p ∈ B, and ask now for a measure q over infinite sequences of numbers that is universal for all the i.i.d. measures corresponding to distributions in B. We then concentrate on the redundancy

    R_n(B) := inf_q sup_{p ∈ B} sup_{x^n ∈ N^n} (1/n) log ( p(x^n) / q(x^n) ),

where we abuse notation and write p(x^n), for p ∈ B, for the probability assigned to x^n by n independent draws from p. Of course, we could similarly define redundancy for length-n sequences for any collection of measures over infinite sequences from a countable alphabet, not necessarily i.i.d. Strongly compressible classes are those sets P∞ of measures over infinite sequences satisfying lim_{n→∞} R_n(P∞) = 0.

For the single-letter formulation in (1), clearly the optimal universal distribution gives any number x a probability proportional to the highest probability that number gets from any model in B, followed by a normalization. But the highest probability a model in B gives any x ∈ N is 1/(x + 1), which is not summable over x. Thus the redundancy is infinite here; equivalently, no matter what universal distribution q we choose and no matter how large a number M we pick, there is a p ∈ B and a number x ∈ N such that log ( p(x) / q(x) ) > M. In this case, we will therefore not have redundancy bounds holding uniformly for the model class. We say B is not strongly compressible. With a very similar argument, it is easy to see that R_n(B∞) = ∞ for all n, where we use B∞ to denote the set of i.i.d. measures over infinite sequences constructed as above from marginals in B.

But we can say something more. Consider again compressing sequences of numbers drawn i.i.d. from an unknown distribution in B. Noting that B is countable, we focus on a measure q_w over infinite sequences that gives a sequence x^n the probability

    q_w(x^n) = Σ_{i ≥ 1} 1/( i(i + 1) ) p_i(x^n).

It is easy to verify that q_w above satisfies

    sup_{p ∈ B} lim_{n→∞} sup_{x^n ∈ N^n} (1/n) log ( p(x^n) / q_w(x^n) ) = 0,    (2)

or that q_w matches every p pointwise over the model class B. Such classes of sources are weakly universal. The code length of the universal measure q_w matches that of p for every p ∈ B, but possibly at an arbitrarily slow rate for some sources (since the class cannot be strongly compressed). A couple of points. Note that admittedly it has been easy to define q_w here since B was countable to begin with; a small numerical illustration of this mixture appears below.
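As a quick numerical check (our own illustration, with an arbitrary truncation of the infinite sum and an arbitrary choice of M), the sketch below compares the ideal code length under the mixture q_w against that of the true uniform source p_M, showing the per-symbol excess (1/n) log( p(x^n) / q_w(x^n) ) shrinking as n grows for a fixed M, even though no bound holds uniformly over all M.

```python
import math, random

def log2_qw(xs, trunc=1000):
    """log2 of q_w(x^n) = sum_{i>=1} p_i(x^n) / (i(i+1)), where p_i is the
    i.i.d. measure with marginal uniform on {0,...,i}.  Only terms with
    i >= max(xs) are nonzero; the rapidly decaying series is truncated."""
    n, m = len(xs), max(max(xs), 1)
    # log2 of each nonzero term: 1/(i(i+1)) * (1/(i+1))**n
    logs = [-math.log2(i) - (n + 1) * math.log2(i + 1) for i in range(m, m + trunc)]
    top = max(logs)                                  # log-sum-exp in base 2
    return top + math.log2(sum(2.0 ** (l - top) for l in logs))

rng = random.Random(0)
M = 20                                               # true source: uniform on {0,...,M}
xs = []
for n in (10, 100, 1000):
    while len(xs) < n:
        xs.append(rng.randint(0, M))
    excess = (-log2_qw(xs) - n * math.log2(M + 1)) / n
    print(n, round(excess, 4))                       # per-symbol excess tends to 0
```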
If not, a condition reminiscent of countability above is necessary and sufficient for a class to be weakly universal as shown in (Kieffer, 1978). To emphasize, qw is guaranteed to satisfy (2), Insurability implying that it does not underestimate relative to any p for sufficiently long sequences. But how long is sufficiently long depends on p, and for a given length-n sequence without knowing p, it may not be possible to say if qw is doing well or not. This second aspect of knowing when an estimator is good is crucial to insurance formulations. One could also replace the sup over xn with expectation, and get average case versions for both strong and weak universality. 2 Strong compression is well known and is the more studied version of universal compression and regret formulations involving log loss. As one might expect, we show in (Santhanam and Anantharam, 2012) that strong compression implies insurability but not vice-versa. However, the insurance formulation has more in common with weak compression and pointwise convergence, rather than strong compression and uniform convergence. The connection between insurability and weak compression turns out to be rather interesting. In (Santhanam and Anantharam, 2012), we show classes of models that can be weakly compressed but are not insurable. At the same time, we also construct classes of models in (Santhanam and Anantharam, 2012) that are insurable, but cannot be weakly compressed. To summarize, our formulation is interesting precisely in cases where the strong notions of (worstcase or average-case) redundancy fail. Namely, classes of distributions whose redundancy is not finite. The universal compression formulation closer to our notion of insurability here is in the idea of weak universal compression in Kieffer (1978). However, weak compression formulations thus far have never included the aspect of determining from the data at hand when a compressor is doing well a crucial part of our problem. There may be one insight that we conjecture can be generalized beyond insurability to all problems with the flavor of data-derived pointwise convergence of estimates. What matters seems to be the local complexity as opposed to global complexity of model classes. Insurability of model classes does not depend on global complexity measures of model classes as with the (strong) redundancy of model classes (which is determined by the integral of the square root of absolute Fisher information over the entire model class) or the Rademacher complexity. Instead, insurability is related to how local neighborhoods look; in particular it depends on local tightness as we will see in Section 3. 2. Precise Formulation of Insurability We model the loss at each time by a random variable taking values in N = {0, 1, . . .}. Denote the sequence of losses by X1, X2 . . . where Xi N. Let N be the set of all finite length sequences from N, including the empty sequence. We will write xn for the sequence x1, . . . ,xn. Where it appears, x0 denotes the empty sequence. A loss distribution is a probability distribution on N. Let P be a set of loss distributions. P is the collection of i.i.d. measures over infinite sequences of symbols from N such that the set of one dimensional marginals over N they induce is P. We write R+ for the set of nonnegative real numbers and use := for equality by definition. Consider an insurer with an initial capital Π0 R+. An insurance scheme for P is comprised of a pair (τ, Π). Here τ : N 7 {0, 1} satisfies τ(x1, . . . , xn) = 1 = τ(x1, . . . 
, xn+1) = 1 for all xn and also p(supn τ(Xn) = 1) = 1 for all p P . τ should be thought of as defining an entry time for the insurer with the property that once the insurer has entered it stays entered and that the insurer enters with probability 1 irrespective of which p P is in effect. Here we say the insurer enters after seeing the sequence xn N (possibly the empty sequence) if τ(xn) = 1. The other ingredient of an insurance Santhanam and Anantharam scheme is the premium setting scheme Π : N R+, satisfying Π(xn) = 0 if τ(xn) = 0, with Π(xn) being interpreted as the premium demanded by the insurer from the insured after the loss sequence xn N is observed. Let 1( ) denote the indicator function of its argument. The event that the insurer goes bankrupt is the event that i=1 (Π(Xi 1) Xi)1(τ(Xi 1) = 1) < 0 for some n 1 . In words, this is the event that in some period n 1 after the insurer has entered, the loss Xn incurred by the insured exceeds the built up capital of the insurer, namely the sum of its initial capital and all the premiums it has collected after it has entered (including the currently charged premium Π(Xn 1)) less all the losses paid out so far. Definition 1 A class P of laws on loss sequences is called insurable by an insurer with initial capital Π0 R+ if η > 0, there exists an insurance scheme (τ, Π) such that p P , p((τ, Π) goes bankrupt ) < η . We should remark that despite the apparent role of the initial capital of the insurer in this definition, it plays no role from a mathematical point of view. To see this note first that if a model class P is insurable by an insurer with capital Π0 it is clearly insurable by all insurers with initial capital at least Π0, since such an insurer can use the same entry time and premium setting scheme as the insurer with initial capital Π0. On the other hand, an insurer with initial capital less than Π0 can use the same entry time as an insurer with initial capital Π0 and simply charge an additional premium at the time of entry which in effect builds up its initial capital to Π0, and then proceed with the same premium setting scheme as that used by the insurer with initial capital Π0. This feature is an artifact of the complete flexibility we give the insurer in setting premiums; for more on this see the concluding remarks in Section 7. As indicated in the introductory Section 1, we will first show that whether a model class of loss distributions is insurable is equivalent to whether we can find suitable loss-domination sequences for the sequence of losses. We next make this connection and the associated terminology precise. Definition 2 A loss-domination scheme for P is a mapping Φ : N 7 R+ { }, where for xn N , we interpret Φ(xn) as an estimated upper bound on xn+1. We call {Φ(Xi 1), i 1} the loss-domination sequence and Φ(Xi 1) the loss dominant at step i. We require for all xn N that Φ(x1, . . . , xn) < = Φ(x1, . . . , xn+1) < and also that for all p P , p(inf n 1 Φ(Xn) < ) = 1. 2 We think of Φ(xn) = as saying that the scheme has not yet committed to proposing finite loss dominants after having seen the sequence xn, while if Φ(xn) < it has. Once the scheme commits to proposing finite loss dominants it has to continue to propose finite loss dominants from that point onwards. Further, with probability 1 under every p P , the scheme has to eventually start proposing finite loss dominants. 
Definition 3 Given our motivation from the insurance problem, we will say the loss-domination scheme Φ goes bankrupt if Φ(X_1^{n-1}) < X_n for some n ≥ 1. □

The connection between the insurance problem and the problem of selecting loss dominants can now be made precise as follows.

Observation 1 Let P∞ be a model class and η > 0. Let Π_0 ∈ R+. An insurer with initial capital Π_0 can find an insurance scheme (τ, Π) such that the probability of remaining solvent is bigger than 1 − η irrespective of which p ∈ P∞ is in effect if and only if there is a loss-domination scheme Φ such that the probability of it going bankrupt is less than η irrespective of which p ∈ P∞ is in effect.

Proof Given an insurance scheme (τ, Π), consider the loss-domination scheme Φ that has Φ(x^n) := ∞ if τ(x^n) = 0, and

    Φ(x^{n-1}) := Π_0 + Σ_{i=1}^{n-1} (Π(x_1^{i-1}) − x_i) 1(τ(x_1^{i-1}) = 1) + Π(x_1^{n-1}),  if τ(x^{n-1}) = 1.

Since τ enters (becomes equal to 1) with probability 1 under each p ∈ P∞ and stays equal to 1 once it has become 1, Φ becomes finite with probability 1 under each p ∈ P∞ and stays finite once it has become finite. Thus Φ is indeed a loss-domination scheme. It is straightforward to check that if the insurance scheme (τ, Π) stays solvent with probability bigger than 1 − η irrespective of which p ∈ P∞ is in effect, then the loss-domination scheme Φ goes bankrupt with probability less than η irrespective of which p ∈ P∞ is in effect.

Conversely, given a loss-domination scheme Φ, define the insurance scheme (τ, Π) by setting τ(x^n) := 0 iff Φ(x^n) = ∞ (and τ(x^n) := 1 iff Φ(x^n) < ∞), and defining Π(x^n) := 0 if Φ(x^n) = ∞ and Π(x^n) := Φ(x^n) if Φ(x^n) < ∞. One sees that τ as defined becomes 1 with probability 1 under each p ∈ P∞ and stays equal to 1 once it becomes 1. Further, the premiums set at each time are finite and equal to 0 till the entry time. Thus (τ, Π) as defined is indeed an insurance scheme. It is straightforward to check that if Φ goes bankrupt with probability less than η irrespective of which p ∈ P∞ is in effect, then (τ, Π) stays solvent with probability bigger than 1 − η irrespective of which p ∈ P∞ is in effect. Hence the above observation. □

We may therefore conclude that a model class P∞ is insurable iff for all η > 0 there is a loss-domination scheme Φ such that the probability of going bankrupt under Φ is less than η irrespective of which p ∈ P∞ is in effect. In the rest of the paper we will therefore focus mainly on whether the model class P∞ is such that for every η > 0 a loss-domination scheme Φ exists with its probability of bankruptcy being less than η irrespective of which model in the model class governs the sequence of losses. In Theorem 1, we provide a condition on P that is both necessary and sufficient for insurability.

3. Statement of Main Result

We go through a few technical points before spelling out the results in detail in Subsection 3.3.

3.1 Close Distributions

Insurability of P∞ depends on the neighborhoods of the probability distributions among its one dimensional marginals P. The relevant measure of closeness between distributions in P that decides the neighborhoods is

    J(p, q) := D( p || (p + q)/2 ) + D( q || (p + q)/2 ).

Note that the above is the Jensen-Shannon divergence (with an additional factor of 2) and is not a true distance. Here D(p||q) denotes the relative entropy of p with respect to q, where p and q are probability distributions on N, defined by

    D(p||q) := Σ_{y ∈ N} p(y) log ( p(y) / q(y) ).

The logarithm is assumed to be taken to base 2 (we use ln for the logarithm to the natural base).
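The quantity J(p, q) is straightforward to compute for distributions with countable support. The sketch below is our own illustration (the dictionary representation and the example pair are arbitrary choices): it evaluates J for two finitely supported distributions and checks the two bounds relating it to the ℓ1 distance stated just below and proved in Lemma 4. The pair chosen also hints at the phenomenon behind deceptive distributions defined in Subsection 3.3: the two distributions are close in ℓ1 yet have very different upper percentiles.

```python
import math

def J(p, q):
    """J(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2), logs in base 2.
    p, q: dicts mapping outcomes in N to probabilities."""
    total = 0.0
    for y in set(p) | set(q):
        py, qy = p.get(y, 0.0), q.get(y, 0.0)
        m = (py + qy) / 2.0
        if py > 0:
            total += py * math.log2(py / m)
        if qy > 0:
            total += qy * math.log2(qy / m)
    return total

def l1(p, q):
    return sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in set(p) | set(q))

p = {0: 0.5, 1: 0.5}
q = {0: 0.5, 1: 0.25, 10**6: 0.25}   # close in l1, very different 75th percentile
d = l1(p, q)
# lower bound, J itself, upper bound (Lemma 4)
print(d ** 2 / (4 * math.log(2)), J(p, q), d / math.log(2))
```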
The reason for choosing the Jensen-Shannon (JS) divergence is that it has two convenient properties: (i) for the necessary part, it becomes easy to quantify how close distributions yield very similar measures on sequences; (ii) for the sufficient part, we bound the JS divergence with the ℓ1 norm in Lemma 4, which in turn lets us work with the ℓ1 topology induced on the class P of distributions. Specifically, we show that if p and q are probability distributions on N, then

    (1/(4 ln 2)) |p − q|_1^2 ≤ J(p, q) ≤ (1/ln 2) |p − q|_1.

3.2 Cumulative Distribution Function

Since we would like to discuss percentiles, it is convenient to use a non-standard definition for the cumulative distribution function of a probability distribution on N. For our purposes, the cumulative distribution function of any probability distribution p on N is a function F_p : R+ ∪ {∞} → [0, 1] defined in an unconventional way. We obtain F_p by first defining F_p on points in the support of p in the way cumulative distribution functions are normally defined. We define F_p for all other nonnegative real numbers by linearly interpolating between the values on the support of p. Finally, F_p(∞) := 1. Let F_p^{-1} : [0, 1] → R+ ∪ {∞} denote the inverse function of F_p. Then F_p^{-1}(x) = 0 for all 0 ≤ x < F_p(0). If p has infinite support then F_p^{-1}(1) = ∞, else F_p^{-1}(1) is the smallest natural number y such that F_p(y) = 1.

Two simple and useful observations can now be made. Consider a probability distribution p with support A ⊆ N. For δ > 0, let (T for tail)

    T_{p,δ} := {y ∈ A : y ≥ F_p^{-1}(1 − δ)},

and let (H for head)

    H_{p,δ} := {y ∈ A : y ≤ 2 F_p^{-1}(1 − δ/2)}.

It is easy to see that

    p(T_{p,δ}) > δ    (3)

and that

    p(H_{p,δ}) > 1 − δ.    (4)

Suppose that for some δ > 0 we have F_p^{-1}(1 − δ) > 0 and the loss dominant at the beginning of period i ≥ 1 happens to be set below F_p^{-1}(1 − δ); then the probability under p of the loss in period i exceeding the loss dominant is bigger than δ. If the loss dominant at the beginning of period i happens to be set to 2 F_p^{-1}(1 − δ/2), then the probability that the loss in period i exceeds the loss dominant is less than δ. We will use these observations in the proofs to follow.

3.3 Condition that is Necessary and Sufficient for Insurability

Existence of close distributions with very different quantiles is what kills insurability. A loss-domination scheme could be deceived by some process p ∈ P∞ into setting low loss dominants, while a close enough distribution hits the scheme with too high a loss. The conditions for insurability of P∞ are phrased in terms of the set of its one dimensional marginals, P. Formally, a distribution p in P is not deceptive if some neighborhood of P around p is tight. For a more complete development of the concept of tightness, see, e.g., Billingsley (1995). Specifically, p is not deceptive if there exists ϵ_p > 0 such that for all δ > 0 there is an f(δ) ∈ R such that all distributions q ∈ P with J(p, q) < ϵ_p satisfy F_q^{-1}(1 − δ) ≤ f(δ). Equivalently, a probability distribution p in P is deceptive if no neighborhood of P around p is tight: for every ϵ > 0 there exists δ > 0 such that, no matter what f(δ) ∈ R+ is chosen, there is a (bad) distribution q ∈ P such that J(p, q) < ϵ and F_q^{-1}(1 − δ) > f(δ).

In the above definition, f(δ) is simply an arbitrary nonnegative real number. However, it is useful to think of this number as the evaluation of a function f : (0, 1) → R at δ. Our main theorem is the following, which we prove in Sections 5 and 6.

Theorem 1 P∞ is insurable iff no p ∈ P is deceptive. □

4. Examples

Consider U, the collection of all uniform distributions over finite supports of the form {m, m + 1, . . .
,M} for all positive integers m and M with m M. Let the sequence of losses be i.i.d. samples from distributions in U call the resulting model class over infinite loss sequences U . Note that no distribution in U is deceptive. Around each distribution in U is a neighborhood that that contains no other distribution of U. Example 3 U is insurable. Proof If the threshold probability of ruin is η, choose the loss-domination scheme Φ as follows. For all sequences xn with n log 1 η + 1 set Φ(xn) = . For all sequences xn with n > log 1 η + 1, the loss dominant Φ(xn) is set to be twice the largest loss observed thus far. It is easy to see that this scheme is bankrupted with probability less than η irrespective of which p U is in effect. 2 Consider the set N 1 of all i.i.d. processes such that the one dimensional marginals have finite first moment. Namely, p N 1 , Ep X < where X N is distributed according to the single letter marginal of p. If N1 is the collection of single letter marginals from N 1 , it is easy to verify as below that every distribution in N1 is deceptive. Santhanam and Anantharam Example 4 N 1 is not insurable. Proof Note that the loss process that puts probability 1 on the all zero sequence exists in N 1 , since it corresponds to the one dimensional marginal loss distribution that produces loss 0 in each period. Since every loss-domination scheme enters with probability 1 no matter which p N 1 is in force, every loss-domination scheme must enter after seeing some finite number of zeros. Fix any loss-domination scheme Φ. Suppose the scheme starts to set finite loss dominants after seeing N losses of size 0. To show that N 1 is not insurable, we show that η > 0 and p N 1 such that p( Φ goes bankrupt ) η. Fix δ = 1 η. Let ϵ be small enough that (1 ϵ)N > 1 δ/2, and let M be a number large enough that (1 ϵ)M < δ/2. Note that since 1 δ/2 δ/2, we have N < M. Let L be greater than any of the loss dominants set by Φ for the sequences 0N, 0N+1, . . . 0M. Let p N 1 satisfy, for all i, ( 1 ϵ if Xi = 0 ϵ if Xi = L. For the i.i.d. loss process having the law p, the insurer is bankrupted on all sequences that contain loss L in between the N-th and M-th steps. These sequences, 0NL, 0N+1L, . . . ,0M 1L, have respective probabilities (under p) (1 ϵ)Nϵ, (1 ϵ)N+1ϵ, . . . , (1 ϵ)M 1, and they also form a prefix free set. Therefore, summing up the geometric series and using the assumptions on ϵ above, p( Φ is bankrupted ) (1 ϵ)N (1 ϵ)M 1 δ/2 δ/2 = η. 2 A reading of the proof above shows that we can say something much stronger. The distributions that break insurability have all their moments finite. Suppose N is the collection of measures each of whose single letter marginal has all moments finite. Namely for all p N and all finite r 0, Ep Xr 1 < . It follows that Example 5 N is not insurable. 2 Consider the collection of all i.i.d. loss distributions with monotone one dimensional marginals. A monotone probability distribution p on N is one that satisfies p(y + 1) p(y) for all y N. Let M be the set of all i.i.d. loss processes, with one dimensional marginal distribution from M, the collection of all monotone probability distributions over N. Again, it is easily shown that every distribution in M is deceptive. It follows from Theorem 1 that Example 6 M is not insurable. Proof To see that any distribution p M is deceptive, consider distributions of form p = (1 ϵ)p+ϵq, where q U is a monotone uniform distribution and ϵ > 0. 
Clearly, the ℓ1 distance between p and q is 2ϵ (and therefore so is the JS divergence, up to a constant factor). But for all M > 0 and δ < ϵ, we can pick q U over a sufficiently large support that the 1 δ percentile of p can be made M. Therefore, no neighborhood around p is tight. 2 Insurability Note also that if p has finite entropy, so does every p obtained by the above construction. Let M M be the collection of finite entropy monotone distributions. The above example also implies that Example 7 M is not insurable. 2 Now for h > 0, we consider the set Mh M of all monotone distributions over N whose entropy is bounded above by h. Let M h be the set of all i.i.d. loss processes with one dimensional marginals from Mh. Then Example 8 M h is insurable. Proof From Markov inequality, if p Mh and X p, p(X > M) = p(log(X + 1) > log(M + 1)) < Ep log(X + 1) log(M + 1) Ep log 1 p(X) log(M + 1) h log(M + 1). To see the second inequality above, note that p is monotone therefore for i N, p(i) 1 i+1. Therefore, for all p Mh, F 1 p (1 δ) 2 h δ . Thus no p Mh is deceptive, and M h is insurable. 2 In the class U above, there was a neighborhood around each distribution p U with no other model from U, Hence U trivially satisfied the local tightness condition that we will prove is necessary and sufficient for insurability. The above case is another extreme the entire model class Mh is tight. The following example illustrates a insurable class of models where neither extreme holds. For a distribution q over N, let q(R)(i + R) = q(i) for all i N. Furthermore let the span of any finite support probability distribution over naturals be the largest natural number which has non-zero probability. Then, let Fh = n (1 ϵ)p1 + ϵp(span(p1)+1) 2 : p1 U, p2 Mh and 1 > ϵ > 0 o . As always F h is the set of measures on infinite sequences obtained by i.i.d. sampling from distributions in Fh. Example 9 F h is insurable. Proof Let the base of any probability distribution over the naturals be the smallest natural number which has non-zero probability. Consider any distribution p = (1 ϵ)p1 + ϵp(span(p1)+1) 2 Fh with p1 U, p2 Mh, and 1 > ϵ > 0. Let m denote base(p), and m + M 1 denote span(p1). Thus |support(p1)| = M, and we have M 1. Of course, we also have base(p1) = base(p). Consider any other distribution q = (1 ϵ )q1 + ϵ q(span(q1)+1) 2 Fh, where q1 U, q2 Mh, and 1 > ϵ > 0. Let m denote base(q) (which equals base(q1)) and let m + M 1 denote span(q1), so |support(q1)| = M , and we have M 1. It suffices to show that there is an ℓ1 ball around p of sufficiently small radius, such that for all δ > 0 we can find a uniform bound on the (1 δ)-th percentile of all q in this ball. If m > m, then the ℓ1 distance between p and q is at least 1 ϵ M . Hence, whenever the ℓ1 distance between p and q is strictly less than 1 ϵ M we must have m m. Thus we may assume that m m. Santhanam and Anantharam Suppose m + M 1 m + 2M 1 ϵ. Then M 2 , from which, because support(p1) support(q1), we can conclude that the ℓ1 distance between q and p is at least 1 ϵ 2 . Thus we may assume that m + M 1 < m + 2M 1 ϵ. Now, for any i 0, we have q(m + M + i) = ϵ q2(i) ϵ 1 i + 1 1 i + 1 . Thus for any K 0 we have q(X > m + 2M 1 ϵ + K) q(X > m + M + K) h log(K + 1) by an argument similar to that in the preceding example, which gives the desired conclusion that no p F h is deceptive, and hence that F h is insurable. 2 5. Necessary Condition for Insurability Theorem 2 shows that lack of deceptive distributions in P is necessary for insurability of P . 
Theorem 2 If P is insurable, then no p P is deceptive. Proof To keep notation simple, we will denote by p (or q) both a measure in P as well as the corresponding one dimensional marginal distribution, which is a member of P. The context will clarify which of the two is meant. We prove the contrapositive of the theorem: if some p P is deceptive, then P is not insurable. Pick α > 0. Suppose p P is deceptive. Let Φ be any loss-domination scheme. Recall that Φ enters on p with probability 1, in the sense that the loss dominants set by Φ will eventually become finite with probability 1 under p. For all n 1, let Rn := {xn : Φ(xn) < } be the set of sequences of length n on which Φ has entered and let N 1 be a number such that p(RN) > 1 α/2. (5) Fix 0 < η < 1 N )(1 1/e2). We prove that P is not insurable by finding, for each lossdomination scheme Φ, a probability distribution q P such that q( Φ goes bankrupt ) η. The basic idea is that because Φ has to enter with probability 1 under p, it would have been forced to set premiums that are too low for q. For any sequence xn, let A(xn) be the set of symbols that appear in it. Recall that the head of the distribution p, Hp,γ, was defined in Section 3.2 to be the set {y Ap : y 2F 1 p (1 γ/2)}, where Ap is the support of p. Further, define for all γ > 0 Rp,γ,n := {xn Rn : A(xn) Hp,γ)}. Given δ > 0, pick γp(δ) so small that (1 γp(δ))N+1/δ 1 α/2. (6) Insurability A word about this parameter γp(δ), since it may not be immediately apparent why this should be defined. The advantage of doing so is technical we will be able to handle p (and q which will be chosen later) as though they were distributions with finite span. Set1 ϵ = 1 16(ln 2)N8 . Applying Lemma 5 to distributions over length-N sequences induced by the distributions p, q P such that J (p, q) ϵ, we have for each δ > 0 that q(XN Rp,γp(δ),N) 1 α 2 This implies that Φ has entered with probability (under q) at least 1 α 2 N for length N sequences. We will find a suitable δ > 0 and a suitable q such that J (p, q) ϵ, and such that the scheme Φ is bankrupted with probability > η. Since p is deceptive, there exists δ > 0 such that sup q P: J (p,q)<ϵ F 1 q (1 δ ) = . Define p,ϵ = {δ : sup q:J (p,q)<ϵ F 1 q (1 δ ) = }. In connection with this definition, note that if the δ tails of distributions in the ϵ neighborhood of p are not bounded, neither are the δ tails for all δ < δ . Furthermore if sup p,ϵ 1/2, we are done since for some δ 1/2, there is a q satisfying J (p, q) ϵ and F 1 q (1 δ) max x N Rp,γp,N Φ(x N). Therefore, conditioned on the event {XN Rp,γp(δ),N}, this q will be bankrupted with probability 1/2. From (7) above we thus have q( Φ goes bankrupt ) 1 α 2 If not, we have p,ϵ < 1/2. Pick δ and r = 2δ such that δ p,ϵ, but r / p,ϵ. For convenience, let M = 1 δ . We consider now a set S of strings of lengths ranging from N to N + M defined by the following properties: every string in S has its prefix of length N belonging to Rp,γp(δ),N; every string of length k in S, N + 1 k N + M, has all its symbols at times N + 1 through to k belonging to Hq ,2r for some q P such that J (p, q ) < ϵ. To clarify this definition, we recall once again that Hq ,2r is the {y Aq : y 2F 1 q (1 r)}, where Aq denotes the support of q . Note that since r / p,ϵ, S is finite. Again, we pick q satisfying J (p, q) ϵ and F 1 q (1 δ) max xk S N k N+M 1 Φ(xk). 1. 
Please note that in the interest of simplicity, we have not attempted to provide the best scaling for ϵ or the tightest possible bounds in arguments below Santhanam and Anantharam Therefore if a symbol from Tq,δ follows any string in S, the scheme goes bankrupt under q. There may be symbols in the complement of Tq,δ that also bankrupt the scheme if they follow a string in S. Taking a different perspective, it could happen that no string in S contains symbols in Tq,δ, or that strings S contain symbols from Tq,δ. We consider both these variations below. Let q1 be the probability (under q) of all sequences in Rp,γp,N under which the scheme Φ has not yet been bankrupted, and let q2 be the probability (under q) of all sequences in Rp,γp,N where Φ has already been bankrupted. Therefore q1 + q2 = q(Rp,γp,N). Define a sequence in S to be a survivor if the loss-domination scheme Φ has not yet been bankrupted on this sequence under q. Thus, for instance, q2 is the probability, under q, of survivor sequences of length N. To continue, we need to consider two cases: Case (i). Strings in S do not contain symbols of Tq,δ (when supq :J (p,q )<ϵ 2F 1 q (1 r) F 1 q (1 δ)). In this case, Hq,2r is contained in the complement of Tq,δ and we show the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ . Given that the sequence seen so far is a survivor sequence, one can classify the next symbol in one of four ways: (a) it is in Tq,δ (which automatically implies bankruptcy); (b) it is in the complement of Tq,δ Hq,2r which we ignore; (c) it is in Hq,2r and results in bankruptcy; (d) it is in Hq,2r and does not result in bankruptcy. We ignore case (b) since we are only interested in a lower bound. In case (c) the contribution to the conditional probability of bankruptcy given a survivor sequence is 1, but we lower bound it for survivor sequences of length N + l (0 l M 1) by the running sum δ + (1 r)δ + . . . + (1 r)M l 1δ . This has the effect that the lower bound looks as though symbols in Hq,2r never contribute to bankruptcy (and hence always lead to survivor sequences). Case (ii). Strings in S could contain symbols of Tq,δ (when supq :J (p,q )<ϵ 2F 1 q (1 r) > F 1 q (1 δ).) We show here that the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 δ)δ + . . . + (1 δ)Mδ . This can be seen, as in the preceding case, by classifying the next symbol following a survivor sequence into three types: (a) it is in Tq,δ (which implies bankruptcy); (b) it is the complement of Tq,δ and results in bankruptcy; (c) it is the complement of Tq,δ and does not result in bankruptcy. In case (b), the conditional probability of bankruptcy given a survivor sequence of length N +l (0 l M 1) is 1, but, as before we lower bound the contribution to by the running sum δ + (1 δ)δ + . . . + (1 δ)M l 1δ . Similar to the prior case, this has the effect that the lower bound looks as though symbols in the complement of Tq,δ never contribute to bankruptcy (and hence always lead to survivor sequences). However, we have 1 δ 1 r, so once again in case (ii) the probability under q with which Φ is bankrupted is bounded below by q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ . Insurability Thus we see that irrespective of which case is in force, under q the loss-domination scheme Φ is bankrupted with probability at least q2 + q1 δ + (1 r)δ + . . . + (1 r)Mδ 1 (1 2δ) 1/δ 2q(Rp,γp,N) 1 (1 2δ) 1/δ The theorem follows. 2 6. 
Sufficient Condition for Insurability When no p P is deceptive, for any η > 0 Theorem 3 constructs a loss-domination scheme that goes bankrupt with probability η. If no p P is deceptive, there is for each p P a number ϵp > 0 such that, for every percentile δ > 0, there is a uniform bound on the δ-percentile over the set of probability distributions in the neighborhood {p P : J (p , p) < ϵp}, . We pick such an ϵp for each p P and call it the reach of p. For p P, the set Bp = {p P : J (p, p ) < ϵp}, where ϵp is the reach of p, will play the role of the set of probability distributions in P for which it will be okay to eventually set loss dominants assuming p is in force. To prove that P is insurable if no distribution among its one dimensional marginals P is deceptive, we will need to find a way to cover P with countably many sets of the form Bp above. Unfortunately, J (p, q) is not a metric, so it is not immediately clear how to go about doing this. On the other hand note that J (p , p) |p p |1/ ln 2, where |p p |1 denotes the ℓ1 distance between p and p (see Lemma 4 in the Appendix). Therefore, we can instead bootstrap offan understanding of the topology induced on P by the ℓ1 metric. 6.1 Topology of P with ℓ1 Metric The topology induced on P by the ℓ1 metric is Lindel of, i.e. any covering of P with open sets in the ℓ1 topology has a countable subcover. See e.g., (Dugundji, 1970, Defn. 6.4) for definitions and properties of Lindel of topological spaces. We can show that P with the ℓ1 topology is Lindel of by appealing to the fact that the set of all probability distributions on N with the ℓ1 topology, is second countable, i.e. that it has a countable basis. The set of all distributions on N along with ℓ1 topology has a countable basis because it has a countable norm-dense set (consider the set of all probability distributions on N with finite support and with all probabilities being rational). Now, P, as a topological subspace of a second countable topological space is also second countable (Dugundji, 1970, Theorem 6.2(2)). Finally, every second countable topological space is Lindel of (Dugundji, 1970, Thm. 6.3), hence P is Lindel of. Santhanam and Anantharam 6.2 Sufficient Condition We now have the machinery required to prove that if no p P is deceptive, then P is insurable, which is the other direction of Theorem 1, as stated next. Theorem 3 If no p P is deceptive, then P is insurable. Proof The proof is constructive. For any 0 < η < 1, we obtain a loss-domination scheme Φ such that for all p P , p(Φ goes bankrupt ) < η. For p P, let Qp = q : |p q|1 < ϵp2(ln 2)2 where ϵp is the reach of p. We will call Qp as the zone of p. The set Qp is non-empty when ϵp > 0. For large enough n, the set of loss sequences of length n with empirical distribution in Qp will ensure that the loss-domination scheme Φ to be proposed enters with probability 1 when p is in force. Note that if ϵp > 0 is small enough then Qp P Bp we will assume wolog that ϵp > 0 is always taken so that Qp P Bp. Since no p P is deceptive, none of the zones Qp are empty and the space P of distributions can be covered by the sets Qp P, namely P = p P(Qp P). From Section 6.1, we know that P is Lindel of under the ℓ1 topology. Thus, there is a countable set P P, such that P is covered by the collection of relatively open sets {Q p P : p P}. We let the above collection be denoted by Q P. 
We will refer to P as the quantization of P and to elements of P as centroids of the quantization, borrowing from commonly used literature in classification. We index the countable set of centroids, P (and reuse the index for the corresponding elements of Q P) by ι : P N. We now describe the loss-domination scheme Φ having the property that for all p P , p(Φ goes bankrupt ) < η. Preliminaries. Consider a length-n sequence xn on which Φ has not entered thus far. Let the empirical distribution of the sequence be q, and let P q := {p P : q Qp } be the set of centroids in the quantization of P (elements of P) which can potentially capture q. Note that q in general need not belong to P or P. If P q = , we will further refine the set of distributions that could capture q further to Pq P q as described below. Refining P q to Pq ensures that models in P q do not prematurely capture loss sequences. Let p be the model in force, which remains unknown. The idea is that we want sequences generated by (unknown) p to be captured by those centroids of the quantization P that have p in their reach. We will require (8) below to ensure that the probability (under the unknown p) of all sequences that may Insurability get captured by centroids p Pq not having p in its reach remains small. In addition, we impose (9) as well to resolve a technical issue since q need not, in general, belong to P. For p P q, let the reach of p be ϵp , and define Dp := ϵp 4(ln 2)4 In case the underlying distribution p happens to be out of the reach of p (wrong capture), the quantity Dp will later lower bound the distance of the empirical q in question from the underlying p. Specifically, we place p in Pq if n satisfies exp n Dp /18 η 2C(p )ι(p )2n(n + 1), (8) and 2F 1 q (1 q Dp /6) log C(p ), (9) where C(p ) is C(p ) := 2 2 supr Bp F 1 r (1 q Note that C(p ) is finite since p is not deceptive. Comparison with Lemma 7 will give a hint as to why the equations above look the way they do. Description of Φ. For the sequence xn with type q, if Pq = , the scheme does not enter yet. If Pq = , let pq denote the distribution in Pq with the smallest index. All sequences with prefix xn (namely sequences obtained by concatenating xn with by any other sequence of symbols) are then said to be trapped by pq namely, loss dominants will be based on pq. The loss dominant assigned for a length-m sequence trapped by pq is η 4n(n + 1) := 2 sup r Bpq F 1 r 1 η 4n(n + 1) Φ enters with probability 1. First, we verify that the scheme enters with probability 1, no matter what distribution p P is in force. Every distribution p P is contained in at least one of the elements of the cover Q P. Recall the enumeration of P. Let p be centroid with the smallest index among all centroids in P whose zones contain p. Let Q be the zone of p . There is thus some γ > 0 such that the neighborhood around p given by I(p, γ) := {q : |p q|1 < γ} satisfies I(p, γ) Q. Note in particular that p is in the reach of p . With probability 1, sequences generated by p will have their empirical distribution within I(p, γ). For the proof of this well known result in the countably infinite alphabet case, see Chung (1961) or for an alternate approach, see Lemma 7. Next (8) will hold for all sequences whose empirical distributions that fall in I(p, γ) whose length n is large enough since C(p ) and ι(p ) do not change with n, the right hand side diminishes to zero polynomially with n while the left hand side diminishes exponentially to zero. 
Thus we conclude (8) will be satisfied with probability 1. Santhanam and Anantharam Next, (9) will also hold almost surely, for if q is the empirical probability of sequences generated by p, then (with a little abuse of notation) Dp /6) F 1 p (1 q with probability 1. Note that the quantity on the left is actually a random variable that is sequence dependent (since q is the empirical distribution of the sequence). Furthermore, we also have 2F 1 p (1 q sup r Bp F 1 r (1 q = log C(p ), where the first inequality follows since p is in the reach of p . Thus the scheme enters with probability 1 no matter which p P is in force. Probability of bankruptcy η. We now analyze the scheme. Consider any p P. Among sequences on which Φ has entered, we will distinguish between those that are in good traps and those in bad traps. If a sequence xn is trapped by p such that p Bp , p is a good trap. Conversely, if p / Bp , p is a bad trap. (Good traps) Suppose a length-n sequence xn is in a good trap, namely, it is trapped by a distribution p such that p Bp . Recall that the loss dominant assigned is 2gp η 4n(n + 1) 1 η 4n(n + 1) where the inequality follows because p is not deceptive, and p is within the reach of p . Therefore from (4), given any sequence in a good trap the scheme is bankrupted with conditional probability at most δ = η/2n(n + 1) in the next step. Therefore, summing over all n, sequences in good traps contribute at most η/2 to the probability of bankruptcy. (Bad traps) We will show that the probability with which sequences generated by p fall into bad traps η/2. Pessimistically, the conditional probability of bankruptcy in the very next step given a sequence falls into a bad trap is going to be bounded above by 1. Thus the contribution to bankruptcy by sequences in bad traps is at most η/2. Let q be any length-n empirical distribution trapped by p with reach ϵ such that p / B p. If p is far from p (because p is not in p s reach), namely J ( p, p) ϵ, but q is close to p (because q has to be in p s zone to be captured by it), namely | p q|1 < ϵ2(ln 2)2 then we would like q to be far from p. That is exactly what we obtain from the triangle-inequality like Lemma 6, namely that J (p, q) ϵ2 ln 2 Insurability and hence, for all q trapped by p that |p q|2 1 J 2(p, q)(ln 2)2 ϵ4(ln 2)4 256 = D2 p. We need not be concerned that the right side above depends on p, and there may be actually no way to lower bound the rhs as a function of just p. Rather, we take care of this issue by setting the entry point appropriately via (8). Thus, for p P , the probability length-n sequences with empirical distribution q is trapped by a bad p is, using (8) and (9) p |q p|2 D p and 2F 1 q (1 6 ) log C( p) (a) (C( p) 2) exp n D p (b) η(C( p) 2) 2C( p)ι( p)2n(n + 1) η 2ι( p)2n(n + 1), where the inequality (a) follows from Lemma 7 and (b) from (8). Therefore, the probability of sequences falling into bad traps η 2ι( p)2n(n + 1) η/2 since P p P 1 ι( p)2 P n 1 1 n(n+1) = 1. The theorem follows. 2 7. Concluding Remarks 7.1 Observations We make a few observations about insurability that, while evident from the proofs and approaches we have taken, are interesting in themselves and worth highlighting. Finite unions of insurable classes. The first is relative simple finite unions of insurable model classes are insurable in themselves. If P1, . . . ,Pm are m insurable model collections, then m i=1Pi is insurable. Countable unions of insurable classes. 
7. Concluding Remarks

7.1 Observations

We make a few observations about insurability that, while evident from the proofs and approaches we have taken, are interesting in themselves and worth highlighting.

Finite unions of insurable classes. The first is relatively simple: finite unions of insurable model classes are themselves insurable. If P_1, . . . , P_m are m insurable model collections, then ∪_{i=1}^m P_i is insurable.

Countable unions of insurable classes. The union of countably infinitely many classes of models, each of which is insurable, need not be insurable. As we have seen, the collection M_h of monotone distributions with entropy h is insurable for all h ∈ N. However, the collection M = ∪_{h ∈ N} M_h is not insurable.

Countable unions of tight sets. Note that from our arguments while proving the sufficiency criterion, it follows that every insurable model class is a countable union of tight sets. The converse, however, is not true. Note that M_h is a tight set for any h > 0, yet M = ∪_{h ∈ N} M_h is not insurable.

7.2 General Remarks

The loss-domination problem formulated and solved in this paper appears to be of natural interest. However, there are several features of the insurance problem formulated here that might appear troubling even to the casual reader.

In practice, an insured party entering into an insurance contract would expect some stability in the premiums that are expected to be paid. A natural direction for further research is therefore to study how the notion of insurability of a model class changes when one imposes restrictions on how much the premium set by the insurer can vary from period to period.

Another obvious shortcoming of the formulation of the insurance problem studied here is the assumption that the insured will accept any contract issued by the insurer. Since the insured in our model represents an aggregate of individual insured parties, a natural direction in which to make the framework more realistic would be to think of the insured parties as being of different types. This would in effect make the total realized premium from the insured (the aggregate of the insured parties) and the distribution of the realized loss in each period a function of the size of the premium per insured party set by the insurer in that period. Characterizing which model classes are insurable when the realized premium and the realized loss are functions of a set premium per insured party would be of considerable interest.

Both for the loss-domination problem and for the insurance problem, working with model classes for the loss sequence that allow for dependencies in the loss from period to period, for instance Markovian dependencies, would be another interesting direction for further research. Considering models with multiple, possibly competing insurers, as well as considering an insurer operating in multiple markets, where losses in one market can be offset by gains in another, also seem to be useful directions to investigate.

Acknowledgments

We are very grateful to the anonymous reviewers whose insightful comments have made this paper much better. We thank C. Nair (Chinese Univ of Hong Kong) and K. Viswanathan (HP Labs) for helpful discussions. N. Santhanam was supported by NSF Grants CCF-1065632, CCF-1018984 and EECS-1029081. V. Anantharam was supported by the ARO MURI grant W911NF-08-1-0233, Tools for the Analysis and Design of Complex Multi-Scale Networks, the NSF grants CNS-0910702 and EECS-1343398, the NSF Science & Technology Center grant CCF-0939370, Science of Information, Marvell Semiconductor Inc., and the U.C. Discovery program.

Lemma 4 Let p and q be probability distributions on N. Then

(1/(4 ln 2)) |p − q|_1² ≤ J(p, q) ≤ (1/ln 2) |p − q|_1.

If, in addition, r is a probability distribution on N, then

J(p, q) + J(q, r) ≥ (ln 2 / 8) J²(p, r).

Proof The lower bound in the first statement follows from Pinsker's inequality on KL divergences,

D(p ‖ (p+q)/2) ≥ (1/(2 ln 2)) |p − (p+q)/2|_1² = (1/(2 ln 2)) (1/4) |p − q|_1²,

and similarly for D(q ‖ (p+q)/2); summing the two bounds gives the claimed lower bound on J(p, q). See, e.g., Cover and Thomas (1991) for more details about Pinsker's inequality.

Since ln(1 + z) ≤ z for all z ≥ 0, the upper bound in the first statement follows as below:

J(p, q) ≤ (1/ln 2) Σ_{x : p(x) ≥ q(x)} p(x) (p(x) − q(x)) / (p(x) + q(x)) + (1/ln 2) Σ_{x′ : q(x′) ≥ p(x′)} q(x′) (q(x′) − p(x′)) / (p(x′) + q(x′)) ≤ (1/ln 2) |p − q|_1.

To prove the triangle-like inequality, note that

J(p, q) + J(q, r) ≥ (1/(4 ln 2)) (|p − q|_1² + |q − r|_1²) ≥ (1/(8 ln 2)) (|p − q|_1 + |q − r|_1)² ≥ (1/(8 ln 2)) |p − r|_1² ≥ (ln 2 / 8) J²(p, r),

where the last inequality follows from the upper bound on J(p, r) already proved. □
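As a numerical sanity check on Lemma 4 (not part of the paper), the snippet below evaluates the three inequalities on randomly drawn finite-support distributions, taking J(p, q) to be D(p ‖ (p+q)/2) + D(q ‖ (p+q)/2) with base-2 logarithms, which is the form the Pinsker step above suggests.

```python
import math
import random

def kl_bits(a, b):
    # KL divergence D(a || b) in bits; here b(x) > 0 wherever a(x) > 0 by construction
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def J(p, q):
    # Assumed form of J: D(p || (p+q)/2) + D(q || (p+q)/2) with base-2 logarithms
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return kl_bits(p, m) + kl_bits(q, m)

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q, r = (random_dist(8) for _ in range(3))
    d = l1(p, q)
    assert d ** 2 / (4 * math.log(2)) <= J(p, q) + 1e-9        # lower bound in Lemma 4
    assert J(p, q) <= d / math.log(2) + 1e-9                   # upper bound in Lemma 4
    assert J(p, q) + J(q, r) + 1e-9 >= (math.log(2) / 8) * J(p, r) ** 2   # triangle-like bound
print("Lemma 4 inequalities held on all random trials")
```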
Lemma 5 Let p and q be probability distributions on a countable set A with J(p, q) ≤ ε. Let p^N and q^N be the distributions over A^N obtained by i.i.d. sampling from p and q respectively (the distributions induced by the product measure). For any R_N ⊆ A^N and α > 0, if p^N(R_N) ≥ 1 − α, then

q^N(R_N) ≥ (1 − α − 2N³ √(4 ε ln 2)) (1 − 1/N).

Proof Let

B_1 = {i ∈ A : q(i) ≤ p(i)(1 − 1/N²)} and B_2 = {i ∈ A : p(i) ≤ q(i)(1 − 1/N²)}.

If J(p, q) ≤ ε, then from Lemma 4 we have

√(4 ε ln 2) ≥ √(4 J(p, q) ln 2) ≥ |p − q|_1.

It can then be easily seen that

p(B_1 ∪ B_2) ≤ 2N² √(4 ε ln 2) and q(B_1 ∪ B_2) ≤ 2N² √(4 ε ln 2),    (10)

since Σ_{x ∈ B_1} (p(x) − q(x)) ≥ p(B_1)/N², and similarly N² |p − q|_1 ≥ q(B_2) ≥ p(B_2).

Let S = A \ (B_1 ∪ B_2). We have for all x ∈ S,

q(x) ≥ p(x)(1 − 1/N²),    (11)

and from (10) we have p(S) ≥ 1 − 2N² √(4 ε ln 2). Now, we focus on the set S^N ⊆ A^N containing all length-N strings of symbols from S. Clearly

p^N(S^N) ≥ 1 − 2N³ √(4 ε ln 2).

Thus we have

p^N(R_N ∩ S^N) ≥ 1 − 2N³ √(4 ε ln 2) − α.

From (11), for all x^N ∈ S^N,

q^N(x^N) ≥ p^N(x^N)(1 − 1/N²)^N ≥ p^N(x^N)(1 − 1/N),

so that

q^N(R_N) ≥ q^N(R_N ∩ S^N) ≥ (1 − 2N³ √(4 ε ln 2) − α)(1 − 1/N). □

Lemma 6 Let ε_0 > 0. If |p_0 − q|_1 ≤ ε_0² (ln 2)²/16, then for all p ∈ P with J(p, p_0) ≥ ε_0, we have J(p, q) ≥ ε_0² ln 2/16.

Proof Since |p_0 − q|_1 ≤ ε_0² (ln 2)²/16, Lemma 4 implies that J(p_0, q) ≤ ε_0² ln 2/16. Further, Lemma 4 then implies that

J(p, q) + ε_0² ln 2/16 ≥ J(p, q) + J(p_0, q) ≥ J²(p, p_0) ln 2/8 ≥ ε_0² ln 2/8,

where the last inequality follows since J(p, p_0) ≥ ε_0. □

Lemma 7 Let p be any probability distribution on N. Let δ > 0 and let k ≥ 2 be an integer. Let X_1^n be a sequence generated i.i.d. with marginals p, and let q(X^n) be the empirical distribution of X_1^n. Then

p(|q(X^n) − p|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k) ≤ (2^k − 2) exp(−n δ²/18).

Remark There is a lemma that looks somewhat similar in Ho and Yeung (2010). The difference from Ho and Yeung (2010) is that the right side of the inequality above does not depend on p, and this property is crucial for its use here. □

Proof The starting point is the following result. Suppose p̃ is a probability distribution on N with finite support of size L. Then from Weissman et al. (2005), if we consider length-n sequences,

p̃(|q(X^n) − p̃|_1 ≤ t) ≥ 1 − (2^L − 2) exp(−n t²/2).    (12)

Since k ≥ 2, consider the distributions p′ and q′ with support A = {1, . . . , k − 1} ∪ {−1}, obtained as

p′(i) = p(i) for 1 ≤ i < k, and p′(−1) = Σ_{j ≥ k} p(j),

and similarly for q′. From (12),

p(|p′ − q′|_1 > δ/3) ≤ (2^k − 2) exp(−n δ²/18).

We will see that all sequences generated by p with empirical distributions q satisfying |p − q|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k are now mapped into sequences generated by p′ with empirical q′ satisfying

|p′ − q′|_1 > δ/3 and q′(−1) ≤ δ/3.    (13)

Thus, we will have

p(|q(X^n) − p|_1 > δ and 2 F_q^{-1}(1 − δ/6) ≤ k) ≤ p(|p′ − q′|_1 > δ/3 and q′(−1) ≤ δ/3) ≤ (2^k − 2) exp(−n δ²/18).

Finally we observe (13) as in Ho and Yeung (2010):

|p − q|_1 = Σ_{l=1}^{k−1} |p(l) − q(l)| + Σ_{j ≥ k} |p(j) − q(j)| ≤ Σ_{l=1}^{k−1} |p(l) − q(l)| + Σ_{j ≥ k} (p(j) + q(j)) ≤ Σ_{l=1}^{k−1} |p(l) − q(l)| + |p′(−1) − q′(−1)| + 2δ/3,

where the last inequality above follows from (4). Since p(l) = p′(l) and q(l) = q′(l) for all l = 1, . . . , k − 1, we have |p′ − q′|_1 ≥ |p − q|_1 − 2δ/3. If |p − q|_1 > δ in addition, then |p′ − q′|_1 > δ/3. □
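The heart of the proof of Lemma 7 is the map that lumps all mass at or above k into one extra symbol so that the finite-support bound (12) applies. The sketch below (illustrative only, with hypothetical toy numbers and the symbol −1 standing for the lumped tail, as in the proof) shows the map and the ℓ1 comparison behind (13).

```python
from collections import Counter

def collapse(dist, k):
    # Lump all probability mass at symbols >= k into a single extra symbol, written -1 here,
    # mirroring the construction of p' and q' in the proof of Lemma 7.
    out = {i: v for i, v in dist.items() if i < k}
    out[-1] = sum(v for i, v in dist.items() if i >= k)
    return out

def l1(a, b):
    support = set(a) | set(b)
    return sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in support)

# Toy example: a (normalized) geometric-like p, and an empirical q from a few samples.
raw = {i: 0.5 ** i for i in range(1, 30)}
Z = sum(raw.values())
p = {i: v / Z for i, v in raw.items()}
samples = [1, 1, 2, 3, 1, 2, 4, 1, 6, 2, 1, 3]
q = {i: c / len(samples) for i, c in Counter(samples).items()}

k = 6
p_k, q_k = collapse(p, k), collapse(q, k)
# Collapsing can shrink the l1 distance by at most twice the smaller lumped tail mass,
# which is the step giving |p' - q'|_1 >= |p - q|_1 - 2*delta/3 in the proof.
print(l1(p, q), l1(p_k, q_k), q_k[-1])
assert l1(p_k, q_k) >= l1(p, q) - 2 * min(p_k[-1], q_k[-1]) - 1e-12
```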
References

M. Asadi, R. Paravi, and N. Santhanam. Stationary and transition probabilities in slow mixing, long memory Markov processes. IEEE Transactions on Information Theory, September 2014.

S. Asmussen and H. Albrecher. Ruin Probabilities. World Scientific Publishing Company, 2nd edition, 2010.

P. Billingsley. Probability and Measure. Wiley Interscience, 1995.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.

K.L. Chung. A note on the ergodic theorem of information theory. Annals of Mathematical Statistics, 32:612–614, 1961.

T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.

H. Cramér. Historical review of Filip Lundberg's work on risk theory. Skandinavisk Aktuarietidskrift (Suppl.), 52:6–12, 1969. Reprinted in The Collected Works of Harald Cramér, edited by Anders Martin-Löf, 2 volumes, Springer, 1994.

J. Dugundji. Topology. Allyn and Bacon Inc., Boston, 1970.

K. Englund and A. Martin-Löf. Statisticians of the Centuries, chapter Ernst Filip Oskar Lundberg, pages 308–311. New York: Springer, 2001.

B. Fittingoff. Universal methods of coding for the case of unknown statistics. In Proceedings of the 5th Symposium on Information Theory, pages 129–135. Moscow-Gorky, 1972.

P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

S. Ho and R. Yeung. On information divergence measures and joint typicality. IEEE Transactions on Information Theory, 56(12):5893–5905, 2010.

J.C. Kieffer. A unified approach to weak universal source coding. IEEE Transactions on Information Theory, 24(6):674–682, November 1978.

D. McAllester. A PAC-Bayesian tutorial with a dropout bound. Available from arXiv, id 1307.2118, 2013.

J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, July 1984.

B. Ryabko. Compression-based methods for non-parametric online prediction, regression, classification and density estimation. Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday, pages 271–288, 2008.

N. Santhanam and V. Anantharam. Agnostic insurance tasks and their relation to compression. In International Conference on Signal Processing and Communications (SPCOM), 2012.

Y.M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, 1987.

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. Weinberger. Universal discrete denoising: known channel. IEEE Transactions on Information Theory, 51(1):5–28, 2005. See also HP Labs Tech Report HPL-2003-29, Feb 2003.