# compression_generalization_and_learning__472d7f1f.pdf

Journal of Machine Learning Research 24 (2023) 1-74 Submitted 5/22; Revised 9/23; Published 10/23

Compression, Generalization and Learning

Marco C. Campi marco.campi@unibs.it Department of Information Engineering University of Brescia via Branze 38, 25123 Brescia, Italy

Simone Garatti simone.garatti@polimi.it Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico di Milano piazza L. da Vinci 32, 20133 Milano, Italy

Editor: Manfred Warmuth

A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted that this observation brings in extra information and, in learning theory, this corresponds to misclassiﬁcation, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control on the probability of change of compression (which maps into the statistical risk in learning applications). Under suitable conditions, the cardinality of the compressed set is shown to be a consistent estimator of the probability of change of compression (without any upper limit on the size of the compressed set); moreover, unprecedentedly tight ﬁnite-sample bounds to evaluate the probability of change of compression are obtained under a generally applicable condition of preference. All results are usable in a fully agnostic setup, i.e., without requiring any a priori knowledge on the probability distribution of the observations. Not only these results oﬀer a valid support to develop trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.

Keywords: Compression Schemes, Statistical Risk, Statistical Learning Theory, Scenario Approach

1. Introduction

Compression is an established topic in theoretical learning, and various generalization bounds have been proven for compression schemes.

According to a deﬁnition introduced in Littlestone and Warmuth (1986), a compression scheme consists of i. a compression function c, which maps any list of observed examples S = ((x1, y1), . . . , (x N, y N)) (xi is called an instance and yi a label ) into a sub-list c(S), and ii. a reconstruction function ρ, which maps any list of examples S into a classiﬁer ρ(S). An important feature of a classiﬁer is its risk and, in the context of compression schemes, one is interested in the risk associated to the classiﬁer ρ(c(S)). The concept of risk ﬁnds a natural deﬁnition in statistical learning where one assumes that examples (x, y) are gener-

c 2023 Marco C. Campi and Simone Garatti.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0605.html.

M.C. Campi and S. Garatti

ated according to a random mechanism: the risk of a generic classiﬁer f is then deﬁned as R(f) = P{f(x) = y} (throughout, boldface indicates random quantities). In loose terms, most results in compression schemes establish that low cardinality of the compression implies low risk for the ensuing classiﬁer. A bit more precisely, let S = ((x1, y1), . . . , (x N, y N)) be a list of independent random examples all sharing the same distribution (which coincides with the distribution of (x, y) in the deﬁnition of risk). In the example-consistent framework (i.e., the bound on the risk is only given for lists of examples S for which ρ(c(S)) returns the corresponding label yi for any xi in S in this case ρ(c(S)) is said to be consistent with S) and under the assumption that the maximum cardinality of c(S) is bounded by an integer d (d is called the size of the compression scheme), Littlestone and Warmuth (1986) and Floyd and Warmuth (1995) establish results of the type: with high probability 1 δ with respect to the generation of the list of examples S, if ρ(c(S)) is consistent with S, then the risk of c(S) is below a known bound that depends on N, d and δ only (in particular, the bound does not depend on the distribution by which examples are generated). Results in this vein have been subsequently extended to the non-consistent framework and to compression schemes with unbounded size, in which context the best known results are given in Graepel et al. (2005).

In a series of recent papers, compression schemes have been studied under a stability condition, a notion that is natural in many contexts and that has its roots in Vapnik and Chervonenkis (1974). For the example-consistent framework and compression schemes with bounded size, Bousquet et al. (2020) succeeded in removing a log(N) term in the expression of the bound for the risk as compared to the formulation given in the above referenced papers; when applied to Support Vector Machines, this result resolves a long-standing issue that was posed in and remained open since Vapnik and Chervonenkis (1974). Later, the scope of Bousquet et al. (2020) has been signiﬁcantly broadened by Hanneke and Kontorovich (2021), where the non-consistent framework with no upper bounds on the size of the compression scheme has been considered. Since the stable case is most relevant to the present paper, we shall come back to these latter contributions with a more detailed discussion and comparison at the end of Section 3.

In the present paper, we make a paradigm shift: since we are interested in compression as a general tool applicable across various domains, we are well-advised to adopt a purist approach in which compression functions are studied in isolation (without a reconstruction function). Our essential goal is to study how the probability of change of compression relates to the size of the compressed set. The corresponding results can be applied to supervised learning, unsupervised learning and, in addition, to any other contexts where compression functions are in use. Our ﬁndings are summarized at the end of this section, we start with introducing the formal elements of the problem.

1.1 Mathematical setup and notation

Examples z are elements from a set Z (for instance, in supervised learning z are pairs (x, y)). The compression functions we study are permutation invariant. Correspond-

Compression, Generalization and Learning

ingly, given any n = 0, 1, 2, . . . and a list of examples (z1, . . . , zn),1 we introduce the associated multiset written as ms(z1, . . . , zn), where the operator ms removes the ordering in the list, while maintaining repetitions. The set operations , , \ (union, intersection, and set diﬀerence) are easily extended to multisets using the notion of multiplicity function µU for a multiset U, which counts how many times each element of Z occurs in U. Then, µU U (z) = µU(z) + µU (z), µU U (z) = min µU(z), µU (z) , and µU\U (z) = max 0, µU(z) µU (z) . Moreover, U U means that µU(z) µU (z) for all z, and |U| stands for the cardinality of a multiset where each example is counted as many times as is its multiplicity. Throughout this paper, multisets have always ﬁnite cardinality and, any time a multiset is introduced, it is tacitly assumed that it has ﬁnitely many elements. A compression function c is a map from any multiset of examples U to a submultiset: c(U) U. We write c(z1, . . . , zn) as a shortcut for c(ms(z1, . . . , zn)). Also, given a multiset U and one more example z, c(U, z) stands for c(U ms(z)). Similar notations apply to other maps having a multiset as argument. Throughout, an example is modeled as a realization of a random element deﬁned over a probability space (Ω, F, P); moreover, a list of n examples is the realization of the ﬁrst n elements of an independent and identically distributed (i.i.d.) sequence z1, z2, . . .. A training set is a multiset generated from a list of observed examples. When dealing with problems in machine learning, a learning algorithm A is a map from training sets to a hypothesis h in a set H (in supervised binary classiﬁcation, h is a concept; in supervised learning with continuous label, h is a predictor; in unsupervised learning, h can, e.g., be a collection of clusters; etc.). According to the above notation, we write A(z1, . . . , z N) to denote the hypothesis generated by A when the input is the training set ms(z1, . . . , z N). We use a {0, 1}-valued function ℓ(h, z) to indicate whether or not a hypothesis h is appropriate for an example z: ℓ(h, z) = 0 signiﬁes that h is appropriate for z, while ℓ(h, z) = 1 corresponds to inappropriateness (for instance, in supervised classiﬁcation, ℓ(h, z) = 1h(x) =y, where 1 is the indicator function). The statistical risk of h is R(h) = P{ℓ(h, z) = 1}, where z is a random element distributed as each zi.

1.2 Main contributions

The contributions of this paper are summarized in the following three points.

(i) Under the property of preference,2 Theorem 4 establishes a new bound to the probability of change of compression as a function of the cardinality of the compressed multiset. For a ﬁnite size of the training set, the bound is informative and useful in applications (see Figure 2). When the size of the training set N tends to inﬁnity, the bound tends to the ratio k/N (where k is the cardinality of the compressed multiset), uniformly in k {0, 1, . . . , N} (the fact that the range for k arrives at N means that the compressed multiset has no upper limits other than the size of the training set itself), see Proposition 8. No lower bounds to the probability of change of compression are possible under the sole preference property. Under an additional non-associativity property and a condition of non-concentrated prob-

1. Note that here symbol n indicates the size of a generic list, while N is in use throughout to indicate the actual number of observed examples. The distinction between the two is necessary to accommodate various needs in theoretical developments. 2. While stated diﬀerently, the preference property is equivalent to the property of stability, see Section 2 for an explanation of our terminology.

M.C. Campi and S. Garatti

abilistic mass, Theorem 7 establishes a lower bound (see Figure 3). This lower bound also converges to k/N uniformly in k {0, 1, . . . , N}. Hence, under the assumptions of Theorem 7, the probability of change of compression is in sandwich between two bounds that merge one on top of the other as N (see Proposition 8). This entails that the cardinality of the compressed multiset is a highly informative statistics to evaluate the probability of change of compression.

(ii) In Section 3, the results in (i) are put at work to study classical compression schemes in the presence of a reconstruction function. It is shown that a preferent compression scheme augmented with the examples that are misclassiﬁed preserves the preference property. From this, one ﬁnds that the risk can be evaluated without resorting to an incremental approach (as it was customary in previous contributions) in which the empirical risk is incremented with an estimate of the mismatch between empirical and actual risk. The resultant evaluations of the risk, established in Theorem 17, are unprecedentedly sharp. Under the additional conditions of non-associativity and of non-concentrated mass, if certain coherence properties hold, then one obtains bounds on the risk that are valid both from above and from below, which provides a statistically consistent evaluation of the risk. Empirical demonstrations complement the theoretical study and show that the bounds cover tightly the actual stochastic dispersion of the risk.

(iii) As examples of application, the achievements in point (ii) are applied in Section 4 to various support vector methods, including the Support Vector Machine (SVM) and the Support Vector Regression (SVR), and to the Guaranteed Error Machine (GEM). This study shows that, in various learning contexts, one can identify statistics of the data from which consistent estimates of the risk can be obtained without resorting to validation or testing.

1.3 Relation with previous contributions and a more general perspective on this work

The scientiﬁc background in which this work has matured lies in some ﬁfteen years of work by its authors in the ﬁeld of data-driven optimization. In a group of papers, whose forefathers are Calaﬁore and Campi (2005, 2006); Campi and Garatti (2008) and that include Campi and Garatti (2011); Campi and Car e (2013); Car e et al. (2015); Campi et al. (2018); Garatti et al. (2019, 2023); Garatti and Campi (2023), they laid with co-authors the foundations of the so-called scenario approach , a vast body of methods and algorithms to obtain data-driven, theoretically-certiﬁed, solutions to uncertain optimization problems. The scenario approach has spurred quite a bit of work also done by others, as witnessed by a large number of theoretical contributions, of which we here only mention the most signiﬁcant ones: Welsh and Rojas (2009); Welsh and Kong (2011); Pagnoncelli et al. (2012); Schildbach et al. (2013, 2014); Margellos et al. (2014, 2015); Zhang et al. (2015); Esfahani et al. (2015); Crespo et al. (2015); Grammatico et al. (2016); Crespo et al. (2016); Lacerda and Crespo (2017); Margellos et al. (2018); Crespo et al. (2019); Falsone et al. (2019). Recently, the studies on scenario optimization have culminated in the works Campi and Garatti (2018); Garatti and Campi (2022), which are conceptually linked to the present contribution by

Compression, Generalization and Learning

the fact that the generalization properties of the solution are evaluated from an observable called complexity (complexity parallels the size of the compressed multiset of this paper). As compared with all this previous literature, the present contribution introduces two major elements of novelty:

(a) compression takes center stage, beyond any contextualization. By this purist approach, we aim to lay the groundwork for a new theory of wide applicability, to machine learning in primis, but also across the other multiple data science ﬁelds in which compression ﬁnds application;

(b) by a novel, powerful, theoretical apparatus, this paper establishes bounds on the risk that fare beyond the domain of previous contributions; in particular, they allow one to drop any condition of non-degeneracy, which was a standing and limiting assumption in previous works, e.g., Campi and Garatti (2018); Garatti and Campi (2022).

We hope that the ﬁndings presented in this paper will open a new era of exploration and discovery in an important subarea of data-driven methods that is centered around the notion of compression. As previously mentioned, we here already consider support vector methods and improve the results in Campi and Garatti (2021) by eliminating all assumptions on the distribution of the examples for the problem of obtaining upper bounds on the risk. We also study a generalized version of the so-called Guaranteed Error Machine, which was introduced in Campi (2010) under a limiting condition on the complexity of the classiﬁer. Beyond the applications discussed in this paper, we expect that our results will prove useful in various ﬁelds where the scenario approach is applicable (including robust optimization, with its multiform applications to diverse contexts). We feel like to also mention that the authors of this paper are at present actively exploring a wide range of example-driven computer science algorithms in which the application of the compression theory of this paper is made possible through an importance procedure for example selection, even in cases where the original algorithm lacks any compression (see Paccagnan et al. (2023) for a study in the context of machine learning). For a broader discussion on the increasing importance of establishing well-founded risk theories for data-driven decision processes, particularly in today s time in which the use of data is becoming pervasive, the reader is referred to the recent position paper Campi et al. (2021).

1.4 Structure of the paper

The main results on compression schemes are presented in a uniﬁed treatment in the next Section 2, which also includes a discussion on the asymptotic behavior of the bounds. Section 3 presents a rapprochement with classical compression schemes in statistical learning that incorporate a reconstruction function, along with some other more general results useful for machine learning problems. Speciﬁc machine learning schemes are considered in Section 4. The proofs of the main results are deferred till Section 5.

M.C. Campi and S. Garatti

2. New generalization results for compression schemes

Our interest lies in quantifying the probability with which a change of compression occurs. As it was mentioned in the introduction, and it will be further explored in Section 4, this probability has important implications in relation to learning schemes. We start with a formal deﬁnition of probability of change of compression.

Deﬁnition 1 (probability of change of compression) The probability of change of compression is deﬁned as

φ(z1, . . . , zn) = P{c(c(z1, . . . , zn), zn+1) = c(z1, . . . , zn)|z1, . . . , zn}.3

On the right-hand side a new element zn+1 is added to the compression of ms(z1, . . . , zn) 4

and it is tested whether this makes the compression change. This gives an event, and our interest lies in the probability of this event. However, in view of its use in applications, what matters is not the probability tout court, rather, we take a more ﬁne-grained standpoint by conditioning on z1, . . . , zn, so as to capture the variability of the probability of change of compression as determined by the examples. This makes φ(z1, . . . , zn) into a random variable. In what follows, we shall often use the symbol φn as a shorthand for the random variable φ(z1, . . . , zn).

Before delving into the mathematical developments, we are well-advised to digress a moment to discuss the nature of the results we mean to reach. Let N be the size of the multiset at hand, the one of which we want to study the probability of change of compression. φN has a probability distribution of its own. Arguably, this distribution may vary signiﬁcantly with the distribution by which the zi s are generated. This fact has an important implication: any result that describes the distribution of φN without referring to some prior knowledge on the distribution of the zi s is bound to stay on the conservative side and is therefore poorly informative. While this may seem to set fundamental limitations to obtaining distribution-free results on φN (i.e., results valid without any a priori knowledge on the distribution of the zi s), nevertheless it turns out that this conclusion is hasty and incorrect: indeed, one can instead move along a diﬀerent path and take a bi-variate standpoint, as next explained. Let |c(z1, . . . , z N)| be the cardinality of the compressed multiset c(z1, . . . , z N). We consider the pair (|c(z1, . . . , z N)|, φN) and identify conditions of general interest under which its bi-variate distribution concentrates in a slender, lenticular-shaped, region (see Figure 3). The implications are quite notable: within the lenticular-shaped region, the distribution of (|c(z1, . . . , z N)|, φN) does exhibit a strong variability depending on the problem (which also translates into the variability of the marginal distribution of φN). However, given the realization of z1, . . . , z N at hand, one can compute the value of |c(z1, . . . , z N)| and intersect the vertical line corresponding to this value with the lenticularshaped region to obtain an interval for the probability of the change of compression. The so-formulated evaluation is tight and informative even for small values of N and oﬀers an useful assessment tool for applications. Importantly, the corresponding theory retains the

3. This means that φ(z1, . . . , zn) is any version of the conditional probability on the right-hand side. 4. It is not unimportant that zn+1 is added to the compression, not to the initial multiset.

Compression, Generalization and Learning

characteristic of being distribution-free. This ﬁnding is stated below as Theorem 7 and it holds under two properties called preference (Property 2) and non-associativity (Property 5), besides a condition that rules out concentrated masses (Property 6). Interestingly, under the sole preference property, only the lower bound of the lenticular-shaped region is lost while the upper bound maintains its validity (Theorem 4), which provides a result broadly applicable to evaluate an upper limit on the probability of change of compression as a function of the cardinality of the compressed multiset.

Moving towards the mathematical results, we ﬁrst formalize the concept of preference.

Property 2 (preference) For any multisets U and V such that V U, if V = c(U), then V = c(U, z) for all z Z.

Hence, if a sub-multiset is not chosen as the compressed multiset, then it cannot become the compressed multiset at a later stage after augmenting the multiset with a new example.5

The following lemma provides a useful reformulation of the preference property.

Lemma 3 A compression function c satisﬁes the preference property if and only if c(V ) = c(U) for all multisets U, V such that c(U) V U.

Proof Assume c satisﬁes the preference property and let z1, . . . , zn be the elements in U \V where U and V are multisets such that c(U) V U. Let S0 = V and Si = Si 1 ms(zi) for i = 1, . . . , n so that Sn = U. Now suppose that c(V ) = c(U). Since c(S0) = c(V ) and c(Sn) = c(U), then it must be that c(Si 1) = c(U) and c(Si 1, zi) = c(U) for some i {1, . . . , n}. However, since c(U) Si 1, this contradicts the assumption that c satisﬁes the preference property. For the other direction, assume that the preference property does not hold. Then, we can ﬁnd U, V, z such that V U and c(U) = V = c(U, z). This implies c(U, z) U U ms(z) and c(U) = c(U, z), contradicting the statement that c(V ) = c(U ) for all multisets U , V

such that c(U ) V U .

An immediate consequence of Lemma 3 is that c c(U) = c(U) whenever c satisﬁes the preference property.

The statement of our ﬁrst theorem is better enunciated by introducing the following functions Ψk,δ : (0, 1) R, which are indexed by k = 0, 1, . . . , N 1 and by the conﬁdence parameter δ (0, 1):

Ψk,δ(α) = δ

N k (1 α) (N m) .

5. This property is not new and is called stability in the literature, see, e.g., Bousquet et al. (2020) where the formulation is slightly diﬀerent but, provably, equivalent. Our introducing a change of terminology is in the interest of clarity as we believe that stability may convey the erroneous idea of absence of change or, what is germane to the ﬁeld of systems theory, the idea that a small input variation can only cause a small output variation. We chose preference because we feel that this term rightly conveys the idea that a multiset cannot be selected and hence preferred at a later stage if it had not been preferred earlier when it was already available.

M.C. Campi and S. Garatti

Figure 1: (a) Function Ψk,δ(α): it starts below δ when α 0 and tends to + when α 1; (b) Function Ψk,δ(α): it tends to + as α 1 or α and takes a value below 1 in a point in ( , 1).

For any k and any δ, the equation Ψk,δ(α) = 1 admits one and only one solution in (0, 1). Indeed, Ψk,δ(α) is strictly increasing, continuous, and Ψk,δ(α) δ < 1 when α 0, while it grows to + when α 1 (see Figure 1(a)). Deﬁne6

( solution to Ψk,δ(α) = 1, k = 0, 1, . . . , N 1; 1, k = N. (1)

Theorem 4 Assume the preference Property 2. For any δ (0, 1), it holds that

P{φN > εk} δ, (2)

where εk is the random variable obtained by the composition of k := |c(z1, . . . , z N)| (the cardinality of the multiset c(z1, . . . , z N)) with the function εk given in (1) (in other words, it is εk evaluated at the random value k).

Proof The proof of Theorem 4 is given in Section 5.2.

In the theorem, parameter δ is called the conﬁdence parameter and it is normally selected to a very small value, say 10 5 or 10 6. The theorem claims that φN, the probability of change of compression, is upper-bounded, with high conﬁdence 1 δ, by εk, which is a known, deterministic, function εk evaluated in correspondence of the cardinality k of the compressed multiset. Figure 2 visualizes function εk for N = 2000 and various values of δ.

In machine learning applications, the interest of Theorem 4 stems from the fact that a change of compression occurs when an example is misclassiﬁed, or mispredicted. Hence,

6. εk can be computed via a bisection algorithm. An eﬃcient and ready-to-use MATLAB code is provided in Appendix B.1.

Compression, Generalization and Learning

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

Figure 2: Curve εk against the value of k for N = 2000 and various values of δ (10 3, 10 6, and 10 9). As established in Theorem 4, for preferent compression functions this curve sets an upper bound (valid with conﬁdence 1 δ) on the probability of change of compression as a function of the cardinality k of the compressed multiset.

the theorem allows one to upper-bound the probability of misclassiﬁcation, or misprediction, by using an observable, the cardinality of the compressed multiset. Importantly, the evaluation holds independently of the distribution of the zi s, hence the user can apply the result without positing any, possibly hazardous, conjecture on how data are generated. An ample space to the use of Theorem 4 in machine learning problems is given in Sections 3 and 4.

Lower and upper bounds on φN are established under additional conditions, the nonassociativity Property 5 and the Property 6 of non-concentrated mass, as described in the following.

Property 5 (non-associativity) For any n 0 and p 1,

P E1 \ E2 = 0,

where E1 = {c(z1, . . . , zn, zn+i) = c(z1, . . . , zn), i = 1, . . . , p},

E2 = {c(z1, . . . , zn, zn+1, . . . , zn+p) = c(z1, . . . , zn)}.

In words, the non-associativity property can be phrased as follows: if the compression does not change adding elements one at a time, then it does non change when they are added altogether, with the possible exception of an event whose probability is zero.7 The reader

7. Non-associativity is naturally satisﬁed in many contexts, including all cases in which the compression singles out the relevant observations in a robust optimization process (this is because adding multiple

M.C. Campi and S. Garatti

may have noticed that Property 5 is given in probability unlike the preference Property 2, which was required to hold for any choice of the examples. The reason is that requiring the validity of the non-associativity property for all examples can be restrictive in some applications.

Property 6 (non-concentrated mass)

P{zi = z} = 0, z Z.

The property of non-concentrated mass simply requires that any z can only be drawn with probability zero and, hence, it excludes with probability 1 that the same z occurs twice or more times in a training set.

Theorem 7 is stated by means of the following functions Ψk,δ : ( , 1) R indexed by k = 0, 1, . . . , N and by the conﬁdence parameter δ (0, 1):

for k = 0, . . . , N 1, let

Ψk,δ(α) = δ 2N

N k (1 α) (N m) + δ

N k (1 α)m N, (3)

while, for k = N, let

ΨN,δ(α) = δ 6N

In Appendix A it is shown that, for k = 0, 1, . . . , N 1, equation Ψk,δ(α) = 1 admits two and only two solutions in ( , 1), say αk and αk, with αk < k

N < αk (see Figure 1(b) for a graphical visualization of Ψk,δ(α), k < N). Instead, equation ΨN,δ(α) = 1 admits only one solution in ( , 1), which is denoted by αN (this is easy to verify because ΨN,δ(α) is strictly decreasing and it tends to 0 as α 1 while it grows to + as α ). Deﬁne8

εk = max{0, αk}, k = 0, 1, . . . , N, (4)

αk, k = 0, 1, . . . , N 1; 1, k = N. (5)

Theorem 7 Assume the preference Property 2, the non-associativity Property 5 and the non-concentrated mass Property 6. For any δ (0, 1), it holds that

P{εk φN εk} 1 δ, (6)

constraints that do not change the solution when considered in isolation viz., the current solution is feasible for the new constraints does not change the solution when all the constraints are introduced simultaneously). See Section 4 for examples in the machine learning context. 8. See Appendix B.2 for a MATLAB code that eﬃciently computes εk and εk.

Compression, Generalization and Learning

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

Figure 3: Region delimited by εk and εk for N = 2000 and various values of δ (10 3, 10 6, and 10 9). Under the assumptions of Theorem 7, this region contains with conﬁdence 1 δ the probability of change of compression as a function of the cardinality of the compressed multiset.

where εk is the random variable obtained by the composition of k := |c(z1, . . . , z N)| with the function εk given in (4) and εk is the random variable obtained by the composition of k with the function εk given in (5).

Proof The proof of Theorem 7 is given in Section 5.3.

With the additional properties of non-associativity and non-concentrated mass, Theorem 7 assigns upper and lower bounds for the change of compression, as visualized in Figure 3. Strictly speaking, these additional requirements do depend on the underlying probability by which examples are generated and, hence, they cannot be labeled as being distribution-free . Nonetheless, in various situations, Theorem 7 becomes applicable under a very limited knowledge on the distribution of the zi s, and we shall see examples of this in Section 4. We also note that dropping the assumption of non-concentrated mass makes Theorem 7 false, as shown in the following counterexample.9 Suppose that one single element z has probability 1 to be selected (so that the non-concentrated mass property is violated), ﬁx any integer M (for example M = 100) and consider the compression function that returns the initial multiset any time this multiset includes the element z less than M times, while it trims the number of elements z to M when z appears more than M times in the initial multiset. It is easily seen that the preference and non-associativity properties hold. On the other hand, for N M the probability of change of compression is zero with probability 1, so that no meaningful lower bounds can be assigned in this example.

9. While the non-concentrated mass property is easy to state, which led us to prefer this formulation, Theorem 7 can still be proven under a slightly weaker condition, as brieﬂy discussed in Remark 30 after the proof of the theorem.

M.C. Campi and S. Garatti

0 500 1000 1500 2000 0

0 1000 2000 3000 4000 0

0 2000 4000 6000 8000 0

Figure 4: Graph of εk, and εk as functions of k for δ = 10 6 and N = 2000, 4000, and 8000.

2.1 Asymptotic behavior of εk and εk, εk

The purpose of this section is to establish explicit lower and upper bounds on εk and εk, εk able to reveal the dependencies of these quantities on k, N, and δ, and also to pinpoint convergence properties as N tends to inﬁnity. The main result is in Proposition 8, followed by some comments. We advise the reader that the explicit bounds in Proposition 8 are in use to clarify various dependencies, but they are not meant for practical computation since they lead to conservative results if used in place of the numerical procedures given in Appendix B.

Proposition 8

ln(k + 1) + 4 + 2

ln(k + 1) + 2 3

Moreover, it holds that

Proof The proof of Proposition 8 is given in Section 5.4.

In both (7) and (8), the dependence on δ is inversely logarithmic, which shows that conﬁdence is cheap : very small values of δ can be enforced without signiﬁcantly aﬀecting the results and, thereby, the width of the interval [εk, εk] (see again Figure 3). For any ﬁxed k, we see that εk and εk, εk tend to k/N as O(1/N), while for k that grows at the same rate as N (say k/N = constant) εk and εk, εk converge towards k/N as O( p

N). This is just marginally slower than the convergence rate for the law of large numbers, as given by the central limit theorem. The evolution of εk, εk as N grows can be seen in Figure 4.

Reading the results of this section in the light of Theorem 4, one concludes that, for all compression functions satisfying the preference Property 2, the bi-variate distribution of k = |c(z1, . . . , z N)| and φN all lies below the line k/N plus an oﬀset whose size goes to zero as O( p

N) with the exception of a slim tail whose probabilistic mass is no more than δ. If, additionally, Properties 5 and 6 hold, Theorem 7 shows that the bi-variate

Compression, Generalization and Learning

distribution of k = |c(z1, . . . , z N)| and φN all lies in a strip around k/N whose size goes to zero as O( p

N) but a slim tail. In this latter case, Theorem 7 also carries the very important implication that the ratio k/N is a strongly consistent estimator of φN irrespective of the problem at hand (it can be proven that Theorem 7 and Proposition 8 together imply that |k/N φN| converges to zero both in the mean square sense and almost surely).

3. Compression schemes for machine learning

The aim of this section is to connect the theory of Section 2 to that of statistical risk in learning algorithms. Our ﬁndings will be compared with existing results at the end of this section. Refer to Section 1.1 for the mathematical setup and notation.

Given a learning algorithm A, suppose that there exists a compression function c that ties in with the loss function ℓaccording to the following property.

Property 9 (coherence part I) For any n 0 and any choice of z1, . . . , zn, zn+1 Z, if ℓ A(z1, . . . , zn), zn+1 = 1, then c c(z1, . . . , zn), zn+1 = c(z1, . . . , zn).

Under Property 9, R(A(z1, . . . , z N)) φN holds with probability 1,10 and Theorem 4 can be used to bound the risk of the hypothesis returned by the learning algorithm, as speciﬁed in the following theorem.

Theorem 10 Given a learning algorithm A, suppose that there exists a compression function c that satisﬁes the coherence part I Property 9. Assume the preference Property 2. For any δ (0, 1), it holds that

P n R A(z1, . . . , z N) > εk o δ,

where k = |c(z1, . . . , z N)| and εk is given in (1).

We next provide a suﬃcient condition for Property 9 to hold for the case in which the learning algorithm can be reconstructed from the compressed multiset.

Deﬁnition 11 (reconstruction function) Given a learning algorithm A and a compression function c, a reconstruction function ρ is a map from multisets to hypotheses such that ρ c(z1, . . . , zn) = A(z1, . . . , zn) for any multiset ms(z1, . . . , zn).

We also need the following property, which requires that the examples in the training set for which the hypothesis chosen by the learning algorithm A is inappropriate are included in the compressed multiset.

Property 12 (inclusion) For any multiset ms(z1, . . . , zn) and for any i = 1, 2, . . . , n, if ℓ A(z1, . . . , zn), zi = 1, then zi appears in c(z1, . . . , zn) the same number of times as it appears in ms(z1, . . . , zn).

10. The reason why the inequality holds with probability 1 and not always is that φN is just a version of the conditional probability in Deﬁnition 1 and various versions can diﬀer over events having probability zero.

M.C. Campi and S. Garatti

The following lemma shows that inclusion implies coherence part I whenever A admits a reconstruction function for the given c.

Lemma 13 Given a learning algorithm A and a compression function c satisfying the inclusion Property 12, if there exists a reconstruction function for (A, c), then the coherence part I Property 9 holds.

Proof We prove that, under inclusion and existence of a reconstruction function, absence of change of compression is necessarily associated to appropriateness (contrapositive of coherence part I Property 9). Let U := ms(z1, . . . , zn) and suppose that for a new zn+1 the compression does not change, i.e., c c(U), zn+1 = c(U). (9)

Applying ρ to both sides of (9) and using the deﬁnition of reconstruction function gives

A c(U), zn+1 = A(U). (10)

On the other hand, (9) implies that zn+1 appears in c c(U), zn+1 as many times as it does in c(U). Thus, zn+1 appears in ms c(U), zn+1 one more time than it does in c c(U), zn+1 . Since A and c satisfy the inclusion Property 12, it follows that

ℓ A c(U), zn+1 , zn+1 = 0, (11)

and, substituting (10) in (11) gives

ℓ A(U), zn+1 = 0.

Hence, it remains proven that

c c(U), zn+1 = c(U) = ℓ A(U), zn+1 = 0,

which is the contrapositive of Property 9.

Remark 14 If, for any multiset, an algorithm generates a hypothesis that is appropriate for all the examples in the multiset (i.e., the hypothesis is consistent with the multiset), then the inclusion Property 12 is automatically satisﬁed and, hence, the existence of a reconstruction function implies the coherence part I Property 9. The inclusion Property 12 provides a condition for the coherence part I property to hold when the algorithm is allowed to generate hypotheses without appropriateness requirements on the training set. Notice also that the inclusion property alone (without a reconstruction function) does not imply the coherence part I property. For example, consider points zi R and let11

A(z1, . . . , zn) = ( , second largest zi] and c(z1, . . . , zn) = max{z1, . . . , zn},12 and say that

11. If two or more zi attain the maximum, then the second largest equals the largest. 12. For a training set that has only one element or it is empty, let, e.g., the algorithm return the whole real line and the compression coincide with the training set.

Compression, Generalization and Learning

A(z1, . . . , zn) is appropriate for z if z A(z1, . . . , zn). Here, one can verify that the inclusion property holds, while no reconstruction function exists. If a new point zn+1 falls in between the second largest zi and max{z1, . . . , zn}, then A(z1, . . . , zn) is not appropriate for this zn+1, but the compression does not change, that is, the coherence part I property does not hold.

The statement of Lemma 13 does not admit a converse: under the existence of a reconstruction function, coherence part I does not imply inclusion. To see this, let examples zi be points of R and consider A(z1, . . . , zn) = [second largest zi, + ), while c(z1, . . . , zn) = second largest zi.13 Then, A(z1, . . . , zn) can be reconstructed from c(z1, . . . , zn) and, when a point for which A(z1, . . . , zn) is inappropriate (i.e., the point does not belong to A(z1, . . . , zn)) is added to c(z1, . . . , zn), the compression becomes the newly added point (which is now the second largest14) so that the coherence part I property holds. On the other hand, if there are among z1, . . . , zn some examples strictly smaller than the second largest zi, then A(z1, . . . , zn) is inappropriate for all of these examples while these examples are not in the compression (and, hence, the inclusion property does not hold).

Interestingly, the properties of inclusion and coherence part I become equivalent under preference, a fact that is stated in the next lemma.

Lemma 15 Consider a learning algorithm A and a compression function c that satisﬁes the preference Property 2. Assume that there exists a reconstruction function ρ for (A, c). Then, the inclusion Property 12 holds iﬀthe coherence part I Property 9 holds.

Proof In view of Lemma 13, we only need to show the implication coherence part I inclusion. Let U := ms(z1, . . . , zn). Assume the coherence part I property and, by contradiction, that the inclusion property fails so that there is a zi U such that ℓ A(U), zi = 1 and zi does not appear in c(U) as many times as it does in U. Now, A(c(U)) = ρ(c(c(U))) = ρ(c(U)) = A(U) (where the second last equality is true because c(c(U)) = c(U) under preference see the comment immediately after Lemma 3); hence, ℓ A(c(U)), zi = ℓ A(U), zi = 1. By the coherence part I property, we then have: c(c(U), zi) = c(U), which contradicts Lemma 3 by the choice V = ms(c(U), zi) for which c(U) V U.

Under preference and the existence of a reconstruction function, the previous lemma shows that inclusion is strictly necessary to have the coherence part I property. The next lemma shows a way to secure inclusion (and thereby coherence part I ) by augmenting the compression function so as to include examples for which the hypothesis is inappropriate.

Lemma 16 Consider a learning algorithm A and a compression function c that satisﬁes the preference Property 2 ((A, c) are not required to satisfy the inclusion Property 12). Assume that there exists a reconstruction function ρ for (A, c). Deﬁne a new couple ( c, ρ) as follows:

13. When the training set has only one element z1, let the algorithm return [z1, + ) and the compression be ms(z1) while, with an empty training set, the algorithm returns the empty subset of R and the compression is obviously empty. 14. If the compression is empty, then the newly added point is alone and it becomes the compression as well.

M.C. Campi and S. Garatti

for any multiset U, let c(U) = c(U) ms(zi U : ℓ(A(U)), zi) = 1) \ c(U) ; (i.e., c(U) is c(U) augmented with the examples that are not already in c(U) for which A(U) is inappropriate);

for any multiset U, let ρ(U) = A(U).

(i) c satisﬁes the preference Property 2;

(ii) ρ is a reconstruction function for (A, c);

(iii) (A, c) satisﬁes the inclusion Property 12 (and, thereby, the coherence part I Property 9).

(i) Consider any two multisets U and V such that c(U) V U. We want to show that c(V ) = c(U), which, by Lemma 3, implies that c satisﬁes the preference Property 2. By deﬁnition of c, it holds that c(U) c(U), yielding c(U) V U. Since c satisﬁes the preference Property 2, Lemma 3 gives c(V ) = c(U), which, together with the fact that ρ is a reconstruction function for (A, c), also implies that A(V ) = ρ(c(V )) = ρ(c(U)) = A(U). Using c(V ) = c(U) and A(V ) = A(U) in the deﬁnition of c(V ) gives

c(V ) = c(V ) ms(zi V : ℓ(A(V ), zi) = 1) \ c(V )

= c(U) ms(zi V : ℓ(A(U), zi) = 1) \ c(U)

= c(U) ms(zi U : ℓ(A(U), zi) = 1) \ c(U) , (12)

where the last equality follows by the observation that c(U) V and that c(U) contains by deﬁnition all the zi U such that ℓ(A(U), zi) = 1. The thesis follows by observing that the right-hand side of (12) is c(U).

(ii) For any multiset U it holds that c(U) c(U) U and, since c satisﬁes the preference Property 2, Lemma 3 gives that c( c(U)) = c(U). Recalling now the deﬁnition of ρ and the fact ρ is a reconstruction function for (A, c), we have that

ρ( c(U)) = A( c(U)) = ρ(c( c(U))) = ρ(c(U)) = A(U).

(iii) This is obvious in view of the deﬁnition of c.

Using Lemma 16, the following theorem follows immediately from Theorem 10.

Theorem 17 Consider a learning algorithm A and a compression function c that satisﬁes the preference Property 2. Assume that there exists a reconstruction function ρ for (A, c). Deﬁne c as in Lemma 16. For any δ (0, 1), it holds that

P n R A(z1, . . . , z N) > εk o δ,

where k = | c(z1, . . . , z N)| and εk is given in (1).

Compression, Generalization and Learning

Upper and lower bounds for R(A(z1, . . . , z N)) can be established under additional conditions.

Property 18 (coherence part II) For any n 0 and p 1,

P E1 \ E2 = 0,

where E1 = {c c(z1, . . . , zn), zn+1 = c(z1, . . . , zn)},

E2 = {ℓ A(z1, . . . , zn), zn+1 = 1}.

The coherence part II property requires that E2 covers E1 up to an event of probability zero. The reason for not requiring that E1\E2 = is that non-pathological examples can be exhibited where this latter condition fails (while the one in probability does hold), showing that this requirement would be unduly restrictive.

We now have the following theorem, which can be proven from Theorem 7 in the light of the two coherence properties.

Theorem 19 Given a learning algorithm A, suppose that there exists a compression function c that satisﬁes the coherence part I Property 9 and the coherence part II Property 18. Assume the preference Property 2, the non-associativity Property 5 and the nonconcentrated mass Property 6. For any δ (0, 1), it holds that

P n εk R A(z1, . . . , z N) εk o 1 δ,

where k = |c(z1, . . . , z N)| and εk, εk are given respectively in (4), (5).

Proof Consider events E1 and E2 as in the statement of the coherence - part II Property 18 and notice that φN = P{E1|z1, . . . , z N} while R A(z1, . . . , z N) = P{E2|z1, . . . , z N}. Properties 9 and 18 then imply that φN = R A(z1, . . . , z N) over a zero probability set only, and the conclusion of Theorem 19 follows in view of (6).

We close this section with some comparison of the results given here with previous results established under the preference property (or the stability property, as it is phrased in some contributions) for compression schemes that consist of a compression function c and a reconstruction function ρ. In an example-consistent framework (i.e., the bound on the risk is only given for multisets S for which ρ(c(S)) is appropriate for all zi S), the best available result for the case when a threshold on the maximum cardinality of c(S) is known is given by Theorem 15 in Bousquet et al. (2020). When N , the bound on the risk in Bousquet et al. (2020) exhibits a O(1/N) convergence rate to zero, in line with the asymptotic results of this paper given in Section 2.1. The ﬁndings of Bousquet et al. (2020) have been extended to compression schemes that have no upper limit for the maximum cardinality of c(S) in Hanneke and Kontorovich (2021), Theorem 10. As compared to Hanneke and Kontorovich (2021), our upper bound shows a uniform

M.C. Campi and S. Garatti

Figure 5: Convex hull of points in R3.

(in |c(S)| {0, 1, . . . , N}) convergence towards |c(S)|/N, which is unattainable within the approach of Hanneke and Kontorovich (2021) (where |c(S)|/N is multiplied by a non-unitary constant). Moreover, our bound is unprecedentedly sharp for ﬁnite values of N and gets rapidly close to |c(S)|/N as N grows, see Figure 4. Moving to the non-consistent framework, the available literature aims at bounding the gap between the empirical probability of inappropriateness (ratio between the number of examples in the training set for which the selected hypothesis is inappropriate divided by the size of the training set) and the actual probability of inappropriateness (i.e., the actual risk). Our Theorem 17 departs from this approach by allowing for an evaluation of the risk that uses directly the size of an augmented compression that automatically incorporates the empirical cases of inappropriateness. This allows us to use the same bound for the risk, without distinguishing between the consistent and non-consistent frameworks. The ensuing theory reveals all its sharpness when change of compression and inappropriateness are equivalent as speciﬁed in Theorem 19 (this is, e.g., the case for Support Vector Regression in Section 4.1.2 or the Guaranteed Error Machine in Section 4.2 under mild conditions), in which case one can establish lower and upper bounds that converge one to the other for increasing N as shown in Section 2.1 (which is unprecedented in statistical learning).15 See also the next Example 20 for a numerical simulation that shows that our bounds for ﬁnite N well capture the intrinsic stochastic variability of the risk.

Example 20 We consider a sample of 1000 points drawn in an independent fashion in R3 and an algorithm A that constructs the corresponding convex hull (see Figure 5). The compression function c returns the vertexes of the convex hull (in case of multiple points corresponding to the same vertex, only one point is put in the compression) and a new point is appropriate if it belongs to the convex hull. It is easy to check that c satisﬁes the preference Property 2 and the non-associativity Property 5 and that coherence part I Property 9 and coherence part II Property 18 also hold. Hence, if the probability by which the points are drawn has no concentrated mass (for instance, if it admits density), then the non-concentrated mass Property 6 is also veriﬁed and Theorem 19 can be used to assess the risk. Panels (a) and (b) in Figure 6 proﬁle the region delimited by εk and εk for N = 1000 and δ = 10 3. The green dots have coordinates equal to the cardinality of the

15. This also shows the sharpness of the upper bound in Theorem 17 because εk εk (see Proposition 8) and the cases dealt with in Theorem 19 that admit lower bound εk and upper bound εk have to be accommodated in Theorem 17 as well.

Compression, Generalization and Learning

0 20 40 60 80 100 0

0 20 40 60 80 100 0

Figure 6: Region delimited by εk and εk for N = 1000 and δ = 10 3. The green dots are generated by a Monte-Carlo testing with (a) a Gaussian distribution and (b) a uniform distribution.

compressed multiset (x axis) and the risk (y axis) in a Monte-Carlo testing in which points in R3 have a Gaussian distribution panel (a) and a uniform distribution in a hyper-cube panel (b). One sees that the two clouds of green dots in (a) and in (b) are quite diﬀerent, while, in both cases, they belong to the region (compare with the discussion in Section 2). Moreover, the stochastic ﬂuctuation of the two clouds well covers the gap between the lower and the upper bound, signifying that the bounds are tight.

Remark 21 (On the role of observations) We feel advisable to just touch upon here an aspect, the full study of which goes beyond the intended goal of this paper. In data-driven applications, it is common practice that observations are split in two sets, used respectively for training and testing. This paper shows that, in a compression setup, data can well stand a double role, in which they are all used for training, while preserving their usability in the process of assessing the risk. Indeed, under the assumptions of Theorem 19 one can show that the quality of risk assessment by means of εk and εk (which is based on the sample of data points that has been used for training) only marginally degrades as compared to testing the solution with a new, untouched, sample of data points of equal cardinality. Hence, in this context, saving data for testing seems inappropriate, particularly when data are a scarce or costly resource.

For the non-consistent framework, it is interesting to further contrast asymptotic bounds that stem from the theory of this paper with previous results. To facilitate the comparison, we ﬁrst reformulate our upper bound in terms of the empirical probability of inappropriateness ˆR A(S) (empirical risk). Start by noticing that | c(S)| in Theorem 17 can be bounded as follows: | c(S)| ˆR A(S) N + |c(S)| (strict inequality occurs when one or more examples zi are simultaneously inappropriate and also contained in c(S) and, therefore, counted twice in the right-hand side). Then, Theorem 17 in conjunction with Proposition 8 give,

M.C. Campi and S. Garatti

with probability 1 δ with respect to the draw of the training set, that

ˆR A(S) + |c(S)|

ˆR A(S) N + |c(S)| + 1

ln ˆR A(S) N + |c(S)| + 1 + 4

ˆR A(S) N + |c(S)| + 1 q

= ˆR A(S) + |c(S)|

ln ˆR A(S) N + |c(S)| + 1 + 1

(since our use of the notation O( ) is not standard, we clarify that here and in (14) g(N, c(S), ˆR(A(S))) = O(f(N, c(S), ˆR(A(S)))) means that there exist a constant C and an N such that g(N, c(S), ˆR(A(S))) Cf(N, c(S), ˆR(A(S))) for all c(S) {0, . . . , N} and ˆR(A(S)) [0, 1] when N N). This last expression can be compared with the best available result in the literature given by Theorem 17 in Hanneke and Kontorovich (2021), which yields

R A(S) = ˆR A(S) + O

It stands out that in (13) term q

ˆR A(S) does not multiply p

|c(S)| + 1, as it instead

does in (14); moreover, p

|c(S)| + 1 is divided by N instead of

N. As a consequence, it is easy to show that, if, e.g., ˆR A(S) is replaced by a constant (so as to accommodate typical non-consistent frameworks), then the rate provided by (13) outdoes that of (14) whenever |c(S)| grows sub-linearly and faster than ln(N) (when |c(S)| is slower than ln(N), the dominant term in (13) is of the type O( p

N), whereas in (14) it is of the type O( p

N)). For instance, replacing |c(S)| with

N gives the rate p

N in (13) and the rate 1/N 1 4 in (14).16

To close, we ﬁnally mention an interesting implication of our result that has been kindly suggested to us by an anonymous reviewer. Suppose that in a binary classiﬁcation problem a hypothesis is selected from a class H via a compression scheme that minimizes the empirical risk (such compression scheme is named agnostic sample compression scheme for H in David et al., 2016b, Section 2; see also David et al., 2016a), and that this compression

16. Interestingly enough, both these rates violate the lower bound established in Hanneke and Kontorovich (2019) for generic non-preferent compression schemes, which shows that the property of preference is strictly necessary to establish accelerated convergence rates for the excess risk as discussed here.

Compression, Generalization and Learning

scheme is also preferent. Then, (13) and Lemma 3.2 in David et al. (2016b) give (with high probability with respect to the draw of the training set) that

R A(S) = inf h H R h + |c(S)|

On the other hand, Theorem 5.2 in Anthony and Bartlett (1999) establishes a bound to the rate at which any hypotheses class H of Vapnik-Chernovenkis dimension d can be learned:

R A(S) inf h H R h

d 320 N . (16)

Considering classes H whose Vapnik-Chernovenkis dimension increases with N more than ln(N), results (15) and (16) imply that these classes cannot admit preferent agnostic sample compression schemes for H of a size that increases at a rate less than

N. This result is in contrast with the case of non-preferent agnostic sample compression scheme for H , in which context David et al. (2016b) shows that schemes of smaller cardinality can be found.

4. Application to known learning schemes

The theory developed in Section 3 is here applied to well-known learning techniques: some algorithms within the family of support vector methods and then a more recent classiﬁcation technique called Guaranteed Error Machine (GEM). In all these cases, Theorem 10 applies without restrictions, while the use of Theorem 19 requires some conditions on the probability distribution of the examples. The content of this section is also meant to illustrate the ﬂexibility and usefulness of the theory of this paper and, in this light, additional schemes that are amenable to be analyzed within the framework of Section 3 are hinted upon at the end of the section.

4.1 Support Vector methods

4.1.1 Support Vector Machines

In supervised binary classiﬁcation, an example z is a pair z := (x, y), where x X (think of X as a generic space without any speciﬁc structure) is an instance and y { 1, 1} is a label . A hypothesis, called a binary classiﬁer , is a map h : X { 1, 1}. We let ℓ(h, (x, y)) = 1y =h(x), so that the loss function equals 0 when y = h(x) (correct classiﬁcation) and 1 when y = h(x) (misclassiﬁcation).

To add ﬂexibility, support vector methods are often got to operate in a feature space. Let ϕ : X H be a feature map from X into a Hilbert space H equipped with an inner product , . Support Vector Machine (SVM, see Cortes and Vapnik, 1995; Sch olkopf and Smola, 1998) is a learning algorithm ASVM that maps a training set S = ms((x1, y1), . . . , (xn, yn)) into a binary classiﬁer according to the formula

ASVM(S)(x) =

( 1, if w (S), ϕ(x) + b (S) 0 1, if w (S), ϕ(x) + b (S) < 0,

M.C. Campi and S. Garatti

where x is a generic instance and (w (S), b (S)) (along with the auxiliary variables ξ i (S) that are used to relax the constraint of exact classiﬁcation of all the examples in the training set) is the solution to the optimization program

PSVM(S) : min w H,b R ξi 0,i=1,...,n w 2 + ρ

i=1 ξi (17)

subject to: 1 yi( w, ϕ(xi) + b) ξi, i = 1, . . . , n.

As shown in Theorem 2 in Burges and Crisp (1999), (17) always admits a minimizer and, moreover, w (S) is unique; however, b (S) (and ξ i (S)) need not be unique. When b (S) is not unique, we assume that the tie is broken by selecting the value of b that minimizes |b|.17 Note that once w (S) and b (S) are uniquely determined, then also the optimal values ξ i (S) remain univocally identiﬁed.

As is well known, see e.g. Sch olkopf and Smola (1998), the feature map ϕ( ) and the inner product , need not be explicitly assigned to solve (17). The reason is that the determination of the solution to (17) (as well as the evaluation of ASVM(S)(x)) only involves the computation of inner products of the type ϕ(xi), ϕ(xj) and ϕ(xi), ϕ(x) . These inner products can indeed be directly obtained from a kernel k( , ) (k( , ) is a function X X R that satisﬁes suitable conditions of positive deﬁniteness, see Sch olkopf and Smola, 1998) and the theory of Reproducing Kernel Hilbert Spaces ensures that any choice of k( , ) always corresponds to allocate a suitable pair ϕ( ), , so that k( , ) = ϕ( ), ϕ( ) (this is the so-called kernel trick ). We also note that the solution to (17) is typically obtained by solving a program that is the dual of (17). However, although important for the practice of SVM, all these remarks are immaterial for the discussion that follows.

To cast SVM within the framework of the present paper, introduce the following compression function c SVM. First, endow X with an arbitrary total ordering (to be used below in point (ii)). For any multiset S, deﬁne c SVM(S) = S+ S0 where:

(i) S+ is the sub-multiset of all the examples (repeated as many times as they appear in S) for which 1 yi( w (S), ϕ(xi) + b (S)) > 0 (note that this is equivalent to the condition ξ i (S) > 0);

(ii) consider the sub-multisets S0 of smallest cardinality for which:

(a) 1 yi( w (S), ϕ(xi) + b (S)) = 0; and, (b) the couple (w (S+ S0), b (S+ S0)) given by (17) with S+ S0 in place of the original S is the same as (w (S), b (S)).18

17. This certainly breaks the tie because, if the smallest absolute value were achieved by two values for b , say b = b, corresponding to the solutions (w , b, ξ i,1) and (w , b, ξ i,2) (recall that w must be the same at optimum), then the optimality of these two solutions would imply that Pn i=1 ξ i,1 = Pn i=1 ξ i,2 and therefore the solution half way between (w , b, ξ i,1) and (w , b, ξ i,2), i.e., (w , 0, 0.5 ξ i,1 +0.5 ξ i,2), would be feasible thanks to convexity, it would achieve the same cost as the other two solutions, but it would be preferred because it carries a smaller value of |b|. 18. The sub-multiset of S formed by all examples that satisfy relation 1 yi( w (S), ϕ(xi) + b (S)) 0 certainly gives the same couple (w (S), b (S)) obtained when (17) is applied to the whole multiset S, hence a multiset of smallest cardinality satisfying (a) and (b) certainly exists.

Compression, Generalization and Learning

Among the multisets S0, single out S0 by using the ordering on X: for each candidate multiset S0, identify the element with smallest instance and pick the multiset that exhibits the smallest among all; if a tie remains, move on to compare the second smallest and so on until S0 is uniquely determined.

We want to apply Theorem 10 to SVM. To this purpose, we start by verifying that c SVM satisﬁes the preference Property 2.

Preference. We apply Lemma 3, which requires to show that, for every multisets S and S such that c SVM(S) S S, it holds that c SVM(S ) = c SVM(S). First, we show that (w (S ), b (S )) = (w (S), b (S)). (18)

For the sake of contradiction, suppose that (18) does not hold:

(w (S ), b (S )) = (w (S), b (S)). (19)

Note that the value achieved by (w (S ), b (S )) for the problem that only contains the constraints corresponding to the examples in c SVM(S)19 cannot be worse than the value achieved by (w (S ), b (S )) for the problem containing the constraints associated to S (because the latter is more constrained than the former); moreover, the value achieved by (w (S), b (S)) for the problem containing only the constraints associated to c SVM(S) is equal to the value achieved by (w (S), b (S)) for the problem containing the constraints associated to S (because the examples in S that are not in c SVM(S) corresponds to ξ i (S) whose value is zero). From (19), then, one concludes that (w (S ), b (S )) must be preferred to (w (S), b (S)) for the problem containing only the constraints associated to c SVM(S). This, however, leads to a contradiction because, by construction (w (c SVM(S)), b (c SVM(S))) = (w (S), b (S)). Hence, (18) remains proven. Consider now c SVM(S ) = S + S 0, where S + and S 0 are obtained from (i) and (ii) applied to S . Since S c SVM(S), S contains all the examples in S for which 1 yi( w (S), ϕ(xi) + b (S)) > 0 and, in view of (18), these examples are also all those in S for which 1 yi( w (S ), ϕ(xi) + b (S )) > 0. Thus, S + = S+. The fact that S 0 = S0 follows instead from observing that S0 is the preferred selection according to the ordering procedure in (ii) to recover (w (S), b (S)) and, since (w (S ), b (S )) = (w (S), b (S)) and S0 is available as a sub-multiset of S , this same S0 is selected when (ii) is applied to S , leading to S 0 = S0. This establishes the validity of Lemma 3 and closes the argument.

Next, we verify the coherence part I Property 9.

Coherence part I. Notice ﬁrst that the very deﬁnition of c SVM implies that ASVM(c SVM(S)) = ASVM(S), i.e., ASVM itself acts as a reconstruction function. Moreover, if we have that ℓ(ASVM(S), (xi, yi)) = 1 for an example (xi, yi) in S, then it must be that yi( w (S), ϕ(xi) + b (S)) 0 (because yi has sign opposite to that of w (S), ϕ(xi) +b (S)). This implies that 1 yi( w (S), ϕ(xi) + b (S)) > 0, from which the inclusion Property 12 follows because

19. This amounts to substitute (w (S ), b (S )) in the problem of the type (17) where only the constraints corresponding to the examples in c SVM(S) are enforced, and then optimize with respect to the variables ξi.

M.C. Campi and S. Garatti

c SVM(S) includes all examples for which this latter inequality holds true. Based on these results, the coherence part I Property 9 follows from Lemma 13.

Having established the preference and the coherence - part I properties, the following theorem follows as a corollary of Theorem 10.

Theorem 22 (Risk of SVM) For any δ (0, 1), it holds that

P n R ASVM(S) > εk o δ,

where k = |c SVM(S)| and εk is given in (1).

We are instead not in a position to establish lower bounds for R ASVM(S) . The reason is that adding a new example (xn+1, yn+1) for which 1 yn+1( w (S), ϕ(xn+1) + b (S)) > 0 changes the compression (refer to (i) in the deﬁnition of c SVM), but this does not exclude that yn+1( w (S), ϕ(xn+1) +b (S)) > 0, in which case the new example is not misclassiﬁed. This fact prevents the coherence part II Property 18 from being satisﬁed.

Remark 23 (A computational aspect) To evaluate an upper bound to the risk according to Theorem 22, one needs to compute |c SVM(S)|. Computing the cardinality of S+ is easy; determining the cardinality of S0, however, is more computationally demanding. On the other hand, εk is increasing with k so that a valid result can be easily found by overestimating |c SVM(S)| with the cardinality of the set of examples for which 1 yi( w (S), ϕ(xi) +b (S)) 0. Heuristically, this evaluation often turns out to be sharp.

4.1.2 Support Vector Regression

In regression problems, an example z is a pair z := (x, y) with x X, a generic space, and y R. A hypothesis h is called a predictor and it is a map from X to R. As loss function, we take

ℓ(h, (x, y)) =

( 1, if |y h(x)| > t 0, if |y h(x)| t,

where t is the so-called prediction tolerance . Let ϕ : X H be a feature map from X to a Hilbert space H endowed with inner product , . Support Vector Regression (SVR, see Smola and Sch olkopf, 2004) is a learning algorithm ASVR(S) that maps any training set S = ms((x1, y1), . . . , (xn, yn)) into the predictor

ASVR(S)(x) = w (S), ϕ(x) + b (S)

where w (S) and b (S) (along with ξ i (S)) are the solution to the program

PSVR(S) : min w H,b R ξi 0,i=1,...,n w 2 + ρ

subject to: |yi w, ϕ(xi) b| t + ξi, i = 1, . . . , n.

Similarly to SVM, a minimizer certainly exists and w (S) is also unique, but b (S) may not be, see Burges and Crisp (1999). In the latter case, the tie is broken by choosing

Compression, Generalization and Learning

the minimizer with the smallest value of |b|. After that w (S) and b (S) have been made unique, also the values of ξ i (S) remain univocally determined. The kernel trick described for SVM applies here as well, so that ϕ( ) and , need not be speciﬁed explicitly and can be assigned via a kernel function.

Our deﬁnition of a compression function c SVR closely resembles that for SVM. For any multiset of examples S, deﬁne c SVR(S) = S+ S0 where:

(i) S+ is the multiset of all the examples (repeated as many times as they appear in S) for which |yi w (S), ϕ(xi) b (S)| > t (note that this is equivalent to the condition ξ i (S) > 0);

(ii) S0 is the smallest sub-multiset only containing examples that satisfy condition |yi w (S), ϕ(xi) b (S)| = t for which w (S+ S0) = w (S) and b (S+ S0) = b (S). In complete analogy with SVM, S0 as deﬁned before may not be unique, in which case an S0 is singled out by a total ordering on X.

The proof that c SVR satisﬁes the preference Property 2 and that (ASVR, c SVR) satisﬁes the coherence part I Property 9 follows the same path, mutatis mutandis, as for SVM and is therefore omitted. This gives the following theorem, obtained as a direct consequence of Theorem 10.

Theorem 24 (Risk of SVR) For any δ (0, 1), it holds that

P n R ASVR(S) > εk o δ,

where k = |c SVR(S)| and εk is given in (1).

Unlike SVM, for SVR lower and upper bounds for the risk are established under an additional, mild, distributional assumption.

Assumption 25 The regular conditional distribution of y given x has no concentrated mass almost surely.

To establish the lower and upper bounds, we resort to Theorem 19. Preliminarily, we show the validity of the assumptions of this theorem.

Non-associativity. Consider any training set S = ms((x1, y1), . . . , (xn, yn)) and an additional multiset of examples S = ms((xn+1, yn+1), . . . , (xn+p, yn+p)) such that c(S, (xn+i, yn+i)) = c(S) for all i {1, . . . , p}. Further, assume that |yn+i ASVR(S)(xn+i)| = t for all i {1, . . . , p}. Then, it must be that |yn+i ASVR(S)(xn+i)| < t for all i {1, . . . , p}. Indeed, suppose by contradiction the opposite: |yn+i ASVR(S)(xn+i)| > t for some i. Then, if ASVR(S, (xn+i, yn+i)) = ASVR(S), then |yn+i ASVR(S, (xn+i, yn+i))(xn+i)| > t and (xn+i, yn+i) is counted in c SVR(S, (xn+i, yn+i)) leading to c SVR(S, (xn+i, yn+i)) = c SVR(S); if instead ASVR(S, (xn+i, yn+i)) = ASVR(S), then ASVR(c SVR(S, (xn+i, yn+i))) = ASVR(c SVR(S)), which means that c SVR(S, (xn+i, yn+i)) cannot be the same as c SVR(S). Thus, it remains proven that |yn+i ASVR(S)(xn+i)| < t for all i {1, . . . , p} and this yields immediately that ASVR(S S ) = ASVR(S) and that c SVR(S S ) = c SVR(S). This, along with the fact

M.C. Campi and S. Garatti

that |yn+i ASVR(S)(xn+i)| = t for all i {1, . . . , p} holds with probability 1 in view of Assumption 25, gives the non-associativity property.

Non-concentrated mass. This is obvious in view of Assumption 25.

Coherence part II. The proof is by contrapositive: letting S = ms((x1, y1), . . . , (xn, yn)) we show that

P {ℓ(ASVR(S), (xn+1, yn+1)) = 0} \ {c SVR(c SVR(S), (xn+1, yn+1)) = c SVR(S)} = 0. (20)

Assumption 25 implies that the case |yn+1 ASVR(S)(xn+1)| = t can be disregarded because it correpsonds to an event that has probability zero. Moreover, when |yn+1 ASVR(S)(xn+1)| > t, we have that ℓ(ASVR(S), (xn+1, yn+1)) = 1, so that this case can be disregarded too. When instead |yn+1 ASVR(S)(xn+1)| < t, it holds that ℓ ASVR(S), (xn+1, yn+1) = 0. In this case,

|yn+1 ASVR(c SVR(S))(xn+1)| = |yn+1 ASVR(S)(xn+1)| < t

yields ASVR(c SVR(S), (xn+1, yn+1)) = ASVR(S), from which

|yn+1 ASVR(c SVR(S), (xn+1, yn+1))(xn+1)| < t.

Hence, (xn+1, yn+1) is not in c SVR(c SVR(S), (xn+1, yn+1)), so that, owing to the preference Property 2, it must be that c SVR(c SVR(S), (xn+1, yn+1)) = c SVR(S). This shows the validity of (20).

Using Theorem 19, we now have the following result.

Theorem 26 (Risk of SVR - bounds from below and from above) Under Assumption 25, for any δ (0, 1), it holds that

P n εk R ASVR(S) εk o 1 δ,

where k = |c SVR(S)| and εk, εk are given respectively in (4), (5).

4.1.3 Other Support Vector methods

The applicability of Theorems 10 and 19 can be carried over to other Support Vector methods. SVR with Adjustable Size, Sch olkopf et al. (1998), requires minor modiﬁcations. Suitable but conceptually straightforward modiﬁcations of the arguments used in this section can also be applied to one-class SVM, Sch olkopf et al. (1999), and Support Vector Data Description (SVDD), Tax and Duin (2004) and Wang et al. (2011). In these latter two cases, the setup changes slightly since examples are unlabeled and hypotheses are regions in the set hosting the examples. To apply Theorem 19, one needs here to modify Assumption 25 so as to enforce speciﬁc non-accumulation conditions for the method at hand.

Compression, Generalization and Learning

4.2 Guaranteed Error Machine

The Guaranteed Error Machine (GEM) is a learning algorithm for classiﬁcation that was ﬁrst introduced in Campi (2010) and then further developed in Car e et al. (2018).20 GEM returns a ternary-valued classiﬁer, which is also allowed to abstain from classifying in case of doubt. To be speciﬁc, letting z = (x, y) with x X, a generic set, and y { 1, 1} (this is the same setup as in SVM), a hypothesis h is here a map from X to { 1, 1, 0}, where the value 0 is interpreted as admission of being unable to classify. Issuing an incorrect label ( 1 in place of 1 or vice versa) leads to a mistake, and the theory aims at bounding the probability for this to happen. Correspondingly, the loss function is deﬁned as follows:

ℓ(h, (x, y)) =

( 1, if |y h(x)| = 2 0, if |y h(x)| = 0 or 1.

To describe the operation of GEM, start by introducing a feature map ϕ : X H, where H is a Hilbert space with inner product , (as for support vector methods, ϕ, H, and , need not be explicitly given and can be implicitly deﬁned by means of a kernel) and also assume the existence of an ordering on X (used later to introduce a tie-break rule). GEM requires that the user chooses an integer d 1, which speciﬁes the maximal cardinality for the compression.21 In loose terms, GEM operates as follows. It is assumed that one has an additional observation ( x, y) (besides the training set S = ms((x1, y1), . . . , (xn, yn))) that acts as initial center . GEM constructs the hyper-sphere in H around ϕ( x) which is the largest possible under the condition that the hyper sphere does not include any ϕ(xi) with label yi diﬀerent from y. All points inside this hyper-sphere are classiﬁed as the label y, and all examples (xi, yi) for which ϕ(xi) is inside the hyper-sphere are removed from the training set. The example that lies on the boundary of the hyper-sphere (and that has therefore prevented the hyper sphere from further enlarging) is then appointed as the new center (in case of ties, the tie is broken by using the ordering on X) and the procedure is repeated by constructing another hyper-sphere around the new center. This time, only the region given by the diﬀerence between the newly constructed hyper-sphere and the ﬁrst hyper sphere (which has been already classiﬁed) is classiﬁed as the label of the second center. This procedure continues the same way and comes to a stop when either the whole space has been classiﬁed or the total number of centers is equal to d, in which case the portion of X that has not been covered is classiﬁed as 0. This leads to the algorithm formally described below.

STEP 0. SET q := 0, P := S ms(( x, y)), C = and x C = x, y C = y;

STEP 1. SET q := q + 1 and SOLVE problem

subject to: ϕ(xi) ϕ(x C) r, for all (xi, yi) P such that yi = y C.

20. The algorithm described here is a variant of those proposed in the referenced papers. 21. Selecting a large value for d reduces the chance of abstention from classifying. When d is larger than the cardinality of the training set, the set of abstention becomes empty. In other cases, the user tries to achieve a good compromise between the probability of abstention and the probability of making an error. This is not speciﬁc to GEM and applies to any technique for the construction of ternary-valued classiﬁers. See Campi (2010) for more discussion on this point.

M.C. Campi and S. Garatti

Let r be the optimal solution (note that r can possibly be + );

STEP 2. FORM the region Rq := {x X : ϕ(x) ϕ(x C) < r } and LET ℓq := y C; UPDATE P as follows: if r > 0, then remove from P all the examples with xi Rq; if instead r = 0,22 then remove from P the example (x C, y C);

STEP 3. IF r < + , THEN

3.a SET C := C ms((xi , yi )), where (xi , yi ) is an example in P such that: a. ϕ(xi ) ϕ(x C) = r ; b. yi = y C; c. xi is smallest in the ordering of X among all the examples satisfying a. and b.;

3.b SET (x C, y C) := (xi , yi );

STEP 4. IF either |C| = d or P = THEN STOP and RETURN ℓj, Rj, j = 1, . . . , q and C; ELSE, GO TO 1.

The GEM predictor is deﬁned as

AGEM(S)(x) =

( 0, if x / Rj j = 1, . . . , q; ℓj otherwise, with j = min j {1, . . . , q} : x Rj .

The compression function for GEM is c GEM(S) = C.

We next establish the preference and coherence part I properties, required to apply Theorem 10.

Preference. Given any multisets S and S such that c GEM(S) S S, it is easy to verify that running STEPS 0-4 with S as input returns the same output as when these steps are run with input S. In particular, c GEM(S ) = C = c GEM(S) and, therefore, the preference property follows by an application of Lemma 3.

Coherence part I. Since applying STEPS 0-4 to c GEM(S) returns the same output as when they are applied to S, AGEM itself acts as a reconstruction function. Also, the inclusion Property 12 is immediately veriﬁed (just pay a bit of care to the case in which more examples with diﬀerent labels corresponds to the same x C). The coherence part I Property 9 then follows from Lemma 13.

Applying Theorem 10 we now have the following result.

Theorem 27 (Risk of GEM) For any δ (0, 1), it holds that

P n R AGEM(S) > εk o δ,

where k = |c GEM(S)| and εk is given in (1).

22. r = 0 only happens if there are examples with diﬀerent labels whose instance is x C.

Compression, Generalization and Learning

Notice also that |c GEM(S)| d holds by construction, which implies that the bound R AGEM(S) εd is always correct with high conﬁdence 1 δ.23

We now turn to lower bounds to the risk, which are established by an application of Theorem 19. We start by showing the validity of the non-associativity property.

Non-associativity. Consider any training set S = ms((x1, y1), . . . , (xn, yn)) and an additional multiset of examples S = ms((xn+1, yn+1), . . . , (xn+p, yn+p)). Suppose that c GEM(S S ) = c GEM(S). For this to be, it is required that at least one of these conditions applies: (i) ℓ(AGEM(S), (xn+i, yn+i)) = 1 for some i {1, . . . , p}; or, (ii) one of the (xn+i, yn+i), i {1, . . . , p}, for which ℓ(AGEM(S), (xn+i, yn+i)) = 0 lies on the boundary of a Rj and is lower in order than the example that is chosen as center by the algorithm applied to S. However, take in isolation an example (xn+i, yn+i) that satisﬁes either (i) or (ii); then, that example alone makes the compression change. This proves the non-associativity property.

To move on and prove the non-concentrated mass and coherence part II properties, we need a mild assumption on the distribution of examples.

Assumption 28 For any c H and γ R, it holds that

P{ ϕ(x) c 2 = γ} = 0.

Non-concentrated mass. This immediately follows from Assumption 28: if P{z = z} = 0 for some z = ( x, y), then Assumption 28 is violated by the choices c = ϕ( x) and γ = 0.

Coherence part II. In view of Assumption 28, xn+1 lies on the boundary of a region Rj with probability zero. On the other hand, when xn+1 is not on the boundary, a change of compression only occurs if (xn+1, yn+1) is misclassiﬁed, that is, ℓ(AGEM(S), (xn+1, yn+1)) = 1. This proves the coherence part II property.

The following theorem now follows from Theorem 19.

Theorem 29 (Risk of GEM - bounds from below and from above) Under Assumption 28, for any δ (0, 1), it holds that

P n εk R AGEM(S) εk o 1 δ,

where k = |c GEM(S)| and εk, εk are given respectively in (4), (5).

4.3 Other learning schemes with a preferent compression

Besides Support Vector methods and GEM, other learning schemes can be studied within the framework of the present paper. We mention here just two additional examples, without

23. A similar result would not be possible without resorting to ternary classiﬁers that allow for abstention from classifying.

M.C. Campi and S. Garatti

working out all the details as it was done in Sections 4.1 and 4.2.

One ﬁrst example is the class of methods for classiﬁcation based on the nearest-neighbor (NN) algorithm. Consider for simplicity 1-NN in a ﬁnite Euclidean space X and with labels generated by a target concept, see Shalev-Shwartz and Ben-David (2014). For every training set S = ms((x1, y1), . . . , (xn, yn)) of instance/label pairs, 1-NN relies on the Voronoi partition of the instance domain X induced by S, where cells are Ci = {x X : x xi x xj j = 1, . . . , n}, i = 1, . . . , n. If one eliminates from S all the examples whose associated cell is in the interior of the region labeled as the cell, then the remaining examples form a compressed multiset for which 1-NN itself acts as a reconstruction function. It is then a simple enough task to show that such a compression function satisﬁes the preference Property 2; moreover, since the 1-NN classiﬁer is always consistent with S, Lemma 13 allows us to conclude that the coherence part I Property 9 holds. Thereby, Theorem 10 can be applied to evaluate the probability of misclassiﬁcation of the 1-NN classiﬁer. Similar arguments are expected to be applicable to more general k-NN schemes, even in generic (inﬁnite dimensional) metric spaces. Further studies can possibly cover more general NN-based methods along the lines of Kontorovich et al. (2017, 2018); Hanneke et al. (2021).

As a second example, again in the context of classiﬁcation, we would like to mention the Total Recall algorithm of Helmbold et al. (1990), which is in use to learn nested diﬀerences of concepts from intersection-closed classes. In the setup of Helmbold et al. (1990), no matter whether the depth24 of the hypothesis returned by the algorithm is arbitrary or a-priori ﬁxed, it is fairly easy to prove that the union of the spanning sets25 with minimal cardinality for the multisets of examples used in the various calls to the closure learner by the Total Recall algorithm26 deﬁnes a compression function that satisﬁes the preference Property 2. By the very deﬁnition of spanning sets, we also have that the Total Recall algorithm applied to the union of the minimal spanning sets (i.e., applied to the compressed multiset) returns the same hypothesis as when the algorithm is run on the whole training set. This means that the Total Recall algorithm itself acts as a reconstruction function and Theorem 17 can be used to obtain evaluations of the probability of misclassiﬁcation. More research can extend this analysis to alternative algorithms, e.g., along the lines discussed in Section 5 in Helmbold et al. (1990).

5.1 A brief overview of the proofs

To help readability, we ﬁrst trace a roadmap of the fundamental steps in which the rather long proofs of Theorems 4 and 7 are articulated. To prove Theorem 4, we ﬁrst establish some properties that have necessarily to be satisﬁed by any compression scheme that is preferent. These are (i) and (ii) in the second page of the proof. Next, the probability P{φN > εk} that appears in the left-hand side of (2) in the statement of Theorem 4 is re-

24. See Helmbold et al. (1990), page 166. 25. See Helmbold et al. (1990), page 170. 26. If multiple spanning sets with minimal cardinality exist, then a choice is singled-out by means of any total ordering of the ﬁnite subests of X.

Compression, Generalization and Learning

written in integral form with respect to suitable measures that are introduced in (23), and the resulting expression is minimized under conditions (i) and (ii). The ensuing variational problem (33) returns an upper bound to P{φN > εk}. The next step consists in the evaluation of the optimal value of problem (33). This step is accomplished by dualization, leading to the reformulation (42). Interestingly, dualization does not introduce any conservatism since strong duality holds, as stated in equation (36). To close the proof, we show that the value δ that appears in the statement of Theorem 4 is achieved by a feasible solution of the dual problem and, thereby, it upper bounds the optimal value of (42) and, by this, that of P{φN > εk}. This derivation is covered in the last part of the proof that starts after equation (46).

The proof of Theorem 7 follows the same path as that of Theorem 4 with the non-trivial diﬀerence that the property (ii) holds with equality in this case (it has an inequality in the proof of Theorem 4). This results in primal and dual problems that have substantial diﬀerences from those in the proof of Theorem 4, while the conceptual structure of the proof remains the same.

5.2 Proof of Theorem 4

Result (2) is ﬁrst proven under the following additional assumption of no concentrated mass

P n zi = z o = 0, z Z; (21)

the extension to the general case is dealt with at the end of this proof.

The quantity of interest P{φN > εk} can be expressed as follows

P n φN > εk o = P n φN > ε|c(z1,...,z N)| o

k=0 P n |c(z1, . . . , z N)| = k and φN > εk o

{i1,...,ik} {1,...,N}

n c(z1, . . . , z N) = ms(zi1, . . . , zik) and φN > εk o

{i1,...,ik} {1,...,N}

P n c(z1, . . . , z N) = ms(zi1, . . . , zik) and φN > εk o ,

where the last equality holds because: due to (21), z1 = = z N occurs with probability 1, and so the multisets ms(zi1, . . . , zik) are all diﬀerent from each other with probability 1; whence, c(z1, . . . , z N) = ms(zi1, . . . , zik) holds for one and only one choice of the indexes with probability 1, implying that the events under the sign of union are disjoint up to overlaps of probability zero.

M.C. Campi and S. Garatti

Now, for any ﬁxed k, all the probabilities in the inner summation are equal because the zi s are i.i.d. and so we can write

{i1,...,ik} {1,...,N}

P n c(z1, . . . , z N) = ms(zi1, . . . , zik) and φN > εk o

P n c(z1, . . . , z N) = ms(z1, . . . , zk) and φN > εk o

(εk,1] dm+ k,N, (22)

where m+ k,N is a (positive) measure on [0, 1] deﬁned as follows (for future use we introduce a deﬁnition that holds for a generic integer m, and not just for m = N): for all m = 0, 1, . . . and k = 0, . . . , m, let

m+ k,m(B) = P n c(z1, . . . , zm) = ms(z1, . . . , zk) and φm B o , (23)

with B any Borel set in [0, 1].

Next we derive two relations (i) and (ii) that are satisﬁed by measures m+ k,m for all compression schemes that satisfy the preference property; relations (i) and (ii) will be in use when evaluating P{φN > εk}.

(i) For m = 0, 1, . . ., it holds that (we use α as variable of integration)

[0,1] dm+ k,m(α) = 1; (24)

(ii) For m = 0, 1, . . . and k = 0, . . . , m, it holds that Z

B dm+ k,m+1(α) Z

B (1 α) dm+ k,m(α) 0, (25)

for any Borel set B [0, 1].

For any given B, the left-hand side of (25) returns a numerical value and, when B ranges over the Borel sets in [0, 1], the left-hand side of (25) deﬁnes a signed measure. Condition (25) means that this measure is in fact negative. In the following, this measure will be denoted as m+ k,m+1 (1 α)m+ k,m,27 and condition (ii) can also be written as

m+ k,m+1 (1 α)m+ k,m M ,

where M is the cone of negative ﬁnite measures on [0, 1].

27. Note that (1 α)m+ k,m cannot be interpreted as a product since (1 α) is not a number because it depends on α; hence, m+ k,m+1 (1 α)m+ k,m has to be interpreted just as a symbol that indicates the measure deﬁned via the left-hand side of equation (25).

Compression, Generalization and Learning

Proof of (i): Along the same lines as the proof of (22), we obtain

k=0 P n |c(z1, . . . , zm)| = k o

{i1,...,ik} {1,...,m}

n c(z1, . . . , zm) = ms(zi1, . . . , zik) o

{i1,...,ik} {1,...,m}

P n c(z1, . . . , zm) = ms(zi1, . . . , zik) o

P n c(z1, . . . , zm) = ms(z1, . . . , zk) o

[0,1] dm+ k,m.

Proof of (ii): For any given Borel set B in [0, 1], we have that Z

B dm+ k,m+1 = P n c(z1, . . . , zm+1) = ms(z1, . . . , zk) and φm+1 B o . (26)

By Lemma 3, relation c(z1, . . . , zm+1) = ms(z1, . . . , zk) implies the following two facts:

(a) c(z1, . . . , zm) = ms(z1, . . . , zk);

(b) c(c(z1, . . . , zm), zm+1) = c(z1, . . . , zm).

Equation (a) is an immediate consequence of Lemma 3, while (b) is proven by the following chain of equalities: c(c(z1, . . . , zm), zm+1) = [use (a)] = c(z1, . . . , zk, zm+1) = [use Lemma 3] = c(z1, . . . , zm, zm+1) = ms(z1, . . . , zk) = c(z1, . . . , zm).

Over the set where c(z1, . . . , zm+1) = ms(z1, . . . , zk), it therefore holds that

φm+1 = P n c(c(z1, . . . , zm+1), zm+2) = c(z1, . . . , zm+1)|z1, . . . , zm+1 o

= P n c(c(z1, . . . , zm), zm+2) = c(z1, . . . , zm)|z1, . . . , zm+1 o

where we have used (a), which gives c(z1, . . . , zm+1) = c(z1, . . . , zm)

= P n c(c(z1, . . . , zm), zm+2) = c(z1, . . . , zm)|z1, . . . , zm o

= φm P-almost surely, (27)

so that the right-had side of (26) can be re-written as

P n c(z1, . . . , zm+1) = ms(z1, . . . , zk) and φm+1 B o

= P n c(z1, . . . , zm+1) = ms(z1, . . . , zk) and φm B o . (28)

M.C. Campi and S. Garatti

On the other hand, an application of (a) and (b) also gives28

P n c(z1, . . . , zm+1) = ms(z1, . . . , zk) and φm B o

P n c(c(z1, . . . , zm), zm+1) = c(z1, . . . , zm) and

c(z1, . . . , zm) = ms(z1, . . . , zk) and φm B o . (29)

Using (28) and (29) in (26), we obtain Z

B dm+ k,m+1 P n c(c(z1, . . . , zm), zm+1) = c(z1, . . . , zm) and

c(z1, . . . , zm) = ms(z1, . . . , zk) and φm B o . (30)

The proof of (ii) is now established by noticing that the right-hand side of (30) can be re-written as follows

E h E 1{c(c(z1,...,zm),zm+1)=c(z1,...,zm)} 1{c(z1,...,zm)=ms(z1,...,zk) and φm B}|z1, . . . , zm i

= E h E 1{c(c(z1,...,zm),zm+1)=c(z1,...,zm)}|z1, . . . , zm

1{c(z1,...,zm)=ms(z1,...,zk) and φm B} i

= E h (1 φm) 1{c(z1,...,zm)=ms(z1,...,zk) and φm B} i

B (1 α) dm+ k,m.

We are now ready to upper-bound P {φN > εk} by taking the sup of the right-hand side of (22) under conditions (i) and (ii) (in addition to the fact that measures m+ k,m belong to the cone M+ of positive ﬁnite measures on [0, 1]). This gives

P {φN > εk} γ, (31)

28. Importantly, inequality in (29) may be strict. For example, suppose that z is uniformly distributed over a circle with unitary circumference and that c(z1, . . . , zn) selects the two points whose gap is smallest (i.e. c(z1, . . . , zn) = ms(zi1, zi2) such that no other pair of points is closer if a tie occurs use an arbitrary tiebreak rule). Take m = 3 and B = [0, 1]. The left-hand side of (29) equals 1/6, as is obvious by observing that any choice of two points has the same probability of being selected. Instead, the right-hand side is the probability that c(z1, z2, z4) = ms(z1, z2) and c(z1, z2, z3) = ms(z1, z2). Naming x the length of the arc connecting z1 and z2, we have: i. x has uniform density equal to 2 over [0, 1/2]; ii. if x > 1/3, then adding one more point certainly changes the compression; and iii. if x 1/3, then the probability that one more point changes the compression is 3x. Hence, the right-hand side of (29) has value

P n c(z1, z2, z4) = ms(z1, z2) and c(z1, z2, z3) = ms(z1, z2) o = Z 1 3

0 (1 3x)2 2dx = 2

which is strictly larger than 1/6. We shall see in Theorem 7 that, by strengthening the assumptions of the theorem with the introduction of the non-associativity Property 5, inequality in (29) turns into an equality and this provides lower bounds on the probability of change of compression in addition to the upper bound of the present theorem.

Compression, Generalization and Learning

where γ is deﬁned as the value of the optimization problem

γ = sup m+ k,m M+

m=0,1,..., k=0,...,m

(εk,1] dm+ k,N (32)

subject to:

[0,1] dm+ k,m = 1, m = 0, 1, . . .

m+ k,m+1 (1 α)m+ k,m M , m = 0, 1, . . . ; k = 0, . . . , m.

To evaluate γ, we consider a truncated version of problem (32) that only includes the measures m+ k,m for m M (we take M N). We then dualize the truncated problem and let M increase.

The truncated problem is

γM = sup m+ k,m M+

m=0,...,M, k=0,...,m

(εk,1] dm+ k,N (33a)

subject to:

[0,1] dm+ k,m = 1, m = 0, . . . , M (33b)

m+ k,m+1 (1 α)m+ k,m M ,

m = 0, . . . , M 1; k = 0, . . . , m. (33c)

As M increases, one adds new constraints (which also contain new variables), while the cost function and previous constraints remain unchanged. Hence, γM does not increase with M and γ γM, (34)

for all M. To dualize (33), consider the Lagrangian:

(εk,1] dm+ k,N

[0,1] dm+ k,m 1

[0,1] µ+ k,m(α) d[m+ k,m+1 (1 α)m+ k,m], (35)

which is a function of

m+ k,m M+, m = 0, . . . , M, k = 0, . . . , m,

and the Lagrange multipliers

λm R, m = 0, . . . , M,

M.C. Campi and S. Garatti

µ+ k,m C0 +[0, 1], m = 0, . . . , M 1, k = 0, . . . , m,

where C0 +[0, 1] is the set of positive and continuous functions over [0, 1].

We show below that29

γM (A) = sup {m+ k,m} inf {λm} {µ+ k,m}

L (B) = inf {λm} {µ+ k,m}

sup {m+ k,m} L (C) = γ M, (36)

where γ M is the value of the dual of problem (33) (1 denotes the indicator function):

γ M = inf λm, m=0,...,M µ+ k,m C0 +[0,1], m=0,...,M 1, k=0,...,m

m=0 λm (37a)

subject to: m k

1α (εk,1]1m=N + (1 α)µ+ k,m(α)1m =M

+ µ+ k,m 1(α)1m =k, α [0, 1],

k = 0, . . . , M; m = k, . . . , M (37b)

(note that in (37b) the indexes run over a range such that there appear functions, for instance µ+ 0, 1, that are not listed as optimization variables; however, these functions are all multiplied by an indicator function that is zero and they therefore disappear; we have used this way of writing the constraints because it simpliﬁes the notation).

Proof of (A) in (36): If measures m+ k,m do not satisfy the constraints in (33b) and (33c), then inf{λm},{µ+ k,m} L is equal to . This is true for (33b) because, if for some m the term m X

[0,1] dm+ k,m 1

in the right-hand side of (35) is not null, then λm can be taken any large with sign equal to that of that term, bringing L down to arbitrary large negative values. Likewise, if (33c) is not satisﬁed for a given pair (k, m), then the last term in the right-hand side of (35) can be made any large negative by selecting a suitable positive large continuous function µ+ k,m.30 Hence, the sup{m+ k,m} of inf{λm},{µ+ k,m} L is attained at measures m+ k,m

29. In various parts of this paper from here onward, the set of measures m+ k,m, m = 0, . . . , M, k = 0, . . . , m, is indicated by the notation {m+ k,m}, where the range of variability for m and k is suppressed for brevity. Similar notations apply to λm and µ+ k,m and other collections alike. 30. Intuitively, this is achieved by a function µ+ k,m that is concentrated over the domain where m+ k,m+1 (1 α)m+ k,m is positive. In this footnote, we provide the interested reader with a formal construction of such a function µ+ k,m. Note that, if condition (33c) is not satisﬁed for a given pair (k, m), then there is a Borel set B in [0, 1] such that Z

B dm+ k,m+1 Z

B (1 α) dm+ k,m > 0. (38)

Compression, Generalization and Learning

satisfying (33b) and (33c) and, once (33b) and (33c) hold, inf{λm},{µ+ k,m} L is achieved

by setting the second and third terms in the right-hand side of (35) to zero (choose λm to be any value and µ+ k,m, e.g., equal to zero for all m and k). This leads to the conclusion that sup{m+ k,m} inf{λm},{µ+ k,m} L equals γM of problem (33).

Proof of (B) in (36): This long and technical proof is provided in Appendix C.

Proof of (C) in (36): First note that the Lagrangian can be rewritten as follows (in the second last term we have used the change of running index j = m+1)

1α (εk,1]1m=N dm+ k,m

[0,1] µ+ k,j 1(α)1j =k dm+ k,j

[0,1] µ+ k,m(α) (1 α)1m =M dm+ k,m.

Letting m be the measure m+ k,m+1 + m+ k,m, B can be sandwiched between a closed set C and an open set O (C B O, note that O R, but it may not be restricted to [0, 1]) such that m(O) m(C) < ε for any arbitrarily small ε (Theorem 12.3 in Billingsley, 1995). Let now

g(α) = dist(α, Oc) dist(α, Oc) + dist(α, C), α R,

where dist(α, X) = inf{|α x| : x X} and Oc is the complement of O. g is a continuous function with codomain [0,1] and g(α) = 0 on Oc while g(α) = 1 on C. We have that Z

[0,1] g(α) d[m+ k,m+1 (1 α)m+ k,m]

[0,1] g(α) dm+ k,m+1 Z

[0,1] g(α) (1 α) dm+ k,m

O g(α) dm+ k,m+1 Z

O g(α) (1 α) dm+ k,m

(since O can expand beyond [0, 1] but m+ k,m+1 and m+ k,m are supported in [0, 1])

C g(α) dm+ k,m+1 Z

C g(α) (1 α) dm+ k,m ε

(since 0 g(α) (1 α) 1 for α [0, 1] and Z

O\C dm+ k,m < ε)

C dm+ k,m+1 Z

C (1 α) dm+ k,m ε

B dm+ k,m+1 ε Z

B (1 α) dm+ k,m ε

B\C dm+ k,m+1 < ε)

> 0, for ε small enough and using (38).

Taking µ+ k,m to be an arbitrarily rescaled version of the restriction of g to [0, 1], one obtains that the last term in the right-hand side of (35) can be made any large negative.

M.C. Campi and S. Garatti

By renaming j as m in the second last term and re-arranging the summations PM m=0 Pm k=0 as PM k=0 PM m=k, we then obtain:

1α (εk,1]1m=N + (1 α)µ+ k,m(α)1m =M

µ+ k,m 1(α)1m =k

dm+ k,m. (39)

Now, if for some pair (k, m) the constraint in (37b) is not satisﬁed for a given α = α, then sup{m+ k,m} L can be sent to + by choosing m+ k,m that has an arbitrarily large mass concentrated in α. Hence, the inf{λm},{µ+ k,m} of sup{m+ k,m} L is attained at λm s

and µ+ k,m s satisfying (37b) and, once (37b) holds, sup{m+ k,m} L is achieved by setting

the second term in the right-hand side of (39) to zero (choose, e.g., m+ k,m = 0 for all k and m). This leads to the conclusion that inf{λm},{µ+ k,m} sup{m+ k,m} L equals γ M of

problem (37).

Next we want to evaluate γ M of problem (37).

For a better visualization of the constraints in (37b), we write them more explicitly in groups indexed by k as follows:

k = 0, . . . , N 1

(1 α)µ+ k,k(α) λk k k m=k (1 α)µ+ k,k+1(α) λk+1 k+1 k + µ+ k,k(α) m=k+1 ... (1 α)µ+ k,N 1(α) λN 1 N 1 k + µ+ k,N 2(α) m=N 1 N k 1α (εk,1] + (1 α)µ+ k,N(α) λN N k + µ+ k,N 1(α) m=N (1 α)µ+ k,N+1(α) λN+1 N+1 k + µ+ k,N(α) m=N+1 ... (1 α)µ+ k,M 1(α) λM 1 M 1 k + µ+ k,M 2(α) m=M 1 0 λM M k + µ+ k,M 1(α) m=M k = N N N 1α (εN,1] + (1 α)µ+ N,N(α) λN N N m=N (1 α)µ+ N,N+1(α) λN+1 N+1 N + µ+ N,N(α) m=N+1 ... (1 α)µ+ N,M 1(α) λM 1 M 1 N + µ+ N,M 2(α) m=M 1 0 λM M N + µ+ N,M 1(α) m=M

Compression, Generalization and Learning

k = N +1, . . . , M 1

(1 α)µ+ k,k(α) λk k k m=k (1 α)µ+ k,k+1(α) λk+1 k+1 k + µ+ k,k(α) m=k+1 ... (1 α)µ+ k,M 1(α) λM 1 M 1 k + µ+ k,M 2(α) m=M 1 0 λM M k + µ+ k,M 1(α) m=M k = M

0 λM M M m=M

For any given k {0, . . . , M}, consider the corresponding set of inequalities and multiply both sides of the ﬁrst inequality by (1 α)0, both sides of the second inequality by (1 α)1, and so on till the last inequality, which is multiplied by (1 α)M k. Then, summing sideby-side the so-obtained inequalities, and noting that all functions µ+ k,m(α) cancel out, one obtains that the constraints in (40) imply the following inequalities:

k = 0, . . . , N

(1 α)N k1α (εk,1]

k = N +1, . . . , M 0

(1 α)m k. (41)

We next show that the optimal value of problem (37) equals the optimal value of an optimization problem with the same cost function as in problem (37) and the constraints (41) complemented with the condition λm = 0 for m = N + 1, . . . , M, viz.

γ M = inf λm, m=0,...,M

m=0 λm (42a)

subject to: N k

(1 α)N k1α (εk,1]

α [0, 1], k = 0, . . . , N (42b)

(1 α)m k, α [0, 1],

k = N + 1, . . . , M (42c)

λm = 0 for m = N + 1, . . . , M (42d)

(clearly, (42c) is automatically satisﬁed in view of (42d)).

To show that the values of γ M given by (37) and (42) are actually the same, start by noting that adding the condition λm = 0 for m = N + 1, . . . , M to problem (37) does not change its optimal value. This requires a short proof:

The constraints in (40) imply that λm 0 for m = 0, . . . , M as it can be seen from the ﬁrst inequality (m = k) of each group (k = 0, . . . , M) evaluated at α = 1. Now,

M.C. Campi and S. Garatti

given a feasible point of (40) that does not have λm = 0 for m = N + 1, . . . , M, consider a modiﬁed point by setting λm = 0 for m = N + 1, . . . , M and µ+ k,m = 0 for k = 0, . . . , M 1 and m = max{k, N}, . . . , M 1, while maintaining the original choices for all other λm and µ+ k,m. This point is still feasible for (40) because, for all k, all the inequalities for m N + 1 become 0 0, the inequality for m = N is a-fortiori satisﬁed (recall that function µ+ k,N in the left-hand side of this inequality is 0 so that setting it to 0 relaxes the constraint) and all other inequalities are not aﬀected. On the other hand, the value of problem (37) corresponding to the modiﬁed point outdoes the value at the original point since all λm in the original feasible point were nonnegative and some of them have been set to zero in the modiﬁed point.

Since the condition λm = 0 for m = N + 1, . . . , M in (42d) can be added to (37) without aﬀecting its optimal value, and considering that the other constraints in (42) for k = 0, . . . , M are implied by those already present in (37) (as shown before equation (41)), the optimal value of (42) is not bigger than the optimal value of (37). The reverse inequality that the optimal value of (37) is not bigger than the optimal value of (42) is proven by showing that for any feasible point of (42) one can ﬁnd a feasible point of (37) that attains the same value. This is shown in the following.

Consider a feasible point of (42). Evaluating all constraints (42b) for k = 0, . . . , N, at α = 1, one sees that λm 0 for m = 0, . . . , N. Moreover, it holds that λm = 0 for m = N + 1, . . . , M. To ﬁnd the sought feasible point of (37), consider the same λm as those for the feasible point of (42) and complement them with the following functions µ+ k,m. For k = 0, . . . , M 1, m = max{k, N}, . . . , M 1, take µ+ k,m = 0. With this choice, all the inequalities in (40) for k = 0, . . . , M, m = max{k, N + 1}, . . . , M become 0 0 and are therefore satisﬁed. The expressions of µ+ k,m for the remaining indexes are ﬁrst deﬁned over [0, 1) and then extended to the closed interval [0, 1]. Over [0, 1), consider the inequalities in (40) for k = 0, . . . , N 1, m = k, . . . , N 1 and take µ+ k,m(α) such that these inequalities are satisﬁed with equality, starting from top and then proceeding downwards. This gives

µ+ k,k(α) = λk k k

µ+ k,k+1(α) = λk+1 k+1 k

1 α + λk k k

µ+ k,N 1(α) =

Since λm 0, the obtained µ+ k,m(α) s are all positive and, moreover, are continuous over [0, 1). We next show that choice (43) satisﬁes over [0, 1) the remaining inequalities (those in (40) for k = 0, . . . , N and m = N). For k = 0, . . . , N 1 and m = N,

Compression, Generalization and Learning

substituting µ+ k,N 1(α) = PN 1 j=k λj(j k) (1 α)N j and µ+ k,N(α) = 0 gives

1 (1 α)N j , (44)

while for k = N and m = N, substituting µ+ N,N(α) = 0 we have N N

1α (εN,1] λN

Equations (44) and (45) are satisﬁed because they coincide with (42b) (recall that λm = 0 for m = N + 1, . . . , M see (42d)). As for α = 1, note that functions µ+ k,m deﬁned in (43) tend to inﬁnity when α 1. This poses a problem of existence for α = 1, which, however, can be easily circumvented by truncating the functions µ+ k,m in the interval α [1 ρ, 1] at the value µ+ k,m(1 ρ) to obtain

µ+,ρ k,m(α) =

( µ+ k,m(α) α < 1 ρ

µ+ k,m(1 ρ) α 1 ρ,

and noting that all the inequalities are satisﬁed over [0, 1] if ρ is chosen small enough.

Summarizing the results so far, we have

P {φN > εk} (31) γ (34) γM (36) = γ M, (46)

where γ M is given by (42). Notice now that increasing M beyond N in (42) does not change the problem because λm = 0 for m N + 1, so that γ M = γ N for all M N. The proof of the theorem (under condition (21)) is concluded by showing that γ N δ, which is what we do next.

For M = N, problem (42) becomes

γ N = inf λm, m=0,...,N

m=0 λm (47)

subject to: N k

(1 α)N k1α (εk,1]

α [0, 1], k = 0, . . . , N.

Take λm = δ

N for m = 0, . . . , N 1 and λN = 0, so that PN m=0 λm = δ. We show that these λm s are feasible for (47) so that γ N PN m=0 λm = δ. The inequality for k = N is satisﬁed because the left-hand side is 0 (recall that εN = 1 so that the indicator function is 1 over an empty set). For k = 0, . . . , N 1, the inequalities for α = 1 become 0 λk, which is true, while for α [0, 1) the inequalities can be rewritten as

1α (εk,1] δ

N k (1 α) (N m), k = 0, . . . , N 1,

M.C. Campi and S. Garatti

and are satisﬁed in view of the deﬁnition of εk, see (1). This concludes the proof under condition (21).

Next we remove condition (21).

Let us augment each random element zi with a random variable θi uniformly distributed over [0, 1] and independent of zi so as to form an i.i.d. sequence z 1 = (z1, θ1), z 2 = (z2, θ2), . . ..31 Clearly, condition (21) applies (mutatis mutandis) to z i, viz.

P n z i = z o = 0, z Z [0, 1]. (48)

Given a multiset of augmented examples ms(z 1, . . . , z n), let

proj[ms(z 1, . . . , z n)] = ms(z1, . . . , zn),

i.e., proj is the extractor of the zi components. We deﬁne a compression c to be applied to multisets of augmented examples as the compression that satisﬁes the following rule: proj[c (z 1, . . . , z n)] = c(z1, . . . , zn) and, among sub-multisets of ms(z 1, . . . , z n) whose projections is c(z1, . . . , zn), c favors augmented examples with lower second components θi. We next show that c inherits from c the preference property. Suppose that c (z 1, . . . , z n, z ) ms(z 1, . . . , z n). This implies that c(z1, . . . , zn, z) ms(z1, . . . , zn). Then,

proj[c (z 1, . . . , z n, z )] = c(z1, . . . , zn, z)

= c(z1, . . . , zn) (because of the preference property of c)

= proj[c (z 1, . . . , z n)]

and, hence, the z components of c (z 1, . . . , z n, z ) and those of c (z 1, . . . , z n) coincide. Moreover, also the θ components coincide by the rule that favors lower second components. This establishes the preference property of c .

In view of (48) and the fact that c has the preference property, we are in the position to apply to c the proof that has been developed before under the assumption of no concentrated mass. Deﬁning φ N = P{c (c (z 1, . . . , z N), z N+1) = c (z 1, . . . , z N)|z 1, . . . , z N} and k = |c (z 1, . . . , z N)|, we have P{φ N > εk } δ.

On the other hand, c(c(z1, . . . , z N), z N+1) = c(z1, . . . , z N)

implies that c (c (z 1, . . . , z N), z N+1) = c (z 1, . . . , z N)

(while the vice-versa does not hold), which gives φN φ N P-almost surely. Moreover, k = k . Hence, P{φN > εk} P{φ N > εk} = P{φ N > εk } δ.

This concludes the proof.

31. Note that the augmented random elements z i are mere mathematical tools used to draw conclusions on zi, which remain the measured and relevant variables.

Compression, Generalization and Learning

5.3 Proof of Theorem 7

We prove the equivalent statement that

P n φN < εk or φN > εk o δ.

The proof parallels that of Theorem 4 and we highlight here the diﬀerences.

Result (22) holds unaltered in the present context, with the only notational diﬀerence that εk is now εk. By proving a similar equation for P{φN < εk}, we come to the result

P n φN < εk or φN > εk o =

[0,εk) (εk,1] dm+ k,N. (49)

One main diﬀerence arises next in connection with (i) and (ii): while (i) holds as before, (ii) holds in this context with equality, which we write in the following way:

(ii) For m = 0, 1, . . . and k = 0, . . . , m, it holds that

m+ k,m+1 (1 α)m+ k,m = 0.

Proof of (ii) : Follow the derivation of (ii) till equation (29). Next, we prove that, in the present context of Theorem 7, equation (29) also holds with reversed inequality, so proving that the two sides of (29) are in fact equal. To see this, notice that in the event under the sign of probability in the right-hand side of (29) it holds that c(z1, . . . , zm) = ms(z1, . . . , zk), which, owing to the preference Property 2, implies

c(z1, . . . , zk, zi) = ms(z1, . . . , zk), i = k + 1, . . . , m,

and c(z1, . . . , zk) = ms(z1, . . . , zk).

In addition, in the same event it also holds that (simply substitute c(z1, . . . , zm) with ms(z1, . . . , zk) in the ﬁrst condition that deﬁnes the event)

c(z1, . . . , zk, zm+1) = ms(z1, . . . , zk).

P n c(c(z1, . . . , zm), zm+1) = c(z1, . . . , zm)

and c(z1, . . . , zm) = ms(z1, . . . , zk) and φm B o

P n c(z1, . . . , zk, zi) = c(z1, . . . , zk), i = k + 1, . . . m + 1

and c(z1, . . . , zk) = ms(z1, . . . , zk) and φm B o

P n c(z1, . . . , zm+1) = ms(z1, . . . , zk) and φm B o ,

where the last inequality follows from the non-associativity Property 5. This establishes the reversed inequality of (29) and, therefore, that (29) and (30) hold with equality. The ﬁnal part of the proof of (ii) consists in re-writing the right-hand side of (30) as is done in the proof of (ii).

M.C. Campi and S. Garatti

We are now ready to upper-bound P n φN < εk or φN > εk o by taking the sup of (49)

under conditions (i) and (ii) (in addition to the fact that measures m+ k,m belong to the cone M+ of positive ﬁnite measures on [0, 1]). This gives

P n φN < εk or φN > εk o γ, (50)

where γ is deﬁned as the value of the optimization problem

γ = sup m+ k,m M+

m=0,1,..., k=0,...,m

[0,εk) (εk,1] dm+ k,N

subject to:

[0,1] dm+ k,m = 1, m = 0, 1, . . .

m+ k,m+1 (1 α)m+ k,m = 0, m = 0, 1, . . . ; k = 0, . . . , m.

To evaluate γ, we consider as before a truncated version of the problem

γM = sup m+ k,m M+

m=0,...,M, k=0,...,m

[0,εk) (εk,1] dm+ k,N (51a)

subject to:

[0,1] dm+ k,m = 1, m = 0, . . . , M (51b)

m+ k,m+1 (1 α)m+ k,m = 0,

m = 0, . . . , M 1; k = 0, . . . , m. (51c)

and observe that γ γM, (52)

for all M. In evaluating γM by dualization one important diﬀerence with the proof of Theorem 4 occurs in the Lagrangian

[0,εk) (εk,1] dm+ k,N

[0,1] dm+ k,m 1

[0,1] µk,m(α) d[m+ k,m+1 (1 α)m+ k,m], (53)

because functions µk,m C0[0, 1] are now required only to be continuous, while their sign is arbitrary (this diﬀerence stems from the equality condition on measures in (ii) as opposed to the inequality condition in (ii)).

Tantamount to (36), we here have

γM (A) = sup {m+ k,m} inf {λm} {µk,m}

L (B) = inf {λm} {µk,m}

sup {m+ k,m} L (C) = γ M, (54)

Compression, Generalization and Learning

where γ M is the value of the dual of problem (51):

γ M = inf λm, m=0,...,M µk,m C0[0,1], m=0,...,M 1, k=0,...,m

m=0 λm (55a)

subject to: m k

1α [0,εk) (εk,1]1m=N + (1 α)µk,m(α)1m =M

+ µk,m 1(α)1m =k, α [0, 1],

k = 0, . . . , M, m = k, . . . , M. (55b)

Proof of (A) in (54): If measures m+ k,m do not satisfy the constraints in (51b) and (51c), then inf{λm},{µk,m} L is equal to . The reason why this is true for (51b) is the same as the reason why this is true for (33b) in the proof of (A) in Theorem 4. As for (51c), note that in the present context functions µk,m have more ﬂexibility than µ+ k,m in Theorem 4 because they need not be positive. By concentrating on µk,m s that are indeed positive, we have as before that m+ k,m+1 (1 α)m+ k,m M ; similarly, with negative µk,m s one concludes that m+ k,m+1 (1 α)m+ k,m M+, and these two facts together imply (51c). To close the proof of (A), we simply notice that the Lagrangian (53) with (51b) and (51c) in place reduces to (51a).

Proof of (B) in (54): As in the proof of Theorem 4, matters of convenience suggest to introduce a modiﬁed Lagrangian Lτ that corresponds to a continuous cost function. To this aim, for k = 0, 1, . . . , N, the integral R

[0,εk) (εk,1] dm+ k,N in the ﬁrst term

of the Lagrangian is rewritten as R

[0,1] 1α [0,εk) (εk,1]dm+ k,N and the indicator function 1α [0,εk) (εk,1] is replaced with a continuous function ϕk,τ(α) that perturbs the Lagrangian in a vanishing way as τ . Precisely, to obtain a continuous transition, we tilt the edges of the indicator function by a small enough quantity τ (which creates linear slopes over the intervals (εk, εk + τ] and [εk τ, εk) while leaving the indicator function unaltered for other values of α) with the only advice that: if εk = 0 (so that the left edge does not exist), then ϕk,τ(α) continues at the value 0 till α = 0; likewise, ϕk,τ(α) continues at value 0 till α = 1 if εk = 1. The modiﬁed Lagrangian is

[0,1] ϕk,τ(α) dm+ k,N

[0,1] dm+ k,m 1

[0,1] µk,m(α) d h m+ k,m+1 (1 α)m+ k,m i .

In full analogy with (83) in Theorem 4, the proof consists in showing the validity of the following relations:

sup{m+ k,m} inf {λm} {µk,m} Lτ = inf {λm} {µk,m} sup{m+ k,m} Lτ

sup{m+ k,m} inf {λm} {µk,m} L inf {λm} {µk,m} sup{m+ k,m} L, (56)

M.C. Campi and S. Garatti

where the only diﬀerence with (83) is that the positive functions µ+ k,m are now the functions µk,m that are undeﬁned in sign. Here, as in Theorem 4, we need only to show the validity of the = at top and the convergence τ 0 on the left.

To show the validity of the top equality

sup {m+ k,m} inf {λm} {µk,m}

Lτ = inf {λm} {µk,m}

sup {m+ k,m} Lτ, (57)

one follows the same argument as in Theorem 4 after noting that there is no need here to introduce the positive measures p+ k,m (in Theorem 4, the p+ k,m s served the purpose of making null the measures qk,m = m+ k,m+1 (1 α) m+ k,m +p+ k,m to evaluate the value V and the supervalue V ; this is not needed here because m+ k,m+1 (1 α) m+ k,m is downright zero and not just negative). Hence, for precise reference, we make explicit that set H in the present context becomes

H := n (v, {rm}, qk,m ) R RM+1 M (M+1)M

[0,1] ϕk,τ(α) dm+ k,N,

[0,1] dm+ k,m 1

qk,m = m+ k,m+1 (1 α) m+ k,m ,

where, for all m and k, m+ k,m M+o . (58)

All derivations from here till the equivalent of equation (89) are identical to those developed in Theorem 4 with the only notice that functions µε k,m must not be nonnegative in the present context (hence, delete from the derivations in Theorem 4 the sentence Moreover, noting . . . in place of µε k,m. ). This way one arrives in the present context to the following equivalent of (89), which only diﬀers from (89) because functions µk,m, which have undeﬁned sign, take the place of µ+ k,m:

inf {λm} {µk,m}

sup (v,{rm},{qk,m}) H

[0,1] µk,m(α) dqk,m

By recalling the expression of v, rm, qk,m in the deﬁnition of H given in (58) and noticing that the curly bracket in the left-hand side is nothing but Lτ (in Theorem 4 we had to digress to take care of measures p+ k,m), (59) immediately gives the counterpart of (91): inf {λm} {µk,m}

sup {m+ k,m} Lτ V .

The last portion after (91) holds unaltered in the present context to conclude that

inf {λm} {µk,m}

sup {m+ k,m} Lτ = V .

Compression, Generalization and Learning

The next step that V = V becomes simpler in the present context. Indeed, one has just to suppress p+,i k,m wherever encountered to show the validity of equations (96) and (97), while, by a derivation similar to that used to obtain (96) and (97), one obtains from (94) (without p+ k,m) relation Z

[0,1] gj(α) d[ m+ k,m+1 (1 α) m+ k,m] = 0,

gj(α), j = 1, 2, . . . , m = 0, 1, . . . , M 1, k = 0, . . . , m. (60)

The last part now becomes: Taking now any function f in C0[0, 1] and noting that f can be arbitrarily approximated in the sup norm by a function gj, (60) yields Z

[0,1] f(α) d[ m+ k,m+1 (1 α) m+ k,m] = 0,

from which m+ k,m+1 (1 α) m+ k,m = 0 (61)

(recall Footnote 30). In the light of (96), (97) and (61) one sees that { m+ k,m} maps into the point ( V , {rm = 0}, qk,m = 0 ), which proves that this point is in H. Hence, it holds that V = V and equation (57) remains proven.

Turning now to equation

lim τ 0 sup {m+ k,m} inf {λm} {µk,m}

Lτ = sup {m+ k,m} inf {λm} {µk,m}

(which is the only remaining relation to prove in (56)), we notice that this is a step that contains some major diﬀerences from Theorem 4, for which reason we prefer to repeat the whole derivation even at the price of duplicating some parts already contained in the proof of Theorem 4.

Notice that, in both sides of (62), the inf operator sends the value to whenever the constraints in (51b) or (51c) are not satisﬁed by {m+ k,m}: hence, (51b) and (51c) must be satisﬁed and are always assumed from now on. Under (51b) and (51c), (62) is rewritten as

lim τ 0 sup {m+ k,m}

[0,1] ϕk,τ(α) dm+ k,N

= sup {m+ k,m}

[0,1] 1α [0,εk) (εk,1] dm+ k,N. (63)

To show the validity of (63), we discretize τ into τi, i = 1, 2, . . ., τi 0, and consider a sequence { m+,i k,m}, i = 1, 2, . . . (where measures m+,i k,m satisfy (51b) and (51c) for any i), such that

[0,1] ϕk,τi(α) d m+,i k,N

M.C. Campi and S. Garatti

equals the left-hand side of (63) (for this to hold, m+,i k,m must achieve a progressively closer and closer approximation of sup{m+ k,m} in the left-hand side of (63) as i in-

creases); then, we construct from { m+,i k,m} a new sequence { m+,i k,m}, i = 1, 2, . . . (still satisfying (51b) and (51c)), such that

[0,1] ϕk,τi(α) d m+,i k,N lim i

[0,1] 1α [0,εk) (εk,1] d m+,i k,N, (64)

which shows that the left-hand side of (63) is upper-bounded by a value that is no bigger than the right-hand side of (63). Since, on the other hand, the left-hand side of (63) cannot be smaller than the right-hand side of (63) because ϕk,τ(α) 1α [0,εk) (εk,1], (63) remains proven.

The construction of { m+,i k,m} is in three steps:

Step 1. [construction of {ˇm+,i k,m}] For all k N for which εk = 1 and for all m, move the

probabilistic mass of m+,i k,m contained in the interval (εk τi, εk] into a concentrated mass in point εk + τi and, for all k N for which εk = 0 and for all m, move the probabilistic mass of m+,i k,m contained in the interval [εk, εk+τi) into a concentrated

mass in point εk τi; let ˇm+,i k,m be the corresponding measures.

Step 2. [construction of {ˆm+,i k,m}] The mass shift in Step 1 can lead to measures ˇm+,i k,m that

violate condition (51c) in εk + τi and/or εk τi; the new measures ˆm+,i k,m restore the validity of condition (51c). The construction is in two steps that focus on εk + τi and εk τi, respectively. For all k > N and all k N for which εk = 1, let ˆm+,i(1) k,m = ˇm+,i k,m, for all m = k, . . . , M. For all other k s, let ˆm+,i(1) k,k = ˇm+,i k,k; then, verify sequentially for m = k, . . . , M 1 whether the condition

ˇm+,i k,m+1({εk + τi}) (1 (εk + τi)) ˆm+,i(1) k,m ({εk + τi}) = 0 (65)

is satisﬁed; if yes, let ˆm+,i(1) k,m+1 = ˇm+,i k,m+1, otherwise trim ˇm+,i k,m+1({εk + τi}) to the

value (1 (εk+τi)) ˆm+,i(1) k,m ({εk+τi}) and deﬁne ˆm+,i(1) k,m+1 as the trimmed version of ˇm+,i k,m+1.32 Likewise, we need to restore validity of condition (51c) in εk τi. Since we again want to trim i.e., reducing rather than raising measures (for reasons that will become clear in Step 3) and in this case the mass shift is leftward, we are well-advised to scan the values of m from bottom to top. For all k > N and all k N for which εk = 0, let ˆm+,i k,m = ˆm+,i(1) k,m , for all m = k, . . . , M. For all

other k s, let ˆm+,i k,M = ˆm+,i(1) k,M ; then, verify sequentially for m = M 1, M 2 . . . , k whether the condition

ˆm+,i k,m+1({εk τi}) (1 (εk τi)) ˆm+,i(1) k,m ({εk τi}) = 0

32. Note that, if (65) is violated, its left-hand side is necessarily greater than zero (so that the trimming operation achieves its intended goal). Reason is that the initial masses in (εk τi, εk] are balanced (i.e., m+,i k,m+1(εk τi, εk] R

(εk τi,εk](1 α)d m+,i k,m = 0) and the shift to εk + τi reduces the coeﬃcient (1 α) in the second term, which is the negative one.

Compression, Generalization and Learning

is satisﬁed; if yes, let ˆm+,i k,m = ˆm+,i(1) k,m , otherwise trim ˆm+,i(1) k,m ({εk τi}) to the

value ˆm+,i k,m+1({εk τi})/(1 (εk τi)) and deﬁne ˆm+,i k,m as the trimmed version

of ˆm+,i(1) k,m .

Step 3. [construction of { m+,i k,m}] The trimming operation in Step 2 may have unbalanced some equalities in (51b), i.e., it may be that

[0,1] dˆm+,i k,m < 1

for some m. If so, re-gain balance by adding to measure ˆm+,i m,m a suitable probabilistic mass concentrated in α = 1, while leaving other measures ˆm+,i k,m, k = m,

unaltered. The so-obtained measures are m+,i k,m. Note that this operation pre-

serves the validity of condition m+,i m,m+1 (1 α) m+,i m,m = 0 (since we have altered measures ˆm+,i m,m only in α = 1 where coeﬃcient (1 α) is null), so that { m+,i k,m} satisﬁes (51c) besides (51b).

Since the mass shift in Step 1 has only moved masses into points where ϕk,τi(α) is bigger, this mass shift can only increase PN k=0 N k R

[0,1] ϕk,τi(α) d m+,i k,N; moreover, any trimming and re-balancing in Steps 2 and 3 involve vanishing masses as τi 0. Therefore,

[0,1] ϕk,τi(α) d m+,i k,N lim i

[0,1] ϕk,τi(α) d m+,i k,N. (66)

On the other hand, by construction, ϕk,τi(α) = 1α [0,εk) (εk,1] if εk = 0 and εk = 1, while, for εk = 0 and/or εk = 1, ϕk,τi(α) = 1α [0,εk) (εk,1] only occurs where m+,i k,N is null. Hence,

[0,1] ϕk,τi(α) d m+,i k,N =

[0,1] 1α [0,εk) (εk,1] d m+,i k,N,

which, substituted in (66), gives (64). This concludes the proof of (B).

Proof of (C) in (54): The proof of point (C) follows, mutatis mutandis, that of Theorem 4. Here, the Lagrangian can be re-written as

1α [0,εk) (εk,1]1m=N + (1 α)µk,m(α)1m =M

µk,m 1(α)1m =k

and the argument is closed as in Theorem 4, in which derivation one has only to substitute µ+ k,m with µk,m.

M.C. Campi and S. Garatti

Next we want to evaluate γ M for problem (55).

In the present context, the constraints can be rewritten more explicitly as in (40) with the only change that all functions µ+ k,m, for any index m, k, need not be positive, so that the superscript + must be dropped everywhere and, moreover, the indicator function 1α (εk,1] now becomes 1α [0,εk) (εk,1]. The discussion that follows (40) remains the same with only a major diﬀerence: the discussion leading up to the conclusion that one can add the constraints λm = 0 for m = N + 1, . . . , M ceases to be valid here because it was grounded on the fact that functions µ+ k,m were positive, while here functions µk,m do not undergo this requirement. Hence, one comes to the following problem:

inf λm, m=0,...,M

m=0 λm (67a)

subject to: N k

(1 α)N k1α [0,εk) (εk,1]

α [0, 1], k = 0, . . . , N (67b)

(1 α)m k, α [0, 1],

k = N + 1, . . . , M (67c)

and the claim is that it returns the same optimal value γ M as (55). The fact that the optimal value of (67) is not bigger than the optimal value of (55) is obvious. The converse result that the optimal value of (55) is not bigger than the optimal value of (67) is proven by showing that for any feasible point of (67) one can ﬁnd a feasible point of (55) that attains the same value, which is shown in the following.

Consider a feasible point of (67). To ﬁnd the sought feasible point of (55), consider the same λm as those for the feasible point of (67) and augment them with the functions µk,m deﬁned as follows. Consider the inequalities in (40) where the µ+ k,m s are replaced by the µk,m s as it must be in the present context. The inequality for k = M and m = M is satisﬁed in view of (67c) for k = M. Next, satisfy the inequalities corresponding to k = 0, . . . , M 1, m = max{k, N} + 1, . . . , M with equality starting from bottom and then proceeding upward. This gives:

µk,M 1(α) = λM

µk,M 2(α) = λM 1

µk,max{k,N}(α) =

j=max{k,N}+1 λj

(1 α)j max{k,N} 1.

Note that for k = N + 1, . . . , M 1, this choice also satisﬁes the inequalities for m = k in view of (67c). The expression of µk,m over [0, 1) for the remaining indexes

Compression, Generalization and Learning

are instead deﬁned as in Theorem 4 by equations (43). We show that choices (68) and (43) satisfy over [0, 1) the remaining inequalities (those for k = 0, . . . , N and m =

N). Substituting µk,N 1(α) = PN 1 j=k λj(j k) (1 α)N j and µk,N(α) = PM j=N+1 λj j k (1

α)j N 1 in these inequalities gives

1α [0,εk) (εk,1]

1 (1 α)N j +

(1 α)j N. (69)

Equation (69) is satisﬁed because it coincides with (67b). As for α = 1, note that functions µk,m deﬁned in (43) tend to inﬁnity when α 1. This poses a problem of existence for α = 1, which, however, can be easily circumvented as in Theorem 4 by truncating the functions µk,m in the interval α [1 ρ, 1] at the value µk,m(1 ρ) to obtain

µρ k,m(α) =

( µk,m(α) α < 1 ρ µk,m(1 ρ) α 1 ρ,

and noting that all the inequalities are satisﬁed over [0, 1] if ρ is chosen small enough.

Summarizing the results so far, we have

P n φN < εk or φN > εk o (50) γ (52) γM (54) = γ M,

where γ M is given by (67). Choose now M = 4N. The proof of the theorem is concluded by showing that γ 4N δ, which is what we do next.

Take λm = δ 2N for m = 0, . . . , N 1, λN = 0 and λm = δ 6N for m = N + 1, . . . , 4N, so that P4N m=0 λm = δ. Inequalities (67c) are clearly satisﬁed because all λm are nonnegative. The inequalities in (67b) for α = 1 are satisﬁed because they become 0 λk (for k < N, the term (1 α)N k in the left-hand side of (67b) annihilates, while for k = N it is the indicator function that annihilates because εN = 1). For α [0, 1), (67b) can be rewritten as

1α [0,εk) (εk,1] δ 2N

N k (1 α) (N m) + δ

N k (1 α)m N,

k = 0, . . . , N 1,

1α [0,εN) δ 6N

(1 α)m N, k = N,

and are satisﬁed in view of the deﬁnition of εk and εk, see (4) and (5). This concludes the proof.

Remark 30 Theorem 7 preserves its validity under the following slightly weaker assumption than the non-concentrated mass Property 6 (while maintaining the preference and nonassociativity Properties 2 and 5): with probability 1, if for some n an example z appears in c(z1, . . . , zn), then it appears in c(z1, . . . , zn) as many times as it does in ms(z1, . . . , zn).

M.C. Campi and S. Garatti

Clearly, the non-concentrated mass property implies the assumption stated here because the non-concentrated mass property gives that any multiset does not have repetitions with probability 1. To instead prove that Theorem 7 preserves its validity under this extension, one proceeds similarly to the last part of the proof of Theorem 4 (where the temporary condition of non-concentrated mass was removed). Precisely, deﬁne z i, c , φ N, and k

as it was done there. Then, the preference property of c holds as shown in the proof of Theorem 4, while the non-associativity property of c follows from the non-associativity of c along a similar argument. Moreover, z i satisﬁes the non-concentrated mass property by construction. Using Theorem 7 for z i, c , φ N, and k now gives

P{εk φ N εk } 1 δ.

Then, to prove Theorem 7 for the initial setting, one has to show that k = k and φN = φ N, with probability 1. k = k is obviously true. As for φN = φ N, in the proof of Theorem 4, it was already noted that c(c(z1, . . . , z N), z N+1) = c(z1, . . . , z N) implies that c (c (z 1, . . . , z N), z N+1) = c (z 1, . . . , z N). On the other hand, the assumption introduced in the present remark straightforwardly gives that c (c (z 1, . . . , z N), z N+1) = c (z 1, . . . , z N) implies that c(c(z1, . . . , z N), z N+1) = c(z1, . . . , z N), except for at most a probability zero event. This gives φN = φ N, and we therefore have

P{εk φN εk} = P{εk φ N εk } 1 δ.

5.4 Proof of Proposition 8

We ﬁrst establish that k N εk εk. The result is obvious for k = N since εk = 1 = εk. For k < N, recall that εk is the unique solution to Ψk,δ(α) = 1, while εk is the solution to Ψk,δ(α) = 1 bigger than k N . Notice that Ψk,δ(α) = 1

2Ψk,δ(α) + νk,δ(α), where νk,δ(α) =

δ 6N P4N m=N+1 (m k) (N k)(1 α)m N. Using the same argument given in Appendix A to show that

N ) < 1, it is easy to prove that νk,δ( k

2, which, along with the fact that νk,δ(α) is strictly decreasing over the interval of interest (0, 1), yields νk,δ(α) < 1

2 for α [ k

N , 1). Since εk > k N (remember that Ψk,δ(α) is monotonically increasing and the argument in Appendix A can be used again to show that Ψk,δ( k

N ) δ < 1), we have that

Ψk,δ(εk) = 1 2Ψk,δ(εk) + νk,δ(εk) = 1 2 + νk,δ(εk) (since Ψk,δ(εk) = 1) < 1. (since νk,δ(εk) < 1

This, along with the fact that Ψk,δ(α) is ﬁrst decreasing and then increasing, gives εk εk.

We next prove equations (7) and (8).

For k = 0, 1, . . . , N 1, consider the equation (note that the left-hand side is not exactly equal to function Ψk,δ in (3) because a 2 in the ﬁrst term has been substituted with

Compression, Generalization and Learning

N k (1 α)m N + δ

N k (1 α)m N = 1. (70)

For every α in ( , 1), the left-hand side is no bigger than Ψk,δ(α) and, as a function of α, it has a behavior similar to Ψk,δ (i.e., it is ﬁrst decreasing and then increasing, diverging to + for both α and α 1). Thus, (70) admits exactly two solutions in ( , 1), one smaller than k/N and one bigger than k/N, and these solutions give a lower bound and an upper bound to εk and εk, respectively. Since (1 α)N k = 0 over ( , 1), equation (70) is equivalent to

(1 α)m k = 1 + δ

(1 α)N k, (71)

which is obtained by multiplying both sides of (70) by N k (1 α)N k and then adding on the left and on the right the term δ 6N N k (1 α)N k. Notice that (71) is meaningful for k = N too, and is indeed equivalent to the equation deﬁning εN, i.e., ΨN,δ(α) = 1 (for k = N, (71) admits only one solution, which coincides with εN). Thus, summarizing, the solution to (71) smaller than k/N provides a lower bound to εk for all k = 0, 1, . . . , N, while the solution greater than k/N, which exists for k = 0, 1, . . . , N 1, provides an upper bound to εk.

We now rewrite (71) in a form that is better suited to obtain an explicit evaluation of its solutions.

For k = 0, the summation in the left-hand side of (71) is

m=0 (1 α)m = 1 (1 α)4N+1

and, noticing that for a generic k it holds that

(1 α)m k = ( 1)k

m=0 (1 α)m #

a cumbersome, but straightforward, computation gives the following general formula for the left-hand side of (71)

(1 α)m k = δ 6N 1 Pk i=0 4N+1 i αi(1 α)4N+1 i

αk+1 . (72)

Substituting in (71) and multiplying on the left and on the right by Nαk+1, we then obtain

αi(1 α)4N+1 i !

αk+1(1 α)N k, (73)

M.C. Campi and S. Garatti

which, in view of the multiplication by Nαk+1, is equivalent to (71) for α = 0. In what follows, the usage of equation (73) to investigate the solutions to (71) will be limited to the interval (0, 1) (and, hence, the condition α = 0 becomes irrelevant). As is clear, this is always the case for the solution greater than k/N, which upper bounds εk, while for the solution lower than k/N, which lower bounds εk, a suﬃcient condition for this solution to be in (0, 1) is that k > ln(3/δ), (74)

as shown in the following derivation.

Start by noticing that

k > ln(3/δ) δ

6ek+1 > 1 δ

where the latter implication follows from 4x/x > ex for x 0. Since 4N+1

N i 4 for i = 0, 1, . . . , k 1, the previous inequality implies that

N + 1 k 1 k + 1 > 1,

which, re-arranging the terms and dividing the left-hand and right-hand sides by k!, yields δ 6N (4N + 1) (4N + 1 k)

(k + 1) k! > 1 + δ

N (N + 1 k)

or, in a more compact form, δ 6N 4N+1 k+1 > 1 + δ 6N N k . Using in this last expression the hockey-stick identity (which asserts that 4N+1 k+1 = P4N m=k m k ), one concludes that

This means that the left-hand side of (71) evaluated at α = 0 is larger than the righthand side evaluated at α = 0, and this implies that the two solutions of (71) (and, thereby, those of (73)) are both within the interval (0, 1).

Apart from the factor δ

6, the left-hand side of (73) is a so-called Beta(k+1,4N +1 k) cumulative distribution function and, as such, it takes value 0 for α = 0 and is strictly increasing in (0, 1) converging to the value 1. The right-hand side, instead, for k = 0, . . . , N 1 takes value 0 for both α = 0 and α = 1 and is ﬁrst increasing and then decreasing, while for k = N is 0 for α = 0 only and it is increasing. A graphical illustration of the typical trend for the left-hand and right-hand sides of (73) for ln(3/δ) < k N 1 is given in Figure 7 (solid blue lines). Given this state of things, it is clear that if the left-hand side of (73), call it L(α), is replaced by a function that lies below L(α), while the right-hand side, call it R(α), is replaced by a monotonically decreasing function that stays above R(α), the so-obtained equation has the property that any solution to it upper bounds the solution to (73) bigger than k/N, which in turn upper bounds εk. Similarly, keeping the same replacement for the left-hand side of (73), but this time substituting the right-hand side R(α)

Compression, Generalization and Learning

Figure 7: Graph of the left-hand and right-hand sides of (73) (solid blue lines) and of the curves that are used to obtain suitable lower and upper bounds to the solutions to (73) (dashed red lines).

with a monotonically increasing function that stays above R(α), an equation is obtained whose solutions provide lower-bounds to εk. In the following, the sought upper bound to εk and lower bound to εk will be obtained by replacing L(α) with a suitable constant function and the right-hand side R(α) with a decreasing exponential and an increasing exponential function, respectively. See again Figure 7 for a graphical illustration (dashed red lines).

Upper bound to εk

Consider (73) for k = 0, 1, . . . , N 1. Start with the left-hand side and notice that

αi(1 α)4N+1 i !

for α k+1 4N+2. As a matter of fact, k+1 4N+2 is the mean of the Beta distribution having cumulative distribution function 1 Pk i=0 4N+1 k αi(1 α)4N+1 i and the mean of this Beta distribution is greater than its median, see Payton et al. (1989).

M.C. Campi and S. Garatti

As for the right-hand side of (73), pick any number c > 1 and notice that33

αk+1(1 α)N k

(k + 1) N + 1 k + 1

αk+1(1 α)N+1 (k+1)

αi(1 α)N+1 i

7 6(k + 1)ck+1 k+1 X

i (1 α)N+1 i

7 6(k + 1)ck+1 N+1 X

i (1 α)N+1 i

= 7 6(k + 1)(1 (1 c))k+1 1 αc 1

7 6(k + 1)e (1 c)(k+1)e α c 1

7 6(k + 1)e (1 c)(k+1)e α c 1

where the second-last inequality derives from relation 1 x e x. Taking now

ln(k + 1) + ln 14

one obtains 1 + δ

αk+1(1 α)N k

7 6(k + 1)e

ln(k+1)+ln 14

ln(k+1)+ln 14

ln(k+1)+ln 14

where the right-hand side is a monotonically decreasing function of α. Using (75) and (76) together, in view of the argument given after (73) we have that the solution to

ln(k+1)+ln 14

ln(k+1)+ln 14

ln(k+1)+ln 14

upper bounds εk as long as the solution turns out to be no smaller than k+1 4N+2 (which is the condition to ensure that (75) is indeed true). Solving (77) for α gives

ln(k + 1) + ln 14

δ N + ln(k + 1) + ln 14

33. This computation is insipred by Alamo et al. (2015).

Compression, Generalization and Learning

which is always greater than k+1 4N+2. Thus, it follows that, for k = 0, 1, . . . , N 1,

ln(k + 1) + ln 14

δ N + ln(k + 1) + ln 14

Moreover, the bound turns out to be valid for k = N too, since εN = 1, while the righthand side of the previous inequality is greater than 1 for k = N. Using the fact that q

ln(k + 1) + ln 14

ln(k + 1) + p

δ in the previous expression, and rearranging the terms, one obtains:

ln(k + 1) + p

ln(14) + ln(k + 1) + ln(14) + 1

It is easy to verify that ln(k + 1) + ln(14) + 1 < 4

k + 1; using this fact in the numerator of the second term together with p

ln(14) + 2 < 4 yields

ln(k + 1) + 4 + 2

which is the bound in (7).

Lower bound to εk

Consider (73) again, this time for k > ln(3

δ) as given in (74) (which, as we have seen, ensures that the solution lower than k/N takes value in the interval (0, 1)). The right-hand side of (73) is upper bounded as follows (c is any number bigger than 1): 1 + δ

αk+1(1 α)N k

(k + 1) N + 1 k + 1

αk+1(1 α)N+1 (k+1)

αi(1 α)N+1 i

7 6(k + 1) 1 ck+1

(αc)i(1 α)N+1 i

7 6(k + 1) 1 ck+1

(αc)i(1 α)N+1 i

= 7 6(k + 1)

1 + α(c 1) N+1 1 + (c 1) k+1

7 6(k + 1)eα(c 1)(N+1)e c 1 (c 1)2

M.C. Campi and S. Garatti

where the last inequality derives from ex x2

2 1 + x ex for x 0. With the choice

ln(k + 1) + ln 14

we now obtain 1 + δ

αk+1(1 α)N k

7 6(k + 1)eα

ln(k+1)+ln 14

δ (N+1) k+1 e q

ln(k+1)+ln 14

2(ln(k+1)+ln 14

where the right-hand side is an increasing function of α. Using (75) and (78) together, in view of the argument given after (73) we have that the solution to

ln(k+1)+ln 14

δ (N+1) k+1 e q

ln(k+1)+ln 14

2(ln(k+1)+ln 14

lower bounds εk as long as the solution turns out to be no smaller than k+1 4N+2 (which is the condition to ensure that (75) is indeed true). Solving (79) for α gives

k + 1 N + 1

ln(k + 1) + ln 14

Hence, for those k for which α k+1 4N+2, we have that εk α, which also clearly gives the looser inequality

k + 1 N + 1

ln(k + 1) + ln 14

On the other hand, if k is such that α < k+1 4N+2, then, substituting the expression for α given in (80) in relation α < k+1 4N+2 yields

k + 1 N + 1

ln(k + 1) + ln 14

N + 1 k + 1

from which k + 1 N + 1 3

k + 1 N + 1

ln(k + 1) + ln 14

and, since εk 0 by deﬁnition, we conclude that (81) remains valid in this case as well.

Go now back to re-consider condition k > ln(3

δ) introduced at the beginning of this part about lower bounding εk, and consider the opposite case that k ln(3

δ). If so, it also holds

that k + 1 3

ln(k + 1) + ln 14

δ , which in turn implies that the right-hand side of (81) is no bigger than 0. This, in view of relation εk 0, proves that (81) is valid in this case as well. Hence, we conclude that (81) is valid for all k = 0, 1, . . . , N. Since k+1 N+1 k

Compression, Generalization and Learning

1 N+1 < 1 N , and q

ln(k + 1) + ln 14

ln(k + 1) + p

δ, from (81) we also obtain

ln(k + 1) + p

The bound in (8) is ﬁnally obtained by noticing that p

ln(14) < 2.

This concludes the proof.

Acknowledgments

The authors would like to thank professor Nicol o Cesa-Bianchi for insightful and inspiring discussions on various parts of this manuscript. They also express their gratitude to the editor and to anonimous reviewers for generously providing comments that helped improve the paper.

The research presented in this article has been partly supported by MUR under the PRIN 2022 project The Scenario Approach for Control and Non-Convex Design (project number D53D23001440006) and by FAIR (Future Artiﬁcial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artiﬁcial Intelligence).

M.C. Campi and S. Garatti

Appendix A. A study of the graph of Ψk,δ

For k = 0, 1, . . . , N 1, the derivative of Ψk,δ, i.e.,

dα (α) = δ 2N

N k (N m)(1 α) (N m+1) δ

N k (m N)(1 α)m N 1,

is a strictly increasing function with limit for α and limit + for α 1. Hence, over the interval ( , 1), Ψk,δ is ﬁrst decreasing and then increasing, with a unique minimum. Since Ψk,δ + for both α and α 1, to prove that the equation Ψk,δ(α) = 1 has indeed two solutions, it is enough to exhibit an α ( , 1) for which Ψk,δ( α) < 1. To this aim, consider α = k

N and notice that

Ψk,δ(k/N) = δ

(N k)N m + 1

N k (N k)m N

For m = k, . . . , N 1, we have

where the inequality is satisﬁed since each term N k i

N i N N k in the product is no bigger than 1. Similarly, for m = N + 1, . . . , 4N we have

N k (N k)m N

N + i N k + i N k

(again it is straightforward to verify that each term N+i N k+i N k

N is no bigger than 1). We therefore see that both sums in the square brackets in the right-hand side of (82) are arithmetic means of quantities no bigger than 1, and therefore they are no bigger than 1 as well. This gives Ψk,δ(k/N) δ < 1, and it also shows that αk < k

Appendix B. MATLAB code

In this Appendix, we provide eﬃcient bisection algorithms in the MATLAB computing environment for the evaluation of εk, εk and εk. The algorithms take advantage of some reformulations of the equations Ψk,δ(α) = 1 and Ψk,δ(α) = 1 as discussed in the beginning of the proof of Proposition 8 in Section 5.4. A summary of these reformulations are provided before the MATLAB codes.

B.1 Bisection algorithm for the computation of εk

Equation Ψk,δ(α) = 1 for k = 0, . . . , N 1 can be rewritten as

(1 α)m k = N k

Compression, Generalization and Learning

and, using a formula analogous to (72) in Section 5.4, but with N 1 in place of 4N, this equation is shown to be equivalent over the interval (0, 1) to

αi(1 α)N i !

αk(1 α)N k.

The only solution to this equation in (0, 1) is εk and it can be computed by bisection starting from 0 and 1 as initial extremes. Apart from the coeﬃcient δ, the left-hand side is an incomplete Beta function with parameters k + 1 and N k, which can be eﬃciently and accurately evaluated with the betainc function of MATLAB. Similarly, apart from the term αN, the right-hand side can be computed as the diﬀerence of two incomplete beta functions. The following code provides a ready-to-use implementation of the bisection algorithm in the MATLAB environment.

function eps = find_eps(k,N,delta)

if k==N eps = 1; else t1 = 0; t2 = 1; while t2-t1 > 1e-10 t = (t1+t2)/2; left = delta*betainc(t,k+1,N-k); right = t*N*(betainc(t,k,N-k+1)-betainc(t,k+1,N-k)); if left > right t2=t; else t1=t; end end eps = t2; end

B.2 Bisection algorithm for the computation of εk and εk

Following the same argument given in the proof of Proposition 8 to derive (71) (see Section 5.4), it is easy to see that equation Ψk,δ(α) = 1 can be reformulated as

(1 α)m k + δ

(1 α)m k = 1 + δ

Using again formula (72) for the second term in the left-hand side and its variant with N 1 in place of 4N for the ﬁrst term, one obtains that Ψk,δ(α) = 1 is equivalent over the interval

M.C. Campi and S. Garatti

(0, 1) to the equation

αi(1 α)N i !

αi(1 α)4N+1 i !

αk(1 α)N k,

where, again, the leftand the right-hand sides can be conveniently computed via the incomplete Beta function. A bisection algorithm with k

N and 1 as extremes can be used to compute αk = εk for k = 0, . . . , N 1; instead, for k = 0, . . . , N, using 0 and k

N as extremes, the bisection algorithm converges to αk = εk when αk > 0 and to 0 = εk when αk 0. The following code provides an implementation in MATLAB.

function [eps L, eps U] = find_eps LU(k,N,delta)

t1 = 0; t2 = k/N; while t2-t1 > 1e-10 t = (t1+t2)/2; left = delta/3*betainc(t,k+1,N-k)+delta/6*betainc(t,k+1,4*N+1-k); right = (1+delta/6/N)*t*N*(betainc(t,k,N-k+1)-betainc(t,k+1,N-k)); if left > right t1=t; else t2=t; end end eps L = t1;

if k==N eps U = 1; else t1 = k/N; t2 = 1; while t2-t1 > 1e-10 t = (t1+t2)/2; left = (delta/2-delta/6)*betainc(t,k+1,N-k)+delta/6*betainc(t,k+1,4*N+1-k); right = (1+delta/6/N)*t*N*(betainc(t,k,N-k+1)-betainc(t,k+1,N-k)); if left > right t2=t; else t1=t; end end eps U = t2;

Compression, Generalization and Learning

Appendix C. Proof of (B) in (36)

Let τ > 0 be a number smaller than 1 εk for all k s for which εk = 1. Matters of convenience (as shown later) suggest to introduce a modiﬁed Lagrangian that corresponds to a continuous cost function as follows

[0,1] ϕk,τ(α) dm+ k,N

[0,1] dm+ k,m 1

[0,1] µ+ k,m(α) d h m+ k,m+1 (1 α)m+ k,m i ,

where: for all k for which εk = 1, ϕk,τ(α) is a continuous function equal to 0 for α [0, εk τ], equal to 1 for α [εk, 1], and with a linear slope connecting 0 to 1 in between; while ϕk,τ(α) is identically zero when εk = 1. We show below the validity of the following relations:

sup {m+ k,m} inf {λm} {µ+ k,m}

Lτ = inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ

sup {m+ k,m} inf {λm} {µ+ k,m}

L inf {λm} {µ+ k,m}

sup {m+ k,m} L. (83)

Notice that the above relations imply the sought result that

sup {m+ k,m} inf {λm} {µ+ k,m}

L = inf {λm} {µ+ k,m}

sup {m+ k,m} L

because inf {λm} {µ+ k,m}

sup {m+ k,m} L

is in sandwich between sup {m+ k,m} inf {λm} {µ+ k,m}

and inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ,

two quantities that converge one onto the other as τ 0.

The two inequalities in (83) are justiﬁed as follows: the at the bottom of (83) is valid

M.C. Campi and S. Garatti

because relation sup inf inf sup is always true, while the

on the right follows from the fact that ϕk,τ(α) in Lτ is greater than or equal to 1α (εk,1] in L.

What remains to show is thus the = at the top of (83) and the convergence τ 0 on the left.

We ﬁrst show that

sup {m+ k,m} inf {λm} {µ+ k,m}

Lτ = inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ, (84)

for which purpose we need to introduce a proper topological vector space, Rudin (1991), as speciﬁed in the following.

Consider the vector space M of ﬁnite signed measures m with support on [0, 1]. Moreover, let LF be the vector space of linear functionals on M of the form R

[0,1] µ(α) dm,

where µ is a continuous function (µ C0[0, 1]). In M, introduce the weak topology induced by LF, see Section 3.8 in Rudin (1991). This weak topology makes M into a locally convex topological vector space whose dual space coincides with LF, see Theorem 3.10 in Rudin (1991).34 By also considering the standard topology of R generated by open intervals, the ambient space in which we are going to work is the topological vector space given by R RM+1 M (M+1)M

2 =: S equipped with the product topology.

The interpretation of S is that it is the codomain of an operator that maps m+ k,m, m = 0, 1, . . . , M, k =, 0, . . . , m into an element of S according to the rule:

m+ k,m m=0,1,...,M k=0,...,m

PN k=0 N k R

[0,1] ϕk,τ(α) dm+ k,N ( R) n Pm k=0 m k R

[0,1] dm+ k,m 1 o

m=0,1,...,M ( RM+1) m+ k,m+1 (1 α) m+ k,m m=0,1,...,M 1 k=0,...,m ( M (M+1)M

(note that this operator returns various terms that are found in Lτ). We next consider the image of this operator, that is, the range of points in S that are reached as {m+ k,m}

varies in its domain (M+)

2 . To this image, we further add an arbitrary positive measure p+ k,m to each term m+ k,m+1 (1 α) m+ k,m (the reason for this will become clear shortly). The ﬁnal set that is obtained as {m+ k,m} and {p+ k,m} vary over

34. For the applicability of Theorem 3.10, one needs that LF separates M, a fact that follows from Footnote 30.

Compression, Generalization and Learning

their domains is denoted by H:

H := n (v, {rm}, qk,m ) R RM+1 M (M+1)M

[0,1] ϕk,τ(α) dm+ k,N,

[0,1] dm+ k,m 1

qk,m = m+ k,m+1 (1 α) m+ k,m + p+ k,m ,

where, for all m and k, m+ k,m M+, p+ k,m M+o . (85)

The closure of H in the topology of S is denoted by H.35 The following deﬁnitions refer to the restrictions of H and H to the line where all rm and qk,m are set to 0 (i.e., the zero element in R and M, respectively): quantities

V := sup n v : (v, {rm = 0}, qk,m = 0 ) H o

V := sup n v : (v, {rm = 0}, qk,m = 0 ) H o

are called value and supervalue, respectively.36 With this notation, we have

sup {m+ k,m} inf {λm} {µ+ k,m}

(this fact easily follows from an argument similar to the proof of equality (A) in (36) after noting that V in the present context plays the same role as γM in left-hand side of(36)). On the other hand, we also have

inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ = V , (86)

which requires the proof given below, based on Hahn-Banach theorem. After showing this, the proof of (84) is concluded by proving that V = V .

To prove (86), note that H is convex and closed and, for any ε > 0, point sε := ( V + ε, {rm = 0}, {qk,m = 0}) / H. By an application of Hahn-Banach theorem

35. The closure H is formed by all contact points of H, where a point is of contact if any neighborhood of the point contains at least one point in H; clearly, any point h H also belongs to H. 36. Note that, in the deﬁnition of V , sup is taken over a nonempty set. As a matter of fact, owing to (24) and (25), any compression scheme satisfying the preference property gives rise to measures m+ k,m such that rm = 0 for all m and qk,m = 0 for all m and k by choosing p+ k,m = (m+ k,m+1 (1 α) m+ k,m). It is also worth noticing that V (and hence V too) is ﬁnite and no bigger than 1. In fact, by the deﬁnition of H, every point in H satisﬁes v r N + 1. On the other hand, if it were that V > 1, then there would exist a contact point of H such that v > 1 and r N = 0, which is in contradiction with the fact that v r N + 1 for all points in H.

M.C. Campi and S. Garatti

(see Theorem 3.4 in Rudin, 1991), one can therefore ﬁnd a linear continuous functional deﬁned over S that separates H from sε in such a way that the functional computed at any point of H is strictly smaller than the functional computed at sε. A generic linear continuous functional deﬁned over S is written as

[0,1] µk,m(α) dqk,m, (87)

where a, λm R and µk,m C0[0, 1], and hence the separation condition yields

aε v PM m=0 λε mrm PM 1 m=0 Pm k=0 R

[0,1] µε k,m(α) dqk,m < aε ( V + ε), (v, {rm}, {qk,m}) H, (88)

where aε, λε m, µε k,m(α) are speciﬁc choices of a, λm, µk,m(α) in (87). Specializing (88) to a point in H with {rm = 0} and {qk,m = 0} yields aε v < aε ( V +ε), which implies aε > 0. Moreover, noting that qk,m contains p+ k,m, which is positive and arbitrarily large, one concludes that µε k,m must be non-negative for the inequality to hold over the whole H. To take notice of this fact, in subsequent derivations we write µε,+ k,m in place of µε k,m. Dividing by aε, inequality (88) now gives

µε,+ k,m(α)

aε dqk,m < V + ε, (v, {rm}, {qk,m}) H.

Given the arbitrariness of ε and restricting attention to H H, one concludes that

inf {λm} {µ+ k,m}

sup (v,{rm},{qk,m}) H

[0,1] µ+ k,m(α) dqk,m

(89) On the other hand, recalling the expression of v, rm, qk,m in the deﬁnition of H given in (85), the left-hand side of (89) can be rewritten as

inf {λm} {µ+ k,m}

sup {m+ k,m},{p+ k,m}

[0,1] µ+ k,m(α) dp+ k,m

which further becomes

inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ + sup {p+ k,m}

[0,1] µ+ k,m(α) dp+ k,m

= inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ,

Compression, Generalization and Learning

where in the last equality the second term has been suppressed because sup{p+ k,m} is taken over non-positive quantities and p+ k,m = 0 is admissible. Altogether, (89) and (90) give the relation

inf {λm} {µ+ k,m}

sup {m+ k,m} Lτ V . (91)

To prove (86), we show that strict inequality in (91) cannot hold. Indeed, in the opposite, there would exist a linear continuous functional of the form (87) that separates H from P := ( V , {rm = 0}, {qk,m = 0}). If we now consider the open set A obtained as counter-image of the reals greater than the value taken by this functional at P minus a small enough margin, then A contains P, while A leaves out all H, contradicting the fact that P is a contact point of H.37

We now show that V = V , so closing the proof of (84). We start by constructing a sequence of neighborhoods of P = ( V , {rm = 0}, {qk,m = 0}) that exhibit asymptotic properties of interest. Consider a countable set of continuous functions g1, g2, . . . dense in C0[0, 1] with respect to the sup norm (e.g., polynomials with rational coeﬃcients, see Theorem 7.26 in Rudin, 1976). For i = 1, 2, . . ., the neighborhoods of P are deﬁned as follows:

Oi := n (v, {rm}, {qk,m}) with |v V | < 1/i; |rm| < 1/i, m = 0, 1, . . . , M; and

max j=1,...,i

[0,1] gj(α) dqk,m

< 1/i, m = 0, 1, . . . , M 1 and k = 0, . . . , m o .

Further, for any m = 0, 1, . . . , M and k = 0, 1, . . . , m consider sequences m+,i k,m and

p+,i k,m indexed in i such that, for each i = 1, 2, . . ., the pair ({m+,i k,m}, {p+,i k,m}) maps into a point of H that is also in Oi (such sequences certainly exist since P is a contact point of H, see Footnote 37). For these sequences we have

[0,1] ϕk,τ(α) dm+,i k,N = V ; (92)

[0,1] dm+,i k,m 1

= 0, m = 0, 1, . . . , M; (93)

[0,1] gj(α) d[m+,i k,m+1 (1 α) m+,i k,m + p+,i k,m] = 0,

gj(α), j = 1, 2, . . . , m = 0, 1, . . . , M 1, k = 0, . . . , m. (94)

In view of (93), for a given m and k, measures m+,i k,m are uniformly bounded in i (i.e.,

m+,i k,m([0, 1]) C, i, for some positive constant C < + ). Since measures m+,i k,m are

37. P is a contact point of H because V is deﬁned via a sup operation over contact points and, therefore, any neighborhood of P is also a neighborhood of a contact point (v, {rm = 0}, qk,m = 0 ) with v close enough to V , so that the neighborhood must contain a point of H.

M.C. Campi and S. Garatti

supported in [0, 1], by Prokhorov s theorem (see Shiryaev, 1996, Theorem 1, Section 2, Chapter III),38 we then conclude that there exists a sub-sequence of indexes ih such that m+,ih k,m has weak limit m+ k,m M+. By repeating the same reasoning in a nested manner, we can further ﬁnd a sub-sequence of the indexes ih such that weak convergence holds for a new choice of m and k. Proceeding the same way for all choices of m and k, we conclude that there exists a sub-sequence of indexes (which, with a little abuse of notation, we still indicate as ih) for which Z

[0,1] f(α) d m+ k,m = lim ih

[0,1] f(α) dm+,ih k,m , m, k, (95)

holds for any continuous function f C0[0, 1].

Since ϕk,τ, as well as the constant function equal to 1, are continuous, (95) together with (92) and (93) yield

[0,1] ϕk,τ(α) d m+ k,N = V (96)

[0,1] d m+ k,m 1 = 0, m = 0, 1, . . . , M. (97)

Turn now to consider (94), from which we have

[0,1] gj(α) d[m+,ih k,m+1 (1 α) m+,ih k,m ] = lim ih

[0,1] gj(α) dp+,ih k,m , (98)

where the limit in (94) restricted to the sub-sequence ih can be broken up in the two limits in (98) because the left-hand side of (98) exists due to the weak convergence of measures m+,ih k,m (note that gj(α) and gj(α)(1 α) are continuous functions). For the functions gj that are non-negative (i.e., gj(α) 0, α), which we henceforth write as g+ j to help interpretation, (98) gives

[0,1] g+ j (α) d[m+,ih k,m+1 (1 α) m+,ih k,m ] 0. (99)

Taking now any non-negative function f+ in C0[0, 1] and noting that f+ can be arbitrarily approximated in the sup norm by a function g+ j ,39 weak convergence of

m+,ih k,m to m+ k,m used in (99) yields Z

[0,1] f+(α) d[ m+ k,m+1 (1 α) m+ k,m] 0,

38. In fact, a straightforward extended version of Prokhorov s theorem for positive and uniformly bounded measures. 39. Note that function f +(α) can be zero for some α, so that an approximant, however close, might as well take negative values, against the requirement that the approximant is a non-negative g+ j . Nonetheless, any ε-close approximant of f +(α) + ε is non-negative and it is also a 2ε-close approximant of f +(α).

Compression, Generalization and Learning

from which m+ k,m+1 (1 α) m+ k,m is a negative measure (recall Footnote 30).

If we now choose p+ k,m = [ m+ k,m+1 (1 α) m+ k,m] (which is in M+), then in the light of (96) and (97) one sees that ({ m+ k,m}, { p+ k,m}) maps into the point ( V , {rm = 0}, qk,m = 0 ), which proves that this point is in H. Hence, it holds that V = V and equation (84) remains proven.

We next show that lim τ 0 sup {m+ k,m} inf {λm} {µ+ k,m}

Lτ = sup {m+ k,m} inf {λm} {µ+ k,m}

which is the only relation in (83) that is still unproven, so concluding the proof.

Notice that, in both sides of (100), the inf operator sends the value to whenever the constraints in (33b) or (33c) are not satisﬁed by {m+ k,m}: hence, (33b) and (33c) must be satisﬁed and are always assumed from now on. Under (33b) and (33c), inf is attained for λm = 0 and µ+ k,m = 0 for all m and k, and (100) is therefore rewritten as

lim τ 0 sup {m+ k,m}

[0,1] ϕk,τ(α) dm+ k,N = sup {m+ k,m}

[0,1] 1α (εk,1] dm+ k,N.

(101) To show the validity of (101), we discretize τ into τi, i = 1, 2, . . ., τi 0, and consider a sequence { m+,i k,m}, i = 1, 2, . . . (where measures m+,i k,m satisfy (33b) and (33c) for any i), such that

[0,1] ϕk,τi(α) d m+,i k,N

equals the left-hand side of (101) (for this to hold, m+,i k,m must achieve a progressively closer and closer approximation of sup{m+ k,m} in the left-hand side of (101) as i in-

creases); then, we construct from { m+,i k,m} a new sequence { m+,i k,m}, i = 1, 2, . . . (still satisfying (33b) and (33c)), such that

[0,1] ϕk,τi(α) d m+,i k,N lim i

[0,1] 1α (εk,1] d m+,i k,N, (102)

which shows that the left-hand side of (101) is upper-bounded by a value that is no bigger than the right-hand side of (101). Since, on the other hand, the left-hand side of (101) cannot be smaller than the right-hand side of (101) because ϕk,τ(α) 1α (εk,1], (101) remains proven.

The construction of { m+,i k,m} is in three steps:

Step 1. [construction of {ˇm+,i k,m}] For all k N for which εk = 1 and for all m, move the

probabilistic mass of m+,i k,m contained in the interval (εk τi, εk] into a concentrated

mass in point εk + τi; let ˇm+,i k,m be the corresponding measures.

M.C. Campi and S. Garatti

Step 2. [construction of {ˆm+,i k,m}] The mass shift in Step 1 can lead to measures ˇm+,i k,m that

violate condition (33c) in εk + τi; the new measures ˆm+,i k,m restore the validity of this condition. For all k > N and all k N for which εk = 1 (so that no mass shift has been performed in Step 1), let ˆm+,i k,m = ˇm+,i k,m, for all m = k, . . . , M. For

all other k s, let ˆm+,i k,k = ˇm+,i k,k; then, verify sequentially for m = k, . . . , M 1 whether the condition

ˇm+,i k,m+1({εk + τi}) (1 (εk + τi)) ˆm+,i k,m({εk + τi}) 0

is satisﬁed; if yes, let ˆm+,i k,m+1 = ˇm+,i k,m+1, otherwise trim ˇm+,i k,m+1({εk + τi}) to the

value (1 (εk + τi)) ˆm+,i k,m({εk + τi}) and deﬁne ˆm+,i k,m+1 as the trimmed version

of ˇm+,i k,m+1.

Step 3. [construction of { m+,i k,m}] The trimming operation in Step 2 may have unbalanced some equalities in (33b), i.e., it may be that

[0,1] dˆm+,i k,m < 1

for some m. If so, re-gain balance by adding to measure ˆm+,i m,m a suitable probabilistic mass (e.g., concentrated in α = 1), while leaving other measures ˆm+,i k,m,

k = m, unaltered. The so-obtained measures are m+,i k,m. Note that this operation

preserves the validity of condition m+,i m,m+1 (1 α) m+,i m,m M , so that { m+,i k,m} satisﬁes (33c) besides (33b).

Since ϕk,τi(α) is non-decreasing in α, the mass shift in Step 1 can only increase PN k=0 N k R

[0,1] ϕk,τi(α) d m+,i k,N; moreover, any trimming and re-balancing in Steps 2 and 3 involve vanishing masses as τi 0. Therefore,

[0,1] ϕk,τi(α) d m+,i k,N lim i

[0,1] ϕk,τi(α) d m+,i k,N. (103)

On the other hand, by construction, ϕk,τi(α) = 1α (εk,1] if εk = 1, while, for εk = 1, ϕk,τi(α) = 1α (εk,1] only occurs on the interval (εk τi, εk] where m+,i k,N is null. Hence,

[0,1] ϕk,τi(α) d m+,i k,N =

[0,1] 1α (εk,1] d m+,i k,N,

which, substituted in (103), gives (102).

This concludes the proof.

T. Alamo, R. Tempo, A. Luque, and D. R. Ramirez. Randomized methods for design of uncertain systems: sample complexity and sequential algorithms. Automatica, 51: 160 172, 2015.

Compression, Generalization and Learning

M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.

P. Billingsley. Probability and Measure. Wiley, New York, NY, 1995.

O. Bousquet, S. Hanneke, S. Moran, and N. Zhivotovskiy. Proper learning, Helly number, and an optimal SVM bound. In Proceedings of 33rd Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 582 609, 2020.

C.J.C. Burges and D.J. Crisp. Uniqueness of the SVM solution. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 223 229, Denver, CO, 1999.

G.C. Calaﬁore and M.C. Campi. Uncertain convex programs: randomized solutions and conﬁdence levels. Mathematical Programming, 102(1):25 46, 2005.

G.C. Calaﬁore and M.C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control, 51(5):742 753, 2006.

M.C. Campi. Classiﬁcation with guaranteed probability of error. Machine Learning, 80: 63 84, 2010.

M.C. Campi and A. Car e. Random convex programs with L1-regularization: sparsity and generalization. SIAM Journal on Control and Optimization, 51(5):3532 3557, 2013.

M.C. Campi and S. Garatti. The exact feasibility of randomized solutions of uncertain convex programs. SIAM Journal on Optimization, 19(3):1211 1230, 2008.

M.C. Campi and S. Garatti. A sampling-and-discarding approach to chance-constrained optimization: feasibility and optimality. Journal of Optimization Theory and Applications, 148(2):257 280, 2011.

M.C. Campi and S. Garatti. Wait-and-judge scenario optimization. Mathematical Programming, 167(1):155 189, 2018.

M.C. Campi and S. Garatti. A theory of the risk for optimization with relaxation and its application to support vector machines. Journal of Machine Learning Research, 22(288): 1 38, 2021.

M.C. Campi, S. Garatti, and F.A. Ramponi. A general scenario theory for nonconvex optimization and decision making. IEEE Transactions on Automatic Control, 63:4067 4078, 2018.

M.C. Campi, A. Car e, and S. Garatti. The scenario approach: a tool at the service of data-driven decision making. Annual Reviews in Control, 52:1 17, 2021.

A. Car e, S. Garatti, and M.C. Campi. Scenario min-max optimization and the risk of empirical costs. SIAM Journal on Optimization, 25(4):2061 2080, 2015.

A. Car e, F.A. Ramponi, and M.C. Campi. A new classiﬁcation algorithm with guaranteed sensitivity and speciﬁcity for medical applications. IEEE Control Systems Letters, 2: 393 398, 2018.

M.C. Campi and S. Garatti

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273 297, 1995.

L.G. Crespo, S.P. Kenny, and D.P. Giesy. Random predictor models for rigorous uncertainty quantiﬁcation. International Journal for Uncertainty Quantiﬁcation, 5(5):469 489, 2015.

L.G. Crespo, S.P. Kenny, and D.P. Giesy. Interval predictor models with a linear parameter dependency. Journal of Veriﬁcation, Validation and Uncertainty Quantiﬁcation, 1(2): 1 10, 2016.

L.G. Crespo, B.K. Colbert, S.P. Kenny, and D.P. Giesy. On the quantiﬁcation of aleatory and epistemic uncertainty using sliced-normal distributions. Systems and Control Letters, 134:104560, 2019.

O. David, S. Moran, and A. Yehudayoﬀ. Supervised learning through the lens of compression. In Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 2016a.

O. David, S. Moran, and A. Yehudayoﬀ. On statistical learning via the lens of compression. ar Xiv, 2016b. Paper id. ar Xiv:1610.03592.

P.M. Esfahani, T. Sutter, and J. Lygeros. Performance bounds for the scenario approach and an extension to a class of non-convex programs. IEEE Transactions on Automatic Control, 60(1):46 58, 2015.

A. Falsone, L. Deori, D. Ioli, S. Garatti, and M. Prandini. Optimal disturbance compensation for constrained linear systems operating in stationary conditions: A scenario-based approach. Automatica, 110:108537, 2019.

S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21:269 304, 1995.

S. Garatti and M.C. Campi. Risk and complexity in scenario optimization. Mathematical Programming, 191(1):243 279, 2022.

S. Garatti and M.C. Campi. On conditional risk assessments in scenario optimization. SIAM Journal on Optimization, 33(2):455 480, 2023.

S. Garatti, M.C. Campi, and A. Car e. On a class of interval predictor models with universal reliability. Automatica, 110:108542, 2019.

S. Garatti, M.C. Campi, and A. Car e. Complexity is an eﬀective observable to tune early stopping in scenario optimization. IEEE Transactions on Automatic Control, 68(2):928 942, 2023.

T. Graepel, R. Herbrich, and J.Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classiﬁcation. Machine Learning, 59:55 76, 2005.

S. Grammatico, X. Zhang, K. Margellos, P.J. Goulart, and J. Lygeros. A scenario approach for non-convex control design. IEEE Transactions on Automatic Control, 61(2):334 345, 2016.

Compression, Generalization and Learning

S. Hanneke and A. Kontorovich. A sharp lower bound for agnostic learning with sample compression schemes. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 489 505, 2019.

S. Hanneke and A. Kontorovich. Stable sample compression schemes: new applications and an optimal SVM margin bound. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 1 25, 2021.

S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss. Universal Bayes consistency in metric spaces. The Annals of Statistics, 49(4):2129 2150, 2021.

D. Helmbold, R. Sloan, and M.K. Warmuth. Learning nested diﬀerences of intersectionclosed concept classes. Machine Learning, 5:165 196, 1990.

A. Kontorovich, S. Sabato, and R. Weiss. Nearest-neighbor sample compression: Eﬃciency, consistency, inﬁnite dimensions. In Advances in Neural Information Processing Systems 30 (Neur IPS 2017), Long Beach, CA, 2017.

A. Kontorovich, S. Sabato, and R. Urner. Active nearest-neighbor learning in metric spaces. Journal of Machine Learning Research, 18(195):1 38, 2018.

M.J. Lacerda and L. G. Crespo. Interval predictor models for data with measurement uncertainty. In Proceedings of the 2017 American Control Conference (ACC), pages 1487 1492, Seattle, WA, 2017.

N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.

K. Margellos, P.J. Goulart, and J. Lygeros. On the road between robust optimization and the scenario approach for chance constrained optimization problems. IEEE Transactions on Automatic Control, 59(8):2258 2263, 2014.

K. Margellos, M. Prandini, and J. Lygeros. On the connection between compression learning and scenario based single-stage and cascading optimization problems. IEEE Transactions on Automatic Control, 60(10):2716 2721, 2015.

K. Margellos, A. Falsone, S. Garatti, and M. Prandini. Distributed constrained optimization and consensus in uncertain networks via proximal minimization. IEEE Transactions on Automatic Control, 63(5):1372 1387, 2018.

D. Paccagnan, M.C. Campi, and S. Garatti. The Pick-to-Learn algorithm: Empowering compression for tight generalization bounds and improved post-training performance. In Advances in Neural Information Processing Systems 36 (Neur IPS 2023), New Orleans, LA, 2023.

B.K. Pagnoncelli, D. Reich, and M.C. Campi. Risk-return trade-oﬀwith the scenario approach in practice: A case study in portfolio selection. Journal of Optimization Theory and Applications, 155(2):707 722, 2012.

M.C. Campi and S. Garatti

M.E. Payton, L.J. Young, and J.H. Young. Bounds for the diﬀerence between median and mean of beta and negative binomial distributions. Metrika, 36:347 354, 1989.

W. Rudin. Principle of Mathematical Analysis. Mc Graw-Hill, 1976.

W. Rudin. Functional Analysis. Mc Graw-Hill, 1991.

G. Schildbach, L. Fagiano, and M. Morari. Randomized solutions to convex programs with multiple chance constraints. SIAM Journal on Optimization, 23(4):2479 2501, 2013.

G. Schildbach, L. Fagiano, C. Frei, and M. Morari. The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations. Automatica, 50(12):3009 3018, 2014.

B. Sch olkopf and A.J. Smola. Learning with kernels. MIT press, 1998.

B. Sch olkopf, P. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: A new support vector regression algorithm. In Advances in Neural Information Processing Systems 11 (NIPS 1998), pages 330 336, Denver, CO, 1998.

B. Sch olkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Plattt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 582 588, Denver, CO, 1999.

S. Shalev-Shwartz and S. Ben-David. Underwstanding Machine Learning (from theory to algorithms). Cambridge University Press, New York, NY, 2014.

A.N. Shiryaev. Probability. Springer, New York, 1996.

A.J. Smola and B. Sch olkopf. A tutorial on support vector regression. Statistics and Computing, 14:199 224, 2004.

D.M.J. Tax and R.P.W. Duin. Support vector data description. Machine Learning, 54: 45 66, 2004.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.

X. Wang, F. Chung, and S. Wang. Theoretical analysis for solution of support vector data description. Neural Networks, 24:360 369, 2011.

J.S. Welsh and H. Kong. Robust experiment design through randomisation with chance constraints. In Proceedings of the 18th IFAC World Congress, Milan, Italy, 2011.

J.S. Welsh and C.R. Rojas. A scenario based approach to robust experiment design. In Proceedings of the 15th IFAC Symposium on System Identiﬁcation, Saint-Malo, France, 2009.

X. Zhang, S. Grammatico, G. Schildbach, P.J. Goulart, and J. Lygeros. On the sample size of random convex programs with structured dependence on the uncertainty. Automatica, 60:182 188, 2015.