# PAC-Bayes Un-Expected Bernstein Inequality

Zakaria Mhammedi
The Australian National University and Data61
zak.mhammedi@anu.edu.au

Peter D. Grünwald
CWI and Leiden University
pdg@cwi.nl

Benjamin Guedj
Inria and University College London
benjamin.guedj@inria.fr

We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\sqrt{L_n \cdot \mathrm{KL}/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to 0 faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and excess risk bounds; for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's, but with $X^2$ taken outside its expectation.

## 1 Introduction

PAC-Bayesian generalization bounds [1, 8, 9, 17, 18, 20, 28, 29, 30] have recently received renewed interest within the context of deep neural networks [14, 34, 42]. In particular, Zhou et al. [42] and Dziugaite and Roy [14] showed that, by extending an idea due to Langford and Caruana [23], one can obtain nontrivial (but still not very strong) generalization bounds on real-world datasets such as MNIST and ImageNet. Since nontrivial generalization bounds are even harder to obtain using alternative methods, there remains a strong interest in improved PAC-Bayesian bounds. In this paper, we provide a considerably improved bound whenever the employed learning algorithm is sufficiently stable on the given data.

Most standard bounds have a term of order $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ on the right, where $\mathrm{COMP}_n$ represents model complexity in the form of a Kullback-Leibler divergence between a prior and a posterior, and $L_n$ is the posterior expected loss on the training sample. The latter only vanishes if there is a sufficiently large neighborhood around the center of the posterior on which the training error is 0. In the two papers [14, 42] mentioned above, this is not the case. For example, the various deep net experiments reported by Dziugaite et al. [14, Table 1] with $n = 150000$ all have $L_n$ around 0.03, so that $\sqrt{\mathrm{COMP}_n/n}$ is multiplied by a non-negligible $\sqrt{0.03} \approx 0.17$. Furthermore, they have $\mathrm{COMP}_n$ increasing substantially with $n$, making $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ converge to 0 at a rate slower than $1/\sqrt{n}$.

In this paper, we provide a bound (Theorem 3) with $L_n$ replaced by a second-order term $V_n$, a term which will go to 0 in many cases in which $L_n$ does not. This can be viewed as an extension of an earlier second-order approach by Tolstikhin and Seldin [39] (TS from now on); they also replace $L_n$, but by a term that, while usually smaller than $L_n$, will tend to be larger than our $V_n$. Specifically, as they write, in classification settings (our primary interest), their replacement is not much smaller than $L_n$ itself. Instead, our $V_n$ can be very close to 0 in classification even when $L_n$ is large.
While the TS bound is based on an empirical Bernstein inequality due to [27]¹, our bound is based on a different modification of Bernstein's moment inequality, in which the occurrence of $X^2$ is taken outside of its expectation (see Lemma 13). We note that an empirical Bernstein inequality was introduced in [4, Theorem 1], and the name "empirical Bernstein" was coined in [32]. The term $V_n$ in our bound goes to 0, and our bound improves on existing bounds, whenever the employed learning algorithm is relatively stable on the given data; for example, if the predictor learned on an initial segment (say, 50%) of the dataset performs similarly (i.e. assigns similar losses to the same samples) to the predictor based on the full data. This improvement is reflected in our experiments where, except for very small sample sizes, we consistently outperform existing bounds both on a toy classification problem with label noise and on standard UCI datasets [13]. Of course, the importance of stability for generalization has been recognized before in landmark papers such as [7, 33, 38], and recently also in the context of PAC-Bayes bounds [35]. However, the data-dependent stability notion $V_n$ occurring in our bound seems very different from any of the notions discussed in those papers.

Theoretically, a further contribution is that we connect our PAC-Bayesian generalization bound to excess risk bounds: we show (Theorem 7) that our generalization bound can be of comparable size to excess risk bounds, up to an irreducible term that is independent of model complexity. The excess risk bound that can be attained for any given problem depends both on the complexity of the set of predictors $\mathcal{H}$ and on the inherent easiness of the problem. The latter is often measured in terms of the exponent $\beta \in [0,1]$ of the Bernstein condition that holds for the given problem [6, 15, 19], which generalizes the exponent in the celebrated Tsybakov margin condition [5, 40]. The larger $\beta$, the faster the excess risk converges. In Section 5, we essentially show that the rate at which the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term goes to 0 can also be bounded by a quantity that gets smaller as $\beta$ gets larger. In contrast, previous PAC-Bayesian bounds do not have such a property.

Contents. In Section 2, we introduce the problem setting and provide a first, simplified version of our main theorem. Section 3 gives our main bound. Experiments are presented in Section 4, followed by theoretical motivation in Section 5. The proof of our main bound is provided in Section 6, where we first present the convenient ESI language for expressing stochastic inequalities, and (our main tool) the unexpected Bernstein lemma (Lemma 13). The paper ends with an outlook for future work.

## 2 Problem Setting, Background, and Simplified Version of Our Bound

Setting and Notation. Let $Z_1,\dots,Z_n$ be i.i.d. random variables in some set $\mathcal{Z}$, with $Z_1 \sim D$. Let $\mathcal{H}$ be a hypothesis set and $\ell : \mathcal{H} \times \mathcal{Z} \to [0,b]$, $b > 0$, be a bounded loss function, where $\ell_h(Z) = \ell(h, Z)$ denotes the loss that hypothesis $h$ makes on $Z$. We call any such tuple $(D, \ell, \mathcal{H})$ a learning problem. For a given hypothesis $h \in \mathcal{H}$, we denote its risk (expected loss on a test sample of size 1) by $L(h) = \mathbb{E}_{Z\sim D}[\ell_h(Z)]$ and its empirical error by $L_n(h) = \frac{1}{n}\sum_{i=1}^n \ell_h(Z_i)$. For any distribution $P$ on $\mathcal{H}$, we write $L(P) = \mathbb{E}_{h\sim P}[L(h)]$ and $L_n(P) = \mathbb{E}_{h\sim P}[L_n(h)]$. For any $m \in [n]$ and any variables $Z_1,\dots,Z_n$ in $\mathcal{Z}$, we denote $Z_{\le m} = (Z_1,\dots,Z_m)$ and $Z_{>m} = (Z_{m+1},\dots,Z_n)$, with the convention that $Z_{\le 0} = Z_{>n} = \emptyset$.

¹An alternative form of empirical Bernstein inequality appears in [41], based on an inequality due to [11].
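To fix ideas, the following minimal sketch (ours, not code from the paper) computes these quantities for linear classifiers under the 0-1 loss (so $b = 1$); the helper names are illustrative:

```python
import numpy as np

# Minimal sketch (ours) of the basic notation for the 0-1 loss, b = 1.

def zero_one_loss(h, X, y):
    """ell_h(Z) for each Z = (x, y): 1 if h misclassifies x, else 0."""
    preds = (X @ h > 0).astype(int)  # sigmoid(h^T x) > 1/2  iff  h^T x > 0
    return (preds != y).astype(float)

def empirical_error(h, X, y):
    """L_n(h) = (1/n) * sum_i ell_h(Z_i)."""
    return zero_one_loss(h, X, y).mean()

def posterior_empirical_error(post_samples, X, y):
    """L_n(P) = E_{h~P}[L_n(h)], approximated by Monte Carlo draws from P."""
    return float(np.mean([empirical_error(h, X, y) for h in post_samples]))
```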
As is customary in PAC-Bayesian works, a learning algorithm is a (computable) function $P : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ that, upon observing input $Z_{\le n} \in \mathcal{Z}^n$, outputs a posterior distribution $P(Z_{\le n})(\cdot)$ on $\mathcal{H}$. The posterior could be a Gibbs or a generalized Bayesian posterior, but also the output of other algorithms. When no confusion can arise, we will abbreviate $P(Z_{\le n})$ to $P_n$, and we denote by $P_0$ any prior distribution, i.e. a distribution on $\mathcal{H}$ which has to be specified in advance, before seeing the data; we will use the convention $P(\emptyset) = P_0$. Finally, we denote the Kullback-Leibler divergence between $P_n$ and $P_0$ by $\mathrm{KL}(P_n \| P_0)$.

Comparing Bounds. Both existing state-of-the-art PAC-Bayes bounds and ours essentially take the following form: there exist constants $\Gamma, A, C \ge 0$ and a function $\varepsilon_{\delta,n}$, logarithmic in $1/\delta$ and $n$, such that for all $\delta \in\, ]0,1[$, with probability at least $1-\delta$ over the sample $Z_1,\dots,Z_n$, it holds that

$$L(P_n) - L_n(P_n) \;\le\; \Gamma \sqrt{\frac{R_n\,(\mathrm{COMP}_n + \varepsilon_{\delta,n})}{n}} \;+\; A\,\frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} \;+\; C \sqrt{\frac{R'_n\,\varepsilon_{\delta,n}}{n}}, \tag{1}$$

where $R_n, R'_n \ge 0$ are sample-dependent quantities which may differ from one bound to another. Existing classical bounds that, after slight relaxations, take on this form are due to Langford and Seeger [24, 37], Catoni [10], Maurer [26], and Tolstikhin and Seldin (TS) [39] (see the latter for a nice overview). In all these cases, $\mathrm{COMP}_n = \mathrm{KL}(P_n \| P_0)$, $R'_n = 0$, and, except for the TS bound, $R_n = L_n(P_n)$. For the TS bound, $R_n$ is equal to the empirical loss variance. Our bound in Theorem 3 also fits (1) (after a relaxation), but with considerably different choices for $\mathrm{COMP}_n$, $R'_n$, and $R_n$.

Of special relevance in our experiments is the bound due to Maurer [26], which, as noted by TS [39], tightens the PAC-Bayes-kl inequality due to Seeger [36], and is one of the tightest known generalization bounds in the literature. It can be stated as follows: for $\delta \in\, ]0,1[$, $n \ge 8$, and any learning algorithm $P$, with probability at least $1-\delta$,

$$\mathrm{kl}(L_n(P_n), L(P_n)) \;\le\; \frac{\mathrm{KL}(P_n \| P_0) + \ln\frac{2\sqrt{n}}{\delta}}{n}, \tag{2}$$

where $\mathrm{kl}$ is the binary Kullback-Leibler divergence. Applying the inequality $p \le q + \sqrt{2q\,\mathrm{kl}(q,p)} + 2\,\mathrm{kl}(q,p)$, with $p = L(P_n)$ and $q = L_n(P_n)$, to (2) yields a bound of the form (1) (see [39] for more details). Note also that using Pinsker's inequality together with (2) implies McAllester's classical PAC-Bayesian bound [28].

We now present a simplified version of our bound in Theorem 3 below as a corollary.

Corollary 1. For any $1 \le m < n$ and any deterministic estimator $\hat h : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{H}$ (such as ERM), there exist constants $\Gamma, A, C > 0$ such that (1) holds with probability at least $1-\delta$, with

$$\mathrm{COMP}_n = \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})),$$
$$R'_n = V'_n = \frac{1}{n}\Big( \sum_{i=1}^m \ell_{\hat h(Z_{>m})}(Z_i)^2 + \sum_{j=m+1}^n \ell_{\hat h(Z_{\le m})}(Z_j)^2 \Big), \tag{3}$$
$$R_n = V_n = \mathbb{E}_{h\sim P_n}\Big[\frac{1}{n}\Big( \sum_{i=1}^m \big(\ell_h(Z_i) - \ell_{\hat h(Z_{>m})}(Z_i)\big)^2 + \sum_{j=m+1}^n \big(\ell_h(Z_j) - \ell_{\hat h(Z_{\le m})}(Z_j)\big)^2 \Big)\Big].$$

Like TS's and Catoni's bounds, but unlike McAllester's and Maurer's, our $\varepsilon_{\delta,n}$ grows as $\ln\frac{\ln n}{\delta}$. Another difference is that our complexity term is a sum of two KL divergences, in which the prior (in this case $P(Z_{\le m})$ or $P(Z_{>m})$) is informed: when $m = n/2$, it is really the posterior based on half the sample. Our experiments confirm that this tends to be much smaller than $\mathrm{KL}(P_n \| P_0)$. While the idea to use part of the sample to create an informed prior is due to [2], we are the first to combine all parts (halves) into a single bound, which requires a novel technique. This technique can be applied to other existing bounds as well (see Section 3). A larger difference between our bound and others lies in the fact that we have $R_n = V_n$ instead of the typical empirical error $R_n = L_n(P_n)$.
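The quantities in Corollary 1 are cheap to compute. Below is a sketch (ours) for the 0-1 loss, reusing `zero_one_loss` from the previous snippet; `fit_erm` is a hypothetical helper returning a deterministic estimate trained on the given subsample:

```python
import numpy as np

# Sketch (ours) of V'_n and V_n from Corollary 1 for the 0-1 loss.

def corollary1_terms(X, y, post_samples, fit_erm, m=None):
    n = len(y)
    m = n // 2 if m is None else m
    h_back = fit_erm(X[m:], y[m:])    # \hat h(Z_{>m}), evaluated on the first half
    h_front = fit_erm(X[:m], y[:m])   # \hat h(Z_{<=m}), evaluated on the second half
    l1 = zero_one_loss(h_back, X[:m], y[:m])
    l2 = zero_one_loss(h_front, X[m:], y[m:])

    # R'_n = V'_n: average squared loss of the cross-fitted estimators.
    v_prime = (np.sum(l1**2) + np.sum(l2**2)) / n

    # R_n = V_n: posterior-expected average squared loss *difference*.
    v = float(np.mean([
        (np.sum((zero_one_loss(h, X[:m], y[:m]) - l1) ** 2) +
         np.sum((zero_one_loss(h, X[m:], y[m:]) - l2) ** 2)) / n
        for h in post_samples
    ]))
    return v_prime, v
```

Note how `v` can be near 0 whenever every posterior draw errs on roughly the same instances as the cross-fitted estimators, which is exactly the stability effect discussed above, even if the losses themselves are large.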
Only TS [39] have an $R_n$ that is somewhat reminiscent of ours; in their case, $R_n = \mathbb{E}_{h\sim P_n}\big[\sum_{i=1}^n (\ell_h(Z_i) - L_n(h))^2\big]/(n-1)$ is the empirical loss variance. The crucial difference to our $V_n$ is that the empirical loss variance cannot be close to 0 unless a sizeable $P_n$-posterior region of $h$'s has empirical error that is almost constant on most data instances. For classification with 0-1 loss, this is a strong condition, since the empirical loss variance is then equal to $n\,L_n(P_n)(1 - L_n(P_n))/(n-1)$, which is only close to 0 if $L_n(P_n)$ is itself close to 0 or 1. In contrast, our $V_n$ can go to 0 even if the empirical error and variance do not, as long as the learning algorithm is sufficiently stable. This can be witnessed in our experiments in Section 4. In Section 5, we argue more formally that under a Bernstein condition, the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term in our bound can be much smaller than $\sqrt{\mathrm{COMP}_n/n}$. Note, finally, that the term $V_n$ has a two-fold cross-validation flavor; but in contrast to a cross-validation error, for $V_n$ to be small it is sufficient that the losses are similar, not that they are small.

The price we pay for having $R_n = V_n$ in our bound is the right-most, irreducible remainder term in (1), of order at most $b/\sqrt{n}$. Note, however, that this term is decoupled from the complexity $\mathrm{COMP}_n$, and thus it is not affected by $\mathrm{COMP}_n$ growing with the size of $\mathcal{H}$. The following lemma gives a tighter bound (tighter than the $b/\sqrt{n}$ just mentioned) on the irreducible term:

Lemma 2. Suppose that the loss is bounded by 1 (i.e. $b = 1$) and that $n$ is even, and let $m = n/2$. For $\delta \in\, ]0,1[$, $R'_n = V'_n$ as in (3), and any estimator $\hat h : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{H}$, we have, with probability at least $1-\delta$,

$$\sqrt{\frac{V'_n}{n}} \;\le\; \sqrt{\frac{2\big(L(\hat h(Z_{>m})) + L(\hat h(Z_{\le m}))\big)}{n}} \;+\; O\!\Big(\frac{\ln(1/\delta)}{n}\Big). \tag{4}$$

Behind the proof of the lemma is an application of Hoeffding's inequality and the empirical Bernstein inequality [27] (see Section C). Note that in the realizable setting, the first term on the RHS of (4) can be of order $O(1/n)$ with the right choice of estimator $\hat h$ (e.g. ERM). In this case (still in the realizable setting), our irreducible term goes to zero at the same rate as in other bounds which have $R_n = L_n(P_n)$.

## 3 Main Bound

We now present our main result in its most general form. Let $\vartheta(u) = (-\ln(1-u) - u)/u^2$ and $c_\eta = \eta\,\vartheta(\eta b)$, for $\eta \in\, ]0,1/b[$, where $b > 0$ is an upper bound on the loss $\ell$.

Theorem 3. [Main Theorem] Let $Z_1,\dots,Z_n$ be i.i.d. with $Z_1 \sim D$. Let $m \in [0..n]$ and let $\pi$ be any distribution with support on a finite or countable grid $G \subset\, ]0,1/b[$. For any $\delta \in\, ]0,1[$ and any learning algorithms $P, Q : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$, we have

$$L(P_n) \;\le\; L_n(P_n) + \inf_{\eta\in G}\Big\{ c_\eta V_n + \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\delta\,\pi(\eta)}}{\eta\, n} \Big\} + c_\nu V'_n + \frac{\ln\frac{1}{\delta\,\pi(\nu)}}{\nu\, n} \tag{5}$$

with probability at least $1-\delta$, where $\nu := \min G$ and $\mathrm{COMP}_n$, $V'_n$, and $V_n$ are the random variables defined by:

$$\mathrm{COMP}_n = \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})), \tag{6}$$
$$V'_n = \frac{1}{n}\Big( \sum_{i=1}^m \mathbb{E}_{h\sim Q(Z_{>i})}\big[\ell_h(Z_i)^2\big] + \sum_{j=m+1}^n \mathbb{E}_{h\sim Q(Z_{<j})}\big[\ell_h(Z_j)^2\big] \Big),$$
$$V_n = \mathbb{E}_{h\sim P_n}\Big[ \frac{1}{n}\Big( \sum_{i=1}^m \big(\ell_h(Z_i) - \mathbb{E}_{h'\sim Q(Z_{>i})}[\ell_{h'}(Z_i)]\big)^2 + \sum_{j=m+1}^n \big(\ell_h(Z_j) - \mathbb{E}_{h'\sim Q(Z_{<j})}[\ell_{h'}(Z_j)]\big)^2 \Big) \Big].$$

Corollary 1 follows by choosing $Q(Z_{>i}) = \delta(\hat h(Z_{>m}))$ for $i \le m$ and $Q(Z_{<j}) = \delta(\hat h(Z_{\le m}))$ for $j > m$. Note that for $j > m$, in the RHS sums of $V_n$ and $V'_n$, we could instead use "online" posteriors $Q(Z_{<j})$ which converge to the final $\hat h(Z_{\le n})$ based on the full data. Doing this would likely improve our bounds, but we did not try it in our experiments since it is computationally demanding.

Grid. A natural choice, which we use in our experiments in Section 4, is the geometric grid

$$G = \Big\{ \frac{1}{b\,2^k} : k \in \big\{1,\dots,\lfloor \log_2 \sqrt{n} \rfloor\big\} \Big\}, \quad \text{with } \pi \text{ uniform on } G, \tag{7}$$

so that $\min G$ is of order $1/(b\sqrt{n})$ and $\ln\frac{1}{\pi(\eta)} = O(\ln\ln n)$.
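To make the structure of (5) concrete, here is a small sketch (ours) that evaluates its right-hand side from precomputed empirical quantities (e.g. by the snippets above), assuming the grid choice (7) with uniform $\pi$ and $b = 1$ for the 0-1 loss; all names are illustrative:

```python
import numpy as np

# Sketch (ours): evaluating the RHS of Theorem 3's bound (5).

def theta(u):
    return (-np.log1p(-u) - u) / u**2

def unexpected_bernstein_bound(L_n, V_n, V_prime, comp_n, n, delta, b=1.0):
    ks = np.arange(1, int(np.floor(np.log2(np.sqrt(n)))) + 1)
    grid = 1.0 / (b * 2.0**ks)          # G as in (7); grid[-1] = min G
    log_inv_pi = np.log(len(grid))      # pi uniform: ln(1/pi(eta)) = ln|G|
    c = grid * theta(grid * b)          # c_eta = eta * theta(eta * b)
    main = np.min(c * V_n
                  + (comp_n + 2 * (np.log(1/delta) + log_inv_pi)) / (grid * n))
    nu, c_nu = grid[-1], c[-1]          # nu = min G drives the V'_n term
    irreducible = c_nu * V_prime + (np.log(1/delta) + log_inv_pi) / (nu * n)
    return L_n + main + irreducible
```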
Informed Priors. Other bounds can also be modified to make use of informed priors from each half of the data; in this case, the $\mathrm{KL}(P_n \| P_0)$ term in these bounds can be replaced by $\mathrm{COMP}_n$ defined in (6). As revealed by additional experiments in Appendix H, doing this substantially improves the corresponding bounds when the learning algorithm is sufficiently stable. Here we show how this can be done for Maurer's bound in (2) (the details for other bounds are postponed to Appendix A).

Lemma 4. Let $\delta \in\, ]0,1[$ and $m \in [0..n]$. In the setting of Theorem 3, we have, with probability at least $1-\delta$,

$$\mathrm{kl}(L_n(P_n), L(P_n)) \;\le\; \frac{\mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})) + \ln\frac{4\sqrt{n}}{\delta}}{n}.$$

Remark 5. (Useful for Section 5 below) Though this may deteriorate the bound in practice, Theorem 3 allows choosing a learning algorithm $P$ such that, for $1 \le m < n$, $P(Z_{\le m}) = P(Z_{>m}) = P_0$ (i.e. no informed priors); this results in $\mathrm{COMP}_n = 2\,\mathrm{KL}(P_n \| P_0)$; the bound is otherwise unchanged.

Biasing. The term $V_n$ in our bound can be seen as the result of biasing the loss when evaluating the generalization error on each half of the sample. The TS bound, having a second-order variance term, can be used in such a way as to arrive at a bound like ours, with the same $V_n$ as in Corollary 1. The idea here is to apply the TS bound twice (once on each half of the sample) to the biased losses $\ell(h,\cdot) - \ell(\hat h(Z_{\le m}),\cdot)$ and $\ell(h,\cdot) - \ell(\hat h(Z_{>m}),\cdot)$, and then combine the results with a union bound. The details of this are postponed to Appendix B. Note, however, that this trick will not lead to a bound with a $V_n$ term as in Theorem 3, i.e. with the online posteriors $(Q(Z_{>i}))_{i\le m}$ and $(Q(Z_{<j}))_{j>m}$.

## 4 Experiments

Setup. We consider linear classifiers: for $h \in \mathcal{H} \subseteq \mathbb{R}^d$ and input $x \in \mathbb{R}^d$, the predicted label is $\hat y = 1\{\phi(h^\top x) > 1/2\}$, where $\phi(w) = 1/(1+e^{-w})$, $w \in \mathbb{R}$, and the bounds are evaluated with the 0-1 loss. We learn our hypotheses using regularized logistic regression: given a sample $S = (Z_p,\dots,Z_q)$, with $(p,q) \in \{(1,m), (m+1,n), (1,n)\}$ and $m = n/2$, we compute

$$\hat h(S) = \operatorname*{arg\,min}_{h\in\mathcal{H}}\; \lambda \|h\|_2^2 \;-\; \frac{1}{q-p+1}\sum_{i=p}^{q} \Big( Y_i \ln \phi(h^\top X_i) + (1-Y_i)\ln\big(1 - \phi(h^\top X_i)\big) \Big). \tag{8}$$

For $Z_{\le n} \in \mathcal{Z}^n$ and $1 \le i \le m < j \le n$, we choose the algorithm $Q$ in Theorem 3 such that $Q(Z_{>i}) = \delta(\hat h(Z_{>m}))$ and $Q(Z_{<j}) = \delta(\hat h(Z_{\le m}))$. For the learning algorithm $P$, we use Gaussian posteriors centered at the learned hypothesis with covariance $\sigma^2 I_d$, $\sigma > 0$; that is, $P(S) = \mathcal{N}(\hat h(S), \sigma^2 I_d)$. The prior distribution is set to $P_0 = \mathcal{N}(0, \sigma_0^2 I_d)$, for $\sigma_0 > 0$.

Parameters. We set $\delta = 0.05$. For all datasets, we use $\lambda = 0.01$, and (approximately) solve (8) using the BFGS algorithm. For each bound, we pick the $\sigma^2 \in \{1/2, \dots, 1/2^J\}$, $J = \lceil \log_2 n \rceil$, which minimizes it on the given data (with $n$ instances). In order for the bounds to still hold with probability at least $1-\delta$, we replace $\delta$ on the RHS of each bound by $\delta/\lceil \log_2 n \rceil$ (this follows from the application of a union bound). We choose the prior variance such that $\sigma_0^2 = 1/2$ (this was the best value on average for the bounds we compare against). We choose the grid $G$ in Theorem 3 as in (7). Finally, we approximate Gaussian expectations using Monte Carlo sampling.

Synthetic data. We generate synthetic data for $d \in \{10, 50\}$ and sample sizes between 800 and 8000. For a given sample size $n$, we 1) draw $X_1,\dots,X_n$ [resp. $\epsilon_1,\dots,\epsilon_n$] identically and independently from the multivariate Gaussian distribution $\mathcal{N}(0, I_d)$ [resp. the Bernoulli distribution $B(0.9)$]; and 2) set $Y_i = 1\{\phi(h^{\star\top} X_i) > 1/2\}$ if $\epsilon_i = 1$, and $Y_i = 1 - 1\{\phi(h^{\star\top} X_i) > 1/2\}$ otherwise (i.e. 10% label noise), for $i \in [n]$, where $h^\star \in \mathbb{R}^d$ is the vector constructed from the first $d$ digits of the number $\pi$. For example, if $d = 10$, then $h^\star = (3,1,4,1,5,9,2,6,5,3)^\top$. Figure 1 shows the results, averaged over 10 independent runs for each sample size.

UCI datasets. For the second experiment, we use several UCI datasets. These are listed in Table 1 (where Breast-C. stands for Breast Cancer). We encode categorical variables as appropriate 0-1 vectors. This effectively increases the dimension of the input space (this is reported as $d$ in Table 1). After removing any rows (i.e. instances) containing missing features and performing the encoding, the input data is scaled such that every column has values between -1 and 1. We used a 5-fold train-test split ($n$ in Table 1 is the training set size), and the results in Table 1 are averages over 5 runs. We only compare with Maurer's bound, since the other bounds were worse than both Maurer's and ours on all datasets.
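For concreteness, here is a sketch (ours, not the authors' code) of the training step (8) using SciPy's BFGS, together with Monte Carlo draws from the randomized predictor $P(S) = \mathcal{N}(\hat h(S), \sigma^2 I_d)$; the function names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (ours) of (8): L2-regularized logistic regression via BFGS;
# lam corresponds to lambda = 0.01 in the experiments.

def fit_logistic(X, y, lam=0.01):
    d = X.shape[1]

    def objective(h):
        z = X @ h
        nll = np.mean(np.logaddexp(0.0, z) - y * z)  # stable -(1/n) log-likelihood
        return lam * (h @ h) + nll

    return minimize(objective, np.zeros(d), method="BFGS").x

def posterior_samples(h_hat, sigma2, k, rng):
    """k Monte Carlo draws from P(S) = N(h_hat, sigma2 * I_d)."""
    return rng.normal(h_hat, np.sqrt(sigma2), size=(k, h_hat.shape[0]))
```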
Discussion. As the dimension $d$ of the input space increases, the complexity $\mathrm{KL}(P_n \| P_0)$, and thus all the PAC-Bayes bounds discussed in this paper, get larger. Our bound suffers less from this increase in $d$, since for a large enough sample size $n$ the term $V_n$ is small enough (see Figure 1) to absorb any increase in the complexity. In fact, for large enough $n$, the irreducible (complexity-free) term involving $V'_n$ in our bound becomes the dominant one. This, combined with the fact that for the 0-1 loss $V'_n \le L_n(P_n)$ for large enough $n$ (see Figure 1), makes our bound tighter than others. Adding a regularization term to the objective (8) is important, as it stabilizes the learned hypotheses $\hat h(Z_{\le m})$, $\hat h(Z_{>m})$, and $\hat h(Z_{\le n})$, which in turn keeps the term $V_n$ small.

## 5 Theoretical Motivation

Definition 6. [Bernstein condition] A learning problem $(D, \ell, \mathcal{H})$ satisfies the $(\beta, B)$-Bernstein condition (BC), for $\beta \in [0,1]$ and $B > 0$, if for all $h \in \mathcal{H}$,

$$\mathbb{E}_{Z\sim D}\big[(\ell_h(Z) - \ell_{h^\star}(Z))^2\big] \;\le\; B\; \mathbb{E}_{Z\sim D}\big[\ell_h(Z) - \ell_{h^\star}(Z)\big]^\beta,$$

where $h^\star \in \operatorname{arg\,inf}_{h\in\mathcal{H}} \mathbb{E}_{Z\sim D}[\ell_h(Z)]$ is a risk minimizer within the closure of $\mathcal{H}$.

The Bernstein condition [3, 5, 6, 15, 22] essentially characterizes the easiness of the learning problem; it implies that the variance of the excess loss random variable $\ell_h(Z) - \ell_{h^\star}(Z)$ gets smaller the closer the risk of hypothesis $h \in \mathcal{H}$ gets to that of the risk minimizer $h^\star$. For bounded loss functions, the BC with $\beta = 0$ always holds. The BC with $\beta = 1$ (the easiest learning setting) is also known as the Massart noise condition [25]; it holds in our experiment with synthetic data in Section 4, and also, e.g., whenever $\mathcal{H}$ is convex and $h \mapsto \ell_h(z)$ is exp-concave for all $z \in \mathcal{Z}$ [15, 31]. For more examples of learning settings where a BC holds, see [22, Section 3].

Our aim in this section is to give an upper bound on the infimum term involving $V_n$ in (5), under a BC, in terms of the complexity $\mathrm{COMP}_n$ and the excess risks $\Delta L(P_n)$, $\Delta L(Q(Z_{>m}))$, and $\Delta L(Q(Z_{\le m}))$, where for a distribution $P \in \mathcal{P}(\mathcal{H})$ the excess risk is defined by $\Delta L(P) = \mathbb{E}_{h\sim P}\big[\mathbb{E}_{Z\sim D}[\ell_h(Z)]\big] - \mathbb{E}_{Z\sim D}[\ell_{h^\star}(Z)]$. In the next theorem, we denote $Q_{\le m} = Q(Z_{\le m})$ and $Q_{>m} = Q(Z_{>m})$, for $m \in [n]$. To simplify the presentation further (and for consistency with Section 4), we assume that $Q$ is chosen such that

$$Q(Z_{>i}) = Q_{>m}, \text{ for } 1 \le i \le m, \qquad\text{and}\qquad Q(Z_{<j}) = Q_{\le m}, \text{ for } m < j \le n. \tag{9}$$

Theorem 7. Suppose the $(\beta, B)$-Bernstein condition holds for some $\beta \in [0,1]$ and $B > 0$. Then for any learning algorithms $P$ and $Q$ (with $Q$ satisfying (9)), there exists a $C > 0$ such that, for all $n \ge 1$ and $m = n/2$, with probability at least $1-\delta$,

$$\frac{1}{C}\,\inf_{\eta\in G}\Big\{ c_\eta V_n + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{\eta\, n} \Big\} \;\le\; \Delta L(P_n) + \Delta L(Q_{\le m}) + \Delta L(Q_{>m}) + \Big( \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} \Big)^{\frac{1}{2-\beta}} + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n}. \tag{10}$$

In addition to the ESI tools provided in Section 6 and Lemma 13, the proof of Theorem 7, presented in Appendix E, also uses an ESI version of the Bernstein condition due to [22]. First, note that the only terms in our main bound (5), other than the infimum on the LHS of (10), are the empirical error $L_n(P_n)$ and an $O(1/\sqrt{n})$ complexity-free term which is typically smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ (e.g. when the dimension of $\mathcal{H}$ is large enough). The term $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ is often the dominating one in other PAC-Bayesian bounds when $\liminf_{n\to\infty} L_n(P_n) > 0$. Now consider the remaining term in our main bound, which matches the infimum term on the LHS of (10), and let us choose the algorithm $P$ as per Remark 5, so that $\mathrm{COMP}_n = 2\,\mathrm{KL}(P_n \| P_0)$. Suppose that, with high probability (w.h.p.), $\mathrm{KL}(P_n \| P_0)/n$ converges to 0 as $n \to \infty$ (otherwise no PAC-Bayesian bound would converge to 0). Then $(\mathrm{COMP}_n/n)^{1/(2-\beta)} + \mathrm{COMP}_n/n$, essentially the sum of the last two terms on the RHS of (10), converges to 0 at a faster rate than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ w.h.p. for $\beta > 0$, and at an equal rate for $\beta = 0$. Thus, in light of Theorem 7, to argue that our bound can be better than others (still when $\liminf_{n\to\infty} L_n(P_n) > 0$), it remains to show that there exist algorithms $P$ and $Q$ for which the sum of the excess risks on the RHS of (10) is smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$.
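To see the effect of $\beta$ in Theorem 7 concretely, write $K := \mathrm{COMP}_n + \varepsilon_{\delta,n}$; the last two terms on the RHS of (10) then behave as follows at the two extremes:

```latex
% Rate of the last two complexity terms on the RHS of (10), K := COMP_n + eps_{delta,n}:
\[
\left(\frac{K}{n}\right)^{\frac{1}{2-\beta}} + \frac{K}{n} =
\begin{cases}
\sqrt{K/n} + K/n, & \beta = 0 \quad \text{(the usual } \sqrt{\mathrm{KL}/n}\text{-type rate)},\\[3pt]
2K/n, & \beta = 1 \quad \text{(Massart noise: the fast } K/n \text{ rate)}.
\end{cases}
\]
```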
One choice of estimator with small excess risk is the empirical risk minimizer (ERM). When $m = n/2$, if one chooses $Q$ such that it outputs a Dirac around the ERM on a given sample, then under a BC with exponent $\beta$ and for parametric $\mathcal{H}$ (such as the $d$-dimensional linear classifiers in Sec. 4), $\Delta L(Q_{\le m})$ and $\Delta L(Q_{>m})$ are of order $O(n^{-1/(2-\beta)})$ w.h.p. [3, 19]. However, setting $P_n = \delta(\mathrm{ERM}(Z_{\le n}))$ is not allowed, since otherwise $\mathrm{KL}(P_n \| P_0) = \infty$. Instead, one can choose $P_n$ to be the generalized-Bayes/Gibbs posterior. In this case too, under a BC with exponent $\beta$ and for parametric $\mathcal{H}$, the excess risk is of order $O(n^{-1/(2-\beta)})$ w.h.p. for clever choices of prior $P_0$ [3, 19].

## 6 Detailed Analysis

We start this section by presenting the convenient ESI notation and use it to present our main technical Lemma 13 (proofs of the ESI results are in Appendix D). We then continue with a proof of Theorem 3.

Definition 8. [ESI (Exponential Stochastic Inequality, pronounced "easy") [19, 22]] Let $\eta > 0$, and let $X, Y$ be any two random variables with joint distribution $D$. We define

$$X \trianglelefteq^D_\eta Y \;\Longleftrightarrow\; X - Y \trianglelefteq^D_\eta 0 \;\Longleftrightarrow\; \mathbb{E}_{(X,Y)\sim D}\big[e^{\eta(X-Y)}\big] \le 1. \tag{11}$$

Definition 8 can be extended to the case where $\eta = \hat\eta$ is also a random variable, in which case the expectation in (11) needs to be replaced by the expectation over the joint distribution of $(X, Y, \hat\eta)$. When no ambiguity can arise, we omit $D$ from the ESI notation. Besides simplifying notation, ESIs are useful in that they simultaneously capture "with high probability" and "in expectation" results:

Proposition 9. [ESI Implications] For fixed $\eta > 0$, if $X \trianglelefteq_\eta Y$ then $\mathbb{E}[X] \le \mathbb{E}[Y]$. For both fixed and random $\hat\eta$, if $X \trianglelefteq_{\hat\eta} Y$, then for all $\delta \in\, ]0,1[$, $X \le Y + \frac{\ln(1/\delta)}{\hat\eta}$, with probability at least $1-\delta$.

In the next proposition, we present two results concerning transitivity and additivity properties of ESI:

Proposition 10. [ESI Transitivity and Chain Rule] (a) Let $Z_1,\dots,Z_n$ be any random variables on $\mathcal{Z}$ (not necessarily independent). If, for some $(\gamma_i)_{i\in[n]} \in\, ]0,+\infty[^n$, $Z_i \trianglelefteq_{\gamma_i} 0$ for all $i \in [n]$, then $\sum_{i=1}^n Z_i \trianglelefteq_{\nu_n} 0$, where $\nu_n = \big(\sum_{i=1}^n \gamma_i^{-1}\big)^{-1}$ (so if $\gamma_i = \gamma > 0$ for all $i \in [n]$, then $\nu_n = \gamma/n$). (b) Suppose now that $Z_1,\dots,Z_n$ are i.i.d., and let $X : \mathcal{Z} \times \bigcup_{i=0}^{n-1} \mathcal{Z}^i \to \mathbb{R}$ be any real-valued function. If, for some $\eta > 0$, $X(Z_i; z_{<i}) \trianglelefteq_\eta 0$ for all $i \in [n]$ and all $z_{<i} \in \mathcal{Z}^{i-1}$, then $\sum_{i=1}^n X(Z_i; Z_{<i}) \trianglelefteq_\eta 0$.

Proposition 11. [ESI and KL divergence] Let $\eta > 0$ and let $\{Y_h : h \in \mathcal{H}\}$ be any family of random variables such that $Y_h \trianglelefteq_\eta 0$ for all $h \in \mathcal{H}$. Let $P_0$ be any distribution on $\mathcal{H}$ and let $P : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ be a learning algorithm. We have: $\mathbb{E}_{h\sim P_n}[Y_h] \trianglelefteq_\eta \frac{\mathrm{KL}(P_n \| P_0)}{\eta}$, where $P_n = P(Z_{\le n})$.

In many applications (especially for our main result), it is desirable to work with a random (i.e. data-dependent) $\eta$ in the ESI inequalities; one can then tune $\eta$ after seeing the data.

Proposition 12. [ESI from fixed to random $\eta$] Let $G$ be a countable subset of $]0,+\infty[$ and let $\pi$ be a prior distribution over $G$. Given a countable collection $\{Y_\eta : \eta \in G\}$ of random variables satisfying $Y_\eta \trianglelefteq_\eta 0$ for all fixed $\eta \in G$, we have, for an arbitrary estimator $\hat\eta$ with support on $G$, $Y_{\hat\eta} \trianglelefteq_{\hat\eta} \frac{\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta}$.

The following key lemma, which is of independent interest, is central to our main result.

Lemma 13. [Key result: un-expected Bernstein] Let $X \sim D$ be a random variable bounded from above by $b > 0$ almost surely, and let $\vartheta(u) = (-\ln(1-u) - u)/u^2$. For all $0 < \eta < 1/b$, we have

(a): $\mathbb{E}[X] - X \trianglelefteq^D_\eta c\,X^2$, for all $c \ge \eta\,\vartheta(\eta b)$. (12)

(b): The result is tight; for every $c < \eta\,\vartheta(\eta b)$, there exists a distribution $D$ such that (12) does not hold.

Lemma 13 is reminiscent of the following slight variation of Bernstein's inequality [12]: let $X$ be any random variable bounded from below by $-b$, and let $\kappa(x) = (e^x - x - 1)/x^2$. For all $\eta > 0$, we have

$$\mathbb{E}[X] - X \trianglelefteq_\eta s\,\mathbb{E}[X^2], \quad \text{for all } s \ge \eta\,\kappa(\eta b). \tag{13}$$
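As a quick numerical sanity check (our sketch, not from the paper), one can verify the ESI form of (12), i.e. $\mathbb{E}\big[\exp\big(\eta(\mathbb{E}[X] - X - c_\eta X^2)\big)\big] \le 1$, by Monte Carlo on a few distributions bounded above by $b$:

```python
import numpy as np

# Monte Carlo sanity check (ours) of Lemma 13(a): for X <= b a.s. and
# 0 < eta < 1/b, E[exp(eta*(E[X] - X - c_eta*X^2))] <= 1, with
# c_eta = eta * theta(eta*b) and theta(u) = (-ln(1-u) - u)/u^2.

def c_eta(eta, b):
    u = eta * b
    return eta * (-np.log1p(-u) - u) / u**2

rng = np.random.default_rng(0)
b = 1.0
samples = {
    "uniform":     rng.uniform(-2.0, b, 10**6),      # bounded support
    "exponential": b - rng.exponential(1.0, 10**6),  # unbounded below, <= b
    "two-point":   rng.choice([-3.0, b], 10**6),
}
for eta in [0.1, 0.5, 0.9]:                          # grid inside ]0, 1/b[
    for name, x in samples.items():
        c = c_eta(eta, b)
        esi = np.mean(np.exp(eta * (x.mean() - x - c * x**2)))
        print(f"eta={eta}, {name}: E[exp(...)] = {esi:.4f} (should be <= 1)")
```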
Note that the un-expected Bernstein Lemma 13 has the $X^2$ lifted out of the expectation. In Appendix G, we prove (13) and compare it to standard versions of Bernstein's inequality. We also compare (12) to the related but distinct empirical Bernstein inequality due to [27, Theorem 4]. We now prove part (a) of Lemma 13, which follows easily from the proof of an existing result [16, 21]. Part (b) is novel; its proof is postponed to Appendix F.

Proof of Lemma 13, Part (a). [16] (see also [21]) showed in the proof of their Lemma 4.1 that

$$\exp\big(\lambda\xi - \lambda^2\vartheta(\lambda)\xi^2\big) \le 1 + \lambda\xi, \quad \text{for all } \lambda \in [0,1[ \text{ and } \xi \ge -1. \tag{14}$$

Letting $\eta = \lambda/b$ and $\xi = -X/b$, (14) becomes

$$\exp\big({-\eta X} - \eta^2\vartheta(\eta b)X^2\big) \le 1 - \eta X, \quad \text{for all } \eta \in\, ]0,1/b[. \tag{15}$$

Taking expectations on both sides of (15) and using the fact that $1 - \eta\,\mathbb{E}[X] \le \exp(-\eta\,\mathbb{E}[X])$ on the RHS of the resulting inequality leads to (12).

Proof of Theorem 3. Let $\eta \in\, ]0,1/b[$ and $c_\eta = \eta\,\vartheta(\eta b)$. For $1 \le i \le m < j \le n$, define

$$X_h(Z_i; z_{>i}) = \ell_h(Z_i) - \mathbb{E}_{h'\sim Q(z_{>i})}[\ell_{h'}(Z_i)], \quad \text{for } z_{>i} \in \mathcal{Z}^{n-i},$$
$$X_h(Z_j; z_{<j}) = \ell_h(Z_j) - \mathbb{E}_{h'\sim Q(z_{<j})}[\ell_{h'}(Z_j)], \quad \text{for } z_{<j} \in \mathcal{Z}^{j-1}.$$

These random variables are bounded from above by $b$, so Lemma 13 yields, for all $h \in \mathcal{H}$,

$$Y^\eta_h(Z_i; z_{>i}) := \mathbb{E}_{Z'\sim D}[X_h(Z'; z_{>i})] - X_h(Z_i; z_{>i}) - c_\eta X_h(Z_i; z_{>i})^2 \trianglelefteq_\eta 0,$$
$$Y^\eta_h(Z_j; z_{<j}) := \mathbb{E}_{Z'\sim D}[X_h(Z'; z_{<j})] - X_h(Z_j; z_{<j}) - c_\eta X_h(Z_j; z_{<j})^2 \trianglelefteq_\eta 0.$$

By the ESI chain rule (Proposition 10(b), applied once to each half of the sample, with the order of the first half reversed), $\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) \trianglelefteq_\eta 0$ and $\sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j}) \trianglelefteq_\eta 0$. Applying Proposition 11 to these two ESIs, with priors $P(Z_{>m})$ and $P(Z_{\le m})$, respectively, and common posterior $P_n = P(Z_{\le n})$, we get, with $\mathrm{KL}_{>m} = \mathrm{KL}(P_n \| P(Z_{>m}))$ and $\mathrm{KL}_{\le m} = \mathrm{KL}(P_n \| P(Z_{\le m}))$:

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i})\Big] \trianglelefteq_\eta \frac{\mathrm{KL}_{>m}}{\eta}, \qquad \mathbb{E}_{h\sim P_n}\Big[\sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j})\Big] \trianglelefteq_\eta \frac{\mathrm{KL}_{\le m}}{\eta}.$$

By ESI transitivity (Proposition 10(a)), summing these two ESIs gives

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) + \sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j})\Big] \trianglelefteq_{\eta/2} \frac{\mathrm{KL}_{>m} + \mathrm{KL}_{\le m}}{\eta} = \frac{\mathrm{COMP}_n}{\eta}.$$

With the prior $\pi$ on $G$, we have, for any estimator $\hat\eta = \hat\eta(Z_{\le n}) \in G \subseteq [1/\sqrt{nb^2},\, 1/b[$ (see Proposition 12),

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^{\hat\eta}_h(Z_i; Z_{>i}) + \sum_{j=m+1}^n Y^{\hat\eta}_h(Z_j; Z_{<j})\Big] \trianglelefteq_{\hat\eta/2} \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta}.$$

Writing $\ell_{Q_{>i}}(Z_i) = \mathbb{E}_{h\sim Q(Z_{>i})}[\ell_h(Z_i)]$ and $\ell_{Q_{<j}}(Z_j) = \mathbb{E}_{h\sim Q(Z_{<j})}[\ell_h(Z_j)]$, and unfolding the definition of $Y^{\hat\eta}_h$, the LHS equals $n\big(L(P_n) - L_n(P_n) - c_{\hat\eta} V_n\big)$ minus the centered terms $\sum_{i=1}^m \big(\mathbb{E}_{Z'\sim D}[\ell_{Q_{>i}}(Z')] - \ell_{Q_{>i}}(Z_i)\big) + \sum_{j=m+1}^n \big(\mathbb{E}_{Z'\sim D}[\ell_{Q_{<j}}(Z')] - \ell_{Q_{<j}}(Z_j)\big)$. The latter are controlled by a second application of the un-expected Bernstein lemma, with $\nu$ in place of $\eta$, to the variables $\ell_{Q_{>i}}(Z_i)$ and $\ell_{Q_{<j}}(Z_j)$ (each bounded above by $b$), again combined with the chain rule; this produces the term $c_\nu \big(\sum_{i=1}^m \mathbb{E}_{h\sim Q(Z_{>i})}[\ell_h(Z_i)^2] + \sum_{j=m+1}^n \mathbb{E}_{h\sim Q(Z_{<j})}[\ell_h(Z_j)^2]\big) = c_\nu\, n\, V'_n$. Combining the two ESI chains, applying Proposition 9 to each (which yields the $\ln\frac{1}{\delta\,\pi(\cdot)}$ terms), and dividing by $n$ yields (5).
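For fixed $\eta$, the high-probability direction of Proposition 9, used in the final step above, is just Markov's inequality applied to the ESI; spelled out:

```latex
% ESI => high probability, via Markov's inequality: if E[e^{eta(X-Y)}] <= 1, then
\[
\Pr\!\Big(X > Y + \tfrac{\ln(1/\delta)}{\eta}\Big)
= \Pr\!\Big(e^{\eta(X-Y)} > \tfrac{1}{\delta}\Big)
\;\le\; \delta\,\mathbb{E}\big[e^{\eta(X-Y)}\big]
\;\le\; \delta .
\]
```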