# PAC-Bayes Un-Expected Bernstein Inequality

Zakaria Mhammedi
The Australian National University and Data61
zak.mhammedi@anu.edu.au

Peter D. Grünwald
CWI and Leiden University
pdg@cwi.nl

Benjamin Guedj
Inria and University College London
benjamin.guedj@inria.fr

We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\sqrt{L_n \cdot \mathrm{KL}/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to 0 faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and excess risk bounds; for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's, but with $X^2$ taken outside its expectation.

## 1 Introduction

PAC-Bayesian generalization bounds [1, 8, 9, 17, 18, 20, 28, 29, 30] have recently received renewed interest within the context of deep neural networks [14, 34, 42]. In particular, Zhou et al. [42] and Dziugaite and Roy [14] showed that, by extending an idea due to Langford and Caruana [23], one can obtain nontrivial (but still not very strong) generalization bounds on real-world datasets such as MNIST and ImageNet. Since nontrivial generalization bounds are even harder to obtain using alternative methods, there remains a strong interest in improved PAC-Bayesian bounds. In this paper, we provide a considerably improved bound whenever the employed learning algorithm is sufficiently stable on the given data.

Most standard bounds have a term of order $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ on the right, where $\mathrm{COMP}_n$ represents model complexity in the form of a Kullback-Leibler divergence between a prior and a posterior, and $L_n$ is the posterior expected loss on the training sample. The latter only vanishes if there is a sufficiently large neighborhood around the center of the posterior on which the training error is 0. In the two papers [14, 42] mentioned above, this is not the case. For example, the various deep net experiments reported by Dziugaite et al. [14, Table 1] with $n = 150000$ all have $L_n$ around 0.03, so that $\sqrt{\mathrm{COMP}_n/n}$ is multiplied by a non-negligible $\sqrt{0.03} \approx 0.17$. Furthermore, they have $\mathrm{COMP}_n$ increasing substantially with $n$, making $\sqrt{L_n \cdot \mathrm{COMP}_n/n}$ converge to 0 at a rate slower than $1/\sqrt{n}$.

In this paper, we provide a bound (Theorem 3) with $L_n$ replaced by a second-order term $V_n$, a term which will go to 0 in many cases in which $L_n$ does not. This can be viewed as an extension of an earlier second-order approach by Tolstikhin and Seldin [39] (TS from now on); they also replace $L_n$, but by a term that, while usually smaller than $L_n$, will tend to be larger than our $V_n$. Specifically, as they write, in classification settings (our primary interest), their replacement is not much smaller than $L_n$ itself. Instead, our $V_n$ can be very close to 0 in classification even when $L_n$ is large.
While the TS bound is based on an empirical Bernstein inequality due to [27]¹, our bound is based on a different modification of Bernstein's moment inequality, in which the occurrence of $X^2$ is taken outside of its expectation (see Lemma 13). We note that an empirical Bernstein inequality was introduced in [4, Theorem 1], and the name "empirical Bernstein" was coined in [32]. The term $V_n$ in our bound goes to 0, and our bound improves on existing bounds, whenever the employed learning algorithm is relatively stable on the given data; for example, if the predictor learned on an initial segment (say, 50%) of the dataset performs similarly (i.e. assigns similar losses to the same samples) to the predictor based on the full data. This improvement is reflected in our experiments where, except for very small sample sizes, we consistently outperform existing bounds both on a toy classification problem with label noise and on standard UCI datasets [13]. Of course, the importance of stability for generalization has been recognized before in landmark papers such as [7, 33, 38], and recently also in the context of PAC-Bayes bounds [35]. However, the data-dependent stability notion $V_n$ occurring in our bound seems very different from any of the notions discussed in those papers.

Theoretically, a further contribution is that we connect our PAC-Bayesian generalization bound to excess risk bounds: we show (Theorem 7) that our generalization bound can be of comparable size to excess risk bounds, up to an irreducible term that is independent of model complexity. The excess risk bound that can be attained for any given problem depends both on the complexity of the set of predictors $\mathcal{H}$ and on the inherent easiness of the problem. The latter is often measured in terms of the exponent $\beta \in [0,1]$ of the Bernstein condition that holds for the given problem [6, 15, 19], which generalizes the exponent in the celebrated Tsybakov margin condition [5, 40]. The larger $\beta$, the faster the excess risk converges. In Section 5, we essentially show that the rate at which the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term goes to 0 can also be bounded by a quantity that gets smaller as $\beta$ gets larger. In contrast, previous PAC-Bayesian bounds do not have such a property.

Contents. In Section 2, we introduce the problem setting and provide a first, simplified version of our main theorem. Section 3 gives our main bound. Experiments are presented in Section 4, followed by theoretical motivation in Section 5. The proof of our main bound is provided in Section 6, where we first present the convenient ESI language for expressing stochastic inequalities, and (our main tool) the unexpected Bernstein lemma (Lemma 13). The paper ends with an outlook for future work.

## 2 Problem Setting, Background, and Simplified Version of Our Bound

Setting and Notation. Let $Z_1,\dots,Z_n$ be i.i.d. random variables in some set $\mathcal{Z}$, with $Z_1 \sim D$. Let $\mathcal{H}$ be a hypothesis set and $\ell : \mathcal{H} \times \mathcal{Z} \to [0,b]$, $b > 0$, be a bounded loss function, where $\ell_h(Z) = \ell(h, Z)$ denotes the loss that hypothesis $h$ makes on $Z$. We call any such tuple $(D, \ell, \mathcal{H})$ a learning problem. For a given hypothesis $h \in \mathcal{H}$, we denote its risk (expected loss on a test sample of size 1) by $L(h) = \mathbb{E}_{Z\sim D}[\ell_h(Z)]$ and its empirical error by $L_n(h) = \frac{1}{n}\sum_{i=1}^n \ell_h(Z_i)$. For any distribution $P$ on $\mathcal{H}$, we write $L(P) = \mathbb{E}_{h\sim P}[L(h)]$ and $L_n(P) = \mathbb{E}_{h\sim P}[L_n(h)]$. For any $m \in [n]$ and any variables $Z_1,\dots,Z_n$ in $\mathcal{Z}$, we denote $Z_{\le m} = (Z_1,\dots,Z_m)$ and $Z_{>m} = (Z_{m+1},\dots,Z_n)$, with the convention that $Z_{\le 0} = Z_{>n} = \emptyset$.

¹An alternative form of empirical Bernstein inequality appears in [41], based on an inequality due to [11].
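To fix ideas, the following minimal sketch (ours, not code from the paper) computes these quantities for linear classifiers under the 0-1 loss (so $b = 1$); the helper names are illustrative:

```python
import numpy as np

# Minimal sketch (ours) of the basic notation for the 0-1 loss, b = 1.

def zero_one_loss(h, X, y):
    """ell_h(Z) for each Z = (x, y): 1 if h misclassifies x, else 0."""
    preds = (X @ h > 0).astype(int)  # sigmoid(h^T x) > 1/2  iff  h^T x > 0
    return (preds != y).astype(float)

def empirical_error(h, X, y):
    """L_n(h) = (1/n) * sum_i ell_h(Z_i)."""
    return zero_one_loss(h, X, y).mean()

def posterior_empirical_error(post_samples, X, y):
    """L_n(P) = E_{h~P}[L_n(h)], approximated by Monte Carlo draws from P."""
    return float(np.mean([empirical_error(h, X, y) for h in post_samples]))
```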
As is customary in PAC-Bayesian works, a learning algorithm is a (computable) function $P : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ that, upon observing input $Z_{\le n} \in \mathcal{Z}^n$, outputs a posterior distribution $P(Z_{\le n})(\cdot)$ on $\mathcal{H}$. The posterior could be a Gibbs or a generalized Bayesian posterior, but also the output of other algorithms. When no confusion can arise, we will abbreviate $P(Z_{\le n})$ to $P_n$, and we denote by $P_0$ any prior distribution, i.e. a distribution on $\mathcal{H}$ which has to be specified in advance, before seeing the data; we will use the convention $P(\emptyset) = P_0$. Finally, we denote the Kullback-Leibler divergence between $P_n$ and $P_0$ by $\mathrm{KL}(P_n \| P_0)$.

Comparing Bounds. Both existing state-of-the-art PAC-Bayes bounds and ours essentially take the following form: there exist constants $\Gamma, A, C \ge 0$ and a function $\varepsilon_{\delta,n}$, logarithmic in $1/\delta$ and $n$, such that for all $\delta \in\, ]0,1[$, with probability at least $1-\delta$ over the sample $Z_1,\dots,Z_n$, it holds that

$$L(P_n) - L_n(P_n) \;\le\; \Gamma \sqrt{\frac{R_n\,(\mathrm{COMP}_n + \varepsilon_{\delta,n})}{n}} \;+\; A\,\frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} \;+\; C \sqrt{\frac{R'_n\,\varepsilon_{\delta,n}}{n}}, \tag{1}$$

where $R_n, R'_n \ge 0$ are sample-dependent quantities which may differ from one bound to another. Existing classical bounds that, after slight relaxations, take on this form are due to Langford and Seeger [24, 37], Catoni [10], Maurer [26], and Tolstikhin and Seldin (TS) [39] (see the latter for a nice overview). In all these cases, $\mathrm{COMP}_n = \mathrm{KL}(P_n \| P_0)$, $R'_n = 0$, and, except for the TS bound, $R_n = L_n(P_n)$. For the TS bound, $R_n$ is equal to the empirical loss variance. Our bound in Theorem 3 also fits (1) (after a relaxation), but with considerably different choices for $\mathrm{COMP}_n$, $R'_n$, and $R_n$.

Of special relevance in our experiments is the bound due to Maurer [26], which, as noted by TS [39], tightens the PAC-Bayes-kl inequality due to Seeger [36], and is one of the tightest known generalization bounds in the literature. It can be stated as follows: for $\delta \in\, ]0,1[$, $n \ge 8$, and any learning algorithm $P$, with probability at least $1-\delta$,

$$\mathrm{kl}(L_n(P_n), L(P_n)) \;\le\; \frac{\mathrm{KL}(P_n \| P_0) + \ln\frac{2\sqrt{n}}{\delta}}{n}, \tag{2}$$

where $\mathrm{kl}$ is the binary Kullback-Leibler divergence. Applying the inequality $p \le q + \sqrt{2q\,\mathrm{kl}(q,p)} + 2\,\mathrm{kl}(q,p)$, with $p = L(P_n)$ and $q = L_n(P_n)$, to (2) yields a bound of the form (1) (see [39] for more details). Note also that using Pinsker's inequality together with (2) implies McAllester's classical PAC-Bayesian bound [28].

We now present a simplified version of our bound in Theorem 3 below as a corollary.

Corollary 1. For any $1 \le m < n$ and any deterministic estimator $\hat h : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{H}$ (such as ERM), there exist constants $\Gamma, A, C > 0$ such that (1) holds with probability at least $1-\delta$, with

$$\mathrm{COMP}_n = \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})),$$
$$R'_n = V'_n = \frac{1}{n}\Big( \sum_{i=1}^m \ell_{\hat h(Z_{>m})}(Z_i)^2 + \sum_{j=m+1}^n \ell_{\hat h(Z_{\le m})}(Z_j)^2 \Big), \tag{3}$$
$$R_n = V_n = \mathbb{E}_{h\sim P_n}\Big[\frac{1}{n}\Big( \sum_{i=1}^m \big(\ell_h(Z_i) - \ell_{\hat h(Z_{>m})}(Z_i)\big)^2 + \sum_{j=m+1}^n \big(\ell_h(Z_j) - \ell_{\hat h(Z_{\le m})}(Z_j)\big)^2 \Big)\Big].$$

Like TS's and Catoni's bounds, but unlike McAllester's and Maurer's, our $\varepsilon_{\delta,n}$ grows as $\ln\frac{\ln n}{\delta}$. Another difference is that our complexity term is a sum of two KL divergences, in which the prior (in this case $P(Z_{\le m})$ or $P(Z_{>m})$) is informed: when $m = n/2$, it is really the posterior based on half the sample. Our experiments confirm that this tends to be much smaller than $\mathrm{KL}(P_n \| P_0)$. While the idea to use part of the sample to create an informed prior is due to [2], we are the first to combine all parts (halves) into a single bound, which requires a novel technique. This technique can be applied to other existing bounds as well (see Section 3). A larger difference between our bound and others lies in the fact that we have $R_n = V_n$ instead of the typical empirical error $R_n = L_n(P_n)$.
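The quantities in Corollary 1 are cheap to compute. Below is a sketch (ours) for the 0-1 loss, reusing `zero_one_loss` from the previous snippet; `fit_erm` is a hypothetical helper returning a deterministic estimate trained on the given subsample:

```python
import numpy as np

# Sketch (ours) of V'_n and V_n from Corollary 1 for the 0-1 loss.

def corollary1_terms(X, y, post_samples, fit_erm, m=None):
    n = len(y)
    m = n // 2 if m is None else m
    h_back = fit_erm(X[m:], y[m:])    # \hat h(Z_{>m}), evaluated on the first half
    h_front = fit_erm(X[:m], y[:m])   # \hat h(Z_{<=m}), evaluated on the second half
    l1 = zero_one_loss(h_back, X[:m], y[:m])
    l2 = zero_one_loss(h_front, X[m:], y[m:])

    # R'_n = V'_n: average squared loss of the cross-fitted estimators.
    v_prime = (np.sum(l1**2) + np.sum(l2**2)) / n

    # R_n = V_n: posterior-expected average squared loss *difference*.
    v = float(np.mean([
        (np.sum((zero_one_loss(h, X[:m], y[:m]) - l1) ** 2) +
         np.sum((zero_one_loss(h, X[m:], y[m:]) - l2) ** 2)) / n
        for h in post_samples
    ]))
    return v_prime, v
```

Note how `v` can be near 0 whenever every posterior draw errs on roughly the same instances as the cross-fitted estimators, which is exactly the stability effect discussed above, even if the losses themselves are large.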
Only TS [39] have an $R_n$ that is somewhat reminiscent of ours; in their case, $R_n = \mathbb{E}_{h\sim P_n}\big[\sum_{i=1}^n (\ell_h(Z_i) - L_n(h))^2\big]/(n-1)$ is the empirical loss variance. The crucial difference to our $V_n$ is that the empirical loss variance cannot be close to 0 unless a sizeable $P_n$-posterior region of $h$'s has empirical error that is almost constant on most data instances. For classification with 0-1 loss, this is a strong condition, since the empirical loss variance is then equal to $n\,L_n(P_n)(1 - L_n(P_n))/(n-1)$, which is only close to 0 if $L_n(P_n)$ is itself close to 0 or 1. In contrast, our $V_n$ can go to 0 even if the empirical error and variance do not, as long as the learning algorithm is sufficiently stable. This can be witnessed in our experiments in Section 4. In Section 5, we argue more formally that under a Bernstein condition, the $\sqrt{V_n \cdot \mathrm{COMP}_n/n}$ term in our bound can be much smaller than $\sqrt{\mathrm{COMP}_n/n}$. Note, finally, that the term $V_n$ has a two-fold cross-validation flavor; but in contrast to a cross-validation error, for $V_n$ to be small it is sufficient that the losses are similar, not that they are small.

The price we pay for having $R_n = V_n$ in our bound is the right-most, irreducible remainder term in (1), of order at most $b/\sqrt{n}$. Note, however, that this term is decoupled from the complexity $\mathrm{COMP}_n$, and thus it is not affected by $\mathrm{COMP}_n$ growing with the size of $\mathcal{H}$. The following lemma gives a tighter bound (tighter than the $b/\sqrt{n}$ just mentioned) on the irreducible term:

Lemma 2. Suppose that the loss is bounded by 1 (i.e. $b = 1$) and that $n$ is even, and let $m = n/2$. For $\delta \in\, ]0,1[$, $R'_n = V'_n$ as in (3), and any estimator $\hat h : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{H}$, we have, with probability at least $1-\delta$,

$$\sqrt{\frac{V'_n}{n}} \;\le\; \sqrt{\frac{2\big(L(\hat h(Z_{>m})) + L(\hat h(Z_{\le m}))\big)}{n}} \;+\; O\!\Big(\frac{\ln(1/\delta)}{n}\Big). \tag{4}$$

Behind the proof of the lemma is an application of Hoeffding's inequality and the empirical Bernstein inequality [27] (see Section C). Note that in the realizable setting, the first term on the RHS of (4) can be of order $O(1/n)$ with the right choice of estimator $\hat h$ (e.g. ERM). In this case (still in the realizable setting), our irreducible term goes to zero at the same rate as in other bounds which have $R_n = L_n(P_n)$.

## 3 Main Bound

We now present our main result in its most general form. Let $\vartheta(u) = (-\ln(1-u) - u)/u^2$ and $c_\eta = \eta\,\vartheta(\eta b)$, for $\eta \in\, ]0,1/b[$, where $b > 0$ is an upper bound on the loss $\ell$.

Theorem 3. [Main Theorem] Let $Z_1,\dots,Z_n$ be i.i.d. with $Z_1 \sim D$. Let $m \in [0..n]$ and let $\pi$ be any distribution with support on a finite or countable grid $G \subset\, ]0,1/b[$. For any $\delta \in\, ]0,1[$ and any learning algorithms $P, Q : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$, we have

$$L(P_n) \;\le\; L_n(P_n) + \inf_{\eta\in G}\Big\{ c_\eta V_n + \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\delta\,\pi(\eta)}}{\eta\, n} \Big\} + c_\nu V'_n + \frac{\ln\frac{1}{\delta\,\pi(\nu)}}{\nu\, n} \tag{5}$$

with probability at least $1-\delta$, where $\nu := \min G$ and $\mathrm{COMP}_n$, $V'_n$, and $V_n$ are the random variables defined by:

$$\mathrm{COMP}_n = \mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})), \tag{6}$$
$$V'_n = \frac{1}{n}\Big( \sum_{i=1}^m \mathbb{E}_{h\sim Q(Z_{>i})}\big[\ell_h(Z_i)^2\big] + \sum_{j=m+1}^n \mathbb{E}_{h\sim Q(Z_{<j})}\big[\ell_h(Z_j)^2\big] \Big),$$
$$V_n = \mathbb{E}_{h\sim P_n}\Big[ \frac{1}{n}\Big( \sum_{i=1}^m \big(\ell_h(Z_i) - \mathbb{E}_{h'\sim Q(Z_{>i})}[\ell_{h'}(Z_i)]\big)^2 + \sum_{j=m+1}^n \big(\ell_h(Z_j) - \mathbb{E}_{h'\sim Q(Z_{<j})}[\ell_{h'}(Z_j)]\big)^2 \Big) \Big].$$

Corollary 1 follows by choosing $Q(Z_{>i}) = \delta(\hat h(Z_{>m}))$ for $i \le m$ and $Q(Z_{<j}) = \delta(\hat h(Z_{\le m}))$ for $j > m$. Note that for $j > m$, in the RHS sums of $V_n$ and $V'_n$, we could instead use "online" posteriors $Q(Z_{<j})$ which converge to the final $\hat h(Z_{\le n})$ based on the full data. Doing this would likely improve our bounds, but we did not try it in our experiments since it is computationally demanding.

Grid. A natural choice, which we use in our experiments in Section 4, is the geometric grid

$$G = \Big\{ \frac{1}{b\,2^k} : k \in \big\{1,\dots,\lfloor \log_2 \sqrt{n} \rfloor\big\} \Big\}, \quad \text{with } \pi \text{ uniform on } G, \tag{7}$$

so that $\min G$ is of order $1/(b\sqrt{n})$ and $\ln\frac{1}{\pi(\eta)} = O(\ln\ln n)$.
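To make the structure of (5) concrete, here is a small sketch (ours) that evaluates its right-hand side from precomputed empirical quantities (e.g. by the snippets above), assuming the grid choice (7) with uniform $\pi$ and $b = 1$ for the 0-1 loss; all names are illustrative:

```python
import numpy as np

# Sketch (ours): evaluating the RHS of Theorem 3's bound (5).

def theta(u):
    return (-np.log1p(-u) - u) / u**2

def unexpected_bernstein_bound(L_n, V_n, V_prime, comp_n, n, delta, b=1.0):
    ks = np.arange(1, int(np.floor(np.log2(np.sqrt(n)))) + 1)
    grid = 1.0 / (b * 2.0**ks)          # G as in (7); grid[-1] = min G
    log_inv_pi = np.log(len(grid))      # pi uniform: ln(1/pi(eta)) = ln|G|
    c = grid * theta(grid * b)          # c_eta = eta * theta(eta * b)
    main = np.min(c * V_n
                  + (comp_n + 2 * (np.log(1/delta) + log_inv_pi)) / (grid * n))
    nu, c_nu = grid[-1], c[-1]          # nu = min G drives the V'_n term
    irreducible = c_nu * V_prime + (np.log(1/delta) + log_inv_pi) / (nu * n)
    return L_n + main + irreducible
```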
Informed Priors. Other bounds can also be modified to make use of informed priors from each half of the data; in this case, the $\mathrm{KL}(P_n \| P_0)$ term in these bounds can be replaced by $\mathrm{COMP}_n$ defined in (6). As revealed by additional experiments in Appendix H, doing this substantially improves the corresponding bounds when the learning algorithm is sufficiently stable. Here we show how this can be done for Maurer's bound in (2) (the details for other bounds are postponed to Appendix A).

Lemma 4. Let $\delta \in\, ]0,1[$ and $m \in [0..n]$. In the setting of Theorem 3, we have, with probability at least $1-\delta$,

$$\mathrm{kl}(L_n(P_n), L(P_n)) \;\le\; \frac{\mathrm{KL}(P_n \| P(Z_{\le m})) + \mathrm{KL}(P_n \| P(Z_{>m})) + \ln\frac{4\sqrt{n}}{\delta}}{n}.$$

Remark 5. (Useful for Section 5 below) Though this may deteriorate the bound in practice, Theorem 3 allows choosing a learning algorithm $P$ such that, for $1 \le m < n$, $P(Z_{\le m}) = P(Z_{>m}) = P_0$ (i.e. no informed priors); this results in $\mathrm{COMP}_n = 2\,\mathrm{KL}(P_n \| P_0)$; the bound is otherwise unchanged.

Biasing. The term $V_n$ in our bound can be seen as the result of biasing the loss when evaluating the generalization error on each half of the sample. The TS bound, having a second-order variance term, can be used in such a way as to arrive at a bound like ours, with the same $V_n$ as in Corollary 1. The idea here is to apply the TS bound twice (once on each half of the sample) to the biased losses $\ell(h,\cdot) - \ell(\hat h(Z_{\le m}),\cdot)$ and $\ell(h,\cdot) - \ell(\hat h(Z_{>m}),\cdot)$, and then combine the results with a union bound. The details of this are postponed to Appendix B. Note, however, that this trick will not lead to a bound with a $V_n$ term as in Theorem 3, i.e. with the online posteriors $(Q(Z_{>i}))_{i\le m}$ and $(Q(Z_{<j}))_{j>m}$.

## 4 Experiments

Setup. We consider linear classifiers: for $h \in \mathcal{H} \subseteq \mathbb{R}^d$ and input $x \in \mathbb{R}^d$, the predicted label is $\hat y = 1\{\phi(h^\top x) > 1/2\}$, where $\phi(w) = 1/(1+e^{-w})$, $w \in \mathbb{R}$, and the bounds are evaluated with the 0-1 loss. We learn our hypotheses using regularized logistic regression: given a sample $S = (Z_p,\dots,Z_q)$, with $(p,q) \in \{(1,m), (m+1,n), (1,n)\}$ and $m = n/2$, we compute

$$\hat h(S) = \operatorname*{arg\,min}_{h\in\mathcal{H}}\; \lambda \|h\|_2^2 \;-\; \frac{1}{q-p+1}\sum_{i=p}^{q} \Big( Y_i \ln \phi(h^\top X_i) + (1-Y_i)\ln\big(1 - \phi(h^\top X_i)\big) \Big). \tag{8}$$

For $Z_{\le n} \in \mathcal{Z}^n$ and $1 \le i \le m < j \le n$, we choose the algorithm $Q$ in Theorem 3 such that $Q(Z_{>i}) = \delta(\hat h(Z_{>m}))$ and $Q(Z_{<j}) = \delta(\hat h(Z_{\le m}))$. For the learning algorithm $P$, we use Gaussian posteriors centered at the learned hypothesis with covariance $\sigma^2 I_d$, $\sigma > 0$; that is, $P(S) = \mathcal{N}(\hat h(S), \sigma^2 I_d)$. The prior distribution is set to $P_0 = \mathcal{N}(0, \sigma_0^2 I_d)$, for $\sigma_0 > 0$.

Parameters. We set $\delta = 0.05$. For all datasets, we use $\lambda = 0.01$, and (approximately) solve (8) using the BFGS algorithm. For each bound, we pick the $\sigma^2 \in \{1/2, \dots, 1/2^J\}$, $J = \lceil \log_2 n \rceil$, which minimizes it on the given data (with $n$ instances). In order for the bounds to still hold with probability at least $1-\delta$, we replace $\delta$ on the RHS of each bound by $\delta/\lceil \log_2 n \rceil$ (this follows from the application of a union bound). We choose the prior variance such that $\sigma_0^2 = 1/2$ (this was the best value on average for the bounds we compare against). We choose the grid $G$ in Theorem 3 as in (7). Finally, we approximate Gaussian expectations using Monte Carlo sampling.

Synthetic data. We generate synthetic data for $d \in \{10, 50\}$ and sample sizes between 800 and 8000. For a given sample size $n$, we 1) draw $X_1,\dots,X_n$ [resp. $\epsilon_1,\dots,\epsilon_n$] identically and independently from the multivariate Gaussian distribution $\mathcal{N}(0, I_d)$ [resp. the Bernoulli distribution $B(0.9)$]; and 2) set $Y_i = 1\{\phi(h^{\star\top} X_i) > 1/2\}$ if $\epsilon_i = 1$, and $Y_i = 1 - 1\{\phi(h^{\star\top} X_i) > 1/2\}$ otherwise (i.e. 10% label noise), for $i \in [n]$, where $h^\star \in \mathbb{R}^d$ is the vector constructed from the first $d$ digits of the number $\pi$. For example, if $d = 10$, then $h^\star = (3,1,4,1,5,9,2,6,5,3)^\top$. Figure 1 shows the results, averaged over 10 independent runs for each sample size.

UCI datasets. For the second experiment, we use several UCI datasets. These are listed in Table 1 (where Breast-C. stands for Breast Cancer). We encode categorical variables as appropriate 0-1 vectors. This effectively increases the dimension of the input space (this is reported as $d$ in Table 1). After removing any rows (i.e. instances) containing missing features and performing the encoding, the input data is scaled such that every column has values between -1 and 1. We used a 5-fold train-test split ($n$ in Table 1 is the training set size), and the results in Table 1 are averages over 5 runs. We only compare with Maurer's bound, since the other bounds were worse than both Maurer's and ours on all datasets.
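For concreteness, here is a sketch (ours, not the authors' code) of the training step (8) using SciPy's BFGS, together with Monte Carlo draws from the randomized predictor $P(S) = \mathcal{N}(\hat h(S), \sigma^2 I_d)$; the function names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (ours) of (8): L2-regularized logistic regression via BFGS;
# lam corresponds to lambda = 0.01 in the experiments.

def fit_logistic(X, y, lam=0.01):
    d = X.shape[1]

    def objective(h):
        z = X @ h
        nll = np.mean(np.logaddexp(0.0, z) - y * z)  # stable -(1/n) log-likelihood
        return lam * (h @ h) + nll

    return minimize(objective, np.zeros(d), method="BFGS").x

def posterior_samples(h_hat, sigma2, k, rng):
    """k Monte Carlo draws from P(S) = N(h_hat, sigma2 * I_d)."""
    return rng.normal(h_hat, np.sqrt(sigma2), size=(k, h_hat.shape[0]))
```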
Discussion. As the dimension $d$ of the input space increases, the complexity $\mathrm{KL}(P_n \| P_0)$, and thus all the PAC-Bayes bounds discussed in this paper, get larger. Our bound suffers less from this increase in $d$, since for a large enough sample size $n$ the term $V_n$ is small enough (see Figure 1) to absorb any increase in the complexity. In fact, for large enough $n$, the irreducible (complexity-free) term involving $V'_n$ in our bound becomes the dominant one. This, combined with the fact that for the 0-1 loss $V'_n \le L_n(P_n)$ for large enough $n$ (see Figure 1), makes our bound tighter than others. Adding a regularization term to the objective (8) is important, as it stabilizes the learned hypotheses $\hat h(Z_{\le m})$, $\hat h(Z_{>m})$, and $\hat h(Z_{\le n})$, which in turn keeps the term $V_n$ small.

## 5 Theoretical Motivation

Definition 6. [Bernstein condition] A learning problem $(D, \ell, \mathcal{H})$ satisfies the $(\beta, B)$-Bernstein condition (BC), for $\beta \in [0,1]$ and $B > 0$, if for all $h \in \mathcal{H}$,

$$\mathbb{E}_{Z\sim D}\big[(\ell_h(Z) - \ell_{h^\star}(Z))^2\big] \;\le\; B\; \mathbb{E}_{Z\sim D}\big[\ell_h(Z) - \ell_{h^\star}(Z)\big]^\beta,$$

where $h^\star \in \operatorname{arg\,inf}_{h\in\mathcal{H}} \mathbb{E}_{Z\sim D}[\ell_h(Z)]$ is a risk minimizer within the closure of $\mathcal{H}$.

The Bernstein condition [3, 5, 6, 15, 22] essentially characterizes the easiness of the learning problem; it implies that the variance of the excess loss random variable $\ell_h(Z) - \ell_{h^\star}(Z)$ gets smaller the closer the risk of hypothesis $h \in \mathcal{H}$ gets to that of the risk minimizer $h^\star$. For bounded loss functions, the BC with $\beta = 0$ always holds. The BC with $\beta = 1$ (the easiest learning setting) is also known as the Massart noise condition [25]; it holds in our experiment with synthetic data in Section 4, and also, e.g., whenever $\mathcal{H}$ is convex and $h \mapsto \ell_h(z)$ is exp-concave for all $z \in \mathcal{Z}$ [15, 31]. For more examples of learning settings where a BC holds, see [22, Section 3].

Our aim in this section is to give an upper bound on the infimum term involving $V_n$ in (5), under a BC, in terms of the complexity $\mathrm{COMP}_n$ and the excess risks $\Delta L(P_n)$, $\Delta L(Q(Z_{>m}))$, and $\Delta L(Q(Z_{\le m}))$, where for a distribution $P \in \mathcal{P}(\mathcal{H})$ the excess risk is defined by $\Delta L(P) = \mathbb{E}_{h\sim P}\big[\mathbb{E}_{Z\sim D}[\ell_h(Z)]\big] - \mathbb{E}_{Z\sim D}[\ell_{h^\star}(Z)]$. In the next theorem, we denote $Q_{\le m} = Q(Z_{\le m})$ and $Q_{>m} = Q(Z_{>m})$, for $m \in [n]$. To simplify the presentation further (and for consistency with Section 4), we assume that $Q$ is chosen such that

$$Q(Z_{>i}) = Q_{>m}, \text{ for } 1 \le i \le m, \qquad\text{and}\qquad Q(Z_{<j}) = Q_{\le m}, \text{ for } m < j \le n. \tag{9}$$

Theorem 7. Suppose the $(\beta, B)$-Bernstein condition holds for some $\beta \in [0,1]$ and $B > 0$. Then for any learning algorithms $P$ and $Q$ (with $Q$ satisfying (9)), there exists a $C > 0$ such that, for all $n \ge 1$ and $m = n/2$, with probability at least $1-\delta$,

$$\frac{1}{C}\,\inf_{\eta\in G}\Big\{ c_\eta V_n + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{\eta\, n} \Big\} \;\le\; \Delta L(P_n) + \Delta L(Q_{\le m}) + \Delta L(Q_{>m}) + \Big( \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n} \Big)^{\frac{1}{2-\beta}} + \frac{\mathrm{COMP}_n + \varepsilon_{\delta,n}}{n}. \tag{10}$$

In addition to the ESI tools provided in Section 6 and Lemma 13, the proof of Theorem 7, presented in Appendix E, also uses an ESI version of the Bernstein condition due to [22]. First, note that the only terms in our main bound (5), other than the infimum on the LHS of (10), are the empirical error $L_n(P_n)$ and an $O(1/\sqrt{n})$ complexity-free term which is typically smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ (e.g. when the dimension of $\mathcal{H}$ is large enough). The term $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ is often the dominating one in other PAC-Bayesian bounds when $\liminf_{n\to\infty} L_n(P_n) > 0$. Now consider the remaining term in our main bound, which matches the infimum term on the LHS of (10), and let us choose the algorithm $P$ as per Remark 5, so that $\mathrm{COMP}_n = 2\,\mathrm{KL}(P_n \| P_0)$. Suppose that, with high probability (w.h.p.), $\mathrm{KL}(P_n \| P_0)/n$ converges to 0 as $n \to \infty$ (otherwise no PAC-Bayesian bound would converge to 0). Then $(\mathrm{COMP}_n/n)^{1/(2-\beta)} + \mathrm{COMP}_n/n$, essentially the sum of the last two terms on the RHS of (10), converges to 0 at a faster rate than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$ w.h.p. for $\beta > 0$, and at an equal rate for $\beta = 0$. Thus, in light of Theorem 7, to argue that our bound can be better than others (still when $\liminf_{n\to\infty} L_n(P_n) > 0$), it remains to show that there exist algorithms $P$ and $Q$ for which the sum of the excess risks on the RHS of (10) is smaller than $\sqrt{\mathrm{KL}(P_n \| P_0)/n}$.
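To see the effect of $\beta$ in Theorem 7 concretely, write $K := \mathrm{COMP}_n + \varepsilon_{\delta,n}$; the last two terms on the RHS of (10) then behave as follows at the two extremes:

```latex
% Rate of the last two complexity terms on the RHS of (10), K := COMP_n + eps_{delta,n}:
\[
\left(\frac{K}{n}\right)^{\frac{1}{2-\beta}} + \frac{K}{n} =
\begin{cases}
\sqrt{K/n} + K/n, & \beta = 0 \quad \text{(the usual } \sqrt{\mathrm{KL}/n}\text{-type rate)},\\[3pt]
2K/n, & \beta = 1 \quad \text{(Massart noise: the fast } K/n \text{ rate)}.
\end{cases}
\]
```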
One choice of estimator with small excess risk is the empirical risk minimizer (ERM). When $m = n/2$, if one chooses $Q$ such that it outputs a Dirac around the ERM on a given sample, then under a BC with exponent $\beta$ and for parametric $\mathcal{H}$ (such as the $d$-dimensional linear classifiers in Sec. 4), $\Delta L(Q_{\le m})$ and $\Delta L(Q_{>m})$ are of order $O(n^{-1/(2-\beta)})$ w.h.p. [3, 19]. However, setting $P_n = \delta(\mathrm{ERM}(Z_{\le n}))$ is not allowed, since otherwise $\mathrm{KL}(P_n \| P_0) = \infty$. Instead, one can choose $P_n$ to be the generalized-Bayes/Gibbs posterior. In this case too, under a BC with exponent $\beta$ and for parametric $\mathcal{H}$, the excess risk is of order $O(n^{-1/(2-\beta)})$ w.h.p. for clever choices of prior $P_0$ [3, 19].

## 6 Detailed Analysis

We start this section by presenting the convenient ESI notation and use it to present our main technical Lemma 13 (proofs of the ESI results are in Appendix D). We then continue with a proof of Theorem 3.

Definition 8. [ESI (Exponential Stochastic Inequality, pronounced "easy") [19, 22]] Let $\eta > 0$, and let $X, Y$ be any two random variables with joint distribution $D$. We define

$$X \trianglelefteq^D_\eta Y \;\Longleftrightarrow\; X - Y \trianglelefteq^D_\eta 0 \;\Longleftrightarrow\; \mathbb{E}_{(X,Y)\sim D}\big[e^{\eta(X-Y)}\big] \le 1. \tag{11}$$

Definition 8 can be extended to the case where $\eta = \hat\eta$ is also a random variable, in which case the expectation in (11) needs to be replaced by the expectation over the joint distribution of $(X, Y, \hat\eta)$. When no ambiguity can arise, we omit $D$ from the ESI notation. Besides simplifying notation, ESIs are useful in that they simultaneously capture "with high probability" and "in expectation" results:

Proposition 9. [ESI Implications] For fixed $\eta > 0$, if $X \trianglelefteq_\eta Y$ then $\mathbb{E}[X] \le \mathbb{E}[Y]$. For both fixed and random $\hat\eta$, if $X \trianglelefteq_{\hat\eta} Y$, then for all $\delta \in\, ]0,1[$, $X \le Y + \frac{\ln(1/\delta)}{\hat\eta}$, with probability at least $1-\delta$.

In the next proposition, we present two results concerning transitivity and additivity properties of ESI:

Proposition 10. [ESI Transitivity and Chain Rule] (a) Let $Z_1,\dots,Z_n$ be any random variables on $\mathcal{Z}$ (not necessarily independent). If, for some $(\gamma_i)_{i\in[n]} \in\, ]0,+\infty[^n$, $Z_i \trianglelefteq_{\gamma_i} 0$ for all $i \in [n]$, then $\sum_{i=1}^n Z_i \trianglelefteq_{\nu_n} 0$, where $\nu_n = \big(\sum_{i=1}^n \gamma_i^{-1}\big)^{-1}$ (so if $\gamma_i = \gamma > 0$ for all $i \in [n]$, then $\nu_n = \gamma/n$). (b) Suppose now that $Z_1,\dots,Z_n$ are i.i.d., and let $X : \mathcal{Z} \times \bigcup_{i=0}^{n-1} \mathcal{Z}^i \to \mathbb{R}$ be any real-valued function. If, for some $\eta > 0$, $X(Z_i; z_{<i}) \trianglelefteq_\eta 0$ for all $i \in [n]$ and all $z_{<i} \in \mathcal{Z}^{i-1}$, then $\sum_{i=1}^n X(Z_i; Z_{<i}) \trianglelefteq_\eta 0$.

Proposition 11. [ESI and KL divergence] Let $\eta > 0$ and let $\{Y_h : h \in \mathcal{H}\}$ be any family of random variables such that $Y_h \trianglelefteq_\eta 0$ for all $h \in \mathcal{H}$. Let $P_0$ be any distribution on $\mathcal{H}$ and let $P : \bigcup_{i=1}^n \mathcal{Z}^i \to \mathcal{P}(\mathcal{H})$ be a learning algorithm. We have: $\mathbb{E}_{h\sim P_n}[Y_h] \trianglelefteq_\eta \frac{\mathrm{KL}(P_n \| P_0)}{\eta}$, where $P_n = P(Z_{\le n})$.

In many applications (especially for our main result), it is desirable to work with a random (i.e. data-dependent) $\eta$ in the ESI inequalities; one can then tune $\eta$ after seeing the data.

Proposition 12. [ESI from fixed to random $\eta$] Let $G$ be a countable subset of $]0,+\infty[$ and let $\pi$ be a prior distribution over $G$. Given a countable collection $\{Y_\eta : \eta \in G\}$ of random variables satisfying $Y_\eta \trianglelefteq_\eta 0$ for all fixed $\eta \in G$, we have, for an arbitrary estimator $\hat\eta$ with support on $G$, $Y_{\hat\eta} \trianglelefteq_{\hat\eta} \frac{\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta}$.

The following key lemma, which is of independent interest, is central to our main result.

Lemma 13. [Key result: un-expected Bernstein] Let $X \sim D$ be a random variable bounded from above by $b > 0$ almost surely, and let $\vartheta(u) = (-\ln(1-u) - u)/u^2$. For all $0 < \eta < 1/b$, we have

(a): $\mathbb{E}[X] - X \trianglelefteq^D_\eta c\,X^2$, for all $c \ge \eta\,\vartheta(\eta b)$. (12)

(b): The result is tight; for every $c < \eta\,\vartheta(\eta b)$, there exists a distribution $D$ such that (12) does not hold.

Lemma 13 is reminiscent of the following slight variation of Bernstein's inequality [12]: let $X$ be any random variable bounded from below by $-b$, and let $\kappa(x) = (e^x - x - 1)/x^2$. For all $\eta > 0$, we have

$$\mathbb{E}[X] - X \trianglelefteq_\eta s\,\mathbb{E}[X^2], \quad \text{for all } s \ge \eta\,\kappa(\eta b). \tag{13}$$
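As a quick numerical sanity check (our sketch, not from the paper), one can verify the ESI form of (12), i.e. $\mathbb{E}\big[\exp\big(\eta(\mathbb{E}[X] - X - c_\eta X^2)\big)\big] \le 1$, by Monte Carlo on a few distributions bounded above by $b$:

```python
import numpy as np

# Monte Carlo sanity check (ours) of Lemma 13(a): for X <= b a.s. and
# 0 < eta < 1/b, E[exp(eta*(E[X] - X - c_eta*X^2))] <= 1, with
# c_eta = eta * theta(eta*b) and theta(u) = (-ln(1-u) - u)/u^2.

def c_eta(eta, b):
    u = eta * b
    return eta * (-np.log1p(-u) - u) / u**2

rng = np.random.default_rng(0)
b = 1.0
samples = {
    "uniform":     rng.uniform(-2.0, b, 10**6),      # bounded support
    "exponential": b - rng.exponential(1.0, 10**6),  # unbounded below, <= b
    "two-point":   rng.choice([-3.0, b], 10**6),
}
for eta in [0.1, 0.5, 0.9]:                          # grid inside ]0, 1/b[
    for name, x in samples.items():
        c = c_eta(eta, b)
        esi = np.mean(np.exp(eta * (x.mean() - x - c * x**2)))
        print(f"eta={eta}, {name}: E[exp(...)] = {esi:.4f} (should be <= 1)")
```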
Note that the un-expected Bernstein Lemma 13 has the $X^2$ lifted out of the expectation. In Appendix G, we prove (13) and compare it to standard versions of Bernstein's inequality. We also compare (12) to the related but distinct empirical Bernstein inequality due to [27, Theorem 4]. We now prove part (a) of Lemma 13, which follows easily from the proof of an existing result [16, 21]. Part (b) is novel; its proof is postponed to Appendix F.

Proof of Lemma 13, Part (a). [16] (see also [21]) showed in the proof of their Lemma 4.1 that

$$\exp\big(\lambda\xi - \lambda^2\vartheta(\lambda)\xi^2\big) \le 1 + \lambda\xi, \quad \text{for all } \lambda \in [0,1[ \text{ and } \xi \ge -1. \tag{14}$$

Letting $\eta = \lambda/b$ and $\xi = -X/b$, (14) becomes

$$\exp\big({-\eta X} - \eta^2\vartheta(\eta b)X^2\big) \le 1 - \eta X, \quad \text{for all } \eta \in\, ]0,1/b[. \tag{15}$$

Taking expectations on both sides of (15) and using the fact that $1 - \eta\,\mathbb{E}[X] \le \exp(-\eta\,\mathbb{E}[X])$ on the RHS of the resulting inequality leads to (12).

Proof of Theorem 3. Let $\eta \in\, ]0,1/b[$ and $c_\eta = \eta\,\vartheta(\eta b)$. For $1 \le i \le m < j \le n$, define

$$X_h(Z_i; z_{>i}) = \ell_h(Z_i) - \mathbb{E}_{h'\sim Q(z_{>i})}[\ell_{h'}(Z_i)], \quad \text{for } z_{>i} \in \mathcal{Z}^{n-i},$$
$$X_h(Z_j; z_{<j}) = \ell_h(Z_j) - \mathbb{E}_{h'\sim Q(z_{<j})}[\ell_{h'}(Z_j)], \quad \text{for } z_{<j} \in \mathcal{Z}^{j-1}.$$

These random variables are bounded from above by $b$, so Lemma 13 yields, for all $h \in \mathcal{H}$,

$$Y^\eta_h(Z_i; z_{>i}) := \mathbb{E}_{Z'\sim D}[X_h(Z'; z_{>i})] - X_h(Z_i; z_{>i}) - c_\eta X_h(Z_i; z_{>i})^2 \trianglelefteq_\eta 0,$$
$$Y^\eta_h(Z_j; z_{<j}) := \mathbb{E}_{Z'\sim D}[X_h(Z'; z_{<j})] - X_h(Z_j; z_{<j}) - c_\eta X_h(Z_j; z_{<j})^2 \trianglelefteq_\eta 0.$$

By the ESI chain rule (Proposition 10(b), applied once to each half of the sample, with the order of the first half reversed), $\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) \trianglelefteq_\eta 0$ and $\sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j}) \trianglelefteq_\eta 0$. Applying Proposition 11 to these two ESIs, with priors $P(Z_{>m})$ and $P(Z_{\le m})$, respectively, and common posterior $P_n = P(Z_{\le n})$, we get, with $\mathrm{KL}_{>m} = \mathrm{KL}(P_n \| P(Z_{>m}))$ and $\mathrm{KL}_{\le m} = \mathrm{KL}(P_n \| P(Z_{\le m}))$:

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i})\Big] \trianglelefteq_\eta \frac{\mathrm{KL}_{>m}}{\eta}, \qquad \mathbb{E}_{h\sim P_n}\Big[\sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j})\Big] \trianglelefteq_\eta \frac{\mathrm{KL}_{\le m}}{\eta}.$$

By ESI transitivity (Proposition 10(a)), summing these two ESIs gives

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^\eta_h(Z_i; Z_{>i}) + \sum_{j=m+1}^n Y^\eta_h(Z_j; Z_{<j})\Big] \trianglelefteq_{\eta/2} \frac{\mathrm{KL}_{>m} + \mathrm{KL}_{\le m}}{\eta} = \frac{\mathrm{COMP}_n}{\eta}.$$

With the prior $\pi$ on $G$, we have, for any estimator $\hat\eta = \hat\eta(Z_{\le n}) \in G \subseteq [1/\sqrt{nb^2},\, 1/b[$ (see Proposition 12),

$$\mathbb{E}_{h\sim P_n}\Big[\sum_{i=1}^m Y^{\hat\eta}_h(Z_i; Z_{>i}) + \sum_{j=m+1}^n Y^{\hat\eta}_h(Z_j; Z_{<j})\Big] \trianglelefteq_{\hat\eta/2} \frac{\mathrm{COMP}_n + 2\ln\frac{1}{\pi(\hat\eta)}}{\hat\eta}.$$

Writing $\ell_{Q_{>i}}(Z_i) = \mathbb{E}_{h\sim Q(Z_{>i})}[\ell_h(Z_i)]$ and $\ell_{Q_{<j}}(Z_j) = \mathbb{E}_{h\sim Q(Z_{<j})}[\ell_h(Z_j)]$, and unfolding the definition of $Y^{\hat\eta}_h$, the LHS equals $n\big(L(P_n) - L_n(P_n) - c_{\hat\eta} V_n\big)$ minus the centered terms $\sum_{i=1}^m \big(\mathbb{E}_{Z'\sim D}[\ell_{Q_{>i}}(Z')] - \ell_{Q_{>i}}(Z_i)\big) + \sum_{j=m+1}^n \big(\mathbb{E}_{Z'\sim D}[\ell_{Q_{<j}}(Z')] - \ell_{Q_{<j}}(Z_j)\big)$. The latter are controlled by a second application of the un-expected Bernstein lemma, with $\nu$ in place of $\eta$, to the variables $\ell_{Q_{>i}}(Z_i)$ and $\ell_{Q_{<j}}(Z_j)$ (each bounded above by $b$), again combined with the chain rule; this produces the term $c_\nu \big(\sum_{i=1}^m \mathbb{E}_{h\sim Q(Z_{>i})}[\ell_h(Z_i)^2] + \sum_{j=m+1}^n \mathbb{E}_{h\sim Q(Z_{<j})}[\ell_h(Z_j)^2]\big) = c_\nu\, n\, V'_n$. Combining the two ESI chains, applying Proposition 9 to each (which yields the $\ln\frac{1}{\delta\,\pi(\cdot)}$ terms), and dividing by $n$ yields (5).
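For fixed $\eta$, the high-probability direction of Proposition 9, used in the final step above, is just Markov's inequality applied to the ESI; spelled out:

```latex
% ESI => high probability, via Markov's inequality: if E[e^{eta(X-Y)}] <= 1, then
\[
\Pr\!\Big(X > Y + \tfrac{\ln(1/\delta)}{\eta}\Big)
= \Pr\!\Big(e^{\eta(X-Y)} > \tfrac{1}{\delta}\Big)
\;\le\; \delta\,\mathbb{E}\big[e^{\eta(X-Y)}\big]
\;\le\; \delta .
\]
```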