# PAC-Bayesian Theory Meets Bayesian Inference

Pascal Germain, Francis Bach, Alexandre Lacoste, Simon Lacoste-Julien
INRIA Paris - École Normale Supérieure, firstname.lastname@inria.fr
Google, allac@google.com

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We exhibit a strong link between frequentist PAC-Bayesian risk bounds and the Bayesian marginal likelihood. That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization risk bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation to the Bayesian Occam's razor criteria, under the assumption that the data is generated by an i.i.d. distribution. Moreover, as the negative log-likelihood is an unbounded loss function, we motivate and propose a PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that our approach is sound on classical Bayesian linear regression tasks.

1 Introduction

Since its early beginning [24, 34], the PAC-Bayesian theory claims to provide PAC guarantees to Bayesian algorithms (McAllester [24]). However, despite the amount of work dedicated to this statistical learning theory (many authors improved the initial results [8, 21, 25, 30, 35] and/or generalized them for various machine learning setups [4, 12, 15, 20, 28, 31, 32, 33]), it is mostly used as a frequentist method. That is, under the assumption that the learning samples are i.i.d.-generated by a data distribution, this theory expresses probably approximately correct (PAC) bounds on the generalization risk. In other words, with probability $1-\delta$, the generalization risk is at most $\varepsilon$ away from the training risk. The Bayesian side of PAC-Bayes comes mostly from the fact that these bounds are expressed on the averaging/aggregation/ensemble of multiple predictors (weighted by a posterior distribution) and incorporate prior knowledge. Although it is still sometimes referred to as a theory that bridges the Bayesian and frequentist approaches [e.g., 16], it has barely been used to justify Bayesian methods until now.¹

¹Some existing connections [3, 6, 14, 19, 29, 30, 36] are discussed in Appendix A.1.

In this work, we provide a direct connection between Bayesian inference techniques [summarized by 5, 13] and PAC-Bayesian risk bounds in a general setup. Our study is based on a simple but insightful connection between the Bayesian marginal likelihood and PAC-Bayesian bounds (previously mentioned by Grünwald [14]), obtained by considering the negative log-likelihood loss function (Section 3). By doing so, we provide an alternative explanation for the Bayesian Occam's razor criteria [18, 22] in the context of model selection, expressed as the complexity-accuracy trade-off appearing in most PAC-Bayesian results. In Section 4, we extend PAC-Bayes theorems to regression problems with unbounded loss, adapted to the negative log-likelihood loss function. Finally, we study Bayesian model selection from a PAC-Bayesian perspective (Section 5), and illustrate our findings on classical Bayesian regression tasks (Section 6).

2 PAC-Bayesian Theory

We denote the learning sample $(X,Y) = \{(x_i, y_i)\}_{i=1}^{n} \in (\mathcal{X}\times\mathcal{Y})^n$, which contains $n$ input-output pairs. The main assumption of frequentist learning theories, including PAC-Bayes, is that $(X,Y)$ is randomly sampled from a data-generating distribution that we denote $D$. Thus, we denote by $(X,Y)\sim D^n$ the i.i.d. observation of $n$ elements.
From a frequentist perspective, we consider in this work loss functions $\ell : \mathcal{F}\times\mathcal{X}\times\mathcal{Y} \to \mathbb{R}$, where $\mathcal{F}$ is a (discrete or continuous) set of predictors $f : \mathcal{X}\to\mathcal{Y}$, and we write the empirical risk on the sample $(X,Y)$ and the generalization risk on distribution $D$ as
$$\widehat{\mathcal{L}}_{X,Y}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, x_i, y_i)\,; \qquad \mathcal{L}_{D}(f) = \mathbb{E}_{(x,y)\sim D}\,\ell(f, x, y)\,.$$
The PAC-Bayesian theory [24, 25] studies an averaging of the above losses according to a posterior distribution $\hat\rho$ over $\mathcal{F}$. That is, it provides probably approximately correct generalization bounds on the (unknown) quantity $\mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) = \mathbb{E}_{f\sim\hat\rho}\,\mathbb{E}_{(x,y)\sim D}\,\ell(f,x,y)$, given the empirical estimate $\mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f)$ and some other parameters. Among these, most PAC-Bayesian theorems rely on the Kullback-Leibler divergence $\mathrm{KL}(\hat\rho\|\pi) = \mathbb{E}_{f\sim\hat\rho}\ln[\hat\rho(f)/\pi(f)]$ between a prior distribution $\pi$ over $\mathcal{F}$, specified before seeing the learning sample $(X,Y)$, and the posterior $\hat\rho$, typically obtained by feeding a learning process with $(X,Y)$.

Two appealing aspects of PAC-Bayesian theorems are that they provide data-driven generalization bounds that are computed on the training sample (i.e., they do not rely on a testing sample), and that they are uniformly valid for all $\hat\rho$ over $\mathcal{F}$. This explains why many works study them as model selection criteria or as an inspiration for the design of learning algorithms. Theorem 1, due to Catoni [8], has been used to derive or study learning algorithms [10, 17, 26, 27].

Theorem 1 (Catoni [8]). Given a distribution $D$ over $\mathcal{X}\times\mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell' : \mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[0,1]$, a prior distribution $\pi$ over $\mathcal{F}$, a real number $\delta\in(0,1]$, and a real number $\beta>0$, with probability at least $1-\delta$ over the choice of $(X,Y)\sim D^n$, we have
$$\forall \hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell'}_{D}(f) \;\le\; \frac{1}{1-e^{-\beta}}\left[1-\exp\left(-\beta\,\mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\ell'}_{X,Y}(f) - \frac{1}{n}\Big[\mathrm{KL}(\hat\rho\|\pi)+\ln\tfrac{1}{\delta}\Big]\right)\right]. \qquad (1)$$

Theorem 1 is limited to loss functions mapping to the range $[0,1]$. Through a straightforward rescaling we can extend it to any bounded loss, i.e., $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[a,b]$, where $[a,b]\subset\mathbb{R}$. This is done by using $\beta := b-a$ and the rescaled loss function $\ell'(f,x,y) := (\ell(f,x,y)-a)/(b-a) \in [0,1]$. After a few arithmetic manipulations, we can rewrite Equation (1) as
$$\forall \hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_{D}(f) \;\le\; a + \frac{b-a}{1-e^{a-b}}\left[1-\exp\left(a - \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) - \frac{1}{n}\Big[\mathrm{KL}(\hat\rho\|\pi)+\ln\tfrac{1}{\delta}\Big]\right)\right]. \qquad (2)$$

From an algorithm design perspective, Equation (2) suggests optimizing a trade-off between the empirical expected loss and the Kullback-Leibler divergence. Indeed, for fixed $\pi$, $X$, $Y$, $n$, and $\delta$, minimizing Equation (2) is equivalent to finding the distribution $\hat\rho$ that minimizes
$$n\,\mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \mathrm{KL}(\hat\rho\|\pi)\,. \qquad (3)$$
It is well known [1, 8, 10, 21] that the optimal Gibbs posterior $\hat\rho^*$ is given by
$$\hat\rho^*(f) = \frac{1}{Z_{X,Y}}\,\pi(f)\,e^{-n\,\widehat{\mathcal{L}}_{X,Y}(f)}\,, \qquad (4)$$
where $Z_{X,Y}$ is a normalization term. Notice that the constant $\beta$ of Equation (1) is now absorbed in the loss function as the rescaling factor $b-a$, setting the trade-off between the expected empirical loss and $\mathrm{KL}(\hat\rho\|\pi)$.

3 Bridging Bayes and PAC-Bayes

In this section, we show that by choosing the negative log-likelihood loss function, minimizing the PAC-Bayes bound is equivalent to maximizing the Bayesian marginal likelihood. To obtain this result, we first consider the Bayesian approach, which starts by defining a prior $p(\theta)$ over the set $\Theta$ of possible model parameters. This induces a set of probabilistic estimators $f_\theta\in\mathcal{F}$, mapping $x$ to a probability distribution over $\mathcal{Y}$. Then, we can estimate the likelihood of observing $y$ given $x$ and $\theta$, i.e., $p(y|x,\theta) \equiv f_\theta(y|x)$.² Using Bayes' rule, we obtain the posterior $p(\theta|X,Y)$:
$$p(\theta|X,Y) = \frac{p(\theta)\,p(Y|X,\theta)}{p(Y|X)} \;\propto\; p(\theta)\,p(Y|X,\theta)\,, \qquad (5)$$
where $p(Y|X,\theta) = \prod_{i=1}^{n} p(y_i|x_i,\theta)$ and $p(Y|X) = \int_\Theta p(\theta)\,p(Y|X,\theta)\,d\theta$.
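When the set of predictors is finite, the Gibbs posterior of Equation (4) can be computed exactly. The following minimal sketch (our own illustration, not part of the paper; the prior weights and empirical risks are made-up toy values) does so with a numerically stable log-sum-exp normalization, and checks that the objective of Equation (3) evaluated at the Gibbs posterior equals $-\ln Z_{X,Y}$, an identity formalized in Equation (10) below.

```python
import numpy as np

def gibbs_posterior(prior, emp_losses, n):
    """Optimal Gibbs posterior of Eq. (4) over a finite hypothesis set.

    prior      : prior weights pi(f), summing to 1
    emp_losses : empirical risks L_hat_{X,Y}(f), same length as prior
    n          : number of training examples
    Returns (rho_hat, Z_XY) with rho_hat(f) = pi(f) exp(-n L_hat(f)) / Z_XY.
    """
    log_w = np.log(prior) - n * emp_losses      # log of the unnormalized weights
    log_Z = np.logaddexp.reduce(log_w)          # log-sum-exp for numerical stability
    return np.exp(log_w - log_Z), np.exp(log_Z)

# Toy usage: three predictors, the second one fits the sample best.
prior = np.array([0.5, 0.25, 0.25])
emp_losses = np.array([0.40, 0.10, 0.35])
rho_hat, Z = gibbs_posterior(prior, emp_losses, n=50)
kl = np.sum(rho_hat * np.log(rho_hat / prior))
# The trade-off of Eq. (3) evaluated at the Gibbs posterior equals -ln Z_{X,Y} (cf. Eq. 10).
print(rho_hat)
print(50 * np.dot(rho_hat, emp_losses) + kl, -np.log(Z))
```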
²To stay aligned with the PAC-Bayesian setup, we only consider the discriminative case in this paper. One can extend to the generative setup by considering a likelihood of the form $p(y,x|\theta)$ instead.

To bridge the Bayesian approach with the PAC-Bayesian framework, we consider the negative log-likelihood loss function [3], denoted $\ell_{\mathrm{nll}}$ and defined by
$$\ell_{\mathrm{nll}}(f_\theta, x, y) := -\ln p(y|x,\theta)\,. \qquad (6)$$
Then, we can relate the empirical loss $\widehat{\mathcal{L}}_{X,Y}$ of a predictor to its likelihood:
$$\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell_{\mathrm{nll}}(\theta, x_i, y_i) = -\frac{1}{n}\sum_{i=1}^{n}\ln p(y_i|x_i,\theta) = -\frac{1}{n}\ln p(Y|X,\theta)\,,$$
or, the other way around,
$$p(Y|X,\theta) = e^{-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)}\,. \qquad (7)$$
Unfortunately, existing PAC-Bayesian theorems work with bounded loss functions or in very specific contexts [e.g., 9, 36], and $\ell_{\mathrm{nll}}$ spans the whole real axis in its general form. In Section 4, we explore PAC-Bayes bounds for unbounded losses. Meanwhile, we consider priors with bounded likelihood. This can be done by assigning a prior of zero to any $\theta$ yielding $\ln\frac{1}{p(y|x,\theta)} \notin [a,b]$. Now, using Equation (7), the optimal posterior (Equation 4) simplifies to
$$\hat\rho^*(\theta) = \frac{\pi(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)}}{Z_{X,Y}} = \frac{p(\theta)\,p(Y|X,\theta)}{p(Y|X)} = p(\theta|X,Y)\,, \qquad (8)$$
where the normalization constant $Z_{X,Y}$ corresponds to the Bayesian marginal likelihood:
$$Z_{X,Y} \equiv p(Y|X) = \int_\Theta \pi(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)}\,d\theta\,. \qquad (9)$$
This shows that the optimal PAC-Bayes posterior given by the generalization bound of Theorem 1 coincides with the Bayesian posterior, when one chooses $\ell_{\mathrm{nll}}$ as the loss function and $\beta := b-a$ (as in Equation 2). Moreover, using the posterior of Equation (8) inside Equation (3), we obtain
$$n\,\mathbb{E}_{\theta\sim\hat\rho^*}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta) + \mathrm{KL}(\hat\rho^*\|\pi) = \int_\Theta \hat\rho^*(\theta)\,n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)\,d\theta + \int_\Theta \hat\rho^*(\theta)\Big[-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta) - \ln Z_{X,Y}\Big]d\theta = \ln\frac{1}{Z_{X,Y}} = -\ln Z_{X,Y}\,. \qquad (10)$$
In other words, minimizing the PAC-Bayes bound is equivalent to maximizing the marginal likelihood. Thus, from the PAC-Bayesian standpoint, the latter encodes a trade-off between the averaged negative log-likelihood loss function and the prior-posterior Kullback-Leibler divergence. Note that Equation (10) has been mentioned by Grünwald [14], based on an earlier observation of Zhang [36]. However, the PAC-Bayesian theorems proposed by the latter do not bound the generalization loss directly, contrary to the classical PAC-Bayesian results [8, 24, 29] that we extend to regression in the forthcoming Section 4 (see the corresponding remarks in Appendix A.1).

We conclude this section by proposing a compact form of Theorem 1, expressed in terms of the marginal likelihood as a direct consequence of Equation (10).

Corollary 2. Given a data distribution $D$, a parameter set $\Theta$, a prior distribution $\pi$ over $\Theta$, and a $\delta\in(0,1]$, if $\ell_{\mathrm{nll}}$ lies in $[a,b]$, we have, with probability at least $1-\delta$ over the choice of $(X,Y)\sim D^n$,
$$\mathbb{E}_{\theta\sim\hat\rho^*}\,\mathcal{L}^{\mathrm{nll}}_{D}(\theta) \;\le\; a + \frac{b-a}{1-e^{a-b}}\Big[1 - e^{a}\,\sqrt[n]{Z_{X,Y}\,\delta}\,\Big]\,,$$
where $\hat\rho^*$ is the Gibbs optimal posterior (Eq. 8) and $Z_{X,Y}$ is the marginal likelihood (Eq. 9).

In Section 5, we exploit the link between PAC-Bayesian bounds and the Bayesian marginal likelihood to expose similarities between both frameworks in the context of model selection. Beforehand, the next Section 4 extends the PAC-Bayesian generalization guarantees to unbounded loss functions. This is mandatory to make our study fully valid, as the negative log-likelihood loss function is in general unbounded (as are other common regression losses).

4 PAC-Bayesian Bounds for Regression

This section aims to extend the PAC-Bayesian results of Section 3 to real-valued unbounded losses. These results are used in the forthcoming sections to study $\ell_{\mathrm{nll}}$, but they are valid for broader classes of loss functions.
Importantly, our new results are focused on regression problems, as opposed to the usual PAC-Bayesian classification framework. The new bounds are obtained through a recent theorem of Alquier et al. [1], stated below (we provide a proof in Appendix A.2 for completeness).

Theorem 3 (Alquier et al. [1]). Given a distribution $D$ over $\mathcal{X}\times\mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, a prior distribution $\pi$ over $\mathcal{F}$, a $\delta\in(0,1]$, and a real number $\lambda>0$, with probability at least $1-\delta$ over the choice of $(X,Y)\sim D^n$, we have
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) \;\le\; \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \frac{1}{\lambda}\Big[\mathrm{KL}(\hat\rho\|\pi) + \ln\tfrac{1}{\delta} + \Psi_{\ell,\pi,D}(\lambda,n)\Big]\,, \qquad (11)$$
where
$$\Psi_{\ell,\pi,D}(\lambda,n) = \ln\,\mathbb{E}_{f\sim\pi}\,\mathbb{E}_{X',Y'\sim D^n}\exp\Big[\lambda\big(\mathcal{L}_D(f)-\widehat{\mathcal{L}}_{X',Y'}(f)\big)\Big]\,. \qquad (12)$$

Alquier et al. used Theorem 3 to design a learning algorithm for $\{0,1\}$-valued classification losses. Indeed, a bounded loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[a,b]$ can be used along with Theorem 3 by applying Hoeffding's lemma to Equation (12), which gives $\Psi_{\ell,\pi,D}(\lambda,n) \le \lambda^2(b-a)^2/(2n)$. More specifically, with $\lambda := n$, we obtain the following bound:
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) \;\le\; \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \frac{1}{n}\Big[\mathrm{KL}(\hat\rho\|\pi) + \ln\tfrac{1}{\delta}\Big] + \tfrac12(b-a)^2\,. \qquad (13)$$
Note that the latter bound leads to the same trade-off as Theorem 1 (expressed by Equation 3). However, the choice $\lambda := n$ has the inconvenience that the bound value is at least $\tfrac12(b-a)^2$, even in the limit $n\to\infty$. With $\lambda := \sqrt{n}$ the bound converges (a result similar to Equation (14) is also formulated by Pentina and Lampert [28]):
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) \;\le\; \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \frac{1}{\sqrt{n}}\Big[\mathrm{KL}(\hat\rho\|\pi) + \ln\tfrac{1}{\delta} + \tfrac12(b-a)^2\Big]\,. \qquad (14)$$

Sub-Gaussian losses. In a regression context, it may be restrictive to consider strictly bounded loss functions. Therefore, we extend Theorem 3 to sub-Gaussian losses. We say that a loss function $\ell$ is sub-Gaussian with variance factor $s^2$, under a prior $\pi$ and a data distribution $D$, if it can be described by a sub-Gaussian random variable $V = \mathcal{L}_D(f) - \ell(f,x,y)$, i.e., its moment generating function is upper bounded by the one of a normal distribution of variance $s^2$ (see Boucheron et al. [7, Section 2.3]):
$$\psi_V(\lambda) = \ln\mathbb{E}\,e^{\lambda V} = \ln\,\mathbb{E}_{f\sim\pi}\,\mathbb{E}_{(x,y)\sim D}\exp\Big[\lambda\big(\mathcal{L}_D(f)-\ell(f,x,y)\big)\Big] \;\le\; \frac{\lambda^2 s^2}{2}\,, \quad \forall\lambda\in\mathbb{R}\,. \qquad (15)$$
The above sub-Gaussian assumption corresponds to the Hoeffding assumption of Alquier et al. [1], and allows us to obtain the following result.

Corollary 4. Given $D$, $\mathcal{F}$, $\pi$, and $\delta$ defined in the statement of Theorem 3, if the loss is sub-Gaussian with variance factor $s^2$, we have, with probability at least $1-\delta$ over the choice of $(X,Y)\sim D^n$,
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) \;\le\; \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \frac{1}{n}\Big[\mathrm{KL}(\hat\rho\|\pi) + \ln\tfrac{1}{\delta}\Big] + \frac{s^2}{2}\,. \qquad (16)$$

Proof. For $i=1\ldots n$, we denote $V_i$ an i.i.d. realization of the random variable $V = \mathcal{L}_D(f)-\ell(f,x,y)$. Then,
$$\Psi_{\ell,\pi,D}(\lambda,n) = \ln\mathbb{E}\exp\Big[\frac{\lambda}{n}\sum_{i=1}^n V_i\Big] = \sum_{i=1}^n \psi_{V_i}\big(\tfrac{\lambda}{n}\big) \;\le\; \frac{\lambda^2 s^2}{2n}\,,$$
where the inequality comes from the sub-Gaussian loss assumption (Equation 15). The result is then obtained from Theorem 3, with $\lambda := n$.

Sub-gamma losses. We say that an unbounded loss function $\ell$ is sub-gamma with variance factor $s^2$ and scale parameter $c$, under a prior $\pi$ and a data distribution $D$, if it can be described by a sub-gamma random variable $V$ (see Boucheron et al. [7, Section 2.4]), that is,
$$\psi_V(\lambda) \;\le\; \frac{s^2}{c^2}\big(-\ln(1-\lambda c) - \lambda c\big) \;\le\; \frac{\lambda^2 s^2}{2(1-c\lambda)}\,, \quad \forall\lambda\in(0,\tfrac{1}{c})\,.$$
Under this sub-gamma assumption, we obtain the following new result, which is necessary to study linear regression in the next sections.

Corollary 5. Given $D$, $\mathcal{F}$, $\pi$, and $\delta$ defined in the statement of Theorem 3, if the loss is sub-gamma with variance factor $s^2$ and scale $c<1$, we have, with probability at least $1-\delta$ over $(X,Y)\sim D^n$,
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f) \;\le\; \mathbb{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}_{X,Y}(f) + \frac{1}{n}\Big[\mathrm{KL}(\hat\rho\|\pi) + \ln\tfrac{1}{\delta}\Big] + \frac{s^2}{2(1-c)}\,. \qquad (17)$$
As a special case, with $\ell := \ell_{\mathrm{nll}}$ and $\hat\rho := \hat\rho^*$ (Equation 8), we have
$$\mathbb{E}_{\theta\sim\hat\rho^*}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) \;\le\; \frac{s^2}{2(1-c)} - \frac{1}{n}\ln\big(Z_{X,Y}\,\delta\big)\,. \qquad (18)$$

Proof.
Following the same path as in the proof of Corollary 4 (with $\lambda := n$), we have
$$\Psi_{\ell,\pi,D}(n,n) = \ln\mathbb{E}\exp\Big[\sum_{i=1}^n V_i\Big] = \ln\prod_{i=1}^n \mathbb{E}\exp[V_i] = \sum_{i=1}^n \psi_{V_i}(1) \;\le\; n\,\frac{s^2}{2(1-c\cdot 1)} = n\,\frac{s^2}{2(1-c)}\,,$$
where the inequality comes from the sub-gamma loss assumption, with $1\in(0,\tfrac1c)$ since $c<1$.

Squared loss. The parameters $s$ and $c$ of Corollary 5 rely on the chosen loss function and prior, and on the assumptions concerning the data distribution. As an example, consider a regression problem where $\mathcal{X}\times\mathcal{Y}\subseteq\mathbb{R}^d\times\mathbb{R}$, a family of linear predictors $f_w(x) = w\cdot x$ with $w\in\mathbb{R}^d$, and a Gaussian prior $\pi = N(0,\sigma_\pi^2 I)$. Let us assume that the input examples are generated by $x\sim N(0,\sigma_x^2 I)$ with label $y = w^*\cdot x + \epsilon$, where $w^*\in\mathbb{R}^d$ and $\epsilon\sim N(0,\sigma_\epsilon^2)$ is a Gaussian noise. Under the squared loss function
$$\ell_{\mathrm{sqr}}(w,x,y) = (w\cdot x - y)^2\,, \qquad (19)$$
we show in Appendix A.4 that Corollary 5 is valid with a variance factor $s^2$ of the order of $\sigma_x^2(\sigma_\pi^2 d + \|w^*\|^2)+\sigma_\epsilon^2$ (the exact values of $s^2$ and of the scale parameter $c$ are given in Appendix A.4). As expected, the bound degrades when the noise increases.

Regression versus classification. The classical PAC-Bayesian theorems are stated in a classification context and bound the generalization error/loss of the stochastic Gibbs predictor $G_{\hat\rho}$. In order to predict the label of an example $x\in\mathcal{X}$, the Gibbs predictor first draws a hypothesis $h\in\mathcal{F}$ according to $\hat\rho$, and then returns $h(x)$. Maurer [23] shows that we can generalize PAC-Bayesian bounds on the generalization risk of the Gibbs classifier to any loss function with output between zero and one. Provided that $y\in\{-1,1\}$ and $h(x)\in[-1,1]$, a common choice is to use the linear loss function $\ell'_{01}(h,x,y) = \tfrac12 - \tfrac12\,y\,h(x)$. The Gibbs generalization loss is then given by $R_D(G_{\hat\rho}) = \mathbb{E}_{(x,y)\sim D}\,\mathbb{E}_{h\sim\hat\rho}\,\ell'_{01}(h,x,y)$. Many PAC-Bayesian works use $R_D(G_{\hat\rho})$ as a surrogate loss to study the zero-one classification loss of the majority vote classifier $R_D(B_{\hat\rho})$:
$$R_D(B_{\hat\rho}) = \Pr_{(x,y)\sim D}\Big(y\,\mathbb{E}_{h\sim\hat\rho}\,h(x) < 0\Big) = \mathbb{E}_{(x,y)\sim D}\,I\Big[y\,\mathbb{E}_{h\sim\hat\rho}\,h(x) < 0\Big]\,, \qquad (20)$$
where $I[\cdot]$ is the indicator function. Given a distribution $\hat\rho$, an upper bound on the Gibbs risk is converted into an upper bound on the majority vote risk by $R_D(B_{\hat\rho}) \le 2R_D(G_{\hat\rho})$ [20]. In some situations, this factor of two may be reached, i.e., $R_D(B_{\hat\rho}) \simeq 2R_D(G_{\hat\rho})$. In other situations, we may have $R_D(B_{\hat\rho}) = 0$ even if $R_D(G_{\hat\rho})$ is close to $\tfrac12$ (see Germain et al. [11] for an extensive study). Indeed, these bounds obtained via the Gibbs risk may be loose and/or unrepresentative of the majority vote generalization error.³

In the current work, we study regression losses instead of classification ones. That is, the provided results express upper bounds on $\mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f)$ for any (bounded, sub-Gaussian, or sub-gamma) losses. Of course, one may want to bound the regression loss of the averaged regressor $F_{\hat\rho}(x) = \mathbb{E}_{f\sim\hat\rho}\,f(x)$. In this case, if the loss function is convex (as is the squared loss), Jensen's inequality gives $\mathcal{L}_D(F_{\hat\rho}) \le \mathbb{E}_{f\sim\hat\rho}\,\mathcal{L}_D(f)$. Note that this inequality replaces the factor of two mentioned above for the classification case, which was due to the non-convex indicator function of Equation (20).

Now that we have generalization bounds for real-valued loss functions, we can continue our study linking PAC-Bayesian results to Bayesian inference. In the next section, we focus on model selection.

³It is noteworthy that the best PAC-Bayesian empirical bound values are so far obtained by considering a majority vote of linear classifiers, where the prior and posterior are Gaussian [2, 10, 20], similarly to the Bayesian linear regression analyzed in Section 6.

5 Analysis of Model Selection

We consider $L$ distinct models $\{M_i\}_{i=1}^{L}$, each one defined by a set of parameters $\Theta_i$.
The PAC-Bayesian theorems naturally suggest selecting the model that is best adapted to the given task by evaluating the bound for each model $\{M_i\}_{i=1}^{L}$ and selecting the one with the lowest bound [2, 25, 36]. This is closely linked to the Bayesian model selection procedure, as we showed in Section 3 that minimizing the PAC-Bayes bound amounts to maximizing the marginal likelihood. Indeed, given a collection of $L$ optimal Gibbs posteriors, one for each model, given by Equation (8),
$$p(\theta|X,Y,M_i) \;\equiv\; \hat\rho^*_i(\theta) = \frac{1}{Z_{X,Y,i}}\,\pi_i(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)}\,, \quad \text{for } \theta\in\Theta_i\,, \qquad (21)$$
the Bayesian Occam's razor criteria [18, 22] chooses the one with the highest model evidence
$$p(Y|X,M_i) \;\equiv\; Z_{X,Y,i} = \int_{\Theta_i} \pi_i(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)}\,d\theta\,. \qquad (22)$$
Corollary 6 below formally links the PAC-Bayesian and the Bayesian model selection. To obtain this result, we simply use the bound of Corollary 5 $L$ times, together with $\ell_{\mathrm{nll}}$ and Equation (10). From the union bound (a.k.a. the Bonferroni inequality), it is mandatory to compute each bound with a confidence parameter of $\delta/L$, to ensure that the final conclusion is valid with probability at least $1-\delta$.

Corollary 6. Given a data distribution $D$, a family of model parameter sets $\{\Theta_i\}_{i=1}^{L}$ and associated priors $\{\pi_i\}_{i=1}^{L}$, where $\pi_i$ is defined over $\Theta_i$, and a $\delta\in(0,1]$, if the loss is sub-gamma with parameters $s^2$ and $c<1$, then, with probability at least $1-\delta$ over $(X,Y)\sim D^n$,
$$\forall i\in\{1,\ldots,L\}:\quad \mathbb{E}_{\theta\sim\hat\rho^*_i}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) \;\le\; \frac{s^2}{2(1-c)} - \frac{1}{n}\ln\Big(Z_{X,Y,i}\,\frac{\delta}{L}\Big)\,,$$
where $\hat\rho^*_i$ is the Gibbs optimal posterior (Eq. 21) and $Z_{X,Y,i}$ is the marginal likelihood (Eq. 22).

Hence, under a uniform prior over the $L$ models, choosing the one with the best model evidence is equivalent to choosing the one with the lowest PAC-Bayesian bound.

Hierarchical Bayes. To perform proper inference on hyperparameters, we have to rely on the hierarchical Bayes approach. This is done by considering a hyperprior $p(\theta')$ over the set of hyperparameters $H$. Then, the prior $p(\theta|\theta')$ can be conditioned on a choice of hyperparameter $\theta'$. The Bayes rule of Equation (5) becomes
$$p(\theta,\theta'|X,Y) = \frac{p(\theta')\,p(\theta|\theta')\,p(Y|X,\theta)}{p(Y|X)}\,.$$
Under the negative log-likelihood loss function, we can rewrite the result of Corollary 5 as a generalization bound on $\mathbb{E}_{\theta'\sim\hat\rho'}\,\mathbb{E}_{\theta\sim\hat\rho_{\theta'}}\,\mathcal{L}^{\mathrm{nll}}_D(\theta)$, where $\hat\rho'(\theta') \propto \pi'(\theta')\,Z_{X,Y,\theta'}$ is the hyperposterior on $H$ and $\pi'$ the hyperprior. Indeed, Equation (18) becomes
$$\mathbb{E}_{(\theta',\theta)\sim\hat\rho}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) = \mathbb{E}_{\theta'\sim\hat\rho'}\,\mathbb{E}_{\theta\sim\hat\rho_{\theta'}}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) \;\le\; \frac{s^2}{2(1-c)} - \frac{1}{n}\ln\Big(\mathbb{E}_{\theta'\sim\pi'}Z_{X,Y,\theta'}\;\delta\Big)\,. \qquad (23)$$
To relate this to the bound obtained in Corollary 6, we consider the case of a discrete hyperparameter set $H = \{\theta'_i\}_{i=1}^{L}$, with a uniform hyperprior $\pi'(\theta'_i) = \tfrac1L$ (from now on, we regard each hyperparameter $\theta'_i$ as the specification of a model $\Theta_i$). Then, Equation (23) becomes
$$\mathbb{E}_{(\theta',\theta)\sim\hat\rho}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) = \mathbb{E}_{\theta'\sim\hat\rho'}\,\mathbb{E}_{\theta\sim\hat\rho_{\theta'}}\,\mathcal{L}^{\mathrm{nll}}_D(\theta) \;\le\; \frac{s^2}{2(1-c)} - \frac{1}{n}\ln\Big(\frac{\delta}{L}\sum_{i=1}^{L} Z_{X,Y,\theta'_i}\Big)\,.$$
This bound is now a function of $\sum_{i=1}^{L} Z_{X,Y,\theta'_i}$ instead of $\max_i Z_{X,Y,\theta'_i}$ as in the bound given by the best model in Corollary 6. This yields a tighter bound, corroborating the Bayesian wisdom that model averaging performs best.

Conversely, when selecting a single hyperparameter $\theta'\in H$, the hierarchical representation is equivalent to choosing a deterministic hyperposterior, satisfying $\hat\rho'(\theta') = 1$ and $0$ for every other value. We then have
$$\mathrm{KL}(\hat\rho\|\pi) = \mathrm{KL}(\hat\rho'\|\pi') + \mathbb{E}_{\theta'\sim\hat\rho'}\,\mathrm{KL}(\hat\rho_{\theta'}\|\pi_{\theta'}) = \ln(L) + \mathrm{KL}(\hat\rho_{\theta'}\|\pi_{\theta'})\,.$$
With the optimal posterior for the selected $\theta'$, we have
$$n\,\mathbb{E}_{\theta\sim\hat\rho_{\theta'}}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta) + \mathrm{KL}(\hat\rho\|\pi) = n\,\mathbb{E}_{\theta\sim\hat\rho_{\theta'}}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta) + \mathrm{KL}(\hat\rho_{\theta'}\|\pi_{\theta'}) + \ln(L) = -\ln(Z_{X,Y,\theta'}) + \ln(L) = -\ln\Big(\frac{Z_{X,Y,\theta'}}{L}\Big)\,.$$
Inserting this result into Equation (17), we fall back on the bound obtained in Corollary 6. Hence, by comparing the values of the bounds, one can get an estimate of the consequence of performing model selection instead of model averaging.
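To make the comparison between model selection and model averaging concrete, the following sketch (our own illustration; the log evidence values are made-up) evaluates the two complexity terms appearing above, working in log space for numerical stability. The common offset $s^2/(2(1-c))$ is omitted since it is identical in both bounds.

```python
import numpy as np

def selection_bound_term(log_Z, n, delta):
    """Complexity term of Corollary 6 for the model with the best evidence:
    -(1/n) ln( max_i Z_i * delta / L )."""
    L = len(log_Z)
    return -(np.max(log_Z) + np.log(delta / L)) / n

def averaging_bound_term(log_Z, n, delta):
    """Complexity term of the hierarchical (model averaging) bound with a
    uniform hyperprior:  -(1/n) ln( (delta/L) * sum_i Z_i )."""
    L = len(log_Z)
    return -(np.logaddexp.reduce(log_Z) + np.log(delta / L)) / n

# Toy log marginal likelihoods ln Z_{X,Y,i} for L = 5 models, n = 200 examples.
log_Z = np.array([-310.0, -295.0, -298.0, -305.0, -340.0])
n, delta = 200, 0.05
print(selection_bound_term(log_Z, n, delta))   # bound term after selecting the best model
print(averaging_bound_term(log_Z, n, delta))   # always <= selection term, since sum_i Z_i >= max_i Z_i
```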
6 Linear Regression

In this section, we perform Bayesian linear regression using the parameterization of Bishop [5]. The output space is $\mathcal{Y} := \mathbb{R}$ and, for an arbitrary input space $\mathcal{X}$, we use a mapping function $\phi:\mathcal{X}\to\mathbb{R}^d$.

The model. Given $(x,y)\in\mathcal{X}\times\mathcal{Y}$ and model parameters $\theta := \langle w,\sigma\rangle \in \mathbb{R}^d\times\mathbb{R}^+$, we consider the likelihood $p(y|x,\langle w,\sigma\rangle) = N(y\,|\,w\cdot\phi(x),\,\sigma^2)$. Thus, the negative log-likelihood loss is
$$\ell_{\mathrm{nll}}(\langle w,\sigma\rangle, x, y) = -\ln p(y|x,\langle w,\sigma\rangle) = \tfrac12\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\big(y - w\cdot\phi(x)\big)^2\,. \qquad (24)$$
For a fixed $\sigma^2$, minimizing Equation (24) is equivalent to minimizing the squared loss function of Equation (19). We also consider an isotropic Gaussian prior of mean $0$ and variance $\sigma_\pi^2$: $p(w|\sigma_\pi) = N(w\,|\,0,\sigma_\pi^2 I)$. For the sake of simplicity, we consider fixed parameters $\sigma^2$ and $\sigma_\pi^2$. The Gibbs optimal posterior (see Equation 8) is then given by
$$\hat\rho^*(w) \;\equiv\; p(w|X,Y,\sigma,\sigma_\pi) = \frac{p(w|\sigma_\pi)\,p(Y|X,w,\sigma)}{p(Y|X,\sigma,\sigma_\pi)} = N(w\,|\,\widehat{w},\,A^{-1})\,, \qquad (25)$$
where $A := \frac{1}{\sigma^2}\Phi^T\Phi + \frac{1}{\sigma_\pi^2}I$; $\widehat{w} := \frac{1}{\sigma^2}A^{-1}\Phi^T y$; $\Phi$ is the $n\times d$ matrix whose $i$th row is $\phi(x_i)$; $y := [y_1,\ldots,y_n]$ is the label vector; and the negative log marginal likelihood is
$$-\ln p(Y|X,\sigma,\sigma_\pi) = \frac{1}{2\sigma^2}\|y-\Phi\widehat{w}\|^2 + \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma_\pi^2}\|\widehat{w}\|^2 + \tfrac12\ln|A| + d\ln\sigma_\pi$$
$$= \underbrace{n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\widehat{w}) + \frac{1}{2\sigma^2}\mathrm{tr}(\Phi^T\Phi A^{-1})}_{n\,\mathbb{E}_{w\sim\hat\rho^*}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(w)} \;+\; \underbrace{\frac{1}{2\sigma_\pi^2}\mathrm{tr}(A^{-1}) - \frac{d}{2} + \frac{1}{2\sigma_\pi^2}\|\widehat{w}\|^2 + \tfrac12\ln|A| + d\ln\sigma_\pi}_{\mathrm{KL}\big(N(\widehat{w},A^{-1})\,\|\,N(0,\sigma_\pi^2 I)\big)}\,.$$
To obtain the second equality, we substitute $\frac{1}{2\sigma^2}\|y-\Phi\widehat{w}\|^2 + \frac{n}{2}\ln(2\pi\sigma^2) = n\,\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\widehat{w})$, and we insert $\frac{1}{2\sigma^2}\mathrm{tr}(\Phi^T\Phi A^{-1}) + \frac{1}{2\sigma_\pi^2}\mathrm{tr}(A^{-1}) = \tfrac12\mathrm{tr}\big[\big(\frac{1}{\sigma^2}\Phi^T\Phi + \frac{1}{\sigma_\pi^2}I\big)A^{-1}\big] = \tfrac12\mathrm{tr}(A^{-1}A) = \frac{d}{2}$, compensated by the $-\frac{d}{2}$ term. This exhibits how the Bayesian regression optimization problem is related to the minimization of a PAC-Bayesian bound, expressed as a trade-off between $\mathbb{E}_{w\sim\hat\rho^*}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(w)$ and $\mathrm{KL}\big(N(\widehat{w},A^{-1})\,\|\,N(0,\sigma_\pi^2 I)\big)$. See Appendix A.5 for detailed calculations.

Model selection experiment. To produce Figures 1a and 1b, we reimplemented the toy experiment of Bishop [5, Section 3.5.1]. That is, we generated a learning sample of 15 data points according to $y = \sin(x) + \epsilon$, where $x$ is uniformly sampled in the interval $[0,2\pi]$ and $\epsilon\sim N(0,\tfrac14)$ is a Gaussian noise. We then learn seven different polynomial models by applying Equation (25). More precisely, for a polynomial model of degree $d$, we map an input $x\in\mathbb{R}$ to a vector $\phi(x) = [1, x^1, x^2, \ldots, x^d] \in \mathbb{R}^{d+1}$, and we fix the parameters $\sigma_\pi^2 = \tfrac{1}{0.005}$ and $\sigma^2 = \tfrac12$. Figure 1a illustrates the seven learned models. Figure 1b shows the negative log marginal likelihood computed for each polynomial model, and is designed to reproduce Bishop [5, Figure 3.14], where it is explained that the marginal likelihood correctly indicates that the polynomial model of degree $d=3$ "is the simplest model which gives a good explanation for the observed data". We show that this claim is well quantified by the trade-off intrinsic to our PAC-Bayesian approach: the complexity KL term keeps increasing with the parameter $d\in\{1,2,\ldots,7\}$, while the empirical risk drastically decreases from $d=2$ to $d=3$, and only slightly afterward. Moreover, we show that the generalization risk (computed on a test sample of size 1000) tends to increase with complex models (for $d\ge 4$).

Empirical comparison of bound values. Figure 1c compares the values of the PAC-Bayesian bounds presented in this paper on a synthetic dataset, where each input $x\in\mathbb{R}^{20}$ is generated by a Gaussian $x\sim N(0,I)$. The associated output $y\in\mathbb{R}$ is given by $y = w^*\cdot x + \epsilon$, with $\|w^*\|^2 = \tfrac19$. We perform Bayesian linear regression in the input space, i.e., $\phi(x) = x$, fixing $\sigma_\pi^2 = \tfrac{1}{100}$ and $\sigma^2 = 2$. That is, we compute the posterior of Equation (25) for training samples of sizes from 10 to $10^6$; a sketch of this computation is given below.
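The following numpy sketch (our own code, not from the paper) implements the closed-form posterior of Equation (25) and the decomposition of $-\ln Z_{X,Y}$ into the expected empirical negative log-likelihood plus the KL term. The toy data mimic the synthetic setup just described; the noise level $\sigma_\epsilon$ used below is our assumption, as its value is not specified here.

```python
import numpy as np

def bayes_linear_regression(Phi, y, sigma2, sigma2_pi):
    """Gibbs/Bayes posterior N(w_hat, A^-1) of Eq. (25) and the negative log marginal
    likelihood, computed as  n*E_rho[L_nll] + KL( N(w_hat, A^-1) || N(0, sigma2_pi I) )."""
    n, d = Phi.shape
    A = Phi.T @ Phi / sigma2 + np.eye(d) / sigma2_pi          # posterior precision matrix
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ Phi.T @ y / sigma2                        # posterior mean
    # n * E_{w ~ rho*}[ L_nll_{X,Y}(w) ]
    exp_nll = (np.sum((y - Phi @ w_hat) ** 2) / (2 * sigma2)
               + n / 2 * np.log(2 * np.pi * sigma2)
               + np.trace(Phi.T @ Phi @ A_inv) / (2 * sigma2))
    # KL( N(w_hat, A^-1) || N(0, sigma2_pi I) )
    _, logdet_A = np.linalg.slogdet(A)
    kl = (np.trace(A_inv) / (2 * sigma2_pi) + w_hat @ w_hat / (2 * sigma2_pi)
          - d / 2 + logdet_A / 2 + d / 2 * np.log(sigma2_pi))
    return w_hat, A_inv, exp_nll + kl                         # last value = -ln Z_{X,Y}

# Toy run mimicking the synthetic setup above (noise std 0.5 is our assumption).
rng = np.random.default_rng(0)
n, d = 1000, 20
w_star = np.full(d, (1 / (9 * d)) ** 0.5)                     # ||w*||^2 = 1/9
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(scale=0.5, size=n)
w_hat, cov, neg_log_Z = bayes_linear_regression(X, y, sigma2=2.0, sigma2_pi=1 / 100)
print(neg_log_Z / n)   # -(1/n) ln Z_{X,Y}, the quantity entering Eq. (18) and Corollary 5
```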
For each learned model, we compute the empirical negative log-likelihood loss of Equation (24) and the three PAC-Bayes bounds, with a confidence parameter of $\delta = \tfrac{1}{20}$. Note that this loss function is an affine transformation of the squared loss studied in Section 4 (Equation 19), i.e., $\ell_{\mathrm{nll}}(\langle w,\sigma\rangle,x,y) = \tfrac12\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\ell_{\mathrm{sqr}}(w,x,y)$. It turns out that $\ell_{\mathrm{nll}}$ is sub-gamma with parameters $s^2 = \frac{1}{\sigma^2}\big[\sigma_x^2(\sigma_\pi^2 d + \|w^*\|^2) + \sigma_\epsilon^2\big]$ and $c = \frac{\sigma_x^2\sigma_\pi^2}{\sigma^2}$, as shown in Appendix A.6. The bounds of Corollary 5 are computed using the above-mentioned values of $\|w^*\|$, $d$, $\sigma$, $\sigma_x$, $\sigma_\pi$, $\sigma_\epsilon$, leading to $s^2 \simeq 0.280$ and $c \simeq 0.005$.

[Figure 1: Model selection experiment (a-b); and comparison of bound values (c). (a) Predicted models for degrees $d=1,\ldots,7$, together with $\sin(x)$; black dots are the 15 training samples. (b) Decomposition of the marginal likelihood into the empirical loss and KL-divergence, as a function of the model degree $d$: $-\ln Z_{X,Y}$, $\mathrm{KL}(\hat\rho\|\pi)$, $n\,\mathbb{E}_{\hat\rho}\widehat{\mathcal{L}}^{\mathrm{nll}}$ and $n\,\mathbb{E}_{\hat\rho}\mathcal{L}^{\mathrm{nll}}$. (c) Bound values on a synthetic dataset according to the number of training samples $n$: Alquier et al.'s $[a,b]$ bound (Theorem 3 + Eq. 14), Catoni's $[a,b]$ bound (Corollary 2), the sub-gamma bound (Corollary 5), the test loss $\mathbb{E}_{\hat\rho}\mathcal{L}^{\mathrm{nll}}_D(\theta)$ and the train loss $\mathbb{E}_{\hat\rho}\widehat{\mathcal{L}}^{\mathrm{nll}}_{X,Y}(\theta)$.]

As the two other bounds of Figure 1c are not suited for unbounded losses, we compute their values using the loss cropped to $[a,b] = [1,4]$. Different parameter values could have been chosen, sometimes leading to another picture: a large value of $s$ degrades our sub-gamma bound, as a larger $[a,b]$ interval does for the other bounds. In the studied setting, the bound of Corollary 5 that we have developed for (unbounded) sub-gamma losses gives tighter guarantees than the two results for $[a,b]$-bounded losses (up to $n = 10^6$). However, our new bound always maintains a gap of $\frac{s^2}{2(1-c)}$ between its value and the generalization loss. The result of Corollary 2 (adapted from Catoni [8]) for bounded losses suffers from a similar gap, while having higher values than our sub-gamma result. Finally, the result of Theorem 3 (Alquier et al. [1]), combined with $\lambda = \sqrt{n}$ (Eq. 14), converges to the expected loss, but it provides good guarantees only for large training samples ($n \gtrsim 10^5$). Note that the latter bound is not directly minimized by our optimal posterior $\hat\rho^*$, as opposed to the one with $\lambda = n$ (Eq. 13), for which we observe values between 5.8 (for $n=10^6$) and 6.4 (for $n=10$), not displayed on Figure 1c.

7 Conclusion

The first contribution of this paper is to bridge the concepts underlying the Bayesian and the PAC-Bayesian approaches: under proper parameterization, the minimization of the PAC-Bayesian bound maximizes the marginal likelihood. This study motivates the second contribution of this paper, which is to prove PAC-Bayesian generalization bounds for regression with unbounded sub-gamma loss functions, including the squared loss used in regression tasks. In this work, we studied model selection techniques. On a broader perspective, we would like to suggest that both the Bayesian and PAC-Bayesian frameworks may have more to learn from each other than what has been done lately (even if other works paved the way [e.g., 6, 14, 30]). Predictors learned from the Bayes rule can benefit from strong PAC-Bayesian frequentist guarantees (under the i.i.d. assumption). Also, the rich Bayesian toolbox may be incorporated into PAC-Bayesian-driven algorithms and risk bounding techniques.

Acknowledgments

We thank Gabriel Dubé and Maxime Tremblay for having proofread the paper and the supplementary material.
References

[1] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. JMLR, 17(239):1-41, 2016.
[2] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In NIPS, 2006.
[3] Arindam Banerjee. On Bayesian bounds. In ICML, pages 81-88, 2006.
[4] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian theory for transductive learning. In AISTATS, pages 105-113, 2014.
[5] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[6] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5.
[8] Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56. Institute of Mathematical Statistics, 2007.
[9] Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39-61, 2008.
[10] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In ICML, pages 353-360, 2009.
[11] Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. JMLR, 16, 2015.
[12] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new PAC-Bayesian perspective on domain adaptation. In ICML, pages 859-868, 2016.
[13] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452-459, 2015.
[14] Peter Grünwald. The safe Bayesian - learning the learning rate via the mixability gap. In ALT, 2012.
[15] Peter D. Grünwald and Nishant A. Mehta. Fast rates with unbounded losses. CoRR, abs/1605.00252, 2016.
[16] Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin C. Cawley. Model selection: Beyond the Bayesian/frequentist divide. JMLR, 11:61-87, 2010.
[17] Tamir Hazan, Subhransu Maji, Joseph Keshet, and Tommi S. Jaakkola. Learning efficient random maximum a-posteriori predictors with non-decomposable loss functions. In NIPS, pages 1887-1895, 2013.
[18] William H. Jefferys and James O. Berger. Ockham's razor and Bayesian analysis. American Scientist, 1992.
[19] Alexandre Lacoste. Agnostic Bayes. PhD thesis, Université Laval, 2015.
[20] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In NIPS, pages 423-430, 2002.
[21] Guy Lever, François Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theor. Comput. Sci., 473:4-28, 2013.
[22] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[23] Andreas Maurer. A note on the PAC-Bayesian theorem. CoRR, cs.LG/0411099, 2004.
[24] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355-363, 1999.
[25] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.
[26] David McAllester and Joseph Keshet. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pages 2205-2212, 2011.
[27] Asaf Noy and Koby Crammer. Robust forward algorithms via PAC-Bayes and Laplace distributions. In AISTATS, 2014.
[28] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In ICML, 2014.
[29] Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. JMLR, 3:233-269, 2002.
[30] Matthias Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
[31] Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. JMLR, 11, 2010.
[32] Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual bandits. In NIPS, pages 1683-1691, 2011.
[33] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. In UAI, 2012.
[34] John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In COLT, 1997.
[35] Ilya O. Tolstikhin and Yevgeny Seldin. PAC-Bayes-empirical-Bernstein inequality. In NIPS, 2013.
[36] Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Information Theory, 52(4):1307-1321, 2006.