# Improved PAC-Bayesian Bounds for Linear Regression

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Vera Shalaeva,1 Alireza Fakhrizadeh Esfahani,2 Pascal Germain,1 Mihaly Petreczky2
1 The MODAL project-team, INRIA Lille Nord-Europe, Villeneuve d'Ascq, France
2 Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, France
{vera.shalaeva, pascal.germain}@inria.fr, alireza.fakhrizadeh@univ-lille.fr, mihaly.petreczky@centralelille.fr

Abstract

In this paper, we improve the PAC-Bayesian error bound for linear regression derived in Germain et al. (2016). The improvements are two-fold. First, the proposed error bound is tighter, and converges to the generalization loss with a well-chosen temperature parameter. Second, the error bound also holds for training data that are not independently sampled. In particular, the error bound applies to certain time series generated by well-known classes of dynamical models, such as ARX models.

1 Introduction

When facing a machine learning problem, one must be careful to avoid overfitting the training dataset. Indeed, it is well known that minimizing the empirical prediction error is not sufficient to generalize to future observations. This is especially important for sensitive AI applications that are nowadays tackled by many industries (self-driving vehicles, health diagnosis, personality profiling, to name a few). Statistical learning theories study the generalization properties of learning algorithms. For prediction problems, they provide guarantees on the true error of machine learning predictors (i.e., the probability of erroneously predicting the labels of not-yet-seen samples). The PAC-Bayesian learning theory, initiated by David McAllester (1999; 2003) (see Guedj (2019) for a recent survey), has the particularity of providing computable non-vacuous generalization bounds on popular machine learning algorithms, such as neural networks (Dziugaite and Roy 2017) and SVMs (Ambroladze, Parrado-Hernández, and Shawe-Taylor 2006). Moreover, as its name suggests, the PAC-Bayesian framework bridges the frequentist Probably Approximately Correct theory and Bayesian inference. This topic is notably discussed in Zhang (2006), Grünwald (2012), Alquier, Ridgway, and Chopin (2016), Germain et al. (2016), and Sheth and Khardon (2017).

In this paper, we build on a result of Germain et al. (2016), which analyses Bayesian linear regression from a PAC-Bayesian perspective, leading to generalization bounds for the squared loss. We improve the preceding results in two directions. First, our new generalization bound is tighter than the one of Germain et al. (2016), and converges to the generalization loss for proper parameters (see Section 3). Second, our result holds for training data that are not independently sampled (see Section 4). The latter result is directly applicable to the problem of learning dynamical systems from time series data, in particular, to learning ARX models. ARX models are a popular class of dynamical systems with a rich literature (Ljung 1999; Hannan and Deistler 1988) due to their relative simplicity and modelling power. Note that ARX models can be viewed as a simple yet non-trivial subclass of recurrent neural network regressions.
For example, just like general recurrent neural networks, ARX models have a memory, i.e., they are able to remember past input data. Noteworthy, Alquier and Wintenberger (2012) proposed PAC-Bayesian oracle inequalities to perform model selection on different time series (weakly dependent processes and causal Bernoulli shifts). Thus, their work is complementary to ours, as it relies on different assumptions and focuses on other types of error bounds.

2 PAC-Bayesian Learning

Let us consider a supervised learning setting, where a learning algorithm is given a training set $S = \{(x_i, y_i)\}_{i=1}^n$ of size $n$. Each pair $(x_i, y_i)$ links a description $x_i \in \mathcal{X}$ to a label $y_i \in \mathcal{Y}$. Typically, the description is encoded by a real-valued vector ($\mathcal{X} \subseteq \mathbb{R}^d$), and the label is a scalar ($\mathcal{Y} \subseteq \mathbb{N}$ for classification problems, or $\mathcal{Y} \subseteq \mathbb{R}$ for regression ones). Given $S$, the learning algorithm returns a prediction function $f : \mathcal{X} \to \mathcal{Y}$, also referred to as a hypothesis. We restrict our attention to prediction functions/hypotheses that are measurable. The quality of the predictor $f$ is usually assessed through a measurable loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, such as the zero-one loss $\ell(y, y') = \mathbb{1}_{y \neq y'}$ in the classification context, or the squared loss $\ell(y, y') = (y - y')^2$ in the regression context, by evaluating the empirical loss

$$\hat{L}_\ell(f)(S) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i), \quad \text{for any } S.$$

PAC Learning. When facing a machine learning problem, one wants to use $f$ to predict the label $y \in \mathcal{Y}$ from a description $x \in \mathcal{X}$ that does not belong to the training set $S$. A good predictor "generalizes" to unseen data. This is the object of study of the Probably Approximately Correct (PAC) approach (Valiant 1984). In order to study the statistical behavior of the average loss, we introduce the following statistical framework. We fix a probability space $(\Omega, \mathbf{P}, \mathcal{F})$, where $\mathcal{F}$ is a σ-algebra over $\Omega$ and $\mathbf{P}$ is a probability measure on $\mathcal{F}$; see for example Billingsley (1986) for the terminology. We assume that there exist random variables $X_i : \Omega \to \mathcal{X}$, $Y_i : \Omega \to \mathcal{Y}$, $i = 1, 2, \ldots$, such that the description-label pairs $\{(x_i, y_i)\}_{i=1}^n$ are samples from the first $n$ variables $\{(X_i, Y_i)\}_{i=1}^n$, i.e., there exists $\omega \in \Omega$ such that $x_i = X_i(\omega)$, $y_i = Y_i(\omega)$. Moreover, we assume that the $X_i, Y_i$ are identically distributed, i.e., $\mathbf{E}\, g(X_i, Y_i)$ does not depend on $i$ for any measurable function $g$.

Notation 1 (E). We will use $\mathbf{E}$ to denote the expected value with respect to the measure $\mathbf{P}$. That is, in the sequel, boldface symbols $\mathbf{P}$ and $\mathbf{E}$ will denote the probability and the corresponding mathematical expectation for the data generating distribution, and we will use boldface to denote random variables on the probability space $(\Omega, \mathbf{P}, \mathcal{F})$ and simple font for their samples. As we will see later on, we will also use a probability measure and the corresponding mathematical expectation defined on the space of predictors, which will be denoted differently.

The generalization loss of a predictor $f$ is then defined as

$$L_\ell(f) = \mathbf{E}\, \ell(f(X_i), Y_i),$$

and it expresses the average error for "unseen data". It is then of interest to compare this error with the average empirical error, where the average is taken over all possible samples. To this end, we define the random variable

$$\hat{L}_\ell(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i),$$

i.e., for any sample $S = \{(x_i, y_i) = (X_i(\omega), Y_i(\omega))\}_{i=1}^n$ of $\{(X_i, Y_i)\}_{i=1}^n$, $\hat{L}_\ell(f)(S) = \hat{L}_\ell(f)(\omega)$ is a sample of the random variable $\hat{L}_\ell(f)$. By a slight abuse of terminology, we will refer to $\hat{L}_\ell(f)$ as the empirical loss too.
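To make the distinction concrete, here is a minimal numerical sketch (not from the paper; the Gaussian data distribution, the weight vectors, and the sample sizes are illustrative assumptions) contrasting the empirical loss $\hat{L}_\ell(f)(S)$ computed on a finite training set with a Monte Carlo estimate of the generalization loss $L_\ell(f)$ for a fixed linear predictor under the squared loss.

```python
# Minimal sketch: empirical vs. generalization squared loss for a fixed
# linear predictor f_w(x) = <w, x>, under a hypothetical Gaussian distribution.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma_x, sigma_e = 3, 50, 1.0, 0.5
w_star = np.array([1.0, -2.0, 0.5])  # assumed data-generating weights
w = np.array([0.8, -1.5, 0.0])       # the predictor under evaluation

def sample(m):
    """Draw m i.i.d. description-label pairs."""
    X = rng.normal(0.0, sigma_x, size=(m, d))
    y = X @ w_star + rng.normal(0.0, sigma_e, size=m)
    return X, y

X, y = sample(n)                          # the training set S
emp_loss = np.mean((y - X @ w) ** 2)      # empirical loss on S

X_new, y_new = sample(1_000_000)          # large fresh sample, proxy for E
gen_loss = np.mean((y_new - X_new @ w) ** 2)
print(f"empirical: {emp_loss:.3f}   generalization (approx.): {gen_loss:.3f}")
```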
PAC theories provide upper bounds of the form

$$\mathbf{P}\left( L_\ell(f) \leq \hat{L}_\ell(f) + \varepsilon \right) \geq 1 - \delta,$$

where $\delta \in (0, 1]$ acts as a confidence parameter; the whole challenge of the PAC theories is to derive the mathematical expression of $\varepsilon$. Among the various approaches proposed to achieve this goal (reviewed in Shalev-Shwartz and Ben-David (2014)), we can mention VC-dimension, sample compression, Rademacher complexity, algorithmic stability, and the PAC-Bayesian theory. In the current work, we stand in the PAC-Bayesian learning framework.

PAC-Bayes. The PAC-Bayesian learning framework (McAllester 1999; 2003) has the particularity of reconciling the PAC learning standpoint with the Bayesian paradigm. To be more precise, let us define a σ-algebra on the set of predictors $F$.[1]

Notation 2 ($\mathbb{E}_{f \sim \rho}$). If $\rho$ is a probability distribution function on $F$, in the sequel we denote by $\mathbb{E}_{f \sim \rho}$ the mathematical expectation with respect to the probability measure which corresponds to $\rho$.

In the PAC-Bayesian paradigm, we consider a prior probability distribution $\pi$ and a posterior probability distribution $\hat\rho$ over this σ-algebra. The prior must be chosen independently of the training set $S$, and the learning algorithm's role is to output the posterior distribution, instead of a single predictor. The PAC-Bayesian bounds take the form[2]

$$\mathbf{P}\left( \mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \varepsilon \right) \geq 1 - \delta.$$

That is, in the PAC-Bayesian setting, the study focuses on the $\hat\rho$-averaged loss.[3] Typically, the term $\varepsilon$ takes into account the prior via the Kullback-Leibler divergence:

$$KL(\hat\rho \,\|\, \pi) = \mathbb{E}_{f \sim \hat\rho} \ln \frac{\hat\rho(f)}{\pi(f)}.$$

Note that the KL divergence is defined only if $\hat\rho$ is absolutely continuous with respect to $\pi$. In this paper, we build on the PAC-Bayesian theorem of Alquier, Ridgway, and Chopin (2016), which is also the starting point of the Germain et al. (2016) result improved in upcoming sections.

Theorem 3 (Alquier, Ridgway, and Chopin (2016)). Given a set $F$ of measurable hypotheses $\mathcal{X} \to \mathcal{Y}$, a measurable loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, a prior distribution $\pi$ over $F$, a $\delta \in (0, 1]$, and a real number $\lambda > 0$,

$$\mathbf{P}\left( \forall \hat\rho \text{ over } F : \ \mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \Psi_{\ell,\pi}(\lambda, n) \right] \right) \geq 1 - \delta, \quad (1)$$

where

$$\Psi_{\ell,\pi}(\lambda, n) = \ln \mathbb{E}_{f \sim \pi}\, \mathbf{E}\, e^{\lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)}. \quad (2)$$

For completeness, we provide a proof of Theorem 3 in Appendix A. This proof highlights that the result is obtained without assuming that the random variables $X_i, Y_i$ are mutually independent, unlike many classical PAC-Bayesian theorems. However, the i.i.d. assumption might be necessary to obtain a computable expression from Theorem 3, because it requires bounding the term $\Psi_{\ell,\pi}(\lambda, n)$ of Eq. (2). Indeed, since $\Psi_{\ell,\pi}(\lambda, n)$ relies on the unknown joint distribution of $X_i, Y_i$ for $i = 1, 2, \ldots$, approximating it requires assumptions on the data.

[1] Note that this σ-algebra is completely different from the σ-algebra $\mathcal{F}$ of the probability space on which the data generating random variables $X_i, Y_i$ are defined. This is not surprising, as the randomness of the data represents an assumption on the nature of the process which generates the data, while the σ-algebra on predictors will be used to define probability distributions, which express our subjective preferences for certain predictors, and which will be adjusted based on the observed data.
[2] Contrary to the example we give here, the relation between the expected empirical loss and the term $\varepsilon$ might be non-linear. This is the case of the famous PAC-Bayes theorem of Seeger (2002).
[3] The PAC-Bayesian literature also studies the stochastic Gibbs predictor, which performs each prediction on $x \in \mathcal{X}$ by drawing $f$ according to $\hat\rho$ and outputting $f(x)$ (e.g., Germain et al. (2015)).
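The bound of Theorem 3 is driven by the term $KL(\hat\rho \| \pi)$. In the Gaussian-over-weights setting used in the following sections, this term has a closed form. The sketch below (a standard Gaussian KL identity; the dimensions and parameters are our own illustrative assumptions, not taken from the paper) shows how it can be evaluated.

```python
# Closed-form KL(rho_hat || pi) between two Gaussian distributions over
# predictor weights, the complexity term appearing in Theorem 3.
import numpy as np

def kl_gaussians(mu1, cov1, mu0, cov0):
    """KL( N(mu1, cov1) || N(mu0, cov0) ) for d-dimensional Gaussians."""
    d = mu1.shape[0]
    cov0_inv = np.linalg.inv(cov0)
    diff = mu0 - mu1
    return 0.5 * (np.trace(cov0_inv @ cov1)
                  + diff @ cov0_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov0) / np.linalg.det(cov1)))

d = 3
pi_mu, pi_cov = np.zeros(d), 2.0 * np.eye(d)    # prior N(0, sigma_pi^2 I)
rho_mu, rho_cov = np.ones(d), 0.5 * np.eye(d)   # hypothetical posterior
print("KL(rho_hat || pi) =", kl_gaussians(rho_mu, rho_cov, pi_mu, pi_cov))
```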
Interestingly, given a training set $S$, obtaining the optimal posterior $\hat\rho$ minimizing the bound of Theorem 3 does not require evaluating $\Psi_{\ell,\pi}(\lambda, n)$, as this latter term is independent of both $S$ and $\hat\rho$. Indeed, for fixed $S$, $\pi$ and $\lambda$, minimizing the right-hand side of Eq. (1) amounts to solving

$$\hat\rho^* = \operatorname*{argmin}_{\hat\rho} \ \lambda\, \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f)(S) + KL(\hat\rho \| \pi),$$

whose solution is given by the Gibbs posterior (Catoni 2007; Alquier, Ridgway, and Chopin 2016; Guedj 2019): for all $f \in F$,

$$\hat\rho^*(f) = \frac{1}{Z}\, \pi(f)\, \exp\left( -\lambda\, \hat{L}_\ell(f)(S) \right), \quad (3)$$

where $Z$ is a normalization constant. We refer to $\lambda$ as a temperature parameter, as it controls the emphasis on the empirical loss minimization. The value of $\lambda$ also directly impacts the value of the generalization bound, and the convergence properties of $\Psi_{\ell,\pi}(\lambda, n)$. In particular, if a non-negative loss is upper bounded by a value $L$ (i.e., $\ell(y, y') \in [0, L]$ for all $y, y' \in \mathcal{Y}$), and the $X_i, Y_i$ are i.i.d., we have, for any $f \in F$ (we provide the mathematical details in Appendix A):

$$\mathbf{E} \exp\left( \lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right) \right) \leq \exp\left( \frac{\lambda^2 L^2}{8n} \right). \quad (4)$$

Hence, we have $\Psi_{\ell,\pi}(\lambda, n) \leq \ln \mathbb{E}_{f \sim \pi}\, e^{\lambda^2 L^2 / (8n)} = \frac{\lambda^2 L^2}{8n}$, from which the following result is obtained.

Corollary 4. Given $F$, $\pi$, a measurable and bounded loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, L]$, under i.i.d. observations, for $\delta \in (0, 1]$ and $\lambda > 0$, for any $\hat\rho$ over $F$, with probability at least $1 - \delta$:

$$\mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} \right] + \frac{\lambda L^2}{8n}.$$

Therefore, from Corollary 4, we obtain with $\lambda = \sqrt{n}$,

$$\mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \frac{1}{\sqrt{n}} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \frac{L^2}{8} \right]. \quad (5)$$

In turn, with $\lambda = n$,

$$\mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \frac{1}{n} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} \right] + \frac{L^2}{8}, \quad (6)$$

with probability at least $1 - \delta$. The generalization bound given by Eq. (5) has the nice property that its value converges to the generalization loss (i.e., the $\frac{1}{\sqrt{n}}$ term tends to 0 as $n$ grows to infinity). However, the result of Eq. (6) does not converge: the bound suffers from an additive term $L^2/8$ even with large $n$.

Relation with Bayesian inference. Despite its lack of convergence, the PAC-Bayesian result of Eq. (6) is interesting for being closely linked to Bayesian inference. As discussed in Germain et al. (2016) (based on earlier results of Zhang (2006) and Grünwald (2012)), maximizing the Bayesian marginal likelihood amounts to minimizing the PAC-Bayes bound of Theorem 3 with $\lambda = n$, provided the Bayesian model parameters (typically denoted $\theta$ in the literature) are carefully reinterpreted as predictors (each $\theta$ is mapped to a regressor $f_\theta$), and the considered loss function $\ell$ is the negative log-likelihood (roughly[4], $\ell_{\mathrm{nll}}(y, f_\theta(x)) = -\ln p(y|x, \theta)$, where $p(y|x, \theta)$ is a Bayesian likelihood). That is, in these particular conditions, the posterior promoted by the celebrated Bayes rule (i.e., $p(\theta|X, Y) = \frac{p(\theta)\, p(Y|X, \theta)}{p(Y|X)}$, where $p(\theta)$ is the prior) aligns with the Gibbs posterior of Eq. (3). Based on this observation, Germain et al. (2016) extend Theorem 3 to Bayesian linear regression, for which the loss is unbounded, as discussed in the next section.
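To make this correspondence tangible: for the squared loss with a Gaussian prior over weights, the Gibbs posterior of Eq. (3) is itself Gaussian and available in closed form. The following sketch (our own illustration under assumed hyperparameters, not code from the paper) computes it; choosing $\lambda = n/(2\sigma_e^2)$, which matches the $\lambda = n$ regime for the negative log-likelihood of a Gaussian model with noise variance $\sigma_e^2$, recovers the standard Bayesian linear regression posterior.

```python
# Sketch: closed-form Gibbs posterior of Eq. (3) for the squared loss with a
# Gaussian prior N(0, sigma_pi^2 I). Completing the square in the exponent
# shows rho_hat(w) is Gaussian with the precision and mean computed below.
import numpy as np

def gibbs_posterior(X, y, lam, sigma_pi):
    """Mean and covariance of rho_hat(w) ∝ pi(w) exp(-lam * empirical loss)."""
    n, d = X.shape
    precision = (2.0 * lam / n) * X.T @ X + np.eye(d) / sigma_pi**2
    cov = np.linalg.inv(precision)
    mean = cov @ ((2.0 * lam / n) * X.T @ y)
    return mean, cov

rng = np.random.default_rng(1)
n, d, sigma_e, sigma_pi = 200, 3, 0.5, 1.0
w_star = rng.normal(size=d)                 # hypothetical true weights
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(0.0, sigma_e, size=n)

# lam = n / (2 sigma_e^2) recovers the Bayesian posterior of this model.
mean, cov = gibbs_posterior(X, y, lam=n / (2 * sigma_e**2), sigma_pi=sigma_pi)
print("posterior mean:", mean)
print("true weights:  ", w_star)
```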
3 Bounds for Bayesian Linear Regression

In the Bayesian literature (Bishop 2006; Murphy 2012, ...), it is common to model a linear regression problem by assuming that $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$. The input-output pairs $X_i, Y_i$ satisfy the following assumptions.

Assumption 5. (a) The inputs $X_i$ are such that $X_i \sim N(0, \sigma_x^2 I)$, and $X_i, X_j$ are independent for $i \neq j$. (b) The labels are given by $Y_i = w^* \cdot X_i + e_i$, where $e_i \sim N(0, \sigma_e^2)$ and $e_i, e_j$ are independent for $i \neq j$.

Here, we consider that $\sigma_e > 0$ is fixed, and we want to estimate the weight vector parameters $w^* \in \mathbb{R}^d$. Thus, the likelihood function of $Y_i$ given $X_i$ and $w \in \mathbb{R}^d$ is given by

$$p(Y_i|X_i, w) = N(Y_i|w \cdot X_i, \sigma_e^2) = (2\pi\sigma_e^2)^{-\frac{1}{2}}\, e^{-\frac{1}{2\sigma_e^2}(Y_i - w \cdot X_i)^2}.$$

Therefore, the corresponding negative log-likelihood loss function is proportional to the squared loss of a linear regressor $f_w(x) = w \cdot x$:

$$\ell_{\mathrm{sqr}}(f_w(X_i), Y_i) = (Y_i - w \cdot X_i)^2. \quad (7)$$

Previous theorem

Considering a family of linear predictors, $F_d = \{f_w \mid w \in \mathbb{R}^d\}$, Germain et al. (2016) proposed a generalization bound for Bayesian linear regression under the above assumptions. To get a generalization bound for the squared loss in the form of Eq. (1), one needs to compute the term $\Psi_{\ell_{\mathrm{sqr}},\pi}(\lambda, n)$ or to upper bound it. The following is the initial PAC-Bayesian bound for the unbounded squared loss proposed by Germain et al. (2016).

[4] We omit several details here to concentrate on the general idea. We refer the reader to Germain et al. (2016) for the whole picture.

Theorem 6 (Germain et al. (2016)). Given $F_d$, $\ell_{\mathrm{sqr}}$, and $\delta$ defined above, given a prior distribution $\pi$ over $F_d$ which is a zero mean Gaussian with covariance $\sigma_\pi^2 I$, i.e., $\pi(f_w) = N(w|0, \sigma_\pi^2 I)$, under Assumption 5, for constants $c \geq 2\sigma_x^2\sigma_\pi^2$ and $\lambda \in (0, \frac{1}{c})$, for any posterior distribution $\hat\rho$ over $F_d$:

$$\mathbf{P}\left( \mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} \right] + \frac{1}{2} \cdot \frac{(d + \|w^*\|^2)\, c + (1 - \lambda c)\, \sigma_e^2}{1 - \lambda c} \right) \geq 1 - \delta. \quad (8)$$

Theorem 6 expresses the result with $\lambda$ stated explicitly, while Germain et al. (2016) (see Appendix A.4 therein) were focusing on the case $\lambda = n$. Here, we observe that the bound does not converge: regardless of the choice of $\lambda$, the last term of Eq. (8) is not negligible. Note that PAC-Bayesian guarantees for similar Bayesian models have also been proposed by other authors, under different sets of assumptions, either a bounded loss (Sheth and Khardon 2017) or non-random inputs (Dalalyan and Tsybakov 2008).

Improved theorem

The first contribution of this paper is an improvement of Theorem 6.

Theorem 7. Given $F_d$, $\ell_{\mathrm{sqr}}$ defined above, under Assumption 5, for any $\delta \in (0, 1]$, $\lambda > 0$, for any prior distribution $\pi$ over $F_d$, and for any posterior distribution $\hat\rho$ over $F_d$, the following holds:

$$\mathbf{P}\left( \mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \Psi_{\ell_{\mathrm{sqr}},\pi}(\lambda, n) \right] \right) \geq 1 - \delta, \quad (9)$$

where

$$\Psi_{\ell_{\mathrm{sqr}},\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi} \frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda v_w}{n/2}\right)^{n/2}} \quad (10)$$

$$\leq \ln \mathbb{E}_{f_w \sim \pi} \exp\left( \frac{\lambda^2 v_w^2}{\lambda v_w + \frac{n}{2}} \right), \quad (11)$$

and $v_w = \sigma_x^2 \|w^* - w\|_2^2 + \sigma_e^2$.

Proof. We get the complexity term in the form of Eq. (10) by simplifying the general form given in Eq. (2), using the assumptions on the inputs and the prior distribution:

$$\Psi_{\ell_{\mathrm{sqr}},\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi}\, \mathbf{E} \exp\left( \lambda \left( L_\ell(f_w) - \hat{L}_\ell(f_w) \right) \right) = \ln \mathbb{E}_{f_w \sim \pi} \left[ \exp\left( \lambda L_\ell(f_w) \right) \underbrace{\mathbf{E} \exp\left( -\frac{\lambda}{n} \sum_{i=1}^{n} (Y_i - w \cdot X_i)^2 \right)}_{(*)} \right].$$

Note that the random variable $Y_i - w \cdot X_i = (w^* - w) \cdot X_i + e_i$ has zero expectation,

$$\mathbf{E}(Y_i - w \cdot X_i) = (w^* - w) \cdot \mathbf{E}\, X_i + \mathbf{E}\, e_i = 0,$$

and its second moment, denoted $v_w$, which by definition equals $L_\ell(f_w)$, is

$$v_w = \mathbf{E}(Y_i - w \cdot X_i)^2 = \mathbf{E}\left[ (w^* - w)^T X_i X_i^T (w^* - w) \right] + \mathbf{E}\left[ 2 (w^* - w) \cdot X_i\, e_i \right] + \mathbf{E}\, e_i^2 = \sigma_x^2 \|w^* - w\|_2^2 + \sigma_e^2.$$

Hence, $(Y_i - w \cdot X_i)/\sqrt{v_w} \sim N(0, 1)$ is a normalized random variable, and the sum of its squares follows a Chi-squared distribution law. The term $(*)$, written in the form

$$\mathbf{E} \exp\left( -\frac{\lambda v_w}{n} \sum_{i=1}^{n} \frac{(Y_i - w \cdot X_i)^2}{v_w} \right),$$

corresponds to the moment generating function (MGF) of a Chi-squared distribution, i.e., $(1 - 2t)^{-n/2}$, evaluated at $t = -\frac{\lambda v_w}{n}$. By replacing the term $(*)$ by the Chi-squared MGF and $L_\ell(f_w)$ by $v_w$, we get the complexity term in the form of Eq. (10). Eq. (11) is obtained by lower bounding the denominator of Eq. (10) using the inequality $(1 + \frac{a}{b})^b > \exp(\frac{ab}{a+b})$ for $a, b > 0$, with $a = \lambda v_w$ and $b = \frac{n}{2}$:

$$\Psi_{\ell_{\mathrm{sqr}},\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi} \frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda v_w}{n/2}\right)^{n/2}} \leq \ln \mathbb{E}_{f_w \sim \pi} \exp\left( \lambda v_w - \frac{\lambda v_w \frac{n}{2}}{\lambda v_w + \frac{n}{2}} \right) = \ln \mathbb{E}_{f_w \sim \pi} \exp\left( \frac{\lambda^2 v_w^2}{\lambda v_w + \frac{n}{2}} \right). \qquad \square$$
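The complexity term of Theorem 7 is an expectation under the prior, so it can be estimated numerically. The sketch below (illustrative parameters, our own code) samples weights from the Gaussian prior, evaluates the term of Eq. (10) with a log-sum-exp trick for numerical stability, and checks it against the relaxation of Eq. (11).

```python
# Monte Carlo sketch of the complexity term of Theorem 7:
# Eq. (10) averages exp(lam v_w) / (1 + lam v_w / (n/2))^(n/2) over w ~ pi.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
d, n, lam = 3, 1000, 10.0
sigma_x, sigma_e, sigma_pi = 1.0, 0.5, 1.0
w_star = np.zeros(d)                      # assumed data-generating weights

m = 100_000
w = rng.normal(0.0, sigma_pi, size=(m, d))       # samples from the prior pi
v_w = sigma_x**2 * np.sum((w_star - w)**2, axis=1) + sigma_e**2

# log of the integrand of Eq. (10): lam v_w - (n/2) ln(1 + lam v_w / (n/2))
log_terms = lam * v_w - (n / 2) * np.log1p(lam * v_w / (n / 2))
psi = logsumexp(log_terms) - np.log(m)
# the relaxed integrand of Eq. (11): lam^2 v_w^2 / (lam v_w + n/2)
upper = logsumexp(lam**2 * v_w**2 / (lam * v_w + n / 2)) - np.log(m)
print(f"Psi (Eq. 10) ~ {psi:.4f}   upper bound (Eq. 11) ~ {upper:.4f}")
```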
We are interested in the convergence properties of the right-hand side of Eq. (9). This highly depends on the choice of $\lambda$. If $\lambda$ is fixed and does not depend on $n$, then as the latter approaches $\infty$, we get

$$\mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} \right].$$

The term $\Psi_{\ell,\pi}(\lambda, n)$ amounts to 0, since the expression under the expectation of Eq. (10) converges to 1, due to the fact that

$$\exp(\lambda v_w) = \lim_{n \to \infty} \left( 1 + \frac{\lambda v_w}{n/2} \right)^{n/2}.$$

Hence, the empirical error converges to the generalization error for a sufficiently large value of the parameter $\lambda$ and a small divergence between the prior and posterior distributions.

If $\lambda$ is considered as a function of $n$, then we can obtain convergence of the right-hand side of Eq. (9) to the left-hand side with a well-chosen temperature parameter. Let $\lambda$ be $n^{\frac{1}{d}} \ln(\frac{1}{\delta})$; then from Eq. (9) and (11), we have

$$\mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{KL(\hat\rho \| \pi)}{n^{\frac{1}{d}} \ln(\frac{1}{\delta})} + \frac{1}{n^{\frac{1}{d}}} + \frac{1}{n^{\frac{1}{d}} \ln(\frac{1}{\delta})} \ln \mathbb{E}_{f_w \sim \pi} \exp\left( 2 n^{\frac{2}{d} - 1} \ln^2\!\left(\frac{1}{\delta}\right) v_w^2 \right).$$

If the amount of training examples $n \to \infty$, then the bound converges to the generalization loss.

Theorems comparison

The new bound given by Theorem 7 is always tighter than the previous one of Theorem 6. Indeed, the fraction of Eq. (10) is upper bounded by its numerator $\exp(\lambda v_w)$. The latter is the exact same expression as in the derivation of Germain et al. (2016) (Supp. Material A4, p. 11, line 4), which leads to the prior bound shown in Eq. (8). Moreover, the new bound converges for a well-chosen temperature parameter $\lambda$ as the number of training observations goes to infinity. For these reasons, the result of Theorem 7 is strictly stronger than that of Theorem 6.

4 Extension to the non-i.i.d. case

In this section we study the case where the observed data are no longer sampled independently from the underlying distribution.

The learning problem and its relationship with time series

We consider the same learning problem as in Section 3, but we modify Assumption 5 by no longer assuming that the $X_i$ are i.i.d. random variables. More precisely, we assume the following.

Assumption 8. We assume Part (b) of Assumption 5, and we assume that $X_i \sim N(0, Q_x)$ for some positive definite matrix $Q_x > 0$.

It then follows that the $Y_i$ are also identically distributed, $Y_i \sim N(0, \sigma_y^2)$, where $\sigma_y^2 = w^{*T} Q_x w^* + \sigma_e^2$. Note that from the assumption that the $X_i$ are identically distributed it follows that $L_\ell(f_w)$ does not depend on $i$, and

$$L_\ell(f_w) = (w^* - w)^T Q_x (w^* - w) + \sigma_e^2.$$

A particular instance of the learning problem above is the problem of learning ARX models, which is a well-studied problem in control theory and econometrics (Ljung 1999; Hannan and Deistler 1988). For the sake of simplicity, we will deal only with the scalar input, scalar output case. Consider stationary zero mean discrete-time stochastic processes $y_t, u_t$, $t \in \mathbb{Z}$. Assume that there exist real numbers $\{a_i, b_i\}_{i=1}^k$ and a stochastic process $e_t$ such that

$$y_t = \sum_{i=1}^{k} a_i y_{t-i} + \sum_{i=1}^{k} b_i u_{t-i} + e_t, \quad (12)$$

where $e_t$ is assumed to be an i.i.d. sequence of random variables such that $e_t \sim N(0, \sigma^2)$ and $e_t$ is uncorrelated with $y_s, u_s$ for $s < t$. Consider the polynomial $a(z) = z^k - \sum_{i=1}^{k} a_i z^{k-i}$. If $a(z)$ has all its complex roots inside the unit disc, and $u_t$ is stationary, then it is well known (Hannan and Deistler 1988) that $y_t$ is the unique stationary process which satisfies Eq. (12). Moreover, if $u_t$ is a jointly Gaussian process, then the parameters $(\{a_i, b_i\}_{i=1}^k, \sigma^2)$ together with the joint distribution of $u_t$ determine the distribution of $y_t$ uniquely (Hannan and Deistler 1988). Intuitively, the learning problem is to compute a prediction $\hat{y}_t$ of $y_t$ based on past values $\{y_{t-l}, u_{t-l}\}_{l=1}^{\infty}$ of the input and output processes.
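As a concrete illustration of Eq. (12), the following sketch (hypothetical coefficients, chosen so that all roots of $a(z)$ lie inside the unit disc; none of the numbers come from the paper) simulates a scalar ARX process driven by Gaussian input and noise.

```python
# Sketch: simulating the scalar ARX model of Eq. (12) with assumed stable
# coefficients; a(z) = z^2 - 0.5 z + 0.2 has both roots inside the unit disc.
import numpy as np

rng = np.random.default_rng(3)
k, T, sigma = 2, 10_000, 0.3
a = np.array([0.5, -0.2])   # assumed AR coefficients a_1, a_2
b = np.array([1.0, 0.4])    # assumed input coefficients b_1, b_2

u = rng.normal(size=T)      # i.i.d. Gaussian (hence stationary) input
e = rng.normal(0.0, sigma, size=T)
y = np.zeros(T)
for t in range(k, T):
    # y_t = sum_i a_i y_{t-i} + sum_i b_i u_{t-i} + e_t
    y[t] = a @ y[t-k:t][::-1] + b @ u[t-k:t][::-1] + e[t]
print("sample variance of the stationary output:", y.var())
```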
In the literature (Hannan and Deistler 1988; Ljung 1999) one typically would like to minimize the prediction error $\mathbf{E}[(y_t - \hat{y}_t)^2]$. In principle, this generalization error may depend on $t$. However, if we assume that the predictor $f$ uses only the last $L$ observations and is of the form $\hat{y}_t = \sum_{i=1}^{L} \hat{a}_i y_{t-i} + \sum_{i=1}^{L} \hat{b}_i u_{t-i}$, then by stationarity of $y_t, u_t$, $t \in \mathbb{Z}$, the predictor will not depend on $t$. Furthermore, if $y_t, u_t$ come from an ARX model (Eq. (12)) and they are Gaussian, then it can be shown (Hannan and Deistler 1988), under some mild assumptions, that the best possible predictor is necessarily of the above form with $L = k$; in fact, we should take $\hat{a}_i = a_i$, $\hat{b}_i = b_i$, $i = 1, \ldots, k$, and in this case the generalization error is $\mathbf{E}[(y_t - \hat{y}_t)^2] = \sigma^2$. For this reason, in the literature (Ljung 1999; Hannan and Deistler 1988) the learning problem is often formulated as the problem of estimating the parameters of the true model (Eq. (12)). It is well known that for ARX models, the latter point of view is essentially equivalent to finding the predictor for which the generalization error $\mathbf{E}[(y_t - \hat{y}_t)^2]$ is the smallest. This allows us to recast the learning problem into our framework for linear regression as follows. For every $i = 1, 2, \ldots$, define

$$Y_i = y_{i+k}, \quad X_i = [y_{i+k-1}, \ldots, y_i, u_{i+k-1}, \ldots, u_i]^T, \quad w^* = [a_1, \ldots, a_k, b_1, \ldots, b_k]^T,$$

and take the regression noise corresponding to index $i$ to be $e_{i+k}$. It then follows that $X_i, Y_i$ and this noise satisfy Assumption 8.

PAC-Bayesian approach for linear regression with possibly dependent observations

In this section we discuss the extension of Theorem 7 to the case when the observations are not independently sampled. Although Theorem 3 holds even when the $(X_i, Y_i)$ are not i.i.d., the proof of Theorem 7 relies heavily on the independence of $X_i$, $i = 1, \ldots, n$. More precisely, let us recall from the proof of Theorem 7 the empirical prediction error variables

$$Z_{w,i} = Y_i - w \cdot X_i = (w^* - w) \cdot X_i + e_i. \quad (13)$$

The proof of Theorem 7 relied on the $Z_{w,i}$, $i = 1, \ldots, n$, being independent and identically distributed zero mean Gaussian random variables. In our case, the variables $Z_{w,i}$ are still identically distributed zero mean Gaussian variables, but they are no longer independent. Hence, we have to take into account the joint distribution of $\{Z_{w,i}\}_{i=1}^n$, which in turn depends on the joint distribution of $\{X_i\}_{i=1}^n$. In order to deal with this phenomenon, we define the joint covariance matrix $Q_{X,n}$ of the random variable $X_{1:n} = [X_1^T, \ldots, X_n^T]^T$ as follows:

$$Q_{X,n} = \mathbf{E}[X_{1:n} X_{1:n}^T],$$

i.e., the $(i, j)$-th $d \times d$ block matrix element of $Q_{X,n}$ is $\mathbf{E}[X_i X_j^T]$. We can then formulate the following bound.

Theorem 9. Let $\rho_n$ be the minimal eigenvalue of $Q_{X,n}$ and assume that $\rho_n > 0$. Under Assumption 8, for any prior distribution $\pi$ over $F_d$, any $\delta \in (0, 1]$, any real number $\lambda > 0$, and for any posterior distribution $\hat\rho$ over $F_d$, we have

$$\mathbf{P}\left( \mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \hat\Psi_{\ell,\pi}(\lambda, n) \right] \right) \geq 1 - \delta, \quad (14)$$

where

$$\hat\Psi_{\ell,\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi} \frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda \rho_{n,w}}{n/2}\right)^{n/2}} \quad (15)$$

$$\leq \ln \mathbb{E}_{f_w \sim \pi} \exp\left( \frac{\lambda^2 v_w \rho_{n,w}}{\lambda \rho_{n,w} + \frac{n}{2}} + \lambda (v_w - \rho_{n,w}) \right), \quad (16)$$

with $v_w = (w^* - w)^T Q_x (w^* - w) + \sigma_e^2$ and $\rho_{n,w} = \rho_n \|w^* - w\|_2^2 + \sigma_e^2$.

Remark 10 (Comparison with the i.i.d. case). If the $X_i$, $i = 1, 2, \ldots$, are independent and $Q_x = \sigma_x^2 I_d$, then $Q_{X,n}$ is diagonal, with the diagonal elements being $\sigma_x^2$. In this case, $\rho_n = \sigma_x^2$ and $\rho_{n,w} = v_w$, and hence the statement of Theorem 9 boils down to that of Theorem 7.
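The quantities entering Theorem 9 can be estimated for the ARX construction above. The sketch below (same hypothetical coefficients as in the previous snippet; the Monte Carlo sizes are arbitrary) builds the stacked vectors $X_{1:n}$ over many independent realizations, estimates $Q_{X,n} = \mathbf{E}[X_{1:n} X_{1:n}^T]$, and extracts its minimal eigenvalue $\rho_n$.

```python
# Sketch: Monte Carlo estimate of Q_{X,n} and of its minimal eigenvalue rho_n
# for the hypothetical ARX model, using the recasting X_i, Y_i of Section 4.
import numpy as np

def simulate_arx(rng, a, b, sigma, T, burn=200):
    """Simulate the scalar ARX recursion, discarding a burn-in prefix."""
    k = len(a)
    u = rng.normal(size=T + burn)
    e = rng.normal(0.0, sigma, size=T + burn)
    y = np.zeros(T + burn)
    for t in range(k, T + burn):
        y[t] = a @ y[t-k:t][::-1] + b @ u[t-k:t][::-1] + e[t]
    return y[burn:], u[burn:]

rng = np.random.default_rng(4)
a, b, sigma = np.array([0.5, -0.2]), np.array([1.0, 0.4]), 0.3
k, n, R = 2, 5, 5_000
stacked = np.zeros((R, n * 2 * k))
for r in range(R):
    y, u = simulate_arx(rng, a, b, sigma, T=n + k)
    # X_i = [y_{i+k-1}, ..., y_i, u_{i+k-1}, ..., u_i]^T for i = 1, ..., n
    X = [np.concatenate([y[i:i+k][::-1], u[i:i+k][::-1]]) for i in range(n)]
    stacked[r] = np.concatenate(X)

Q_Xn = stacked.T @ stacked / R           # estimate of E[X_{1:n} X_{1:n}^T]
rho_n = np.linalg.eigvalsh(Q_Xn).min()   # the rho_n entering Theorem 9
print("estimated rho_n:", rho_n)
```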
Before presenting the proof of Theorem 9, some discussion is in order. Recall that one of the advantages of the error bound of Theorem 7 was that it converged to zero as $n \to \infty$. The question arises whether this is the case for the error bound of Theorem 9. In order to answer this question, we need to investigate the dependence on $n$ of the smallest eigenvalue $\rho_n$ of the covariance matrix $Q_{X,n}$, since $\rho_n$ is used in the error bound of Theorem 9. To this end, note that $Q_{X,n}$ is a positive semi-definite matrix, and hence, by the properties of minimal eigenvalues of positive semi-definite matrices (Golub and Van Loan 2013), $r^T Q_{X,n} r \geq \rho_n r^T r$. From Söderström and Stoica (1989) (Chapter 5, page 135) it follows that $\rho_n \leq \rho_{n-1}$, i.e., $\rho_n$ is a monotonically decreasing sequence. In particular, as $\rho_n \leq \rho_1$ and $Q_{X,1} = Q_x$, we have $\rho_1 \|w^* - w\|_2^2 \leq (w^* - w)^T Q_x (w^* - w)$ and hence $\rho_{n,w} \leq v_w$. This means that the right-hand side of Eq. (15) is not smaller than the right-hand side of Eq. (10), and Eq. (16) is not smaller than Eq. (11). That is, the error bounds of Theorem 9 are not smaller than those of Theorem 7. Moreover, $\rho_n \geq 0$, since it is an eigenvalue of the positive semi-definite matrix $Q_{X,n}$. In particular, $\rho_\infty = \lim_{n \to \infty} \rho_n = \inf_n \rho_n$ exists. We then get the following corollary of Theorem 9 by noticing that, since $\rho_n \geq \rho_\infty$,

$$\frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda \rho_{n,w}}{n/2}\right)^{n/2}} \leq \frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda \rho_{\infty,w}}{n/2}\right)^{n/2}}.$$

Corollary 11. Assume $\rho_\infty > 0$. For any prior $\pi$ over $F_d$, any $\delta \in (0, 1]$, any $\lambda > 0$, and any $\hat\rho$ over $F_d$, Eq. (14) remains true if we replace $\hat\Psi_{\ell,\pi}$ by $\tilde\Psi_{\ell,\pi}$, where

$$\hat\Psi_{\ell,\pi}(\lambda, n) \leq \tilde\Psi_{\ell,\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi} \frac{\exp(\lambda v_w)}{\left(1 + \frac{\lambda \rho_{\infty,w}}{n/2}\right)^{n/2}},$$

with $v_w = (w^* - w)^T Q_x (w^* - w) + \sigma_e^2$ and $\rho_{\infty,w} = \rho_\infty \|w^* - w\|_2^2 + \sigma_e^2$.

Corollary 11 gives a PAC-Bayesian bound whose asymptotic behavior is easy to study. Indeed, since $(1 + \frac{\lambda \rho_{\infty,w}}{n/2})^{n/2}$ increases with $n$ and converges to $\exp(\lambda \rho_{\infty,w})$ as $n \to \infty$, the error bound $\tilde\Psi_{\ell,\pi}(\lambda, n)$ decreases with $n$ and

$$\lim_{n \to \infty} \tilde\Psi_{\ell,\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi} \exp\left( \lambda (v_w - \rho_{\infty,w}) \right). \quad (17)$$

That is, contrary to the i.i.d. case of Theorem 7, the PAC-Bayesian error bound of Corollary 11 decreases with $n$, but it will not converge to 0; rather, it will be bounded from above by the right-hand side of Eq. (17). Note that $v_w - \rho_{\infty,w} = (w - w^*)^T (Q_x - \rho_\infty I_d)(w - w^*)$. The latter is a monotonically increasing function of $Q_x - \rho_\infty I_d$: the smaller this difference is, the closer the right-hand side of Eq. (17) is to zero. The difference $Q_x - \rho_\infty I_d$ is zero in the i.i.d. case, and it can be seen as a measure of the degree of dependence of the $X_i$, $i = 1, 2, \ldots$.

Note that Theorem 9 and Corollary 11 are meaningful only for $\rho_n > 0$ and $\rho_\infty > 0$. For time series, the assumption $\rho_\infty > 0$ is equivalent to $Q_{X,n} > m I_{nd}$ for all $n$, for some $m > 0$. This property is a mild modification of the well-known property of informativity of the data set $\{y_t, u_t\}_{t=1}^{\infty}$ (Ljung 1999), as can be seen by an easy modification of the argument of Söderström and Stoica (1989) (Chapter 5, page 122, proof of Property 1). In turn, informativity of the data set is a standard assumption made in the literature (Ljung 1999), and it is required for learning ARX models. Note that under mild assumptions on $u_t$, it follows from Ljung (1999, Theorem 2.3) that $\hat{L}_\ell(f_w) \to L_\ell(f_w)$ as $n \to \infty$ with probability one. That is, even though the law of large numbers does not apply in this case, we still know that the empirical loss converges to the generalization error as $n \to \infty$.
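The size of the non-vanishing term of Eq. (17) can also be probed numerically. In the sketch below (all numbers are hypothetical assumptions), weights are drawn from a Gaussian prior and the limiting residual is estimated by Monte Carlo; shrinking $Q_x - \rho_\infty I_d$ toward zero drives the residual toward zero, consistently with the i.i.d. case.

```python
# Sketch: Monte Carlo estimate of the limiting residual of Eq. (17),
# ln E_{w~pi} exp( lam (v_w - rho_inf_w) ), where
# v_w - rho_inf_w = (w - w*)^T (Q_x - rho_inf I_d) (w - w*).
import numpy as np

rng = np.random.default_rng(5)
d, lam, sigma_pi, m = 2, 1.0, 0.5, 500_000
w_star = np.zeros(d)
Q_x = np.array([[1.0, 0.3],
                [0.3, 1.0]])   # assumed stationary covariance of the X_i
rho_inf = 0.6                  # assumed limit of the eigenvalue sequence rho_n

w = rng.normal(0.0, sigma_pi, size=(m, d))     # samples from the prior pi
diff = w - w_star
M = Q_x - rho_inf * np.eye(d)                  # the "degree of dependence"
gap = np.einsum('ij,jk,ik->i', diff, M, diff)  # v_w - rho_inf_w per sample
residual = np.log(np.mean(np.exp(lam * gap)))
print("limiting residual of Eq. (17):", residual)
```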
Proof of Theorem 9. The proof follows the same lines as that of Theorem 7. From Theorem 3 it follows that

$$\mathbf{P}\left( \mathbb{E}_{f_w \sim \hat\rho} L_\ell(f_w) \leq \mathbb{E}_{f_w \sim \hat\rho} \hat{L}_\ell(f_w) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \Psi_{\ell,\pi}(\lambda, n) \right] \right) \geq 1 - \delta. \quad (18)$$

Consider the random variable $Z_{w,i}$ defined in Eq. (13). Just like in the proof of Theorem 7,

$$\Psi_{\ell,\pi}(\lambda, n) = \ln \mathbb{E}_{f_w \sim \pi}\, \mathbf{E} \exp\left( \lambda \left( L_\ell(f_w) - \hat{L}_\ell(f_w) \right) \right) = \ln \mathbb{E}_{f_w \sim \pi} \left[ \exp\left( \lambda L_\ell(f_w) \right) \mathbf{E} \exp\left( -\frac{\lambda}{n} \sum_{i=1}^{n} Z_{w,i}^2 \right) \right], \quad (19)$$

and it can be shown that $Z_{w,i}$ is zero mean Gaussian with variance $\mathbf{E}[Z_{w,i}^2] = v_w$. In the proof of Theorem 7 we used the fact that, under its assumptions, the $\{Z_{w,i}\}_{i=1}^n$ were mutually independent and identically distributed, and hence $\sum_{i=1}^{n} \frac{Z_{w,i}^2}{v_w}$ had a $\chi^2$ distribution. In our case, the $Z_{w,i}$ are not independent. In order to get around this issue, we define the random variable $Z_{w,1:n}$ and its covariance matrix $Q_{w,n}$:

$$Z_{w,1:n} = [Z_{w,1}, \ldots, Z_{w,n}]^T, \quad Q_{w,n} = \mathbf{E}[Z_{w,1:n} Z_{w,1:n}^T].$$

It is easy to see that $Q_{w,n} = D_w^T Q_{X,n} D_w + \sigma_e^2 I_n$, where $D_w$ is the $nd \times n$ block-diagonal matrix

$$D_w = \mathrm{diag}(\underbrace{(w^* - w), \ldots, (w^* - w)}_{n \text{ times}}).$$

Notice that $r^T Q_{X,n} r \geq \rho_n r^T r$ for all $r \in \mathbb{R}^{nd}$ by Golub and Van Loan (2013). Then, for any $z \in \mathbb{R}^n$, by taking $r = D_w z$, it follows that

$$z^T Q_{w,n} z = (D_w z)^T Q_{X,n} (D_w z) + \sigma_e^2 z^T z \geq \rho_n (D_w z)^T (D_w z) + \sigma_e^2 z^T z = \rho_{n,w}\, z^T z, \quad (20)$$

where we used that $\|D_w z\|_2^2 = \|w^* - w\|_2^2 \|z\|_2^2$. Define $S = Q_{w,n}^{-1/2} Z_{w,1:n}$, and let $S_i$ be the $i$-th entry of $S$, i.e., $S = [S_1, \ldots, S_n]^T$. Then from Eq. (20) it follows that

$$\sum_{i=1}^{n} Z_{w,i}^2 = Z_{w,1:n}^T Q_{w,n}^{-1/2} Q_{w,n} Q_{w,n}^{-1/2} Z_{w,1:n} = S^T Q_{w,n} S \geq \rho_{n,w}\, S^T S = \rho_{n,w} \sum_{i=1}^{n} S_i^2.$$

It then follows that

$$\mathbf{E} \exp\left( -\frac{\lambda}{n} \sum_{i=1}^{n} Z_{w,i}^2 \right) \leq \mathbf{E} \exp\left( -\frac{\lambda \rho_{n,w}}{n} \sum_{i=1}^{n} S_i^2 \right). \quad (21)$$

Notice now that $S$ is Gaussian and zero mean, with covariance $\mathbf{E}[S S^T] = Q_{w,n}^{-1/2} \mathbf{E}[Z_{w,1:n} Z_{w,1:n}^T] Q_{w,n}^{-1/2} = I_n$. That is, the random variables $S_i$ are normally distributed, $S_i$ and $S_j$ are independent, and therefore $\sum_{i=1}^{n} S_i^2$ has a $\chi^2$ distribution. Hence,

$$\mathbf{E} \exp\left( -\frac{\lambda \rho_{n,w}}{n} \sum_{i=1}^{n} S_i^2 \right) = \left( 1 + \frac{\lambda \rho_{n,w}}{n/2} \right)^{-n/2}.$$

Combining this with Eq. (21) and (19), Eq. (18) implies Eq. (15). By using the inequality $(1 + \frac{a}{b})^b > e^{\frac{ab}{a+b}}$ for $a, b > 0$, with $a = \lambda \rho_{n,w}$ and $b = \frac{n}{2}$, Eq. (16) follows from Eq. (15). $\square$

Related works

Note that PAC bounds for learning time series have been explored in the literature by Kuznetsov and Mohri (2017; 2018). Their approach is based on covering numbers and Rademacher complexity instead of a PAC-Bayes analysis but, in contrast to the current paper, Kuznetsov and Mohri's work allows for non-stationary time series. Alquier and Wintenberger (2012) include a PAC-Bayesian analysis in their model selection procedure for time series. Among other differences, they provide oracle-inequality-type bounds, whereas our analysis provides generalization bounds relying on the empirical loss.

5 Conclusion

We have presented an improved PAC-Bayesian error bound for linear regression and extended this error bound to the case of non-i.i.d. observations. Thus, the obtained bound applies to the problem of learning time series using ARX models, which can be viewed as a simple yet non-trivial subclass of recurrent neural network regressions. For this reason, we are hopeful that the results of Section 4 could potentially lead to PAC-Bayesian bounds for recurrent neural networks.

6 Acknowledgement

This work is funded in part by CNRS project PEPS Blanc INS2I 2019 Bayes Rea For RNN, in part by the CPER Data project, co-financed by the European Union, the European Regional Development Fund (ERDF), the French State and the French Region of Hauts-de-France, and in part by the French project APRIORI ANR-18-CE23-0015.

A Mathematical details

Proof of Theorem 3

Proof. The PAC-Bayesian theorem is based on the following Donsker-Varadhan change of measure. For any measurable function $\varphi : F \to \mathbb{R}$, we have

$$\mathbb{E}_{f \sim \hat\rho}\, \varphi(f) \leq KL(\hat\rho \| \pi) + \ln \mathbb{E}_{f \sim \pi}\, e^{\varphi(f)}.$$

Thus, with $\varphi(f) = \lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)$, we obtain, for all $\hat\rho$ on $F$:

$$\mathbb{E}_{f \sim \hat\rho}\, \lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right) \leq KL(\hat\rho \| \pi) + \ln \mathbb{E}_{f \sim \pi}\, e^{\lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)}. \quad (22)$$

Let us consider the random variable $\xi = \mathbb{E}_{f \sim \pi}\, e^{\lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)}$. By the Markov inequality, we have

$$\mathbf{P}\left( \xi \leq \frac{1}{\delta}\, \mathbf{E}\, \xi \right) \geq 1 - \delta,$$

which, combined with Eq.
(22), gives

$$\mathbf{P}\left( \mathbb{E}_{f \sim \hat\rho}\, \lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right) \leq KL(\hat\rho \| \pi) + \ln\left( \frac{1}{\delta}\, \mathbf{E}\, \xi \right) \right) \geq 1 - \delta.$$

By rearranging the terms of the above equation, we obtain the following equivalent form of the statement of the theorem:

$$\mathbf{P}\left( \mathbb{E}_{f \sim \hat\rho} L_\ell(f) \leq \mathbb{E}_{f \sim \hat\rho} \hat{L}_\ell(f) + \frac{1}{\lambda} \left[ KL(\hat\rho \| \pi) + \ln \frac{1}{\delta} + \ln \mathbf{E}\, \xi \right] \right) \geq 1 - \delta.$$

To see that the inequality above is equivalent to the statement of the theorem, note that by Fubini's theorem,

$$\mathbf{E}\, \xi = \mathbf{E}\, \mathbb{E}_{f \sim \pi}\, e^{\lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)} = \mathbb{E}_{f \sim \pi}\, \mathbf{E}\, e^{\lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right)},$$

and hence $\ln \mathbf{E}\, \xi = \Psi_{\ell,\pi}(\lambda, n)$. Moreover, $\ln\left( \frac{1}{\delta}\, \mathbf{E}\, \xi \right) = \ln \frac{1}{\delta} + \ln \mathbf{E}\, \xi$. $\square$

Details leading to Eq. (4)

For any $f \in F$:

$$\mathbf{E} \exp\left( \lambda \left( L_\ell(f) - \hat{L}_\ell(f) \right) \right) = \mathbf{E}\, e^{\frac{\lambda}{n} \sum_{i=1}^{n} \left( \mathbf{E}\, \ell(f(X_k), Y_k) - \ell(f(X_i), Y_i) \right)} = \mathbf{E} \prod_{i=1}^{n} e^{\frac{\lambda}{n} \left( \mathbf{E}\, \ell(f(X_k), Y_k) - \ell(f(X_i), Y_i) \right)}$$

$$= \prod_{i=1}^{n} \mathbf{E}\, e^{\frac{\lambda}{n} \left( \mathbf{E}\, \ell(f(X_k), Y_k) - \ell(f(X_i), Y_i) \right)} \qquad (X_i, Y_i \text{ i.i.d.})$$

$$\leq \prod_{i=1}^{n} \exp\left( \frac{\lambda^2 L^2}{8 n^2} \right) = \exp\left( \frac{\lambda^2 L^2}{8 n} \right), \qquad \text{(Hoeff.)}$$

where the line (Hoeff.) is obtained from Hoeffding's lemma applied to the random variable $L_\ell(f) - \ell(f(X_i), Y_i) \in [L_\ell(f) - L,\ L_\ell(f)]$, which has an expected value of zero.

References

Alquier, P., and Wintenberger, O. 2012. Model selection for weakly dependent time series forecasting. Bernoulli 18(3):883-913.

Alquier, P.; Ridgway, J.; and Chopin, N. 2016. On the properties of variational approximations of Gibbs posteriors. JMLR 17(239):1-41.

Ambroladze, A.; Parrado-Hernández, E.; and Shawe-Taylor, J. 2006. Tighter PAC-Bayes bounds. In NIPS.

Billingsley, P. 1986. Probability and Measure. Wiley.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Catoni, O. 2007. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56. Institute of Mathematical Statistics.

Dalalyan, A. S., and Tsybakov, A. B. 2008. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72(1-2):39-61.

Dziugaite, G. K., and Roy, D. M. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI. AUAI Press.

Germain, P.; Lacasse, A.; Laviolette, F.; Marchand, M.; and Roy, J.-F. 2015. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. JMLR 16.

Germain, P.; Bach, F. R.; Lacoste, A.; and Lacoste-Julien, S. 2016. PAC-Bayesian theory meets Bayesian inference. In NIPS, 1876-1884.

Golub, G. H., and Van Loan, C. F. 2013. Matrix Computations. Johns Hopkins University Press.

Grünwald, P. 2012. The safe Bayesian: learning the learning rate via the mixability gap. In ALT.

Guedj, B. 2019. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353.

Hannan, E., and Deistler, M. 1988. The Statistical Theory of Linear Systems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.

Kuznetsov, V., and Mohri, M. 2017. Generalization bounds for non-stationary mixing processes. Machine Learning 106(1):93-117.

Kuznetsov, V., and Mohri, M. 2018. Theory and algorithms for forecasting time series. arXiv preprint arXiv:1803.05814.

Ljung, L. 1999. System Identification: Theory for the User (2nd Ed.). Upper Saddle River, USA: PTR Prentice Hall.

McAllester, D. 1999. Some PAC-Bayesian theorems. Machine Learning 37(3):355-363.

McAllester, D. 2003. Simplified PAC-Bayesian margin bounds. In COLT, 203-215.

Murphy, K. P. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Seeger, M. 2002. PAC-Bayesian generalization bounds for Gaussian processes. JMLR 3:233-269.

Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. New York, NY, USA: Cambridge University Press.

Sheth, R., and Khardon, R. 2017.
Excess risk bounds for the Bayes risk using variational inference in latent Gaussian models. In NIPS, 5151-5161.

Söderström, T., and Stoica, P. 1989. System Identification. Prentice Hall.

Valiant, L. G. 1984. A theory of the learnable. Communications of the ACM 27(11):1134-1142.

Zhang, T. 2006. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory 52(4):1307-1321.