# PAC-Bayes Analysis Beyond the Usual Bounds

Omar Rivasplata (University College London & DeepMind), o.rivasplata@cs.ucl.ac.uk · Ilja Kuzborskij (DeepMind), iljak@google.com · Csaba Szepesvári (DeepMind), szepi@google.com · John Shawe-Taylor (University College London), jst@cs.ucl.ac.uk

We focus on a stochastic learning model where the learner observes a finite set of training examples and the output of the learning process is a data-dependent distribution over a space of hypotheses. The learned data-dependent distribution is then used to make randomized predictions, and the high-level theme addressed here is guaranteeing the quality of predictions on examples that were not seen during training, i.e. generalization. In this setting the unknown quantity of interest is the expected risk of the data-dependent randomized predictor, for which upper bounds can be derived via a PAC-Bayes analysis, leading to PAC-Bayes bounds. Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds. We clarify the role of the requirements of fixed data-free priors, bounded losses, and i.i.d. data. We highlight that those requirements were used to upper-bound an exponential moment term, while the basic PAC-Bayes theorem remains valid without those restrictions. We present three bounds that illustrate the use of data-dependent priors, including one for the unbounded square loss.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## 1 Introduction

The context of this paper is the statistical learning model where the learner observes training data $S = (Z_1, Z_2, \ldots, Z_n)$ randomly drawn from a space of size-$n$ samples $\mathcal{S} = \mathcal{Z}^n$ (e.g. $\mathcal{Z} = \mathbb{R}^d \times \mathcal{Y}$ for a supervised learning problem where the input space is $\mathbb{R}^d$ and the label set is $\mathcal{Y}$) according to some unknown probability distribution¹ $P_n \in \mathcal{M}_1(\mathcal{S})$. Typically $Z_1, \ldots, Z_n$ are independent and share a common distribution $P_1 \in \mathcal{M}_1(\mathcal{Z})$. Upon observing the training data $S$, the learner outputs a data-dependent probability distribution $Q_S$ over a hypothesis space $\mathcal{H}$. Notice that this learning scenario involves randomness in the data and in the hypothesis.

In this stochastic learning model, randomized predictions are carried out by randomly drawing a fresh hypothesis for each prediction. Therefore, we consider the performance of a probability distribution $Q$ over the hypothesis space: the expected empirical loss is $Q[\hat{L}_s] = \int_{\mathcal{H}} \hat{L}_s(h)\, Q(dh)$, i.e. the $Q$-average of the standard empirical loss $\hat{L}_s(h) = \hat{L}(h, s)$ defined as $\hat{L}(h, s) = \frac{1}{n}\sum_{i=1}^{n} \ell(h, z_i)$ for a fixed $h \in \mathcal{H}$ and $s = (z_1, \ldots, z_n)$, where $\ell : \mathcal{H} \times \mathcal{Z} \to [0, \infty)$ is a given loss function. Similarly, the expected population loss is $Q[L] = \int_{\mathcal{H}} L(h)\, Q(dh)$, i.e. the $Q$-average of the standard population loss $L(h) = \int_{\mathcal{Z}} \ell(h, z)\, P_1(dz)$ for a fixed $h \in \mathcal{H}$, where $P_1 \in \mathcal{M}_1(\mathcal{Z})$ is the distribution that generates one random example. An important component of our development is formalizing data-dependent distributions over $\mathcal{H}$ in a way that makes explicit their difference from fixed data-free distributions over $\mathcal{H}$.

¹ We write $\mathcal{M}_1(\mathcal{X})$ to denote the family of probability measures over a set $\mathcal{X}$, see Appendix A.

**Data-dependent distributions as stochastic kernels.** A data-dependent distribution over the space $\mathcal{H}$ is formalized as a stochastic kernel² from $\mathcal{S}$ to $\mathcal{H}$, which is defined as a mapping³ $Q : \mathcal{S} \times \Sigma_{\mathcal{H}} \to [0, 1]$ such that (i) for each $B \in \Sigma_{\mathcal{H}}$ the function
$s \mapsto Q(s, B)$ is measurable; and (ii) for each $s \in \mathcal{S}$ the function $Q_s : B \mapsto Q(s, B)$ is a probability measure over $\mathcal{H}$. We write $\mathcal{K}(\mathcal{S}, \mathcal{H})$ to denote the set of all stochastic kernels from $\mathcal{S}$ to distributions over $\mathcal{H}$. We reserve the notation $\mathcal{M}_1(\mathcal{H})$ for the set of data-free distributions over $\mathcal{H}$. Notice that $\mathcal{M}_1(\mathcal{H}) \subset \mathcal{K}(\mathcal{S}, \mathcal{H})$, since every data-free distribution can be regarded as a constant kernel. With the notation just introduced, $Q_S$ stands for the distribution over $\mathcal{H}$ corresponding to a randomly drawn data set $S$. The stochastic kernel $Q$ can be thought of as describing a randomizing learner. One well-known example is the Gibbs learner, where $Q_S$ is of the form $Q_S(dh) \propto e^{-\gamma \hat{L}(h,S)}\,\mu(dh)$ for some $\gamma > 0$, with $\mu$ a base measure over $\mathcal{H}$ (a toy instantiation is sketched after this section's notation). Note that, besides randomized predictors, other prediction schemes may be devised from a learned distribution over hypotheses, for instance ensemble predictors and majority vote predictors (see the related literature in Section 4 below).

² This is also called a transition kernel or probability kernel, a well-known concept in the literature on stochastic processes, see e.g. Kallenberg [2017], Meyn and Tweedie [2009], Ethier and Kurtz [1986].
³ The space of size-$n$ samples $\mathcal{S}$ is equipped with a sigma algebra that we denote $\Sigma_{\mathcal{S}}$, and the hypothesis space $\mathcal{H}$ is equipped with a sigma algebra that we denote $\Sigma_{\mathcal{H}}$. For precise definitions see Appendix A.

A common question arising in learning theory aims to explain the generalization ability of a learner: how can a learner ensure a well-behaved population loss? One way to answer this question is via upper bounds on the population loss, also called generalization bounds. Often the focus is on the generalization gap, which is the difference between the population loss and the empirical loss, and on giving upper bounds on this gap.

There are several types of generalization bounds we care about in learning theory, with variations in the way they depend on the training data $S$ and the data-generating distribution $P_n$. The classical bounds (such as VC-bounds) depend on neither. Distribution-dependent bounds are expressed in terms of quantities related to the data-generating distribution (e.g. population mean or variance) and possibly constants, but not the data in any way. These bounds can be helpful for studying the behaviour of a learning method on different distributions: for example, some data-generating distributions might give faster convergence rates than others. Finally, data-dependent bounds are expressed in terms of empirical quantities that can be computed directly from data. These are useful for building and comparing predictors [Catoni, 2007], and also for self-bounding [Freund, 1998] or self-certified [Pérez-Ortiz et al., 2020] learning algorithms, which are learning algorithms that use all the available data to simultaneously provide a predictor and a risk certificate that is valid on unseen examples.

PAC-Bayesian inequalities allow one to derive distribution- or data-dependent generalization bounds in the context of the stochastic prediction model discussed above. The usual PAC-Bayes analysis introduces a reference data-free probability measure $Q^0 \in \mathcal{M}_1(\mathcal{H})$ on the hypothesis space $\mathcal{H}$. The learned data-dependent distribution $Q_S$ is commonly called a posterior, while $Q^0$ is called a prior. However, in contrast to Bayesian learning, the PAC-Bayes prior $Q^0$ acts as an analytical device and may or may not be used by the learning algorithm, and the PAC-Bayes posterior $Q_S$ is unrestricted, so it may be different from the posterior that would be obtained from $Q^0$ through Bayesian inference. In this sense, the PAC-Bayes approach affords an extra level of flexibility in the choice of distributions, even compared to generalized Bayesian approaches [Bissiri et al., 2016].

In the following, for any given $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and $s \in \mathcal{S}$, we write $Q_s[\hat{L}_s] = \int_{\mathcal{H}} \hat{L}_s(h)\, Q_s(dh)$ and $Q_s[L] = \int_{\mathcal{H}} L(h)\, Q_s(dh)$ for the expected empirical loss and the expected population loss, respectively.
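To make the kernel viewpoint concrete, here is a minimal toy sketch (ours, not from the paper) of a Gibbs learner over a finite hypothesis space, together with the $Q_s$-averages just defined. The hypothesis class, loss, and all parameter values are illustrative assumptions.

```python
# A toy instantiation (ours) of a stochastic kernel Q in K(S, H): a Gibbs
# learner over a finite set of threshold classifiers, with the Q_s-averages
# of the empirical loss and a randomized prediction drawn fresh from Q_s.
import numpy as np

rng = np.random.default_rng(0)

# Threshold classifiers h_t(x) = 1[x >= t] on 1-d inputs, with 0-1 loss.
thresholds = np.linspace(-1.0, 1.0, 21)
loss = lambda t, x, y: float((x >= t) != y)

def gibbs_kernel(s, gamma=10.0):
    """The kernel Q: maps a sample s to a probability vector Q_s over hypotheses."""
    emp = np.array([np.mean([loss(t, x, y) for x, y in s]) for t in thresholds])
    w = np.exp(-gamma * emp)          # Q_s(h) proportional to exp(-gamma * L_hat(h, s))
    return w / w.sum(), emp

s = [(x, float(x >= 0.1)) for x in rng.uniform(-1, 1, 50)]  # labels: threshold at 0.1
Q_s, emp = gibbs_kernel(s)

print("expected empirical loss Q_s[L_hat_s]:", float(Q_s @ emp))

# Randomized prediction: draw a fresh hypothesis from Q_s for each prediction.
h = rng.choice(thresholds, p=Q_s)
print("sampled threshold:", h, "-> prediction at x=0.5:", float(0.5 >= h))
```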
The focus of PAC-Bayes analysis is deriving bounds on the gap between $Q_S[L]$ and $Q_S[\hat{L}_S]$. For instance, the classical result of McAllester [1999] says the following: for a fixed data-free distribution $Q^0 \in \mathcal{M}_1(\mathcal{H})$, a bounded loss function with range $[0, 1]$, a stochastic kernel $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over size-$n$ random samples $S$:

$$Q_S[L] - Q_S[\hat{L}_S] \le \sqrt{\frac{\mathrm{KL}(Q_S \| Q^0) + \log\frac{2\sqrt{n}}{\delta}}{2n}}. \tag{1}$$

Here $\mathrm{KL}(\cdot \| \cdot)$ stands for the Kullback-Leibler divergence⁴, which is defined for two given probability distributions $Q, Q^0$ over $\mathcal{H}$ as follows: $\mathrm{KL}(Q \| Q^0) = \int_{\mathcal{H}} \log(dQ/dQ^0)\, dQ$, where $dQ/dQ^0$ denotes the Radon-Nikodym derivative.

⁴ Also known as relative entropy, see e.g. Cover and Thomas [2006].

Note that PAC-Bayes bounds (e.g. McAllester's bound described above) are usually presented under a statement that says that with probability at least $1 - \delta$, the displayed inequality holds simultaneously for all probability distributions $Q$ over $\mathcal{H}$, i.e. with an arbitrary $Q$ replacing $Q_S$. Such a commonly used formulation has the apparent advantage of being valid uniformly for all distributions over $\mathcal{H}$, while our formulation is valid for a fixed kernel. At the same time, the commonly used formulation has the disadvantage of hiding the data-dependence of the posterior distributions used in practice, while our formulation in terms of a stochastic kernel shows the data-dependence explicitly: given the data $S$, the corresponding distribution over $\mathcal{H}$ is $Q_S$. Notice that one fixed stochastic kernel suffices to describe a whole parametric family of distributions (such as Gaussian or Laplace distributions, among others) with parameter values learned from data. Since our main interest is in results for data-dependent distributions (contrasted with results for fixed data-free distributions), we argue in favour of the formulation based on stochastic kernels. These have appeared in the learning theory literature under the names of Markov kernels [Xu and Raginsky, 2017] or regular conditional probabilities [Catoni, 2004, 2007, Alquier, 2008].

A large body of subsequent work focused on refining the PAC-Bayes analysis by means of alternative proof techniques and different ways to measure the gap between $Q_S[L]$ and $Q_S[\hat{L}_S]$. For instance, Langford and Seeger [2001] and Seeger [2002] gave an upper bound on the relative entropy of $Q_S[\hat{L}_S]$ and $Q_S[L]$, commonly called the PAC-Bayes-kl bound [Seldin et al., 2012], which holds with high probability over randomly drawn size-$n$ samples $S$:

$$\mathrm{kl}(Q_S[\hat{L}_S] \,\|\, Q_S[L]) \le \frac{1}{n}\left(\mathrm{KL}(Q_S \| Q^0) + \log\frac{2\sqrt{n}}{\delta}\right). \tag{2}$$

Here $\mathrm{kl}(\cdot \| \cdot)$, appearing on the left-hand side of this inequality, denotes the binary KL divergence, which is by definition the KL divergence between the Bernoulli distributions with the given parameters: $\mathrm{kl}(q \| q') = q \log\frac{q}{q'} + (1-q)\log\frac{1-q}{1-q'}$ for $q, q' \in [0, 1]$. Inequality (2) is tighter than (1) due to Pinsker's inequality $2(p-q)^2 \le \mathrm{kl}(p \| q)$. In fact, by a refined form of Pinsker's inequality, namely $(p-q)^2/(2q) \le \mathrm{kl}(p \| q)$, which is valid for $p < q$ (and tighter than the former when $q < 0.25$), from Eq. (2) one obtains a localised inequality⁵ (see Eq.
(6) of McAllester [2003]), which holds with high probability⁶ over randomly drawn size-$n$ samples $S$:

$$Q_S[L] - Q_S[\hat{L}_S] \;\lesssim\; \sqrt{\frac{Q_S[\hat{L}_S]}{n}\,\mathrm{KL}(Q_S \| Q^0)} \;+\; \frac{1}{n}\,\mathrm{KL}(Q_S \| Q^0). \tag{3}$$

⁵ For $x, b, c$ nonnegative, $x \le c + b\sqrt{x}$ implies $x \le c + b\sqrt{c} + b^2$.
⁶ The notation $\lesssim$ hides universal constants and logarithmic factors.

PAC-Bayes bounds like Eq. (1) and Eq. (3) tell us that the population loss is controlled by a trade-off between the empirical loss and the deviation of the posterior from the prior, as captured by the KL divergence. Note that inequality (3) is tighter than (1) when $Q_S[\hat{L}_S] < Q_S[L] < 0.25$. Obviously, the upper bound in Eq. (3) is dominated by the lower-order (second) term whenever the empirical loss $Q_S[\hat{L}_S]$ is small enough, which makes this inequality very appealing for learning problems based on empirical risk minimization, where the empirical loss is driven to zero. At a high level, such data-dependent upper bounds on the generalization gap are highly desirable, as their empirical terms are closely linked to, and hopefully capture more properties of, the data. In this direction, valuable contributions were made by Tolstikhin and Seldin [2013], who obtained an empirical PAC-Bayes bound similar in spirit to Eq. (3), but controlled by the sample variance of the loss. An alternative direction towards sharper empirical bounds was explored through tunable bounds [Catoni, 2007, van Erven, 2014, Thiemann et al., 2017], which involve a free parameter that offers a trade-off between the empirical error term and the KL(posterior‖prior) term.

Despite their variety and attractive properties, the results discussed above (and the vast majority of the literature) share two crucial limitations: the prior $Q^0$ cannot depend on the training data $S$, and the loss function has to be bounded. It is conceivable that in many realistic situations the population loss is effectively controlled by the KL complexity term: indeed, in most modern learning scenarios (e.g. training deep neural networks) the empirical loss is driven to zero. At the same time, the choice of a fixed data-free prior essentially amounts to a wild guess about what the posterior will look like. Therefore, allowing prior distributions to be data-dependent introduces much needed flexibility, since this opens up the possibility of minimizing upper bounds in both the posterior and the prior, which should lead to tighter empirical bounds on $Q_S[L]$ and tighter risk certificates.

These limitations have been removed in the PAC-Bayesian literature in special cases. For instance, Ambroladze et al. [2007] and Parrado-Hernández et al. [2012] used priors that were trained on a held-out portion of the available data, thus enabling empirical bounds with PAC-Bayes priors that are data-dependent, but independent of the training set. Priors that depend on the full training set have also been studied recently. Thiemann et al. [2017] proposed to construct a prior as a mixture of point masses at a finite number of data-dependent hypotheses trained on a $k$-fold split of the training set, effectively a data-dependent prior. Another approach was proposed by Dziugaite and Roy [2018b]: rather than splitting the training data, they require the data-dependent prior $Q^0_s$ (where $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$) to be stable with respect to small changes in the composition of the $n$-tuple $s$. As we will see shortly, there is benefit in relaxing the restrictions of the usual PAC-Bayes literature.
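As an aside on evaluating bounds in the relative-entropy form of Eq. (2) in practice: one inverts the binary KL divergence numerically. The sketch below is ours, not code from the paper; it uses the constant $\log(2\sqrt{n}/\delta)$ in the reconstructed form of Eq. (2), and all numeric inputs are illustrative.

```python
# A small sketch (ours): evaluating the PAC-Bayes-kl bound of Eq. (2) requires
# computing kl^{-1}(p_hat, c) = max{ p : kl(p_hat || p) <= c }, which bisection
# finds easily since kl(p_hat || .) is increasing on [p_hat, 1].
import math

def binary_kl(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(p_hat, c, tol=1e-9):
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(p_hat, mid) <= c:
            lo = mid            # mid still satisfies the constraint; move up
        else:
            hi = mid
    return lo

# Illustrative values: empirical randomized loss 0.05, KL = 5, n = 10000.
n, KL, delta, emp = 10_000, 5.0, 0.05, 0.05
c = (KL + math.log(2 * math.sqrt(n) / delta)) / n
print("PAC-Bayes-kl upper bound on Q_S[L]:", kl_inverse(emp, c))
```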
## 2 Our Contributions

In this paper we discuss a basic PAC-Bayes inequality (Theorem 1 below) and a general template for PAC-Bayesian bounds (Theorem 2 below). The formulation of both results is based on representing data-dependent distributions as stochastic kernels. To make a case for the usefulness of this approach, we show that our Theorem 2 encompasses many usual bounds from the literature [McAllester, 1998, 1999, Seeger, 2002, Catoni, 2007, Thiemann et al., 2017], while at the same time enabling new PAC-Bayes inequalities. Importantly, our study takes a critical stand on the usual assumptions on which PAC-Bayes inequalities are based, namely, (a) data-free prior, (b) bounded loss, and (c) i.i.d. data observations. We aim to clarify the role of these assumptions and to illustrate how to obtain PAC-Bayes inequalities when these assumptions are removed. As we will soon see, the analysis leading to our Theorem 2 shows that PAC-Bayes priors can be data-dependent by default, and also that the underlying loss function can be unbounded by default. Furthermore, the proof of our Theorem 2 does not rely on the assumption of i.i.d. data observations, which may enable new results for statistically dependent data in future research.

For illustration, our general PAC-Bayes theorem⁷ for stochastic kernels (Theorem 2 in Section 3), in specialized form, implies that for any convex function $F : \mathbb{R}^2 \to \mathbb{R}$, for any stochastic kernels $Q, Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ over randomly drawn $S$ one has

$$F(Q_S[L], Q_S[\hat{L}_S]) \le \mathrm{KL}(Q_S \| Q^0_S) + \log(\xi(Q^0)/\delta), \tag{4}$$

where $\xi(Q^0)$ is the exponential moment of $F(L(h), \hat{L}_s(h))$, which is defined as follows:

$$\xi(Q^0) = \int_{\mathcal{S}}\int_{\mathcal{H}} e^{F(L(h),\, \hat{L}_s(h))}\, Q^0_s(dh)\, P_n(ds).$$

⁷ Generic PAC-Bayes theorems, similar in spirit to ours, have been presented before, e.g. by Audibert [2004], Germain et al. [2009], Bégin et al. [2014, 2016], but only with fixed data-free priors.

Observe that Eq. (4) holds for an arbitrary convex function $F$. This way the usual bounds are encompassed: $F(x, y) = 2n(x-y)^2$ yields a McAllester [1999]-type bound, $F(x, y) = n\,\mathrm{kl}(y \| x)$ gives the bound of Seeger [2002], and $F(x, y) = n \log\frac{1}{1 - x(1 - e^{-\lambda})} - \lambda n y$ gives the bound of Catoni [2007]. Furthermore, $F(x, y) = n(x-y)^2/(2x)$ leads to the so-called PAC-Bayes-λ bound of Thiemann et al. [2017], or to the bound of Rivasplata et al. [2019], which holds under the usual requirements of a fixed data-free prior $Q^0$, losses within the $[0, 1]$ range, and i.i.d. data:

$$Q_S[L] \le \left(\sqrt{Q_S[\hat{L}_S] + \frac{\mathrm{KL}(Q_S \| Q^0) + \log\frac{2\sqrt{n}}{\delta}}{2n}} + \sqrt{\frac{\mathrm{KL}(Q_S \| Q^0) + \log\frac{2\sqrt{n}}{\delta}}{2n}}\right)^{2}. \tag{5}$$

As a consequence of the universality of Eq. (4), besides the usual bounds we may derive novel bounds, e.g. with data-dependent priors $Q^0_S$. Conceptually, our approach splits the usual PAC-Bayesian analysis into two components: (i) choose an $F$ to use in Eq. (4), and (ii) obtain an upper bound on the exponential moment $\xi(Q^0)$. The cost of generality is that for each specific choice of bound (technically, a choice of a function $F$ and a kernel $Q^0$) we need to study the exponential moment $\xi(Q^0)$ and, in particular, provide a reasonable, possibly data-dependent upper bound on it. We stress that the only technical step necessary for the introduction of a data-dependent prior is a bound on $\xi(Q^0)$; the rest is taken care of by Eq. (4). While previous works⁸ analysed the exponential moment separately, as we do here, to the best of our knowledge they considered data-free priors only.

⁸ Audibert and Bousquet [2007], Alquier et al. [2016], among others, for the case of fixed data-free priors.
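To illustrate the two-step recipe, the sketch below (ours) instantiates step (i) with $F(x, y) = n(x-y)^2/(2x)$ and step (ii) with the standard exponential moment bound $\xi \le 2\sqrt{n}$, valid for data-free priors and $[0,1]$ losses, which after solving the quadratic inequality yields the bound displayed in Eq. (5); all numeric inputs are illustrative.

```python
# A sketch (ours) of the two-step recipe around Eq. (4): plug a convex F and an
# exponential-moment bound into KL + log(xi/delta), then solve for Q_S[L].
# With F(x, y) = n (x - y)^2 / (2 x) and xi <= 2 sqrt(n), solving the quadratic
# in sqrt(Q_S[L]) gives the closed form below (Eq. (5)).
import math

def pac_bayes_quadratic(emp_loss, kl, n, delta):
    c = (kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n)
    return (math.sqrt(emp_loss + c) + math.sqrt(c)) ** 2

print(pac_bayes_quadratic(emp_loss=0.05, kl=5.0, n=10_000, delta=0.05))
```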
We think our work is the first to point out techniques for upper-bounding $\xi(Q^0)$ when $Q^0$ is a stochastic kernel, and to present PAC-Bayesian inequalities where the prior is data-dependent by default. Our work also clarifies where and how the data-free nature of the priors was used in previous works. We emphasize that in this paper the main focus is on using data-dependent priors in the PAC-Bayes analysis. Again, we point out that the proof of the basic PAC-Bayes inequality (Theorem 1 below) requires neither fixed data-free priors, nor bounded loss functions, nor i.i.d. data observations. The same can be said of Theorem 2, a consequence of Theorem 1(ii), which gives a general template for deriving PAC-Bayes bounds. Below we discuss three generalization bounds with data-dependent priors, two of which are for bounded losses, while the third is for the unbounded square loss.

### 2.1 A PAC-Bayes bound with a data-dependent Gibbs prior

Choosing as prior an empirical Gibbs distribution $Q^0_s(dh) \propto e^{-\gamma \hat{L}(h,s)}\,\mu(dh)$ for some fixed $\gamma > 0$ and base measure $\mu$ over $\mathcal{H}$, we derive a novel PAC-Bayes bound. Recall that $s$ is the size-$n$ sample. We use $F(x, y) = \sqrt{n}(x - y)$, and we prove that in this case the exponential moment $\xi(Q^0)$ satisfies $\log(\xi(Q^0)) \le \frac{2\gamma}{\sqrt{n}} + \frac{1}{2}$. The proof (Appendix B) is based on the algorithmic stability argument for Gibbs densities, inspired by the proof of Kuzborskij et al. [2019, Theorem 1]. Combining this with Eq. (4), for any kernel $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ over size-$n$ i.i.d. samples $S$ we have

$$Q_S[L] - Q_S[\hat{L}_S] \le \frac{1}{\sqrt{n}}\left(\mathrm{KL}(Q_S \| Q^0_S) + \frac{2\gamma}{\sqrt{n}} + \frac{1}{2} + \log\frac{1}{\delta}\right). \tag{6}$$

Notice that this prior allowed us to remove the $\log(n)$ factor from the usual PAC-Bayes bounds (see our Eq. (1) and Eq. (2) above). This was one of the important contributions of Catoni [2007], who also used a data-dependent Gibbs distribution, see Catoni [2007, Theorem 1.2.4, Theorem 1.3.1, & corollaries]. Interestingly, the choice $Q = Q^0$ gives the smallest right-hand side in Eq. (6) (however, it does not necessarily minimize the bound on $Q_S[L]$), which leads to the following for the Gibbs learner: $Q_S[L] - Q_S[\hat{L}_S] \lesssim 1/\sqrt{n} + \gamma/n$. Notice that this latter bound has an additive $1/\sqrt{n}$ compared to the bound in expectation of Raginsky et al. [2017].

### 2.2 PAC-Bayes bounds with DP-stable data-dependent priors

Next we discuss an approach to convert any PAC-Bayes bound with a usual data-free prior into a bound with a stable data-dependent prior, which is accomplished by generalizing a technique from Dziugaite and Roy [2018b]. Essentially, they show (see Appendix C) that for any fixed data-free distribution $Q^\star \in \mathcal{M}_1(\mathcal{H})$ and stochastic kernel $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ satisfying the DP($\epsilon$) property⁹, one can turn the inequality

$$F(Q_S[L], Q_S[\hat{L}_S]) \le \mathrm{KL}(Q_S \| Q^\star) + \log(\xi(Q^\star)/\delta)$$

into

$$F(Q_S[L], Q_S[\hat{L}_S]) \le \mathrm{KL}(Q_S \| Q^0_S) + \log(2\,\xi(Q^\star)/\delta) + \frac{n\epsilon^2}{2}. \tag{7}$$

⁹ DP($\epsilon$) stands for differential privacy with parameter $\epsilon$. See Appendix C for details on this property.

In other words, if Eq. (4) holds with a data-free prior $Q^\star$, then Eq. (7) holds with a data-dependent prior that is distributionally stable (i.e. satisfies DP($\epsilon$)). Note that different choices of $F$ would lead to different bounds on $\xi(Q^\star)$: essentially, the upper bounds on the exponential moment typically considered in the PAC-Bayesian literature. For example, taking $F(x, y) = n\,\mathrm{kl}(y \| x)$ one can show that $\xi(Q^\star) \le 2\sqrt{n}$ [Maurer, 2004], and this leads to Theorem 4.2 of Dziugaite and Roy [2018b]: if $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ satisfies the DP($\epsilon$) property, then for any kernel $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ over size-$n$ i.i.d. samples $S$ we have

$$\mathrm{kl}(Q_S[\hat{L}_S] \,\|\, Q_S[L]) \le \frac{1}{n}\left(\mathrm{KL}(Q_S \| Q^0_S) + \log\frac{4\sqrt{n}}{\delta}\right) + \frac{\epsilon^2}{2}.$$

Eq. (7) is a general version of this result, whose derivation is based on the notion of max-information [Dwork et al., 2015a]. The details of the general conversion recipe are given in Appendix C.
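A numerical illustration of the conversion (ours; it relies on the reconstructed forms of Eq. (2) and Theorem 4.2 above, so treat the exact constants as our reading of the source rather than the paper's verbatim statement):

```python
# A sketch (ours) of the cost of a DP-stable data-dependent prior: the kl-bound
# radius (KL + log(2*sqrt(n)/delta))/n for a data-free prior becomes
# (KL + log(4*sqrt(n)/delta))/n + eps^2/2, i.e. privacy is paid for by an
# additive eps^2/2 and a doubled log factor, hopefully offset by a smaller KL.
import math

def kl_radius_data_free(kl, n, delta):
    return (kl + math.log(2 * math.sqrt(n) / delta)) / n

def kl_radius_dp_prior(kl, n, delta, eps):
    return (kl + math.log(4 * math.sqrt(n) / delta)) / n + eps**2 / 2

n, delta = 10_000, 0.05
print("data-free prior, KL=5: ", kl_radius_data_free(kl=5.0, n=n, delta=delta))
print("DP prior, KL=2, eps=0.05:", kl_radius_dp_prior(kl=2.0, n=n, delta=delta, eps=0.05))
```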
### 2.3 A generalization bound for the square loss with a data-dependent prior

Our third and last contribution is a novel bound for the setting of learning linear predictors with the square loss. This demonstrates the full power of our take on the PAC-Bayes analysis, as we consider a regression problem with the unbounded square loss and a data-dependent prior. In fact, our framework of data-dependent priors makes it possible to obtain the problem-dependent bound in Eq. (8) for square loss regression. We are not aware of an equivalent previous result.

In this setting, the input space is $\mathcal{X} = \mathbb{R}^d$ and the label space is $\mathcal{Y} = \mathbb{R}$. A linear predictor is of the form $h_w : \mathbb{R}^d \to \mathbb{R}$ with $h_w(x) = w^\top x$ for $x \in \mathbb{R}^d$, where of course $w \in \mathbb{R}^d$. Hence $h_w$ may be identified with the weight vector $w$, and correspondingly the hypothesis space $\mathcal{H}$ may be identified with the weight space $\mathcal{W} = \mathbb{R}^d$. The size-$n$ random sample is $S = ((X_1, Y_1), \ldots, (X_n, Y_n)) \in (\mathbb{R}^d \times \mathbb{R})^n$. The population and empirical losses are defined with respect to the square loss function:

$$L(w) = \tfrac{1}{2}\,\mathbb{E}\big[(w^\top X_1 - Y_1)^2\big] \quad\text{and}\quad \hat{L}_S(w) = \frac{1}{2n}\sum_{i=1}^{n}(w^\top X_i - Y_i)^2.$$

The population covariance matrix is $\Sigma = \mathbb{E}[X_1 X_1^\top] \in \mathbb{R}^{d \times d}$ and its eigenvalues are $\lambda_1 \ge \cdots \ge \lambda_d$. The sample covariance matrix is $\hat{\Sigma} = (X_1 X_1^\top + \cdots + X_n X_n^\top)/n$, with eigenvalues $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_d$, and its regularized version is $\hat{\Sigma}_\lambda = \hat{\Sigma} + \lambda I$ for $\lambda > 0$. Note that the $\hat{\lambda}_i$ are data-dependent. Consider the prior $Q^0_{\gamma,\lambda}$ with density $q^0_{\gamma,\lambda}(w) \propto e^{-\frac{\gamma\lambda}{2}\|w\|^2}$ for some $\gamma, \lambda > 0$ that possibly depend on the data. In this setting, we prove (Appendix D) that for any posterior $Q \in \mathcal{K}(\mathcal{S}, \mathcal{W})$, for any $\gamma > 0$ and any $\lambda > \max_i\{\lambda_i - \hat{\lambda}_i\}$, with probability one over size-$n$ random samples $S$ we have

$$Q_S[L] - Q_S[\hat{L}_S] \le \min_{w \in \mathbb{R}^d} L(w) + \frac{1}{\gamma}\,\mathrm{KL}(Q_S \,\|\, Q^0_{\gamma,\lambda}) + \frac{1}{2\gamma}\sum_{i=1}^{d}\log\frac{\lambda}{\lambda + \hat{\lambda}_i - \lambda_i}. \tag{8}$$

A straightforward observation is that this generalization bound holds with probability one over the distribution of size-$n$ random samples. This is a stronger result than the usual high-probability bounds. Of course one may derive a high-probability bound from Eq. (8) by an application of Markov's inequality, but that would make the result weaker. The with-probability-one result, for instance, allows us to select the best out of a countable collection of $\lambda$ values at no extra cost, while a high-probability bound would need to pay a union-bound price for such a selection.

Notice that we are not necessarily assuming bounded inputs or labels. Our bound depends on the data-generating distribution (possibly of unbounded support) via the spectra of the covariance matrices. While this is apparent from the last term in Eq. (8), in fact the KL(posterior‖prior) term also depends on the covariances (see Proposition 12 in Appendix D). In particular, if the data inputs are independent sub-gaussian random vectors, then with high probability $|\hat{\lambda}_i - \lambda_i| \lesssim \sqrt{d/n}$, and the last term in Eq. (8) then behaves as $\sum_{i=1}^{d}\log\frac{\lambda}{\lambda + \hat{\lambda}_i - \lambda_i} \lesssim d/\sqrt{n}$. This can of course be extended to heavy-tailed distributions or, in general, to any input distributions such that the spectrum of the covariance matrix concentrates well [Vershynin, 2011]. The explicit dependence on the spectrum of the sample covariance matrix opens interesting avenues for distribution-dependent analysis. The above argument can be extended to heavy-tailed data distributions, where in some cases we can have concentration of the smallest eigenvalue of a sample covariance matrix even for unbounded instances, see Vershynin [2011, Section 5.4.2].
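The sketch below (ours) checks the eigenvalue condition and evaluates the spectral term on synthetic Gaussian data; it uses Eq. (8) in the reconstructed form above, so the exact expression should be treated as our reading of the source, and the dimensions and parameter values are illustrative.

```python
# A small synthetic check (ours) of the data-dependent spectral term in Eq. (8):
# compare the population spectrum with the sample spectrum and evaluate
# (1/2) sum_i log(lam / (lam + hat_lambda_i - lambda_i)).
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 2000, 1.0

Sigma = np.diag(np.linspace(2.0, 0.5, d))     # population covariance; eigenvalues lambda_i
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n                       # sample covariance; eigenvalues hat_lambda_i

lam_pop = np.linalg.eigvalsh(Sigma)[::-1]     # lambda_1 >= ... >= lambda_d
lam_hat = np.linalg.eigvalsh(Sigma_hat)[::-1]

assert lam > np.max(lam_pop - lam_hat), "condition lambda > max_i(lambda_i - hat_lambda_i)"
spectral_term = 0.5 * np.sum(np.log(lam / (lam + lam_hat - lam_pop)))
print("spectral term:", spectral_term)        # shrinks as the spectra concentrate (n grows)
```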
Moreover, our technique allows us to combine the PAC-Bayes analysis with specific applications by considering various data distributions. For instance, we can obtain bounds for structured data by analyzing the eigenvalues of the corresponding (sparse or blocked) covariance matrices [Wainwright, 2019], thus revealing a finer-grained dependence on the distribution compared to the usual PAC-Bayes bounds. Similarly, one can obtain generalization bounds for statistically dependent data by looking at the concentration of the covariance with dependent observations [de la Peña and Giné, 2012].

An important component of the proof of Eq. (8) is the following identity for the exponential moment of $f = \gamma(L(w) - \hat{L}_S(w))$ under the prior distribution: for $\lambda > \max_i\{\lambda_i - \hat{\lambda}_i\}$, with probability one over random samples $S$,

$$Q^0_{\gamma,\lambda}[e^f] = \exp\left(\gamma\left[\min_{w \in \mathbb{R}^d} L(w) - \min_{w \in \mathbb{R}^d}\Big(\hat{L}_S(w) + \tfrac{\lambda}{2}\|w\|^2\Big)\right]\right)\prod_{i=1}^{d}\sqrt{\frac{\lambda}{\lambda + \hat{\lambda}_i - \lambda_i}}. \tag{9}$$

This identity computes the exponential moment of $f$ under the prior distribution explicitly. It also explains why the upper bound in Eq. (8) contains the term $\min_{w \in \mathbb{R}^d} L(w)$. The latter should be understood as the label noise. This term disappears in a noise-free problem, while given a distribution-dependent boundedness of the loss function, the term concentrates well around zero (see Proposition 11 in Appendix D). We comment on the free parameter $\gamma$ in Appendix D. Finally, note that Eq. (9) elucidates an equivalence between the concentration of the eigenvalues of the sample covariance matrix and the concentration of the empirical loss. Indeed, for simplicity assuming a noise-free setting (that is, $\min_{w \in \mathbb{R}^d} L(w) = 0$), we observe that whenever $(\hat{\lambda}_i - \lambda_i) \to 0$ as $n \to \infty$ for i.i.d. instances, we have $\hat{L}_S(w) \to L(w)$. This provides an alternative way to control the concentration, compared to works based on restrictions on the loss, e.g. by Germain et al. [2016] and Holland [2019]. We discuss another PAC-Bayes bound for unbounded losses in Appendix E.

## 3 Our PAC-Bayes theorem for stochastic kernels

The following results involve data- and hypothesis-dependent functions $f : \mathcal{S} \times \mathcal{H} \to \mathbb{R}$. Notice that the order $\mathcal{S} \times \mathcal{H}$ is immaterial; functions $\mathcal{H} \times \mathcal{S} \to \mathbb{R}$ are treated the same way. It will be convenient to define $f_s(h) = f(s, h)$. If $\rho \in \mathcal{M}_1(\mathcal{H})$ is a data-free distribution, we will write $\rho[f_s]$ to denote the $\rho$-average of $f_s(\cdot)$ for fixed $s$, that is, $\rho[f_s] = \int_{\mathcal{H}} f_s(h)\, \rho(dh)$. When $\rho$ is data-dependent, that is, $\rho \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ is a stochastic kernel, we will write $\rho_s$ for the distribution over $\mathcal{H}$ corresponding to a fixed $s$, so $\rho_s(B) = \rho(s, B)$ for $B \in \Sigma_{\mathcal{H}}$, and $\rho_s[f_s] = \int_{\mathcal{H}} f_s(h)\, \rho_s(dh)$.

The joint distribution over $\mathcal{S} \times \mathcal{H}$ defined by $P \in \mathcal{M}_1(\mathcal{S})$ and $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ is the measure, denoted¹⁰ $P \otimes Q$, that acts on functions $\varphi : \mathcal{S} \times \mathcal{H} \to \mathbb{R}$ as follows:

$$(P \otimes Q)[\varphi] = \int_{\mathcal{S}}\int_{\mathcal{H}} \varphi(s, h)\, Q_s(dh)\, P(ds).$$

Drawing a random pair $(S, H) \sim P \otimes Q$ is equivalent to drawing $S \sim P$ and then drawing $H \sim Q_S$. In this case, with $\mathbb{E}$ denoting the expectation under the joint distribution $P \otimes Q$, the previous display takes the form $\mathbb{E}[\varphi(S, H)] = \mathbb{E}[\mathbb{E}[\varphi(S, H) \mid S]]$. Our basic result is the following theorem.

¹⁰ The notation $P \otimes Q$ (see e.g. Kallenberg [2017]), used here for the joint distribution over $\mathcal{S} \times \mathcal{H}$ defined by $P \in \mathcal{M}_1(\mathcal{S})$ and $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$, corresponds to what in Bayesian learning is commonly written $Q_{H|S} P_S$.

**Theorem 1 (Basic PAC-Bayes inequality)** Fix a probability measure $P \in \mathcal{M}_1(\mathcal{S})$, a stochastic kernel $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$, and a measurable function $f : \mathcal{S} \times \mathcal{H} \to \mathbb{R}$, and let

$$\xi = \int_{\mathcal{S}}\int_{\mathcal{H}} e^{f(s,h)}\, Q^0_s(dh)\, P(ds).$$

(i) For any $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of a pair $(S, H) \sim P \otimes Q$ we have

$$f(S, H) \le \log\frac{dQ_S}{dQ^0_S}(H) + \log(\xi/\delta).$$

(ii) For any $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of $S \sim P$ we have

$$Q_S[f_S] \le \mathrm{KL}(Q_S \| Q^0_S) + \log(\xi/\delta).$$

To the best of our knowledge, this theorem is new. Notice that $Q^0$ is by default a stochastic kernel from $\mathcal{S}$ to $\mathcal{H}$.
Hence, given data $S$, the prior $Q^0_S$ is a data-dependent distribution over hypotheses. By contrast, the usual PAC-Bayes approaches assume that $Q^0$ is a data-free distribution. Also note that the function $f$ is unrestricted, and the distribution $P \in \mathcal{M}_1(\mathcal{S})$ is unrestricted, except for integrability conditions ensuring that $\xi$ is finite. A key step of the proof involves a well-known change of measure that can be traced back to Csiszár [1975] and Donsker and Varadhan [1975].

**Proof** Recall that when $Y$ is a positive random variable, by Markov's inequality, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ we have

$$\log Y \le \log \mathbb{E}[Y] + \log(1/\delta). \tag{$\star$}$$

Let $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$, and let $\mathbb{E}^0$ denote expectation under the joint distribution $P \otimes Q^0$. Thus if $S \sim P$ and $H \sim Q^0_S$, we have $\xi = \mathbb{E}^0[e^{f(S,H)}] = \mathbb{E}^0[\mathbb{E}^0[e^{f(S,H)} \mid S]]$. Let $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and denote by $\mathbb{E}$ the expectation under the joint distribution $P \otimes Q$. Then by a change of measure we may re-write $\xi = \mathbb{E}^0[e^{f(S,H)}]$ as

$$\xi = \mathbb{E}\left[e^{f(S,H) - \log\frac{dQ_S}{dQ^0_S}(H)}\right] = \mathbb{E}[e^D] \quad\text{with}\quad D = f(S, H) - \log\frac{dQ_S}{dQ^0_S}(H).$$

(i) Applying inequality ($\star$) to $Y = e^D$, with probability at least $1 - \delta$ over the random draw of the pair $(S, H) \sim P \otimes Q$ we get $D \le \log \mathbb{E}[e^D] + \log(1/\delta)$. (ii) Recall $f_S(H) = f(S, H)$. Notice that $\mathbb{E}[D \mid S] = Q_S[f_S] - \mathrm{KL}(Q_S \| Q^0_S)$. By Jensen's inequality, $\mathbb{E}[D \mid S] \le \log \mathbb{E}[e^D \mid S]$, while from ($\star$) applied to $Y = \mathbb{E}[e^D \mid S]$, with probability at least $1 - \delta$ over the random draw of $S \sim P$ we have $\log \mathbb{E}[e^D \mid S] \le \log \mathbb{E}[e^D] + \log(1/\delta)$. Combining the last two displays and $\mathbb{E}[e^D] = \xi$ gives $Q_S[f_S] - \mathrm{KL}(Q_S \| Q^0_S) \le \log(\xi/\delta)$. $\square$

Suppose the function $f$ is of the form $f = F \circ A$ with $A : \mathcal{S} \times \mathcal{H} \to \mathbb{R}^k$ and $F : \mathbb{R}^k \to \mathbb{R}$ convex. In this case, by Jensen's inequality we have $F(Q_s[A_s]) \le Q_s[F(A_s)]$, and Theorem 1(ii) gives:

**Theorem 2 (PAC-Bayes for stochastic kernels)** For any $P \in \mathcal{M}_1(\mathcal{S})$, any $Q^0 \in \mathcal{K}(\mathcal{S}, \mathcal{H})$, any positive integer $k$, any measurable function $A : \mathcal{S} \times \mathcal{H} \to \mathbb{R}^k$ and convex function $F : \mathbb{R}^k \to \mathbb{R}$, let $f = F \circ A$ and let $\xi = (P \otimes Q^0)[e^f]$ as in Theorem 1. Then for any $Q \in \mathcal{K}(\mathcal{S}, \mathcal{H})$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of $S \sim P$ we have

$$F(Q_S[A_S]) \le \mathrm{KL}(Q_S \| Q^0_S) + \log(\xi/\delta). \tag{10}$$

This theorem is a general template for deriving PAC-Bayes bounds, not just with data-free priors, but more generally with data-dependent priors. Previous works (see Section 4 below) that presented similar generic templates for deriving PAC-Bayes bounds considered data-free priors only. We emphasize that a data-free distribution is equivalent to a constant stochastic kernel: $Q^0_s = Q^0_{s'}$ for all $s, s' \in \mathcal{S}$. Hence $\mathcal{M}_1(\mathcal{H}) \subset \mathcal{K}(\mathcal{S}, \mathcal{H})$, which implies that our Theorem 2 encompasses the usual PAC-Bayes inequalities with data-free priors in the literature. Interestingly, our Theorem 2 is valid with any normed space in place of $\mathbb{R}^k$. The theorem extends the typically used case where $k = 2$ and $A(s, h) = (L(h), \hat{L}(h, s))$, in which case the function of interest is $f(s, h) = F(L(h), \hat{L}(h, s))$, where $F : \mathbb{R}^2 \to \mathbb{R}$ is a convex function, but there are no restrictions on the loss function used in defining $L(h)$ and $\hat{L}(h, s)$. Hence Theorem 2 is valid for any loss function: convex or non-convex, bounded or unbounded. Notice also that our Theorem 2 holds for any $P \in \mathcal{M}_1(\mathcal{S})$, i.e. without restrictions on the data-generating process. In particular, our Theorem 2 holds without the i.i.d. data assumption, hence it could potentially enable new generalization bounds for statistically dependent data. In Section 4 below we comment on some literature related to unbounded losses and non-i.i.d. data.
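For readers who want the change of measure spelled out, here is an expanded view of the key identity in the proof (our rendering of the standard Donsker-Varadhan-style computation; the absolute-continuity assumption $Q_s \ll Q^0_s$ is made explicit by us):

```latex
% Expanded change-of-measure step (notation as in Theorem 1),
% valid whenever Q_s << Q^0_s for P-almost every s:
\[
  \xi \;=\; \mathbb{E}^0\!\big[e^{f(S,H)}\big]
      \;=\; \int_{\mathcal S}\!\int_{\mathcal H} e^{f(s,h)}\,
            \frac{dQ^0_s}{dQ_s}(h)\, Q_s(dh)\, P(ds)
      \;=\; \mathbb{E}\!\Big[\exp\!\Big(f(S,H) - \log\tfrac{dQ_S}{dQ^0_S}(H)\Big)\Big]
      \;=\; \mathbb{E}\big[e^{D}\big].
\]
% Taking E[D | S] then recovers Q_S[f_S] - KL(Q_S || Q^0_S), since
% E[ log(dQ_S/dQ^0_S)(H) | S ] = KL(Q_S || Q^0_S) by definition of the KL divergence.
```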
An important role is played by $\xi$, the exponential moment (the moment generating function evaluated at 1) of the function $f$ under the joint distribution $P \otimes Q^0$. As discussed in Section 2 above, there are essentially two main steps involved in obtaining a PAC-Bayesian inequality: (i) choose an $F$ to use in Theorem 2, and (ii) upper-bound the exponential moment $\xi$. We emphasize that the usual assumptions on which PAC-Bayes bounds are based, namely, (a) data-free prior, (b) bounded loss, and (c) i.i.d. data, played a role only in the technique used for controlling $\xi$. This is because with a data-free $Q^0$ we may swap the order of integration:

$$\xi = \int_{\mathcal{S}}\int_{\mathcal{H}} e^{f(s,h)}\, Q^0(dh)\, P(ds) = \int_{\mathcal{H}}\int_{\mathcal{S}} e^{f(s,h)}\, P(ds)\, Q^0(dh) =: \xi_{\mathrm{swap}}.$$

Then bounding $\xi$ proceeds by calculating or bounding $\xi_{\mathrm{swap}}$, for which there are readily available techniques for bounded loss functions and i.i.d. data (see e.g. Maurer [2004], Germain et al. [2009], van Erven [2014]). The bounds with data-dependent priors that we presented in Section 2 required different kinds of techniques to control the exponential moment; the details are in the appendices.

To the best of our knowledge, ours is the first work to extend the PAC-Bayes analysis to stochastic kernels. This framework appears to be a promising theoretical tool for obtaining new results. The three types of data-dependent priors discussed in Section 2 show the versatility of the approach. Deriving more cases of PAC-Bayes inequalities without the usual assumptions is left for future research.

## 4 Additional discussion and related literature

The literature on the PAC-Bayes learning approach is vast. We briefly mention the usual references McAllester [1999], Langford and Seeger [2001], Seeger [2002], and Catoni [2007]; but see also Maurer [2004] and Keshet et al. [2011]. Note that McAllester [1999] continued McAllester [1998], whose work was inspired by Shawe-Taylor and Williamson [1997]'s PAC analysis of a Bayesian-style estimator. We acknowledge the tutorials of Langford [2005] and McAllester [2013], the mini-tutorial of van Erven [2014], and the primer of Guedj [2019]. Our Theorem 2 is akin to general forms of the PAC-Bayes theorem given before by Audibert [2004], Germain et al. [2009], and Bégin et al. [2014, 2016]. Our Theorem 1(i) is akin to the pointwise bound of Blanchard and Fleuret [2007], in that the bound holds over the random draw of data and hypothesis pairs.

There are many application areas that have used the PAC-Bayes approach, but there are essentially two ways in which a PAC-Bayes bound is typically applied: either use the bound to give a risk certificate for a randomized predictor learned by some method, or turn the bound itself into a learning method by searching for a randomized predictor that minimizes the bound. The latter is mentioned already by McAllester [1999]; credit for this approach in various contexts is due also to Germain et al. [2009], Seldin and Tishby [2010], Keshet et al. [2011], Noy and Crammer [2014], Keshet et al. [2017], possibly among others. Recently, the latter approach has also found success in training neural networks, see Dziugaite and Roy [2017, 2018b]. In fact, the recent resurgence of interest in the PAC-Bayes approach has been to a large extent motivated by the interest in generalization guarantees for neural networks. Langford and Caruana [2001] used McAllester [1999]'s classical PAC-Bayesian bound to evaluate the error of a (stochastic) neural network classifier. Dziugaite and Roy [2017] obtained numerically non-vacuous generalization bounds by optimizing the same bound.
Subsequent studies (e.g. Rivasplata et al. [2019], Pérez-Ortiz et al. [2020]) continued this approach, sometimes with links to the generalization of stochastic optimization methods (e.g. London [2017], Neyshabur et al. [2018], Dziugaite and Roy [2018a]) or algorithmic stability.

A line of work on connecting PAC-Bayes priors to data was explored by Lever et al. [2013], Pentina and Lampert [2014] and more recently by Rivasplata et al. [2018], who assumed that priors are distribution-dependent. In that setting the priors are still data-free, but in a less agnostic fashion (compared to an arbitrary fixed prior), which allows one to demonstrate improvements for nice data-generating distributions. Data-dependent priors were investigated recently by Awasthi et al. [2020], who relied on tools from empirical process theory and controlled the capacity of a data-dependent hypothesis class (see also Foster et al. [2019]).

The PAC-Bayes literature does contain a line of work that investigates relaxing the restriction to bounded loss functions. A straightforward way to extend PAC-Bayes inequalities to unbounded loss functions is to make assumptions on the tail behaviour of the loss [Alquier et al., 2016, Germain et al., 2016] or its moments [Alquier and Guedj, 2018, Holland, 2019], leading to interesting bounds in special cases. Recent work has also looked into the analysis for heavy-tailed losses. For example, Alquier and Guedj [2018] proposed a polynomial moment-dependent bound with $f$-divergence replacing the KL divergence, while Holland [2019] devised an exponential bound assuming that the second moment of the loss is bounded uniformly across hypotheses. An alternative approach was explored by Kuzborskij and Szepesvári [2019], who proposed a stability-based approach by controlling the Efron-Stein variance proxy of the loss. Squared loss regression was studied by Shalaeva et al. [2020], who improved the results of Germain et al. [2016] and also relaxed the data-generation assumption to non-i.i.d. data. It is worth mentioning the important work related to extending the PAC-Bayes framework to statistically dependent data, see e.g. Alquier and Wintenberger [2012], who applied Rio [2000]'s version of Hoeffding's inequality, derived PAC-Bayes bounds for non-i.i.d. data, and used them in model selection for time series.

As we mentioned in the introduction, besides randomized predictions, other prediction schemes may be derived from a learned distribution over hypotheses. Aggregation by exponential weighting was considered by Dalalyan and Tsybakov [2007, 2008], ensembles of decision trees were considered by Lorenzen et al. [2019], and weighted majority votes by Masegosa et al. [2020] and Germain et al. [2015]. This list is far from complete.

Finally, it is worth mentioning that the PAC-Bayesian analysis extends beyond bounds on the gap between population and empirical losses: a large body of literature has also looked into upper and lower bounds on the excess risk, namely $Q_S[L] - \inf_{h \in \mathcal{H}} L(h)$; we refer e.g. to Catoni [2007], Alquier et al. [2016], Grünwald and Mehta [2019], Kuzborskij et al. [2019], Mhammedi et al. [2019]. The approach of analyzing the gap (for randomized predictors), which we follow in this paper, is generally complementary to such excess risk analyses.

## Broader Impact

We think this work will have a positive impact on the theoretical machine learning community.
However, since this work presents a high-level theoretical framework, its direct impact on society will be linked to the particular user-specific applications where this framework may be instantiated.

## Acknowledgments and Disclosure of Funding

We warmly thank the anonymous reviewers for their valuable feedback, which helped us to improve the paper greatly. For comments on various early parts of this work we warmly thank Tor Lattimore, Yevgeny Seldin, Tim van Erven, Benjamin Guedj, and Pascal Germain. We warmly acknowledge the Foundations team at DeepMind, and the AI Centre at University College London, for providing friendly and stimulating work environments. Omar Rivasplata and Ilja Kuzborskij warmly thank Vitaly Feldman for interesting discussions and a fun table tennis game while visiting DeepMind. Omar Rivasplata gratefully acknowledges DeepMind sponsorship for carrying out research studies on the theoretical foundations of machine learning and AI at University College London. This work was done while Omar was a research scientist intern at DeepMind. Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, the Alberta Machine Intelligence Institute (Amii), and the Natural Sciences and Engineering Research Council (NSERC) of Canada. John Shawe-Taylor gratefully acknowledges support and funding from the U.S. Army Research Laboratory and the U.S. Army Research Office, and from the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1.

## References

P. Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathematical Methods of Statistics, 17(4):279–304, 2008.

P. Alquier and B. Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, May 2018.

P. Alquier and O. Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 18(3):883–913, 2012.

P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(1):8374–8414, 2016.

A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems (NIPS), pages 9–16, 2007.

J.-Y. Audibert. A better variance control for PAC-Bayesian classification. Preprint, 2004.

J.-Y. Audibert and O. Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–889, 2007.

P. Awasthi, S. Kale, S. Karp, and M. Mohri. PAC-Bayes learning bounds for sample-dependent priors. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy. PAC-Bayesian theory for transductive learning. In Artificial Intelligence and Statistics (AISTATS), pages 105–113, 2014.

L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy. PAC-Bayesian bounds based on the Rényi divergence. In Artificial Intelligence and Statistics (AISTATS), pages 435–444, 2016.

P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.

G. Blanchard and F. Fleuret. Occam's hammer. In Conference on Learning Theory (COLT), pages 112–126. Springer, 2007.

S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

O. Catoni.
Statistical learning theory and stochastic optimization: École d'Été de Probabilités de Saint-Flour XXXI-2001. Springer, 2004.

O. Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. IMS Lecture Notes-Monograph Series, 56, 2007. URL www.jstor.org/stable/20461499.

T. M. Cover and J. A. Thomas. Elements of information theory. Wiley, 2nd edition, 2006.

I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158, 1975.

A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Conference on Learning Theory (COLT), pages 97–111. Springer, 2007.

A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.

V. de la Peña and E. Giné. Decoupling: from dependence to independence. Springer, 2012.

M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28, 1975.

C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems (NIPS), pages 2350–2358, 2015a. Our citations refer to the full version arXiv:1506.02629.

C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126. ACM, 2015b.

G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In International Conference on Machine Learning (ICML), pages 1376–1385, 2018a.

G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence (UAI), 2017.

G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems (NeurIPS), pages 8430–8441, 2018b.

S. N. Ethier and T. G. Kurtz. Markov processes: characterization and convergence. Wiley, 1986.

D. J. Foster, S. Greenberg, S. Kale, H. Luo, M. Mohri, and K. Sridharan. Hypothesis set stability and generalization. In Advances in Neural Information Processing Systems (NeurIPS), pages 6729–6739, 2019.

Y. Freund. Self bounding learning algorithms. In Conference on Learning Theory (COLT), pages 247–258. ACM, 1998.

P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pages 353–360. ACM, 2009.

P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and J.-F. Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16:787–860, 2015.

P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems (NIPS), pages 1884–1892, 2016.

P. D. Grünwald and N. A. Mehta. A tight excess risk bound via a unified PAC-Bayesian Rademacher-Shtarkov-MDL complexity. In Algorithmic Learning Theory (ALT), volume 98, pages 433–465. PMLR, 2019.

B. Guedj. A primer on PAC-Bayesian learning. arXiv:1901.05353, 2019.

M. Holland.
PAC-Bayes under potentially heavy tails. In Advances in Neural Information Processing Systems (NeurIPS), pages 2715–2724, 2019.

O. Kallenberg. Random Measures, Theory and Applications. Springer, 2017.

J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2224–2227. IEEE, 2011.

J. Keshet, S. Maji, T. Hazan, and T. Jaakkola. Perturbation models and PAC-Bayesian generalization bounds. In Perturbations, Optimization, and Statistics, pages 289–309. MIT Press, 2017.

I. Kuzborskij and C. Szepesvári. Efron-Stein PAC-Bayesian inequalities. arXiv:1909.01931, 2019. URL https://arxiv.org/abs/1909.01931.

I. Kuzborskij, N. Cesa-Bianchi, and C. Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle. In Conference on Learning Theory (COLT), volume 99, pages 2028–2054. PMLR, 2019.

J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273–306, 2005.

J. Langford and R. Caruana. (Not) bounding the true error. In Advances in Neural Information Processing Systems (NIPS), pages 809–816, 2001.

J. Langford and M. Seeger. Bounds for averaging classifiers. Technical Report CMU-CS-01-102, Carnegie Mellon University, 2001.

G. Lever, F. Laviolette, and J. Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.

B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 2931–2940, 2017.

S. S. Lorenzen, C. Igel, and Y. Seldin. On PAC-Bayesian bounds for random forests. Machine Learning, 108(8-9):1503–1522, 2019.

A. R. Masegosa, S. S. Lorenzen, C. Igel, and Y. Seldin. Second order PAC-Bayesian bounds for the weighted majority vote. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2007.13532.

A. Maurer. A note on the PAC-Bayesian theorem. arXiv:cs/0411099, 2004.

D. A. McAllester. Some PAC-Bayesian theorems. In Conference on Learning Theory (COLT), pages 230–234. ACM, 1998. Also one year later in Machine Learning, 37(3), pages 355–363, 1999.

D. A. McAllester. PAC-Bayesian model averaging. In Conference on Learning Theory (COLT), pages 164–170. ACM, 1999.

D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

D. A. McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv:1307.2118, 2013.

F. McSherry and K. Talwar. Mechanism design via differential privacy. In IEEE Symposium on Foundations of Computer Science (FOCS), volume 7, pages 94–103. IEEE, 2007.

S. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, 2nd edition, 2009.

Z. Mhammedi, P. Grünwald, and B. Guedj. PAC-Bayes un-expected Bernstein inequality. In Advances in Neural Information Processing Systems (NeurIPS), pages 12202–12213, 2019.

B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), 2018.

A. Noy and K. Crammer. Robust forward algorithms via PAC-Bayes and Laplace distributions. In Artificial Intelligence and Statistics (AISTATS), pages 678–686, 2014.

E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors.
Journal of Machine Learning Research, 13(Dec):3507–3531, 2012.

A. Pentina and C. H. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), pages 991–999, 2014.

M. Pérez-Ortiz, O. Rivasplata, J. Shawe-Taylor, and C. Szepesvári. Tighter risk certificates for neural networks. arXiv:2007.12911, 2020.

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Conference on Learning Theory (COLT), 2017.

E. Rio. Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes. Comptes Rendus de l'Académie des Sciences, Series I, Mathematics, 330(10):905–908, 2000.

O. Rivasplata, E. Parrado-Hernández, J. Shawe-Taylor, S. Sun, and C. Szepesvári. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems (NeurIPS), pages 9214–9224, 2018.

O. Rivasplata, V. M. Tankasali, and C. Szepesvári. PAC-Bayes with Backprop. arXiv:1908.07380, 2019.

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269, 2002.

Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11(Dec):3595–3646, 2010.

Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.

V. Shalaeva, A. F. Esfahani, P. Germain, and M. Petreczky. Improved PAC-Bayesian bounds for linear regression. In Conference on Artificial Intelligence (AAAI), 2020.

J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayesian estimator. In Conference on Learning Theory (COLT), pages 2–9. ACM, 1997.

N. Thiemann. PAC-Bayesian ensemble learning. Master's thesis, University of Copenhagen, 2016.

N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin. A strongly quasiconvex PAC-Bayesian bound. In Algorithmic Learning Theory (ALT), pages 466–492, 2017.

I. O. Tolstikhin and Y. Seldin. PAC-Bayes-empirical-Bernstein inequality. In Advances in Neural Information Processing Systems (NIPS), pages 109–117, 2013.

T. van Erven. PAC-Bayes mini-tutorial: A continuous union bound. arXiv:1405.1580, 2014.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2011. Chapter 5 of: Compressed Sensing, Theory and Applications, edited by Y. Eldar and G. Kutyniok, Cambridge University Press, 2012, pages 210–268.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.

A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2524–2533, 2017.