# PAC-Bayes under potentially heavy tails

Matthew J. Holland
Institute of Scientific and Industrial Research, Osaka University
matthew-h@ar.sanken.osaka-u.ac.jp

**Abstract.** We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, assuming only that the first three moments of the loss distribution are bounded, the learning algorithm derived from this estimator achieves nearly sub-Gaussian statistical error, up to the quality of the prior.

## 1 Introduction

More than two decades ago, the origins of PAC-Bayesian learning theory were developed with the goal of strengthening traditional PAC learning guarantees¹ by explicitly accounting for prior knowledge [17, 12, 5]. Subsequent work developed finite-sample risk bounds for "Bayesian" learning algorithms which specify a distribution over the model [13]. These bounds are controlled using the empirical risk and the relative entropy between "prior" and "posterior" distributions, and hold uniformly over the choice of the latter, meaning that the guarantees hold for data-dependent posteriors, hence the naming. Furthermore, choosing the posterior to minimize PAC-Bayesian risk bounds leads to practical learning algorithms which have seen numerous successful applications [2].

Following this framework, a tremendous amount of work has been done to refine, extend, and apply the PAC-Bayesian framework to new learning problems. Tight risk bounds for bounded losses are due to Seeger [15] and Maurer [11], with the former work applying them to Gaussian processes. Bounds constructed using the loss variance in a Bernstein-type inequality were given by Seldin et al. [16], with a data-dependent extension derived by Tolstikhin and Seldin [18]. As stated by McAllester [14], virtually all of the bounds derived in the original PAC-Bayesian theory apply only to bounded loss functions. This technical barrier was overcome by Alquier et al. [2], who introduce an additional error term depending on the concentration of the empirical risk about the true risk. This technique was subsequently applied to the log-likelihood loss in the context of Bayesian linear regression by Germain et al. [9], and further systematized by Bégin et al. [3]. While this approach lets us deal with unbounded losses, naturally the statistical error guarantees are only as good as the confidence intervals available for the empirical mean deviations. In particular, strong assumptions on all of the moments of the loss are essentially unavoidable using the traditional tools espoused by Bégin et al. [3], which means the heavy-tailed regime cannot be handled, where all we assume is that a few higher-order moments are finite (say, finite variance and/or finite kurtosis). A new technique for deriving PAC-Bayesian bounds even under heavy-tailed losses is introduced by Alquier and Guedj [1]; their lucid procedure provides error rates even under heavy tails, but as the authors recognize, the guarantees are sub-optimal at high confidence levels due to direct dependence on the empirical risk, leading in turn to sub-optimal algorithms derived from these bounds.²

¹ PAC: probably approximately correct [19].
² See work by Catoni [6], Devroye et al. [8], and the references within for background on the fundamental limitations of the empirical mean for real-valued random variables.
In this work, while keeping many core ideas of Bégin et al. [3] intact, we use a novel approach to obtain exponential tail bounds on the excess risk via PAC-Bayesian bounds that hold even under heavy-tailed losses. Our key technique is to replace the empirical risk with a new mean estimator inspired by the dimension-free estimators of Catoni and Giulini [7], designed to be computationally convenient. We review some key theory in section 2 before introducing the new estimator in section 3. In section 4 we apply this estimator to the PAC-Bayes setting, deriving a new robust optimal Gibbs posterior. Empirical inquiries into the properties of the new mean estimator are given in section 5. All proofs are relegated to the supplementary materials.

## 2 PAC-Bayesian theory based on the empirical mean

Let us begin by briefly reviewing the best available PAC-Bayesian learning guarantees under general losses. Denote by $z_1,\ldots,z_n \in \mathcal{Z}$ a sequence of independent observations distributed according to a common distribution $\mu$. Denote by $\mathcal{H}$ a model/hypothesis class, from which the learner selects a candidate based on the $n$-sized sample. The quality of this choice can be measured in a pointwise fashion using a loss function $l : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, assumed to satisfy $l \geq 0$. The learning task is to achieve a small risk, defined by $R(h) := \mathbb{E}_\mu\, l(h;z)$. Since the underlying distribution is inherently unknown, the canonical proxy is the empirical risk

$$\widehat{R}(h) := \frac{1}{n}\sum_{i=1}^{n} l(h;z_i), \quad h \in \mathcal{H}.$$

Let $\nu$ and $\rho$ respectively denote prior and posterior distributions on the model $\mathcal{H}$. The so-called Gibbs risk induced by $\rho$, as well as its empirical counterpart, are given by

$$G_\rho := \mathbb{E}_\rho\, R = \int_{\mathcal{H}} R(h)\,d\rho(h), \qquad \widehat{G}_\rho := \mathbb{E}_\rho\, \widehat{R} = \frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{H}} l(h;z_i)\,d\rho(h).$$

When our losses are almost surely bounded, lucid guarantees are available.

**Theorem 1** (PAC-Bayes under bounded losses [13, 3]). *Assume $0 \leq l \leq 1$, and fix an arbitrary prior $\nu$ on $\mathcal{H}$. For any confidence level $\delta \in (0,1)$, we have, with probability no less than $1-\delta$ over the draw of the sample, that*

$$G_\rho \leq \widehat{G}_\rho + \sqrt{\frac{K(\rho;\nu) + \log(2\sqrt{n}\,\delta^{-1})}{2n}}$$

*uniformly in the choice of $\rho$.*

Since the good event on which the inequality in Theorem 1 holds is valid for any choice of $\rho$, the result holds even when $\rho$ depends on the sample, which justifies calling it a posterior distribution. Optimizing this upper bound with respect to $\rho$ leads to the so-called optimal Gibbs posterior, which takes a form that is readily characterized (cf. Remark 13). The above results fall apart when the loss is unbounded, and meaningful extensions become challenging when exponential moment bounds are not available. As highlighted in section 1 above, over the years the analytical machinery has evolved to provide general-purpose PAC-Bayesian bounds even under heavy-tailed data. The following theorem of Alquier and Guedj [1] extends the strategy of Bégin et al. [3] to obtain bounds under the weakest conditions we know of.

**Theorem 2** (PAC-Bayes under heavy-tailed losses [1]). *Take any $p > 1$ and set $q = p/(p-1)$. For any confidence level $\delta \in (0,1)$, we have, with probability no less than $1-\delta$ over the draw of the sample, that*

$$G_\rho \leq \widehat{G}_\rho + \left(\frac{\mathbb{E}_\nu\,|\widehat{R}-R|^q}{\delta}\right)^{1/q}\left(\mathbb{E}_\nu\left(\frac{d\rho}{d\nu}\right)^p\right)^{1/p}$$

*uniformly in the choice of $\rho$.*

For concreteness, consider the case of $p = 2$, where $q = 2/(2-1) = 2$, and assume that the loss variance $\mathrm{var}_\mu\, l(h;z)$ is $\nu$-finite, namely that $V_\nu := \int_{\mathcal{H}} \mathrm{var}_\mu\, l(h;z)\,d\nu(h) < \infty$. From Proposition 4 of Alquier and Guedj [1], we have $\mathbb{E}_\nu\,|\widehat{R}-R|^2 \leq V_\nu/n$. It follows that on the high-probability event, we have

$$G_\rho \leq \widehat{G}_\rho + \sqrt{\frac{V_\nu}{n\,\delta}\,\mathbb{E}_\nu\left(\frac{d\rho}{d\nu}\right)^2}.$$

While the $1/\sqrt{n}$ rate and the dependence on a divergence between $\nu$ and $\rho$ are similar, note that the dependence on the confidence level $\delta \in (0,1)$ is polynomial; compare this with the logarithmic dependence available in Theorem 1 above when the losses were bounded.
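To see concretely how much this polynomial dependence costs, the following minimal Python sketch (all numerical values, including the divergence and variance terms, are hypothetical placeholders chosen purely for illustration) evaluates the two complexity terms side by side as $\delta$ shrinks:

```python
import numpy as np

n = 1000      # sample size (hypothetical)
kl = 5.0      # K(rho; nu), assumed value
V_nu = 4.0    # nu-weighted variance bound V_nu, assumed value
renyi = 2.0   # E_nu (d rho / d nu)^2, assumed value

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    # Theorem 1 (bounded losses): complexity grows like sqrt(log(1/delta)).
    bounded = np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))
    # Theorem 2 with p = q = 2 (heavy tails): complexity grows like sqrt(1/delta).
    heavy = np.sqrt(V_nu * renyi / (n * delta))
    print(f"delta={delta:.0e}  bounded={bounded:.4f}  heavy={heavy:.4f}")
```

Tightening $\delta$ by three orders of magnitude barely moves the bounded-loss term, while the heavy-tailed term grows by a factor of roughly $\sqrt{1000} \approx 32$.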
For comparison, our main result of section 4 is a uniform bound on the Gibbs risk: with probability no less than $1-\delta$, we have

$$G_\rho \leq \widehat{G}_{\rho,\psi} + \frac{1}{\sqrt{n}}\left(K(\rho;\nu) + \frac{\log(8\pi M_2\,\delta^{-2})}{2} + M_2 + \nu_n(\mathcal{H}) - 1\right) + \mathcal{O}\left(\frac{1}{n}\right),$$

where $\widehat{G}_{\rho,\psi}$ is an estimator of $G_\rho$ defined in section 3, $\nu_n(\mathcal{H})$ is a term depending on the quality of the prior $\nu$, and the key constants are bounds such that for all $h \in \mathcal{H}$ we have $M_2 \geq \mathbb{E}_\mu\, l(h;z)^2$. As long as the first three moments are finite, this guarantee holds, and thus both sub-Gaussian and heavy-tailed losses (e.g., with infinite higher-order moments) are permitted. Given any valid $M_2$, the PAC-Bayesian upper bound above can be minimized in $\rho$ based on the data, and thus an optimal Gibbs posterior can also be computed in practice. In section 4, we characterize this robust posterior.

## 3 A new estimator using smoothed Bernoulli noise

**Notation.** In this section, we are dealing with the specific problem of robust mean estimation, thus we specialize our notation slightly. Data observations will be $x_1,\ldots,x_n \in \mathbb{R}$, assumed to be independent copies of $x \sim \mu$. Denote the index set $[k] := \{1,2,\ldots,k\}$. Write $\mathcal{M}_1^+(\Omega,\mathcal{A})$ for the set of all probability measures defined on the measurable space $(\Omega,\mathcal{A})$. Write $K(P;Q)$ for the relative entropy between measures $P$ and $Q$ (also known as the KL divergence; definition in the appendix). We shall typically suppress $\mathcal{A}$ and even $\Omega$ in the notation when it is clear from the context. Let $\psi$ be a bounded, non-decreasing function such that for some $b > 0$ and all $u \in \mathbb{R}$,

$$-\log\left(1 - u + \frac{u^2}{b}\right) \leq \psi(u) \leq \log\left(1 + u + \frac{u^2}{b}\right). \tag{1}$$

As a concrete and analytically useful example, we shall use the piecewise polynomial function of Catoni and Giulini [7], defined by

$$\psi(u) := \begin{cases} u - u^3/6, & -\sqrt{2} \leq u \leq \sqrt{2}, \\ 2\sqrt{2}/3, & u > \sqrt{2}, \\ -2\sqrt{2}/3, & u < -\sqrt{2}, \end{cases} \tag{2}$$

which for $b = 2$ satisfies (1). Slightly looser bounds hold with $b = 1$ for an analogous procedure using a Huber-type influence function.

[Figure 1: Graph of the Catoni function $\psi(u)$, along with the upper and lower bounds from (1).]

**Estimator definition.** We consider a straightforward procedure, in which the data are subject to a soft truncation after re-scaling, defined by

$$\widehat{x} := \frac{s}{n}\sum_{i=1}^{n} \psi\left(\frac{x_i}{s}\right), \tag{3}$$

where $s > 0$ is a re-scaling parameter. Depending on the setting of $s$, this function can very closely approximate the sample mean, and indeed modifying this scaling parameter controls the bias of this estimator in a direct way, which can be quantified as follows. As the scale grows, note that

$$s\,\psi\left(\frac{x}{s}\right) = x - \frac{x^3}{6s^2} \to x, \quad \text{as } s \to \infty,$$

which implies that, taking expectation with respect to the sample and letting $s \to \infty$, in the limit this estimator is unbiased, with

$$\lim_{s\to\infty} \mathbb{E}\,\widehat{x} = \mathbb{E}_\mu\, x - \lim_{s\to\infty}\frac{\mathbb{E}_\mu\, x^3}{6s^2} = \mathbb{E}_\mu\, x.$$

On the other hand, taking $s$ closer to zero implies that more observations will be truncated. Taking $s$ small enough,³ we have

$$\widehat{x} = \frac{2\sqrt{2}\,s}{3n}\left(|I_+| - |I_-|\right),$$

which converges to zero as $s \to 0$. Here the positive/negative index sets are $I_+ := \{i \in [n] : x_i > 0\}$ and $I_- := \{i \in [n] : x_i < 0\}$. Thus taking $s$ too small means that only the signs of the observations matter, and the absolute value of the estimator tends to become too small.

³ More precisely, taking $s \leq \min\{|x_i| : i \in [n]\}/\sqrt{2}$.
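To fix ideas, here is a minimal Python sketch of (2) and (3) (our own illustration, not reference code from the paper), showing the two limiting regimes just described on a heavy-tailed sample:

```python
import numpy as np

def psi(u):
    # Catoni-Giulini influence function (2): cubic near the origin, constant tails.
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= np.sqrt(2.0), u - u**3 / 6.0,
                    np.sign(u) * 2.0 * np.sqrt(2.0) / 3.0)

def x_hat(x, s):
    # Soft truncation after re-scaling, as in (3).
    x = np.asarray(x, dtype=float)
    return (s / len(x)) * np.sum(psi(x / s))

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.75, size=100)   # heavy-tailed sample

print(np.mean(x))        # sample mean, dragged around by the upper tail
print(x_hat(x, 1e8))     # s huge: essentially no truncation, recovers the mean
print(x_hat(x, 1e-8))    # s tiny: only signs survive, estimate collapses toward zero
print(x_hat(x, 5.0))     # intermediate s: upper tail softly truncated
```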
**High-probability deviation bounds for $\widehat{x}$.** We are interested in high-probability bounds on the deviations $|\widehat{x} - \mathbb{E}_\mu\, x|$ under the weakest possible assumptions on the underlying data distribution. To obtain such guarantees in a straightforward manner, we make the simple observation that the estimator $\widehat{x}$ defined in (3) can be related to an estimator with smoothed noise as follows. Let $\epsilon_1,\ldots,\epsilon_n$ be an iid sample of noise $\epsilon \in \{0,1\}$ with distribution $\mathrm{Bernoulli}(\theta)$ for some $0 < \theta < 1$. Then, taking expectation with respect to the noise sample, one has that

$$\mathbb{E}_\epsilon\left[\frac{s}{n}\sum_{i=1}^{n}\psi\left(\frac{x_i\,\epsilon_i}{s}\right)\right] = \theta\,\widehat{x}.$$

This simple observation becomes useful to us in the context of the following technical fact.

**Lemma 3.** *Assume we are given some independent data $x_1,\ldots,x_n$, assumed to be copies of the random variable $x \sim \mu$. In addition, let $\epsilon_1,\ldots,\epsilon_n$ similarly be independent observations of "strategic noise," with a distribution $\epsilon \sim \rho$ that we can design. Fix an arbitrary prior distribution $\nu$, and consider $f : \mathbb{R}^2 \to \mathbb{R}$, assumed to be bounded and measurable. Write $K(\rho;\nu)$ for the Kullback-Leibler divergence between the distributions $\rho$ and $\nu$. It follows that with probability no less than $1-\delta$ over the random draw of the sample, we have*

$$\mathbb{E}_\epsilon\left[\frac{1}{n}\sum_{i=1}^{n} f(x_i,\epsilon_i)\right] \leq \int \log \mathbb{E}_\mu \exp\big(f(x,\epsilon)\big)\,d\rho(\epsilon) + \frac{K(\rho;\nu) + \log(\delta^{-1})}{n},$$

*uniformly in the choice of $\rho$, where the expectation on the left-hand side is over the noise sample.*

The special case of interest here is $f(x,\epsilon) = \psi(x\epsilon/s)$. Using (1) and Lemma 3, with prior $\nu = \mathrm{Bernoulli}(1/2)$ and posterior $\rho = \mathrm{Bernoulli}(\theta)$, it follows that on the $1-\delta$ high-probability event, uniformly in the choice of $0 < \theta < 1$, we have

$$\frac{\theta\,\widehat{x}}{s} \leq \int \left(\frac{\epsilon\,\mathbb{E}_\mu\, x}{s} + \frac{\epsilon^2\,\mathbb{E}_\mu\, x^2}{2s^2}\right) d\rho(\epsilon) + \frac{K(\rho;\nu) + \log(\delta^{-1})}{n} = \frac{\theta\,\mathbb{E}_\mu\, x}{s} + \frac{\theta\,\mathbb{E}_\mu\, x^2}{2s^2} + \frac{\theta\log(2\theta) + (1-\theta)\log(2(1-\theta)) + \log(\delta^{-1})}{n},$$

where we have used the fact that $\mathbb{E}\,\epsilon^2 = \mathbb{E}\,\epsilon = \theta$ in the Bernoulli case. Dividing both sides by $\theta/s$ and optimizing this as a function of $s > 0$ yields a closed-form expression for $s$ depending on the second moment, the confidence $\delta$, and $\theta$. Analogous arguments yield lower bounds on the same quantity. Taking these facts together, we have the following proposition, which says that assuming only a finite second moment $\mathbb{E}_\mu\, x^2 < \infty$, the proposed estimator achieves exponential tail bounds scaling with the second non-central moment.

**Proposition 4** (Concentration of deviations). *Scaling with $s^2 = n\,\mathbb{E}_\mu\, x^2/(2\log(\delta^{-1}))$, the estimator defined in (3) satisfies*

$$|\widehat{x} - \mathbb{E}_\mu\, x| \leq \sqrt{\frac{2\,\mathbb{E}_\mu\, x^2\,\log(\delta^{-1})}{n}} \tag{6}$$

*with probability at least $1-2\delta$.*

**Remark 5.** While the above bound (6) depends on the true second moment, the result is easily extended to hold for any valid upper bound on that moment, which is what will inevitably have to be used in practice.

**Centered estimates.** Note that the bound (6) depends on the second moment of the underlying data; this is in contrast to M-estimators, which, due to a natural centering of the data, typically have tail bounds depending on the variance [6]. This results in a sensitivity to the absolute value of the location of the distribution: e.g., performance on a distribution with unit variance and $\mathbb{E}_\mu\, x = 0$ will tend to be much better than on a distribution with $\mathbb{E}_\mu\, x = 10^4$. Fortunately, a simple centering strategy works well to alleviate this sensitivity, as follows. Without loss of generality, assume that the first $0 < k < n$ points are used for constructing a shifting device, with the remaining $n-k > 0$ points left for running the usual routine on the shifted data. More concretely, define

$$x_\psi := \frac{s}{k}\sum_{i=1}^{k}\psi\left(\frac{x_i}{s}\right), \quad \text{where } s^2 = \frac{k\,\mathbb{E}_\mu\, x^2}{2\log(\delta^{-1})}. \tag{7}$$

From (6) in Proposition 4, we have

$$|x_\psi - \mathbb{E}_\mu\, x| \leq \varepsilon_k := \sqrt{\frac{2\,\mathbb{E}_\mu\, x^2\,\log(\delta^{-1})}{k}}$$

on an event with probability no less than $1-2\delta$ over the draw of the $k$-sized sub-sample. Using this, we shift the remaining data points as $x_i' := x_i - x_\psi$. Note that the second moment of this shifted data is bounded as $\mathbb{E}_\mu(x')^2 \leq \mathrm{var}_\mu\, x + \varepsilon_k^2$. Passing these shifted points through (3), with the analogous second moment bound used for scaling, we have

$$\widehat{x}' = \frac{s}{n-k}\sum_{i=k+1}^{n}\psi\left(\frac{x_i'}{s}\right), \quad \text{where } s^2 = \frac{(n-k)\left(\mathrm{var}_\mu\, x + \varepsilon_k^2\right)}{2\log(\delta^{-1})}. \tag{8}$$

Shifting the resulting output back to the original location by adding $x_\psi$, and conditioning on $x_\psi$, we have by (6) again that

$$\left|(\widehat{x}' + x_\psi) - \mathbb{E}_\mu\, x\right| = \left|\widehat{x}' - \mathbb{E}_\mu(x - x_\psi)\right| \leq \sqrt{\frac{2\left(\mathrm{var}_\mu\, x + \varepsilon_k^2\right)\log(\delta^{-1})}{n-k}}$$

with probability no less than $1-2\delta$ over the draw of the remaining $n-k$ points. Defining the centered estimator as $\widehat{x}_c := \widehat{x}' + x_\psi$, and taking a union bound over the two good events on the independent sample subsets, we may thus conclude that

$$\mathbb{P}\left\{|\widehat{x}_c - \mathbb{E}_\mu\, x| > \varepsilon\right\} \leq 4\exp\left(\frac{-(n-k)\,\varepsilon^2}{2\left(\mathrm{var}_\mu\, x + \varepsilon_k^2\right)}\right), \tag{9}$$

where the probability is over the draw of the full $n$-sized sample. While one takes a hit in terms of the sample size, the variance now works to combat sensitivity to the distribution location (see section 5 for empirical tests).
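To make the two-stage procedure concrete, here is a minimal Python sketch of (7)-(9), assuming valid bounds on the second moment and variance are supplied by the user (the function names, the even split, and the example below are our own illustrative choices, not the paper's reference implementation):

```python
import numpy as np

def psi(u):
    # Catoni-Giulini influence function (2).
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= np.sqrt(2.0), u - u**3 / 6.0,
                    np.sign(u) * 2.0 * np.sqrt(2.0) / 3.0)

def robust_mean(x, m2, delta):
    # Estimator (3) with the Proposition 4 scale rule s^2 = n*m2 / (2*log(1/delta)).
    x = np.asarray(x, dtype=float)
    s = np.sqrt(len(x) * m2 / (2.0 * np.log(1.0 / delta)))
    return s * np.mean(psi(x / s))

def robust_mean_centered(x, m2, var, delta, k=None):
    # Two-stage centered estimator following (7)-(9).
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = n // 2 if k is None else k
    x_shift = robust_mean(x[:k], m2, delta)              # shifting device (7)
    eps_k = np.sqrt(2.0 * m2 * np.log(1.0 / delta) / k)  # its deviation level
    x_prime = x[k:] - x_shift                            # shifted data
    # Second moment of the shifted data is at most var + eps_k^2; estimate, shift back.
    return robust_mean(x_prime, var + eps_k**2, delta) + x_shift

# Example: modest variance but large location, where centering should help.
rng = np.random.default_rng(0)
a, b = 40.0, 5.0
x = rng.normal(a, b, size=100)
m2, var = a**2 + b**2, b**2     # true moments supplied for illustration
print(abs(robust_mean(x, m2, 0.01) - a),
      abs(robust_mean_centered(x, m2, var, 0.01) - a))
```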
## 4 PAC-Bayesian bounds for heavy-tailed data

An important and influential paper due to D. McAllester gave the following theorem as a motivating result. To get started, we give a slightly modified version of his result.

**Theorem 6** (McAllester [12], Preliminary Theorem 2). *Let $\nu$ be a prior probability distribution over $\mathcal{H}$, assumed countable and such that $\nu(h) > 0$ for all $h \in \mathcal{H}$. Consider the pattern recognition task with $z = (x,y) \in \mathcal{X} \times \{-1,1\}$ and the classification error $l(h;z) = I\{h(x) \neq y\}$. Then with probability no less than $1-\delta$, for any choice of $h \in \mathcal{H}$, we have*

$$R(h) \leq \frac{1}{n}\sum_{i=1}^{n} l(h;z_i) + \sqrt{\frac{\log(1/\nu(h)) + \log(1/\delta)}{2n}}.$$

One quick glance at the proof of this theorem shows that the bounded nature of the observations plays a crucial role in deriving excess risk bounds of the above form, as it is used to obtain concentration inequalities for the empirical risk about the true risk. While analogous concentration inequalities hold under slightly weaker assumptions, when considering the potentially heavy-tailed setting, one simply cannot guarantee that the empirical risk is tightly concentrated about the true risk, which prevents direct extensions of such theorems. With this in mind, we take a different approach, one that does not require the empirical mean to be well-concentrated.

**Our motivating pre-theorem.** The basic idea of our approach is very simple: instead of using the sample mean, bound the off-sample risk using a more robust estimator which is easy to compute directly, and which allows risk bounds even under unbounded, potentially heavy-tailed losses. Define a new approximation of the risk by

$$\widehat{R}_\psi(h) := \frac{s}{n}\sum_{i=1}^{n}\psi\left(\frac{l(h;z_i)}{s}\right) \tag{10}$$

for $s > 0$. Note that this is just a direct application of the robust estimator defined in (3) to the case of a loss which depends on the choice of candidate $h \in \mathcal{H}$. As a motivating result, we essentially re-prove McAllester's result (Theorem 6) under much weaker assumptions on the loss, using the statistical properties of the new risk estimator (10) rather than relying on classical Chernoff inequalities.

**Theorem 7** (Pre-theorem). *Let $\nu$ be a prior probability distribution over $\mathcal{H}$, assumed countable. Assume that $\nu(h) > 0$ for all $h \in \mathcal{H}$, and that $m_2(h) := \mathbb{E}\, l(h;z)^2 < \infty$ for all $h \in \mathcal{H}$. Setting the scale in (10) to $s_h^2 = n\,m_2(h)/(2\log(\delta^{-1}))$, then with probability no less than $1-2\delta$, for any choice of $h \in \mathcal{H}$, we have*

$$R(h) \leq \widehat{R}_\psi(h) + \sqrt{\frac{2\,m_2(h)\left(\log(1/\nu(h)) + \log(1/\delta)\right)}{n}}.$$

**Remark 8.** We note that all quantities on the right-hand side of Theorem 7 are easily computed based on the sample, except for the second moment $m_2$, which in practice must be replaced with an empirical estimate. With an empirical estimate of $m_2$ in place, the upper bound can easily be used to derive a learning algorithm.
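For instance, over a finite (or suitably truncated countable) model, the resulting rule is just to minimize the right-hand side of Theorem 7 over candidates. A minimal Python sketch, with an empirical plug-in for $m_2(h)$ as mentioned in Remark 8 (the data model, loss, and prior here are our own toy choices):

```python
import numpy as np

def psi(u):
    # Catoni-Giulini influence function (2).
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= np.sqrt(2.0), u - u**3 / 6.0,
                    np.sign(u) * 2.0 * np.sqrt(2.0) / 3.0)

def r_psi(losses, m2, delta):
    # Robust risk estimate (10), scale s_h^2 = n*m2(h) / (2*log(1/delta)).
    n = len(losses)
    s = np.sqrt(n * m2 / (2.0 * np.log(1.0 / delta)))
    return (s / n) * np.sum(psi(losses / s))

# Toy problem: locate the mean of heavy-tailed data under the squared-error loss.
rng = np.random.default_rng(0)
z = rng.lognormal(0.0, 1.25, size=200)        # all moments finite, heavy right tail
H = np.linspace(-5.0, 5.0, 101)               # finite model
nu = np.full(len(H), 1.0 / len(H))            # uniform prior
delta = 0.05

def theorem7_bound(h, nu_h):
    losses = (z - h) ** 2
    m2 = np.mean(losses ** 2)                 # empirical plug-in for m2(h), cf. Remark 8
    penalty = np.sqrt(2.0 * m2 * (np.log(1.0 / nu_h) + np.log(1.0 / delta)) / len(z))
    return r_psi(losses, m2, delta) + penalty

h_best = H[np.argmin([theorem7_bound(h, p) for h, p in zip(H, nu)])]
print(h_best)   # lands near the true mean exp(1.25**2 / 2) ~ 2.18
```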
**Uncountable model case.** Next we extend the previous motivating theorem to a more general result on a potentially uncountable $\mathcal{H}$, using stochastic learning algorithms, as has become standard in the PAC-Bayes literature. We need a few technical conditions, listed below:

1. *Bounds on lower-order moments.* For all $h \in \mathcal{H}$, we require $\mathbb{E}_\mu\, l(h;z)^2 \leq M_2 < \infty$ and $\mathbb{E}_\mu\, l(h;z)^3 \leq M_3 < \infty$.
2. *Bounds on the risk.* For all $h \in \mathcal{H}$, we require $R(h) \leq \sqrt{n M_2/(4\log(\delta^{-1}))}$.
3. *Large enough confidence.* We require $\delta \leq \exp(-1/9) \approx 0.89$.

These conditions are quite reasonable, and are easily realized under heavy-tailed data with just lower-order moment assumptions on $\mu$ and, say, a compact class $\mathcal{H}$. The new terms that appear in our bounds that do not appear in previous works are

$$\widehat{G}_{\rho,\psi} := \mathbb{E}_\rho\, \widehat{R}_\psi \qquad \text{and} \qquad \nu_n(\mathcal{H}) := \mathbb{E}\left[\mathbb{E}_\nu \exp\left(\sqrt{n}\left(R - \widehat{R}_\psi\right)\right)\right],$$

where the outer expectation is over the sample. The former is the expectation of the proposed robust estimator with respect to the posterior $\rho$, and the latter is a term that depends directly on the quality of the prior $\nu$.

**Theorem 9.** *Let $\nu$ be a prior distribution on the model $\mathcal{H}$, and let the three assumptions listed above hold. Setting the scale in (10) to $s^2 = n M_2/(2\log(\delta^{-1}))$, with probability no less than $1-\delta$ over the random draw of the sample, it holds that*

$$G_\rho \leq \widehat{G}_{\rho,\psi} + \frac{1}{\sqrt{n}}\left(K(\rho;\nu) + \frac{\log(8\pi M_2\,\delta^{-2})}{2} + M_2 + \nu_n(\mathcal{H}) - 1\right) + \mathcal{O}\left(\frac{1}{n}\right)$$

*for any choice of probability distribution $\rho$ on $\mathcal{H}$ (noting that $G_\rho < \infty$ by assumption).*

**Remark 10.** As is evident from the statement of Theorem 9, the convergence rate is clear for all terms but $\nu_n(\mathcal{H})/\sqrt{n}$. In our proof, we use a modified version of the elegant and now-standard strategy formulated by Bégin et al. [3]. A glance at the proof shows that under this strategy, there is essentially no way to avoid dependence on $\nu_n(\mathcal{H})$. Since the random variable $R - \widehat{R}_\psi$ is bounded over the random draw of the sample and $h \sim \nu$, the bounds still hold and are non-trivial. That said, $\nu_n(\mathcal{H})$ may indeed increase as $n \to \infty$, potentially spoiling the $1/\sqrt{n}$ rate, and even consistency in the worst case. Clearly $\nu_n(\mathcal{H})$ presents no trouble if $R - \widehat{R}_\psi \leq 0$ on a high-probability event, but note that this essentially amounts to asking for a prior that on average realizes bounds better than we can guarantee for any posterior through the above analysis. Such a prior may indeed exist, but if it were known, then that would eliminate the need for doing any learning at all. If the deviations $R - \widehat{R}_\psi$ were truly sub-Gaussian [4], then the $1/\sqrt{n}$ rate could easily be obtained. However, impossibility results from Devroye et al. [8] suggest that under just a few finite moment assumptions, such an estimator cannot be constructed. As such, here we see a clear limitation of the established PAC-Bayes analytical framework under potentially heavy-tailed data. Since the change-of-measure step in the proof is fundamental to the basic argument, it appears that concessions will have to be made, either in the form of slower rates, deviations larger than the relative entropy, or weaker dependence on $1/\delta$.

**Remark 11.** Note that while in its tightest form the above bound requires knowledge of $\mathbb{E}_\mu\, l(h;z)^2$, we may set the $s > 0$ used to define $\widehat{R}_\psi$ using any valid upper bound $M_2$, under which the above bound still holds as-is, using known quantities. Furthermore, for reference, the content of the $\mathcal{O}(1/n)$ term in the above bound is a sum of terms proportional to $V\log(\delta^{-1})$ and $M_3\log(\delta^{-1})$, scaled by $1/n$, where $V$ is an upper bound on the variance, $\mathrm{var}_\mu\, l(h;z) \leq V < \infty$ over $h \in \mathcal{H}$.
As a principled approach to deriving stochastic learning algorithms, one naturally considers the choice of posterior $\rho$ in Theorem 9 that minimizes the upper bound. This is typically referred to as the optimal Gibbs posterior [9], and it takes a form which is easily characterized, as we prove in the following proposition.

**Proposition 12** (Robust optimal Gibbs posterior). *The upper bound of Theorem 9 is optimized by a data-dependent posterior distribution $\widehat{\rho}$, defined in terms of its density function with respect to the prior $\nu$ as*

$$\frac{d\widehat{\rho}}{d\nu}(h) = \frac{\exp\left(-\sqrt{n}\,\widehat{R}_\psi(h)\right)}{\mathbb{E}_\nu \exp\left(-\sqrt{n}\,\widehat{R}_\psi\right)}.$$

*Furthermore, the risk bound under the optimal Gibbs posterior takes the form*

$$G_{\widehat{\rho}} \leq -\frac{1}{\sqrt{n}}\log \mathbb{E}_\nu \exp\left(-\sqrt{n}\,\widehat{R}_\psi\right) + \frac{1}{\sqrt{n}}\left(\frac{\log(8\pi M_2\,\delta^{-2})}{2} + M_2 + \nu_n(\mathcal{H}) - 1\right) + \mathcal{O}\left(\frac{1}{n}\right)$$

*with probability no less than $1-\delta$ over the draw of the sample.*

**Remark 13** (Comparison with the traditional Gibbs posterior). In traditional PAC-Bayes analysis [9, Equation 8], the optimal Gibbs posterior, let us write $\widehat{\rho}_{\mathrm{emp}}$, is defined by

$$\frac{d\widehat{\rho}_{\mathrm{emp}}}{d\nu}(h) = \frac{\exp\left(-n\,\widehat{R}(h)\right)}{\mathbb{E}_\nu \exp\left(-n\,\widehat{R}\right)},$$

where $\widehat{R}(h) = n^{-1}\sum_{i=1}^{n} l(h;z_i)$ is the empirical risk. The respective exponents involve $n\,\widehat{R}$ and $\sqrt{n}\,\widehat{R}_\psi$, but since the scaling in the latter case should be done with $s \propto \sqrt{n}$, in both cases the $1/n$ factor in the underlying estimator cancels out. In the special case of the negative log-likelihood loss, Germain et al. [9] demonstrate that the optimal Gibbs posterior coincides with the classical Bayesian posterior. As noted by Alquier et al. [2], the optimal Gibbs posterior has shown strong empirical performance in practice, and variational approaches have been proposed as efficient alternatives to more traditional MCMC-based implementations. Comparison of both the computational and learning efficiency of our proposed robust Gibbs posterior with the traditional Gibbs posterior is a point of significant interest moving forward.
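For a model discretized to a finite grid, the robust posterior of Proposition 12 can be computed directly by exponentiating the robust risk estimates against the prior. A minimal Python sketch (the discretization, the toy data, and the plug-in moment bound are our own illustrative choices):

```python
import numpy as np

def psi(u):
    # Catoni-Giulini influence function (2).
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= np.sqrt(2.0), u - u**3 / 6.0,
                    np.sign(u) * 2.0 * np.sqrt(2.0) / 3.0)

def robust_gibbs(H, z, loss, M2, delta, prior):
    # Robust optimal Gibbs posterior (Proposition 12): d(rho)/d(nu) prop. to exp(-sqrt(n)*R_psi).
    n = len(z)
    s = np.sqrt(n * M2 / (2.0 * np.log(1.0 / delta)))   # scale rule of Theorem 9
    r = np.array([(s / n) * np.sum(psi(loss(h, z) / s)) for h in H])
    logw = -np.sqrt(n) * r + np.log(prior)
    logw -= logw.max()                                  # numerical stabilization
    w = np.exp(logw)
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.lognormal(0.0, 1.25, size=300)
H = np.linspace(0.0, 6.0, 121)
prior = np.full(len(H), 1.0 / len(H))
loss = lambda h, zz: (zz - h) ** 2
M2 = max(np.mean((z - h) ** 4) for h in H)   # empirical plug-in upper bound, cf. Remark 11
w = robust_gibbs(H, z, loss, M2, delta=0.05, prior=prior)
print(H[np.argmax(w)], float((H * w).sum()))  # posterior mode and mean near E[z] ~ 2.18
```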
## 5 Empirical analysis

In this section, we use tightly controlled simulations to investigate how the performance of $\widehat{x}$ (cf. (3) and Proposition 4) compares with the sample mean and other robust estimators. We pay particular attention to how performance depends on the underlying distribution family, the value of the second moments, and the sample size.

**Experimental setup.** For each experimental setting and each independent trial, we generate a sample $x_1,\ldots,x_n$ of size $n$, compute some estimator $\{x_i\}_{i=1}^{n} \mapsto \widehat{x}$, and record the deviation $|\widehat{x} - \mathbb{E}_\mu\, x|$. The sample sizes range over $n \in \{10, 20, 30, \ldots, 100\}$, and the number of trials is $10^4$. We draw data from two distribution families: the Normal family with mean $a$ and variance $b^2$, and the log-Normal family with log-mean $a_{\log}$ and log-variance $b_{\log}^2$, under multiple parameter settings. In particular, we consider the impact of shifting the distribution location over $[-40.0, 40.0]$, with small and large variance settings. Regarding the variance, we have low, mid, and high settings, which correspond to $b = 0.5, 5.0, 50.0$ in the Normal case, and $b_{\log} = 1.1, 1.35, 1.75$ in the log-Normal case. Over all settings, the log-location parameter of the log-Normal data is fixed at $a_{\log} = 0$. Shifting the Normal data is trivially accomplished by taking the desired $a \in [-40.0, 40.0]$. Shifting the log-Normal data is accomplished by subtracting the (pre-shift) true mean, equal to $\exp(a_{\log} + b_{\log}^2/2)$, to center the data, and subsequently adding the desired location.

The methods being compared are as follows: mean denotes the empirical mean, med the empirical median,⁴ mult_g is the estimator of Holland [10] using smoothed Gaussian noise, mult_b the proposed estimator $\widehat{x}$ defined in (3) using smoothed Bernoulli noise, and finally mult_bc the centered version of $\widehat{x}$; see the discussion culminating in (9). The latter methods are given access to the true variance or second moment as needed for scaling purposes, and all algorithms are run with confidence parameter $\delta = 0.01$.

**Impact of distribution family.** In Figure 2, we give histograms of the deviations for each method of interest under high-variance settings. Colored vertical rules correspond to the error bounds for $\widehat{x}$ under Gaussian noise and Bernoulli noise (bound via Proposition 4), at confidence level $\delta$. When the standard deviation is not much larger than the mean, we can see substantial improvement over traditional estimators. The bias introduced by the different $\widehat{x}$ choices is clearly far smaller on average than that of the median, with substantially reduced sensitivity to outliers when compared with the mean. The centered version of $\widehat{x}$ has a deviation distribution somewhere between that of the empirical mean and that of the other $\widehat{x}$ choices.

**Impact of distribution location.** In Figure 3 (a), we plot average/median deviations over trials, taken as a function of the true location $\mathbb{E}_\mu\, x$. From these results, two clear observations can be made. First, the performance of the Gaussian-type (mult_g) and Bernoulli-type (mult_b) estimators tends to differ greatly as a function of the true mean; in particular, the bias of the Gaussian case is far more sensitive to the true location, providing strong evidence for the use of our proposed Bernoulli version, which is less expensive and essentially uniformly better than the Gaussian version (as we would expect from the tighter bounds), with error growing more slowly as a function of the true mean value. Second, the centering procedure clearly works well to mitigate the effect of the second moment value, although a price is paid in overall accuracy due to the naive sample-splitting technique discussed above.

**Impact of sample size.** In Figure 3 (b), we show average/median deviations taken over all trials, viewed as a function of the sample size $n$. The most distinct observation that can be made here is that the estimator $\widehat{x}$ of (3) has learning efficiency far superior to the empirical mean and median, though as expected the centered version of $\widehat{x}$ has poorer efficiency, a direct result of the sample-splitting scheme used in its definition. As discussed before, this comes with the caveat that the mean cannot be too much larger than the standard deviation; when the second moment is exceedingly large, this leads to a rather large bias, as seen in Figure 3 (a) previously.

⁴ After sorting, the median is computed as the middle point when $n$ is odd, or the average of the two middle points when $n$ is even.

[Figure 2: Histograms of deviations $|\widehat{x} - \mathbb{E}_\mu\, x|$ for different distributions and estimators, with accompanying error bounds. Sample size is $n = 10$. Distributions are centered such that the mean equals the low-level standard deviation. Top: Normal data. Bottom: log-Normal data.]

[Figure 3: (a) Deviations $|\widehat{x} - \mathbb{E}_\mu\, x|$ as a function of the true mean $\mathbb{E}_\mu\, x$. (b) Deviations $|\widehat{x} - \mathbb{E}_\mu\, x|$ as a function of the sample size $n$. In both sub-figures, left is Normal data, right is log-Normal data.]
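For readers who wish to replicate the flavor of these simulations, here is a minimal Python sketch of the high-variance log-Normal setting (as in the text, the estimator is given the true second moment; everything else is our own simplification, and mult_g/mult_bc are omitted):

```python
import numpy as np

def psi(u):
    # Catoni-Giulini influence function (2).
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= np.sqrt(2.0), u - u**3 / 6.0,
                    np.sign(u) * 2.0 * np.sqrt(2.0) / 3.0)

def mult_b(x, m2, delta=0.01):
    # Proposed estimator (3) with the scaling of Proposition 4.
    s = np.sqrt(len(x) * m2 / (2.0 * np.log(1.0 / delta)))
    return s * np.mean(psi(np.asarray(x) / s))

rng = np.random.default_rng(0)
n, trials, b_log = 10, 10**4, 1.75              # high-variance log-Normal setting
true_mean = np.exp(b_log**2 / 2.0)              # a_log = 0
m2 = np.exp(2.0 * b_log**2)                     # true second moment, given to mult_b
devs = {"mean": [], "med": [], "mult_b": []}
for _ in range(trials):
    x = rng.lognormal(0.0, b_log, size=n)
    devs["mean"].append(abs(np.mean(x) - true_mean))
    devs["med"].append(abs(np.median(x) - true_mean))
    devs["mult_b"].append(abs(mult_b(x, m2) - true_mean))
for name, d in devs.items():
    print(f"{name:8s} avg={np.mean(d):7.3f}  med={np.median(d):7.3f}")
```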
## 6 Conclusions

The main contribution of this paper was to develop a novel approach to obtaining PAC-Bayesian learning guarantees, one which admits deviations with exponential tails under weak moment assumptions on the underlying loss distribution, while still being computationally amenable. In this work, our chief interest was the fundamental problem of obtaining strong guarantees for stochastic learning algorithms which can reflect prior knowledge about the data-generating process, from which we derived a new robust Gibbs posterior. Moving forward, a deeper study of the statistical nature of this new stochastic learning algorithm, as well as the computational considerations to be made in practice, are of significant interest.

**Acknowledgments.** This work was partially supported by JSPS KAKENHI Grant Number 18H06477.

## References

[1] Alquier, P. and Guedj, B. (2018). Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902.
[2] Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(1):8374–8414.
[3] Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. (2016). PAC-Bayesian bounds based on the Rényi divergence. In Proceedings of Machine Learning Research, volume 51, pages 435–444.
[4] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
[5] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization: École d'Été de Probabilités de Saint-Flour XXXI-2001, volume 1851 of Lecture Notes in Mathematics. Springer.
[6] Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185.
[7] Catoni, O. and Giulini, I. (2017). Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747.
[8] Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Annals of Statistics, 44(6):2695–2725.
[9] Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems 29, pages 1884–1892.
[10] Holland, M. J. (2019). Robust descent using smoothed multiplicative noise. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 703–711.
[11] Maurer, A. (2004). A note on the PAC-Bayesian theorem. arXiv preprint arXiv:cs/0411099.
[12] McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363.
[13] McAllester, D. A. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21.
[14] McAllester, D. A. (2013). A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118.
[15] Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269.
[16] Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., and Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093.
[17] Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1996). A framework for structural risk minimisation. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 68–76. ACM.
[18] Tolstikhin, I. and Seldin, Y. (2013). PAC-Bayes-Empirical-Bernstein inequality. In Advances in Neural Information Processing Systems 26, pages 109–117.
[19] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.