# Data-dependent PAC-Bayes priors via differential privacy

Gintare Karolina Dziugaite, University of Cambridge; Element AI
Daniel M. Roy, University of Toronto; Vector Institute

The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999) can incorporate knowledge about the learning algorithm and (data) distribution through the use of distribution-dependent priors, yielding tighter generalization bounds on data-dependent posteriors. Using this flexibility, however, is difficult, especially when the data distribution is presumed to be unknown. We show how an ε-differentially private data-dependent prior yields a valid PAC-Bayes bound, and then show how non-private mechanisms for choosing priors can also yield generalization bounds. As an application of this result, we show that a Gaussian prior mean chosen via stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) leads to a valid PAC-Bayes bound given control of the 2-Wasserstein distance to an ε-differentially private stationary distribution. We study our data-dependent bounds empirically, and show that they can be nonvacuous even when other distribution-dependent bounds are vacuous.

## 1 Introduction

There has been a resurgence of interest in PAC-Bayes bounds, especially towards explaining generalization in large-scale neural networks trained by stochastic gradient descent (Dziugaite and Roy, 2017; Neyshabur et al., 2017b; Neyshabur et al., 2017a; London, 2017). See also (Bégin et al., 2016; Germain et al., 2016; Thiemann et al., 2017; Bartlett, Foster, and Telgarsky, 2017; Raginsky, Rakhlin, and Telgarsky, 2017; Grünwald and Mehta, 2017; Smith and Le, 2018). PAC-Bayes bounds control the generalization error of Gibbs classifiers (aka PAC-Bayes posteriors) in terms of the Kullback-Leibler (KL) divergence to a fixed probability measure (aka PAC-Bayes prior) on the space of classifiers.
PAC-Bayes bounds depend on a tradeoff between the empirical risk of the posterior Q and a penalty (1/m) KL(Q‖P), where P is the prior, fixed independently of the sample S ∈ Z^m from some space Z of labelled examples. The KL penalty is typically the largest contribution to the bound, and so finding the tightest possible bound generally depends on minimizing the KL term. The KL penalty vanishes for Q = P, but typically P, viewed as a randomized (Gibbs) classifier, has poor performance since it has been chosen independently of the data. On the other hand, since P is chosen independently of the data, posteriors Q tuned to the data to achieve minimal empirical risk often bear little resemblance to the data-independent prior P, causing KL(Q‖P) to be large. As a result, PAC-Bayes bounds can be loose or even vacuous.

The problem of excessive KL penalties is not inherent to the PAC-Bayes framework. Indeed, the PAC-Bayes theorem permits one to choose the prior P based on the distribution D of the data. However, since D is considered unknown, and our only insight as to D is through the sample S, this flexibility would seem to be useless, as P must be chosen independently of S in existing bounds. Nevertheless, it is possible to make progress in this direction, and it is likely the best way towards tighter bounds and deeper understanding.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

There is a growing body of work in the PAC-Bayes literature on data-distribution-dependent priors (Catoni, 2007; Parrado-Hernández et al., 2012; Lever, Laviolette, and Shawe-Taylor, 2013). Our focus is on generalization bounds that use all the data, and in this setting, Lever, Laviolette, and Shawe-Taylor (2013) prove a remarkable result for Gibbs posteriors.
Writing L̂_S for the empirical risk function with respect to the sample S, Lever, Laviolette, and Shawe-Taylor study the randomized classifier Q with density (relative to some base measure) proportional to exp(−t L̂_S). For large values of t, Q concentrates increasingly around the empirical risk minimizers. As a prior, the authors consider the probability distribution with density proportional to exp(−t L_D), where the empirical risk has been replaced by its expectation, L_D, the (true) risk. Remarkably, Lever, Laviolette, and Shawe-Taylor are able to upper bound the KL divergence between the two distributions by a term independent of D, yielding the following result. Here kl(q‖p) is the KL divergence between Bernoulli measures with means q and p. See Section 4 for more details.

Theorem 1.1 (Lever, Laviolette, and Shawe-Taylor 2013). Fix t > 0. For S ∈ Z^m, let Q(S) = P_{exp(−t L̂_S)} be a Gibbs posterior with respect to some base measure P on R^p, where the empirical risk L̂_S is bounded in [0,1]. For every δ > 0, m ∈ N, and distribution D on Z, with probability at least 1 − δ over S ∼ D^m, the KL-generalization error kl(L̂_S(Q(S)) ‖ L_D(Q(S))) admits an explicit upper bound that depends on D only through t, grows with t, and vanishes as m → ∞.

The dependence on the data distribution is captured through t, which is ideally chosen as small as possible, subject to Q(S) yielding small empirical risk. (One can use a union bound to tune t based on S.) The fact that the KL bound does not depend on D, other than through t, implies that the bound must be loose for all t such that there exists a distribution D that causes Q to overfit with high probability on size-m datasets S ∼ D^m. In other words, for fixed t, the bound is no longer distribution dependent. This would not be important if not for the following empirical finding: weights sampled according to high values of t do not overfit on real data, but they do on data whose labels have been randomized. Thus these bounds are vacuous in practice when the generalization error is, in fact, small. Evidently, the KL bound gives up too much.
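To make the role of the inverse temperature t concrete, here is a minimal sketch (our illustration, not from the paper) of a Gibbs posterior over a finite set of classifiers: weights proportional to exp(−t L̂_S), so that mass concentrates on the empirical risk minimizers as t grows. The risk values are invented for illustration.

```python
import numpy as np

def gibbs_posterior(emp_risks, t):
    """Probability weights proportional to exp(-t * empirical risk),
    computed stably via a max shift (i.e., a softmax over -t * risk)."""
    emp_risks = np.asarray(emp_risks, dtype=float)
    logits = -t * emp_risks
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# As t grows, mass concentrates on the empirical risk minimizer (index 0).
risks = [0.10, 0.12, 0.30]
p_small = gibbs_posterior(risks, t=1.0)     # near-uniform
p_large = gibbs_posterior(risks, t=500.0)   # nearly a point mass
```

At t = 1 the three classifiers receive comparable mass; at t = 500 essentially all mass sits on the empirical risk minimizer, which is the concentration behaviour the text describes.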
Our work launches a different attack on the problem of using distribution-dependent priors. Loosely speaking, if a prior is chosen on the basis of the data, but in a way that is very stable to perturbations of the data set, then the resulting data-dependent prior should reflect the underlying data distribution, rather than the data, resulting in a bound that should still hold, perhaps with smaller probability. We formalize this intuition using differential privacy (Dwork, 2006; Dwork et al., 2015b).

We show that an ε-differentially private prior mean yields a valid, though necessarily looser, PAC-Bayes generalization bound. (See Theorem 4.2.) The result is a straightforward application of results connecting privacy and adaptive data analysis (Dwork et al., 2015b; Dwork et al., 2015a). The real challenge is using such a result: in practice, ε-differentially private mechanisms can be expensive to compute. In the context of generalization bounds for neural networks, we consider the possibility of using stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) to choose a data-dependent prior by way of stochastic optimization/sampling. By various results, SGLD is known to produce an (ε,δ)-differentially private release (Mir, 2013; Bassily, Smith, and Thakurta, 2014; Dimitrakakis et al., 2014; Wang, Fienberg, and Smola, 2015; Minami et al., 2016). A gap remains between pure and approximate differential privacy. Even if this gap were to be closed, the privacy/accuracy tradeoff of these analyses is too poor because they do not take advantage of the fact that, under some technical conditions, the distribution of SGLD's output converges weakly towards a stationary distribution (Teh, Thiery, and Vollmer, 2016, Thm. 7), which is ε-differentially private. One can also bound the KL divergence (and then the 2-Wasserstein distance) of SGLD to stationarity within a constant, given an appropriate fixed step size (Raginsky, Rakhlin, and Telgarsky, 2017).
Neither result implies that SGLD achieves pure ε-differential privacy. We show that we can circumvent this barrier in our PAC-Bayes setting. We give a general PAC-Bayes bound for non-private data-dependent priors, and then an application to multivariate Gaussian priors with non-private data-dependent means, with explicit bounds for the case of Gibbs posteriors. In particular, conditional on a data set, if a data-dependent mean vector w is close in 2-Wasserstein distance to an ε-differentially private mean vector, then the generalization error is close to that of the ε-differentially private mean. The data-dependent mean w is not necessarily differentially private, even approximately. As a consequence, under suitable assumptions, SGLD can be used to optimize a data-dependent mean and still yield a valid PAC-Bayes bound.

## 2 Other Related Work

Our analysis relies on the stability of a data-dependent prior. Stability has long been understood to relate to generalization (Bousquet and Elisseeff, 2002). Our result relies on the connection between generalization and differential privacy (Dwork et al., 2015b; Dwork et al., 2015a; Bassily et al., 2016; Oneto, Ridella, and Anguita, 2017), which can be viewed as a particularly stringent notion of stability. See (Dwork, 2008) for a survey of differential privacy. Kifer, Smith, and Thakurta (2012, Thm. 1) also establish a limit theorem for differential privacy, showing that the almost sure limit of a sequence of mechanisms of a given privacy level is itself a mechanism of that privacy level. Our result can be viewed as a significant weakening of the hypothesis, requiring only that the weak limit be private: no element of the sequence need be private. The bounds we establish hold for bounded loss functions and i.i.d. data.
Under additional assumptions, one can obtain PAC-Bayes generalization and excess risk bounds for unbounded loss with heavy tails (Catoni, 2007; Germain et al., 2016; Grünwald and Mehta, 2016; Alquier and Guedj, 2018). Alquier and Guedj (2018) also consider non-i.i.d. training data. Our approach to differentially private data-dependent priors can be readily extended to these settings.

## 3 Preliminaries

Let Z be a measurable space, let M₁(Z) denote the space of probability measures on Z, and let D ∈ M₁(Z) be unknown. We consider the batch supervised learning setting under a loss function bounded below: having observed S ∼ D^m, i.e., m independent and identically distributed samples from D, we aim to choose a predictor, parameterized by a weight vector w ∈ R^p, with minimal risk L_D(w) = E_{z∼D}[ℓ(w,z)], where ℓ : R^p × Z → R is measurable and bounded below. (We ignore the possibility of constraints on the weight vector for simplicity.) We also consider randomized predictors, represented by probability measures Q ∈ M₁(R^p), whose risks are defined via averaging: L_D(Q) = E_{w∼Q}[L_D(w)] = E_{z∼D} E_{w∼Q}[ℓ(w,z)], where the second equality follows from Tonelli's theorem and the fact that ℓ is bounded below. Let S = (z₁,…,z_m) and let D̂ = (1/m) Σ_{i=1}^m δ_{z_i} be the empirical distribution. Given a randomized predictor Q, such as one chosen by a learning algorithm on the basis of data S, its empirical risk, L̂_S(Q) := L_{D̂}(Q) = (1/m) Σ_{i=1}^m E_{w∼Q}[ℓ(w,z_i)], is studied as a stand-in for its risk, which we cannot compute. While L̂_S(Q) is easily seen to be an unbiased estimate of L_D(Q) when Q is independent of S, our goal is to characterize the (one-sided) generalization error L_D(Q) − L̂_S(Q) when Q is random and dependent on S. Finally, when optimizing the weight vector or defining tractable distributions on R^p, we use a (differentiable) surrogate risk L̃_S, which is the empirical average of a bounded surrogate loss taking values in an interval of length Δ.
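The empirical risk L̂_S(Q) of a randomized predictor averages the loss over both the sample and weights drawn from Q. A toy sketch (ours, not the paper's setup), using a linear classifier with 0-1 loss and a Gaussian Q, estimates it by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_loss(w, x, y):
    # Linear classifier: predict sign(<w, x>); loss 1 on a mistake.
    return float(np.sign(x @ w) != y)

def empirical_risk(Q_mean, Q_std, S, n_samples=2000):
    """Monte Carlo estimate of the empirical risk of Q = N(Q_mean, Q_std^2 I):
    average the loss over weights drawn from Q and examples in S."""
    ws = rng.normal(Q_mean, Q_std, size=(n_samples, len(Q_mean)))
    return np.mean([[zero_one_loss(w, x, y) for (x, y) in S] for w in ws])

# Toy sample: points labelled by the sign of their first coordinate.
S = [(np.array([1.0, 0.0]), 1), (np.array([-1.0, 0.0]), -1)]
risk = empirical_risk(np.array([5.0, 0.0]), 0.1, S)
```

With Q concentrated around a weight vector that classifies both points correctly, the estimate is essentially zero, illustrating how L̂_S(Q) depends on Q as a whole rather than on a single weight vector.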
## 3.1 Differential privacy

For readers not familiar with differential privacy, Appendix A provides the basic definitions and results. We use the notation A : R → T to denote a randomized algorithm A that takes an input in a measurable space R and produces a random output in the measurable space T.

Definition 3.1. A randomized algorithm A : Z^m → T is (ε,δ)-differentially private if, for all pairs S, S′ ∈ Z^m that differ at only one coordinate, and all measurable subsets B ⊆ T, we have P{A(S) ∈ B} ≤ e^ε P{A(S′) ∈ B} + δ. Further, ε-differentially private means (ε,0)-differentially private.

For our purposes, max-information is the key quantity controlled by differential privacy.

Definition 3.2 (Dwork et al. 2015a, §3). Let β ≥ 0, let X and Y be random variables in arbitrary measurable spaces, and let X′ be independent of Y and equal in distribution to X. The β-approximate max-information between X and Y, denoted I_∞^β(X;Y), is the least value k such that, for all product-measurable events E, P{(X,Y) ∈ E} ≤ e^k P{(X′,Y) ∈ E} + β. (3.1) The max-information I_∞(X;Y) is defined to be I_∞^β(X;Y) for β = 0. For m ∈ N and A : Z^m → T, the β-approximate max-information of A, denoted I_∞^β(A,m), is the least value k such that, for all D ∈ M₁(Z), I_∞^β(S; A(S)) ≤ k when S ∼ D^m. The max-information of A is defined similarly.¹

In Section 4.1, we consider the case where the dataset S and a data-dependent prior P(S) have small approximate max-information. The above definition tells us that we can almost treat the data-dependent prior as if it was chosen independently of S. The following is the key result connecting pure differential privacy and max-information:

Theorem 3.3 (Dwork et al. 2015a, Thms. 19-20). Fix m ∈ N. Let A : Z^m → T be ε-differentially private. Then I_∞(A,m) ≤ εm and, for all β > 0, I_∞^β(A,m) ≤ ε²m/2 + ε√(m ln(2/β)/2).

## 4 PAC-Bayes bounds

Let Q, P ∈ M₁(R^p). When Q is absolutely continuous with respect to P, written Q ≪ P, we write dQ/dP : R^p → R₊ ∪ {∞} for some Radon-Nikodym derivative of Q with respect to P.
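Definition 3.1 can be checked numerically for the textbook randomized-response mechanism on a single bit (our illustration; it is not from the paper): the worst-case likelihood ratio between outputs on neighbouring inputs is exactly e^ε.

```python
import math

def randomized_response_prob(bit, eps):
    """Classic eps-differentially private mechanism on one bit:
    report the true bit with probability e^eps / (1 + e^eps),
    otherwise flip it. Returns P{output = 1} given the input bit."""
    p_keep = math.exp(eps) / (1 + math.exp(eps))
    return p_keep if bit == 1 else 1 - p_keep

eps = 0.5
# "Neighbouring datasets" here differ in the single input bit.
p1 = randomized_response_prob(1, eps)   # P{output = 1 | input 1}
p0 = randomized_response_prob(0, eps)   # P{output = 1 | input 0}
# Worst-case ratio over the two output events B = {1} and B = {0}:
ratio = max(p1 / p0, (1 - p0) / (1 - p1))
```

The ratio equals e^ε exactly, so the inequality in Definition 3.1 holds with δ = 0 and is tight for this mechanism.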
The Kullback-Leibler (KL) divergence from Q to P is KL(Q‖P) = ∫ ln(dQ/dP) dQ if Q ≪ P, and ∞ otherwise. Let B_p denote the Bernoulli distribution on {0,1} with mean p. For p, q ∈ [0,1], we abuse notation and define kl(q‖p) := KL(B_q‖B_p) = q ln(q/p) + (1−q) ln((1−q)/(1−p)).

The following PAC-Bayes bound for bounded loss is due to Maurer (2004), and extends the 0-1 bound established by Langford and Seeger (2001), building off the seminal work of McAllester (1999) and Shawe-Taylor and Williamson (1997). See also (Langford, 2002) and (Catoni, 2007).

Theorem 4.1 (PAC-Bayes; Maurer 2004, Thm. 5). Under bounded loss ℓ ∈ [0,1], for every δ > 0, m ∈ N, distribution D on Z, and distribution P on R^p, with probability at least 1 − δ over S ∼ D^m, (∀Q) kl(L̂_S(Q) ‖ L_D(Q)) ≤ (KL(Q‖P) + ln(2√m/δ))/m.

One can use Pinsker's inequality to obtain a bound on the generalization error |L̂_S(Q) − L_D(Q)|; however, this significantly loosens the bound, especially when L̂_S(Q) is close to zero. We refer to the quantity kl(L̂_S(Q) ‖ L_D(Q)) as the KL-generalization error. From a bound on this quantity, we can bound the risk as follows: given empirical risk q and a bound c on the KL-generalization error, the risk is bounded by the largest value p ∈ [0,1] such that kl(q‖p) ≤ c. See (Dziugaite and Roy, 2017) for a discussion of this computation. When the empirical risk is near zero, the KL-generalization error is essentially the generalization error. As empirical risk increases, the bound loosens, and the square root of the KL-generalization error bounds the generalization error.

## 4.1 Data-dependent priors

The prior P that appears in the PAC-Bayes generalization bound must be chosen independently of the data S ∼ D^m, but can depend on the data distribution D itself. If a data-dependent prior P(S) does not depend too much on any individual data point, it should behave as if it depends only on the distribution.
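The risk-bound computation described above (the largest p with kl(q‖p) ≤ c) has no closed form, but since kl(q‖·) is increasing on [q, 1], it can be inverted by bisection. A minimal sketch (function names are ours):

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, c, tol=1e-9):
    """Largest p in [q, 1] with kl(q || p) <= c, found by bisection;
    kl(q || .) is 0 at p = q and increasing on [q, 1]."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# E.g. empirical risk 0.05 and KL-generalization bound c = 0.01:
bound = kl_inverse(0.05, 0.01)
```

For small empirical risk the inverted bound is much tighter than the Pinsker-style bound q + √(c/2), which is the looseness the text refers to.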
Theorem 3.3 allows us to formalize this intuition: we can obtain new PAC-Bayes bounds that use data-dependent priors, provided they are ε-differentially private. We provide an example using the bound of Maurer (Theorem 4.1).

Theorem 4.2 (PAC-Bayes with private data-dependent priors). Fix a bounded loss ℓ ∈ [0,1]. Let m ∈ N, let P : Z^m → M₁(R^p) be an ε-differentially private mechanism for choosing a data-dependent prior, let D ∈ M₁(Z), and let S ∼ D^m. Then, for every δ > 0, with probability at least 1 − δ, for all Q ∈ M₁(R^p), kl(L̂_S(Q) ‖ L_D(Q)) ≤ (KL(Q‖P(S)) + ln(4√m/δ))/m + ε²/2 + ε√(ln(4/δ)/(2m)). (4.2)

¹Note that in much of the literature it is standard to express the max-information in bits, i.e., the factor e^k above is replaced by 2^{k′} with k′ = k log₂ e.

See Appendix B for a proof of a more general statement of the theorem and further discussion. The main innovation here is recognizing the potential to choose data-dependent priors using private mechanisms. The hard work is done by Theorem 3.3: obtaining differentially private versions of other PAC-Bayes bounds is straightforward. When one is choosing the privacy parameter ε, there is a balance between minimizing the direct contributions of ε to the bound (forcing ε smaller) and minimizing the indirect contribution of ε through the KL term for posteriors Q that have low empirical risk (forcing ε larger). The optimal value for ε is often much less than one, which can be challenging to obtain. We discuss strategies for achieving the required privacy in later sections.

## 5 Weak approximations to ε-differentially private priors

Theorem 4.2 permits data-dependent priors that are chosen by ε-differentially private mechanisms. In this section, we discuss concrete families of priors and mechanisms for choosing among them in data-dependent ways. We also relax Theorem 4.2 to allow non-private priors. We apply our main result to non-private data-dependent Gaussian priors with a fixed covariance matrix. Thus, we choose only the mean w₀ ∈ R^p privately in a data-dependent way.
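The right-hand side of Theorem 4.2 is cheap to evaluate, which makes the ε tradeoff described above easy to explore numerically. The sketch below uses the constants as reconstructed here (the Maurer term plus the max-information penalty of Theorem 3.3 with β = δ/2); the function and argument names are ours, and the inputs are illustrative.

```python
import math

def private_pacbayes_rhs(kl_qp, m, eps, delta):
    """Right-hand side of the differentially private PAC-Bayes bound
    (Theorem 4.2, with constants as reconstructed here): the Maurer
    term plus the privacy penalty derived from Theorem 3.3."""
    maurer = (kl_qp + math.log(4 * math.sqrt(m) / delta)) / m
    privacy = eps ** 2 / 2 + eps * math.sqrt(math.log(4 / delta) / (2 * m))
    return maurer + privacy

# Illustrative numbers: KL(Q||P(S)) = 50 nats, m = 10,000, delta = 0.05.
rhs_nonpriv = private_pacbayes_rhs(kl_qp=50.0, m=10_000, eps=0.0, delta=0.05)
rhs_priv = private_pacbayes_rhs(kl_qp=50.0, m=10_000, eps=0.1, delta=0.05)
```

Setting ε = 0 recovers the Maurer bound (with the ln(4√m/δ) constant of Theorem 4.2); any ε > 0 strictly loosens the bound, which is the direct cost that must be repaid by a smaller KL term.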
We show that it suffices for the data-dependent mean to be merely close in 2-Wasserstein distance to a private mean to yield a generalization bound. (It is natural to consider also choosing a data-dependent covariance, but as argued below, the privacy budget we have in applications to generalization is very small.)

Ideally, we would choose a mean vector w₀ that leads to a tight bound. A reasonable approach is to choose w₀ by approximately minimizing the empirical risk L̂_S or surrogate risk L̃_S, subject to privacy constraints. A natural way to do this is via an exponential mechanism. We pause to introduce some notation for Gibbs distributions: for a measure P on R^p and a measurable function g : R^p → R, let P[g] denote the expectation ∫ g(h) P(dh) and, provided P[g] < ∞, let P_g denote the probability measure on R^p, absolutely continuous with respect to P, with Radon-Nikodym derivative dP_g/dP (h) = g(h)/P[g]. A distribution of the form P_{exp(−tg)} is generally referred to as a Gibbs distribution with energy function g and inverse temperature t. In the special case where P is a probability measure, we call P_{exp(−t L̃_S)} a Gibbs posterior.

Lemma 5.1 (McSherry and Talwar 2007, Thm. 6). Let q : Z^m × R^p → R be measurable, let P be a σ-finite measure on R^p, let β > 0, and assume P[exp(−β q(S,·))] < ∞ for all S ∈ Z^m. Let Δq := sup_{S,S′} sup_{w∈R^p} |q(S,w) − q(S′,w)|, where the first supremum ranges over pairs S, S′ ∈ Z^m that disagree on no more than one coordinate. Let A : Z^m → R^p, on input S ∈ Z^m, output a sample from the Gibbs distribution P_{exp(−β q(S,·))}. Then A is 2βΔq-differentially private.

The following result is a straightforward application of Lemma 5.1, and essentially equivalent results have appeared in numerous studies of the differential privacy of Bayesian and Gibbs posteriors (Mir, 2013; Bassily, Smith, and Thakurta, 2014; Dimitrakakis et al., 2014; Wang, Fienberg, and Smola, 2015; Minami et al., 2016).

Corollary 5.2.
Let t > 0 and let L̃_S denote the surrogate risk, taking values in an interval of length Δ. One sample from the Gibbs posterior P_{exp(−t L̃_S)} is (2tΔ/m)-differentially private.

## 5.1 Weak convergence yields valid PAC-Bayes bounds

Even for small values of the inverse temperature, it is difficult to implement the exponential mechanism because sampling from Gibbs posteriors exactly is intractable. On the other hand, a number of algorithms exist for generating approximate samples from Gibbs posteriors. If one can control the total-variation distance, one can obtain a bound like Theorem 4.2 by applying approximate max-information bounds for (ε,δ)-differentially private mechanisms. However, many algorithms do not control the total-variation distance to stationarity, or do so poorly. The generalization properties of randomized classifiers are generally insensitive to small variations of the parameters, however, and so it stands to reason that our data-dependent prior need only be close to an ε-differentially private prior. We formalize this intuition here by deriving bounds on KL(Q‖P(S)) in terms of a non-private data-dependent prior P_S. We start with an identity:

Lemma 5.3. If P′ ≪ P, then KL(Q‖P) = KL(Q‖P′) + Q[ln(dP′/dP)].

The proof is straightforward. The lemma highlights the role of Q in judging the difference between P′ and P, and leads immediately to the following corollary (see Appendix C).

Lemma 5.4 (Non-private priors). Let m ∈ N, let P : Z^m → M₁(R^p) be ε-differentially private, let D ∈ M₁(Z), let S ∼ D^m, and let P_S be a data-dependent prior such that, for some P*(S) satisfying P[P*(S) ∈ · | S] = P[P(S) ∈ · | S], we have P_S ≪ P*(S) with probability at least 1 − δ′. Then, with probability at least 1 − δ − δ′, Eq. (4.2) holds with KL(Q‖P(S)) replaced by KL(Q‖P_S) + Q[ln(dP_S/dP*(S))].

The conditions that P_S ≪ P*(S), and that Q[ln(dP_S/dP*(S))] be finite for some Q, do not constrain P_S to be differentially private. In fact, S could be P_S-measurable!
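Over a finite candidate set, the exponential mechanism of Lemma 5.1 can be implemented exactly; the sketch below (our toy example, with 0-1 loss, so the per-example sensitivity gives D = 1 and privacy level 2tD/m as in Corollary 5.2) samples a threshold classifier from a Gibbs distribution with a uniform base measure.

```python
import numpy as np

rng = np.random.default_rng(1)

def exponential_mechanism(cands, emp_risk, S, t):
    """Sample one candidate from the Gibbs distribution with mass
    proportional to exp(-t * emp_risk(S, w)) (uniform base measure).
    Per Lemma 5.1, with per-example losses in an interval of length D,
    the empirical risk has sensitivity D/m, so this is (2*t*D/m)-DP."""
    scores = np.array([-t * emp_risk(S, w) for w in cands])
    scores -= scores.max()          # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return cands[rng.choice(len(cands), p=probs)], probs

# Toy data: 1-D points with 0/1 labels; candidate thresholds; 0-1 loss.
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
def emp_risk(S, w):
    # Threshold classifier: predict 1 iff x >= w.
    return np.mean([(x >= w) != bool(y) for x, y in S])
cands = [0.0, 0.5, 1.0]
choice, probs = exponential_mechanism(cands, emp_risk, S, t=50.0)
eps = 2 * 50.0 * 1.0 / len(S)   # privacy level 2tD/m
```

At t = 50 on m = 4 points, the mechanism almost surely returns the zero-error threshold 0.5, but the implied ε = 25 is enormous; this is the privacy/accuracy tension discussed in the text, since useful ε values are far below one.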
Lemma 5.4 is not immediately applicable because P*(S) is intractable to generate. The following application considers multivariate Gaussian priors, N(w), indexed by their mean vectors w ∈ R^p, with a fixed positive definite covariance matrix Σ ∈ R^{p×p}. We write ‖x‖_{Σ⁻¹} = √(xᵀΣ⁻¹x). We require two technical results: the first implies that we can bound Q[ln(dP_S/dP*(S))] if Q concentrates near the non-private mean; the second characterizes this concentration for Gibbs posteriors built from bounded surrogate risks.

Lemma 5.5. Q[ln(dN(w)/dN(w₀))] ≤ ½ ‖w − w₀‖²_{Σ⁻¹} + ‖w − w₀‖_{Σ⁻¹} E_{v∼Q} ‖v − w₀‖_{Σ⁻¹}.

Lemma 5.6. Let P = N(w) and Q = P_{exp(h)} for h ≤ 0. Then E_{v∼Q} ‖v − w‖_{Σ⁻¹} ≤ 2√(‖h‖_{L∞(P)}) + √p.

Corollary 5.7 (Gaussian means close to private means). Let m ∈ N, let D ∈ M₁(Z), let S ∼ D^m, let A : Z^m → R^p be ε-differentially private, and let w(S) denote a data-dependent mean vector such that, for some w*(S) satisfying P[w*(S) ∈ · | S] = P[A(S) ∈ · | S], we have ‖w(S) − w*(S)‖₂ ≤ C (5.1) with probability at least 1 − δ′. Let σ_min be the minimum eigenvalue of Σ. Then, with probability at least 1 − δ − δ′, Eq. (4.2) holds with KL(Q‖P(S)) replaced by KL(Q‖N(w(S))) + ½ C²/σ_min + (C/√σ_min) E_{v∼Q} ‖v − w(S)‖_{Σ⁻¹}. In particular, for a Gibbs posterior Q = (P_S)_{exp(−t L̃_S)}, the last expectation can be bounded explicitly via Lemma 5.6 with h = −t L̃_S. See Appendix C for details and further discussion.

One way to achieve Eq. (5.1) is to construct w₁(S), w₂(S), … so that P[w_n(S) ∈ · | S] → P[A(S) ∈ · | S] weakly with high probability. Skorohod's representation theorem then implies the existence of w*(S). One of the standard algorithms used to generate such sequences for high-dimensional Gibbs distributions is stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011). In order to get nonasymptotic results, it suffices to bound the 2-Wasserstein distance of the SGLD Markov chain to stationarity. Recall that the p-Wasserstein distance between μ and ν is given by W_p(μ,ν)^p = inf_γ ∫ ‖v − w‖₂^p dγ(v,w), where the infimum runs over couplings of μ and ν, i.e., distributions γ ∈ M₁(R^p × R^p) with marginals μ and ν, respectively.
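Two facts about W₂ make the closeness condition Eq. (5.1) tangible (both are standard, not specific to the paper): between two Gaussians with the same covariance, W₂ with the Euclidean ground metric is just the distance between the means; and in one dimension, the optimal coupling pairs sorted samples, giving a simple Monte Carlo estimate. A sketch comparing the two:

```python
import numpy as np

rng = np.random.default_rng(2)

def w2_gaussians_same_cov(w, w_prime):
    """W2 between N(w, S) and N(w', S) with a shared covariance S
    (Euclidean ground metric) reduces to ||w - w'||_2."""
    return float(np.linalg.norm(np.asarray(w) - np.asarray(w_prime)))

def w2_empirical_1d(xs, ys):
    """Monte Carlo W2 estimate in 1-D: the optimal coupling pairs
    sorted samples (the quantile coupling)."""
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.sqrt(np.mean((xs - ys) ** 2)))

w, w_prime = 0.0, 1.5
exact = w2_gaussians_same_cov([w], [w_prime])
approx = w2_empirical_1d(rng.normal(w, 1.0, 200_000),
                         rng.normal(w_prime, 1.0, 200_000))
```

The empirical estimate converges to the closed form as the sample size grows, which is the sense in which one can monitor (approximately) how far a non-private mean sits from a private one.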
Let A(S) return a sample from the Gibbs posterior P_{exp(−t L̃_S)} with Gaussian P and surrogate risk L̃_S constructed from smooth loss functions taking values in a length-Δ interval. By Corollary 5.2, A is (2tΔ/m)-differentially private. Consider running SGLD to target P[A(S) ∈ · | S] = P_{exp(−t L̃_S)}. Assume² that, for every c > 0, there is a step size h > 0 and a number of SGLD iterations n ∈ O(1/c^q), such that the n-th iterate w(S) ∈ R^p produced by SGLD satisfies W₂(P[w(S) ∈ · | S], P[A(S) ∈ · | S]) ≤ c. Markov's inequality and the definition of W₂ immediately imply the following.

Corollary 5.8 (Prior via SGLD). Under the above assumption, for some step size h > 0, the n-th iterate w(S) of SGLD targeting P_{exp(−t L̃_S)} satisfies Corollary 5.7 with ε = 2tΔ/m and C ∈ O(1/(δ′ n^{1/q})).

The dependence on δ′ is poor. However, one can construct Markov chain algorithms that are geometrically ergodic, in which case n^{−1/q} is replaced by a term e^{−Ω(n)}, allowing one to spend computation to control the 1/δ′ term.

²The status of this assumption relates to results by Mattingly, Stuart, and Higham (2002, Thm. 7.3) and Raginsky, Rakhlin, and Telgarsky (2017, Prop. 3.3), under so-called dissipativity conditions on the regularized loss. We have hidden potentially exponential dependence on t, which is problem dependent. See also (Erdogdu, Mackey, and Shamir, 2018).

## 6 Empirical studies

We have presented data-dependent bounds, and so it is necessary to study them empirically to evaluate their usefulness. The goal of this section is to make a number of arguments. First, it is an empirical question as to what value of the inverse temperature t is sufficient to yield small empirical risk from a Gibbs posterior. Indeed, we compare to the bound of Lever, Laviolette, and Shawe-Taylor (2013), presented above as Theorem 1.1. As Lever, Laviolette, and Shawe-Taylor point out, this bound depends explicitly on t, where it plays an obvious role as a measure of complexity.
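For readers unfamiliar with SGLD, its update is w ← w − (h/2)∇U(w) + √h ξ with ξ ∼ N(0, I), targeting a distribution proportional to exp(−U). The sketch below (a one-dimensional toy of ours, not the paper's training setup) targets a Gibbs posterior whose energy is a quadratic surrogate risk plus a Gaussian log-prior, so the stationary mean can be checked analytically.

```python
import numpy as np

rng = np.random.default_rng(3)

def sgld(grad_U, w0, step, n_iters):
    """Stochastic gradient Langevin dynamics with a fixed step size:
    w <- w - (step/2) * grad_U(w) + sqrt(step) * N(0, I)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w = w - 0.5 * step * grad_U(w) + np.sqrt(step) * rng.standard_normal(w.shape)
    return w

# Toy energy: U(w) = t * (w - mean(z))^2 + (gamma/2) * w^2, so the target
# exp(-U) is Gaussian with mean 2*t*mean(z) / (2*t + gamma).
z = rng.normal(2.0, 1.0, size=1000)
t, gamma = 50.0, 1.0
grad_U = lambda w: t * 2 * (w - z.mean()) + gamma * w

samples = np.array([sgld(grad_U, [0.0], step=1e-3, n_iters=1500)[0]
                    for _ in range(100)])
```

With this quadratic energy, the chain mixes quickly and the sample mean of the iterates matches the analytic stationary mean; for neural network losses there is of course no such check, which is exactly the nonconvergence concern raised in Section 6.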
Second, despite how tight this bound is for small values of t, the bound must become vacuous before the Gibbs posterior would have started to overfit on random labels, because this bound holds for all data distributions. We demonstrate that this phase transition happens well before the Gibbs posterior achieves its minimum risk on true labels. Third, because our bound retains the KL term, we can potentially identify easy data. Indeed, our risk bound decreases beyond the point where the same classifier begins to overfit to random labels. Finally, our results suggest that we can use the property that SGLD converges weakly to investigate the generalization properties of Gibbs classifiers. More work is clearly needed to scale our study to full-fledged deep learning benchmarks. (See Appendix E for details on how we compute the KL term and the challenges there due to Gibbs posteriors Q not being exactly samplable.)

More concretely, we perform an empirical evaluation using SGLD to approximate simulating from Gibbs distributions. Corollary 5.7 provides some justification by showing that a slight weakening of the PAC-Bayes generalization bound is valid provided that SGLD eventually controls the 2-Wasserstein distance to stationarity. However, because we cannot measure our convergence in practice, it is an empirical question as to whether our samples are accurate enough. Violated bounds would be an obvious sign of trouble. We expect the bound on the classification error not to go below the true error as estimated on the held-out test set (with high probability). We perform an experiment on MNIST (and CIFAR10, with the same conclusion, so we have not included it) using true and random labels and find that no bounds are violated. The results suggest that it may be useful to empirically study bounds for Gibbs classifiers using SGLD. Our main focus is a synthetic experiment comparing the bounds of Lever, Laviolette, and Shawe-Taylor (2013) to our new bounds based on privacy.
The main finding here is that, as expected, the bounds by Lever, Laviolette, and Shawe-Taylor must explode when the Gibbs classifier begins to overfit random labels, whereas our bounds, on true labels, continue to track the training error and bound the test error.

Our focus is on classification by neural networks into K classes. Thus Z = X × [K], and we use neural networks that output probability vectors over these K classes. Given weights w ∈ R^p and input x ∈ X, the probability vector output by the network is p(w,x) ∈ [0,1]^K. Networks are trained by minimizing the cross entropy loss ℓ(w,(x,y)) = g(p(w,x), y), where g((p₁,…,p_K), y) = −ln p_y. Note that cross entropy loss is merely bounded below. We report results in terms of {0,1}-valued classification error: ℓ₀₁(w,(x,y)) = 0 if and only if y is the largest coordinate of p(w,x). We refer to elements of R^p and M₁(R^p) as classifiers and randomized classifiers, respectively, and to the (empirical) 0-1 risk as the (empirical) error. We train two different architectures using SGLD on MNIST and a synthetic dataset, SYNTH. The experimental setup is explained in Appendix F.

One-stage training procedure. We run SGLD for T training epochs with a fixed value of the parameter t. We observe that convergence appears to occur within 10 epochs, but use a much larger number of training epochs to potentially expose nonconvergence behavior. The value of the inverse temperature t is fixed during the whole training procedure.

Two-stage training procedure. In order to evaluate our private PAC-Bayes bound (Theorem 4.2), we perform a two-stage training procedure:

Stage One. We run SGLD for T₁ epochs with inverse temperature t₁, minimizing the standard cross entropy objective. Let w₀ denote the neural network weights after stage one.

Figure 1: Results for a fully connected neural network trained on the MNIST dataset with SGLD and a fixed value of t. We vary t on the x-axis. The y-axis shows the average 0-1 loss.
We plot the estimated generalization gap, which is the difference between the training and test errors. The left plot shows the results for the true-label dataset. We observe that the training error converges to zero as t increases. Further, while the generalization error increases for intermediate values of t (10⁴ to 10⁶), it starts dropping again as t increases. We see that the Lever bound fails to capture this behaviour due to its monotonic increase with t. The right-hand plot shows the results for a classifier trained on a random labelling of MNIST images. The true error is around 0.9. For small values of t (under 10³), the network fails to learn, and the training error stays at around 0.9. When t exceeds the number of training points, the network starts to overfit heavily. The sharp increase of the generalization gap is predicted by the Lever bound.

Transition. We restart the learning rate schedule and continue SGLD for T₁ epochs while linearly annealing the inverse temperature between t₁ and t₂, i.e., t_e = ((e − T₁) t₂ + (2T₁ − e) t₁)/T₁, where e ∈ [T₁, 2T₁] is the current epoch number. The objective at w is the cross entropy loss for w plus a weight-decay term (γ/2)‖w‖²₂.

Stage Two. We continue SGLD for T₂ − T₁ epochs with inverse temperature t₂. The objective is the same as in the transition stage.

During the first stage, the k-step transitions of SGLD converge weakly towards a Gibbs distribution with a uniform base measure, producing a random vector w₀ ∈ R^p. The private data-dependent prior P_{w₀} is the Gaussian distribution centred at w₀ with diagonal covariance (1/γ) I_p. During the second stage, SGLD converges to the Gibbs posterior with the Gaussian base measure P_{w₀}, i.e., Q_{t₂} = (P_{w₀})_{exp(−t₂ L̂_S)}.

Bound calculation. Our experiments evaluate different values of the inverse temperature t. We evaluate Lever bounds for the randomized classifier Q_t obtained by the one-stage training procedure, with T = 1000. We do so on both the MNIST and SYNTH datasets.
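The linear annealing schedule in the transition stage can be sketched directly (the endpoint values below are illustrative, not the paper's settings):

```python
def annealed_inverse_temp(e, T1, t1, t2):
    """Linear annealing of the inverse temperature during the transition
    stage, epochs e in [T1, 2*T1]: equals t1 at e = T1 and t2 at e = 2*T1."""
    return ((e - T1) * t2 + (2 * T1 - e) * t1) / T1

T1, t1, t2 = 100, 1.0, 1000.0
start = annealed_inverse_temp(100, T1, t1, t2)   # = t1
mid = annealed_inverse_temp(150, T1, t1, t2)     # halfway between t1 and t2
end = annealed_inverse_temp(200, T1, t1, t2)     # = t2
```

The schedule interpolates affinely, so the chain moves smoothly from the high-temperature (small t₁) regime used to learn the prior mean to the low-temperature (large t₂) regime used for the posterior.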
We also evaluate our private PAC-Bayes bound (Theorem 4.2) for the randomized classifier Qτ2 and the private data-dependent prior Pw0, where the privacy parameter depends on τ1. The bound depends on the value of KL(Qτ2||Pw0); the challenges of estimating this term are described in Appendix E. We only evaluate the differentially private PAC-Bayes bounds on the small neural network and the SYNTH dataset. The parameter settings for the SYNTH experiments are T1 = 100, T2 = 1000, γ = 2; for MNIST, T1 = 500, T2 = 1000, γ = 5. When evaluating Lever bounds with the one-stage learning procedure, T = 1000 for both datasets. 6.2 Results Results are presented in Figs. 1 and 2. We never observe a violation of the PAC-Bayes bounds for Gibbs distributions. This suggests either that our assumption that SGLD has nearly converged is accurate enough, or that the bounds are sufficiently loose that any effect of nonconvergence is masked. Our MNIST experiments highlight that the Lever bounds upper bound the risk for every possible data distribution, including the random label distribution. In the random label experiment (Fig. 1, right plot), when τ gets close to the number of training samples, the generalization error starts increasing steeply. This phase transition is captured by the Lever bound. In the true label experiment (Fig. 1, left plot), the generalization error does not rise with τ. Indeed, it continues to decrease, and so the Lever bound quickly becomes vacuous as we increase τ. The Lever bound cannot capture this behavior because it must simultaneously bound the generalization error under random labels. On the SYNTH dataset, we see the same phase transition under random labels, and so Lever bounds remain vacuous after this point. In contrast, our private PAC-Bayes bounds can track the error beyond the phase transition that occurs under random labels. (See Fig. 2.) At high values of τ, our KL upper bound becomes very loose. Figure 2: Results for a small fully connected neural network trained on a synthetically generated dataset, SYNTH, consisting of 50 training examples. The x-axis shows the τ value, and the y-axis the average 0–1 loss. To generate the top plots, we train the network with one-stage SGLD; the top-left plot corresponds to the true label dataset, and the top-right to the random label dataset. As in the MNIST experiments, we do not witness any violation of the Lever bounds, and once again the Lever bound gets very loose for larger values of τ in the true label case. The bottom plot shows the results for two-stage SGLD; here the x-axis plots the τ2 value used in the second-stage optimization, the first stage used τ1 = 1, and the network is trained on true labels. The differentially private PAC-Bayes bound yields a much tighter estimate of the generalization gap for larger values of τ than the Lever bound (top left). When τ becomes very large relative to the amount of training data, it becomes more difficult to sample from the Gibbs posterior, resulting in a looser upper bound on the KL divergence between the prior and posterior. Private versus Lever PAC-Bayes bound. While the Lever PAC-Bayes bound fails to explain generalization for high values of τ, our private PAC-Bayes bound may remain nonvacuous. This is because it retains the KL term, which is sensitive to the data distribution via Q, and thus can be much lower than the upper bound on the KL in Lever, Laviolette, and Shawe-Taylor (2013) for datasets with small true Bayes risk. Two-stage optimization, inspired by the DP PAC-Bayes bound, allows us to obtain more accurate classifiers by setting a higher inverse temperature parameter, τ2, in the second stage. We do not plot the DP PAC-Bayes bound for the MNIST experiments due to the computational challenges of approximating the KL term over a high-dimensional parameter space, as discussed in Appendix E.
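A standard identity makes the KL term amenable to Monte Carlo estimation: for a Gibbs posterior Qτ ∝ exp(−τ L̂S) P with partition function Z = E_P[exp(−τ L̂S)], one has KL(Qτ||P) = −τ E_{Qτ}[L̂S] − ln Z, so ln Z can be approximated from samples drawn from the prior. The sketch below illustrates this identity only; the estimator actually used (Eq. (E.5) of Appendix E) is not reproduced here, and `empirical_risk` plus the sample lists are placeholders for the real loss and samplers.

```python
import numpy as np

def log_partition_mc(prior_samples, empirical_risk, tau):
    """Monte Carlo estimate of ln Z = ln E_P[exp(-tau * L_hat(w))]
    from i.i.d. samples w_i ~ P, stabilized via log-sum-exp."""
    log_weights = np.array([-tau * empirical_risk(w) for w in prior_samples])
    m = log_weights.max()
    return m + np.log(np.mean(np.exp(log_weights - m)))

def kl_gibbs_estimate(posterior_samples, prior_samples, empirical_risk, tau):
    """Plug-in estimate of KL(Q_tau || P) = -tau * E_Q[L_hat] - ln Z,
    using samples from the posterior for the expectation and samples
    from the prior for the partition function."""
    e_q = np.mean([empirical_risk(w) for w in posterior_samples])
    return -tau * e_q - log_partition_mc(prior_samples, empirical_risk, tau)
```

The looseness described above shows up here directly: when τ is large, exp(−τ L̂S) concentrates on a tiny region of the prior's support, so the variance of the ln Z estimate (and hence of the KL estimate) blows up unless very many prior samples are used.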
We evaluate our private PAC-Bayes bound on the MNIST dataset only for the combination of τ1 = 10^3 and τ2 ∈ {3×10^3, 3×10^4, 10^5, 3×10^5}. These values are chosen such that τ1 gives a small penalty for using the data to learn the prior, and τ2 is large enough that at τ = τ2 Lever's bound is vacuous (as seen in Fig. 1). We use 10^5 samples from the DP Gaussian prior learnt in stage one to approximate the ln Z term that appears in the KL, as defined in Eq. (E.5). The results are presented in the table below. While the DP PAC-Bayes bound is very loose, it is still smaller than Lever's bound for high values of the inverse temperature. Note that for smaller values of τ2, we can use Lever's upper bound on the KL term instead of performing a Monte Carlo approximation. Since τ1 is small and adds only a small penalty (≈1%), the DP PAC-Bayes bound is then equal to Lever's bound plus a differential privacy penalty (≈1%).

τ2                                3×10^3   3×10^4   10^5   3×10^5
Test error                        0.12     0.07     0.06   0.04
DP PAC-Bayes bound on test        0.21     0.35     0.65   1
Lever PAC-Bayes (τ = τ2) on test  0.26     1        1      1

Acknowledgments The authors would like to thank Olivier Catoni, Pascal Germain, Mufan Li, David McAllester, Alexander Rakhlin, and John Shawe-Taylor for helpful discussions. This research was carried out in part while the authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley. GKD was additionally supported by an EPSRC studentship. DMR was additionally supported by an NSERC Discovery Grant and an Ontario Early Researcher Award. P. Alquier and B. Guedj (May 2018). Simpler PAC-Bayesian Bounds for Hostile Data. Mach. Learn. 107.5, pp. 887–902. DOI: 10.1007/s10994-017-5690-0. P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Info. Proc. Syst. (NIPS), pp. 6241–6250. R. Bassily, A. Smith, and A. Thakurta (2014). Differentially private empirical risk minimization: Efficient algorithms and tight error bounds.
arXiv: 1405.7085v2 [cs.LG]. R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman (2016). Algorithmic stability for adaptive data analysis. Proc. Symp. Theory of Comput. (STOC), pp. 1046–1059. L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy (2016). PAC-Bayesian bounds based on the Rényi divergence. Proc. Artificial Intelligence and Statistics (AISTATS), pp. 435–444. O. Bousquet and A. Elisseeff (2002). Stability and generalization. Journal of Machine Learning Research 2.Mar, pp. 499–526. O. Catoni (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes–Monograph Series. Institute of Mathematical Statistics. DOI: 10.1214/074921707000000391. arXiv: 0712.0248 [stat.ML]. C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. Rubinstein (2014). Robust and private Bayesian inference. Int. Conf. Algorithmic Learning Theory (ALT), pp. 291–305. C. Dwork (2006). Differential Privacy. Int. Colloq. Automata, Languages and Programming (ICALP), pp. 1–12. DOI: 10.1007/11787006_1. C. Dwork (2008). Differential privacy: A survey of results. International Conference on Theory and Applications of Models of Computation. Springer, pp. 1–19. C. Dwork, A. Roth, et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9.3–4, pp. 211–407. C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth (2015a). Generalization in adaptive data analysis and holdout reuse. Advances in Neural Info. Proc. Syst., pp. 2350–2358. C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth (2015b). Preserving statistical validity in adaptive data analysis. Proc. Symp. Theory of Comput. (STOC), pp. 117–126. G. K. Dziugaite and D. M. Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. Proc. 33rd Conf. Uncertainty in Artificial Intelligence (UAI).
arXiv: 1703.11008. M. A. Erdogdu, L. Mackey, and O. Shamir (2018). Global Non-convex Optimization with Discretized Diffusions. arXiv: 1810.12361. P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien (2016). PAC-Bayesian Theory Meets Bayesian Inference. Advances in Neural Info. Proc. Syst., pp. 1884–1892. P. D. Grünwald and N. A. Mehta (2016). Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes. arXiv: 1605.00252. P. D. Grünwald and N. A. Mehta (2017). A Tight Excess Risk Bound via a Unified PAC-Bayesian–Rademacher–Shtarkov–MDL Complexity. arXiv: 1710.07732. D. Kifer, A. Smith, and A. Thakurta (2012). Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research 1.41, pp. 1–40. J. Langford (2002). Quantitatively tight sample complexity bounds. PhD thesis. Carnegie Mellon University. J. Langford and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS-01-102. Carnegie Mellon University. G. Lever, F. Laviolette, and J. Shawe-Taylor (2013). Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473, pp. 4–28. DOI: 10.1016/j.tcs.2012.10.013. B. London (2017). A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent. Advances in Neural Info. Proc. Syst. (NIPS), pp. 2931–2940. J. Mattingly, A. Stuart, and D. Higham (2002). Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications 101.2, pp. 185–232. DOI: 10.1016/S0304-4149(02)00150-3. A. Maurer (2004). A note on the PAC-Bayesian theorem. arXiv: cs/0411099 [cs.LG]. D. A. McAllester (1999). PAC-Bayesian Model Averaging. Proc. Conf. Learning Theory (COLT), pp. 164–170. F. McSherry and K. Talwar (2007). Mechanism Design via Differential Privacy. Proc. Symp. Found. Comp. Sci. (FOCS), pp. 94–103. K. Minami, H. Arai, I. Sato, and H. Nakagawa (2016).
Differential Privacy without Sensitivity. Advances in Neural Info. Proc. Syst. (NIPS), pp. 956–964. D. J. Mir (2013). Differential privacy: an exploration of the privacy-utility landscape. PhD thesis. Rutgers University. B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017a). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Proc. Int. Conf. on Learning Representations (ICLR). arXiv: 1707.09564. B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017b). Exploring generalization in deep learning. Advances in Neural Info. Proc. Syst. (NIPS), pp. 5949–5958. L. Oneto, S. Ridella, and D. Anguita (2017). Differential privacy and generalization: Sharper bounds with applications. Pattern Recognition Letters 89, pp. 31–38. DOI: 10.1016/j.patrec.2017.02.006. E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun (2012). PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research 13.Dec, pp. 3507–3531. M. Raginsky, A. Rakhlin, and M. Telgarsky (2017). Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. Proc. Conf. on Learning Theory (COLT). arXiv: 1702.03849. O. Rivasplata, C. Szepesvari, J. S. Shawe-Taylor, E. Parrado-Hernandez, and S. Sun (2018). PAC-Bayes bounds for stable algorithms with instance-dependent priors. Advances in Neural Info. Proc. Syst. 31, pp. 9234–9244. J. Shawe-Taylor and R. C. Williamson (1997). A PAC analysis of a Bayesian estimator. Proc. Conf. on Learning Theory (COLT). ACM, pp. 2–9. S. L. Smith and Q. V. Le (2018). A Bayesian perspective on generalization and stochastic gradient descent. Proc. Int. Conf. on Learning Representations (ICLR). Y. W. Teh, A. H. Thiery, and S. J. Vollmer (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, pp. 1–33. N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin (2017).
A Strongly Quasiconvex PAC-Bayesian Bound. Int. Conf. on Algorithmic Learning Theory (ALT), pp. 466–492. arXiv: 1608.05610. Y.-X. Wang, S. E. Fienberg, and A. J. Smola (2015). Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo. Proc. Int. Conf. Machine Learning (ICML), pp. 2493–2502. M. Welling and Y. W. Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proc. of the 28th Int. Conf. on Machine Learning (ICML), pp. 681–688.