# Hypothesis Set Stability and Generalization

Dylan J. Foster, Massachusetts Institute of Technology, dylanf@mit.edu
Spencer Greenberg, Spark Wave, admin@sparkwave.tech
Satyen Kale, Google Research, satyen@satyenkale.com
Haipeng Luo, University of Southern California, haipengl@usc.edu
Mehryar Mohri, Google Research and Courant Institute, mohri@google.com
Karthik Sridharan, Cornell University, sridharan@cs.cornell.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We present a study of generalization for data-dependent hypothesis sets. We give a general learning guarantee for data-dependent hypothesis sets based on a notion of transductive Rademacher complexity. Our main result is a generalization bound for data-dependent hypothesis sets expressed in terms of a notion of hypothesis set stability and a notion of Rademacher complexity for data-dependent hypothesis sets that we introduce. This bound admits as special cases both standard Rademacher complexity bounds and algorithm-dependent uniform stability bounds. We also illustrate the use of these learning bounds in the analysis of several scenarios.

1 Introduction

Most generalization bounds in learning theory hold for a fixed hypothesis set, selected before receiving a sample. This includes learning bounds based on covering numbers, VC-dimension, pseudo-dimension, Rademacher complexity, local Rademacher complexity, and other complexity measures [Pollard, 1984, Zhang, 2002, Vapnik, 1998, Koltchinskii and Panchenko, 2002, Bartlett et al., 2002]. Some alternative guarantees have also been derived for specific algorithms. Among them, the most general family is that of uniform stability bounds given by Bousquet and Elisseeff [2002]. These bounds were recently significantly improved by Feldman and Vondrak [2019], who proved guarantees that are informative even when the stability parameter $\beta$ is only in $o(1)$, as opposed to $o(1/\sqrt{m})$. The $\log^2 m$ factor in these bounds was later reduced to $\log m$ by Bousquet et al. [2019]. Bounds for a restricted class of algorithms were also recently presented by Maurer [2017], under a number of assumptions on the smoothness of the loss function. Appendix A gives more background on stability.

In practice, machine learning engineers commonly resort to hypothesis sets depending on the same sample as the one used for training. This includes instances where a regularization, a feature transformation, or a data normalization is selected using the training sample, or other instances where the family of predictors is restricted to a smaller class based on the sample received. In other instances, as is common in deep learning, the data representation and the predictor are learned using the same sample. In ensemble learning, the sample used to train models sometimes coincides with the one used to determine their aggregation weights. However, standard generalization bounds are not directly applicable in these scenarios since they assume a fixed hypothesis set.

Figure 1: Decomposition of the learning algorithm's hypothesis selection into two stages. In the first stage, the algorithm determines a hypothesis set $H_S$ associated to the training sample $S$, which may be a small subset of the set of all hypotheses that could be considered, say $H = \bigcup_{S \in Z^m} H_S$. The second stage then consists of selecting a hypothesis $h_S$ out of $H_S$.

1.1 Contributions of this paper.

1. Foundational definitions of data-dependent hypothesis sets.
We present foundational definitions of learning algorithms that rely on data-dependent hypothesis sets. Here, the algorithm decomposes into two stages: a first stage where the algorithm, on receiving the sample $S$, chooses a hypothesis set $H_S$ dependent on $S$, and a second stage where a hypothesis $h_S$ is selected from $H_S$. Standard generalization bounds correspond to the case where $H_S$ is equal to some fixed $H$ independent of $S$. Algorithm-dependent analyses, such as uniform stability bounds, coincide with the case where $H_S$ is chosen to be a singleton $H_S = \{h_S\}$. Thus, the scenario we study covers both existing settings and other intermediate scenarios. Figure 1 illustrates our general scenario.

2. Learning bounds via transductive Rademacher complexity. We present general learning bounds for data-dependent hypothesis sets using a notion of transductive Rademacher complexity (Section 3). These bounds hold for arbitrary bounded losses and improve upon previous guarantees given by Gat [2001] and Cannon et al. [2002] for the binary loss, which were expressed in terms of a notion of shattering coefficient adapted to the data-dependent case; they are also more explicit than the guarantees presented by Philips [2005, Corollary 4.6 or Theorem 4.7]. Nevertheless, such bounds may often not be sufficiently informative, since they ignore the relationship between hypothesis sets based on similar samples.

3. Learning bounds via hypothesis set stability. We provide finer generalization bounds based on the key notion of hypothesis set stability that we introduce in this paper. This notion admits algorithmic stability as a special case, when the hypothesis sets are reduced to singletons. We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. Our main results are generalization bounds (Section 4) for stable data-dependent hypothesis sets expressed in terms of the hypothesis set stability parameter, our notion of Rademacher complexity, and a notion of cross-validation stability that, in turn, can be upper-bounded by the diameter of the family of hypothesis sets. Our learning bounds admit as special cases both standard Rademacher complexity bounds and algorithm-dependent uniform stability bounds.

4. New generalization bounds for specific learning applications. In Section 5 (see also Appendix G), we illustrate the generality and the benefits of our hypothesis set stability learning bounds by using them to derive new generalization bounds for several learning applications. To our knowledge, there is no straightforward analysis based on previously existing tools that yields these generalization bounds. These applications include: (a) bagging algorithms that may employ nonuniform, data-dependent averaging of the base predictors, (b) stochastic strongly-convex optimization algorithms based on an average of other stochastic optimization algorithms, (c) stable representation learning algorithms, which first learn a data representation using the sample and then learn a predictor on top of the learned representation, and (d) distillation algorithms, which first compute a complex predictor using the sample and then use it to learn a simpler predictor that is close to it.

1.2 Related work on data-dependent hypothesis sets.

Shawe-Taylor et al. [1998] presented an analysis of structural risk minimization over data-dependent hierarchies based on a concept of luckiness, which generalizes the notion of margin of linear classifiers.
Their analysis can be viewed as an alternative and general study of data-dependent hypothesis sets, using luckiness functions and ω-smallness (or ω-smoothness) conditions. A luckiness function helps decompose a hypothesis set into lucky sets, that is, sets of functions luckier than a given function. The luckiness framework is attractive, and the notion of luckiness, for example margin, can in fact be combined with our results. However, finding pairs of truly data-dependent luckiness and ω-smallness functions, other than those based on the margin and the empirical VC-dimension, is quite difficult, in particular because of the very technical ω-smallness condition [see Philips, 2005, p. 70]. In contrast, hypothesis set stability is simpler and often easier to bound. The notions of luckiness and ω-smallness have also been used by Herbrich and Williamson [2002] to derive algorithm-specific guarantees. In fact, the authors show a connection with algorithmic stability (not hypothesis set stability), at the price of a guarantee requiring the strong condition that the stability parameter be in $o(1/m)$, where $m$ is the sample size [see Herbrich and Williamson, 2002, pp. 189-190].

Data-dependent hypothesis classes are conceptually related to the notion of data-dependent priors in PAC-Bayesian generalization bounds. Catoni [2007] developed a localized PAC-Bayes analysis by using a prior defined in terms of the data-generating distribution. This work was extended by Lever et al. [2013], who proved sharp risk bounds for stochastic exponential weights algorithms. Parrado-Hernández et al. [2012] investigated the possibility of learning the prior from a separate data set, as well as priors obtained by computing a data-dependent bound on the KL term. More closely related to this paper is the work of Dziugaite and Roy [2018a,b], who develop PAC-Bayes bounds by choosing the prior via a data-dependent differentially private mechanism, and who also showed that notions weaker than differential privacy suffice to yield valid bounds. In Appendix H, we give a more detailed discussion of PAC-Bayes bounds; in particular, we show how PAC-Bayes bounds finer than the standard ones can be derived from Rademacher complexity bounds, with an analysis and constants alternative to those of Kakade et al. [2008], and how data-dependent PAC-Bayes bounds can be derived from our data-dependent Rademacher complexity bounds. More discussion of data-dependent priors can be found in Appendix F.3.

2 Definitions and Properties

Let $X$ be the input space and $Y$ the output space, and define $Z = X \times Y$. We denote by $D$ the unknown distribution over $X \times Y$ according to which samples are drawn. The hypotheses $h$ we consider map $X$ to a set $Y'$ sometimes different from $Y$. For example, in binary classification, we may have $Y = \{-1,+1\}$ and $Y' = \mathbb{R}$. Thus, we denote by $\ell \colon Y' \times Y \to [0,1]$ a loss function defined on $Y' \times Y$ and taking non-negative real values bounded by one. We denote the loss of a hypothesis $h \colon X \to Y'$ at point $z = (x,y) \in X \times Y$ by $L(h,z) = \ell(h(x),y)$. We denote by $R(h)$ the generalization error or expected loss of a hypothesis $h \in H$ and by $\widehat{R}_S(h)$ its empirical loss over a sample $S = (z_1,\ldots,z_m)$:

$$R(h) = \mathbb{E}_{z \sim D}[L(h,z)], \qquad \widehat{R}_S(h) = \mathbb{E}_{z \sim S}[L(h,z)] = \frac{1}{m}\sum_{i=1}^m L(h,z_i).$$

In the general framework we consider, a hypothesis set depends on the sample received. We will denote by $H_S$ the hypothesis set depending on the labeled sample $S \in Z^m$ of size $m \ge 1$. We assume that $H_S$ is invariant to the ordering of the points in $S$.
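To make the two-stage decomposition concrete, here is a minimal Python sketch (our own illustration, not part of the paper's formal development; the interface choose_hypothesis_set / select_hypothesis and the particular constant-predictor class are assumptions made for the example). The first stage builds $H_S$ from $S$; the second selects $h_S \in H_S$ by empirical risk minimization.

```python
import numpy as np

def empirical_risk(h, sample, loss):
    """R_hat_S(h) = (1/m) * sum_i loss(h(x_i), y_i)."""
    return float(np.mean([loss(h(x), y) for x, y in sample]))

def choose_hypothesis_set(sample):
    """First stage: the sample S determines the hypothesis set H_S.
    Illustrative choice: constant predictors near the sample's label mean."""
    y_mean = np.mean([y for _, y in sample])
    return [lambda x, c=c: c for c in np.linspace(y_mean - 0.1, y_mean + 0.1, 5)]

def select_hypothesis(H_S, sample, loss):
    """Second stage: pick h_S in H_S, here by empirical risk minimization."""
    return min(H_S, key=lambda h: empirical_risk(h, sample, loss))

loss = lambda p, y: min(abs(p - y), 1.0)   # bounded loss, values in [0, 1]
rng = np.random.default_rng(0)
S = [(x, 0.5 * x + rng.normal(0, 0.05)) for x in rng.uniform(0, 1, 50)]
H_S = choose_hypothesis_set(S)             # depends on the sample S
h_S = select_hypothesis(H_S, S, loss)      # a hypothesis h_S in H_S
print(empirical_risk(h_S, S, loss))
```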
Definition 1 (Hypothesis set uniform stability). Fix $m \ge 1$. We say that a family of data-dependent hypothesis sets $H = (H_S)_{S \in Z^m}$ is $\beta$-uniformly stable (or simply $\beta$-stable) for some $\beta \ge 0$ if, for any two samples $S$ and $S'$ of size $m$ differing only by one point, the following holds:

$$\forall h \in H_S,\ \exists h' \in H_{S'} \colon\ \forall z \in Z,\quad \big|L(h,z) - L(h',z)\big| \le \beta. \tag{1}$$

Thus, two hypothesis sets derived from samples differing by one element are close in the sense that any hypothesis in one admits a counterpart in the other set with $\beta$-similar losses. A closely related notion is the sensitivity of a function $f \colon Z^m \to \mathbb{R}$. Such a function $f$ is called $\beta$-sensitive if for any two samples $S$ and $S'$ of size $m$ differing only by one point, we have $|f(S) - f(S')| \le \beta$.

We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. To introduce its definition, for any two samples $S,T \in Z^m$ and a vector of Rademacher variables $\sigma$, denote by $S^{T,\sigma}$ the sample derived from $S$ by replacing its $i$th element with the $i$th element of $T$, for all $i \in [m] = \{1,2,\ldots,m\}$ with $\sigma_i = 1$. We will use $H^{\sigma}_{S,T}$ to denote the hypothesis set $H_{S^{T,\sigma}}$.

Definition 2 (Rademacher complexity of data-dependent hypothesis sets). Fix $m \ge 1$. The empirical Rademacher complexity $\widehat{\mathfrak{R}}_{S,T}(H)$ and the Rademacher complexity $\mathfrak{R}_m(H)$ of a family of data-dependent hypothesis sets $H = (H_S)_{S \in Z^m}$, for two samples $S = (z^S_1,\ldots,z^S_m)$ and $T = (z^T_1,\ldots,z^T_m)$ in $Z^m$, are defined by

$$\widehat{\mathfrak{R}}_{S,T}(H) = \frac{1}{m}\,\mathbb{E}_{\sigma}\bigg[\sup_{h \in H^{\sigma}_{S,T}} \sum_{i=1}^m \sigma_i\, h(z^T_i)\bigg], \qquad \mathfrak{R}_m(H) = \frac{1}{m}\,\mathbb{E}_{S,T \sim D^m,\ \sigma}\bigg[\sup_{h \in H^{\sigma}_{S,T}} \sum_{i=1}^m \sigma_i\, h(z^T_i)\bigg]. \tag{2}$$

When the family of data-dependent hypothesis sets $H$ is $\beta$-stable with $\beta = O(1/m)$, the empirical Rademacher complexity $\widehat{\mathfrak{R}}_{S,T}(G)$ is sharply concentrated around its expectation $\mathfrak{R}_m(G)$, as with the standard empirical Rademacher complexity (see Lemma 4). Let $H_{S,T}$ denote the union of all hypothesis sets based on $\mathcal{U} = \{U \colon U \subseteq (S \cup T),\ U \in Z^m\}$, the subsamples of $S \cup T$ of size $m$: $H_{S,T} = \bigcup_{U \in \mathcal{U}} H_U$. Since for any $\sigma$ we have $H^{\sigma}_{S,T} \subseteq H_{S,T}$, the following simpler upper bound in terms of the standard empirical Rademacher complexity of $H_{S,T}$ can be used for our notion of Rademacher complexity:

$$\mathfrak{R}_m(H) \le \frac{1}{m}\,\mathbb{E}_{S,T \sim D^m,\ \sigma}\bigg[\sup_{h \in H_{S,T}} \sum_{i=1}^m \sigma_i\, h(z^T_i)\bigg] = \mathbb{E}_{S,T \sim D^m}\big[\widehat{\mathfrak{R}}_T(H_{S,T})\big],$$

where $\widehat{\mathfrak{R}}_T(H_{S,T})$ is the standard empirical Rademacher complexity¹ of $H_{S,T}$ for the sample $T$. Some properties of our notion of Rademacher complexity are given in Appendix B.

¹ Note that the standard definition of Rademacher complexity assumes that hypothesis sets are not data-dependent; however, the definition remains valid for data-dependent hypothesis sets.

Let $G_S$ denote the family of loss functions associated to $H_S$:

$$G_S = \{z \mapsto L(h,z) \colon h \in H_S\}, \tag{3}$$

and let $G = (G_S)_{S \in Z^m}$ denote the family of hypothesis sets $G_S$. Our main results will be expressed in terms of $\mathfrak{R}_m(G)$. When the loss function $\ell$ is $\mu$-Lipschitz, by Talagrand's contraction lemma [Ledoux and Talagrand, 1991], in all our results $\mathfrak{R}_m(G)$ can be replaced by $\mu\,\mathbb{E}_{S,T \sim D^m}[\widehat{\mathfrak{R}}_T(H_{S,T})]$.
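Definition 2 has a direct operational reading. The following Monte Carlo sketch (our own illustration; hypothesis_set is an assumed user-supplied map from a sample to a finite set of real-valued hypotheses) estimates the empirical complexity $\widehat{\mathfrak{R}}_{S,T}(H)$ by sampling $\sigma$, forming $S^{T,\sigma}$, and taking the supremum over $H^{\sigma}_{S,T}$:

```python
import numpy as np

def empirical_rademacher(S, T, hypothesis_set, n_trials=200, seed=0):
    """Monte Carlo estimate of R_hat_{S,T}(H) from Definition 2:
    (1/m) * E_sigma[ sup_{h in H_{S^{T,sigma}}} sum_i sigma_i * h(x_i^T) ]."""
    rng = np.random.default_rng(seed)
    m = len(S)
    estimates = []
    for _ in range(n_trials):
        sigma = rng.choice([-1, 1], size=m)
        # S^{T,sigma}: replace the i-th point of S by the i-th point of T
        # exactly at the indices where sigma_i = 1.
        mixed = [T[i] if sigma[i] == 1 else S[i] for i in range(m)]
        H = hypothesis_set(mixed)                 # the set H^sigma_{S,T}
        estimates.append(
            max(sum(sigma[i] * h(T[i][0]) for i in range(m)) for h in H) / m
        )
    return float(np.mean(estimates))
```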
Rademacher complexity is one way to measure the capacity of the family of data-dependent hypothesis sets. We also derive learning bounds in situations where a notion of diameter of the hypothesis sets is small. We now define notions of cross-validation stability and diameter for data-dependent hypothesis sets. In the following, for a sample $S$, a point $z' \in S$, and a point $z \in Z$, $S_{z' \to z}$ denotes the sample obtained from $S$ by replacing $z'$ by $z$.

Definition 3 (Hypothesis set cross-validation (CV) stability and diameter). Fix $m \ge 1$. For some $\bar\chi, \chi, \bar\Delta, \Delta, \Delta_{\max} \ge 0$, we say that a family of data-dependent hypothesis sets $H = (H_S)_{S \in Z^m}$ has average CV-stability $\bar\chi$, CV-stability $\chi$, average diameter $\bar\Delta$, diameter $\Delta$, and max-diameter $\Delta_{\max}$ if the following hold:

$$\mathbb{E}_{S \sim D^m}\,\mathbb{E}_{z \sim D,\, z' \sim S}\bigg[\sup_{h \in H_S,\, h' \in H_{S_{z' \to z}}} L(h',z) - L(h,z)\bigg] \le \bar\chi \tag{4}$$

$$\sup_{S \in Z^m}\,\mathbb{E}_{z \sim D,\, z' \sim S}\bigg[\sup_{h \in H_S,\, h' \in H_{S_{z' \to z}}} L(h',z) - L(h,z)\bigg] \le \chi \tag{5}$$

$$\mathbb{E}_{S \sim D^m}\,\mathbb{E}_{z \sim S}\bigg[\sup_{h,h' \in H_S} L(h',z) - L(h,z)\bigg] \le \bar\Delta \tag{6}$$

$$\sup_{S \in Z^m}\,\mathbb{E}_{z \sim S}\bigg[\sup_{h,h' \in H_S} L(h',z) - L(h,z)\bigg] \le \Delta \tag{7}$$

$$\sup_{S \in Z^m}\,\max_{z \in S}\bigg[\sup_{h,h' \in H_S} L(h',z) - L(h,z)\bigg] \le \Delta_{\max}. \tag{8}$$

CV-stability of hypothesis sets can be bounded in terms of their stability and diameter (see the straightforward proof in Appendix C).

Lemma 1. Suppose a family of data-dependent hypothesis sets $H$ is $\beta$-uniformly stable. Then, if it has diameter $\Delta$, it is $(\Delta + \beta)$-CV-stable, and if it has average diameter $\bar\Delta$, it is $(\bar\Delta + \beta)$-average CV-stable.

3 General learning bound for data-dependent hypothesis sets

In this section, we present general learning bounds for data-dependent hypothesis sets that do not make use of the notion of hypothesis set stability. One straightforward idea to derive such guarantees for data-dependent hypothesis sets is to replace the hypothesis set $H_S$ depending on the observed sample $S$ by the union of all such hypothesis sets over all samples of size $m$, $H_m = \bigcup_{S \in Z^m} H_S$. However, in general, $H_m$ can be very rich, which can lead to uninformative learning bounds. A somewhat better alternative consists of considering the union of all such hypothesis sets for samples of size $m$ included in some supersample $U$ of size $m+n$, with $n \ge 1$: $H_{U,m} = \bigcup_{S \in Z^m,\ S \subseteq U} H_S$. We will derive learning guarantees based on the maximum transductive Rademacher complexity of $H_{U,m}$. There is a trade-off in the choice of $n$: smaller values lead to less complex sets $H_{U,m}$, but they also lead to weaker dependencies on sample sizes. Our bounds are more refined guarantees than the shattering-coefficient bounds originally given for this problem by Gat [2001] in the case $n = m$, and later by Cannon et al. [2002] for any $n \ge 1$. They also apply to arbitrary bounded loss functions and not just the binary loss. They are expressed in terms of the following notion of transductive Rademacher complexity for data-dependent hypothesis sets:

$$\mathfrak{R}_{U,m}(G) = \frac{1}{m+n}\,\mathbb{E}_{\sigma}\bigg[\sup_{h \in H_{U,m}} \sum_{i=1}^{m+n} \sigma_i\, L(h,z^U_i)\bigg],$$

where $U = (z^U_1,\ldots,z^U_{m+n}) \in Z^{m+n}$ and where $\sigma$ is a vector of $(m+n)$ independent random variables taking value $\frac{m+n}{n}$ with probability $\frac{n}{m+n}$, and $-\frac{m+n}{m}$ with probability $\frac{m}{m+n}$. Our notion of transductive Rademacher complexity is simpler than that of El-Yaniv and Pechyony [2007] (in the data-independent case) and leads to simpler proofs and guarantees. A by-product of our analysis is learning guarantees for standard transductive learning in terms of this notion of transductive Rademacher complexity, which can be of independent interest.
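The skewed variables $\sigma_i$ are easy to simulate, and $\mathfrak{R}_{U,m}(G)$ can be estimated by Monte Carlo once $H_{U,m}$ is replaced by a finite set of hypotheses with precomputed loss profiles (an assumption we make purely for illustration; the $\frac{1}{m+n}$ normalization matches the definition above):

```python
import numpy as np

def transductive_rademacher(loss_profiles, m, n, n_trials=500, seed=0):
    """Estimate R_{U,m}(G) = (1/(m+n)) E_sigma[ sup_h sum_i sigma_i L(h, z_i^U) ].
    `loss_profiles` has one row per hypothesis: the vector (L(h, z_i^U))_i
    of length m+n.  Each sigma_i equals (m+n)/n with probability n/(m+n)
    and -(m+n)/m with probability m/(m+n), so that E[sigma_i] = 0."""
    rng = np.random.default_rng(seed)
    N = m + n
    L = np.asarray(loss_profiles)            # shape: (num_hypotheses, m+n)
    vals = []
    for _ in range(n_trials):
        sigma = np.where(rng.random(N) < n / N, N / n, -N / m)
        vals.append((L @ sigma).max() / N)   # sup over the hypotheses
    return float(np.mean(vals))
```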
Theorem 1. Let $H = (H_S)_{S \in Z^m}$ be a family of data-dependent hypothesis sets. Let $G$ be defined as in (3). Then, for any $\delta > 0$, with probability at least $1-\delta$ over the draw of the sample $S \sim D^m$, the following inequality holds for all $h \in H_S$:

$$R(h) \le \widehat{R}_S(h) + \max_{U \in Z^{m+n}} 2\,\mathfrak{R}_{U,m}(G) + 3\sqrt{\,\cdots\,}.$$

Proof. (Sketch; full proof in Appendix D.) We use the following symmetrization result, which holds for any $\epsilon > 0$ with $m\epsilon^2 \ge 2$ for data-dependent hypothesis sets (Lemma 5, Appendix D):

$$\mathbb{P}_{S \sim D^m}\Big[\sup_{h \in H_S} R(h) - \widehat{R}_S(h) > \epsilon\Big] \le 2\,\mathbb{P}_{S \sim D^m,\, T \sim D^n}\Big[\sup_{h \in H_S} \widehat{R}_T(h) - \widehat{R}_S(h) > \tfrac{\epsilon}{2}\Big].$$

To bound the right-hand side, we use an extension of McDiarmid's inequality to sampling without replacement [Cortes et al., 2008], applied to $\Phi(S) = \sup_{h \in H_{U,m}} \widehat{R}_T(h) - \widehat{R}_S(h)$. Lemma 6 (Appendix D) is then used to bound $\mathbb{E}[\Phi(S)]$ in terms of our notion of transductive Rademacher complexity.

4 Learning bound for stable data-dependent hypothesis sets

In this section, we present our main generalization bounds for data-dependent hypothesis sets.

Theorem 2. Let $H = (H_S)_{S \in Z^m}$ be a $\beta$-stable family of data-dependent hypothesis sets, with average CV-stability $\bar\chi$, CV-stability $\chi$, and max-diameter $\Delta_{\max}$. Let $G$ be defined as in (3). Then, for any $\delta > 0$, with probability at least $1-\delta$ over the draw of a sample $S \sim D^m$, the following inequality holds for all $h \in H_S$, where the right-hand side is the minimum of the three terms (9)-(11):

$$R(h) \;\le\; \widehat{R}_S(h) + \min\left\{
\begin{aligned}
&\min\big\{2\,\mathfrak{R}_m(G),\ \bar\chi\big\} + (1 + 2\beta m)\sqrt{\tfrac{1}{2m}\log\tfrac{1}{\delta}}, && (9)\\
&e\chi + 4\sqrt{\big(\tfrac{1}{m} + 2\beta\big)\log\tfrac{6}{\delta}}, && (10)\\
&48\,(3\beta + \Delta_{\max})\log(m)\log\big(\tfrac{5m^3}{\delta}\big) + \cdots && (11)
\end{aligned}\right.$$

The proof of the theorem is given in Appendix E. The main idea is to control the sensitivity of the function $\Psi(S,S')$ defined for any two samples $S,S'$ as follows: $\Psi(S,S') = \sup_{h \in H_S} R(h) - \widehat{R}_{S'}(h)$. To prove bound (9), we apply McDiarmid's inequality to $\Psi(S,S)$, using the $(\frac{1}{m} + 2\beta)$-sensitivity of $\Psi(S,S)$, and then upper-bound the expectation $\mathbb{E}_{S \sim D^m}[\Psi(S,S)]$ in terms of our notion of Rademacher complexity. Bound (10) is obtained via a differential-privacy-based technique, as in Feldman and Vondrak [2018], and (11) is a direct consequence of the bound of Feldman and Vondrak [2019], using the observation that an algorithm that chooses an arbitrary $h \in H_S$ is $O(\beta + \Delta_{\max})$-uniformly stable in the classical sense of Bousquet and Elisseeff [2002].

Bound (9) admits as a special case the standard Rademacher complexity bound for fixed hypothesis sets [Koltchinskii and Panchenko, 2002, Bartlett and Mendelson, 2002]: in that case, we have $H_S = H$ for some fixed $H$, thus $\mathfrak{R}_m(G)$ coincides with the standard Rademacher complexity; furthermore, the family of hypothesis sets is 0-stable, thus the bound holds with $\beta = 0$. Bounds (10) and (11) specialize to the bounds of Feldman and Vondrak [2018] and Feldman and Vondrak [2019], respectively, for the special case of standard uniform stability of algorithms: in that case, $H_S$ is reduced to a singleton, $H_S = \{h_S\}$, and so $\Delta = \Delta_{\max} = 0$, which implies $\chi \le \Delta + \beta = \beta$.

The Rademacher complexity-based bound (9) typically gives the tightest control on the generalization error compared to the bounds (10) and (11), which rely on the cruder diameter notion. However, the diameter may be easier to bound for some applications than the Rademacher complexity. To compare the diameter-based bounds: in applications where $\Delta_{\max} = O(\Delta)$, bound (11) may be tighter than (10); but in several applications we have $\beta = O(\frac{1}{m})$, and then bound (10) is tighter.

5 Applications

We now discuss several applications of our learning guarantees, with some others in Appendix G.

5.1 Bagging

Bagging [Breiman, 1996] is a prominent ensemble method used to improve the stability of learning algorithms. It consists of generating $k$ new samples $B_1,B_2,\ldots,B_k$, each of size $p$, by sampling uniformly with replacement from the original sample $S$ of size $m$. An algorithm $A$ is then trained on each of these samples to generate $k$ predictors $A(B_i)$, $i \in [k]$. In regression, the predictors are combined by taking a convex combination $\sum_{i=1}^k w_i\, A(B_i)$.
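In code, the scheme just described looks as follows (a sketch under our own conventions: base_fit stands for the algorithm $A$ and maps a subsample to a predictor; we sample without replacement, matching the analysis below, and the cap $w_i \le C/k$ from that analysis is left to the caller):

```python
import numpy as np

def bagged_regressor(S, base_fit, k, p, weights=None, seed=0):
    """Draw k subsamples B_1..B_k of size p uniformly from S, train A on
    each, and combine the k predictors by the convex combination
    sum_i w_i * A(B_i) (uniform weights w_i = 1/k by default)."""
    rng = np.random.default_rng(seed)
    m = len(S)
    predictors = []
    for _ in range(k):
        idx = rng.choice(m, size=p, replace=False)   # without replacement
        predictors.append(base_fit([S[i] for i in idx]))
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, float)
    assert np.isclose(w.sum(), 1.0) and w.min() >= 0.0  # convex combination
    return lambda x: sum(wi * f(x) for wi, f in zip(w, predictors))
```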
Here, we analyze a common instance of bagging to illustrate the application of our learning guarantees: we assume a regression setting and uniform sampling from $S$ without replacement.² We also assume that the loss function is $\mu$-Lipschitz in its first argument, that the predictions are in the range $[0,1]$, and that all the mixing weights $w_i$ are bounded by $\frac{C}{k}$ for some constant $C \ge 1$, in order to ensure that no subsample $B_i$ is overly influential in the final regressor (in practice, a uniform mixture is typically used in bagging).

²Sampling without replacement is only adopted to make the analysis more concise; its extension to sampling with replacement is straightforward.

To analyze bagging, we cast it in our framework. First, to deal with the randomness in choosing the subsamples, we can equivalently imagine the process as choosing indices in $[m]$ to form the subsamples rather than samples in $S$; once $S$ is drawn, the subsamples are generated by filling in the samples at the corresponding indices. For any index $i \in [m]$, the chance that it is picked in any given subsample is $\frac{p}{m}$. Thus, by Chernoff's bound, with probability at least $1-\delta$, no index in $[m]$ appears in more than $t = \frac{kp}{m}\big(1 + \sqrt{\frac{3m}{kp}\log\frac{m}{\delta}}\big)$ subsamples. In the following, we condition on the random seed of the bagging algorithm so that this is indeed the case, and later use a union bound to control the chance that the chosen random seed does not satisfy this property, as elucidated in Section F.2. Define the data-dependent family of hypothesis sets $H$ as $H_S = \big\{\sum_{i=1}^k w_i\,A(B_i) \colon w \in \Delta^{C/k}_k\big\}$, where $\Delta^{C/k}_k$ denotes the simplex of distributions over $k$ items with all weights $w_i \le \frac{C}{k}$.

Next, we give upper bounds on the hypothesis set stability and the Rademacher complexity of $H$. Assume that the algorithm $A$ admits uniform stability $\beta_A$ [Bousquet and Elisseeff, 2002], i.e., for any two samples $B$ and $B'$ of size $p$ that differ in exactly one data point and for all $x \in X$, we have $|A(B)(x) - A(B')(x)| \le \beta_A$. Now, let $S$ and $S'$ be two samples of size $m$ differing by one point at the same index, $z \in S$ and $z' \in S'$. Then, consider the subsets $B'_i$ of $S'$ which are obtained from the $B_i$s by copying over all the elements except $z$, and replacing all instances of $z$ by $z'$. For any $B_i$, if $z \notin B_i$, then $A(B_i) = A(B'_i)$, and if $z \in B_i$, then $|A(B_i)(x) - A(B'_i)(x)| \le \beta_A$ for any $x \in X$. We can now bound the hypothesis set uniform stability as follows: since $L$ is $\mu$-Lipschitz in the prediction, at most $t$ of the subsamples are affected, each with mixing weight at most $\frac{C}{k}$, so that for any $z'' \in Z$ and any $w \in \Delta^{C/k}_k$, we have

$$\Big|L\Big(\sum_{i=1}^k w_i\,A(B_i),\,z''\Big) - L\Big(\sum_{i=1}^k w_i\,A(B'_i),\,z''\Big)\Big| \le \frac{tC}{k}\,\mu\beta_A = \Big[\frac{p}{m}\Big(1 + \sqrt{\tfrac{3m}{kp}\log\tfrac{m}{\delta}}\Big)\Big]\,C\mu\beta_A.$$

It is easy to check that the CV-stability and the diameter of the hypothesis sets are $\Omega(1)$ in the worst case. Thus, the CV-stability-based bound (10) and the standard uniform-stability bound (11) are not informative here, and we use the Rademacher complexity-based bound (9) instead. Bounding the Rademacher complexity $\widehat{\mathfrak{R}}_S(H_{S,T})$ for $S,T \in Z^m$ is non-trivial. Instead, we can derive a reasonable upper bound by analyzing the Rademacher complexity of a larger function class. Specifically, for any $z \in Z$, define the $d = \binom{2m}{p}$-dimensional vector $u_z = \big(A(B)(z)\big)_{B \subseteq S \cup T,\ |B| = p}$. Then, the larger class of functions is $F_{S,T} = \{z \mapsto w \cdot u_z \colon w \in \mathbb{R}^d,\ \|w\|_1 = 1\}$. Clearly, $H_{S,T} \subseteq F_{S,T}$. Since $\|u_z\|_\infty \le 1$, a standard Rademacher complexity bound (see Theorem 11.15 in [Mohri et al., 2018]) implies $\widehat{\mathfrak{R}}_S(F_{S,T}) \le \sqrt{\frac{2\log(2d)}{m}}$. Thus, by Talagrand's contraction lemma, we conclude that $\widehat{\mathfrak{R}}_S(G_{S,T}) \le \mu\sqrt{\frac{2\log\big(2\binom{2m}{p}\big)}{m}}$.
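As a quick numerical sanity check (our own illustration), the complexity term $\mu\sqrt{2\log\big(2\binom{2m}{p}\big)/m}$ can be evaluated by computing $\log\binom{2m}{p}$ with the log-gamma function:

```python
from math import lgamma, log, sqrt

def log_binom(n, k):
    """log C(n, k), computed stably via the log-gamma function."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def bagging_complexity_term(m, p, mu=1.0):
    """mu * sqrt(2 * log(2 * C(2m, p)) / m): the bound on R_hat_S(G_{S,T})."""
    return mu * sqrt(2 * (log(2.0) + log_binom(2 * m, p)) / m)

print(bagging_complexity_term(10_000, 50))   # ~0.26: already well below 1
```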
In view of this complexity bound, by Theorem 2, for any $\delta > 0$, with probability at least $1 - 2\delta$ over the draw of a sample $S \sim D^m$ and the randomness in the bagging algorithm, the following inequality holds for any $h \in H_S$:

$$R(h) \le \widehat{R}_S(h) + 2\mu\sqrt{\frac{2\log\big(2\binom{2m}{p}\big)}{m}} + \bigg(1 + 2\,p\Big(1 + \sqrt{\tfrac{3m}{kp}\log\tfrac{m}{\delta}}\Big)\,C\mu\beta_A\bigg)\sqrt{\frac{1}{2m}\log\frac{1}{\delta}}.$$

For $p = o(\sqrt{m})$ and $k = \omega(p)$, the generalization gap goes to 0 as $m \to \infty$, regardless of the stability of $A$. This gives a new generalization guarantee for bagging, similar (but incomparable) to the one derived by Elisseeff et al. [2005]. One major point of difference is that, unlike their bound, our bound allows for non-uniform averaging schemes.

5.2 Stochastic strongly-convex optimization

Here, we consider data-dependent hypothesis sets based on stochastic strongly-convex optimization algorithms. As shown by Shalev-Shwartz et al. [2010], uniform convergence bounds do not hold for the stochastic convex optimization problem in general. Consider $K$ stochastic strongly-convex optimization algorithms $A_j$, each returning a vector $w^S_j$ after receiving the sample $S \in Z^m$, $j \in [K]$. As shown by Shalev-Shwartz et al. [2010], such algorithms are $\beta = O(\frac{1}{m})$-sensitive in their output vector, i.e., for all $j \in [K]$, we have $\|w^S_j - w^{S'}_j\|_2 \le \beta$ if $S$ and $S'$ differ by one point. Assume that the loss $L(w,z)$ is $\mu$-Lipschitz with respect to its first argument $w$. Let the data-dependent hypothesis set be defined as follows: $H_S = \big\{\sum_{j=1}^K \alpha_j w^S_j \colon \alpha \in \Delta_K \cap B_1(\alpha_0,r)\big\}$, where $\alpha_0$ is in the simplex of distributions $\Delta_K$ and $B_1(\alpha_0,r)$ is the $L_1$ ball of radius $r > 0$ around $\alpha_0$. We choose $r = \frac{1}{2\mu D\sqrt{m}}$, with $D$ as defined below. A natural choice for $\alpha_0$ would be the uniform mixture.

Since the loss function is $\mu$-Lipschitz, the family of hypothesis sets $H_S$ is $\mu\beta$-stable. In this setting, bounding the Rademacher complexity is difficult, so we resort to the diameter-based bound (10) instead. Note that for any $\alpha, \alpha' \in \Delta_K \cap B_1(\alpha_0,r)$ and any $z \in Z$, we have

$$\Big|L\Big(\sum_{j=1}^K \alpha_j w^S_j,\,z\Big) - L\Big(\sum_{j=1}^K \alpha'_j w^S_j,\,z\Big)\Big| \le \mu\Big\|\sum_{j=1}^K (\alpha_j - \alpha'_j)\,w^S_j\Big\|_2 \le \mu\,\big\|[w^S_1 \cdots w^S_K]\big\|_{1,2}\,\|\alpha - \alpha'\|_1 \le 2\mu r D,$$

where $\|[w^S_1 \cdots w^S_K]\|_{1,2} = \max_{x \ne 0} \frac{\|\sum_{j} x_j w^S_j\|_2}{\|x\|_1} = \max_{j \in [K]} \|w^S_j\|_2 \le D$. Thus, the average diameter admits the upper bound $\bar\Delta \le 2\mu r D = \frac{1}{\sqrt{m}}$. In view of that, by Theorem 2, for any $\delta > 0$, with probability at least $1-\delta$, the following holds for all $\alpha \in \Delta_K \cap B_1(\alpha_0,r)$:

$$\mathbb{E}_{z \sim D}\Big[L\Big(\sum_{j=1}^K \alpha_j w^S_j,\,z\Big)\Big] \le \frac{1}{m}\sum_{i=1}^m L\Big(\sum_{j=1}^K \alpha_j w^S_j,\,z^S_i\Big) + \frac{e}{\sqrt{m}} + e\mu\beta + 4\sqrt{\Big(\frac{1}{m} + 2\mu\beta\Big)\log\frac{6}{\delta}}.$$

The second stage of an algorithm in this context consists of choosing $\alpha$, potentially using a non-stable algorithm. This application illustrates both the use of our learning bounds via the diameter and their applicability even in the absence of uniform convergence bounds. As an aside, we note that the analysis of Section 5.1 can be carried over to this setting, by setting $A$ to be a stochastic strongly-convex optimization algorithm which outputs a weight vector $\hat{w}$. This yields generalization bounds for aggregating over a larger set of mixing weights, albeit with the restriction that each algorithm uses only a small part of $S$.
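A sketch of the corresponding construction (our own illustration; solvers stands for the $K$ algorithms $A_j$, each mapping the sample to a weight vector $w^S_j$, and the sampler of mixing weights is a crude stand-in for the constraint $\alpha \in \Delta_K \cap B_1(\alpha_0, r)$):

```python
import numpy as np

def mixture_hypothesis_set(S, solvers, alpha0, r, n_samples=100, seed=0):
    """Candidate elements of H_S = { sum_j alpha_j w_j^S :
    alpha in the simplex, ||alpha - alpha0||_1 <= r }.  Random zero-sum
    perturbations of alpha0 of L1 norm at most r are clipped back to the
    simplex; for small r this crudely samples the small-diameter set H_S."""
    rng = np.random.default_rng(seed)
    W = np.stack([solve(S) for solve in solvers])     # rows: w_j^S, shape K x d
    K = len(solvers)
    mixtures = []
    for _ in range(n_samples):
        delta = rng.normal(size=K)
        delta -= delta.mean()                         # keep sum(alpha) = 1
        delta *= r / max(np.abs(delta).sum(), 1e-12)  # ||delta||_1 <= r
        alpha = np.clip(alpha0 + delta, 0.0, None)
        alpha /= alpha.sum()                          # project back to simplex
        mixtures.append(alpha @ W)                    # sum_j alpha_j w_j^S
    return mixtures
```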
5.3 $\Delta$-sensitive feature mappings

Consider the scenario where the training sample $S \in Z^m$ is used to learn a non-linear feature mapping $\Phi_S \colon X \to \mathbb{R}^N$ that is $\Delta$-sensitive for some $\Delta = O(\frac{1}{m})$, that is, $\|\Phi_S(x) - \Phi_{S'}(x)\| \le \Delta$ for all $x \in X$ whenever $S$ and $S'$ differ by one point. $\Phi_S$ may be the feature mapping corresponding to some positive definite symmetric kernel, or a mapping defined by the top layer of an artificial neural network trained on $S$, with a stability property. To define the second stage, let $\mathsf{L}$ be a set of $\gamma$-Lipschitz functions $f \colon \mathbb{R}^N \to \mathbb{R}$. Then we define $H_S = \{x \mapsto f(\Phi_S(x)) \colon f \in \mathsf{L}\}$. Assume that the loss function $\ell$ is $\mu$-Lipschitz with respect to its first argument. Then, for any hypothesis $h \colon x \mapsto f(\Phi_S(x))$ in $H_S$ and any sample $S'$ differing from $S$ by one element, the hypothesis $h' \colon x \mapsto f(\Phi_{S'}(x))$ in $H_{S'}$ admits losses that are $\beta$-close to those of $h$, with $\beta = \mu\gamma\Delta$, since, for all $(x,y) \in X \times Y$, by the Lipschitzness of $\ell$ and $f$, the following inequality holds:

$$\big|\ell(f(\Phi_S(x)),y) - \ell(f(\Phi_{S'}(x)),y)\big| \le \mu\,\big|f(\Phi_S(x)) - f(\Phi_{S'}(x))\big| \le \mu\gamma\,\|\Phi_S(x) - \Phi_{S'}(x)\| \le \mu\gamma\Delta.$$

Thus, the family of hypothesis sets $H = (H_S)_{S \in Z^m}$ is uniformly $\beta$-stable with $\beta = \mu\gamma\Delta = O(\frac{1}{m})$. In view of that, by Theorem 2, for any $\delta > 0$, with probability at least $1-\delta$ over the draw of a sample $S \sim D^m$, the following inequality holds for any $h \in H_S$:

$$R(h) \le \widehat{R}_S(h) + 2\,\mathfrak{R}_m(G) + (1 + 2\mu\gamma\Delta m)\sqrt{\frac{1}{2m}\log\frac{1}{\delta}}.$$

Notice that this bound applies even when the second stage of an algorithm, which consists of selecting a hypothesis $h_S$ in $H_S$, is not stable. A standard uniform-stability guarantee cannot be used in that case. The setting described here can be straightforwardly extended to the case of other norms, both for the definition of sensitivity and for the norm used in the definition of $H_S$.

5.4 Distillation

Here, we consider distillation algorithms which, in the first stage, train a very complex model on the labeled sample. Let $f_S \colon X \to \mathbb{R}$ denote the resulting predictor for a training sample $S$ of size $m$. We will assume that the training algorithm is $\beta$-sensitive, that is, $\|f_S - f_{S'}\|_\infty \le \beta = O(\frac{1}{m})$ for $S$ and $S'$ differing by one point. In the second stage, the algorithm selects a hypothesis that is $\gamma$-close to $f_S$ from a less complex family of predictors $H$. This defines the following sample-dependent hypothesis set:

$$H_S = \{h \in H \colon \|h - f_S\|_\infty \le \gamma\}.$$

Figure 2: Illustration of the distillation hypothesis sets. Notice that the diameter of a hypothesis set $H_S$ may be large here.

Assume that the loss $\ell$ is $\mu$-Lipschitz with respect to its first argument and that $H$ is a subset of a vector space. Let $S$ and $S'$ be two samples differing by one point. Note that $f_S$ may not be in $H$, but we will assume that $f_{S'} - f_S$ is in $H$. Let $h$ be in $H_S$; then the hypothesis $h' = h + f_{S'} - f_S$ is in $H_{S'}$, since $\|h' - f_{S'}\|_\infty = \|h - f_S\|_\infty \le \gamma$. Figure 2 illustrates the hypothesis sets. By the $\mu$-Lipschitzness of the loss, for any $z = (x,y) \in Z$,

$$\big|\ell(h'(x),y) - \ell(h(x),y)\big| \le \mu\,|h'(x) - h(x)| = \mu\,|f_{S'}(x) - f_S(x)| \le \mu\beta.$$

Thus, the family of hypothesis sets $(H_S)_{S \in Z^m}$ is $\mu\beta$-stable. In view of that, by Theorem 2, for any $\delta > 0$, with probability at least $1-\delta$ over the draw of a sample $S \sim D^m$, the following inequality holds for any $h \in H_S$:

$$R(h) \le \widehat{R}_S(h) + 2\,\mathfrak{R}_m(G) + (1 + 2\mu\beta m)\sqrt{\frac{1}{2m}\log\frac{1}{\delta}}.$$

Notice that a standard uniform-stability argument would not necessarily apply here, since $H_S$ could be relatively complex and the second stage not necessarily stable.
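The two stages of distillation can be sketched as follows (our own illustration: fit_complex plays the role of the β-sensitive first stage, simple_class is a finite, vectorized stand-in for $H$, and the sup-norm constraint defining $H_S$ is checked on the sample points only, as a proxy):

```python
import numpy as np

def distill(S, fit_complex, simple_class, gamma, loss):
    """Stage 1: train the complex predictor f_S on S.
    Stage 2: restrict the simpler class to
    H_S = {h in H : ||h - f_S||_inf <= gamma} and return the empirical
    risk minimizer within H_S."""
    f_S = fit_complex(S)
    xs = np.array([x for x, _ in S])
    def close_to_f_S(h):
        # proxy for the sup-distance, evaluated on the sample points
        return np.max(np.abs(h(xs) - f_S(xs))) <= gamma
    H_S = [h for h in simple_class if close_to_f_S(h)]
    if not H_S:
        raise ValueError("no simple hypothesis within gamma of f_S")
    return min(H_S, key=lambda h: float(np.mean([loss(h(x), y) for x, y in S])))
```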
6 Conclusion

We presented a broad study of generalization with data-dependent hypothesis sets, including general learning bounds based on a notion of transductive Rademacher complexity and, more importantly, learning bounds for stable data-dependent hypothesis sets. We illustrated the applications of these guarantees to the analysis of several problems. Our framework is general and covers learning scenarios commonly arising in applications for which standard generalization bounds are not applicable. Our results can be further augmented and refined to include model selection bounds and local Rademacher complexity bounds for stable data-dependent hypothesis sets (to be presented in a more extended version of this manuscript), and further extensions described in Appendix F. Our analysis can also be extended to the non-i.i.d. setting and other learning scenarios such as that of transduction. Several by-products of our analysis, including our proof techniques, new guarantees for transductive learning, and our PAC-Bayesian bounds for randomized algorithms, both in the sample-independent and sample-dependent cases, can be of independent interest. While we highlighted several applications of our learning bounds, a tighter analysis might be needed to derive guarantees for a wider range of data-dependent hypothesis classes or scenarios.

Acknowledgements. HL is supported by NSF IIS-1755781. The work of SG and MM was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award. KS would like to acknowledge NSF CAREER Award 1750575 and a Sloan Research Fellowship.

References

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In Proceedings of COLT, volume 2375, pages 79–97. Springer-Verlag, 2002.

Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of STOC, pages 1046–1059, 2016.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833, 2019. URL http://arxiv.org/abs/1910.07833.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Adam Cannon, J. Mark Ettinger, Don R. Hush, and Clint Scovel. Machine learning with data dependent hypothesis classes. Journal of Machine Learning Research, 2:335–358, 2002.

Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics, 2007.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. In Proceedings of NIPS, pages 359–366, 2001.

Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability of transductive regression algorithms. In Proceedings of ICML, pages 176–183, 2008.

Luc Devroye and T. J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979.

Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Proceedings of NeurIPS, pages 8440–8450, 2018a.

Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In Proceedings of ICML, pages 1376–1385, 2018b.

Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. In Proceedings of COLT, pages 157–171, 2007.

André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79, 2005.

Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. In Proceedings of NeurIPS, pages 9770–9780, 2018.

Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Proceedings of COLT, 2019.

Yoram Gat. A learning generalization bound with an application to sparse-representation classifiers. Machine Learning, 42(3):233–239, 2001.

Ralf Herbrich and Robert C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175–212, 2002.
Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of NIPS, pages 793–800, 2008.

Satyen Kale, Ravi Kumar, and Sergei Vassilvitskii. Cross-validation and mean-square stability. In Proceedings of Innovations in Computer Science, pages 487–495, 2011.

Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Proceedings of COLT, pages 152–162. ACM, 1997.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of UAI, pages 275–282, 2002.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93–117, 2017.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.

Guy Lever, François Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.

Andreas Maurer. A second-order look at stability and generalization. In Proceedings of COLT, pages 1461–1475, 2017.

David A. McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of COLT, pages 203–215, 2003.

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of FOCS, pages 94–103, 2007.

Kohei Miyaguchi. PAC-Bayesian transportation bound. CoRR, abs/1905.13435, 2019.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary phi-mixing and beta-mixing processes. Journal of Machine Learning Research, 11:789–814, 2010.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of ICLR, 2018.

Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13(Dec):3507–3531, 2012.

Petra Philips. Data-Dependent Analysis of Learning Algorithms. PhD thesis, Australian National University, 2005.

David Pollard. Convergence of Stochastic Processes. Springer, 1984.

W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506–514, 1978.

Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

Nati Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Proceedings of NIPS, 2010.

Thomas Steinke and Jonathan Ullman. Subgaussian tail bounds via stability arguments. CoRR, abs/1701.03493, 2017. URL http://arxiv.org/abs/1701.03493.
G. W. Stewart. Matrix Algorithms: Volume 1, Basic Decompositions. Society for Industrial and Applied Mathematics, 1998.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.