# Relative Deviation Margin Bounds

Corinna Cortes¹, Mehryar Mohri¹,², Ananda Theertha Suresh¹

¹Google Research, New York, NY; ²Courant Institute of Mathematical Sciences, New York, NY. Correspondence to: Ananda Theertha Suresh. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** We present a series of new and more favorable margin-based learning guarantees that depend on the empirical margin loss of a predictor. We give two types of learning bounds, in terms of either the Rademacher complexity or the empirical ℓ∞-covering number of the hypothesis set used, both distribution-dependent and valid for general families. Furthermore, using our relative deviation margin bounds, we derive distribution-dependent generalization bounds for unbounded loss functions under the assumption of a finite moment. We also briefly highlight several applications of these bounds and discuss their connection with existing results.

## 1. Introduction

Margin-based learning bounds provide a fundamental tool for the analysis of generalization in classification (Vapnik, 1998; 2006; Schapire et al., 1997; Koltchinskii and Panchenko, 2002; Taskar et al., 2003; Bartlett and Shawe-Taylor, 1998; Cortes et al., 2014; Kuznetsov et al., 2014; Cortes et al., 2017). These are guarantees that hold for real-valued functions based on the notion of confidence margin. Unlike worst-case bounds based on standard complexity measures such as the VC-dimension, margin bounds provide optimistic guarantees: a strong guarantee holds for predictors that achieve a relatively small empirical margin loss, for a relatively large value of the confidence margin. More generally, guarantees similar to margin bounds can be derived based on the notion of luckiness (Shawe-Taylor et al., 1998; Koltchinskii and Panchenko, 2002).

Notably, margin bounds do not have an explicit dependency on the dimension of the feature space for linear or kernel-based hypotheses. They provide strong guarantees for large-margin maximization algorithms such as Support Vector Machines (SVM) (Cortes and Vapnik, 1995), including when they are used with positive definite kernels such as Gaussian kernels, for which the dimension of the feature space is infinite. Similarly, margin-based learning bounds have helped derive significant guarantees for AdaBoost (Freund and Schapire, 1997; Schapire et al., 1997). More recently, margin-based learning bounds have been derived for feed-forward artificial neural networks (NNs) (Neyshabur et al., 2015; Bartlett et al., 2017) and convolutional neural networks (CNNs) (Long and Sedghi, 2020).

An alternative family of tighter learning guarantees is that of relative deviation bounds (Vapnik, 1998; 2006; Anthony and Shawe-Taylor, 1993; Cortes et al., 2019). These are bounds on the difference of the generalization and the empirical error scaled by the square-root of the generalization error or empirical error, or some other power of the error. The scaling is similar to dividing by the standard deviation since, for smaller values of the error, the variance of the error of a predictor roughly coincides with its error. These guarantees translate into very useful bounds on the difference of the generalization error and empirical error whose complexity terms admit the empirical error as a factor.

This paper presents relative deviation margin bounds.
These are new learning bounds that combine the benefit of standard margin bounds and that of standard relative deviation bounds, thereby resulting in tighter margin-based guarantees (Section 5.2). Our bounds are distribution-dependent and valid for general hypothesis sets. They can be viewed as second-order margin-based guarantees. For a sample size m, they are based on an interpolation between a $\frac{1}{\sqrt{m}}$-term that includes the square-root of the empirical margin loss as a factor and another term in $\frac{1}{m}$. In particular, when the empirical margin loss is zero, the bound only admits the $\frac{1}{m}$ fast-rate term. As an example, our learning bounds provide tighter guarantees for margin-based algorithms such as SVM and boosting than existing ones.

We give two new families of relative deviation bounds, both distribution-dependent and valid for general hypothesis sets. Additionally, both families of guarantees hold for an arbitrary α-moment, with α ∈ (1,2]. The guarantees for general α-moments admit interesting applications in other areas. We describe one such application to deriving generalization guarantees for unbounded loss functions in Section 5.1.

Our first family of margin bounds is expressed in terms of the empirical ℓ∞-covering number of the hypothesis set (Section 3). We show how these empirical covering numbers can be upper-bounded to derive empirical fat-shattering guarantees. One benefit of these resulting guarantees is that there are known upper bounds on the covering numbers for various standard hypothesis sets, which can be leveraged to derive explicit bounds.

Our second family of margin bounds is expressed in terms of the Rademacher complexity of the hypothesis set used (Section 4). Here, our learning bounds are first expressed in terms of a peeling-based Rademacher complexity term we introduce. Next, we give a series of upper bounds on this complexity measure, first simpler ones in terms of Rademacher complexity, next in terms of empirical ℓ2-covering numbers, and finally in terms of the so-called maximum Rademacher complexity. In particular, we show that a simplified version of our bounds yields a guarantee similar to the maximum Rademacher margin bound of Srebro et al. (2010), but for a general α-moment.

We then use our families of margin bounds for α-moments to provide generalization guarantees for unbounded loss functions (Section 5.1). We also illustrate these results by deriving explicit bounds for various standard hypothesis sets in Section 5.2.

### 1.1. Contributions and Previous Work

We now further highlight our contributions and compare them to related previous work.

**ℓ∞-covering-based bounds:** A version of our main result for empirical ℓ∞-covering number bounds in the special case α = 2 was postulated by Bartlett (1998) without a proof. The author suggested that the proof could be given by combining various techniques with the results of Anthony and Shawe-Taylor (1993) and Vapnik (1998; 2006). However, as pointed out by Cortes et al. (2019), the proofs given by Anthony and Shawe-Taylor (1993) and Vapnik (1998; 2006) are incomplete and rely on a key lemma that is not proven by these authors. In a distinct line of research, Zhang (2002) presented finer covering number-based bounds for linear classifiers. These are not relative deviation bounds, but the author postulated that his techniques could be modified, using Bernstein-type concentration bounds, to obtain relative deviation ℓ∞-covering number bounds for linear classifiers.
However, a careful inspection suggests that this is not a straightforward exercise and that obtaining such bounds in fact requires techniques such as those we develop in this paper, or, perhaps, somewhat similar ones. **Our contribution:** We provide a self-contained proof based on a margin-based symmetrization argument. Our proof technique uses a new symmetrization argument that is different from those of Bartlett (1998) and Zhang (2002).

**Rademacher complexity bounds:** Using ideas from local Rademacher complexity (Bartlett et al., 2005), Rademacher complexity bounds were given by Srebro et al. (2010). However, their bounds are based on the so-called maximum Rademacher complexity, which depends on the worst possible sample and is therefore independent of the underlying distribution. **Our contribution:** We provide the first distribution-dependent relative deviation margin bounds, in terms of a peeling-based Rademacher complexity. The proof is based on several new ingredients, including a new symmetrization result, an upper bound in terms of a normalized Rademacher process, and a peeling-based argument. We also show that, as a by-product of our guarantees, the distribution-independent bounds of Srebro et al. (2010) can be recovered, albeit with a more general α ∈ (1,2].

**Generalization bounds for unbounded loss functions:** Standard relative deviation bounds do not hold for commonly used loss functions that are unbounded, such as the cross-entropy. Cortes et al. (2019) provided zero-one relative deviation bounds which they used to derive guarantees for unbounded losses, in terms of the dichotomies generated by the hypothesis class, under the assumption of a finite moment of the loss. **Our contribution:** We present the first generalization bounds for unbounded loss functions in terms of covering numbers and Rademacher complexity, which are optimistic bounds that, in general, are more favorable than the previously known bounds of Cortes et al. (2019), under the same finite-moment assumption. Doing so further required us to derive relative deviation margin bounds for a general α-moment (α ∈ (1,2]), in contrast with previous work, which only focused on the special case α = 2. The need for guarantees for unbounded loss functions with bounded α-moments with α < 2 comes up in several scenarios, for example in the context of importance weighting (Cortes, Mansour, and Mohri, 2010).

Recently, relative deviation margin bounds for the special case of linear classifiers were studied by Grønlund et al. (2020). Both the results and the proof techniques in that work are specific to the case of linear hypotheses. In contrast, our bounds hold for any general hypothesis set and recover the bounds of Grønlund et al. (2020) for the special case of linear classifiers, up to logarithmic factors. Furthermore, our proofs, while more general, are also simpler. Moreover, in contrast with these bounds, our guarantees are expressed in terms of Rademacher complexity and are therefore distribution-dependent. Relative deviation PAC-Bayesian bounds were also derived by McAllester (2003) for linear hypothesis sets. It is known, however, that Rademacher complexity learning bounds are finer guarantees: as shown by Kakade et al. (2008) and Foster et al. (2019, Appendix H), they can be used to derive more favorable PAC-Bayesian guarantees than previously known ones (McAllester, 2003).
## 2. Preliminaries

In this section, we introduce the main definitions and notation used in our analysis and prove two symmetrization-type lemmas for a relative deviation between the expected binary loss and the empirical margin loss.

We consider an input space X, a binary output space Y = {−1,+1}, and a hypothesis set H of functions mapping from X to ℝ. We denote by D a distribution over Z = X × Y and denote by R(h) the generalization error and by R_S(h) the empirical error of a hypothesis h ∈ H:
$$R(h) = \mathop{\mathbb{E}}_{z=(x,y)\sim D}\big[1_{yh(x)\le 0}\big], \qquad R_S(h) = \mathop{\mathbb{E}}_{z=(x,y)\sim S}\big[1_{yh(x)\le 0}\big],$$
where we write z ∼ S to indicate that z is randomly drawn from the empirical distribution defined by S. Given ρ ≥ 0, we similarly define the ρ-margin loss and empirical ρ-margin loss of h ∈ H:
$$R_\rho(h) = \mathop{\mathbb{E}}_{z=(x,y)\sim D}\big[1_{yh(x)<\rho}\big], \qquad R_S^\rho(h) = \mathop{\mathbb{E}}_{z=(x,y)\sim S}\big[1_{yh(x)<\rho}\big].$$
We will sometimes use the shorthand $x_1^m$ to denote a sample of m points $(x_1,\ldots,x_m)\in X^m$.

The relative margin deviation of a hypothesis h ∈ H is the ratio of the difference between the generalization error of h and its empirical margin loss, and the α-th root of the generalization error, for 1 < α ≤ 2:
$$\frac{R(h) - R_S^\rho(h)}{\sqrt[\alpha]{R(h) + \tau}},$$
modulo a constant term τ > 0 used to guarantee the positivity of the denominator, which can be chosen to be arbitrarily small. For R(h) small, the variance R(h)(1 − R(h)) is close to R(h). Thus, for α = 2, the ratio can be viewed as a normalization of the difference between the generalization error of h and its empirical margin loss obtained by dividing (approximately) by the standard deviation.

The problem we consider is to derive high-probability upper bounds for the supremum over h ∈ H of the relative margin deviation of h. This will result in our relative deviation margin bounds. We will be mainly interested in the case α = 2. But, as we shall see in Section 5.1, the case α ∈ (1,2) is crucial since it allows us to derive new covering number-based learning guarantees for unbounded loss functions when the α-moment of the loss is bounded only for some value α ∈ (1,2).

The following is our first symmetrization lemma in terms of empirical margin losses. As already mentioned, the parameter τ > 0 is used to ensure a positive denominator so that the relative deviations are mathematically well defined.

**Lemma 1.** Fix ρ ≥ 0 and 1 < α ≤ 2 and assume that $m\epsilon^{\frac{\alpha}{\alpha-1}} > 1$. Then, for any ϵ, τ > 0, the following inequality holds:
$$\mathop{\mathbb{P}}\left[\sup_{h\in H}\frac{R(h) - R_S^\rho(h)}{\sqrt[\alpha]{R(h) + \tau}} > \epsilon\right] \le 2\,\mathop{\mathbb{P}}\left[\sup_{h\in H}\frac{R_{S'}(h) - R_S^\rho(h)}{\sqrt[\alpha]{\frac{1}{2}\big[R_{S'}(h) + R_S^\rho(h) + \frac{1}{m}\big]}} > \epsilon\right].$$

The proof is presented in Appendix A. It consists of extending the proof technique of Cortes et al. (2019) for the standard empirical error to the empirical margin case and of using the binomial inequality of (Greenberg and Mohri, 2013, Lemma 9). The lemma helps us bound the relative deviation in terms of the empirical margin loss on a sample S and the empirical error on an independent sample S′, both of size m.

We now introduce some notation needed for the presentation and discussion of our relative deviation margin bounds. Let φ: ℝ → ℝ₊ be a function such that the following inequality holds for all x ∈ ℝ: $1_{x\le 0} \le \phi(x) \le 1_{x<\rho}$. As an example, we can choose $\phi(x) = 1_{x<\rho/2}$. For a sample z = (x,y), let g(z) = φ(yh(x)). Then,
$$1_{yh(x)\le 0} \le g(z) \le 1_{yh(x)<\rho}. \tag{1}$$
Let the family G be defined as follows: $G = \{z=(x,y)\mapsto \phi(yh(x)) \colon h\in H\}$, and let $R(g) = \mathbb{E}_{z\sim D}[g(z)]$ denote the expectation of g and $R_S(g) = \mathbb{E}_{z\sim S}[g(z)]$ its empirical expectation for a sample S. There are several choices for the function φ, as illustrated by Figure 1.

*Figure 1. Illustration of different choices of function φ for ρ = 0.25.*
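To make the definitions above concrete, here is a small NumPy sketch, our own illustration rather than code from the paper, that computes the empirical error R_S(h) and the empirical ρ-margin loss R_S^ρ(h) from the margins y·h(x) of a sample, and implements two of the surrogate functions φ discussed in this section, checking the sandwich inequality (1). The toy margins and all function names are assumptions made for illustration.

```python
import numpy as np

def empirical_error(margins):
    """Empirical error R_S(h): fraction of points with margin y*h(x) <= 0."""
    return float(np.mean(np.asarray(margins) <= 0.0))

def empirical_margin_loss(margins, rho):
    """Empirical rho-margin loss R_S^rho(h): fraction of points with y*h(x) < rho."""
    return float(np.mean(np.asarray(margins) < rho))

def phi_step(x, rho):
    """Surrogate phi(x) = 1_{x < rho/2}."""
    return (np.asarray(x) < rho / 2).astype(float)

def phi_ramp(x, rho):
    """Ramp surrogate: 1 for x < 0, 1 - x/rho on [0, rho], 0 for x > rho."""
    return np.clip(1.0 - np.asarray(x) / rho, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rho = 0.25
    # Toy margins y*h(x) for a fixed scorer on 1000 random points.
    margins = rng.normal(loc=0.4, scale=0.5, size=1000)
    print("R_S(h)     =", empirical_error(margins))
    print("R_S^rho(h) =", empirical_margin_loss(margins, rho))
    # Check the sandwich 1_{x<=0} <= phi(x) <= 1_{x<rho} of Inequality (1).
    lower = (margins <= 0).astype(float)
    upper = (margins < rho).astype(float)
    for phi in (phi_step, phi_ramp):
        vals = phi(margins, rho)
        assert np.all(lower <= vals) and np.all(vals <= upper)
    print("Both surrogates lie between the zero-one and rho-margin indicators.")
```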
For example, φ(x) can be chosen to be $1_{x<\rho}$ or $1_{x<\rho/2}$ (Bartlett, 1998). φ can also be chosen to be the so-called ramp loss:
$$\phi(x) = \begin{cases} 1 & \text{if } x < 0,\\ 1 - \frac{x}{\rho} & \text{if } x \in [0,\rho],\\ 0 & \text{if } x > \rho,\end{cases}$$
or the smoothed margin loss chosen by Srebro et al. (2010):
$$\phi(x) = \begin{cases} 1 & \text{if } x < 0,\\ \frac{1+\cos(\pi x/\rho)}{2} & \text{if } x \in [0,\rho],\\ 0 & \text{if } x > \rho.\end{cases}$$

Fix ρ > 0. Define the ρ-truncation function $\beta_\rho\colon \mathbb{R} \to [-\rho,+\rho]$ by $\beta_\rho(u) = \max\{u,-\rho\}\,1_{u\le 0} + \min\{u,+\rho\}\,1_{u>0}$ for all u ∈ ℝ. For any h ∈ H, we denote by $h_\rho$ the ρ-truncation of h, $h_\rho = \beta_\rho \circ h$, and define $H_\rho = \{h_\rho \colon h\in H\}$. For any family of functions F, we also denote by $\mathcal N_\infty(F,\epsilon,x_1^m)$ the empirical ℓ∞-covering number of F over the sample $(x_1,\ldots,x_m)$ and by $C(F,\epsilon,x_1^m)$ a minimum empirical cover. Then, the following symmetrization lemma holds.

**Lemma 2.** Fix ρ ≥ 0 and 1 < α ≤ 2. Then, the following inequality holds:
$$\mathop{\mathbb{P}}\left[\sup_{h\in H}\frac{R_{S'}(h)-R_S^\rho(h)}{\sqrt[\alpha]{\frac12\big[R_{S'}(h)+R_S^\rho(h)+\frac1m\big]}}>\epsilon\right] \le \mathop{\mathbb{P}}\left[\sup_{g\in G}\frac{R_{S'}(g)-R_S(g)}{\sqrt[\alpha]{\frac12\big[R_{S'}(g)+R_S(g)+\frac1m\big]}}>\epsilon\right].$$
Further, for $g(z) = 1_{yh(x)<\rho/2}$, using the shorthand $K = C(H_\rho,\frac\rho2,S\cup S')$, the following holds:
$$\mathop{\mathbb{P}}\left[\sup_{h\in H}\frac{R_{S'}(h)-R_S^\rho(h)}{\sqrt[\alpha]{\frac12\big[R_{S'}(h)+R_S^\rho(h)+\frac1m\big]}}>\epsilon\right] \le \mathop{\mathbb{P}}\left[\sup_{h\in K}\frac{R_{S'}^{\rho/2}(h)-R_S^{\rho/2}(h)}{\sqrt[\alpha]{\frac12\big[R_{S'}^{\rho/2}(h)+R_S^{\rho/2}(h)+\frac1m\big]}}>\epsilon\right].$$

The proof, given in Appendix A, consists of using Inequality (1). The first result of the lemma gives an upper bound for a general choice of the functions g, that is, for an arbitrary choice of the loss function φ. This inequality will be used in Section 4 to derive our Rademacher complexity bounds. The second inequality is for the specific choice of φ corresponding to the ρ/2-step function. We will use this inequality in the next section to derive ℓ∞-covering number bounds.

## 3. Relative Deviation Margin Bounds: Covering Numbers

In this section, we present a general relative deviation margin-based learning bound, expressed in terms of the expected empirical covering number of $H_\rho$. The learning guarantee is thus distribution-dependent. It is also very general since it is given for any 1 < α ≤ 2 and an arbitrary hypothesis set.

**Theorem 1** (General relative deviation margin bound). Fix ρ ≥ 0 and 1 < α ≤ 2. Then, for any hypothesis set H of functions mapping from X to ℝ and any ϵ, τ > 0, the following inequality holds:
$$\mathop{\mathbb{P}}\left[\sup_{h\in H}\frac{R(h)-R_S^\rho(h)}{\sqrt[\alpha]{R(h)+\tau}}>\epsilon\right] \le 4\mathop{\mathbb{E}}_{x_1^{2m}\sim D^{2m}}\big[\mathcal N_\infty(H_\rho,\tfrac\rho2,x_1^{2m})\big]\exp\left(-\frac{m\,\epsilon^{\frac{\alpha}{\alpha-1}}}{4}\right).$$

The proof is given in Appendix B. As mentioned earlier, a version of this result for α = 2 was postulated by Bartlett (1998). The result can be alternatively expressed as follows, taking the limit τ → 0.

**Corollary 1.** Fix ρ ≥ 0 and 1 < α ≤ 2. Then, for any hypothesis set H of functions mapping from X to ℝ, with probability at least 1 − δ, the following inequality holds for all h ∈ H:
$$R(h) - R_S^\rho(h) \le 2^{\frac{2(\alpha-1)}{\alpha}}\,\sqrt[\alpha]{R(h)}\left[\frac{\log \mathbb{E}\big[\mathcal N_\infty(H_\rho,\frac\rho2,x_1^{2m})\big]+\log\frac4\delta}{m}\right]^{1-\frac1\alpha}.$$

Note that a smaller value of α (α closer to 1) might be advantageous for some values of R(h), at the price of a worse complexity in terms of the sample size. For α = 2, the result can be rewritten as follows. In the following, we use $\bar{\mathcal N}$ as a shorthand for $\mathbb{E}\big[\mathcal N_\infty(H_\rho,\frac\rho4,x_1^{2m})\big]$.

**Corollary 2.** Fix ρ ≥ 0. Then, for any hypothesis set H of functions mapping from X to ℝ, with probability at least 1 − δ, the following inequality holds for all h ∈ H:
$$R(h) \le R_S^\rho(h) + 2\sqrt{\frac{R_S^\rho(h)\big(\log\bar{\mathcal N}+\log\frac4\delta\big)}{m}} + \frac{4\big(\log\bar{\mathcal N}+\log\frac4\delta\big)}{m}.$$

*Proof.* Let a, b, and c be defined as follows: a = R(h), b = $R_S^\rho(h)$, and $c = \frac{\log\mathbb{E}[\mathcal N_\infty(H_\rho,\frac\rho2,x_1^{2m})]+\log\frac4\delta}{m}$. Then, for α = 2, the inequality of Corollary 1 can be rewritten as $a \le b + 2\sqrt{ca}$. This implies that $(\sqrt a - \sqrt c)^2 \le b + c$ and hence $\sqrt a \le \sqrt{b+c}+\sqrt c$. Therefore, $a \le b + 2c + 2\sqrt{(b+c)c} \le b + 4c + 2\sqrt{cb}$. Substituting the values of a, b, and c yields the bound.
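As a quick numerical reading of Corollary 2, in the form stated above and with the absolute constants of our presentation, the sketch below evaluates the explicit bound $R_S^\rho(h) + 2\sqrt{R_S^\rho(h)\,c} + 4c$ for a few hypothetical values of the empirical margin loss and of the log-covering-number term. The sample size, confidence level, and covering-number value are illustrative assumptions; the point is only to show how the empirical margin loss enters the dominant term as a factor.

```python
import math

def corollary2_bound(emp_margin_loss, log_cov, delta, m):
    """Evaluate R_S^rho(h) + 2*sqrt(R_S^rho(h)*c) + 4*c with c = (log_cov + log(4/delta)) / m."""
    c = (log_cov + math.log(4.0 / delta)) / m
    return emp_margin_loss + 2.0 * math.sqrt(emp_margin_loss * c) + 4.0 * c

if __name__ == "__main__":
    m, delta, log_cov = 10000, 0.05, 50.0   # hypothetical sample size and covering-number log
    for b in (0.0, 0.005, 0.05, 0.2):
        bound = corollary2_bound(b, log_cov, delta, m)
        print(f"R_S^rho = {b:5.3f}  ->  bound on R(h) ~ {bound:.4f}")
```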
The guarantee just presented provides a tighter margin-based learning bound than standard margin bounds since the dominating term admits the empirical margin loss as a factor. Standard margin bounds are subject to a trade-off: a large value of ρ reduces the complexity term while leading to a larger empirical margin loss term. Here, the presence of the empirical loss factor favors this trade-off by allowing a smaller choice of ρ. The bound is distribution-dependent since it is expressed in terms of the expected covering number, and it holds for an arbitrary hypothesis set H.

The learning bounds just presented hold for a fixed value of ρ. They can be extended to hold uniformly for all values of ρ ∈ (0,1], at the price of an additional log log-term. We illustrate that extension for Corollary 1.

**Corollary 3.** Fix 1 < α ≤ 2. Then, for any hypothesis set H of functions mapping from X to ℝ and any ρ ∈ (0,r], with probability 1 − δ, the following inequality holds for all h ∈ H:
$$R(h) - R_S^\rho(h) \le 2^{\frac{2(\alpha-1)}{\alpha}}\,\sqrt[\alpha]{R(h)}\left[\frac{\log\bar{\mathcal N} + \log\frac{8\log_2^2(2r/\rho)}{\delta}}{m}\right]^{1-\frac1\alpha}.$$

*Proof.* For k ≥ 1, let $\rho_k = r/2^k$ and $\delta_k = \delta/(2k^2)$. For each such $\rho_k$, by Corollary 1, with probability at least $1-\delta_k$, for all h ∈ H,
$$R(h) - R_S^{\rho_k}(h) \le 2^{\frac{2(\alpha-1)}{\alpha}}\,\sqrt[\alpha]{R(h)}\left[\frac{\log\mathbb{E}\big[\mathcal N_\infty(H_{\rho_k},\tfrac{\rho_k}2,x_1^{2m})\big] + \log\frac4{\delta_k}}{m}\right]^{1-\frac1\alpha}.$$
By the union bound, the total error probability is at most $\sum_k \delta_k = \delta\sum_k \frac1{2k^2} \le \delta$. For any ρ ∈ (0,r], there exists a k such that ρ ∈ (ρ_k,ρ_{k−1}]. For this k, ρ ≤ ρ_{k−1} = r/2^{k−1}, hence k ≤ log₂(2r/ρ). By the definition of the margin loss, for all h ∈ H, $R_S^{\rho_k}(h) \le R_S^\rho(h)$. Furthermore, since $\rho_k = \rho_{k-1}/2 \ge \rho/2$, we have $\mathcal N_\infty(H_{\rho_k},\frac{\rho_k}2,x_1^{2m}) \le \mathcal N_\infty(H_\rho,\frac\rho4,x_1^{2m})$. Hence, for all ρ ∈ (0,r],
$$R(h) - R_S^\rho(h) \le 2^{\frac{2(\alpha-1)}{\alpha}}\,\sqrt[\alpha]{R(h)}\left[\frac{\log\bar{\mathcal N} + \log\frac{8\log_2^2(2r/\rho)}{\delta}}{m}\right]^{1-\frac1\alpha}.$$
This concludes the proof.

Our previous bounds can be expressed in terms of the fat-shattering dimension, as illustrated below. Recall that, given γ > 0, a set of points U = {u₁,…,u_m} is said to be γ-shattered by a family of real-valued functions H if there exist real numbers (r₁,…,r_m) (witnesses) such that for all binary vectors (b₁,…,b_m) ∈ {0,1}^m, there exists h ∈ H such that:
$$h(u_i) \ge r_i + \gamma \ \text{ if } b_i = 1; \qquad h(u_i) \le r_i - \gamma \ \text{ otherwise}.$$
The fat-shattering dimension $\operatorname{fat}_\gamma(H)$ of the family H is the cardinality of the largest set γ-shattered by H (Anthony and Bartlett, 1999).

**Corollary 4.** Fix ρ ≥ 0. Then, for any hypothesis set H of functions mapping from X to ℝ with $d = \operatorname{fat}_{\frac\rho{16}}(H)$, with probability at least 1 − δ, the following holds for all h ∈ H:
$$R(h) \le R_S^\rho(h) + 2\sqrt{R_S^\rho(h)\,\bar m} + 4\bar m,$$
where $\bar m = \frac{1 + d\log_2(2c^2m)\log_2\frac{2cem}{d} + \log\frac4\delta}{m}$ and c = 17.

*Proof.* By (Bartlett, 1998, proof of Theorem 2), we have
$$\log\max_{x_1^{2m}}\big[\mathcal N_\infty(H_\rho,\tfrac\rho2,x_1^{2m})\big] \le 1 + \bar d\log_2(2c^2m)\log_2\frac{2cem}{\bar d},$$
where $\bar d = \operatorname{fat}_{\frac\rho{16}}(H_\rho) \le \operatorname{fat}_{\frac\rho{16}}(H) = d$. Upper bounding the expectation by the maximum completes the proof.

We will use this bound in Section 5.2 to derive explicit guarantees for several standard hypothesis sets.

## 4. Relative Deviation Margin Bounds: Rademacher Complexity

In this section, we present relative deviation margin bounds expressed in terms of the Rademacher complexity of the hypothesis set. As in the previous section, these bounds are general: they hold for any 1 < α ≤ 2 and arbitrary hypothesis sets. As before, we define the family G by $G = \{(x,y)\mapsto \phi(yh(x)) \colon h\in H\}$, where φ is a function such that $1_{x\le 0}\le\phi(x)\le 1_{x<\rho}$. For a set G and a sample $z_1^m$, the empirical Rademacher complexity is defined as
$$\mathfrak R_m(G) = \mathop{\mathbb{E}}_\sigma\left[\sup_{g\in G}\frac1m\sum_{i=1}^m\sigma_i g(z_i)\right],$$
where the σ_i are independent Rademacher variables, that is, uniform random variables taking values in {−1,+1}. We further allow G to depend on the sample.
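As a concrete illustration of the quantity just defined, the following sketch estimates the empirical Rademacher complexity $\mathfrak R_m(G)$ of a finite family of loss vectors by Monte Carlo sampling of the Rademacher variables σ. This is our own illustrative code, not an algorithm from the paper; the finite family, the number of Monte Carlo draws, and the function names are assumptions.

```python
import numpy as np

def empirical_rademacher(loss_matrix, num_draws=2000, seed=0):
    """Monte Carlo estimate of R_m(G) = E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ].

    loss_matrix: array of shape (n_functions, m); row j holds (g_j(z_1), ..., g_j(z_m)).
    """
    rng = np.random.default_rng(seed)
    n_functions, m = loss_matrix.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        # Supremum over the finite family of the sigma-weighted empirical average.
        total += np.max(loss_matrix @ sigma) / m
    return total / num_draws

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n_functions = 200, 50
    # Toy family G: random {0,1}-valued loss vectors of varying sparsity.
    G = (rng.random((n_functions, m)) < rng.random((n_functions, 1))).astype(float)
    print("Estimated R_m(G) ~", round(empirical_rademacher(G), 4))
```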
The proof of our main result in this section relies on the following three main ingredients: (1) a symmetrization lemma relating the relative margin deviation term to a symmetrized quantity involving empirical terms only (Lemmas 1 and 2); (2) relating the problem of bounding that symmetrized quantity to that of bounding a normalized Rademacher process (Lemma 3); (3) bounding that normalized Rademacher process in terms of Rademacher complexity using an adapted peeling technique.

### 4.1. Rademacher Complexity-Based Margin Bounds

We first relate bounding the symmetrized relative deviations to bounding the normalized Rademacher process
$$\sup_{g\in G}\frac{\frac1m\sum_{i=1}^m\sigma_i g(z_i)}{\sqrt[\alpha]{\frac1m\big[\sum_{i=1}^m g(z_i)+1\big]}}.$$

**Lemma 3.** Fix 1 < α ≤ 2. Then, the following inequality holds:
$$\mathop{\mathbb{P}}\left[\sup_{g\in G}\frac{R_{S'}(g)-R_S(g)}{\sqrt[\alpha]{\frac12\big[R_{S'}(g)+R_S(g)+\frac1m\big]}}>\epsilon\right] \le 2\mathop{\mathbb{P}}_{z_1^m\sim D^m,\,\sigma}\left[\sup_{g\in G}\frac{\frac1m\sum_{i=1}^m\sigma_i g(z_i)}{\sqrt[\alpha]{\frac1m\big[\sum_{i=1}^m g(z_i)+1\big]}}>\frac\epsilon2\right].$$

The proof is given in Appendix C. It consists of introducing Rademacher variables and deriving an upper bound in terms of the first m points only.

Now, to bound the normalized Rademacher process term, the technique adopted in previous work consists of fixing $z_1^m$ and applying Hoeffding's bound to the ratio $\frac{\frac1m\sum_{i=1}^m\sigma_i g(z_i)}{\sqrt[\alpha]{\frac1m[\sum_{i=1}^m g(z_i)+1]}}$ for a fixed g ∈ G (Anthony and Shawe-Taylor, 1993; Cortes et al., 2019). This is then followed by a union bound, which results in shattering coefficients or covering numbers, and an expectation over $z_1^m$. Instead, for a fixed $z_1^m$, we seek to directly bound the normalized Rademacher process term via a uniform convergence bound. Doing so is not straightforward due to the complex denominator. Thus, we first peel G according to the value of the main term in the denominator, $\frac1m\sum_{i=1}^m g(z_i)+\frac1m$: we partition G into sets $G_k(z_1^m)$ for which this average value lies in $\big[\frac{2^k}m,\frac{2^{k+1}}m\big)$. This reduces bounding the normalized Rademacher process term to bounding the Rademacher process terms $\sup_{g\in G_k(z_1^m)}\frac1m\sum_{i=1}^m\sigma_i g(z_i)$.

Now, to bound these terms, using McDiarmid's inequality would result in terms that are too loose. This is essentially because the proxy term for the variance in McDiarmid's inequality is a quantity of the form $\sum_{i=1}^m\sup_g(\delta_i g)^2$, where $\delta_i g$ denotes the bounded difference with respect to the i-th point. We use an alternative bounded difference inequality (van Handel, 2016, Theorem 3.18) with a proxy term of the form $\sup_g\sum_{i=1}^m(\delta_i g)^2$ instead, which helps us leverage the property of $G_k(z_1^m)$ and also provides a finer one-sided inequality. This results, for each Rademacher process term $\sup_{g\in G_k(z_1^m)}\frac1m\sum_{i=1}^m\sigma_i g(z_i)$, in a bound expressed in terms of the Rademacher complexity of $G_k(z_1^m)$. A union bound over the sets $G_k(z_1^m)$ and an expectation over $z_1^m$ conclude the proof.

With this background, we now detail the peeling argument, that is, we partition G into subsets $G_k$, give a learning bound for each $G_k$, and then take a weighted union bound. For any non-negative integer k with 0 ≤ k ≤ log₂ m, let $G_k(z_1^m)$ denote the family of hypotheses defined by
$$G_k(z_1^m) = \Big\{g\in G \colon 2^k \le \Big(\sum_{i=1}^m g(z_i)\Big)+1 < 2^{k+1}\Big\}.$$
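The partition just defined is straightforward to compute for a finite family. The sketch below, an illustration of ours on a hypothetical finite family of [0,1]-valued loss vectors, places each function in the bucket $G_k(z_1^m)$ determined by the dyadic value of $\sum_i g(z_i)+1$, exactly as in the definition above.

```python
import numpy as np

def peel(loss_matrix):
    """Partition a finite family G into the subsets G_k(z_1^m).

    loss_matrix: array of shape (n_functions, m) with values in [0, 1];
    row j is (g_j(z_1), ..., g_j(z_m)). Function j is placed in bucket k
    when 2^k <= sum_i g_j(z_i) + 1 < 2^(k+1).
    """
    sums = loss_matrix.sum(axis=1) + 1.0
    ks = np.floor(np.log2(sums)).astype(int)
    buckets = {}
    for j, k in enumerate(ks):
        buckets.setdefault(int(k), []).append(j)
    return buckets

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, n_functions = 128, 40
    G = (rng.random((n_functions, m)) < rng.random((n_functions, 1))).astype(float)
    for k, idx in sorted(peel(G).items()):
        print(f"G_{k}: {len(idx)} functions, sum + 1 in [{2**k}, {2**(k+1)})")
```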
Using the above inequality and a peeling argument, we show the following upper bound expressed in terms of Rademacher complexities.

**Lemma 4.** Fix 1 < α ≤ 2 and $z_1^m \in Z^m$. Then, the following inequality holds:
$$\mathop{\mathbb{P}}_\sigma\left[\sup_{g\in G}\frac{\frac1m\sum_{i=1}^m\sigma_i g(z_i)}{\sqrt[\alpha]{\frac1m\big[\sum_{i=1}^m g(z_i)+1\big]}}>\epsilon\;\middle|\;z_1^m\right] \le \sum_{k=0}^{\lfloor\log_2 m\rfloor}\exp\left(\frac{m^2\,\mathfrak R_m^2(G_k(z_1^m))}{2^{k+5}}-\frac{\epsilon^2}{64}\,m^{2-\frac2\alpha}\,2^{-k(1-\frac2\alpha)}\right).$$

The proof is given in Appendix C. Instead of applying Hoeffding's bound to each term of the left-hand side for a fixed g and then using covering and the union bound to bound the supremum, here we seek to bound the supremum over G directly. To do so, we use a bounded difference inequality that leads to a finer result than McDiarmid's inequality.

Let $r_m(G)$ be defined as the following peeling-based Rademacher complexity of G:
$$r_m(G) = \sup_{0\le k\le\log_2 m}\log\left[\mathop{\mathbb{E}}_{z_1^m\sim D^m}\left[\exp\left(\frac{m^2\,\mathfrak R_m^2(G_k(z_1^m))}{2^{k+5}}\right)\right]\right].$$
Then, the following is a margin-based relative deviation bound expressed in terms of $r_m(G)$, that is, in terms of Rademacher complexities.

**Theorem 2.** Fix 1 < α ≤ 2. Then, with probability at least 1 − δ, for all hypotheses h ∈ H, the following inequality holds:
$$R(h)-R_S^\rho(h)\le 32\,\sqrt[\alpha]{R(h)}\left[\frac{r_m(G)+\log\log m+\log\frac{16}\delta}{m}\right]^{1-\frac1\alpha}.$$

Combining the above lemma with Theorem 2 yields the following.

**Corollary 5.** Fix 1 < α ≤ 2 and let G be defined as above. Then, with probability at least 1 − δ, for all hypotheses h ∈ H,
$$R(h)-R_S^\rho(h)\le 32\,\sqrt[\alpha]{R_S^\rho(h)}\left(\frac{\bar m}{m}\right)^{1-\frac1\alpha}+2(32)^{\frac\alpha{\alpha-1}}\,\frac{\bar m}{m},$$
where $\bar m = r_m(G)+\log\log m+\log\frac{16}\delta$.

The above result can be extended to hold for all α simultaneously.

**Corollary 6.** Let G be defined as above. Then, with probability at least 1 − δ, for all hypotheses h ∈ H and all α ∈ (1,2],
$$R(h)-R_S^\rho(h)\le 32\,\sqrt[\alpha]{R_S^\rho(h)}\left(\frac{r_m(G)+\log\frac{16\log m}{\delta}}{m}\right)^{1-\frac1\alpha}+2(32)^{\frac\alpha{\alpha-1}}\,\frac{r_m(G)+\log\frac{16\log m}{\delta}}{m}.$$

### 4.2. Upper Bounds on the Peeling-Based Rademacher Complexity

We now present several upper bounds on $r_m(G)$ and show how they can help recover previously known quantities. We provide proofs for all the results in Appendix D. For any hypothesis set G, we denote by $S_G(z_1^m)$ the number of distinct dichotomies generated by G over the sample:
$$S_G(z_1^m) = \operatorname{Card}\big(\{(g(z_1),\ldots,g(z_m)) \colon g\in G\}\big).$$
We note that we do not make any assumptions on the range of G.

**Lemma 5.** If the functions in G take values in {0,1}, then the following upper bound holds for the peeling-based Rademacher complexity of G:
$$r_m(G)\le \tfrac18\,\log\mathop{\mathbb{E}}_{z_1^m}\big[S_G(z_1^m)\big].$$

Combining the above result with Corollary 5 improves the relative deviation bounds of (Cortes et al., 2019, Corollary 2) for α < 2. In particular, we improve the $\mathbb{E}_{z_1^m}[S_G(z_1^m)]$ term in their bounds to $\big(\mathbb{E}_{z_1^m}[S_G(z_1^m)]\big)^{1-1/\alpha}$, which is significant for α < 2. We next upper bound the peeling-based Rademacher complexity in terms of covering numbers.

**Lemma 6.** For a set of hypotheses G,
$$r_m(G)\le \sup_{0\le k\le\log_2 m}\log\Big[\mathop{\mathbb{E}}_{z_1^m\sim D^m}\big[\exp\{f_k(z_1^m,G)\}\big]\Big],$$
where, for each k, $f_k(z_1^m,G)$ is an entropy-integral upper bound on $\frac{m^2\,\mathfrak R_m^2(G_k(z_1^m))}{2^{k+5}}$, expressed in terms of the empirical ℓ2-covering numbers through a chaining term of the form $\int\sqrt{\frac1m\log\mathcal N_2(G_k(z_1^m),\epsilon,z_1^m)}\,d\epsilon$ (see Appendix D).

One can further simplify the above bound using the smoothed margin loss from (Srebro et al., 2010). Let the worst-case Rademacher complexity be defined as $\mathfrak R^{\max}_m(H) = \sup_{z_1^m}\mathfrak R_m(H)$.

**Lemma 7.** Let g be the smoothed margin loss from (Srebro et al., 2010, Section 5.1), whose second derivative is bounded by $\frac{\pi^2}{4\rho^2}$. Then, $r_m(G)$ is upper bounded by
$$\frac{m\,\big[4\pi\,\mathfrak R^{\max}_m(H)\big]^2}{\rho^2}\left[2\log^{\frac32}\frac{m}{\mathfrak R^{\max}_m(H)}\,\log^{\frac32}\frac{2\pi m}{\rho\,\mathfrak R^{\max}_m(H)}\right].$$

*Proof.* Recall that the smoothed margin loss of Srebro et al. (2010) is given by
$$\phi(yh(x)) = \begin{cases} 1 & \text{if } yh(x) < 0,\\ \frac{1+\cos(\pi yh(x)/\rho)}{2} & \text{if } yh(x) \in [0,\rho],\\ 0 & \text{if } yh(x) > \rho.\end{cases}$$
Upper bounding the expectation by the maximum gives
$$r_m(G)\le\sup_k\log\sup_{z_1^m}\Big[\exp\Big(\frac{m^2\,\mathfrak R_m^2(G_k(z_1^m))}{2^{k+5}}\Big)\Big]=\sup_k\sup_{z_1^m}\frac{m^2\,\mathfrak R_m^2(G_k(z_1^m))}{2^{k+5}}.$$
Let $G'_k(z_1^m) = \{g\in G \colon \sum_{i=1}^m g(z_i)+1\le 2^{k+1}\}$. Since $G_k(z_1^m)\subseteq G'_k(z_1^m)$,
$$r_m(G)\le\sup_k\sup_{z_1^m}\frac{m^2\,\mathfrak R_m^2(G'_k(z_1^m))}{2^{k+5}}.$$
Now, $\mathfrak R_m(G'_k(z_1^m))$ coincides with the local Rademacher complexity term defined in (Srebro et al., 2010, Section 2). Thus, by (Srebro et al., 2010, Lemma 2.2), it admits an upper bound in terms of $\mathfrak R^{\max}_m(H)$ which, plugged into the previous inequality, yields the stated bound and concludes the proof.
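To give a feel for the peeling-based complexity $r_m(G) = \sup_k \log \mathbb{E}_{z_1^m}[\exp(m^2\,\mathfrak R_m^2(G_k(z_1^m))/2^{k+5})]$ defined earlier in this section, here is a self-contained Monte Carlo sketch on a toy family of threshold losses. The data distribution, the finite family, and the sampling sizes are illustrative assumptions of ours, and the estimate only shows how the per-bucket Rademacher complexities enter the definition; it is not an algorithm from the paper.

```python
import numpy as np

def rademacher_sup(loss_matrix, rng, num_draws=200):
    """Monte Carlo estimate of E_sigma[sup_g (1/m) sum_i sigma_i g(z_i)] for a finite family."""
    if loss_matrix.shape[0] == 0:
        return 0.0
    m = loss_matrix.shape[1]
    vals = []
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        vals.append(np.max(loss_matrix @ sigma) / m)
    return float(np.mean(vals))

def peeling_contributions(sample_losses, rng):
    """For one sample z_1^m, return exp(m^2 * R_m(G_k)^2 / 2^(k+5)) for every occupied bucket k."""
    m = sample_losses.shape[1]
    sums = sample_losses.sum(axis=1) + 1.0
    ks = np.floor(np.log2(sums)).astype(int)
    contrib = {}
    for k in np.unique(ks):
        Gk = sample_losses[ks == k]
        Rk = rademacher_sup(Gk, rng)
        contrib[int(k)] = np.exp(m**2 * Rk**2 / 2.0**(k + 5))
    return contrib

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    m, num_datasets = 64, 20
    thresholds = np.linspace(0.05, 0.95, 30)      # toy family: g_t(z) = 1_{z < t}
    per_k = {}
    for _ in range(num_datasets):                 # crude Monte Carlo over z_1^m ~ Uniform[0,1]^m
        z = rng.random(m)
        losses = (z[None, :] < thresholds[:, None]).astype(float)
        for k, v in peeling_contributions(losses, rng).items():
            per_k.setdefault(k, []).append(v)
    r_m = max(np.log(np.mean(v)) for v in per_k.values())
    print("Monte Carlo estimate of r_m(G) on the toy family:", round(float(r_m), 3))
```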
Combining Lemma 7 with Corollary 5 yields the following bound, which is a generalization of (Srebro et al., 2010, Theorem 5), holding for all α ∈ (1,2].

**Corollary 7.** For any δ > 0, with probability at least 1 − δ, the following inequality holds for all α ∈ (1,2] and all h ∈ H:
$$R(h)-R_S^\rho(h)\le 32\,\sqrt[\alpha]{R_S^\rho(h)}\,\beta_m^{1-\frac1\alpha}+2(32)^{\frac\alpha{\alpha-1}}\beta_m,$$
where $\beta_m = \frac{\bar r_m + \log\frac{16\log m}{\delta}}{m}$ and $\bar r_m$ is the upper bound on $r_m(G)$ given in Lemma 7.

## 5. Applications

In this section, we discuss two applications of our relative deviation margin bounds. We first show how they can be used to obtain generalization guarantees for unbounded loss functions. Next, we describe the application of our bounds to several specific hypothesis sets and show that they can recover some recent results. In Appendix F, we further discuss other potential applications of our learning guarantees.

### 5.1. Generalization Bounds for Unbounded Loss Functions

Standard generalization bounds hold for bounded loss functions. Many loss functions frequently used in applications, such as the cross-entropy loss, are unbounded when used with standard hypothesis sets. For the more general and more realistic case of unbounded loss functions, a number of different results have been presented in the past, under different assumptions on the family of functions. This includes learning bounds assuming the existence of an envelope, that is, a single non-negative function with a finite expectation lying above the absolute value of the loss of every function in the hypothesis set (Dudley, 1984; Pollard, 1984; Dudley, 1987; Pollard, 1989; Haussler, 1992), or an assumption similar to Hoeffding's inequality, based on the expectation of a hyperbolic function, a quantity similar to the moment-generating function (Meir and Zhang, 2003), or the weaker assumption that the α-th moment of the loss is bounded for some value of α > 1 (Vapnik, 1998; 2006; Cortes et al., 2019). The need for guarantees for unbounded loss functions with bounded α-moments with α < 2 comes up in several scenarios, for example in the context of importance weighting (Cortes, Mansour, and Mohri, 2010). Here, we will also adopt this assumption and present distribution-dependent learning bounds for unbounded losses that improve upon the previous bounds of Cortes et al. (2019). To do so, we will leverage the relative deviation margin bounds given in the previous sections, which hold for any 1 < α ≤ 2.

Let L be an unbounded loss function and L(h,z) denote the loss of hypothesis h for sample z. Denote by $L(h) = \mathbb{E}_{z\sim D}[L(h,z)]$ the expected loss of h and by $L_S(h) = \mathbb{E}_{z\sim S}[L(h,z)]$ its empirical loss on a sample S. Let $L_\alpha(h) = \mathbb{E}_{z\sim D}[L(h,z)^\alpha]$ be the α-th moment of the loss function L, which is assumed finite for all h ∈ H. In what follows, we will use the shorthand $\mathbb{P}[L(h,z)>t]$ for $\mathbb{P}_{z\sim D}[L(h,z)>t]$ and, similarly, $\mathbb{P}_S[L(h,z)>t]$ for $\mathbb{P}_{z\sim S}[L(h,z)>t]$.

**Theorem 3.** Fix ρ ≥ 0. Let 1 < α ≤ 2, 0 < ϵ ≤ 1, and $0<\tau^{\frac{\alpha-1}{\alpha}}<\epsilon^{\frac{\alpha}{\alpha-1}}$. For any loss function L (not necessarily bounded) and hypothesis set H such that $L_\alpha(h)<+\infty$ for all h ∈ H,
$$\mathop{\mathbb{P}}\Big[\exists h\in H\colon L(h)-L_S(h)>\Gamma_\tau(\alpha,\epsilon)\,\epsilon\,\sqrt[\alpha]{L_\alpha(h)+\tau}+\rho\Big]\le \mathop{\mathbb{P}}\left[\sup_{h\in H,\,t\in\mathbb{R}}\frac{\mathbb{P}[L(h,z)>t]-\mathbb{P}_S[L(h,z)>t-\rho]}{\sqrt[\alpha]{\mathbb{P}[L(h,z)>t]+\tau}}>\epsilon\right],$$
where $\Gamma_\tau(\alpha,\epsilon)$ is an explicit function of α, ϵ, and τ, given in Appendix E, which grows only logarithmically with $\frac1\epsilon$.

The proof is provided in Appendix E. The above theorem can be used in conjunction with our relative deviation margin bounds to obtain strong guarantees for unbounded loss functions, and we illustrate it with our ℓ∞-based bounds. Similar techniques can be used to obtain peeling-based Rademacher complexity bounds. Combining Theorem 3 with Theorem 1 yields the following corollary.

**Corollary 8.** Fix ρ ≥ 0 and let 1 < α ≤ 2. Then, for any loss function L and hypothesis set H such that $L_\alpha(h)<+\infty$ for all h ∈ H, with probability at least 1 − δ, the following holds for all h ∈ H:
$$L(h)-L_S(h)\le \gamma(\alpha)\,\sqrt[\alpha]{L_\alpha(h)}\left[\frac{\bar m}{m}\right]^{1-\frac1\alpha}+\rho,$$
where $\bar m = \log\mathbb{E}\big[\mathcal N_\infty(L(H),\tfrac\rho2,x_1^{2m})\big]+\log\frac1\delta$, $L(H)=\{z\mapsto L(h,z)\colon h\in H\}$, and $\gamma(\alpha) = O(\log m)$.
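To illustrate the finite α-moment assumption $L_\alpha(h) < +\infty$ that drives Theorem 3 and Corollary 8, the short sketch below computes an empirical α-th moment $\frac1m\sum_i L(h,z_i)^\alpha$ and the corresponding normalization $\sqrt[\alpha]{L_\alpha(h)}$ for two synthetic unbounded losses, cross-entropy-like values and heavy-tailed importance-weighted losses. The data and the choice of α values are purely illustrative assumptions.

```python
import numpy as np

def alpha_moment(losses, alpha):
    """Empirical alpha-th moment (1/m) * sum_i L(h, z_i)**alpha and its alpha-th root."""
    losses = np.asarray(losses, dtype=float)
    moment = float(np.mean(losses ** alpha))
    return moment, moment ** (1.0 / alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    m = 5000
    # Synthetic unbounded losses: cross-entropy-like values ...
    cross_entropy = -np.log(np.clip(rng.uniform(0.01, 1.0, size=m), 1e-12, None))
    # ... and heavy-tailed importance-weighted losses (lognormal weights times a bounded loss).
    iw_losses = rng.lognormal(mean=0.0, sigma=1.5, size=m) * rng.uniform(0.0, 1.0, size=m)
    for name, L in [("cross-entropy", cross_entropy), ("importance-weighted", iw_losses)]:
        for alpha in (1.2, 1.5, 2.0):
            mom, root = alpha_moment(L, alpha)
            print(f"{name:>20s}  alpha={alpha:.1f}  L_alpha={mom:8.3f}  alpha-root={root:6.3f}")
```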
The upper bound in the above corollary has two terms. The first term is based on the covering number and decreases with ρ, while the second term increases with ρ. A natural choice for ρ is $1/\sqrt m$; however, one can choose a suitable value of ρ that minimizes the sum to obtain favorable bounds.¹ Furthermore, the above bound depends on the covering number, as opposed to the result of Cortes et al. (2019), which depends on the number of dichotomies generated by the hypothesis set. Hence, the above bound is optimistic and, in general, more favorable than the previously known bounds of Cortes et al. (2019). We note that, instead of using our ℓ∞-based bounds, one can use our Rademacher complexity bounds to derive finer results.

¹This requires that the bound hold uniformly for all ρ, which can be shown with an additional $\log\log\frac1\rho$ term (see Corollary 9).

### 5.2. Relative Margin Bounds for Common Hypothesis Sets

In this section, we briefly highlight some applications of our learning bounds: both our covering number and Rademacher complexity margin bounds can be used to derive finer margin-based guarantees for several commonly used hypothesis sets. Below, we briefly illustrate these applications.

**Linear hypothesis sets:** Let H be the family of linear hypotheses defined by $H = \{x\mapsto w\cdot x \colon \|w\|_2\le 1,\ x\in\mathbb{R}^n,\ \|x\|_2\le R\}$. The margin bound for SVM by Bartlett and Shawe-Taylor (1998, Theorem 1.7) is
$$R(h)\le R_S^\rho(h)+c\sqrt{\beta'_m},\tag{2}$$
where c is some universal constant and where $\beta'_m = O\big(\frac{(R/\rho)^2}{m}\big)$. Recently, Grønlund et al. (2020) derived the following more favorable relative deviation margin bound for linear hypothesis sets:
$$R(h)\le R_S^\rho(h)+2\sqrt{R_S^\rho(h)\,\beta'_m}+\beta'_m,\tag{3}$$
where $\beta'_m = O\big(\frac{(R/\rho)^2}{m}\big)$. We can directly apply our relative deviation margin bounds to recover this result up to logarithmic factors. However, our guarantees have the additional benefit of being expressed in terms of Rademacher complexity and thus of being distribution-dependent, unlike the bound of Grønlund et al. (2020). Furthermore, while their proof technique crucially depends on the fact that the underlying hypothesis set is linear, ours is comparatively simpler and very general: it applies to arbitrary hypothesis sets.

**Feed-forward neural networks of depth d:** For a matrix W, let $\|W\|_{p,q}$ denote the matrix (p,q) norm and $\|W\|_2$ the spectral norm. Let $H_0 = \{x \colon \|x\|_2\le 1,\ x\in\mathbb{R}^n\}$ and $H_i = \{\sigma(Wh) \colon h\in H_{i-1},\ \|W\|_2\le B,\ \|W^\top\|_{2,1}\le B_{2,1}\|W\|_2\}$, where σ is an L-Lipschitz activation function. The Rademacher complexity bounds of Corollary 7 can be used to provide generalization bounds for neural networks. By Bartlett et al. (2017), the following upper bound holds for $H_d$: $\mathfrak R^{\max}_m(H) = O\big(\frac{d^{3/2}BB_{2,1}(BL)^d}{\rho\sqrt m}\big)$. Plugging this upper bound into the bound of Corollary 7 leads to the following:
$$R(h)\le R_S^\rho(h)+2\sqrt{R_S^\rho(h)\,\beta_m}+\beta_m,\tag{4}$$
where $\beta_m = O\big(\frac{d^3B^2B_{2,1}^2(BL)^{2d}}{\rho^2 m}\big)$. In comparison, the best existing neural network bound by Bartlett et al. (2017, Theorem 1.1) is
$$R(h)\le R_S^\rho(h)+c\,\bar\beta_m,\tag{5}$$
where c is a universal constant and $\bar\beta_m$ is the corresponding empirical Rademacher complexity term. The margin bound (4) has the benefit of a more favorable dependency on the empirical margin loss than (5), which can be significant when that empirical term is small. On the other hand, the empirical Rademacher complexity of (5) is more favorable than its counterpart in (4). A similar analysis can be used to derive relative margin bounds for ensembles of predictors or families of neural networks (see Appendix F.2), as well as many other function classes.
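The comparison made for linear hypotheses earlier in this subsection can be seen numerically. The sketch below evaluates, for synthetic values of the empirical margin loss, a standard-margin-style bound $R_S^\rho + c\sqrt{\beta'_m}$ against the relative-deviation form $R_S^\rho + 2\sqrt{R_S^\rho\,\beta'_m} + \beta'_m$ of Equation (3), with $\beta'_m = (R/\rho)^2/m$. The constant c = 1, the parameter values, and the grid of empirical margin losses are illustrative assumptions; the output only shows the qualitative effect of the $\sqrt{R_S^\rho}$ factor when the empirical margin loss is small.

```python
import numpy as np

def beta_prime(R, rho, m):
    """Complexity term beta'_m = (R / rho)**2 / m for norm-bounded linear hypotheses."""
    return (R / rho) ** 2 / m

def standard_margin_bound(emp_margin_loss, bp, c=1.0):
    """Standard-margin-style bound: R_S^rho + c * sqrt(beta'_m)."""
    return emp_margin_loss + c * np.sqrt(bp)

def relative_deviation_bound(emp_margin_loss, bp):
    """Relative-deviation form of Eq. (3): R_S^rho + 2*sqrt(R_S^rho * beta'_m) + beta'_m."""
    return emp_margin_loss + 2.0 * np.sqrt(emp_margin_loss * bp) + bp

if __name__ == "__main__":
    bp = beta_prime(R=1.0, rho=0.1, m=100000)   # beta'_m = 0.001 for these toy values
    print(f"beta'_m = {bp:.4f}")
    for loss in (0.0, 0.001, 0.01, 0.1):
        std = standard_margin_bound(loss, bp)
        rel = relative_deviation_bound(loss, bp)
        print(f"R_S^rho = {loss:5.3f}  standard ~ {std:6.4f}  relative-deviation ~ {rel:6.4f}")
```

As expected, the relative-deviation form is markedly tighter when the empirical margin loss is near zero and matches the standard form when it is large.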
## 6. Conclusion

Margin bounds are the most appropriate tools for the analysis of generalization in classification problems since they are more optimistic and typically not dimension-dependent. They have been used successfully to analyze the generalization properties of linear classifiers with Gaussian kernels, of AdaBoost, and, more recently, of neural networks. The finer margin guarantees we presented provide a more powerful tool for such analyses. Our relative margin bounds can further be used to derive guarantees for a variety of hypothesis sets and in a variety of applications. In particular, as illustrated in Appendix F.2, these bounds can help derive more favorable margin-based learning bounds for different families of neural networks, which has been the topic of several recent research publications. They may also serve as a useful tool in the analysis of scenarios such as active learning and the design of new algorithms.

## References

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47:207–217, 1993.

H. Bao, C. Scott, and M. Sugiyama. Calibrated surrogate losses for adversarially robust classification. In Conference on Learning Theory, pages 408–451. PMLR, 2020.

P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proceedings of NIPS, pages 6240–6249, 2017.

C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20(3), 1995.

C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.

C. Cortes, M. Mohri, and U. Syed. Deep boosting. In Proceedings of ICML, pages 1179–1187, 2014.

C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of ICML, pages 874–883, 2017.

C. Cortes, S. Greenberg, and M. Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. Ann. Math. Artif. Intell., 85(1):45–70, 2019.

R. M. Dudley. A course on empirical processes. Lecture Notes in Mathematics, 1097:2–142, 1984.

R. M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 14(4):1306–1326, 1987.

D. J. Foster, S. Greenberg, S. Kale, H. Luo, M. Mohri, and K. Sridharan. Hypothesis set stability and generalization. In Proceedings of NeurIPS, pages 6729–6739, 2019.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

W. Gao and Z.-H. Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

S. Greenberg and M. Mohri. Tight lower bound on the probability of a binomial exceeding its expectation. Statistics and Probability Letters, 86:91–98, 2013.
A. Grønlund, L. Kamma, and K. G. Larsen. Near-tight margin-based generalization bounds for support vector machines. In International Conference on Machine Learning, pages 3779–3788. PMLR, 2020.

D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78–150, 1992.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of NIPS, pages 793–800, 2008.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Proceedings of NIPS, pages 2501–2509, 2014.

P. M. Long and H. Sedghi. Generalization bounds for deep convolutional neural networks. In Proceedings of ICLR, 2020.

D. McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pages 203–215. Springer, 2003.

R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839–860, 2003.

B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of COLT, pages 1376–1401, 2015.

D. Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.

D. Pollard. Asymptotics via empirical processes. Statistical Science, 4(4):341–366, 1989.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of ICML, pages 322–330, 1997.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Information Theory, 44(5):1926–1940, 1998.

N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Proceedings of NIPS, pages 2199–2207, 2010.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Proceedings of NIPS, 2003.

R. van Handel. Probability in High Dimension. APC 550 Lecture Notes, Princeton University, 2016.

V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

V. N. Vapnik. Estimation of Dependences Based on Empirical Data, second edition. Springer, Berlin, 2006.

T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.