H-Consistency Bounds for Surrogate Loss Minimizers

Pranjal Awasthi (1), Anqi Mao (2), Mehryar Mohri (1,2), Yutao Zhong (2)

(1) Google Research, New York, NY. (2) Courant Institute of Mathematical Sciences, New York, NY. Correspondence to: Anqi Mao, Yutao Zhong.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We present a detailed study of estimation errors in terms of surrogate loss estimation errors. We refer to such guarantees as H-consistency bounds, since they account for the hypothesis set H adopted. These guarantees are significantly stronger than H-calibration or H-consistency. They are also more informative than similar excess error bounds derived in the literature, when H is the family of all measurable functions. We prove general theorems providing such guarantees, for both the distribution-dependent and distribution-independent settings. We show that our bounds are tight, modulo a convexity assumption. We also show that previous excess error bounds can be recovered as special cases of our general results. We then present a series of explicit bounds in the case of the zero-one loss, with multiple choices of the surrogate loss and for both the family of linear functions and neural networks with one hidden layer. We further prove more favorable distribution-dependent guarantees in that case. We also present a series of explicit bounds in the case of the adversarial loss, with surrogate losses based on the supremum of the ρ-margin, hinge or sigmoid loss and for the same two general hypothesis sets. Here too, we prove several enhancements of these guarantees under natural distributional assumptions. Finally, we report the results of simulations illustrating our bounds and their tightness.

1. Introduction

Most learning algorithms rely on optimizing a surrogate loss function distinct from the target loss function tailored to the task considered. This is typically because the target loss function is computationally hard to optimize or because it does not admit favorable properties, such as differentiability or smoothness, crucial to the convergence of optimization algorithms. But, what guarantees can we count on for the target loss estimation error, when minimizing a surrogate loss estimation error? A desirable property of a surrogate loss function, often referred to in that context, is Bayes-consistency. It requires that, asymptotically, nearly optimal minimizers of the surrogate excess error also nearly optimally minimize the target excess error (Steinwart, 2007). This property holds for a broad family of convex surrogate losses of the standard binary and multi-class classification losses (Zhang, 2004a; Bartlett et al., 2006; Tewari & Bartlett, 2007; Steinwart, 2007). But Bayes-consistency is not relevant when learning with a hypothesis set H distinct from the family of all measurable functions. Instead, the hypothesis set-dependent notion of H-consistency should be adopted, as argued by Long & Servedio (2013) (see also (Kuznetsov et al., 2014) and (Zhang & Agarwal, 2020)). More recently, Awasthi et al. (2021a) further studied H-consistency guarantees for the adversarial loss (Goodfellow et al., 2014; Madry et al., 2017; Tsipras et al., 2018; Carlini & Wagner, 2017). Nevertheless, consistency and H-consistency are both asymptotic properties and thus do not provide any guarantee for approximate minimizers learned from finite samples.
Instead, we will consider upper bounds on the target estimation error expressed in terms of the surrogate estimation error, which we refer to as H-consistency bounds, since they account for the hypothesis set H adopted. These guarantees are significantly stronger than H-calibration or H-consistency (Section 6), and than the margin-based properties of convex surrogate losses for linear predictors studied by Ben-David et al. (2012) and Long & Servedio (2011). They are also more informative than similar excess error bounds derived in the literature, which correspond to the special case where H is the family of all measurable functions (Zhang, 2004a; Bartlett et al., 2006) (see also (Mohri et al., 2018, Section 4.7)). We prove general theorems providing such guarantees, which could be used in both distribution-dependent and distribution-independent settings (Section 4). We show that our bounds are tight, modulo a convexity assumption (Sections 5.2 and 6.1). We also show that previous excess error bounds can be recovered as special cases of our general results (Section 5.1). We then present a series of explicit bounds in the case of the 0/1 loss (Section 5), with multiple choices of the surrogate loss and for both the family of linear functions (Section 5.3) and that of neural networks with one hidden layer (Section 5.4). We further prove more favorable distribution-dependent guarantees in that case (Section 5.5).

We also present a detailed analysis of the adversarial loss (Section 6). We show that there can be no non-trivial adversarial H-consistency bound for supremum-based convex loss functions and the supremum-based sigmoid loss function, under mild assumptions that hold for most hypothesis sets used in practice (Section 6.2). These results imply that the loss functions commonly used in practice for optimizing the adversarial loss cannot benefit from any useful H-consistency bound guarantee! These are novel results that go beyond the negative ones given for convex surrogates by Awasthi et al. (2021a). We present new H-consistency bounds for the adversarial loss with surrogate losses based on the supremum of the ρ-margin loss, for linear hypothesis sets (Section 6.3) and the family of neural networks with one hidden layer (Section 6.4). Here too, we prove several enhancements of these guarantees under some natural distributional assumptions (Section 6.5).

Our results help compare different surrogate loss functions of the zero-one loss or adversarial loss, given the specific hypothesis set used, based on the functional form of their H-consistency bounds. These results, combined with approximation error properties of surrogate losses, can help select the most suitable surrogate loss in practice. In addition to several general theorems, our study required a careful inspection of the properties of various surrogate loss functions and hypothesis sets. Our proofs and techniques could be adopted for the analysis of many other surrogate loss functions and hypothesis sets. In Section 7, we report the results of simulations illustrating our bounds and their tightness. We give a detailed discussion of related work in Appendix A. We start with some preliminary definitions and notation.

2. Preliminaries

Let X denote the input space and Y = {−1,+1} the binary label space. We will denote by D a distribution over X × Y, by P a set of such distributions, and by H a hypothesis set of functions mapping from X to R.
The generalization error and minimal generalization error for a loss function ℓ(h,x,y) are defined as R_ℓ(h) = E_{(x,y)∼D}[ℓ(h,x,y)] and R*_{ℓ,H} = inf_{h∈H} R_ℓ(h). Let H_all denote the hypothesis set of all measurable functions. The excess error of a hypothesis h is defined as the difference R_ℓ(h) − R*_{ℓ,H_all}, which can be decomposed into the sum of two terms, the estimation error and the approximation error:

R_ℓ(h) − R*_{ℓ,H_all} = (R_ℓ(h) − R*_{ℓ,H}) + (R*_{ℓ,H} − R*_{ℓ,H_all}).   (1)

Given two loss functions ℓ1 and ℓ2, a fundamental question is whether ℓ1 is consistent with respect to ℓ2 for a hypothesis set H and a set of distributions P (Bartlett et al., 2006; Steinwart, 2007; Long & Servedio, 2013; Bao et al., 2021; Awasthi et al., 2021a).

Definition 1 ((P,H)-consistency). We say that ℓ1 is (P,H)-consistent with respect to ℓ2 if, for all distributions D ∈ P and sequences {h_n}_{n∈N} ⊂ H, we have

lim_{n→+∞} [R_{ℓ1}(h_n) − R*_{ℓ1,H}] = 0  ⟹  lim_{n→+∞} [R_{ℓ2}(h_n) − R*_{ℓ2,H}] = 0.   (2)

We will denote by Φ a margin-based loss if a loss function ℓ can be represented as ℓ(h,x,y) = Φ(yh(x)), and by Φ̃ = sup_{x': ‖x−x'‖_p ≤ γ} Φ(yh(x')), p ∈ [1,+∞], the supremum-based counterpart. In standard binary classification, ℓ2 is the 0/1 loss ℓ_{0−1} = 1_{sign(h(x)) ≠ y}, where sign(α) = 1_{α≥0} − 1_{α<0}, and ℓ1 is the margin-based loss for some function Φ: R → R_+, typically convex. In adversarial binary classification, ℓ2 is the adversarial 0/1 loss ℓ_γ = sup_{x': ‖x−x'‖_p ≤ γ} 1_{yh(x') ≤ 0}, for some γ ∈ (0,1), and ℓ1 is the supremum-based margin loss Φ̃. Let B^d_p(r) denote the d-dimensional ℓ_p-ball with radius r: B^d_p(r) = {z ∈ R^d : ‖z‖_p ≤ r}. Without loss of generality, we consider X = B^d_p(1). Let p, q ∈ [1,+∞] be conjugate numbers, that is, 1/p + 1/q = 1. We will specifically study the family of linear hypotheses H_lin = {x ↦ w·x + b : ‖w‖_q ≤ W, |b| ≤ B} and that of one-hidden-layer ReLU networks H_NN = {x ↦ Σ_{j=1}^n u_j (w_j·x + b_j)_+ : ‖u‖_1 ≤ Λ, ‖w_j‖_q ≤ W, |b_j| ≤ B}, where (·)_+ = max(·,0). Finally, for any ϵ > 0, we will denote by [t]_ϵ the ϵ-truncation of t ∈ R defined by t·1_{t>ϵ}.

3. H-consistency bound definitions

(P,H)-consistency is an asymptotic relation between two loss functions. However, in many applications, we are interested in a more quantitative relation. This motivates the study of H-consistency bounds.

Definition 2 (H-consistency bound). If, for some non-decreasing function f: R_+ → R_+, a bound of the following form holds for all h ∈ H and D ∈ P:

R_{ℓ2}(h) − R*_{ℓ2,H} ≤ f(R_{ℓ1}(h) − R*_{ℓ1,H}),   (3)

then we call it an H-consistency bound. Furthermore, if P consists of all distributions over X × Y, we say that the bound is distribution-independent.

When H = H_all and P is the set of all distributions, a bound of the form (3) is also called a consistency excess error bound. Note that when f(0) = 0 and f is continuous at 0, the H-consistency bound (3) implies H-consistency (2). Thus, H-consistency bounds provide stronger quantitative results than consistency and calibration. Furthermore, there is a fundamental reason to study such bounds from the statistical learning point of view: they can be turned into more favorable generalization bounds for the target loss ℓ2 than the excess error bound. For example, when P is the set of all distributions, by (1), relation (3) implies that, for all h ∈ H, the following inequality holds:

R_{ℓ2}(h) − R*_{ℓ2,H_all} ≤ f(R_{ℓ1}(h) − R*_{ℓ1,H}) + R*_{ℓ2,H} − R*_{ℓ2,H_all}.   (4)

Similarly, the excess error bound can be written as follows:

R_{ℓ2}(h) − R*_{ℓ2,H_all} ≤ f(R_{ℓ1}(h) − R*_{ℓ1,H} + R*_{ℓ1,H} − R*_{ℓ1,H_all}).   (5)

If we further bound the estimation error [R_{ℓ1}(h) − R*_{ℓ1,H}] by the empirical error plus a complexity term, (4) and (5) both turn into generalization bounds; a small numerical sketch of the quantities appearing in (1) and (3) is given below.
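To make these quantities concrete, here is a minimal numerical sketch on a toy problem. The discrete input grid, the conditional probability η(x), and the small bounded one-dimensional linear class are illustrative assumptions of ours, not part of the paper's setup; the sketch only evaluates the decomposition (1) and the two estimation errors that an H-consistency bound (3) relates.

```python
import numpy as np

# Toy problem: uniform marginal on a grid of X = [-1, 1], known eta(x).
X = np.linspace(-1.0, 1.0, 401)
eta = np.clip(0.5 + 0.6 * X, 0.0, 1.0)          # eta(x) = D(Y = +1 | X = x)

def risk(cond_risk, h_vals):
    # R_l(h) = E_x[C_l(h, x)] under the uniform marginal on the grid.
    return np.mean(cond_risk(h_vals))

def c_zero_one(h_vals):
    # C_{l_{0-1}}(h, x) = eta(x) if sign(h(x)) = -1, else 1 - eta(x).
    return np.where(h_vals >= 0, 1.0 - eta, eta)

def c_hinge(h_vals):
    # Conditional risk of the hinge surrogate Phi(t) = max(0, 1 - t).
    return eta * np.maximum(0.0, 1.0 - h_vals) + (1.0 - eta) * np.maximum(0.0, 1.0 + h_vals)

# Hypothesis set H: x -> w x + b with |w| <= 1, |b| <= 0.2 (grid search for the infimum).
ws = np.linspace(-1.0, 1.0, 81)
bs = np.linspace(-0.2, 0.2, 41)
H_vals = (ws[:, None, None] * X[None, None, :] + bs[None, :, None]).reshape(-1, X.size)

best_01_H = min(risk(c_zero_one, h) for h in H_vals)
best_hinge_H = min(risk(c_hinge, h) for h in H_vals)

# With H_all, the minimal risks follow from minimizing the conditional risk pointwise in x.
best_01_all = np.mean(np.minimum(eta, 1.0 - eta))
ts = np.linspace(-2.0, 2.0, 401)
best_hinge_all = np.mean(np.min(eta[None, :] * np.maximum(0.0, 1.0 - ts[:, None])
                                + (1.0 - eta[None, :]) * np.maximum(0.0, 1.0 + ts[:, None]), axis=0))

h = 0.5 * X + 0.1                                # one particular hypothesis in H
excess = risk(c_zero_one, h) - best_01_all
estimation = risk(c_zero_one, h) - best_01_H
approximation = best_01_H - best_01_all
print(f"excess {excess:.4f} = estimation {estimation:.4f} + approximation {approximation:.4f}")
print(f"surrogate (hinge) estimation error: {risk(c_hinge, h) - best_hinge_H:.4f}")
```

With these quantities in hand, an H-consistency bound of the form (3) is precisely a functional relation between the two printed estimation errors.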
However, the generalization bound obtained by (4) is linearly dependent on the approximation error of target loss ℓ2, while the one obtained by (5) depends on the approximation error of the surrogate loss ℓ1 and can potentially be worse than linear dependence. Moreover, (4) can be easily used to compare different surrogates by directly comparing the corresponding mapping f. However, only comparing the mapping f for different surrogates in (5) is not sufficient since the approximation errors of surrogates may differ as well. Minimizability gap. We will adopt the standard notation for the conditional distribution of Y given X = x: η(x) = D(Y = 1 X = x) and will also use the shorthand η(x) = η(x) 1 2. It is useful to write the generalization error as Rℓ(h) = EX[Cℓ(h,x)], where Cℓ(h,x) is the conditional ℓ-risk defined by Cℓ(h,x) = η(x)ℓ(h,x,+1) + (1 η(x))ℓ(h,x, 1). The minimal conditional ℓ-risk is denoted by C ℓ,H(x) = infh H Cℓ(h,x). We also use the following shorthand for the gap Cℓ,H(h,x) = Cℓ(h,x) C ℓ,H(x). We call Cℓ,H(h,x) ϵ = Cℓ,H(h,x)1 Cℓ,H(h,x)>ϵ the conditional ϵ-regret for ℓ. To simplify the notation, we also define for any t [0,1], Cℓ(h,x,t) = tℓ(h,x,+1) + (1 t)ℓ(h,x, 1) and Cℓ,H(h,x,t) = Cℓ(h,x,t) infh H Cℓ(h,x,t). Thus, Cℓ,H(h,x,η(x)) = Cℓ,H(h,x). A key quantity that appears in our bounds is the (ℓ,H)- minimizability gap Mℓ,H, which is the difference of the bestin class error and the expectation of the minimal conditional ℓ-risk: Mℓ,H = R ℓ,H EX[C ℓ,H(x)]. This is an inherent property of the hypothesis set H and distribution D that we cannot hope to estimate or minimize. As an example, the minimizability gap for the 0/1 loss and adversarial 0/1 loss with Hall can be expressed as follows: Mℓ0 1,Hall = R ℓ0 1,Hall EX[min{η(x),1 η(x)}] = 0, Mℓγ,Hall = R ℓγ,Hall EX[min{η(x),1 η(x)}]. Steinwart (2007, Lemma 2.5) shows that the minimizability gap vanishes when the loss ℓis minimizable. Awasthi et al. (2021a) point out that the minimizability condition does not hold for adversarial loss functions, and therefore that, in general, Mℓγ,Hall is strictly positive, thereby presenting additional challenges for adversarial robust classification. Thus, the minimizability gap is critical in the study of adversarial surrogate loss functions. The minimizability gaps for some common loss functions and hypothesis sets are given in Table 1 in Section 5.2 for completeness. 4. General theorems We first introduce two main theorems that provide a general H-consistency bound between any target loss and surrogate loss. These bounds are H-dependent, taking into consideration the specific hypothesis set used by a learning algorithm. To the best of our knowledge, no such guarantee has appeared in the past. For both theoretical and practical computational reasons, learning algorithms typically seek a good hypothesis within a restricted subset of Hall. Thus, in general, H-dependent bounds can provide more relevant guarantees than excess error bounds. Our proposed bounds are also more general in the sense that Hall can be used as a special case. Theorems 1 and 2 are counterparts of each other, while the latter may provide a more explicit form of bounds as in (3). Theorem 1 (Distribution-dependent Ψ-bound). Assume that there exists a convex function Ψ R+ R with Ψ(0) 0 and ϵ 0 such that the following holds for all h H and x X: Ψ( Cℓ2,H(h,x) ϵ) Cℓ1,H(h,x). (6) Then, the following inequality holds for any h H: Ψ(Rℓ2(h) R ℓ2,H + Mℓ2,H) Rℓ1(h) R ℓ1,H + Mℓ1,H + max{Ψ(0),Ψ(ϵ)}. 
(7) Theorem 2 (Distribution-dependent Γ-bound). Assume that there exists a concave function Γ R+ R and ϵ 0 such that the following holds for all h H and x X: Cℓ2,H(h,x) ϵ Γ( Cℓ1,H(h,x)). (8) Then, the following inequality holds for any h H: Rℓ2(h) R ℓ2,H Γ(Rℓ1(h) R ℓ1,H+Mℓ1,H) Mℓ2,H+ϵ.(9) H-Consistency Bounds for Surrogate Loss Minimizers The proofs of Theorems 1 and 2 are included in Appendix D, where we make use of the convexity of Ψ and concavity of Γ. Below, we will mainly focus on the case where Ψ(0) = 0 and ϵ = 0. Note that if ℓ2 is upper bounded by ℓ1 and R ℓ1,H Mℓ1,H = R ℓ2,H Mℓ2,H, then, the following inequality automatically holds for any h H: Rℓ2(h) R ℓ2,H + Mℓ2,H Rℓ1(h) R ℓ1,H + Mℓ1,H. This is a special case of Theorems 1 and 2. Indeed, since R ℓ1,H Mℓ1,H = R ℓ2,H Mℓ2,H, we have C ℓ2,H(x) C ℓ1,H(x) and thus Cℓ2,H(h,x) Cℓ1,H(h,x). Therefore, Φ and Γ can be the identity function. We refer to such cases as trivial cases . They occur when Mℓ1,H and Mℓ2,H respectively coincide with the corresponding approximation errors and R ℓ1,Hall = R ℓ2,Hall. We will later see such cases for specific loss functions and hypothesis sets (See (38) in Appendix K.1.6 and (56) in Appendix L.1.1). Let us point out, however, that the corresponding H-consistency bounds are still valid and worth studying since they can be shown to be the tightest (Theorems 4 and 6). Theorem 1 is distribution-dependent, in the sense that, for a fixed distribution, if we find a Ψ that satisfies condition (6), then the bound (7) only gives guarantee for that same distribution. Since the distribution D of interest is typically unknown, to obtain guarantees for D, if the only information given is that D belongs to a set of distributions P, we need to find a Ψ that satisfies condition (6) for all the distributions in P. The choice of Ψ is critical, since it determines the form of the bound obtained. We say that Ψ is optimal if any function that makes the bound (7) hold for all distributions in P is everywhere no larger than Ψ. The optimal Ψ leads to the tightest Hconsistency bound (7) uniform over P. Specifically, when P consists of all distributions, we say that the bound is distribution-independent. The above also applies to Theorem 2, except that Γ is optimal if any function that makes the bound (9) hold for all distributions in P is everywhere no less than Γ. When ℓ2 is the 0/1 loss or the adversarial 0/1 loss, the conditional ϵ-regret that appears in condition (6) has explicit forms for common hypothesis sets as characterized later in Lemma 1 and 2, establishing the basis for introducing non-adversarial and adversarial H-estimation error transformation in Section 5.2 and 6.1. We will see later in these sections that the transformations introduced are often the optimal Ψ we are seeking for, which respectively leads to tight non-adversarial and adversarial distributionindependent guarantees. In Section 5 and 6, we also apply our general theorems and tools to loss functions and hypothesis sets widely used in practice. Each case requires a careful analysis that we present in detail. 5. Guarantees for the zero-one loss ℓ2 = ℓ0 1 In this section, we discuss guarantees in the non-adversarial scenario where ℓ2 is the zero-one loss, ℓ0 1. The lemma stated next characterizes the minimal conditional ℓ0 1-risk and the conditional ϵ-regret, which will be helpful for introducing the general tools in Section 5.2. The proof is given in Appendix E. For convenience, we will adopt the following notation: H(x) = {h H sign(h(x)) η(x) 0}. 
Lemma 1. Assume that H satisfies the following condition for any x X: {sign(h(x)) h H} = { 1,+1}. Then, the minimal conditional ℓ0 1-risk is C ℓ0 1,H(x) = C ℓ0 1,Hall(x) = min{η(x),1 η(x)}. The conditional ϵ-regret for ℓ0 1 can be characterized as Cℓ0 1,H(h,x) ϵ = 2 η(x) ϵ1h H(x) . 5.1. Hypothesis set of all measurable functions Before introducing our general tools, we will consider the case where H = Hall and will show that previous excess error bounds can be recovered as special cases of our results. As shown in (Steinwart, 2007), both Mℓ0 1,Hall and MΦ,Hall vanish. Thus by Lemma 1, we obtain the following corollary of Theorem 1 by taking ϵ = 0. Corollary 1. Assume that there exists a convex function Ψ R+ R with Ψ(0) = 0 such that for any x X, Ψ(2 η(x) ) infh Hall(x) CΦ,Hall(h,x). Then, for any hypothesis h Hall, the following inequality holds: Ψ(Rℓ0 1(h) R ℓ0 1,Hall) RΦ(h) R Φ,Hall. Furthermore, Corollary 2 follows from Corollary 1 by taking the convex function Ψ(t) = (t/(2c))s. Corollary 2. Assume there exist s 1 and c > 0 such that for any x X, η(x) c infh Hall(x)( CΦ,Hall(h,x)) 1 s . Then, for any hypothesis h Hall, Rℓ0 1(h) R ℓ0 1,Hall 2c (RΦ(h) R Φ,Hall) The excess error bound results in the literature are all covered by the above corollaries. As shown in Appendix F, Theorem 4.7 in (Mohri et al., 2018) is a special case of Corollary 2 and Theorem 1.1 in (Bartlett et al., 2006) is a special case of Corollary 1. 5.2. General hypothesis sets H In this section, we provide general tools to study Hconsistency bounds when the target loss is the 0/1 loss. We will then apply them to study specific hypothesis sets and surrogates in Section 5.3 and 5.4. Lemma 1 characterizes H-Consistency Bounds for Surrogate Loss Minimizers 2 1 0 1 2 t Hinge Logistic Exponential Quadratic Sigmoid 0.0 0.2 0.4 0.6 0.8 1.0 t Hinge Logistic Exponential Quadratic Sigmoid Figure 1: Left: surrogates. Right: Hlin-est. error trans. inv. the conditional ϵ-regret for ℓ0 1 with common hypothesis sets. Thus, Theorems 1 and 2 can be instantiated as Theorems 8 and 9 in these cases (see Appendix C). They are powerful distribution-dependent bounds and, as discussed in Section 4, the bounds become distribution-independent if the corresponding conditions can be verified for all the distributions with some Ψ, which is equivalent to verifying the condition in the following theorem. Theorem 3 (Distribution-independent Ψ-bound). Assume that H satisfies the condition of Lemma 1. Assume that there exists a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that for any t [1/2,1], Ψ( 2t 1 ϵ) inf x X,h H h(x)<0 CΦ,H(h,x,t). Then, for any hypothesis h H and any distribution, Ψ(Rℓ0 1(h) R ℓ0 1,H + Mℓ0 1,H) RΦ(h) R Φ,H + MΦ,H + max{0,Ψ(ϵ)}. (10) The counterpart of Theorem 3 is Theorem 12 (distributionindependent Γ-bound), deferred to Appendix C due to space limitations. The proofs for both theorems are included in Appendix G. Theorem 3 provides the general tool to derive distribution-independent H-consistency bounds. They are in fact tight if we choose Ψ to be the H-estimation error transformation defined as follows. Definition 3 (H-estimation error transformation). The H-estimation error transformation of Φ is defined on t [0,1] by TΦ(t) = T(t)1t [ϵ,1] + (T(ϵ)/ϵ)t1t [0,ϵ), where T(t) = infx X,h H h(x)<0 CΦ,H(h,x, t+1 When ϵ = 0, TΦ(t) coincides with T(t). Observe that for any t [(1 + ϵ)/2,1], the following equality holds: TΦ(2t 1) = inf x X,h H h(x)<0 CΦ,H(h,x,t). 
Taking Ψ = TΦ satisfies the condition in Theorem 3 if TΦ is convex with TΦ(0) = 0. Moreover, as mentioned earlier, it actually leads to the tightest H-consistency bound (10) when ϵ = 0. Theorem 4 (Tightness). Suppose that H satisfies the condition of Lemma 1 and that ϵ = 0. If TΦ is convex with TΦ(0) = 0, then, for any t [0,1] and δ > 0, there exist a distribution D and a hypothesis h H such that Table 1: Loss functions and their minimizability gaps. Loss Functions Definitions Mℓ,Hlin Mℓ,HNN Hinge Φhinge(t) = max{0, 1 t} (25) (40) Logistic Φlog(t) = log2(1 + e t) (27) (42) Exponential Φexp(t) = e t (29) (44) Quadratic Φquad(t) = (1 t)21t 1 (31) (31) Sigmoid Φsig(t) = 1 tanh(kt), k > 0 (33) (48) ρ-Margin Φρ(t) = min{1, max{0, 1 t ρ}}, ρ > 0 (36) (40) Sup-ρ-Margin Φρ = supx x x p γ Φρ(yh(x )) (53) (58) Zero-One ℓ0 1 = 1sign(h(x)) y (24) (39) Adversarial Zero-One ℓγ = supx x x p γ 1yh(x ) 0 (52) (57) Table 2: Hlin-estimation error transformation and Hlinconsistency bounds with ϵ = 0. Surrogates TΦ(t), t [0, 1] Bound Hinge min{B, 1} t (26) 2 log2(t + 1) + 1 t 2 log2(1 t), t e B 1 e B+1, 1 t+1 2 log2(1 + e B) 1 t 2 log2(1 + e B), t > e B 1 e B+1. (28) Exponential 1 t2, t e2B 1 e2B+1, 1 t+1 2 e B, t > e2B 1 e2B+1. (30) Quadratic {t2, t B, 2B t B2, t > B. (32) Sigmoid tanh(k B) t (34) ρ-Margin min{B,ρ} Rℓ0 1(h) R ℓ0 1,H + Mℓ0 1,H = t and TΦ(t) RΦ(h) R Φ,H + MΦ,H TΦ(t) + δ. The proof is included in Appendix I. In other words, when ϵ = 0, if TΦ is convex with TΦ(0) = 0, it is optimal for the distribution-independent bound (10). Moreover, if TΦ is additionally invertible and non-increasing, T 1 Φ is the optimal function for the distribution-independent bound in Theorem 12 (Appendix C) and the two bounds are equivalent. In the following sections, we will see that all these assumptions hold for common loss functions with linear and neural network hypothesis sets. Next, we will apply Theorems 3 and 4 to the linear models (Section 5.3) and neural networks (Section 5.4). Each case requires a detailed analysis (See Appendix K.1 and K.2). The loss functions considered below and their minimizability gaps are defined in Table 1. In some cases, the minimizability gap coincides with the approximation error. For example, MΦsig,Hlin = R Φsig,Hlin EX[1 1 2η(x) tanh(k(W x p + B))] coincides with the (Φsig,Hlin)-approximation error R Φsig,Hlin EX[1 1 2η(x) ] for B = + ; MΦhinge,HNN = R Φhinge,HNN EX[1 2η(x) 1 min{ΛW x p + ΛB,1}] coincides with the (Φhinge,HNN)-approximation error R Φhinge,HNN EX[1 1 2η(x) ] for ΛB 1. The detailed derivation is included in Appendix K, L. 5.3. Linear hypotheses By applying Theorems 3 and 4, we can derive Hlinconsistency bounds for common loss functions defined in Table 1. Table 2 supplies the Hlin-estimation error transformation TΦ and the corresponding bounds for those H-Consistency Bounds for Surrogate Loss Minimizers Table 3: HNN-estimation error transformation and HNNconsistency bounds with ϵ = 0. Surrogates TΦ(t), t [0, 1] Bound Hinge min{ΛB, 1} t (41) 2 log2(t + 1) + 1 t 2 log2(1 t), t eΛB 1 eΛB+1, 1 t+1 2 log2(1 + e ΛB) 1 t 2 log2(1 + eΛB), t > eΛB 1 eΛB+1. (43) Exponential 1 t2, t e2ΛB 1 e2ΛB+1, 1 t+1 2 eΛB, t > e2ΛB 1 e2ΛB+1. (45) Quadratic {t2, t ΛB, 2ΛBt (ΛB)2, t > ΛB. (47) Sigmoid tanh(kΛB) t (49) ρ-Margin min{ΛB,ρ} loss functions. The inverse T 1 Φ is given in Table 5 of Appendix B. Surrogates Φ and their corresponding T 1 Φ (B = 0.8) are visualized in Figure 1. 
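As a complement to Table 2 and Figure 1, the following sketch evaluates a few of these transformations and their inverses numerically. The closed forms used (hinge, quadratic, sigmoid and ρ-margin) are the ones reported in Tables 2 and 5; the parameter values B = 0.8 (matching Figure 1), k = 1 and ρ = 1 are only for illustration.

```python
import numpy as np

# H_lin-estimation error transformations T_Phi from Table 2 and inverses from Table 5.
B, k, rho = 0.8, 1.0, 1.0   # illustrative parameter choices only

def T_hinge(t):
    return min(B, 1.0) * t

def T_quadratic(t):
    return t**2 if t <= B else 2.0 * B * t - B**2

def T_sigmoid(t):
    return np.tanh(k * B) * t

def T_rho_margin(t):
    return (min(B, rho) / rho) * t

def T_quadratic_inv(s):
    return np.sqrt(s) if s <= B**2 else s / (2.0 * B) + B / 2.0

# Bound (10) with Psi = T_Phi and eps = 0 reads
#   T_Phi(R_{l01}(h) - R* + M_{l01,H}) <= R_Phi(h) - R* + M_{Phi,H},
# so applying T_Phi^{-1} to the right-hand side gives the bounds listed in Table 2.
for t in [0.1, 0.5, 1.0]:
    print(t, T_hinge(t), T_quadratic(t), T_sigmoid(t), T_rho_margin(t))

# Sanity check: the quadratic transformation and its inverse compose to the identity.
for t in [0.2, 0.9]:
    assert abs(T_quadratic_inv(T_quadratic(t)) - t) < 1e-12
```

Reading off the inverse transformation at the surrogate estimation error (plus minimizability gap) yields the right-hand side of the corresponding bound in Table 2.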
Theorems 3 and 4 apply to all these cases since TΦ is convex, increasing, invertible and satisfies that TΦ(0) = 0. More precisely, taking Ψ = TΦ and ϵ = 0 in (10) and using the inverse function T 1 Φ directly give the tightest bound. As an example, for the sigmoid loss, T 1 Φsig(t) = t tanh(k B). Then the bound (10) becomes Rℓ0 1(h) R ℓ0 1,Hlin (RΦsig(h) R Φsig,Hlin + MΦsig,Hlin)/tanh(k B) Mℓ0 1,Hlin, which is (34) in Table 2. Furthermore, after plugging in the minimizability gaps concluded in Table 1, we will obtain the novel bound Rℓ0 1(h) R ℓ0 1,Hall (RΦsig(h) EX[1 1 2η(x) tanh(k(W x p + B))])/tanh(k B) ((35) in Appendix K.1.5). The bounds for other surrogates are similarly derived in Appendix K.1. For the logistic loss and exponential loss, to simplify the expression, the bounds are obtained by plugging in an upper bound of T 1 Φ . Let us emphasize that these H-consistency bounds are novel in the sense that they are all hypothesis set-dependent and, to our knowledge, no such guarantee has been presented before. More precisely, the bounds of Table 2 depend directly on the parameter B in the linear models and parameters of the loss function (e.g., k in sigmoid loss). Thus, for a fixed hypothesis h Hlin, we may give the tightest bound by choosing the best parameter B. As an example, Appendix K.1.5 shows that the bound (35) with B = + coincides with the excess error bound known for the sigmoid loss (Bartlett et al., 2006). However, for a fixed hypothesis h, by varying B (hypothesis set) and k (loss function), we may obtain a finer bound! Thus studying hypothesis set-dependent bounds can guide us to select the most suitable hypothesis set and loss function. Moreover, as shown by Theorem 4, all the bounds obtained by directly using T 1 Φ are tight and cannot be further improved. 5.4. One-hidden-layer Re LU neural networks In this section, we give H-consistency bounds for onehidden-layer Re LU neural networks HNN. Table 3 is the counterpart of Table 2 for HNN. Different from the bounds in the linear case, all the bounds in Table 3 not only depend on B, but also depend on Λ, which is a new parameter in HNN. This further illustrates that our bounds are hypothesis set-dependent and that, as with the linear case, adequately choosing the parameters Λ and B in HNN would give us better hypothesis set-dependent guarantees than standard excess error bounds. The inverse T 1 Φ is given in Table 6 of Appendix B. Our proofs and techniques could also be adopted for the analysis of multi-layer neural networks. 5.5. Guarantees under Massart s noise condition The distribution-independent H-consistency bound (10) cannot be improved, since they are tight as shown in Theorem 4. However, the bounds can be further improved in the distribution-dependent setting. Indeed, we will study how H-consistency bounds can be improved under low noise conditions, which impose the restrictions on the conditional distribution η(x). We consider Massart s noise condition (Massart & Nédélec, 2006) which is defined as follows. Definition 4 (Massart s noise). The distribution D over X Y satisfies Massart s noise condition if η(x) β for almost all x X, for some constant β (0,1/2]. When it is known that the distribution D satisfies Massart s noise condition with β, in contrast with the distributionindependent bounds, we can require the bounds (7) and (9) to hold uniformly only for such distributions. 
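As a quick illustration of Definition 4, the sketch below computes the largest admissible β for two toy conditional distributions; both η functions are illustrative assumptions of ours, not distributions used in the paper.

```python
import numpy as np

# Massart's noise condition (Definition 4): |eta(x) - 1/2| >= beta for almost all x,
# for some constant beta in (0, 1/2].
X = np.linspace(-1.0, 1.0, 1001)

def massart_beta(eta):
    # Largest beta for which |eta(x) - 1/2| >= beta on the whole grid.
    return float(np.min(np.abs(eta - 0.5)))

eta_flipped = np.where(X >= 0, 0.95, 0.05)       # labels flipped with probability 0.05
eta_logistic = 1.0 / (1.0 + np.exp(-4.0 * X))    # eta crosses 1/2 at x = 0

print(massart_beta(eta_flipped))    # 0.45 -> condition holds with beta = 0.45
print(massart_beta(eta_logistic))   # 0.0  -> fails for every beta in (0, 1/2]
```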
With Massart s noise condition, we introduce a modified Hestimation error transformation in Proposition 1 (Appendix M), which verifies condition (13) of Theorem 8 (the finer distribution dependent guarantee mentioned before, deferred to Appendix C) for all distributions under the noise condition. Then, using this transformation, we can obtain more favorable distribution-dependent bounds. As an example, we consider the quadratic loss Φquad, the logistic loss Φlog and the exponential loss Φexp with Hall. For all distributions and h Hall, as shown in (Zhang, 2004a; Bartlett et al., 2006; Mohri et al., 2018), the following holds: Rℓ0 1(h) R ℓ0 1,Hall 2(RΦ(h) R Φ,Hall) 1/2, when the surrogate loss Φ is Φlog or Φexp. If Φ = Φquad, then the constant multiplier 2 can be removed. For distributions that satisfy Massart s noise condition with β, as proven in Appendix M, for any h Hall such that RΦ(h) R Φ,Hall + T(2β), the consistency excess error bound is improved from the square-root dependency to a linear dependency: Rℓ0 1(h) R ℓ0 1,Hall 2β(RΦ(h) R Φ,Hall)/T(2β), (11) where T(t) equals to t2, t+1 2 log2(t+1)+ 1 t 2 log2(1 t) and 1 1 t2 for Φquad, Φlog and Φexp respectively. These linear dependent bounds are tight, as illustrated in Section 7. H-Consistency Bounds for Surrogate Loss Minimizers 6. Guarantees for the adversarial loss ℓ2 = ℓγ In this section, we discuss the adversarial scenario where ℓ2 is the adversarial 0/1 loss ℓγ. We consider symmetric hypothesis sets, which satisfy: h H if and only if h H. For convenience, we will adopt the following definitions: hγ(x) = inf x x x p γ h(x ) hγ(x) = sup x x x p γ h(x ). We also define Hγ(x) = {h H hγ(x) 0 hγ(x)}. The following characterization of the minimal conditional ℓγ-risk and conditional ϵ-regret is based on (Awasthi et al., 2021a, Lemma 27) and will be helpful in introducing the general tools in Section 6.1. The proof is similar and is included in Appendix E for completeness. Lemma 2. Assume that H is symmetric. Then, the minimal conditional ℓγ-risk is C ℓγ,H(x) = min{η(x),1 η(x)}1Hγ(x) H + 1Hγ(x)=H . The conditional ϵ-regret for ℓγ can be characterized as Cℓγ,H(h,x) ϵ = 2 ϵ h Hγ(x) H 2 η(x) ϵ hγ(x) < 0 2 η(x) ϵ hγ(x) > 0 0 otherwise 6.1. General hypothesis sets H As with the non-adversarial case, we begin by providing general theoretical tools to study H-consistency bounds when the target loss is the adversarial 0/1 loss. Lemma 2 characterizes the conditional ϵ-regret for ℓγ with symmetric hypothesis sets. Thus, Theorems 1 and 2 can be instantiated as Theorems 10 and 11 (See Appendix C) in these cases. These results are distribution-dependent and can serve as general tools. For example, we can use these tools to derive more favorable guarantees under noise conditions (Section 6.5). As in the previous section, we present their distribution-independent version in the following theorem. Theorem 5 (Adversarial distribution-independent Ψ-bound). Suppose that H is symmetric. Assume there exist a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that the following holds for any t [1/2,1] Ψ( t ϵ) inf x X,h Hγ(x) H C Φ,H(h,x,t), Ψ( 2t 1 ϵ) inf x X,h H hγ(x)<0 C Φ,H(h,x,t). Then, for any hypothesis h H and any distribution, Ψ(Rℓγ(h) R ℓγ,H + Mℓγ,H) R Φ(h) R Φ,H + M Φ,H + max{0,Ψ(ϵ)}. (12) The counterpart of Theorem 5 is Theorem 13 (adversarial distribution-independent Γ-bound), deferred to Appendix C due to space limitations. The proofs for both theorems are included in Appendix H. 
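Since the quantities h_γ and h̄_γ drive all of the results in this section, here is a small sketch of how they, together with the losses ℓ_γ and the supremum-based ρ-margin loss, can be evaluated for a linear hypothesis. The closed form w·x + b − γ‖w‖_q for the infimum over the ℓ_p ball is a standard consequence of Hölder's inequality, stated here as an assumption rather than quoted from this part of the paper, and all parameter values are illustrative.

```python
import numpy as np

# Worst-case values of a linear hypothesis h(x) = w.x + b over the l_p ball of radius gamma:
#   h_gamma(x)    = w.x + b - gamma * ||w||_q,
#   bar-h_gamma(x)= w.x + b + gamma * ||w||_q,  with 1/p + 1/q = 1.
gamma, p, rho = 0.1, 2.0, 1.0
q = p / (p - 1.0)                     # conjugate exponent (p = 2 -> q = 2)

def h_lower(w, b, x):
    return np.dot(w, x) + b - gamma * np.linalg.norm(w, ord=q)

def h_upper(w, b, x):
    return np.dot(w, x) + b + gamma * np.linalg.norm(w, ord=q)

def adversarial_zero_one(w, b, x, y):
    # l_gamma(h, x, y) = sup over the ball of 1_{y h(x') <= 0} = 1_{worst-case margin <= 0}.
    worst_margin = y * (np.dot(w, x) + b) - gamma * np.linalg.norm(w, ord=q)
    return float(worst_margin <= 0)

def sup_rho_margin(w, b, x, y):
    # Supremum-based rho-margin loss: Phi_rho applied to the worst-case margin,
    # since Phi_rho is non-increasing.
    worst_margin = y * (np.dot(w, x) + b) - gamma * np.linalg.norm(w, ord=q)
    return min(1.0, max(0.0, 1.0 - worst_margin / rho))

w, b = np.array([0.6, -0.8]), 0.05
x, y = np.array([0.3, 0.2]), +1
print(h_lower(w, b, x), h_upper(w, b, x))
print(adversarial_zero_one(w, b, x, y), sup_rho_margin(w, b, x, y))
```

The set H_γ(x) used in Lemma 2 then consists of the hypotheses whose sign is not constant over the ball, that is, those with h_γ(x) ≤ 0 ≤ h̄_γ(x).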
As with the non-adversarial scenario, the tightest distribution-independent H-consistency bounds obtained by Theorem 5 can be achieved by the optimal Ψ, which is the adversarial H-estimation error transformation defined as follows. Definition 5 (Adversarial H-estimation error transformation). The adversarial H-estimation error transformation of Φ is defined on t [0,1] by T Φ(t) = min{T1(t),T2(t)}, where T1(t) = T1(t)1t [1/2,1] + 2 T1(1/2)t1t [0,1/2), T2(t) = T2(t)1t [ϵ,1] + ( T2(ϵ)/ϵ)t1t [0,ϵ), with T1(t) = inf x X,h Hγ(x) H C Φ,H(h,x,t), T2(t) = inf x X,h H hγ(x)<0 C Φ,H(h,x, t + 1 It is clear that T Φ satisfies assumptions in Theorem 5. The next theorem shows that it gives the tightest H-consistency bound (12) under certain conditions. Theorem 6 (Adversarial tightness). Suppose that H is symmetric and that ϵ = 0. If T Φ = min{T1,T2} is convex with T Φ(0) = 0 and T2 T1, then, for any t [0,1] and δ > 0, there exist a distribution D and a hypothesis h H such that Rℓγ(h) R ℓγ,H+Mℓγ,H = t and T Φ(t) R Φ(h) R Φ,H + M Φ,H T Φ(t) + δ. The proof is included in Appendix I. In other words, when ϵ = 0, if T2 T1 and T Φ is convex with T Φ(0) = 0, T Φ is the optimal function for the distribution-independent bound (12). Moreover, if T Φ is additionally invertible and non-increasing, T 1 Φ is the optimal function for the distribution-independent bound in Theorem 13 (Appendix C) and the two bounds will be equivalent. We will see that all these assumptions hold for cases considered in Section 6.3 and 6.4. Next, we will apply Theorem 5 along with the tightness guarantee Theorem 6 to study specific hypothesis sets and adversarial surrogate loss functions in Section 6.2 for negative results and Section 6.3 and 6.4 for positive results. A careful analysis is presented in each case (See Appendix L). 6.2. Negative results for adversarial robustness Awasthi et al. (2021a) show that supremum-based convex loss functions of the type Φ = supx x x p γ Φ(yh(x )), where Φ is convex and non-increasing, are not H-calibrated with respect to ℓγ for H containing 0, that is regular for adversarial calibration, e.g., Hlin and HNN. Definition 6 (Regularity for adversarial calibration). [Definition 5 in (Awasthi et al., 2021a)] We say that a hypothesis set H is regular for adversarial calibration if there exists H-Consistency Bounds for Surrogate Loss Minimizers Table 4: Adversarial H-consistency bounds. They are completely new consistency bounds in the adversarial setting and can turn into more significant ϵ-consistency results. The minimizability gaps appearing in the bounds for the surrogates are concluded in Table 1. The detailed derivation is included in Appendix L, N. Surrogates Bound (Hlin) Bound (HNN) Distribution set Φρ (54) (59) All distributions Φhinge (67) (71) Massart s noise Φsig (69) (73) Massart s noise a distinguishing x in X, that is if there exist f,g H such that inf x x p γ f(x ) > 0 and sup x x p γ g(x ) < 0. Similarly, we show that there are no non-trivial adversarial H-consistency bounds with respect to ℓγ for supremumbased convex loss functions and supremum-based symmetric loss functions (see Definition 7 below) including sigmoid loss with such hypothesis sets. Definition 7 (Symmetric loss). We say that a margin-based loss Φ is symmetric if there exists a constant C 0 such that Φ(t) + Φ( t) = C for any t R, and denote it by Φsym. We also define its supremum-based counterpart as Φsym = supx x x p γ Φsym(yh(x )) and call Φsym the supremum-based symmetric loss. 
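As a quick numeric illustration of Definition 7, the sketch below checks which of a few margin-based losses from Table 1 are symmetric; the parameter values k = 1 and ρ = 1 are illustrative.

```python
import numpy as np

# Definition 7: a margin-based loss Phi is symmetric if Phi(t) + Phi(-t) = C for all t.
k, rho = 1.0, 1.0
losses = {
    "sigmoid":    lambda t: 1.0 - np.tanh(k * t),
    "rho-margin": lambda t: np.clip(1.0 - t / rho, 0.0, 1.0),
    "hinge":      lambda t: np.maximum(0.0, 1.0 - t),
}
ts = np.linspace(-3.0, 3.0, 7)
for name, phi in losses.items():
    sums = phi(ts) + phi(-ts)
    print(name, "symmetric" if np.allclose(sums, sums[0]) else "not symmetric")
# Expected: sigmoid -> symmetric (C = 2); rho-margin and hinge -> not symmetric.
```

Consistently with the discussion that follows, the sigmoid loss is symmetric, while the ρ-margin loss is neither convex nor symmetric, so the negative result of Theorem 7 below does not apply to it, in line with the positive results of Sections 6.3 and 6.4.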
For the sigmoid loss Φsig(t) = 1 tanh(kt), k > 0, we have Φsig(t) + Φsig( t) = 2, which implies that Φsig is symmetric. Note that Awasthi et al. (2021a) do not study the sigmoid loss, which is non-convex. Thus, our results below go beyond their results for convex adversarial surrogates. Theorem 7 (Negative results for robustness). Suppose that H contains 0 and is regular for adversarial calibration. Let ℓ1 be supremum-based convex loss or supremum-based symmetric loss and ℓ2 = ℓγ. Then, f(t) 1/2 for any t 0 are the only non-decreasing functions f such that (3) holds. The proof is given in Appendix J. In other words, the function f in bound (3) must be lower bounded by 1/2 for such adversarial surrogates. Theorem 7 implies that the loss functions commonly used in practice for optimizing the adversarial loss cannot benefit from any useful H-consistency bound guarantee. Instead, we show in Section 6.3 and 6.4 that the supremum-based ρ-margin loss Φρ = supx x x p γ Φρ(yh(x )) proposed by Awasthi et al. (2021a) admits favorable adversarial H-consistency bounds. These bounds would also imply significantly stronger results than the asymptotic H-consistency guarantee in (Awasthi et al., 2021a). 6.3. Linear hypotheses In this section, by applying Theorems 10 and 11, we derive the adversarial Hlin-consistency bound (54) in Table 4 for supremum-based ρ-margin loss. This is a completely new consistency bound in the adversarial setting. As with the non-adversarial case, the bound is dependent on the parameter B in linear hypothesis set and ρ in the loss function. This helps guide the choice of loss functions once the hypothesis set is fixed. More precisely, if B > 0 is known, we can always choose ρ < B such that the bound is the tightest. Moreover, the bound can turn into more significant ϵ-consistency results in adversarial setting than the H-consistency result in (Awasthi et al., 2021a). Corollary 3. Let D be a distribution over X Y such that M Φρ,Hlin ϵ for some ϵ 0. Then, the following holds: Rℓγ(h) R ℓγ,Hlin ρ(R Φρ(h) R Φρ,Hlin +ϵ)/min{B,ρ}. Awasthi et al. (2021a) show that Φρ is Hlin-consistent with respect to ℓγ when M Φρ,Hlin = 0. This result can be immediately implied by Corollary 3. Moreover, Corollary 3 provides guarantees for more general cases where M Φρ,Hlin can be nonzero. 6.4. One-hidden-layer Re LU neural networks For the one-hidden-layer Re LU neural networks HNN and Φρ, we have the HNN-consistency bound (59) in Table 4. Note infx X suph HNN hγ(x) does not have an explicit expression. However, (59) can be further relaxed to be (60) in Appendix L.2, which is identical to the bound in the linear case modulo the replacement of B by ΛB. As in the linear case, the bound is new and also implies stronger ϵ-consistency results as follows: Corollary 4. Let D be a distribution over X Y such that M Φρ,HNN ϵ for some ϵ 0. Then, Rℓγ(h) R ℓγ,HNN ρ(R Φρ(h) R Φρ,HNN+ϵ)/min{ΛB,ρ}. Besides the bounds for Φρ, Table 4 gives a series of results that are all new in the adversarial setting. Like the bounds in Table 2 and 3, they are all hypothesis set dependent and very useful. For example, the improved bounds for Φhinge and Φsig under noise conditions in the table can also turn into meaningful consistency results under Massart s noise condition, as shown in Section 6.5. 6.5. Guarantees under Massart s noise condition Section 6.2 shows that non-trivial distribution-independent bounds for supremum-based hinge loss and supremumbased sigmoid loss do not exist. 
However, under Massart's noise condition (Definition 4), we will show that there exist non-trivial adversarial H-consistency bounds for these two loss functions. Furthermore, we will see that the bounds exhibit a linear dependency, as those in Section 5.5.

[Figure 2: Left: tightness of bound (11) in Section 5.5. Right: tightness of bounds (54), (67) and (69) in Sections 6.3 and 6.5. Log-scale axes labeled "excess error" and "generalization error".]

As with the non-adversarial scenario, we introduce a modified adversarial H-estimation error transformation in Proposition 2 (Appendix N). Using this tool, we derive adversarial H-consistency bounds for Φ̃hinge and Φ̃sig under Massart's noise condition in Table 4. From the bounds (67), (69), (71), and (73), we can also obtain novel ϵ-consistency results for Φ̃hinge and Φ̃sig with linear models and neural networks under Massart's noise condition.

Corollary 5. Let H be Hlin or HNN. Let D be a distribution over X × Y which satisfies Massart's noise condition with β and such that M_Φ̃,H ≤ ϵ for some ϵ ≥ 0. Then,

R_ℓγ(h) − R*_ℓγ,H ≤ ((1 + 2β)/(4β)) (R_Φ̃(h) − R*_Φ̃,H + ϵ)/T(B),

where T(t) equals min{t,1} for Φ̃hinge and tanh(kt) for Φ̃sig, and B is replaced by ΛB for H = HNN.

In Section 7, we will further show that these linear-dependency bounds in the adversarial setting are tight, along with the non-adversarial bounds we discussed earlier in Section 5.5.

7. Simulations

Here, we present experiments on simulated data to illustrate our bounds and their tightness. We generate data points x ∈ R on [−1,+1]. All risks are approximated by their empirical counterparts computed over 10^7 i.i.d. samples.

Non-adversarial. To demonstrate the tightness of our non-adversarial bounds, we consider a scenario where the marginal distribution is symmetric about x = 0 with labels flipped. With probability 1/16, (x,y) = (1,−1); with probability 7/16, the label is +1 and the data follows the truncated normal distribution on [σ,1] with both mean and standard deviation σ. We consider Φquad, Φlog and Φexp defined in Table 1. The distribution considered satisfies Massart's noise condition with β = 1/2. Thus, our bound (11) in Section 5.5 becomes R_ℓ0−1(h) − R*_ℓ0−1,Hall ≤ R_Φ(h) − R*_Φ,Hall, for any h ∈ Hall such that R_Φ(h) ≤ R*_Φ,Hall + 1. All the minimal generalization errors vanish in this case. As shown in Figure 2, for h(x) = 5x, the bounds corresponding to Φquad, Φlog and Φexp are all tight as σ → 0.

Adversarial. To demonstrate the tightness of our adversarial bounds, the distribution is modified as follows: with probability 1/16, (x,y) = (1,−1); with probability 1/16, (x,y) = (−1,+1); with probability 7/8, the label is −1 and the data follows the truncated normal distribution on [−1, γ − σ] with mean γ − σ and standard deviation σ. We set γ = 0.1 and consider Φ̃ρ with ρ = 1, and Φ̃hinge and Φ̃sig with k = 1. The distribution considered satisfies Massart's noise condition with β = 1/2. Thus, our bounds (54), (67) and (69) in Table 4 become R_ℓγ(h) ≤ R_Φ̃(h), for any h ∈ Hlin. As shown in Figure 2, for h(x) = 5x, the bounds corresponding to Φ̃ρ, Φ̃hinge and Φ̃sig are all tight as σ → 0.

8. Conclusion

We presented an exhaustive study of H-consistency bounds, including a series of new guarantees for both the non-adversarial zero-one loss function and the adversarial zero-one loss function. Our hypothesis-dependent guarantees are significantly stronger than the consistency or calibration ones.
Our results include a series of theoretical and conceptual tools helpful for the analysis of other loss functions and other hypothesis sets, including multi-class classification or ranking losses. They can be further extended to the analysis of non-i.i.d. settings such as that of drifting distributions (Helmbold & Long, 1994; Long, 1999; Barve & Long, 1997; Bartlett et al., 2000; Mohri & Medina, 2012; Gama et al., 2014) or, more generally, time series prediction (Engle, 1982; Bollerslev, 1986; Brockwell & Davis, 1986; Box & Jenkins, 1990; Hamilton, 1994; Meir, 2000; Kuznetsov & Mohri, 2015; 2017; 2020). Our results can also be extended to many other loss functions, using our general proof techniques or a similar analysis. Acknowledgements Part of the work of A. Mao and Y. Zhong was done during their internship at Google Research. Attias, I., Kontorovich, A., and Mansour, Y. Improved generalization bounds for robust learning. ar Xiv preprint ar Xiv:1810.02180, 2018. Awasthi, P., Dutta, A., and Vijayaraghavan, A. On robustness to adversarial examples and polynomial optimization. In Advances in Neural Information Processing Systems, pp. 13737 13747, 2019. Awasthi, P., Frank, N., and Mohri, M. Adversarial learning guarantees for linear hypotheses and neural networks. In International Conference on Machine Learning, pp. 431 441, 2020. Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate H-Consistency Bounds for Surrogate Loss Minimizers losses. In Advances in Neural Information Processing Systems, pp. 9804 9815, 2021a. Awasthi, P., Frank, N., and Mohri, M. On the existence of the adversarial bayes classifier. In Advances in Neural Information Processing Systems, pp. 2978 2990, 2021b. Bao, H., Scott, C., and Sugiyama, M. Corrigendum to: Calibrated surrogate losses for adversarially robust classification. ar Xiv preprint ar Xiv:2005.13748, 2021. Bartlett, P. L., Ben-David, S., and Kulkarni, S. Learning changing concepts by exploiting the structure of change. Machine Learning, 41:153 174, 2000. Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Bartlett, P. L., Bubeck, S., and Cherapanamjeri, Y. Adversarial examples in multi-layer random relu networks. ar Xiv preprint ar Xiv:2106.12611, 2021. Barve, R. D. and Long, P. M. On the complexity of learning from drifting distributions. Information and Computation, 138(2):101 123, 1997. Ben-David, S., Loker, D., Srebro, N., and Sridharan, K. Minimizing the misclassification error rate using a surrogate convex loss. ar Xiv preprint ar Xiv:1206.6442, 2012. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J Econometrics, 1986. Box, G. E. P. and Jenkins, G. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990. Brockwell, P. J. and Davis, R. A. Time Series: Theory and Methods. Springer-Verlag, New York, 1986. Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. ar Xiv preprint ar Xiv:2105.12806, 2021. Bubeck, S., Lee, Y. T., Price, E., and Razenshteyn, I. Adversarial examples from cryptographic pseudo-random generators. ar Xiv preprint ar Xiv:1811.06418, 2018a. Bubeck, S., Price, E., and Razenshteyn, I. Adversarial examples from computational constraints. ar Xiv preprint ar Xiv:1805.10204, 2018b. Bubeck, S., Cherapanamjeri, Y., Gidel, G., and Combes, R. T. d. 
A single gradient step finds adversarial examples on random two-layers neural networks. ar Xiv preprint ar Xiv:2104.03863, 2021. Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39 57, 2017. Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of evasion adversaries. ar Xiv preprint ar Xiv:1806.01471, 2018. Diakonikolas, I., Kane, D. M., and Manurangsi, P. The complexity of adversarially robust proper learning of halfspaces with agnostic noise. ar Xiv preprint ar Xiv:2007.15220, 2020. Engle, R. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987 1007, 1982. Feige, U., Mansour, Y., and Schapire, R. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, pp. 637 657, 2015. Feige, U., Mansour, Y., and Schapire, R. E. Robust inference for multiclass classification. In Algorithmic Learning Theory, pp. 368 386, 2018. Gama, J., Žliobait e, I., Bifet, A., Pechenizkiy, M., and Bouchachia, H. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46, 04 2014. Gao, W. and Zhou, Z.-H. On the consistency of auc pairwise optimization. In International Joint Conference on Artificial Intelligence, pp. 939 945, 2015. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Hamilton, J. D. Time series analysis. Princeton, 1994. Helmbold, D. P. and Long, P. M. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1): 27 46, 1994. Khim, J. and Loh, P.-L. Adversarial risk bounds for binary classification via function transformation. ar Xiv preprint ar Xiv:1810.09519, 2018. Kuznetsov, V. and Mohri, M. Learning theory and algorithms for forecasting non-stationary time series. In Proceedings of NIPS, pp. 541 549, 2015. Kuznetsov, V. and Mohri, M. Generalization bounds for non-stationary mixing processes. Mach. Learn., 106(1): 93 117, 2017. Kuznetsov, V. and Mohri, M. Discrepancy-based theory and algorithms for forecasting non-stationary time series. Ann. Math. Artif. Intell., 88(4):367 399, 2020. Kuznetsov, V., Mohri, M., and Syed, U. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pp. 2501 2509, 2014. H-Consistency Bounds for Surrogate Loss Minimizers Long, P. and Servedio, R. Learning large-margin halfspaces with more malicious noise. In Advances in Neural Information Processing Systems, pp. 91 99, 2011. Long, P. and Servedio, R. Consistency versus realizable Hconsistency for multiclass classification. In International Conference on Machine Learning, pp. 801 809, 2013. Long, P. M. The complexity of learning according to two models of a drifting environment. Machine Learning, 37: 337 354, 1999. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Massart, P. and Nédélec, É. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326 2366, 2006. Meir, R. Nonparametric time series prediction through adaptive model selection. Machine Learning, pp. 5 34, 2000. Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings, volume 7568 of Lecture Notes in Computer Science, pp. 
124 138. Springer, 2012. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, second edition, 2018. Montasser, O., Hanneke, S., and Srebro, N. Vc classes are adversarially robustly learnable, but only improperly. ar Xiv preprint ar Xiv:1902.04217, 2019. Montasser, O., Hanneke, S., and Srebro, N. Reducing adversarially robust learning to non-robust pac learning. ar Xiv preprint ar Xiv:2010.12039, 2020. Robey, A., Chamon, L., Pappas, G., Hassani, H., and Ribeiro, A. Adversarial robustness with semi-infinite constrained learning. In Advances in Neural Information Processing Systems, pp. 6198 6215, 2021. Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! In Advances in Neural Information Processing Systems, pp. 3353 3364, 2019. Steinwart, I. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Uematsu, K. and Lee, Y. On theoretically optimal ranking functions in bipartite ranking. Department of Statistics, The Ohio State University, Tech. Rep, 863, 2011. Viallard, P., VIDOT, E. G., Habrard, A., and Morvant, E. A pac-bayes analysis of adversarial robustness. In Advances in Neural Information Processing Systems, pp. 14421 14433, 2021. Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. ar Xiv preprint ar Xiv:2001.03994, 2020. Yin, D., Ramchandran, K., and Bartlett, P. L. Rademacher complexity for adversarially robust generalization. In International Conference of Machine Learning, pp. 7085 7094, 2019. Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. ar Xiv preprint ar Xiv:1901.08573, 2019. Zhang, M. and Agarwal, S. Bayes consistency vs. Hconsistency: The interplay between surrogate loss functions and the scoring function class. In Advances in Neural Information Processing Systems, pp. 16927 16936, 2020. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a. Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b. H-Consistency Bounds for Surrogate Loss Minimizers Contents of Appendix A Related Work 14 B Deferred Tables 15 C Deferred Theorems 16 D Proof of Theorem 1 and Theorem 2 17 E Proof of Lemma 1 and Lemma 2 18 F Comparison with Previous Results when H = Hall 19 F.1 Comparison with (Mohri et al., 2018, Theorem 4.7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 F.2 Comparison with (Bartlett et al., 2006, Theorem 1.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 G Proof of Theorem 3 and Theorem 12 20 H Proof of Theorem 5 and Theorem 13 21 I Proof of Theorem 4 and Theorem 6 21 J Proof of Theorem 7 23 K Derivation of Non-Adversarial H-Consistency Bounds 23 K.1 Linear Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 K.1.1 Hinge Loss . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 K.1.2 Logistic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 K.1.3 Exponential Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 K.1.4 Quadratic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 K.1.5 Sigmoid Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 K.1.6 ρ-Margin Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 K.2 One-Hidden-Layer Re LU Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 K.2.1 Hinge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 K.2.2 Logistic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 K.2.3 Exponential Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 K.2.4 Quadratic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 K.2.5 Sigmoid Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 K.2.6 ρ-Margin Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 L Derivation of Adversarial H-Consistency Bounds 42 H-Consistency Bounds for Surrogate Loss Minimizers L.1 Linear Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 L.1.1 Supremum-Based ρ-Margin Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 L.2 One-Hidden-Layer Re LU Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 L.2.1 Supremum-Based ρ-Margin Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 M Derivation of Non-Adversarial Hall-Consistency Bounds under Massart s Noise Condition 47 M.1 Quadratic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 M.2 Logistic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 M.3 Exponential Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 N Derivation of Adversarial H-Consistency Bounds under Massart s Noise Condition 49 N.1 Linear Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 N.1.1 Supremum-Based Hinge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 N.1.2 Supremum-Based Sigmoid Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 N.2 One-Hidden-Layer Re LU Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 N.2.1 Supremum-Based Hinge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 N.2.2 Supremum-Based Sigmoid Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 H-Consistency Bounds for Surrogate Loss Minimizers A. 
Related Work Bayes-consistency (also known as consistency) and excess error bounds between margin-based loss functions and the zero-one loss have been widely studied in the literature (Zhang, 2004a; Bartlett et al., 2006; Steinwart, 2007; Mohri et al., 2018). Consistency studies the asymptotic relation between the surrogate excess error and the target excess error while excess error bounds study the quantitative relation between them and thus is stronger. They both consider the hypothesis set of all measurable functions. Zhang (2004a), Bartlett et al. (2006), and Steinwart (2007) studied consistency via the lens of calibration and showed that calibration and consistency are equivalent in the standard binary classification when considering the hypothesis set of all measurable functions. Zhang (2004a) studied the closesenee to the optimal excess error of the zero-one loss minimizers of convex surrogates. Bartlett et al. (2006) extended the results of Zhang (2004a) and developed a general methodology for coming up with quantitative bounds between the excess error corresponding to the zero-one loss and that of margin-based surrogate loss functions for all distributions. In a more recent work, Mohri et al. (2018) simplified these results and provided different proofs for the excess error bounds of various loss functions widely used in practice. Calibration and consistency analysis have also been extended to the multi-class settings (Zhang, 2004b; Tewari & Bartlett, 2007) and to ranking problems (Uematsu & Lee, 2011; Gao & Zhou, 2015). Bayes-consistency is not an appropriate notion when studying learning with a hypothesis set H that is distinct from the family of all measurable functions. Therefore, a new hypothesis set-dependent notion namely, H-consistency, has been proposed and explored in the more recent literature (Long & Servedio, 2013; Kuznetsov et al., 2014; Zhang & Agarwal, 2020). In particular, Long & Servedio (2013) argued that H-consistency is a more useful notion than consistency by empirically showing that certain loss functions that are H-consistent but not Bayes consistent can perform significantly better than a loss function known to be Bayes consistent. The work of Kuznetsov et al. (2014) extended the H-consistency results in (Long & Servedio, 2013) to the case of structured prediction and provided positive results for H-consistency of several multi-class ensemble algorithms. In a recent work Zhang & Agarwal (2020) investigated the empirical phenomenon in (Long & Servedio, 2013) and designed a class of piecewise linear scoring functions such that minimizing a surrogate that is not H-consistent over this larger class yields H-consistency of linear models. For linear predictors, more general margin-based properties of convex surrogate losses are also studied in (Long & Servedio, 2011; Ben-David et al., 2012). Aiming for such margin-based error guarantees, Ben-David et al. (2012) argued that the hinge loss is optimal among convex losses. Most recently, the notion of H-consistency along with H-calibration have also been studied in the context of adversarially robust classification (Bao et al., 2021; Awasthi et al., 2021a). In the adversarial scenario, in contrast to standard classification, the target loss is the adversarial zero-one loss (Goodfellow et al., 2014; Madry et al., 2017; Carlini & Wagner, 2017; Tsipras et al., 2018; Shafahi et al., 2019; Wong et al., 2020). 
This corresponds to the worst zero-one loss incurred over an adversarial perturbation of x within a γ-ball as measured in a norm, typically ℓp for p [1,+ ]. The adversarial loss presents new challenges and makes the consistency analysis significantly more complex. The work of Bao et al. (2021) initiated the study of H-calibration with respect to the adversarial zero-one loss for the linear models. They showed that convex surrogates are not calibrated and introduced a class of nonconvex margin-based surrogate losses. They then provided sufficient conditions for such nonconvex losses to be calibrated in the linear case. The work of Awasthi et al. (2021a) extended the results in (Bao et al., 2021) to the general nonlinear hypothesis sets and pointed out that although H-calibration is a necessary condition of H-consistency, it is not sufficient in the adversarial scenario. They then proposed sufficient conditions which guarantee calibrated losses to be consistent in the setting of adversarially robust classification. All the above mentioned publications either studied asymptotic properties (Bayes-consistency or H-consistency) or studied quantitative relations when H is the family of all measurable functions (excess error bounds). Instead, our work considers a hypothesis set-dependent quantitative relation between the surrogate estimation error and the target estimation error. This is significantly stronger than H-calibration or H-consistency and is also more informative than excess error bounds which correspond to a special case of our results with H = Hall. As a by-product, our theory contributes more significant consistency results for the poorly understood setting of adversarial robustness. There have also been recent works on different theoretical aspects of adversarial robustness such as tension between the zero-one loss and the adversarial zero-one loss (Tsipras et al., 2018; Zhang et al., 2019), computational bottlenecks for adversarial loss (Bubeck et al., 2018a;b; Awasthi et al., 2019), adversarial examples (Bartlett et al., 2021; Bubeck et al., 2021), sample complexity of adversarial surrogate H-Consistency Bounds for Surrogate Loss Minimizers losses (Khim & Loh, 2018; Cullina et al., 2018; Yin et al., 2019; Montasser et al., 2019; Awasthi et al., 2020), computational complexity of adversarially robust linear classifiers (Diakonikolas et al., 2020), connections with PAC learning (Montasser et al., 2020; Viallard et al., 2021), perturbations beyond ℓp norm(Feige et al., 2015; 2018; Attias et al., 2018), adversarial robustness optimization (Robey et al., 2021), overparametrization (Bubeck & Sellke, 2021) and Bayes optimality (Awasthi et al., 2021b). B. Deferred Tables Table 5: Non-adversarial Hlin-estimation error transformation (ϵ = 0) and Hlin-consistency bounds. All the bounds are hypothesis set-dependent (parameter B in Hlin) and provide novel guarantees as discussed in Section 5.3. The minimizability gaps appearing in the bounds for the surrogates are concluded in Table 1. The detailed derivation is included in Appendix K.1. Surrogates TΦ(t), t [0, 1] T 1 Φ (t), t R+ Bound Hinge min{B, 1} t t min{B,1} (26) 2 log2(t + 1) + 1 t 2 log2(1 t), t e B 1 e B+1, 1 t+1 2 log2(1 + e B) 1 t 2 log2(1 + e B), t > e B 1 e B+1. upper bounded by e B 1) t, t > 1 e B+1) 2 . (28) Exponential 1 t2, t e2B 1 e2B+1, 1 t+1 2 e B, t > e2B 1 e2B+1. upper bounded by e2B 1) t, t > 1 e2B+1) 2 . (30) Quadratic {t2, t B, 2B t B2, t > B. { t, t B2, t 2B + B 2 , t > B2. 
(32) Sigmoid tanh(k B) t t tanh(k B) (34) ρ-Margin min{B,ρ} ρ t ρ min{B,ρ} t (37) Table 6: Non-adversarial HNN-estimation error transformation (ϵ = 0) and HNN-consistency bounds. All the bounds are hypothesis set-dependent (parameter Λ and B in HNN) and provide novel guarantees as discussed in Section 5.4. The minimizability gaps appearing in the bounds for the surrogates are concluded in Table 1. The detailed derivation is included in Appendix K.2. Surrogates TΦ(t), t [0, 1] T 1 Φ (t), t R+ Bound Hinge min{ΛB, 1} t t min{ΛB,1} (41) 2 log2(t + 1) + 1 t 2 log2(1 t), t eΛB 1 eΛB+1, 1 t+1 2 log2(1 + e ΛB) 1 t 2 log2(1 + eΛB), t > eΛB 1 eΛB+1. upper bounded by eΛB 1) t, t > 1 eΛB+1) 2 . (43) Exponential 1 t2, t e2ΛB 1 e2ΛB+1, 1 t+1 2 eΛB, t > e2ΛB 1 e2ΛB+1. upper bounded by e2B+1 ) 2 , e2ΛB 1) t, t > 1 e2ΛB+1) 2 . (45) Quadratic {t2, t ΛB, 2ΛBt (ΛB)2, t > ΛB. { 2 , t > (ΛB)2 (47) Sigmoid tanh(kΛB) t t tanh(kΛB) (49) ρ-Margin min{ΛB,ρ} ρ t ρ min{ΛB,ρ} t (51) H-Consistency Bounds for Surrogate Loss Minimizers C. Deferred Theorems Theorem 8 (Non-adversarial distribution-dependent Ψ-bound). Suppose that H satisfies the condition of Lemma 1 and that Φ is a margin-based loss function. Assume there exist a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that the following holds for any x X: Ψ( 2 η(x) ϵ) inf h H(x) CΦ,H(h,x). (13) Then, for any hypothesis h H, Ψ(Rℓ0 1(h) R ℓ0 1,H + Mℓ0 1,H) RΦ(h) R Φ,H + MΦ,H + max{0,Ψ(ϵ)}. (14) Theorem 9 (Non-adversarial distribution-dependent Γ-bound). Suppose that H satisfies the condition of Lemma 1 and that Φ is a margin-based loss function. Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any x X: 2 η(x) ϵ Γ( inf h H(x) CΦ,H(h,x)). (15) Then, for any hypothesis h H, Rℓ0 1(h) R ℓ0 1,H Γ(RΦ(h) R Φ,H + MΦ,H) Mℓ0 1,H + ϵ. (16) Theorem 10 (Adversarial distribution-dependent Ψ-bound). Suppose that H is symmetric and that Φ is a supremumbased margin loss function. Assume there exist a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that the following holds for any x X: Ψ( η(x) + 1/2 ϵ) inf h Hγ(x) CΦ,H(h,x), Ψ( 2 η(x) ϵ) inf h H hγ(x)<0 CΦ,H(h,x), Ψ( 2 η(x) ϵ) inf h H hγ(x)>0 CΦ,H(h,x). Then, for any hypothesis h H, Ψ(Rℓγ(h) R ℓγ,H + Mℓγ,H) R Φ(h) R Φ,H + M Φ,H + max{0,Ψ(ϵ)}. (18) Theorem 11 (Adversarial distribution-dependent Γ-bound). Suppose that H is symmetric and that Φ is a supremumbased margin loss function. Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any x X: η(x) + 1/2 ϵ Γ( inf h Hγ(x) CΦ,H(h,x)), 2 η(x) ϵ Γ( inf h H hγ(x)<0 CΦ,H(h,x)), 2 η(x) ϵ Γ( inf h H hγ(x)>0 CΦ,H(h,x)). Then, for any hypothesis h H, Rℓγ(h) R ℓγ,H Γ(R Φ(h) R Φ,H + M Φ,H) Mℓγ,H + ϵ. (20) Theorem 12 (Distribution-independent Γ-bound). Suppose that H satisfies the condition of Lemma 1 and that Φ is a margin-based loss function. Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any for any t [1/2,1] 2t 1 ϵ Γ( inf x X,h H h(x)<0 CΦ,H(h,x,t)). H-Consistency Bounds for Surrogate Loss Minimizers Then, for any hypothesis h H and any distribution, Rℓ0 1(h) R ℓ0 1,H Γ(RΦ(h) R Φ,H + MΦ,H) Mℓ0 1,H + ϵ. (21) Theorem 13 (Adversarial distribution-independent Γ-bound). Suppose that H is symmetric and that Φ is a supremumbased margin loss function. 
Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any for any t [1/2,1] t ϵ Γ inf x X,h Hγ(x) H C Φ,H(h,x,t) , 2t 1 ϵ Γ( inf x X,h H hγ(x)<0 C Φ,H(h,x,t)). Then, for any hypothesis h H and any distribution, Rℓγ(h) R ℓγ,H Γ(R Φ(h) R Φ,H + M Φ,H) Mℓγ,H + ϵ. (22) D. Proof of Theorem 1 and Theorem 2 Theorem 1 (Distribution-dependent Ψ-bound). Assume that there exists a convex function Ψ R+ R with Ψ(0) 0 and ϵ 0 such that the following holds for all h H and x X: Ψ( Cℓ2,H(h,x) ϵ) Cℓ1,H(h,x). (6) Then, the following inequality holds for any h H: Ψ(Rℓ2(h) R ℓ2,H + Mℓ2,H) Rℓ1(h) R ℓ1,H + Mℓ1,H + max{Ψ(0),Ψ(ϵ)}. (7) Proof. For any h H, since Ψ( Cℓ2,H(h,x)1 Cℓ2,H(h,x)>ϵ) Cℓ1,H(h,x) for all x X, we have Ψ(Rℓ2(h) R ℓ2,H + Mℓ2,H) = Ψ(EX[Cℓ2(h,x) C ℓ2,H(x)]) = Ψ(EX[ Cℓ2,H(h,x)]) EX[Ψ( Cℓ2,H(h,x))] (Jensen s ineq.) = EX[Ψ( Cℓ2,H(h,x)1 Cℓ2,H(h,x)>ϵ + Cℓ2,H(h,x)1 Cℓ2,H(h,x) ϵ)] EX[Ψ( Cℓ2,H(h,x)1 Cℓ2,H(h,x)>ϵ) + Ψ( Cℓ2,H(h,x)1 Cℓ2,H(h,x) ϵ)] (Ψ(0) 0) EX[ Cℓ1,H(h,x)] + sup t [0,ϵ] Ψ(t) (assumption) = Rℓ1(h) R ℓ1,H + Mℓ1,H + max{Ψ(0),Ψ(ϵ)}, (convexity of Ψ) which proves the theorem. Theorem 2 (Distribution-dependent Γ-bound). Assume that there exists a concave function Γ R+ R and ϵ 0 such that the following holds for all h H and x X: Cℓ2,H(h,x) ϵ Γ( Cℓ1,H(h,x)). (8) Then, the following inequality holds for any h H: Rℓ2(h) R ℓ2,H Γ(Rℓ1(h) R ℓ1,H +Mℓ1,H) Mℓ2,H +ϵ. (9) H-Consistency Bounds for Surrogate Loss Minimizers Proof. For any h H, since Cℓ2,H(h,x)1 Cℓ2,H(h,x)>ϵ Γ( Cℓ1,H(h,x)) for all x X, we have Rℓ2(h) R ℓ2,H + Mℓ2,H = EX[Cℓ2(h,x) C ℓ2,H(x)] = EX[ Cℓ2,H(h,x)] = EX[ Cℓ2,H(h,x)1 Cℓ2,H(h,x)>ϵ + Cℓ2,H(h,x)1 Cℓ2,H(h,x) ϵ] EX[Γ( Cℓ1,H(h,x))] + ϵ (assumption) Γ(EX[ Cℓ1,H(h,x)]) + ϵ (concavity of Γ) = Γ(Rℓ1(h) R ℓ1,H + Mℓ1,H) + ϵ, which proves the theorem. E. Proof of Lemma 1 and Lemma 2 Lemma 1. Assume that H satisfies the following condition for any x X: {sign(h(x)) h H} = { 1,+1}. Then, the minimal conditional ℓ0 1-risk is C ℓ0 1,H(x) = C ℓ0 1,Hall(x) = min{η(x),1 η(x)}. The conditional ϵ-regret for ℓ0 1 can be characterized as Cℓ0 1,H(h,x) ϵ = 2 η(x) ϵ1h H(x) . Proof. By the definition, the conditional ℓ0 1-risk is Cℓ0 1(h,x) = η(x)1h(x)<0 + (1 η(x))1h(x) 0 η(x) if h(x) < 0, 1 η(x) if h(x) 0. By the assumption, for any x X, there exists h H such that sign(h (x)) = sign( η(x)), where η(x) is the Bayes classifier such that Cℓ0 1( η(x),x) = C ℓ0 1,Hall(x) = min{η(x),1 η(x)}. Therefore, the optimal conditional ℓ0 1-risk is C ℓ0 1,H(x) = Cℓ0 1(h ,x) = Cℓ0 1( η(x),x) = min{η(x),1 η(x)} which proves the first part of lemma. By the definition, Cℓ0 1,H(h,x) = Cℓ0 1(h,x) C ℓ0 1,H(x) = η(x)1h(x)<0 + (1 η(x))1h(x) 0 min{η(x),1 η(x)} 2 η(x) , h H(x), 0, otherwise. This leads to Cℓ0 1,H(h,x) ϵ = 2 η(x) ϵ1h H(x) . Lemma 2. Assume that H is symmetric. Then, the minimal conditional ℓγ-risk is C ℓγ,H(x) = min{η(x),1 η(x)}1Hγ(x) H + 1Hγ(x)=H . H-Consistency Bounds for Surrogate Loss Minimizers The conditional ϵ-regret for ℓγ can be characterized as Cℓγ,H(h,x) ϵ = 2 ϵ h Hγ(x) H 2 η(x) ϵ hγ(x) < 0 2 η(x) ϵ hγ(x) > 0 0 otherwise Proof. By the definition, the conditional ℓγ-risk is Cℓγ(h,x) = η(x)1{hγ(x) 0} + (1 η(x))1{hγ(x) 0} 1 if h Hγ(x), η(x) if hγ(x) < 0, 1 η(x) if hγ(x) > 0. Since H is symmetric, for any x X, either there exists h H such that hγ(x) > 0, or Hγ(x) = H. When Hγ(x) = H, {h H hγ(x) < 0} and {h H hγ(x) > 0} are both empty sets. Thus C ℓγ,H(x) = 1. When Hγ(x) H, there exists h H such that Cℓγ(h,x) = min{η(x),1 η(x)} = C ℓγ,H(x). 
Therefore, the minimal conditional ℓγ-risk is C ℓγ,H(x) = 1, Hγ(x) = H , min{η(x),1 η(x)}, Hγ(x) H . When Hγ(x) = H, Cℓγ(h,x) 1, which implies that Cℓγ,H(h,x) 0. For h Hγ(x) H, Cℓγ,H(h,x) = 1 min{η(x),1 η(x)} = η(x) + 1/2; for h H such that hγ(x) < 0, we have Cℓγ,H(h,x) = η(x) min{η(x),1 η(x)} = max{0,2 η(x)}; for h H such that hγ(x) > 0, Cℓγ,H(h,x) = 1 η(x) min{η(x),1 η(x)} = max{0, 2 η(x)}. Therefore, Cℓγ,H(h,x) = η(x) + 1/2 h Hγ(x) H, max{0,2 η(x)} hγ(x) < 0, max{0, 2 η(x)} hγ(x) > 0, 0 otherwise. This leads to Cℓγ,H(h,x) ϵ = 2 ϵ h Hγ(x) H 2 η(x) ϵ hγ(x) < 0 2 η(x) ϵ hγ(x) > 0 0 otherwise F. Comparison with Previous Results when H = Hall F.1. Comparison with (Mohri et al., 2018, Theorem 4.7) Assume Φ is convex and non-increasing. For any x X, by the convexity, we have CΦ(h,x) = η(x)Φ(h(x)) + (1 η(x))Φ( h(x)) Φ(2 η(x)h(x)). (23) inf h Hall(x) CΦ,Hall(h,x) inf h Hall 2 η(x)h(x) 0 CΦ,Hall(h,x) (h Hall(x) Ô h(x) η(x) 0) inf h Hall 2 η(x)h(x) 0Φ(2 η(x)h(x)) C Φ,Hall(x) (eq. (23)) = CΦ(0,x) C Φ,Hall(x) (Φ is non-increasing). H-Consistency Bounds for Surrogate Loss Minimizers Thus the condition of Theorem 4.7 in (Mohri et al., 2018) implies the condition in Corollary 2: η(x) c [CΦ(0,x) C Φ,Hall(x)] 1 s , x X Ô η(x) c inf h Hall(x) [ CΦ,Hall(h,x)] Therefore, Theorem 4.7 in (Mohri et al., 2018) is a special case of Corollary 2. F.2. Comparison with (Bartlett et al., 2006, Theorem 1.1) We show that the ψ-transform in (Bartlett et al., 2006) verifies the condition in Corollary 1 for all distributions. First, by Definition 2 in (Bartlett et al., 2006), we know that ψ is convex, ψ(0) = 0 and ψ ψ. Then, ψ(2 η(x) ) ψ(2 η(x) ) (ψ ψ) = inf α 0(max{η(x),1 η(x)}Φ(α) + min{η(x),1 η(x)}Φ( α)) inf α R(max{η(x),1 η(x)}Φ(α) + min{η(x),1 η(x)}Φ( α)) (def. of ψ) = inf α η(x) 0(η(x)Φ(α) + (1 η(x))Φ( α)) inf α R(η(x)Φ(α) + (1 η(x))Φ( α)) (symmetry) = inf h Hall h(x) η(x) 0 CΦ,Hall(h,x) inf h Hall(x) CΦ,Hall(h,x) (h Hall(x) Ô h(x) η(x) 0) Therefore, Theorem 1.1 in (Bartlett et al., 2006) is a special case of Corollary 1. G. Proof of Theorem 3 and Theorem 12 Theorem 3 (Distribution-independent Ψ-bound). Assume that H satisfies the condition of Lemma 1. Assume that there exists a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that for any t [1/2,1], Ψ( 2t 1 ϵ) inf x X,h H h(x)<0 CΦ,H(h,x,t). Then, for any hypothesis h H and any distribution, Ψ(Rℓ0 1(h) R ℓ0 1,H + Mℓ0 1,H) RΦ(h) R Φ,H + MΦ,H + max{0,Ψ(ϵ)}. (10) Proof. Note the condition (13) in Theorem 8 is symmetric about η(x) = 0. Thus, condition (13) uniformly holds for all distributions is equivalent to the following holds for any t [1/2,1] Ψ( 2t 1 ϵ) inf x X,h H h(x)<0 CΦ,H(h,x,t), which proves the theorem. Theorem 12 (Distribution-independent Γ-bound). Suppose that H satisfies the condition of Lemma 1 and that Φ is a margin-based loss function. Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any for any t [1/2,1] 2t 1 ϵ Γ( inf x X,h H h(x)<0 CΦ,H(h,x,t)). Then, for any hypothesis h H and any distribution, Rℓ0 1(h) R ℓ0 1,H Γ(RΦ(h) R Φ,H + MΦ,H) Mℓ0 1,H + ϵ. (21) Proof. Note the condition (15) in Theorem 9 is symmetric about η(x) = 0. Thus, condition (15) uniformly holds for all distributions is equivalent to the following holds for any t [1/2,1] Ψ( 2t 1 ϵ) inf x X,h H h(x)<0 CΦ,H(h,x,t), which proves the theorem. H-Consistency Bounds for Surrogate Loss Minimizers H. Proof of Theorem 5 and Theorem 13 Theorem 5 (Adversarial distribution-independent Ψ-bound). Suppose that H is symmetric. 
Assume there exist a convex function Ψ R+ R with Ψ(0) = 0 and ϵ 0 such that the following holds for any t [1/2,1] Ψ( t ϵ) inf x X,h Hγ(x) H C Φ,H(h,x,t), Ψ( 2t 1 ϵ) inf x X,h H hγ(x)<0 C Φ,H(h,x,t). Then, for any hypothesis h H and any distribution, Ψ(Rℓγ(h) R ℓγ,H + Mℓγ,H) R Φ(h) R Φ,H + M Φ,H + max{0,Ψ(ϵ)}. (12) Proof. Note the condition (17) in Theorem 10 is symmetric about η(x) = 0. Thus, condition (17) uniformly holds for all distributions is equivalent to the following holds for any t [1/2,1] Ψ( t ϵ) inf x X,h Hγ(x) H C Φ,H(h,x,t), Ψ( 2t 1 ϵ) inf x X,h H hγ(x)<0 C Φ,H(h,x,t), which proves the theorem. Theorem 13 (Adversarial distribution-independent Γ-bound). Suppose that H is symmetric and that Φ is a supremumbased margin loss function. Assume there exist a non-negative and non-decreasing concave function Γ R+ R and ϵ 0 such that the following holds for any for any t [1/2,1] t ϵ Γ inf x X,h Hγ(x) H C Φ,H(h,x,t) , 2t 1 ϵ Γ( inf x X,h H hγ(x)<0 C Φ,H(h,x,t)). Then, for any hypothesis h H and any distribution, Rℓγ(h) R ℓγ,H Γ(R Φ(h) R Φ,H + M Φ,H) Mℓγ,H + ϵ. (22) Proof. Note the condition (19) in Theorem 11 is symmetric about η(x) = 0. Thus, condition (19) uniformly holds for all distributions is equivalent to the following holds for any t [1/2,1] t ϵ Γ inf x X,h Hγ(x) H C Φ,H(h,x,t) , 2t 1 ϵ Γ( inf x X,h H hγ(x)<0 C Φ,H(h,x,t)), which proves the theorem. I. Proof of Theorem 4 and Theorem 6 Theorem 4 (Tightness). Suppose that H satisfies the condition of Lemma 1 and that ϵ = 0. If TΦ is convex with TΦ(0) = 0, then, for any t [0,1] and δ > 0, there exist a distribution D and a hypothesis h H such that Rℓ0 1(h) R ℓ0 1,H+Mℓ0 1,H = t and TΦ(t) RΦ(h) R Φ,H + MΦ,H TΦ(t) + δ. H-Consistency Bounds for Surrogate Loss Minimizers Proof. By Theorem 3, if TΦ is convex with TΦ(0) = 0, the first inequality holds. For any t [0,1], consider the distribution that supports on a singleton {x0} and satisfies that η(x0) = 1 inf x X,h H h(x)<0 CΦ,H(h,x,η(x0)) = inf h H h(x0)<0 CΦ,H(h,x0,η(x0)) = inf h H h(x0)<0 CΦ,H(h,x0). For any δ > 0, take h0 H such that h0(x0) < 0 and CΦ,H(h0,x0) inf h H h(x0)<0 CΦ,H(h,x0) + δ = inf x X,h H h(x)<0 CΦ,H(h,x,η(x0)) + δ. Then, we have Rℓ0 1(h0) R ℓ0 1,H + Mℓ0 1,H = Rℓ0 1(h0) EX[C ℓ0 1,H(x)] = Cℓ0 1,H(h0,x0) RΦ(h0) R Φ,H + MΦ,H = RΦ(h0) EX[C Φ,H(x)] = CΦ,H(h0,x0) inf x X,h H h(x)<0 CΦ,H(h,x,η(x0)) + δ = TΦ(2η(x0) 1) + δ = TΦ(t) + δ, which completes the proof. Theorem 6 (Adversarial tightness). Suppose that H is symmetric and that ϵ = 0. If T Φ = min{T1,T2} is convex with T Φ(0) = 0 and T2 T1, then, for any t [0,1] and δ > 0, there exist a distribution D and a hypothesis h H such that Rℓγ(h) R ℓγ,H + Mℓγ,H = t and T Φ(t) R Φ(h) R Φ,H + M Φ,H T Φ(t) + δ. Proof. By Theorem 5, if T Φ is convex with T Φ(0) = 0, the first inequality holds. For any t [0,1], consider the distribution that supports on a singleton {x0}, which satisfies that η(x0) = 1 2 and Hγ(x0) H. Thus inf x X,h H hγ(x)<0 C Φ,H(h,x,η(x0)) = inf h H hγ(x0)<0 C Φ,H(h,x0,η(x0)) = inf h H hγ(x0)<0 C Φ,H(h,x0). For any δ > 0, take h H such that hγ(x0) < 0 and C Φ,H(h,x0) inf h H hγ(x0)<0 C Φ,H(h,x0) + δ = inf x X,h H hγ(x)<0 C Φ,H(h,x,η(x0)) + δ. Then, we have Rℓγ(h) R ℓγ,H + Mℓγ,H = Rℓγ(h) EX[C ℓγ,H(x)] = Cℓγ,H(h,x0) R Φ(h) R Φ,H + M Φ,H = R Φ(h) EX[C Φ,H(x)] = C Φ,H(h,x0) inf x X,h H hγ(x)<0 C Φ,H(h,x,η(x0)) + δ = T2(2η(x0) 1) + δ = T Φ(2η(x0) 1) + δ (T2 T1) = T Φ(t) + δ which completes the proof. H-Consistency Bounds for Surrogate Loss Minimizers J. Proof of Theorem 7 Theorem 7 (Negative results for robustness). 
Suppose that H contains 0 and is regular for adversarial calibration. Let ℓ1 be supremum-based convex loss or supremum-based symmetric loss and ℓ2 = ℓγ. Then, f(t) 1/2 for any t 0 are the only non-decreasing functions f such that (3) holds. Proof. Assume x0 X is distinguishing. Consider the distribution that supports on {x0}. Let η(x0) = 1/2 and h0 = 0 H. Then, for any h H, Rℓγ(h) = Cℓγ(h,x0) = 1/21hγ(x0) 0 + 1/21hγ(x0) 0 1/2, where the equality can be achieved for some h H since x0 is distinguishing. Therefore, R ℓγ,H = C ℓγ,H(x0) = inf h H Cℓγ(h,x0) = 1/2. Note Rℓγ(h0) = 1/2 + 1/2 = 1. For the supremum-based convex loss Φ, for any h H, R Φ(h) = C Φ(h,x0) = 1/2Φ(hγ(x0)) + 1/2Φ( hγ(x0)) Φ(1/2hγ(x0) 1/2hγ(x0)) (convexity of Φ) Φ(0), (Φ is non-increasing) where both equality can be achieved by h0 = 0. Therefore, R Φ,H = C Φ,H(x0) = R Φ(h0) = Φ(0). If (3) holds for some non-decreasing function f, then, we obtain for any h H, Rℓγ(h) 1/2 f(R Φ(h) Φ(0)). Let h = h0, then f(0) 1/2. Since f is non-decreasing, for any t [0,1], f(t) 1/2. For the supremum-based symmetric loss Φsym, there exists a constant C 0 such that, for any h H, R Φsym(h) = C Φsym(h,x0) = 1/2Φsym(hγ(x0)) + 1/2Φsym( hγ(x0)) 1/2Φsym(hγ(x0)) + 1/2Φsym( hγ(x0)) 2 where the equality can be achieved by h0 = 0. Therefore, R Φsym,H = C Φsym,H(x0) = R Φsym(h0) = C If (3) holds for some non-decreasing function f, then, we obtain for any h H, Rℓγ(h) 1/2 f(R Φsym(h) C Let h = h0, then f(0) 1/2. Since f is non-decreasing, for any t [0,1], f(t) 1/2. K. Derivation of Non-Adversarial H-Consistency Bounds K.1. Linear Hypotheses Since Hlin satisfies the condition of Lemma 1, by Lemma 1 the (ℓ0 1,Hlin)-minimizability gap can be expressed as follows: Mℓ0 1,Hlin = R ℓ0 1,Hlin EX[min{η(x),1 η(x)}] = R ℓ0 1,Hlin R ℓ0 1,Hall. (24) Therefore, the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error. By the definition of Hlin, for any x X, {h(x) h Hlin} = [ W x p B,W x p + B]. H-Consistency Bounds for Surrogate Loss Minimizers K.1.1. HINGE LOSS For the hinge loss Φhinge(α) = max{0,1 α}, for all h Hlin and x X: CΦhinge(h,x,t) = tΦhinge(h(x)) + (1 t)Φhinge( h(x)) = tmax{0,1 h(x)} + (1 t)max{0,1 + h(x)}. inf h Hlin CΦhinge(h,x,t) = 1 2t 1 min{W x p + B,1}. Therefore, the (Φhinge,Hlin)-minimizability gap can be expressed as follows: MΦhinge,Hlin = R Φhinge,Hlin EX[1 inf h Hlin CΦhinge(h,x,η(x))]. = R Φhinge,Hlin EX[1 2η(x) 1 min{W x p + B,1}]. (25) Note the (Φhinge,Hlin)-minimizability gap coincides with the (Φhinge,Hlin)-approximation error R Φhinge,Hlin EX[1 2η(x) 1 ] for B 1. 2 < t 1, we have inf h Hlin h(x)<0CΦhinge(h,x,t) = tmax{0,1 0} + (1 t)max{0,1 + 0} inf x X inf h Hlin h(x)<0 CΦhinge,Hlin(h,x,t) = inf x X{ inf h Hlin h(x)<0CΦhinge(h,x,t) inf h Hlin CΦhinge(h,x,t)} = inf x X(2t 1)min{W x p + B,1} = (2t 1)min{B,1} where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = min{B,1}t. By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the hinge loss is as follows: TΦhinge = min{B,1}t, t [0,1], Therefore, TΦhinge is convex, non-decreasing, invertible and satisfies that TΦhinge(0) = 0. By Theorem 4, we can choose Ψ(t) = min{B,1}t in Theorem 3, or, equivalently, Γ(t) = t min{B,1} in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the hinge loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin RΦhinge(h) R Φhinge,Hlin + MΦhinge,Hlin min{B,1} Mℓ0 1,Hlin. 
(26) Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φhinge,Hlin)- minimizability gap coincides with the (Φhinge,Hlin)-approximation error for B 1, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦhinge(h) R Φhinge,Hall if B 1 1 B [RΦhinge(h) EX[1 2η(x) 1 min{W x p + B,1}]] otherwise. The inequality for B 1 coincides with the consistency excess error bound known for the hinge loss (Zhang, 2004a; Bartlett et al., 2006; Mohri et al., 2018) but the one for B < 1 is distinct and novel. For B < 1, we have EX[1 2η(x) 1 min{W x p + B,1}] > EX[1 2η(x) 1 ] = 2EX[min{η(x),1 η(x)}] = R Φhinge,Hall. H-Consistency Bounds for Surrogate Loss Minimizers Therefore for B < 1, RΦhinge(h) EX[1 2η(x) 1 min{W x p + B,1}] < RΦhinge(h) R Φhinge,Hall. Note that: R Φhinge,Hall = 2R ℓ0 1,Hall = 2EX[min{η(x),1 η(x)}]. Thus, the first inequality (case B 1) can be equivalently written as follows: h Hlin, Rℓ0 1(h) RΦhinge(h) EX[min{η(x),1 η(x)}], which is a more informative upper bound than the standard inequality Rℓ0 1(h) RΦhinge(h). K.1.2. LOGISTIC LOSS For the logistic loss Φlog(α) = log2(1 + e α), for all h Hlin and x X: CΦlog(h,x,t) = tΦlog(h(x)) + (1 t)Φlog( h(x)) = tlog2(1 + e h(x)) + (1 t)log2(1 + eh(x)). inf h Hlin CΦlog(h,x,t) tlog2(t) (1 t)log2(1 t) if log t 1 t W x p + B, max{t,1 t}log2(1 + e (W x p+B)) + min{t,1 t}log2(1 + e W x p+B) if log t 1 t > W x p + B. Therefore, the (Φlog,Hlin)-minimizability gap can be expressed as follows: MΦlog,Hlin = R Φlog,Hlin EX[ inf h Hlin CΦlog(h,x,η(x))] = R Φlog,Hlin EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))1log η(x) 1 η(x) W x p+B] EX[max{η(x),1 η(x)}log2(1 + e (W x p+B))1log η(x) 1 η(x) >W x p+B] EX[min{η(x),1 η(x)}log2(1 + e W x p+B)1log η(x) 1 η(x) >W x p+B] Note (Φlog,Hlin)-minimizability gap coincides with the (Φlog,Hlin)-approximation error R Φlog,Hlin EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))] for B = + . 2 < t 1, we have inf h Hlin h(x)<0CΦlog(h,x,t) = tlog2(1 + e 0) + (1 t)log2(1 + e0) inf x X inf h Hlin h(x)<0 CΦlog,Hlin(h,x,t) = inf x X( inf h Hlin h(x)<0CΦlog(h,x,t) inf h Hlin CΦlog(h,x,t)) 1 + tlog2(t) + (1 t)log2(1 t) if log t 1 t W x p + B, 1 tlog2(1 + e (W x p+B)) (1 t)log2(1 + e W x p+B) if log t 1 t > W x p + B. 1 + tlog2(t) + (1 t)log2(1 t) if log t 1 t B, 1 tlog2(1 + e B) (1 t)log2(1 + e B) if log t 1 t > B. where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = 2 log2(t + 1) + 1 t 2 log2(1 t), t e B 1 e B+1, 1 t+1 2 log2(1 + e B) 1 t 2 log2(1 + e B), t > e B 1 H-Consistency Bounds for Surrogate Loss Minimizers By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the logistic loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦlog is convex, non-decreasing, invertible and satisfies that TΦlog(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦlog(t) in Theorem 3, or equivalently Γ(t) = T 1 Φlog(t) in Theorem 12, which are optimal. 
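The closed forms above can also be checked numerically. The following is a minimal sketch (not code from the paper), assuming, as in the derivations above, that the infimum over x drives W x p + B down to B, so that the transformation reduces to a one-dimensional search over the score u = h(x) in [-B, B]; the value B = 0.5 and the grid resolution are arbitrary choices made only for illustration.

import numpy as np

B = 0.5  # assumed toy bound on the bias term of H_lin

def cond_risk(phi, u, t):
    # conditional Phi-risk written in terms of the score u = h(x)
    return t * phi(u) + (1 - t) * phi(-u)

def transformation(phi, t, n_grid=20001):
    # brute-force H_lin-estimation error transformation T_Phi(t), with epsilon = 0
    s = (1 + t) / 2                       # t in [0, 1] corresponds to eta(x) = (1 + t) / 2
    grid = np.linspace(-B, B, n_grid)     # achievable scores at the worst-case x
    best = cond_risk(phi, grid, s).min()  # inf over all h in H_lin
    best_neg = cond_risk(phi, grid[grid <= 0], s).min()  # inf over h(x) <= 0 (limit of h(x) < 0)
    return best_neg - best

hinge = lambda u: np.maximum(0.0, 1.0 - u)
logistic = lambda u: np.log2(1.0 + np.exp(-u))

for t in [0.1, 0.4, 0.8]:
    print(t, transformation(hinge, t), min(B, 1) * t)  # hinge: matches min{B, 1} t
    print(t, transformation(logistic, t))              # logistic: matches the piecewise form above

The same routine, with the surrogate swapped, reproduces the transformations derived for the other losses in this appendix.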
To simplify the expression, using the fact that 2 log2(t + 1) + 1 t 2 log2(1 t) = 1 ( t + 1 2 log2(t + 1 2 log2(1 + e B) 1 t 2 log2(1 + e B) = 1 2 log2( 4 2 + e B + e B ) + 1/2log2( 1 + e B 1 + e B )t, TΦlog can be lower bounded by 2 , t e B 1 e B+1)t, t > e B 1 Thus, we adopt an upper bound of T 1 Φlog as follows: T 1 Φlog(t) = e B 1)t, t > 1 Therefore, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the logistic loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin + Mℓ0 1,Hlin 2(RΦlog(h) R Φlog,Hlin + MΦlog,Hlin) 1 2 , if RΦlog(h) R Φlog,Hlin 1 e B+1) 2 MΦlog,Hlin 2( e B+1 e B 1)(RΦlog(h) R Φlog,Hlin + MΦlog,Hlin), otherwise (28) Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φlog,Hlin)- minimizability gap coincides with the (Φlog,Hlin)-approximation error for B = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall 2[RΦlog(h) R Φlog,Hall] 1 2 if B = + 2[RΦlog(h) R Φlog,Hlin + MΦlog,Hlin] 1 2 if RΦlog(h) R Φlog,Hlin 1 e B+1) 2 MΦlog,Hlin 2( e B+1 e B 1)(RΦlog(h) R Φlog,Hlin + MΦlog,Hlin) otherwise otherwise where the (Φlog,Hlin)-minimizability gap MΦlog,Hlin is characterized as below, which is less than the (Φlog,Hlin)- H-Consistency Bounds for Surrogate Loss Minimizers approximation error when B < + : MΦlog,Hlin = R Φlog,Hlin EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))1log η(x) 1 η(x) W x p+B] EX[max{η(x),1 η(x)}log2(1 + e (W x p+B))1log η(x) 1 η(x) >W x p+B] EX[min{η(x),1 η(x)}log2(1 + e W x p+B)1log η(x) 1 η(x) >W x p+B] < R Φlog,Hlin EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))] = R Φlog,Hlin R Φlog,Hall. Therefore, the inequality for B = + coincides with the consistency excess error bound known for the logistic loss (Zhang, 2004a; Mohri et al., 2018) but the one for B < + is distinct and novel. K.1.3. EXPONENTIAL LOSS For the exponential loss Φexp(α) = e α, for all h Hlin and x X: CΦexp(h,x,t) = tΦexp(h(x)) + (1 t)Φexp( h(x)) = te h(x) + (1 t)eh(x). inf h Hlin CΦexp(h,x,t) = t(1 t) if 1/2log t 1 t W x p + B max{t,1 t}e (W x p+B) + min{t,1 t}e W x p+B if 1/2log t 1 t > W x p + B. Therefore, the (Φexp,Hlin)-minimizability gap can be expressed as follows: MΦexp,Hlin = R Φexp,Hlin EX[ inf h Hlin CΦexp(h,x,η(x))] = R Φexp,Hlin EX[2 η(x)(1 η(x))11/2 log η(x) 1 η(x) W x p+B] EX[max{η(x),1 η(x)}e (W x p+B)11/2 log η(x) 1 η(x) >W x p+B] EX[min{η(x),1 η(x)}e W x p+B11/2 log η(x) 1 η(x) >W x p+B]. Note (Φexp,Hlin)-minimizability gap coincides with the (Φexp,Hlin)-approximation error R Φexp,Hlin η(x)(1 η(x))] for B = + . 2 < t 1, we have inf h Hlin h(x)<0CΦexp(h,x,t) = te 0 + (1 t)e0 inf x X inf h Hlin h(x)<0 CΦexp,Hlin(h,x,t) = inf x X( inf h Hlin h(x)<0CΦexp(h,x,t) inf h Hlin CΦexp(h,x,t)) t(1 t) if 1/2log t 1 t W x p + B, 1 te (W x p+B) (1 t)e W x p+B if 1/2log t 1 t > W x p + B. t(1 t), 1/2log t 1 t B 1 te B (1 t)e B, 1/2log t 1 t > B where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = 1 t2, t e2B 1 e2B+1, 1 t+1 2 e B, t > e2B 1 H-Consistency Bounds for Surrogate Loss Minimizers By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the exponential loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦexp is convex, non-decreasing, invertible and satisfies that TΦexp(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦexp(t) in Theorem 3, or equivalently Γ(t) = T 1 Φexp(t) in Theorem 12, which are optimal. 
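As a quick worked instance, note that the unconstrained minimum of t e^{-α} + (1 - t) e^{α} is 2√(t(1 - t)), attained at α = (1/2) log(t/(1 - t)), while the infimum over h(x) < 0 equals 1 (taking h(x) → 0). Hence, writing τ = 2t - 1, on the first branch (τ ≤ (e^{2B} - 1)/(e^{2B} + 1) = tanh(B)) the transformation evaluates to T_Φexp(τ) = 1 - 2√(t(1 - t)) = 1 - √(1 - τ²). For example, with B = 1 (an arbitrary illustrative value) and τ = 0.5, the threshold is tanh(1) ≈ 0.762 > 0.5, so T_Φexp(0.5) = 1 - √0.75 ≈ 0.134; inverting this branch gives Γ(s) = √(2s - s²) ≤ √(2s), which is the square-root behavior that appears in the simplified bound below.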
To simplify the expression, using the fact that 2 e B = 1 1/2e B 1/2e B + e B e B TΦexp can be lower bounded by 2 , t e2B 1 e2B+1)t, t > e2B 1 Thus, we adopt an upper bound of T 1 Φexp as follows: T 1 Φexp(t) = e2B 1)t, t > 1 Therefore, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the exponential loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin + Mℓ0 1,Hlin 2(RΦexp(h) R Φexp,Hlin + MΦexp,Hlin) 1 2 , if RΦexp(h) R Φexp,Hlin 1 e2B+1) 2 MΦexp,Hlin, e2B 1)(RΦexp(h) R Φexp,Hlin + MΦexp,Hlin), otherwise. (30) Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φexp,Hlin)- minimizability gap coincides with the (Φexp,Hlin)-approximation error for B = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall 2[RΦexp(h) R Φexp,Hall] 1 2 if B = + , 2[RΦexp(h) R Φexp,Hlin + MΦexp,Hlin] 1 2 if RΦexp(h) R Φexp,Hlin 1 e2B+1) 2 MΦexp,Hlin, e2B 1)(RΦexp(h) R Φexp,Hlin + MΦexp,Hlin) otherwise. otherwise. where the (Φexp,Hlin)-minimizability gap MΦexp,Hlin is characterized as below, which is less than the (Φexp,Hlin)- approximation error when B < + : MΦexp,Hlin = R Φexp,Hlin EX[2 η(x)(1 η(x))11/2 log η(x) 1 η(x) W x p+B] EX[max{η(x),1 η(x)}e (W x p+B)11/2 log η(x) 1 η(x) >W x p+B] EX[min{η(x),1 η(x)}e W x p+B11/2 log η(x) 1 η(x) >W x p+B] < R Φexp,Hlin EX[2 η(x)(1 η(x))] = R Φexp,Hlin R Φexp,Hall. Therefore, the inequality for B = + coincides with the consistency excess error bound known for the exponential loss (Zhang, 2004a; Mohri et al., 2018) but the one for B < + is distinct and novel. H-Consistency Bounds for Surrogate Loss Minimizers K.1.4. QUADRATIC LOSS For the quadratic loss Φquad(α) = (1 α)21α 1, for all h Hlin and x X: CΦquad(h,x,t) = tΦquad(h(x)) + (1 t)Φquad( h(x)) = t(1 h(x))21h(x) 1 + (1 t)(1 + h(x))21h(x) 1. inf h Hlin CΦquad(h,x,t) = 4t(1 t), 2t 1 W x p + B, max{t,1 t}(1 (W x p + B)) 2 + min{t,1 t}(1 + W x p + B) 2, 2t 1 > W x p + B. Therefore, the (Φquad,Hlin)-minimizability gap can be expressed as follows: MΦquad,Hlin = R Φquad,Hlin EX[4η(x)(1 η(x))1 2η(x) 1 W x p+B] EX[max{η(x),1 η(x)}(1 (W x p + B)) 21 2η(x) 1 >W x p+B] EX[min{η(x),1 η(x)}(1 + (W x p + B)) 21 2η(x) 1 >W x p+B] Note (Φquad,Hlin)-minimizability gap coincides with the (Φquad,Hlin)-approximation error R Φquad,Hlin EX[4η(x)(1 η(x))] for B 1. 2 < t 1, we have inf h Hlin h(x)<0CΦquad(h,x,t) = t + (1 t) inf x X inf h Hlin h(x)<0 CΦquad,Hlin(h,x,t) = inf x X( inf h Hlin h(x)<0CΦquad(h,x,t) inf h Hlin CΦquad(h,x,t)) 1 4t(1 t), 2t 1 W x p + B, 1 t(1 (W x p + B)) 2 (1 t)(1 + W x p + B) 2, 2t 1 > W x p + B. 1 4t(1 t), 2t 1 B, 1 t(1 B)2 (1 t)(1 + B)2, 2t 1 > B. where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = t2, t B, 2B t B2, t > B. By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the quadratic loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦquad is convex, non-decreasing, invertible and satisfies that TΦquad(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦquad(t) in Theorem 3, or equivalently Γ(t) = T 1 Φquad(t) = 2 , t > B2 , in Theorem 12, which are optimal. 
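Before stating the resulting bound, here is a minimal sketch (not code from the paper) of the transformation just defined together with its inverse Γ = T_Φquad^{-1}, whose closed form we take to be √s on [0, B²] and (s + B²)/(2B) beyond, since this is what inverts the piecewise definition of T above; the value B = 0.5 is an arbitrary choice for illustration.

import numpy as np

B = 0.5  # assumed toy bound, as above

def T_quad(t):
    # H_lin-estimation error transformation of the quadratic loss (epsilon = 0)
    return t ** 2 if t <= B else 2 * B * t - B ** 2

def T_quad_inv(s):
    # inverse used as Gamma in Theorem 12 (our reading of the piecewise form above)
    return np.sqrt(s) if s <= B ** 2 else (s + B ** 2) / (2 * B)

for t in np.linspace(0.0, 1.0, 11):
    assert abs(T_quad_inv(T_quad(t)) - t) < 1e-9  # round-trip check
print(T_quad(0.8), T_quad_inv(0.55))              # 0.55 and 0.8 for B = 0.5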
Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the quadratic loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin + Mℓ0 1,Hlin [RΦquad(h) R Φquad,Hlin + MΦquad,Hlin] 1 2 if RΦquad(h) R Φquad,Hlin B2 MΦquad,Hlin RΦquad(h) R Φquad,Hlin+MΦquad,Hlin 2 otherwise (32) H-Consistency Bounds for Surrogate Loss Minimizers Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φquad,Hlin)- minimizability gap coincides with the (Φquad,Hlin)-approximation error for B 1, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall [RΦquad(h) R Φquad,Hall] [RΦquad(h) R Φquad,Hlin + MΦquad,Hlin] 1 2 if RΦquad(h) R Φquad,Hlin B2 MΦquad,Hlin RΦquad(h) R Φquad,Hlin+MΦquad,Hlin 2 otherwise otherwise where the (Φquad,Hlin)-minimizability gap MΦquad,Hlin is characterized as below, which is less than the (Φquad,Hlin)- approximation error when B < 1: MΦquad,Hlin = R Φquad,Hlin EX[4η(x)(1 η(x))1 2η(x) 1 W x p+B] EX[max{η(x),1 η(x)}(1 (W x p + B)) 21 2η(x) 1 >W x p+B] EX[min{η(x),1 η(x)}(1 + (W x p + B)) 21 2η(x) 1 >W x p+B] R Φquad,Hlin EX[4η(x)(1 η(x))] = R Φquad,Hlin R Φquad,Hall. Therefore, the inequality for B 1 coincides with the consistency excess error bound known for the quadratic loss (Zhang, 2004a; Bartlett et al., 2006) but the one for B < 1 is distinct and novel. K.1.5. SIGMOID LOSS For the sigmoid loss Φsig(α) = 1 tanh(kα), k > 0, for all h Hlin and x X: CΦsig(h,x,t) = tΦsig(h(x)) + (1 t)Φsig( h(x)), = t(1 tanh(kh(x))) + (1 t)(1 + tanh(kh(x))). inf h Hlin CΦsig(h,x,t) = 1 1 2t tanh(k(W x p + B)) Therefore, the (Φsig,Hlin)-minimizability gap can be expressed as follows: MΦsig,Hlin = R Φsig,Hlin EX[ inf h Hlin CΦsig(h,x,η(x))] = R Φsig,Hlin EX[1 1 2η(x) tanh(k(W x p + B))]. (33) Note (Φsig,Hlin)-minimizability gap coincides with the (Φsig,Hlin)-approximation error R Φsig,Hlin EX[1 1 2η(x) ] for B = + . 2 < t 1, we have inf h Hlin h(x)<0CΦsig(h,x,t) = 1 1 2t tanh(0) inf x X inf h Hlin h(x)<0 CΦsig,Hlin(h,x,t) = inf x X( inf h Hlin h(x)<0CΦsig(h,x,t) inf h Hlin CΦsig(h,x,t)) = inf x X(2t 1)tanh(k(W x p + B)) = (2t 1)tanh(k B) H-Consistency Bounds for Surrogate Loss Minimizers where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = tanh(k B)t. By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the sigmoid loss is as follows: TΦsig = tanh(k B)t, t [0,1], Therefore, TΦsig is convex, non-decreasing, invertible and satisfies that TΦsig(0) = 0. By Theorem 4, we can choose Ψ(t) = tanh(k B)t in Theorem 3, or equivalently Γ(t) = t tanh(k B) in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the sigmoid loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin RΦsig(h) R Φsig,Hlin + MΦsig,Hlin tanh(k B) Mℓ0 1,Hlin. (34) Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φsig,Hlin)- minimizability gap coincides with the (Φsig,Hlin)-approximation error for B = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦsig(h) R Φsig,Hall if B = + 1 tanh(k B)[RΦsig(h) EX[1 1 2η(x) tanh(k(W x p + B))]] otherwise. (35) The inequality for B = + coincides with the consistency excess error bound known for the sigmoid loss (Zhang, 2004a; Bartlett et al., 2006; Mohri et al., 2018) but the one for B < + is distinct and novel. For B < + , we have EX[1 1 2η(x) tanh(k(W x p + B))] > EX[1 2η(x) 1 ] = 2EX[min{η(x),1 η(x)}] = R Φhinge,Hall. 
Therefore for B < + , RΦsig(h) EX[1 1 2η(x) tanh(k(W x p + B))] < RΦsig(h) R Φsig,Hall. Note that: R Φsig,Hall = 2R ℓ0 1,Hall = 2EX[min{η(x),1 η(x)}]. Thus, the first inequality (case B = + ) can be equivalently written as follows: h Hlin, Rℓ0 1(h) RΦsig(h) EX[min{η(x),1 η(x)}], which is a more informative upper bound than the standard inequality Rℓ0 1(h) RΦsig(h). K.1.6. ρ-MARGIN LOSS For the ρ-margin loss Φρ(α) = min{1,max{0,1 α ρ }}, ρ > 0, for all h Hlin and x X: CΦρ(h,x,t) = tΦρ(h(x)) + (1 t)Φρ( h(x)), = tmin{1,max{0,1 h(x) ρ }} + (1 t)min{1,max{0,1 + h(x) inf h Hlin CΦρ(h,x,t) = min{t,1 t} + max{t,1 t} 1 min{W x p + B,ρ} Therefore, the (Φρ,Hlin)-minimizability gap can be expressed as follows: MΦρ,Hlin = R Φρ,Hlin EX[ inf h Hlin CΦρ(h,x,η(x))] = R Φρ,Hlin EX min{η(x),1 η(x)} + max{η(x),1 η(x)} 1 min{W x p + B,ρ} H-Consistency Bounds for Surrogate Loss Minimizers Note the (Φρ,Hlin)-minimizability gap coincides with the (Φρ,Hlin)-approximation error R Φρ,Hlin EX[min{η(x),1 η(x)}] for B ρ. 2 < t 1, we have inf h Hlin h(x)<0CΦρ(h,x,t) = t + (1 t) 1 min{W x p + B,ρ} inf x X inf h Hlin h(x)<0 CΦρ,Hlin(h,x) = inf x X( inf h Hlin h(x)<0CΦρ(h,x,t) inf h Hlin CΦρ(h,x,t)) = inf x X(2t 1) min{W x p + B,ρ} = (2t 1)min{B,ρ} ρ = T(2t 1) where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = min{B,ρ} By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the ρ-margin loss is as follows: TΦρ = min{B,ρ} ρ t, t [0,1], Therefore, TΦρ is convex, non-decreasing, invertible and satisfies that TΦρ(0) = 0. By Theorem 4, we can choose Ψ(t) = min{B,ρ} ρ t in Theorem 3, or equivalently Γ(t) = ρ min{B,ρ} t in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the Hlin-consistency bound for the ρ-margin loss, valid for all h Hlin: Rℓ0 1(h) R ℓ0 1,Hlin ρ(RΦρ(h) R Φρ,Hlin + MΦρ,Hlin) min{B,ρ} Mℓ0 1,Hlin. (37) Since the (ℓ0 1,Hlin)-minimizability gap coincides with the (ℓ0 1,Hlin)-approximation error and (Φρ,Hlin)- minimizability gap coincides with the (Φρ,Hlin)-approximation error for B ρ, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦρ(h) R Φρ,Hall if B ρ ρ(RΦρ(h) EX[min{η(x),1 η(x)}+max{η(x),1 η(x)}(1 min{W x p+B,ρ} B otherwise. Note that: R Φρ,Hall = R ℓ0 1,Hall = EX[min{η(x),1 η(x)}]. Thus, the first inequality (case B ρ) can be equivalently written as follows: h Hlin, Rℓ0 1(h) RΦρ(h). (38) The case B ρ is one of the trivial cases mentioned in Section 4, where the trivial inequality Rℓ0 1(h) RΦρ(h) can be obtained directly using the fact that ℓ0 1 is upper bounded by Φρ. This, however, does not imply that non-adversarial Hlin-consistency bound for the ρ-margin loss is trivial when B > ρ since it is optimal. K.2. One-Hidden-Layer Re LU Neural Network As with the linear case, HNN also satisfies the condition of Lemma 1 and thus the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error: Mℓ0 1,HNN = R ℓ0 1,HNN EX[min{η(x),1 η(x)}] = R ℓ0 1,HNN R ℓ0 1,Hall. (39) By the definition of HNN, for any x X, {h(x) h HNN} = [ Λ(W x p + B),Λ(W x p + B)]. H-Consistency Bounds for Surrogate Loss Minimizers K.2.1. HINGE LOSS For the hinge loss Φhinge(α) = max{0,1 α}, for all h HNN and x X: CΦhinge(h,x,t) = tΦhinge(h(x)) + (1 t)Φhinge( h(x)) = tmax{0,1 h(x)} + (1 t)max{0,1 + h(x)}. inf h HNN CΦhinge(h,x,t) = 1 2t 1 min{ΛW x p + ΛB,1}. Therefore, the (Φhinge,HNN)-minimizability gap can be expressed as follows: MΦhinge,HNN = R Φhinge,HNN EX[1 inf h HNN CΦhinge(h,x,η(x))]. 
= R Φhinge,HNN EX[1 2η(x) 1 min{ΛW x p + ΛB,1}]. (40) Note the (Φhinge,HNN)-minimizability gap coincides with the (Φhinge,HNN)-approximation error R Φhinge,HNN EX[1 2η(x) 1 ] for ΛB 1. 2 < t 1, we have inf h HNN h(x)<0CΦhinge(h,x,t) = tmax{0,1 0} + (1 t)max{0,1 + 0} inf x X inf h HNN h(x)<0 CΦhinge,HNN(h,x,t) = inf x X{ inf h HNN h(x)<0CΦhinge(h,x,t) inf h HNN CΦhinge(h,x,t)} = inf x X(2t 1)min{ΛW x p + ΛB,1} = (2t 1)min{ΛB,1} where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = min{ΛB,1}t. By Definition 3, for any ϵ 0, the HNN-estimation error transformation of the hinge loss is as follows: TΦhinge = min{ΛB,1}t, t [0,1], Therefore, TΦhinge is convex, non-decreasing, invertible and satisfies that TΦhinge(0) = 0. By Theorem 4, we can choose Ψ(t) = min{ΛB,1}t in Theorem 3, or, equivalently, Γ(t) = t min{ΛB,1} in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the hinge loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN RΦhinge(h) R Φhinge,HNN + MΦhinge,HNN min{ΛB,1} Mℓ0 1,HNN. (41) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error and (Φhinge,HNN)- minimizability gap coincides with the (Φhinge,HNN)-approximation error for ΛB 1, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦhinge(h) R Φhinge,Hall if ΛB 1 1 ΛB [RΦhinge(h) EX[1 2η(x) 1 min{ΛW x p + ΛB,1} ]] otherwise. The inequality for ΛB 1 coincides with the consistency excess error bound known for the hinge loss (Zhang, 2004a; Bartlett et al., 2006; Mohri et al., 2018) but the one for ΛB < 1 is distinct and novel. For ΛB < 1, we have EX[1 2η(x) 1 min{ΛW x p + ΛB,1}] > EX[1 2η(x) 1 ] = 2EX[min{η(x),1 η(x)}] = R Φhinge,Hall. H-Consistency Bounds for Surrogate Loss Minimizers Therefore for ΛB < 1, RΦhinge(h) EX[1 2η(x) 1 min{ΛW x p + ΛB,1}] < RΦhinge(h) R Φhinge,Hall. Note that: R Φhinge,Hall = 2R ℓ0 1,Hall = 2EX[min{η(x),1 η(x)}]. Thus, the first inequality (case ΛB 1) can be equivalently written as follows: h HNN, Rℓ0 1(h) RΦhinge(h) EX[min{η(x),1 η(x)}], which is a more informative upper bound than the standard inequality Rℓ0 1(h) RΦhinge(h). K.2.2. LOGISTIC LOSS For the logistic loss Φlog(α) = log2(1 + e α), for all h HNN and x X: CΦlog(h, x, t) = tΦlog(h(x)) + (1 t)Φlog( h(x)), = t log2(1 + e h(x)) + (1 t) log2(1 + eh(x)). inf h HNN CΦlog(h, x, t) = { t log2(t) (1 t) log2(1 t) if log t 1 t ΛW x p + ΛB, max{t, 1 t} log2(1 + e (ΛW x p+ΛB)) + min{t, 1 t} log2(1 + eΛW x p+ΛB) if log t 1 t > ΛW x p + ΛB. Therefore, the (Φlog,HNN)-minimizability gap can be expressed as follows: MΦlog,HNN = R Φlog,HNN EX[ inf h HNN CΦlog(h,x,η(x))] = R Φlog,HNN EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))1log η(x) 1 η(x) ΛW x p+ΛB] EX[max{η(x),1 η(x)}log2(1 + e (ΛW x p+ΛB))1log η(x) 1 η(x) >ΛW x p+ΛB] EX[min{η(x),1 η(x)}log2(1 + eΛW x p+ΛB)1log η(x) 1 η(x) >ΛW x p+ΛB] Note (Φlog,HNN)-minimizability gap coincides with the (Φlog,HNN)-approximation error R Φlog,HNN EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))] for ΛB = + . 2 < t 1, we have inf h HNN h(x)<0CΦlog(h,x,t) = tlog2(1 + e 0) + (1 t)log2(1 + e0) inf x X inf h HNN h(x)<0 CΦlog,HNN(h,x,t) = inf x X( inf h HNN h(x)<0CΦlog(h,x,t) inf h HNN CΦlog(h,x,t)) 1 + tlog2(t) + (1 t)log2(1 t) if log t 1 t ΛW x p + ΛB, 1 tlog2(1 + e (ΛW x p+ΛB)) (1 t)log2(1 + eΛW x p+ΛB) if log t 1 t > ΛW x p + ΛB. 1 + tlog2(t) + (1 t)log2(1 t) if log t 1 t ΛB, 1 tlog2(1 + e ΛB) (1 t)log2(1 + eΛB) if log t 1 t > ΛB. 
where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = 2 log2(t + 1) + 1 t 2 log2(1 t), t eΛB 1 eΛB+1, 1 t+1 2 log2(1 + e ΛB) 1 t 2 log2(1 + eΛB), t > eΛB 1 H-Consistency Bounds for Surrogate Loss Minimizers By Definition 3, for any ϵ 0, the HNN-estimation error transformation of the logistic loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦlog is convex, non-decreasing, invertible and satisfies that TΦlog(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦlog(t) in Theorem 3, or equivalently Γ(t) = T 1 Φlog(t) in Theorem 12, which are optimal. To simplify the expression, using the fact that 2 log2(t + 1) + 1 t 2 log2(1 t) = 1 ( t + 1 2 log2(t + 1 2 log2(1 + e ΛB) 1 t 2 log2(1 + eΛB) = 1 2 log2( 4 2 + e ΛB + eΛB ) + 1/2log2( 1 + eΛB 1 + e ΛB )t, TΦlog can be lower bounded by 2 , t eΛB 1 eΛB+1)t, t > eΛB 1 Thus, we adopt an upper bound of T 1 Φlog as follows: T 1 Φlog(t) = eΛB 1)t, t > 1 Therefore, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the logistic loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN + Mℓ0 1,HNN 2(RΦlog(h) R Φlog,HNN + MΦlog,HNN) 1 2 , if RΦlog(h) R Φlog,HNN 1 eΛB+1) 2 MΦlog,HNN 2( eΛB+1 eΛB 1)(RΦlog(h) R Φlog,HNN + MΦlog,HNN), otherwise (43) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error and (Φlog,HNN)- minimizability gap coincides with the (Φlog,HNN)-approximation error for ΛB = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall 2[RΦlog(h) R Φlog,Hall] 1 2 if ΛB = + 2[RΦlog(h) R Φlog,HNN + MΦlog,HNN] 1 2 if RΦlog(h) R Φlog,HNN 1 eΛB+1) 2 MΦlog,HNN 2( eΛB+1 eΛB 1)(RΦlog(h) R Φlog,HNN + MΦlog,HNN) otherwise otherwise where the (Φlog,HNN)-minimizability gap MΦlog,HNN is characterized as below,which is less than the (Φlog,HNN)- H-Consistency Bounds for Surrogate Loss Minimizers approximation error when ΛB < + : MΦlog,HNN = R Φlog,HNN EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))1log2 η(x) 1 η(x) ΛW x p+ΛB] EX[max{η(x),1 η(x)}log2(1 + e (ΛW x p+ΛB))1log2 η(x) 1 η(x) >ΛW x p+ΛB] [min{η(x),1 η(x)}log2(1 + eΛW x p+ΛB)1log2 η(x) 1 η(x) >ΛW x p+ΛB] < R Φlog,HNN EX[ η(x)log2(η(x)) (1 η(x))log2(1 η(x))] = R Φlog,HNN R Φlog,Hall. Therefore, the inequality for ΛB = + coincides with the consistency excess error bound known for the logistic loss (Zhang, 2004a; Mohri et al., 2018) but the one for ΛB < + is distinct and novel. K.2.3. EXPONENTIAL LOSS For the exponential loss Φexp(α) = e α, for all h HNN and x X: CΦexp(h,x,t) = tΦexp(h(x)) + (1 t)Φexp( h(x)) = te h(x) + (1 t)eh(x). inf h HNN CΦexp(h,x,t) = t(1 t) if 1/2log t 1 t ΛW x p + ΛB max{t,1 t}e (ΛW x p+ΛB) + min{t,1 t}eΛW x p+ΛB if 1/2log t 1 t > ΛW x p + ΛB. Therefore, the (Φexp,HNN)-minimizability gap can be expressed as follows: MΦexp,HNN = R Φexp,HNN EX[ inf h HNN CΦexp(h,x,η(x))] = R Φexp,HNN EX[2 η(x)(1 η(x))11/2 log η(x) 1 η(x) ΛW x p+ΛB] EX[max{η(x),1 η(x)}e (ΛW x p+ΛB)11/2 log η(x) 1 η(x) >ΛW x p+ΛB] EX[min{η(x),1 η(x)}eΛW x p+ΛB11/2 log η(x) 1 η(x) >ΛW x p+ΛB]. Note (Φexp,HNN)-minimizability gap coincides with the (Φexp,HNN)-approximation error R Φexp,HNN η(x)(1 η(x))] for ΛB = + . 2 < t 1, we have inf h HNN h(x)<0CΦexp(h,x,t) = te 0 + (1 t)e0 inf x X inf h HNN h(x)<0 CΦexp,HNN(h,x,t) = inf x X( inf h HNN h(x)<0CΦexp(h,x,t) inf h HNN CΦexp(h,x,t)) t(1 t) if 1/2log t 1 t ΛW x p + ΛB, 1 te (ΛW x p+ΛB) (1 t)eΛW x p+ΛB if 1/2log t 1 t > ΛW x p + ΛB. 
t(1 t), 1/2log t 1 t ΛB 1 te ΛB (1 t)eΛB, 1/2log t 1 t > ΛB where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = 1 t2, t e2ΛB 1 e2ΛB+1, 1 t+1 2 eΛB, t > e2ΛB 1 H-Consistency Bounds for Surrogate Loss Minimizers By Definition 3, for any ϵ 0, the Hlin-estimation error transformation of the exponential loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦexp is convex, non-decreasing, invertible and satisfies that TΦexp(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦexp(t) in Theorem 3, or equivalently Γ(t) = T 1 Φexp(t) in Theorem 12, which are optimal. To simplify the expression, using the fact that 2 eΛB = 1 1/2eΛB 1/2e ΛB + eΛB e ΛB TΦexp can be lower bounded by 2 , t e2ΛB 1 1 2( e2ΛB 1 e2ΛB+1)t, t > e2ΛB 1 Thus, we adopt an upper bound of T 1 Φexp as follows: T 1 Φexp(t) = e2B+1 ) 2 , e2ΛB 1)t, t > 1 e2ΛB+1) 2 . Therefore, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the exponential loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN + Mℓ0 1,HNN 2(RΦexp(h) R Φexp,HNN + MΦexp,HNN) 1 2 , if RΦexp(h) R Φexp,HNN 1 e2ΛB+1) 2 MΦexp,HNN, e2ΛB 1)(RΦexp(h) R Φexp,HNN + MΦexp,HNN), otherwise. (45) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error and (Φlog,HNN)- minimizability gap coincides with the (Φlog,HNN)-approximation error for ΛB = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall 2[RΦexp(h) R Φexp,Hall] 2[RΦexp(h) R Φexp,HNN + MΦexp,HNN] 1 2 if RΦexp(h) R Φexp,HNN 1 e2ΛB+1) 2 MΦexp,HNN 2( e2ΛB+1 e2ΛB 1)(RΦexp(h) R Φexp,HNN + MΦexp,HNN) o/w o/w where the (Φexp,HNN)-minimizability gap MΦexp,HNN is characterized as below, which is less than the (Φexp,HNN)- approximation error when ΛB < + : MΦexp,HNN = R Φexp,HNN EX[2 η(x)(1 η(x))11/2 log2 η(x) 1 η(x) ΛW x p+ΛB] EX[max{η(x),1 η(x)}e (ΛW x p+ΛB)11/2 log2 η(x) 1 η(x) >ΛW x p+ΛB] [min{η(x),1 η(x)}eΛW x p+ΛB11/2 log2 η(x) 1 η(x) >ΛW x p+ΛB] < R Φexp,HNN EX[2 η(x)(1 η(x))] = R Φexp,HNN R Φexp,Hall. Therefore, the inequality for ΛB = + coincides with the consistency excess error bound known for the exponential loss (Zhang, 2004a; Mohri et al., 2018) but the one for ΛB < + is distinct and novel. H-Consistency Bounds for Surrogate Loss Minimizers K.2.4. QUADRATIC LOSS For the quadratic loss Φquad(α) = (1 α)21α 1, for all h HNN and x X: CΦquad(h,x,t) = tΦquad(h(x)) + (1 t)Φquad( h(x)) = t(1 h(x))21h(x) 1 + (1 t)(1 + h(x))21h(x) 1. inf h HNN CΦquad(h,x,t) 4t(1 t), 2t 1 ΛW x p + ΛB, max{t,1 t}(1 (ΛW x p + ΛB)) 2 + min{t,1 t}(1 + ΛW x p + ΛB) 2, 2t 1 > ΛW x p + ΛB. Therefore, the (Φquad,HNN)-minimizability gap can be expressed as follows: MΦquad,HNN = R Φquad,HNN EX[4η(x)(1 η(x))1 2η(x) 1 ΛW x p+ΛB] EX[max{η(x),1 η(x)}(1 (ΛW x p + ΛB)) 21 2η(x) 1 >ΛW x p+ΛB] EX[min{η(x),1 η(x)}(1 + (ΛW x p + ΛB)) 21 2η(x) 1 >ΛW x p+ΛB] Note (Φquad,HNN)-minimizability gap coincides with the (Φquad,HNN)-approximation error R Φquad,HNN EX[4η(x)(1 η(x))] for ΛB 1. 2 < t 1, we have inf h HNN h(x)<0 CΦquad(h, x, t) = t + (1 t) = 1 inf x X inf h HNN h(x)<0 CΦquad,HNN(h, x, t) = inf x X( inf h HNN h(x)<0 CΦquad(h, x, t) inf h HNN CΦquad(h, x, t)) 1 4t(1 t), 2t 1 ΛW x p + ΛB, 1 t(1 (ΛW x p + ΛB)) 2 (1 t)(1 + ΛW x p + ΛB) 2, 2t 1 > ΛW x p + ΛB. = {1 4t(1 t), 2t 1 ΛB, 1 t(1 ΛB)2 (1 t)(1 + ΛB)2, 2t 1 > ΛB. where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = t2, t ΛB, 2ΛB t (ΛB)2, t > ΛB. 
By Definition 3, for any ϵ 0, the HNN-estimation error transformation of the quadratic loss is as follows: T(t), t [ϵ,1], T(ϵ) ϵ t, t [0,ϵ). Therefore, when ϵ = 0, TΦquad is convex, non-decreasing, invertible and satisfies that TΦquad(0) = 0. By Theorem 4, we can choose Ψ(t) = TΦquad(t) in Theorem 3, or equivalently Γ(t) = T 1 Φquad(t) = 2 , t > (ΛB)2 , in Theorem 12, which are optimal. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the quadratic H-Consistency Bounds for Surrogate Loss Minimizers loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN + Mℓ0 1,HNN [RΦquad(h) R Φquad,HNN + MΦquad,HNN] 1 2 if RΦquad(h) R Φquad,HNN (ΛB)2 MΦquad,HNN RΦquad(h) R Φquad,HNN+MΦquad,HNN 2 otherwise (47) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error and (Φquad,HNN)- minimizability gap coincides with the (Φquad,HNN)-approximation error for ΛB 1, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall [RΦquad(h) R Φquad,Hall] 1 2 if ΛB 1 [RΦquad(h) R Φquad,HNN + MΦquad,HNN] 1 2 if RΦquad(h) R Φquad,HNN (ΛB)2 MΦquad,HNN RΦquad(h) R Φquad,HNN+MΦquad,HNN 2 otherwise otherwise where the (Φquad,HNN)-minimizability gap MΦquad,HNN is characterized as below, which is less than the (Φquad,HNN)- approximation error when ΛB < 1: MΦquad,HNN = R Φquad,HNN EX[4η(x)(1 η(x))1 2η(x) 1 ΛW x p+ΛB] EX[max{η(x),1 η(x)}(1 (ΛW x p + ΛB)) 21 2η(x) 1 >ΛW x p+ΛB] EX[min{η(x),1 η(x)}(1 + (ΛW x p + ΛB)) 21 2η(x) 1 >ΛW x p+ΛB] < R Φquad,HNN EX[4η(x)(1 η(x))] = R Φquad,HNN R Φquad,Hall. Therefore, the inequality for ΛB 1 coincides with the consistency excess error bound known for the quadratic loss (Zhang, 2004a; Bartlett et al., 2006) but the one for ΛB < 1 is distinct and novel. K.2.5. SIGMOID LOSS For the sigmoid loss Φsig(α) = 1 tanh(kα), k > 0, for all h HNN and x X: CΦsig(h,x,t) = tΦsig(h(x)) + (1 t)Φsig( h(x)), = t(1 tanh(kh(x))) + (1 t)(1 + tanh(kh(x))). inf h HNN CΦsig(h,x,t) = 1 1 2t tanh(k(ΛW x p + ΛB)) Therefore, the (Φsig,HNN)-minimizability gap can be expressed as follows: MΦsig,HNN = R Φsig,HNN EX[ inf h HNN CΦsig(h,x,η(x))] = R Φsig,HNN EX[1 1 2η(x) tanh(k(ΛW x p + ΛB))]. (48) Note (Φsig,HNN)-minimizability gap coincides with the (Φsig,HNN)-approximation error R Φsig,HNN EX[1 1 2η(x) ] for ΛB = + . H-Consistency Bounds for Surrogate Loss Minimizers 2 < t 1, we have inf h HNN h(x)<0CΦsig(h,x,t) = 1 1 2t tanh(0) inf x X inf h HNN h(x)<0 CΦsig,HNN(h,x,t) = inf x X( inf h HNN h(x)<0CΦsig(h,x,t) inf h HNN CΦsig(h,x,t)) = inf x X(2t 1)tanh(k(ΛW x p + ΛB)) = (2t 1)tanh(kΛB) where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = tanh(kΛB)t. By Definition 3, for any ϵ 0, the HNN-estimation error transformation of the sigmoid loss is as follows: TΦsig = tanh(kΛB)t, t [0,1], Therefore, TΦsig is convex, non-decreasing, invertible and satisfies that TΦsig(0) = 0. By Theorem 4, we can choose Ψ(t) = tanh(kΛB)t in Theorem 3, or equivalently Γ(t) = t tanh(kΛB) in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the sigmoid loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN RΦsig(h) R Φsig,HNN + MΦsig,HNN tanh(kΛB) Mℓ0 1,HNN. 
(49) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error, and since (Φsig,HNN)- minimizability gap coincides with the (Φsig,HNN)-approximation error for ΛB = + , the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦsig(h) R Φsig,Hall if ΛB = + 1 tanh(kΛB)[RΦsig(h) EX[1 1 2η(x) tanh(k(ΛW x p + ΛB)) ]] otherwise. The inequality for ΛB = + coincides with the consistency excess error bound known for the sigmoid loss (Zhang, 2004a; Bartlett et al., 2006; Mohri et al., 2018) but the one for ΛB < + is distinct and novel. For ΛB < + , we have EX[1 1 2η(x) tanh(k(ΛW x p + ΛB))] > EX[1 2η(x) 1 ] = 2EX[min{η(x),1 η(x)}] = R Φhinge,Hall. Therefore for ΛB < + , RΦsig(h) EX[1 1 2η(x) tanh(k(ΛW x p + ΛB))] < RΦsig(h) R Φsig,Hall. Note that: R Φsig,Hall = 2R ℓ0 1,Hall = 2EX[min{η(x),1 η(x)}]. Thus, the first inequality (case ΛB = + ) can be equivalently written as follows: h HNN, Rℓ0 1(h) RΦsig(h) EX[min{η(x),1 η(x)}], which is a more informative upper bound than the standard inequality Rℓ0 1(h) RΦsig(h). K.2.6. ρ-MARGIN LOSS For the ρ-margin loss Φρ(α) = min{1,max{0,1 α ρ }}, ρ > 0, for all h HNN and x X: CΦρ(h,x,t) = tΦρ(h(x)) + (1 t)Φρ( h(x)), = tmin{1,max{0,1 h(x) ρ }} + (1 t)min{1,max{0,1 + h(x) inf h HNN CΦρ(h,x,t) = min{t,1 t} + max{t,1 t} 1 min{ΛW x p + ΛB,ρ} H-Consistency Bounds for Surrogate Loss Minimizers Therefore, the (Φρ,HNN)-minimizability gap can be expressed as follows: MΦρ,HNN = R Φρ,HNN EX[ inf h HNN CΦρ(h,x,η(x))] = R Φρ,HNN EX min{η(x),1 η(x)} + max{η(x),1 η(x)} 1 min{ΛW x p + ΛB,ρ} Note the (Φρ,HNN)-minimizability gap coincides with the (Φρ,HNN)-approximation error R Φρ,HNN EX[min{η(x),1 η(x)}] for ΛB ρ. 2 < t 1, we have inf h HNN h(x)<0CΦρ(h,x,t) = t + (1 t) 1 min{ΛW x p + ΛB,ρ} inf x X inf h HNN h(x)<0 CΦρ,HNN(h,x) = inf x X( inf h HNN h(x)<0CΦρ(h,x,t) inf h HNN CΦρ(h,x,t)) = inf x X(2t 1) min{ΛW x p + ΛB,ρ} = (2t 1)min{ΛB,ρ} ρ = T(2t 1) where T is the increasing and convex function on [0,1] defined by t [0,1], T(t) = min{ΛB,ρ} By Definition 3, for any ϵ 0, the HNN-estimation error transformation of the ρ-margin loss is as follows: TΦρ = min{ΛB,ρ} ρ t, t [0,1], Therefore, TΦρ is convex, non-decreasing, invertible and satisfies that TΦρ(0) = 0. By Theorem 4, we can choose Ψ(t) = min{ΛB,ρ} ρ t in Theorem 3, or equivalently Γ(t) = ρ min{ΛB,ρ} t in Theorem 12, which are optimal when ϵ = 0. Thus, by Theorem 3 or Theorem 12, setting ϵ = 0 yields the HNN-consistency bound for the ρ-margin loss, valid for all h HNN: Rℓ0 1(h) R ℓ0 1,HNN ρ(RΦρ(h) R Φρ,HNN + MΦρ,HNN) min{ΛB,ρ} Mℓ0 1,HNN. (51) Since the (ℓ0 1,HNN)-minimizability gap coincides with the (ℓ0 1,HNN)-approximation error and (Φρ,HNN)- minimizability gap coincides with the (Φρ,HNN)-approximation error for ΛB ρ, the inequality can be rewritten as follows: Rℓ0 1(h) R ℓ0 1,Hall RΦρ(h) R Φρ,Hall if ΛB ρ ρ(RΦρ(h) EX[min{η(x),1 η(x)}+max{η(x),1 η(x)}(1 min{ΛW x p+ΛB,ρ} ΛB otherwise. Note that: R Φρ,Hall = R ℓ0 1,Hall = EX[min{η(x),1 η(x)}]. Thus, the first inequality (case ΛB ρ) can be equivalently written as follows: h HNN, Rℓ0 1(h) RΦρ(h). The case ΛB ρ is one of the trivial cases mentioned in Section 4, where the trivial inequality Rℓ0 1(h) RΦρ(h) can be obtained directly using the fact that ℓ0 1 is upper bounded by Φρ. This, however, does not imply that non-adversarial HNN-consistency bound for the ρ-margin loss is trivial when ΛB > ρ since it is optimal. H-Consistency Bounds for Surrogate Loss Minimizers L. Derivation of Adversarial H-Consistency Bounds L.1. 
Linear Hypotheses By the definition of Hlin, for any x X, hγ(x) = w x γ w q + b [ W x p γW B,W x p γW + B] x p γ [ W x p γW B,B] x p < γ , hγ(x) = w x + γ w q + b [ W x p + γW B,W x p + γW + B] x p γ [ B,W x p + γW + B] x p < γ . Note Hlin is symmetric. For any x X, there exist w = 0 and any 0 < b B such that w x γ w q + b > 0. Thus by Lemma 2, for any x X, C ℓγ,Hlin(x) = min{η(x),1 η(x)}. The (ℓγ,Hlin)-minimizability gap can be expressed as follows: Mℓγ,Hlin = R ℓγ,Hlin EX[min{η(x),1 η(x)}]. (52) L.1.1. SUPREMUM-BASED ρ-MARGIN LOSS For the supremum-based ρ-margin loss Φρ = sup x x x p γ Φρ(yh(x )), where Φρ(α) = min{1,max{0,1 α ρ }}, ρ > 0, for all h Hlin and x X: C Φρ(h,x,t) = t Φρ(h(x)) + (1 t) Φρ( h(x)) = tΦρ(hγ(x)) + (1 t)Φρ( hγ(x)) = tmin{1,max{0,1 hγ(x) ρ }} + (1 t)min{1,max{0,1 + hγ(x) inf h Hlin C Φρ(h,x,t) = max{t,1 t} 1 min{W max{ x p,γ} γW + B,ρ} ρ + min{t,1 t}. Therefore, the ( Φρ,Hlin)-minimizability gap can be expressed as follows: M Φρ,Hlin = R Φρ,Hlin EX[ inf h Hlin C Φρ(h,x,η(x))] = R Φρ,Hlin EX max{η(x),1 η(x)} 1 min{W max{ x p,γ} γW + B,ρ} EX[min{η(x),1 η(x)}]. H-Consistency Bounds for Surrogate Loss Minimizers 2 < t 1, we have inf h Hlin hγ(x) 0 hγ(x) C Φρ(h,x,t) = t + (1 t) inf x X inf h Hlin hγ(x) 0 hγ(x) C Φρ,Hlin(h,x,t) = inf x X inf h Hlin hγ(x) 0 hγ(x) C Φρ(h,x,t) inf h Hlin C Φρ(h,x,t) min{W max{ x p,γ} γW + B,ρ} where T1 is the increasing and convex function on [0,1] defined by t [0,1], T1(t) = min{B,ρ} inf h Hlin hγ(x)<0 C Φρ(h,x,t) = t + (1 t) 1 min{W max{ x p,γ} γW + B,ρ} inf x X inf h Hlin hγ(x)<0 C Φρ,Hlin(h,x,t) = inf x X{ inf h Hlin hγ(x)<0 C Φρ(h,x,t) inf h Hlin C Φρ(h,x,t)} = inf x X(2t 1) min{W max{ x p,γ} γW + B,ρ} = (2t 1)min{B,ρ} ρ = T2(2t 1), where T2 is the increasing and convex function on [0,1] defined by t [0,1], T2(t) = min{B,ρ} By Definition 5, for any ϵ 0, the adversarial Hlin-estimation error transformation of the supremum-based ρ-margin loss is as follows: T Φρ = min{B,ρ} ρ t, t [0,1], Therefore, T1 = T2 and T Φρ is convex, non-decreasing, invertible and satisfies that T Φρ(0) = 0. By Theorem 6, we can choose Ψ(t) = min{B,ρ} ρ t in Theorem 5, or equivalently Γ(t) = ρ min{B,ρ} t in Theorem 13, which are optimal when ϵ = 0. Thus, by Theorem 5 or Theorem 13, setting ϵ = 0 yields the adversarial Hlin-consistency bound for the supremum-based ρ-margin loss, valid for all h Hlin: Rℓγ(h) R ℓγ,Hlin ρ(R Φρ(h) R Φρ,Hlin + M Φρ,Hlin) min{B,ρ} Mℓγ,Hlin. (54) Mℓγ,Hlin = R ℓγ,Hlin EX[min{η(x),1 η(x)}], M Φρ,Hlin = R Φρ,Hlin EX max{η(x),1 η(x)} 1 min{W max{ x p,γ} γW + B,ρ} EX[min{η(x),1 η(x)}], H-Consistency Bounds for Surrogate Loss Minimizers inequality (54) can be rewritten as follows: R Φρ(h) EX[max{η(x),1 η(x)}(1 min{W max{ x p,γ} γW +B,ρ} ρ )] if B ρ ρ(R Φρ(h) EX[max{η(x),1 η(x)}(1 min{W max{ x p,γ} γW +B,ρ} min{B,ρ} +(1 ρ min{B,ρ})EX[min{η(x),1 η(x)}] otherwise. Note that: min{W max{ x p,γ} γW + B,ρ} = ρ if B ρ. Thus, the first inequality (case B ρ) can be equivalently written as follows: h Hlin, Rℓγ(h) R Φρ(h). (56) The case B ρ is one of the trivial cases mentioned in Section 4, where the trivial inequality Rℓγ(h) R Φρ(h) can be obtained directly using the fact that ℓγ is upper bounded by Φρ. This, however, does not imply that adversarial Hlin-consistency bound for the supremum-based ρ-margin loss is trivial when B > ρ since it is optimal. L.2. One-Hidden-Layer Re LU Neural Networks By the definition of HNN, for any x X, hγ(x) = inf x x x p γ n j=1 uj(wj x + b)+ hγ(x) = sup x x x p γ n j=1 uj(wj x + b)+ Note HNN is symmetric. 
L.2. One-Hidden-Layer ReLU Neural Networks

By the definition of $\mathcal H_{\rm NN}$, for any $x\in\mathcal X$,
\[
\underline h_\gamma(x)=\inf_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+,\qquad
\overline h_\gamma(x)=\sup_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+.
\]
Note that $\mathcal H_{\rm NN}$ is symmetric. For any $x\in\mathcal X$, choosing $u$ with $\|u\|_1\le\Lambda$ and $\sum_j u_j>0$, $w_j=0$ and any $0<b\le B$ gives $\underline h_\gamma(x)>0$. Thus, by Lemma 2, for any $x\in\mathcal X$, $\mathcal C^*_{\ell_\gamma,\mathcal H_{\rm NN}}(x)=\min\{\eta(x),1-\eta(x)\}$. The $(\ell_\gamma,\mathcal H_{\rm NN})$-minimizability gap can be expressed as follows:
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]. \tag{57}
\]

L.2.1. SUPREMUM-BASED ρ-MARGIN LOSS

For the supremum-based ρ-margin loss $\widetilde\Phi_\rho(h,x,y)=\sup_{\|x'-x\|_p\le\gamma}\Phi_\rho(yh(x'))$, where $\Phi_\rho(\alpha)=\min\{1,\max\{0,1-\alpha/\rho\}\}$, $\rho>0$, for all $h\in\mathcal H_{\rm NN}$ and $x\in\mathcal X$:
\[
\mathcal C_{\widetilde\Phi_\rho}(h,x,t)=t\,\Phi_\rho(\underline h_\gamma(x))+(1-t)\,\Phi_\rho(-\overline h_\gamma(x))
=t\min\{1,\max\{0,1-\underline h_\gamma(x)/\rho\}\}+(1-t)\min\{1,\max\{0,1+\overline h_\gamma(x)/\rho\}\},
\]
\[
\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_\rho}(h,x,t)=\max\{t,1-t\}\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr)+\min\{t,1-t\}.
\]
Therefore, the $(\widetilde\Phi_\rho,\mathcal H_{\rm NN})$-minimizability gap can be expressed as follows:
\[
\mathcal M_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}=\mathcal R^*_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}-\mathbb{E}_X\Bigl[\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_\rho}(h,x,\eta(x))\Bigr]
=\mathcal R^*_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}-\mathbb{E}_X\Bigl[\max\{\eta(x),1-\eta(x)\}\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr)\Bigr]-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}].
\]
For $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\mathcal C_{\widetilde\Phi_\rho}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\Delta\mathcal C_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}t\,\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho
=t\,\min\Bigl\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho=\mathcal T_1(t),
\]
where $\mathcal T_1$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_1(t)=\frac{\min\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\}}{\rho}t$; and
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_\rho}(h,x,t)=t+(1-t)\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr),
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\Delta\mathcal C_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}(2t-1)\,\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho
=(2t-1)\,\min\Bigl\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho=\mathcal T_2(2t-1),
\]
where $\mathcal T_2$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_2(t)=\frac{\min\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\}}{\rho}t$. By Definition 5, for any $\epsilon\ge0$, the adversarial $\mathcal H_{\rm NN}$-estimation error transformation of the supremum-based ρ-margin loss is as follows:
\[
\mathcal T_{\widetilde\Phi_\rho}(t)=\frac{\min\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\}}{\rho}\,t,\qquad t\in[0,1].
\]
Therefore, $\mathcal T_1=\mathcal T_2$ and $\mathcal T_{\widetilde\Phi_\rho}$ is convex, non-decreasing, invertible and satisfies $\mathcal T_{\widetilde\Phi_\rho}(0)=0$. By Theorem 6, we can choose $\Psi(t)=\frac{\min\{\inf_{x}\sup_{h}\underline h_\gamma(x),\rho\}}{\rho}t$ in Theorem 5, or equivalently $\Gamma(t)=\frac{\rho}{\min\{\inf_{x}\sup_{h}\underline h_\gamma(x),\rho\}}t$ in Theorem 13, which are optimal when $\epsilon=0$. Thus, by Theorem 5 or Theorem 13, setting $\epsilon=0$ yields the adversarial $\mathcal H_{\rm NN}$-consistency bound for the supremum-based ρ-margin loss, valid for all $h\in\mathcal H_{\rm NN}$:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}\le\frac{\rho\bigl(\mathcal R_{\widetilde\Phi_\rho}(h)-\mathcal R^*_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}+\mathcal M_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}\bigr)}{\min\{\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\}}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}. \tag{59}
\]
Observe that
\[
\inf_{x\in\mathcal X}\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\ge\sup_{h\in\mathcal H_{\rm NN}}\inf_{x\in\mathcal X}\underline h_\gamma(x)
=\sup_{\|u\|_1\le\Lambda,\|w_j\|_q\le W,|b|\le B}\;\inf_{x\in\mathcal X}\;\inf_{\|s\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x+w_j\cdot s+b)_+
\ge\sup_{\|u\|_1\le\Lambda,|b|\le B}\sum_{j=1}^n u_j(b)_+=\Lambda B,
\]
where the last inequality follows by restricting to $w_j=0$. Thus, the inequality can be relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}\le\frac{\rho\bigl(\mathcal R_{\widetilde\Phi_\rho}(h)-\mathcal R^*_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}+\mathcal M_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}\bigr)}{\min\{\Lambda B,\rho\}}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}. \tag{60}
\]
Using the expressions
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],\qquad
\mathcal M_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}=\mathcal R^*_{\widetilde\Phi_\rho,\mathcal H_{\rm NN}}-\mathbb{E}_X\Bigl[\max\{\eta(x),1-\eta(x)\}\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr)\Bigr]-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],
\]
inequality (59) can be rewritten as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le
\begin{cases}
\mathcal R_{\widetilde\Phi_\rho}(h)-\mathbb{E}_X\Bigl[\max\{\eta(x),1-\eta(x)\}\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr)\Bigr] & \text{if }\Lambda B\ge\rho\\[2pt]
\frac{\rho}{\min\{\Lambda B,\rho\}}\Bigl(\mathcal R_{\widetilde\Phi_\rho}(h)-\mathbb{E}_X\Bigl[\max\{\eta(x),1-\eta(x)\}\Bigl(1-\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),\rho\Bigr\}/\rho\Bigr)\Bigr]\Bigr)+\Bigl(1-\frac{\rho}{\min\{\Lambda B,\rho\}}\Bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}] & \text{otherwise.}
\end{cases}
\]
Observe that
\[
\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)=\sup_{\|u\|_1\le\Lambda,\|w_j\|_q\le W,|b|\le B}\;\inf_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+
\le\inf_{\|x'-x\|_p\le\gamma}\;\sup_{\|u\|_1\le\Lambda,\|w_j\|_q\le W,|b|\le B}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+
=\inf_{\|x'-x\|_p\le\gamma}\Lambda(W\|x'\|_p+B)
=\begin{cases}\Lambda(W\|x\|_p-\gamma W+B) & \|x\|_p\ge\gamma\\ \Lambda B & \|x\|_p<\gamma\end{cases}
=\Lambda\bigl(W\max\{\|x\|_p,\gamma\}-\gamma W+B\bigr).
\]
Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le
\begin{cases}
\mathcal R_{\widetilde\Phi_\rho}(h)-\mathbb{E}_X\bigl[\max\{\eta(x),1-\eta(x)\}\bigl(1-\min\{\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B),\rho\}/\rho\bigr)\bigr] & \text{if }\Lambda B\ge\rho\\[2pt]
\frac{\rho}{\min\{\Lambda B,\rho\}}\bigl(\mathcal R_{\widetilde\Phi_\rho}(h)-\mathbb{E}_X\bigl[\max\{\eta(x),1-\eta(x)\}\bigl(1-\min\{\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B),\rho\}/\rho\bigr)\bigr]\bigr)+\bigl(1-\frac{\rho}{\min\{\Lambda B,\rho\}}\bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}] & \text{otherwise.}
\end{cases}
\tag{61}
\]
Note that the relaxed adversarial $\mathcal H_{\rm NN}$-consistency bounds (60) and (61) for the supremum-based ρ-margin loss are identical to the bounds (54) and (55) in the linear case, respectively, modulo the replacement of $B$ by $\Lambda B$.

M. Derivation of Non-Adversarial $\mathcal H_{\rm all}$-Consistency Bounds under Massart's Noise Condition

Under Massart's noise condition, we introduce a modified $\mathcal H$-estimation error transformation. We assume that $\epsilon=0$ throughout this section.

Proposition 1. Under Massart's noise condition with $\beta$, the modified $\mathcal H$-estimation error transformation of $\Phi$ for $\epsilon=0$ is defined on $t\in[0,1]$ by
\[
\mathcal T^M_\Phi(t)=\mathcal T(t)\,1_{t\in[2\beta,1]}+\frac{\mathcal T(2\beta)}{2\beta}\,t\,1_{t\in[0,2\beta)},
\]
with $\mathcal T(t)$ defined in Definition 3. Suppose that $\mathcal H$ satisfies the condition of Lemma 1 and $\overline{\mathcal T}^M_\Phi$ is any lower bound of $\mathcal T^M_\Phi$ such that $\overline{\mathcal T}^M_\Phi\le\mathcal T^M_\Phi$. If $\overline{\mathcal T}^M_\Phi$ is convex with $\overline{\mathcal T}^M_\Phi(0)=0$, then, for any hypothesis $h\in\mathcal H$ and any distribution under Massart's noise condition with $\beta$,
\[
\overline{\mathcal T}^M_\Phi\bigl(\mathcal R_{\ell_{0-1}}(h)-\mathcal R^*_{\ell_{0-1},\mathcal H}+\mathcal M_{\ell_{0-1},\mathcal H}\bigr)\le\mathcal R_\Phi(h)-\mathcal R^*_{\Phi,\mathcal H}+\mathcal M_{\Phi,\mathcal H}.
\]
Proof. Note that condition (13) in Theorem 8 is symmetric about $\eta(x)=\frac12$. Thus, requiring condition (13) to hold uniformly over all distributions under Massart's noise condition with $\beta$ is equivalent to requiring that, for any $t\in[\frac12+\beta,1]$,
\[
\Psi(2t-1-\epsilon)\le\inf_{x\in\mathcal X,\,h\in\mathcal H:\,h(x)<0}\Delta\mathcal C_{\Phi,\mathcal H}(h,x,t). \tag{62}
\]
It is clear that any lower bound $\overline{\mathcal T}^M_\Phi$ of the modified $\mathcal H$-estimation error transformation satisfies condition (62). The result then follows from Theorem 8.

M.1. Quadratic Loss

For the quadratic loss $\Phi_{\rm quad}(\alpha)=(1-\alpha)^2\,1_{\alpha\le1}$, for all $h\in\mathcal H_{\rm all}$ and $x\in\mathcal X$:
\[
\mathcal C_{\Phi_{\rm quad}}(h,x,t)=t\,\Phi_{\rm quad}(h(x))+(1-t)\,\Phi_{\rm quad}(-h(x))
=t(1-h(x))^2 1_{h(x)\le1}+(1-t)(1+h(x))^2 1_{h(x)\ge-1},\qquad
\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\rm quad}}(h,x,t)=4t(1-t),
\]
\[
\mathcal M_{\Phi_{\rm quad},\mathcal H_{\rm all}}=\mathcal R^*_{\Phi_{\rm quad},\mathcal H_{\rm all}}-\mathbb{E}_X\Bigl[\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\rm quad}}(h,x,\eta(x))\Bigr]=\mathcal R^*_{\Phi_{\rm quad},\mathcal H_{\rm all}}-\mathbb{E}_X[4\eta(x)(1-\eta(x))].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\mathcal C_{\Phi_{\rm quad}}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\Delta\mathcal C_{\Phi_{\rm quad},\mathcal H_{\rm all}}(h,x,t)
=\inf_{x\in\mathcal X}\bigl(1-4t(1-t)\bigr)=1-4t(1-t)=(2t-1)^2=\mathcal T(2t-1),
\]
where $\mathcal T$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T(t)=t^2$. By Proposition 1, for $\epsilon=0$, the modified $\mathcal H_{\rm all}$-estimation error transformation of the quadratic loss under Massart's noise condition with $\beta$ is as follows:
\[
\mathcal T^M_{\Phi_{\rm quad}}(t)=
\begin{cases}
2\beta\,t, & t\in[0,2\beta)\\
t^2, & t\in[2\beta,1].
\end{cases}
\]
Therefore, $\mathcal T^M_{\Phi_{\rm quad}}$ is convex, non-decreasing, invertible and satisfies $\mathcal T^M_{\Phi_{\rm quad}}(0)=0$. By Proposition 1, we obtain the $\mathcal H_{\rm all}$-consistency bound for the quadratic loss, valid for all $h\in\mathcal H_{\rm all}$ such that $\mathcal R_{\Phi_{\rm quad}}(h)-\mathcal R^*_{\Phi_{\rm quad},\mathcal H_{\rm all}}\le\mathcal T(2\beta)=4\beta^2$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_{0-1}}(h)-\mathcal R^*_{\ell_{0-1},\mathcal H_{\rm all}}\le\frac{\mathcal R_{\Phi_{\rm quad}}(h)-\mathcal R^*_{\Phi_{\rm quad},\mathcal H_{\rm all}}}{2\beta}. \tag{63}
\]
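For illustration, the short sketch below (with an assumed value of β, not from the paper's experiments) evaluates the modified transformation $\mathcal T^M_{\Phi_{\rm quad}}$ and the linear bound (63) that it yields in the low-surrogate-error regime.

```python
# Illustrative sketch (assumed beta): the modified H_all-estimation error transformation
# of the quadratic loss under Massart's noise, T^M(t) = 2*beta*t on [0, 2*beta) and t^2
# on [2*beta, 1], and the linear bound (63) valid when the surrogate excess <= 4*beta^2.
import numpy as np

def modified_transform_quadratic(t, beta):
    t = np.asarray(t, dtype=float)
    return np.where(t < 2 * beta, 2 * beta * t, t ** 2)

def zero_one_excess_bound_quadratic(surrogate_excess, beta):
    assert surrogate_excess <= 4 * beta ** 2   # regime where bound (63) applies
    return surrogate_excess / (2 * beta)

if __name__ == "__main__":
    beta = 0.25
    print(modified_transform_quadratic(np.linspace(0, 1, 6), beta))  # linear then quadratic
    print(zero_one_excess_bound_quadratic(0.2, beta))                # 0.2 / (2*0.25) = 0.4
```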
M.2. Logistic Loss

For the logistic loss $\Phi_{\log}(\alpha)=\log_2(1+e^{-\alpha})$, for all $h\in\mathcal H_{\rm all}$ and $x\in\mathcal X$:
\[
\mathcal C_{\Phi_{\log}}(h,x,t)=t\,\Phi_{\log}(h(x))+(1-t)\,\Phi_{\log}(-h(x))=t\log_2(1+e^{-h(x)})+(1-t)\log_2(1+e^{h(x)}),\qquad
\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\log}}(h,x,t)=-t\log_2(t)-(1-t)\log_2(1-t),
\]
\[
\mathcal M_{\Phi_{\log},\mathcal H_{\rm all}}=\mathcal R^*_{\Phi_{\log},\mathcal H_{\rm all}}-\mathbb{E}_X\Bigl[\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\log}}(h,x,\eta(x))\Bigr]=\mathcal R^*_{\Phi_{\log},\mathcal H_{\rm all}}-\mathbb{E}_X\bigl[-\eta(x)\log_2(\eta(x))-(1-\eta(x))\log_2(1-\eta(x))\bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\mathcal C_{\Phi_{\log}}(h,x,t)=t\log_2(1+e^{-0})+(1-t)\log_2(1+e^{0})=1,
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\Delta\mathcal C_{\Phi_{\log},\mathcal H_{\rm all}}(h,x,t)
=\inf_{x\in\mathcal X}\bigl(1+t\log_2(t)+(1-t)\log_2(1-t)\bigr)=1+t\log_2(t)+(1-t)\log_2(1-t)=\mathcal T(2t-1),
\]
where $\mathcal T$ is the increasing and convex function on $[0,1]$ defined by
\[
\mathcal T(t)=\frac{1+t}{2}\log_2(1+t)+\frac{1-t}{2}\log_2(1-t).
\]
By Proposition 1, for $\epsilon=0$, the modified $\mathcal H_{\rm all}$-estimation error transformation of the logistic loss under Massart's noise condition with $\beta$ is as follows:
\[
\mathcal T^M_{\Phi_{\log}}(t)=
\begin{cases}
\mathcal T(t), & t\in[2\beta,1]\\
\frac{\mathcal T(2\beta)}{2\beta}\,t, & t\in[0,2\beta).
\end{cases}
\]
Therefore, $\mathcal T^M_{\Phi_{\log}}$ is convex, non-decreasing, invertible and satisfies $\mathcal T^M_{\Phi_{\log}}(0)=0$. By Proposition 1, we obtain the $\mathcal H_{\rm all}$-consistency bound for the logistic loss, valid for all $h\in\mathcal H_{\rm all}$ such that $\mathcal R_{\Phi_{\log}}(h)-\mathcal R^*_{\Phi_{\log},\mathcal H_{\rm all}}\le\mathcal T(2\beta)=\frac{1+2\beta}{2}\log_2(1+2\beta)+\frac{1-2\beta}{2}\log_2(1-2\beta)$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_{0-1}}(h)-\mathcal R^*_{\ell_{0-1},\mathcal H_{\rm all}}\le\frac{2\beta\bigl(\mathcal R_{\Phi_{\log}}(h)-\mathcal R^*_{\Phi_{\log},\mathcal H_{\rm all}}\bigr)}{\frac{1+2\beta}{2}\log_2(1+2\beta)+\frac{1-2\beta}{2}\log_2(1-2\beta)}. \tag{64}
\]

M.3. Exponential Loss

For the exponential loss $\Phi_{\exp}(\alpha)=e^{-\alpha}$, for all $h\in\mathcal H_{\rm all}$ and $x\in\mathcal X$:
\[
\mathcal C_{\Phi_{\exp}}(h,x,t)=t\,\Phi_{\exp}(h(x))+(1-t)\,\Phi_{\exp}(-h(x))=te^{-h(x)}+(1-t)e^{h(x)},\qquad
\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\exp}}(h,x,t)=2\sqrt{t(1-t)},
\]
\[
\mathcal M_{\Phi_{\exp},\mathcal H_{\rm all}}=\mathcal R^*_{\Phi_{\exp},\mathcal H_{\rm all}}-\mathbb{E}_X\Bigl[\inf_{h\in\mathcal H_{\rm all}}\mathcal C_{\Phi_{\exp}}(h,x,\eta(x))\Bigr]=\mathcal R^*_{\Phi_{\exp},\mathcal H_{\rm all}}-\mathbb{E}_X\bigl[2\sqrt{\eta(x)(1-\eta(x))}\bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\mathcal C_{\Phi_{\exp}}(h,x,t)=te^{-0}+(1-t)e^{0}=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm all}\\ h(x)<0}}\Delta\mathcal C_{\Phi_{\exp},\mathcal H_{\rm all}}(h,x,t)
=\inf_{x\in\mathcal X}\bigl(1-2\sqrt{t(1-t)}\bigr)=1-2\sqrt{t(1-t)}=\mathcal T(2t-1),
\]
where $\mathcal T$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T(t)=1-\sqrt{1-t^2}$. By Proposition 1, for $\epsilon=0$, the modified $\mathcal H_{\rm all}$-estimation error transformation of the exponential loss under Massart's noise condition with $\beta$ is as follows:
\[
\mathcal T^M_{\Phi_{\exp}}(t)=
\begin{cases}
\mathcal T(t), & t\in[2\beta,1]\\
\frac{\mathcal T(2\beta)}{2\beta}\,t, & t\in[0,2\beta).
\end{cases}
\]
Therefore, $\mathcal T^M_{\Phi_{\exp}}$ is convex, non-decreasing, invertible and satisfies $\mathcal T^M_{\Phi_{\exp}}(0)=0$. By Proposition 1, we obtain the $\mathcal H_{\rm all}$-consistency bound for the exponential loss, valid for all $h\in\mathcal H_{\rm all}$ such that $\mathcal R_{\Phi_{\exp}}(h)-\mathcal R^*_{\Phi_{\exp},\mathcal H_{\rm all}}\le\mathcal T(2\beta)=1-\sqrt{1-4\beta^2}$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_{0-1}}(h)-\mathcal R^*_{\ell_{0-1},\mathcal H_{\rm all}}\le\frac{2\beta\bigl(\mathcal R_{\Phi_{\exp}}(h)-\mathcal R^*_{\Phi_{\exp},\mathcal H_{\rm all}}\bigr)}{1-\sqrt{1-4\beta^2}}.
\]
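The next sketch (with illustrative values of β only) compares the multiplicative factors $2\beta/\mathcal T(2\beta)$ that appear in the Massart-noise bounds for the logistic loss (64) and the exponential loss above.

```python
# Small numerical sketch (our own illustration): the linear Massart-noise bound factors
# 2*beta / T(2*beta), with T_log(t) = ((1+t)/2)*log2(1+t) + ((1-t)/2)*log2(1-t)
# for the logistic loss and T_exp(t) = 1 - sqrt(1 - t^2) for the exponential loss.
import numpy as np

def T_logistic(t):
    t = np.asarray(t, dtype=float)
    # small guard inside log2 avoids log2(0) when t is exactly 1
    return ((1 + t) / 2) * np.log2(1 + t) + ((1 - t) / 2) * np.log2(np.maximum(1 - t, 1e-12))

def T_exponential(t):
    t = np.asarray(t, dtype=float)
    return 1.0 - np.sqrt(1.0 - t ** 2)

if __name__ == "__main__":
    for beta in (0.05, 0.1, 0.25, 0.45):
        print(beta, 2 * beta / T_logistic(2 * beta), 2 * beta / T_exponential(2 * beta))
```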
N. Derivation of Adversarial H-Consistency Bounds under Massart's Noise Condition

As in the non-adversarial scenario of Section 5.5, we introduce a modified adversarial $\mathcal H$-estimation error transformation. We assume that $\epsilon=0$ throughout this section.

Proposition 2. Under Massart's noise condition with $\beta$, the modified adversarial $\mathcal H$-estimation error transformation of $\widetilde\Phi$ for $\epsilon=0$ is defined on $t\in[0,1]$ by $\mathcal T^M_{\widetilde\Phi}(t)=\min\{\mathcal T^M_1(t),\mathcal T^M_2(t)\}$, where
\[
\mathcal T^M_1(t)=\mathcal T_1(t)\,1_{t\in[\frac12+\beta,1]}+\frac{2}{1+2\beta}\,\mathcal T_1\bigl(\tfrac12+\beta\bigr)\,t\,1_{t\in[0,\frac12+\beta)},\qquad
\mathcal T^M_2(t)=\mathcal T_2(t)\,1_{t\in[2\beta,1]}+\frac{\mathcal T_2(2\beta)}{2\beta}\,t\,1_{t\in[0,2\beta)},
\]
with $\mathcal T_1(t)$ and $\mathcal T_2(t)$ defined in Definition 5. Suppose that $\mathcal H$ is symmetric and $\overline{\mathcal T}^M_{\widetilde\Phi}$ is any lower bound of $\mathcal T^M_{\widetilde\Phi}$ such that $\overline{\mathcal T}^M_{\widetilde\Phi}\le\mathcal T^M_{\widetilde\Phi}$. If $\overline{\mathcal T}^M_{\widetilde\Phi}$ is convex with $\overline{\mathcal T}^M_{\widetilde\Phi}(0)=0$, then, for any hypothesis $h\in\mathcal H$ and any distribution under Massart's noise condition with $\beta$,
\[
\overline{\mathcal T}^M_{\widetilde\Phi}\bigl(\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H}+\mathcal M_{\ell_\gamma,\mathcal H}\bigr)\le\mathcal R_{\widetilde\Phi}(h)-\mathcal R^*_{\widetilde\Phi,\mathcal H}+\mathcal M_{\widetilde\Phi,\mathcal H}.
\]
Proof. Note that condition (17) in Theorem 10 is symmetric about $\eta(x)=\frac12$. Thus, requiring condition (17) to hold uniformly over all distributions under Massart's noise condition with $\beta$ is equivalent to requiring that, for any $t\in[\frac12+\beta,1]$,
\[
\Psi(t-\epsilon)\le\inf_{x\in\mathcal X,\,h\in\mathcal H:\,\underline h_\gamma(x)\le0\le\overline h_\gamma(x)}\Delta\mathcal C_{\widetilde\Phi,\mathcal H}(h,x,t),\qquad
\Psi(2t-1-\epsilon)\le\inf_{x\in\mathcal X,\,h\in\mathcal H:\,\overline h_\gamma(x)<0}\Delta\mathcal C_{\widetilde\Phi,\mathcal H}(h,x,t). \tag{66}
\]
It is clear that any lower bound $\overline{\mathcal T}^M_{\widetilde\Phi}$ of the modified adversarial $\mathcal H$-estimation error transformation satisfies condition (66). The result then follows from Theorem 10.

N.1. Linear Hypotheses

As in Section L.1, for any $x\in\mathcal X$, $\underline h_\gamma(x)=w\cdot x-\gamma\|w\|_q+b$ and $\overline h_\gamma(x)=w\cdot x+\gamma\|w\|_q+b$. The hypothesis set $\mathcal H_{\rm lin}$ is symmetric and, for any $x\in\mathcal X$, taking $w=0$ and any $0<b\le B$ gives $\underline h_\gamma(x)>0$. Thus, by Lemma 2, for any $x\in\mathcal X$, $\mathcal C^*_{\ell_\gamma,\mathcal H_{\rm lin}}(x)=\min\{\eta(x),1-\eta(x)\}$ and the $(\ell_\gamma,\mathcal H_{\rm lin})$-minimizability gap can be expressed as $\mathcal M_{\ell_\gamma,\mathcal H_{\rm lin}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm lin}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]$.

N.1.1. SUPREMUM-BASED HINGE LOSS

For the supremum-based hinge loss $\widetilde\Phi_{\rm hinge}(h,x,y)=\sup_{\|x'-x\|_p\le\gamma}\Phi_{\rm hinge}(yh(x'))$, where $\Phi_{\rm hinge}(\alpha)=\max\{0,1-\alpha\}$, for all $h\in\mathcal H_{\rm lin}$ and $x\in\mathcal X$:
\[
\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=t\,\Phi_{\rm hinge}(\underline h_\gamma(x))+(1-t)\,\Phi_{\rm hinge}(-\overline h_\gamma(x))
=t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\overline h_\gamma(x)\}
\ge t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\underline h_\gamma(x)\},
\]
\[
\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\ge\inf_{h\in\mathcal H_{\rm lin}}\bigl[t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\underline h_\gamma(x)\}\bigr]=1-|2t-1|\min\{W\max\{\|x\|_p,\gamma\}-\gamma W+B,1\},
\]
\[
\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=\inf_{h\in\mathcal H_{\rm lin}}\bigl[t\max\{0,1-w\cdot x+\gamma\|w\|_q-b\}+(1-t)\max\{0,1+w\cdot x+\gamma\|w\|_q+b\}\bigr]
\le\inf_{b\in[-B,B]}\bigl[t\max\{0,1-b\}+(1-t)\max\{0,1+b\}\bigr]=1-|2t-1|\min\{B,1\},
\]
\[
\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}=\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}-\mathbb{E}\Bigl[\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,\eta(x))\Bigr]
\le\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}-\mathbb{E}\bigl[1-|2\eta(x)-1|\min\{W\max\{\|x\|_p,\gamma\}-\gamma W+B,1\}\bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\Delta\mathcal C_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{1-\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\min\{B,1\}=(2t-1)\min\{B,1\},
\]
which is at least $\mathcal T_1(t)$ for $t\in[\frac12+\beta,1]$, where $\mathcal T_1$ is the increasing and convex function on $[0,1]$ defined by
\[
\mathcal T_1(t)=
\begin{cases}
\min\{B,1\}\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\min\{B,1\}(2t-1), & t\in[\frac12+\beta,1];
\end{cases}
\]
\[
\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\ge\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\bigl[t\max\{0,1-\overline h_\gamma(x)\}+(1-t)\max\{0,1+\overline h_\gamma(x)\}\bigr]=t\max\{0,1-0\}+(1-t)\max\{0,1+0\}=1,
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\Delta\mathcal C_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)-\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\min\{B,1\}=(2t-1)\min\{B,1\}=\mathcal T_2(2t-1),
\]
where $\mathcal T_2$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_2(t)=\min\{B,1\}\,t$. By Proposition 2, for $\epsilon=0$, the modified adversarial $\mathcal H_{\rm lin}$-estimation error transformation of the supremum-based hinge loss under Massart's noise condition with $\beta$ is lower bounded as follows:
\[
\mathcal T^M_{\widetilde\Phi_{\rm hinge}}\ge\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}=\min\{\mathcal T_1,\mathcal T_2\}=
\begin{cases}
\min\{B,1\}\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\min\{B,1\}(2t-1), & t\in[\frac12+\beta,1].
\end{cases}
\]
Note that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}$ is convex, non-decreasing, invertible and satisfies $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}(0)=0$.
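Before applying Proposition 2, the following minimal sketch (with assumed values of β and B, not from the paper) evaluates the piecewise-linear lower bound $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}$ just constructed; it is continuous at $t=\frac12+\beta$, convex and vanishes at 0.

```python
# Minimal sketch (assumed beta and B): the lower bound on the modified adversarial
# H_lin-estimation error transformation for the supremum-based hinge loss,
# min{B,1}*(4*beta/(1+2*beta))*t on [0, 1/2+beta) and min{B,1}*(2t-1) on [1/2+beta, 1].
import numpy as np

def modified_adv_transform_hinge(t, beta, B):
    c = min(B, 1.0)
    t = np.asarray(t, dtype=float)
    return np.where(t < 0.5 + beta, c * 4 * beta / (1 + 2 * beta) * t, c * (2 * t - 1))

if __name__ == "__main__":
    beta, B = 0.25, 2.0
    print(modified_adv_transform_hinge(np.linspace(0.0, 1.0, 7), beta, B))
```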
By Proposition 2, using the fact that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}(t)\ge\min\{B,1\}\frac{4\beta}{1+2\beta}t$ yields the adversarial $\mathcal H_{\rm lin}$-consistency bound for the supremum-based hinge loss, valid for all $h\in\mathcal H_{\rm lin}$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm lin}}\le\frac{(1+2\beta)\bigl(\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)-\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}+\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}\bigr)}{4\beta\min\{B,1\}}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm lin}}. \tag{67}
\]
Using
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm lin}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm lin}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],\qquad
\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}\le\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm lin}}-\mathbb{E}\bigl[1-|2\eta(x)-1|\min\{W\max\{\|x\|_p,\gamma\}-\gamma W+B,1\}\bigr],
\]
the inequality can be relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)}{4\beta\min\{B,1\}}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\min\{B,1\}}\,\mathbb{E}\bigl[1-|2\eta(x)-1|\min\{W\max\{\|x\|_p,\gamma\}-\gamma W+B,1\}\bigr].
\]
Note that $\min\{W\max\{\|x\|_p,\gamma\}-\gamma W+B,1\}\le1$ and $1-|1-2\eta(x)|=2\min\{\eta(x),1-\eta(x)\}$. Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)}{4\beta\min\{B,1\}}-\Bigl(\frac{1+2\beta}{2\beta\min\{B,1\}}-1\Bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}].
\]
When $B\ge1$, it can be equivalently written as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{1+2\beta}{4\beta}\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)-\frac{1}{2\beta}\,\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]. \tag{68}
\]

N.1.2. SUPREMUM-BASED SIGMOID LOSS

For the supremum-based sigmoid loss $\widetilde\Phi_{\rm sig}(h,x,y)=\sup_{\|x'-x\|_p\le\gamma}\Phi_{\rm sig}(yh(x'))$, where $\Phi_{\rm sig}(\alpha)=1-\tanh(k\alpha)$, $k>0$, for all $h\in\mathcal H_{\rm lin}$ and $x\in\mathcal X$:
\[
\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)=t\,\Phi_{\rm sig}(\underline h_\gamma(x))+(1-t)\,\Phi_{\rm sig}(-\overline h_\gamma(x))
=t\bigl(1-\tanh(k\underline h_\gamma(x))\bigr)+(1-t)\bigl(1+\tanh(k\overline h_\gamma(x))\bigr)
\ge\max\bigl\{1+(1-2t)\tanh(k\underline h_\gamma(x)),\,1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr\},
\]
\[
\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\ge\max\Bigl\{\inf_{h\in\mathcal H_{\rm lin}}\bigl[1+(1-2t)\tanh(k\underline h_\gamma(x))\bigr],\,\inf_{h\in\mathcal H_{\rm lin}}\bigl[1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr]\Bigr\}
=1-|1-2t|\tanh\bigl(k(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr),
\]
\[
\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)=\inf_{h\in\mathcal H_{\rm lin}}\bigl[t\bigl(1-\tanh(k(w\cdot x-\gamma\|w\|_q+b))\bigr)+(1-t)\bigl(1+\tanh(k(w\cdot x+\gamma\|w\|_q+b))\bigr)\bigr]
\le\inf_{b\in[-B,B]}\bigl[t(1-\tanh(kb))+(1-t)(1+\tanh(kb))\bigr]=\max\{t,1-t\}(1-\tanh(kB))+\min\{t,1-t\}(1+\tanh(kB))=1-|1-2t|\tanh(kB),
\]
\[
\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}=\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}-\mathbb{E}\Bigl[\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,\eta(x))\Bigr]
\le\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}-\mathbb{E}\bigl[1-|1-2\eta(x)|\tanh\bigl(k(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\Delta\mathcal C_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{1-\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\tanh(kB)=(2t-1)\tanh(kB),
\]
which is at least $\mathcal T_1(t)$ for $t\in[\frac12+\beta,1]$, where $\mathcal T_1$ is the increasing and convex function on $[0,1]$ defined by
\[
\mathcal T_1(t)=
\begin{cases}
\tanh(kB)\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\tanh(kB)(2t-1), & t\in[\frac12+\beta,1];
\end{cases}
\]
\[
\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\ge\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\bigl[1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr]\ge1,
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\Delta\mathcal C_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{\inf_{\substack{h\in\mathcal H_{\rm lin}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)-\inf_{h\in\mathcal H_{\rm lin}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\tanh(kB)=(2t-1)\tanh(kB)=\mathcal T_2(2t-1),
\]
where $\mathcal T_2$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_2(t)=\tanh(kB)\,t$. By Proposition 2, for $\epsilon=0$, the modified adversarial $\mathcal H_{\rm lin}$-estimation error transformation of the supremum-based sigmoid loss under Massart's noise condition with $\beta$ is lower bounded as follows:
\[
\mathcal T^M_{\widetilde\Phi_{\rm sig}}\ge\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}=\min\{\mathcal T_1,\mathcal T_2\}=
\begin{cases}
\tanh(kB)\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\tanh(kB)(2t-1), & t\in[\frac12+\beta,1].
\end{cases}
\]
Note that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}$ is convex, non-decreasing, invertible and satisfies $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}(0)=0$. By Proposition 2, using the fact that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}(t)\ge\tanh(kB)\frac{4\beta}{1+2\beta}t$ yields the adversarial $\mathcal H_{\rm lin}$-consistency bound for the supremum-based sigmoid loss, valid for all $h\in\mathcal H_{\rm lin}$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm lin}}\le\frac{(1+2\beta)\bigl(\mathcal R_{\widetilde\Phi_{\rm sig}}(h)-\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}+\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}\bigr)}{4\beta\tanh(kB)}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm lin}}. \tag{69}
\]
Using
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm lin}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm lin}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],\qquad
\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}\le\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm lin}}-\mathbb{E}\bigl[1-|1-2\eta(x)|\tanh\bigl(k(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\bigr],
\]
the inequality can be relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)}{4\beta\tanh(kB)}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\tanh(kB)}\,\mathbb{E}\bigl[1-|1-2\eta(x)|\tanh\bigl(k(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\bigr].
\]
Note that $\tanh\bigl(k(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\le1$ and $1-|1-2\eta(x)|=2\min\{\eta(x),1-\eta(x)\}$. Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)}{4\beta\tanh(kB)}-\Bigl(\frac{1+2\beta}{2\beta\tanh(kB)}-1\Bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}].
\]
When $B=+\infty$, it can be equivalently written as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{1+2\beta}{4\beta}\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)-\frac{1}{2\beta}\,\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]. \tag{70}
\]
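As a small numerical illustration (the parameter values are ours, not the paper's), the sketch below compares the multiplicative factors $(1+2\beta)/(4\beta\min\{B,1\})$ and $(1+2\beta)/(4\beta\tanh(kB))$ appearing in the Massart-noise bounds (67) and (69) for the supremum-based hinge and sigmoid losses.

```python
# Small numeric comparison (illustrative beta, B, k): the multiplicative factors in the
# adversarial H_lin Massart-noise bounds (67) and (69).
import numpy as np

def hinge_factor(beta, B):
    return (1 + 2 * beta) / (4 * beta * min(B, 1.0))

def sigmoid_factor(beta, B, k):
    return (1 + 2 * beta) / (4 * beta * np.tanh(k * B))

if __name__ == "__main__":
    beta, B, k = 0.2, 0.8, 2.0
    print(hinge_factor(beta, B), sigmoid_factor(beta, B, k))
```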
N.2. One-Hidden-Layer ReLU Neural Networks

As in Section L.2, for any $x\in\mathcal X$, $\underline h_\gamma(x)=\inf_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+$ and $\overline h_\gamma(x)=\sup_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+$. The hypothesis set $\mathcal H_{\rm NN}$ is symmetric and, for any $x\in\mathcal X$, choosing $u$ with $\|u\|_1\le\Lambda$ and $\sum_j u_j>0$, $w_j=0$ and any $0<b\le B$ gives $\underline h_\gamma(x)>0$. Thus, by Lemma 2, for any $x\in\mathcal X$, $\mathcal C^*_{\ell_\gamma,\mathcal H_{\rm NN}}(x)=\min\{\eta(x),1-\eta(x)\}$ and the $(\ell_\gamma,\mathcal H_{\rm NN})$-minimizability gap can be expressed as $\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]$.

N.2.1. SUPREMUM-BASED HINGE LOSS

For the supremum-based hinge loss $\widetilde\Phi_{\rm hinge}(h,x,y)=\sup_{\|x'-x\|_p\le\gamma}\Phi_{\rm hinge}(yh(x'))$, where $\Phi_{\rm hinge}(\alpha)=\max\{0,1-\alpha\}$, for all $h\in\mathcal H_{\rm NN}$ and $x\in\mathcal X$:
\[
\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\overline h_\gamma(x)\}
\ge t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\underline h_\gamma(x)\},
\]
\[
\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\ge\inf_{h\in\mathcal H_{\rm NN}}\bigl[t\max\{0,1-\underline h_\gamma(x)\}+(1-t)\max\{0,1+\underline h_\gamma(x)\}\bigr]=1-|2t-1|\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),1\Bigr\},
\]
\[
\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\le\inf_{h\in\mathcal H_{\rm NN}:\,w_j=0}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=1-|2t-1|\min\{\Lambda B,1\},
\]
\[
\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}=\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,\eta(x))\Bigr]
\le\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[1-|2\eta(x)-1|\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),1\Bigr\}\Bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\Delta\mathcal C_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{1-\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\min\{\Lambda B,1\}=(2t-1)\min\{\Lambda B,1\},
\]
which is at least $\mathcal T_1(t)$ for $t\in[\frac12+\beta,1]$, where $\mathcal T_1$ is the increasing and convex function on $[0,1]$ defined by
\[
\mathcal T_1(t)=
\begin{cases}
\min\{\Lambda B,1\}\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\min\{\Lambda B,1\}(2t-1), & t\in[\frac12+\beta,1];
\end{cases}
\]
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\ge\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\bigl[t\max\{0,1-\overline h_\gamma(x)\}+(1-t)\max\{0,1+\overline h_\gamma(x)\}\bigr]=t\max\{0,1-0\}+(1-t)\max\{0,1+0\}=1,
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\Delta\mathcal C_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)-\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm hinge}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\min\{\Lambda B,1\}=(2t-1)\min\{\Lambda B,1\}=\mathcal T_2(2t-1),
\]
where $\mathcal T_2$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_2(t)=\min\{\Lambda B,1\}\,t$. By Proposition 2, for $\epsilon=0$, the modified adversarial $\mathcal H_{\rm NN}$-estimation error transformation of the supremum-based hinge loss under Massart's noise condition with $\beta$ is lower bounded as follows:
\[
\mathcal T^M_{\widetilde\Phi_{\rm hinge}}\ge\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}=\min\{\mathcal T_1,\mathcal T_2\}=
\begin{cases}
\min\{\Lambda B,1\}\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\min\{\Lambda B,1\}(2t-1), & t\in[\frac12+\beta,1].
\end{cases}
\]
Note that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}$ is convex, non-decreasing, invertible and satisfies $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}(0)=0$. By Proposition 2, using the fact that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm hinge}}(t)\ge\min\{\Lambda B,1\}\frac{4\beta}{1+2\beta}t$ yields the adversarial $\mathcal H_{\rm NN}$-consistency bound for the supremum-based hinge loss, valid for all $h\in\mathcal H_{\rm NN}$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}\le\frac{(1+2\beta)\bigl(\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)-\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}+\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}\bigr)}{4\beta\min\{\Lambda B,1\}}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}. \tag{71}
\]
Using
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],\qquad
\mathcal M_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}\le\mathcal R^*_{\widetilde\Phi_{\rm hinge},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[1-|2\eta(x)-1|\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),1\Bigr\}\Bigr],
\]
the inequality can be relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)}{4\beta\min\{\Lambda B,1\}}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\min\{\Lambda B,1\}}\,\mathbb{E}\Bigl[1-|2\eta(x)-1|\min\Bigl\{\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x),1\Bigr\}\Bigr].
\]
Observe that
\[
\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)=\sup_{\|u\|_1\le\Lambda,\|w_j\|_q\le W,|b|\le B}\;\inf_{\|x'-x\|_p\le\gamma}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+
\le\inf_{\|x'-x\|_p\le\gamma}\;\sup_{\|u\|_1\le\Lambda,\|w_j\|_q\le W,|b|\le B}\sum_{j=1}^n u_j(w_j\cdot x'+b)_+
=\inf_{\|x'-x\|_p\le\gamma}\Lambda(W\|x'\|_p+B)=\Lambda\bigl(W\max\{\|x\|_p,\gamma\}-\gamma W+B\bigr).
\]
Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)}{4\beta\min\{\Lambda B,1\}}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\min\{\Lambda B,1\}}\,\mathbb{E}\bigl[1-|2\eta(x)-1|\min\{\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B),1\}\bigr].
\]
Note that $\min\{\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B),1\}\le1$ and $1-|1-2\eta(x)|=2\min\{\eta(x),1-\eta(x)\}$. Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)}{4\beta\min\{\Lambda B,1\}}-\Bigl(\frac{1+2\beta}{2\beta\min\{\Lambda B,1\}}-1\Bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]. \tag{72}
\]
When $\Lambda B\ge1$, it can be equivalently written as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{1+2\beta}{4\beta}\,\mathcal R_{\widetilde\Phi_{\rm hinge}}(h)-\frac{1}{2\beta}\,\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}].
\]
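The relaxation step above rests on the pointwise inequality $\sum_j u_j(w_j\cdot x'+b)_+\le\Lambda(W\|x'\|_p+B)$. The following hedged sanity check (our own code, specialized to $p=q=2$ and random networks satisfying the norm constraints) verifies this inequality numerically; taking the infimum of its right-hand side over the $\gamma$-ball around $x$ gives $\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B)$.

```python
# Hedged sanity check (illustration only): for random one-hidden-layer ReLU networks
# with ||u||_1 <= Lambda, ||w_j||_2 <= W, |b| <= B, verify the pointwise bound
# h(x') <= Lambda * (W * ||x'||_2 + B) used in the relaxation above.
import numpy as np

rng = np.random.default_rng(0)
Lambda, W, B, dim, n_units = 1.0, 1.0, 0.5, 3, 4   # assumed norm bounds

for _ in range(1000):
    u = rng.normal(size=n_units)
    u *= Lambda / np.abs(u).sum()                      # ||u||_1 = Lambda
    w = rng.normal(size=(n_units, dim))
    w *= W / np.linalg.norm(w, axis=1, keepdims=True)  # ||w_j||_2 = W
    b = rng.uniform(-B, B)
    x_prime = rng.normal(size=dim) * 3.0               # arbitrary perturbed input
    value = u @ np.maximum(w @ x_prime + b, 0.0)       # h(x') for this network
    assert value <= Lambda * (W * np.linalg.norm(x_prime) + B) + 1e-9
print("pointwise bound h(x') <= Lambda*(W*||x'|| + B) verified on random networks")
```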
N.2.2. SUPREMUM-BASED SIGMOID LOSS

For the supremum-based sigmoid loss $\widetilde\Phi_{\rm sig}(h,x,y)=\sup_{\|x'-x\|_p\le\gamma}\Phi_{\rm sig}(yh(x'))$, where $\Phi_{\rm sig}(\alpha)=1-\tanh(k\alpha)$, $k>0$, for all $h\in\mathcal H_{\rm NN}$ and $x\in\mathcal X$:
\[
\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)=t\bigl(1-\tanh(k\underline h_\gamma(x))\bigr)+(1-t)\bigl(1+\tanh(k\overline h_\gamma(x))\bigr)
\ge\max\bigl\{1+(1-2t)\tanh(k\underline h_\gamma(x)),\,1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr\},
\]
\[
\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\ge\max\Bigl\{\inf_{h\in\mathcal H_{\rm NN}}\bigl[1+(1-2t)\tanh(k\underline h_\gamma(x))\bigr],\,\inf_{h\in\mathcal H_{\rm NN}}\bigl[1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr]\Bigr\}
=1-|1-2t|\tanh\Bigl(k\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\Bigr),
\]
\[
\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\le\max\{t,1-t\}(1-\tanh(k\Lambda B))+\min\{t,1-t\}(1+\tanh(k\Lambda B))=1-|1-2t|\tanh(k\Lambda B),
\]
\[
\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}=\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,\eta(x))\Bigr]
\le\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[1-|1-2\eta(x)|\tanh\Bigl(k\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\Bigr)\Bigr].
\]
Thus, for $\frac12<t\le1$, we have
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)=t+(1-t)=1,\qquad
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \underline h_\gamma(x)\le0\le\overline h_\gamma(x)}}\Delta\mathcal C_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{1-\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\tanh(k\Lambda B)=(2t-1)\tanh(k\Lambda B),
\]
which is at least $\mathcal T_1(t)$ for $t\in[\frac12+\beta,1]$, where $\mathcal T_1$ is the increasing and convex function on $[0,1]$ defined by
\[
\mathcal T_1(t)=
\begin{cases}
\tanh(k\Lambda B)\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\tanh(k\Lambda B)(2t-1), & t\in[\frac12+\beta,1];
\end{cases}
\]
\[
\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\ge\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\bigl[1+(1-2t)\tanh(k\overline h_\gamma(x))\bigr]\ge1,
\]
\[
\inf_{x\in\mathcal X}\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\Delta\mathcal C_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}(h,x,t)
=\inf_{x\in\mathcal X}\Bigl\{\inf_{\substack{h\in\mathcal H_{\rm NN}\\ \overline h_\gamma(x)<0}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)-\inf_{h\in\mathcal H_{\rm NN}}\mathcal C_{\widetilde\Phi_{\rm sig}}(h,x,t)\Bigr\}\ge\inf_{x\in\mathcal X}(2t-1)\tanh(k\Lambda B)=(2t-1)\tanh(k\Lambda B)=\mathcal T_2(2t-1),
\]
where $\mathcal T_2$ is the increasing and convex function on $[0,1]$ defined by $\mathcal T_2(t)=\tanh(k\Lambda B)\,t$. By Proposition 2, for $\epsilon=0$, the modified adversarial $\mathcal H_{\rm NN}$-estimation error transformation of the supremum-based sigmoid loss under Massart's noise condition with $\beta$ is lower bounded as follows:
\[
\mathcal T^M_{\widetilde\Phi_{\rm sig}}\ge\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}=\min\{\mathcal T_1,\mathcal T_2\}=
\begin{cases}
\tanh(k\Lambda B)\frac{4\beta}{1+2\beta}\,t, & t\in[0,\frac12+\beta)\\
\tanh(k\Lambda B)(2t-1), & t\in[\frac12+\beta,1].
\end{cases}
\]
Note that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}$ is convex, non-decreasing, invertible and satisfies $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}(0)=0$. By Proposition 2, using the fact that $\overline{\mathcal T}^M_{\widetilde\Phi_{\rm sig}}(t)\ge\tanh(k\Lambda B)\frac{4\beta}{1+2\beta}t$ yields the adversarial $\mathcal H_{\rm NN}$-consistency bound for the supremum-based sigmoid loss, valid for all $h\in\mathcal H_{\rm NN}$ and all distributions $\mathcal D$ satisfying Massart's noise condition with $\beta$:
\[
\mathcal R_{\ell_\gamma}(h)-\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}\le\frac{(1+2\beta)\bigl(\mathcal R_{\widetilde\Phi_{\rm sig}}(h)-\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}+\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}\bigr)}{4\beta\tanh(k\Lambda B)}-\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}. \tag{73}
\]
Using
\[
\mathcal M_{\ell_\gamma,\mathcal H_{\rm NN}}=\mathcal R^*_{\ell_\gamma,\mathcal H_{\rm NN}}-\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}],\qquad
\mathcal M_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}\le\mathcal R^*_{\widetilde\Phi_{\rm sig},\mathcal H_{\rm NN}}-\mathbb{E}\Bigl[1-|1-2\eta(x)|\tanh\Bigl(k\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\Bigr)\Bigr],
\]
the inequality can be relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)}{4\beta\tanh(k\Lambda B)}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\tanh(k\Lambda B)}\,\mathbb{E}\Bigl[1-|1-2\eta(x)|\tanh\Bigl(k\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\Bigr)\Bigr].
\]
Observe that, as shown in Section N.2.1, $\sup_{h\in\mathcal H_{\rm NN}}\underline h_\gamma(x)\le\Lambda\bigl(W\max\{\|x\|_p,\gamma\}-\gamma W+B\bigr)$. Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)}{4\beta\tanh(k\Lambda B)}+\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]-\frac{1+2\beta}{4\beta\tanh(k\Lambda B)}\,\mathbb{E}\bigl[1-|1-2\eta(x)|\tanh\bigl(k\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\bigr].
\]
Note that $\tanh\bigl(k\Lambda(W\max\{\|x\|_p,\gamma\}-\gamma W+B)\bigr)\le1$ and $1-|1-2\eta(x)|=2\min\{\eta(x),1-\eta(x)\}$. Thus, the inequality can be further relaxed as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{(1+2\beta)\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)}{4\beta\tanh(k\Lambda B)}-\Bigl(\frac{1+2\beta}{2\beta\tanh(k\Lambda B)}-1\Bigr)\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}]. \tag{74}
\]
When $\Lambda B=+\infty$, it can be equivalently written as follows:
\[
\mathcal R_{\ell_\gamma}(h)\le\frac{1+2\beta}{4\beta}\,\mathcal R_{\widetilde\Phi_{\rm sig}}(h)-\frac{1}{2\beta}\,\mathbb{E}_X[\min\{\eta(x),1-\eta(x)\}].
\]
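Finally, as a toy illustration (the η values below are synthetic and merely satisfy the Massart condition $|\eta(x)-\frac12|\ge\beta$), the sketch evaluates the right-hand side of the last display, i.e. the $\Lambda B=+\infty$ simplification of (74), for a given surrogate risk.

```python
# Toy illustration (made-up eta values and surrogate risk): the right-hand side of the
# Lambda*B = +infinity simplification of bound (74),
# (1+2*beta)/(4*beta) * R_sup-sigmoid(h) - 1/(2*beta) * E_X[min(eta, 1-eta)].
import numpy as np

def rhs_massart_sigmoid_bound(surrogate_risk, eta, beta):
    noise_term = np.mean(np.minimum(eta, 1.0 - eta))
    return (1 + 2 * beta) / (4 * beta) * surrogate_risk - noise_term / (2 * beta)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    beta = 0.3                                            # Massart parameter
    eta = 0.5 + beta + (0.5 - beta) * rng.random(1000)    # synthetic eta in [1/2+beta, 1]
    print(rhs_massart_sigmoid_bound(surrogate_risk=0.4, eta=eta, beta=beta))
```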