# H-Consistency Bounds: Characterization and Extensions

Anqi Mao, Courant Institute, New York, NY 10012, aqmao@cims.nyu.edu
Mehryar Mohri, Google Research & CIMS, New York, NY 10011, mohri@google.com
Yutao Zhong, Courant Institute, New York, NY 10012, yutao@cims.nyu.edu

Abstract

A series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] have introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining if they hold and deriving these bounds have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations? This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification. We present new and tight H-consistency bounds for both the family of constrained losses and that of comp-sum losses, which covers the familiar cross-entropy, or logistic loss, applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets. Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class H-consistency bound for cross-entropy reduces to an excess bound and is not significant. Instead, we prove a much stronger and more significant guarantee.

1 Introduction

Bayes-consistency is an important property of surrogate loss functions. It requires that minimizing the surrogate excess error over the family of all measurable functions leads to the minimization of the target error in the limit [Steinwart, 2007]. This property applies to a broad family of convex margin-based losses in binary classification [Zhang, 2004a, Bartlett et al., 2006], as well as some extensions in multi-class classification [Tewari and Bartlett, 2007]. However, Bayes-consistency does not apply to the hypothesis sets commonly used for learning, such as the family of linear models or that of neural networks, which of course do not include all measurable functions. Furthermore, it is also only an asymptotic property and does not supply any convergence guarantee.

To address these limitations, a series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining the validity of these bounds and deriving them have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations for H-consistency bounds?

This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification.
Previous approaches to deriving these bounds required the development of new proofs for each specific case. In contrast, we introduce the general concept of an error transformation function that serves as a general tool for deriving such bounds with tightness guarantees. We show that deriving an H-consistency bound for comp-sum losses and constrained losses, for both complete and bounded hypothesis sets, can be reduced to the calculation of their corresponding error transformation function. Our general tools and tight bounds offer several remarkable advantages: first, they improve existing bounds for complete hypothesis sets previously proven in [Awasthi et al., 2022b]; second, they encompass all comp-sum and constrained losses studied thus far [Awasthi et al., 2022a, Mao et al., 2023h] as well as many new ones; third, they extend beyond the completeness assumption adopted in previous work; fourth, they provide novel guarantees for bounded hypothesis sets; and, finally, they help prove a much stronger and more significant guarantee for the logistic loss with linear hypothesis sets than that of [Zheng et al., 2023].

Previous work. Here, we briefly discuss recent studies of H-consistency bounds by Awasthi et al. [2022a,b], Mao et al. [2023h] and Zheng et al. [2023]. Awasthi et al. [2022a] introduced and studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for bounded hypothesis sets of linear models and one-hidden-layer neural networks. The subsequent study [Awasthi et al., 2022b] further generalized the framework to multi-class classification and presented an extensive study of H-consistency bounds for diverse multi-class surrogate losses, including negative results for max losses [Crammer and Singer, 2001] and positive results for sum losses [Weston and Watkins, 1998] and constrained losses [Lee et al., 2004]. However, the hypothesis sets examined in their analysis were assumed to be complete, which rules out the bounded hypothesis sets typically used in practice. Moreover, the final bounds derived in [Awasthi et al., 2022b] are based on ad hoc methods and may not be tight. [Mao et al., 2023h] complemented this previous work by studying a wide family of comp-sum losses in multi-class classification, which generalizes the sum losses and includes as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption on the hypothesis sets was adopted, and their H-consistency bounds do not apply to common bounded hypothesis sets in practice. Recently, Zheng et al. [2023] proved H-consistency bounds for the multi-class logistic loss with bounded linear hypothesis sets. However, their bounds require a crucial distributional assumption, under which the minimizability gaps coincide with the approximation errors. Thus, their bounds can be recovered as excess error bounds, which are less significant.
Other related work on H-consistency bounds includes H-consistency bounds for pairwise ranking [Mao, Mohri, and Zhong, 2023d,e]; theoretically grounded surrogate losses and algorithms for learning with abstention supported by H-consistency bounds, including the study of score-based abstention [Mao, Mohri, and Zhong, 2023f], predictor-rejector abstention [Mao, Mohri, and Zhong, 2023c] and learning to abstain with a fixed predictor with application to decontextualization [Mohri, Andor, Choi, Collins, Mao, and Zhong, 2023]; principled approaches for learning to defer with multiple experts that benefit from strong H-consistency bounds, including the single-stage scenario [Mao, Mohri, and Zhong, 2023b] and a two-stage scenario [Mao, Mohri, Mohri, and Zhong, 2023a]; H-consistency theory and algorithms for adversarial robustness [Awasthi et al., 2021a,b, 2023a, Mao et al., 2023h, Awasthi et al., 2023b]; and efficient algorithms and loss functions for structured prediction with stronger H-consistency guarantees [Mao et al., 2023g].

Structure of this paper. We present new and tight H-consistency bounds for both the family of comp-sum losses (Section 4.1) and that of constrained losses (Section 5.1), which cover the familiar cross-entropy, or logistic loss, applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets (Sections 4.2 and 5.2). Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class H-consistency bound for cross-entropy reduces to an excess bound independent of the hypothesis set. Instead, we prove a much stronger and more significant guarantee (Section 4.2). We give a comprehensive discussion of related work in Appendix A. We start with some basic definitions and notation in Section 2.

2 Preliminaries

We denote by $X$ the input space, by $Y$ the output space, and by $D$ a distribution over $X \times Y$. We consider the standard scenario of multi-class classification, where $Y = \{1, \ldots, n\}$. Given a hypothesis set $H$ of functions mapping $X \times Y$ to $\mathbb{R}$, the multi-class classification problem consists of finding a hypothesis $h \in H$ with small generalization error $R_{\ell_{0-1}}(h)$, defined by $R_{\ell_{0-1}}(h) = \mathbb{E}_{(x,y) \sim D}[\ell_{0-1}(h,x,y)]$, where $\ell_{0-1}(h,x,y) = 1_{h(x) \neq y}$ is the multi-class zero-one loss, with $h(x) = \operatorname{argmax}_{y \in Y} h(x,y)$ the prediction of $h$ for the input point $x$. We also denote by $H(x)$ the set of all predictions associated with input $x$ generated by functions in $H$, that is, $H(x) = \{h(x) : h \in H\}$. We will analyze the guarantees of surrogate multi-class losses in terms of the zero-one loss. We denote by $\ell$ a surrogate loss and by $R_\ell(h)$ its generalization error, $R_\ell(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h,x,y)]$. For a loss function $\ell$, we define the best-in-class generalization error within a hypothesis set $H$ as $R^*_\ell(H) = \inf_{h \in H} R_\ell(h)$, and refer to $R_\ell(h) - R^*_\ell(H)$ as the estimation error. We will study the key notion of H-consistency bounds [Awasthi et al., 2022a,b], which are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error, for some non-decreasing real-valued function $f$:
$\forall h \in H, \quad R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le f\big(R_\ell(h) - R^*_\ell(H)\big).$
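As a concrete instance of such a bound (a standard example consistent with Corollary 7 below, stated here for the family of all measurable functions, for which the minimizability gaps introduced next vanish), the logistic loss admits

$$
R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\,\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big)^{\frac{1}{2}},
$$

that is, $f(t) = \sqrt{2t}$. The bounds studied in this paper refine such statements by replacing $H_{\mathrm{all}}$ with a restricted hypothesis set $H$ and accounting for the corresponding minimizability gaps.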
These bounds imply that the zero-one estimation error is at most $f(\epsilon)$ whenever the surrogate loss estimation error is bounded by $\epsilon$. Thus, the learning guarantees provided by H-consistency bounds are both non-asymptotic and hypothesis set-specific. The function $f$ appearing in these bounds is expressed in terms of a minimizability gap, a quantity measuring the difference between the best-in-class error $R^*_\ell(H)$ and the expected best-in-class conditional error $\mathbb{E}_x[C^*_\ell(H,x)]$: $M_\ell(H) = R^*_\ell(H) - \mathbb{E}_X[C^*_\ell(H,x)]$, where $C_\ell(h,x) = \mathbb{E}_{y|x}[\ell(h,x,y)]$ and $C^*_\ell(H,x) = \inf_{h \in H} C_\ell(h,x)$ are the conditional error and the best-in-class conditional error, respectively. We further write $\Delta C_{\ell,H}(h,x) = C_\ell(h,x) - C^*_\ell(H,x)$ to denote the conditional regret. Note that the minimizability gap is an inherent quantity depending on the hypothesis set $H$ and the loss function $\ell$.

By Lemma 1, the minimizability gap for the zero-one loss, $M_{\ell_{0-1}}(H)$, coincides with its approximation error $A_{\ell_{0-1}}(H) = R^*_{\ell_{0-1}}(H) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ when the set of all possible predictions generated by $H$ covers the label space $Y$. This holds for typical hypothesis sets used in practice. However, for a surrogate loss $\ell$, the minimizability gap $M_\ell(H)$ is always upper bounded by and in general finer than its approximation error $A_\ell(H) = R^*_\ell(H) - R^*_\ell(H_{\mathrm{all}})$, since $M_\ell(H) = A_\ell(H) - I_\ell(H)$, where $H_{\mathrm{all}}$ is the family of all measurable functions and $I_\ell(H) = \mathbb{E}_x[C^*_\ell(H,x) - C^*_\ell(H_{\mathrm{all}},x)]$ (see Appendix B for a more detailed discussion). Thus, an H-consistency bound, expressed as follows for some increasing function $\Gamma$:
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Gamma\big(R_\ell(h) - R^*_\ell(H) + M_\ell(H)\big),$ (1)
is more favorable than an excess error bound expressed in terms of approximation errors, $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + A_{\ell_{0-1}}(H) \le \Gamma\big(R_\ell(h) - R^*_\ell(H) + A_\ell(H)\big)$. Here, $\Gamma$ is typically linear or the square-root function modulo constants. When $H = H_{\mathrm{all}}$, the family of all measurable functions, an H-consistency bound coincides with the excess error bound and implies Bayes-consistency by taking the limit. It is therefore a stronger guarantee than an excess error bound and Bayes-consistency. The minimizability gap is always non-negative, since the infimum of the expectation is greater than or equal to the expectation of the infimum. Furthermore, as shown in Appendix B, when $H$ is the family of all measurable functions or when the Bayes error coincides with the best-in-class error, that is $R^*_\ell(H) = R^*_\ell(H_{\mathrm{all}})$, the minimizability gap vanishes. In such cases, (1) implies the H-consistency of a surrogate loss $\ell$ with respect to the zero-one loss $\ell_{0-1}$:
$R_\ell(h_n) - R^*_\ell(H) \xrightarrow{\; n \to +\infty \;} 0 \implies R_{\ell_{0-1}}(h_n) - R^*_{\ell_{0-1}}(H) \xrightarrow{\; n \to +\infty \;} 0.$

In the next sections, we will provide both a general characterization and an extension of H-consistency bounds for multi-class classification. Before proceeding, we first introduce a useful lemma from [Awasthi et al., 2022b] which characterizes the conditional regret of the zero-one loss explicitly. We denote by $p(x) = (p(x,1),\ldots,p(x,n))$ the conditional distribution of $y$ given $x$.

Lemma 1. For the zero-one loss $\ell_{0-1}$, the best-in-class conditional error and the conditional regret can be expressed as follows: for any $x \in X$, we have
$C^*_{\ell_{0-1}}(H,x) = 1 - \max_{y \in H(x)} p(x,y)$ and $\Delta C_{\ell_{0-1},H}(h,x) = \max_{y \in H(x)} p(x,y) - p(x, h(x)).$
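As a simple numerical illustration of Lemma 1 (the probability values below are hypothetical and chosen only for concreteness), suppose $n = 3$, $H(x) = Y$, and $p(x,\cdot) = (0.5, 0.3, 0.2)$. Then

$$
C^*_{\ell_{0-1}}(H,x) = 1 - 0.5 = 0.5,
\qquad
\Delta C_{\ell_{0-1},H}(h,x) = 0.5 - p\big(x, h(x)\big),
$$

so a hypothesis predicting label 2 at $x$ incurs a conditional regret of $0.5 - 0.3 = 0.2$, while one predicting label 1 incurs none.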
3 Comparison with previous work

Here, we briefly discuss previous studies of H-consistency bounds [Awasthi et al., 2022a,b, Zheng et al., 2023, Mao et al., 2023h] in standard binary or multi-class classification and compare their results with those we present. Awasthi et al. [2022a] studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for the bounded hypothesis sets of linear models $H^{\mathrm{bi}}_{\mathrm{lin}}$ and one-hidden-layer neural networks $H^{\mathrm{bi}}_{\mathrm{NN}}$, defined as follows:
$H^{\mathrm{bi}}_{\mathrm{lin}} = \{x \mapsto w \cdot x + b : \|w\| \le W, |b| \le B\}$
$H^{\mathrm{bi}}_{\mathrm{NN}} = \{x \mapsto \sum_{j=1}^{n} u_j (w_j \cdot x + b)_+ : \|u\|_1 \le \Lambda, \|w_j\| \le W, |b| \le B\},$
where $B$, $W$, and $\Lambda$ are positive constants and where $(\cdot)_+ = \max(\cdot, 0)$. We will show that our bounds recover these binary classification H-consistency bounds.

The scenario of multi-class classification is more challenging and more crucial in applications. Recent work by Awasthi et al. [2022b] showed that max losses [Crammer and Singer, 2001], defined as $\ell_{\max}(h,x,y) = \max_{y' \neq y} \Phi\big(h(x,y) - h(x,y')\big)$ for some convex and non-increasing function $\Phi$, cannot admit meaningful H-consistency bounds, unless the distribution is deterministic. They also presented a series of H-consistency bounds for sum losses [Weston and Watkins, 1998] and constrained losses [Lee et al., 2004] for symmetric and complete hypothesis sets, that is, hypothesis sets such that
$H = \{h \colon X \times Y \to \mathbb{R} \mid h(\cdot,y) \in F,\ \forall y \in Y\}$ (symmetry)
$\forall x \in X,\ \{f(x) : f \in F\} = \mathbb{R}$ (completeness)
for some family $F$ of functions mapping from $X$ to $\mathbb{R}$. The completeness assumption rules out the bounded hypothesis sets typically used in practice, such as $H_{\mathrm{lin}}$. Moreover, the final bounds derived in [Awasthi et al., 2022b] are based on ad hoc proofs and may not be tight. In contrast, we will study both complete and bounded hypothesis sets, and provide a very general tool to derive H-consistency bounds. Our bounds are tighter than those of Awasthi et al. [2022b] given for complete hypothesis sets and extend beyond the completeness assumption.

[Mao et al., 2023h] complemented the work of [Awasthi et al., 2022b] by studying a wide family of comp-sum losses in multi-class classification, which generalized the sum losses and included as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption was adopted; thus, their H-consistency bounds do not apply to common bounded hypothesis sets used in practice. We illustrate the application of our general results through a broader set of surrogate losses than [Mao et al., 2023h] and significantly generalize the bounds of [Mao et al., 2023h] to bounded hypothesis sets.

Recently, Zheng et al. [2023] proved H-consistency bounds for the logistic loss with linear hypothesis sets in multi-class classification: $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B, y \in Y\}$. However, their bounds require a crucial distributional assumption under which the minimizability gaps $M_{\ell_{0-1}}(H_{\mathrm{lin}})$ and $M_{\ell_{\log}}(H_{\mathrm{lin}})$ coincide with the approximation errors $R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ and $R^*_{\ell_{\log}}(H_{\mathrm{lin}}) - R^*_{\ell_{\log}}(H_{\mathrm{all}})$, respectively (see the note before [Zheng et al., 2023, Appendix F]). Thus, their bounds can be recovered as excess error bounds $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big)^{\frac{1}{2}}$, which are less significant. In contrast, our $H_{\mathrm{lin}}$-consistency bounds are much finer and take into account the role of the parameter $B$ and that of the number of labels $n$. Thus, we provide stronger and more significant guarantees for the logistic loss with linear hypothesis sets than [Zheng et al., 2023].

In summary, our general tools offer the remarkable advantages of deriving tight bounds, which improve upon the existing bounds of Awasthi et al.
[2022b] given for complete hypothesis sets, cover the comp-sum and constrained losses considered in [Awasthi et al., 2022a, Mao et al., 2023h] as well as new ones, extend beyond the completeness assumption with novel guarantees valid for bounded hypothesis sets, and provide much stronger and more significant guarantees for the logistic loss with linear hypothesis sets than those of Zheng et al. [2023].

4 Comp-sum losses

In this section, we present a general characterization of H-consistency bounds for comp-sum losses, a family of loss functions including the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the sum exponential loss [Weston and Watkins, 1998, Awasthi et al., 2022b], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], the mean absolute error loss [Ghosh et al., 2017], and many other loss functions used in applications. This is a family of loss functions defined via the composition of a non-negative and non-decreasing function $\Psi$ with the sum exponential losses (see [Mao et al., 2023h]):
$\forall h \in H,\ \forall (x,y) \in X \times Y, \quad \ell_{\mathrm{comp}}(h,x,y) = \Psi\Big(\sum_{y' \in Y} e^{h(x,y') - h(x,y)}\Big).$ (2)
This expression can be equivalently written as $\ell_{\mathrm{comp}}(h,x,y) = \Phi\big(\frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big)$, where $\Phi \colon u \mapsto \Psi\big(\frac{1}{u}\big)$ is a non-increasing auxiliary function from $[0,1]$ to $\mathbb{R}_+ \cup \{+\infty\}$. As an example, the logistic loss corresponds to the choice $\Phi \colon u \mapsto -\log(u)$ and the sum exponential loss to $\Phi \colon u \mapsto \frac{1}{u} - 1$.

4.1 H-consistency bounds

In previous work, deriving H-consistency bounds has required giving new proofs for each instance. The following result provides a very general tool for deriving such bounds with tightness guarantees. We introduce an error transformation function and show that deriving an H-consistency bound for comp-sum losses can be reduced to the calculation of this function.

Theorem 2 (H-consistency bound for comp-sum losses). Assume that $H$ is symmetric and complete and that $T_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$T_{\mathrm{comp}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H),$ (3)
with $T_{\mathrm{comp}}$ an H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [0,\frac{1}{2}]}\ \sup_{\mu \in [-\tau,\, 1-\tau]} \big\{\tfrac{1+t}{2}\big[\Phi(\tau) - \Phi(1-\tau-\mu)\big] + \tfrac{1-t}{2}\big[\Phi(1-\tau) - \Phi(\tau+\mu)\big]\big\}$ for $n = 2$,
$T_{\mathrm{comp}}(t) = \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]}\ \inf_{\tau_1 \ge \max(\tau_2, 1/n),\ \tau_1+\tau_2 \le 1,\ \tau_2 \ge 0}\ \sup_{\mu \in [-\tau_2,\, \tau_1]} \big\{\tfrac{P+t}{2}\big[\Phi(\tau_2) - \Phi(\tau_1-\mu)\big] + \tfrac{P-t}{2}\big[\Phi(\tau_1) - \Phi(\tau_2+\mu)\big]\big\}$ for $n > 2$.
Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H) = T_{\mathrm{comp}}(t)$.

Thus, Theorem 2 shows that, when $T_{\mathrm{comp}}$ is convex, to make these guarantees explicit, all that is needed is to calculate $T_{\mathrm{comp}}$. Moreover, the last statement shows the tightness of the guarantees derived using this function. The constraints in $T_{\mathrm{comp}}$ are due to the forms that the conditional probability vector and scoring functions take. These forms become more flexible for $n > 2$, leading to intricate constraints. Note that our H-consistency bounds are distribution-independent and we cannot claim tightness across all distributions. The general expression of $T_{\mathrm{comp}}$ in Theorem 2 is complex, but it can be considerably simplified under some broad assumptions, as shown by the following result.

Theorem 3 (characterization of $T_{\mathrm{comp}}$). Assume that $\Phi$ is convex, differentiable at $\frac{1}{2}$ and $\Phi'(\frac{1}{2}) < 0$. Then, $T_{\mathrm{comp}}$ can be expressed as follows:
$T_{\mathrm{comp}}(t) = \Phi\big(\tfrac{1}{2}\big) - \inf_{\mu \in [-\frac{1}{2}, \frac{1}{2}]} \big\{\tfrac{1-t}{2}\Phi\big(\tfrac{1}{2} + \mu\big) + \tfrac{1+t}{2}\Phi\big(\tfrac{1}{2} - \mu\big)\big\}$ for $n = 2$,
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [\frac{1}{n}, \frac{1}{2}]} \big\{\Phi(\tau) - \inf_{\mu \in [-\tau, \tau]} \big\{\tfrac{1+t}{2}\Phi(\tau - \mu) + \tfrac{1-t}{2}\Phi(\tau + \mu)\big\}\big\}$ for $n > 2$.

The proofs of this result and of the other theorems in this section are given in Appendix C.
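As an illustration of how Theorem 3 is used (a worked sketch of the calculation detailed in Appendix C.3, based on the simplified $n = 2$ expression above), consider the logistic choice $\Phi(u) = -\log(u)$: the inner infimum is attained at $\mu = -\frac{t}{2}$, so

$$
T_{\mathrm{comp}}(t)
= \Phi\big(\tfrac{1}{2}\big) - \Big[\tfrac{1-t}{2}\,\Phi\big(\tfrac{1-t}{2}\big) + \tfrac{1+t}{2}\,\Phi\big(\tfrac{1+t}{2}\big)\Big]
= \log 2 + \tfrac{1-t}{2}\log\tfrac{1-t}{2} + \tfrac{1+t}{2}\log\tfrac{1+t}{2}
= \tfrac{1+t}{2}\log(1+t) + \tfrac{1-t}{2}\log(1-t),
$$

which coincides with the corresponding entry of Table 1 below.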
Examples. We now illustrate the application of our theory through several examples. To do so, we compute the H-estimation error transformation $T_{\mathrm{comp}}$ for comp-sum losses and present the results in Table 1.

Table 1: H-estimation error transformation for common comp-sum losses.

| Auxiliary function $\Phi$ | $-\log(t)$ | $\frac{1}{t} - 1$ | $\frac{1}{q}(1-t^q),\ q \in (0,1)$ | $1-t$ | $(1-t)^2$ |
|---|---|---|---|---|---|
| Transformation $T_{\mathrm{comp}}$ | $\frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ | $1-\sqrt{1-t^2}$ | $\frac{1}{qn^q}\Big[\Big(\frac{(1+t)^{\frac{1}{1-q}}+(1-t)^{\frac{1}{1-q}}}{2}\Big)^{1-q}-1\Big]$ | $\frac{t}{n}$ | $\frac{t^2}{4}$ |

Remarkably, by applying Theorem 2, we are able to obtain the same H-consistency bounds for comp-sum losses with $\Phi(t) = -\log(t)$, $\Phi(t) = \frac{1}{q}(1-t^q)$ with $q \in (0,1)$, and $\Phi(t) = 1-t$ as those derived using ad hoc methods in [Mao et al., 2023h], as well as a novel tight H-consistency bound for the new comp-sum loss $\ell_{\mathrm{sq}} = \big[1 - \frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big]^2$ with $\Phi(t) = (1-t)^2$ in Theorem 4. The calculation of $T_{\mathrm{comp}}$ for all entries of Table 1 is detailed in Appendix C.3. To illustrate the effectiveness of our general tools, here, we show how the error transformation function can be straightforwardly calculated in the case of the new surrogate loss $\ell_{\mathrm{sq}}$.

Theorem 4 (H-consistency bound for a new comp-sum loss). Assume that $H$ is symmetric and complete. Then, for all $h \in H$ and any distribution, the following tight bound holds:
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le 2\big(R_{\ell_{\mathrm{sq}}}(h) - R^*_{\ell_{\mathrm{sq}}}(H) + M_{\ell_{\mathrm{sq}}}(H)\big)^{\frac{1}{2}} - M_{\ell_{0-1}}(H).$

Proof. For $n = 2$, plugging $\Phi(t) = (1-t)^2$ into Theorem 3 gives
$T_{\mathrm{comp}}(t) = \tfrac{1}{4} - \inf_{\mu \in [-\frac{1}{2},\frac{1}{2}]} \big\{\tfrac{1-t}{2}\big(\tfrac{1}{2}-\mu\big)^2 + \tfrac{1+t}{2}\big(\tfrac{1}{2}+\mu\big)^2\big\} = \tfrac{1}{4} - \tfrac{1-t^2}{4} = \tfrac{t^2}{4},$
with the inner minimum achieved at $\mu = -\frac{t}{2}$. Similarly, for $n > 2$, plugging $\Phi(t) = (1-t)^2$ into Theorem 3 yields
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [\frac{1}{n},\frac{1}{2}]} \big\{(1-\tau)^2 - \inf_{\mu \in [-\tau,\tau]} \big\{\tfrac{1+t}{2}(1-\tau+\mu)^2 + \tfrac{1-t}{2}(1-\tau-\mu)^2\big\}\big\}$
$= \inf_{\tau \in [\frac{1}{n},\frac{1}{2}]} \big\{(1-\tau)^2 - (1-\tau)^2(1-t^2)\big\}$ (minimum achieved at $\mu = t(\tau - 1)$)
$= \tfrac{t^2}{4}$. (minimum achieved at $\tau = \tfrac{1}{2}$)
By Theorem 2, the bound obtained is tight, which completes the proof.

4.2 Extension to non-complete/bounded hypothesis sets: comp-sum losses

As pointed out earlier, the hypothesis sets typically used in practice are bounded. Let $F$ be a family of real-valued functions $f$ with $|f(x)| \le \Lambda(x)$ for all $x \in X$ and such that all values in $[-\Lambda(x), +\Lambda(x)]$ can be reached, where $\Lambda(x) > 0$ is a fixed function on $X$. We will study hypothesis sets $H$ in which each scoring function is bounded:
$H = \{h \colon X \times Y \to \mathbb{R} \mid h(\cdot,y) \in F,\ \forall y \in Y\}.$ (4)
This holds for most hypothesis sets used in practice. The symmetric and complete hypothesis sets studied in previous work correspond to the special case of $H$ where $\Lambda(x) = +\infty$ for all $x \in X$. The hypothesis set of linear models $H_{\mathrm{lin}}$, defined by $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B, y \in Y\}$, is also a special instance of $H$, with $\Lambda(x) = W\|x\| + B$. Let us emphasize that previous studies did not establish any H-consistency bound for these general hypothesis sets $H$.

Theorem 5 (H-consistency bound for comp-sum losses). Assume that $\overline{T}_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$\overline{T}_{\mathrm{comp}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H),$
with $\overline{T}_{\mathrm{comp}}$ the H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$\overline{T}_{\mathrm{comp}}(t) = \inf_{\tau \in [0,\frac{1}{2}]}\ \sup_{\mu \in [s_{\min}-\tau,\ 1-\tau-s_{\min}]} \big\{\tfrac{1+t}{2}\big[\Phi(\tau) - \Phi(1-\tau-\mu)\big] + \tfrac{1-t}{2}\big[\Phi(1-\tau) - \Phi(\tau+\mu)\big]\big\}$ for $n = 2$,
$\overline{T}_{\mathrm{comp}}(t) = \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]}\ \inf_{s_{\min} \le \tau_2 \le \tau_1 \le s_{\max},\ \tau_1+\tau_2 \le 1}\ \sup_{\mu \in C} \big\{\tfrac{P+t}{2}\big[\Phi(\tau_2) - \Phi(\tau_1-\mu)\big] + \tfrac{P-t}{2}\big[\Phi(\tau_1) - \Phi(\tau_2+\mu)\big]\big\}$ for $n > 2$,
where $C = \big[\max\{s_{\min}-\tau_2,\ \tau_1-s_{\max}\},\ \min\{s_{\max}-\tau_2,\ \tau_1-s_{\min}\}\big]$, $s_{\max} = \frac{1}{1+(n-1)e^{-2\inf_x \Lambda(x)}}$ and $s_{\min} = \frac{1}{1+(n-1)e^{2\inf_x \Lambda(x)}}$. Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H) = \overline{T}_{\mathrm{comp}}(t)$.

This theorem significantly broadens the applicability of our framework as it encompasses bounded hypothesis sets.
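To make the quantities appearing in Theorem 5 concrete, here is a minimal numerical sketch (the choice of a linear hypothesis set and the values of $n$ and $B$ are hypothetical illustrations, not part of the theorem), computing $s_{\min}$ and $s_{\max}$ from $\inf_x \Lambda(x)$:

```python
import math

def softmax_score_range(n, inf_lambda):
    """Range [s_min, s_max] of any label's softmax score when every raw score
    h(x, y) lies in [-Lambda(x), +Lambda(x)], using the expressions in Theorem 5."""
    s_max = 1.0 / (1.0 + (n - 1) * math.exp(-2.0 * inf_lambda))
    s_min = 1.0 / (1.0 + (n - 1) * math.exp(2.0 * inf_lambda))
    return s_min, s_max

# Hypothetical setting: linear hypothesis set H_lin with ||w_y|| <= W and |b_y| <= B,
# so Lambda(x) = W ||x|| + B; inf_x Lambda(x) = B if inputs of arbitrarily small norm
# are allowed (an assumption made here purely for illustration).
n, B = 10, 1.0
s_min, s_max = softmax_score_range(n, inf_lambda=B)
print(s_min, s_max)  # as B -> 0 both endpoints tend to 1/n; as B -> infinity, to 0 and 1
```

These two endpoints drive the shape of the bounds in the examples that follow: the complete case corresponds to $s_{\min} = 0$ and $s_{\max} = 1$.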
The last statement of Theorem 5 further shows the tightness of the H-consistency bounds derived using this error transformation function. We now illustrate the application of our theory through several examples.

A. Example: logistic loss. We first consider the multinomial logistic loss, that is, $\ell_{\mathrm{comp}}$ with $\Phi(u) = -\log(u)$, for which we give the following guarantee.

Theorem 6 (H-consistency bounds for logistic loss). For any $h \in H$ and any distribution, we have
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H) + M_{\ell_{\log}}(H)\big),$
where $\ell_{\log} = -\log\big(\frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big)$ and
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{s_{\max}-s_{\min}}{s_{\min}+s_{\max}} \\ \frac{t}{2}\log\big(\frac{s_{\max}}{s_{\min}}\big) + \log\big(\frac{2\sqrt{s_{\max}s_{\min}}}{s_{\max}+s_{\min}}\big) & \text{otherwise.} \end{cases}$

The proof of Theorem 6 is given in Appendix E.2. With the help of some simple calculations, we can derive a simpler upper bound:
$\Psi^{-1}(t) \le \Gamma(t) = \begin{cases} \sqrt{2t} & t \le \frac{(s_{\max}-s_{\min})^2}{2(s_{\min}+s_{\max})^2} \\ \frac{2(s_{\min}+s_{\max})}{s_{\max}-s_{\min}}\, t & \text{otherwise.} \end{cases}$
When the relative difference between $s_{\min}$ and $s_{\max}$ is small, the coefficient of the linear term in $\Gamma$ explodes. On the other hand, making that difference large essentially turns $\Gamma$ into a square-root function for all values. In general, $\Lambda$ is not infinite since regularization is used, which controls both the complexity of the hypothesis set and the magnitude of the scores.

Comparison with [Mao et al., 2023h]. For the symmetric and complete hypothesis sets $H$ considered in [Mao et al., 2023h], $\Lambda(x) = +\infty$, $s_{\max} = 1$, $s_{\min} = 0$, $\Psi(t) = \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ and $\Gamma(t) = \sqrt{2t}$. By Theorem 6, this yields an H-consistency bound for the logistic loss.

Corollary 7 (H-consistency bounds for logistic loss). Assume that $H$ is symmetric and complete. Then, for any $h \in H$ and any distribution, we have
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H) + M_{\ell_{\log}}(H)\big) - M_{\ell_{0-1}}(H),$
where $\Psi(t) = \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ and $\Psi^{-1}(t) \le \sqrt{2t}$.

Corollary 7 recovers the H-consistency bounds of Mao et al. [2023h].

Comparison with [Awasthi et al., 2022a] and [Zheng et al., 2023]. For the linear models $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B\}$, we have $\Lambda(x) = W\|x\| + B$. By Theorem 6, we obtain $H_{\mathrm{lin}}$-consistency bounds for the logistic loss.

Corollary 8 ($H_{\mathrm{lin}}$-consistency bounds for logistic loss). For any $h \in H_{\mathrm{lin}}$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{lin}}) + M_{\ell_{\log}}(H_{\mathrm{lin}})\big) - M_{\ell_{0-1}}(H_{\mathrm{lin}}),$
where
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{(n-1)(e^{2B}-e^{-2B})}{2+(n-1)(e^{2B}+e^{-2B})} \\ \frac{t}{2}\log\Big(\frac{1+(n-1)e^{2B}}{1+(n-1)e^{-2B}}\Big) + \log\Big(\frac{2\sqrt{(1+(n-1)e^{2B})(1+(n-1)e^{-2B})}}{2+(n-1)(e^{2B}+e^{-2B})}\Big) & \text{otherwise.} \end{cases}$
For $n = 2$, we have
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{e^{2B}-1}{e^{2B}+1} \\ \frac{t}{2}\log\Big(\frac{1+e^{2B}}{1+e^{-2B}}\Big) + \log\Big(\frac{2\sqrt{(1+e^{2B})(1+e^{-2B})}}{2+e^{2B}+e^{-2B}}\Big) & \text{otherwise,} \end{cases}$
which coincides with the $H_{\mathrm{lin}}$-estimation error transformation in [Awasthi et al., 2022a]. Thus, Corollary 8 includes as a special case the $H_{\mathrm{lin}}$-consistency bounds given by Awasthi et al. [2022a] for binary classification.

Our bounds of Corollary 8 improve upon the multi-class $H_{\mathrm{lin}}$-consistency bounds of recent work [Zheng et al., 2023, Theorem 3.3] in the following ways: i) their bound holds only for restricted distributions, while our bound holds for any distribution; ii) their bound holds only for restricted values of the estimation error $R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{lin}})$, while ours holds for any value and, more precisely, admits a piecewise functional form; iii) under their distributional assumption, the minimizability gaps $M_{\ell_{0-1}}(H_{\mathrm{lin}})$ and $M_{\ell_{\log}}(H_{\mathrm{lin}})$ coincide with the approximation errors $R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ and $R^*_{\ell_{\log}}(H_{\mathrm{lin}}) - R^*_{\ell_{\log}}(H_{\mathrm{all}})$, respectively (see the note before [Zheng et al., 2023, Appendix F]).
Thus, their bounds can be recovered as an excess error bound $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\big[R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big]^{\frac{1}{2}}$, which is not specific to the hypothesis set $H$ and thus not as significant. In contrast, our $H_{\mathrm{lin}}$-consistency bound is finer and takes into account the role of the parameter $B$ as well as the number of labels $n$; iv) [Zheng et al., 2023, Theorem 3.3] only offers approximate bounds that are not tight; in contrast, by Theorem 5, our bound is tight.

Note that our H-consistency bounds in Theorem 6 are not limited to specific hypothesis set forms. They are directly applicable to various types of hypothesis sets, including neural networks. For example, the same derivation can be extended to the one-hidden-layer neural networks studied in [Awasthi et al., 2022a] and their multi-class generalization by calculating and substituting the corresponding $\Lambda(x)$. As a result, we can obtain novel and tight H-consistency bounds for bounded neural network hypothesis sets in multi-class classification, which highlights the versatility of our general tools.

B. Example: sum exponential loss. We then consider the sum exponential loss, that is, $\ell_{\mathrm{comp}}$ with $\Phi(u) = \frac{1}{u} - 1$. By computing the error transformation in Theorem 5, we obtain the following result.

Theorem 9 (H-consistency bounds for sum exponential loss). For any $h \in H$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\exp}}(h) - R^*_{\ell_{\exp}}(H) + M_{\ell_{\exp}}(H)\big),$
where $\ell_{\exp} = \sum_{y' \neq y} e^{h(x,y') - h(x,y)}$ and
$\Psi(t) = \begin{cases} 1 - \sqrt{1-t^2} & t \le \frac{s_{\max}^2 - s_{\min}^2}{s_{\min}^2 + s_{\max}^2} \\ \frac{s_{\max}-s_{\min}}{2 s_{\max} s_{\min}}\, t - \frac{(s_{\max}-s_{\min})^2}{2 s_{\max} s_{\min}(s_{\max}+s_{\min})} & \text{otherwise.} \end{cases}$

The proof of Theorem 9 is given in Appendix E.3. Observe that $1 - \sqrt{1-t^2} \ge t^2/2$. By Theorem 9, making $s_{\min}$ close to zero, that is, making $\Lambda$ close to infinite for any $x \in X$, essentially turns $\Psi$ into a square function for all values. In general, $\Lambda$ is not infinite since regularization is used in practice, which controls both the complexity of the hypothesis set and the magnitude of the scores.

C. Example: generalized cross-entropy loss and mean absolute error loss. Due to space limitations, we present the results for these loss functions in Appendix E.

5 Constrained losses

In this section, we present a general characterization of H-consistency bounds for constrained losses, that is, loss functions defined via a constraint, as in [Lee et al., 2004]:
$\ell_{\mathrm{cstnd}}(h,x,y) = \sum_{y' \neq y} \Phi\big(-h(x,y')\big)$ (5)
with the constraint that $\sum_{y \in Y} h(x,y) = 0$, for a non-negative and non-increasing auxiliary function $\Phi$.

5.1 H-consistency bounds

As in the previous section, we prove a result that supplies a very general tool, an error transformation function, for deriving H-consistency bounds for constrained losses. When $T_{\mathrm{cstnd}}$ is convex, to make these guarantees explicit, we only need to calculate $T_{\mathrm{cstnd}}$.

Table 2: H-estimation error transformation for common constrained losses.

| Auxiliary function $\Phi$ | $\Phi_{\exp}(t) = e^{-t}$ | $\Phi_{\mathrm{hinge}}(t) = \max\{0, 1-t\}$ | $\Phi_{\mathrm{sq\text{-}hinge}}(t) = (1-t)^2 1_{t \le 1}$ | $\Phi_{\mathrm{sq}}(t) = (1-t)^2$ |
|---|---|---|---|---|
| Transformation $T_{\mathrm{cstnd}}$ | $2 - \sqrt{4-t^2}$ | $t$ | $\frac{t^2}{2}$ | $t^2$ |

Theorem 10 (H-consistency bound for constrained losses). Assume that $H$ is symmetric and complete. Assume that $T_{\mathrm{cstnd}}$ is convex. Then, for any hypothesis $h \in H$ and any distribution,
$T_{\mathrm{cstnd}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H),$
with the H-estimation error transformation for constrained losses defined for $t \in [0,1]$ by
$T_{\mathrm{cstnd}}(t) = \inf_{\tau \ge 0}\ \sup_{\mu \in \mathbb{R}} \big\{\tfrac{1-t}{2}\big[\Phi(\tau) - \Phi(-\tau+\mu)\big] + \tfrac{1+t}{2}\big[\Phi(-\tau) - \Phi(\tau-\mu)\big]\big\}$ for $n = 2$,
$T_{\mathrm{cstnd}}(t) = \inf_{P \in [\frac{1}{n-1},\,1]}\ \inf_{\tau_1 \ge \max\{\tau_2, 0\}}\ \sup_{\mu \in \mathbb{R}} \big\{\tfrac{2-P-t}{2}\big[\Phi(-\tau_2) - \Phi(-\tau_1+\mu)\big] + \tfrac{2-P+t}{2}\big[\Phi(-\tau_1) - \Phi(-\tau_2-\mu)\big]\big\}$ for $n > 2$.
Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H) = T_{\mathrm{cstnd}}(t)$.

Here too, the theorem guarantees the tightness of the bound. This general expression of $T_{\mathrm{cstnd}}$ can be considerably simplified under some broad assumptions, as shown by the following result.

Theorem 11 (characterization of $T_{\mathrm{cstnd}}$). Assume that $\Phi$ is convex, differentiable at zero and $\Phi'(0) < 0$. Then, $T_{\mathrm{cstnd}}$ can be expressed as follows:
$T_{\mathrm{cstnd}}(t) = \begin{cases} \Phi(0) - \inf_{\mu \in \mathbb{R}} \big\{\frac{1-t}{2}\Phi(\mu) + \frac{1+t}{2}\Phi(-\mu)\big\} & n = 2 \\ \inf_{\tau \ge 0} \big\{\big(2-\frac{1}{n-1}\big)\Phi(-\tau) - \inf_{\mu \in \mathbb{R}} \big\{\frac{2-t-\frac{1}{n-1}}{2}\Phi(-\tau+\mu) + \frac{2+t-\frac{1}{n-1}}{2}\Phi(-\tau-\mu)\big\}\big\} & n > 2 \end{cases}$
$\ge \begin{cases} \Phi(0) - \inf_{\mu \in \mathbb{R}} \big\{\frac{1-t}{2}\Phi(\mu) + \frac{1+t}{2}\Phi(-\mu)\big\} & n = 2 \\ \inf_{\tau \ge 0} \big\{2\Phi(-\tau) - \inf_{\mu \in \mathbb{R}} \big\{\frac{2-t}{2}\Phi(-\tau+\mu) + \frac{2+t}{2}\Phi(-\tau-\mu)\big\}\big\} & n > 2. \end{cases}$

The proofs of all the results in this section are given in Appendix D.

Examples. We now compute the H-estimation error transformation for constrained losses and present the results in Table 2. Here, we present the simplified $T_{\mathrm{cstnd}}$ obtained by using the lower bound in Theorem 11. Remarkably, by applying Theorem 10, we are able to obtain tighter H-consistency bounds for constrained losses with $\Phi = \Phi_{\mathrm{hinge}}, \Phi_{\mathrm{sq\text{-}hinge}}, \Phi_{\exp}$ than those derived using ad hoc methods in [Awasthi et al., 2022b], and a novel H-consistency bound for the new constrained loss $\ell_{\mathrm{cstnd}}(h,x,y) = \sum_{y' \neq y}\big(1 + h(x,y')\big)^2$ with $\Phi(t) = (1-t)^2$.

5.2 Extension to non-complete or bounded hypothesis sets

As in the case of comp-sum losses, we extend our results beyond the completeness assumption adopted in previous work and establish H-consistency bounds for bounded hypothesis sets. This significantly broadens the applicability of our framework.

Theorem 12 (H-consistency bound for constrained losses). Assume that $\overline{T}_{\mathrm{cstnd}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$\overline{T}_{\mathrm{cstnd}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H),$ (6)
with $\overline{T}_{\mathrm{cstnd}}$ the H-estimation error transformation for constrained losses defined for all $t \in [0,1]$ by
$\overline{T}_{\mathrm{cstnd}}(t) = \inf_{\tau \ge 0}\ \sup_{\mu \in [\tau - \Lambda_{\min},\ \tau + \Lambda_{\min}]} \big\{\tfrac{1-t}{2}\big[\Phi(\tau) - \Phi(-\tau+\mu)\big] + \tfrac{1+t}{2}\big[\Phi(-\tau) - \Phi(\tau-\mu)\big]\big\}$ for $n = 2$,
$\overline{T}_{\mathrm{cstnd}}(t) = \inf_{P \in [\frac{1}{n-1},\,1]}\ \inf_{\tau_1 \ge \max\{\tau_2, 0\}}\ \sup_{\mu \in C} \big\{\tfrac{2-P-t}{2}\big[\Phi(-\tau_2) - \Phi(-\tau_1+\mu)\big] + \tfrac{2-P+t}{2}\big[\Phi(-\tau_1) - \Phi(-\tau_2-\mu)\big]\big\}$ for $n > 2$,
where $C = \big[\max\{\tau_1, -\tau_2\} - \Lambda_{\min},\ \min\{\tau_1, -\tau_2\} + \Lambda_{\min}\big]$ and $\Lambda_{\min} = \inf_{x \in X} \Lambda(x)$. Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H) = \overline{T}_{\mathrm{cstnd}}(t)$.

The proof is presented in Appendix F.1. Next, we illustrate the application of our theory through the example of the constrained exponential loss, that is, $\ell_{\mathrm{cstnd}}$ with $\Phi(t) = e^{-t}$. By using the error transformation in Theorem 12, we obtain new H-consistency bounds in Theorem 13 (see Appendix F.2 for the proof) for bounded hypothesis sets $H$.

Theorem 13 (H-consistency bounds for constrained exponential loss). Let $\Phi(t) = e^{-t}$. For any $h \in H$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H)\big),$
where
$\Psi(t) = \begin{cases} 1 - \sqrt{1-t^2} & t \le \frac{e^{2\Lambda_{\min}} - 1}{e^{2\Lambda_{\min}} + 1} \\ \frac{t}{2}\big(e^{\Lambda_{\min}} - e^{-\Lambda_{\min}}\big) + \frac{2 - e^{\Lambda_{\min}} - e^{-\Lambda_{\min}}}{2} & \text{otherwise.} \end{cases}$

Awasthi et al. [2022b] prove H-consistency bounds for constrained exponential losses when $H$ is symmetric and complete. Theorem 13 significantly generalizes those results to non-complete hypothesis sets. Unlike in the complete case, the functional form of our new bounds has two pieces, which correspond to linear and square-root convergence respectively, modulo constants.
Furthermore, the coefficient of the linear piece depends on the the magnitude of Λmin. When Λmin is small, the coefficient of the linear term in Ψ 1 explodes. On the other hand, making Λmin large essentially turns Ψ 1 into a square-root function. 6 Discussion Here, we further elaborate on the practical value of our tools and H-consistency bounds. Our contributions include a more general and convenient mathematical tool for proving H-consistency bounds, along with tighter bounds that enable a better comparison of surrogate loss functions and extensions beyond previous completeness assumptions. As mentioned by [Awasthi et al., 2022b], given a hypothesis set H, H-consistency bounds can be used to compare different surrogate loss functions and select the most favorable one, which depends on the functional form of the H-consistency bound; the smoothness of the surrogate loss and its optimization properties; approximation properties of the surrogate loss function controlled by minimizability gaps; and the dependency on the number of classes in the multiplicative constant. Consequently, a tighter H-consistency bound provides a more accurate comparison, as a loose bound might not adequately capture the full advantage of using one surrogate loss. In contrast, Bayes-consistency does not take into account the hypothesis set and is an asymptotic property, thereby failing to guide the comparison of different surrogate losses. Another application of our H-consistency bounds involves deriving generalization bounds for surrogate loss minimizers [Mao et al., 2023h], expressed in terms of the same quantities previously discussed. Therefore, when dealing with finite samples, a tighter H-consistency bound could also result in a corresponding tighter generalization bound. Moreover, our novel results extend beyond previous completeness assumptions, offering guarantees applicable to bounded hypothesis sets commonly used with regularization. This enhancement provides meaningful learning guarantees. Technically, our error transformation function serves as a very general tool for deriving H-consistency bounds with tightness guarantees. These functions are defined within each class of loss functions including comp-sum losses and constrained losses, and their formulation depends on the structure of the individual loss function class, the range of the hypothesis set and the number of classes. To derive explicit bounds, all that is needed is to calculate these error transformation functions. Under some broad assumptions on the auxiliary function within a loss function, these error transformation functions can be further distilled into more simplified forms, making them straightforward to compute. 7 Conclusion We presented a general characterization and extension of H-consistency bounds for multi-class classification. We introduced new tools for deriving such bounds with tightness guarantees and illustrated their benefits through several applications and examples. Our proposed method is a significant advance in the theory of H-consistency bounds for multi-class classification. It can provide a general and powerful tool for deriving tight bounds for a wide variety of other loss functions and hypothesis sets. We believe that our work will open up new avenues of research in the field of multi-class classification consistency. A. Agarwal and S. Agarwal. On consistent surrogate risk minimization and property elicitation. In Conference on Learning Theory, pages 4 22, 2015. P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. 
Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, pages 9804 9815, 2021a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. A finer calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pages 1117 1174, 2022a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Multi-class H-consistency bounds. In Advances in neural information processing systems, 2022b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pages 10077 10094, 2023a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. DC-programming for neural network optimizations. Journal of Global Optimization, 2023b. P. L. Bartlett, M. I. Jordan, and J. D. Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357 -365, 1944. J. Berkson. Why I prefer logits to probits. Biometrics, 7(4):327 -339, 1951. M. Blondel. Structured prediction with projection oracles. In Advances in neural information processing systems, 2019. D.-R. Chen and T. Sun. Consistency of multiclass empirical risk minimization methods based on convex loss. Journal of Machine Learning Research, 7:2435 2447, 2006. D.-R. Chen and D.-H. Xiang. The consistency of multicategory support vector machines. Advances in Computational Mathematics, 24(1):155 169, 2006. C. Ciliberto, L. Rosasco, and A. Rudi. A consistent regularization approach for structured prediction. In Advances in neural information processing systems, 2016. C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In Algorithmic Learning Theory, pages 67 82, 2016a. C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, pages 1660 1668, 2016b. C. Cortes, G. De Salvo, and M. Mohri. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, to appear, 2023. K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec):265 292, 2001. K. Dembczynski, W. Kotlowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. ar Xiv preprint ar Xiv:1206.6401, 2012. U. Dogan, T. Glasmachers, and C. Igel. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17:1 32, 2016. J. Finocchiaro, R. M. Frongillo, and B. Waggoner. An embedding framework for the design and analysis of consistent polyhedral surrogates. ar Xiv preprint ar Xiv:2206.14707, 2022. R. Frongillo and B. Waggoner. Surrogate regret bounds for polyhedral losses. In Advances in Neural Information Processing Systems, pages 21569 21580, 2021. W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. In Conference on learning theory, pages 341 358, 2011. W. Gao and Z.-H. Zhou. On the consistency of AUC pairwise optimization. In International Joint Conference on Artificial Intelligence, 2015. A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. 
In Proceedings of the AAAI conference on artificial intelligence, 2017. V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501 2509, 2014. Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. Y. Liu. Fisher consistency of multicategory support vector machines. In Artificial intelligence and statistics, pages 291 298, 2007. P. Long and R. Servedio. Consistency versus realizable H-consistency for multiclass classification. In International Conference on Machine Learning, pages 801 809, 2013. A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Two-stage learning to defer with multiple experts. In Advances in neural information processing systems, 2023a. A. Mao, M. Mohri, and Y. Zhong. Principled approaches for learning to defer with multiple experts. ar Xiv preprint ar Xiv:2310.14774, 2023b. A. Mao, M. Mohri, and Y. Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. ar Xiv preprint ar Xiv:2310.14772, 2023c. A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023d. A. Mao, M. Mohri, and Y. Zhong. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023e. A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for scorebased multi-class abstention. ar Xiv preprint ar Xiv:2310.14770, 2023f. A. Mao, M. Mohri, and Y. Zhong. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023g. A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023h. C. Mohri, D. Andor, E. Choi, M. Collins, A. Mao, and Y. Zhong. Learning to reject with a fixed predictor: Application to decontextualization. ar Xiv preprint ar Xiv:2301.09044, 2023. M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018. H. Narasimhan, H. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In International Conference on Machine Learning, pages 2398 2407, 2015. A. Osokin, F. Bach, and S. Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, 2017. F. Pedregosa, F. Bach, and A. Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18:1 35, 2017. B. Á. Pires and C. Szepesvári. Multiclass classification calibration functions. ar Xiv preprint ar Xiv:1609.06385, 2016. B. A. Pires, C. Szepesvari, and M. Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. In International Conference on Machine Learning, pages 1391 1399, 2013. H. G. Ramaswamy and S. Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, 2012. H. G. Ramaswamy and S. Agarwal. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research, 17(1):397 441, 2016. H. G. Ramaswamy, S. Agarwal, and A. Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. 
In Advances in Neural Information Processing Systems, 2013. H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with a reject option. ar Xiv preprint ar Xiv:1505.04137, 2015. P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In International Conference on Artificial Intelligence and Statistics, pages 618 626, 2011. I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. A. Thilagar, R. Frongillo, J. J. Finocchiaro, and E. Goodwill. Consistent polyhedral surrogates for top-k classification and variants. In International Conference on Machine Learning, pages 21329 21359, 2022. K. Uematsu and Y. Lee. On theoretically optimal ranking functions in bipartite ranking. Journal of the American Statistical Association, 112(519):1311 1322, 2017. P. F. Verhulst. Notice sur la loi que la population suit dans son accroissement. Correspondance mathématique et physique, 10:113 -121, 1838. P. F. Verhulst. Recherches mathématiques sur la loi d accroissement de la population. Nouveaux Mémoires de l Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1 -42, 1845. Y. Wang and C. Scott. Weston-Watkins hinge loss and ordered partitions. In Advances in neural information processing systems, pages 19873 19883, 2020. Y. Wang and C. D. Scott. On classification-calibration of gamma-phi losses. ar Xiv preprint ar Xiv:2302.07321, 2023. J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Citeseer, 1998. R. C. Williamson, E. Vernet, and M. D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:1 52, 2016. M. Zhang and S. Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. In Advances in Neural Information Processing Systems, pages 16927 16936, 2020. M. Zhang, H. G. Ramaswamy, and S. Agarwal. Convex calibrated surrogates for the multi-label f-measure. In International Conference on Machine Learning, pages 11246 11255, 2020. T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a. T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b. Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, 2018. C. Zheng, G. Wu, F. Bao, Y. Cao, C. Li, and J. Zhu. Revisiting discriminative vs. generative classifiers: Theory and implications. In International Conference on Machine Learning, 2023. Contents of Appendix A Related work 16 B Minimizability gap 17 B.1 Zero minimizability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.2 Relationship with approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.3 Significance of H-consistency bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 C Proofs for comp-sum losses 18 C.1 Proof of H-consistency bounds with Tcomp (Theorem 2) . . . . . . . . . . . . . . . . 19 C.2 Characterization of Tcomp (Theorem 3) . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.3 Computation of examples . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . 22 D Proofs for constrained losses 23 D.1 Proof of H-consistency bounds with Tcstnd (Theorem 10) . . . . . . . . . . . . . . . . 24 D.2 Characterization of Tcstnd (Theorem 11) . . . . . . . . . . . . . . . . . . . . . . . . . . 26 D.3 Computation of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 E Extensions of comp-sum losses 28 E.1 Proof of H-consistency bounds with T comp (Theorem 5) . . . . . . . . . . . . . . . . 28 E.2 Logistic loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E.3 Sum exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 E.4 Generalized cross-entropy loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.5 Mean absolute error loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 F Extensions of constrained losses 35 F.1 Proof of H-consistency bound with T cstnd (Theorem 12) . . . . . . . . . . . . . . . . 35 F.2 Constrained exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 A Related work The notions of Bayes-consistency (also known as consistency) and calibration have been well studied not only with respect to the binary zero-one loss [Zhang, 2004a, Bartlett et al., 2006, Steinwart, 2007, Mohri et al., 2018], but also with respect to the multi-class zero-one loss [Zhang, 2004b, Tewari and Bartlett, 2007], the general multi-class losses [Ramaswamy and Agarwal, 2012, Narasimhan et al., 2015, Ramaswamy and Agarwal, 2016], the multi-class SVMs [Chen and Sun, 2006, Chen and Xiang, 2006, Liu, 2007, Dogan et al., 2016, Wang and Scott, 2020], the gamma-phi losses [Wang and Scott, 2023], the multi-label losses [Gao and Zhou, 2011, Dembczynski et al., 2012, Zhang et al., 2020], the losses with a reject option [Ramaswamy et al., 2015, Cortes et al., 2016a,b, 2023], the ranking losses [Ravikumar et al., 2011, Ramaswamy et al., 2013, Gao and Zhou, 2015, Uematsu and Lee, 2017], the cost sensitive losses [Pires et al., 2013, Pires and Szepesvári, 2016], the structured losses [Ciliberto et al., 2016, Osokin et al., 2017, Blondel, 2019], the polyhedral losses [Frongillo and Waggoner, 2021, Finocchiaro et al., 2022], the Top-k classification losses [Thilagar et al., 2022], the proper losses [Agarwal and Agarwal, 2015, Williamson et al., 2016] and the losses of ordinal regression [Pedregosa et al., 2017]. Bayes-consistency only holds for the full family of measurable functions, which of course is distinct from the more restricted hypothesis set used by a learning algorithm. Therefore, a hypothesis setdependent notion of H-consistency has been proposed by Long and Servedio [2013] in the realizable setting, used by Zhang and Agarwal [2020] for linear models, and generalized by Kuznetsov et al. [2014] to the structured prediction case. Long and Servedio [2013] showed that there exists a case where a Bayes-consistent loss is not H-consistent while inconsistent losses can be H-consistent. Zhang and Agarwal [2020] further investigated the phenomenon in [Long and Servedio, 2013] and showed that the situation of losses that are not H-consistent with linear models can be remedied by carefully choosing a larger piecewise linear hypothesis set. Kuznetsov et al. [2014] proved positive results for the H-consistency of several multi-class ensemble algorithms, as an extension of H-consistency results in [Long and Servedio, 2013]. Recently, Awasthi et al. 
[2022a,b], Mao et al. [2023h], Zheng et al. [2023] presented a series of results providing H-consistency bounds. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are more informative guarantees than similar excess error bounds derived in the literature, which correspond to the special case where H is the family of all measurable functions [Zhang, 2004a, Bartlett et al., 2006, Mohri et al., 2018]. Awasthi et al. [2022a] studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for bounded hypothesis set of linear models and one-hidden-layer neural networks. The subsequent study [Awasthi et al., 2022b] further generalized the framework to multi-class classification, where they presented a extensive study of H-consistency bounds for diverse multi-class surrogate losses including negative results for max losses [Crammer and Singer, 2001] and positive results for sum losses [Weston and Watkins, 1998], and constrained losses [Lee et al., 2004]. However, the hypothesis sets adopted there were assumed to be complete, which rules out the bounded hypothesis sets typically used in practice. Moreover, the final bounds derived from [Awasthi et al., 2022b] are based on ad hoc methods and may not be tight. [Mao et al., 2023h] complemented the previous work by studying a wide family of comp-sum losses in the multi-class classification, which generalized the sum-losses and included as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption on the hypothesis sets was adopted and their H-consistency bounds do not apply to common bounded hypothesis sets in practice. Zheng et al. [2023] proved H-consistency bounds for multi-class logistic loss with bounded linear hypothesis sets. However, their bounds require a crucial distributional assumption under which, the minimizability gaps coincide with the approximation errors. Thus, their bounds can be recovered as excess error bounds, which are less significant. This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification. Our general tools and tight bounds show several remarkable advantages: first, they improve existing bounds for complete hypothesis sets previously proven in [Awasthi et al., 2022b], second, they encompass all previously comp-sum and constrained losses studied thus far as well as many new ones [Awasthi et al., 2022a, Mao et al., 2023h], third, they extend beyond the completeness assumption adopted in previous work, fourth, they give novel guarantees for bounded hypothesis sets, and finally they help prove a much stronger and more significant guarantee for logistic loss with linear hypothesis set than [Zheng et al., 2023]. 
Other related work on H-consistency bounds includes: H-consistency bounds for pairwise ranking [Mao et al., 2023d,e]; theoretically grounded surrogate losses and algorithms for learning with abstention supported by H-consistency bounds, including the study of score-based abstention [Mao et al., 2023f], predictor-rejector abstention [Mao et al., 2023c] and learning to abstain with a fixed predictor with application in decontextualization [Mohri et al., 2023]; principled approaches for learning to defer with multiple experts that benefit from strong H-consistency bounds, including the single-stage scenario [Mao et al., 2023b] and a two-stage scenario [Mao et al., 2023a]; H-consistency theory and algorithms for adversarial robustness [Awasthi et al., 2021a,b, 2023a, Mao et al., 2023h, Awasthi et al., 2023b]; and efficient algorithms and loss functions for structured prediction with stronger H-consistency guarantees [Mao et al., 2023g]. B Minimizability gap This is a brief discussion of minimizability gaps and their properties. By definition, for any loss function ℓ, the minimizability gap is defined by Mℓ(H) = inf h H{ E (x,y) D[ℓ(h,x,y)]} E x[ inf h H E y x[ℓ(h,x,y)]] = R ℓ(H) E x[C ℓ(H,x)]. B.1 Zero minimizability Lemma 14. Let ℓbe a surrogate loss such that for (x,y) X Y and any measurable function h Hall, the loss ℓ(h,x,y) only depends on h(x) and y (thus we can write ℓ(h,x,y) = ℓ(h(x),y) for some function ℓ). Then, the minimizabiliy gap vanishes: Mℓ(Hall) = 0. Proof. Fix ϵ > 0. Then, by definition of the infimum, for any x X, there exists hx Hall such that E y x[ℓ(hx,x,y)] inf h Hall E y x[ℓ(h,x,y)] + ϵ. Now, define the function h by h(x) = hx(x), for all x X. h can be shown to be measurable, for example, when X admits a countable dense subset. Then, E (x,y) D[ℓ(h,x,y)] = E (x,y) D[ℓ(h(x),y)] = E (x,y) D[ℓ(hx(x),y)] = E (x,y) D[ℓ(hx,x,y)] E x[ inf h Hall E y x[ℓ(h,x,y)] + ϵ] = E x[C ℓ(Hall,x)] + ϵ. Thus, we have inf h Hall E (x,y) D[ℓ(h,x,y)] E x[C ℓ(Hall,x)] + ϵ. Since the inequality holds for any ϵ > 0, we have R ℓ(Hall) = infh Hall E(x,y) D[ℓ(h,x,y)] Ex[C ℓ(Hall,x)]. This implies equality since the inequality R ℓ(H) Ex[C ℓ(H,x)] holds for any H. B.2 Relationship with approximation error Let Aℓdenote the approximation error of a loss function ℓand a hypothesis set H: Aℓ(H) = R ℓ(H) R ℓ(Hall). We will denote by Iℓ(H) the difference of pointwise infima Iℓ(H) = Ex [C ℓ(H,x) C ℓ(Hall,x)], which is non-negative. The minimizability gap can be decomposed as follows in terms of the approximation error and the difference of pointwise infima: Mℓ(H) = R ℓ(H) Ex [C ℓ(H,x)] = R ℓ(H) R ℓ(Hall) + R ℓ(Hall) Ex [C ℓ(H,x)] = Aℓ(H) + R ℓ(Hall) Ex [C ℓ(H,x)] = Aℓ(H) + Ex [C ℓ(Hall,x) C ℓ(H,x)] (By Lemma 14) = Aℓ(H) Iℓ(H). The decomposition immediately implies the following result. Lemma 15. Let ℓbe a surrogate loss such that for (x,y) X Y and any measurable function h Hall, the loss ℓ(h,x,y) only depends on h(x) and y (thus we can write ℓ(h,x,y) = ℓ(h(x),y) for some function ℓ). Then, for any loss function ℓand hypothesis set H, we have: Mℓ(H) Aℓ(H). By Lemma 1, when ℓis the zero-one loss, Iℓ(H) = 0 when the hypothesis set generates labels that cover all possible outcomes for each input. However, for a surrogate loss function, Iℓ(H) is non-negative, and is generally non-zero. Take the example of binary classification and denote the conditional distribution as η(x) = D(Y = 1 X = x). Let H be a family of functions h with h(x) Λ for all x X and such that all values in [ Λ,+Λ] can be reached. 
Consider for example the exponential-based margin loss: $\ell(h,x,y) = e^{-yh(x)}$. Then, $\mathcal{C}_\ell(h,x) = \eta(x)e^{-h(x)} + (1-\eta(x))e^{h(x)}$. Upon observing this, it becomes apparent that the infimum over all measurable functions can be expressed in the following way, for all $x$: $\mathcal{C}^*_\ell(\mathcal{H}_{\mathrm{all}},x) = 2\sqrt{\eta(x)(1-\eta(x))}$, while the infimum over $\mathcal{H}$, $\mathcal{C}^*_\ell(\mathcal{H},x)$, depends on $\Lambda$ and can be expressed as
$$\mathcal{C}^*_\ell(\mathcal{H},x) = \begin{cases} \max\{\eta(x),1-\eta(x)\}\,e^{-\Lambda} + \min\{\eta(x),1-\eta(x)\}\,e^{\Lambda} & \Lambda < \frac{1}{2}\big|\log\frac{\eta(x)}{1-\eta(x)}\big| \\ 2\sqrt{\eta(x)(1-\eta(x))} & \text{otherwise.} \end{cases}$$
Thus, in the deterministic scenario, $\mathcal{I}_\ell(\mathcal{H}) = \mathbb{E}_x[\mathcal{C}^*_\ell(\mathcal{H},x) - \mathcal{C}^*_\ell(\mathcal{H}_{\mathrm{all}},x)] = e^{-\Lambda}$.

B.3 Significance of H-consistency bounds

As shown in the previous section, for the target loss $\ell_{0-1}$, the minimizability gap coincides with the approximation error, $\mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = \mathcal{A}_{\ell_{0-1}}(\mathcal{H})$, when the hypothesis set generates labels that cover all possible outcomes for each input. However, for a surrogate loss $\ell$, the minimizability gap is generally strictly less than the approximation error: $\mathcal{M}_\ell(\mathcal{H}) < \mathcal{A}_\ell(\mathcal{H})$. Thus, an H-consistency bound, expressed as follows for some increasing function $\Gamma$:
$$\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H}) \leq \Gamma\big(\mathcal{R}_\ell(h) - \mathcal{R}^*_\ell(\mathcal{H}) + \mathcal{M}_\ell(\mathcal{H})\big),$$
is more favorable than an excess error bound expressed in terms of approximation errors:
$$\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{A}_{\ell_{0-1}}(\mathcal{H}) \leq \Gamma\big(\mathcal{R}_\ell(h) - \mathcal{R}^*_\ell(\mathcal{H}) + \mathcal{A}_\ell(\mathcal{H})\big),$$
since the equality $\mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = \mathcal{A}_{\ell_{0-1}}(\mathcal{H})$ makes the left-hand sides coincide, while $\mathcal{M}_\ell(\mathcal{H}) \leq \mathcal{A}_\ell(\mathcal{H})$ makes the right-hand side of the former smaller. Here, $\Gamma$ is typically linear or the square-root function modulo constants. When $\mathcal{H} = \mathcal{H}_{\mathrm{all}}$, the family of all measurable functions, by Lemma 14, the H-consistency bound coincides with the excess error bound and implies Bayes-consistency by taking the limit. It is therefore a stronger guarantee than an excess error bound and Bayes-consistency.

C Proofs for comp-sum losses

Let $y_{\max} = \operatorname{argmax}_{y \in \mathcal{Y}} p(x,y)$ and $h(x) = \operatorname{argmax}_{y \in \mathcal{Y}} h(x,y)$, where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy.

C.1 Proof of H-consistency bounds with $\mathcal{T}_{\mathrm{comp}}$ (Theorem 2)

Theorem 2 (H-consistency bound for comp-sum losses). Assume that $\mathcal{H}$ is symmetric and complete and that $\mathcal{T}_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in \mathcal{H}$ and any distribution:
$$\mathcal{T}_{\mathrm{comp}}\big(\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H})\big) \leq \mathcal{R}_{\ell_{\mathrm{comp}}}(h) - \mathcal{R}^*_{\ell_{\mathrm{comp}}}(\mathcal{H}) + \mathcal{M}_{\ell_{\mathrm{comp}}}(\mathcal{H}), \quad (3)$$
with $\mathcal{T}_{\mathrm{comp}}$ an H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$$\mathcal{T}_{\mathrm{comp}}(t) = \begin{cases} \inf_{\tau \in [0,\frac{1}{2}]} \sup_{\mu \in [-\tau, 1-\tau]} \big\{ \frac{1+t}{2}[\Phi(\tau) - \Phi(1-\tau-\mu)] + \frac{1-t}{2}[\Phi(1-\tau) - \Phi(\tau+\mu)] \big\} & n = 2 \\ \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]} \; \inf_{\substack{\tau_1 \geq \max(\tau_2, 1/n) \\ \tau_1 + \tau_2 \leq 1,\ \tau_2 \geq 0}} \; \sup_{\mu \in [-\tau_2, \tau_1]} \big\{ \frac{P+t}{2}[\Phi(\tau_2) - \Phi(\tau_1-\mu)] + \frac{P-t}{2}[\Phi(\tau_1) - \Phi(\tau_2+\mu)] \big\} & n > 2. \end{cases}$$
Furthermore, for any $t \in [0,1]$, there exist a distribution $\mathcal{D}$ and a hypothesis $h \in \mathcal{H}$ such that $\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = t$ and $\mathcal{R}_{\ell_{\mathrm{comp}}}(h) - \mathcal{R}^*_{\ell_{\mathrm{comp}}}(\mathcal{H}) + \mathcal{M}_{\ell_{\mathrm{comp}}}(\mathcal{H}) = \mathcal{T}_{\mathrm{comp}}(t)$.

Proof. For the comp-sum loss $\ell_{\mathrm{comp}}$, the conditional $\ell_{\mathrm{comp}}$-risk can be expressed as follows:
$$\mathcal{C}_{\ell_{\mathrm{comp}}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\ell_{\mathrm{comp}}(h,x,y) = \sum_{y \in \mathcal{Y}} p(x,y)\,\Phi\Big(\frac{e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{h(x,y')}}\Big) = \sum_{y \in \mathcal{Y}} p(x,y)\,\Phi(S_h(x,y))$$
$$= p(x,y_{\max})\,\Phi(S_h(x,y_{\max})) + p(x,h(x))\,\Phi(S_h(x,h(x))) + \sum_{y \notin \{y_{\max}, h(x)\}} p(x,y)\,\Phi(S_h(x,y)),$$
where we let $S_h(x,y) = \frac{e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{h(x,y')}}$ for any $y \in \mathcal{Y}$, with the constraint that $\sum_{y \in \mathcal{Y}} S_h(x,y) = 1$. For any $h \in \mathcal{H}$ such that $h(x) \neq y_{\max}$ and $x \in \mathcal{X}$, we can always find a family of hypotheses $\{h_\mu\} \subset \mathcal{H}$ such that $S_{h,\mu}(x,\cdot) = \frac{e^{h_\mu(x,\cdot)}}{\sum_{y' \in \mathcal{Y}} e^{h_\mu(x,y')}}$ takes the following values:
$$S_{h,\mu}(x,y) = \begin{cases} S_h(x,y) & \text{if } y \notin \{y_{\max}, h(x)\} \\ S_h(x,y_{\max}) + \mu & \text{if } y = h(x) \\ S_h(x,h(x)) - \mu & \text{if } y = y_{\max}. \end{cases}$$
Note that $S_{h,\mu}$ satisfies the constraint $\sum_{y \in \mathcal{Y}} S_{h,\mu}(x,y) = \sum_{y \in \mathcal{Y}} S_h(x,y) = 1$. Let $p_1 = p(x,y_{\max})$, $p_2 = p(x,h(x))$, $\tau_1 = S_h(x,h(x))$ and $\tau_2 = S_h(x,y_{\max})$ to simplify the notation.
Then, by the definition of Sh,µ, we have for any h H and x X, Cℓcomp(h,x) inf µ [ τ2,τ1]Cℓcomp(hµ,x) = sup µ [ τ2,τ1] {p1[Φ(τ2) Φ(τ1 µ)] + p2[Φ(τ1) Φ(τ2 + µ)]} = sup µ [ τ2,τ1] {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (P = p1 + p2 [ 1 n 1 p1 p2,1]) inf P [ 1 n 1 p1 p2,1] inf τ1 max(τ2,1/n) τ1+τ2 1,τ2 0 sup µ [ τ2,τ1] {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (τ1 max(τ2,1/n),τ1 + τ2 1,τ2 0) = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression of Tcomp is simplified. Since Tcomp is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcomp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcomp(E X[ Cℓ0 1,H(h,x)]) E X[Tcomp( Cℓ0 1,H(h,x))] E X[ Cℓcomp,H(h,x)] = Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τϵ [0, 1 2] and satisfies sup µ [ τϵ,1 τϵ] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Rℓcomp(h) EX[C ℓcomp(H,x)] = Cℓcomp(h,x) C ℓcomp(H,x) = sup µ [ τϵ,1 τϵ] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. By letting ϵ 0, we prove the tightness for n = 2. The proof for n > 2 directly extends from the case when n = 2. Indeed, for any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τ1,ϵ, Sh(x,2) = τ2,ϵ, Sh(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 1, and inf P [ 1 n 1 t,1] sup µ [ τ2,ϵ,τ1,ϵ] {P + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + P t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} = sup µ [ τ2,ϵ,τ1,ϵ] {1 + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + 1 t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} < Tcomp(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t Tcomp(t) < Tcomp(t) + ϵ. By letting ϵ 0, we prove the tightness for n > 2. C.2 Characterization of Tcomp (Theorem 3) Theorem 3 (characterization of Tcomp). Assume that Φ is convex, differentiable at 1 2 and Φ ( 1 2) < 0. Then, Tcomp can be expressed as follows: 2) infµ [ 1 2 + µ) + 1+t 2 µ)} n = 2 infτ [ 1 2 ]{Φ(τ) infµ [ τ,τ]{ 1+t 2 Φ(τ µ) + 1 t 2 Φ(τ + µ)}} n > 2. Proof. For n = 2, we have Tcomp(t) = inf τ [0, 1 2 ] sup µ [ τ,1 τ] {1 + t 2 [Φ(τ) Φ(1 τ µ)] + 1 t 2 [Φ(1 τ) Φ(τ + µ)]} = inf τ [0, 1 2 Φ(τ) + 1 t 2 [Φ(1 τ)] inf µ [ τ,1 τ]{1 + t 2 Φ(1 τ µ) + 1 t 2 Φ(τ + µ)}) = inf τ [0, 1 2 Φ(τ) + 1 t 2 [Φ(1 τ)]) inf µ [ 1 2 + µ) + 1 + t inf τ [0, 1 2)) inf µ [ 1 2 + µ) + 1 + t (Φ is convex) 2) inf µ [ 1 2 + µ) + 1 + t 2 µ)} (Φ ( 1 2) < 0, t(τ 1 where the equality can be achieved by τ = 1 For n > 2, we have Tcomp(t) = inf P [ 1 n 1 ,1] inf τ1 max(τ2,1/n) τ1+τ2 1,τ2 0 sup µ [ τ2,τ1] F(P,τ1,τ2,µ) where we let F(P,τ1,τ2,µ) = P +t 2 [Φ(τ2) Φ(τ1 µ)] + P t 2 [Φ(τ1) Φ(τ2 + µ)]. For simplicity, we assume that Φ is differentiable. For general convex Φ, we can proceed by using left and right derivatives, which are non-decreasing. Differentiate F with respect to µ, we have 2 Φ (τ1 µ) + t P 2 Φ (τ2 + µ). Using the fact that P [ 1 n 1 t,1],t [0,1] and Φ is non-decreasing, we obtain that F µ is nonincreasing. Furthermore, Φ is non-decreasing and non-positive, Φ is non-negative, we obtain that Φ (+ ) = 0. 
This implies that F µ (+ ) 0 and F µ ( ) 0. Therefore, there exists µ0 R such that F µ (µ0) = P + t 2 Φ (τ1 µ0) + t P 2 Φ (τ2 + µ0) = 0 By taking µ = τ1 τ2 and using the fact that τ2 1 2) < 0, we have F µ (τ1 τ2) = P + t 2 Φ (τ2) + t P 2 Φ (τ1) < 0. Thus, since F µ is non-increasing, we obtain µ0 < τ1 τ2. Differentiate F with respect to τ2 at µ0, we have F τ2 = P + t 2 Φ (τ2) + t P 2 Φ (τ2 + µ0). Since Φ is non-decreasing, we obtain F τ2 P + t 2 Φ (τ2) + t P 2 Φ (τ2 + τ1 τ2) = F µ (τ1 τ2) < 0, which implies that the infimum infτ1 max{τ2, 1 n } is achieved when τ2 = τ1. Differentiate F with respect to P at µ0 and τ1 = τ2, by the convexity of Φ, we obtain F P = Φ(τ1) Φ(τ1 µ0) + Φ(τ1) Φ(τ1 + µ0) 0, which implies that the infimum inf P [ 1 n 1 ,1] is achieved when P = 1. Above all, we obtain Tcomp(t) = inf τ [ 1 2 ] sup µ [ τ,τ] F(1,τ,τ,µ) = inf τ [ 1 2 ]{Φ(τ) inf µ [ τ,τ]{1 + t 2 Φ(τ µ) + 1 t 2 Φ(τ + µ)}}. C.3 Computation of examples Example: Φ(t) = log(t). For n = 2, plugging in Φ(t) = log(t) in Theorem 3, gives Tcomp = log 2 inf µ [ 1 2 + µ) 1 + t 2 log(1 + t) + 1 t 2 log(1 t). (minimum achieved at µ = t Similarly, for n > 2, plugging in Φ(t) = log(t) in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{ log τ inf µ [ τ,τ]{ 1 t 2 log(τ + µ) 1 + t 2 log(τ µ)}} 2 log(1 + t) + 1 t 2 log(1 t). (minimum achieved at µ = τt) Example: Φ(t) = 1 t 1. For n = 2, plugging in Φ(t) = 1 t 1 in Theorem 3, gives Tcomp = 2 inf µ [ 1 2 1 1 2 + µ + 1 + t 1 t2. (minimum achieved at µ = (1 t) 1 2 (1+t) 1 2 2((1+t) 1 2 +(1 t) 1 2 )) Similarly, for n > 2, plugging in Φ(t) = 1 t 1 in Theorem 3 yields Tcomp = inf τ [ 1 τ inf µ [ τ,τ]{1 + t 2 1 τ µ + 1 + t 2 1 τ + µ}} = inf τ [ 1 2 ] 1 2τ (1 1 t2) (minimum achieved at µ = (1 t) 1 2 (1+t) 1 2 (1+t) 1 2 +(1 t) 1 2 τ) 1 t2. (minimum achieved at τ = 1 Example: Φ(t) = 1 q(1 tq),q (0,1). For n = 2, plugging in Φ(t) = 1 q(1 tq) in Theorem 3, gives q2q inf µ [ 1 = 1 q2q (1 + t) 1 1 q + (1 t) (minimum achieved at µ = (1 t) 1 1 q (1+t) 1 1 q 2((1+t) 1 1 q +(1 t) 1 1 q )) Similarly, for n > 2, plugging in Φ(t) = 1 q(1 tq) in Theorem 3 yields Tcomp = inf τ [ 1 q inf µ [ τ,τ]{ 1 + t 2q (τ µ)q 1 t 2q (τ + µ)q}} = inf τ [ 1 1 1 q + (1 t) (minimum achieved at µ = (1 t) 1 1 q (1+t) 1 1 q (1+t) 1 1 q +(1 t) 1 1 q τ) = 1 qnq (1 + t) 1 1 q + (1 t) 1 qnq . (minimum achieved at τ = 1 Example: Φ(t) = 1 t. For n = 2, plugging in Φ(t) = 1 t in Theorem 3, gives 2 inf µ [ 1 2 µ) + 1 + t 2 + µ)} = 1 Similarly, for n > 2, plugging in Φ(t) = 1 t in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{(1 τ) inf µ [ τ,τ]{1 + t 2 (1 τ + µ) + 1 t 2 (1 τ µ)}} = inf τ [ 1 2 ]τ t (minimum achieved at µ = τ) n. (minimum achieved at τ = 1 Example: Φ(t) = (1 t)2. For n = 2, plugging in Φ(t) = (1 t)2 in Theorem 3, gives 4 inf µ [ 1 Similarly, for n > 2, plugging in Φ(t) = (1 t)2 in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{(1 τ)2 inf µ [ τ,τ]{1 + t 2 (1 τ + µ)2 + 1 t 2 (1 τ µ)2}} = inf τ [ 1 2 ]{(1 τ)2t2} (minimum achieved at µ = t(τ 1)) 4 . (minimum achieved at τ = 1 D Proofs for constrained losses Let ymax = argmaxy Y p(x,y) and h(x) = argmaxy Y h(x,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. D.1 Proof of H-consistency bounds with Tcstnd (Theorem 10) Theorem 10 (H-consistency bound for constrained losses). Assume that H is symmetric and complete. Assume that Tcstnd is convex. 
Then, for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) with H-estimation error transformation for constrained losses defined on t [0,1] by Tcstnd(t) = inf τ 0sup µ R { 1 t 2 [Φ(τ) Φ( τ + µ)] + 1+t 2 [Φ( τ) Φ(τ µ)]} n = 2 inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R { 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]} n > 2. Furthermore, for any t [0,1], there exist a distribution D and a hypothesis h H such that Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Tcstnd(t). Proof. For the constrained loss ℓcstnd, the conditional ℓcstnd-risk can be expressed as follows: Cℓcstnd(h,x) = y Y p(x,y)ℓcstnd(h,x,y) = y Y p(x,y) y y Φ( h(x,y )) = y Y Φ( h(x,y)) y y p(x,y ) = y Y Φ( h(x,y))(1 p(x,y)) = Φ( h(x,ymax))(1 p(x,ymax)) + Φ( h(x,h(x)))(1 p(x,h(x))) + y {ymax,h(x)} Φ( h(x,y))(1 p(x,y)). For any h H and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymax,h(x)} h(x,ymax) + µ if y = h(x) h(x,h(x)) µ if y = ymax. Note that the hypotheses hµ satisfies the constraint: y Y hµ(x,y) = y Y h(x,y) = 0, µ R. Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = h(x,h(x)) and τ2 = h(x,ymax) to simplify the notation. Then, by the definition of hµ, we have for any h H and x X, Cℓcstnd(h,x) inf µ RCℓcstnd(hµ,x) = sup µ R {(1 p1)[Φ( τ2) Φ( τ1 + µ)] + (1 p2)[Φ( τ1) Φ( τ2 µ)]} = sup µ R {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (P = p1 + p2 [ 1 n 1,1]) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (τ1 0, τ2 τ1) = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) where for n = 2, an additional constraint τ1 + τ2 = 0 is imposed and the expression of Tcomp is simplified. Since Tcstnd is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcstnd(E X[ Cℓ0 1,H(h,x)]) E X[Tcstnd( Cℓ0 1,H(h,x))] E X[ Cℓcstnd,H(h,x)] = Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,2) = τϵ 0 and satisfies sup µ R {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Rℓcstnd(h) EX[C ℓcstnd(H,x)] = Cℓcstnd(h,x) C ℓcstnd(H,x) = sup µ R {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,1) = τ1,ϵ, h(x,2) = τ2,ϵ, h(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 0, and inf P [ 1 n 1 ,1]sup µ R {2 P t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 2 P + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} = sup µ R {1 t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 1 + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} < Tcstnd(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) < Tcstnd(t) + ϵ. D.2 Characterization of Tcstnd (Theorem 11) Theorem 11 (characterization of Tcstnd). 
Assume that Φ is convex, differentiable at zero and Φ (0) < 0. Then, Tcstnd can be expressed as follows: Tcstnd(t) = Φ(0) infµ R{ 1 t 2 Φ(µ) + 1+t 2 Φ( µ)} n = 2 infτ 0{(2 1 n 1)Φ( τ) infµ R{ 2 t 1 n 1 2 Φ( τ + µ) + 2+t 1 n 1 2 Φ( τ µ)}} n > 2 {Φ(0) infµ R{ 1 t 2 Φ(µ) + 1+t 2 Φ( µ)} n = 2 infτ 0{2Φ( τ) infµ R{ 2 t 2 Φ( τ + µ) + 2+t 2 Φ( τ µ)}} n > 2. Proof. For n = 2, we have Tcstnd(t) = inf τ 0sup µ R {1 t 2 [Φ(τ) Φ( τ + µ)] + 1 + t 2 [Φ( τ) Φ(τ µ)]} = inf τ 0(1 t 2 Φ(τ) + 1 + t 2 [Φ( τ)]) inf µ R{1 t 2 Φ( τ + µ) + 1 + t = inf τ 0(1 t 2 Φ(τ) + 1 + t 2 [Φ( τ)]) inf µ R{1 t 2 Φ(µ) + 1 + t inf τ 0(Φ(0) Φ (0)tτ) inf µ R{1 t 2 Φ(µ) + 1 + t 2 Φ( µ)} (Φ is convex) = Φ(0) inf µ R{1 t 2 Φ(µ) + 1 + t 2 Φ( µ)} (Φ (0) < 0, tτ 0) where the equality can be achieved by τ = 0. For n > 2, we have Tcstnd(t) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R F(P,τ1,τ2,µ) where we let F(P,τ1,τ2,µ) = 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]. For simplicity, we assume that Φ is differentiable. For general convex Φ, we can proceed by using left and right derivatives, which are non-decreasing. Differentiate F with respect to µ, we have F µ = P + t 2 2 Φ ( τ1 + µ) + 2 P + t 2 Φ ( τ2 µ). Using the fact that P [ 1 n 1,1],t [0,1] and Φ is non-decreasing, we obtain that F µ is nonincreasing. Furthermore, Φ is non-decreasing and non-positive, Φ is non-negative, we obtain that Φ (+ ) = 0. This implies that F µ (+ ) 0 and F µ ( ) 0. Therefore, there exists µ0 R such that F µ (µ0) = P + t 2 2 Φ ( τ1 + µ0) + 2 P + t 2 Φ ( τ2 µ0) = 0 By taking µ = τ1 τ2 and using the fact that Φ (0) < 0, we have F µ (τ1 τ2) = P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ1) < 0. Thus, since F µ is non-increasing, we obtain µ0 < τ1 τ2. Differentiate F with respect to τ2 at µ0, we have F τ2 = P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ2 µ0). Since Φ is non-decreasing, we obtain F τ2 P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ2 τ1 + τ2) = F µ (τ1 τ2) < 0, which implies that the infimum infτ1 max{τ2,0} is achieved when τ2 = τ1. Differentiate F with respect to P at µ0 and τ1 = τ2, by the convexity of Φ, we obtain F P = Φ( τ1 + µ0) Φ( τ1) Φ( τ1) + Φ( τ1 µ0) 0, which implies that the infimum inf P [ 1 n 1 ,1] is achieved when P = 1 n 1. Above all, we obtain Tcstnd(t) = inf τ 0sup µ R F( 1 n 1,τ,τ,µ) = inf τ 0{(2 1 n 1)Φ( τ) inf µ R{2 t 1 n 1 2 Φ( τ + µ) + 2 + t 1 n 1 2 Φ( τ µ)}} inf τ 0sup µ R F(0,τ,τ,µ) = inf τ 0{2Φ( τ) inf µ R{2 t 2 Φ( τ + µ) + 2 + t 2 Φ( τ µ)}}. D.3 Computation of examples Example: Φ(t) = Φexp(t) = e t. For n = 2, plugging in Φ(t) = e t in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 e µ + 1 + t 1 t2. (minimum achieved at µ = 1 For n > 2, plugging in Φ(t) = e t in Theorem 11 yields Tcomp inf τ 0{2eτ inf µ R{2 t 2 eτ µ + 2 + t 2 inf µ R{2 t 2 e µ + 2 + t 2 eµ} (minimum achieved at τ = 0) 4 t2. (minimum achieved at µ = 1 Example: Φ(t) = Φhinge(t) = max{0,1 t}. For n = 2, plugging in Φ(t) = max{0,1 t} in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 max{0,1 µ} + 1 + t 2 max{0,1 + µ}} = t. (minimum achieved at µ = 1) For n > 2, plugging in Φ(t) = max{0,1 t} in Theorem 11 yields Tcomp inf τ 0{2max{0,1 + τ} inf µ R{2 t 2 max{0,1 + τ µ} + 2 + t 2 max{0,1 + τ + µ}}} = 2 inf µ R{2 t 2 max{0,1 µ} + 2 + t 2 max{0,1 + µ}} (minimum achieved at τ = 0) = t. (minimum achieved at µ = 1) Example: Φ(t) = Φsq hinge(t) = (1 t)21t 1. For n = 2, plugging in Φ(t) = (1 t)21t 1 in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 (1 µ)21µ 1 + 1 + t 2 (1 + µ)21µ 1} = t2. 
(minimum achieved at µ = t) For n > 2, plugging in Φ(t) = (1 t)21t 1 in Theorem 11 yields Tcomp inf τ 0{2(1 + τ)21τ 1 inf µ R{2 t 2 (1 + τ µ)21 τ+µ 1 + 2 + t 2 (1 + τ + µ)21τ+µ 1}} 2 inf µ R{2 t 2 (1 µ)21µ 1 + 2 + t 2 (1 + µ)21µ 1} (minimum achieved at τ = 0) 2 . (minimum achieved at µ = t Example: Φ(t) = Φsq(t) = (1 t)2. For n = 2, plugging in Φ(t) = (1 t)2 in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 (1 µ)2 + 1 + t 2 (1 + µ)2} = t2. (minimum achieved at µ = t) For n > 2, plugging in Φ(t) = (1 t)2 in Theorem 11 yields Tcomp inf τ 0{2(1 + τ)2 inf µ R{2 t 2 (1 + τ µ)2 + 2 + t 2 (1 + τ + µ)2}} 2 inf µ R{2 t 2 (1 µ)2 + 2 + t 2 (1 + µ)2} (minimum achieved at τ = 0) 2 . (minimum achieved at µ = t E Extensions of comp-sum losses E.1 Proof of H-consistency bounds with T comp (Theorem 5) Theorem 5 (H-consistency bound for comp-sum losses). Assume that T comp is convex. Then, the following inequality holds for any hypothesis h H and any distribution: T comp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) with T comp the H-estimation error transformation for comp-sum losses defined for all t [0,1] by T comp(t) = inf τ [0, 1 2 ] sup µ [smin τ,1 τ smin] { 1+t 2 [Φ(τ) Φ(1 τ µ)] + 1 t 2 [Φ(1 τ) Φ(τ + µ)]} n = 2 inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C { P +t 2 [Φ(τ2) Φ(τ1 µ)] + P t 2 [Φ(τ1) Φ(τ2 + µ)]} n > 2, where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}], smax = 1 1+(n 1)e 2 infx Λ(x) and smin = 1 1+(n 1)e2 infx Λ(x) . Furthermore, for any t [0,1], there exist a distribution D and h H such that Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Tcomp(t). Proof. For the comp-sum loss ℓcomp, the conditional ℓcomp-risk can be expressed as follows: Cℓcomp(h,x) = y Y p(x,y)ℓcomp(h,x,y) = y Y p(x,y)Φ( eh(x,y) y Y eh(x,y ) ) = y Y p(x,y)Φ(Sh(x,y)) = p(x,ymax)Φ(Sh(x,ymax)) + p(x,h(x))Φ(Sh(x,h(x))) + y {ymax,h(x)} p(x,y)Φ(Sh(x,y)) where we let Sh(x,y) = eh(x,y) y Y eh(x,y ) for any y Y with the constraint that y Y Sh(x,y) = 1. Note that for any h H, 1 1 + (n 1)e2Λ(x) = e Λ(x) e Λ(x) + (n 1)eΛ(x) Sh(x,y) eΛ(x) eΛ(x) + (n 1)e Λ(x) = 1 1 + (n 1)e 2Λ(x) Therefore for any (x,y) X Y, Sh(x,y) [Smin,Smax], where we let Smax = 1 1+(n 1)e 2Λ(x) and Smin = 1 1+(n 1)e2Λ(x) . Furthermore, all values in [Smin,Smax] of Sh can be reached for some h H. Observe that 0 Smax + Smin 1. Let ymax = argmaxy Y p(x,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymax and x X, we can always find a family of hypotheses {hµ} H such that Sh,µ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: Sh,µ(x,y) = Sh(x,y) if y / {ymax,h(x)} Sh(x,ymax) + µ if y = h(x) Sh(x,h(x)) µ if y = ymax. Note that Sh,µ satisfies the constraint: y Y Sh,µ(x,y) = y Y Sh(x,y) = 1. Since Sh,µ(x,y) [Smin,Smax], we have the following constraints on µ: Smin Sh(x,ymax) µ Smax Sh(x,ymax) Sh(x,h(x)) Smax µ Sh(x,h(x)) Smin. (7) Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = Sh(x,h(x)) and τ2 = Sh(x,ymax) to simplify the notation. Let C = {µ R µ verify constraint (7)}. Since Sh(x,h(x)) Smax Smax Sh(x,ymax) and Smin Sh(x,ymax) Sh(x,h(x)) Smin, C is not an empty set and can be expressed as C = [max{Smin τ2,τ1 Smax},min{Smax τ2,τ1 Smin}]. 
Then, by the definition of Sh,µ, we have for any h H and x X, Cℓcomp(h,x) inf µ C Cℓcomp(hµ,x) = sup µ C {p1[Φ(τ2) Φ(τ1 µ)] + p2[Φ(τ1) Φ(τ2 + µ)]} = sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (P = p1 + p2 [ 1 n 1 p1 p2,1]) inf P [ 1 n 1 p1 p2,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (Smin τ2 τ1 Smax, τ1 + τ2 1) inf P [ 1 n 1 p1 p2,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (Smin smin smax Smax) = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}] C, smax = 1 1+(n 1)e 2 infx Λ(x) and smin = 1 1+(n 1)e2 infx Λ(x) . Note that for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression can be simplified as Cℓcomp(h,x) inf µ C Cℓcomp(hµ,x) inf τ [0, 1 2 ] sup µ [smin τ,1 τ smin] {1 + p1 p2 2 [Φ(τ) Φ(1 τ µ)] + 1 p1 + p2 2 [Φ(1 τ) Φ(τ + µ)]} = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where we use the fact that smax +smin = 1 and P = 1 when n = 2. Since Tcomp is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcomp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcomp(E X[ Cℓ0 1,H(h,x)]) E X[Tcomp( Cℓ0 1,H(h,x))] E X[ Cℓcomp,H(h,x)] = Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τϵ [0, 1 2] and satisfies sup µ [smin τϵ,1 τϵ smin] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) = t and Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Rℓcomp(h) EX[C ℓcomp(H,x)] = Cℓcomp(h,x) C ℓcomp(H,x) = sup µ [smin τϵ,1 τϵ smin] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τ1,ϵ, Sh(x,2) = τ2,ϵ and Sh(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 1, and inf P [ 1 n 1 t,1]sup µ C {P + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + P t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} = sup µ C {1 + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + 1 t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} < Tcomp(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) < Tcomp(t) + ϵ. By letting ϵ 0, we conclude the proof. E.2 Logistic loss Theorem 6 (H-consistency bounds for logistic loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓlog(h) R ℓlog(H) + Mℓlog(H)), where ℓlog = log( eh(x,y) y Y eh(x,y ) ) and Ψ(t) = 2 log(1 + t) + 1 t 2 log(1 t) t smax smin smin+smax t 2 log( smax smin ) + log( 2 smaxsmin smax+smin ) otherwise. Proof. For the multinomial logistic loss ℓlog, plugging in Φ(t) = log(t) in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t 2 [ log(τ2) + log(τ1 µ)] + P t 2 [ log(τ1) + log(τ2 + µ)]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. 
By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ log(smin + smax 2 ) + log(smin + smax 2 [ log(smin + smax 2 ) + log(smin + smax = log(smin + smax 2 ) + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t 2 log( smin+smax 2 log( smin+smax 2 + µ). Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = t(smin + smax) Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = smin smax 2 . Since µ smin smax 2 is equivalent to t smax smin smin+smax , the maximum can be expressed as max µ C g(µ) = {g(µ ) t smax smin smin+smax g( smin smax 2 ) otherwise Computing the value of g at these points yields: g(µ ) = 1 + t 2 log (1 + t)(smin + smax) 2 log (1 t)(smin + smax) g(smin smax 2 ) = 1 + t 2 log(smax) + 1 t 2 log(smin) Then, if t smax smin smin+smax , we obtain T comp = log(smin + smax 2 ) + 1 + t 2 log (1 + t)(smin + smax) 2 log (1 t)(smin + smax) 2 log(1 + t) + 1 t 2 log(1 t). Otherwise, we obtain T comp = log(smin + smax 2 ) + 1 + t 2 log(smax) + 1 t 2 log(smin) smin ) + log(2 smaxsmin smax + smin ). Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓlog(h) R ℓlog(H) + Mℓlog(H)), 2 log(1 + t) + 1 t 2 log(1 t) t smax smin smin+smax t 2 log( smax smin ) + log( 2 smaxsmin smax+smin ) otherwise. E.3 Sum exponential loss Theorem 9 (H-consistency bounds for sum exponential loss). For any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓexp(h) R ℓexp(H) + Mℓexp(H)) where ℓexp = y y eh(x,y ) h(x,y) and Ψ(t) = 1 t2 t s2 max s2 min s2 min+s2max smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax+smin) otherwise. . Proof. For the sum exponential loss ℓexp, plugging in Φ(t) = 1 t 1 in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t τ2 1 τ1 µ] + P t τ1 1 τ2 + µ]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ 2 smin + smax 2 smin + smax 2µ] 2 [ 2 smin + smax 2 smin + smax + 2µ]} = 2 smin + smax + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t smin+smax 2µ 1 t smin+smax+2µ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = smin + smax 1 t Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . 
Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = 2 . Since µ smin smax 2 is equivalent to t s2 max s2 min s2 min+s2max , the maximum can be expressed as max µ C g(µ) = g(µ ) t s2 max s2 min s2 min+s2max g( smin smax 2 ) otherwise Computing the value of g at these points yields: 1 t2 2 smin + smax g(smin smax 2 ) = 1 + t Then, if t s2 max s2 min s2 min+s2max , we obtain T comp = 2 smin + smax + 1 1 t2 2 smin + smax = 1 Otherwise, we obtain T comp = 2 smin + smax 1 + t = smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax + smin). Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓexp(h) R ℓexp(H) + Mℓexp(H)), 1 t2 t s2 max s2 min s2 min+s2max smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax+smin) otherwise. E.4 Generalized cross-entropy loss Theorem 16 (H-consistency bounds for generalized cross-entropy loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓgce(h) R ℓgce(H) + Mℓgce(H)), where Ψ(t) = 1 q( smin+smax 2 ) q ( (1+t) 1 1 q +(1 t) 1 1 q 2 ) 1 q 1 t s1 q max s1 q min s1 q min+s1 q max t 2q(sq max sq min) + 1 q( sq min+sq max 2 ( smin+smax 2 ) q) otherwise. 1 q[1 ( eh(x,y) y Y eh(x,y ) ) Proof. For generalized cross-entropy loss ℓgce, plugging Φ(t) = 1 q(1 tq) in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t q (τ2)q + 1 q (τ1 µ)q] + P t q (τ1)q + 1 q (τ2 + µ)q]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2q [ (smin + smax q + (smin + smax 2q [ (smin + smax q + (smin + smax q (smin + smax q + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t 2q ( smin+smax 2 µ) q + 1 t 2q ( smin+smax 2 + µ) q. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = (1 t) 1 1 q (1 + t) 1 1 q (1 + t) 1 1 q + (1 t) 1 1 q smin + smax Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = 2 . 
Since µ smin smax 2 is equivalent to t s1 q max s1 q min s1 q min+s1 q max , the maximum can be expressed as max µ C g(µ) = g(µ ) t s1 q max s1 q min s1 q min+s1 q max g( smin smax 2 ) otherwise Computing the value of g at these points yields: q (smin + smax 1 1 q + (1 t) g(smin smax 2 ) = 1 + t 2q (smax)q + 1 t Then, if t s1 q max s1 q min s1 q min+s1 q max , we obtain q (smin + smax 1 1 q + (1 t) q (smin + smax q (smin + smax 1 1 q + (1 t) 1 Otherwise, we obtain q (smin + smax 2q (smax)q + 1 t 2q (sq max sq min) + 1 q (sq min + sq max 2 (smin + smax Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓgce(h) R ℓgce(H) + Mℓgce(H)), 1 q( smin+smax 2 ) q ( (1+t) 1 1 q +(1 t) 1 1 q 2 ) 1 q 1 t s1 q max s1 q min s1 q min+s1 q max t 2q(sq max sq min) + 1 q( sq min+sq max 2 ( smin+smax 2 ) q) otherwise. E.5 Mean absolute error loss Theorem 17 (H-consistency bounds for mean absolute error loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) 2(Rℓmae(h) R ℓmae(H) + Mℓmae(H)) smax smin . Proof. For mean absolute error loss ℓmae, plugging Φ(t) = 1 t in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t 2 [ (τ2) + (τ1 µ)] + P t 2 [ (τ1) + (τ2 + µ)]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ (smin + smax 2 ) + (smin + smax 2 [ (smin + smax 2 ) + (smin + smax = sup µ C tµ where C = [ smin smax 2 , smax smin 2 ]. Since tµ is monotonically non-increasing, the maximum over C can be achieved by µ = smin smax 2 , T comp = smax smin Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) 2(Rℓmae(h) R ℓmae(H) + Mℓmae(H)) smax smin . F Extensions of constrained losses F.1 Proof of H-consistency bound with T cstnd (Theorem 12) Theorem 12 (H-consistency bound for constrained losses). Assume that T cstnd is convex. Then, the following inequality holds for any hypothesis h H and any distribution: T cstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). (6) with T cstnd the H-estimation error transformation for constrained losses defined for all t [0,1] by T cstnd(t) = inf τ 0 sup µ [τ Λmin,τ+Λmin] { 1 t 2 [Φ(τ) Φ( τ + µ)] + 1+t 2 [Φ( τ) Φ(τ µ)]} n = 2 inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C { 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]} n > 2, where C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin] and Λmin = infx X Λ(x). Furthermore, for any t [0,1], there exist a distribution D and a hypothesis h H such that Rℓ0 1(h) R ℓ0 1(H)+ Mℓ0 1(H) = t and Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Tcstnd(t). Proof. For the constrained loss ℓcstnd, the conditional ℓcstnd-risk can be expressed as follows: Cℓcstnd(h,x) = y Y p(x,y)ℓcstnd(h,x,y) = y Y p(x,y) y y Φ( h(x,y )) = y Y Φ( h(x,y)) y y p(x,y ) = y Y Φ( h(x,y))(1 p(x,y)) = Φ( h(x,ymax))(1 p(x,ymax)) + Φ( h(x,h(x)))(1 p(x,h(x))) + y {ymax,h(x)} Φ( h(x,y))(1 p(x,y)). For any h H and x X, by the definition of H, we can always find a family of hypotheses {hµ} H such that hµ(x, ) take the following values: h(x,y) if y / {ymax,h(x)} h(x,ymax) + µ if y = h(x) h(x,h(x)) µ if y = ymax. 
Note that the hypotheses hµ satisfies the constraint: y Y hµ(x,y) = y Y h(x,y) = 0, µ R. Since hµ(x,y) [ Λ(x),Λ(x)], we have the following constraints on µ: Λ(x) h(x,ymax) µ Λ(x) h(x,ymax) Λ(x) + h(x,h(x)) µ Λ(x) + h(x,h(x). Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = h(x,h(x)) and τ2 = h(x,ymax) to simplify the notation. Then, the constraint on µ can be expressed as µ C, C = [max{τ1, τ2} Λ(x),min{τ1, τ2} + Λ(x)] Since max{τ1, τ2} min{τ1, τ2} = τ1 + τ2 τ1 + τ2 2Λ(x), C is not an empty set. By the definition of hµ, we have for any h H and x X, Cℓcstnd(h,x) inf µ C Cℓcstnd(hµ,x) = sup µ C {(1 p1)[Φ( τ2) Φ( τ1 + µ)] + (1 p2)[Φ( τ1) Φ( τ2 µ)]} = sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (P = p1 + p2 [ 1 n 1,1]) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (τ1 0, τ2 τ1) inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin] C since Λmin Λ(x)) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}{2 P p1 + p2 2 Φ( τ2) + 2 P + p1 p2 inf µ C{2 P p1 + p2 2 Φ( τ1 + µ) + 2 P + p1 p2 2 Φ( τ2 µ)}} = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) Note that for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression can be simplified as Cℓcstnd(h,x) inf µ C Cℓcstnd(hµ,x) inf τ 0 sup µ [τ Λmin,τ+Λmin] {1 p1 + p2 2 [Φ(τ) Φ( τ + µ)] + 1 + p1 p2 2 [Φ( τ) Φ(τ µ)]} = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) Since Tcstnd is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcstnd(E X[ Cℓ0 1,H(h,x)]) E X[Tcstnd( Cℓ0 1,H(h,x))] E X[ Cℓcstnd,H(h,x)] = Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). Let n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,2) = τϵ 0 and satisfies sup µ [τϵ Λmin,τϵ+Λmin] {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) = t and Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Rℓcstnd(h) EX[C ℓcstnd(H,x)] = Cℓcstnd(h,x) C ℓcstnd(H,x) = sup µ [τϵ Λmin,τϵ+Λmin] {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, for any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0,3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,1) = τ1,ϵ, h(x,2) = τ2,ϵ, h(x,3) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 0, and inf P [ 1 n 1 ,1]sup µ C {2 P t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 2 P + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} = sup µ C {1 t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 1 + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} < Tcstnd(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. F.2 Constrained exponential loss Theorem 13 (H-consistency bounds for constrained exponential loss). Let Φ(t) = e t. For any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H)) where Ψ(t) = 1 t2 t e2Λmin 1 e2Λmin+1 t 2(eΛmin e Λmin) + 2 eΛmin e Λmin 2 otherwise. . Proof. 
For n = 2, plugging in Φ(t) = e t in Theorem 12, gives T cstnd(t) = inf τ 0 sup µ [τ Λmin,τ+Λmin] {1 t 2 [e τ eτ µ] + 1 + t 2 [eτ e τ+µ]}. By differentiating with respect to τ, we can see that the infimum is achieved when τ = 0 modulo some elementary analysis. Thus, T cstnd can be reformulated as T cstnd = sup µ [ Λmin,Λmin] {1 t 2 [1 e µ] + 1 + t = 1 + sup µ [ Λmin,Λmin] g(µ). where g(µ) = 1 t 2 eµ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = 1 1 + t Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since µ 0 and Λmin 0, we have In view of the constraint, if µ Λmin, the maximum is achieved by µ = µ . Otherwise, if µ < Λmin, since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = Λmin. Since µ Λmin is equivalent to t e2Λmin 1 e2Λmin+1, the maximum can be expressed as max µ [ Λmin,Λmin]g(µ) = {g(µ ) t e2Λmin 1 e2Λmin+1 g( Λmin) otherwise Computing the value of g at these points yields: g( Λmin) = 1 t 2 eΛmin 1 + t Then, if t e2Λmin 1 e2Λmin+1, we obtain T cstnd = 1 Otherwise, we obtain T cstnd = 1 1 t 2 eΛmin 1 + t 2(eΛmin e Λmin) + 2 eΛmin e Λmin For n > 2, plugging in Φ(t) = e t in Theorem 12, gives T cstnd(t) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P t 2 [eτ2 eτ1 µ] + 2 P + t 2 [eτ1 eτ2+µ]}. where C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin]. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ2 = τ1 = 0 and P = 1 modulo some elementary analysis. Thus, T cstnd can be reformulated as T cstnd = sup µ C {1 t 2 [1 e µ] + 1 + t = 1 + sup µ C g(µ). where C = [ Λmin,Λmin] and g(µ) = 1 t 2 eµ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = 1 Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since µ 0 and Λmin 0, we have In view of the constraint, if µ Λmin, the maximum is achieved by µ = µ . Otherwise, if µ < Λmin, since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = Λmin. Since µ Λmin is equivalent to t e2Λmin 1 e2Λmin+1, the maximum can be expressed as max µ [ Λmin,Λmin]g(µ) = {g(µ ) t e2Λmin 1 e2Λmin+1 g( Λmin) otherwise Computing the value of g at these points yields: g( Λmin) = 1 t 2 eΛmin 1 + t Then, if t e2Λmin 1 e2Λmin+1, we obtain T cstnd = 1 Otherwise, we obtain T cstnd = 1 1 t 2 eΛmin 1 + t 2(eΛmin e Λmin) + 2 eΛmin e Λmin Since T cstnd is convex, by Theorem 12, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H)) 1 t2 t e2Λmin 1 e2Λmin+1 t 2(eΛmin e Λmin) + 2 eΛmin e Λmin 2 otherwise.
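As a complement to the computation above, the following is a minimal numerical sanity check, not taken from the paper, of the closed-form expression of Ψ in Theorem 13 for the constrained exponential loss with n = 2. It evaluates the inf-sup defining the transformation of Theorem 12 with Φ(u) = e^{-u} on a grid and compares it with the piecewise closed form. The function names T_cstnd_numeric and Psi_closed_form, the grid sizes, the range of τ, and the choice Λ_min = 0.5 are illustrative assumptions introduced only for this sketch.

# Numerical sanity check of Theorem 13 (constrained exponential loss, n = 2).
# Grid parameters and Lambda_min are illustrative, not values from the paper.
import numpy as np

def T_cstnd_numeric(t, lam, n_tau=200, n_mu=2000, tau_max=3.0):
    """Grid evaluation of inf_{tau >= 0} sup_{mu in [tau - lam, tau + lam]} of the
    n = 2 expression in Theorem 12 with Phi(u) = exp(-u)."""
    best = np.inf
    for tau in np.linspace(0.0, tau_max, n_tau):
        mu = np.linspace(tau - lam, tau + lam, n_mu)
        val = ((1 - t) / 2 * (np.exp(-tau) - np.exp(tau - mu))
               + (1 + t) / 2 * (np.exp(tau) - np.exp(-tau + mu)))
        best = min(best, val.max())  # sup over mu, then inf over tau
    return best

def Psi_closed_form(t, lam):
    """Closed form claimed in Theorem 13: 1 - sqrt(1 - t^2) while the optimal mu
    stays inside [-lam, lam], and linear in t once it hits the boundary -lam."""
    threshold = (np.exp(2 * lam) - 1) / (np.exp(2 * lam) + 1)
    if t <= threshold:
        return 1 - np.sqrt(1 - t ** 2)
    return t / 2 * (np.exp(lam) - np.exp(-lam)) + (2 - np.exp(lam) - np.exp(-lam)) / 2

if __name__ == "__main__":
    lam = 0.5  # illustrative value of Lambda_min
    for t in np.linspace(0.05, 0.95, 10):
        print(f"t = {t:.2f}  grid value: {T_cstnd_numeric(t, lam):.4f}  "
              f"closed form: {Psi_closed_form(t, lam):.4f}")

Up to grid resolution, the two columns agree on both regimes of t, which also confirms empirically that the infimum over τ is attained at τ = 0, as argued in the proof.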