# H-Consistency Bounds: Characterization and Extensions

Anqi Mao, Courant Institute, New York, NY 10012, aqmao@cims.nyu.edu
Mehryar Mohri, Google Research & CIMS, New York, NY 10011, mohri@google.com
Yutao Zhong, Courant Institute, New York, NY 10012, yutao@cims.nyu.edu

Abstract

A series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] have introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining if they hold and deriving these bounds have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations? This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification. We present new and tight H-consistency bounds for both the family of constrained losses and that of comp-sum losses, which covers the familiar cross-entropy, or logistic loss, applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets. Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class H-consistency bound for cross-entropy reduces to an excess bound and is not significant. Instead, we prove a much stronger and more significant guarantee.

1 Introduction

Bayes-consistency is an important property of surrogate loss functions. It requires that minimizing the surrogate excess error over the family of all measurable functions leads to the minimization of the target error in the limit [Steinwart, 2007]. This property applies to a broad family of convex margin-based losses in binary classification [Zhang, 2004a, Bartlett et al., 2006], as well as some extensions in multi-class classification [Tewari and Bartlett, 2007]. However, Bayes-consistency does not apply to the hypothesis sets commonly used for learning, such as the family of linear models or that of neural networks, which of course do not include all measurable functions. Furthermore, it is also only an asymptotic property and does not supply any convergence guarantee.

To address these limitations, a series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining the validity of these bounds and deriving them have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations for H-consistency bounds?

This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification.
Previous approaches to deriving these bounds required the development of new proofs for each specific case. In contrast, we introduce the general concept of an error transformation function that serves as a general tool for deriving such bounds with tightness guarantees. We show that deriving an H-consistency bound for comp-sum losses and constrained losses, for both complete and bounded hypothesis sets, can be reduced to the calculation of their corresponding error transformation function. Our general tools and tight bounds offer several remarkable advantages: first, they improve existing bounds for complete hypothesis sets previously proven in [Awasthi et al., 2022b]; second, they encompass all comp-sum and constrained losses studied thus far [Awasthi et al., 2022a, Mao et al., 2023h] as well as many new ones; third, they extend beyond the completeness assumption adopted in previous work; fourth, they provide novel guarantees for bounded hypothesis sets; and, finally, they help prove a much stronger and more significant guarantee for the logistic loss with linear hypothesis sets than that of [Zheng et al., 2023].

Previous work. Here, we briefly discuss recent studies of H-consistency bounds by Awasthi et al. [2022a,b], Mao et al. [2023h] and Zheng et al. [2023]. Awasthi et al. [2022a] introduced and studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for bounded hypothesis sets of linear models and one-hidden-layer neural networks. The subsequent study [Awasthi et al., 2022b] further generalized the framework to multi-class classification and presented an extensive study of H-consistency bounds for diverse multi-class surrogate losses, including negative results for max losses [Crammer and Singer, 2001] and positive results for sum losses [Weston and Watkins, 1998] and constrained losses [Lee et al., 2004]. However, the hypothesis sets examined in their analysis were assumed to be complete, which rules out the bounded hypothesis sets typically used in practice. Moreover, the final bounds derived in [Awasthi et al., 2022b] are based on ad hoc methods and may not be tight. [Mao et al., 2023h] complemented this previous work by studying a wide family of comp-sum losses in multi-class classification, which generalizes the sum losses and includes as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption on the hypothesis sets was adopted, and their H-consistency bounds do not apply to common bounded hypothesis sets in practice. Recently, Zheng et al. [2023] proved H-consistency bounds for the multi-class logistic loss with bounded linear hypothesis sets. However, their bounds require a crucial distributional assumption, under which the minimizability gaps coincide with the approximation errors. Thus, their bounds can be recovered as excess error bounds, which are less significant.
Other related work on H-consistency bounds includes H-consistency bounds for pairwise ranking [Mao, Mohri, and Zhong, 2023d,e]; theoretically grounded surrogate losses and algorithms for learning with abstention supported by H-consistency bounds, including the study of score-based abstention [Mao, Mohri, and Zhong, 2023f], predictor-rejector abstention [Mao, Mohri, and Zhong, 2023c] and learning to abstain with a fixed predictor with application to decontextualization [Mohri, Andor, Choi, Collins, Mao, and Zhong, 2023]; principled approaches for learning to defer with multiple experts that benefit from strong H-consistency bounds, including the single-stage scenario [Mao, Mohri, and Zhong, 2023b] and a two-stage scenario [Mao, Mohri, Mohri, and Zhong, 2023a]; H-consistency theory and algorithms for adversarial robustness [Awasthi et al., 2021a,b, 2023a, Mao et al., 2023h, Awasthi et al., 2023b]; and efficient algorithms and loss functions for structured prediction with stronger H-consistency guarantees [Mao et al., 2023g].

Structure of this paper. We present new and tight H-consistency bounds for both the family of comp-sum losses (Section 4.1) and that of constrained losses (Section 5.1), which cover the familiar cross-entropy, or logistic loss, applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets (Sections 4.2 and 5.2). Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class H-consistency bound for cross-entropy reduces to an excess bound independent of the hypothesis set. Instead, we prove a much stronger and more significant guarantee (Section 4.2). We give a comprehensive discussion of related work in Appendix A. We start with some basic definitions and notation in Section 2.

2 Preliminaries

We denote by $X$ the input space, by $Y$ the output space, and by $D$ a distribution over $X \times Y$. We consider the standard scenario of multi-class classification, where $Y = \{1, \ldots, n\}$. Given a hypothesis set $H$ of functions mapping $X \times Y$ to $\mathbb{R}$, the multi-class classification problem consists of finding a hypothesis $h \in H$ with small generalization error $R_{\ell_{0-1}}(h)$, defined by $R_{\ell_{0-1}}(h) = \mathbb{E}_{(x,y) \sim D}[\ell_{0-1}(h,x,y)]$, where $\ell_{0-1}(h,x,y) = 1_{h(x) \neq y}$ is the multi-class zero-one loss, with $h(x) = \operatorname{argmax}_{y \in Y} h(x,y)$ the prediction of $h$ for the input point $x$. We also denote by $H(x)$ the set of all predictions associated with input $x$ generated by functions in $H$, that is, $H(x) = \{h(x) : h \in H\}$. We will analyze the guarantees of surrogate multi-class losses in terms of the zero-one loss. We denote by $\ell$ a surrogate loss and by $R_\ell(h)$ its generalization error, $R_\ell(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h,x,y)]$. For a loss function $\ell$, we define the best-in-class generalization error within a hypothesis set $H$ as $R^*_\ell(H) = \inf_{h \in H} R_\ell(h)$, and refer to $R_\ell(h) - R^*_\ell(H)$ as the estimation error. We will study the key notion of H-consistency bounds [Awasthi et al., 2022a,b], which are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error, for some non-decreasing real-valued function $f$:
$\forall h \in H, \quad R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le f\big(R_\ell(h) - R^*_\ell(H)\big).$
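As a concrete instance of such a bound (a standard example consistent with Corollary 7 below, stated here for the family of all measurable functions, for which the minimizability gaps introduced next vanish), the logistic loss admits

$$
R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\,\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big)^{\frac{1}{2}},
$$

that is, $f(t) = \sqrt{2t}$. The bounds studied in this paper refine such statements by replacing $H_{\mathrm{all}}$ with a restricted hypothesis set $H$ and accounting for the corresponding minimizability gaps.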
These bounds imply that the zero-one estimation error is at most $f(\epsilon)$ whenever the surrogate loss estimation error is bounded by $\epsilon$. Thus, the learning guarantees provided by H-consistency bounds are both non-asymptotic and hypothesis set-specific. The function $f$ appearing in these bounds is expressed in terms of a minimizability gap, a quantity measuring the difference between the best-in-class error $R^*_\ell(H)$ and the expected best-in-class conditional error $\mathbb{E}_x[C^*_\ell(H,x)]$: $M_\ell(H) = R^*_\ell(H) - \mathbb{E}_X[C^*_\ell(H,x)]$, where $C_\ell(h,x) = \mathbb{E}_{y|x}[\ell(h,x,y)]$ and $C^*_\ell(H,x) = \inf_{h \in H} C_\ell(h,x)$ are the conditional error and the best-in-class conditional error, respectively. We further write $\Delta C_{\ell,H}(h,x) = C_\ell(h,x) - C^*_\ell(H,x)$ to denote the conditional regret. Note that the minimizability gap is an inherent quantity depending on the hypothesis set $H$ and the loss function $\ell$.

By Lemma 1, the minimizability gap for the zero-one loss, $M_{\ell_{0-1}}(H)$, coincides with its approximation error $A_{\ell_{0-1}}(H) = R^*_{\ell_{0-1}}(H) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ when the set of all possible predictions generated by $H$ covers the label space $Y$. This holds for typical hypothesis sets used in practice. However, for a surrogate loss $\ell$, the minimizability gap $M_\ell(H)$ is always upper bounded by and in general finer than its approximation error $A_\ell(H) = R^*_\ell(H) - R^*_\ell(H_{\mathrm{all}})$, since $M_\ell(H) = A_\ell(H) - I_\ell(H)$, where $H_{\mathrm{all}}$ is the family of all measurable functions and $I_\ell(H) = \mathbb{E}_x[C^*_\ell(H,x) - C^*_\ell(H_{\mathrm{all}},x)]$ (see Appendix B for a more detailed discussion). Thus, an H-consistency bound, expressed as follows for some increasing function $\Gamma$:
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Gamma\big(R_\ell(h) - R^*_\ell(H) + M_\ell(H)\big),$ (1)
is more favorable than an excess error bound expressed in terms of approximation errors, $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + A_{\ell_{0-1}}(H) \le \Gamma\big(R_\ell(h) - R^*_\ell(H) + A_\ell(H)\big)$. Here, $\Gamma$ is typically linear or the square-root function modulo constants. When $H = H_{\mathrm{all}}$, the family of all measurable functions, an H-consistency bound coincides with the excess error bound and implies Bayes-consistency by taking the limit. It is therefore a stronger guarantee than an excess error bound and Bayes-consistency. The minimizability gap is always non-negative, since the infimum of the expectation is greater than or equal to the expectation of the infimum. Furthermore, as shown in Appendix B, when $H$ is the family of all measurable functions or when the Bayes error coincides with the best-in-class error, that is $R^*_\ell(H) = R^*_\ell(H_{\mathrm{all}})$, the minimizability gap vanishes. In such cases, (1) implies the H-consistency of a surrogate loss $\ell$ with respect to the zero-one loss $\ell_{0-1}$:
$R_\ell(h_n) - R^*_\ell(H) \xrightarrow{\; n \to +\infty \;} 0 \implies R_{\ell_{0-1}}(h_n) - R^*_{\ell_{0-1}}(H) \xrightarrow{\; n \to +\infty \;} 0.$

In the next sections, we will provide both a general characterization and an extension of H-consistency bounds for multi-class classification. Before proceeding, we first introduce a useful lemma from [Awasthi et al., 2022b] which characterizes the conditional regret of the zero-one loss explicitly. We denote by $p(x) = (p(x,1),\ldots,p(x,n))$ the conditional distribution of $y$ given $x$.

Lemma 1. For the zero-one loss $\ell_{0-1}$, the best-in-class conditional error and the conditional regret can be expressed as follows: for any $x \in X$, we have
$C^*_{\ell_{0-1}}(H,x) = 1 - \max_{y \in H(x)} p(x,y)$ and $\Delta C_{\ell_{0-1},H}(h,x) = \max_{y \in H(x)} p(x,y) - p(x, h(x)).$
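As a simple numerical illustration of Lemma 1 (the probability values below are hypothetical and chosen only for concreteness), suppose $n = 3$, $H(x) = Y$, and $p(x,\cdot) = (0.5, 0.3, 0.2)$. Then

$$
C^*_{\ell_{0-1}}(H,x) = 1 - 0.5 = 0.5,
\qquad
\Delta C_{\ell_{0-1},H}(h,x) = 0.5 - p\big(x, h(x)\big),
$$

so a hypothesis predicting label 2 at $x$ incurs a conditional regret of $0.5 - 0.3 = 0.2$, while one predicting label 1 incurs none.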
3 Comparison with previous work

Here, we briefly discuss previous studies of H-consistency bounds [Awasthi et al., 2022a,b, Zheng et al., 2023, Mao et al., 2023h] in standard binary or multi-class classification and compare their results with those we present. Awasthi et al. [2022a] studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for the bounded hypothesis sets of linear models $H^{\mathrm{bi}}_{\mathrm{lin}}$ and one-hidden-layer neural networks $H^{\mathrm{bi}}_{\mathrm{NN}}$, defined as follows:
$H^{\mathrm{bi}}_{\mathrm{lin}} = \{x \mapsto w \cdot x + b : \|w\| \le W, |b| \le B\}$
$H^{\mathrm{bi}}_{\mathrm{NN}} = \{x \mapsto \sum_{j=1}^{n} u_j (w_j \cdot x + b)_+ : \|u\|_1 \le \Lambda, \|w_j\| \le W, |b| \le B\},$
where $B$, $W$, and $\Lambda$ are positive constants and where $(\cdot)_+ = \max(\cdot, 0)$. We will show that our bounds recover these binary classification H-consistency bounds.

The scenario of multi-class classification is more challenging and more crucial in applications. Recent work by Awasthi et al. [2022b] showed that max losses [Crammer and Singer, 2001], defined as $\ell_{\max}(h,x,y) = \max_{y' \neq y} \Phi\big(h(x,y) - h(x,y')\big)$ for some convex and non-increasing function $\Phi$, cannot admit meaningful H-consistency bounds, unless the distribution is deterministic. They also presented a series of H-consistency bounds for sum losses [Weston and Watkins, 1998] and constrained losses [Lee et al., 2004] for symmetric and complete hypothesis sets, that is, hypothesis sets such that
$H = \{h \colon X \times Y \to \mathbb{R} \mid h(\cdot,y) \in F,\ \forall y \in Y\}$ (symmetry)
$\forall x \in X,\ \{f(x) : f \in F\} = \mathbb{R}$ (completeness)
for some family $F$ of functions mapping from $X$ to $\mathbb{R}$. The completeness assumption rules out the bounded hypothesis sets typically used in practice, such as $H_{\mathrm{lin}}$. Moreover, the final bounds derived in [Awasthi et al., 2022b] are based on ad hoc proofs and may not be tight. In contrast, we will study both complete and bounded hypothesis sets, and provide a very general tool to derive H-consistency bounds. Our bounds are tighter than those of Awasthi et al. [2022b] given for complete hypothesis sets and extend beyond the completeness assumption.

[Mao et al., 2023h] complemented the work of [Awasthi et al., 2022b] by studying a wide family of comp-sum losses in multi-class classification, which generalized the sum losses and included as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption was adopted; thus, their H-consistency bounds do not apply to common bounded hypothesis sets used in practice. We illustrate the application of our general results through a broader set of surrogate losses than [Mao et al., 2023h] and significantly generalize the bounds of [Mao et al., 2023h] to bounded hypothesis sets.

Recently, Zheng et al. [2023] proved H-consistency bounds for the logistic loss with linear hypothesis sets in multi-class classification: $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B, y \in Y\}$. However, their bounds require a crucial distributional assumption under which the minimizability gaps $M_{\ell_{0-1}}(H_{\mathrm{lin}})$ and $M_{\ell_{\log}}(H_{\mathrm{lin}})$ coincide with the approximation errors $R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ and $R^*_{\ell_{\log}}(H_{\mathrm{lin}}) - R^*_{\ell_{\log}}(H_{\mathrm{all}})$, respectively (see the note before [Zheng et al., 2023, Appendix F]). Thus, their bounds can be recovered as excess error bounds $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big)^{\frac{1}{2}}$, which are less significant. In contrast, our $H_{\mathrm{lin}}$-consistency bounds are much finer and take into account the role of the parameter $B$ and that of the number of labels $n$. Thus, we provide stronger and more significant guarantees for the logistic loss with linear hypothesis sets than [Zheng et al., 2023].

In summary, our general tools offer the remarkable advantages of deriving tight bounds, which improve upon the existing bounds of Awasthi et al.
[2022b] given for complete hypothesis sets, cover the comp-sum and constrained losses considered in [Awasthi et al., 2022a, Mao et al., 2023h] as well as new ones, extend beyond the completeness assumption with novel guarantees valid for bounded hypothesis sets, and provide much stronger and more significant guarantees for the logistic loss with linear hypothesis sets than those of Zheng et al. [2023].

4 Comp-sum losses

In this section, we present a general characterization of H-consistency bounds for comp-sum losses, a family of loss functions including the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the sum exponential loss [Weston and Watkins, 1998, Awasthi et al., 2022b], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], the mean absolute error loss [Ghosh et al., 2017], and many other loss functions used in applications. This is a family of loss functions defined via the composition of a non-negative and non-decreasing function $\Psi$ with the sum exponential losses (see [Mao et al., 2023h]):
$\forall h \in H,\ \forall (x,y) \in X \times Y, \quad \ell_{\mathrm{comp}}(h,x,y) = \Psi\Big(\sum_{y' \in Y} e^{h(x,y') - h(x,y)}\Big).$ (2)
This expression can be equivalently written as $\ell_{\mathrm{comp}}(h,x,y) = \Phi\big(\frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big)$, where $\Phi \colon u \mapsto \Psi\big(\frac{1}{u}\big)$ is a non-increasing auxiliary function from $[0,1]$ to $\mathbb{R}_+ \cup \{+\infty\}$. As an example, the logistic loss corresponds to the choice $\Phi \colon u \mapsto -\log(u)$ and the sum exponential loss to $\Phi \colon u \mapsto \frac{1}{u} - 1$.

4.1 H-consistency bounds

In previous work, deriving H-consistency bounds has required giving new proofs for each instance. The following result provides a very general tool for deriving such bounds with tightness guarantees. We introduce an error transformation function and show that deriving an H-consistency bound for comp-sum losses can be reduced to the calculation of this function.

Theorem 2 (H-consistency bound for comp-sum losses). Assume that $H$ is symmetric and complete and that $T_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$T_{\mathrm{comp}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H),$ (3)
with $T_{\mathrm{comp}}$ an H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [0,\frac{1}{2}]}\ \sup_{\mu \in [-\tau,\, 1-\tau]} \big\{\tfrac{1+t}{2}\big[\Phi(\tau) - \Phi(1-\tau-\mu)\big] + \tfrac{1-t}{2}\big[\Phi(1-\tau) - \Phi(\tau+\mu)\big]\big\}$ for $n = 2$,
$T_{\mathrm{comp}}(t) = \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]}\ \inf_{\tau_1 \ge \max(\tau_2, 1/n),\ \tau_1+\tau_2 \le 1,\ \tau_2 \ge 0}\ \sup_{\mu \in [-\tau_2,\, \tau_1]} \big\{\tfrac{P+t}{2}\big[\Phi(\tau_2) - \Phi(\tau_1-\mu)\big] + \tfrac{P-t}{2}\big[\Phi(\tau_1) - \Phi(\tau_2+\mu)\big]\big\}$ for $n > 2$.
Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H) = T_{\mathrm{comp}}(t)$.

Thus, Theorem 2 shows that, when $T_{\mathrm{comp}}$ is convex, to make these guarantees explicit, all that is needed is to calculate $T_{\mathrm{comp}}$. Moreover, the last statement shows the tightness of the guarantees derived using this function. The constraints in $T_{\mathrm{comp}}$ are due to the forms that the conditional probability vector and scoring functions take. These forms become more flexible for $n > 2$, leading to intricate constraints. Note that our H-consistency bounds are distribution-independent and we cannot claim tightness across all distributions. The general expression of $T_{\mathrm{comp}}$ in Theorem 2 is complex, but it can be considerably simplified under some broad assumptions, as shown by the following result.

Theorem 3 (characterization of $T_{\mathrm{comp}}$). Assume that $\Phi$ is convex, differentiable at $\frac{1}{2}$ and $\Phi'(\frac{1}{2}) < 0$. Then, $T_{\mathrm{comp}}$ can be expressed as follows:
$T_{\mathrm{comp}}(t) = \Phi\big(\tfrac{1}{2}\big) - \inf_{\mu \in [-\frac{1}{2}, \frac{1}{2}]} \big\{\tfrac{1-t}{2}\Phi\big(\tfrac{1}{2} + \mu\big) + \tfrac{1+t}{2}\Phi\big(\tfrac{1}{2} - \mu\big)\big\}$ for $n = 2$,
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [\frac{1}{n}, \frac{1}{2}]} \big\{\Phi(\tau) - \inf_{\mu \in [-\tau, \tau]} \big\{\tfrac{1+t}{2}\Phi(\tau - \mu) + \tfrac{1-t}{2}\Phi(\tau + \mu)\big\}\big\}$ for $n > 2$.

The proofs of this result and of the other theorems in this section are given in Appendix C.
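As an illustration of how Theorem 3 is used (a worked sketch of the calculation detailed in Appendix C.3, based on the simplified $n = 2$ expression above), consider the logistic choice $\Phi(u) = -\log(u)$: the inner infimum is attained at $\mu = -\frac{t}{2}$, so

$$
T_{\mathrm{comp}}(t)
= \Phi\big(\tfrac{1}{2}\big) - \Big[\tfrac{1-t}{2}\,\Phi\big(\tfrac{1-t}{2}\big) + \tfrac{1+t}{2}\,\Phi\big(\tfrac{1+t}{2}\big)\Big]
= \log 2 + \tfrac{1-t}{2}\log\tfrac{1-t}{2} + \tfrac{1+t}{2}\log\tfrac{1+t}{2}
= \tfrac{1+t}{2}\log(1+t) + \tfrac{1-t}{2}\log(1-t),
$$

which coincides with the corresponding entry of Table 1 below.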
Examples. We now illustrate the application of our theory through several examples. To do so, we compute the H-estimation error transformation $T_{\mathrm{comp}}$ for comp-sum losses and present the results in Table 1.

Table 1: H-estimation error transformation for common comp-sum losses.

| Auxiliary function $\Phi$ | $-\log(t)$ | $\frac{1}{t} - 1$ | $\frac{1}{q}(1-t^q),\ q \in (0,1)$ | $1-t$ | $(1-t)^2$ |
|---|---|---|---|---|---|
| Transformation $T_{\mathrm{comp}}$ | $\frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ | $1-\sqrt{1-t^2}$ | $\frac{1}{qn^q}\Big[\Big(\frac{(1+t)^{\frac{1}{1-q}}+(1-t)^{\frac{1}{1-q}}}{2}\Big)^{1-q}-1\Big]$ | $\frac{t}{n}$ | $\frac{t^2}{4}$ |

Remarkably, by applying Theorem 2, we are able to obtain the same H-consistency bounds for comp-sum losses with $\Phi(t) = -\log(t)$, $\Phi(t) = \frac{1}{q}(1-t^q)$ with $q \in (0,1)$, and $\Phi(t) = 1-t$ as those derived using ad hoc methods in [Mao et al., 2023h], as well as a novel tight H-consistency bound for the new comp-sum loss $\ell_{\mathrm{sq}} = \big[1 - \frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big]^2$ with $\Phi(t) = (1-t)^2$ in Theorem 4. The calculation of $T_{\mathrm{comp}}$ for all entries of Table 1 is detailed in Appendix C.3. To illustrate the effectiveness of our general tools, here, we show how the error transformation function can be straightforwardly calculated in the case of the new surrogate loss $\ell_{\mathrm{sq}}$.

Theorem 4 (H-consistency bound for a new comp-sum loss). Assume that $H$ is symmetric and complete. Then, for all $h \in H$ and any distribution, the following tight bound holds:
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le 2\big(R_{\ell_{\mathrm{sq}}}(h) - R^*_{\ell_{\mathrm{sq}}}(H) + M_{\ell_{\mathrm{sq}}}(H)\big)^{\frac{1}{2}} - M_{\ell_{0-1}}(H).$

Proof. For $n = 2$, plugging $\Phi(t) = (1-t)^2$ into Theorem 3 gives
$T_{\mathrm{comp}}(t) = \tfrac{1}{4} - \inf_{\mu \in [-\frac{1}{2},\frac{1}{2}]} \big\{\tfrac{1-t}{2}\big(\tfrac{1}{2}-\mu\big)^2 + \tfrac{1+t}{2}\big(\tfrac{1}{2}+\mu\big)^2\big\} = \tfrac{1}{4} - \tfrac{1-t^2}{4} = \tfrac{t^2}{4},$
with the inner minimum achieved at $\mu = -\frac{t}{2}$. Similarly, for $n > 2$, plugging $\Phi(t) = (1-t)^2$ into Theorem 3 yields
$T_{\mathrm{comp}}(t) = \inf_{\tau \in [\frac{1}{n},\frac{1}{2}]} \big\{(1-\tau)^2 - \inf_{\mu \in [-\tau,\tau]} \big\{\tfrac{1+t}{2}(1-\tau+\mu)^2 + \tfrac{1-t}{2}(1-\tau-\mu)^2\big\}\big\}$
$= \inf_{\tau \in [\frac{1}{n},\frac{1}{2}]} \big\{(1-\tau)^2 - (1-\tau)^2(1-t^2)\big\}$ (minimum achieved at $\mu = t(\tau - 1)$)
$= \tfrac{t^2}{4}$. (minimum achieved at $\tau = \tfrac{1}{2}$)
By Theorem 2, the bound obtained is tight, which completes the proof.

4.2 Extension to non-complete/bounded hypothesis sets: comp-sum losses

As pointed out earlier, the hypothesis sets typically used in practice are bounded. Let $F$ be a family of real-valued functions $f$ with $|f(x)| \le \Lambda(x)$ for all $x \in X$ and such that all values in $[-\Lambda(x), +\Lambda(x)]$ can be reached, where $\Lambda(x) > 0$ is a fixed function on $X$. We will study hypothesis sets $H$ in which each scoring function is bounded:
$H = \{h \colon X \times Y \to \mathbb{R} \mid h(\cdot,y) \in F,\ \forall y \in Y\}.$ (4)
This holds for most hypothesis sets used in practice. The symmetric and complete hypothesis sets studied in previous work correspond to the special case of $H$ where $\Lambda(x) = +\infty$ for all $x \in X$. The hypothesis set of linear models $H_{\mathrm{lin}}$, defined by $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B, y \in Y\}$, is also a special instance of $H$, with $\Lambda(x) = W\|x\| + B$. Let us emphasize that previous studies did not establish any H-consistency bound for these general hypothesis sets $H$.

Theorem 5 (H-consistency bound for comp-sum losses). Assume that $\overline{T}_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$\overline{T}_{\mathrm{comp}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H),$
with $\overline{T}_{\mathrm{comp}}$ the H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$\overline{T}_{\mathrm{comp}}(t) = \inf_{\tau \in [0,\frac{1}{2}]}\ \sup_{\mu \in [s_{\min}-\tau,\ 1-\tau-s_{\min}]} \big\{\tfrac{1+t}{2}\big[\Phi(\tau) - \Phi(1-\tau-\mu)\big] + \tfrac{1-t}{2}\big[\Phi(1-\tau) - \Phi(\tau+\mu)\big]\big\}$ for $n = 2$,
$\overline{T}_{\mathrm{comp}}(t) = \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]}\ \inf_{s_{\min} \le \tau_2 \le \tau_1 \le s_{\max},\ \tau_1+\tau_2 \le 1}\ \sup_{\mu \in C} \big\{\tfrac{P+t}{2}\big[\Phi(\tau_2) - \Phi(\tau_1-\mu)\big] + \tfrac{P-t}{2}\big[\Phi(\tau_1) - \Phi(\tau_2+\mu)\big]\big\}$ for $n > 2$,
where $C = \big[\max\{s_{\min}-\tau_2,\ \tau_1-s_{\max}\},\ \min\{s_{\max}-\tau_2,\ \tau_1-s_{\min}\}\big]$, $s_{\max} = \frac{1}{1+(n-1)e^{-2\inf_x \Lambda(x)}}$ and $s_{\min} = \frac{1}{1+(n-1)e^{2\inf_x \Lambda(x)}}$. Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{comp}}}(h) - R^*_{\ell_{\mathrm{comp}}}(H) + M_{\ell_{\mathrm{comp}}}(H) = \overline{T}_{\mathrm{comp}}(t)$.

This theorem significantly broadens the applicability of our framework as it encompasses bounded hypothesis sets.
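To make the quantities appearing in Theorem 5 concrete, here is a minimal numerical sketch (the choice of a linear hypothesis set and the values of $n$ and $B$ are hypothetical illustrations, not part of the theorem), computing $s_{\min}$ and $s_{\max}$ from $\inf_x \Lambda(x)$:

```python
import math

def softmax_score_range(n, inf_lambda):
    """Range [s_min, s_max] of any label's softmax score when every raw score
    h(x, y) lies in [-Lambda(x), +Lambda(x)], using the expressions in Theorem 5."""
    s_max = 1.0 / (1.0 + (n - 1) * math.exp(-2.0 * inf_lambda))
    s_min = 1.0 / (1.0 + (n - 1) * math.exp(2.0 * inf_lambda))
    return s_min, s_max

# Hypothetical setting: linear hypothesis set H_lin with ||w_y|| <= W and |b_y| <= B,
# so Lambda(x) = W ||x|| + B; inf_x Lambda(x) = B if inputs of arbitrarily small norm
# are allowed (an assumption made here purely for illustration).
n, B = 10, 1.0
s_min, s_max = softmax_score_range(n, inf_lambda=B)
print(s_min, s_max)  # as B -> 0 both endpoints tend to 1/n; as B -> infinity, to 0 and 1
```

These two endpoints drive the shape of the bounds in the examples that follow: the complete case corresponds to $s_{\min} = 0$ and $s_{\max} = 1$.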
The last statement of Theorem 5 further shows the tightness of the H-consistency bounds derived using this error transformation function. We now illustrate the application of our theory through several examples.

A. Example: logistic loss. We first consider the multinomial logistic loss, that is, $\ell_{\mathrm{comp}}$ with $\Phi(u) = -\log(u)$, for which we give the following guarantee.

Theorem 6 (H-consistency bounds for logistic loss). For any $h \in H$ and any distribution, we have
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H) + M_{\ell_{\log}}(H)\big),$
where $\ell_{\log} = -\log\big(\frac{e^{h(x,y)}}{\sum_{y' \in Y} e^{h(x,y')}}\big)$ and
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{s_{\max}-s_{\min}}{s_{\min}+s_{\max}} \\ \frac{t}{2}\log\big(\frac{s_{\max}}{s_{\min}}\big) + \log\big(\frac{2\sqrt{s_{\max}s_{\min}}}{s_{\max}+s_{\min}}\big) & \text{otherwise.} \end{cases}$

The proof of Theorem 6 is given in Appendix E.2. With the help of some simple calculations, we can derive a simpler upper bound:
$\Psi^{-1}(t) \le \Gamma(t) = \begin{cases} \sqrt{2t} & t \le \frac{(s_{\max}-s_{\min})^2}{2(s_{\min}+s_{\max})^2} \\ \frac{2(s_{\min}+s_{\max})}{s_{\max}-s_{\min}}\, t & \text{otherwise.} \end{cases}$
When the relative difference between $s_{\min}$ and $s_{\max}$ is small, the coefficient of the linear term in $\Gamma$ explodes. On the other hand, making that difference large essentially turns $\Gamma$ into a square-root function for all values. In general, $\Lambda$ is not infinite since regularization is used, which controls both the complexity of the hypothesis set and the magnitude of the scores.

Comparison with [Mao et al., 2023h]. For the symmetric and complete hypothesis sets $H$ considered in [Mao et al., 2023h], $\Lambda(x) = +\infty$, $s_{\max} = 1$, $s_{\min} = 0$, $\Psi(t) = \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ and $\Gamma(t) = \sqrt{2t}$. By Theorem 6, this yields an H-consistency bound for the logistic loss.

Corollary 7 (H-consistency bounds for logistic loss). Assume that $H$ is symmetric and complete. Then, for any $h \in H$ and any distribution, we have
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H) + M_{\ell_{\log}}(H)\big) - M_{\ell_{0-1}}(H),$
where $\Psi(t) = \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t)$ and $\Psi^{-1}(t) \le \sqrt{2t}$.

Corollary 7 recovers the H-consistency bounds of Mao et al. [2023h].

Comparison with [Awasthi et al., 2022a] and [Zheng et al., 2023]. For the linear models $H_{\mathrm{lin}} = \{(x,y) \mapsto w_y \cdot x + b_y : \|w_y\| \le W, |b_y| \le B\}$, we have $\Lambda(x) = W\|x\| + B$. By Theorem 6, we obtain $H_{\mathrm{lin}}$-consistency bounds for the logistic loss.

Corollary 8 ($H_{\mathrm{lin}}$-consistency bounds for logistic loss). For any $h \in H_{\mathrm{lin}}$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) \le \Psi^{-1}\big(R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{lin}}) + M_{\ell_{\log}}(H_{\mathrm{lin}})\big) - M_{\ell_{0-1}}(H_{\mathrm{lin}}),$
where
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{(n-1)(e^{2B}-e^{-2B})}{2+(n-1)(e^{2B}+e^{-2B})} \\ \frac{t}{2}\log\Big(\frac{1+(n-1)e^{2B}}{1+(n-1)e^{-2B}}\Big) + \log\Big(\frac{2\sqrt{(1+(n-1)e^{2B})(1+(n-1)e^{-2B})}}{2+(n-1)(e^{2B}+e^{-2B})}\Big) & \text{otherwise.} \end{cases}$
For $n = 2$, we have
$\Psi(t) = \begin{cases} \frac{1+t}{2}\log(1+t) + \frac{1-t}{2}\log(1-t) & t \le \frac{e^{2B}-1}{e^{2B}+1} \\ \frac{t}{2}\log\Big(\frac{1+e^{2B}}{1+e^{-2B}}\Big) + \log\Big(\frac{2\sqrt{(1+e^{2B})(1+e^{-2B})}}{2+e^{2B}+e^{-2B}}\Big) & \text{otherwise,} \end{cases}$
which coincides with the $H_{\mathrm{lin}}$-estimation error transformation in [Awasthi et al., 2022a]. Thus, Corollary 8 includes as a special case the $H_{\mathrm{lin}}$-consistency bounds given by Awasthi et al. [2022a] for binary classification.

Our bounds of Corollary 8 improve upon the multi-class $H_{\mathrm{lin}}$-consistency bounds of recent work [Zheng et al., 2023, Theorem 3.3] in the following ways: i) their bound holds only for restricted distributions, while our bound holds for any distribution; ii) their bound holds only for restricted values of the estimation error $R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{lin}})$, while ours holds for any value and, more precisely, admits a piecewise functional form; iii) under their distributional assumption, the minimizability gaps $M_{\ell_{0-1}}(H_{\mathrm{lin}})$ and $M_{\ell_{\log}}(H_{\mathrm{lin}})$ coincide with the approximation errors $R^*_{\ell_{0-1}}(H_{\mathrm{lin}}) - R^*_{\ell_{0-1}}(H_{\mathrm{all}})$ and $R^*_{\ell_{\log}}(H_{\mathrm{lin}}) - R^*_{\ell_{\log}}(H_{\mathrm{all}})$, respectively (see the note before [Zheng et al., 2023, Appendix F]).
Thus, their bounds can be recovered as an excess error bound $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H_{\mathrm{all}}) \le \sqrt{2}\big[R_{\ell_{\log}}(h) - R^*_{\ell_{\log}}(H_{\mathrm{all}})\big]^{\frac{1}{2}}$, which is not specific to the hypothesis set $H$ and thus not as significant. In contrast, our $H_{\mathrm{lin}}$-consistency bound is finer and takes into account the role of the parameter $B$ as well as the number of labels $n$; iv) [Zheng et al., 2023, Theorem 3.3] only offers approximate bounds that are not tight; in contrast, by Theorem 5, our bound is tight.

Note that our H-consistency bounds in Theorem 6 are not limited to specific hypothesis set forms. They are directly applicable to various types of hypothesis sets, including neural networks. For example, the same derivation can be extended to the one-hidden-layer neural networks studied in [Awasthi et al., 2022a] and their multi-class generalization by calculating and substituting the corresponding $\Lambda(x)$. As a result, we can obtain novel and tight H-consistency bounds for bounded neural network hypothesis sets in multi-class classification, which highlights the versatility of our general tools.

B. Example: sum exponential loss. We then consider the sum exponential loss, that is, $\ell_{\mathrm{comp}}$ with $\Phi(u) = \frac{1}{u} - 1$. By computing the error transformation in Theorem 5, we obtain the following result.

Theorem 9 (H-consistency bounds for sum exponential loss). For any $h \in H$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\exp}}(h) - R^*_{\ell_{\exp}}(H) + M_{\ell_{\exp}}(H)\big),$
where $\ell_{\exp} = \sum_{y' \neq y} e^{h(x,y') - h(x,y)}$ and
$\Psi(t) = \begin{cases} 1 - \sqrt{1-t^2} & t \le \frac{s_{\max}^2 - s_{\min}^2}{s_{\min}^2 + s_{\max}^2} \\ \frac{s_{\max}-s_{\min}}{2 s_{\max} s_{\min}}\, t - \frac{(s_{\max}-s_{\min})^2}{2 s_{\max} s_{\min}(s_{\max}+s_{\min})} & \text{otherwise.} \end{cases}$

The proof of Theorem 9 is given in Appendix E.3. Observe that $1 - \sqrt{1-t^2} \ge t^2/2$. By Theorem 9, making $s_{\min}$ close to zero, that is, making $\Lambda$ close to infinite for any $x \in X$, essentially turns $\Psi$ into a square function for all values. In general, $\Lambda$ is not infinite since regularization is used in practice, which controls both the complexity of the hypothesis set and the magnitude of the scores.

C. Example: generalized cross-entropy loss and mean absolute error loss. Due to space limitations, we present the results for these loss functions in Appendix E.

5 Constrained losses

In this section, we present a general characterization of H-consistency bounds for constrained losses, that is, loss functions defined via a constraint, as in [Lee et al., 2004]:
$\ell_{\mathrm{cstnd}}(h,x,y) = \sum_{y' \neq y} \Phi\big(-h(x,y')\big)$ (5)
with the constraint that $\sum_{y \in Y} h(x,y) = 0$, for a non-negative and non-increasing auxiliary function $\Phi$.

5.1 H-consistency bounds

As in the previous section, we prove a result that supplies a very general tool, an error transformation function, for deriving H-consistency bounds for constrained losses. When $T_{\mathrm{cstnd}}$ is convex, to make these guarantees explicit, we only need to calculate $T_{\mathrm{cstnd}}$.

Table 2: H-estimation error transformation for common constrained losses.

| Auxiliary function $\Phi$ | $\Phi_{\exp}(t) = e^{-t}$ | $\Phi_{\mathrm{hinge}}(t) = \max\{0, 1-t\}$ | $\Phi_{\mathrm{sq\text{-}hinge}}(t) = (1-t)^2 1_{t \le 1}$ | $\Phi_{\mathrm{sq}}(t) = (1-t)^2$ |
|---|---|---|---|---|
| Transformation $T_{\mathrm{cstnd}}$ | $2 - \sqrt{4-t^2}$ | $t$ | $\frac{t^2}{2}$ | $t^2$ |

Theorem 10 (H-consistency bound for constrained losses). Assume that $H$ is symmetric and complete. Assume that $T_{\mathrm{cstnd}}$ is convex. Then, for any hypothesis $h \in H$ and any distribution,
$T_{\mathrm{cstnd}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H),$
with the H-estimation error transformation for constrained losses defined for $t \in [0,1]$ by
$T_{\mathrm{cstnd}}(t) = \inf_{\tau \ge 0}\ \sup_{\mu \in \mathbb{R}} \big\{\tfrac{1-t}{2}\big[\Phi(\tau) - \Phi(-\tau+\mu)\big] + \tfrac{1+t}{2}\big[\Phi(-\tau) - \Phi(\tau-\mu)\big]\big\}$ for $n = 2$,
$T_{\mathrm{cstnd}}(t) = \inf_{P \in [\frac{1}{n-1},\,1]}\ \inf_{\tau_1 \ge \max\{\tau_2, 0\}}\ \sup_{\mu \in \mathbb{R}} \big\{\tfrac{2-P-t}{2}\big[\Phi(-\tau_2) - \Phi(-\tau_1+\mu)\big] + \tfrac{2-P+t}{2}\big[\Phi(-\tau_1) - \Phi(-\tau_2-\mu)\big]\big\}$ for $n > 2$.
Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H) = T_{\mathrm{cstnd}}(t)$.

Here too, the theorem guarantees the tightness of the bound. This general expression of $T_{\mathrm{cstnd}}$ can be considerably simplified under some broad assumptions, as shown by the following result.

Theorem 11 (characterization of $T_{\mathrm{cstnd}}$). Assume that $\Phi$ is convex, differentiable at zero and $\Phi'(0) < 0$. Then, $T_{\mathrm{cstnd}}$ can be expressed as follows:
$T_{\mathrm{cstnd}}(t) = \begin{cases} \Phi(0) - \inf_{\mu \in \mathbb{R}} \big\{\frac{1-t}{2}\Phi(\mu) + \frac{1+t}{2}\Phi(-\mu)\big\} & n = 2 \\ \inf_{\tau \ge 0} \big\{\big(2-\frac{1}{n-1}\big)\Phi(-\tau) - \inf_{\mu \in \mathbb{R}} \big\{\frac{2-t-\frac{1}{n-1}}{2}\Phi(-\tau+\mu) + \frac{2+t-\frac{1}{n-1}}{2}\Phi(-\tau-\mu)\big\}\big\} & n > 2 \end{cases}$
$\ge \begin{cases} \Phi(0) - \inf_{\mu \in \mathbb{R}} \big\{\frac{1-t}{2}\Phi(\mu) + \frac{1+t}{2}\Phi(-\mu)\big\} & n = 2 \\ \inf_{\tau \ge 0} \big\{2\Phi(-\tau) - \inf_{\mu \in \mathbb{R}} \big\{\frac{2-t}{2}\Phi(-\tau+\mu) + \frac{2+t}{2}\Phi(-\tau-\mu)\big\}\big\} & n > 2. \end{cases}$

The proofs of all the results in this section are given in Appendix D.

Examples. We now compute the H-estimation error transformation for constrained losses and present the results in Table 2. Here, we present the simplified $T_{\mathrm{cstnd}}$ obtained by using the lower bound in Theorem 11. Remarkably, by applying Theorem 10, we are able to obtain tighter H-consistency bounds for constrained losses with $\Phi = \Phi_{\mathrm{hinge}}, \Phi_{\mathrm{sq\text{-}hinge}}, \Phi_{\exp}$ than those derived using ad hoc methods in [Awasthi et al., 2022b], and a novel H-consistency bound for the new constrained loss $\ell_{\mathrm{cstnd}}(h,x,y) = \sum_{y' \neq y}\big(1 + h(x,y')\big)^2$ with $\Phi(t) = (1-t)^2$.

5.2 Extension to non-complete or bounded hypothesis sets

As in the case of comp-sum losses, we extend our results beyond the completeness assumption adopted in previous work and establish H-consistency bounds for bounded hypothesis sets. This significantly broadens the applicability of our framework.

Theorem 12 (H-consistency bound for constrained losses). Assume that $\overline{T}_{\mathrm{cstnd}}$ is convex. Then, the following inequality holds for any hypothesis $h \in H$ and any distribution:
$\overline{T}_{\mathrm{cstnd}}\big(R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H)\big) \le R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H),$ (6)
with $\overline{T}_{\mathrm{cstnd}}$ the H-estimation error transformation for constrained losses defined for all $t \in [0,1]$ by
$\overline{T}_{\mathrm{cstnd}}(t) = \inf_{\tau \ge 0}\ \sup_{\mu \in [\tau - \Lambda_{\min},\ \tau + \Lambda_{\min}]} \big\{\tfrac{1-t}{2}\big[\Phi(\tau) - \Phi(-\tau+\mu)\big] + \tfrac{1+t}{2}\big[\Phi(-\tau) - \Phi(\tau-\mu)\big]\big\}$ for $n = 2$,
$\overline{T}_{\mathrm{cstnd}}(t) = \inf_{P \in [\frac{1}{n-1},\,1]}\ \inf_{\tau_1 \ge \max\{\tau_2, 0\}}\ \sup_{\mu \in C} \big\{\tfrac{2-P-t}{2}\big[\Phi(-\tau_2) - \Phi(-\tau_1+\mu)\big] + \tfrac{2-P+t}{2}\big[\Phi(-\tau_1) - \Phi(-\tau_2-\mu)\big]\big\}$ for $n > 2$,
where $C = \big[\max\{\tau_1, -\tau_2\} - \Lambda_{\min},\ \min\{\tau_1, -\tau_2\} + \Lambda_{\min}\big]$ and $\Lambda_{\min} = \inf_{x \in X} \Lambda(x)$. Furthermore, for any $t \in [0,1]$, there exist a distribution $D$ and a hypothesis $h \in H$ such that $R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) = t$ and $R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H) = \overline{T}_{\mathrm{cstnd}}(t)$.

The proof is presented in Appendix F.1. Next, we illustrate the application of our theory through the example of the constrained exponential loss, that is, $\ell_{\mathrm{cstnd}}$ with $\Phi(t) = e^{-t}$. By using the error transformation in Theorem 12, we obtain new H-consistency bounds in Theorem 13 (see Appendix F.2 for the proof) for bounded hypothesis sets $H$.

Theorem 13 (H-consistency bounds for constrained exponential loss). Let $\Phi(t) = e^{-t}$. For any $h \in H$ and any distribution,
$R_{\ell_{0-1}}(h) - R^*_{\ell_{0-1}}(H) + M_{\ell_{0-1}}(H) \le \Psi^{-1}\big(R_{\ell_{\mathrm{cstnd}}}(h) - R^*_{\ell_{\mathrm{cstnd}}}(H) + M_{\ell_{\mathrm{cstnd}}}(H)\big),$
where
$\Psi(t) = \begin{cases} 1 - \sqrt{1-t^2} & t \le \frac{e^{2\Lambda_{\min}} - 1}{e^{2\Lambda_{\min}} + 1} \\ \frac{t}{2}\big(e^{\Lambda_{\min}} - e^{-\Lambda_{\min}}\big) + \frac{2 - e^{\Lambda_{\min}} - e^{-\Lambda_{\min}}}{2} & \text{otherwise.} \end{cases}$

Awasthi et al. [2022b] prove H-consistency bounds for constrained exponential losses when $H$ is symmetric and complete. Theorem 13 significantly generalizes those results to non-complete hypothesis sets. Unlike in the complete case, the functional form of our new bounds has two pieces, which correspond to linear and square-root convergence respectively, modulo constants.
Furthermore, the coefficient of the linear piece depends on the the magnitude of Λmin. When Λmin is small, the coefficient of the linear term in Ψ 1 explodes. On the other hand, making Λmin large essentially turns Ψ 1 into a square-root function. 6 Discussion Here, we further elaborate on the practical value of our tools and H-consistency bounds. Our contributions include a more general and convenient mathematical tool for proving H-consistency bounds, along with tighter bounds that enable a better comparison of surrogate loss functions and extensions beyond previous completeness assumptions. As mentioned by [Awasthi et al., 2022b], given a hypothesis set H, H-consistency bounds can be used to compare different surrogate loss functions and select the most favorable one, which depends on the functional form of the H-consistency bound; the smoothness of the surrogate loss and its optimization properties; approximation properties of the surrogate loss function controlled by minimizability gaps; and the dependency on the number of classes in the multiplicative constant. Consequently, a tighter H-consistency bound provides a more accurate comparison, as a loose bound might not adequately capture the full advantage of using one surrogate loss. In contrast, Bayes-consistency does not take into account the hypothesis set and is an asymptotic property, thereby failing to guide the comparison of different surrogate losses. Another application of our H-consistency bounds involves deriving generalization bounds for surrogate loss minimizers [Mao et al., 2023h], expressed in terms of the same quantities previously discussed. Therefore, when dealing with finite samples, a tighter H-consistency bound could also result in a corresponding tighter generalization bound. Moreover, our novel results extend beyond previous completeness assumptions, offering guarantees applicable to bounded hypothesis sets commonly used with regularization. This enhancement provides meaningful learning guarantees. Technically, our error transformation function serves as a very general tool for deriving H-consistency bounds with tightness guarantees. These functions are defined within each class of loss functions including comp-sum losses and constrained losses, and their formulation depends on the structure of the individual loss function class, the range of the hypothesis set and the number of classes. To derive explicit bounds, all that is needed is to calculate these error transformation functions. Under some broad assumptions on the auxiliary function within a loss function, these error transformation functions can be further distilled into more simplified forms, making them straightforward to compute. 7 Conclusion We presented a general characterization and extension of H-consistency bounds for multi-class classification. We introduced new tools for deriving such bounds with tightness guarantees and illustrated their benefits through several applications and examples. Our proposed method is a significant advance in the theory of H-consistency bounds for multi-class classification. It can provide a general and powerful tool for deriving tight bounds for a wide variety of other loss functions and hypothesis sets. We believe that our work will open up new avenues of research in the field of multi-class classification consistency. A. Agarwal and S. Agarwal. On consistent surrogate risk minimization and property elicitation. In Conference on Learning Theory, pages 4 22, 2015. P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. 
Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, pages 9804 9815, 2021a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. A finer calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pages 1117 1174, 2022a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Multi-class H-consistency bounds. In Advances in neural information processing systems, 2022b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pages 10077 10094, 2023a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. DC-programming for neural network optimizations. Journal of Global Optimization, 2023b. P. L. Bartlett, M. I. Jordan, and J. D. Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357 -365, 1944. J. Berkson. Why I prefer logits to probits. Biometrics, 7(4):327 -339, 1951. M. Blondel. Structured prediction with projection oracles. In Advances in neural information processing systems, 2019. D.-R. Chen and T. Sun. Consistency of multiclass empirical risk minimization methods based on convex loss. Journal of Machine Learning Research, 7:2435 2447, 2006. D.-R. Chen and D.-H. Xiang. The consistency of multicategory support vector machines. Advances in Computational Mathematics, 24(1):155 169, 2006. C. Ciliberto, L. Rosasco, and A. Rudi. A consistent regularization approach for structured prediction. In Advances in neural information processing systems, 2016. C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In Algorithmic Learning Theory, pages 67 82, 2016a. C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, pages 1660 1668, 2016b. C. Cortes, G. De Salvo, and M. Mohri. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, to appear, 2023. K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec):265 292, 2001. K. Dembczynski, W. Kotlowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. ar Xiv preprint ar Xiv:1206.6401, 2012. U. Dogan, T. Glasmachers, and C. Igel. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17:1 32, 2016. J. Finocchiaro, R. M. Frongillo, and B. Waggoner. An embedding framework for the design and analysis of consistent polyhedral surrogates. ar Xiv preprint ar Xiv:2206.14707, 2022. R. Frongillo and B. Waggoner. Surrogate regret bounds for polyhedral losses. In Advances in Neural Information Processing Systems, pages 21569 21580, 2021. W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. In Conference on learning theory, pages 341 358, 2011. W. Gao and Z.-H. Zhou. On the consistency of AUC pairwise optimization. In International Joint Conference on Artificial Intelligence, 2015. A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. 
In Proceedings of the AAAI conference on artificial intelligence, 2017. V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501 2509, 2014. Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. Y. Liu. Fisher consistency of multicategory support vector machines. In Artificial intelligence and statistics, pages 291 298, 2007. P. Long and R. Servedio. Consistency versus realizable H-consistency for multiclass classification. In International Conference on Machine Learning, pages 801 809, 2013. A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Two-stage learning to defer with multiple experts. In Advances in neural information processing systems, 2023a. A. Mao, M. Mohri, and Y. Zhong. Principled approaches for learning to defer with multiple experts. ar Xiv preprint ar Xiv:2310.14774, 2023b. A. Mao, M. Mohri, and Y. Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. ar Xiv preprint ar Xiv:2310.14772, 2023c. A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023d. A. Mao, M. Mohri, and Y. Zhong. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023e. A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for scorebased multi-class abstention. ar Xiv preprint ar Xiv:2310.14770, 2023f. A. Mao, M. Mohri, and Y. Zhong. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023g. A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023h. C. Mohri, D. Andor, E. Choi, M. Collins, A. Mao, and Y. Zhong. Learning to reject with a fixed predictor: Application to decontextualization. ar Xiv preprint ar Xiv:2301.09044, 2023. M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018. H. Narasimhan, H. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In International Conference on Machine Learning, pages 2398 2407, 2015. A. Osokin, F. Bach, and S. Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, 2017. F. Pedregosa, F. Bach, and A. Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18:1 35, 2017. B. Á. Pires and C. Szepesvári. Multiclass classification calibration functions. ar Xiv preprint ar Xiv:1609.06385, 2016. B. A. Pires, C. Szepesvari, and M. Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. In International Conference on Machine Learning, pages 1391 1399, 2013. H. G. Ramaswamy and S. Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, 2012. H. G. Ramaswamy and S. Agarwal. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research, 17(1):397 441, 2016. H. G. Ramaswamy, S. Agarwal, and A. Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. 
In Advances in Neural Information Processing Systems, 2013. H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with a reject option. ar Xiv preprint ar Xiv:1505.04137, 2015. P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In International Conference on Artificial Intelligence and Statistics, pages 618 626, 2011. I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. A. Thilagar, R. Frongillo, J. J. Finocchiaro, and E. Goodwill. Consistent polyhedral surrogates for top-k classification and variants. In International Conference on Machine Learning, pages 21329 21359, 2022. K. Uematsu and Y. Lee. On theoretically optimal ranking functions in bipartite ranking. Journal of the American Statistical Association, 112(519):1311 1322, 2017. P. F. Verhulst. Notice sur la loi que la population suit dans son accroissement. Correspondance mathématique et physique, 10:113 -121, 1838. P. F. Verhulst. Recherches mathématiques sur la loi d accroissement de la population. Nouveaux Mémoires de l Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1 -42, 1845. Y. Wang and C. Scott. Weston-Watkins hinge loss and ordered partitions. In Advances in neural information processing systems, pages 19873 19883, 2020. Y. Wang and C. D. Scott. On classification-calibration of gamma-phi losses. ar Xiv preprint ar Xiv:2302.07321, 2023. J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Citeseer, 1998. R. C. Williamson, E. Vernet, and M. D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:1 52, 2016. M. Zhang and S. Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. In Advances in Neural Information Processing Systems, pages 16927 16936, 2020. M. Zhang, H. G. Ramaswamy, and S. Agarwal. Convex calibrated surrogates for the multi-label f-measure. In International Conference on Machine Learning, pages 11246 11255, 2020. T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a. T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b. Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, 2018. C. Zheng, G. Wu, F. Bao, Y. Cao, C. Li, and J. Zhu. Revisiting discriminative vs. generative classifiers: Theory and implications. In International Conference on Machine Learning, 2023. Contents of Appendix A Related work 16 B Minimizability gap 17 B.1 Zero minimizability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.2 Relationship with approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.3 Significance of H-consistency bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 C Proofs for comp-sum losses 18 C.1 Proof of H-consistency bounds with Tcomp (Theorem 2) . . . . . . . . . . . . . . . . 19 C.2 Characterization of Tcomp (Theorem 3) . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.3 Computation of examples . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . 22 D Proofs for constrained losses 23 D.1 Proof of H-consistency bounds with Tcstnd (Theorem 10) . . . . . . . . . . . . . . . . 24 D.2 Characterization of Tcstnd (Theorem 11) . . . . . . . . . . . . . . . . . . . . . . . . . . 26 D.3 Computation of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 E Extensions of comp-sum losses 28 E.1 Proof of H-consistency bounds with T comp (Theorem 5) . . . . . . . . . . . . . . . . 28 E.2 Logistic loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E.3 Sum exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 E.4 Generalized cross-entropy loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.5 Mean absolute error loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 F Extensions of constrained losses 35 F.1 Proof of H-consistency bound with T cstnd (Theorem 12) . . . . . . . . . . . . . . . . 35 F.2 Constrained exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 A Related work The notions of Bayes-consistency (also known as consistency) and calibration have been well studied not only with respect to the binary zero-one loss [Zhang, 2004a, Bartlett et al., 2006, Steinwart, 2007, Mohri et al., 2018], but also with respect to the multi-class zero-one loss [Zhang, 2004b, Tewari and Bartlett, 2007], the general multi-class losses [Ramaswamy and Agarwal, 2012, Narasimhan et al., 2015, Ramaswamy and Agarwal, 2016], the multi-class SVMs [Chen and Sun, 2006, Chen and Xiang, 2006, Liu, 2007, Dogan et al., 2016, Wang and Scott, 2020], the gamma-phi losses [Wang and Scott, 2023], the multi-label losses [Gao and Zhou, 2011, Dembczynski et al., 2012, Zhang et al., 2020], the losses with a reject option [Ramaswamy et al., 2015, Cortes et al., 2016a,b, 2023], the ranking losses [Ravikumar et al., 2011, Ramaswamy et al., 2013, Gao and Zhou, 2015, Uematsu and Lee, 2017], the cost sensitive losses [Pires et al., 2013, Pires and Szepesvári, 2016], the structured losses [Ciliberto et al., 2016, Osokin et al., 2017, Blondel, 2019], the polyhedral losses [Frongillo and Waggoner, 2021, Finocchiaro et al., 2022], the Top-k classification losses [Thilagar et al., 2022], the proper losses [Agarwal and Agarwal, 2015, Williamson et al., 2016] and the losses of ordinal regression [Pedregosa et al., 2017]. Bayes-consistency only holds for the full family of measurable functions, which of course is distinct from the more restricted hypothesis set used by a learning algorithm. Therefore, a hypothesis setdependent notion of H-consistency has been proposed by Long and Servedio [2013] in the realizable setting, used by Zhang and Agarwal [2020] for linear models, and generalized by Kuznetsov et al. [2014] to the structured prediction case. Long and Servedio [2013] showed that there exists a case where a Bayes-consistent loss is not H-consistent while inconsistent losses can be H-consistent. Zhang and Agarwal [2020] further investigated the phenomenon in [Long and Servedio, 2013] and showed that the situation of losses that are not H-consistent with linear models can be remedied by carefully choosing a larger piecewise linear hypothesis set. Kuznetsov et al. [2014] proved positive results for the H-consistency of several multi-class ensemble algorithms, as an extension of H-consistency results in [Long and Servedio, 2013]. Recently, Awasthi et al. 
[2022a,b], Mao et al. [2023h], Zheng et al. [2023] presented a series of results providing H-consistency bounds. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are more informative guarantees than similar excess error bounds derived in the literature, which correspond to the special case where H is the family of all measurable functions [Zhang, 2004a, Bartlett et al., 2006, Mohri et al., 2018]. Awasthi et al. [2022a] studied H-consistency bounds in binary classification. They provided a series of tight H-consistency bounds for bounded hypothesis set of linear models and one-hidden-layer neural networks. The subsequent study [Awasthi et al., 2022b] further generalized the framework to multi-class classification, where they presented a extensive study of H-consistency bounds for diverse multi-class surrogate losses including negative results for max losses [Crammer and Singer, 2001] and positive results for sum losses [Weston and Watkins, 1998], and constrained losses [Lee et al., 2004]. However, the hypothesis sets adopted there were assumed to be complete, which rules out the bounded hypothesis sets typically used in practice. Moreover, the final bounds derived from [Awasthi et al., 2022b] are based on ad hoc methods and may not be tight. [Mao et al., 2023h] complemented the previous work by studying a wide family of comp-sum losses in the multi-class classification, which generalized the sum-losses and included as special cases the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], the generalized cross-entropy loss [Zhang and Sabuncu, 2018], and the mean absolute error loss [Ghosh et al., 2017]. Here too, the completeness assumption on the hypothesis sets was adopted and their H-consistency bounds do not apply to common bounded hypothesis sets in practice. Zheng et al. [2023] proved H-consistency bounds for multi-class logistic loss with bounded linear hypothesis sets. However, their bounds require a crucial distributional assumption under which, the minimizability gaps coincide with the approximation errors. Thus, their bounds can be recovered as excess error bounds, which are less significant. This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification. Our general tools and tight bounds show several remarkable advantages: first, they improve existing bounds for complete hypothesis sets previously proven in [Awasthi et al., 2022b], second, they encompass all previously comp-sum and constrained losses studied thus far as well as many new ones [Awasthi et al., 2022a, Mao et al., 2023h], third, they extend beyond the completeness assumption adopted in previous work, fourth, they give novel guarantees for bounded hypothesis sets, and finally they help prove a much stronger and more significant guarantee for logistic loss with linear hypothesis set than [Zheng et al., 2023]. 
Other related work on H-consistency bounds includes: H-consistency bounds for pairwise ranking [Mao et al., 2023d,e]; theoretically grounded surrogate losses and algorithms for learning with abstention supported by H-consistency bounds, including the study of score-based abstention [Mao et al., 2023f], predictor-rejector abstention [Mao et al., 2023c] and learning to abstain with a fixed predictor with application in decontextualization [Mohri et al., 2023]; principled approaches for learning to defer with multiple experts that benefit from strong H-consistency bounds, including the single-stage scenario [Mao et al., 2023b] and a two-stage scenario [Mao et al., 2023a]; H-consistency theory and algorithms for adversarial robustness [Awasthi et al., 2021a,b, 2023a, Mao et al., 2023h, Awasthi et al., 2023b]; and efficient algorithms and loss functions for structured prediction with stronger H-consistency guarantees [Mao et al., 2023g]. B Minimizability gap This is a brief discussion of minimizability gaps and their properties. By definition, for any loss function ℓ, the minimizability gap is defined by Mℓ(H) = inf h H{ E (x,y) D[ℓ(h,x,y)]} E x[ inf h H E y x[ℓ(h,x,y)]] = R ℓ(H) E x[C ℓ(H,x)]. B.1 Zero minimizability Lemma 14. Let ℓbe a surrogate loss such that for (x,y) X Y and any measurable function h Hall, the loss ℓ(h,x,y) only depends on h(x) and y (thus we can write ℓ(h,x,y) = ℓ(h(x),y) for some function ℓ). Then, the minimizabiliy gap vanishes: Mℓ(Hall) = 0. Proof. Fix ϵ > 0. Then, by definition of the infimum, for any x X, there exists hx Hall such that E y x[ℓ(hx,x,y)] inf h Hall E y x[ℓ(h,x,y)] + ϵ. Now, define the function h by h(x) = hx(x), for all x X. h can be shown to be measurable, for example, when X admits a countable dense subset. Then, E (x,y) D[ℓ(h,x,y)] = E (x,y) D[ℓ(h(x),y)] = E (x,y) D[ℓ(hx(x),y)] = E (x,y) D[ℓ(hx,x,y)] E x[ inf h Hall E y x[ℓ(h,x,y)] + ϵ] = E x[C ℓ(Hall,x)] + ϵ. Thus, we have inf h Hall E (x,y) D[ℓ(h,x,y)] E x[C ℓ(Hall,x)] + ϵ. Since the inequality holds for any ϵ > 0, we have R ℓ(Hall) = infh Hall E(x,y) D[ℓ(h,x,y)] Ex[C ℓ(Hall,x)]. This implies equality since the inequality R ℓ(H) Ex[C ℓ(H,x)] holds for any H. B.2 Relationship with approximation error Let Aℓdenote the approximation error of a loss function ℓand a hypothesis set H: Aℓ(H) = R ℓ(H) R ℓ(Hall). We will denote by Iℓ(H) the difference of pointwise infima Iℓ(H) = Ex [C ℓ(H,x) C ℓ(Hall,x)], which is non-negative. The minimizability gap can be decomposed as follows in terms of the approximation error and the difference of pointwise infima: Mℓ(H) = R ℓ(H) Ex [C ℓ(H,x)] = R ℓ(H) R ℓ(Hall) + R ℓ(Hall) Ex [C ℓ(H,x)] = Aℓ(H) + R ℓ(Hall) Ex [C ℓ(H,x)] = Aℓ(H) + Ex [C ℓ(Hall,x) C ℓ(H,x)] (By Lemma 14) = Aℓ(H) Iℓ(H). The decomposition immediately implies the following result. Lemma 15. Let ℓbe a surrogate loss such that for (x,y) X Y and any measurable function h Hall, the loss ℓ(h,x,y) only depends on h(x) and y (thus we can write ℓ(h,x,y) = ℓ(h(x),y) for some function ℓ). Then, for any loss function ℓand hypothesis set H, we have: Mℓ(H) Aℓ(H). By Lemma 1, when ℓis the zero-one loss, Iℓ(H) = 0 when the hypothesis set generates labels that cover all possible outcomes for each input. However, for a surrogate loss function, Iℓ(H) is non-negative, and is generally non-zero. Take the example of binary classification and denote the conditional distribution as η(x) = D(Y = 1 X = x). Let H be a family of functions h with h(x) Λ for all x X and such that all values in [ Λ,+Λ] can be reached. 
Consider for example the exponential-based margin loss: $\ell(h,x,y) = e^{-yh(x)}$. Then, $\mathcal{C}_\ell(h,x) = \eta(x)e^{-h(x)} + (1-\eta(x))e^{h(x)}$. Upon observing this, it becomes apparent that the infimum over all measurable functions can be expressed in the following way, for all $x$: $\mathcal{C}^*_\ell(\mathcal{H}_{\mathrm{all}},x) = 2\sqrt{\eta(x)(1-\eta(x))}$, while the infimum over $\mathcal{H}$, $\mathcal{C}^*_\ell(\mathcal{H},x)$, depends on $\Lambda$ and can be expressed as
$$\mathcal{C}^*_\ell(\mathcal{H},x) = \begin{cases} \max\{\eta(x),1-\eta(x)\}\,e^{-\Lambda} + \min\{\eta(x),1-\eta(x)\}\,e^{\Lambda} & \Lambda < \frac{1}{2}\big|\log\frac{\eta(x)}{1-\eta(x)}\big| \\ 2\sqrt{\eta(x)(1-\eta(x))} & \text{otherwise.} \end{cases}$$
Thus, in the deterministic scenario, $\mathcal{I}_\ell(\mathcal{H}) = \mathbb{E}_x[\mathcal{C}^*_\ell(\mathcal{H},x) - \mathcal{C}^*_\ell(\mathcal{H}_{\mathrm{all}},x)] = e^{-\Lambda}$.

B.3 Significance of H-consistency bounds

As shown in the previous section, for the target loss $\ell_{0-1}$, the minimizability gap coincides with the approximation error, $\mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = \mathcal{A}_{\ell_{0-1}}(\mathcal{H})$, when the hypothesis set generates labels that cover all possible outcomes for each input. However, for a surrogate loss $\ell$, the minimizability gap is generally strictly less than the approximation error: $\mathcal{M}_\ell(\mathcal{H}) < \mathcal{A}_\ell(\mathcal{H})$. Thus, an H-consistency bound, expressed as follows for some increasing function $\Gamma$:
$$\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H}) \leq \Gamma\big(\mathcal{R}_\ell(h) - \mathcal{R}^*_\ell(\mathcal{H}) + \mathcal{M}_\ell(\mathcal{H})\big),$$
is more favorable than an excess error bound expressed in terms of approximation errors:
$$\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{A}_{\ell_{0-1}}(\mathcal{H}) \leq \Gamma\big(\mathcal{R}_\ell(h) - \mathcal{R}^*_\ell(\mathcal{H}) + \mathcal{A}_\ell(\mathcal{H})\big),$$
since the equality $\mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = \mathcal{A}_{\ell_{0-1}}(\mathcal{H})$ makes the left-hand sides coincide, while $\mathcal{M}_\ell(\mathcal{H}) \leq \mathcal{A}_\ell(\mathcal{H})$ makes the right-hand side of the former smaller. Here, $\Gamma$ is typically linear or the square-root function modulo constants. When $\mathcal{H} = \mathcal{H}_{\mathrm{all}}$, the family of all measurable functions, by Lemma 14, the H-consistency bound coincides with the excess error bound and implies Bayes-consistency by taking the limit. It is therefore a stronger guarantee than an excess error bound and Bayes-consistency.

C Proofs for comp-sum losses

Let $y_{\max} = \operatorname{argmax}_{y \in \mathcal{Y}} p(x,y)$ and $h(x) = \operatorname{argmax}_{y \in \mathcal{Y}} h(x,y)$, where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy.

C.1 Proof of H-consistency bounds with $\mathcal{T}_{\mathrm{comp}}$ (Theorem 2)

Theorem 2 (H-consistency bound for comp-sum losses). Assume that $\mathcal{H}$ is symmetric and complete and that $\mathcal{T}_{\mathrm{comp}}$ is convex. Then, the following inequality holds for any hypothesis $h \in \mathcal{H}$ and any distribution:
$$\mathcal{T}_{\mathrm{comp}}\big(\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H})\big) \leq \mathcal{R}_{\ell_{\mathrm{comp}}}(h) - \mathcal{R}^*_{\ell_{\mathrm{comp}}}(\mathcal{H}) + \mathcal{M}_{\ell_{\mathrm{comp}}}(\mathcal{H}), \quad (3)$$
with $\mathcal{T}_{\mathrm{comp}}$ an H-estimation error transformation for comp-sum losses defined for all $t \in [0,1]$ by
$$\mathcal{T}_{\mathrm{comp}}(t) = \begin{cases} \inf_{\tau \in [0,\frac{1}{2}]} \sup_{\mu \in [-\tau, 1-\tau]} \big\{ \frac{1+t}{2}[\Phi(\tau) - \Phi(1-\tau-\mu)] + \frac{1-t}{2}[\Phi(1-\tau) - \Phi(\tau+\mu)] \big\} & n = 2 \\ \inf_{P \in [\max\{\frac{1}{n-1},\, t\},\, 1]} \; \inf_{\substack{\tau_1 \geq \max(\tau_2, 1/n) \\ \tau_1 + \tau_2 \leq 1,\ \tau_2 \geq 0}} \; \sup_{\mu \in [-\tau_2, \tau_1]} \big\{ \frac{P+t}{2}[\Phi(\tau_2) - \Phi(\tau_1-\mu)] + \frac{P-t}{2}[\Phi(\tau_1) - \Phi(\tau_2+\mu)] \big\} & n > 2. \end{cases}$$
Furthermore, for any $t \in [0,1]$, there exist a distribution $\mathcal{D}$ and a hypothesis $h \in \mathcal{H}$ such that $\mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{H}) = t$ and $\mathcal{R}_{\ell_{\mathrm{comp}}}(h) - \mathcal{R}^*_{\ell_{\mathrm{comp}}}(\mathcal{H}) + \mathcal{M}_{\ell_{\mathrm{comp}}}(\mathcal{H}) = \mathcal{T}_{\mathrm{comp}}(t)$.

Proof. For the comp-sum loss $\ell_{\mathrm{comp}}$, the conditional $\ell_{\mathrm{comp}}$-risk can be expressed as follows:
$$\mathcal{C}_{\ell_{\mathrm{comp}}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\ell_{\mathrm{comp}}(h,x,y) = \sum_{y \in \mathcal{Y}} p(x,y)\,\Phi\Big(\frac{e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{h(x,y')}}\Big) = \sum_{y \in \mathcal{Y}} p(x,y)\,\Phi(S_h(x,y))$$
$$= p(x,y_{\max})\,\Phi(S_h(x,y_{\max})) + p(x,h(x))\,\Phi(S_h(x,h(x))) + \sum_{y \notin \{y_{\max}, h(x)\}} p(x,y)\,\Phi(S_h(x,y)),$$
where we let $S_h(x,y) = \frac{e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{h(x,y')}}$ for any $y \in \mathcal{Y}$, with the constraint that $\sum_{y \in \mathcal{Y}} S_h(x,y) = 1$. For any $h \in \mathcal{H}$ such that $h(x) \neq y_{\max}$ and $x \in \mathcal{X}$, we can always find a family of hypotheses $\{h_\mu\} \subset \mathcal{H}$ such that $S_{h,\mu}(x,\cdot) = \frac{e^{h_\mu(x,\cdot)}}{\sum_{y' \in \mathcal{Y}} e^{h_\mu(x,y')}}$ takes the following values:
$$S_{h,\mu}(x,y) = \begin{cases} S_h(x,y) & \text{if } y \notin \{y_{\max}, h(x)\} \\ S_h(x,y_{\max}) + \mu & \text{if } y = h(x) \\ S_h(x,h(x)) - \mu & \text{if } y = y_{\max}. \end{cases}$$
Note that $S_{h,\mu}$ satisfies the constraint $\sum_{y \in \mathcal{Y}} S_{h,\mu}(x,y) = \sum_{y \in \mathcal{Y}} S_h(x,y) = 1$. Let $p_1 = p(x,y_{\max})$, $p_2 = p(x,h(x))$, $\tau_1 = S_h(x,h(x))$ and $\tau_2 = S_h(x,y_{\max})$ to simplify the notation.
Then, by the definition of Sh,µ, we have for any h H and x X, Cℓcomp(h,x) inf µ [ τ2,τ1]Cℓcomp(hµ,x) = sup µ [ τ2,τ1] {p1[Φ(τ2) Φ(τ1 µ)] + p2[Φ(τ1) Φ(τ2 + µ)]} = sup µ [ τ2,τ1] {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (P = p1 + p2 [ 1 n 1 p1 p2,1]) inf P [ 1 n 1 p1 p2,1] inf τ1 max(τ2,1/n) τ1+τ2 1,τ2 0 sup µ [ τ2,τ1] {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (τ1 max(τ2,1/n),τ1 + τ2 1,τ2 0) = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression of Tcomp is simplified. Since Tcomp is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcomp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcomp(E X[ Cℓ0 1,H(h,x)]) E X[Tcomp( Cℓ0 1,H(h,x))] E X[ Cℓcomp,H(h,x)] = Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τϵ [0, 1 2] and satisfies sup µ [ τϵ,1 τϵ] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Rℓcomp(h) EX[C ℓcomp(H,x)] = Cℓcomp(h,x) C ℓcomp(H,x) = sup µ [ τϵ,1 τϵ] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. By letting ϵ 0, we prove the tightness for n = 2. The proof for n > 2 directly extends from the case when n = 2. Indeed, for any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τ1,ϵ, Sh(x,2) = τ2,ϵ, Sh(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 1, and inf P [ 1 n 1 t,1] sup µ [ τ2,ϵ,τ1,ϵ] {P + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + P t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} = sup µ [ τ2,ϵ,τ1,ϵ] {1 + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + 1 t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} < Tcomp(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t Tcomp(t) < Tcomp(t) + ϵ. By letting ϵ 0, we prove the tightness for n > 2. C.2 Characterization of Tcomp (Theorem 3) Theorem 3 (characterization of Tcomp). Assume that Φ is convex, differentiable at 1 2 and Φ ( 1 2) < 0. Then, Tcomp can be expressed as follows: 2) infµ [ 1 2 + µ) + 1+t 2 µ)} n = 2 infτ [ 1 2 ]{Φ(τ) infµ [ τ,τ]{ 1+t 2 Φ(τ µ) + 1 t 2 Φ(τ + µ)}} n > 2. Proof. For n = 2, we have Tcomp(t) = inf τ [0, 1 2 ] sup µ [ τ,1 τ] {1 + t 2 [Φ(τ) Φ(1 τ µ)] + 1 t 2 [Φ(1 τ) Φ(τ + µ)]} = inf τ [0, 1 2 Φ(τ) + 1 t 2 [Φ(1 τ)] inf µ [ τ,1 τ]{1 + t 2 Φ(1 τ µ) + 1 t 2 Φ(τ + µ)}) = inf τ [0, 1 2 Φ(τ) + 1 t 2 [Φ(1 τ)]) inf µ [ 1 2 + µ) + 1 + t inf τ [0, 1 2)) inf µ [ 1 2 + µ) + 1 + t (Φ is convex) 2) inf µ [ 1 2 + µ) + 1 + t 2 µ)} (Φ ( 1 2) < 0, t(τ 1 where the equality can be achieved by τ = 1 For n > 2, we have Tcomp(t) = inf P [ 1 n 1 ,1] inf τ1 max(τ2,1/n) τ1+τ2 1,τ2 0 sup µ [ τ2,τ1] F(P,τ1,τ2,µ) where we let F(P,τ1,τ2,µ) = P +t 2 [Φ(τ2) Φ(τ1 µ)] + P t 2 [Φ(τ1) Φ(τ2 + µ)]. For simplicity, we assume that Φ is differentiable. For general convex Φ, we can proceed by using left and right derivatives, which are non-decreasing. Differentiate F with respect to µ, we have 2 Φ (τ1 µ) + t P 2 Φ (τ2 + µ). Using the fact that P [ 1 n 1 t,1],t [0,1] and Φ is non-decreasing, we obtain that F µ is nonincreasing. Furthermore, Φ is non-decreasing and non-positive, Φ is non-negative, we obtain that Φ (+ ) = 0. 
This implies that F µ (+ ) 0 and F µ ( ) 0. Therefore, there exists µ0 R such that F µ (µ0) = P + t 2 Φ (τ1 µ0) + t P 2 Φ (τ2 + µ0) = 0 By taking µ = τ1 τ2 and using the fact that τ2 1 2) < 0, we have F µ (τ1 τ2) = P + t 2 Φ (τ2) + t P 2 Φ (τ1) < 0. Thus, since F µ is non-increasing, we obtain µ0 < τ1 τ2. Differentiate F with respect to τ2 at µ0, we have F τ2 = P + t 2 Φ (τ2) + t P 2 Φ (τ2 + µ0). Since Φ is non-decreasing, we obtain F τ2 P + t 2 Φ (τ2) + t P 2 Φ (τ2 + τ1 τ2) = F µ (τ1 τ2) < 0, which implies that the infimum infτ1 max{τ2, 1 n } is achieved when τ2 = τ1. Differentiate F with respect to P at µ0 and τ1 = τ2, by the convexity of Φ, we obtain F P = Φ(τ1) Φ(τ1 µ0) + Φ(τ1) Φ(τ1 + µ0) 0, which implies that the infimum inf P [ 1 n 1 ,1] is achieved when P = 1. Above all, we obtain Tcomp(t) = inf τ [ 1 2 ] sup µ [ τ,τ] F(1,τ,τ,µ) = inf τ [ 1 2 ]{Φ(τ) inf µ [ τ,τ]{1 + t 2 Φ(τ µ) + 1 t 2 Φ(τ + µ)}}. C.3 Computation of examples Example: Φ(t) = log(t). For n = 2, plugging in Φ(t) = log(t) in Theorem 3, gives Tcomp = log 2 inf µ [ 1 2 + µ) 1 + t 2 log(1 + t) + 1 t 2 log(1 t). (minimum achieved at µ = t Similarly, for n > 2, plugging in Φ(t) = log(t) in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{ log τ inf µ [ τ,τ]{ 1 t 2 log(τ + µ) 1 + t 2 log(τ µ)}} 2 log(1 + t) + 1 t 2 log(1 t). (minimum achieved at µ = τt) Example: Φ(t) = 1 t 1. For n = 2, plugging in Φ(t) = 1 t 1 in Theorem 3, gives Tcomp = 2 inf µ [ 1 2 1 1 2 + µ + 1 + t 1 t2. (minimum achieved at µ = (1 t) 1 2 (1+t) 1 2 2((1+t) 1 2 +(1 t) 1 2 )) Similarly, for n > 2, plugging in Φ(t) = 1 t 1 in Theorem 3 yields Tcomp = inf τ [ 1 τ inf µ [ τ,τ]{1 + t 2 1 τ µ + 1 + t 2 1 τ + µ}} = inf τ [ 1 2 ] 1 2τ (1 1 t2) (minimum achieved at µ = (1 t) 1 2 (1+t) 1 2 (1+t) 1 2 +(1 t) 1 2 τ) 1 t2. (minimum achieved at τ = 1 Example: Φ(t) = 1 q(1 tq),q (0,1). For n = 2, plugging in Φ(t) = 1 q(1 tq) in Theorem 3, gives q2q inf µ [ 1 = 1 q2q (1 + t) 1 1 q + (1 t) (minimum achieved at µ = (1 t) 1 1 q (1+t) 1 1 q 2((1+t) 1 1 q +(1 t) 1 1 q )) Similarly, for n > 2, plugging in Φ(t) = 1 q(1 tq) in Theorem 3 yields Tcomp = inf τ [ 1 q inf µ [ τ,τ]{ 1 + t 2q (τ µ)q 1 t 2q (τ + µ)q}} = inf τ [ 1 1 1 q + (1 t) (minimum achieved at µ = (1 t) 1 1 q (1+t) 1 1 q (1+t) 1 1 q +(1 t) 1 1 q τ) = 1 qnq (1 + t) 1 1 q + (1 t) 1 qnq . (minimum achieved at τ = 1 Example: Φ(t) = 1 t. For n = 2, plugging in Φ(t) = 1 t in Theorem 3, gives 2 inf µ [ 1 2 µ) + 1 + t 2 + µ)} = 1 Similarly, for n > 2, plugging in Φ(t) = 1 t in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{(1 τ) inf µ [ τ,τ]{1 + t 2 (1 τ + µ) + 1 t 2 (1 τ µ)}} = inf τ [ 1 2 ]τ t (minimum achieved at µ = τ) n. (minimum achieved at τ = 1 Example: Φ(t) = (1 t)2. For n = 2, plugging in Φ(t) = (1 t)2 in Theorem 3, gives 4 inf µ [ 1 Similarly, for n > 2, plugging in Φ(t) = (1 t)2 in Theorem 3 yields Tcomp = inf τ [ 1 2 ]{(1 τ)2 inf µ [ τ,τ]{1 + t 2 (1 τ + µ)2 + 1 t 2 (1 τ µ)2}} = inf τ [ 1 2 ]{(1 τ)2t2} (minimum achieved at µ = t(τ 1)) 4 . (minimum achieved at τ = 1 D Proofs for constrained losses Let ymax = argmaxy Y p(x,y) and h(x) = argmaxy Y h(x,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. D.1 Proof of H-consistency bounds with Tcstnd (Theorem 10) Theorem 10 (H-consistency bound for constrained losses). Assume that H is symmetric and complete. Assume that Tcstnd is convex. 
Then, for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) with H-estimation error transformation for constrained losses defined on t [0,1] by Tcstnd(t) = inf τ 0sup µ R { 1 t 2 [Φ(τ) Φ( τ + µ)] + 1+t 2 [Φ( τ) Φ(τ µ)]} n = 2 inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R { 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]} n > 2. Furthermore, for any t [0,1], there exist a distribution D and a hypothesis h H such that Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Tcstnd(t). Proof. For the constrained loss ℓcstnd, the conditional ℓcstnd-risk can be expressed as follows: Cℓcstnd(h,x) = y Y p(x,y)ℓcstnd(h,x,y) = y Y p(x,y) y y Φ( h(x,y )) = y Y Φ( h(x,y)) y y p(x,y ) = y Y Φ( h(x,y))(1 p(x,y)) = Φ( h(x,ymax))(1 p(x,ymax)) + Φ( h(x,h(x)))(1 p(x,h(x))) + y {ymax,h(x)} Φ( h(x,y))(1 p(x,y)). For any h H and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymax,h(x)} h(x,ymax) + µ if y = h(x) h(x,h(x)) µ if y = ymax. Note that the hypotheses hµ satisfies the constraint: y Y hµ(x,y) = y Y h(x,y) = 0, µ R. Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = h(x,h(x)) and τ2 = h(x,ymax) to simplify the notation. Then, by the definition of hµ, we have for any h H and x X, Cℓcstnd(h,x) inf µ RCℓcstnd(hµ,x) = sup µ R {(1 p1)[Φ( τ2) Φ( τ1 + µ)] + (1 p2)[Φ( τ1) Φ( τ2 µ)]} = sup µ R {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (P = p1 + p2 [ 1 n 1,1]) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (τ1 0, τ2 τ1) = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) where for n = 2, an additional constraint τ1 + τ2 = 0 is imposed and the expression of Tcomp is simplified. Since Tcstnd is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcstnd(E X[ Cℓ0 1,H(h,x)]) E X[Tcstnd( Cℓ0 1,H(h,x))] E X[ Cℓcstnd,H(h,x)] = Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,2) = τϵ 0 and satisfies sup µ R {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Rℓcstnd(h) EX[C ℓcstnd(H,x)] = Cℓcstnd(h,x) C ℓcstnd(H,x) = sup µ R {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,1) = τ1,ϵ, h(x,2) = τ2,ϵ, h(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 0, and inf P [ 1 n 1 ,1]sup µ R {2 P t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 2 P + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} = sup µ R {1 t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 1 + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} < Tcstnd(t) + ϵ. Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) < Tcstnd(t) + ϵ. D.2 Characterization of Tcstnd (Theorem 11) Theorem 11 (characterization of Tcstnd). 
Assume that Φ is convex, differentiable at zero and Φ (0) < 0. Then, Tcstnd can be expressed as follows: Tcstnd(t) = Φ(0) infµ R{ 1 t 2 Φ(µ) + 1+t 2 Φ( µ)} n = 2 infτ 0{(2 1 n 1)Φ( τ) infµ R{ 2 t 1 n 1 2 Φ( τ + µ) + 2+t 1 n 1 2 Φ( τ µ)}} n > 2 {Φ(0) infµ R{ 1 t 2 Φ(µ) + 1+t 2 Φ( µ)} n = 2 infτ 0{2Φ( τ) infµ R{ 2 t 2 Φ( τ + µ) + 2+t 2 Φ( τ µ)}} n > 2. Proof. For n = 2, we have Tcstnd(t) = inf τ 0sup µ R {1 t 2 [Φ(τ) Φ( τ + µ)] + 1 + t 2 [Φ( τ) Φ(τ µ)]} = inf τ 0(1 t 2 Φ(τ) + 1 + t 2 [Φ( τ)]) inf µ R{1 t 2 Φ( τ + µ) + 1 + t = inf τ 0(1 t 2 Φ(τ) + 1 + t 2 [Φ( τ)]) inf µ R{1 t 2 Φ(µ) + 1 + t inf τ 0(Φ(0) Φ (0)tτ) inf µ R{1 t 2 Φ(µ) + 1 + t 2 Φ( µ)} (Φ is convex) = Φ(0) inf µ R{1 t 2 Φ(µ) + 1 + t 2 Φ( µ)} (Φ (0) < 0, tτ 0) where the equality can be achieved by τ = 0. For n > 2, we have Tcstnd(t) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ R F(P,τ1,τ2,µ) where we let F(P,τ1,τ2,µ) = 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]. For simplicity, we assume that Φ is differentiable. For general convex Φ, we can proceed by using left and right derivatives, which are non-decreasing. Differentiate F with respect to µ, we have F µ = P + t 2 2 Φ ( τ1 + µ) + 2 P + t 2 Φ ( τ2 µ). Using the fact that P [ 1 n 1,1],t [0,1] and Φ is non-decreasing, we obtain that F µ is nonincreasing. Furthermore, Φ is non-decreasing and non-positive, Φ is non-negative, we obtain that Φ (+ ) = 0. This implies that F µ (+ ) 0 and F µ ( ) 0. Therefore, there exists µ0 R such that F µ (µ0) = P + t 2 2 Φ ( τ1 + µ0) + 2 P + t 2 Φ ( τ2 µ0) = 0 By taking µ = τ1 τ2 and using the fact that Φ (0) < 0, we have F µ (τ1 τ2) = P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ1) < 0. Thus, since F µ is non-increasing, we obtain µ0 < τ1 τ2. Differentiate F with respect to τ2 at µ0, we have F τ2 = P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ2 µ0). Since Φ is non-decreasing, we obtain F τ2 P + t 2 2 Φ ( τ2) + 2 P + t 2 Φ ( τ2 τ1 + τ2) = F µ (τ1 τ2) < 0, which implies that the infimum infτ1 max{τ2,0} is achieved when τ2 = τ1. Differentiate F with respect to P at µ0 and τ1 = τ2, by the convexity of Φ, we obtain F P = Φ( τ1 + µ0) Φ( τ1) Φ( τ1) + Φ( τ1 µ0) 0, which implies that the infimum inf P [ 1 n 1 ,1] is achieved when P = 1 n 1. Above all, we obtain Tcstnd(t) = inf τ 0sup µ R F( 1 n 1,τ,τ,µ) = inf τ 0{(2 1 n 1)Φ( τ) inf µ R{2 t 1 n 1 2 Φ( τ + µ) + 2 + t 1 n 1 2 Φ( τ µ)}} inf τ 0sup µ R F(0,τ,τ,µ) = inf τ 0{2Φ( τ) inf µ R{2 t 2 Φ( τ + µ) + 2 + t 2 Φ( τ µ)}}. D.3 Computation of examples Example: Φ(t) = Φexp(t) = e t. For n = 2, plugging in Φ(t) = e t in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 e µ + 1 + t 1 t2. (minimum achieved at µ = 1 For n > 2, plugging in Φ(t) = e t in Theorem 11 yields Tcomp inf τ 0{2eτ inf µ R{2 t 2 eτ µ + 2 + t 2 inf µ R{2 t 2 e µ + 2 + t 2 eµ} (minimum achieved at τ = 0) 4 t2. (minimum achieved at µ = 1 Example: Φ(t) = Φhinge(t) = max{0,1 t}. For n = 2, plugging in Φ(t) = max{0,1 t} in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 max{0,1 µ} + 1 + t 2 max{0,1 + µ}} = t. (minimum achieved at µ = 1) For n > 2, plugging in Φ(t) = max{0,1 t} in Theorem 11 yields Tcomp inf τ 0{2max{0,1 + τ} inf µ R{2 t 2 max{0,1 + τ µ} + 2 + t 2 max{0,1 + τ + µ}}} = 2 inf µ R{2 t 2 max{0,1 µ} + 2 + t 2 max{0,1 + µ}} (minimum achieved at τ = 0) = t. (minimum achieved at µ = 1) Example: Φ(t) = Φsq hinge(t) = (1 t)21t 1. For n = 2, plugging in Φ(t) = (1 t)21t 1 in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 (1 µ)21µ 1 + 1 + t 2 (1 + µ)21µ 1} = t2. 
(minimum achieved at µ = t) For n > 2, plugging in Φ(t) = (1 t)21t 1 in Theorem 11 yields Tcomp inf τ 0{2(1 + τ)21τ 1 inf µ R{2 t 2 (1 + τ µ)21 τ+µ 1 + 2 + t 2 (1 + τ + µ)21τ+µ 1}} 2 inf µ R{2 t 2 (1 µ)21µ 1 + 2 + t 2 (1 + µ)21µ 1} (minimum achieved at τ = 0) 2 . (minimum achieved at µ = t Example: Φ(t) = Φsq(t) = (1 t)2. For n = 2, plugging in Φ(t) = (1 t)2 in Theorem 11, gives Tcomp = 1 inf µ R{1 t 2 (1 µ)2 + 1 + t 2 (1 + µ)2} = t2. (minimum achieved at µ = t) For n > 2, plugging in Φ(t) = (1 t)2 in Theorem 11 yields Tcomp inf τ 0{2(1 + τ)2 inf µ R{2 t 2 (1 + τ µ)2 + 2 + t 2 (1 + τ + µ)2}} 2 inf µ R{2 t 2 (1 µ)2 + 2 + t 2 (1 + µ)2} (minimum achieved at τ = 0) 2 . (minimum achieved at µ = t E Extensions of comp-sum losses E.1 Proof of H-consistency bounds with T comp (Theorem 5) Theorem 5 (H-consistency bound for comp-sum losses). Assume that T comp is convex. Then, the following inequality holds for any hypothesis h H and any distribution: T comp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) with T comp the H-estimation error transformation for comp-sum losses defined for all t [0,1] by T comp(t) = inf τ [0, 1 2 ] sup µ [smin τ,1 τ smin] { 1+t 2 [Φ(τ) Φ(1 τ µ)] + 1 t 2 [Φ(1 τ) Φ(τ + µ)]} n = 2 inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C { P +t 2 [Φ(τ2) Φ(τ1 µ)] + P t 2 [Φ(τ1) Φ(τ2 + µ)]} n > 2, where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}], smax = 1 1+(n 1)e 2 infx Λ(x) and smin = 1 1+(n 1)e2 infx Λ(x) . Furthermore, for any t [0,1], there exist a distribution D and h H such that Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Tcomp(t). Proof. For the comp-sum loss ℓcomp, the conditional ℓcomp-risk can be expressed as follows: Cℓcomp(h,x) = y Y p(x,y)ℓcomp(h,x,y) = y Y p(x,y)Φ( eh(x,y) y Y eh(x,y ) ) = y Y p(x,y)Φ(Sh(x,y)) = p(x,ymax)Φ(Sh(x,ymax)) + p(x,h(x))Φ(Sh(x,h(x))) + y {ymax,h(x)} p(x,y)Φ(Sh(x,y)) where we let Sh(x,y) = eh(x,y) y Y eh(x,y ) for any y Y with the constraint that y Y Sh(x,y) = 1. Note that for any h H, 1 1 + (n 1)e2Λ(x) = e Λ(x) e Λ(x) + (n 1)eΛ(x) Sh(x,y) eΛ(x) eΛ(x) + (n 1)e Λ(x) = 1 1 + (n 1)e 2Λ(x) Therefore for any (x,y) X Y, Sh(x,y) [Smin,Smax], where we let Smax = 1 1+(n 1)e 2Λ(x) and Smin = 1 1+(n 1)e2Λ(x) . Furthermore, all values in [Smin,Smax] of Sh can be reached for some h H. Observe that 0 Smax + Smin 1. Let ymax = argmaxy Y p(x,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymax and x X, we can always find a family of hypotheses {hµ} H such that Sh,µ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: Sh,µ(x,y) = Sh(x,y) if y / {ymax,h(x)} Sh(x,ymax) + µ if y = h(x) Sh(x,h(x)) µ if y = ymax. Note that Sh,µ satisfies the constraint: y Y Sh,µ(x,y) = y Y Sh(x,y) = 1. Since Sh,µ(x,y) [Smin,Smax], we have the following constraints on µ: Smin Sh(x,ymax) µ Smax Sh(x,ymax) Sh(x,h(x)) Smax µ Sh(x,h(x)) Smin. (7) Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = Sh(x,h(x)) and τ2 = Sh(x,ymax) to simplify the notation. Let C = {µ R µ verify constraint (7)}. Since Sh(x,h(x)) Smax Smax Sh(x,ymax) and Smin Sh(x,ymax) Sh(x,h(x)) Smin, C is not an empty set and can be expressed as C = [max{Smin τ2,τ1 Smax},min{Smax τ2,τ1 Smin}]. 
Then, by the definition of Sh,µ, we have for any h H and x X, Cℓcomp(h,x) inf µ C Cℓcomp(hµ,x) = sup µ C {p1[Φ(τ2) Φ(τ1 µ)] + p2[Φ(τ1) Φ(τ2 + µ)]} = sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (P = p1 + p2 [ 1 n 1 p1 p2,1]) inf P [ 1 n 1 p1 p2,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (Smin τ2 τ1 Smax, τ1 + τ2 1) inf P [ 1 n 1 p1 p2,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + p1 p2 2 [Φ(τ2) Φ(τ1 µ)] + P p1 + p2 2 [Φ(τ1) Φ(τ2 + µ)]} (Smin smin smax Smax) = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}] C, smax = 1 1+(n 1)e 2 infx Λ(x) and smin = 1 1+(n 1)e2 infx Λ(x) . Note that for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression can be simplified as Cℓcomp(h,x) inf µ C Cℓcomp(hµ,x) inf τ [0, 1 2 ] sup µ [smin τ,1 τ smin] {1 + p1 p2 2 [Φ(τ) Φ(1 τ µ)] + 1 p1 + p2 2 [Φ(1 τ) Φ(τ + µ)]} = Tcomp(p1 p2) = Tcomp( Cℓ0 1,H(h,x)), (by Lemma 1) where we use the fact that smax +smin = 1 and P = 1 when n = 2. Since Tcomp is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcomp(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcomp(E X[ Cℓ0 1,H(h,x)]) E X[Tcomp( Cℓ0 1,H(h,x))] E X[ Cℓcomp,H(h,x)] = Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H). For the second part, we first consider n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τϵ [0, 1 2] and satisfies sup µ [smin τϵ,1 τϵ smin] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) = t and Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) = Rℓcomp(h) EX[C ℓcomp(H,x)] = Cℓcomp(h,x) C ℓcomp(H,x) = sup µ [smin τϵ,1 τϵ smin] {1 + t 2 [Φ(τϵ) Φ(1 τϵ µ)] + 1 t 2 [Φ(1 τϵ) Φ(τϵ + µ)]} < Tcomp(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0, 3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that Sh(x,1) = τ1,ϵ, Sh(x,2) = τ2,ϵ and Sh(x,y) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 1, and inf P [ 1 n 1 t,1]sup µ C {P + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + P t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} = sup µ C {1 + t 2 [Φ(τ2,ϵ) Φ(τ1,ϵ µ)] + 1 t 2 [Φ(τ1,ϵ) Φ(τ2,ϵ + µ)]} < Tcomp(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Tcomp(t) Rℓcomp(h) R ℓcomp(H) + Mℓcomp(H) < Tcomp(t) + ϵ. By letting ϵ 0, we conclude the proof. E.2 Logistic loss Theorem 6 (H-consistency bounds for logistic loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓlog(h) R ℓlog(H) + Mℓlog(H)), where ℓlog = log( eh(x,y) y Y eh(x,y ) ) and Ψ(t) = 2 log(1 + t) + 1 t 2 log(1 t) t smax smin smin+smax t 2 log( smax smin ) + log( 2 smaxsmin smax+smin ) otherwise. Proof. For the multinomial logistic loss ℓlog, plugging in Φ(t) = log(t) in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t 2 [ log(τ2) + log(τ1 µ)] + P t 2 [ log(τ1) + log(τ2 + µ)]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. 
By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ log(smin + smax 2 ) + log(smin + smax 2 [ log(smin + smax 2 ) + log(smin + smax = log(smin + smax 2 ) + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t 2 log( smin+smax 2 log( smin+smax 2 + µ). Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = t(smin + smax) Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = smin smax 2 . Since µ smin smax 2 is equivalent to t smax smin smin+smax , the maximum can be expressed as max µ C g(µ) = {g(µ ) t smax smin smin+smax g( smin smax 2 ) otherwise Computing the value of g at these points yields: g(µ ) = 1 + t 2 log (1 + t)(smin + smax) 2 log (1 t)(smin + smax) g(smin smax 2 ) = 1 + t 2 log(smax) + 1 t 2 log(smin) Then, if t smax smin smin+smax , we obtain T comp = log(smin + smax 2 ) + 1 + t 2 log (1 + t)(smin + smax) 2 log (1 t)(smin + smax) 2 log(1 + t) + 1 t 2 log(1 t). Otherwise, we obtain T comp = log(smin + smax 2 ) + 1 + t 2 log(smax) + 1 t 2 log(smin) smin ) + log(2 smaxsmin smax + smin ). Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓlog(h) R ℓlog(H) + Mℓlog(H)), 2 log(1 + t) + 1 t 2 log(1 t) t smax smin smin+smax t 2 log( smax smin ) + log( 2 smaxsmin smax+smin ) otherwise. E.3 Sum exponential loss Theorem 9 (H-consistency bounds for sum exponential loss). For any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓexp(h) R ℓexp(H) + Mℓexp(H)) where ℓexp = y y eh(x,y ) h(x,y) and Ψ(t) = 1 t2 t s2 max s2 min s2 min+s2max smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax+smin) otherwise. . Proof. For the sum exponential loss ℓexp, plugging in Φ(t) = 1 t 1 in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t τ2 1 τ1 µ] + P t τ1 1 τ2 + µ]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ 2 smin + smax 2 smin + smax 2µ] 2 [ 2 smin + smax 2 smin + smax + 2µ]} = 2 smin + smax + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t smin+smax 2µ 1 t smin+smax+2µ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = smin + smax 1 t Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . 
Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = 2 . Since µ smin smax 2 is equivalent to t s2 max s2 min s2 min+s2max , the maximum can be expressed as max µ C g(µ) = g(µ ) t s2 max s2 min s2 min+s2max g( smin smax 2 ) otherwise Computing the value of g at these points yields: 1 t2 2 smin + smax g(smin smax 2 ) = 1 + t Then, if t s2 max s2 min s2 min+s2max , we obtain T comp = 2 smin + smax + 1 1 t2 2 smin + smax = 1 Otherwise, we obtain T comp = 2 smin + smax 1 + t = smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax + smin). Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓexp(h) R ℓexp(H) + Mℓexp(H)), 1 t2 t s2 max s2 min s2 min+s2max smax smin 2smaxsmin t (smax smin)2 2smaxsmin(smax+smin) otherwise. E.4 Generalized cross-entropy loss Theorem 16 (H-consistency bounds for generalized cross-entropy loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓgce(h) R ℓgce(H) + Mℓgce(H)), where Ψ(t) = 1 q( smin+smax 2 ) q ( (1+t) 1 1 q +(1 t) 1 1 q 2 ) 1 q 1 t s1 q max s1 q min s1 q min+s1 q max t 2q(sq max sq min) + 1 q( sq min+sq max 2 ( smin+smax 2 ) q) otherwise. 1 q[1 ( eh(x,y) y Y eh(x,y ) ) Proof. For generalized cross-entropy loss ℓgce, plugging Φ(t) = 1 q(1 tq) in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t q (τ2)q + 1 q (τ1 µ)q] + P t q (τ1)q + 1 q (τ2 + µ)q]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2q [ (smin + smax q + (smin + smax 2q [ (smin + smax q + (smin + smax q (smin + smax q + sup µ C g(µ) where C = [ smin smax 2 , smax smin 2 ] and g(µ) = 1+t 2q ( smin+smax 2 µ) q + 1 t 2q ( smin+smax 2 + µ) q. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = (1 t) 1 1 q (1 + t) 1 1 q (1 + t) 1 1 q + (1 t) 1 1 q smin + smax Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since smax smin 0, we have µ 0 smax smin 2 In view of the constraint C, if µ smin smax 2 , the maximum is achieved by µ = µ . Otherwise, if µ < smin smax 2 , since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = 2 . 
Since µ smin smax 2 is equivalent to t s1 q max s1 q min s1 q min+s1 q max , the maximum can be expressed as max µ C g(µ) = g(µ ) t s1 q max s1 q min s1 q min+s1 q max g( smin smax 2 ) otherwise Computing the value of g at these points yields: q (smin + smax 1 1 q + (1 t) g(smin smax 2 ) = 1 + t 2q (smax)q + 1 t Then, if t s1 q max s1 q min s1 q min+s1 q max , we obtain q (smin + smax 1 1 q + (1 t) q (smin + smax q (smin + smax 1 1 q + (1 t) 1 Otherwise, we obtain q (smin + smax 2q (smax)q + 1 t 2q (sq max sq min) + 1 q (sq min + sq max 2 (smin + smax Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓgce(h) R ℓgce(H) + Mℓgce(H)), 1 q( smin+smax 2 ) q ( (1+t) 1 1 q +(1 t) 1 1 q 2 ) 1 q 1 t s1 q max s1 q min s1 q min+s1 q max t 2q(sq max sq min) + 1 q( sq min+sq max 2 ( smin+smax 2 ) q) otherwise. E.5 Mean absolute error loss Theorem 17 (H-consistency bounds for mean absolute error loss). For any h H and any distribution, we have Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) 2(Rℓmae(h) R ℓmae(H) + Mℓmae(H)) smax smin . Proof. For mean absolute error loss ℓmae, plugging Φ(t) = 1 t in Theorem 5, gives T comp inf P [ 1 n 1 t,1] inf Smin τ2 τ1 Smax τ1+τ2 1 sup µ C {P + t 2 [ (τ2) + (τ1 µ)] + P t 2 [ (τ1) + (τ2 + µ)]} where C = [max{smin τ2,τ1 smax},min{smax τ2,τ1 smin}]. Here, we only compute the expression for n > 2. The expression for n = 2 will lead to the same result since it can be viewed as a special case of the expression for n > 2. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ1 = τ2 = smin+smax 2 and P = 1 modulo some elementary analysis. Thus, T comp can be reformulated as T comp = sup µ C {1 + t 2 [ (smin + smax 2 ) + (smin + smax 2 [ (smin + smax 2 ) + (smin + smax = sup µ C tµ where C = [ smin smax 2 , smax smin 2 ]. Since tµ is monotonically non-increasing, the maximum over C can be achieved by µ = smin smax 2 , T comp = smax smin Since T comp is convex, by Theorem 5, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) 2(Rℓmae(h) R ℓmae(H) + Mℓmae(H)) smax smin . F Extensions of constrained losses F.1 Proof of H-consistency bound with T cstnd (Theorem 12) Theorem 12 (H-consistency bound for constrained losses). Assume that T cstnd is convex. Then, the following inequality holds for any hypothesis h H and any distribution: T cstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). (6) with T cstnd the H-estimation error transformation for constrained losses defined for all t [0,1] by T cstnd(t) = inf τ 0 sup µ [τ Λmin,τ+Λmin] { 1 t 2 [Φ(τ) Φ( τ + µ)] + 1+t 2 [Φ( τ) Φ(τ µ)]} n = 2 inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C { 2 P t 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P +t 2 [Φ( τ1) Φ( τ2 µ)]} n > 2, where C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin] and Λmin = infx X Λ(x). Furthermore, for any t [0,1], there exist a distribution D and a hypothesis h H such that Rℓ0 1(h) R ℓ0 1(H)+ Mℓ0 1(H) = t and Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Tcstnd(t). Proof. For the constrained loss ℓcstnd, the conditional ℓcstnd-risk can be expressed as follows: Cℓcstnd(h,x) = y Y p(x,y)ℓcstnd(h,x,y) = y Y p(x,y) y y Φ( h(x,y )) = y Y Φ( h(x,y)) y y p(x,y ) = y Y Φ( h(x,y))(1 p(x,y)) = Φ( h(x,ymax))(1 p(x,ymax)) + Φ( h(x,h(x)))(1 p(x,h(x))) + y {ymax,h(x)} Φ( h(x,y))(1 p(x,y)). For any h H and x X, by the definition of H, we can always find a family of hypotheses {hµ} H such that hµ(x, ) take the following values: h(x,y) if y / {ymax,h(x)} h(x,ymax) + µ if y = h(x) h(x,h(x)) µ if y = ymax. 
Note that the hypotheses hµ satisfies the constraint: y Y hµ(x,y) = y Y h(x,y) = 0, µ R. Since hµ(x,y) [ Λ(x),Λ(x)], we have the following constraints on µ: Λ(x) h(x,ymax) µ Λ(x) h(x,ymax) Λ(x) + h(x,h(x)) µ Λ(x) + h(x,h(x). Let p1 = p(x,ymax), p2 = p(x,h(x)), τ1 = h(x,h(x)) and τ2 = h(x,ymax) to simplify the notation. Then, the constraint on µ can be expressed as µ C, C = [max{τ1, τ2} Λ(x),min{τ1, τ2} + Λ(x)] Since max{τ1, τ2} min{τ1, τ2} = τ1 + τ2 τ1 + τ2 2Λ(x), C is not an empty set. By the definition of hµ, we have for any h H and x X, Cℓcstnd(h,x) inf µ C Cℓcstnd(hµ,x) = sup µ C {(1 p1)[Φ( τ2) Φ( τ1 + µ)] + (1 p2)[Φ( τ1) Φ( τ2 µ)]} = sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (P = p1 + p2 [ 1 n 1,1]) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (τ1 0, τ2 τ1) inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P p1 + p2 2 [Φ( τ2) Φ( τ1 + µ)] + 2 P + p1 p2 2 [Φ( τ1) Φ( τ2 µ)]} (C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin] C since Λmin Λ(x)) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}{2 P p1 + p2 2 Φ( τ2) + 2 P + p1 p2 inf µ C{2 P p1 + p2 2 Φ( τ1 + µ) + 2 P + p1 p2 2 Φ( τ2 µ)}} = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) Note that for n = 2, an additional constraint τ1 + τ2 = 1 is imposed and the expression can be simplified as Cℓcstnd(h,x) inf µ C Cℓcstnd(hµ,x) inf τ 0 sup µ [τ Λmin,τ+Λmin] {1 p1 + p2 2 [Φ(τ) Φ( τ + µ)] + 1 + p1 p2 2 [Φ( τ) Φ(τ µ)]} = Tcstnd(p1 p2) = Tcstnd( Cℓ0 1,H(h,x)). (by Lemma 1) Since Tcstnd is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, Tcstnd(Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H)) = Tcstnd(E X[ Cℓ0 1,H(h,x)]) E X[Tcstnd( Cℓ0 1,H(h,x))] E X[ Cℓcstnd,H(h,x)] = Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H). Let n = 2. For any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 . For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,2) = τϵ 0 and satisfies sup µ [τϵ Λmin,τϵ+Λmin] {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = Rℓ0 1(h) EX[C ℓ0 1(H,x)] = Cℓ0 1(h,x) C ℓ0 1(H,x) = t and Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) = Rℓcstnd(h) EX[C ℓcstnd(H,x)] = Cℓcstnd(h,x) C ℓcstnd(H,x) = sup µ [τϵ Λmin,τϵ+Λmin] {1 t 2 [Φ(τϵ) Φ( τϵ + µ)] + 1 + t 2 [Φ( τϵ) Φ(τϵ µ)]} < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. The proof for n > 2 directly extends from the case when n = 2. Indeed, for any t [0,1], we consider the distribution that concentrates on a singleton {x} and satisfies p(x,1) = 1+t 2 , p(x,2) = 1 t 2 , p(x,y) = 0,3 y n. For any ϵ > 0, by the definition of infimum, we can take h H such that h(x,1) = τ1,ϵ, h(x,2) = τ2,ϵ, h(x,3) = 0, 3 y n and satisfies τ1,ϵ + τ2,ϵ = 0, and inf P [ 1 n 1 ,1]sup µ C {2 P t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 2 P + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} = sup µ C {1 t 2 [Φ( τ2,ϵ) Φ( τ1,ϵ + µ)] + 1 + t 2 [Φ( τ1,ϵ) Φ( τ2,ϵ µ)]} < Tcstnd(t) + ϵ. Then, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) = t and Tcstnd(t) Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H) < Tcstnd(t) + ϵ. By letting ϵ 0, we conclude the proof. F.2 Constrained exponential loss Theorem 13 (H-consistency bounds for constrained exponential loss). Let Φ(t) = e t. For any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H)) where Ψ(t) = 1 t2 t e2Λmin 1 e2Λmin+1 t 2(eΛmin e Λmin) + 2 eΛmin e Λmin 2 otherwise. . Proof. 
For n = 2, plugging in Φ(t) = e t in Theorem 12, gives T cstnd(t) = inf τ 0 sup µ [τ Λmin,τ+Λmin] {1 t 2 [e τ eτ µ] + 1 + t 2 [eτ e τ+µ]}. By differentiating with respect to τ, we can see that the infimum is achieved when τ = 0 modulo some elementary analysis. Thus, T cstnd can be reformulated as T cstnd = sup µ [ Λmin,Λmin] {1 t 2 [1 e µ] + 1 + t = 1 + sup µ [ Λmin,Λmin] g(µ). where g(µ) = 1 t 2 eµ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = 1 1 + t Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since µ 0 and Λmin 0, we have In view of the constraint, if µ Λmin, the maximum is achieved by µ = µ . Otherwise, if µ < Λmin, since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = Λmin. Since µ Λmin is equivalent to t e2Λmin 1 e2Λmin+1, the maximum can be expressed as max µ [ Λmin,Λmin]g(µ) = {g(µ ) t e2Λmin 1 e2Λmin+1 g( Λmin) otherwise Computing the value of g at these points yields: g( Λmin) = 1 t 2 eΛmin 1 + t Then, if t e2Λmin 1 e2Λmin+1, we obtain T cstnd = 1 Otherwise, we obtain T cstnd = 1 1 t 2 eΛmin 1 + t 2(eΛmin e Λmin) + 2 eΛmin e Λmin For n > 2, plugging in Φ(t) = e t in Theorem 12, gives T cstnd(t) = inf P [ 1 n 1 ,1] inf τ1 max{τ2,0}sup µ C {2 P t 2 [eτ2 eτ1 µ] + 2 P + t 2 [eτ1 eτ2+µ]}. where C = [max{τ1, τ2} Λmin,min{τ1, τ2} + Λmin]. By differentiating with respect to τ2 and P, we can see that the infimum is achieved when τ2 = τ1 = 0 and P = 1 modulo some elementary analysis. Thus, T cstnd can be reformulated as T cstnd = sup µ C {1 t 2 [1 e µ] + 1 + t = 1 + sup µ C g(µ). where C = [ Λmin,Λmin] and g(µ) = 1 t 2 eµ. Since g is continuous, it attains its supremum over a compact set. Note that g is concave and differentiable. In view of that, the maximum over the open set ( ,+ ) can be obtained by setting its gradient to zero. Differentiate g(µ) to optimize, we obtain g(µ ) = 0, µ = 1 Moreover, by the concavity, g(µ) is non-increasing when µ µ . Since µ 0 and Λmin 0, we have In view of the constraint, if µ Λmin, the maximum is achieved by µ = µ . Otherwise, if µ < Λmin, since g(µ) is non-increasing when µ µ , the maximum is achieved by µ = Λmin. Since µ Λmin is equivalent to t e2Λmin 1 e2Λmin+1, the maximum can be expressed as max µ [ Λmin,Λmin]g(µ) = {g(µ ) t e2Λmin 1 e2Λmin+1 g( Λmin) otherwise Computing the value of g at these points yields: g( Λmin) = 1 t 2 eΛmin 1 + t Then, if t e2Λmin 1 e2Λmin+1, we obtain T cstnd = 1 Otherwise, we obtain T cstnd = 1 1 t 2 eΛmin 1 + t 2(eΛmin e Λmin) + 2 eΛmin e Λmin Since T cstnd is convex, by Theorem 12, for any h H and any distribution, Rℓ0 1(h) R ℓ0 1(H) + Mℓ0 1(H) Ψ 1(Rℓcstnd(h) R ℓcstnd(H) + Mℓcstnd(H)) 1 t2 t e2Λmin 1 e2Λmin+1 t 2(eΛmin e Λmin) + 2 eΛmin e Λmin 2 otherwise.
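As a complement to the computation above, the following is a minimal numerical sanity check, not taken from the paper, of the closed-form expression of Ψ in Theorem 13 for the constrained exponential loss with n = 2. It evaluates the inf-sup defining the transformation of Theorem 12 with Φ(u) = e^{-u} on a grid and compares it with the piecewise closed form. The function names T_cstnd_numeric and Psi_closed_form, the grid sizes, the range of τ, and the choice Λ_min = 0.5 are illustrative assumptions introduced only for this sketch.

# Numerical sanity check of Theorem 13 (constrained exponential loss, n = 2).
# Grid parameters and Lambda_min are illustrative, not values from the paper.
import numpy as np

def T_cstnd_numeric(t, lam, n_tau=200, n_mu=2000, tau_max=3.0):
    """Grid evaluation of inf_{tau >= 0} sup_{mu in [tau - lam, tau + lam]} of the
    n = 2 expression in Theorem 12 with Phi(u) = exp(-u)."""
    best = np.inf
    for tau in np.linspace(0.0, tau_max, n_tau):
        mu = np.linspace(tau - lam, tau + lam, n_mu)
        val = ((1 - t) / 2 * (np.exp(-tau) - np.exp(tau - mu))
               + (1 + t) / 2 * (np.exp(tau) - np.exp(-tau + mu)))
        best = min(best, val.max())  # sup over mu, then inf over tau
    return best

def Psi_closed_form(t, lam):
    """Closed form claimed in Theorem 13: 1 - sqrt(1 - t^2) while the optimal mu
    stays inside [-lam, lam], and linear in t once it hits the boundary -lam."""
    threshold = (np.exp(2 * lam) - 1) / (np.exp(2 * lam) + 1)
    if t <= threshold:
        return 1 - np.sqrt(1 - t ** 2)
    return t / 2 * (np.exp(lam) - np.exp(-lam)) + (2 - np.exp(lam) - np.exp(-lam)) / 2

if __name__ == "__main__":
    lam = 0.5  # illustrative value of Lambda_min
    for t in np.linspace(0.05, 0.95, 10):
        print(f"t = {t:.2f}  grid value: {T_cstnd_numeric(t, lam):.4f}  "
              f"closed form: {Psi_closed_form(t, lam):.4f}")

Up to grid resolution, the two columns agree on both regimes of t, which also confirms empirically that the infimum over τ is attained at τ = 0, as argued in the proof.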