# Surrogate Regret Bounds for Polyhedral Losses

Rafael Frongillo, University of Colorado Boulder, raf@colorado.edu
Bo Waggoner, University of Colorado Boulder, bwag@colorado.edu

Abstract. Surrogate risk minimization is a ubiquitous paradigm in supervised machine learning, wherein a target problem is solved by minimizing a surrogate loss on a dataset. Surrogate regret bounds, also called excess risk bounds, are a common tool to prove generalization rates for surrogate risk minimization. While surrogate regret bounds have been developed for certain classes of loss functions, such as proper losses, general results are relatively sparse. We provide two general results. The first gives a linear surrogate regret bound for any polyhedral (piecewise-linear and convex) surrogate, meaning that surrogate generalization rates translate directly to target rates. The second shows that for sufficiently non-polyhedral surrogates, the regret bound is a square root, meaning fast surrogate generalization rates translate to slow rates for the target. Together, these results suggest polyhedral surrogates are optimal in many cases.

1 Introduction

In supervised learning, our goal is to learn a hypothesis that solves some target problem given labeled data. The problem is often specified by a target loss ℓ that is discrete in the sense that the prediction lies in a finite set, as with classification, ranking, and structured prediction. As minimizing the target ℓ over a data set is typically NP-hard, we instead turn to the general paradigm of surrogate risk minimization. Here a surrogate loss function L is chosen and minimized instead of the target ℓ. A link function ψ then maps the surrogate predictions to target predictions.

Naturally, a central question in surrogate risk minimization is how to choose the surrogate L. A minimal requirement is statistical consistency, which roughly says that approaching the best possible surrogate loss will eventually lead to the best possible target loss. A more precise goal is to understand the rate at which this target minimization occurs, called the generalization rate: how quickly the target loss approaches the best possible, in terms of the number of data points. In other words, if we define the regret of a hypothesis to be its performance relative to the Bayes optimal hypothesis, we wish to know how quickly the ℓ-regret approaches zero.

Surrogate regret bounds are a common tool to prove generalization rates for surrogate risk minimization, by bounding the target regret as a function of the surrogate regret. This bound allows generalization rates for surrogate losses (of which there are many, often leveraging properties of L like strong convexity and smoothness) to imply rates for the target problem. Roughly speaking, a surrogate regret bound takes the form

    ∀ surrogate hypotheses h,   Regret_ℓ(ψ ∘ h) ≤ ζ( Regret_L(h) ) ,   (1)

where the regret transfer function ζ : ℝ_+ → ℝ_+ controls how surrogate regret translates to target regret. For example, if we have a sequence of surrogate hypotheses h_n achieving a fast rate Regret_L(h_n) = O(1/n), and we have a regret transfer of ζ(ϵ) = √ϵ, then we can only guarantee a slower target rate of Regret_ℓ(ψ ∘ h_n) = O(1/√n). The ideal regret transfer is linear, i.e. ζ(ϵ) = C·ϵ for some C > 0, which would translate the fast-rate guarantee directly to the target.
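To make this rate conversion concrete, here is a minimal numerical sketch (not from the paper; the constant C = 2 and the particular rates are illustrative assumptions) comparing a linear transfer applied to a typical polyhedral-surrogate rate of O(1/√n) against a square-root transfer applied to a fast smooth-surrogate rate of O(1/n).

```python
# A minimal numerical sketch (not from the paper) of how Equation (1) converts
# a surrogate regret rate into a target regret guarantee. The constant C = 2
# and the specific rates below are illustrative assumptions.
import math

def target_bound(surrogate_regret, zeta):
    """Upper bound on target regret implied by Equation (1)."""
    return zeta(surrogate_regret)

linear = lambda eps: 2.0 * eps            # zeta(eps) = C*eps (polyhedral case)
square_root = lambda eps: math.sqrt(eps)  # zeta(eps) = sqrt(eps) (smooth case)

for n in (10**2, 10**4, 10**6):
    smooth_rate = 1.0 / n                  # O(1/n) surrogate rate, typical for smooth losses
    polyhedral_rate = 1.0 / math.sqrt(n)   # O(1/sqrt(n)) surrogate rate, typical for polyhedral losses
    print(f"n={n:>7}:  smooth + sqrt transfer -> {target_bound(smooth_rate, square_root):.1e}   "
          f"polyhedral + linear transfer -> {target_bound(polyhedral_rate, linear):.1e}")
```

Under these assumptions both choices end with a target guarantee on the order of 1/√n, which is exactly the tradeoff examined in the rest of the paper.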
Much is known about surrogate regret bounds for particular problems, especially for binary classification and related problems like bipartite ranking; see Section 1.1 for a discussion of the literature. General results which span multiple target problems, however, are sparse. Furthermore, we still lack guidance as to which surrogate will lead to the best overall target generalization rate, as this rate depends on both the surrogate generalization rate and the regret transfer function. For example, consider the choice between a polyhedral (piecewise-linear and convex) surrogate, like hinge loss, or a smooth surrogate like exponential or squared loss. The surrogate rate for polyhedral losses will likely be a slower O(1/√n), whereas a faster O(1/n) rate is typical for smooth surrogates; but what is the tradeoff when it comes to the regret transfer function ζ, which ultimately decides the target generalization rate? To answer these types of questions, we need more general results.

Our first result is a general surrogate regret bound for polyhedral surrogates. Perhaps surprisingly, we show that every consistent surrogate in this class achieves a linear regret transfer, meaning their surrogate generalization rates translate directly to target rates, up to a constant factor.

Theorem 1 (Linear bound for polyhedral). Let L be any polyhedral surrogate and ψ any link function such that (L, ψ) is consistent for the target loss ℓ. Then L satisfies Equation (1) with a linear ζ.

The recent embedding framework of Finocchiaro et al. [10] shows how to construct consistent polyhedral surrogates for any discrete target loss; Theorem 1 additionally endows these surrogates with linear regret bounds. The same framework shows that every polyhedral loss is a consistent surrogate for some discrete target. Theorem 1 therefore immediately applies to the Weston-Watkins surrogate [31], the Lovász hinge [10, 33], and other polyhedral surrogates in the literature currently lacking regret bounds.¹

We also give a partial converse: if the surrogate is sufficiently "non-polyhedral", i.e. smooth and locally strongly convex, then ζ cannot shrink faster than a square root. For these losses, therefore, even fast surrogate generalization rates would likely translate to slow rates for the target.

Theorem 2 (Lower bound for non-polyhedral). Let L be a locally strongly convex surrogate with locally Lipschitz gradient. If L satisfies Equation (1) for a target loss ℓ, then there exists c > 0 such that, for all small enough ϵ ≥ 0, we have ζ(ϵ) ≥ c√ϵ.

While it may seem intuitive that ζ is a square root for strongly convex losses, it is well-known that losses like exponential loss for binary classification also have a square-root ζ, even though they are neither strongly convex nor strongly smooth [4]. Theorem 2 offers an explanation for this phenomenon, and together with Theorem 1 clarifies an intriguing trend in the literature: ζ is typically either linear or square-root.

Taken together, our results shed light on the dichotomy expressed above: slower polyhedral surrogate rates of O(1/√n) will translate to the target unchanged, while most smooth surrogates, even if achieving a fast surrogate generalization rate of O(1/n), will likely yield the same target rate of O(1/√n). Our results therefore suggest an inherent tradeoff between fast surrogate rates and optimal regret transfer, a phenomenon also observed by Mahdavi et al. [17] for binary classification.
In particular, our results suggest that polyhedral losses may achieve an overall generalization rate to the target problem with the optimal dependence on the number of data points.

The proof of our upper bound, Theorem 1, is nonconstructive. After proving the main results (Sections 3 and 4), we make it constructive in Section 5. Specifically, the regret transfer function for a polyhedral loss can be bounded by ζ(t) ≤ α·t, where the constant is of the form α = α_ℓ · α_L · α_ψ. Each component of α depends only on its respective object ℓ, L, and ψ. A corollary of the derivation is that a polyhedral loss L and link ψ are consistent for ℓ if and only if ψ is ϵ-separated, a condition introduced in Finocchiaro et al. [10].

¹ Many of these polyhedral surrogates are actually known to be inconsistent for their intended target problems. Nonetheless, the embedding framework of Finocchiaro et al. [10] shows that any polyhedral surrogate is consistent for some target problem, namely the one that it embeds. It is to this embedded target loss function, rather than the intended target, that our linear regret bounds apply.

1.1 Literature on surrogate regret bounds

Surrogate regret bounds were perhaps developed first by Zhang [37], as a method to prove statistical consistency for various surrogate losses for binary classification. These results were then generalized by Bartlett et al. [4] for margin losses, and Reid and Williamson [26] for composite proper losses. These latter works and others give a fairly complete picture for binary classification: a characterization of calibrated surrogates, and precise closed forms for the optimal ζ for each. By extension, close variations on binary classification are also well-understood, such as cost-sensitive binary classification [28], binary classification with an abstain/reject option [3, 8, 34], bipartite ranking [1, 13, 18], and ordinal regression via reduction to binary classification [22]. For multiclass classification, we have nice consistency characterizations with respect to 0-1 loss [30], with some surrogate regret bounds [9, 23]. Beyond binary classification and 0-1 loss, our knowledge drops off precipitously, with few general results.

There are some works which provide partial understanding of specific target problems. For example, in the problem of multiclass classification with an abstain option, Ramaswamy et al. [25] give three consistent polyhedral surrogates (two of which are commonly used as inconsistent surrogates for 0-1 loss), along with linear surrogate regret bounds. (See also Ni et al. [19, Theorems 7 & 8] and Charoenphakdee et al. [6, Theorem 8] for other surrogate regret bounds for the same problem.) Ramaswamy et al. [24] show how to extend these surrogates to nested, hierarchical versions, again providing linear regret bounds. Motivated by these polyhedral surrogates, together with examples arising in top-k prediction [16, 32] and other structured prediction problems [33], Finocchiaro et al. [10] introduce an embedding framework for polyhedral surrogates. This framework shows that all polyhedral surrogates are consistent for some discrete target problem, and vice versa, but does not provide surrogate regret bounds. Our results fill this gap, giving linear regret bounds for all polyhedral losses, and thus all discrete target problems.

Perhaps closest in spirit to our work are several recent works on general structured prediction problems. Ciliberto et al. [7, Theorem 2] and Osokin et al.
[21, Theorem 7] give surrogate regret bounds for a quadratic surrogate, for a general class of structured prediction problems. Blondel [5, Proposition 5] gives surrogate regret bounds for a tighter class of quadratic surrogates based on projection oracles. Nowak-Vila et al. [20, Theorem 4.4] give even more general bounds, using a broader class of surrogates based on estimating a vector-valued expectation. All of these works apply to a broad range of discrete target problems. The main difference to our work is that their upper bounds are all for strongly convex surrogates,² whereas ours are for polyhedral surrogates. Osokin et al. [21, Theorem 8] and Nowak-Vila et al. [20, Theorem 4.5] also provide lower bounds, in the same spirit as our Theorem 2 but only for the classes of surrogates they study. In particular, our results recover their bounds up to a constant factor.

² These works express surrogate regret bounds with ζ⁻¹ appearing on the left-hand side of eq. (1), and thus the terms upper bound and lower bound are reversed from their terminology.

Finally, in the binary classification setting, some works have used surrogate regret bounds to study the relationship between generalization rates for the surrogate and target, as discussed above. Zhang et al. [36] give detailed results about how surrogate regret bounds actually convert the surrogate rate to the target rate, by carefully analyzing the behavior of the regret transfer function near 0. They argue that hinge loss achieves strictly better rates than smooth surrogates like logistic and exponential loss; notably, though, this claim implicitly assumes that the generalization rate is the same for all surrogates under consideration. By contrast, Mahdavi et al. [17], as we do, consider that the surrogate generalization rate may depend on the choice of surrogate. They study a particular family of smoothed hinge losses, and show a direct tradeoff: smoother losses yield faster surrogate generalization rates, but slower regret transfers. Without placing further assumptions on the data, therefore, Mahdavi et al. [17] observe the same general phenomenon as we do: while smoothness may help surrogate generalization rates, it can hurt the regret transfer just as much. In particular, without further assumptions, polyhedral surrogates appear to achieve optimal overall generalization rates.

2 Preliminaries

Target problems. We consider supervised machine learning problems where data points have features in X and labels in Y, a finite set. The target problem is to learn a hypothesis mapping X to a report (or prediction) space R, possibly different from Y. For example, binary classification has Y = R = {−1, +1}, whereas for ranking problems R may be the set of permutations of Y. In this paper, we assume R is a finite set. The target problem is specified by the target loss ℓ : R → ℝ_+^Y, which maps a report (prediction) r ∈ R to the vector ℓ(r) of nonnegative loss values for each label, i.e. ℓ(r) = (ℓ(r)_y)_{y∈Y}. Given a hypothesis g : X → R and a data point (x, y) ∈ X × Y, the loss is ℓ(g(x))_y. We make a minimal non-redundancy assumption on ℓ, formalized in Section 2.1. We may refer to ℓ as a discrete target loss to emphasize our assumption that R and Y are finite.

Surrogate problems. We recall the surrogate risk minimization framework. First, one constructs a surrogate loss function L : ℝ^d → ℝ_+^Y. Then, one minimizes it to learn a hypothesis of the form h : X → ℝ^d.
Finally, one applies a link function ψ : ℝ^d → R, which maps h to the implied target hypothesis ψ ∘ h.³ In other words, a surrogate prediction h(x) ∈ ℝ^d implies a prediction ψ(h(x)) ∈ R for the target problem. For example, in binary classification, the target problem is given by 0-1 loss, ℓ_zo(r)_y = 1{r ≠ y}, with surrogate losses typically defined over ℝ. Common surrogate losses for 0-1 loss include hinge loss L_hinge(u)_y = max(1 − uy, 0), logistic loss L_log(u)_y = ln(1 + exp(−yu)), and exponential loss L_exp(u)_y = exp(−yu). These surrogates are typically associated with the link ψ : u ↦ sign(u), which maps negative predictions to −1 and positive predictions to +1, breaking the tie at 0 arbitrarily.

Surrogate regret bounds. The suitability and efficiency of a chosen surrogate is often quantified with a surrogate regret bound. Formally, the surrogate regret of h with respect to data distribution D is given by

    R_L(h; D) = E_{(X,Y)∼D} L(h(X))_Y − inf_{h′:X→ℝ^d} E_{(X,Y)∼D} L(h′(X))_Y ,

where the infimum is taken over all measurable functions. (One may equivalently assume the Bayes optimal hypothesis is in the function class.) Given h, the target regret of the implied hypothesis ψ ∘ h is

    R_ℓ(ψ ∘ h; D) = E_{(X,Y)∼D} ℓ(ψ(h(X)))_Y − inf_{h′:X→R} E_{(X,Y)∼D} ℓ(h′(X))_Y ,

where we recall that ψ is the link function. A surrogate regret bound answers the following question: when we minimize the surrogate regret at a rate f(n), given n data points, how fast does the target regret diminish? Formally, we say that ζ : ℝ_+ → ℝ_+ is a regret transfer function if ζ is continuous at 0 and satisfies ζ(0) = 0. Given a regret transfer function ζ, we say that (L, ψ) guarantees a regret transfer of ζ for ℓ if

    ∀D, ∀h : X → ℝ^d,   R_ℓ(ψ ∘ h; D) ≤ ζ( R_L(h; D) ) .   (2)

We note that a classic minimal requirement for (L, ψ) is that they be consistent for ℓ, which is the case if there exists any regret transfer ζ guaranteed by (L, ψ) for ℓ [29]. In other words, consistency implies that R_L(h; D) → 0 implies R_ℓ(ψ ∘ h; D) → 0, but says nothing about the relative rates.

Polyhedral surrogates. A function f : ℝ^d → ℝ is called polyhedral if it can be written as a pointwise maximum of a finite set of affine functions [27, Section 19]. The epigraph of a polyhedral function is a polyhedral set, i.e. the intersection of a finite set of closed halfspaces; thus polyhedral functions are always convex. We say a surrogate loss L : ℝ^d → ℝ_+^Y is polyhedral if, for each fixed y ∈ Y, the loss as a function of the prediction, i.e. u ↦ L(u)_y, is polyhedral.

2.1 Property elicitation and the conditional approach

Our technical approach is based on property elicitation, which studies the structure of ℓ and L as they relate to Δ_Y, the probability simplex over Y. This involves a conditional perspective. Given a data distribution D on X × Y, for (X, Y) ∼ D, we consider the format of possible target predictions g(x) = r ∈ R; surrogate predictions h(x) = u ∈ ℝ^d; and conditional distributions Pr[Y | X = x] = p ∈ Δ_Y. By dropping the role of the hypotheses and the feature space X, we can focus on the relationships of the functions ℓ(r)_y and L(u)_y and the conditional distribution p. In particular, in our notation the expected L-loss of prediction u on distribution p is simply the inner product ⟨p, L(u)⟩. The Bayes risk of L is L̲(p) = inf_{u∈ℝ^d} ⟨p, L(u)⟩, and similarly the Bayes risk of ℓ is ℓ̲(p) = inf_{r∈R} ⟨p, ℓ(r)⟩. We abuse notation to define the conditional surrogate regret R_L(u, p) = ⟨p, L(u)⟩ − L̲(p), and similarly the conditional target regret R_ℓ(r, p) = ⟨p, ℓ(r)⟩ − ℓ̲(p).
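To ground these conditional quantities, here is a small sketch (not the paper's code) that computes ⟨p, L(u)⟩, the Bayes risks, and the conditional regrets for binary hinge loss with the sign link; the grid approximation of the surrogate Bayes risk, the fixed tie-break at 0, and the specific (u, p) pairs are assumptions made for illustration.

```python
# A small sketch (not the paper's code) of the conditional quantities above for
# binary classification: Y = {-1, +1}, hinge surrogate, sign link, 0-1 target
# loss. We identify p with Pr[Y = +1]; the grid approximation of the surrogate
# Bayes risk is an assumption made for illustration.

def hinge(u, y):        # L_hinge(u)_y = max(1 - u*y, 0)
    return max(1.0 - u * y, 0.0)

def zero_one(r, y):     # l_zo(r)_y = 1[r != y]
    return float(r != y)

def sign_link(u):       # psi(u) = sign(u), with the tie at 0 broken toward +1
    return 1 if u >= 0 else -1

def expected(loss, report, p):   # <p, loss(report)> with p = (1 - p, p) over labels (-1, +1)
    return (1.0 - p) * loss(report, -1) + p * loss(report, +1)

GRID = [i / 1000 for i in range(-3000, 3001)]   # surrogate reports u in [-3, 3]

def conditional_regrets(u, p):
    bayes_L = min(expected(hinge, v, p) for v in GRID)         # Bayes risk of L at p, approximated
    bayes_l = min(expected(zero_one, r, p) for r in (-1, +1))  # Bayes risk of l at p
    R_L = expected(hinge, u, p) - bayes_L                      # conditional surrogate regret
    R_l = expected(zero_one, sign_link(u), p) - bayes_l        # conditional target regret
    return R_L, R_l

for u, p in [(0.5, 0.9), (-0.2, 0.7), (1.0, 0.3)]:
    R_L, R_l = conditional_regrets(u, p)
    print(f"u={u:+.1f}, p={p:.1f}:  R_L(u, p) = {R_L:.3f},  R_l(psi(u), p) = {R_l:.3f}")
```

In these examples the conditional target regret never exceeds the conditional surrogate regret, previewing the linear transfer established for polyhedral losses in Section 3.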
Given a regret transfer function ζ, we say (L, ψ) guarantees a conditional regret transfer of ζ for ℓ if, for all p ∈ Δ_Y and all u ∈ ℝ^d, R_ℓ(ψ(u), p) ≤ ζ(R_L(u, p)). The following is a direct result of Jensen's inequality, using that R_L(h; D) = E_X R_L(h(X), p_X), where p_X is the conditional distribution on Y given X.

Observation 1. If (L, ψ) guarantee a conditional regret transfer of ζ for ℓ, and ζ is concave, then (L, ψ) guarantee a regret transfer of ζ for ℓ.

³ The term link function can carry a more specific meaning in the statistics literature, for example when discussing generalized linear models. Our usage here is more general: any map from surrogate reports to target reports is a valid link function, though perhaps not a calibrated one.

Property elicitation. To begin, we define a property, which is essentially a statistic or summary of a distribution, such as the mode. We then define what it means for a loss function to elicit a property. We write Γ : Δ_Y ⇉ R as shorthand for Γ : Δ_Y → 2^R \ {∅}.

Definition 1 (Property, level set). A property is a function Γ : Δ_Y ⇉ R, for some set R. The level set of Γ for report r is the set Γ_r = {p ∈ Δ_Y : r ∈ Γ(p)}.

The following definition applies both to target losses and surrogate losses.

Definition 2 (Elicits, report space). A loss function f : R → ℝ_+^Y, where the domain R is called the report space, elicits a property Γ : Δ_Y ⇉ R if

    ∀p ∈ Δ_Y,   Γ(p) = argmin_{r∈R} ⟨p, f(r)⟩ =: prop[f](p) .   (3)

As each loss f elicits a unique property, we refer to it as prop[f]. In particular, we will consistently use the notation Γ = prop[L] where L is a surrogate loss with report space ℝ^d, and we will use γ = prop[ℓ] where ℓ is a target loss with finite report space R. For example, prop[ℓ_zo] is simply the mode of p: identifying p with the probability of label +1, mode(p) = {−1, +1} if p = 1/2, and mode(p) = {sign(2p − 1)} otherwise.

We will assume throughout that the target loss ℓ : R → ℝ_+^Y is non-redundant, meaning that all target reports r ∈ R are possibly uniquely optimal. Formally, letting γ = prop[ℓ], we suppose that for all r ∈ R, there exists p ∈ Δ_Y with γ(p) = {r}.⁴

Calibration and consistency. It is known that consistency implies the weaker condition of calibration, which implies the still-weaker, but very useful, condition of indirect elicitation. To define these conditions, consider L, ψ, ℓ, and γ = prop[ℓ]. Recall that γ(p) is the set of target reports that are optimal for p. The typical definition of indirect elicitation, which we reformulate below, is as follows: a surrogate-link pair (L, ψ) indirectly elicits γ if u ∈ prop[L](p) implies ψ(u) ∈ γ(p), for all p ∈ Δ_Y. Note that if ψ(u) ∉ γ(p), then u is a bad surrogate prediction for p, in the sense that it links to a suboptimal target report. Let B_{L,ψ,ℓ}(p) = {R_L(u, p) : ψ(u) ∉ γ(p)}, the set of surrogate regrets resulting from bad predictions. Thus, we may equivalently state indirect elicitation as the requirement that every bad surrogate report have positive surrogate regret, which clarifies the relationship to calibration: calibration further requires a nonzero gap between the expected loss of an optimal report and that of any bad one.

Definition 3 (Indirectly Elicits, Calibrated). Let ℓ be a target loss and γ = prop[ℓ]. A surrogate-link pair (L, ψ) indirectly elicits γ if 0 ∉ B_{L,ψ,ℓ}(p) for all p ∈ Δ_Y. The pair (L, ψ) is calibrated for ℓ if 0 < inf B_{L,ψ,ℓ}(p) for all p ∈ Δ_Y.

Fact 1 ([4, 29]). Let ℓ : R → ℝ_+^Y be a target loss with finite R and Y. If a surrogate and link (L, ψ) are consistent for ℓ, then (L, ψ) are calibrated for ℓ. If (L, ψ) are calibrated for ℓ, then they indirectly elicit ℓ.
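The hinge-loss example discussed next can also be checked numerically; the following sketch (not the paper's code; the grid over surrogate reports, the tie-break ψ(0) = +1, and the identification of p with Pr[Y = +1] are assumptions made for illustration) approximates inf B_{L,ψ,ℓ}(p) for hinge loss with the sign link and 0-1 loss.

```python
# A numeric check (a sketch, not the paper's code) of Definition 3 for the
# example discussed next: hinge loss, the sign link (with psi(0) = +1 assumed
# as the tie-break), and 0-1 loss. We identify p with Pr[Y = +1] and
# approximate inf B_{L,psi,l}(p) by a grid search over surrogate reports u.

def expected_hinge(u, p):
    return p * max(1.0 - u, 0.0) + (1.0 - p) * max(1.0 + u, 0.0)

GRID = [i / 1000 for i in range(-3000, 3001)]   # surrogate reports u in [-3, 3]

def mode(p):                      # gamma(p) = prop[l_zo](p), the mode
    if p == 0.5:
        return {-1, +1}
    return {+1} if p > 0.5 else {-1}

def inf_bad_regret(p):            # approximates inf B_{L,psi,l}(p)
    link = lambda u: 1 if u >= 0 else -1
    bayes = min(expected_hinge(v, p) for v in GRID)
    bad = [expected_hinge(u, p) - bayes for u in GRID if link(u) not in mode(p)]
    return min(bad) if bad else float("inf")   # empty B(p): infimum is +infinity

for p in (0.5, 0.6, 0.8, 1.0):
    print(f"p = {p:.1f}:  inf B(p) ≈ {inf_bad_regret(p):.3f}")
```

Every value printed is bounded away from 0 (with inf B(1/2) = +∞ because B(1/2) is empty), consistent with hinge loss being calibrated for 0-1 loss.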
For example, it is well-known that hinge loss L_hinge is calibrated and consistent for 0-1 loss ℓ_zo using the link ψ : u ↦ sign(u). Recall that ℓ_zo elicits the mode. One can verify that indeed (L_hinge, ψ) indirectly elicit the mode; for example, B_{L_hinge,ψ,ℓ_zo}(0) = B_{L_hinge,ψ,ℓ_zo}(1) = [1, ∞) and B_{L_hinge,ψ,ℓ_zo}(1/2) = ∅, neither of which contains 0.

3 Upper Bound: Polyhedral Surrogates

In this section, we describe the proof of Theorem 1, that polyhedral surrogates guarantee linear regret transfers. At a high level, the proof (whose details appear in Appendix A) has two parts:

1. Fix p ∈ Δ_Y. Prove a regret bound for this fixed p, namely a constant α_p such that R_ℓ(ψ(u), p) ≤ α_p R_L(u, p) for all u ∈ ℝ^d.
2. Apply this regret bound to each p in a carefully chosen finite subset of Δ_Y. Argue that the maximum α_p from this finite set suffices for the overall regret bound.

The first part will actually hold for any calibrated surrogate. The second part relies on a crucial fact about polyhedral losses L: their properties Γ = prop[L] cover the whole simplex with only a finite number of polyhedral level sets [10]. Thus, although the prediction space ℝ^d is infinite, one can restrict to just a finite set of prediction points and always have a global minimizer of expected surrogate loss.

⁴ Technically, this definition differs from the one given in the literature [10, 11], but the two are equivalent for finite elicitable properties.

3.1 Linear transfer for fixed p

First, we establish a linear transfer function for any fixed conditional distribution p. This statement holds for any calibrated surrogate, not necessarily polyhedral.

Lemma 1. Let ℓ : R → ℝ_+^Y be a discrete target loss and suppose the surrogate L : ℝ^d → ℝ_+^Y and link ψ : ℝ^d → R are calibrated for ℓ. Then for any p ∈ Δ_Y, there exists α_p ≥ 0 such that, for all u ∈ ℝ^d, R_ℓ(ψ(u), p) ≤ α_p R_L(u, p).

While useful, this result is somewhat nonconstructive and potentially loose. In Section 5, we unpack the constant α_p for polyhedral losses in particular.

3.2 Linear overall transfer function

Given Lemma 1, we might hope to obtain that, for any calibrated surrogate, there exists α = sup_p α_p such that R_ℓ(ψ(u), p′) ≤ α R_L(u, p′) for all p′. However, we know this is false in general: not all surrogates yield linear regret transfers! Indeed, for many surrogates, the supremum will be +∞. However, polyhedral surrogates have a special structure that allows us to show α is finite. We first present some tools that hold for general surrogates, then the key implication from L being polyhedral.

Lemma 2 ([12]). If (L, ψ) indirectly elicits γ, then Γ = prop[L] refines γ in the sense that, for all u ∈ ℝ^d, there exists r ∈ R such that Γ_u ⊆ γ_r.

The next lemma states that surrogate regret is linear in p on any fixed level set Γ_u of Γ = prop[L]. By combining this fact with Lemma 2, we obtain that target regret is linear on these level sets as well.

Lemma 3. Suppose (L, ψ) indirectly elicits γ = prop[ℓ] and let Γ = prop[L]. Then for any fixed u, u′ ∈ ℝ^d and r ∈ R, the functions R_L(u′, ·) and R_ℓ(r, ·) are linear on Γ_u in their second arguments.

A key fact we use about polyhedral losses is that they have a finite set of optimal sets.

Lemma 4. If L : ℝ^d → ℝ_+^Y is polyhedral, then Γ = prop[L] has a finite set of level sets that union to Δ_Y. Moreover, these level sets are polytopes.

We are now ready to restate the main upper bound and sketch a proof.

Theorem 3 (Theorem 1 restated). Suppose the surrogate loss L : ℝ^d → ℝ_+^Y and link ψ : ℝ^d → R are consistent for the target loss ℓ : R → ℝ_+^Y.
If L is polyhedral, then (L, ψ) guarantee a linear regret transfer for ℓ, i.e. there exists α ≥ 0 such that, for all D and all measurable h : X → ℝ^d, R_ℓ(ψ ∘ h; D) ≤ α R_L(h; D).

The key point of the proof is that, by Lemma 4, the polyhedral loss L has a finite set U ⊆ ℝ^d of predictions such that (a) for each u ∈ U, the level set Γ_u is a polytope, and (b) ∪_{u∈U} Γ_u = Δ_Y. Let Q be the finite set of all vertices of all these level sets. For any vertex q ∈ Q, we have a linear regret transfer for this fixed q with constant α_q from Lemma 1. Lemma 3 allows us to write the regret of a general p as a convex combination of the regrets of its containing level set's vertices. So we obtain a bound of α = max_{q∈Q} α_q.

4 Lower bound

In this section, we show that if a surrogate loss is sufficiently "non-polyhedral", then the best regret transfer it can achieve is ζ(ϵ) on the order of √ϵ. (Proofs are deferred to Appendix B.) Specifically, we consider a relaxation of L being both strongly smooth and strongly convex, as we now formalize. Let Γ = prop[L] and γ = prop[ℓ]. We will assume L is strongly smooth everywhere and strongly convex around some boundary report for L, ℓ, which we define to be a u₀ ∈ ℝ^d such that, for some r, r′ ∈ R with r ≠ r′, there exists a distribution p₀ such that p₀ ∈ Γ_{u₀} ∩ γ_r ∩ γ_{r′}. For the conditional distribution p₀, the surrogate report u₀ minimizes expected L-loss, and it may link to either r or r′, as either of these minimizes expected ℓ-loss. Recall that an open neighborhood of u₀ is an open set in ℝ^d containing u₀.

Assumption 1. L : ℝ^d → ℝ_+^Y is a surrogate and ψ : ℝ^d → R a link, consistent with the discrete target ℓ : R → ℝ_+^Y, satisfying the following:
(i) For all y ∈ Y, the function L(·)_y is differentiable with a locally Lipschitz gradient.⁵
(ii) For some α > 0 and some boundary report u₀ ∈ ℝ^d for L, ℓ with witnessing distribution p₀, the expected loss function u ↦ ⟨p₀, L(u)⟩ is α-strongly convex on an open neighborhood of u₀.

This assumption is quite weak. In particular, because ℓ has a finite report set R, boundary reports essentially always exist whenever (L, ψ) indirectly elicits ℓ. In particular, as the level sets of γ are closed, convex, and union to the simplex [10, 15], there must be some pair of reports r, r′ ∈ R with γ_r ∩ γ_{r′} ≠ ∅. We can then take p₀ ∈ γ_r ∩ γ_{r′}, and any u₀ ∈ Γ(p₀). One caveat is that we may have Γ(p₀) = ∅. While some surrogates may fail to have a minimizer for some distributions p, this typically happens when p is on the boundary of the simplex, and not generally on the boundary between level sets. For example, consider exponential loss L_exp(u)_y = exp(−uy), where prop[L_exp](0) = prop[L_exp](1) = ∅, but we still have prop[L_exp](1/2) = {0}.

The following result is our main lower bound.

Theorem 4 (Stronger version of Theorem 2). Suppose the surrogate loss L and link ψ satisfy a regret transfer of ζ for a target loss ℓ. If L, ψ, and ℓ satisfy Assumption 1, then there exists c > 0 such that, for some ϵ* > 0 and all 0 ≤ ϵ < ϵ*, we have ζ(ϵ) ≥ c√ϵ.

The key idea of the proof is to fix a boundary report u₀ and consider a sequence of distributions p_λ → p₀ as λ → 0. Target regret R_ℓ(ψ(u₀), p_λ) shrinks linearly in λ, but surrogate regret R_L(u₀, p_λ) shrinks quadratically. The latter holds roughly because expected surrogate loss is strongly convex in some neighborhood of u₀ and is strongly smooth elsewhere.⁶ Theorem 4 captures a wide range of surrogate losses beyond those that are strongly convex and strongly smooth.
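To see the quadratic-versus-linear scaling concretely, here is a minimal numerical sketch (not from the paper's proofs; it relies on closed-form Bayes risks for binary exponential loss and assumes the sign link breaks the tie at 0 toward +1).

```python
# A numerical sketch (not from the paper's proofs) of the scaling behind
# Theorem 4, using binary exponential loss L_exp(u)_y = exp(-u*y), 0-1 target
# loss, and the sign link with psi(0) = +1 assumed as the tie-break. The
# boundary report is u0 = 0, and we take p_lambda = 1/2 - lambda (identifying
# p with Pr[Y = +1]) so that psi(u0) links to a suboptimal target report.
import math

def surrogate_regret_at_u0(p):
    # Expected exp-loss at u0 = 0 is 1; the Bayes risk is 2*sqrt(p*(1-p)).
    return 1.0 - 2.0 * math.sqrt(p * (1.0 - p))

def target_regret_at_u0(p):
    # psi(0) = +1 has expected 0-1 loss (1 - p); the Bayes risk is min(p, 1 - p).
    return (1.0 - p) - min(p, 1.0 - p)

for lam in (0.1, 0.01, 0.001):
    p = 0.5 - lam
    R_L = surrogate_regret_at_u0(p)   # shrinks like 2*lambda^2
    R_l = target_regret_at_u0(p)      # shrinks like 2*lambda
    print(f"lambda={lam:.3f}:  R_L = {R_L:.2e},  R_l = {R_l:.2e},  "
          f"R_l / sqrt(R_L) = {R_l / math.sqrt(R_L):.3f}")
```

Any regret transfer ζ guaranteed by this pair must satisfy ζ(R_L) ≥ R_l at every λ, and the ratio R_l/√(R_L) stays near √2, illustrating why ζ cannot shrink faster than a square root near 0.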
For example, exponential loss L_exp(u)_y = exp(−uy) is not strongly convex, but does satisfy Assumption 1; for part (ii), the boundary report is u₀ = 0 as we saw above, and L_exp is (1/e)-strongly convex on (−1, 1) ∋ u₀. As another example, consider Huber loss for binary classification. Here Y = {−1, +1} and the prediction space is ℝ, and Huber loss can be defined by letting the hinge loss be t = max{0, 1 − ry} and setting L(r)_y = t² if t ≤ 2, else 4(t − 1). Theorem 4 does capture Huber, but needs the strength of Assumption 1(ii), as Huber is strongly convex on [−1, 1] but linear outside that interval.

5 Unpacking the constant for polyhedral losses

Our main upper bound for polyhedral losses, Theorem 1, is mostly nonconstructive: it says polyhedral surrogates have some α such that R_ℓ(ψ ∘ h; D) ≤ α R_L(h; D). In this section, we make the result constructive by unpacking the implications of consistency for polyhedral surrogates. Proofs appear in Appendix C.

5.1 Contribution from the target, surrogate, and link

We will give a bound on the constant of the form α ≤ α_ℓ · α_L · α_ψ, i.e. one component each relating to the structure of the target loss, surrogate loss, and link. First, we will derive α_L using the theory of Hoffman constants from linear optimization [14]. Then, we will show that any link for a consistent polyhedral loss must satisfy a condition called ϵ-separation, introduced in [10]. We will set α_ψ = 1/ϵ. With α_ℓ simply an upper bound on the target loss, we will finally prove α ≤ α_ℓ · α_L · α_ψ.

⁵ A map g : A → B is locally Lipschitz if for every a ∈ A there is an open neighborhood U of a such that the restriction of g to U ∩ A is Lipschitz continuous.

⁶ A tempting alternative is to fix a distribution p₀ and consider a sequence of predictions. We know from Lemma 1 that this approach will fail, however: for any consistent surrogate and any fixed p₀, there is a linear regret transfer if we restrict to p₀.

5.1.1 Hoffman constants

The intuition we explore here is the linear growth rate of a given polyhedral function as we move away from its optimal set. In particular, consider the expected loss u ↦ ⟨p, L(u)⟩. Expected loss should grow linearly with distance from the optimal set, which happens to be Γ(p). In particular, the worst growth rate is governed roughly by the smallest slope of the function in any direction. This is captured by the Hoffman constant of an associated system of linear inequalities. In Appendix C, we use Hoffman constants to obtain the following result, which defines H_{L,p}.

Lemma 5 ([14]). Let L : ℝ^d → ℝ_+^Y be a polyhedral loss with Γ = prop[L]. Then for any fixed p, there exists some smallest constant H_{L,p} ≥ 0 such that d_∞(u, Γ(p)) ≤ H_{L,p} · R_L(u, p) for all u ∈ ℝ^d.

To foreshadow the usefulness of H_{L,p}, recall the proof strategy of the upper bound: use Lemma 1 to obtain a constant α_p for each p; then carefully choose a finite set of p's and take the maximum constant. In the same way, we will be able to set the loss constant α_L = max_p H_{L,p}. First, however, we will need to investigate separated links.

5.1.2 Separated link functions

A link is ϵ-separated if, for all pairs of surrogate reports u, u′ such that u′ is optimal for the surrogate and u links to a suboptimal target report, we have ‖u − u′‖_∞ > ϵ. The definition was introduced by Finocchiaro et al. [10].

Definition 4 (Separated Link). Let properties Γ : Δ_Y ⇉ ℝ^d and γ : Δ_Y ⇉ R be given. We say a link ψ : ℝ^d → R is ϵ-separated with respect to Γ and γ if for all p ∈ Δ_Y and u ∈ ℝ^d with ψ(u) ∉ γ(p), we have d_∞(u, Γ(p)) > ϵ, where d_∞(u, A) := inf_{a∈A} ‖u − a‖_∞.
Similarly, we say ψ is ϵ-separated with respect to L and ℓ if it is ϵ-separated with respect to prop[L] and prop[ℓ].

Recall from the previous subsection that Hoffman constants allowed us to show that polyhedral expected losses grow linearly as we move away from the optimal set. Now, ϵ-separated link functions allow us to guarantee that we move at least a constant distance from the optimal set before we start linking to an incorrect report r ∈ R. But when does a polyhedral loss have an associated ϵ-separated link? Finocchiaro et al. [10] showed that ϵ-separated links for polyhedral surrogates are calibrated. We show that the converse is in fact true: every calibrated link is ϵ-separated for some ϵ > 0. The proof follows a similar argument to that of Tewari and Bartlett [30, Lemma 6].

Lemma 6. Let a polyhedral surrogate L : ℝ^d → ℝ_+^Y, a discrete loss ℓ : R → ℝ_+^Y, and a link ψ : ℝ^d → R be given such that (L, ψ) is calibrated with respect to ℓ. Then there exists ϵ > 0 such that ψ is ϵ-separated with respect to Γ := prop[L] and γ := prop[ℓ].

5.1.3 Combining the loss and link

We can now follow the general proof strategy of the main upper bound, but constructively. Given Lemma 6, we can make the following definition.

Definition 5 (ϵ_ψ). If (L, ψ) are calibrated for ℓ, then let ϵ_ψ := sup{ϵ : ψ is ϵ-separated}.

We will also need an upper bound C_ℓ = max_{r,p} R_ℓ(r, p) on the regret of ℓ. Since R_ℓ is convex in p, in particular this maximum is achieved at a vertex of the probability simplex.

Definition 6 (C_ℓ). Given a discrete loss ℓ : R → ℝ_+^Y, define C_ℓ := max_{r,r′∈R, y∈Y} ℓ(r)_y − ℓ(r′)_y.

Recall that Lemma 4 gave that, if L is polyhedral, then Γ = prop[L] has a finite set of full-dimensional level sets, each a polytope, that union to the simplex.

Definition 7 (H_L). Given a polyhedral surrogate loss L : ℝ^d → ℝ_+^Y, let Q be the set of all vertices of the full-dimensional level sets of Γ = prop[L], and define H_L := max_{p∈Q} H_{L,p}.

Theorem 5 (Constructive linear transfer). Let ℓ : R → ℝ_+^Y be a discrete target loss, L : ℝ^d → ℝ_+^Y a polyhedral surrogate loss, and ψ : ℝ^d → R a link function. If (L, ψ) are consistent for ℓ, then

    (∀h, D)   R_ℓ(ψ ∘ h; D) ≤ (C_ℓ H_L / ϵ_ψ) · R_L(h; D) .

The proof closely mirrors the proof of the nonconstructive upper bound, Theorem 1, but using the constructive constants C_ℓ, H_L, ϵ_ψ derived above. To summarize, we have shown the following.

Theorem 6. Let ℓ : R → ℝ_+^Y be a discrete target loss, L : ℝ^d → ℝ_+^Y a polyhedral surrogate loss, and ψ : ℝ^d → R a link function. The following are equivalent:

1. (L, ψ) is consistent for ℓ.
2. ψ is ϵ-separated with respect to L and ℓ for some ϵ > 0.
3. (L, ψ) guarantees a linear regret transfer for ℓ.

Furthermore, if any of the above hold, then in particular (L, ψ) guarantees the regret transfer ζ(t) = (C_ℓ H_L / ϵ_ψ) · t.

5.2 Further tightening

While our goal is not to focus on exact constants for specific problems, we remark on a few ways Theorem 5 may be loose and how one could compute the tightest possible constant α for which R_ℓ(ψ ∘ h; D) ≤ α R_L(h; D) for all h and D. In general, for a fixed p, there is some smallest α*_p such that R_ℓ(ψ(u), p) ≤ α*_p R_L(u, p) for all u. Then, it follows from our results that α = max_{p∈Q} α*_p for the finite set Q used in the proof, i.e. the vertices of the full-dimensional level sets of Γ = prop[L]. Above, we bounded α*_p ≤ C_ℓ H_{L,p} / ϵ_ψ. The intuition is that some u at distance ϵ_ψ from Γ(p), the optimal set, may link to a bad report r = ψ(u) ∉ γ(p).
The rate at which L grows is at least 1/H_{L,p}, so the surrogate regret at u may be as small as ϵ_ψ / H_{L,p}, while the target regret may be as high as C_ℓ = max_{r′,p′} R_ℓ(r′, p′). The ratio of regrets is therefore bounded by C_ℓ H_{L,p} / ϵ_ψ. The tightest possible bound, on the other hand, is α*_p = sup_{u : ψ(u)∉γ(p)} R_ℓ(ψ(u), p) / R_L(u, p). This bound can be smaller if the values of the numerator and denominator are correlated across u. For example, u may only be ϵ_ψ-close to the optimal set when it links to reports ψ(u) with lower target regret; or L may have a smaller slope in the direction where the link's separation is larger than ϵ.

To illustrate with a concrete example, consider the binary encoded predictions (BEP) surrogate of [25] for the abstain target loss, ℓ(r)_y = 1/2 if r = ⊥, and otherwise ℓ(r)_y = 1[r ≠ y]. The surrogate involves an injective map B : Y → {−1, 1}^d for d = ⌈log₂ |Y|⌉. It is L(u)_y = max_{j=1,...,d} (1 − u_j B(y)_j)_+, where (·)_+ indicates taking the maximum of the argument and zero. The associated link is ψ(u) = ⊥ if min_{j=1,...,d} |u_j| ≤ 1/2, and otherwise ψ(u) = argmin_{y∈Y} ‖B(y) − u‖.

One can show that for p = δ_y, i.e. the distribution placing all mass on some y ∈ Y, L(u)_y = d_∞(u, Γ(p)) exactly, giving H_{L,p} = 1. It is almost immediate that ϵ_ψ = 1/2. Meanwhile, R_ℓ(r, p) ≤ 1, giving us an upper bound α*_p ≤ (1)(1)/(1/2) = 2. In fact, this is slightly loose; the exact constant, given in [25], is 1. The looseness stems from the fact that for p = δ_y, the closest reports u to the optimal set, i.e. at distance only ϵ_ψ = 1/2 away, do not link to reports maximizing target regret; they link to the abstain report ⊥, which has regret only 1/2. With this correction, and an observation that all u linking to reports y′ ≠ y are at distance at least 3/2 from Γ(p), we restore the tight bound α*_p ≤ 1. A similar but slightly more involved calculation can be carried out for the other vertices p ∈ Q.

Finally, while we use the ∞-norm to define the minimum-slope constant H_L and the separation ϵ_ψ, in principle one could use another norm. One reason for restricting to the ∞-norm is that it is more compatible with Hoffman constants. However, all definitions hold for other norms, and so does the main upper bound, as existence of an H_L and ϵ_ψ in the ∞-norm implies existence of constants for other norms. These constants may change for different norms, and in particular, the optimal overall constant may arise from a norm other than the ∞-norm.

6 Discussion

We have shown two broad results about regret transfer functions for surrogate losses. In particular, polyhedral surrogates always achieve a linear transfer, whereas non-polyhedral surrogates are generally square root. Section 5 outlines several directions to further refine the bound for polyhedral surrogates. Beyond these directions, an interesting question is to what extent our results hold when the label set or report set is infinite. For example, when the labels are the real line, pinball loss is polyhedral yet elicits a quantile, a very different case from the one we study. In particular, Lemma 4 will not give a finite set of optimal sets in this case. Nonetheless, we suspect similar results could go through under some regularity assumptions. Finally, in line with our results, we would like to better understand the tradeoff between smoothness of the surrogate and the overall generalization rate. One important direction along these lines is to extend the analysis of Mahdavi et al. [17] beyond binary classification.
Broader Impacts

As a theoretical work, the likely impacts of this paper take the form of downstream research and applications. Instead of direct applications, we anticipate this work leading to more investigation of surrogate losses to improve discrete prediction tasks. It may inform practitioners' choices of which surrogate losses they use for supervised learning tasks. Of course, such machine learning tasks can be solved for ethical or unethical purposes. We do not know of particular risks of negative impacts of this work beyond the risks of supervised machine learning in general.

Acknowledgments and Disclosure of Funding

We thank Stephen Becker, Jessie Finocchiaro, and Nishant Mehta for insights, discussions, and references. This material is based upon work supported by the National Science Foundation under Grant No. IIS-2045347.

References

[1] Shivani Agarwal. Surrogate regret bounds for bipartite ranking via strongly proper losses. The Journal of Machine Learning Research, 15(1):1653–1674, 2014.
[2] Franz Aurenhammer. A criterion for the affine equivalence of cell complexes in ℝ^d and convex polyhedra in ℝ^{d+1}. Discrete & Computational Geometry, 2(1):49–64, 1987.
[3] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.
[4] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[5] Mathieu Blondel. Structured prediction with projection oracles. Advances in Neural Information Processing Systems, 32:12145–12156, 2019.
[6] Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pages 1507–1517. PMLR, 2021.
[7] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. Advances in Neural Information Processing Systems, 29:4412–4420, 2016.
[8] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In International Conference on Algorithmic Learning Theory, pages 67–82. Springer, 2016.
[9] John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.
[10] Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in Neural Information Processing Systems, 2019.
[11] Rafael Frongillo and Ian Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.
[12] Rafael M. Frongillo and Ian A. Kash. Elicitation complexity of statistical properties. Biometrika, 2020.
[13] Wei Gao and Zhi-Hua Zhou. On the consistency of AUC pairwise optimization. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[14] Alan J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4), 1952.
[15] Nicolas S. Lambert and Yoav Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the 10th ACM Conference on Electronic Commerce, pages 109–118, 2009.
[16] Maksim Lapin, Matthias Hein, and Bernt Schiele. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1468–1477, 2016.
[17] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Binary excess risk for smooth convex surrogates. arXiv preprint arXiv:1402.1792, 2014.
[18] Aditya Krishna Menon and Robert C. Williamson. Bipartite ranking: a risk-theoretic perspective. The Journal of Machine Learning Research, 17(1):6766–6867, 2016.
[19] Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. arXiv preprint arXiv:1901.10655, 2019.
[20] Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. A general theory for structured prediction with smooth convex surrogates. arXiv preprint arXiv:1902.01958, 2019.
[21] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302–313, 2017.
[22] Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18:1–35, 2017.
[23] Bernardo Avila Pires, Csaba Szepesvari, and Mohammad Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. In International Conference on Machine Learning, pages 1391–1399, 2013.
[24] Harish Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Convex calibrated surrogates for hierarchical classification. In International Conference on Machine Learning, pages 1852–1860, 2015.
[25] Harish G. Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.
[26] Mark D. Reid and Robert C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 897–904, 2009.
[27] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematical Series. Princeton University Press, 1997.
[28] Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.
[29] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[30] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.
[31] Yutong Wang and Clayton Scott. Weston-Watkins hinge loss and ordered partitions. Advances in Neural Information Processing Systems, 2020.
[32] Forest Yang and Sanmi Koyejo. On the consistency of top-k surrogate losses. arXiv preprint arXiv:1901.11141, 2019.
[33] Jiaqian Yu and Matthew B. Blaschko. The Lovász hinge: A novel convex surrogate for submodular losses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[34] Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(Jan):111–130, 2010.
[35] Constantin Zalinescu. Sharp estimates for Hoffman's constant for systems of linear inequalities and equalities. SIAM Journal on Optimization, 14(2):517–533, 2003.
[36] Jingwei Zhang, Tongliang Liu, and Dacheng Tao. On the rates of convergence from surrogate risk minimizers to the Bayes optimal classifier. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[37] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.