# unified_binary_and_multiclass_marginbased_classification__3cfc7e39.pdf

Journal of Machine Learning Research 25 (2024) 1-51 Submitted 12/23; Revised 5/24; Published 5/24

Uniﬁed Binary and Multiclass Margin-Based Classiﬁcation

Yutong Wang1,2 yutongw@umich.edu Clayton Scott1,3 clayscot@umich.edu 1Department of Electrical Engineering and Computer Science 2Michigan Institute of Data Science 3Department of Statistics University of Michigan Ann Arbor, MI 48109, USA

Editor: Zhihua Zhang

The notion of margin loss has been central to the development and analysis of algorithms for binary classiﬁcation. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classiﬁcation. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classiﬁcation-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classiﬁcation-calibrated. Keywords: Classiﬁcation, loss functions, consistency, margins, label encodings

1. Introduction

Classiﬁcation into k 2 categories is the learning task of selecting a classiﬁer, i.e., a function from the feature space X to the set of labels [k] := {1, . . . , k}, given training data. Many of the most popular and successful classiﬁcation methods aim to ﬁnd a classiﬁer with minimum risk, where the risk of a classiﬁer is the expected value of a loss function that measures the quality of predictions. Indeed, logistic regression, support vector machines, boosting, and neural network methods can all be viewed as algorithms to minimize the risk associated with a certain loss. In practice, loss-based approaches to binary (k = 2) and multiclass (k 3) cases are formulated diﬀerently. Because of this discrepancy, the theory and practice of multiclass methods often lag behind their binary counterparts. The purpose of this work is to bridge this gap by introducing a framework that uniﬁes binary and multiclass loss-based classiﬁcation. The standard approach to binary classiﬁcation is to learn a discriminant function g : X R that maps an instance x to a discriminant g(x), e.g., g(x) = w T x + b for linear classiﬁcation. Identifying {1, 2} with { 1, +1}, a label y is predicted by sign(g(x)). The loss ascribed to a discriminant g and a pair (x, y) is deﬁned in terms of a function ψ : R R, with the loss being equal to ψ(( 1)yg(x)). Loss functions of this form are referred to as

c 2024 Yutong Wang and Clayton Scott.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1599.html.

Wang and Scott

Figure 1: Relative margin form. Panel (a): Multiclass losses L satisfying the permutation equivariant and relative margin-based conditions (i.e., PERM losses, Deﬁnition 2.2) can be expressed in the relative margin form as in Panel (b). See Theorem 2.5. The relative margin form employs three components: Panels (c). the matrix label code {Υy}y [k], (d). the discriminant function g, and (e). the template , a symmetric function ψ : Rk 1 R.

binary margin loss functions. Examples include the logistic loss ψ(t) = log(1+e t) (logistic regression), the hinge loss ψ(t) = max(0, 1 t) (support vector machines), the exponential loss ψ(t) = e t (Ada Boost), and the sigmoid loss ψ(t) = 1/(1 + et) (some neural networks). The conventional approach to multiclass classiﬁcation seeks to learn a class-score function f = (f1, . . . , fk) : X Rk, e.g., a feed-forward neural network. The label for an input x is predicted by taking the argmax of the class-score vectors v := f(x). Multiclass loss functions, e.g., cross entropy, may be viewed as functions L : {1, . . . , k} Rk R. The loss incurred by a class-score function f on a pair (x, y) X [k] is then L(y, f(x)). When a multiclass loss function is expressed in terms of a class-score function output, we say that it is in class-score form. We show that a large family of multiclass loss functions in class-score form, including cross-entropy, multiclass exponential loss (Mukherjee and Schapire, 2013), multiclass hinge losses (Crammer and Singer, 2001; Weston and Watkins, 1998), Gamma-Phi losses (Beijbom et al., 2014), and Fenchel-Young losses (Blondel et al., 2020), can be expressed in what we call the relative-margin form. Instead of being deﬁned in terms of a class-score function output f(x), in the relative margin form a loss is expressed ψ(Υyg(x)), which involves the following elements: 1. a symmetric1 function ψ : Rk 1 R, 2. a set of (k 1) (k 1) matrices {Υy}y [k] that encode the label, and

3. a discriminant function of the form g : X Rk 1. When k = 2, we have Υy = ( 1)y, and the relative margin form coincides with the standard binary margin form. See Figure 1 for an illustration of the framework. Thus, the relative margin form generalizes the notion of margin loss from binary classiﬁcation.

1. Recall that a function is symmetric if its value does not change when its input is permuted.

Unified Margin-Based Classification

1.1 Our contributions

In this work, we develop the relative margin from as described above, and illustrate the utility of this framework by establishing novel results on classiﬁcation-calibration2. More speciﬁcally, our contributions are: Theorem 2.5: characterization of losses expressible in the relative margin form. We show that a multiclass loss function can be expressed in the relative margin form if and only if both of the following conditions are met: 1. permutation equivariance: L treats all classes equally, and 2. relative margin-based: L(y, f(x)) depends only on the diﬀerences between the fj(x) s rather than the values of the fj(x) s themselves. Below, such losses are referred to as PERM losses. Our characterization implies that many popular losses such as the cross entropy, multiclass exponential, and Fenchel-Young losses (Blondel, 2019) are PERM losses. PERM losses are characterized by their template, a symmetric function ψ : Rk 1 R. Thus, our result implies that for the analysis of existing and design of new PERM losses, it suﬃces to focus on the template ψ which can be simpler than the original L. Theorem 4.7: Extending a fundamental result on classiﬁcation-calibration (CC) to the multiclass case. A seminal result of Bartlett et al. (2006) in the binary case shows that a convex margin loss is classiﬁcation-calibrated if and only if ψ is diﬀerentiable at 0 and has negative derivative there. We use the PERM loss framework to prove a multiclass extension of this result for L that are totally regular (Deﬁnition 4.2). We use Theorem 4.7 to prove a novel suﬃcient condition for CC of sums of losses (Proposition 4.8), and, as a corollary, establish CC for sums of Gamma-Phi losses (Example 5). A key component of the totally regular property is that the gradient of the template ψ must be entry-wise negative everywhere3. Theorem 5.10: Expanding previous suﬃcient conditions of CC for Fenchel Young losses. Fenchel-Young losses, deﬁned via taking the convex conjugate of a certain negentropy function, have recently been proposed for structured prediction, including multiclass classiﬁcation. However, an open question is whether Fenchel-Young losses are classiﬁcation-calibrated under the assumption that the negentropy is strictly convex. Our Theorem 5.10 answers this in the aﬃrmative. The matrix label code. The key component in our analysis is the matrix label code (Deﬁnition 2.4) which extends to the multiclass case the universally adopted { 1} label code for binary classiﬁcation. To obtain our main results, we developed a suite of technical lemmas for working with the matrix label code in Appendix B. We expect the matrix label code to be useful for multiclass classiﬁcation research beyond the scope here.

1.2 Background

In this section, we deﬁne the probabilistic setting assumed throughout this work. Moreover, we ﬁrst review the background on classiﬁcation-calibration.

2. For readers unfamiliar with this theory, we provide a brief review below in the Background on Classiﬁcation-Calibration section (Section 1.2). 3. Thus, in the binary case, our result is weaker than that of Bartlett et al. (2006), which only requires negativity at 0 R. We leave as an open question whether this gap can be closed.

Wang and Scott

Let {(xi, yi)}n i=1 be drawn from a joint distribution P over X [k]. Let I be the indicator function and I{y = j} be the 01-loss for y, j [k]. The goal of classiﬁcation is to select a classiﬁer h : X [k] minimizing the 01-risk R01,P (h) := E(X,Y ) P [I{Y = h(X)}] objective. A fundamental approach for learning a classiﬁer is empirical risk minimization (ERM), which selects an h minimizing the empirical 01-risk ˆRn 01(h) := 1

n Pn i=1 I{yi = h(xi)} over a class of functions H. Directly minimizing the empirical 01-risk objective is often intractable4 due to the discreteness of the objective. The surrogate-based approach addresses this issue as follows. Deﬁne the L-risk for a ﬁxed continuous loss function L by RL(f) := E(X,Y ) P [L(Y, f(X))] and the empirical L-risk by ˆRn L(f) := 1

n Pn i=1 L(yi, f(xi)). Minimizing ˆRn L(f) over a family of f s, e.g., neural networks, is employed as a tractable surrogate objective. A class-score function f induces a discrete-valued classiﬁer arg max f(x) := arg maxj=1,...,k fj(x) with ties broken arbitrarily. An important question is whether good performance with respect to the surrogate L-risk transfers back to the 01-risk, the original objective.

Deﬁnition 1.1 A surrogate loss L has the consistency transfer property5 if for any distribution P over X [k] and any sequence of class-score functions { ˆf(n)}n, e.g., ˆf(n) is an empirical L-risk minimizer over a family of functions depending on n, the following is satisﬁed: limn RL( ˆf(n)) = inff RL(f) implies that limn R01(arg max ˆf(n)) = infh R01(h). The inﬁmums are, respectively, over all measurable functions f : X Rk and h : X [k].

Thus, the consistency transfer property (CTP) provides justiﬁcation for L-risk minimization when minimizing the 01-risk is the original objective of interest. For binary margin-based losses ψ, the seminal result of Bartlett et al. (2006, Theorem 1.3) shows that CTP is equivalent to a functional property of ψ known as classiﬁcation-calibration (CC). For the multiclass case, Tewari and Bartlett (2007) deﬁne the CC property (reviewed in detail in Section 3 below) and show its equivalence to the CTP as well.

1.3 Related work

Suﬃcient conditions for CC of multiclass convex margin-based losses. For binary convex loss in margin form, the suﬃcient conditions for CC are easy to verify and apply to common losses of interest. Indeed, Bartlett et al. (2006, Theorem 2.1) asserts that a nonnegative binary convex loss in margin form ψ : R R 0 is CC if and only if ψ is diﬀerentiable at 0 and ψ (0) < 0. There are several diﬃculties for extending the above characterization to the multiclass case. The result in the binary case relies on the expression of a binary margin-based loss as ψ(( 1)yz). While proposed multiclass extensions of the expression exist (to be reviewed below), existing work on suﬃcient conditions for CC of multiclass losses focus on losses in the class-score form L : [k] Rk R. Most previous works studied CC of losses of a speciﬁc form such as the Gamma-Phi losses (Zhang, 2004; Beijbom et al., 2014; Wang and Scott, 2023a), Fenchel-Young losses (Duchi et al., 2018; Blondel et al., 2020), and hinge-like losses (Tan and Zhang, 2022).

4. See Bhattacharyya et al. (2018) and the references therein 5. This property appears in many works, e.g., Steinwart (2007); Bartlett et al. (2006); Tewari and Bartlett (2007); Zhang (2004). The name consistency transfer property was used by Wang and Scott (2023a).

Unified Margin-Based Classification

Notably, Tewari and Bartlett (2007) derive a characterization of CC for multiclass losses analogous to that of Bartlett et al. (2006, Theorem 2.1). However, their suﬃcient condition (Tewari and Bartlett, 2007, Theorem 7), while geometrically elegant, is hard to verify for common losses such as the multiclass exponential loss6. Multiclass margins & margin-based loss. A line of work has considered the problem of extending the binary margin ( 1)yz and binary margin form ψ(( 1)yz) to the multiclass setting. For the margin, Crammer and Singer (2001) and Mohri et al. (2018, 9.2) deﬁned a notion of scalar-valued multiclass margin as minj [k]:j =y vy vj where y is the ground truth label. In other words, this notion of multiclass margin is the minimum diﬀerence of score between the ground truth label and the rest of the labels. While intuitive, according to Tewari and Bartlett (2007, 5.1), convex losses based on this notion of margin are never classiﬁcation-calibrated. Lee et al. (2004); Zou et al. (2008) proposes deﬁnitions of vector-valued multiclass margin vectors and associated margin losses for multiclass SVMs and boosting, respectively. These losses are known to be classiﬁcation-calibrated if and only if sum-to-zero constraints are enforced on the input margin vectors (Dogan et al., 2016, Theorem 6). Lee et al. (2004) develops a multiclass SVM based on a hinge loss that leverages this notion of margin. However, in practice, enforcing these sum-to-zero constraints leads to signiﬁcantly slower computations (Fu et al., 2022). Dogan et al. (2016) developed a framework using relative margins to unify the analysis of several variants of multiclass support vector machines. This notion of relative margin7

is ﬁrst introduced by Rosset et al. (2003, 4). Both of these prior works only established classiﬁcation-calibration of speciﬁc losses for speciﬁc algorithms. By contrast, our work develops a framework that characterizes when multiclass losses can be expressed via relative margins and proves general classiﬁcation-calibration results for a large family of losses. Label encodings for multiclass classiﬁcation. As alluded to earlier, the multiclass SVM introduced by Lee et al. (2004) is classiﬁcation-calibrated provided that certain sumto-zero constraints are enforced. Moreover, enforcing these constraints is computationally prohibitive. The simplex code (Hill and Doucet, 2007) is a multiclass label code designed to address this computational issue by using a constraint-free reparametrization of the multiclass SVM dual formulation (Wu and Lange, 2010; Saberian and Vasconcelos, 2011; Mroueh et al., 2012; van den Burg and Groenen, 2016; Pouliot, 2018). Another approach to label encoding is the multivector construction8 (Shalev-Shwartz and Ben-David, 2014, 17.7) which has been used to study the multiclass perceptron, logistic regression and SVMs (Duda et al., 2006, 5.12). Moreover, the multivector construction has been used to study the PAC learning sample complexity of multiclass linear classiﬁers (Daniely and Shalev-Shwartz, 2014). While the simplex code and the multivector constructions are elegant techniques, the scope of these work are on speciﬁc algorithms and classiﬁers. In particular, a framework for

6. See Footnote 7 in Tewari and Bartlett (2007, Table 1). 7. Rosset et al. (2003) also uses the name margin vector in conﬂict with the later work by Zou et al. (2008) which deﬁnes a diﬀerent notion of margin. Due to this confusion, we will refer to the earlier notion of Rosset et al. (2003) by relative margins . 8. Also known as Kesler s construction which, according to Crammer and Singer (2003), is credited to (Kesler, 1961). A description of the construction can be found in Duda et al. (2006, 5.12) and Shalev Shwartz and Ben-David (2014, 17.7).

Wang and Scott

general multiclass margin-based losses and analysis of their properties, e.g., classiﬁcationcalibration, is lacking. Our work ﬁlls this gap by using the framework of the relative-margin form and matrix label code to 1. characterize the set of multiclass losses expressible in this form (Theorem 2.5), 2. prove suﬃcient conditions for classiﬁcation-calibration of a large family of multiclass losses (Theorems 4.7) and 3. apply our results to expand the previously known family of classiﬁcation-calibrated Fenchel-Young losses (Theorem 5.10).

1.4 Notations

In this subsection, we brieﬂy discuss some of the key notations. For the reader s convenience, we include a more comprehensive reference for the notations in Table 1 in Section A of the appendix. Moreover, we tabulate the mathematical objects deﬁned in Table 2. Throughout this work, let k 2 denote the number of classes. Denote the k-probability simplex by k = {p Rk 0 : Pk j=1 pj = 1}. Vectors/matrices. Let the square bracket with subscript [ ]j be the projection of a vector onto its j-th component, i.e., [v]j := vj where v = (v1, . . . , vk) Rk. Given two vectors w, v Rk, we write w v (resp. w v) if wj vj (resp. wj > vj) for all j [k]. Allzeros/all-ones/i-th elementary basis vector in Rn are denoted 0(n)/1(n)/e(n) i , respectively. When the ambient dimension is clear, we drop the superscript (n). The n n identity matrix is denoted In. Permutations and permutation matrices. A bijection from [k] to itself is called a permutation (on [k]). Denote by Sym(k) the set of all permutations on [k]. For each σ Sym(k), let Sσ denote the permutation matrix corresponding to σ. In other words, if v Rk

is a vector, then [Sσv]j = [v]σ(j) = vσ(j).

2. Permutation equivariant relative margin (PERM) losses

In this section, we deﬁne PERM losses and prove fundamental properties used throughout the rest of the work. We begin with a discussion of the two deﬁning properties: permutation equivariance and relative-margins. Recall that in the introduction, we denoted loss functions as L : [k] Rk R. For a class-score function f : X Rk, the loss is evaluated as L(y, f(x)) on an instance (x, y). Throughout the rest of this work, for mathematical convenience we will use the equivalent notation convention Ly(f(x)) where y appears in the subscript. In this convention, a multiclass loss function is a vector-valued function L = (L1, . . . , Lk) : Rk Rk. With this convention, the property that the loss treats each of the k classes equally is captured by permutation equivariance, a notion similar to but distinct from symmetry9. Denote by v the class-score vector f(x) on some generic instance (x, y). The permutation equivariance property states that if the class-score vector v is relabeled by some permutation σ Sym(k), then the vector of losses L(v) should be relabeled in the same way, i.e., L(Sσv) = SσL(v).

9. Recall that a function is symmetric if its value does not change when its input is permuted. Sometimes, this is also referred to as permutation invariance (Bronstein et al., 2021, 3.1). For our purposes, equivariance is the correct descriptor.

Unified Margin-Based Classification

Relative margins10 have been recently used by Dogan et al. (2016) and by Fathony et al. (2016) in the context of multiclass SVMs that only utilize the set of diﬀerences vy vj over all y, j [k] such that y = j, rather than the non-relative or absolute class-scores (v1, . . . , vk) themselves. Below, it will be notationally convenient to use a matrix that converts the vector of class-score v into relative margins:

Deﬁnition 2.1 Let D := Ik 1 1(k 1) R(k 1) k. Observe that [Dv]y = vk vy for y [k 1] and v Rk. Equivalently, Dv = (vk v1, vk v2, . . . , vk vk 1) .

We are now ready to deﬁne

Deﬁnition 2.2 (PERM losses) Let k 2 be an integer. A k-ary multiclass loss function is a vector-valued function L = (L1, . . . , Lk) : Rk Rk. We say that L is

1. permutation equivariant if L(Sσv) = SσL(v) for all v Rk and σ Sym(k), 2. relative margin-based if for each y [k] there exists a function ℓy : Rk 1 R so that

Ly(v) = ℓy(Dv) = ℓy(vk v1, vk v2, . . . , vk vk 1), for all v Rk. (1)

We refer to the vector-valued function ℓ:= (ℓ1, . . . , ℓk) as the reduced form of L. 3. PERM if L is both permutation equivariant and relative margin-based. In this case, the function ψ := ℓk is referred to as the template of L.

While the concepts of permutation equivariance and relative margin have been studied largely in isolation in previous works, our work is the ﬁrst to systematically study the properties of losses having both properties. Below in Theorem 2.5, we show that a PERM loss is completely determined by its template via what we call the relative-margin form.

Remark 2.3 (On the name template ) The symbol of the template ψ is chosen intentionally to match that of Bartlett et al. (2006), where a (binary) margin loss is expressed as ψ(( 1)yg(x)). Treating g(x) = z as an arbitrary input to ψ, the rationale behind the name is that the positive branch ψ(( 1)2z), of the margin loss serves as a template for the negative branch ψ(( 1)1z).

Before proceeding, let us examine some notable PERM losses:

Example 1 The cross entropy (also multinomial logistic) loss is given by

LCE y (v) = log 1 + P j [k]:j =y exp( (vy vj)) , for all v Rk.

It is easy to see that its template is the function ψCE(z) = log 1 + Pk 1 j=1 exp( zj) . When

k = 2, we have ψCE(z) = log(1 + exp( z)) which is the binary logistic loss (also known as the binary cross entropy). When k = 3, we have ψCE(z) = log(1 + exp( z1) + exp( z2)) is a function deﬁned over the 2D plane R2, plotted in Figure 1.

10. The idea of relative margins goes back to Rosset et al. (2003). We note that Jebara and Shivaswamy (2008) also use this term in a unrelated context in binary SVMs.

Wang and Scott

Example 2 The Gamma-Phi loss (Beijbom et al., 2014), denoted Lγ,φ y (v), is a generalization of the cross entropy loss deﬁned as follows: Let γ : R R and φ : R R 0 be functions. Deﬁne

Lγ,φ y (v) := γ P j [k]:j =y φ(vy vj) , for all v Rk.

It is easy to see that its template is the function ψγ,φ(z) = γ Pk 1 j=1 φ(zj) . The cross

entropy LCE is the Gamma-Phi loss where φ(t) = exp( t) and γ(t) = log(1 + t). The multiclass exponential (Mukherjee and Schapire, 2013) loss LExp is the Gamma-Phi loss where φ(t) = exp( t) and γ is the identity.

Example 3 A notable case of the Gamma-Phi loss is when φ(t) = max{0, 1 t} is the hinge loss and γ is the identity. The resulting loss is known as the Weston-Watkins hinge loss (Weston and Watkins, 1998; Bredensteiner and Bennett, 1999; Vapnik, 1998), which is well-known to be not classiﬁcation-calibrated (Liu, 2007; Tewari and Bartlett, 2007). However, the Weston-Watkins hinge loss is calibrated with respect to a discrete loss called the ordered partition loss related to ranking with ties (Wang and Scott, 2020).

Example 4 The Crammer-Singer hinge loss (Crammer and Singer, 2001) is a well-known loss that is a PERM loss but not a Gamma-Phi loss. It is deﬁned as

LCS y (v) := maxj [k]:j =y {max{0, 1 (vy vj)}} , for all v Rk.

Section 5 features another example of a family of PERM losses, namely, the Fenchel Young losses (Blondel et al., 2020).

2.1 Matrix label code and the relative margin form

This section introduces the multiclass generalization of { 1}: the matrix label code {Υy}k y=1.

Deﬁnition 2.4 (Matrix label code) For k 2 and y [k], deﬁne the (k 1) (k 1) matrix Υy as follows: For y = k, Υk := Ik 1. For y [k 1], deﬁne Υy column-wise by

( e(k 1) j : j = y 1(k 1) : j = y, for each j [k 1].

Equivalently, to construct Υy for each y [k 1], ﬁrst take the identity matrix Ik 1, then replace the y-th column by all 1 s. Note that Υ2 y = Ik 1 for all y [k 1], i.e., Υy is an involution. Moreover, in the binary case where k = 2, we have Υ1 = 1 and Υ2 = 1, i.e., the matrix label code reduces to the label encoding { 1}. Next, recall that a function f : Rn R is symmetric if f(Sσ( )) = f( ) for all σ Sym(n). We now state the main property of the PERM loss and the matrix label code:

Theorem 2.5 (Relative-margin form) Let L : Rk Rk be a PERM loss with template ψ, and let v Rk and y [k] be arbitrary. Then ψ is a symmetric function. Moreover,

Ly(v) = ψ(Υy Dv). (2)

Conversely, let ψ : Rk 1 R be a symmetric function. Deﬁne a multiclass loss function L = (L1, . . . , Lk) : Rk Rk according to eq. (2). Then L is a PERM loss with template ψ.

Unified Margin-Based Classification

Equation (2) is referred to as the relative margin form of L. Theorem 2.5 and other results in this section are proved in Section C in the appendix.

Remark 2.6 The relative margin form can be interpreted as a one-to-one correspondence between PERM losses L : Rk Rk and symmetric functions ψ : Rk 1 R. Thus, we can refer to a PERM loss by either L or ψ without ambiguity.

Remark 2.7 (Uniqueness of the matrix label code) In Section J of the appendix, we prove11 a unique-ness result: any set of matrices {Υ y}k y=1 satisfying the conclusion of Theorem 2.5 is equal to the matrix label code (Deﬁnition 2.4) up to row permutations of the Υy s.

Remark 2.8 Observe that Υy Dv = (vy v1, vy v2, . . . , vy vk) Rk 1 where the vy vy entry is omitted. This is Lemma B.2 in the appendix. Now, note that the right hand side is exactly the relative margin as deﬁned in Rosset et al. (2003, Eqn. (8)). The advantage of the left hand expression Υy Dv is that the label y acts via its label encoding Υy by left matrix multplication. This disentanglement of the label encoding and the loss can facilitate calculations and will be crucial in our key Lemma G.6.

Remark 2.9 (Prior work) The matrices in the deﬁnition of the matrix label code have been used by Wang and Scott (2020, 2021) for analyzing the Weston-Watkins (WW) SVMs (Weston and Watkins, 1998). Wang and Scott (2020) show that the hinge loss from the WW-SVM is calibrated with respect to the ordered partition loss, and uses this theory to explain the empirical observation made by Dogan et al. (2016) that WW-SVM performs well even under signiﬁcant label noise. Wang and Scott (2021) derive a reparametrization of the Weston-Watkins dual problem that decomposes into subproblems that can be solved exactly in O(k log(k)) time, leading to faster performance for the linear WW-SVMs with many classes. The disentanglement discussed in the previous remark is the key ingredient for deriving this reparametrization. This work extends Wang and Scott (2020, 2021) beyond the multiclass SVM setting, and provides a framework for using these matrices for margin-based losses.

2.2 Relationship between class-score functions and discriminant functions

As mentioned in the introduction, the sign function converts a real-valued discriminant function g : X R to a discrete-valued classiﬁer h : X {1, 2} by choosing h(x) to be y {1, 2} such that ( 1)yg(x) 0 with ties broken arbitrarily. To generalize this to the multiclass case, ﬁrst recall that when k = 2, the matrix label code is equal to Υy = ( 1)y. Thus, ( 1)yg(x) 0 iﬀΥyg(x) 0. More simply put, y is the predicted class if and only if Υy is the sign of g(x). Next, we deﬁne the multiclass sign function analogously. Given a vector-valued discriminant function g : X Rk 1 we deﬁne a discrete-valued classiﬁer h : X [k] by taking y [k] such that Υyg(x) 0. Similar to the binary case, ties can be broken arbitrarily. Thus, Υy can be viewed as the sign of g(x). To relate this multiclass sign function method for producing a classiﬁer to the conventional arg max method, we have

11. We omit the proofs of some of the technical intermediate lemmas. For the omitted proofs, see the ar Xiv version of our manuscript (Wang and Scott, 2023b).

Wang and Scott

Proposition 2.10 Let v Rk, z := Dv, and y [k]. Then y arg maxj vj iﬀΥyz 0.

Furthermore, let v := 0 z . Then y arg maxj vj iﬀy arg maxj v j.

Proof For the ﬁrst part, recall from Remark 2.8 that Υy Dv = (vy v1, vy v2, . . . , vy vk)

where the vy vy entry is omitted. Therefore, the assertion y arg maxj vj iﬀΥyz 0 follows immediately from the deﬁnition of and z. For the Furthermore part, note that by construction we have v = (0, v2 v1, v3 v1, . . . , vk v1) . By adding the constant v1 to all entries of v , we recover v. Thus, the indices maximizing v are the same as those of v = (v1, v2, . . . , vk) .

The ﬁrst part of Proposition 2.10 gives an intuitive explanation for our earlier deﬁnition of a multiclass sign . The second part of Proposition 2.10 gives a simple formula for computing the arg max from the discriminant function.

Remark 2.11 Proposition 2.10 can be seen as an equivalence between class-score functions and discriminant functions. Given a class-score function f : X Rk, we can derive a discriminant function g : X Rk 1 by deﬁning g(x) := Df(x) for all x X. Conversely, we can view the discriminant function g as the given , and derive a class-score function f by deﬁning f(x) := 0 g(x) for all x X. In both cases, arg max f(x) =

arg max 0 g(x) .

3. Classiﬁcation-calibration and Consistency

This section is a brief review of the core deﬁnitions and theorem from Tewari and Bartlett (2007) in preparation for our results on suﬃcient conditions for classiﬁcation-calibration. Below, we assume the probability setting introduced earlier in Section 1.2.

Deﬁnition 3.1 (Range and its convex hull) Let f : Rm Rn be a function. Denote by ran(f) := {f(x) : x Rm} the range of f, and ranc(f) := conv(ran(f)) the convex hull of the range of f.

Deﬁnition 3.2 (Tewari and Bartlett (2007)) A set S Rk + is classiﬁcation-calibrated if there exists a function12 θ : Rk [k] such that

inf{ p, ζ : ζ S : pθ(ζ) < max p} > infζ S p, ζ (3)

for all p k. A multiclass loss function L is classiﬁcation-calibrated if ranc(L) is classiﬁcation-calibrated.

Intuitively, Deﬁnition 3.2 says that the lowest achievable conditional risk when predicting the wrong label (Eqn. (3) LHS) is still strictly larger than the conditional Bayes risk (Eqn. (3) RHS). Deﬁne arg max : Rk [k] by arg max(v) = min{i [k] : vi = maxj [k] vj}. When L is permutation equivariant and classiﬁcation-calibrated, we can take θ in Deﬁnition 3.2 to be simply arg max. See Tewari and Bartlett (2007, Lemma 4). As alluded to earlier, the signiﬁcance of Deﬁnition 3.2 is demonstrated by the following theorem:

12. The function θ is called a calibrated link for S.

Unified Margin-Based Classification

Theorem 3.3 (Tewari and Bartlett 2007) Let L be a PERM loss. Let F be the set of all Borel functions X Rk. If L is classiﬁcation-calibrated, then L has the consistency transfer property: For all sequence of function classes {Fn}n such that Fn F, S n Fn = F, all ˆfn Fn and all probability distributions P on X [k]

RL,P ( ˆfn) P inff RL,P (f) implies R01,P (arg max ˆfn) P infh R01,P (h)

where the inﬁmums are taken over all Borel functions f : X Rk and h : X [k], respectively.

Remark 3.4 In applications, ˆfn is often taken to be an L-risk empirical minimizer over a training dataset of cardinality n. However, the above property holds for any sequence of functions ˆfn Fn.

Checking that a multiclass loss L is classiﬁcation-calibration is a non-trivial task, requiring delicate analysis of the loss function (Tewari and Bartlett, 2007). Recently, Wang and Scott (2023a) established a suﬃcient condition of classiﬁcation-calibration for Gamma-Phi losses (Example 2). Below, we derive a suﬃcient conditions for general PERM losses.

4. Regular PERM losses

To demonstrate the utility of the relative margin form and Theorem 2.5, we prove in this section Theorem 4.7, a suﬃcient condition for classiﬁcation-calibration (CC) of multiclass loss functions. To this end, we begin with the necessary deﬁnitions.

Deﬁnition 4.1 A function f : Rn R is

1. coercive if for all c R, the set {v Rn : f(v) c} (i.e., the c-sublevel set) is bounded, 2. semi-coercive if for all c R there exists b R such that

{v Rn : f(v) c} {v Rn : b minj [n] vj}.

The deﬁnition of a coercive function is well-studied (Boyd and Vandenberghe, 2004). However, semi-coercivity appears to be new. Intuitively, a function is semi-coercive if, for all c R, its c-sublevel set is contained in a translation of the positive orthant.

Deﬁnition 4.2 (Regular PERM loss) Let L be a PERM loss with template ψ. We say that L is regular if ψ is nonnegative, twice diﬀerentiable, strictly convex, semi-coercive, and the gradient ψ(z) 0 is entrywise negative for all z Rk 1.

Below, we give a suﬃcient condition for a Gamma-Phi loss (Example 2) to be a regular PERM loss. But ﬁrst, we discuss the intuition behind the condition ψ(z) 0. The condition ψ(z) 0 is reminiscent of a condition in Bartlett et al. (2006, Theorem 6), which shows that in the binary case a convex margin loss ψ is classiﬁcation-calibrated if and only if ψ is diﬀerentiable at 0 and ψ(0) < 0 where the overdot denotes taking derivative of a univariate function. If we view ψ( ) as a vector ﬁeld on R, then this condition on the derivative can be stated as the vector ﬁeld ψ( ) near 0 points toward the positive half of the real line .

Wang and Scott

positive orthant

Figure 2: Vector ﬁeld ψ(z) where ψ is the template of the cross entropy (Example 1). The contour lines are shaded according to the value of the function ψ (darker is larger). See Figure 1.

The condition ψ(z) 0 can be thought of in a similar fashion. See Figure 2 for a plot of the vector ﬁeld ψ(z) where ψ ψCE is the template of the cross entropy (Example 1), and where k = 3. Note that the vectors point toward the shaded region whose interior is the positive orthant in R2. However, Deﬁnition 4.2 requires the strict negativity of the gradient for all of Rk 1, whereas the derivative of ψ is only required to be negative at 0 in the binary case in Bartlett et al. (2006, Theorem 6). Next, we deﬁne the truncation of a PERM loss L. Intuitively, if k 3 is the number of classes and L is k-ary, then the truncation of L results in a (k 1)-ary loss.

Proposition 4.3 (Truncation) Assume k 3. Let L : Rk Rk 0 be a PERM loss with template ψ : Rk 1 R. Deﬁne trunc[ψ](w) := limλ ψ(λ, w) for all w Rk 2. If L is regular (Deﬁnition 4.2), then trunc[ψ] is a well-deﬁned symmetric function Rk 2 R (i.e., all limits exist in R), referred to as the truncation of ψ.

Proof Let h(λ) := ψ(λ, w). The condition that ψ( ) 0 implies that h is decreasing as a function of λ. Moreover, h is nonnegative since ψ is nonnegative. Thus, limλ + h(λ) exists. The symmetry of trunc[ψ] follows immediately from the symmetry of ψ.

By Theorem 2.5, the symmetric function trunc[ψ] induces a unique loss function which we call the truncation of L. Note that if ψCE(z) = log(1+exp( z1)+exp( z2)) is the template of the cross entropy with k = 3 (Example 1), then trunc[ψCE](w) = limλ ψCE(λ, w) = log(1 + exp( w)), which is just the binary cross entropy/logistic loss. Next, we deﬁne the m-fold iterated truncation of the template of a PERM loss.

Corollary 4.4 For each m {0, 1, . . . , k 2}, deﬁne trunc m[ψ] to be m-fold repeated applications of trunc to ψ, i.e., trunc m[ψ] := trunc[ trunc[trunc[ψ]] ] where trunc appears m-times. By convention, let trunc 0[ψ] = ψ. Moreover, for each n {2, . . . , k}, deﬁne the notational shorthand ψ(n) := trunc (k n)[ψ]. If, for n {2, . . . , k 1}, ψ(n+1) : Rn R is a symmetric function such that the associated PERM loss, denoted L(n+1), is a regular PERM loss, then ψ(n) : Rn 1 R is a symmetric function.

The n-ary truncated loss captures the behavior of ψ when the ﬁrst k n inputs to the template ψ approach + . Note that if ψCE is the template of the k-ary cross entropy, then trunc (k n)[ψCE] is the template of the n-ary cross entropy. This follows from a calculation similar to the one right before Corollary 4.4. Next, we deﬁne totally regular PERM loss which generalizes the above observation:

Deﬁnition 4.5 (Totally regular PERM loss) Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. For each n {2, . . . , k}, let ψ(n) : Rn 1 R be as in

Unified Margin-Based Classification

Corollary 4.4 and let L(n) be the unique13 PERM loss associated to ψ(n). We say that L is a totally regular PERM loss if L(n) is regular for each n {2, . . . , k}.

Before proving our main result, we note that Gamma Phi losses form a large family of totally regular PERM losses:

Proposition 4.6 Let γ, φ : R R be functions and Lγ,φ be the associated Gamma-Phi loss as deﬁned in Example 2. Suppose that all of the following holds

1. γ is a twice diﬀerentiable function such that γ 0, dγ

dt > 0 and d2γ

dt2 0 on all of R,

2. φ is a twice diﬀerentiable function such that dφ

dt < 0, d2φ

dt2 > 0 on all of R. Moreover, φ(t) 0 as t + . Then Lγ,φ is totally regular.

We now state our main result:

Theorem 4.7 If L is totally regular, then L is classiﬁcation-calibrated.

In view of Theorem 4.7, the assumptions of Proposition 4.6 turns out to be a sufﬁcient condition for a Gamma-Phi loss to be classiﬁcation-calibrated. Theorem 4.7 and Proposition 4.6 together recover the result that the coherence loss (Zhang et al., 2009) is classiﬁcation-calibrated. However, this suﬃcient condition is subsumed by a previous result in Wang and Scott (2023a, Theorem 3.3). Nevertheless, that Lγ,φ is totally regular has nontrivial consequences as we will see next. First, new totally regular PERM losses can be constructed from existing ones:

Proposition 4.8 Let L and L : Rk Rk be totally regular PERM losses and λ > 0 be a number. Then λL and L + L are also totally regular.

Thus, λL and L+L are both classiﬁcation-calibrated by Theorem 4.7. Next, we apply this result to sums of Gamma-Phi losses:

Example 5 (Sum of Gamma-Phi losses) Consider two Gamma-Phi losses each satisfying the assumptions of Proposition 4.6. For concreteness, take the cross entropy LCE and the multiclass exponential loss LExp (See Example 2). Then by Proposition 4.8, the loss 1 2(LCE +LExp) is totally regular and thus classiﬁcation-calibrated by Theorem 4.7. Note that the sum of two Gamma-Phi losses need not be another Gamma-Phi losses. Thus, Theorem 4.7 combined with Proposition 4.8 gives the ﬁrst suﬃcient condition for when the sum of classiﬁcation-calibrated losses are again classiﬁcation-calibrated, in the multiclass case14.

In the next section, we will apply Theorem 4.7 to obtain a suﬃcient condition for classiﬁcation-calibration of Fenchel-Young losses (Theorem 5.10).

13. The existence and uniqueness is guaranteed by Theorem 2.5. 14. Note that the situation is much simpler in the binary case, due to the result of Bartlett et al. (2006, Theorem 6) that convex margin loss ψ is classiﬁcation-calibrated if and only if ψ is diﬀerentiable at 0 and ψ(0) < 0 where the overdot denotes taking derivative of a univariate function.

Wang and Scott

5. Fenchel-Young losses: suﬃcient conditions for classiﬁcation-calibration

In this section, we consider a subset of PERM losses called the Fenchel-Young losses:

Deﬁnition 5.1 (Blondel et al. 2020) Let Ω: k R be a continuous function and µ R 0. Deﬁne cy := µ(1(k) e(k) y ). The Fenchel-Young loss associated to Ωand µ is the loss function LΩ,µ : Rk Rk 0 whose y-th component is given by

LΩ,µ y (v) := maxp k Ω(p) + Ω(e(k) y ) + v + cy, p e(k) y . (4)

The maximization in Eqn. 4 is essentially the Fenchel conjugate15 of Ω. See Deﬁnition 5.4.

It is not obvious that Fenchel-Young losses are indeed PERM losses. We prove this fact in Proposition I.1 in the appendix.

Deﬁnition 5.2 (Negentropy) A function Ω: k R is a negentropy if :

1. Ωis closed (maps closed sets to closed sets) and convex, 2. Ωis symmetric, i.e., Ω(Sσp) = Ω(p) for all p k and σ Sym(k), 3. Ω(p) 0 for all p k and Ω(e(k) i ) = 0 for all i [k].

The negative entropy is more convenient to work with due to its convexity. We use the term negentropy in the interest of brevity. This term has been previously used in statistics (Hyv arinen and Oja, 2000) and in machine learning (Mensch et al., 2019).

Example 6 When Ω(p) := Pk y=1 py log(py) is the negative Shannon entropy and µ = 0, we have that LΩ,0 is the cross entropy (Example 1).

Below, we use the notation that p = [p1, . . . , pk 1] Rk 1 is the sub-vector of p with the last element dropped. Later in Proposition I.1, we show that the Fenchel-Young loss Eqn. (4) is a PERM loss with template

ψΩ,µ(z) := maxp k Ω(p) + µ1 p p, z . (5)

Remark 5.3 Blondel et al. (2020) allow the vector cy Rk to be arbitrary, in which case the resulting loss is known as cost-sensitive Fenchel-Young loss. However, known calibration results (Blondel, 2019; Nowak-Vila et al., 2019) are only for cy as in Deﬁnition 5.1.

In order to state our suﬃcient condition for classiﬁcation-calibration (Theorem 5.10), we brieﬂy review two key notions from the theory of convex analysis and Legendre transformation following Rockafellar (1970, Section 26):

Deﬁnition 5.4 (Convex conjugates) Let D Rn be a closed convex set. Let f : D R be a function. Deﬁne D := {y Rn : supx D y, x f(x) < }. The convex conjugate of a function f : D R is the function f : D R given by f (y) = supx D y, x f(x).

Deﬁnition 5.5 (Convex functions of Legendre type) Let D Rn be a closed convex set. A convex function f : D R is said to be of Legendre type if

15. Also known as the convex conjugate.

Unified Margin-Based Classification

1. C := int(D) is an open convex subset of Rn, 2. f is strictly convex and diﬀerentiable on C, 3. for all {xi} i=1 C with limi xi bdry(D), we have limi f(xi) = + .

For example, when D = k and f = H is the negative Shannon entropy, then f : D R 0 is of Legendre type. See the paragraph immediately following Blondel et al. (2020, Deﬁnition 3).

Deﬁnition 5.6 (Regular negentropies) A negentropy Ω: k R is a regular negentropy if Ωis of Lengedre type and twice diﬀerentiable.

The term negentropy was previously used by Mensch et al. (2019). To the best of our knowledge, the term regular negentropy is new.

Proposition 5.7 (Truncation of a negentropy) Assume k 3. Let pad : k 1 k be the zero-padding operator, i.e., pad(q) := [0, q1, . . . , qk 1] k for all q = [q1, . . . , qk 1] k 1. Let Ω: k R be a negenetropy. Deﬁne the truncation of Ω, denoted trunc[Ω] : k 1 R, by trunc[Ω]( ) := Ω(pad( )). Then trunc[Ω] is a negentropy.

The following is analogous to the earlier iterated truncation construction in Corollary 4.4:

Corollary 5.8 For each m {0, 1, . . . , k 2}, deﬁne trunc m[Ω] to be m-fold repeated applications of trunc to Ω, i.e., trunc m[Ω] := trunc[ trunc[trunc[Ω]] ] where trunc appears m-times. By convention, let trunc 0[Ω] = Ω. Moreover, for each n {2, . . . , k}, deﬁne the notational shorthand Ω(n) := trunc (k n)[Ω]. Then Ω(n) : n R is a negentropy.

Deﬁnition 5.9 (Totally regular negentropy) Let Ω: k R be a negentropy. We say that Ωis a totally regular negentropy if Ω(n) is a regular negentropy for each n {2, . . . , k}.

To the best of our knowledge, the deﬁnition of a totally regular negentropy is new. As suggested by the name, totally regular negentropies induce Fenchel-Young losses that are totally regular. This is a critical step16 towards proving the main result of this section below. The result leverages Theorem 4.7 to establish a suﬃcient condition for the classiﬁcationcalibration of Fenchel-Young losses:

Theorem 5.10 Let Ωbe a totally regular negentropy, µ R 0 be ﬁxed, and LΩ,µ be the Fenchel-Young loss associated to Ωand µ. Then LΩ,µ is classiﬁcation-calibrated.

In light of Theorem 3.3, if Ωsatisﬁes the conditions of Theorem 5.10, then LΩ,µ satisﬁes the consistency transfer property (Deﬁnition 1.1). Theorem 5.10 recovers all known classiﬁcation-calibration suﬃcient conditions (Nowak-Vila et al., 2019; Blondel, 2019) which require that Ωbe strongly-convex whereas our result only requires strict convexity. The following proposition and example show that there is always a nonempty set of losses to which the result applies but the prior results don t.

16. See the proof of Theorem I.5 in the appendix.

Wang and Scott

Proposition 5.11 Let Ω: k R be a totally regular negentropy and g : R 0 R 0 be convex, twice diﬀerentiable and strictly increasing. Let u := (1/k)1(k) k be the uniform probability vector. Deﬁne Θ : k R by Θ(p) := g(Ω(p) Ω(u)) g( Ω(u)) for p k. Then Θ is a totally regular negentropy. Furthermore, if g (0) = 0, then Θ is not strongly-convex.

Example 7 For a concrete application of Proposition 5.11, take Ωto be the negative Shannon entropy (Example 6) and g(x) = x2 the square function. Then Θ as deﬁned in Proposition 5.11 is a totally regular negenetropy that is not strongly-convex. Moreover, LΘ,µ is classiﬁcation-calibration for any µ R 0 by Theorem 5.10.

Note that previous suﬃcient conditions (Nowak-Vila et al., 2019; Blondel, 2019) do not cover our example above.

6. Discussion and Future Directions

We proposed the relative margin form as a multiclass extension to the popular (binary) margin loss framework. We characterized the set of losses that can be expressed in the relative margin form the PERM losses. Our Theorem 4.7 generalizes to the multiclass case the seminal result of (Bartlett et al., 2006, Theorem 6). Central to our analysis is the matrix label code (Deﬁnition 2.4) which we expect to be useful beyond the scope here. Utilizing our framework, we extended the existing set of known suﬃcient conditions for the classiﬁcation-calibration of Fenchel-Young losses. We expect the framework to be useful for research on multiclass classiﬁcation, e.g., for H-consistency (Long and Servedio, 2013; Zhang and Agarwal, 2020; Awasthi et al., 2022). Characterization of classiﬁcation-calibration. One weakness of our work is that, compared to the characterization of CC in the binary case (Bartlett et al., 2006, Theorem 6), we are only able to prove an analogous result (Theorem 4.7) in the multiclass case under additional assumptions. Future work will investigate whether this can be improved. Cost-weighted classiﬁcation. Our results are limited to the traditional notion of classiﬁcation-calibration under the assumption that the cost of misclassifying all classes are the same. The more general notion of c-classiﬁcation-calibration described in Williamson et al. (2016) allows the cost of misclassiﬁcation to be class-dependent according to some class-speciﬁc weight vector c. An interesting future direction is to consider a setting where a general c-classiﬁcation-calibration suﬃcient condition similar to our Theorem 4.7 can be derived for non-symmetric losses. More speciﬁcally, given a classiﬁcation-calibrated PERM loss, is there a principled way to modify the loss to make it c-classiﬁcation calibrated? Combining losses. Hui et al. (2023) demonstrate the eﬀectiveness of summing existing losses (the cross entropy and the square loss) to design new losses for classiﬁcation. Unfortunately, our result in Proposition 4.8 does not apply to their new loss since the square loss is not totally regular. One approach may be to connect our technique with that of Williamson and Cranko (2023, 6) for studying the geometric properties of sums of losses. Beyond classiﬁcation. While this work is concerned with classiﬁcation-calibration, there are many works on calibration for more general discrete supervised learning tasks. Ramaswamy and Agarwal (2016) developed theory for multiclass classiﬁcation with abstain option and, more generally, losses deﬁned over ﬁnite sets i.e., discrete losses. Finocchiaro

Unified Margin-Based Classification

et al. (2019) showed that there exists polyhedral losses that are calibrated with respect to arbitrary discrete losses. An interesting question is if the PERM loss framework can useful for analyzing such surrogate losses. Beyond classiﬁcation-calibration, a potentially fruit direction is to employ the PERM loss framework to study optimization and learning theory.

Acknowledgments

The authors were supported in part by the National Science Foundation under awards 1838179 and 2008074, and by the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002. YW is supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Futures program.

Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Multi-Class H-Consistency Bounds. In Advances in Neural Information Processing Systems, 2022.

Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliﬀe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

Oscar Beijbom, Mohammad Saberian, David Kriegman, and Nuno Vasconcelos. Guessaverse loss functions for cost-sensitive multiclass boosting. In International Conference on Machine Learning, pages 586 594. PMLR, 2014.

M Beltagy and S Shenawy. On the boundary of closed convex sets in En. ar Xiv preprint ar Xiv:1301.0688, 2013.

Arnab Bhattacharyya, Suprovat Ghoshal, and Rishi Saket. Hardness of learning noisy halfspaces using polynomial thresholds. In Conference On Learning Theory, pages 876 917. PMLR, 2018.

David A Blackwell and Meyer A Girshick. Theory of games and statistical decisions. Courier Corporation, 1979.

Mathieu Blondel. Structured prediction with projection oracles. Advances in neural information processing systems, 32, 2019.

Mathieu Blondel, Andr e FT Martins, and Vlad Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1 69, 2020.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

Erin J Bredensteiner and Kristin P Bennett. Multicategory classiﬁcation by support vector machines. Computational Optimization: A Tribute to Olvi Mangasarian Volume I, pages 53 79, 1999.

Wang and Scott

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernelbased vector machines. Journal of machine learning research, 2(Dec):265 292, 2001.

Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3(Jan):951 991, 2003.

Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287 316. PMLR, 2014.

Ur un Dogan, Tobias Glasmachers, and Christian Igel. A uniﬁed view on multi-class support vector classiﬁcation. Journal of Machine Learning Research, 17(45):1 32, 2016.

John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classiﬁcation, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246 3275, 2018.

Richard O Duda, Peter E Hart, and David G Stork. Pattern classiﬁcation. John Wiley & Sons, 2006.

Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classiﬁcation: A risk minimization perspective. Advances in Neural Information Processing Systems, 29, 2016.

Jessica Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. Advances in Neural Information Processing Systems, 32, 2019.

Sheng Fu, Piao Chen, and Zhisheng Ye. Simplex-based proximal multicategory support vector machine. IEEE Transactions on Information Theory, 2022.

Simon I Hill and Arnaud Doucet. A framework for kernel-based multi-category classiﬁcation. Journal of Artiﬁcial Intelligence Research, 30:525 564, 2007.

Like Hui, Mikhail Belkin, and Stephen Wright. Cut your losses with squentropy. In Proceedings of the 40th International Conference on Machine Learning, pages 14114 14131. PMLR, 2023.

Aapo Hyv arinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411 430, 2000.

Tony Jebara and Pannagadatta Shivaswamy. Relative margin machines. Advances in Neural Information Processing Systems, 21, 2008.

Yuri Kalnishkan and Michael V Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8):1228 1244, 2008.

Carl Kesler. Preliminary Experiments on Perceptron: Application to Bubble Chamber Event Recognition. Cornell University, 1961.

Unified Margin-Based Classification

John M Lee. Smooth manifolds. In Introduction to smooth manifolds, pages 1 31. Springer, 2013.

Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classiﬁcation of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004.

Yufeng Liu. Fisher consistency of multicategory support vector machines. In Artiﬁcial intelligence and statistics, pages 291 298. PMLR, 2007.

Phil Long and Rocco Servedio. Consistency versus realizable h-consistency for multiclass classiﬁcation. In International Conference on Machine Learning, pages 801 809. PMLR, 2013.

Arthur Mensch, Mathieu Blondel, and Gabriel Peyr e. Geometric losses for distributional learning. In International Conference on Machine Learning, pages 4516 4525. PMLR, 2019.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, 2018.

Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, and Jean-Jeacques Slotine. Multiclass learning with simplex coding. Advances in Neural Information Processing Systems, 25, 2012.

Indraneel Mukherjee and Robert E Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 2013.

Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. A general theory for structured prediction with smooth convex surrogates. ar Xiv preprint ar Xiv:1902.01958, 2019.

Robert J Plemmons. M-matrix Characterizations. i. Nonsingular M-matrices. Linear Algebra and its applications, 18(2):175 188, 1977.

Guillaume Pouliot. Equivalence of multicategory SVM and simplex cone SVM: Fast computations and statistical theory. In International conference on machine learning, pages 4133 4140. PMLR, 2018.

Harish G Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397 441, 2016.

R Tyrrell Rockafellar. Convex analysis, volume 18. Princeton University Press, 1970.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. Advances in Neural Information Processing Systems, 16, 2003.

Mohammad Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. Advances in Neural Information Processing Systems, 24, 2011.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Wang and Scott

Ingo Steinwart. How to compare diﬀerent loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007.

Zhiqiang Tan and Xinwei Zhang. On loss functions and regret bounds for multi-category classiﬁcation. IEEE Transactions on Information Theory, 2022.

Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classiﬁcation methods. Journal of Machine Learning Research, 8(5), 2007.

Gertjan van den Burg and Patrick Groenen. Gen SVM: A generalized multiclass support vector machine. Journal of Machine Learning Research, 17:1 42, 2016.

Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Yutong Wang and Clayton Scott. Weston-Watkins hinge loss and ordered partitions. Advances in Neural Information Processing Systems, 33:19873 19883, 2020.

Yutong Wang and Clayton Scott. An exact solver for the Weston-Watkins SVM subproblem. In International Conference on Machine Learning, pages 10894 10904. PMLR, 2021.

Yutong Wang and Clayton Scott. On classiﬁcation-calibration of gamma-phi losses. In The Thirty Sixth Annual Conference on Learning Theory, pages 4929 4951. PMLR, 2023a.

Yutong Wang and Clayton Scott. Uniﬁed binary and multiclass margin-based classiﬁcation. ar Xiv preprint ar Xiv:2311.17778, 2023b.

Jason Weston and Chris Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.

Robert C Williamson and Zac Cranko. The geometry and calculus of losses. Journal of Machine Learning Research, 24(342):1 72, 2023.

Robert C Williamson, Elodie Vernet, and Mark D Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:1 52, 2016.

Tong Tong Wu and Kenneth Lange. Multicategory vertex discriminant analysis for highdimensional data. The Annals of Applied Statistics, 4(4):1698 1721, 2010.

Mingyuan Zhang and Shivani Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. Advances in Neural Information Processing Systems, 33:16927 16936, 2020.

Tong Zhang. Statistical analysis of some multi-category large margin classiﬁcation methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004.

Zhihua Zhang, Michael Jordan, Wu-Jun Li, and Dit-Yan Yeung. Coherence functions for multicategory margin-based classiﬁcation methods. In Artiﬁcial Intelligence and Statistics, pages 647 654. PMLR, 2009.

Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2(4):1290, 2008.

Unified Margin-Based Classification

Appendix A. Notational conventions and deﬁnitions

For the reader s convenience, we tabulate the notational convention used in this work in Table 1. Moreover, we tabulate the mathematical objects deﬁned in Table 2. Finally, we require the following additional notations: Permutations. A bijection from [k] to itself is called a permutation (on [k]). Recall we denote by Sym(k) the set of all permutations on [k]. Two permutations σ, σ Sym(k) can be composed resulting in another permtuation σσ Sym(k) deﬁned by function composition: for y [k], we have σσ (y) := σ(σ (y)). For i, j [k], let τ(i,j) Sym(k) denote the transposition which swaps i and j, leaving all other elements unchanged. More precisely, τ(i,j)(i) = j, τ(i,j)(j) = i and τ(i,j)(y) = y for y [k]\{i, j}. Deﬁne the notational shorthand τi := τ(k,i), the transposition that swaps k and i. Permutation matrices. Recall that for each σ Sym(k), Sσ Rk k denotes the permutation matrix corresponding to σ. In other words, if v Rk is a vector, then [Sσv]j = [v]σ(j) = vσ(j). Note that if σ, σ Sym(k), then Sσσ = Sσ Sσ where the order of compositions are reversed (Lemma B.6). Deﬁne the notational shorthand T(i,j) := Sτ(i,j) the matrix corresponding to the transposition of i and j. Likewise, deﬁne Ti := T(k,i).

When k is ambiguous, we disambiguate it with the superscript notation and write S(k) (i,j) and

T(k) (i,j). Topology. Let S be a subset of a topological space. Let int(S) and bdry(S) denote the interior and the boundary of the set S, respectively. See Table 1 for the full list of symbols.

Mathematical object Description of notation Example of notation

Set of all permutations Sym(k) Permutations Lower case sigma or tau σ, τ Transpositions tau with subscripts τ(i,j)

Vector Bold lower case v, w Entries of vector Normal font lower case v1, v2, . . .

Special vector Blackboard font All zeros/ones vector in Rn 0(n), 1(n)

i-th elem. basis vector in Rn e(n) i Matrix Bold upper case A j-th Column [A]:j n n Identity In

Table 1: Notation convention used throughout this work.

Appendix B. Properties of the matrix label code

As mentioned in the main text, the deﬁnition of matrix label code has already been introduced in Wang and Scott (2020, Deﬁnition S3.14). However, that work, the matrix label code is used to analyze a particular optimization problem resulting from the Weston-

Wang and Scott

Symbol Description Deﬁned in...

L Multiclass loss function Deﬁnition 2.2 ℓ Reduced form of L Deﬁnition 2.2 item 2 ψ Template of L Deﬁnition 2.2 item 3 Ω Negentropy Deﬁnition 5.2 LΩ,µ Fenchel-Young loss assoc. to Ω, µ Deﬁnition 5.1 trunc[ψ] truncation of ψ Proposition 4.3 ψ(n) trunc applied k n times to ψ Corollary 4.4 L(n) PERM loss corresponding to ψ(n) Deﬁnition 4.5 trunc[ℓ] truncation of ℓ Deﬁnition H.3 pad zero-padding operator Proposition 5.7 trunc[Ω] truncation of Ω Proposition 5.7 Ω(n) trunc applied k n times to Ω Corollary 5.8 D relative margins conversion matrix Deﬁnition 2.1 Mσ DSσD Deﬁnition B.4 Υ matrix label code Deﬁnition 2.4 T Transposition matrix Section A τ Transposition permutation Section A A(z) gradient of ℓ(z) Lemma G.6 P (resp. Q) drop last (resp. ﬁrst) coordinate Lemma G.25 (resp. G.26) ran( ) and ranc( ) range of convex hull of range Deﬁnition 3.1 ran c(f) and ran c(f) closure and interior of ranc(f) Deﬁnition G.27 N(ζ; S) Positive normals Deﬁnition G.35

Table 2: Mathematical objects and where they are deﬁned.

Watkins SVM (Weston and Watkins, 1998). Here, we develop a comprehensive theory for using matrix label code with PERM loss. With the exception of Lemma B.15, all results in this section have already appeared in Wang and Scott (2020, 2021) using slightly diﬀerent notation. For the reader s convenience, the proofs are omitted here but are all included in the ar Xiv version of this manuscript (Wang and Scott, 2023b).

First, we note that for y {1, . . . , k 1}, the matrix Υ(k) y acts on a vector z = (z1, . . . , zk 1) Rk 1 by

j [k 1], [Υ(k) y z]j :=

( zj zy : j = y zy : j = y. (6)

Remark B.1 Throughout this section, we consider k ﬁxed and write Υ1, , Υk instead of Υ(k) 1 , , Υ(k) k . Moreover, whenever Ty does not have a superscript, we implicitly assume that Ty = T(k) y . In some results, we will work with (k 1) (k 1) permutation matrices in which case we will write, for example, T(k 1) (y,y ) where y, y [k 1].

Unified Margin-Based Classification

Lemma B.2 For all y [k], we have DTy = Υy D. In particular, for all i > 1 and j [k 1], we have

( vy vj : y = j vy vk : y = j.

Remark B.3 Let D denote the Moore-Penrose inverse of D. Since D contains a copy of the identity matrix, D has full column rank. Hence, DD is the identity.

Deﬁnition B.4 For each σ Sym(k), deﬁne the (k 1) (k 1) matrix Mσ := DSσD .

Lemma B.5 For all y [k], we have Mτy = Υy.

Lemma B.6 If σ, σ Sym(k), then Sσσ = Sσ Sσ.

Lemma B.7 For all σ, σ Sym(k), we have Mσσ = Mσ Mσ.

Lemma B.8 For all y [k], Υ2 y is the identity.

Lemma B.9 Let y1, y2 [k 1] be distinct, i.e., y1 = y2. Then τy1τy2τy1 = τ(y1,y2), as elements of Sym(k). Moreover, Ty1Ty2Ty1 = T(y1,y2), as elements of Rk k.

Corollary B.10 Every σ Sym(k) can be written as a product σ = τy1τy2 τyn for some integer n 0 and yi [k 1] for each i [n].

Lemma B.11 For all y1, y2 [k 1], we have T(k 1) (y1,y2)D = DT(k) (y1,y2).

Lemma B.12 Let y1, y2 [k 1] be distinct. Then T(k 1) (y1,y2) = Υy1Υy2Υy1.

Lemma B.13 Let y, j [k]. Then

( T(k 1) (j,y) ΥjΥy : both y and j [k 1]

ΥjΥy : otherwise. (7)

Proposition B.14 For an arbitrary σ Sym(k 1), let σ Sym(k) denote the permutation that extends σ to [k] , i.e., σ (k) := k and σ (y) := σ(y) for y [k 1]. Then we have SσD = DSσ .

Lemma B.15 Let 1 := 1(k 1) denote17 the (k 1)-dimensional (column) vector of all ones. Let ey := e(k 1) y denote the (k 1)-dimensional y-th elementary basis (column) vector. Let y, j [k] be such that y = j. Then we have the following identities

Υy T(y,j)Υj = (1 + ej)(ej ey) if y, j [k 1], (8)

Υy Υj = (1 + ej)(ej) if y = k, (9)

Υy Υj = (1 + ey)(ey) if j = k. (10)

17. We drop the superscript (k 1) for convenience.

Wang and Scott

Appendix C. Proof of Theorem 2.5

We ﬁrst recall the deﬁnition of a template from Theorem 2.5. For the reader s convenience, we restate it as a deﬁnition below:

Lemma C.1 Let L be a PERM loss with template ψ. Let D be as in Deﬁnition 2.1. Then for all v Rk we have Lk(v) := [L(v)]k = ψ(Dv).

Proof Recall from Deﬁnition 2.1 that Dv = (vk v1, vk v2, . . . , vk vk 1) . Thus, the result now follows immediately from Deﬁnition 2.2.

For the reader s convenience, the following is a restatement of Theorem 2.5.

Theorem C.2 Let L : Rk Rk be a PERM loss with template ψ, and let v Rk and y [k] be arbitrary. Then ψ is a symmetric function and L can be expressed as

Ly(v) = ψ(Υy Dv). (11)

Conversely, given a symmetric function ψ : Rk 1 R, the function L = (L1, . . . , Lk) : Rk Rk deﬁned componentwise via eq. (11) is a PERM loss with template ψ.

Proof [Proof of Theorem C.2] We begin with the = direction of the proof. First, we check that ψ : Rk 1 R is symmetric, i.e., ψ(Sσz) = ψ(z) for all z Rk 1 and all σ Sym(k 1). Below, ﬁx such z and σ. Pick v Rk such that Dv = z. For instance, let v := z 0 . Deﬁne σ Sym(k) to be the permutation that extends σ to [k] as in Proposition B.14. Then

ψ(Sσz) = ψ(SσDv) = ψ(DSσ v) Proposition B.14

= [L(Sσ v)]k Deﬁnition 2.2 of the template ψ of a PERM loss

= [Sσ L(v)]k L is permutation equivariant (Deﬁnition 2.2)

= [L(v)]σ (k) Sσ is the permutation matrix of σ

= [L(v)]k Deﬁnition of σ in Proposition B.14

= ψ(Dv). Lemma C.1

Since by deﬁnition ψ(Dv) = ψ(z), we have by the above that ψ(Sσz) = ψ(z) for all z. This proves that ψ is symmetric. Next, we prove eq. (11), i.e., [L(v)]y = ψ(Υy Dv) for all y [k]. Again, this is a straight forward computation:

[L(v)]y = [L(v)]τy(k) = [Ty L(v)]k deﬁnitions of τy and Ty = [L(Tyv)]k L is permutation equivariant

= ψ(DTyv) Lemma C.1

= ψ(Υy Dv) Lemma B.2.

This proves eq. (11) and thus the = direction follows. The proof of the = direction, i.e., the Conversely part of the theorem, proceeds similarly. For the reader s convenience, it is omitted here but is included in the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Unified Margin-Based Classification

Lemma C.3 Let L : Rk Rk be a function such that Ty L(v) = L(Tyv) for all y [k] and v Rk. Then L is permutation-equivariant, i.e., L(Sσv) = SσL(v) for all y [k] and v Rk.

Proof For an arbitrary σ Sym(k), write σ = τy1τy2 τyn as a product of transposition where n 0 is an integer and yi [k 1] for each i [n]. This is possible because of Corollary B.10. By Lemma B.6, we have Sσ = Tyn Ty2Ty1. Thus

L(Sσv) = L(Tyn Ty1v) = Tyn L(Tyn 1 Ty1v) = = Tyn Ty1L(v) = SσL(v)

where = = denotes repeated application of L(Tyv) = Ty L(v) for all y [k].

Remark C.4 For convenience, we now summarize the relationship between L, ℓand ψ from Deﬁnition 2.2 in terms of the input to ψ, typically denoted below by z Rk 1. Let y [k] and z Rk 1 be arbitrary. Deﬁne v Rk be such that Dv = z, where D is as deﬁned in Deﬁnition 2.1. Since DD is the identity (Remark B.3), we can for instance let v = D z. Then we have ℓy(z) = [L(v)]y = ψ(Υyz). (12)

The ﬁrst equality is eq. (1) (rewritten using Dv = z) and the second equality is eq. (11). A useful corollary of the above identity is the following:

Sσℓ(z) = ℓ(Mσz), for all σ Sym(k). (13)

To prove eq. (13), we let v := D z as discussed earlier. Then

Sσℓ(z) 1= SσL(v) 2= L(Sσv) 3= ℓ(DSσv) 4= ℓ(DSσD z) 5= ℓ(Mσz).

For 1 and 3, we used eq. (11). For 2, we used permutation equivariance of L. For 4, we used the deﬁnition of v. For 5, we used Deﬁnition B.4 of Mσ.

Appendix D. Well-incentivized losses

This section discusses the well-incentivized property of a loss that should be satisﬁed for a multiclass classiﬁcation loss to be useful. This will be made clear after Remark D.2.

Deﬁnition D.1 (Well-incentivized losses) Let L be a multiclass loss function (Deﬁnition 2.2). We say that L is well-incentivized (resp. strictly well-incentivized) if for all v Rk and distinct y, j [k], vj vy (resp. vj < vy) implies Lj(v) Ly(v) (resp. Lj(v) > Ly(v)).

Remark D.2 Suppose that L is well-incentivized. Let v Rk and y arg maxj [k] vj. Then Ly(v) = minj [k] Lj(v).

Remark D.2 implies that for well-incentivized losses, the arg max predictor is correct, i.e., Ly(v) is minimized when the class score [v]y is the highest. Next, we deﬁne a condition on the template ψ such that the corresponding PERM loss is well-incentivized:

Wang and Scott

Deﬁnition D.3 A function f : Rn R is non-increasing (resp. decreasing) if for all v, w Rn such that v w (resp. v w) we have f(v) f(w) (resp. f(v) < f(w)).

Proposition D.4 Let f : Rn R be a continuously diﬀerentiable function. If f( ) 0 (resp. f( ) 0) then f is non-increasing (resp. decreasing).

Proposition D.5 Let L be a PERM loss whose template ψ is non-increasing (resp. decreasing). Then L is (resp. strictly) well-incentivized.

The proofs of Propositions D.4 and D.5 are straightforward calculations involving the Fundamental Theorem of Calculus. For the reader s convenience, they are omitted here but are included in the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Appendix E. Proof of Proposition 4.6

The proof is a straightforward calculation involving the chain rule and properties of the functions φ and γ. For the reader s convenience, it is omitted here but is included in the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Appendix F. Proof of Proposition 4.8

We ﬁrst prove that L+L is regular. Let ψ and ψ be the templates of L and L respectively. Then L + L has template ψ + ψ . All conditions in Deﬁnition 4.2 clearly holds, except for the semi-coercive property which we now check. Let c, b R be arbitrary and satisfy

{z Rn : ψ(z) c} {z Rn : b min j [n] zj}.

Since ψ and ψ are both non-negative, we have ψ(z) + ψ (z) c implies ψ(z) c. Thus

{z Rn : ψ(z) + ψ (z) c} {z Rn : ψ(z) c} {z Rn : b minj [n] zj},

which shows that ψ + ψ is semi-coercive, as desired. The fact that L + L is totally regular follows immediately from the deﬁnition of the truncation in Proposition 4.3. That λL is totally regular is completely straightforward and thus we omit its proof.

Appendix G. Properties of Regular PERM losses

In this section, we will prove several key properties of regular PERM losses which were introduced in Deﬁnition 4.2. Recall that a regular PERM loss has a template ψ such that ψ is nonnegative, twice diﬀerentiable, strictly convex, semi-coercive, and the gradient ψ(z) 0 is entrywise negative for all z Rk 1. Section G.1 focuses on consequences of the semi-coercivity condition. Sections G.2 and G.3 focus on consequences of the other aforementioned conditions. Finally, Section H presents the proof of our main result Theorem 4.7.

Unified Margin-Based Classification

G.1 Semi-coercive functions

In this section, we study the properties of a PERM loss L whose template ψ is semi-coercive (Deﬁnition 4.1).

Lemma G.1 Let L : Rk Rk + be a PERM loss whose template ψ is semi-coercive. Let ℓ be the reduced form of L. Then, for all ζ Rk, the set {z Rk 1 : ℓ(z) ζ} is bounded.

Proof Below, let z denote an arbitrary element of Rk 1. In set-builder notations, we write {z : condition} to mean {z Rk 1 : condition}. Now, by eq. (12), we have that ℓ(z) ζ if and only if ψ(Υyz) ζy for all y [k]. Thus, we get

{z : ℓ(z) ζ} = T y [k]{z : ψ(Υyz) ζy} = T y [k] Υy ({z : ψ(z) ζy}) (14)

For the last equality above, we used the fact that Υy = Υ 1 y (Lemma B.8) and that

{z : ψ(Υyz) ζy} = Υ 1 y ({z : ψ(z) ζy})

for all y [k]. Moreover, by the semi-coercivity assumption on ψ we have that for all y [k] there exists by R such that {z : ψ(z) ζy} {z : min z by}. Putting it all together, we have {z : ℓ(z) ζ} T y [k] Υy({z : min z by}) =: B (15)

Thus, it suﬃces to show that B is bounded. Below, we prove this. Since the empty set is bounded, we assume below that B is nonempty. Let z B Rk 1 be a ﬁxed arbitrary point. First, recall the inﬁnity-norm: z := max{|z i| : i [k 1]}. To prove that B is bounded, it suﬃces to show that there exists some number M such that z M. Note that we can express z alternatively as

z = max {| max z |, | min z |} , (16)

where max z := maxj [k 1] z j and min z := minj [k 1] z j. Deﬁne M1 = max{|by| : y [k]} and M2 = max{|by + bj| : y [k], j [k]}. Finally, deﬁne M = max{M1, M2}. We claim that z M. First, we note that min z bk. To see this, ﬁrst recall that Υk is the identity (Deﬁnition 2.4). From eq. (15) we have B Υy({z : min z by}) for each y [k]. Now, recall from Deﬁnition 2.4 that Υk is the identity matrix. Thus in particular B {z : min z bk} and so min z bk holds. Next, let y arg minj [k 1] z j and thus z y = min z . From eq. (15), we have z Υy({z : min z by}). Thus, Υyz {z : min z by} and in particular, [Υyz ]y by. Moreover, by eq. (6), we have [Υyz ]y = z y = min z , and thus min z by. Combining with the result from the previous paragraph, we now have that min z [bk, by] and, in particular, | min z | M1. Note that here, we used the fact that if α, β, γ R are such that γ [α, β], then |γ| max{|α|, |β|}. Next, let l arg maxj [k 1] z j (and y be still the same as before). First consider the case when l = y. Then z is a constant vector and z = | min z | in which case z M1 M holds.

Wang and Scott

Next, consider the case when l = y. Then we have [Υlz ]y = z l z y = (min z ) (max z ) by eq. (6). Similar as in the previous case, z B Υl({z : min z bl}) which implies that [Υlz ]y bl. Thus, max z min z bl (by + bl). Furthermore, max z min z bk. Thus, we ve shown that max z [bk, (by + bl)]. This implies that | max z | max{M1, M2} = M, where again we used the fact that γ [α, β] implies |γ| max{|α|, |β|}, for arbitrary α, β, γ R. To conclude, since | min z | M, by Equation (16), we have z M.

Proposition G.2 Let L be a nonnegative PERM loss with template ψ. If ψ is semi-coercive (Deﬁnition 4.1), then CL p is coercive for all p int( k).

Proof Let c R be arbitrary and S := {z Rk 1 : CL p (z) = p, ℓ(z) c} be the c-sublevel set. To show that CL p is coercive, we show that S Rk 1 is a bounded set. Observe that for all z S we have c p, ℓ(z) = P y [k] pyψ(Υyz) pyψ(Υyz) for all y [k]. We remark that the preceding inequality crucially uses ψ( ) 0. Thus, S T y [k]{z Rk 1 : ψ(Υyz) c/py}, where the right hand side is a bounded set (Lemma G.1). Hence, S is also bounded.

G.2 The link function

In this section, we study the set of minimizers of the conditional risk of a PERM loss L, i.e., the set arg minz Rk 1 CL p (z). When L is the multinomial cross entropy (Example 1), this argmin is a singleton set for all p int( k) and the mapping from int( k) p to this unique minimizer recovers the logit function. For a general loss L, this mapping is sometimes referred to as the link function (Nowak Vila et al., 2019; Williamson et al., 2016). See Deﬁnition G.5 below. This section will study the properties of the link function, culminating in a suﬃcient condition for when the link function is a bijection (Proposition G.9).

Proposition G.3 Let L be a nonnegative PERM loss with template ψ. If ψ is convex, then CL p is convex for all p k. Furthermore, if ψ is strictly convex, then CL p is strictly convex for all p int( k).

Proof Recall that CL p (z) = P y [k] pyψ(Υyz) where Υy is an invertible matrix by Lemma B.8. Thus, if ψ is (strictly) convex, then z 7 ψ(Υyz) is (strictly) convex for each y [k]. For each p k, CL p is a convex combination of convex function and is thus convex. Furthermore, if p int( k), then CL p is a convex combination of strictly convex function and is thus strictly convex. See Boyd and Vandenberghe (2004, Section 3.2.1) for instance.

An easy consequence of the above result is the following:

Corollary G.4 Let p int( k) be arbitrary and L be a nonnegative PERM loss whose ψ is semi-coercive. If ψ is convex, then the inﬁmum infz Rk 1 CL p (z) is attained. Furthermore, if ψ is strictly convex, then the inﬁmum is attained by a unique minimizer, i.e., arg minz Rk 1 CL p (z) is a singleton set.

Unified Margin-Based Classification

Proof By Proposition G.2, CL p is coercive. By Proposition G.3, CL p is strictly convex. By the Extreme Value Theorem, a continuous and coercive function has at least one global minimum. Furthermore, a strictly convex functions have at most one global minimum. For a reference of these result standards, see Boyd and Vandenberghe (2004, Section 4.2).

In view of Corollary G.4, we deﬁne:

Deﬁnition G.5 Let L be a PERM loss whose template ψ is nonnegative, strictly convex and semi-coercive. Deﬁne the link function link L : int( k) Rk 1 by letting link L(p) be the unique element of arg minz Rk 1 CL p (z).

In this section, we give a suﬃcient condition on L for link L of Deﬁnition G.5 to be a bijection. We will need the concept of an M-matrix, which is reviewed in Section K.1.

Lemma G.6 Let L : Rk Rk + be a regular PERM loss (Deﬁnition 4.2) with reduced form ℓand template ψ. For all z Rk 1, the (k 1) (k 1) matrix

A(z) := ℓ1(z) ℓk 1(z)

is a non-singular M-matrix with strictly positive diagonal elements. Furthermore, both A(z) and A(z) are strictly monotone.

While our usage of the operator is standard, for completeness we refer the reader to the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b) for our notations on the derivative operator .

The deﬁnition of a (strictly) monotone matrix is discussed in Section K.1. Proof First, we compute the gradient of ℓ= (ℓ1, . . . , ℓk) : Rk 1 Rk. For each y [k], we have by deﬁnition that ℓy(z) = ψ(Υyz) (eq. (12)). Thus, by the chain rule, we have

ℓy(z) = Υ y ψ(Υyz). (17)

Next, ﬁx y [k 1] and z Rk 1. Let w := ψ(Υyz). Then by the assumption that ψ( ) 0, we have w is a entrywise negative column vector, i.e., w 0. Moreover, by eq. (17) we have ℓy(z) = Υ y w. Thus, for each j [k 1], we have [Υ y w]j = ([Υy]:j) w. Recall from Deﬁnition 2.4 that

( e(k 1) j : j = y 1(k 1) : j = y , which implies that [Υ y w]j =

( wj : j = y P l [k 1] wl : j = y.

In particular, [Υ y w]j 0 for all j = y which proves that A(z) is a Z-matrix. Furthermore, note that the fact w < 0 and [Υ y w]y = P l [k 1] wl implies that the diagonals of A(z) are positive. Observe that Υ y w has the property that

|[Υ y w]y| = P l [k 1] wl > P l [k 1]:l =y wl = P l [k 1]:l =y |[Υ y w]l|.

Note that the strict inequality above follows from the fact that w 0 and so in particular wy > 0. This proves that A(z) is strictly diagonally dominant. By Corollary K.3, we

Wang and Scott

have that A(z) is a non-singular M-matrix. For the Furthermore part, we can apply Lemma K.5 since the diagonal elements of A(z) are positive. By Corollary K.6, A(z) is also strictly monotone.

Lemma G.7 Let L : Rk Rk + be a regular PERM loss. Let z Rk 1 and p k be arbitrary. Then z minimizes CL p if and only if

pk ψ(z) = A(z) p1 pk 1 (18)

where A(z) is as in Lemma G.6. Furthermore, CL p has a minimizer, then p int( k).

Proof Proposition G.3 asserts that CL p is convex. For a diﬀerentiable convex function, recall that the gradient-vanishing condition is necessary and suﬃcient for optimality of unconstrained optimization (Boyd and Vandenberghe, 2004). Thus, z minimizes CL p iﬀ

0 = CL p (z) = P j [k] pj ℓj(z) = pk ψ(z) + A(z) p1 pk 1 . (19)

which is equivalent to eq. (18). Next, for the Furthermore part, ﬁrst note that Lemma G.6 says A(z) is a non-singular M-matrix. Now, we ﬁrst show that pk = 0. If pk = 0, then Equation (18) reduces to 0 = A(z) p1 pk 1 . Since A(z) is non-singular, we must have p1 = = pk 1 = 0 as well which implies that p = 0. But this contradicts that p k. Thus, pk > 0 and so pk ψ(z) 0. From Lemma G.6, we have that A(z) is strictly monotone. Combined with eq. (18), we can conclude that py > 0 for each y [k 1] as well.

The Furthermore part of Lemma G.7 immediately implies the following.

Corollary G.8 If p k \ int( k), then arg minz Rk 1 CL p (z) = .

Proposition G.9 Let L be a regular PERM loss. Then link L (Deﬁnition G.5) is a bijection.

Proof First, we prove that link is injective. Suppose that p, q int( k) are such that link L(p) = link L(q) =: z. Then by eq. (18), we have that

ψ(z)A(z) 1 = p 1 k p1 pk 1 = q 1 k q1 qk 1 . (20)

Thus, (1 pk)/pk = (p1 + + pk 1)/pk = (q1 + + qk 1)/qk = (1 qk)/qk implies that pk = qk. Therefore, eq. (20) implies that py = qy for each y [k 1] as well. Thus, p = q which proves that link is injective. Next, we prove that link is surjective. Pick z Rk 1. From Lemma G.6, we have that A(z) is non-singular and strictly monotone. By non-singular-ness, there exists v Rk 1 such that ψ(z) = A(z) v. Furthermore, since ψ(z) 0 and A(z) is strictly monotone, we have v 0. Deﬁne p1, . . . , pk by pk := (v1 + + vk 1 + 1) 1 and py := vypk for each y [k 1]. Clearly, we have p 0. Furthermore,

p1 + p2 + + pk = pk(v1 + + vk 1 + 1) = 1.

Unified Margin-Based Classification

Thus, we have p int( k). By construction, z and p satisfy eq. (18). Thus link L(p) = z follows from the deﬁnition of link L.

Remark G.10 Before proceeding, we remark that Proposition G.9 gives theoretical support to the conjectural observation in Nowak-Vila et al. (2019, Remark 3.1) regarding the injectivity of the link function.

G.3 Geometry of the loss surface

Recall from Deﬁnition 3.1 and Theorem 3.3 that the classiﬁcation-calibration of the set ranc(L) implies the classiﬁcation-calibration of the loss L. In general, the set ranc(L) may be diﬃcult to compute. In this section, we study the geometry of the set ran(L) when L is a regular PERM loss which enables us to compute the convex hull ranc(L) of ran(L). One of the main tools is the mapping deﬁned below:

Corollary G.11 Let L be a regular PERM loss with reduced form ℓ. Then ℓ: Rk 1 Rk

is injective.

Proof Suppose that z, w Rk 1 are such that ℓ(z) = ℓ(w). By Proposition G.9, there exists p int( k) such that link L(p) = z. Now, p, ℓ(z) = p, ℓ(w) implies that both z, w minimize CL p . By Corollary G.4, we have z = w and so ℓis injective.

Deﬁnition G.12 Given a PERM loss L with reduced form ℓ, we deﬁne two functions F and G mapping from Rk 1 R to Rk by F(z, λ) = ℓ(z) + λ1 and G(z, t) = ℓ(z) + te(k) k , where z Rk 1 and λ R. For computing F , we view the tuple z, λ as a (column) vector z λ . Hence, F (z, λ) = ℓ(z) 1 . Likewise for the tuple z, t and G.

Remark G.13 In the context of Deﬁnition G.12, it is perhaps more precise to write F( z λ ). However, this notation is cumbersome. Below, we will always use the tuple notation. The reason we discuss the vector notation is so that F and G can be more simply treated as gradients of vector input-valued functions (rather than matrix input-valued).

Below, we will study the properties of the two functions from Deﬁnition G.12.

G.3.1 Properties of the F function

The main result of this section is to prove that F is a homeomorphism from Rk 1 R to Rk (Corollary G.19).

Lemma G.14 Let L be a regular PERM loss with reduced form ℓand F be as in Definition G.12. Then F is injective. In other words, if ℓ(z) + λ1 = ℓ(w) + µ1 for some z, w Rk 1 and λ, µ R, then both z = w and λ = µ.

Wang and Scott

Proof Let z, w Rk 1 and λ, µ R be as in the statement of the lemma. Our goal is to show that both z = w and λ = µ. First, consider the case that z = w. Then ℓ(z) = ℓ(w) and so λ = µ. Thus, z = w implies λ = µ. Next, consider the case that λ = µ. Then we have ℓ(z) = ℓ(w). By Corollary G.11, we have z = w. Therefore, λ = µ implies z = w. Thus, it only remains to show that if both z = w and λ = µ, then we have a contradiction. Without loss of generality, suppose that λ > µ. Then we have ℓ(z) + (λ µ)1 = ℓ(w). Thus, for all p k, we have p, ℓ(w) = p, ℓ(z) + (λ µ)1 > p, ℓ(z) . Thus, there does not exist p k such that w is the minimizer of CL p . But this contradicts since Proposition G.9 implies that link is surjective.

Lemma G.15 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Deﬁnition G.12. Then for all (z, λ) Rk 1 R, F (z, λ) is non-singular.

Proof Recall that ℓ(z) R(k 1) k and 1(k) Rk. Thus, F (z, λ) = ℓ(z) (1(k))

is a k k square matrix. To show that it is non-singular, ﬁrst pick v Rk arbitrary. It suﬃces to check that if F (z, λ)v = 0 then v = 0. Towards this, ﬁrst note that F (z, λ)v = 0 can be equivalently stated as both ℓ(z)v = 0 and (1(k)) v = v1 + + vk = 0. Replacing v by v if necessary, we assume that vk 0. Recall A(z) from Lemma G.6. Then the identity ℓ(z)v = 0 can be rewritten as vk ψ(z) = A(z) v1 vk 1 . Since A(z) is monotone and vk ψ(z) 0, we get that vy 0 for each y [k 1]. Combined with the fact that v 1 = v1 + + vk = 0, we have that v = 0, as desired.

Now, by applying the inverse function theorem18, we immediately have the following.

Corollary G.16 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Deﬁnition G.12. For all (z, λ) Rk 1 R, there exist open neighborhoods U (z, λ) and V F(z, λ) such that F|U : U V is a diﬀeomorphism.

Proposition G.17 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Deﬁnition G.12. The map F is a bijection.

Proof Lemma G.14 shows that F is injective. To show that F is surjective, we prove that ran(F) is both open and closed as a subset of Rk. This would imply that ran(F) = Rk

since the only subets of Rk that are both open and closed are and Rk. Now, Lemma G.16 shows that ran(F) is an open subset of Rk. It remains to prove that ran(F) is closed. To this end, consider a sequence {(z(i), λ(i))} i=1 such that F(z(i), λ(i)) = ℓ(z(i)) + λ(i)1 converges to ζ Rk. Our goal is to show that ζ ran(F). We begin by ﬁrst picking ϵ > 0. Let 1 := 1(k) (without the superscript (k)) denotes the k-dimensional vector of all ones. Since the sequence converges, there exists M such that

ζ ϵ1 ℓ(z(i)) + λ(i)1 ζ + ϵ1

18. The inverse function theorem is a standard result in multivariate calculus. See the ar Xiv version of this manuscript (Wang and Scott, 2023b) for the result s statement and a textbook reference.

Unified Margin-Based Classification

for all i M. Before proceeding, we prove a helper lemma.

Lemma G.18 (Helper lemma) Let L be a PERM loss with reduced form ℓand template ψ such that ψ( ) 0. Then for all z Rk 1, we have that miny [k 1] ℓy(z) ψ(0(k 1)) =: C, where 0(k 1) is the (k 1)-dimensional all-zeros vector.

Proof [Proof of helper lemma] Let v Rk be such that Dv = z, where we recall from Deﬁnition 2.1 that D = Ik 1 1(k 1) . As in Remark C.4, we can take v = z 0 . Next, let y arg maxj [k 1] vj. Recall from Section A that min( ) over vector-valued inputs denotes entrywise minimum. Then by Remark D.2 and Proposition D.5, we have that min(L(z)) = Ly(v). Next by eq. (12) from Remark C.4, we have [L(v)]y = ℓy(z). Let w := σy(v). (Recall from Section A that σy Sym(k) is the transposition that swaps k and y.) Note that by construction we have k arg max w. By permutation-equivariance, we have [L(v)]y = [L(v)]σy(k) = [L(σy(v))]k = [L(w)]k. Again by eq. (12) from Remark C.4, we have [L(w)]k = ψ(Dw). Since k arg max w, we have that Dw Rk 1 0 belongs to the non-negative orthtant. In other words, Dw 0(k 1). Finally, since ψ( ) 0, we have that ψ(Dw) ψ(0(k 1)).

We now return to the proof of the proposition. Let C be as in the helper lemma. Then

min(ℓ(z(i)) + λ(i)1) = min(ℓ(z(i))) + λ(i) C + λ(i).

Thus, we have min(ζ) ϵ C + λ(i) and which implies that λ(i) C + ϵ min(ζ) =: D. From this, we get that

ℓ(z(i)) ζ + ϵ1 λ(i)1 ζ + (ϵ + D)1

Thus, for all i M we have z(i) {z Rk 1 : ℓ(z) ζ + (ϵ + D)1} which is a bounded set by Lemma G.1. By passing to a subsequence, we may assume that z(i) converges to some z Rk 1. Thus, we have λ(i)1 converges to ℓ(z ) + ζ, which implies in particular that λ(i)

converges to some λ . Putting it all together, we have shown that F(z(i), λ(i)) converges to ζ = F(z , λ ) and so ran(F) is closed.

Corollary G.19 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Deﬁnition G.12. The map F is a diﬀeomorphism, i.e., F is a diﬀerentiable bijection with a diﬀerentiable inverse. In particular, F is a homeomorphism.

Proof Lee (2013, Proposition 4.6 (f)) states that every bijective local diﬀeomorphism is a (global) diﬀeomorphism. Thus, the result follows in view of the facts that F is a bijection (Proposition G.17) and that F is a local diﬀeomorphism (Corollary G.16).

Proposition G.20 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Deﬁnition G.12. Consider arbitrary v, x Rk and t R. Deﬁne19 α(t) Rk 1 and

19. Note that by Corollary G.19, such α(t) and β(t) exist and are unique with respect to this property.

Wang and Scott

β(t) R to be such that tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1(k). Then for all t R, we have

1. α : R Rk 1 and β : R R are diﬀerentiable,

2. If v 0, then β (t) = dβ

dt (t) > 0,

3. β is concave, i.e., β (t) = d2β

Remark G.21 We note that α and β in Proposition G.20 implicitly depend on x and v.

Proof Below, let 1 := 1(k) be the k-dimensional vector of all ones. To prove the ﬁrst part, ﬁrst note that (α(t), β(t)) = F 1(tv + x). Hence, α and β are diﬀerentiable. Next, we prove the second part. Let y [k] be arbitrary, to be speciﬁed later. Now, the y-th coordinate of tv + x = ℓ(α(t)) + β(t)1 is

vyt + xy = ℓy(α(t)) + β(t). (21)

Diﬀerentiating (21) on both sides with respect to t and applying the chain rule20, we get

vy = α (t) ℓy(α(t)) + β (t). (22)

For convenience, we adopt the notation f ( ) := ( f( )) for functions with a scalar input. Now, we claim that α (t) ℓy(α(t)) 0 for some y [k]. In fact, we prove this claim by proving a slightly more general statement that will be used again later.

Lemma G.22 Let L be a regular PERM loss with reduced form ℓ. Let z, w Rk 1 be arbitrary. Then there exists y [k] such that w ℓy(z) 0.

Proof Suppose for the sake of contradiction that w ℓy(z) > 0 for all y [k]. Then

0 ℓ1(z) ℓk 1(z) ℓk(z) w = A(z) ψ(z) w.

In other words, we have A(z) w 0 and w ψ(z) > 0. By Lemma G.6, A(z) is strictly monotone . Thus, A(z) w 0 implies21 w 0. But ψ(z) 0 by assumption that L is a regular PERM loss. Hence, w ψ(z) < 0, a contradiction.

Applying Lemma G.22 with w = α (t), we get the desired claim. Now, pick y [k] such that α (t) ℓy(α(t)) 0. Thus, from Eqn. (22) we have β (t) = vy α (t) ℓy(α(t)) > 0. This proves the second part of Proposition G.20. Finally, we prove the last part of Proposition G.20. Pick y [k] (unrelated to the earlier choice) such that α (t) ℓy(α(t)) 0. Such a y [k] exists by Lemma G.22 by setting w = α (t). Diﬀerentiating eq. (22) with respect to t, we get by the product rule for curves22 that 0 = α (t) ℓy(α(t)) + α (t) d

dt( ℓy(α(t))) + β (t)

20. The chain rule is a standard result in multivariate calculus. See the ar Xiv version of this manuscript (Wang and Scott, 2023b) for the result s statement and a textbook reference. 21. This is the deﬁnition of strictly monotone matrices . See Section K.1 22. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Unified Margin-Based Classification

By the chain rule for curves23, we have d dt( ℓy(α(t))) = 2 ℓy(α(t))α (t) which combined with the above implies

β (t) = α (t) ℓy(α(t)) + α (t) 2 ℓy(α(t))α (t).

Since ℓy is convex, 2 ℓy(α(t)) is positive semideﬁnite. Therefore, α (t) 2 ℓy(α(t))α (t) 0. Putting it all together, β (t) 0, i.e., β is concave.

G.3.2 Properties of the G function

Lemma G.23 Let L be a regular PERM loss with reduced form ℓ. Let G be as in Deﬁnition G.12. Then for all (z, t) Rk 1 R, the gradient (matrix) G(z, t) is non-singular.

Proof The proof proceeds similarly as in Lemma G.15. Recall that ℓ(z) R(k 1) k

and e(k) k Rk. Below, we suppress this superscript and simply write ek := e(k) k Thus,

G(z, λ) = ℓ(z) e k is a k k square matrix. To show that it is non-singular, ﬁrst pick v Rk arbitrary. It suﬃces to check that if G(z, λ)v = 0 then v = 0. Towards this, ﬁrst note that G(z, λ)v = 0 can be equivalently stated as both ℓ(z)v = 0 and e k v = vk = 0. Recall A(z) from Lemma G.6. Then the identity ℓ(z)v = 0 can be rewritten as 0 = vk ψ(z) = A(z) v1 vk 1 . Since A(z) is non-singular, we get that vy = 0 for each y [k 1]. Thus, we have that v = 0, as desired.

Lemma G.24 Suppose that A Rn n and v Rn are such that 1) A is strictly monotone (Deﬁnition K.1), 2) v 0 has strictly negative entries, 3) the matrix A v 1 1 is invertible,

and 4) M := A v 0 1 A v 1 1 1 R(n+1) (n+1) is also invertible. Then Mn+1,n+1 > 0, i.e., the bottom right entry of M is positive.

Proof Deﬁne B Rn n, w, u Rn and c R such that A v 1 1 1 = B w u c . Then by deﬁnition of the matrix inverse, we have In 0 0 1

= AB + vu Aw + cv 1 B + u 1 w + c

Now, we observe that

M = A v 0 1

= AB + vu Aw + cv u c

Since M is invertible by assumption, we have that c = 0. To ﬁnish the proof, it suﬃces to show that c > 0. Below, we assume that c < 0 and derive a contradiction. The top right block of eq. (23) implies that Aw + cv = 0, i.e., A( w) = cv. Since c < 0 and v 0, we have A( w) 0. By strict monotonicity of A, we have that

23. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Wang and Scott

w 0, or equivalently, w 0. Finally, the bottom right entry of eq. (23) implies that 1 = 1 w + c < 0, which is a contradiction.

Lemma G.25 Let L be a regular PERM loss with reduced form ℓand z Rk 1 be arbitrary. Deﬁne P to be the projection Rk Rk 1 that drops the last coordinate, i.e., P([v1, . . . , vk] ) = [v1, . . . , vk 1] . Then there exists z Rk 1 and t R>0 such that Pℓ(z ) + t 1 = P(ℓ(z)).

Proof Deﬁne functions ζ : Rk 1 R Rk 1 and τ : Rk 1 R R by ζ(z, t) τ(z, t) := F 1 G(z, t).

We ﬁrst show that τ(z, 0) = 0. Observe that

F 1(G(z, 0)) = F 1(ℓ(z) + 0 ek) = F 1(ℓ(z) + 0 1) = z 0 . (24)

Therefore, ζ(z, t) = z and τ(z, 0) = 0. Next, we claim that τ

t (z, 0) = 0. By deﬁnition,

F 1 G(z, t) = ζ

z(z, t) τ z(z, t) ζ

Let be a notational shorthand for the input (z, t) to F and G. The gradient of F 1 G

F 1 G( ) = G( ) F 1(G( )) = G( )( F (F 1 G( ))) 1

Below, let ek := e(k) k , i.e., we drop the superscript. Now, from the proof of Lemma G.23, we

have that G(z, t) = ℓ(z) ek and from the proof of Lemma G.15 that F (z, t) = ℓ(z) 1(k) . By eq. (24), we have

G(z, 0)( F (F 1 G(z, 0))) 1 = G(z, 0)( F (z, 0)) 1

The above is invertible because G(z, 0) is invertible by Lemma G.23. Moreover,

G(z, 0)( F (z, 0)) 1 = ℓ(z) e k

ℓ(z) (1(k))

= A(z) ψ(z) (0(k 1)) 1

A(z) ψ(z) (1(k 1)) 1

By eq. (25), τ

t (z, 0) is the bottom right element of G(z, 0)( F (z, 0)) 1. By applying Lemma G.24 to the RHS of eq. (27) with A = A(z) and v = ψ(z), we see that the bottom right entry of G(z, 0)( F (z, 0)) 1 is nonzero. This proves that τ

t (z, 0) = 0. Note that Lemmas G.6 and G.23 together guarantee that the requirements of Lemma G.24 are all met. Next, we claim that there exists some t R such that τ(z, t ) > 0. To see this, assume the contrary. Then τ(z, t) 0 for all t R. In particular, t = 0 is a (global) maximizer of t 7 τ(z, t) which implies that τ

t (z, 0) = 0, a contradiction. Thus, the claim follows.

Unified Margin-Based Classification

Now, ﬁx t R such that τ(z, t ) > 0. Deﬁne z := ζ(z, t ) and t = τ(z, t ). Then by the deﬁnition of ζ and τ, we have F(z , t ) = G(z, t ) by applying F to both side of

(z ) t = ζ(z, t ) τ(z, t ) = F 1(G(z, t )).

Unwinding the deﬁnition of F and G (Deﬁnition G.12), the identity F(z , t ) = G(z, t ) implies ℓ(z ) + t 1 = F(z , t ) = G(z, t ) = ℓ(z) + t ek.

Now, applying P( ) to both side, we have P(ℓ(z ) + t 1) = P(ℓ(z) + t ek) = P(ℓ(z)) where we used the fact that P( ) is linear and that P(ek) = 0.

Corollary G.26 Let L be a regular PERM loss with reduced form ℓand z Rk 1 be arbitrary. Deﬁne Q to be the projection Rk Rk 1 that projs the ﬁrst coordinate, i.e., Q([v1, . . . , vk] ) = [v2, . . . , vk] . Then there exists z Rk 1 and t R>0 such that Q(ℓ(z ) + t 1) = Qℓ(z).

Proof Recall P from Lemma G.25. Let σ Sym(k) be such that Q = PSσ. Recall Mσ from Deﬁnition B.4. Then Sσℓ(z) = ℓ(Mσz) by eq. (13). Applying Lemma G.25 to Mσz, there exists z Rk 1 and t R>0 such that Pℓ(Mσz) = P(ℓ( z ) + t 1). Putting it all together, we have

Qℓ(z) = PSσℓ(z) = Pℓ(Mσz) = P(ℓ( z ) + t 1).

Next, Q = PSσ implies QSσ 1 = P. Hence, again applying eq. (13), we have

P(ℓ( z ) + t 1) = QSσ 1(ℓ( z ) + t 1) = Q(ℓ(Mσ 1 z ) + t 1).

Letting z := Mσ 1 z , we have Qℓ(z) = Q(ℓ(z ) + t 1), as desired.

Deﬁnition G.27 Let f : Rm Rn be a function. Deﬁne the following sets:

1. ran c(f) := {ζ + λ1 : ζ ran(f), λ [0, )}

2. ran c(f) := {ζ + λ1 : ζ ran(f), λ (0, )}

When f = ℓis the reduced form of a PERM loss, the above two sets are closely related to ranc(ℓ) (Deﬁnition 3.1), as the following lemma and Proposition G.31 show. These sets are convenient alternative characterizations.

Lemma G.28 Let L be a regular PERM loss with reduced form ℓ. Then we have the following:

1. ran(ℓ) is closed. 2. ran c(ℓ) is closed and bdry(ran c(ℓ)) = ran(ℓ). 3. ran c(ℓ) = int(ran c(ℓ)) and bdry(ran c(ℓ)) = ran(ℓ).

Wang and Scott

Proof For item 1, deﬁne C := Rk 1 {0} which is a closed subset of Rk 1 R. Now, note that ran(ℓ) = F(C) where F is as in Deﬁnition G.12. Since F is a homeomorphism (Corollary G.19), F(C) is closed as well. For item 2, deﬁne D := Rk 1 [0, ) which is a closed subset of Rk 1 R. Then ran c(ℓ) = F(D). Thus, as in the previous case ran c(ℓ) is closed. Next, we have

bdry(ran c(ℓ)) = bdry(F(D))=F(bdry(D)) = F(C) = ran(ℓ),

where the second equality from the left follows from F being a homeomorphism. For item 3, let E = Rk 1 (0, ). Then similar to the above, we have

int(ran c(ℓ)) = int(F(D)) = F(int(D)) = F(E) = ran c(ℓ).

To conclude, note that bdry(ran c(ℓ)) = bdry(F(E)) = F(bdry(E)) = F(C) = ran(ℓ).

Proposition G.29 Let L be a regular PERM loss. Then ran c(ℓ) is convex.

Proof Let ζ, ξ ran c(f). Write ζ = ℓ(z) + λ1 and ξ = ℓ(w) + µ1, where z, w Rk 1 and λ, µ [0, ). Let v = ξ ζ Rk and x = ζ. Take α and β as deﬁned in Proposition G.20, i.e., we have for all t R that

tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1. (28)

Plugging in t = 0 into (28), we get ζ = ℓ(α(0)) + β(0). Thus, α(0) = z and β(0) = λ. Likewise, plugging in t = 1, we get α(1) = w and β(1) = µ. In particular, we have β(0) 0 and β(1) 0. By Proposition G.20, β is concave. Thus, β(t) 0 for all t [0, 1], i.e.,

tv + w = tξ + (1 t)ζ = ℓ(α(t)) + β(t)1 ran c(ℓ)

for all t [0, 1]. This proves that ran c(ℓ) is convex.

The following result is a restatement of Beltagy and Shenawy (2013, Theorem 9):

Theorem G.30 (Beltagy and Shenawy (2013)) Let C be a nonempty closed convex subset of Rn. If C contains no hyperplane, then C = conv(bdry(C)).

Proposition G.31 Let L be a regular PERM loss with reduced form ℓ. Then ranc(ℓ) = ran c(ℓ) and bdry(ranc(ℓ)) = ran(ℓ).

Proof Clearly, ran c(ℓ) is nonempty. Furthermore, by Lemma G.28 and Lemma G.29, ran c(ℓ) is closed and convex. Next, note that ran c(ℓ) lies in the nonnegative quadrant [0, )k. Since no hyperplane lies entirely inside the nonnegative quadrant, ran c(ℓ) cannot contain any hyperplane. Hence, we have veriﬁed that ran c(ℓ) satisﬁes the condition of Theorem G.30. To ﬁnish the proof, we have

ranc(ℓ) = conv(ran(ℓ)) Deﬁnition of ranc(ℓ) (29)

= conv(bdry(ran c(ℓ))) Lemma G.29 (30)

= ran c(ℓ) Theorem G.30 (31)

Unified Margin-Based Classification

This proves the ﬁrst part. For the second part, note that bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ) by Lemma G.29.

Before moving on, we summarize the results on ranc(ℓ) we have thus obtained below:

Corollary G.32 Let L be a regular PERM loss with reduced form ℓ. Recall from Deﬁnition G.27 ran c(ℓ) := {ζ + λ1 : ζ ran(ℓ), λ (0, )}.

Then ranc(ℓ) is a closed and convex set with the following properties:

1. ranc(ℓ) = {ζ + λ1 : ζ ran(ℓ), λ [0, )}

2. int(ranc(ℓ)) = ran c(ℓ) (see Deﬁnitions 3.1 and G.27)

3. bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ).

We state one more result about the set ran c(ℓ) which will be useful later.

Lemma G.33 Let L be a regular PERM loss with reduced form ℓ. Then ran c(ℓ) = {ζ Rk : z Rk 1 such that ζ ℓ(z)}.

Proof Recall that by deﬁnition we have ran c(ℓ) = {ℓ(z) + λ1 : z Rk 1, λ (0, )}. Thus, the direction is immediate. For the other inclusion, take ζ = ℓ(z) + v where z Rk 1 and v 0. Let x = ℓ(z). Take α and β as deﬁned in Proposition G.20, i.e., we have for all t R that

tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1. (32)

Plugging in t = 0 into (32), we get x = ℓ(z) = ℓ(α(0))+β(0). Thus, α(0) = z and β(0) = 0. Recall that v = ζ ℓ(z) 0 by assumption. Hence, by Proposition G.20, β is strictly increasing. In particular, β(1) > β(0) = 0. Now, plugging in t = 1 into (32), we get v + x = v + ℓ(z) = ζ = ℓ(α(1)) + β(1)1. This shows that ζ ran c(ℓ), as desired.

Remark G.34 From basic point-set topology24, we know that a closed set is the union of its interior and its boundary. Thus, ranc(ℓ) = int(ranc(ℓ)) bdry(ranc(ℓ)). Hence, a consequence of Lemma G.33 and Lemma G.28 is that ranc(ℓ) is precisely the superprediction set of ℓ(see Williamson et al. (2016, Deﬁnition 15) and Kalnishkan and Vyugin (2008)):

ranc(ℓ) = {ζ Rk : z Rk 1 such that ζ ℓ(z)}.

Recall (Tewari and Bartlett, 2007, Deﬁnition 5):

Deﬁnition G.35 (Admissible sets; Tewari and Bartlett (2007)) Let S Rk + be a set and ζ Rk +. Deﬁne the set N(ζ; S) := {p k : ξ ζ, p 0, ξ S}. We say that S is admissible if for all ζ bdry(S) and p N(ζ; S) we have arg min(ζ) arg max(p).

24. The boundary of a set A is deﬁned as the closure of A minus the interior of A.

Wang and Scott

Proposition G.36 (Tewari and Bartlett (2007)) Let S Rk + be a symmetric set. If |N(ζ; S)| = 1 for all ζ bdry(S), then S is admissible.

Lemma G.37 Let L be a regular PERM loss with reduced form ℓ. Let ζ ran(ℓ). Then we have N(ζ; ran(ℓ)) = N(ζ; ranc(ℓ)) = N(ζ; ran c(ℓ)).

Proof We ﬁrst prove that N(ζ; ran(ℓ)) = N(ζ; ranc(ℓ)). Since ranc(ℓ) ran(ℓ), we immediately have N(ζ; ran(ℓ)) N(ζ; ranc(ℓ)). For the other inclusion, we ﬁrst note that ranc(ℓ) = ran c(ℓ) by Proposition G.31. Thus, every ξ ranc(ℓ) can be written as ξ = α+β1 for some α ran(ℓ) and β 0. Now, let p N(ζ; ran(ℓ)) and let ξ ranc(ℓ) be decomposed as in the preceding sentence. Then

ξ ζ, p = α + β1 ζ, p = α ζ, p + β 1, p 0

where the last inequality holds since (1) α ζ, p 0 because p N(ζ; ran(ℓ)), and (2) β 0. Hence, such a p satisﬁes ξ ζ, p 0, ξ ranc(ℓ) as well which implies that p N(ζ; ranc(ℓ)), as desired. Next, we prove N(ζ; ranc(ℓ)) = N(ζ; ran c(ℓ)). Again, since ranc(ℓ) ran c(ℓ), we immediately have N(ζ; ran c(ℓ)) N(ζ; ranc(ℓ)). For the other inclusion, we ﬁrst note that clos(ran c(ℓ)) = ranc(ℓ). Suppose p k is such that ξ ζ, p 0 for al ξ ran c(ℓ). Then by continuity, we must have that ξ ζ, p 0 for all ξ clos(ran c(ℓ)) = ranc(ℓ).

Proposition G.38 Let L be a regular PERM loss with reduced form ℓ. Then ranc(ℓ) and ran c(ℓ) are both admissible.

Proof By Proposition G.36, it suﬃces to check the following two claims hold:

1. for all ζ bdry(ranc(ℓ)) we have |N(ζ; ranc(ℓ))| = 1, and 2. for all ζ bdry(ran c(ℓ)) we have |N(ζ; ran c(ℓ))| = 1. By Corollary G.32, we have bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ). Hence, by Lemma G.37, to show both above claims it suﬃces to show that |N(ζ; ran(ℓ))| = 1 for all ζ ran(ℓ). Note that here we can replace N(ζ; ranc(ℓ)) and N(ζ; ran c(ℓ)) by N(ζ; ran(ℓ)) because of Lemma G.37. Below, ﬁx ζ = ℓ(z) ran(ℓ) where z Rk 1 is arbitrary. Then

N(ζ; ran(ℓ)) = {p k : ξ ζ, p 0, ξ ran(ℓ)} Deﬁnition G.35

= {p k : ℓ(w) ℓ(z), p 0, w Rk 1} Corollary G.11

= n p k : z arg minw Rk 1 CL p (w) o deﬁnition of arg min

= n p int( k) : z arg minw Rk 1 CL p (w) o Corollary G.8

= {p int( k) : z = link L(p)} Deﬁnition G.5 and Corollary G.4

By Proposition G.9, link L is an injection. Thus, | p int( k) : z = link L(p) | = 1.

Unified Margin-Based Classification

Appendix H. Proof of Theorem 4.7

Lemma H.1 Suppose that k 3 and y {2, . . . , k}. Let Q be as in Corollary G.26. Then QΥ(k) y = Υ(k 1) y 1 Q.

The proof of Lemma H.1 is similar to that of Lemma B.2 and is thus omitted here. For the proof, see the ar Xiv version of this work (Wang and Scott, 2023b).

Lemma H.2 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let z Rk 1 and w Rk 2 be arbitrary and such that Qz = w. Then limλ ψ(z + λe(k 1) 1 ) = trunc[ψ](w).

Proof Write z = [z1, . . . , zk 1] and w = [w1, . . . , wk 2]. Since Qz = w, we have z2 = w1, z3 = w2 and so on. Now, deﬁne g : R R by g(λ) := ψ(z + λe(k 1) 1 ) = ψ(z1 + λ, z2, . . . , zk 1) and h : R R by

h(λ) := ψ(λ, w) = ψ(λ, w1, w2, . . . , wk 2) = ψ(λ, z2, z3, . . . , zk 1) = g(λ z1).

As argued in the proof of Proposition 4.3, h is decreasing and nonnegative. Thus, g is also decreasing and nonnegative. Moreover, limλ g(λ) = limλ g(λ z1) = limλ h(λ). The right hand side is equal to trunc[ψ](w), as in the proof of Proposition 4.3

Earlier in Proposition 4.3 and Corollary 4.4, we deﬁned trunc[ψ], ψ(n) and L(n). We now deﬁne the analogous notation for the reduced form ℓ:

Deﬁnition H.3 (Truncation of ℓ) Assume k 3. Let L : Rk Rk 0 be a regular PERM loss with template ψ : Rk 1 R. As in Proposition 4.3, deﬁne trunc[ℓ] to be the reduced form of trunc[L] (whose template is trunc[ψ]).

Lemma H.4 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Deﬁnition H.3. Let Q be as in Corollary G.26. Let z Rk 1 and x Rk 0 be arbitrary. For brevity, let e1 := e(k 1) 1 . Then we have lim λ + Q (ℓ(z + λe1) + x) = trunc[ℓ](Qz) + Qx (33)

and Q (ℓ(z) + x) trunc[ℓ](Qz). (34)

Proof Throughout this proof, let y {2, . . . , k} be arbitrary. First, we claim that Υ(k) y e1 = e1 for each y {2, . . . , k}. If y = k, then this is clearly true since Υ(k) k is, by deﬁnition, the identity matrix. Now, for y {2, . . . , k 1}, recall that Υ(k) y is deﬁned as the matrix obtained by replacing the y-th column of the identity matrix with the all 1 s vector. Thus,

Wang and Scott

since y > 1, the ﬁrst column of Υ(k) y is equal to that of the identity matrix. Therefore, Υ(k) y e1 = e1 as well. Next, still assuming y {2, . . . , k 1}, we have

ℓy(z + λe1) = ψ(Υ(k) y (z + λe1)) = ψ(Υ(k) y z + λe1).

Hence, we have

lim λ + ℓy(z + λe1) = lim λ + ψ(Υ(k) y z + λe1) Theorem C.2.

= trunc[ψ](QΥ(k) y z) Proposition 4.3 and Lemma H.2

= trunc[ψ] Υ(k 1) y 1 Qz Lemma H.1

= [trunc[ℓ]]y 1(Qz) Theorem C.2 and explanation below.

Application of Theorem C.2 to the last equality requires a bit more explanation. For said equality, we used the fact that trunc[ℓ] is the reduced form of L(k 1) (Deﬁnition H.3), whose template is trunc[ψ]. Thus, applying Equation (12) (which is a collary of Equation (11) from Theorem C.2) to trunc[L], we have [trunc[ℓ]]y 1(z) = trunc[ψ](Υ(k 1) y 1 z). Thus,

lim λ + Q (ℓ(z + λe1) + x) = trunc[ℓ](Qz) + Qx.

Next, for every y {2, . . . , k}, we note that the function

gy(λ) := ℓy(z + λe1) = ψ(Υyz + λe1)

is strictly decreasing. To see this, by the chain rule for curves25, we have

g y(λ) = ψ(Υyz + λe1) e1 < 0.

Thus, ℓy(z) = gy(0) > limλ + gy(λ) = trunc[ℓ]y 1(Qz), which proves that

Q (ℓ(z) + x) trunc[ℓ](Qz) + Qx trunc[ℓ](Qz)

as desired.

Lemma H.5 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Deﬁnition H.3. Let Q be as in Corollary G.26. Then Q(ranc(ℓ)) ran c(trunc[ℓ]) and clos[Q(ranc(ℓ))] = ranc(trunc[ℓ]).

Proof Let C := Q(ranc(ℓ)) and take ζ C. We ﬁrst prove C ran c(trunc[ℓ]). By the characterization of ranc(ℓ) from Corollary G.32 item 1, there exists z Rk 1 and x [0, )k

such that ζ = Q(ℓ(z) + x). Now let z := Q(z). Applying Eqn. (34) from Lemma H.4, we get ζ = Q (ℓ(z) + x) trunc[ℓ]( z).

25. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b).

Unified Margin-Based Classification

In particular, by the characterization of ran c(trunc[ℓ]) from Lemma G.33, we have that ζ ran c(trunc[ℓ]). This proves that C ran c(trunc[ℓ]). Next, we prove clos[C] = ranc(trunc[ℓ]). We ﬁrst show that clos[C] ranc(trunc[ℓ]) by proving that every point ranc(trunc[ℓ]) is a limit point of C. Let ζ ranc(trunc[ℓ]). By the characterization of ranc(trunc[ℓ]) as in Corollary G.32, there exists z Rk 2 and x [0, )k 1 such that ζ = trunc[ℓ]( z) + x. Now, pick z Rk 1

and x [0, )k such that z = Qz and x = Qx. Applying Lemma H.4 Eqn. (33), we get that ζ is a limit point of S, which proves the desired claim. This proves that clos(C) ranc(trunc[ℓ]). By the ﬁrst part, we know that C ran c(trunc[ℓ]). By Corollary G.32, ran c(trunc[ℓ]) = int(ranc(trunc[ℓ])) ranc(trunc[ℓ]). Putting it all together, we have

C ran c(trunc[ℓ]) ranc(trunc[ℓ]) clos(C).

From Corollary G.32, we have that ranc(trunc[ℓ]) is closed. Since by deﬁnition clos(C) is the smallest closed set containing C, we get that ranc(trunc[ℓ]) = clos(C), as desired.

Theorem H.6 (Blackwell and Girshick (1979)) Let C Rn be a convex set. Then int(C) = int(clos(C)).

Proposition H.7 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Deﬁnition H.3. Let Q be as in Corollary G.26. Then we have Q(ranc(ℓ)) = ran c(trunc[ℓ])

Proof For brevity, let C := Q(ranc(ℓ)). By Corollary G.32, ranc(trunc[ℓ]) is convex. Since convexity is preserved under projection, we have that C is convex as well. Now,

int(C) = int(clos(C)) Theorem H.6 (35)

= int(ranc(trunc[ℓ])) Lemma H.5 (36)

= ran c(trunc[ℓ]) Lemma G.28 (37)

C Lemma H.5 (38)

Since C int(C) by deﬁnition, we conclude that C = ran c(trunc[ℓ]).

Proposition H.8 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Deﬁnition H.3. Let Q be as in Corollary G.26. Then we have Q(ran c(ℓ)) = ran c(trunc[ℓ]).

Proof By the preceding Proposition H.7, we have Q(ranc(ℓ)) = ran c(trunc[ℓ]). Since ran c(ℓ) ranc(ℓ) we have Q(ran c(ℓ)) Q(ranc(ℓ)). Thus, to prove the result we only have to show Q(ran c(ℓ)) ran c(trunc[ℓ]). To this end, let trunc[ℓ](w) ran c(trunc[ℓ]) and z ranc(ℓ) be such that Qℓ(z) = trunc[ℓ](w). By Corollary G.26, there exist z Rk 1 and t R such that t > 0

Wang and Scott

and Q(ℓ(z ) + t 1) = Qℓ(z) = trunc[ℓ](w). Since ℓ(z ) + t 1 ran c(ℓ), we get that trunc[ℓ](w) Q(ran c(ℓ)) as desired.

Deﬁnition H.9 For each m {0, 1, . . . , k 2}, deﬁne trunc m[ℓ] to be m-fold repeated applications of trunc to ℓ, i.e., trunc m[ℓ] := trunc[ trunc[trunc[ℓ]] ] where trunc appears m-times. By convention, let trunc 0[ℓ] = ℓ. Moreover, for each n {2, . . . , k}, deﬁne the notational shorthand ℓ(n) := trunc (k n)[ℓ].

Remark H.10 It follows tautologically from the deﬁnition of ψ(n) in Corollary 4.4, that deﬁne ℓ(n) is the reduced form of L(n) (whose template is ψ(n)).

Next, let m 1 be an integer. Below, let Q m denote the m-fold iterated composition of Q. In other words, Q m := Q Q repeated m times.

Proposition H.11 Suppose that L is totally regular. Then Let m {1, . . . , k 2}. Then Q m(ranc(ℓ)) = ran c(trunc m[ℓ]).

Proof We prove by induction. The case when m = 1 is simply Proposition H.8. Now, suppose that the result holds for m where 1 < m < k 2. Then

Q m+1(ranc(ℓ)) = Q(Q m(ranc(ℓ)))

= Q(ran c(trunc m[ℓ])) Induction hypothesis

= ran c(trunc[trunc m[ℓ]]) Proposition H.8

= ran c(trunc m+1[ℓ]).

This completes the induction step and the desired result follows.

Before presenting the proof of Theorem 4.7, we recall (Tewari and Bartlett, 2007, Theorem 7):

Theorem H.12 (Tewari and Bartlett (2007)) Let S Rk + be a symmetric convex set. Then S is classiﬁcation calibrated if and only if 1. S is admissible and 2. Q m(S) is admissible for all m {1, . . . , k 2}.

Finally, we conclude with the Proof of Theorem 4.7 Assume that we are in the situation stated at the beginning of Section H. Let S = ranc(ℓ(k)). By Theorem H.12, it suﬃces to prove that 1. S is admissible and 2. Q m(S) is admissible for all m {1, . . . , k 2}. First, from Proposition G.38, S = ranc(ℓ(k)) is admissible. Next, for each m {1, . . . , k 2}, we have by Proposition H.11 that Q m(ranc(ℓ(k))) = ran c(ℓ(k m)). Since L(k m) is a regular PERM loss with reduced form ℓ(k m), we have again by Proposition G.38 that ran c(ℓ(k m)) is admissible.

Unified Margin-Based Classification

Appendix I. Classiﬁcation-Calibration of Fenchel-Young losses

The goal of this section is two fold. The ﬁrst subsection (Section I.1) presents the proof of Theorem 5.10 on a suﬃcient condition for the classiﬁcation-calibration of Fenchel-Young losses. The second subsection (Section I.2) shows the existence of a totally regular negentropy that is strictly convex, but not strongly convex. To prepare for the later proofs, we introduce a more convenient parametrization of the the negenetropy and its Fenchel-Young loss. Deﬁne the reduced k-probability simplex as

k := { p := (p1, . . . , pk 1) [0, 1]k 1 : Pk 1 i=1 pi 1}. (39)

In other words, k is simply k without the ﬁrst coordinate. To every function Ω: k R with domain on the k-simplex, we deﬁne a corresponding function Ω: k R called the reduced form of Ω, deﬁned by

Ω( p) := Ω p1, . . . , pk 1, 1 Pk 1 i=1 pi , for all p = (p1, . . . , pk 1) k. (40)

Equation (40) induces a one-to-one correspondence between functions Ω: k R on the simplex k and functions Ω: k R on the reduced simplex k.

I.1 Proof of Theorem 5.10

Before proceeding with the proof, we establish two key results. Recall that k is deﬁned in Eqn. (39) and Ωin Eqn. 40.

Proposition I.1 Let Ωbe a negentropy (Deﬁnition 5.2) and µ R. Then the Fenchel Young loss LΩ,µ associated to Ωand µ (Deﬁnition 5.1) is a PERM loss that is closed, convex, and non-negative. The template ψΩ,µ of LΩ,µ is semi-coercive and is given by

ψΩ,µ(z) = max p=[p1,...,pk 1] k Ω( p) + µ1 p p, z . (41)

Furthermore, if Ωis a regular negentropy, then L is a regular PERM loss.

Remark I.2 We note that results from Blondel et al. (2020) may be used to prove portions of Proposition I.1. Combining their Proposition 1Order preservation part with the expression immediately following their Deﬁnition 2 can be used to prove that LΩ,µ is permutation equivariant. Moreover, the expression in our Eqn. (41) is related to their Eqn. (15). However, there are key technical diﬀerences since we work with the (k 1-dimensional) relative margin, while Blondel et al. (2020) use the class-score formulation. For this reason and for the reader s convenience, we prove Proposition I.1 without leveraging their results.

Proof In this proof, all elementary basis vectors are implicitly assumed to be k-dimensional, i.e., we write ey instead of e(k) y . First, recall that the Fenchel conjugate of a closed convex function is again closed convex (Rockafellar, 1970). Next, we show that L is permutation equivariant. First, we show that L is symmetric. By Lemma C.3, it suﬃces to prove the claim that L(Tjv) = Tj L(v) for all j [k] and v Rk. To this end, let y [k] be arbitrary. Recall that the Fenchel-Young loss is deﬁned as

[L(v)]y = maxp k Ω(p) + cy, p ey + v, p ey (42)

Wang and Scott

[Tj L(v)]y = [L(v)]τj(y) = max p k Ω(p) + v + cτj(y), p eτj(y) deﬁnition of L

= max p k Ω(p) + Tj(v + cτj(y)), Tj(p eτj(y)) Tj is an isometry

= max p k Ω(p) + Tjv + cτj(τj(y)), Tjp eτj(τj(y))

= max p k Ω(p) + Tjv + cy, Tjp ey τj τj is the identity

= max p k Ω(Tjp) + Tjv + cy, Tjp ey Ωis symmetric

= max p k Ω(p) + Tjv + cy, p ey) Tj| k is a bijection

= [L(Tjv)]y deﬁnition of L.

Since y is arbitrary, we have proven the claim. Next, we show that L is relative margin-based, i.e., eq. (42) depends on v only through Dv. Recall that Dv = vk v1 vk v2 vk vk 1 . See Deﬁnition 2.1. Now, only the v, p ey term of eq. (42) depends on v. Thus, it suﬃces to prove that this term only depends on Dv. To this end, observe that

v, p = p1v1 + + pkvk = p1v1 + + pk 1vk 1 + (1 (p1 + + pk 1))vk = vk (p1(vk v1) + + pk 1(vk vk 1)) = vk p, Dv

where we write p to denote the vector [p1, . . . , pk 1] . Thus, for each y [k 1], we have

v, p ey = v, p vy = vk vy p, Dv = [Dv]y p, Dv

This shows that L is relative margin-based. Now the template of L is

ψ(z) = ℓk(z) = maxp k Ω(p) + ck, p ek p, z .

Since ek k and [ek]y = 0 for y [k 1], we have by construction that

ψ(z) Ω(e1) + c1, e1 e1 0, z = 0.

When ck = µ(1 ek), we have

ψ(z) = max p k Ω( p) + µ1 p p, z , for all z Rk 1. (43)

Finally, we prove that ψ is semi-coercive (Deﬁnition 4.1), i.e., for all c R there exists b R such that {z Rk 1 : ψ(z) c} {z Rk 1 : b min z}. To this end, let c R and z Rk 1 be such that c ψ(z). Let j arg min z. Then since ej = ek 1 j k, we have

c ψ(z) = supp k Ω(p) p, z Ω(ej) + µ ej, z zj = min z.

Thus, we have shown that {z Rk 1 : ψ(z) c} {z Rk 1 : min z c}.

Unified Margin-Based Classification

Next, we prove the Furthermore part. By the ﬁrst part, it remains to show that ψ is strictly convex, twice diﬀerentiable and ψ(z) 0 for all z Rk 1. Deﬁne ϕ( p) := Ω( p) µ1 p. Then ϕ : k R is also of Legendre type. Moreover, note that

ψ(z) = max p k p, z Ω( p) + µ1 p Eqn. (43)

= max p k p, z ϕ( p) = ϕ ( z) deﬁnition of Fenchel conjugate

Recall the following fundamental theorem regarding convex conjugates (Rockafellar, 1970).

Theorem I.3 (Rockafellar (1970)) If (C, f) is a convex function of Legendre type, then (C , f ) is a convex function of Legendre type. The map f : C C is a homeomorphism and f = ( f) 1.

By Theorem I.3, we have

1. The function ϕ , and hence ψ, is of Legendre type. In particular, ψ is strictly convex.

2. The derivative ϕ : int( k) Rk 1 is a bijection and the derivative of ϕ satisﬁes ϕ = ( ϕ) 1 : Rk 1 int( k).

It follows that if ϕ is twice diﬀerentiable, then so is ϕ . Finally, by the chain rule, we have ψ(z) = ϕ (z). Since ϕ (z) int( k) for all z, we have in particular that ϕ (z) 0. Thus, ψ(z) 0 for all z.

Proposition I.4 Let k 3 be an integer, Ω: k R a regular negentropy and µ 0. Let LΩ,µ be the Fenchel-Young loss corresponding to Ωand µ. Then trunc LΩ,µ = Ltrunc[Ω],µ.

Proof Throughout this proof, let Θ := trunc[Ω]. Let Ωand Θ be the reduced versions (see eq. (40)) of Ωand Θ, respectively. We remark that the notation used in this proof slightly departs from that of eq. (40) in the following way: Here, elements of k will be denoted as p instead of p. Likewise, elements of k 1 will be denoted as q rather than q. Moreover, throughout this proof, we abuse notation and denote by 1 either the (k 1) or the (k 2)-dimensional vector. The dimension of 1 will be clear from the context. Now, from eq. (41) in Proposition I.1, recall that the template of LΩ,µ is

ψΩ,µ : Rk 1 R, where ψΩ,µ(z) = max p=[p1,...,pk 1] k Ω(p) + µ1 p z, p . (44)

Similarly, the template of LΘ,µ is

ψΘ,µ : Rk 2 R, where ψΘ,µ(w) = max q=[q1,...,qk 2] k 1 Θ(q) + µ1 q w, q . (45)

By the deﬁnition of the truncation of a PERM loss (given in Proposition 4.3), the template of trunc[LΩ,µ] is trunc[ψΩ,µ]. Since PERM losses are uniquely deﬁned by their templates (Theorem 2.5), the result of Proposition I.4 can be equivalently stated as trunc[ψΩ,µ] = ψΘ,µ. Below, our proof will focus on proving this identity.

Wang and Scott

Both trunc[ψΩ,µ] and ψΘ,µ have Rk 2 as domain. Fix w Rk 2 arbitrarily. Our goal is to show that trunc[ψΩ,µ](w) = ψΘ,µ(w). Pick z Rk 1 such that Qz = w. For instance, we can pad w with a zero, i.e., take z := [0, w] = pad(w). By Lemma H.2, we have

trunc[ψΩ,µ](w) = lim λ + ψΩ,µ(z + λe(k 1) 1 )

To simplify notations, let e1 := e(k 1) 1 . In view of eq. (44) and eq. (45), establishing our goal, i.e., the identity trunc[ψΩ,µ](w) = ψΘ,µ(w), is equivalent to proving

lim λ + max p k Ω(p) + µ1 p z + λe1, p = max q k 1 Θ(q) + µ1 q Qz, q . (46)

For brevity, we deﬁne

g(λ) := max p k Ω(p) + µ1 p z + λe1, p . (47)

For each λ R, ﬁx arbitrarily an element pλ arg maxp k Ω(p) + µ1 p z + λe1, p . We note that the maximization is over a compact domain with a continuous objective. We will prove the result via a series of claims. Claim 1: g : R R+, deﬁned at eq. (47), is monotone non-increasing. Claim 2: For all λ R, g(λ) Θ(bq) + µ1 bq w, bq . Claim 3: p1 = 0. Claim 4: limt g(λt) = Ω(bq) + µ1 bq w, bq

The proofs of these claims proceed via straightforward but tedious computations and applications of inequalities. They are omitted here but appear in the ar Xiv version of this work (Wang and Scott, 2023b). Now we ﬁnish the proof of eq. (46), which we argued earlier suﬃces to establish Proposition I.4. By Claims 1 and 4, we have that limλ + g(λ) = Θ(bq) + µ1 bq w, bq . But limλ + g(λ) is exactly the left hand side of eq. (46), by the construction of g in eq. (47). By the deﬁnition of bq, moreover we have that Θ(bq) + µ1 bq w, bq is precisely the right hand side of eq. (46).

Theorem I.5 Let Ω: k R be a negentropy (Deﬁnition 5.2). For each n {2, . . . , k}, let Ω(n) be the n-ary truncated negentropy of Ω(Deﬁnition 5.9). Let LΩ,µ be the Fenchel Young loss corresponding to Ωand µ (Deﬁnition 5.1). Likewise, for each n {2, . . . , k}, let L(n) be the n-ary truncation of LΩ,µ (Deﬁnition 4.5). Then L(n) = LΩ(n),µ. In other words, the n-ary truncated loss L(n) of LΩ,µ is equal to the Fenchel-Young loss of the n-ary truncated negentropy Ω(n) and µ.

The proof of Theorem I.5 is omitted here since it largely mirrors that of Proposition H.11. See the ar Xiv version of manuscript (Wang and Scott, 2023b) for the proof.

Proof [Proof of Theorem 5.10] By Theorem 4.7, it suﬃces to show that LΩ,µ is totally regular (Deﬁnition 4.5). Unwinding the deﬁnition, our goal is to show for each n {2, . . . , k} that the n-ary truncated loss of LΩ,µ, denoted L(n), is regular.

Unified Margin-Based Classification

Next, let Ω(n) be the n-ary truncated negentropy of Ω(Deﬁnition 5.9) and let LΩ(n),µ

be the Fenchel-Young loss corresponding to Ω(n) and µ. By Theorem I.5, we have L(n) = LΩ(n),µ. By the assumption of Theorem 5.10, Ω(n) is a regular negentropy. Thus, by Proposition I.1, LΩ(n),µ is a regular PERM loss. Since L(n) = LΩ(n),µ, we are done.

I.2 Totally regular negentropy that is not strongly convex

In this section, we prove Proposition 5.11 which is used to show that there exists totally regular entropies that are not strongly convex. See Example 7. Thus, the associated Fenchel Young loss is calibrated by Theorem 5.10. Moreover, this calibration result is outside of the purview of previously established results (Blondel, 2019; Nowak-Vila et al., 2019) which requires strong convexity

The next four results serve as the technical tools for proving Proposition 5.11. Their proofs are omitted here and are in the ar Xiv version of this work (Wang and Scott, 2023b).

Proposition I.6 Let f : D R 0 be of Legendre type and g : R 0 R be convex, diﬀerentiable and strictly increasing. Let C = int(D). Suppose that D is compact and there exists x C such that infx D f(x) = f(x ). Then g f : D R is of Legendre type.

Lemma I.7 Let f and g be as in Proposition I.6. If g (0) = 0 and x int(D) is such that infx D f(x) = f(x ) = 0, then the Hessian of g f vanishes at x , i.e., 2 g f(x ) = 0.

Corollary I.8 In the situation of Lemma I.7, g f is not α-strongly convex for any α > 0.

Proposition I.9 Let Ω: k R be a regular negentropy and let g : R 0 R 0 be as in Proposition I.6. Furthermore, suppose that g is twice diﬀerentiable. Let a R be a negative number such that a Ω(p) for all p k. Deﬁne Θ : k R by

Θ(p) := g(Ω(p) a) g( a), p k.

Then Ωis a regular negentropy.

Proof [Proof of Proposition 5.11] First we note that the assumptions on g is in the setting of Proposition I.6. Next, by Deﬁnition 5.9, we must show that Θ(n) is a regular negentropy for each n {2, . . . , k}. Let a = Ω(u). Note that since Ωis symmetric and convex, we must have that a Ω(p) for all p k. Furthermore, it is easy to see that Θ(n)( q) := g( Ω(n)( q) a) g( a) for all q n. Now, apply Proposition I.9 to Θ(n) and a = Ω(u), we get the desired result. The part of the proposition assuming g (0) = 0 follows immediately from Corollary I.8.

Appendix J. Uniqueness of the matrix label code

In this section, we prove Theorem J.1. The proof appears at the end of this section.

Wang and Scott

Theorem J.1 Let L be the cross entropy loss and ψ be its template as in Example 1. Suppose that {Ay}k y=1 is a set of (k 1) (k 1) matrices satisfying Ly(v) = ψ(Ay Dv) for all y [k] and all v Rk. Then for each y [k] there exists a permutation σy Sym(k) so that Ay = SσyΥy.

Lemma J.2 Let ψ be the template of the multinomial logistic loss (Example 1). Let a Rk 1 be a vector. Suppose that ψ(ta) = ψ(te1) for all t R. Then a = ei for some i [k 1].

Lemma J.2 serves as the main tool for Lemma J.3 below. The proofs of both Lemmas J.2 and J.3 appear in the ar Xiv version of our manuscript (Wang and Scott, 2023b).

Lemma J.3 Let ψ be the template of the multinomial logistic loss (Example 1). Suppose that A is such that ψ(z) = ψ(Az) for all z Rk 1, then A is the identity matrix up to row permutation. In other words, there exists σ Sym(k) such that A = Sσ.

Lemma J.4 Let ψ be the template of the multinomial logistic loss (Example 1). Let j [k] and A R(k 1) (k 1) be arbitrary. Suppose that ψ(Υjz) = ψ(Az) for all z Rk 1. Then A is equal to the Υj up to row permutation. In other words, there exists σ Sym(k) such that A = SσΥj.

Proof [Proof of Lemma J.4] We ﬁrst prove that the assumption of Lemma J.4 implies that of the previous Lemma J.3: ψ(z) = ψ(Az) for all z Rk 1. Let u Rk 1 be arbitrary and let z := Υju. Since Υ2 j = I, we have ψ(u) = ψ(ΥjΥju) = ψ(Υjz) = ψ(Az) = ψ(AΥju) where for the third equality from the left we used the assumption of Lemma J.4. Now, by the previous Lemma J.3, we have that AΥj = Sσ for some σ Sym(k). Using Υ2 j = I, we get SσΥj = AΥjΥj = A. That is to say A is equal to Υj up to permutations of the rows.

Proof [Proof of Theorem J.1] Let y [k] and v Rk be ﬁxed. Let z := Dv. By Theorem 2.5, we have ψ(Υyz) = ψ(Ayz). Since D is surjective, the preceding identity holds for all z Rk 1. Thus by Lemma J.4, Ay = SσΥy for some σ Sym(k).

Appendix K. Mathematical Background

We review mathematical background on fundamental topics that are used throughout this work.

K.1 Non-singular M-matrix

We recall some deﬁnitions from linear algebra.

Deﬁnition K.1 Let A = (aij) Rn n be a matrix. We say that A is a

1. Z-matrix if aij 0 whenever i = j.

Unified Margin-Based Classification

2. M-matrix if A is a Z-matrix and all eigenvalues of A have nonnegative real parts.

3. strictly diagonally dominant matrix if |aii| > P j [n]:j =i |aij| for all i [n].

4. monotone matrix if for all x Rn, Ax 0 implies x 0. If, in addition, Ax 0 implies x 0, then A is said to be strictly monotone.

Theorem K.2 (Levy-Desplanques/Gershgorin circle) Let A be a strictly diagonally dominant matrix. Then A is non-singular whose eigenvalues all have nonnegative real parts.

The above result immediately implies the following:

Corollary K.3 A strictly diagonally dominant Z-matrix is a non-singular M-matrix.

Non-singular M-matrices have many equivalent characterizations. The one relevant to us is the following:

Theorem K.4 (Plemmons (1977)) Let A be a Z-matrix. Then A is a non-singular M-matrix if and only A is a monotone matrix.

Lemma K.5 Let A = (aij) Rn n be a non-singular M-matrix. If the diagonals of A are positive, then A is strictly monotone.

Proof Let x Rn be arbitrary such that Ax 0. Our goal is to show that x 0. First, from Theorem K.4, we have that A is monotone. Thus, Ax 0 implies x 0. We only have to check additionally that x 0. Since A is a Z-matrix, the oﬀ-diagonals are non-positive, i.e., aij 0 for all i, j [n] such that i = j. Now, let i [n]. We need to check that xi > 0. To this end, note that 0 < [Ax]i = Pn j=1 aijxj = aiixi + P j =i aijxj aiixi. Note that P j =i aijxj 0 because x 0 and aij 0. Finally, xi > 0 since aii > 0.

Corollary K.6 Let A = (aij) Rn n be a non-singular M-matrix. Then A is also a non-singular M-matrix. Moreover, if the diagonals of A are positive, then A is strictly monotone.

Proof Note that A and A are both non-singular and have the same set of eigenvalues. Moreover, A being a Z-matrix implies that A is a Z-matrix. Thus, A is a non-singular M-matrix. Finally, Lemma K.5 implies that A is a strictly monotone matrix.