# unified_binary_and_multiclass_marginbased_classification__3cfc7e39.pdf Journal of Machine Learning Research 25 (2024) 1-51 Submitted 12/23; Revised 5/24; Published 5/24 Unified Binary and Multiclass Margin-Based Classification Yutong Wang1,2 yutongw@umich.edu Clayton Scott1,3 clayscot@umich.edu 1Department of Electrical Engineering and Computer Science 2Michigan Institute of Data Science 3Department of Statistics University of Michigan Ann Arbor, MI 48109, USA Editor: Zhihua Zhang The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated. Keywords: Classification, loss functions, consistency, margins, label encodings 1. Introduction Classification into k 2 categories is the learning task of selecting a classifier, i.e., a function from the feature space X to the set of labels [k] := {1, . . . , k}, given training data. Many of the most popular and successful classification methods aim to find a classifier with minimum risk, where the risk of a classifier is the expected value of a loss function that measures the quality of predictions. Indeed, logistic regression, support vector machines, boosting, and neural network methods can all be viewed as algorithms to minimize the risk associated with a certain loss. In practice, loss-based approaches to binary (k = 2) and multiclass (k 3) cases are formulated differently. Because of this discrepancy, the theory and practice of multiclass methods often lag behind their binary counterparts. The purpose of this work is to bridge this gap by introducing a framework that unifies binary and multiclass loss-based classification. The standard approach to binary classification is to learn a discriminant function g : X R that maps an instance x to a discriminant g(x), e.g., g(x) = w T x + b for linear classification. Identifying {1, 2} with { 1, +1}, a label y is predicted by sign(g(x)). The loss ascribed to a discriminant g and a pair (x, y) is defined in terms of a function ψ : R R, with the loss being equal to ψ(( 1)yg(x)). Loss functions of this form are referred to as c 2024 Yutong Wang and Clayton Scott. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1599.html. Wang and Scott Figure 1: Relative margin form. Panel (a): Multiclass losses L satisfying the permutation equivariant and relative margin-based conditions (i.e., PERM losses, Definition 2.2) can be expressed in the relative margin form as in Panel (b). See Theorem 2.5. The relative margin form employs three components: Panels (c). the matrix label code {Υy}y [k], (d). the discriminant function g, and (e). the template , a symmetric function ψ : Rk 1 R. binary margin loss functions. Examples include the logistic loss ψ(t) = log(1+e t) (logistic regression), the hinge loss ψ(t) = max(0, 1 t) (support vector machines), the exponential loss ψ(t) = e t (Ada Boost), and the sigmoid loss ψ(t) = 1/(1 + et) (some neural networks). The conventional approach to multiclass classification seeks to learn a class-score function f = (f1, . . . , fk) : X Rk, e.g., a feed-forward neural network. The label for an input x is predicted by taking the argmax of the class-score vectors v := f(x). Multiclass loss functions, e.g., cross entropy, may be viewed as functions L : {1, . . . , k} Rk R. The loss incurred by a class-score function f on a pair (x, y) X [k] is then L(y, f(x)). When a multiclass loss function is expressed in terms of a class-score function output, we say that it is in class-score form. We show that a large family of multiclass loss functions in class-score form, including cross-entropy, multiclass exponential loss (Mukherjee and Schapire, 2013), multiclass hinge losses (Crammer and Singer, 2001; Weston and Watkins, 1998), Gamma-Phi losses (Beijbom et al., 2014), and Fenchel-Young losses (Blondel et al., 2020), can be expressed in what we call the relative-margin form. Instead of being defined in terms of a class-score function output f(x), in the relative margin form a loss is expressed ψ(Υyg(x)), which involves the following elements: 1. a symmetric1 function ψ : Rk 1 R, 2. a set of (k 1) (k 1) matrices {Υy}y [k] that encode the label, and 3. a discriminant function of the form g : X Rk 1. When k = 2, we have Υy = ( 1)y, and the relative margin form coincides with the standard binary margin form. See Figure 1 for an illustration of the framework. Thus, the relative margin form generalizes the notion of margin loss from binary classification. 1. Recall that a function is symmetric if its value does not change when its input is permuted. Unified Margin-Based Classification 1.1 Our contributions In this work, we develop the relative margin from as described above, and illustrate the utility of this framework by establishing novel results on classification-calibration2. More specifically, our contributions are: Theorem 2.5: characterization of losses expressible in the relative margin form. We show that a multiclass loss function can be expressed in the relative margin form if and only if both of the following conditions are met: 1. permutation equivariance: L treats all classes equally, and 2. relative margin-based: L(y, f(x)) depends only on the differences between the fj(x) s rather than the values of the fj(x) s themselves. Below, such losses are referred to as PERM losses. Our characterization implies that many popular losses such as the cross entropy, multiclass exponential, and Fenchel-Young losses (Blondel, 2019) are PERM losses. PERM losses are characterized by their template, a symmetric function ψ : Rk 1 R. Thus, our result implies that for the analysis of existing and design of new PERM losses, it suffices to focus on the template ψ which can be simpler than the original L. Theorem 4.7: Extending a fundamental result on classification-calibration (CC) to the multiclass case. A seminal result of Bartlett et al. (2006) in the binary case shows that a convex margin loss is classification-calibrated if and only if ψ is differentiable at 0 and has negative derivative there. We use the PERM loss framework to prove a multiclass extension of this result for L that are totally regular (Definition 4.2). We use Theorem 4.7 to prove a novel sufficient condition for CC of sums of losses (Proposition 4.8), and, as a corollary, establish CC for sums of Gamma-Phi losses (Example 5). A key component of the totally regular property is that the gradient of the template ψ must be entry-wise negative everywhere3. Theorem 5.10: Expanding previous sufficient conditions of CC for Fenchel Young losses. Fenchel-Young losses, defined via taking the convex conjugate of a certain negentropy function, have recently been proposed for structured prediction, including multiclass classification. However, an open question is whether Fenchel-Young losses are classification-calibrated under the assumption that the negentropy is strictly convex. Our Theorem 5.10 answers this in the affirmative. The matrix label code. The key component in our analysis is the matrix label code (Definition 2.4) which extends to the multiclass case the universally adopted { 1} label code for binary classification. To obtain our main results, we developed a suite of technical lemmas for working with the matrix label code in Appendix B. We expect the matrix label code to be useful for multiclass classification research beyond the scope here. 1.2 Background In this section, we define the probabilistic setting assumed throughout this work. Moreover, we first review the background on classification-calibration. 2. For readers unfamiliar with this theory, we provide a brief review below in the Background on Classification-Calibration section (Section 1.2). 3. Thus, in the binary case, our result is weaker than that of Bartlett et al. (2006), which only requires negativity at 0 R. We leave as an open question whether this gap can be closed. Wang and Scott Let {(xi, yi)}n i=1 be drawn from a joint distribution P over X [k]. Let I be the indicator function and I{y = j} be the 01-loss for y, j [k]. The goal of classification is to select a classifier h : X [k] minimizing the 01-risk R01,P (h) := E(X,Y ) P [I{Y = h(X)}] objective. A fundamental approach for learning a classifier is empirical risk minimization (ERM), which selects an h minimizing the empirical 01-risk ˆRn 01(h) := 1 n Pn i=1 I{yi = h(xi)} over a class of functions H. Directly minimizing the empirical 01-risk objective is often intractable4 due to the discreteness of the objective. The surrogate-based approach addresses this issue as follows. Define the L-risk for a fixed continuous loss function L by RL(f) := E(X,Y ) P [L(Y, f(X))] and the empirical L-risk by ˆRn L(f) := 1 n Pn i=1 L(yi, f(xi)). Minimizing ˆRn L(f) over a family of f s, e.g., neural networks, is employed as a tractable surrogate objective. A class-score function f induces a discrete-valued classifier arg max f(x) := arg maxj=1,...,k fj(x) with ties broken arbitrarily. An important question is whether good performance with respect to the surrogate L-risk transfers back to the 01-risk, the original objective. Definition 1.1 A surrogate loss L has the consistency transfer property5 if for any distribution P over X [k] and any sequence of class-score functions { ˆf(n)}n, e.g., ˆf(n) is an empirical L-risk minimizer over a family of functions depending on n, the following is satisfied: limn RL( ˆf(n)) = inff RL(f) implies that limn R01(arg max ˆf(n)) = infh R01(h). The infimums are, respectively, over all measurable functions f : X Rk and h : X [k]. Thus, the consistency transfer property (CTP) provides justification for L-risk minimization when minimizing the 01-risk is the original objective of interest. For binary margin-based losses ψ, the seminal result of Bartlett et al. (2006, Theorem 1.3) shows that CTP is equivalent to a functional property of ψ known as classification-calibration (CC). For the multiclass case, Tewari and Bartlett (2007) define the CC property (reviewed in detail in Section 3 below) and show its equivalence to the CTP as well. 1.3 Related work Sufficient conditions for CC of multiclass convex margin-based losses. For binary convex loss in margin form, the sufficient conditions for CC are easy to verify and apply to common losses of interest. Indeed, Bartlett et al. (2006, Theorem 2.1) asserts that a nonnegative binary convex loss in margin form ψ : R R 0 is CC if and only if ψ is differentiable at 0 and ψ (0) < 0. There are several difficulties for extending the above characterization to the multiclass case. The result in the binary case relies on the expression of a binary margin-based loss as ψ(( 1)yz). While proposed multiclass extensions of the expression exist (to be reviewed below), existing work on sufficient conditions for CC of multiclass losses focus on losses in the class-score form L : [k] Rk R. Most previous works studied CC of losses of a specific form such as the Gamma-Phi losses (Zhang, 2004; Beijbom et al., 2014; Wang and Scott, 2023a), Fenchel-Young losses (Duchi et al., 2018; Blondel et al., 2020), and hinge-like losses (Tan and Zhang, 2022). 4. See Bhattacharyya et al. (2018) and the references therein 5. This property appears in many works, e.g., Steinwart (2007); Bartlett et al. (2006); Tewari and Bartlett (2007); Zhang (2004). The name consistency transfer property was used by Wang and Scott (2023a). Unified Margin-Based Classification Notably, Tewari and Bartlett (2007) derive a characterization of CC for multiclass losses analogous to that of Bartlett et al. (2006, Theorem 2.1). However, their sufficient condition (Tewari and Bartlett, 2007, Theorem 7), while geometrically elegant, is hard to verify for common losses such as the multiclass exponential loss6. Multiclass margins & margin-based loss. A line of work has considered the problem of extending the binary margin ( 1)yz and binary margin form ψ(( 1)yz) to the multiclass setting. For the margin, Crammer and Singer (2001) and Mohri et al. (2018, 9.2) defined a notion of scalar-valued multiclass margin as minj [k]:j =y vy vj where y is the ground truth label. In other words, this notion of multiclass margin is the minimum difference of score between the ground truth label and the rest of the labels. While intuitive, according to Tewari and Bartlett (2007, 5.1), convex losses based on this notion of margin are never classification-calibrated. Lee et al. (2004); Zou et al. (2008) proposes definitions of vector-valued multiclass margin vectors and associated margin losses for multiclass SVMs and boosting, respectively. These losses are known to be classification-calibrated if and only if sum-to-zero constraints are enforced on the input margin vectors (Dogan et al., 2016, Theorem 6). Lee et al. (2004) develops a multiclass SVM based on a hinge loss that leverages this notion of margin. However, in practice, enforcing these sum-to-zero constraints leads to significantly slower computations (Fu et al., 2022). Dogan et al. (2016) developed a framework using relative margins to unify the analysis of several variants of multiclass support vector machines. This notion of relative margin7 is first introduced by Rosset et al. (2003, 4). Both of these prior works only established classification-calibration of specific losses for specific algorithms. By contrast, our work develops a framework that characterizes when multiclass losses can be expressed via relative margins and proves general classification-calibration results for a large family of losses. Label encodings for multiclass classification. As alluded to earlier, the multiclass SVM introduced by Lee et al. (2004) is classification-calibrated provided that certain sumto-zero constraints are enforced. Moreover, enforcing these constraints is computationally prohibitive. The simplex code (Hill and Doucet, 2007) is a multiclass label code designed to address this computational issue by using a constraint-free reparametrization of the multiclass SVM dual formulation (Wu and Lange, 2010; Saberian and Vasconcelos, 2011; Mroueh et al., 2012; van den Burg and Groenen, 2016; Pouliot, 2018). Another approach to label encoding is the multivector construction8 (Shalev-Shwartz and Ben-David, 2014, 17.7) which has been used to study the multiclass perceptron, logistic regression and SVMs (Duda et al., 2006, 5.12). Moreover, the multivector construction has been used to study the PAC learning sample complexity of multiclass linear classifiers (Daniely and Shalev-Shwartz, 2014). While the simplex code and the multivector constructions are elegant techniques, the scope of these work are on specific algorithms and classifiers. In particular, a framework for 6. See Footnote 7 in Tewari and Bartlett (2007, Table 1). 7. Rosset et al. (2003) also uses the name margin vector in conflict with the later work by Zou et al. (2008) which defines a different notion of margin. Due to this confusion, we will refer to the earlier notion of Rosset et al. (2003) by relative margins . 8. Also known as Kesler s construction which, according to Crammer and Singer (2003), is credited to (Kesler, 1961). A description of the construction can be found in Duda et al. (2006, 5.12) and Shalev Shwartz and Ben-David (2014, 17.7). Wang and Scott general multiclass margin-based losses and analysis of their properties, e.g., classificationcalibration, is lacking. Our work fills this gap by using the framework of the relative-margin form and matrix label code to 1. characterize the set of multiclass losses expressible in this form (Theorem 2.5), 2. prove sufficient conditions for classification-calibration of a large family of multiclass losses (Theorems 4.7) and 3. apply our results to expand the previously known family of classification-calibrated Fenchel-Young losses (Theorem 5.10). 1.4 Notations In this subsection, we briefly discuss some of the key notations. For the reader s convenience, we include a more comprehensive reference for the notations in Table 1 in Section A of the appendix. Moreover, we tabulate the mathematical objects defined in Table 2. Throughout this work, let k 2 denote the number of classes. Denote the k-probability simplex by k = {p Rk 0 : Pk j=1 pj = 1}. Vectors/matrices. Let the square bracket with subscript [ ]j be the projection of a vector onto its j-th component, i.e., [v]j := vj where v = (v1, . . . , vk) Rk. Given two vectors w, v Rk, we write w v (resp. w v) if wj vj (resp. wj > vj) for all j [k]. Allzeros/all-ones/i-th elementary basis vector in Rn are denoted 0(n)/1(n)/e(n) i , respectively. When the ambient dimension is clear, we drop the superscript (n). The n n identity matrix is denoted In. Permutations and permutation matrices. A bijection from [k] to itself is called a permutation (on [k]). Denote by Sym(k) the set of all permutations on [k]. For each σ Sym(k), let Sσ denote the permutation matrix corresponding to σ. In other words, if v Rk is a vector, then [Sσv]j = [v]σ(j) = vσ(j). 2. Permutation equivariant relative margin (PERM) losses In this section, we define PERM losses and prove fundamental properties used throughout the rest of the work. We begin with a discussion of the two defining properties: permutation equivariance and relative-margins. Recall that in the introduction, we denoted loss functions as L : [k] Rk R. For a class-score function f : X Rk, the loss is evaluated as L(y, f(x)) on an instance (x, y). Throughout the rest of this work, for mathematical convenience we will use the equivalent notation convention Ly(f(x)) where y appears in the subscript. In this convention, a multiclass loss function is a vector-valued function L = (L1, . . . , Lk) : Rk Rk. With this convention, the property that the loss treats each of the k classes equally is captured by permutation equivariance, a notion similar to but distinct from symmetry9. Denote by v the class-score vector f(x) on some generic instance (x, y). The permutation equivariance property states that if the class-score vector v is relabeled by some permutation σ Sym(k), then the vector of losses L(v) should be relabeled in the same way, i.e., L(Sσv) = SσL(v). 9. Recall that a function is symmetric if its value does not change when its input is permuted. Sometimes, this is also referred to as permutation invariance (Bronstein et al., 2021, 3.1). For our purposes, equivariance is the correct descriptor. Unified Margin-Based Classification Relative margins10 have been recently used by Dogan et al. (2016) and by Fathony et al. (2016) in the context of multiclass SVMs that only utilize the set of differences vy vj over all y, j [k] such that y = j, rather than the non-relative or absolute class-scores (v1, . . . , vk) themselves. Below, it will be notationally convenient to use a matrix that converts the vector of class-score v into relative margins: Definition 2.1 Let D := Ik 1 1(k 1) R(k 1) k. Observe that [Dv]y = vk vy for y [k 1] and v Rk. Equivalently, Dv = (vk v1, vk v2, . . . , vk vk 1) . We are now ready to define Definition 2.2 (PERM losses) Let k 2 be an integer. A k-ary multiclass loss function is a vector-valued function L = (L1, . . . , Lk) : Rk Rk. We say that L is 1. permutation equivariant if L(Sσv) = SσL(v) for all v Rk and σ Sym(k), 2. relative margin-based if for each y [k] there exists a function ℓy : Rk 1 R so that Ly(v) = ℓy(Dv) = ℓy(vk v1, vk v2, . . . , vk vk 1), for all v Rk. (1) We refer to the vector-valued function ℓ:= (ℓ1, . . . , ℓk) as the reduced form of L. 3. PERM if L is both permutation equivariant and relative margin-based. In this case, the function ψ := ℓk is referred to as the template of L. While the concepts of permutation equivariance and relative margin have been studied largely in isolation in previous works, our work is the first to systematically study the properties of losses having both properties. Below in Theorem 2.5, we show that a PERM loss is completely determined by its template via what we call the relative-margin form. Remark 2.3 (On the name template ) The symbol of the template ψ is chosen intentionally to match that of Bartlett et al. (2006), where a (binary) margin loss is expressed as ψ(( 1)yg(x)). Treating g(x) = z as an arbitrary input to ψ, the rationale behind the name is that the positive branch ψ(( 1)2z), of the margin loss serves as a template for the negative branch ψ(( 1)1z). Before proceeding, let us examine some notable PERM losses: Example 1 The cross entropy (also multinomial logistic) loss is given by LCE y (v) = log 1 + P j [k]:j =y exp( (vy vj)) , for all v Rk. It is easy to see that its template is the function ψCE(z) = log 1 + Pk 1 j=1 exp( zj) . When k = 2, we have ψCE(z) = log(1 + exp( z)) which is the binary logistic loss (also known as the binary cross entropy). When k = 3, we have ψCE(z) = log(1 + exp( z1) + exp( z2)) is a function defined over the 2D plane R2, plotted in Figure 1. 10. The idea of relative margins goes back to Rosset et al. (2003). We note that Jebara and Shivaswamy (2008) also use this term in a unrelated context in binary SVMs. Wang and Scott Example 2 The Gamma-Phi loss (Beijbom et al., 2014), denoted Lγ,φ y (v), is a generalization of the cross entropy loss defined as follows: Let γ : R R and φ : R R 0 be functions. Define Lγ,φ y (v) := γ P j [k]:j =y φ(vy vj) , for all v Rk. It is easy to see that its template is the function ψγ,φ(z) = γ Pk 1 j=1 φ(zj) . The cross entropy LCE is the Gamma-Phi loss where φ(t) = exp( t) and γ(t) = log(1 + t). The multiclass exponential (Mukherjee and Schapire, 2013) loss LExp is the Gamma-Phi loss where φ(t) = exp( t) and γ is the identity. Example 3 A notable case of the Gamma-Phi loss is when φ(t) = max{0, 1 t} is the hinge loss and γ is the identity. The resulting loss is known as the Weston-Watkins hinge loss (Weston and Watkins, 1998; Bredensteiner and Bennett, 1999; Vapnik, 1998), which is well-known to be not classification-calibrated (Liu, 2007; Tewari and Bartlett, 2007). However, the Weston-Watkins hinge loss is calibrated with respect to a discrete loss called the ordered partition loss related to ranking with ties (Wang and Scott, 2020). Example 4 The Crammer-Singer hinge loss (Crammer and Singer, 2001) is a well-known loss that is a PERM loss but not a Gamma-Phi loss. It is defined as LCS y (v) := maxj [k]:j =y {max{0, 1 (vy vj)}} , for all v Rk. Section 5 features another example of a family of PERM losses, namely, the Fenchel Young losses (Blondel et al., 2020). 2.1 Matrix label code and the relative margin form This section introduces the multiclass generalization of { 1}: the matrix label code {Υy}k y=1. Definition 2.4 (Matrix label code) For k 2 and y [k], define the (k 1) (k 1) matrix Υy as follows: For y = k, Υk := Ik 1. For y [k 1], define Υy column-wise by ( e(k 1) j : j = y 1(k 1) : j = y, for each j [k 1]. Equivalently, to construct Υy for each y [k 1], first take the identity matrix Ik 1, then replace the y-th column by all 1 s. Note that Υ2 y = Ik 1 for all y [k 1], i.e., Υy is an involution. Moreover, in the binary case where k = 2, we have Υ1 = 1 and Υ2 = 1, i.e., the matrix label code reduces to the label encoding { 1}. Next, recall that a function f : Rn R is symmetric if f(Sσ( )) = f( ) for all σ Sym(n). We now state the main property of the PERM loss and the matrix label code: Theorem 2.5 (Relative-margin form) Let L : Rk Rk be a PERM loss with template ψ, and let v Rk and y [k] be arbitrary. Then ψ is a symmetric function. Moreover, Ly(v) = ψ(Υy Dv). (2) Conversely, let ψ : Rk 1 R be a symmetric function. Define a multiclass loss function L = (L1, . . . , Lk) : Rk Rk according to eq. (2). Then L is a PERM loss with template ψ. Unified Margin-Based Classification Equation (2) is referred to as the relative margin form of L. Theorem 2.5 and other results in this section are proved in Section C in the appendix. Remark 2.6 The relative margin form can be interpreted as a one-to-one correspondence between PERM losses L : Rk Rk and symmetric functions ψ : Rk 1 R. Thus, we can refer to a PERM loss by either L or ψ without ambiguity. Remark 2.7 (Uniqueness of the matrix label code) In Section J of the appendix, we prove11 a unique-ness result: any set of matrices {Υ y}k y=1 satisfying the conclusion of Theorem 2.5 is equal to the matrix label code (Definition 2.4) up to row permutations of the Υy s. Remark 2.8 Observe that Υy Dv = (vy v1, vy v2, . . . , vy vk) Rk 1 where the vy vy entry is omitted. This is Lemma B.2 in the appendix. Now, note that the right hand side is exactly the relative margin as defined in Rosset et al. (2003, Eqn. (8)). The advantage of the left hand expression Υy Dv is that the label y acts via its label encoding Υy by left matrix multplication. This disentanglement of the label encoding and the loss can facilitate calculations and will be crucial in our key Lemma G.6. Remark 2.9 (Prior work) The matrices in the definition of the matrix label code have been used by Wang and Scott (2020, 2021) for analyzing the Weston-Watkins (WW) SVMs (Weston and Watkins, 1998). Wang and Scott (2020) show that the hinge loss from the WW-SVM is calibrated with respect to the ordered partition loss, and uses this theory to explain the empirical observation made by Dogan et al. (2016) that WW-SVM performs well even under significant label noise. Wang and Scott (2021) derive a reparametrization of the Weston-Watkins dual problem that decomposes into subproblems that can be solved exactly in O(k log(k)) time, leading to faster performance for the linear WW-SVMs with many classes. The disentanglement discussed in the previous remark is the key ingredient for deriving this reparametrization. This work extends Wang and Scott (2020, 2021) beyond the multiclass SVM setting, and provides a framework for using these matrices for margin-based losses. 2.2 Relationship between class-score functions and discriminant functions As mentioned in the introduction, the sign function converts a real-valued discriminant function g : X R to a discrete-valued classifier h : X {1, 2} by choosing h(x) to be y {1, 2} such that ( 1)yg(x) 0 with ties broken arbitrarily. To generalize this to the multiclass case, first recall that when k = 2, the matrix label code is equal to Υy = ( 1)y. Thus, ( 1)yg(x) 0 iffΥyg(x) 0. More simply put, y is the predicted class if and only if Υy is the sign of g(x). Next, we define the multiclass sign function analogously. Given a vector-valued discriminant function g : X Rk 1 we define a discrete-valued classifier h : X [k] by taking y [k] such that Υyg(x) 0. Similar to the binary case, ties can be broken arbitrarily. Thus, Υy can be viewed as the sign of g(x). To relate this multiclass sign function method for producing a classifier to the conventional arg max method, we have 11. We omit the proofs of some of the technical intermediate lemmas. For the omitted proofs, see the ar Xiv version of our manuscript (Wang and Scott, 2023b). Wang and Scott Proposition 2.10 Let v Rk, z := Dv, and y [k]. Then y arg maxj vj iffΥyz 0. Furthermore, let v := 0 z . Then y arg maxj vj iffy arg maxj v j. Proof For the first part, recall from Remark 2.8 that Υy Dv = (vy v1, vy v2, . . . , vy vk) where the vy vy entry is omitted. Therefore, the assertion y arg maxj vj iffΥyz 0 follows immediately from the definition of and z. For the Furthermore part, note that by construction we have v = (0, v2 v1, v3 v1, . . . , vk v1) . By adding the constant v1 to all entries of v , we recover v. Thus, the indices maximizing v are the same as those of v = (v1, v2, . . . , vk) . The first part of Proposition 2.10 gives an intuitive explanation for our earlier definition of a multiclass sign . The second part of Proposition 2.10 gives a simple formula for computing the arg max from the discriminant function. Remark 2.11 Proposition 2.10 can be seen as an equivalence between class-score functions and discriminant functions. Given a class-score function f : X Rk, we can derive a discriminant function g : X Rk 1 by defining g(x) := Df(x) for all x X. Conversely, we can view the discriminant function g as the given , and derive a class-score function f by defining f(x) := 0 g(x) for all x X. In both cases, arg max f(x) = arg max 0 g(x) . 3. Classification-calibration and Consistency This section is a brief review of the core definitions and theorem from Tewari and Bartlett (2007) in preparation for our results on sufficient conditions for classification-calibration. Below, we assume the probability setting introduced earlier in Section 1.2. Definition 3.1 (Range and its convex hull) Let f : Rm Rn be a function. Denote by ran(f) := {f(x) : x Rm} the range of f, and ranc(f) := conv(ran(f)) the convex hull of the range of f. Definition 3.2 (Tewari and Bartlett (2007)) A set S Rk + is classification-calibrated if there exists a function12 θ : Rk [k] such that inf{ p, ζ : ζ S : pθ(ζ) < max p} > infζ S p, ζ (3) for all p k. A multiclass loss function L is classification-calibrated if ranc(L) is classification-calibrated. Intuitively, Definition 3.2 says that the lowest achievable conditional risk when predicting the wrong label (Eqn. (3) LHS) is still strictly larger than the conditional Bayes risk (Eqn. (3) RHS). Define arg max : Rk [k] by arg max(v) = min{i [k] : vi = maxj [k] vj}. When L is permutation equivariant and classification-calibrated, we can take θ in Definition 3.2 to be simply arg max. See Tewari and Bartlett (2007, Lemma 4). As alluded to earlier, the significance of Definition 3.2 is demonstrated by the following theorem: 12. The function θ is called a calibrated link for S. Unified Margin-Based Classification Theorem 3.3 (Tewari and Bartlett 2007) Let L be a PERM loss. Let F be the set of all Borel functions X Rk. If L is classification-calibrated, then L has the consistency transfer property: For all sequence of function classes {Fn}n such that Fn F, S n Fn = F, all ˆfn Fn and all probability distributions P on X [k] RL,P ( ˆfn) P inff RL,P (f) implies R01,P (arg max ˆfn) P infh R01,P (h) where the infimums are taken over all Borel functions f : X Rk and h : X [k], respectively. Remark 3.4 In applications, ˆfn is often taken to be an L-risk empirical minimizer over a training dataset of cardinality n. However, the above property holds for any sequence of functions ˆfn Fn. Checking that a multiclass loss L is classification-calibration is a non-trivial task, requiring delicate analysis of the loss function (Tewari and Bartlett, 2007). Recently, Wang and Scott (2023a) established a sufficient condition of classification-calibration for Gamma-Phi losses (Example 2). Below, we derive a sufficient conditions for general PERM losses. 4. Regular PERM losses To demonstrate the utility of the relative margin form and Theorem 2.5, we prove in this section Theorem 4.7, a sufficient condition for classification-calibration (CC) of multiclass loss functions. To this end, we begin with the necessary definitions. Definition 4.1 A function f : Rn R is 1. coercive if for all c R, the set {v Rn : f(v) c} (i.e., the c-sublevel set) is bounded, 2. semi-coercive if for all c R there exists b R such that {v Rn : f(v) c} {v Rn : b minj [n] vj}. The definition of a coercive function is well-studied (Boyd and Vandenberghe, 2004). However, semi-coercivity appears to be new. Intuitively, a function is semi-coercive if, for all c R, its c-sublevel set is contained in a translation of the positive orthant. Definition 4.2 (Regular PERM loss) Let L be a PERM loss with template ψ. We say that L is regular if ψ is nonnegative, twice differentiable, strictly convex, semi-coercive, and the gradient ψ(z) 0 is entrywise negative for all z Rk 1. Below, we give a sufficient condition for a Gamma-Phi loss (Example 2) to be a regular PERM loss. But first, we discuss the intuition behind the condition ψ(z) 0. The condition ψ(z) 0 is reminiscent of a condition in Bartlett et al. (2006, Theorem 6), which shows that in the binary case a convex margin loss ψ is classification-calibrated if and only if ψ is differentiable at 0 and ψ(0) < 0 where the overdot denotes taking derivative of a univariate function. If we view ψ( ) as a vector field on R, then this condition on the derivative can be stated as the vector field ψ( ) near 0 points toward the positive half of the real line . Wang and Scott positive orthant Figure 2: Vector field ψ(z) where ψ is the template of the cross entropy (Example 1). The contour lines are shaded according to the value of the function ψ (darker is larger). See Figure 1. The condition ψ(z) 0 can be thought of in a similar fashion. See Figure 2 for a plot of the vector field ψ(z) where ψ ψCE is the template of the cross entropy (Example 1), and where k = 3. Note that the vectors point toward the shaded region whose interior is the positive orthant in R2. However, Definition 4.2 requires the strict negativity of the gradient for all of Rk 1, whereas the derivative of ψ is only required to be negative at 0 in the binary case in Bartlett et al. (2006, Theorem 6). Next, we define the truncation of a PERM loss L. Intuitively, if k 3 is the number of classes and L is k-ary, then the truncation of L results in a (k 1)-ary loss. Proposition 4.3 (Truncation) Assume k 3. Let L : Rk Rk 0 be a PERM loss with template ψ : Rk 1 R. Define trunc[ψ](w) := limλ ψ(λ, w) for all w Rk 2. If L is regular (Definition 4.2), then trunc[ψ] is a well-defined symmetric function Rk 2 R (i.e., all limits exist in R), referred to as the truncation of ψ. Proof Let h(λ) := ψ(λ, w). The condition that ψ( ) 0 implies that h is decreasing as a function of λ. Moreover, h is nonnegative since ψ is nonnegative. Thus, limλ + h(λ) exists. The symmetry of trunc[ψ] follows immediately from the symmetry of ψ. By Theorem 2.5, the symmetric function trunc[ψ] induces a unique loss function which we call the truncation of L. Note that if ψCE(z) = log(1+exp( z1)+exp( z2)) is the template of the cross entropy with k = 3 (Example 1), then trunc[ψCE](w) = limλ ψCE(λ, w) = log(1 + exp( w)), which is just the binary cross entropy/logistic loss. Next, we define the m-fold iterated truncation of the template of a PERM loss. Corollary 4.4 For each m {0, 1, . . . , k 2}, define trunc m[ψ] to be m-fold repeated applications of trunc to ψ, i.e., trunc m[ψ] := trunc[ trunc[trunc[ψ]] ] where trunc appears m-times. By convention, let trunc 0[ψ] = ψ. Moreover, for each n {2, . . . , k}, define the notational shorthand ψ(n) := trunc (k n)[ψ]. If, for n {2, . . . , k 1}, ψ(n+1) : Rn R is a symmetric function such that the associated PERM loss, denoted L(n+1), is a regular PERM loss, then ψ(n) : Rn 1 R is a symmetric function. The n-ary truncated loss captures the behavior of ψ when the first k n inputs to the template ψ approach + . Note that if ψCE is the template of the k-ary cross entropy, then trunc (k n)[ψCE] is the template of the n-ary cross entropy. This follows from a calculation similar to the one right before Corollary 4.4. Next, we define totally regular PERM loss which generalizes the above observation: Definition 4.5 (Totally regular PERM loss) Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. For each n {2, . . . , k}, let ψ(n) : Rn 1 R be as in Unified Margin-Based Classification Corollary 4.4 and let L(n) be the unique13 PERM loss associated to ψ(n). We say that L is a totally regular PERM loss if L(n) is regular for each n {2, . . . , k}. Before proving our main result, we note that Gamma Phi losses form a large family of totally regular PERM losses: Proposition 4.6 Let γ, φ : R R be functions and Lγ,φ be the associated Gamma-Phi loss as defined in Example 2. Suppose that all of the following holds 1. γ is a twice differentiable function such that γ 0, dγ dt > 0 and d2γ dt2 0 on all of R, 2. φ is a twice differentiable function such that dφ dt < 0, d2φ dt2 > 0 on all of R. Moreover, φ(t) 0 as t + . Then Lγ,φ is totally regular. We now state our main result: Theorem 4.7 If L is totally regular, then L is classification-calibrated. In view of Theorem 4.7, the assumptions of Proposition 4.6 turns out to be a sufficient condition for a Gamma-Phi loss to be classification-calibrated. Theorem 4.7 and Proposition 4.6 together recover the result that the coherence loss (Zhang et al., 2009) is classification-calibrated. However, this sufficient condition is subsumed by a previous result in Wang and Scott (2023a, Theorem 3.3). Nevertheless, that Lγ,φ is totally regular has nontrivial consequences as we will see next. First, new totally regular PERM losses can be constructed from existing ones: Proposition 4.8 Let L and L : Rk Rk be totally regular PERM losses and λ > 0 be a number. Then λL and L + L are also totally regular. Thus, λL and L+L are both classification-calibrated by Theorem 4.7. Next, we apply this result to sums of Gamma-Phi losses: Example 5 (Sum of Gamma-Phi losses) Consider two Gamma-Phi losses each satisfying the assumptions of Proposition 4.6. For concreteness, take the cross entropy LCE and the multiclass exponential loss LExp (See Example 2). Then by Proposition 4.8, the loss 1 2(LCE +LExp) is totally regular and thus classification-calibrated by Theorem 4.7. Note that the sum of two Gamma-Phi losses need not be another Gamma-Phi losses. Thus, Theorem 4.7 combined with Proposition 4.8 gives the first sufficient condition for when the sum of classification-calibrated losses are again classification-calibrated, in the multiclass case14. In the next section, we will apply Theorem 4.7 to obtain a sufficient condition for classification-calibration of Fenchel-Young losses (Theorem 5.10). 13. The existence and uniqueness is guaranteed by Theorem 2.5. 14. Note that the situation is much simpler in the binary case, due to the result of Bartlett et al. (2006, Theorem 6) that convex margin loss ψ is classification-calibrated if and only if ψ is differentiable at 0 and ψ(0) < 0 where the overdot denotes taking derivative of a univariate function. Wang and Scott 5. Fenchel-Young losses: sufficient conditions for classification-calibration In this section, we consider a subset of PERM losses called the Fenchel-Young losses: Definition 5.1 (Blondel et al. 2020) Let Ω: k R be a continuous function and µ R 0. Define cy := µ(1(k) e(k) y ). The Fenchel-Young loss associated to Ωand µ is the loss function LΩ,µ : Rk Rk 0 whose y-th component is given by LΩ,µ y (v) := maxp k Ω(p) + Ω(e(k) y ) + v + cy, p e(k) y . (4) The maximization in Eqn. 4 is essentially the Fenchel conjugate15 of Ω. See Definition 5.4. It is not obvious that Fenchel-Young losses are indeed PERM losses. We prove this fact in Proposition I.1 in the appendix. Definition 5.2 (Negentropy) A function Ω: k R is a negentropy if : 1. Ωis closed (maps closed sets to closed sets) and convex, 2. Ωis symmetric, i.e., Ω(Sσp) = Ω(p) for all p k and σ Sym(k), 3. Ω(p) 0 for all p k and Ω(e(k) i ) = 0 for all i [k]. The negative entropy is more convenient to work with due to its convexity. We use the term negentropy in the interest of brevity. This term has been previously used in statistics (Hyv arinen and Oja, 2000) and in machine learning (Mensch et al., 2019). Example 6 When Ω(p) := Pk y=1 py log(py) is the negative Shannon entropy and µ = 0, we have that LΩ,0 is the cross entropy (Example 1). Below, we use the notation that p = [p1, . . . , pk 1] Rk 1 is the sub-vector of p with the last element dropped. Later in Proposition I.1, we show that the Fenchel-Young loss Eqn. (4) is a PERM loss with template ψΩ,µ(z) := maxp k Ω(p) + µ1 p p, z . (5) Remark 5.3 Blondel et al. (2020) allow the vector cy Rk to be arbitrary, in which case the resulting loss is known as cost-sensitive Fenchel-Young loss. However, known calibration results (Blondel, 2019; Nowak-Vila et al., 2019) are only for cy as in Definition 5.1. In order to state our sufficient condition for classification-calibration (Theorem 5.10), we briefly review two key notions from the theory of convex analysis and Legendre transformation following Rockafellar (1970, Section 26): Definition 5.4 (Convex conjugates) Let D Rn be a closed convex set. Let f : D R be a function. Define D := {y Rn : supx D y, x f(x) < }. The convex conjugate of a function f : D R is the function f : D R given by f (y) = supx D y, x f(x). Definition 5.5 (Convex functions of Legendre type) Let D Rn be a closed convex set. A convex function f : D R is said to be of Legendre type if 15. Also known as the convex conjugate. Unified Margin-Based Classification 1. C := int(D) is an open convex subset of Rn, 2. f is strictly convex and differentiable on C, 3. for all {xi} i=1 C with limi xi bdry(D), we have limi f(xi) = + . For example, when D = k and f = H is the negative Shannon entropy, then f : D R 0 is of Legendre type. See the paragraph immediately following Blondel et al. (2020, Definition 3). Definition 5.6 (Regular negentropies) A negentropy Ω: k R is a regular negentropy if Ωis of Lengedre type and twice differentiable. The term negentropy was previously used by Mensch et al. (2019). To the best of our knowledge, the term regular negentropy is new. Proposition 5.7 (Truncation of a negentropy) Assume k 3. Let pad : k 1 k be the zero-padding operator, i.e., pad(q) := [0, q1, . . . , qk 1] k for all q = [q1, . . . , qk 1] k 1. Let Ω: k R be a negenetropy. Define the truncation of Ω, denoted trunc[Ω] : k 1 R, by trunc[Ω]( ) := Ω(pad( )). Then trunc[Ω] is a negentropy. The following is analogous to the earlier iterated truncation construction in Corollary 4.4: Corollary 5.8 For each m {0, 1, . . . , k 2}, define trunc m[Ω] to be m-fold repeated applications of trunc to Ω, i.e., trunc m[Ω] := trunc[ trunc[trunc[Ω]] ] where trunc appears m-times. By convention, let trunc 0[Ω] = Ω. Moreover, for each n {2, . . . , k}, define the notational shorthand Ω(n) := trunc (k n)[Ω]. Then Ω(n) : n R is a negentropy. Definition 5.9 (Totally regular negentropy) Let Ω: k R be a negentropy. We say that Ωis a totally regular negentropy if Ω(n) is a regular negentropy for each n {2, . . . , k}. To the best of our knowledge, the definition of a totally regular negentropy is new. As suggested by the name, totally regular negentropies induce Fenchel-Young losses that are totally regular. This is a critical step16 towards proving the main result of this section below. The result leverages Theorem 4.7 to establish a sufficient condition for the classificationcalibration of Fenchel-Young losses: Theorem 5.10 Let Ωbe a totally regular negentropy, µ R 0 be fixed, and LΩ,µ be the Fenchel-Young loss associated to Ωand µ. Then LΩ,µ is classification-calibrated. In light of Theorem 3.3, if Ωsatisfies the conditions of Theorem 5.10, then LΩ,µ satisfies the consistency transfer property (Definition 1.1). Theorem 5.10 recovers all known classification-calibration sufficient conditions (Nowak-Vila et al., 2019; Blondel, 2019) which require that Ωbe strongly-convex whereas our result only requires strict convexity. The following proposition and example show that there is always a nonempty set of losses to which the result applies but the prior results don t. 16. See the proof of Theorem I.5 in the appendix. Wang and Scott Proposition 5.11 Let Ω: k R be a totally regular negentropy and g : R 0 R 0 be convex, twice differentiable and strictly increasing. Let u := (1/k)1(k) k be the uniform probability vector. Define Θ : k R by Θ(p) := g(Ω(p) Ω(u)) g( Ω(u)) for p k. Then Θ is a totally regular negentropy. Furthermore, if g (0) = 0, then Θ is not strongly-convex. Example 7 For a concrete application of Proposition 5.11, take Ωto be the negative Shannon entropy (Example 6) and g(x) = x2 the square function. Then Θ as defined in Proposition 5.11 is a totally regular negenetropy that is not strongly-convex. Moreover, LΘ,µ is classification-calibration for any µ R 0 by Theorem 5.10. Note that previous sufficient conditions (Nowak-Vila et al., 2019; Blondel, 2019) do not cover our example above. 6. Discussion and Future Directions We proposed the relative margin form as a multiclass extension to the popular (binary) margin loss framework. We characterized the set of losses that can be expressed in the relative margin form the PERM losses. Our Theorem 4.7 generalizes to the multiclass case the seminal result of (Bartlett et al., 2006, Theorem 6). Central to our analysis is the matrix label code (Definition 2.4) which we expect to be useful beyond the scope here. Utilizing our framework, we extended the existing set of known sufficient conditions for the classification-calibration of Fenchel-Young losses. We expect the framework to be useful for research on multiclass classification, e.g., for H-consistency (Long and Servedio, 2013; Zhang and Agarwal, 2020; Awasthi et al., 2022). Characterization of classification-calibration. One weakness of our work is that, compared to the characterization of CC in the binary case (Bartlett et al., 2006, Theorem 6), we are only able to prove an analogous result (Theorem 4.7) in the multiclass case under additional assumptions. Future work will investigate whether this can be improved. Cost-weighted classification. Our results are limited to the traditional notion of classification-calibration under the assumption that the cost of misclassifying all classes are the same. The more general notion of c-classification-calibration described in Williamson et al. (2016) allows the cost of misclassification to be class-dependent according to some class-specific weight vector c. An interesting future direction is to consider a setting where a general c-classification-calibration sufficient condition similar to our Theorem 4.7 can be derived for non-symmetric losses. More specifically, given a classification-calibrated PERM loss, is there a principled way to modify the loss to make it c-classification calibrated? Combining losses. Hui et al. (2023) demonstrate the effectiveness of summing existing losses (the cross entropy and the square loss) to design new losses for classification. Unfortunately, our result in Proposition 4.8 does not apply to their new loss since the square loss is not totally regular. One approach may be to connect our technique with that of Williamson and Cranko (2023, 6) for studying the geometric properties of sums of losses. Beyond classification. While this work is concerned with classification-calibration, there are many works on calibration for more general discrete supervised learning tasks. Ramaswamy and Agarwal (2016) developed theory for multiclass classification with abstain option and, more generally, losses defined over finite sets i.e., discrete losses. Finocchiaro Unified Margin-Based Classification et al. (2019) showed that there exists polyhedral losses that are calibrated with respect to arbitrary discrete losses. An interesting question is if the PERM loss framework can useful for analyzing such surrogate losses. Beyond classification-calibration, a potentially fruit direction is to employ the PERM loss framework to study optimization and learning theory. Acknowledgments The authors were supported in part by the National Science Foundation under awards 1838179 and 2008074, and by the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002. YW is supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Futures program. Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Multi-Class H-Consistency Bounds. In Advances in Neural Information Processing Systems, 2022. Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Oscar Beijbom, Mohammad Saberian, David Kriegman, and Nuno Vasconcelos. Guessaverse loss functions for cost-sensitive multiclass boosting. In International Conference on Machine Learning, pages 586 594. PMLR, 2014. M Beltagy and S Shenawy. On the boundary of closed convex sets in En. ar Xiv preprint ar Xiv:1301.0688, 2013. Arnab Bhattacharyya, Suprovat Ghoshal, and Rishi Saket. Hardness of learning noisy halfspaces using polynomial thresholds. In Conference On Learning Theory, pages 876 917. PMLR, 2018. David A Blackwell and Meyer A Girshick. Theory of games and statistical decisions. Courier Corporation, 1979. Mathieu Blondel. Structured prediction with projection oracles. Advances in neural information processing systems, 32, 2019. Mathieu Blondel, Andr e FT Martins, and Vlad Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1 69, 2020. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004. Erin J Bredensteiner and Kristin P Bennett. Multicategory classification by support vector machines. Computational Optimization: A Tribute to Olvi Mangasarian Volume I, pages 53 79, 1999. Wang and Scott Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021. Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernelbased vector machines. Journal of machine learning research, 2(Dec):265 292, 2001. Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3(Jan):951 991, 2003. Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287 316. PMLR, 2014. Ur un Dogan, Tobias Glasmachers, and Christian Igel. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17(45):1 32, 2016. John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246 3275, 2018. Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2006. Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. Advances in Neural Information Processing Systems, 29, 2016. Jessica Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. Advances in Neural Information Processing Systems, 32, 2019. Sheng Fu, Piao Chen, and Zhisheng Ye. Simplex-based proximal multicategory support vector machine. IEEE Transactions on Information Theory, 2022. Simon I Hill and Arnaud Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30:525 564, 2007. Like Hui, Mikhail Belkin, and Stephen Wright. Cut your losses with squentropy. In Proceedings of the 40th International Conference on Machine Learning, pages 14114 14131. PMLR, 2023. Aapo Hyv arinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411 430, 2000. Tony Jebara and Pannagadatta Shivaswamy. Relative margin machines. Advances in Neural Information Processing Systems, 21, 2008. Yuri Kalnishkan and Michael V Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8):1228 1244, 2008. Carl Kesler. Preliminary Experiments on Perceptron: Application to Bubble Chamber Event Recognition. Cornell University, 1961. Unified Margin-Based Classification John M Lee. Smooth manifolds. In Introduction to smooth manifolds, pages 1 31. Springer, 2013. Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. Yufeng Liu. Fisher consistency of multicategory support vector machines. In Artificial intelligence and statistics, pages 291 298. PMLR, 2007. Phil Long and Rocco Servedio. Consistency versus realizable h-consistency for multiclass classification. In International Conference on Machine Learning, pages 801 809. PMLR, 2013. Arthur Mensch, Mathieu Blondel, and Gabriel Peyr e. Geometric losses for distributional learning. In International Conference on Machine Learning, pages 4516 4525. PMLR, 2019. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, 2018. Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, and Jean-Jeacques Slotine. Multiclass learning with simplex coding. Advances in Neural Information Processing Systems, 25, 2012. Indraneel Mukherjee and Robert E Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 2013. Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. A general theory for structured prediction with smooth convex surrogates. ar Xiv preprint ar Xiv:1902.01958, 2019. Robert J Plemmons. M-matrix Characterizations. i. Nonsingular M-matrices. Linear Algebra and its applications, 18(2):175 188, 1977. Guillaume Pouliot. Equivalence of multicategory SVM and simplex cone SVM: Fast computations and statistical theory. In International conference on machine learning, pages 4133 4140. PMLR, 2018. Harish G Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397 441, 2016. R Tyrrell Rockafellar. Convex analysis, volume 18. Princeton University Press, 1970. Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. Advances in Neural Information Processing Systems, 16, 2003. Mohammad Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. Advances in Neural Information Processing Systems, 24, 2011. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. Wang and Scott Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. Zhiqiang Tan and Xinwei Zhang. On loss functions and regret bounds for multi-category classification. IEEE Transactions on Information Theory, 2022. Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007. Gertjan van den Burg and Patrick Groenen. Gen SVM: A generalized multiclass support vector machine. Journal of Machine Learning Research, 17:1 42, 2016. Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998. Yutong Wang and Clayton Scott. Weston-Watkins hinge loss and ordered partitions. Advances in Neural Information Processing Systems, 33:19873 19883, 2020. Yutong Wang and Clayton Scott. An exact solver for the Weston-Watkins SVM subproblem. In International Conference on Machine Learning, pages 10894 10904. PMLR, 2021. Yutong Wang and Clayton Scott. On classification-calibration of gamma-phi losses. In The Thirty Sixth Annual Conference on Learning Theory, pages 4929 4951. PMLR, 2023a. Yutong Wang and Clayton Scott. Unified binary and multiclass margin-based classification. ar Xiv preprint ar Xiv:2311.17778, 2023b. Jason Weston and Chris Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998. Robert C Williamson and Zac Cranko. The geometry and calculus of losses. Journal of Machine Learning Research, 24(342):1 72, 2023. Robert C Williamson, Elodie Vernet, and Mark D Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:1 52, 2016. Tong Tong Wu and Kenneth Lange. Multicategory vertex discriminant analysis for highdimensional data. The Annals of Applied Statistics, 4(4):1698 1721, 2010. Mingyuan Zhang and Shivani Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. Advances in Neural Information Processing Systems, 33:16927 16936, 2020. Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004. Zhihua Zhang, Michael Jordan, Wu-Jun Li, and Dit-Yan Yeung. Coherence functions for multicategory margin-based classification methods. In Artificial Intelligence and Statistics, pages 647 654. PMLR, 2009. Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2(4):1290, 2008. Unified Margin-Based Classification Appendix A. Notational conventions and definitions For the reader s convenience, we tabulate the notational convention used in this work in Table 1. Moreover, we tabulate the mathematical objects defined in Table 2. Finally, we require the following additional notations: Permutations. A bijection from [k] to itself is called a permutation (on [k]). Recall we denote by Sym(k) the set of all permutations on [k]. Two permutations σ, σ Sym(k) can be composed resulting in another permtuation σσ Sym(k) defined by function composition: for y [k], we have σσ (y) := σ(σ (y)). For i, j [k], let τ(i,j) Sym(k) denote the transposition which swaps i and j, leaving all other elements unchanged. More precisely, τ(i,j)(i) = j, τ(i,j)(j) = i and τ(i,j)(y) = y for y [k]\{i, j}. Define the notational shorthand τi := τ(k,i), the transposition that swaps k and i. Permutation matrices. Recall that for each σ Sym(k), Sσ Rk k denotes the permutation matrix corresponding to σ. In other words, if v Rk is a vector, then [Sσv]j = [v]σ(j) = vσ(j). Note that if σ, σ Sym(k), then Sσσ = Sσ Sσ where the order of compositions are reversed (Lemma B.6). Define the notational shorthand T(i,j) := Sτ(i,j) the matrix corresponding to the transposition of i and j. Likewise, define Ti := T(k,i). When k is ambiguous, we disambiguate it with the superscript notation and write S(k) (i,j) and T(k) (i,j). Topology. Let S be a subset of a topological space. Let int(S) and bdry(S) denote the interior and the boundary of the set S, respectively. See Table 1 for the full list of symbols. Mathematical object Description of notation Example of notation Set of all permutations Sym(k) Permutations Lower case sigma or tau σ, τ Transpositions tau with subscripts τ(i,j) Vector Bold lower case v, w Entries of vector Normal font lower case v1, v2, . . . Special vector Blackboard font All zeros/ones vector in Rn 0(n), 1(n) i-th elem. basis vector in Rn e(n) i Matrix Bold upper case A j-th Column [A]:j n n Identity In Table 1: Notation convention used throughout this work. Appendix B. Properties of the matrix label code As mentioned in the main text, the definition of matrix label code has already been introduced in Wang and Scott (2020, Definition S3.14). However, that work, the matrix label code is used to analyze a particular optimization problem resulting from the Weston- Wang and Scott Symbol Description Defined in... L Multiclass loss function Definition 2.2 ℓ Reduced form of L Definition 2.2 item 2 ψ Template of L Definition 2.2 item 3 Ω Negentropy Definition 5.2 LΩ,µ Fenchel-Young loss assoc. to Ω, µ Definition 5.1 trunc[ψ] truncation of ψ Proposition 4.3 ψ(n) trunc applied k n times to ψ Corollary 4.4 L(n) PERM loss corresponding to ψ(n) Definition 4.5 trunc[ℓ] truncation of ℓ Definition H.3 pad zero-padding operator Proposition 5.7 trunc[Ω] truncation of Ω Proposition 5.7 Ω(n) trunc applied k n times to Ω Corollary 5.8 D relative margins conversion matrix Definition 2.1 Mσ DSσD Definition B.4 Υ matrix label code Definition 2.4 T Transposition matrix Section A τ Transposition permutation Section A A(z) gradient of ℓ(z) Lemma G.6 P (resp. Q) drop last (resp. first) coordinate Lemma G.25 (resp. G.26) ran( ) and ranc( ) range of convex hull of range Definition 3.1 ran c(f) and ran c(f) closure and interior of ranc(f) Definition G.27 N(ζ; S) Positive normals Definition G.35 Table 2: Mathematical objects and where they are defined. Watkins SVM (Weston and Watkins, 1998). Here, we develop a comprehensive theory for using matrix label code with PERM loss. With the exception of Lemma B.15, all results in this section have already appeared in Wang and Scott (2020, 2021) using slightly different notation. For the reader s convenience, the proofs are omitted here but are all included in the ar Xiv version of this manuscript (Wang and Scott, 2023b). First, we note that for y {1, . . . , k 1}, the matrix Υ(k) y acts on a vector z = (z1, . . . , zk 1) Rk 1 by j [k 1], [Υ(k) y z]j := ( zj zy : j = y zy : j = y. (6) Remark B.1 Throughout this section, we consider k fixed and write Υ1, , Υk instead of Υ(k) 1 , , Υ(k) k . Moreover, whenever Ty does not have a superscript, we implicitly assume that Ty = T(k) y . In some results, we will work with (k 1) (k 1) permutation matrices in which case we will write, for example, T(k 1) (y,y ) where y, y [k 1]. Unified Margin-Based Classification Lemma B.2 For all y [k], we have DTy = Υy D. In particular, for all i > 1 and j [k 1], we have ( vy vj : y = j vy vk : y = j. Remark B.3 Let D denote the Moore-Penrose inverse of D. Since D contains a copy of the identity matrix, D has full column rank. Hence, DD is the identity. Definition B.4 For each σ Sym(k), define the (k 1) (k 1) matrix Mσ := DSσD . Lemma B.5 For all y [k], we have Mτy = Υy. Lemma B.6 If σ, σ Sym(k), then Sσσ = Sσ Sσ. Lemma B.7 For all σ, σ Sym(k), we have Mσσ = Mσ Mσ. Lemma B.8 For all y [k], Υ2 y is the identity. Lemma B.9 Let y1, y2 [k 1] be distinct, i.e., y1 = y2. Then τy1τy2τy1 = τ(y1,y2), as elements of Sym(k). Moreover, Ty1Ty2Ty1 = T(y1,y2), as elements of Rk k. Corollary B.10 Every σ Sym(k) can be written as a product σ = τy1τy2 τyn for some integer n 0 and yi [k 1] for each i [n]. Lemma B.11 For all y1, y2 [k 1], we have T(k 1) (y1,y2)D = DT(k) (y1,y2). Lemma B.12 Let y1, y2 [k 1] be distinct. Then T(k 1) (y1,y2) = Υy1Υy2Υy1. Lemma B.13 Let y, j [k]. Then ( T(k 1) (j,y) ΥjΥy : both y and j [k 1] ΥjΥy : otherwise. (7) Proposition B.14 For an arbitrary σ Sym(k 1), let σ Sym(k) denote the permutation that extends σ to [k] , i.e., σ (k) := k and σ (y) := σ(y) for y [k 1]. Then we have SσD = DSσ . Lemma B.15 Let 1 := 1(k 1) denote17 the (k 1)-dimensional (column) vector of all ones. Let ey := e(k 1) y denote the (k 1)-dimensional y-th elementary basis (column) vector. Let y, j [k] be such that y = j. Then we have the following identities Υy T(y,j)Υj = (1 + ej)(ej ey) if y, j [k 1], (8) Υy Υj = (1 + ej)(ej) if y = k, (9) Υy Υj = (1 + ey)(ey) if j = k. (10) 17. We drop the superscript (k 1) for convenience. Wang and Scott Appendix C. Proof of Theorem 2.5 We first recall the definition of a template from Theorem 2.5. For the reader s convenience, we restate it as a definition below: Lemma C.1 Let L be a PERM loss with template ψ. Let D be as in Definition 2.1. Then for all v Rk we have Lk(v) := [L(v)]k = ψ(Dv). Proof Recall from Definition 2.1 that Dv = (vk v1, vk v2, . . . , vk vk 1) . Thus, the result now follows immediately from Definition 2.2. For the reader s convenience, the following is a restatement of Theorem 2.5. Theorem C.2 Let L : Rk Rk be a PERM loss with template ψ, and let v Rk and y [k] be arbitrary. Then ψ is a symmetric function and L can be expressed as Ly(v) = ψ(Υy Dv). (11) Conversely, given a symmetric function ψ : Rk 1 R, the function L = (L1, . . . , Lk) : Rk Rk defined componentwise via eq. (11) is a PERM loss with template ψ. Proof [Proof of Theorem C.2] We begin with the = direction of the proof. First, we check that ψ : Rk 1 R is symmetric, i.e., ψ(Sσz) = ψ(z) for all z Rk 1 and all σ Sym(k 1). Below, fix such z and σ. Pick v Rk such that Dv = z. For instance, let v := z 0 . Define σ Sym(k) to be the permutation that extends σ to [k] as in Proposition B.14. Then ψ(Sσz) = ψ(SσDv) = ψ(DSσ v) Proposition B.14 = [L(Sσ v)]k Definition 2.2 of the template ψ of a PERM loss = [Sσ L(v)]k L is permutation equivariant (Definition 2.2) = [L(v)]σ (k) Sσ is the permutation matrix of σ = [L(v)]k Definition of σ in Proposition B.14 = ψ(Dv). Lemma C.1 Since by definition ψ(Dv) = ψ(z), we have by the above that ψ(Sσz) = ψ(z) for all z. This proves that ψ is symmetric. Next, we prove eq. (11), i.e., [L(v)]y = ψ(Υy Dv) for all y [k]. Again, this is a straight forward computation: [L(v)]y = [L(v)]τy(k) = [Ty L(v)]k definitions of τy and Ty = [L(Tyv)]k L is permutation equivariant = ψ(DTyv) Lemma C.1 = ψ(Υy Dv) Lemma B.2. This proves eq. (11) and thus the = direction follows. The proof of the = direction, i.e., the Conversely part of the theorem, proceeds similarly. For the reader s convenience, it is omitted here but is included in the ar Xiv version of this manuscript (Wang and Scott, 2023b). Unified Margin-Based Classification Lemma C.3 Let L : Rk Rk be a function such that Ty L(v) = L(Tyv) for all y [k] and v Rk. Then L is permutation-equivariant, i.e., L(Sσv) = SσL(v) for all y [k] and v Rk. Proof For an arbitrary σ Sym(k), write σ = τy1τy2 τyn as a product of transposition where n 0 is an integer and yi [k 1] for each i [n]. This is possible because of Corollary B.10. By Lemma B.6, we have Sσ = Tyn Ty2Ty1. Thus L(Sσv) = L(Tyn Ty1v) = Tyn L(Tyn 1 Ty1v) = = Tyn Ty1L(v) = SσL(v) where = = denotes repeated application of L(Tyv) = Ty L(v) for all y [k]. Remark C.4 For convenience, we now summarize the relationship between L, ℓand ψ from Definition 2.2 in terms of the input to ψ, typically denoted below by z Rk 1. Let y [k] and z Rk 1 be arbitrary. Define v Rk be such that Dv = z, where D is as defined in Definition 2.1. Since DD is the identity (Remark B.3), we can for instance let v = D z. Then we have ℓy(z) = [L(v)]y = ψ(Υyz). (12) The first equality is eq. (1) (rewritten using Dv = z) and the second equality is eq. (11). A useful corollary of the above identity is the following: Sσℓ(z) = ℓ(Mσz), for all σ Sym(k). (13) To prove eq. (13), we let v := D z as discussed earlier. Then Sσℓ(z) 1= SσL(v) 2= L(Sσv) 3= ℓ(DSσv) 4= ℓ(DSσD z) 5= ℓ(Mσz). For 1 and 3, we used eq. (11). For 2, we used permutation equivariance of L. For 4, we used the definition of v. For 5, we used Definition B.4 of Mσ. Appendix D. Well-incentivized losses This section discusses the well-incentivized property of a loss that should be satisfied for a multiclass classification loss to be useful. This will be made clear after Remark D.2. Definition D.1 (Well-incentivized losses) Let L be a multiclass loss function (Definition 2.2). We say that L is well-incentivized (resp. strictly well-incentivized) if for all v Rk and distinct y, j [k], vj vy (resp. vj < vy) implies Lj(v) Ly(v) (resp. Lj(v) > Ly(v)). Remark D.2 Suppose that L is well-incentivized. Let v Rk and y arg maxj [k] vj. Then Ly(v) = minj [k] Lj(v). Remark D.2 implies that for well-incentivized losses, the arg max predictor is correct, i.e., Ly(v) is minimized when the class score [v]y is the highest. Next, we define a condition on the template ψ such that the corresponding PERM loss is well-incentivized: Wang and Scott Definition D.3 A function f : Rn R is non-increasing (resp. decreasing) if for all v, w Rn such that v w (resp. v w) we have f(v) f(w) (resp. f(v) < f(w)). Proposition D.4 Let f : Rn R be a continuously differentiable function. If f( ) 0 (resp. f( ) 0) then f is non-increasing (resp. decreasing). Proposition D.5 Let L be a PERM loss whose template ψ is non-increasing (resp. decreasing). Then L is (resp. strictly) well-incentivized. The proofs of Propositions D.4 and D.5 are straightforward calculations involving the Fundamental Theorem of Calculus. For the reader s convenience, they are omitted here but are included in the ar Xiv version of this manuscript (Wang and Scott, 2023b). Appendix E. Proof of Proposition 4.6 The proof is a straightforward calculation involving the chain rule and properties of the functions φ and γ. For the reader s convenience, it is omitted here but is included in the ar Xiv version of this manuscript (Wang and Scott, 2023b). Appendix F. Proof of Proposition 4.8 We first prove that L+L is regular. Let ψ and ψ be the templates of L and L respectively. Then L + L has template ψ + ψ . All conditions in Definition 4.2 clearly holds, except for the semi-coercive property which we now check. Let c, b R be arbitrary and satisfy {z Rn : ψ(z) c} {z Rn : b min j [n] zj}. Since ψ and ψ are both non-negative, we have ψ(z) + ψ (z) c implies ψ(z) c. Thus {z Rn : ψ(z) + ψ (z) c} {z Rn : ψ(z) c} {z Rn : b minj [n] zj}, which shows that ψ + ψ is semi-coercive, as desired. The fact that L + L is totally regular follows immediately from the definition of the truncation in Proposition 4.3. That λL is totally regular is completely straightforward and thus we omit its proof. Appendix G. Properties of Regular PERM losses In this section, we will prove several key properties of regular PERM losses which were introduced in Definition 4.2. Recall that a regular PERM loss has a template ψ such that ψ is nonnegative, twice differentiable, strictly convex, semi-coercive, and the gradient ψ(z) 0 is entrywise negative for all z Rk 1. Section G.1 focuses on consequences of the semi-coercivity condition. Sections G.2 and G.3 focus on consequences of the other aforementioned conditions. Finally, Section H presents the proof of our main result Theorem 4.7. Unified Margin-Based Classification G.1 Semi-coercive functions In this section, we study the properties of a PERM loss L whose template ψ is semi-coercive (Definition 4.1). Lemma G.1 Let L : Rk Rk + be a PERM loss whose template ψ is semi-coercive. Let ℓ be the reduced form of L. Then, for all ζ Rk, the set {z Rk 1 : ℓ(z) ζ} is bounded. Proof Below, let z denote an arbitrary element of Rk 1. In set-builder notations, we write {z : condition} to mean {z Rk 1 : condition}. Now, by eq. (12), we have that ℓ(z) ζ if and only if ψ(Υyz) ζy for all y [k]. Thus, we get {z : ℓ(z) ζ} = T y [k]{z : ψ(Υyz) ζy} = T y [k] Υy ({z : ψ(z) ζy}) (14) For the last equality above, we used the fact that Υy = Υ 1 y (Lemma B.8) and that {z : ψ(Υyz) ζy} = Υ 1 y ({z : ψ(z) ζy}) for all y [k]. Moreover, by the semi-coercivity assumption on ψ we have that for all y [k] there exists by R such that {z : ψ(z) ζy} {z : min z by}. Putting it all together, we have {z : ℓ(z) ζ} T y [k] Υy({z : min z by}) =: B (15) Thus, it suffices to show that B is bounded. Below, we prove this. Since the empty set is bounded, we assume below that B is nonempty. Let z B Rk 1 be a fixed arbitrary point. First, recall the infinity-norm: z := max{|z i| : i [k 1]}. To prove that B is bounded, it suffices to show that there exists some number M such that z M. Note that we can express z alternatively as z = max {| max z |, | min z |} , (16) where max z := maxj [k 1] z j and min z := minj [k 1] z j. Define M1 = max{|by| : y [k]} and M2 = max{|by + bj| : y [k], j [k]}. Finally, define M = max{M1, M2}. We claim that z M. First, we note that min z bk. To see this, first recall that Υk is the identity (Definition 2.4). From eq. (15) we have B Υy({z : min z by}) for each y [k]. Now, recall from Definition 2.4 that Υk is the identity matrix. Thus in particular B {z : min z bk} and so min z bk holds. Next, let y arg minj [k 1] z j and thus z y = min z . From eq. (15), we have z Υy({z : min z by}). Thus, Υyz {z : min z by} and in particular, [Υyz ]y by. Moreover, by eq. (6), we have [Υyz ]y = z y = min z , and thus min z by. Combining with the result from the previous paragraph, we now have that min z [bk, by] and, in particular, | min z | M1. Note that here, we used the fact that if α, β, γ R are such that γ [α, β], then |γ| max{|α|, |β|}. Next, let l arg maxj [k 1] z j (and y be still the same as before). First consider the case when l = y. Then z is a constant vector and z = | min z | in which case z M1 M holds. Wang and Scott Next, consider the case when l = y. Then we have [Υlz ]y = z l z y = (min z ) (max z ) by eq. (6). Similar as in the previous case, z B Υl({z : min z bl}) which implies that [Υlz ]y bl. Thus, max z min z bl (by + bl). Furthermore, max z min z bk. Thus, we ve shown that max z [bk, (by + bl)]. This implies that | max z | max{M1, M2} = M, where again we used the fact that γ [α, β] implies |γ| max{|α|, |β|}, for arbitrary α, β, γ R. To conclude, since | min z | M, by Equation (16), we have z M. Proposition G.2 Let L be a nonnegative PERM loss with template ψ. If ψ is semi-coercive (Definition 4.1), then CL p is coercive for all p int( k). Proof Let c R be arbitrary and S := {z Rk 1 : CL p (z) = p, ℓ(z) c} be the c-sublevel set. To show that CL p is coercive, we show that S Rk 1 is a bounded set. Observe that for all z S we have c p, ℓ(z) = P y [k] pyψ(Υyz) pyψ(Υyz) for all y [k]. We remark that the preceding inequality crucially uses ψ( ) 0. Thus, S T y [k]{z Rk 1 : ψ(Υyz) c/py}, where the right hand side is a bounded set (Lemma G.1). Hence, S is also bounded. G.2 The link function In this section, we study the set of minimizers of the conditional risk of a PERM loss L, i.e., the set arg minz Rk 1 CL p (z). When L is the multinomial cross entropy (Example 1), this argmin is a singleton set for all p int( k) and the mapping from int( k) p to this unique minimizer recovers the logit function. For a general loss L, this mapping is sometimes referred to as the link function (Nowak Vila et al., 2019; Williamson et al., 2016). See Definition G.5 below. This section will study the properties of the link function, culminating in a sufficient condition for when the link function is a bijection (Proposition G.9). Proposition G.3 Let L be a nonnegative PERM loss with template ψ. If ψ is convex, then CL p is convex for all p k. Furthermore, if ψ is strictly convex, then CL p is strictly convex for all p int( k). Proof Recall that CL p (z) = P y [k] pyψ(Υyz) where Υy is an invertible matrix by Lemma B.8. Thus, if ψ is (strictly) convex, then z 7 ψ(Υyz) is (strictly) convex for each y [k]. For each p k, CL p is a convex combination of convex function and is thus convex. Furthermore, if p int( k), then CL p is a convex combination of strictly convex function and is thus strictly convex. See Boyd and Vandenberghe (2004, Section 3.2.1) for instance. An easy consequence of the above result is the following: Corollary G.4 Let p int( k) be arbitrary and L be a nonnegative PERM loss whose ψ is semi-coercive. If ψ is convex, then the infimum infz Rk 1 CL p (z) is attained. Furthermore, if ψ is strictly convex, then the infimum is attained by a unique minimizer, i.e., arg minz Rk 1 CL p (z) is a singleton set. Unified Margin-Based Classification Proof By Proposition G.2, CL p is coercive. By Proposition G.3, CL p is strictly convex. By the Extreme Value Theorem, a continuous and coercive function has at least one global minimum. Furthermore, a strictly convex functions have at most one global minimum. For a reference of these result standards, see Boyd and Vandenberghe (2004, Section 4.2). In view of Corollary G.4, we define: Definition G.5 Let L be a PERM loss whose template ψ is nonnegative, strictly convex and semi-coercive. Define the link function link L : int( k) Rk 1 by letting link L(p) be the unique element of arg minz Rk 1 CL p (z). In this section, we give a sufficient condition on L for link L of Definition G.5 to be a bijection. We will need the concept of an M-matrix, which is reviewed in Section K.1. Lemma G.6 Let L : Rk Rk + be a regular PERM loss (Definition 4.2) with reduced form ℓand template ψ. For all z Rk 1, the (k 1) (k 1) matrix A(z) := ℓ1(z) ℓk 1(z) is a non-singular M-matrix with strictly positive diagonal elements. Furthermore, both A(z) and A(z) are strictly monotone. While our usage of the operator is standard, for completeness we refer the reader to the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b) for our notations on the derivative operator . The definition of a (strictly) monotone matrix is discussed in Section K.1. Proof First, we compute the gradient of ℓ= (ℓ1, . . . , ℓk) : Rk 1 Rk. For each y [k], we have by definition that ℓy(z) = ψ(Υyz) (eq. (12)). Thus, by the chain rule, we have ℓy(z) = Υ y ψ(Υyz). (17) Next, fix y [k 1] and z Rk 1. Let w := ψ(Υyz). Then by the assumption that ψ( ) 0, we have w is a entrywise negative column vector, i.e., w 0. Moreover, by eq. (17) we have ℓy(z) = Υ y w. Thus, for each j [k 1], we have [Υ y w]j = ([Υy]:j) w. Recall from Definition 2.4 that ( e(k 1) j : j = y 1(k 1) : j = y , which implies that [Υ y w]j = ( wj : j = y P l [k 1] wl : j = y. In particular, [Υ y w]j 0 for all j = y which proves that A(z) is a Z-matrix. Furthermore, note that the fact w < 0 and [Υ y w]y = P l [k 1] wl implies that the diagonals of A(z) are positive. Observe that Υ y w has the property that |[Υ y w]y| = P l [k 1] wl > P l [k 1]:l =y wl = P l [k 1]:l =y |[Υ y w]l|. Note that the strict inequality above follows from the fact that w 0 and so in particular wy > 0. This proves that A(z) is strictly diagonally dominant. By Corollary K.3, we Wang and Scott have that A(z) is a non-singular M-matrix. For the Furthermore part, we can apply Lemma K.5 since the diagonal elements of A(z) are positive. By Corollary K.6, A(z) is also strictly monotone. Lemma G.7 Let L : Rk Rk + be a regular PERM loss. Let z Rk 1 and p k be arbitrary. Then z minimizes CL p if and only if pk ψ(z) = A(z) p1 pk 1 (18) where A(z) is as in Lemma G.6. Furthermore, CL p has a minimizer, then p int( k). Proof Proposition G.3 asserts that CL p is convex. For a differentiable convex function, recall that the gradient-vanishing condition is necessary and sufficient for optimality of unconstrained optimization (Boyd and Vandenberghe, 2004). Thus, z minimizes CL p iff 0 = CL p (z) = P j [k] pj ℓj(z) = pk ψ(z) + A(z) p1 pk 1 . (19) which is equivalent to eq. (18). Next, for the Furthermore part, first note that Lemma G.6 says A(z) is a non-singular M-matrix. Now, we first show that pk = 0. If pk = 0, then Equation (18) reduces to 0 = A(z) p1 pk 1 . Since A(z) is non-singular, we must have p1 = = pk 1 = 0 as well which implies that p = 0. But this contradicts that p k. Thus, pk > 0 and so pk ψ(z) 0. From Lemma G.6, we have that A(z) is strictly monotone. Combined with eq. (18), we can conclude that py > 0 for each y [k 1] as well. The Furthermore part of Lemma G.7 immediately implies the following. Corollary G.8 If p k \ int( k), then arg minz Rk 1 CL p (z) = . Proposition G.9 Let L be a regular PERM loss. Then link L (Definition G.5) is a bijection. Proof First, we prove that link is injective. Suppose that p, q int( k) are such that link L(p) = link L(q) =: z. Then by eq. (18), we have that ψ(z)A(z) 1 = p 1 k p1 pk 1 = q 1 k q1 qk 1 . (20) Thus, (1 pk)/pk = (p1 + + pk 1)/pk = (q1 + + qk 1)/qk = (1 qk)/qk implies that pk = qk. Therefore, eq. (20) implies that py = qy for each y [k 1] as well. Thus, p = q which proves that link is injective. Next, we prove that link is surjective. Pick z Rk 1. From Lemma G.6, we have that A(z) is non-singular and strictly monotone. By non-singular-ness, there exists v Rk 1 such that ψ(z) = A(z) v. Furthermore, since ψ(z) 0 and A(z) is strictly monotone, we have v 0. Define p1, . . . , pk by pk := (v1 + + vk 1 + 1) 1 and py := vypk for each y [k 1]. Clearly, we have p 0. Furthermore, p1 + p2 + + pk = pk(v1 + + vk 1 + 1) = 1. Unified Margin-Based Classification Thus, we have p int( k). By construction, z and p satisfy eq. (18). Thus link L(p) = z follows from the definition of link L. Remark G.10 Before proceeding, we remark that Proposition G.9 gives theoretical support to the conjectural observation in Nowak-Vila et al. (2019, Remark 3.1) regarding the injectivity of the link function. G.3 Geometry of the loss surface Recall from Definition 3.1 and Theorem 3.3 that the classification-calibration of the set ranc(L) implies the classification-calibration of the loss L. In general, the set ranc(L) may be difficult to compute. In this section, we study the geometry of the set ran(L) when L is a regular PERM loss which enables us to compute the convex hull ranc(L) of ran(L). One of the main tools is the mapping defined below: Corollary G.11 Let L be a regular PERM loss with reduced form ℓ. Then ℓ: Rk 1 Rk is injective. Proof Suppose that z, w Rk 1 are such that ℓ(z) = ℓ(w). By Proposition G.9, there exists p int( k) such that link L(p) = z. Now, p, ℓ(z) = p, ℓ(w) implies that both z, w minimize CL p . By Corollary G.4, we have z = w and so ℓis injective. Definition G.12 Given a PERM loss L with reduced form ℓ, we define two functions F and G mapping from Rk 1 R to Rk by F(z, λ) = ℓ(z) + λ1 and G(z, t) = ℓ(z) + te(k) k , where z Rk 1 and λ R. For computing F , we view the tuple z, λ as a (column) vector z λ . Hence, F (z, λ) = ℓ(z) 1 . Likewise for the tuple z, t and G. Remark G.13 In the context of Definition G.12, it is perhaps more precise to write F( z λ ). However, this notation is cumbersome. Below, we will always use the tuple notation. The reason we discuss the vector notation is so that F and G can be more simply treated as gradients of vector input-valued functions (rather than matrix input-valued). Below, we will study the properties of the two functions from Definition G.12. G.3.1 Properties of the F function The main result of this section is to prove that F is a homeomorphism from Rk 1 R to Rk (Corollary G.19). Lemma G.14 Let L be a regular PERM loss with reduced form ℓand F be as in Definition G.12. Then F is injective. In other words, if ℓ(z) + λ1 = ℓ(w) + µ1 for some z, w Rk 1 and λ, µ R, then both z = w and λ = µ. Wang and Scott Proof Let z, w Rk 1 and λ, µ R be as in the statement of the lemma. Our goal is to show that both z = w and λ = µ. First, consider the case that z = w. Then ℓ(z) = ℓ(w) and so λ = µ. Thus, z = w implies λ = µ. Next, consider the case that λ = µ. Then we have ℓ(z) = ℓ(w). By Corollary G.11, we have z = w. Therefore, λ = µ implies z = w. Thus, it only remains to show that if both z = w and λ = µ, then we have a contradiction. Without loss of generality, suppose that λ > µ. Then we have ℓ(z) + (λ µ)1 = ℓ(w). Thus, for all p k, we have p, ℓ(w) = p, ℓ(z) + (λ µ)1 > p, ℓ(z) . Thus, there does not exist p k such that w is the minimizer of CL p . But this contradicts since Proposition G.9 implies that link is surjective. Lemma G.15 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Definition G.12. Then for all (z, λ) Rk 1 R, F (z, λ) is non-singular. Proof Recall that ℓ(z) R(k 1) k and 1(k) Rk. Thus, F (z, λ) = ℓ(z) (1(k)) is a k k square matrix. To show that it is non-singular, first pick v Rk arbitrary. It suffices to check that if F (z, λ)v = 0 then v = 0. Towards this, first note that F (z, λ)v = 0 can be equivalently stated as both ℓ(z)v = 0 and (1(k)) v = v1 + + vk = 0. Replacing v by v if necessary, we assume that vk 0. Recall A(z) from Lemma G.6. Then the identity ℓ(z)v = 0 can be rewritten as vk ψ(z) = A(z) v1 vk 1 . Since A(z) is monotone and vk ψ(z) 0, we get that vy 0 for each y [k 1]. Combined with the fact that v 1 = v1 + + vk = 0, we have that v = 0, as desired. Now, by applying the inverse function theorem18, we immediately have the following. Corollary G.16 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Definition G.12. For all (z, λ) Rk 1 R, there exist open neighborhoods U (z, λ) and V F(z, λ) such that F|U : U V is a diffeomorphism. Proposition G.17 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Definition G.12. The map F is a bijection. Proof Lemma G.14 shows that F is injective. To show that F is surjective, we prove that ran(F) is both open and closed as a subset of Rk. This would imply that ran(F) = Rk since the only subets of Rk that are both open and closed are and Rk. Now, Lemma G.16 shows that ran(F) is an open subset of Rk. It remains to prove that ran(F) is closed. To this end, consider a sequence {(z(i), λ(i))} i=1 such that F(z(i), λ(i)) = ℓ(z(i)) + λ(i)1 converges to ζ Rk. Our goal is to show that ζ ran(F). We begin by first picking ϵ > 0. Let 1 := 1(k) (without the superscript (k)) denotes the k-dimensional vector of all ones. Since the sequence converges, there exists M such that ζ ϵ1 ℓ(z(i)) + λ(i)1 ζ + ϵ1 18. The inverse function theorem is a standard result in multivariate calculus. See the ar Xiv version of this manuscript (Wang and Scott, 2023b) for the result s statement and a textbook reference. Unified Margin-Based Classification for all i M. Before proceeding, we prove a helper lemma. Lemma G.18 (Helper lemma) Let L be a PERM loss with reduced form ℓand template ψ such that ψ( ) 0. Then for all z Rk 1, we have that miny [k 1] ℓy(z) ψ(0(k 1)) =: C, where 0(k 1) is the (k 1)-dimensional all-zeros vector. Proof [Proof of helper lemma] Let v Rk be such that Dv = z, where we recall from Definition 2.1 that D = Ik 1 1(k 1) . As in Remark C.4, we can take v = z 0 . Next, let y arg maxj [k 1] vj. Recall from Section A that min( ) over vector-valued inputs denotes entrywise minimum. Then by Remark D.2 and Proposition D.5, we have that min(L(z)) = Ly(v). Next by eq. (12) from Remark C.4, we have [L(v)]y = ℓy(z). Let w := σy(v). (Recall from Section A that σy Sym(k) is the transposition that swaps k and y.) Note that by construction we have k arg max w. By permutation-equivariance, we have [L(v)]y = [L(v)]σy(k) = [L(σy(v))]k = [L(w)]k. Again by eq. (12) from Remark C.4, we have [L(w)]k = ψ(Dw). Since k arg max w, we have that Dw Rk 1 0 belongs to the non-negative orthtant. In other words, Dw 0(k 1). Finally, since ψ( ) 0, we have that ψ(Dw) ψ(0(k 1)). We now return to the proof of the proposition. Let C be as in the helper lemma. Then min(ℓ(z(i)) + λ(i)1) = min(ℓ(z(i))) + λ(i) C + λ(i). Thus, we have min(ζ) ϵ C + λ(i) and which implies that λ(i) C + ϵ min(ζ) =: D. From this, we get that ℓ(z(i)) ζ + ϵ1 λ(i)1 ζ + (ϵ + D)1 Thus, for all i M we have z(i) {z Rk 1 : ℓ(z) ζ + (ϵ + D)1} which is a bounded set by Lemma G.1. By passing to a subsequence, we may assume that z(i) converges to some z Rk 1. Thus, we have λ(i)1 converges to ℓ(z ) + ζ, which implies in particular that λ(i) converges to some λ . Putting it all together, we have shown that F(z(i), λ(i)) converges to ζ = F(z , λ ) and so ran(F) is closed. Corollary G.19 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Definition G.12. The map F is a diffeomorphism, i.e., F is a differentiable bijection with a differentiable inverse. In particular, F is a homeomorphism. Proof Lee (2013, Proposition 4.6 (f)) states that every bijective local diffeomorphism is a (global) diffeomorphism. Thus, the result follows in view of the facts that F is a bijection (Proposition G.17) and that F is a local diffeomorphism (Corollary G.16). Proposition G.20 Let L be a regular PERM loss with reduced form ℓ. Let F be as in Definition G.12. Consider arbitrary v, x Rk and t R. Define19 α(t) Rk 1 and 19. Note that by Corollary G.19, such α(t) and β(t) exist and are unique with respect to this property. Wang and Scott β(t) R to be such that tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1(k). Then for all t R, we have 1. α : R Rk 1 and β : R R are differentiable, 2. If v 0, then β (t) = dβ dt (t) > 0, 3. β is concave, i.e., β (t) = d2β Remark G.21 We note that α and β in Proposition G.20 implicitly depend on x and v. Proof Below, let 1 := 1(k) be the k-dimensional vector of all ones. To prove the first part, first note that (α(t), β(t)) = F 1(tv + x). Hence, α and β are differentiable. Next, we prove the second part. Let y [k] be arbitrary, to be specified later. Now, the y-th coordinate of tv + x = ℓ(α(t)) + β(t)1 is vyt + xy = ℓy(α(t)) + β(t). (21) Differentiating (21) on both sides with respect to t and applying the chain rule20, we get vy = α (t) ℓy(α(t)) + β (t). (22) For convenience, we adopt the notation f ( ) := ( f( )) for functions with a scalar input. Now, we claim that α (t) ℓy(α(t)) 0 for some y [k]. In fact, we prove this claim by proving a slightly more general statement that will be used again later. Lemma G.22 Let L be a regular PERM loss with reduced form ℓ. Let z, w Rk 1 be arbitrary. Then there exists y [k] such that w ℓy(z) 0. Proof Suppose for the sake of contradiction that w ℓy(z) > 0 for all y [k]. Then 0 ℓ1(z) ℓk 1(z) ℓk(z) w = A(z) ψ(z) w. In other words, we have A(z) w 0 and w ψ(z) > 0. By Lemma G.6, A(z) is strictly monotone . Thus, A(z) w 0 implies21 w 0. But ψ(z) 0 by assumption that L is a regular PERM loss. Hence, w ψ(z) < 0, a contradiction. Applying Lemma G.22 with w = α (t), we get the desired claim. Now, pick y [k] such that α (t) ℓy(α(t)) 0. Thus, from Eqn. (22) we have β (t) = vy α (t) ℓy(α(t)) > 0. This proves the second part of Proposition G.20. Finally, we prove the last part of Proposition G.20. Pick y [k] (unrelated to the earlier choice) such that α (t) ℓy(α(t)) 0. Such a y [k] exists by Lemma G.22 by setting w = α (t). Differentiating eq. (22) with respect to t, we get by the product rule for curves22 that 0 = α (t) ℓy(α(t)) + α (t) d dt( ℓy(α(t))) + β (t) 20. The chain rule is a standard result in multivariate calculus. See the ar Xiv version of this manuscript (Wang and Scott, 2023b) for the result s statement and a textbook reference. 21. This is the definition of strictly monotone matrices . See Section K.1 22. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b). Unified Margin-Based Classification By the chain rule for curves23, we have d dt( ℓy(α(t))) = 2 ℓy(α(t))α (t) which combined with the above implies β (t) = α (t) ℓy(α(t)) + α (t) 2 ℓy(α(t))α (t). Since ℓy is convex, 2 ℓy(α(t)) is positive semidefinite. Therefore, α (t) 2 ℓy(α(t))α (t) 0. Putting it all together, β (t) 0, i.e., β is concave. G.3.2 Properties of the G function Lemma G.23 Let L be a regular PERM loss with reduced form ℓ. Let G be as in Definition G.12. Then for all (z, t) Rk 1 R, the gradient (matrix) G(z, t) is non-singular. Proof The proof proceeds similarly as in Lemma G.15. Recall that ℓ(z) R(k 1) k and e(k) k Rk. Below, we suppress this superscript and simply write ek := e(k) k Thus, G(z, λ) = ℓ(z) e k is a k k square matrix. To show that it is non-singular, first pick v Rk arbitrary. It suffices to check that if G(z, λ)v = 0 then v = 0. Towards this, first note that G(z, λ)v = 0 can be equivalently stated as both ℓ(z)v = 0 and e k v = vk = 0. Recall A(z) from Lemma G.6. Then the identity ℓ(z)v = 0 can be rewritten as 0 = vk ψ(z) = A(z) v1 vk 1 . Since A(z) is non-singular, we get that vy = 0 for each y [k 1]. Thus, we have that v = 0, as desired. Lemma G.24 Suppose that A Rn n and v Rn are such that 1) A is strictly monotone (Definition K.1), 2) v 0 has strictly negative entries, 3) the matrix A v 1 1 is invertible, and 4) M := A v 0 1 A v 1 1 1 R(n+1) (n+1) is also invertible. Then Mn+1,n+1 > 0, i.e., the bottom right entry of M is positive. Proof Define B Rn n, w, u Rn and c R such that A v 1 1 1 = B w u c . Then by definition of the matrix inverse, we have In 0 0 1 = AB + vu Aw + cv 1 B + u 1 w + c Now, we observe that M = A v 0 1 = AB + vu Aw + cv u c Since M is invertible by assumption, we have that c = 0. To finish the proof, it suffices to show that c > 0. Below, we assume that c < 0 and derive a contradiction. The top right block of eq. (23) implies that Aw + cv = 0, i.e., A( w) = cv. Since c < 0 and v 0, we have A( w) 0. By strict monotonicity of A, we have that 23. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b). Wang and Scott w 0, or equivalently, w 0. Finally, the bottom right entry of eq. (23) implies that 1 = 1 w + c < 0, which is a contradiction. Lemma G.25 Let L be a regular PERM loss with reduced form ℓand z Rk 1 be arbitrary. Define P to be the projection Rk Rk 1 that drops the last coordinate, i.e., P([v1, . . . , vk] ) = [v1, . . . , vk 1] . Then there exists z Rk 1 and t R>0 such that Pℓ(z ) + t 1 = P(ℓ(z)). Proof Define functions ζ : Rk 1 R Rk 1 and τ : Rk 1 R R by ζ(z, t) τ(z, t) := F 1 G(z, t). We first show that τ(z, 0) = 0. Observe that F 1(G(z, 0)) = F 1(ℓ(z) + 0 ek) = F 1(ℓ(z) + 0 1) = z 0 . (24) Therefore, ζ(z, t) = z and τ(z, 0) = 0. Next, we claim that τ t (z, 0) = 0. By definition, F 1 G(z, t) = ζ z(z, t) τ z(z, t) ζ Let be a notational shorthand for the input (z, t) to F and G. The gradient of F 1 G F 1 G( ) = G( ) F 1(G( )) = G( )( F (F 1 G( ))) 1 Below, let ek := e(k) k , i.e., we drop the superscript. Now, from the proof of Lemma G.23, we have that G(z, t) = ℓ(z) ek and from the proof of Lemma G.15 that F (z, t) = ℓ(z) 1(k) . By eq. (24), we have G(z, 0)( F (F 1 G(z, 0))) 1 = G(z, 0)( F (z, 0)) 1 The above is invertible because G(z, 0) is invertible by Lemma G.23. Moreover, G(z, 0)( F (z, 0)) 1 = ℓ(z) e k ℓ(z) (1(k)) = A(z) ψ(z) (0(k 1)) 1 A(z) ψ(z) (1(k 1)) 1 By eq. (25), τ t (z, 0) is the bottom right element of G(z, 0)( F (z, 0)) 1. By applying Lemma G.24 to the RHS of eq. (27) with A = A(z) and v = ψ(z), we see that the bottom right entry of G(z, 0)( F (z, 0)) 1 is nonzero. This proves that τ t (z, 0) = 0. Note that Lemmas G.6 and G.23 together guarantee that the requirements of Lemma G.24 are all met. Next, we claim that there exists some t R such that τ(z, t ) > 0. To see this, assume the contrary. Then τ(z, t) 0 for all t R. In particular, t = 0 is a (global) maximizer of t 7 τ(z, t) which implies that τ t (z, 0) = 0, a contradiction. Thus, the claim follows. Unified Margin-Based Classification Now, fix t R such that τ(z, t ) > 0. Define z := ζ(z, t ) and t = τ(z, t ). Then by the definition of ζ and τ, we have F(z , t ) = G(z, t ) by applying F to both side of (z ) t = ζ(z, t ) τ(z, t ) = F 1(G(z, t )). Unwinding the definition of F and G (Definition G.12), the identity F(z , t ) = G(z, t ) implies ℓ(z ) + t 1 = F(z , t ) = G(z, t ) = ℓ(z) + t ek. Now, applying P( ) to both side, we have P(ℓ(z ) + t 1) = P(ℓ(z) + t ek) = P(ℓ(z)) where we used the fact that P( ) is linear and that P(ek) = 0. Corollary G.26 Let L be a regular PERM loss with reduced form ℓand z Rk 1 be arbitrary. Define Q to be the projection Rk Rk 1 that projs the first coordinate, i.e., Q([v1, . . . , vk] ) = [v2, . . . , vk] . Then there exists z Rk 1 and t R>0 such that Q(ℓ(z ) + t 1) = Qℓ(z). Proof Recall P from Lemma G.25. Let σ Sym(k) be such that Q = PSσ. Recall Mσ from Definition B.4. Then Sσℓ(z) = ℓ(Mσz) by eq. (13). Applying Lemma G.25 to Mσz, there exists z Rk 1 and t R>0 such that Pℓ(Mσz) = P(ℓ( z ) + t 1). Putting it all together, we have Qℓ(z) = PSσℓ(z) = Pℓ(Mσz) = P(ℓ( z ) + t 1). Next, Q = PSσ implies QSσ 1 = P. Hence, again applying eq. (13), we have P(ℓ( z ) + t 1) = QSσ 1(ℓ( z ) + t 1) = Q(ℓ(Mσ 1 z ) + t 1). Letting z := Mσ 1 z , we have Qℓ(z) = Q(ℓ(z ) + t 1), as desired. Definition G.27 Let f : Rm Rn be a function. Define the following sets: 1. ran c(f) := {ζ + λ1 : ζ ran(f), λ [0, )} 2. ran c(f) := {ζ + λ1 : ζ ran(f), λ (0, )} When f = ℓis the reduced form of a PERM loss, the above two sets are closely related to ranc(ℓ) (Definition 3.1), as the following lemma and Proposition G.31 show. These sets are convenient alternative characterizations. Lemma G.28 Let L be a regular PERM loss with reduced form ℓ. Then we have the following: 1. ran(ℓ) is closed. 2. ran c(ℓ) is closed and bdry(ran c(ℓ)) = ran(ℓ). 3. ran c(ℓ) = int(ran c(ℓ)) and bdry(ran c(ℓ)) = ran(ℓ). Wang and Scott Proof For item 1, define C := Rk 1 {0} which is a closed subset of Rk 1 R. Now, note that ran(ℓ) = F(C) where F is as in Definition G.12. Since F is a homeomorphism (Corollary G.19), F(C) is closed as well. For item 2, define D := Rk 1 [0, ) which is a closed subset of Rk 1 R. Then ran c(ℓ) = F(D). Thus, as in the previous case ran c(ℓ) is closed. Next, we have bdry(ran c(ℓ)) = bdry(F(D))=F(bdry(D)) = F(C) = ran(ℓ), where the second equality from the left follows from F being a homeomorphism. For item 3, let E = Rk 1 (0, ). Then similar to the above, we have int(ran c(ℓ)) = int(F(D)) = F(int(D)) = F(E) = ran c(ℓ). To conclude, note that bdry(ran c(ℓ)) = bdry(F(E)) = F(bdry(E)) = F(C) = ran(ℓ). Proposition G.29 Let L be a regular PERM loss. Then ran c(ℓ) is convex. Proof Let ζ, ξ ran c(f). Write ζ = ℓ(z) + λ1 and ξ = ℓ(w) + µ1, where z, w Rk 1 and λ, µ [0, ). Let v = ξ ζ Rk and x = ζ. Take α and β as defined in Proposition G.20, i.e., we have for all t R that tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1. (28) Plugging in t = 0 into (28), we get ζ = ℓ(α(0)) + β(0). Thus, α(0) = z and β(0) = λ. Likewise, plugging in t = 1, we get α(1) = w and β(1) = µ. In particular, we have β(0) 0 and β(1) 0. By Proposition G.20, β is concave. Thus, β(t) 0 for all t [0, 1], i.e., tv + w = tξ + (1 t)ζ = ℓ(α(t)) + β(t)1 ran c(ℓ) for all t [0, 1]. This proves that ran c(ℓ) is convex. The following result is a restatement of Beltagy and Shenawy (2013, Theorem 9): Theorem G.30 (Beltagy and Shenawy (2013)) Let C be a nonempty closed convex subset of Rn. If C contains no hyperplane, then C = conv(bdry(C)). Proposition G.31 Let L be a regular PERM loss with reduced form ℓ. Then ranc(ℓ) = ran c(ℓ) and bdry(ranc(ℓ)) = ran(ℓ). Proof Clearly, ran c(ℓ) is nonempty. Furthermore, by Lemma G.28 and Lemma G.29, ran c(ℓ) is closed and convex. Next, note that ran c(ℓ) lies in the nonnegative quadrant [0, )k. Since no hyperplane lies entirely inside the nonnegative quadrant, ran c(ℓ) cannot contain any hyperplane. Hence, we have verified that ran c(ℓ) satisfies the condition of Theorem G.30. To finish the proof, we have ranc(ℓ) = conv(ran(ℓ)) Definition of ranc(ℓ) (29) = conv(bdry(ran c(ℓ))) Lemma G.29 (30) = ran c(ℓ) Theorem G.30 (31) Unified Margin-Based Classification This proves the first part. For the second part, note that bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ) by Lemma G.29. Before moving on, we summarize the results on ranc(ℓ) we have thus obtained below: Corollary G.32 Let L be a regular PERM loss with reduced form ℓ. Recall from Definition G.27 ran c(ℓ) := {ζ + λ1 : ζ ran(ℓ), λ (0, )}. Then ranc(ℓ) is a closed and convex set with the following properties: 1. ranc(ℓ) = {ζ + λ1 : ζ ran(ℓ), λ [0, )} 2. int(ranc(ℓ)) = ran c(ℓ) (see Definitions 3.1 and G.27) 3. bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ). We state one more result about the set ran c(ℓ) which will be useful later. Lemma G.33 Let L be a regular PERM loss with reduced form ℓ. Then ran c(ℓ) = {ζ Rk : z Rk 1 such that ζ ℓ(z)}. Proof Recall that by definition we have ran c(ℓ) = {ℓ(z) + λ1 : z Rk 1, λ (0, )}. Thus, the direction is immediate. For the other inclusion, take ζ = ℓ(z) + v where z Rk 1 and v 0. Let x = ℓ(z). Take α and β as defined in Proposition G.20, i.e., we have for all t R that tv + x = F(α(t), β(t)) = ℓ(α(t)) + β(t)1. (32) Plugging in t = 0 into (32), we get x = ℓ(z) = ℓ(α(0))+β(0). Thus, α(0) = z and β(0) = 0. Recall that v = ζ ℓ(z) 0 by assumption. Hence, by Proposition G.20, β is strictly increasing. In particular, β(1) > β(0) = 0. Now, plugging in t = 1 into (32), we get v + x = v + ℓ(z) = ζ = ℓ(α(1)) + β(1)1. This shows that ζ ran c(ℓ), as desired. Remark G.34 From basic point-set topology24, we know that a closed set is the union of its interior and its boundary. Thus, ranc(ℓ) = int(ranc(ℓ)) bdry(ranc(ℓ)). Hence, a consequence of Lemma G.33 and Lemma G.28 is that ranc(ℓ) is precisely the superprediction set of ℓ(see Williamson et al. (2016, Definition 15) and Kalnishkan and Vyugin (2008)): ranc(ℓ) = {ζ Rk : z Rk 1 such that ζ ℓ(z)}. Recall (Tewari and Bartlett, 2007, Definition 5): Definition G.35 (Admissible sets; Tewari and Bartlett (2007)) Let S Rk + be a set and ζ Rk +. Define the set N(ζ; S) := {p k : ξ ζ, p 0, ξ S}. We say that S is admissible if for all ζ bdry(S) and p N(ζ; S) we have arg min(ζ) arg max(p). 24. The boundary of a set A is defined as the closure of A minus the interior of A. Wang and Scott Proposition G.36 (Tewari and Bartlett (2007)) Let S Rk + be a symmetric set. If |N(ζ; S)| = 1 for all ζ bdry(S), then S is admissible. Lemma G.37 Let L be a regular PERM loss with reduced form ℓ. Let ζ ran(ℓ). Then we have N(ζ; ran(ℓ)) = N(ζ; ranc(ℓ)) = N(ζ; ran c(ℓ)). Proof We first prove that N(ζ; ran(ℓ)) = N(ζ; ranc(ℓ)). Since ranc(ℓ) ran(ℓ), we immediately have N(ζ; ran(ℓ)) N(ζ; ranc(ℓ)). For the other inclusion, we first note that ranc(ℓ) = ran c(ℓ) by Proposition G.31. Thus, every ξ ranc(ℓ) can be written as ξ = α+β1 for some α ran(ℓ) and β 0. Now, let p N(ζ; ran(ℓ)) and let ξ ranc(ℓ) be decomposed as in the preceding sentence. Then ξ ζ, p = α + β1 ζ, p = α ζ, p + β 1, p 0 where the last inequality holds since (1) α ζ, p 0 because p N(ζ; ran(ℓ)), and (2) β 0. Hence, such a p satisfies ξ ζ, p 0, ξ ranc(ℓ) as well which implies that p N(ζ; ranc(ℓ)), as desired. Next, we prove N(ζ; ranc(ℓ)) = N(ζ; ran c(ℓ)). Again, since ranc(ℓ) ran c(ℓ), we immediately have N(ζ; ran c(ℓ)) N(ζ; ranc(ℓ)). For the other inclusion, we first note that clos(ran c(ℓ)) = ranc(ℓ). Suppose p k is such that ξ ζ, p 0 for al ξ ran c(ℓ). Then by continuity, we must have that ξ ζ, p 0 for all ξ clos(ran c(ℓ)) = ranc(ℓ). Proposition G.38 Let L be a regular PERM loss with reduced form ℓ. Then ranc(ℓ) and ran c(ℓ) are both admissible. Proof By Proposition G.36, it suffices to check the following two claims hold: 1. for all ζ bdry(ranc(ℓ)) we have |N(ζ; ranc(ℓ))| = 1, and 2. for all ζ bdry(ran c(ℓ)) we have |N(ζ; ran c(ℓ))| = 1. By Corollary G.32, we have bdry(ranc(ℓ)) = bdry(ran c(ℓ)) = ran(ℓ). Hence, by Lemma G.37, to show both above claims it suffices to show that |N(ζ; ran(ℓ))| = 1 for all ζ ran(ℓ). Note that here we can replace N(ζ; ranc(ℓ)) and N(ζ; ran c(ℓ)) by N(ζ; ran(ℓ)) because of Lemma G.37. Below, fix ζ = ℓ(z) ran(ℓ) where z Rk 1 is arbitrary. Then N(ζ; ran(ℓ)) = {p k : ξ ζ, p 0, ξ ran(ℓ)} Definition G.35 = {p k : ℓ(w) ℓ(z), p 0, w Rk 1} Corollary G.11 = n p k : z arg minw Rk 1 CL p (w) o definition of arg min = n p int( k) : z arg minw Rk 1 CL p (w) o Corollary G.8 = {p int( k) : z = link L(p)} Definition G.5 and Corollary G.4 By Proposition G.9, link L is an injection. Thus, | p int( k) : z = link L(p) | = 1. Unified Margin-Based Classification Appendix H. Proof of Theorem 4.7 Lemma H.1 Suppose that k 3 and y {2, . . . , k}. Let Q be as in Corollary G.26. Then QΥ(k) y = Υ(k 1) y 1 Q. The proof of Lemma H.1 is similar to that of Lemma B.2 and is thus omitted here. For the proof, see the ar Xiv version of this work (Wang and Scott, 2023b). Lemma H.2 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let z Rk 1 and w Rk 2 be arbitrary and such that Qz = w. Then limλ ψ(z + λe(k 1) 1 ) = trunc[ψ](w). Proof Write z = [z1, . . . , zk 1] and w = [w1, . . . , wk 2]. Since Qz = w, we have z2 = w1, z3 = w2 and so on. Now, define g : R R by g(λ) := ψ(z + λe(k 1) 1 ) = ψ(z1 + λ, z2, . . . , zk 1) and h : R R by h(λ) := ψ(λ, w) = ψ(λ, w1, w2, . . . , wk 2) = ψ(λ, z2, z3, . . . , zk 1) = g(λ z1). As argued in the proof of Proposition 4.3, h is decreasing and nonnegative. Thus, g is also decreasing and nonnegative. Moreover, limλ g(λ) = limλ g(λ z1) = limλ h(λ). The right hand side is equal to trunc[ψ](w), as in the proof of Proposition 4.3 Earlier in Proposition 4.3 and Corollary 4.4, we defined trunc[ψ], ψ(n) and L(n). We now define the analogous notation for the reduced form ℓ: Definition H.3 (Truncation of ℓ) Assume k 3. Let L : Rk Rk 0 be a regular PERM loss with template ψ : Rk 1 R. As in Proposition 4.3, define trunc[ℓ] to be the reduced form of trunc[L] (whose template is trunc[ψ]). Lemma H.4 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Definition H.3. Let Q be as in Corollary G.26. Let z Rk 1 and x Rk 0 be arbitrary. For brevity, let e1 := e(k 1) 1 . Then we have lim λ + Q (ℓ(z + λe1) + x) = trunc[ℓ](Qz) + Qx (33) and Q (ℓ(z) + x) trunc[ℓ](Qz). (34) Proof Throughout this proof, let y {2, . . . , k} be arbitrary. First, we claim that Υ(k) y e1 = e1 for each y {2, . . . , k}. If y = k, then this is clearly true since Υ(k) k is, by definition, the identity matrix. Now, for y {2, . . . , k 1}, recall that Υ(k) y is defined as the matrix obtained by replacing the y-th column of the identity matrix with the all 1 s vector. Thus, Wang and Scott since y > 1, the first column of Υ(k) y is equal to that of the identity matrix. Therefore, Υ(k) y e1 = e1 as well. Next, still assuming y {2, . . . , k 1}, we have ℓy(z + λe1) = ψ(Υ(k) y (z + λe1)) = ψ(Υ(k) y z + λe1). Hence, we have lim λ + ℓy(z + λe1) = lim λ + ψ(Υ(k) y z + λe1) Theorem C.2. = trunc[ψ](QΥ(k) y z) Proposition 4.3 and Lemma H.2 = trunc[ψ] Υ(k 1) y 1 Qz Lemma H.1 = [trunc[ℓ]]y 1(Qz) Theorem C.2 and explanation below. Application of Theorem C.2 to the last equality requires a bit more explanation. For said equality, we used the fact that trunc[ℓ] is the reduced form of L(k 1) (Definition H.3), whose template is trunc[ψ]. Thus, applying Equation (12) (which is a collary of Equation (11) from Theorem C.2) to trunc[L], we have [trunc[ℓ]]y 1(z) = trunc[ψ](Υ(k 1) y 1 z). Thus, lim λ + Q (ℓ(z + λe1) + x) = trunc[ℓ](Qz) + Qx. Next, for every y {2, . . . , k}, we note that the function gy(λ) := ℓy(z + λe1) = ψ(Υyz + λe1) is strictly decreasing. To see this, by the chain rule for curves25, we have g y(λ) = ψ(Υyz + λe1) e1 < 0. Thus, ℓy(z) = gy(0) > limλ + gy(λ) = trunc[ℓ]y 1(Qz), which proves that Q (ℓ(z) + x) trunc[ℓ](Qz) + Qx trunc[ℓ](Qz) as desired. Lemma H.5 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Definition H.3. Let Q be as in Corollary G.26. Then Q(ranc(ℓ)) ran c(trunc[ℓ]) and clos[Q(ranc(ℓ))] = ranc(trunc[ℓ]). Proof Let C := Q(ranc(ℓ)) and take ζ C. We first prove C ran c(trunc[ℓ]). By the characterization of ranc(ℓ) from Corollary G.32 item 1, there exists z Rk 1 and x [0, )k such that ζ = Q(ℓ(z) + x). Now let z := Q(z). Applying Eqn. (34) from Lemma H.4, we get ζ = Q (ℓ(z) + x) trunc[ℓ]( z). 25. See the Vector Calculus section in the appendix of the ar Xiv version of this manuscript (Wang and Scott, 2023b). Unified Margin-Based Classification In particular, by the characterization of ran c(trunc[ℓ]) from Lemma G.33, we have that ζ ran c(trunc[ℓ]). This proves that C ran c(trunc[ℓ]). Next, we prove clos[C] = ranc(trunc[ℓ]). We first show that clos[C] ranc(trunc[ℓ]) by proving that every point ranc(trunc[ℓ]) is a limit point of C. Let ζ ranc(trunc[ℓ]). By the characterization of ranc(trunc[ℓ]) as in Corollary G.32, there exists z Rk 2 and x [0, )k 1 such that ζ = trunc[ℓ]( z) + x. Now, pick z Rk 1 and x [0, )k such that z = Qz and x = Qx. Applying Lemma H.4 Eqn. (33), we get that ζ is a limit point of S, which proves the desired claim. This proves that clos(C) ranc(trunc[ℓ]). By the first part, we know that C ran c(trunc[ℓ]). By Corollary G.32, ran c(trunc[ℓ]) = int(ranc(trunc[ℓ])) ranc(trunc[ℓ]). Putting it all together, we have C ran c(trunc[ℓ]) ranc(trunc[ℓ]) clos(C). From Corollary G.32, we have that ranc(trunc[ℓ]) is closed. Since by definition clos(C) is the smallest closed set containing C, we get that ranc(trunc[ℓ]) = clos(C), as desired. Theorem H.6 (Blackwell and Girshick (1979)) Let C Rn be a convex set. Then int(C) = int(clos(C)). Proposition H.7 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Definition H.3. Let Q be as in Corollary G.26. Then we have Q(ranc(ℓ)) = ran c(trunc[ℓ]) Proof For brevity, let C := Q(ranc(ℓ)). By Corollary G.32, ranc(trunc[ℓ]) is convex. Since convexity is preserved under projection, we have that C is convex as well. Now, int(C) = int(clos(C)) Theorem H.6 (35) = int(ranc(trunc[ℓ])) Lemma H.5 (36) = ran c(trunc[ℓ]) Lemma G.28 (37) C Lemma H.5 (38) Since C int(C) by definition, we conclude that C = ran c(trunc[ℓ]). Proposition H.8 Assume k 3. Let L : Rk Rk be a regular PERM loss with template ψ : Rk 1 R. Let Q be as in Corollary G.26. Recall from Proposition 4.3 the truncation of ψ denoted by trunc[ψ] : Rk 2 R. Let trunc[ℓ] be as in Definition H.3. Let Q be as in Corollary G.26. Then we have Q(ran c(ℓ)) = ran c(trunc[ℓ]). Proof By the preceding Proposition H.7, we have Q(ranc(ℓ)) = ran c(trunc[ℓ]). Since ran c(ℓ) ranc(ℓ) we have Q(ran c(ℓ)) Q(ranc(ℓ)). Thus, to prove the result we only have to show Q(ran c(ℓ)) ran c(trunc[ℓ]). To this end, let trunc[ℓ](w) ran c(trunc[ℓ]) and z ranc(ℓ) be such that Qℓ(z) = trunc[ℓ](w). By Corollary G.26, there exist z Rk 1 and t R such that t > 0 Wang and Scott and Q(ℓ(z ) + t 1) = Qℓ(z) = trunc[ℓ](w). Since ℓ(z ) + t 1 ran c(ℓ), we get that trunc[ℓ](w) Q(ran c(ℓ)) as desired. Definition H.9 For each m {0, 1, . . . , k 2}, define trunc m[ℓ] to be m-fold repeated applications of trunc to ℓ, i.e., trunc m[ℓ] := trunc[ trunc[trunc[ℓ]] ] where trunc appears m-times. By convention, let trunc 0[ℓ] = ℓ. Moreover, for each n {2, . . . , k}, define the notational shorthand ℓ(n) := trunc (k n)[ℓ]. Remark H.10 It follows tautologically from the definition of ψ(n) in Corollary 4.4, that define ℓ(n) is the reduced form of L(n) (whose template is ψ(n)). Next, let m 1 be an integer. Below, let Q m denote the m-fold iterated composition of Q. In other words, Q m := Q Q repeated m times. Proposition H.11 Suppose that L is totally regular. Then Let m {1, . . . , k 2}. Then Q m(ranc(ℓ)) = ran c(trunc m[ℓ]). Proof We prove by induction. The case when m = 1 is simply Proposition H.8. Now, suppose that the result holds for m where 1 < m < k 2. Then Q m+1(ranc(ℓ)) = Q(Q m(ranc(ℓ))) = Q(ran c(trunc m[ℓ])) Induction hypothesis = ran c(trunc[trunc m[ℓ]]) Proposition H.8 = ran c(trunc m+1[ℓ]). This completes the induction step and the desired result follows. Before presenting the proof of Theorem 4.7, we recall (Tewari and Bartlett, 2007, Theorem 7): Theorem H.12 (Tewari and Bartlett (2007)) Let S Rk + be a symmetric convex set. Then S is classification calibrated if and only if 1. S is admissible and 2. Q m(S) is admissible for all m {1, . . . , k 2}. Finally, we conclude with the Proof of Theorem 4.7 Assume that we are in the situation stated at the beginning of Section H. Let S = ranc(ℓ(k)). By Theorem H.12, it suffices to prove that 1. S is admissible and 2. Q m(S) is admissible for all m {1, . . . , k 2}. First, from Proposition G.38, S = ranc(ℓ(k)) is admissible. Next, for each m {1, . . . , k 2}, we have by Proposition H.11 that Q m(ranc(ℓ(k))) = ran c(ℓ(k m)). Since L(k m) is a regular PERM loss with reduced form ℓ(k m), we have again by Proposition G.38 that ran c(ℓ(k m)) is admissible. Unified Margin-Based Classification Appendix I. Classification-Calibration of Fenchel-Young losses The goal of this section is two fold. The first subsection (Section I.1) presents the proof of Theorem 5.10 on a sufficient condition for the classification-calibration of Fenchel-Young losses. The second subsection (Section I.2) shows the existence of a totally regular negentropy that is strictly convex, but not strongly convex. To prepare for the later proofs, we introduce a more convenient parametrization of the the negenetropy and its Fenchel-Young loss. Define the reduced k-probability simplex as k := { p := (p1, . . . , pk 1) [0, 1]k 1 : Pk 1 i=1 pi 1}. (39) In other words, k is simply k without the first coordinate. To every function Ω: k R with domain on the k-simplex, we define a corresponding function Ω: k R called the reduced form of Ω, defined by Ω( p) := Ω p1, . . . , pk 1, 1 Pk 1 i=1 pi , for all p = (p1, . . . , pk 1) k. (40) Equation (40) induces a one-to-one correspondence between functions Ω: k R on the simplex k and functions Ω: k R on the reduced simplex k. I.1 Proof of Theorem 5.10 Before proceeding with the proof, we establish two key results. Recall that k is defined in Eqn. (39) and Ωin Eqn. 40. Proposition I.1 Let Ωbe a negentropy (Definition 5.2) and µ R. Then the Fenchel Young loss LΩ,µ associated to Ωand µ (Definition 5.1) is a PERM loss that is closed, convex, and non-negative. The template ψΩ,µ of LΩ,µ is semi-coercive and is given by ψΩ,µ(z) = max p=[p1,...,pk 1] k Ω( p) + µ1 p p, z . (41) Furthermore, if Ωis a regular negentropy, then L is a regular PERM loss. Remark I.2 We note that results from Blondel et al. (2020) may be used to prove portions of Proposition I.1. Combining their Proposition 1Order preservation part with the expression immediately following their Definition 2 can be used to prove that LΩ,µ is permutation equivariant. Moreover, the expression in our Eqn. (41) is related to their Eqn. (15). However, there are key technical differences since we work with the (k 1-dimensional) relative margin, while Blondel et al. (2020) use the class-score formulation. For this reason and for the reader s convenience, we prove Proposition I.1 without leveraging their results. Proof In this proof, all elementary basis vectors are implicitly assumed to be k-dimensional, i.e., we write ey instead of e(k) y . First, recall that the Fenchel conjugate of a closed convex function is again closed convex (Rockafellar, 1970). Next, we show that L is permutation equivariant. First, we show that L is symmetric. By Lemma C.3, it suffices to prove the claim that L(Tjv) = Tj L(v) for all j [k] and v Rk. To this end, let y [k] be arbitrary. Recall that the Fenchel-Young loss is defined as [L(v)]y = maxp k Ω(p) + cy, p ey + v, p ey (42) Wang and Scott [Tj L(v)]y = [L(v)]τj(y) = max p k Ω(p) + v + cτj(y), p eτj(y) definition of L = max p k Ω(p) + Tj(v + cτj(y)), Tj(p eτj(y)) Tj is an isometry = max p k Ω(p) + Tjv + cτj(τj(y)), Tjp eτj(τj(y)) = max p k Ω(p) + Tjv + cy, Tjp ey τj τj is the identity = max p k Ω(Tjp) + Tjv + cy, Tjp ey Ωis symmetric = max p k Ω(p) + Tjv + cy, p ey) Tj| k is a bijection = [L(Tjv)]y definition of L. Since y is arbitrary, we have proven the claim. Next, we show that L is relative margin-based, i.e., eq. (42) depends on v only through Dv. Recall that Dv = vk v1 vk v2 vk vk 1 . See Definition 2.1. Now, only the v, p ey term of eq. (42) depends on v. Thus, it suffices to prove that this term only depends on Dv. To this end, observe that v, p = p1v1 + + pkvk = p1v1 + + pk 1vk 1 + (1 (p1 + + pk 1))vk = vk (p1(vk v1) + + pk 1(vk vk 1)) = vk p, Dv where we write p to denote the vector [p1, . . . , pk 1] . Thus, for each y [k 1], we have v, p ey = v, p vy = vk vy p, Dv = [Dv]y p, Dv This shows that L is relative margin-based. Now the template of L is ψ(z) = ℓk(z) = maxp k Ω(p) + ck, p ek p, z . Since ek k and [ek]y = 0 for y [k 1], we have by construction that ψ(z) Ω(e1) + c1, e1 e1 0, z = 0. When ck = µ(1 ek), we have ψ(z) = max p k Ω( p) + µ1 p p, z , for all z Rk 1. (43) Finally, we prove that ψ is semi-coercive (Definition 4.1), i.e., for all c R there exists b R such that {z Rk 1 : ψ(z) c} {z Rk 1 : b min z}. To this end, let c R and z Rk 1 be such that c ψ(z). Let j arg min z. Then since ej = ek 1 j k, we have c ψ(z) = supp k Ω(p) p, z Ω(ej) + µ ej, z zj = min z. Thus, we have shown that {z Rk 1 : ψ(z) c} {z Rk 1 : min z c}. Unified Margin-Based Classification Next, we prove the Furthermore part. By the first part, it remains to show that ψ is strictly convex, twice differentiable and ψ(z) 0 for all z Rk 1. Define ϕ( p) := Ω( p) µ1 p. Then ϕ : k R is also of Legendre type. Moreover, note that ψ(z) = max p k p, z Ω( p) + µ1 p Eqn. (43) = max p k p, z ϕ( p) = ϕ ( z) definition of Fenchel conjugate Recall the following fundamental theorem regarding convex conjugates (Rockafellar, 1970). Theorem I.3 (Rockafellar (1970)) If (C, f) is a convex function of Legendre type, then (C , f ) is a convex function of Legendre type. The map f : C C is a homeomorphism and f = ( f) 1. By Theorem I.3, we have 1. The function ϕ , and hence ψ, is of Legendre type. In particular, ψ is strictly convex. 2. The derivative ϕ : int( k) Rk 1 is a bijection and the derivative of ϕ satisfies ϕ = ( ϕ) 1 : Rk 1 int( k). It follows that if ϕ is twice differentiable, then so is ϕ . Finally, by the chain rule, we have ψ(z) = ϕ (z). Since ϕ (z) int( k) for all z, we have in particular that ϕ (z) 0. Thus, ψ(z) 0 for all z. Proposition I.4 Let k 3 be an integer, Ω: k R a regular negentropy and µ 0. Let LΩ,µ be the Fenchel-Young loss corresponding to Ωand µ. Then trunc LΩ,µ = Ltrunc[Ω],µ. Proof Throughout this proof, let Θ := trunc[Ω]. Let Ωand Θ be the reduced versions (see eq. (40)) of Ωand Θ, respectively. We remark that the notation used in this proof slightly departs from that of eq. (40) in the following way: Here, elements of k will be denoted as p instead of p. Likewise, elements of k 1 will be denoted as q rather than q. Moreover, throughout this proof, we abuse notation and denote by 1 either the (k 1) or the (k 2)-dimensional vector. The dimension of 1 will be clear from the context. Now, from eq. (41) in Proposition I.1, recall that the template of LΩ,µ is ψΩ,µ : Rk 1 R, where ψΩ,µ(z) = max p=[p1,...,pk 1] k Ω(p) + µ1 p z, p . (44) Similarly, the template of LΘ,µ is ψΘ,µ : Rk 2 R, where ψΘ,µ(w) = max q=[q1,...,qk 2] k 1 Θ(q) + µ1 q w, q . (45) By the definition of the truncation of a PERM loss (given in Proposition 4.3), the template of trunc[LΩ,µ] is trunc[ψΩ,µ]. Since PERM losses are uniquely defined by their templates (Theorem 2.5), the result of Proposition I.4 can be equivalently stated as trunc[ψΩ,µ] = ψΘ,µ. Below, our proof will focus on proving this identity. Wang and Scott Both trunc[ψΩ,µ] and ψΘ,µ have Rk 2 as domain. Fix w Rk 2 arbitrarily. Our goal is to show that trunc[ψΩ,µ](w) = ψΘ,µ(w). Pick z Rk 1 such that Qz = w. For instance, we can pad w with a zero, i.e., take z := [0, w] = pad(w). By Lemma H.2, we have trunc[ψΩ,µ](w) = lim λ + ψΩ,µ(z + λe(k 1) 1 ) To simplify notations, let e1 := e(k 1) 1 . In view of eq. (44) and eq. (45), establishing our goal, i.e., the identity trunc[ψΩ,µ](w) = ψΘ,µ(w), is equivalent to proving lim λ + max p k Ω(p) + µ1 p z + λe1, p = max q k 1 Θ(q) + µ1 q Qz, q . (46) For brevity, we define g(λ) := max p k Ω(p) + µ1 p z + λe1, p . (47) For each λ R, fix arbitrarily an element pλ arg maxp k Ω(p) + µ1 p z + λe1, p . We note that the maximization is over a compact domain with a continuous objective. We will prove the result via a series of claims. Claim 1: g : R R+, defined at eq. (47), is monotone non-increasing. Claim 2: For all λ R, g(λ) Θ(bq) + µ1 bq w, bq . Claim 3: p1 = 0. Claim 4: limt g(λt) = Ω(bq) + µ1 bq w, bq The proofs of these claims proceed via straightforward but tedious computations and applications of inequalities. They are omitted here but appear in the ar Xiv version of this work (Wang and Scott, 2023b). Now we finish the proof of eq. (46), which we argued earlier suffices to establish Proposition I.4. By Claims 1 and 4, we have that limλ + g(λ) = Θ(bq) + µ1 bq w, bq . But limλ + g(λ) is exactly the left hand side of eq. (46), by the construction of g in eq. (47). By the definition of bq, moreover we have that Θ(bq) + µ1 bq w, bq is precisely the right hand side of eq. (46). Theorem I.5 Let Ω: k R be a negentropy (Definition 5.2). For each n {2, . . . , k}, let Ω(n) be the n-ary truncated negentropy of Ω(Definition 5.9). Let LΩ,µ be the Fenchel Young loss corresponding to Ωand µ (Definition 5.1). Likewise, for each n {2, . . . , k}, let L(n) be the n-ary truncation of LΩ,µ (Definition 4.5). Then L(n) = LΩ(n),µ. In other words, the n-ary truncated loss L(n) of LΩ,µ is equal to the Fenchel-Young loss of the n-ary truncated negentropy Ω(n) and µ. The proof of Theorem I.5 is omitted here since it largely mirrors that of Proposition H.11. See the ar Xiv version of manuscript (Wang and Scott, 2023b) for the proof. Proof [Proof of Theorem 5.10] By Theorem 4.7, it suffices to show that LΩ,µ is totally regular (Definition 4.5). Unwinding the definition, our goal is to show for each n {2, . . . , k} that the n-ary truncated loss of LΩ,µ, denoted L(n), is regular. Unified Margin-Based Classification Next, let Ω(n) be the n-ary truncated negentropy of Ω(Definition 5.9) and let LΩ(n),µ be the Fenchel-Young loss corresponding to Ω(n) and µ. By Theorem I.5, we have L(n) = LΩ(n),µ. By the assumption of Theorem 5.10, Ω(n) is a regular negentropy. Thus, by Proposition I.1, LΩ(n),µ is a regular PERM loss. Since L(n) = LΩ(n),µ, we are done. I.2 Totally regular negentropy that is not strongly convex In this section, we prove Proposition 5.11 which is used to show that there exists totally regular entropies that are not strongly convex. See Example 7. Thus, the associated Fenchel Young loss is calibrated by Theorem 5.10. Moreover, this calibration result is outside of the purview of previously established results (Blondel, 2019; Nowak-Vila et al., 2019) which requires strong convexity The next four results serve as the technical tools for proving Proposition 5.11. Their proofs are omitted here and are in the ar Xiv version of this work (Wang and Scott, 2023b). Proposition I.6 Let f : D R 0 be of Legendre type and g : R 0 R be convex, differentiable and strictly increasing. Let C = int(D). Suppose that D is compact and there exists x C such that infx D f(x) = f(x ). Then g f : D R is of Legendre type. Lemma I.7 Let f and g be as in Proposition I.6. If g (0) = 0 and x int(D) is such that infx D f(x) = f(x ) = 0, then the Hessian of g f vanishes at x , i.e., 2 g f(x ) = 0. Corollary I.8 In the situation of Lemma I.7, g f is not α-strongly convex for any α > 0. Proposition I.9 Let Ω: k R be a regular negentropy and let g : R 0 R 0 be as in Proposition I.6. Furthermore, suppose that g is twice differentiable. Let a R be a negative number such that a Ω(p) for all p k. Define Θ : k R by Θ(p) := g(Ω(p) a) g( a), p k. Then Ωis a regular negentropy. Proof [Proof of Proposition 5.11] First we note that the assumptions on g is in the setting of Proposition I.6. Next, by Definition 5.9, we must show that Θ(n) is a regular negentropy for each n {2, . . . , k}. Let a = Ω(u). Note that since Ωis symmetric and convex, we must have that a Ω(p) for all p k. Furthermore, it is easy to see that Θ(n)( q) := g( Ω(n)( q) a) g( a) for all q n. Now, apply Proposition I.9 to Θ(n) and a = Ω(u), we get the desired result. The part of the proposition assuming g (0) = 0 follows immediately from Corollary I.8. Appendix J. Uniqueness of the matrix label code In this section, we prove Theorem J.1. The proof appears at the end of this section. Wang and Scott Theorem J.1 Let L be the cross entropy loss and ψ be its template as in Example 1. Suppose that {Ay}k y=1 is a set of (k 1) (k 1) matrices satisfying Ly(v) = ψ(Ay Dv) for all y [k] and all v Rk. Then for each y [k] there exists a permutation σy Sym(k) so that Ay = SσyΥy. Lemma J.2 Let ψ be the template of the multinomial logistic loss (Example 1). Let a Rk 1 be a vector. Suppose that ψ(ta) = ψ(te1) for all t R. Then a = ei for some i [k 1]. Lemma J.2 serves as the main tool for Lemma J.3 below. The proofs of both Lemmas J.2 and J.3 appear in the ar Xiv version of our manuscript (Wang and Scott, 2023b). Lemma J.3 Let ψ be the template of the multinomial logistic loss (Example 1). Suppose that A is such that ψ(z) = ψ(Az) for all z Rk 1, then A is the identity matrix up to row permutation. In other words, there exists σ Sym(k) such that A = Sσ. Lemma J.4 Let ψ be the template of the multinomial logistic loss (Example 1). Let j [k] and A R(k 1) (k 1) be arbitrary. Suppose that ψ(Υjz) = ψ(Az) for all z Rk 1. Then A is equal to the Υj up to row permutation. In other words, there exists σ Sym(k) such that A = SσΥj. Proof [Proof of Lemma J.4] We first prove that the assumption of Lemma J.4 implies that of the previous Lemma J.3: ψ(z) = ψ(Az) for all z Rk 1. Let u Rk 1 be arbitrary and let z := Υju. Since Υ2 j = I, we have ψ(u) = ψ(ΥjΥju) = ψ(Υjz) = ψ(Az) = ψ(AΥju) where for the third equality from the left we used the assumption of Lemma J.4. Now, by the previous Lemma J.3, we have that AΥj = Sσ for some σ Sym(k). Using Υ2 j = I, we get SσΥj = AΥjΥj = A. That is to say A is equal to Υj up to permutations of the rows. Proof [Proof of Theorem J.1] Let y [k] and v Rk be fixed. Let z := Dv. By Theorem 2.5, we have ψ(Υyz) = ψ(Ayz). Since D is surjective, the preceding identity holds for all z Rk 1. Thus by Lemma J.4, Ay = SσΥy for some σ Sym(k). Appendix K. Mathematical Background We review mathematical background on fundamental topics that are used throughout this work. K.1 Non-singular M-matrix We recall some definitions from linear algebra. Definition K.1 Let A = (aij) Rn n be a matrix. We say that A is a 1. Z-matrix if aij 0 whenever i = j. Unified Margin-Based Classification 2. M-matrix if A is a Z-matrix and all eigenvalues of A have nonnegative real parts. 3. strictly diagonally dominant matrix if |aii| > P j [n]:j =i |aij| for all i [n]. 4. monotone matrix if for all x Rn, Ax 0 implies x 0. If, in addition, Ax 0 implies x 0, then A is said to be strictly monotone. Theorem K.2 (Levy-Desplanques/Gershgorin circle) Let A be a strictly diagonally dominant matrix. Then A is non-singular whose eigenvalues all have nonnegative real parts. The above result immediately implies the following: Corollary K.3 A strictly diagonally dominant Z-matrix is a non-singular M-matrix. Non-singular M-matrices have many equivalent characterizations. The one relevant to us is the following: Theorem K.4 (Plemmons (1977)) Let A be a Z-matrix. Then A is a non-singular M-matrix if and only A is a monotone matrix. Lemma K.5 Let A = (aij) Rn n be a non-singular M-matrix. If the diagonals of A are positive, then A is strictly monotone. Proof Let x Rn be arbitrary such that Ax 0. Our goal is to show that x 0. First, from Theorem K.4, we have that A is monotone. Thus, Ax 0 implies x 0. We only have to check additionally that x 0. Since A is a Z-matrix, the off-diagonals are non-positive, i.e., aij 0 for all i, j [n] such that i = j. Now, let i [n]. We need to check that xi > 0. To this end, note that 0 < [Ax]i = Pn j=1 aijxj = aiixi + P j =i aijxj aiixi. Note that P j =i aijxj 0 because x 0 and aij 0. Finally, xi > 0 since aii > 0. Corollary K.6 Let A = (aij) Rn n be a non-singular M-matrix. Then A is also a non-singular M-matrix. Moreover, if the diagonals of A are positive, then A is strictly monotone. Proof Note that A and A are both non-singular and have the same set of eigenvalues. Moreover, A being a Z-matrix implies that A is a Z-matrix. Thus, A is a non-singular M-matrix. Finally, Lemma K.5 implies that A is a strictly monotone matrix.