Weston-Watkins Hinge Loss and Ordered Partitions

Yutong Wang, University of Michigan, yutongw@umich.edu
Clayton D. Scott, University of Michigan, clayscot@umich.edu

Abstract

Multiclass extensions of the support vector machine (SVM) have been formulated in a variety of ways. A recent empirical comparison of nine such formulations [1] recommends the variant proposed by Weston and Watkins (WW), despite the fact that the WW-hinge loss is not calibrated with respect to the 0-1 loss. In this work we introduce a novel discrete loss function for multiclass classification, the ordered partition loss, and prove that the WW-hinge loss is calibrated with respect to this loss. We also argue that the ordered partition loss is minimally emblematic among discrete losses satisfying this property. Finally, we apply our theory to justify the empirical observation made by Doğan et al. [1] that the WW-SVM can work well even under massive label noise, a challenging setting for multiclass SVMs.

1 Introduction

Classification is the task of assigning labels to instances, and a common approach is to minimize the misclassification error corresponding to the 0-1 loss. However, the 0-1 loss is discrete and typically cannot be optimized efficiently. To address this, the 0-1 loss is often replaced by a surrogate loss during training. If the surrogate is calibrated with respect to the 0-1 loss, then a classifier minimizing the expected surrogate loss will also minimize the expected 0-1 loss in the infinite sample limit.

For multiclass classification, several different multiclass extensions of the support vector machine (SVM) have been proposed, including the Weston-Watkins (WW) [2], Crammer-Singer (CS) [3], and Lee-Lin-Wahba (LLW) [4] SVMs. The pertinent difference between these multiclass SVMs is the multiclass generalization of the hinge loss. Below, we refer to the hinge loss from the WW-SVM as the WW-hinge loss, and so on.

It is well-known that the LLW-hinge loss is calibrated with respect to the 0-1 loss, while the WW- and CS-hinge losses are not [5, 6]. Despite this result, the LLW-SVM is not more widely accepted than the WW-, CS-, and other SVMs. The first reason for this is that while the LLW-SVM is calibrated with respect to the 0-1 loss, this did not lead to superior performance empirically. In particular, Doğan et al. [1] found that the LLW-SVM fails in low dimensional feature spaces even in the noiseless setting. On the other hand, Doğan et al. [1] observed that the WW-SVM is the only multiclass SVM that succeeded in both the noiseless and noisy settings in their simulations. Indeed, Doğan et al. [1] concluded that, among 9 different competing multiclass SVMs, the WW-SVM offers the best overall performance when considering accuracy and computation.

The second reason is that the calibration framework is not limited to the 0-1 loss. There could be other discrete losses with respect to which a surrogate is calibrated, and which help to explain its performance. Indeed, Ramaswamy et al. [7] recently showed that the CS-hinge loss is calibrated with respect to a discrete loss for classification with abstention. In a vein similar to [7], we show that the WW-hinge loss is calibrated with respect to a novel discrete loss that we call the ordered partition loss. Our results leverage the embedding framework for analyzing discrete losses and convex piecewise-linear surrogates, introduced recently by Finocchiaro et al. [8].
We also give theoretical justification for the empirical performance of the WW-SVM observed by Doğan et al. [1].

1.1 Related work

Cortes and Vapnik [9] introduced the support vector machine for learning a binary classifier, using the hinge loss as a surrogate for the 0-1 loss. Steinwart [10] showed that the binary SVM is universally consistent, a desirable property of a classification algorithm that ensures its convergence to the Bayes optimal classifier in the large sample limit. Steinwart [11] later used calibration to give a more general proof of SVM consistency with respect to the 0-1 loss. Around that time, more general theories of when a loss is calibrated with respect to the 0-1 loss, or "classification calibrated," began to emerge [12, 13, 14], and since then a proliferation of papers has extended these ideas to a variety of learning settings (see Bao et al. [15] for a recent review).

Several natural extensions of the binary SVM exist, including the Weston-Watkins (WW) [2], Crammer-Singer (CS) [3], and Lee-Lin-Wahba (LLW) [4] SVMs. Tewari and Bartlett [6] extended the definition of calibration with respect to the 0-1 loss to the multiclass setting. Liu [5] and Tewari and Bartlett [6] analyzed these hinge losses and showed that the WW and CS hinge losses are not calibrated with respect to the 0-1 loss, while the LLW hinge loss is. Doğan et al. [1] introduced a framework that unified existing multiclass SVMs, proved the 0-1 loss consistency of several multiclass SVMs when the kernel is allowed to change, and also conducted extensive experiments.

Although the CS-hinge loss is not calibrated with respect to the 0-1 loss, Zhang [12] showed that the Crammer-Singer SVM is consistent under the majority assumption, i.e., when the most probable class has probability greater than 1/2. When the majority assumption is violated, experiments conducted by Doğan et al. [1] suggested that the CS-SVM fails, while the WW-SVM continues to perform well.

The LLW-hinge loss is calibrated with respect to the 0-1 loss while the WW-hinge loss is not [5]. Nevertheless, the WW-SVM often outperforms the LLW-SVM in experiments [1], which ostensibly undermines using calibration as a justification for performance. To reconcile this, we refer the reader to the discussion in Doğan et al. [1, Section 3.3] on relative and absolute margin losses. Doğan et al. [1] argued that the poorer performance of losses based on the absolute margin, including the LLW-hinge, is due to the absolute margin being incompatible with the decision function. On the other hand, the CS- and WW-hinge losses are relative margin based and do not suffer from the same issue. We remark that Fathony et al. [16] proposed a relative margin hinge loss which is calibrated with respect to the 0-1 loss and outperforms the WW-hinge loss, at the expense of greater computational complexity.

Ramaswamy and Agarwal [17] extended the notion of calibration to an arbitrary discrete loss used in general multiclass learning. The general multiclass learning framework unifies several learning problems, including cost-sensitive classification [18], classification with an abstain option [7], ranking [19], and partial label learning [20]. Furthermore, Ramaswamy and Agarwal [17] introduced the concept of convex calibration dimension, which is defined, for a given discrete loss, as the minimum dimension of the domain of a convex surrogate loss calibrated with respect to that discrete loss.
Ramaswamy et al. [7] proved the consistency of the CS-SVM with respect to the abstention loss where the cost of abstaining is 1/2, by showing that the CS hinge loss is calibrated with respect to this abstention loss. They also proposed a new calibrated convex surrogate loss in dimension $\log_2 k$ for the abstention loss, implying that the CS hinge loss is suboptimal from the convex calibration dimension perspective.

Recently, several new multiclass hinge-like losses have been proposed, as well as frameworks for constructing convex losses. Doğan et al. [1] used their framework to devise two new multiclass hinge losses, and, using ideas from adversarial multiclass classification, Fathony et al. [16] proposed a new multiclass hinge-like loss; all three are calibrated with respect to the 0-1 loss. Blondel et al. [21] introduced a class of losses known as Fenchel-Young losses, which contains non-smooth losses such as the CS hinge loss as well as smooth losses such as the logistic loss. Tan and Zhang [22] proposed an approach for constructing hinge-like losses using generalized entropies. Finocchiaro et al. [8] studied the calibration properties of polyhedral losses using the embedding framework that they developed. They analyzed several polyhedral losses in the literature, including the CS hinge, the Lovász hinge [23], and the top-n loss [24].

1.2 Our contributions

We introduce a novel discrete loss $\ell$, the ordered partition loss. We show in Theorem 3.1 that the Weston-Watkins hinge loss $L$ embeds the ordered partition loss $\ell$. Our embedding result, together with results of [8], implies that $L$ is calibrated with respect to $\ell$ (Corollary 3.2). To the best of our knowledge, this is the first calibration-theoretic result for the WW-hinge loss. We also introduce the notion of the minimally emblematic discrete loss that a polyhedral loss can embed and argue that the ordered partition loss is minimally emblematic for the WW-hinge loss. In Section 5, we use properties of the ordered partition loss to give theoretical support for the empirical observations made by Doğan et al. [1] on the success of the WW-SVM in the massive label noise setting.

1.3 Notations

Let $k \ge 3$ be an integer denoting the number of classes. For a positive integer $n$, let $[n] = \{1, \dots, n\}$. If $v = (v_1, \dots, v_k) \in \mathbb{R}^k$ and $i \in [k]$ is an index, let $[v]_i := v_i$. Define $\max v = \max_{i \in [k]} v_i$ and $\arg\max v = \{i \in [k] : v_i = \max v\}$. Let $\mathcal{S}_k$ denote the set of permutations on $[k]$, i.e., elements of $\mathcal{S}_k$ are bijections $\sigma : [k] \to [k]$. Given $\sigma \in \mathcal{S}_k$ and $v \in \mathbb{R}^k$, the vector $\sigma v \in \mathbb{R}^k$ is defined entrywise by $[\sigma v]_i = v_{\sigma(i)}$. Equivalently, we view $\mathcal{S}_k$ as the set of permutation matrices in $\mathbb{R}^{k \times k}$. Let $\mathbb{R}_+$ denote the set of nonnegative reals. Denote by $\Delta_k = \{(p_1, \dots, p_k) \in \mathbb{R}^k_+ : p_1 + \cdots + p_k = 1\}$ the probability simplex. For $p \in \Delta_k$, we write $Y \sim p$ to denote a discrete random variable $Y \in [k]$ whose probability mass function is $p$. Let $\langle \cdot, \cdot \rangle$ be the usual dot product between vectors. Denote by $\mathbb{I}\{\mathrm{input}\}$ the indicator function, which returns 1 if input is true and 0 otherwise.

1.4 Background

Recall the general multiclass learning framework as described in [17]: $\mathcal{X}$ is a sample space and $P$ is a joint distribution over $\mathcal{X} \times [k]$. A multiclass classification loss is a function $\ell : \mathcal{R} \to \mathbb{R}^k_+$, where $\mathcal{R}$ is called the prediction space and $[\ell(r)]_y \in \mathbb{R}_+$ is the penalty incurred for predicting $r \in \mathcal{R}$ when the label is $y \in [k]$. If $\mathcal{R}$ is finite, we refer to $\ell$ as a discrete loss. For example, a common setting for classification is $\mathcal{R} = [k]$ and $\ell$ is the 0-1 loss. The $\ell$-risk of a hypothesis function $f : \mathcal{X} \to \mathcal{R}$ is

$$\mathrm{er}^{\ell}_P(f) := \mathbb{E}_{(X,Y) \sim P}\left\{ [\ell(f(X))]_Y \right\}. \qquad (1)$$
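To make this framework concrete, here is a minimal numerical sketch (ours, not from the paper): a discrete loss with a finite prediction space $\mathcal{R}$ can be stored as a $|\mathcal{R}| \times k$ matrix, and the $\ell$-risk in eq. (1) can be estimated by averaging over i.i.d. samples. All variable names are illustrative.

```python
import numpy as np

k = 3
# The 0-1 loss as a discrete loss matrix: rows index predictions r,
# columns index labels y; entry = I{r != y}.
zero_one = 1.0 - np.eye(k)

rng = np.random.default_rng(0)
labels = rng.integers(0, k, size=1000)       # labels Y_1, ..., Y_n (0-indexed)
predictions = rng.integers(0, k, size=1000)  # predictions f(X_1), ..., f(X_n)

# Empirical estimate of er^l_P(f) = E[ [l(f(X))]_Y ] in eq. (1).
empirical_risk = zero_one[predictions, labels].mean()
print(empirical_risk)
```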
The goal is to design $\ell$-consistent algorithms, i.e., procedures that output a hypothesis $f_n$ based on an input of $n$ training samples drawn i.i.d. from $P$, such that $\mathrm{er}^{\ell}_P(f_n) \to \mathrm{er}^{\ell,*}_P = \inf_{f : \mathcal{X} \to \mathcal{R}} \mathrm{er}^{\ell}_P(f)$ as $n \to \infty$. Since $\ell$ is discrete, eq. (1) is difficult to minimize directly. To circumvent this difficulty, we consider a convex surrogate loss $L : \mathbb{R}^d \to \mathbb{R}^k$ for some positive integer $d$. The following property relates the surrogate loss $L$ and the discrete loss $\ell$.

Definition 1.1 (Calibration). For each $p \in \Delta_k$, define $\gamma_\ell(p) := \arg\min_{r \in \mathcal{R}} \langle p, \ell(r) \rangle$. We say that $L$ is calibrated with respect to $\ell$ if there exists a function $\psi : \mathbb{R}^d \to \mathcal{R}$ such that for all $p \in \Delta_k$,

$$\inf_{u \in \mathbb{R}^d :\, \psi(u) \notin \gamma_\ell(p)} \langle p, L(u) \rangle \;>\; \inf_{v \in \mathbb{R}^d} \langle p, L(v) \rangle.$$

By Ramaswamy and Agarwal [17, Theorem 3], $L$ being calibrated with respect to $\ell$ is equivalent to the following: there exists $\psi : \mathbb{R}^d \to \mathcal{R}$ such that for all joint distributions $P$ on $\mathcal{X} \times [k]$ and all sequences of functions $g_n : \mathcal{X} \to \mathbb{R}^d$, we have that $\mathrm{er}^{L}_P(g_n) \to \mathrm{er}^{L,*}_P$ implies $\mathrm{er}^{\ell}_P(\psi \circ g_n) \to \mathrm{er}^{\ell,*}_P$, where $\mathrm{er}^{L,*}_P = \inf_{g : \mathcal{X} \to \mathbb{R}^d} \mathrm{er}^{L}_P(g)$. Thus, the calibration property allows us to focus on finding $L$-consistent algorithms. In general it can be difficult to check that a given $L$ is calibrated with respect to $\ell$. Finocchiaro et al. [8] introduced the following definition:

Definition 1.2 (Finocchiaro et al. [8]). The loss $L : \mathbb{R}^d \to \mathbb{R}^k$ embeds $\ell : \mathcal{R} \to \mathbb{R}^k$ if there exists an injection $\varphi : \mathcal{R} \to \mathbb{R}^d$, called an embedding, such that

1. $L(\varphi(r)) = \ell(r)$ for all $r \in \mathcal{R}$, and
2. for all $p \in \Delta_k$, $r \in \arg\min_{r' \in \mathcal{R}} \langle p, \ell(r') \rangle$ if and only if $\varphi(r) \in \arg\min_{v \in \mathbb{R}^d} \langle p, L(v) \rangle$.

The notion of embedding is important due to the following result from [8, Theorem 3]:

Theorem 1.3 (Finocchiaro et al. [8]). Let $L$ be convex piecewise-linear and $\ell$ be discrete. If $L$ embeds $\ell$, then $L$ is calibrated with respect to $\ell$.

Figure 1: The gray triangle represents the probability simplex $\Delta_3$, where $(p_1, p_2, p_3) \in \Delta_3$ is plotted as $(p_2, p_3)$ in the plane. The interior of each polygonal region contains $p \in \Delta_3$ such that $\min_{S \in \mathrm{OP}_k} \langle p, \ell(S) \rangle$ has a unique minimizer. For the derivations, see the supplementary material. Ordered partitions are represented as follows: $(\{1\}, \{2,3\}) \mapsto 1|23$, $(\{1\}, \{2\}, \{3\}) \mapsto 1|2|3$, ..., $(\{3\}, \{2\}, \{1\}) \mapsto 3|2|1$.

Given $L$, $\ell$, and $\varphi$, Finocchiaro et al. [8, Definition 6] provided an explicit construction for $\psi$, with an excess risk bound proved in [8, Theorem 6]. In this work, we are interested in the case when $L$ is the WW-hinge loss:

Definition 1.4. For $v \in \mathbb{R}^k$, define the Weston-Watkins hinge loss [2] $L(v) \in \mathbb{R}^k_+$ entrywise by

$$[L(v)]_y = \sum_{i \in [k] :\, i \ne y} h(v_y - v_i), \qquad y \in [k],$$

where $h : \mathbb{R} \to \mathbb{R}_+$ is the hinge function defined by $h(x) = \max\{0, 1 - x\}$.
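For concreteness, the following short sketch (ours, assuming NumPy is available) transcribes Definition 1.4 directly; the function name `ww_hinge` is our own.

```python
import numpy as np

def ww_hinge(v):
    """Weston-Watkins hinge loss of Definition 1.4: [L(v)]_y = sum_{i != y} max(0, 1 - (v_y - v_i))."""
    v = np.asarray(v, dtype=float)
    margins = v[:, None] - v[None, :]        # margins[y, i] = v_y - v_i
    hinge = np.maximum(0.0, 1.0 - margins)   # h(v_y - v_i)
    np.fill_diagonal(hinge, 0.0)             # exclude the i = y term
    return hinge.sum(axis=1)                 # the vector ([L(v)]_1, ..., [L(v)]_k)

# Example with k = 3: v = (0, -1, -1) gives L(v) = (0, 3, 3).
print(ww_hinge([0.0, -1.0, -1.0]))
```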
By Theorem 1.3, to prove that $L$ is calibrated with respect to $\ell$, it suffices to show that $L$ embeds $\ell$. Going forward, $L$ will refer to the WW-hinge loss. We now work toward showing that $L$ embeds the ordered partition loss $\ell$, which we introduce next.

2 The ordered partition loss

The prediction space $\mathcal{R}$ that we use is the set of ordered partitions, which we now define:

Definition 2.1. An ordered partition on $[k]$ of length $l$ is an ordered list $S = (S_1, \dots, S_l)$ of nonempty, pairwise disjoint subsets of $[k]$ such that $S_1 \cup \cdots \cup S_l = [k]$. Denote by $\mathrm{OP}_k$ the set of all ordered partitions on $[k]$ with length $\ge 2$. We write the length of $S$ as $l_S$ to be precise when working with multiple ordered partitions.

Ordered partitions can be thought of as complete rankings of $k$ items where ties are allowed. They are widely studied in combinatorics [25, 26, 27]. In the ranking literature, ordered partitions are called bucket orders [28] and the $S_i$'s are called the buckets. The first bucket $S_1$ contains the highest ranked items, and so on. There is only one ordered partition with $l_S = 1$, namely the trivial partition $S = ([k])$. Thus, $\mathrm{OP}_k$ is the set of nontrivial ordered partitions. We now define the following discrete loss over the ordered partitions:

Definition 2.2. The ordered partition loss $\ell : \mathrm{OP}_k \to \mathbb{R}^k_+$ is defined, for $i \in [k]$ and $S = (S_1, \dots, S_l) \in \mathrm{OP}_k$, as

$$[\ell(S)]_i = |S_1| - 1 + \sum_{j=1}^{l_S - 1} |S_1 \cup \cdots \cup S_{j+1}| \cdot \mathbb{I}\{i \notin S_1 \cup \cdots \cup S_j\}.$$

The intuition behind the ordered partition loss is that we want to rank the labels, where ties are allowed and each $S_i$ is a set of labels that are tied. We want the correct label to be as high up the ranking as possible: the lower the true class is ranked, the larger the loss. To build intuition about $\ell$, let $Y \sim p$ and consider the random variable $[\ell(S)]_Y$, whose expectation is

$$\mathbb{E}_{Y \sim p}\{[\ell(S)]_Y\} = |S_1| - 1 + \sum_{j=1}^{l_S - 1} |S_1 \cup \cdots \cup S_{j+1}| \cdot \Pr_{Y \sim p}\{Y \notin S_1 \cup \cdots \cup S_j\}. \qquad (2)$$

Note that $\mathbb{E}_{Y \sim p}\{[\ell(S)]_Y\} = \langle p, \ell(S) \rangle$. In Figure 1, we visualize the decision rule for the Bayes optimal classifier in the $k = 3$ case by plotting the function $p \mapsto \arg\min_{S \in \mathrm{OP}_k} \langle p, \ell(S) \rangle$. When $l_S = 2$, we have

$$\mathbb{E}_{Y \sim p}\{[\ell(S)]_Y\} = |S_1| - 1 + k \cdot \Pr_{Y \sim p}\{Y \notin S_1\}. \qquad (3)$$

Thus, we have a trade-off: adding elements to $S_1$ increases the $|S_1| - 1$ term but decreases the $k \cdot \Pr_{Y \sim p}\{Y \notin S_1\}$ term. More generally, when $l_S \ge 2$, the ordered partition loss requires the predictor to associate each test instance $x$ with a nested sequence of sets $S_1 \subset S_1 \cup S_2 \subset \cdots$, where these sets are designed to balance the probability of containing $x$'s label with the size of the set. In the learning with partial labels setting [29, 20], for each training instance the learner observes a set of labels, one of which is the true label. The sets $S_1, S_1 \cup S_2, \dots$ might be called progressive partial labels in the spirit of partial label learning [29, 20].

Next, we define the embedding that satisfies Definition 1.2 when $L$ is the WW-hinge loss and $\ell$ is the ordered partition loss:

Definition 2.3. The embedding $\varphi : \mathrm{OP}_k \to \mathbb{R}^k$ is defined as follows: let $S = (S_1, \dots, S_l) \in \mathrm{OP}_k$. Define $\varphi(S) \in \mathbb{R}^k$ entrywise so that for all $i \in [l_S]$ and all $j \in S_i$, we have $[\varphi(S)]_j = -(i-1)$.

With the discrete loss $\ell$ and the embedding map $\varphi$ defined, we now proceed to the main results.

3 Main results

In this work, we establish that the WW-hinge loss embeds the ordered partition loss:

Theorem 3.1. The Weston-Watkins hinge loss $L : \mathbb{R}^k \to \mathbb{R}^k$ embeds the ordered partition loss $\ell : \mathrm{OP}_k \to \mathbb{R}^k$ with embedding $\varphi$ as in Definition 2.3.

In light of Theorem 1.3, Theorem 3.1 implies

Corollary 3.2. $L$ is calibrated with respect to $\ell$.

In the remainder of this section, we develop the tools necessary to prove Theorem 3.1.

3.1 Vectorial representation of ordered partitions

First, we define the set $\mathcal{S}_k C_{\mathbb{Z}}$ whose elements serve as realizations of ordered partitions inside $\mathbb{R}^k$.

Definition 3.3. Define the following sets:

$$C := \{v \in \mathbb{R}^k : v_1 = 0,\ v_k \le -1,\ v_i - v_{i+1} \in [0, 1]\ \forall i \in [k-1]\}, \qquad C_{\mathbb{Z}} := C \cap \mathbb{Z}^k, \qquad (4)$$

and finally $\mathcal{S}_k C_{\mathbb{Z}} := \bigcup_{\sigma \in \mathcal{S}_k} \sigma C_{\mathbb{Z}}$, where $\sigma C_{\mathbb{Z}} = \{\sigma v : v \in C_{\mathbb{Z}}\}$.

A vector $v \in \mathbb{R}^k$ is monotonic non-increasing if $v_1 \ge v_2 \ge \cdots \ge v_k$. Note that vectors in $C_{\mathbb{Z}}$ are nonconstant, integer-valued, monotonic non-increasing, and such that consecutive entries decrease by at most 1. Furthermore, by construction, $\mathcal{S}_k C_{\mathbb{Z}}$ consists of all possible permutations of elements in $C_{\mathbb{Z}}$. Therefore, the entries of an element $v \in \mathcal{S}_k C_{\mathbb{Z}}$ take on every value in $\{0, -1, \dots, -(l-1)\}$ for some integer $l \in \{2, \dots, k\}$. Thus, $v \in \mathcal{S}_k C_{\mathbb{Z}}$ can be thought of as a vectorial representation of the ordered partition $S = (S_1, \dots, S_l)$ where $S_i = \{j : v_j = -(i-1)\}$ for each $i \in [l]$. In Proposition 3.6 below, we make this notion precise.
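The following sketch (ours, assuming NumPy) transcribes Definition 2.2 and Definition 2.3 for small $k$, enumerating $\mathrm{OP}_k$ through the rank-vector representation just described, and numerically checks the identity $L(\varphi(S)) = \ell(S)$ asserted by condition 1 of Definition 1.2 for the WW-hinge loss (Theorem 3.1). All function names are our own and labels are 0-indexed in the code.

```python
import itertools
import numpy as np

def ww_hinge(v):
    """[L(v)]_y = sum_{i != y} max(0, 1 - (v_y - v_i)) (Definition 1.4)."""
    v = np.asarray(v, dtype=float)
    hinge = np.maximum(0.0, 1.0 - (v[:, None] - v[None, :]))
    np.fill_diagonal(hinge, 0.0)
    return hinge.sum(axis=1)

def ordered_partitions(k):
    """Enumerate OP_k as tuples of buckets (S_1, ..., S_l), l >= 2, via rank vectors."""
    for ranks in itertools.product(range(k), repeat=k):
        l = max(ranks) + 1
        if l >= 2 and set(ranks) == set(range(l)):  # every rank 0..l-1 is used
            yield tuple(tuple(j for j in range(k) if ranks[j] == i) for i in range(l))

def op_loss(S, k):
    """Ordered partition loss of Definition 2.2 (labels 0-indexed)."""
    loss = np.full(k, len(S[0]) - 1.0)
    prefix = set(S[0])                               # running union S_1 u ... u S_j
    for j in range(len(S) - 1):
        nxt = prefix | set(S[j + 1])                 # S_1 u ... u S_{j+1}
        loss += len(nxt) * np.array([i not in prefix for i in range(k)], dtype=float)
        prefix = nxt
    return loss

def phi(S, k):
    """Embedding of Definition 2.3: [phi(S)]_j = -(i - 1) for j in S_i."""
    v = np.zeros(k)
    for i, bucket in enumerate(S):
        for j in bucket:
            v[j] = -float(i)
    return v

# Check condition 1 of Definition 1.2, L(phi(S)) = l(S), for a small k.
k = 4
assert all(np.allclose(ww_hinge(phi(S, k)), op_loss(S, k)) for S in ordered_partitions(k))
```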
Lemma 3.4. The image of $\varphi$ is contained in $\mathcal{S}_k C_{\mathbb{Z}}$.

Proof. Let $S \in \mathrm{OP}_k$. It suffices to prove that there exists some $\sigma \in \mathcal{S}_k$ such that $\sigma \varphi(S) \in C_{\mathbb{Z}}$. Note that by definition, the set of unique values of $\varphi(S)$ is $\{[\varphi(S)]_j : j \in [k]\} = \{0, -1, -2, \dots, -(l_S - 1)\}$. Thus, let $\sigma \in \mathcal{S}_k$ be such that $\sigma \varphi(S)$ is monotonic non-increasing. Then $\sigma \varphi(S) \in C_{\mathbb{Z}}$.

Next, we define the inverse of $\varphi$.

Definition 3.5. The quasi-link map $\tilde{\psi} : \mathcal{S}_k C_{\mathbb{Z}} \to \mathrm{OP}_k$ is defined as follows: given $v \in \mathcal{S}_k C_{\mathbb{Z}}$, let $l = 1 - \min_{j \in [k]} v_j$. Define $S_i = \{j \in [k] : v_j = -(i-1)\}$ for each $i \in [l]$. Finally, define $\tilde{\psi}(v) = (S_1, \dots, S_l)$.

The tilde in $\tilde{\psi}$ is to differentiate the quasi-link from $\psi$ in Definition 1.1.

Proposition 3.6. The embedding map $\varphi : \mathrm{OP}_k \to \mathcal{S}_k C_{\mathbb{Z}}$ given in Definition 2.3 is a bijection, with inverse given by the quasi-link map $\tilde{\psi}$ from Definition 3.5.

Proof. We first show that $\tilde{\psi}(\varphi(S)) = S$ for all $S = (S_1, \dots, S_l) \in \mathrm{OP}_k$. Observe that $S_i = \{j \in [k] : [\varphi(S)]_j = -(i-1)\}$ for all $i = 1, 2, \dots, l$. This implies that $\tilde{\psi}(\varphi(S)) = S$. Next, we show that $\varphi(\tilde{\psi}(v)) = v$ for all $v \in \mathcal{S}_k C_{\mathbb{Z}}$. Let $S = (S_1, \dots, S_l) = \tilde{\psi}(v)$. Then $[\varphi(S)]_j = -(i-1)$ if and only if $j \in S_i$. By definition, $S_i = \{j \in [k] : v_j = -(i-1)\}$. Hence, $[\varphi(S)]_j = -(i-1)$ if and only if $v_j = -(i-1)$, which implies that $\varphi(S) = v$, as desired.

In the next section, using $\varphi$, we prove a relationship between the inner risk functions of $L$ and $\ell$.

3.2 Inner risk functions

Define the inner risk functions $\underline{L} : \Delta_k \to \mathbb{R}_+$ and $\underline{\ell} : \Delta_k \to \mathbb{R}_+$ as follows:

$$\underline{L}(p) = \inf_{v \in \mathbb{R}^k} \langle p, L(v) \rangle, \qquad \text{and} \qquad \underline{\ell}(p) = \inf_{S \in \mathrm{OP}_k} \langle p, \ell(S) \rangle. \qquad (5)$$

Note that these functions appear in the second part of Definition 1.2, although here we have inf instead of min. Since $\mathrm{OP}_k$ is finite, the infimum in the definition of $\underline{\ell}$ is attained. Later, we will argue that the infimum in the definition of $\underline{L}$ is also attained. We now state the main structural result regarding $\underline{L}$:

Theorem 3.7. For all $p \in \Delta_k$, we have

$$\underline{L}(p) = \min_{v \in \mathcal{S}_k C_{\mathbb{Z}}} \langle p, L(v) \rangle.$$

Sketch of proof. Note that $L$ is invariant under translation by any scalar multiple of the all-ones vector. Thus, $L$ has an extra degree of freedom. We introduce a loss function $Ł : \mathbb{R}^{k-1} \to \mathbb{R}^k$, called the reduced WW-hinge loss, which removes this extra degree of freedom. Furthermore, there exists a mapping $\pi : \mathbb{R}^k \to \mathbb{R}^{k-1}$ such that $\langle p, L(v) \rangle = \langle p, Ł(\pi(v)) \rangle$ for all $p \in \Delta_k$ and $v \in \mathbb{R}^k$. Letting $z = \pi(v) \in \mathbb{R}^{k-1}$, we show that for a fixed $p$, the function $F_p(z) := \langle p, Ł(z) \rangle$ is convex and piecewise linear, and its minimization can be formulated as a linear program [30]. Furthermore, since $F_p$ is nonnegative, the infimum $\inf_{z \in \mathbb{R}^{k-1}} F_p(z)$ is attained [30, Corollary 3.2], which implies that the infimum in the definition of $\underline{L}$ in eq. (5) is attained as well. The linear program is shown to be totally unimodular, which implies that an integral solution exists [31], i.e., $\min_{z \in \mathbb{R}^{k-1}} F_p(z) = F_p(z^\star)$ for some $z^\star \in \mathbb{Z}^{k-1}$. From $z^\star$, we obtain an integral $v^\star \in \mathbb{Z}^k$ such that $\underline{L}(p) = \langle p, L(v^\star) \rangle$. Finally, we construct an element $v^{\star\star} \in \mathcal{S}_k C_{\mathbb{Z}}$ from $v^\star$ in such a way that the objective does not increase, i.e., $\langle p, L(v^{\star\star}) \rangle \le \langle p, L(v^\star) \rangle$, which implies that $\underline{L}(p) = \langle p, L(v^{\star\star}) \rangle$ by the optimality of $v^\star$.
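As a numerical illustration of Theorem 3.7 (not the proof technique used in the paper), the sketch below, assuming SciPy is available, solves the linear program for $\inf_{v \in \mathbb{R}^k} \langle p, L(v) \rangle$ directly and compares it with the minimum over the finite set $\mathcal{S}_k C_{\mathbb{Z}}$. The helper names and the choice of random $p$ are ours.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def ww_hinge(v):
    v = np.asarray(v, dtype=float)
    hinge = np.maximum(0.0, 1.0 - (v[:, None] - v[None, :]))
    np.fill_diagonal(hinge, 0.0)
    return hinge.sum(axis=1)

def sk_cz(k):
    """The finite set S_k C_Z: vectors whose entries take every value in {0,-1,...,-(l-1)}, l >= 2."""
    for vals in itertools.product(range(k), repeat=k):
        l = max(vals) + 1
        if l >= 2 and set(vals) == set(range(l)):
            yield -np.array(vals, dtype=float)

def inner_risk_lp(p):
    """inf_v <p, L(v)> as an LP: variables (v, xi); minimize sum_{y != i} p_y * xi_{y,i}
    subject to xi_{y,i} >= 1 - v_y + v_i and xi >= 0."""
    k = len(p)
    pairs = [(y, i) for y in range(k) for i in range(k) if i != y]
    n = k + len(pairs)
    c = np.zeros(n)
    rows, b = [], []
    for m, (y, i) in enumerate(pairs):
        c[k + m] = p[y]
        row = np.zeros(n)
        row[y], row[i], row[k + m] = -1.0, 1.0, -1.0   # -v_y + v_i - xi_{y,i} <= -1
        rows.append(row)
        b.append(-1.0)
    bounds = [(None, None)] * k + [(0, None)] * len(pairs)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b), bounds=bounds, method="highs")
    return res.fun

rng = np.random.default_rng(0)
k = 4
p = rng.dirichlet(np.ones(k))
lhs = inner_risk_lp(p)                              # infimum over all of R^k
rhs = min(p @ ww_hinge(v) for v in sk_cz(k))        # minimum over the finite set S_k C_Z
print(lhs, rhs)                                     # should agree, as asserted by Theorem 3.7
```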
The ordered partition loss $\ell$ and the WW-hinge loss $L$ are related by the following:

Theorem 3.8. For all $p \in \Delta_k$ and all $S \in \mathrm{OP}_k$, we have $\langle p, \ell(S) \rangle = \langle p, L(\varphi(S)) \rangle$, where $\varphi$ is the embedding map as in Definition 2.3.

Sketch of proof. Let $S = (S_1, \dots, S_l) \in \mathrm{OP}_k$ and $p \in \Delta_k$. Let $T \in \mathbb{R}^{k \times k}$ consist of ones on and below the main diagonal and zeros everywhere else. Letting $D = T^{-1}$, we have $\langle p, L(\varphi(S)) \rangle = \langle p, TDL(\varphi(S)) \rangle = \langle T^\top p, DL(\varphi(S)) \rangle$. Next, we observe that $[T^\top p]_i = p_i + \cdots + p_k$ for each $i \in [k]$. We then show through a lengthy calculation that for each $i \in [k]$:

1. If $i = 1$, then $[T^\top p]_1 = 1$ and $[DL(\varphi(S))]_1 = |S_1| - 1$.
2. If $i > 1$ and $i = |S_1 \cup \cdots \cup S_j| + 1$ for some $j \in [l]$, then $[T^\top p]_i = \Pr_{Y \sim p}\{Y \notin S_1 \cup \cdots \cup S_j\}$ and $[DL(\varphi(S))]_i = |S_1 \cup \cdots \cup S_{j+1}|$.
3. For all other $i$, $[DL(\varphi(S))]_i = 0$ (in which case the value of $[T^\top p]_i$ is irrelevant).

From this, we deduce that $\langle T^\top p, DL(\varphi(S)) \rangle$ is equal to eq. (2).

Next, we show that the inner risks of $L$ and $\ell$ from eq. (5) are in fact identical:

Corollary 3.9. For all $p \in \Delta_k$, we have $\underline{L}(p) = \underline{\ell}(p)$.

Proof. Observe that

$$\underline{\ell}(p) \overset{(a)}{=} \min_{S \in \mathrm{OP}_k} \langle p, \ell(S) \rangle \overset{(b)}{=} \min_{S \in \mathrm{OP}_k} \langle p, L(\varphi(S)) \rangle \overset{(c)}{=} \min_{v \in \mathcal{S}_k C_{\mathbb{Z}}} \langle p, L(v) \rangle \overset{(d)}{=} \underline{L}(p),$$

where (a) follows from the definition of $\underline{\ell}$, (b) from Theorem 3.8, (c) from the fact that $\varphi : \mathrm{OP}_k \to \mathcal{S}_k C_{\mathbb{Z}}$ is a bijection (Proposition 3.6), and (d) from Theorem 3.7.

Having developed all the tools necessary, we turn toward the proof of our main result, Theorem 3.1.

3.3 Proof of Theorem 3.1

We check that the two conditions in Definition 1.2 hold. The first condition is that $L(\varphi(S)) = \ell(S)$ for all $S \in \mathrm{OP}_k$, which follows from Theorem 3.8. To see this, note that for each $i \in [k]$, the $i$-th elementary basis vector $e_i \in \Delta_k$. Thus, we have $[L(\varphi(S))]_i = \langle e_i, L(\varphi(S)) \rangle = \langle e_i, \ell(S) \rangle = [\ell(S)]_i$ for all $i \in [k]$. This implies that $L(\varphi(S)) = \ell(S)$, which is the first condition of Definition 1.2.

Next, we check the second condition. Let $p \in \Delta_k$. Define $\gamma(p) := \arg\min_{S \in \mathrm{OP}_k} \langle p, \ell(S) \rangle$ and $\Gamma(p) := \arg\min_{v \in \mathbb{R}^k} \langle p, L(v) \rangle$. By the definition of $\gamma$, $S \in \gamma(p)$ if and only if $\langle p, \ell(S) \rangle = \underline{\ell}(p)$. Likewise, $\varphi(S) \in \Gamma(p)$ if and only if $\langle p, L(\varphi(S)) \rangle = \underline{L}(p)$. By Corollary 3.9 and Theorem 3.8, we have $\langle p, \ell(S) \rangle = \underline{\ell}(p)$ if and only if $\langle p, L(\varphi(S)) \rangle = \underline{L}(p)$. Putting it all together, we get that $S \in \gamma(p)$ if and only if $\varphi(S) \in \Gamma(p)$, which is the second condition of Definition 1.2.

4 Minimally emblematic losses

Going forward, let $L : \mathbb{R}^d \to \mathbb{R}^k_+$ be a generic surrogate loss. The WW-hinge loss is denoted by $L^{WW}$ and the CS-hinge loss by $L^{CS}$. Likewise, let $\ell : \mathcal{R} \to \mathbb{R}^k_+$ be a generic discrete loss. The ordered partition loss is denoted by $\ell^{OP}$ and the 0-1 loss by $\ell^{zo}$. We define a dual notion to the embedding dimension of Finocchiaro et al. [32, Definition 6]:

Definition 4.1. Let $L : \mathbb{R}^d \to \mathbb{R}^k_+$ be a loss. Define the embedding cardinality of $L$ as

$$\mathrm{emb.card}(L) := \min\left\{ n \in \{2, 3, \dots\} \;\middle|\; \text{there exists a discrete loss } \ell : [n] \to \mathbb{R}^k \text{ such that } L \text{ embeds } \ell \right\}.$$

A discrete loss $\ell : \mathcal{R} \to \mathbb{R}^k$ is said to be minimally emblematic for $L$ if $|\mathcal{R}| = \mathrm{emb.card}(L)$ and $L$ embeds $\ell$.

Remark 4.2. Intuitively, $\ell$ is minimally emblematic for $L$ with embedding $\varphi$ if $\varphi(\mathcal{R})$ captures all the essential information contained in the surrogate $L$ in the most compact way. Let us say that a set of vectors $E \subseteq \mathbb{R}^k$ is an emblem of $L$ if for all $p \in \Delta_k$, the set $E \cap \arg\min_v \langle p, L(v) \rangle$ is nonempty. Then we can equivalently define $\ell$ with $\varphi$ to be minimally emblematic for $L$ if $\varphi(\mathcal{R})$ is an emblem of $L$ of minimal cardinality. In other words, $\varphi(\mathcal{R})$ is a minimal set of minimizers of all possible $L$-inner risks.

For each $k \in \{3, \dots, 15\}$, we showed by a computer search that for all $S \in \mathrm{OP}_k$, there exists $p \in \Delta_k$ such that $S$ is the unique minimizer of $\min_{T \in \mathrm{OP}_k} \langle p, \ell(T) \rangle$. A consequence of this is the following:

Proposition 4.3. For $k \in \{3, \dots, 15\}$, $\mathrm{emb.card}(L^{WW}) = |\mathrm{OP}_k|$. In other words, the ordered partition loss is minimally emblematic for the WW-hinge loss.

We conjecture that this result holds for all $k \ge 3$.

Figure 2: The gray triangle represents the probability simplex $\Delta_3$, where $(p_1, p_2, p_3) \in \Delta_3$ is plotted as $(p_2, p_3)$ in the plane. The light gray regions are $\Omega_{L^{WW}}$ (left) and $\Omega_{L^{CS}}$ (right). For the derivation, see the supplementary material.

5 The argmax link

Define $\gamma_\ell(p) := \arg\min_{r \in \mathcal{R}} \langle p, \ell(r) \rangle$ and $\Gamma_L(p) := \arg\min_{v \in \mathbb{R}^d} \langle p, L(v) \rangle$.
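As a brute-force illustration (ours; feasible for small $k$ only), the sketch below computes $\gamma_{\ell^{OP}}(p)$ by evaluating eq. (2) over all of $\mathrm{OP}_k$, in the spirit of the computer search behind Proposition 4.3 and the regions plotted in Figure 1. The second example uses a distribution with no majority class, for which the optimal top bucket contains two labels. Labels are 0-indexed and all function names are our own.

```python
import itertools
import numpy as np

def ordered_partitions(k):
    """Nontrivial ordered partitions of {0, ..., k-1}, encoded as tuples of buckets."""
    for ranks in itertools.product(range(k), repeat=k):
        l = max(ranks) + 1
        if l >= 2 and set(ranks) == set(range(l)):
            yield tuple(tuple(j for j in range(k) if ranks[j] == i) for i in range(l))

def conditional_risk(p, S):
    """<p, l(S)> via eq. (2): |S_1| - 1 + sum_j |S_1 u...u S_{j+1}| * Pr{Y not in S_1 u...u S_j}."""
    risk, prefix = len(S[0]) - 1.0, set(S[0])
    for j in range(len(S) - 1):
        nxt = prefix | set(S[j + 1])
        risk += len(nxt) * (1.0 - sum(p[i] for i in prefix))
        prefix = nxt
    return risk

def gamma_op(p):
    """Bayes-optimal ordered partitions gamma_{l^OP}(p), by brute force."""
    risks = {S: conditional_risk(p, S) for S in ordered_partitions(len(p))}
    best = min(risks.values())
    return [S for S, r in risks.items() if np.isclose(r, best)]

# A clear majority class: the unique minimizer is ({0}, {1, 2}).
print(gamma_op([0.6, 0.2, 0.2]))    # [((0,), (1, 2))]
# No majority and a tie for the most probable class: the top bucket has two labels.
print(gamma_op([0.4, 0.4, 0.2]))    # [((0, 1), (2,))]
```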
For multiclass classification into $k$ classes, most multiclass SVMs output a vector of scores $v \in \mathbb{R}^k$, which is converted to a class label by taking $\arg\max v$. In this section, we analyze the argmax as a link function. Recall from Section 1.3 that $\arg\max$ is a set-valued function. Define

$$\Omega_L := \{p \in \Delta_k : |\arg\max p| = 1,\ \arg\max v = \arg\max p\ \ \forall v \in \Gamma_L(p)\}.$$

When $L$ is calibrated with respect to $\ell^{zo}$, we have that $\Omega_L = \{p \in \Delta_k : |\arg\max p| = 1\}$. Hence, $\Delta_k \setminus \Omega_L$ has measure zero. For other $L$ not necessarily calibrated with respect to $\ell^{zo}$, it is desirable that $\Omega_L$ be as large as possible. Below, we will prove that $\Omega_{L^{CS}}$ is a proper subset of $\Omega_{L^{WW}}$.

Recall that $\mathcal{X}$ is a sample space and $P$ is a distribution on $\mathcal{X} \times [k]$. For each $x \in \mathcal{X}$, define the class conditional distribution $\eta_P(x) \in \Delta_k$ by $[\eta_P(x)]_y = \Pr_{(X,Y) \sim P}(Y = y \mid X = x)$.

Proposition 5.1. Let $L : \mathbb{R}^d \to \mathbb{R}^k_+$ be a loss and let $P$ be a joint distribution on $\mathcal{X} \times [k]$ such that $\eta_P(x) \in \Omega_L$ for all $x$. Let $g^* : \mathcal{X} \to \mathbb{R}^k$ be such that $g^*(x) \in \Gamma_L(\eta_P(x))$ for all $x \in \mathcal{X}$. Then $\arg\max g^*$ is Bayes optimal with respect to the 0-1 loss.

Proof. By the definition of $\Omega_L$, we have $\arg\max g^*(x) = \arg\max \eta_P(x)$ for all $x \in \mathcal{X}$.

The following theorem asserts that for any $v \in \Gamma_{L^{WW}}(p)$, the set $\arg\max v$ is contained in the top bucket $S_1$ for some $S \in \gamma_{\ell^{OP}}(p)$.

Theorem 5.2. Let $p \in \Delta_k$ be such that $\max p > \frac{1}{k}$, and let $v \in \Gamma_{L^{WW}}(p)$. Then there exists $S = (S_1, \dots, S_l) \in \gamma_{\ell^{OP}}(p)$ such that $\arg\max v \subseteq S_1$.

Below, we consider two conditions on $p \in \Delta_k$ such that for all $S \in \gamma_{\ell^{OP}}(p)$, the top bucket $S_1 = \arg\max p$. By Theorem 5.2, for such $p \in \Delta_k$, we can recover $\arg\max p$ from any $v \in \Gamma_{L^{WW}}(p)$. The first condition covers $p \in \Delta_k$ such that the top class has a majority:

Proposition 5.3. Let $p \in \Delta_k$ satisfy the majority condition: $\max p > 1/2$. Then for all $S = (S_1, \dots, S_l) \in \gamma_{\ell^{OP}}(p)$, we have $|S_1| = 1$ and $S_1 = \arg\max p$.

While Proposition 5.3 does not guarantee that $\gamma_{\ell^{OP}}(p)$ is a singleton, all $S \in \gamma_{\ell^{OP}}(p)$ have the same top bucket. The second condition covers $p \in \Delta_k$ whose top class may not have a majority, yet $\arg\max p$ can still be recovered from any $v \in \Gamma_{L^{WW}}(p)$ by taking $\arg\max v$:

Proposition 5.4. Fix a number $\alpha$ such that $1 > \alpha > \frac{1}{k}$. Let $p \in \Delta_k$ satisfy the symmetric label noise (SLN) condition: there exists $j^* \in [k]$ such that $p_{j^*} = \alpha$ and $p_j = \frac{1 - \alpha}{k - 1}$ for all $j \ne j^*$. Then $(\{j^*\}, [k] \setminus \{j^*\})$ is the unique element of $\gamma_{\ell^{OP}}(p)$.

In particular, when $\alpha < 1/2$, $p$ violates the majority condition. Under SLN, we have $\arg\max p = \{j^*\}$, since $\alpha - \frac{1-\alpha}{k-1} = \frac{(k-1)\alpha - 1 + \alpha}{k-1} = \frac{k\alpha - 1}{k-1} > 0$. In light of Theorem 5.2, we have

Corollary 5.5. If $p \in \Delta_k$ satisfies the majority or the SLN condition, then $p \in \Omega_{L^{WW}}$.

Thus, in two common regimes, where for all $x \in \mathcal{X}$ the class conditional $\eta_P(x)$ satisfies the SLN or the majority condition, the Bayes optimal ordered partition has a top bucket consisting of a single element. When this occurs, the argmax link recovers the most probable class, i.e., the unique element of the top bucket. This supports the observation by Doğan et al. [1] that the WW-SVM performs well under the SLN condition, even with significant label noise.

For the CS-hinge loss, it is known that $\Omega_{L^{CS}} = \{p \in \Delta_k : p \text{ satisfies the majority condition}\}$ [5, Lemma 4]. In particular, $\Omega_{L^{CS}}$ is a proper subset of $\Omega_{L^{WW}}$. For $k = 3$, we show in Figure 2 the regions $\Omega_{L^{WW}}$ and $\Omega_{L^{CS}}$. Our finding provides theoretical support for the finding of [1] that WW outperforms CS.
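The following sketch (ours; brute force over $\mathrm{OP}_k$, so small $k$ only) illustrates Proposition 5.4: under symmetric label noise with $\alpha$ between $1/k$ and $1/2$, the majority condition fails, yet the Bayes-optimal ordered partition is $(\{j^*\}, [k] \setminus \{j^*\})$, so the argmax link recovers the most probable class. The particular values of `k`, `j_star`, and `alpha` are arbitrary illustrative choices; the helpers repeat those from the earlier sketch so that the block is self-contained.

```python
import itertools
import numpy as np

def ordered_partitions(k):
    for ranks in itertools.product(range(k), repeat=k):
        l = max(ranks) + 1
        if l >= 2 and set(ranks) == set(range(l)):
            yield tuple(tuple(j for j in range(k) if ranks[j] == i) for i in range(l))

def conditional_risk(p, S):
    """<p, l(S)> via eq. (2)."""
    risk, prefix = len(S[0]) - 1.0, set(S[0])
    for j in range(len(S) - 1):
        nxt = prefix | set(S[j + 1])
        risk += len(nxt) * (1.0 - sum(p[i] for i in prefix))
        prefix = nxt
    return risk

# Symmetric label noise (SLN): class j* has probability alpha, the rest share 1 - alpha equally.
k, j_star, alpha = 5, 2, 0.3            # alpha < 1/2 (majority fails) but alpha > 1/k
p = np.full(k, (1.0 - alpha) / (k - 1))
p[j_star] = alpha

risks = {S: conditional_risk(p, S) for S in ordered_partitions(k)}
best = min(risks, key=risks.get)
# Per Proposition 5.4, the minimizer is ({j*}, everything else); its top bucket is {j*},
# so taking the argmax over the top bucket recovers the most probable class.
print(best)                              # ((2,), (0, 1, 3, 4))
```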
6 Conclusion and future work

We proved that the Weston-Watkins hinge loss is calibrated with respect to the ordered partition loss, which we argue is minimally emblematic for the WW-hinge loss. Furthermore, we showed the advantage of the WW-hinge loss over the Crammer-Singer hinge loss when the popular argmax link is used. An interesting direction is to apply the ordered partition loss to other multiclass learning problems such as partial label and multilabel learning.

Broader Impact

This work does not present any foreseeable societal consequence.

Acknowledgements

The authors were supported in part by the National Science Foundation under awards 1838179 and 2008074, by the Department of Defense, Defense Threat Reduction Agency under award HDTRA120-2-0002, and by the Michigan Institute for Data Science.

References

[1] Ürün Doğan, Tobias Glasmachers, and Christian Igel. A unified view on multi-class support vector classification. The Journal of Machine Learning Research, 17(1):1550-1831, 2016.
[2] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. 7th European Symposium on Artificial Neural Networks, 1999.
[3] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265-292, 2001.
[4] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.
[5] Yufeng Liu. Fisher consistency of multicategory support vector machines. In Artificial Intelligence and Statistics, pages 291-298, 2007.
[6] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(May):1007-1025, 2007.
[7] Harish G. Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530-554, 2018.
[8] Jessica Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in Neural Information Processing Systems, pages 10780-10790, 2019.
[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[10] Ingo Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18(3):768-791, 2002.
[11] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128-142, 2005.
[12] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225-1251, 2004.
[13] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[14] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, 2007.
[15] Han Bao, Clayton Scott, and Masashi Sugiyama. Calibrated surrogate losses for adversarially robust classification. In Conference on Learning Theory, 2020.
[16] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems, pages 559-567, 2016.
[17] Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397-441, 2016.
[18] Clayton Scott. Calibrated asymmetric surrogate losses.
Electronic Journal of Statistics, 6:958-992, 2012.
[19] John C. Duchi, Lester Mackey, and Michael I. Jordan. The asymptotics of ranking algorithms. The Annals of Statistics, 41(5):2292-2323, 2013.
[20] Jesús Cid-Sueiro. Proper losses for learning from partial labels. In Advances in Neural Information Processing Systems, pages 1565-1573, 2012.
[21] Mathieu Blondel, Andre Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 606-615, 2019.
[22] Zhiqiang Tan and Xinwei Zhang. On loss functions and regret bounds for multi-category classification. arXiv preprint arXiv:2005.08155, 2020.
[23] Jiaqian Yu and Matthew B. Blaschko. The Lovász hinge: A novel convex surrogate for submodular losses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[24] Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533-1554, 2017.
[25] Toufik Mansour. Combinatorics of Set Partitions. CRC Press, 2012.
[26] Oliver A. Gross. Preferential arrangements. The American Mathematical Monthly, 69(1):4-8, 1962.
[27] Masao Ishikawa, Anisse Kasraoui, and Jiang Zeng. Euler-Mahonian statistics on ordered set partitions. SIAM Journal on Discrete Mathematics, 22(3):1105-1137, 2008.
[28] Ronald Fagin, Ravi Kumar, Mohammad Mahdian, D. Sivakumar, and Erik Vee. Comparing and aggregating rankings with ties. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 47-58, 2004.
[29] Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12(May):1501-1536, 2011.
[30] Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, Belmont, MA, 1997.
[31] Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Courier Corporation, 2001.
[32] Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. Embedding dimension of polyhedral losses. In Conference on Learning Theory, pages 1558-1585, 2020.