Composite Multiclass Losses

Journal of Machine Learning Research 17 (2016) 1-52. Submitted 7/14; Revised 5/16; Published 12/16.

Robert C. Williamson (BOB.WILLIAMSON@ANU.EDU.AU), Australian National University and Data61
Elodie Vernet (EV315@CAM.AC.UK), Centre for Mathematical Sciences, University of Cambridge
Mark D. Reid (MARK.REID@ANU.EDU.AU), Australian National University and Data61

Editor: Nicolas Vayatis

Abstract

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a proper composite loss, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on classification calibration by relating it to properness. We determine the stationarity condition, Bregman representation, order-sensitivity, and quasi-convexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses, which we characterise in terms of the Hessian of the Bayes risk. We show that the simple integral representation for binary proper losses cannot be extended to multiclass losses, but offer concrete guidance regarding how to design different loss functions. The conclusion drawn from these results is that the proper composite representation is a natural and convenient tool for the design of multiclass loss functions.

Keywords: Proper losses, Multiclass losses, Link functions, Convexity and quasi-convexity of losses, Margin losses, Classification calibration, Parametrisations and representations of loss functions, Admissibility, Mixability, Minimaxity, Superprediction set

1. Introduction

Machine learning is done for a purpose. The performance of a machine learning solution is judged by means of a loss function, and different choices of loss function lead to different solutions. The theory of binary losses (i.e. losses suitable for binary prediction problems) is well understood. This paper extends that understanding to multiclass losses and aids the choice of a suitable loss function by exploring the parametrisations available and the implications of different choices. It does so by systematically exploring a decomposition of a multiclass loss into two components: one which affects the statistical performance, and one which affects the computational optimisation of models.

The problem setting is where one is given a bag $\{(x_i, y_i)\}_i$ of pairs of points $x_i$ and their accompanying labels $y_i \in [n] := \{1, \ldots, n\}$, drawn from a finite set of size $n$. The task can be either to predict a label for an unseen instance, or to predict the probability that a label takes on a particular value. These two problems are called multiclass classification and probability estimation respectively.

Proper composite losses are the composition of a proper loss and an invertible link (both defined formally below). This representation makes the understanding of multiclass losses easier because, crucially, it separates two distinct concerns: the statistical and the computational. The statistical properties are controlled by the proper loss, while the link function is essentially just a parametrisation.
Choice of a suitable link can help: for example, a nonconvex proper loss can be made convex (and thus more amenable to numerical optimisation) by choice of an appropriate link. For prediction purposes it is desirable to use an admissible loss (one where every possible prediction is uniquely optimal for some underlying distribution). It turns out that every proper composite loss is admissible; in fact proper composite losses satisfy a stronger adequacy property than admissibility. We characterise when a multiclass loss has a proper composite representation and when such representations are unique. We consider integral representations (whereby the proper component can be expressed as a weighted combination of elementary proper losses). We show the surprising result that there is a fundamental difference between $n = 2$ and $n > 2$ in terms of the simplicity of the parametrisation of the class of elementary proper losses.

It has been known for some time that proper losses are characterised by their conditional Bayes risks (or entropy functions). It has already been shown how important properties of a loss that control the performance of certain learning tasks can be expressed directly in terms of the Bayes risk. In this paper we extend results due to Reid and Williamson (2010) (for $n = 2$) to general $n$ and characterise the convexity of a proper loss in terms of the associated Bayes risk. We also illuminate the connection between classification and probability estimation by characterising the relationship between the crucial property that a loss should have for each of these tasks: classification calibration (which we first generalise to make sense in the more general setting we consider) and properness. We explain the relationship between these two concepts, which captures the idea behind the probing reduction from classification to class probability estimation. We also show how the results of the paper can provide tools to help with the design of multiclass losses, putting this on firmer ground than in the past.

1.1 Previous Work

With some exceptions, existing work on multiclass loss functions attempts to work directly with $\ell: V \to \mathbb{R}^n_+$. As we shall show, this conflates two separate concerns: the design of the statistical properties of the loss (those that affect statistical performance) and the aspects that affect the computational properties (those that control the ease with which empirical averages of the loss are minimised). The proper composite representation is not new: in hindsight, the observation of Grünwald and Dawid (2004) that every loss induces a proper scoring rule is tantamount to the proper composite representation. Furthermore, its components (link functions and proper losses) have a long history. The novelty of the present work is to systematically use these two components as a canonical parametrisation of loss functions. Key differences between the present paper and previous work are tabulated in Table 1.

| Attribute | Previous Work | Present Paper | Ref. |
|---|---|---|---|
| Structure and semantics | None; just a function, possibly convex in parameters. | Clear separation of concerns and meaning for $\lambda$ and $\psi$. Gives meaning to predictions $v$ as transformed probabilities. | |
| Classification versus probability estimation | Little insight in the multiclass case; confer recent works such as (Reid and Williamson, 2010, 2011; Narasimhan and Agarwal, 2013; Menon and Williamson, 2014) for the binary case. | Clear connection via a characterisation relating classification calibrated, prediction calibrated and proper losses. | |
| Effect of choice of loss function on performance | Margin based. Only a sufficient condition, and only for the statistical batch setting. Mixes up statistical fundamentals ($\underline{L}$) with parametrisation ($\psi$). Strong convexity for speed of convergence in the online setting; cf. (Abernethy et al., 2009). | Mixability and stochastic mixability. Characterisation in the online setting. Both online worst-case and statistical batch settings. Parametrisation $\psi$ automatically ignored. | |
| Admissibility | Not considered explicitly. Ensured, however, by assuming $\ell$ is convex. | All proper composite losses admissible. All continuous Bayes losses have a proper composite representation. | |
| Quasi-convexity and minimaxity | Guaranteed by assuming $\ell$ is convex. | Quasi-convexity guaranteed for all continuous proper losses; minimaxity for all continuous proper composite losses. | |
| Convexifiability | No principled way to convexify a loss; can make convex surrogate approximations. | All continuous proper losses convexifiable (using the canonical link). | 6.4 |
| Design principles and parametrisation | No guidance; choose $\ell$ or margin function $\phi$, in which case symmetry is imposed. | Principled; general asymmetric losses possible; parametrise via $(\Lambda, \Psi)$; separation of concerns. | 8.3 |
| Connections to divergences | Many-to-one for margin losses in the binary case (Nguyen et al., 2009). | Explicit 1:1 correspondence for the binary and multiclass cases (Reid and Williamson, 2011; García-García and Williamson, 2012). | |

Table 1: Comparison of the present paper to previous works on loss functions.

Proper losses are the natural losses to use for probability estimation. They have been studied in detail when $n = 2$ (the binary case), where there is a nice integral representation (Buja et al., 2005; Gneiting and Raftery, 2007; Reid and Williamson, 2011) and a characterisation in the differentiable case (Reid and Williamson, 2010). The proper composite representation for binary losses has proved very illuminating in the study of bipartite ranking problems (Menon and Williamson, 2014). Classification calibrated losses are an analog of proper losses for the problem of classification (Bartlett et al., 2006). The relationship between classification calibration and properness was determined by Reid and Williamson (2010) for $n = 2$. Most of these results have had no multiclass analogue until now.

Whilst there is much work on classification problems, it is now widely understood that there are often advantages in being able to predict probabilities, rather than just labels (Bennett, 2003; Cohen and Goldszmidt, 2004). The theory of loss functions makes it clear how one ideally chooses a loss: one takes account of one's utility concerning various incorrect predictions (Kiefer, 1987), (Berger, 1985, Section 2.4). The practice rarely involves such a step, primarily, we conjecture, because there is no adequate understanding of the way one can parametrise losses effectively, especially in the multiclass case. There is little guidance in the literature concerning how to choose a loss function; typically heuristic arguments are used for the choice, confer e.g. (Ighodaro et al., 1982; Nayak and Naik, 1989).
An early approach to multiclass losses is simply reduction to binary (Allwein et al., 2001; Beygelzimer et al., 2007; Crammer and Singer, 2001; Dietterich and Bakiri, 1995; Zadrozny and Elkan, 2002). Related approaches are pairwise coupling or Bradley-Terry models (Hastie and Tibshirani, 1998; Wu and Weng, 2004; Huang et al., 2006), where certain relationships are assumed to hold between the pairwise probabilities and the multivariate probability of interest. The design of losses for multiclass prediction has received recent attention (Zhang, 2004; Hill and Doucet, 2007; Tewari and Bartlett, 2007; Liu, 2007; Santos-Rodríguez et al., 2009; Zou et al., 2008; Zhang et al., 2009), although none of these papers developed the connection to proper losses, and most restrict consideration to margin losses (which imply certain symmetry conditions). Zou et al. (2005) proposed a multiclass generalisation of admissible losses (their name for classification calibration) for multiclass margin classification. Liu (2007) considered several multiclass generalisations of hinge loss (suitable for multiclass SVMs), showed that some of them were and others were not Fisher consistent, and, when they were not, showed how the training algorithm could be modified to make the losses behave consistently. Shi et al. (2010) have investigated the relationship between classification calibration of multiclass losses and losses for structured prediction, and have proposed an extension of classification calibration which they call parametric consistency, which attempts to take account of the function class used (classification calibration is, like all the results in this paper, concerned with behaviour per point; in practice one typically optimises over restricted classes of functions). Multiclass losses have also been considered in the development of multiclass boosting (e.g. Zhu et al., 2009; Mukherjee and Schapire, 2013; Wu and Lange, 2010).

1.2 Outline

The rest of the paper is organised as follows. In §2 we set up the problem formally and state some purely mathematical results we will need. In §3 we relate properness, classification calibration, and the notion used by Tewari and Bartlett (2007), which we rename "prediction calibrated". In §4 we provide a novel characterisation of multiclass properness. In §5 we study proper composite losses (the composition of a proper loss with an invertible link) and characterise when a given loss has such a representation and when the representation is unique. In §6 we develop a number of interesting implications of the representation and the characterisation results in terms of mixability (§6.1), admissibility (§6.2) and convexity (§6.4), where we give a complete characterisation of the (strong) convexity of composite multiclass proper losses in terms of the Bayes risk. In §7 we present a (somewhat surprising) negative result concerning the integral representation of proper multiclass losses. In §8 we outline how the above results can aid in the design of proper losses, especially by use of a (new) multiclass extension of the canonical link. Finally, §9 summarises the key contributions and outlines some future directions.

2. Formal Setup

Suppose $\mathcal{X}$ is some set and $\mathcal{Y} = [n] = \{1, \ldots, n\}$ is a set of labels. (Throughout the paper $n$ is an integer greater than or equal to 2.) We suppose we are given data $S = \{(x_i, y_i)\}_{i \in [m]}$ such that $y_i \in \mathcal{Y}$ is the label corresponding to $x_i \in \mathcal{X}$. These data follow a joint distribution $P_{\mathcal{X},\mathcal{Y}}$ on $\mathcal{X} \times [n]$.
We denote by $\mathbb{E}_{\mathcal{X},\mathcal{Y}}$ and $\mathbb{E}_{\mathcal{Y}|\mathcal{X}}$ respectively the expectation and the conditional expectation with respect to $P_{\mathcal{X},\mathcal{Y}}$. Given a new observation $x$, we want to predict the probability $p_i := P(Y = i \mid X = x)$ of $x$ belonging to class $i$, for $i \in [n]$. Multiclass classification requires the learner to predict the most likely class of $x$, that is, to find $\hat{y} \in \arg\max_{i \in [n]} p_i$. A loss measures the quality of prediction.

Let $\Delta^n := \{(p_1, \ldots, p_n)' : \sum_{i \in [n]} p_i = 1 \text{ and } 0 \le p_i \le 1,\ \forall i \in [n]\}$ denote the $n$-simplex. For multiclass probability estimation, $\ell: \Delta^n \to \mathbb{R}^n_+$. The partial losses $\ell_i$ are the components of $\ell(q) = (\ell_1(q), \ldots, \ell_n(q))'$, and $\ell_i(q)$ is the loss incurred by predicting $q \in \Delta^n$ when $y = i$. A commonly used loss for probability estimation is the log loss $\ell^{\log}$, defined by $\ell^{\log}_i(q) := -\log q_i$ for $i \in [n]$. Other examples of multiclass losses we will refer to in this paper include the square loss $\ell^{\mathrm{sq}}_i(q) := \sum_{j \in [n]} (\llbracket i = j \rrbracket - q_j)^2$, the absolute loss $\ell^{\mathrm{abs}}_i(q) := \sum_{j \in [n]} |\llbracket i = j \rrbracket - q_j|$, and the 0-1 loss $\ell^{01}_i(q) := \llbracket i \notin \arg\max_{j \in [n]} q_j \rrbracket$. Here $\llbracket P \rrbracket$ denotes the function that is 1 when $P$ is true and 0 otherwise.

Throughout the paper, $A'$ denotes the transpose of the matrix or vector $A$, except when applied to a real-valued function, where it denotes the derivative. We denote matrix multiplication of compatible matrices $A$ and $B$ by $A \cdot B$, so the inner product of two vectors $x, y \in \mathbb{R}^n$ is $x' \cdot y$.

The conditional risk $L$ associated with a loss $\ell$ is the function
$$L: \Delta^n \times \Delta^n \ni (p,q) \mapsto L(p,q) = \mathbb{E}_{Y \sim p}\, \ell_Y(q) = p' \cdot \ell(q) = \sum_{i \in [n]} p_i\, \ell_i(q) \in \mathbb{R}_+,$$
where $Y \sim p$ means $Y$ is drawn according to a multinomial distribution with parameter $p \in \Delta^n$. In a typical learning problem one will construct an estimate $q: \mathcal{X} \to \Delta^n$. The full risk is $\mathbb{L}(q) = \mathbb{E}_{\mathcal{X}} \mathbb{E}_{\mathcal{Y}|\mathcal{X}}\, \ell_Y(q(X))$. Minimising $\mathbb{L}(q)$ over $q: \mathcal{X} \to \Delta^n$ is equivalent to minimising $L(p(x), q(x))$ over $q(x) \in \Delta^n$ for all $x \in \mathcal{X}$, where $p(x) = (p_1(x), \ldots, p_n(x))'$ and $p_i(x) = P(Y = i \mid X = x)$. Thus it suffices to consider only the conditional risk; confer (Reid and Williamson, 2011).

If one is interested in estimating probabilities ($\ell: \Delta^n \to \mathbb{R}^n_+$), it is natural to require that the associated conditional risk is minimised when estimating the true underlying probability. Such a loss is called proper (formally: $L(p,p) \le L(p,q)$, $\forall p, q \in \Delta^n$). It is strictly proper if the inequality is strict whenever $p \ne q$ (so the risk is uniquely minimised by predicting the correct probability). The conditional Bayes risk is defined by $\underline{L}: \Delta^n \ni p \mapsto \inf_{q \in \Delta^n} L(p,q)$.

Figure 1: The idea of a proper composite loss. The link $\psi$ (a monotone function from $\Delta^n$ to the prediction space $V$) controls convexity; the proper loss $\lambda$, parametrised by a concave Bayes risk $\underline{\Lambda}: \Delta^n \to \mathbb{R}$, controls the statistical properties. The composite loss maps predictions via $\psi^{-1}$ to probabilities, and probabilities to loss values.

This function is always concave (Gneiting and Raftery, 2007). If $\ell$ is proper, then $\underline{L}(p) = L(p,p) = p' \cdot \ell(p)$. Strictly proper losses induce Fisher consistent estimators of probabilities: if $\ell$ is strictly proper, $p = \arg\min_q L(p,q)$. By considering when the derivatives $\frac{\partial}{\partial q_i} L(p,q)$ are zero, it is straightforward to show that, of the example losses introduced above, the log loss, square loss and 0-1 loss are proper, while the absolute loss is not. Furthermore, both log loss and square loss are strictly proper, while 0-1 loss is proper but not strictly proper. Using the fact that, for proper losses, $\underline{L}(p) = L(p,p)$, we see that $\underline{L}^{\log}(p) = -\sum_{i \in [n]} p_i \log p_i$ (i.e., Shannon entropy), $\underline{L}^{\mathrm{sq}}(p) = 1 - \sum_{i \in [n]} p_i^2$, and $\underline{L}^{01}(p) = \min_i \{1 - p_i\}$.
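These properness claims are easy to check numerically. The sketch below is our own illustration (the function names are ours, not the paper's): it evaluates the conditional risk $L(p,q) = p' \cdot \ell(q)$ over random predictions and confirms that log loss and square loss are minimised near $q = p$, while absolute loss is minimised at a vertex of the simplex and hence is not proper.

```python
import numpy as np

def log_loss(q):      # partial losses (l_1(q), ..., l_n(q)) of log loss
    return -np.log(q)

def square_loss(q):   # l_i(q) = sum_j (1[i=j] - q_j)^2
    return ((np.eye(len(q)) - q) ** 2).sum(axis=1)

def abs_loss(q):      # l_i(q) = sum_j |1[i=j] - q_j|
    return np.abs(np.eye(len(q)) - q).sum(axis=1)

def risk(p, q, loss):  # conditional risk L(p,q) = p' . l(q)
    return p @ loss(q)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
for loss in (log_loss, square_loss, abs_loss):
    qs = rng.dirichlet(np.ones(3), size=5000)   # random interior predictions
    best = qs[np.argmin([risk(p, q, loss) for q in qs])]
    print(loss.__name__, "approximate minimiser:", best.round(2))
# log_loss and square_loss recover q close to p = (0.5, 0.3, 0.2);
# abs_loss drifts towards the vertex e_1, so it is not proper.
```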
The losses above are defined on the simplex $\Delta^n$, since the argument (a predictor) represents a probability vector. However, it is sometimes desirable to use another set $V$ of predictions. For example, if one wishes to use linear predictors, their natural range is $\mathbb{R}^n$. One can then consider losses $\ell: V \to \mathbb{R}^n_+$. Suppose there exists an invertible function $\psi: \Delta^n \to V$. Then $\ell$ can be written as the composition of a loss $\lambda$ defined on the simplex with $\psi^{-1}$; that is, $\ell(v) = \lambda_\psi(v) := \lambda(\psi^{-1}(v))$. Such a loss $\lambda_\psi$ is a composite loss. If $\lambda$ is proper, we say $\ell$ is a proper composite loss, with associated proper loss $\lambda$ and link $\psi$; see Figure 1. Many commonly used multiclass losses are composite losses, even though they are not often expressed as such; see the example in §8.4. Throughout the paper, $\ell$ is a general loss defined on $V$ (where $V$ may equal $\Delta^n$), and $\lambda$ is always a loss defined on $\Delta^n$, which may be proper. For such a loss $\lambda: \Delta^n \to \mathbb{R}^n_+$, its conditional risk is denoted $\Lambda(p,q)$ and its conditional Bayes risk $\underline{\Lambda}(p)$.

In order to differentiate losses we project the $n$-simplex into a subset of $\mathbb{R}^{n-1}$. Let $\tilde{\Delta}^n := \{(p_1, \ldots, p_{n-1})' : p_i \ge 0,\ \forall i, \text{ and } \sum_{i=1}^{n-1} p_i \le 1\}$ denote the bottom of the $n$-simplex. We denote by $\Pi: \Delta^n \ni p = (p_1, \ldots, p_n)' \mapsto \tilde{p} = (p_1, \ldots, p_{n-1})' \in \tilde{\Delta}^n$ the projection of $\Delta^n$, and by $\Pi^{-1}: \tilde{\Delta}^n \ni \tilde{p} = (\tilde{p}_1, \ldots, \tilde{p}_{n-1})' \mapsto (\tilde{p}_1, \ldots, \tilde{p}_{n-1}, 1 - \sum_{i=1}^{n-1} \tilde{p}_i)' \in \Delta^n$ its inverse. For convenience, we will often use $\tilde{n} := n - 1$ to denote the dimension of the set $\tilde{\Delta}^n$.

We use the following notation. The $k$th unit vector $e_k$ is the $n$-vector with all components zero except the $k$th, which is 1. The $n$-vector $\mathbb{1}_n := (1, \ldots, 1)'$. The (relative) interior of the simplex is $\mathring{\Delta}^n := \{(p_1, \ldots, p_n)' : \sum_{i \in [n]} p_i = 1 \text{ and } 0 < p_i < 1,\ \forall i \in [n]\}$, and the boundary is $\partial\Delta^n := \Delta^n \setminus \mathring{\Delta}^n$. We also adopt notation from Magnus and Neudecker (1999). For the reader's convenience we list the essential notations and conventions in Appendix A.

3. Relating Properness to Classification Calibration

Properness is an attractive property of a loss for the task of class probability estimation. However, if one is merely interested in classifying (predicting $\hat{y} \in [n]$ given $x \in \mathcal{X}$), then it is stronger than one needs. In this section we relate classification calibration (the analog of properness for classification problems) to properness.

Suppose $c \in \mathring{\Delta}^n$. We cover $\Delta^n$ with $n$ subsets, each representing one class:
$$T_i(c) := \{p \in \Delta^n : \forall j \ne i,\ p_i c_j \ge p_j c_i\}, \qquad i \in [n].$$
Observe that for $i \ne j$ the sets $R_{ij}(c) := \{p \in \Delta^n : p_i c_j = p_j c_i\}$ are subsets of dimension $n-2$ passing through $c$ and all $e_k$ such that $k \ne i$ and $k \ne j$. The subspace containing $R_{ij}(c)$ partitions $\mathbb{R}^n$ into two parts; $T_i(c)$ is the intersection of $\Delta^n$ with the half-spaces delimited by these $(n-2)$-dimensional subsets on the same side as $e_i$. An example of this partition is shown graphically in Figure 2.

Figure 2: A partitioning of the 3-simplex by regions $T_i(c)$, $i = 1, 2, 3$, where $c = (.35, .2, .45)'$, as viewed from the direction $(1,1,1)$.
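As a concrete illustration (our own sketch, with a hypothetical helper name): since $p_i c_j \ge p_j c_i$ for all $j$ says exactly that the ratio $p_i / c_i$ is maximal, membership of $p$ in $T_i(c)$ is a simple comparison. The regions of Figure 2 can thus be computed as follows.

```python
import numpy as np

def regions(p, c):
    """Indices i with p in T_i(c), i.e. p_i * c_j >= p_j * c_i for all j."""
    p, c = np.asarray(p, float), np.asarray(c, float)
    return [i for i in range(len(p))
            if all(p[i] * c[j] >= p[j] * c[i] for j in range(len(p)))]

c = [0.35, 0.20, 0.45]
print(regions([0.6, 0.2, 0.2], c))   # -> [0]: p1/c1 = 1.71 is maximal
print(regions([0.2, 0.3, 0.5], c))   # -> [1]: p2/c2 = 1.5 is maximal
print(regions(c, c))                 # -> [0, 1, 2]: p = c lies in every T_i(c)
# With c = (1/n, ..., 1/n), regions(p, c) reduces to argmax_i p_i,
# recovering the usual classification rule.
```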
We will make use of the following properties of $T_i(c)$.

Lemma 1 Suppose $c \in \mathring{\Delta}^n$ and $i \in [n]$. Then the following hold:
1. For all $p \in \Delta^n$, there exists $i$ such that $p \in T_i(c)$.
2. Suppose $p \in \Delta^n$. For $i \ne j$, $T_i(c) \cap T_j(c) \subseteq \{p \in \Delta^n : p_i c_j = p_j c_i\}$, a subset of a subspace of dimension $n-2$.
3. Suppose $p \in \mathring{\Delta}^n$. If $p \in \bigcap_{i=1}^n T_i(c)$ then $p = c$.
4. For all $p, q \in \Delta^n$ with $p \ne q$, there exist $c \in \mathring{\Delta}^n$ and $i \in [n]$ such that $p \in T_i(c)$ and $q \notin T_i(c)$.

The proof is deferred to Appendix B.1.

Classification calibrated losses have been developed and studied under various definitions and names (Zhang, 2004; Bartlett et al., 2006). Below we generalise the notion of $c$-calibration, which was proposed for $n = 2$ by Reid and Williamson (2010) and developed by Scott (2011, 2012) as a generalisation of the notion of classification calibration of Bartlett et al. (2006); confer also Steinwart (2007).

Definition 2 Suppose $\ell: \Delta^n \to \mathbb{R}^n_+$ is a loss and $c \in \mathring{\Delta}^n$. We say $\ell$ is $c$-calibrated at $p \in \Delta^n$ if for all $i \in [n]$ such that $p \notin T_i(c)$, $\forall q \in T_i(c)$, $\underline{L}(p) < L(p,q)$. We say that $\ell$ is $c$-calibrated if $\ell$ is $c$-calibrated at every $p \in \Delta^n$.

Definition 2 means that if the probability vector $q$ one predicts does not belong to the same subset (i.e. does not predict the same class) as the true probability vector $p$, then the loss is necessarily larger than $\underline{L}(p)$. Classification calibration in the sense used by Bartlett et al. (2006) corresponds to $\frac{1}{2}$-calibrated losses when $n = 2$. If $c_{\mathrm{mid}} := (\frac{1}{n}, \ldots, \frac{1}{n})'$, then $c_{\mathrm{mid}}$-calibration induces Fisher-consistent estimates in the case of classification. Furthermore, "$\ell$ is $c_{\mathrm{mid}}$-calibrated and, for all $i \in [n]$, $\ell_i$ is continuous and bounded below" is equivalent to "$\ell$ is infinite sample consistent" as defined by Zhang (2004). This is because if $\ell$ is continuous and $T_i(c)$ is closed, then ($\forall q \in T_i(c)$, $\underline{L}(p) < L(p,q)$) if and only if $\underline{L}(p) < \inf_{q \in T_i(c)} L(p,q)$.

The following result generalises the correspondence between binary classification calibration and properness (Reid and Williamson, 2010, Theorem 16) to multiclass losses ($n > 2$).

Proposition 3 A continuous loss $\ell: \Delta^n \to \mathbb{R}^n_+$ is strictly proper if and only if it is $c$-calibrated for all $c \in \mathring{\Delta}^n$.

Proof ($\Rightarrow$) Suppose that $\ell$ is strictly proper. Then for all $c \in \mathring{\Delta}^n$, for all $i \in [n]$ such that $p \notin T_i(c)$, and for all $q \in T_i(c)$, we have $p \ne q$ and thus $\underline{L}(p) < L(p,q)$, since $\ell$ is strictly proper.

($\Leftarrow$) Suppose that $\ell$ is $c$-calibrated for all $c \in \mathring{\Delta}^n$. Suppose $p, q \in \Delta^n$ and $p \ne q$. By Lemma 1 (part 4) one can separate $p$ and $q$ into two different classes: there exist $c \in \mathring{\Delta}^n$ and $i \in [n]$ such that $q \in T_i(c)$ and $p \notin T_i(c)$. Hence $\underline{L}(p) < L(p,q)$, since $\ell$ is $c$-calibrated. Since $\ell$ is continuous and $\Delta^n$ is closed, the infimum in the definition of $\underline{L}(p)$ is attained. Since $\underline{L}(p) < L(p,q)$ for all $q \ne p$, we conclude $\underline{L}(p) = L(p,p)$. Thus $\ell$ is strictly proper.

In particular, a continuous strictly proper loss is $c_{\mathrm{mid}}$-calibrated. Thus for any estimator $\hat{q}_n$ of the conditional probability vector constructed by minimising the empirical average of a continuous strictly proper loss, one can build an estimator of the label (corresponding to the largest component of $\hat{q}_n$) which is Fisher consistent for the problem of classification.

In the binary case, $\ell$ is classification calibrated if and only if the following implication holds (Bartlett et al., 2006):
$$\mathbb{L}(f_n) \to \min_g \mathbb{L}(g) \implies P_{\mathcal{X},\mathcal{Y}}(Y \ne f_n(X)) \to \min_g P_{\mathcal{X},\mathcal{Y}}(Y \ne g(X)). \tag{1}$$
Tewari and Bartlett (2007) have characterised when (1) holds in the multiclass case. Since there is no reason to assume the equivalence between classification calibration and (1) still holds for $n > 2$, we give different names to these two notions. We use "classification calibration" for the notion (Definition 2) linked to Fisher consistency, and "prediction calibrated" (defined below) for the notion of Tewari and Bartlett (equivalent to (1)).

Definition 4 Suppose $\ell: V \to \mathbb{R}^n_+$ is a loss. Let $C_\ell := \mathrm{co}(\{\ell(v) : v \in V\})$, the convex hull of the image of $V$ under $\ell$. We say $\ell$ is prediction calibrated if there exists a prediction function $\mathrm{pred}: \mathbb{R}^n \to [n]$ such that
$$\forall p \in \Delta^n : \quad \inf_{z \in C_\ell :\; p_{\mathrm{pred}(z)} < \max_i p_i} p' \cdot z \;>\; \inf_{z \in C_\ell} p' \cdot z = \underline{L}(p).$$

Suppose that $\ell: \Delta^n \to \mathbb{R}^n_+$ is prediction calibrated with $\mathrm{pred}(\ell(p)) \in \arg\max_i p_i$. Then $\ell$ is $c_{\mathrm{mid}}$-calibrated almost everywhere. By introducing a reference link $\bar{\psi}$ (which corresponds to the actual link $\psi$ if $\ell$ is a proper composite loss $\ell = \lambda \circ \psi^{-1}$), we now show how the pred function can be canonically expressed in terms of $\arg\max_i p_i$.
Proposition 5 Suppose $\ell: V \to \mathbb{R}^n_+$ is a loss. Let $\bar{\psi}: \Delta^n \to V$ satisfy $\bar{\psi}(p) \in \arg\min_{v \in V} L(p,v)$ and let $\lambda = \ell \circ \bar{\psi}$. Then $\lambda$ is proper. If $\ell$ is prediction calibrated, then $\mathrm{pred}(\lambda(p)) \in \arg\max_{i \in [n]} p_i$.

Proof We show first that $\lambda$ is proper. Let $p \in \Delta^n$. Then
$$\Lambda(p,p) = L(p, \bar{\psi}(p)) = L\big(p, \arg\min_v L(p,v)\big) = \min_v L(p,v) \le \min_{q \in \Delta^n} \Lambda(p,q).$$
Thus $\lambda$ is proper and $\underline{L}(p) = \underline{\Lambda}(p)$. We now assume that $\ell$ is prediction calibrated. Suppose that $\mathrm{pred}(z = \lambda(p)) \notin \arg\max_i p_i$. Then $p_{\mathrm{pred}(\lambda(p))} < \max_i p_i$, and thus $p' \cdot z = \Lambda(p,p) > \underline{L}(p) = \underline{\Lambda}(p)$, which contradicts the properness of $\lambda$.

4. Characterizing Properness

We now present some simple (but new) consequences of properness in the multiclass case (Proposition 6). We also build some connections between the properness of multiclass losses and the properness of binary losses that can be derived from them via a restriction of the multiclass loss to a line connecting two points in the $n$-simplex (Proposition 7). Finally, we show that multiclass proper losses are effectively characterised by their Bayes risks (Proposition 8) and that the continuity of losses is intimately tied to the differentiability of their Bayes risks (Proposition 9). An important implication of these last results is that we are able to study the class of multiclass proper losses by focusing our attention on concave functions defined over probabilities.

To state our propositions we need to introduce monotone functions, directional derivatives, and superdifferentials (cf. Hiriart-Urruty and Lemaréchal, 2001). We say $f: C \subseteq \mathbb{R}^n \to \mathbb{R}^n$ is monotone (resp. strictly monotone) on $C$ when for all $x$ and $y$ in $C$,
$$(f(x) - f(y))' \cdot (x - y) \ge 0 \quad \big(\text{resp. } (f(x) - f(y))' \cdot (x - y) > 0 \text{ for } x \ne y\big); \tag{2}$$
confer (Hiriart-Urruty and Lemaréchal, 2001; Rockafellar and Wets, 2004). If a function $f: \mathbb{R}^n \to \mathbb{R}$ is concave, then $\lim_{t \downarrow 0} \frac{f(x+td) - f(x)}{t}$ exists; it is called the directional derivative of $f$ at $x$ in the direction $d$ and is denoted $Df(x,d)$. By analogy with the usual definition of the subdifferential for convex functions, we introduce the superdifferential: the superdifferential $\partial f(x)$ of a concave $f$ at $x$ is
$$\partial f(x) := \big\{s \in \mathbb{R}^n : s' \cdot y \ge Df(x,y),\ \forall y \in \mathbb{R}^n\big\} = \big\{s \in \mathbb{R}^n : f(y) \le f(x) + s' \cdot (y - x),\ \forall y \in \mathbb{R}^n\big\}.$$
Similarly, a vector $s \in \partial f(x)$ is called a supergradient of $f$ at $x$.

Proposition 6 Suppose $\ell: \Delta^n \to \mathbb{R}^n_+$ is a loss. If $\ell$ is proper, then $-\ell$ is monotone on $\Delta^n$. Furthermore, if $\ell$ is strictly proper then it is also invertible.

Proof For all $p, q \in \Delta^n$,
$$(\ell(p) - \ell(q))' \cdot (p - q) = \big(p' \cdot \ell(p) - p' \cdot \ell(q)\big) + \big(q' \cdot \ell(q) - q' \cdot \ell(p)\big) \le 0,$$
since $p' \cdot \ell(p) \le p' \cdot \ell(q)$ and $q' \cdot \ell(q) \le q' \cdot \ell(p)$ by properness; hence $-\ell$ is monotone. For the strictly proper case, we just have to check that $\ell$ is injective. By way of contradiction, assume $\ell$ is not invertible. Then there exist $p \ne q$ such that $\ell(p) = \ell(q)$, which means $L(p,p) = L(p,q)$, contradicting the supposed strict properness of $\ell$.
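A quick numeric sanity check of Proposition 6 (our own sketch, not from the paper): for the strictly proper log loss, the inner product $(\ell(p) - \ell(q))' \cdot (p - q)$ should be nonpositive at every pair of points in the interior of the simplex.

```python
import numpy as np

rng = np.random.default_rng(1)
log_loss = lambda q: -np.log(q)   # partial losses of log loss

worst = max(
    (log_loss(p) - log_loss(q)) @ (p - q)
    for p, q in zip(rng.dirichlet(np.ones(4), 1000),
                    rng.dirichlet(np.ones(4), 1000))
)
print(worst <= 1e-12)  # True: (l(p) - l(q))' . (p - q) <= 0, i.e. -l is monotone
```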
The following proposition presents several characterisations of multiclass properness. It shows how the characterisation of properness in the general (not necessarily differentiable) multiclass case can be reduced to the binary case. We also show that this is equivalent to testing the properness condition for the loss on all possible line segments joining two distributions within the simplex. This latter characterisation can be viewed as a statement connecting order sensitivity and properness: the true class probability minimises the risk, and if the prediction moves away from the true class probability along a line, then the risk increases. This property is convenient for optimisation purposes: if one reaches a local minimum in the second argument of the risk and the loss is strictly proper, then it is a global minimum. If the loss is merely proper, such a local minimum is a global minimum or a constant in an open set.

But observe that typically one is minimising the full risk $\mathbb{L}(q(\cdot))$ over functions $q: \mathcal{X} \to \Delta^n$. We note that order sensitivity of $\ell$ does not imply that this optimisation problem is well behaved; one needs convexity of $q \mapsto L(p,q)$ for all $p \in \Delta^n$ to ensure convexity of the functional optimisation problem. We characterise when that holds in Section 6.4.

Proposition 7 Suppose $\ell: \Delta^n \to \mathbb{R}^n_+$ is a loss. For $p, q \in \Delta^n$ define the binary loss $\ell^{p,q}: [0,1] \to \mathbb{R}^2_+$ with partial losses
$$\ell^{p,q}_{-1}(\eta) := p' \cdot \ell\big(p + \eta(q - p)\big), \qquad \ell^{p,q}_{1}(\eta) := q' \cdot \ell\big(p + \eta(q - p)\big).$$
The following statements are equivalent:
1. $\ell$ is proper;
2. $\ell^{p,q}$ is proper for all $p, q \in \Delta^n$;
3. $\forall p, q \in \Delta^n$ and $0 \le h_1 \le h_2 \le 1$, $L(p, p + h_1(q-p)) \le L(p, p + h_2(q-p))$; and
4. there exists a concave function $f: \Delta^n \to \mathbb{R}$ and, for each $q \in \Delta^n$, a supergradient $A(q) \in \partial f(q)$, such that $\forall p, q \in \Delta^n$, $p' \cdot \ell(q) = L(p,q) = f(q) + (p - q)' \cdot A(q)$.

The proof is deferred to Appendix B.3. Characterisation (2) shows that in order to check whether a loss is proper, one need only check properness along each line. One could use the easy characterisation of properness for differentiable binary losses ($\ell: [0,1] \to \mathbb{R}^2_+$ is proper if and only if $\forall \eta \in [0,1]$, $\frac{-\ell_1'(\eta)}{1-\eta} = \frac{\ell_{-1}'(\eta)}{\eta} \ge 0$; Reid and Williamson, 2010). However, this needs to be checked for all lines defined by $p, q \in \Delta^n$.

The above result can also be seen as a generalisation of a result by Lambert (2010), who proved that properness is equivalent to the fact that the further your prediction is from reality, the larger the loss (hence the name "order sensitivity"); also confer the results on monotonicity due to Nau (1985). Lambert's result relied upon the total order of $\mathbb{R}$. In the multiclass case, there is no such total order. Yet, as the above result shows, one can compare two predictions if they lie on the same line as the true class probability.

Characterisation (4) is a restatement of the well known Bregman representation of proper losses; Cid-Sueiro and Figueiras-Vidal (2001) presented the differentiable case, and Gneiting and Raftery (2007, Theorem 3.2) the general case. This last property gives us the form of the proper losses associated with a given Bayes risk. Suppose $\underline{L}: \Delta^n \to \mathbb{R}_+$ is concave. The proper losses whose Bayes risk is equal to $\underline{L}$ are
$$\ell: \Delta^n \ni q \mapsto \Big(\underline{L}(q) + (e_i - q)' \cdot A(q)\Big)_{i=1}^n \in \mathbb{R}^n_+, \qquad A(q) \in \partial\underline{L}(q). \tag{3}$$
This result suggests that some information is lost by representing a proper loss via its Bayes risk (when the latter is not differentiable). The next proposition elucidates this by showing that proper losses which have the same Bayes risk are equal almost everywhere.

Proposition 8 Two proper losses $\ell^1, \ell^2: \Delta^n \to \mathbb{R}^n_+$ have the same conditional Bayes risk function $\underline{L}$ if and only if $\ell^1 = \ell^2$ almost everywhere. If $\underline{L}$ is differentiable, $\ell^1 = \ell^2$ everywhere.

Proof A concave function is differentiable almost everywhere (Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.2.3). Thus (3) proves that two proper losses $\ell^1$ and $\ell^2$ which have the same Bayes risk are equal almost everywhere. Suppose now that two proper losses are equal almost everywhere. Then their associated Bayes risks $\underline{L}^1$ and $\underline{L}^2$ are equal almost everywhere and continuous (since they are concave). If there existed $p$ such that $\underline{L}^1(p) \ne \underline{L}^2(p)$, then, since $\underline{L}^1$ and $\underline{L}^2$ are continuous, there would exist $\varepsilon > 0$ such that $\forall q \in B(p,\varepsilon) \cap \Delta^n$, $\underline{L}^1(q) \ne \underline{L}^2(q)$, where $B(p,\varepsilon)$ is a ball of radius $\varepsilon$ centred at $p$. Yet this contradicts the fact that $\underline{L}^1$ and $\underline{L}^2$ are equal almost everywhere. Hence the Bayes risks are equal everywhere.
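Equation (3) is constructive: given any concave Bayes risk one can manufacture a proper loss from its (super)gradients. The sketch below (our own; the helper names are hypothetical) rebuilds log loss from the Shannon entropy $\underline{L}(p) = -\sum_i p_i \log p_i$, whose gradient is $A(p) = (-\log p_i - 1)_i$.

```python
import numpy as np

def entropy(p):                 # concave Bayes risk: Shannon entropy
    return -(p * np.log(p)).sum()

def grad_entropy(p):            # a supergradient A(p) of the entropy
    return -np.log(p) - 1.0

def proper_loss_from(L, A, q):  # eq. (3): l_i(q) = L(q) + (e_i - q)' . A(q)
    n = len(q)
    return np.array([L(q) + (np.eye(n)[i] - q) @ A(q) for i in range(n)])

q = np.array([0.5, 0.3, 0.2])
print(proper_loss_from(entropy, grad_entropy, q))  # [0.693 1.204 1.609]
print(-np.log(q))                                  # identical: log loss recovered
```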
While the previous proposition shows that losses are closely related to their Bayes risks, the next proposition shows how the continuity of a loss is related to the differentiability of its Bayes risk.

Proposition 9 Suppose $\ell: \Delta^n \to \mathbb{R}^n_+$ is a proper loss. Then $\ell$ is continuous on $\mathring{\Delta}^n$ if and only if $\underline{L}$ is differentiable on $\mathring{\Delta}^n$; and $\ell$ is continuous at $p \in \mathring{\Delta}^n$ if and only if $\underline{L}$ is differentiable at $p \in \mathring{\Delta}^n$.

The proof of this result can be found in Appendix B.2. This type of relationship is further explored in Section 6.4, where the convexity of a composite loss is related to properties of its Bayes risk.

5. The Proper Composite Representation: Uniqueness and Existence

Many natural predictors have a range other than the simplex (for example those induced by linear functions). It is thus sometimes convenient to define a loss on some set $V$ rather than $\Delta^n$; confer (Reid and Williamson, 2010). The link function explicates the result of Grünwald and Dawid (2004) that every decision problem induces a decision problem expressed in terms of proper losses (see van Erven et al., 2011, Section 6, for further explanation). Traditionally (McCullagh and Nelder, 1989), links are defined only for binary problems (where one is using univariate probabilities). However, there is scattered (but seemingly unsystematic) work on multivariate links (Glonek and McCullagh, 1995; Glonek, 1996), primarily from the perspective of probabilistic modelling (as opposed to the design of loss functions). Sometimes multivariate links are constructed from univariate links (Molenberghs and Lesaffre, 1999).

Composite losses (see the definition in §2) are a way of constructing losses on sets other than $\Delta^n$: given a proper loss $\lambda: \Delta^n \to \mathbb{R}^n_+$ and an invertible link $\psi: \Delta^n \to V$, one defines $\lambda_\psi: V \to \mathbb{R}^n_+$ as $\lambda_\psi := \lambda \circ \psi^{-1}$. We now consider the question: given a loss $\ell: V \to \mathbb{R}^n_+$, when does $\ell$ have a proper composite representation (whereby $\ell$ can be written as $\ell = \lambda \circ \psi^{-1}$), and is this representation unique?
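For intuition, here is a minimal sketch of the composite construction (our own illustration; the paper's formal treatment follows). The familiar binary logistic loss is exactly such a composite: the binary log loss $\lambda$ composed with the inverse of the logit link $\psi(q) = \log(q/(1-q))$, so that predictions are real-valued scores rather than probabilities.

```python
import numpy as np

def lam(q):                       # binary log loss on the simplex, q = P(Y = +1)
    return np.array([-np.log(1 - q), -np.log(q)])   # (l_{-1}, l_{+1})

psi     = lambda q: np.log(q / (1 - q))             # logit link: (0,1) -> R
psi_inv = lambda v: 1 / (1 + np.exp(-v))            # its inverse (sigmoid)

def ell(v):                       # composite loss l = lam o psi^{-1} on V = R
    return lam(psi_inv(v))

v = 1.3
print(ell(v))                       # [1.541, 0.241]
print(np.log1p(np.exp([v, -v])))    # logistic losses log(1 + e^{+v}), log(1 + e^{-v}): identical
```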
We first consider the binary case. Here the prediction space $V \subseteq \mathbb{R}$ is assumed to be either an interval or the entire real line.

5.1 The Binary Case

Our first result shows that if a binary loss can be written as a proper composite loss, the proper loss defined on the simplex is unique. Furthermore, as soon as the loss is not constant, the link function is also unique. If the loss is constant on an interval, then one can choose any value of the link function on this interval which keeps the link continuous and invertible, and still obtain a proper composite loss. The proof can be found in Appendix B.4. As is common in the literature, we write the binary labels as $\{-1, +1\}$, so the partial losses are $\ell_{-1}$ and $\ell_{+1}$.

Proposition 10 Suppose $\ell = \lambda \circ \psi^{-1}: V \to \mathbb{R}^2_+$ is a proper composite loss such that the proper loss $\lambda$ is differentiable and the link function $\psi$ is differentiable and invertible. Then the proper loss $\lambda$ is unique. Furthermore, $\psi$ is unique if for all $v_1, v_2 \in V$ there exists $v \in [v_1, v_2]$ with $\ell_{-1}'(v) \ne 0$ or $\ell_{1}'(v) \ne 0$. If there exist $\bar{v}_1, \bar{v}_2 \in V$ such that $\ell_{-1}'(v) = \ell_{1}'(v) = 0$ for all $v \in [\bar{v}_1, \bar{v}_2]$, one can choose any $\psi|_{[\bar{v}_1, \bar{v}_2]}$ such that $\psi$ is differentiable, invertible and continuous on $[\bar{v}_1, \bar{v}_2]$ and still obtain $\ell = \lambda \circ \psi^{-1}$; $\psi$ is uniquely defined wherever $\ell$ is invertible.

We now determine necessary and sufficient conditions for a binary loss to be expressible as a proper composite loss. Once again, the proof is deferred, to Appendix B.5.

Proposition 11 Suppose $\ell: V \to \mathbb{R}^2_+$ is a differentiable binary loss such that for all $v \in V$, $\ell_{-1}'(v) \ne 0$ or $\ell_{1}'(v) \ne 0$. Then $\ell$ can be expressed as a proper composite loss if and only if the following three conditions hold:
1. $\ell_1$ is decreasing (resp. increasing);
2. $\ell_{-1}$ is increasing (resp. decreasing); and
3. $f: V \ni v \mapsto -\ell_1'(v) / \ell_{-1}'(v)$ is strictly increasing (resp. decreasing) and continuous.

Observe that the last condition is always satisfied if both $\ell_1$ and $\ell_{-1}$ are convex.

5.2 Binary Margin Losses

Suppose $\phi: \mathbb{R} \to \mathbb{R}_+$ is a function. The loss $\ell^\phi: V \ni v \mapsto (\ell_{-1}(v), \ell_1(v))' := (\phi(-v), \phi(v))' \in \mathbb{R}^2_+$ is called a binary margin loss. Binary margin losses are often used for classification problems. We now show how the previous proposition applies to them.

Corollary 12 Suppose $\phi: \mathbb{R} \to \mathbb{R}_+$ is differentiable and, for all $v \in \mathbb{R}$, $\phi'(v) \ne 0$ or $\phi'(-v) \ne 0$. Then $\ell^\phi$ can be expressed as a proper composite loss if and only if $\phi$ is monotonic and $f: \mathbb{R} \ni v \mapsto \phi'(v)/\phi'(-v)$ is strictly monotonic and continuous.

If $\phi$ is convex or concave, then $f$ defined above is monotonic. However, not all binary margin losses are proper composite losses; one can even build a smooth margin loss which cannot be expressed as a proper composite loss. Consider $\phi(x) = 1 - \frac{\arctan(x-1)}{\pi}$. Then
$$f(v) = \frac{\phi'(v)}{\phi'(-v)} = \frac{v^2 + 2v + 2}{v^2 - 2v + 2},$$
which is not invertible. This loss is illustrated in Figure 5, after some additional concepts are introduced.
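The failure can be verified numerically (our own sketch): $f$ below is continuous but not strictly monotonic, so by Corollary 12 this margin loss has no proper composite representation.

```python
import numpy as np

phi  = lambda x: 1 - np.arctan(x - 1) / np.pi     # smooth margin function
dphi = lambda x: -1 / (np.pi * (1 + (x - 1)**2))  # phi'(x)

f = lambda v: dphi(v) / dphi(-v)                  # = (v^2+2v+2) / (v^2-2v+2)

v = np.linspace(-6, 6, 13)
print(np.round(f(v), 3))
# f rises from values below 1, peaks at v = sqrt(2) (where f = 3 + 2*sqrt(2))
# and falls back towards 1; e.g. f(1) = f(2) = 5. It is not injective,
# so no valid (invertible) link exists.
```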
5.3 The Multiclass Case

Uniqueness of the composite representation remains straightforward in the multiclass case.

Proposition 13 Suppose a loss $\ell: V \to \mathbb{R}^n_+$ has two proper composite representations $\ell = \lambda \circ \psi^{-1} = \mu \circ \phi^{-1}$, where $\lambda$ and $\mu$ are proper losses with corresponding Bayes risks $\underline{\Lambda}$ and $\underline{M}$ respectively, and $\psi$ and $\phi$ are continuous invertible link functions. Then $\lambda = \mu$ almost everywhere. If $\ell$ is continuous and has a composite representation, then the proper loss in the decomposition is unique ($\lambda = \mu$ everywhere). If $\ell$ is invertible and has a composite representation, then the representation is unique.

Proof $\underline{\Lambda}(p) = \inf_q p' \cdot \lambda(q) = \inf_q p' \cdot \ell(\psi(q)) = \inf_v p' \cdot \ell(v)$ (since $\psi$ is invertible) $= \inf_q p' \cdot \ell(\phi(q)) = \inf_q p' \cdot \mu(q) = \underline{M}(p)$. Then $\lambda$ and $\mu$ are two proper losses with the same Bayes risk, so by Proposition 8 they are equal almost everywhere. If moreover $\ell$ is continuous, then $\lambda = \ell \circ \psi$ and $\mu = \ell \circ \phi$ are continuous, so $\lambda = \mu$ everywhere. If moreover $\ell$ is invertible, then $\psi = \ell^{-1} \circ \lambda$ and $\phi = \ell^{-1} \circ \mu$, so $\psi$ and $\phi$ are also equal almost everywhere, and as they are continuous, they are equal everywhere. So $\lambda = \ell \circ \psi = \ell \circ \phi = \mu$.

Figure 3: Illustration of $\Delta^n$-smoothness (for each $x \in \ell(V)$ there is a unique supporting $p \in \Delta^n$) and $\Delta^n$-strict convexity (for each $p \in \Delta^n$ there is a unique supported $x \in \ell(V)$). The hyperplanes witness the possession or non-possession of the respective properties.

Characterising the existence of a composite representation is more complex in the multiclass case, and we need to introduce some definitions. For $p \in \Delta^n$ and $\beta \in \mathbb{R}$ we use the hyperplanes
$$h^\beta_p := \{x \in \mathbb{R}^n : x' \cdot p = \beta\}.$$
A hyperplane $h^\beta_p$ supports a set $A$ at $x \in A$ when $x \in h^\beta_p$ and either $a' \cdot p \ge \beta$ for all $a \in A$, or $a' \cdot p \le \beta$ for all $a \in A$. Given a loss $\ell: V \to \mathbb{R}^n_+$, the loss image is $\ell(V) := \{\ell(v) : v \in V\}$.

Definition 14 Let $S(p,x)$ denote the statement "$\ell(V)$ is supported by $h^\beta_p$ at $x$ for some $\beta \in \mathbb{R}$".
1. A loss image $\ell(V)$ is $\Delta^n$-strictly convex if for all $p \in \Delta^n$ there exists a unique $x \in \ell(V)$ such that $S(p,x)$.
2. A loss image $\ell(V)$ is $\Delta^n$-smooth if for all $x \in \ell(V)$ there exists a unique $p \in \Delta^n$ such that $S(p,x)$.

This definition is illustrated in Figure 3. Dropping the uniqueness requirement in these definitions would drastically change things. Since we will require $\ell$ to be continuous, $\ell(V)$ is always closed, and since by assumption $\ell(V) \subseteq [0,\infty)^n$, every such loss satisfies the weakened version of $\Delta^n$-strict convexity: for all $p \in \Delta^n$ there exists $x \in \ell(V)$ such that $S(p,x)$. The weakened version of $\Delta^n$-smoothness, which requires that for all $x \in \ell(V)$ there exists $p \in \Delta^n$ such that $S(p,x)$, is a convexity-like requirement. (Confer the following result (Schneider, 1993, Theorem 1.3.3): suppose $A$ is a closed set with nonempty interior such that through each boundary point of $A$ there is a support plane to $A$; then $A$ is convex.) The name $\Delta^n$-strict convexity is justified by the observation that replacing $\Delta^n$ by the $\ell_1$ unit ball $B^n_1$ gives a natural definition of strict convexity of a general set in $\mathbb{R}^n$. We also observe that both $\Delta^n$-strict convexity and $\Delta^n$-smoothness are closely related to the curvature of the Bayes risk $\underline{L}$, by way of the fact that the support function of the set $\ell(V)$ (restricted to $\Delta^n$) is the Bayes risk; confer (Williamson, 2014). Specifically, $\Delta^n$-strict convexity is equivalent to the Hessian $\mathrm{H}\underline{\tilde{L}}(\tilde{p})$ being non-singular for all $\tilde{p}$ in the interior of $\tilde{\Delta}^n$, while $\Delta^n$-smoothness is implied whenever $\underline{L}$ is continuously differentiable.

Suppose $A, B \subseteq \mathbb{R}^n$. Then the Minkowski sum $A + B := \{a + b : a \in A,\ b \in B\}$.

Definition 15 Given a loss $\ell: V \to \mathbb{R}^n_+$, we denote by
$$S_\ell := \ell(V) + [0,\infty)^n = \{x \in \mathbb{R}^n_+ : \exists v \in V,\ \forall i \in [n],\ x_i \ge \ell_i(v)\}$$
the superprediction set of $\ell$ (Kalnishkan and Vyugin, 2008).

One can characterise the existence of proper composite representations in terms of properties of the superprediction set. We start with an old result; confer (Dawid, 2007).

Proposition 16 Every continuous proper loss has a convex superprediction set.

Figure 5: Superprediction set of a binary margin loss which is not a proper composite loss; see the text following Corollary 12.

Proof Suppose $\ell$ is proper but $S_\ell$ is not convex. Then there exists $x_0 \in \ell(\Delta^n)$ such that $\ell(\Delta^n)$ is not supported at $x_0$ by any hyperplane $h_p$ with normal vector $p \in \Delta^n$. Let $q_0 \in \Delta^n$ be such that $\ell(q_0) = x_0$. Then there is a hyperplane $h_{q_0}$ (with normal $q_0$) that supports $\ell(\Delta^n)$ at some $x_1 = \ell(q_1) \ne x_0$. Thus $q \mapsto q_0' \cdot \ell(q)$ is minimised at $q_1$ and not minimised at $q_0$, and thus $\ell$ cannot be proper, a contradiction.

The geometry of continuous proper losses is illustrated (for $n = 2$) in Figure 4.

Figure 4: Illustration of the geometry of loss functions. The locus of the vector-valued loss $\ell$ is plotted as $v$ varies over $V$. The superprediction set $S_\ell$ is the region to the northeast of the loss image $\ell(V)$. The hyperplane $h^{\underline{L}(q)}_q = \{x : x' \cdot q = \underline{L}(q)\}$ has normal vector $q$ and offset $\underline{L}(q)$. It supports $S_\ell$ at the point $x = \ell(v)$, indicating that the Bayes risk is achieved at $v$ for the true probability $q$.

The superprediction set of the margin loss discussed following Corollary 12 is not convex, as can be seen in Figure 5. Continuous proper losses are quasi-convex, canonically so, as the following result shows.

Proposition 17 Suppose $\ell: \Delta^n \to \mathbb{R}^n_+$ is a continuous proper loss. Then its superprediction set $S_\ell$ is convex and, for all $p \in \Delta^n$, the function $f_p(q) := L(p,q) = p' \cdot \ell(q)$ is quasi-convex. Conversely, suppose $f_p(q) := p' \cdot \ell(q)$ is quasi-convex in $q$ for all $p \in \Delta^n$. Then there is a unique convex set $S$ such that $S_\ell = S$, and $\ell$ is necessarily proper.

The proof is in Appendix B.6. Some (but not all) proper losses are in addition convex; this is studied in more detail in Section 6.4 below.
Working with Sℓis problematic for characterising the existence of strictly proper composite representations (essentially because while for a strictly proper loss ℓ, ℓ( n) is n-strictly convex, Sℓis not strictly convex (because of the flat spots at the extremes bounded losses have superprediction sets with flats parallel to the axes by construction)1. We will thus characterise proper and strictly proper composite representations in terms of properties of ℓ(V ) rather than Sℓ. 1. It turns out that by starting with the superprediction set, and defining the loss in terms of the (super-) gradient of the (concave) support function of the superprediction set, these difficulties can be avoided (Williamson, 2014). COMPOSITE MULTICLASS LOSSES Proposition 18 Suppose ℓ: V Rn + is a continuous loss. ℓhas a proper composite representation if and only if ℓ(V ) is n-smooth. Additionally, ℓis strictly proper composite if and only if ℓ(V ) is also n-strictly convex. The proof is in Section B.7. 6. Implications: Mixability, Admissibility, Minimaxity and Convexity We now consider some of the implications that the proper composite representation has for several previously studied properties of loss functions. 6.1 Mixability Mixability is a fundamental property of a loss function in the study of prediction with expert advice. In this setting learning takes place in fixed number of sequential rounds. Each round a learner is presented with predictions some finite number of experts. The learner then makes a prediction and the outcome for that round is revealed. The learner s and experts predictions are assessed using some predefined loss function and the aim of the learner is to incur a total loss not much worse than the best expert i.e., the one with the smallest total loss. The difference between the learner s total loss and that of the best expert is known as the regret. In his seminal work, Vovk (1995) showed that no matter how the experts behave, there exists a strategy for the learner (called the aggregating algorithm ) that guarantees a regret bounded by ln K η where K is the number of experts and η is a positive number called the mixability constant (defined below) that only depends on the loss. Losses for which this constant is defined are called mixable. Furthermore, this constant characterises when such a constant regret bound is possible. That is, if a loss is not mixable then there is no strategy the learner can use to guarantee a constant regret bound. Formally, mixability of a loss ℓis defined in terms of the convexity of a transformation of the loss s superprediction set Sℓ(see Definition 15). We say that for η > 0 the η-exponentiated superprediction set is the image of Sℓ Rn under the mapping Eη : Rn Rn + defined by Eη(x) := (e ηxi)n i=1. A loss ℓis said to be η-mixable if its η-exponentiated superprediction set is convex. The mixability of ℓis the smallest value of η for which ℓis η-mixable. For further details, the reader is referred to papers by Vovk (1995); Kalnishkan and Vyugin (2008); Vovk and Zhdanov (2009). Recently, van Erven et al. (2012b)2 have shown that the mixability of a loss is related to the curvature of the loss s Bayes risk relative to the curvature of the Bayes risk for log loss. The main result here builds on some of the insights from that work and shows that mixable losses (under mild conditions) always have proper composite representations. For α (0,1) we write α := 1 α. For x,y Rn, x y (xi yi, i [n]). We now give a necessary condition for mixability. 2. 
For $\alpha \in (0,1)$ we write $\bar{\alpha} := 1 - \alpha$. For $x, y \in \mathbb{R}^n$, $x \ge y$ means $x_i \ge y_i$ for all $i \in [n]$. We now give a necessary condition for mixability.

Lemma 19 Suppose $\ell: V \to \mathbb{R}^n_+$, $x_0 = \ell(v_0)$ and $x_1 = \ell(v_1)$ with $x_0 \ne x_1$. For $\alpha \in (0,1)$, define $x_\alpha := \bar{\alpha} x_0 + \alpha x_1$ and $v_\alpha := \bar{\alpha} v_0 + \alpha v_1$. If for some $\alpha$
$$x_\alpha \le \ell(v_\alpha) \tag{4}$$
then $\ell$ is not mixable.

Proof Pick some $\eta > 0$. Let $f_\eta(a) := e^{-\eta a}$ for $a \in \mathbb{R}$, so that for $x \in \mathbb{R}^n$ we have $E_\eta(x) = (f_\eta(x_i))_{i=1}^n$. Observe that the function $f_\eta$ is strictly monotone decreasing ($a < b \implies f_\eta(a) > f_\eta(b)$) and strictly convex ($\bar{\alpha} f_\eta(a) + \alpha f_\eta(b) > f_\eta(\bar{\alpha} a + \alpha b)$ for $a \ne b$). For $i \in [n]$ set $x_{0,i} = \ell_i(v_0)$ and $x_{1,i} = \ell_i(v_1)$. By assumption, we have
$$\bar{\alpha} x_{0,i} + \alpha x_{1,i} \le \ell_i(\bar{\alpha} v_0 + \alpha v_1), \quad i \in [n],$$
which by strict monotonicity gives
$$f_\eta(\bar{\alpha} x_{0,i} + \alpha x_{1,i}) \ge f_\eta\big(\ell_i(\bar{\alpha} v_0 + \alpha v_1)\big), \quad i \in [n],$$
and hence by strict convexity
$$\bar{\alpha} f_\eta(x_{0,i}) + \alpha f_\eta(x_{1,i}) > f_\eta\big(\ell_i(\bar{\alpha} v_0 + \alpha v_1)\big), \quad i \in [n],$$
that is, $\bar{\alpha} E_\eta(\ell(v_0)) + \alpha E_\eta(\ell(v_1)) > E_\eta(\ell(\bar{\alpha} v_0 + \alpha v_1))$. Thus $\ell$ is not mixable, since we have witnessed the non-convexity of the $\eta$-exponentiated superprediction set for $\ell$; and $\eta > 0$ was arbitrary.

van Erven et al. (2012b) showed that (under some mild conditions) a proper loss $\lambda$ and the composite loss $\lambda_{\bar{\psi}}$ obtained via the reference link $\bar{\psi}$ (see Proposition 5) share the same mixability constant. We now show that mixable losses always have strictly proper composite representations.

Proposition 20 Suppose $\ell: V \to \mathbb{R}^n_+$ is a $\Delta^n$-smooth continuous loss. If $\ell$ is mixable, then $\ell$ has a strictly proper composite representation.

Proof We prove the contrapositive. By Proposition 18, lack of a strictly proper composite representation is equivalent to $\ell(V)$ being not $\Delta^n$-strictly convex. Suppose then that $\ell(V)$ is indeed not $\Delta^n$-strictly convex. There are two possibilities to consider:
1. There exists $p \in \Delta^n$ such that there is no $x \in \ell(V)$ at which $\ell(V)$ is supported by $h^\beta_p$ for some $\beta \in \mathbb{R}$; or
2. There exist $p \in \Delta^n$, $v_0, v_1 \in V$ with $v_0 \ne v_1$, $x_0 = \ell(v_0)$, $x_1 = \ell(v_1)$, and $\beta \in \mathbb{R}$ such that $h^\beta_p$ supports $\ell(V)$ at both $x_0$ and $x_1$.
Since $\ell(V) \subseteq [0,\infty)^n$ and $\ell$ is continuous (hence $\ell(V)$ is closed), for all $p \in \Delta^n$ there always exists $x \in \ell(V)$ such that $h^\beta_p$ supports $\ell(V)$ at $x$. Thus, under the hypothesis, case 2 must always hold. Then, by continuity of $\ell$ and the definition of a supporting hyperplane, there exists $\alpha \in (0,1)$ such that (4) holds, and so $\ell$ is not mixable.

6.2 Admissibility

The above results are strongly related to the classical notion of admissibility (Ferguson, 1967; Chernoff and Moses, 1986; Kiefer, 1987), which is particularly simple in our situation. We adapt the terminology of Ferguson (1967) to be consistent with the rest of the present paper.

Definition 21 Suppose $\ell: V \to \mathbb{R}^n_+$ is a loss. A prediction $v_1 \in V$ is better than $v_2 \in V$ if $\ell(v_1) \le \ell(v_2)$ and $\ell_i(v_1) < \ell_i(v_2)$ for some $i \in [n]$. A prediction $v_1$ is equivalent to $v_2$ if $\ell(v_1) = \ell(v_2)$. A prediction $v \in V$ is admissible if there is no prediction better than $v$. If a prediction $v \in V$ is Bayes-optimal for some distribution, that is, there exists $p \in \Delta^n$ such that $v \in \arg\min_{\bar{v} \in V} p' \cdot \ell(\bar{v})$, then we say $v$ is strongly admissible.

Ferguson (1967, Theorem 1, page 60) states the following (which we present for invertible losses, so that $\ell(v_1) = \ell(v_2) \implies v_1 = v_2$).

Proposition 22 Suppose $\ell: V \to \mathbb{R}^n_+$ is invertible and $p \in \mathring{\Delta}^n$. If $v \in V$ is the unique prediction such that $L(p,v) = \underline{L}(p)$, then $v$ is admissible.

Proposition 18 then implies the following.

Corollary 23 Suppose $\ell: V \to \mathbb{R}^n_+$ is continuous and invertible. If $\ell$ has a strictly proper composite representation, then all $v \in V$ are admissible and strongly admissible.
Proof If $\ell$ has a strictly proper composite representation, then $\ell(V)$ is $\Delta^n$-strictly convex, and thus for all $p \in \Delta^n$ there exists a unique $x \in \ell(V)$ such that $h^{\underline{L}(p)}_p$ supports $\ell(V)$ at $x$. Thus by Proposition 22, the $v$ such that $\ell(v) = x$ is an admissible prediction. Furthermore, since $\ell(V)$ is $\Delta^n$-smooth, this argument actually holds for all $v \in V$, and thus every prediction is admissible. Finally, it follows directly from the definition of $\Delta^n$-smoothness that all $v$ are strongly admissible.

Proposition 24 If $\ell: V \to \mathbb{R}^n_+$ is continuous and has a proper composite representation, then every prediction is admissible.

Proof We prove the contrapositive: if a continuous loss $\ell: V \to \mathbb{R}^n_+$ is such that there exist $x_0, x_1 \in \ell(V)$ with $x_1$ better than $x_0$, then $\ell$ cannot have a proper composite representation. Observe that "$x_1$ is better than $x_0$" is equivalent to
$$\forall i \in [n],\ e_i' \cdot (x_0 - x_1) \ge 0 \quad \text{and} \quad \exists i \in [n],\ e_i' \cdot (x_0 - x_1) > 0.$$
Consider two mutually exclusive and exhaustive cases:
1. $e_i' \cdot (x_0 - x_1) > 0$ for all $i \in [n]$. Then for all $p \in \Delta^n$, $p' \cdot (x_0 - x_1) > 0$, i.e. $p' \cdot x_0 > p' \cdot x_1$, and thus $\ell(V)$ cannot be supported at $x_0$ by $h^\beta_p$ for any $p \in \Delta^n$; thus $\ell(V)$ is not $\Delta^n$-smooth.
2. Alternatively, suppose $e_i' \cdot (x_0 - x_1) = 0$ for $i \in I \subseteq [n]$ and $e_i' \cdot (x_0 - x_1) > 0$ for $i \in [n] \setminus I$, with $1 \le |I| < n$. Consider the two mutually exclusive subcases over $p \in \Delta^n$:
(a) $p_i > 0$ for some $i \in [n] \setminus I$. Then $p' \cdot (x_0 - x_1) > 0$ and $\ell(V)$ cannot be supported at $x_0$ by $h^\beta_p$ for any $\beta \in \mathbb{R}$.
(b) $p_i = 0$ for all $i \in [n] \setminus I$. Then $p' \cdot (x_0 - x_1) = 0$, in which case $\ell(V)$ is supported by $h^\beta_p$ for some $\beta$ at both $x_0$ and $x_1$.
In either of these subcases, the $\Delta^n$-smoothness condition is violated. Thus in both cases we have shown $\ell(V)$ cannot be $\Delta^n$-smooth, and by Proposition 18 it cannot have a proper composite representation.

As can be seen in Figure 6, there can be no hope of a converse: mere admissibility of every prediction $x \in \ell(V)$ cannot imply that $\ell$ has a proper composite representation. However, strong admissibility of every prediction implies $\ell(V)$ is $\Delta^n$-smooth, and so if $\ell$ is continuous, strong admissibility of every prediction implies (via Proposition 18) that $\ell$ has a proper composite representation. The relationship between strict convexity of $S_\ell$ and admissibility is not new (Brown, 1981), but the connection with our characterisation of proper composite losses is new. We conclude that if $\ell$ is continuous and invertible and we desire that all predictions be admissible, then it suffices to consider only losses with a proper composite representation. Continuous invertible losses that do not have a proper composite representation are redundant, in the sense that there are guaranteed to exist predictions that are not Bayes optimal for any true distribution.

6.3 Minimaxity

We say a loss $\ell: V \to \mathbb{R}^n_+$ is minimax if its conditional risk $L(p,v) = p' \cdot \ell(v)$ satisfies
$$\max_{p \in \Delta^n} \min_{v \in V} L(p,v) = \min_{v \in V} \max_{p \in \Delta^n} L(p,v). \tag{5}$$
Minimaxity of proper losses has been studied in a very general setting by Grünwald and Dawid (2004), who showed the connection between robust Bayes procedures and maximum entropy; confer classical results presented, for example, by Ferguson (1967). In this brief subsection we point out some simple implications of our earlier results.

Setting $V = \Delta^n$, observe that for all proper losses $\lambda: \Delta^n \to \mathbb{R}^n_+$, $p \mapsto \Lambda(p,q) = p' \cdot \lambda(q)$ is linear for all $q \in \Delta^n$, and if $\lambda$ is also continuous, by Proposition 17, $q \mapsto \Lambda(p,q)$ is quasi-convex for all $p \in \Delta^n$. It thus follows from the minimax theorem of Sion (1958) that all continuous proper losses satisfy
$$\max_{p \in \Delta^n} \min_{q \in \Delta^n} \Lambda(p,q) = \min_{q \in \Delta^n} \max_{p \in \Delta^n} \Lambda(p,q) \tag{6}$$
and are thus minimax.
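A minimal numeric illustration of (6), our own sketch: for binary log loss, both sides equal the maximum entropy $\log 2$, attained at $p = q = (\frac12, \frac12)$.

```python
import numpy as np

grid = np.linspace(0.005, 0.995, 199)
Lam = lambda p, q: -(p * np.log(q) + (1 - p) * np.log(1 - q))  # risk, n = 2

# max_p min_q Lam(p,q): for a proper loss the inner minimum is the Bayes
# risk, here the Shannon entropy of p.
lhs = max(min(Lam(p, q) for q in grid) for p in grid)
# min_q max_p Lam(p,q): the inner maximum of a linear function of p is
# attained at a vertex, p in {0, 1}.
rhs = min(max(Lam(p, q) for p in (0.0, 1.0)) for q in grid)

print(lhs, rhs, np.log(2))  # all approximately 0.6931
```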
Suppose ℓ= λ ψ = λ ψ 1 : V Rn + is a proper composite loss, with conditional risk L(p,v) = Λ(p,ψ 1(v)). Since ψ 1 is invertible, max p n min v V L(p,v) = max p n min q n Λ(p,q), (7) COMPOSITE MULTICLASS LOSSES h L(v) q = {x: x q = L(v)} Sℓ x1 = ℓ(v1) h L(v) q = {x: x q = L(v)} Figure 6: Left: Illustration of a continuous loss ℓ(which can be presumed invertible) with a nonconvex superprediction set. For true probability q, v1 and v2 both are Bayes optimal since q ℓ(v1) = q ℓ(v2) = L(v); thus h L(v1) q = h L(v3) q supports ℓ(V ) at both x1 and x3. The point x2 is never a member of a supporting hyperplane of ℓ(V ) and is thus never the Bayes optimal prediction for any q and so not strongly admissible. The green line indicates the set of predictions that are not strongly admissible they will never be Bayes optimal for any q n. Such predictions are, however, admissible, as can be seen by the grey translated negative orthants centred at x1, x2 and x3 (each orthant does not contain any other predictions better than them). All the other predictions whose image lies in the black line are both admissible and strongly admissible. The loss image ℓ(V ) is not n-smooth because there exist no p n that supports ℓ(V ) at x2. Hence by Proposition 18, ℓcan not have a proper composite representation. Right: Similar to the figure on the left, except there are now some predictions, such as x2, which are not admissible: x4 is better than x2 as can be seen since x4 is contained in the interior of the shifted negative orthant centred at x2. Note in this case the boundary of the super-prediction set Sℓdoes not equal ℓ(V ) (see the part of Sℓcross-hatched in red). This loss can not have a proper composite representation by Proposition 24. where by the relationship between q and v, argmin v V L(p,v) = ψ argmin q n Λ(p,q) . Similarly, min v V max p n L(p,v) = min q n max p n Λ(p,q). (8) Since λ is proper, Λ satisfies (6) which combined with (7) and (8) proves the following. Proposition 25 Every continuous proper composite loss is minimax. Note that this alone does not imply that all continuous proper composite losses are quasi-convex, which would follow if ψ mapped convex sets to convex sets; however this can not be true in WILLIAMSON, VERNET AND REID general in Rn because convexity preserving mappings must be affine (Webster, 1994, Theorem 7.3.7); confer (Meyer and Kay, 1973). Recall Proposition 17 showed the quasi-convexity of all proper losses. Proposition 25 means that the use of the classical minimax theorem by Abernethy et al. (2009) in order to prove their main result for convex losses can be foregone; their result also holds for arbitrary continuous proper composite losses. 6.4 Convexity In order to computationally optimise models with respect to a loss function it is convenient if the loss is convex. In this subsection we develop conditions for the convexity of multiclass composite proper losses. We assume throughout this section that the loss and link are twice differentiable. We start by proving some identities for their first and second derivatives. 6.4.1 TECHNICAL PRELIMINARIES Suppose ℓ= λ ψ 1 is composed of the proper loss λ : n Rn + and the inverse of the link ψ : n V . In order to simplify the calculation of derivatives for the function ℓ: V Rn + we will assume the set V is a flat, (n 1)-dimensional, convex subset of Rn +. 
We do so since, if $V$ were some arbitrary manifold, the extra definitions required to make sense of convexity (e.g., in terms of geodesics) and derivatives on manifolds would obscure the gist of the results below. Furthermore, little is lost either practically or theoretically by assuming a simple $V$. In practice, predictions are usually vectors in $\mathbb{R}^n_+$, and in theory one could always choose a parametrisation of $V$ in terms of some simpler space $U$ and redefine the link via composition with that parametrisation. Alternatively, since links must be invertible, a composite loss could be defined by a choice of loss and a choice of inverse link $\psi^{-1}: V \to \Delta^n$ for a $V$ assumed to be flat, etc.

Recalling the convention that $\tilde{n} := n - 1$, let $v \in V$ be fixed but arbitrary, with corresponding $\tilde{p} = \tilde{\psi}^{-1}(v)$, where $\tilde{\psi}(\tilde{p}) := \psi((\tilde{p}_1, \ldots, \tilde{p}_{\tilde{n}}, \tilde{p}_n)')$ with $\tilde{p}_n := 1 - \sum_{i=1}^{\tilde{n}} \tilde{p}_i$ is the induced function from $\tilde{\Delta}^n$ to $V$, and $\tilde{\lambda} := \lambda \circ \Pi^{-1}$ is the loss as a function of $\tilde{p}$. By the chain rule and the inverse function theorem, the derivatives for each of the partial losses $\ell_i$ satisfy
$$D\ell_i(v) = D\tilde{\lambda}_i(\tilde{\psi}^{-1}(v)) = D\tilde{\lambda}_i(\tilde{p}) \cdot [D\tilde{\psi}(\tilde{p})]^{-1}. \tag{9}$$
We use $e^{\tilde{n}}_i$ to denote the $i$th $\tilde{n}$-dimensional unit vector, $e^{\tilde{n}}_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ when $i \in [\tilde{n}]$, and define $e^{\tilde{n}}_i := 0_{\tilde{n}}$ when $i > \tilde{n}$. We can write $D\tilde{\lambda}_i(\tilde{p})$ in terms of the $n \times \tilde{n}$ matrix $D\tilde{\lambda}(\tilde{p})$ using $D\tilde{\lambda}_i(\tilde{p}) = (e^n_i)' \cdot D\tilde{\lambda}(\tilde{p})$. Writing $\hat{\lambda}(\tilde{p}) := (\tilde{\lambda}_1(\tilde{p}), \ldots, \tilde{\lambda}_{\tilde{n}}(\tilde{p}))'$ for the first $\tilde{n}$ components, we have $D\tilde{\lambda}(\tilde{p}) = (D\hat{\lambda}(\tilde{p})', D\tilde{\lambda}_n(\tilde{p})')'$, and so
$$D\tilde{\lambda}_i(\tilde{p}) = (e^n_i)' \cdot D\tilde{\lambda}(\tilde{p}) = (e^n_i)' \cdot \begin{pmatrix} D\hat{\lambda}(\tilde{p}) \\ D\tilde{\lambda}_n(\tilde{p}) \end{pmatrix}. \tag{10}$$
Furthermore, since $\lambda$ is proper, Lemma 6 of (van Erven et al., 2012b) means we can use the relationship between a proper loss and its projected Bayes risk $\underline{\tilde{L}} := \underline{L} \circ \Pi^{-1}$ to write
$$D\hat{\lambda}(\tilde{p}) = W(\tilde{p}) \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) \tag{11}$$
$$D\tilde{\lambda}_n(\tilde{p}) = -y(\tilde{p})' \cdot D\hat{\lambda}(\tilde{p}), \tag{12}$$
where $W(\tilde{p}) := I_{\tilde{n}} - \mathbb{1}_{\tilde{n}} \cdot \tilde{p}'$, $y(\tilde{p}) := \tilde{p} / \tilde{p}_n(\tilde{p})$, and $\tilde{p}_n(\tilde{p}) := 1 - \sum_{i \in [\tilde{n}]} \tilde{p}_i$. Thus, combining (10)-(12), we have for all $i \in [\tilde{n}]$
$$D\tilde{\lambda}_i(\tilde{p}) = (e^{\tilde{n}}_i)' \cdot W(\tilde{p}) \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) = \big((e^{\tilde{n}}_i)' - (e^{\tilde{n}}_i)' \cdot \mathbb{1}_{\tilde{n}} \cdot \tilde{p}'\big) \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) = (e^{\tilde{n}}_i - \tilde{p})' \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) \tag{13}$$
and
$$D\tilde{\lambda}_n(\tilde{p}) = -y(\tilde{p})' \cdot W(\tilde{p}) \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) = -\tfrac{1}{\tilde{p}_n(\tilde{p})}\, \tilde{p}' \cdot (I_{\tilde{n}} - \mathbb{1}_{\tilde{n}} \tilde{p}') \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) = -\tfrac{1}{\tilde{p}_n(\tilde{p})} \big(\tilde{p}' - (1 - \tilde{p}_n(\tilde{p}))\, \tilde{p}'\big) \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) = -\tilde{p}' \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}). \tag{14}$$
Finally, noting that by definition $e^{\tilde{n}}_n = 0_{\tilde{n}}$, (13) and (14) can be merged and combined with (9) to obtain the following proposition.

Proposition 26 For all $i \in [n]$, $\tilde{p} \in \mathring{\tilde{\Delta}}^n$ (the relative interior of $\tilde{\Delta}^n$), and $v = \tilde{\psi}(\tilde{p})$,
$$D\ell_i(v) = -(e^{\tilde{n}}_i - \tilde{p})' \cdot \kappa(\tilde{p}), \tag{15}$$
$$\kappa(\tilde{p}) := -\mathrm{H}\underline{\tilde{L}}(\tilde{p}) \cdot [D\tilde{\psi}(\tilde{p})]^{-1}. \tag{16}$$

Using the definition of the Hessian $\mathrm{H}\ell_i = D[(D\ell_i)']$ and the product rule (31), differentiating (15) with respect to $v$ (even when the terms inside the derivative are expressed using $\tilde{p} = \tilde{\psi}^{-1}(v)$) establishes the following proposition.

Proposition 27 For all $i \in [n]$, $\tilde{p} \in \mathring{\tilde{\Delta}}^n$, and $v = \tilde{\psi}(\tilde{p})$,
$$\mathrm{H}\ell_i(v) = -\big((e^{\tilde{n}}_i - \tilde{p})' \otimes I_{\tilde{n}}\big) \cdot D\big[\kappa \circ \tilde{\psi}^{-1}\big](v) + \kappa(\tilde{p})' \cdot [D\tilde{\psi}(\tilde{p})]^{-1},$$
where $\kappa(\tilde{p})$ is defined in (16).

The product $\kappa(\tilde{p}) = -\mathrm{H}\underline{\tilde{L}}(\tilde{p}) \cdot [D\tilde{\psi}(\tilde{p})]^{-1}$ that appears in both propositions above can be interpreted as the curvature of the Bayes risk function $\underline{\tilde{L}}$ relative to the rate of change of the link function $\tilde{\psi}$. When the link function is the identity, $\tilde{\psi}(\tilde{p}) = \tilde{p}$ (i.e. when we have a proper loss directly), the expressions for the derivative and Hessian of each $\ell_i$ simplify to
$$D\ell_i(\tilde{p}) = (e^{\tilde{n}}_i - \tilde{p})' \cdot \mathrm{H}\underline{\tilde{L}}(\tilde{p}) \tag{17}$$
$$\mathrm{H}\ell_i(\tilde{p}) = \big((e^{\tilde{n}}_i - \tilde{p})' \otimes I_{\tilde{n}}\big) \cdot D\big[\mathrm{H}\underline{\tilde{L}}\big](\tilde{p}) - \mathrm{H}\underline{\tilde{L}}(\tilde{p}). \tag{18}$$
The form of $\kappa$ as the product of $\mathrm{H}\underline{\tilde{L}}$ and $[D\tilde{\psi}]^{-1}$ suggests another simplification.

Definition 28 The canonical link function for a loss $\lambda$ with Bayes risk $\underline{\tilde{L}}$ is defined via
$$\psi_\lambda(\tilde{p}) := -D\underline{\tilde{L}}(\tilde{p})'. \tag{19}$$
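Under the sign convention of (19), the canonical link for log loss works out to the multinomial logit, a standard example; the computation below is our own sketch. With $\underline{\tilde{L}}$ the projected Shannon entropy, $-D\underline{\tilde{L}}(\tilde{p})' = (\log(\tilde{p}_i/\tilde{p}_n))_{i \in [\tilde{n}]}$, whose inverse is the softmax-style map normalising against the reference class $n$.

```python
import numpy as np

def canonical_link(p):            # psi_lambda(p~) = -D L~(p~)' for log loss
    pn = 1 - p[:-1].sum()         # last coordinate reconstructed from p~
    return np.log(p[:-1] / pn)    # (log(p_i / p_n))_{i=1..n-1}: the logit

def canonical_link_inv(v):        # inverse: softmax against reference class n
    z = np.exp(np.append(v, 0.0))
    return z / z.sum()

p = np.array([0.5, 0.3, 0.2])
v = canonical_link(p)
print(v)                          # [0.916, 0.405]
print(canonical_link_inv(v))      # [0.5, 0.3, 0.2]: round trip

# numeric check of v = -D L~(p~)' via finite differences of the entropy
def Ltilde(pt):
    full = np.append(pt, 1 - pt.sum())
    return -(full * np.log(full)).sum()

eps, pt = 1e-6, p[:2]
grad = np.array([(Ltilde(pt + eps * e) - Ltilde(pt - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(-grad)                      # matches v above
```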
The form of κ as the product of H L̃ and Dψ̃ suggests another simplification.

Definition 28 The canonical link function ψ_λ for a loss λ with Bayes risk L is defined via

ψ̃_λ(p̃) := −D L̃(p̃)′.    (19)

We will show in Section 8.1 that (19) is indeed guaranteed to be a legitimate link. The term κ then simplifies to κ(p̃) = I_ñ, since Dψ̃(p̃) = −D(D L̃(p̃)′) = −H L̃(p̃). For this choice of link function, the first and second derivatives become considerably simpler.

Proposition 29 If λ : Δⁿ → ℝⁿ₊ is a proper loss and ψ_λ is its associated canonical link then, for all i ∈ [n], p̃ ∈ Δ̃ⁿ°, and v = ψ̃_λ(p̃), the composite loss ℓ = λ ∘ ψ_λ⁻¹ satisfies

Dℓᵢ(v) = −(eᵢ^ñ − p̃)′    (20)
Hℓᵢ(v) = (−H L̃(p̃))⁻¹.    (21)

The simplified form of the Hessian above is established by noting that, since κ(p̃) = I_ñ, we have D[κ(ψ̃⁻¹(v))′] = 0 for all v ∈ V in Proposition 27. The above propositions hold for any number of classes n.
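Proposition 29 can likewise be checked on a concrete example. For the multiclass log loss the canonical link (19) works out to ψ̃_λ,ᵢ(p̃) = log(p̃ᵢ/pₙ), whose inverse is the softmax-style map below with the last score pinned to zero; the resulting composite loss is the familiar cross-entropy-with-softmax, and (20)–(21) reproduce its well-known gradient p̃ − eᵢ and Hessian diag(p̃) − p̃p̃′ (one can check via the Sherman–Morrison formula that this equals (−H L̃(p̃))⁻¹). A minimal sketch, ours, again using finite differences:

```python
import numpy as np

def inv_link(v):                       # inverse canonical link for log loss, n = 3
    z = np.append(np.exp(v), 1.0)      # last score pinned to 0
    return z[:-1] / z.sum()            # returns p-tilde

def ell(i, v):                         # composite loss: log loss after inverse link
    pt = inv_link(v)
    p = np.append(pt, 1 - pt.sum())
    return -np.log(p[i])

def grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x); e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hess(f, x, h=1e-4):
    return np.array([grad(lambda y: grad(f, y, h)[k], x, h) for k in range(len(x))])

v = np.array([0.4, -0.7])
pt = inv_link(v)
for i in range(3):
    e = np.zeros(2)
    if i < 2:
        e[i] = 1.0
    g = grad(lambda u: ell(i, u), v)
    print(np.allclose(g, pt - e, atol=1e-5))                          # eq. (20)
H = hess(lambda u: ell(0, u), v)
print(np.allclose(H, np.diag(pt) - np.outer(pt, pt), atol=1e-3))      # eq. (21)
```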
It is instructive (both here and later in the paper) to examine the binary case, where n = 2. In this case, Proposition 26 and Proposition 27 reduce to

ℓ₁′(v) = −(1 − p̃) κ(p̃);  ℓ₂′(v) = p̃ κ(p̃)    (22)
ℓ₁″(v) = (−(1 − p̃) κ′(p̃) + κ(p̃)) / ψ̃′(p̃)    (23)
ℓ₂″(v) = (p̃ κ′(p̃) + κ(p̃)) / ψ̃′(p̃)    (24)

where κ(p̃) = −L̃″(p̃)/ψ̃′(p̃) ≥ 0 and so (d/dv) κ(ψ̃⁻¹(v)) = κ′(p̃)/ψ̃′(p̃).

6.4.2 CONDITIONS FOR CONVEXITY OF MULTICLASS COMPOSITE PROPER LOSSES

We will now consider when multiclass proper losses are convex, and give a characterisation in terms of the corresponding Bayes risk, which as we have seen is the natural way to parametrise a loss. The results below are the multiclass generalisation of the characterisation of the convexity of binary composite losses (Reid and Williamson, 2010). In fact we obtain more general results even in the binary case, because here we consider strongly convex losses. We will also show how any non-convex proper loss can be made convex by suitable choice of a link function (the canonical link)³.

³ There are problems associated with the domain of definition of such link functions that need to be dealt with (Kamalaruban et al., 2015).

For a convex set C ⊆ ℝⁿ, a loss ℓ : C → ℝⁿ₊ is said to be convex if for all p ∈ Δⁿ the map C ∋ v ↦ L(p,v) = p′ℓ(v) is convex. That is, a loss is convex if, under any distribution p over outcomes i ∈ [n], the expected loss E_{i∼p}[ℓᵢ(v)] is convex in v. It is easy to see that ℓ is convex if and only if ℓᵢ : C → ℝ₊ is convex for all i ∈ [n]. (The "if" part follows since a sum of convex functions is convex; the "only if" follows by considering p = eᵢ for i ∈ [n].)

Definition 30 Suppose C ⊆ ℝⁿ is convex. A function f : C → ℝ is strongly convex on C with modulus c ≥ 0 if for all x, x₀ ∈ C and α ∈ (0,1),

f(αx + (1−α)x₀) ≤ α f(x) + (1−α) f(x₀) − ½ c α(1−α) ‖x − x₀‖².

When c = 0 in the above definition, f is merely convex. The function f is strongly convex on C with modulus c if and only if x ↦ f(x) − (c/2)‖x‖² is convex on C (Hiriart-Urruty and Lemaréchal, 2001, page 73). Therefore the maps v ↦ ℓᵢ(v) are c-strongly convex if and only if Hℓᵢ(v) ⪰ c I_ñ. By applying Proposition 27 we obtain the following characterisation of the c-strong convexity of the loss ℓ.

Proposition 31 A proper composite loss ℓ = λ ∘ ψ⁻¹ is strongly convex with modulus c ≥ 0 if and only if for all p̃ ∈ Δ̃ⁿ° and all i ∈ [n]

−[(eᵢ^ñ − p̃)′ ⊗ I_ñ] D[κ(ψ̃⁻¹(v))′] + κ(p̃)′ [Dψ̃(p̃)]⁻¹ ⪰ c I_ñ.    (25)

We now consider the implications of Proposition 31 in two special cases: in the multiclass case with the canonical link, and in the binary case with the identity link. Recall that the canonical link ψ_λ is chosen so that ψ̃(p̃) = −D L̃(p̃)′. This simplifies κ(p̃) to the identity matrix I_ñ, so Dκ(p̃) = 0. In this case the above proposition reduces to the following corollary.

Corollary 32 If ℓ = λ ∘ ψ⁻¹ is defined so that ψ̃ = −D L̃′, then each map v ↦ ℓᵢ(v) is c-strongly convex if and only if (−H L̃(p̃))⁻¹ ⪰ c I_ñ, or equivalently −H L̃(p̃) ⪯ (1/c) I_ñ.

An immediate consequence of this result is obtained by observing that the definiteness constraint is always met when c = 0, since L̃ is always a concave function. Thus, using a canonical link guarantees that a proper composite loss is convex.
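To see concretely how a canonical link convexifies, consider a binary proper loss that is not convex under the identity link. The weight w(p̃) = (p̃(1−p̃))^{−3/2} below is our illustrative choice (reminiscent of the exponential-loss weight); by (23)–(24) with ψ̃ the identity, the partial losses have second derivatives w − (1−p̃)w′ and w + p̃w′, the latter of which goes negative near p̃ = 0, whereas under the canonical link (ψ̃′ = w, so κ ≡ 1) both second derivatives equal 1/w > 0. A short sketch:

```python
import numpy as np

# A proper binary loss specified by its weight w(p) = (p(1-p))^{-3/2} (our choice).
# Identity link: curvatures of the partial losses from (23)-(24) with psi = id.
p = np.linspace(0.02, 0.98, 49)
w  = (p * (1 - p)) ** -1.5
wp = -1.5 * (p * (1 - p)) ** -2.5 * (1 - 2 * p)   # w'(p)
lam1pp = w - (1 - p) * wp                         # second derivative of partial loss 1
lam2pp = w + p * wp                               # second derivative of partial loss 2
print((lam2pp < 0).any())   # True: the proper loss itself is NOT convex
# Canonical link (psi' = w): both curvatures become 1/w, as in Corollary 32.
print((1 / w > 0).all())    # True: the canonically linked composite IS convex
```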
There is an upper definiteness condition, analogous to that for strong convexity, that has implications for rates of convergence in numerical optimisation. Boyd and Vandenberghe (2004, §9.1.2) show that if a twice differentiable function f : X → ℝ satisfies

m I ⪯ H f(x) ⪯ M I

for all x ∈ X ⊆ ℝⁿ, then the value M/m is an upper bound on the condition number of H f, that is, the ratio of maximum to minimum eigenvalue of H f. This value measures the eccentricity of the sublevel sets of f and controls the rate at which optima of f are approached. Applying this result to the Hessian of a composite loss ℓ with a canonical link shows that the condition number bound is controlled by the Hessian of the Bayes risk of ℓ. Specifically, if the condition number is to be no more than M/m then

(1/M) I_ñ ⪯ −H L̃(p̃) ⪯ (1/m) I_ñ

for all p̃. In the case that M = m, so the condition number is 1, the only Hessian that satisfies these conditions is −H L̃(p̃) = I_ñ (up to scaling), which is easily shown to be the Bayes risk Hessian of squared loss. Thus squared loss is the only canonical composite loss for which a condition number of 1 is possible.

In the binary case, when n = 2, (23) and (24) and the positivity of ψ̃′ simplify (25) to the two conditions

−(1 − p̃) κ′(p̃) + κ(p̃) ≥ c ψ̃′(p̃)
p̃ κ′(p̃) + κ(p̃) ≥ c ψ̃′(p̃).

Further assuming that ψ̃ is the identity link (ψ̃(v) = v) and letting w(p̃) := −L̃″(p̃) gives

w′(p̃) ≤ (1/(1 − p̃)) (w(p̃) − c)
w′(p̃) ≥ −(1/p̃) (w(p̃) − c)

that is,

−1/p̃ ≤ w′(p̃)/(w(p̃) − c) ≤ 1/(1 − p̃),  p̃ ∈ (0,1).    (26)

The last equivalence is achieved by dividing through by w(p̃) − c, which must necessarily be positive, since if it were not, the final pair of inequalities would imply −1/p̃ ≥ 1/(1 − p̃), a contradiction given that p̃ ∈ [0,1]. Note that (26) reduces to (Reid and Williamson, 2010, Corollary 26) for c = 0.

Observe that if g(p̃) := log(w(p̃) − c) then g′(p̃) = w′(p̃)/(w(p̃) − c) is the middle term in (26). This allows a simplification of the inequality. Specifically, if we assume w(½) = 1 then

−1/p̃ ≤ g′(p̃) ≤ 1/(1 − p̃),  p̃ ∈ (0,1)
⟹ ∫_{1/2}^{q} −(1/p̃) dp̃ ⋚ ∫_{1/2}^{q} g′(p̃) dp̃ ⋚ ∫_{1/2}^{q} (1/(1 − p̃)) dp̃,  q ∈ (0,1)
⟹ −log(q) − log(2) ⋚ g(q) − log(1 − c) ⋚ −log(2) − log(1 − q),  q ∈ (0,1)    (27)
⟹ 1/(2q) ⋚ e^{g(q)}/(1 − c) ⋚ 1/(2(1 − q)),  q ∈ (0,1)

(where ⋚ holds in the direction shown for q ≥ ½ and is reversed for q ≤ ½), which gives the following proposition purely in terms of w(p̃) rather than w(p̃) and its derivative.

Figure 7: Graph of w(p̃) = −L̃″(p̃) as a function of p̃ necessary for a suitably normalised binary proper loss to be strongly convex with modulus c ∈ {0, 1/5, 2/5, 3/5, 4/5, 1}. The regions R_c are nested by subsethood, so that R₀ ⊇ R_{1/5} ⊇ R_{2/5} ⊇ R_{3/5} ⊇ R_{4/5} ⊇ R₁, where R₁ is simply the dotted line (containing only the function w(c) ≡ 1, c ∈ [0,1], which is the weight function corresponding to squared loss). The palest shaded region corresponds to R₀, the allowable range of w necessary for the corresponding proper loss to be convex, and the darkest corresponds to R_{4/5}.

Proposition 33 Let w(p̃) = −H L̃(p̃) = −L̃″(p̃) and assume w(½) = 1. A proper binary loss ℓ : Δ² → ℝ²₊ is strongly convex with modulus c ∈ [0,1] only if

1/(2p̃) ⋚ (w(p̃) − c)/(1 − c) ⋚ 1/(2(1 − p̃)),  p̃ ∈ (0,1),    (28)

where ⋚ denotes ≤ for p̃ ≥ ½ and ≥ for p̃ ≤ ½.

When c = 0 (corresponding to ℓ being convex) this is equivalent to an expression of Reid and Williamson (2010, Equation 31), where it was incorrectly claimed that this condition was also sufficient. Inequation (28) is illustrated in Figure 7.

The above proposition only gives a necessary condition for strong convexity. (In addition to w belonging to the specified region, w′(p̃) also needs to be suitably controlled.) A sufficient condition is useful for designing strongly convex proper losses. Observe that if

w(p̃) = exp( ∫_{1/2}^{p̃} u(t) dt + K ) + c,

where u : [0,1] → ℝ and K, c ∈ ℝ, then (∂/∂p̃) log(w(p̃) − c) = u(p̃). We require w(½) = 1, and so exp( ∫_{1/2}^{1/2} u(t) dt + K ) + c = 1, hence e^K = 1 − c and

w(p̃) = (1 − c) exp( ∫_{1/2}^{p̃} u(t) dt ) + c    (29)

satisfies (26) if

−1/p̃ ≤ u(p̃) ≤ 1/(1 − p̃),  p̃ ∈ (0,1),    (30)

and hence the loss with weight function w is strongly convex with modulus c. Thus by choosing u to satisfy (30) and constructing w via (29) one can design strongly convex proper binary losses, as the following sketch illustrates.
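A minimal sketch of this design recipe (ours): with u(p̃) = 2p̃ − 1, which satisfies (30), the integral in (29) is available in closed form and yields w(p̃) = (1−c) exp((p̃−½)²) + c; the code checks (26) numerically for c = ½.

```python
import numpy as np

# Design a 0.5-strongly-convex proper binary loss via (29)-(30):
# u(p) = 2p - 1 satisfies -1/p <= u(p) <= 1/(1-p) on (0,1).
c = 0.5
p = np.linspace(0.01, 0.99, 99)
w  = (1 - c) * np.exp((p - 0.5) ** 2) + c                 # eq. (29), closed form
wp = (1 - c) * np.exp((p - 0.5) ** 2) * 2 * (p - 0.5)     # w'(p)
ratio = wp / (w - c)                                      # equals u(p) by construction
print(np.all(-1 / p <= ratio) and np.all(ratio <= 1 / (1 - p)))   # (26) holds: True
print(np.isclose((1 - c) * np.exp(0.0) + c, 1.0))                 # w(1/2) = 1: True
```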
One can ask whether (25) can be simplified in the n > 2 case by using a matrix version of the logarithmic derivative trick, in a manner similar to that used above when n = 2. Such a result does exist (Horn and Johnson, 1991, Section 6.6.19), but it requires that (H L̃(p̃))⁻¹ and D(H L̃(p̃)) commute for all p̃ ∈ Δ̃ⁿ, which is not generally the case.

7. Integral Representations of Proper Losses

Binary proper losses have an attractive integral representation that provides substantial insight and is a useful tool both for designing losses and for understanding the implications of different choices of loss. Specifically, there exists a family of extremal loss functions (cost-weighted generalisations of the 0-1 loss) parametrised by c ∈ [0,1] and defined for all η̂ ∈ [0,1] by

ℓᶜ₋₁(η̂) := c ⟦η̂ ≥ c⟧ and ℓᶜ₁(η̂) := (1 − c) ⟦η̂ < c⟧.

As shown by Buja et al. (2005) and Reid and Williamson (2011), given these extremal functions, any proper binary loss ℓ can be expressed as the weighted integral

ℓ = ∫₀¹ ℓᶜ w(c) dc + constant

with weight function w(c) = −L̃″(c). This representation is a special case of a representation from Choquet theory (Phelps, 2001; Simon, 2011), which characterises when every point in some set can be expressed as a weighted combination of the extremal points of the set. Although there is such a representation when n > 2, the difficulty is that the set of extremal points is much larger, and this rules out the existence of a nice small set of primitive proper losses when n > 2, and consequently rules out an easy-to-work-with weight function parametrising all possible multiclass losses in a manner analogous to the binary case.
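The binary representation is easy to exercise numerically. With w(c) = 1/(c(1−c)) (the log-loss weight) the weighted combination of the cost-weighted losses recovers log loss, since ∫_η^1 dc/c = −log η and ∫_0^η dc/(1−c) = −log(1−η). A crude Riemann-sum sketch, ours, assuming NumPy:

```python
import numpy as np

# Recover log loss from its weight function via the integral representation:
#   int_0^1 (1-c) [eta <  c] w(c) dc = -log(eta)
#   int_0^1   c   [eta >= c] w(c) dc = -log(1 - eta)
eta = 0.3
cs = np.linspace(1e-6, 1 - 1e-6, 200_001)
w = 1 / (cs * (1 - cs))
dc = cs[1] - cs[0]
loss1  = np.sum((1 - cs) * (eta <  cs) * w) * dc   # partial loss for y = 1
loss_1 = np.sum(cs       * (eta >= cs) * w) * dc   # partial loss for y = -1
print(loss1,  -np.log(eta))       # both ~ 1.204
print(loss_1, -np.log(1 - eta))   # both ~ 0.357
```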
The rest of this section makes this statement precise.

A convex cone K is a set of points closed under positive linear combinations; that is, K = αK + βK for any α, β ≥ 0. A point f ∈ K is extremal if f = ½(g + h) for g, h ∈ K implies that there exists α ∈ ℝ₊ such that g = α f. That is, f cannot be represented as a non-trivial combination of other points in K. The set of extremal points of K will be denoted ex K.

Suppose U is a bounded closed convex set in ℝᵈ, and K_b(U) is the set of convex functions on U bounded by 1; then K_b(U) is compact with respect to the topology of uniform convergence. Bronshtein (1978, Theorem 2.2) showed that the extremal points of the convex cone K(U) = {α f + β g : f, g ∈ K_b(U), α, β ≥ 0} are dense (w.r.t. the topology of uniform convergence) in K(U) when d > 1. This means that for any function f ∈ K(U) there is a sequence of functions (gᵢ)ᵢ such that gᵢ ∈ ex K(U) for all i and limᵢ ‖f − gᵢ‖_∞ = 0, where ‖f‖_∞ := sup_{u∈U} |f(u)|. We use this result to show that the set of extremal Bayes risks is dense in the set of Bayes risks when n > 2.

In order to simplify our analysis, we restrict attention to fair proper losses. A loss is fair if each partial loss is zero on its corresponding vertex of the simplex (ℓᵢ(eᵢ) = 0, i ∈ [n]). A proper loss is fair if and only if its Bayes risk is zero at each vertex of the simplex (in this case the Bayes risk is also called fair). One does not lose generality by studying fair proper losses, since any proper loss is the sum of a fair proper loss and a constant vector. The set of fair proper losses defined on Δⁿ forms a closed convex cone, denoted L_n. The set of concave functions which are zero on all the vertices of the simplex Δⁿ is denoted F_n and is also a closed convex cone.

Proposition 34 Suppose n > 2. Then for any fair proper loss ℓ ∈ L_n there exists a sequence (ℓᵢ)ᵢ of extremal fair proper losses (ℓᵢ ∈ ex L_n) which converges almost everywhere to ℓ.

The implication of this proposition is that the set of extremal multiclass proper losses is very large. Some intuition can be gleaned from Figure 8, from which it is apparent that there is a qualitative difference between the complexity of the set of extremal concave functions in one dimension (corresponding to n = 2) and in higher dimensions (n > 2).

The proof of Proposition 34 requires the following lemma, which relies upon the correspondence between a proper loss and its Bayes risk (Proposition 8) and the fact that two continuous functions equal almost everywhere are equal everywhere.

Lemma 35 If ℓ ∈ ex L_n then its corresponding Bayes risk L is extremal in F_n. Conversely, if L ∈ ex F_n then all the proper losses ℓ with Bayes risk equal to L are extremal in L_n.

Proof We suppose that ℓ ∈ ex L_n and denote its Bayes risk by L(p) = p′ℓ(p). Let F, G ∈ F_n be such that L = ½(F + G). Suppose f and g are proper losses whose Bayes risks are respectively equal to F and G; then for all p ∈ Δⁿ, and almost everywhere in q (more precisely, where L, F and G are differentiable), L(p,q) = ½(G(p,q) + F(p,q)). Then ℓ = ½(g + f) almost everywhere, so there exists α such that g = αℓ almost everywhere, hence G = αL almost everywhere, and then everywhere by continuity. So L is extremal in F_n.

Now suppose that the concave function L is extremal, and let ℓ be a proper loss whose Bayes risk is L. Then L(p,q) = p′ℓ(q) = L(q) + (p − q)′A(q), where A(q) ∈ ∂L(q). Suppose that there exist f, g ∈ L_n such that ℓ = ½(f + g) almost everywhere, with associated Bayes risks F and G respectively. Then L(p) = p′ℓ(p) = p′ · ½(f(p) + g(p)) = ½(F + G) almost everywhere, so L = ½(F + G) everywhere by continuity. Since L is extremal we must have F = αL, and so f = αℓ where L is differentiable (and so almost everywhere). Thus ℓ is extremal in L_n. ∎

Figure 8: Complexity of extremal concave functions in two dimensions (corresponding to n = 3). The figure shows the graph of an extremal concave function in two dimensions. The lines below indicate where the slope changes. The pattern of these lines can be made arbitrarily complex. This illustrates the fact (Proposition 34) that the set of extremal concave functions is very large.

We also need a correspondence between the uniform convergence of a sequence of Bayes risk functions and the convergence of their associated proper losses.

Lemma 36 Suppose L, Lᵢ ∈ F_n for i ∈ ℕ, and suppose ℓ and ℓᵢ, i ∈ ℕ, are associated proper losses. Then (Lᵢ)ᵢ converges uniformly to L if and only if (ℓᵢ)ᵢ converges almost everywhere to ℓ.

Proof We require two facts from convex analysis; confer (Hiriart-Urruty and Lemaréchal, 2001, Theorems B.3.1.4 and D.6.2.7). If a sequence (fᵢ)ᵢ of convex functions converges pointwise to f, then: 1) the sequence converges uniformly on any compact domain; and 2) for all ε > 0, ∂fᵢ(x) ⊆ ∂f(x) + B(0,ε) for i large enough. The reverse implication of the lemma is then a direct consequence of the first result, and the forward implication is obtained by considering the set {x : for all i ∈ ℕ, Lᵢ and L are differentiable at x}, which is of measure 1. ∎

Proof (Proposition 34) When n > 2 the simplex Δⁿ is isomorphic to a subset of ℝᵈ for d > 1. Since F_n is a convex cone associated with the set of bounded concave functions (i.e., the fair Bayes risks), Bronshtein (1978, Theorem 2.2) guarantees (with the appropriate change from concavity to convexity) that ex F_n is dense in F_n w.r.t. the topology of uniform convergence. Therefore, if ℓ ∈ L_n there exists a sequence (fᵢ)ᵢ with fᵢ ∈ ex F_n which converges uniformly to the Bayes risk L of ℓ, and so by Lemma 36 there is a corresponding sequence (ℓᵢ)ᵢ of fair proper losses that converges almost everywhere to ℓ. Lemma 35 guarantees that each ℓᵢ is extremal in L_n, since each fᵢ ∈ ex F_n, and so we have shown there exists a sequence (ℓᵢ)ᵢ with ℓᵢ ∈ ex L_n which converges to ℓ, which was arbitrary. ∎
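Before turning to the design tools of the next section, the Bayes-risk-to-loss correspondence that drives Lemma 35 (statement 4 of Proposition 7: ℓᵢ(q) = L(q) + (eᵢ − q)′A(q)) can be exercised directly. The sketch below, ours, applies it to the fair concave Bayes risk L(p) = 1 − Σᵢ pᵢ², which induces the (squared) Brier loss, and checks properness on random pairs:

```python
import numpy as np
rng = np.random.default_rng(0)

# From a fair concave Bayes risk to a proper loss: ell_i(q) = L(q) + (e_i - q)' DL(q)'.
def L(p):  return 1 - (p ** 2).sum()
def DL(p): return -2 * p
def ell(q):
    return np.array([L(q) + (np.eye(len(q))[i] - q) @ DL(q) for i in range(len(q))])

ok = True
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    ok &= p @ ell(p) <= p @ ell(q) + 1e-12    # properness: truth minimises risk
print(ok)   # True
```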
8. Tools for Designing Losses

In this section we show how the results developed above can be used to design losses for particular purposes. There is no question about how this should be done in principle (Berger, 1985, Section 2.4). And even when not made explicit at the outset, all inference ultimately has an implicit loss function that captures what matters to the end user — even if the original purpose was merely to "gather information", simply because in the end the information is acted upon (De Groot, 1962). Until now, the lack of convenient and canonical parametrisations of multiclass loss functions has made the comparison of different loss functions, and their tuning for specific applications, difficult.

In subsection 8.1, we show how to construct a parametric family of valid link functions from a finite number of base links by effectively taking their convex combination. By composing each link with a fixed proper loss, this immediately allows the specification of a family of losses with a fixed Bayes risk. This construction enables the creation of losses with a range of optimisation characteristics (e.g., convexity, robustness) but a common statistical basis (i.e., the same Bayes risk). In subsection 8.2 we show how it is possible to build losses from constraints on their Bayes risk curves on the edges of the simplex. This allows a loss to be constructed by effectively specifying its behaviour on pairs of outcomes. We show how this observation can be used to create piecewise linear, proper losses for cost-sensitive misclassification. Finally (subsection 8.3) we observe how link functions are in fact themselves very similar to loss functions, and (subsection 8.4) we present some examples of proper composite losses from the literature (where they were not expressed in the proper composite parametrisation).

8.1 Families of Losses with Fixed Bayes Risk

The theory developed above suggests that each choice of proper loss λ and link function ψ results in an overall loss function with properties (e.g., convexity) that depend entirely on their relationship to each other. Given these two "knobs" for parametrising a loss function, we can begin to ask what kind of practical trade-offs are involved when selecting a composite loss as a surrogate loss for a particular problem. We now propose a simple scheme for constructing families of losses with the same Bayes risk. This is achieved by fixing a choice of proper loss λ and creating a parametrised family (described below) of link functions ψ_α for parameters α ∈ A. Since the Bayes risk is entirely determined by λ, any composite loss λ ∘ ψ_α⁻¹ for α ∈ A will have Bayes risk L(p) = p′λ(p). Thus we are able to examine the effect different choices of composite loss can have on a problem, without changing the essential underlying problem.⁴

⁴ Of course, this argument only holds in a point-wise analysis, that is, where choices for estimates p̂(x) can be made independently. Once a restricted hypothesis class for the functions p̂ is introduced, the choice of link can matter.

In order to construct a parametric family of links we first choose a set of inverse link functions I = {ψ₁⁻¹, …, ψ_B⁻¹} with a common domain; that is, ψ_b⁻¹ : V → Δⁿ for a common n and V. This collection will be called the basis set of link functions. We then take the convex hull of I to form a set of inverse link functions Ψ⁻¹ = co(I). Each ψ⁻¹ ∈ Ψ⁻¹ is then identified with the unique α ∈ A = Δ^B such that Σ_{b=1}^{B} α_b ψ_b⁻¹ = ψ⁻¹. For this construction to be valid, it is necessary to show that every such ψ⁻¹ ∈ Ψ⁻¹ is indeed an inverse link function, that is, that it is invertible. The following proposition shows that it suffices to assume that all of the basis functions are strictly monotone (see Equation 2).

Proposition 37 Every function ψ⁻¹ in the set Ψ⁻¹ = co(I) is invertible whenever each basis function in I is strictly monotone.

This result is a consequence of: 1) strict monotonicity being preserved under convex combination; and 2) strict monotonicity implying invertibility. The first claim is established by considering strictly monotone f and g and some α ∈ [0,1], and noting that if h = α f + (1−α) g then

(h(u) − h(v))′(u − v) = α (f(u) − f(v))′(u − v) + (1−α) (g(u) − g(v))′(u − v) > 0.

A strictly monotone function f that is not invertible is impossible, since if (f(u) − f(v))′(u − v) > 0 for all u ≠ v, then a u ≠ v such that f(u) = f(v) would lead to a contradiction.

Strictly monotone basis functions are easily obtained via canonical links for strictly proper losses. By definition, a canonical link satisfies ψ = −D L′ for some Bayes risk function. Strict properness guarantees that L is strictly concave (van Erven et al., 2012b, Lemma 1). Kachurovskii's theorem (Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.1.4) states that the derivative of a function is (strictly) monotone if and only if the function is (strictly) convex. Since

(f(f⁻¹(u)) − f(f⁻¹(v)))′ (f⁻¹(u) − f⁻¹(v)) = (u − v)′ (f⁻¹(u) − f⁻¹(v)),

we see that strictly monotone functions have strictly monotone inverses, and we have established the following proposition.

Proposition 38 If λ is a strictly proper loss then its canonical link ψ_λ = −D L′ has a strictly monotone inverse.

This result means that a set of basis links can be defined via a choice of strictly concave Bayes risk functions. As an example, the class of Fisher-consistent margin losses proposed by Zou et al. (2008) provides a flexible starting point for designing sets of link functions as described above. They give explicit formulae for the inverse link for a composite loss defined by a choice of convex function φ : ℝ → ℝ. Specifically, if the loss for predicting v ∈ V = {v ∈ ℝⁿ : Σᵢ vᵢ = 0} when the observed class is j is given by ℓⱼ(v) = φ(vⱼ), then its inverse link is

ψ_φ⁻¹(v) = (1/Z_φ(v)) [φ′(vᵢ)]_{i=1}^{n},

where Z_φ(v) normalises the vector to lie in Δⁿ. Each choice of strictly convex φ gives a valid inverse link which can be used as a basis function.
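Proposition 37 is straightforward to check numerically. In the sketch below (our choice of basis) the basis inverse links are tempered softmaxes on sum-zero score vectors, each strictly monotone as the gradient of a strictly convex log-sum-exp; their convex combination is verified to be strictly monotone, hence invertible:

```python
import numpy as np
rng = np.random.default_rng(1)

def softmax_T(v, T):                      # tempered softmax: a strictly monotone
    z = np.exp(v / T)                     # inverse link on sum-zero scores
    return z / z.sum()

def mix(v, a=0.3):                        # convex combination of two basis links
    return a * softmax_T(v, 1.0) + (1 - a) * softmax_T(v, 2.0)

ok = True
for _ in range(1000):
    u = rng.normal(size=3); u -= u.mean() # V = {v : sum_i v_i = 0}
    v = rng.normal(size=3); v -= v.mean()
    ok &= (mix(u) - mix(v)) @ (u - v) > 0 # strict monotonicity (Equation 2)
print(ok)   # True
```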
8.2 Piecewise Linear Multiclass Losses

We now build a family of conditional Bayes risks. Suppose we are given n(n−1)/2 concave functions {L_{i₁,i₂} : Δ² → ℝ}_{1≤i₁<i₂≤n} …

In the binary case, this corresponds to a parametric family of margin losses with margin function

φ_T(z) := T log( 1 + exp((1 − z)/T) )

and thus φ_T′(z) = −e^{(1−z)/T} / (1 + e^{(1−z)/T}), and one can check that g_T(v) := φ_T′(v)/φ_T′(−v) is strictly monotone continuous and φ_T is monotone for all T > 0; thus by Corollary 12 there is a strictly proper composite representation. Identifying ℓ₋₁(v) with φ_T(−v) and ℓ₁(v) with φ_T(v), we can find, for a given p, the v that minimises L(p,v) by solving

(∂/∂v) L(p,v) = −(1 − p) φ_T′(−v) + p φ_T′(v) = 0

and obtain (1 − p)/p = g_T(v). Solving this for v (using Maple) we find the link function component of the proper composite representation:

ψ_T(p) = T ln( (1/(2(1 − p))) ( 2p e^{1/T} − e^{1/T} + √(4e^{2/T}p² − 4p e^{2/T} − 4p² + e^{2/T} + 4p) ) ).

This is illustrated in Figure 9. Zou et al. (2008, Theorem 1) effectively compute ψ_T⁻¹ for general n. One can determine the proper component as

L_T(p) = p φ_T(ψ_T(p)) + (1 − p) φ_T(−ψ_T(p)),

which is plotted in Figure 10.

Figure 10: Illustration of the conditional Bayes risk corresponding to the binary coherence loss for T ∈ [0.02, 4], with blue corresponding to T = 0.02 and red to T = 4.

One can glean further insight by considering the corresponding weight function w_T(p) := −L_T″(p), which is plotted in Figure 11.

Figure 11: Weight function of the proper loss component of the proper composite representation of the binary coherence loss for T ∈ [0.02, 4], where blue corresponds to T = 0.02 and red to T = 4.

The weight function view makes it clear how the proper component of the loss approaches 0-1 loss as T → 0. (The weight function for 0-1 loss is w(c) = δ(c − ½); note that the convergence in Figure 11 is not uniform, given the behaviour at 0 and 1.) Observe too that, as well as the proper loss varying with T, the associated link also varies (in a complex way; see Figure 9). Thus one is varying not only the statistical properties of the loss (in a substantial way: when T is small the weight is centred near p = ½, whereas for large T a series expansion of w_T(p) shows that w_T(p) ≈ T(1/p + 1/(1 − p)) + 4) but also the parametrisation. An alternative to this class of losses would be to fix a link, and then vary the proper component. For a given model or hypothesis class F, this has the advantage that the effective hypothesis class {ψ⁻¹ ∘ f : f ∈ F} remains fixed, and only the (proper) loss varies. The (Λ,Ψ) parametrisation of section 8.3 offers a convenient way to do this.
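The coherence-loss link can also be recovered purely numerically, which provides a check on the closed form above: for each p one solves ∂L(p,v)/∂v = 0, i.e. (1−p)σ((1+v)/T) = p σ((1−v)/T) with σ the logistic function, by bracketing the unique root. A sketch, ours, assuming NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit        # numerically stable logistic sigma

T = 0.5

def psi(p):                            # link: p -> optimal prediction v
    f = lambda v: (1 - p) * expit((1 + v) / T) - p * expit((1 - v) / T)
    return brentq(f, -60, 60)          # f is increasing in v, so the root is unique

ps = np.linspace(0.05, 0.95, 19)
vs = np.array([psi(p) for p in ps])
print(np.all(np.diff(vs) > 0))         # True: the recovered link is strictly increasing
print(abs(psi(0.5)) < 1e-9)            # True: psi_T(1/2) = 0, by symmetry
```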
9. Conclusions

We have systematically studied multiclass composite losses. The results of the paper (summarised below) show that this is an attractive parametrisation of multiclass losses. If one desires that all predictions be strongly admissible, then there is nothing lost in using the proper composite representation. Since the link is only a reparametrisation, this means one still has the relationship between losses and divergences as described by García-García and Williamson (2012). The proper composite representation leads to a desirable separation of concerns, whereby the inferential properties of the loss (such as its mixability) are governed by the proper loss, and the convexity (necessary for numerical optimisation) is controlled by the link function. It thus seems to be the best way to parametrise loss functions.

The key technical contributions of the paper are as follows:
- the relationship between prediction calibration and classification calibration, showing that the latter can be seen as a pointwise version of the former (Section 3);
- a characterisation of multiclass proper losses in terms of their binary restrictions (Proposition 7);
- every (multiclass) proper loss is quasi-convex (Proposition 17);
- a characterisation of which binary margin losses have a proper composite representation (Corollary 12);
- a characterisation of when a multiclass loss has a proper composite representation and when the representation is unique (Section 5.3);
- the relationship between the proper composite representation, mixability and admissibility (Sections 6.1 and 6.2);
- necessary conditions for strong convexity of multiclass proper losses in terms of their corresponding Bayes risks (Proposition 31);
- canonical links always convexify proper losses, and an outline of how this can help in the design of losses (Corollary 32);
- the attractive (simply parametrised) integral representation for binary proper losses can not be extended to the multiclass case (Section 7).

These results suggest that, in order to design losses for multiclass prediction problems, it is helpful to use the composite representation and to design the proper part via the Bayes risk, as suggested for the binary case by Buja et al. (2005). The link function can then be tuned to control the optimisation properties of the loss. Merely requiring the loss to be convex confounds two separate aspects of a loss: the shape of ℓ(V), which controls the predictive performance, and the parametrisation of ℓ(V), which affects the numerical optimisation of the loss.

There remain open questions. Perhaps the most practically important is the interaction between the loss and restricted hypothesis classes: typically one does not optimise conditionally; one optimises the full expected risk with respect to a restricted function class F ⊆ Y^X. The question of how knowledge of F should influence the design of a loss remains open; some initial work along these lines is the notion of stochastic mixability (van Erven et al., 2012a, 2015).

Acknowledgments

The research reported here was performed while Elodie Vernet was a student at ENS Cachan, visiting ANU and NICTA, and was supported by the Australian Research Council and NICTA, which was funded by the Australian Government through the ICT Centre of Excellence program. An earlier version of some of these results appeared in NIPS 2011 and ICML 2012. The work benefited from discussions with Jake Abernethy, Tim van Erven, Rafael Frongillo, and Darío García-García, and from the comments of the referees, whom we thank.
Thanks also to Harish Guruprasad for identifying a flaw in an earlier proof of the quasi-convexity of proper losses.

Appendix A. Matrix Calculus

If A = [a_{ij}] is an n × m matrix, vec A is the vector of columns of A stacked on top of each other. The Kronecker product of an m × n matrix A with a p × q matrix B is the mp × nq matrix

A ⊗ B = ( A_{1,1}B ⋯ A_{1,n}B ; ⋮ ⋱ ⋮ ; A_{m,1}B ⋯ A_{m,n}B ).

We use the following properties of Kronecker products (Magnus and Neudecker, 1999, Chapter 2): (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) for all appropriately sized A, B, C, D, and I₁ ⊗ A = A.

If f : ℝⁿ → ℝᵐ is differentiable at c, then the partial derivative of fᵢ with respect to the jth coordinate at c is denoted Dⱼfᵢ(c). The m × n matrix of partial derivatives of f is the Jacobian of f, denoted (Df(c))_{i,j} := Dⱼfᵢ(c) for i ∈ [m], j ∈ [n]. If F is a matrix-valued function, then DF(X) := Df(vec X), where f is defined by f(vec X) = vec F(X). We will require the product rule for matrix-valued functions (Vetter, 1970; Fackler, 2005): suppose f : ℝⁿ → ℝ^{m×p} and g : ℝⁿ → ℝ^{p×q}, so that (f·g) : ℝⁿ → ℝ^{m×q}; then

D(f·g)(x) = (g(x)′ ⊗ I_m) Df(x) + (I_q ⊗ f(x)) Dg(x).    (31)

The Hessian at x ∈ X ⊆ ℝⁿ of a real-valued function f : X → ℝ is the n × n real, symmetric matrix of second derivatives at x, (Hf(x))_{j,k} := D_{k,j}f(x) = ∂²f/∂x_k∂x_j. Note that the derivative D_{k,j} is in row j, column k. It is easy to establish that the Jacobian of the transpose of the Jacobian of f is the Hessian of f; that is,

Hf(x) = D((Df(x))′)    (32)

(Magnus and Neudecker, 1999, Chapter 10). If X ⊆ ℝⁿ and f : X → ℝᵐ is a vector-valued function, then the Hessian of f at x ∈ X is the mn × n matrix consisting of the Hessians of the functions fᵢ stacked vertically: Hf(x) = (Hf₁(x) ; ⋯ ; Hf_m(x)).
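The product rule (31) is easy to validate numerically; the sketch below, ours and assuming NumPy, compares both sides on random smooth matrix-valued functions, with Jacobians of vec F taken by central differences (column-major flattening, matching the vec convention above):

```python
import numpy as np
rng = np.random.default_rng(2)

n, m, p, q = 3, 2, 4, 2
A = rng.normal(size=(m, p, n))
B = rng.normal(size=(p, q, n))
f = lambda x: np.sin(A @ x)            # f: R^n -> R^{m x p}
g = lambda x: np.cos(B @ x)            # g: R^n -> R^{p x q}

def jac(F, x, h=1e-6):                 # Jacobian of vec F by central differences
    cols = []
    for k in range(len(x)):
        e = np.zeros(len(x)); e[k] = h
        cols.append((F(x + e) - F(x - e)).flatten(order='F') / (2 * h))
    return np.column_stack(cols)

x = rng.normal(size=n)
lhs = jac(lambda y: f(y) @ g(y), x)
rhs = (np.kron(g(x).T, np.eye(m)) @ jac(f, x)
       + np.kron(np.eye(q), f(x)) @ jac(g, x))      # right-hand side of (31)
print(np.allclose(lhs, rhs, atol=1e-5))             # True
```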
Appendix B. Deferred Proofs

This appendix contains proofs of results in the main text that, due to their length or technicality, are better presented outside the flow of the main text.

B.1 Proof of Lemma 1

1. We prove this by contradiction. Suppose p ∈ Δⁿ is such that for all i ∈ [n], p ∉ Tᵢ(c). Then

p ∉ T_{j₁}(c) ⟹ ∃ j₂ ≠ j₁ such that p_{j₁}/c_{j₁} < p_{j₂}/c_{j₂}
p ∉ T_{j₂}(c) ⟹ ∃ j₃ ≠ j₂ such that p_{j₂}/c_{j₂} < p_{j₃}/c_{j₃}

and hence, by repeating this argument,

p ∉ T_{jₙ}(c) ⟹ ∃ j_{n+1} ≠ jₙ such that p_{jₙ}/c_{jₙ} < p_{j_{n+1}}/c_{j_{n+1}}.

Thus we have n+1 indices j₁, …, j_{n+1} belonging to [n], and therefore one is repeated (j_k say), whence p_{j_k}/c_{j_k} < p_{j_k}/c_{j_k}, which is a contradiction.

2. Obvious.

3. If p ∈ ⋂_{i=1}^{n} Tᵢ(c), then for all j ∈ [n], c_j = Σᵢ pᵢc_j ≥ Σᵢ p_jcᵢ = p_j; summing over j shows equality throughout. Thus p = c.

4. We prove this by contradiction. Suppose p ≠ q are such that for all c, p ∈ Tᵢ(c) implies q ∈ Tᵢ(c). Observe that for all j ∈ [n], p ∈ Tⱼ(p), and so q ∈ ⋂_{j=1}^{n} Tⱼ(p), and hence q = p by part 3 — a contradiction.

B.2 Proof of Proposition 9

Observe that

∂L(p) = {(s′, 0)′ + α1 : s ∈ ∂L̃(p̃), α ∈ ℝ}.    (33)

Indeed, (q̃ − p̃)′s = (q − p)′((s′, 0)′ + α1).

(⇐) We first assume that L̃ is differentiable at p̃. We use the following result (Hiriart-Urruty and Lemaréchal, 2001, page 203): if f is a convex function, then for all ε > 0 there exists δ > 0 such that y ∈ B(x,δ) implies ∂f(y) ⊆ ∂f(x) + B(0,ε). Let ε > 0. Since L̃ is differentiable at p̃, there exists δ̃ > 0 such that

q̃ ∈ B(p̃, δ̃) ⟹ ∀A(q̃) ∈ ∂L̃(q̃), ‖A(q̃) − DL̃(p̃)‖ ≤ ε.    (34)

Then there exists δ such that q ∈ B(p,δ) implies q̃ ∈ B(p̃, δ̃). Thus, using (3) and (34), for all i ∈ [n], q ∈ B(p,δ) and α₁, α₂ ∈ ℝ,

ℓᵢ(q) − ℓᵢ(p) = L(q) + (eᵢ − q)′((A(q̃)′, 0)′ + α₁1) − ( L(p) + (eᵢ − p)′((DL̃(p̃)′, 0)′ + α₂1) )
= L(q) − L(p) + (ẽᵢ − q̃)′A(q̃) − (ẽᵢ − p̃)′DL̃(p̃) + γ,  where A(q̃) ∈ ∂L̃(q̃), and
γ = α₁(eᵢ − q)′1 − α₂(eᵢ − p)′1 = α₁(1 − q′1) − α₂(1 − p′1) = 0,

so

ℓᵢ(q) − ℓᵢ(p) = L(q) − L(p) + (ẽᵢ − q̃)′(A(q̃) − DL̃(p̃)) + (p̃ − q̃)′DL̃(p̃).

By continuity of L, |L(q) − L(p)| < ε for small enough δ. Furthermore, by (34), ‖A(q̃) − DL̃(p̃)‖ ≤ ε, and ‖p̃ − q̃‖ ≤ δ. Hence |ℓᵢ(q) − ℓᵢ(p)| ≤ ε + ε + δ (up to bounded factors), which can be made arbitrarily small by suitable choice of ε. Thus ℓᵢ is continuous for all i ∈ [n], and so ℓ is continuous.

(⇒) Assume that L̃ is not differentiable at p̃ ∈ Δ̃ⁿ°. Then there exist two different supergradients at p̃: A(p̃) and B(p̃). Assume that one of these supergradients, A(p̃), is the one associated with the loss ℓ, in the sense that for all i ∈ [n], ℓᵢ(p) = L(p) + (ẽᵢ − p̃)′A(p̃). Suppose that

∀i ∈ [n], (eᵢ − p)′((A(p̃)′, 0)′ + α₁1) ≤ (eᵢ − p)′((B(p̃)′, 0)′ + α₂1), ∀α₁, α₂ ∈ ℝ.    (35)

Then

Σᵢ qᵢ (eᵢ − p)′((A(p̃)′, 0)′ + α₁1) ≤ Σᵢ qᵢ (eᵢ − p)′((B(p̃)′, 0)′ + α₂1), ∀q ∈ Δⁿ, α₁, α₂ ∈ ℝ
⟹ (q − p)′((A(p̃)′, 0)′ + α₁1) ≤ (q − p)′((B(p̃)′, 0)′ + α₂1), ∀q ∈ Δⁿ, α₁, α₂ ∈ ℝ
⟹ (q̃ − p̃)′A(p̃) ≤ (q̃ − p̃)′B(p̃), ∀q ∈ Δⁿ.    (36)

Since p̃ ∈ Δ̃ⁿ° we can choose q₁ and q₂ ∈ Δⁿ such that q̃₁ − p̃ = −(q̃₂ − p̃), and so the only way (36) can hold is if (q̃ − p̃)′A(p̃) = (q̃ − p̃)′B(p̃). Since q is arbitrary, we obtain A(p̃) = B(p̃), a contradiction; so (35) must be false. Thus there exists i ∈ [n] such that

(eᵢ − p)′((A(p̃)′, 0)′ + α₁1) > (eᵢ − p)′((B(p̃)′, 0)′ + α₂1), ∀α₁, α₂ ∈ ℝ.

Thus there exists i ∈ [n] with

(ẽᵢ − p̃)′A(p̃) > (ẽᵢ − p̃)′B(p̃).    (37)

Let p_η := p + η(eᵢ − p) and denote by C(p̃_η) the supergradient associated with ℓ at p̃_η (that is, ℓᵢ(p_η) = L(p_η) + (ẽᵢ − p̃_η)′C(p̃_η)). By the definition of the supergradient,

L(p_η) ≤ L(p) + (p̃_η − p̃)′B(p̃) and L(p) ≤ L(p_η) + (p̃ − p̃_η)′C(p̃_η).

Hence

L(p_η) ≤ L(p_η) + C(p̃_η)′(p̃ − p̃_η) + B(p̃)′(p̃_η − p̃)
⟹ C(p̃_η)′(p̃_η − p̃) ≤ B(p̃)′(p̃_η − p̃).

But by definition of p_η, p̃_η − p̃ = η(ẽᵢ − p̃). Thus for η > 0,

C(p̃_η)′(ẽᵢ − p̃) ≤ B(p̃)′(ẽᵢ − p̃).    (38)

Now ℓᵢ(p_η) = L(p_η) + (ẽᵢ − p̃_η)′C(p̃_η) = L(p_η) + (1 − η)(ẽᵢ − p̃)′C(p̃_η). Hence (38) implies ℓᵢ(p_η) ≤ L(p_η) + (1 − η)(ẽᵢ − p̃)′B(p̃). However lim_{η→0} p_η = p, and by continuity of L,

lim_{η→0} L(p_η) + (1 − η)(ẽᵢ − p̃)′B(p̃) = L(p) + (ẽᵢ − p̃)′B(p̃) < L(p) + (ẽᵢ − p̃)′A(p̃) = ℓᵢ(p)

by (37). Thus lim_{η→0} ℓᵢ(p_η) < ℓᵢ(p), so ℓᵢ is not continuous at p, and thus ℓ is not continuous at p. ∎

B.3 Proof of Proposition 7

The proof shows the equivalence of statements 1 and 2 and, separately, the equivalences of 1 and 3 and of 1 and 4.

1 ⟹ 2: Suppose ℓ is proper and p, q ∈ Δⁿ. Let L_{p,q} denote the conditional risk associated with ℓ_{p,q}. Then

L_{p,q}(η, η̂) = (ηq + (1−η)p)′ ℓ(p + η̂(q − p)) = L(p + η(q − p), p + η̂(q − p)) ≥ L(p + η(q − p), p + η(q − p)) = L_{p,q}(η, η).

Hence ℓ_{p,q} is proper.

Figure 12: Illustration of the proof of Proposition 7.

2 ⟹ 1: Suppose ℓ_{p,q} is proper for all p, q ∈ Δⁿ. Suppose p, q ∈ Δⁿ. Then there exist p̄ and q̄ ∈ Δⁿ such that p = p̄ + η(q̄ − p̄) and q = p̄ + η̂(q̄ − p̄), where η, η̂ ∈ [0,1] (the line passing through p and q cuts Δⁿ at p̄ and q̄; see Figure 12). Then

L(p, q) = L_{p̄,q̄}(η, η̂) ≥ L_{p̄,q̄}(η, η) = L(p, p).

Hence ℓ is proper.

One can easily prove that 3 ⟹ 1 by taking h₁ = 0. For 1 ⟹ 3 we use a result of Lambert (2010, Proposition 1), which tells us that a binary probability estimation loss ℓ_b is proper if and only if, whenever η ≤ η₁ ≤ η₂ or η ≥ η₁ ≥ η₂, L_b(η, η₁) ≤ L_b(η, η₂) (the assumptions on the statistic are checked in the binary case with the statistic function Γ : Δ² ∋ p ↦ E(p) ∈ [0,1]). We also know that if ℓ is proper then for all p, q ∈ Δⁿ, ℓ_{p,q} (introduced in Proposition 7) is proper. Assume ℓ is proper, p, q ∈ Δⁿ, and 0 ≤ h₁ ≤ h₂; introduce the projections p̄, q̄ ∈ Δⁿ of p and q; then there exist η and µ such that p = p̄ + η(q̄ − p̄) and q = p̄ + µ(q̄ − p̄). Denote η₁ = η + h₁(µ − η) and η₂ = η + h₂(µ − η). Then the result of Lambert applied to ℓ_{p̄,q̄} gives L(p, p + h₁(q − p)) ≤ L(p, p + h₂(q − p)). One can adapt the proof to the case of strict properness.

1 ⟹ 4: If ℓ is proper, p′ℓ(q) = q′ℓ(q) + (p − q)′ℓ(q) = L(q) + (p − q)′ℓ(q). Thus for all q ∈ Δⁿ there exists A(q) such that L(p, q) = L(q) + (p − q)′A(q). Since ℓ is proper, for all p ∈ Δⁿ, 0 ≤ L(p, q) − L(p, p) = L(q) − L(p) + (p − q)′A(q).
Then A(q) is a supergradient of f := L (which is concave) at q, and p′ℓ(q) = f(q) + (p − q)′A(q).

4 ⟹ 1: If there exists a concave function f and, for all q ∈ Δⁿ, a supergradient A(q) ∈ ∂f(q) such that for all p, q ∈ Δⁿ, p′ℓ(q) = f(q) + (p − q)′A(q), then

L(p, p) − L(p, q) = f(p) − f(q) − (p − q)′A(q) ≤ 0.

Hence ℓ is proper. ∎

B.4 Proof of Proposition 10

The proposition is a direct consequence of the characterisation of differentiable binary proper losses (Reid and Williamson, 2010): a differentiable binary loss λ is proper if and only if

−λ₁′(η)/(1 − η) = λ₋₁′(η)/η ≥ 0, η ∈ (0,1).    (39)

Suppose the loss ℓ can be expressed as a proper composite loss: ℓ = λ_ψ = λ ∘ ψ⁻¹, and so λ = ℓ ∘ ψ. Therefore, for y ∈ {−1, 1}, λ_y′(η) = ψ′(η) ℓ_y′(ψ(η)). Then λ is proper and thus satisfies (39), whence for all v ∈ V

−(1/(1 − ψ⁻¹(v))) ψ′(ψ⁻¹(v)) ℓ₁′(v) = (1/ψ⁻¹(v)) ψ′(ψ⁻¹(v)) ℓ₋₁′(v),

so that either ψ′(ψ⁻¹(v)) = 0, or ℓ₁′(v) = ℓ₋₁′(v) = 0, or

ψ⁻¹(v) = ℓ₋₁′(v) / (ℓ₋₁′(v) − ℓ₁′(v)),  v ∈ V.    (40)

Since ψ is differentiable and invertible, ψ′ cannot equal zero on an interval. By continuity, ψ⁻¹ is then uniquely determined by (40) on an interval I unless there is a subinterval [v₁, v₂] ⊆ I on which ℓ₁′(v) = ℓ₋₁′(v) = 0 for all v ∈ [v₁, v₂]. If I = V then ψ is unique, and thus λ = ℓ ∘ ψ is unique. If ℓ₁′(v) = ℓ₋₁′(v) = 0 for all v ∈ [v₁, v₂], then one can choose any ψ|_{[v₁,v₂]} which is differentiable, invertible and such that ψ⁻¹ is continuous at v₁ and v₂; and since ℓ₁ and ℓ₋₁ are constant on [v₁, v₂], λ(η) = ℓ(ψ(η)) does not depend on this choice, and so in any case λ is unique. ∎

B.5 Proof of Proposition 11

The loss λ is proper if and only if (39) holds with λ₁′(η) ≤ 0 and λ₋₁′(η) ≥ 0. This is equivalent to the existence of an invertible ψ such that (40) holds and

ψ′(ψ⁻¹(v)) ℓ₁′(v) ≤ 0 and ψ′(ψ⁻¹(v)) ℓ₋₁′(v) ≥ 0,  v ∈ V.    (41)

(⇒) Suppose ℓ has a composite representation with ψ strictly increasing, so that ψ′ ≥ 0, and thus ℓ₁′(v) ≤ 0 and ℓ₋₁′(v) ≥ 0. Hence ℓ₁ is decreasing and ℓ₋₁ is increasing. By hypothesis, ℓ₁′(v) ≠ 0 or ℓ₋₁′(v) ≠ 0. Furthermore, ψ′ can not equal zero except at isolated points. Thus (40) implies

ψ⁻¹(v) = ℓ₋₁′(v) / (ℓ₋₁′(v) − ℓ₁′(v)) = 1 / (1 − f(v)),

where f(v) := ℓ₁′(v)/ℓ₋₁′(v), and thus f is strictly increasing. (If instead ψ were strictly decreasing, we can run the same argument to conclude that ℓ₁ is increasing, ℓ₋₁ is decreasing and f is strictly decreasing.)

(⇐) Suppose ℓ₁ is decreasing, ℓ₋₁ is increasing and f is strictly increasing. By setting ψ⁻¹(v) = 1/(1 − f(v)), ψ⁻¹ is invertible and (41) holds. The other case is analogous. ∎
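The formula for ψ⁻¹ in the proof above can be exercised on a concrete margin loss. For the logistic loss ℓ₁(v) = log(1 + e^{−v}), ℓ₋₁(v) = log(1 + e^{v}), the recipe ψ⁻¹(v) = ℓ₋₁′(v)/(ℓ₋₁′(v) − ℓ₁′(v)) should return the logistic (sigmoid) function; a short check, ours:

```python
import numpy as np

l1  = lambda v: np.log1p(np.exp(-v))     # partial loss for y = +1
l_1 = lambda v: np.log1p(np.exp(v))      # partial loss for y = -1
d   = lambda f, v, h=1e-6: (f(v + h) - f(v - h)) / (2 * h)   # central difference

vs = np.linspace(-5, 5, 11)
inv_link = d(l_1, vs) / (d(l_1, vs) - d(l1, vs))             # formula from B.5
print(np.allclose(inv_link, 1 / (1 + np.exp(-vs)), atol=1e-8))   # True: the sigmoid
```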
B.6 Proof of Proposition 17

Fix an arbitrary p ∈ Δⁿ. The function f_p is quasi-convex if its α-sublevel sets F_p^α := {q ∈ Δⁿ : p′ℓ(q) ≤ α} are convex for all α ∈ ℝ (Greenberg and Pierskalla, 1971). Fix an arbitrary α > L(p), so that F_p^α ≠ ∅. Let Q_p^α := {x ∈ ℝⁿ : p′x ≤ α}, so F_p^α = {q ∈ Δⁿ : ℓ(q) ∈ Q_p^α}. Denote by h_q^β := {x : x′q = β} the hyperplane in direction q ∈ Δⁿ with offset β ∈ ℝ, and by H_q^β := {x : x′q ≥ β} the corresponding half-space. Since ℓ is proper, S_ℓ is supported at x = ℓ(q) by the hyperplane h_q^{L(q)}, and furthermore, since S_ℓ is convex, S_ℓ = ⋂_{q∈Δⁿ} H_q^{L(q)}. Let

V_p^α := ⋂_{x ∈ ℓ(Δⁿ)∩Q_p^α} H_{ℓ⁻¹(x)}^{L(ℓ⁻¹(x))} = ⋂_{q ∈ F_p^α} H_q^{L(q)}

(see Figure 13). Since V_p^α is the intersection of half-spaces it is convex. Note that a given half-space H_q^{L(q)} is supported by exactly one hyperplane, namely h_q^{L(q)}.

Figure 13: Illustration of the proof of quasi-convexity of continuous proper losses (see text).

Thus the set of hyperplanes that support V_p^α is {h_q^{L(q)} : q ∈ F_p^α}. If u ∈ F_p^α then there is a hyperplane in direction u that supports V_p^α, and its offset is given by

σ_{V_p^α}(u) := inf_{x∈V_p^α} u′x = L(u) > −∞,

whereas if u ∉ F_p^α then for all β ∈ ℝ, h_u^β does not support V_p^α and hence σ_{V_p^α}(u) = −∞. Thus we have shown that u ∉ F_p^α implies σ_{V_p^α}(u) = −∞. Observe that σ_{V_p^α}(u) = −s_{V_p^α}(−u), where s_C(u) := sup_{x∈C} u′x is the support function of a set C. It is known (Valentine, 1964, Theorem 5.1) that the domain of definition of a support function, {u ∈ ℝⁿ : s_C(u) < +∞}, for a convex set C is always convex. Thus

G_p^α := {u ∈ Δⁿ : σ_{V_p^α}(u) > −∞} = {u ∈ ℝⁿ : σ_{V_p^α}(u) > −∞} ∩ Δⁿ

is always convex, because it is the intersection of convex sets. Finally, by observing that

G_p^α = {q ∈ Δⁿ : ℓ(q) ∈ ℓ(Δⁿ) ∩ Q_p^α} = F_p^α,

we have shown that F_p^α is convex. Since p ∈ Δⁿ and α ∈ ℝ were arbitrary, we have thus shown that f_p is quasi-convex for all p ∈ Δⁿ.

For the converse, the convexity of S comes from the construction in the first part of this proof. All that needs to be shown is uniqueness. Suppose then, for the sake of a contradiction, that a given quasi-convex f_p corresponds to two distinct S₁, S₂. Then the corresponding support functions s_{S₁} ≠ s_{S₂}, and thus there exists u such that s_{S₁}(u) ≠ s_{S₂}(u). Let xᵢ be the point of support of Sᵢ by a hyperplane with normal u. By assumption x₁ ≠ x₂. Suppose x₁ and x₂ do not lie on a single hyperplane with normal p. Then there exists α such that h_p^α separates x₁ and x₂. Thus u ∈ dom s_{S₁∩Q_p^α} but u ∉ dom s_{S₂∩Q_p^α} (or vice versa). Since F_p^α = dom s_{S₁∩Q_p^α} and F_p^α = dom s_{S₂∩Q_p^α}, we have a contradiction. In case x₁, x₂ do lie on a single hyperplane h_p^α, one can argue (by the continuity of s_S) that there must exist a u′, a convex combination of u and p, which induces support points that can be separated as above. ∎
B.7 Proof of Proposition 18

Δⁿ-smooth ⟹ proper composite ℓ: Suppose ℓ(V) is Δⁿ-smooth. Pick some x ∈ ℓ(V), with x = ℓ(v) for some v ∈ V (not necessarily unique, because we have not assumed ℓ is invertible). By assumption, there exists a unique p ∈ Δⁿ such that h_p^β supports ℓ(V) at x for some β ∈ ℝ. Define ψ to be the function such that ψ(p) = v. Since the corresponding p is unique, ψ is invertible. Now

β = inf{b ∈ ℝ : h_p^b ∩ ℓ(V) ≠ ∅} = p′ℓ(v) = p′ℓ(ψ(p)) = p′λ(p),

where λ := ℓ ∘ ψ. We will show λ is proper. Observe that for each p ∈ Δⁿ

inf{b : h_p^b ∩ ℓ(V) ≠ ∅} = inf{p′ℓ(v) : h_p^{p′ℓ(v)} ∩ ℓ(V) ≠ ∅, v ∈ V}.

Thus p′λ(p) = p′(ℓ ∘ ψ)(p) = inf_{v∈V} p′ℓ(v). Since ψ : Δⁿ → V is invertible,

inf_{v∈V} p′ℓ(v) = inf_{v∈V} p′(λ ∘ ψ⁻¹)(v) = inf_{q∈Δⁿ} p′λ(q),

which we have shown above equals p′λ(p). Thus λ is proper and ℓ has a proper composite representation.

Proper composite ℓ ⟹ Δⁿ-smooth: Suppose ℓ = λ ∘ ψ⁻¹, where λ is proper. We need to show that for all x ∈ ℓ(V) there is a unique p ∈ Δⁿ such that ℓ(V) is supported by h_p^β at x for some β ∈ ℝ. Pick an arbitrary v ∈ V, which induces an arbitrary x = ℓ(v) ∈ ℓ(V). Let p = ψ⁻¹(v). Then h_p^{L(p)} supports ℓ(V) at x since λ is proper. Suppose there were another q ≠ p, q ∈ Δⁿ, such that h_q^{L(q)} supports ℓ(V) at x. But that would require v = ψ(q), which is impossible since v = ψ(p) and ψ is invertible. Thus p is unique, and hence ℓ(V) is Δⁿ-smooth.

Δⁿ-strict convexity ⟹ strictly proper composite ℓ: Let p ∈ Δⁿ. By invertibility of ℓ and Δⁿ-strict convexity of ℓ(V), there exists a unique v ∈ V such that there exists a hyperplane h_p^β supporting ℓ(V) at ℓ(v). Define ψ such that, for all p ∈ Δⁿ, ψ(p) is this unique v. Since h_p^β supports ℓ(V),

β = inf{b : h_p^b ∩ ℓ(V) ≠ ∅} = p′ℓ(v) = p′ℓ(ψ(p)).

By continuity of ℓ and Δⁿ-strict convexity of ℓ(V), ψ is continuous. Let λ := ℓ ∘ ψ. Observe that

inf{β : h_p^β ∩ ℓ(V) ≠ ∅} = inf{p′ℓ(v) : h_p^{p′ℓ(v)} ∩ ℓ(V) ≠ ∅, v ∈ V}.

Thus p′λ(p) = p′(ℓ ∘ ψ)(p) = inf_{v∈V} p′ℓ(v). By Δⁿ-smoothness of ℓ(V), for all v ∈ V there exists a unique p ∈ Δⁿ such that ψ(p) = v, and thus ψ is invertible. Hence p′λ(p) = inf_{q∈Δⁿ} p′λ(q), and thus λ is proper. Since ℓ(V) is Δⁿ-strictly convex there exists a unique point where h_p^{Λ(p)} supports ℓ(V). Hence λ is strictly proper, and we have shown that ℓ has a strictly proper composite representation.

Strictly proper composite ℓ ⟹ Δⁿ-strictly convex: Suppose ℓ has a strictly proper composite representation ℓ(v) = λ(ψ⁻¹(v)). Pick p ∈ Δⁿ. By assumption, there exists v ∈ V such that ψ⁻¹(v) = p. Since λ is strictly proper, there is a unique q ∈ Δⁿ which minimises q ↦ p′λ(q). By invertibility of ψ, there thus exists a unique v ∈ V that minimises v ↦ p′ℓ(v), and so there is a unique x at which h_p^β supports ℓ(V) for some β ∈ ℝ. Thus ℓ(V) is Δⁿ-strictly convex. Now pick an arbitrary v ∈ V, which induces an arbitrary x = ℓ(v) ∈ ℓ(V). Let p = ψ⁻¹(v). Then h_p^{L(p)} supports ℓ(V) at x since λ is proper. Suppose there were another q ≠ p, q ∈ Δⁿ, such that h_q^{L(q)} supports ℓ(V) at x. But that would require v = ψ(q), which is impossible since v = ψ(p) and ψ is invertible. Thus p is unique and ℓ(V) is Δⁿ-smooth. ∎

References

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. The Journal of Machine Learning Research, 1:113-141, 2001.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, March 2006.

Paul N. Bennett. Using asymmetric distributions to improve text classifier probability estimates. In Proceedings of SIGIR '03, pages 111-118, 2003.

James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1985.

Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Multiclass classification with filter trees. Preprint, June 2007. URL http://hunch.net/~jl/projects/reductions/mc_to_b/invertedTree.pdf.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Efim Mikhailovich Bronshtein. Extremal convex functions. Siberian Mathematical Journal, 19:6-12, 1978.

Lawrence D. Brown. A complete class theorem for statistical problems with finite sample spaces. The Annals of Statistics, 9(6):1289-1300, 1981.

Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005. URL http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf.

Hermann Chernoff and Lincoln E. Moses. Elementary Decision Theory. Dover, 1986.

Jesús Cid-Sueiro and Aníbal R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445-455, May 2001.

Ira Cohen and Moises Goldszmidt. Properties and benefits of calibrated classifiers.
Technical Report HPL-2004-22(R.1), HP Laboratories, Palo Alto, July 2004. URL http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.

A. Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77-93, March 2007.

Morris H. De Groot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404-419, 1962.

Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

Paul K. Fackler. Notes on matrix calculus. North Carolina State University, 2005. URL http://www4.ncsu.edu/~pfackler/MatCalc.pdf.

Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, 1967.

Darío García-García and Robert C. Williamson. Divergences and risks for multiclass experiments. In Conference on Learning Theory (JMLR: W&CP), volume 23, pages 28.1-28.20, 2012.

Gary F. V. Glonek. A class of regression models for multivariate categorical responses. Biometrika, 83(1):15-28, 1996.

Gary F. V. Glonek and Peter McCullagh. Multivariate logistic models. Journal of the Royal Statistical Society, Series B, 57(3):533-546, 1995.

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, March 2007.

Harvey J. Greenberg and William P. Pierskalla. A review of quasi-convex functions. Operations Research, 19(7):1553-1570, November 1971.

Peter D. Grünwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367-1433, 2004.

Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. The Annals of Mathematical Statistics, 26(2):451-471, 1998.

Simon I. Hill and Arnaud Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30:525-564, 2007.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer, Berlin, 2001.

Roger A. Horn and Charles A. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. Journal of Machine Learning Research, 7:85-115, 2006.

Ayodele Ighodaro, Thomas Santner, and Lawrence Brown. Admissibility and complete class results for the multinomial estimation problem with entropy and squared error loss. Journal of Multivariate Analysis, 12:469-479, 1982.

Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228-1244, 2008.

Parameswaran Kamalaruban, Robert C. Williamson, and Xinhua Zhang. Exp-concavity of proper composite losses. In JMLR Workshop and Conference Proceedings (Proceedings COLT 2015), volume 40, 2015.

Jack Carl Kiefer. Introduction to Statistical Inference. Springer-Verlag, New York, 1987.

Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. Technical report, Stanford University, March 2010. URL http://www.stanford.edu/~nlambert/lambert_elicitation.pdf.

Yufeng Liu. Fisher consistency of multicategory support vector machines.
In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pages 289-296, 2007.

Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics (revised edition). John Wiley & Sons, 1999.

Peter McCullagh and John A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.

Aditya K. Menon and Robert C. Williamson. Bipartite ranking: risk, optimality, and equivalences. Journal of Machine Learning Research, July 2014. URL http://users.cecs.anu.edu.au/~williams/papers/P194.pdf. Submitted.

Walter Meyer and David C. Kay. A convexity structure admits but one real linearization of dimension greater than one. Journal of the London Mathematical Society (2), 7:124-130, 1973.

Geert Molenberghs and Emmanuel Lesaffre. Marginal modelling of multivariate categorical data. Statistics in Medicine, 18:2237-2255, 1999.

Indraneel Mukherjee and Robert E. Schapire. A theory of multiclass boosting. The Journal of Machine Learning Research, 14(1):437-497, 2013.

Harikrishna Narasimhan and Shivani Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In Advances in Neural Information Processing Systems, pages 2913-2921, 2013.

Robert F. Nau. Should scoring rules be "effective"? Management Science, 31(5):527-535, May 1985.

Tapan K. Nayak and Dayanand N. Naik. Estimating multinomial cell probabilities under quadratic loss. Journal of the Royal Statistical Society, Series D (The Statistician), 38(1):3-10, 1989.

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and f-divergences. Annals of Statistics, 37:876-904, 2009.

Robert R. Phelps. Lectures on Choquet's Theorem, volume 1757 of Lecture Notes in Mathematics. Springer, 2nd edition, 2001.

Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.

Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731-817, March 2011.

Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson, and Nishant A. Mehta. Generalized mixability via entropic duality. JMLR: Workshop and Conference Proceedings (COLT 2015), 40:1-22, 2015.

R. Tyrrell Rockafellar and Roger J-B. Wets. Variational Analysis. Springer-Verlag, Berlin, 2004.

Raúl Santos-Rodríguez, Alicia Guerrero-Curieses, Rocío Alaiz-Rodríguez, and Jesús Cid-Sueiro. Cost-sensitive learning based on Bregman divergences. Machine Learning, 76:271-285, 2009.

Rolf Schneider. Convex Bodies: The Brunn-Minkowski Theory. Cambridge University Press, 1993.

Clayton Scott. Surrogate losses and regret bounds for cost-sensitive classification with example-dependent costs. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.

Qinfeng Shi, Mark Reid, and Tiberio Caetano. Conditional random fields and support vector machines: A hybrid approach. arXiv:1009.3346v1, September 2010. URL http://arxiv.org/PS_cache/arxiv/pdf/1009/1009.3346v1.pdf.

Barry Simon. Convexity: An Analytic Viewpoint. Cambridge University Press, 2011.

Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, 1958.

Ingo Steinwart. How to compare different loss functions. Constructive Approximation, 26:225-287, 2007.

Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.

Frederick A. Valentine. Convex Sets. McGraw-Hill, New York, 1964.

Tim van Erven, Mark D. Reid, and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. In Proceedings of the 24th Annual Conference on Learning Theory, 2011.

Tim van Erven, Peter Grünwald, Mark D. Reid, and Robert C. Williamson. Mixability in statistical learning. In Advances in Neural Information Processing Systems, pages 1691-1699, 2012a.

Tim van Erven, Mark D. Reid, and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. Journal of Machine Learning Research, 13:1639-1663, May 2012b.

Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793-1861, 2015.

William J. Vetter. Derivative operations on matrices. IEEE Transactions on Automatic Control, 15(2):241-244, April 1970.

Volodya Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 51-60. ACM, 1995.

Volodya Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. Journal of Machine Learning Research, 10:2445-2471, 2009.

Roger Webster. Convexity. Oxford University Press, 1994.

Robert C. Williamson. The geometry of losses. In Conference on Learning Theory (JMLR: W&CP), volume 35, pages 1078-1108, 2014.

Ting-Fan Wu and Ruby C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975-1005, 2004.

Tong Tong Wu and Kenneth Lange. Multicategory vertex discriminant analysis for high-dimensional data. The Annals of Applied Statistics, 4:1698-1721, 2010.

Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of SIGKDD, 2002.

Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.

Zhihua Zhang, Michael I. Jordan, Wu-Jun Li, and Dit-Yan Yeung. Coherence functions for multicategory margin-based classification methods. In Proceedings of the Twelfth Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multi-class AdaBoost. Statistics and its Interface, 2:349-360, 2009.

Hui Zou, Ji Zhu, and Trevor Hastie. The margin vector, admissible loss and multi-class margin-based classifiers. Preprint, 2005. URL http://www-stat.stanford.edu/~hastie/Papers/margin.pdf.

Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2(4):1290-1306, 2008.