# LEGENDRETRON: Uprising Proper Multiclass Loss Learning

Kevin H. Lam¹ Christian Walder²³ Spiridon Penev¹⁴ Richard Nock²³

**Abstract** Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as properness, which asserts that Bayes rule is optimal. Recent works have sought to learn losses and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0, 1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present LEGENDRETRON as a novel and practical method that jointly learns proper canonical losses and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a t-test at 99% significance on all datasets with greater than 10 classes.

¹School of Mathematics & Statistics, UNSW Sydney, Australia. ²Google Research. ³ANU College of Engineering, Computing and Cybernetics, The Australian National University, Australia. ⁴UNSW Data Science Hub (uDASH), UNSW Sydney, Australia. Correspondence to: Kevin Lam.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

Loss functions are a pillar of machine learning (ML). In supervised learning, a loss provides a measure of discrepancy between the underlying ground truth and a model's predictions. A learning algorithm attempts to minimise this discrepancy by adjusting the model. In other words, the loss governs how a model learns. A bad choice of loss cannot be compensated for by the other qualities of the learning pipeline: it means a poor model in the end. This brings forth the question: which loss is best for the problem at hand?

Statistical decision theory answers this by turning to admissible losses (Savage, 1971), also referred to as proper losses or proper scoring rules (Gneiting & Raftery, 2007). Proper losses are those for which the posterior expected loss is minimised when probability predictions coincide with the true underlying probabilities. That is, a proper loss is one that can induce probability estimates that are admissible or optimal. Proper losses have been extensively studied in Shuford et al. (1966); Grünwald & Dawid (2004); Reid & Williamson (2010); Williamson et al. (2016), with the latter two works extending losses to proper composite forms in binary and multiclass settings.

Only a handful of proper losses, such as the square and log losses, are commonly used in ML. This is not surprising: properness is an intensional property and does not provide any candidate function. While eliciting some members is possible, extending further requires tuning or adapting the loss as part of the ML task. There has been a recent surge of interest in doing so for supervised learning, including Mei & Moura (2018); Grabocka et al. (2019); Streeter (2019); Liu et al. (2020); Siahkamari et al. (2020); Sypherd et al. (2022a). However, no connections are made to properness in formulating the losses in these works.
On the other hand, several recent works have used properness to formulate losses, including Nock & Nielsen (2008); Nock & Menon (2020); Walder & Nock (2020); Sypherd et al. (2022b). Notably, the works of Nock & Menon (2020); Walder & Nock (2020) have proposed algorithms to learn both the link function and linear predictor of logistic regression models by considering both functions to be unknown but learnable, thereby extending Single Index Models (Härdle et al., 1993; Mei & Moura, 2018) and algorithms to learn them (Kakade et al., 2011). Despite the impressive progress in these works, no references have been made to proper losses for multiclass problems.

**Background** To approach multiclass problems in a principled manner, we generalise logistic regression as follows. For a given invertible and monotonic (see Definition A.1) link function $\psi$ that maps $[0, 1]$ to $\mathbb{R}$ and an input-label pair $(x, y)$ with $x \in \mathbb{R}^p$ and $y \in \{-1, 1\}$, logistic regression learns a model of the form $\Pr(Y = 1 \mid x) = \psi^{-1}(w^\top x + b)$ by fitting a coefficient vector $w \in \mathbb{R}^p$ and an intercept $b \in \mathbb{R}$. A class prediction is then formed as $\hat{y} = \arg\max_{y \in \{-1, 1\}} \Pr(Y = y \mid x)$. The crucial element of logistic regression lies in the invertible and monotonic link function that connects probabilities to predictors. Invertibility of the link allows one to identify a unique probability estimate to associate with the predictor. Monotonicity of the link enforces an order to class predictions as elements of $x$ either increase or decrease monotonically, so that the decision boundary between classes is unique. Loosely speaking, the generalisation of these ideas to multiclass problems with $C \geq 2$ classes is to form probability estimates by using a monotonic link function $\psi$ such that $\psi^{-1}(x) = (p_1, p_2, \dots, p_{C-1})$ with $\sum_{k=1}^{C-1} p_k \leq 1$.

**Motivation** In this work, our interest lies in learning proper losses for multiclass problems. Two observations highlight why this is beneficial: properness directly enforces the same ranking of classes as probabilities without building multiple models, and learned losses can provide better models for related domains. We first note that to approach a multiclass problem with $C > 2$ classes, one would typically pose the problem as multiple 1-vs-rest or 1-vs-1 component problems. Each component problem consists of positive and negative labels, where the former refers to a class of interest, and the latter refers to all other classes in 1-vs-rest or to a single other class of interest in 1-vs-1. An unfortunate consequence of the design of these reductions to binary problems is that they do not include the admissibility constraint that probability estimates should rank classes in the same way that true probabilities do. Without loss of generality to the 1-vs-1 approach, we observe this in the following theorem.

Theorem 1.1. Suppose we use the 1-vs-rest approach to estimate probabilities for a multiclass problem with $C > 2$ classes. Then we learn models of the form $\Pr(\tilde{Y}_k = 1 \mid x) = \psi_k^{-1}(w_k^\top x + b_k)$, where the binary label is $\tilde{y}_k = +1$ when $y = k$ and $\tilde{y}_k = -1$ otherwise, for $k = 1, \dots, C$. Probability estimates for any class $k$ are admissible if and only if $\psi_k^{-1}(w_k^\top x + b_k) > \psi_i^{-1}(w_i^\top x + b_i)$ for all $i \neq k$.

To avoid solving $C$ 1-vs-rest problems through constrained optimisation, we desire an approach that allows us to model multiclass probabilities simultaneously, while learning proper multiclass losses which can induce admissible probability estimates for all $C$ classes directly.
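To make the Background construction concrete, the following minimal sketch (our illustration, not the authors' code; the squashing map anticipates the softmax⁺ of Section 5, and the weights are random placeholders) maps a predictor in $\mathbb{R}^{C-1}$ to the first $C-1$ class probabilities, recovering the last as the remainder:

```python
import torch

C, p = 4, 3
W, b = torch.randn(C - 1, p), torch.zeros(C - 1)

def inv_link(z):
    # psi^{-1}: R^{C-1} -> projected simplex, via a softmax with an appended zero logit.
    return torch.softmax(torch.cat([z, torch.zeros(1)]), dim=0)[:-1]

x = torch.randn(p)
p_head = inv_link(W @ x + b)                              # (p_1, ..., p_{C-1})
p_full = torch.cat([p_head, (1 - p_head.sum()).unsqueeze(0)])
print(p_full, p_full.sum())                               # valid C-class distribution
```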
It has also been shown in the work of Nock & Menon (2020) that loss learning can provide better models for problems in domains related to the original domain where the loss was learned, compared to using uninformed losses such as cross-entropy or log loss. In general, linear models are known to be sensitive to training noise, with the notable result of Long & Servedio (2008) that such noise is sufficient to deteriorate any linear binary classification model to the point that it performs no better than an unbiased coin flip on the original noise-free domain. The presence of label noise in a dataset can be interpreted as a domain that relates to an original noise-free domain, up to some classification noise process. The ideas of loss learning and loss transfer from Nock & Menon (2020) can then be seen as a mechanism that allows us to both overcome training noise and learn accurate models.

We illustrate that learning proper multiclass losses can be done by modelling the canonical link function which connects probability estimates with a proper loss (see Definition 4.1 and remarks therein). In order to model canonical links flexibly, we form them as composite functions with a fixed component and a learnable component.

**Contributions** Our main contributions are as follows:

- We derive necessary and sufficient conditions for a composite function in $\mathbb{R}^{C-1}$ to be monotonic and the gradient of a twice-differentiable convex function;
- We derive sufficient conditions for a composite function in $\mathbb{R}^{C-1}$ to be monotonic and the gradient of a twice-differentiable strictly convex function;
- We present LEGENDRETRON as a novel and practical way of learning proper canonical losses and probabilities concurrently in the multiclass problem setting.

**Organisation** In Section 2, we review existing works which similarly aim to learn losses and models concurrently. In Section 3, we describe properness and proper canonical losses. In Section 4, we design multiclass canonical link functions through Legendre functions and the $(u, v)$-geometric structure, and provide conditions for composite functions to be monotonic and gradients of convex functions. We then describe our method, LEGENDRETRON, in detail within Section 5. Lastly, numerical comparisons are provided in Section 6 before concluding in Section 7.

## 2. Related Work

**TRON family of link-learning algorithms** The notion of searching for proper losses was first established in Nock & Nielsen (2008). The SLISOTRON algorithm was later presented in Kakade et al. (2011) as the first algorithm designed to learn a model of the form $\Pr(Y = 1 \mid x) = u(w^\top x)$ for binary problems; it learns the unknown link function $u : \mathbb{R} \to [0, 1]$, assumed to be 1-Lipschitz and non-decreasing, and the vector $w \in \mathbb{R}^p$ used to form the linear predictor $w^\top x$. The algorithm iterates between Lipschitz isotonic regression to estimate $u$ and gradient updates to estimate $w$. A notable and practical shortcoming of SLISOTRON is that the isotonic regression steps to update $u$ do not guarantee that $u$ maps to $[0, 1]$. The BREGMANTRON algorithm was later proposed in Nock & Menon (2020) to refine the SLISOTRON algorithm by addressing this and providing convergence guarantees. By utilising the connection between proper losses and their canonical link functions outlined in Section 4, the BREGMANTRON replaced the link function $u$ with the inverse canonical link $\psi^{-1}$, which guaranteed probability estimates to lie in $[0, 1]$.
**ISGP-LINKGISTIC algorithm** The idea of using the $(u, v)$-geometric structure in combination with Legendre functions to learn canonical link functions has recently been explored in the work of Walder & Nock (2020), whose ISGP-LINKGISTIC algorithm learns a model of the form $\Pr(Y = 1 \mid x) = (u \circ v^{-1})(w^\top x)$. By squaring and integrating a Gaussian Process (GP) to yield the Integrated Squared Gaussian Process (ISGP), monotonicity and invertibility of $v^{-1} : \mathbb{R} \to \mathbb{R}$ is guaranteed. The ISGP-LINKGISTIC algorithm exploits this property by choosing a fixed squashing function $u$ separate from the a priori ISGP-distributed $v^{-1}$. Inference is performed with a stochastic EM algorithm where the E-step fixes the linear predictor $w^\top x$ and applies a Laplace approximation to the latent GP to compute $\mathbb{E}_{q(v^{-1} \mid w)}[\log p(y \mid x, v^{-1})]$, and the M-step maximises this expectation with respect to $w$. The ISGP-LINKGISTIC algorithm takes a Bayesian approach to learning proper canonical losses jointly with a probability estimator by posterior sampling of inverse canonical links.

## 3. Definitions and Properties of Losses

In this section, we revisit the notion of proper losses to formulate proper canonical losses in the multiclass setting. We follow the definitions and notation of Williamson et al. (2016) and describe key properties therein for our discussion of composite multiclass losses.

Let $C \geq 2$ be the total number of classes. Our setting is multiclass probability estimation. Denote the $(C-1)$-dimensional probability simplex as
$$\Delta^{C-1} = \Big\{ p \in \mathbb{R}^C : \sum_{i=1}^{C} p_i = 1,\ p_i \geq 0\ \forall i \Big\}$$
and its relative interior as
$$\mathrm{ri}(\Delta^{C-1}) = \Big\{ p \in \mathbb{R}^C : \sum_{i=1}^{C} p_i = 1,\ p_i \in (0, 1)\ \forall i \Big\}.$$
Suppose we have a dataset $D$ of $N$ pairs $\{(x_n, y_n)\}_{n=1}^N$, where each $x_n \in \mathcal{X} = \mathbb{R}^p$ and $y_n \in \mathcal{Y} = \{1, \dots, C\}$ denote an input and a single label respectively. We aim to learn a function $h : \mathcal{X} \to \Delta^{C-1}$ such that $\hat{y}_n \in \arg\max_{c \in \{1, \dots, C\}} \hat{\Pr}(y_n = c \mid x_n)$ closely matches $y_n$.

Consider the label as a random variable $Y \sim \mathrm{Categorical}(p)$ with prior class probabilities $p \in \Delta^{C-1}$. We denote by $q \in \Delta^{C-1}$ the estimated probabilities in the following definitions. To assess the quality of probability estimates, a loss function can be defined generally as $\ell : \Delta^{C-1} \to \mathbb{R}_+^C$, $\ell(q) = (\ell_1(q), \dots, \ell_C(q))^\top$, where $\ell_i$ is the partial loss for predicting $q$ when $y = i$. For a given label $y$, we can return to scalar-valued losses by referring to the $y$-th partial loss $\ell_y$.

Definition 3.1 (Conditional Bayes Risk). The conditional risk associated with $\ell$ is defined as $L(p, q) = \mathbb{E}_{Y \sim \mathrm{Categorical}(p)}[\ell_Y(q)]$ for all $p, q \in \Delta^{C-1}$. The best achievable conditional risk associated with a loss is termed the conditional Bayes risk and is defined as
$$\underline{L} : \Delta^{C-1} \to \mathbb{R}_+, \quad \underline{L}(p) = \inf_{q \in \Delta^{C-1}} L(p, q) = \inf_{q \in \Delta^{C-1}} \mathbb{E}_{Y \sim \mathrm{Categorical}(p)}[\ell_Y(q)].$$
It is well known that $\underline{L}$ is concave.

Definition 3.2 (Proper Losses). A loss $\ell$ is proper if and only if $L$ is minimised when $q = p$. In other words, $\underline{L}(p) = L(p, p) \leq L(p, q)$ for all $p, q \in \Delta^{C-1}$. Losses where the inequality is strict whenever $p \neq q$ are termed strictly proper.

Remark. Properness is an essential property of losses, as optimising a model with respect to a proper loss guides the model's probability estimates towards true posterior class probabilities. Examples of proper losses include the 0-1, square, log, and Matsushita losses (Matusita, 1956).

To draw the connection between a proper loss and its conditional Bayes risk, we require definitions of subgradients and Bregman divergences. Subgradients are a generalisation of gradients and are particularly useful when analysing convex functions that may not be differentiable.
**Subgradients** For a convex set $S \subseteq \mathbb{R}^n$, the subdifferential of a convex function $f : S \to (-\infty, +\infty]$ at $x \in S$ is defined as
$$\partial f(x) = \{\phi \in \mathbb{R}^n : \langle \phi, y - x \rangle \leq f(y) - f(x),\ \forall y \in \mathbb{R}^n\},$$
where an element $\phi \in \partial f(x)$ is called a subgradient of $f$ at $x$. By convention, we define $\partial f(x) = \emptyset$ for all $x \notin S$. Moreover, $f$ is strictly convex if and only if
$$\partial f(x) = \{\phi \in \mathbb{R}^n : \langle \phi, y - x \rangle < f(y) - f(x),\ \forall y \neq x\}.$$

**Bregman divergence** For a convex set $S \subseteq \mathbb{R}^n$ and a continuously-differentiable and strictly convex function $f : S \to (-\infty, +\infty]$, the Bregman divergence with generator $f$ is defined for all $x, y \in S$ as
$$D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$$
The following result is a rewritten characterisation of proper losses through their "Bregman representation", and explicates the connection between a proper loss and its conditional Bayes risk.

Proposition 3.3 ((Williamson et al., 2016, Proposition 7)). Let $\ell : \Delta^{C-1} \to \mathbb{R}_+^C$ be a loss. Then $\ell$ is a (strictly) proper loss if and only if there exists a (strictly) convex function $f : \Delta^{C-1} \to \mathbb{R}$ such that for all $q \in \Delta^{C-1}$, there exists a subgradient $\phi \in \partial f(q)$ such that $L(p, q) = -(p - q)^\top \phi - f(q)$ for all $p \in \Delta^{C-1}$. Moreover, if $\underline{L}$ is differentiable on $\mathrm{ri}(\Delta^{C-1})$, then $L(p, q) = (p - q)^\top \ell(q) + \underline{L}(q)$, where $\ell$ is the unique proper loss associated with $\underline{L}$, with the property $\nabla \underline{L}(p) = \ell(p)$ for all $p \in \mathrm{ri}(\Delta^{C-1})$.

Remark 3.4. We note that the regret $L(p, q) - \underline{L}(p)$ is a Bregman divergence, namely $D_{-\underline{L}}(p, q)$, if and only if $\ell$ is strictly proper, due to the requirement of strict convexity of $f$.

In this work, we seek to learn strictly proper losses $\ell$ by exploiting the connection $\nabla \underline{L}(p) = \ell(p)$ described in Proposition 3.3. In Section 4, we extend this connection between probabilities and predictors in $\mathbb{R}^{C-1}$ through canonical link functions, and describe in detail how strictly proper losses can be learned through this extended connection.
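As a concrete instance of Remark 3.4, the following small sketch (our illustration, not the authors' code; a generic autograd helper) evaluates the Bregman divergence of the negative Bayes risk of the log loss, which recovers the KL divergence:

```python
import torch

def bregman(f, x, y):
    # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y> for a differentiable generator f.
    y = y.clone().requires_grad_(True)
    (grad_y,) = torch.autograd.grad(f(y), y)
    return f(x) - f(y) - grad_y @ (x - y)

# f = -Bayes risk of the log loss = negative Shannon entropy; its Bregman
# divergence is the KL divergence, i.e. the regret of the log loss.
f = lambda p: (p * p.log()).sum()
p = torch.tensor([0.7, 0.2, 0.1]); q = torch.tensor([0.5, 0.3, 0.2])
print(bregman(f, p, q))             # Bregman divergence D_{-L}(p, q)
print((p * (p / q).log()).sum())    # KL(p || q): same value
```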
## 4. Designing Multiclass Canonical Links

In this section, we provide definitions of canonical link functions, Legendre functions and the $(u, v)$-geometric structure. The latter two structures are essential for the design and learning of canonical link functions. We show that designing a canonical link amounts to designing a composite function that is the gradient of a twice-differentiable and convex function. To this end, we present our key theoretical contributions: conditions for composite functions to be gradients of convex functions.

**Composite Form** It is often desirable to link predictors with their probability estimates through an invertible link function $\psi : \Delta^{C-1} \to \mathbb{R}^C$. This allows one to uniquely identify probabilities while working with general predictors. It also allows one to define loss functions more generally as $\ell_\psi = \ell \circ \psi^{-1}$, which are referred to as proper composite losses when $\ell$ is proper. Williamson et al. (2016, Proposition 13) show that a proper composite loss $\ell_\psi$ is uniquely represented by $\ell$ and $\psi$ when $\ell_\psi$ is continuous and invertible.

**Proper Canonical Form** As elements of $\Delta^{C-1}$ are uniquely determined by their first $C-1$ components, the above properties can be more naturally described on the projected probability simplex $\tilde{\Delta}^{C-1}$. Define the projection map $\tilde{\Pi} : \Delta^{C-1} \to \tilde{\Delta}^{C-1}$, $\tilde{\Pi}(p) = (p_1, \dots, p_{C-1})$ for all $p \in \Delta^{C-1}$, and its inverse $\tilde{\Pi}^{-1} : \tilde{\Delta}^{C-1} \to \Delta^{C-1}$, $\tilde{\Pi}^{-1}(\tilde{p}) = \big(\tilde{p}_1, \dots, \tilde{p}_{C-1}, 1 - \sum_{k=1}^{C-1} \tilde{p}_k\big)$ for all $\tilde{p} \in \tilde{\Delta}^{C-1}$.

Definition 4.1. The projected conditional Bayes risk is defined as $\underline{\tilde{L}} = \underline{L} \circ \tilde{\Pi}^{-1}$. Suppose $\underline{\tilde{L}}$ is differentiable. Then the canonical link function is defined as $\tilde{\psi} : \tilde{\Delta}^{C-1} \to \mathbb{R}^{C-1}$, $\tilde{\psi}(\tilde{p}) = -\nabla \underline{\tilde{L}}(\tilde{p})$.

Williamson et al. (2016, Corollary 32) show that given a proper loss $\ell$, the function $\ell \circ \tilde{\Pi}^{-1} \circ \tilde{\psi}^{-1}$ has components which are convex with respect to the input domain. We refer to such losses as proper canonical losses to distinguish them from proper composite losses. The connection between a differentiable conditional Bayes risk, a proper loss, and a canonical link, shown by Proposition 3.3 and Definition 4.1, is given by
$$\ell = \nabla \underline{L} = \big((\nabla \underline{\tilde{L}} \circ \tilde{\Pi})^\top J_{\tilde{\Pi}}\big)^\top = -\big((\tilde{\psi} \circ \tilde{\Pi})^\top J_{\tilde{\Pi}}\big)^\top,$$
where $J_{\tilde{\Pi}}$ is the Jacobian of $\tilde{\Pi}$. This illustrates that one can learn proper canonical losses by modelling either the conditional Bayes risk or its associated canonical link.

**Properties of Legendre functions** Let $f : \mathbb{R}^{C-1} \to \mathbb{R}$ be continuously differentiable and strictly convex. We refer to $f$ as a Legendre function. The Legendre-Fenchel conjugate of $f$, denoted $f^*$, is defined on $S^* = \{\nabla f(x) : x \in \mathbb{R}^{C-1}\}$ as
$$f^*(x^*) = \big\langle (\nabla f)^{-1}(x^*),\, x^* \big\rangle - f\big((\nabla f)^{-1}(x^*)\big),$$
and $f$ is Legendre if and only if $f^*$ is Legendre. Rockafellar (1970, Theorem 26.5) shows that when the latter holds, $(f^*)^* = f$, and $\nabla f$ is continuous and invertible with $\nabla f^* = (\nabla f)^{-1}$. Moreover, if $f$ is twice-differentiable with positive definite Hessian everywhere, then the inverse function theorem yields that $f^*$ is twice-differentiable, since $\nabla^2 f^*(\nabla f(x)) = (\nabla^2 f(x))^{-1}$.

**$(u, v)$-geometric structure** Amari (2016); Nock et al. (2016); Walder & Nock (2020) state that a general dually flat structure on $\mathbb{R}^{C-1}$ can be defined in terms of an arbitrary strictly convex function $\xi$. Let $u$ and $v$ be differentiable invertible functions. The pair $(u, v)$ gives a dually flat structure on $\mathbb{R}^{C-1}$ if and only if $\nabla \xi = u \circ v^{-1}$. We consider the $(u, v)$-geometric structure of the Bregman divergence $D_{(-\underline{\tilde{L}})^*}$, which gives $\tilde{\psi}^{-1} = u \circ v^{-1}$.

Figure 1. Relationship between predictors and probability estimates through the inverse of the canonical link function under the $(u, v)$-geometric structure: predictors in $\mathbb{R}^{C-1}$ are mapped by $v^{-1}$ to transformed predictors in $\mathbb{R}^{C-1}$, and then by $u$ to probability estimates in $\tilde{\Delta}^{C-1}$.

**Designing links** In this work, we focus on the case when $\underline{\tilde{L}}$ is twice-differentiable. Note that $-\underline{\tilde{L}}$ is convex since $\tilde{\Pi}^{-1}$ is affine and $-\underline{L}$ is convex. Properties of Legendre functions allow us to move from $-\underline{\tilde{L}}$ to its Legendre-Fenchel conjugate $(-\underline{\tilde{L}})^*$, and similarly allow us to move from the canonical link $\tilde{\psi}$ to its inverse $\tilde{\psi}^{-1}$. The $(u, v)$-geometric structure then allows us to flexibly learn $\tilde{\psi}^{-1}$ by splitting it into a learnable component $v^{-1}$ and a fixed component $u$. Fixing $u$ to be a suitable squashing function ensures that $\tilde{\psi}^{-1}$ maps to $\tilde{\Delta}^{C-1}$, thereby allowing us to uniquely identify multiclass probabilities associated with predictors from $\mathbb{R}^{C-1}$. On the other hand, $v^{-1}$ can be parameterised by an invertible neural network, which allows $\tilde{\psi}^{-1}$ to adapt to the multiclass problem at hand. Legendre functions and the $(u, v)$-geometric structure together yield a more natural and practical design of the canonical link through its inverse, since it is often much easier to map inputs from an unbounded space such as $\mathbb{R}^{C-1}$ to a bounded space such as $\tilde{\Delta}^{C-1}$. Figure 1 illustrates how the inverse of the canonical link is modelled using the $(u, v)$-geometric structure. Loosely speaking, $v^{-1}$ allows one to find better logit representations before they are squashed to probabilities.

Under the $(u, v)$-geometric structure, if one can prove that $u \circ v^{-1}$ maps to $\tilde{\Delta}^{C-1}$ and is the gradient of a Legendre function $f$, then one can set $(-\underline{\tilde{L}})^* = f$ and $\nabla(-\underline{\tilde{L}})^* = u \circ v^{-1}$ as the corresponding inverse canonical link function by using properties of Legendre functions. This requires showing that $u \circ v^{-1}$ is the gradient of a twice-differentiable and strictly convex function. In the following two theorems, we provide conditions under which this assertion holds for general composite functions.
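Before stating these conditions, it helps to instantiate the machinery above in the familiar binary case $C = 2$ with the log loss (a standard example; the worked derivation below is our illustration, consistent with Definition 4.1 and anticipating the LogSumExp⁺ construction of Section 5):

```latex
% Binary case (C = 2) with the log loss, writing \tilde{p} = p_1:
\underline{\tilde{L}}(\tilde{p})
  = -\tilde{p}\log\tilde{p} - (1-\tilde{p})\log(1-\tilde{p}),
\qquad
\tilde{\psi}(\tilde{p})
  = -\underline{\tilde{L}}'(\tilde{p})
  = \log\frac{\tilde{p}}{1-\tilde{p}}.
% Inverting the link recovers the sigmoid, the gradient of the convex
% potential (-\underline{\tilde{L}})^{*}(z) = \log(1 + e^{z}):
\tilde{\psi}^{-1}(z) = \frac{e^{z}}{1+e^{z}} = \sigma(z)
  = \frac{\mathrm{d}}{\mathrm{d}z}\log(1+e^{z}).
```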
We defer the background, supporting theorems and proofs of the following results to Sections A, E and F within the Appendices.

Theorem 4.2. Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ and $g : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be differentiable. Then the following conditions are equivalent:

1. $f \circ g = \nabla F$, where $F$ is a twice-differentiable convex function.
2. The Jacobian $J_{f \circ g}(x)$ is symmetric for all $x \in \mathbb{R}^{C-1}$.
3. $J_{f \circ g}(x)$ is positive semi-definite for all $x \in \mathbb{R}^{C-1}$.
4. $f \circ g$ is monotone.

**Proof sketch of Theorem 4.2** To claim that a function $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ is the gradient of a convex function $g : \mathbb{R}^{C-1} \to \mathbb{R}$ requires $f$ to satisfy maximal cyclical monotonicity. This is a more abstract notion of monotonicity for domains in higher dimensions, and encompasses two notions of monotonicity, namely maximal monotonicity and cyclical monotonicity. It turns out that it is sufficient to consider monotonicity, as maximal monotonicity is automatically guaranteed because our domain is $\mathbb{R}^{C-1}$ and $f \circ g$ is differentiable and therefore continuous.

Theorem 4.2 characterises when a composite function is the gradient of a convex function. It also serves as a convenient and practical criterion to aid model design through a check of positive semi-definiteness of the Jacobian $J_{f \circ g}$. The implications of Theorem 4.2 are profound, as it allows us to derive the following sufficient conditions under which the composition of gradients of convex functions is the gradient of a Legendre function.

Theorem 4.3. Let $f : \mathbb{R}^{C-1} \to S$ and $g : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be differentiable, where $S \subseteq \mathbb{R}^{C-1}$, and suppose $J_f(x)$ and $J_g(x)$ are symmetric and positive definite for all $x \in \mathbb{R}^{C-1}$. Then $f \circ g$ is the gradient of a twice-differentiable Legendre function.

**Proof sketch of Theorem 4.3** Theorem 4.2 tells us it is sufficient to check for positive semi-definiteness of a composite function's Jacobian. Our proof involves a check that all eigenvalues of the Jacobian are positive. This asserts that the composite function is the gradient of a twice-differentiable and strictly convex function.

To use the $(u, v)$-geometric structure from Section 4 with Theorem 4.3, we can set $f = u$ and $g = v^{-1}$ within Theorem 4.3. This presents an additional requirement that the functions $f$ and $g$ are also invertible. In Section 5, we show how these requirements can be met with our proposed algorithm, LEGENDRETRON.
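Theorem 4.2's criterion lends itself to a direct numerical sanity check. The sketch below (our illustration, not from the paper) tests whether a map on $\mathbb{R}^{C-1}$ has a symmetric positive semi-definite Jacobian at random points — Jacobian symmetry is necessary for being a gradient field, and a PSD symmetric part is necessary for monotonicity. Softmax⁺ (the gradient of LogSumExp⁺, defined in Section 5) passes, while a rotation, which is monotone but not a gradient field, fails the symmetry check:

```python
import torch
from torch.autograd.functional import jacobian

def passes_gradient_criterion(fn, dim, n_samples=50, tol=1e-6):
    """Check Theorem 4.2's symmetry and PSD conditions at random points."""
    for _ in range(n_samples):
        J = jacobian(fn, torch.randn(dim))
        if not torch.allclose(J, J.T, atol=tol):
            return False  # asymmetric Jacobian: cannot be a gradient field
        if torch.linalg.eigvalsh(0.5 * (J + J.T)).min() < -tol:
            return False  # symmetric part not PSD: not monotone
    return True

def softmax_plus(z):
    # Gradient of LogSumExp+: append a zero logit, softmax, drop the last entry.
    return torch.softmax(torch.cat([z, torch.zeros(1)]), dim=0)[:-1]

# A rotation by 0.5 rad is monotone in R^2 but has an asymmetric Jacobian,
# so it is not the gradient of any convex function.
theta = torch.tensor(0.5)
R = torch.stack([torch.stack([theta.cos(), -theta.sin()]),
                 torch.stack([theta.sin(),  theta.cos()])])

print(passes_gradient_criterion(softmax_plus, dim=2))      # True
print(passes_gradient_criterion(lambda x: R @ x, dim=2))   # False
```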
## 5. Learning Proper Canonical Multiclass Losses: LEGENDRETRON

In this section, we present LEGENDRETRON, our main algorithmic contribution for learning proper canonical losses for multiclass probability estimation. With the theory of Legendre functions, the $(u, v)$-geometric structure and Theorem 4.3 in hand to support our approach, we now present LEGENDRETRON in detail, as an extension of generalised linear models and Single Index Models for multinomial logistic regression.

**Model** Given a dataset $D = \{(x_n, y_n)\}_{n=1}^N$, we have the classification model
$$y_n \mid x_n \sim \mathrm{Categorical}\big(\hat{p}(z_n)\big), \qquad z_n = W x_n + b,$$
where $W \in \mathbb{R}^{(C-1) \times p}$, $b \in \mathbb{R}^{C-1}$ and $\hat{p}(z_n) = (u \circ v^{-1})(z_n)$, with $u$ chosen as a squashing function that maps to $\tilde{\Delta}^{C-1}$ and $v^{-1} = \nabla g$ for a twice-differentiable and strictly convex function $g$. We leave the specification of a suitable squashing function $u$ as a modelling choice and provide a natural choice at the end of this section.

For any $B \in \mathbb{Z}_+$, let $g_1, g_2, \dots, g_B$ be fully input convex neural networks (FICNN) investigated in Amos et al. (2017). We set $v^{-1} = (\nabla g_1) \circ (\nabla g_2) \circ \dots \circ (\nabla g_B)$. For each $g_i$, we use the same architecture as Huang et al. (2021), which is described as
$$z_{i,1} = l_{i,1}^+(x), \qquad z_{i,k} = l_{i,k}(x) + l_{i,k}^+\big(s(z_{i,k-1})\big) \ \text{for } k = 2, \dots, M+1,$$
$$h_i(x) = s(z_{i,M+1}), \qquad g_i(x) = s(w_{i,0})\, h_i(x) + s(w_{i,1})\, \|x\|^2,$$
where we denote by $l_{i,k}^+$ a linear layer with positive weights, by $l_{i,k}$ a linear layer with unconstrained weights, $w_{i,0}, w_{i,1} \in \mathbb{R}$ are unconstrained parameters, and $s(x) = \log(1 + e^x)$ is the softplus function, applied elementwise on vector inputs. In particular, $l_{i,M+1}$ and $l_{i,M+1}^+$ are linear layers that map to $\mathbb{R}$, while for each $k = 1, \dots, M$, $l_{i,k}$ and $l_{i,k}^+$ are hidden layers that map to $\mathbb{R}^H$ for a chosen dimension size $H \in \mathbb{Z}_+$. With this setup, each $g_i$ is strongly convex (and therefore strictly convex) with an invertible gradient and positive definite Hessian for all $x \in \mathbb{R}^{C-1}$, due to the quadratic term within each $g_i$.

We now show that, when equipped with a suitable squashing function $u$, any function learned by LEGENDRETRON is a valid inverse canonical link function. We turn to a modified version of the LogSumExp function previously studied in Nielsen & Hadjeres (2018) and describe its main properties in the following theorem.

Theorem 5.1. Let $f(x) = \log\big(1 + \sum_{k=1}^{C-1} \exp(x_k)\big)$. The key properties of $f$ are:

- $f$ is strictly convex with invertible gradient $u : \mathbb{R}^{C-1} \to \tilde{\Delta}^{C-1}$, $u(x)_i = \frac{\exp(x_i)}{1 + \sum_{k=1}^{C-1} \exp(x_k)}$;
- the Hessian of $f$, given by $J_u(x)$, is positive definite for all $x \in \mathbb{R}^{C-1}$.

We refer to $f$ and $u$ as LogSumExp⁺ and softmax⁺ respectively. Let $v^{-1} : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be defined as $v^{-1} = (\nabla g_1) \circ (\nabla g_2) \circ \dots \circ (\nabla g_B)$, where $g_1, g_2, \dots, g_B$ are FICNNs. Then any function $u \circ v^{-1}$ learned by LEGENDRETRON is the gradient of a twice-differentiable Legendre function and is therefore the inverse of a canonical link function.

With this specification, any function $u \circ v^{-1}$ learned via LEGENDRETRON is the gradient of a twice-differentiable Legendre function which can serve as an inverse canonical link function. Moreover, we can deduce that any inverse canonical link, or gradient of a twice-differentiable Legendre function, can be approximated by the architecture of $u \circ v^{-1}$ defined in Theorem 5.1.

Corollary 5.2. Let $u : \mathbb{R}^{C-1} \to \tilde{\Delta}^{C-1}$ and $v^{-1} : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be defined as in Theorem 5.1, with $v^{-1}$ parameterised by $\theta$. Define $\mathcal{C}(\Omega)$ as the set of twice-differentiable convex functions with positive definite Hessian everywhere on a compact set $\Omega \subset \mathbb{R}^{C-1}$, and $\mathcal{F} = \{f : \mathbb{R}^{C-1} \to \tilde{\Delta}^{C-1} \mid \exists\, \theta \text{ such that } u \circ v^{-1} = f\}$. Then $\mathcal{F}$ is dense in the set of gradients of functions in $\mathcal{C}(\Omega)$ that map to $\tilde{\Delta}^{C-1}$.
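The FICNN block just described translates directly into code. The following sketch (our illustration under the stated architecture; it mirrors, but is not, the authors' or Huang et al.'s implementation) builds one block $g_i$, with positivity of the $l^+$ weights enforced by a softplus reparameterisation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pos_linear(layer, x):
    # Linear layer with positive weights via a softplus reparameterisation.
    return x @ F.softplus(layer.weight).T + layer.bias

class FICNN(nn.Module):
    """One block g_i: strongly convex in x thanks to the ||x||^2 term."""
    def __init__(self, dim, hidden, M):
        super().__init__()
        # Unconstrained skip connections l_{i,k}(x), k = 2, ..., M+1.
        self.skip = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(M - 1)]
                                  + [nn.Linear(dim, 1)])
        # Positive-weight layers l+_{i,k}, k = 1, ..., M+1.
        self.pos = nn.ModuleList([nn.Linear(dim, hidden)]
                                 + [nn.Linear(hidden, hidden) for _ in range(M - 1)]
                                 + [nn.Linear(hidden, 1)])
        self.w0 = nn.Parameter(torch.zeros(()))
        self.w1 = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        z = pos_linear(self.pos[0], x)                        # z_{i,1}
        for skip, pos in zip(self.skip, self.pos[1:]):
            z = skip(x) + pos_linear(pos, F.softplus(z))      # z_{i,k}
        h = F.softplus(z)                                     # h_i(x)
        quad = (x ** 2).sum(-1, keepdim=True)
        return F.softplus(self.w0) * h + F.softplus(self.w1) * quad  # g_i(x)

# The gradient used inside v^{-1} is obtained by automatic differentiation:
# grad = torch.autograd.grad(g(x).sum(), x, create_graph=True)[0]
```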
Algorithm 1 describes LEGENDRETRON in detail.

**Algorithm 1** LEGENDRETRON
Input: sample $S \sim D$, number of iterations $T$, number of FICNNs $B$, hidden layer dimension size $H$, number of layers $M$, squashing function $u$.
Initialise $W$ and $b$. Initialise $g_1, g_2, \dots, g_B$, each with $M$ layers of dimension size $H$, and denote their joint set of parameters $\theta$.
for $i = 1$ to $T$ do
  Set $v^{-1} = (\nabla g_1) \circ (\nabla g_2) \circ \dots \circ (\nabla g_B)$.
  for each $(x_n, y_n) \in S$ do
    Compute $z_n = W x_n + b$.
    Compute $\hat{p}(z_n) = (u \circ v^{-1})(z_n)$.
  end for
  Compute $\mathbb{E}_S[L(\hat{p}(z), y)]$ by Monte Carlo, where $L$ is the log-likelihood of the Categorical distribution.
  Update $W$, $b$ and $\theta$ by backpropagation.
end for
Output: $W$, $b$ and $g_1, g_2, \dots, g_B$.

We conclude this section with remarks on our model design, and connections between LEGENDRETRON, multinomial logistic regression and the log loss.

Remark 5.3. The basis of our design comes from requiring the determinant of $J_{u \circ v^{-1}}(x)$ to satisfy $|J_{u \circ v^{-1}}(x)| = |J_u(v^{-1}(x))| \, |J_{v^{-1}}(x)| > 0$ for all $x \in \mathbb{R}^{C-1}$. A direct way to guarantee this is by setting $u$ and $v^{-1}$ to be functions with known positive definite Hessians. To the best of our knowledge, only the CP-Flow architecture (Huang et al., 2021) satisfies this property among invertible networks in the normalising flows literature, making it our choice for $v^{-1}$. While it is possible to set other functions as $u$, it is generally difficult to elicit invertible functions that squash inputs to $\tilde{\Delta}^{C-1}$ aside from variants of softmax. We have set $u = \mathrm{softmax}^+$ as it simplifies to the well-known sigmoid when $C = 2$, which was used as the analogous squashing function in ISGP (Walder & Nock, 2020).

Remark 5.4. As LogSumExp⁺ is twice-differentiable and Legendre, its gradient softmax⁺ is a valid inverse canonical link function since it maps to $\tilde{\Delta}^{C-1}$. However, we note that setting $\tilde{\psi}^{-1} = \mathrm{softmax}^+$ results in learning only the parameters $W$ and $b$, which is equivalent to formulating multinomial logistic regression as a generalised linear model with link
$$\tilde{\psi}(\tilde{p}) = \bigg( \log \frac{\tilde{p}_i}{1 - \sum_{k=1}^{C-1} \tilde{p}_k} \bigg)_{1 \leq i \leq C-1}$$
and corresponding proper loss $\ell = -\big((\tilde{\psi} \circ \tilde{\Pi})^\top J_{\tilde{\Pi}}\big)^\top$, which can be shown to be the log loss (or cross-entropy). That is, LEGENDRETRON with $v^{-1}$ as the identity map is equivalent to multinomial logistic regression, or to logistic regression when $C = 2$. In this case, Algorithm 1 would only optimise the parameters of a linear model, without loss learning, by using the log loss. Comparisons between LEGENDRETRON and multinomial logistic regression or logistic regression therefore illustrate the performance difference between learning a proper loss for the dataset and optimising with respect to the log loss (cross-entropy). These results are provided in Tables 1 and 3, and Figure 2.
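A minimal end-to-end sketch of Algorithm 1 follows (our illustration, assuming the hypothetical `FICNN` block sketched above, zero-indexed integer labels, and full-batch updates; for brevity it clamps probabilities rather than using the log-scale trick of Appendix I.2):

```python
import torch

def softmax_plus(z):
    # u = softmax+: append a zero logit, softmax, drop the last coordinate.
    zeros = torch.zeros(z.shape[0], 1)
    return torch.softmax(torch.cat([z, zeros], dim=-1), dim=-1)[:, :-1]

def v_inverse(z, ficnns):
    # v^{-1} = (grad g_1) o ... o (grad g_B), via automatic differentiation.
    for g in ficnns:
        z = z if z.requires_grad else z.requires_grad_()
        (z,) = torch.autograd.grad(g(z).sum(), z, create_graph=True)
    return z

def train(X, y, C, B=2, H=4, M=4, T=100, lr=1e-2):
    W = torch.randn(C - 1, X.shape[1], requires_grad=True)
    b = torch.zeros(C - 1, requires_grad=True)
    ficnns = [FICNN(C - 1, H, M) for _ in range(B)]
    params = [W, b] + [q for g in ficnns for q in g.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(T):
        opt.zero_grad()
        z = X @ W.T + b                                    # z_n = W x_n + b
        p_head = softmax_plus(v_inverse(z, ficnns))        # first C-1 probabilities
        probs = torch.cat([p_head, 1 - p_head.sum(-1, keepdim=True)], dim=-1)
        nll = -probs.clamp_min(1e-12).log()[torch.arange(len(y)), y].mean()
        nll.backward()                                     # update W, b and theta
        opt.step()
    return W, b, ficnns
```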
## 6. Experiments

In this section, we provide numerical comparisons between LEGENDRETRON, multinomial logistic regression and other existing methods that also aim to jointly learn models and proper canonical losses. For our experiments, we set softmax⁺ as the squashing function $u$ for both LEGENDRETRON and multinomial logistic regression. For a practical and numerically stable implementation, we also map probability estimates to the log scale by deriving an alternate LogSumExp trick for softmax⁺. We defer the full experimental details to Appendix I.

All experiments were performed using PyTorch (Paszke et al., 2019) and took roughly one CPU month to complete. (The total run time for our experiments is favourable relative to the reported two CPU months for the ISGP-LINKGISTIC algorithm from Walder & Nock (2020).) CPU run times for the aloi dataset, which had the largest number of classes (1,000), were respectively 4 hours and 0.75 hours for LEGENDRETRON and multinomial logistic regression. We note that the difference in run times for this experiment is in part due to the larger number of epochs (360), the larger number of blocks $B$, and the autograd and backpropagation operations required to update $v^{-1}$ for a much larger number of classes. Average GPU run times on a P100 for the MNIST experiments in Table 2 were 2.32 and 2.12 hours for VGGTRON and VGG respectively. These run times demonstrate the relative efficiency and applicability of loss learning for most datasets.

Table 1. Test AUC for generalised linear models with various link methods (ordered by decreasing average). See text for details.

| Method | MNIST | FMNIST |
|---|---|---|
| LEGENDRETRON | 99.9% | 99.2% |
| ISGP-Linkgistic | 99.9% | 99.2% |
| GP-Linkgistic | 99.9% | 99.1% |
| Logistic regression | 99.9% | 98.5% |
| GLMTron | 99.6% | 98.1% |
| BREGMANTRON | 99.7% | 97.9% |
| BREGMANTRON (label) | 99.6% | 97.7% |
| BREGMANTRON (approx) | 99.3% | 94.6% |
| SLISOTRON | 94.6% | 90.7% |

**MNIST Binary Problems** Binary problems are a special case of our setting where $C = 2$, so LEGENDRETRON is readily applicable. In Table 1, we compare LEGENDRETRON against ISGP-LINKGISTIC (Walder & Nock, 2020) and BREGMANTRON (Nock & Menon, 2020), as both algorithms also aim to learn proper canonical losses for binary problems. We also compare with other baselines from these two works, including the SLISOTRON algorithm of Kakade et al. (2011). Experiment details can be found in Section 6 of Nock & Menon (2020). Our model successfully matches the (binary-specific) ISGP-LINKGISTIC baseline, which was the strongest algorithm in test AUC performance in the experiments of Walder & Nock (2020).

**MNIST Multiclass Problems using Linear Models** For the three MNIST-like datasets (LeCun et al., 2010; Xiao et al., 2017; Clanuwat et al., 2018), we compared LEGENDRETRON against multinomial logistic regression and ISGP-LINKGISTIC, since the latter is the strongest algorithm in ten-class classification test accuracy based on the experiments within Walder & Nock (2020). ISGP-LINKGISTIC approaches the multiclass problem by learning proper canonical losses for the 10 component 1-vs-rest problems. Our experimental results in Figure 2 show that LEGENDRETRON and multinomial logistic regression outperform the ISGP-LINKGISTIC baseline on all three datasets. These results illustrate our conjecture that properness with respect to losses and models of component problems in a multiclass setting does not imply optimality of class predictions or probability estimates. By respecting the true problem structure, proper multiclass losses allow the model to learn probability estimates that better distinguish between all the classes at hand. Our results also show that LEGENDRETRON either matches or outperforms multinomial logistic regression on all three datasets. This is most notable on the Kuzushiji-MNIST dataset, where LEGENDRETRON outperforms multinomial logistic regression by a reasonable margin.

Table 2. Test classification accuracies of VGGTRON and VGG for the MNIST, Fashion-MNIST and Kuzushiji-MNIST datasets. See text for details.

| Method | MNIST | FMNIST | KMNIST |
|---|---|---|---|
| VGGTRON | 99.59% | 92.88% | 98.26% |
| VGG | 99.40% | 92.80% | 98.12% |

**MNIST Multiclass Problems using Nonlinear Models** We note that the architecture of $u \circ v^{-1}$ does not restrict us to linear models. In Table 2, we provide experimental results on how loss learning can improve non-linear models. Specifically, we replace the linear model components of LEGENDRETRON and multinomial logistic regression with a VGG-5 architecture, which we refer to as VGGTRON and VGG respectively in Table 2. Our results show that learning proper losses can improve the performance of non-linear models. A more comprehensive survey of how loss learning can improve model performance for classification tasks is an avenue for future work, given the variety of architectures and datasets.

**Other Multiclass Problems and Label Noise** We also compared LEGENDRETRON against multinomial logistic regression on 15 datasets that are publicly available from the LIBSVM library (Chang & Lin, 2011), the UCI machine learning repository (Asuncion & Newman, 2007; Dua & Graff, 2017), and the Statlog project (King et al., 1995). We note that we did not compare our proposed method with other multiclass classification methods, such as the kernel methods explored in Zien & Ong (2007) and Li et al. (2018), as these methods are centred on the task of classification, whereas our focus is on jointly learning multiclass probabilities and proper canonical losses through the canonical link function.
To assess robustness against label noise, we also compare the classification accuracy of LEGENDRETRON and multinomial logistic regression where labels in the training set are corrupted with probability $\eta$. That is, for any true label $y_n$, we instead train our models on the potentially corrupted label
$$\tilde{y}_n = \begin{cases} y_n & \text{with probability } 1 - \eta, \\ c & \text{with probability } \eta, \end{cases} \qquad \text{where } c \in \mathcal{Y} \setminus \{y_n\}.$$
We applied symmetric label noise in our experiments, which is the case where the probability of $\tilde{y}_n = c$ for each $c \in \mathcal{Y} \setminus \{y_n\}$ is $\frac{\eta}{C-1}$. We run both LEGENDRETRON and multinomial logistic regression on each dataset 20 times, where each run randomly splits the dataset into 80% training and 20% testing sets.

Our results in Table 3 show that LEGENDRETRON outperforms multinomial logistic regression under a t-test at 99% significance for most datasets and label noise settings. The performance of LEGENDRETRON is on par with multinomial logistic regression on the svmguide2, wine and iris datasets. Multinomial logistic regression only statistically outperforms LEGENDRETRON on the dna dataset. LEGENDRETRON consistently outperforms multinomial logistic regression especially strongly on problems where the number of classes is greater than 10. The better performance of LEGENDRETRON can partially be attributed to the greater model capacity afforded by $v^{-1}$, which allows logit estimates to adapt to problems with more classes by adding more nonlinearities. We note that the Lipschitz and strongly monotone properties of $\nabla g_1, \nabla g_2, \dots, \nabla g_B$ depend only on the inputs, which remain uncorrupted, so probability estimates respect the true rankings of classes by design. We conjecture that these properties allow for more adaptive shrinking or expanding of logit variations depending on the level of label noise present, offering a form of tolerance to label noise.

## 7. Conclusion and Broader Impact

In this work, we proposed a general approach which jointly learns proper canonical losses and multiclass probabilities. Our contributions advance the recent work on learning losses with probabilities based on the seminal work within Kakade et al. (2011); Nock & Menon (2020); Walder & Nock (2020) by providing a natural extension to the multiclass setting. The practical nature and generality of our model is owed to the general parameterisation of Fully Input Convex Neural Networks, with theoretical support from Legendre functions, structures from information geometry and hallmark results from convex analysis. By grounding losses in properness for the multiclass setting, we have demonstrated that our model improves upon existing methods that aim to solve multiclass problems through binary reductions, and also outperforms the natural baseline of multinomial logistic regression. Separately, we have also provided conditions under which a composition of gradients of differentiable convex functions is the gradient of another differentiable convex function.

While it is possible for advances in machine learning to bring positive and negative societal impacts, the present work remains general and is not specific to any application, so it is unlikely to bring about any immediate negative societal impact. We anticipate that our results will find applications in multiclass classification and probability estimation, as well as variational inference.
Figure 2. Test performance vs. training set size for the MNIST, Kuzushiji-MNIST and Fashion-MNIST datasets (x-axis: training set size from 10,000 to 60,000; y-axis: ten-class classification accuracy, %). We compare the ten-class classification accuracy of LEGENDRETRON (LT), multinomial logistic regression (MLR) and ISGP-LINKGISTIC (ISGP), where the ISGP combines 10 one-vs-rest binary models while the former two algorithms model the probabilities of all 10 classes jointly.

Table 3. Average test classification accuracies (%) for LEGENDRETRON (LT) and multinomial logistic regression (MLR) on LIBSVM, UCI and Statlog datasets, at varying levels of label noise ($\eta$). Numbers of a method are bolded when it performs statistically better at a significance level of 99% under a t-test; absence of bolding indicates both methods have statistically similar performance.

| Dataset | # Features | # Classes | LT ($\eta$=0%) | MLR ($\eta$=0%) | LT ($\eta$=20%) | MLR ($\eta$=20%) | LT ($\eta$=50%) | MLR ($\eta$=50%) |
|---|---|---|---|---|---|---|---|---|
| aloi | 128 | 1,000 | 88.11 ± 0.03 | 10.34 ± 0.42 | 83.03 ± 0.06 | 7.07 ± 0.45 | 75.23 ± 0.07 | 3.53 ± 0.29 |
| sector | 55,197 | 105 | 89.71 ± 0.18 | 8.77 ± 0.73 | 81.00 ± 0.28 | 4.12 ± 0.44 | 57.38 ± 0.31 | 3.17 ± 0.47 |
| letter | 16 | 26 | 79.82 ± 0.30 | 53.37 ± 0.25 | 74.17 ± 0.21 | 51.24 ± 0.28 | 64.28 ± 0.26 | 46.78 ± 0.41 |
| news20 | 62,061 | 20 | 75.65 ± 0.72 | 63.09 ± 0.58 | 73.48 ± 0.20 | 50.49 ± 1.16 | 51.72 ± 0.16 | 31.54 ± 1.83 |
| Sensorless | 48 | 11 | 88.31 ± 0.19 | 34.42 ± 0.46 | 82.63 ± 0.99 | 32.70 ± 0.50 | 52.02 ± 0.78 | 29.35 ± 0.84 |
| vowel | 10 | 11 | 79.72 ± 1.03 | 44.58 ± 1.08 | 63.77 ± 1.36 | 43.44 ± 1.17 | 40.94 ± 1.61 | 35.42 ± 1.45 |
| usps | 256 | 10 | 95.23 ± 0.16 | 93.79 ± 0.17 | 92.88 ± 0.15 | 92.95 ± 0.19 | 90.23 ± 0.26 | 90.48 ± 0.27 |
| segment | 19 | 7 | 95.95 ± 0.24 | 87.86 ± 0.40 | 92.21 ± 0.40 | 87.28 ± 0.40 | 86.56 ± 0.47 | 82.75 ± 0.46 |
| satimage | 36 | 6 | 86.97 ± 0.19 | 83.93 ± 0.28 | 84.93 ± 0.25 | 81.16 ± 0.28 | 77.44 ± 0.29 | 77.39 ± 0.29 |
| glass | 36 | 6 | 58.72 ± 1.94 | 52.09 ± 1.88 | 53.72 ± 1.98 | 50.47 ± 2.11 | 42.56 ± 1.92 | 45.47 ± 1.67 |
| vehicle | 18 | 4 | 76.91 ± 0.65 | 64.94 ± 0.43 | 73.59 ± 0.79 | 63.06 ± 0.53 | 60.94 ± 1.25 | 55.18 ± 1.20 |
| dna | 180 | 3 | 92.79 ± 0.30 | 94.43 ± 0.19 | 82.61 ± 0.51 | 89.55 ± 0.31 | 58.23 ± 1.05 | 64.18 ± 0.81 |
| svmguide2 | 20 | 3 | 56.01 ± 1.40 | 56.01 ± 1.40 | 56.01 ± 1.40 | 56.01 ± 1.40 | 51.65 ± 2.81 | 52.41 ± 3.04 |
| wine | 13 | 3 | 96.94 ± 1.14 | 97.78 ± 0.59 | 90.97 ± 1.92 | 96.25 ± 0.99 | 69.44 ± 2.89 | 77.36 ± 2.46 |
| iris | 4 | 3 | 86.67 ± 3.89 | 83.00 ± 2.08 | 80.00 ± 3.71 | 81.50 ± 2.27 | 63.50 ± 5.13 | 70.67 ± 3.83 |

## References

Amari, S. Information geometry and its applications, volume 194. Springer, 2016.

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 146-155. PMLR, 06-11 Aug 2017.

Asplund, E. A monotone convergence theorem for sequences of nonlinear mappings. In Browder, F. E. (ed.), Proceedings of Symposia in Pure Mathematics, volume 18, pp. 1-9, Chicago, IL, USA, 1968. American Mathematical Society.

Asuncion, A. and Newman, D. UCI repository of machine learning databases, 2007.

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, 2011. ISBN 9781441994660.

Bhatia, R. Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer New York, 2013. ISBN 9781461206538.

Borwein, J. and Wiersma, H. Asplund decomposition of monotone operators. SIAM Journal on Optimization, 18(3):946-960, 2007.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. CoRR, abs/1812.01718, 2018.

Dua, D. and Graff, C. UCI machine learning repository, 2017.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.

Grabocka, J., Scholz, R., and Schmidt-Thieme, L. Learning surrogate losses. arXiv preprint arXiv:1905.10108, 2019.

Grünwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367-1433, 2004.

Härdle, W., Hall, P., and Ichimura, H. Optimal smoothing in single-index models. The Annals of Statistics, 21(1):157-178, 1993.

Huang, C.-W., Chen, R. T. Q., Tsirigotis, C., and Courville, A. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2021.

Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

King, R. D., Feng, C., and Sutherland, A. Statlog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3):289-333, 1995.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Li, J., Liu, Y., Yin, R., Zhang, H., Ding, L., and Wang, W. Multi-class learning: From theory to algorithm. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Liu, L., Wang, M., and Deng, J. A unified framework of surrogate loss by refactoring and interpolation. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III, volume 12348 of Lecture Notes in Computer Science, pp. 278-293. Springer, 2020.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. In Cohen, W. W., McCallum, A., and Roweis, S. T. (eds.), Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pp. 608-615. ACM, 2008.

Matusita, K. Decision rule, based on the distance, for the classification problem. Annals of the Institute of Statistical Mathematics, 8:67-77, 1956.

Meenakshi, A. and Rajian, C. On a product of positive semidefinite matrices. Linear Algebra and its Applications, 295(1):3-6, 1999. ISSN 00243795.

Mei, J. and Moura, J. M. F. SILVar: Single index latent variable models. IEEE Transactions on Signal Processing, 66(11):2790-2803, 2018.

Nielsen, F. and Hadjeres, G. Monte Carlo information geometry: The dually flat case. CoRR, abs/1803.07225, 2018.

Nock, R. and Menon, A. Supervised learning: no loss no cry. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7370-7380. PMLR, 13-18 Jul 2020.
Nock, R. and Nielsen, F. On the efficient minimization of classification calibrated surrogates. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Nock, R., Nielsen, F., and Amari, S. On conformal divergences and their population minimizers. IEEE Transactions on Information Theory, 62(1):527-538, 2016.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Reid, M. D. and Williamson, R. C. Composite binary losses. Journal of Machine Learning Research, 11(83):2387-2422, 2010.

Rockafellar, R. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970. ISBN 0691080690.

Rockafellar, R., Wets, M., and Wets, R. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2009. ISBN 9783540627722.

Savage, L. J. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783-801, 1971. ISSN 01621459.

Shuford, E. H., Albert, A., and Edward Massengill, H. Admissible probability measurement procedures. Psychometrika, 31(2):125-145, 1966.

Siahkamari, A., Xia, X., Saligrama, V., Castañón, D., and Kulis, B. Learning to approximate a Bregman divergence. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3603-3612. Curran Associates, Inc., 2020.

Streeter, M. Learning effective loss functions efficiently. CoRR, abs/1907.00103, 2019.

Sypherd, T., Diaz, M., Cava, J. K., Dasarathy, G., Kairouz, P., and Sankar, L. A tunable loss function for robust classification: Calibration, landscape, and generalization. IEEE Transactions on Information Theory, 68(9):6021-6051, 2022a.

Sypherd, T., Nock, R., and Sankar, L. Being properly improper. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 20891-20932. PMLR, 17-23 Jul 2022b.

Walder, C. and Nock, R. All your loss are belong to Bayes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 18505-18517. Curran Associates, Inc., 2020.

Williamson, R. C., Vernet, E., and Reid, M. D. Composite multiclass losses. Journal of Machine Learning Research, 17(222):1-52, 2016.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, 2017.

Zien, A. and Ong, C. S. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 1191-1198. Association for Computing Machinery, 2007. ISBN 9781595937933.

## A. Convex Analysis: Relevant Background and List of Theorems

### A.1. Background
To motivate the results studied in this section, we first note that in general, the composition of two monotone functions in $\mathbb{R}^{C-1}$ is not necessarily another monotone function in $\mathbb{R}^{C-1}$. This means that methods to design monotonic functions in $\mathbb{R}$ cannot be applied to functions defined on $\mathbb{R}^{C-1}$, leaving the methods discussed in Section 2 unsuitable for the general multiclass setting. Separately, we note that the composition of two gradients of differentiable convex functions is not necessarily the gradient of another convex function.

In general, to claim that a function $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ is the gradient of a convex function $g : \mathbb{R}^{C-1} \to \mathbb{R}$ requires $f$ to satisfy a notion of monotonicity generalised to higher dimensions. The connection between convex functions and their gradients is well known in convex analysis via the notion of maximal cyclically monotone functions. This combines two notions of monotonicity: maximal monotonicity and cyclical monotonicity. These are defined within the following list of definitions and theorems.

### A.2. List of Theorems

Definition A.1 ((Rockafellar et al., 2009, Definition 12.1)). A function $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ is monotone if $\langle f(x) - f(z), x - z \rangle \geq 0$ for all $x, z \in \mathbb{R}^{C-1}$. Moreover, it is strictly monotone when the inequality is strict whenever $x \neq z$.

The following two definitions require the notion of the graph of a function $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$, which is defined as $\mathrm{gph}(f) = \{(x, y) : x \in \mathbb{R}^{C-1}, y = f(x)\}$.

Definition A.2 ((Bauschke & Combettes, 2011, Definition 20.20)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be a monotone function. Then $f$ is maximally monotone if there exists no monotone function $g : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ such that $\mathrm{gph}(f) \subsetneq \mathrm{gph}(g)$.

Definition A.3 ((Bauschke & Combettes, 2011, Definition 22.10)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$. For an arbitrary integer $n \geq 2$, $f$ is $n$-cyclically monotone if for any $\{(x_i, y_i)\}_{i=1,\dots,n} \subseteq \mathrm{gph}(f)$ it follows that
$$\sum_{i=1}^{n} \langle y_i, x_{i+1} - x_i \rangle \leq 0, \qquad \text{where } x_{n+1} = x_1.$$
$f$ is cyclically monotone if it is $n$-cyclically monotone for every integer $n \geq 2$. In addition, if there exists no cyclically monotone function $g \neq f$ such that $\mathrm{gph}(f) \subseteq \mathrm{gph}(g)$, then $f$ is maximal cyclically monotone.

Theorem A.4 ((Rockafellar et al., 2009, Theorems 12.17 & 12.25)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$. Then $f = \nabla h$ for a differentiable convex function $h : \mathbb{R}^{C-1} \to \mathbb{R}$ if and only if $f$ is maximal cyclically monotone; that is, $f$ is maximally monotone and cyclically monotone.

Theorem A.5 ((Rockafellar et al., 2009, Proposition 12.3)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be a differentiable function. Then $f$ is monotone if and only if $J_f(x)$ is positive semi-definite for all $x \in \mathbb{R}^{C-1}$. Moreover, if $J_f(x)$ is positive definite for all $x \in \mathbb{R}^{C-1}$, then $f$ is strictly monotone.

Theorem A.6 ((Bauschke & Combettes, 2011, Corollary 20.25)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be a monotone and continuous function. Then $f$ is maximally monotone.

Theorem A.7 ((Borwein & Wiersma, 2007, Theorem 3)). Let $f : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ be maximally monotone and continuously differentiable. Then $f(x) = \nabla F(x) + Lx$, where $F$ is a differentiable convex function and $L$ is a skew-symmetric matrix.

Theorem A.8 ((Meenakshi & Rajian, 1999, Theorem 3)). Let $A, B \in \mathbb{R}^{(C-1) \times (C-1)}$ be symmetric and positive semi-definite matrices. Then $AB$ is positive semi-definite if and only if it is symmetric.

Theorem A.9 ((Bhatia, 2013, Theorem VIII.4.6)). Let $V \subseteq \mathbb{R}^{(C-1) \times (C-1)}$ be a real vector space whose elements are matrices with real eigenvalues, and denote by $\lambda_i(M)$ the $i$-th smallest eigenvalue of any matrix $M \in V$. Let $A, B \in V$. Then
$$\lambda_i(A) + \lambda_1(B) \leq \lambda_i(A + B) \leq \lambda_i(A) + \lambda_{C-1}(B).$$

### A.3. Remarks

Theorem A.4 serves as a criterion and characterisation of differentiable convex functions through their gradients.
Theorems A.6 to A.8 are hallmark results from the rich literature of convex analysis and monotone operators that tie together conditions under which a differentiable composite function is the gradient of a convex function. Notably, Theorem A.7 is a rewritten version of the Asplund decomposition of maximal monotone operators (Asplund, 1968), which tells us it suffices to focus on maximal monotonicity. We refer the reader to Appendix E for the usage of Theorems A.5 to A.8 in the proof of Theorem 4.2. Theorem A.9 allows us to obtain a lower bound on the smallest eigenvalue of the sum of two real-valued matrices with real eigenvalues. This is particularly useful for proving positive definiteness in Theorem 4.3. We refer the reader to Appendix F for its usage in the proof of Theorem 4.3.

## B. Examples of Binary Proper Losses

Denote by $y \in \{-1, 1\}$ a label and let $p = \Pr(Y = 1 \mid x)$ be the true probability that $Y = 1$ given $x$. Let $\hat{y}$ and $\hat{p}$ be the predicted class and probability estimate of $Y = 1$ given input $x$. Below are the analytical formulas of the partial losses for various examples of binary proper losses, where it is assumed that $p = 1$. Figure 3 plots these partial losses.

| Loss | Partial loss $\ell_1(\hat{p})$ |
|---|---|
| 0-1 | $[\![\hat{y}(\hat{p}) \neq 1]\!]$ |
| square | $(1 - \hat{p})^2$ |
| log | $-\log(\hat{p})$ |
| Matsushita | $\frac{1}{2}\sqrt{\frac{1 - \hat{p}}{\hat{p}}}$ |

Figure 3. Plots of the partial losses corresponding to various proper losses when the true probability is 1 (x-axis: predicted probability from 0 to 1; curves: 0-1, log, square and Matsushita losses).

## C. Proof of equivalent conditions on subdifferentials for strictly convex functions

($\Rightarrow$) Suppose $f$ is strictly convex and assume, for a proof by contradiction, that there exist some $x, y \in \mathrm{dom} f$ with $x \neq y$ such that $f(x) + \langle \phi, y - x \rangle \geq f(y)$ for some $\phi \in \partial f(x)$. Fix $\lambda \in (0, 1)$. Then we have
$$f(x) + \langle \phi, (\lambda x + (1 - \lambda) y) - x \rangle = f(x) + (1 - \lambda) \langle \phi, y - x \rangle$$
$$\leq f(\lambda x + (1 - \lambda) y) \quad \text{by definition of a subgradient}$$
$$< \lambda f(x) + (1 - \lambda) f(y) \quad \text{by strict convexity of } f$$
$$\leq f(x) + (1 - \lambda) \langle \phi, y - x \rangle \quad \text{by the above assumption.}$$
Thus we have a contradiction, so the subdifferential of $f$ for all $x \in \mathrm{dom} f$ must be given by
$$\partial f(x) = \{\phi \in \mathbb{R}^n : \langle \phi, y - x \rangle < f(y) - f(x),\ \forall y \neq x\}.$$

($\Leftarrow$) Suppose the subdifferential of $f$ for any $x \in \mathrm{dom} f$ is given by
$$\partial f(x) = \{\phi \in \mathbb{R}^n : \langle \phi, y - x \rangle < f(y) - f(x),\ \forall y \neq x\}.$$
Fix $x, y \in \mathrm{dom} f$ with $x \neq y$ and $\lambda \in (0, 1)$, and consider $\phi \in \partial f(\lambda x + (1 - \lambda) y)$. Then we have
$$f(\lambda x + (1 - \lambda) y) + (1 - \lambda) \langle \phi, x - y \rangle < f(x),$$
$$f(\lambda x + (1 - \lambda) y) + \lambda \langle \phi, y - x \rangle < f(y).$$
Multiplying the first inequality by $\lambda$ and the second by $(1 - \lambda)$ and summing them gives
$$f(\lambda x + (1 - \lambda) y) < \lambda f(x) + (1 - \lambda) f(y).$$
This holds for arbitrary $x, y \in \mathrm{dom} f$ and $\lambda \in (0, 1)$, so it follows that $f$ is strictly convex.

## D. Proof of Proposition 3.3

($\Rightarrow$) Fix $q \in \Delta^{C-1}$ and suppose $\ell$ is proper. Then we have
$$L(p, q) = p^\top \ell(q) = q^\top \ell(q) + (p - q)^\top \ell(q) = \underline{L}(q) + (p - q)^\top \ell(q),$$
and
$$0 \leq L(p, q) - L(p, p) = \underline{L}(q) + (p - q)^\top \ell(q) - \underline{L}(p) \implies \langle -\ell(q), p - q \rangle \leq (-\underline{L})(p) - (-\underline{L})(q).$$
Recall that $\underline{L}$ is concave, so $-\underline{L}$ is convex. Hence $-\ell(q) \in \partial(-\underline{L})(q)$, which means $-\ell(q)$ is a subgradient of $-\underline{L}$ at $q$, and
$$L(p, q) = -\big(-\underline{L}(q)\big) - (p - q)^\top \big(-\ell(q)\big),$$
which is of the required form with $f = -\underline{L}$ and $\phi = -\ell(q)$.

($\Leftarrow$) Suppose there exists a convex function $f : \Delta^{C-1} \to \mathbb{R}$ such that for all $q \in \Delta^{C-1}$, there exists a subgradient $\phi \in \partial f(q)$ with $L(p, q) = -(p - q)^\top \phi - f(q)$. For all $p \in \Delta^{C-1}$, we have
$$L(p, q) - L(p, p) = f(p) - f(q) - (p - q)^\top \phi \geq 0 \quad \text{since } \phi \text{ is a subgradient of } f \text{ at } q,$$
so $L(p, p) \leq L(p, q)$. Hence $\ell$ is a proper loss.

It remains to prove that $\ell$ is strictly proper if and only if there exists a strictly convex function $f : \Delta^{C-1} \to \mathbb{R}$ such that for all $q \in \Delta^{C-1}$, there exists a subgradient $\phi \in \partial f(q)$ with $L(p, q) = -(p - q)^\top \phi - f(q)$ for all $p \in \Delta^{C-1}$.
This follows immediately from the definitions of strictly proper losses and subdifferentials, under which the above inequalities become strict.

We are left to prove that $L(p, q) = (p - q)^\top \ell(q) + \underline{L}(q)$ when $\underline{L}$ is differentiable. We first note that $\underline{L}$ is concave, so $-\underline{L}$ is convex. Recall from the workings above that $L(p, q) = p^\top \ell(q) = \underline{L}(q) + (p - q)^\top \ell(q)$. Setting $f = -\underline{L}$ in Proposition 3.3, we can deduce that $\ell(q) = \nabla \underline{L}(q)$ for all $q \in \mathrm{ri}(\Delta^{C-1})$ for a proper loss $\ell$, which follows by the uniqueness of subgradients of differentiable functions. That is, $\nabla \underline{L}(q) = \ell(q)$ for all $q \in \mathrm{ri}(\Delta^{C-1})$.

## E. Proof of Theorem 4.2

(1) $\implies$ (2). This follows from Schwarz's theorem on the equality of mixed partial derivatives.

(2) $\implies$ (3). This follows from Theorem A.8, since the Jacobian of a composite function is a product of matrices.

(3) $\implies$ (4). This follows from Theorem A.5.

We are left to prove (4) $\implies$ (1).

(4) $\implies$ (1). Since $f \circ g$ is monotone and continuous, it follows from Theorem A.6 that $f \circ g$ is maximally monotone. From Theorem A.7, $(f \circ g)(x) = \nabla F(x) + Lx$ for a differentiable convex function $F$ and a skew-symmetric matrix $L$. Since $f \circ g$ is differentiable, it follows that $F$ is twice-differentiable. This gives us $J_{f \circ g} = \nabla^2 F + L$, where $\nabla^2 F$ is symmetric by Schwarz's theorem on the equality of mixed partial derivatives. Theorems A.5 and A.8 tell us that $J_{f \circ g}$ is also symmetric. As $J_{f \circ g}$ and $\nabla^2 F$ are both symmetric, $L$ must be symmetric as well; being also skew-symmetric, $L = -L^\top = -L$, so $L = 0$. That is, $f \circ g = \nabla F$, where $F$ is twice-differentiable and convex.

## F. Proof of Theorem 4.3

Fix $x \in \mathbb{R}^{C-1}$. The Jacobian of $f \circ g$ is given by $J_{f \circ g}(x) = J_f(g(x)) J_g(x)$. We aim to prove that $J_{f \circ g}(x)$ is positive definite. We first note that $J_g(x)$ is invertible since $|J_g(x)| > 0$. Now, observe that $J_{f \circ g}(x)$ is similar to the matrix
$$(J_g(x))^{\frac{1}{2}} J_f(g(x)) J_g(x) (J_g(x))^{-\frac{1}{2}} = (J_g(x))^{\frac{1}{2}} J_f(g(x)) (J_g(x))^{\frac{1}{2}},$$
where the square root $(J_g(x))^{\frac{1}{2}}$ of the matrix $J_g(x)$ is known to be symmetric since $J_g(x)$ is symmetric and positive definite. Since $J_f(g(x))$ is also symmetric, it follows that $(J_g(x))^{\frac{1}{2}} J_f(g(x)) (J_g(x))^{\frac{1}{2}}$ is symmetric. As $J_g(x)$ and $J_f(g(x))$ are positive definite, we have
$$\Big| (J_g(x))^{\frac{1}{2}} J_f(g(x)) (J_g(x))^{\frac{1}{2}} \Big| = |J_f(g(x))| \, |J_g(x)| > 0.$$
It follows that $(J_g(x))^{\frac{1}{2}} J_f(g(x)) (J_g(x))^{\frac{1}{2}}$ is positive definite (indeed, it is congruent to the positive definite matrix $J_f(g(x))$ via the invertible matrix $(J_g(x))^{\frac{1}{2}}$), meaning it has positive eigenvalues $\lambda_1, \dots, \lambda_{C-1} \in \mathbb{R}$, which we order so that $\lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_{C-1}$. Since similar matrices have the same eigenvalues, it follows that $\lambda_1, \dots, \lambda_{C-1}$ are also the eigenvalues of $J_{f \circ g}(x)$.

Denote by $S = \frac{1}{2}\big(J_{f \circ g}(x) + (J_{f \circ g}(x))^\top\big)$ and $A = \frac{1}{2}\big(J_{f \circ g}(x) - (J_{f \circ g}(x))^\top\big)$ the symmetric and skew-symmetric parts of $J_{f \circ g}(x)$ respectively. It is well known that for any skew-symmetric matrix $A$ and any $z \in \mathbb{R}^{C-1}$, we have $z^\top A z = 0$. To prove that $J_{f \circ g}(x)$ is positive definite, it therefore suffices to prove that $z^\top J_{f \circ g}(x) z = z^\top S z > 0$ for any $z \in \mathbb{R}^{C-1} \setminus \{0\}$.

Firstly, recall that all eigenvalues of $J_{f \circ g}(x)$ are real and positive, and that the transpose $(J_{f \circ g}(x))^\top$ has the same eigenvalues as $J_{f \circ g}(x)$; that is, all eigenvalues of $(J_{f \circ g}(x))^\top$ are real and positive. Secondly, $S = \frac{1}{2}\big(J_{f \circ g}(x) + (J_{f \circ g}(x))^\top\big)$ is symmetric, so all of its eigenvalues must be real. Hence, Theorem A.9 gives us the following bound for the smallest eigenvalue:
$$\lambda_1(S) = \lambda_1\Big(\tfrac{1}{2} J_{f \circ g}(x) + \tfrac{1}{2} (J_{f \circ g}(x))^\top\Big) \geq \tfrac{1}{2} \lambda_1\big(J_{f \circ g}(x)\big) + \tfrac{1}{2} \lambda_1\big((J_{f \circ g}(x))^\top\big) = \lambda_1 > 0.$$
The Rayleigh quotient of $S$ for any $z \in \mathbb{R}^{C-1} \setminus \{0\}$ is given by $\frac{z^\top S z}{\|z\|^2}$ and satisfies
$$\lambda_1(S) \leq \frac{z^\top S z}{\|z\|^2} \leq \lambda_{C-1}(S).$$
Hence, we have $\frac{z^\top S z}{\|z\|^2} \geq \lambda_1(S) > 0$ for all $z \in \mathbb{R}^{C-1} \setminus \{0\}$. Thus, $z^\top J_{f \circ g}(x) z = z^\top S z > 0$ for all $z \in \mathbb{R}^{C-1} \setminus \{0\}$, and so $J_{f \circ g}(x)$ is positive definite.
This holds for arbitrary $x \in \mathbb{R}^{C-1}$, so it follows by Theorem 4.2 that $f \circ g$ is the gradient of a twice-differentiable convex function $F$, with $F$ strictly convex since $J_{f \circ g}(x)$ is positive definite. In other words, $f \circ g$ is the gradient of a twice-differentiable Legendre function.

## G. Proof of Theorem 5.1

**Proof of the properties of LogSumExp⁺ and softmax⁺** Since positive definiteness of $J_u(x)$ for all $x \in \mathbb{R}^{C-1}$ implies strict convexity of $f$, and strict convexity of $f$ implies invertibility of $u$, it suffices to prove that $J_u(x)$ is positive definite for all $x \in \mathbb{R}^{C-1}$. Fix $x \in \mathbb{R}^{C-1}$. For ease of notation, we denote $J_u(x)$ by $M$, where $M_{ij}$ refers to the entry in the $i$-th row and $j$-th column of $J_u(x)$. Consider any row $i \in \{1, \dots, C-1\}$. We have
$$M_{ii} = \frac{\exp(x_i)\big(1 + \sum_{k \neq i} \exp(x_k)\big)}{\big(1 + \sum_{k=1}^{C-1} \exp(x_k)\big)^2}, \qquad M_{ij} = \frac{-\exp(x_i)\exp(x_j)}{\big(1 + \sum_{k=1}^{C-1} \exp(x_k)\big)^2} \ \text{ for } j \neq i.$$
Observe that
$$M_{ii} - \sum_{j \neq i} |M_{ij}| = \frac{\exp(x_i)}{\big(1 + \sum_{k=1}^{C-1} \exp(x_k)\big)^2} > 0.$$
This holds for arbitrary $i \in \{1, \dots, C-1\}$, so it follows that $J_u(x)$ is strictly diagonally dominant with positive diagonal entries. This implies that $J_u(x)$ is positive definite, so it follows that $f$ is strictly convex. This completes the proof of the key properties of the LogSumExp⁺ function and its gradient softmax⁺.

**Proof that functions learned by LEGENDRETRON are inverse canonical links** We first note that $v^{-1} = (\nabla g_1) \circ (\nabla g_2) \circ \dots \circ (\nabla g_B)$ is indeed invertible, since the right-hand side is invertible by the strong convexity of $g_1, g_2, \dots, g_B$. Since each Hessian $\nabla^2 g_i(x)$ is symmetric and positive definite, it follows that $v^{-1}$ is the gradient of a twice-differentiable Legendre function by applying Theorem 4.3 recursively. It follows from Theorem 4.2 that $J_{v^{-1}}(x)$ is symmetric. We also have $|J_{v^{-1}}(x)| > 0$, so $J_{v^{-1}}(x)$ is positive definite for all $x \in \mathbb{R}^{C-1}$. Recall that LogSumExp⁺ is twice-differentiable with gradient $u = \mathrm{softmax}^+$ and Hessian $J_u(x)$ strictly diagonally dominant; that is, $J_u(x)$ is symmetric and positive definite. Applying Theorem 4.3 to $u \circ v^{-1}$ allows us to deduce that $u \circ v^{-1}$ is the gradient of a twice-differentiable Legendre function that maps to $\tilde{\Delta}^{C-1}$, so $u \circ v^{-1}$ can be set as the inverse of an implicit canonical link function.

## H. Proof of Corollary 5.2

Let $u = \mathrm{softmax}^+$ and fix $g$ to be the gradient of a twice-differentiable Legendre function with positive definite Hessian everywhere that maps to $\tilde{\Delta}^{C-1}$. Note that LogSumExp⁺ is twice-differentiable and Legendre, so $u^{-1}$ and $g$ satisfy the sufficient conditions of Theorem 4.3. It follows that $u^{-1} \circ g$ is the gradient of a twice-differentiable Legendre function defined on a compact set $\Omega$. The result then follows from Proposition 3 of Huang et al. (2021).

## I. Experimental Details

### I.1. Network Architecture and Optimisation Details

Experimental details on architecture and optimisation parameters for LEGENDRETRON (LT) and multinomial logistic regression (MLR). Here we denote by $\alpha$ the learning rate, $\lambda$ the weight decay, and $\gamma$ the multiplicative rate of decay applied to $\alpha$ every $S$ epochs through a step-wise learning rate scheduler. We used the Adam optimiser for all experiments.

| Dataset(s) | Model | B | H | M | α | γ | S | Epochs | Batch Size |
|---|---|---|---|---|---|---|---|---|---|
| MNIST/FMNIST/KMNIST | LT | 1 | 4 | 4 | 0.001 | 0.7 | 4 | 200 | 128 |
| MNIST/FMNIST/KMNIST | MLR | \ | \ | \ | 0.001 | 0.7 | 4 | 200 | 128 |
| aloi | LT | 2 | 2 | 4 | 0.01 | 0.95 | 4 | 360 | 64 |
| aloi | MLR | \ | \ | \ | 0.01 | 0.95 | 4 | 360 | 64 |
| LIBSVM/UCI/Statlog (other datasets) | LT | 2 | 2 | 4 | 0.01 | 0.95 | 4 | 240 | 64 |
| LIBSVM/UCI/Statlog (other datasets) | MLR | \ | \ | \ | 0.01 | 0.95 | 4 | 240 | 64 |

### I.2. LogSumExp trick for softmax⁺

Let $u = \mathrm{softmax}^+$ and consider $x \in \mathbb{R}^{C-1}$. We have
$$\log\big(\tilde{\Pi}^{-1}(u(x))\big) = \bigg( \log \frac{\exp(x_1)}{1 + \sum_{k=1}^{C-1} \exp(x_k)},\ \dots,\ \log \frac{\exp(x_{C-1})}{1 + \sum_{k=1}^{C-1} \exp(x_k)},\ \log \frac{1}{1 + \sum_{k=1}^{C-1} \exp(x_k)} \bigg),$$
where $\log$ on the left-hand side is applied elementwise. We seek an alternate expression for $\log(\tilde{\Pi}^{-1}(u(x)))$ that is numerically stable. Let $x^* = \max(x_1, \dots, x_{C-1})$ and $S = \exp(-x^*) + \sum_{k=1}^{C-1} \exp(x_k - x^*)$. We can write
$$\log\big(\tilde{\Pi}^{-1}(u(x))\big) = \bigg( \log \frac{\exp(x_1 - x^*)}{S},\ \dots,\ \log \frac{\exp(x_{C-1} - x^*)}{S},\ \log \frac{\exp(-x^*)}{S} \bigg)$$
$$= \big( x_1 - x^* - \log S,\ \dots,\ x_{C-1} - x^* - \log S,\ -x^* - \log S \big).$$
It can be observed that this expression is numerically stable for all large values of $x^*$.
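The stable expression above translates into a few lines of code. The following sketch (our illustrative PyTorch helper, not the authors' implementation) computes all $C$ log-probabilities from $C-1$ logits without overflow:

```python
import torch

def log_softmax_plus(x):
    """log of Pi^{-1}(softmax+(x)): C log-probabilities from C-1 logits."""
    x_star = x.max(dim=-1, keepdim=True).values             # x* = max_k x_k
    shifted = torch.cat([x - x_star, -x_star], dim=-1)      # (x_1 - x*, ..., -x*)
    log_S = torch.logsumexp(shifted, dim=-1, keepdim=True)  # log S
    return shifted - log_S

z = torch.tensor([[1000.0, -1000.0]])   # extreme logits, C = 3
print(log_softmax_plus(z))              # finite values, no overflow
```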