# Supervised Learning: No Loss No Cry

Richard Nock¹² and Aditya Krishna Menon³

¹Data61 (Australia), ²The Australian National University (Australia), ³Google Research (USA).

*Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).*

**Abstract.** Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an ad hoc manner. In hopes of making this procedure more principled, the problem of learning the loss function for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature. In this paper, we revisit the SLISOTRON algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast SLISOTRON as learning a loss from a family of composite square losses. By interpreting this through the lens of proper losses, we derive a generalisation of SLISOTRON based on Bregman divergences. The resulting BREGMANTRON algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the BREGMANTRON outperforms the SLISOTRON, and that the loss it learns can be minimized by other algorithms for different tasks, thereby opening the interesting problem of loss transfer between domains.

## 1. Introduction

Computationally efficient supervised learning essentially started with the PAC framework of Valiant (1984), in which the goal was to learn, in polynomial time, a function able to predict a label (or class, one of two possible) for i.i.d. inputs. The initial loss, whose minimization enforces the accurate prediction of labels, was the binary zero-one loss, which returns 1 iff a mistake is made. The zero-one loss was later progressively replaced in learning algorithms for tractability reasons, including its non-differentiability and the structural complexity of its minimization (Kearns & Vazirani, 1994; Auer et al., 1995). From the late nineties, a zoo of losses started to be used for tractable machine learning (ML), the most popular ones built from the square loss and the logistic loss. Recently, there has been a significant push to widen even more the choice of loss; to pick a few, see Grabocka et al. (2019); Kakade et al. (2011); Liu et al. (2019); Mei & Moura (2018); Nock & Nielsen (2008; 2009); Reid & Williamson (2010); Siahkamari et al. (2019); Streeter (2019); Sypherd et al. (2019). With few exceptions, such works seldom ground the change of loss in reasons beyond tractability at large, be it algorithmic or statistical, with for example the introduction of classification calibration, which studies conditions under which losses comply with the accurate prediction of labels (Bartlett et al., 2006). It turns out that statistics and Bayes decision theory give a precise reason, one which has long been the object of philosophical and formal debates (de Finetti, 1949).
It starts from a simple principle: Bayes rule is optimal for the loss at hand, a property known as properness (Savage, 1971). Then comes a less known subtlety: a proper loss as commonly used for real-valued prediction, such as the square and logistic losses, involves an implicit canonical link (Reid & Williamson, 2010) function that maps class probabilities (such as the output of Bayes rule) to real values. This is exemplified by the sigmoid (inverse) link in deep learning. Supervised learning in a Bayesian framework can thus be more broadly addressed by learning a classifier and a link for the domain at hand, which implies learning a proper canonical loss along with the classifier. This loss, if suitably expressed, can be used for training. This kills two birds with one stone: we get access not just to real-valued predictions, but also to a way to embed them into class probability estimates via the inverse link: we directly learn to estimate Bayes rule.

A large number of papers, especially recently, have tried to push forward the problem of learning the loss, including e.g. (Grabocka et al., 2019; Liu et al., 2019; Mei & Moura, 2018; Siahkamari et al., 2019; Streeter, 2019; Sypherd et al., 2019), but none of those alludes to properness to ground the choice of the loss, therefore taking the risk of fitting a loss whose (unknown) optima may fail to contain Bayes rule. To the best of our knowledge, Nock & Nielsen (2008) is the first paper grounding the search for the loss within properness, and Kakade et al. (2011) brings the first algorithm (SLISOTRON) and associated theoretical results for fitting the link, though subject to restrictive assumptions on Bayes rule and on the target distribution, at the risk of fitting probabilities outside $[0, 1]$, and falling short of convergence guarantees that would comply with the classical picture of ML, for either training or generalization.

Our major contribution is a new algorithm, the BREGMANTRON (Algorithm 2), a generalisation of the SLISOTRON (Kakade et al., 2011) to learn proper canonical losses. BREGMANTRON exploits two dual views of proper losses, guarantees class probability estimates in $[0, 1]$, and uses Lipschitz constraints that can be tuned at runtime. Our formal contribution includes a simple convergence guarantee for this algorithm which alleviates all assumptions on the domain and Bayes rule in (Kakade et al., 2011). Our result shows that convergence happens as a function of the discrepancy between our estimate and the true value of the *mean operator*, a sufficient statistic for the class (Patrini et al., 2014). As the discrepancy converges to zero, the (link, classifier) pair estimated by the BREGMANTRON converges to a stable output. To come to this result, we pass through an intermediate step in which we show a particular explicit form, as a Bregman divergence (Bregman, 1967), for any differentiable proper composite loss; Bregman divergences are canonical distortion measures. To save space, all proofs are given in the supplementary material, denoted SI.

## 2. Definitions and notations

The following shorthands are used: $[n] := \{1, 2, \ldots, n\}$ for $n \in \mathbb{N}_*$; for $z \geq 0$ and $a \leq b \in \mathbb{R}$, denote $z \cdot [a, b] := [za, zb]$ and $z + [a, b] := [z + a, z + b]$. We also let $\overline{\mathbb{R}} := [-\infty, \infty]$. In (batch) supervised learning, one is given a training set of $m$ examples $S := \{(\boldsymbol{x}_i, y^*_i), i \in [m]\}$, where $\boldsymbol{x}_i \in \mathcal{X}$ is an observation ($\mathcal{X}$ is called the domain: often, $\mathcal{X} \subseteq \mathbb{R}^d$) and $y^*_i \in \mathcal{Y} := \{-1, 1\}$ is a label, or class. The objective is to learn a classifier $h : \mathcal{X} \to \mathbb{R}$ which belongs to a given set $\mathcal{H}$.
The goodness of fit of some $h$ on $S$ is evaluated by a loss.

**Losses.** A loss for binary class probability estimation (Buja et al., 2005) is some $\ell : \mathcal{Y} \times [0, 1] \to \mathbb{R}$ whose expression can be split according to *partial losses* $\ell_1, \ell_{-1}$:

$$\ell(y^*, u) := [\![y^* = 1]\!] \cdot \ell_1(u) + [\![y^* = -1]\!] \cdot \ell_{-1}(u). \tag{1}$$

Its *conditional Bayes risk* function is the best achievable loss when labels are drawn with a particular positive base-rate,

$$\underline{L}(\pi) := \inf_u \mathbb{E}_{\mathsf{Y} \sim \pi}[\ell(\mathsf{Y}, u)], \tag{2}$$

where $\Pr[\mathsf{Y} = 1] = \pi$. A loss for class probability estimation $\ell$ is *proper* iff Bayes prediction locally achieves the minimum everywhere: $\underline{L}(\pi) = \mathbb{E}_{\mathsf{Y}}[\ell(\mathsf{Y}, \pi)]$, $\forall \pi \in [0, 1]$, and *strictly proper* if Bayes is the unique minimum. Fitting a prediction $h(\boldsymbol{x}) \in \mathbb{R}$ into some $u \in [0, 1]$ as required in (1) is done via a link function.

**Links, composite and canonical proper losses.** A link $\psi : [0, 1] \to \mathbb{R}$ connects real-valued prediction and class probability estimation. A loss can be augmented with a link to account for real-valued prediction, $\ell_\psi(y^*, z) := \ell(y^*, \psi^{-1}(z))$ with $z \in \mathbb{R}$ (Reid & Williamson, 2010). There exists a particular link, uniquely defined¹ for any proper differentiable loss, the *canonical link*: $\psi := -\underline{L}'$ (Reid & Williamson, 2010, Section 6.1). We note that the differentiability condition can be removed (Reid & Williamson, 2010, Footnote 6). As an example, for log-loss we find the link $\psi(u) = \log \frac{u}{1-u}$, with inverse the well-known sigmoid $\psi^{-1}(z) = (1 + e^{-z})^{-1}$. A *canonical proper loss* is a proper loss using the canonical link.

**Convex surrogates.** When the loss is proper canonical and symmetric ($\ell_1(u) = \ell_{-1}(1 - u)$, $\forall u \in (0, 1)$), it was shown in Nock & Nielsen (2008; 2009) that there exists a convenient dual formulation amenable to direct minimization with real-valued classifiers: a convex surrogate loss

$$F_\ell(z) := (-\underline{L})^\star(-z), \tag{3}$$

where $\star$ denotes the Legendre conjugate, $F^\star(z) := \sup_{z' \in \mathrm{dom}(F)}\{zz' - F(z')\}$ (Boyd & Vandenberghe, 2004). For simplicity, we just call $F_\ell$ the convex surrogate of $\ell$. The logistic, square and Matsushita losses are all surrogates of proper canonical and symmetric losses. Such functions are called surrogates since they all define convenient upper bounds of the 0/1 loss. Any proper canonical and symmetric loss has $\ell_\psi(y^*, z) = F_\ell(y^* z)$, so both dual forms are equivalent in terms of minimization (Nock & Nielsen, 2008; 2009).

**Learning.** Given a sample $S$, we learn $h$ by the empirical minimization of a proper loss on $S$ that we denote $\ell_\psi(S, h) := \mathbb{E}_i[\ell(y^*_i, \psi^{-1}(h(\boldsymbol{x}_i)))]$. We insist on the fact that minimizing any such loss does not just give access to a real-valued predictor $h$: it also gives access to a class probability estimator given the loss (Nock & Nielsen, 2008, Section 5), (Nock & Williamson, 2019),

$$\Pr[\mathsf{Y} = 1 \,|\, \boldsymbol{x}; h, \psi] := \psi^{-1}(h(\boldsymbol{x})), \tag{4}$$

so in the Bayesian framework, supervised learning can also encompass learning the link $\psi$ of the loss as well. If the loss is proper canonical, learning the link implies learning the loss. As usual, we assume $S$ sampled i.i.d. according to an unknown but fixed $\mathcal{D}$ and let $\ell_\cdot(\mathcal{D}, h) := \mathbb{E}_{S \sim \mathcal{D}}[\ell_\cdot(S, h)]$.

¹Up to multiplication or addition by a scalar (Buja et al., 2005).
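To make these dual views concrete, here is a small numerical sketch (ours, not part of the paper; function names are illustrative) instantiating (2)–(4) for the log-loss: the canonical link is the logit, its inverse the sigmoid, and brute-force Legendre conjugation of $-\underline{L}$ recovers the logistic surrogate $F_\ell(z) = \log(1 + e^{-z})$:

```python
import numpy as np

# Conditional Bayes risk of log-loss: L(u) = -u log u - (1-u) log(1-u).
def bayes_risk(u):
    return -u * np.log(u) - (1 - u) * np.log(1 - u)

# Canonical link psi(u) = -L'(u) = log(u / (1-u)); its inverse is the sigmoid.
def link(u):
    return np.log(u / (1 - u))

def inv_link(z):
    return 1.0 / (1.0 + np.exp(-z))

# Convex surrogate (3): F(z) = (-L)*(-z) = sup_u {-z u + L(u)}, computed by
# brute-force search over the conjugate's domain [0, 1].
def surrogate(z, grid=np.linspace(1e-6, 1 - 1e-6, 100001)):
    return np.max(-z * grid + bayes_risk(grid))

# Sanity checks: link inversion, and F(z) = log(1 + exp(-z)) (logistic loss).
assert abs(inv_link(link(0.3)) - 0.3) < 1e-12
for z in (-2.0, 0.0, 3.0):
    assert abs(surrogate(z) - np.log1p(np.exp(-z))) < 1e-4
```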
## 3. Related work

Our problem of interest is learning not only a classifier, but also a loss function itself. A minimal requirement for the loss to be useful is that it is proper, i.e., it preserves the Bayes classification rule. Constraining our loss to this set ensures standard guarantees on the classification performance using this loss, e.g., via surrogate regret bounds. Evidently, when choosing amongst losses, we must have a well-defined objective. We now reinterpret an algorithm of Kakade et al. (2011) as providing such an objective.

**The SLISOTRON algorithm.** Kakade et al. (2011) considered the problem of learning a class-probability model of the form $\Pr[\mathsf{Y} = 1 \,|\, \boldsymbol{x}] = u(\boldsymbol{w}_*^\top \boldsymbol{x})$, where $u(\cdot)$ is a 1-Lipschitz, non-decreasing function, and $\boldsymbol{w}_* \in \mathbb{R}^d$ is a fixed vector. They proposed SLISOTRON, an iterative algorithm that alternates between gradient steps to estimate $\boldsymbol{w}_*$, and nonparametric isotonic regression steps to estimate $u$. SLISOTRON provably bounds the expected square loss, i.e.,

$$\ell^{\mathrm{SQ}}_\psi(S, h) = \mathbb{E}_{\boldsymbol{x} \sim S}\left[\mathbb{E}_{y \sim S}[(y - \psi^{-1}(h(\boldsymbol{x})))^2 \,|\, \boldsymbol{x}]\right], \tag{5}$$

where $h(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x}$ is a linear scorer and $y := (y^* + 1)/2$. The square loss has $2\ell^{\mathrm{SQ}}_1(u) := (1 - u)^2$, $2\ell^{\mathrm{SQ}}_{-1}(u) := u^2$, and conditional Bayes risk $2\underline{L}^{\mathrm{SQ}}(u) := u(1 - u)$. Observe now that the SLISOTRON algorithm can be interpreted as follows: we jointly learn a classifier $h \in \mathcal{H}$ and composite link $\psi$ for the square loss $\ell \in \mathcal{L}$, with

$$\mathcal{L} := \{(y, z) \mapsto (y - \psi^{-1}(z))^2 : \psi^{-1} \text{ is 1-Lipschitz, invertible}\}.$$

That is, SLISOTRON can be interpreted as finding a classifier and a link via all compositions of the square loss with a 1-Lipschitz, invertible function. Kakade et al. (2011) in fact do not directly analyze (5) but a lower bound that directly follows from Jensen's inequality:

$$\ell^{\mathrm{SQ}}_\psi(S, h) \geq \mathbb{E}_{\boldsymbol{x} \sim S}\left[(\mathbb{E}_{y \sim S}[y \,|\, \boldsymbol{x}] - \psi^{-1} \circ h(\boldsymbol{x}))^2\right] =: \tilde{\ell}^{\mathrm{SQ}}_\psi(S, h). \tag{6}$$

This does not change the problem, as the slack is the expected (per observation) variance of labels in the sample, a constant given $S$. We shall return to this point in the sequel. Kakade et al. (2011) make an assumption about Bayes rule, $\mathbb{E}_{y \sim \mathcal{D}}[y \,|\, \boldsymbol{x}] = \psi^{-1}_{\mathrm{OPT}}(\boldsymbol{w}_{\mathrm{OPT}}^\top \boldsymbol{x})$ with $\psi^{-1}_{\mathrm{OPT}}$ 1-Lipschitz and $\|\boldsymbol{w}_{\mathrm{OPT}}\| \leq R$. Under such an assumption, it is shown that there exists an iteration $t = O((Rm/d)^{1/3})$ of the SLISOTRON with $\max\{\tilde{\ell}^{\mathrm{SQ}}_\psi(S, h), \tilde{\ell}^{\mathrm{SQ}}_\psi(\mathcal{D}, h)\} \leq O((dR^2/m)^{1/3})$ with high probability ($\tilde{\ell}^{\mathrm{SQ}}_\psi$ is the lower bound of the loss in (6)). Nothing is guaranteed outside this unknown "hitting" point, which we partially attribute to the lack of convergence results on training. Another potential downside from the Bayesian standpoint is that the estimates learned are not guaranteed to be in $[0, 1]$ by the isotonic regression as modeled.

**Algorithm 1** SLISOTRON
**Input:** sample $S = \{(\boldsymbol{x}_i, y_i), i = 1, 2, \ldots, m\}$, iterations $T \in \mathbb{N}_*$.
For $t = 0, 1, \ldots, T - 1$:
- [Step 1] If $t = 0$ Then $\boldsymbol{w}_{t+1} = \boldsymbol{w}_1 = \boldsymbol{0}$; Else fit $\boldsymbol{w}_{t+1}$ using
$$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \frac{1}{m}\sum_{i=1}^m \left(u_t(\boldsymbol{w}_t^\top \boldsymbol{x}_i) - y_i\right)\boldsymbol{x}_i \tag{7}$$
- [Step 2] order indexes in $S$ so that $\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_{i+1} \geq \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_i$, $\forall i \in [m-1]$;
- [Step 3] let $\hat{z}_{(t+1)i} := \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_i$;
- [Step 4] fit next (inverse) link $u_{t+1} \leftarrow \mathrm{IsotonicReg}(\hat{\boldsymbol{z}}_{t+1}, S)$; // fitting of $u_{t+1}$ given $\hat{\boldsymbol{z}}_{t+1}$ (8)

**Output:** $u_T$, $\boldsymbol{w}_T$.

Figure 1: The SLISOTRON algorithm of Kakade et al. (2011).
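For intuition, the following minimal sketch (ours, not the authors' code) implements the SLISOTRON alternation in Python. Note that scikit-learn's plain isotonic regression here stands in for the Lipschitz isotonic regression used in the original analysis, and labels are assumed in $\{0, 1\}$:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def slisotron(X, y, T=50):
    """Sketch of SLISOTRON (Kakade et al., 2011): alternate a
    perceptron-like update on w with an isotonic fit of the inverse link u."""
    m, d = X.shape
    w = np.zeros(d)
    u_hat = np.full(m, 0.5)  # current estimates u_t(w_t . x_i)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    for _ in range(T):
        # Step 1 (cf. (7)): w <- w - (1/m) sum_i (u_t(w.x_i) - y_i) x_i.
        w = w - (1.0 / m) * X.T @ (u_hat - y)
        # Steps 2-4: score, then fit a non-decreasing u on the sorted scores.
        z = X @ w
        u_hat = iso.fit(z, y).predict(z)
    return w, iso
```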
**Learning the loss.** Over the last decade, the problem of learning the loss has seen a considerable push, for a variety of reasons. Sypherd et al. (2019) introduced a family of tunable classification-calibrated losses, aimed at increasing robustness in classification. Mei & Moura (2018) formulated the generalized linear model using Bregman divergences, though no relationship with proper losses is made and the loss function used integrates several regularizers breaking properness; the formal results rely on several quite restrictive assumptions, and the guarantees are loosened if the true composite link comes from a loss that is not strongly convex "enough". In Streeter (2019), the problem studied is in fact learning the regularized part of the logistic loss, with no approximation guarantee. In Grabocka et al. (2019), the goal is to learn a loss defined by a neural network, without reference to proper losses and with no approximation guarantee. Such a line of work also appears in a slightly different form in Liu et al. (2019). In Siahkamari et al. (2019), the loss considered is mainly used for metric learning, but integrates Bregman divergences. No mention of properness is made. Perhaps the most restrictive part of the approach is that it fits piecewise linear divergences, which are therefore neither differentiable nor strictly convex.

Interestingly, none of these recent references alludes to properness to constrain the choice of the loss. Only the modelling of Mei & Moura (2018) can be related to properness, via Theorem 1 proven below. The problem of learning the loss was introduced as loss *tuning* in Nock & Nielsen (2008) (see also Reid & Williamson (2010)). Though a general boosting result was shown for any tuned loss following a particular construction on its Bayes risk, it was restricted to losses defined from a convex combination of a basis set, and no insight on improved convergence rates was given.

## 4. Learning proper canonical losses

We now present BREGMANTRON, our algorithm to learn proper canonical losses by learning a link function. We proceed in two steps: we first show an explicit form for proper differentiable composite losses, and then provide our approach, the BREGMANTRON.

**Every proper differentiable composite loss is Bregman.** Let $F : \mathbb{R} \to \mathbb{R}$ be convex differentiable. The Bregman divergence $D_F$ with generator $F$ is:

$$D_F(z \,\|\, z') := F(z) - F(z') - (z - z')F'(z'). \tag{12}$$

Bregman divergences satisfy a number of convenient properties, many of which are going to be used in our results. In order not to burden the paper's body, we have summarized in SI (Section II) a factsheet of all the results we use. Our first result gives a way to move between proper composite losses and Bregman divergences.

**Theorem 1** *Let $\ell : \mathcal{Y} \times [0, 1] \to \mathbb{R}$ be differentiable and $\psi : [0, 1] \to \mathbb{R}$ invertible. Then $\ell$ is a proper composite loss with link $\psi$ iff it is a Bregman divergence with:*

$$\ell(y^*, h(\boldsymbol{x})) = D_{-\underline{L}}(y \,\|\, \psi^{-1} \circ h(\boldsymbol{x})) = D_{(-\underline{L})^\star}(-\underline{L}' \circ \psi^{-1} \circ h(\boldsymbol{x}) \,\|\, -\underline{L}'(y)), \tag{13}$$

*where $\underline{L}$ is the conditional Bayes risk defined in (2), and we remind the correspondence $y = (y^* + 1)/2$.*

Though similar forms of this Theorem have been proven in the past, in Cranko et al. (2019, Theorem 2), Nock & Nielsen (2008, Lemma 3), Reid & Williamson (2010, Corollary 13), Zhang (2004, Theorem 2.1), Savage (1971, Section 4), none fits exactly the setting of Theorem 1, which is therefore of independent interest and proven in SI, Section III. We now remark that the approach of Kakade et al. (2011) in (6) in fact cannot be replicated for proper canonical losses in general: because any Bregman divergence is convex in its left parameter, we still have, as in (6),

$$\ell_\psi(S, h) \geq \mathbb{E}_{\boldsymbol{x} \sim S}\left[D_{-\underline{L}}(\mathbb{E}_{y \sim S}[y \,|\, \boldsymbol{x}] \,\|\, \psi^{-1} \circ h(\boldsymbol{x}))\right],$$

but the slack in the generalized case can easily be found to be the expected *Bregman information* of the class (Banerjee et al., 2004), $\mathbb{E}_{\boldsymbol{x} \sim S}[\mathcal{I}_{-\underline{L}}(\mathsf{Y} \,|\, \boldsymbol{x})]$, with $\mathcal{I}_{-\underline{L}}(\mathsf{Y} \,|\, \boldsymbol{x}) = -\mathbb{E}_{y \sim S}[\underline{L}(y) \,|\, \boldsymbol{x}] + \underline{L}(\mathbb{E}_{y \sim S}[y \,|\, \boldsymbol{x}])$, which therefore depends on the loss at hand (which in our case is learned as well).
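As a quick sanity check of Theorem 1 (a worked example of ours, not from the paper), take the square loss: with generator $F(u) := -\underline{L}^{\mathrm{SQ}}(u) = (u^2 - u)/2$,

$$D_F(y \,\|\, u) = F(y) - F(u) - (y - u)F'(u) = \frac{y^2 - y}{2} - \frac{u^2 - u}{2} - (y - u)\,\frac{2u - 1}{2} = \frac{1}{2}(y - u)^2,$$

i.e., (half) the square loss of (5), recovering the left-hand identity of (13) in this special case.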
**Learning proper canonical losses.** We now focus on learning class probabilities without restriction, over all losses having the expression in (13), but with the requirement that we use the canonical link, $\psi := -\underline{L}'$, thereby imposing that we learn the loss as well, via its link. Being the central piece of our algorithm, we formally define this function alongside some key parameters that will be learned. Notably, for algorithmic convenience we in fact learn the *inverse* canonical link of a loss, but in order not to burden the paper, we shall also refer to this function as a link for short; its domain or image makes the distinction clear from context.

**Definition 2** *A link $u$ is a strictly increasing function with $\mathrm{Im}\, u = [0, 1]$, for which there exist $z_{\mathrm{MIN}}, z_{\mathrm{MAX}}$ and $0 < n \leq N$ such that (i) $u(z_{\mathrm{MIN}}) = 0$, $u(z_{\mathrm{MAX}}) = 1$ and (ii) $\forall z' \geq z$, $n(z' - z) \leq u(z') - u(z) \leq N(z' - z)$.*

Notice that relaxing $z_{\mathrm{MIN}}, z_{\mathrm{MAX}} \in \overline{\mathbb{R}}$ (the closure of $\mathbb{R}$) and $n, N \in \overline{\mathbb{R}}_+$ would allow to encompass all invertible links, so Definition 2 is not restrictive but rather focuses on simply computable links. Given the canonical link $u$, we let

$$U(z) := \int_{z_{\mathrm{MIN}}}^{z} u(t)\,\mathrm{d}t, \tag{14}$$

from which we obtain the convex surrogate $F_u(z) = U(-z)$ and conditional Bayes risk $\underline{L}_u(v) = -U^\star(v)$ for the proper loss $\ell^u_y(c) := D_{-\underline{L}_u}(y \,\|\, c)$, $y \in \{0, 1\}$. In line with Theorem 1, the loss we seek to minimize is

$$\ell(y^*, h(\boldsymbol{x})) = D_{U^\star}(y \,\|\, u \circ h(\boldsymbol{x})) = D_U(h(\boldsymbol{x}) \,\|\, u^{-1}(y)), \tag{15}$$

with $h(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x}$ a linear classifier. We present BREGMANTRON, our algorithm for fitting such losses, in Figure 2.

**Algorithm 2** BREGMANTRON
**Input:** sample $S = \{(\boldsymbol{x}_i, y_i), i = 1, 2, \ldots, m\}$, iterations $T \in \mathbb{N}_*$.
Initialize $u_0(z) := 0 \vee (1 \wedge (az + b))$;
For $t = 0, 1, \ldots, T - 1$:
- [Step 1] If $t = 0$ Then $\boldsymbol{w}_{t+1} = \boldsymbol{w}_1 = \boldsymbol{0}$; Else fit $\boldsymbol{w}_{t+1}$ using a gradient step towards
$$\boldsymbol{w}_\star := \arg\min_{\boldsymbol{w}} \mathbb{E}_S[D_{U_t}(\boldsymbol{w}^\top \boldsymbol{x} \,\|\, u_t^{-1}(y))]; \quad \text{// proper canonical fitting of } \boldsymbol{w}_{t+1} \text{ given } \hat{\boldsymbol{y}}_t, u_t \tag{9}$$
- [Step 2] order indexes in $S$ so that $\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_{i+1} \geq \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_i$, $\forall i \in [m-1]$;
- [Step 3] fit $\hat{\boldsymbol{y}}_{t+1}$ by solving for the global optimum ($n_t, N_t$ chosen so that $0 < n_t \leq N_t$):
$$\begin{aligned} \hat{\boldsymbol{y}}_{t+1} := \arg\min_{\hat{\boldsymbol{y}}}\ & \mathbb{E}_S[D_{U_t^\star}(\hat{y} \,\|\, \hat{y}_t)] \quad \text{// proper composite fitting of } \hat{\boldsymbol{y}}_{t+1} \text{ given } \boldsymbol{w}_{t+1}, u_t \\ \text{s.t. } & \hat{y}_{i+1} - \hat{y}_i \in \left[n_t \cdot (\boldsymbol{w}_{t+1}^\top(\boldsymbol{x}_{i+1} - \boldsymbol{x}_i)),\ N_t \cdot (\boldsymbol{w}_{t+1}^\top(\boldsymbol{x}_{i+1} - \boldsymbol{x}_i))\right], \ \forall i \in [m-1], \\ & \hat{y}_1 \geq 0, \quad \hat{y}_m \leq 1. \end{aligned} \tag{10}$$
- [Step 4] fit next (inverse) link $u_{t+1} \leftarrow \mathrm{FIT}(\hat{\boldsymbol{y}}_{t+1}, \boldsymbol{w}_{t+1}, S)$; // fitting of $u_{t+1}$ given $\boldsymbol{w}_{t+1}, \hat{\boldsymbol{y}}_{t+1}$ (11)

**Output:** $u_T$, $\boldsymbol{w}_T$.

Figure 2: The BREGMANTRON algorithm.

BREGMANTRON iteratively fits a sequence $u_0, u_1, \ldots$ of links and losses and $\boldsymbol{w}_0, \boldsymbol{w}_1, \ldots$ of classifiers. Here, $\wedge, \vee$ are shorthands for min, max respectively, and in Step 3 we have dropped the iteration index in the optimization problem ($\hat{y}_i$ denotes $\hat{y}_{ti}$). Notice that Steps 1 and 3 exploit the two dual views of proper losses presented in Section 2. Before analyzing BREGMANTRON, we make a few comments. First, in the initialization, we pick $z_{0\mathrm{MAX}} - z_{0\mathrm{MIN}} := \Delta > 0$ and let $a := 1/\Delta$, $b := -z_{0\mathrm{MIN}}/\Delta$. Second, the choice of $n_t, N_t$ is made iteration-dependent for flexibility reasons, since in particular the gradient step does not put an explicit limit on $\boldsymbol{w}_t$. However, there is an implicit constraint, which is to ensure $n_t \leq 1/(\boldsymbol{w}_{t+1}^\top(\boldsymbol{x}_m - \boldsymbol{x}_1)) \leq N_t$ to get a non-empty feasible set for $u_{t+1}$. Third, BREGMANTRON bears close similarity to SLISOTRON, but with two key points to note. In Step 1, we perform a gradient step to minimise a divergence between the current predictions and the labels; fixing a learning rate of $\eta = 1$ in fact reduces to the SLISOTRON update. Further, in Step 3, we perform fitting of $\hat{\boldsymbol{y}}_{t+1}$ based on the previous estimates $\hat{\boldsymbol{y}}_t$, rather than the observed labels themselves as per SLISOTRON. Our Step 3 can thus be seen as a Bregman regularisation step, which ensures the predictions (and thus the link function) do not vary too much across iterates. Such stability ensures asymptotic convergence, but does mean that the initial choice of link can influence the rate of this convergence.
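To make Step 3 concrete, here is a minimal sketch (ours; names and solver choices are illustrative assumptions). It uses the binary-KL generator $U^\star(v) = v\log v + (1-v)\log(1-v)$ as one concrete stand-in for the current $U_t^\star$, and a generic constrained solver in place of the dedicated one used in practice:

```python
import numpy as np
from scipy.optimize import minimize, Bounds, LinearConstraint

def bregman_kl(a, b):
    """D_{U*}(a || b) for U*(v) = v log v + (1-v) log(1-v): binary KL,
    one concrete stand-in for the current conjugate U_t*."""
    a = np.clip(a, 1e-9, 1 - 1e-9)
    b = np.clip(b, 1e-9, 1 - 1e-9)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def step3_fit(y_prev, z, n_t=1e-2, N_t=1.0):
    """Sketch of Step 3: move yhat_{t+1} as little as possible (in Bregman
    divergence) from yhat_t, subject to the Lipschitz band of (10) on the
    sorted scores z = w_{t+1}.x (ascending)."""
    m = len(z)
    dz = np.diff(z)
    # Difference operator: (D @ y)[i] = y[i+1] - y[i].
    D = np.eye(m, k=1)[:-1] - np.eye(m)[:-1]
    band = LinearConstraint(D, n_t * dz, N_t * dz)
    res = minimize(lambda yh: bregman_kl(yh, y_prev).mean(),
                   x0=np.clip(y_prev, 0.05, 0.95),
                   # [0,1] bounds: given the monotone band, equivalent to
                   # the endpoint constraints yhat_1 >= 0, yhat_m <= 1.
                   bounds=Bounds(0.0, 1.0),
                   constraints=[band], method="trust-constr")
    return res.x
```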
Finally, with $z_{(t+1)i} := \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_i$, FIT can be summarized as:

[1] linearly interpolate between $(z_{(t+1)i}, \hat{y}_{(t+1)i})$ and $(z_{(t+1)(i+1)}, \hat{y}_{(t+1)(i+1)})$, $\forall i \in \{2, 3, \ldots, m-2\}$;

[2] pick $z_{(t+1)\mathrm{MIN}} \leq \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1$, $z_{(t+1)\mathrm{MAX}} \geq \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_m$ with

$$\hat{y}_j \in [l_j, r_j], \quad j \in \{1, m\}, \tag{16}$$

and linearly interpolate between $(z_{(t+1)\mathrm{MIN}}, 0)$ and $(z_{(t+1)1}, \hat{y}_{(t+1)1})$, and between $(z_{(t+1)m}, \hat{y}_{(t+1)m})$ and $(z_{(t+1)\mathrm{MAX}}, 1)$. Here, $l_1 := n_t \cdot (\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1 - z_{(t+1)\mathrm{MIN}})$, $r_1 := N_t \cdot (\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1 - z_{(t+1)\mathrm{MIN}})$, $l_m := 1 - N_t \cdot (z_{(t+1)\mathrm{MAX}} - \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_m)$, $r_m := 1 - n_t \cdot (z_{(t+1)\mathrm{MAX}} - \boldsymbol{w}_{t+1}^\top \boldsymbol{x}_m)$.

Figure 3 presents a simple example of FIT in the BREGMANTRON. Our theoretical results depend on the sample-wise variation conditions following from (10). As such, the piecewise affine interpolation in FIT, chosen for its simplicity, can be replaced by other interpolation procedures tailored to other constraints, such as second-order differentiability of the surrogate.

Figure 3: (Inverse) link $u$ of a proper canonical loss learned (red), as computable in FIT ($z_{ti} := \boldsymbol{w}_t^\top \boldsymbol{x}_i$). In blue, the stability constraint in the update of the link (Definition 3). Stability imposes just a constraint on the change of (inverse) link for a single example; it does not impose any constraint on the classifier update (which, in this example, is significant for example $(\boldsymbol{x}_1, y_1)$, see text).
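A minimal sketch of this interpolation (ours; it assumes the endpoint values in (16) have already been chosen) can be written with `np.interp`:

```python
import numpy as np

def fit_link(z, y_hat, z_min, z_max):
    """Sketch of FIT: piecewise-affine (inverse) link u_{t+1} through the
    fitted points (z_i, yhat_i), pinned to 0 at z_min and 1 at z_max.
    Assumes z ascending (Step 2) and yhat feasible for (10)/(16)."""
    knots_z = np.concatenate([[z_min], z, [z_max]])
    knots_u = np.concatenate([[0.0], y_hat, [1.0]])
    # np.interp linearly interpolates between knots and clamps to 0 / 1
    # outside [z_min, z_max], matching Definition 2's image [0, 1].
    return lambda s: np.interp(s, knots_z, knots_u)
```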
**Analysis of BREGMANTRON.** We are now ready to analyze the BREGMANTRON. Our main result shows that provided the link does not change too much between iterations, we are guaranteed to decrease the following loss:

$$\ell^r_t(S, \boldsymbol{w}_{t'}) := \mathbb{E}_S[D_{U_r^\star}(y \,\|\, u_t(\boldsymbol{w}_{t'}^\top \boldsymbol{x}))], \tag{17}$$

for $r, t, t' = 1, 2, \ldots$, which gives (13) for $\underline{L} := -U_r^\star$, $u_t := \psi^{-1}$ and $h(\boldsymbol{x}) := \boldsymbol{w}_{t'}^\top \boldsymbol{x}$. We do not impose $r = t$, as our algorithm incorporates a step of proper composite fitting of the next link given the current loss. We formalise the stability of the link below.

**Definition 3** *Let $\alpha_t, \beta_t \geq 0$. BREGMANTRON is $(\alpha_t, \beta_t)$-stable at iteration $t$ iff the solution $\hat{\boldsymbol{y}}_{t+1}$ in Step 3 satisfies $\hat{y}_1 / u_t(\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1) \in [1 - \beta_t, 1 + \alpha_t]$.*

Since $\hat{y}_1 := u_{t+1}(\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1)$ from FIT, it comes that stability requires a bounded local change in inverse link for a single example (but implies a bounded change for all, via the Lipschitz constraints in Step 3; this is explained in the proof of the main Theorem of this Section). We define the following *mean operators* (Patrini et al., 2014): $\hat{\boldsymbol{\mu}}_y := \mathbb{E}_S[y \boldsymbol{x}]$ (sample), $\hat{\boldsymbol{\mu}}_t := \mathbb{E}_S[\hat{y}_t \boldsymbol{x}]$, $t \geq 1$ (estimated), where $\hat{\boldsymbol{y}}_t$ is defined in the BREGMANTRON. We also let $p^*_t := \max\{\mathbb{E}_S[y], \mathbb{E}_S[u_t(\boldsymbol{w}_{t+1}^\top \boldsymbol{x})]\} \in [0, 1]$ denote the max estimated $\hat{\Pr}[\mathsf{Y} = 1]$ using both our model and the sample $S$. Assume $p^*_t > 0$, as otherwise the problem is trivial. Finally, $X := \max_i \|\boldsymbol{x}_i\|_2$ (we consider the $L_2$ norm for simplicity; our result holds for any norm on $\mathcal{X}$).

**Definition 4** *BREGMANTRON is said to be in the $\delta_t$-regime at iteration $t$, for some $\delta_t > 0$, iff:*

$$\|\hat{\boldsymbol{\mu}}_y - \hat{\boldsymbol{\mu}}_t\|_2 \geq 2\sqrt{p^*_t}\,\delta_t X, \quad \forall t. \tag{18}$$

To simplify the statement of our Theorem, we let $f(z) := z/(1 + z)$, which satisfies $f(\overline{\mathbb{R}}_+) = [0, 1]$.

**Theorem 5** *Suppose that BREGMANTRON is in the $\delta_t$-regime at iteration $t$, and the following holds: in Step 1, the learning rate is*

$$\eta_t = \frac{1 - \gamma_t}{2 N_t X^2}\left(1 - \frac{f(\delta_t)(1 + f(\delta_t))\,p^*_t X}{\|\hat{\boldsymbol{\mu}}_y - \hat{\boldsymbol{\mu}}_t\|_2}\right)$$

*for some user-fixed $\gamma_t \in \left[0, \sqrt{f(f(\delta_t))/2}\right]$; in Step 3, $N_t, n_t$ satisfy $N_t/n_t, N_{t-1}/n_{t-1} \leq 1 + f(\delta_t)$. Then if the BREGMANTRON is $(f(\delta_t), f(\delta_t))$-stable, then:*

$$\ell^{t+1}_{t+1}(S, \boldsymbol{w}_{t+1}) \leq \ell^t_t(S, \boldsymbol{w}_t) - p^*_t f(\delta_t) \tag{19}$$

(proof in SI, Section IV). Explicitly, it can be shown that the learning rate at iteration $t$ lies in the following interval (using the interval arithmetic of Section 2):

$$\eta_t \in \frac{1 - \gamma_t}{2 N_t X^2}\left(1 - \frac{\delta_t p^*_t (2 + \delta_t)}{2(1 + \delta_t)^2} \cdot \left[\sqrt{\delta_t p^*_t}, 1\right]\right).$$

Theorem 5 essentially says that as long as $\hat{\boldsymbol{\mu}}_y \neq \hat{\boldsymbol{\mu}}_t$, we can hope to get better results. This is no surprise: the gradient step in Step 1 of BREGMANTRON is proportional to $\hat{\boldsymbol{\mu}}_y - \hat{\boldsymbol{\mu}}_t$. The conditions on Steps 1 and 3 are easily enforceable at any step of the algorithm, so the Theorem essentially says that whenever the link does not change too much between iterations, we are guaranteed a decrease in the loss, and therefore a better fit of the class probabilities. Stability is the only assumption made: unlike Kakade et al. (2011), no assumptions are made about Bayes rule or the distribution $\mathcal{D}$, and no constraints are put on the classifier $\boldsymbol{w}$. We can also choose to enforce stability in the update of $u$ in Step 3. Interestingly, while this restricts the choice of links (at least when $\|\hat{\boldsymbol{\mu}}_y - \hat{\boldsymbol{\mu}}_t\|_2$ is small), this guarantees the bound in (19) at no additional cost or assumption.

**Corollary 6** *Suppose BREGMANTRON is run so that in Step 3, constraint $\hat{y}_1 \geq 0$ is replaced by*

$$\hat{y}_1 / u_t(\boldsymbol{w}_{t+1}^\top \boldsymbol{x}_1) \in [1 - \beta_t, 1 + \alpha_t], \tag{20}$$

*and all other constraints are kept the same. Suppose BREGMANTRON is in the $\delta_t$-regime at iteration $t$, parameters $\eta_t, N_t, n_t$ are fixed as in Theorem 5, and furthermore $\alpha_t, \beta_t \in [0, f(\delta_t)]$. Then (19) holds.*

(proof in SI, Section V) There exists an alternative reading of Corollary 6: there exists a way to fix the key parameters $(\eta_t, n_t, N_t, \alpha_t, \beta_t)$ at each iteration such that a decrease of the loss is guaranteed if our current estimate of the sample mean operator, $\hat{\boldsymbol{\mu}}_t$, is not good enough.

## 5. Discussion

Bregman divergences have had a rich history outside of convex optimisation, where they were introduced (Bregman, 1967). They are the canonical distortions on the manifold of parameters of exponential families in information geometry (Amari & Nagaoka, 2000), and they have been introduced in normative economics in several contexts (Magdalou & Nock, 2011; Shorrocks, 1980). In machine learning, their rediscovery was grounded in their representation and algorithmic properties, starting with the work of M. Warmuth and collaborators (Helmbold et al., 1995; Herbster & Warmuth, 1998), later linked back to exponential families (Azoury & Warmuth, 2001), then axiomatized in unsupervised learning (Banerjee et al., 2004; 2005), and then in supervised learning (see Section 4). Classical uniform convergence bounds apply to the convex surrogate of any proper canonical loss (1-Lipschitz), so we refer to Bartlett & Mendelson (2002, Section 2) for available tools.

The setting of the BREGMANTRON raises two questions, the first of which is crucial for the algorithm. We make no assumption about the optimal link, which amounts to a powerful agnostic view of machine learning pursued in a number of works (Bousquet et al., 2019), but it makes much more sense if we can prove that the link fit by FIT belongs to a set containing reasonable approximations of the target. This set contains piecewise affine links, which is a bit more general than Definition 2 but matches the links learned by the BREGMANTRON. We remove the index notation $T$ in $u_T$ and $z_{T\cdot}$, and consider the following $\ell_1$ restricted discrepancy,

$$\mathcal{E}(u, \ell) := \int_{z_1}^{z_m} |(-\underline{L}')^{-1}(z) - u(z)|\,\mathrm{d}z, \tag{21}$$

where $\ell$ is proper canonical with invertible canonical link. It is restricted because we do not consider the set $(-\infty, z_1) \cup (z_m, +\infty)$, whose fitting in fact does not depend on data (see Figure 3). Denote $\mathcal{U}_{n,N}(\boldsymbol{w}, S)$ the set of piecewise affine links with non-$\{0, 1\}$ breakout points on abscissae $z_1, z_2, \ldots, z_m$ ($\boldsymbol{w}^\top \boldsymbol{x}_i =: z_i < z_{i+1} := \boldsymbol{w}^\top \boldsymbol{x}_{i+1}$, $\forall i \in [m-1]$, wlog), satisfying Definition 2.
Let $\|\cdot\|$ be any norm on $\mathcal{X}$ and $\|\cdot\|_\star$ its dual. For any $\varepsilon > 0$, $G_\varepsilon(S)$ is the graph whose vertices are the observations in $S$, and an edge links $\boldsymbol{x}, \boldsymbol{x}' \in S$ iff $\|\boldsymbol{x} - \boldsymbol{x}'\| \leq \varepsilon$. $G$ is said 2-connected iff it remains connected when any single vertex is removed.

**Lemma 7** *For any $S$ of size $m$, any $\varepsilon > 0$ such that $G_\varepsilon(S)$ is 2-connected, and any proper canonical loss $\ell$, there exist $n, N$ such that $\inf_{u \in \mathcal{U}_{n,N}(\boldsymbol{w}, S)} \mathcal{E}(u, \ell) \leq 2Nm\varepsilon^2 \|\boldsymbol{w}\|_\star^2$.*

2-connectivity is a lightweight connectivity condition that essentially prevents the graph from consisting of two almost separate subgraphs. The proof of the Lemma relies on an old result from graph theory connecting 2-connectivity and Hamiltonicity, known as Fleischner's Theorem (Fleischner, 1974), Gross & Yellen (2004, p. 265, F17). Crucially, $N$ can be much smaller than the Lipschitz constant of $(-\underline{L}')^{-1}$. Lemma 7 does guarantee that the set of links in which the BREGMANTRON finds $u_T$ is powerful enough to approximate a link, provided we sample enough examples to drive $\varepsilon$ small enough while guaranteeing that $G_\varepsilon(S)$ is 2-connected. This does not require i.i.d. sampling, but would require additional assumptions about $\mathcal{X}$ to be tractable (such as boundedness), or the possibility of active learning in $\mathcal{X}$. This also does not guarantee that FIT finds a link with small $\mathcal{E}(\cdot, \ell)$, and this brings us to our second question: is it possible that (near-)optimal solutions contain very "different" couples $(u, \boldsymbol{w})$, and for which useful notion(s) of "different"? This, we believe, has ties with the transferability of the loss.
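To make the condition of Lemma 7 concrete, here is a small sketch (ours; it assumes the `networkx` package, whose biconnectivity test coincides with 2-connectedness for three or more vertices) that builds $G_\varepsilon(S)$ and checks it:

```python
import numpy as np
import networkx as nx

def eps_graph_is_2_connected(X, eps):
    """Build G_eps(S): vertices are the observations, an edge links two
    observations at (Euclidean, here) distance <= eps. Lemma 7 asks this
    graph to be 2-connected."""
    m = len(X)
    G = nx.Graph()
    G.add_nodes_from(range(m))
    for i in range(m):
        for j in range(i + 1, m):
            if np.linalg.norm(X[i] - X[j]) <= eps:
                G.add_edge(i, j)
    return nx.is_biconnected(G)
```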
Last, deep learning has achieved tremendous success on the question of how one can learn a mapping $\varphi$ from a general space $\mathcal{X}$ to $\mathbb{R}^d$, where $\mathcal{X}$ possesses only weak mathematically amenable properties. Our work can be directly branched after (or during) the training of $\varphi$ to get a complete proper learning pipeline for $\Pr[\mathsf{Y} = 1 \,|\, \boldsymbol{x}] := u(\boldsymbol{w}^\top \varphi(\boldsymbol{x}))$. Folding BREGMANTRON within the training of $\varphi$ would however likely require significant algorithmic improvements of Lipschitz isotonic regression for an efficient training pipeline.

## 6. Experimental results

We present experiments illustrating:

(a) the viability of the BREGMANTRON as an alternative to classic GLM or SLISOTRON learning;

(b) the nature of the loss functions learned by the BREGMANTRON, which are potentially asymmetric;

(c) the potential of using the loss function learned by the BREGMANTRON as input to some downstream learner.

**Predictive performance of BREGMANTRON.** We compare BREGMANTRON as a generic binary classification method against the following baselines: logistic regression, GLMTron (Kakade et al., 2011) with $u(\cdot)$ the sigmoid, and SLISOTRON. We also consider two variants of BREGMANTRON: one where in Step 3 we do not find the global optimum (BREGMANTRON_approx), but rather a feasible solution with minimal $\hat{y}_m$; and another where in Step 3 we fit against the labels, rather than $\hat{\boldsymbol{y}}_t$ (BREGMANTRON_label). In all experiments, we fix the following parameters for BREGMANTRON: we use a constant learning rate of $\eta = 1$ to perform the gradient update in Step 1; for Step 3, we fix $n_t = 10^{-2}$ and $N_t = 1$ for all iterations.

We compare performance on two standard benchmark datasets, the MNIST digits (mnist) and the fashion MNIST (fmnist); on the latter domain, we sub-sample the data to 1000 points for computational efficiency. We converted the former to a binary classification problem of the digits 0 versus 8, and the latter of the odd versus even classes. We also consider a synthetic dataset (synth), comprising 2D Gaussian class-conditionals with means $\pm(1, 1)$ and identity covariance matrix. The Bayes-optimal solution for $\Pr[\mathsf{Y} = 1 \,|\, \boldsymbol{x}]$ can be derived in this case: it takes the form of a sigmoid, as assumed by logistic regression, composed with a linear model whose weights are proportional to the class mean. In this case, therefore, logistic regression works on a search space much smaller than BREGMANTRON's and guaranteed to contain the optimum.

On a given dataset, we measure the predictive performance of each method via the area under the ROC curve (AUC). This assesses the ranking quality of predictions, which provides a commensurate means of comparison; in particular, the BREGMANTRON optimises a bespoke loss function that can be vastly different from the square loss.

| Method | synth | mnist | fmnist |
|---|---|---|---|
| Logistic regression | 92.2% | 99.9% | 98.5% |
| GLMTron | 92.2% | 99.6% | 98.1% |
| SLISOTRON | 91.6% | 94.6% | 90.7% |
| BREGMANTRON_approx | 92.2% | 99.3% | 94.6% |
| BREGMANTRON_label | 90.1% | 99.6% | 97.7% |
| BREGMANTRON | 92.3% | 99.7% | 97.9% |

Table 1: Test set AUC of various methods on binary classification datasets. See text for details.

Table 1 summarises the results. We make three observations. First, BREGMANTRON is consistently competitive with the mature baseline of logistic regression. Interestingly, this is even so on the synth problem, wherein logistic regression is correctly specified. Although the difference in performance here is minor, it does illustrate that BREGMANTRON can infer a meaningful pair $(u, \boldsymbol{w})$. Second, BREGMANTRON and BREGMANTRON_label are generally superior to the SLISOTRON. We attribute this to the latter's reliance on an isotonic regression step to fit the links, as opposed to a Bregman regularisation. Third, while BREGMANTRON_approx also performs reasonably, it is typically worse than the full BREGMANTRON. This illustrates the value of (at least approximately) solving Step 3 to the global optimum. Further, while BREGMANTRON_label generally performs slightly worse than standard BREGMANTRON, it remains competitive. A formal analysis of this method would be of interest in future work.

**Illustration of learned losses.** As with the SLISOTRON, a salient feature of BREGMANTRON is the ability to automatically learn a link function. Unlike the SLISOTRON, however, the link in the BREGMANTRON has an interpretation as corresponding to a canonical loss function. Figure 4 illustrates the link functions learned by BREGMANTRON on each dataset. We see that these links are generally asymmetric about $\frac{1}{2}$. This is in contrast to standard link functions such as the sigmoid. Recall that each link corresponds to an underlying canonical loss, whose derivative is given by $\ell'(y, v) = u(v) - y$. Asymmetry of $u(\cdot)$ thus manifests in $\ell(+1, v) \neq \ell(-1, -v)$. We illustrate these implicit canonical losses in Figure 5. As a consequence of the links not being symmetric around $\frac{1}{2}$, the losses on the positive and negative classes are not symmetric for the synth dataset. This is unlike the theoretical link, but the theoretical link may not be optimal at all on sampled data. This, we believe, also illustrates the intriguing potential of the BREGMANTRON to detect and exploit hidden asymmetries in the underlying data distribution.

Figure 4: (Inverse) link functions estimated by BREGMANTRON on each dataset (panels: synth, mnist, fmnist). On all datasets, the links are seen to be (slightly) asymmetric around $\frac{1}{2}$, i.e., $u(v) \neq 1 - u(-v)$; in particular, $u(0) \neq \frac{1}{2}$.

Figure 5: Loss functions estimated by BREGMANTRON on each dataset (panels: synth, mnist, fmnist). The losses are (slightly) asymmetric, and have the flavour of the square-hinge loss; this is a consequence of the linear interpolation used when fitting.
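To recover such implicit losses from a learned link, one can integrate the canonical-loss identity $\ell'(y, v) = u(v) - y$ numerically; a minimal sketch (ours; the function name and discretisation are illustrative):

```python
import numpy as np

def canonical_losses(u, z_min, z_max, num=2001):
    """Recover the implicit canonical partial losses from an (inverse)
    link u. Since l'(y, v) = u(v) - y for canonical losses, the
    positive-class loss is int_v^{z_max} (1 - u(t)) dt and the
    negative-class loss is int_{z_min}^v u(t) dt (up to constants)."""
    v = np.linspace(z_min, z_max, num)
    uv = u(v)
    dv = v[1] - v[0]
    loss_neg = np.cumsum(uv) * dv                    # l(-1, v): increasing in v
    loss_pos = np.cumsum((1 - uv)[::-1])[::-1] * dv  # l(+1, v): decreasing in v
    return v, loss_pos, loss_neg
```

Used e.g. with the `fit_link` sketch above, this reproduces the kind of plots shown in Figure 5.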
**Transferability of the loss between domains.** Finally, we illustrate the potential of recycling the loss function implicitly learned by BREGMANTRON for some other task. We take the fmnist dataset, and first train BREGMANTRON to classify the classes 0 versus 6 ("T-shirt" versus "Shirt"). This classifier achieves an AUC of 0.85, which is competitive with the logistic regression AUC of 0.86. Recall that BREGMANTRON gives us a learned link $u$, which per the above discussion also defines an implicit canonical loss. By training a classifier to distinguish classes 2 versus 4 ("Pullover" versus "Coat") using this loss function, we achieve an AUC of 0.879. This slightly outperforms the 0.877 AUC of logistic regression, and is also competitive with the 0.879 AUC attained by training BREGMANTRON directly on classes 2 versus 4. This indicates that the loss learned by BREGMANTRON on one domain could be useful to other classification algorithms training classifiers on related domains. Properly developing this possibility is out of the scope of this paper, and as far as we know such a perspective is new in machine learning.

## 7. Conclusion

Fitting a loss that complies with Bayes decision theory implies learning not just a classifier, but also a canonical link of a proper loss, and therefore a proper canonical loss. In seminal work, Kakade et al. (2011) made, with the SLISOTRON algorithm, the first attempt at solving this bigger picture of supervised learning. We propose in this paper a more general approach, grounded in a general Bregman formulation of differentiable proper canonical losses. From a formal standpoint, an interesting avenue for future work is the inclusion of a regulariser in the loss: we conjecture that the choice of a gradient step in Step 1 of BREGMANTRON makes it convenient to devise and analyse such extensions. Experiments tend to confirm the ability of our approach, the BREGMANTRON, to significantly beat the SLISOTRON, and to compete with classical supervised approaches even when they are informed with the optimal choice of link. Interestingly, they seem to illustrate the importance of the stability requirement made by our theory. More interesting is perhaps the observation that the loss learned by the BREGMANTRON on one domain can be useful to other learning algorithms fitting classifiers on related domains, a transferability property of the learned loss that deserves further thought.

**Acknowledgments.** The authors warmly thank Manfred Warmuth for discussions around the introduction of Bregman divergences in machine learning. Thanks are also due to the reviewers and Lalitha Sankar for numerous remarks and discussions.
## References

Amari, S.-I. and Nagaoka, H. Methods of Information Geometry. Oxford University Press, 2000.

Auer, P., Herbster, M., and Warmuth, M. Exponentially many local minima for single neurons. In NIPS*8, pp. 316–322, 1995.

Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. MLJ, 43(3):211–246, 2001.

Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. In Proc. of the 4th SIAM International Conference on Data Mining, pp. 234–245, 2004.

Banerjee, A., Guo, X., and Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664–2669, 2005.

Bartlett, P., Jordan, M., and McAuliffe, J. D. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138–156, 2006.

Bartlett, P.-L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.

Bousquet, O., Kane, D., and Moran, S. The optimal approximation factor in density estimation. In COLT'19, pp. 318–341, 2019.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.

Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: structure and applications, 2005. Technical Report, University of Pennsylvania.

Cranko, Z., Menon, A.-K., Nock, R., Ong, C. S., Shi, Z., and Walder, C.-J. Monge blunts Bayes: Hardness results for adversarial training. In 36th ICML, pp. 1406–1415, 2019.

de Finetti, B. Rôle et domaine d'application du théorème de Bayes selon les différents points de vue sur les probabilités (in French). In 18th International Congress on the Philosophy of Sciences, pp. 67–82, 1949.

Fleischner, H. The square of every two-connected graph is Hamiltonian. Journal of Combinatorial Theory, Series B, 16:29–34, 1974.

Grabocka, J., Scholz, R., and Schmidt-Thieme, L. Learning surrogate losses. CoRR, abs/1905.10108, 2019.

Gross, J.-L. and Yellen, J. Handbook of Graph Theory. CRC Press, 2004. ISBN 1-58488-090-2.

Helmbold, D.-P., Kivinen, J., and Warmuth, M.-K. Worst-case loss bounds for single neurons. In NIPS*8, pp. 309–315, 1995.

Herbster, M. and Warmuth, M. Tracking the best regressor. In 9th COLT, pp. 24–31, 1998.

Kakade, S., Kalai, A.-T., Kanade, V., and Shamir, O. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS*24, pp. 927–935, 2011.

Kearns, M. J. and Vazirani, U. V. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.

Liu, L., Wang, M., and Deng, J. UniLoss: Unified surrogate loss by adaptive interpolation. https://openreview.net/forum?id=ryegXAVKDB, 2019.

Magdalou, B. and Nock, R. Income distributions and decomposable divergence measures. Journal of Economic Theory, 146(6):2440–2454, 2011.

Mei, J. and Moura, J.-M.-F. SILVar: Single index latent variable models. IEEE Trans. Signal Processing, 66(11):2790–2803, 2018.

Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pp. 1201–1208, 2008.

Nock, R. and Nielsen, F. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048–2059, 2009.

Nock, R. and Williamson, R.-C. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pp. 4829–4838, 2019.

Patrini, G., Nock, R., Rivera, P., and Caetano, T. (Almost) no label no cry. In NIPS*27, 2014.

Reid, M.-D. and Williamson, R.-C. Composite binary losses. JMLR, 11:2387–2422, 2010.

Savage, L.-J. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783–801, 1971.

Shorrocks, A.-F. The class of additively decomposable inequality measures. Econometrica, 48:613–625, 1980.

Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning Bregman divergences. CoRR, abs/1905.11545, 2019.
Streeter, M. Learning effective loss functions efficiently. CoRR, abs/1907.00103, 2019.

Sypherd, T., Diaz, M., Laddha, H., Sankar, L., Kairouz, P., and Dasarathy, G. A tunable loss function for classification. CoRR, abs/1906.02314, 2019.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56–134, 2004.