# Learning with Fitzpatrick Losses

Seta Rakotomandimby (École des Ponts, seta.rakotomandimby@enpc.fr) · Jean-Philippe Chancelier (École des Ponts, jean-philippe.chancelier@enpc.fr) · Michel De Lara (École des Ponts, michel.delara@enpc.fr) · Mathieu Blondel (Google DeepMind, mblondel@google.com)

**Abstract.** Fenchel-Young losses are a family of loss functions, encompassing the squared, logistic and sparsemax losses, among others. They are convex w.r.t. the model output and the target, separately. Each Fenchel-Young loss is implicitly associated with a link function that maps model outputs to predictions. For instance, the logistic loss is associated with the soft argmax link function. Can we build new loss functions associated with the same link function as Fenchel-Young losses? In this paper, we introduce Fitzpatrick losses, a new family of separately convex loss functions based on the Fitzpatrick function. A well-known theoretical tool in maximal monotone operator theory, the Fitzpatrick function naturally leads to a refined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses, while maintaining the same link function for prediction. As examples, we introduce the Fitzpatrick logistic loss and the Fitzpatrick sparsemax loss, counterparts of the logistic and sparsemax losses. This yields two new tighter losses associated with the soft argmax and the sparse argmax, two of the most ubiquitous output layers used in machine learning. We study the properties of Fitzpatrick losses in detail and, in particular, show that they can be seen as Fenchel-Young losses using a modified, target-dependent generating function. We demonstrate the effectiveness of Fitzpatrick losses for label proportion estimation.

## 1 Introduction

Loss functions are a cornerstone of statistics and machine learning: they measure the difference, or loss, between a ground-truth target and a model prediction.
As such, they have attracted a wealth of research. Proper losses (a.k.a. proper scoring rules) [17, 16] measure the discrepancy between a target distribution and a probability forecast. They are essentially primal-primal Bregman divergences, with both the target and the prediction belonging to the same primal space. They are typically explicitly composed with a link function [26, 30], in order to map the model output to a prediction. A disadvantage of this explicit composition is that it often makes the resulting composite loss function non-convex. A related family of loss functions are Fenchel-Young losses [7, 8], which encompass many commonly-used loss functions in machine learning, including the squared, logistic, sparsemax and perceptron losses. Fenchel-Young losses can be seen as primal-dual Bregman divergences [1], with the target belonging to the primal space and the model output belonging to the dual space. In contrast to proper losses, each Fenchel-Young loss is implicitly associated with a given link function, mapping the dual-space model output to a primal-space prediction (for instance, the soft argmax is the link function associated with the logistic loss). This crucial difference makes Fenchel-Young losses always convex w.r.t. the model output and w.r.t. the target, separately. Can we build new (separately) convex losses associated with the same link function as Fenchel-Young losses?

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1 (plots of the logistic loss vs. the Fitzpatrick logistic loss, and of the sparsemax loss vs. the Fitzpatrick sparsemax loss, as functions of $s$): We introduce Fitzpatrick losses, a new family of loss functions $L(y, \theta)$ generated by a convex regularization function $\Omega$, that are lower bounds of the Fenchel-Young losses generated by the same function $\Omega$, while maintaining the same link function $\hat{y}_\Omega = \nabla \Omega^*$.
In particular, we use our framework to instantiate the counterparts of the logistic and sparsemax losses, two instances of Fenchel-Young losses, associated with the soft argmax and the sparse argmax. In the figure, we plot $L(y, \theta)$, where $y = e_1$, $\theta = (s, 0)$ and $L \in \{L_{F[\partial \Omega]}, L_{\Omega \oplus \Omega^*}\}$, confirming the lower-bound property.

In this paper, we introduce Fitzpatrick losses, a new family of primal-dual (separately) convex loss functions. Our proposal builds upon the Fitzpatrick function, a well-known theoretical object in maximal monotone operator theory [15, 11, 2]. So far, the Fitzpatrick function had been used as a theoretical tool to represent maximal monotone operators [28] and to construct Bregman-like primal-primal divergences [10], but it had not been used to construct primal-dual loss functions for machine learning, as we do. Crucially, the Fitzpatrick function naturally leads to a refined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses. Yet, their predictions are produced using the same link function, suggesting that we can use Fitzpatrick losses as a tighter replacement for the corresponding Fenchel-Young losses (Figure 1). We make the following contributions. After reviewing some background, we introduce Fitzpatrick losses. They can be thought of as a tighter version of Fenchel-Young losses that uses the same link function. We instantiate two new loss functions in this family: the Fitzpatrick logistic loss and the Fitzpatrick sparsemax loss. They are the counterparts of the logistic and sparsemax losses, two instances of Fenchel-Young losses. We therefore obtain two new tighter losses for the soft argmax and the sparse argmax, two of the most popular output layers in machine learning. We study the properties of Fitzpatrick losses in detail. We show that Fitzpatrick losses are equivalent to Fenchel-Young losses with a modified, target-dependent generating function.
We demonstrate the effectiveness of Fitzpatrick losses for probabilistic classification on 11 datasets.

## 2 Background

### 2.1 Convex analysis

We denote the non-negative real numbers by $\mathbb{R}_+ := [0, +\infty)$, the positive real numbers by $\mathbb{R}_{++} := (0, +\infty)$, and the extended real numbers by $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$. We suppose given a nonzero natural number $k$. We denote the probability simplex by $\triangle^k := \{p \in \mathbb{R}^k_+ \mid \sum_{i=1}^k p_i = 1\}$. We denote the indicator function of a set $C \subseteq \mathbb{R}^k$ by $\iota_C(y) = 0$ if $y \in C$, $+\infty$ otherwise. We denote the effective domain of a function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ by $\operatorname{dom} \Omega := \{y \in \mathbb{R}^k \mid \Omega(y) < +\infty\}$. A function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ is said to be proper if it never takes the value $-\infty$ and if $\operatorname{dom} \Omega \neq \emptyset$. We denote the Euclidean projection onto a nonempty closed convex set $C \subseteq \mathbb{R}^k$ by $P_C(\theta)$, the unique solution $y \in C$ of the minimization problem $\min_{y \in C} \|y - \theta\|_2^2$. For a function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$, its subdifferential $\partial \Omega \subseteq \mathbb{R}^k \times \mathbb{R}^k$ is defined by

$(y', \theta') \in \partial \Omega \iff \theta' \in \partial \Omega(y') \iff \Omega(y) \geq \Omega(y') + \langle y - y', \theta' \rangle, \quad \forall y \in \mathbb{R}^k.$

When $\Omega$ is convex and differentiable, the subdifferential is a singleton and we have $\partial \Omega(y') = \{\nabla \Omega(y')\}$. The normal cone to a nonempty closed convex set $C \subseteq \mathbb{R}^k$ at $y'$ is defined by $\theta \in N_C(y') \iff \langle y - y', \theta \rangle \leq 0, \ \forall y \in C$ if $y' \in C$, and $N_C(y') = \emptyset$ if $y' \notin C$. The Fenchel conjugate $\Omega^* \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ of a function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ is defined by

$\Omega^*(\theta) := \sup_{y' \in \mathbb{R}^k} \langle y', \theta \rangle - \Omega(y').$

From standard results in convex analysis [27, Proposition 11.3], when $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ is a proper convex l.s.c. (lower semicontinuous) function,

$\partial \Omega^*(\theta) = \operatorname{argmax}_{y' \in \mathbb{R}^k} \ \langle y', \theta \rangle - \Omega(y').$

When the argmax is unique, it is equal to $\nabla \Omega^*(\theta)$. We define the generalized Bregman divergence [18] $D_\Omega \colon \mathbb{R}^k \times \mathbb{R}^k \to \overline{\mathbb{R}}_+$ generated by a proper convex l.s.c. function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ by

$D_\Omega(y, y') := \Omega(y) - \Omega(y') - \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle, \quad (1)$

with the convention $+\infty + (-\infty) = +\infty$ (we comment on this convention in Appendix C). When $\Omega$ is differentiable, Equation (1) gives the classical Bregman divergence $D_\Omega(y, y') := \Omega(y) - \Omega(y') - \langle y - y', \nabla \Omega(y') \rangle$. Both $y$ and $y'$ belong to the primal space.

### 2.2 Fenchel-Young losses

**Definition and properties.** The Fenchel-Young loss [8] $L_{\Omega \oplus \Omega^*} \colon \mathbb{R}^k \times \mathbb{R}^k \to \overline{\mathbb{R}}$ generated by a proper convex l.s.c.
function $\Omega$ is

$L_{\Omega \oplus \Omega^*}(y, \theta) := (\Omega \oplus \Omega^*)(y, \theta) - \langle y, \theta \rangle := \Omega(y) + \Omega^*(\theta) - \langle y, \theta \rangle.$

As its name indicates, it is grounded in the Fenchel-Young inequality

$\langle y, \theta \rangle \leq \Omega(y) + \Omega^*(\theta), \quad \forall y, \theta \in \mathbb{R}^k.$

The Fenchel-Young loss enjoys many desirable properties; notably, it is non-negative and it is separately convex in $y$ and $\theta$. The Fenchel-Young loss can be seen as a primal-dual Bregman divergence [1, 8], where $y$ belongs to the primal space and $\theta$ belongs to the dual space.

**Link functions.** To map an element $\theta$ of a dual space to an element $y$ of a primal space, we define the link function (potentially set-valued) associated with the loss $L$ by $\theta \mapsto \{y \in \mathbb{R}^k \mid L(y, \theta) = 0\}$. Given a proper convex function $\Omega$, the associated Fenchel-Young loss $L_{\Omega \oplus \Omega^*}$ produces the canonical link function $\partial \Omega^*$, since $L_{\Omega \oplus \Omega^*}(y, \theta) = 0 \iff y \in \partial \Omega^*(\theta)$. In particular, when $\Omega$ is strictly convex, and thus $\Omega^*$ is differentiable according to [27, Theorem 11.13], the Fenchel-Young loss satisfies the identity of indiscernibles $L_{\Omega \oplus \Omega^*}(y, \theta) = 0 \iff y = \nabla \Omega^*(\theta)$. In the remainder of this paper, we will use the notation $\hat{y}_\Omega(\theta)$ for the canonical link function $\partial \Omega^*(\theta)$. When the function $\Omega^*$ is differentiable, $\hat{y}_\Omega(\theta)$ will denote $\nabla \Omega^*(\theta)$. Since $\Omega^*$ is convex, $\hat{y}_\Omega$ is monotone (see Section 2.3). As shown in [8], the monotonicity implies that $\theta$ and $\hat{y}_\Omega(\theta)$ are sorted the same way, i.e., $\theta_i > \theta_j \Rightarrow \hat{y}_\Omega(\theta)_i \geq \hat{y}_\Omega(\theta)_j$. Link functions also play an important role in the loss subgradients (and in the loss gradient when the loss is differentiable), as we have

$\nabla_\theta L_{\Omega \oplus \Omega^*}(y, \theta) = \hat{y}_\Omega(\theta) - y. \quad (2)$

**Examples of Fenchel-Young loss instances and their associated link function.** We give a few examples of Fenchel-Young losses. With the squared 2-norm, $\Omega(y') = \frac{1}{2}\|y'\|_2^2$, we obtain the squared loss

$L_{\Omega \oplus \Omega^*}(y, \theta) = L_{\text{squared}}(y, \theta) := \frac{1}{2}\|y - \theta\|_2^2$

and the identity link $\hat{y}_\Omega(\theta) = \theta$. With the indicator of a nonempty closed convex set $C$, $\Omega(y') = \iota_C(y')$, we obtain the perceptron loss

$L_{\Omega \oplus \Omega^*}(y, \theta) = L_{\text{perceptron}}(y, \theta) := \max_{y' \in C} \langle y', \theta \rangle - \langle y, \theta \rangle$

and the argmax link $\hat{y}_\Omega(\theta) = \operatorname{argmax}_{y' \in C} \langle y, \theta \rangle$.
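As a concrete illustration of a Fenchel-Young loss and its link, the logistic loss and its soft argmax link (introduced formally just below) can be sketched in NumPy as follows; the function names are ours, not the paper's:

```python
import numpy as np

def softargmax(theta):
    """Soft argmax link: exp(theta) / sum(exp(theta)), computed stably."""
    z = theta - np.max(theta)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logistic_fy_loss(y, theta):
    """Fenchel-Young loss for Omega(y) = <y, log y> + indicator of the simplex:
    log-sum-exp(theta) + <y, log y> - <y, theta>, with 0 log 0 = 0."""
    m = np.max(theta)
    lse = m + np.log(np.sum(np.exp(theta - m)))
    neg_entropy = np.sum(np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0)), 0.0))
    return lse + neg_entropy - np.dot(y, theta)

theta = np.array([1.0, -0.5, 0.3])
y = np.array([1.0, 0.0, 0.0])          # target e_1 in the simplex
loss = logistic_fy_loss(y, theta)
grad = softargmax(theta) - y           # gradient, as in Equation (2)
```

The loss vanishes exactly when the target equals the soft argmax of the scores, which is the defining property of the link function.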
With the squared 2-norm restricted to some nonempty closed convex set $C$, $\Omega(y') = \frac{1}{2}\|y'\|_2^2 + \iota_C(y')$, we obtain the SparseMAP loss [24]

$L_{\Omega \oplus \Omega^*}(y, \theta) = L_{\text{SparseMAP}}(y, \theta) := \frac{1}{2}\|y - \theta\|_2^2 - \frac{1}{2}\|P_C(\theta) - \theta\|_2^2,$

and the link is the Euclidean projection onto $C$, $\hat{y}_\Omega(\theta) = P_C(\theta)$. When the set is $C = \triangle^k$, we obtain the sparsemax loss [21] and the sparse argmax link $\hat{y}_\Omega(\theta) = P_{\triangle^k}(\theta)$ (also known as sparsemax), which is known to produce sparse probability distributions. With the Shannon negentropy restricted to the probability simplex, $\Omega(y') := \langle y', \log y' \rangle + \iota_{\triangle^k}(y')$, we obtain the logistic loss

$L_{\Omega \oplus \Omega^*}(y, \theta) = L_{\text{logistic}}(y, \theta) := \log \sum_{i=1}^k \exp(\theta_i) + \langle y, \log y \rangle - \langle y, \theta \rangle,$

and the soft argmax link (also known as softmax) $\hat{y}_\Omega(\theta) = \operatorname{softargmax}(\theta) := \exp(\theta) / \sum_{i=1}^k \exp(\theta_i)$.

### 2.3 Maximal monotone operators and the Fitzpatrick function

An operator $A$, that is, a subset $A \subseteq \mathbb{R}^k \times \mathbb{R}^k$, is called monotone if for all $(y, \theta) \in A$ and all $(y', \theta') \in A$, we have $\langle y' - y, \theta' - \theta \rangle \geq 0$. We overload the notation to denote $A(y) := \{\theta \in \mathbb{R}^k \mid (y, \theta) \in A\}$. A monotone operator $A$ is said to be maximal if there does not exist $(y, \theta) \notin A$ such that $A \cup \{(y, \theta)\}$ is still monotone. It is well-known that the subdifferential $\partial \Omega$ of a proper convex l.s.c. function $\Omega$ is maximal monotone. For more details on monotone operators, see [3, 28]. A well-known object in monotone operator theory, the Fitzpatrick function associated with a maximal monotone operator $A$ [15, 11, 2], denoted $F[A] \colon \mathbb{R}^k \times \mathbb{R}^k \to \overline{\mathbb{R}}$, is defined by

$F[A](y, \theta) := \sup_{(y', \theta') \in A} \langle y - y', \theta' \rangle + \langle y', \theta \rangle.$

In particular, with $A = \partial \Omega$, we have

$F[\partial \Omega](y, \theta) = \sup_{(y', \theta') \in \partial \Omega} \langle y - y', \theta' \rangle + \langle y', \theta \rangle = \sup_{y' \in \operatorname{dom} \Omega} \Big[ \langle y', \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle \Big].$

The Fitzpatrick function was studied in depth in [2]. In particular, it is (jointly) convex and satisfies

$\langle y, \theta \rangle \leq F[\partial \Omega](y, \theta) \leq (\Omega \oplus \Omega^*)(y, \theta) = \Omega(y) + \Omega^*(\theta), \quad \forall y, \theta \in \mathbb{R}^k. \quad (3)$

We introduce the operator $\hat{y}_{F[\partial \Omega]} \colon (\mathbb{R}^k \times \mathbb{R}^k) \rightrightarrows \mathbb{R}^k$, associated to the Fitzpatrick function $F[\partial \Omega]$:

$\hat{y}_{F[\partial \Omega]}(y, \theta) := \operatorname{argmax}_{y' \in \operatorname{dom} \Omega} \Big[ \langle y', \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle \Big]. \quad (4)$

As proven for Item 4 in Proposition 1, we have $\hat{y}_{F[\partial \Omega]}(y, \theta) \subseteq \partial_\theta F[\partial \Omega](y, \theta)$.
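The sparse argmax link mentioned above, i.e., the Euclidean projection onto the probability simplex, can be computed with the classical sort-based algorithm; a minimal sketch (function name ours):

```python
import numpy as np

def sparse_argmax(theta):
    """Euclidean projection of theta onto the probability simplex
    (the sparsemax link), via the O(k log k) sort-based algorithm."""
    k = theta.shape[0]
    u = np.sort(theta)[::-1]                   # sorted in decreasing order
    css = np.cumsum(u) - 1.0                   # cumulative sums minus total mass
    rho = np.nonzero(u * np.arange(1, k + 1) > css)[0][-1]
    tau = css[rho] / (rho + 1.0)               # threshold
    return np.maximum(theta - tau, 0.0)

p = sparse_argmax(np.array([2.0, 1.2, -0.8]))  # last coordinate is thresholded to 0
```

Unlike the soft argmax, which is always dense, this link can set coordinates exactly to zero, which is why the resulting distributions are sparse.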
For the rest of the paper, in the case where the argmax in (4) is a singleton $\{y^\star\}$, we will write $\hat{y}_{F[\partial \Omega]}(y, \theta) := y^\star$. The Fitzpatrick function $F[\partial \Omega](y, \theta)$ and $(\Omega \oplus \Omega^*)(y, \theta) = \Omega(y) + \Omega^*(\theta)$ play a similar role, but the latter function is separable in $y$ and $\theta$, while the former is not. In particular, this makes the subdifferential $\partial_\theta F[\partial \Omega](y, \theta)$ depend on both $y$ and $\theta$, while $\partial_\theta (\Omega \oplus \Omega^*)(y, \theta) = \partial \Omega^*(\theta)$ depends only on $\theta$. The Fitzpatrick function was used in [10] to theoretically study primal-primal Bregman-like divergences. As discussed in more detail in Section 3.4, using these divergences for machine learning would require us to compose them with an explicit link function, which would typically break convexity. In the next section, we introduce new primal-dual losses based on the Fitzpatrick function.

## 3 Fitzpatrick losses

### 3.1 Definition and properties

Inspired by the inequality in (3), which we can view as a refined Fenchel-Young inequality, we introduce Fitzpatrick losses, a new family of loss functions generated by a convex l.s.c. function $\Omega$.

**Definition 1 (Fitzpatrick loss).** Let $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ be a proper convex l.s.c. function. When $y \in \operatorname{dom} \Omega$ and $\theta \in \mathbb{R}^k$, we define the Fitzpatrick loss $L_{F[\partial \Omega]} \colon \mathbb{R}^k \times \mathbb{R}^k \to \overline{\mathbb{R}}$ generated by $\Omega$ as

$L_{F[\partial \Omega]}(y, \theta) := F[\partial \Omega](y, \theta) - \langle y, \theta \rangle = \sup_{(y', \theta') \in \partial \Omega} \langle y - y', \theta' \rangle + \langle y', \theta \rangle - \langle y, \theta \rangle = \sup_{(y', \theta') \in \partial \Omega} \langle y' - y, \theta - \theta' \rangle.$

When $y \notin \operatorname{dom} \Omega$, $L_{F[\partial \Omega]}(y, \theta) := +\infty$.

Fitzpatrick losses enjoy similar properties as Fenchel-Young losses, while being tighter than Fenchel-Young losses.

**Proposition 1 (Properties of Fitzpatrick losses).**
1. Non-negativity: for all $(y, \theta) \in \mathbb{R}^k \times \mathbb{R}^k$, $L_{F[\partial \Omega]}(y, \theta) \geq 0$.
2. Same link function: $L_{\Omega \oplus \Omega^*}(y, \theta) = 0 \iff L_{F[\partial \Omega]}(y, \theta) = 0 \iff y \in \hat{y}_\Omega(\theta)$.
3. Separate convexity: $L_{F[\partial \Omega]}(y, \theta)$ is separately convex in $y$ and $\theta$.
4. (Sub-)gradient: $\hat{y}_{F[\partial \Omega]}(y, \theta) - y \subseteq \partial_\theta L_{F[\partial \Omega]}(y, \theta)$, where $\hat{y}_{F[\partial \Omega]}(y, \theta)$ is given by (4).
5. Tighter inequality: for all $(y, \theta) \in \mathbb{R}^k \times \mathbb{R}^k$, $0 \leq L_{F[\partial \Omega]}(y, \theta) \leq L_{\Omega \oplus \Omega^*}(y, \theta)$.

A proof is given in Appendix B.2.
Because the Fitzpatrick loss and the Fenchel-Young loss generated by the same $\Omega$ have the same link function $\hat{y}_\Omega$, they share the same minimizers w.r.t. $\theta$ for $y$ fixed. However, the Fitzpatrick loss is always a lower bound of the corresponding Fenchel-Young loss. Moreover, they have different gradients w.r.t. $\theta$: $\nabla_\theta L_{\Omega \oplus \Omega^*}(y, \theta) = \hat{y}_\Omega(\theta) - y$ vs. $\partial_\theta L_{F[\partial \Omega]}(y, \theta) \ni \hat{y}_{F[\partial \Omega]}(y, \theta) - y$. It is worth noticing that $\hat{y}_{F[\partial \Omega]}(y, \theta)$ depends on both $y$ and $\theta$, contrary to $\hat{y}_\Omega(\theta)$. When $\Omega$ is a twice differentiable function on its domain (which is for instance the case of the squared 2-norm or the negentropy), we next show that Fitzpatrick losses enjoy a particularly simple expression and become a squared Mahalanobis-like distance.

**Proposition 2 (Expressions of $F[\partial \Omega](y, \theta)$ and $L_{F[\partial \Omega]}(y, \theta)$ when $\Omega$ is twice differentiable).** Let $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ be a convex function such that $\operatorname{dom} \Omega$ is an open set. Let us assume that $\Omega$ is twice differentiable. Then, for any $y \in \operatorname{dom} \Omega$ and for any $y^\star \in \hat{y}_{F[\partial \Omega]}(y, \theta)$, as defined in (4), we have that

$F[\partial \Omega](y, \theta) = \langle y, \nabla \Omega(y^\star) \rangle + \langle y^\star, \theta \rangle - \langle y^\star, \nabla \Omega(y^\star) \rangle,$

$L_{F[\partial \Omega]}(y, \theta) = \langle y^\star - y, \theta - \nabla \Omega(y^\star) \rangle = \langle y^\star - y, \nabla^2 \Omega(y^\star)(y^\star - y) \rangle,$

and $\nabla^2 \Omega(y^\star)(y^\star - y) = \theta - \nabla \Omega(y^\star)$.

A proof is given in Appendix B.3. When $\Omega$ is constrained (i.e., when it contains an indicator function), we show in Section 3.5 that the above expression becomes a lower bound.

### 3.2 Examples

We now present the Fitzpatrick loss counterparts of various Fenchel-Young losses.

**Squared loss.**

**Proposition 3 (Squared loss as a Fitzpatrick loss).** When $\Omega(y') = \frac{1}{2}\|y'\|_2^2$, we have for all $y \in \mathbb{R}^k$ and $\theta \in \mathbb{R}^k$

$L_{F[\partial \Omega]}(y, \theta) = \frac{1}{4}\|y - \theta\|_2^2 = \frac{1}{2} L_{\text{squared}}(y, \theta).$

A proof is given in Appendix B.4. Therefore, the Fenchel-Young and Fitzpatrick losses generated by $\Omega$ coincide up to a factor of $\frac{1}{2}$.

**Perceptron loss.**

**Proposition 4 (Perceptron loss as a Fitzpatrick loss).** When $\Omega(y') = \iota_C(y')$, where $C$ is a nonempty closed convex set, we have for all $y \in C$ and $\theta \in \mathbb{R}^k$

$L_{F[\partial \Omega]}(y, \theta) = L_{\text{perceptron}}(y, \theta) = \max_{y' \in C} \langle y', \theta \rangle - \langle y, \theta \rangle.$

A proof is given in Appendix B.5.
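Proposition 3 is easy to check numerically. For $\Omega = \frac{1}{2}\|\cdot\|_2^2$, the graph of $\partial \Omega$ is $\{(y', y')\}$, so the supremum defining the Fitzpatrick loss is over $\langle y' - y, \theta - y' \rangle$ and is attained in closed form at $y' = (y + \theta)/2$. A small sketch (function names ours):

```python
import numpy as np

def fitzpatrick_squared_loss(y, theta):
    """For Omega(y) = 0.5||y||^2, sup over y' of <y' - y, theta - y'>,
    maximized in closed form at y' = (y + theta) / 2."""
    y_star = (y + theta) / 2.0
    return np.dot(y_star - y, theta - y_star)

def squared_fy_loss(y, theta):
    """Corresponding Fenchel-Young loss: 0.5 * ||y - theta||^2."""
    return 0.5 * np.sum((y - theta) ** 2)

rng = np.random.default_rng(0)
y, theta = rng.normal(size=4), rng.normal(size=4)
lf = fitzpatrick_squared_loss(y, theta)   # equals 0.25 * ||y - theta||^2
lfy = squared_fy_loss(y, theta)           # equals 0.50 * ||y - theta||^2
```

This also illustrates the tighter inequality of Proposition 1: the Fitzpatrick loss is here exactly half the Fenchel-Young loss.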
Therefore, the Fenchel-Young and Fitzpatrick losses generated by $\Omega$ exactly coincide in this case.

**Fitzpatrick SparseMAP and Fitzpatrick sparsemax losses.** As our first example where Fenchel-Young and Fitzpatrick losses substantially differ, we introduce the Fitzpatrick SparseMAP loss, which is the Fitzpatrick counterpart of the SparseMAP loss [24].

**Proposition 5 (Fitzpatrick SparseMAP loss).** When $\Omega(y') = \frac{1}{2}\|y'\|_2^2 + \iota_C(y')$, where $C$ is a nonempty closed convex set, we have for all $y \in C$ and $\theta \in \mathbb{R}^k$

$L_{F[\partial \Omega]}(y, \theta) = 2\Omega^*((y + \theta)/2) - \langle y, \theta \rangle = \langle y^\star - y, \theta - y^\star \rangle,$

where we used $y^\star$ as a shorthand for $\hat{y}_{F[\partial \Omega]}(y, \theta) = \nabla \Omega^*((y + \theta)/2) = P_C((y + \theta)/2)$.

A proof is given in Appendix B.6. As a special case, when $C = \triangle^k$, we call the obtained loss the Fitzpatrick sparsemax loss, as it is the counterpart of the sparsemax loss [21]. Like the SparseMAP and sparsemax losses, these new losses rely on the Euclidean projection as a core building block. The Euclidean projection onto the probability simplex $\triangle^k$ can be computed exactly in $O(k)$ expected time and $O(k \log k)$ worst-case time [9, 22, 14, 12].

**Fitzpatrick logistic loss.** We now derive the Fitzpatrick counterpart of the logistic loss. Before stating the next proposition, we recall the definition of the Lambert function [13] $W \colon \mathbb{R}_+ \to \mathbb{R}_+$. The function $W$ is the inverse of the function $f \colon \mathbb{R}_+ \to \mathbb{R}_+$ where $f(w) = w \exp(w)$ for all $w \in \mathbb{R}_+$.

**Proposition 6 (Fitzpatrick logistic loss).** When $\Omega(y') = \langle y', \log y' \rangle + \iota_{\triangle^k}(y')$, we have for all $y \in \triangle^k$ and $\theta \in \mathbb{R}^k$

$L_{F[\partial \Omega]}(y, \theta) = \langle y^\star - y, \theta - \log y^\star - \mathbf{1} \rangle,$

where we used $y^\star$ as a shorthand for $\hat{y}_{F[\partial \Omega]}(y, \theta)$, defined by

$\hat{y}_{F[\partial \Omega]}(y, \theta)_i = \begin{cases} e^{-\lambda^\star} e^{\theta_i}, & \text{if } y_i = 0, \\ y_i / W(y_i e^{\lambda^\star - \theta_i}), & \text{if } y_i > 0. \end{cases}$

A proof and the value of $\lambda^\star = \lambda^\star(y, \theta) \in \mathbb{R}$ are given in Appendix B.7. To obtain $\lambda^\star(y, \theta)$, we need to solve a one-dimensional root equation, which can be done using, for instance, a bisection.

### 3.3 Relation with Fenchel-Young losses

At first sight, Fitzpatrick losses and Fenchel-Young losses appear quite different.
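Proposition 5 with $C = \triangle^k$ gives a directly computable Fitzpatrick sparsemax loss. A self-contained NumPy sketch (function names ours), which also checks the tighter inequality against the Fenchel-Young sparsemax loss:

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    k = v.shape[0]
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, k + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def sparsemax_fy_loss(y, theta):
    """Fenchel-Young sparsemax loss: 0.5||y - theta||^2 - 0.5||P(theta) - theta||^2."""
    p = proj_simplex(theta)
    return 0.5 * np.sum((y - theta) ** 2) - 0.5 * np.sum((p - theta) ** 2)

def fitzpatrick_sparsemax_loss(y, theta):
    """Proposition 5 with C = simplex: <y* - y, theta - y*>, y* = P((y + theta)/2)."""
    y_star = proj_simplex((y + theta) / 2.0)
    return np.dot(y_star - y, theta - y_star)

y = np.array([1.0, 0.0, 0.0])
theta = np.array([0.5, 0.2, -1.0])
lf = fitzpatrick_sparsemax_loss(y, theta)
lfy = sparsemax_fy_loss(y, theta)
```

Both losses vanish when the target equals the sparse argmax of $\theta$, consistent with the shared link function of Proposition 1.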
In the next proposition, we show that the Fitzpatrick loss generated by $\Omega$ is in fact equal to the Fenchel-Young loss generated by the modified, target-dependent function

$\Omega_y(y') := \Omega(y') + D_\Omega(y, y'),$

where $D_\Omega$ is the generalized Bregman divergence defined in (1). In particular, Lemma 1 in the appendix shows that if $\Omega = \Psi + \iota_C$, where $C$ is a nonempty closed convex set, then $\Omega_y(y') = \Psi_y(y') + \iota_C(y')$, where $\Psi_y(y') := \Psi(y') + D_\Psi(y, y')$.

**Proposition 7 (Characterization of $F[\partial \Omega]$, $L_{F[\partial \Omega]}$ and $\hat{y}_{F[\partial \Omega]}$ using $\Omega_y$).** Let $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ be a proper convex l.s.c. function. Then, for all $y \in \operatorname{dom} \Omega$ and all $\theta \in \mathbb{R}^k$,

$F[\partial \Omega](y, \theta) = \Omega_y(y) + \Omega_y^*(\theta), \quad L_{F[\partial \Omega]}(y, \theta) = L_{\Omega_y \oplus \Omega_y^*}(y, \theta), \quad \hat{y}_{F[\partial \Omega]}(y, \theta) = \hat{y}_{\Omega_y}(\theta).$

This characterization of the Fitzpatrick function $F[\partial \Omega]$ is also new to our knowledge. A proof is given in Appendix B.8. Proposition 7 is very useful, as it means that Fitzpatrick losses inherit all the known properties of Fenchel-Young losses, analyzed in prior works [8, 6]. In particular, Fenchel-Young losses are smooth (i.e., with Lipschitz gradients) when $\Omega$ is strongly convex. We therefore immediately obtain that Fitzpatrick losses are smooth in their second argument $\theta$ if $\Omega$ is strongly convex and $D_\Omega$ is convex in its second argument, which is the case when $\Omega(y') = \frac{1}{2}\|y'\|_2^2$ and when $\Omega(y') = \langle y', \log y' \rangle$. Therefore, the Fitzpatrick sparsemax and logistic losses are smooth. However, in the general case, this does not hold, as $D_\Omega$ is usually not convex in its second argument. Proposition 7 also provides a means to compute Fitzpatrick losses and their gradients. Finally, it suggests a very natural geometric interpretation of Fitzpatrick losses, as presented in Figure 2.

### 3.4 Relation with generalized Bregman divergences

As we stated before, the generalized Bregman divergence $D_\Omega(y, y')$ in (1) is a primal-primal divergence, as both $y$ and $y'$ belong to the same primal space.
In contrast, Fenchel-Young losses $L_{\Omega \oplus \Omega^*}(y, \theta)$ are primal-dual, since $y$ belongs to the primal space and $\theta$ belongs to the dual space.

Figure 2: Geometric interpretation, with $\Omega(y') = \frac{1}{2}\|y'\|_2^2$. The Fenchel-Young loss $L_{\Omega \oplus \Omega^*}(y, \theta)$ is the gap between $\Omega(y)$ and $\langle y, \theta \rangle - \Omega^*(\theta)$, the value at $y$ of the tangent with slope $\theta$ and intercept $-\Omega^*(\theta)$. As per Proposition 7, the Fitzpatrick loss $L_{F[\partial \Omega]}(y, \theta)$ is equal to $L_{\Omega_y \oplus \Omega_y^*}(y, \theta)$ and is therefore equal to the gap between $\Omega_y(y) = \Omega(y)$ and $\langle y, \theta \rangle - \Omega_y^*(\theta)$, the value at $y$ of the tangent with slope $\theta$ and intercept $-\Omega_y^*(\theta)$. Since $\Omega_y(y') = \Omega(y') + D_\Omega(y, y')$, we have that $\Omega_y(y') \geq \Omega(y')$, with equality when $y' = y$. We therefore have $\Omega_y^*(\theta) \leq \Omega^*(\theta)$, implying that the Fitzpatrick loss is a lower bound of the Fenchel-Young loss.

The two can however be related for any proper l.s.c. convex function $\Omega$, since, for any $y'$ such that $\partial \Omega(y') \neq \emptyset$, we have

$\inf_{\theta' \in \partial \Omega(y')} L_{\Omega \oplus \Omega^*}(y, \theta') = \inf_{\theta' \in \partial \Omega(y')} \Omega(y) + \Omega^*(\theta') - \langle y, \theta' \rangle$
$= \Omega(y) + \inf_{\theta' \in \partial \Omega(y')} \Omega^*(\theta') - \langle y, \theta' \rangle$
$= \Omega(y) - \sup_{\theta' \in \partial \Omega(y')} \langle y, \theta' \rangle - \Omega^*(\theta')$
$= \Omega(y) - \Omega(y') - \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle$
$= D_\Omega(y, y'),$

where, in the penultimate line, we have used that $\Omega^*(\theta') = \langle y', \theta' \rangle - \Omega(y')$, as $\theta' \in \partial \Omega(y')$. This equality remains true when $\partial \Omega(y') = \emptyset$, by the convention $\inf \emptyset = +\infty$ and by the definition of $D_\Omega(y, y')$ in (1). This identity suggests that we can create Bregman-like primal-primal divergences by replacing $\Omega \oplus \Omega^*$ with $F[\partial \Omega]$:

$D_{F[\partial \Omega]}(y, y') := \inf_{\theta' \in \partial \Omega(y')} L_{F[\partial \Omega]}(y, \theta') = \inf_{\theta' \in \partial \Omega(y')} F[\partial \Omega](y, \theta') - \langle y, \theta' \rangle.$

This recovers one of the two Bregman-like divergences proposed in [10], the other one replacing the inf above by a sup. As stated in [10], $F[\partial \Omega]$ and $\Omega \oplus \Omega^*$ are representations of $\partial \Omega$. In order to use a primal-primal divergence as a loss, we need to explicitly compose it with a link function, such as $\hat{y}_\Omega(\theta) = \partial \Omega^*(\theta)$.
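The identity relating Fenchel-Young losses to generalized Bregman divergences can be checked numerically in the differentiable case, where the infimum over $\theta' \in \partial \Omega(y')$ reduces to the single point $\theta' = \nabla \Omega(y')$. A small sketch for $\Omega = \frac{1}{2}\|\cdot\|_2^2$ (function names ours):

```python
import numpy as np

def fy_loss_squared(y, theta):
    """Fenchel-Young loss for Omega(y) = 0.5||y||^2, whose conjugate is
    itself: Omega(y) + Omega*(theta) - <y, theta>."""
    return 0.5 * np.dot(y, y) + 0.5 * np.dot(theta, theta) - np.dot(y, theta)

rng = np.random.default_rng(1)
y, y_prime = rng.normal(size=5), rng.normal(size=5)

# theta' = grad Omega(y') = y', so the infimum is a single evaluation,
# and it equals the Bregman divergence D_Omega(y, y') = 0.5||y - y'||^2.
lhs = fy_loss_squared(y, y_prime)
rhs = 0.5 * np.sum((y - y_prime) ** 2)
```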
Unfortunately, $D_\Omega(y, \hat{y}_\Omega(\theta))$ or $D_{F[\partial \Omega]}(y, \hat{y}_\Omega(\theta))$ are typically non-convex functions of $\theta$, while Fenchel-Young and Fitzpatrick losses are always convex in $\theta$. In addition, differentiating through $\hat{y}_\Omega(\theta)$ typically requires implicit differentiation [19, 5], while Fenchel-Young and Fitzpatrick losses enjoy easy-to-compute gradients, thanks to Danskin's theorem.

### 3.5 Lower bound on Fitzpatrick losses

If $\Omega = \Psi + \iota_C$, where $\Psi \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ is a proper convex Legendre-type function and $C \subseteq \operatorname{dom} \Psi$ is a nonempty closed convex set, then it was shown in [8, Proposition 3] that Fenchel-Young losses satisfy the lower bound $D_\Psi(y, \hat{y}_\Omega(\theta)) \leq L_{\Omega \oplus \Omega^*}(y, \theta)$, with equality if $C = \operatorname{dom} \Psi$. We now show that a similar result holds for Fitzpatrick losses. As before, we define $\Psi_y(y') := \Psi(y') + D_\Psi(y, y')$.

**Proposition 8 (Lower bound on Fitzpatrick losses).** Let $\Omega = \Psi + \iota_C$, where $\Psi \colon \mathbb{R}^k \to \overline{\mathbb{R}}$ is a proper convex Legendre-type function, as defined in [8, Definition 3], and $C \subseteq \operatorname{dom} \Psi$ is a nonempty closed convex set. We recall that $\Omega_y(y') := \Omega(y') + D_\Omega(y, y')$. Let us assume that $\Omega_y^*$ is differentiable. Then,

$D_{\Psi_y}(y, y^\star) = \langle y - y^\star, \nabla^2 \Psi(y^\star)(y - y^\star) \rangle \leq L_{F[\partial \Omega]}(y, \theta),$

with equality if $\operatorname{dom} \Psi = C$, where we used $y^\star$ as a shorthand for $\nabla \Omega_y^*(\theta)$. A proof is given in Appendix B.9. If $\Psi_y$ is $\mu$-strongly convex, we obtain $\frac{\mu}{2}\|y - y^\star\|_2^2 \leq D_{\Psi_y}(y, y^\star)$.

## 4 Experiments

**Experimental setup.** We follow exactly the same experimental setup as in [7, 8]. We consider a dataset of $n$ pairs $(x_i, y_i)$ of feature vectors $x_i \in \mathbb{R}^d$ and label proportions $y_i \in \triangle^k$, where $d$ is the number of features and $k$ is the number of classes. At inference time, given an unknown input vector $x \in \mathbb{R}^d$, our goal is to estimate a vector of label proportions $\hat{y} \in \triangle^k$. A model is specified by a matrix $W \in \mathbb{R}^{k \times d}$ and a convex l.s.c. function $\Omega \colon \mathbb{R}^k \to \overline{\mathbb{R}}$. Predictions are then produced by the generalized linear model $x \mapsto \hat{y}_\Omega(Wx)$. At training time, we estimate the matrix $W \in \mathbb{R}^{k \times d}$ by minimizing the convex objective

$\sum_{i=1}^n L(y_i, W x_i) + \frac{\lambda}{2}\|W\|_2^2, \quad (5)$

where $L \in \{L_{\Omega \oplus \Omega^*}, L_{F[\partial \Omega]}\}$.
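The training objective above can be sketched as follows; a minimal NumPy version using the logistic Fenchel-Young loss, plain gradient descent in place of L-BFGS, and synthetic data (all names ours):

```python
import numpy as np

def softargmax_rows(Z):
    """Row-wise soft argmax link."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def objective(W, X, Y, lam):
    """Objective (5) with the logistic Fenchel-Young loss."""
    Theta = X @ W.T
    m = Theta.max(axis=1)
    lse = m + np.log(np.exp(Theta - m[:, None]).sum(axis=1))
    ent = np.sum(np.where(Y > 0, Y * np.log(np.where(Y > 0, Y, 1.0)), 0.0), axis=1)
    return np.sum(lse + ent - np.sum(Y * Theta, axis=1)) + 0.5 * lam * np.sum(W * W)

def objective_grad(W, X, Y, lam):
    """Gradient via Equation (2): grad_theta L = yhat(theta) - y."""
    R = softargmax_rows(X @ W.T) - Y      # (n, k) residuals
    return R.T @ X + lam * W

rng = np.random.default_rng(0)
n, d, k = 50, 10, 3
X = rng.normal(size=(n, d))
Y = softargmax_rows(rng.normal(size=(n, k)))  # synthetic label proportions
W = np.zeros((k, d))
for _ in range(200):                          # plain gradient descent
    W -= 0.01 * objective_grad(W, X, Y, lam=1e-3)
```

Swapping in a Fitzpatrick loss only changes the loss value and the residual term in the gradient (via Proposition 1, Item 4); the generalized linear model and the training loop are unchanged.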
We focus on the (Fitzpatrick) sparsemax and the (Fitzpatrick) logistic losses. We optimize (5) using the L-BFGS algorithm [20]. The gradient of the Fenchel-Young loss is given in (2), while the gradient of the Fitzpatrick loss is given in Proposition 1, Item 4. Experiments were conducted on an Intel Xeon E5-2667 clocked at 3.30 GHz with 192 GB of RAM, running Linux. Our implementation relies on the SciPy [29] and scikit-learn [25] libraries. We ran experiments on 11 standard multi-label benchmark datasets¹ (see Table 2 in Appendix A for statistics on the datasets). For all datasets, we removed samples with no label, normalized samples to have zero mean and unit variance, and normalized labels to lie in the probability simplex. In (5), we chose the hyperparameter $\lambda \in \{10^{-4}, 10^{-3}, \ldots, 10^{4}\}$ against the validation set. We report test set mean squared error (also known as Brier score in a probabilistic forecasting context) in Table 1.

**Results.** We found that the logistic loss and the Fitzpatrick logistic loss are comparable on most datasets, with the logistic loss significantly winning on 2 datasets and the Fitzpatrick logistic loss significantly winning on 2 datasets, out of 11. Since the Fitzpatrick logistic loss is slightly more computationally demanding, requiring to solve a root equation while the logistic loss does not, we believe that the logistic loss remains the best choice when we wish to use the soft argmax as link function $\hat{y}_\Omega$. Similarly, we found that the sparsemax loss and the Fitzpatrick sparsemax loss are comparable on most datasets, with the sparsemax loss significantly winning on only 1 dataset out of 11 and the Fitzpatrick sparsemax loss significantly winning on 2 datasets out of 11. Since the two losses both use the Euclidean projection onto the simplex $P_{\triangle^k}$ as their link function $\hat{y}_\Omega$, we conclude that the Fitzpatrick sparsemax loss is a serious contender to the sparsemax loss, especially when predicting sparse label proportions is important.
¹The datasets can be downloaded from http://mulan.sourceforge.net/datasets-mlc.html and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

| Dataset | Sparsemax | Fitzpatrick-sparsemax | Logistic | Fitzpatrick-logistic |
|---|---|---|---|---|
| Birds | 0.531 | **0.513** | 0.519 | 0.522 |
| Cal500 | 0.035 | 0.035 | 0.034 | 0.034 |
| Delicious | 0.051 | 0.052 | 0.056 | 0.055 |
| Ecthr A | 0.514 | 0.514 | 0.431 | **0.423** |
| Emotions | 0.317 | 0.318 | 0.327 | **0.320** |
| Flags | 0.186 | 0.188 | 0.184 | 0.187 |
| Mediamill | **0.191** | 0.203 | **0.207** | 0.220 |
| Scene | 0.363 | **0.355** | **0.344** | 0.368 |
| Tmc | 0.151 | 0.152 | 0.161 | 0.160 |
| Unfair | 0.149 | 0.148 | 0.157 | 0.158 |
| Yeast | 0.186 | 0.187 | 0.183 | 0.185 |

Table 1: Test performance comparison between the sparsemax loss, the logistic loss and their Fitzpatrick counterparts on the task of label proportion estimation, with regularization parameter $\lambda$ in (5) tuned against the validation set. For each dataset, label proportion errors are measured using the mean squared error (MSE). We use bold if the error is at least 0.005 lower than its counterpart.

## 5 Conclusion

We proposed to leverage the Fitzpatrick function, a theoretical tool from monotone operator theory, to build a new family of primal-dual, separately convex loss functions for machine learning. We reinterpreted Fitzpatrick losses as lower bounds of Fenchel-Young losses that maintain the same link function. Our paper therefore challenges the idea that there can only be one loss function, convex in each argument, associated with a certain link function. For instance, we created the Fitzpatrick logistic and sparsemax losses, which are associated with the soft argmax and sparse argmax links, traditionally associated with the logistic and sparsemax losses, respectively. We believe that even more loss functions with the same link can be created, which calls for a systematic study of their properties and respective benefits.
In future work, we intend to study calibration guarantees for Fitzpatrick losses, to test new link functions from maximal monotone operator theory, and to implement more efficient training for the Fitzpatrick logistic loss.

## References

[1] S.-i. Amari. Information geometry and its applications, volume 194. Springer, 2016.
[2] H. Bauschke, D. McLaren, and H. Sendov. Fitzpatrick functions: Inequalities, examples, and remarks on a problem by S. Fitzpatrick. Journal of Convex Analysis, 13, 2005.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics. Springer-Verlag, New York, second edition, 2017.
[4] D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334-334, 1997.
[5] M. Blondel, Q. Berthet, M. Cuturi, R. Frostig, S. Hoyer, F. Llinares-López, F. Pedregosa, and J.-P. Vert. Efficient and modular implicit differentiation. In Advances in Neural Information Processing Systems, volume 35, pages 5230-5242. Curran Associates, Inc., 2022.
[6] M. Blondel, F. Llinares-López, R. Dadashi, L. Hussenot, and M. Geist. Learning energy networks with generalized Fenchel-Young losses. Advances in Neural Information Processing Systems, 35:12516-12528, 2022.
[7] M. Blondel, A. Martins, and V. Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 606-615. PMLR, 2019.
[8] M. Blondel, A. F. Martins, and V. Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1-69, 2020.
[9] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163-166, 1984.
[10] R. S. Burachik and J. E. Martínez-Legaz.
On Bregman-type distances for convex functions and maximally monotone operators. Set-Valued and Variational Analysis, 26:369-384, 2018.
[11] R. S. Burachik and B. F. Svaiter. Maximal monotone operators, convex functions and a special family of enlargements. Set-Valued Analysis, 10:297-316, 2002.
[12] L. Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1-2):575-585, 2016.
[13] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329-359, 1996.
[14] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. of ICML, 2008.
[15] S. Fitzpatrick. Representing monotone operators by convex functions. In Workshop/Miniconference on Functional Analysis and Optimization, volume 20, pages 59-66. Australian National University, Mathematical Sciences Institute, 1988.
[16] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.
[17] P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367-1433, 2004.
[18] K. C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4):1142-1168, 1997.
[19] S. G. Krantz and H. R. Parks. The implicit function theorem: history, theory, and applications. Springer Science & Business Media, 2002.
[20] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503-528, 1989.
[21] A. Martins and R. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614-1623. PMLR, 2016.
[22] C. Michelot.
A finite algorithm for finding the projection of a point onto the canonical simplex of Rⁿ. Journal of Optimization Theory and Applications, 50(1):195-200, 1986.
[23] J. J. Moreau. Inf-convolution, sous-additivité, convexité des fonctions numériques. J. Math. Pures Appl. (9), 49:109-154, 1970.
[24] V. Niculae, A. Martins, M. Blondel, and C. Cardie. SparseMAP: Differentiable sparse structured inference. In International Conference on Machine Learning, pages 3799-3808. PMLR, 2018.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[26] M. D. Reid and R. C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.
[27] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, Berlin, 1998.
[28] E. K. Ryu and W. Yin. Large-scale convex optimization: algorithms & analyses via monotone operators. Cambridge University Press, 2022.
[29] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261-272, 2020.
[30] R. C. Williamson, E. Vernet, and M. D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(222):1-52, 2016.
A Datasets statistics

Dataset   | Type        | Train | Dev  | Test  | Features | Classes | Avg. labels
----------|-------------|-------|------|-------|----------|---------|------------
Birds     | Audio       | 134   | 45   | 172   | 260      | 19      | 2
Cal500    | Music       | 376   | 126  | 101   | 68       | 174     | 26
Delicious | Text        | 9682  | 3228 | 3181  | 500      | 983     | 19
Ecthr A   | Text        | 6683  | 228  | 847   | 92401    | 10      | 1
Emotions  | Music       | 293   | 98   | 202   | 72       | 6       | 2
Flags     | Images      | 96    | 33   | 65    | 19       | 7       | 3
Mediamill | Video       | 22353 | 7451 | 12373 | 120      | 101     | 5
Scene     | Images      | 908   | 303  | 1196  | 294      | 6       | 1
Tmc       | Text        | 16139 | 5380 | 7077  | 48099    | 896     | 6
Unfair    | Text        | 645   | 215  | 172   | 6290     | 8       | 1
Yeast     | Micro-array | 1125  | 375  | 917   | 103      | 14      | 4

Table 2: Datasets statistics

Lemma 1 (Generalized Bregman divergence for constrained $\Omega$). Let $\Omega = \Psi + \iota_C$, where $\Psi \colon \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ is proper convex l.s.c. and $C \subseteq \operatorname{dom} \Psi$ is a nonempty closed convex set such that $\operatorname{ri} C \cap \operatorname{ri} \operatorname{dom} \Psi \neq \emptyset$, where $\operatorname{ri} C$ is the relative interior of $C$. Then, for all $y, y' \in \mathbb{R}^k$,
$$D_\Omega(y, y') = D_\Psi(y, y') + D_{\iota_C}(y, y').$$

Proof. As $C, \operatorname{dom} \Psi \subseteq \mathbb{R}^k$ and $\operatorname{ri} C \cap \operatorname{ri} \operatorname{dom} \Psi \neq \emptyset$, we can apply [3, Proposition 6.19] and [3, Theorem 16.47] to write $\partial \Omega(y') = \partial \Psi(y') + \partial \iota_C(y')$. Thus, we have
$$D_\Omega(y, y') := \Omega(y) - \Omega(y') - \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle = \Psi(y) + \iota_C(y) - \Psi(y') - \iota_C(y') - \sup_{\theta' \in \partial \Psi(y') + \partial \iota_C(y')} \langle y - y', \theta' \rangle$$
$$= \Psi(y) - \Psi(y') - \sup_{\theta' \in \partial \Psi(y')} \langle y - y', \theta' \rangle + \iota_C(y) - \iota_C(y') - \sup_{\theta' \in \partial \iota_C(y')} \langle y - y', \theta' \rangle = D_\Psi(y, y') + D_{\iota_C}(y, y').$$

Lemma 2 (Generalized Bregman divergence of the indicator function of a nonempty closed convex set $C \subseteq \mathbb{R}^k$).
$$D_{\iota_C}(y, y') = \begin{cases} \iota_C(y) & \text{if } y' \in C \\ +\infty & \text{if } y' \notin C \end{cases} \;=\; \iota_C(y) + \iota_C(y').$$
Proof. Using (1) and the fact that $\partial \iota_C = N_C$ [3, Example 16.13], we obtain
$$D_{\iota_C}(y, y') := \iota_C(y) - \iota_C(y') - \sup_{\theta' \in N_C(y')} \langle y - y', \theta' \rangle.$$
When $y \in C$ and $y' \in C$,
$$\sup_{\theta' \in N_C(y')} \langle y - y', \theta' \rangle = \sup_{\substack{\theta' \in \mathbb{R}^k \\ \langle z - y', \theta' \rangle \leq 0 \;\forall z \in C}} \langle y - y', \theta' \rangle = 0.$$
When $y' \in C$ and $y \notin C$, $D_{\iota_C}(y, y') = +\infty$, as $+\infty + (-\infty) = +\infty$ in the definition of the Bregman divergence. Therefore, when $y' \in C$, we get that $D_{\iota_C}(y, y') = \iota_C(y)$. When $y' \notin C$, $N_C(y') = \emptyset$. Again, in the definition of the Bregman divergence, $+\infty + (-\infty) = +\infty$ and we use the convention $\sup \emptyset = -\infty$.

Lemma 3 (Bregman divergence of $\Psi_y$). Let $\Psi \colon \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ be proper, convex and twice differentiable on the interior of its domain. For all $y \in \mathbb{R}^k$, let $\Psi_y := \Psi + D_\Psi(y, \cdot)$.
Then, for all $y, y', y'' \in \operatorname{int} \operatorname{dom} \Psi$,
$$D_{\Psi_y}(y'', y') = D_\Psi(y, y'') - D_\Psi(y, y') + D_\Psi(y'', y') + \langle y'' - y', \nabla^2 \Psi(y')(y - y') \rangle$$
$$= \langle y'' - y, \nabla \Psi(y'') \rangle - \langle y'' - y, \nabla \Psi(y') \rangle + \langle y'' - y', \nabla^2 \Psi(y')(y - y') \rangle$$
and, in particular, for all $y, y' \in \operatorname{int} \operatorname{dom} \Psi$,
$$D_{\Psi_y}(y, y') = \langle y - y', \nabla^2 \Psi(y')(y - y') \rangle.$$
Proof. For all $y, y' \in \operatorname{int} \operatorname{dom} \Psi$, as $\Psi$ is convex and differentiable on $\operatorname{int} \operatorname{dom} \Psi$, we have that
$$\Psi_y(y') = \Psi(y') + D_\Psi(y, y') = \Psi(y') + \Psi(y) - \Psi(y') - \langle y - y', \nabla \Psi(y') \rangle = \Psi(y) - \langle y - y', \nabla \Psi(y') \rangle,$$
and therefore, as $\Psi$ is twice differentiable on $\operatorname{int} \operatorname{dom} \Psi$, we get that
$$\nabla \Psi_y(y') = \nabla^2 \Psi(y')(y' - y) + \nabla \Psi(y').$$
Therefore, for all $y, y', y'' \in \operatorname{int} \operatorname{dom} \Psi$,
$$D_{\Psi_y}(y'', y') = \Psi_y(y'') - \Psi_y(y') - \langle y'' - y', \nabla \Psi_y(y') \rangle$$
$$= \Psi(y'') + D_\Psi(y, y'') - \Psi(y') - D_\Psi(y, y') - \langle y'' - y', \nabla^2 \Psi(y')(y' - y) \rangle - \langle y'' - y', \nabla \Psi(y') \rangle$$
$$= D_\Psi(y, y'') - D_\Psi(y, y') + D_\Psi(y'', y') + \langle y'' - y', \nabla^2 \Psi(y')(y - y') \rangle$$
$$= \langle y'' - y, \nabla \Psi(y'') \rangle - \langle y'' - y, \nabla \Psi(y') \rangle + \langle y'' - y', \nabla^2 \Psi(y')(y - y') \rangle,$$
and, in particular, using the just established equality with $y'' = y$, we have, for all $y, y' \in \operatorname{int} \operatorname{dom} \Psi$,
$$D_{\Psi_y}(y, y') = \langle y - y, \nabla \Psi(y) \rangle - \langle y - y, \nabla \Psi(y') \rangle + \langle y - y', \nabla^2 \Psi(y')(y - y') \rangle = \langle y - y', \nabla^2 \Psi(y')(y - y') \rangle.$$

Lemma 4 (Generalized Bregman divergence of negentropy). Let $\alpha \in \mathbb{R}$. Let $\Psi(y') := \sum_{i=1}^k y'_i \log y'_i - \alpha \sum_{i=1}^k y'_i$ be defined for $y' \in \mathbb{R}^k_+$. Then, for $y, y' \in \mathbb{R}^k_+$,
$$D_\Psi(y, y') = \sum_{i=1}^k y_i \log \frac{y_i}{y'_i} - \sum_{i=1}^k (y_i - y'_i) + \iota_{\mathbb{R}^k_{++}}(y'),$$
which does not depend on $\alpha$.

Proof. First, if $y' \in \mathbb{R}^k_{++}$, $\Psi$ is differentiable at $y'$ and $\nabla \Psi(y')_i = \log y'_i + 1 - \alpha$. Thus, $\partial \Psi(y') = \{\nabla \Psi(y')\}$ and
$$D_\Psi(y, y') = \Psi(y) - \Psi(y') - \sup_{\theta' \in \partial \Psi(y')} \langle y - y', \theta' \rangle = \Psi(y) - \Psi(y') - \langle y - y', \nabla \Psi(y') \rangle = \sum_{i=1}^k y_i \log \frac{y_i}{y'_i} - \sum_{i=1}^k (y_i - y'_i).$$
Second, if we prove that $\partial \Psi(y') = \emptyset$ when there exists a component $i$ such that $y'_i = 0$, we can conclude the proof, as $\sup \emptyset = -\infty$ by convention. Indeed, let us assume that $y'_i = 0$ and suppose that $\theta' \in \partial \Psi(y')$. Then, by definition of subgradients,
$$\langle y'' - y', \theta' \rangle + \Psi(y') \leq \Psi(y''), \quad \forall y'' \in \mathbb{R}^k_+.$$
We consider $\varepsilon > 0$ and choose $y'' = y' + \varepsilon e_i$, where $e_i$ is the $i$-th canonical basis vector. Thus, we obtain
$$\varepsilon \theta'_i \leq \Psi(y' + \varepsilon e_i) - \Psi(y') = \varepsilon \log \varepsilon - \alpha \varepsilon,$$
as $y'_i = 0$ and $0 \log 0 = 0$ by convention. By noticing that $\lim_{\varepsilon \to 0^+} (\varepsilon \log \varepsilon - \alpha \varepsilon)/\varepsilon = -\infty$, we get a contradiction, which concludes the proof.
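The $\alpha$-independence claimed in Lemma 4 is easy to check numerically. The following minimal sketch assumes NumPy is available; the helper names `negentropy`, `bregman` and `generalized_kl` are ours, for illustration only. For strictly positive $y, y'$, the Bregman divergence computed from the definition coincides with the generalized Kullback–Leibler expression for any choice of $\alpha$:

```python
import numpy as np

def negentropy(y, alpha):
    # Psi(y) = sum_i y_i log y_i - alpha * sum_i y_i, for y in R^k_+
    return np.sum(y * np.log(y)) - alpha * np.sum(y)

def bregman(y, yp, alpha):
    # D_Psi(y, y') = Psi(y) - Psi(y') - <y - y', grad Psi(y')>
    grad = np.log(yp) + 1.0 - alpha  # gradient of Psi at y' in R^k_++
    return negentropy(y, alpha) - negentropy(yp, alpha) - np.dot(y - yp, grad)

def generalized_kl(y, yp):
    # sum_i y_i log(y_i / y'_i) - sum_i (y_i - y'_i)
    return np.sum(y * np.log(y / yp)) - np.sum(y - yp)

rng = np.random.default_rng(0)
y, yp = rng.random(5) + 0.1, rng.random(5) + 0.1
for alpha in (0.0, 1.0, -2.5):
    # same value for every alpha, and equal to the generalized KL divergence
    assert np.isclose(bregman(y, yp, alpha), generalized_kl(y, yp))
```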
Lemma 5 (Subdifferential inclusion for a sup function). Let $Y \subseteq \mathbb{R}^k$ be some set. Let $\{f_y\}_{y \in Y}$ be a family of finite functions $f_y \colon \mathbb{R}^k \to \mathbb{R}$. We define the function $g \colon \mathbb{R}^k \to \mathbb{R}$ by $g(\theta) = \sup_{y \in Y} f_y(\theta)$, for any $\theta \in \mathbb{R}^k$. Then, we have $\partial f_{\bar{y}}(\theta) \subseteq \partial g(\theta)$, for all $\theta \in \mathbb{R}^k$ and for all $\bar{y} \in \operatorname{argmax}_{y \in Y} f_y(\theta)$.
Proof. Left as an exercise.

Lemma 6 (Value and gradient of $\Psi_y^*$). Let $\Psi \colon \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ be a proper strictly convex function. Let us assume that $D_\Psi(y, y')$ is convex w.r.t. $y'$, for all $y \in \operatorname{dom} \Psi$. We recall that $\Psi_y(y') = \Psi(y') + D_\Psi(y, y')$. Then, for all $\theta \in \mathbb{R}^k$ and for all
$$\hat{y} \in \operatorname{argmax}_{y' \in \operatorname{dom} \Psi} \; \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle,$$
we have
$$\Psi_y^*(\theta) = \langle \hat{y}, \theta \rangle - \Psi(y) + \langle y - \hat{y}, \nabla \Psi(\hat{y}) \rangle \quad \text{and} \quad \nabla \Psi_y^*(\theta) = \hat{y}.$$
Furthermore, if the function $\Psi$ is twice differentiable, we have
$$\operatorname{argmax}_{y' \in \operatorname{dom} \Psi} \; \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle \;\subseteq\; \big\{ y' \in \operatorname{dom} \Psi : \nabla^2 \Psi(y')(y' - y) = \theta - \nabla \Psi(y') \big\}.$$
Proof. As $\Psi \leq \Psi_y$, we have $\operatorname{dom} \Psi_y \subseteq \operatorname{dom} \Psi$. Thus, we get
$$\Psi_y^*(\theta) = \sup_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle - \Psi_y(y') = \sup_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle - \big( \Psi(y') + \Psi(y) - \Psi(y') - \langle y - y', \nabla \Psi(y') \rangle \big)$$
$$= \sup_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle - \Psi(y) + \langle y - y', \nabla \Psi(y') \rangle.$$
As $\Psi$ is strictly convex and $D_\Psi(y, y')$ is convex w.r.t. $y'$, we have that $\Psi_y$ is strictly convex. Thus, according to [27, Theorem 11.13], $\Psi_y^*$ is differentiable. Thus,
$$\{\nabla \Psi_y^*(\theta)\} = \partial \Psi_y^*(\theta) \supseteq \operatorname{argmax}_{y' \in \operatorname{dom} \Psi} \; \langle y', \theta \rangle - \Psi(y) + \langle y - y', \nabla \Psi(y') \rangle = \operatorname{argmax}_{y' \in \operatorname{dom} \Psi} \; \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle,$$
using Lemma 5 for the inclusion. In the case where $\Psi$ is twice differentiable, setting the gradient of the inner function to zero concludes the proof.

Lemma 7 (Gradient of $\Psi_y^*$, squared norm case). Let $\Psi(y') := \frac{1}{2} \|y'\|_2^2$ and let $\Psi_y$ be defined as in Lemma 6. Then, the gradient of $\Psi_y^*$ is given by
$$\nabla \Psi_y^*(\theta) = \frac{y + \theta}{2}.$$
Proof. Let us notice that
$$\sup_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle = \sup_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle + \langle y, y' \rangle - \|y'\|_2^2,$$
and that $\Psi$ is a strictly convex function. By strong concavity of $y' \mapsto \langle y', \theta \rangle + \langle y, y' \rangle - \|y'\|_2^2$, we deduce that $\operatorname{argmax}_{y' \in \operatorname{dom} \Psi} \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle \neq \emptyset$. Furthermore, as $\Psi$ is twice differentiable with $\nabla^2 \Psi(y') = I$ and $\operatorname{dom} \Psi = \mathbb{R}^k$, the set $\{ y' \in \operatorname{dom} \Psi : \nabla^2 \Psi(y')(y' - y) = \theta - \nabla \Psi(y') \}$ is a singleton. Thus, using Lemma 6 with $\nabla \Psi(y') = y'$ and $\nabla^2 \Psi(y') = I$, we obtain that $\nabla \Psi_y^*(\theta)$ is the solution w.r.t. $y'$ of the equation $y' - y = \theta - y'$. Rearranging the terms concludes the proof.
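The closed form of Lemma 7 can be cross-checked by maximizing the inner objective $y' \mapsto \langle y', \theta \rangle - \Psi(y) + \langle y - y', \nabla \Psi(y') \rangle$ numerically. A sketch, assuming SciPy is available; the helper `neg_inner` is ours, for illustration only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y, theta = rng.normal(size=4), rng.normal(size=4)

def neg_inner(yp):
    # negative of <y', theta> - Psi(y) + <y - y', grad Psi(y')>
    # with Psi = 0.5 * ||.||^2, so grad Psi(y') = y'
    return -(yp @ theta - 0.5 * y @ y + (y - yp) @ yp)

# the negated objective is a convex quadratic; BFGS finds its minimizer
res = minimize(neg_inner, np.zeros_like(y))
# the maximizer should match the closed form (y + theta) / 2 of Lemma 7
assert np.allclose(res.x, (y + theta) / 2, atol=1e-4)
```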
Before stating the next lemma, we recall the definition of the Lambert function [13] $W \colon \mathbb{R}_+ \to \mathbb{R}_+$. The function $W$ is the inverse of the function $f \colon \mathbb{R}_+ \to \mathbb{R}_+$ defined by $f(w) = w e^w$ for all $w \in \mathbb{R}_+$.

Lemma 8 (Gradient of $\Psi_y^*$, negentropy case). Let $\Psi(y') := \sum_{i=1}^k y'_i \log y'_i - \alpha \sum_{i=1}^k y'_i$ be defined for $y' \in \mathbb{R}^k_+$. Then,
$$\nabla \Psi_y^*(\theta)_i = \begin{cases} e^{\theta_i - 2 + \alpha}, & \text{if } y_i = 0, \\ \dfrac{y_i}{W\big(y_i e^{-(\theta_i - 2 + \alpha)}\big)}, & \text{if } y_i > 0. \end{cases}$$
Proof. Following the same arguments as in the proof of Lemma 7, we use Lemma 6. We know that $\hat{y}$ is the solution of $\nabla^2 \Psi(\hat{y})(\hat{y} - y) = \theta - \nabla \Psi(\hat{y})$. Using $\nabla \Psi(\hat{y}) = \log \hat{y} + 1 - \alpha$ and $\nabla^2 \Psi(\hat{y}) = \operatorname{diag}(1/\hat{y})$ (where the logarithm and the division are performed element-wise), we obtain, for all $i \in \{1, \dots, k\}$,
$$(\hat{y}_i - y_i)/\hat{y}_i = \theta_i - \log \hat{y}_i - 1 + \alpha \iff 1 - y_i/\hat{y}_i = \theta_i - \log \hat{y}_i - 1 + \alpha.$$
When $y_i = 0$, we immediately have $\hat{y}_i = \exp(\theta_i - 2 + \alpha)$. When $y_i > 0$, after rearranging, we obtain
$$\frac{y_i}{\hat{y}_i} \exp\Big(\frac{y_i}{\hat{y}_i}\Big) = y_i \exp(-(\theta_i - 2 + \alpha)) \iff \frac{y_i}{\hat{y}_i} = W\big(y_i \exp(-(\theta_i - 2 + \alpha))\big),$$
hence the result.

Lemma 9 (Gradient of $\Omega_y^*$). Let $\Psi \colon \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ be a proper strictly convex l.s.c. function and assume that, for all $y \in \mathbb{R}^k$, the function $D_\Psi(y, \cdot)$ is convex. Let $\Omega_y = \Psi_y + \iota_C$ with $\Psi_y$ defined as in Lemma 6 and assume that $C$ is a nonempty compact convex set included in $\operatorname{dom} \Psi$. Then $\nabla \Omega_y^*(\theta)$ is the unique element of the set
$$\operatorname{argmax}_{y' \in C} \; \langle y', \theta \rangle + \langle y - y', \nabla \Psi(y') \rangle.$$
Proof. The result follows from Danskin's theorem [4, Proposition B.25].

Lemma 10 (Dual of simplex-constrained conjugate). Let $\Psi \colon \mathbb{R}^k \to \mathbb{R} \cup \{+\infty\}$ be a proper strictly convex function. Then,
$$(\Psi + \iota_{\triangle^k})^*(\theta) = \min_{\tau \in \mathbb{R}} \; \tau + (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau \mathbf{1}),$$
and for any $\tau^* \in \operatorname{argmin}_{\tau \in \mathbb{R}} \tau + (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau \mathbf{1})$, we get
$$\nabla (\Psi + \iota_{\triangle^k})^*(\theta) = \nabla (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau^* \mathbf{1}).$$
Proof. We have, by the definition of the Fenchel conjugate, that $(\Psi + \iota_{\triangle^k})^*(\theta) = \max_{y' \in \triangle^k} \langle y', \theta \rangle - \Psi(y')$. As the unit simplex $\triangle^k$ is a compact set and as $(\Psi + \iota_{\triangle^k})^*$ is differentiable (by [27, Theorem 11.13], given that $\Psi$ is strictly convex and $\triangle^k$ is convex), we apply Danskin's theorem [4, Proposition B.25] and get
$$\{\nabla (\Psi + \iota_{\triangle^k})^*(\theta)\} = \operatorname{argmax}_{y' \in \triangle^k} \langle y', \theta \rangle - \Psi(y').$$
Furthermore, we rewrite $(\Psi + \iota_{\triangle^k})^*$ via the following dual minimization problem:
$$(\Psi + \iota_{\triangle^k})^*(\theta) = \max_{y' \in \triangle^k} \langle y', \theta \rangle - \Psi(y') = \max_{y' \in \mathbb{R}^k_+} \min_{\tau \in \mathbb{R}} \; \langle y', \theta \rangle - \Psi(y') - \tau (\langle y', \mathbf{1} \rangle - 1)$$
$$= \min_{\tau \in \mathbb{R}} \; \tau + \max_{y' \in \mathbb{R}^k_+} \langle y', \theta - \tau \mathbf{1} \rangle - \Psi(y') = \min_{\tau \in \mathbb{R}} \; \tau + (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau \mathbf{1}).$$
Again, as $\Psi$ is strictly convex, $(\Psi + \iota_{\mathbb{R}^k_+})^*$ is differentiable [27, Theorem 11.13]. After some elementary calculus on subdifferentials, we conclude by applying the primal and dual forms of Fermat's rule [27, Theorem 11.39, Item (d)]: for any $\tau^* \in \operatorname{argmin}_{\tau \in \mathbb{R}} \tau + (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau \mathbf{1})$, we have $\nabla (\Psi + \iota_{\mathbb{R}^k_+})^*(\theta - \tau^* \mathbf{1}) \in \operatorname{argmax}_{y' \in \triangle^k} \langle y', \theta \rangle - \Psi(y')$.

Lemma 11 (Gradient of $\Omega_y^*$, negentropy, constrained to the simplex). Let
$$\Psi(y') = \begin{cases} \langle y', \log y' \rangle, & \text{if } y' \in \mathbb{R}^k_+, \\ +\infty, & \text{otherwise.} \end{cases}$$
Then, for all $y \in \mathbb{R}^k_+$, the gradient of $\Omega_y = \Psi_y + \iota_{\triangle^k}$ (as defined in Lemma 9) is given by
$$\nabla \Omega_y^*(\theta)_i = \begin{cases} e^{-\lambda^*} e^{\theta_i}, & \text{if } y_i = 0, \\ \dfrac{y_i}{W(y_i e^{\lambda^* - \theta_i})}, & \text{if } y_i > 0, \end{cases}$$
where $\lambda^*$ is the unique solution of the equation
$$\sum_{i : y_i = 0} e^{\theta_i - \lambda^*} + \sum_{i : y_i > 0} \frac{y_i}{W(y_i e^{-(\theta_i - \lambda^*)})} = 1.$$
Proof. From Lemma 9 and Lemma 10, since $\operatorname{dom} \Psi_y = \mathbb{R}^k_+$, we have $\hat{y} = \nabla \Omega_y^*(\theta) = \nabla \Psi_y^*(\theta - \tau^* \mathbf{1})$ for any solution $\tau^*$ of
$$\min_{\tau \in \mathbb{R}} \; \tau + \Psi_y^*(\theta - \tau \mathbf{1}).$$
As $\tau \mapsto \tau + \Psi_y^*(\theta - \tau \mathbf{1})$ is a convex function, setting its derivative to zero, we get
$$\tau^* \in \operatorname{argmin}_{\tau \in \mathbb{R}} \; \tau + \Psi_y^*(\theta - \tau \mathbf{1}) \iff \langle \nabla \Psi_y^*(\theta - \tau^* \mathbf{1}), \mathbf{1} \rangle = 1.$$
Using Lemma 8 (with $\alpha = 0$), we obtain that $\tau^*$ satisfies
$$\sum_{i : y_i = 0} e^{\theta_i - \tau^* - 2} + \sum_{i : y_i > 0} \frac{y_i}{W(y_i e^{-(\theta_i - \tau^* - 2)})} = 1.$$
By monotonicity in $\tau^*$, we conclude that $\tau^*$ exists and is unique. Using the change of variable $\lambda^* = \tau^* + 2$ concludes the proof.

B.2 Proof of Proposition 1 (Properties of Fitzpatrick losses)

Apart from differentiability, the proofs follow from the study of Fitzpatrick functions found in [15, 2, 28]. We include the proofs for completeness.

Link function and non-negativity. We recall that
$$L_{F[\partial \Omega]}(y, \theta) = \sup_{(y', \theta') \in \partial \Omega} -\langle y - y', \theta - \theta' \rangle = -\inf_{(y', \theta') \in \partial \Omega} \langle y - y', \theta - \theta' \rangle.$$
From the monotonicity of $\partial \Omega$, we have that if $(y, \theta) \in \partial \Omega$ and $(y', \theta') \in \partial \Omega$, then $\langle y - y', \theta - \theta' \rangle \geq 0$. Therefore, for all $(y, \theta) \in \partial \Omega$,
$$\inf_{(y', \theta') \in \partial \Omega} \langle y - y', \theta - \theta' \rangle = 0,$$
with the infimum being attained at $(y', \theta') = (y, \theta)$. This proves the link function.
From the maximality of $\partial \Omega$, if $(y, \theta) \notin \partial \Omega$, there exists $(y', \theta') \in \partial \Omega$ such that $\langle y - y', \theta - \theta' \rangle < 0$. Therefore, for all $(y, \theta) \notin \partial \Omega$,
$$\inf_{(y', \theta') \in \partial \Omega} \langle y - y', \theta - \theta' \rangle < 0.$$
This proves the non-negativity.

Separable convexity. We recall that $L_{F[\partial \Omega]}(y, \theta) = F[\partial \Omega](y, \theta) - \langle y, \theta \rangle$, where
$$F[\partial \Omega](y, \theta) = \sup_{(y', \theta') \in \partial \Omega} \langle y - y', \theta' \rangle + \langle y', \theta \rangle = \sup_{(y', \theta') \in \partial \Omega} \langle y, \theta' \rangle + \langle y', \theta \rangle - \langle y', \theta' \rangle.$$
The function $(y, \theta) \mapsto \langle y, \theta' \rangle + \langle y', \theta \rangle - \langle y', \theta' \rangle$ is (jointly) convex in $(y, \theta)$ for all $(y', \theta')$. Since the supremum preserves convexity, $F[\partial \Omega](y, \theta)$ is (jointly) convex in $(y, \theta)$. The function $(y, \theta) \mapsto -\langle y, \theta \rangle$ is convex in $y$ and in $\theta$ separately, but not (jointly) convex in $(y, \theta)$. Therefore, $L_{F[\partial \Omega]}(y, \theta)$ is separately convex in $y$ and $\theta$.

Subgradient. We are going to prove that, for any $y, \theta \in \mathbb{R}^k$ and for any $\hat{y} \in \hat{y}_{F[\partial \Omega]}(y, \theta)$ as defined in (4), we have $\hat{y} - y \in \partial_\theta L_{F[\partial \Omega]}(y, \theta)$. Indeed, by Definition 1, we have that
$$L_{F[\partial \Omega]}(y, \theta) = \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y' - y, \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle.$$
Furthermore, $\partial f_{y'}(\theta) = \{y' - y\}$, where $f_{y'}(\theta) = \langle y' - y, \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle$. Thus, by applying Lemma 5, we get $\hat{y} - y \in \partial_\theta L_{F[\partial \Omega]}(y, \theta)$ for any
$$\hat{y} \in \operatorname{argmax}_{y' \in \operatorname{dom} \partial \Omega} f_{y'}(\theta) = \operatorname{argmax}_{y' \in \operatorname{dom} \partial \Omega} \Big[ \langle y' - y, \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle \Big],$$
which yields the result by definition of $\hat{y}_{F[\partial \Omega]}(y, \theta)$ in (4).

Differentiability. Since the function $\Omega$ is strictly convex and $y' \mapsto D_\Omega(y, y')$ is convex, $\Omega_y(y') = \Omega(y') + D_\Omega(y, y')$ is strictly convex in $y'$. From the duality between strict convexity and differentiability, $\theta \mapsto \Omega_y^*(\theta)$ is differentiable.

Tighter inequality. Using $\partial \Omega = \{(y', \theta') \in \mathbb{R}^k \times \mathbb{R}^k : \Omega(y) \geq \Omega(y') + \langle y - y', \theta' \rangle \text{ for all } y\}$ and $\Omega^*(\theta) = \sup_{y' \in \mathbb{R}^k} \langle y', \theta \rangle - \Omega(y')$, we get, for any $(y', \theta') \in \partial \Omega$,
$$\langle y - y', \theta' \rangle + \langle y', \theta \rangle \leq \Omega(y) - \Omega(y') + \langle y', \theta \rangle \leq \Omega(y) + \Omega^*(\theta).$$
Therefore,
$$F[\partial \Omega](y, \theta) = \sup_{(y', \theta') \in \partial \Omega} \langle y - y', \theta' \rangle + \langle y', \theta \rangle \leq \Omega(y) + \Omega^*(\theta).$$

B.3 Proof of Proposition 2 (Expression of the Fitzpatrick loss when $\Omega$ is twice differentiable)

We recall that
$$F[\partial \Omega](y, \theta) = \sup_{(y', \theta') \in \partial \Omega} \langle y, \theta' \rangle + \langle y', \theta \rangle - \langle y', \theta' \rangle = \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle.$$
Since $\Omega$ is differentiable, we have $\partial \Omega(y') = \{\nabla \Omega(y')\}$ and therefore $\theta' = \nabla \Omega(y')$, which gives
$$F[\partial \Omega](y, \theta) = \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle + \langle y - y', \nabla \Omega(y') \rangle.$$
Setting the gradient of the inner function w.r.t. $y'$ to zero, we get, for any $\hat{y} \in \hat{y}_{F[\partial \Omega]}(y, \theta)$ as defined in (4),
$$\nabla^2 \Omega(\hat{y}) y + \theta - \nabla \Omega(\hat{y}) - \nabla^2 \Omega(\hat{y}) \hat{y} = 0.$$
Using $y' = \hat{y}$ and $\theta' = \nabla \Omega(\hat{y})$ in Definition 1, we then obtain
$$L_{F[\partial \Omega]}(y, \theta) = -\langle y - \hat{y}, \theta - \theta' \rangle = -\langle y - \hat{y}, \theta - \nabla \Omega(\hat{y}) \rangle = \langle y - \hat{y}, \nabla^2 \Omega(\hat{y})(y - \hat{y}) \rangle.$$

B.4 Proof of Proposition 3 (squared loss)

Using Proposition 2 with $\nabla \Omega(y') = y'$ and $\nabla^2 \Omega(y') = I$, we obtain
$$y + \theta - 2 \hat{y} = 0 \iff \hat{y} = \frac{y + \theta}{2}.$$
We therefore obtain
$$L_{F[\partial \Omega]}(y, \theta) = -\Big\langle y - \frac{y + \theta}{2}, \, \theta - \frac{y + \theta}{2} \Big\rangle = \frac{1}{4} \|y - \theta\|_2^2.$$

B.5 Proof of Proposition 4 (perceptron loss)

A proof of the Fitzpatrick function for this case was given in [2, Example 3.1]. We include a proof for completeness. Since $\Omega = \iota_C$, we have $\operatorname{dom} \Omega = C$. Therefore, for all $y \in C$ and $\theta \in \mathbb{R}^k$,
$$F[\partial \Omega](y, \theta) = \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle$$
$$= \sup_{y' \in C} \; \langle y', \theta \rangle - \Big( \iota_C(y) - \iota_C(y') - \sup_{\theta' \in \partial \iota_C(y')} \langle y - y', \theta' \rangle \Big) = \sup_{y' \in C} \; \langle y', \theta \rangle - D_{\iota_C}(y, y') = \sup_{y' \in C} \; \langle y', \theta \rangle,$$
where in the second line we used that $\iota_C(y) = \iota_C(y') = 0$ and in the last equality we used Lemma 2. Therefore, for all $y \in \mathbb{R}^k$ and $\theta \in \mathbb{R}^k$,
$$F[\partial \Omega](y, \theta) = \sup_{y' \in C} \langle y', \theta \rangle + \iota_C(y) = \iota_C(y) + \iota_C^*(\theta).$$

B.6 Proof of Proposition 5 (Fitzpatrick sparseMAP loss)

A proof of the Fitzpatrick function for this case was given in [2, Example 3.13]. We provide an alternative proof. From Proposition 7, we know that, for any $y \in \operatorname{dom} \Omega$ ($= C$ here),
$$F[\partial \Omega](y, \theta) = \Omega_y(y) + \Omega_y^*(\theta) = \Omega(y) + \Omega_y^*(\theta),$$
where
$$\Omega_y(y') = \frac{1}{2} \|y'\|_2^2 + \frac{1}{2} \|y - y'\|_2^2 + \iota_C(y') = \|y'\|_2^2 + \frac{1}{2} \|y\|_2^2 - \langle y, y' \rangle + \iota_C(y') = 2 \Omega(y') + \Omega(y) - \langle y, y' \rangle.$$
Using conjugate calculus, we obtain
$$\Omega_y^*(\theta) = 2 \Omega^*\Big( \frac{y + \theta}{2} \Big) - \Omega(y), \quad \text{so that} \quad F[\partial \Omega](y, \theta) = 2 \Omega^*\Big( \frac{y + \theta}{2} \Big).$$
From Proposition 7, the supremum w.r.t. $y'$ is achieved at $\hat{y} = \nabla \Omega^*\big( \frac{y + \theta}{2} \big) = P_C\big( \frac{y + \theta}{2} \big)$. We therefore obtain
$$L_{F[\partial \Omega]}(y, \theta) = -\langle y - \hat{y}, \theta - \hat{y} \rangle.$$

B.7 Proof of Proposition 6 (Fitzpatrick logistic loss)

Differentiability w.r.t. $\theta$ and formula of the gradient. According to Proposition 7, we have
$$L_{F[\partial \Omega]}(y, \theta) = \Omega_y(y) + \Omega_y^*(\theta) - \langle y, \theta \rangle.$$
Thus the differentiability w.r.t. $\theta$ of $L_{F[\partial \Omega]}(y, \theta)$ follows from the differentiability of $\theta \mapsto \Omega_y^*(\theta)$.
Lemma 11 yields the differentiability of $\Omega_y^*(\theta)$ and a formula for its gradient $\hat{y}_{F[\partial \Omega]}(y, \theta) := \nabla \Omega_y^*(\theta)$:
$$\hat{y}_{F[\partial \Omega]}(y, \theta)_i = \begin{cases} e^{-\lambda^*} e^{\theta_i}, & \text{if } y_i = 0, \\ \dfrac{y_i}{W(y_i e^{\lambda^* - \theta_i})}, & \text{if } y_i > 0. \end{cases}$$
It follows that $\nabla_\theta L_{F[\partial \Omega]}(y, \theta) = \hat{y}_{F[\partial \Omega]}(y, \theta) - y$.

Formula of the Fitzpatrick logistic loss. We use $\hat{y}$ as a shorthand for $\hat{y}_{F[\partial \Omega]}(y, \theta)$. As we know that $\Omega_y^*(\theta) = \langle \hat{y}, \theta \rangle - \Omega_y(\hat{y})$, we use Proposition 7 again to get
$$L_{F[\partial \Omega]}(y, \theta) = \Omega_y(y) + \langle \hat{y}, \theta \rangle - \Omega_y(\hat{y}) - \langle y, \theta \rangle = \Omega(y) - \big( \Omega(\hat{y}) + D_\Omega(y, \hat{y}) \big) + \langle \hat{y} - y, \theta \rangle,$$
as $\Omega_y(y') = \Omega(y') + D_\Omega(y, y')$ and, in particular, $\Omega_y(y) = \Omega(y)$. Furthermore, as $\hat{y} \in \triangle^k \cap \mathbb{R}^k_{++}$, $\Omega$ is differentiable at $\hat{y}$ and $D_\Omega(y, \hat{y}) = \Omega(y) - \Omega(\hat{y}) - \langle y - \hat{y}, \nabla \Omega(\hat{y}) \rangle$, where $\nabla \Omega(\hat{y}) = \log \hat{y} + \mathbf{1}$. Thus,
$$L_{F[\partial \Omega]}(y, \theta) = \langle y - \hat{y}, \nabla \Omega(\hat{y}) \rangle + \langle \hat{y} - y, \theta \rangle = \langle \hat{y} - y, \theta - \log \hat{y} - \mathbf{1} \rangle.$$

Bisection formula for $\lambda^*$ and bounds. We also get from Lemma 11 a bisection formula for $\lambda^*$, which is a shorthand for $\lambda^*_{F[\partial \Omega]}(y, \theta)$:
$$\sum_{i : y_i = 0} e^{\theta_i - \lambda^*} + \sum_{i : y_i > 0} \frac{y_i}{W(y_i e^{-(\theta_i - \lambda^*)})} = 1.$$
We now derive a lower bound and an upper bound for $\lambda^* \in \mathbb{R}$. Let us prove that
$$\log \sum_{i=1}^k e^{\theta_i} \;\leq\; \lambda^* \;\leq\; \log 2 + \max\Big\{ \log \sum_{i : y_i = 0} e^{\theta_i}, \; \log \ell_0(y) + \max_{i : y_i > 0} \big( \theta_i + 2 \ell_0(y) y_i \big) \Big\},$$
where $\ell_0(y) = \operatorname{Card}\{j : y_j \neq 0\}$ is the number of nonzero entries of $y$.

For the lower bound, we use the concavity [13] of the Lambert function $W$, which implies $W(t) \leq t$ for $t \geq 0$ and therefore
$$\frac{y_i}{W(y_i e^{-(\theta_i - \lambda^*)})} \geq \frac{y_i}{y_i e^{-(\theta_i - \lambda^*)}} = e^{\theta_i - \lambda^*}.$$
Thus,
$$1 \geq \sum_{i : y_i = 0} e^{\theta_i - \lambda^*} + \sum_{i : y_i > 0} e^{\theta_i - \lambda^*} = e^{-\lambda^*} \sum_{i=1}^k e^{\theta_i},$$
which in turn implies $e^{\lambda^*} \geq \sum_{i=1}^k e^{\theta_i}$ and yields the lower bound.

For the upper bound, the function
$$g(\lambda) = e^{-\lambda} \sum_{i : y_i = 0} e^{\theta_i} + \sum_{i : y_i > 0} \frac{y_i}{W(y_i e^{\lambda - \theta_i})}$$
is continuous and decreasing (as it is a positive combination of decreasing functions) and $g(-\infty) = +\infty$. Thus, if we find a $\lambda$ such that $g(\lambda) \leq 1$, we know that $\lambda^* \leq \lambda$. We deal with each term of $g(\lambda)$ separately. If $\lambda \in \mathbb{R}$ satisfies
$$e^{-\lambda} \sum_{i : y_i = 0} e^{\theta_i} \leq \frac{1}{2} \quad \text{and} \quad \max_{i : y_i > 0} \frac{y_i}{W(y_i e^{\lambda - \theta_i})} \leq \frac{1}{2 \ell_0(y)},$$
then $g(\lambda) \leq \frac{1}{2} + \frac{1}{2} = 1$. Thus, all $\lambda$ satisfying the following inequalities are upper bounds of $\lambda^*$:
$$2 \sum_{i : y_i = 0} e^{\theta_i} \leq e^\lambda \quad \text{and} \quad 2 \ell_0(y) y_i \leq W(y_i e^{\lambda - \theta_i}), \quad \forall i : y_i > 0.$$
As $W$ is monotone and $W^{-1}(t) = t e^t$, we get
$$\log 2 + \log \sum_{i : y_i = 0} e^{\theta_i} \leq \lambda \quad \text{and} \quad 2 \ell_0(y) y_i e^{2 \ell_0(y) y_i} \leq y_i e^{\lambda - \theta_i}, \quad \forall i : y_i > 0.$$
Thus, taking
$$\lambda = \max\Big\{ \log 2 + \log \sum_{i : y_i = 0} e^{\theta_i}, \; \max_{i : y_i > 0} \big( \log 2 + \log \ell_0(y) + \theta_i + 2 \ell_0(y) y_i \big) \Big\}$$
yields an upper bound of $\lambda^*$.

B.8 Proof of Proposition 7 (characterization of $F[\partial \Omega]$ using $D_\Omega$)

Let $(y, \theta) \in \operatorname{dom} \Omega \times \mathbb{R}^k$. We have
$$F[\partial \Omega](y, \theta) = \sup_{(y', \theta') \in \partial \Omega} \langle y - y', \theta' \rangle + \langle y', \theta \rangle = \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle + \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle$$
$$= \Omega(y) + \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle - \Omega(y') - \Big( \Omega(y) - \Omega(y') - \sup_{\theta' \in \partial \Omega(y')} \langle y - y', \theta' \rangle \Big)$$
$$= \Omega(y) + \sup_{y' \in \operatorname{dom} \partial \Omega} \; \langle y', \theta \rangle - \big( \Omega(y') + D_\Omega(y, y') \big) = \Omega(y) + \big( \Omega + D_\Omega(y, \cdot) \big)^*(\theta) = \Omega_y(y) + \Omega_y^*(\theta).$$
The supremum above is achieved at some $\hat{y} = \nabla \Omega_y^*(\theta)$. When $\Omega = \Psi + \iota_C$, where $C \subseteq \operatorname{dom} \Psi$ is a nonempty closed convex set, using Lemmas 1 and 2, we have, for all $y \in C$,
$$\Omega_y(y') = \Psi(y') + D_\Psi(y, y') + \iota_C(y').$$

B.9 Proof of Proposition 8 (lower bound)

It was shown in [8, Proposition 3] that if $f = g + \iota_C$, where the function $g$ is Legendre-type and $C \subseteq \operatorname{dom} g$ is a nonempty closed convex set, then, for all $y \in C$ and $\theta \in \mathbb{R}^k$,
$$0 \leq D_g(y, \nabla f^*(\theta)) \leq L_{f, f^*}(y, \theta),$$
with equality if $C = \operatorname{dom} g$. Using $g = \Psi_y$, $f = \Omega_y = \Psi_y + \iota_C$, $\hat{y} = \nabla \Omega_y^*(\theta) = \hat{y}_{F[\partial \Omega]}(y, \theta)$, and Lemma 3, we therefore obtain
$$D_{\Psi_y}(y, \hat{y}) = \langle y - \hat{y}, \nabla^2 \Psi(\hat{y})(y - \hat{y}) \rangle \leq L_{\Omega_y, \Omega_y^*}(y, \theta) = L_{F[\partial \Omega]}(y, \theta).$$
If $\Psi$ is $\mu$-strongly convex and $D_\Psi$ is convex in its second argument, then $\Psi_y$ is $\mu$-strongly convex as well. Therefore, we also have
$$\frac{\mu}{2} \|y - \hat{y}\|_2^2 \leq D_{\Psi_y}(y, \hat{y}).$$

C Comment on the inf-addition convention $+\infty + (-\infty) = +\infty$

The question of adding conflicting infinities has been thoroughly studied by Moreau in [23]. In this paper, Moreau introduced two additions for $\mathbb{R} \cup \{-\infty, +\infty\}$. One of Moreau's two conventions for summing infinities can be found under the name of inf-addition in [27, Chapter 1]; the other convention is named sup-addition. Using the inf-addition on $\mathbb{R} \cup \{-\infty, +\infty\}$ allows, for instance, the equivalence of $\min_{x \in C} f(x)$ and $\min_{x \in \mathbb{R}^n} f(x) + \iota_C(x)$. The same goes for the sup-addition, but for maximization problems. We used the inf-addition convention for the definition of the generalized Bregman divergence in (1).
As the generalized Bregman divergence is to be thought of as a generalization of a distance, it is generally a quantity that will be minimized. Therefore, the inf-addition is a good fit for the definition of the generalized Bregman divergence. In the proof of Proposition 7 in B.8, the choice of inf-addition is corroborated by the fact that we want to calculate, for some fixed $y \in \mathbb{R}^k$,
$$\big( \Omega(\cdot) + D_\Omega(y, \cdot) \big)^*(\theta) = \sup_{y'} \; \langle y', \theta \rangle - \big( \Omega(y') + D_\Omega(y, y') \big), \quad \theta \in \mathbb{R}^k.$$
The minus sign transforms the inf-addition into the sup-addition when distributed. If we had chosen the sup-addition in the definition of the Bregman divergence $D_\Omega$, we would here calculate a supremum of an inf-addition, thus entering unknown territory.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: As stated in the abstract and in the introduction, we have defined Fitzpatrick losses in Definition 1 and studied their properties in Proposition 1. We instantiated the Fitzpatrick sparseMAP and the Fitzpatrick logistic loss in Proposition 5 and Proposition 6; we have studied the properties of Fitzpatrick losses in Proposition 1 and Proposition 7; we have gathered the results of the label proportion estimation in Table 1.
2.
Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the computational limitations of the Fitzpatrick logistic loss. As stated in Proposition 6 and in Section 4, the computation of the loss value and gradient involves solving a root equation.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All results in the paper are proved in the appendix. We strived to make the proofs self-contained.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: In Section 4, we describe the training setup, the linear model we use for predictions and give a link to access the datasets we use.
We also indicate that we use the L-BFGS algorithm for training. Furthermore, Proposition 5 and Proposition 6 yield formulae for the gradients of the new losses we introduce.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provided a link to the datasets used in the experiments. The code will be open-sourced upon release. For now, we provide instructions to reproduce the main experimental results. These instructions involve basic supervised learning tools such as the L-BFGS algorithm, a linear prediction model and standard cross-validation for the hyperparameter.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The selection of hyperparameters is discussed in Section 4. We use the predetermined train-test splits that come with the datasets.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The splits between training sets and test sets are fixed in advance, so there is no variability when conducting the label proportion estimation tests. Furthermore, as we minimize convex losses, the convergence of the minimization algorithm is independent of the starting point.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We provide the characteristics of the computer we used to conduct the tests. On the largest dataset, our experiment only takes a couple of hours to run.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The research conducted does not involve human subjects. We did not create new datasets and used standard open-access datasets for multi-label classification. After reviewing the societal impact and potential harmful consequences from the Code of Ethics, we conclude that the research conducted in the paper does not pose a risk of harmful consequences.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The research conducted in this paper is theoretical and has no societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The datasets used in the tests are publicly available. No new prediction model has been released in this paper.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for Existing Assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: In Section 4, the libraries used for the implementation of the numerical experiments are credited and cited.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.