# The Geometry of Mixability

Armando J. Cabrera Pacheco (a.cabrera@uni-tuebingen.de), University of Tübingen and Tübingen AI Center, Germany
Robert C. Williamson (bob.williamson@uni-tuebingen.de), University of Tübingen and Tübingen AI Center, Germany

Reviewed on OpenReview: https://openreview.net/forum?id=VrvGHDSzZ7

Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting, since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function ℓ is η-mixable if and only if the superprediction set spr(ηℓ) of the scaled loss function ηℓ slides freely inside the superprediction set spr(ℓ_log) of the log loss ℓ_log, under fairly general assumptions on the differentiability of ℓ. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a coordinate-free manner and reconciles previous results obtained for mixable loss functions in the binary and the multi-class cases.

1 Introduction

In the context of prediction with expert advice as described by Vovk (1998; 2001), an information game is considered between three players: the learner, N ∈ ℕ experts and nature.¹ At each step t ∈ ℤ_+, each expert makes a prediction γ_i^t ∈ Γ, i = 1, ..., N (here Γ is the prediction space), which the learner is allowed to see; the learner makes a prediction γ^t ∈ Γ; nature chooses an outcome ω^t ∈ Ω (here Ω is the set of outcomes); and, for a fixed loss function ℓ: Ω × Γ → [0, ∞], the cumulative loss is calculated for the learner and for each of the experts:

  L^t(learner) := Σ_{i=1}^t ℓ(ω^i, γ^i),   L^t(expert_k) := Σ_{i=1}^t ℓ(ω^i, γ_k^i),

where expert_k denotes the k-th expert (k = 1, ..., N). The goal is to bound the difference between the learner's loss and the experts' losses, which is often called the regret. That is, we want to bound the quantity

  R_k^t := L^t(learner) − L^t(expert_k).

¹We refer the reader to (Vovk, 1998; 2001) for discussions about the desired (or required) assumptions on Γ, Ω and ℓ.

1.1 Mixable games and characterizations of mixable and fundamental loss functions

For a wide class of games, called η-mixable games for η > 0, the Aggregating Algorithm (see for example (Vovk, 2001)) ensures an optimal bound for the regret (R_k^t ≤ η^{-1} ln N for all k = 1, ..., N), independent of the trial t. Since the mixability of a game depends on the loss function ℓ, a loss function ℓ is called η-mixable if the corresponding game is η-mixable. Since the Aggregating Algorithm is arguably one of the most well-founded and well-studied prediction algorithms, there is a natural interest in understanding properties and characterizations of mixable loss functions. Examples of mixable loss functions include the log loss, the relative entropy for binary outcomes (Haussler et al., 1998) and the Brier score (Vovk & Zhdanov, 2009; van Erven et al., 2012). Mixability of a loss function ℓ is characterized by a stronger form of convexity of the superprediction set of ℓ, namely the convexity of the superprediction set of ℓ after an exponential projection (see (1.3) below, as well as Vovk (2015) and van Erven et al. (2012)). Unfortunately, this characterization of mixability lacks a transparent geometric interpretation. The main goal of this work is to provide such a geometric interpretation.
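To make the regret bound above concrete, here is a minimal numerical sketch (ours, not from the paper) of the Aggregating Algorithm for the 1-mixable log loss in the binary setting: experts report probabilities of outcome 1, the learner predicts the weight-averaged probability (a valid substitution function for the log loss), and the weights are updated multiplicatively by e^{−η·loss}. The toy data and all names are our own; the run checks the bound R_k^t ≤ η^{-1} ln N.

```python
# Minimal sketch of the Aggregating Algorithm for the 1-mixable log loss (illustration only).
import numpy as np

rng = np.random.default_rng(0)
N, T, eta = 5, 200, 1.0

def log_loss(outcome, gamma):
    # gamma is the predicted probability of outcome 1
    p = gamma if outcome == 1 else 1.0 - gamma
    return -np.log(p)

weights = np.full(N, 1.0 / N)
learner_loss = 0.0
expert_loss = np.zeros(N)

for t in range(T):
    expert_preds = rng.uniform(0.05, 0.95, size=N)   # experts' predictions in (0, 1)
    learner_pred = float(weights @ expert_preds)      # mixture prediction (log-loss substitution)
    outcome = rng.integers(0, 2)                      # nature's binary outcome

    learner_loss += log_loss(outcome, learner_pred)
    losses = np.array([log_loss(outcome, g) for g in expert_preds])
    expert_loss += losses

    weights *= np.exp(-eta * losses)                  # Aggregating Algorithm weight update
    weights /= weights.sum()

regret = learner_loss - expert_loss.min()
print(f"regret = {regret:.4f} <= ln(N)/eta = {np.log(N)/eta:.4f}: {regret <= np.log(N)/eta + 1e-9}")
```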
The motivation stems from an observation made by Vovk (2015): for binary outcomes, η-mixability of a strictly proper loss function ℓ can be characterized as the positivity of the infimum of the quotient of the curvatures of ℓ and of the log loss ℓ_log. Here, as usual, loss functions are defined on the 2-simplex Δ_2 (see (1.1)). Moreover, he then proves that fundamentality (see Vovk (2015)) of a loss can be characterized as the finiteness of the supremum of the same quotient of curvatures. These two results suggest that these properties are geometric, meaning that they can be studied using tools of differential geometry, and in this regard, mixability and fundamentality should not depend on the coordinates chosen to express them.

Loosely speaking, in convex geometry a convex set L is said to slide freely inside a convex set K if for any point x in the boundary of K there is a translation vector y such that the translation of L by y (i.e., the Minkowski sum L + y, see (4.2)) intersects K at x and L + y ⊆ K. We provide the following geometric characterization of mixability and fundamentality, as a geometric comparison to the log loss (see Figure 1). Let spr(ℓ) denote the superprediction set of a loss function ℓ (see (1.4)).

Theorem 1.1 (Informal statement). A twice continuously differentiable proper loss function ℓ is mixable if and only if there is η > 0 such that spr(ηℓ) slides freely inside spr(ℓ_log). In addition, the same ℓ is fundamental if and only if there exists γ > 0 such that spr(ℓ_log) slides freely inside spr(γℓ).

[Figure 1: We abuse notation and denote the image of a loss function ℓ simply by ℓ. The figure shows how the superprediction set of a translate of a scaling of ℓ slides freely inside spr(ℓ_log). The bullet points are located at the image of a point p ∈ Δ_2.]

This new characterization of mixability appeals because it is stated directly in terms of the superprediction sets of the loss functions, rather than the exponentiated superprediction set. In order to obtain this theorem it is necessary to re-interpret properness from a differential geometry point of view, which constitutes a large part of this work. However, this technical effort pays off. van Erven et al. (2012) characterized η-mixable (differentiable) multi-class loss functions and, moreover, related η to the Hessians of the Bayes risks of ℓ and the log loss (see Definition 1.3), which can be interpreted as their curvatures. By generalizing the tools developed here for the binary case, we were able to obtain a multi-class analogue of Theorem 1.1 and to build a bridge to the results in (van Erven et al., 2012).²

²In fact, the present paper can be seen as the successful realization of the attempt of (van Erven et al., 2012), where we felt that the whole argument should be doable in a coordinate-free manner, but did not then have the skills to achieve that.

Finally, we note that the large field of information geometry (Amari, 2016) strives to understand the basic problems of statistical inference from a geometric perspective, with a strong focus on the geometry of statistical manifolds. Likewise, there are strongly geometric theories which describe the effect of hypothesis classes in machine learning (Mendelson & Williamson, 2002; Van Erven et al., 2015). Our hope is that by geometrizing the theory of loss functions we may be able to bridge these diverse perspectives.

1.2 Description of results and structure of the article

Using the same setting as (van Erven et al., 2012), we obtain a geometric characterization of η-mixable loss functions in the sense of differential geometry. Loss functions are considered to be maps ℓ: Δ_n → R^n_{≥0}, which under the conditions assumed in this work give rise to submanifolds ℓ(relint(Δ_n)) of R^n whose geometric properties are determined by ℓ (see the relevant precise definitions below). We first discuss the case n = 2 (binary classification loss functions), since it is more intuitively understandable, and then the case n ≥ 2. We summarize the main results as follows.

1. We recast the notion of a (strictly) proper loss as a geometric property of the loss itself rather than of its superprediction set. That is, properness is no longer considered a parametrization-dependent property; it is a statement about the geometric properties of the loss surface ℓ(relint(Δ_n)) (the boundary of the superprediction set). See Lemmas 2.7 and 3.2.

2. A geometric comparison is performed: for n = 2 in terms of the curvature of the loss curves (see Section 1.5 below), and for n ≥ 2 in terms of the scalar second fundamental form of the loss surfaces (see Section 3 and Appendix A), which measures how they curve inside R^n. The precise statements are given in Lemma 2.13 and Lemma 3.6. Intuitively, these results tell us how the superprediction set of ℓ sits inside the superprediction set of the log loss. We remark on a fundamental difference between the cases n = 2 and n ≥ 2: for n = 2, as Vovk (2015) observes, mixability can be related to the quotient of the curvature of ℓ and the curvature of ℓ_log. In this case the image of ℓ is a curve in R² whose curvature can be easily computed. In higher dimensions the notion of curvature is far more complicated and different notions of curvature exist (see for example (Lee, 2018)). For n ≥ 2, our geometric comparison relies on a comparison of the second fundamental forms of the loss surfaces defined by ℓ and ℓ_log; the second fundamental form measures how a surface is curved inside an ambient space, which is R^n in our case (see Appendix A). In this regard, part of our work can be interpreted as a coordinate-independent version of the main computations carried out in (van Erven et al., 2012).

3. Finally, we interpret our result from the point of view of convex analysis to give a new characterization of mixability. More precisely, we show that a (strictly) proper loss function ℓ is η-mixable if and only if the superprediction set of ηℓ slides freely (see Definition 4.11) inside the superprediction set of the log loss.

As byproducts, we obtain a general way to define mixability with respect to a fixed (strictly) proper loss function, further properties and consequences for binary classification loss functions (particularly for composite losses and canonical links), and a bridge to the results obtained in (van Erven et al., 2012). Since we treat loss functions from the point of view of differential geometry and convex geometry, a considerable amount of background in these topics is needed. We present this work as self-contained as possible and spend some time providing the intuition and motivation for the results (and sometimes the background), which naturally results in a longer exposition. In Section 2 we treat the binary case and in Section 3 the multi-class case, to obtain the geometric interpretation of properness and mixability and to perform the geometric comparison (in terms of curvature).
In Section 4 we make the connection to convex geometry and obtain the geometric characterization of mixability in terms of the sliding-freely condition on superprediction sets. In Section 5 we provide some additional comments and remarks, and build a bridge between our results and those in van Erven et al. (2012).

Here we summarize our setup; for more details see van Erven et al. (2012). Denote by [n] the set of natural numbers {1, ..., n}. The set of probability distributions on a finite set Y with |Y| = n ∈ ℕ is given by

  Δ_n := { (p_1, ..., p_n) ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0 for i = 1, ..., n }.

We note that Δ_n is a manifold with (non-smooth) boundary of dimension n − 1. Moreover, Δ_n is a hypersurface in R^n; we denote the interior (as a manifold) of Δ_n by int(Δ_n), which is the same set as the relative interior relint(Δ_n) of Δ_n. We define the standard parametrization of Δ_n as the map Φ_std : S_{n−1} ⊂ R^{n−1} → Δ_n,

  Φ_std(t_1, ..., t_{n−1}) = ( t_1, ..., t_{n−1}, 1 − Σ_{i=1}^{n−1} t_i ),

where S_{n−1} = { (t_1, ..., t_{n−1}) ∈ R^{n−1} : Σ_{i=1}^{n−1} t_i ≤ 1, t_i ≥ 0 for i = 1, ..., n − 1 }. In particular, when n = 2 the standard parametrization of Δ_2 is the map Φ_std : [0, 1] → Δ_2 given by Φ_std(t) = (t, 1 − t).

Definition 1.2. A loss function is a map ℓ: Δ_n × Y → R_{≥0} such that for each k ∈ Y, the map ℓ(·, k): Δ_n → R is continuous.

Given a loss function ℓ, p ∈ Δ_n and k ∈ Y, the value ℓ(p, k) represents the penalty of predicting p upon observing k. We define the partial losses of a loss function ℓ as the maps ℓ_i : Δ_n → R_{≥0} given by ℓ_i(p) = ℓ(p, i). A loss function can thus be described in terms of its partial losses as

  ℓ(p, k) = Σ_{i=1}^n [[k = i]] ℓ_i(p).

Thus, we can identify a loss function ℓ with the map ℓ: Δ_n → R^n_{≥0} determined by its partial losses, ℓ(p) = (ℓ_1(p), ..., ℓ_n(p)). In this work we follow this convention unless stated otherwise. Note that in this way we can see a loss function ℓ as an embedding of int(Δ_n) into R^n_{≥0} (assuming enough properties on ℓ). We will see later that properness ensures that the image of this embedding is a nice hypersurface of R^n with appealing geometric properties. Under the assumption that the outcomes are distributed with probability p ∈ Δ_n, we make the definitions below, following (van Erven et al., 2012; Reid & Williamson, 2010).

Definition 1.3. Given a loss function ℓ, we define the conditional risk as the map L : Δ_n × Δ_n → R, L(p, q) := ⟨ℓ(q), p⟩, and the associated conditional Bayes risk as the map L : Δ_n → R given by

  L(p) := inf_{q ∈ Δ_n} L(p, q) = inf_{q ∈ Δ_n} ⟨ℓ(q), p⟩.

Definition 1.4. A loss function ℓ: Δ_n → R^n_{≥0} is said to be proper if for any p ∈ Δ_n,

  L(p, p) ≤ L(p, q) for all q ∈ Δ_n.

In other words, L(p, ·) has a minimum at p. When p is the only minimizer of L(p, ·) we say that ℓ is strictly proper.

Proper losses are the natural losses to use when predicting class probabilities: they are such that perfect prediction is not penalized; for further discussion of why they are important, and additional references to the literature, see (Buja et al., 2005; Williamson & Cranko, 2022; Williamson et al., 2016). For our geometric considerations it will be useful to denote the image of Δ_n under ℓ by M_ℓ, and to impose enough differentiability conditions on ℓ so that M_ℓ is (at least) a C²-manifold. See Definitions 2.1 and 3.1 below.

We now recall the definition of mixability (see for example (Vovk, 2015; van Erven et al., 2012)). For η > 0, let E_η : R^n → R^n be the η-exponential projection defined as

  E_η(y) := (e^{−η y_1}, ..., e^{−η y_n}).
(1.3) A loss function ℓis called η-mixable if the image of its superprediction set, spr(ℓ), given by spr(ℓ) := {λ [0, )n | there is q n such that ℓi(q) λi for i [n]}, (1.4) is convex under the η-exponential projection, that is Eη(spr(ℓ)) [0, 1]n is convex. We say that ℓis mixable if ℓis η-mixable for some η > 0. Definition 1.5. Let ℓbe a mixable loss function. The mixability constant of ℓ, η ℓ, is defined as η ℓ:= sup η>0 {η > 0 | ℓis η-mixable} . 1.4 Motivation In this part we mainly discuss the case n = 2 since it is more intuitively accessible. It has been made evident that there is a strong relation between properness and mixability. Here we make this relation more explicit and transparent from a geometric point of view. The basic motivation is as follows. It is commonly understood that properness is a property that depends on the parametrization of the boundary of the superprediction set of ℓ(Vovk, 2015). It has been also shown that it is related to the curvature of the Bayes risk, since it requires that the superprediction set remains convex under the η-exponential projection given by (1.3) (with the standard parametrization of the simplex 2) (Buja et al., 2005; Reid & Williamson, 2010; van Erven et al., 2012). Mixability is considered to be a stronger notion of convexity (Vovk, 2015), for some η > 0. The basic observation in this work is that it is possible recast properness from a geometric point of view, i.e., independent of the parametrization of n. More precisely, we define properness in terms of the loss function viewed as a map ℓ: n Rn 0 rather than in terms of the superprediction set spr(ℓ) (as it is usually defined). More precisely, to determine whether a given ℓis proper or not, it is not enough to look at image ℓ( n) (as the boundary of spr(ℓ)) but rather how n is mapped into Rn 0 by ℓ since we will be using tools of differential geometry, we will assume C2 differentiability (see Section 2). More precisely, restricting first to n = 2 (see Lemma 2.7 below), a given loss function ℓ: 2 R2 0 will be (strictly) proper if and only if Published in Transactions on Machine Learning Research (09/2023) Figure 2: Consider the two loss functions given by ℓ1(p1, p2) = ( log(p1), log(p2)) and ℓ2(p1, p2) = ( log(p2), log(p1)), for p = (p1, p2) 2. Although spr(ℓ1) = spr(ℓ2), ℓ2 is not proper since the normal vector at ℓ2(p) is not p/|p| for any p 2. 1. the normal vector to ℓ( n) at ℓ(p) is equal to p/|p| for all p int( 2), and 2. the curvature (see Section 1.5 below) at any point ℓ(p) with respect to the unit normal vector n = p/|p| is strictly positive for all p int( 2). As observed in Figure 2, spr(ℓ1) = spr(ℓ2), which implies that their boundaries coincide (as a set). In particular, this implies that it is possible to parametrize the boundary of ℓ2( 2), (ℓ2( 2)), in the same way as (ℓ1( 2)) in order to have a proper loss. However, note that this changes the map ℓ2 and hence from the point of view of this work, this is a different loss function. In practice, one is given a loss function ℓrather than a superprediction set spr(ℓ), therefore we look at losses as individual maps from 2 to R2 0 instead of looking at their superpredictions sets and obtaining a proper loss by choosing a convenient parametrization of (spr(ℓ)). Remark 1.6. Our strength by characterizing proper loss functions in this way is that we will be able to apply techniques from differential geometry, however, these considerations only work for loss functions which are sufficiently differentiable. 
For a general set up, it is possible to characterize properness of a loss function in a fairly simple way via the convexity of its superprediction set. More precisely, the loss surface is the subgradient of the support function of the superprediction set. This was thoroughly studied by Williamson & Cranko (2022). We briefly explore some connections to our work in Section 4. Alternative approaches to extending and better understanding mixability include (Reid et al., 2015) and (Mhammedi & Williamson, 2018). 1.5 Comments about the curvature of planar curves The second condition for ℓto be proper mentioned above involves a condition on the curvature of ℓ(int( 2)). We now make this notion precise. Recall that if α(t) = (x1(t), x2(t)) is a C2 curve with α (t) = (x 1(t), x 2(t)) = (0, 0) for all t in its domain, then its curvature can be seen a measurement of the variation of its unit normal Published in Transactions on Machine Learning Research (09/2023) Figure 3: For a regular curve α, at each point we have two normal unit vectors. vector at each point. We define the canonical normal vector at α(t), nc(t), as the unit normal vector in the direction obtained by rotating α (t) 90 counterclockwise. Then, the signed curvature of α at t is defined as κα(t) := x 1(t)x 2(t) x 1(t)x 2(t) (x 1(t)2 + x 2(t)2)3/2 . (1.5) The interpretation of this number is as follows: κα(t) is positive if α curves in the direction of nc(t). However, note that at each point we have two normal vectors: nc(t). Thus, nc(t) and κα depend on the direction of α (i.e., α ), and their values differ by a negative sign. Thus, we can talk about the curvature of α with respect to a chosen unit vector n (either choosing nc or nc for all points, assuming this is possible, which is the case for the curves we will consider here, see Figure 3) and denote it by κ+ α. In the case when n = nc, then κ+ α = κα, and when n = nc, then κ+ α = κα. Since κα is invariant under reparametrizations (up to a sign), we can simply talk at the curvature of α at a given point p in the image of α. In Section 2 we make precise our choice in (2) above. For a summary of geometry of curves see Appendix A. Going back to loss functions, suppose ℓ: 2 R2 0 is a loss function. Since 2 is a 1-manifold, any parametrization around a point (of its interior) can be assumed to be of the form Φ: (a, b) R 2 for some a < b. Thus, the local expression of ℓunder this parametrization eℓ= ℓ Φ is a curve in R2. By changing Φ around the same point, we are reparametrizing eℓ. Since curvature is independent of coordinates (i.e., of the Φ used) up to a sign, we can define the curvature of the loss curve ℓ(int( 2)) with respect to a chosen unit normal vector (which will depend only on ℓ). To compute it from its definition in (1.5), we need to choose a parametrization Φ, and as we will see, many times it is convenient to take Φ = Φstd. Remark 1.7. One could avoid part of the technical complications above by choosing beforehand Φ = Φstd, as it is usually implicitly done, and then requiring ℓ1 and ℓ2 to be monotone (cf. Buja et al. (2005); Reid & Williamson (2010); Shuford et al. (1966); Vovk (2015)) essentially, this amounts to choosing direction for the admissible loss curves. 
Although this approach is appealing since the curve parameter (t in our case) can be directly interpreted as a probability, and moreover it simplifies calculations since in this case the convention can be chosen so that the signed curvature coincide with κ+ (see for example Vovk (2015)), when considering the multi-class case, the notion of direction breaks down and it is not clear which properties of Mℓ= ℓ( n) one should consider. The approach we consider here gives a concrete logical path to a generalization to the multi-class case (see Section 3). 1.6 Reconciling this point of view with previous works In this part we explain how to translate the results we obtain here to previous results regarding proper losses and mixability. We do this in particular with (Reid & Williamson, 2010) and (Vovk, 2015). Published in Transactions on Machine Learning Research (09/2023) (Reid & Williamson, 2010) Let Φ = Φstd. The parameter bη in (Reid & Williamson, 2010) corresponds to the parameter t here, ℓ1(bη) and ℓ 1(bη) correspond to eℓ1(t) and eℓ2(t), respectively. Although the regularity assumption in (Reid & Williamson, 2010) is initially only differentiability of the partial losses, when discussing the weight of a loss function they impose C2 regularity. From (Reid & Williamson, 2010, Theorem 1), we see that a loss ℓis proper if (in particular) ℓ 1 > 0 and ℓ 0 < 0. We can heuristically say that ℓgoes from right to left . This means that in this case, κ+ ℓ(bη) = κℓ(bη). The log loss in this case is ℓlog(bη) = ( ln(bη), ln(1 bη)). (Vovk, 2015) In (Vovk, 2015) the loss functions are defined as maps (λ0(p), λ1(p)), with λ0 increasing and λ1 decreasing (infinite differentiable). In this case, heuristically, losses go from left to right so that κ+ λ (p) = κλ(p). To relate this convention to ours, we set Φ(t) = (1 t, t). Then the parameter p in (Vovk, 2015) corresponds to t and λ0 and λ1 correspond to eℓ1 and eℓ2. The log loss is then given by λ(p) = ( ln(1 p), ln(p)). Therefore, from our point of view, in previous works there is an implicit choice of a parametrization of 2, particularly motivated to interpret the parameter as a probability. However, sometimes this might not be the case and a link function is needed (Reid & Williamson, 2010) this fits well with our approach as a link function for us is a different choice of parametrization; this will carefully explained in Section 2.7. In favor of the study of loss functions using tools from differential geometry we are then motivated to eliminate this choice of parametrization and consider ℓas a map between manifolds (namely, int( 2) and ℓ(int( 2)) as a submanifold of R2). Although picking a general parametrization of 2 complicates the interpretation of the parameter, it makes other properties of loss functions transparent. This approach has, to the knowledge of the authors, never been explored. We remark that, however, one can always set Φ = Φstd and reinterpret the results of this work as the parameter being a probability. With this geometric characterization of loss functions and properness at hand we continue to study mixability. 2 Properness and Mixability for Binary Classification We first restrict our discussion to binary classification, i.e., setting n = 2. Thus, we consider maps ℓ: 2 R2 0, where 2 = {(p1, p2) R2 | p1 + p2 = 1}, with partial losses ℓ1(p) and ℓ2(p). In this case the standard parametrization of 2 is given by Φstd(t) = (t, 1 t) for t [0, 1]. 
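As a quick numerical illustration of the conventions just fixed (our sketch, not code from the paper), the snippet below evaluates the signed-curvature formula (1.5) for the log loss in the coordinates of Φ_std. With this parametrization the loss curve runs from right to left, so the curvature with respect to the normal pointing into the positive orthant is minus the canonical signed curvature; the result agrees with the closed form t(1−t)/(t²+(1−t)²)^{3/2} recalled later in Section 2.2.

```python
# Signed curvature (1.5) of the log loss curve under Phi_std(t) = (t, 1 - t); illustration only.
import numpy as np

def signed_curvature(d1, d2):
    """Canonical signed curvature (1.5) from first and second derivatives of a plane curve."""
    x1p, x2p = d1
    x1pp, x2pp = d2
    return (x1p * x2pp - x1pp * x2p) / (x1p**2 + x2p**2) ** 1.5

t = np.linspace(0.01, 0.99, 99)
# log loss in the coordinates of Phi_std: (-ln t, -ln(1-t)), with its first and second derivatives
d1 = (-1.0 / t, 1.0 / (1.0 - t))
d2 = (1.0 / t**2, 1.0 / (1.0 - t) ** 2)

kappa_plus = -signed_curvature(d1, d2)            # flip sign: normal pointing towards R^2_{>=0}
closed_form = t * (1.0 - t) / (t**2 + (1.0 - t) ** 2) ** 1.5

print(np.allclose(kappa_plus, closed_form))        # True
print(kappa_plus.min() > 0)                        # positive on int(Delta_2), tending to 0 at the ends
```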
When a parametrization of 2, say Φ, is chosen, then the local expression of ℓwith respect to Φ (eℓ= ℓ Φ) is a map from some interval I R to R2, that is, a curve in the plane R2. Dating back to (Haussler et al., 1995; Vovk, 1998) it has been established that properness of a loss function imposes strong conditions on the first and second derivatives of their partial losses. In (Vovk, 2015) these relations were expressed by means of the curvature of the loss curve. Moreover, in (Buja et al., 2005; Reid & Williamson, 2010) properness is related to the second derivative of its Bayes risk, which in a way can be interpreted as its curvature. However, in these works there is always an implicit choice of parametrization of 2, which in turn imposes certain restrictions on the admissible loss functions, particularly making the results parametrization dependent. In this section, we first recast properness as a geometric property which allows us to obtain results in a parametrization (or coordinate) independent way. Definition 2.1. An admissible loss function is a map ℓ: 2 R2 0 such that (i) ℓ(int( 2)) R2 0 is a 1-manifold of class C2, (ii) there exists a C1 map n: ℓ(int( 2)) Nℓ(int( 2)), n(ℓ(p)) =: nℓ(p), where Nℓ(int 2) is the normal space of ℓ(int 2), and (iii) n(ℓ(p)) or n(ℓ(p)) belongs to R2 >0 for all p int( 2). We denote the set of admissible loss functions as L. Remark 2.2. We give the following interpretation of the previous definition. (i) simply says that the loss curve (once parametrized) is twice differentiable with continuous second partial derivatives. (ii) prevents some Published in Transactions on Machine Learning Research (09/2023) anomalies on ℓ, for example, ℓcan not be constant on a neighborhood of a point. (iii) defines a subfamily of loss curves which are not allowed to vary too much . This definition should be compared to the definition of loss functions in (Vovk, 2015, Section 2). Definition 2.3. Let ℓ L. Let n: ℓ(int( 2)) Nℓ(int( 2)) be the map that assigns to each ℓ(p) the normal vector to Mℓat ℓ(p) that lies in R2 0. We denote by κ+ α( ) the signed curvature of α with respect to the unit normal belonging to R2 0. We refer to κ+ α( ) as the curvature with respect to the unit normal vector pointing towards R2 0. 2.1 Proper losses Lemma 2.4. Suppose that ℓin L is strictly proper, then the signed curvature of the loss curve ℓ( 2) has a sign. Moreover, its curvature, κℓ, is positive with respect to unit normal vector (field) pointing towards R2 0. Proof. Let p0 int( 2) and let Φ: I R 2 be a parametrization of 2 around p0 = Φ(t0), for some t0 I, which we use to obtain a parametrization of 2 2 around (p0, p0)3. We consider the local expression of L given by e L(t, s) = ℓ(Φ(s)), Φ(t) . Using strict properness we know that fixing t, the function e L(t, ) achieves a minimum at s = t (and it is the only one), that is 0 = se L(t, s)|s=t = eℓ (t), Φ(t) = eℓ 1(t)Φ1(t) + eℓ 2(t)Φ2(t), (2.1) 0 < sse L(t, s)|s=t = eℓ (t), Φ(t) = eℓ 1(t)Φ1(t) + eℓ 2(t)Φ2(t). (2.2) To compute the sign of the signed curvature of ℓ( 2) it is enough to determine the sign of eℓ 1(t)eℓ 2(t) eℓ 1(t)eℓ 2(t). Without loss of generality, assuming Φ2 = 0 on this coordinate neighborhood we can write eℓ 1(t)eℓ 2(t) eℓ 1(t)eℓ 2(t) = eℓ 1(t)eℓ 2(t) eℓ 1(t) eℓ 1(t)Φ1(t) = eℓ 1(t) Φ2(t) h eℓ 2(t)Φ2(t) + eℓ 1(t)Φ1(t) i = eℓ 1(t) Φ2(t) h eℓ (t), Φ(t) i > 0, where we have used (2.1) and (2.2). Notice that if eℓ 1(t) = 0 for some t then necessarily eℓ 2(t) = 0 by (2.1), which is impossible in L. 
Therefore eℓ 1 has a sign and this sign determines the sign of the signed curvature of ℓ( 2). For the second statement, notice that again using (2.1) we know that eℓ 1 and eℓ 2 have different signs (and they do not change). If eℓ 1 > 0, then that means that the first coordinate increases and the second decreases, hence n(t) points towards R2 0 and κeℓ> 0. If eℓ 1 < 0, then we are in the opposite case and in this case n(t) points to R2 0 and κeℓ< 0, thus the signed curvature with respect to n(t) (the unit normal pointing towards R2 0) is positive. From the proof of the previous theorem we obtain the following corollary. Corollary 2.5. Let ℓ L. If ℓis proper, then p int( 2) is normal to the loss curve ℓ( 2) at ℓ(p). Proof. It follows directly from (2.1), since for fixed p int( 2), ℓ(q), p attains a minimum at p. 3Notice that this particular choice of coordinates around (p0, p0) suffices since we want to conclude something about the curvature of the curve loss ℓ. Published in Transactions on Machine Learning Research (09/2023) Lemma 2.6. In L, proper implies strictly proper. Proof. Let p int( 2), and suppose that there is p = p int( 2), such that ℓ(p ), p = inf q 2 ℓ(q), p . Using (2.1), we see that p is normal to ℓat ℓ(p), and hence p and p are parallel. Since both belong to 2, it follows that p = p, which is a contradiction. Therefore, in what follows (as long as we stay within L) we will use proper and strictly proper interchangeably. Note that the converse of Lemma 2.4 does not hold. That is, there are ℓ L which have positive signed curvature (with respect to the unit normal pointing towards R2 0), but are not proper. Indeed let ℓbe defined as ℓ(p) = ( ln(p2), ln(p1)). Taking the (standard) parametrization Φstd(t) = (t, 1 t) we see that the loss curve eℓgoes from left to right so neℓ(t) points towards R2 0. Moreover, we can readily see that the (signed) curvature κeℓis positive. However, Φstd(t) is not normal to eℓat eℓ(t), thus by Corollary 2.5, ℓcan not be proper. Therefore, we obtain the following characterization of proper losses in L. Lemma 2.7. Let ℓ L. Then ℓis strictly proper if and only if p is normal to the loss curve ℓ( 2) at ℓ(p) for all p int( 2) and the signed curvature of ℓ( 2) with respect to the normal vector pointing towards R2 0 is positive at all points ℓ(p) for p int( 2). Proof. The if part is Lemma 2.4. For the only if part, let ℓ L be such that κ+ ℓ> 0, (2.4) where κ+ ℓis the signed curvature of ℓwith respect to the unit normal pointing towards R2 0. Let p int( 2) and let Φ be a parametrization around p. We readily see that (2.3) implies that se L(t, s)|s=t = 0, while (2.4) implies sse L(t, s)|s=t > 0 by the proof of Lemma 2.4. This implies that fixing t, e L achieves its minimum at s = t. Then ℓis proper and by Lemma 2.6, we conclude it is strictly proper. Remark 2.8. Notice that to check whether a given loss function ℓ L is proper or not, it suffices to do it in any coordinate system. That is, given Φ, we check conditions (2.3) and (2.4) for eℓ= ℓ Φ. 2.2 Mixable loss functions We say that a loss function ℓis fair if ℓ1(p) 0 as p (0, 1) and ℓ2(p) 0 as p (1, 0) (this is motivated by the interpretation when using the standard parametrization, see Reid & Williamson (2010)). In addition, recall that a loss function ℓ L is proper if and only if (i) nℓ(p) = p |p| can be chosen, and (ii) κ+ ℓ(p) > 0 for all p int( 2). Thus, a prototype of a fair proper loss function is shown in Figure 4. 
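The two conditions of Lemma 2.7 are easy to test numerically. The sketch below (an illustration under the standard parametrization with finite-difference derivatives, not code from the paper) checks the normality condition ⟨ℓ̃′(t), Φ_std(t)⟩ = 0 and the positivity of κ⁺ for the log loss and for the swapped loss ℓ(p) = (−ln p₂, −ln p₁) of Figure 2: the latter has positive curvature but fails the normality condition, hence is not proper.

```python
# Numerical check of the two conditions of Lemma 2.7 under Phi_std(t) = (t, 1 - t); illustration only.
import numpy as np

def check(l, t, h=1e-5):
    """Finite-difference check of the normality and positive-curvature conditions at parameters t."""
    d1 = (l(t + h) - l(t - h)) / (2 * h)                    # l~'(t), shape (2, len(t))
    d2 = (l(t + h) - 2 * l(t) + l(t - h)) / h**2            # l~''(t)
    phi = np.stack([t, 1 - t])
    normal_ok = np.allclose((d1 * phi).sum(axis=0), 0.0, atol=1e-5)
    signed = (d1[0] * d2[1] - d2[0] * d1[1]) / (d1[0]**2 + d1[1]**2) ** 1.5
    # the curve runs right-to-left when l~_1' < 0, in which case kappa^+ = -kappa
    kappa_plus = np.where(d1[0] < 0, -signed, signed)
    return normal_ok, bool((kappa_plus > 0).all())

t = np.linspace(0.05, 0.95, 19)
log_loss = lambda t: np.stack([-np.log(t), -np.log(1 - t)])   # proper
swapped  = lambda t: np.stack([-np.log(1 - t), -np.log(t)])   # same superprediction set, not proper

print("log loss:", check(log_loss, t))    # (True, True)
print("swapped :", check(swapped, t))     # (False, True): positive curvature, normality fails
```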
Published in Transactions on Machine Learning Research (09/2023) Figure 4: Prototype of a mixable fair proper loss function. Recall from Section 1 that mixability is defined in terms of the superprediction set spr(ℓ) of ℓ. More precisely, for η > 0, consider the map Eη(y1, y2) = e ηy1, e ηy2 , where Eη : R2 0 [0, 1]2 is the exponential projection (1.3). Then, ℓis η-mixable if and only if Eη(spr(ℓ)) is convex. Remark 2.9. We stress the fact that this definition depends on the superprediction set of ℓrather than on ℓitself two different loss functions with the same superprediction set will be equally mixable. From our perspective, when talking about mixability of the map ℓ(i.e., without making reference to the superprediction set), we see that we can define it as follows. A loss ℓis mixable if the 1-dimensional manifold Eη ℓ(int( 2)) has signed curvature κ+ Eη ℓ 0. We will adopt the latter version here. Although clearly these definitions are equivalent, it is useful to have this at hand to relate mixability with properness. For now on, when we say ℓis mixable we mean in the latter way. See Figure 5. We close this part by describing the log loss, which will play an important role. Let ℓlog : 2 R2, given by ℓlog(p) = ( ln(p1), ln(p2)) . (2.5) Let Φ = Φstd. Then eℓlog(t) = ( ln(t), ln(1 t)) . Since eℓ log(t) = t 1, (1 t) 1 , its canonical normal vector is neℓlog(t) = 1 p t2 + (1 t)2 (1 t) 1, t 1 . The curvature with respect to neℓlog(t), the normal vector pointing towards R2 0, is then given by κ+ eℓlog = κeℓlog = t(1 t) (t2 + (1 t)2)3/2 > 0. (2.6) When there is no risk of confusion we denote κ+ eℓlog simply as κ+ log. Published in Transactions on Machine Learning Research (09/2023) (0, 0) (1, 0) Figure 5: Diagram depicting how convexity of Eη(spr(ℓ)) is characterized by the principal curvatures of Eη ℓ(int( 2)). 2.3 Mixability and curvature Haussler et al. (1995) gave a characterization of the mixability constant of a mixable proper binary loss function ℓin terms of the first and second derivatives of its partial losses. We reprove this characterization from a geometric point of view, that is, independent of the parametrization chosen for n. Let ℓ L be proper and Φ a 1-chart parametrization4 of 2, then Eη(spr(ℓ)) will be convex if and only if the curve γ(t) = E(ℓ(Φ(t))) has negative curvature with respect to the unit normal pointing towards R2 0. Since ℓis proper we can assume without loss of generality that κℓ(p) = κ+ ℓ(p) > 0. We are then interested in computing the signed curvature of g(t) = (g1(t), g2(t)) = E(eℓ1(t)), E(eℓ2(t)) = e ηeℓ1(t), e ηeℓ2(t) , and showing that κg 0. We have g 1(t) = ηeℓ 1(t)e ηeℓ1(t) g 1(t) = ηeℓ 1(t)e ηeℓ1(t) + η2eℓ 1(t)2e ηeℓ1(t) = ηe ηeℓ1(t) h ηeℓ 1(t)2 eℓ 1(t) i g 2(t) = ηeℓ 2(t)e ηeℓ2(t) 4This means that the map Φ: D 2 is such that Φ(D) = 2. Published in Transactions on Machine Learning Research (09/2023) g 2(t) = ηeℓ 2(t)e ηeℓ2(t) + η2eℓ 2(t)2e ηeℓ2(t) = ηe ηeℓ2(t) h ηeℓ 2(t)2 eℓ 2(t) i , and thus we have g 1(t)2 + g 2(t)2 3/2 κg(t) = ηeℓ 1(t)e ηeℓ1(t)ηe ηeℓ2(t) h ηeℓ 2(t)2 eℓ 2(t) i ηe ηeℓ1(t) h ηeℓ 1(t)2 eℓ 1(t) i ηeℓ 2(t)e ηeℓ0(t) =η2e ηeℓ1(t)e ηeℓ2(t) h eℓ 1(t)eℓ 2(t) eℓ 1(t)ηeℓ 2(t)2 + eℓ 2(t)ηeℓ 1(t)2 eℓ 2(t)eℓ 1(t) i =η2e ηeℓ1(t)e ηeℓ2(t) h ηeℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t)) + h eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) ii . Note that the sign of eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) is the sign of κeℓ. If κeℓis positive, then one can check that eℓ 1(t) > 0 and eℓ 2(t) < 0, thus the first term in brackets is necessarily negative. 
Thus by making η large κg(t) will become negative. Then we want ηeℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t)) + h eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) i 0, η eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) ( eℓ 2(t)eℓ 1(t))(eℓ 1(t) eℓ 2(t)) . When considering the case when the signed curvature is negative, we have: Lemma 2.10. Suppose that ℓ L is a proper loss function. Then, if ℓis mixable, for any 1-chart parametrization Φ of 2, the mixability constant is given by η ℓ= inf t Φ 1(int( 2)) eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) eℓ 1(t)eℓ 2(t)(eℓ 1(t) eℓ 2(t)) Conversely, if (2.7) holds, then ℓis mixable with mixability constant η ℓ. By the local nature of curvature, it would be possible to consider a local version of Lemma 2.10, which would characterize a local notion of mixability. This alternative will not be pursued here. Vovk (2015) observed that mixability for proper losses is equivalent to a quotient of curvatures being bounded away from zero. For the reader s convenience we prove this statement. To recover Vovk s statement observe that the properties he imposes on the loss functions imply that κ+ is the signed curvature (see Section 1.6). Lemma 2.11. A proper loss function ℓ L is mixable if and only if inf p κ+ ℓ(p) κ+ log(p) > 0, where κ+ log denotes the curvature of ℓlog. Moreover, when this holds, η ℓ= inf p κ+ ℓ(p) κ+ log(p) > 0, Proof. By Lemma 2.10, ℓis proper with mixability constant η ℓ> 0 if and only if η ℓ= inf t Φ 1(int( 2)) eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) eℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t)) Published in Transactions on Machine Learning Research (09/2023) for any given 1-chart parametrization Φ. Setting Φ = Φstd and using (2.6), we have the following. For any t Φ 1(int( 2)), eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) eℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t)) eℓ 1(t)eℓ 2(t) eℓ 2(t)eℓ 1(t) eℓ 1(t)2 + eℓ 2(t)2 3/2 eℓ 1(t)2 + eℓ 2(t)2 3/2 1 (1 t)2 + 1 1 (1 t)2 + 1 |eℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t))| κ+ eℓlog(t) eℓ 1(t)2 + eℓ 2(t)2 1 (1 t)2 + 1 !3/2 1 (1 t)2 1 |eℓ 2(t)eℓ 1(t)(eℓ 1(t) eℓ 2(t))| κ+ eℓlog(t) eℓ 1(t)2 1 + t2 (1 t)2 3/2 1 (1 t)2 1 | t 1 t eℓ 1(t)3(1 + t 1 t)| κ+ eℓlog(t), where we used that by properness eℓ 2(t) = t 1 t eℓ 1(t) (see (2.1)). Since κ+ is independent of the parametrization, we obtain the result. Remark 2.12. Lemma 2.11 exemplifies the usefulness of Φstd. The curvature of ℓlog is easily computed with respect to the standard parametrization, by fixing Φ = Φstd we can easily recognize when the curvature of ℓlog appears in our computation. However, since curvature is a geometric quantity we know this relation between curvatures will hold for any other parametrization as well. Using this point of view, the following observations enlighten why the weight function in (Buja et al., 2005) and in (Reid & Williamson, 2010) basically encodes all the relevant information in the binary case. Recall that given a proper loss function ℓ, the weight of ℓ(with respect to a local parametrization Φ of 2) is defined as eℓ 1(t) Φ2(t) eℓ 2(t) Φ1(t) We stress that the weight depends on the coordinates Φ of that we use, and hence we use the notation ℓΦ. As observed in Remark 2.12, we sometimes set Φ = Φstd (as it is done in (Buja et al., 2005; Reid & Williamson, 2010)) to be able to recognize some terms. Lemma 2.13. Let ℓ L be a proper loss and Φ a local parametrization of 2, denote by eℓΦ its local expression and by wℓΦ be its weight. Then we have for any t Φ 1(int( 2)), k+ eℓΦ(t) = 1 wℓΦ(t) |Φ 1(t)| 1 Φ1(t)2 + Φ2(t)2 and moreover, if λ is another proper loss, κ+ eλΦ(t) = wλΦ(t) weℓΦ(t) . 
(2.9) In particular, when Φ = Φstd, k+ eℓstd(t) = 1 wℓstd(t) 1 t2 + (1 t)2 Published in Transactions on Machine Learning Research (09/2023) and if in addition, λ = ℓlog (with Φ = Φstd), k+ eℓstd(t) κ+ eℓlog(t) = 1 wℓstd(t) 1 t(1 t) = wℓlog(t) wℓstd(t). (2.10) Proof. Let ℓ: 2 R2 2 be a proper loss and let Φ be any parametrization of 2 around p. Let us compute κ+ eℓΦ (assuming w.l.o.g. that κ+ eℓΦ = κeℓΦ, which means eℓ 1 > 0 and Φ 1 < 0). κ+ eℓΦ(t) = eℓ 1(t)eℓ 2(t) eℓ 1(t)eℓ 2(t) eℓ 2(t)2 + eℓ 1(t)2 3/2 = eℓ 1(t)eℓ 2(t) + eℓ 1(t)eℓ 1(t)Φ1(t) Φ2(t)2 eℓ 1(t)2 + eℓ 1(t)2 3/2 = eℓ 1(t) Φ2(t) 1 eℓ 1(t)3 Φ1(t)eℓ 1(t) + Φ2(t)eℓ 2(t) Φ1(t)2 Φ2(t)2 + 1 3/2 = eℓ 1(t) Φ2(t) 1 eℓ 1(t)3 Φ 1(t)eℓ 1(t) + Φ 2(t)eℓ 2(t) Φ1(t)2 Φ2(t)2 + 1 3/2 = eℓ 1(t) Φ2(t) 1 eℓ 1(t)3 Φ 1(t)eℓ 1(t) Φ 2(t)Φ1(t) Φ2(t) eℓ 1(t) Φ1(t)2 Φ2(t)2 + 1 3/2 = 1 Φ2(t) 1 eℓ 1(t) Φ 1(t) Φ 2(t)Φ1(t) Φ1(t)2 + Φ2(t)2 = 1 Φ2(t) 1 eℓ 1(t) Φ 1(t)Φ2(t) Φ 2(t)Φ1(t) Φ2(t) Φ1(t)2 + Φ2(t)2 eℓ 1(t) (Φ 1(t)Φ2(t) Φ 2(t)Φ1(t)) 1 Φ1(t)2 + Φ2(t)2 = 1 wℓΦ(t) (Φ 1(t)Φ2(t) + Φ 1(t)Φ1(t)) 1 Φ1(t)2 + Φ2(t)2 = 1 wℓΦ(t) ( Φ 1(t)) (Φ2(t) + Φ1(t)) 1 Φ1(t)2 + Φ2(t)2 where we have used that by properness we know that eℓ (t), Φ(t) = 0 ((2.1)), which implies eℓ (t), Φ(t) = eℓ (t), Φ (t) by differentiating with respect to t from the third to the fourth equality, and that Φ 1(t)+Φ 2(t) = 0 since Φ(t) 2 from the third to last to the second to last equality. Notice that in the last equation of the previous string of equalities, the only term involving ℓis eℓ 1 (or more precisely wℓΦ(t)) and the remaining terms depend only on the parametrization Φ. Then we obtain κ+ eλΦ(t) = wλΦ(t) The remaining statements follow from setting Φ = Φstd and (2.6). Remark 2.14. Combining Lemma 2.11 and (2.10), we recover the characterization of the mixability constant in terms of the quotient of weights obtained by van Erven et al. (2012, Section 4.1). However for the corresponding statement involving the quotient of second derivatives of the Bayes risks, the fact that 2 has Published in Transactions on Machine Learning Research (09/2023) an affine parametrization is important. Indeed, this relies on Corollary 3 in (Reid & Williamson, 2010) that states that w(t) = e L (t). In general, it can be checked that (t) Φ 1(t) Φ1(t) (t) = Φ2(t)eℓ 1(t)2 1 + Φ2(t)2 3/2 κeℓ(t), which reduces to w(t) = e L (t) when Φ = Φstd. From the point of view of the present work, L (or a quotient of them) is not a good quantity to consider since it strongly depends on coordinates. However, notice that if one restricts to affine parametrizations of 2 then e L (t) depends on eℓ 1(t)2 and κeℓ(t) and hence in view of Lemma 2.13 restricting to a fixed affine parametrization of 2 will make quotients of the second derivative of the Bayes risk well behaved. Let us remark some points about Lemma 2.13. Let ℓ: 2 R be a given strictly proper, fair, loss function. Given a parametrization, we obtain a weight wℓΦ given by (2.8), that is, the weight depends on the parametrization. The curvature of ℓis independent of Φ up to a sign. However, when defining κ+ ℓwe made the choice of the sign in a uniform way, thus the curvature is independent of the parametrization for the family of losses considered here. Then it follows that the quotient of curvatures is independent of the parametrization and by (2.9), it also follows that the quotient of weights is also independent of the coordinates (despite the weights being coordinate dependent themselves). 
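The relations of Lemma 2.13 and Lemma 2.11 can also be checked numerically. The sketch below (ours; it assumes the weight convention of (2.8) with Φ = Φ_std) compares the square loss ℓ̃(t) = ((1−t)², t²) with the log loss: the curvature satisfies κ⁺ = 1/(w(t)(t²+(1−t)²)^{3/2}), the quotient of curvatures equals the quotient of weights, and its infimum recovers the mixability constant η* = 2 of the square loss.

```python
# Numerical check of Lemma 2.13 / Lemma 2.11 for the square and log losses under Phi_std; illustration only.
import numpy as np

t = np.linspace(0.001, 0.999, 9999)
norm3 = (t**2 + (1 - t) ** 2) ** 1.5

def kappa_plus(d1, d2):
    """Curvature w.r.t. the unit normal pointing into the positive orthant (here l~_1' < 0)."""
    return -(d1[0] * d2[1] - d2[0] * d1[1]) / (d1[0]**2 + d1[1]**2) ** 1.5

# square loss l~(t) = ((1-t)^2, t^2): derivatives and weight w(t) = l~_2'(t) / Phi_1(t) = 2
k_sq = kappa_plus((-2 * (1 - t), 2 * t), (2 * np.ones_like(t), 2 * np.ones_like(t)))
w_sq = 2.0 * np.ones_like(t)

# log loss: derivatives and weight w(t) = 1 / (t (1 - t))
k_log = kappa_plus((-1 / t, 1 / (1 - t)), (1 / t**2, 1 / (1 - t) ** 2))
w_log = 1.0 / (t * (1 - t))

print(np.allclose(k_sq, 1 / (w_sq * norm3)), np.allclose(k_log, 1 / (w_log * norm3)))  # True True
print(np.allclose(k_sq / k_log, w_log / w_sq))                                          # True
print((k_sq / k_log).min())   # ~2.0: the mixability constant of the square loss (Lemma 2.11)
```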
A corresponding notion of weight in higher dimensions (for the multi-class case) is way more complicated and it is unclear whether using them would lead to successful results. One higher dimensional analog of curvatures is readily seen to be the so called principal curvatures of a hypersurface in Euclidean space (see Appendix A). This will be the main motivation when dealing with the multi-class case (Section 3) Alternative ways to characterize proper higher dimensional loss functions have been studied in (Williamson et al., 2016). 2.4 Geometric comparison of loss functions Fix a proper, fair loss function λ: 2 R2. Given another proper, fair loss function ℓ = λ, how might we compare them? From the point of view of differential geometry, since given p the normal vectors at λ(p) and ℓ(p) coincide, it is natural to look at their curvatures. Motivated by Lemma 2.11, we impose (for the moment) the condition inf p 2 κ+ ℓ(p) κ+ λ (p) = 1. Note that this implies that κ+ ℓ(p) κ+ λ (p) for all p 2. We divide the comparison in steps for clarity. (1) Expressing λ( 2) as a function. Note that since λ is proper and fair, the normal vector to a point λ(p) can only be (1, 0) when p = (1, 0) (i.e., when evaluating λ at the boundary of 2). Thus, the set λ(int( 2)) can be expressed as a graph over the x-axis. To obtain an explicit expression let Φ = Φstd. We use the fact that eλ1 : (0, 1) (0, l1) (where l1 could be infinity) is invertible. Then, we have that λ(int( 2)) = {(x, f(x)) | x (0, l1)} where f(x) = λ2(eλ 1 1 (x), 1 eλ 1 1 (x)). (2) Translating and parametrizing ℓ( 2). Let p0 int( 2) with κ+ ℓ(p0) > κ+ λ (p0), if such p0 does not exist then ℓ= λ. We define ℓ0 : 2 R2 by ℓ0(p) = ℓ(p) + [λ(p0) ℓ(p0)], i.e., we translate ℓso that it coincides with λ at λ(p0). (ℓ0 is not fair anymore, however, the curvature is invariant under translations.) Published in Transactions on Machine Learning Research (09/2023) We now parametrize ℓ( 2) as the graph of a function g defined on an interval I0 around x0 (the x-coordinate of λ(p0)), aligning it with λ (we can assume this interval to be maximal). We let g(x) = ℓ0 2((eℓ0 1) 1(x), 1 (eℓ0 1) 1(x)). Since κ+ ℓ(p0) > κ+ λ (p0), we know that around x0 the graph of g is to the northeast of f. (3) Comparison. If the graph of g is to the northeast of f on the whole I0, then we see that the superprediction set of ℓ0 is contained in that of λ. If this does not hold, it means that there is x1 I0 such that f(x1) = g(x1), and w.l.o.g. we can assume x1 > x0. Thus we know that g(x) f(x) 0 on [x0, x1] and g(x) f(x) = 0 on {x0, x1}, i.e., the boundary of [x0, x1]. Define the second order operator which computes the curvature of the graph (x, h(x)) (see (A.1)): L(h)(x) = κ+ h (x) = 1 (1 + h (x)2)3/2 h (x). Since κ+ ℓ(p0) > κ+ λ (p0), we see that L(g f) 0 on [x0, x1]. The maximum principle now implies that the supremum of g f is attained at the boundary on [x0, x1], and hence we know that f(x) = g(x) on [x0, x1], which is a contradiction. Thus the superprediction set ℓis contained in the superprediction set of λ (see Section 4). More generally, if we assume instead that inf p 2 κ+ ℓ(p) κ+ λ (p) = η, for some η > 0, we see that (see Appendix A) that ℓη(p) = ηℓ(p) satisfies inf p 2 κ+ ℓη(p) κ+ λ (p) = 1. That is, we can reproduce the previous analysis with ℓη instead of ℓ. The previous discussion motivates right away a comparison between proper, fair loss functions. Definition 2.15. Let λ: 2 R2 0 be a proper, fair loss in L, which we call a base loss. 
We say that a proper, fair loss ℓ: 2 R2 is mixable with respect to λ if inf p 2 κ+ ℓ(p) κ+ λ (p) > 0. 2.5 Mixability and fundamentality as comparison to the log loss Now, suppose ℓ L is proper and fair. Thus, in particular κ+ ℓ(p) > 0 for all p int( 2). We want to think of mixability as a geometric comparison to the log loss as suggested by Vovk (2015) and give a detailed interpretation of this comparison. We fix the standard parametrization of 2, Φ = Φstd : [0, 1] 2, given by Φ(t) = (t, 1 t). The log loss in these coordinates is thus given by eℓlog(t) = ( ln(t), ln(1 t)), and by (2.6), its curvature with respect to the unit normal pointing towards R2 0 is given by κ+ eℓlog(t) = t(1 t) (t2 + (1 t)2)3/2 . Published in Transactions on Machine Learning Research (09/2023) Notice that κ+ eℓlog(t) > 0 for all t (0, 1) and κ+ eℓlog(t) 0 as t 0 or t 1. Thus, clearly by Lemma 2.7, for any proper subinterval C of [0, 1] (cf. (Vovk, 2015, Corollary 2)), we have inf t C κ+ ℓ(t) κ+ eℓlog(t) > 0. Thus, whether a proper, fair loss function ℓis mixable or not will depend of the behavior of the quotient κ+ ℓ(p)/κ+ log(p) as p approaches (0, 1) and (1, 0). More precisely, we have obtained the following. Lemma 2.16. Let ℓ L be a proper loss. Then ℓis mixable if and only if lim p (0,1) κ+ ℓ(p) κ+ log(p) > 0, and lim p (1,0) κ+ ℓ(p) κ+ log(p) > 0. Motivated by this we make the following definition. Definition 2.17. Let ℓbe a proper, fair loss function in L, and Φ = Φstd be the standard parametrization of 2. We say that is ℓ(B1, B2)-logarithmic at the boundary if κ+ eℓlog(t) = B 1 1 > 0, and κ+ eℓlog(t) = B 1 2 > 0. Let us analyze what this means. Suppose that ℓis proper and (B1, B2)-logarithmic. Then for any t (0, 1), using (2.10) in Lemma 2.13 and (2.8), we have κ+ eℓlog(t) = 1 weℓstd(t) 1 t(1 t) = Notice that as t 0+, B 1 1 = lim t 0+ 1 |eℓ 1(t)| = lim t 0+ (ℓlog) 1(t) eℓ 1(t) and similarly, B 1 2 = lim t 1 1 1 t 1 |eℓ 2(t)| = lim t 1 (ℓlog) 2(t) eℓ 2(t) that is, we are only comparing the rate at which ℓi, i = 1, 2, go to 0 (since they do by fairness) with the rate at which the log loss does. In (Vovk, 2015), Vovk defines a loss function λ to be fundamental if given a (computable, proper, mixable) loss function λ and a data sequence in ζ Z that is random under λ with respect to a prediction algorithm F, then it is random under λ with respect to F. He shows that a fair, mixable ℓ L is fundamental if and only if (using the notation in (Vovk, 2015)) sup p [0,1] κℓ(p) κlog(p) < . Published in Transactions on Machine Learning Research (09/2023) Since we have seen that mixability can be regarded as a comparison of curvatures of the loss curve of ℓand that of ℓlog and we have reinterpreted fundamentabiliy as a comparison of ℓand ℓlog near the boundary building on Definition 2.15, we can easily come up with a notion of λ-fundamentality. Definition 2.18. Let λ be a proper, fair loss function in L. We say that a proper, fair loss function ℓ L is λ-fundamental if ℓis mixable with respect to λ, and when Φ = Φstd, we have κ+ eλ (t) < κ+ eλ (t) < . Suppose now that a mixable loss function ℓ L is fundamental. Then there exist η, γ > 0 such that κ+ log(p) γ, for all p int( 2). This implies that η 1κ+ ℓ(p) κ+ log(p) and κ+ log(p) γ 1κ+ ℓ(p), for all p int( 2), which readily implies (Appendix A) that κ+ ηℓ(p) κ+ log(p) and κ+ log(p) κ+ γℓ(p), for all p int( 2). Rephrasing the previous discussion we have obtained the following characterization of fundamentality. Theorem 2.19. 
A loss function ℓ L is fundamental if and only if there exist numbers η, γ > 0, such that for any p int( 2), there are translation vectors xp and yp in R2 0 such that spr(ηℓ+ xp) spr(ℓlog) spr(γℓ+ yp). 2.6 Constructing new mixable losses from previous We now observe how mixability helps us to construct new proper, fair and mixable functions from previous proper, fair and mixable losses. We first define a family of losses that will serve to illustrate the idea. We set Φ = Φstd and λ = ℓlog. Let a > 0 and define the loss function λa : 2 R2 0 λa(p) = aλ(p). It can be readily checked that κ e λa(t) = a 1κ+ λ (t), thus since κ+ λ (t) = 1 it follows that λa is 1-mixable for a 1 and it is not if a > 1. Note that λa is still proper and fair. Take then a < 1, we can readily see that there exists a proper, fair an mixable loss function λ such that λ = λa + λ . Published in Transactions on Machine Learning Research (09/2023) Indeed, λ = λ λa = λ1 a, which is fair, proper and 1-mixable. This process works in a more general setting than scalings of λ. Consider for example the spherical loss σ defined in coordinates by eσ(t) = 1 t t2+(1 t)2 , 1 + t It can be easily checked that this is bounded, proper and fair and that κeσ(t) = 1. Thus κ+ λ (t) = (t2 + (1 t)2)3/2 t(1 t) > 1, thus σ is 1-mixable. Thus, as before, there is a loss function ℓ such that λ = σ + ℓ . Moreover, the loss function given (in coordinates) by ℓ (t) = λ(t) σ(t) = ln(t) 1 + t t2+(1 t)2 , ln(1 t) 1 t which can be seen to be unbounded, proper, fair and mixable. We close this subsection with the following observation. Suppose that ℓis a proper, fair, mixable loss function with mixability constant η > 0. Then the loss function ℓη = ηℓis 1-mixable. Thus, there exists a proper, fair, mixable loss ℓ such that ℓlog = ℓη + ℓ . As we will see in Section 4, the previous observation can be interpreted from the point of view of the superprediction sets of the involved loss functions and convex geometry: spr(ηℓ) slides freely inside spr(λ) (see Theorem 4.23). 2.7 Composite losses and the canonical link In this part we discuss composite losses following (Reid & Williamson, 2010). Let us recall their setting. Let V R be a set of prediction values. A link function is a continuous map ψ: [0, 1] V. Given a loss function eϱ: {0, 1} [0, 1] R and assuming V = R, if ψ is invertible, we define the composite loss ϱψ as eϱψ(y, v) = eϱ(y, ψ 1(v)). Definition 2.20. A composite loss eϱψ is a proper composite loss if eϱ is a proper loss in the sense of Reid & Williamson (2010). Recall that in (Reid & Williamson, 2010), Φ = Φstd is implicitly assumed. Then, given a loss function eϱ (in the (Reid & Williamson, 2010) sense), we can construct a loss function ϱ: 2 R2 0, by ϱ = eϱ Φ 1 std. Then, the composite loss eϱψ can be expressed as eϱψ(v) = (eϱ ψ 1)(v) = (ϱ Φstd ψ 1)(p) = ϱ (Φstd ψ 1) (p) In other words, the composite loss eϱψ is the local expression of ϱ with respect to the parametrization Φ = Φstd ψ 1 of 2. We denote the local expression of ϱ with respect to Φ by bϱ, that is bϱ := ϱ Φstd Ψ 1 = eϱ Ψ 1 To show how this reconciliation of terms work, we obtain a result similar to Corollary 12 in (Reid & Williamson, 2010). Suppose that a composite loss eϱψ is given and it has differentiable partial losses (i.e., the corresponding Published in Transactions on Machine Learning Research (09/2023) loss ϱ is in L), furthermore, we assume that ψ is a diffeomorphism which in one dimension means it is strictly monotonic. 
Then we know that eϱψ is strictly proper if and only if ϱ is strictly proper (by definition). This implies that p is normal to ϱ( 2) at ϱ(p) for all p int( 2) and its curvature is positive (with respect to the unit normal pointing towards R2 0). This means for all v V, 0 = bϱ (v), Φ(v) = bϱ 1(v)Φ1(v) + bϱ 2(v)Φ2(v) = eϱ 1(ψ 1(v))(ψ 1) (v)Φ1(v) + eϱ 2(ψ 1(v))(ψ 1) (v)Φ2(v) = eϱ 1(ψ 1(v))Φ1(v) + eϱ 2(ψ 1(v))(1 Φ1(v)), where we have used that ψ is a diffeomorphism and that Φ1 + Φ2 = 1 for all parametrizations Φ of 2. Therefore, we have Φ1(v) eϱ 1(ψ 1(v)) eϱ 2(ψ 1(v)) = eϱ 2(ψ 1(v)), ψ 1(v) = eϱ 2(ψ 1(v)) (eϱ 2(ψ 1(v)) eϱ 1(ψ 1(v))) for all v V. Since we are working with valid reparametrizations the choice of Ψ will not affect the curvature of ϱ. Hence we obtain Corollary 2.21. A composite loss eϱψ is strictly proper if and only if ϱ L is strictly proper and ψ satisfies ψ 1(v) = eϱ 2(ψ 1(v)) (eϱ 2(ψ 1(v)) eϱ 1(ψ 1(v))) for all v R. Remark 2.22. We have seen that whether a loss function ℓ L is strictly proper or not, depends on whether conditions (2.3) and (2.4) hold or not. Notice that under a (admissible) change of coordinates, for example given by a link ψ, (2.4) will not be modified. However, (2.3) might change (since in a way, we are changing the velocity at which we move on ℓ( 2)). Hence, Corollary 2.21 is giving us a way to define the set of admissible links (or reparametrizations of 2) given a loss function ℓand the standard parametrization of 2. In this case, the new parametrization is given by Φ = Φstd ψ 1. For applications, it is desired to be able to work with a given composite loss eϱψ, and moreover, to have convexity of the partial losses eϱψ 1 and eϱψ 2 . From our point of view, we see eϱψ as the local expression of some ϱ: 2 R, so that bϱ := ϱ Φ = ϱ Φstd ψ 1 = (ϱ Φstd) ψ 1 = eϱ ψ 1. Let us work with the partial losses separately: d dv bϱ1(v) = eϱ 1(ψ 1(v))(ψ 1) (v) d dv bϱ2(v) = eϱ 2(ψ 1(v))(ψ 1) (v) Proceeding as in the proof of Lemma 2.4, properness implies 0 = v e L(u, v)|v=u = eϱ 1(ψ 1(v))(ψ 1) (v)Φ1(u) + eϱ 2(ψ 1(v))(ψ 1) (v)Φ2(u)|s=u Published in Transactions on Machine Learning Research (09/2023) or, equivalently, 0 = eϱ 1(ψ 1(v))Φ1(v) + eϱ 2(ψ 1(v))Φ2(v). Therefore, we can define w as w(v) := weϱ(Ψ 1(v)) = eϱ 2(ψ 1(v)) Φ1(v) = eϱ 1(ψ 1(v)) Φ2(v) , (2.11) where weϱ is the weight of eϱ, we can rewrite the derivatives of the partial losses of bϱ as dv (v) = w(v)Φ2(v)(ψ 1) (v), dv (v) = w(s)Φ1(v)(ψ 1) (v). Taking second derivatives we have dv2 (v) = w(v)(ψ 1) (v) Φ2(v) w(v)(ψ 1) (v) Φ 2(v), dv2 (v) = w(v)(ψ 1) (v) Φ1(v) + w(v)(ψ 1) (v) Φ 1(v). A way to guarantee both expressions are positive is as follows. Assume w.l.o.g. that (ψ 1) > 0. Since we are assuming w > 0, bϱ2 is increasing and bϱ1 is decreasing (also we have Φ1 is increasing and Φ2 is decreasing). We readily see that imposing w(v)(ψ 1) (v) = 1 for all v R, is enough to guarantee both second derivatives to be strictly positive. Definition 2.23. Given ϱ L strictly proper, we define the canonical link ψ as the link defined by (ψ 1) (v) = ψ 1(v) eϱ 2(ψ 1(v)) = 1 w(v), (2.12) for v V, where w is defined in (2.11). The differential equation (2.12) can be seen as separable ordinary differential equation, which is solvable for loss functions in L. To give a geometric meaning, we look at the norm of the velocity of the loss curve α(v) = bϱ(v). |α (v)|2 = w(v)2(ψ 1) (v)2 Φ1(v)2 + Φ2(v)2 By assuming w(s)(ψ 1) (s) = 1 and Φ = Ψ, we have |α (s)|2 = Φ0(ψ 1(s))2 + Φ1(ψ 1(s))2 . 
Thus the canonical link gives a parametrization of 2 such that bϱ is a curve such that its velocity vector at v coincides with the length of the vector Φ(ψ 1(v)). In other words, it is a parametrization of the loss curve ϱ(int( 2)) such that for ϱ(p) = bϱ(v) ℓ(int( 2)), the tangent vector at the point has length |p|. We close this discussion with a charcterization of the canonical link. Theorem 2.24. Let ϱ L be a strictly proper loss function and ψ its canonical link. The reparametrization of ϱ determined by its canonical link is a parametrization of ϱ(int( 2)) with weight equal to 1. Published in Transactions on Machine Learning Research (09/2023) Proof. Let bϱ = ϱ (Φstd ψ 1) = eϱ ψ 1 be the reparametrization of ϱ(int( 2)) determined by the canonical link. Since bϱ 2(v) = eϱ 2(ψ 1(v))(ψ 1) (v), for all v V, and from Definition 2.23 eϱ 2(ψ 1(v))) = ψ 1(v) (ψ 1) (v), (2.13) for all v V, we have bϱ 2(v) = ψ 1(v). Thus wbϱ(v) = bϱ 2(v) Φ1(v) 3 Mixability for Multi-Class Classification Now we focus our attention on multi-class classification loss functions, that is, maps ℓ: n Rn 0 given by the partial losses ℓ(p) = (ℓ1(p), ..., ℓn(p)). Our main goal is to interpret mixability as a geometric comparison of a given loss function ℓto the log loss, as we did for the binary case. As suggested by the comments after Remark 2.11, the extra work of characterizing properness and mixability in a geometric way (coordinate independent) will pay off since to carry out the comparison we will look at the scalar second fundamental forms of ℓ(int( n)) and ℓlog(int( n)). The scalar second fundamental form measures how a Riemannian manifold curves inside an ambient space , in this case how ℓ(int( n)) curves inside Rn (see Appendix A for details). The definition of L (Definition 2.1) can be extended to higher dimensions. Definition 3.1. An admissible loss function is a map ℓ: n Rn 0 such that (i) ℓ(int( n)) Rn 0 is a (n 1)-manifold of class C2, (ii) there exists a C1 map n : ℓ(int( n)) Nℓ(int( n)), n(ℓ(p)) =: nℓ(p), where Nℓ(int(( n)) is the normal space of ℓ(int(( n)), and (iii) n(ℓ(p)) or n(ℓ(p)) belongs to Rn >0 for all p int( n). We denote the set of admissible loss functions as Ln, or simply L when the dimension is clear from context. We fix the log loss and denote it for convenience by λ := ℓlog : n Rn 0, as the map λ(p) = ( ln(p1), ..., ln(pn)), for p = (p1, ..., pn) n. Let ℓ Ln and consider a parametrization Φ: D Rn 1 n of n around p int( n). The local expression of the conditional risk (using the parametrization Φ Φ of n n around (p, p)) is given by e L(t, s) = eℓ(s), Φ(t) = k=1 eℓk(s)Φk(t), where t = (t1, ...tn 1), s = (s1, ..., sn 1) D and eℓ= ℓ Φ. Published in Transactions on Machine Learning Research (09/2023) Imposing ℓto be proper implies that when fixing t, s = t is a critical point of e L(t, ), that is, 0 = si e L(t, )|s=t = sieℓ(t), Φ(t) for all i {1, ..., n 1}. Note that since the tangent space of Mℓat eℓ(t), Teℓ(t)eℓ(U), is generated by { s1eℓ(t), ..., sn 1eℓ(t)}, we conclude that Φ(t) is a normal vector. In other words, as before, we have n(ℓ(p)) = p for all p int( n). The fact that e L(t, ) achieves a minimum at s = t (at interior points) is equivalent to requiring that the Hessian, D2e L, is positive definite at s = t. The Hessian of e L(t, ) at s = t is given by [D2e L]ij(t) = sjsi e L(t, )|s=t = 2 sjsieℓ(t), Φ(t) . The next step is to relate [D2e L]ij(t) to the scalar second fundamental form h of Mℓ= ℓ( n) (see Appendix A for its definition). 
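Before relating [D²L̃] to the second fundamental form, the following small sketch (ours, for n = 3 with Φ = Φ_std, not code from the paper) verifies the two conditions just stated for the multi-class log loss at an interior point: the s-gradient of L̃(t, ·) vanishes at s = t and the s-Hessian there is positive definite, matching the closed form δ_km/t_k + 1/t_n for the log loss.

```python
# Gradient and Hessian of the local conditional risk L~(t, s) at s = t for the multi-class log loss (n = 3).
import numpy as np

def phi_std(s):                      # Phi_std: (s1, s2) -> (s1, s2, 1 - s1 - s2)
    return np.array([s[0], s[1], 1.0 - s[0] - s[1]])

def log_loss(s):                     # l~(s) = (-ln p_1, -ln p_2, -ln p_3)
    return -np.log(phi_std(s))

def cond_risk(t, s):                 # L~(t, s) = <l~(s), Phi_std(t)>
    return float(log_loss(s) @ phi_std(t))

t = np.array([0.2, 0.5])             # interior point of the simplex, p = (0.2, 0.5, 0.3)
h = 1e-4
grad = np.zeros(2)
hess = np.zeros((2, 2))
for i in range(2):
    e_i = np.eye(2)[i]
    grad[i] = (cond_risk(t, t + h * e_i) - cond_risk(t, t - h * e_i)) / (2 * h)
    for j in range(2):
        e_j = np.eye(2)[j]
        hess[i, j] = (cond_risk(t, t + h * (e_i + e_j)) - cond_risk(t, t + h * (e_i - e_j))
                      - cond_risk(t, t - h * (e_i - e_j)) + cond_risk(t, t - h * (e_i + e_j))) / (4 * h**2)

print(np.allclose(grad, 0.0, atol=1e-6))          # True: s = t is a critical point of L~(t, .)
print(np.linalg.eigvalsh(hess))                   # both eigenvalues > 0: positive definite
# closed form for the log loss: hess[i, j] ~= delta_ij / t_i + 1 / (1 - t_1 - t_2)
print(np.allclose(hess, np.diag(1 / t) + 1.0 / (1.0 - t.sum()), atol=1e-3))   # True
```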
More precisely, we compute the h with respect to a local parametrization Φ of n, i.e., we obtain the matrix [hij] representing h. To do this we need to compute the second derivatives of its parametrization eℓ= ℓ Φ (Appendix A). Since, sieℓ(s) = sieℓ1(s), ..., sieℓn 1(s) 2 sjsieℓ(s) = 2 sjsieℓ1(s), ..., 2 sjsieℓn 1(s) . The scalar second fundamental form (with respect to the normal vector pointing towards Rn 0) is then given by hij(s) = h( sieℓ(s), sj eℓ(s)) = 2 sjsieℓ(s), n(eℓ(s)) = 2 sjsieℓ(s), Φ(s) = 1 |Φ(s)| 2 sjsieℓ(s), Φ(s) = 1 |Φ(s)|[D2e L]ij(s), (3.1) for i, j = 1, ..., n 1, thus if [D2e L]ij(s) is positive definite, then the matrix [hij](s) is positive definite. In this case its eigenvalues are strictly positive and hence, the principal curvatures of Mℓat ℓ(s) (see Appendix A), κ+ i (s) (with respect to the unit normal pointing towards Rn 0) are all positive. Therefore, using a similar reasoning as we did in the case n = 2, we have obtained the following geometric characterization of properness (by following the same arguments as in Section 2). Lemma 3.2. Let ℓ Ln. ℓis strictly proper if and only if nℓ(p) = p/|p| and the principal curvatures of Mℓat ℓ(p), κ+ i (p) (i = 1, .., n 1), are strictly positive for all p int( n). We briefly explain how the comparison of scalar second fundamental forms will be performed. We follow a similar procedure as the one described in Section 2.4 for the case n = 2. 1. We establish that given a proper loss function ℓ Ln, around every p int( n), ℓ(int( n)) can be parametrized as a graph of a function f defined on a neighborhood around some x Rn such that (x , f(x )) = ℓ(p ). We do this explicitly for the log loss λ. Published in Transactions on Machine Learning Research (09/2023) ηℓ(int( n)) Figure 6: Geometric interpretation of η-mixability. 2. Since λ and ℓare proper, the normal vector to λ(int( n)) and ℓ(int( n)) at λ(p ) and ℓ(p ), respectively, is p /|p |. Hence we can identify their tangent spaces at these points. We do so and fix the parametrizations given in step (1). 3. By assuming η-mixability of ℓ, we look at the principal curvatures of Eη(ℓ(int( n)) and prove an equivalent condition for them to be non-negative with respect to normal vector field pointing towards Eη(spr(ℓ)) (i.e., convexity). The condition to be satisfied is seen to be comparison of the scalar second fundamental forms of λ and ℓthat we can recognize by step (1). 4. We interpret this comparison as follows. Since the tangent spaces to ℓ(p ) (and ηℓ(p )) and λ(p ) coincide for the chosen point p , if we translate ℓto coincide to λ at p , call this tangent space H (and note it can be indetified with the supporting plane of the loss functions at the given point). Then if we express (locally) ηℓ(int( n)) and λ(int( n)) over H, the graph of ηℓ(int( n)) lies above the graph of λ(int( n)). See Figure 6. 3.1 Representing proper loss functions as graphs over Euclidean spaces When restricting to the set of admissible loss functions Ln (n 2), we can represent losses as functions over Rn 1 (a similar approach was taken in (van Erven et al., 2012); the difference relies on the fact that here we are after the comparison of second fundamental forms), which allows us to represent geometric quantities in a simple way. This will be useful to recognize these quantities when comparing a proper loss function ℓto the log loss λ, as we did for the binary case in Section 2. Let ℓ: n Rn 0 be a proper loss in Ln given by ℓ(p) = (ℓ1(p), ..., ℓn(p)) . 
Let Φ: n 1 Rn 1 n be the standard parametrization of n given by Φ(s) = Φstd(s) = s1, ..., sn 1, 1 where s = (s1, ..., sn) n 1. The local expression of eℓin these coordinates is then given by eℓ(s) = (ℓ Φ)(s), so that eℓi(s) = (ℓi Φ)(s). Also, we define the projection Π: Rn 0 Rn 1 0 as Π(y1, ..., yn) = (y1, ..., yn 1). Recall that properness implies that the normal vector of Mℓ= ℓ( n) at ℓ(p) can be chosen to be |p| 1p, for p int( n). As a consequence, the normal vector is never parallel to the hyperplane {(x1, .., xn) Published in Transactions on Machine Learning Research (09/2023) Rn | xn = 0}, so that around any point ℓ(p) with p int( n), Mℓcan be written as a graph over Rn 0 {0} (as regular as Mℓis). In general, the existence of this function is guaranteed by the implicit function theorem, however, in our case we can give an explicit description of it as follows. The function Π|Mℓis a map with injective derivative, say around ℓ(q) for a fixed q int( n), therefore, the inverse function theorem ensures the existence (and differentiability) of a local inverse, which we can denote by Π| 1 Mℓ. This inverse map can be seen as a local parametrization of Mℓ. Thus, the local expression of ℓ(viewed as a map from n to Mℓ), ℓ: Dq Rn 1 Uℓ(q) Rn 1 (where the latter are small neighborhoods around Φ 1(q) and Π(ℓ(q)) respectively) is given by ℓ(s) = (Π ℓ Φ)(s) = (Π eℓ)(s) = eℓ1(s), ..., eℓn 1(s) . This map is a diffeomorphism and its inverse ℓ 1 : Uℓ(q) Dq, will be denoted by ℓ 1(x) = eℓ 1 1 (x), ..., eℓ 1 n 1(s) . We warn the reader about this abuse of notation, eℓ 1 i (x) is not the inverse of eℓi(s), it is a map satisfying xi = eℓi(s), si = eℓ 1 i (x), (ℓ ℓ 1)(x) = x, (ℓ 1 ℓ)(s) = s. We want to define f : Uℓq R such that graph(f) Mℓ. We see that setting Uℓ(q) Π(Mℓ), so that it contains Π(ℓ(q)), we arrive to f(x) = (ℓn Φ ℓ 1)(x) = eℓn(ℓ 1(x)). We have obtained the following result. Lemma 3.3. Let ℓ Ln be a strictly proper loss. Let q int( n). Then there exists an open set U Rn 1 0 {0} and a function f : U R 0 such that Mℓadmits the parametrization Φf(x) = (x, f(x)), around ℓ(q). Let ℓand f be as in Lemma 3.3. The unit normal vector field (pointing towards Rn 0) is then given by nℓ(x) = 1 p |Df(x)|2 + 1 ( Df(x), 1) . (3.2) We proceed to calculate the scalar second fundamental form. The first and second derivatives of Φf are given by kΦf(x) = (ek, kf(x)), kmΦf(x) = (0, kmf(x)), for k, m = 1, ..., n 1, where ek denotes the canonical basis of Rn 1 and 0 is the 0 vector of Rn 1. Denote by hℓthe scalar second fundamental form of Mℓ. Thus with respect to this coordinates we have hℓ km(x) = kmΦf(x), nℓ(x) = 1 p |Df(x)|2 + 1 kmf(x), (3.3) for k, m = 1, ..., n 1. Published in Transactions on Machine Learning Research (09/2023) 3.1.1 Mλ as a graph Fix an arbitrary point q int( n). The local expression of λ (with respect to the standard parametrization Φ = Φstd around q and Π around ℓ(q )) is given by λ(s) = ( ln(s1), ..., ln(sn 1)) , thus, we have λ 1(x) = e x1, ..., e xn 1 . Fix s = Φ 1(q ). Thus, around x = Π(λ(q )), using Lemma 3.3, Mλ around ℓ(q) can be described as Φg(x) = (x, g(x)). Moreover, in this case we have the explicit expression g(x) = ln(1 Pn 1 i=1 e xi). Notice that λ 1(x ) = s . We now compute the scalar second fundamental form hλ of λ at x . 1 Pn 1 i=1 e xi 1 Pn 1 i=1 e xi + e xke xm (1 Pn i=1 e xi)2 for k, m = 1, ..., n 1 (here δkm denotes the Kronecker delta). 
In particular, ek, s k 1 Pn 1 i=1 s i 0, δkms k 1 Pn 1 i=1 s i + s ks m 1 Pn 1 i=1 s i 2 for k, m = 1, ..., n 1, and since n((x , g(x )) = 1 q Pn 1 i=1 (s i )2+(1 Pn 1 i=1 s i )2 (s , 1 Pn 1 i=1 s i ) we have hλ km(x ) = kmΦg(x ), n((x , g(x )) (3.4) = 1 q Pn i=1(s i )2 + (1 Pn i=1 s i )2 δkms k + s ks m 1 Pn i=1 s i for k, m = 1, ..., n 1. Remark 3.4. Note that if instead of λ we would have used a translation of it, that is, for c Rn, define a loss function ϕ: n Rn 0 by ϕ(p) = λ(p) + c, we can repeat the previous computation. The only difference is that we would have a different point xc instead of x . 3.2 Geometric interpretation of mixability Mixability is defined as a property of the superprediction set of a proper loss ℓ Ln. More precisely, ℓis mixable if and only if Eη(spr(ℓ)) is convex for some η > 0. As before, we can determine whether Eη(spr(ℓ)) is convex by looking at its boundary Eη(spr(ℓ)) = Eη(ℓ( n)). Eη(spr(ℓ)) is convex if the principal curvatures of Eη(ℓ( n) are non-negative (when defined with respect to the inner pointing normal vector) at all points. Since convexity is a global property that can be tested locally everywhere , it makes sense to make the following definition. Published in Transactions on Machine Learning Research (09/2023) Definition 3.5 (η-Mixability at p n). We say that ℓ Ln is η-mixable at p int( n) if Eη(Mℓ) has non-negative principal curvatures with respect to the unit normal vector pointing towards Eη(spr(ℓ)) at Eη(ℓ(p)). Clearly, ℓ Ln is η-mixable at all p int( n) if and only if it is η-mixable. Let ℓ, ϱ Ln be strictly proper. First, we note that properness implies that the second fundamental forms of ℓand ϱ can be compared in the following sense. Given q n, note that the normal vector to Mℓand Mϱ can be chosen to be q /|q |. A translation does not affect the geometric properties of Mϱ (since it is an isometry of Rn), thus we consider the translated loss ϱℓ(q ) : n Rn, given by ϱℓ(q )(p) = ϱ(p) + [ℓ(q ) ϱ(q )] , i.e., we translate ϱ by the vector cq = λ(q ) ℓ(q ) so that both ϱq and ℓcoincide when evaluated at q . Doing so allows us to identify the tangent spaces to Mϱℓ(q ) and Mℓat ϱℓ(q )(q ) = ℓ(q ). We will call ϱℓ(q ) the translation of ϱ to ℓ(q ). Lemma 3.6. Let ℓ Ln be strictly proper. Let hℓand hλ denote the scalar second fundamental form of Mℓ and Mλ (the log loss), respectively. Then, ℓis η-mixable at p int( n) if and only if hℓ(ℓ(p)) ηhλ(λ(p)) (3.6) is positive semi-definite, where hℓand hλ denote the second fundamental forms of ℓand λ in the graphical coordintes described in Lemma 3.3. And therefore, ℓis η-mixable if and only if (3.6) holds for all p int( n). Proof. Let ℓ: n Rn 0 be an admissible proper loss ℓ(p) = (ℓ1(p), ..., ℓn(p)) . The η-exponential projection map Eη : Rn Rn is given by Eη(y) = (e ηy1, ..., e ηyn). Let q int( n) and write Mℓaround ℓ(q ) as the graph of a function f over Rn 1, defined on an open set U f x containing x , such that f(x ) = ℓ(q ). We can directly give a parametrization of Eη(Mℓ) around Eη(ℓ(q )) = Eη((x , f(x )) by Ψ(x) = e ηx1, ..., e ηxn 1, e ηf(x) . We proceed to compute the second fundamental form of Eη(Mℓ) around Eη(ℓ(q )) (with respect to the inward pointing unit normal vector). 
The first and second derivatives of Ψ are given by kΨ(x) = ηe ηxkek, η kf(x)e ηf(x) kmΨ(x) = η2δkme ηxkek, η kmf(x)e ηf(x) + η2 kf(x) mf(x)e ηf(x) and noting that the (inward pointing) unit vector field is given by n(Eη(ℓ(x)) = 1 Pn 1 i=1 if(x)2e2ηxi + e2ηf(x) 1/2 1f(x)eηx1, ..., nf(x)eηxn, eηf(x) Therefore, letting Eη := Eη(Uf) = Eη(f(Uf)), the second fundamental form of Eη at Eη((x , f(x ))) is given by h Eη km(x ) = kmΨ(x ), n(ℓη(x )) Published in Transactions on Machine Learning Research (09/2023) = 1 Pn 1 i=1 if(x )2e2ηx i + e2ηf(x ) 1/2 η2δkme ηx kek, i=1 if(x )eηx i ei Rn 1 +η kmf(x ) η2 kf(x ) mf(x ) = η Pn 1 i=1 if(x )2e2ηx i + e2ηf(x ) 1/2 [ηδkm kf(x ) + kmf(x ) η kf(x ) mf(x )] . Thus, since the convexity of Eη(spr(ℓ)) is equivalent to the principal curvatures of Eη(Mℓ) being non-negative at q for all q int( n) (with respect to the inner pointing normal vector), we see this will be the case if and only if the matrix Akm = kmf(x ) η [ δkm kf(x ) + kf(x ) mf(x )] is positive semi-definite for all x corresponding to q int( n). Note that since we have a graphical parametrization Φf of Mℓaround x U, we have kΦf(x ) = (ek, kf(x )) and by (3.2), n(x , f(x )) = 1 p |Df(x )|2 + 1 ( Df(x ), 1) . On the other hand, since the normal vector to Φf(U) at (x , f(x )) is q |q |, we have n((x , f(x ))) = 1 q Pn i=1(s i )2 + (1 Pn i=1 s i )2 s 1, ..., s n, 1 for s Rn 1 such that Φ(s ) = q . By properness we know that 0 = kΦf(x ), n((x , f(x ))) = 1 q Pn i=1(s i )2 + (1 Pn i=1 s i )2 s k + kf(x ) kf(x ) = s k 1 Pn 1 i=1 s i , 1 + |Df(x )|2 = 1 + Pn 1 j=1 (s j)2 1 Pn 1 i=1 s i 2 = Pn 1 j=1 (s j)2 + 1 Pn 1 i=1 s i 2 1 Pn 1 i=1 s i 2 . (3.7) Using (3.3) and the previous observations, we can rewrite the terms of Akm as |Df(x )|2 + 1 p |Df(x )|2 + 1 kmf(x ) Published in Transactions on Machine Learning Research (09/2023) = 1 1 Pn 1 i=1 s i Pn 1 j=1 (s j)2 + 1 Pn 1 i=1 s i 2 hℓ km(x ) [ δkm kf(x ) + kf(x ) mf(x )] = δkms k 1 Pn 1 i=1 s i + s ks m 1 Pn 1 i=1 s i 2 . Now, consider the log loss λ and its translation to ℓ(q ) which we denote by λ to simplify the notation. That is, we have λ = λ(p) + [ℓ(q ) λ(q )] . As discussed in Remark 3.4, we can write Mλ as a graph around x (since λ (q ) = ℓ(q )). The scalar second fundamental form of Mλ at λ (q ) is then given by hλ ij (x ) = 1 q Pn i=1(s i )2 + (1 Pn i=1 s i )2 δkms k + s ks m 1 Pn i=1 s i This readily implies that [ δkm kf(x ) + kf(x ) mf(x )] = 1 1 Pn 1 i=1 s i Pn 1 j=1 (s j)2 + 1 Pn 1 i=1 s i 2 hλ ij (x ). Therefore, ℓis η-mixable at q if and only if we have that [hℓ ij](x ) η[hλ ij ](x ) is semi-positive definite. Since q was arbitrary the result follows. Remark 3.7. The previous comparison of second fundamental forms is possible because properness forces the induced metrics by ℓand λ to coincide at ℓ(q ) = λ (q ), that is, [gℓ ij](x ) = [gλ ij ](x ) (see Appendix A and Remark A.3). The conclusion of Theorem 3.6 does not necessarily hold if one takes a different coordinate system. In order to get a geometric interpretation (i.e., independent of coordinates) we note the following: 0 [hℓ ij](x ) η[hλ ij ](x ) = [hℓ ij](x )[gℓ ij] 1(x )[gℓ ij](x ) η[hλ ij ](x )[gλ ij ] 1(x )[gλ ij ](x ) = [hℓ ij](x )[gℓ ij] 1(x ) η[hλ ij ](x )[gλ ij ] 1(x ) [gℓ ij](x ). The matrices [hℓ ij](x )[gℓ] 1(x ) and [hλ ij ](x )[gλ ] 1(x ) are the local expression of the Weingarten map (see (Lee, 2018) for its definition and properties) of ℓand λ respectively. 
The eigenvalues of these matrices are the principal curvatures of Mℓand Mλ (and they are independent of coordinates), and the determinants are their Gaussian curvatures. From here it also follows that η [hℓ ij](x )[gℓ ij] 1(x ) [hλ ij ](x )[gλ ij ] 1(x ) [gℓ ij](x ) =η h [hηℓ ij ](x )[gηℓ ij ] 1(x ) [hλ ij ](x )[gλ ij ] 1(x ) i [gℓ ij](x ), Published in Transactions on Machine Learning Research (09/2023) [W ηℓ ij ] [W λ ij] 0, (3.9) where W ℓdenotes the Weingarten map of the loss function ℓ. Then once a system of coordinates around p n is chosen the relation (3.9) holds. A priori, the relation obtained between the Weingarten maps of ℓ and λ does not provide much information, but it does points to look at the loss function ηℓ. With this in mind Lemma 3.6 does give a direct geometric interpretation as follows. Let ℓ: n Rn 0 in L be a proper loss. Given a point q n we know that around ℓ(q), Mℓcan be parametrized with Φf(x) = (x, f(x)) for some function f around the point Π(ℓ(q)). Let x = Φ(ℓ(q)). Consider now the proper loss ϱ = ηℓ, for some η > 0. We readily see that ϱ can be parametrized as Φg(y) = (y, g(y)) with g(y) = ηf(η 1y), with g defined around yq = ηx . Now we compute the second fundamental form of ϱ at yq. Notice that ig(y)|y=ηx = η if(η 1x)|y=ηx η 1 = if(x ), ijg(y)|y=ηx = ijf(η 1x)|y=ηx η 1 = η 1 ijf(x ), hϱ ij(ηx ) = hηℓ ij (ηx ) = η 1hℓ(x ). Then assuming the hypothesis of Lemma 3.6, we obtain hηℓ ij (ηx ) hλ ij(x ) = η 1hℓ ij(x ) hλ ij(x ) = η 1 hℓ(x ) ηhλ ij(x ) 0. (3.10) The supporting planes at ηℓ(p) and λ(p) of Mηℓ(or more precisely, of its translation to λ(p)) and Mλ, respectively, coincide (since the normal vectors are the same), we denote it by Hp. By looking at Mηℓand Mλ locally as graphs over Hp, Lemma 3.6 gives the following comparison of graphs, which in turn can be regarded as local embeddability in the sense of convex geometry (see Definition 4.16 below). Theorem 3.8. ℓ Ln proper is η-mixable if and only if for all p int( n) the local graph of the translation of ηℓto λ(p) over the supporting plane to both Mℓp and Mλ at λ(p), Hp, lies above the graph of λ over Hp. Remark 3.9. Observe the resemblance of Lemma 3.6 to Theorem 10 in (van Erven et al., 2012). To recover the latter from our point of view we will first reinterpret Lemma 3.6 and Theorem 3.8 from a convex geometry point of view which will lead to a nice bridge between Lemma 3.6 and (van Erven et al., 2012, Theorem 10). 4 Connections to convex geometry In this section we reinterpret our results from the point of view of convex geometry. With this interpretation we can relate Theorem 3.8 to results in (van Erven et al., 2012) and (Williamson & Cranko, 2022). We first provide some background and state relevant results from convex geometry which are well-known and can be found in (Schneider, 2014) and can be adapted to our setting. Let K Rn be a convex set, that is λx + (1 λ)y K for all x, y K and λ [0, 1]. We define the recession cone of K as the set rec(K) = {x Rn : K + x K}. The boundary of K is denoted by K, as since we will assume that K is a differentiable manifold we denote the interior (as a manifold) of K by int( K). As usual the scaling of K by η > 0 and the Minkowski sum of K and L are defined as ηK = {ηk Rn : k K}, (4.1) K + L = {k + l Rn : k K, l L}. (4.2) Published in Transactions on Machine Learning Research (09/2023) Definition 4.1. Let K be a closed convex set in Rn. The support function of K, σ(K, u): Rn R, is defined as σ(K, u) = sup x K x, u . We sometimes denote it as σK(u) := σ(K, u). 
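As a concrete illustration of Definition 4.1 (an aside, not needed for what follows): for a strictly proper loss ℓ and a direction u = -p with p ∈ int(∆n), the supremum defining σ_spr(ℓ)(-p) is attained at the boundary point ℓ(p), so that σ_spr(ℓ)(-p) = -⟨ℓ(p), p⟩ = -Lℓ(p); this identity reappears in the proof of Theorem 4.22 below. A minimal numerical sketch for the binary log loss (the sampling grid and the chosen p are ours, purely illustrative):

import numpy as np

# Boundary of spr(l_log) for n = 2: lambda(q) = (-ln q, -ln(1 - q)), q in (0, 1).
qs = np.linspace(1e-4, 1 - 1e-4, 100_000)
boundary = np.stack([-np.log(qs), -np.log(1 - qs)], axis=1)

def sigma_bruteforce(u):
    # sup of <x, u> over (sampled) boundary points; for u with non-positive
    # components this agrees with the sup over the whole superprediction set,
    # since moving into the recession cone cannot increase <x, u>.
    return np.max(boundary @ u)

p = np.array([0.3, 0.7])          # a point of the simplex
bayes_risk_log = -p @ np.log(p)   # L_log(p) = -sum_i p_i ln p_i
print(sigma_bruteforce(-p))       # approximately -0.6109
print(-bayes_risk_log)            # approximately -0.6109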
From the definition we know that y K y, u σK(u) for all u Rn. From (Schneider, 2014, Section 1.7) we have the following. Lemma 4.2 (Properties of σ). Let L, K Rn be closed convex sets. 1. σL σK if and only if L K. 2. σ(K + t, u) = σ(K, u) + t, u for all t Rn. 3. σ(K + L, u) = σ(K, u) + σ(L, u). Definition 4.3. A function f : D Rn R is convex if its extension to Rn given by ( f(x), if x D , if x / D The following lemma is a well-known result (see (Schneider, 2014, Theorem 1.7.1) for example). Lemma 4.4. Let f : Rn R convex, closed and positively homogeneous, then f is the support function of the convex, closed set Kf = {x Rn | x, u f(u) for all u Rn}. Definition 4.5. Let L, K Rn and closed and convex. We say that L is a summand of K if there exists a convex, closed set M Rn such that K = M + L. We will be mainly interested in sets K whose recession cone is Rn 0, hence we denote by Kn the set of closed, convex sets whose recession cone is Rn 0. In the following we extend some common results in convex geometry which are usually stated for closed, compact convex sets in Rn (Schneider, 2014), however, some of them are easily extended to Kn (Shveidel, 2001). Lemma 4.6 (Basic properties of sets in Kn ). Let K, L Kn and η > 0. Then, the following holds: (1) ηK Kn , (2) rec(K + L) = Rn 0, (3) K + L is closed, and (4) K + L Kn . Proof. In order to show (1), we need to show that ηK is closed, convex and rec(ηK) = Rn 0. Let x, y ηK and λ [0, 1], then we have λx + (1 λ)y = η(λkx + (1 λ)ky) where x = ηkx and y = ηky for some kx, ky K. Since K is convex, then λkx + (1 λ)ky K and hence ηK is convex. Let xn K be a convergent sequence that converges to x. Then, there exists kxn K such Published in Transactions on Machine Learning Research (09/2023) that xn = ηkxn. Since η is a constant, {kxn} converges to kx K (since K is closed). By the uniqueness of the limit, x = ηx ηK. Now, let x Rn 0, we want to show that ηK + x ηK. Take any k K, ηk + x = η k + 1 ηx Rn 0. Conversely, if x rec(ηK), then for any k1 K, we have then there exists k2 K, such that ηk1 + x = ηk2. Hence ηx rec(K) = Rn 0, thus x Rn 0. To show (2), let x Rn 0. We want to show that K + L + x K + L. Let k K and l L, then k + l + x K + L, since l + x L. Thus Rn 0 rec(K + L). Now, suppose that there is x rec(K + L) such that x / Rn 0. Since rec(K + L) is a cone, for all λ > 0, we have λx rec(K + L). Let k K and l L. Then k + l + λx K + L K. Thus ℓ+λx rec(K) = Rn 0 for all λ > 0, but notice that this is a contradiction since by picking λ sufficiently large, l + λx / Rn 0. Thus rec(K + L) = Rn 0. For (3), see (Rockafellar, 1970, Theorem 8.2) and (Shveidel, 2001, Theorem 3.1). Finally, (4) is simply the combination of (2) and (3) (and the fact that K + L is convex). We now specialize the discussion to a particular type of sets K Kn . First, suppose that the boundary K is of class C2, then at each point x int( K) there is an outward pointing normal vector u K(x). Thus, clearly, we can define a map u K : int( K) Sn 1 assigning u K(x) to x int( K). We define Rn 0 = {x Rn : x = (x1, ..., xn), with xi 0 for i = 1, ..., n}, int(Rn 0) = {x Rn : x = (x1, ..., xn), with xi < 0 for i = 1, ..., n} = Rn <0. Definition 4.7. Define C2 +(Kn ) as the collection of sets K Kn with boundary K of class C2, and such that the map u K is a C1-diffeomorphism from int( K) to Sn 1 := Sn 1 Rn <0. We now specialize some properties of the support function to C2 +(Kn ). Lemma 4.8. If K C2 +(Kn ), then dom(σK) = int(Rn 0) {0}. Proof. 
Take u = 0 in dom(σK), then it must be an outward normal vector to int( K), hence it is in Sn 1 . Then dom(σK) int(Rn 0) {0}. Now, for u Rn <0, normalize it to make it unitary by letting v = u/|u|, then v Sn 1 and thus it must be a normal vector form some x int( K), hence the support function evaluated at v is finite, and in consequence σK(u) is finite too. Remark 4.9. Following Schneider (2014, Section 2.5) the condition K C2 +(Kn ) is equivalent to assuming the principal curvatures of K to be non-zero. It also follows that σK(u)|Sn 1 = u 1 K (u), u , and moreover, σK is of class C2. Published in Transactions on Machine Learning Research (09/2023) Remark 4.10. Let ℓ Ln be a proper loss function. By definition we see that Remark 4.9 implies spr(ℓ) C2 +(Kn ) (since Mℓ= (spr(ℓ))). Definition 4.11. Let K, L C2 +(Kn ). We say that L slides freely inside K if to each boundary point x of K, there exists a translation vector t Rn, such that x L + t K. Theorem 4.12. Let K, L C2 +(Kn ). L is a summand of K, then L slides freely inside K. Proof. Suppose that there exists M C2 +(Kn ) such that K = L + M. Let x K. Then there are l L and m M such that Thus, x L + m L + M = K. Remark 4.13. For a general convex set L, if L is a summand of K C2 +(Kn ) we see that the previous proof holds an we conclude that L slides freely inside K; note however that this imposes restrictions on possible sets L. One of this consequences is that the principal curvatures of L must be positive as can be seen from a second fundamental form comparison and Theorem 3.8. Lemma 4.14. Let K, L C2 +(Kn ) and suppose that f( ) = σK( ) σL( ) is convex. Then the set M = {x Rn | x, u f(u) for all u Rn}, is in C2 +(Kn ), and it is such that K = M + L, that is, L and M are summands of K. Proof. From Lemma 4.8, the domain of f is Rn <0 {0}, i.e., f : Rn <0 {0} R is convex. Thus it is the support function of M (by Lemma 4.4). That is, f( ) = σM( ). Therefore we have σM = σK σL, and hence K = M + L. Note that M is a summand of K, then using Theorem 4.12 we know that M slides freely inside K, and since K has positive principal curvatures then M does too (Remark 4.13). Since σM is of class C2, then M has to be in C2 +(Kn ). Theorem 4.15. [(Schneider, 2014, Theorem 1.5.2)] Let D Rn convex and let f : D R be a continuous function. Suppose that for each point x0 D there are an affine function g on Rn and a neighborhood U of x0 such that f(x0) = g(x0) and f g in U D. Then f is convex. Definition 4.16. We say that L is locally embeddable in K if for all x K, there is a y L and a neighborhood U of y, such that (L U) + x y K. Theorem 4.17. Let K, L C2 +(Kn ) and L strictly convex. If L is locally embeddable in K, then L is a summand of K. Proof. Let u0 Sn 1 and x0 K be a point such that u(x0) = u0. Since L is locally embeddable in K there are y0 L and a neighborhood U0 of y0 such that (L U0)+x0 y0 K. Since u 1 L : Sn 1 L is continuous, there exists a neighborhood V0 of u0 such that u 1 K (V0) U0. Then it follows that σ(L+x0 y0, u0) = σ(K, u0) and σ(L + x0 y0, u) σ(K, u) for all u V0 by Lemma 4.2. Let f( ) = σ(K, ) σ(L, ) (this is defined on Rn 0 {0} and is positively homogeneous), and g( ) = x y, . Then, clearly, we have (i) f(u0) = g(u0), since f(u0) = σ(K, u0) σ(L, u0) = σ(K, u0) σ(L + x0 y0, u0) + x0 y0, u0 (ii) f g on V0, f(u) = σ(K, u) σ(L, u) = σ(K, u) σ(L + x0 y0, u) + x0 y0, u Published in Transactions on Machine Learning Research (09/2023) It follows by Theorem 4.15 that f is convex, and by Lemma 4.14 we conclude that L is a summand of K. 
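The hypothesis of Lemma 4.14, convexity of σK - σL, can be probed numerically for superprediction sets, giving a concrete instance of the chain "convex difference of support functions, hence summand (Lemma 4.14), hence slides freely (Theorem 4.12)". Since a difference of support functions is positively homogeneous, for n = 2 its convexity on R²_{<0} reduces to convexity of its restriction to the segment {-(t, 1 - t) : t ∈ (0, 1)}. The sketch below is only illustrative (the brute-force support functions, the grid sizes and the choice of the square loss ℓ_sq(q) = ((1 - q)², q²) are ours); the threshold it exhibits is consistent with the classical fact that the binary square loss is η-mixable exactly for η ≤ 2, and with Theorem 4.22 below.

import numpy as np

# Boundary curves of the superprediction sets for n = 2.
qs = np.linspace(1e-4, 1 - 1e-4, 200_000)
log_bd = np.stack([-np.log(qs), -np.log(1 - qs)], axis=1)   # log loss
sq_bd = np.stack([(1 - qs) ** 2, qs ** 2], axis=1)          # square loss

def support(bd, u):
    # sigma(spr(l), u) for u with non-positive components, by brute force
    return np.max(bd @ u)

def difference_on_segment(eta, ts):
    # t |-> sigma(spr(lambda), -(t, 1-t)) - sigma(spr(eta * l_sq), -(t, 1-t))
    return np.array([support(log_bd, np.array([-t, t - 1.0]))
                     - support(eta * sq_bd, np.array([-t, t - 1.0])) for t in ts])

ts = np.linspace(0.01, 0.99, 199)
for eta in (1.0, 2.0, 2.5):
    g = difference_on_segment(eta, ts)
    convex = bool(np.all(g[:-2] - 2.0 * g[1:-1] + g[2:] >= -1e-8))  # second differences
    print(eta, convex)   # expected: 1.0 True, 2.0 True, 2.5 False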
The following lemma is a direct consequence of the characterization of mixability in Theorem 3.8 and Definition 4.16. Lemma 4.18. Let ℓ Ln be a proper loss. For η > 0, if ℓis η-mixable then spr(ηℓ) is locally embeddable in spr(λ). Lemma 4.19. If ℓis η-mixable then spr(ηℓ) slides freely inside spr(λ). Proof. Let ℓbe η-mixable, then Lemma 4.18 implies spr(ηℓ) it is locally embeddable in spr(λ). Then Theorem 4.17 implies it is a summand and Theorem 4.12 implies it slides freely inside spr(λ). Corollary 4.20. Let ℓbe a η-mixable proper loss. Then spr(ηℓ) C2 +(Kn ) and it slides freely inside spr(λ) (λ is the log loss). Additionally, there exists M C2 +(Kn ) such that spr(λ) = spr(ηℓ) + M. Moreover, M can be regarded as ϱ( n) for a 1-mixable proper loss ϱ. Proof. Since ℓis an η-mixable proper loss function, ηℓis also a proper loss function and hence spr(ηℓ) C2 +(Kn ) (Remark 4.10). Theorem 3.8 implies that spr(ηℓ) is locally embeddable in spr(λ). From Theorem 4.17 we know that spr(ηℓ) is a summand of spr(λ), which proves the existence of M. As a consequence, M is a convex set with recession cone Rn 0 (Lemma 4.14). By applying (Williamson & Cranko, 2022, Proposition 21) we can regard M as the image of a proper loss function ϱ, which since spr(ϱ) is a summand of spr(λ) it is 1-mixable (Lemma 4.14). We now state (Schneider, 2014, Theorem 2.5.4) adapted to our setting which will be helpful to relate our work to (van Erven et al., 2012). Theorem 4.21. Let K, L C2 +(Kn ). Let h M(x) denote the second fundamental form of M at x with respect to u (see (A.2)). The following are equivalent: (i) h L(x) h K(y) for all pairs of points x and y at which u(x) = u(y). (ii) σK σL is a support function. Since n is an affine manifold, the geodesics in n are simply straight lines. This allows to define convexity of functions defined on n in the usual way we do for functions on Rn. The following theorem connects and reconciles our results to those in (van Erven et al., 2012). More precisely, we create a bridge between our results and (van Erven et al., 2012, Theorem 10). Theorem 4.22. Let ℓ Ln be proper loss. Let η > 0, then ℓis η-mixable if and only if ηLℓ( ) Lλ( ) is convex on int( n), where Lϱ( ) denotes the Bayes risk of the loss function ϱ (Definition 1.3) and λ denotes the log loss. Proof. Suppose that ℓis a proper loss in Ln which is η-mixable. By Lemma 4.19 spr(ηℓ) slides freely inside spr(λ) and in particular hηℓ(ℓ(p)) hλ(λ(p)). By Theorem 4.21 it follows that σspr(λ) σspr(ηℓ) is a support function with domain Rn <0 {0}, in particular it is convex on its interior. Let u Rn <0, such that the outward normal vector of ℓ( n) and λ( n) at ℓ(p) and λ(p), respectively, is u. Then we have for x = p n, σspr(λ)(x) σspr(ηℓ)(x) = |x|(σspr(λ)(x/|x|) σspr(ηℓ)(x/|x|)) = |x|( λ(p), x/|x| ηℓ(p), x/|x| ) = |p|( λ(p), p/|p| ηℓ(p), p/|p| ) = λ(p), p ηℓ(p), p = λ(p), p + ηℓ(p), p = Lλ(p) + ηLℓ(p), which proves the claim. Published in Transactions on Machine Learning Research (09/2023) Suppose now that for given ℓ Ln proper, there exists a η > 0 such that spr(ηℓ) slides freely inside spr(λ). Note that in particular this implies that spr(ηℓ) is locally embeddable in spr(λ), and hence for each p int( n) we have hηℓ(ηℓ(p)) hλ(λ(p)) 0, which by (3.10) and Lemma 3.6 implies that ℓis η-mixable. Thus combining this with Lemma 4.19 we obtain the following characterization of mixability of proper (sufficiently differentiable) loss functions. Theorem 4.23. Let ℓ Ln be proper. 
ℓis η-mixable if and only if spr(ηℓ) slides freely inside spr(λ), where λ denotes the log loss. In general, the set L provides a family of loss functions with appealing properties. Arguably, one of the most important properties is that given ℓ L, if we assume that ℓis proper then we know its principal curvatures are strictly positive. This is a strong and useful geometric feature. For example, in (Williamson & Cranko, 2022) the notion of a inverse loss called the anti-polar loss was investigated. Given ℓa proper loss (in the sense of (Williamson & Cranko, 2022), which are not necessarily smooth), they consider the 0-homogeneous extension of ℓ(see Remark 26 in (Williamson & Cranko, 2022)), defined on Rn >0 and given by ℓext(p) := ℓ p p 1 where p 1 = p1 + ... + pn. For the following we simply denote ℓext by ℓ. In (Williamson & Cranko, 2022, Proposition 29) it is shown that there exists a map ℓ : R>0 Rn 0 such that ℓ(p) = (ℓ ℓ ℓ)(p) ℓ (x) = (ℓ ℓ ℓ )(x), for all x, p Rn >0. The map ℓ is called the anti-polar loss of ℓ. For the family of admissible loss function L considered in this work, we exploit the differentiability conditions to obtain in a straightforward way an inverse loss defined on ℓ(int( n)). To see this, suppose that ℓ L is proper. Since this is equivalent to saying that spr(ℓ) is in C2 +(Kn ), meaning that the map uspr(ℓ) is C1 diffeomorphism. Then we can define the map ℓ 1 : ℓ(int( n)) int( n) by ℓ 1(x) := u spr(ℓ)(x) uspr(ℓ)(x) 1 , which is the inverse of the map ℓ: int( n) ℓ(int( n)). Recall that u spr(ℓ)(x) is nothing else than the unit normal vector (pointing towards Rn 0) at x ℓ(int( n)). It is of interest of finding parametrizations (or links) that simplify the expression of a given proper loss ℓ. At a theoretical level there are potentially many ways to to this. Notably we have at hand the notion of canonical link in (Williamson et al., 2016) (or see Section 2.7 above for n = 2). As an example of other ways to obtain nice links we have Lemma 3.3 above, which gives a nice expression in coordinates (as the form of a graph) of ℓ. Unfortunately, to obtain that results one makes uses of the inverse function theorem which does not provide an explicit inverse but rather its existence. 5 Conclusions We summarize the main messages of this work. Since mixable loss functions are of great importance in prediction games, it is desirable to understand them from different perspectives. Inspired by the work of Vovk (2015), in Section 2 we studied binary loss functions from the point of view of differential geometry, hence restricting to loss functions in L (Definition 2.1). To do this, we re-interpret properness as a geometric property, namely, a loss function ℓ L is proper if and only if the normal vector (belonging to R2 0) to Mℓ= ℓ(int( 2)) at ℓ(p) is p |p|, for any p int( 2), and Published in Transactions on Machine Learning Research (09/2023) the loss curve ℓ(int( 2)) has positive curvature (with respect to p |p|). Having this framework at hand, we characterized mixability and fundamentality of a proper loss ℓ L, as a curvature comparison to the log loss ℓlog (cf. (Vovk, 2015)). In Section 3, we extended the geometric characterization of proper loss functions to higher dimensions, and obtained the corresponding interpretation of mixability as a geometric comparison (now in terms of the principal curvatures of the loss surface ). This comparison is done by using the second fundamental forms of the loss surfaces . 
The main goal of Section 4 is to re-interpret the geometric results in Section 3 from the point of view of convex geometry. The main result in this part is a new characterization of η-mixability of a proper loss function ℓ L, as spr(ηℓ) sliding freely inside spr(ℓlog) (in general dimension). This provides an intuitive and geometric way to interpret mixability. Since the results obtained in this work are in terms of curvature, it was necessary to re-interpret well known properties of loss functions in the language of differential geometry. Although slightly tedious, this allowed the reconciliation of the results obtained by Vovk (2015) for n = 2 and by van Erven et al. (2012) for n 2. Theorem 4.22 connects our results to (van Erven et al., 2012, Theorem 10) as follows. In (van Erven et al., 2012, Theorem 10) the following statements are proven to be equivalent: (i) a proper loss ℓ L is η-mixable, (ii) ηHe L(t) He Llog(t) is positive semi-definite for all t Φ 1 std(int( n)), where HF(t) denotes the Hessian of F at t, (iii) ηL(p) Llog(p) is convex on int( n), and (iv) ηe L(p) e Llog(p) is convex on Φ 1 std(int( n)). There, they first proved the equivalence of (i) and (ii), which is the result of a long direct computation done very carefully. The equivalence between (iii) and (iv) is straightforward. To connect these two sets of equivalences, standard convex geometry is used to prove the equivalence of (ii) and (iii). Note that the statements (ii) and (iv) make reference to a precise choice of parametrization of n (i.e., the standard parametrization Φstd), therefore, the work presented here is naturally not related to these statements but rather to (i) and (iii), whose equivalence can be considered to be the content of Sections 3 and 4. Determining whether this new approach provides a simplification of the computations in (van Erven et al., 2012) or not, strongly depends on the differential geometry and convex geometry background of the reader. This work should be considered as complementing the understanding of mixable loss functions and providing a new geometric insight into them. Acknowledgements This work was supported by the Deutsche Forschungsgemeinschaft under Germany s Excellence Strategy EXC number 2064/1 Project number 390727645. Shun-ichi Amari. Information geometry and its applications. Springer, 2016. Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, 2005. Manfredo P. do Carmo. Differential geometry of curves & surfaces. Dover Publications, Inc., Mineola, NY, 2016. ISBN 978-0-486-80699-0; 0-486-80699-5. Revised & updated second edition of [ MR0394451]. David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Tight worst-case loss bounds for predicting with expert advice. In Computational learning theory (Barcelona, 1995), volume 904 of Lecture Notes in Comput. Sci., pp. 69 83. Springer, Berlin, 1995. doi: 10.1007/3-540-59119-2_169. URL https: //doi.org/10.1007/3-540-59119-2_169. Published in Transactions on Machine Learning Research (09/2023) David Haussler, Jykri Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906 1925, 1998. doi: 10.1109/18.705569. John M. Lee. Introduction to Riemannian manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2018. ISBN 978-3-319-91754-2; 978-3-319-91755-9. 
Second edition of [ MR1468735]. Shahar Mendelson and Robert C. Williamson. Agnostic learning nonconvex function classes. In International Conference on Computational Learning Theory, pp. 1 13. Springer, 2002. Zakaria Mhammedi and Robert C Williamson. Constant regret, generalized mixability, and mirror descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https: //proceedings.neurips.cc/paper/2018/file/af1b5754061ebbd4412adfb34c8d3534-Paper.pdf. Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387 2422, 2010. ISSN 1532-4435. Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson, and Nishant Mehta. Generalized mixability via entropic duality. In Peter Grünwald, Elad Hazan, and Satyen Kale (eds.), Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 1501 1522, Paris, France, 03 06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Reid15.html. R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970. Rolf Schneider. Convex bodies: the Brunn-Minkowski theory, volume 151 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, expanded edition, 2014. ISBN 978-1-107-60101-7. Emir H. Shuford, Arthur Albert, and H. Edward Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125 145, 1966. doi: 10.1007/BF02289503. URL https://doi.org/10.1007/ BF02289503. Anna P. Shveidel. Recession cones of star-shaped and co-star-shaped sets. In Optimization and related topics (Ballarat/Melbourne, 1999), volume 47 of Appl. Optim., pp. 403 414. Kluwer Acad. Publ., Dordrecht, 2001. doi: 10.1007/978-1-4757-6099-6_19. URL https://doi.org/10.1007/978-1-4757-6099-6_19. Tim van Erven, Mark D. Reid, and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. J. Mach. Learn. Res., 13:1639 1663, 2012. ISSN 1532-4435. Tim Van Erven, Peter Grünwald, Nishant A. Mehta, Mark Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793 1861, 2015. Vladimir Vovk. The fundamental nature of the log loss function. In Lev Beklemishev, Andreas Blass, Nachum Dershowitz, Berndt Finkbeiner, and Wolfram Schulte (eds.), Lecture Notes in Computer Science, volume 9300 of Lecture Notes in Computer Science, pp. 307 318. Springer, 2015. ISBN 978-3-319-23533-2. doi: 10.1007/978-3-319-23534-9_20. Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. J. Mach. Learn. Res., 10:2445 2471, 2009. ISSN 1532-4435. doi: 10.1145/1390156.1390295. URL https://doi.org/10.1145/ 1390156.1390295. Volodya Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56 (2):153 173, 1998. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.1997.1556. URL https://www. sciencedirect.com/science/article/pii/S0022000097915567. Volodya Vovk. Competitive on-line statistics. International Statistical Review / Revue Internationale de Statistique, 69(2):213 248, 2001. ISSN 03067734, 17515823. Published in Transactions on Machine Learning Research (09/2023) Robert C. Williamson and Zac Cranko. The geometry and calculus of losses, 2022. URL https://arxiv. org/abs/2209.00238. ar Xiv:2209.00238. Robert C. Williamson, Elodie Vernet, and Mark D. 
Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:Paper No. 223, 52, 2016. ISSN 1532-4435. A Differential Geometry In this part we provide a brief summary of the concepts of differential geometry that are used in this work (we assume the reader has some familiarity with the topic although we try to put emphasis on the intuition). We do not intend to give a comprehensive introduction to the topic. Most of the material can be found in almost any differential geometry book, however, we recommend (and when possible use the notation of) (do Carmo, 2016) and (Lee, 2018). A.1 Curvature of Curves A parametrized curve is a differentiable map α: (a, b) Rn, (a < b). We are interested in studying the geometry of parametrized curves. For this it would be useful to restrict our discussions to curves with a well defined tangent line at every point α(t) for t (a, b) (i.e., with non-vanishing α (t)). These curves are called regular. Let ϕ: (a, b) (c, d) be a diffeomorphism, the curve β = α(ϕ(s)) is a reparametrization of α. Note that in this case α((a, b)) = β((c, d)). The image M = α((a, b)) is a 1-dimensional differentiable manifold in Rn (for this it is essential to restrict to regular curves). The study of curves is of particular importance since some aspects are carried to the study of the geometry of general hypersurfaces in Rn. Typically, curvature is defined for curves parametrized by arc-length meaning that |β (s)| = 1 for all s (c, d) (and a regular curve can always be parametrized this way). For these types of curves, the curvature of β at β(s) is defined as the length of β (s), which measures how much a curve curves . However, this notion does not give information about the direction on which a curve is curving . We start looking at the case n = 2. We define the signed curvature of a general curve α(t) = (x1(t), x2(t)) by (cf. (1.5)) κα(t) := x 1(t)x 2(t) x 1(t)x2(t) (x 1(t)2 + x 2(t)2)3/2 . It can be checked that |κ(t)| coincides with the curvature of α when parametrized by arc-length (at the corresponding point), the signed curvature is well defined up to a sign (the sign will change if we consider a reparametrization that reverses the order of (a, b), for example a curve defined on ( b, a) given by β(s) = α( s)), which motivates the discussion in Section 1.5. For example, suppose that a planar curve is defined by a function f : (a, b) R is the following way: α(t) = (t, f(t)), for t (a, b). A quick computation gives κα(t) = f (t) (1 + f (t)2)3/2 . (A.1) Given a regular curve α: (a, b) R3 as above and a real number η = 0, it is straightforward to see that the curve β(t) = ηα(t) is also a regular curve and its signed curvature is given by κβ(t) = η2x 1(t)x 2(t) η2x 1(t)x2(t) (η2x 1(t)2 + η2x 2(t)2)3/2 = 1 The notion of signed curvature can be extended to curves in manifolds sitting inside Rn (see for example (Lee, 2018, Chapter 8)). For α( ε, ε) Rn parametrized by arc-length, the signed curvature (with respect to n) κ+ α of α at p = α(0) is given by κ+ α(0) = n, α (0) . It can be shown that this definition agrees with the one we gave for n = 2. Published in Transactions on Machine Learning Research (09/2023) A.2 Geometry of hypersurfaces in Rn Let M be a differentiable hypersurface inside Rn of class Ck (i.e., a n 1-dimensional Ck manifold). By this we mean that for each p M there is an open set U Rn 1 and a Ck injective map Φ: U M (called a parametrization of M around p). 
For each x U, { 1Φ(x), ..., n 1Φ(x)} forms a basis for the tangent space Tq M (q = Φ(x)) to M at q. Since Φ(U) Rn we can consider the induced metric on M by the Euclidean metric in Rn (denoted by , ). This is a Riemannian metric on M given on the coordinates given by Φ by the matrix gij(x) = iΦ(x), jΦ(x) , for x U. The metric g allows us to define the length of a curves in M. In general, if a manifold M of dimension n 1 is sitting inside an n-dimensional Riemannian manifold M (and M is endowed with the induced metric from M) the second fundamental form carries the information on how M is curved inside M. Let g be the metric on M and g the induced metric on M by g. Let denote the Levi Civita connection of g. Let n be a smooth unit normal vector field to M (that is n(p) is perpendicular to Tp M for each p M). The scalar second fundamental form of M with respect to n is the covariant 2-tensor h on M defined as h(X, Y ) = n, XY = Xn, Y . (A.2) for X, Y tangent vectors to M. Note that for a hypersurface, at each point we have exactly to unit normal vectors to M at p, thus the scalar second fundamental form is well-defined up to a sign. Fixing a point p M and an orthonormal basis {E1, ..., En 1} for the tangent space at p Tp M, the eigenvalues of the matrix given by hij = h(Ei, Ej) for i, j = 1, ..., n 1 are called the principal curvatures of M at p and the corresponding eigenspaces are called the principal directions. For details of the above see Chapter 8 in (Lee, 2018). When M = Rn and M is parametrized by Φ: U Rn 1 M Rn, with respect to the local frame { 1Φ, ..., n 1Φ} of Φ(U), the scalar second fundamental form with respect to a normal unit vector field n is given by ((Lee, 2018, Proposition 8.23)) hij = h( iΦ, jΦ) = ijΦ, n , (A.3) for i, j = 1, ..., n 1. Given any p M and v Tp M, there a geodesic γV : (a, b) M of M passing through p with velocity v at p. Let M1 and M2 be two hypersurfaces in Rn+1 tangent at a point p M1 M2. Choose a normal vector n and suppose that M1 lies above M2 (with respect to n). We have the following lemma from (Lee, 2018). With the previous lemma we can obtain a comparison result for manifolds with positive principal curvatures. Lemma A.1. Suppose that M1 and M2 are tangent at p M1 M2 and fix a normal vector n at p. Suppose that M1 and M2 have positive principal curvatures at p. Then h1(v, v) h2(v, v) for all v Tp M if and only if M1 lies above M2 (with respect to n) locally around p. Proof. First we make the following observation. Suppose that M is a smooth hypersurface in Rn and we have a regular curve α: ( ε, ε) M such that α(0) = p and α (0) = v for some p M and v Tp M. Then, letting h denote the second fundamental form of M from (A.2) we have h(v, v) = vn, v dt (t), α (t) t=0 = (n α)(t), α (t) t=0 = n, α (0) . Thus, if α is parametrized by arc-length, h(v, v) = n, α (0) = κ+ α(0). Published in Transactions on Machine Learning Research (09/2023) Suppose M1 lies above M2 are tangent at p and let v Tp M1 = Tp M2 with |v| = 1. Then we can intersect M1 and M2 with the plane generated by v and n. Then we obtain two curves α1 and α2 on M1 and M2, respectively, such that αi(0) = p and α (0) = v for i = 1, 2. Moreover, we can assume that these curves are parametrized by arc-length so its Euclidean curvature is given by α i (0), n . 
Since we can regard these curves as planar curves, there are functions $f_1$ and $f_2$ such that the curves $\alpha_1$ and $\alpha_2$ are represented in the plane spanned by $v$ and $n$ by the curves
$$\gamma_1(x) = (x, f_1(x)), \qquad \gamma_2(x) = (x, f_2(x)),$$
with $f_i(0) = 0$, $f_i'(0) = 0$ (so that $\gamma_i'(0)$ corresponds to $v$) and $f_i''(0) > 0$ (since $M_1$ and $M_2$ have positive principal curvatures at $p$), for $i = 1, 2$. By construction and (A.1), $\kappa^+_{\gamma_i}(0) = f_i''(0)$, and by definition $\kappa^+_{\gamma_i}(0) = \langle \alpha_i''(0), n \rangle$, for $i = 1, 2$. If $M_1$ lies above $M_2$ around $p$, then $f_1 \geq f_2$ near $0$ with equality at $0$, so $f_1''(0) \geq f_2''(0)$ and hence $\kappa^+_{\gamma_1}(0) \geq \kappa^+_{\gamma_2}(0)$, which is equivalent to $h_1(v, v) \geq h_2(v, v)$ for any $v \in T_pM$ with $|v| = 1$. Let $w \in T_pM$, $w \neq 0$, be arbitrary and set $v = w/|w|$; then
$$h_1(w, w) = |w|^2 h_1(v, v) \geq |w|^2 h_2(v, v) = h_2(w, w), \qquad \text{(A.4)}$$
as claimed. Conversely, if (A.4) holds, then in particular it holds for unit vectors $v$, which ultimately means that $f_1''(0) \geq f_2''(0)$ for every unit vector $v \in T_pM$. This implies that $M_1$ lies above $M_2$.

We present the following instructive example.

Example A.2. Consider the differentiable function $f_\kappa(x, y) = \kappa(x^2 + y^2)$ with $\kappa > 0$, and let $M_\kappa = \{(x, y, f_\kappa(x, y)) \mid (x, y) \in B_1(0)\}$. We choose the parametrization $\Phi_\kappa(x, y) = (x, y, f_\kappa(x, y))$ of $M_\kappa$ and compute the scalar second fundamental form of $M_\kappa$ at $p = (0, 0, 0)$ in these coordinates. We have
$$\partial_x \Phi_\kappa(x, y) = (1, 0, 2\kappa x), \qquad \partial_y \Phi_\kappa(x, y) = (0, 1, 2\kappa y),$$
$$\partial_{xx} \Phi_\kappa(x, y) = (0, 0, 2\kappa), \qquad \partial_{xy} \Phi_\kappa(x, y) = (0, 0, 0), \qquad \partial_{yy} \Phi_\kappa(x, y) = (0, 0, 2\kappa),$$
thus from (A.3), at the point $\Phi_\kappa(0, 0) = (0, 0, 0)$ the scalar second fundamental form of $M_\kappa$ with respect to $n = (0, 0, 1)$ is given by
$$[h_{f_\kappa}](p) = \begin{pmatrix} 2\kappa & 0 \\ 0 & 2\kappa \end{pmatrix},$$
and in particular for $\kappa = 1$ we have
$$[h_{f_1}](p) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}.$$
Thus, clearly, we have
$$[h_{f_\kappa}](p) - [h_{f_1}](p) = \begin{pmatrix} 2\kappa - 2 & 0 \\ 0 & 2\kappa - 2 \end{pmatrix}, \qquad \text{(A.5)}$$
which is positive definite if and only if $\kappa > 1$ (precisely when $M_\kappa$ lies above $M_1$ locally around $p$, the two surfaces being tangent at $p$).

Remark A.3. We stress a technical observation. The comparison (A.5) in Example A.2 is valid because, regardless of the value of $\kappa$, $\partial_x \Phi_\kappa(0, 0)$ and $\partial_y \Phi_\kappa(0, 0)$ are the same, meaning that we can identify the tangent spaces to $M_\kappa$ and $M_1$ at $p$ for all $\kappa$, with basis $\{\partial_x \Phi_1(0, 0), \partial_y \Phi_1(0, 0)\}$. In general this is not necessarily the case, so one should perform a change of basis before comparing the second fundamental forms.
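Example A.2 is also easy to reproduce numerically. The sketch below is illustrative only (the finite-difference construction is ours): it assembles the scalar second fundamental form of the graph of $f_\kappa$ at the origin from formula (A.3) with the upward normal $(0, 0, 1)$, and confirms that $[h_{f_\kappa}](p) - [h_{f_1}](p)$ is positive definite exactly when $\kappa > 1$.

import numpy as np

def second_fundamental_form_at_origin(f, h=1e-4):
    # Graph parametrization Phi(x, y) = (x, y, f(x, y)). For the paraboloids
    # below the gradient of f vanishes at the origin, so the upward unit normal
    # there is n = (0, 0, 1) and, by (A.3), h_ij = <d_ij Phi, n> = d_ij f.
    fxx = (f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h ** 2
    fyy = (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h ** 2
    fxy = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h ** 2)
    return np.array([[fxx, fxy], [fxy, fyy]])

def paraboloid(kappa):
    return lambda x, y: kappa * (x ** 2 + y ** 2)

h1 = second_fundamental_form_at_origin(paraboloid(1.0))   # approximately diag(2, 2)
for kappa in (0.5, 1.5):
    hk = second_fundamental_form_at_origin(paraboloid(kappa))
    eigs = np.linalg.eigvalsh(hk - h1)                     # approximately 2*kappa - 2
    print(kappa, bool(np.all(eigs > 0)))                   # 0.5 False, 1.5 True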