# the_geometry_and_calculus_of_losses__3aacaeb3.pdf

Journal of Machine Learning Research 24 (2023) 1-72 Submitted 9/22; Revised 8/23; Published 11/23

The Geometry and Calculus of Losses

Robert C. Williamson BOB.WILLIAMSON@UNI-TUEBINGEN.DE University of T ubingen and T ubingen AI Center, Germany

Zac Cranko ZAC.CRANKO@GMAIL.COM Sydney, Australia

Editor: Aryeh Kontorovich

Statistical decision problems lie at the heart of statistical machine learning. The simplest problems are multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. It enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potential useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of polar loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk s aggregating algorithm. Keywords: convex sets, support functions, gauges, polars, concave duality, proper loss functions, M-sums, distorted probabilities, polar losses, Shephard duality, anti-norms, Bregman divergences, semi inner products, Finsler geometry, aggregating algorithm, substitution functions, direct and inverse addition.

1. Introduction

Most machine learning research focusses on methods (algorithms). But these methods are designed to solve particular problems. Platt (1962) argued for the greater importance of problem-oriented research. Our premise is that we need to better understand the elements of machine learning problems, and their permissible transformations. We focus on some of the simplest possible machine learning problems, namely multiclass classification and probability estimation. Stateless machine learning problems have three key ingredients:

1. the loss function l: specifies how predictive performance is evaluated;

2. the data generating process: in the statistical setting this corresponds to an underlying probability distribution P from which samples are drawn;

2023 Robert C. Williamson and Zac Cranko.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0987.html.

WILLIAMSON AND CRANKO

3. the model class F: the analyst s choice, informed by their prior knowledge1.

Implicit in this is the protocol by which the learner or analyst interacts with the data; we presume the usual statistical batch setting for now, but most of the technical results of the paper are not so restricted. Thus an (idealised) machine learning problem can be parametrised by the triple (l,F,P). Much research in machine learning focusses upon the classes of models F and methods for searching for the best element within F for data generated by P, on theoretical results concerning the complexity of F and its effect on convergence of empirical estimates (Vapnik, 1998), or the intrinsic geometry induced by F (Amari, 2016). Little attention has been paid to research on the loss function l, and its interaction with the other ingredients F and P. A recent exception is (van Erven et al., 2015) which showed how the joint interaction of l, F and P control the speed of convergence of learning algorithms. The lack of attention is surprising because the choice of loss function matters, especially when (as is typical) one has limited data, and so one cares about the speed of convergence of empirical estimates, and the best model in the class F has non-zero expected loss (again typically the case). Understanding the implications and options for the choice of loss function also matters when one considers the integration of machine learning technologies into larger socio-technical systems, since the loss function serves as an abstraction of what matters at the larger system level, and can be used, for example, to abstract a range of notions of fairness in ML problems (Menon and Williamson, 2018a). Loss functions are central to statistical decision problems, and have a long history (Wald, 1950); see (Vernet et al., 2016; Williamson, 2013) for some pointers to the literature. The present paper focusses upon understanding at a deeper level the loss functions for multiclass probability estimation, and their transformations. Our results are, to use the apposite term of Rota (1997), cryptomorphic an isomorphism that was previously hidden from view, which once decoded is illuminating. Our approach is parametrisation independent in the sense of the distinction made in (van Erven et al., 2015; Vernet et al., 2016) (in essence, what matters is the geometry of the set induced by the loss function which does not change under reparametrization). Losses in machine learning play a role analogous to metrics in other applied problems. Menger (1928) introduced distance geometry (in order to view the world in terms of distances) and there is an incredible variety of distances to choose from (Deza and Deza, 2009). But as we shall see, it is the simpler notion of norms, and normed spaces, that are the most relevant in the study of losses. The development of functional analysis critically depended upon the development of finite dimensional normed spaces by Minkowski (1896). In his Geometrie der Zahlen, he developed the notion of a symmetric convex body and its equivalence to a norm ball {x | p(x) 1}, as well as introducing the notion of a supporting hyperplane and the corresponding support function. He showed2 it was the dual to the norm p(x). We shall see that these concepts that were central to the development of

1. The claim that all the analyst brings is the model class F is a simplification that captures much of ML; in general the analyst provides a learning algorithm (a function) A: S 7 f F which given a sample S produces an f (Herbrich and Williamson, 2002). For the purposes of the present paper the simplification stated in the main text suffices. 2. See (Martini et al., 2001, Section 2) and (Thompson, 1996, Section 1.5) for a more detailed history. The extension of these concepts to infinite dimensional spaces underpinned the development of functional analysis; Dieudonn e (1981, page 130) credits Helly (1921) with the idea of abstracting away from particular spaces such as ℓp, Lp or C([a,b]) to the notion of general normed sequence spaces by methods which do not depend upon special features of the space. While apparently rather elementary, these finite dimensional normed spaces ( Minkowski Spaces (Thompson, 1996)) underpin the general theory of Banach spaces. Pietsch (2007, Page xxii) quotes Dvoretzky (1960) inspired by Grothendieck: many problems in the theory of Banach spaces may be reduced to the finite-dimensional case, i.e. to problems concerning Minkowski spaces.

GEOMETRY AND CALCULUS OF LOSSES

normed spaces are, with minor modification, the foundation for an understanding of loss functions. Recapitulating history, we concentrate in this paper on the finite dimensional case, corresponding to multiclass classification and conditional probability estimation problems. The rest of the paper is organised as follows. Section 2 introduces the mathematical machinery we utilise. Section 3 introduces (proper) loss functions, including the antipolar loss which is a natural inverse of a loss function. Section 4 presents examples, illustrating the new perspective, and the antipolar loss in particular. Section 5 presents some design strategies for loss functions in terms of their superprediction sets. Section 6 shows how to construct new proper losses from old ones by suitable combination of their superprediction sets. These results are based on the new results in Section 6.8 which substantially extend the theory of M-sums of convex sets, including a general duality result for M-sums of norms and anti-norms. Section 7 concludes.

1.1 Motivation, Expectations, Context and Significance

The goals and results of this paper are different in nature to those of the majority of papers in machine learning3. To that end, we give some context and set expectations. The paper contains no new algorithms and no experimental results. What it does contain is a new way of looking at loss functions which 1) illustrates the close connection between losses and norms and anti-norms; 2) presents the new idea of an antipolar loss; 3) develops a calculus for loss functions that allows multiple proper loss functions to be combined in a manner that the resulting loss is guaranteed proper; and 4) shows how the geometrical perspective can be used to design loss functions. Why embark on this complex endeavour? Currently loss functions are widely used, but there is little insight to be had regarding the consequences of particular choices. This is especially true when these functions are identified with their algebraic formulas. There were insights derived by Reid and Williamson (2011) for the design of loss functions (following Hand and Vinciotti (2003)), and in (Menon and Williamson, 2018b, Appendix B), but these approaches, whilst tractable enough in the binary case, become intractable for the multiclass situation. As we will show, there is an intrinsic geometry to loss functions which controls the nature of the learning problem at hand. Evidence for this was already given by van Erven et al. (2012); Pacheco and Williamson (2023) who showed how the mixability constant of a loss (which appears in bounds for the regret in online learning) is directly controlled by the intrinsic geometry of the loss function. Some of the value of the viewpoint developed in the paper is only realised in the companion paper (Williamson and Cranko, 2022) which uses the geometric approach developed here to derive, in a much simpler manner, the bridge between loss functions and measures of information that was previously presented by Reid and Williamson (2011) (binary case) and Garcia-Garcia and Williamson (2012) (multiclass case). In (Williamson and Cranko, 2022) we show that the geometric way of viewing information measures allows one to derive results seemingly unobtainable by others means. In particular, we derive a general data processing equality from which one can derive the classical strong data processing inequality. It turns out that the geometric viewpoint is central to these novel results. The new perspective has been used by Kamalaruban et al. (2015) to show the connection between exp-concavity and mixability, which is relevant to online learning algorithms, as well as to the

3. But not different in nature to many papers in economics. Indeed, economists have, over a long period, conducted investigations on the foundations of their discipline (utilities). As we shall see below in footnote 12, the similarity turns out not to be just in style, but there is a remarkable parallel in content as well.

WILLIAMSON AND CRANKO

understanding of fast rates in statistical learning (van Erven et al., 2015). It was used by Mhammedi and Williamson (2018) to solve an open question regarding generalised mixability as well as to draw a connection to mirror descent. It also underpins the results of Cranko (2021), which develops the theory of proper composite losses in an infinite dimensional setting. The present paper is a substantially extended and improved version of (Williamson, 2014). Beyond the correction of some errors, the present paper fully develops the theory of M-sums of superprediction sets in a general and rigorous manner.

2. Preliminaries

We introduce some standard machinery from the theory of convex sets and functions (see Hiriart Urruty and Lemar echal, 2001; Rockafellar, 1970; Rockafellar and Wets, 2004; Schneider, 2014; Penot, 1997)4. The concave cases of some of these results can be found in the works of Pukelsheim (1983) and Barbara and Crouzeix (1994). In choosing our notational conventions, we have adopted notation more common in the mathematical literature, even though some of this may be unfamiliar to a machine learning audience (since we refer to the mathematical literature for a number of the results upon which we build).

2.1 Basic Notation

Let X be a finite dimensional Euclidean space over the reals. The space of linear functionals on X is X , and the natural coupling is , : X X R; the usual inner product. Define the special sets R>0 := (0, ); R 0 := [0, ); R<0 := ( ,0); R 0 := ( ,0]; R := [ ,+ ]; R 0 := [0,+ ]. Denote the cardinality of a set S X by |S|. If S = {Tα |α A} is a set of sets, then TS := T α A Tα and SS := S α A Tα. We refer to the components of x X by xi and x = (x1,...,xn). Let (S) denote the set of probability measures on a set S, and [n] := {1,2,...,n}; then ([n]) x Rn 0 n i=1 xi = 1 . Let (ei)i [n] be the canonical basis vectors in Rn. The family

of p-norms (with p [1, ]) on the space X are defined by x p := i [n]|xi|p 1/p for finite p, and x := maxi [n]|xi| =: W i [n] xi. The p-unit ball is Bp := {x X | x p 1}, and if there is no subscript we take B := B2. The Iverson bracket J K takes a proposition and returns 1 if it is true, and 0 otherwise. We use the common conventions inf( ) := + , sup( ) := , 0 := 0 and 1/0 := . If v Rn, then v denotes its transpose. The all ones vector is defined as 1n := (1,...,1) Rn.

2.2 Convex Sets

Let S,T X, x X, α R and U R. Let αS := {αs | s S}, U S := {αs | α U, s S}, S+x := {s+x | s S}. The Minkowski sum is S+T := {s+t | s S and t T}. For S X, cl(S) and S both denote its closure. The collection of closed, nonempty, convex subsets of S is K(S). The interior and boundary of S are

int(S) := {x S| ε > 0, (εB+x) S} and bd(S) := S\int(S).

4. We recognise that there is a significant quantity of background material needed before we get to the machine learning problem and the results about loss functions. But this really illustrates the point of the paper: the deeper structure of loss functions arises from more fundamental geometrical concepts. And while some of the material in this section is widely known, the results for concave gauges and their polar duals, which are central to the analysis of loss functions, are both less well known and not a trivial variation of the convex case.

GEOMETRY AND CALCULUS OF LOSSES

radiant shady

star-shaped co-star-shaped

1-homogeneous

subadditive

1-homogeneous

superadditive

FIGURE 1: The relationship between various classes of sets S X defined in 2.3, and the corresponding properties of the gauge γS and antigauge βS defined in 2.7.

If S is convex its relative interior and relative boundary are

ri(S) := {x S| y S, λ > 1, λx+(1 λ)y S} and rbd(S) := S\ri(S). (1)

Its convex hull is co(S) := T{T X|S T and T convex}; its conic hull is cone(S) := S t>0t S. For the closure of these operations we sometimes write clco(S) := cl(co S), and clcone(S) := cl(cone S).

2.3 Starry, Radiant and Shady Sets

A nonempty, proper subset S X is:

star-shaped if (0,1] S S and 0 int S;

co-star-shaped if [1, ) S S and 0 / S;

radiant if it is star-shaped and convex;

shady if it is co-star-shaped and convex.

By convention the empty set is star-shaped (and radiant), and the entire space is co-star-shaped. Thus the star-shaped sets are the complements of the co-star-shaped sets and vice versa. If S = S we say S is symmetric; if S is symmetric and radiant we say it is a norm ball. Let R(X) and S(X) denote, respectively, the collections of closed radiant and closed shady subsets of X X. These definitions are illustrated in Figure 1.

2.4 Cones and Recession Cones

A set C X is said to be a cone if R>0 C C. A cone C is pointed if 0 C; salient if x, x C implies x = 0; and blunt if 0 / C. Every closed cone is pointed. Every blunt, convex cone is salient, but this is not the case for pointed convex cones. If a convex cone C is salient, then C \{0} is also a convex cone. For a cone C X there is a natural counterpart C X called the dual cone, where

C := {x X | x C, x ,x 0}. (2)

A pointed convex cone C X induces a partial ordering on X which we denote C; for x,y X we say x C y if x y C. For a set S X we say d X is a recession direction of S if S+d S. The

WILLIAMSON AND CRANKO

FIGURE 2: An illustration of the set S X (which extends infinitely far to the north-east , and thus only a finite portion is illustrated, a convention which we adopt throughout the paper) and its recession cone rec(S) = R2 0. For all x S and all d rec(S) the point x+d is contained within S.

collection of recession directions of S is called its recession cone which we denote by

rec(S) := {d X | S+d S}. (3)

The recession cone is illustrated in Figure 2. If S K(X) then it is easily verified that rec(S) is indeed a closed convex cone. We say a set S X is C-oriented if rec(S) = C. Unless otherwise stated we use X+ to denote a salient, closed, convex cone that is a proper subset of X: X+ X.

Proposition 1 (Calculus of Recession Cones) Let S K(X). The following hold:

1. rec(S) = {0} if and only if S is bounded;

2. rec(S) is a closed convex cone, thus S+rec(S) = S;

3. d rec(S) if and only if there exist sequences (xn)n N, xn S, and (tn)n N 0, tn R 0, such that xntn d;

4. if C X is a convex cone then rec(C) = cl(C);

5. if A B X then rec(A) rec(B).

If (Si)i I with Si K(X) is a family with an arbitrary index set I with T i I Si = , then

6. rec(T i I Si) = T i I rec Si;

7. rec(S i I Si) S i I rec Si.

If [m] I is a finite subcollection, then

8. rec i [m] Si) i [m] rec(Si).

Proof These are all well-known and can be found in a variety of common references (Auslender and Teboulle, 2003; Rockafellar, 1970; Rockafellar and Wets, 2004).

GEOMETRY AND CALCULUS OF LOSSES

2.5 Convex, Concave and Homogeneous Functions

For the remainder of this section let f : X R; we define its domain, epigraph and hypograph respectively as

dom(f) := {x X | f(x) R},

epi(f) := {(x,t) dom f R | f(x) t},

hyp(f) := {(x,t) dom f R | f(x) t}.

Let α R. The below, level and above sets are, respectively,

lev α(f) := {x dom(f) | f(x) α},

lev=α(f) := {x dom(f) | f(x) = α},

lev α(f) := {x dom(f) | f(x) α}.

We say f is convex if it satisfies

x,y dom f t (0,1), f(tx+(1 t)y) t f(x)+(1 t)f(y).

If f is convex, then f is concave. Equivalently f is convex (resp. concave) if epi(f) is convex (resp. hyp(f) is convex). We say f is quasi-convex (resp. quasi-concave) if lev α f (resp. lev α f) is convex for all α R. We say a convex (resp. concave) function f is closed if epi(f) (resp. hyp(f)) is closed. Thus if f is closed and convex, f is closed and concave. The closure of a convex (resp. concave) function f is the convex (resp. concave) function g such that epi(g) = cl(epi f) (resp. hyp(g) = cl(hyp f)). The function g is denoted by cl(f). Finally for a set S X, define argsupx S f(x) := {x X| f(x) = supx S f(x)} and arginfx S f(x) := {x X| f(x) = infx S f(x)}; either of these sets can be empty. If for some k R, f satisfies f(αx) = αk f(x) for α > 0 and for all x, we say f is homogeneous of degree k or k-homogeneous (there is an obvious extension to set-valued functions). If f is 1homogeneous, we also say f is positively homogeneous. If for all x,y dom(f) we have f(x+y) f(x)+ f(y) then f is called subadditive. If f is subadditive then f is called superadditive. If f is positively homogeneous and subadditive (resp. superadditive) then we say f is sublinear (resp. superlinear). All sublinear functions are convex and all superlinear functions are concave. Suppose f1,..., fm : X R. Their infimal convolution is the function X R defined by

(f1 fm)(x) := inf{ f1(x1)+ + fm(xm)|x1 + +xm = x}. (4)

2.6 Support Functions, Subdifferentials and Superdifferentials

For a set S X we define its convex support function

X x 7 σS(x ) := sup{ x ,x | x S} R. (5)

However, in our setting, it will often be more natural to consider the concave support function

X x 7 ρS(x ) := inf{ x ,x | x S} R. (6)

WILLIAMSON AND CRANKO

The convex and concave support functions are related as follows:

x X , σS(x ) = sup x S x ,x = sup x S x , x = inf x S x ,x = ρ S(x ). (7)

It is easy to see that σ and ρ are both 1-homogeneous, σ is subadditive and thus sublinear; and ρ is superadditive and thus superlinear. We introduce the mappings f, f : X 2X with

X x 7 f(x) := {x X | y X, f(y) f(x) x ,y x },

X x 7 f(x) := {x X | y X, f(y) f(x) x ,y x }. (8)

The first mapping is the classical subdifferential, and the second is the less-common concave subdifferential, or superdifferential. These sets are related by (f) = ( f). The mapping

X x 7 f(x) := f(x) f(x)

is known as the symmetric subdifferential (Mordukhovich and Shao, 1995). When f is convex f = f, and when f is concave f = (f). In the event that f is both convex and concave f = f = f. More importantly, the symmetric subdifferential satisfies ( f) = f, which makes it a convenient choice for us since we deal with (sub/super)-differentials of both convex and concave functions. We refer to elements of f as subgradients (recognising the slight terminological abuse in the choice of name) and write dom( f) := {x X | f = }. If there is a function g: X X that is always a subgradient of f in the sense that for all x dom( f) we have g(x) f(x), then g is called a selection of f and we write g f. The following proposition is a standard result (see Penot, 2012; Bauschke and Combettes, 2011):

Lemma 2 Let f : X R be convex with nonempty domain. Then ri(dom f) dom( f).

It is easy to show that for some convex functions the inclusion in Lemma 2 is not strict; For example take ,s , then dom( ,s ) = dom( ,s ).

Proposition 3 Suppose f : X R is 1-homogeneous. Then f is 0-homogeneous.

Proof From the definition of the subdifferential, for all x X and all α > 0,

f(αx) = {x X | y X, f(y) f(αx) x ,y αx }

= {x X | αy X, f(αy) f(αx) x ,αy αx }

= {x X | y X, α f(y) α f(x) α x ,y x }

and f is thus 0-homogenous. The proof for the superdifferential is similar.

Lemma 4 (Z alinescu, 2013, Corollary 3) Assume C K(X). Then σC is differentiable on dom( σC)\ {0} if and only if either C is a singleton or int(C) = and C is strictly convex.

Lemma 4 together with (7) gives us the corollary for the concave case:

GEOMETRY AND CALCULUS OF LOSSES

Corollary 5 Assume C K(X). Then ρC is differentiable on dom( ρC)\{0} if and only if either C is a singleton or int(C) = and C is strictly convex.

The gradient operator on X is := ( x1,..., xn). The following lemma has an obvious extension for the concave case:

Lemma 6 (Rockafellar, 1970, Theorem 25.1) Let f : X R be convex. If f is differentiable at x dom(f), then f(x) = { f(x)}. Conversely, if f(x) is a singleton at x dom(f) then f is differentiable at x.

Support functions are oblivious to closures and convex hulls:

Lemma 7 (Hiriart-Urruty and Lemar echal, 2001, Proposition C.2.2.1) Suppose S X is nonempty. Then σS = σcl(S) = σco(S); whence σS = σclco(S).

Using (7) we find:

Corollary 8 Suppose S X is nonempty. Then ρS = ρcl(S) = ρco(S); whence ρS = ρclco(S).

Lemma 9 Let S X, S = . Then dom(σS) = (rec(clco S)) and dom(ρS) = rec(clco S) .

Proof Firstly σS = σclco S (Lemma 7). Then from (Auslender and Teboulle, 2003, Theorem 2.2.1, p. 32) (domσS) = rec(clco S). Thus since dom(σS) is convex, dom(σS) = ( rec(clco S)) = (rec(clco S)) . For the concave case using (7) we have dom( ρclco S) = dom(σ clco S) = ( rec( clco S)) = rec(clco S) .

2.7 Gauge Functions

The theory of gauges (Minkowski functionals) and polars has been traditionally developed for radiant sets (see Hiriart-Urruty and Lemar echal, 2001; Rockafellar, 1970; Schneider, 2014; Thompson, 1996), whereas the theory of gauges for shady sets is less well known (Rockafellar, 1967; Pukelsheim, 1983; Barbara and Crouzeix, 1994; Penot and Z alinescu, 2000). However gauge functions for shady sets have been used in statistics in a manner similar to that which we will use them (Pukelsheim, 1983) and in economics; see (e.g. Hasenkamp and Schrader, 1978) and footnote 12 below. For convex sets the support function is a natural object to consider. Likewise, when working with star-shaped and co-star-shaped sets the gauge and anti-gauge are a natural parallel. For a set S X we define its gauge and anti-gauge:

X x 7 γS(x) := inf{λ > 0 | x λS} R, (9)

X x 7 βS(x) := sup{λ > 0 | x λS} R. (10)

If S is closed and radiant, then γS is closed and sublinear (Penot and Z alinescu, 2000, Proposition 2.3). Alternately, if S is closed and shady, then βS is closed and superlinear (Penot and Z alinescu, 2000, Proposition 2.4). Thus γS is a convex gauge and βS a concave gauge, as they are sometimes described in the literature. We list some properties of gauge functions and their associated sets in Table 1 and graphically in Figure 1. For closed S, the base star-shaped and co-star-shaped sets can be recovered

WILLIAMSON AND CRANKO

Gauge function γS

S Non-negative 1-Homogeneous Subadditive Norm

star-shaped radiant norm ball

Anti-gauge function βS

S Non-negative 1-Homogeneous Superadditive

co-star-shaped shady

TABLE 1: Properties of gauge and anti-gauge functions when restricted to their domains, as determined by their defining sets. See also Figure 1.

with the inverse mappings γS 7 lev 1(γS) = S and βS 7 lev 1(βS) = S respectively. Mirroring Lemma 9, Penot and Z alinescu (2000) observed that when S is closed

dom(γS) = cone(S) {0} rec(S) and dom(βS) = cone(S) {0} rec(S).

As one might expect

dom(γS) = clcone(S) and dom(βS) = clcone(S).

If 0 int S, then dom(γS) =X for all star-shaped sets S. Conversely, for any co-star-shaped set S X we have cone(S) X. In general this is the key difference between gauge and anti-gauge functions. That is, while gauge functions can be finite on the whole space, anti-gauges are usually finite only on a conic subset of X. This is significant with regard to the equivalence of gauges or norms. Recall two gauge functions γT1,γT2 : X R are equivalent if there exists constants c,C R>0 such that for all x X, cγT2(x) γT1(x) CγT2(x). Whilst all convex gauges and norms in finite dimensional space are equivalent, that is not true in general for concave gauges or anti-norms βS1,βS2 : X R if at least one of the concave gauges can take on the value + for some x X. Unbounded concave gauges correspond to unbounded loss functions; a point which will be elaborated below. The attentive reader will notice the similarity between the properties of the gauge of a norm ball and the properties of a norm on X. Indeed every norm on X, , can be written as a gauge of the norm ball lev 1 , and conversely the gauge of every norm ball, as defined in 2, is a norm. If one restricts analysis to the set cone(S) for a shady set S X in effect ruling out multiplication by negative scalars the function βS : cone(S) R is a natural counterpart to a norm on this space, which we call an anti-norm. As we shall see in 3, the conditional Bayes risks associated with proper losses are in fact anti-norms5.

5. Anti-gauges are sometimes called anti-norms (Berestovsk ı and Gichev, 2004; Moszy nska and Richter, 2012;

Merikoski, 1991), although confusingly this name is sometimes used to refer to the dual (polar) of a traditional norm (Horv ath et al., 2017; Martini and Swanepoel, 2006). We will thus stick with the terminology anti-gauge .

GEOMETRY AND CALCULUS OF LOSSES

2.8 Legendre-Fenchel Conjugates

The Legendre Fenchel conjugate or convex conjugate of f is the function

X x 7 f (x ) := sup x X ( x ,x f(x)) R. (11)

In addition to the more common convex Legendre Fenchel conjugate, we will make use of the concave conjugate, which, like the case of the concave support function, will be more appropriate for our purposes. The concave conjugate of f is

X x 7 f (x ) := inf x X( x ,x f(x)) R, (12)

and is related to the convex conjugate as follows:

f (x ) = sup x X ( x ,x f(x)) = inf x X( x ,x ( f)(x)) = ( f) ( x ).

The concave conjugate therefore satisfies a reverse Fenchel Young inequality:

x X, x X , f(x)+ f (x ) x ,x .

A function f is lower semi-continuous if for all x X and for all sequences (xn)n N with xn X and xn x we have f(x) liminfn f(xn). If f is lower semi-continuous then f is said to be upper semi-continuous.

2.9 Polar Duality

While we do make some use of the above Legendre-Fenchel duality, more important for our purposes is the polar duality of convex sets (Moszy nska, 2006, Chapter 13). It arises from the classical polarity between points and lines relative to the unit circle in inversive geometry (Askwith, 1917). Following Berger (2010, VII.5.C), define a bijection between the set of all points other than the origin onto the set of hyperplanes not containing the origin via x 7 hx := {y Rn | x,y = 1}, which is known as the polar hyperplane of the point x. Associated with hx is the halfspace Hx := {y Rn | x,y 1}. Given a set C Rn containing the origin, the polar of C is simply T x C Hx. More generally, there is a bijection between the sublinear functions σA and closed convex sets A; and another bijection between the superlinear functions ρB and closed convex sets B Hiriart-Urruty and Lemar echal (2001). Noting Table 1 we see that the gauge and anti-gauge functions also satisfy the same criteria as the support functions when restricted to (respectively) the radiant and shady subsets. A natural question to ask then is: can we find sets A and B such that σA = γA and ρB = βB? The answer in the convex case with respect to the radiant sets is well known, but there is an equally rich, parallel structure in the concave case with respect to the shady sets. Penot (2012) and Z alinescu (2002) show the following results for the convex case, and Penot and Z alinescu (2000) prove the concave case. For a set S X its polar and antipolar are the sets

S := {x X | x S, x ,x 1} and S := {x X | x S, x ,x 1}. (13)

Equivalently S = lev 1 ρS and S = lev 1 ρS. It is easy to show that the polarity operation S 7 S

takes closed radiant sets to closed radiant sets, and the antipolarity operation S 7 S takes closed

WILLIAMSON AND CRANKO

FIGURE 3: The polar duality of support functions and gauge functions for radiant (R) and shady (S) sets.

shady sets to closed shady sets such that if S S(X+), then S S(X + ). The polars and bipolars satisfy (Penot and Z alinescu, 2000, Lemma 4.2):

S = (clco((0,1] S))

S = clco((0,1] S) and S = (clco([1, ) S))

S = clco([1, ) S), (14)

where the operations S 7 clco((0,1] S) and S 7 clco([1, ) S) are known as the radiant hull and shady hull respectively. The polar and antipolar operations also induce a natural duality relationship between the gauge and support functions:

R R(X), σR = γR , σR = γR; and S S(X), ρS = βS , ρS = βS, (15)

such that we may define the function polar and antipolar:

σ R := σR , γ R := γR ; and ρ S := ρS , β S := βS . (16)

The above relationships are presented diagrammatically in Figure 3. The polar (antipolar) relationship between the convex (concave) support functions and gauge functions motivates a definition of the polar (antipolar) for sub/super-linear non-negative functions that is independent of its definition as a support function or gauge of a set. Let f,g: X R with f sublinear, and g superlinear. Convex and concave polar duality correspondences for f and g are given by

X x 7 f (x ) := sup x =0

f(x) R and X x 7 g (x ) := inf x =0 x ,x

Thus f and g satisfy a generalised H older and reverse H older inequality respectively:

x X,x X , x ,x f (x )f(x) and x ,x g (x )g(x). (17)

The case of H older conjugate norms p and q with 1/p+ 1/q = 1 can easily be derived as a special case of the polar duality relationships above, with Bq = B p and Bp = B q. We will make use of the following result of Barbara and Crouzeix (1994) which can be seen to be analogous to the classical result (Hiriart-Urruty and Lemar echal, 2001, Proposition E.1.4.3) regarding subdifferentials of Legendre Fenchel conjugates. We express the result for the concave case because that is what we need for losses. An analogous result holds for convex gauges.

GEOMETRY AND CALCULUS OF LOSSES

Lemma 10 (Barbara and Crouzeix 1994, Theorem 3.1) Let S S(X). Then for x domβS and x domβS

x βS(x) βS (x ) x

βS (x ) βS(x) βS (x )βS(x) = x ,x .

Barbara and Crouzeix (1994) provided a sketch of a proof. However as it is central to what follows we present a complete proof below. Proof We first prove

βS (x ) βS(s) | {z } (A)

βS (x ) S and βS(s) = x /βS (x ),s

For the sufficient condition suppose s S and x /βS (x ) βS(s). Then from (8) we have

y X, βS(y) βS(s)+ x /βS (x ),y s . (19)

Since by assumption S is shady, s = 0. Setting y = 0 and then y = 2s, and exploiting the 1-homogeneity of βS and (19)

βS(0) = 0 βS(s) x /βS (x ),s = βS(s) x /βS (x ),s , (20)

βS(2s) = 2βS(s) βS(s)+ x /βS (x ),s = βS(s) x /βS (x ),s . (21)

Together, (20) and (21) give

βS(s) = x /βS (x ),s . (22)

Combining (19) with (22) gives

(A) = y X, βS(y) x /βS (x ),s + x /βS (x ),y s = x /βS (x ),y ,

(A) = y X, βS(y) x /βS (x ),y . (23)

Since S is closed and shady, lev 1 βS = S and

y X, 1 βS(s) (22) = x /βS (x ),s (23) x /βS (x ),y = s S, 1 x /βS (x ),s .

Thus x /βS (x ) S by (13). For the necessary condition suppose now that x /βS (x ) S and let s S be such that βS(s) = x /βS (x ),s . Then

0 = βS(s)+ x /βS (x ), s y X, x /βS (x ),y = βS(s)+ x /βS (x ),y s . (24)

The reverse H older inequality (17) gives

y X, βS (x )βS(y) x ,y βS(y) x /βS (x ),y . (25)

WILLIAMSON AND CRANKO

infx S x ,x

sup{λ > 0 | x λS}

FIGURE 4: Illustration of concave support function ρS and concave gauge function βS. In (a) the set S is supported by the hyperplane normal to x at the point x with offset equal to ρS(x). In (b), picking a point x, we consider scaled versions λS of S such that x ends up on the boundary of λS, thus giving the value of βS(x). The ray passing through x intersects the boundary of S at x/βS(x).

Combining (24) with (25) we have

(B) = y X, βS(y) βS(s)+ x /βS (x ),y s .

Thus x /βS (x ) βS(s), and (18) is proved. Let x dom(βS). Then βS x/βS(x) = 1. This follows since βS is 1-homogenous. Since S = lev 1 βS, we have x/βS(x) S. Substituting s = x/βS(x) in (18) implies that for all x dom(βS )

βS (x ) βS(x) x

βS (x ) S and 1 βS(x) βS(x) = x /βS (x ), x/βS(x) ,

where we used the 0-homogeneity of βS (Proposition 3) to obtain βS(x/βS(x)) = βS(x). Since S is shady we can apply the bipolar theorem (14) to obtain the equivalent condition for all x dom(βS ):

x βS(x) β S(x ) x βS(x) S and 1 βS (x ) βS (x ) = x /βS (x ), x/βS(x) .

Finally observe

1 = x /βS (x ), x/βS(x) βS (x )βS(x) = x ,x ,

which concludes the proof.

Lemma 11 Suppose S S(X). Then βS is strictly concave if and only if S is strictly convex.

GEOMETRY AND CALCULUS OF LOSSES

Proof We show the sufficient condition using a proof by contradiction. Assume (for the contradiction) βS is strictly convex. If S is not strictly convex there exists x,y S and t (0,1) such that tx + (1 t)y bd(S). Since S is convex it follows that ri(S) is convex. And so it must be the case that x,y bd(S). Since S is shady, and by hypothesis βS is strictly convex, it follows that

βS(tx+(1 t)y) | {z } 1

< t βS(x) | {z } 1

+(1 t)βS(y) | {z } 1

which is absurd. Thus S is strictly convex. For necessity, choose arbitrary x,y S with x = y. Then x/βS(x), y/βS(y) bd(S) and βS(x/βS(x)) = βS(y/βS(y)) = 1. Since S is strictly convex t x/βS(x)+(1 t) y/βS(y) ri(S) for all t (0,1). Thus

t βS(x) x+ (1 t)

βS(y) y > 1 = tβS

t βS(x) x+ (1 t)

βS(y) y > t βS(x) βS(x)+ (1 t)

βS(y) βS(y),

where in the second line we exploited the 1-homogeneity of βS. Multiplying both sides by 1 t/βS(x)+(1 t)/βS(y) and exploiting 1-homogeneity again gives t (0,1)

t/βS(x)+ (1 t)/βS(y) βS

t βS(x) x+ (1 t)

t/βS(x) t/βS(x)+ (1 t)/βS(y) βS(x)+ (1 t)/βS(y) t/βS(x)+ (1 t)/βS(y) βS(y)

t/βS(x) t/βS(x)+ (1 t)/βS(y) x+ (1 t)/βS(y) t/βS(x)+ (1 t)/βS(y) y >

t/βS(x) t/βS(x)+ (1 t)/βS(y) βS(x)+ (1 t)/βS(y) t/βS(x)+ (1 t)/βS(y) βS(y),

and thus βS is strictly concave.

Corollary 12 Let S S(X). Then ρS strictly concave if and only if S is strictly convex.

Proof Apply Lemma 11 to ρS = βS .

Lemma 13 Let (sn i )n N with sn i X+ \ {0} for i [m]. Assume i [m] sn i

n N is a convergent sequence. Then each sequence (sn i )n N for i [m] has a convergent subsequence.

Proof If each of the m sequences (sn i )n N is bounded the proof is trivial. Assume that for 1 i l the sequences (sn i )n N are unbounded. We now show there exists a linear functional z such that with xn := m i=1 sn i ,

z ,xn | {z } bounded

= z ,sn 1 + + z ,sn l

| {z } unbounded

+ z ,sn l+1 + + z ,sn m | {z } bounded

WILLIAMSON AND CRANKO

Convex case R R(X) f convex

Concave case S S(X) f concave Definition

Support function σR ρR (5), (6) Gauge function γR βS (9), (10) Conjugate function f f (11), (12) Polar set R S (13) Polar function f f (16)

TABLE 2: Summary of the convex case and the (less known) concave case of support functions, gauge functions, polar sets, polar functions and Legendre-Fenchel conjugates.

producing a contradiction. To show the existence of z define the set

Observe that since X+ is salient, closed and convex, we have U {0} = . Furthermore, we trivially have that {0} is compact and U is closed. Then by the Hahn Banach separation theorem (Penot, 2012, Theorem 1.79, p. 55) there exists z X such that z ,u δ > 0 for all u U. To see why the first l terms must be unbounded with this choice of z note that we can write

i [l], z ,sn i = D z , sn i sn i/ sn i E = sn i D z , sn i/ sn i E ,

where sn i and D z , sn i/ sn i E δ for every n N since sn i/ sn i U. This shows (26), which is absurd.

2.10 Comparing the Convex and Concave Versions

Table 2 tabulates the convex and concave versions of the key mathematical objects we make use of in this paper. As can be seen, for every standard convex version, there is a corresponding concave version.

3. Loss Functions

In this section we will introduce proper losses; first in the traditional way, and then in terms of the superprediction set. We will then show some of the implications of the latter approach. A loss function is an outcome contingent disutility : that is, for a given outcome y, it provides a measure of (dis)utility of a prediction as a function u( ,y) (Berger, 1985). We introduce loss functions more formally by first introducing some concepts from statistical decision theory, to which we apply some of the geometric concepts introduced in 2. Let Z, Y be random variables taking values in the spaces Z and Y . We assume Y is finite with n := |Y |, and therefore distributions that assign probability to every state of Y are isomorphic to probability vectors from := ri( p Rn 0 n i=1 pi = 1 ),

GEOMETRY AND CALCULUS OF LOSSES

the relative interior of the n-simplex, that is cl( ) (Y ). Its dual cone, Rn 0, is the associated collection of loss vectors. We use X+ Rn and X + Rn to denote an arbitrary pair of salient, closed, convex cones, dual to one another. The reader might find it helpful to identify X+ with clcone( ), and X + with its dual, however most of our theorems merely depend on a dual pair of closed convex cones. One can understand the effect of choice of loss in terms of the conditional perspective which allows one to ignore the distribution of Z, which is typically unknown.6 We call mappings l: Y R 0 loss functions, and l(p;y) is the penalty from predicting p upon observing y Y . (It will sometimes be convenient to consider the extension of ldefined as l: cl( ) Y R 0, which now needs to map to R 0 to allow for infinite values; see Remark 22). It will be convenient to stack loss functions into a function over the second argument:

p , l(p) := y 7 l(p;y) RY 0,

or equivalently we have a vector

p , l(p) (l(p;y1),..., l(p;yn)) Rn 0.

For each fixed y Y , the functions l( ;y) are called partial losses. If for some norm (the choice does not matter) l(p) < for all p we say the loss is bounded. The conditional risk associated with lis defined via

L: (q, p) 7 L(q, p) := Eq[l(p)] = l(p),q R 0.

The conditional Bayes risk L := q 7 infp L(q, p) is always concave. For the next two definitions let f : Z , and g(z) := Pr(Y |Z = z). That is, for z Z , f(z) is a distribution over Y conditioned on Z = z and g(Z) is the true conditional distribution. The full risk is

EZEY |Z[l f(Z)] = EZ l f(Z),g(Z) . (27)

The most general framing of a supervised machine learning problem is to minimise (27) by choosing an appropriate function f. If we fix land g, the minimal value of the full risk (27) is bounded below by the Bayes risk7:

inf f : X EZ l f(Z),g(Z) .

The superprediction set (Kalnishkan et al., 2004; Kalnishkan and Vyugin, 2002; Dawid, 2007) of a loss function l: Rn 0 is

spr(l) := [

n x Rn x Rn 0 l o Rn 0. (28)

The set spr(l) (we write spr lwhen there is no ambiguity) consists of all the points x that incur no less loss than some point l l( ). In the parlance of game theory, this is the union of the points

6. See (Steinwart and Christmann, 2008; Reid and Williamson, 2011) for a discussion of this conditional perspective. 7. See (Williamson and Cranko, 2022) for a further discussion of the Bayes risk, its generalisation to restricted model classes F, and the relationship to measures of information such as f-divergences and integral probability metrics.

WILLIAMSON AND CRANKO

p that are weakly dominated by some other point, together with the dominating points. Equivalently (28) can be written

l l( ) (l +Rn 0) = l( )+Rn 0. (29)

In the next section we will be interested in the closed convex hull of the superprediction set mapping, for which it is useful to note

clco(spr l) (29) = clco(l( )+Rn 0) = clco(l( ))+Rn 0 = [

l clco(l( )) (l +Rn 0). (30)

Lemma 14 is a special case of a result due to Choquet (1962), but as the original is in French we include a proof for our setting below.

Lemma 14 Let S1,...,Sm X+ \{0} each be closed. Then the set S1 + +Sm is closed.

Proof Let S := S1 + +Sm, and take a convergent sequence (xn)n N x with xn S. Then for each xn there exists (sn i )i [m] with sn i Si for i [m] and n N. By Lemma 13 each of the sequences (sn i )n N has a convergent subsequence. And as Si is closed limn sn i S for each i [m] and x = i [m] limn sn i S (taking subsequences if need be).

3.1 Proper Losses

A natural requirement to impose upon lis that it is proper8 (Hendrickson and Buehler, 1971), which means that

[ p,q , l(q),q l(p),q ]

q , q arginf p l(p),q

That is, predicting the true probability minimises the expected loss. We say l is strictly proper when the above inequality is strict for p = q, that is, {q} = arginfp l(p),q for all q . If l is proper, L(p) = L(p, p) = l(p), p . The superprediction set spr(l) of a proper loss l has some useful properties: Theorem 15 makes explicit the link between the superprediction set and the convex geometry in 2. Proposition 21 below justifies that from a superprediction set we can construct a loss function. This motivates a shift in focus of analysis from loss functions lto families of convex superprediction sets of proper losses.

Theorem 15 (Representation) Let l: Rn be a loss function with the associated conditional Bayes risk L. Then

1. L = ρclco(spr l)| (the restriction of ρclco(spr l) to );

8. See (Gneiting and Raftery, 2007; Reid and Williamson, 2011) for an elaboration of the notion of properness, which dates back at least to work of Shuford Jr. et al. (1966) and von Holstein (1970); early particular examples are due to Brier (1950) and Good (1952). Note that what we call a proper loss is often called a proper scoring rule; the case we consider corresponds to having an action space being a set of finite dimensional distributions; see for example (Gr unwald and Dawid, 2004).

GEOMETRY AND CALCULUS OF LOSSES

2. [ p , l(p) ρclco(spr l)(p)] lis proper.

Proof The claim that L = ρclco(spr l) on for all loss functions lis straightforward:

p , L(p) = inf q l(q), p = inf l l( ) l, p = inf l l( )+Rn 0 l, p (30) = ρclco(spr l)(p). (32)

We now prove the second claim. Assume lis proper. Then for p we have ρspr(l)(p) = L(p, p) = l(p), p and p , q Rn,

l(p),q l(q),q l(p),q l(p), p l(q),q l(p), p

(32) l(p),q p ρclco(spr l)(q) ρclco(spr l)(p).

Thus l(p) ρclco(spr l)(p) for p . We use a proof by contradiction to show the reverse implication. Assume lenjoys the subgradient representation l(p) ρclco(spr l)(p) for p , but is not proper. Since lis not proper, by (31) there exists p,q with

l(p),q < l(q),q l(p),q l(p), p < l(q),q l(p), p

l(p),q p < l(q),q l(p), p . (33)

Since l(p ) ρclco(spr l)(p ) for p , from (8) we have p ,q Rn 0,

l(p ),q p ρclco(spr l)(q ) ρclco(spr l)(p )

= l(p ), p ρclco(spr l)(p )

l(p ), p ρclco(spr l)(p ), (34)

where in the implication we take q = 0. This gives us, for our choice of p,q,

l(p),q p (33) < l(q),q l(p), p

(34) ρclco(spr l)(q) l(p), p

ρclco(spr l)(q) inf r l(r), p

(32) = ρclco(spr l)(q) ρclco(spr l)(p),

which contradicts our assumption that l(p) ρclco(spr l)(p) for all p .

Thus in order to build a geometry of loss functions in terms of convex sets, with Theorem 15 we see that the propriety condition of the losses cannot be discarded; see also the discussion in section 3.6 below. The definition of as the relative interior of the probability simplex guarantees (via Corollary 17(2) below) that the subdifferential ρclco(spr l)(p) is nonempty for all p . This is analogous to the differentiable case, where if one wishes to compute gradients of a differentiable function, the natural area of analysis is the interior of its domain of definition.

WILLIAMSON AND CRANKO

Proposition 16 Suppose l: Rn 0 is a loss function. Then clco(spr l) is Rn 0-oriented.

Proof By hypothesis l: Rn and so from the definition of the superprediction set spr(l) Rn 0 we have

rec(Rn 0) P1(5) rec(clco(spr l))

l clco(l( )) (l +Rn 0)

l clco(l( )) rec(l +Rn 0)

= rec(Rn 0)

P1(4) = Rn 0

as desired.

Corollary 17 Suppose l: Rn 0 is a loss function. Then

1. dom(ρclco(spr l)) Rn 0, and

2. dom( ρclco(spr l)) Rn >0.

Proof Take Proposition 16 and apply Lemma 9, this shows claim 1. Apply Lemma 2 to (1) to show claim 2.

Theorem 15 and Proposition 16 motivate the introduction of the following family of convex sets. (Recall that X = Rn for some n and X+ X denotes a salient closed convex cone.) Take C X, let P(C) be the collection of C-oriented, closed, nonempty convex subsets of C \{0}:

P(C) := {S K(C \{0}) | rec(S) = C}. (35)

The construction (35) admits a lot of structure. In particular, Lemma 18 justifies our interest in the shady sets introduced in 2.7. Recall that S(X) denotes the collection of closed shady subsets of X.

Lemma 18 P(X+) S(X).

Proof Take S P(X+). Since S rec(S) = X+, we have

( d X+, α > 0, S+αd S) = ( d S, α > 0, S+αd S)

= ( α 1, αS S),

and S is co-star-shaped. From (35), S is also convex, thus S is shady.

Lemma 19 Let S P(X+). Then ri(S) = .

GEOMETRY AND CALCULUS OF LOSSES

Proof The result follows from a proof by contradiction. Suppose ri(S) = . Then from (1) this means

x S, λ > 1, y S, [λx+(1 λ)y / S λx / S+(λ 1)y]. (36)

From the conditions, λ 1 > 0 and y S rec(S). Thus (λ 1)y rec(S). It follows that S+(λ 1)y = S because for any x rec S, S+x = S (Proposition 1 (2)). Hence (36) reduces to

x S, λ > 1, λx / S,

which contradicts the fact that S is co-star-shaped (Lemma 18).

Lemma 20 Let Ai P(X+) for i [m]. Then T i [m] Ai = .

Proof Take an arbitrary sequence (ai)i [m] with ai Ai. Let Ji := [m] \ {i}, for i [m]. Then j Ji aj + ai = k [m] ak for each i [m]. Since X+ is a cone j Ji aj X+ for each i [m]. By assumption Ai rec(Ai) = X+, and from the definition of the recession cone (3) we always have k [m] ak = ai + j Ji aj ai +X+ Ai for all i [m], and thus T i [m] Ai = .

Proposition 21 completes the connection between superprediction sets and loss functions.

Proposition 21 Take S P(Rn 0). There exists a 0-homogeneous selection l: Rn >0 Rn 0 of ρS in the sense that

p Rn >0, l(p) ρS(p), (37)

and lrestricted to is a proper loss.

Proof From Lemma 2, dom( ρS) ri(Rn 0) = Rn >0, and so there exists l(p) arginf x S x, p for

p Rn >0. Thus

p,q Rn >0, l(q),q l(p),q ,

and lis proper. Since ρS is 1-homogeneous, lis 0-homogeneous (Proposition 3).

Remark 22 (From bounded to unbounded) In Theorem 15 and Proposition 21 we needed to be careful when talking about the domain of definition of a loss function l. This was to ensure that ρspr(l) is nonempty in order to have the inclusion l(p) ρspr(l)(p). Recall our definition of as the relative interior of the standard simplex. In practice since if there exists q cl( ) with ρspr(l)(q) = , we can define l(q;y) := lim(pn) q l(pn;y), where the sequence (pn) is chosen with pn and by Lemma 2 we know we have ρspr(l)(pn) = (allowing us to take l(pn) ρspr(l)(pn)). This extends our loss function to a mapping l: cl( ) R n 0. Many of the results on bounded loss functions can be similarly extended to unbounded loss functions, by restricting analysis to ; for example the range of log loss is R n 0 when defined on cl( ) but Rn 0 when restricted to .

WILLIAMSON AND CRANKO

Name l(p;y) L(p) Propriety Bounded Ref.

0/1 l0/1 co{ej : j argmax y Y py} min y Y py proper 3.1

Logarithmic llog log(py) y Y py log py strict 3.4

Concave Norm

a [ ,1]\{0} la

! 1 a 1 β a

a 1 (p) strict 4.1

Brier l Br (1+ p 2 2) 1n 2py 1 p 2 2 strict 4.2

Cobb Douglas a Rn 0 l CDa ψa(p/a) ay/py ψa(p) strict 4.3

TABLE 3: Some common loss functions, their conditional Bayes risks, and their propriety and boundedness.

In light of Remark 22, the distinction between bounded and unbounded losses becomes less important. Some commonly used loss functions are listed in Table 3 along with their boundedness. Corollary 23 below follows from the proof of Proposition 21 together with Corollary 5.

Corollary 23 A loss function lis strictly proper if and only if clco(spr l) is strictly convex.

Corollary 24 There is a bijection between the equivalence class of loss functions l: Rn 0 which agree almost everywhere and the family of convex sets P(Rn 0).

Proof There is a bijection between superlinear functions ρS and closed convex sets S (Hiriart-Urruty and Lemar echal, 2001, Theorem C.2.2.2), and the mapping ρS is a singleton almost everywhere (Hiriart-Urruty and Lemar echal, 2001, Theorem B.4.2.3). The connection to loss functions (Theorem 15 and Proposition 21) completes the proof.

3.2 Starting with Sets

The above development motivates the key viewpoint of the present paper: start with a set S P(Rn 0) and derive the loss (and other quantities) from it. We will thus sometimes explicitly parametrise the loss theoretic functions as l S, LS and LS. One immediate consequence of using a set spr(l) P(Rn 0) to define a proper loss lis that it may be the case that for two different loss functions l = m we have clco(spr l) = clco(spr m). This is the case whenever the conditional Bayes risk functions for land m coincide. However in such cases land m differ only on a set of measure zero (cf. Vernet et al., 2016, Proposition 8). That is, for some S := spr(l) = spr(m), both land m satisfy (37). However when l is strictly proper, spr(l) is strictly convex and so for strictly convex spr(l) = spr(m) we always have l= m.

Remark 25 (Misclassification loss) Misclassification loss l0/1 (also called 0/1 loss) (Buja et al., 2005; Gneiting and Raftery, 2007) assigns zero loss when predicting correctly and a loss of 1 when predicting incorrectly. This can be extended to when one predicts with a distribution p , with l0/1(p) = ej where j = argmaxi [n] pi and ej is the jth canonical basis vector in Rn, under the

GEOMETRY AND CALCULUS OF LOSSES

assumption that {pi : i [n]} has a unique maximum. We can extend l0/1 to all of in a manner that is consistent with Theorem 15 as follows: Define

L0/1 : p 7 min i [n] pi,

and let l0/1(p) := L0/1(p) = ( L0/1(p)) = ( mini pi) = (maxi( pi)). Using (Hiriart Urruty and Lemar echal, 2001, Example 3.4, page 182) we have

l0/1(p) = ( L(p)) = co{ ej : j argmax i [n] pi} = co{ej : j argmax i [n] pi}. (38)

When the argmax is unique, this reduces to just ej as per the usual definition.

Remark 26 (Naturally 0-Homogeneous) In Theorem 15 we needed to make restrictions on the domain of the concave support function and subdifferential in order for the geometric functions from convex analysis to agree with the classical concept of a loss function. However when loss functions are viewed instead as the subgradient of a concave support function these restrictions are indeed arbitrary. The subgradient representation of a proper loss function (37) suggests via the 0-homogeneity result of Proposition 21 that it is more natural to take a proper loss function ldefined in the conventional way on and instead consider its 0-homogeneous extension: Rn >0 p 7 l (p) := l p 1/ p 1 Rn 0. Defined like this, l satisfies (37) and agrees with lon . Without loss of generality, for the remainder of our analysis we will refer to a loss function as such a 0-homogeneous mapping Rn >0 Rn 0.9

3.3 Bregman Divergences, Semi Inner Products and Finslerian Geometry

The relationship between l and the conditional Bayes risk L (which we know is equal to ρspr(l)) is usually credited to Savage (1971) and is intimately related to Bregman divergences10. Given a convex function φ, the Bregman divergence between x,y dom(φ) is defined to be

Bφ(x,y) := φ(x) φ(y) g(y),x y ,

where g φ is a selection of the subgradient of φ. It is known that the regret L(p,q) L(p) is a Bregman divergence with φ = L, which is not only convex but is also 1-homogeneous. The additional structure of 1-homogeneity offers a nice simplification.

Proposition 27 Give a Bayes risk L, p,q Rn >0,

B L(p,q) = L(p,q) L(p) = l(q) l(p), p . (39)

Proof We have p,q Rn >0, L(p,q) = l(q), p , and thus if lis proper by Theorem 15, the general form of the Bregman divergence simplifies:

B L(p,q) = L(p)+L(q)+ l(q), p q

9. A formal alternative would be to work with the horizon hzn Rn of directions dir: Rn hzn Rn which can be considered pure (magnitudeless) direction vectors, so that for α > 0, dir(αx) = dir(x) (Rockafellar and Wets, 2004, Chapter 3). It amounts to the same thing since 0-homogeneity of lmeans that the magnitude of any vector x Rn does not affect the value of l(x); all that matters is the direction of x. Thus one could in fact define l: hzn Rn 0 Rn 0 via l( x) = l(x), where x is the unique point of intersection of the infinite magnitude direction vector x with the unit sphere. 10. See (Reid and Williamson, 2011) for more context and background on Bregman divergences.

WILLIAMSON AND CRANKO

B L(p,q) 1/ p 2

p1, l(p;y1)

p2, l(p;y2)

FIGURE 5: Geometrical interpretation of regret as the Bregman divergence B L(p,q) = l(q) l(p), p . As q p, so l(q) l(p) and the vectors l(q) l(p) and p become orthogonal and B L(p,q) 0.

= l(p), p + l(q),q + l(q), p l(q),q

= l(q) l(p), p

= L(p,q) L(p),

where we have used the fact that since lis proper L(p) = l(p), p .

The simpler form (39) provides for a geometrical interpretation of the Bregman divergence as the inner product of the vectors (l(q) l(p)) and p, which we illustrate in Figure 5. Considering loss functions as subgradients of concave support functions provides an intriguing geometrical perspective which we now sketch. It is based upon existing work on norm derivatives (Alsina et al., 2010) which we first summarise. Given a norm on some vector space V, define the normalised norm derivative for x,y V via

τ (x,y) := lim λ 0 x+λy 2 x 2

2λ = x lim λ 0 x+λy x

The function τ is of interest because if the norm is derived from an inner product via = , 1 2 , then τ (x,y) = x,y . For norms that are not derived from an inner product, the normalised norm derivative can be claimed to be like an inner product. This claim can be formalised as follows. First we slightly generalise τ by writing it in terms of a gauge function γ with associated unit ball S

in V which we henceforth take to be Rn (recall every norm is a gauge function, but a gauge function is only a norm if its unit ball is centrally symmetric with respect to the origin). Suppose S is smooth and strictly convex and thus S is too, and hence σS (and σS ) is differentible everywhere since there is only one support point for a given hyperplane. We can generalise the definition of τ as

τ S(x,y) := γS (x) lim λ 0 γS (x+λy) γS (x)

= σS(x) lim λ 0 σS(x+λy) σS(x)

GEOMETRY AND CALCULUS OF LOSSES

= σS(x)DσS(x) y,

where D f(x) = ( f(x)/ x1,..., f(x)/ xn), the Jacobian of f, is a row vector (Magnus and Neudecker, 1999, page 99).

Lumer (1961) introduced a semi inner product on a real vector space V as a real-valued two-place function [ , ] satisfying the following axioms11.

SIP1 [x+y,z] = [x,z]+[y,z] x,y,z V.

SIP2 [λx,y] = λ[x,y] λ R, x,y V.

SIP3 [x,x] 0 x V, x = 0.

SIP4 |[x,y]|2 [x,x][y,y] x,y V.

SIP5 [x,λy] = λ[x,y] λ R, x,y V.

SIP6 [y,x+εy] [y,x] for all real ε 0, x,y V.

Unlike the standard inner product, [ , ] is not symmetric: in general [x,y] = [y,x]. Suppose 0 S then 0 S . Suppose further that the principal curvatures of S and S are all non-zero (l S and l S being strictly proper guarantee that). Then σS and σS are in C2 (Schneider, 2014, page 115). Let GS := 1

2Hγ2 S , where H denotes the hessian: for f : Rn R, H f(x) := D((D(x)) ) (Magnus and Neudecker, 1999). Following Giles (1967) (who assumed γ was additionally a norm, and thus symmetric, an assumption which we drop) for x,y Rn we define

[y,x]S := y GS(x) x.

Proposition 28 Suppose S satisfies the assumptions above. Then [ , ]S is a semi inner product.

Proof Since σS is convex, we have HσS is positive semidefinite (in fact it has only one zero eigenvalue under the assumptions above (Schneider, 2014, page 118)). By properties of support functions we have that for all x

DγS (x) x = DσS(x) x = σS(x) = γS (x).

Furthermore, for all x HγS (x) x = HσS(x) x = 0n.

By the product rule, we thus have for all x

1 2Hγ2 S (x) = 1

2D((Dγ2 S (x)) ) = D(DγS (x) γS (x)) = DγS (x) DγS (x)+γS (x)HγS (x). (41)

Hence for all x

1 2Hγ2 S (x) x = (DγS (x) DγS (x)+γS (x)HγS (x)) x (42)

= DγS (x) DγS (x) x

= DγS (x) γS (x),

11. Semi-inner-products have been used previously in machine learning; see e.g. (Zhang et al., 2009; Der and Lee, 2007).

WILLIAMSON AND CRANKO

and consequently for all x

2Hγ2 S (x) x = x DγS (x) γS (x) = γS (x) γS (x) = γ2 S (x).

Since 0 S , γS (x) 0 for all x. Thus from (41) we see that for all x, GS(x) is the sum of two positive semidefinite matrices γS (x)HγS (x) and a DγS (x) DγS (x). The positive definiteness of the rank one second term follows from the fact its only non-zero eigenvalue is DγS (x) DγS (x) and since γS (0) = σS(0) = 0, by positive homogeneity, and by convexity, for all x, DγS (x) 0 (elementwise); consequently DγS (x) DγS (x) 0. Since λmin(A+B) λmin(A) for positive semidefinite A and B (L utkepohl, 1996, 9.12.2(7)), we conclude that GS(x) = 1

2Hγ2 S (x) is positive semidefinite for all x. For arbitrary x,y,z Rn, λ R and ε 0, we have

[x+y,z]S = (x+y) GS(z) z = x GS(z) z+y GS(z) z = [x,z]S +[y,z]S [λx,y]S = (λx) GS(y) y = λ(x GS(y) y) = λ[x,y]S [x,x]S = x GS(x) x 0 since GS(x) is positive semidefinite for all x

[y,x]S = y GS(x) x = y DγS (x) γS (x) γS (y)γS (x) = [x,x]

1 2 S [y,y]

1 2 S [x,λy]S = x GS(λy) λy = x GS(y) λy = λ[x,y]S [y,x+εy]S = x GS(x+εy) (x+εy) y GS(x) x since γS C2,

demonstrating that [ , ]S satisfies axioms SIP1 SIP6, where the antepenultimate line used the fact that y DγS (x) γ(y), and the penultimate line follows from γS being 1-homogeneous, implying γ2 S is 2-homogeneous and so by Euler s theorem, Hγ2 S is 0-homogeneous.

Observe that by (42), we have

[y,x]S = y DγS (x) γS (x) = γS (x)DγS (x) y = σS(x)DσS(x) y = τ S(x,y), (43)

by (40). One can reparametrise [ , ]S as follows. By (Schneider, 2014, Page 55), we have 1

where ( ) is the Legendre-Fenchel conjugate (11). Combining this with the result from (Seeger, 1992) that (H f(x)) 1 = H f (y), where y = (D f)(x) and x = (D f) 1(y), we can write

1 2Hγ2 S (u) = 1

2(Hγ2 S(x)) 1,

where u = (1

2Dγ2 S(x)) and x = (1

2Dγ2 S) 1(u). Observe that if A = Hγ2 S is 0-homogeneous, then it is easy to see that A 1 is too. Combining (43) and (42) we have

[y,x]S = y 1

2Hγ2 S(x) 1 x.

The semi-inner product [x,y]S induces a Finslerian geometry, a generalisation of Riemannanian geometry, where the norm varies throughout the space (as in the Riemannian case) but the unit balls

GEOMETRY AND CALCULUS OF LOSSES

are not necessarily ellipsoids (as in the Riemannian case). Rund (1959, page 16) notes that the metric of a Finsler space may be regarded as being locally Minkowskian just like Riemannian geometry without the quadratic restriction (Chern, 1996). For the situation of interest in the present paper (where we work with concave, rather than convex, gauge and support functions), we can mimic the above development to consider anti semi inner products | , | . We merely need replace the convex gauge γ by the concave gauge β, and the convex support function σ by the concave support function ρ and to reverse the inequality in the fourth axiom to read

SIPb4 | |x,y| |2 |x,x| |y,y| , x,y V .

We keep the other axioms the same. Define

|y,x| S := y b GS(x) x,

where b GS := 1

2Hβ 2 S . Then by a similar argument to the above, we have that |y,x| S is indeed an anti semi inner product. The only change is working with negative semidefiniteness instead of positive, and the use of the reverse inequality DρS y ρS (x), which is a rephrasing of the result with loss functions that L(x,y) L(x). Following the argument in (Lumer, 1961), but reversing the inequalities, we have that |x,x| 1 2 is a concave gauge function. Translating to our loss notation we can write

|y,x| S = LS(x)LS(y,x), (44)

which can be seen to be a weighted conditional risk. Utilising the formula for Bregman divergences (39) and by analogy with (43), we have (writing BS := BLS) that |y,x| S = ρS(x)Dρ(x) y = LS(x) l S(x),y = LS(x)LS(y,x), and thus

L(x)BS(y,x) = LS(x)[LS(y,x) LS(y)]

= LS(x)LS(y,x) LS(x)LS(y)

= |y,x| S p

|x,x| S |y,y| S.

Expressing loss-theoretic quantities in terms of | , | S and hence b GS may also be conceptually valuable, but we defer further investigation of this. The normalisation that naturally arises in (44) has not, to our knowledge, arisen previously. This does suggest that Riemannian geometry, the traditional foundation of information geometry (Amari, 2016), is not quite the right fit for the geometry of losses, and we instead need the richer, locally Minkowskian (Chern, 1996), notion of Finslerian geometry (Rund, 1959).

3.4 The Antipolar Loss

A loss function lmaps a distribution p Rn >0 to a loss vector l(p) Rn 0. Given l(p), one might ask if we can recover p? This problem arises naturally in a variety of settings (Vovk, 2001; Gneiting and Katzfuss, 2014). In practice it might be difficult to find or even show the existence of an inverse loss, but in light of Remark 26 we can show the existence of, and suggest several ways to calculate, a pseudoinverse, l . For reasons that will become clear, we call the function l the antipolar loss. The antipolar loss provides a universal substitution function (Kamalaruban et al., 2015) for the Aggregating Algorithm (Vovk, 2001, 1995, 1990). The substitution function needs to map an

WILLIAMSON AND CRANKO

arbitrary superprediction x spr(l) to a prediction p such that l := l(p) dominates x in the sense that l Rn 0 x (that is, pointwise inequality not incurring more loss under each y Y ). Explicitly stating a substitution function is the primary difference between the (unrealisable) aggregating pseudo-algorithm and the aggregating algorithm (Vovk, 2001). Determining the substitution function even for simple cases can be difficult (Zhdanov, 2011). We show below that by making use of antipolars, we can determine such substitution functions; see (Friedlander et al., 2014; Aravkin et al., 2018) for further uses of antipolarity.

Proposition 29 Let l: Rn >0 Rn 0 be a proper loss. There exists l : Rn >0 Rn 0 with l ρ(spr l) that satisfies

p Rn >0, l(p) = (l l l)(p) and x Rn >0, l (x) = (l l l )(x). (45)

Furthermore l is a proper loss.

Proof Let S := spr(l). Since the subdifferential of a support function is 0-homogenous (Proposition 3) applying Lemma 10 we have p Rn >0,

l(p) ρS(p) l(p) ρS (l(p))

ρS (l(p)) ρS(p)

L10 p ρS(p) ρS (l(p) ρS (l(p))) = ρS (l(p)).

Let us choose l to satisfy l (l(p)) = p 1/ρS(p). This defines l : Rn >0 Rn 0. We can then extend l to Rn 0 Rn 0 using the argument of Remark 22. Note that l(l (l(p))) = l p 1/ρS(p) = l(p). For the second claim regarding l , exchange the roles of land l in the above argument and apply Proposition 3. We now argue that l is proper. By construction l (l(p)) = p/ρS(p) ρS (l(p)). Let q = l(p). Then l (q) ρS (q). Proposition 21 then implies that l | is proper as long as S P(Rn 0) which we will now show. We have

P(Rn 0) = {S K(Rn 0 \{0})| rec(S) = Rn 0}.

By the observation following (13), the map S 7 S takes closed shady sets to closed shady sets and S S(X+) S S(X +). Furthermore S is convex. (Suppose x 0,x 1 S and for some λ (0,1), let x λ := λx 1 +(1 λ)x 0, then it is straightforward to check that x λ S from the definition of S .) Finally we have that

rec(S ) = {d Rn | x S, x ,x 1 x +d,x 1}

= {d Rn | x S, x ,x 1 x ,x + d,x 1}

= {d Rn | x S, d,x 1}

= {d Rn | x S, d,x 0}

x S {d Rn | d,x 0}

where the last line follows from the fact that rec(S) = Rn 0.

GEOMETRY AND CALCULUS OF LOSSES

(l log llog)(p)

p1, l(p;y1)

p2, l(p;y2)

FIGURE 6: Illustration of Proposition 29 using llog (Table 3) with Y = [2]. Upon predicting p , we receive the loss vector llog(p). Evaluating the antipolar loss l log at llog(p) we have the point (l log llog)(p) = α p for some α > 0. However, llog is 0-homogeneous (Proposition 21) and so (llog l log llog)(p) = llog(p) for all p R2 >0.

Proposition 29 is illustrated with llog in Figure 6, which should be compared with the work of Shephard (1953, pg. 23) which was the inspiration for the argument regarding antipolar losses in the present paper12. More generally, it follows from Lemma 10 that if lis a mapping ri(X+) X + with spr(l) P(X+) then l is a mapping ri(X + ) X+ with spr(l ) P(X + ). The pseudoinverse property of (45) can be expressed using the notion of the direction of a vector as in footnote 9, allowing us to write for all p Rn >0, dir(l l)(p) = dir p.

12. This has become known as Shephard s duality theorem in the economics literature (Shephard, 1970; Jacobsen,

1972; Mc Fadden, 1978; Hanoch, 1978; Cornes, 1992; F are and Primont, 1994, 1995; Penot, 2005; Z alinescu, 2016) and appears in standard microeconomics texts (Varian, 1978). Shephard s development of dual theory in economics in his 1953 book (Shephard, 1953) was described as one of the most original contributions to economic theory of all time (Jorgenson, 1981) due to three key ideas:

1. The duality between cost and production functions (essentially polar duality of concave gauge functions);

2. Shephard s lemma (Varian, 1978, page 74), (Mas-Collel et al., 1995, Page 141): essentially the result (Schneider,

2014) that the subgradient of a support function (in economics terminology, cost function evaluated at some fixed price of input vectors) σ(x) is the support set ( conditional factor demand correspondence evaluated at the same price vector), and furthermore if σ is differentiable at x, the support set is a singleton;

3. Homotheticity: essentially that the key functions of the theory are a composition of a positive monotone increasing scalar function and a positively homogeneous function of several variables.

The economic theory tends to obscure the simplicity of concave gauge duality because of the need to parametrise families of sets (either by the vector of inputs available to a firm or the vector of outputs) and the adoption of convoluted terminology ( conditional factor demand correspondence instead of support set ). The geometry in all cases is simply that of concave gauge duality, a point explicitly recognised by Hasenkamp and Schrader (1978) in the context of aggregation problems arising in production economics.

WILLIAMSON AND CRANKO

Proposition 30 Let l: Rn >0 Rn 0 be a proper loss. Then ρspr(l) is strictly concave if and only if l is strictly proper.

Proof This follows immediately from Corollaries 12 and 23.

Corollary 31 If ρspr(l) is strictly concave then for any function m that satisfies m ρ(spr l) we have m = l .

Proof It follows that spr(l) is strictly convex (Corollary 12), thus ρspr(l) is always a singleton (Corollary 5). Thus there is only one possible selection m with m ρ(spr l) .

Remark 32 If l is the antipolar of the loss lthen clearly so is the family (αl )α>0 since as lis 0-homogeneous l (αl ) = l l .

There are three ways the antipolar loss l can be computed given a proper loss l(cf. 2.7): we can

1. take the superdifferential of the concave support function of the antipolar superprediction set l7 ρ(spr l) l ;

2. compute the antipolar of the associated conditional Bayes risk function and superdifferentiate L 7 L l ; or

3. solve the optimisation problem in

l (p) arginf x Rn 0 l(p),x 1 .

The complete set of antipolar loss function relationships is presented in Figure 7. The notion of the antipolar loss and its relationship to the concave polar of the superprediction set provides conceptual insight. Furthermore, at least in some cases one can determine the inverse in closed form (see equation 46 in 4.1, as well as the other examples in 4.2 and 4.3).

3.5 Convexifying Proper Losses: The Canonical Link

All proper losses lhave convex superprediction sets, but that does not imply that the partial functions li = l( ;i) are convex for all i [m] (Reid and Williamson, 2010; Vernet et al., 2016). However, such proper losses with non-convex partial losses can be made convex by reparametrisation. A composite proper loss l ψ 1 is the composition of a proper loss l and an (inverse) link function ψ 1 that reparametrizes the loss (Reid and Williamson, 2010; Vernet et al., 2016). The aforementioned papers studied such links using the tools of differential calculus. We will now show that the geometric perspective of the present paper, along with the properties of antipolar losses, allows a simpler proof of the fact that there is always a special link function which ensures the resulting composite loss is a convex function. This link function is called the canonical link in (Reid and Williamson, 2010) (binary case) and (Vernet et al., 2016) (general multiclass case); as we

GEOMETRY AND CALCULUS OF LOSSES

lev 1 ρS lev 1 ρS lev 1 ρS

x 7 infq =0 x,q

L(q) p 7 infx =0 x,p L (x) p 7 infx =0 x,p L (x)

spr(l ) l L

l ρS x 7 x, l (x)

l ρS p 7 l(p), p

spr(l ) l L

l ρS x 7 x, l (x)

p 7 l(p), p

FIGURE 7: Illustration of the relationships superprediction sets S, associated loss functions l, and conditional Bayes risks L, along with their antipolar counterparts. (The diagram, which is a finite graph, has been unrolled to make reading easier.)

shall see below, the canonical loss is indeed the composition of the loss with its associated antipolar loss. We first need some additional notions (Pennanen, 1999; Gissler and Hoheisel, 2022). Suppose X and Y are sets, and K Y is a convex cone, and f : X Y with dom f convex. Recalling from 2.4 the ordering K, we say f is K-convex if for all x0,x1 X, and all α (0,1),

f(αx1 +(1 α)x0) K α f(x1)+(1 α)f(x0).

That is, f(αx1+(1 α)x0) α f(x1)+(1 α)f(x0) K. The K-epigraph of a function f : X Y is

epi K f := {(x,y) X Y | f(x) K y}.

For functions f : X R, the traditional epigraph epi f corresponds to epi K f with K = R 0. Analogous to the result for the traditional epigraph that f is convex iff epi f is, we have (Jahn, 2011, Lemma 14.8):

Lemma 33 Suppose f : X Y and dom f is convex, and K Y is a cone. Then f is K-convex if and only if epi K f is a convex subset of X Y.

Suppose f : Rn Rn, and write f(x) = (f1(x),..., fn(x)). Let K = Rn 0. Then f is K-convex if

f(αx1 +(1 α)x0) α f(x1) (1 α)f(x0) Rn 0 i [n], fi(αx1 +(1 α)x0) α fi(x1) (1 α)fi(x0) ( ,0]

WILLIAMSON AND CRANKO

i [n], fi(αx1 +(1 α)x0) α fi(x1)+(1 α)f(x0)

i [n], fi is convex.

Hence (confer (Gissler and Hoheisel, 2022, Section 4.4.4)) f is component-wise convex:

Lemma 34 f : Rn Rn is Rn 0-convex iff fi : Rn R is convex for all i [n].

Let l:= l l denote the canonical loss induced by composing an arbitrary proper loss lwith its associated antipolar loss l . Proposition 29 implies there exists a function γ l: Rn 0 R>0 such that for all x Rn 0, l(x) = γ l(x)x. We can now prove the following result directly, without the need for differential calculus as used in (Vernet et al., 2016, Corollary 32 et seq.).

Theorem 35 Suppose lis a proper loss. Then the canonical loss lis component-wise convex.

Proof By Lemmas 33 and 34, lis component-wise convex if epi K lis convex, with K = Rn 0, which we now proceed to show. Write γ = γ l. Then

epi K l= {(x,y) Rn 0 Rn 0 | l(x) K y}

x Rn 0 {(x,y)|y Rn 0, l(x) K y}

x Rn 0 {(x,y)|y Rn 0, γ(x)x K y}

x Rn 0 {(x,y)|y Rn 0, x K y/γ(x)}

x Rn 0 {(x,γ(x)y )|y Rn 0, x K y }

x Rn 0 {x} γ(x)({x}+Rn 0)

x Rn 0 γ(x)({x}+Rn 0)

x Rn 0 γ(x){y Rn 0 |x K y}

x Rn 0 {y Rn 0 |x K y/γ(x)}

x Rn 0 {y Rn 0 |γ(x)x K y}

x Rn 0 {y Rn 0 | l(x) K y}

l l( ) {y Rn 0 |l K y}

= Rn 0 spr l,

GEOMETRY AND CALCULUS OF LOSSES

by (29). By (Hiriart-Urruty and Lemar echal, 2001, Proposition A.1.2.3), the Cartesian product of convex sets is convex. Since Rn 0 and spr lare both convex, we have thus shown that epi K lis also convex, concluding the proof.

3.6 The Naturalness of our Setup, and its Advantages

The above development presumes a particular form for S, namely that it is in the class P(Rn 0). This assumption is necessary for our proofs, but is it reasonable? Furthermore, the definition of a loss function as the subgradient of ρS is rather unusual. What advantage does it have? Finally, the introduction of the link function seems to just complicate matters even further. What additionality does it bring? In this brief subsection we provide succinct answers to these natural questions. The choice of P(Rn 0) is indeed simply justified by the results obtainable: if S is not convex, then ρS = ρco S so there is nothing lost in assuming convexity. Since we only use the loss via its average (the risk) we are thus only interested in supporting hyperplanes with normals in the positive orthant; thus the assumption on the recession cone simply allows unbounded losses to exist, and ensures they are bounded on the relative interior of the simplex.13 Some advantages of defining the loss as ρS will be elaborated in the remainder of the paper (in terms of designing losses), but it also allows a direct connection to the question of mixability of a loss, which is defined in terms of the geometry of S (van Erven et al., 2012), as well as to the dual theory of production economics (as elaborated in footnote 12). Without the inherent geometrical structure, it seems unlikely one would have stumbled across the notion of an antipolar loss. Finally, the representation of a general (not necessarily proper) loss l, such that spr l P(Rn 0), as l= λ ψ 1 (where λ is proper, and ψ 1 is an inverse link function) allows a very clean separation of concerns: the statistical properties of the loss are controlled by λ, and the convexity of the partial losses li, i [n] is controlled by ψ; confer (Vernet et al., 2016).

4. Examples of the Antipolar Loss

We now present some examples of the antipolar loss for some well-known standard loss functions.

4.1 Concave Norm Losses

In general the calculation of antipolars of superprediction sets, or equivalently the antipolar loss may be difficult to achieve in closed form. However, analogous to the case of classical lp gauges (norms) with the duality property that B p = Bq with 1

q = 1, there is a parametric family of antigauges which has an attractive self-closure property with respect to taking antipolars. Following Barbara and Crouzeix (1994), we define βa : Rn 0 R for a [ ,1]\{0} as follows

β (p) := min y Y py

( y Y pa y 1/a p Rn >0 0 p bd Rn 0, a ( ,0)

13. For a much more careful treatment of unbounded losses, see (Waggoner, 2021).

WILLIAMSON AND CRANKO

!1/a a (0,1].

Barbara and Crouzeix (1994) showed for all a [ ,1] \ {0} that βa is indeed an antigauge and furthermore

b = 1 β a = βb. (46)

Note that if a (0,1] then b [ ,0), and if a [ ,0) then b (0,1]. Thus no antigauge in the family {βa |a [ ,1]\{0}} can be its own antipolar (unlike the classical result that B 2 = B2). The family of antigauges {βa |a [ ,1] \ {0}} can be used to define a family of proper losses on n outcomes. Since βa is an antigauge, it can be written either as the antigauge of the set lev 1(βa), or using polar duality as the concave support function of the set lev 1(β a ), which is the convention we follow below. In order to find the associated loss function we need la such that spr(la) = lev 1(β a ). The self antipolar property (46) makes this easy. Since β a

a 1 is differentiable on the interior of its domain we have

a ( ,1)\{0}, p Rn >0, la(p) = β a

a 1 (p) = exp 1 β a

a 1 (p) p; 1 a 1

where the exponentiation of vectors is defined component wise, exp(p; a) := (pa 1,..., pa n). Applying the same procedure to the antipolar loss we have

a ( ,1)\{0}, p Rn >0, l a (p) = βa(p) = exp 1 βa(p) p; a 1 .

There are two special values of a [ ,1]\{0} worth mentioning: When a = 1, b = and βb is no longer differentiable, but it is superdifferentiable, with

p Rn >0, l0/1(p) β (p).

When a = , b = 1 and β1(p) = 1n. Thus

p Rn >0, l (p) := 1n β1(p),

the constant loss. Note that l = l 0/1. The closure of the family (la)a [ ,1]\{0} under the antipolar operation is illustrated in Figure 8.

Example 1 Misclassification loss is not strictly proper and so ρspr(l0/1)(p) will not be a singleton for all p . This poses a problem for calculating l 0/1, since the antipolar superprediction set subdifferential ρspr(l 0/1)(p) will not be a singleton. However, we can use the family (la)a [ ,1]\{0} to approximate l0/1, and therefore approximate the antipolar. That is we can come arbitrarily close to obtaining the pair (l0/1(p), l 0/1(p)) with the sequence (la(p), l a

a 1 (p))a<0 for p Rn >0. The pointwise limit is illustrated explicitly in Figure 9.

GEOMETRY AND CALCULUS OF LOSSES

1/4 1/2 3/4 1

βα(p1,1 p1)

(a) The functions (βa)a [ ,0) are coloured from blue to white, and their antipolar counterparts, (βa)a (0,1] (β a )a [ ,0), are coloured from white to green. As a 1, la l0/1, and as a , lα 12, the constant loss.

p1, l(p;y1)

p2, l(p;y2)

l 3 l3/4(p)

(b) The loss l3/4 and its antipolar l 3 acting on the vector p := (1/3,2/3).

FIGURE 8: The concave norm conditional Bayes risk functions βa and losses la, and an illustration of self-polarity of the family {la}a [ ,1]\{0}.

l 0/1 l0/1(p)

p1, l(p;y1)

p2, l(p;y2)

FIGURE 9: Illustration of Example 1 with p := (1/3,2/3). We can simultaneously approximate misclassification loss, l0/1, and its antipolar, l 0/1, over using the concave norm losses.

WILLIAMSON AND CRANKO

4.2 Brier Loss

The Brier score (Brier, 1950) is usually defined for p in terms of its conditional Bayes risk (van Erven et al., 2012, Section 5), but for our purposes we need to work with the 1-homogeneous extension to Rn 0:

p Rn 0, ρspr(l Br)(p) =

( p 1 p 2 2 p 1 p Rn 0 \{0}

0 p = 0. (47)

Indeed we have l Br ρspr(l Br):

p Rn >0, ρspr(l Br)(p) = ( 1)(p) 1

p 1 ( 2 2)(p)+ ( p 2 2 1)(p)

= ( 1)(p) 1

p 1 2p p 2 2 ( 1)(p)

p 1 + p 2 2 p 1

1+ p 2 2 p 2 1

1n 2 p p 1 ,

where to compute the subdifferential of (47) we used the concave subdifferential quotient rule (Mordukhovich and Shao, 1995, Theorem 5.2). Thus

p Rn >0, l Br(p) =

1+ p 2 2 p 2 1

1n 2 p p 1 .

Example 2 When n = 2 we can determine the Brier loss antipolar explicitly. Since we know ρ spr(l Br) is 1-homogeneous, it suffices to evaluate it on the 2-simplex and then 1-homogeneously extend it. Parametrising an element p as p = (p1,1 p1) it follows that

ρspr(l Br)(p) = inf q =0 p,q ρspr(l Br)(q) (48)

= inf q cl q1p1 +q2(1 p1)

1 (q2 1 +q2 2)

= inf 0 q1 1 q1p1 +(1 q1)(1 p1)

1 q2 1 (1 q1)2 .

This can be computed directly, resulting in

p , ρspr(l Br)(p) = f(p1) := (2p1 1)2p

p1(1 p1) 4p1 ,

and thus for p = α(p1,1 p1) R2 0, ρspr(l Br)(p) = α f(p1)14. The Brier loss and its polar are illustrated in Figure 10.

14. The explicit form of the loss itself can be obtained by differentiation of ρspr(l Br)(p) and restriction to the simplex. Although little insight seems gleanable from the formula, we present it for completeness. We have for all q [0,1],

l Br,1(q) = (2q 1) 2q2

q(1 q) 2q2(1 q)+3q(1 q) (q+1)

q(1 q) 2q 2

GEOMETRY AND CALCULUS OF LOSSES

(l Br l Br)(p)

p1, l(p;y1)

p2, l(p;y2)

FIGURE 10: Brier loss and its antipolar acting on the vector p := (2/5,3/5).

It does not seem possible to find a closed form for l Br when n > 2. However the objective function in (48) can be seen to be quasi-convex in p (since ρspr(l Br) is concave and positive and thus 1/ρspr(l Br) is quasi-convex) and therefore is amenable to numerical solution.

4.3 Cobb-Douglas Loss

As a final example, let a Rn 0 and consider the parametrised superlinear function

Rn p 7 ψa(p) :=

( i [n] p ai/ a 1 i p Rn 0 otherwise.

Barbara and Crouzeix (1994) show that ψa is self-polar (cf. Remark 32) in the sense that

a Rn 0, p Rn >0, ψ a(p) = a 1

ψa(a)ψa(p). (49)

The function ψa can be seen to be the form of the Cobb-Douglas production function (Cobb and Douglas, 1928), the self-duality of which has been an object of considerable interest in microeconomics (Houthhakker, 1965; Samuelson, 1965; Sato, 1976)15. In order to find the associated loss function we need l CDa such that spr(l CDa) = lev 1(ψ a). The self polar property (49) makes this easy since

l CDa ψ a l CDa a 1

l Br,2(q) = (2q 1) 2q2

q(1 q)+2q2(1 q) 3q

q(1 q)+q(1 q)

q(1 q) 2q 2 .

15. It would be of interest to determine other self-dual losses using the results of (Houthhakker, 1965; Samuelson, 1965;

Sato, 1976) and to ascertain the significance (if any) of the self-dual nature of the boosting loss (example 3). Observe that for all a Rn 0 and all p Rn >0, (l CDa l CDa l CDa)(p) = l CDa(p), a fact one can verify directly by using (49).

WILLIAMSON AND CRANKO

Writing the quotient of vectors componentwise, a

p := (a1/p1 ,...,an/pn), since ψa is differentiable on its domain we have

p Rn 0, l CDa(p) = a 1

ψa(a)( ψa)(p) = a 1

a 1 = ψa(p)

ψa(a) a p. (50)

Applying the same procedure to the antipolar loss we have

p Rn 0, l CDa(p) = ( ψa)(p) = ψa(p)

Example 3 We illustrate the self-duality of l CDa with a simple example. Set n = 2 and a1 = a2 = 1 and thus ψa(x) = x1x2, ψa(a) = 1, and a 1 = 2. Restricting to ([2]), and writing p = (p1, p2) ([2]), from (50) we have

p ([2]), l CDa(p) =

which can be recognised as the boosting loss (Buja et al., 2005). This loss has as its weight function (Reid and Williamson, 2011) w(p1) := 2

p2 1 ρ(p1,1 p1), where ρ = ψa is the concave support function of spr l CDa. We have

w(p1) = 1 (p1(1 p1))3/2 .

Using (51) to calculate the antipolar loss l CDa we have

l CDa(p) = p1p2

2 l CDa(p).

The superprediction sets associated with the loss l CDa and its antipolar are illustrated in Figure 11.

5. Designing Losses via their Superprediction Sets

Loss functions are clearly essential for machine learning, but they are often taken for granted, their choice being primarily a consequence of convenience or familiarity. We posit that this is due in part to a lack of a tools for designing and tuning them. In this section we offer some starting points for such a tuning exercise. The conventional approach to working with loss functions is to focus on the analytic form of the mappings (p,y) 7 l(p,y). In this section we show some examples of the power of instead working with the family P(Rn 0) and deriving the associated loss functions via the subdifferential of the concave support functions.

5.1 Canonical Normalisation

One problem that presents itself when working with the family P(Rn 0) is that of normalisation. In 4 the lack of consistency of normalisation between (la), l Br and l CDa made it difficult to compare these losses side by side. Our proposed normalisation for a loss l is to pick pl Rn >0 and α > 0

GEOMETRY AND CALCULUS OF LOSSES

(l CDa l CDa)(p)

p1, l(p;y1)

p2, l(p;y2)

FIGURE 11: Illustration of the self-dual nature of the Cobb-Douglas loss l CDa with a := (1/2,1/2) and p := (7/10,3/10).

such that ρspr(αl)(pl) = 1, a task that the superprediction set machinery makes very simple. There are a couple of ways one might choose pl, the simplest is pl := 1n for all loss functions l. However, a more robust choice is pl argmaxp ρspr(l)(p). We say a loss function l is normalised if its associated conditional Bayes risk function attains a maximum value of 1. For several of our results we need a non-compact version of the Sion (1958) minimax theorem. The following result attributed to Ha (1981) is originally shown in a much more general setting and so we state it for the space X below.

Lemma 36 (Ha 1981, Theorem 2) Let X and Y each be nonempty convex subsets of X. Let f : X Y R be such that

1. For each x X, y 7 f(x,y) is lower semi-continuous and quasi-convex;

2. For each y Y, x 7 f(x,y) is upper semi-continuous and quasi-concave.

If there exists a nonempty convex set X X and a compact set Y Y such that

inf y Y sup x X f(x,y) inf y Y max x X f(x,y), (52)

inf y Y sup x X f(x,y) = sup x X inf y Y f(x,y).

Theorem 37 Let lbe a proper loss and let p argmaxp ρspr(l)(p). Then

ρspr(l)(p ) 1n βspr(l)(1n) = 1n ρspr(l )(1n).

Proof First we apply Lemma 36 to establish

max p inf z spr(l) z, p = inf z spr(l)max p z, p . (53)

WILLIAMSON AND CRANKO

For α 1/βspr(l)(1n) we have α1n spr(l), and

β 1, inf z spr(l)max p z, p inf z {βα1n}max p z, p ,

since the left hand side is finite. Thus we have demonstrated the sufficient condition for (52). Since spr(l) and are convex, we have satisfied the conditions for Lemma 36 and shown (53). Therefore

max p ρspr(l)(p) = max p inf z spr(l) z, p (53) = inf z spr(l)max p z, p .

Observe maxp z, p = maxy Y zy. A proof by contradiction easily confirms that for z ρspr(l)(p ) we have zy = zz for all y,z Y . Thus the minimising z ρspr(l)(p ) is a multiple of 1n, more precisely

z = inf{λ > 0 | λ1n spr(l)} 1n = 1n 1 βspr(l)(1n) ,

which completes the proof.

By Theorem 37 we see that the maximum of ρspr(l) occurs for p such that l(p ) = α1n where α := 1/βspr(l)(1n). If we normalise a proper loss lwith the coefficient c := βspr(l)(1n), then evaluating the conditional Bayes risk at p we have

ρspr(cl)(p ) = cρspr(l)(p ) T37 = c 1n/βspr(l)(1n), p = c βspr(l)(1n) 1n, p = 1.

The following corollary demonstrates another application of the polar loss, that is the uniform loss vector 1n is achieved by the prediction that maximises the conditional Bayes risk.

Corollary 38 Let lbe a proper loss and p := l (1n)/ l (1n) 1. Then p argmax p ρspr(l)(p).

Corollary 39 A proper loss function lis normalised if and only if 1n bd(spr l).

We give the normalisation coefficients for the common losses from Table 3 in Table 4, and plot their conditional Bayes risk functions and superprediction sets in Figure 12 (the overbar denotes this normalisation). With the normalised versions of these loss functions we can now see that l CD1n is attained as the limit lima 0 la.

5.2 Shifting the Maximum

In 5.1 we saw that the conditional Bayes risk of a proper loss lis maximised over the probability simplex at p := l (1n)/ l (1n) 1 (Corollary 38). The question naturally arises then of how one might one modify l7 ˇlto reposition the maximum to an arbitrary p0 .16 That is, p 7 ˇl (1n) = p0.

16. The motivation for doing so arises from considering the cost-sensitive missclassification losses lc, c (0,1) (Reid and Williamson, 2011, Section 5.2), whose conditional Bayes risks are Lc(p) = (1 p)c (1 c)p. The maximum of Lc(p) over p occurs at c (although the maximum value does not remain constant as c varies). The corresponding losses lc allow one to impose a different cost for false positives and false negatives. Thus shifting the maximum of a general loss allows one to reweight the costs for the different types of prediction error one might make.

GEOMETRY AND CALCULUS OF LOSSES

1/4 1/2 3/4 1

ρspr( la)(p1,1 p1)

(a) Conditional Bayes risk for l CDa: graph of ρspr( la) for a = (α,α), with α > 0 and

α I := {0.9525i |i [40]} (blue); and α < 0 with α = β/(β 1), β I (green).

p1, l(p;y1)

p2, l(p;y2)

(b) Superprediction sets for l CDa for same parameter range as in (a).

1/4 1/2 3/4 1

ρspr( l)(p1,1 p1)

(c) Comparison of conditional Bayes risks for l CDa with a = (1/2,1/2) (green), l Br (red), and llog (blue).

p1, l(p;y1)

p2, l(p;y2)

(d) Comparison of superprediction sets for l CDa with a = (1/2,1/2) (green), l Br (red), and llog (blue).

FIGURE 12: Normalised loss functions.

WILLIAMSON AND CRANKO

Name Symbol Maximiser Coefficient Normalised Loss

Misclassification Error (0/1) l0/1 (1/n,...,1/n) n nl0/1(p,y)

Logarithmic llog (1/n,...,1/n) 1/log(n) 1 log(n) log(py)

Concave Norm a [ ,1]\{0} la (1/n,...,1/n) a n a n py β a

Brier l Br (1/n,...,1/n) n n 1 n n 1 (1+ p 2 2) 1n 2py

Cobb Douglas a Rn 0 l CDa (a1,...,an) a 1 ψa(a) a 1ψa(p/a2) 1

TABLE 4: Canonical normalisations of the loss functions from Table 3.

If one is given only lor L this is not obvious. However, the answer is simple in terms of S := spr(l). Before proceeding we note that this problem is not well posed since we have not defined what exactly we are hoping to retain of the original function l. That said, our construction entails more or less the minimum required perturbation of S in order to endow ˇlwith the necessary desiderata. To solve this question we construct a new super prediction set ˇS from S and define ˇl ρ ˇS. The family P(Rn 0) is a cone since it is closed under positive scalar multiplication and addition of sets from the family. Thus if we construct a mapping S 7 ˇS := αS+x , where α > 0 and x Rn 0 we can easily ensure ˇS P(Rn 0). In order to move the maximiser from p to p0 it suffices to translate the set S by l(p ) l(p0). However l(p0) / Rn 0, and so we push the vector l(p0) into Rn 0 by adding just enough of the constant loss: α 1n, where α := maxy Y l(p0,y). This has a neutral effect on the argmax. We now have

ˇS := 1 β(l(p0)) S+ l(p ) l(p0)+α 1n , (54)

where the term 1/βS(l(p0)) normalises ˇS so that maxp βS(p) = maxq β ˇS(q). The normalisation can easily be calculated using the identity l(p0) = βS(l(p0)) l(p ). Using the calculus of support functions,

ρ ˇS = 1 βS(l(p0)) ρS + l(p ) l(p0), +α 1 ,

and ˇl(q) = 1 βS(l(p0)) l(q)+ l(p ) l(p0)+α 1n . (55)

Example 4 We will now apply the operations (54) and (55) to llog with Y := [2] and p0 := (1/4, 3/4). We know ρspr(llog) achieves its maximum at the uniform prediction: l (1n) = (1/2, 1/2). To shift the maximum to p0 we define the new loss p ,

ˇllog(p) = 1 βspr(llog)(p0)

llog(p)+ llog(p ) llog(p0) (max y Y llog(p0;y )) 1n

GEOMETRY AND CALCULUS OF LOSSES

0 1/4 1/2 1

ρS(p1,1 p1), ρ ˇS(p1,1 p1)

(a) The effect of llog 7 ˇllog on the conditional Bayes risk function. The original conditional Bayes risk is dashed.

spr( ˇllog)

bd(spr llog)

l(p ) = l(p0) p0

p1, l(p;y1)

p2, l(p;y2)

(b) The corresponding superprediction set operation.

FIGURE 13: Illustration of Example 4, shifting the maximum of the conditional Bayes risk function associated with llog. The modified conditional Bayes risk ρspr( ˇllog) attains its maximum at (1/4,3/4).

= 1.7783 llog(p)+ llog(1n/2) llog(p0)+log(4) 1n

The effect of ρspr(llog) 7 ρspr( ˇllog) is illustrated in Figure 13a. The corresponding superprediction set operation is illustrated in Figure 13b.

5.3 Building Losses From Norms

In 2.7 we looked at norms as gauge functions, and saw that the antigauge functions naturally give rise to the notion of an antinorm. In 3 we saw that these antinorms are precisely the conditional Bayes risk functions. In this section we will see that there is a natural injection or family thereof between the symmetric radiant sets and the shady sets. In doing so we define a new family of bounded, proper loss functions: the norm losses. Recall ( 2.4) X+ X denotes a salient, closed, convex cone, and X + denotes its dual cone. The following result provides a means to take a symmetric radiant set to generate a superprediction set for a proper loss.

Theorem 40 Let R R(X) be symmetric. Choose x X+ with x T r R X+(X++r) and x / R. Then

1. R+x+X+ X+, and

2. R+x+X+ P(X+).

Corollary 41 Let x T r R(X+ +r) then R+αx+X+ P(X+), where α > 1.

WILLIAMSON AND CRANKO

B2 +(1+2 1/2)12

p1, l(p;y1)

p2, l(p;y2)

FIGURE 14: Illustration of Theorem 40 using B2. We translate B2 northeast using the vector x := (1+2 1/2) 12. This ensures S2 := B2 +(1+2 1/2) 12 +R2 0 R2 0.

Proof From the definition of the dual cone, this means

R+x X+ x X + , r R, r +x,x 0. (56)

Minimising the inner product in (56) we have

x X+, r R, r +x,x inf z X +\{0}min r R r +x,z

= inf z X +\{0}

z/γR(z),z + x,z

= inf z X +\{0}

x z/γR(z),z . (57)

We exclude 0 from X + since z = 0 satisfies (56) trivially. And minr R r,z = z/γR(z),z follows because since R is symmetric and convex, the maximum occurs at z/γR(z),z . Let us now choose x X+ as described in the theorem. Then

r R X + (X+ +r) r R X + , x r X+

= r bd(R) X + , x r X+

z X + \{0}, x z γR(z) X+

which gives

inf z X +\{0}

x z/γR(z),z 0. (58)

GEOMETRY AND CALCULUS OF LOSSES

x X + , r R, r +x,x (57) inf z X +\{0}

x z/γR(z),z (58) 0.

Thus by (56), R+x X+. A closed convex cone is its own recession cone (Proposition 1(4)), which proves claim 1. The only condition for R+x+X+ P(X+) that is non-trivial to show is R+x+X+ 0. As before assume x X+ is chosen according to the conditions of the theorem. Then R x. Since x X+, let x0 := x 1/γR(x) and x1 := x (1 1/γR(x)) and x0,x1 X+. Then x = x0 +x1 and x0 bd(R). Since R is radiant x0 = 0.

x0 bd(R) 0 bd(R) x0.

By the symmetry of R, this is eqivalent to

0 bd(R)+x0 x1 bd(R)+x0 +x1 0 / bd(R)+x,

which implies R+x+X+ 0, and claim 2 is proved.

In light of Theorem 40 it might be surprising to note that there is no obvious operation S(Rn) R(Rn). The long flat portions of the set S2 S(R2) in Figure 14 make it easy to see how we might reconstruct B2 by translating S2 and forcing the resulting set to be symmetric. But there is no such simple answer for the superprediction sets of unbounded losses. See, for example, spr(llog) (Figure 7) and spr(l CDa) (Figure 11).

5.4 The Norm Losses

Let (Bα)α [1, ] be the family of closed unit α-norm balls in Rn with Bα := {x Rn | x α 1}. The family (Bα) is increasing in the sense that

The point 1n satisfies 1n T r Bα(X + + r) for all α [1, ]. For each Bα take the point 1 +

1/γBα (1n) 1n = (1+n 1/α)1n and build the set Sα := Bα +(1+n 1/α)1n +Rn 0. By Corollary 41 we have Sα P(Rn 0), guaranteeing properness of the associated loss functions: l α ρSα. The set B2 along with S2 is shown in Figure 14. Our choice of construction of Sα has another convenient property:

α [1, ], ρSα(l α(1n)) = 1. (59)

That is, the family (l α)α [1, ] has the normalisation about the conditional Bayes risk from 5.1. We can derive the closed form expression for the whole family on Rn 0 as follows:

ρSα = ρBα +(1+n 1/α) 1n, +ρRn 0 = ρBα +(1+n 1/α) 1 +ρRn 0.

WILLIAMSON AND CRANKO

1/4 1/2 3/4 1

ρSα (p1,1 p1)

(a) The family (ρSα )α [1, ], coloured from blue to red.

p1, l(p,y1)

p2, l(p;y2)

(b) The loss functions l 1, l 2 and l acting on the vector p = (7/10,3/10) .

FIGURE 15: The family l α

α [1, ] smoothly varies between l0/1 + 1/2 1n and the constant loss function.

Using (7) and the fact that Bα is symmetric we have

ρBα = σ Bα = σBα = βBγ = γ,

where γ is the H older conjugate of α, that is 1/α + 1/γ = 1. Therefore

ρSα = (1+n 1/α) 1 α α 1 +ρRn 0.

The family (ρSα)α [1, ] is plotted in Figure 15a. We can compute the subdifferential of ρSα directly:

p X+ \{0}, ρSα(p) = (1+n 1/α) 1(p) p p α α 1 ,

giving us a closed form expression for l α:

p X+ \{0}, y Y , l α(p,y) := 1+n 1/α py p α α 1 .

Some special values of α include:

p X+ \{0}, y Y , l 1(p,y) = l0/1(p,y)+ 1

2 and l (p,y) = l (p;y) = 1,

where l0/1 is misclassification loss (38) and l is the constant loss (which we derived in a completely different way in 4.1). The various intermediaries like l 2 smoothly interpolate between these two extremes as illustrated in Figure 15b, where it is clear that the condition (59) is equivalent to the simpler geometric property 1n T α [1, ] bd(Sα). Finally we note the family l α

α [1, ] is clearly bounded.

GEOMETRY AND CALCULUS OF LOSSES

p1, l(p;y1)

p2, l(p;y2)

FIGURE 16: Illustration of cospr(l0/1) = spr(l ) and that l 1(p), p = miny Y py.

Example 5 We can now derive the antipolar result l 0/1 = l = l from 4.1 using a simpler geometrical argument: Consider the set spr(l ) which has the property that ρspr(l ) = 1 over Rn 0. And so we have

spr(l ) = lev 1 ρspr(l ) = lev 1( 1 +ρRn 0) = (lev 1 1) Rn 0.

These sets are illustrated in Figure 16.

6. Combining Given Proper Losses to Form New Ones

Much machine learning practice works with a small family of loss functions for the pragmatic reason that they are familiar, available, and have explicit formulas. The above development shows there is an enormous range of possible proper losses one could use, but offers no concrete way of constructing them (with explicit formulas); if one had an analytic description of a desired superprediction set, then it is clear how to cosntruct the loss function. But such a premise seems implausible. In this section, we develop a straightforward way of constructing a larger usable set of loss functions by finding ways of combining existing proper losses in a manner that guarantees the result is also a proper loss, and which provides explicit formulas for the resulting loss function and associated conditional Bayes risk (concave support function). In 5 we observed the power of defining loss functions lby directly building their superprediction sets spr(l). We also saw that the family P(X+) is a cone, and is therefore closed under the family of operations

T X+, α > 0, P(X+) S 7 αS+T P(X+). (60)

It s natural then to consider what other operations have a closure property analogous to (60) for the family P(X+). With the relationship between proper losses and S P(X+), this amounts to asking whether one can combine multiple proper losses non-additively and still be guaranteed that the result

WILLIAMSON AND CRANKO

is a proper loss; when working directly with l, it is not obvious how to ensure the resulting loss is proper17. The convex analysis literature has largely studied the closely related family R(X) (Seeger, 1990; Seeger and Volle, 1995; Gardner et al., 2013; Gardner and Kiderlen, 2018), with some results for the family S(X) (Barbara and Crouzeix, 1994; Penot, 1997; Penot and Z alinescu, 2000). We will present a general family of operations, called M-sums and dual M-sums, (Gardner et al., 2013; Mesikepp, 2016) which provide a general means by which to create new proper losses from two or more given proper losses. These M-sums provide the opportunity to smoothly interpolate between several proper losses in a variety of ways (beyond merely taking the sum)18. In 6.2 we introduce our approach, and then successively demonstrate the preservation of the convexity of the superprediction sets ( 6.3), their closure ( 6.4), and orientation ( 6.5). We then introduce the functional analog of our combination rules (which serve to combine conditional Bayes risks) ( 6.6), examine their properties in terms of support functions ( 6.7), and the effect of polar operations ( 6.8). The argument is summarised and tied together in 6.9. The epimultiplication operation is

R 2X (α,S) 7 α S :=

( αS α = 0, rec(S) α = 0.

Fix M Rm. Then the M-sum and dual M-sum operations are defined as 2X m (A1,...,Am) 7 M(A1,...,Am) := [

µ M i [m] µi Ai,

and 2X m (A1,...,Am) 7 M(A1,...,Am) := [

i [m] µi Ai.

6.1 M-Composition of Losses

Throughout this section let l1,..., lm be a sequence of proper loss functions, paired with their conditional Bayes risk functions L1,...,Lm : X+ R. Let m: Rm >0 Rm 0 be a proper loss function with the associated conditional Bayes risk M : Rm 0 R. We introduce the two functions

M(L1,...,Lm) := p 7 M(L1(p),...,Lm(p))

M(L1,...,Lm) := p 7 sup{M(L1(a1),...,Lm(am)) | a1 + +am = p},

which we call the functional M-sum and dual functional M-sum respectively.

17. This question is obviously analogous to the question of aggregation in economics; see for example (Shephard, 1970, Chapter 6). In our case, restricting consideration to proper losses makes the problem situation simpler, and a rather more comprehensive answer can be given. 18. In applying the existing theory of M-sums we have needed to extend it in two ways: we have developed the concave version (which combines shady sets rather than radiant ones), and we have developed a comprehensive duality theory. These results may be of interest in their own right. They extend and generalise a range of results in the literature on the combination of convex bodies, including (Artstein-Avidan and Rubinstein, 2017; Penot and Z alinescu, 2000; Seeger and Volle, 1995; Volle, 1998; Luc and Volle, 1997; Penot and Z alinescu, 2001; Barbara and Crouzeix, 1994; Pallaschke and Urbanski, 2013; Milman and Rotem, 2017a; Slomka, 2011; Milman and Rotem, 2017b).

GEOMETRY AND CALCULUS OF LOSSES

The functional M-sum encompasses a wide range of operations on losses. For example using Corollary 56 we can see the sum of losses jand k can be written as an M-sum using the constant loss l with N := ρspr(l ):

N(ρspr(j),ρspr(k)) j+ k.

Note that j defined in this way is not guaranteed to retain the pseudo-inverse property of the antipolar l in the sense of Proposition 29.

6.2 Constructing Superprediction Sets with M-Sums

Our approach here is organised as follows: First we show some general sufficient conditions for the operations M and M (introduced in 6) to map P(X+) to P(X+). Since we are ultimately interested in the support functions of these sets, in 6.6 we compute the (convex and concave) support functions of sets in the images of M and M. The choices of name and notation for the dual M-sum are not accidental, in 6.8 we characterise the duality relationship of M and M in terms of the polar and antipolar and present closure results for the families R(X) and S(X+) (Theorem 61 and Theorem 63), which allows us to compute the gauge and antigauge functions of sets in the images of M and M (Corollary 65). We first seek to establish closure (in the algebraic sense) of the family P(X+) with the operations M and M. In order to show this requires a number of theorems, which culminate in the d enouement Corollary 51. For ease of exposition these results are summarised in Table 5. It is necessary to introduce the Panlev e Kuratowski notion of convergence for sequences of sets (Rockafellar and Wets, 2004). Define the following classes of subsets:

N := {N N | N\N is finite} and N # := 2N.

Let (Sn)n N with Sn X be a sequence of sets. Then the inner and outer limit are

liminf n Sn := {x X | N N , n N, xn Sn, xn x}

limsup n Sn := x X N N #, n N, xn Sn, xn x .

If liminfk Ck = limsupk Ck then we say (Ck)k N converges with limit limk Ck. As one might hope, since N N # it follows that liminfn Sn limsupn Sn.

Proposition 42 Let S K(X), (µk)k N µ with µk R>0 for all k N. Then µ S = limk µk S.

Proof The only interesting case is when µ = 0, which is immediate from Lemma 13.

6.3 Convexity

The convexity of superprediction sets plays an essential role in our theory, and thus if we wish to combine multiple proper losses by combining their superprediction sets, we need to ensure the resulting set is guaranteed convex. We first need some auxilliary lemmas.

WILLIAMSON AND CRANKO

Lemma 43 Let S K(X), and α,β [0, ). Then

1. rec(α S) = rec(S);

2. α S+β S = (α +β) S.

Proof 1. Let α = 0. Then α S = rec(S). Since rec(S) is a closed cone, it is easily verified (Proposition 14) that rec(rec(S)) = rec(S). For α > 0 we have rec(α S) = rec(S). This is an immediate consequence of Proposition 1(3). Turning now to claim 2 there are three cases: neither α nor β is zero, only one of α or β is zero, or both α and β are zero. The only interesting case is the second. Let α = 0,β = 0 and we have

α S+β S = α S+rec(S) = α S = (α +β) S.

The second equality follows from Proposition 1(2) since rec(S) = rec(α S).

Lemma 44 Let I be an arbitrary index set. For families of subsets of X, (Si)i I and (Tj)j I we have T i I Si + T j I Tj T i I(Si +Ti).

Proof Let x T i I Si +T j I Tj. Then x = s+r for some points s,r where s is in every Si, and r is in every Tj. Thus x Si +Tj for all i, j I, including the pairs (Si,Tj) with j = i. Consequently x is in the intersection T i I(Si +Ti).

The main result of this subsection is the following.

Theorem 45 Let M K(Rm), and Ai K(X) for i [m]. Then the sets M(A1,...,Am) and M(A1,...,Am) are convex.

Proof Fix arbitrary x,y M(A1,...,Am). Then there are µ,ν M, such that

x i [m] µi Ai and y m j=1 νj Aj. (61)

To show M(A1,...,Am) is a convex set, we need to show tx + (1 t)y M(A1,...,Am) for all t (0,1). By virtue of (61), t (0,1),

tx+(1 t)y t i [m] µi Ai +(1 t) m j=1 νj A j

= i [m] (tµi Ai +(1 t)νi Ai). (62)

Applying Lemma 43 with S = Ai, α = tµi and β = (1 t)νi implies

i [m], tµi Ai +(1 t)νi Ai = (tµi +(1 t)νi) Ai, (63)

and thus i [m] (tµi Ai +(1 t)νi Ai) = i [m] (tµi +(1 t)νi) Ai. (64)

GEOMETRY AND CALCULUS OF LOSSES

Finally, convexity of M guarantees tµ +(1 t)ν M, and therefore t (0,1),

tx+(1 t)y (62) i [m] (tµi Ai +(1 t)νi Ai)

(64) = i [m] (tµi +(1 t)νi) Ai

µ M i [m] µi Ai, (65)

which concludes the proof that M(A1,...,Am) is convex. The proof that M(A1,...,Am) is convex is similar. Let x,y M(A1,...,Am). Then there exists µ,ν M such that x T i [m] µi Ai and y Tm j=1 νj A j. Therefore t (0,1),

tx+(1 t)y t

i [m] µi Ai

j [m] νj A j

i [m] tµi Ai

j [m] (1 t)νj A j

i [m] (tµi Ai +(1 t)νi Ai). (66)

From (63), t (0,1), \

i [m] (tµi Ai +(1 t)νi Ai) = \

i [m] (tµi +(1 t)νi) Ai. (67)

Again the convexity of M guarantees that tµ +(1 t)ν M, and mirroring (65), t (0,1),

tx+(1 t)y (66) \

i [m] (tµi Ai +(1 t)νi Ai)

i [m] (tµi +(1 t)νi) Ai

i [m] µi Ai,

which concludes the proof that M(A1,...,Am) is convex.

Proposition 46 Let M K(Rm 0 \{0}), and Ai K(X+ \{0}) for i [m]. Then

M(A1,...,Am) X+ \{0} and M(A1,...,Am) X+ \{0}.

Proof Since X+ \{0} is a cone, it is closed under addition and positive multiplication. The set M does not contain 0 Rm and since the Ai are all subsets of X+ \ {0}, for all µ M the following inclusions are immediate:

µ1 A1 + + µm Am X+ \{0} and µ1 A1 µm Am X+ \{0}.

WILLIAMSON AND CRANKO

6.4 Closure

Superprediction sets are closed by construction, so we also need to ensure that our combination rules preserve closure. First we need the following lemma.

Lemma 47 Let Ai K(X+) for i [m]. Let (µk)k N µ with µk Rm 0. Then

1. limk (µk 1A1 + + µk m Am) = µ1 A1 + + µm Am,

2. limk (µk 1A1 µk m Am) = µ1 A1 µm Am.

Proof Define the set C := µ1A1 + + µm Am and the sequence of sets Ck := µk 1A1 + + µk m Am. Likewise the set D := µ1A1 µm Am and the sequence of sets Dk := µk 1A1 µk m Am. Take an arbitrary convergent sequence (xk) x such that xk Ck. Then xk = i [m] µk i ak i with ak i Ai for i [m] and each k N. All the sequences µk i ak i have convergent subsequences (Lemma 13) so it is without loss of generality to assume that they are convergent (by passing to subsequence if necessary), and it follows that limki µki i aki i exists for each i [m] and

x = lim k1 µk1 1 ak1 1 + + lim km µkm m akm m P42 = x µ1 A1 + + µm Am,

since µk Rn 0 for all k N, and we have proven (1). Again take an arbitrary sequence (xk) x such that xk Dk. Then xk = µk i ak i for some sequences µk i ak i µk j A j for all i, j [m] and all n N. Applying Proposition 42 completes the proof of claim 2.

Our main result for this subsection is:

Theorem 48 Let M K(Rm 0 \ {0}), Ai K(X+ \ {0}) for i [m]. Then M(A1,...,Am) and M(A1,...,Am) are both closed.

Proof Take an arbitrary sequence (xk) x such that xk M(A1,...,Am). Then there exists a sequence (µk)k N with µk M so that xk i [m] µk i Ai for all k N. Assume the sequence (µk)k N is bounded. Then without loss of generality we may assume it is convergent (by passing to a subsequence if necessary) with limit µ. Since M is closed, µ M. It follows that

x = lim k xk lim k (µk 1 A1 + + µk m Am)

L47(1) = µ1 A1 + + µm Am M(A1,...,Am).

A proof by contradiction shows that the sequence (µk)k N is bounded. Assume (µk)k N is unbounded. Then we can write µk = νk µk where νk = µk/ µk for each k N. Thus (νk)k N is bounded and we may assume it is convergent with limit ν. Therefore

xk µk 1 A1 + + µk m Am xk µk νk 1 A1 + +νk m Am

= lim k xk µk lim k νk 1 A1 + +νk m Am L47(1) 0 ν1 A1 + +νm Am,

which contradicts Proposition 46 (taking M = {ν}). Thus M(A1,...,Am) is closed.

GEOMETRY AND CALCULUS OF LOSSES

Again take an arbitrary sequence (xk) x such that xk M(A1,...,Am). Then xk = µk i ak i for some sequences µk i ak i µk j A j for all i, j [m] and all n N. Clearly the sequences (µk i )k N must be bounded for all i [m]. And so, as before, without loss of generality we assume it has limit µ M. Applying Proposition 42 completes the proof that M(A1,...,Am) is closed.

Corollary 49 Let M K(Rm), Ai K(Rm) for i [m]. Assume M and each Ai for i [m] are compact. Then both M(A1,...,Am) and M(A1,...,Am) are closed.

Proof The above result follows by an almost identical proof to Theorem 48, however since M and each Ai for i [m] are closed and bounded, this rules out the above pathologies when it comes to building bounded sequences (ak i )k N with ak i Ai for all i [m].

6.5 Orientation

Superprediction sets recess to the positive orthant. Thus we also need to ensure this property is preserved under our combination rules. This is captured by the main result of this subsection:

Theorem 50 Let M Rm 0, Ai P(X+) for i [m]. Then

1. rec( M(A1,...,Am)) = X+, and

2. rec( M(A1,...,Am)) = X+.

Proof Let µ M, Bµ := i [m] µi Ai, Cµ := T i [m] µi Ai. Then

µ M, rec(Bµ) P1(8) i [m] rec(µi Ai) L43(1) = i [m] rec(Ai) P1(2) = X+. (68)

For the equivalent result for Cµ by Lemma 20, T i [m] Ai = , Proposition 1(6) implies

µ M, rec(Cµ) P1(6) = \

i [m] rec(µi Ai) L43(1) = \

i [m] rec(Ai) = X+. (69)

It follows that

rec( M(A1,...,Am)) = rec( [

µ M Bµ) P1(7) [

µ M rec(Bµ) (68) X+, (70)

rec( M(A1,...,Am)) = rec( [

µ M Cµ) P1(7) [

µ M rec(Cµ) (69) = X+. (71)

The reverse inclusion is shown by contradiction. Suppose the inclusions in (70) and (71) are all strict. Then there exists

d rec( M(A1,...,Am)) where d / X+. (72)

WILLIAMSON AND CRANKO

(Ai)i [m] M M(A1,...,Am) M(A1,...,Am)

Theorem 45 Ai K(X) M K(Rm) convex convex Theorem 48 Ai K(X+\{0}) M K(Rm 0\{0}) closed closed Theorem 50 Ai P(X+) M Rm 0 X+-oriented X+-oriented Proposition 46 Ai P(X+) M Rm 0 \{0} subset of X+ \{0} subset of X+ \{0} Corollary 51 Ai P(X+) M P(Rm 0) in P(X+) in P(X+)

TABLE 5: Summary of M-sum structure results.

From Proposition 1(3) this means there are sequences (tk)k N 0, tk (0,1], and (xk)k N, xk rec( M(A1,...,Am)) such that limk tkxk = d. From the definition of M(A1,...,Am) there must be a sequence (µk)k N M such that

xk µk 1 A1 + + µk m Am.

By assumption we have Ai rec(Ai) = X+ for i [m], which implies µk i Ai X+ for i [m]. Since X+ is a cone we have

tkxk i [m] tkµk i Ai X+.

By hypothesis X+ is closed, and therefore contains limk tkxk, giving us a contradiction in (72). Thus equality holds throughout in (70). By an identical argument, mutatis mutandis, applied to M we have equality throughout in (71), proving claim 2.

The collection of results amassed in this section is summarised in Table 5, collectively they imply:

Corollary 51 Let M P(Rm 0). Then M maps from P(X+)m to P(X+), and M maps from P(X+)m

With Corollary 51 we see that the family of superprediction sets of proper losses P(Rn 0) is closed under the the M and M operations. Corollary 51 is illustrated in Figure 17.

6.6 The Functional M-Sum

While the superprediction sets are our starting point, to be able to derive proper loss functions we work with the support function of these sets. The combination rules for the sets have an analog for their corresponding support functions. We first introduce the functional M-sum and in the following subsection justify the overloading of the naming and notation. Let f1,..., fm : X R be convex functions. For convex g: Rm R define the convex functional M-sum

X x 7 g(f1,..., fm)(x) := g(f1(x),..., fm(x))

and the dual convex functional M-sum

X x 7 g(f1,..., fm)(x) := inf{g(f1(a1),..., fm(am)) | a1 + +am = x}.

GEOMETRY AND CALCULUS OF LOSSES

spr( l1/2) spr l log),spr l1/2))

bd(spr l1/2)

bd spr l log)

p1, l(p;y1)

p2, l(p;y2)

FIGURE 17: Illustration of Corollary 51. We take the normalised log loss llog ( 5.1) and shift the maximum of its conditional Bayes risk (see 5.2) to (3/4,1/4) to generate the loss function l log. We then show the resulting renormalised M-sum of this loss with the normalised concave norm loss l1/2, and M := spr( l1/2).

If f1,..., fm and g are concave functions the above two notations are overloaded with the concave functional M-sum and dual concave functional M-sum:

X x 7 g(f1,..., fm)(x) := g(f1(x),..., fm(x))

X x 7 g(f1,..., fm)(x) := sup{g(f1(a1),..., fm(am)) | a1 + +am = x}.

The overload of notation can be defended since we will be either dealing with convex or concave functions but not combinations of the two.

6.7 Support Functions

We now justify the overload of the name M-sum for both the set operation and the functional operation, via the following theorem.

Theorem 52 Let Ai X for i [m] and either

WILLIAMSON AND CRANKO

1. M Rm 0 with ri(M) = , or

2. M Rm, and Ai for i [m] are each bounded.

Then σM(σA1,...,σAm) = σ M(A1,...,Am).

Proof From the definition of the convex support function:

x X , σ M(A1,...,Am)(x ) = sup s M(A1,...,Am) x ,s = sup µ M sup s i [m] µi Ai x ,s . (73)

To simplify analysis it is useful to be able to replace the epimultiplication in (73) with ordinary scalar multiplication. Assume claim 1. Proposition 42 shows epimultiplication is continuous in the Painleve Kuratowski sense with respect to sequences of positive scalars, this means that (73) implies the existence of a sequence (µn)n N with µn ri(M) such that

lim n sup s i [m] µn i Ai x ,s = lim n sup s i [m] µn i Ai x ,s = sup µ M sup s i [m] µi Ai x ,s ,

where the first equality follows since ri(M) ri(Rm 0) = Rm >0. Therefore replacing M by ri(M) in the supremum has no effect. Thus the expression in (73) simplifies, giving x X ,

sup µ M sup s i [m] µi Ai x ,s

= sup µ ri(M) sup s i [m] µi Ai x ,s

= sup µ ri(M) sup a1 A1

sup am Am x ,µ1a1 + + µmam

= sup µ ri(M) sup a1 A1 sup am Am

x ,µ1a1 + + x ,µmam

= sup µ ri(M)

sup a1 A1 µ1 x ,a1 + + sup am Am µm x ,am

= sup µ ri(M) (µ1σA1(x )+ + µmσAm(x ))

= sup µ ri(M) (σA1(x ),...,σAm(x )),µ

= σri(M)(σA1(x ),...,σAm(x ))

L7= σM(σA1(x ),...,σAm(x )), (74)

as desired. Now assume claim 2. Since each of the sets Ai are bounded, Proposition 1(1) implies that epimultiplication reduces to scalar multiplication. By a similar argument to (74) mutatis mutandis (we no longer need to replace M by ri(M)) we are able to derive the same identity.

The relationship between convex and concave support functions (7) yields the following corollary:

GEOMETRY AND CALCULUS OF LOSSES

Corollary 53 Let Ai X for i [m] and either

1. M Rm 0 with ri(M) = , or

2. M Rm, and Ai for i [m] are each bounded.

Then ρM(ρA1,...,ρAm) = ρ M(A1,...,Am).

Corollary 54 Let l1,..., lm be a sequence of proper loss functions with conditional Bayes risks L1,...,Lm. Let M : Rm 0 R be a conditional Bayes risk function. Let

j M(L1,...,Lm) and k M(L1,...,Lm).

Then jand k are proper loss functions.

This corollary shows how we can create new proper losses from old via the M-sums and dual M-sums. The following is a restatement in the terminology of M-sums of an old result which we require subsequently.

Proposition 55 (Hiriart-Urruty and Lemar echal 2001, Theorem D.4.3.1) Let f1,..., fm : Rn R be convex. Let g: Rm R be convex and increasing component-wise in the sense that x Rm 0 y implies g(x) g(y). Define F := x 7 (f1(x),..., fm(x)). Then

x Rn, (g F)(x) = g(F(x))( f1(x),..., fm(x)).

Corollary 56 Let (Li)i [m], and M be as defined above, and let li = Li for i [m] and m = M. Then

l1(p,y1) ... lm(p,y1) ... ... ... l1(p,yn) ... lm(p,yn)

m(L1(p),...,Lm(p))

M(L1,...,Lm)(p) Rn.

Lemma 57 Let f1,..., fm : X R be convex (resp. concave) functions then σM(f1,..., fm) is convex (resp. ρM(f1,..., fm) is concave).

Proof Let f1,..., fm : X R be convex. Fix arbitrary (ai)i [m] and (bi)i [m] with ai,bi dom(fi) for i [m], and pick an arbitrary t (0,1). Then

i [m], fi(tai +(1 t)bi) t fi(ai)+(1 t)fi(bi)

= sup µ M i [m] µi fi(tai +(1 t)bi) t sup µ M i [m] µi fi(ai)+(1 t) sup ν M

m j=1 νj f j(bj).

Since (ai)i [m] and (bi)i [m] are arbitrary, taking the infimum over both sides yields

inf a1+ +am=x b1+ +bm=y

sup µ M i [m] µi fi(tai +(1 t)bi)

WILLIAMSON AND CRANKO

inf a1+ +am=x b1+ +bm=y

t sup µ M i [m] µi fi(ai)+(1 t) sup ν M

m j=1 νj f j(bj)

inf a1+ +am=x sup µ M i [m] µi fi(ai)

inf b1+ +bm=y sup ν M

m j=1 νj f j(bj)

= t σM(f1,..., fm)(x)+(1 t) σM(f1,..., fm)(y). (75)

To complete the proof

inf a1+ +am=x b1+ +bm=y

sup µ M i [m] µi fi(tai +(1 t)bi)

sup µ M i [m] µi fi(tai +(1 t)bi)

i [m], ai,bi dom(fi) and i [m] ai = x and i [m] bi = y

sup µ M i [m] µi fi(ci)

i [m], ci dom(fi) and i [m] ci = tx+(1 t)y

= σM(f1,..., fm)(tx+(1 t)y), (76)

where the second equality follows since dom(fi) is convex for each i [m]. Since (76) (75), we have that σM(f1,..., fm) is convex. By an identical argument, mutatis mutandis, the concave result for the concave functional M-sum follows.

Lemma 58 Let S K(X), and µ R 0. Then x 7 supx dom(σS) x ,x µσS(x )) = (σµ A) .

Proof Let µ > 0. Then µσS = σµA = σµ A (Hiriart-Urruty and Lemar echal, 2001, Theorem C.3.3.2) and the result follows by conjugation. Now assume µ = 0. Then from Lemma 9, dom(σS) = (rec S) and x X,

sup x dom(σS) ( x ,x µσS(x )) = sup x (rec S) x ,x

= sup x (rec S) x ,x

( 0 x rec(S) otherwise, (77)

where the final equality follows from the definition of the dual cone (2). Noticing (77) is the indicator of the set rec(S), it therefore is also the convex conjugate of the support function σS (Rockafellar, 1970, Theorem 13.2, p. 114).

Theorem 59 Let M R(Rm) be compact, N S(Rm), and Ai K(X) for i [m] with i [m]Ai = . Then

1. σ M(A1,...,Am) = σM(σA1,...,σAm), and

2. ρ N(A1,...,Am) = ρN(ρA1,...,ρAm).

GEOMETRY AND CALCULUS OF LOSSES

Proof 1. Let D := dom(σA1) dom(σAm) (X )m and a := (a1,...,am) D. The Legendre Fenchel conjugate (11) of the right-hand side is x X, σM(σA1,...,σAm) (x)

x ,x inf a1+ +am=x σM((σA1(a1),...,σAm(am)))

sup a1+ +am=x x ,x σM((σA1(a1),...,σAm(am)))

= sup a1,...,am X ( a1,x + + am,x σM((σA1(a1),...,σAm(am))))

= sup a D ( a1,x + + am,x σM((σA1(a1),...,σAm(am))))

a1,x + + am,x sup µ M (σA1(a1),...,σAm(am)),µ

= sup a D inf µ M( a1,x + + am,x (σA1(a1),...,σAm(am)),µ )

= sup a D inf µ MLx(a,µ), (78)

where for x X we have Lx : D M R with

a D, µ M, Lx(a,µ) := a1,x + + am,x (σA1(a1),...,σAm(am)),µ .

Immediately Lx(a,µ) is concave and upper semi-continuous in a, and convex and lower semicontinuous in µ. Both D and M are convex and M is compact, and so we can apply Sion s minimax theorem (1958) and write

sup a D inf µ MLx(a,µ) = inf µ Msup a D Lx(a,µ)

= inf µ M i [m] sup x dom(σAi) ( x ,x µiσAi(x ))

= inf µ M i [m] r Ai,µi(x), (79)

where r Ai,µi(x) := supx domσAi( x ,x µiσAi(x )). Examining the functions r Ai,µi with Lemma 58, we see that r Ai,µi = σ µi Ai and in summary we have x X,

σM(σA1,...,σAm) (x) (78) = sup a D inf µ MLx(a,µ)

(79) = inf µ M i [m] r Ai,µi(x)

L58 = inf µ M(σ µ1 A1(x)+ +σ µm Am(x))

= inf µ M(σµ1 A1 σµm Am) (x),

WILLIAMSON AND CRANKO

where denotes infimal convolution (4). We now apply the biconjugate theorem to both sides to obtain x X ,

σM(σA1,...,σAm) (x ) = sup x X

x ,x inf µ M(σµ1 A1 σµm Am) (x)

= sup µ M sup x X

x ,x (σµ1 A1 σµm Am) (x)

(σµ1 A1 σµm Am)(x )

= sup µ M σµ1 A1 µm Am(x )

µ M µ1 A1 µm Am(x )

= σ M(A1,...,Am)(x ),

where the final two equalities are due to Hiriart-Urruty and Lemar echal (2001, Theorem C.3.3.2). From Lemma 57 we have σM(σA1,...,σAm) = σM(σA1,...,σAm), which completes the proof of claim 1. We now turn our attention to claim 2. Let E := dom(ρA1) dom(ρAm) (X )m and a := (a1,...,am) E. The concave conjugate (12) of the right-hand side is x X, ρN(ρA1,...,ρAm)

x ,x sup a1+ +am=x ρN((ρA1(a1),...,ρAm(am)))

a1,x + + am,x inf µ N (ρA1(a1),...,ρAm(am)),µ

= inf a E sup µ M ( a1,x + + am,x (ρA1(a1),...,ρAm(am)),µ )

= inf a E sup µ N Ex(a,µ),

where for x X we have Ex : E N R with

a E, µ M, Ex(a,µ) := a1,x + + am,x (ρA1(a1),...,ρAm(am)),µ .

We immediately have that Ex(a,µ) is convex and lower semi-continuous in a, and concave and upper semi-continuous in µ. In order to apply Lemma 36 we need to find certain sets E E and N N such that we satisfy (52). From the definition of E we have 0 E. From the 1-homogeneity of the functions ρAm we know Ex(0,µ) = 0 for all µ N. Therefore µ N,

inf a E Ex(a,µ) Ex(0,µ) inf a E Ex(a,µ) inf b {0}Ex(b,µ) = 0. (80)

Let N := N BβN(0)+1 N, where BβN(0)+1 is the norm ball of radius βN(0)+1. Then N N is compact, and E := {0} E is convex and nonempty. From (80) we have

inf a E sup µ N Ex(a,µ) 0 = inf a E sup µ N Ex(a,µ),

GEOMETRY AND CALCULUS OF LOSSES

and therefore we satisfy (52) and we can apply Lemma 36. From here we proceed with an argument that parallels the proof for claim 1, mutatis mutandis (infima and suprema exchanged, convex conjugates replaced with concave conjugates, inf convolution replaced with sup convolution) and we find ρ N(A1,...,Am) = ρN(ρA1,...,ρAm), completing the proof of claim 2.

6.8 The M-Sum Polar

The polar (antipolar) operation plays an important role in our theory, and thus it is natural to ask how polarity interacts with the M-sum operations. We first need the following proposition.

Proposition 60 Let Ai X for i I be an arbitrary collection of sets with index set I, let R R(X) and S S(X). Then

1. (S i I Ai) = T i I(A i ), and (S i I Ai) = T i I(A i );

2. R = lev 1(γR), and S = lev 1(βS).

Proof The polar identity in claim 1 is a standard result in functional analysis (Aliprantis and Border, 2006, Corollary 5.104, p. 218). The proof for the antipolar is identical modulo the reversal of some inequalities. Claim 2 is immediate from the definition of star-shaped and co-star-shaped sets in 2.7.

We now show that the polarity operation preserves the radiant and shady nature of the sets.

Theorem 61 Let Ai R(X) and Bi S(X+) for i [m], M R(Rm) compact, N S(Rm 0). Then

1. M(A1,...,Am) R(X), and M(A1,...,Am) R(X);

2. N(B1,...,Bm) S(X), and N(B1,...,Bm) S(X).

Proof We take the first case:

(0,1] M(A1,...,Am) = (0,1] [

µ M i [m] µi Ai

µ M i [m] (0,1] µi Ai

µ M i [m] µi (0,1] Ai

µ M i [m] µi Ai,

and so it is clear that M(A1,...,Am) is star-shaped. By the same argument, mutatis mutandis M(A1,...,Am) is star-shaped, and N(B1,...,Bm) and N(B1,...,Bm) are co-star-shaped. Theorem 45 guarantees convexity, and Theorem 48 guarantees closure. Proposition 46 guarantees the exclusion of the origin in the shady case, and so all that is left is to show 0 int( M(A1,...,Am)), and 0 int( M(A1,...,Am)).

WILLIAMSON AND CRANKO

By hypothesis Ai R(X) for each i [m] and so 0 int(Ai). By definition there exists (ri)i [m+1] with ri R>0 such that

i [m], Bri Ai X and Brm+1 M Rm.

Let r := mini [m] ri, µ Rm >0 Brm+1 M, and µ min := mini [m] µ i > 0. Then

M(A1,...,Am) = [

µ M i [m] µi Ai [

µ M i [m] µi Br i [m] µ i Br Br µ min,

M(A1,...,Am) = [

i [m] µi Ai [

i [m] µi Br \

i [m] µ i Br = Br µ min,

completing the proof.

Lemma 62 Let R R(X), S S(X) with S rec(S), α 0. Then

1. lev α γR = α R, and

2. lev α βS = α S.

Proof Suppose α > 0 and let x lev α(γR), then exploiting the 1-homogeneity of γR:

α γR(x ) 1 γαR(x ) 1.

Thus applying Proposition 60(2) we find lev α γR = lev 1(γαR) = αR. Penot and Z alinescu (2000, Proposition 2.3) prove lev=0(γR) = rec(R). Since gauge functions are nonnegative (9), this shows claim 1. Suppose α > 0 and let x lev α(βS), then appealing to the 1-homogeneity of βS:

α βS(x ) 1 βαS(x ) 1.

Thus applying Proposition 60(2) we find lev α(βS) = lev 1(βαS) = αS. Now suppose α = 0. Penot and Z alinescu (2000, Proposition 2.4) prove lev=0(βS) = rec(S)\cone(S), and lev>0(βS) = cone(S), giving lev 0(βS) = rec(S). This shows claim 2 and completes the proof.

The following general duality result extends a range of classical results, as well as results in (Seeger, 1990); it demonstrates the appealling fact that the polar of an M-sum of (Ai)i is the dual (polar-M)-sum of the polars of (Ai)i; and similarly for antipolars.

Theorem 63 Let Ai R(X) and Bi S(X) with Bi rec(Bi) for i [m], M R(Rm), and N S(Rm 0). Assume M and Ai for i [m] are compact. Then

1. M(A1,...,Am) = M (A 1,...,A m), and

2. N(B1,...,Bm) = N (B 1,...,B m).

GEOMETRY AND CALCULUS OF LOSSES

Proof Calculating the polar of the left hand side of claim 1 we have

( M(A1,...,Am)) (13) = lev 1 σ M(A1,...,Am)

T52 = lev 1 ( M(σA1,...,σAm))

= lev 1 (x 7 σM((σA1(x ),...,σAm(x ))))

(15) = lev 1 x 7 γM ((γA 1(x ),...,γA m(x )))

P60(2) = x X |(γA 1(x ),...,γA m(x )) M

x X (γA 1(x ),...,γA m(x )) = µ

x X i [m], γA i (x ) = µ i . (81)

Observe that M is star-shaped and closed, thus 0 M and [0,1] M = M . Hence

M(A1,...,Am) (81) = [

x X i [m], γA i (x ) = µ i

x X i [m], γA i (x ) = µ i

x X i [m], γA i (x ) = λµ i

x X i [m], γA i (x ) µ i

i [m] lev µ i (γA i )

i [m] µ i A i

= M (A 1,...,A m), (82)

which completes the proof of claim 1. The proof of claim 2 proceeds much like the proof of claim 1. Calculating the left hand side of claim 2, by a similar argument to (82), mutatis mutandis

( N(B1,...,Bm)) (13) = lev 1 ρ N(B1,...,Bm)

C53 = lev 1 ( N(σB1,...,σBm))

= lev 1 (x 7 ρN((ρB1(x ),...,ρBm(x ))))

(15) = lev 1 x 7 βN ((βB 1(x ),...,βB m(x )))

P60(2) = x X (βB 1(x ),...,βB m(x )) N

x X (βB 1(x ),...,βB m(x )) = µ

x X i [m], βB i (x ) = µ i . (83)

WILLIAMSON AND CRANKO

Observe that N is co-star-shaped, thus [1, ) N = N . Hence

N(B1,...,Bm) (83) = [

x X i [m], βB i (x ) = µ i

x X i [m], βB i (x ) = µ i

x X i [m], βB i (x ) = λµ i

x X i [m], βB i (x ) µ i

i [m] lev µ i (βB i )

i [m] µ i B i

= N (B 1,...,B m),

which completes the proof of claim 2.

We can take the polars of both sides of the above theorem to obtain the result that the polar of the dual polar M-sum of the polars of (Ai)i is the M-sum of (Ai)i:

Corollary 64 Let Ai R(X) and Bi S(X) with Bi rec(Bi) for i [m], M R(Rm), and N S(Rm 0). Assume M and Ai for i [m] are compact. Then

1. M(A1,...,Am) = ( M (A 1,...,A m)) , and

2. N(B1,...,Bm) = ( N (B 1,...,B m)) .

Proof From Theorem 61 we know M(A1,...,Am) is radiant and N(B1,...,Bm) is shady. The bipolar theorem (14) applied to Theorem 63 gives

M(A1,...,Am) = ( M (A 1,...,A m)) and N(B1,...,Bm) = ( N (B 1,...,B m)) .

Theorem 48 and Corollary 49 makes the explicit closure operations redundant.

The above duality result can also be expressed in terms of gauges:

Corollary 65 Let Ai R(X) and Bi S(X) with Bi rec(Bi) for i [m], M R(Rm), and N S(Rm 0). Assume M and Ai for i [m] are compact. Then

1. γ M(A1,...,Am) = γM(γA1,...,γAm), and

2. β N(B1,...,Bm) = βN(βB1,...,βBm).

GEOMETRY AND CALCULUS OF LOSSES

M M(A1,...,Am) Operation σM(σA1,...,σAm)

{1m} A1 + +Am Minkowski sum m i=1 σAi

([m]) co(A1 Am) Convex union

M M(A1,...,Am) σM(σA1,...,σAm)

{1m} A1 Am Intersection σA1 σAm

([m]) A1 Am Inverse sum inf a1+ am=x

TABLE 6: Examples of convex M-sums and dual M-sums. The inverse sum is discussed in (Rockafellar, 1970, page 21).

Proof The two identities are derived as follows:

γ M(A1,...,Am) (15) = σ( M(A1,...,Am))

T63(1) = σ M(A 1,...,A m)

T59 = ρM (σA 1,...,σA m)

(15) = γM(γA1,...,γAm),

β M(B1,...,Bm) (15) = ρ( N(B1,...,Bm))

T63(2) = ρ N (B 1,...,B m)

T59 = ρN (ρB 1,...,ρB m)

(15) = βN(βB1,...,βBm).

Finally, the duality result implies the following result expressed in terms of proper loss functions.

Corollary 66 Let l1,..., lm be a sequence of proper loss functions with conditional Bayes risks L1,...,Lm. Let M : Rm 0 R be another conditional Bayes risk function. Then

l M(L1,...,Lm) l M (L 1,...,L m).

6.9 Conclusion on M-sums and New Losses from Old

The above development shows a general and powerful way of combining existing proper losses into new ones. An intriguing and satisfying feature of the results is that, in essence, the way you combine several proper losses into a new one, is to combine them using yet another proper loss! The M-sum operations are thus a very natural means to combine multiple existing proper loss functions to create a new proper loss function. Observe that by suitable scaling the component proper losses, one can smoothly interpolate between them in a variety of ways (depending upon the choice of M). To provide some intuition, several classical examples of convex M-sums are presented in Table 6. Confer (Seeger, 1990) who uses different notation. See also (Mesikepp, 2016) and (Gardner et al., 2013) and (Kusraev and Kutateladze, 1995, Chapter 1).

7. Conclusion

We have presented a geometric theory of proper losses, whereby we take an unbounded convex set with particular properties as the starting point, and derive the proper loss as the (sub)-gradient of

WILLIAMSON AND CRANKO

the support function of the set. The new perspective shows the natural duality between a loss and the antipolar loss newly introduced here, which is of value in understanding Vovk s aggregating algorithm (Kamalaruban et al., 2015). It also shows how one can generally combine multiple proper losses to create new proper losses. In developing that combinatorial theory, we extended a number of results on M-sums, and in particular proved an elegant and general duality theorem. The theory shows a deep but simple relationship between loss functions and concave gauges, the concave analog of a norm. Such concave gauges arise naturally in economics, but until now they have not been explicitly utilised within machine learning. We have presented the theory for finite outcomes (n-class probability estimation). Many of the results extend to a more general setting (Cranko, 2021). The geometry of losses complements existing geometric approaches to machine learning, which have focussed on the geometry of the data distribution (through its likelihood function) (Amari, 2016), the geometry of the prior (Mahony and Williamson, 2001), and the geometry of the model class (Lee et al., 1998; Mendelson and Williamson, 2002). The geometric theory developed here enables a general and insightful perspective relating Bayes risks and measures of information extending the results in (Reid and Williamson, 2011) and (Garcia Garcia and Williamson, 2012). By developing measures of information in terms of convex sets (directly related to the superprediction sets used in the present paper) one can extend and refine the famous data processing theorem of information theory (Williamson and Cranko, 2022). Given the foundational role loss functions play in a wide range of machine learning problems, it seems reasonable to suppose that the theory presented here will lead to further insights. One concrete direction for future work is to relate the geometry of the loss function developed here to the geometry of hypotheses classes and thus illuminate the interaction between losses and hypothesis classes that controls the speed of convergence in learning problems (van Erven et al., 2015). Ideally one would have a theory that simultaneously incorporated the loss function l, the hypothesis class F, the distribution of the data P, and one s prior knowledge into some overall geometric structure.

8. Acknowledgements

This work was inspired by conversations between RW and Mark Reid during 2009 2012. An early version of some of the results appeared in COLT2014. RW s contribution was funded in part by NICTA and the Deutsche Forschungsgemeinschaft under Germany s Excellence Strategy EXC number 2064/1 Project number 390727645. Thanks to extraordinarily assiduous referees for catching many typos and glitches in an earlier version.

Charalambos D. Aliprantis and Kim Border. Infinite dimensional analysis: a hitchhiker s guide. Springer Science & Business Media, 2006. Claudi Alsina, Justyna Sikorska, and M. Santos Tom as. Norm Derivatives and Characterizations of Inner Product Spaces. World Scientific, 2010. Shun-ichi Amari. Information Geometry and its Applications. Springer, 2016. Alexandre Y. Aravkin, James V. Burke, Dmitriy Drusvyatskiy, Michael P. Friedlander, and Kellie J. Mac Phee. Foundations of gauge and perspective duality. SIAM Journal on Optimization, 28(3): 2406 2434, 2018.

GEOMETRY AND CALCULUS OF LOSSES

Shiri Artstein-Avidan and Yanir A. Rubinstein. Differential analysis of polarity: Polar Hamilton Jacobi, conservation laws, and Monge Amp ere equations. Journal d Analyse Math ematique, 132 (1):133 156, 2017. Edward H. Askwith. A Course of Pure Geometry. Cambridge University Press, 1917. Alfred Auslender and Marc Teboulle. Asymptotic cones and functions in optimization and variational inequalities. Springer, 2003. Abdessamad Barbara and Jean-Pierre Crouzeix. Concave gauge functions and applications. ZOR Mathematical Methods of Operations Research, 40:43 74, 1994. Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer Science & Business Media, 2011. Valeri ı N. Berestovsk ı and Victor M. Gichev. Metrized semigroups. Journal of Mathematical Sciences, 119(1):10 29, 2004. James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1985. Marcel Berger. Geometry Revealed: A Jacob s Ladder to Modern Higher Geometry. Springer, 2010. Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1 3, January 1950. Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005. Shiing-Shen Chern. Finsler geometry is just Riemannian geometry without the quadratic restriction. Notices of the American Mathematical Society, 43(9):959 963, 1996. Gustave Choquet. Les cˆones convexes faiblement complets dans l analyse. Proceedings of the International Congress of Mathematicians, Stockholm, pages 317 330, 1962. Charles W. Cobb and Paul H. Douglas. A theory of production. The American Economic Review, 18 (1):139 165, March 1928. Richard Cornes. Duality and Modern Economics. Cambridge University Press, 1992. Zac Cranko. An analytic approach to the structure and composition of General Learning Problems. Ph D thesis, The Australian National University, 2021. URL https://openresearch-repository.anu.edu.au/bitstream/1885/219338/1/ Cranko%20Thesis%202021.pdf. A. Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77 93, March 2007. Ricky Der and Daniel Lee. Large-margin classification in Banach spaces. In Artificial Intelligence and Statistics, pages 91 98. PMLR, 2007. Michel Marie Deza and Elena Deza. Encyclopedia of Distances. Springer, 2009. Jean Dieudonn e. History of Functional Analysis. North-Holland Publishing Company, 1981. Rolf F are and Daniel Primont. The unification of Ronald W. Shephard s duality theory. Journal of Economics (Zeitschrift f ur National okonomie), 60(2):199 207, 1994. Rolf F are and Daniel Primont. Multi-output Production and Duality: Theory and Applications. Kluwer Academic Publishers, 1995. Michael P. Friedlander, Ives Macedo, and Ting Kei Pong. Gauge optimization, duality, and applications. SIAM Journal on Optimization, 24(4):1999 2022, 2014. Dario Garcia-Garcia and Robert C. Williamson. Divergences and Risks for Multiclass Experiments. In Conference on Learning Theory (JMLR: W&CP), volume 23, pages 28.1 28.20, 2012.

WILLIAMSON AND CRANKO

Richard J. Gardner and Markus Kiderlen. Operations between functions. Communications in Analysis and Geometry, 26(4):787 855, 2018. Richard J. Gardner, Daniel Hug, and Wolfgang Weil. Operations between sets in geometry. Journal of the European Mathematical Society, 15:2297 2352, 2013. John Robilliard Giles. Classes of semi-inner-product spaces. Transactions of the American Mathematical Society, 129(3):436 446, 1967. Armand Gissler and Tim Hoheisel. A note on the K-epigraph. Optimization, (online preview):1 35, 2022. Tilmann Gneiting and Matthias Katzfuss. Probabilistic forecasting. Annual Review of Statistics and its Applications, 1:125 151, 2014. Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, March 2007. Irving John Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107 114, 1952. Peter D. Gr unwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367 1433, 2004. Chung Ha. A noncompact minimax theorem. Pacific Journal of Mathematics, 97(1):115 117, 1981. David J. Hand and Veronica Vinciotti. Local Versus Global Models for Classification Problems: Fitting Models Where it Matters. The American Statistician, 57(2):124 131, 2003. Giora Hanoch. Symmetric duality and polar production functions. In Melvyn Fuss and Daniel Mc Fadden, editors, Production Economics: A Dual Approach to Theory and Applications, volume 1, pages 111 132. North-Holland, 1978. Georg Hasenkamp and J urgen Schrader. Dual polar price and quantity aggregation. Zeitschrift f ur National okonomie, 38(3-4):305 322, 1978. Eduard Helly. Uber Systeme linear Gleichungen mit unendlich vielen Unbekannten. Monatshefte f ur Mathematik und Physik, 31(1):60 91, 1921. Arlo D. Hendrickson and Robert J. Buehler. Proper scores for probability forecasters. The Annals of Mathematical Statistics, 42(6):1916 1921, 1971. Ralf Herbrich and Robert C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3(Sep):175 212, 2002. Jean-Baptiste Hiriart-Urruty and Claude Lemar echal. Fundamentals of Convex Analysis. Springer, Berlin, 2001. Akos G. Horv ath, Zsolt L angi, and Margarita Spirova. Semi-inner products and the concept of semi-polarity. Results in Mathematics, 71(1-2):127 144, 2017. Hendrik S. Houthhakker. A note on self-dual preferences. Econometrica, 33(4):797 801, October 1965. Stephen E. Jacobsen. On Shephard s duality theorem. Journal of Economic Theory, 4:458 464, 1972. Johannes Jahn. Vector optimization: Theory, Applications, and Extensions. Springer, 2nd edition, 2011. Dale W. Jorgenson. Foreword. In Ronald W. Shephard: Cost and Production Functions (Reprint of the First Edition), pages iii iv. Spinger-Verlag, 1981. Yuri Kalnishkan and Michael V. Vyugin. Mixability and the existence of weak complexities. In The 15th Annual Conference on Computational Learning Theory (COLT 2002), volume 2375 of Lecture Notes in Artificial Intelligence, pages 105 120. Springer-Verlag, 2002.

GEOMETRY AND CALCULUS OF LOSSES

Yuri Kalnishkan, Volodya Vovk, and Michael V. Vyugin. Loss functions, complexities, and the Legendre transformation. Theoretical Computer Science, 313:195 207, 2004. Parameswaran Kamalaruban, Robert C. Williamson, and Xinhua Zhang. Exp-concavity of proper composite losses. In Conference on Learning Theory (JMLR: W&CP), volume 40, pages 1 31, 2015. Anatoly G. Kusraev and Semen Samsonovich Kutateladze. Subdifferentials: theory and applications. Kluwer, 1995. Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974 1980, 1998. Dinh The Luc and Michel Volle. Levels sets infimal convolution and level addition. Journal of optimization theory and applications, 94:695 714, 1997. G unter Lumer. Semi-inner-product spaces. Transactions of the American Mathematical Society, 100 (1):29 43, 1961. Helmut L utkepohl. Handbook of Matrices. John Wiley and Sons, 1996. Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics (revised edition). John Wiley & Sons, 1999. Robert E. Mahony and Robert C. Williamson. Prior knowledge and preferential structures in gradient descent learning algorithms. Journal of Machine Learning Research, 1:311 355, 2001. Horst Martini and Konrad J. Swanepoel. Antinorms and Radon curves. Aequationes Mathematicae, 72(1-2):110 138, 2006. Horst Martini, Konrad J. Swanepoel, and Gunter Weiß. The geometry of Minkowski spaces a survey. Part I. Expositiones Mathematicae, 19:97 142, 2001. Andreu Mas-Collel, Michael D. Whinston, and Jerry R. Green. Microeconomic Theory. Oxford University Press, 1995. Daniel Mc Fadden. Cost, revenue, and profit functions. In Melvyn Fuss and Daniel Mc Fadden, editors, Production Economics: A Dual Approach to Theory and Applications, volume 1, pages 1 110. North-Holland, 1978. Shahar Mendelson and Robert C. Williamson. Agnostic learning nonconvex function classes. In International Conference on Computational Learning Theory, pages 1 13. Springer, 2002. Karl Menger. Untersuchungen uber allgemeine Metrik. Mathematische Annalen, 100:75 163, 1928. Aditya Krishna Menon and Robert C. Williamson. The cost of fairness in binary classification. FAT*18, Proceedings of Machine Learning Research, 81:1 43, 2018a. Aditya Krishna Menon and Robert C. Williamson. A loss framework for calibrated anomaly detection. In Proceedings of the 32nd international conference on neural information processing systems, pages 1494 1504, 2018b. Jorma Kaarlo Merikoski. On c-norms and c-antinorms on cones. Linear Algebra and its Applications, 150:315 329, 1991. Tim Mesikepp. M-addition. Journal of Mathematical Analysis and Applications, 443:146 177, 2016. Zakaria Mhammedi and Robert C. Williamson. Constant regret, generalized mixability, and mirror descent. In Neural Information Processing Systems, 2018. Vitali Milman and Liran Rotem. Non-standard constructions in convex geometry: geometric means of convex bodies. In Convexity and concentration, pages 361 390. Springer, 2017a. Vitali Milman and Liran Rotem. Irrational constructions in convex geometry. St. Petersburg Mathematics Journal, 29(1):165 175, 2017b. Hermann Minkowski. Geometrie der Zahlen. B.G. Teubner, 1896.

WILLIAMSON AND CRANKO

Boris S. Mordukhovich and Yongheng Shao. On nonconvex subdifferential calculus in Banach spaces. Journal of Convex Analysis, 2(1/2):211 227, 1995. Maria Moszy nska. Selected Topics in Convex Geometry. Birkh auser, 2006. Maria Moszy nska and Wolf-Dieter Richter. Reverse triangle inequality: Antinorms and semiantinorms. Studia Scientiarum Mathematicarum Hungarica, 49(1):120 138, 2012. Armando Cabrera Pacheco and Robert C. Williamson. The geometry of mixability. Transactions on Machine Learning Research, 2023. Diethard Ernst Pallaschke and Ryszard Urbanski. Pairs of compact convex sets: fractional arithmetic with convex sets. Springer Science & Business Media, 2013. Teemu Pennanen. Graph-convex mappings and K-convex functions. Journal of Convex Analysis, 6 (2):235 266, 1999. Jean-Paul Penot. Duality for radiant and shady programs. Acta Mathematica Vietnamica, 22(2): 541 566, 1997. Jean-Paul Penot. The bearing of duality on microeconomics. In Advances in Mathematical Economics, pages 113 139. Springer, 2005. Jean-Paul Penot. Calculus Without Derivatives. Springer, 2012. Jean-Paul Penot and Constantin Z alinescu. Approximation of functions and sets. In Approximation, optimization and mathematical economics, pages 255 274. Springer, 2001. Jean-Paul Penot and Constantin Z alinescu. Harmonic sum and duality. Journal of Convex Analysis, 7(1):95 113, 2000. Albrecht Pietsch. History of Banach Spaces and Linear Operators. Birkh auser, 2007. John R. Platt. Strong inference. Science, 146(3642):347 353, October 1962. Friedrich Pukelsheim. On information functions and their polars. Journal of Optimization Theory and Applications, 41(4):533 546, 1983. Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387 2422, 2010. Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731 817, 2011. R. Tyrell Rockafellar. Convex Analysis. Princeton University Press, 1970. R. Tyrrell Rockafellar. Monotone Processes of Convex and Concave Type, volume 77 of Memoirs of the American Mathematical Society. American Mathematical Society, 1967. R. Tyrrell Rockafellar and Roger J-B. Wets. Variational Analysis. Springer-Verlag, 2004. Gian-Carlo Rota. Indiscrete Thoughts. Birkh auser, 1997. Hanno Rund. The differential geometry of Finsler spaces. Springer-Verlag, 1959. Paul A. Samuelson. Using full duality to show that simultaneously additive direct and indirect utilities implies unitary price elasticity of demand. Econometrica, 33(4):781 796, October 1965. Ryuzo Sato. Self-dual preferences. Econometrica, 44(5):1017 1032, September 1976. Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783 801, 1971. Rolf Schneider. Convex Bodies: The Brunn-Minkowski Theory. Cambridge University Press, 2014. Second expanded edition. Alberto Seeger. Direct and inverse addition in convex analysis and applications. Journal of Mathematical Analysis and Applications, 148:317 349, 1990. Alberto Seeger. Second derivatives of a convex function and of its Legendre-Fenchel transformate. SIAM Journal on Optimization, 2(3):405 424, 1992.

GEOMETRY AND CALCULUS OF LOSSES

Alberto Seeger and Michel Volle. On a convolution operation obtained by adding level sets: classical and new results. Recherche op erationnelle/Operations Research, 29(2):131 154, 1995. Ronald W. Shephard. Cost and Production Functions. Princeton University Press, 1953. Ronald W. Shephard. Theory of Cost and Production Functions. Princeton University Press, 1970. Emir H. Shuford Jr., Arthur Albert, and H. Edward Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125 145, 1966. Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171 176, 1958. Boaz A. Slomka. On duality and endomorphisms of lattices of closed convex sets. Advances in Geometry, 11:225 239, 2011. Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008. Anthony C. Thompson. Minkowski Geometry. Cambridge University Press, 1996. Tim van Erven, Mark D. Reid, and Robert C. Williamson. Mixability is Bayes risk curvature relative to log loss. Journal of Machine Learning Research, 13:1639 1663, 2012. Tim van Erven, Peter D. Gr unwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793 1861, 2015. Vladimir N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998. Hal R. Varian. Microeconomic Analysis. W.W. Norton and Company, 1978. Elodie Vernet, Robert C. Williamson, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1 52, 2016. Michel Volle. Duality for the level sum of quasiconvex functions and applications. ESAIM: Control, Optimisation and calculus of variations, 3:329 343, 1998. Carl-Axel S. Sta el von Holstein. A family of strictly proper scoring rules which are sensitive to distance. Journal of Applied Meteorology, 9:360 364, 1970. Volodya Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT), pages 371 383, 1990. Volodya Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 51 60. ACM, 1995. Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213 248, 2001. Bo Waggoner. Linear functions to the extended reals. ar Xiv:2102.09552, 2021. Abraham Wald. Statistical Decision Functions. John Wiley & Sons, 1950. Robert C. Williamson. Loss functions. In Bernhard Sch olkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, pages 71 80. Springer, 2013. Robert C. Williamson. The geometry of losses. In Proceedings of The 27th Conference on Learning Theory, pages 1078 1108, 2014. Robert C. Williamson and Zac Cranko. Information processing equalities and the information risk bridge. ar Xiv:2207.11987, 2022. Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10(12), 2009. Fedor Zhdanov. Theory and Applications of Competitive Prediction. Ph D thesis, Department of Computer Science, Royal Holloway, University of London, 2011. Constantin Z alinescu. Convex analysis in general vector spaces. World Scientific, 2002. Constantin Z alinescu. On the differentiability of the support function. Journal of Global Optimization, 57(3):719 731, 2013.

WILLIAMSON AND CRANKO

Constantin Z alinescu. Relations between the convexity of a set and the differentiability of its support function. Optimization, 65(3):651 670, 2016.