# convex_elicitation_of_continuous_properties__8be640e9.pdf

Convex Elicitation of Continuous Real Properties

Jessica Finocchiaro Department of Computer Science University of Colorado, Boulder jessica.finocchiaro@colorado.edu

Rafael Frongillo Department of Computer Science University of Colorado, Boulder raf@colorado.edu

A property or statistic of a distribution is said to be elicitable if it can be expressed as the minimizer of some loss function in expectation. Recent work shows that continuous real-valued properties are elicitable if and only if they are identiﬁable, meaning the set of distributions with the same property value can be described by linear constraints. From a practical standpoint, one may ask for which such properties do there exist convex loss functions. In this paper, in a ﬁnite-outcome setting, we show that in fact essentially every elicitable real-valued property can be elicited by a convex loss function. Our proof is constructive, and leads to convex loss functions for new properties.

1 Introduction

Property elicitation is the study of statistics, or properties, of probability distributions which one can incentivize an expected-utility-maximizing agent to reveal. In a machine learning context, this agent is an algorithm following the principle of empirical risk minimization (ERM), wherein a hypothesis is ﬁt to the data by minimizing its error on a training data set, as judged by some loss function. The interest in property elicitation across the economics, statistics, and machine learning communities is reﬂected in the literature, with important results appearing in all three.

A central thread of this literature, weaving between all three communities, asks which continuous real-valued properties are elicitable, and which loss functions elicit them. Building on earlier work of Osband [16] and Lambert [12], Steinwart et al. [22] show that a property is elicitable if and only if it is identiﬁable, a concept introduced by Osband which says that the set of distributions sharing the same property value can be described by a set of linear constraints. Moreover, these papers give characterizations of the loss functions eliciting these identiﬁable properties, showing that every loss can be written as the integral of a positive-weighted identiﬁcation function.

A question of practical interest remains, however: for which properties do there exist convex loss functions eliciting them? Convex losses give concrete algorithms to efﬁciently solve ERM problems, and are also useful more broadly in statistical and economic settings (see 6). At ﬁrst glance, the answer to this question might appear to follow immediately from the comprehensive loss function characterizations of Lambert [12] and Steinwart et al. [22]. Unfortunately, it is far from clear in these characterizations whether there exists a weight function rendering their construction convex.

In this paper, we address this question of convex elicitability in the ﬁnite-outcome setting. Surprisingly, we ﬁnd that, under somewhat mild smoothness assumptions, every identiﬁable real-valued property is convex elicitable. Our proof proceeds by pinpointing a few key attributes of identiﬁcation functions, and then solving the following abstract problem: given a set of functions F from R to R, when does there exist a weight function λ : R R>0 making λf increasing over the report space R for all f F? We give a constructive solution to this problem under certain conditions, and show that identiﬁcation functions happen to satisfy these conditions.

32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montr eal, Canada.

After reviewing the relevant prior work in more detail ( 2), we give our main result ( 3). We then give intuition for our key technical proposition, the solution to the abstract problem mentioned above ( 4), followed by examples illustrating the constructive nature of our approach ( 5). We conclude with a discussion of applications to information elicitation, and future work ( 6). See the Appendix for all omitted proofs.

Notation. We will use the following notation throughout the paper. Let R>0 := {r : r R, r > 0} denote the positive reals. For an interval I, let I denote the interior of I. We let Y denote the outcomes space, here taken to be ﬁnite, and (Y) denote the set of probability distributions over Y.

2 Setting and Background

In property elicitation, we aim to learn some distributional property by minimizing a loss function. A property is simply a function Γ : P R, which assigns a desired report R Rk to each probability distribution in a convex set P (Y). Without loss of generality, we often restrict R = Γ(P), but this does not affect the result. Common properties include moments, quantiles, and expectiles. Throughout the paper we will assume that Γ is a continuous real-valued function, implying R R is an interval. We also restrict to the ﬁnite outcome setting, where |Y| < , and consider P R|Y|, meaning we identify each distribution with the corresponding vector of probabilities.

We are interested in when properties are elicitable, meaning they can be expressed as the minimizer of expected loss for some loss function. In the present paper, we will additionally ask when the loss function can be convex. Deﬁnition 1. A loss function L : R Y R { } elicits a property Γ if for all p P,

{Γ(p)} = arg min r EY p L(r, Y ) . (1)

In this case, we say Γ is elicitable. If L( , y) is convex for every y Y, we say Γ is convex elicitable.

A central notion in property elicitation is that of identiﬁability, where the level sets {p : Γ(p) = r} can be expressed as a linear constraint. Deﬁnition 2. Let property Γ : P R be given, where R = Γ(P). A function V : R Y R identiﬁes Γ if EY p [V (r, Y )] = 0 r = Γ(p) (2)

for all r R and p P. In this case we say Γ is identiﬁable. We say V is oriented if we additionally have EY p [V (r, Y )] > 0 r > Γ(p), for all r R and p P.

Note that by the terminology of Steinwart et al. [22], an identiﬁcation satisfying eq. (2) on all of R is called strong, as otherwise it must hold almost everywhere. We can loosely think of an identiﬁcation function as a derivative of a loss; if L is differentiable and elicits Γ, then roughly speaking, we expect d dr EY p L(r, Y ) = 0 Γ(p) = r.

Finally, we will often assume our properties to possess two important qualities: continuity, and being nowhere-locally-constant. Deﬁnition 3 (Lambert [12]). A continuous property Γ : P R is nowhere-locally-constant if there does not exist any open neighborbood U of P such that Γ(p) = r for all p U.

Intuitively, restricting to nowhere-locally-constant properties is merely to ease bookkeeping, as one could always collapse different report values together afterwards.

It is known that for continuous, nowhere-locally-constant, real-valued properties, identiﬁability implies elicitability. In this paper, we show that under slightly stronger assumptions, identiﬁability implies convex elicitability. To place this result in the proper context, we now brieﬂy tour the history of property elicitation.

2.1 Relevant prior work

While Savage [20] studied the elicitation of expected values, the literature on the elicitation of general properties began with Osband [16], who gave several important results. One of Osband s observations is that the level sets {p : Γ(p) = r} of an elicitable property Γ must be convex [16,

Proposition 2.5]. He also introduced the notion of an identiﬁcation function, and the so-called Osband s principle, which states that (under a mild regularity assumption) every loss function eliciting a given property can be written as the integral of a weighted identiﬁcation function [16, Theorem 2.1]. He also gave several other results, such as the separability of loss functions jointly eliciting quantiles.

Independent of Osband, Lambert et al. [14, 15, 12] provide a geometric approach to both continuous and ﬁnite properties (that is, properties taking values in a ﬁnite set R) when the set of outcomes Y is ﬁnite. They represent the identiﬁcation function as a vector, and relating ﬁnite-valued properties to power diagrams in computational geometry. They rediscover several results of Osband for the realvalued case, such as convexity of level sets and a one-dimensional version of Osband s principle. Moreover, the proof of [12, Theorem 5] shows the following. (Steinwart, et al. [22] extend this result to the case of inﬁnite Y.)

Theorem 1 (Lambert [12]). Let Γ : P R be a continuous, nowhere-locally-constant property. If the level sets {p P : Γ(p) = r} are convex, then Γ is elicitable, and has a continuous, bounded, and oriented identiﬁcation function.

None of the above-mentioned papers address the question of when the loss eliciting the property in question is convex. This question has arisen in the context of surrogate risk minimization, where unlike our setting, one seeks to indirectly elicit a given ﬁnite-valued property (such as the mode, or most likely label) by ﬁrst eliciting a real-valued property, and then applying a link function [5, 23]. For example, support vector machines (SVMs) commonly use hinge loss and then apply the sign function, a combination which indirectly elicits the mode [21]. This literature is related to elicitation complexity, where one asks how many dimensions of the surrogate property are needed so that the desired property can be computed via the link function [7, 9, 14]. This relationship was perhaps ﬁrst identiﬁed by Agarwal and Agarwal [3], who restate prior work in terms of property elicitation, and speciﬁcally focus on convex losses. Finally, Reid, Vernet, and Williamson [18, 24] consider losses which indirectly elicit the full distribution, and consider convexity of the composite loss.

In contrast to this line of work, we seek the direct elicitation of continuous properties. While convex losses are well-known for several continuous properties of interest, including the mean and other expected values (squared loss), ratios of expectations (weighted squared loss), and the median and other quantiles (pinball loss), to our knowledge, to date there have been no results on the direct convex elicitation of general continuous properties.

3 Main Result

We will show that, under mild conditions, every elicitable real-valued property is also convex elicitable. Let us ﬁrst give some intuition why one might suspect this statement to be true. From a geometric perspective, the level sets {p : Γ(p) = r} of continuous elicitable properties are hyperplanes intersected with P. As one changes r, the level sets may be locally parallel, in which case the property is locally a link of a linear property (expected value), or the level sets may not be parallel, in which case the property locally resembles a link of a ratio of expectations. In fact, the second case also covers the ﬁrst, so we can say that, roughly speaking, every continuous property looks locally like a ratio of expectations. The following proposition states that if the property can actually be written as a ﬁnite piecewise ratio of expectations, it is convex elicitable. Hence, taking the limit as one approximates a given property better and better by ratios of expectations, one may suspect that indeed every continuous property is convex elicitable.

Proposition 1. Continuous piecewise ratio-of-expectation properties are convex elicitable.

Proof. First, we formalize the statement. Recall that Y is a ﬁnite set. Let φi : Y R and ψi : Y R>0 be arbitrary for i = 1, . . . , k, and let γi(p) = EY pφi(Y )/EY pψi(Y ). Assume that we have a0 < < ak such that for all p P, there is a unique i {1, . . . , k} such that either γi 1(p) (ai 1, ai) or γi 1(p) = γi(p) = ai 1. Call this i(p), and by extension i(r) where r = γi(p) for this i. We will show that Γ(p) := γi(p)(p) is convex elicitable with respect to the full probability simplex P = (Y).

Observe that by construction, for each i {1, . . . , k 1} the level sets for ai coincide: Si = {p : Γ(p) = ai} = {p : γi(p) = ai} = {p : γi 1(p) = ai}. Moreover, for all such i, these level sets

are full-dimensional in P, i.e., there are (n 2)-dimensional afﬁne sets which are the intersection of a hyperplane and P. Now let Vi(r, y) = ψi(y)r φi(y), which identiﬁes γi, and is strictly increasing in r as ψi(y) > 0 for all y. We now see that the hyperplane which is the span of Si in Rn is orthogonal to the vectors Vi 1(ai, ) Rn and Vi(ai, ) Rn, by the deﬁnition of identiﬁability. We conclude that there is some coefﬁcient αi 1 such that Vi 1(ai, y) = αi 1Vi(ai, y) for all y Y. (In fact, αi 1 > 0, as the coefﬁcient of r must be positive.) We then construct βi(r) = Qi(r) j=0 αj and write the identiﬁcation as V (r, y) = βi(r)Vi(r)(r, y).

Moving now to the formal result, let I R be an interval. Our main technical ingredient shows, given a collection F of functions f : I R satisfying certain conditions, how to construct a multiplier λ : I R>0 making λf strictly increasing on I for all f F. In our proof, the family F will be the set of identiﬁcation functions {V ( , y)}y Y, and λ will play the role of the weight function in previous work ([16, Theorem 2.1], [11, Theorem 2.7], [14, Theorem 3]) showing that L(r, y) = R r r0 λ(x)V (x, y)dx elicits Γ. As λf is increasing, L will additionally be convex.

Therefore, the conditions below are only mildly stronger than what Lambert shows to be true of the desired properties. We begin with our three conditions; the ﬁrst we will assume, and the second and third we will prove hold for any oriented identiﬁcation function.

Condition 1. Every f F is continuous on I, and continuously differentiable on I except on a ﬁnite set Sf I. When f is differentiable, d dxf(x) is ﬁnite. Additionally, if x I and f(x) = 0, then for all z in some open neighborhood U of x, d dzf(z) 0 whenever f is differentiable.

Condition 2. Every f F is bounded and has at most one zero xf I so that if xf exists, f(x) < 0 for x < xf and f(x) > 0 for x > xf. If f does not have a zero on I, then either f(x) < 0 or f(x) > 0 for all x I. For all x I, at least one function f F is nonzero at x.

Condition 3. For all f, g F and all open subintervals I I such that f > 0 > g on I , the function g

f is strictly increasing on I .

Our main technical tool follows; we sketch the proof in 4 and defer the full proof to the Appendix.

Proposition 2. If F satisﬁes Condition 1, 2, and 3, then there exists a function λ : I R>0 so that λf is increasing over I for every f F.

With this tool in hand, we are ready to prove our main result.

Theorem 2. For P = (Y), let Γ : P R be a continuous, nowhere-locally-constant property which is identiﬁed by a bounded and oriented V : R Y R. If F = {V ( , y)}y Y satisﬁes Condition 1, then Γ is convex elicitable.

Proof. We have assumed all f F = {V ( , y)}y Y are bounded, oriented, and satisfy Condition 1, and thus to apply Proposition 2, we need only establish Conditions 2 and 3. A fact we use throughout is that V (r, y) = EY δy V (r, Y ), where δy is the point distribution on y Y.

To establish Condition 2, we procede in order. First, boundedness of each f F follows by assumption. Second, we show that each f has at most one zero on R. As V identiﬁes Γ, note that V (r, y) = 0 Γ(δy) = r when r R. As Γ is single-valued, there can be at most one such r R. Third, we must show that if f has a zero on R, it changes sign from negative to positive at that zero, and if not, f never changes sign on R. The ﬁrst case follows from the fact that Γ(δy) = r and that V is oriented. For the second case, V ( , y) has no zero on R, and thus by continuity of V , cannot change sign on R. Fourth, to see that F has at least one nonzero function for all r R, note that if V (r, y) = 0 for all y Y, then EY p V (r, Y ) = 0 for all p P. Thus, as V identiﬁes Γ and r R, we would have Γ(p) = r for all p, contradicting nowhere-locally-constancy.

For Condition 3, consider V ( , y0), V ( , y1) F and open interval I = (a, b) such that V (r, y0) > 0 > V (r, y1) for all r I . We deﬁne pα = (1 α)δy0 + αδy1 and γ(α) = Γ(pα) for α [0, 1]. Since Γ is continuous and nowhere-locally-constant, [22, Cor. 9] implies that Γ is quasi-monotone, which in turn implies that γ is nondecreasing on [0, 1].

We ﬁrst show I γ([0, 1]) = [γ(0), γ(1)]. By deﬁnition of I , we know r I = V (r, y1) < 0 < V (r, y0) and the orientation of V then implies Γ(δy1) > r > Γ(δy0). Thus, Γ(δy1) = γ(1) b > a Γ(δy0) = γ(0), with the strict inequality since I is nonempty. We then see that r (a, b) = r [Γ(δy0), Γ(δy1)] = γ([0, 1]), and therefore I γ([0, 1])

Next, we show that γ is not only nondecreasing but strictly increasing on A = γ 1(I ). Note that A is itself an open interval as γ is continuous. Let α, α A, and suppose for a contradiction that γ(α) = γ(α ) = r I R. Then Γ(pα) = Γ(pα ) = r, and as V identiﬁes Γ, we have EY pαV (r, Y ) = EY pα V (r, Y ) = 0. Thus, EY p0V (r, Y ) = (α EY pαV (r, Y ) αEY pα V (r, Y ))/(α α) = 0, and similarly for p1. By identiﬁability again, we must now have Γ(p0) = Γ(p1) = r, contradicting Γ(p0) < Γ(p1) as observed above.

Since V identiﬁes Γ, we have for α A,

0 = EY pαV (γ(α), Y ) = (1 α)EY δy0 [V (γ(α), Y )] + αEY δy1 [V (γ(α), Y )]

= (1 α)V (γ(α), y0) + αV (γ(α), y1) ,

from which we conclude the function F(α) = V (γ(α), y1)/V (γ(α), y0) = (α 1)/α = 1 1/α, which is strictly increasing in α. Observe that as γ is strictly increasing on A, its inverse is strictly increasing on I . Thus V (r, p1)/V (r, p0) = F(γ 1(r)) = 1 1/γ 1(r) is strictly increasing on I , as desired.

As we have now established that F satisﬁes Conditions 1-3, Proposition 2 gives us a weight function λ : R R>0 such that for all y Y, the map r 7 λ(r)V (r, y) is strictly increasing on R. Thus, ﬁxing r0 R, the loss L(r, y) = R r r0 λ(r )V (r , y)dr is convex in r for each y Y, as noted by Rockafellar [19, Theorem 24.2]. Moreover, as λ > 0, L elicits Γ by Lambert [12, Theorem 6].

While we defer discussion of future work to 6, it is worth noting here that the argument establishing Condition 3 immediately extends to inﬁnite outcome spaces. Beginning with p0, p1 being arbitrary distributions, Γ(p0) = Γ(p1), one simply observes that V (γ(α), p0)/V (γ(α), p1) = 1 1/α by the same logic. The central challenge to such an extension therefore lies in the proof of Proposition 2.

Loosely speaking, when combining Theorem 2 with the existing literature, we conclude that every nice elicitable property is additionally convex elicitable. We formalize this in two corollaries, one stated as an implication, and the other given in the style of Steinwart et al. [22, Cor. 9]. Corollary 1. Let Γ : P R be continuous, nowhere-locally-constant, and elicited by a loss L with bounded and continuous ﬁrst and second derivatives. Suppose also, for all p P, that d dr EY p L(r, Y ) = 0 for at most one r R. Then Γ is convex elicitable.

Proof. As L(r, y) elicits Γ and is differentiable, for all r R, and all p with Γ(p) = r, we must have d dr EY p L(r, Y ) = 0. By our assumption on the critical points of L, we see that V (r, δy) = d dr L(r, δy) is a bounded and oriented identiﬁcation function for Γ, and is continuously differentiable with bounded derivative. Thus, F = {V (r, δy)}y Y satisﬁes Condition 1, and the result follows from Theorem 2.

Corollary 2. Let P = (Y) be the probability simplex over n outcomes, and let Γ : P R be a nowhere-locally-constant property with a bounded and nowhere vanishing ﬁrst derivative, a bounded second derivative, and a differentiable right inverse.1 Then the following are equivalent:

1. For all r R, the level set {p : Γ(p) = r} is convex. 2. Γ is quasi-monotonic. 3. Γ is identiﬁable and has a bounded and oriented identiﬁcation function. 4. Γ is elicitable. 5. There exists a non-negative, measurable, locally Lipschitz continuous loss function eliciting Γ. 6. Γ is convex elicitable.

Proof. We essentially reduce to a similar result of Steinwart et al. [22, Corollary 9]. First, note that the deﬁnition of nowhere-locally-constant from Lambert et al. [14] coincides with the deﬁnition

1We may identify P with {v R|Y| 1 + : P|Y| 1 i=1 vi 1} so that the derivatives are well deﬁned. In the proof, for ease of notation, we will still write dot products in R|Y|.

of Steinwart et al. [22, Deﬁnition 4] in ﬁnite dimensions. Second, as our assumptions are stronger than theirs, the equivalence of the ﬁrst ﬁve conditions follows. As convex elicitability implies convex level sets (by the same argument of Lambert [12, Theorem 5], which follows even if L can be inﬁnite on the boundary of R), it then sufﬁces to show that identiﬁability implies convex elicitability.

By standard arguments, the convexity of the level sets {p : Γ(p) = r} for r R imply that each level set must be a hyperplane intersected with P. (See e.g. Theorem 1 of [14].) Letting ˆp be the right inverse of Γ, so that Γ(ˆp(r)) = r for all r R, we may deﬁne

V (r, y) = ˆp(r)Γ (δy ˆp(r)) , (3)

a form taken from Frongillo and Kash [8, Proposition 18].

Now for any p with Γ(p) = r, as the level set is a hyperplane intersected with P, we must have Γ(αp + (1 α)ˆp(r)) = r, and we conclude ˆp(r)Γ (p ˆp(r)) = 0. (Simply take the derivative with respect to α.) Thus, as Γ = 0, the vector ˆp(r)Γ ˆp(r)Γ ˆp(r)1 deﬁnes the same hyperplane as {p : Γ(p) = r}, and thus V identiﬁes Γ. (Here 1 R|Y| denotes the all-ones vector.) That V is also bounded and oriented follows easily from our assumptions. As V has a bounded derivative everywhere by assumption, it satisﬁes Condition 1, and convex elicitability then follows from Theorem 2.

4 Proof Sketch and Intuition

We now give a sketch of the construction of the weight function λ in Proposition 2. See the Appendix for the full proof. For the purposes of this section, let us simplify our three conditions as follows:

Condition 1 . Every f F is continuously differentiable. Condition 2 . Each f F has a single zero, and moves from negative to positive. Condition 3 . When f > 0 > g, the ratio g/f is increasing.

Two function case. To begin, let us consider two functions satisfying Conditions 1 , 2 , and 3 , such that f > 0 > g on the interval I. We wish to ﬁnd some λ : I R>0 making both λf and λg strictly increasing. By Condition 3 , we know g/f is increasing on I. Let us choose λ as follows,

λ(r) := ( f(r)g(r)) 1/2 . (4)

As (fg)(r) > 0 for all r I, we have λ(r) > 0 as well. Moreover, one easily checks that λf = p

f/g and λg = p

g/f, which are both increasing as monotonic transformations of g/f.

General case. More generally, we wish to ﬁnd a λ such that for all x R, d dx(λf)(x) > 0. When f > 0, this constraint is equivalent to d dx log(λf)(x) > 0, which is in turn equivalent to d

dx log λ(x) < d dx log f(x). Similarly, if f(x) < 0, then we need d

dx log λ(x) > d dx log( f(x)). Finally, the case f(x) = 0 follows easily from Condition 2 , as d dxf(x) > 0 and λ > 0. Combining these constraints, we see that for all f > 0 and all g < 0, we must have

d dx log( g(x)) < d

dx log λ(x) < d dx log f(x) . (5)

In order for these constraints to be feasible, we must have d dx log( g(x)) < d dx log f(x) for all f < 0 < g, which is seen to be equivalent to Condition 3 after some manipulation.

Perhaps the most natural way to satisfy constraint (5) is to simply take the midpoint between the maximum lower bound m : R R and minimum upper bound m : R R deﬁned as follows:

m(x) := sup g F:g(x)<0

d dx log( g(x)) m(x) := inf f F:f(x)>0

d dx log(f(x))

This yields the following construction (where r0 R is arbitrary),

2 (m(x) + m(x)) , λ(x) = exp Z x

r0 h(z)dz , (6)

where one notes h(x) = d dx log λ(x). Provided our three conditions hold, we now have a positive weight function λ satisfying the constraint (5), and we conclude that λf is increasing for all f F.

Let us observe that our general construction in eq. (6) really is a generalization of the two-function case in eq. (4). That is, we are primarily concerned with the most decreasing and least increasing functions, which allows us to focus on two functions instead of the entire set F. When we only have two functions f > 0 > g, eq. (6) reduces to h(x) = 1

dx log( g(x)) + d dx log f(x) , whence λ(x) = exp 1

2 log( g(x)f(x)) = 1/ p

Hurdles and technicalities. As stated, the above construction has two issues, which we now brieﬂy identify and describe how our proof circumvents. First, in general our functions f will pass through 0, possibly making h and therefore λ unbounded. Recall that we only needed to satisfy eq. (5), and thus rather than taking the midpoint of the lower and upper bounds as in eq. (6), which will diverge whenever one of the bounds diverges, we can always choose h in a slightly more clever manner to be closer to the smallest magnitude bound. See the Appendix for one such construction.

The second problem is that our actual Condition 1 allows for nondifferentiability, which arises in settings of particular interest, like Proposition 1. Fortunately, in the ﬁnite-outcome setting, it is essentially without loss of generality to consider continuous f F (see Theorem 1). We can therefore address the ﬁnite nondifferentiabilities using continuity arguments, allowing us to focus on the set Ic I where every f F is continuously differentiable.

To illustrate the constructive nature of Theorem 2, we now give two examples. The ﬁrst is the Beta family scoring rule found in Buja et al. [6, 11] and Gneiting and Raftery [11, 3], which we use to illustrate the construction itself. The second is a simple elicitable property for which the obvious identiﬁcation function does not give a convex loss; we show how to convexify it.

1. Beta families. Consider the Beta family of loss functions discussed in Buja et al. [6], which elicit the mean over outcomes Y = {0, 1}, with R = [0, 1]. After some manipulation, one can write the loss and identiﬁcation function as follows, for any α, β > 1,

L(r, y) = Z r

0 zα 1(1 z)β 1(z y)dz V (r, y) = rα 1(1 r)β 1(r y) .

While some choices of the parameters yield convex losses, such as α = β = 0 (log loss) and α = β = 1 (squared loss), not all do, e.g. α = 1/5, β = 1/2.

Applying the two-function construction from Section 4, we choose λ(r) = r1/2 α(1 r)1/2 β, giving the identiﬁcation function V (r, y) = r1/2(1 r)1/2(r y), which is itself in the Beta family with α = β = 1/2. Intergrating V yields the following convex loss,

L (r, y) = Z r

0 z1/2(1 z)1/2(z y)dz = arcsin( p

r(1 r) , (7)

also discovered by Buja et al., which serves as a intermediary between log and squared loss.

2. A quadratic property. Let Y = {1, 2, 3}, and Γ(p) = 1

1 4p1p2+2p2 2 2p2 , where Γ(p) = p1 when p2 = 0 for continuity (from L Hˆopital s rule). Here, py denotes the probability outcome y is observed. Some of the level sets of Γ can be seen in Figure 2. A very natural choice of identiﬁcation function for Γ is V (r, 1) = r 1, V (r, 2) = 1

2 + r r2, V (r, 3) = r, as one readily veriﬁes. Yet we see in Figure 1(b) that V ( , 2) is not strictly increasing, so the loss given by integrating V will not be convex.

The set F = {V ( , y)}y Y satisﬁes Conditions 1 3, however, and thus we may use our construction to obtain a positive function λ for which L(r, y) = R r r0 λ(x)V (x, y)dx elicits Γ and is convex in r. Unfortunately, for this particular example, the construction given in the proof of Proposition 2 produces a somewhat unwieldy function λ. Fortunately, while our constructed λ is guaranteed to make λf monotone for every function f in F, it is generally not unique, and in many cases a simpler choice of λ can be found. In particular, our proof shows that any function h satisfying the criteria laid out in Claim 1 of the Appendix will lead to suitable choice of λ; among these criteria are that h(r) = d

dr log λ(r) must lie between m(r) and m(r) for all r. For practical purposes,

0.2 0.4 0.6 0.8 1.0

(a) V ( , 1)

0.2 0.4 0.6 0.8 1.0

(b) V ( , 2)

0.2 0.4 0.6 0.8 1.0

(c) V ( , 3)

0.2 0.4 0.6 0.8 1.0

(d) λ( )V ( , 1)

0.2 0.4 0.6 0.8 1.0

(e) λ( )V ( , 2)

0.2 0.4 0.6 0.8 1.0

(f) λ( )V ( , 3)

Figure 1: The functions V ( , y) are not always increasing for all y Y, but our function λ monotonizes them, as shown in (d) (f).

0.0 0.2 0.4 0.6 0.8 1.0

Pr(Outcome 1)

Pr(Outcome 2)

Figure 2: Level sets of Γ

0.2 0.4 0.6 0.8 1.0

Figure 3: m( ) in solid blue, m( ) in orange, valid h( ) in green and dashed blue.

therefore, we may use the following general technique in lieu of the construction given in the proof of Proposition 2:

1. Compute the bounds m(r) and m(r).

2. Search over some class of practical (e.g. linear) functions for an h which satisﬁes the criteria of Claim 1.

We illustrate this more practical construction in Figure 3; for the case of our quadratic property, the choice h(x) = 4x 1 (shown as dashed blue) sufﬁces, giving us the simpler λ(r) = exp(2x2 x). This choice of λ gives

λ(r)V (r, 1) = exp(2r2 r)(r 1)

λ(r)V (r, 2) = exp(2r2 r)((1/2) + r r2)

λ(r)V (r, 3) = r exp(2r2 r) ,

which we can integrate to obtain a convex loss.

6 Conclusion and future work

We have shown that all real-valued properties over ﬁnite outcomes, which are identiﬁed by a mostlysmooth continuous identiﬁcation function, are convex elicitable. Beyond natural relevance to machine learning, and statistical estimation more broadly, these results bring insights into the area of information elicitation. For example, a generalization of a common prediction market framework, the Scoring Rule Market, is well-deﬁned for any loss function [10, 14]. Yet it is not clear whether practical markets exist for any elicitable property. Among the practical considerations are axioms such as Tractable Trading (TT), which states that participants can compute their optimal trade/action under a budget [2], and Bounded Trader Budget (BTB), which states that traders with arbitrarily small budgets can still fruitfully participate in the market [10]. Our results imply that essentially every continuous real-valued elicitable property over ﬁnite outcomes has a market mechanism which satisﬁes these axioms. There are likely also implications for wagering mechanisms [13] and forecasting competitions [25], among other settings in information elicitation.

There are several avenues for future work, which we outline below.

Relaxing our conditions. We believe one could allow V to be smooth almost everywhere. One may still be able to use the fact that g/f is strictly increasing to have an almost-everywhere deﬁned derivative, but again, there are several challenges to this approach.

Inﬁnite outcomes. A challenging but important extension would be to allow inﬁnite Y, for example, Y = [0, 1] R. As discussed following Theorem 2, many pieces of our argument extend immediately, such as the argument establishing Condition 3. We believe the key hurdle to such an extension will be in Proposition 2, as several quantities become harder to control. As one example, the deﬁnition of h in eq. (6) involves a maximum and minimum which may not be obtained. Extending to inﬁnite outcomes requires the relaxation of our continuity assumption, as many properties of interest have discontinuous identiﬁcation functions in the inﬁnite-outcome space, like the median.

Strongly convex losses. Just as convex loss functions are useful so that empirical risk minimization is a tractable problem, strongly convex losses are even more tractable. Roughly speaking (ignoring the log transformation), if the gap in eq. (5) is bounded away from zero, λf will be increasing at least as fast as some linear function for all f, meaning its integral will be strongly convex. It is not clear what meaningful conditions on Γ sufﬁce for this to hold, however, and a full characterization is far from clear. Similarly, characterizations for exp-concave losses would also be interesting.

Vector-valued properties. Finally, we would like to extend our construction to vector-valued properties Γ : P Rk. In light of our results, this question is only interesting for properties which are not vectors of elicitable properties: if the k components of Γ are themselves elicitable, we may construct a convex loss for each, and the sum will be a convex loss eliciting Γ. Unfortunately, we lack a characterization of elicitable vector-valued properties, so the question of whether all elicitable vector-valued properties are convex elicitable seems even further from reach.

Acknowledgements. We would like to thank Bo Waggoner and Arpit Agarwal for their insights and the discussion which led to this project, and we thank Krisztina Dearborn for consultation on analysis results. Additionally, we would like to thank our reviewers for their feedback and suggestions. This project was funded by National Science Foundation Grant CCF-1657598.

[1] S. Abbott. Understanding analysis. Springer, 2001.

[2] J. D. Abernethy and R. M. Frongillo. A collaborative mechanism for crowdsourcing prediction problems. In Advances in Neural Information Processing Systems 24, pages 2600 2608, 2011.

[3] A. Agarwal and S. Agarwal. On consistent surrogate risk minimization and property elicitation. In JMLR Workshop and Conference Proceedings, volume 40, pages 1 19, 2015.

[4] T. M. Apostol. Mathematical analysis. 1974.

[5] P. L. Bartlett, M. I. Jordan, and J. D. Mc Auliffe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

[6] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classiﬁcation: Structure and applications. 2005.

[7] T. Fissler, J. F. Ziegel, and others. Higher order elicitability and Osband s principle. The Annals of Statistics, 44(4):1680 1707, 2016.

[8] R. Frongillo and I. Kash. Vector-Valued Property Elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1 18, 2015.

[9] R. Frongillo and I. A. Kash. On Elicitation Complexity. In Advances in Neural Information Processing Systems 29, 2015.

[10] R. Frongillo and B. Waggoner. An Axiomatic Study of Scoring Rule Markets. Preprint, 2017.

[11] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007.

[12] N. S. Lambert. Elicitation and evaluation of statistical forecasts. 2018.

[13] N. S. Lambert, J. Langford, J. W. Vaughan, Y. Chen, D. M. Reeves, Y. Shoham, and D. M. Pennock. An axiomatic characterization of wagering mechanisms. Journal of Economic Theory, 156:389 416, 2015.

[14] N. S. Lambert, D. M. Pennock, and Y. Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129 138, 2008.

[15] N. S. Lambert and Y. Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the 10th ACM conference on Electronic commerce, pages 109 118, 2009.

[16] K. H. Osband. Providing Incentives for Better Cost Forecasting. University of California, Berkeley, 1985.

[17] R. L. Pouso. A simple proof of the fundamental theorem of calculus for the lebesgue integral. ar Xiv preprint ar Xiv:1203.1462, 2012.

[18] M. Reid and R. Williamson. Composite binary losses. The Journal of Machine Learning Research, 9999:2387 2422, 2010.

[19] R. Rockafellar. Convex analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1997.

[20] L. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, pages 783 801, 1971.

[21] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, Sept. 2008. Google-Books-ID: HUnqnrp Yt4IC.

[22] I. Steinwart, C. Pasin, R. Williamson, and S. Zhang. Elicitation and Identiﬁcation of Properties. In Proceedings of The 27th Conference on Learning Theory, pages 482 526, 2014.

[23] A. Tewari and P. L. Bartlett. On the consistency of multiclass classiﬁcation methods. The Journal of Machine Learning Research, 8:1007 1025, 2007.

[24] R. C. Williamson, E. Vernet, and M. D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1 52, 2016.

[25] J. Witkowski, R. Freeman, J. W. Vaughan, D. M. Pennock, and A. Krause. Incentive-compatible forecasting competitions. In Proceedings of the Thirty-Second AAAI Conference on Artiﬁcial Intelligence (AAAI-18), 2018.