# All your loss are belong to Bayes

Christian Walder¹,² and Richard Nock¹,²
¹CSIRO Data61, Australia. ²The Australian National University. {first.last}@data61.csiro.au
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Loss functions are a cornerstone of machine learning and the starting point of most algorithms. Statistics and Bayesian decision theory have contributed, via properness, to eliciting over the past decades a wide set of admissible losses in supervised learning, to which most popular choices belong (logistic, square, Matsushita, etc.). Rather than making a potentially biased ad hoc choice of the loss, there has recently been a boost in efforts to fit the loss to the domain at hand while training the model itself. The key approaches fit a canonical link, a function which monotonically relates the closed unit interval to $\mathbb{R}$ and can provide a proper loss via integration. In this paper, we rely on a broader view of proper composite losses and a recent construct from information geometry, source functions, whose fitting alleviates constraints faced by canonical links. We introduce a trick on squared Gaussian Processes to obtain a random process whose paths are compliant source functions with many desirable properties in the context of link estimation. Experimental results demonstrate substantial improvements over the state of the art.

## 1 Introduction

Supervised learning has been an influential framework of machine learning, swayed decades ago by pioneers who managed to put into equations its nontrivial uncertainties in sampling and model selection [Val84, Vap98]. One lighthearted but famous videogame-born summary¹ comes with an unexpected technical easter egg: uncertainty grasps the crux of the framework, the loss. All these works have indeed been predated, from the early seventies, by a rich literature on admissible losses for supervised learning, in statistical decision theory [Sav71], and even earlier in foundational philosophical work [dF49]. The essential criterion for admissibility is properness, the fact that Bayes rule is optimal for the loss at hand. Properness is intensional: it does not provide a particular set of functions to choose a loss from. Over the past decades, a significant body of work has focused on eliciting the set of admissible losses (see [NM20] and references therein); yet, in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life: more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02]. More recent work has aimed to provide machine learning with greater flexibility in the loss [HT92, KS09, KKSK11, NM20], yet these works face significant technical challenges arising from (i) the joint estimation of a non-parametric loss function along with the remainder of the model, and (ii) the specific part of a proper loss which is learned, called a link function, which relates class probability estimation to real-valued prediction [RW10]. In a nutshell, the present work departs from these chiefly frequentist approaches and introduces a new and arguably Bayesian standpoint to the problem (see [GRR] for a related discussion on the nature of such a standpoint).

¹See https://tinyurl.com/y5jnurau or search for "all your Bayes are belong to us, Vapnik quote".
We exploit a finer characterisation of loss functions and a trick on the Gaussian Process (GP) for efficient inference while retaining guarantees on the losses. Experiments show that our approach tends to significantly beat the state of the art [KS09, KKSK11, NM20], and records better results than baselines informed with specific losses or links, which to our knowledge is a first among approaches learning a loss or link.

More specifically, we first sidestep the impediments and constraints of fitting a link by learning a source function, which need only be monotonic, and which allows us to reconstruct a loss via a given link, yielding a proper composite loss [RW10]. The entire construct exploits a fine-grained information-geometric characterisation of Bregman divergences pioneered by Zhang and Amari [Ama12, Ama13, Zha04]. This is our first contribution. Our second contribution exploits a Bayesian standpoint on the problem itself: since properness does not specify a loss to minimise, we do not estimate nor model a loss based on data; instead we perform Bayesian inference on the losses themselves, and more specifically on the source function. From the losses' standpoint, our technique brings the formal advantage of a fine-tuned control of the key parameters of a loss which, in addition to properness, control consistency, robustness, calibration and rates. We perform Bayesian inference within this class of losses by a new and simple trick addressing the requirement of priors over functions which are simultaneously non-parametric, differentiable and guaranteed monotonic. We introduce for that purpose the Integrated Squared Gaussian Process (ISGP), which is itself of independent interest as a convenient model for monotonic functions. The ISGP extends by integration (which yields monotonicity) the squared Gaussian Process, which has itself seen a recent burst of attention as a model for merely non-negative functions [ST03, MM06, LGOR15, WB17, FTS17, LH19]. The ISGP contrasts with previous approaches to learning monotonic functions, which are either non-probabilistic [ABE+55, YW09, APS10, LR14, Bac18, Lim18], resort to a discretisation [HHLK19, KEC19, UKEC19], or are only approximately monotonic due to the use of only a finite set of monotonicity-promoting constraints [RV10, GBCC15, SPV16, LRG+17, YLR+18, ANARL19].

Organisation. In Section 2 we introduce properness and source functions. We define the ISGP model in Section 3, and provide relevant Bayesian inference procedures in Section 4. Numerical experiments are provided in Section 5 before concluding with Section 6. Technical details and an extensive suite of illustrations are provided in the supplementary appendices.

## 2 Definitions and key properties for losses

Figure 1 (diagram labels: prediction, class probability estimation, canonical link, composite link): Relationship between the source, canonical link and composite link defined by (3). Domains and names are given in the context of supervised learning.

Our notations follow [NM20, RW10]. In the context of statistical decision theory as discussed more than seventy years ago [dF49] and later theorised [Sav71], the key properties of a supervised loss function start with ensuring that Bayes rule is optimal for the loss, a property known as properness.

Proper (composite) losses: let $\mathrm{Y} := \{-1, 1\}$ be a set of labels. A loss for binary class probability estimation [BSS05] is a function $\ell : \mathrm{Y} \times [0,1] \to \overline{\mathbb{R}}$ (the closure of $\mathbb{R}$).
Its (conditional) Bayes risk function is the best achievable loss when labels are drawn with a particular positive base-rate, $L(\pi) := \inf_u \mathbb{E}_{Y \sim \mathrm{Bernoulli}(\pi)}\, \ell(Y, u)$, where $\Pr[Y = 1] = \pi$. A loss for class probability estimation $\ell$ is proper iff Bayes prediction locally achieves the minimum everywhere: $L(\pi) = \mathbb{E}_Y\, \ell(Y, \pi),\ \forall \pi \in [0,1]$, and strictly proper if Bayes is the unique minimum. Fitting a classifier's prediction $h(z) \in \mathbb{R}$ ($z$ being an observation) into some $u \in [0,1]$ is done via an invertible link function $\chi : [0,1] \to \mathbb{R}$ connecting real-valued prediction and class probability estimation. A proper loss augmented with a link, $\ell(y, \chi^{-1}(x))$ with $x \in \mathbb{R}$ [RW10], is called proper composite. There exists a particular link, uniquely defined (up to multiplication or addition by a scalar [BSS05]) for any proper loss, the canonical link, $\chi := (-L)'$ [RW10, Section 6.1]: for example, the logistic loss yields the canonical link $\chi(u) = (-L)'(u) = \log(u/(1-u))$, with inverse the well-known sigmoid $\chi^{-1}(x) = ((-L)')^{-1}(x) = (1 + \exp(-x))^{-1}$. A proper loss augmented with its canonical link instead of any composite link is called proper canonical. We remind that the Bayes risk of a proper loss is concave [RW10]. There exists a particularly useful characterisation of proper composite losses [NM20, Theorem 1]: a loss is proper composite iff

$$\ell(y, \chi^{-1}(x)) = D_{-L}\!\left(\tilde{y} \,\|\, \chi^{-1}(x)\right) = D_{(-L)^\star}\!\left((-L)' \circ \chi^{-1}(x) \,\|\, (-L)'(\tilde{y})\right), \qquad (1)$$

where $\tilde{y} := (y+1)/2 \in \{0,1\}$, $D_{-L}$ is the Bregman divergence with generator $-L$ and $(-L)^\star$ is the Legendre conjugate of $-L$. Let $G$ be convex differentiable. Then we have $D_G(x \,\|\, x') := G(x) - G(x') - (x - x')\,G'(x')$ [Bre67] and $G^\star(x) := \sup_{x' \in \mathrm{dom}(G)} \{x x' - G(x')\}$ [BV04]. We assume in this paper that losses are twice differentiable for readability (differentiability assumption(s) can be alleviated [RW10, Footnote 6]) and strictly proper for the canonical link to be invertible, both properties being satisfied by popular choices (log, square, Matsushita losses, etc.).

Margin losses: an important class of losses for real-valued prediction are functions of the kind $G(yh)$ where $G : \mathbb{R} \to \mathbb{R}$, $y \in \{-1, 1\}$ and $yh$ is called a margin in [RW10]. There exists a rich theory on the connections between margin losses and proper losses: if $G(x) := (-L)^\star(-x)$ for a proper loss $\ell$ whose Bayes risk is symmetric around $1/2$ (equivalent to having no class-conditional misclassification costs), then minimising the margin loss is equivalent to minimising the proper canonical loss [NN08, Lemma 1]. In the more general case, given a couple $(G, \chi)$, there exists a proper loss $\ell$ such that $G(y\chi(u)) = \ell(y, u)$ for all $y, u$ iff [RW10, Corollary 14]:

$$\chi^{-1}(x) = \frac{G'(-x)}{G'(-x) + G'(x)}, \qquad \forall x. \qquad (2)$$

When this holds, we say that the couple $(G, \chi)$ is representable as a proper loss. When properness is too tight a constraint, one can rely on classification calibration [BJM06], which relaxes the optimality condition to just predicting labels. Such a notion is particularly interesting for margin losses. We assume in this paper that margin losses are twice differentiable, a property satisfied by most popular choices (logistic, square, Matsushita, etc.), but notably not satisfied by the hinge loss.

Key quantities for a good loss: there are two crucial quantities that indicate the capability of a function to be a good choice for a loss, its Lipschitz constant and weight function. For any function $G : \mathbb{R} \to \mathbb{R}$, the Lipschitz constant of $G$ is $\mathrm{lip}\, G := \sup_{x \neq x'} |G(x) - G(x')| / |x - x'|$ and its weight function is $w_G(x) := -G''(x)$. Controlling the Lipschitz constant of a loss is important for robustness [CBG+17] and mandatory for consistency [Tel13]. The weight function, on the other hand, defines properness [Sch89] but also controls optimisation rates [KM99, NN08] and defines the geometry of the loss [AN00, Section 3]. More than the desirable properness, in fitting or tuning a loss, one ideally needs to make sure that levers are easily accessible to control these two quantities.
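To ground these definitions, the short check below (a standalone sketch, not from the paper; it assumes the standard log loss, whose Bayes risk is $L(\pi) = -\pi\log\pi - (1-\pi)\log(1-\pi)$, and the logistic margin loss $G(x) = \log(1 + e^{-x})$) verifies numerically that (2) recovers the sigmoid inverse link and that the primal Bregman form in (1) reproduces the log loss.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Derivative of the logistic margin loss G(x) = log(1 + e^{-x}).
Gp = lambda x: -sigmoid(-x)                       # G'(x) = -1 / (1 + e^x)

def chi_inv(x):                                   # Eq. (2): inverse link recovered from G
    return Gp(-x) / (Gp(-x) + Gp(x))

def negL(u):                                      # -L(u) = u log u + (1-u) log(1-u)
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return u * np.log(u) + (1 - u) * np.log1p(-u)

def negL_prime(u):                                # (-L)'(u) = log(u / (1-u)), canonical link
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return np.log(u) - np.log1p(-u)

def bregman_negL(a, b):                           # D_{-L}(a || b)
    return negL(a) - negL(b) - (a - b) * negL_prime(b)

x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(chi_inv(x), sigmoid(x))        # (2) yields the sigmoid inverse link
for y in (-1, 1):
    y_tilde = (y + 1) / 2
    log_loss = -y_tilde * np.log(sigmoid(x)) - (1 - y_tilde) * np.log(sigmoid(-x))
    assert np.allclose(bregman_negL(y_tilde, sigmoid(x)), log_loss)   # primal form of (1)
```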
(u, v)-geometric structure: a proper loss defines a canonical link. A proper composite loss therefore makes use of two link functions. Apart from flexibility, there is an information-geometric justification for such a more general formulation, linked to a finer characterisation of the information geometry of Bregman divergences introduced in [Ama12, Ama13, Zha04] and more recently analysed, the (u, v)-geometric structure [NNA16]. Here $u, v$ are two differentiable invertible functions and the gradient of the generator of the divergence is given by $u \circ v^{-1}$. This way, the two dual forms of a Bregman divergence (in (1)) can be represented using the same coordinate system, known as the source, which specifies the dual coordinate system and Riemannian metric induced by the divergence. We consider the $(\nu, \chi)$-geometric structure of the divergence $D_{-L}$, which therefore gives for $\nu$:

$$\nu = (-L)' \circ \chi^{-1}, \qquad (3)$$

and yields a local composite estimate of Bayes rule for a corresponding $x \in \mathbb{R}$ as in [NM20, Eq. 4]:

$$\hat{p}(y = 1 \,|\, x; \nu, L) := \chi^{-1}(x) = ((-L)')^{-1} \circ \nu(x). \qquad (4)$$

Figure 1 depicts $\nu$, which we call the source as well. The properties of $\nu$ in (3) are summarised below.

Definition 1. A source (function) $\nu$ is a differentiable, strictly increasing function.

Any loss completed with a source gives a proper composite loss. There are two expressions of interest for a loss involving source $\nu$, which follow from (1) and a margin loss $G(\cdot)$, with $\tilde{y} \in \{0,1\}$, $x \in \mathbb{R}$ and $u \in [0,1]$ (with the correspondence $x := \chi(u)$ and $\tilde{y} := (y+1)/2$):

$$\ell_\nu(y, u) := D_{(-L)^\star}\!\left(\nu(x) \,\|\, (-L)'(\tilde{y})\right), \qquad (5)$$
$$G_\nu(y\chi(u)) := G\!\left(y \cdot (\nu^{-1} \circ (-L)')(u)\right). \qquad (6)$$

When it is clear from the context, we shall remove the index $\nu$ for readability. Instead of learning the canonical link of a loss, which poses significant challenges due to its closed domain [KKSK11, NM20], we choose to infer the source given a loss. As we now see, this can be done efficiently using Gaussian Process inference and with strong guarantees on the losses involved.

## 3 The Integrated Squared Gaussian Process Model

Definition. An Integrated Squared Gaussian Process (ISGP) is a non-parametric random process whose sample paths turn out to be source functions.

Definition 2. An ISGP is defined by the following model:

$$\nu_0 \sim \mathcal{N}(\mu, \gamma^{-1}); \qquad f \sim \mathrm{GP}(k(\cdot, \cdot)) \;\Leftrightarrow\; f(\cdot) = \mathbf{w}^\top \phi(\cdot),\ \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Lambda); \qquad \nu(x) := \nu_0 + \int_0^x f^2(z)\, dz.$$

Here, $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_M(x))^\top$ represents $M$ basis functions, $\Lambda \in \mathbb{R}^{M \times M}$ is a diagonal matrix with elements $\lambda_m > 0$, and $\mathrm{GP}(k(\cdot, \cdot))$ is the zero-mean Gaussian process with kernel $k : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Hence the above equivalence (denoted $\Leftrightarrow$) implies the Mercer expansion

$$k(x, z) = \phi(x)^\top \Lambda\, \phi(z). \qquad (7)$$

Note that while for finite $M$ this construction implies a parametric model, parametric approximations are common in the GP literature [RW05]. The ISGP is Gaussian in the parameters $\mathbf{w}$ and $\nu_0$, and admits a simple form for $\nu$:

$$\nu(x) = \nu_0 + \int_0^x f^2(z)\, dz = \nu_0 + \int_0^x \left(\mathbf{w}^\top \phi(z)\right)^2 dz = \nu_0 + \mathbf{w}^\top \psi(x)\, \mathbf{w}, \qquad (8)$$

with the positive semi-definite matrix

$$\psi(x) := \int_0^x \phi(z)\, \phi(z)^\top dz \in \mathbb{R}^{M \times M}. \qquad (9)$$

It is our choice of squaring of $f$, previously exploited to model non-negativity [ST03, MM06, LGOR15, WB17, FTS17, LH19], which leads to this simple form for the monotonic $\nu$.
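To make Definition 2 and (8) concrete, the sketch below (not the authors' code; the cosine/sine features and all hyper-parameter values are assumed, illustrative stand-ins in the spirit of the trigonometric kernel introduced later in Section 4.3) draws a few ISGP sample paths, replacing the closed-form $\psi(x)$ of (9) by a cumulative trapezoidal integral of $f^2$. Each path is non-decreasing by construction, since $\nu'(x) = f^2(x) \geq 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, gamma = 16, 0.0, 1.0                      # illustrative model sizes / hyper-parameters
m = np.arange(1, M // 2 + 1)
lam = np.concatenate([0.1 / 1.2 ** m] * 2)       # diagonal of Lambda (assumed values)

def phi(x):
    """M/2 cosine and M/2 sine basis functions evaluated at the points x."""
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)   # shape (len(x), M)

x = np.linspace(0.0, 4.0, 200)
for _ in range(5):                               # a few ISGP sample paths
    w = rng.normal(size=M) * np.sqrt(lam)        # w ~ N(0, Lambda)
    nu0 = rng.normal(mu, 1.0 / np.sqrt(gamma))   # nu_0 ~ N(mu, gamma^{-1})
    f2 = (phi(x) @ w) ** 2                       # f^2(x) = (w^T phi(x))^2 >= 0
    # nu(x) = nu_0 + int_0^x f^2(z) dz, here via the cumulative trapezoidal rule
    nu = nu0 + np.concatenate([[0.0], np.cumsum(0.5 * (f2[1:] + f2[:-1]) * np.diff(x))])
    assert np.all(np.diff(nu) >= 0)              # every path is a valid (monotone) source
```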
Sanity checks of the ISGP prior for loss inference. We now formally analyse the goodness of fit of an ISGP in the context of loss inference. The following Lemma is immediate and links the ISGP to inference on functions expressed as Bregman divergences, and therefore to proper losses (Section 2).

Lemma 1. $\nu(x)$ as in Definition 2 is a source function with probability one.

Remark 3.1. Bregman divergences are also the analytical form of losses in other areas of machine learning (e.g. unsupervised learning [BGW05]), so ISGPs may, via Lemma 1, be of broader interest.

The following Theorem safeguards inference on loss functions with an ISGP.

Theorem 2. For any convex $G$ and any ISGP, the following holds with probability one: (i) if $G$ is decreasing, there exists a canonical link $(-L)'$ such that the couple $(G, \chi)$, where $\chi$ is obtained from (3), is representable as a proper loss; (ii) if $G$ is classification calibrated, the margin loss $G_\nu(x) := G \circ \nu^{-1}(x)$ is classification calibrated. (Proof in Appendix C.)

To summarise, the whole support of an ISGP is fully representable as proper losses, but the proof of point (i) is not necessarily constructive as it involves a technical invertibility condition. On the other hand, part (ii) relaxes properness to classification calibration, but the invertibility condition is weakened to that of $\nu$ for the margin loss involved. Theorem 2 (and the title of the present work) begs for a converse on the condition(s) on the ISGP ensuring that any proper composite loss can be represented in its support. Fortunately, a sufficient condition is already known [MXZ06].

Theorem 3. Suppose kernel $k$ in Definition 2 is universal. Then for any proper loss $\ell$ and any composite link $\chi$, there exists $\nu$ as in Definition 2 such that the proper composite loss $\ell(y, \chi^{-1}(x))$ can be represented as in (5): $\ell(y, \chi^{-1}(x)) = \ell_\nu(y, u)$, $\forall u \in [0,1]$, $y \in \{-1, 1\}$, with $x := \chi(u)$.

To avoid a paper laden with notation, we refer to [MXZ06] for the exact definition of universality which, informally, postulates that the span of the kernel can approximate any function to arbitrary precision. Interestingly, a substantial number of kernels are universal, including some of particular interest in our setting (see below). We now investigate a desirable property of the first-order moment of a loss computed from an ISGP. We say that the ISGP is unbiased iff $\mathbb{E}_\nu[\nu(x)] = x$, $\forall x$.

Theorem 4. Let $\{(x, y)\} \subset \mathbb{R} \times \mathrm{Y}$ be a sample of labelled training values. For any unbiased ISGP and any proper canonical loss $\ell$, the proper composite loss formed by $\ell$ and the composite link $\chi := \nu^{-1} \circ (-L)'$ of (3) satisfies:

$$\mathbb{E}_{(x,y)}\left[\ell\!\left(y, ((-L)')^{-1}(x)\right)\right] \;\leq\; \mathbb{E}_{(x,y),\nu}\left[\ell_\nu\!\left(y, \chi^{-1}(x)\right)\right].$$

(Proof in Appendix C.) In the context of Bayesian inference this means that the prior uncertainty in $\nu$ induces a prior on losses whose expected loss upper bounds that of the canonical link. We exploit this property to initialise our Bayesian inference scheme with a fixed canonical link (see Section 4.2). Of course, this property follows from Theorem 4 if the ISGP is unbiased, which we now show is trivial to guarantee. We say that a kernel $k$ is translation invariant iff $k(x, x') = k(0, x - x')$, $\forall x, x'$.

Theorem 5. For any translation invariant $k$, the ISGP $\nu$ satisfies $\mathbb{E}_\nu[\nu(x)] = \mu + k(0, 0)\, x$.

Proof. By linearity of expectation this mean function equals $\mathbb{E}[\nu_0] = \mu$ plus the simple term

$$\mathbb{E}_{f \sim \mathrm{GP}(k(\cdot,\cdot))}\left[\int_0^x f^2(z)\, dz\right] = \int_0^x \mathbb{E}\left[f^2(z)\right] dz = \int_0^x k(z, z)\, dz = x\, k(0, 0), \qquad (10)$$

yielding the statement of the Theorem.

Hence, if $\mu = 0$ and $k(0, 0) = 1$ in Definition 2, the ISGP is unbiased. Interestingly, some translation invariant kernels are universal [MXZ06, Section 4].
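Theorem 5 is easy to confirm empirically. The sketch below (standalone and illustrative, with an assumed cosine/sine basis rather than the trigonometric kernel exactly as tuned in Section 4.3; the $\nu_0$ noise is dropped since it has mean $\mu$ and does not affect the check) rescales $\Lambda$ so that $k(0,0) = 1$ and verifies that the Monte Carlo average of $\nu(x)$ is close to $\mu + k(0,0)\, x = x$.

```python
import numpy as np

rng = np.random.default_rng(1)
M, mu = 16, 0.0
m = np.arange(1, M // 2 + 1)
lam = np.concatenate([0.5 ** m] * 2)             # illustrative eigenvalues
lam = lam / lam[: M // 2].sum()                  # rescale so that k(x, x) = k(0, 0) = 1

def phi(x):                                      # cosine/sine features (translation-invariant k)
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)

x = np.linspace(0.0, 3.0, 50)
S = 20000
W = rng.normal(size=(S, M)) * np.sqrt(lam)       # S draws of w ~ N(0, Lambda)
F2 = (W @ phi(x).T) ** 2                         # f^2 on the grid, shape (S, 50)
dx = np.diff(x)
integral = np.cumsum(0.5 * (F2[:, 1:] + F2[:, :-1]) * dx, axis=1)
nus = mu + np.concatenate([np.zeros((S, 1)), integral], axis=1)

# Theorem 5 predicts E[nu(x)] = mu + k(0,0) * x = x here; the gap is Monte Carlo error.
print(np.max(np.abs(nus.mean(axis=0) - x)))
```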
We now dig into how an ISGP allows us to control the properties of supervised losses that govern robustness and statistical consistency, among others [CBG+17, Tel13]. Denote $\|\phi\|^2_{\max} := \sup_x \|\phi(x)\|_2^2$, which is simple to upper-bound for the $\phi$ we will adopt later (see (14)).

Lemma 6. For an ISGP $\nu$, the Lipschitz constant of $\ell_\nu$ in (5) satisfies $\mathrm{lip}\, \ell_\nu \leq \|\phi\|^2_{\max}\, \|\mathbf{w}\|_2^2\, \mathrm{lip}\, \ell$.

Proof. By Cauchy-Schwartz, $\nu'(x) = f^2(x) = (\mathbf{w}^\top \phi(x))^2 \leq \sup_x (\mathbf{w}^\top \phi(x))^2 \leq \sup_x \|\phi(x)\|_2^2\, \|\mathbf{w}\|_2^2 = \|\phi\|^2_{\max}\, \|\mathbf{w}\|_2^2$. The Lemma then follows from the chain rule on $\ell_\nu$ in (5).

Similarly, we have $\chi = \nu^{-1} \circ (-L)'$ from (3) and so $\chi'(u) = w_\ell(u) \cdot \left(\mathbf{w}^\top (\phi\phi^\top)(\chi(u))\, \mathbf{w}\right)^{-1}$, yielding this time a Lipschitz constant for $G_\nu$ of (6) which is proportional to the weight of $\ell$ and inversely proportional to $f^2$. This shows that the prior's uncertainty still allows for a probabilistic handle on the Lipschitz constant, via $\Lambda$, the eigenspectrum in the Mercer expansion.

## 4 Inference with Integrated Squared Gaussian Processes

In Section 4.1 we give an approximate inference method for the ISGP with simple i.i.d. likelihood functions; this is analogous to basic GP regression and classification as in e.g. [RW05]. The subsequent Section 4.2 builds on this to provide an inference method for proper losses for the generalised linear model; this is a Bayesian analog of the loss fitting methods in [KKSK11, NM20].

### 4.1 Basic Laplace-approximate ISGP inference for a fixed likelihood function

Efficient variational inference for the ISGP is an open problem due to the intractable integrals this entails. It is straightforward, however, to construct a Laplace approximation to the posterior in the parameters $\Gamma = \{\mathbf{w}, \nu_0\}$, provided that e.g. the likelihood function for the data $\mathcal{D} = \{x_n, y_n\}_{n=1}^N$ factorises in the usual way as $\log p(\mathcal{D} \,|\, \nu, \Theta) = \sum_{n=1}^N \log p(y_n \,|\, \nu(x_n), \Theta)$. Here $\Theta$ represents the hyper-parameters, which are $\gamma$, $\mu$, any parameters of the kernel $k(\cdot, \cdot)$, and any parameters of the likelihood function. We use automatic differentiation and numerical optimisation to obtain the posterior mode $\hat{\Gamma} = \mathrm{argmax}_\Gamma \log p(\mathcal{D}, \Gamma)$. We then use automatic differentiation once more to compute the Hessian

$$H := -\nabla^2_\Gamma \log p(\mathcal{D}, \Gamma)\big|_{\Gamma = \hat{\Gamma}}. \qquad (11)$$

Letting $\hat{\Sigma}_\Gamma = H^{-1}$, the approximate posterior is then $p(\Gamma \,|\, \mathcal{D}) \approx \mathcal{N}(\Gamma \,|\, \hat{\Gamma}, \hat{\Sigma}_\Gamma) =: q(\Gamma \,|\, \mathcal{D})$.

Predictive distribution: posterior samples of the monotonic function $\nu$ are obtained from samples $\Gamma^{(s)} = \{\mathbf{w}^{(s)}, \nu_0^{(s)}\}$ from $q$ via the deterministic relation (8). For the predictive mean function, we exploit the property that the predictive distribution is the sum of the Gaussian $\nu_0$ plus the well-studied quadratic function of a multivariate normal. If we let $\hat{\nu}_0$ be the element of $\hat{\Gamma}$ corresponding to $\nu_0$, and let $\hat{\Sigma}_{\mathbf{w}}$ be the sub-matrix of $\hat{\Sigma}_\Gamma$ corresponding to $\mathbf{w}$, etc., then the following holds.

Lemma 7. Using the notations just defined, the mean function may be written as

$$\mathbb{E}_{q(\Gamma | \mathcal{D})}\left[\nu(x)\right] = \hat{\nu}_0 + \mathrm{tr}\!\left(\psi(x)\, \hat{\Sigma}_{\mathbf{w}}\right) + \hat{\mathbf{w}}^\top \psi(x)\, \hat{\mathbf{w}}.$$

Proof. We have $\mathbb{E}_{q(\Gamma|\mathcal{D})}[\nu(x)] = \hat{\nu}_0 + \mathbb{E}_{\mathcal{N}(\mathbf{w} | \hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}})}\left[\mathbf{w}^\top \psi(x)\, \mathbf{w}\right]$ and, for $\mathbf{w} \sim \mathcal{N}(\mathbf{w} \,|\, \hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}})$,

$$\mathbb{E}\left[\mathbf{w}^\top \psi(x)\, \mathbf{w}\right] = \mathbb{E}\left[\mathrm{tr}\!\left(\psi(x)\, \mathbf{w}\mathbf{w}^\top\right)\right] = \mathrm{tr}\!\left(\psi(x)\, \mathbb{E}\!\left[\mathbf{w}\mathbf{w}^\top\right]\right) = \mathrm{tr}\!\left(\psi(x)\, (\hat{\Sigma}_{\mathbf{w}} + \hat{\mathbf{w}}\hat{\mathbf{w}}^\top)\right),$$

which leads to the required result.

This mean function differs from that of a simple GP, which is linear w.r.t. $\hat{\mathbf{w}}$. Closed form expressions for the higher moments of $\mathbf{w}^\top \psi(x)\, \mathbf{w}$ are provided in [MP92, Section 3.2b: Moments of a Quadratic Form].
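The closed form of Lemma 7 is easy to check numerically. The sketch below is illustrative only: $\psi(x)$ is integrated numerically, and the posterior parameters $\hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}}$ are random placeholders rather than the output of an actual Laplace fit. It compares the closed-form mean function against a Monte Carlo average over draws from $q$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 8
m = np.arange(1, M // 2 + 1)

def phi(x):                                      # illustrative cosine/sine basis
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)

def psi(x, n_grid=2000):
    """Numerical psi(x) = int_0^x phi(z) phi(z)^T dz (trapezoidal rule)."""
    z = np.linspace(0.0, x, n_grid)
    P = phi(z)
    w_trap = np.full(n_grid, z[1] - z[0]); w_trap[[0, -1]] *= 0.5
    return np.einsum('n,ni,nj->ij', w_trap, P, P)

# Placeholder Laplace-posterior parameters, standing in for the output of Section 4.1.
nu0_hat = 0.3
w_hat = 0.5 * rng.normal(size=M)
A = 0.1 * rng.normal(size=(M, M))
Sigma_w = A @ A.T + 1e-3 * np.eye(M)

x = 2.0
Px = psi(x)
closed_form = nu0_hat + np.trace(Px @ Sigma_w) + w_hat @ Px @ w_hat   # Lemma 7

W = rng.multivariate_normal(w_hat, Sigma_w, size=200000)              # draws of w from q
monte_carlo = nu0_hat + np.einsum('si,ij,sj->s', W, Px, W).mean()
print(closed_form, monte_carlo)   # the two agree up to Monte Carlo error
```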
We can easily replicate the use of Jensen's inequality in the proof of Theorem 4 to obtain the following guarantee for the posterior mean function. We recall that $L$ is the Bayes risk of a proper loss $\ell$ and that $\chi^{-1} := ((-L)')^{-1} \circ \nu$ is the construct of the (inverse) composite link using source $\nu$ (of (3)).

Theorem 8. For any sample of labelled data points $\{(x, y)\} \subset \mathbb{R} \times \mathrm{Y}$, any proper loss $\ell$ and source $\nu$ sampled from the approximate posterior $q(\Gamma \,|\, \mathcal{D})$, we have:

$$\mathbb{E}_{(x,y),\nu}\left[\ell_\nu\!\left(y, \chi^{-1}(x)\right)\right] \;\geq\; \mathbb{E}_{(x,y)}\left[D_{(-L)^\star}\!\left(\hat{\nu}_0 + \mathrm{tr}\!\left(\psi(x)\hat{\Sigma}_{\mathbf{w}}\right) + \hat{\mathbf{w}}^\top \psi(x)\, \hat{\mathbf{w}} \,\Big\|\, (-L)'(\tilde{y})\right)\right].$$

Note that the l.h.s. above takes an expectation with respect to the approximate posterior of $\nu$ while the r.h.s. involves the parameters of the approximate posterior mean function of $\nu$. This clarifies the importance of Bayesian modelling of $\nu$, as the resulting expected loss merely upper bounds that for the fixed mean function. Alternatively, the result suggests a cheap approximation to marginalising (4) w.r.t. $\nu$ for prediction, namely plugging the mean function for $\nu$ into that equation.

Marginal likelihood approximation and optimisation: we follow [SF06] and use automatic differentiation to achieve the same result while avoiding the manual gradient calculations of the usual approach in the GP literature [RW05, Section 3.4]. See Section D for further details on the marginal likelihood, computational complexity, and likelihood functions for the univariate case.

### 4.2 Using the basic ISGP inference to fit Bayes estimates via Expectation Maximisation

At this stage, inferring the loss in a machine learning model boils down to simply placing an ISGP prior on the source function and doing Bayesian inference. In accordance with part (i) of Theorem 2, this induces a posterior over sources which in turn implies a distribution over proper composite losses. We now present an effective inference procedure for the generalised linear model, which has been the standard model for closely related isotonic regression approaches to the problem [KS09, KKSK11, NM20]. We first choose a proper loss $\ell$, with Bayes risk $L$, to get the proper composite estimate (4). Here we consider the log loss, and therefore the inverse sigmoid for $(-L)'$, although we emphasise that our method works with any proper loss in which the source is embedded.

Model: to summarise, given a dataset $\mathcal{D} = \{\mathbf{z}_n, y_n\}_{n=1,\ldots,N} \subset \mathbb{R}^D \times \mathrm{Y}$, we model $\hat{p}(x) := ((-L)')^{-1} \circ \nu(x)$ from (4) with $((-L)')^{-1}(x) = (1 + \exp(-x))^{-1}$, to obtain the classification model

$$y_n \,|\, \mathbf{z}_n \sim \mathrm{Bernoulli}(\hat{p}(x_n)); \qquad x_n = \boldsymbol{\beta}^\top \mathbf{z}_n, \qquad (12)$$

where $\nu \sim \mathcal{F}$ and $\mathcal{F}$ is a prior over functions such as the GP or ISGP. We leave the prior on the linear model weight vector $\boldsymbol{\beta} \in \mathbb{R}^D$ unspecified as we perform maximum (marginal) likelihood on that parameter. This model generalises logistic regression, which we recover for $\nu(x) = x$.

Inference: we use Expectation Maximisation (EM) to perform maximum likelihood in $\boldsymbol{\beta}$, marginal of $\nu$. In accordance with Theorem 4, we initialise $\boldsymbol{\beta}^{(\mathrm{old})}$ using the traditional fixed link $((-L)')^{-1}$. In the E-step, the $x_n = \boldsymbol{\beta}^{(\mathrm{old})\top} \mathbf{z}_n$ are fixed and we use the Laplace approximate inference scheme of Section 4.1 to compute the posterior in $\nu$ (or equivalently $\Gamma = \{\mathbf{w}, \nu_0\}$), taking the likelihood function to be that of $y_n \,|\, x_n$ from (12) above. Let that approximate posterior, previously denoted by $q(\Gamma \,|\, \mathcal{D})$, be denoted here by $q(\nu \,|\, \boldsymbol{\beta}^{(\mathrm{old})})$. The M-step then updates $\boldsymbol{\beta}$ by (letting $\mathbf{y} = \{y_1, y_2, \ldots, y_N\}$ and $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$),

$$\boldsymbol{\beta}^{(\mathrm{new})} = \mathrm{argmax}_{\boldsymbol{\beta}}\, Q(\boldsymbol{\beta} \,|\, \boldsymbol{\beta}^{(\mathrm{old})}); \qquad Q(\boldsymbol{\beta} \,|\, \boldsymbol{\beta}^{(\mathrm{old})}) = \mathbb{E}_{q(\nu | \boldsymbol{\beta}^{(\mathrm{old})})}\left[\log p(\mathbf{y} \,|\, \mathbf{x}, \nu)\right].$$

Although the expectation for $Q$ is analytically intractable, the univariate setting suggests that Monte Carlo will be effective. We draw $S$ samples $\nu^{(s)}$, $s = 1, \ldots, S$, from $q(\nu \,|\, \boldsymbol{\beta}^{(\mathrm{old})})$, and approximate

$$\mathbb{E}_{q(\nu | \boldsymbol{\beta}^{(\mathrm{old})})}\left[\log p(\mathbf{y} \,|\, \mathbf{x}, \nu)\right] \approx \frac{1}{S} \sum_{s=1}^S \sum_{n=1}^N \log \mathrm{Bernoulli}\!\left(y_n \,\Big|\, ((-L)')^{-1} \circ \nu^{(s)}(\boldsymbol{\beta}^\top \mathbf{z}_n)\right),$$

where $\nu^{(s)}$ is given by (8). The above expression may be automatically differentiated and optimised using standard numerical tools.
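As a rough illustration of this M-step (a sketch under assumed shapes, not the authors' implementation: the posterior draws $(\nu_0^{(s)}, \mathbf{w}^{(s)})$ are random stand-ins for an actual E-step, the integral in $\nu^{(s)}$ is approximated by fixed quadrature, plain Adam replaces L-BFGS, and the data are synthetic), the Monte Carlo objective can be written directly in PyTorch and differentiated with respect to $\boldsymbol{\beta}$.

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)
N, D, M, S = 200, 5, 16, 8                       # illustrative sizes
Z = torch.randn(N, D)                            # synthetic features z_n
y = torch.randint(0, 2, (N,)).double()           # synthetic labels as 0/1

# Stand-ins for S posterior draws (nu_0^{(s)}, w^{(s)}) from the E-step of Section 4.1.
nu0_s = 0.1 * torch.randn(S)
w_s = 0.3 * torch.randn(S, M)

m = torch.arange(1, M // 2 + 1).double()
def phi(x):                                      # cosine/sine features
    arg = torch.pi * 0.25 * x.unsqueeze(-1) * m
    return torch.cat([torch.cos(arg), torch.sin(arg)], dim=-1)

t = torch.linspace(0.0, 1.0, 64)                 # quadrature nodes for int_0^x f^2(z) dz
def nu(x, nu0, w):                               # nu(x) = nu0 + x * mean_t f^2(t * x)
    f2 = (phi(x.unsqueeze(-1) * t) @ w) ** 2     # shape (N, 64)
    return nu0 + x * f2.mean(dim=-1)

beta = torch.zeros(D, requires_grad=True)
opt = torch.optim.Adam([beta], lr=0.05)
for _ in range(200):                             # gradient ascent on the Monte Carlo Q
    x = Z @ beta                                 # x_n = beta^T z_n
    eta = torch.stack([nu(x, nu0_s[s], w_s[s]) for s in range(S)])   # (S, N) logits
    # Bernoulli with probs = sigmoid(eta), i.e. ((-L)')^{-1} applied to the source output
    Q = torch.distributions.Bernoulli(logits=eta).log_prob(y).mean(0).sum()
    opt.zero_grad(); (-Q).backward(); opt.step()
```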
To achieve this efficiently under the ISGP prior for $\mathcal{F}$, we implement a custom derivative function based on the relation

$$\nu'(x) = f^2(x) = \left(\mathbf{w}^\top \phi(x)\right)^2, \qquad (13)$$

rather than requiring the automatic differentiation library to differentiate through $\psi$ in (8). Given our approximate posterior, probabilistic predictions are obtained by marginalising out $\nu$ in a manner analogous to a GP classifier [RW05].

Computational Complexity. Our $O(M^2 N)$ per-EM-iteration algorithm admits control of $M$, a meaningful knob to control complexity. This compares favourably with the state of the art [NM20, Section 5], the fastest option of which costs $O(N \log N)$ per iteration at the expense of a rather involved data structure, without which their procedure costs $O(N^2)$.

### 4.3 The trigonometric kernel

We introduce a novel kernel designed to conveniently lend itself to the ISGP. See Appendix F for the requirements this entails, and a discussion of a tempting yet less convenient alternative (the standard Nyström approximation [RW05]). Our main insight is that we need not fix the kernel and derive the corresponding Mercer expansion; even if such an expansion is available, the integrals for $\psi(x)$ are generally intractable. Instead we recall that $k(x, z) = \phi(x)^\top \Lambda\, \phi(z) = \sum_{m=1}^M \lambda_m\, \phi_m(x)\, \phi_m(z)$, and we let the $(\lambda_m, \phi_m(x))$ pairs (of which there are an even number, $M$) be given by the following union of two sets of $M/2$ pairs,

$$\left\{\left(b/a^m,\ \cos(\pi m c x)\right)\right\}_{m=1}^{M/2} \ \cup\ \left\{\left(b/a^m,\ \sin(\pi m c x)\right)\right\}_{m=1}^{M/2}, \qquad (14)$$

on the domain $x \in [-1/c, +1/c]$, where $a > 1$ and $b > 0$ are input and output length-scale hyper-parameters and $M \approx 100$ is chosen large enough to provide sufficient flexibility. This is related to (i) the construction in [WB17] (but has the advantage of admitting closed form expressions for $k$), and (ii) the random features approaches of e.g. [RR08, LGQCRFV10, CBMF17] (which differ by approximating a fixed covariance with a trigonometric basis). It is easy to show that the kernel is translation invariant and that the prior variance, which is especially relevant here due to Theorem 5, is given by

$$k(x, x) = k(0, 0) = b\,(1 - a^{-M/2})/(a - 1). \qquad (15)$$

We give expressions for $k(x, z)$ and $\psi(x)$ in Appendix E, and an illustration in Figure 4.

## 5 Experiments

We provide illustrative examples and quantitative comparisons of the ISGP prior in univariate regression/classification, and inferring proper losses. The code for the real world problems of learning the loss is available online.² We fixed $M = 64$ in (14). Classification problems used $((-L)')^{-1} = \sigma$ with $\sigma(x) = 1/(1 + \exp(-x))$. Optimisation was performed with L-BFGS [BLNZ95]. Gradients were computed with automatic differentiation (with the notable exception of (13)) using PyTorch [PGM+19]. The experiments took roughly two CPU months.

²https://gitlab.com/cwalder/linkgistic

Table 1: Mean test set negative log likelihoods for various isotonic regression methods. See text for details.

| Train | Input | GP | ISGP | PAVA | Linear |
|---|---|---|---|---|---|
| Large | accel. | 2080.9 | 2083.4 | 2055.0 | 2105.9 |
| | displ. | 788.1 | 780.6 | 760.7 | 907.2 |
| | power | 763.4 | 763.9 | 755.5 | 1002.2 |
| | weight | 735.5 | 750.7 | 769.9 | 788.4 |
| Small | accel. | 2173.3 | 2216.8 | 2212.9 | 2199.6 |
| | displ. | 871.0 | 849.4 | 871.4 | 950.9 |
| | power | 827.3 | 811.8 | 826.4 | 1033.5 |
| | weight | 777.2 | 770.7 | 835.8 | 815.8 |
Table 2: Test AUC for generalised linear models with various link methods (ordering in decreasing average). See text for details.

| Method | mnist | fmnist |
|---|---|---|
| ISGP-Linkgistic | 99.9% | 99.2% |
| GP-Linkgistic | 99.9% | 99.1% |
| Logistic regression | 99.9% | 98.5% |
| GLMTron | 99.6% | 98.1% |
| BREGMANTRON | 99.7% | 97.9% |
| BREGMANTRON_label | 99.6% | 97.7% |
| BREGMANTRON_approx | 99.3% | 94.6% |
| SLISOTRON | 94.6% | 90.7% |

Figure 2: Test performance vs. training set size for the MNIST, Kuzushiji-MNIST and Fashion-MNIST datasets. We compare logistic regression (logistic), GP-LINKGISTIC (GP) and ISGP-LINKGISTIC (ISGP). Panel (a) shows the mean one-vs-the-rest binary classification accuracy, and panel (b) shows the ten-class classification accuracy obtained by combining the one-vs-the-rest models.

Univariate Regression Toy. As depicted in Figure 3, we combined the Gaussian likelihood function (17) with both the GP and ISGP function priors for $\nu$. We investigate this setup further in the supplementary Figure 5 and Figure 6, which also depict the latent $f$ and $f^2$, which are related by $\nu' = f^2$. Finally, Figure 7 visualises the marginal likelihood surface near the optimal point found by the method of Section D. See the captions for details.

Univariate Classification Toy. We illustrate the mechanics of the classification model with ISGP prior and the sigmoid-Bernoulli likelihood function of (19). We constructed a univariate binary classification problem by composing the sigmoid with a piece-wise linear function, which allows us to compare both the inferred $\nu$ and the inferred (inverse) canonical link $\chi^{-1} = ((-L)')^{-1} \circ \nu$ to their respective ground truths; see Figure 8 and the caption therein for more details.

Real-world Univariate Regression. We compared our ISGP based regression model to (a) GP regression with the same kernel of Section 4.3, (b) the widely used pool adjacent violators algorithm (PAVA) [Kru64] for isotonic regression, and (c) linear least squares. For the ISGP and GP models we used maximum marginal likelihood hyper-parameters. The task was to regress each of the four real-valued features of the auto-mpg dataset [DG17] onto the target (car efficiency in miles per gallon). We partitioned the dataset into five splits. For the Small problems, we trained on one split and tested on the remaining four; for the Large problems we did the converse. We repeated this all $\binom{5}{1} = \binom{5}{4} = 5$ ways and averaged the results to obtain Table 1. As expected, the Bayesian GP and ISGP models, which have the advantage of making stronger regularity assumptions, perform relatively better with less data (the Small case). Our ISGP is in turn superior to the GP in that case, in line with the appropriateness of the monotonicity assumption which it captures.

Real-world Problems of Learning the Loss. We bench-marked our loss inference algorithm of Section 4.2 on binary classification problems. Note that multi-class problems may be handled via e.g. a 1-vs-all or 1-vs-1 multiple coding, while more direct handling may be a topic of future work.
We used both the GP and ISGP function priors, and we refer to the resulting algorithms as GP-LINKGISTIC and ISGP-LINKGISTIC, respectively, only the latter of which gives rise to guaranteed proper losses. To ease the computational burden, we fixed the hyper-parameters throughout. We set $\mu = 0$ and $\gamma = 0.01$. We set the length-scale parameter $a = 1.2$ and chose $b$ such that $k(0, 0) = 1$ (using (15)); by (10) this makes $\nu(x) = x$ the most likely source a priori, to roughly bias towards traditional logistic regression. For the GP we tuned $b$ for good performance on the test set (for the full 60K training examples), to give the GP an unfair advantage over the ISGP, although this advantage is small as GP-LINKGISTIC is rather insensitive to the choice of $b$. Table 2 compares the SLISOTRON and BREGMANTRON algorithms from [KKSK11] and [NM20], respectively, along with the other baselines from the latter work. Further details on the experimental setup are provided in [NM20]. In contrast with the SLISOTRON and BREGMANTRON algorithms, our models successfully match or outperform the logistic regression baseline. Moreover, the monotonic ISGP-LINKGISTIC slightly outperforms GP-LINKGISTIC, and as far as we know records the first result beating logistic regression on this problem, by a reasonable margin on fmnist [NM20]. We further bench-marked ISGP-LINKGISTIC against GP-LINKGISTIC and logistic regression (as the latter was the strongest practical algorithm in the experiments of [NM20]) on a broader set of tasks, namely the three MNIST-like datasets of [LC10, XRV17, CBIK+18]. We found that ISGP-LINKGISTIC dominates on all three datasets as the training set size increases; see Figure 2 and the caption therein for more details. Figure 9 depicts an example of the learned (inverse) link functions.

## 6 Conclusion

We have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness. Our contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility (which we inherit from the Bayesian approach) and, as a direct consequence, practical effectiveness, as evidenced by our state of the art performance. Our model is both highly general and yet capable of out-performing even the most classic baseline, the logistic loss for binary classification. As such, this represents an interesting step toward more flexible modelling in a wider machine learning context, which typically works with a loss function that is prescribed a priori. Since the tricks we use essentially rely on the loss being expressible as a Bregman divergence, and since Bregman divergences are also a principled distortion measure for unsupervised learning, such as in the context of the popular k-means and EM algorithms, an interesting avenue for future work is to investigate the potential of our approach for unsupervised learning.

## Broader Impact

From an ethical standpoint, it is likely that any advancements in fundamental machine learning methodologies will eventually give rise to both positive and negative outcomes. The present work is sufficiently general and application independent, however, as to pose relatively little cause for immediate concern. Nonetheless, we note one example of an optimistic take on the potential impact of the present work.
The example is from the field of quantitative criminology, wherein it is advocated that further study of asymmetric loss functions may provide lawmakers with procedures far more sensitive to the real consequences of forecasting errors in the context of law enforcement [Ber11]. Such a study implies that symmetric links like the ones derived from most common losses: logistic, square, Matsushita, etc. are not a good fit for sensitive fields. Since inference on proper losses can produce non-symmetric links, the present contribution may be useful for such application fields. Acknowledgments and Disclosure of Funding We thank the anonymous reviewers for their insightful feedback, and we thank The Boeing Company for partly funding this work. [ABE+55] Miriam Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and Edward Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, (4), 1955. [Ama12] S.-I. Amari. New developments of information geometry (17): Tsallis q-entropy, escort geometry, conformal geometry. In Mathematical Sciences (suurikagaku), number 592, pages 73 82. Science Company, October 2012. In Japanese. [Ama13] S.-I. Amari. New developments of information geometry (26): Information geometry of convex programming and game theory. In Mathematical Sciences (suurikagaku), number 605, pages 65 74. Science Company, November 2013. in japanese. [AN00] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000. [ANARL19] Clement Abi Nader, Nicholas Ayache, Philippe Robert, and Marco Lorenzi. Monotonic Gaussian Process for Spatio-Temporal Disease Progression Modeling in Brain Imaging Data. Neuro Image, 2019. [APS10] Pankaj K. Agarwal, Jeff M. Phillips, and Bardia Sadri. Lipschitz unimodal and isotonic regression on paths and trees. In Proc. of the 9th Latin American Symposium on Theoretical Informatics, pages 384 396, 2010. [Bac18] Francis Bach. Efficient algorithms for non-convex isotonic regression through submodular optimization. In Advances in Neural Information Processing Systems 31. 2018. [Ber11] Richard Berk. Asymmetric loss functions for forecasting in criminal justice settings. Journal of Quantitative Criminology, 2011. [BGW05] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664 2669, 2005. [BJM06] P. Bartlett, M. Jordan, and J. D. Mc Auliffe. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138 156, 2006. [BLNZ95] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal of Scientific Computation, 1995. [Bre67] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200 217, 1967. [BSS05] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation ans classification: structure and applications, 2005. Technical Report, University of Pennsylvania. [BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004. [CBG+17] M. Cissé, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: improving robustness to adversarial examples. In 34th ICML, 2017. [CBIK+18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature, 2018. 
[CBMF17] Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. volume 70 of Proceedings of Machine Learning Research, pages 884 893, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. [Cha10] Kian Ming Adam Chai. Multi-task learning with Gaussian processes. Ph D thesis, University of Edinburgh, UK, 2010. [CM00] Claudia Czado and Axel Munk. Noncanonical links in generalized linear models - when is the effort justified? Journal of Statistical Planning and Inference, 87, 2000. [CR02] Claudia Czado and Adrian Raftery. Choosing the link function and accounting for link uncertainty in generalized linear models using bayes factors. Statistical Papers, 47, 2002. [Cza97] Claudia Czado. On selecting parametric link transformation families in generalized linear models. Journal of Statistical Planning and Inference, 61:125 139, 05 1997. [d F49] B. de Finetti. Rôle et domaine d application du théorème de Bayes selon les différents points de vue sur les probabilités (in French). In 18th International Congress on the Philosophy of Sciences, pages 67 82, 1949. [DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. [FTS17] S. Flaxman, Y.W. Teh, and D. Sejdinovic. Poisson Intensity Estimation with Reproducing Kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. [GBCC15] Shirin Golchi, D. Bingham, H. Chipman, and David Campbell. Monotone emulation of computer experiments. SIAM/ASA Journal on Uncertainty Quantification, 3:370 392, 01 2015. [GRR] A Gelman, CP Robert, and J Rousseau. Inherent difficulties of non-Bayesian likelihood-based inference, as revealed by an examination of a recent book by Aitkin. Statistics & Risk Modeling, 30. [HHLK19] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential gaussian process flows. In Proceedings of Machine Learning Research, volume 89, pages 1812 1821, 16 18 Apr 2019. [HT92] W.K. Hardle and Berwin Turlach. Nonparametric approaches to generalized linear models. 02 1992. [KEC19] Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Gaussian process latent variable alignment learning. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, 2019. [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24. 2011. [KM99] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109 128, 1999. [Kru64] J.B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964. [KS09] Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. [LC10] Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. [LGOR15] Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for gaussian process modulated poisson processes. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1814 1822, Lille, France, 2015. PMLR. [LGQCRFV10] Miguel Lázaro-Gredilla, Joaquin Quiñnero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum gaussian process regression. Journal of Machine Learning Research, 11(63):1865 1881, 2010. 
[LH19] Siqi Liu and Milos Hauskrecht. Nonparametric regressive point processes based on conditional gaussian processes. In Advances in Neural Information Processing Systems 32. 2019. [Lim18] Cong Han Lim. An efficient pruning algorithm for robust isotonic regression. In Advances in Neural Information Processing Systems 31. 2018. [LR14] Ronny Luss and Saharon Rosset. Generalized isotonic regression. Journal of Computational and Graphical Statistics, 23, 2014. [LRG+17] Cheng Li, Santu Rana, Satyandra K. Gupta, Vu Nguyen, and Svetha Venkatesh. Bayesian optimization with monotonicity information. In NIPS Workshop on Bayesian Optimization, 2017. [MM06] Peter Mc Cullagh and Jesper Møller. The permanental process. Advances in Applied Probability, 38(4), 2006. [MP92] Arak Mathai and Serge Provost. Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker, Inc., 1992. [MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651 2667, 2006. [NDF00] Ioannis Ntzoufras, Petros Dellaportas, and Jonathan Forster. Bayesian variable and link determination for generalised linear models. Journal of Statistical Planning and Inference, 111, 03 2000. [NM20] Richard Nock and Aditya Krishna Menon. Supervised learning: No loss no cry. In ICML 20, 2020. [NN08] Richard Nock and Frank Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201 1208, 2008. [NNA16] Richard Nock, Frank Nielsen, and Shun-ichi Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:1 12, 2016. [Nys28] Evert Johannes Nyström. Über die praktische auflösung von linearen integralgleichungen mit anwendungen auf randwertaufgaben der potentialtheorie. Commentationes physico-mathematicae, 4:1 52, 1928. [PGM+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024 8035. Curran Associates, Inc., 2019. [RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177 1184. 2008. [RV10] Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010. [RW05] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005. [RW10] Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387 2422, 2010. [Sav71] L.-J. Savage. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783 801, 1971. [Sch89] M.-J. Schervish. A general method for comparing probability assessors. Ann. of Stat., 17(4):1856 1879, 1989. [SF06] Hans J. Skaug and David A. Fournier. Automatic approximation of the marginal likelihood in non-gaussian hierarchical models. Computational Statistical Data Analysis, 51, 2006. [SPV16] Eero Siivola, Juho Piironen, and Aki Vehtari. Automatic monotonicity detection for gaussian processes. In ar Xiv 1610.05440, 2016. [ST03] Tomoyuki Shirai and Yoichiro Takahashi. 
Random point fields associated with certain fredholm determinants ii: Fermion shifts and their ergodic and gibbs properties. The Annals of Probability, (3), 07 2003. [Tel13] M. Telgarsky. Boosting with the logistic loss is consistent. In 26 th COLT, pages 911 965, 2013. [UKEC19] Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Monotonic gaussian process flow. In ar Xiv 1905.12930, 2019. [Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134 1142, 1984. [Vap98] V. Vapnik. Statistical Learning Theory. John Wiley, 1998. [WB17] Christian J. Walder and Adrian N. Bishop. Fast bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 2017. [XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. [YLR+18] Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, and Svetha Venkatesh. Sparse approximation for gaussian process with derivative observations. In AI 2018: Advances in Artificial Intelligence, 2018. [YW09] L Yeganova and W Wilbur. Isotonic regression under lipschitz constraint. Journal of Optimization Theory and Applications, 2009. [ZE02] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Knowledge Discovery and Data Mining, 2002. [Zha04] J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16:159 195, 2004.