# All your loss are belong to Bayes

Christian Walder¹,² and Richard Nock¹,²
¹CSIRO Data61, Australia. ²The Australian National University. {first.last}@data61.csiro.au
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Loss functions are a cornerstone of machine learning and the starting point of most algorithms. Statistics and Bayesian decision theory have contributed, via properness, to eliciting over the past decades a wide set of admissible losses in supervised learning, to which most popular choices belong (logistic, square, Matsushita, etc.). Rather than making a potentially biased ad hoc choice of the loss, there has recently been a boost in efforts to fit the loss to the domain at hand while training the model itself. The key approaches fit a canonical link, a function which monotonically relates the closed unit interval to $\mathbb{R}$ and can provide a proper loss via integration. In this paper, we rely on a broader view of proper composite losses and a recent construct from information geometry, source functions, whose fitting alleviates constraints faced by canonical links. We introduce a trick on squared Gaussian Processes to obtain a random process whose paths are compliant source functions with many desirable properties in the context of link estimation. Experimental results demonstrate substantial improvements over the state of the art.

## 1 Introduction

Supervised learning has been an influential framework of machine learning, swayed decades ago by pioneers who managed to put into equations its nontrivial uncertainties in sampling and model selection [Val84, Vap98]. One lighthearted but famous videogame-born summary¹ comes with an unexpected technical easter egg: uncertainty grasps the crux of the framework, the loss. All these works have indeed been predated, from the early seventies, by a rich literature on admissible losses for supervised learning, in statistical decision theory [Sav71], and even earlier in foundational philosophical work [dF49]. The essential criterion for admissibility is properness, the fact that Bayes rule is optimal for the loss at hand. Properness is intensional: it does not provide a particular set of functions to choose a loss from. Over the past decades, a significant body of work has focused on eliciting the set of admissible losses (see [NM20] and references therein); yet, in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life: more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02]. More recent work has aimed to provide machine learning with greater flexibility in the loss [HT92, KS09, KKSK11, NM20], yet these works face significant technical challenges arising from (i) the joint estimation of a non-parametric loss function along with the remainder of the model, and (ii) the specific part of a proper loss which is learned, called a link function, which relates class probability estimation to real-valued prediction [RW10]. In a nutshell, the present work departs from these chiefly frequentist approaches and introduces a new and arguably Bayesian standpoint to the problem (see [GRR] for a related discussion on the nature of such a standpoint).

¹See https://tinyurl.com/y5jnurau or search for "all your Bayes are belong to us, Vapnik quote".
We exploit a finer characterisation of loss functions and a trick on the Gaussian Process (GP) for efficient inference while retaining guarantees on the losses. Experiments show that our approach tends to significantly beat the state of the art [KS09, KKSK11, NM20], and records better results than baselines informed with specific losses or links, which to our knowledge is a first among approaches learning a loss or link.

More specifically, we first sidestep the impediments and constraints of fitting a link by learning a source function, which need only be monotonic, and which allows us to reconstruct a loss via a given link, yielding a proper composite loss [RW10]. The entire construct exploits a fine-grained information-geometric characterisation of Bregman divergences pioneered by Zhang and Amari [Ama12, Ama13, Zha04]. This is our first contribution. Our second contribution exploits a Bayesian standpoint on the problem itself: since properness does not specify a loss to minimise, we do not estimate nor model a loss based on data; instead we perform Bayesian inference on the losses themselves, and more specifically on the source function. From the losses' standpoint, our technique brings the formal advantage of a fine-tuned control of the key parameters of a loss which, in addition to properness, control consistency, robustness, calibration and rates. We perform Bayesian inference within this class of losses by a new and simple trick addressing the requirement of priors over functions which are simultaneously non-parametric, differentiable and guaranteed monotonic. We introduce for that purpose the Integrated Squared Gaussian Process (ISGP), which is itself of independent interest as a convenient model for monotonic functions. The ISGP extends by integration (which yields monotonicity) the squared Gaussian Process, which has itself seen a recent burst of attention as a model for merely non-negative functions [ST03, MM06, LGOR15, WB17, FTS17, LH19]. The ISGP contrasts with previous approaches to learning monotonic functions, which are either non-probabilistic [ABE+55, YW09, APS10, LR14, Bac18, Lim18], resort to a discretisation [HHLK19, KEC19, UKEC19], or are only approximately monotonic due to the use of only a finite set of monotonicity-promoting constraints [RV10, GBCC15, SPV16, LRG+17, YLR+18, ANARL19].

Organisation. In Section 2 we introduce properness and source functions. We define the ISGP model in Section 3, and provide relevant Bayesian inference procedures in Section 4. Numerical experiments are provided in Section 5 before concluding with Section 6. Technical details and an extensive suite of illustrations are provided in the supplementary appendices.

## 2 Definitions and key properties for losses

Figure 1 (diagram labels: prediction, class probability estimation, canonical link, composite link): Relationship between the source, canonical link and composite link defined by (3). Domains and names are given in the context of supervised learning.

Our notations follow [NM20, RW10]. In the context of statistical decision theory as discussed more than seventy years ago [dF49] and later theorised [Sav71], the key properties of a supervised loss function start with ensuring that Bayes rule is optimal for the loss, a property known as properness.

Proper (composite) losses: let $\mathrm{Y} := \{-1, 1\}$ be a set of labels. A loss for binary class probability estimation [BSS05] is a function $\ell : \mathrm{Y} \times [0,1] \to \overline{\mathbb{R}}$ (the closure of $\mathbb{R}$).
Its (conditional) Bayes risk function is the best achievable loss when labels are drawn with a particular positive base-rate, $L(\pi) := \inf_u \mathbb{E}_{Y \sim \mathrm{Bernoulli}(\pi)}\, \ell(Y, u)$, where $\Pr[Y = 1] = \pi$. A loss for class probability estimation $\ell$ is proper iff Bayes prediction locally achieves the minimum everywhere: $L(\pi) = \mathbb{E}_Y\, \ell(Y, \pi),\ \forall \pi \in [0,1]$, and strictly proper if Bayes is the unique minimum. Fitting a classifier's prediction $h(z) \in \mathbb{R}$ ($z$ being an observation) into some $u \in [0,1]$ is done via an invertible link function $\chi : [0,1] \to \mathbb{R}$ connecting real-valued prediction and class probability estimation. A proper loss augmented with a link, $\ell(y, \chi^{-1}(x))$ with $x \in \mathbb{R}$ [RW10], is called proper composite. There exists a particular link, uniquely defined (up to multiplication or addition by a scalar [BSS05]) for any proper loss, the canonical link, $\chi := (-L)'$ [RW10, Section 6.1]: for example, the logistic loss yields the canonical link $\chi(u) = (-L)'(u) = \log(u/(1-u))$, with inverse the well-known sigmoid $\chi^{-1}(x) = ((-L)')^{-1}(x) = (1 + \exp(-x))^{-1}$. A proper loss augmented with its canonical link instead of any composite link is called proper canonical. We remind that the Bayes risk of a proper loss is concave [RW10]. There exists a particularly useful characterisation of proper composite losses [NM20, Theorem 1]: a loss is proper composite iff

$$\ell(y, \chi^{-1}(x)) = D_{-L}\!\left(\tilde{y} \,\|\, \chi^{-1}(x)\right) = D_{(-L)^\star}\!\left((-L)' \circ \chi^{-1}(x) \,\|\, (-L)'(\tilde{y})\right), \qquad (1)$$

where $\tilde{y} := (y+1)/2 \in \{0,1\}$, $D_{-L}$ is the Bregman divergence with generator $-L$ and $(-L)^\star$ is the Legendre conjugate of $-L$. Let $G$ be convex differentiable. Then we have $D_G(x \,\|\, x') := G(x) - G(x') - (x - x')\,G'(x')$ [Bre67] and $G^\star(x) := \sup_{x' \in \mathrm{dom}(G)} \{x x' - G(x')\}$ [BV04]. We assume in this paper that losses are twice differentiable for readability (differentiability assumption(s) can be alleviated [RW10, Footnote 6]) and strictly proper for the canonical link to be invertible, both properties being satisfied by popular choices (log, square, Matsushita losses, etc.).

Margin losses: an important class of losses for real-valued prediction are functions of the kind $G(yh)$ where $G : \mathbb{R} \to \mathbb{R}$, $y \in \{-1, 1\}$ and $yh$ is called a margin in [RW10]. There exists a rich theory on the connections between margin losses and proper losses: if $G(x) := (-L)^\star(-x)$ for a proper loss $\ell$ whose Bayes risk is symmetric around $1/2$ (equivalent to having no class-conditional misclassification costs), then minimising the margin loss is equivalent to minimising the proper canonical loss [NN08, Lemma 1]. In the more general case, given a couple $(G, \chi)$, there exists a proper loss $\ell$ such that $G(y\chi(u)) = \ell(y, u)$ for all $y, u$ iff [RW10, Corollary 14]:

$$\chi^{-1}(x) = \frac{G'(-x)}{G'(-x) + G'(x)}, \qquad \forall x. \qquad (2)$$

When this holds, we say that the couple $(G, \chi)$ is representable as a proper loss. When properness is too tight a constraint, one can rely on classification calibration [BJM06], which relaxes the optimality condition to just predicting labels. Such a notion is particularly interesting for margin losses. We assume in this paper that margin losses are twice differentiable, a property satisfied by most popular choices (logistic, square, Matsushita, etc.), but notably not satisfied by the hinge loss.

Key quantities for a good loss: there are two crucial quantities that indicate the capability of a function to be a good choice for a loss, its Lipschitz constant and weight function. For any function $G : \mathbb{R} \to \mathbb{R}$, the Lipschitz constant of $G$ is $\mathrm{lip}\, G := \sup_{x \neq x'} |G(x) - G(x')| / |x - x'|$ and its weight function is $w_G(x) := -G''(x)$. Controlling the Lipschitz constant of a loss is important for robustness [CBG+17] and mandatory for consistency [Tel13]. The weight function, on the other hand, defines properness [Sch89] but also controls optimisation rates [KM99, NN08] and defines the geometry of the loss [AN00, Section 3]. More than the desirable properness, in fitting or tuning a loss, one ideally needs to make sure that levers are easily accessible to control these two quantities.
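To ground these definitions, the short check below (a standalone sketch, not from the paper; it assumes the standard log loss, whose Bayes risk is $L(\pi) = -\pi\log\pi - (1-\pi)\log(1-\pi)$, and the logistic margin loss $G(x) = \log(1 + e^{-x})$) verifies numerically that (2) recovers the sigmoid inverse link and that the primal Bregman form in (1) reproduces the log loss.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Derivative of the logistic margin loss G(x) = log(1 + e^{-x}).
Gp = lambda x: -sigmoid(-x)                       # G'(x) = -1 / (1 + e^x)

def chi_inv(x):                                   # Eq. (2): inverse link recovered from G
    return Gp(-x) / (Gp(-x) + Gp(x))

def negL(u):                                      # -L(u) = u log u + (1-u) log(1-u)
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return u * np.log(u) + (1 - u) * np.log1p(-u)

def negL_prime(u):                                # (-L)'(u) = log(u / (1-u)), canonical link
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return np.log(u) - np.log1p(-u)

def bregman_negL(a, b):                           # D_{-L}(a || b)
    return negL(a) - negL(b) - (a - b) * negL_prime(b)

x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(chi_inv(x), sigmoid(x))        # (2) yields the sigmoid inverse link
for y in (-1, 1):
    y_tilde = (y + 1) / 2
    log_loss = -y_tilde * np.log(sigmoid(x)) - (1 - y_tilde) * np.log(sigmoid(-x))
    assert np.allclose(bregman_negL(y_tilde, sigmoid(x)), log_loss)   # primal form of (1)
```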
(u, v)-geometric structure: a proper loss defines a canonical link. A proper composite loss therefore makes use of two link functions. Apart from flexibility, there is an information-geometric justification for such a more general formulation, linked to a finer characterisation of the information geometry of Bregman divergences introduced in [Ama12, Ama13, Zha04] and more recently analysed, the (u, v)-geometric structure [NNA16]. Here $u, v$ are two differentiable invertible functions and the gradient of the generator of the divergence is given by $u \circ v^{-1}$. This way, the two dual forms of a Bregman divergence (in (1)) can be represented using the same coordinate system, known as the source, which specifies the dual coordinate system and Riemannian metric induced by the divergence. We consider the $(\nu, \chi)$-geometric structure of the divergence $D_{-L}$, which therefore gives for $\nu$:

$$\nu = (-L)' \circ \chi^{-1}, \qquad (3)$$

and yields a local composite estimate of Bayes rule for a corresponding $x \in \mathbb{R}$ as in [NM20, Eq. 4]:

$$\hat{p}(y = 1 \,|\, x; \nu, L) := \chi^{-1}(x) = ((-L)')^{-1} \circ \nu(x). \qquad (4)$$

Figure 1 depicts $\nu$, which we call the source as well. The properties of $\nu$ in (3) are summarised below.

Definition 1. A source (function) $\nu$ is a differentiable, strictly increasing function.

Any loss completed with a source gives a proper composite loss. There are two expressions of interest for a loss involving source $\nu$, which follow from (1) and a margin loss $G(\cdot)$, with $\tilde{y} \in \{0,1\}$, $x \in \mathbb{R}$ and $u \in [0,1]$ (with the correspondence $x := \chi(u)$ and $\tilde{y} := (y+1)/2$):

$$\ell_\nu(y, u) := D_{(-L)^\star}\!\left(\nu(x) \,\|\, (-L)'(\tilde{y})\right), \qquad (5)$$
$$G_\nu(y\chi(u)) := G\!\left(y \cdot (\nu^{-1} \circ (-L)')(u)\right). \qquad (6)$$

When it is clear from the context, we shall remove the index $\nu$ for readability. Instead of learning the canonical link of a loss, which poses significant challenges due to its closed domain [KKSK11, NM20], we choose to infer the source given a loss. As we now see, this can be done efficiently using Gaussian Process inference and with strong guarantees on the losses involved.

## 3 The Integrated Squared Gaussian Process Model

Definition. An Integrated Squared Gaussian Process (ISGP) is a non-parametric random process whose sample paths turn out to be source functions.

Definition 2. An ISGP is defined by the following model:

$$\nu_0 \sim \mathcal{N}(\mu, \gamma^{-1}); \qquad f \sim \mathrm{GP}(k(\cdot, \cdot)) \;\Leftrightarrow\; f(\cdot) = \mathbf{w}^\top \phi(\cdot),\ \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Lambda); \qquad \nu(x) := \nu_0 + \int_0^x f^2(z)\, dz.$$

Here, $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_M(x))^\top$ represents $M$ basis functions, $\Lambda \in \mathbb{R}^{M \times M}$ is a diagonal matrix with elements $\lambda_m > 0$, and $\mathrm{GP}(k(\cdot, \cdot))$ is the zero-mean Gaussian process with kernel $k : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Hence the above equivalence (denoted $\Leftrightarrow$) implies the Mercer expansion

$$k(x, z) = \phi(x)^\top \Lambda\, \phi(z). \qquad (7)$$

Note that while for finite $M$ this construction implies a parametric model, parametric approximations are common in the GP literature [RW05]. The ISGP is Gaussian in the parameters $\mathbf{w}$ and $\nu_0$, and admits a simple form for $\nu$:

$$\nu(x) = \nu_0 + \int_0^x f^2(z)\, dz = \nu_0 + \int_0^x \left(\mathbf{w}^\top \phi(z)\right)^2 dz = \nu_0 + \mathbf{w}^\top \psi(x)\, \mathbf{w}, \qquad (8)$$

with the positive semi-definite matrix

$$\psi(x) := \int_0^x \phi(z)\, \phi(z)^\top dz \in \mathbb{R}^{M \times M}. \qquad (9)$$

It is our choice of squaring of $f$, previously exploited to model non-negativity [ST03, MM06, LGOR15, WB17, FTS17, LH19], which leads to this simple form for the monotonic $\nu$.
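To make Definition 2 and (8) concrete, the sketch below (not the authors' code; the cosine/sine features and all hyper-parameter values are assumed, illustrative stand-ins in the spirit of the trigonometric kernel introduced later in Section 4.3) draws a few ISGP sample paths, replacing the closed-form $\psi(x)$ of (9) by a cumulative trapezoidal integral of $f^2$. Each path is non-decreasing by construction, since $\nu'(x) = f^2(x) \geq 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, gamma = 16, 0.0, 1.0                      # illustrative model sizes / hyper-parameters
m = np.arange(1, M // 2 + 1)
lam = np.concatenate([0.1 / 1.2 ** m] * 2)       # diagonal of Lambda (assumed values)

def phi(x):
    """M/2 cosine and M/2 sine basis functions evaluated at the points x."""
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)   # shape (len(x), M)

x = np.linspace(0.0, 4.0, 200)
for _ in range(5):                               # a few ISGP sample paths
    w = rng.normal(size=M) * np.sqrt(lam)        # w ~ N(0, Lambda)
    nu0 = rng.normal(mu, 1.0 / np.sqrt(gamma))   # nu_0 ~ N(mu, gamma^{-1})
    f2 = (phi(x) @ w) ** 2                       # f^2(x) = (w^T phi(x))^2 >= 0
    # nu(x) = nu_0 + int_0^x f^2(z) dz, here via the cumulative trapezoidal rule
    nu = nu0 + np.concatenate([[0.0], np.cumsum(0.5 * (f2[1:] + f2[:-1]) * np.diff(x))])
    assert np.all(np.diff(nu) >= 0)              # every path is a valid (monotone) source
```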
Sanity checks of the ISGP prior for loss inference. We now formally analyse the goodness of fit of an ISGP in the context of loss inference. The following Lemma is immediate and links the ISGP to inference on functions expressed as Bregman divergences, and therefore to proper losses (Section 2).

Lemma 1. $\nu(x)$ as in Definition 2 is a source function with probability one.

Remark 3.1. Bregman divergences are also the analytical form of losses in other areas of machine learning (e.g. unsupervised learning [BGW05]), so ISGPs may, via Lemma 1, be of broader interest.

The following Theorem safeguards inference on loss functions with an ISGP.

Theorem 2. For any convex $G$ and any ISGP, the following holds with probability one: (i) if $G$ is decreasing, there exists a canonical link $(-L)'$ such that the couple $(G, \chi)$, where $\chi$ is obtained from (3), is representable as a proper loss; (ii) if $G$ is classification calibrated, the margin loss $G_\nu(x) := G \circ \nu^{-1}(x)$ is classification calibrated. (Proof in Appendix C.)

To summarise, the whole support of an ISGP is fully representable as proper losses, but the proof of point (i) is not necessarily constructive as it involves a technical invertibility condition. On the other hand, part (ii) relaxes properness to classification calibration, but the invertibility condition is weakened to that of $\nu$ for the margin loss involved. Theorem 2 (and the title of the present work) begs for a converse on the condition(s) on the ISGP ensuring that any proper composite loss can be represented in its support. Fortunately, a sufficient condition is already known [MXZ06].

Theorem 3. Suppose kernel $k$ in Definition 2 is universal. Then for any proper loss $\ell$ and any composite link $\chi$, there exists $\nu$ as in Definition 2 such that the proper composite loss $\ell(y, \chi^{-1}(x))$ can be represented as in (5): $\ell(y, \chi^{-1}(x)) = \ell_\nu(y, u)$, $\forall u \in [0,1]$, $y \in \{-1, 1\}$, with $x := \chi(u)$.

To avoid a paper laden with notation, we refer to [MXZ06] for the exact definition of universality which, informally, postulates that the span of the kernel can approximate any function to arbitrary precision. Interestingly, a substantial number of kernels are universal, including some of particular interest in our setting (see below). We now investigate a desirable property of the first-order moment of a loss computed from an ISGP. We say that the ISGP is unbiased iff $\mathbb{E}_\nu[\nu(x)] = x$, $\forall x$.

Theorem 4. Let $\{(x, y)\} \subset \mathbb{R} \times \mathrm{Y}$ be a sample of labelled training values. For any unbiased ISGP and any proper canonical loss $\ell$, the proper composite loss formed by $\ell$ and the composite link $\chi := \nu^{-1} \circ (-L)'$ of (3) satisfies:

$$\mathbb{E}_{(x,y)}\left[\ell\!\left(y, ((-L)')^{-1}(x)\right)\right] \;\leq\; \mathbb{E}_{(x,y),\nu}\left[\ell_\nu\!\left(y, \chi^{-1}(x)\right)\right].$$

(Proof in Appendix C.) In the context of Bayesian inference this means that the prior uncertainty in $\nu$ induces a prior on losses whose expected loss upper bounds that of the canonical link. We exploit this property to initialise our Bayesian inference scheme with a fixed canonical link (see Section 4.2). Of course, this property follows from Theorem 4 if the ISGP is unbiased, which we now show is trivial to guarantee. We say that a kernel $k$ is translation invariant iff $k(x, x') = k(0, x - x')$, $\forall x, x'$.

Theorem 5. For any translation invariant $k$, the ISGP $\nu$ satisfies $\mathbb{E}_\nu[\nu(x)] = \mu + k(0, 0)\, x$.

Proof. By linearity of expectation this mean function equals $\mathbb{E}[\nu_0] = \mu$ plus the simple term

$$\mathbb{E}_{f \sim \mathrm{GP}(k(\cdot,\cdot))}\left[\int_0^x f^2(z)\, dz\right] = \int_0^x \mathbb{E}\left[f^2(z)\right] dz = \int_0^x k(z, z)\, dz = x\, k(0, 0), \qquad (10)$$

yielding the statement of the Theorem.

Hence, if $\mu = 0$ and $k(0, 0) = 1$ in Definition 2, the ISGP is unbiased. Interestingly, some translation invariant kernels are universal [MXZ06, Section 4].
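Theorem 5 is easy to confirm empirically. The sketch below (standalone and illustrative, with an assumed cosine/sine basis rather than the trigonometric kernel exactly as tuned in Section 4.3; the $\nu_0$ noise is dropped since it has mean $\mu$ and does not affect the check) rescales $\Lambda$ so that $k(0,0) = 1$ and verifies that the Monte Carlo average of $\nu(x)$ is close to $\mu + k(0,0)\, x = x$.

```python
import numpy as np

rng = np.random.default_rng(1)
M, mu = 16, 0.0
m = np.arange(1, M // 2 + 1)
lam = np.concatenate([0.5 ** m] * 2)             # illustrative eigenvalues
lam = lam / lam[: M // 2].sum()                  # rescale so that k(x, x) = k(0, 0) = 1

def phi(x):                                      # cosine/sine features (translation-invariant k)
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)

x = np.linspace(0.0, 3.0, 50)
S = 20000
W = rng.normal(size=(S, M)) * np.sqrt(lam)       # S draws of w ~ N(0, Lambda)
F2 = (W @ phi(x).T) ** 2                         # f^2 on the grid, shape (S, 50)
dx = np.diff(x)
integral = np.cumsum(0.5 * (F2[:, 1:] + F2[:, :-1]) * dx, axis=1)
nus = mu + np.concatenate([np.zeros((S, 1)), integral], axis=1)

# Theorem 5 predicts E[nu(x)] = mu + k(0,0) * x = x here; the gap is Monte Carlo error.
print(np.max(np.abs(nus.mean(axis=0) - x)))
```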
We now dig into how an ISGP allows us to control the properties of supervised losses that govern robustness and statistical consistency, among others [CBG+17, Tel13]. Denote $\|\phi\|^2_{\max} := \sup_x \|\phi(x)\|_2^2$, which is simple to upper-bound for the $\phi$ we will adopt later (see (14)).

Lemma 6. For an ISGP $\nu$, the Lipschitz constant of $\ell_\nu$ in (5) satisfies $\mathrm{lip}\, \ell_\nu \leq \|\phi\|^2_{\max}\, \|\mathbf{w}\|_2^2\, \mathrm{lip}\, \ell$.

Proof. By Cauchy-Schwartz, $\nu'(x) = f^2(x) = (\mathbf{w}^\top \phi(x))^2 \leq \sup_x (\mathbf{w}^\top \phi(x))^2 \leq \sup_x \|\phi(x)\|_2^2\, \|\mathbf{w}\|_2^2 = \|\phi\|^2_{\max}\, \|\mathbf{w}\|_2^2$. The Lemma then follows from the chain rule on $\ell_\nu$ in (5).

Similarly, we have $\chi = \nu^{-1} \circ (-L)'$ from (3) and so $\chi'(u) = w_\ell(u) \cdot \left(\mathbf{w}^\top (\phi\phi^\top)(\chi(u))\, \mathbf{w}\right)^{-1}$, yielding this time a Lipschitz constant for $G_\nu$ of (6) which is proportional to the weight of $\ell$ and inversely proportional to $f^2$. This shows that the prior's uncertainty still allows for a probabilistic handle on the Lipschitz constant, via $\Lambda$, the eigenspectrum in the Mercer expansion.

## 4 Inference with Integrated Squared Gaussian Processes

In Section 4.1 we give an approximate inference method for the ISGP with simple i.i.d. likelihood functions; this is analogous to basic GP regression and classification as in e.g. [RW05]. The subsequent Section 4.2 builds on this to provide an inference method for proper losses for the generalised linear model; this is a Bayesian analog of the loss fitting methods in [KKSK11, NM20].

### 4.1 Basic Laplace-approximate ISGP inference for a fixed likelihood function

Efficient variational inference for the ISGP is an open problem due to the intractable integrals this entails. It is straightforward, however, to construct a Laplace approximation to the posterior in the parameters $\Gamma = \{\mathbf{w}, \nu_0\}$, provided that e.g. the likelihood function for the data $\mathcal{D} = \{x_n, y_n\}_{n=1}^N$ factorises in the usual way as $\log p(\mathcal{D} \,|\, \nu, \Theta) = \sum_{n=1}^N \log p(y_n \,|\, \nu(x_n), \Theta)$. Here $\Theta$ represents the hyper-parameters, which are $\gamma$, $\mu$, any parameters of the kernel $k(\cdot, \cdot)$, and any parameters of the likelihood function. We use automatic differentiation and numerical optimisation to obtain the posterior mode $\hat{\Gamma} = \mathrm{argmax}_\Gamma \log p(\mathcal{D}, \Gamma)$. We then use automatic differentiation once more to compute the Hessian

$$H := -\nabla^2_\Gamma \log p(\mathcal{D}, \Gamma)\big|_{\Gamma = \hat{\Gamma}}. \qquad (11)$$

Letting $\hat{\Sigma}_\Gamma = H^{-1}$, the approximate posterior is then $p(\Gamma \,|\, \mathcal{D}) \approx \mathcal{N}(\Gamma \,|\, \hat{\Gamma}, \hat{\Sigma}_\Gamma) =: q(\Gamma \,|\, \mathcal{D})$.

Predictive distribution: posterior samples of the monotonic function $\nu$ are obtained from samples $\Gamma^{(s)} = \{\mathbf{w}^{(s)}, \nu_0^{(s)}\}$ from $q$ via the deterministic relation (8). For the predictive mean function, we exploit the property that the predictive distribution is the sum of the Gaussian $\nu_0$ plus the well-studied quadratic function of a multivariate normal. If we let $\hat{\nu}_0$ be the element of $\hat{\Gamma}$ corresponding to $\nu_0$, and let $\hat{\Sigma}_{\mathbf{w}}$ be the sub-matrix of $\hat{\Sigma}_\Gamma$ corresponding to $\mathbf{w}$, etc., then the following holds.

Lemma 7. Using the notations just defined, the mean function may be written as

$$\mathbb{E}_{q(\Gamma | \mathcal{D})}\left[\nu(x)\right] = \hat{\nu}_0 + \mathrm{tr}\!\left(\psi(x)\, \hat{\Sigma}_{\mathbf{w}}\right) + \hat{\mathbf{w}}^\top \psi(x)\, \hat{\mathbf{w}}.$$

Proof. We have $\mathbb{E}_{q(\Gamma|\mathcal{D})}[\nu(x)] = \hat{\nu}_0 + \mathbb{E}_{\mathcal{N}(\mathbf{w} | \hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}})}\left[\mathbf{w}^\top \psi(x)\, \mathbf{w}\right]$ and, for $\mathbf{w} \sim \mathcal{N}(\mathbf{w} \,|\, \hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}})$,

$$\mathbb{E}\left[\mathbf{w}^\top \psi(x)\, \mathbf{w}\right] = \mathbb{E}\left[\mathrm{tr}\!\left(\psi(x)\, \mathbf{w}\mathbf{w}^\top\right)\right] = \mathrm{tr}\!\left(\psi(x)\, \mathbb{E}\!\left[\mathbf{w}\mathbf{w}^\top\right]\right) = \mathrm{tr}\!\left(\psi(x)\, (\hat{\Sigma}_{\mathbf{w}} + \hat{\mathbf{w}}\hat{\mathbf{w}}^\top)\right),$$

which leads to the required result.

This mean function differs from that of a simple GP, which is linear w.r.t. $\hat{\mathbf{w}}$. Closed form expressions for the higher moments of $\mathbf{w}^\top \psi(x)\, \mathbf{w}$ are provided in [MP92, Section 3.2b: Moments of a Quadratic Form].
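The closed form of Lemma 7 is easy to check numerically. The sketch below is illustrative only: $\psi(x)$ is integrated numerically, and the posterior parameters $\hat{\mathbf{w}}, \hat{\Sigma}_{\mathbf{w}}$ are random placeholders rather than the output of an actual Laplace fit. It compares the closed-form mean function against a Monte Carlo average over draws from $q$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 8
m = np.arange(1, M // 2 + 1)

def phi(x):                                      # illustrative cosine/sine basis
    arg = np.pi * np.outer(x, m) * 0.25
    return np.concatenate([np.cos(arg), np.sin(arg)], axis=1)

def psi(x, n_grid=2000):
    """Numerical psi(x) = int_0^x phi(z) phi(z)^T dz (trapezoidal rule)."""
    z = np.linspace(0.0, x, n_grid)
    P = phi(z)
    w_trap = np.full(n_grid, z[1] - z[0]); w_trap[[0, -1]] *= 0.5
    return np.einsum('n,ni,nj->ij', w_trap, P, P)

# Placeholder Laplace-posterior parameters, standing in for the output of Section 4.1.
nu0_hat = 0.3
w_hat = 0.5 * rng.normal(size=M)
A = 0.1 * rng.normal(size=(M, M))
Sigma_w = A @ A.T + 1e-3 * np.eye(M)

x = 2.0
Px = psi(x)
closed_form = nu0_hat + np.trace(Px @ Sigma_w) + w_hat @ Px @ w_hat   # Lemma 7

W = rng.multivariate_normal(w_hat, Sigma_w, size=200000)              # draws of w from q
monte_carlo = nu0_hat + np.einsum('si,ij,sj->s', W, Px, W).mean()
print(closed_form, monte_carlo)   # the two agree up to Monte Carlo error
```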
We can easily replicate the use of Jensen's inequality in the proof of Theorem 4 to obtain the following guarantee for the posterior mean function. We recall that $L$ is the Bayes risk of a proper loss $\ell$ and that $\chi^{-1} := ((-L)')^{-1} \circ \nu$ is the construct of the (inverse) composite link using source $\nu$ (of (3)).

Theorem 8. For any sample of labelled data points $\{(x, y)\} \subset \mathbb{R} \times \mathrm{Y}$, any proper loss $\ell$ and source $\nu$ sampled from the approximate posterior $q(\Gamma \,|\, \mathcal{D})$, we have:

$$\mathbb{E}_{(x,y),\nu}\left[\ell_\nu\!\left(y, \chi^{-1}(x)\right)\right] \;\geq\; \mathbb{E}_{(x,y)}\left[D_{(-L)^\star}\!\left(\hat{\nu}_0 + \mathrm{tr}\!\left(\psi(x)\hat{\Sigma}_{\mathbf{w}}\right) + \hat{\mathbf{w}}^\top \psi(x)\, \hat{\mathbf{w}} \,\Big\|\, (-L)'(\tilde{y})\right)\right].$$

Note that the l.h.s. above takes an expectation with respect to the approximate posterior of $\nu$ while the r.h.s. involves the parameters of the approximate posterior mean function of $\nu$. This clarifies the importance of Bayesian modelling of $\nu$, as the resulting expected loss merely upper bounds that for the fixed mean function. Alternatively, the result suggests a cheap approximation to marginalising (4) w.r.t. $\nu$ for prediction, namely plugging the mean function for $\nu$ into that equation.

Marginal likelihood approximation and optimisation: we follow [SF06] and use automatic differentiation to achieve the same result while avoiding the manual gradient calculations of the usual approach in the GP literature [RW05, Section 3.4]. See Section D for further details on the marginal likelihood, computational complexity, and likelihood functions for the univariate case.

### 4.2 Using the basic ISGP inference to fit Bayes estimates via Expectation Maximisation

At this stage, inferring the loss in a machine learning model boils down to simply placing an ISGP prior on the source function and doing Bayesian inference. In accordance with part (i) of Theorem 2, this induces a posterior over sources which in turn implies a distribution over proper composite losses. We now present an effective inference procedure for the generalised linear model, which has been the standard model for closely related isotonic regression approaches to the problem [KS09, KKSK11, NM20]. We first choose a proper loss $\ell$, with Bayes risk $L$, to get the proper composite estimate (4). Here we consider the log loss, and therefore the inverse sigmoid for $(-L)'$, although we emphasise that our method works with any proper loss in which the source is embedded.

Model: to summarise, given a dataset $\mathcal{D} = \{\mathbf{z}_n, y_n\}_{n=1,\ldots,N} \subset \mathbb{R}^D \times \mathrm{Y}$, we model $\hat{p}(x) := ((-L)')^{-1} \circ \nu(x)$ from (4) with $((-L)')^{-1}(x) = (1 + \exp(-x))^{-1}$, to obtain the classification model

$$y_n \,|\, \mathbf{z}_n \sim \mathrm{Bernoulli}(\hat{p}(x_n)); \qquad x_n = \boldsymbol{\beta}^\top \mathbf{z}_n, \qquad (12)$$

where $\nu \sim \mathcal{F}$ and $\mathcal{F}$ is a prior over functions such as the GP or ISGP. We leave the prior on the linear model weight vector $\boldsymbol{\beta} \in \mathbb{R}^D$ unspecified as we perform maximum (marginal) likelihood on that parameter. This model generalises logistic regression, which we recover for $\nu(x) = x$.

Inference: we use Expectation Maximisation (EM) to perform maximum likelihood in $\boldsymbol{\beta}$, marginal of $\nu$. In accordance with Theorem 4, we initialise $\boldsymbol{\beta}^{(\mathrm{old})}$ using the traditional fixed link $((-L)')^{-1}$. In the E-step, the $x_n = \boldsymbol{\beta}^{(\mathrm{old})\top} \mathbf{z}_n$ are fixed and we use the Laplace approximate inference scheme of Section 4.1 to compute the posterior in $\nu$ (or equivalently $\Gamma = \{\mathbf{w}, \nu_0\}$), taking the likelihood function to be that of $y_n \,|\, x_n$ from (12) above. Let that approximate posterior, previously denoted by $q(\Gamma \,|\, \mathcal{D})$, be denoted here by $q(\nu \,|\, \boldsymbol{\beta}^{(\mathrm{old})})$. The M-step then updates $\boldsymbol{\beta}$ by (letting $\mathbf{y} = \{y_1, y_2, \ldots, y_N\}$ and $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$),

$$\boldsymbol{\beta}^{(\mathrm{new})} = \mathrm{argmax}_{\boldsymbol{\beta}}\, Q(\boldsymbol{\beta} \,|\, \boldsymbol{\beta}^{(\mathrm{old})}); \qquad Q(\boldsymbol{\beta} \,|\, \boldsymbol{\beta}^{(\mathrm{old})}) = \mathbb{E}_{q(\nu | \boldsymbol{\beta}^{(\mathrm{old})})}\left[\log p(\mathbf{y} \,|\, \mathbf{x}, \nu)\right].$$

Although the expectation for $Q$ is analytically intractable, the univariate setting suggests that Monte Carlo will be effective. We draw $S$ samples $\nu^{(s)}$, $s = 1, \ldots, S$, from $q(\nu \,|\, \boldsymbol{\beta}^{(\mathrm{old})})$, and approximate

$$\mathbb{E}_{q(\nu | \boldsymbol{\beta}^{(\mathrm{old})})}\left[\log p(\mathbf{y} \,|\, \mathbf{x}, \nu)\right] \approx \frac{1}{S} \sum_{s=1}^S \sum_{n=1}^N \log \mathrm{Bernoulli}\!\left(y_n \,\Big|\, ((-L)')^{-1} \circ \nu^{(s)}(\boldsymbol{\beta}^\top \mathbf{z}_n)\right),$$

where $\nu^{(s)}$ is given by (8). The above expression may be automatically differentiated and optimised using standard numerical tools.
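As a rough illustration of this M-step (a sketch under assumed shapes, not the authors' implementation: the posterior draws $(\nu_0^{(s)}, \mathbf{w}^{(s)})$ are random stand-ins for an actual E-step, the integral in $\nu^{(s)}$ is approximated by fixed quadrature, plain Adam replaces L-BFGS, and the data are synthetic), the Monte Carlo objective can be written directly in PyTorch and differentiated with respect to $\boldsymbol{\beta}$.

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)
N, D, M, S = 200, 5, 16, 8                       # illustrative sizes
Z = torch.randn(N, D)                            # synthetic features z_n
y = torch.randint(0, 2, (N,)).double()           # synthetic labels as 0/1

# Stand-ins for S posterior draws (nu_0^{(s)}, w^{(s)}) from the E-step of Section 4.1.
nu0_s = 0.1 * torch.randn(S)
w_s = 0.3 * torch.randn(S, M)

m = torch.arange(1, M // 2 + 1).double()
def phi(x):                                      # cosine/sine features
    arg = torch.pi * 0.25 * x.unsqueeze(-1) * m
    return torch.cat([torch.cos(arg), torch.sin(arg)], dim=-1)

t = torch.linspace(0.0, 1.0, 64)                 # quadrature nodes for int_0^x f^2(z) dz
def nu(x, nu0, w):                               # nu(x) = nu0 + x * mean_t f^2(t * x)
    f2 = (phi(x.unsqueeze(-1) * t) @ w) ** 2     # shape (N, 64)
    return nu0 + x * f2.mean(dim=-1)

beta = torch.zeros(D, requires_grad=True)
opt = torch.optim.Adam([beta], lr=0.05)
for _ in range(200):                             # gradient ascent on the Monte Carlo Q
    x = Z @ beta                                 # x_n = beta^T z_n
    eta = torch.stack([nu(x, nu0_s[s], w_s[s]) for s in range(S)])   # (S, N) logits
    # Bernoulli with probs = sigmoid(eta), i.e. ((-L)')^{-1} applied to the source output
    Q = torch.distributions.Bernoulli(logits=eta).log_prob(y).mean(0).sum()
    opt.zero_grad(); (-Q).backward(); opt.step()
```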
To achieve this efficiently under the ISGP prior for $\mathcal{F}$, we implement a custom derivative function based on the relation

$$\nu'(x) = f^2(x) = \left(\mathbf{w}^\top \phi(x)\right)^2, \qquad (13)$$

rather than requiring the automatic differentiation library to differentiate through $\psi$ in (8). Given our approximate posterior, probabilistic predictions are obtained by marginalising out $\nu$ in a manner analogous to a GP classifier [RW05].

Computational Complexity. Our $O(M^2 N)$ per-EM-iteration algorithm admits control of $M$, a meaningful knob to control complexity. This compares favourably with the state of the art [NM20, Section 5], the fastest option of which costs $O(N \log N)$ per iteration at the expense of a rather involved data structure, without which their procedure costs $O(N^2)$.

### 4.3 The trigonometric kernel

We introduce a novel kernel designed to conveniently lend itself to the ISGP. See Appendix F for the requirements this entails, and a discussion of a tempting yet less convenient alternative (the standard Nyström approximation [RW05]). Our main insight is that we need not fix the kernel and derive the corresponding Mercer expansion; even if such an expansion is available, the integrals for $\psi(x)$ are generally intractable. Instead we recall that $k(x, z) = \phi(x)^\top \Lambda\, \phi(z) = \sum_{m=1}^M \lambda_m\, \phi_m(x)\, \phi_m(z)$, and we let the $(\lambda_m, \phi_m(x))$ pairs (of which there are an even number, $M$) be given by the following union of two sets of $M/2$ pairs,

$$\left\{\left(b/a^m,\ \cos(\pi m c x)\right)\right\}_{m=1}^{M/2} \ \cup\ \left\{\left(b/a^m,\ \sin(\pi m c x)\right)\right\}_{m=1}^{M/2}, \qquad (14)$$

on the domain $x \in [-1/c, +1/c]$, where $a > 1$ and $b > 0$ are input and output length-scale hyper-parameters and $M \approx 100$ is chosen large enough to provide sufficient flexibility. This is related to (i) the construction in [WB17] (but has the advantage of admitting closed form expressions for $k$), and (ii) the random features approaches of e.g. [RR08, LGQCRFV10, CBMF17] (which differ by approximating a fixed covariance with a trigonometric basis). It is easy to show that the kernel is translation invariant and that the prior variance, which is especially relevant here due to Theorem 5, is given by

$$k(x, x) = k(0, 0) = b\,(1 - a^{-M/2})/(a - 1). \qquad (15)$$

We give expressions for $k(x, z)$ and $\psi(x)$ in Appendix E, and an illustration in Figure 4.

## 5 Experiments

We provide illustrative examples and quantitative comparisons of the ISGP prior in univariate regression/classification, and inferring proper losses. The code for the real world problems of learning the loss is available online.² We fixed $M = 64$ in (14). Classification problems used $((-L)')^{-1} = \sigma$ with $\sigma(x) = 1/(1 + \exp(-x))$. Optimisation was performed with L-BFGS [BLNZ95]. Gradients were computed with automatic differentiation (with the notable exception of (13)) using PyTorch [PGM+19]. The experiments took roughly two CPU months.

²https://gitlab.com/cwalder/linkgistic

Table 1: Mean test set negative log likelihoods for various isotonic regression methods. See text for details.

| Train | Input | GP | ISGP | PAVA | Linear |
|---|---|---|---|---|---|
| Large | accel. | 2080.9 | 2083.4 | 2055.0 | 2105.9 |
| | displ. | 788.1 | 780.6 | 760.7 | 907.2 |
| | power | 763.4 | 763.9 | 755.5 | 1002.2 |
| | weight | 735.5 | 750.7 | 769.9 | 788.4 |
| Small | accel. | 2173.3 | 2216.8 | 2212.9 | 2199.6 |
| | displ. | 871.0 | 849.4 | 871.4 | 950.9 |
| | power | 827.3 | 811.8 | 826.4 | 1033.5 |
| | weight | 777.2 | 770.7 | 835.8 | 815.8 |
Table 2: Test AUC for generalised linear models with various link methods (ordering in decreasing average). See text for details.

| Method | mnist | fmnist |
|---|---|---|
| ISGP-Linkgistic | 99.9% | 99.2% |
| GP-Linkgistic | 99.9% | 99.1% |
| Logistic regression | 99.9% | 98.5% |
| GLMTron | 99.6% | 98.1% |
| BREGMANTRON | 99.7% | 97.9% |
| BREGMANTRON_label | 99.6% | 97.7% |
| BREGMANTRON_approx | 99.3% | 94.6% |
| SLISOTRON | 94.6% | 90.7% |

Figure 2: Test performance vs. training set size for the MNIST, Kuzushiji-MNIST and Fashion-MNIST datasets. We compare logistic regression (logistic), GP-LINKGISTIC (GP) and ISGP-LINKGISTIC (ISGP). Panel (a) shows the mean one-vs-the-rest binary classification accuracy, and panel (b) shows the ten-class classification accuracy obtained by combining the one-vs-the-rest models.

Univariate Regression Toy. As depicted in Figure 3, we combined the Gaussian likelihood function (17) with both the GP and ISGP function priors for $\nu$. We investigate this setup further in the supplementary Figure 5 and Figure 6, which also depict the latent $f$ and $f^2$, which are related by $\nu' = f^2$. Finally, Figure 7 visualises the marginal likelihood surface near the optimal point found by the method of Section D. See the captions for details.

Univariate Classification Toy. We illustrate the mechanics of the classification model with ISGP prior and the sigmoid-Bernoulli likelihood function of (19). We constructed a univariate binary classification problem by composing the sigmoid with a piece-wise linear function, which allows us to compare both the inferred $\nu$ and the inferred (inverse) canonical link $\chi^{-1} = ((-L)')^{-1} \circ \nu$ to their respective ground truths; see Figure 8 and the caption therein for more details.

Real-world Univariate Regression. We compared our ISGP based regression model to (a) GP regression with the same kernel of Section 4.3, (b) the widely used pool adjacent violators algorithm (PAVA) [Kru64] for isotonic regression, and (c) linear least squares. For the ISGP and GP models we used maximum marginal likelihood hyper-parameters. The task was to regress each of the four real-valued features of the auto-mpg dataset [DG17] onto the target (car efficiency in miles per gallon). We partitioned the dataset into five splits. For the Small problems, we trained on one split and tested on the remaining four; for the Large problems we did the converse. We repeated this all $\binom{5}{1} = \binom{5}{4} = 5$ ways and averaged the results to obtain Table 1. As expected, the Bayesian GP and ISGP models, which have the advantage of making stronger regularity assumptions, perform relatively better with less data (the Small case). Our ISGP is in turn superior to the GP in that case, in line with the appropriateness of the monotonicity assumption which it captures.

Real-world Problems of Learning the Loss. We bench-marked our loss inference algorithm of Section 4.2 on binary classification problems. Note that multi-class problems may be handled via e.g. a 1-vs-all or 1-vs-1 multiple coding, while more direct handling may be a topic of future work.
We used both the GP and ISGP function priors, and we refer to the resulting algorithms as GP-LINKGISTIC and ISGP-LINKGISTIC, respectively, only the latter of which gives rise to guaranteed proper losses. To ease the computational burden, we fixed the hyper-parameters throughout. We set $\mu = 0$ and $\gamma = 0.01$. We set the length-scale parameter $a = 1.2$ and chose $b$ such that $k(0, 0) = 1$ (using (15)); by (10) this makes $\nu(x) = x$ the most likely source a priori, to roughly bias towards traditional logistic regression. For the GP we tuned $b$ for good performance on the test set (for the full 60K training examples), to give the GP an unfair advantage over the ISGP, although this advantage is small as GP-LINKGISTIC is rather insensitive to the choice of $b$. Table 2 compares the SLISOTRON and BREGMANTRON algorithms from [KKSK11] and [NM20], respectively, along with the other baselines from the latter work. Further details on the experimental setup are provided in [NM20]. In contrast with the SLISOTRON and BREGMANTRON algorithms, our models successfully match or outperform the logistic regression baseline. Moreover, the monotonic ISGP-LINKGISTIC slightly outperforms GP-LINKGISTIC, and as far as we know records the first result beating logistic regression on this problem, by a reasonable margin on fmnist [NM20]. We further bench-marked ISGP-LINKGISTIC against GP-LINKGISTIC and logistic regression (as the latter was the strongest practical algorithm in the experiments of [NM20]) on a broader set of tasks, namely the three MNIST-like datasets of [LC10, XRV17, CBIK+18]. We found that ISGP-LINKGISTIC dominates on all three datasets as the training set size increases; see Figure 2 and the caption therein for more details. Figure 9 depicts an example of the learned (inverse) link functions.

## 6 Conclusion

We have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness. Our contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility (which we inherit from the Bayesian approach) and, as a direct consequence, practical effectiveness, as evidenced by our state of the art performance. Our model is both highly general and yet capable of out-performing even the most classic baseline, the logistic loss for binary classification. As such, this represents an interesting step toward more flexible modelling in a wider machine learning context, which typically works with a loss function that is prescribed a priori. Since the tricks we use essentially rely on the loss being expressible as a Bregman divergence, and since Bregman divergences are also a principled distortion measure for unsupervised learning, such as in the context of the popular k-means and EM algorithms, an interesting avenue for future work is to investigate the potential of our approach for unsupervised learning.

## Broader Impact

From an ethical standpoint, it is likely that any advancements in fundamental machine learning methodologies will eventually give rise to both positive and negative outcomes. The present work is sufficiently general and application independent, however, as to pose relatively little cause for immediate concern. Nonetheless, we note one example of an optimistic take on the potential impact of the present work.
The example is from the field of quantitative criminology, wherein it is advocated that further study of asymmetric loss functions may provide lawmakers with procedures far more sensitive to the real consequences of forecasting errors in the context of law enforcement [Ber11]. Such a study implies that symmetric links like the ones derived from most common losses: logistic, square, Matsushita, etc. are not a good fit for sensitive fields. Since inference on proper losses can produce non-symmetric links, the present contribution may be useful for such application fields. Acknowledgments and Disclosure of Funding We thank the anonymous reviewers for their insightful feedback, and we thank The Boeing Company for partly funding this work. [ABE+55] Miriam Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and Edward Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, (4), 1955. [Ama12] S.-I. Amari. New developments of information geometry (17): Tsallis q-entropy, escort geometry, conformal geometry. In Mathematical Sciences (suurikagaku), number 592, pages 73 82. Science Company, October 2012. In Japanese. [Ama13] S.-I. Amari. New developments of information geometry (26): Information geometry of convex programming and game theory. In Mathematical Sciences (suurikagaku), number 605, pages 65 74. Science Company, November 2013. in japanese. [AN00] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000. [ANARL19] Clement Abi Nader, Nicholas Ayache, Philippe Robert, and Marco Lorenzi. Monotonic Gaussian Process for Spatio-Temporal Disease Progression Modeling in Brain Imaging Data. Neuro Image, 2019. [APS10] Pankaj K. Agarwal, Jeff M. Phillips, and Bardia Sadri. Lipschitz unimodal and isotonic regression on paths and trees. In Proc. of the 9th Latin American Symposium on Theoretical Informatics, pages 384 396, 2010. [Bac18] Francis Bach. Efficient algorithms for non-convex isotonic regression through submodular optimization. In Advances in Neural Information Processing Systems 31. 2018. [Ber11] Richard Berk. Asymmetric loss functions for forecasting in criminal justice settings. Journal of Quantitative Criminology, 2011. [BGW05] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664 2669, 2005. [BJM06] P. Bartlett, M. Jordan, and J. D. Mc Auliffe. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138 156, 2006. [BLNZ95] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal of Scientific Computation, 1995. [Bre67] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200 217, 1967. [BSS05] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation ans classification: structure and applications, 2005. Technical Report, University of Pennsylvania. [BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004. [CBG+17] M. Cissé, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: improving robustness to adversarial examples. In 34th ICML, 2017. [CBIK+18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature, 2018. 
[CBMF17] Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. volume 70 of Proceedings of Machine Learning Research, pages 884 893, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. [Cha10] Kian Ming Adam Chai. Multi-task learning with Gaussian processes. Ph D thesis, University of Edinburgh, UK, 2010. [CM00] Claudia Czado and Axel Munk. Noncanonical links in generalized linear models - when is the effort justified? Journal of Statistical Planning and Inference, 87, 2000. [CR02] Claudia Czado and Adrian Raftery. Choosing the link function and accounting for link uncertainty in generalized linear models using bayes factors. Statistical Papers, 47, 2002. [Cza97] Claudia Czado. On selecting parametric link transformation families in generalized linear models. Journal of Statistical Planning and Inference, 61:125 139, 05 1997. [d F49] B. de Finetti. Rôle et domaine d application du théorème de Bayes selon les différents points de vue sur les probabilités (in French). In 18th International Congress on the Philosophy of Sciences, pages 67 82, 1949. [DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. [FTS17] S. Flaxman, Y.W. Teh, and D. Sejdinovic. Poisson Intensity Estimation with Reproducing Kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. [GBCC15] Shirin Golchi, D. Bingham, H. Chipman, and David Campbell. Monotone emulation of computer experiments. SIAM/ASA Journal on Uncertainty Quantification, 3:370 392, 01 2015. [GRR] A Gelman, CP Robert, and J Rousseau. Inherent difficulties of non-Bayesian likelihood-based inference, as revealed by an examination of a recent book by Aitkin. Statistics & Risk Modeling, 30. [HHLK19] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential gaussian process flows. In Proceedings of Machine Learning Research, volume 89, pages 1812 1821, 16 18 Apr 2019. [HT92] W.K. Hardle and Berwin Turlach. Nonparametric approaches to generalized linear models. 02 1992. [KEC19] Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Gaussian process latent variable alignment learning. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, 2019. [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24. 2011. [KM99] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109 128, 1999. [Kru64] J.B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964. [KS09] Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. [LC10] Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. [LGOR15] Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for gaussian process modulated poisson processes. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1814 1822, Lille, France, 2015. PMLR. [LGQCRFV10] Miguel Lázaro-Gredilla, Joaquin Quiñnero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum gaussian process regression. Journal of Machine Learning Research, 11(63):1865 1881, 2010. 
[LH19] Siqi Liu and Milos Hauskrecht. Nonparametric regressive point processes based on conditional gaussian processes. In Advances in Neural Information Processing Systems 32. 2019. [Lim18] Cong Han Lim. An efficient pruning algorithm for robust isotonic regression. In Advances in Neural Information Processing Systems 31. 2018. [LR14] Ronny Luss and Saharon Rosset. Generalized isotonic regression. Journal of Computational and Graphical Statistics, 23, 2014. [LRG+17] Cheng Li, Santu Rana, Satyandra K. Gupta, Vu Nguyen, and Svetha Venkatesh. Bayesian optimization with monotonicity information. In NIPS Workshop on Bayesian Optimization, 2017. [MM06] Peter Mc Cullagh and Jesper Møller. The permanental process. Advances in Applied Probability, 38(4), 2006. [MP92] Arak Mathai and Serge Provost. Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker, Inc., 1992. [MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651 2667, 2006. [NDF00] Ioannis Ntzoufras, Petros Dellaportas, and Jonathan Forster. Bayesian variable and link determination for generalised linear models. Journal of Statistical Planning and Inference, 111, 03 2000. [NM20] Richard Nock and Aditya Krishna Menon. Supervised learning: No loss no cry. In ICML 20, 2020. [NN08] Richard Nock and Frank Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201 1208, 2008. [NNA16] Richard Nock, Frank Nielsen, and Shun-ichi Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:1 12, 2016. [Nys28] Evert Johannes Nyström. Über die praktische auflösung von linearen integralgleichungen mit anwendungen auf randwertaufgaben der potentialtheorie. Commentationes physico-mathematicae, 4:1 52, 1928. [PGM+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024 8035. Curran Associates, Inc., 2019. [RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177 1184. 2008. [RV10] Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010. [RW05] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005. [RW10] Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387 2422, 2010. [Sav71] L.-J. Savage. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783 801, 1971. [Sch89] M.-J. Schervish. A general method for comparing probability assessors. Ann. of Stat., 17(4):1856 1879, 1989. [SF06] Hans J. Skaug and David A. Fournier. Automatic approximation of the marginal likelihood in non-gaussian hierarchical models. Computational Statistical Data Analysis, 51, 2006. [SPV16] Eero Siivola, Juho Piironen, and Aki Vehtari. Automatic monotonicity detection for gaussian processes. In ar Xiv 1610.05440, 2016. [ST03] Tomoyuki Shirai and Yoichiro Takahashi. 
Random point fields associated with certain fredholm determinants ii: Fermion shifts and their ergodic and gibbs properties. The Annals of Probability, (3), 07 2003. [Tel13] M. Telgarsky. Boosting with the logistic loss is consistent. In 26 th COLT, pages 911 965, 2013. [UKEC19] Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Monotonic gaussian process flow. In ar Xiv 1905.12930, 2019. [Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134 1142, 1984. [Vap98] V. Vapnik. Statistical Learning Theory. John Wiley, 1998. [WB17] Christian J. Walder and Adrian N. Bishop. Fast bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 2017. [XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. [YLR+18] Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, and Svetha Venkatesh. Sparse approximation for gaussian process with derivative observations. In AI 2018: Advances in Artificial Intelligence, 2018. [YW09] L Yeganova and W Wilbur. Isotonic regression under lipschitz constraint. Journal of Optimization Theory and Applications, 2009. [ZE02] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Knowledge Discovery and Data Mining, 2002. [Zha04] J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16:159 195, 2004.