# Multivariate Probability Calibration with Isotonic Bernstein Polynomials

Yongqiao Wang¹ and Xudong Liu²
¹College of Finance, Zhejiang Gongshang University, China
²School of Computing, University of North Florida, USA
{wangyq@zjsu.edu.cn, xudong.liu@unf.edu}

Multivariate probability calibration is the problem of predicting class membership probabilities from the classification scores of multiple classifiers. To achieve better performance, the calibrating function is often required to be coordinate-wise non-decreasing; that is, for every classifier, the higher the score, the higher the probability of the class label being positive. To this end, we propose a multivariate regression method based on shape-restricted Bernstein polynomials. This method is universally flexible: it can approximate any continuous calibrating function to any specified error as the polynomial degree increases to infinity. Moreover, it is universally consistent: the estimated calibrating function converges to any continuous calibrating function as the training size increases to infinity. Our empirical study shows that the proposed method achieves better calibrating performance than benchmark methods.

## 1 Introduction

Predicting the class membership probability of a given sample is a crucial component of various machine-learning-based decision making systems [He et al., 2015; Ustun and Rudin, 2019; Schwarz and Heider, 2019]. One approach to obtaining such predictions is to acquire a probability function by estimating the joint distribution of the class label and the sample vector. However, since the sample vector usually includes both categorical and continuous random variables, estimating this joint distribution is a challenging task. An alternative is probability calibration, which consists of two steps. First, it independently trains $M$ classifiers whose underlying scoring functions map a feature vector to an $M$-dimensional score vector. Second, it trains a probability calibrating function that maps the $M$-dimensional score vector to a membership probability. We see two advantages of this alternative over direct estimation. One, it incorporates well-established classification algorithms of choice to produce sample scores. The other, it decreases the dimension of the joint distribution to $M$, assuming $M$ is much smaller than the dimension of the feature vector.

When $M = 1$, i.e., there is only one classifier, many parametric and non-parametric methods have been proposed for probability calibration. The related-work section of Wang et al. [2019] surveys such methods for binary classification. In general, parametric methods suffer from a lack of flexibility and vulnerability to mis-specification, while non-parametric calibration methods, including histogram binning and isotonic (or monotonic) regression, usually require a large amount of data to perform reasonably well. Recently, researchers have shown that imposing isotonicity and smoothness on the calibrating function can greatly improve out-of-sample prediction performance when the training size is small [Zadrozny and Elkan, 2002; Jiang et al., 2011; Naeini and Cooper, 2016]. However, requiring monotonicity is a rather difficult task, for it is imposed at every point in the domain along every dimension, which contributes infinitely many inequality constraints.
To this end, we propose to solve the multivariate probability calibration problem for binary-classified samples by shape-restricted regression with multivariate Bernstein polynomials (Bern Poly Fusion, for short). In statistics, univariate Bernstein polynomials have found many successful applications in nonlinear regression under shape restrictions such as monotonicity and convexity [Wang and Ghosh, 2012]. The key contributions of this work are as follows. First, this non-parametric method has asymptotic universal flexibility; that is, as the polynomial degree increases, the proposed function family can approximate any continuous multivariate calibrating function to any specified error. This is a great advantage over Ozdemir et al. [2017], in which the specified parametric copula family limits flexibility. Second, the calibrating function estimated by this method is coordinate-wise isotonic over the entire domain. This is an advantage over Zhong and Kwok [2013].

In this paper, all deterministic variables are written in lower-case letters, while random variables are written in capital letters. Scalars are written in normal letters, and vectors are written in boldface letters. For example, the $i$-th element of a constant vector $\mathbf{s}$ is $s_i$, and the $i$-th element of a random vector $\mathbf{X}$ is $X_i$. We write $\mathbb{P}[A]$ for the probability of event $A$ and $\mathbb{E}[\xi]$ for the mathematical expectation of a random variable $\xi$. We denote by $\mathbb{I}\{\cdot\}$ the indicator function, and $\triangleq$ reads "equal to by definition".

## 2 Problem Formulation

In a binary classification problem, one considers a random vector $(\mathbf{X}, Y)$, where the feature vector $\mathbf{X} \in \mathcal{X} \subseteq \mathbb{R}^d$ and the class label $Y \in \{0, 1\}$. The classification problem aims to predict the class label $Y$ of each $\mathbf{x} \in \mathcal{X}$. A typical classification model first estimates a scoring function $s : \mathcal{X} \to \mathbb{R}$, then predicts the class label of $\mathbf{x}$ according to whether its score $s(\mathbf{x})$ is above or below a threshold. A good scoring function is expected to have good ranking power: for any $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, if $s(\mathbf{x}_1) > s(\mathbf{x}_2)$, then $\mathbb{P}[Y = 1 \mid \mathbf{X} = \mathbf{x}_1] \ge \mathbb{P}[Y = 1 \mid \mathbf{X} = \mathbf{x}_2]$. Without loss of generality, this paper assumes $s(\mathcal{X}) \subseteq [0, 1]$; otherwise one can achieve this by any increasing transformation.

Instead of class labels, this paper studies how to predict the conditional probability function $g : \mathcal{X} \to [0, 1]$,

$$g(\mathbf{x}) \triangleq \mathbb{P}[Y = 1 \mid \mathbf{X} = \mathbf{x}]. \tag{1}$$

Then $\mathbb{P}[Y = 0 \mid \mathbf{X} = \mathbf{x}] = 1 - g(\mathbf{x})$. One can obtain $g(\cdot)$ by estimating the joint distribution of $(\mathbf{X}, Y)$, i.e., the two conditional distributions of $\mathbf{X} \mid Y = 1$ and $\mathbf{X} \mid Y = 0$, from training data. However, this task is hard in many applications. An alternative is post-processing that relies on classification models. Assume that $M$ classification models have been independently trained, with $\mathbf{s}(\mathbf{x}) \triangleq (s_1(\mathbf{x}), \ldots, s_M(\mathbf{x}))^\top$. If the calibrating function is $f : [0, 1]^M \to [0, 1]$,

$$f(\mathbf{s}) \triangleq \mathbb{P}[Y = 1 \mid \mathbf{S} = \mathbf{s}], \tag{2}$$

then the membership probability for $\mathbf{x}$ is

$$\mathbb{P}[Y = 1 \mid \mathbf{X} = \mathbf{x}] = f(\mathbf{s}(\mathbf{x})). \tag{3}$$

$f(\cdot)$ should be estimated by calibration models from the training data $\{(\mathbf{s}(\mathbf{X}_i), Y_i)\}_{i=1}^n$. Note that this calibration is necessary for probability prediction, since many classification methods produce scoring functions that have no direct relationship with membership probability [Guo et al., 2017; Ott et al., 2018; Kuleshov et al., 2018; Kuleshov and Ermon, 2017]. To simplify notation, we let $\mathbf{s} \triangleq \mathbf{s}(\mathbf{x})$, $\mathbf{S} \triangleq \mathbf{s}(\mathbf{X})$, $\mathbf{S}_i \triangleq \mathbf{s}(\mathbf{X}_i)$, and $\mathcal{D}_n \triangleq \{(\mathbf{S}_i, Y_i)\}_{i=1}^n$.
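To make the two-step procedure concrete, the sketch below builds the calibration training set $\mathcal{D}_n = \{(\mathbf{S}_i, Y_i)\}_{i=1}^n$ from two independently trained classifiers. It is a minimal illustration under our own assumptions (synthetic data, scikit-learn estimators, and a min-max rescaling of scores into $[0,1]$), not the paper's setup; the paper's experiments instead use feedforward networks and an empirical-CDF transform of the scores (Section 5), and any increasing transformation is admissible because it preserves the ranking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data split into three parts: one per classifier, one for calibration.
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
X1, y1 = X[:500], y[:500]          # trains classifier 1
X2, y2 = X[500:1000], y[500:1000]  # trains classifier 2
Xc, yc = X[1000:], y[1000:]        # held out for the calibrating model

clf1 = LogisticRegression(max_iter=1000).fit(X1, y1)
clf2 = GradientBoostingClassifier().fit(X2, y2)

def to_unit_interval(scores):
    """Increasing transformation pushing raw scores into [0, 1] (our assumption)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

# Score vector S_i = (s_1(X_i), s_2(X_i)) for each calibration sample.
s1 = to_unit_interval(clf1.decision_function(Xc))
s2 = to_unit_interval(clf2.predict_proba(Xc)[:, 1])
S = np.column_stack([s1, s2])      # shape (n, M) with M = 2
D_n = (S, yc)                      # calibration training data {(S_i, Y_i)}
```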
## 3 Related Work

### 3.1 Copula Fusion

Copula Fusion, proposed by Ozdemir et al. [2017], estimates $f(\cdot)$ by Bayes' theorem:

$$f(\mathbf{s}) = \frac{\mathbb{P}[\mathbf{S} = \mathbf{s} \mid Y = 1]\,\mathbb{P}[Y = 1]}{\sum_{\ell \in \{0,1\}} \mathbb{P}[\mathbf{S} = \mathbf{s} \mid Y = \ell]\,\mathbb{P}[Y = \ell]}. \tag{4}$$

Each $\mathbf{S} \mid Y = \ell$ has the following copula representation:

$$\mathbb{P}[S_1 \le s_1, \ldots, S_M \le s_M \mid Y = \ell] = C_\ell\big(\mathbb{P}[S_1 \le s_1 \mid Y = \ell], \ldots, \mathbb{P}[S_M \le s_M \mid Y = \ell]\big), \tag{5}$$

where $C_\ell : [0, 1]^M \to [0, 1]$ is the copula function. In this representation, each multivariate distribution consists of $M$ marginal distributions $\mathbb{P}[S_m \le s_m \mid Y = \ell]$ and one copula function $C_\ell(\cdot)$. To estimate the calibrating function $f(\cdot)$ from the training data $\mathcal{D}_n$, one must estimate two priors $\mathbb{P}[Y = \ell]$, $2M$ marginal conditional distributions $\mathbb{P}[S_m \le s_m \mid Y = \ell]$, and two copulas $C_\ell$. Each $\mathbb{P}[Y = \ell]$ can be replaced with its plug-in estimator $\sum_{i=1}^n \mathbb{I}\{Y_i = \ell\}/n$. Each $\mathbb{P}[S_m \le s_m \mid Y = \ell]$ can be obtained independently with kernel density estimation, and each $C_\ell$ by maximum likelihood estimation.

### 3.2 MR-MIC

MR-MIC, proposed by Zhong and Kwok [2013], estimates the membership probabilities $\{p_i\}_{i=1}^n$ for $\{\mathbf{X}_i\}_{i=1}^n$ with the following optimization:

$$\min_{\mathbf{p}} \; \sum_{i=1}^n (Y_i - p_i)^2 + \frac{\lambda}{2}\,\mathbf{p}^\top \Omega\,\mathbf{p} \tag{6a}$$
$$\text{s.t.} \quad p_i \ge p_j \quad \text{if } (i, j) \in E, \tag{6b}$$

where $\sum_{i=1}^n (Y_i - p_i)^2$ is the empirical error, a plug-in estimator of the $L_2$ risk $\mathbb{E}[f(\mathbf{S}) - Y]^2$; $\mathbf{p}^\top \Omega\,\mathbf{p}$ is the manifold regularization term for smoothness; and $\lambda > 0$ is a hyperparameter controlling the trade-off between the empirical error and the regularization. Eq. (6b) is the isotonic constraint, where $E$ is the transitive reduction of the relation $\{(i, j) \mid \mathbf{S}_i > \mathbf{S}_j\}$. However, requiring monotonicity only on ordered pairs of sample points cannot guarantee monotonicity over the entire domain $[0, 1]^M$; the estimate may fail to be monotone, especially when the training size is small.

## 4 Methodology

To keep the presentation simple, we limit our discussion to the case $M = 2$; the extension to $M \ge 3$ is straightforward. This paper estimates the calibrating function $f(\cdot)$ with the shape-restricted non-parametric regression

$$\min_{f \in \mathcal{F}} \; \sum_{i=1}^n \big[f(\mathbf{S}_i) - Y_i\big]^2, \tag{7}$$

where $\mathcal{F}$ is the set

$$\mathcal{F} \triangleq \Big\{ f \in \mathcal{C}_{[0,1]^2} \;\Big|\; f(0, 0) \ge 0,\; f(1, 1) \le 1,\; f \text{ is coordinate-wise non-decreasing} \Big\},$$

and $f \in \mathcal{C}_{[0,1]^2}$ means that $f$ is continuous over $[0, 1]^2$. The requirement of coordinate-wise non-decrease arises from the ranking power of each scoring function. This requirement is very strong: since it is imposed at every point of $[0, 1]^2$, the calibrating function is continuously constrained and involves an uncountable number of inequality constraints. Generally, this kind of optimization is a semi-infinite program [Reemtsen and Rückmann, 1998], because it involves a finite number of decision variables and an infinite number of inequality constraints. Over-restricted models specify simple functional families on which shape restrictions are easy to impose, but they achieve computational convenience at the expense of flexibility. Under-restricted models replace the shape restrictions with finitely many inequality constraints at the samples or at finitely many grid points, and thus fail to guarantee full adherence.

This paper solves the probability calibration problem with shape-restricted multivariate Bernstein polynomials. Univariate Bernstein polynomials are widely regarded as the best shape-preserving functions and have been successfully applied to univariate shape-restricted non-parametric regression [Wang and Ghosh, 2012]. Let $f(\cdot)$ be approximated with a degree-$(K_1, K_2)$ Bernstein polynomial

$$B_{K_1 K_2}(s_1, s_2) \triangleq \sum_{k_1=0}^{K_1} \sum_{k_2=0}^{K_2} \beta_{k_1 k_2}\, b_{k_1, K_1}(s_1)\, b_{k_2, K_2}(s_2), \tag{8}$$

where $b_{k_m, K_m}(\cdot)$ is a univariate degree-$K_m$ basis Bernstein polynomial,

$$b_{k_m, K_m}(s_m) \triangleq \binom{K_m}{k_m} s_m^{k_m} (1 - s_m)^{K_m - k_m}. \tag{9}$$

$B_{K_1 K_2}$ is a linear combination of $(K_1 + 1)(K_2 + 1)$ bivariate basis Bernstein polynomials.
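As a quick sanity check on Eqs. (8)-(9), the snippet below evaluates the univariate basis polynomials and their tensor-product combination. It is an illustrative sketch (NumPy/SciPy, with arbitrarily chosen coefficients), not code from the paper.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(K, s):
    """Return the (K+1) univariate basis polynomials b_{k,K}(s) of Eq. (9)."""
    k = np.arange(K + 1)
    return comb(K, k) * s**k * (1.0 - s)**(K - k)

def bernstein_2d(beta, s1, s2):
    """Evaluate B_{K1 K2}(s1, s2) of Eq. (8) for a coefficient matrix beta."""
    K1, K2 = beta.shape[0] - 1, beta.shape[1] - 1
    b1 = bernstein_basis(K1, s1)           # length K1+1
    b2 = bernstein_basis(K2, s2)           # length K2+1
    return b1 @ beta @ b2                  # sum_{k1,k2} beta[k1,k2] b1[k1] b2[k2]

# Example coefficients beta[k1, k2] = (k1/K1 + k2/K2)/2 are non-decreasing along
# both indices, so (by Theorem 1 below) the surface is coordinate-wise non-decreasing.
K1 = K2 = 5
k1, k2 = np.meshgrid(np.arange(K1 + 1), np.arange(K2 + 1), indexing="ij")
beta = (k1 / K1 + k2 / K2) / 2.0
print(bernstein_2d(beta, 0.0, 0.0),   # equals beta[0, 0] = 0.0
      bernstein_2d(beta, 0.3, 0.8),   # some value strictly between 0 and 1
      bernstein_2d(beta, 1.0, 1.0))   # equals beta[K1, K2] = 1.0
```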
**Theorem 1.** A sufficient condition for $B_{K_1 K_2} \in \mathcal{F}$ is

$$\beta_{00} \ge 0, \qquad \beta_{K_1 K_2} \le 1, \tag{10}$$
$$\beta_{0 k_2} \le \beta_{1 k_2} \le \cdots \le \beta_{K_1 k_2}, \quad k_2 \in \{0, \ldots, K_2\}, \tag{11}$$
$$\beta_{k_1 0} \le \beta_{k_1 1} \le \cdots \le \beta_{k_1 K_2}, \quad k_1 \in \{0, \ldots, K_1\}. \tag{12}$$

*Proof.* The proof is based on three properties of univariate basis Bernstein polynomials:

$$b_{k,K}(0) = \begin{cases} 1 & k = 0 \\ 0 & k = 1, \ldots, K \end{cases}, \qquad b_{k,K}(1) = \begin{cases} 0 & k = 0, \ldots, K-1 \\ 1 & k = K \end{cases},$$

$$b'_{k,K}(s) = \begin{cases} -K\, b_{0, K-1}(s) & \text{if } k = 0, \\ K\, b_{K-1, K-1}(s) & \text{if } k = K, \\ K\,\big[b_{k-1, K-1}(s) - b_{k, K-1}(s)\big] & \text{otherwise.} \end{cases}$$

Since $B_{K_1 K_2}(0, 0) = \beta_{00}$ and $B_{K_1 K_2}(1, 1) = \beta_{K_1 K_2}$, we have $B_{K_1 K_2}(0, 0) \ge 0 \Leftrightarrow \beta_{00} \ge 0$ and $B_{K_1 K_2}(1, 1) \le 1 \Leftrightarrow \beta_{K_1 K_2} \le 1$. For all $\mathbf{s} \in [0, 1]^2$, we have

$$\frac{\partial B_{K_1 K_2}(s_1, s_2)}{\partial s_1} = K_1 \sum_{k_1=0}^{K_1 - 1} \sum_{k_2=0}^{K_2} \big[\beta_{(k_1+1) k_2} - \beta_{k_1 k_2}\big]\, b_{k_1, K_1 - 1}(s_1)\, b_{k_2, K_2}(s_2).$$

Because $b_{k,K}(s) \ge 0$ for $s \in [0, 1]$, Eq. (11) implies that $B_{K_1 K_2}(s_1, s_2)$ is non-decreasing with respect to $s_1$. The sufficiency of Eq. (12) for $s_2$ is proved similarly. □

The probability calibration problem with two foregoing classification models can thus be solved by fitting a bivariate degree-$(K_1, K_2)$ Bernstein polynomial to the training samples $\{(S_{i1}, S_{i2}, Y_i)\}_{i=1}^n$. The corresponding quadratic program is

$$\min_{\beta} \; \sum_{i=1}^n \Big[ \sum_{k_1=0}^{K_1} \sum_{k_2=0}^{K_2} \beta_{k_1 k_2}\, b_{k_1, K_1}(S_{i1})\, b_{k_2, K_2}(S_{i2}) - Y_i \Big]^2 \quad \text{s.t. Eqs. (10)-(12).} \tag{13}$$

This quadratic program is convex and tractable, and can be solved efficiently by many off-the-shelf optimization packages. If the optimal solution of optimization (13) is $\beta^*$, the estimated calibrating function is

$$\hat{f}(s_1, s_2) = \sum_{k_1=0}^{K_1} \sum_{k_2=0}^{K_2} \beta^*_{k_1 k_2}\, b_{k_1, K_1}(s_1)\, b_{k_2, K_2}(s_2). \tag{14}$$

Let $\mathcal{F}_K$ be the family of shape-restricted degree-$K$ bivariate Bernstein polynomials,

$$\mathcal{F}_K \triangleq \Big\{ B_K(\mathbf{s}) = \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} \beta_{k_1 k_2}\, b_{k_1, K}(s_1)\, b_{k_2, K}(s_2) \;:\; \beta \text{ satisfies Eqs. (10)-(12)} \Big\}. \tag{15}$$
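Since the objective in (13) is a convex quadratic in $\beta$ and the constraints (10)-(12) are linear, the fit maps directly onto a disciplined convex program. The paper solves it with CVX in Matlab; the sketch below is an assumed cvxpy analogue (variable names, solver defaults, and the helper function are ours), not the authors' code.

```python
import numpy as np
import cvxpy as cp
from scipy.special import comb

def bernstein_basis(K, s):
    """Univariate basis polynomials b_{k,K}(s) of Eq. (9)."""
    k = np.arange(K + 1)
    return comb(K, k) * s**k * (1.0 - s)**(K - k)

def fit_bern_poly_fusion(S, y, K1=5, K2=5):
    """Solve the quadratic program (13) and return the coefficient matrix beta*.

    S : (n, 2) array of scores in [0, 1];  y : (n,) array of 0/1 labels.
    """
    B1 = np.stack([bernstein_basis(K1, s) for s in S[:, 0]])   # (n, K1+1)
    B2 = np.stack([bernstein_basis(K2, s) for s in S[:, 1]])   # (n, K2+1)

    beta = cp.Variable((K1 + 1, K2 + 1))
    # Fitted value for sample i: sum_{k1,k2} beta[k1,k2] b_{k1,K1}(S_i1) b_{k2,K2}(S_i2).
    fitted = cp.sum(cp.multiply(B1 @ beta, B2), axis=1)
    constraints = [
        beta[0, 0] >= 0, beta[K1, K2] <= 1,   # Eq. (10)
        cp.diff(beta, axis=0) >= 0,           # Eq. (11): non-decreasing along k1
        cp.diff(beta, axis=1) >= 0,           # Eq. (12): non-decreasing along k2
    ]
    cp.Problem(cp.Minimize(cp.sum_squares(fitted - y)), constraints).solve()
    return beta.value

# Usage with the calibration data built earlier (assumed names):
# beta_star = fit_bern_poly_fusion(S, yc)
# p_hat = bernstein_2d(beta_star, 0.7, 0.4)   # Eq. (14): estimated P[Y=1 | s=(0.7, 0.4)]
```

Note that, for $K_1 = K_2 = K$, the program has $(K+1)^2$ decision variables and $2K(K+1)+2$ constraints, both independent of the training size $n$ (see Section 5.3).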
### 4.1 Universal Flexibility

This subsection proves that $\bigcup_{K=1}^{\infty} \mathcal{F}_K$ is dense in $\mathcal{F}$ with respect to the sup-norm, which means that shape-restricted Bernstein polynomials can approximate any continuous coordinate-wise non-decreasing calibrating function to arbitrary accuracy as the degree $K$ increases to infinity.

**Theorem 2.** The sequence of function families $\mathcal{F}_K$ is nested in $\mathcal{F}$, i.e.,

$$\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_K \subseteq \cdots \subseteq \mathcal{F} \cap L_2[0,1]^2,$$

where $f \in L_2[0,1]^2$ means $\int_{[0,1]^2} |f(\mathbf{s})|^2\, d\mathbf{s} < +\infty$.

*Proof.* (I) $\mathcal{F}_K \subseteq \mathcal{F}_{K+1}$ for all $K$. Any $B_K \in \mathcal{F}_K$, $B_K(\mathbf{s}) = \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} \beta_{k_1 k_2}\, b_{k_1,K}(s_1)\, b_{k_2,K}(s_2)$, can be rewritten as a degree-$(K+1)$ Bernstein polynomial

$$B_K(\mathbf{s}) = \sum_{k_1=0}^{K+1} \sum_{k_2=0}^{K+1} \tilde{\beta}_{k_1 k_2}\, b_{k_1, K+1}(s_1)\, b_{k_2, K+1}(s_2) \tag{16}$$

with

$$\tilde{\beta}_{k_1 k_2} = \frac{(K+1-k_1)(K+1-k_2)}{(K+1)^2}\, \beta_{k_1 k_2} + \frac{(K+1-k_1)\, k_2}{(K+1)^2}\, \beta_{k_1 (k_2-1)} + \frac{k_1 (K+1-k_2)}{(K+1)^2}\, \beta_{(k_1-1) k_2} + \frac{k_1 k_2}{(K+1)^2}\, \beta_{(k_1-1)(k_2-1)}. \tag{17}$$

This follows from the degree-elevation property of univariate Bernstein polynomials,

$$b_{k,K}(s) = \frac{K+1-k}{K+1}\, b_{k, K+1}(s) + \frac{k+1}{K+1}\, b_{k+1, K+1}(s).$$

To verify $B_K \in \mathcal{F}_{K+1}$, we further prove that $\tilde{\beta} \in \mathbb{R}^{(K+2) \times (K+2)}$ satisfies

$$\tilde{\beta}_{00} \ge 0, \qquad \tilde{\beta}_{(K+1)(K+1)} \le 1,$$
$$\tilde{\beta}_{0 k_2} \le \tilde{\beta}_{1 k_2} \le \cdots \le \tilde{\beta}_{(K+1) k_2}, \quad k_2 \in \{0, \ldots, K+1\},$$
$$\tilde{\beta}_{k_1 0} \le \tilde{\beta}_{k_1 1} \le \cdots \le \tilde{\beta}_{k_1 (K+1)}, \quad k_1 \in \{0, \ldots, K+1\}.$$

The first requirement follows from $\tilde{\beta}_{00} = \beta_{00}$, $\tilde{\beta}_{(K+1)(K+1)} = \beta_{KK}$ and Eq. (10). The second requirement follows from Eq. (11) and

$$\tilde{\beta}_{(k_1+1) k_2} - \tilde{\beta}_{k_1 k_2} = \frac{(K-k_1)(K+1-k_2)}{(K+1)^2} \big[\beta_{(k_1+1) k_2} - \beta_{k_1 k_2}\big] + \frac{k_1 (K+1-k_2)}{(K+1)^2} \big[\beta_{k_1 k_2} - \beta_{(k_1-1) k_2}\big] + \frac{(K-k_1)\, k_2}{(K+1)^2} \big[\beta_{(k_1+1)(k_2-1)} - \beta_{k_1 (k_2-1)}\big] + \frac{k_1 k_2}{(K+1)^2} \big[\beta_{k_1 (k_2-1)} - \beta_{(k_1-1)(k_2-1)}\big]. \tag{18}$$

The proof of the third requirement is similar. Membership in $\mathcal{F}$ follows from Theorem 1.

(II) Since all basis Bernstein polynomials and their linear combinations belong to $L_2[0,1]^2$, we have $\mathcal{F}_K \subseteq L_2[0,1]^2$ for all $K$. □

**Theorem 3.** $\bigcup_{K=1}^{\infty} \mathcal{F}_K$ is dense in $\mathcal{F}$ with respect to the sup-norm, i.e., for every $f \in \mathcal{F}$,

$$\lim_{K \to \infty} \; \min_{B_K \in \mathcal{F}_K} \; \sup_{\mathbf{s} \in [0,1]^2} |f(\mathbf{s}) - B_K(\mathbf{s})| = 0. \tag{19}$$

*Proof.* According to De Vore and Lorentz [1993, p. 10], for any multivariate continuous function $f$ there are multivariate Bernstein polynomials that converge uniformly to $f$ as the degrees increase to infinity. Hence the set of unconstrained bivariate Bernstein polynomials is dense in $\mathcal{C}_{[0,1]^2}$ with respect to the sup-norm. Since $\mathcal{F} \subseteq \mathcal{C}_{[0,1]^2}$, for each $f \in \mathcal{F}$ and any approximation error $\epsilon > 0$ there exists a $K \in \mathbb{N}$ such that the function

$$B_K(\mathbf{s}) = \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} f\!\left(\frac{k_1}{K}, \frac{k_2}{K}\right) b_{k_1,K}(s_1)\, b_{k_2,K}(s_2)$$

satisfies $\sup_{\mathbf{s} \in [0,1]^2} |B_K(\mathbf{s}) - f(\mathbf{s})| < \epsilon$. Therefore, if we let $\beta \in \mathbb{R}^{(K+1) \times (K+1)}$ with

$$\beta_{k_1 k_2} = f\!\left(\frac{k_1}{K}, \frac{k_2}{K}\right), \tag{20}$$

then $\beta$ satisfies Eqs. (10)-(12) by the properties of $f$, and thus $B_K(\cdot) \in \mathcal{F}_K$. Combining this with Theorem 2, we obtain Eq. (19). □
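Theorem 3's construction can be checked numerically: setting $\beta_{k_1 k_2} = f(k_1/K, k_2/K)$ for a continuous, coordinate-wise non-decreasing $f$ yields a member of $\mathcal{F}_K$ whose sup-norm distance to $f$ shrinks as $K$ grows. The snippet below is an illustrative check with a test function of our own choosing, not an experiment from the paper.

```python
import numpy as np
from scipy.special import comb

def bern2d(beta, s1, s2):
    """Tensor-product Bernstein polynomial of Eq. (8) (same as bernstein_2d above)."""
    b = lambda K, s: comb(K, np.arange(K + 1)) * s**np.arange(K + 1) * (1 - s)**(K - np.arange(K + 1))
    return b(beta.shape[0] - 1, s1) @ beta @ b(beta.shape[1] - 1, s2)

# A continuous, coordinate-wise non-decreasing test calibrating function (assumption).
f = lambda s1, s2: (np.sin(np.pi * s1 / 2) + s2**2) / 2.0

grid = np.linspace(0.0, 1.0, 101)
for K in (2, 5, 10, 20, 40):
    knots = np.arange(K + 1) / K
    beta = f(knots[:, None], knots[None, :])      # Eq. (20): beta[k1,k2] = f(k1/K, k2/K)
    sup_err = max(abs(bern2d(beta, a, b) - f(a, b)) for a in grid for b in grid)
    print(K, round(float(sup_err), 4))
# The printed sup-norm errors shrink toward 0 as K grows, as Theorem 3 predicts.
```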
### 4.2 Consistency

The problem studied in this paper is a random-design regression: each $(\mathbf{S}_i, Y_i)$ is random, and $(\mathbf{S}, Y), (\mathbf{S}_1, Y_1), \ldots, (\mathbf{S}_n, Y_n)$ are independent and identically distributed. To measure the error of a regression estimate, we use the $L_2$ error

$$\int_{[0,1]^2} \big|\hat{f}_n(\mathbf{s}) - f(\mathbf{s})\big|^2\, \mu(d\mathbf{s}).$$

Since the estimate $\hat{f}_n$ depends on the data $\mathcal{D}_n$, this $L_2$ error is a random variable. The following consistency theorem relies on Lemma 1 [Györfi et al., 2006, Lemma 10.1].

**Lemma 1.** Let $\mathcal{F}_n = \mathcal{F}_n(\mathcal{D}_n)$ be a class of functions $g : \mathbb{R}^d \to \mathbb{R}$ depending on the data $\{(\mathbf{S}_i, Y_i)\}_{i=1}^n$. If $\hat{f}_n$ is obtained by

$$\hat{f}_n = \arg\min_{g \in \mathcal{F}_n} \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2, \tag{21}$$

then

$$\int \big|\hat{f}_n(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) \le 2 \sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big| + \inf_{g \in \mathcal{F}_n} \int \big|g(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}), \tag{22}$$

where $\mu$ denotes the distribution of $\mathbf{S}$.

**Theorem 4.** Suppose $\hat{f}_n$ is obtained by Eqs. (13)-(14). If the degree $K$ satisfies

$$K \to \infty, \qquad K^2/n \to 0 \quad (n \to \infty), \tag{23}$$

then

(I) $\hat{f}_n$ is strongly universally consistent, i.e., for any $f \in \mathcal{F}$, as $n \to \infty$,

$$\int \big|\hat{f}_n(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) \to 0, \quad \text{a.s.}; \tag{24}$$

(II) $\hat{f}_n$ is weakly universally consistent, i.e., for any $f \in \mathcal{F}$, as $n \to \infty$,

$$\mathbb{E} \int \big|\hat{f}_n(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) \to 0. \tag{25}$$

*Proof.* (I) Strong consistency. By Lemma 1 (applied with $\mathcal{F}_n = \mathcal{F}_K$, $K = K(n)$), to prove the strong universal consistency of $\hat{f}_n$ it suffices to verify that, as $n \to \infty$,

$$\sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big| \to 0, \quad \text{a.s.}, \tag{26}$$

$$\inf_{g \in \mathcal{F}_n} \int \big|g(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) \to 0. \tag{27}$$

Let us first prove Eq. (27). According to Theorem 3, $\bigcup_{K=1}^{\infty} \mathcal{F}_K$ is dense in $\mathcal{F}$ with respect to the sup-norm, and hence dense in $\mathcal{F}$ with respect to $L_2(\mu)$ for any distribution $\mu$. This implies that, for every $f \in \mathcal{F}$ and any arbitrarily small error $\epsilon > 0$, there exists a $B_{K_0(\epsilon)} \in \bigcup_{K=1}^{\infty} \mathcal{F}_K$ that satisfies

$$\int_{[0,1]^2} \big|B_{K_0(\epsilon)}(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) < \epsilon.$$

Because $K \to \infty$ as $n \to \infty$, there is an $n_0(\epsilon) \in \mathbb{N}$ such that $K > K_0(\epsilon)$ for all $n \ge n_0(\epsilon)$. Combining this with the nestedness of the families $\mathcal{F}_K$ (Theorem 2), one concludes that

$$\inf_{g \in \mathcal{F}_n} \int \big|g(\mathbf{s}) - f(\mathbf{s})\big|^2 \mu(d\mathbf{s}) < \epsilon, \quad n \ge n_0(\epsilon). \tag{28}$$

Since $\epsilon > 0$ is arbitrary, this implies Eq. (27).

To verify Eq. (26), let $\mathcal{F}_n^+$ be the class of all subgraphs of functions in $\mathcal{F}_n$,

$$\mathcal{F}_n^+ \triangleq \Big\{ \big\{ (\mathbf{s}, t) \in [0,1]^{M+1} : t \le g(\mathbf{s}) \big\} : g \in \mathcal{F}_n \Big\}, \tag{29}$$

and let $V_{\mathcal{F}_n^+}$ be the Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}_n^+$. Since each $g \in \mathcal{F}_n$ is a linear combination of $(K+1)^2$ basis Bernstein polynomials, and $\mathcal{F}_n^+$ is a subset of

$$\Big\{ \big\{ (\mathbf{s}, t) \in [0,1]^{M+1} : g(\mathbf{s}) + \alpha t \ge 0 \big\} : g \in \mathcal{F}_n,\ \alpha \in \mathbb{R} \Big\},$$

which is induced by a vector space of real functions on $\mathbb{R}^3$ with dimension $(K+1)^2 + 1$, Theorem 9.5 of [Györfi et al., 2006] gives

$$V_{\mathcal{F}_n^+} \le (K+1)^2 + 1. \tag{30}$$

Then we have

$$\mathbb{P}\Big[ \sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big| > \epsilon \Big] \le 24 \Big(\frac{32e}{\epsilon} \log\frac{48e}{\epsilon}\Big)^{(K+1)^2+1} \exp\Big\{-\frac{n\epsilon^2}{128}\Big\} \le 24 \Big(\frac{48e}{\epsilon}\Big)^{2(K+1)^2+2} \exp\Big\{-\frac{n\epsilon^2}{128}\Big\} = 24 \exp\Big\{\big(2(K+1)^2+2\big)\log\frac{48e}{\epsilon} - \frac{n\epsilon^2}{128}\Big\}. \tag{31}$$

The first inequality is proved as in [Wang et al., 2019, Theorem 4], and the second inequality follows from $\log x \le x - 1 \le x$ for $x > 0$. Since $K^2/n \to 0$, the exponent in Eq. (31) is at most $-n\epsilon^2/256$ for all sufficiently large $n$, and therefore

$$\sum_{n=1}^{\infty} \mathbb{P}\Big[ \sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big| > \epsilon \Big] < \infty. \tag{32}$$

By the Borel-Cantelli lemma, we obtain Eq. (26). Therefore, Eq. (23) is a sufficient condition for the strong universal consistency of $\hat{f}_n(\cdot)$.

(II) Weak consistency. To prove the weak universal consistency of $\hat{f}_n(\cdot)$, it suffices to verify that, as $n \to \infty$,

$$\mathbb{E}\Big[\sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big|\Big] \to 0, \tag{33}$$

$$\inf_{g \in \mathcal{F}_n} \int \big[g(\mathbf{s}) - f(\mathbf{s})\big]^2 \mu(d\mathbf{s}) \to 0. \tag{34}$$

Because Eq. (34) directly follows from Eq. (28), it is sufficient to verify Eq. (33). If $\xi$ is a nonnegative random variable, then for an arbitrary $\epsilon > 0$,

$$\mathbb{E}[\xi] = \int_0^{\infty} \mathbb{P}[\xi > t]\, dt \le \epsilon + \int_{\epsilon}^{\infty} \mathbb{P}[\xi > t]\, dt. \tag{35}$$

Following this,

$$\mathbb{E}\Big[\sup_{g \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^n \big[g(\mathbf{S}_i) - Y_i\big]^2 - \mathbb{E}\big[g(\mathbf{S}) - Y\big]^2 \Big|\Big] \le \epsilon + \int_{\epsilon}^{\infty} 24 \Big(\frac{48e}{t}\Big)^{2(K+1)^2+2} \exp\Big\{-\frac{n t^2}{128}\Big\}\, dt \le \epsilon + 24 \Big(\frac{48e}{\epsilon}\Big)^{2(K+1)^2+2} \int_{\epsilon}^{\infty} \exp\Big\{-\frac{n \epsilon t}{128}\Big\}\, dt = \epsilon + 24 \Big(\frac{48e}{\epsilon}\Big)^{2(K+1)^2+2} \frac{128}{n\epsilon} \exp\Big\{-\frac{n\epsilon^2}{128}\Big\}.$$

The first inequality is obtained from Eqs. (35) and (31). Since $K^2/n \to 0$ as $n \to \infty$, the second term vanishes, and because $\epsilon$ is arbitrary, Eq. (33) holds. Hence condition (23) is sufficient for the weak universal consistency of $\hat{f}_n(\cdot)$. □

## 5 Experiments

In the experiments, to make each $s_m(\mathcal{X}) \subseteq [0, 1]$, every score $s_m(\mathbf{X}_i)$ is transformed by the empirical cumulative distribution function of $\{s_m(\mathbf{X}_i)\}_{i=1}^n$. In Copula Fusion, the copula function for each class is chosen by the maximum-likelihood criterion; the candidate copula families are Gaussian, Clayton, rotated Clayton, Plackett, Frank, Gumbel, rotated Gumbel, and Student's t. In MR-MIC, $\Omega$ is the similarity matrix with $\omega_{ij} = 1/\|\mathbf{S}_i - \mathbf{S}_j\|$. Because model (6) only produces calibrated probabilities $\{\hat{p}_i\}_{i=1}^n$ for the training samples $\{\mathbf{S}_i\}_{i=1}^n$, we estimate its calibrating function by linear interpolation of the scattered data $\{(\mathbf{S}_i, \hat{p}_i)\}_{i=1}^n$. The quadratic optimizations for MR-MIC and Bern Poly Fusion are solved with CVX [Grant and Boyd, 2014].

### 5.1 Characteristics of Calibrating Functions

The data is SUSY from the UCI Machine Learning Repository [Dua and Graff, 2017]. The two foregoing classifiers are feedforward networks with hidden-layer sizes 10 and 20. We randomly draw 200 samples for training the two classifiers and 400 samples for training the calibrating model. The hyperparameters are arbitrarily chosen: $\lambda = 10^{-5}$ in MR-MIC, and $K_1 = K_2 = 5$ in the proposed method. Figure 1(a) shows that both classifiers have good ranking power. Figures 1(b)-1(d) show the main characteristics of the three estimated calibrating functions with contours. In all panels, by and large, the estimated calibrating probability increases as $\mathbf{s}$ moves from southwest to northeast. Among the three methods, only Bern Poly Fusion adheres to the requirement of coordinate-wise non-decrease over the entire domain: MR-MIC imposes this requirement only on the training samples $\{\mathbf{S}_i\}_{i=1}^n$, while Copula Fusion does not explicitly impose it. The calibrating function from Bern Poly Fusion is also smoother than those of MR-MIC and Copula Fusion.

### 5.2 Model Comparison

Ideal performance measures for probability calibration should be based on the difference between the true $f$ and the estimated $\hat{f}$, e.g., $\sup_{\mathbf{s} \in [0,1]^2} |f(\mathbf{s}) - \hat{f}(\mathbf{s})|$ and $\int_{[0,1]^2} |f(\mathbf{s}) - \hat{f}(\mathbf{s})|\, d\mathbf{s}$. In practice, however, the true $f$ is unavailable. When the test data is very large, $f$ can be approximated as follows: the domain $[0, 1]^2$ is discretized into a uniformly spaced $10 \times 10$ grid, and the true probability in each grid cell is approximated by the fraction of class-1 samples among the test samples that fall in that cell. Because the majority of scores scatter around the anti-diagonal, few test samples fall in the grid cells near the northwest and southeast corners, so the approximated probabilities at these cells are subject to large deviations. Therefore, every grid cell containing fewer than 1000 test samples is discarded from performance measurement. The following two measures are used: $\mathrm{MCE} = \max_{\mathbf{s} \in G} |f(\mathbf{s}) - \hat{f}(\mathbf{s})|$ and $\mathrm{ECE} = \sum_{\mathbf{s} \in G} |f(\mathbf{s}) - \hat{f}(\mathbf{s})| / |G|$, where $G$ is the set of centers of the grid cells with no fewer than 1000 test samples. The third performance measure is the Brier score, the mean squared difference between predicted probabilities and class labels.
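For reference, the grid-based MCE and ECE just described can be computed as follows. This is our own illustration of the procedure (the $10 \times 10$ grid and the 1000-sample threshold follow the text; the function and variable names are assumptions), not the authors' evaluation code.

```python
import numpy as np

def grid_mce_ece(S_test, y_test, f_hat, bins=10, min_count=1000):
    """Approximate MCE and ECE on a uniform bins x bins grid over [0,1]^2.

    S_test : (n, 2) test scores;  y_test : (n,) 0/1 labels;
    f_hat  : callable (s1, s2) -> calibrated probability, e.g. the fitted Eq. (14).
    Grid cells with fewer than `min_count` test samples are discarded.
    """
    idx = np.clip((S_test * bins).astype(int), 0, bins - 1)   # cell index per sample
    gaps = []
    for i in range(bins):
        for j in range(bins):
            mask = (idx[:, 0] == i) & (idx[:, 1] == j)
            if mask.sum() < min_count:
                continue                                      # unreliable cell, skip
            true_prob = y_test[mask].mean()                   # share of class 1 in cell
            center = ((i + 0.5) / bins, (j + 0.5) / bins)     # cell center, an element of G
            gaps.append(abs(true_prob - f_hat(*center)))
    gaps = np.array(gaps)
    return gaps.max(), gaps.mean()                            # MCE, ECE

# The Brier score needs no grid:
# brier = np.mean((np.array([f_hat(a, b) for a, b in S_test]) - y_test) ** 2)
```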
Figure 1: Training data (Class 0, Class 1) and contours of the three calibrating functions, including (c) Copula Fusion and (d) Bern Poly Fusion.

| Measure | Method | Adult | Census | Covertype | Dota2 | SUSY | Mean |
|---|---|---|---|---|---|---|---|
| MCE | MR-MIC | 6.142 ± 0.843 | 7.116 ± 0.825 | 5.785 ± 0.611 | **5.353 ± 0.576** | 6.185 ± 0.479 | 6.116 ± 0.667 |
| MCE | Copula Fusion | **5.309 ± 0.723** | 6.817 ± 0.611 | 5.882 ± 0.626 | 6.290 ± 0.508 | 6.592 ± 0.469 | 6.178 ± 0.587 |
| MCE | Bern Poly Fusion | 5.613 ± 0.650 | **6.241 ± 0.565** | **5.367 ± 0.434** | 5.591 ± 0.528 | **5.054 ± 0.394** | **5.573 ± 0.514** |
| ECE | MR-MIC | 2.459 ± 0.441 | 3.787 ± 0.468 | 3.444 ± 0.359 | 3.124 ± 0.393 | 3.182 ± 0.258 | 3.199 ± 0.384 |
| ECE | Copula Fusion | 2.548 ± 0.296 | 3.890 ± 0.444 | 3.369 ± 0.326 | 3.319 ± 0.343 | 2.963 ± 0.240 | 3.218 ± 0.330 |
| ECE | Bern Poly Fusion | **2.226 ± 0.309** | **3.587 ± 0.370** | **2.847 ± 0.231** | **2.799 ± 0.268** | **2.432 ± 0.241** | **2.727 ± 0.284** |
| Brier | MR-MIC | 15.485 ± 0.915 | 19.719 ± 1.338 | 19.776 ± 0.861 | 18.740 ± 0.758 | 19.015 ± 0.646 | 18.547 ± 0.904 |
| Brier | Copula Fusion | 16.441 ± 0.703 | 19.714 ± 1.095 | 18.067 ± 0.824 | 19.264 ± 0.711 | 17.771 ± 0.506 | 18.252 ± 0.768 |
| Brier | Bern Poly Fusion | **14.375 ± 0.854** | **18.966 ± 0.724** | **16.095 ± 0.577** | **16.811 ± 0.434** | **15.539 ± 0.473** | **16.357 ± 0.612** |

Table 1: Model comparison (%); the best performance is highlighted in bold.

Experiments are performed on five large datasets from [Dua and Graff, 2017], each with more than 40,000 samples: Adult, Census, Covertype, Dota2 and SUSY. In Covertype, the largest class (Lodgepole Pine) is taken as the positive class and the others as the negative class. For each dataset, three size-500 subsets are randomly drawn: the first trains classifier 1, the second trains classifier 2, and the third trains the calibration models. All remaining samples are used for testing the calibration models. In each training step, we use 5-fold cross-validation to determine hyperparameters. The two foregoing classifiers are three-layer feedforward networks. The above procedure is repeated 10 times, and model comparison is based on the average and standard deviation over the ten rounds.

Table 1 shows that the proposed method outperforms the benchmark models in all but two instances: Copula Fusion is best on Adult and MR-MIC on Dota2, both with respect to the MCE measure. Considering the results averaged over all five datasets, Bern Poly Fusion performs best on every measure.

### 5.3 Running Time

The complexity of Copula Fusion depends on the choice of the copula family; for some families, the maximum-likelihood estimation is a non-convex optimization. The training of MR-MIC involves a quadratic program with $n$ decision variables; an ADMM-based algorithm for this optimization has complexity $O(n^2)$. The training of Bern Poly Fusion involves a quadratic program with $(K+1)^2$ decision variables and $2K(K+1) + 2$ constraints, both independent of the training size $n$. According to Section 4.6.2 of [Ben-Tal, 2019], Data($P$) and Size($P$) increase linearly with $n$, so the Newton complexity of an $\epsilon$-solution is $O(n)$. For the experiments in Subsection 5.2, where $n = 500$, the CPU times in seconds for training MR-MIC, Copula Fusion, and Bern Poly Fusion are 1.335 ± 0.128, 1.486 ± 0.172, and 0.501 ± 0.044, respectively.
If $n$ is 1,000, the CPU times in seconds for the three methods are 7.964 ± 0.807, 1.756 ± 0.246, and 0.872 ± 0.090, respectively. Therefore, compared to MR-MIC and Copula Fusion, Bern Poly Fusion attains lower running time and better scalability.

## 6 Conclusions

A novel method based on shape-restricted Bernstein polynomial regression is proposed for probability calibration. This method has universal fitting flexibility and is both strongly and weakly universally consistent. Experimental results show that the method has a clear advantage over benchmark methods. Future work includes the extension from binary to multi-class or structured classification problems [Kuleshov and Liang, 2015; Leathart et al., 2019].

## Acknowledgments

The work of the first author was supported by NSFC (71571163) and Zhejiang NSF (LY19G010001).

## References

- [Ben-Tal, 2019] Aharon Ben-Tal. Lectures on Modern Convex Optimization. Technical report, School of Industrial & Systems Engineering, Georgia Institute of Technology, 2019.
- [De Vore and Lorentz, 1993] Ronald A. De Vore and George G. Lorentz. Constructive Approximation. Springer-Verlag, 1993.
- [Dua and Graff, 2017] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
- [Grant and Boyd, 2014] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
- [Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proc. of ICML'17, pages 1321-1330, 2017.
- [Györfi et al., 2006] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.
- [He et al., 2015] Jiazhen He, James Bailey, Benjamin I. P. Rubinstein, and Rui Zhang. Identifying at-risk students in massive open online courses. In Proc. of AAAI'15, pages 1749-1755, 2015.
- [Jiang et al., 2011] Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Smooth isotonic regression: A new method to calibrate predictive models. AMIA Summits on Translational Science Proceedings, 2011:16, 2011.
- [Kuleshov and Ermon, 2017] Volodymyr Kuleshov and Stefano Ermon. Estimating uncertainty online against an adversary. In Proc. of AAAI'17, pages 2110-2116, 2017.
- [Kuleshov and Liang, 2015] Volodymyr Kuleshov and Percy S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems 28, pages 3474-3482. MIT Press, 2015.
- [Kuleshov et al., 2018] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In Proc. of ICML'18, pages 2801-2809, 2018.
- [Leathart et al., 2019] Tim Leathart, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. On calibration of nested dichotomies. In Proc. of PAKDD'19, pages 69-80, 2019.
- [Naeini and Cooper, 2016] Mahdi Pakdaman Naeini and Gregory F. Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. In Proc. of ICDM'16, pages 360-369, 2016.
- [Ott et al., 2018] Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. Analyzing uncertainty in neural machine translation. In Proc. of ICML'18, pages 3956-3965, 2018.
- [Ozdemir et al., 2017] Onur Ozdemir, Thomas G. Allen, Sora Choi, Thakshila Wimalajeewa, and Pramod K. Varshney. Copula based classifier fusion under statistical dependence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2740-2748, 2017.
- [Reemtsen and Rückmann, 1998] Rembert Reemtsen and Jan-J. Rückmann. Semi-Infinite Programming, volume 25. Springer Science & Business Media, 1998.
- [Schwarz and Heider, 2019] Johanna Schwarz and Dominik Heider. GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making. Bioinformatics, 35(14):2458-2465, 2019.
- [Ustun and Rudin, 2019] Berk Ustun and Cynthia Rudin. Learning optimized risk scores. Journal of Machine Learning Research, 20:1-75, 2019.
- [Vellanki et al., 2019] Pratibha Vellanki, Santu Rana, Sunil Gupta, David Rubin de Celis Leal, Alessandra Sutti, Murray Height, and Svetha Venkatesh. Bayesian functional optimisation with shape prior. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1617-1624, 2019.
- [Wang and Ghosh, 2012] J. Wang and S. K. Ghosh. Shape restricted nonparametric regression with Bernstein polynomials. Computational Statistics & Data Analysis, 56(9):2729-2741, 2012.
- [Wang et al., 2019] Y. Wang, L. Li, and C. Dang. Calibrating classification probabilities with shape-restricted polynomial regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1813-1827, 2019.
- [Zadrozny and Elkan, 2002] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proc. of KDD'02, pages 694-699, 2002.
- [Zhong and Kwok, 2013] Leon Wenliang Zhong and James T. Kwok. Accurate probability calibration for multiple classifiers. In Proc. of IJCAI'13, pages 1939-1945, 2013.