# Meta-Cal: Well-controlled Post-hoc Calibration by Ranking

Xingchen Ma¹, Matthew B. Blaschko¹

¹ESAT-PSI, KU Leuven, Belgium. Correspondence to: Xingchen Ma.

Abstract

In many applications, it is desirable that a classifier not only makes accurate predictions, but also outputs calibrated posterior probabilities. However, many existing classifiers, especially deep neural network classifiers, tend to be uncalibrated. Post-hoc calibration is a technique to recalibrate a model by learning a calibration map. Existing approaches mostly focus on constructing calibration maps with low calibration errors; however, this quality alone is inadequate for a calibrator to be useful. In this paper, we introduce two constraints that are worth consideration when designing a calibration map for post-hoc calibration. We then present Meta-Cal, which is built from a base calibrator and a ranking model. Under some mild assumptions, two high-probability bounds are given with respect to these constraints. Empirical results on CIFAR-10, CIFAR-100 and ImageNet and a range of popular network architectures show our proposed method significantly outperforms the current state of the art for post-hoc multi-class classification calibration.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Recent advances in machine learning have resulted in very high accuracy in many classification tasks. In the context of computer vision, there has been a steady increase in top-1 accuracy on ImageNet (Russakovsky et al., 2015) since AlexNet (Krizhevsky et al., 2012). While highly accurate models are desirable in general, in some applications we also want the output of a classifier to be a calibrated posterior probability. This is especially important in cost-sensitive classification tasks, such as medical diagnosis (Ma et al., 2017) and autonomous driving (Chen et al., 2015).

The goal of classification calibration is to obtain a probabilistic model that has low calibration error. A formal definition of calibration error will be given in Section 3. To achieve this objective, one can design models with intrinsically low calibration errors. These models (Wilson et al., 2016; Pereyra et al., 2017; Lakshminarayanan et al., 2016; Malinin & Gales, 2018; Milios et al., 2018; Maddox et al., 2019) are usually Bayesian in nature and tend to be expensive to train and to make inferences with. On the other end of the spectrum, a post-hoc calibration method transforms the outputs of an existing classification model into well-calibrated predictions by learning a trainable post-processing step. A lot of effort (Zhang et al., 2020; Guo et al., 2017; Ding et al., 2020; Patel et al., 2020; Wenger et al., 2020; Kull et al., 2019) has gone into this line of work, given the design difficulties and computational burdens of the former approach. In this work, we focus on post-hoc multi-class calibration.

Among the existing literature on post-hoc calibration, some methods (Zhang et al., 2020; Guo et al., 2017; Ding et al., 2020; Patel et al., 2020) seek calibration maps that preserve classification accuracy, while other works (Wenger et al., 2020; Kull et al., 2019) focus more on calibration error without explicitly enforcing accuracy preservation. Compared with the latter, the former incurs no potential accuracy drop, but its family of calibration maps is less flexible.
We will formalize the limitation of such a family by giving a lower bound on its calibration error in Proposition 2. In this work, we try to get the best of both worlds by combining an existing calibration model with a bipartite ranking model, and we call our proposed post-hoc calibration method Meta-Cal. Similar to previous works (Wenger et al., 2020; Kull et al., 2019), Meta-Cal does not enforce that the overall accuracy is kept. As we will show later in Section 4, such a calibrator can be of little use even if its calibration error is low; additional constraints should be taken into consideration. In this work, we introduce two constraints: (i) miscoverage rate and (ii) coverage accuracy. In Section 4.1 and Section 4.2, we show Meta-Cal has full control over these two constraints. An intuitive interpretation of the above two constraints is by considering a quality assurance (QA) system. In such a system, two desired properties are: (i) for products accepted by the QA system, their quality requirements are satisfied with high confidence, and (ii) the QA system does not perform unnecessary rejection. The first point corresponds to the coverage accuracy of a calibration map and the second one corresponds to the miscoverage rate. Their formal definitions are given in Section 4. Our proposed Meta-Cal is constructed from a base calibrator and a ranking model. If this base calibration model is accuracy-preserving, it is guaranteed that Meta-Cal has an improved calibration error bound over it.

The main contributions of this paper are:

- A novel post-hoc calibration approach (Meta-Cal) for multi-class classification is proposed. Meta-Cal augments a base calibration model and obtains better calibration performance.
- Two practical constraints are investigated alongside Meta-Cal. We show how these constraints are incorporated in constructing our proposed calibrator. Theoretical results on high-probability bounds w.r.t. these constraints are presented.
- We show the effectiveness of our proposed approach and validate our theoretical claims through a series of empirical experiments.

In the next section, we briefly review post-hoc calibration methods for classification and bipartite ranking models. Necessary background, notation and assumptions that will be used throughout this paper are given in Section 3. Section 4 describes our proposed approach; two practical constraints of Meta-Cal are discussed in Section 4.1 and Section 4.2, respectively. Empirical results are presented in Section 5. Finally, we conclude in Section 6.

2. Related Work

Post-hoc Calibration of Classifiers: Platt scaling (Platt, 1999; Lin et al., 2007) is proposed to learn a calibration transformation that maps the outputs of a binary SVM into posterior probabilities. Extensions of Platt scaling for multi-class calibration include temperature scaling (Guo et al., 2017), ensemble temperature scaling (Zhang et al., 2020) and local temperature scaling (Ding et al., 2020). Different from the parametric approach adopted by Platt scaling, histogram binning (Zadrozny & Elkan, 2001) is a non-parametric post-hoc calibration method in binary settings. Two popular refinements of histogram binning are isotonic regression (Zadrozny & Elkan, 2002) and Bayesian binning into quantiles (Naeini et al., 2015). Platt scaling assumes the scores of each class are Gaussian distributed; however, this assumption is hardly satisfied for many probabilistic classifiers.
Motivated by this assumption violation, Beta calibration (Kull et al., 2017) is proposed, which assumes a Beta distribution of scores within each class. For multi-class classification, Dirichlet calibration (Kull et al., 2019) is proposed as an extension of Beta calibration. More recently, a Gaussian-Process-based calibration method is presented in Wenger et al. (2020).

Bipartite Ranking: The problem of bipartite ranking has been studied for a long time. In bipartite ranking, one wants to compare or rank two different objects and decide which one is better. Burges et al. (2005) use a neural network to model a ranking function and train this network by gradient descent methods. Clémençon et al. (2008) define a statistical framework for such ranking problems and provide several consistency results. Narasimhan & Agarwal (2013) investigate the relationship between binary classification and bipartite ranking. Our work is closely related to Narasimhan & Agarwal (2013), and we rely on the theoretical results presented there to construct a binary class probability estimation (CPE) model from a ranking model.

Selective Classification: Selective classification is a technique that augments a classifier with a reject option. The essence of selective classification is the trade-off between coverage and accuracy. El-Yaniv & Wiener (2010) lay the foundation for selective classification by characterizing the theoretical and practical boundaries of risk-coverage trade-offs. Geifman & El-Yaniv (2017) apply selective classification to deep neural networks. While the starting point of our work is to make improvements in post-hoc classification calibration under several practical constraints, it turns out that the techniques presented in this work can also be used in selective classification. Geifman & El-Yaniv (2017) give a one-sided and tight high-probability risk bound, while we present two high-probability bounds for the miscoverage rate and the coverage accuracy. It should be noted that the definitions of risk and coverage in our setting are different from these two metrics as defined in the context of selective classification (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017), thus these bounds are not directly comparable.

3. Problem Formulation

In this section, we summarize some necessary background, notation and assumptions that will be used throughout this paper. Let $\mathcal{X}$ be the input space and $\mathcal{Y} = [k] := \{1, \ldots, k\}$ be the label space in multi-class classification. We denote the $(k-1)$-simplex by $\Delta_k := \{(p_1, \ldots, p_k) \mid \sum_{i \in [k]} p_i = 1,\ p_i \in [0, 1]\}$. Suppose a probabilistic classifier $f : \mathcal{X} \to \Delta_k$ is given and this classifier is trained on an i.i.d. data set following a joint distribution $P(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$. Let the random variable $X$ take values in the input space $\mathcal{X}$, $Z = (Z_1, \ldots, Z_k)$ be the output of $f$ applied to $X$, $\hat{Z} = \max_{i \in [k]} Z_i$ be the confidence score and $\hat{Y} = \arg\max_{i \in [k]} Z_i$ be the prediction for $X$. Whenever there is a tie, it is broken uniformly at random. To measure the level of calibration, following Naeini et al. (2015); Kumar et al. (2019); Wenger et al. (2020), the calibration error is defined in Definition 1.

Definition 1 (Calibration error). The $L_p$ calibration error of $f$ with $p \geq 1$ is:

$$\mathrm{CE}_p(f) = \left( \mathbb{E}\left[ \big|\, \mathbb{E}[\mathbf{1}(Y = \hat{Y}) \mid \hat{Z}] - \hat{Z} \,\big|^p \right] \right)^{1/p}, \tag{1}$$

where the expectation is taken with respect to $P(X, Y)$.

If the calibration error of a model $f$ is zero, we say $f$ is perfectly calibrated. In practice, $\mathrm{CE}_p(\cdot)$ is unobservable since it depends on the unknown joint distribution $P(X, Y)$. To empirically estimate the calibration error based on a finite data set, prior works apply a fixed binning scheme $B \in \mathcal{B}$, where $\mathcal{B}$ is the family of binning schemes, to the values of $\hat{Z}$ and compute the difference between the accuracy and the average confidence score in every bin. An example of a binning scheme is a partition of the interval $[0, 1]$. When $p = 1$, the calibration error is also known as the expected calibration error (ECE) (Guo et al., 2017), and an empirical estimator based on $B$ is denoted by $\widehat{\mathrm{ECE}}_B$. Although it is known that $\widehat{\mathrm{ECE}}_B$ is a biased estimator (Kumar et al., 2019; Zhang et al., 2020) of the ECE and its magnitude can be hard to interpret, in this work we use this binned estimator to measure the level of calibration due to its simplicity and popularity.
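For concreteness, the binned estimator just described can be written in a few lines. The following is a minimal sketch with equal-width bins on the confidence score; the function name, the bin count and the handling of bin edges are illustrative choices and are not taken from the paper's released code.

```python
import numpy as np

def binned_ece(probs, labels, n_bins=15):
    """Empirical ECE with equal-width bins on the confidence score Z_hat.

    probs:  (n, k) array of predicted class probabilities (output of f).
    labels: (n,) array of integer class labels.
    """
    conf = probs.max(axis=1)                  # Z_hat: top-label confidence
    pred = probs.argmax(axis=1)               # Y_hat: predicted class
    correct = (pred == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # right-closed bins; the first bin also includes confidence 0.0
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap          # weight the gap by the bin mass
    return ece
```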
In the post-hoc calibration problem, one wants to find a function that transforms the outputs of $f$ so that the final model is better calibrated. Formally, a calibration map is a function $g : \Delta_k \to \Delta_k$. The goal is to learn a mapping $g \in \mathcal{G}$ on a finite calibration data set $\{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the joint distribution $P(X, Y)$ so that the composition $g \circ f$ has a small calibration error, where $\mathcal{G}$ is the calibration family.

In the problem of bipartite ranking, we want to find a ranking model $h : \mathcal{X} \to \mathbb{R}$ that ranks positive and negative examples so that the positive examples have higher scores with high probability. Formally, one wants to minimize the following ranking risk (Narasimhan & Agarwal, 2013; Clémençon et al., 2008):

$$L(h) = P\big( (\xi - \xi')(h(X) - h(X')) < 0 \big),$$

where $(X, \xi)$ and $(X', \xi')$ are i.i.d. pairs taking values in $\mathcal{X} \times \{-1, +1\}$. The ranking risk is the probability that $h$ ranks two randomly drawn pairs incorrectly (Clémençon et al., 2008). In this paper, if not stated otherwise, the random variable $\xi$ is a sign variable indicating whether $X$ is correctly classified by $f$ or not, that is, $\xi = 2 \cdot \mathbf{1}(Y \neq \hat{Y}) - 1$, so $\xi = +1$ when $X$ is misclassified. In general, a bipartite ranking model is learnt using a consistent pairwise surrogate loss, such as the exponential loss or the logistic loss (Gao & Zhou, 2014).

In the following, we list some assumptions that will be used in Section 4. The first assumption simply means that for two independent samples from $P(X)$, the probability of their ranking scores being equal vanishes. The second assumption is required by Narasimhan & Agarwal (2013), as we rely on their results.

Assumption 1. Denote the marginal distribution of $X$ taking values in $\mathcal{X}$ by $P(X)$. The induced distribution of $h(X)$ is continuous, where $h$ is a ranking model.

Assumption 2. Let $h : \mathcal{X} \to [a, b]$ (where $a, b \in \mathbb{R}$, $a < b$) be any bounded-range ranking model. $\eta_h(s) = P(\xi = 1 \mid h(x) = s)$ is square-integrable w.r.t. the induced density of $h(X)$.

4. Meta-Cal

In this section, we start by showing that the family of accuracy-preserving calibration maps has an inherent limitation (Proposition 2), which motivates us to combine a bipartite ranking model with an existing calibration model. Then we demonstrate that a low calibration error alone does not necessarily mean a calibrator is practical: post-hoc calibration should be considered together with other factors. In this work, we investigate two useful constraints in Section 4.1 and Section 4.2, respectively: miscoverage rate control and coverage accuracy control.

Proposition 2 (Lower bound of ECE). Define $\mathcal{G}_a$ as the family of accuracy-preserving calibration maps, that is, $\mathcal{G}_a = \{g \in \mathcal{G} : \arg\max_{i \in [k]} f(X)_i = \arg\max_{i \in [k]} g(f(X))_i\}$.
Then for all $g \in \mathcal{G}_a$:

$$\sup_{B \in \mathcal{B}} \widehat{\mathrm{ECE}}_B(g \circ f) > \frac{1 - \hat{\pi}_0}{k}, \tag{2}$$

where $\widehat{\mathrm{ECE}}_B$ is estimated on a finite data set and $\hat{\pi}_0$ is the empirical accuracy of $f$ on the same data set. Further, we can show that

$$\forall B \in \mathcal{B},\ \forall g \in \mathcal{G}, \quad \mathrm{ECE}(g \circ f) \geq \widehat{\mathrm{ECE}}_B(g \circ f). \tag{3}$$

All proofs are provided in Supplement A and we only describe a sketch of the proof here. The first step is to find the supremum of the binned estimator $\widehat{\mathrm{ECE}}_B(g \circ f)$ across all calibration maps $g \in \mathcal{G}_a$ and all binning schemes $B \in \mathcal{B}$. This is achieved using Minkowski's inequality. The second step is showing that this supremum is lower bounded by $(1 - \hat{\pi}_0)/k$. Unless $f$ is perfectly accurate, $(1 - \hat{\pi}_0)/k$ is strictly positive. The last step is already given in Kumar et al. (2019) by applying Jensen's inequality. We note that bounding $\widehat{\mathrm{ECE}}_B(g \circ f)$ instead of its supremum makes little sense here, since it is easy to construct a trivial $g \in \mathcal{G}_a$ and a binning scheme $B$ such that $\widehat{\mathrm{ECE}}_B(g \circ f)$ is exactly zero. In Supplement B, we give such a trivial construction. Proposition 2 makes it clear that an accuracy-preserving calibration map has an inherent limitation: such a calibration map cannot perfectly calibrate a classifier even if it is given the label information. To avoid potential confusion, we emphasize here that the empirical estimator of ECE measures the top-label calibration error, not the calibration error w.r.t. a fixed class.

In the following proposition, we show that if a post-hoc calibration map is allowed to change the accuracy of the original classifier, it can achieve perfect calibration with the aid of a binary classifier.

Proposition 3 (Optimal calibration map). Suppose realizations of $X \mid Y \neq \hat{Y}$ with positive labels and realizations of $X \mid Y = \hat{Y}$ with negative labels are separable, and we are given a perfect binary classifier $\varphi^* : \mathcal{X} \to \{-1, +1\}$ that is able to classify these two sets. For a new observation $x \in \mathcal{X}$, an optimal calibration map $g^* \in \mathcal{G}$ that minimizes $\sup_{B \in \mathcal{B}} \widehat{\mathrm{ECE}}_B(g \circ f)$ over $g \in \mathcal{G}$ can be constructed using the following rules:

- if $\varphi^*(x) = -1$, we let $\max_{i \in [k]} g^*(x)_i = 1$;
- if $\varphi^*(x) = +1$, we let $\max_{i \in [k]} g^*(x)_i = 1/k$.

Proof. The optimality of $g^*$ follows directly from the proof steps of Proposition 2. We note $g^* \notin \mathcal{G}_a$: since the tie-breaking strategy used is uniform at random, $g^*$ does not preserve the accuracy after the above calibration.

It is straightforward to check that the constructed $g^*$ in Proposition 3 has zero calibration error, that is, $\mathrm{ECE}(g^* \circ f) = \sup_{B \in \mathcal{B}} \widehat{\mathrm{ECE}}_B(g^* \circ f) = 0$. However, it is unlikely that a perfect binary classifier can be obtained in practice, and classification errors occur whenever $\varphi(X) \neq \xi = 2 \cdot \mathbf{1}(Y \neq \hat{Y}) - 1$. We denote the population Type I error (false positive error) and Type II error (false negative error) of $\varphi$ by $R_0(\varphi)$ and $R_1(\varphi)$, respectively. In the following, we define the miscoverage rate and coverage accuracy of a calibration map $g$ constructed using the rules in Proposition 3. These two constraints will be used to construct the final calibration map.

Definition 4 (Miscoverage rate). The miscoverage rate $F_0(g)$ of $g$ constructed using the rules in Proposition 3 is defined to be the Type I error of the binary classifier $\varphi$ associated with $g$.

Definition 5 (Coverage accuracy). The coverage accuracy $F_1(g)$ of $g$ constructed using the rules in Proposition 3 is defined to be the precision of the binary classifier $\varphi$ associated with $g$.

The rationale behind the above definitions will be illustrated by two examples later in this section.
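To make Definitions 4 and 5 concrete, both quantities can be read off from the decisions of the binary classifier on labeled data. A minimal sketch, assuming `phi` holds decisions in {-1, +1} and `correct` marks whether f classified each sample correctly; both array names are hypothetical inputs, and coverage accuracy is computed here as the accuracy among covered (negative-classified) instances, following the interpretation given below.

```python
import numpy as np

def empirical_miscoverage_rate(phi, correct):
    """Type I error of phi: fraction of correctly classified (negative)
    samples that phi rejects by labeling them +1."""
    return np.mean(phi[correct] == +1)

def empirical_coverage_accuracy(phi, correct):
    """Accuracy of f among the covered samples, i.e. those phi labels -1."""
    covered = (phi == -1)
    return np.mean(correct[covered])
```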
To be brief, these two constraints are necessary for a calibrator to be useful if it cannot preserve accuracy. The miscoverage rate of $g$ is the proportion of instances whose predictions are no longer correct after calibration among the instances whose predictions were originally correct before calibration. The coverage accuracy of $g$ is the accuracy among covered instances (i.e., those instances classified as negative by the binary classifier) after calibration. The correspondence between the Type I error (precision) of a binary classifier and the miscoverage rate (coverage accuracy) of a calibration map is due to the construction rules in Proposition 3.

Proposition 6. Using the construction rules in Proposition 3, we can estimate the expected calibration error of $g$ on a finite data set:

$$\sup_{B \in \mathcal{B}} \widehat{\mathrm{ECE}}_B(g \circ f) = \hat{R}_1(\varphi) \cdot (1 - \hat{\pi}_0),$$

where $\hat{R}_1(\varphi)$ is the empirical Type II error of $\varphi$, which is trained following Proposition 3, $g$ is constructed using the rules in Proposition 3, and $\hat{\pi}_0$ is the empirical accuracy of $f$ on this finite data set.

The following two trivial calibration maps are used to illustrate the necessity of the above two constraints. These two calibrators have low calibration errors, but they are hardly useful.

High miscoverage rate: Based on the result in Proposition 6, to obtain a low calibration error, we could build a binary classifier with a low Type II error. Such a classifier potentially has a high Type I error, and thus the miscoverage rate is potentially high as well. For example, we can select a small portion of the negative instances and relabel all remaining instances as positive; a 1-nearest-neighbor classifier fit to this relabeled data is such a classifier.

Low coverage accuracy: Regardless of the input, we let the output of the calibration map be the marginal class distribution. The calibration error of this calibration map is equal to 0, but its coverage accuracy is low and equal to the proportion of the largest class.

Obviously, the above two constructions are of little use in practice. Another motivating example demonstrating the usefulness of these two constraints is an automated decision model with a human in the loop (HIL). In such a model, on the one hand, it is desirable that human interaction is as minimal as possible (low miscoverage rate). On the other hand, we hope this model can make accurate and reliable decisions (high coverage accuracy). We now describe how to construct a calibration map with a controlled miscoverage rate or a controlled coverage accuracy.¹

¹Code is available at https://github.com/maxc01/metacal
4.1. Miscoverage Rate Control

To control the miscoverage rate of our proposed calibration method, we need to construct a suitable binary classifier with a controlled Type I error. Such a classifier can be built using the following steps. Given a finite calibration data set $\{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the joint distribution $P(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, we first create a binary classification data set $\{(x_i,\ 2 \cdot \mathbf{1}(y_i \neq \hat{y}_i) - 1)\}_{i=1}^n$, where $\hat{y}_i = \arg\max_{j \in [k]} f(x_i)_j$ and $f$ is the multi-class classifier to be calibrated. Without loss of generality, we suppose the first $n_1$ inputs have negative labels and the last $n_2 = n - n_1$ inputs have positive labels. We then compute ranking scores on the first $n_1$ inputs using a ranking model $h$ and denote these ranking scores by $\{r_i\}_{i=1}^{n_1}$, where $r_i = h(x_i)$. Given a miscoverage rate tolerance $\alpha$, the desired binary classifier is constructed as:

$$\hat{\varphi}(x) = 2 \cdot \mathbf{1}\big(h(x) > r_{(v)}\big) - 1, \quad v = \lceil (n_1 + 1)(1 - \alpha) \rceil,$$

where $r_{(v)}$ is the $v$-th order statistic of $\{r_i\}_{i=1}^{n_1}$, that is, $r_{(1)} \leq \cdots \leq r_{(n_1)}$. We note that under Assumption 1 the induced distribution of $h(X)$ is continuous, thus $P(r_i = r_j) = 0$ for all $i \neq j \in [n_1]$ and the strict inequalities $r_{(1)} < \cdots < r_{(n_1)}$ hold. In the following Proposition 7, we show that the miscoverage rate of our calibrator is well-controlled by proving that the Type I error of this constructed binary classifier is well-controlled.

Proposition 7 (Miscoverage rate control). Under Assumption 1, given a finite test data set $D$ of size $m$ drawn i.i.d. from $P(X, Y)$, the empirical miscoverage rate $\hat{F}_0(g)$ of $g$ on $D$ is well-controlled:

$$P\big(\hat{F}_0(g) \geq F_0(g) + \delta\big) \leq P\big(|R_0 - F_0(g)| \geq \delta\big) \leq 2 \exp\left(-\frac{\delta^2}{2\sigma^2}\right), \tag{4}$$

where $F_0(g)$ is the population miscoverage rate of $g$, the random variable $R_0$ has a Normal distribution $\mathcal{N}(F_0(g), \sigma^2)$ with $\sigma^2 = F_0(g)(1 - F_0(g))/m_1$, and $m_1 \leq m$ is the number of samples whose predictions are correct. Further, we have (Tong et al., 2018):

$$P\big(F_0(g) > \alpha\big) = \sum_{j=v}^{n_1} \binom{n_1}{j} (1 - \alpha)^j \alpha^{\,n_1 - j}, \tag{5}$$

where $v = \lceil (n_1 + 1)(1 - \alpha) \rceil$.

Equation (5) is given in Tong et al. (2018) and we adapt it here for our purpose. Essentially it means that the population miscoverage rate is smaller than the predefined miscoverage rate tolerance with high probability. A rough estimate of the miscoverage rate of $g$ is $\lfloor (1 + n_1)\alpha \rfloor / (1 + n_1) \approx \alpha$.

Although the miscoverage rate of our constructed calibration map can be well-controlled using the above binary classifier, there is no guarantee that the Type II error of $\hat{\varphi}$ is also well-controlled. Combining this with the result shown in Proposition 6, it can be seen that the empirical ECE depends solely on $\hat{R}_1(\varphi)$, since $\hat{\pi}_0$ is a constant given a finite data set. Thus we cannot ensure the empirical calibration error is small if we keep using the construction rules shown in Proposition 3. A simple refinement is to utilize a separate base calibration model $g_m \in \mathcal{G}$; the updated construction rules for the calibration map are:

- if $\hat{\varphi}(x) = -1$, we let $g(x) = g_m(x)$;
- if $\hat{\varphi}(x) = +1$, we let $\max_{i \in [k]} g(x)_i = 1/k$.

Now it is clear why we call our post-hoc calibration approach Meta-Cal: our calibrator is built from a base calibration map. The rationale for the above updated rules is shown in Proposition 8.

Proposition 8 (Lower bound of Meta-Cal). Suppose the base calibration map $g_m \in \mathcal{G}_a$ and $g$ is constructed using the above updated rules. Then:

$$\sup_{B \in \mathcal{B}} \widehat{\mathrm{ECE}}_B(g \circ f) > w \cdot \frac{1 - \hat{\pi}_0}{k}, \tag{6}$$

where $w = (1 - \hat{R}_0(\hat{\varphi}))\hat{\pi}_0 + \hat{R}_1(\hat{\varphi})(1 - \hat{\pi}_0) < 1$, $\hat{R}_0(\hat{\varphi})$ and $\hat{R}_1(\hat{\varphi})$ are the empirical Type I and Type II errors of $\hat{\varphi}$, respectively, and $\hat{\pi}_0$ is the empirical accuracy of $f$ on a finite data set.

The result in Proposition 8 should be compared to the lower bound shown in Proposition 2. It can be seen that the calibration map constructed by Meta-Cal has an improved calibration error lower bound compared with its accuracy-preserving base calibrator. Proposition 8 should also be compared with Proposition 6. From Equation (6), using the decomposition $\frac{w}{k} = \frac{(1 - \hat{R}_0(\hat{\varphi}))\hat{\pi}_0}{k} + \frac{\hat{R}_1(\hat{\varphi})(1 - \hat{\pi}_0)}{k}$, we see that the influence of the Type II error has been effectively diluted. This is desirable since we only have control over the Type I error of $\hat{\varphi}$. At the same time, the miscoverage rate of our constructed calibrator is well-controlled. In Section 5, we empirically show that Meta-Cal outperforms other competing methods in terms of the empirical estimation of ECE.
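Equation (5) is a binomial tail and can be evaluated directly to check how the order-statistic index $v = \lceil (n_1 + 1)(1 - \alpha) \rceil$ trades off against the tolerance $\alpha$. A small sketch assuming SciPy is available; the function name and the example values are illustrative.

```python
from math import ceil
from scipy.stats import binom

def miscoverage_violation_prob(n1, alpha):
    """Evaluate Equation (5): P(F0(g) > alpha), the upper tail of a
    Binomial(n1, 1 - alpha) distribution starting at the order-statistic
    index v = ceil((n1 + 1) * (1 - alpha))."""
    v = ceil((n1 + 1) * (1 - alpha))
    return binom.sf(v - 1, n1, 1 - alpha)  # P(Bin(n1, 1 - alpha) >= v)

# Example: miscoverage_violation_prob(500, 0.05)
```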
Algorithm 1 sketches the construction of Meta-Cal under a miscoverage rate constraint. For details of this construction, please see the first paragraph of Section 4.1.

Algorithm 1 Meta-Cal (miscoverage control)
1: Input: Training data set $\{(x_i, y_i)\}_{i=1}^n$, miscoverage rate tolerance $\alpha$, base calibration model $g_m$, ranking model $h$.
2: Output: Binary classifier $\hat{\varphi}$, Meta-Cal calibration model $g$.
3: Partition the training data set randomly into two parts. The first part contains only negative ($\hat{Y} = Y$) samples. The second part contains both negative and positive ($\hat{Y} \neq Y$) samples.
4: Compute ranking scores on the first part using the ranking model $h$. Compute the threshold $r^*$ based on $\alpha$.
5: Construct a binary classifier $\hat{\varphi}$ based on $r^*$.
6: Train the base calibration model $g_m$ using the samples in the second part whose scores are smaller than $r^*$.
7: Construct the final calibration map $g$ using the updated rules.
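To complement Algorithm 1, the following is a minimal end-to-end sketch in Python, assuming temperature scaling as the base calibrator and the entropy ranking model used later in Section 5. The function names, the split fraction and the numerical details are illustrative and simplified relative to the released implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def entropy_score(probs):
    """Ranking model used in the experiments: entropy of the predicted
    distribution (higher entropy ~ more likely to be misclassified)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Temperature scaling as the accuracy-preserving base calibrator g_m."""
    def nll(t):
        p = softmax(logits / t)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def fit_meta_cal(logits, labels, alpha=0.05, split_frac=0.1, seed=0):
    """Meta-Cal under a miscoverage rate constraint (cf. Algorithm 1).
    A random fraction of the data (its correct predictions only) sets the
    threshold; the rest, restricted to covered samples, fits g_m."""
    rng = np.random.default_rng(seed)
    n, k = logits.shape
    probs = softmax(logits)
    correct = probs.argmax(axis=1) == labels
    scores = entropy_score(probs)

    part1 = rng.random(n) < split_frac
    neg_scores = np.sort(scores[part1 & correct])    # negatives of part 1
    n1 = len(neg_scores)
    v = int(np.ceil((n1 + 1) * (1 - alpha)))
    r_star = neg_scores[min(v, n1) - 1]              # v-th order statistic

    covered = (~part1) & (scores <= r_star)          # phi_hat(x) = -1
    temp = fit_temperature(logits[covered], labels[covered])
    return r_star, temp

def predict_meta_cal(logits, r_star, temp):
    """Updated construction rules: covered samples get the base calibrator,
    rejected samples get the uniform distribution (max prob = 1/k)."""
    probs = softmax(logits)
    out = softmax(logits / temp)
    reject = entropy_score(probs) > r_star           # phi_hat(x) = +1
    out[reject] = 1.0 / logits.shape[1]
    return out
```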
4.2. Coverage Accuracy Control

In some applications, we are more interested in controlling the coverage accuracy of a calibration map instead of its miscoverage rate. To construct a calibration map with a controlled coverage accuracy, similar to Section 4.1, we need to build a suitable binary classifier. It will be convenient to introduce the concept of a calibrated binary CPE model (Narasimhan & Agarwal, 2013).

Definition 9 (Calibrated binary CPE model (Narasimhan & Agarwal, 2013)). A binary class probability estimation (CPE) model $\hat{\eta} : \mathcal{X} \to [0, 1]$ is said to be calibrated w.r.t. a probability distribution $P(X, \xi)$ on $\mathcal{X} \times \{-1, +1\}$ if:

$$P(\xi = 1 \mid \hat{\eta}(x) = u) = u, \quad \forall u \in \mathrm{range}(\hat{\eta}),$$

where $\mathrm{range}(\hat{\eta})$ denotes the range of $\hat{\eta}$.

We note that a calibrated binary CPE model is different from a perfectly calibrated model (Definition 1) in the binary setting. Before we state our bound for the coverage accuracy, we present two existence propositions.

Proposition 10 (Existence of a monotonically increasing CPE transformation). Under Assumptions 1 and 2, let $h : \mathcal{X} \to [a, b]$ (where $a, b \in \mathbb{R}$, $a < b$) be any bounded-range ranking model. There exists a monotonically increasing function $t : \mathbb{R} \to [0, 1]$ such that $t \circ h$, resulting from composing $t$ and $h$, is a calibrated binary CPE model.

Proof. The existence of such a monotonically increasing CPE transformation follows directly by combining the results of Narasimhan & Agarwal (2013, Lemma 13) and Narasimhan & Agarwal (2013, Definition 12).

Proposition 11 (Existence of a monotonically decreasing coverage accuracy transformation). Under Assumptions 1 and 2, let $h : \mathcal{X} \to [a, b]$ (where $a, b \in \mathbb{R}$, $a < b$) be any bounded-range ranking model. There exists a monotonically decreasing function $l : \mathbb{R} \to [0, 1]$ such that $l(s)$ is the coverage accuracy of a calibration map $g$ constructed using the binary classifier $\hat{\varphi}(x) = 2 \cdot \mathbf{1}(h(x) > s) - 1$.

Proof. From Proposition 10, there exists a monotonically increasing $t$ such that $t \circ h$ is a calibrated binary CPE model. Let $E_h(s) = P(\xi = 1 \mid h(x) < s)$; since $t$ is monotonically increasing, we have $E_h(s) = P(\xi = 1 \mid t(h(x)) < t(s))$. Because $t \circ h$ is a calibrated binary CPE model, using Definition 9 we further have:

$$E_h(s) = \mathrm{constant} \cdot \int_c^{t(s)} u \, \mathrm{d}u = \mathrm{constant} \cdot \frac{t(s)^2 - c^2}{2},$$

where $c$ is a constant. From Definition 5, the desired coverage accuracy transformation is:

$$l(s) = 1 - E_h(s) = 1 - \mathrm{constant} \cdot \frac{t(s)^2 - c^2}{2}.$$

Since $t(s) \geq 0$ for all $s \in \mathbb{R}$ and $t(\cdot)$ is monotonically increasing, it is clear that $l$ is a monotonically decreasing function. From the definition of $E_h(s)$, the binary classifier that is used to construct $g$ is exactly $\hat{\varphi}(x) = 2 \cdot \mathbf{1}(h(x) > s) - 1$.

Based on the above two existence propositions, we describe a bound for the empirical coverage accuracy in Proposition 12.

Proposition 12 (Coverage accuracy control). Under Assumptions 1 and 2, the empirical coverage accuracy $\hat{F}_1(g)$ of $g$ on a finite test data set of size $m$ drawn i.i.d. from $P(X, Y)$ is well-controlled:

$$P\big(\hat{F}_1(g) \leq \beta - \delta\big) \leq P\big(|R_1 - \beta| \geq \delta\big) \leq 2 \exp\left(-\frac{m_1 \delta^2}{2\beta(1 - \beta)}\right), \tag{7}$$

where $\beta$ is the desired coverage accuracy, the random variable $R_1$ has a Normal distribution $\mathcal{N}\big(\beta, \beta(1 - \beta)/m_1\big)$, and $m_1 \leq m$ is the number of samples whose ranking scores are smaller than $l^{-1}(\beta)$ in the test data set. Further, the desired binary classifier is:

$$\hat{\varphi}(x) = 2 \cdot \mathbf{1}\big(h(x) > l^{-1}(\beta)\big) - 1. \tag{8}$$

It should be noted that among all calibration maps whose coverage accuracy is higher than $\beta$, the calibration map constructed using the binary classifier in Equation (8) has the smallest miscoverage rate. In the following, we describe how to estimate the monotonically decreasing function $l$ in practice. Given a finite calibration data set $\{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the joint distribution $P(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$: first, the ranking scores $\{r_i\}_{i=1}^n$, where $r_i = h(x_i)$, are computed using our ranking model $h$. Then a set of bins $\{I_1, \ldots, I_b\}$ partitioning $\{r_i\}_{i=1}^n$ is constructed through uniform-mass binning (Zadrozny & Elkan, 2001). For a given $j \in \{1, \ldots, b\}$, let $l_j^{(s)}$ and $l_j^{(a)}$ denote the average of the ranking scores on $I_j$ and $f$'s accuracy on $\{I_1, \ldots, I_j\}$, respectively. Finally, a decreasing isotonic regression model is fitted on $\{(l_j^{(s)}, l_j^{(a)})\}_{j=1}^b$, and the fitted model is an estimate of $l$.

Algorithm 2 sketches the construction of Meta-Cal under the coverage accuracy constraint.

Algorithm 2 Meta-Cal (coverage accuracy control)
1: Input: Training data set $\{(x_i, y_i)\}_{i=1}^n$, desired coverage accuracy $\beta$, base calibration model $g_m$, ranking model $h$.
2: Output: Binary classifier $\hat{\varphi}$, Meta-Cal calibration model $g$.
3: Randomly split the training data set into two parts.
4: Estimate the coverage accuracy transformation $\hat{l}$ on the first part.
5: Compute a threshold $r^* = \hat{l}^{-1}(\beta)$ based on the estimated $\hat{l}$ and $\beta$.
6: Construct a binary classifier $\hat{\varphi}$ based on $r^*$.
7: Train the base calibration model $g_m$ using the samples in the second part whose scores are smaller than $r^*$.
8: Construct the final calibration map $g$ using the updated rules.
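The estimation of $l$ just described (steps 4 and 5 of Algorithm 2) can be sketched with scikit-learn's isotonic regression. The function names and the number of bins below are illustrative assumptions, not the paper's released code.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_coverage_accuracy_transform(scores, correct, n_bins=10):
    """Estimate the decreasing map l: ranking score -> coverage accuracy.

    scores:  ranking scores h(x_i) on the calibration split.
    correct: boolean array, whether f classified x_i correctly.
    """
    order = np.argsort(scores)
    s_sorted, c_sorted = scores[order], correct[order].astype(float)
    # uniform-mass binning: equal-count bins over the sorted scores
    bins = np.array_split(np.arange(len(scores)), n_bins)
    bin_score = np.array([s_sorted[b].mean() for b in bins])            # l_j^(s)
    cum_acc = np.array([c_sorted[: b[-1] + 1].mean() for b in bins])    # l_j^(a)
    iso = IsotonicRegression(increasing=False, out_of_bounds="clip")
    iso.fit(bin_score, cum_acc)
    return iso

def threshold_for_coverage(iso, beta, score_grid):
    """Approximate l^{-1}(beta): the largest score whose predicted
    coverage accuracy is still at least beta."""
    acc = iso.predict(score_grid)
    ok = score_grid[acc >= beta]
    return ok.max() if len(ok) else score_grid.min()
```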
Several remarks are made to conclude this section. (i) The miscoverage rate bound holds for all base calibration maps; the coverage accuracy bound requires the base calibrator to preserve accuracy. (ii) These two bounds do not depend on the metric used to evaluate the level of calibration, whether it is the top-label calibration error (which we use in this paper) or the marginal calibration error (Kumar et al., 2019; Kull et al., 2019). (iii) To ensure the independence assumption, the training data of Meta-Cal should be different from the data set used to train the multi-class classifier.

5. Experiments

In this section, we present empirical results on a set of multi-class classifiers. First, we compare our proposed Meta-Cal described in Section 4 with several baselines. Then, in Section 5.2, we validate that the two constraints investigated in this work are well satisfied.

5.1. Calibrating Neural Networks

In this section, we compare our approach with other methods of post-hoc multi-class calibration. Following previous works (Guo et al., 2017; Kull et al., 2019; Zhang et al., 2020; Wenger et al., 2020), we report calibration results on several neural network classifiers trained on CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). For CIFAR-10 and CIFAR-100, the following networks are used: DenseNet (Huang et al., 2016a), ResNet (He et al., 2015), ResNet with stochastic depth (Huang et al., 2016b), and WideResNet (Zagoruyko & Komodakis, 2016). 45000 out of 60000 images are used for training these classifiers; the remaining 15000 images are held out for training and evaluating post-hoc calibration methods. The training details are given in Supplement C. These 15000 samples are randomly split into 5000/10000 samples to train and evaluate a post-hoc calibration method. For ImageNet, we use the pre-trained DenseNet-161 and ResNet-152 from PyTorch (Paszke et al., 2019). The 50000 images in the validation set are used for training and evaluating post-hoc calibration methods; to train and test a calibration map, we randomly split these samples into 25000/25000 images.

Figure 1. Empirical miscoverage rate. The error bars show ±2 standard deviations over 40 independent runs. The desired miscoverage rates for CIFAR-10, CIFAR-100 and ImageNet are all set to 0.05. (Panels: CIFAR-10, CIFAR-100, ImageNet; one row per network architecture.)

Figure 2. Empirical coverage accuracy. The error bars show ±2 standard deviations over 40 independent runs. The desired coverage accuracies for CIFAR-10, CIFAR-100 and ImageNet are set to 0.97, 0.87 and 0.85, respectively. (Panels: CIFAR-10, CIFAR-100, ImageNet; one row per network architecture.)

The following post-hoc algorithms are used for comparison: temperature scaling (Guo et al., 2017), ensemble temperature scaling (Zhang et al., 2020), and Gaussian Process calibration (Wenger et al., 2020). Dirichlet calibration (Kull et al., 2019) is not included in our empirical comparison, since on the neural network classifiers considered in this paper its performance is consistently worse than the other methods (Zhang et al., 2020).

To construct a calibration map using our proposed Meta-Cal, we first need to decide which base calibrator to use.
Throughout the experimental part of this paper, we use temperature scaling since: (i) it is the most computationally efficient approach among the above baselines, and (ii) it can be seen from Proposition 8 that Meta-Cal augments the performance of temperature scaling, since it is an accuracy-preserving calibration map. Second, a ranking model is required to construct the desired binary classifier under the different constraints. In this paper, we define the ranking model as the entropy of the output of the uncalibrated probabilistic classifier, that is, given $X \in \mathcal{X}$, we define $h(X) = -\sum_{i \in [k]} f(X)_i \log f(X)_i$. An alternative is to learn a ranking model using a consistent ranking algorithm (Clémençon et al., 2008) on a separate data set. A potential advantage of this approach is that the constructed binary classifier has a lower Type II error, and thus, from Proposition 6, the worst-case estimate of the expected calibration error can be improved. A disadvantage is that a separate data set is required to learn the ranking model, which effectively means we have fewer samples to train the calibration model.

The experimental configurations specific to our proposed approach are as follows. For Meta-Cal under the miscoverage rate constraint, we set the miscoverage rate tolerance to 0.05 for all neural network classifiers and all data sets used in the experiments. For Meta-Cal under the coverage accuracy constraint, we set the desired coverage accuracy to 0.97, 0.87 and 0.85 for CIFAR-10, CIFAR-100 and ImageNet, respectively. For reference, the original accuracy for each configuration is shown in the third column of Table 1. In both settings, we randomly select 1/10 of the samples (up to 500 samples) from the calibration data set to construct the binary classifier or to estimate the coverage accuracy transformation function.

Table 1. ECE comparison. Uncal, TS, ETS, GPC, Meta-Mis and Meta-Acc denote no calibration, temperature scaling, ensemble temperature scaling, Gaussian Process calibration, Meta-Cal under the miscoverage rate constraint and Meta-Cal under the coverage accuracy constraint, respectively. Reported values are the average of 40 independent runs. All standard errors are less than 5e-4.

| Dataset | Network | Acc | Uncal | TS | ETS | GPC | Meta-Mis | Meta-Acc |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | DenseNet40 | 0.9242 | 0.05105 | 0.00510 | 0.00567 | 0.00634 | 0.00434 | 0.00355 |
| CIFAR-10 | ResNet110 | 0.9356 | 0.04475 | 0.00781 | 0.00809 | 0.00684 | 0.00391 | 0.00441 |
| CIFAR-10 | ResNet110SD | 0.9404 | 0.04022 | 0.00439 | 0.00509 | 0.00364 | 0.00350 | 0.00315 |
| CIFAR-10 | WideResNet32 | 0.9393 | 0.04396 | 0.00706 | 0.00712 | 0.00684 | 0.00485 | 0.00532 |
| CIFAR-100 | DenseNet40 | 0.7000 | 0.21107 | 0.01067 | 0.01104 | 0.01298 | 0.01093 | 0.00793 |
| CIFAR-100 | ResNet110 | 0.7148 | 0.18182 | 0.02037 | 0.02130 | 0.01348 | 0.01815 | 0.01441 |
| CIFAR-100 | ResNet110SD | 0.7283 | 0.15496 | 0.01043 | 0.01057 | 0.01265 | 0.01109 | 0.00733 |
| CIFAR-100 | WideResNet32 | 0.7382 | 0.18425 | 0.01332 | 0.01351 | 0.00993 | 0.01332 | 0.01189 |
| ImageNet | DenseNet161 | 0.7705 | 0.05531 | 0.02053 | 0.02064 | NA | 0.01388 | 0.01248 |
| ImageNet | ResNet152 | 0.7620 | 0.06290 | 0.02023 | 0.02004 | NA | 0.01360 | 0.01138 |

The comparison results of the different calibration methods are shown in Table 1. The empirical ECE is estimated with 15 equal-width bins, and the ECE values reported in Table 1 are the average of 40 independent runs. Since GPC does not converge on ImageNet, its results on ImageNet are marked as NA in Table 1. We note that we use the CE loss instead of MSE when optimizing ETS. From Table 1, it can be seen that our proposed method Meta-Cal performs much better w.r.t. the empirical ECE in almost all configurations. Table 1 should nonetheless be interpreted carefully, since our proposed method works in a novel setting, namely post-hoc calibration under constraints: essentially, the improved calibration can be interpreted as the price we pay in miscoverage. It is also worth noting that after calibration, GPC shows an accuracy drop of around 0.2% for ResNet110 and ResNet110SD on the CIFAR-10 data set. It is possible that GPC is trading off accuracy for calibration in an implicit and uncontrolled way.

Table 2. Time comparison (s). Reported values are the average of 40 independent runs.

| Dataset | Network | Uncal | TS | ETS | GPC | Meta-Mis | Meta-Acc |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | DenseNet40 | 0.004 | 0.074 | 0.143 | 4.722 | 0.102 | 0.094 |
| CIFAR-10 | ResNet110 | 0.004 | 0.066 | 0.050 | 5.255 | 0.127 | 0.084 |
| CIFAR-10 | ResNet110SD | 0.004 | 0.075 | 0.148 | 5.401 | 0.101 | 0.090 |
| CIFAR-10 | WideResNet32 | 0.004 | 0.078 | 0.133 | 5.116 | 0.105 | 0.093 |
| CIFAR-100 | DenseNet40 | 0.014 | 0.492 | 0.492 | 40.218 | 0.470 | 0.336 |
| CIFAR-100 | ResNet110 | 0.026 | 0.426 | 1.404 | 42.118 | 0.408 | 0.299 |
| CIFAR-100 | ResNet110SD | 0.015 | 0.357 | 0.931 | 40.867 | 0.421 | 0.327 |
| CIFAR-100 | WideResNet32 | 0.014 | 0.431 | 0.708 | 44.882 | 0.467 | 0.353 |
| ImageNet | DenseNet161 | 0.304 | 12.707 | 99.488 | NA | 13.573 | 9.903 |
| ImageNet | ResNet152 | 0.309 | 15.624 | 111.589 | NA | 12.581 | 8.426 |
Table 2 shows the running time of the different calibration methods. It can be seen that Meta-Cal has little overhead over its base calibrator (TS in our case). Sometimes Meta-Cal takes less time than TS because its base calibrator is optimized on a subset of the whole calibration data set.

5.2. Verifying Constraints

In this section, we empirically verify whether the constraints on the miscoverage rate and the coverage accuracy are satisfied. The purpose of this section is to support our theoretical claims in Section 4.1 and Section 4.2. The error bar plot in Figure 1 depicts ±2 standard deviations of the empirical miscoverage rate over 40 independent runs. It can be seen from Figure 1 that the miscoverage rate is well-controlled given a miscoverage rate tolerance equal to 0.05. In Figure 2, ±2 standard deviations of the empirical coverage accuracy are illustrated. Given that the desired coverage accuracies are set to 0.97, 0.87 and 0.85 for CIFAR-10, CIFAR-100 and ImageNet, respectively, it can be seen that the coverage accuracy is well-controlled.

6. Conclusion

In this work, we propose Meta-Cal: a post-hoc calibration method for multi-class classification under practical constraints, including the miscoverage rate and the coverage accuracy. Our proposed approach can augment an accuracy-preserving calibration map and improve its calibration performance. Contrary to post-hoc calibration methods that cannot preserve accuracy, our approach has full control over the coverage accuracy. A series of theoretical results are given to show the validity of Meta-Cal. Empirical results for a range of neural network classifiers on several popular computer vision data sets show our approach is able to improve an existing calibration method.

Acknowledgements

Xingchen Ma is supported by Onfido. This research received funding from the Flemish Government under the Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen programme.

References

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 89-96, New York, NY, USA, August 2005. Association for Computing Machinery. ISBN 978-1-59593-180-1. doi: 10.1145/1102351.1102363.

Chen, C., Seff, A., Kornhauser, A., and Xiao, J. DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2722-2730, December 2015. doi: 10.1109/ICCV.2015.312.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and Empirical Minimization of U-statistics. Annals of Statistics, 36(2):844-874, April 2008. ISSN 0090-5364, 2168-8966. doi: 10.1214/009052607000000910.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Ding, Z., Han, X., Liu, P., and Niethammer, M. Local Temperature Scaling for Probability Calibration. arXiv:2008.05105 [cs], August 2020.

El-Yaniv, R. and Wiener, Y. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research, 11(53):1605-1641, 2010.

Gao, W. and Zhou, Z.-H. On the Consistency of AUC Pairwise Optimization. arXiv:1208.0645 [cs, stat], July 2014.

Geifman, Y. and El-Yaniv, R. Selective Classification for Deep Neural Networks. arXiv:1705.08500 [cs], June 2017.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17, pp. 1321-1330. JMLR.org, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], December 2015.

Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs], August 2016a.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep Networks with Stochastic Depth. In Leibe, B., Matas, J., Sebe, N., and Welling, M. (eds.), Computer Vision - ECCV 2016, Lecture Notes in Computer Science, pp. 646-661, Cham, 2016b. Springer International Publishing. ISBN 978-3-319-46493-0. doi: 10.1007/978-3-319-46493-0_39.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012.

Kull, M., Filho, T. S., and Flach, P. Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623-631. PMLR, April 2017.

Kull, M., Perello-Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 12316-12326. Curran Associates, Inc., 2019.

Kumar, A., Liang, P. S., and Ma, T. Verified Uncertainty Calibration. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 3792-3803. Curran Associates, Inc., 2019.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv:1612.01474 [cs, stat], December 2016.

Lin, H.-T., Lin, C.-J., and Weng, R. C. A note on Platt's probabilistic outputs for support vector machines. Machine Learning, 68(3):267-276, October 2007. ISSN 1573-0565. doi: 10.1007/s10994-007-5018-6.

Ma, X., Huang, D., Wang, Y., and Wang, Y. Cost-Sensitive Two-Stage Depression Prediction Using Dynamic Visual Clues. In Lai, S.-H., Lepetit, V., Nishino, K., and Sato, Y. (eds.), Computer Vision - ACCV 2016, Lecture Notes in Computer Science, pp. 338-351, Cham, 2017. Springer International Publishing. ISBN 978-3-319-54184-6. doi: 10.1007/978-3-319-54184-6_21.

Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. A Simple Baseline for Bayesian Uncertainty in Deep Learning. arXiv:1902.02476 [cs, stat], December 2019.

Malinin, A. and Gales, M. Predictive Uncertainty Estimation via Prior Networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7047-7058. Curran Associates, Inc., 2018.

Milios, D., Camoriano, R., Michiardi, P., Rosasco, L., and Filippone, M. Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6005-6015. Curran Associates, Inc., 2018.
Naeini, M. P., Cooper, G. F., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI '15, pp. 2901-2907, Austin, Texas, January 2015. AAAI Press. ISBN 978-0-262-51129-2.

Narasimhan, H. and Agarwal, S. On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation. Advances in Neural Information Processing Systems, 26:2913-2921, 2013.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Patel, K., Beluch, W., Yang, B., Pfeiffer, M., and Zhang, D. Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning. arXiv:2006.13092 [cs, stat], September 2020.

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing Neural Networks by Penalizing Confident Output Distributions. arXiv:1701.06548 [cs], January 2017.

Platt, J. C. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pp. 61-74. MIT Press, 1999.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211-252, December 2015. ISSN 1573-1405. doi: 10.1007/s11263-015-0816-y.

Tong, X., Feng, Y., and Li, J. J. Neyman-Pearson classification algorithms and NP receiver operating characteristics. Science Advances, 4(2), February 2018. ISSN 2375-2548. doi: 10.1126/sciadv.aao1659.

Wenger, J., Kjellström, H., and Triebel, R. Non-Parametric Calibration for Classification. In International Conference on Artificial Intelligence and Statistics, pp. 178-190, June 2020.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Stochastic Variational Deep Kernel Learning. arXiv:1611.00336 [cs, stat], November 2016.

Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp. 609-616, San Francisco, CA, USA, June 2001. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-778-1.

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pp. 694-699, New York, NY, USA, July 2002. Association for Computing Machinery. ISBN 978-1-58113-567-1. doi: 10.1145/775047.775151.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146 [cs], May 2016.

Zhang, J., Kailkhura, B., and Han, T. Y.-J. Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning. arXiv:2003.07329 [cs, stat], June 2020.