TOP-LABEL CALIBRATION AND MULTICLASS-TO-BINARY REDUCTIONS

Published as a conference paper at ICLR 2022

Chirag Gupta & Aaditya Ramdas
Carnegie Mellon University
{chiragg,aramdas}@cmu.edu

ABSTRACT

We propose a new notion of multiclass calibration called top-label calibration. A classifier is said to be top-label calibrated if the reported probability for the predicted class label (the top-label) is calibrated, conditioned on the top-label. This conditioning is essential for the practical utility of the calibration property, since the top-label is always reported and we must condition on what is reported. However, the popular notion of confidence calibration erroneously skips this conditioning. Furthermore, we outline a multiclass-to-binary (M2B) reduction framework that unifies confidence, top-label, and class-wise calibration, among others. As its name suggests, M2B works by reducing multiclass calibration to different binary calibration problems; various types of multiclass calibration can then be achieved using simple binary calibration routines. We instantiate the M2B framework with the well-studied histogram binning (HB) binary calibrator, and prove that the overall procedure is multiclass calibrated without making any assumptions on the underlying data distribution. In an empirical evaluation with four deep net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling. Code for this work is available at https://github.com/aigen/df-posthoc-calibration.

1 INTRODUCTION

Machine learning models often make probabilistic predictions. The ideal prediction is the true conditional distribution of the output given the input. However, nature never reveals true probability distributions, making it infeasible to achieve this ideal in most situations. Instead, there is significant interest in designing models that are calibrated, which is often feasible.

We motivate the definition of calibration using a standard example of predicting the probability of rain. Suppose a meteorologist claims that the probability of rain on a particular day is 0.7. Regardless of whether it rains on that day or not, we cannot know if 0.7 was the underlying probability of rain. However, we can test if the meteorologist is calibrated in the long run, by checking whether, on the D days when 0.7 was predicted, it indeed rained on around 0.7D days (and the same is true for other predicted probabilities). This example is readily converted to a formal binary calibration setting. Denote a random (feature, label)-pair as (X, Y) ∈ 𝒳 × {0, 1}, where 𝒳 is the feature space. A probabilistic predictor h : 𝒳 → [0, 1] is said to be calibrated if for every prediction q ∈ [0, 1], P(Y = 1 | h(X) = q) = q (almost surely). Arguably, if an ML classification model produces such calibrated scores for the classes, downstream users of the model can reliably use its predictions for a broader set of tasks.

Our focus in this paper is calibration for multiclass classification, with L ≥ 3 classes and Y ∈ [L] := {1, 2, . . . , L}. We assume all (training and test) data is drawn i.i.d. from a fixed distribution P, and denote a general point from this distribution as (X, Y) ∼ P. Consider a typical multiclass predictor, h : 𝒳 → Δ^{L-1}, whose range Δ^{L-1} is the probability simplex in R^L.
A natural notion of calibration for h, called canonical calibration, is the following: for every l ∈ [L], P(Y = l | h(X) = q) = q_l (where q_l denotes the l-th component of q). However, canonical calibration becomes infeasible to achieve or verify once L is even 4 or 5 (Vaicenavicius et al., 2019). Thus, there is interest in studying statistically feasible relaxations of the canonical notion, such as confidence calibration (Guo et al., 2017) and class-wise calibration (Kull et al., 2017).

In particular, the notion of confidence calibration (Guo et al., 2017) has been popular recently. A model is confidence calibrated if the following is true: "when the reported confidence for the predicted class is q ∈ [0, 1], the accuracy is also q". In any practical setting, the confidence q is never reported alone; it is always reported along with the actual class prediction l ∈ [L]. One may expect that if a model is confidence calibrated, the following also holds: "when the class l is predicted with confidence q, the probability of the actual class being l is also q". Unfortunately, this expectation is rarely met: there exist confidence calibrated classifiers for which the latter statement is grossly violated for all classes (Example 1). On the other hand, our proposed notion of top-label calibration enforces the latter statement. It is philosophically more coherent, because it requires conditioning on all relevant reported quantities (both the predicted top label and our confidence in it). In Section 2, we argue further that top-label calibration is a simple and practically meaningful replacement of confidence calibration.

In Section 3, we unify top-label, confidence, and a number of other popular notions of multiclass calibration into the framework of multiclass-to-binary (M2B) reductions. The M2B framework relies on the simple observation that each of these notions internally verifies binary calibration claims. As a consequence, each M2B notion of calibration can be achieved by solving a number of binary calibration problems. With the M2B framework at our disposal, all of the rich literature on binary calibration can now be used for multiclass calibration. We illustrate this by instantiating the M2B framework with the binary calibration algorithm of histogram binning, or HB (Zadrozny and Elkan, 2001; Gupta and Ramdas, 2021). The M2B + HB procedure achieves state-of-the-art results with respect to standard notions of calibration error (Section 4). Further, we show that our procedure is provably calibrated for arbitrary data-generating distributions. The formal theorems are delayed to Appendices B and C (due to space limitations), but an informal result is presented in Section 4.

2 MODIFYING CONFIDENCE CALIBRATION TO TOP-LABEL CALIBRATION

Let c : 𝒳 → [L] denote a classifier or top-label predictor and h : 𝒳 → [0, 1] a function that provides a confidence or probability score for the top-label c(X). The predictor (c, h) is said to be confidence calibrated (for the data-generating distribution P) if

P(Y = c(X) | h(X)) = h(X).    (1)

In other words, when the reported confidence h(X) equals p ∈ [0, 1], the fraction of instances where the predicted label is correct also approximately equals p. Note that for an L-dimensional predictor h : 𝒳 → Δ^{L-1}, one would use c(·) = argmax_{l ∈ [L]} h_l(·) and h(·) = h_{c(·)}(·); ties are broken arbitrarily. Then h is confidence calibrated if the corresponding (c, h) satisfies (1).
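To make the notation concrete, the following minimal sketch (Python with NumPy; the array names and toy numbers are ours, not from the paper) shows how the pair (c, h) is obtained from an L-dimensional probability predictor, as described above: c is the argmax class and h is the corresponding maximum probability. Classes are 0-indexed here, whereas the paper uses 1, . . . , L.

```python
import numpy as np

# probs: (n, L) array of predicted probability vectors, one row per example,
# e.g. the softmax output of a deep network on n test points (toy values here).
probs = np.array([
    [0.10, 0.60, 0.30],
    [0.55, 0.25, 0.20],
    [0.05, 0.05, 0.90],
])

c = np.argmax(probs, axis=1)          # top-label prediction c(x), in {0, ..., L-1}
h = probs[np.arange(len(probs)), c]   # confidence h(x) = h_{c(x)}(x), the top probability

# Confidence calibration (1) asks: among points with h(X) = p, does Y = c(X) hold
# a fraction p of the time?  Top-label calibration (2), defined next, asks the same
# question separately for each predicted class c(X).
print(c)  # [1 0 2]
print(h)  # [0.6  0.55 0.9 ]
```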
Confidence calibration is most applicable in high-accuracy settings where we trust the label prediction c(x). For instance, if a high-accuracy cancer-grade-prediction model predicts a patient as having "95% grade III, 3% grade II, and 2% grade I", we would suggest that the patient undergo an invasive treatment. However, we may want to know (and control) the number of non-grade-III patients that were given this suggestion incorrectly. In other words, is P(cancer is not grade III | cancer is predicted to be of grade III with confidence 95%) equal to 5%? It would appear that, by focusing on the probability of the predicted label, confidence calibration enforces such control. However, as we illustrate next, confidence calibration fails at this goal by providing a guarantee that is neither practically interpretable, nor actionable.

Translating the probabilistic statement (1) into words, we ascertain that confidence calibration leads to guarantees of the form: "if the confidence h(X) in the top-label is 0.6, then the accuracy (frequency with which Y equals c(X)) is 0.6". Such a guarantee is not very useful. Suppose a patient P is informed (based on their symptoms X) that they are most likely to have a certain disease D, with probability 0.6. Further, patient P is told that this score is confidence calibrated. P can now infer the following: "among all patients who have probability 0.6 of having some unspecified disease, the fraction who have that unspecified disease is also 0.6." However, P is concerned only about disease D, and not about other diseases. That is, P wants to know the probability of having D among patients who were predicted to have disease D with confidence 0.6, not among patients who were predicted to have some disease with confidence 0.6. In other words, P cares about the occurrence of D among patients who were told the same thing that P has been told. It is tempting to wish that the confidence calibrated probability 0.6 has some bearing on what P cares about. However, this faith is misguided, as the above reasoning suggests, and as further illustrated through the following example.

Example 1. Suppose the instance space is (X, Y) ∈ {a, b} × {1, 2, . . .}. (X can be seen as the random patient, and Y as the disease they are suffering from.) Consider a predictor (c, h) and let the values taken by (X, Y, c, h) be as follows:

| Feature x | P(X = x) | Class prediction c(x) | Confidence h(x) | P(Y = c(X) \| X = x) |
|---|---|---|---|---|
| a | 0.5 | 1 | 0.6 | 0.2 |
| b | 0.5 | 2 | 0.6 | 1.0 |

The table specifies only the probabilities P(Y = c(X) | X = x); the probabilities P(Y = l | X = x) for l ≠ c(x) can be set arbitrarily. We verify that (c, h) is confidence calibrated:

P(Y = c(X) | h(X) = 0.6) = 0.5 (P(Y = 1 | X = a) + P(Y = 2 | X = b)) = 0.5 (0.2 + 1) = 0.6.

However, whether the actual instance is X = a or X = b, the probabilistic claim of 0.6 bears no correspondence with reality. If X = a, h(X) = 0.6 is extremely overconfident since P(Y = 1 | X = a) = 0.2. Contrarily, if X = b, h(X) = 0.6 is extremely underconfident.

The reason for the strange behavior above is that the probability P(Y = c(X) | h(X)) is not interpretable from a decision-making perspective. In practice, we never report just the confidence h(X), but also the class prediction c(X) (obviously!). Thus it is more reasonable to talk about the conditional probability of Y = c(X) given what is reported, that is, both c(X) and h(X). We make a small but critical change to (1); we say that (c, h) is top-label calibrated if
P(Y = c(X) | h(X), c(X)) = h(X).    (2)

(See the disambiguating Remark 2 on terminology.) Going back to the patient-disease example, top-label calibration would tell patient P the following: "among all patients who (just like you) are predicted to have disease D with probability 0.6, the fraction who actually have disease D is also 0.6." Philosophically, it makes sense to condition on what is reported, both the top label and its confidence, because that is what is known to the recipient of the information; and there is no apparent justification for not conditioning on both.

A commonly used metric for quantifying the miscalibration of a model is the expected calibration error (ECE). The ECE associated with confidence calibration is defined as

conf-ECE(c, h) := E_X |P(Y = c(X) | h(X)) - h(X)|.    (3)

We define top-label-ECE (TL-ECE) in an analogous fashion, but also condition on c(X):

TL-ECE(c, h) := E_X |P(Y = c(X) | c(X), h(X)) - h(X)|.    (4)

Higher values of ECE indicate worse calibration performance. The predictor in Example 1 has conf-ECE(c, h) = 0. However, it has TL-ECE(c, h) = 0.4, revealing its miscalibration. More generally, it can be deduced as a straightforward consequence of Jensen's inequality that conf-ECE(c, h) is never larger than TL-ECE(c, h) (see Proposition 4 in Appendix H). As illustrated by Example 1, the difference can be significant. In the following subsection we illustrate that the difference can be significant on a real dataset as well. First, we make a couple of remarks.

Remark 1 (ECE estimation using binning). Estimating the ECE requires estimating probabilities conditional on some prediction, such as h(x). A common strategy to do this is to bin together nearby values of h(x) using binning schemes (Nixon et al., 2020, Section 2.1), and compute a single estimate for the predicted and true probabilities using all the points in a bin. The calibration method we espouse in this work, histogram binning (HB), produces discrete predictions whose ECE can be estimated without further binning. Based on this, we use the following experimental protocol: we report unbinned ECE estimates while assessing HB, and binned ECE estimates for all other compared methods, which are continuous-output methods (deep-nets, temperature scaling, etc.). It is commonly understood that binning leads to underestimation of the effective ECE (Vaicenavicius et al., 2019; Kumar et al., 2019). Thus, using unbinned ECE estimates for HB gives HB a disadvantage compared to the binned ECE estimates we use for other methods. (This further strengthens our positive results for HB.) The binning scheme we use is equal-width binning, where the interval [0, 1] is divided into B equal-width intervals. Equal-width binning typically leads to lower ECE estimates compared to adaptive-width binning (Nixon et al., 2020).

Remark 2 (Terminology). The term conf-ECE was introduced by Kull et al. (2019). Most works refer to conf-ECE as just ECE (Guo et al., 2017; Nixon et al., 2020; Mukhoti et al., 2020; Kumar et al., 2018). However, some papers refer to conf-ECE as top-label-ECE (Kumar et al., 2019; Zhang et al., 2020), resulting in two different terms for the same concept. We refer to the older notion as conf-ECE; our definition of top-label calibration/ECE (4) is different from previous ones.
Figure 1: Confidence reliability diagrams misrepresent the effective miscalibration. (a) Confidence reliability diagram and top-label reliability diagram (distinguished by marker style) for a ResNet-50 model on the CIFAR-10 dataset; see further details in points (a) and (b) below. The gray bars denote the fraction of predictions in each bin. The confidence reliability diagram (mistakenly) suggests better calibration than the top-label reliability diagram. (b) Class-wise and zoomed-in version of Figure 1a for bin 6 (top) and bin 10 (bottom); see further details in point (c) below. The markers are in the same position as in Figure 1a, and denote the average predicted and true probabilities. The colored points denote the predicted and true probabilities when seen class-wise. The histograms on the right show the number of test points per class within bins 6 and 10.

2.1 AN ILLUSTRATIVE EXPERIMENT WITH RESNET-50 ON CIFAR-10

We now compare confidence and top-label calibration using ECE estimates and reliability diagrams (Niculescu-Mizil and Caruana, 2005). This experiment can be seen as a less malignant version of Example 1. Here, confidence calibration is not completely meaningless, but can nevertheless be misleading. Figure 1 illustrates the (test-time) calibration performance of a ResNet-50 model (He et al., 2016) on the CIFAR-10 dataset (Krizhevsky, 2009). In the following summarizing points, (c, h) corresponds to the ResNet-50 model.

(a) One set of markers in Figure 1a forms the confidence reliability diagram (Guo et al., 2017), constructed as follows. First, the h(x) values on the test set are binned into one of B = 10 bins, [0, 0.1), [0.1, 0.2), . . . , [0.9, 1], depending on the interval to which h(x) belongs. The gray bars in Figure 1a indicate the fraction of h(x) values in each bin; nearly 92% of the points belong to bin [0.9, 1] and no points belong to bin [0, 0.1). Next, for every bin b, we plot (conf_b, acc_b), which are the plug-in estimates of E[h(X) | h(X) ∈ bin b] and P(Y = c(X) | h(X) ∈ bin b) respectively. The dashed X = Y line indicates perfect confidence calibration.

(b) The other set of markers in Figure 1a forms the top-label reliability diagram. Unlike the confidence reliability diagram, the top-label reliability diagram shows the average miscalibration across classes in a given bin. For a given class l and bin b, define

Δ_{b,l} := | P̂(Y = c(X) | c(X) = l, h(X) ∈ bin b) - Ê[h(X) | c(X) = l, h(X) ∈ bin b] |,

where P̂, Ê denote empirical estimates based on the test data. The overall miscalibration is then

Δ_b := Weighted-average(Δ_{b,l}) = Σ_{l ∈ [L]} P̂(c(X) = l | h(X) ∈ bin b) Δ_{b,l}.

Note that Δ_b is always non-negative and does not indicate whether the overall miscalibration occurs due to under- or over-confidence; also, if the absolute values were dropped from Δ_{b,l}, then Δ_b would simply equal acc_b - conf_b. In order to plot Δ_b in a reliability diagram, we obtain the direction for the corresponding point from the confidence reliability diagram. Thus for every (conf_b, acc_b), we plot (conf_b, conf_b + Δ_b) if acc_b > conf_b, and (conf_b, conf_b - Δ_b) otherwise. This scatter plot of the Δ_b's gives us the top-label reliability diagram.

Figure 1a shows that there is a visible increase in miscalibration when going from confidence calibration to top-label calibration. To understand why this change occurs, Figure 1b zooms into bin 6 (h(X) ∈ [0.5, 0.6)) and bin 10 (h(X) ∈ [0.9, 1.0]), as described next.

(c) Figure 1b displays the class-wise top-label reliability diagrams for bins 6 and 10.
Note that for bin 6, the marker is nearly on the X = Y line, indicating that the overall accuracy matches the overall confidence of 0.545. However, the true accuracy when class 1 was predicted is 0.2, and the true accuracy when class 8 was predicted is 0.9 (a very similar scenario to Example 1). For bin 10, the marker indicates a miscalibration of 0.01; however, when class 4 was predicted (roughly 8% of all test points) the miscalibration is 0.05.

[Figure 2: estimated ECE versus the number of bins (5 to 25) for four architectures (ResNet-50, ResNet-110, Wide-ResNet-26-10, DenseNet-121); curves show top-label-ECE and conf-ECE for the base model, temperature scaling, and histogram binning.] Figure 2 caption: Conf-ECE (dashed lines) and TL-ECE (solid lines) of four deep-net architectures on CIFAR-10, as well as with recalibration using histogram binning and temperature scaling. The TL-ECE is often 2-3 times the conf-ECE, depending on the number of bins used to estimate ECE and the architecture. Top-label histogram binning typically performs better than temperature scaling.

Figure 2 displays the aggregate effect of the above phenomenon (across bins and classes) through estimates of the conf-ECE and TL-ECE. The precise experimental setup is described in Section 4. These plots display the ECE estimates of the base model, as well as the base model recalibrated using temperature scaling (Guo et al., 2017) and our upcoming formulation of top-label histogram binning (Section 3). Since ECE estimates depend on the number of bins B used (see Roelofs et al. (2020) for empirical work around this), we plot the ECE estimate for every value B ∈ [5, 25] in order to obtain clear and unambiguous results. We find that the TL-ECE is significantly higher than the conf-ECE for most values of B, most architectures, and both the pre- and post-recalibration models. This figure also previews the performance of our forthcoming top-label histogram binning algorithm. Top-label HB has smaller estimated TL-ECE than temperature scaling for most values of B and most architectures. Except for ResNet-50, the conf-ECE estimates are also better.

To summarize, top-label calibration captures the intuition of confidence calibration by focusing on the predicted class. However, top-label calibration also conditions on the predicted class, which is always part of the prediction in any practical setting. Further, TL-ECE estimates can be substantially different from conf-ECE estimates. Thus, while it is common to compare predictors based on the conf-ECE, the TL-ECE comparison is more meaningful, and can potentially be different. A plug-in estimator of both quantities is sketched below.
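The following is a minimal sketch of the plug-in estimates behind Figures 1 and 2, using equal-width bins as in Remark 1 (Python with NumPy; the function and variable names are ours, and the paper's full estimation protocol, including the unbinned treatment of histogram-binning outputs, is in its Appendix E.2).

```python
import numpy as np

def ece_estimates(h, c, y, num_bins=15):
    """Plug-in estimates of conf-ECE (3) and TL-ECE (4) with equal-width bins.

    h : (n,) top-label confidences; c : (n,) predicted labels; y : (n,) true labels.
    """
    bins = np.minimum((h * num_bins).astype(int), num_bins - 1)  # equal-width bins on [0, 1]
    correct = (y == c).astype(float)
    n = len(h)

    conf_ece, tl_ece = 0.0, 0.0
    for b in range(num_bins):
        in_bin = bins == b
        if not in_bin.any():
            continue
        # conf-ECE: condition only on the confidence bin
        conf_ece += in_bin.mean() * abs(correct[in_bin].mean() - h[in_bin].mean())
        # TL-ECE: additionally condition on the predicted label
        for l in np.unique(c[in_bin]):
            cell = in_bin & (c == l)
            tl_ece += (cell.sum() / n) * abs(correct[cell].mean() - h[cell].mean())
    return conf_ece, tl_ece
```

Because each bin's signed gap is a weighted average of the per-class gaps within it, the first estimate can never exceed the second, mirroring the Jensen's-inequality argument behind Proposition 4.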
3 CALIBRATION ALGORITHMS FROM CALIBRATION METRICS

In this section, we unify a number of notions of multiclass calibration as multiclass-to-binary (or M2B) notions, and propose a general-purpose calibration algorithm that achieves the corresponding M2B notion of calibration. The M2B framework yields multiple novel post-hoc calibration algorithms, each of which is tuned to a specific M2B notion of calibration.

3.1 MULTICLASS-TO-BINARY (M2B) NOTIONS OF CALIBRATION

In Section 2, we defined confidence calibration (1) and top-label calibration (2). These notions verify calibration claims for the highest predicted probability. Other popular notions of calibration verify calibration claims for other entries in the full L-dimensional prediction vector. A predictor h = (h_1, h_2, . . . , h_L) is said to be class-wise calibrated (Kull et al., 2017) if

(class-wise calibration)  ∀ l ∈ [L],  P(Y = l | h_l(X)) = h_l(X).    (5)

Another recently proposed notion is top-K-confidence calibration (Gupta et al., 2021). For k ∈ [L], let c^(k) : 𝒳 → [L] denote the k-th highest class prediction, and let h^(k) : 𝒳 → [0, 1] denote the confidence associated with it (c = c^(1) and h = h^(1) are special cases). For a given K ≤ L,

(top-K-confidence calibration)  ∀ k ∈ [K],  P(Y = c^(k)(X) | h^(k)(X)) = h^(k)(X).    (6)

As we did in Section 2 for confidence → top-label, top-K-confidence calibration can be modified to the more interpretable top-K-label calibration by further conditioning on the predicted labels:

(top-K-label calibration)  ∀ k ∈ [K],  P(Y = c^(k)(X) | h^(k)(X), c^(k)(X)) = h^(k)(X).    (7)

| Calibration notion | Quantifier | Prediction (pred(X)) | Binary calibration statement |
|---|---|---|---|
| Confidence | - | h(X) | P(Y = c(X) \| pred(X)) = h(X) |
| Top-label | - | c(X), h(X) | P(Y = c(X) \| pred(X)) = h(X) |
| Class-wise | ∀ l ∈ [L] | h_l(X) | P(Y = l \| pred(X)) = h_l(X) |
| Top-K-confidence | ∀ k ∈ [K] | h^(k)(X) | P(Y = c^(k)(X) \| pred(X)) = h^(k)(X) |
| Top-K-label | ∀ k ∈ [K] | c^(k)(X), h^(k)(X) | P(Y = c^(k)(X) \| pred(X)) = h^(k)(X) |

Table 1: Multiclass-to-binary (M2B) notions internally verify one or more binary calibration statements/claims. The statements in the rightmost column are required to hold almost surely.

Each of these notions reduces multiclass calibration to one or more binary calibration requirements, where each binary calibration requirement corresponds to verifying if the distribution of Y, conditioned on some prediction pred(X), satisfies a single binary calibration claim associated with pred(X). Table 1 illustrates how the calibration notions discussed so far internally verify a number of binary calibration claims, making them M2B notions. For example, for class-wise calibration, for every l ∈ [L], the conditioning is on pred(X) = h_l(X), and a single binary calibration statement is verified: P(Y = l | pred(X)) = h_l(X). Based on this property, we call each of these notions multiclass-to-binary or M2B notions.

The notion of canonical calibration mentioned in the introduction is not an M2B notion. Canonical calibration is discussed in detail in Appendix G. Due to the conditioning on a multi-dimensional prediction, non-M2B notions of calibration are harder to achieve or verify. For the same reason, it is possibly easier for humans to interpret binary calibration claims when taking decisions/actions.

3.2 ACHIEVING M2B NOTIONS OF CALIBRATION USING M2B CALIBRATORS

The M2B framework illustrates how multiclass calibration can typically be viewed via a reduction to binary calibration. The immediate consequence of this reduction is that one can now solve multiclass calibration problems by leveraging the well-developed methodology for binary calibration.

The upcoming M2B calibrators belong to the standard recalibration or post-hoc calibration setting. In this setting, one starts with a fixed pre-learnt base model g : 𝒳 → Δ^{L-1}. The base model g can correspond to a deep-net, a random forest, or any 1-v-all (one-versus-all) binary classification model such as logistic regression. The base model is typically optimized for classification accuracy and may not be calibrated. The goal of post-hoc calibration is to use some given calibration data D = (X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n) ∈ (𝒳 × [L])^n, typically data on which g was not learnt, to recalibrate g.
In practice, the calibration data is usually the same as the validation data. To motivate M2B calibrators, suppose we want to verify if g is calibrated on a certain test set, based on a given M2B notion of calibration. Then, the verifying process will split the test data into a number of sub-datasets, each of which will verify one of the binary calibration claims. In Appendix A.2, we argue that the calibration data can also be viewed as a test set, and every step in the verification process can be used to provide a signal for improving calibration. M2B calibrators take the form of wrapper methods that work on top of a given binary calibrator. Denote an arbitrary black-box binary calibrator as At0,1u : r0, 1s X ˆp X ˆt0, 1uq Ñ r0, 1s X , where the first argument is a mapping X Ñ r0, 1s that denotes a (miscalibrated) binary predicor, and the second argument is a calibration data sequence of arbitrary length. The output is a (better calibrated) binary predictor. Examples of At0,1u are histogram binning (Zadrozny and Elkan, 2001), isotonic regression (Zadrozny and Elkan, 2002), and Platt scaling (Platt, 1999). In the upcoming descriptions, we use the indicator function 1 ta bu P t0, 1u which takes the value 1 if a b, and 0 if a b. The general formulation of our M2B calibrator is delayed to Appendix A since the description is a bit involved. To ease readability and adhere to the space restrictions, in the main paper we describe the calibrators corresponding to top-label, class-wise, and confidence calibration (Algorithms 1 3). Each of these calibrators are different from the classical M2B calibrator (Algorithm 4) that has been used by Zadrozny and Elkan (2002), Guo et al. (2017), Kull et al. (2019), and most other papers Published as a conference paper at ICLR 2022 M2B calibrators: Post-hoc multiclass calibration using binary calibrators Input in each case: Binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X , base multiclass predictor g : X Ñ L 1, calibration data D p X1, Y1q, . . . , p Xn, Ynq. Algorithm 1: Confidence calibrator 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 D1 Ð tp Xi, 1 t Yi cp Xiquq : i P rnsu; 4 h Ð At0,1upg, D1q; 5 return pc, hq; Algorithm 2: Top-label calibrator 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 for l Ð 1 to L do 4 Dl Ð tp Xi, 1 t Yi luq : cp Xiq lqu; 5 hl Ð At0,1upg, Dlq; 7 hp q Ð hcp qp q (predict hlpxq if cpxq l); 8 return pc, hq; Algorithm 3: Class-wise calibrator 1 Write g pg1, g2, . . . , g Lq; 2 for l Ð 1 to L do 3 Dl Ð tp Xi, 1 t Yi luq : i P rnsu; 4 hl Ð At0,1upgl, Dlq; 6 return ph1, h2, . . . , h Lq; Algorithm 4: Normalized calibrator 1 Write g pg1, g2, . . . , g Lq; 2 for l Ð 1 to L do 3 Dl Ð tp Xi, 1 t Yi luq : i P rnsu; 4 rhl Ð At0,1upgl, Dlq; 6 Normalize: for every l P r Ls, hlp q : rhlp q{ řL k 1 rhkp q; 7 return ph1, h2, . . . , h Lq; we are aware of, with the most similar one being Algorithm 3. Top-K-label and top-K-confidence calibrators are also explicitly described in Appendix A (Algorithms 6 and 7). Top-label calibration requires that for every class l P r Ls, Pp Y l | cp Xq l, hp Xqq hp Xq. Thus, to achieve top-label calibration, we must solve L calibration problems. Algorithm 2 constructs L datasets t Dl : l P r Lsu (line 4). The features in Dl are the Xi s for which cp Xiq l, and the labels are 1 t Yi lu. Now for every l P r Ls, we calibrate g to hl : X Ñ r0, 1s using Dl and any binary calibrator. 
The final probabilistic predictor is h(·) = h_{c(·)}(·) (that is, it predicts h_l(x) if c(x) = l). The top-label predictor c does not change in this process. Thus the accuracy of (c, h) is the same as the accuracy of g, irrespective of which binary calibrator A_{0,1} is used. Unlike the top-label calibrator, the confidence calibrator merges all classes together into a single dataset ∪_{l ∈ [L]} D_l.

To achieve class-wise calibration, Algorithm 3 also solves L calibration problems, but these correspond to satisfying P(Y = l | h_l(X)) = h_l(X). Unlike top-label calibration, the dataset D_l for class-wise calibration contains all the X_i's (even if c(X_i) ≠ l), and h_l is passed to A_{0,1} instead of h. Also, unlike confidence calibration, Y_i is replaced with 1{Y_i = l} instead of 1{Y_i = c(X_i)}. The overall process is similar to reducing multiclass classification to L 1-v-all binary classification problems, but our motivation is intricately tied to the notion of class-wise calibration.

Most popular empirical works that have discussed binary calibrators for multiclass calibration have done so using the normalized calibrator, Algorithm 4. This is almost identical to Algorithm 3, except that there is an additional normalization step (line 6 of Algorithm 4). This normalization was first proposed by Zadrozny and Elkan (2002, Section 5.2), and has been used unaltered by most other works [1] where the goal has been to simply compare direct multiclass calibrators such as temperature scaling, Dirichlet scaling, etc., to a calibrator based on binary methods (for instance, see Section 4.2 of Guo et al. (2017)). In contrast to these papers, we investigate multiple M2B reductions in an effort to identify the right reduction of multiclass calibration to binary calibration.

To summarize, the M2B characterization immediately yields a novel and different calibrator for every M2B notion. In the following section, we instantiate M2B calibrators on the binary calibrator of histogram binning (HB), leading to two new algorithms, top-label-HB and class-wise-HB, that achieve strong empirical results and satisfy distribution-free calibration guarantees.

[1] The only exception we are aware of is the recent work of Patel et al. (2021), who also suggest skipping normalization (see their Appendix A1); however, they use a common I-Max binning scheme across classes, whereas in Algorithm 3 the predictor h_l for each class is learnt completely independently of other classes.

| Metric | Dataset | Architecture | Base | TS | VS | DS | N-HB | TL-HB |
|---|---|---|---|---|---|---|---|---|
| Top-label ECE | CIFAR-10 | ResNet-50 | 0.025 | 0.022 | 0.020 | 0.019 | **0.018** | 0.020 |
| | | ResNet-110 | 0.029 | 0.022 | 0.021 | 0.021 | **0.020** | 0.021 |
| | | WRN-26-10 | 0.023 | 0.023 | 0.019 | 0.021 | **0.012** | 0.018 |
| | | DenseNet-121 | 0.027 | 0.027 | 0.020 | 0.020 | **0.019** | 0.021 |
| | CIFAR-100 | ResNet-50 | 0.118 | 0.114 | 0.113 | 0.322 | **0.081** | 0.143 |
| | | ResNet-110 | 0.127 | 0.121 | 0.115 | 0.353 | **0.093** | 0.145 |
| | | WRN-26-10 | 0.103 | 0.103 | 0.100 | 0.304 | **0.070** | 0.129 |
| | | DenseNet-121 | 0.110 | 0.110 | 0.109 | 0.322 | **0.086** | 0.139 |
| Top-label MCE | CIFAR-10 | ResNet-50 | 0.315 | 0.305 | 0.773 | 0.282 | 0.411 | **0.107** |
| | | ResNet-110 | 0.275 | 0.227 | 0.264 | 0.392 | 0.195 | **0.077** |
| | | WRN-26-10 | 0.771 | 0.771 | 0.498 | 0.325 | 0.140 | **0.071** |
| | | DenseNet-121 | 0.289 | 0.289 | 0.734 | 0.294 | 0.345 | **0.087** |
| | CIFAR-100 | ResNet-50 | 0.436 | 0.300 | **0.251** | 0.619 | 0.397 | 0.291 |
| | | ResNet-110 | 0.313 | **0.255** | 0.277 | 0.557 | 0.266 | 0.257 |
| | | WRN-26-10 | 0.273 | **0.255** | 0.256 | 0.625 | 0.287 | 0.280 |
| | | DenseNet-121 | 0.279 | **0.231** | 0.235 | 0.600 | 0.320 | 0.289 |

Table 2: Top-label-ECE and top-label-MCE for deep-net models (column Base) and various post-hoc calibrators: temperature-scaling (TS), vector-scaling (VS), Dirichlet-scaling (DS), top-label-HB (TL-HB), and normalized-HB (N-HB).
The best performing method in each row is in bold.

| Metric | Dataset | Architecture | Base | TS | VS | DS | N-HB | CW-HB |
|---|---|---|---|---|---|---|---|---|
| Class-wise ECE × 10² | CIFAR-10 | ResNet-50 | 0.46 | 0.42 | 0.35 | 0.35 | 0.50 | **0.28** |
| | | ResNet-110 | 0.59 | 0.50 | 0.42 | 0.38 | 0.53 | **0.27** |
| | | WRN-26-10 | 0.44 | 0.44 | 0.35 | 0.39 | 0.39 | **0.28** |
| | | DenseNet-121 | 0.46 | 0.46 | **0.36** | **0.36** | 0.48 | **0.36** |
| | CIFAR-100 | ResNet-50 | 0.22 | 0.20 | 0.20 | 0.66 | 0.23 | **0.16** |
| | | ResNet-110 | 0.24 | 0.23 | 0.21 | 0.72 | 0.24 | **0.16** |
| | | WRN-26-10 | 0.19 | 0.19 | 0.18 | 0.61 | 0.20 | **0.14** |
| | | DenseNet-121 | 0.20 | 0.21 | 0.19 | 0.66 | 0.24 | **0.16** |

Table 3: Class-wise-ECE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2, except TL-HB is replaced with class-wise-HB (CW-HB).

4 EXPERIMENTS: M2B CALIBRATION WITH HISTOGRAM BINNING

Histogram binning or HB was proposed by Zadrozny and Elkan (2001) with strong empirical results for binary calibration. In HB, a base binary calibration model g : 𝒳 → [0, 1] is used to partition the calibration data into a number of bins so that each bin has roughly the same number of points. Then, for each bin, the probability of Y = 1 is estimated using the empirical distribution on the calibration data. This estimate forms the new calibrated prediction for that bin. Recently, Gupta and Ramdas (2021) showed that HB satisfies strong distribution-free calibration guarantees, which are otherwise impossible for scaling methods (Gupta et al., 2020).

Despite these results for binary calibration, studies for multiclass calibration have reported that HB typically performs worse than scaling methods such as temperature scaling (TS), vector scaling (VS), and Dirichlet scaling (DS) (Kull et al., 2019; Roelofs et al., 2020; Guo et al., 2017). In our experiments, we find that the issue is not HB but the M2B wrapper used to produce the HB baseline. With the right M2B wrapper, HB beats TS, VS, and DS. A number of calibrators have been proposed recently (Zhang et al., 2020; Rahimi et al., 2020; Patel et al., 2021; Gupta et al., 2021), but VS and DS continue to remain strong baselines which are often close to the best in these papers. We do not compare to each of these calibrators; our focus is on the M2B reduction and the message that the baselines dramatically improve with the right M2B wrapper.

We use three metrics for comparison. The first is top-label-ECE or TL-ECE (defined in (4)), which we argued leads to a more meaningful comparison than conf-ECE. Second, we consider the more stringent maximum-calibration-error (MCE) metric that assesses the worst calibration across predictions (see more details in Appendix E.3). For top-label calibration, MCE is given by

TL-MCE(c, h) := max_{l ∈ [L]} sup_{r ∈ Range(h)} |P(Y = l | c(X) = l, h(X) = r) - r|.

To assess class-wise calibration, we use class-wise-ECE, defined as the average calibration error across classes:

CW-ECE(c, h) := (1/L) Σ_{l=1}^{L} E_X |P(Y = l | h_l(X)) - h_l(X)|.

All ECE/MCE estimation is performed as described in Remark 1. For further details, see Appendix E.2.

Formal algorithm and theoretical guarantees. Top-label-HB (TL-HB) and class-wise-HB (CW-HB) are explicitly stated in Appendices B and C respectively; these are instantiations of the top-label calibrator and the class-wise calibrator with HB. N-HB is the normalized calibrator (Algorithm 4) with HB, which is the same as CW-HB but with an added normalization step. In the Appendix, we extend the binary calibration guarantees of Gupta and Ramdas (2021) to TL-HB and CW-HB (Theorems 1 and 2).
We informally summarize one of the results here: if there are at least k calibration points per bin, then the expected ECE is bounded as E[(TL- or CW-)ECE] ≤ √(1/2k), for TL-HB and CW-HB respectively. The outer expectation is over the calibration data, and corresponds to the randomness in the predictor learnt on the calibration data. Note that the ECE itself is an expected error over an unseen i.i.d. test point (X, Y) ∼ P.

Experimental details. We experimented on the CIFAR-10 and CIFAR-100 datasets, which have 10 and 100 classes respectively. The base models are deep-nets with the following architectures: ResNet-50, ResNet-110, Wide-ResNet-26-10 (WRN) (Zagoruyko and Komodakis, 2016), and DenseNet-121 (Huang et al., 2017). Both CIFAR datasets consist of 60K (60,000) points, which are split as 45K/5K/10K to form the train/validation/test sets. The validation set was used for post-hoc calibration and the test set was used for evaluation through ECE/MCE estimates. Instead of training new models, we used the pre-trained models of Mukhoti et al. (2020), and ask: which post-hoc calibrator improves their calibration the most? We used their Brier score and focal loss models in our experiments (Mukhoti et al. (2020) report that these are the empirically best performing loss functions). All results in the main paper are with Brier score, and results with focal loss are in Appendix E.4. Implementation details for TS, VS, and DS are in Appendix E.

Findings. In Tables 2 and 3, we report the binned ECE and MCE estimates when B = 15 bins are used by HB and for ECE estimation. We make the following observations:

(a) For TL-ECE, N-HB is the best performing method for both CIFAR-10 and CIFAR-100. While most methods perform similarly across architectures for CIFAR-10, there is high variation on CIFAR-100. DS is the worst performing method on CIFAR-100, but TL-HB also performs poorly. We believe that this could be because the data splitting scheme of the TL-calibrator (line 4 of Algorithm 2) splits the dataset across the predicted classes, and some classes in CIFAR-100 occur very rarely. This is further discussed in Appendix E.6.

(b) For TL-MCE, TL-HB is the best performing method on CIFAR-10, by a huge margin. For CIFAR-100, TS or VS perform slightly better than TL-HB. Since HB ensures that each bin gets roughly the same number of points, the predictions are well calibrated across bins, leading to smaller TL-MCE. A similar observation was also made by Gupta and Ramdas (2021).

(c) For CW-ECE, CW-HB is the best performing method across the two datasets and all four architectures. The N-HB method, which has been used in many CW-ECE baseline experiments, performs terribly. In other words, skipping the normalization step leads to a large improvement in CW-ECE. This observation is one of our most striking findings. To shed further light on it, we note that the distribution-free calibration guarantees for CW-HB shown in Appendix C no longer hold post-normalization. Thus, both our theory and experiments indicate that skipping normalization improves CW-ECE performance.

Additional experiments in the Appendix. In Appendix E.5, we report each of the results in Tables 2 and 3 with the number of bins taking every value in the range [5, 25]. Most observations remain the same under this expanded study. In Appendix B.2, we consider top-label calibration for the class-imbalanced COVTYPE-7 dataset, and show that TL-HB adapts to tail/infrequent classes.
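To make the TL-HB procedure evaluated above concrete, the following is a minimal sketch of the top-label calibrator (Algorithm 2) instantiated with histogram binning, in the spirit of Algorithm 8 in Appendix B: the calibration data is split by predicted class, and within each class the confidences are grouped into roughly equal-mass bins of about k points whose empirical accuracy becomes the new prediction. The function names are ours, and details such as the tie-breaking parameter δ and the exact binning routine of Gupta and Ramdas (2021) are simplified away; it also assumes every class predicted at test time was predicted on at least one calibration point.

```python
import numpy as np

def fit_top_label_hb(h, c, y, points_per_bin=50):
    """Top-label histogram binning: one binary HB problem per predicted class (dataset D_l)."""
    recal = {}
    for l in np.unique(c):
        idx = c == l                              # calibration points predicted as class l
        scores, correct = h[idx], (y[idx] == l).astype(float)
        n_bins = max(len(scores) // points_per_bin, 1)   # floor(n_l / k) bins for class l
        # uniform-mass bin edges: each bin receives roughly points_per_bin points
        edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])
        bin_ids = np.searchsorted(edges, scores)
        bin_means = np.array([
            correct[bin_ids == b].mean() if np.any(bin_ids == b) else correct.mean()
            for b in range(n_bins)
        ])
        recal[l] = (edges, bin_means)
    return recal

def predict_top_label_hb(recal, h, c):
    """Recalibrated confidence: look up the bin mean of the class-l calibrator, l = c(x)."""
    out = np.empty_like(h)
    for i, (score, l) in enumerate(zip(h, c)):
        edges, bin_means = recal[l]
        out[i] = bin_means[np.searchsorted(edges, score)]
    return out
```

The predicted label c(x) is left untouched, so accuracy is unchanged; only the reported confidence is replaced by the empirical accuracy of its bin, which is what drives the distribution-free guarantee summarized above.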
5 CONCLUSION We make two contributions to the study of multiclass calibration: (i) defining the new notion of top-label calibration which enforces a natural minimal requirement on a multiclass predictor the probability score for the top class prediction should be calibrated; (ii) developing a multiclass-tobinary (M2B) framework which posits that various notions of multiclass calibration can be achieved via reduction to binary calibration, balancing practical utility with statistically tractability. Since it is important to identify appropriate notions of calibration in any structured output space (Kuleshov et al., 2018; Gneiting et al., 2007), we anticipate that the philosophy behind the M2B framework could find applications in other structured spaces. Published as a conference paper at ICLR 2022 6 REPRODUCIBILITY STATEMENT Some reproducibility desiderata, such as external code and libraries that were used are summarized in Appendix E.1. All code to generate results with the CIFAR datasets is attached in the supplementary material. Our base models were pre-trained deep-net models generated by Mukhoti et al. (2020), obtained from www.robots.ox.ac.uk/ viveka/focal calibration/ (corresponding to brier score and focal loss adaptive 53 at the above link). By avoiding training of new deep-net models with multiple hyperparameters, we also consequently avoided selection biases that inevitably creep in due to test-data-peeking. The predictions of the pre-trained models were obtained using the code at https://github.com/torrvision/focal calibration. 7 ETHICS STATEMENT Post-hoc calibration is a post-processing step that can be applied on top of miscalibrated machine learning models to increase their reliability. As such, we believe our work should improve the transparency and explainability of machine learning models. However, we outline a few limitations. Post-hoc calibration requires keeping aside a fresh, representative dataset, that was not used for training. If this dataset is too small, the resulting calibration guarantee can be too weak to be meaningful in practice. Further, if the test data distribution shifts in significant ways, additional corrections may be needed to recalibrate (Gupta et al., 2020; Podkopaev and Ramdas, 2021). A well calibrated classifier is not necessarily an accurate or a fair one, and vice versa (Kleinberg et al., 2017). Deploying calibrated models in critical applications like medicine, criminal law, banking, etc. does not preclude the possibility of the model being frequently wrong or unfair. ACKNOWLEDGEMENTS This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Towns et al., 2014). Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). CG s research was supported by the generous Bloomberg Data Science Ph.D. Fellowship. CG would like to thank Saurabh Garg and Youngseog Chung for interesting discussions, and Viveka Kulharia for help with the focal calibration repository. Finally, we thank Zack Lipton, the ICLR reviewers, and the ICLR area chair, for excellent feedback that helped improve the writing of the paper. Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131 151, 1999. 
Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Routledge, 2017. Luc Devroye. The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. The Annals of Statistics, 11(3):896 904, 1983. Luc Devroye. Automatic pattern recognition: A study of the probability of error. IEEE Transactions on pattern analysis and machine intelligence, 10(4):530 543, 1988. Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2): 243 268, 2007. Louis Gordon and Richard A Olshen. Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis, 15(2):147 163, 1984. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. Published as a conference paper at ICLR 2022 Chirag Gupta and Aaditya Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, 2021. Chirag Gupta, Aleksandr Podkopaev, and Aaditya Ramdas. Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Advances in Neural Information Processing Systems, 2020. Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. In International Conference on Learning Representations, 2021. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science, 2017. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009. Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, 2018. Meelis Kull, Telmo M. Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain wellcalibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052 5080, 2017. Meelis Kull, Miquel Perello-Nieto, Markus K angsepp, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems, 2019. Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, 2019. Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, 2018. G abor Lugosi and Andrew Nobel. Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24(2):687 706, 1996. 
Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip HS Torr, and Puneet K Dokania. Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems, 2020. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning, 2005. Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. ar Xiv preprint ar Xiv:1904.01685, 2020. Andrew Nobel. Histogram regression estimation using data-dependent partitions. The Annals of Statistics, 24(3):1084 1105, 1996. Kanil Patel, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. Multi-class uncertainty calibration via mutual information maximization-based binning. In International Conference on Learning Representations, 2021. John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61 74. MIT Press, 1999. Published as a conference paper at ICLR 2022 Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, 2021. Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. ar Xiv preprint ar Xiv:2001.11595, 2020. Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Richard Hartley, and Byron Boots. Intra orderpreserving functions for calibration of multi-class neural networks. In Advances in Neural Information Processing Systems, 2020. Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, and Michael C Mozer. Mitigating bias in calibration error estimation. ar Xiv preprint ar Xiv:2012.08668, 2020. J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating Scientific Discovery. Computing in Science & Engineering, 16(5):62 74, 2014. Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B Sch on. Evaluating model calibration in classification. In International Conference on Artificial Intelligence and Statistics, 2019. Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003. David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: a unifying framework. In Advances in Neural Information Processing Systems, 2019. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, 2001. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining, 2002. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016. Jize Zhang, Bhavya Kailkhura, and T Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020. 
Published as a conference paper at ICLR 2022 Algorithm 5: Post-hoc calibrator for a given M2B calibration notion C Input: Base (uncalibrated) multiclass predictor g, calibration data D p X1, Y1q, . . . , p Xn, Ynq, binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X 1 K Ð number of distinct calibration claims that C verifies; 2 for each claim k P r Ks do 3 From g, infer prc, rgq Ð plabel-predictor, probability-predictorq corresponding to claim k; 4 Dk Ð tp Xi, Ziqu, where Zi Ð 1 t Yi rcp Xiqu; 5 if conditioning does not include class prediction rc then 6 (confidence, top-K-confidence, and class-wise calibration) 7 hk Ð At0,1uprg, Dkq; 10 (top-label and top-K-label calibration) 11 for l P r Ls do 12 Dk,l Ð tp Xi, Ziq P Dk : rcp Xiq lqu; 13 hk,l Ð At0,1uprg, Dk,lq; 15 hkp q Ð hk,rcp qp q (hk predicts hk,lpxq if rcpxq l); 18 (the new predictor replaces each rg with the corresponding hk) 19 return plabel-predictor, hkq corresponding to each claim k P r Ks; Input for Algorithms 6 and 7: base multiclass predictor g : X Ñ L 1, calibration data D p X1, Y1q, . . . , p Xn, Ynq, binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X . Algorithm 6: Top-K-label calibrator 1 For every k P r Ks, infer from g the k-th largest class predictor cpkq and the associated probability gpkq; 2 for k Ð 1 to K do 3 for l Ð 1 to L do 4 Dk,l Ð tp Xi, 1 t Yi luq : cpkqp Xiq lu; 5 hpk,lq Ð At0,1upgpkq, Dk,lq; 7 hpkq Ð hpk,cpkqp qqp q; 9 return php1q, hp2q, . . . , hp Kqq; Algorithm 7: Top-K-confidence calibrator 1 For every k P r Ks, infer from g the k-th largest class predictor cpkq and the associated probability gpkq; 2 for k Ð 1 to K do 3 Dk Ð tp Xi, 1 t Yi luq : i P rnsu; 4 hpkq Ð At0,1upgpkq, Dkq; 6 return php1q, hp2q, . . . , hp Kqq; A ADDENDUM TO SECTION 3 CALIBRATION ALGORITHMS FROM CALIBRATION METRICS In Section 3, we introduced the concept of M2B calibration, and showed that popular calibration notions are in fact M2B notions (Table 1). We showed how the calibration notions of top-label, class-wise, and confidence calibration can be achieved using a corresponding M2B calibrator. In the following subsection, we present the general-purpose wrapper Algorithm 5 that can be used to derive an M2B calibrator from any given M2B calibration notion that follows the rubric specified by Table 1. In Appendix A.2, we illustrate the philosophy of M2B calibration using a simple example with a dataset that contains 6 points. This example also illustrates the top-label-calibrator, the classwise-calibrator, and the confidence-calibrator. Published as a conference paper at ICLR 2022 (a) Predictions of a fixed base model g : X Ñ 3 on calibration/test data D tpa, 3q, pb, 4q, . . . , pf, 1qu. (b) Confidence calibration (c) Top-label calibration (d) Class-wise calibration (e) Canonical calibration Figure 3: Illustrative example for Section A.2. The numbers in plot (a) correspond to the predictions made by g on a dataset D. If D were a test set, plots (b e) show how it should be used to verify if g satisfies the corresponding notion of calibration. Consequently, we argue that if D were a calibration set, and we want to achieve one of the notions (b e), then the data shown in the corresponding plots should be the data used to calibrate g as well. A.1 GENERAL-PURPOSE M2B CALIBRATOR Denote some M2B notion of calibration as C. Suppose C corresponds to K binary calibration claims. The outer for-loop in Algorithm 5, runs over each such claim in C. 
For example, for class-wise calibration, K L and for confidence and top-label calibration, K 1. Corresponding to each claim, there is a probability-predictor that the conditioning is to be done on, such as g or gl or gpkq. Additionally, there may be conditioning on the label predictor such as c or cpkq. These are denoted as prc, rgq in Algorithm 5. For confidence and top-label calibration, rc c, the top-label-confidence. For class-wise calibration, when rg gl, we have rcp q l. If there is no label conditioning in the calibration notion, such as in confidence, top-K-confidence, and class-wise calibration, then we enter the if-condition inside the for-loop. Here hk is learnt using a single calibration dataset and a single call to At0,1u. Otherwise, if there is label conditioning, such as in top-label and top-K-label calibration, we enter the else-condition, where we learn a separate hk,l for every l P r Ls, using a different part of the dataset Dl in each case. Then hkpxq equals hk,lpxq if rcpxq l. Finally, since C is verifying a sequence of claims, the output of Algorithm 5 is a sequence of predictors. Each original prediction prc, rgq corresponding to the C is replaced with prc, hkq. This is the output of the M2B calibrator. Note that the rc values are not changed. This output appears abstract, but normally, it can be represented in an interpretable way. For example, for class-wise calibration, the output is just a sequence of predictors, one for each class: ph1, h2, . . . , h Lq. This general-purpose M2B calibrators can be used to achieve any M2B calibration notion: toplabel calibration (Algorithm 2), class-wise calibration (Algorithm 3), confidence calibration (Algorithm 1), top-K-label calibration (Algorithm 6), and top-K-confidence calibration (Algorithm 7). A.2 AN EXAMPLE TO ILLUSTRATE THE PHILOSOPHY OF M2B CALIBRATION Figure 3a shows the predictions of a given base model g on a given dataset D. Suppose D is a test set, and we are testing confidence calibration. Then the only predictions that matter are the top-predictions corresponding to the shaded values. These are stripped out and shown in Figure 3b, in the gp q row. Note that the indicator 1 t Y cp qu is sufficient to test confidence calibration and given this, the cp Xq are not needed. Thus the second row in Figure 3b only shows these indicators. Published as a conference paper at ICLR 2022 Algorithm 8: Top-label histogram binning Input: Base multiclass predictor g, calibration data D p X1, Y1q, . . . , p Xn, Ynq Hyperparameter: # points per bin k P N (say 50), tie-breaking parameter δ ą 0 (say 10 10) Output: Top-label calibrated predictor pc, hq 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 for l Ð 1 to L do 4 Dl Ð tp Xi, 1 t Yi luq : cp Xiq lqu and nl Ð |Dl|; 5 hl Ð Binary-histogram-binningpg, Dl, tnl{ku , δq; 7 hp q Ð hcp qp q; 8 return pc, hq; Verifying top-label calibration is similar (Figure 3c), but in addition to the predictions gp q, we also retain the values of cp q. Thus the gp q and 1 t Y cp qu are shown, but split across the 4 classes. Class-wise calibration requires access to all the predictions, however, each class is considered separately as indicated by Figure 3d. Canonical calibration looks at the full prediction vector in each case. However, in doing so, it becomes unlikely that gpxq gpyq for any x, y since the number of values that g can take is now exponential. Let us turn this around and suppose that D were a calibration set instead of a test set. 
We argue that D should be used in the same way, whether testing or calibrating. Thus, if confidence calibration is to be achieved, we should focus on the pg, 1 t Y cp quq corresponding to g. If top-label calibration is to be achieved, we should use the pc, gq values. If class-wise calibration is to be achieved, we should look at each gl separately and solve L different problems. Finally, for canonical calibration, we must look at the entire g vector as a single unit. This is the core philosophy behind M2B calibrators: if binary claims are being verified, solve binary calibration problems. B DISTRIBUTION-FREE TOP-LABEL CALIBRATION USING HISTOGRAM BINNING In this section, we formally describe histogram binning (HB) with the top-label-calibrator (Algorithm 2) and provide methodological insights through theory and experiments. B.1 FORMAL ALGORITHM AND THEORETICAL GUARANTEES Algorithm 8 describes the top-label calibrator formally using HB as the binary calibration algorithm. The function called in line 5 is Algorithm 2 of Gupta and Ramdas (2021). The first argument in the call is the top-label confidence predictor, the second argument is the dataset to be used, the third argument is the number of bins to be used, and the fourth argument is a tie-breaking parameter (described shortly). While previous empirical works on HB fixed the number of bins per class, the analysis of Gupta and Ramdas (2021) suggests that a more principled way of choosing the number of bins is to fix the number of points per bin. This is parameter k of Algorithm 8. Given k, the number of bins is decided separately for every class as tnl{ku where nl is the number of points predicted as class l. This choice is particularly relevant for top-label calibration since nl can be highly non-uniform (we illustrate this empirically in Section B.2). The tie-breaking parameter δ can be arbitrarily small (like 10 10), and its significance is mostly theoretical it is used to ensure that outputs of different bins are not exactly identical by chance, so that conditioning on a calibrated probability output is equivalent to conditioning on a bin; this leads to a cleaner theoretical guarantee. HB recalibrates g to a piecewise constant function h that takes one value per bin. Consider a specific bin b; the h value for this bin is computed as the average of the indicators t1 t Yi cp Xiqu : Xi P Bin bu. This is an estimate of the bias of the bin Pp Y cp Xq | X P Bin bq. A concentration inequality can then be used to bound the deviation between the estimate and the true bias to prove distribution-free calibration guarantees. In the forthcoming Theorem 1, we show high-probability and in-expectation bounds on the the TL-ECE of HB. Additionally, we show marginal and condi- Published as a conference paper at ICLR 2022 tional top-label calibration bounds, defined next. These notions were proposed in the binary calibration setting by Gupta et al. (2020) and Gupta and Ramdas (2021). In the definition below, A refers to any algorithm that takes as input calibration data D and an initial classifier g to produce a top-label predictor c and an associated probability map h. Algorithm 8 is an example of A. Definition 1 (Marginal and conditional top-label calibration). Let ε, α P p0, 1q be some given levels of approximation and failure respectively. An algorithm A : pg, Dq ÞÑ pc, hq is (a) pε, αq-marginally top-label calibrated if for every distribution P over X ˆ r Ls, P |Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε ě 1 α. 
(8) (b) (ε, α)-conditionally top-label calibrated if for every distribution P over X × [L], P(∀ l ∈ [L], r ∈ Range(h), |P(Y = c(X) | c(X) = l, h(X) = r) − r| ≤ ε) ≥ 1 − α. (9) To clarify, all probabilities are taken over the test point (X, Y) ∼ P, the calibration data D ∼ P^n, and any other inherent algorithmic randomness in A; these are all implicit in (c, h) = A(D, g). Marginal calibration asserts that with high probability, on average over the distribution of D, X, P(Y = c(X) | c(X), h(X)) is at most ε away from h(X). In comparison, TL-ECE is the average of these deviations over X. Marginal calibration may be a more appropriate metric for calibration than TL-ECE if we are somewhat agnostic to probabilistic errors less than some fixed threshold ε (like 0.05). Conditional calibration is a strictly stronger definition that requires the deviation to be at most ε for every possible prediction (l, r), including rare ones, not just on average over predictions. This may be relevant in medical settings where we want the prediction on every patient to be reasonably calibrated. Algorithm 8 satisfies the following calibration guarantees.

Theorem 1. Fix hyperparameters δ > 0 (arbitrarily small) and points per bin k ≥ 2, and assume nl ≥ k for every l ∈ [L]. Then, for any α ∈ (0, 1), Algorithm 8 is (ε1, α)-marginally and (ε2, α)-conditionally top-label calibrated for ε1 = √(log(2/α)/(2(k − 1))) + δ, and ε2 = √(log(2n/(kα))/(2(k − 1))) + δ. (10) Further, for any distribution P over X × [L], we have P(TL-ECE(c, h) ≤ ε2) ≥ 1 − α, and E[TL-ECE(c, h)] ≤ √(1/(2k)) + δ.

The proof in Appendix H is a multiclass top-label adaptation of the guarantee in the binary setting by Gupta and Ramdas (2021). The Õ(1/√k) dependence of the bound relies on Algorithm 8 delegating at least k points to every bin. Since δ can be chosen to be arbitrarily small, setting k = 50 gives roughly E_D[TL-ECE(h)] ≤ 0.1. Based on this, we suggest setting k ∈ [50, 150] in practice.

B.2 TOP-LABEL HISTOGRAM BINNING ADAPTS TO CLASS IMBALANCED DATASETS

The principled methodology of fixing the number of points per bin reaps practical benefits. Figure 4 illustrates this through the performance of HB on the class-imbalanced COVTYPE-7 dataset (Blackard and Dean, 1999), with class ratio approximately 36% for class 1 and 49% for class 2. The entire dataset has 581,012 points, which is divided into train-test in the ratio 70:30. Then, 10% of the training points are held out for calibration (n = |D| = 40,671). The base classifier is a random forest (RF) trained on the remaining training points (it achieves around 95% test accuracy). The RF is then recalibrated using HB. The top-label reliability diagrams in Figure 4a illustrate that the original RF (in orange) is underconfident on both the most likely and least likely classes. Additional figures in Appendix F show that the RF is always underconfident no matter which class is predicted as the top-label. HB (in green) recalibrates the RF effectively across all classes. Validity plots (Gupta and Ramdas, 2021) estimate how the LHS of condition (8), denoted as V(ε), varies with ε. We observe that for all ε, V(ε) is higher for HB. The rightmost barplot compares the estimated TL-ECE for all classes, and also shows the class proportions.
While the original RF is significantly miscalibrated for the less likely classes, HB has a more uniform miscalibration across classes. Figure 4b considers a slightly different HB algorithm where the number of points per class is not adapted to the number of times the class is predicted, but is fixed beforehand (this corresponds to replacing ⌊nl/k⌋ in line 5 of Algorithm 8 with a fixed B ∈ N). While even in this setting there is a drop in the TL-ECE compared to the RF model, the final profile is less uniform compared to fixing the number of points per bin. The validity plots and top-label reliability diagrams for all the 7 classes are reported in Figure 9 in Appendix F, along with some additional observations.

Figure 4: Recalibration of a random forest using histogram binning on the class-imbalanced COVTYPE-7 dataset (class 2 is roughly 100 times likelier than class 4). Panels show top-label reliability diagrams (predicted vs. true probability) and validity plots for classes 2 and 4 (random forest vs. histogram binning), together with a barplot of top-label ECE and class ratio per class. (a) Top-label histogram binning (Algorithm 8) with k = 100 points per bin. Class 4 has only 183 calibration points. Algorithm 8 adapts and uses only a single bin to ensure that the TL-ECE on class 4 is comparable to the TL-ECE on class 2. Overall, the random forest classifier has significantly higher TL-ECE for the least likely classes (4, 5, and 6), but the post-calibration TL-ECE using binning is quite uniform. (b) Histogram binning with B = 50 bins for every class. Compared to Figure 4a, the post-calibration TL-ECE for the most likely classes decreases while the TL-ECE for the least likely classes increases. By ensuring a fixed number of calibration points per bin, Algorithm 8 obtains relatively uniform top-label calibration across classes (Figure 4a). In comparison, if a fixed number of bins is chosen for all classes, the performance deteriorates for the least likely classes (Figure 4b).

C DISTRIBUTION-FREE CLASS-WISE CALIBRATION USING HISTOGRAM BINNING

In this section, we formally describe histogram binning (HB) with the class-wise-calibrator (Algorithm 3) and provide theoretical guarantees for it. The overall procedure is called class-wise-HB. Further details and background on HB are contained in Appendix B, where top-label-HB is described.

C.1 FORMAL ALGORITHM

To achieve class-wise calibration using binary routines, we learn each component function hl in a one-vs-all fashion as described in Algorithm 3. Algorithm 9 contains the pseudocode with the underlying routine as binary HB. To learn hl, we use a dataset Dl which, unlike in top-label HB (Algorithm 8), contains Xi even if c(Xi) ≠ l. However, the Yi is replaced with 1{Yi = l}. The number of points per bin kl can be different for different classes, but generally one would set k1 = . . . = kL = k ∈ N. Larger values of kl will lead to smaller εl and δl in the guarantees, at a loss of sharpness since the number of bins ⌊n/kl⌋ would be smaller.
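Both Algorithm 8 (Appendix B) and Algorithm 9, stated formally next, reduce calibration to calls to a binary histogram-binning routine. The following is a minimal numpy sketch of such a routine under the uniform-mass binning described in Appendix B.1; the function name, variable names, and the exact way the tie-breaking parameter δ perturbs the bin outputs are our own illustrative choices, not the interface of the released code.

```python
import numpy as np

def binary_histogram_binning(scores, labels, num_bins, delta=1e-10, seed=0):
    """Uniform-mass binary histogram binning (sketch).

    scores: confidence scores g(X_i) in [0, 1] on the calibration data
    labels: binary indicators (1 if the claim being calibrated holds for X_i)
    Returns a function mapping new scores to recalibrated probabilities.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    num_bins = max(1, min(num_bins, len(scores)))

    # Sort by score and split into num_bins bins with (roughly) the same
    # number of calibration points per bin (uniform-mass binning).
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    bins = np.array_split(np.arange(len(scores)), num_bins)

    # The output for a bin is the average of the indicators in that bin,
    # perturbed by a tiny delta-scaled noise so that distinct bins almost
    # surely output distinct values (the role of delta in the text).
    upper_edges = np.array([scores[idx[-1]] for idx in bins])
    bin_outputs = np.array([labels[idx].mean() + delta * rng.uniform()
                            for idx in bins])

    def h(new_scores):
        # Assign each new score to the first bin whose upper edge covers it.
        b = np.searchsorted(upper_edges, np.asarray(new_scores, dtype=float))
        return bin_outputs[np.clip(b, 0, num_bins - 1)]

    return h
```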
Algorithm 9: Class-wise histogram binning
Input: Base multiclass predictor g : X → ∆^{L−1}, calibration data D = (X1, Y1), . . . , (Xn, Yn)
Hyperparameters: # points per bin k1, k2, . . . , kL ∈ N^L (say each kl = 50), tie-breaking parameter δ > 0 (say 10^−10)
Output: L class-wise calibrated predictors h1, h2, . . . , hL
1 for l ← 1 to L do
2     Dl ← {(Xi, 1{Yi = l}) : i ∈ [n]};
3     hl ← Binary-histogram-binning(gl, Dl, ⌊n/kl⌋, δ);
4 end
5 return (h1, h2, . . . , hL);

C.2 CALIBRATION GUARANTEES

A general algorithm A for class-wise calibration takes as input calibration data D and an initial classifier g to produce an approximately class-wise calibrated predictor h : X → [0, 1]^L. Define the notation ε = (ε1, ε2, . . . , εL) ∈ (0, 1)^L and α = (α1, α2, . . . , αL) ∈ (0, 1)^L.

Definition 2 (Marginal and conditional class-wise calibration). Let ε, α ∈ (0, 1)^L be some given levels of approximation and failure respectively. An algorithm A : (g, D) ↦ h is (a) (ε, α)-marginally class-wise calibrated if for every distribution P over X × [L] and for every l ∈ [L], P(|P(Y = l | hl(X)) − hl(X)| ≤ εl) ≥ 1 − αl. (11) (b) (ε, α)-conditionally class-wise calibrated if for every distribution P over X × [L] and for every l ∈ [L], P(∀r ∈ Range(hl), |P(Y = l | hl(X) = r) − r| ≤ εl) ≥ 1 − αl. (12)

Definition 2 requires that each hl is (εl, αl)-calibrated in the binary senses defined by Gupta et al. (2021, Definitions 1 and 2). From Definition 2, we can also derive uniform bounds that hold simultaneously over every l ∈ [L]. Let α = Σ_{l=1}^{L} αl and ε = max_{l ∈ [L]} εl. Then (11) implies P(∀l ∈ [L], |P(Y = l | hl(X)) − hl(X)| ≤ ε) ≥ 1 − α, (13) and (12) implies P(∀l ∈ [L], r ∈ Range(hl), |P(Y = l | hl(X) = r) − r| ≤ ε) ≥ 1 − α. (14) The choice of not including the uniformity over L in Definition 2 reveals the nature of our class-wise HB algorithm and the upcoming theoretical guarantees: (a) we learn the hl's separately for each l and do not combine the learnt functions in any way (such as normalization), (b) we do not combine the calibration inequalities for different l ∈ [L] in any way other than a union bound. Thus the only way we can show (13) (or (14)) is by using a union bound over (11) (or (12)). We now state the distribution-free calibration guarantees satisfied by Algorithm 9.

Theorem 2. Fix hyperparameters δ > 0 (arbitrarily small) and points per bin k1, k2, . . . , kL ≥ 2, and assume n ≥ kl for every l ∈ [L]. Then, for every l ∈ [L], for any αl ∈ (0, 1), Algorithm 9 is (ε^(1), α)-marginally and (ε^(2), α)-conditionally class-wise calibrated with ε^(1)_l = √(log(2/αl)/(2(kl − 1))) + δ, and ε^(2)_l = √(log(2n/(kl αl))/(2(kl − 1))) + δ. (15) Further, for any distribution P over X × [L], (a) P(CW-ECE(c, h) ≤ max_{l ∈ [L]} ε^(2)_l) ≥ 1 − Σ_{l ∈ [L]} αl, and (b) E[CW-ECE(c, h)] ≤ max_{l ∈ [L]} √(1/(2kl)) + δ.

Theorem 2 is proved in Appendix H. The proof follows by using the result of Gupta and Ramdas (2021, Theorem 2), derived in the binary calibration setting, for each hl separately. Gupta and Ramdas (2021) proved a more general result for general ℓp-ECE bounds. Similar results can also be derived for the suitably defined ℓp-CW-ECE. As discussed in Section 3.2, unlike previous works (Zadrozny and Elkan, 2002; Guo et al., 2017; Kull et al., 2019), Algorithm 9 does not normalize the hl's. We do not know how to derive Theorem 2 style results for a normalized version of Algorithm 9.

D FIGURES FOR APPENDIX E

Appendix E begins on page 23. The relevant figures for Appendix E are displayed on the following pages.
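Before the Appendix E figures, the following sketch ties the two M2B procedures together: it shows how top-label HB (Algorithm 8) and class-wise HB (Algorithm 9) wrap the binary_histogram_binning routine sketched above. The function names and conventions (classes labelled 1 to L; no handling of classes that are never predicted on the calibration data) are our own illustrative choices, not the released implementation.

```python
import numpy as np

def top_label_hb(g, c, X_cal, Y_cal, points_per_bin=50, delta=1e-10):
    """Sketch of Algorithm 8: one binary HB problem per predicted class.

    g(x) -> top-label confidence in [0, 1]; c(x) -> predicted class in {1, ..., L}.
    """
    preds = np.asarray([c(x) for x in X_cal])
    confs = np.asarray([g(x) for x in X_cal], dtype=float)
    correct = (np.asarray(Y_cal) == preds).astype(float)   # 1{Y_i = c(X_i)}
    h_per_class = {}
    for l in np.unique(preds):
        mask = preds == l                    # D_l: points predicted as class l
        n_l = int(mask.sum())
        h_per_class[l] = binary_histogram_binning(
            confs[mask], correct[mask], n_l // points_per_bin, delta)

    def h(x):
        # The predicted top label is unchanged; only its confidence is remapped.
        return float(h_per_class[c(x)]([g(x)])[0])

    return c, h

def class_wise_hb(g_vec, X_cal, Y_cal, points_per_bin=50, delta=1e-10):
    """Sketch of Algorithm 9: one binary HB problem per class, using all points.

    g_vec(x) -> length-L probability vector; the returned h_l's are not normalized.
    """
    G = np.asarray([g_vec(x) for x in X_cal], dtype=float)   # n x L matrix
    Y = np.asarray(Y_cal)
    n, L = G.shape
    return [binary_histogram_binning(G[:, l], (Y == l + 1).astype(float),
                                     n // points_per_bin, delta)
            for l in range(L)]
```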
Figure 5: Table 2 style results with the number of bins varied as B ∈ [5, 25]. Each panel plots the estimated ECE against the number of bins for ResNet-50, ResNet-110, Wide-ResNet-26-10, and DenseNet-121. (a) TL-ECE estimates on CIFAR-10 with Brier score: TL-HB is close to the best in each case; while CW-HB performs the best at B = 15, the ECE estimate may not be reliable since it is highly variable across bins. (b) TL-ECE estimates on CIFAR-100 with Brier score: N-HB is the best performing method, while DS is the worst performing method, across different numbers of bins; TL-HB performs worse than TS and VS. (c) TL-MCE estimates on CIFAR-10 with Brier score: the only reliably and consistently well-performing method is TL-HB. (d) TL-MCE estimates on CIFAR-100 with Brier score: DS is the worst performing method; other methods perform comparably across different values of B. See Appendix E.5 for further details. The captions summarize the findings in each case. In most cases, the findings are similar to those with B = 15. The notable exception is that the performance of N-HB on CIFAR-10 for TL-ECE, while very good at B = 15, is quite inconsistent when seen across different bins. In some cases, the blue base model line and the orange temperature scaling line coincide. This occurs since the optimal temperature on the calibration data was learnt to be T = 1, which corresponds to not changing the base model at all.

Figure 6, panels (a)-(c) (the full caption follows panel (d) below); each panel plots the estimated ECE against the number of bins for the same four architectures. (a) TL-ECE estimates on CIFAR-10 with focal loss: TL-HB is close to the best in each case; while CW-HB performs the best at B = 15, the ECE estimate may not be reliable since it is highly variable across bins. (b) TL-ECE estimates on CIFAR-100 with focal loss: N-HB is the best performing method, while DS is the worst performing method, across different numbers of bins; TL-HB performs worse than TS and VS. (c) TL-MCE estimates on CIFAR-10 with focal loss: the only reliably and consistently well-performing method is TL-HB.
(d) TL-MCE estimates on CIFAR-100 with focal loss: DS is the worst performing method; other methods perform comparably across different values of B. Figure 6: Table 4 style results with the number of bins varied as B ∈ [5, 25]. See Appendix E.5 for further details. The captions summarize the findings in each case. In most cases, the findings are similar to those with B = 15. In some cases, the blue base model line and the orange temperature scaling line coincide. This occurs since the optimal temperature on the calibration data was learnt to be T = 1, which corresponds to not changing the base model at all.

Figure 7: Table 3 style results with the number of bins varied as B ∈ [5, 25]. Each panel plots the estimated ECE against the number of bins for ResNet-50, ResNet-110, Wide-ResNet-26-10, and DenseNet-121. (a) CW-ECE estimates on CIFAR-10 with Brier score: CW-HB is the best performing method across bins, and N-HB is quite unreliable. (b) CW-ECE estimates on CIFAR-100 with Brier score: CW-HB is the best performing method; DS and N-HB are the worst performing methods. The captions summarize the findings in each case, which are consistent with those in the table. See Appendix E.5 for further details.

Figure 8: Table 5 style results with the number of bins varied as B ∈ [5, 25]. (a) CW-ECE estimates on CIFAR-10 with focal loss: CW-HB is the best performing method across bins, and N-HB is quite unreliable. (b) CW-ECE estimates on CIFAR-100 with focal loss: CW-HB is the best performing method; DS and N-HB are the worst performing methods. The captions summarize the findings in each case, which are consistent with those in the table. See Appendix E.5 for further details.

E ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS FOR CIFAR-10 AND CIFAR-100

We present additional details and results to supplement the experiments with CIFAR-10 and CIFAR-100 in Sections 2 and 4 of the main paper.

E.1 EXTERNAL LIBRARIES USED

All our base models were pre-trained deep-net models generated by Mukhoti et al. (2020), obtained from www.robots.ox.ac.uk/~viveka/focal_calibration/ and used along with the code at https://github.com/torrvision/focal_calibration to obtain base predictions. We focused on the models trained with Brier score and focal loss, since these were found to perform the best for calibration. All reports in the main paper are with the Brier score; in Appendix E.4, we report corresponding results with focal loss.
We also used the code at https://github.com/torrvision/focal_calibration for temperature scaling (TS). For vector scaling (VS) and Dirichlet scaling (DS), we used the code of Kull et al. (2019), hosted at https://github.com/dirichletcal/dirichlet_python. For VS, we used the file dirichletcal/calib/vectorscaling.py, and for DS, we used the file dirichletcal/calib/fulldirichlet.py. No hyperparameter tuning was performed in any of our histogram binning experiments or baseline experiments; default settings were used in every case. The random seed was fixed so that every run of the experiment gives the same result. In particular, by relying on pre-trained models, we avoid training new deep-net models with multiple hyperparameters, thus avoiding any selection biases that may arise due to test-data peeking across multiple settings.

E.2 FURTHER COMMENTS ON BINNING FOR ECE ESTIMATION

As mentioned in Remark 1, ECE estimates for all methods except TL-HB and CW-HB were computed using fixed-width bins [0, 1/B), [1/B, 2/B), . . . , [1 − 1/B, 1] for various values of B ∈ [5, 25]. For TL-HB and CW-HB, B is the number of bins used for each call to binary HB. For TL-HB, note that we actually proposed that the number of points per bin (rather than the number of bins per class) should be fixed; see Section B.2. However, for ease of comparison to other methods, we simply set the number of bins to B for each call to binary HB. That is, in line 5 of Algorithm 8, we replace ⌊nl/k⌋ with B. For CW-HB, we described Algorithm 9 with different values of kl corresponding to the number of points per bin for each class. For the CIFAR-10 and CIFAR-100 comparisons, we set each k1 = k2 = . . . = kL = k, where k ∈ N satisfies ⌊n/k⌋ = B. Tables 2, 3, 4, and 5 report estimates with B = 15, which has been commonly used in many works (Guo et al., 2017; Kull et al., 2019; Mukhoti et al., 2020). Corresponding to each table, we have a figure where ECE estimates with varying B are reported to strengthen conclusions: these are Figures 5, 7, 6, and 8 respectively. Plug-in estimates of the ECE were used, same as Guo et al. (2017). Further binning was not done for TL-HB and CW-HB since the output is already discrete and sufficiently many points take each of the predicted values. Note that due to Jensen's inequality, any further binning will only decrease the ECE estimate (Kumar et al., 2019). Thus, using unbinned estimates may give TL-HB and CW-HB a disadvantage.

E.3 SOME REMARKS ON MAXIMUM-CALIBRATION-ERROR (MCE)

Guo et al. (2017) defined MCE with respect to confidence calibration, as follows: conf-MCE(c, h) := sup_{r ∈ Range(h)} |P(Y = c(X) | h(X) = r) − r|. (16) Conf-MCE suffers from the same issue illustrated in Figure 2 for conf-ECE. In Figure 1b, we looked at the reliability diagram within two bins. These indicate two of the values over which the supremum is taken in equation (16): these are the Y-axis distances between the markers and the X = Y line for bins 6 and 10 (both are less than 0.02). On the other hand, the effective maximum miscalibration for bin 6 is roughly 0.15 (for class 1) and roughly 0.045 (for class 4), and the maximum should be taken with respect to these values across all bins.
To remedy the underestimation of the effective MCE, we can consider the top-label-MCE, defined as TL-MCE(c, h) := max_{l ∈ [L]} sup_{r ∈ Range(h)} |P(Y = l | c(X) = l, h(X) = r) − r|. (17) Interpreted in words, the TL-MCE assesses the maximum deviation between the predicted and true probabilities across all predictions and all classes. Following the same argument as in the proof of Proposition 4, it can be shown that for any c, h, conf-MCE(c, h) ≤ TL-MCE(c, h). The TL-MCE is closely related to conditional top-label calibration (Definition 1b). Clearly, an algorithm is (ε, α)-conditionally top-label calibrated if and only if for every distribution P, P(TL-MCE(c, h) ≤ ε) ≥ 1 − α. Thus the conditional top-label calibration guarantee of Theorem 1 implies a high-probability bound on the TL-MCE as well.

Table 4: Top-label-ECE and top-label-MCE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2. The best performing method in each row is in bold.

Metric          Dataset     Architecture   Base    TS      VS      DS      N-HB    TL-HB
Top-label ECE   CIFAR-10    ResNet-50      0.022   0.023   0.018   0.019   0.023   0.019
                            ResNet-110     0.025   0.024   0.022   0.021   0.020   0.020
                            WRN-26-10      0.024   0.019   0.016   0.017   0.019   0.018
                            DenseNet-121   0.023   0.023   0.021   0.021   0.025   0.021
                CIFAR-100   ResNet-50      0.109   0.107   0.107   0.332   0.086   0.148
                            ResNet-110     0.124   0.117   0.105   0.316   0.115   0.153
                            WRN-26-10      0.100   0.100   0.101   0.293   0.074   0.135
                            DenseNet-121   0.106   0.108   0.105   0.312   0.091   0.147
Top-label MCE   CIFAR-10    ResNet-50      0.298   0.443   0.368   0.472   0.325   0.082
                            ResNet-110     0.378   0.293   0.750   0.736   0.535   0.089
                            WRN-26-10      0.741   0.582   0.311   0.363   0.344   0.075
                            DenseNet-121   0.411   0.411   0.243   0.391   0.301   0.099
                CIFAR-100   ResNet-50      0.289   0.355   0.234   0.640   0.322   0.273
                            ResNet-110     0.293   0.265   0.274   0.633   0.366   0.272
                            WRN-26-10      0.251   0.227   0.256   0.663   0.229   0.270
                            DenseNet-121   0.237   0.225   0.239   0.597   0.327   0.248

E.4 TABLE 2 AND 3 STYLE RESULTS WITH FOCAL LOSS

Results for top-label-ECE and top-label-MCE with the base deep-net model trained using focal loss are reported in Table 4. Corresponding results for class-wise-ECE are reported in Table 5. The observations are similar to the ones reported for Brier score:
1. For TL-ECE, TL-HB is either the best or close to the best performing method on CIFAR-10, but suffers on CIFAR-100. This phenomenon is discussed further in Appendix E.6. N-HB is the best or close to the best for both CIFAR-10 and CIFAR-100.
2. For TL-MCE, TL-HB is the best performing method on CIFAR-10, by a huge margin. For CIFAR-100, TS or VS perform better than TL-HB, but not by a huge margin.
3. For CW-ECE, CW-HB is the best performing method across the two datasets and all four architectures.

E.5 ECE AND MCE ESTIMATES WITH VARYING NUMBER OF BINS

Corresponding to each entry in Tables 2 and 4, we perform an ablation study with the number of bins varying as B ∈ [5, 25]. This is in keeping with the findings of Roelofs et al. (2020) that the ECE/MCE estimate, as well as the relative performance of the various models, can vary with the number of bins. The results are reported in Figure 5 (ablation of Table 2) and Figure 7 (ablation of Table 3). The captions of these figures contain further details on the findings. Most findings are similar to those in the main paper, but the findings in the tables are strengthened through this ablation. The same ablations are performed for focal loss as well.
The results are reported in Figure 6 (ablation of Table 4) and Figure 8 (ablation of Table 5). The captions of these figures contain further details on the findings. The ablation results in the figures support those in the tables.

Table 5: Class-wise-ECE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2, except top-label-HB is replaced with class-wise-HB or Algorithm 3 (CW-HB). The best performing method in each row is in bold.

Metric                 Dataset     Architecture   Base   TS     VS     DS     N-HB   CW-HB
Class-wise ECE (×10²)  CIFAR-10    ResNet-50      0.42   0.42   0.35   0.37   0.52   0.35
                                   ResNet-110     0.48   0.44   0.36   0.35   0.51   0.29
                                   WRN-26-10      0.41   0.31   0.31   0.35   0.49   0.27
                                   DenseNet-121   0.41   0.41   0.40   0.39   0.63   0.30
                       CIFAR-100   ResNet-50      0.22   0.20   0.20   0.66   0.23   0.16
                                   ResNet-110     0.24   0.23   0.21   0.72   0.24   0.16
                                   WRN-26-10      0.19   0.19   0.18   0.61   0.20   0.14
                                   DenseNet-121   0.20   0.21   0.19   0.66   0.24   0.16

E.6 ANALYZING THE POOR PERFORMANCE OF TL-HB ON CIFAR-100

CIFAR-100 is an imbalanced dataset with 100 classes and 5000 points for validation/calibration (as per the default splits). Due to random subsampling, the validation split we used had one of the classes predicted as the top-label only 31 times. Thus, based on Theorem 1, we do not expect HB to have small TL-ECE. This is confirmed by the empirical results presented in Tables 2/4 and Figures 5b/6b. We observe that HB has higher estimated TL-ECE than all methods except DS, for most values of the number of bins. The performance of TL-HB for TL-MCE, however, is much closer to the other methods, since HB uses the same number of points per bin, ensuring that the predictions are somewhat equally calibrated across bins (Figures 5d/6d). In comparison, for CW-ECE, CW-HB is the best performing method. This is because in the class-wise setting, 5000 points are available for recalibration irrespective of the class, which is sufficient for HB. The deterioration in the performance of HB when few calibration points are available was also observed in the binary setting by Gupta and Ramdas (2021, Appendix C). Niculescu-Mizil and Caruana (2005) noted in the conclusion of their paper that Platt scaling (Platt, 1999), which is closely related to TS, performs well when the available data is small, but another nonparametric binning method, isotonic regression (Zadrozny and Elkan, 2002), performs better when enough data is available. Kull et al. (2019, Section 4.1) compared HB to other calibration techniques for class-wise calibration on 21 UCI datasets, and found that HB performs the worst. On inspecting the UCI repository, we found that most of the datasets they used had fewer than 5000 (total) data points, and many contain fewer than 500. Overall, comparing our results to previous empirical studies, we believe that if sufficiently many points are available for recalibration, or the number of classes is small, then HB performs quite well. To be more precise, we expect HB to be competitive if at least 200 points per class can be held out for recalibration, and the number of points per bin is at least k ≥ 20.

F ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS FOR COVTYPE-7

We present additional details and results for the top-label HB experiment of Section B.2. The base classifier is an RF learnt using sklearn.ensemble.RandomForestClassifier with default parameters. The base RF is a nearly continuous base model since most of its predictions are unique. Thus, we need to use binning to make reliability diagrams and validity plots, and to perform ECE estimation, for the base model.
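For reference, the plug-in TL-ECE estimate discussed in Appendix E.2 can be sketched as follows: within each predicted class, bin the reported confidences, and average |empirical accuracy − average confidence| over all (class, bin) cells, weighted by cell frequency. This is a minimal illustration with our own names and the fixed-width-bin convention; for the base RF below, the bins are instead induced by the unique outputs of top-label HB, as described next.

```python
import numpy as np

def plugin_tl_ece(confs, preds, labels, num_bins=15):
    """Plug-in estimate of the top-label ECE (sketch).

    confs: reported top-label confidences h(X_i); preds: predicted labels c(X_i);
    labels: true labels Y_i. Fixed-width bins [0, 1/B), ..., [1 - 1/B, 1] are
    formed within each predicted class.
    """
    confs, preds, labels = map(np.asarray, (confs, preds, labels))
    n = len(confs)
    ece = 0.0
    for l in np.unique(preds):
        m = preds == l
        bin_ids = np.clip((confs[m] * num_bins).astype(int), 0, num_bins - 1)
        for b in range(num_bins):
            in_bin = bin_ids == b
            if not in_bin.any():
                continue
            acc = (labels[m][in_bin] == l).mean()   # empirical P(Y = l | cell)
            conf = confs[m][in_bin].mean()          # average reported confidence
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```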
To have a fair comparison, instead of having a fixed binning scheme to assess the base model, the binning scheme was decided based on the unique predictions of top-label HB. Thus for every l and r ∈ Range(hl), the bins are defined as {x : c(x) = l, hl(x) = r}. Due to this, while the base model in Figures 4a and 4b is the same, the reliability diagrams and validity plots in orange are different. As can be seen in the bar plots in Figure 4, the ECE estimation is not affected significantly. When k = 100, the total number of bins chosen by Algorithm 8 was 403, which is roughly 57.6 bins per class. The choice of B = 50 for the fixed-bins-per-class experiment was made on this basis.

Figure 9: Top-label histogram binning (HB) calibrates a miscalibrated random forest on the class-imbalanced COVTYPE-7 dataset. Each panel shows, for every class 1-7, a reliability diagram (predicted vs. true probability) and a validity plot. (a) Top-label HB with k = 100 points per bin. (b) Top-label HB with B = 50 bins per class. For the less likely classes (4, 5, and 6), the left column is better calibrated than the right column. Similar observations are made on other datasets, and so we recommend adaptively choosing a different number of bins per class, as Algorithm 8 does.

Figure 9 supplements Figure 4 in the main paper by presenting reliability diagrams and validity plots of top-label HB for all classes. Figure 9a presents the plots with an adaptive number of bins per class (Algorithm 8), and Figure 9b presents these for a fixed number of bins per class. We make the following observations. (a) For every class l ∈ [L], the RF is overconfident.
This may seem surprising at first since we generally expect that models may be overconfident for certain classes and underconfident for others. However, note that all our plots assess top-label calibration, that is, we are assessing the predicted and true probabilities of only the predicted class. It is possible that a model is overconfident for every class whenever that class is predicted to be the top-label. (b) For the most likely classes, namely classes 1 and 2, the number of bins in the adaptive case is higher than 50. Fewer bins leads to better calibration (at the cost of sharpness). This can be verified through the validity plots for classes 1 and 2 the validity plots in the fixed bins case is slightly above the validity plot in the adaptive bin case. However both validity plots are quite similar. (c) The opposite is true for the least likely classes, namely classes 4, 5, 6. The validity plot in the fixed bins case is below the validity plot in the adaptive bins case, indicating higher TL-ECE in the fixed bins case. The difference between the validity plots is high. Thus if a fixed number of bins per class is pre-decided, the performance for the least likely classes significantly suffers. Based on these observations, we recommend adaptively choosing the number of bins per class, as done by Algorithm 8. G BINNING-BASED CALIBRATORS FOR CANONICAL MULTICLASS CALIBRATION Canonical calibration is a notion of calibration that does not fall in the M2B category. To define canonical calibration, we use Y to denote the output as a 1-hot vector. That is, Yi e Yi P L 1, where el corresponds to the l-th canonical basis vector in Rd. Recall that a predictor h ph1, h2, . . . , h Lq is said to be canonically calibrated if Pp Y l | hp Xqq hlp Xq for every l P r Ls. Equivalently, this can be stated as E r Y | hp Xqs hp Xq. Canonical calibration implies class-wise calibration: Proposition 1. If E r Y | hp Xqs hp Xq, then for every l P r Ls, Pp Y l | hlp Xqq hlp Xq. The proof in Appendix H is straightforward, but the statement above is illuminating, because there exist predictors that are class-wise calibrated but not canonically calibrated (Vaicenavicius et al., 2019, Example 1). Canonical calibration is not an M2B notion since the conditioning occurs on the L-dimensional prediction vector predp Xq hp Xq, and after this conditioning, each of the L statements Pp Y l | predp Xqq hlp Xq should simultaneously be true. On the other hand, M2B notions verify only individual binary calibration claims for every such conditioning. Since canonical calibration does not fall in the M2B category, Algorithm 5 does not lead to a calibrator for canonical calibration. In this section, we discuss alternative binning-based approaches to achieving canonical calibration. For binary calibration, there is a complete ordering on the interval r0, 1s, and this ordering is leveraged by binning based calibration algorithms. However, L 1, for L ě 3 does not have such a natural ordering. Hence, binning algorithms do not obviously extend for multiclass classification. In this section, we briefly discuss some binning-based calibrators for canonical calibration. Our descriptions are for general L ě 3, but we anticipate these algorithms to work reasonably only for small L, say if L ď 5. As usual, denote g : X Ñ L 1 as the base model and h : X Ñ L 1 as the model learnt using some post-hoc canonical calibrator. 
For canonical calibration, we can surmise binning schemes that directly learn h by partitioning the prediction space L 1 into bins and estimating the distribution of Y in each bin. A canonical calibration guarantee can be showed for such a binning scheme using multinomial concentration (Podkopaev and Ramdas, 2021, Section 3.1). However, since Volp L 1q 2Θp Lq, there will either be a bin whose volume is 2Ωp Lq (meaning that h would Published as a conference paper at ICLR 2022 not be sharp), or the number of bins will be 2Ωp Lq, entailing 2Ωp Lq requirements on the sample complexity a curse of dimensionality. Nevertheless, let us consider some binning schemes that could work if L is small. Formally, a binning scheme corresponds to a partitioning of L 1 into B ě 1 bins. We denote this binning scheme as B : L 1 Ñ r Bs, where Bpsq corresponds to the bin to which s P L 1 belongs. To learn h, the calibration data is binned to get sets of data-point indices that belong to each bin, depending on the gp Xiq values: for every b P r Bs, Tb : ti : Bpgp Xiqq bu, nb |Tb| . We then compute the following estimates for the label probabilities in each bin: for every pl, bq P r Ls ˆ r Bs, pΠl,b : i PTb 1 t Yi lu nb if nb ą 0 else pΠl,b 1{B. The binning predictor h : X Ñ L 1 is now defined component-wise as follows: for every l P r Ls, hlpxq pΠl,Bpxq. In words, for every bin b P r Bs, h predicts the empirical distribution of the Y values in bin b. Using a multinomial concentration inequality (Devroye, 1983; Qian et al., 2020; Weissman et al., 2003), calibration guarantees can be shown for the learnt h. Podkopaev and Ramdas (2021, Theorem 3) show such a result using the Bretagnolle-Huber-Carol inequality. All of these concentration inequality give bounds that depend inversely on nb or ?nb. In the following subsections, we describe some binning schemes which can be used for canonical calibration based on the setup illustrated above. First we describe fixed schemes that are not adaptive to the distribution of the data: Sierpinski binning (Appendix G.1) and grid-style binning (Appendix G.2). These are analogous to fixed-width binning for L 2. Fixed binning schemes are not adapted to the calibration data and may have highly imbalanced bins leading to poor estimation of the distribution of Y in bins with small nb. In the binary case, this issue is remedied using histogram binning to ensure that each bin has nearly the same number of calibration points (Gupta and Ramdas, 2021). While histogram binning uses the order of the scalar gp Xiq values, there is no obvious ordering for the multi-dimensional gp Xiq values. In Appendix G.3 we describe a projection based histogram binning scheme that circumvents this issue and ensures that each nb is reasonably large. In Appendix G.4, we present some preliminary experimental results using our proposed binning schemes. Certain asymptotic consistency results different from calibration have been established for histogram regression and classification in the nonparametric statistics literature (Nobel, 1996; Lugosi and Nobel, 1996; Gordon and Olshen, 1984; Breiman et al., 2017; Devroye, 1988); further extensive references can be found within these works. The methodology of histogram regression and classification relies on binning and is very similar to the one we propose here. The main difference is that these works consider binning the feature space X directly, unlike the post-hoc setting where we are essentially interested in binning L 1. 
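As a concrete reference for the recipe above, the following is a minimal numpy sketch of the generic bin-and-average canonical calibrator for an arbitrary binning function B; Sierpinski binning, grid-style binning, and the projection-based scheme of Algorithm 10 differ only in the binning function that is passed in. The names are ours, and, as one illustrative choice, empty bins default to the uniform distribution over labels.

```python
import numpy as np

def fit_canonical_binning(binning, g_cal, Y_cal, num_bins, L):
    """Generic binning-based canonical calibrator (sketch).

    binning: maps a probability vector in the simplex to a bin index in {0, ..., B-1}
    g_cal:   n x L array of base-model predictions g(X_i) on the calibration data
    Y_cal:   true labels in {1, ..., L}
    Returns h, mapping a new prediction vector to the empirical label
    distribution of the calibration points falling in its bin.
    """
    g_cal = np.asarray(g_cal, dtype=float)
    Y_cal = np.asarray(Y_cal)
    Pi = np.full((num_bins, L), 1.0 / L)        # default for empty bins: uniform
    bin_ids = np.array([binning(s) for s in g_cal])
    for b in range(num_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # empirical distribution of Y over the calibration points in bin b
            Pi[b] = np.bincount(Y_cal[in_bin] - 1, minlength=L) / in_bin.sum()

    def h(s):
        return Pi[binning(np.asarray(s, dtype=float))]

    return h
```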
In terms of theory, the results these works target are asymptotic consistency for the (Bayes) optimal classification and regression functions, instead of canonical calibration. It would be interesting to consider the (finite-sample) canonical calibration properties of the various algorithms proposed in the context of histogram classification. However, such a study is beyond the scope of this paper. G.1 SIERPINSKI BINNING First, we describe Sierpinski binning for L 3. The probability simplex for L 3, 2, is a triangle with vertices e1 p1, 0, 0q, e2 p0, 1, 0q, and e3 p0, 0, 1q. Sierpinski binning is a recursive partitioning of this triangle based on the fractal popularly known as the Sierpinski triangles. Some Sierpinski bins for L 3 are shown in Figure 10. Formally, we define Sierpinski binning recursively based on a depth parameter q P N. Given an x P X, let s gpxq. For q 1, the number of bins is B 4, and the binning scheme B is defined as: 1 if s1 ą 0.5 2 if s2 ą 0.5 3 if s3 ą 0.5 4 otherwise. Published as a conference paper at ICLR 2022 2 q 1 q 2 q 3 Figure 10: Sierpinski binning for L 3. The leftmost triangle represents the probability simplex 2. Sierpinski binning divides 2 recursively based on a depth parameter q P N. Note that since s1 s2 s3 1, only one of the above conditions can be true. It can be verified that each of the bins have volume equal to p1{4q-th the volume of the probability simplex 2. If a finer resolution of 2 is desired, B can be increased by further dividing the partitions above. Note that each partition is itself a triangle; thus each triangle can be mapped to 2 to recursively define the sub-partitioning. For i P r4s, define the bins bi ts : Bpsq iu. Consider the bin b1. Let us reparameterize it as pt1, t2, t3q p2s1 1, 2s2, 2s3q. It can be verified that b1 tpt1, t2, t3q : s1 ą 0.5u tpt1, t2, t3q : t1 t2 t3 1, t1 P p0, 1s, t2 P r0, 1q, t3 P r0, 1qu. Based on this reparameterization, we can recursively sub-partition b1 as per the scheme (18), replacing s with t. Such reparameterizations can be defined for each of the bins defined in (18): b2 tps1, s2, s3q : s2 ą 0.5u : pt1, t2, t3q p2s1, 2s2 1, 2s3q, b3 tps1, s2, s3q : s3 ą 0.5u : pt1, t2, t3q p2s1, 2s2, 2s3 1q, b4 tps1, s2, s3q : si ď 0.5 for all iu : pt1, t2, t3q p1 2s1, 1 2s2, 1 2s3q, and thus every bin can be recursively sub-partitioned as per (18). As illustrated in Figure 10, for Sierpinski binning, we sub-partition only the bins b1, b2, b3 since the bin b4 corresponds to low confidence for all labels, where finer calibration may not be needed. (Also, in the L ą 3 case described shortly, the corresponding version of b4 is geometrically different from L 1, and the recursive partitioning cannot be defined for it.) If at every depth, we sub-partition all bins except the corresponding b4 bins, then it can be shown using simple algebra that the total number of bins is p3q 1 1q{2. For example, in Figure 10, when q 2, the number of bins is B 14, and when q 3, the number of bins is B 40. As in the case of L 3, Sierpinski binning for general L is defined through a partitioning function of L 1 into L 1 bins, and a reparameterization of the partitions so that they can be further sub-partitioned. The L 1 bins at depth q 1 are defined as Bpsq " l if sl ą 0.5, L 1 otherwise. (19) While this looks similar to the partitioning (18), the main difference is that the bin b L 1 has a larger volume than other bins. 
Indeed for l P r Ls, volpblq volp L 1q{2L 1, while volpb L 1q volp L 1qp1 L{2L 1q ě volp L 1q{2L 1, with equality only occuring for L 3. Thus the bin b L 1 is larger than the other bins. If gpxq P b L 1, then the prediction for x may be not be very sharp, compared to if gpxq were in any of the other bins. On the other hand, if gpxq P b L 1, the score for every class is smaller than 0.5, and sharp calibration may often not be desired in this region. In keeping with this understanding, we only reparameterize the bins b1, b2, . . . , b L so that they can be further divided: bl tps1, s2, . . . , s Lq : sl ą 0.5u : pt1, t2, . . . , t Lq p2s1, . . . , 2sl 1, . . . , 2s Lq. For every l P r Ls, under the reparameterization above, it is straightforward to verify that tpt1, t2, . . . , t Lq : sl ą 0.5u tpt1, t2, . . . , t Lq : ÿ u Pr Ls tu 1, tl P p0, 1s, tu P r0, 1q @u lu. Thus every bin can be recursively sub-partitioned following (19). For Sierpinski binning with L labels, the number of bins at depth q is given by p Lq 1 1q{p L 1q. Published as a conference paper at ICLR 2022 K 4, τ 0.25 0 0.2 0.4 0.6 0.8 1 Figure 11: Grid-style binning for L 3. G.2 GRID-STYLE BINNING Grid-style binning is motivated from the 2D reliability diagrams of Widmann et al. (2019, Figure 1), where they partitioned 2 into multiple equi-volume bins in order to assess canonical calibration on a 3-class version of CIFAR-10. For L 3, 2 can be divided as shown in Figure 11. This corresponds to gridding the space 2, just the way we think of gridding the real hyperplane. However, the mathematical description of this grid for general L is not apparent from Figure 11. We describe grid-style binning formally for general L ě 3. Consider some τ ą 0 such that K : 1{τ P N. For every tuple k pk1, k2, . . . , k Lq in the set I tk P NL : maxp L, K 1q ď ÿ l Pr Ls kl ď K p L 1qu, (20) define the bins bk : ts P L 1 : for every l P r Ls, sl K P rkl 1, kls. (21) These bins are not mutually disjoint, but intersections can only occur at the edges. That is, for every s that belongs to more than one bin, at least one component sl satisies sl K P N. In order to identify a single bin s, ties can be broken arbitrarily. One strategy is to use some ordering on NL; say for k1 k2 P N, k1 ă k2 if and only if for the first element of k1 that is unequal to the corresponding element of k2 the one corresponding to k1 is smaller. Then define the binning function B : L 1 Ñ |I| as Bpsq mintk : s P bku. The following propositions prove that a) each s belongs to at least one bin, and b) that every bin is an L 1 dimensional object (and thus a meaningful partition of L 1). Proposition 2. The bins tbk : k P Iu defined by (21) mutually exhaust L 1. Proof. Consider any s P L 1. For sl K R N t1, 2, . . .u, set kl maxp1, rsl Ksq ą sl K. Consider the condition C : for all l such that sl K R N, sl K 0. If C is true, then for l such that sl K P N, set kl sl K. If C is not true, then for l such that sl K P N, set exactly one kl sl K 1, and for the rest set kl sl K. Based on this setting of k, it can be verified that s P bk. Further, note that for every l, kl ě sl K, and there exists at least one l such that kl ą sl K. Thus we have: Since řL l 1 kl P N, we must have řL l 1 kl ě K 1. Further since each kl P N, řL l 1 kl ě L. Next, note that for every l, kl ď sl K 1. If C is true, then there is at least one l such that sl K P N, and for this l, we have set kl sl K ă sl K 1. 
If C is not true, then either there exists at least one l such that sl K R N Y t0u for which kl rsl Ks ă sl K 1, or every sl K P N, in which case we Published as a conference paper at ICLR 2022 have set kl sl K for one of them. In all cases, note that there exists an l such that kl ă sl K 1. Thus, l 1 psl K 1q Since řL l 1 kl P N, we must have řL l 1 kl ď K L 1. Next, we show that each bin indexed by k P I contains a non-zero volume subset of L 1, where volume is defined with respect to the Lebesgue measure in RL 1. This can be shown by arguing that bk contains a scaled and translated version of L 1. Proposition 3. For every k P I, there exists some u P RL and v ą 0 such that u v L 1 Ď bk. Proof. Fix some k P I. By condition (20), řL l 1 kl P rmaxp L, K 1q, K L 1s. Based on this, we claim that there exists a τ P r0, 1q such that l 1 pkl 1q τL p1 τq K. (22) Indeed, note that for τ 0, řL l 1pkl 1q τL p1 τq ď p K 1q 1 K and for τ 1, řL l 1pkl 1q τL p1 τq řL l 1 kl ą K. Thus, there exists a τ that satisfies (22) by the intermediate value theorem. Next, define u K 1pk pτ 1q1Lq and v K 1p1 τq ą 0, where 1L denotes the vector in RL with each component equal to 1. Consider any s P u v L 1. Note that for every l P r Ls, sl K P rkl 1, kls and by property (22), l 1 pkl pτ 1qq l 1 pkl 1q τL p1 τq K. Thus, s P L 1 and by the definition of bk, s P bk. This completes the proof. The previous two propositions imply that B satisfies the property we require of a reasonable binning scheme. For L 3, grid-style binning gives equi-volume bins as illustrated in Figure 11; however this is not true for L ą 3. We now describe a histogram binning based partitioning scheme. G.3 PROJECTION BASED HISTOGRAM BINNING FOR CANONICAL CALIBRATION Some of the bins defined by Sierpinski binning and grid-style binning may have very few calibration points nb, leading to poor estimation of pΠ. In the binary calibration case, this can be remedied using histogram binning which strongly relies on the scoring function g taking values in a fully ordered space r0, 1s. To ensure that each bin contains Ωpn{Bq points, we estimate the quantiles of gp Xq and created the bins as per these quantiles. However, there are no natural quantiles for unordered prediction spaces such as L 1 (L ě 3). In this section, we develop a histogram binning scheme for L 1 that is semantically interpretable and has desirable statistical properties. Our algorithm takes as input a prescribed number of bins B and an arbitrary sequence of vectors q1, q2, . . . , q B 1 P RL with unit ℓ2-norm: }qi}2 1. Each of these vectors represents a direction on which we will project L 1 in order to induce a full order on L 1. Then, for each direction, we will use an order statistics on the induced full order to identify a bin with exactly tpn 1q{Bu 1 calibration points (except the last bin, which may have more points). The formal algorithm is described in Algorithm 10. It uses the following new notation: given m vectors v1, v2, . . . , vm P RL, a unit vector u, and an index j P rms, let order-statisticsptv1, v2, . . . , vmu, u, jq denote the j-th order-statistics of tv T 1 u, v T 2 u, . . . , v T muu. Published as a conference paper at ICLR 2022 Algorithm 10: Projection histogram binning for canonical calibration Input: Base multiclass predictor g : X Ñ L 1, calibration data D tp X1, Y1q, p X2, Y2q, . . . , p Xn, Ynqu Hyperparameter: number of bins B, unit vectors q1, q2, . . . , q B P RL, Output: Approximately calibrated scoring function h 1 S Ð tgp X1q, gp X2q, . . . 
, gp Xnqu; 2 T Ð empty array of size B; 3 c Ð t n 1 4 for b Ð 1 to B 1 do 5 Tb Ð order-statisticsp S, qb, cq; 6 S Ð Sztv P S : v T qb ď Tbu; 8 TB Ð 1.01; 9 Bpgp qq Ð mintb P r Bs : gp q T qb ă Tbu; 10 pΠ Ð empty matrix of size B ˆ L; 11 for b Ð 1 to B do 12 for l Ð 1 to L do 13 pΠb,l Ð Meant1 t Yi lu : Bpgp Xiqq b and @s P r Bs, gp Xiq T qs Tsu; 16 for l Ð 1 to L do 17 hlp q Ð pΠBpgp qq,l; 19 return h; We now briefly describe some values computed by Algorithm 10 in words to build intuition. The array T, which is learnt on the data, represents the identified thresholds for the directions given by q. Each pqb, Tbq pair corresponds to a hyperplane that cuts L 1 into two subsets given by tx P L 1 : x T qb ă Tbu and tx P L 1 : x T qb ě Tbu. The overall partitioning of L 1 is created by merging these cuts sequentially. This defines the binning function B. By construction, the binning function is such that each bin contains at least t n 1 B u 1 many points in its interior. As suggested by Gupta and Ramdas (2021), we do not include the points that lie on the boundary, that is, points Xi that satisfy gp Xiq T qs Ts for some s P r Bs. The interior points bins are then used to estimate the bin biases pΠ. No matter how the q-vectors are chosen, the bins created by Algorithm 10 have at least X n B \ 1 points for bias estimation. However, we discuss some simple heuristics for setting q that are semantically meaningful. For some intuition, note that the binary version of histogram binning Gupta and Ramdas (2021, Algorithm 1) is essentially recovered by Algorithm 10 if L 2 by setting each qb as e2 (the vector r0, 1s). Equivalently, we can set each qb to e1 since both are equivalent for creating a projection-based order on 2. Thus for L ě 3, a natural strategy for the q-vectors is to rotate between the canonical basis vectors: q1 e1, q2 e2, . . . , q L e L, q L 1 e1, . . . , and so on. Projecting with respect to el focuses on the class l by forming a bin corresponding to the largest values of glp Xiq among the remaining Xi s which have not yet been binned. (On the other hand, projecting with respect to el will correspond to forming a bin with the smallest values of glp Xiq.) The q-vectors can also be set adaptively based on the training data (without seeing the calibration data). For instance, if most points belong to a specific class l P r Ls, we may want more sharpness for this particular class. In that case, we can choose a higher ratio of the q-vectors to be el. G.4 EXPERIMENTS WITH THE COVTYPE DATASET In Figure 12 we illustrate the binning schemes proposed in this section on a 3-class version of the COVTYPE-7 dataset considered in Section B.2. As noted before, this is an imbalanced dataset where classes 1 and 2 dominate. We created a 3 class problem with the classes 1, 2, and other (as Published as a conference paper at ICLR 2022 (a) Calibration using Sierpinski binning at depth q 2. (b) Calibration using grid-style binning with K 5, τ 0.2. (c) Projection-based HB with B 27 projections: q1 e1, q2 e2, . . . , q4, e1, . . ., and so on. (d) Projection-based HB with B 27 random projections (qi drawn uniformly from the ℓ2-unit-ball in R3). Figure 12: Canonical calibration using fixed and histogram binning on a 3-class version of the COVTYPE-7 dataset. The reliability diagrams (left) indicate that all forms of binning improve the calibration of the base logistic regression model. 
The recalibration diagrams (right) are a scatter plot of the predictions gp Xq on the test data with the points colored in 10 different colors depending on their bin. For every bin, the arrow-tail indicates the mean probability predicted by the base model g whereas the arrow-head indicates the probability predicted by the updated model h. class 3). The entire dataset has 581012 points and the ratio of the classes is approximately 36%, 49%, 15% respectively. The dataset was split into train-test in the ratio 70:30. The training data was further split into modeling-calibration in the ratio 90:10. A logistic regression model g using sklearn.linear model.Logistic Regression was learnt on the modeling data, and g was recalibrated on the calibration data. The plots on the right in Figure 12 are recalibration diagrams. The base predictions gp Xq on the test-data are displayed as a scatter plot on 2. Points in different bins are colored using one of 10 Published as a conference paper at ICLR 2022 different colors (since the number of bins is larger than 10, some colors correspond to more than one bin). For each bin, an arrow is drawn, where the tail of the arrow corresponds to the average gp Xq predictions in the bin and the head of the arrow corresponds to the recalibrated hp Xq prediction for the bin. For bins that contained very few points, the arrows are suppressed for visual clarity. The plots on the left in Figure 12 are validity plots (Gupta and Ramdas, 2021). Validity plots display estimates of V pεq Ptest-data p}E r Y | gp Xqs gp Xq}1 ď εq , as ε varies (g corresponds to the validity plot for logistic regression; replacing g with h above gives plots for the binning based classifier h). For logistic regression, the same binning scheme as the one provided by h is used to estimate V pεq. Overall, Figure 12 shows that all of the binning approaches improve the calibration of the original logistic regression model across different ε. However, the recalibration does not change the original model significantly. Comparing the different binning methods to each other, we find that they all perform quite similarly. It would be interesting to further study these and other binning methods for post-hoc canonical calibration. Proofs appear in separate subsections, in the same order as the corresponding results appear in the paper and Appendix. Proposition 4 was stated informally, so we state it formally as well. H.1 STATEMENT AND PROOF OF PROPOSITION 4 Proposition 4. For any predictor pc, hq, conf-ECEpc, hq ď TL-ECEpc, hq. Proof. To avoid confusion between the the conditioning operator and the absolute value operator | |, we use abs p q to denote absolute values below. Note that, abs p Pp Y cp Xq | hp Xqq hp Xqq abs p E r1 t Y cp Xqu | hp Xqs hp Xqq abs p E r1 t Y cp Xqu hp Xq | hp Xqsq abs p E r E r1 t Y cp Xqu hp Xq | hp Xq, cp Xqs | hp Xqsq ď E rabs p E r1 t Y cp Xqu hp Xq | hp Xq, cp Xqsq | hp Xqs (by Jensen s inequality) E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqq | hp Xqs . conf-ECEpc, hq E rabs p Pp Y cp Xq | hp Xqq hp Xqqs ď E r E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqq | hp Xqss E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqqs TL-ECEpc, hq. H.2 PROOF OF THEOREM 1 The proof strategy is as follows. First, we use the bound of Gupta and Ramdas (2021, Theorem 4) (henceforth called the GR21 bound), derived in the binary calibration setting, to conclude marginal, conditional, and ECE guarantees for each hl. 
Then, we show that the binary guarantees for the individual hl s leads to a top-label guarantee for the overall predictor pc, hq. Consider any l P r Ls. Let Pl denote the conditional distribution of p X, 1 t Y luq given cp Xq l. Clearly, Dl is a set of nl i.i.d. samples from Pl, and hl is learning a binary calibrator with respect to Pl using binary histogram binning. The number of data-points is nl and the number of bins is Bl tnl{ku bins. We now apply the GR21 bounds on hl. First, we verify that the condition they require is satisfied: nl ě k tnl{ku ě 2Bl. Published as a conference paper at ICLR 2022 Thus their marginal calibration bound for hl gives, |Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď δ logp2{αq 2ptnl{Blu 1q | cp Xq l Note that since tnl{Blu tnl{ tnl{kuu ě k, logp2{αq 2pk 1q ě δ logp2{αq 2ptnl{Blu 1q. Thus we have P p|Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď ε1 | cp Xq lq ě 1 α. This is satisfied for every l. Using the law of total probability gives us the top-label marginal calibration guarantee for pc, hq: Pp|Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε1q l 1 Ppcp Xq lq Pp|Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε1 | cp Xq lq (law of total probability) l 1 Ppcp Xq lq Pp|Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď ε1 | cp Xq lq (by construction, if cpxq l, hpxq hlpxq) l 1 Ppcp Xq lqp1 αq Similarly, the in-expectation ECE bound of GR21, for p 1, gives for every l, E |Pp Y l | cp Xq l, hlp Xqq hlp Xq | cp Xq l| ď a tnl{ku {2nl δ E|Pp Y cp Xq | cp Xq, hlp Xqq hp Xq| l 1 Ppcp Xq lq E|Pp Y l | cp Xq l, hlp Xqq hlp Xq| | cp Xq l l 1 Ppcp Xq lqp a Next, we show the top-label conditional calibration bound. Let B řL l 1 Bl and αl αBl{B. Note that B ď řL l 1 nl{k n{k. The binary conditional calibration bound of GR21 gives @r P Rangephlq, |Pp Y l | cp Xq l, hlp Xq rq r| ď δ logp2Bl{αlq 2ptnl{Blu 1q | cp Xq l Note that d logp2Bl{αlq 2ptnl{Blu 1q logp2B{αq 2ptnl{Blu 1q Published as a conference paper at ICLR 2022 logp2n{kαq 2ptnl{Blu 1q (since B ď n{k) 2pk 1q (since k ď tnl{Blu). Thus for every l P r Ls, Pp@r P Rangephlq, |Pp Y l | cp Xq l, hlp Xq rq r| ď ε2q ě 1 αl. By construction of h, conditioning on cp Xq l and hlp Xq r is the same as conditioning on cp Xq l and hp Xq r. Taking a union bound over all L gives Pp@l P r Ls, r P Rangephq, |Pp Y cp Xq | cp Xq l, hp Xq rq r|q ď ε2q l 1 αl 1 α, proving the conditional calibration result. Finally, note that if for every l P r Ls, r P Rangephq, |Pp Y cp Xq | cp Xq l, hp Xq rq r| ď ε2, then also TL-ECEpc, hq E|Pp Y cp Xq | hp Xq, cp Xqq hp Xq| ď ε2. This proves the high-probability bound for the TL-ECE. Remark 3. Gupta and Ramdas (2021) proved a more general result for general ℓp-ECE bounds. Similar results can also be derived for the suitably defined ℓp-TL-ECE. Additionally, it can be shown that with probability 1 α, the TL-MCE of pc, hq is bounded by ε2. (TL-MCE is defined in Appendix E, equation (17).) H.3 PROOF OF PROPOSITION 1 Consider a specific l P r Ls. We use hl to denote the l-th component function of h and Yl 1 t Y lu. Canonical calibration implies Pp Y l | hp Xqq E r Yl | hp Xqs hlp Xq. (23) We can then use the law of iterated expectations (or tower rule) to get the final result: E r Yl | hlp Xqs E r E r Yl | hp Xqs | hlp Xqs E rhlp Xq | hlp Xqs (by the canonical calibration property (23)) hlp Xq. H.4 PROOF OF THEOREM 2 We use the bounds of Gupta and Ramdas (2021, Theorem 4) (henceforth called the GR21 bounds), derived in the binary calibration setting, to conclude marginal, conditional, and ECE guarantees for each hl. This leads to the class-wise results as well. 
Consider any l P r Ls. Let Pl denote the distribution of p X, 1 t Y luq. Clearly, Dl is a set of n i.i.d. samples from Pl, and hl is learning a binary calibrator with respect to Pl using binary histogram binning. The number of data-points is n and the number of bins is Bl tn{klu bins. We now apply the GR21 bounds on hl. First, we verify that the condition they require is satisfied: n ě kl tn{klu ě 2Bl. Thus the GR21 marginal calibration bound gives that for every l P r Ls, hl satisfies |Pp Y l | hlp Xqq hlp Xq| ď δ logp2{αlq 2ptn{Blu 1q Published as a conference paper at ICLR 2022 The class-wise marginal calibration bound of Theorem 2 follows since tn{Blu tn{ tn{kluu ě kl, and so logp2{αlq 2ptn{Blu 1q. Next, the GR21 conditional calibration bound gives for every l P r Ls, hl satisfies @r P Rangephlq, |Pp Y l | hlp Xq rq r| ď δ logp2Bl{αlq 2ptn{Blu 1q The class-wise marginal calibration bound of Theorem 2 follows since Bl tn{klu ď n{kl and tn{Blu tn{ tn{kluu ě kl, and so logp2Bl{αlq 2ptn{Blu 1q. Let k minl Pr Ls kl. The in-expectation ECE bound of GR21, for p 1, gives for every l, E rbinary-ECE-for-class-l phlqs ď a tn{klu {2n δ E r CW-ECEpc, hqs E l 1 binary-ECE-for-class-l phlq as required for the in-expectation CW-ECE bound of Theorem 2. Finally, for the high probability CW-ECE bound, let ε maxl Pr Ls εp2q l and α řL l 1 αl. By taking a union bound over the the conditional calibration bounds for each hl, we have, with probability 1 α, for every l P r Ls and r P Rangephq, |Pp Y l | cp Xq l, hp Xq rq r| ď εp2q l ď ε. Thus, with probability 1 α, CW-ECEpc, hq L 1 L ÿ l 1 E|Pp Y l | hlp Xqq hlp Xq| ď ε. This proves the high-probability bound for the CW-ECE.