TOP-LABEL CALIBRATION AND MULTICLASS-TO-BINARY REDUCTIONS

Published as a conference paper at ICLR 2022

Chirag Gupta & Aaditya Ramdas
Carnegie Mellon University
{chiragg,aramdas}@cmu.edu

ABSTRACT

We propose a new notion of multiclass calibration called top-label calibration. A classifier is said to be top-label calibrated if the reported probability for the predicted class label (the top-label) is calibrated, conditioned on the top-label. This conditioning is essential for the practical utility of the calibration property, since the top-label is always reported and we must condition on what is reported. However, the popular notion of confidence calibration erroneously skips this conditioning. Furthermore, we outline a multiclass-to-binary (M2B) reduction framework that unifies confidence, top-label, and class-wise calibration, among others. As its name suggests, M2B works by reducing multiclass calibration to different binary calibration problems; various types of multiclass calibration can then be achieved using simple binary calibration routines. We instantiate the M2B framework with the well-studied histogram binning (HB) binary calibrator, and prove that the overall procedure is multiclass calibrated without making any assumptions on the underlying data distribution. In an empirical evaluation with four deep net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling. Code for this work is available at https://github.com/aigen/df-posthoc-calibration.

1 INTRODUCTION

Machine learning models often make probabilistic predictions. The ideal prediction is the true conditional distribution of the output given the input. However, nature never reveals true probability distributions, making it infeasible to achieve this ideal in most situations. Instead, there is significant interest in designing models that are calibrated, which is often feasible.

We motivate the definition of calibration using a standard example of predicting the probability of rain. Suppose a meteorologist claims that the probability of rain on a particular day is 0.7. Regardless of whether it rains on that day or not, we cannot know if 0.7 was the underlying probability of rain. However, we can test if the meteorologist is calibrated in the long run, by checking whether, on the D days when 0.7 was predicted, it indeed rained on around 0.7D days (and the same is true for other predicted probabilities). This example is readily converted to a formal binary calibration setting. Denote a random (feature, label)-pair as (X, Y) ∈ 𝒳 × {0, 1}, where 𝒳 is the feature space. A probabilistic predictor h : 𝒳 → [0, 1] is said to be calibrated if for every prediction q ∈ [0, 1], P(Y = 1 | h(X) = q) = q (almost surely). Arguably, if an ML classification model produces such calibrated scores for the classes, downstream users of the model can reliably use its predictions for a broader set of tasks.

Our focus in this paper is calibration for multiclass classification, with L ≥ 3 classes and Y ∈ [L] := {1, 2, . . . , L}. We assume all (training and test) data is drawn i.i.d. from a fixed distribution P, and denote a general point from this distribution as (X, Y) ∼ P. Consider a typical multiclass predictor, h : 𝒳 → Δ^{L-1}, whose range Δ^{L-1} is the probability simplex in R^L.
A natural notion of calibration for h, called canonical calibration, is the following: for every l ∈ [L], P(Y = l | h(X) = q) = q_l (where q_l denotes the l-th component of q). However, canonical calibration becomes infeasible to achieve or verify once L is even 4 or 5 (Vaicenavicius et al., 2019). Thus, there is interest in studying statistically feasible relaxations of the canonical notion, such as confidence calibration (Guo et al., 2017) and class-wise calibration (Kull et al., 2017).

In particular, the notion of confidence calibration (Guo et al., 2017) has been popular recently. A model is confidence calibrated if the following is true: "when the reported confidence for the predicted class is q ∈ [0, 1], the accuracy is also q". In any practical setting, the confidence q is never reported alone; it is always reported along with the actual class prediction l ∈ [L]. One may expect that if a model is confidence calibrated, the following also holds: "when the class l is predicted with confidence q, the probability of the actual class being l is also q". Unfortunately, this expectation is rarely met: there exist confidence calibrated classifiers for which the latter statement is grossly violated for all classes (Example 1). On the other hand, our proposed notion of top-label calibration enforces the latter statement. It is philosophically more coherent, because it requires conditioning on all relevant reported quantities (both the predicted top label and our confidence in it). In Section 2, we argue further that top-label calibration is a simple and practically meaningful replacement of confidence calibration.

In Section 3, we unify top-label, confidence, and a number of other popular notions of multiclass calibration into the framework of multiclass-to-binary (M2B) reductions. The M2B framework relies on the simple observation that each of these notions internally verifies binary calibration claims. As a consequence, each M2B notion of calibration can be achieved by solving a number of binary calibration problems. With the M2B framework at our disposal, all of the rich literature on binary calibration can now be used for multiclass calibration. We illustrate this by instantiating the M2B framework with the binary calibration algorithm of histogram binning, or HB (Zadrozny and Elkan, 2001; Gupta and Ramdas, 2021). The M2B + HB procedure achieves state-of-the-art results with respect to standard notions of calibration error (Section 4). Further, we show that our procedure is provably calibrated for arbitrary data-generating distributions. The formal theorems are delayed to Appendices B and C (due to space limitations), but an informal result is presented in Section 4.

2 MODIFYING CONFIDENCE CALIBRATION TO TOP-LABEL CALIBRATION

Let c : 𝒳 → [L] denote a classifier or top-label predictor and h : 𝒳 → [0, 1] a function that provides a confidence or probability score for the top-label c(X). The predictor (c, h) is said to be confidence calibrated (for the data-generating distribution P) if

P(Y = c(X) | h(X)) = h(X).    (1)

In other words, when the reported confidence h(X) equals p ∈ [0, 1], the fraction of instances where the predicted label is correct also approximately equals p. Note that for an L-dimensional predictor h : 𝒳 → Δ^{L-1}, one would use c(·) = argmax_{l ∈ [L]} h_l(·) and h(·) = h_{c(·)}(·); ties are broken arbitrarily. Then h is confidence calibrated if the corresponding (c, h) satisfies (1).
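To make the notation concrete, the following minimal sketch (Python with NumPy; the array names and toy numbers are ours, not from the paper) shows how the pair (c, h) is obtained from an L-dimensional probability predictor, as described above: c is the argmax class and h is the corresponding maximum probability. Classes are 0-indexed here, whereas the paper uses 1, . . . , L.

```python
import numpy as np

# probs: (n, L) array of predicted probability vectors, one row per example,
# e.g. the softmax output of a deep network on n test points (toy values here).
probs = np.array([
    [0.10, 0.60, 0.30],
    [0.55, 0.25, 0.20],
    [0.05, 0.05, 0.90],
])

c = np.argmax(probs, axis=1)          # top-label prediction c(x), in {0, ..., L-1}
h = probs[np.arange(len(probs)), c]   # confidence h(x) = h_{c(x)}(x), the top probability

# Confidence calibration (1) asks: among points with h(X) = p, does Y = c(X) hold
# a fraction p of the time?  Top-label calibration (2), defined next, asks the same
# question separately for each predicted class c(X).
print(c)  # [1 0 2]
print(h)  # [0.6  0.55 0.9 ]
```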
Confidence calibration is most applicable in high-accuracy settings where we trust the label prediction c(x). For instance, if a high-accuracy cancer-grade-prediction model predicts a patient as having "95% grade III, 3% grade II, and 2% grade I", we would suggest that the patient undergo an invasive treatment. However, we may want to know (and control) the number of non-grade-III patients that were given this suggestion incorrectly. In other words, is P(cancer is not grade III | cancer is predicted to be of grade III with confidence 95%) equal to 5%? It would appear that, by focusing on the probability of the predicted label, confidence calibration enforces such control. However, as we illustrate next, confidence calibration fails at this goal by providing a guarantee that is neither practically interpretable, nor actionable.

Translating the probabilistic statement (1) into words, we ascertain that confidence calibration leads to guarantees of the form: "if the confidence h(X) in the top-label is 0.6, then the accuracy (frequency with which Y equals c(X)) is 0.6". Such a guarantee is not very useful. Suppose a patient P is informed (based on their symptoms X) that they are most likely to have a certain disease D, with probability 0.6. Further, patient P is told that this score is confidence calibrated. P can now infer the following: "among all patients who have probability 0.6 of having some unspecified disease, the fraction who have that unspecified disease is also 0.6." However, P is concerned only about disease D, and not about other diseases. That is, P wants to know the probability of having D among patients who were predicted to have disease D with confidence 0.6, not among patients who were predicted to have some disease with confidence 0.6. In other words, P cares about the occurrence of D among patients who were told the same thing that P has been told. It is tempting to wish that the confidence calibrated probability 0.6 has some bearing on what P cares about. However, this faith is misguided, as the above reasoning suggests, and as further illustrated through the following example.

Example 1. Suppose the instance space is (X, Y) ∈ {a, b} × {1, 2, . . .}. (X can be seen as the random patient, and Y as the disease they are suffering from.) Consider a predictor (c, h) and let the values taken by (X, Y, c, h) be as follows:

| Feature x | P(X = x) | Class prediction c(x) | Confidence h(x) | P(Y = c(X) \| X = x) |
|---|---|---|---|---|
| a | 0.5 | 1 | 0.6 | 0.2 |
| b | 0.5 | 2 | 0.6 | 1.0 |

The table specifies only the probabilities P(Y = c(X) | X = x); the probabilities P(Y = l | X = x) for l ≠ c(x) can be set arbitrarily. We verify that (c, h) is confidence calibrated:

P(Y = c(X) | h(X) = 0.6) = 0.5 (P(Y = 1 | X = a) + P(Y = 2 | X = b)) = 0.5 (0.2 + 1) = 0.6.

However, whether the actual instance is X = a or X = b, the probabilistic claim of 0.6 bears no correspondence with reality. If X = a, h(X) = 0.6 is extremely overconfident since P(Y = 1 | X = a) = 0.2. Contrarily, if X = b, h(X) = 0.6 is extremely underconfident.

The reason for the strange behavior above is that the probability P(Y = c(X) | h(X)) is not interpretable from a decision-making perspective. In practice, we never report just the confidence h(X), but also the class prediction c(X) (obviously!). Thus it is more reasonable to talk about the conditional probability of Y = c(X) given what is reported, that is, both c(X) and h(X). We make a small but critical change to (1); we say that (c, h) is top-label calibrated if
P(Y = c(X) | h(X), c(X)) = h(X).    (2)

(See the disambiguating Remark 2 on terminology.) Going back to the patient-disease example, top-label calibration would tell patient P the following: "among all patients who (just like you) are predicted to have disease D with probability 0.6, the fraction who actually have disease D is also 0.6." Philosophically, it makes sense to condition on what is reported, both the top label and its confidence, because that is what is known to the recipient of the information; and there is no apparent justification for not conditioning on both.

A commonly used metric for quantifying the miscalibration of a model is the expected calibration error (ECE). The ECE associated with confidence calibration is defined as

conf-ECE(c, h) := E_X |P(Y = c(X) | h(X)) - h(X)|.    (3)

We define top-label-ECE (TL-ECE) in an analogous fashion, but also condition on c(X):

TL-ECE(c, h) := E_X |P(Y = c(X) | c(X), h(X)) - h(X)|.    (4)

Higher values of ECE indicate worse calibration performance. The predictor in Example 1 has conf-ECE(c, h) = 0. However, it has TL-ECE(c, h) = 0.4, revealing its miscalibration. More generally, it can be deduced as a straightforward consequence of Jensen's inequality that conf-ECE(c, h) is never larger than TL-ECE(c, h) (see Proposition 4 in Appendix H). As illustrated by Example 1, the difference can be significant. In the following subsection we illustrate that the difference can be significant on a real dataset as well. First, we make a couple of remarks.

Remark 1 (ECE estimation using binning). Estimating the ECE requires estimating probabilities conditional on some prediction, such as h(x). A common strategy to do this is to bin together nearby values of h(x) using binning schemes (Nixon et al., 2020, Section 2.1), and compute a single estimate for the predicted and true probabilities using all the points in a bin. The calibration method we espouse in this work, histogram binning (HB), produces discrete predictions whose ECE can be estimated without further binning. Based on this, we use the following experimental protocol: we report unbinned ECE estimates while assessing HB, and binned ECE estimates for all other compared methods, which are continuous-output methods (deep-nets, temperature scaling, etc.). It is commonly understood that binning leads to underestimation of the effective ECE (Vaicenavicius et al., 2019; Kumar et al., 2019). Thus, using unbinned ECE estimates for HB gives HB a disadvantage compared to the binned ECE estimates we use for other methods. (This further strengthens our positive results for HB.) The binning scheme we use is equal-width binning, where the interval [0, 1] is divided into B equal-width intervals. Equal-width binning typically leads to lower ECE estimates compared to adaptive-width binning (Nixon et al., 2020).

Remark 2 (Terminology). The term conf-ECE was introduced by Kull et al. (2019). Most works refer to conf-ECE as just ECE (Guo et al., 2017; Nixon et al., 2020; Mukhoti et al., 2020; Kumar et al., 2018). However, some papers refer to conf-ECE as top-label-ECE (Kumar et al., 2019; Zhang et al., 2020), resulting in two different terms for the same concept. We refer to the older notion as conf-ECE; our definition of top-label calibration/ECE (4) is different from previous ones.
Figure 1: Confidence reliability diagrams misrepresent the effective miscalibration. (a) Confidence reliability diagram and top-label reliability diagram (distinguished by marker style) for a ResNet-50 model on the CIFAR-10 dataset; see further details in points (a) and (b) below. The gray bars denote the fraction of predictions in each bin. The confidence reliability diagram (mistakenly) suggests better calibration than the top-label reliability diagram. (b) Class-wise and zoomed-in version of Figure 1a for bin 6 (top) and bin 10 (bottom); see further details in point (c) below. The markers are in the same position as in Figure 1a, and denote the average predicted and true probabilities. The colored points denote the predicted and true probabilities when seen class-wise. The histograms on the right show the number of test points per class within bins 6 and 10.

2.1 AN ILLUSTRATIVE EXPERIMENT WITH RESNET-50 ON CIFAR-10

We now compare confidence and top-label calibration using ECE estimates and reliability diagrams (Niculescu-Mizil and Caruana, 2005). This experiment can be seen as a less malignant version of Example 1. Here, confidence calibration is not completely meaningless, but can nevertheless be misleading. Figure 1 illustrates the (test-time) calibration performance of a ResNet-50 model (He et al., 2016) on the CIFAR-10 dataset (Krizhevsky, 2009). In the following summarizing points, (c, h) corresponds to the ResNet-50 model.

(a) One set of markers in Figure 1a forms the confidence reliability diagram (Guo et al., 2017), constructed as follows. First, the h(x) values on the test set are binned into one of B = 10 bins, [0, 0.1), [0.1, 0.2), . . . , [0.9, 1], depending on the interval to which h(x) belongs. The gray bars in Figure 1a indicate the fraction of h(x) values in each bin; nearly 92% of the points belong to bin [0.9, 1] and no points belong to bin [0, 0.1). Next, for every bin b, we plot (conf_b, acc_b), which are the plug-in estimates of E[h(X) | h(X) ∈ bin b] and P(Y = c(X) | h(X) ∈ bin b) respectively. The dashed X = Y line indicates perfect confidence calibration.

(b) The other set of markers in Figure 1a forms the top-label reliability diagram. Unlike the confidence reliability diagram, the top-label reliability diagram shows the average miscalibration across classes in a given bin. For a given class l and bin b, define

Δ_{b,l} := | P̂(Y = c(X) | c(X) = l, h(X) ∈ bin b) - Ê[h(X) | c(X) = l, h(X) ∈ bin b] |,

where P̂, Ê denote empirical estimates based on the test data. The overall miscalibration is then

Δ_b := Weighted-average(Δ_{b,l}) = Σ_{l ∈ [L]} P̂(c(X) = l | h(X) ∈ bin b) Δ_{b,l}.

Note that Δ_b is always non-negative and does not indicate whether the overall miscalibration occurs due to under- or over-confidence; also, if the absolute values were dropped from Δ_{b,l}, then Δ_b would simply equal acc_b - conf_b. In order to plot Δ_b in a reliability diagram, we obtain the direction for the corresponding point from the confidence reliability diagram. Thus for every (conf_b, acc_b), we plot (conf_b, conf_b + Δ_b) if acc_b > conf_b, and (conf_b, conf_b - Δ_b) otherwise. This scatter plot of the Δ_b's gives us the top-label reliability diagram.

Figure 1a shows that there is a visible increase in miscalibration when going from confidence calibration to top-label calibration. To understand why this change occurs, Figure 1b zooms into bin 6 (h(X) ∈ [0.5, 0.6)) and bin 10 (h(X) ∈ [0.9, 1.0]), as described next.

(c) Figure 1b displays the class-wise top-label reliability diagrams for bins 6 and 10.
Note that for bin 6, the marker is nearly on the X = Y line, indicating that the overall accuracy matches the overall confidence of 0.545. However, the true accuracy when class 1 was predicted is 0.2, and the true accuracy when class 8 was predicted is 0.9 (a very similar scenario to Example 1). For bin 10, the marker indicates a miscalibration of 0.01; however, when class 4 was predicted (roughly 8% of all test points) the miscalibration is 0.05.

[Figure 2: estimated ECE versus the number of bins (5 to 25) for four architectures (ResNet-50, ResNet-110, Wide-ResNet-26-10, DenseNet-121); curves show top-label-ECE and conf-ECE for the base model, temperature scaling, and histogram binning.] Figure 2 caption: Conf-ECE (dashed lines) and TL-ECE (solid lines) of four deep-net architectures on CIFAR-10, as well as with recalibration using histogram binning and temperature scaling. The TL-ECE is often 2-3 times the conf-ECE, depending on the number of bins used to estimate ECE and the architecture. Top-label histogram binning typically performs better than temperature scaling.

Figure 2 displays the aggregate effect of the above phenomenon (across bins and classes) through estimates of the conf-ECE and TL-ECE. The precise experimental setup is described in Section 4. These plots display the ECE estimates of the base model, as well as the base model recalibrated using temperature scaling (Guo et al., 2017) and our upcoming formulation of top-label histogram binning (Section 3). Since ECE estimates depend on the number of bins B used (see Roelofs et al. (2020) for empirical work around this), we plot the ECE estimate for every value B ∈ [5, 25] in order to obtain clear and unambiguous results. We find that the TL-ECE is significantly higher than the conf-ECE for most values of B, most architectures, and both the pre- and post-recalibration models. This figure also previews the performance of our forthcoming top-label histogram binning algorithm. Top-label HB has smaller estimated TL-ECE than temperature scaling for most values of B and most architectures. Except for ResNet-50, the conf-ECE estimates are also better.

To summarize, top-label calibration captures the intuition of confidence calibration by focusing on the predicted class. However, top-label calibration also conditions on the predicted class, which is always part of the prediction in any practical setting. Further, TL-ECE estimates can be substantially different from conf-ECE estimates. Thus, while it is common to compare predictors based on the conf-ECE, the TL-ECE comparison is more meaningful, and can potentially be different. A plug-in estimator of both quantities is sketched below.
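The following is a minimal sketch of the plug-in estimates behind Figures 1 and 2, using equal-width bins as in Remark 1 (Python with NumPy; the function and variable names are ours, and the paper's full estimation protocol, including the unbinned treatment of histogram-binning outputs, is in its Appendix E.2).

```python
import numpy as np

def ece_estimates(h, c, y, num_bins=15):
    """Plug-in estimates of conf-ECE (3) and TL-ECE (4) with equal-width bins.

    h : (n,) top-label confidences; c : (n,) predicted labels; y : (n,) true labels.
    """
    bins = np.minimum((h * num_bins).astype(int), num_bins - 1)  # equal-width bins on [0, 1]
    correct = (y == c).astype(float)
    n = len(h)

    conf_ece, tl_ece = 0.0, 0.0
    for b in range(num_bins):
        in_bin = bins == b
        if not in_bin.any():
            continue
        # conf-ECE: condition only on the confidence bin
        conf_ece += in_bin.mean() * abs(correct[in_bin].mean() - h[in_bin].mean())
        # TL-ECE: additionally condition on the predicted label
        for l in np.unique(c[in_bin]):
            cell = in_bin & (c == l)
            tl_ece += (cell.sum() / n) * abs(correct[cell].mean() - h[cell].mean())
    return conf_ece, tl_ece
```

Because each bin's signed gap is a weighted average of the per-class gaps within it, the first estimate can never exceed the second, mirroring the Jensen's-inequality argument behind Proposition 4.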
3 CALIBRATION ALGORITHMS FROM CALIBRATION METRICS

In this section, we unify a number of notions of multiclass calibration as multiclass-to-binary (or M2B) notions, and propose a general-purpose calibration algorithm that achieves the corresponding M2B notion of calibration. The M2B framework yields multiple novel post-hoc calibration algorithms, each of which is tuned to a specific M2B notion of calibration.

3.1 MULTICLASS-TO-BINARY (M2B) NOTIONS OF CALIBRATION

In Section 2, we defined confidence calibration (1) and top-label calibration (2). These notions verify calibration claims for the highest predicted probability. Other popular notions of calibration verify calibration claims for other entries in the full L-dimensional prediction vector. A predictor h = (h_1, h_2, . . . , h_L) is said to be class-wise calibrated (Kull et al., 2017) if

(class-wise calibration)  ∀ l ∈ [L],  P(Y = l | h_l(X)) = h_l(X).    (5)

Another recently proposed notion is top-K-confidence calibration (Gupta et al., 2021). For k ∈ [L], let c^(k) : 𝒳 → [L] denote the k-th highest class prediction, and let h^(k) : 𝒳 → [0, 1] denote the confidence associated with it (c = c^(1) and h = h^(1) are special cases). For a given K ≤ L,

(top-K-confidence calibration)  ∀ k ∈ [K],  P(Y = c^(k)(X) | h^(k)(X)) = h^(k)(X).    (6)

As we did in Section 2 for confidence → top-label, top-K-confidence calibration can be modified to the more interpretable top-K-label calibration by further conditioning on the predicted labels:

(top-K-label calibration)  ∀ k ∈ [K],  P(Y = c^(k)(X) | h^(k)(X), c^(k)(X)) = h^(k)(X).    (7)

| Calibration notion | Quantifier | Prediction (pred(X)) | Binary calibration statement |
|---|---|---|---|
| Confidence | - | h(X) | P(Y = c(X) \| pred(X)) = h(X) |
| Top-label | - | c(X), h(X) | P(Y = c(X) \| pred(X)) = h(X) |
| Class-wise | ∀ l ∈ [L] | h_l(X) | P(Y = l \| pred(X)) = h_l(X) |
| Top-K-confidence | ∀ k ∈ [K] | h^(k)(X) | P(Y = c^(k)(X) \| pred(X)) = h^(k)(X) |
| Top-K-label | ∀ k ∈ [K] | c^(k)(X), h^(k)(X) | P(Y = c^(k)(X) \| pred(X)) = h^(k)(X) |

Table 1: Multiclass-to-binary (M2B) notions internally verify one or more binary calibration statements/claims. The statements in the rightmost column are required to hold almost surely.

Each of these notions reduces multiclass calibration to one or more binary calibration requirements, where each binary calibration requirement corresponds to verifying if the distribution of Y, conditioned on some prediction pred(X), satisfies a single binary calibration claim associated with pred(X). Table 1 illustrates how the calibration notions discussed so far internally verify a number of binary calibration claims, making them M2B notions. For example, for class-wise calibration, for every l ∈ [L], the conditioning is on pred(X) = h_l(X), and a single binary calibration statement is verified: P(Y = l | pred(X)) = h_l(X). Based on this property, we call each of these notions multiclass-to-binary or M2B notions.

The notion of canonical calibration mentioned in the introduction is not an M2B notion. Canonical calibration is discussed in detail in Appendix G. Due to the conditioning on a multi-dimensional prediction, non-M2B notions of calibration are harder to achieve or verify. For the same reason, it is possibly easier for humans to interpret binary calibration claims when taking decisions/actions.

3.2 ACHIEVING M2B NOTIONS OF CALIBRATION USING M2B CALIBRATORS

The M2B framework illustrates how multiclass calibration can typically be viewed via a reduction to binary calibration. The immediate consequence of this reduction is that one can now solve multiclass calibration problems by leveraging the well-developed methodology for binary calibration.

The upcoming M2B calibrators belong to the standard recalibration or post-hoc calibration setting. In this setting, one starts with a fixed pre-learnt base model g : 𝒳 → Δ^{L-1}. The base model g can correspond to a deep-net, a random forest, or any 1-v-all (one-versus-all) binary classification model such as logistic regression. The base model is typically optimized for classification accuracy and may not be calibrated. The goal of post-hoc calibration is to use some given calibration data D = (X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n) ∈ (𝒳 × [L])^n, typically data on which g was not learnt, to recalibrate g.
In practice, the calibration data is usually the same as the validation data. To motivate M2B calibrators, suppose we want to verify if g is calibrated on a certain test set, based on a given M2B notion of calibration. Then, the verifying process will split the test data into a number of sub-datasets, each of which will verify one of the binary calibration claims. In Appendix A.2, we argue that the calibration data can also be viewed as a test set, and every step in the verification process can be used to provide a signal for improving calibration. M2B calibrators take the form of wrapper methods that work on top of a given binary calibrator. Denote an arbitrary black-box binary calibrator as At0,1u : r0, 1s X ˆp X ˆt0, 1uq Ñ r0, 1s X , where the first argument is a mapping X Ñ r0, 1s that denotes a (miscalibrated) binary predicor, and the second argument is a calibration data sequence of arbitrary length. The output is a (better calibrated) binary predictor. Examples of At0,1u are histogram binning (Zadrozny and Elkan, 2001), isotonic regression (Zadrozny and Elkan, 2002), and Platt scaling (Platt, 1999). In the upcoming descriptions, we use the indicator function 1 ta bu P t0, 1u which takes the value 1 if a b, and 0 if a b. The general formulation of our M2B calibrator is delayed to Appendix A since the description is a bit involved. To ease readability and adhere to the space restrictions, in the main paper we describe the calibrators corresponding to top-label, class-wise, and confidence calibration (Algorithms 1 3). Each of these calibrators are different from the classical M2B calibrator (Algorithm 4) that has been used by Zadrozny and Elkan (2002), Guo et al. (2017), Kull et al. (2019), and most other papers Published as a conference paper at ICLR 2022 M2B calibrators: Post-hoc multiclass calibration using binary calibrators Input in each case: Binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X , base multiclass predictor g : X Ñ L 1, calibration data D p X1, Y1q, . . . , p Xn, Ynq. Algorithm 1: Confidence calibrator 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 D1 Ð tp Xi, 1 t Yi cp Xiquq : i P rnsu; 4 h Ð At0,1upg, D1q; 5 return pc, hq; Algorithm 2: Top-label calibrator 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 for l Ð 1 to L do 4 Dl Ð tp Xi, 1 t Yi luq : cp Xiq lqu; 5 hl Ð At0,1upg, Dlq; 7 hp q Ð hcp qp q (predict hlpxq if cpxq l); 8 return pc, hq; Algorithm 3: Class-wise calibrator 1 Write g pg1, g2, . . . , g Lq; 2 for l Ð 1 to L do 3 Dl Ð tp Xi, 1 t Yi luq : i P rnsu; 4 hl Ð At0,1upgl, Dlq; 6 return ph1, h2, . . . , h Lq; Algorithm 4: Normalized calibrator 1 Write g pg1, g2, . . . , g Lq; 2 for l Ð 1 to L do 3 Dl Ð tp Xi, 1 t Yi luq : i P rnsu; 4 rhl Ð At0,1upgl, Dlq; 6 Normalize: for every l P r Ls, hlp q : rhlp q{ řL k 1 rhkp q; 7 return ph1, h2, . . . , h Lq; we are aware of, with the most similar one being Algorithm 3. Top-K-label and top-K-confidence calibrators are also explicitly described in Appendix A (Algorithms 6 and 7). Top-label calibration requires that for every class l P r Ls, Pp Y l | cp Xq l, hp Xqq hp Xq. Thus, to achieve top-label calibration, we must solve L calibration problems. Algorithm 2 constructs L datasets t Dl : l P r Lsu (line 4). The features in Dl are the Xi s for which cp Xiq l, and the labels are 1 t Yi lu. Now for every l P r Ls, we calibrate g to hl : X Ñ r0, 1s using Dl and any binary calibrator. 
The final probabilistic predictor is h(·) = h_{c(·)}(·) (that is, it predicts h_l(x) if c(x) = l). The top-label predictor c does not change in this process. Thus the accuracy of (c, h) is the same as the accuracy of g, irrespective of which binary calibrator A_{0,1} is used. Unlike the top-label calibrator, the confidence calibrator merges all classes together into a single dataset ∪_{l ∈ [L]} D_l.

To achieve class-wise calibration, Algorithm 3 also solves L calibration problems, but these correspond to satisfying P(Y = l | h_l(X)) = h_l(X). Unlike top-label calibration, the dataset D_l for class-wise calibration contains all the X_i's (even if c(X_i) ≠ l), and h_l is passed to A_{0,1} instead of h. Also, unlike confidence calibration, Y_i is replaced with 1{Y_i = l} instead of 1{Y_i = c(X_i)}. The overall process is similar to reducing multiclass classification to L 1-v-all binary classification problems, but our motivation is intricately tied to the notion of class-wise calibration.

Most popular empirical works that have discussed binary calibrators for multiclass calibration have done so using the normalized calibrator, Algorithm 4. This is almost identical to Algorithm 3, except that there is an additional normalization step (line 6 of Algorithm 4). This normalization was first proposed by Zadrozny and Elkan (2002, Section 5.2), and has been used unaltered by most other works [1] where the goal has been to simply compare direct multiclass calibrators such as temperature scaling, Dirichlet scaling, etc., to a calibrator based on binary methods (for instance, see Section 4.2 of Guo et al. (2017)). In contrast to these papers, we investigate multiple M2B reductions in an effort to identify the right reduction of multiclass calibration to binary calibration.

To summarize, the M2B characterization immediately yields a novel and different calibrator for every M2B notion. In the following section, we instantiate M2B calibrators on the binary calibrator of histogram binning (HB), leading to two new algorithms, top-label-HB and class-wise-HB, that achieve strong empirical results and satisfy distribution-free calibration guarantees.

[1] The only exception we are aware of is the recent work of Patel et al. (2021), who also suggest skipping normalization (see their Appendix A1); however, they use a common I-Max binning scheme across classes, whereas in Algorithm 3 the predictor h_l for each class is learnt completely independently of other classes.

| Metric | Dataset | Architecture | Base | TS | VS | DS | N-HB | TL-HB |
|---|---|---|---|---|---|---|---|---|
| Top-label ECE | CIFAR-10 | ResNet-50 | 0.025 | 0.022 | 0.020 | 0.019 | **0.018** | 0.020 |
| | | ResNet-110 | 0.029 | 0.022 | 0.021 | 0.021 | **0.020** | 0.021 |
| | | WRN-26-10 | 0.023 | 0.023 | 0.019 | 0.021 | **0.012** | 0.018 |
| | | DenseNet-121 | 0.027 | 0.027 | 0.020 | 0.020 | **0.019** | 0.021 |
| | CIFAR-100 | ResNet-50 | 0.118 | 0.114 | 0.113 | 0.322 | **0.081** | 0.143 |
| | | ResNet-110 | 0.127 | 0.121 | 0.115 | 0.353 | **0.093** | 0.145 |
| | | WRN-26-10 | 0.103 | 0.103 | 0.100 | 0.304 | **0.070** | 0.129 |
| | | DenseNet-121 | 0.110 | 0.110 | 0.109 | 0.322 | **0.086** | 0.139 |
| Top-label MCE | CIFAR-10 | ResNet-50 | 0.315 | 0.305 | 0.773 | 0.282 | 0.411 | **0.107** |
| | | ResNet-110 | 0.275 | 0.227 | 0.264 | 0.392 | 0.195 | **0.077** |
| | | WRN-26-10 | 0.771 | 0.771 | 0.498 | 0.325 | 0.140 | **0.071** |
| | | DenseNet-121 | 0.289 | 0.289 | 0.734 | 0.294 | 0.345 | **0.087** |
| | CIFAR-100 | ResNet-50 | 0.436 | 0.300 | **0.251** | 0.619 | 0.397 | 0.291 |
| | | ResNet-110 | 0.313 | **0.255** | 0.277 | 0.557 | 0.266 | 0.257 |
| | | WRN-26-10 | 0.273 | **0.255** | 0.256 | 0.625 | 0.287 | 0.280 |
| | | DenseNet-121 | 0.279 | **0.231** | 0.235 | 0.600 | 0.320 | 0.289 |

Table 2: Top-label-ECE and top-label-MCE for deep-net models (column Base) and various post-hoc calibrators: temperature-scaling (TS), vector-scaling (VS), Dirichlet-scaling (DS), top-label-HB (TL-HB), and normalized-HB (N-HB).
The best performing method in each row is in bold.

| Metric | Dataset | Architecture | Base | TS | VS | DS | N-HB | CW-HB |
|---|---|---|---|---|---|---|---|---|
| Class-wise ECE × 10² | CIFAR-10 | ResNet-50 | 0.46 | 0.42 | 0.35 | 0.35 | 0.50 | **0.28** |
| | | ResNet-110 | 0.59 | 0.50 | 0.42 | 0.38 | 0.53 | **0.27** |
| | | WRN-26-10 | 0.44 | 0.44 | 0.35 | 0.39 | 0.39 | **0.28** |
| | | DenseNet-121 | 0.46 | 0.46 | **0.36** | **0.36** | 0.48 | **0.36** |
| | CIFAR-100 | ResNet-50 | 0.22 | 0.20 | 0.20 | 0.66 | 0.23 | **0.16** |
| | | ResNet-110 | 0.24 | 0.23 | 0.21 | 0.72 | 0.24 | **0.16** |
| | | WRN-26-10 | 0.19 | 0.19 | 0.18 | 0.61 | 0.20 | **0.14** |
| | | DenseNet-121 | 0.20 | 0.21 | 0.19 | 0.66 | 0.24 | **0.16** |

Table 3: Class-wise-ECE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2, except TL-HB is replaced with class-wise-HB (CW-HB).

4 EXPERIMENTS: M2B CALIBRATION WITH HISTOGRAM BINNING

Histogram binning or HB was proposed by Zadrozny and Elkan (2001) with strong empirical results for binary calibration. In HB, a base binary calibration model g : 𝒳 → [0, 1] is used to partition the calibration data into a number of bins so that each bin has roughly the same number of points. Then, for each bin, the probability of Y = 1 is estimated using the empirical distribution on the calibration data. This estimate forms the new calibrated prediction for that bin. Recently, Gupta and Ramdas (2021) showed that HB satisfies strong distribution-free calibration guarantees, which are otherwise impossible for scaling methods (Gupta et al., 2020).

Despite these results for binary calibration, studies for multiclass calibration have reported that HB typically performs worse than scaling methods such as temperature scaling (TS), vector scaling (VS), and Dirichlet scaling (DS) (Kull et al., 2019; Roelofs et al., 2020; Guo et al., 2017). In our experiments, we find that the issue is not HB but the M2B wrapper used to produce the HB baseline. With the right M2B wrapper, HB beats TS, VS, and DS. A number of calibrators have been proposed recently (Zhang et al., 2020; Rahimi et al., 2020; Patel et al., 2021; Gupta et al., 2021), but VS and DS continue to remain strong baselines which are often close to the best in these papers. We do not compare to each of these calibrators; our focus is on the M2B reduction and the message that the baselines dramatically improve with the right M2B wrapper.

We use three metrics for comparison. The first is top-label-ECE or TL-ECE (defined in (4)), which we argued leads to a more meaningful comparison than conf-ECE. Second, we consider the more stringent maximum-calibration-error (MCE) metric that assesses the worst calibration across predictions (see more details in Appendix E.3). For top-label calibration, MCE is given by

TL-MCE(c, h) := max_{l ∈ [L]} sup_{r ∈ Range(h)} |P(Y = l | c(X) = l, h(X) = r) - r|.

To assess class-wise calibration, we use class-wise-ECE, defined as the average calibration error across classes:

CW-ECE(c, h) := (1/L) Σ_{l=1}^{L} E_X |P(Y = l | h_l(X)) - h_l(X)|.

All ECE/MCE estimation is performed as described in Remark 1. For further details, see Appendix E.2.

Formal algorithm and theoretical guarantees. Top-label-HB (TL-HB) and class-wise-HB (CW-HB) are explicitly stated in Appendices B and C respectively; these are instantiations of the top-label calibrator and the class-wise calibrator with HB. N-HB is the normalized calibrator (Algorithm 4) with HB, which is the same as CW-HB but with an added normalization step. In the Appendix, we extend the binary calibration guarantees of Gupta and Ramdas (2021) to TL-HB and CW-HB (Theorems 1 and 2).
We informally summarize one of the results here: if there are at least k calibration points per bin, then the expected ECE is bounded as E[(TL- or CW-)ECE] ≤ √(1/2k), for TL-HB and CW-HB respectively. The outer expectation is over the calibration data, and corresponds to the randomness in the predictor learnt on the calibration data. Note that the ECE itself is an expected error over an unseen i.i.d. test point (X, Y) ∼ P.

Experimental details. We experimented on the CIFAR-10 and CIFAR-100 datasets, which have 10 and 100 classes respectively. The base models are deep-nets with the following architectures: ResNet-50, ResNet-110, Wide-ResNet-26-10 (WRN) (Zagoruyko and Komodakis, 2016), and DenseNet-121 (Huang et al., 2017). Both CIFAR datasets consist of 60K (60,000) points, which are split as 45K/5K/10K to form the train/validation/test sets. The validation set was used for post-hoc calibration and the test set was used for evaluation through ECE/MCE estimates. Instead of training new models, we used the pre-trained models of Mukhoti et al. (2020), and ask: which post-hoc calibrator improves their calibration the most? We used their Brier score and focal loss models in our experiments (Mukhoti et al. (2020) report that these are the empirically best performing loss functions). All results in the main paper are with Brier score, and results with focal loss are in Appendix E.4. Implementation details for TS, VS, and DS are in Appendix E.

Findings. In Tables 2 and 3, we report the binned ECE and MCE estimates when B = 15 bins are used by HB and for ECE estimation. We make the following observations:

(a) For TL-ECE, N-HB is the best performing method for both CIFAR-10 and CIFAR-100. While most methods perform similarly across architectures for CIFAR-10, there is high variation on CIFAR-100. DS is the worst performing method on CIFAR-100, but TL-HB also performs poorly. We believe that this could be because the data splitting scheme of the TL-calibrator (line 4 of Algorithm 2) splits the dataset across the predicted classes, and some classes in CIFAR-100 occur very rarely. This is further discussed in Appendix E.6.

(b) For TL-MCE, TL-HB is the best performing method on CIFAR-10, by a huge margin. For CIFAR-100, TS or VS perform slightly better than TL-HB. Since HB ensures that each bin gets roughly the same number of points, the predictions are well calibrated across bins, leading to smaller TL-MCE. A similar observation was also made by Gupta and Ramdas (2021).

(c) For CW-ECE, CW-HB is the best performing method across the two datasets and all four architectures. The N-HB method, which has been used in many CW-ECE baseline experiments, performs terribly. In other words, skipping the normalization step leads to a large improvement in CW-ECE. This observation is one of our most striking findings. To shed further light on it, we note that the distribution-free calibration guarantees for CW-HB shown in Appendix C no longer hold post-normalization. Thus, both our theory and experiments indicate that skipping normalization improves CW-ECE performance.

Additional experiments in the Appendix. In Appendix E.5, we report each of the results in Tables 2 and 3 with the number of bins taking every value in the range [5, 25]. Most observations remain the same under this expanded study. In Appendix B.2, we consider top-label calibration for the class-imbalanced COVTYPE-7 dataset, and show that TL-HB adapts to tail/infrequent classes.
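To make the TL-HB procedure evaluated above concrete, the following is a minimal sketch of the top-label calibrator (Algorithm 2) instantiated with histogram binning, in the spirit of Algorithm 8 in Appendix B: the calibration data is split by predicted class, and within each class the confidences are grouped into roughly equal-mass bins of about k points whose empirical accuracy becomes the new prediction. The function names are ours, and details such as the tie-breaking parameter δ and the exact binning routine of Gupta and Ramdas (2021) are simplified away; it also assumes every class predicted at test time was predicted on at least one calibration point.

```python
import numpy as np

def fit_top_label_hb(h, c, y, points_per_bin=50):
    """Top-label histogram binning: one binary HB problem per predicted class (dataset D_l)."""
    recal = {}
    for l in np.unique(c):
        idx = c == l                              # calibration points predicted as class l
        scores, correct = h[idx], (y[idx] == l).astype(float)
        n_bins = max(len(scores) // points_per_bin, 1)   # floor(n_l / k) bins for class l
        # uniform-mass bin edges: each bin receives roughly points_per_bin points
        edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])
        bin_ids = np.searchsorted(edges, scores)
        bin_means = np.array([
            correct[bin_ids == b].mean() if np.any(bin_ids == b) else correct.mean()
            for b in range(n_bins)
        ])
        recal[l] = (edges, bin_means)
    return recal

def predict_top_label_hb(recal, h, c):
    """Recalibrated confidence: look up the bin mean of the class-l calibrator, l = c(x)."""
    out = np.empty_like(h)
    for i, (score, l) in enumerate(zip(h, c)):
        edges, bin_means = recal[l]
        out[i] = bin_means[np.searchsorted(edges, score)]
    return out
```

The predicted label c(x) is left untouched, so accuracy is unchanged; only the reported confidence is replaced by the empirical accuracy of its bin, which is what drives the distribution-free guarantee summarized above.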
5 CONCLUSION We make two contributions to the study of multiclass calibration: (i) defining the new notion of top-label calibration which enforces a natural minimal requirement on a multiclass predictor the probability score for the top class prediction should be calibrated; (ii) developing a multiclass-tobinary (M2B) framework which posits that various notions of multiclass calibration can be achieved via reduction to binary calibration, balancing practical utility with statistically tractability. Since it is important to identify appropriate notions of calibration in any structured output space (Kuleshov et al., 2018; Gneiting et al., 2007), we anticipate that the philosophy behind the M2B framework could find applications in other structured spaces. Published as a conference paper at ICLR 2022 6 REPRODUCIBILITY STATEMENT Some reproducibility desiderata, such as external code and libraries that were used are summarized in Appendix E.1. All code to generate results with the CIFAR datasets is attached in the supplementary material. Our base models were pre-trained deep-net models generated by Mukhoti et al. (2020), obtained from www.robots.ox.ac.uk/ viveka/focal calibration/ (corresponding to brier score and focal loss adaptive 53 at the above link). By avoiding training of new deep-net models with multiple hyperparameters, we also consequently avoided selection biases that inevitably creep in due to test-data-peeking. The predictions of the pre-trained models were obtained using the code at https://github.com/torrvision/focal calibration. 7 ETHICS STATEMENT Post-hoc calibration is a post-processing step that can be applied on top of miscalibrated machine learning models to increase their reliability. As such, we believe our work should improve the transparency and explainability of machine learning models. However, we outline a few limitations. Post-hoc calibration requires keeping aside a fresh, representative dataset, that was not used for training. If this dataset is too small, the resulting calibration guarantee can be too weak to be meaningful in practice. Further, if the test data distribution shifts in significant ways, additional corrections may be needed to recalibrate (Gupta et al., 2020; Podkopaev and Ramdas, 2021). A well calibrated classifier is not necessarily an accurate or a fair one, and vice versa (Kleinberg et al., 2017). Deploying calibrated models in critical applications like medicine, criminal law, banking, etc. does not preclude the possibility of the model being frequently wrong or unfair. ACKNOWLEDGEMENTS This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Towns et al., 2014). Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). CG s research was supported by the generous Bloomberg Data Science Ph.D. Fellowship. CG would like to thank Saurabh Garg and Youngseog Chung for interesting discussions, and Viveka Kulharia for help with the focal calibration repository. Finally, we thank Zack Lipton, the ICLR reviewers, and the ICLR area chair, for excellent feedback that helped improve the writing of the paper. Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131 151, 1999. 
Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Routledge, 2017. Luc Devroye. The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. The Annals of Statistics, 11(3):896 904, 1983. Luc Devroye. Automatic pattern recognition: A study of the probability of error. IEEE Transactions on pattern analysis and machine intelligence, 10(4):530 543, 1988. Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2): 243 268, 2007. Louis Gordon and Richard A Olshen. Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis, 15(2):147 163, 1984. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. Published as a conference paper at ICLR 2022 Chirag Gupta and Aaditya Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, 2021. Chirag Gupta, Aleksandr Podkopaev, and Aaditya Ramdas. Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Advances in Neural Information Processing Systems, 2020. Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. In International Conference on Learning Representations, 2021. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science, 2017. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009. Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, 2018. Meelis Kull, Telmo M. Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain wellcalibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052 5080, 2017. Meelis Kull, Miquel Perello-Nieto, Markus K angsepp, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems, 2019. Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, 2019. Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, 2018. G abor Lugosi and Andrew Nobel. Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24(2):687 706, 1996. 
Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip HS Torr, and Puneet K Dokania. Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems, 2020. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning, 2005. Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. ar Xiv preprint ar Xiv:1904.01685, 2020. Andrew Nobel. Histogram regression estimation using data-dependent partitions. The Annals of Statistics, 24(3):1084 1105, 1996. Kanil Patel, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. Multi-class uncertainty calibration via mutual information maximization-based binning. In International Conference on Learning Representations, 2021. John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61 74. MIT Press, 1999. Published as a conference paper at ICLR 2022 Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, 2021. Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. ar Xiv preprint ar Xiv:2001.11595, 2020. Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Richard Hartley, and Byron Boots. Intra orderpreserving functions for calibration of multi-class neural networks. In Advances in Neural Information Processing Systems, 2020. Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, and Michael C Mozer. Mitigating bias in calibration error estimation. ar Xiv preprint ar Xiv:2012.08668, 2020. J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating Scientific Discovery. Computing in Science & Engineering, 16(5):62 74, 2014. Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B Sch on. Evaluating model calibration in classification. In International Conference on Artificial Intelligence and Statistics, 2019. Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003. David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: a unifying framework. In Advances in Neural Information Processing Systems, 2019. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, 2001. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining, 2002. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016. Jize Zhang, Bhavya Kailkhura, and T Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020. 
Published as a conference paper at ICLR 2022 Algorithm 5: Post-hoc calibrator for a given M2B calibration notion C Input: Base (uncalibrated) multiclass predictor g, calibration data D p X1, Y1q, . . . , p Xn, Ynq, binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X 1 K Ð number of distinct calibration claims that C verifies; 2 for each claim k P r Ks do 3 From g, infer prc, rgq Ð plabel-predictor, probability-predictorq corresponding to claim k; 4 Dk Ð tp Xi, Ziqu, where Zi Ð 1 t Yi rcp Xiqu; 5 if conditioning does not include class prediction rc then 6 (confidence, top-K-confidence, and class-wise calibration) 7 hk Ð At0,1uprg, Dkq; 10 (top-label and top-K-label calibration) 11 for l P r Ls do 12 Dk,l Ð tp Xi, Ziq P Dk : rcp Xiq lqu; 13 hk,l Ð At0,1uprg, Dk,lq; 15 hkp q Ð hk,rcp qp q (hk predicts hk,lpxq if rcpxq l); 18 (the new predictor replaces each rg with the corresponding hk) 19 return plabel-predictor, hkq corresponding to each claim k P r Ks; Input for Algorithms 6 and 7: base multiclass predictor g : X Ñ L 1, calibration data D p X1, Y1q, . . . , p Xn, Ynq, binary calibrator At0,1u : r0, 1s X ˆ p X ˆ t0, 1uq Ñ r0, 1s X . Algorithm 6: Top-K-label calibrator 1 For every k P r Ks, infer from g the k-th largest class predictor cpkq and the associated probability gpkq; 2 for k Ð 1 to K do 3 for l Ð 1 to L do 4 Dk,l Ð tp Xi, 1 t Yi luq : cpkqp Xiq lu; 5 hpk,lq Ð At0,1upgpkq, Dk,lq; 7 hpkq Ð hpk,cpkqp qqp q; 9 return php1q, hp2q, . . . , hp Kqq; Algorithm 7: Top-K-confidence calibrator 1 For every k P r Ks, infer from g the k-th largest class predictor cpkq and the associated probability gpkq; 2 for k Ð 1 to K do 3 Dk Ð tp Xi, 1 t Yi luq : i P rnsu; 4 hpkq Ð At0,1upgpkq, Dkq; 6 return php1q, hp2q, . . . , hp Kqq; A ADDENDUM TO SECTION 3 CALIBRATION ALGORITHMS FROM CALIBRATION METRICS In Section 3, we introduced the concept of M2B calibration, and showed that popular calibration notions are in fact M2B notions (Table 1). We showed how the calibration notions of top-label, class-wise, and confidence calibration can be achieved using a corresponding M2B calibrator. In the following subsection, we present the general-purpose wrapper Algorithm 5 that can be used to derive an M2B calibrator from any given M2B calibration notion that follows the rubric specified by Table 1. In Appendix A.2, we illustrate the philosophy of M2B calibration using a simple example with a dataset that contains 6 points. This example also illustrates the top-label-calibrator, the classwise-calibrator, and the confidence-calibrator. Published as a conference paper at ICLR 2022 (a) Predictions of a fixed base model g : X Ñ 3 on calibration/test data D tpa, 3q, pb, 4q, . . . , pf, 1qu. (b) Confidence calibration (c) Top-label calibration (d) Class-wise calibration (e) Canonical calibration Figure 3: Illustrative example for Section A.2. The numbers in plot (a) correspond to the predictions made by g on a dataset D. If D were a test set, plots (b e) show how it should be used to verify if g satisfies the corresponding notion of calibration. Consequently, we argue that if D were a calibration set, and we want to achieve one of the notions (b e), then the data shown in the corresponding plots should be the data used to calibrate g as well. A.1 GENERAL-PURPOSE M2B CALIBRATOR Denote some M2B notion of calibration as C. Suppose C corresponds to K binary calibration claims. The outer for-loop in Algorithm 5, runs over each such claim in C. 
For example, for class-wise calibration, K L and for confidence and top-label calibration, K 1. Corresponding to each claim, there is a probability-predictor that the conditioning is to be done on, such as g or gl or gpkq. Additionally, there may be conditioning on the label predictor such as c or cpkq. These are denoted as prc, rgq in Algorithm 5. For confidence and top-label calibration, rc c, the top-label-confidence. For class-wise calibration, when rg gl, we have rcp q l. If there is no label conditioning in the calibration notion, such as in confidence, top-K-confidence, and class-wise calibration, then we enter the if-condition inside the for-loop. Here hk is learnt using a single calibration dataset and a single call to At0,1u. Otherwise, if there is label conditioning, such as in top-label and top-K-label calibration, we enter the else-condition, where we learn a separate hk,l for every l P r Ls, using a different part of the dataset Dl in each case. Then hkpxq equals hk,lpxq if rcpxq l. Finally, since C is verifying a sequence of claims, the output of Algorithm 5 is a sequence of predictors. Each original prediction prc, rgq corresponding to the C is replaced with prc, hkq. This is the output of the M2B calibrator. Note that the rc values are not changed. This output appears abstract, but normally, it can be represented in an interpretable way. For example, for class-wise calibration, the output is just a sequence of predictors, one for each class: ph1, h2, . . . , h Lq. This general-purpose M2B calibrators can be used to achieve any M2B calibration notion: toplabel calibration (Algorithm 2), class-wise calibration (Algorithm 3), confidence calibration (Algorithm 1), top-K-label calibration (Algorithm 6), and top-K-confidence calibration (Algorithm 7). A.2 AN EXAMPLE TO ILLUSTRATE THE PHILOSOPHY OF M2B CALIBRATION Figure 3a shows the predictions of a given base model g on a given dataset D. Suppose D is a test set, and we are testing confidence calibration. Then the only predictions that matter are the top-predictions corresponding to the shaded values. These are stripped out and shown in Figure 3b, in the gp q row. Note that the indicator 1 t Y cp qu is sufficient to test confidence calibration and given this, the cp Xq are not needed. Thus the second row in Figure 3b only shows these indicators. Published as a conference paper at ICLR 2022 Algorithm 8: Top-label histogram binning Input: Base multiclass predictor g, calibration data D p X1, Y1q, . . . , p Xn, Ynq Hyperparameter: # points per bin k P N (say 50), tie-breaking parameter δ ą 0 (say 10 10) Output: Top-label calibrated predictor pc, hq 1 c Ð classifier or top-class based on g; 2 g Ð top-class-probability based on g; 3 for l Ð 1 to L do 4 Dl Ð tp Xi, 1 t Yi luq : cp Xiq lqu and nl Ð |Dl|; 5 hl Ð Binary-histogram-binningpg, Dl, tnl{ku , δq; 7 hp q Ð hcp qp q; 8 return pc, hq; Verifying top-label calibration is similar (Figure 3c), but in addition to the predictions gp q, we also retain the values of cp q. Thus the gp q and 1 t Y cp qu are shown, but split across the 4 classes. Class-wise calibration requires access to all the predictions, however, each class is considered separately as indicated by Figure 3d. Canonical calibration looks at the full prediction vector in each case. However, in doing so, it becomes unlikely that gpxq gpyq for any x, y since the number of values that g can take is now exponential. Let us turn this around and suppose that D were a calibration set instead of a test set. 
We argue that D should be used in the same way, whether testing or calibrating. Thus, if confidence calibration is to be achieved, we should focus on the pg, 1 t Y cp quq corresponding to g. If top-label calibration is to be achieved, we should use the pc, gq values. If class-wise calibration is to be achieved, we should look at each gl separately and solve L different problems. Finally, for canonical calibration, we must look at the entire g vector as a single unit. This is the core philosophy behind M2B calibrators: if binary claims are being verified, solve binary calibration problems. B DISTRIBUTION-FREE TOP-LABEL CALIBRATION USING HISTOGRAM BINNING In this section, we formally describe histogram binning (HB) with the top-label-calibrator (Algorithm 2) and provide methodological insights through theory and experiments. B.1 FORMAL ALGORITHM AND THEORETICAL GUARANTEES Algorithm 8 describes the top-label calibrator formally using HB as the binary calibration algorithm. The function called in line 5 is Algorithm 2 of Gupta and Ramdas (2021). The first argument in the call is the top-label confidence predictor, the second argument is the dataset to be used, the third argument is the number of bins to be used, and the fourth argument is a tie-breaking parameter (described shortly). While previous empirical works on HB fixed the number of bins per class, the analysis of Gupta and Ramdas (2021) suggests that a more principled way of choosing the number of bins is to fix the number of points per bin. This is parameter k of Algorithm 8. Given k, the number of bins is decided separately for every class as tnl{ku where nl is the number of points predicted as class l. This choice is particularly relevant for top-label calibration since nl can be highly non-uniform (we illustrate this empirically in Section B.2). The tie-breaking parameter δ can be arbitrarily small (like 10 10), and its significance is mostly theoretical it is used to ensure that outputs of different bins are not exactly identical by chance, so that conditioning on a calibrated probability output is equivalent to conditioning on a bin; this leads to a cleaner theoretical guarantee. HB recalibrates g to a piecewise constant function h that takes one value per bin. Consider a specific bin b; the h value for this bin is computed as the average of the indicators t1 t Yi cp Xiqu : Xi P Bin bu. This is an estimate of the bias of the bin Pp Y cp Xq | X P Bin bq. A concentration inequality can then be used to bound the deviation between the estimate and the true bias to prove distribution-free calibration guarantees. In the forthcoming Theorem 1, we show high-probability and in-expectation bounds on the the TL-ECE of HB. Additionally, we show marginal and condi- Published as a conference paper at ICLR 2022 tional top-label calibration bounds, defined next. These notions were proposed in the binary calibration setting by Gupta et al. (2020) and Gupta and Ramdas (2021). In the definition below, A refers to any algorithm that takes as input calibration data D and an initial classifier g to produce a top-label predictor c and an associated probability map h. Algorithm 8 is an example of A. Definition 1 (Marginal and conditional top-label calibration). Let ε, α P p0, 1q be some given levels of approximation and failure respectively. An algorithm A : pg, Dq ÞÑ pc, hq is (a) pε, αq-marginally top-label calibrated if for every distribution P over X ˆ r Ls, P |Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε ě 1 α. 
(8) (b) (ε, α)-conditionally top-label calibrated if for every distribution P over X × [L], P(∀ l ∈ [L], r ∈ Range(h), |P(Y = c(X) | c(X) = l, h(X) = r) − r| ≤ ε) ≥ 1 − α. (9) To clarify, all probabilities are taken over the test point (X, Y) ∼ P, the calibration data D ∼ P^n, and any other inherent algorithmic randomness in A; these are all implicit in (c, h) = A(D, g). Marginal calibration asserts that with high probability, on average over the distribution of D, X, P(Y = c(X) | c(X), h(X)) is at most ε away from h(X). In comparison, TL-ECE is the average of these deviations over X. Marginal calibration may be a more appropriate metric for calibration than TL-ECE if we are somewhat agnostic to probabilistic errors less than some fixed threshold ε (like 0.05). Conditional calibration is a strictly stronger definition that requires the deviation to be at most ε for every possible prediction (l, r), including rare ones, not just on average over predictions. This may be relevant in medical settings where we want the prediction on every patient to be reasonably calibrated. Algorithm 8 satisfies the following calibration guarantees.

Theorem 1. Fix hyperparameters δ > 0 (arbitrarily small) and points per bin k ≥ 2, and assume nl ≥ k for every l ∈ [L]. Then, for any α ∈ (0, 1), Algorithm 8 is (ε1, α)-marginally and (ε2, α)-conditionally top-label calibrated for ε1 = √(log(2/α)/(2(k − 1))) + δ, and ε2 = √(log(2n/(kα))/(2(k − 1))) + δ. (10) Further, for any distribution P over X × [L], we have P(TL-ECE(c, h) ≤ ε2) ≥ 1 − α, and E[TL-ECE(c, h)] ≤ √(1/(2k)) + δ.

The proof in Appendix H is a multiclass top-label adaptation of the guarantee in the binary setting by Gupta and Ramdas (2021). The Õ(1/√k) dependence of the bound relies on Algorithm 8 delegating at least k points to every bin. Since δ can be chosen to be arbitrarily small, setting k = 50 gives roughly E_D[TL-ECE(h)] ≤ 0.1. Based on this, we suggest setting k ∈ [50, 150] in practice.

B.2 TOP-LABEL HISTOGRAM BINNING ADAPTS TO CLASS IMBALANCED DATASETS

The principled methodology of fixing the number of points per bin reaps practical benefits. Figure 4 illustrates this through the performance of HB on the class-imbalanced COVTYPE-7 dataset (Blackard and Dean, 1999), with class ratio approximately 36% for class 1 and 49% for class 2. The entire dataset has 581,012 points, which is divided into train-test in the ratio 70:30. Then, 10% of the training points are held out for calibration (n = |D| = 40,671). The base classifier is a random forest (RF) trained on the remaining training points (it achieves around 95% test accuracy). The RF is then recalibrated using HB. The top-label reliability diagrams in Figure 4a illustrate that the original RF (in orange) is underconfident on both the most likely and least likely classes. Additional figures in Appendix F show that the RF is always underconfident no matter which class is predicted as the top-label. HB (in green) recalibrates the RF effectively across all classes. Validity plots (Gupta and Ramdas, 2021) estimate how the LHS of condition (8), denoted as V(ε), varies with ε. We observe that for all ε, V(ε) is higher for HB. The rightmost barplot compares the estimated TL-ECE for all classes, and also shows the class proportions.
While the original RF is significantly miscalibrated for the less likely classes, HB has a more uniform miscalibration across classes. Figure 4b considers a slightly different HB algorithm where the number of points per class is not adapted to the number of times the class is predicted, but is fixed beforehand (this corresponds to replacing ⌊nl/k⌋ in line 5 of Algorithm 8 with a fixed B ∈ N). While even in this setting there is a drop in the TL-ECE compared to the RF model, the final profile is less uniform compared to fixing the number of points per bin. The validity plots and top-label reliability diagrams for all the 7 classes are reported in Figure 9 in Appendix F, along with some additional observations.

Figure 4: Recalibration of a random forest using histogram binning on the class-imbalanced COVTYPE-7 dataset (class 2 is roughly 100 times likelier than class 4). Panels show top-label reliability diagrams (predicted vs. true probability) and validity plots for classes 2 and 4 (random forest vs. histogram binning), together with a barplot of top-label ECE and class ratio per class. (a) Top-label histogram binning (Algorithm 8) with k = 100 points per bin. Class 4 has only 183 calibration points. Algorithm 8 adapts and uses only a single bin to ensure that the TL-ECE on class 4 is comparable to the TL-ECE on class 2. Overall, the random forest classifier has significantly higher TL-ECE for the least likely classes (4, 5, and 6), but the post-calibration TL-ECE using binning is quite uniform. (b) Histogram binning with B = 50 bins for every class. Compared to Figure 4a, the post-calibration TL-ECE for the most likely classes decreases while the TL-ECE for the least likely classes increases. By ensuring a fixed number of calibration points per bin, Algorithm 8 obtains relatively uniform top-label calibration across classes (Figure 4a). In comparison, if a fixed number of bins is chosen for all classes, the performance deteriorates for the least likely classes (Figure 4b).

C DISTRIBUTION-FREE CLASS-WISE CALIBRATION USING HISTOGRAM BINNING

In this section, we formally describe histogram binning (HB) with the class-wise-calibrator (Algorithm 3) and provide theoretical guarantees for it. The overall procedure is called class-wise-HB. Further details and background on HB are contained in Appendix B, where top-label-HB is described.

C.1 FORMAL ALGORITHM

To achieve class-wise calibration using binary routines, we learn each component function hl in a one-vs-all fashion as described in Algorithm 3. Algorithm 9 contains the pseudocode with the underlying routine as binary HB. To learn hl, we use a dataset Dl which, unlike in top-label HB (Algorithm 8), contains Xi even if c(Xi) ≠ l. However, the Yi is replaced with 1{Yi = l}. The number of points per bin kl can be different for different classes, but generally one would set k1 = . . . = kL = k ∈ N. Larger values of kl will lead to smaller εl and δl in the guarantees, at a loss of sharpness since the number of bins ⌊n/kl⌋ would be smaller.
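Both Algorithm 8 (Appendix B) and Algorithm 9, stated formally next, reduce calibration to calls to a binary histogram-binning routine. The following is a minimal numpy sketch of such a routine under the uniform-mass binning described in Appendix B.1; the function name, variable names, and the exact way the tie-breaking parameter δ perturbs the bin outputs are our own illustrative choices, not the interface of the released code.

```python
import numpy as np

def binary_histogram_binning(scores, labels, num_bins, delta=1e-10, seed=0):
    """Uniform-mass binary histogram binning (sketch).

    scores: confidence scores g(X_i) in [0, 1] on the calibration data
    labels: binary indicators (1 if the claim being calibrated holds for X_i)
    Returns a function mapping new scores to recalibrated probabilities.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    num_bins = max(1, min(num_bins, len(scores)))

    # Sort by score and split into num_bins bins with (roughly) the same
    # number of calibration points per bin (uniform-mass binning).
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    bins = np.array_split(np.arange(len(scores)), num_bins)

    # The output for a bin is the average of the indicators in that bin,
    # perturbed by a tiny delta-scaled noise so that distinct bins almost
    # surely output distinct values (the role of delta in the text).
    upper_edges = np.array([scores[idx[-1]] for idx in bins])
    bin_outputs = np.array([labels[idx].mean() + delta * rng.uniform()
                            for idx in bins])

    def h(new_scores):
        # Assign each new score to the first bin whose upper edge covers it.
        b = np.searchsorted(upper_edges, np.asarray(new_scores, dtype=float))
        return bin_outputs[np.clip(b, 0, num_bins - 1)]

    return h
```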
Algorithm 9: Class-wise histogram binning
Input: Base multiclass predictor g : X → ∆^{L−1}, calibration data D = (X1, Y1), . . . , (Xn, Yn)
Hyperparameters: # points per bin k1, k2, . . . , kL ∈ N^L (say each kl = 50), tie-breaking parameter δ > 0 (say 10^−10)
Output: L class-wise calibrated predictors h1, h2, . . . , hL
1 for l ← 1 to L do
2     Dl ← {(Xi, 1{Yi = l}) : i ∈ [n]};
3     hl ← Binary-histogram-binning(gl, Dl, ⌊n/kl⌋, δ);
4 end
5 return (h1, h2, . . . , hL);

C.2 CALIBRATION GUARANTEES

A general algorithm A for class-wise calibration takes as input calibration data D and an initial classifier g to produce an approximately class-wise calibrated predictor h : X → [0, 1]^L. Define the notation ε = (ε1, ε2, . . . , εL) ∈ (0, 1)^L and α = (α1, α2, . . . , αL) ∈ (0, 1)^L.

Definition 2 (Marginal and conditional class-wise calibration). Let ε, α ∈ (0, 1)^L be some given levels of approximation and failure respectively. An algorithm A : (g, D) ↦ h is (a) (ε, α)-marginally class-wise calibrated if for every distribution P over X × [L] and for every l ∈ [L], P(|P(Y = l | hl(X)) − hl(X)| ≤ εl) ≥ 1 − αl. (11) (b) (ε, α)-conditionally class-wise calibrated if for every distribution P over X × [L] and for every l ∈ [L], P(∀r ∈ Range(hl), |P(Y = l | hl(X) = r) − r| ≤ εl) ≥ 1 − αl. (12)

Definition 2 requires that each hl is (εl, αl)-calibrated in the binary senses defined by Gupta et al. (2021, Definitions 1 and 2). From Definition 2, we can also derive uniform bounds that hold simultaneously over every l ∈ [L]. Let α = Σ_{l=1}^{L} αl and ε = max_{l ∈ [L]} εl. Then (11) implies P(∀l ∈ [L], |P(Y = l | hl(X)) − hl(X)| ≤ ε) ≥ 1 − α, (13) and (12) implies P(∀l ∈ [L], r ∈ Range(hl), |P(Y = l | hl(X) = r) − r| ≤ ε) ≥ 1 − α. (14) The choice of not including the uniformity over L in Definition 2 reveals the nature of our class-wise HB algorithm and the upcoming theoretical guarantees: (a) we learn the hl's separately for each l and do not combine the learnt functions in any way (such as normalization), (b) we do not combine the calibration inequalities for different l ∈ [L] in any way other than a union bound. Thus the only way we can show (13) (or (14)) is by using a union bound over (11) (or (12)). We now state the distribution-free calibration guarantees satisfied by Algorithm 9.

Theorem 2. Fix hyperparameters δ > 0 (arbitrarily small) and points per bin k1, k2, . . . , kL ≥ 2, and assume n ≥ kl for every l ∈ [L]. Then, for every l ∈ [L], for any αl ∈ (0, 1), Algorithm 9 is (ε^(1), α)-marginally and (ε^(2), α)-conditionally class-wise calibrated with ε^(1)_l = √(log(2/αl)/(2(kl − 1))) + δ, and ε^(2)_l = √(log(2n/(kl αl))/(2(kl − 1))) + δ. (15) Further, for any distribution P over X × [L], (a) P(CW-ECE(c, h) ≤ max_{l ∈ [L]} ε^(2)_l) ≥ 1 − Σ_{l ∈ [L]} αl, and (b) E[CW-ECE(c, h)] ≤ max_{l ∈ [L]} √(1/(2kl)) + δ.

Theorem 2 is proved in Appendix H. The proof follows by using the result of Gupta and Ramdas (2021, Theorem 2), derived in the binary calibration setting, for each hl separately. Gupta and Ramdas (2021) proved a more general result for general ℓp-ECE bounds. Similar results can also be derived for the suitably defined ℓp-CW-ECE. As discussed in Section 3.2, unlike previous works (Zadrozny and Elkan, 2002; Guo et al., 2017; Kull et al., 2019), Algorithm 9 does not normalize the hl's. We do not know how to derive Theorem 2 style results for a normalized version of Algorithm 9.

D FIGURES FOR APPENDIX E

Appendix E begins on page 23. The relevant figures for Appendix E are displayed on the following pages.
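Before the Appendix E figures, the following sketch ties the two M2B procedures together: it shows how top-label HB (Algorithm 8) and class-wise HB (Algorithm 9) wrap the binary_histogram_binning routine sketched above. The function names and conventions (classes labelled 1 to L; no handling of classes that are never predicted on the calibration data) are our own illustrative choices, not the released implementation.

```python
import numpy as np

def top_label_hb(g, c, X_cal, Y_cal, points_per_bin=50, delta=1e-10):
    """Sketch of Algorithm 8: one binary HB problem per predicted class.

    g(x) -> top-label confidence in [0, 1]; c(x) -> predicted class in {1, ..., L}.
    """
    preds = np.asarray([c(x) for x in X_cal])
    confs = np.asarray([g(x) for x in X_cal], dtype=float)
    correct = (np.asarray(Y_cal) == preds).astype(float)   # 1{Y_i = c(X_i)}
    h_per_class = {}
    for l in np.unique(preds):
        mask = preds == l                    # D_l: points predicted as class l
        n_l = int(mask.sum())
        h_per_class[l] = binary_histogram_binning(
            confs[mask], correct[mask], n_l // points_per_bin, delta)

    def h(x):
        # The predicted top label is unchanged; only its confidence is remapped.
        return float(h_per_class[c(x)]([g(x)])[0])

    return c, h

def class_wise_hb(g_vec, X_cal, Y_cal, points_per_bin=50, delta=1e-10):
    """Sketch of Algorithm 9: one binary HB problem per class, using all points.

    g_vec(x) -> length-L probability vector; the returned h_l's are not normalized.
    """
    G = np.asarray([g_vec(x) for x in X_cal], dtype=float)   # n x L matrix
    Y = np.asarray(Y_cal)
    n, L = G.shape
    return [binary_histogram_binning(G[:, l], (Y == l + 1).astype(float),
                                     n // points_per_bin, delta)
            for l in range(L)]
```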
Figure 5: Table 2 style results with the number of bins varied as B ∈ [5, 25]. Each panel plots the estimated ECE against the number of bins for ResNet-50, ResNet-110, Wide-ResNet-26-10, and DenseNet-121. (a) TL-ECE estimates on CIFAR-10 with Brier score: TL-HB is close to the best in each case; while CW-HB performs the best at B = 15, the ECE estimate may not be reliable since it is highly variable across bins. (b) TL-ECE estimates on CIFAR-100 with Brier score: N-HB is the best performing method, while DS is the worst performing method, across different numbers of bins; TL-HB performs worse than TS and VS. (c) TL-MCE estimates on CIFAR-10 with Brier score: the only reliably and consistently well-performing method is TL-HB. (d) TL-MCE estimates on CIFAR-100 with Brier score: DS is the worst performing method; other methods perform comparably across different values of B. See Appendix E.5 for further details. The captions summarize the findings in each case. In most cases, the findings are similar to those with B = 15. The notable exception is that the performance of N-HB on CIFAR-10 for TL-ECE, while very good at B = 15, is quite inconsistent when seen across different bins. In some cases, the blue base model line and the orange temperature scaling line coincide. This occurs since the optimal temperature on the calibration data was learnt to be T = 1, which corresponds to not changing the base model at all.

Figure 6, panels (a)-(c) (the full caption follows panel (d) below); each panel plots the estimated ECE against the number of bins for the same four architectures. (a) TL-ECE estimates on CIFAR-10 with focal loss: TL-HB is close to the best in each case; while CW-HB performs the best at B = 15, the ECE estimate may not be reliable since it is highly variable across bins. (b) TL-ECE estimates on CIFAR-100 with focal loss: N-HB is the best performing method, while DS is the worst performing method, across different numbers of bins; TL-HB performs worse than TS and VS. (c) TL-MCE estimates on CIFAR-10 with focal loss: the only reliably and consistently well-performing method is TL-HB.
(d) TL-MCE estimates on CIFAR-100 with focal loss: DS is the worst performing method; other methods perform comparably across different values of B. Figure 6: Table 4 style results with the number of bins varied as B ∈ [5, 25]. See Appendix E.5 for further details. The captions summarize the findings in each case. In most cases, the findings are similar to those with B = 15. In some cases, the blue base model line and the orange temperature scaling line coincide. This occurs since the optimal temperature on the calibration data was learnt to be T = 1, which corresponds to not changing the base model at all.

Figure 7: Table 3 style results with the number of bins varied as B ∈ [5, 25]. Each panel plots the estimated ECE against the number of bins for ResNet-50, ResNet-110, Wide-ResNet-26-10, and DenseNet-121. (a) CW-ECE estimates on CIFAR-10 with Brier score: CW-HB is the best performing method across bins, and N-HB is quite unreliable. (b) CW-ECE estimates on CIFAR-100 with Brier score: CW-HB is the best performing method; DS and N-HB are the worst performing methods. The captions summarize the findings in each case, which are consistent with those in the table. See Appendix E.5 for further details.

Figure 8: Table 5 style results with the number of bins varied as B ∈ [5, 25]. (a) CW-ECE estimates on CIFAR-10 with focal loss: CW-HB is the best performing method across bins, and N-HB is quite unreliable. (b) CW-ECE estimates on CIFAR-100 with focal loss: CW-HB is the best performing method; DS and N-HB are the worst performing methods. The captions summarize the findings in each case, which are consistent with those in the table. See Appendix E.5 for further details.

E ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS FOR CIFAR-10 AND CIFAR-100

We present additional details and results to supplement the experiments with CIFAR-10 and CIFAR-100 in Sections 2 and 4 of the main paper.

E.1 EXTERNAL LIBRARIES USED

All our base models were pre-trained deep-net models generated by Mukhoti et al. (2020), obtained from www.robots.ox.ac.uk/~viveka/focal_calibration/ and used along with the code at https://github.com/torrvision/focal_calibration to obtain base predictions. We focused on the models trained with Brier score and focal loss, since these were found to perform the best for calibration. All reports in the main paper are with the Brier score; in Appendix E.4, we report corresponding results with focal loss.
We also used the code at https://github.com/torrvision/focal_calibration for temperature scaling (TS). For vector scaling (VS) and Dirichlet scaling (DS), we used the code of Kull et al. (2019), hosted at https://github.com/dirichletcal/dirichlet_python. For VS, we used the file dirichletcal/calib/vectorscaling.py, and for DS, we used the file dirichletcal/calib/fulldirichlet.py. No hyperparameter tuning was performed in any of our histogram binning experiments or baseline experiments; default settings were used in every case. The random seed was fixed so that every run of the experiment gives the same result. In particular, by relying on pre-trained models, we avoid training new deep-net models with multiple hyperparameters, thus avoiding any selection biases that may arise due to test-data peeking across multiple settings.

E.2 FURTHER COMMENTS ON BINNING FOR ECE ESTIMATION

As mentioned in Remark 1, ECE estimates for all methods except TL-HB and CW-HB were computed using fixed-width bins [0, 1/B), [1/B, 2/B), . . . , [1 − 1/B, 1] for various values of B ∈ [5, 25]. For TL-HB and CW-HB, B is the number of bins used for each call to binary HB. For TL-HB, note that we actually proposed that the number of points per bin (rather than the number of bins per class) should be fixed; see Section B.2. However, for ease of comparison to other methods, we simply set the number of bins to B for each call to binary HB. That is, in line 5 of Algorithm 8, we replace ⌊nl/k⌋ with B. For CW-HB, we described Algorithm 9 with different values of kl corresponding to the number of points per bin for each class. For the CIFAR-10 and CIFAR-100 comparisons, we set each k1 = k2 = . . . = kL = k, where k ∈ N satisfies ⌊n/k⌋ = B. Tables 2, 3, 4, and 5 report estimates with B = 15, which has been commonly used in many works (Guo et al., 2017; Kull et al., 2019; Mukhoti et al., 2020). Corresponding to each table, we have a figure where ECE estimates with varying B are reported to strengthen conclusions: these are Figures 5, 7, 6, and 8 respectively. Plug-in estimates of the ECE were used, same as Guo et al. (2017). Further binning was not done for TL-HB and CW-HB since the output is already discrete and sufficiently many points take each of the predicted values. Note that due to Jensen's inequality, any further binning will only decrease the ECE estimate (Kumar et al., 2019). Thus, using unbinned estimates may give TL-HB and CW-HB a disadvantage.

E.3 SOME REMARKS ON MAXIMUM-CALIBRATION-ERROR (MCE)

Guo et al. (2017) defined MCE with respect to confidence calibration, as follows: conf-MCE(c, h) := sup_{r ∈ Range(h)} |P(Y = c(X) | h(X) = r) − r|. (16) Conf-MCE suffers from the same issue illustrated in Figure 2 for conf-ECE. In Figure 1b, we looked at the reliability diagram within two bins. These indicate two of the values over which the supremum is taken in equation (16): these are the Y-axis distances between the markers and the X = Y line for bins 6 and 10 (both are less than 0.02). On the other hand, the effective maximum miscalibration for bin 6 is roughly 0.15 (for class 1) and roughly 0.045 (for class 4), and the maximum should be taken with respect to these values across all bins.
To remedy the underestimation of the effective MCE, we can consider the top-label-MCE, defined as TL-MCE(c, h) := max_{l ∈ [L]} sup_{r ∈ Range(h)} |P(Y = l | c(X) = l, h(X) = r) − r|. (17) Interpreted in words, the TL-MCE assesses the maximum deviation between the predicted and true probabilities across all predictions and all classes. Following the same argument as in the proof of Proposition 4, it can be shown that for any c, h, conf-MCE(c, h) ≤ TL-MCE(c, h). The TL-MCE is closely related to conditional top-label calibration (Definition 1b). Clearly, an algorithm is (ε, α)-conditionally top-label calibrated if and only if for every distribution P, P(TL-MCE(c, h) ≤ ε) ≥ 1 − α. Thus the conditional top-label calibration guarantee of Theorem 1 implies a high-probability bound on the TL-MCE as well.

Table 4: Top-label-ECE and top-label-MCE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2. The best performing method in each row is in bold.

Metric          Dataset     Architecture   Base    TS      VS      DS      N-HB    TL-HB
Top-label ECE   CIFAR-10    ResNet-50      0.022   0.023   0.018   0.019   0.023   0.019
                            ResNet-110     0.025   0.024   0.022   0.021   0.020   0.020
                            WRN-26-10      0.024   0.019   0.016   0.017   0.019   0.018
                            DenseNet-121   0.023   0.023   0.021   0.021   0.025   0.021
                CIFAR-100   ResNet-50      0.109   0.107   0.107   0.332   0.086   0.148
                            ResNet-110     0.124   0.117   0.105   0.316   0.115   0.153
                            WRN-26-10      0.100   0.100   0.101   0.293   0.074   0.135
                            DenseNet-121   0.106   0.108   0.105   0.312   0.091   0.147
Top-label MCE   CIFAR-10    ResNet-50      0.298   0.443   0.368   0.472   0.325   0.082
                            ResNet-110     0.378   0.293   0.750   0.736   0.535   0.089
                            WRN-26-10      0.741   0.582   0.311   0.363   0.344   0.075
                            DenseNet-121   0.411   0.411   0.243   0.391   0.301   0.099
                CIFAR-100   ResNet-50      0.289   0.355   0.234   0.640   0.322   0.273
                            ResNet-110     0.293   0.265   0.274   0.633   0.366   0.272
                            WRN-26-10      0.251   0.227   0.256   0.663   0.229   0.270
                            DenseNet-121   0.237   0.225   0.239   0.597   0.327   0.248

E.4 TABLE 2 AND 3 STYLE RESULTS WITH FOCAL LOSS

Results for top-label-ECE and top-label-MCE with the base deep-net model trained using focal loss are reported in Table 4. Corresponding results for class-wise-ECE are reported in Table 5. The observations are similar to the ones reported for Brier score:
1. For TL-ECE, TL-HB is either the best or close to the best performing method on CIFAR-10, but suffers on CIFAR-100. This phenomenon is discussed further in Appendix E.6. N-HB is the best or close to the best for both CIFAR-10 and CIFAR-100.
2. For TL-MCE, TL-HB is the best performing method on CIFAR-10, by a huge margin. For CIFAR-100, TS or VS perform better than TL-HB, but not by a huge margin.
3. For CW-ECE, CW-HB is the best performing method across the two datasets and all four architectures.

E.5 ECE AND MCE ESTIMATES WITH VARYING NUMBER OF BINS

Corresponding to each entry in Tables 2 and 4, we perform an ablation study with the number of bins varying as B ∈ [5, 25]. This is in keeping with the findings of Roelofs et al. (2020) that the ECE/MCE estimate, as well as the relative performance of the various models, can vary with the number of bins. The results are reported in Figure 5 (ablation of Table 2) and Figure 7 (ablation of Table 3). The captions of these figures contain further details on the findings. Most findings are similar to those in the main paper, but the findings in the tables are strengthened through this ablation. The same ablations are performed for focal loss as well.
The results are reported in Figure 6 (ablation of Table 4) and Figure 8 (ablation of Table 5). The captions of these figures contain further details on the findings. The ablation results in the figures support those in the tables.

Table 5: Class-wise-ECE for deep-net models and various post-hoc calibrators. All methods are the same as in Table 2, except top-label-HB is replaced with class-wise-HB or Algorithm 3 (CW-HB). The best performing method in each row is in bold.

Metric                 Dataset     Architecture   Base   TS     VS     DS     N-HB   CW-HB
Class-wise ECE (×10²)  CIFAR-10    ResNet-50      0.42   0.42   0.35   0.37   0.52   0.35
                                   ResNet-110     0.48   0.44   0.36   0.35   0.51   0.29
                                   WRN-26-10      0.41   0.31   0.31   0.35   0.49   0.27
                                   DenseNet-121   0.41   0.41   0.40   0.39   0.63   0.30
                       CIFAR-100   ResNet-50      0.22   0.20   0.20   0.66   0.23   0.16
                                   ResNet-110     0.24   0.23   0.21   0.72   0.24   0.16
                                   WRN-26-10      0.19   0.19   0.18   0.61   0.20   0.14
                                   DenseNet-121   0.20   0.21   0.19   0.66   0.24   0.16

E.6 ANALYZING THE POOR PERFORMANCE OF TL-HB ON CIFAR-100

CIFAR-100 is an imbalanced dataset with 100 classes and 5000 points for validation/calibration (as per the default splits). Due to random subsampling, the validation split we used had one of the classes predicted as the top-label only 31 times. Thus, based on Theorem 1, we do not expect HB to have small TL-ECE. This is confirmed by the empirical results presented in Tables 2/4 and Figures 5b/6b. We observe that HB has higher estimated TL-ECE than all methods except DS, for most values of the number of bins. The performance of TL-HB for TL-MCE, however, is much closer to the other methods, since HB uses the same number of points per bin, ensuring that the predictions are somewhat equally calibrated across bins (Figures 5d/6d). In comparison, for CW-ECE, CW-HB is the best performing method. This is because in the class-wise setting, 5000 points are available for recalibration irrespective of the class, which is sufficient for HB. The deterioration in the performance of HB when few calibration points are available was also observed in the binary setting by Gupta and Ramdas (2021, Appendix C). Niculescu-Mizil and Caruana (2005) noted in the conclusion of their paper that Platt scaling (Platt, 1999), which is closely related to TS, performs well when the available data is small, but another nonparametric binning method, isotonic regression (Zadrozny and Elkan, 2002), performs better when enough data is available. Kull et al. (2019, Section 4.1) compared HB to other calibration techniques for class-wise calibration on 21 UCI datasets, and found that HB performs the worst. On inspecting the UCI repository, we found that most of the datasets they used had fewer than 5000 (total) data points, and many contain fewer than 500. Overall, comparing our results to previous empirical studies, we believe that if sufficiently many points are available for recalibration, or the number of classes is small, then HB performs quite well. To be more precise, we expect HB to be competitive if at least 200 points per class can be held out for recalibration, and the number of points per bin is at least k ≥ 20.

F ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS FOR COVTYPE-7

We present additional details and results for the top-label HB experiment of Section B.2. The base classifier is an RF learnt using sklearn.ensemble.RandomForestClassifier with default parameters. The base RF is a nearly continuous base model since most of its predictions are unique. Thus, we need to use binning to make reliability diagrams and validity plots, and to perform ECE estimation, for the base model.
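For reference, the plug-in TL-ECE estimate discussed in Appendix E.2 can be sketched as follows: within each predicted class, bin the reported confidences, and average |empirical accuracy − average confidence| over all (class, bin) cells, weighted by cell frequency. This is a minimal illustration with our own names and the fixed-width-bin convention; for the base RF below, the bins are instead induced by the unique outputs of top-label HB, as described next.

```python
import numpy as np

def plugin_tl_ece(confs, preds, labels, num_bins=15):
    """Plug-in estimate of the top-label ECE (sketch).

    confs: reported top-label confidences h(X_i); preds: predicted labels c(X_i);
    labels: true labels Y_i. Fixed-width bins [0, 1/B), ..., [1 - 1/B, 1] are
    formed within each predicted class.
    """
    confs, preds, labels = map(np.asarray, (confs, preds, labels))
    n = len(confs)
    ece = 0.0
    for l in np.unique(preds):
        m = preds == l
        bin_ids = np.clip((confs[m] * num_bins).astype(int), 0, num_bins - 1)
        for b in range(num_bins):
            in_bin = bin_ids == b
            if not in_bin.any():
                continue
            acc = (labels[m][in_bin] == l).mean()   # empirical P(Y = l | cell)
            conf = confs[m][in_bin].mean()          # average reported confidence
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```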
To have a fair comparison, instead of having a fixed binning scheme to assess the base model, the binning scheme was decided based on the unique predictions of top-label HB. Thus for every l and r ∈ Range(hl), the bins are defined as {x : c(x) = l, hl(x) = r}. Due to this, while the base model in Figures 4a and 4b is the same, the reliability diagrams and validity plots in orange are different. As can be seen in the bar plots in Figure 4, the ECE estimation is not affected significantly. When k = 100, the total number of bins chosen by Algorithm 8 was 403, which is roughly 57.6 bins per class. The choice of B = 50 for the fixed-bins-per-class experiment was made on this basis.

Figure 9: Top-label histogram binning (HB) calibrates a miscalibrated random forest on the class-imbalanced COVTYPE-7 dataset. Each panel shows, for every class 1-7, a reliability diagram (predicted vs. true probability) and a validity plot. (a) Top-label HB with k = 100 points per bin. (b) Top-label HB with B = 50 bins per class. For the less likely classes (4, 5, and 6), the left column is better calibrated than the right column. Similar observations are made on other datasets, and so we recommend adaptively choosing a different number of bins per class, as Algorithm 8 does.

Figure 9 supplements Figure 4 in the main paper by presenting reliability diagrams and validity plots of top-label HB for all classes. Figure 9a presents the plots with an adaptive number of bins per class (Algorithm 8), and Figure 9b presents these for a fixed number of bins per class. We make the following observations. (a) For every class l ∈ [L], the RF is overconfident.
This may seem surprising at first since we generally expect that models may be overconfident for certain classes and underconfident for others. However, note that all our plots assess top-label calibration, that is, we are assessing the predicted and true probabilities of only the predicted class. It is possible that a model is overconfident for every class whenever that class is predicted to be the top-label. (b) For the most likely classes, namely classes 1 and 2, the number of bins in the adaptive case is higher than 50. Fewer bins leads to better calibration (at the cost of sharpness). This can be verified through the validity plots for classes 1 and 2 the validity plots in the fixed bins case is slightly above the validity plot in the adaptive bin case. However both validity plots are quite similar. (c) The opposite is true for the least likely classes, namely classes 4, 5, 6. The validity plot in the fixed bins case is below the validity plot in the adaptive bins case, indicating higher TL-ECE in the fixed bins case. The difference between the validity plots is high. Thus if a fixed number of bins per class is pre-decided, the performance for the least likely classes significantly suffers. Based on these observations, we recommend adaptively choosing the number of bins per class, as done by Algorithm 8. G BINNING-BASED CALIBRATORS FOR CANONICAL MULTICLASS CALIBRATION Canonical calibration is a notion of calibration that does not fall in the M2B category. To define canonical calibration, we use Y to denote the output as a 1-hot vector. That is, Yi e Yi P L 1, where el corresponds to the l-th canonical basis vector in Rd. Recall that a predictor h ph1, h2, . . . , h Lq is said to be canonically calibrated if Pp Y l | hp Xqq hlp Xq for every l P r Ls. Equivalently, this can be stated as E r Y | hp Xqs hp Xq. Canonical calibration implies class-wise calibration: Proposition 1. If E r Y | hp Xqs hp Xq, then for every l P r Ls, Pp Y l | hlp Xqq hlp Xq. The proof in Appendix H is straightforward, but the statement above is illuminating, because there exist predictors that are class-wise calibrated but not canonically calibrated (Vaicenavicius et al., 2019, Example 1). Canonical calibration is not an M2B notion since the conditioning occurs on the L-dimensional prediction vector predp Xq hp Xq, and after this conditioning, each of the L statements Pp Y l | predp Xqq hlp Xq should simultaneously be true. On the other hand, M2B notions verify only individual binary calibration claims for every such conditioning. Since canonical calibration does not fall in the M2B category, Algorithm 5 does not lead to a calibrator for canonical calibration. In this section, we discuss alternative binning-based approaches to achieving canonical calibration. For binary calibration, there is a complete ordering on the interval r0, 1s, and this ordering is leveraged by binning based calibration algorithms. However, L 1, for L ě 3 does not have such a natural ordering. Hence, binning algorithms do not obviously extend for multiclass classification. In this section, we briefly discuss some binning-based calibrators for canonical calibration. Our descriptions are for general L ě 3, but we anticipate these algorithms to work reasonably only for small L, say if L ď 5. As usual, denote g : X Ñ L 1 as the base model and h : X Ñ L 1 as the model learnt using some post-hoc canonical calibrator. 
For canonical calibration, we can surmise binning schemes that directly learn h by partitioning the prediction space L 1 into bins and estimating the distribution of Y in each bin. A canonical calibration guarantee can be showed for such a binning scheme using multinomial concentration (Podkopaev and Ramdas, 2021, Section 3.1). However, since Volp L 1q 2Θp Lq, there will either be a bin whose volume is 2Ωp Lq (meaning that h would Published as a conference paper at ICLR 2022 not be sharp), or the number of bins will be 2Ωp Lq, entailing 2Ωp Lq requirements on the sample complexity a curse of dimensionality. Nevertheless, let us consider some binning schemes that could work if L is small. Formally, a binning scheme corresponds to a partitioning of L 1 into B ě 1 bins. We denote this binning scheme as B : L 1 Ñ r Bs, where Bpsq corresponds to the bin to which s P L 1 belongs. To learn h, the calibration data is binned to get sets of data-point indices that belong to each bin, depending on the gp Xiq values: for every b P r Bs, Tb : ti : Bpgp Xiqq bu, nb |Tb| . We then compute the following estimates for the label probabilities in each bin: for every pl, bq P r Ls ˆ r Bs, pΠl,b : i PTb 1 t Yi lu nb if nb ą 0 else pΠl,b 1{B. The binning predictor h : X Ñ L 1 is now defined component-wise as follows: for every l P r Ls, hlpxq pΠl,Bpxq. In words, for every bin b P r Bs, h predicts the empirical distribution of the Y values in bin b. Using a multinomial concentration inequality (Devroye, 1983; Qian et al., 2020; Weissman et al., 2003), calibration guarantees can be shown for the learnt h. Podkopaev and Ramdas (2021, Theorem 3) show such a result using the Bretagnolle-Huber-Carol inequality. All of these concentration inequality give bounds that depend inversely on nb or ?nb. In the following subsections, we describe some binning schemes which can be used for canonical calibration based on the setup illustrated above. First we describe fixed schemes that are not adaptive to the distribution of the data: Sierpinski binning (Appendix G.1) and grid-style binning (Appendix G.2). These are analogous to fixed-width binning for L 2. Fixed binning schemes are not adapted to the calibration data and may have highly imbalanced bins leading to poor estimation of the distribution of Y in bins with small nb. In the binary case, this issue is remedied using histogram binning to ensure that each bin has nearly the same number of calibration points (Gupta and Ramdas, 2021). While histogram binning uses the order of the scalar gp Xiq values, there is no obvious ordering for the multi-dimensional gp Xiq values. In Appendix G.3 we describe a projection based histogram binning scheme that circumvents this issue and ensures that each nb is reasonably large. In Appendix G.4, we present some preliminary experimental results using our proposed binning schemes. Certain asymptotic consistency results different from calibration have been established for histogram regression and classification in the nonparametric statistics literature (Nobel, 1996; Lugosi and Nobel, 1996; Gordon and Olshen, 1984; Breiman et al., 2017; Devroye, 1988); further extensive references can be found within these works. The methodology of histogram regression and classification relies on binning and is very similar to the one we propose here. The main difference is that these works consider binning the feature space X directly, unlike the post-hoc setting where we are essentially interested in binning L 1. 
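As a concrete reference for the recipe above, the following is a minimal numpy sketch of the generic bin-and-average canonical calibrator for an arbitrary binning function B; Sierpinski binning, grid-style binning, and the projection-based scheme of Algorithm 10 differ only in the binning function that is passed in. The names are ours, and, as one illustrative choice, empty bins default to the uniform distribution over labels.

```python
import numpy as np

def fit_canonical_binning(binning, g_cal, Y_cal, num_bins, L):
    """Generic binning-based canonical calibrator (sketch).

    binning: maps a probability vector in the simplex to a bin index in {0, ..., B-1}
    g_cal:   n x L array of base-model predictions g(X_i) on the calibration data
    Y_cal:   true labels in {1, ..., L}
    Returns h, mapping a new prediction vector to the empirical label
    distribution of the calibration points falling in its bin.
    """
    g_cal = np.asarray(g_cal, dtype=float)
    Y_cal = np.asarray(Y_cal)
    Pi = np.full((num_bins, L), 1.0 / L)        # default for empty bins: uniform
    bin_ids = np.array([binning(s) for s in g_cal])
    for b in range(num_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # empirical distribution of Y over the calibration points in bin b
            Pi[b] = np.bincount(Y_cal[in_bin] - 1, minlength=L) / in_bin.sum()

    def h(s):
        return Pi[binning(np.asarray(s, dtype=float))]

    return h
```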
In terms of theory, the results these works target are asymptotic consistency for the (Bayes) optimal classification and regression functions, instead of canonical calibration. It would be interesting to consider the (finite-sample) canonical calibration properties of the various algorithms proposed in the context of histogram classification. However, such a study is beyond the scope of this paper. G.1 SIERPINSKI BINNING First, we describe Sierpinski binning for L 3. The probability simplex for L 3, 2, is a triangle with vertices e1 p1, 0, 0q, e2 p0, 1, 0q, and e3 p0, 0, 1q. Sierpinski binning is a recursive partitioning of this triangle based on the fractal popularly known as the Sierpinski triangles. Some Sierpinski bins for L 3 are shown in Figure 10. Formally, we define Sierpinski binning recursively based on a depth parameter q P N. Given an x P X, let s gpxq. For q 1, the number of bins is B 4, and the binning scheme B is defined as: 1 if s1 ą 0.5 2 if s2 ą 0.5 3 if s3 ą 0.5 4 otherwise. Published as a conference paper at ICLR 2022 2 q 1 q 2 q 3 Figure 10: Sierpinski binning for L 3. The leftmost triangle represents the probability simplex 2. Sierpinski binning divides 2 recursively based on a depth parameter q P N. Note that since s1 s2 s3 1, only one of the above conditions can be true. It can be verified that each of the bins have volume equal to p1{4q-th the volume of the probability simplex 2. If a finer resolution of 2 is desired, B can be increased by further dividing the partitions above. Note that each partition is itself a triangle; thus each triangle can be mapped to 2 to recursively define the sub-partitioning. For i P r4s, define the bins bi ts : Bpsq iu. Consider the bin b1. Let us reparameterize it as pt1, t2, t3q p2s1 1, 2s2, 2s3q. It can be verified that b1 tpt1, t2, t3q : s1 ą 0.5u tpt1, t2, t3q : t1 t2 t3 1, t1 P p0, 1s, t2 P r0, 1q, t3 P r0, 1qu. Based on this reparameterization, we can recursively sub-partition b1 as per the scheme (18), replacing s with t. Such reparameterizations can be defined for each of the bins defined in (18): b2 tps1, s2, s3q : s2 ą 0.5u : pt1, t2, t3q p2s1, 2s2 1, 2s3q, b3 tps1, s2, s3q : s3 ą 0.5u : pt1, t2, t3q p2s1, 2s2, 2s3 1q, b4 tps1, s2, s3q : si ď 0.5 for all iu : pt1, t2, t3q p1 2s1, 1 2s2, 1 2s3q, and thus every bin can be recursively sub-partitioned as per (18). As illustrated in Figure 10, for Sierpinski binning, we sub-partition only the bins b1, b2, b3 since the bin b4 corresponds to low confidence for all labels, where finer calibration may not be needed. (Also, in the L ą 3 case described shortly, the corresponding version of b4 is geometrically different from L 1, and the recursive partitioning cannot be defined for it.) If at every depth, we sub-partition all bins except the corresponding b4 bins, then it can be shown using simple algebra that the total number of bins is p3q 1 1q{2. For example, in Figure 10, when q 2, the number of bins is B 14, and when q 3, the number of bins is B 40. As in the case of L 3, Sierpinski binning for general L is defined through a partitioning function of L 1 into L 1 bins, and a reparameterization of the partitions so that they can be further sub-partitioned. The L 1 bins at depth q 1 are defined as Bpsq " l if sl ą 0.5, L 1 otherwise. (19) While this looks similar to the partitioning (18), the main difference is that the bin b L 1 has a larger volume than other bins. 
Indeed for l P r Ls, volpblq volp L 1q{2L 1, while volpb L 1q volp L 1qp1 L{2L 1q ě volp L 1q{2L 1, with equality only occuring for L 3. Thus the bin b L 1 is larger than the other bins. If gpxq P b L 1, then the prediction for x may be not be very sharp, compared to if gpxq were in any of the other bins. On the other hand, if gpxq P b L 1, the score for every class is smaller than 0.5, and sharp calibration may often not be desired in this region. In keeping with this understanding, we only reparameterize the bins b1, b2, . . . , b L so that they can be further divided: bl tps1, s2, . . . , s Lq : sl ą 0.5u : pt1, t2, . . . , t Lq p2s1, . . . , 2sl 1, . . . , 2s Lq. For every l P r Ls, under the reparameterization above, it is straightforward to verify that tpt1, t2, . . . , t Lq : sl ą 0.5u tpt1, t2, . . . , t Lq : ÿ u Pr Ls tu 1, tl P p0, 1s, tu P r0, 1q @u lu. Thus every bin can be recursively sub-partitioned following (19). For Sierpinski binning with L labels, the number of bins at depth q is given by p Lq 1 1q{p L 1q. Published as a conference paper at ICLR 2022 K 4, τ 0.25 0 0.2 0.4 0.6 0.8 1 Figure 11: Grid-style binning for L 3. G.2 GRID-STYLE BINNING Grid-style binning is motivated from the 2D reliability diagrams of Widmann et al. (2019, Figure 1), where they partitioned 2 into multiple equi-volume bins in order to assess canonical calibration on a 3-class version of CIFAR-10. For L 3, 2 can be divided as shown in Figure 11. This corresponds to gridding the space 2, just the way we think of gridding the real hyperplane. However, the mathematical description of this grid for general L is not apparent from Figure 11. We describe grid-style binning formally for general L ě 3. Consider some τ ą 0 such that K : 1{τ P N. For every tuple k pk1, k2, . . . , k Lq in the set I tk P NL : maxp L, K 1q ď ÿ l Pr Ls kl ď K p L 1qu, (20) define the bins bk : ts P L 1 : for every l P r Ls, sl K P rkl 1, kls. (21) These bins are not mutually disjoint, but intersections can only occur at the edges. That is, for every s that belongs to more than one bin, at least one component sl satisies sl K P N. In order to identify a single bin s, ties can be broken arbitrarily. One strategy is to use some ordering on NL; say for k1 k2 P N, k1 ă k2 if and only if for the first element of k1 that is unequal to the corresponding element of k2 the one corresponding to k1 is smaller. Then define the binning function B : L 1 Ñ |I| as Bpsq mintk : s P bku. The following propositions prove that a) each s belongs to at least one bin, and b) that every bin is an L 1 dimensional object (and thus a meaningful partition of L 1). Proposition 2. The bins tbk : k P Iu defined by (21) mutually exhaust L 1. Proof. Consider any s P L 1. For sl K R N t1, 2, . . .u, set kl maxp1, rsl Ksq ą sl K. Consider the condition C : for all l such that sl K R N, sl K 0. If C is true, then for l such that sl K P N, set kl sl K. If C is not true, then for l such that sl K P N, set exactly one kl sl K 1, and for the rest set kl sl K. Based on this setting of k, it can be verified that s P bk. Further, note that for every l, kl ě sl K, and there exists at least one l such that kl ą sl K. Thus we have: Since řL l 1 kl P N, we must have řL l 1 kl ě K 1. Further since each kl P N, řL l 1 kl ě L. Next, note that for every l, kl ď sl K 1. If C is true, then there is at least one l such that sl K P N, and for this l, we have set kl sl K ă sl K 1. 
If C is not true, then either there exists at least one l such that sl K R N Y t0u for which kl rsl Ks ă sl K 1, or every sl K P N, in which case we Published as a conference paper at ICLR 2022 have set kl sl K for one of them. In all cases, note that there exists an l such that kl ă sl K 1. Thus, l 1 psl K 1q Since řL l 1 kl P N, we must have řL l 1 kl ď K L 1. Next, we show that each bin indexed by k P I contains a non-zero volume subset of L 1, where volume is defined with respect to the Lebesgue measure in RL 1. This can be shown by arguing that bk contains a scaled and translated version of L 1. Proposition 3. For every k P I, there exists some u P RL and v ą 0 such that u v L 1 Ď bk. Proof. Fix some k P I. By condition (20), řL l 1 kl P rmaxp L, K 1q, K L 1s. Based on this, we claim that there exists a τ P r0, 1q such that l 1 pkl 1q τL p1 τq K. (22) Indeed, note that for τ 0, řL l 1pkl 1q τL p1 τq ď p K 1q 1 K and for τ 1, řL l 1pkl 1q τL p1 τq řL l 1 kl ą K. Thus, there exists a τ that satisfies (22) by the intermediate value theorem. Next, define u K 1pk pτ 1q1Lq and v K 1p1 τq ą 0, where 1L denotes the vector in RL with each component equal to 1. Consider any s P u v L 1. Note that for every l P r Ls, sl K P rkl 1, kls and by property (22), l 1 pkl pτ 1qq l 1 pkl 1q τL p1 τq K. Thus, s P L 1 and by the definition of bk, s P bk. This completes the proof. The previous two propositions imply that B satisfies the property we require of a reasonable binning scheme. For L 3, grid-style binning gives equi-volume bins as illustrated in Figure 11; however this is not true for L ą 3. We now describe a histogram binning based partitioning scheme. G.3 PROJECTION BASED HISTOGRAM BINNING FOR CANONICAL CALIBRATION Some of the bins defined by Sierpinski binning and grid-style binning may have very few calibration points nb, leading to poor estimation of pΠ. In the binary calibration case, this can be remedied using histogram binning which strongly relies on the scoring function g taking values in a fully ordered space r0, 1s. To ensure that each bin contains Ωpn{Bq points, we estimate the quantiles of gp Xq and created the bins as per these quantiles. However, there are no natural quantiles for unordered prediction spaces such as L 1 (L ě 3). In this section, we develop a histogram binning scheme for L 1 that is semantically interpretable and has desirable statistical properties. Our algorithm takes as input a prescribed number of bins B and an arbitrary sequence of vectors q1, q2, . . . , q B 1 P RL with unit ℓ2-norm: }qi}2 1. Each of these vectors represents a direction on which we will project L 1 in order to induce a full order on L 1. Then, for each direction, we will use an order statistics on the induced full order to identify a bin with exactly tpn 1q{Bu 1 calibration points (except the last bin, which may have more points). The formal algorithm is described in Algorithm 10. It uses the following new notation: given m vectors v1, v2, . . . , vm P RL, a unit vector u, and an index j P rms, let order-statisticsptv1, v2, . . . , vmu, u, jq denote the j-th order-statistics of tv T 1 u, v T 2 u, . . . , v T muu. Published as a conference paper at ICLR 2022 Algorithm 10: Projection histogram binning for canonical calibration Input: Base multiclass predictor g : X Ñ L 1, calibration data D tp X1, Y1q, p X2, Y2q, . . . , p Xn, Ynqu Hyperparameter: number of bins B, unit vectors q1, q2, . . . , q B P RL, Output: Approximately calibrated scoring function h 1 S Ð tgp X1q, gp X2q, . . . 
, gp Xnqu; 2 T Ð empty array of size B; 3 c Ð t n 1 4 for b Ð 1 to B 1 do 5 Tb Ð order-statisticsp S, qb, cq; 6 S Ð Sztv P S : v T qb ď Tbu; 8 TB Ð 1.01; 9 Bpgp qq Ð mintb P r Bs : gp q T qb ă Tbu; 10 pΠ Ð empty matrix of size B ˆ L; 11 for b Ð 1 to B do 12 for l Ð 1 to L do 13 pΠb,l Ð Meant1 t Yi lu : Bpgp Xiqq b and @s P r Bs, gp Xiq T qs Tsu; 16 for l Ð 1 to L do 17 hlp q Ð pΠBpgp qq,l; 19 return h; We now briefly describe some values computed by Algorithm 10 in words to build intuition. The array T, which is learnt on the data, represents the identified thresholds for the directions given by q. Each pqb, Tbq pair corresponds to a hyperplane that cuts L 1 into two subsets given by tx P L 1 : x T qb ă Tbu and tx P L 1 : x T qb ě Tbu. The overall partitioning of L 1 is created by merging these cuts sequentially. This defines the binning function B. By construction, the binning function is such that each bin contains at least t n 1 B u 1 many points in its interior. As suggested by Gupta and Ramdas (2021), we do not include the points that lie on the boundary, that is, points Xi that satisfy gp Xiq T qs Ts for some s P r Bs. The interior points bins are then used to estimate the bin biases pΠ. No matter how the q-vectors are chosen, the bins created by Algorithm 10 have at least X n B \ 1 points for bias estimation. However, we discuss some simple heuristics for setting q that are semantically meaningful. For some intuition, note that the binary version of histogram binning Gupta and Ramdas (2021, Algorithm 1) is essentially recovered by Algorithm 10 if L 2 by setting each qb as e2 (the vector r0, 1s). Equivalently, we can set each qb to e1 since both are equivalent for creating a projection-based order on 2. Thus for L ě 3, a natural strategy for the q-vectors is to rotate between the canonical basis vectors: q1 e1, q2 e2, . . . , q L e L, q L 1 e1, . . . , and so on. Projecting with respect to el focuses on the class l by forming a bin corresponding to the largest values of glp Xiq among the remaining Xi s which have not yet been binned. (On the other hand, projecting with respect to el will correspond to forming a bin with the smallest values of glp Xiq.) The q-vectors can also be set adaptively based on the training data (without seeing the calibration data). For instance, if most points belong to a specific class l P r Ls, we may want more sharpness for this particular class. In that case, we can choose a higher ratio of the q-vectors to be el. G.4 EXPERIMENTS WITH THE COVTYPE DATASET In Figure 12 we illustrate the binning schemes proposed in this section on a 3-class version of the COVTYPE-7 dataset considered in Section B.2. As noted before, this is an imbalanced dataset where classes 1 and 2 dominate. We created a 3 class problem with the classes 1, 2, and other (as Published as a conference paper at ICLR 2022 (a) Calibration using Sierpinski binning at depth q 2. (b) Calibration using grid-style binning with K 5, τ 0.2. (c) Projection-based HB with B 27 projections: q1 e1, q2 e2, . . . , q4, e1, . . ., and so on. (d) Projection-based HB with B 27 random projections (qi drawn uniformly from the ℓ2-unit-ball in R3). Figure 12: Canonical calibration using fixed and histogram binning on a 3-class version of the COVTYPE-7 dataset. The reliability diagrams (left) indicate that all forms of binning improve the calibration of the base logistic regression model. 
The recalibration diagrams (right) are a scatter plot of the predictions gp Xq on the test data with the points colored in 10 different colors depending on their bin. For every bin, the arrow-tail indicates the mean probability predicted by the base model g whereas the arrow-head indicates the probability predicted by the updated model h. class 3). The entire dataset has 581012 points and the ratio of the classes is approximately 36%, 49%, 15% respectively. The dataset was split into train-test in the ratio 70:30. The training data was further split into modeling-calibration in the ratio 90:10. A logistic regression model g using sklearn.linear model.Logistic Regression was learnt on the modeling data, and g was recalibrated on the calibration data. The plots on the right in Figure 12 are recalibration diagrams. The base predictions gp Xq on the test-data are displayed as a scatter plot on 2. Points in different bins are colored using one of 10 Published as a conference paper at ICLR 2022 different colors (since the number of bins is larger than 10, some colors correspond to more than one bin). For each bin, an arrow is drawn, where the tail of the arrow corresponds to the average gp Xq predictions in the bin and the head of the arrow corresponds to the recalibrated hp Xq prediction for the bin. For bins that contained very few points, the arrows are suppressed for visual clarity. The plots on the left in Figure 12 are validity plots (Gupta and Ramdas, 2021). Validity plots display estimates of V pεq Ptest-data p}E r Y | gp Xqs gp Xq}1 ď εq , as ε varies (g corresponds to the validity plot for logistic regression; replacing g with h above gives plots for the binning based classifier h). For logistic regression, the same binning scheme as the one provided by h is used to estimate V pεq. Overall, Figure 12 shows that all of the binning approaches improve the calibration of the original logistic regression model across different ε. However, the recalibration does not change the original model significantly. Comparing the different binning methods to each other, we find that they all perform quite similarly. It would be interesting to further study these and other binning methods for post-hoc canonical calibration. Proofs appear in separate subsections, in the same order as the corresponding results appear in the paper and Appendix. Proposition 4 was stated informally, so we state it formally as well. H.1 STATEMENT AND PROOF OF PROPOSITION 4 Proposition 4. For any predictor pc, hq, conf-ECEpc, hq ď TL-ECEpc, hq. Proof. To avoid confusion between the the conditioning operator and the absolute value operator | |, we use abs p q to denote absolute values below. Note that, abs p Pp Y cp Xq | hp Xqq hp Xqq abs p E r1 t Y cp Xqu | hp Xqs hp Xqq abs p E r1 t Y cp Xqu hp Xq | hp Xqsq abs p E r E r1 t Y cp Xqu hp Xq | hp Xq, cp Xqs | hp Xqsq ď E rabs p E r1 t Y cp Xqu hp Xq | hp Xq, cp Xqsq | hp Xqs (by Jensen s inequality) E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqq | hp Xqs . conf-ECEpc, hq E rabs p Pp Y cp Xq | hp Xqq hp Xqqs ď E r E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqq | hp Xqss E rabs p Pp Y cp Xq | hp Xq, cp Xqq hp Xqqs TL-ECEpc, hq. H.2 PROOF OF THEOREM 1 The proof strategy is as follows. First, we use the bound of Gupta and Ramdas (2021, Theorem 4) (henceforth called the GR21 bound), derived in the binary calibration setting, to conclude marginal, conditional, and ECE guarantees for each hl. 
Then, we show that the binary guarantees for the individual hl s leads to a top-label guarantee for the overall predictor pc, hq. Consider any l P r Ls. Let Pl denote the conditional distribution of p X, 1 t Y luq given cp Xq l. Clearly, Dl is a set of nl i.i.d. samples from Pl, and hl is learning a binary calibrator with respect to Pl using binary histogram binning. The number of data-points is nl and the number of bins is Bl tnl{ku bins. We now apply the GR21 bounds on hl. First, we verify that the condition they require is satisfied: nl ě k tnl{ku ě 2Bl. Published as a conference paper at ICLR 2022 Thus their marginal calibration bound for hl gives, |Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď δ logp2{αq 2ptnl{Blu 1q | cp Xq l Note that since tnl{Blu tnl{ tnl{kuu ě k, logp2{αq 2pk 1q ě δ logp2{αq 2ptnl{Blu 1q. Thus we have P p|Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď ε1 | cp Xq lq ě 1 α. This is satisfied for every l. Using the law of total probability gives us the top-label marginal calibration guarantee for pc, hq: Pp|Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε1q l 1 Ppcp Xq lq Pp|Pp Y cp Xq | cp Xq, hp Xqq hp Xq| ď ε1 | cp Xq lq (law of total probability) l 1 Ppcp Xq lq Pp|Pp Y l | cp Xq l, hlp Xqq hlp Xq| ď ε1 | cp Xq lq (by construction, if cpxq l, hpxq hlpxq) l 1 Ppcp Xq lqp1 αq Similarly, the in-expectation ECE bound of GR21, for p 1, gives for every l, E |Pp Y l | cp Xq l, hlp Xqq hlp Xq | cp Xq l| ď a tnl{ku {2nl δ E|Pp Y cp Xq | cp Xq, hlp Xqq hp Xq| l 1 Ppcp Xq lq E|Pp Y l | cp Xq l, hlp Xqq hlp Xq| | cp Xq l l 1 Ppcp Xq lqp a Next, we show the top-label conditional calibration bound. Let B řL l 1 Bl and αl αBl{B. Note that B ď řL l 1 nl{k n{k. The binary conditional calibration bound of GR21 gives @r P Rangephlq, |Pp Y l | cp Xq l, hlp Xq rq r| ď δ logp2Bl{αlq 2ptnl{Blu 1q | cp Xq l Note that d logp2Bl{αlq 2ptnl{Blu 1q logp2B{αq 2ptnl{Blu 1q Published as a conference paper at ICLR 2022 logp2n{kαq 2ptnl{Blu 1q (since B ď n{k) 2pk 1q (since k ď tnl{Blu). Thus for every l P r Ls, Pp@r P Rangephlq, |Pp Y l | cp Xq l, hlp Xq rq r| ď ε2q ě 1 αl. By construction of h, conditioning on cp Xq l and hlp Xq r is the same as conditioning on cp Xq l and hp Xq r. Taking a union bound over all L gives Pp@l P r Ls, r P Rangephq, |Pp Y cp Xq | cp Xq l, hp Xq rq r|q ď ε2q l 1 αl 1 α, proving the conditional calibration result. Finally, note that if for every l P r Ls, r P Rangephq, |Pp Y cp Xq | cp Xq l, hp Xq rq r| ď ε2, then also TL-ECEpc, hq E|Pp Y cp Xq | hp Xq, cp Xqq hp Xq| ď ε2. This proves the high-probability bound for the TL-ECE. Remark 3. Gupta and Ramdas (2021) proved a more general result for general ℓp-ECE bounds. Similar results can also be derived for the suitably defined ℓp-TL-ECE. Additionally, it can be shown that with probability 1 α, the TL-MCE of pc, hq is bounded by ε2. (TL-MCE is defined in Appendix E, equation (17).) H.3 PROOF OF PROPOSITION 1 Consider a specific l P r Ls. We use hl to denote the l-th component function of h and Yl 1 t Y lu. Canonical calibration implies Pp Y l | hp Xqq E r Yl | hp Xqs hlp Xq. (23) We can then use the law of iterated expectations (or tower rule) to get the final result: E r Yl | hlp Xqs E r E r Yl | hp Xqs | hlp Xqs E rhlp Xq | hlp Xqs (by the canonical calibration property (23)) hlp Xq. H.4 PROOF OF THEOREM 2 We use the bounds of Gupta and Ramdas (2021, Theorem 4) (henceforth called the GR21 bounds), derived in the binary calibration setting, to conclude marginal, conditional, and ECE guarantees for each hl. This leads to the class-wise results as well. 
Consider any l P r Ls. Let Pl denote the distribution of p X, 1 t Y luq. Clearly, Dl is a set of n i.i.d. samples from Pl, and hl is learning a binary calibrator with respect to Pl using binary histogram binning. The number of data-points is n and the number of bins is Bl tn{klu bins. We now apply the GR21 bounds on hl. First, we verify that the condition they require is satisfied: n ě kl tn{klu ě 2Bl. Thus the GR21 marginal calibration bound gives that for every l P r Ls, hl satisfies |Pp Y l | hlp Xqq hlp Xq| ď δ logp2{αlq 2ptn{Blu 1q Published as a conference paper at ICLR 2022 The class-wise marginal calibration bound of Theorem 2 follows since tn{Blu tn{ tn{kluu ě kl, and so logp2{αlq 2ptn{Blu 1q. Next, the GR21 conditional calibration bound gives for every l P r Ls, hl satisfies @r P Rangephlq, |Pp Y l | hlp Xq rq r| ď δ logp2Bl{αlq 2ptn{Blu 1q The class-wise marginal calibration bound of Theorem 2 follows since Bl tn{klu ď n{kl and tn{Blu tn{ tn{kluu ě kl, and so logp2Bl{αlq 2ptn{Blu 1q. Let k minl Pr Ls kl. The in-expectation ECE bound of GR21, for p 1, gives for every l, E rbinary-ECE-for-class-l phlqs ď a tn{klu {2n δ E r CW-ECEpc, hqs E l 1 binary-ECE-for-class-l phlq as required for the in-expectation CW-ECE bound of Theorem 2. Finally, for the high probability CW-ECE bound, let ε maxl Pr Ls εp2q l and α řL l 1 αl. By taking a union bound over the the conditional calibration bounds for each hl, we have, with probability 1 α, for every l P r Ls and r P Rangephq, |Pp Y l | cp Xq l, hp Xq rq r| ď εp2q l ď ε. Thus, with probability 1 α, CW-ECEpc, hq L 1 L ÿ l 1 E|Pp Y l | hlp Xqq hlp Xq| ď ε. This proves the high-probability bound for the CW-ECE.