# Verified Uncertainty Calibration

Ananya Kumar, Percy Liang, Tengyu Ma
Department of Computer Science, Stanford University
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates, that is, estimates representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient: it requires O(B/ε²) samples, compared to O(1/ε²) for scaling methods, where B is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration. This requires only O(1/ε² + B) samples. Next, we show that we can estimate a model's calibration error more accurately using an estimator from the meteorological community, or equivalently measure its calibration error with fewer samples (O(√B) instead of O(B)). We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration. We implement all these methods in a Python library.

1 Introduction

The probability that a system outputs for an event should reflect the true frequency of that event: if an automated diagnosis system says 1,000 patients have cancer with probability 0.1, approximately 100 of them should indeed have cancer. In this case, we say the model is uncertainty calibrated. The importance of this notion of calibration has been emphasized in personalized medicine [1], meteorological forecasting [2, 3, 4, 5, 6], and natural language processing applications [7, 8]. As most modern machine learning models, such as neural networks, do not output calibrated probabilities out of the box [9, 10, 11], researchers use recalibration methods that take the output of an uncalibrated model and transform it into a calibrated probability.

Scaling approaches for recalibration, such as Platt scaling [12], isotonic regression [13], and temperature scaling [9], are widely used and require very few samples, but do they actually produce calibrated probabilities? We discover that these methods are less calibrated than reported. Past work approximates a model's calibration error using a finite set of bins. We show that by using more bins, we can uncover a higher calibration error for models on CIFAR-10 and ImageNet. We show that a fundamental limitation with approaches that output a continuous range of probabilities is that their true calibration error is unmeasurable with a finite number of bins (Example 3.2).

An alternative approach, histogram binning [10], outputs probabilities from a finite set. Histogram binning can produce a model that is calibrated, and unlike scaling methods we can measure its calibration error, but it is sample inefficient. In particular, the number of samples required to calibrate
scales linearly with the number of distinct probabilities the model can output, B [14], which can be large, particularly in the multiclass setting where B typically scales with the number of classes. Recalibration sample efficiency is crucial: we often want to recalibrate our models in the presence of domain shift [15] or recalibrate a model trained on simulated data, and may have access to only a small labeled dataset from the target domain.

To get the sample efficiency of Platt scaling and the verification guarantees of histogram binning, we propose the scaling-binning calibrator (Figure 1c). Like scaling methods, we fit a simple function g ∈ G to the recalibration dataset. We then bin the input space so that an equal number of inputs land in each bin. In each bin, we output the average of the g values in that bin; these are the gray circles in Figure 1c. In contrast, histogram binning outputs the average of the label values in each bin (Figure 1b). The motivation behind our method is that the g values in each bin are in a narrower range than the label values, so when we take the average we incur lower estimation error. If G is well chosen, our method requires O(1/ε² + B) samples to achieve calibration error ε, instead of the O(B/ε²) samples required by histogram binning, where B is the number of model outputs (Theorem 4.1). Note that in prior work, binning the outputs of a function was used for evaluation and without any guarantees, whereas in our case it is used for the method itself, and we show improved sample complexity.

Figure 1: Visualization of the three recalibration approaches: (a) Platt scaling, (b) histogram binning, (c) scaling-binning calibrator. The black crosses are the ground truth labels, and the red lines are the output of the recalibration methods. Platt scaling (Figure 1a) fits a function to the recalibration data, but its calibration error is not measurable. Histogram binning (Figure 1b) outputs the average label in each bin. The scaling-binning calibrator (Figure 1c) fits a function g ∈ G to the recalibration data and then takes the average of the function values (the gray circles) in each bin. The function values have lower variance than the labels, as visualized by the blue dotted lines, which is why our approach has lower variance.

We run multiclass calibration experiments on CIFAR-10 [16] and ImageNet [17]. The scaling-binning calibrator achieves a lower calibration error than histogram binning, while allowing us to measure the true calibration error, unlike scaling methods. We get a 35% lower calibration error on CIFAR-10 and a 5x lower calibration error on ImageNet than histogram binning for B = 100.

Finally, we show how to estimate the calibration error of models more accurately. Prior work in machine learning [7, 9, 15, 18, 19] directly estimates each term in the calibration error from samples (Definition 5.1). The sample complexity of this plugin estimator scales linearly with B. A debiased estimator introduced in the meteorological literature [20, 21] reduces the bias of the plugin estimator; we prove that it achieves sample complexity that scales with √B by leveraging error cancellations across bins. Experiments on CIFAR-10 and ImageNet confirm that the debiased estimator measures the calibration error more accurately.

2 Setup and background

2.1 Binary classification

Let 𝒳 be the input space and 𝒴 be the label space, where 𝒴 = {0, 1} for binary classification. Let X ∈ 𝒳 and Y ∈ 𝒴 be random variables denoting the input and label, given by an unknown joint distribution P.
As usual, expectations are taken over all random variables. Suppose we have a model f : 𝒳 → [0, 1], where the (possibly uncalibrated) output of the model represents the model's confidence that the label is 1. The calibration error examines the difference between the model's probability and the true probability given the model's output:

Definition 2.1 (Calibration error). The calibration error of f : 𝒳 → [0, 1] is given by:

CE(f) = ( E[ (f(X) − E[Y | f(X)])² ] )^(1/2)

If CE(f) = 0 then f is perfectly calibrated. This notion of calibration error is the most commonly used [2, 3, 4, 7, 15, 18, 19, 20]. Replacing the 2s in the above definition by p ≥ 1, we get the ℓp calibration error; the ℓ1 and ℓ∞ calibration errors are also used in the literature [9, 22, 23]. In addition to CE, we also deal with the ℓ1 calibration error (known as ECE) in Sections 3 and 5.

Calibration alone is not sufficient: consider an image dataset containing 50% dogs and 50% cats. If f outputs 0.5 on all inputs, f is calibrated but not very useful. We often also wish to minimize the mean-squared error, also known as the Brier score, subject to a calibration budget [5, 24].

Definition 2.2. The mean-squared error of f : 𝒳 → [0, 1] is given by MSE(f) = E[(f(X) − Y)²].

Note that MSE and CE are not orthogonal and MSE = 0 implies perfect calibration; in fact the MSE is the sum of the squared calibration error and a sharpness term [2, 4, 18].

2.2 Multiclass classification

While calibration in binary classification is well-studied, it is less clear what to do for multiclass, where multiple definitions abound, differing in their strengths. In the multiclass setting, 𝒴 = [K] = {1, . . . , K} and f : 𝒳 → [0, 1]^K outputs a confidence measure for each class in [K].

Definition 2.3 (Top-label calibration error). The top-label calibration error examines the difference between the model's probability for its top prediction and the true probability of that prediction given the model's output:

( E[ ( max_{j∈[K]} f(X)_j − P(Y = argmax_{j∈[K]} f(X)_j | max_{j∈[K]} f(X)_j) )² ] )^(1/2)

We would often like the model to be calibrated on less likely predictions as well: imagine that a medical diagnosis system says there is a 50% chance a patient has a benign tumor, a 10% chance she has an aggressive form of cancer, and a 40% chance she has one of a long list of other conditions. We would like the model to be calibrated on all of these predictions, so we define the marginal calibration error, which examines, for each class, the difference between the model's probability and the true probability of that class given the model's output.

Definition 2.4 (Marginal calibration error). Let w_k ∈ [0, 1] denote how important calibrating class k is, where w_k = 1/K if all classes are equally important. The marginal calibration error is:

( Σ_{k∈[K]} w_k E[ (f(X)_k − P(Y = k | f(X)_k))² ] )^(1/2)

Prior works [9, 15, 19] propose methods for multiclass calibration but only measure top-label calibration [23]; concurrent work to ours [25] defines similar per-class calibration metrics, under which temperature scaling [9] is worse than vector scaling despite having better top-label calibration.
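To make these definitions concrete, here is a minimal numerical sketch of Definitions 2.1 and 2.2 in the binary setting, assuming a toy model with a finite output set so that E[Y | f(X)] is known exactly; all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy binary model with a finite output set, so E[Y | f(X)] is known exactly.
outputs = np.array([0.2, 0.5, 0.8])      # values f can output
p_output = np.array([0.3, 0.4, 0.3])     # P(f(X) = v) for each value (hypothetical)
p_true = np.array([0.1, 0.5, 0.9])       # E[Y | f(X) = v] (hypothetical ground truth)

# Definition 2.1: CE(f) = (E[(f(X) - E[Y | f(X)])^2])^(1/2)
sq_ce = np.sum(p_output * (outputs - p_true) ** 2)
ce = np.sqrt(sq_ce)

# Definition 2.2: MSE(f) = E[(f(X) - Y)^2]; conditioning on f(X) decomposes it as
# the squared calibration error plus E[Var(Y | f(X))], the sharpness term of Section 2.1.
sharpness = np.sum(p_output * p_true * (1 - p_true))
mse = sq_ce + sharpness

print(f"CE = {ce:.3f}, squared CE = {sq_ce:.4f}, MSE = {mse:.4f}")
```

For models that output a continuous range of probabilities, E[Y | f(X)] cannot be read off from samples this way, which is exactly the difficulty Section 3 addresses.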
For notational simplicity, our theory focuses on the binary classification setting. We can transform top-label calibration into a binary calibration problem: the model outputs a probability corresponding to its top prediction, and the label represents whether the model gets it correct or not. Marginal calibration can be transformed into K one-vs-all binary calibration problems, where for each k ∈ [K] the model outputs the probability associated with the k-th class, and the label represents whether the correct class is k [13]. We consider both top-label calibration and marginal calibration in our experiments. Other notions of multiclass calibration include joint calibration (which requires the entire probability vector to be calibrated) [2, 6] and event-pooled calibration [18].

2.3 Recalibration

Since most machine learning models do not output calibrated probabilities out of the box [9, 10], recalibration methods take the output of an uncalibrated model and transform it into a calibrated probability. That is, given a trained model f : 𝒳 → [0, 1], let Z = f(X). We are given recalibration data T = {(z_i, y_i)}_{i=1}^n independently sampled from P(Z, Y), and we wish to learn a recalibrator g : [0, 1] → [0, 1] such that g ∘ f is well-calibrated.

Scaling methods, for example Platt scaling [12], output a function g = argmin_{g∈G} Σ_{(z,y)∈T} ℓ(g(z), y), where G is a model family, each g ∈ G is differentiable, and ℓ is a loss function, for example the log-loss or mean-squared error. The advantage of such methods is that they converge very quickly since they only fit a small number of parameters.

Histogram binning first constructs a set of bins (intervals) that partitions [0, 1], formalized below.

Definition 2.5 (Binning schemes). A binning scheme B of size B is a set of B intervals I_1, . . . , I_B that partitions [0, 1]. Given z ∈ [0, 1], let β(z) = j, where I_j is the interval that z lands in (z ∈ I_j).

The bins are typically chosen such that either I_1 = [0, 1/B], I_2 = (1/B, 2/B], . . . , I_B = ((B−1)/B, 1] (equal-width binning) [9], or so that each bin contains an equal number of z_i values in the recalibration data (uniform-mass binning) [10]. Histogram binning then outputs the average y_i value in each bin.

3 Is Platt scaling calibrated?

In this section, we show that methods like Platt scaling and temperature scaling are (i) less calibrated than reported and (ii) it is difficult to tell how miscalibrated they are. That is, we show, both theoretically and with experiments on CIFAR-10 and ImageNet, why the calibration error of models that output a continuous range of values is underestimated. We defer proofs to Appendix B.

The key to estimating the calibration error is estimating the conditional expectation E[Y | f(X)]. If f(X) is continuous, without smoothness assumptions on E[Y | f(X)] (assumptions that cannot be verified in practice), this is impossible. This is analogous to the difficulty of measuring the mutual information between two continuous signals [26]. To approximate the calibration error, prior work bins the output of f into B intervals. The calibration error in each bin is estimated as the difference between the average value of f(X) and Y in that bin. Note that the binning here is for evaluation only, whereas in histogram binning it is used for the recalibration method itself. We formalize the notion of this binned calibration error below.

Definition 3.1. The binned version of f outputs the average value of f in each bin I_j:

f_B(x) = E[f(X) | f(X) ∈ I_j]   where x ∈ I_j   (4)

Given B, the binned calibration error of f is simply the calibration error of f_B.
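As a concrete illustration of Definition 3.1, the following sketch (not from the paper) computes the true and binned squared calibration errors on a synthetic distribution where E[Y | f(X)] is known; the oscillating conditional probability is a hypothetical choice that makes errors cancel within coarse bins.

```python
import numpy as np

# Synthetic setup: f(X) is (approximately) uniform on [0, 1], represented by a fine
# grid, and E[Y | f(X) = z] oscillates around z. Both CE(f) and CE(f_B) can then be
# computed directly from Definitions 2.1 and 3.1.
grid = np.linspace(0, 1, 100_001)                                  # values of f(X)
true_prob = np.clip(grid + 0.1 * np.sin(40 * np.pi * grid), 0, 1)  # E[Y | f(X) = z]

true_sq_ce = np.mean((grid - true_prob) ** 2)  # (CE(f))^2, approximated on the grid

def binned_sq_ce(num_bins):
    """Squared calibration error of f_B with equal-width bins: in each bin I_j,
    compare the average value of f to the average of E[Y | f(X)]."""
    bin_ids = np.minimum((grid * num_bins).astype(int), num_bins - 1)
    sq_ce = 0.0
    for j in range(num_bins):
        mask = bin_ids == j
        sq_ce += mask.mean() * (grid[mask].mean() - true_prob[mask].mean()) ** 2
    return sq_ce

for B in [10, 100, 1000]:
    print(f"B={B:5d}  binned CE^2={binned_sq_ce(B):.5f}  true CE^2={true_sq_ce:.5f}")
```

Coarse bins average the oscillations away, so the binned calibration error sits far below the true calibration error; the lower bound only tightens as the bins become finer.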
A simple example shows that using binning to estimate the calibration error can severely underestimate the true calibration error.

Example 3.2. For any binning scheme B and any continuous bijective function f : [0, 1] → [0, 1], there exists a distribution P over X, Y such that CE(f_B) = 0 but CE(f) ≥ 0.49. Note that for all f, 0 ≤ CE(f) ≤ 1.

The intuition of the construction is that in each interval I_j in B, the model could underestimate the true probability E[Y | f(X)] half the time, and overestimate the probability half the time. So if we average over the entire bin the model appears to be calibrated, even though it is very uncalibrated. The formal proof is in Appendix B, and holds for arbitrary ℓp calibration errors including the ECE.

Next, we show that given a function f, its binned version always has lower calibration error. The proof, in Appendix B, is by Jensen's inequality. Intuitively, averaging a model's prediction within a bin allows errors at different parts of the bin to cancel out with each other. This result is similar to Theorem 2 in recent work [27], and holds for arbitrary ℓp calibration errors including the ECE.

Proposition 3.3 (Binning underestimates error). Given any binning scheme B and model f : 𝒳 → [0, 1], we have CE(f_B) ≤ CE(f).

Figure 2: Binned calibration errors of a recalibrated VGG-net model on (a) ImageNet and (b) CIFAR-10, with 90% confidence intervals. The binned calibration error increases as we increase the number of bins. This suggests that binning cannot be reliably used to measure the true calibration error.

3.1 Experiments

Our experiments on ImageNet and CIFAR-10 suggest that previous work reports numbers which are lower than the actual calibration error of their models. Recall that binning lower bounds the calibration error. We cannot compute the actual calibration error, but if we use a finer set of bins then we get a tighter lower bound on the calibration error. As in [9], our model's objective was to output the top predicted class and a confidence score associated with the prediction. For ImageNet, we started with a trained VGG16 model with an accuracy of 64.3%. We split the validation set into 3 sets of sizes (20000, 5000, 25000). We used the first set of data to recalibrate the model using Platt scaling, the second to select the binning scheme B so that each bin contains an equal number of points, and the third to measure the binned calibration error. We calculated 90% confidence intervals for the binned calibration error using 1,000 bootstrap resamples, and performed the same experiment with varying numbers of bins. Figure 2a shows that as we increase the number of bins on ImageNet, the measured calibration error is higher, and this is statistically significant. For example, if we use 15 bins as in [9], we would think the calibration error is around 0.02 when in reality the calibration error is at least twice as high. Figure 2b shows similar findings for CIFAR-10, and in Appendix C we show that our findings hold even if we use the ℓ1 calibration error (ECE) and alternative binning strategies.
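A simplified sketch of this measurement protocol follows; it is not the exact pipeline above (for brevity, the bins are chosen on the same data they are evaluated on), and the data-generating process, sizes, and function names are hypothetical stand-ins for the recalibrated model's held-out confidences and labels.

```python
import numpy as np

def plugin_binned_sq_ce(probs, labels, num_bins):
    """Binned squared calibration error estimate with uniform-mass bins."""
    order = np.argsort(probs)
    sq_ce = 0.0
    for idx in np.array_split(order, num_bins):   # equal number of points per bin
        sq_ce += (len(idx) / len(probs)) * (probs[idx].mean() - labels[idx].mean()) ** 2
    return sq_ce

def bootstrap_ci(probs, labels, num_bins, resamples=1000, alpha=0.1, seed=0):
    """Point estimate and (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(probs)
    draws = []
    for _ in range(resamples):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        draws.append(plugin_binned_sq_ce(probs[idx], labels[idx], num_bins))
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return plugin_binned_sq_ce(probs, labels, num_bins), (lo, hi)

# Hypothetical stand-in for a recalibrated model's held-out confidences and labels.
rng = np.random.default_rng(1)
z = rng.uniform(0.05, 0.95, size=25_000)
y = rng.binomial(1, np.clip(z + 0.05 * np.sin(30 * np.pi * z), 0, 1))
for B in [15, 100]:   # finer binning typically reports a larger (tighter) lower bound
    est, (lo, hi) = bootstrap_ci(z, y, B)
    print(f"B={B}: binned CE^2 = {est:.5f}, 90% CI [{lo:.5f}, {hi:.5f}]")
```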
4 The scaling-binning calibrator

Section 3 shows that the problem with scaling methods is that we cannot estimate their calibration error. The upside of scaling methods is that if the function family has at least one function that can achieve calibration error ε, they require O(1/ε²) samples to reach calibration error ε, while histogram binning requires O(B/ε²) samples. Can we devise a method that is sample efficient to calibrate and one where it is possible to estimate the calibration error? To achieve this, we propose the scaling-binning calibrator (Figure 1c), where we first fit a scaling function and then bin the outputs of the scaling function.

4.1 Algorithm

We split the recalibration data T of size n into 3 sets: T_1, T_2, T_3. The scaling-binning calibrator, illustrated in Figure 1, outputs ĝ_B such that ĝ_B ∘ f has low calibration error:

Step 1 (Function fitting): Select g = argmin_{g∈G} Σ_{(z,y)∈T_1} (y − g(z))².

Step 2 (Binning scheme construction): We choose the bins so that an equal number of g(z_i) in T_2 land in each bin I_j, for each j ∈ {1, . . . , B}. This uniform-mass binning scheme [10], as opposed to equal-width binning [9], is essential for being able to estimate the calibration error in Section 5.

Step 3 (Discretization): Discretize g by outputting the average g value in each bin; these are the gray circles in Figure 1c. Let µ(S) = (1/|S|) Σ_{s∈S} s denote the mean of a set of values S. Let µ̂[j] = µ({g(z_i) | g(z_i) ∈ I_j, (z_i, y_i) ∈ T_3}) be the mean of the g(z_i) values that landed in the j-th bin. Recall that if z ∈ I_j, then β(z) = j is the index of the interval z lands in. Then we set ĝ_B(z) = µ̂[β(g(z))]; that is, we simply output the mean value in the bin that g(z) falls in.

4.2 Analysis

We now show that the scaling-binning calibrator requires O(B + 1/ε²) samples to calibrate, and in Section 5 we show that we can efficiently measure its calibration error. For the main theorem, we make some standard regularity assumptions on G, which we formalize in Appendix D. Our result is a generalization result: we show that if G contains some g with low calibration error, then our method is at least almost as well-calibrated as g given sufficiently many samples.

Theorem 4.1 (Calibration bound). Assume regularity conditions on G (finite parameters, injectivity, Lipschitz-continuity, consistency, twice differentiability). Given δ ∈ (0, 1), there is a constant c such that for all B, ε > 0, with n ≥ c (B log B + (log B)/ε²) samples, the scaling-binning calibrator finds ĝ_B with (CE(ĝ_B))² ≤ 2 min_{g∈G} (CE(g))² + ε², with probability at least 1 − δ.

Note that our method can potentially be better calibrated than g, because we bin the outputs of the scaling function, which reduces its calibration error (Proposition 3.3). While binning worsens the sharpness and can increase the mean-squared error of the model, in Proposition D.4 we show that if we use many bins, binning the outputs cannot increase the mean-squared error by much.

We prove Theorem 4.1 in Appendix D but give a sketch here. Step 1 of our algorithm is Platt scaling, which simply fits a function g to the data; standard results in asymptotic statistics show that g converges in O(1/ε²) samples. Step 3, where we bin the outputs of g, is the main step of the algorithm. If we had infinite data, Proposition 3.3 showed that the binned version g_B has lower calibration error than g, so we would be done. However, we do not have infinite data; the core of our proof is to show that the empirically binned ĝ_B is within ε of g_B in O(B + 1/ε²) samples, instead of the O(B + B/ε²) samples required by histogram binning. The intuition is in Figure 1: the g(z_i) values in each bin (gray circles in Figure 1c) are in a narrower range than the y_i values (black crosses in Figure 1b) and thus have lower variance, so when we take the average we incur less estimation error. The perhaps surprising part is that we are estimating B numbers with Õ(1/ε²) samples. In fact, there may be a small number of bins where the g(z_i) values are not in a narrow range, but our proof still shows that the overall estimation error is small.
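Below is a minimal sketch of the three steps, not the authors' released library: it assumes a Platt-style family g(z) = sigmoid(a · logit(z) + b) fit by gradient descent on the squared loss, and the class name, data split, and hyperparameters are illustrative only.

```python
import numpy as np

def _sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def _logit(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

class ScalingBinningCalibrator:
    """Sketch of the scaling-binning calibrator: fit g on T1, build uniform-mass
    bins from g's values on T2, and replace g by its average value per bin on T3."""

    def fit(self, z, y, num_bins=10, steps=2000, lr=0.5):
        n = len(z)
        z1, y1 = z[: n // 3], y[: n // 3]            # T1: function fitting
        z2 = z[n // 3 : 2 * n // 3]                  # T2: binning scheme construction
        z3 = z[2 * n // 3 :]                         # T3: discretization

        # Step 1: g = argmin over the family of sum_i (y_i - g(z_i))^2 on T1.
        a, b, x1 = 1.0, 0.0, _logit(z1)
        for _ in range(steps):
            g1 = _sigmoid(a * x1 + b)
            grad = 2 * (g1 - y1) * g1 * (1 - g1)     # chain rule for the squared loss
            a -= lr * np.mean(grad * x1)
            b -= lr * np.mean(grad)
        self.a, self.b = a, b

        # Step 2: uniform-mass bins: equal numbers of g(z) from T2 land in each bin.
        g2 = _sigmoid(a * _logit(z2) + b)
        self.edges = np.quantile(g2, np.linspace(0, 1, num_bins + 1)[1:-1])

        # Step 3: output the average g value (not the average label) in each bin of T3.
        g3 = _sigmoid(a * _logit(z3) + b)
        which = np.searchsorted(self.edges, g3)
        self.bin_means = np.array([
            g3[which == j].mean() if np.any(which == j) else np.nan  # empty bins unlikely
            for j in range(num_bins)
        ])
        return self

    def predict(self, z):
        g = _sigmoid(self.a * _logit(z) + self.b)
        return self.bin_means[np.searchsorted(self.edges, g)]

# Usage on hypothetical recalibration data: uncalibrated confidences z and labels y.
rng = np.random.default_rng(0)
z = rng.uniform(0.01, 0.99, size=3000)
y = rng.binomial(1, _sigmoid(2 * _logit(z)))         # a miscalibrated toy model
calibrated = ScalingBinningCalibrator().fit(z, y, num_bins=10).predict(z)
```

The design point to notice is Step 3: the discretized output is the average of the g values in each bin, not the average label, which is what gives the variance reduction described above.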
Our uniform-mass binning scheme allows us to estimate the calibration error efficiently (see Section 5), unlike for scaling methods, where we cannot estimate the calibration error (Section 3). Recall that we chose our bins so that each bin has an equal proportion of points in the recalibration set. Lemma 4.3 shows that this property approximately holds in the population as well. This allows us to estimate the calibration error efficiently (Theorem 5.4).

Definition 4.2 (Well-balanced binning). Given a binning scheme B of size B and α ≥ 1, we say B is α-well-balanced if for all j,

1/(αB) ≤ P(Z ∈ I_j) ≤ α/B.

Lemma 4.3. For a universal constant c, if n ≥ c B log(B/δ), then with probability at least 1 − δ, the binning scheme B we chose is 2-well-balanced.

While the way we choose bins is not novel [10], we believe the guarantees around it are: not all binning schemes in the literature allow us to efficiently estimate the calibration error; for example, the binning scheme in [9] does not. Our proof of Lemma 4.3 is in Appendix D. We use a discretization argument to prove the result; this gives a tighter bound than applying Chernoff bounds or a standard VC dimension argument, which would tell us we need O(B² log(B/δ)) samples.

Figure 3: (a) Effect of the number of bins on the squared calibration error: recalibrating using 1,000 data points on CIFAR-10, the scaling-binning calibrator achieves lower squared calibration error than histogram binning, especially when the number of bins B is large. (b) Tradeoff between calibration and MSE: for a fixed calibration error, the scaling-binning calibrator allows us to use more bins, which results in models with more predictive power, as measured by the mean-squared error. Note the vertical axis range is [0.04, 0.08] to zoom into the relevant region.

4.3 Experiments

Our experiments on CIFAR-10 and ImageNet show that in the low-data regime, for example when we use 1,000 data points to recalibrate, the scaling-binning calibrator produces models with much lower calibration error than histogram binning. The uncalibrated model outputs a confidence score associated with each class. We recalibrated each class separately as in [13], using B bins per class, and evaluated calibration using the marginal calibration error (Definition 2.4).

We describe our experimental protocol for CIFAR-10. The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. We ran either the scaling-binning calibrator (we fit a sigmoid in the function fitting step) or histogram binning and measured the marginal calibration error on the entire set of 10,000 points. We repeated this entire procedure 100 times and computed means and 90% confidence intervals, and we repeated this varying the number of bins B. Figure 3a shows that the scaling-binning calibrator produces models with lower calibration error, for example 35% lower calibration error when we use 100 bins per class.

Using more bins allows a model to produce more fine-grained predictions, e.g. [20] use B = 51 bins, which improves the quality of predictions as measured by the mean-squared error. Figure 3b shows that our method achieves better mean-squared errors for any given calibration constraint. More concretely, the figure shows a scatter plot of the mean-squared error and squared calibration error for histogram binning and the scaling-binning calibrator when we vary the number of bins. For example, if we want our models to have a calibration error of at most 0.02 (2%), we get a 9% lower mean-squared error.
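The uniform-mass binning used throughout can also be checked for the well-balanced property of Definition 4.2. Below is a small sketch, with a hypothetical score distribution standing in for the g(z_i) values and a generously chosen sample size: it builds uniform-mass bins from one sample and verifies the 2-well-balanced property on a much larger fresh sample standing in for the population, in the spirit of Lemma 4.3.

```python
import numpy as np

def uniform_mass_edges(scores, num_bins):
    """Bin edges chosen so that each bin receives an equal share of `scores`."""
    return np.quantile(scores, np.linspace(0, 1, num_bins + 1)[1:-1])

def is_alpha_well_balanced(edges, population_scores, alpha=2.0):
    """Empirical check of Definition 4.2: every bin's probability mass should lie
    in [1/(alpha*B), alpha/B]. A large fresh sample stands in for P(Z in I_j)."""
    num_bins = len(edges) + 1
    counts = np.bincount(np.searchsorted(edges, population_scores), minlength=num_bins)
    mass = counts / len(population_scores)
    return bool(np.all((mass >= 1 / (alpha * num_bins)) & (mass <= alpha / num_bins)))

# Hypothetical recalibration scores; Lemma 4.3 says on the order of B log(B/delta)
# samples suffice, and we use a comfortable 10,000 here for B = 100.
rng = np.random.default_rng(0)
B = 100
edges = uniform_mass_edges(rng.beta(2, 5, size=10_000), num_bins=B)
print(is_alpha_well_balanced(edges, rng.beta(2, 5, size=500_000), alpha=2.0))  # expect True
```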
In Appendix E we show that we get a 5x lower top-label calibration error on ImageNet, and give further experiment details.

Validating theoretical bounds: In Appendix E we run synthetic experiments to validate the bound in Theorem 4.1. In particular, we show that if we fix the number of samples n and vary the number of bins B, the squared calibration error for the scaling-binning calibrator is nearly constant, but for histogram binning it increases nearly linearly with B. For both methods, the squared calibration error decreases approximately as 1/n; that is, when we double the number of samples the squared calibration error halves.

5 Verifying calibration

Before deploying our model we would like to check that it has calibration error below some desired threshold. In this section we show that we can accurately estimate the calibration error of binned models, if the binning scheme is 2-well-balanced. Recent work in machine learning uses a plugin estimate for each term in the calibration error [7, 15, 18, 19]. Older work in meteorology [20, 21] notices that this is a biased estimate, and proposes a debiased estimator that subtracts off an approximate correction term to reduce the bias. Our contribution is to show that the debiased estimator is more accurate: while the plugin estimator requires samples proportional to B to estimate the calibration error, the debiased estimator requires samples proportional to √B. Note that we show an improved sample complexity; prior work only showed that the naive estimator is biased. In Appendix G we also propose a way to debias the ℓ1 calibration error (ECE), and show that we can estimate the ECE more accurately on CIFAR-10 and ImageNet.

Suppose we wish to measure the squared calibration error ε² of a binned model f : 𝒳 → S, where S ⊆ [0, 1] and |S| = B. Suppose we get an evaluation set T_n = {(x_1, y_1), . . . , (x_n, y_n)}. Past work typically estimates the calibration error by directly estimating each term from samples:

Definition 5.1 (Plugin estimator). Let L_s denote the y_j values where the model outputs s: L_s = {y_j | (x_j, y_j) ∈ T_n, f(x_j) = s}. Let p̂_s be the estimated probability of f outputting s: p̂_s = |L_s| / n. Let ŷ_s be the empirical average of Y when the model outputs s: ŷ_s = (Σ_{y∈L_s} y) / |L_s|. The plugin estimate for the squared calibration error is the weighted squared difference between ŷ_s and s:

Ê²_pl = Σ_{s∈S} p̂_s (s − ŷ_s)²

Alternatively, [20, 21] propose to subtract an approximation of the bias from the estimate:

Definition 5.2 (Debiased estimator). The debiased estimator for the squared calibration error is:

Ê²_db = Σ_{s∈S} p̂_s [ (s − ŷ_s)² − ŷ_s(1 − ŷ_s) / (p̂_s n − 1) ]

We are interested in analyzing the number of samples required to estimate the calibration error within a constant multiplicative factor, that is, to give an estimate Ê² such that |Ê² − ε²| ≤ ½ε² (where ½ can be replaced by any constant r with 0 < r < 1). Our main result is that the plugin estimator requires Õ(B/ε²) samples (Theorem 5.3), while the debiased estimator requires Õ(√B/ε²) samples (Theorem 5.4).

Theorem 5.3 (Plugin estimator bound). Suppose we have a binned model with squared calibration error ε², where the binning scheme is 2-well-balanced, that is, for all s ∈ S, P(f(X) = s) ≥ 1/(2B) (we do not need the upper bound of the 2-well-balanced property). If n ≥ c (B/ε²) log(1/δ) for some universal constant c, then for the plugin estimator we have ½ε² ≤ Ê²_pl ≤ (3/2)ε² with probability at least 1 − δ.

Theorem 5.4 (Debiased estimator bound). Suppose we have a binned model with squared calibration error ε² and for all s ∈ S, P(f(X) = s) ≥ 1/(2B). If n ≥ c (√B/ε²) log(1/δ) for some universal constant c, then for the debiased estimator we have ½ε² ≤ Ê²_db ≤ (3/2)ε² with probability at least 1 − δ.
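Both estimators are simple to compute. The following is a self-contained sketch of Definitions 5.1 and 5.2 on a synthetic binned model; the function name, the guard for singleton bins, and the data-generating process are illustrative assumptions, not the authors' library.

```python
import numpy as np

def plugin_and_debiased(preds, labels):
    """Plugin (Definition 5.1) and debiased (Definition 5.2) estimates of the squared
    calibration error of a binned model. `preds` holds the model's discrete outputs
    s in S, `labels` the corresponding 0/1 labels."""
    n = len(preds)
    plugin, debiased = 0.0, 0.0
    for s in np.unique(preds):
        ys = labels[preds == s]
        p_hat = len(ys) / n                              # \hat{p}_s
        y_hat = ys.mean()                                # \hat{y}_s
        plugin += p_hat * (s - y_hat) ** 2
        # Subtract an approximation of the per-bin bias, as in [20, 21]
        # (guard against singleton bins, where the correction is undefined).
        debiased += p_hat * ((s - y_hat) ** 2 - y_hat * (1 - y_hat) / max(len(ys) - 1, 1))
    return plugin, debiased

# Synthetic binned model: with few samples per output value, the plugin estimate is
# inflated by the accumulated per-bin bias (roughly B/n in total), while the debiased
# estimate stays close to the true squared calibration error.
rng = np.random.default_rng(0)
B, n = 100, 2000
outputs = (np.arange(B) + 0.5) / B                       # the B values the model can output
true_probs = np.clip(outputs + 0.02, 0, 1)               # true E[Y | f(X) = s]
idx = rng.integers(0, B, size=n)
preds, labels = outputs[idx], rng.binomial(1, true_probs[idx])
true_sq_ce = np.mean((outputs - true_probs) ** 2)
print(plugin_and_debiased(preds, labels), "true:", true_sq_ce)
```

On this toy setup the plugin estimate is noticeably larger than the truth while the debiased estimate tracks it, matching the intuition behind Theorems 5.3 and 5.4.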
The proofs of both theorems are in Appendix F. The idea is that for the plugin estimator, each term in the sum has bias 1/n. These biases accumulate, giving a total bias of B/n. The debiased estimator has much lower bias, and the estimation variance cancels across bins; this intuition is captured in Lemma F.8, which requires careful conditioning to make the argument go through.

5.1 Experiments

We run a multiclass marginal calibration experiment on CIFAR-10 which suggests that the debiased estimator produces better estimates of the calibration error than the plugin estimator. We split the validation set of size 10,000 into two sets S_C and S_E of sizes 3,000 and 7,000 respectively. We use S_C to recalibrate and discretize a trained VGG-16 model. We calibrate each of the K = 10 classes separately as described in Section 2 and used B = 100 or B = 10 bins per class. For varying values of n, we sample n points with replacement from S_E, and estimate the calibration error using the debiased estimator and the plugin estimator. We then compute the squared deviation of these estimates from the squared calibration error measured on the entire set S_E. We repeat this resampling 1,000 times to get the mean squared deviation of the estimates from the ground truth and confidence intervals.

Figure 4: Mean-squared errors of the plugin and debiased estimators on a recalibrated VGG16 model on CIFAR-10, with 90% confidence intervals (lower values are better); panel (b) shows B = 100. The debiased estimator is closer to the ground truth, which corresponds to 0 on the vertical axis, especially when B is large or n is small. Note that this is the MSE of the squared calibration error, not the MSE of the model in Figure 3.

Figure 4a shows that the debiased estimates are much closer to the ground truth than the plugin estimates; the difference is especially significant when the number of samples n is small or the number of bins B is large. Note that having a perfect estimate corresponds to 0 on the vertical axis. In Appendix G, we include histograms of the absolute difference between the estimates and the ground truth for the plugin and debiased estimators, over the 1,000 resamples.

6 Related work

Calibration, including the squared calibration error, has been studied in many fields besides machine learning, including meteorology [2, 3, 4, 5, 6], fairness [28, 29], healthcare [1, 30, 31, 32], reinforcement learning [33], natural language processing [7, 8], speech recognition [34], econometrics [24], and psychology [35]. Besides the calibration error, prior work also uses the Hosmer-Lemeshow test [36] and reliability diagrams [4, 37] to evaluate calibration. Concurrent work to ours [38] also notices that using the plugin calibration error estimator to test for calibration leads to rejecting well-calibrated models too often. Besides calibration, other ways of producing and quantifying uncertainties include Bayesian methods [39] and conformal prediction [40, 41]. Recalibration is related to (conditional) density estimation [43, 44], as the goal is to estimate E[Y | f(X)]. Algorithms and analysis in density estimation typically assume the true density is L-Lipschitz, while in calibration applications the calibration error of the final model should be measurable from data, without making untestable assumptions on L. Bias is a common issue with statistical estimators; for example, the seminal work by Stein [42] fixes the bias of the mean-squared error. However, debiasing an estimator does not typically lead to an improved sample complexity, as it does in our case.
7 Conclusion

This paper makes three contributions:
1. We showed that the calibration error of continuous methods is underestimated.
2. We introduced the first method, to our knowledge, that has better sample complexity than histogram binning and has a measurable calibration error, giving us the best of scaling and binning methods.
3. We showed that an alternative estimator for the calibration error has better sample complexity than the plugin estimator.

There are many exciting avenues for future work:
1. Dataset shifts: Can we maintain calibration under dataset shifts (for example, train on MNIST, but evaluate on SVHN) without labeled examples from the target dataset?
2. Measuring calibration: Can we come up with alternative metrics that still capture a notion of calibration, but are measurable for scaling methods?

Reproducibility. Our Python calibration library is available at . All code, data, and experiments can be found on CodaLab at . Updated code can be found at .

Acknowledgements. The authors would like to thank the Open Philanthropy Project, Stanford Graduate Fellowship, and Toyota Research Institute for funding. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We are grateful to Pang Wei Koh, Chuan Guo, Anand Avati, Shengjia Zhao, Weihua Hu, Yu Bai, John Duchi, Dan Hendrycks, Jonathan Uesato, Michael Xie, Albert Gu, Aditi Raghunathan, Fereshte Khani, Stefano Ermon, Eric Nalisnick, and Pushmeet Kohli for insightful discussions. We thank the anonymous reviewers for their thorough reviews and suggestions that have improved our paper. We would also like to thank Pang Wei Koh, Yair Carmon, Albert Gu, Rachel Holladay, and Michael Xie for their inputs on our draft, and Chuan Guo for providing code snippets from their temperature scaling paper.

References

[1] X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
[2] A. H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973.
[3] A. H. Murphy and R. L. Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26:41–47, 1977.
[4] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32:12–22, 1983.
[5] T. Gneiting and A. E. Raftery. Weather forecasting with ensemble methods. Science, 310, 2005.
[6] J. Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, 2009.
[7] K. Nguyen and B. O'Connor. Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP), pages 1587–1598, 2015.
[8] D. Card and N. A. Smith. The importance of calibration for estimating proportions from annotations. In Association for Computational Linguistics (ACL), 2018.
[9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), pages 1321–1330, 2017.
[10] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning (ICML), pages 609–616, 2001.
[11] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning (ICML), 2018.
[12] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[13] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694–699, 2002.
[14] M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Binary classifier calibration: Non-parametric approach. arXiv, 2014.
[15] D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019.
[16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[18] V. Kuleshov and P. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[19] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning (ICML), 2019.
[20] J. Bröcker. Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39:655–667, 2012.
[21] C. A. T. Ferro and T. E. Fricker. A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138(668):1954–1960, 2012.
[22] M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.
[23] J. V. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran. Measuring calibration in deep learning. arXiv, 2019.
[24] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
[25] M. Kull, M. P. Nieto, M. Kängsepp, T. S. Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[26] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.
[27] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. In Artificial Intelligence and Statistics (AISTATS), 2019.
[28] U. Hebert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning (ICML), 2018.
[29] L. T. Liu, M. Simchowitz, and M. Hardt. The implicit fairness criterion of unconstrained learning. In International Conference on Machine Learning (ICML), 2019.
[30] C. S. Crowson, E. J. Atkinson, and T. M. Therneau. Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research, 25:1692–1706, 2017.
[31] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996.
[32] S. Yadlowsky, S. Basu, and L. Tian. A calibration metric for risk scores with survival data. Machine Learning for Healthcare, 2019.
[33] A. Malik, V. Kuleshov, J. Song, D. Nemer, H. Seymour, and S. Ermon. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning (ICML), 2019.
[34] D. Yu, J. Li, and L. Deng. Calibration of confidence measures in speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(8):2461–2473, 2011.
[35] S. Lichtenstein, B. Fischhoff, and L. D. Phillips. Judgement under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.
[36] D. W. Hosmer and S. Lemeshow. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9:1043–1069, 1980.
[37] J. Bröcker and L. A. Smith. Increasing the reliability of reliability diagrams. Weather and Forecasting, 22(3):651–661, 2007.
[38] D. Widmann, F. Lindsten, and D. Zachariah. Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[39] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 1995.
[40] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research (JMLR), 9:371–421, 2008.
[41] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113:1094–1111, 2016.
[42] C. M. Stein. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135–1151, 1981.
[43] L. Wasserman. Density estimation. , 2019.
[44] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.
[45] F. Chollet. Keras. , 2015.
[46] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[47] Y. Geifman. cifar-vgg. , 2015.
[48] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[49] J. H. Hubbard and B. B. Hubbard. Vector Calculus, Linear Algebra, and Differential Forms. Prentice Hall, 1998.
[50] M. Kull, T. M. S. Filho, and P. Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11:5052–5080, 2017.