# On Calibration of Modern Neural Networks

Chuan Guo*, Geoff Pleiss*, Yu Sun*, Kilian Q. Weinberger (Cornell University)

*Equal contribution, alphabetical order. Correspondence to: Chuan Guo, Geoff Pleiss, Yu Sun. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.

Confidence calibration, the problem of predicting probability estimates representative of the true correctness likelihood, is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling, a single-parameter variant of Platt scaling, is surprisingly effective at calibrating predictions.

## 1. Introduction

Recent advances in deep learning have dramatically improved neural network accuracy (Simonyan & Zisserman, 2015; Srivastava et al., 2015; He et al., 2016; Huang et al., 2016; 2017). As a result, neural networks are now entrusted with making complex decisions in applications such as object detection (Girshick, 2015), speech recognition (Hannun et al., 2014), and medical diagnosis (Caruana et al., 2015). In these settings, neural networks are an essential component of larger decision making pipelines.

In real-world decision making systems, classification networks must not only be accurate, but also should indicate when they are likely to be incorrect. As an example, consider a self-driving car that uses a neural network to detect pedestrians and other obstructions (Bojarski et al., 2016).

Figure 1. Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100. Refer to the text below for detailed illustration.

If the detection network is not able to confidently predict the presence or absence of immediate obstructions, the car should rely more on the output of other sensors for braking. Alternatively, in automated health care, control should be passed on to human doctors when the confidence of a disease diagnosis network is low (Jiang et al., 2012).

Specifically, a network should provide a calibrated confidence measure in addition to its prediction. In other words, the probability associated with the predicted class label should reflect its ground truth correctness likelihood. Calibrated confidence estimates are also important for model interpretability. Humans have a natural cognitive intuition for probabilities (Cosmides & Tooby, 1996). Good confidence estimates provide a valuable extra bit of information to establish trustworthiness with the user, especially for neural networks, whose classification decisions are often difficult to interpret.
Further, good probability estimates can be used to incorporate neural networks into other probabilistic models. For example, one can improve performance by combining network outputs with a language model in speech recognition (Hannun et al., 2014; Xiong et al., 2016), or with camera information for object detection (Kendall & Cipolla, 2016).

In 2005, Niculescu-Mizil & Caruana (2005) showed that neural networks typically produce well-calibrated probabilities on binary classification tasks. While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated. This is visualized in Figure 1, which compares a 5-layer LeNet (left) (LeCun et al., 1998) with a 110-layer ResNet (right) (He et al., 2016) on the CIFAR-100 dataset. The top row shows the distribution of prediction confidence (i.e. probabilities associated with the predicted label) as histograms. The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. This is further illustrated in the bottom-row reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005), which show accuracy as a function of confidence. We see that LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet's accuracy is better, but does not match its confidence.

Our goal is not only to understand why neural networks have become miscalibrated, but also to identify what methods can alleviate this problem. In this paper, we demonstrate on several computer vision and NLP tasks that neural networks produce confidences that cannot represent true probabilities. Additionally, we offer insight and intuition into network training and architectural trends that may cause miscalibration. Finally, we compare various post-processing calibration methods on state-of-the-art neural networks, and introduce several extensions of our own. Surprisingly, we find that a single-parameter variant of Platt scaling (Platt et al., 1999), which we refer to as temperature scaling, is often the most effective method at obtaining calibrated probabilities. Because this method is straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings.

## 2. Definitions

The problem we address in this paper is supervised multiclass classification with neural networks. The input $X \in \mathcal{X}$ and label $Y \in \mathcal{Y} = \{1, \ldots, K\}$ are random variables that follow a ground truth joint distribution $\pi(X, Y) = \pi(Y \mid X)\,\pi(X)$. Let $h$ be a neural network with $h(X) = (\hat{Y}, \hat{P})$, where $\hat{Y}$ is a class prediction and $\hat{P}$ is its associated confidence, i.e. probability of correctness. We would like the confidence estimate $\hat{P}$ to be calibrated, which intuitively means that $\hat{P}$ represents a true probability. For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. More formally, we define perfect calibration as

$$\mathbb{P}\big(\hat{Y} = Y \mid \hat{P} = p\big) = p, \quad \forall p \in [0, 1] \tag{1}$$

where the probability is over the joint distribution. In all practical settings, achieving perfect calibration is impossible. Additionally, the probability in (1) cannot be computed using finitely many samples since $\hat{P}$ is a continuous random variable. This motivates the need for empirical approximations that capture the essence of (1).

**Reliability Diagrams** (e.g.
Figure 1 bottom) are a visual representation of model calibration (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). These diagrams plot expected sample accuracy as a function of confidence. If the model is perfectly calibrated, i.e. if (1) holds, then the diagram should plot the identity function. Any deviation from a perfect diagonal represents miscalibration.

To estimate the expected accuracy from finite samples, we group predictions into $M$ interval bins (each of size $1/M$) and calculate the accuracy of each bin. Let $B_m$ be the set of indices of samples whose prediction confidence falls into the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$. The accuracy of $B_m$ is

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),$$

where $\hat{y}_i$ and $y_i$ are the predicted and true class labels for sample $i$. Basic probability tells us that $\mathrm{acc}(B_m)$ is an unbiased and consistent estimator of $\mathbb{P}(\hat{Y} = Y \mid \hat{P} \in I_m)$. We define the average confidence within bin $B_m$ as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where $\hat{p}_i$ is the confidence for sample $i$. $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ approximate the left-hand and right-hand sides of (1) respectively for bin $B_m$. Therefore, a perfectly calibrated model will have $\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ for all $m \in \{1, \ldots, M\}$. Note that reliability diagrams do not display the proportion of samples in a given bin, and thus cannot be used to estimate how many samples are calibrated.

**Expected Calibration Error (ECE).** While reliability diagrams are useful visual tools, it is more convenient to have a scalar summary statistic of calibration. Since statistics comparing two distributions cannot be comprehensive, previous works have proposed variants, each with a unique emphasis. One notion of miscalibration is the difference in expectation between confidence and accuracy, i.e.

$$\mathbb{E}_{\hat{P}}\Big[\,\big|\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) - p\big|\,\Big] \tag{2}$$

Expected Calibration Error (Naeini et al., 2015), or ECE, approximates (2) by partitioning predictions into $M$ equally-spaced bins (similar to the reliability diagrams) and taking a weighted average of the bins' accuracy/confidence difference. More precisely,

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|, \tag{3}$$

where $n$ is the number of samples. The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams, e.g. Figure 1). We use ECE as the primary empirical metric to measure calibration. See Section S1 for more analysis of this metric.

**Maximum Calibration Error (MCE).** In high-risk applications where reliable confidence measures are absolutely necessary, we may wish to minimize the worst-case deviation between confidence and accuracy:

$$\max_{p \in [0,1]} \big|\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) - p\big|. \tag{4}$$

The Maximum Calibration Error (Naeini et al., 2015), or MCE, estimates an upper bound of this deviation. Similarly to ECE, this approximation involves binning:

$$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|. \tag{5}$$

In reliability diagrams, MCE measures the largest calibration gap (red bars) across all bins, whereas ECE measures a weighted average of all gaps. For perfectly calibrated classifiers, MCE and ECE both equal 0.
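Both estimators are straightforward to compute from held-out predictions. The following is a minimal NumPy sketch (illustrative only, not the authors' implementation): the function name `ece_mce`, the input array names, and the toy data at the bottom are all assumptions made for the example, and the 15 equal-width bins mirror the $M = 15$ setting used for Table 1.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=15):
    """Estimate ECE (eq. 3) and MCE (eq. 5) with equal-width confidence bins.

    confidences: predicted-class probabilities p_hat_i, shape (n,)
    correct:     0/1 indicators 1(y_hat_i == y_i), shape (n,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        # Bin I_m = (m/M, (m+1)/M]; fold confidence exactly 0 into the first bin.
        lo, hi = bin_edges[m], bin_edges[m + 1]
        in_bin = (confidences > lo) & (confidences <= hi)
        if m == 0:
            in_bin |= confidences == 0.0
        if not np.any(in_bin):
            continue                            # empty bins contribute nothing
        acc_m = correct[in_bin].mean()          # acc(B_m)
        conf_m = confidences[in_bin].mean()     # conf(B_m)
        gap = abs(acc_m - conf_m)
        ece += (in_bin.sum() / n) * gap         # weighted average of gaps
        mce = max(mce, gap)                     # worst-case gap
    return ece, mce

# Toy usage: an overconfident classifier (90% confidence, ~70% accuracy).
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
corr = rng.random(1000) < 0.7
print(ece_mce(conf, corr))   # ECE and MCE both come out near 0.2
```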
**Negative log likelihood** (NLL) is a standard measure of a probabilistic model's quality (Friedman et al., 2001). It is also referred to as the cross entropy loss in the context of deep learning (Bengio et al., 2015). Given a probabilistic model $\hat{\pi}(Y \mid X)$ and $n$ samples, NLL is defined as:

$$\mathcal{L} = -\sum_{i=1}^{n} \log \hat{\pi}(y_i \mid x_i) \tag{6}$$

It is a standard result (Friedman et al., 2001) that, in expectation, NLL is minimized if and only if $\hat{\pi}(Y \mid X)$ recovers the ground truth conditional distribution $\pi(Y \mid X)$.

## 3. Observing Miscalibration

The architecture and training procedures of neural networks have rapidly evolved in recent years. In this section we identify some recent changes that are responsible for the miscalibration phenomenon observed in Figure 1. Though we cannot claim causality, we find that model capacity and lack of regularization are closely related to model (mis)calibration.

Figure 2. The effect of network depth (far left), width (middle left), Batch Normalization (middle right), and weight decay (far right) on miscalibration, as measured by ECE (lower is better).

**Model capacity.** The model capacity of neural networks has increased at a dramatic pace over the past few years. It is now common to see networks with hundreds, if not thousands, of layers (He et al., 2016; Huang et al., 2016) and hundreds of convolutional filters per layer (Zagoruyko & Komodakis, 2016). Recent work shows that very deep or wide models are able to generalize better than smaller ones, while exhibiting the capacity to easily fit the training set (Zhang et al., 2017).

Although increasing depth and width may reduce classification error, we observe that these increases negatively affect model calibration. Figure 2 displays error and ECE as a function of depth and width on a ResNet trained on CIFAR-100. The far left figure varies depth for a network with 64 convolutional filters per layer, while the middle left figure fixes the depth at 14 layers and varies the number of convolutional filters per layer. Though even the smallest models in the graph exhibit some degree of miscalibration, the ECE metric grows substantially with model capacity. During training, after the model is able to correctly classify (almost) all training samples, NLL can be further minimized by increasing the confidence of predictions. Increased model capacity will lower training NLL, and thus the model will be more (over)confident on average.
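A small numerical sketch (a toy illustration added here, not an experiment from the paper, assuming a binary task in which every input truly has positive-class probability 0.7) makes this tension concrete: the NLL of a sample the model already classifies correctly keeps shrinking as its confidence approaches 1, while the expected NLL of eq. (6) on unseen data is smallest exactly at the calibrated confidence, so the extra confidence is paid for in probabilistic error.

```python
import numpy as np

# Toy binary problem: every input has true P(Y=1 | X) = 0.7, and the model
# predicts the same confidence q for class 1 on every input.
true_p = 0.7

def expected_nll(q, p=true_p):
    # Expected per-sample NLL (cross entropy) on unseen data.
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def nll_on_correct(q):
    # NLL of a sample the model already labels correctly: -log(confidence).
    return -np.log(q)

for q in [0.5, 0.7, 0.9, 0.99]:
    print(f"q={q:4}: NLL on correct samples={nll_on_correct(q):.3f}, "
          f"expected NLL={expected_nll(q):.3f}")

# NLL on already-correct samples falls monotonically as q -> 1, so training can
# keep pushing confidence upward; expected NLL is smallest at q = 0.7, so the
# overconfident model pays for it on held-out data (cf. Figure 3).
```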
**Batch Normalization** (Ioffe & Szegedy, 2015) improves the optimization of neural networks by minimizing distribution shifts in activations within the neural network's hidden layers. Recent research suggests that these normalization techniques have enabled the development of very deep architectures, such as ResNets (He et al., 2016) and DenseNets (Huang et al., 2017). It has been shown that Batch Normalization improves training time, reduces the need for additional regularization, and can in some cases improve the accuracy of networks.

While it is difficult to pinpoint exactly how Batch Normalization affects the final predictions of a model, we do observe that models trained with Batch Normalization tend to be more miscalibrated. In the middle right plot of Figure 2, we see that a 6-layer ConvNet obtains worse calibration when Batch Normalization is applied, even though classification accuracy improves slightly. We find that this result holds regardless of the hyperparameters used on the Batch Normalization model (i.e. low or high learning rate, etc.).

**Weight decay**, which used to be the predominant regularization mechanism for neural networks, is decreasingly utilized when training modern neural networks. Learning theory suggests that regularization is necessary to prevent overfitting, especially as model capacity increases (Vapnik, 1998). However, due to the apparent regularization effects of Batch Normalization, recent research seems to suggest that models with less L2 regularization tend to generalize better (Ioffe & Szegedy, 2015). As a result, it is now common to train models with little weight decay, if any at all. The top-performing ImageNet models of 2015 all use an order of magnitude less weight decay than models of previous years (He et al., 2016; Simonyan & Zisserman, 2015).

We find that training with less weight decay has a negative impact on calibration. The far right plot in Figure 2 displays training error and ECE for a 110-layer ResNet with varying amounts of weight decay. The only other forms of regularization are data augmentation and Batch Normalization. We observe that calibration and accuracy are not optimized by the same parameter setting. While the model exhibits both over-regularization and under-regularization with respect to classification error, it does not appear that calibration is negatively impacted by having too much weight decay. Model calibration continues to improve when more regularization is added, well after the point of achieving optimal accuracy. The slight uptick at the end of the graph may be an artifact of using a weight decay factor that impedes optimization.

Figure 3. Test error and NLL of a 110-layer ResNet with stochastic depth on CIFAR-100 during training. NLL is scaled by a constant to fit in the figure. Learning rate drops by 10x at epochs 250 and 375. The shaded area marks the region between the epochs at which the best validation loss and best validation error are produced.

**NLL** can be used to indirectly measure model calibration. In practice, we observe a disconnect between NLL and accuracy, which may explain the miscalibration in Figure 2. This disconnect occurs because neural networks can overfit to NLL without overfitting to the 0/1 loss. We observe this trend in the training curves of some miscalibrated models. Figure 3 shows test error and NLL (rescaled to match error) on CIFAR-100 as training progresses. Both error and NLL immediately drop at epoch 250, when the learning rate is dropped; however, NLL overfits during the remainder of training. Surprisingly, overfitting to NLL is beneficial to classification accuracy. On CIFAR-100, test error drops from 29% to 27% in the region where NLL overfits. This phenomenon renders a concrete explanation of miscalibration: the network learns better classification accuracy at the expense of well-modeled probabilities.

We can connect this finding to recent work examining the generalization of large neural networks. Zhang et al. (2017) observe that deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high-capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.

## 4. Calibration Methods

In this section, we first review existing calibration methods, and introduce new variants of our own. All methods are post-processing steps that produce (calibrated) probabilities. Each method requires a hold-out validation set, which in practice can be the same set used for hyperparameter tuning.
We assume that the training, validation, and test sets are drawn from the same distribution.

### 4.1. Calibrating Binary Models

We first introduce calibration in the binary setting, i.e. $\mathcal{Y} = \{0, 1\}$. For simplicity, throughout this subsection, we assume the model outputs only the confidence for the positive class (in contrast with the setting in Section 2, in which the model produces both a class prediction and a confidence). Given a sample $x_i$, we have access to $\hat{p}_i$, the network's predicted probability of $y_i = 1$, as well as $z_i \in \mathbb{R}$, which is the network's non-probabilistic output, or logit. The predicted probability $\hat{p}_i$ is derived from $z_i$ using a sigmoid function $\sigma$; i.e. $\hat{p}_i = \sigma(z_i)$. Our goal is to produce a calibrated probability $\hat{q}_i$ based on $y_i$, $\hat{p}_i$, and $z_i$.

**Histogram binning** (Zadrozny & Elkan, 2001) is a simple non-parametric calibration method. In a nutshell, all uncalibrated predictions $\hat{p}_i$ are divided into mutually exclusive bins $B_1, \ldots, B_M$. Each bin is assigned a calibrated score $\theta_m$; i.e. if $\hat{p}_i$ is assigned to bin $B_m$, then $\hat{q}_i = \theta_m$. At test time, if prediction $\hat{p}_{te}$ falls into bin $B_m$, then the calibrated prediction $\hat{q}_{te}$ is $\theta_m$. More precisely, for a suitably chosen $M$ (usually small), we first define bin boundaries $0 = a_1 \leq a_2 \leq \ldots \leq a_{M+1} = 1$, where the bin $B_m$ is defined by the interval $(a_m, a_{m+1}]$. Typically the bin boundaries are either chosen to be equal-length intervals or to equalize the number of samples in each bin. The predictions $\theta_m$ are chosen to minimize the bin-wise squared loss:

$$\min_{\theta_1, \ldots, \theta_M} \; \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \leq \hat{p}_i < a_{m+1}) \, (\theta_m - y_i)^2, \tag{7}$$

where $\mathbf{1}$ is the indicator function. Given fixed bin boundaries, the solution to (7) results in $\theta_m$ that correspond to the average number of positive-class samples in bin $B_m$.

**Isotonic regression** (Zadrozny & Elkan, 2002), arguably the most common non-parametric calibration method, learns a piecewise constant function $f$ to transform uncalibrated outputs; i.e. $\hat{q}_i = f(\hat{p}_i)$. Specifically, isotonic regression produces $f$ to minimize the square loss $\sum_{i=1}^{n} (f(\hat{p}_i) - y_i)^2$. Because $f$ is constrained to be piecewise constant, we can write the optimization problem as:

$$\begin{aligned}
\min_{\substack{M,\; \theta_1, \ldots, \theta_M,\\ a_1, \ldots, a_{M+1}}} \;\; & \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \leq \hat{p}_i < a_{m+1}) \, (\theta_m - y_i)^2 \\
\text{subject to} \;\; & 0 = a_1 \leq a_2 \leq \ldots \leq a_{M+1} = 1, \\
& \theta_1 \leq \theta_2 \leq \ldots \leq \theta_M,
\end{aligned}$$

where $M$ is the number of intervals; $a_1, \ldots, a_{M+1}$ are the interval boundaries; and $\theta_1, \ldots, \theta_M$ are the function values. Under this parameterization, isotonic regression is a strict generalization of histogram binning in which the bin boundaries and bin predictions are jointly optimized.
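To make these binning approaches concrete, here is a minimal NumPy sketch of histogram binning with equal-width bins (an illustration, not the authors' implementation): with the boundaries fixed, the solution of eq. (7) sets each $\theta_m$ to the empirical positive-class rate of the validation samples in bin $B_m$. The function and variable names (`fit_histogram_binning`, `p_val`, etc.) and the synthetic data are assumptions for the example; isotonic regression could be fit similarly with the monotonicity constraint enforced (e.g. via scikit-learn's `IsotonicRegression`).

```python
import numpy as np

def fit_histogram_binning(p_val, y_val, n_bins=10):
    """Fit the bin scores theta_m of eq. (7) on a binary validation set.

    p_val: uncalibrated positive-class probabilities p_hat_i, shape (n,)
    y_val: binary labels y_i in {0, 1}, shape (n,)
    Returns the bin edges and the calibrated score for each bin.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)           # equal-width bins
    # np.digitize maps each p_hat_i to a bin index in {0, ..., n_bins-1}.
    idx = np.clip(np.digitize(p_val, edges[1:-1]), 0, n_bins - 1)
    theta = np.empty(n_bins)
    for m in range(n_bins):
        in_bin = idx == m
        # Minimizer of the bin-wise squared loss: the empirical positive rate.
        theta[m] = y_val[in_bin].mean() if np.any(in_bin) else edges[m:m + 2].mean()
    return edges, theta

def apply_histogram_binning(p_test, edges, theta):
    """Map new predictions to the calibrated score of their bin."""
    n_bins = len(theta)
    idx = np.clip(np.digitize(p_test, edges[1:-1]), 0, n_bins - 1)
    return theta[idx]

# Usage on hypothetical validation/test confidences:
rng = np.random.default_rng(0)
p_val = rng.random(5000)
y_val = (rng.random(5000) < p_val ** 2).astype(int)      # deliberately miscalibrated
edges, theta = fit_histogram_binning(p_val, y_val)
p_cal = apply_histogram_binning(np.array([0.15, 0.55, 0.95]), edges, theta)
print(p_cal)   # roughly [0.02, 0.30, 0.90]: each score pulled toward its bin's true rate
```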
**Bayesian Binning into Quantiles (BBQ)** (Naeini et al., 2015) is an extension of histogram binning using Bayesian model averaging. Essentially, BBQ marginalizes out all possible binning schemes to produce $\hat{q}_i$. More formally, a binning scheme $s$ is a pair $(M, \mathcal{I})$ where $M$ is the number of bins, and $\mathcal{I}$ is a corresponding partitioning of $[0, 1]$ into disjoint intervals ($0 = a_1 \leq a_2 \leq \ldots \leq a_{M+1} = 1$). The parameters of a binning scheme are $\theta_1, \ldots, \theta_M$. Under this framework, histogram binning and isotonic regression both produce a single binning scheme, whereas BBQ considers a space $\mathcal{S}$ of all possible binning schemes for the validation dataset $D$ (because the validation dataset is finite, $\mathcal{S}$ is as well). BBQ performs Bayesian averaging of the probabilities produced by each scheme:

$$P(\hat{q}_{te} \mid \hat{p}_{te}, D) = \sum_{s \in \mathcal{S}} P(\hat{q}_{te}, S = s \mid \hat{p}_{te}, D) = \sum_{s \in \mathcal{S}} P(\hat{q}_{te} \mid \hat{p}_{te}, S = s, D)\, P(S = s \mid D),$$

where $P(\hat{q}_{te} \mid \hat{p}_{te}, S = s, D)$ is the calibrated probability using binning scheme $s$. Using a uniform prior, the weight $P(S = s \mid D)$ can be derived using Bayes' rule:

$$P(S = s \mid D) = \frac{P(D \mid S = s)}{\sum_{s' \in \mathcal{S}} P(D \mid S = s')}.$$

The parameters $\theta_1, \ldots, \theta_M$ can be viewed as parameters of $M$ independent binomial distributions. Hence, by placing a Beta prior on $\theta_1, \ldots, \theta_M$, we can obtain a closed form expression for the marginal likelihood $P(D \mid S = s)$. This allows us to compute $P(\hat{q}_{te} \mid \hat{p}_{te}, D)$ for any test input.

**Platt scaling** (Platt et al., 1999) is a parametric approach to calibration, unlike the other approaches. The non-probabilistic predictions of a classifier are used as features for a logistic regression model, which is trained on the validation set to return probabilities. More specifically, in the context of neural networks (Niculescu-Mizil & Caruana, 2005), Platt scaling learns scalar parameters $a, b \in \mathbb{R}$ and outputs $\hat{q}_i = \sigma(a z_i + b)$ as the calibrated probability. Parameters $a$ and $b$ can be optimized using the NLL loss over the validation set. It is important to note that the neural network's parameters are fixed during this stage.

### 4.2. Extension to Multiclass Models

For classification problems involving $K > 2$ classes, we return to the original problem formulation. The network outputs a class prediction $\hat{y}_i$ and confidence score $\hat{p}_i$ for each input $x_i$. In this case, the network logits $z_i$ are vectors, where $\hat{y}_i = \mathrm{argmax}_k\, z_i^{(k)}$, and $\hat{p}_i$ is typically derived using the softmax function $\sigma_{\mathrm{SM}}$:

$$\sigma_{\mathrm{SM}}(z_i)^{(k)} = \frac{\exp\big(z_i^{(k)}\big)}{\sum_{j=1}^{K} \exp\big(z_i^{(j)}\big)}, \qquad \hat{p}_i = \max_k \sigma_{\mathrm{SM}}(z_i)^{(k)}.$$

The goal is to produce a calibrated confidence $\hat{q}_i$ and (possibly new) class prediction $\hat{y}_i'$ based on $y_i$, $\hat{y}_i$, $\hat{p}_i$, and $z_i$.

Table 1. ECE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods. The number following a model's name denotes the network depth.

| Dataset | Model | Uncalibrated | Hist. Binning | Isotonic | BBQ | Temp. Scaling | Vector Scaling | Matrix Scaling |
|---|---|---|---|---|---|---|---|---|
| Birds | ResNet 50 | 9.19% | 4.34% | 5.22% | 4.12% | 1.85% | 3.0% | 21.13% |
| Cars | ResNet 50 | 4.3% | 1.74% | 4.29% | 1.84% | 2.35% | 2.37% | 10.5% |
| CIFAR-10 | ResNet 110 | 4.6% | 0.58% | 0.81% | 0.54% | 0.83% | 0.88% | 1.0% |
| CIFAR-10 | ResNet 110 (SD) | 4.12% | 0.67% | 1.11% | 0.9% | 0.6% | 0.64% | 0.72% |
| CIFAR-10 | Wide ResNet 32 | 4.52% | 0.72% | 1.08% | 0.74% | 0.54% | 0.6% | 0.72% |
| CIFAR-10 | DenseNet 40 | 3.28% | 0.44% | 0.61% | 0.81% | 0.33% | 0.41% | 0.41% |
| CIFAR-10 | LeNet 5 | 3.02% | 1.56% | 1.85% | 1.59% | 0.93% | 1.15% | 1.16% |
| CIFAR-100 | ResNet 110 | 16.53% | 2.66% | 4.99% | 5.46% | 1.26% | 1.32% | 25.49% |
| CIFAR-100 | ResNet 110 (SD) | 12.67% | 2.46% | 4.16% | 3.58% | 0.96% | 0.9% | 20.09% |
| CIFAR-100 | Wide ResNet 32 | 15.0% | 3.01% | 5.85% | 5.77% | 2.32% | 2.57% | 24.44% |
| CIFAR-100 | DenseNet 40 | 10.37% | 2.68% | 4.51% | 3.59% | 1.18% | 1.09% | 21.87% |
| CIFAR-100 | LeNet 5 | 4.85% | 6.48% | 2.35% | 3.77% | 2.02% | 2.09% | 13.24% |
| ImageNet | DenseNet 161 | 6.28% | 4.52% | 5.18% | 3.51% | 1.99% | 2.24% | - |
| ImageNet | ResNet 152 | 5.48% | 4.36% | 4.77% | 3.56% | 1.86% | 2.23% | - |
| SVHN | ResNet 152 (SD) | 0.44% | 0.14% | 0.28% | 0.22% | 0.17% | 0.27% | 0.17% |
| 20 News | DAN 3 | 8.02% | 3.6% | 5.52% | 4.98% | 4.11% | 4.61% | 9.1% |
| Reuters | DAN 3 | 0.85% | 1.75% | 1.15% | 0.97% | 0.91% | 0.66% | 1.58% |
| SST Binary | Tree-LSTM | 6.63% | 1.93% | 1.65% | 2.27% | 1.84% | 1.84% | 1.84% |
| SST Fine Grained | Tree-LSTM | 6.71% | 2.09% | 1.65% | 2.61% | 2.56% | 2.98% | 2.39% |

**Extension of binning methods.** One common way of extending binary calibration methods to the multiclass setting is by treating the problem as $K$ one-versus-all problems (Zadrozny & Elkan, 2002). For $k = 1, \ldots, K$, we form a binary calibration problem where the label is $\mathbf{1}(y_i = k)$ and the predicted probability is $\sigma_{\mathrm{SM}}(z_i)^{(k)}$. This gives us $K$ calibration models, each for a particular class.
At test time, we obtain an unnormalized probability vector $[\hat{q}_i^{(1)}, \ldots, \hat{q}_i^{(K)}]$, where $\hat{q}_i^{(k)}$ is the calibrated probability for class $k$. The new class prediction $\hat{y}_i'$ is the argmax of the vector, and the new confidence $\hat{q}_i'$ is the max of the vector normalized by $\sum_{k=1}^{K} \hat{q}_i^{(k)}$. This extension can be applied to histogram binning, isotonic regression, and BBQ.

**Matrix and vector scaling** are two multi-class extensions of Platt scaling. Let $z_i$ be the logits vector produced before the softmax layer for input $x_i$. Matrix scaling applies a linear transformation $W z_i + b$ to the logits:

$$\hat{q}_i = \max_k \sigma_{\mathrm{SM}}(W z_i + b)^{(k)}, \qquad \hat{y}_i' = \underset{k}{\mathrm{argmax}}\; (W z_i + b)^{(k)}. \tag{8}$$

The parameters $W$ and $b$ are optimized with respect to NLL on the validation set. As the number of parameters for matrix scaling grows quadratically with the number of classes $K$, we define vector scaling as a variant where $W$ is restricted to be a diagonal matrix.

**Temperature scaling**, the simplest extension of Platt scaling, uses a single scalar parameter $T > 0$ for all classes. Given the logit vector $z_i$, the new confidence prediction is

$$\hat{q}_i = \max_k \sigma_{\mathrm{SM}}(z_i / T)^{(k)}. \tag{9}$$

$T$ is called the temperature, and it softens the softmax (i.e. raises the output entropy) with $T > 1$. As $T \to \infty$, the probability $\hat{q}_i$ approaches $1/K$, which represents maximum uncertainty. With $T = 1$, we recover the original probability $\hat{p}_i$. As $T \to 0$, the probability collapses to a point mass (i.e. $\hat{q}_i = 1$). $T$ is optimized with respect to NLL on the validation set. Because the parameter $T$ does not change the maximum of the softmax function, the class prediction $\hat{y}_i'$ remains unchanged. In other words, temperature scaling does not affect the model's accuracy.

Temperature scaling is commonly used in settings such as knowledge distillation (Hinton et al., 2015) and statistical mechanics (Jaynes, 1957). To the best of our knowledge, it has not previously been used in the context of calibrating probabilistic models (to highlight the connection with prior works, we define temperature scaling in terms of $\frac{1}{T}$ rather than a multiplicative scalar). The model is equivalent to maximizing the entropy of the output probability distribution subject to certain constraints on the logits (see Section S2).
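Because $T$ is a single scalar fit by NLL on held-out logits, temperature scaling can be implemented in a few lines. The sketch below is an illustrative NumPy/SciPy version, not the authors' released code: `fit_temperature` searches a bounded interval with `scipy.optimize.minimize_scalar`, and the synthetic logits and labels at the bottom are hypothetical stand-ins for a real validation set. Vector scaling would instead learn one scale and one bias per class.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T, logits, labels):
    """Average NLL of softmax(logits / T) against integer labels."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                  # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Pick T > 0 minimizing validation NLL (eq. 9); accuracy is unchanged."""
    res = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                          args=(val_logits, val_labels), method="bounded")
    return res.x

def calibrated_confidence(logits, T):
    """Confidence of the predicted class after temperature scaling."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# Hypothetical usage with validation logits/labels from a trained network:
rng = np.random.default_rng(0)
val_logits = rng.normal(scale=5.0, size=(2000, 10))        # overconfident stand-in logits
val_labels = np.where(rng.random(2000) < 0.75,              # ~75-80% "accurate" labels
                      val_logits.argmax(axis=1),
                      rng.integers(0, 10, size=2000))
T = fit_temperature(val_logits, val_labels)
# The fitted T is typically greater than 1 here, softening the overconfident logits.
print(T, calibrated_confidence(val_logits[:3], T))
```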
### 4.3. Other Related Works

Calibration and confidence scores have been studied in various contexts in recent years. Kuleshov & Ermon (2016) study the problem of calibration in the online setting, where the inputs can come from a potentially adversarial source. Kuleshov & Liang (2015) investigate how to produce calibrated probabilities when the output space is a structured object. Lakshminarayanan et al. (2016) use ensembles of networks to obtain uncertainty estimates. Pereyra et al. (2017) penalize overconfident predictions as a form of regularization. Hendrycks & Gimpel (2017) use confidence scores to determine if samples are out-of-distribution.

Bayesian neural networks (Denker & Lecun, 1990; MacKay, 1992) return a probability distribution over outputs as an alternative way to represent model uncertainty. Gal & Ghahramani (2016) draw a connection between Dropout (Srivastava et al., 2014) and model uncertainty, claiming that sampling models with dropped nodes is a way to estimate the probability distribution over all possible models for a given sample. Kendall & Gal (2017) combine this approach with a model that outputs a predictive mean and variance for each data point. This notion of uncertainty is not restricted to classification problems. In contrast, our framework, which does not require model sampling, returns a confidence for a given output rather than returning a distribution of possible outputs.

## 5. Results

We apply the calibration methods in Section 4 to image classification and document classification neural networks. For image classification we use 6 datasets:

1. Caltech-UCSD Birds (Welinder et al., 2010): 200 bird species. 5994/2897/2897 images for train/validation/test sets.
2. Stanford Cars (Krause et al., 2013): 196 classes of cars by make, model, and year. 8041/4020/4020 images for train/validation/test.
3. ImageNet 2012 (Deng et al., 2009): Natural scene images from 1000 classes. 1.3 million/25,000/25,000 images for train/validation/test.
4. CIFAR-10/CIFAR-100 (Krizhevsky & Hinton, 2009): Color images (32x32) from 10/100 classes. 45,000/5,000/10,000 images for train/validation/test.
5. Street View House Numbers (SVHN) (Netzer et al., 2011): 32x32 colored images of cropped-out house numbers from Google Street View. 604,388/6,000/26,032 images for train/validation/test.

We train state-of-the-art convolutional networks: ResNets (He et al., 2016), ResNets with stochastic depth (SD) (Huang et al., 2016), Wide ResNets (Zagoruyko & Komodakis, 2016), and DenseNets (Huang et al., 2017). We use the data preprocessing, training procedures, and hyperparameters as described in each paper. For Birds and Cars, we fine-tune networks pretrained on ImageNet.

For document classification we experiment with 4 datasets:

1. 20 News: News articles, partitioned into 20 categories by content. 9034/2259/7528 documents for train/validation/test.
2. Reuters: News articles, partitioned into 8 categories by topic. 4388/1097/2189 documents for train/validation/test.
3. Stanford Sentiment Treebank (SST) (Socher et al., 2013): Movie reviews, represented as sentence parse trees that are annotated by sentiment. Each sample includes a coarse binary label and a fine-grained 5-class label. As described in (Tai et al., 2015), the training/validation/test sets contain 6920/872/1821 documents for binary, and 8544/1101/2210 for fine-grained.

On 20 News and Reuters, we train Deep Averaging Networks (DANs) (Iyyer et al., 2015) with 3 feed-forward layers and Batch Normalization. These networks obtain competitive accuracy using the optimization hyperparameters suggested by the original paper. On SST, we train Tree-LSTMs (Long Short-Term Memory) (Tai et al., 2015) using the default settings in the authors' code.

**Calibration Results.** Table 1 displays model calibration, as measured by ECE (with $M = 15$ bins), before and after applying the various methods (see Section S3 for MCE, NLL, and error tables). It is worth noting that most datasets and models experience some degree of miscalibration, with ECE typically between 4% and 10%. This is not architecture-specific: we observe miscalibration on convolutional networks (with and without skip connections), recurrent networks, and deep averaging networks. The two notable exceptions are SVHN and Reuters, both of which experience ECE values below 1%. Both of these datasets have very low error (1.98% and 2.97%, respectively), and therefore the ratio of ECE to error is comparable to other datasets.

Our most important discovery is the surprising effectiveness of temperature scaling despite its remarkable simplicity. Temperature scaling outperforms all other methods on the vision tasks, and performs comparably to other methods on the NLP datasets.
What is perhaps even more surprising is that temperature scaling outperforms the vector and matrix Platt scaling variants, which are strictly more general methods. In fact, vector scaling recovers essentially the same solution as temperature scaling: the learned vector has nearly constant entries, and therefore is no different than a scalar transformation. In other words, network miscalibration is intrinsically low dimensional.

The only dataset that temperature scaling does not calibrate is the Reuters dataset. In this instance, only one of the above methods is able to improve calibration. Because this dataset is well-calibrated to begin with (ECE below 1%), there is not much room for improvement with any method, and post-processing may not even be necessary to begin with. It is also possible that our measurements are affected by dataset split or by the particular binning scheme.

Matrix scaling performs poorly on datasets with hundreds of classes (i.e. Birds, Cars, and CIFAR-100), and fails to converge on the 1000-class ImageNet dataset. This is expected, since the number of parameters scales quadratically with the number of classes. Any calibration model with tens of thousands (or more) parameters will overfit to a small validation set, even when applying regularization.

Figure 4. Reliability diagrams for a ResNet-110 (SD) on CIFAR-100 before calibration (far left) and after calibration with temperature scaling (middle left), histogram binning (middle right), and isotonic regression (far right).

Binning methods improve calibration on most datasets, but do not outperform temperature scaling. Additionally, binning methods tend to change class predictions, which hurts accuracy (see Section S3). Histogram binning, the simplest binning method, typically outperforms isotonic regression and BBQ, despite the fact that both methods are strictly more general. This further supports our finding that calibration is best corrected by simple models.

**Reliability diagrams.** Figure 4 contains reliability diagrams for 110-layer ResNets on CIFAR-100 before and after calibration. From the far left diagram, we see that the uncalibrated ResNet tends to be overconfident in its predictions. We can then observe the effects of temperature scaling (middle left), histogram binning (middle right), and isotonic regression (far right) on calibration. All three displayed methods produce much better confidence estimates. Of the three methods, temperature scaling most closely recovers the desired diagonal function. Each of the bins is well calibrated, which is remarkable given that all the probabilities were modified by only a single parameter. We include reliability diagrams for other datasets in Section S4.

**Computation time.** All methods scale linearly with the number of validation set samples. Temperature scaling is by far the fastest method, as it amounts to a one-dimensional convex optimization problem. Using a conjugate gradient solver, the optimal temperature can be found in 10 iterations, or a fraction of a second on most modern hardware. In fact, even a naive line-search for the optimal temperature is faster than any of the other methods. The computational complexity of vector and matrix scaling are linear and quadratic respectively in the number of classes, reflecting the number of parameters in each method.
For CIFAR-100 ($K = 100$), finding a near-optimal vector scaling solution with conjugate gradient descent requires at least 2 orders of magnitude more time. Histogram binning and isotonic regression take an order of magnitude longer than temperature scaling, and BBQ takes roughly 3 orders of magnitude more time.

**Ease of implementation.** BBQ is arguably the most difficult to implement, as it requires implementing a model averaging scheme. While all other methods are relatively easy to implement, temperature scaling may arguably be the most straightforward to incorporate into a neural network pipeline. In Torch7 (Collobert et al., 2011), for example, we implement temperature scaling by inserting a nn.MulConstant between the logits and the softmax, whose parameter is $1/T$. We set $T = 1$ during training, and subsequently find its optimal value on the validation set.

## 6. Conclusion

Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced. We have demonstrated that recent advances in neural network architecture and training (model capacity, normalization, and regularization) have strong effects on network calibration. It remains future work to understand why these trends affect calibration while improving accuracy. Nevertheless, simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling is the simplest, fastest, and most straightforward of the methods, and surprisingly is often the most effective.

## Acknowledgments

The authors are supported in part by the III-1618134, III-1526012, and IIS-1149882 grants from the National Science Foundation, as well as the Bill and Melinda Gates Foundation and the Office of Naval Research.

## References

Bengio, Yoshua, Goodfellow, Ian J, and Courville, Aaron. Deep learning. Nature, 521:436-444, 2015.

Bojarski, Mariusz, Del Testa, Davide, Dworakowski, Daniel, Firner, Bernhard, Flepp, Beat, Goyal, Prasoon, Jackel, Lawrence D, Monfort, Mathew, Muller, Urs, Zhang, Jiakai, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Caruana, Rich, Lou, Yin, Gehrke, Johannes, Koch, Paul, Sturm, Marc, and Elhadad, Noemie. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn Workshop, NIPS, 2011.

Cosmides, Leda and Tooby, John. Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1):1-73, 1996.

DeGroot, Morris H and Fienberg, Stephen E. The comparison and evaluation of forecasters. The Statistician, pp. 12-22, 1983.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248-255, 2009.

Denker, John S and Lecun, Yann. Transforming neural-net output levels to probability distributions. In NIPS, pp. 853-859, 1990.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

Girshick, Ross. Fast R-CNN. In ICCV, pp. 1440-1448, 2015.
Hannun, Awni, Case, Carl, Casper, Jared, Catanzaro, Bryan, Diamos, Greg, Elsen, Erich, Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.

Hendrycks, Dan and Gimpel, Kevin. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.

Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. 2015.

Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian. Deep networks with stochastic depth. In ECCV, 2016.

Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. In CVPR, 2017.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

Iyyer, Mohit, Manjunatha, Varun, Boyd-Graber, Jordan, and Daumé III, Hal. Deep unordered composition rivals syntactic methods for text classification. In ACL, 2015.

Jaynes, Edwin T. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Jiang, Xiaoqian, Osl, Melanie, Kim, Jihoon, and Ohno-Machado, Lucila. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263-274, 2012.

Kendall, Alex and Cipolla, Roberto. Modelling uncertainty in deep learning for camera relocalization. 2016.

Kendall, Alex and Gal, Yarin. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.

Krause, Jonathan, Stark, Michael, Deng, Jia, and Fei-Fei, Li. 3D object representations for fine-grained categorization. In IEEE Workshop on 3D Representation and Recognition (3dRR), Sydney, Australia, 2013.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images, 2009.

Kuleshov, Volodymyr and Ermon, Stefano. Reliable confidence estimation via online learning. arXiv preprint arXiv:1607.03594, 2016.

Kuleshov, Volodymyr and Liang, Percy. Calibrated structured prediction. In NIPS, pp. 3474-3482, 2015.

Lakshminarayanan, Balaji, Pritzel, Alexander, and Blundell, Charles. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

MacKay, David JC. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.

Naeini, Mahdi Pakdaman, Cooper, Gregory F, and Hauskrecht, Milos. Obtaining well calibrated probabilities using bayesian binning. In AAAI, pp. 2901, 2015.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.

Niculescu-Mizil, Alexandru and Caruana, Rich. Predicting good probabilities with supervised learning. In ICML, pp. 625-632, 2005.

Pereyra, Gabriel, Tucker, George, Chorowski, Jan, Kaiser, Łukasz, and Hinton, Geoffrey. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
Platt, John et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Socher, Richard, Perelygin, Alex, Wu, Jean, Chuang, Jason, Manning, Christopher D., Ng, Andrew, and Potts, Christopher. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631-1642, 2013.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Tai, Kai Sheng, Socher, Richard, and Manning, Christopher D. Improved semantic representations from tree-structured long short-term memory networks. 2015.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

Xiong, Wayne, Droppo, Jasha, Huang, Xuedong, Seide, Frank, Seltzer, Mike, Stolcke, Andreas, Yu, Dong, and Zweig, Geoffrey. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.

Zadrozny, Bianca and Elkan, Charles. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, pp. 609-616, 2001.

Zadrozny, Bianca and Elkan, Charles. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pp. 694-699, 2002.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. In BMVC, 2016.

Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. In ICLR, 2017.