Published in Transactions on Machine Learning Research (07/2024)

Improving Predictor Reliability with Selective Recalibration

Thomas P. Zollo tpz2105@columbia.edu Columbia University
Zhun Deng zhun.d@columbia.edu Columbia University
Jake C. Snell js2523@princeton.edu Princeton University
Toniann Pitassi toni@cs.columbia.edu Columbia University
Richard Zemel zemel@cs.columbia.edu Columbia University

Reviewed on OpenReview: https://openreview.net/forum?id=Aoj9H6jl6F

Abstract

A reliable deep learning system should be able to accurately express its confidence with respect to its predictions, a quality known as calibration. One of the most effective ways to produce reliable confidence estimates with a pre-trained model is by applying a post-hoc recalibration method. Popular recalibration methods like temperature scaling are typically fit on a small amount of data and work in the model's output space, as opposed to the more expressive feature embedding space, and thus usually have only one or a handful of parameters. However, the target distribution to which they are applied is often complex and difficult to fit well with such a function. To this end we propose selective recalibration, where a selection model learns to reject some user-chosen proportion of the data in order to allow the recalibrator to focus on regions of the input space that can be well-captured by such a model. We provide theoretical analysis to motivate our algorithm, and test our method through comprehensive experiments on difficult medical imaging and zero-shot classification tasks. Our results show that selective recalibration consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines.

1 Introduction

In order to build user trust in a machine learning system, it is important that the system can accurately express its confidence with respect to its own predictions.
Under the notion of calibration common in deep learning (Guo et al., 2017; Minderer et al., 2021), a confidence estimate output by a model should be as close as possible to the expected accuracy of the prediction. For instance, a prediction assigned 30% confidence should be correct 30% of the time. This is especially important in risk-sensitive settings such as medical diagnosis, where binary predictions alone are not useful since a 30% chance of disease must be treated differently than a 1% chance. While advancements in neural network architecture and training have brought improvements in calibration as compared to previous methods (Minderer et al., 2021), neural networks still suffer from miscalibration, usually in the form of overconfidence (Guo et al., 2017; Wang et al., 2021). In addition, these models are often applied to complex data distributions, possibly including outliers, and may have different calibration error within and between different subsets in the data (Ovadia et al., 2019; Perez-Lebel et al., 2023). We illustrate this setting in Figure 1a with a Reliability Diagram, a tool for visualizing calibration by plotting average confidence against accuracy for bins of datapoints with similar confidence estimates. To address this calibration error, the confidence estimates of a pre-trained model can be refined using a post-hoc recalibration method like Platt scaling (Platt, 1999), temperature scaling (Guo et al., 2017), or histogram binning (Zadrozny & Elkan, 2001).

Figure 1: Reliability Diagrams for a model that has different calibration error (deviation from the diagonal) in different subsets of the data (here shown in blue and green). The data per subset is binned based on confidence values; each marker represents a bin, and its size depicts the amount of data in the bin. The red dashed diagonal represents perfect calibration, where confidence equals expected accuracy.
Given existing empirical evidence (Guo et al., 2017) and the fact that they are typically fit on small validation sets (on the order of hundreds to a few thousand examples), these recalibrators usually reduce the input space to the model's logits (e.g., temperature scaling) or predicted class scores (e.g., Platt scaling, histogram binning), as opposed to the high-dimensional and expressive feature embedding space. Accordingly, they are generally inexpressive models, having only one or a handful of parameters (Platt, 1999; Guo et al., 2017; Zadrozny & Elkan, 2001; Kumar et al., 2019). But the complex data distributions to which neural networks are often applied are difficult to fit well with such simple functions, and calibration error can even be exacerbated for some regions of the input space, especially when the model has only a single scaling parameter (see Figure 1b). Motivated by these observations, we contend that these popular recalibration methods are a natural fit for use with a selection model. Selection models (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017) are used alongside a classifier, and may reject a portion of the classifier's predictions in order to improve some performance metric on the subset of accepted (i.e., unrejected) examples. While selection models have typically been applied to improve classifier accuracy (Geifman & El-Yaniv, 2017), they have also been used to improve calibration error by rejecting the confidence estimates of a fixed model (Fisch et al., 2022). However, selection alone cannot fully address the underlying miscalibration because it does not alter the confidence output of the model (see Figure 1c), and the connection between selection and post-hoc recalibration remains largely unexplored. In this work we propose selective recalibration, where a selection model and a recalibration model are jointly optimized in order to produce predictions with low calibration error.
By rejecting some portion of the data, the system can focus on a region that can be well-captured by a simple recalibration model, leading to a set of predictions with lower calibration error than under recalibration or selection alone (see Figure 1d). This approach is especially important when a machine learning model is deployed for decision-making in risk-sensitive domains such as healthcare, finance, and law, where a predictor must be well-calibrated if a human expert is to use its output to improve outcomes and avoid causing active harm. To summarize our contributions:

- We formulate selective recalibration, and offer a new loss function for training such a system, Selective Top-Label Binary Cross Entropy (S-TLBCE), which aligns with the typical notion of loss under smooth recalibrators like Platt or temperature scaling models.
- We test selective recalibration and S-TLBCE in real-world medical diagnosis and image classification experiments, and find that selective recalibration with S-TLBCE consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines.
- We provide theoretical insight to support our motivations and algorithm.

2 Related Work

Making well-calibrated predictions is a key feature of a reliable statistical model (Guo et al., 2017; Hébert-Johnson et al., 2018; Minderer et al., 2021; Fisch et al., 2022). Popular methods for improving calibration error given a pretrained model and labeled validation dataset include Platt scaling (Platt, 1999) and temperature scaling (Guo et al., 2017), histogram binning (Zadrozny & Elkan, 2001), and Platt binning (Kumar et al., 2019) (as well as others like those in Naeini et al. (2015); Zhang et al. (2020)).
Loss functions have also been proposed to improve the calibration error of a neural network in training, including Maximum Mean Calibration Error (MMCE) (Kumar et al., 2018), S-AvUC, and SB-ECE (Karandikar et al., 2021). Calibration error is typically measured using quantities such as Expected Calibration Error (Naeini et al., 2015; Guo et al., 2017), Maximum Calibration Error (Naeini et al., 2015; Guo et al., 2017), or Brier Score (Brier, 1950) that measure whether prediction confidence matches expected outcomes. Previous research (Roelofs et al., 2022) has demonstrated that calibration measures calculated using binning have lower bias when computed using equal-mass bins. Another technique for improving ML system reliability is selective classification (Geifman & El-Yaniv, 2017; El-Yaniv & Wiener, 2010), wherein a model is given the option to abstain from making a prediction on certain examples (often based on confidence or out-of-distribution measures). Selective classification has been well-studied in the context of neural networks (Geifman & El-Yaniv, 2017; 2019; Madras et al., 2018). It has been shown to increase disparities in accuracy across groups (Jones et al., 2021), although work has been done to mitigate this effect in both classification (Jones et al., 2021) and regression (Shah et al., 2022) tasks by enforcing calibration across groups. Recent work by Fisch et al. (2022) introduces the setting of calibrated selective classification, in which predictions from a pre-trained model are rejected for the sake of improving selective calibration error. The authors propose a method for training a selective calibration model using an S-MMCE loss function derived from the work of Kumar et al. (2018). Our work differs from this and other previous work by considering the joint training and application of selection and recalibration models. While Fisch et al.
(2022) apply selection directly to a frozen model's outputs, we contend that the value in our algorithm lies in this joint optimization. Also, instead of using S-MMCE, we propose a new loss function, S-TLBCE, which more closely aligns with the objective function for Platt and temperature scaling. Besides calibration and selection, there are other approaches to quantifying and addressing the uncertainty in modern neural networks. One popular approach is the use of ensembles, where multiple models are trained and their joint outputs are used to estimate uncertainty. Ensembles have been shown to both improve accuracy and provide a means to estimate predictive uncertainty without the need for Bayesian modeling (Lakshminarayanan et al., 2017). Bayesian neural networks (BNNs) offer an alternative by explicitly modeling uncertainty through distributions over the weights of the network, thus providing a principled uncertainty estimation (Blundell et al., 2015). Dropout can also be viewed as approximate Bayesian inference (Gal, 2016). Another technique which has received interest recently is conformal prediction, which uses past data to determine a prediction interval or set in which future points are predicted to fall with high probability (Shafer & Vovk, 2008; Vovk et al., 2005). Such distribution-free guarantees have been extended to cover a wide set of risk measures (Deng et al., 2023; Snell et al., 2023) and applications such as robot planning (Ren et al., 2023) and prompting a large language model (Zollo et al., 2024).

3 Background

Consider the multi-class classification setting with $K$ classes and data instances $(x, y) \sim \mathcal{D}$, where $x$ is the input and $y \in \{1, 2, \ldots, K\}$ is the ground-truth class label. For a black-box predictor $f$, $f(x) \in \mathbb{R}^K$ is a vector where $f(x)_k$ is the predicted probability that input $x$ has label $k$; we denote the confidence in the top predicted label as $\hat{f}(x) = \max_k f(x)_k$.
Further, we may access the unnormalized class scores $f_s(x) \in \mathbb{R}^K$ (which may be negative) and the feature embeddings $f_e(x) \in \mathbb{R}^d$. The predicted class is $\hat{y} = \arg\max_k f(x)_k$ and the correctness of a prediction is $y^c = \mathbf{1}\{y = \hat{y}\}$.

3.1 Selective Classification

In selective classification, a selection model $g$ produces binary outputs, where 0 indicates rejection and 1 indicates acceptance. A common goal is to decrease some error metric by rejecting no more than a $1 - \beta$ proportion of the data for a given target coverage level $\beta$. One popular choice of input to $g$ is the feature embedding $f_e(x)$, although other representations may be used. Often, a soft selection model $\hat{g} : \mathbb{R}^d \to [0, 1]$ is trained and $g$ is produced at inference time by choosing a threshold $\tau$ on $\hat{g}$ to achieve coverage level $\beta$ (i.e., $\mathbb{E}[\mathbf{1}\{\hat{g}(X) \geq \tau\}] = \beta$).

3.2 Calibration

The model $f$ is said to be top-label calibrated if $\mathbb{E}_{\mathcal{D}}[y^c \mid \hat{f}(x) = p] = p$ for all $p \in [0, 1]$ in the range of $\hat{f}(x)$. To measure deviation from this condition, we calculate expected calibration error (ECE):

$$\mathrm{ECE}_q = \left( \mathbb{E}_{\hat{f}(x)} \left[ \big| \mathbb{E}_{\mathcal{D}}[y^c \mid \hat{f}(x)] - \hat{f}(x) \big|^q \right] \right)^{1/q} \tag{1}$$

where $q$ is typically 1 or 2. A recalibrator model $h$ can be applied to $f$ to produce outputs in the interval $[0, 1]$ such that $h(f(x)) \in \mathbb{R}^K$ is the recalibrated prediction confidence for input $x$ and $\hat{h}(f(x)) = \max_k h(f(x))_k$. See Section 4.3 for details on some specific forms of $h(\cdot)$.

3.3 Selective Calibration

Under the notion of calibrated selective classification introduced by Fisch et al. (2022), a predictor is selectively calibrated if $\mathbb{E}_{\mathcal{D}}\big[y^c \mid \hat{f}(x) = p, g(x) = 1\big] = p$ for all $p \in [0, 1]$ in the range of $\hat{f}(x)$ where $g(x) = 1$. To interpret this statement: for the subset of examples that are accepted (i.e., $g(x) = 1$), at a given confidence level $p$ the predicted label should be correct for a $p$ proportion of instances.
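For concreteness, the binned estimator of $\mathrm{ECE}_q$ with equal-mass bins (the binning scheme used in our experiments, following Roelofs et al. (2022)) can be written in a few lines. This is a minimal numpy sketch; the function name and default bin count are illustrative, not from the paper:

```python
import numpy as np

def ece(confidences, correct, n_bins=10, q=1):
    """Estimate top-label ECE_q with equal-mass binning.

    confidences: top-label confidence f_hat(x) per example.
    correct:     1 if the top predicted label was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    # Equal-mass bins: each bin holds roughly n / n_bins examples.
    bins = np.array_split(order, n_bins)
    total, n = 0.0, len(confidences)
    for b in bins:
        if len(b) == 0:
            continue
        # |accuracy - mean confidence| within the bin, weighted by bin mass.
        gap = abs(correct[b].mean() - confidences[b].mean())
        total += (len(b) / n) * gap ** q
    return total ** (1.0 / q)
```

A maximally overconfident model (confidence 0.9, always wrong) gets `ece` of 0.9 under both $q = 1$ and $q = 2$, while a model whose confidence matches its accuracy in every bin gets 0.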
Selective expected calibration error is then calculated as:

$$\mathrm{S\text{-}ECE}_q = \left( \mathbb{E}_{\hat{f}(x)} \left[ \big| \mathbb{E}_{\mathcal{D}}[y^c \mid \hat{f}(x), g(x) = 1] - \hat{f}(x) \big|^q \,\Big|\, g(x) = 1 \right] \right)^{1/q} \tag{2}$$

It should be noted that selective calibration is a separate goal from selective accuracy, and enforcing it may in some cases decrease accuracy. For example, a system may reject datapoints with $\hat{f}(x) = 0.7$ and $p(y^c = 1 \mid x) = 0.99$ (which will be accurate 99% of the time) in order to retain datapoints with $\hat{f}(x) = 0.7$ and $p(y^c = 1 \mid x) = 0.7$ (which will be accurate 70% of the time). This will decrease accuracy, but the tradeoff would be acceptable in many applications where probabilistic estimates (as opposed to discrete labels) are the key decision-making tool. See Section 5.2.1 for a more thorough discussion and empirical results regarding this potential trade-off. Here we are only concerned with calibration, and leave methods for exploring the Pareto front of selective calibration and accuracy to future work.

4 Selective Recalibration

In order to achieve lower calibration error than existing approaches, we propose jointly optimizing a selection model and a recalibration model. Expected calibration error under both selection and recalibration is equal to

$$\mathrm{SR\text{-}ECE}_q = \left( \mathbb{E}_{\hat{h}(f(x))} \left[ \big| \mathbb{E}_{\mathcal{D}}[y^c \mid \hat{h}(f(x)), g(x) = 1] - \hat{h}(f(x)) \big|^q \,\Big|\, g(x) = 1 \right] \right)^{1/q} \tag{3}$$

Taking $\mathrm{SR\text{-}ECE}_q$ as our loss quantity of interest, our goal in selective recalibration is to solve the optimization problem:

$$\min_{g,h} \ \mathrm{SR\text{-}ECE}_q \quad \text{s.t.} \quad \mathbb{E}_{\mathcal{D}}[g(x)] \geq \beta, \tag{4}$$

where $\beta$ is our desired coverage level. There are several different ways one could approach optimizing the quantity in Eq. 4 through selection and/or recalibration. One could apply only $h$ or $g$, first train $h$ and then $g$ (or vice versa), or jointly train $g$ and $h$ (i.e., selective recalibration). In Fisch et al. (2022), only $g$ is applied; however, as our experiments will show, much of the possible reduction in calibration error comes from $h$.
While $h$ can be effective alone, typical recalibrators are inexpressive, and thus may benefit from rejecting some difficult-to-fit portion of the data (as we find to be the case in experiments on several real-world datasets in Section 5). Training the models sequentially is also sub-optimal, as the benefits of selection with regard to recalibration can only be fully realized if the two models interact in training, since fixing the first model constrains the available solutions. Selective recalibration, where $g$ and $h$ are trained together, admits any solution available to these approaches, and can produce combinations of $g$ and $h$ that are unlikely to be found via sequential optimization (we formalize this intuition theoretically via an example in Section 6). Since Eq. 4 cannot be directly optimized, we instead follow Geifman & El-Yaniv (2019) and Fisch et al. (2022) and define a surrogate loss function $L$ including both a selective error quantity and a term to enforce the coverage constraint (weighted by $\lambda$):

$$\min_{g,h} \ L = L_{\mathrm{sel}} + \lambda L_{\mathrm{cov}}(\beta). \tag{5}$$

We describe choices for $L_{\mathrm{sel}}$ (selection loss) and $L_{\mathrm{cov}}$ (coverage loss) in Sections 4.1 and 4.2, along with recalibrator models in Section 4.3. Finally, we specify an inference procedure in Section 4.4, and explain how the model trained with the soft constraint in Eq. 5 is used to satisfy Eq. 4.

4.1 Selection Loss

In selective recalibration, the selection loss term measures the calibration of selected instances. Its general form for a batch of data $D = \{(x_i, y_i)\}_{i=1}^n$ with $n$ examples is

$$L_{\mathrm{sel}} = \frac{l(\hat{g}, h; f, D)}{\frac{1}{n} \sum_i \hat{g}(x_i)} \tag{6}$$

where $l$ measures the loss on selected examples and the denominator scales the loss according to the proportion preserved. We consider three forms of $l$: the S-MMCE of Fisch et al. (2022), a selective version of multi-class cross entropy, and our proposed selective top-label cross entropy loss.

4.1.1 Maximum Mean Calibration Error

We apply the S-MMCE loss function proposed in Fisch et al.
(2022) for training a selective calibration system. For a batch of training data, this loss function is defined as

$$l_{\mathrm{S\text{-}MMCE}}(\hat{g}, h; f, D) = \left[ \frac{1}{n^2} \sum_{i,j} \big| y^c_i - \hat{h}(f(x_i)) \big|^q \, \big| y^c_j - \hat{h}(f(x_j)) \big|^q \, \hat{g}(x_i) \hat{g}(x_j) \, \phi\big( \hat{h}(f(x_i)), \hat{h}(f(x_j)) \big) \right]^{1/q} \tag{7}$$

where $\phi$ is some similarity kernel, such as a Laplacian kernel. At a high level, this loss penalizes pairs of instances that have similar confidence and are both far from the true label $y^c$ (which denotes prediction correctness, 0 or 1). Further details and motivation for such an objective can be found in Fisch et al. (2022).

4.1.2 Top-Label Binary Cross Entropy

Consider a selective version of a typical multi-class cross entropy loss:

$$l_{\mathrm{S\text{-}MCE}}(\hat{g}, h; f, D) = -\frac{1}{n} \sum_i \hat{g}(x_i) \log h(f(x_i))_{y_i}. \tag{8}$$

In the case that the model is incorrect ($y^c = 0$), this loss penalizes based on under-confidence in the ground-truth class. However, our goal is calibration according to the predicted class. Thus we propose a loss function for training a selective recalibration model based on the typical approach to optimizing a smooth recalibration model, Selective Top-Label Binary Cross Entropy (S-TLBCE):

$$l_{\mathrm{S\text{-}TLBCE}}(\hat{g}, h; f, D) = -\frac{1}{n} \sum_i \hat{g}(x_i) \left[ y^c_i \log \hat{h}(f(x_i)) + (1 - y^c_i) \log\big(1 - \hat{h}(f(x_i))\big) \right]. \tag{9}$$

In contrast to S-MCE, in the case of an incorrect prediction S-TLBCE penalizes based on over-confidence in the predicted label. This aligns with the established notion of top-label calibration error, as well as the typical Platt or temperature scaling objectives, and makes this a natural loss function for training a selective recalibration model. We compare S-TLBCE and S-MCE empirically in our experiments, and note that in the binary case these losses are the same.

4.2 Coverage Loss

When the goal of selection is improving accuracy, there exists an ordering under $\hat{g}$ that is optimal for any choice of $\beta$, namely one where $\hat{g}$ is greater for all correctly labeled examples than it is for any incorrectly labeled example.
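The S-TLBCE term of Eq. 9 is just a binary cross-entropy on prediction correctness, weighted by the soft selection scores. A minimal numpy sketch (the function name and the `eps` clamp are our own; in practice this would be a differentiable operation in an autodiff framework):

```python
import numpy as np

def s_tlbce(g_soft, h_conf, correct, eps=1e-12):
    """Selective Top-Label Binary Cross Entropy (Eq. 9), numpy sketch.

    g_soft:  soft selection scores g_hat(x_i) in [0, 1].
    h_conf:  recalibrated top-label confidences h_hat(f(x_i)).
    correct: y_c values, 1 if the prediction was right, else 0.
    """
    g = np.asarray(g_soft, dtype=float)
    p = np.clip(np.asarray(h_conf, dtype=float), eps, 1 - eps)
    yc = np.asarray(correct, dtype=float)
    # Penalize over-confidence in the *predicted* label when wrong,
    # under-confidence when right: binary CE on prediction correctness.
    per_example = -(yc * np.log(p) + (1 - yc) * np.log(1 - p))
    return np.mean(g * per_example)
```

Dividing this by the mean of `g_soft` would give the full selection loss of Eq. 6. Note that examples with a selection score of zero contribute nothing, which is what lets the selector steer the recalibrator away from hard-to-fit regions.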
Accordingly, coverage losses used to train such systems often only enforce that no more than a $1 - \beta$ proportion is rejected. Unlike selective accuracy, selective calibration is not monotonic with respect to individual examples, and a mismatch in coverage between training and deployment may hurt test performance. Thus in selective recalibration we assume the user aims to accept exactly a $\beta$ proportion of the data, and employ a coverage loss that targets a specific $\beta$:

$$L_{\mathrm{cov}}(\beta) = \left( \beta - \frac{1}{n} \sum_i \hat{g}(x_i) \right)^2. \tag{10}$$

Such a loss is an asymptotically consistent estimator of $(\beta - \mathbb{E}[\hat{g}(x)])^2$. Alternatively, Fisch et al. (2022) use a logarithmic regularization approach for enforcing the coverage constraint without a specific target $\beta$, computing cross entropy between the output of $\hat{g}$ and a target vector of all ones. However, we found this approach to be unstable and sensitive to the choice of $\lambda$ in initial experiments, while our coverage loss enabled stable training at any sufficiently large choice of $\lambda$, similar to the findings of Geifman & El-Yaniv (2019).

4.3 Recalibration Models

We consider two popular differentiable calibration models, Platt scaling and temperature scaling, both of which attempt to fit a function between model confidence and output correctness. The main difference between the models is that Platt scaling works in the output probability space, whereas temperature scaling is applied to the model logits before a softmax is taken. A Platt recalibrator (Platt, 1999) produces output according to

$$h_{\mathrm{Platt}}(f(x)) = \frac{1}{1 + \exp(w f(x) + b)} \tag{11}$$

where $w, b$ are learnable scalar parameters. On the other hand, a temperature scaling model (Guo et al., 2017) produces output according to

$$h_{\mathrm{Temp}}(f_s(x)) = \mathrm{softmax}\!\left( \frac{f_s(x)}{T} \right) \tag{12}$$

where $f_s(x)$ is the vector of logits (unnormalized scores) produced by $f$ and $T$ is the single learned (scalar) parameter. Both models are typically trained with a binary cross-entropy loss, where the labels 0 and 1 denote whether an instance is correctly classified.
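Because temperature scaling has a single scalar parameter, its fit can be sketched very compactly. The following is an illustrative simplification, not the paper's procedure: it fits $T$ by grid search over the top-label binary cross-entropy described above, whereas the paper's models are fit by gradient descent (and jointly with the selector):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 191)):
    """Fit the single temperature T minimizing top-label binary
    cross-entropy against correctness labels (grid-search sketch)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    yc = (logits.argmax(axis=1) == labels).astype(float)  # correctness
    best_T, best_loss = 1.0, np.inf
    for T in grid:
        conf = np.clip(softmax(logits, T).max(axis=1), 1e-12, 1 - 1e-12)
        loss = -np.mean(yc * np.log(conf) + (1 - yc) * np.log(1 - conf))
        if loss < best_loss:
            best_T, best_loss = T, loss
    return best_T
```

On an overconfident model (high logit margins, mediocre accuracy) the fitted $T$ comes out well above 1, softening the confidences, while for an already-calibrated model it stays near 1.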
4.4 Inference

Once we have trained $\hat{g}$ and $h$, we can flexibly account for $\beta$ by selecting a threshold $\tau$ on unlabeled test data (or some other representative tuning set) such that $\mathbb{E}[\mathbf{1}\{\hat{g}(X) \geq \tau\}] = \beta$. The model $g$ is then simply $g(x) := \mathbf{1}\{\hat{g}(x) \geq \tau\}$.

4.4.1 High-Probability Coverage Guarantees

Since $\mathbf{1}\{\hat{g}(x) \geq \tau\}$ is a random variable with a Bernoulli distribution, we may also apply the Hoeffding bound (Hoeffding, 1963) to guarantee that with high probability the target coverage $\hat{\beta}$ (the proportion of the target distribution where $\hat{g}(x) \geq \tau$) will lie in some range. Given a set $V$ of $n_u$ i.i.d. unlabeled examples from the target distribution, we denote the empirical coverage on $V$ as $\bar{\beta}$. With probability at least $1 - \delta$, $\hat{\beta}$ will be in the range $[\bar{\beta} - \epsilon, \bar{\beta} + \epsilon]$, where

$$\epsilon = \sqrt{\frac{\log(2/\delta)}{2 n_u}}.$$

For some critical coverage level $\beta$, $\tau$ can be decreased until $\bar{\beta} - \epsilon \geq \beta$.

5 Experiments

In this section we examine the performance of selective recalibration and baselines when applied to models pre-trained on real-world datasets and applied to a target distribution possibly shifted from the training distribution. In Section 5.1 we investigate whether, given a small validation set of labeled examples drawn i.i.d. from the target distribution, joint optimization consistently leads to lower empirical selective calibration error than selection or recalibration alone or sequential optimization. Subsequently, in Section 5.2 we study multiple out-of-distribution prediction tasks and the ability of a single system to provide decreasing selective calibration error across a range of coverage levels when faced with a further distribution shift from validation data to test data. We also analyze the trade-off between selective calibration error and accuracy in this setting.
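The threshold choice and the Hoeffding half-width above can be sketched directly. This is a minimal sketch (function names are ours); the quantile of the selector scores gives a $\tau$ accepting roughly a $\beta$ fraction, and `hoeffding_eps` gives the interval half-width for $n_u$ unlabeled points:

```python
import numpy as np

def coverage_threshold(g_scores, beta):
    """Pick tau so that about a beta fraction of scores satisfy
    g_hat(x) >= tau, as in Section 4.4."""
    g_scores = np.asarray(g_scores, dtype=float)
    # The (1 - beta)-quantile leaves ~beta probability mass above tau.
    return np.quantile(g_scores, 1.0 - beta)

def hoeffding_eps(n_unlabeled, delta):
    """Half-width of the two-sided Hoeffding interval: with probability
    >= 1 - delta, true coverage is within +/- eps of the empirical
    coverage computed on n_unlabeled i.i.d. points."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n_unlabeled))
```

For example, with $n_u = 2000$ and $\delta = 0.05$ the half-width is about 0.03, so an empirical coverage of 0.83 certifies true coverage of at least 0.80 with 95% probability.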
Since we are introducing the objective of selective recalibration here, we focus on high-level design decisions, in particular the choice of selection and recalibration method, loss function, and optimization procedure (joint vs. sequential). For selective recalibration models, the input to $g$ is the feature embedding. Temperature scaling is used for multi-class examples and Platt scaling is applied in the binary cases (following Guo et al. (2017) and initial results on validation data). Calibration error is measured using $\mathrm{ECE}_1$ and $\mathrm{ECE}_2$ with equal-mass binning. For the selection loss, we use $l_{\mathrm{S\text{-}TLBCE}}$ and $l_{\mathrm{S\text{-}MMCE}}$ for binary tasks, and include a selective version of typical multi-class cross-entropy ($l_{\mathrm{S\text{-}MCE}}$) for multi-class tasks. Pre-training is performed where $h$ is optimized first in order to reasonably initialize the calibrator parameters before beginning to train $g$. Models are trained both with $h$ fixed after this pre-training (denoted "sequential" in results) and with $h$ jointly optimized throughout training (denoted "joint" in results). Our selection baselines include confidence-based rejection ("Confidence") and multiple out-of-distribution (OOD) detection methods ("Iso. Forest", "One-class SVM"), common techniques when rejecting to improve accuracy. The confidence baseline rejects examples with the smallest $\hat{f}(x)$ (or $\hat{h}(f(x))$), while the OOD methods are measured in the embedding space of the pre-trained model. All selection baselines are applied to the recalibrated model in order to make the strongest comparison. We make further comparisons to recalibration baselines, including the previously described temperature and Platt scaling as well as binning methods like histogram binning and Platt binning. See Appendix A for more experiment details, including calibration error measurement and baseline implementations.

5.1 Selective Recalibration with i.i.d.
Data

First, we test whether selective recalibration consistently produces low ECE in a setting where a validation set of labeled training data is available from the same distribution as the test data, using outputs of pretrained models on the Camelyon17 and ImageNet datasets. Camelyon17 (Bandi et al., 2018) is a task where the input $x$ is a 96×96 patch of a whole-slide image of a lymph node section from a patient with potentially metastatic breast cancer and the label $y$ is whether the patch contains a tumor. Selection and recalibration models are trained with 1000 samples, and we apply a Platt scaling $h$ since the task is binary. ImageNet is a well-known large-scale image classification dataset; we use 2000 samples from a supervised ResNet34 model for training selection and recalibration models, and a temperature scaling $h$ since the task is multi-class. Our soft selector $\hat{g}$ is a shallow fully-connected network (2 hidden layers with dimension 128), and we report selective calibration error for coverage levels $\beta \in \{0.75, 0.8, 0.85, 0.9\}$. Full experiment details, including model specifications and training parameters, can be found in Appendix A.

Figure 2: Selective calibration error on ImageNet and Camelyon17 for coverage levels $\beta \in \{0.75, 0.8, 0.85, 0.9\}$. Left: various recalibration methods trained using labeled validation data. Middle: selection baselines, including confidence-based rejection and various OOD measures. Right: selective recalibration with different loss functions.

Results are displayed in Figure 2. They show that by jointly optimizing the selector and recalibration models, we are able to achieve improved ECE at the given coverage level $\beta$ compared to first training $h$ and then $g$. We also find selective recalibration achieves the lowest ECE in almost every case in comparison to recalibration alone.
While the confidence-based selection strategy performs well in these experiments, it is not a good approach to selective calibration in general, as it is a heuristic and may fail in cases where a model's confident predictions are in fact poorly calibrated (see Section 5.2 for examples). In addition, the S-TLBCE loss shows more consistent performance than S-MMCE: it reduces ECE in every case, whereas training with S-MMCE increases calibration error in some cases. To be sure that the lower calibration error as compared to recalibration is not due to the extra parameters in the selector model, we also produce results for a recalibration model with the same architecture as our selector. Once again the input is the feature embedding, and the model $h$ has 2 hidden layers with dimension 128. Results for Camelyon17 are included in Figure 2; for clarity of presentation of the ImageNet results, we omit the MLP recalibrator from the plot, as it was an order of magnitude worse than all other methods (ECE1 ≈ 0.26, ECE2 ≈ 0.30). In neither case does this baseline reach the performance of the best low-dimensional recalibration model.

5.2 Selective Re-Calibration under Distribution Shift

In this experiment, we study the various methods applied to genetic perturbation classification with RxRx1, as well as zero-shot image classification with CLIP and CIFAR-100-C. RxRx1 (Taylor et al., 2019) is a task where the input $x$ is a 3-channel image of cells obtained by fluorescent microscopy, the label $y$ indicates which of 1,139 genetic treatments (including no treatment) the cells received, and there is a domain shift across the batch in which the imaging experiment was run. CIFAR-100 is a well-known image classification dataset, and we perform zero-shot image classification with CLIP. We follow the setting of Fisch et al.
(2022), where the test data is drawn from a shifted distribution with respect to the validation set and the goal is not to target a specific $\beta$, but rather to train a selector that works across a range of coverage levels. In the case of RxRx1, the strong batch processing effect leads to a 9% difference in pretrained model accuracy between validation (18%) and test (27%) outputs; we also apply a selective recalibration model trained on validation output from zero-shot CLIP and CIFAR-100 to test examples drawn from the OOD CIFAR-100-C test set. Our validation sets have 1000 (RxRx1) or 2000 (CIFAR-100) examples, $\hat{g}$ is a small network with 1 hidden layer of dimension 64, and we set $\beta = 0.5$ when training the models. For our results we report the area under the curve (AUC) for the coverage vs. error curve, a typical metric in selective classification (Geifman & El-Yaniv, 2017; Fisch et al., 2022) that reflects how a model can reduce the error on average at different levels of $\beta$. We measure AUC in the range $\beta \in [0.5, 1.0]$, with measurements taken at intervals of 0.05 (i.e., $\beta \in \{0.5, 0.55, 0.6, \ldots, 0.95, 1.0\}$). Additionally, to induce robustness to the distribution shift, we noise the selector/recalibrator input. See Appendix A for full specifications.

Table 1: RxRx1 and CIFAR-100-C AUC in the range $\beta \in [0.5, 1.0]$.

| Selection | Opt. of h, g | RxRx1 ECE1 | RxRx1 ECE2 | RxRx1 Acc. | CIFAR-100-C ECE1 | CIFAR-100-C ECE2 | CIFAR-100-C Acc. |
|---|---|---|---|---|---|---|---|
| Confidence | - | 0.071 | 0.081 | 0.353 | 0.048 | 0.054 | 0.553 |
| One-class SVM | - | 0.058 | 0.077 | 0.227 | 0.044 | 0.051 | 0.388 |
| Iso. Forest | - | 0.048 | 0.061 | 0.221 | 0.044 | 0.051 | 0.379 |
| S-MCE | sequential | 0.059 | 0.075 | 0.250 | 0.033 | 0.041 | 0.499 |
| S-MCE | joint | 0.057 | 0.073 | 0.249 | 0.060 | 0.068 | 0.484 |
| S-MMCE | sequential | 0.036 | 0.045 | 0.218 | 0.030 | 0.037 | 0.503 |
| S-MMCE | joint | 0.036 | 0.045 | 0.218 | 0.043 | 0.051 | 0.489 |
| S-TLBCE | sequential | 0.036 | 0.045 | 0.219 | 0.030 | 0.037 | 0.507 |
| S-TLBCE | joint | 0.039 | 0.049 | 0.218 | 0.026 | 0.032 | 0.500 |
| Recalibration (β = 1.0): Temp. Scale | - | 0.055 | 0.070 | 0.274 | 0.041 | 0.047 | 0.438 |
| Recalibration (β = 1.0): None | - | 0.308 | 0.331 | 0.274 | 0.071 | 0.079 | 0.438 |

Results are shown in Table 1.
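The coverage-vs-error AUC reported above is simply the mean of the selective error over the grid of $\beta$ values. A minimal sketch (our own helper; `errors_fn` is a user-supplied metric, e.g. an ECE estimator evaluated on the accepted subset):

```python
import numpy as np

def coverage_error_auc(g_scores, errors_fn, betas=np.linspace(0.5, 1.0, 11)):
    """Mean selective error over a beta grid (coverage-vs-error AUC).

    g_scores:  soft selector scores g_hat(x) on the test set.
    errors_fn: callable taking a boolean accept-mask and returning the
               error metric (e.g. ECE_1) on the accepted subset.
    """
    g_scores = np.asarray(g_scores, dtype=float)
    vals = []
    for beta in betas:
        # Threshold at the (1 - beta)-quantile to accept ~beta of the data.
        tau = np.quantile(g_scores, 1.0 - beta)
        mask = g_scores >= tau
        vals.append(errors_fn(mask))
    return float(np.mean(vals))
```

A lower AUC means the selector reduces the error more, on average, across the whole $\beta \in [0.5, 1.0]$ range rather than at one operating point.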
First, these results highlight that even in this OOD setting, the selection-only approach of Fisch et al. (2022) is not enough: recalibration is a key ingredient in improving selective calibration error. Fixing $h$ and then training $g$ performs better than joint optimization for RxRx1, likely because the distribution shift significantly changed the optimal temperature for the region of the feature space where $g(x) = 1$. Joint optimization performs best for CIFAR-100-C, and still significantly improves ECE on RxRx1, although it is outperformed by fixing $h$ first in that case. The confidence baseline performs quite poorly in both experiments and according to both metrics, significantly increasing selective calibration error in all cases.

5.2.1 Trade-offs Between Calibration Error and Accuracy

While accurate probabilistic output is the only concern in some domains, and should be of at least some concern in most domains, discrete label accuracy can also be important in some circumstances. Table 1 shows accuracy results under selection, and Figure 3 shows the selective accuracy curve and confidence histogram for our selective recalibration model trained with S-TLBCE for RxRx1 and CIFAR-100 (and applied to shifted distributions). Together, these results illustrate that under different data and prediction distributions, selective recalibration may increase or decrease accuracy. For RxRx1, the model tends to reject examples with higher confidence, which also tend to be more accurate. Thus, while ECE@$\beta$ may improve with respect to the full dataset, selective accuracy at $\beta$ is worse. On the other hand, for CIFAR-100-C, the model tends to reject examples with lower confidence, which also tend to be less accurate. Accordingly, both ECE@$\beta$ and selective accuracy at $\beta$ improve with respect to the full dataset.
Figure 3: Plots illustrating (1) the distribution of confidence among the full distribution and the examples accepted for prediction (i.e., where g(x) = 1) at coverage level β = 0.8, and (2) selective accuracy in the range β = [0.8, 1.0].

6 Theoretical Analysis

To build a deeper understanding of selective recalibration (and its alternatives), we consider a theoretical situation where a pre-trained model is applied to a target distribution different from the distribution on which it was trained, mirroring both our experimental setting and a common challenge in real-world deployments. We show that with either selection or recalibration alone there will still be calibration error, while selective recalibration can achieve ECE = 0. We also show that joint optimization of g and h, as opposed to sequentially optimizing each model, is necessary to achieve zero calibration error. We consider a setting with two classes, and without loss of generality we set y ∈ {−1, 1}. We are given a classifier pre-trained on a mixture model, a typical way to view the distribution of objects in images (Zhu et al., 2014; Thulasidasan et al., 2019). The pre-trained classifier is then applied to a target distribution containing a portion of outliers from each class unseen during training. Our specific choices of classifier and training and target distributions are chosen for ease of interpretation and analysis; however, the intuitions built can be applied more generally, for example to neural network classifiers, which are too complex for such analysis but are often faced with outlier data on which calibration is poor (Ovadia et al., 2019).

Figure 4: A classifier pre-trained on a mixture model is applied to a target distribution with outliers.
6.1.1 Data Generation Model

Definition 1 (Target Distribution) The target distribution is defined as a (θ*, σ, α)-perturbed mixture model over (x, y) ∈ ℝ^p × {−1, 1}:

x | y, z ∼ z J1 + (1 − z) J2,

where y follows the Bernoulli distribution P(y = 1) = P(y = −1) = 1/2, z follows a Bernoulli distribution P(z = 1) = β, and z is independent of y.

Our data model considers a mixture of two distributions with disjoint and bounded supports, J1 and J2, where J2 is considered to be an outlier distribution. Specifically, for y ∈ {−1, 1}, J1 is supported in the balls with centers yθ* and radius r1, J2 is supported in the balls with centers −yαθ* and radius r2, and both J1 and J2 have standard deviation σ. See Figure 4 for an illustration of our data model, and Appendix C.1 for a full definition of the distribution.

6.1.2 Classifier Algorithm

Recall that in our setting f is a pre-trained model, where the training distribution is unknown and we only have samples from some different target distribution. We follow this setting in our theory by considering a (possibly biased) estimator θ̂ of θ*, which is the output of a training algorithm A(S_tr) that takes the i.i.d. training data set S_tr = {(x_i^tr, y_i^tr)}_{i=1}^m as input. The distribution from which S_tr is drawn is different from the target distribution from which we have data to train the selection and recalibration models. We impose only one assumption on θ̂: that A outputs a θ̂ that converges to some θ0 if training data is abundant enough, where θ0 is aligned with θ* in direction (see Assumption 3, Appendix C.2 for the formal statement).
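To make the data model concrete, here is a small sampling sketch. The function name is our own, the outlier centres follow the formal definition in Appendix C.1 (centred at −yαθ*), and for brevity we sample untruncated Gaussians, so the truncation of J1 and J2 to small support balls is only approximated (a reasonable sketch when σ is small):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_perturbed_mixture(n, theta, sigma=0.1, alpha=0.3, beta=0.9):
    """Sample n points from a (theta*, sigma, alpha)-perturbed mixture.

    Inliers (z = 1, drawn w.p. beta) come from J1, centred at y * theta*;
    outliers (z = 0) come from J2, centred at -y * alpha * theta*.
    Truncation to the support balls is omitted in this sketch."""
    p = len(theta)
    y = rng.choice([-1, 1], size=n)   # balanced labels
    z = rng.random(n) < beta          # inlier indicator
    centers = np.where(z[:, None],
                       y[:, None] * theta,            # J1 centres
                       -y[:, None] * alpha * theta)   # J2 centres
    x = centers + sigma * rng.standard_normal((n, p))
    return x, y, z
```

Sampling from this model makes the later analysis tangible: the outlier points cluster near ∓αθ*, the region the proofs call B.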
For ease of analysis and explanation, we consider a simple classifier defined by θ̂ = Σ_{i=1}^m x_i^tr y_i^tr / m when the training distribution is set to be an unperturbed Gaussian mixture x^tr | y^tr ∼ N(y^tr θ*, σ²I) and y^tr follows a Bernoulli distribution P(y^tr = 1) = 1/2.¹ This form of θ̂ is closely related to Fisher's rule in linear discriminant analysis for Gaussian mixtures (see Appendix C.2.1 for further discussion). Having obtained θ̂, our pretrained classifier aligns with the typical notion of a softmax response in neural networks. We first obtain the confidence vector f(x) = (f_{−1}(x), f_1(x))ᵀ, where

f_{−1}(x) = 1 / (e^{2θ̂ᵀx} + 1),   f_1(x) = e^{2θ̂ᵀx} / (e^{2θ̂ᵀx} + 1),   (13)

and then output ŷ = arg max_{k ∈ {−1,1}} f_k(x). For k ∈ {−1, 1}, the confidence score f_k(x) is an estimator of P(y = k | x), and the final classifier is equivalent to ŷ = sgn(θ̂ᵀx).

6.2 Main Theoretical Results

Having established our data and classification models, we now analyze why selective recalibration (i.e., joint training of g and h) can outperform recalibration and selection performed alone or sequentially. To measure calibration error, we consider ECE_q with q = 1 (and drop the subscript q for notational simplicity below). For clarity of the theorem statements and proofs, we restate the definitions of calibration error to make them explicitly dependent on the selection model g and temperature T, tailored to the binary case we study. We emphasize that we are not introducing new concepts, but offering different surface forms of the same quantities introduced earlier. First, we notice that under our data generating model and pretrained classifier, ECE can be expressed as

ECE = E_{θ̂ᵀx} | P[y = 1 | θ̂ᵀx = v] − 1 / (1 + e^{−2v}) |.

By studying such population quantities, our analysis does not depend on the binning methods commonly used to estimate expected calibration error empirically.

¹The in-distribution case also works under our data generation model.
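As a concrete transcription of Eq. (13), the confidence vector and prediction can be computed as follows (a minimal sketch; `confidence_and_prediction` is our own helper name, and we use the algebraically equivalent form e^{2v}/(e^{2v}+1) = 1/(1+e^{−2v})):

```python
import numpy as np

def confidence_and_prediction(theta_hat, x):
    """Confidence vector (f_{-1}(x), f_1(x)) and prediction y_hat for the
    linear classifier of Eq. (13), with v = theta_hat^T x.
    The prediction equals sign(theta_hat^T x)."""
    v = float(theta_hat @ x)
    f1 = 1.0 / (1.0 + np.exp(-2.0 * v))   # estimate of P(y = 1 | x)
    f_neg1 = 1.0 - f1                     # estimate of P(y = -1 | x)
    y_hat = 1 if f1 >= f_neg1 else -1     # arg max over k in {-1, 1}
    return (f_neg1, f1), y_hat
```

Note that the two confidence entries always sum to one, and the decision boundary is exactly the hyperplane θ̂ᵀx = 0.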
6.2.1 Selective Recalibration vs. Recalibration or Selection

We study the following ECE quantities under our data model for recalibration alone (R-ECE), selection alone (S-ECE), and selective recalibration (SR-ECE). For recalibration, we focus on the popular temperature scaling model, although the analysis is nearly identical for Platt scaling.

R-ECE(T) := E_{θ̂ᵀx} | P[y = 1 | θ̂ᵀx/T = v] − 1/(1 + e^{−2v}) |;

S-ECE(g) := E_{θ̂ᵀx} [ | P[y = 1 | θ̂ᵀx = v, g(x) = 1] − 1/(1 + e^{−2v}) |  |  g(x) = 1 ];

SR-ECE(g, T) := E_{θ̂ᵀx} [ | P[y = 1 | θ̂ᵀx/T = v, g(x) = 1] − 1/(1 + e^{−2v}) |  |  g(x) = 1 ].

Our first theorem proves that under our data generation model, S-ECE and R-ECE can never reach 0, but SR-ECE can reach 0 by choosing appropriate g and T.

Theorem 1 Under Assumption 3, for any δ ∈ (0, 1) and θ̂ output by A, there exist thresholds M ∈ ℕ+ and τ > 0 such that if max{r1, r2, σ, ||θ*||} < τ and m > M, there exists a positive lower bound L such that, with probability at least 1 − δ over S_tr,

min { min_{g: E[g(x)] ≥ β} S-ECE(g),  min_{T ∈ ℝ} R-ECE(T) } > L.

However, there exist T0 and g0 satisfying E[g0(x)] ≥ β such that SR-ECE(g0, T0) = 0.

Intuition and interpretation. Here we give some intuition for understanding our results. Under our perturbed mixture model, R-ECE is calculated as

R-ECE(T) = E_{v=θ̂ᵀx} | 1{v ∈ A} / (1 + exp(−2θ̂ᵀθ* v / (σ²||θ̂||²))) + 1{v ∈ B} / (1 + exp(2αθ̂ᵀθ* v / (σ²||θ̂||²))) − 1 / (e^{−2v/T} + 1) |

for disjoint sets A and B, which correspond to points on the support of J1 and J2 respectively. In order to achieve zero R-ECE when v ∈ A, we need T⁻¹ = θ̂ᵀθ*/(σ²||θ̂||²); however, for v ∈ B we need T⁻¹ = −αθ̂ᵀθ*/(σ²||θ̂||²). These clearly cannot be achieved simultaneously. Thus the presence of the outlier data makes it impossible for the recalibration model to properly calibrate the confidence for the whole population. A similar expression can be obtained for S-ECE. As long as θ̂ᵀθ*/(σ²||θ̂||²) and −αθ̂ᵀθ*/(σ²||θ̂||²) are far from 1 (i.e., miscalibration exists), no choice of g can reach zero S-ECE.
In other words, no selection rule alone can lead to calibrated predictions, since no subset of the data is calibrated under the pre-trained classifier. However, by setting g0(x) = 0 for all x with θ̂ᵀx ∈ B and g0(x) = 1 otherwise, and choosing T0 with T0⁻¹ = θ̂ᵀθ*/(σ²||θ̂||²), we get SR-ECE(g0, T0) = 0. Thus we can conclude that achieving ECE = 0 on eligible predictions is only possible under selective recalibration, while selection or recalibration alone induces positive ECE. See Appendix C for more details and analysis.

6.2.2 Joint Learning versus Sequential Learning

We can further demonstrate that jointly learning the selection model g and temperature scaling parameter T can outperform sequential learning of g and T. Let us first denote g̃ := arg min_{g: E[g(x)] ≥ β} S-ECE(g) and T̃ := arg min_{T ∈ ℝ} R-ECE(T). We denote two types of expected calibration error under sequential learning of g and T, depending on which is optimized first:

ECE_{R→S} := min_{g: E[g(x)] ≥ β} SR-ECE(g, T̃);   ECE_{S→R} := min_{T ∈ ℝ} SR-ECE(g̃, T).

Our second theorem shows that both types of expected calibration error under sequential learning are bounded away from zero, while jointly learning (g, T) can reach zero calibration error.

Theorem 2 Under Assumption 3, if β > 2(1 − β), for any δ ∈ (0, 1) and θ̂ output by A, there exist thresholds M ∈ ℕ+ and τ2 > τ1 > 0 such that if max{r1, r2, σ} < τ2, τ1 < σ, and m > M, then there exists a positive lower bound L such that, with probability at least 1 − δ over S_tr,

min { ECE_{R→S}, ECE_{S→R} } > L.

However, there exist T0 and g0 satisfying E[g0(x)] ≥ β such that SR-ECE(g0, T0) = 0.

Intuition and interpretation. If we first optimize the temperature scaling model to obtain T̃, T̃⁻¹ will not be equal to θ̂ᵀθ*/(σ²||θ̂||²). Then, when applying selection, there exists no g that can reach 0 calibration error, since the temperature is optimal for neither the data in A nor the data in B.
On the other hand, if we first optimize the selection model and obtain g̃, g̃ will reject points in A instead of those in B, because points in A incur higher calibration error; thus data from both A and B will be selected (i.e., not rejected). In that case, temperature scaling will not be able to push calibration error to zero because, as in the earlier R-ECE analysis, the calibration error in A and B cannot reach 0 simultaneously using a single temperature scaling model. The optimal jointly-learned solution, by contrast, yields a set of predictions with zero expected calibration error.

7 Conclusion

We have shown both empirically and theoretically that combining selection and recalibration is a potent strategy for producing a set of well-calibrated predictions. Across the eight pairs of distribution and β tested when i.i.d. validation data is available, selective recalibration with our proposed S-TLBCE loss function outperforms every recalibrator tested in 7 cases, and always reduces S-ECE with respect to the recalibrator employed by the selective recalibration model itself. Taken together, these results show that while many popular recalibration functions are quite effective at reducing calibration error, they can often be fit better to the data when given the opportunity to ignore a small portion of difficult examples. Thus, in domains where calibrated confidence is critical to decision making, selective recalibration is a practical and lightweight strategy for improving outcomes downstream of deep learning model predictions.

Broader Impact Statement

While the goal of our method is to foster better outcomes in settings of societal importance like medical diagnosis, as mentioned in Section 2, selective classification may create disparities among protected groups. Future work on selective recalibration could focus on analyzing and mitigating any unequal effects of the algorithm.
Acknowledgments

We thank Mike Mozer and Jasper Snoek for their very helpful feedback on this work. This research was supported by funds provided by the National Science Foundation and by DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (ARNI: The NSF AI Institute for Artificial and Natural Intelligence). JCS gratefully acknowledges financial support from the Schmidt DataX Fund at Princeton University, made possible through a major gift from the Schmidt Futures Foundation. TP gratefully acknowledges support by NSF AF:Medium 2212136 and by the Simons Collaboration on the Theory of Algorithmic Fairness grant.

References

Peter Bandi, Oscar Geessink, Quirine Manson, Marcory van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, Quanzheng Li, Farhad Ghazvinian Zanjani, Sveta Zinger, Keisuke Fukuta, Daisuke Komura, Vlado Ovtcharov, Shenghua Cheng, Shaoqun Zeng, Jeppe Thagaard, and Geert Litjens. From detection of individual metastases to classification of lymph node status at the patient level: The Camelyon17 challenge. IEEE Transactions on Medical Imaging, PP:1–1, 2018. doi: 10.1109/TMI.2018.2867350.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, 2015.

Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.

Zhun Deng, Thomas P. Zollo, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Distribution-free statistical dispersion control for societal applications. In Advances in Neural Information Processing Systems, 2023.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(53):1605–1641, 2010.

Adam Fisch, Tommi S. Jaakkola, and Regina Barzilay. Calibrated selective classification.
Transactions on Machine Learning Research, 2022. ISSN 2835-8856.

Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, 2017.

Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, 2019.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.

Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 2018.

Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 0162-1459.

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, and Percy Liang. Selective classification can magnify disparities across groups. In International Conference on Learning Representations, 2021.

Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael C. Mozer, and Becca Roelofs. Soft calibration objectives for neural networks. In Advances in Neural Information Processing Systems, 2021.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.
WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, 2021.

Ananya Kumar, Percy S. Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, 2019.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, 2018.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer. In Advances in Neural Information Processing Systems, 2018.

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Ann Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems, 2021.

Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, 2015.

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, 2019.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011.
Alexandre Perez-Lebel, Marine Le Morvan, and Gaël Varoquaux. Beyond calibration: estimating the grouping loss of modern neural networks. In International Conference on Learning Representations, 2023.

J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.

Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners. In Conference on Robot Learning, 2023.

Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, and Michael C. Mozer. Mitigating bias in calibration error estimation. In International Conference on Artificial Intelligence and Statistics, 2022.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12):371–421, 2008.

Abhin Shah, Yuheng Bu, Joshua K. Lee, Subhro Das, Rameswar Panda, Prasanna Sattigeri, and Gregory W. Wornell. Selective regression under fairness criteria. In International Conference on Machine Learning, 2022.

Jake Snell, Thomas P. Zollo, Zhun Deng, Toniann Pitassi, and Richard Zemel. Quantile risk control: A flexible framework for bounding the probability of high-loss predictions. In International Conference on Learning Representations, 2023.

James Taylor, Berton Earnshaw, Ben Mabey, Mason Victors, and Jason Yosinski. RxRx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations, 2019.

Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: improved calibration and predictive uncertainty for deep neural networks. In International Conference on Neural Information Processing Systems, 2019.
Vladimir Vovk, Ilia Nouretdinov, Akimichi Takemura, and Glenn Shafer. Defensive forecasting for linear protocols. In International Conference on Algorithmic Learning Theory, 2005.

Deng-Bao Wang, Lei Feng, and Min-Ling Zhang. Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. In Advances in Neural Information Processing Systems, 2021.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, 2001.

Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020.

Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration. In International Conference on Machine Learning, 2022.

Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In Conference on Computer Vision and Pattern Recognition, 2014.

Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Prompt risk control: A rigorous framework for responsible deployment of large language models. In International Conference on Learning Representations, 2024.

A Additional Experiment Details

In training we follow Fisch et al. (2022) and drop the denominator in L_sel, as the coverage loss suffices to keep ĝ from collapsing to 0. Recalibration model code is taken from the code releases accompanying Guo et al. (2017)2 (temperature scaling) and Kumar et al. (2019)3 (Platt scaling, histogram binning, Platt binning).

A.1 Calibration Measures

We calculate ECE_q for q ∈ {1, 2} using the Python library released by Kumar et al. (2019)4.
ECE_q is calculated as

ECE_q = ( Σ_{j=1}^{m} (|B_j| / n) · | (1/|B_j|) Σ_{i ∈ B_j} 1{y_i = ŷ_i} − (1/|B_j|) Σ_{i ∈ B_j} f̂(x_i) |^q )^{1/q},

where B = B_1, ..., B_m is a set of m equal-mass prediction bins, n is the total number of predictions, and predictions are sorted and binned based on their maximum confidence f̂(x). We set m = 15.

A.2 Baselines

Next we describe how the baseline methods are implemented. Our descriptions are based on creating an ordering of the test set such that at a given coverage level β, the 1 − β proportion of examples at the end of the ordering is rejected.

A.2.1 Confidence-Based Rejection

Confidence-based rejection is performed by ordering instances in decreasing order of f̂(x), the maximum confidence the model has in any class for that example.

A.2.2 Out-of-Distribution Scores

The sklearn Python library (Pedregosa et al., 2011) is used to produce the One-Class SVM and Isolation Forest models. Anomaly scores are oriented such that more typical datapoints receive higher scores; instances are ranked in decreasing order.

A.3 In-distribution Experiments

Our selector g is a shallow fully-connected network with 2 hidden layers of dimension 128 and ReLU activations.

A.3.1 Camelyon17

Camelyon17 (Bandi et al., 2018) is a task where the input x is a 96×96 patch of a whole-slide image of a lymph node section from a patient with potentially metastatic breast cancer, the label y is whether the patch contains a tumor, and the domain d specifies which of 5 hospitals the patch came from. We pre-train a DenseNet-121 model on the Camelyon17 train set using the code from Koh et al. (2021)5. The validation set has 34,904 examples and accuracy of 91%, while the test set has 84,054 examples and accuracy of 83%. Our selector g is trained with a learning rate of 0.0005, the coverage loss weight λ is set to 32 (following Geifman & El-Yaniv, 2019), and the model is trained with 1000 samples for 1000 epochs with a batch size of 100.
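The equal-mass binned ECE_q of Appendix A.1 can be sketched as follows (our own minimal implementation for illustration; the paper uses the verified_calibration library of Kumar et al. (2019), whose exact binning details may differ):

```python
import numpy as np

def ece_q(confidences, correct, q=1, num_bins=15):
    """Equal-mass binned ECE_q (a sketch). Predictions are sorted by max
    confidence and split into bins of roughly equal size; the per-bin
    |accuracy - avg. confidence|^q gaps are averaged with weights |B_j|/n
    and the q-th root is taken."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    order = np.argsort(confidences)
    total = 0.0
    for b in np.array_split(order, num_bins):
        if len(b) == 0:
            continue
        gap = abs(correct[b].mean() - confidences[b].mean())
        total += (len(b) / n) * gap ** q
    return total ** (1.0 / q)
```

With q = 1 this is the usual weighted average of per-bin calibration gaps; q = 2 penalizes large per-bin gaps more heavily.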
2https://github.com/gpleiss/temperature_scaling
3https://github.com/p-lambda/verified_calibration
4https://github.com/p-lambda/verified_calibration
5https://github.com/p-lambda/wilds

A.3.2 ImageNet

ImageNet is a large-scale image classification dataset. We extract the features, scores, and labels from the 50,000 ImageNet validation samples using a pre-trained ResNet34 model from the torchvision library. Our selector g is trained with a learning rate of 0.00001, the coverage loss weight λ is set to 32 (following Geifman & El-Yaniv, 2019), and the model is trained with 2000 samples for 1000 epochs with a batch size of 200.

A.4 Out-of-distribution Experiments

Our selector g is a shallow fully-connected network (1 hidden layer with dimension 64 and ReLU activation) trained with a learning rate of 0.0001, the coverage loss weight λ is set to 8, and the model is trained for 50 epochs (to avoid overfitting, since this is an OOD setting) with a batch size of 256.

A.4.1 RxRx1

RxRx1 (Taylor et al., 2019) is a task where the input x is a 3-channel image of cells obtained by fluorescent microscopy, the label y indicates which of the 1,139 genetic treatments (including no treatment) the cells received, and the domain d specifies the batch in which the imaging experiment was run. The validation set has 9,854 examples and accuracy of 18%, while the test set has 34,432 examples and accuracy of 27%. 1000 samples are drawn for model training. Gaussian noise with mean 0 and standard deviation 1 is added to training examples in order to promote robustness.

A.4.2 CIFAR-100

CIFAR-100 is a well-known image classification dataset, and we perform zero-shot image classification with CLIP. We draw 2000 samples for model training, and test on 50,000 examples drawn from the 750,000 examples in CIFAR-100-C.
Data augmentation in training is performed using AugMix (Hendrycks et al., 2020) with a severity level of 3 and a mixture width of 3.

B Additional Experiment Results

B.1 Brier Score Results

While our focus in this work is expected calibration error, for completeness we also report results with respect to Brier score. Figure 5 shows Brier score results for the experiments in Section 5.1. Selective recalibration reduces Brier score in both experiments and outperforms recalibration. The OOD selection baselines perform well, although they show increasing error as more data is rejected, illustrating their poor fit for the task. Further, Brier score results for the experiments in Section 5.2 are included in Table 2. Selective recalibration reduces error, while confidence-based rejection increases error, which is surprising since Brier score favors predictions with confidence near 1.

Figure 5: Selective calibration error on ImageNet and Camelyon17 for coverage levels β ∈ {0.75, 0.8, 0.85, 0.9}. Left: various recalibration methods are trained using labeled validation data. Middle: selection baselines including confidence-based rejection and various OOD measures. Right: selective recalibration with different loss functions.

Table 2: RxRx1 and CIFAR-100-C Brier score AUC in the range β = [0.5, 1.0].

Selection       Opt. of h, g    RxRx1   CIFAR-100-C
Confidence      -               0.169   0.180
One-class SVM   -               0.077   0.051
Iso. Forest     -               0.061   0.051
S-MCE           sequential      0.138   0.166
S-MCE           joint           0.138   0.166
S-MMCE          sequential      0.126   0.165
S-MMCE          joint           0.126   0.164
S-TLBCE         sequential      0.126   0.166
S-TLBCE         joint           0.126   0.165
Temp. Scale     (β = 1.0)       0.140   0.164
None            (β = 1.0)       0.252   0.170

C Theory: Additional Details and Proofs

C.1 Details on Data Generation Model

Definition 2 (Formal version of Definition 1) For θ* ∈ ℝ^p, a (θ*, σ, α, r1, r2)-perturbed truncated Gaussian model is defined as the following distribution over (x, y) ∈ ℝ^p × {−1, 1}:

x | y, z ∼ z J1 + (1 − z) J2.
Here, J1 and J2 are two truncated Gaussian distributions, i.e.,

J1 ∝ ρ1 N(yθ*, σ²I) 1{x ∈ B(θ*, r1) ∪ B(−θ*, r1)},   J2 ∝ ρ2 N(−yαθ*, σ²I) 1{x ∈ B(αθ*, r2) ∪ B(−αθ*, r2)},

where ρ1, ρ2 are normalization coefficients that make J1 and J2 properly defined; y follows the Bernoulli distribution P(y = 1) = P(y = −1) = 1/2; and z follows a Bernoulli distribution P(z = 1) = β. For simplicity, throughout this paper we set ρ1 = ρ2, which is always achievable by setting r1/r2 appropriately. We also set α ∈ (0, 1/2).

C.2 Details on θ̂

Recall that we consider the θ̂ that is the output of a training algorithm A(S_tr) that takes the i.i.d. training data set S_tr = {(x_i^tr, y_i^tr)}_{i=1}^m as input. We impose the following assumption on θ̂.

Assumption 3 For any given δ ∈ (0, 1), there exists θ0 ∈ ℝ^p with ||θ0|| = Θ(1) such that, with probability at least 1 − δ, ||θ̂ − θ0|| < ϕ(δ, m), where ϕ(δ, m) goes to 0 as m goes to infinity. Also, there exists a threshold M ∈ ℕ+ such that if m > M, ϕ(δ, m) is a decreasing function of δ and m. Moreover, ||θ0||, θ0ᵀθ* > 0.

We prove the following lemma as a cornerstone for our future proofs.

Lemma 1 Under Assumption 3, for any δ ∈ (0, 1), there exist a threshold M ∈ ℕ+ and constants 0 < I1 < I2, 0 < I3 < I4 with αI4 < I3, and 0 < I5 < I6, such that if m > M, with probability at least 1 − δ over the randomness of S_tr,

θ̂ᵀθ* / ||θ̂||² ∈ [I1, I2],   θ̂ᵀθ* ∈ [I3, I4],   ||θ̂|| ∈ [I5, I6].

Proof 1 Under Assumption 3, we know m → ∞ leads to θ̂ → θ0. In addition, for any δ ∈ (0, 1) there exists a threshold M ∈ ℕ+ such that if m > M, ϕ(δ, m) is a decreasing function of δ and m, which leads to

θ̂ᵀθ* / ||θ̂||² ∈ [θ0ᵀθ* / ||θ0||² − ε, θ0ᵀθ* / ||θ0||² + ε],   θ̂ᵀθ* ∈ [θ0ᵀθ* − ε, θ0ᵀθ* + ε],   ||θ̂|| ∈ [||θ0|| − ε, ||θ0|| + ε]

for some small ε > 0 that makes the left ends of the above intervals positive and α(θ0ᵀθ* + ε) < θ0ᵀθ* − ε hold for all r1, r2, σ, m as long as m > M. Then, we set the I_i's accordingly to each value above.
C.2.1 Example Training θ̂

In this section, we provide one example to justify Assumption 3, i.e., θ̂ = Σ_{i=1}^m x_i^tr y_i^tr / m, where the training set is drawn from an unperturbed Gaussian mixture, i.e., x^tr | y^tr ∼ N(y^tr θ*, σ²I) and y^tr follows a Bernoulli distribution P(y^tr = 1) = 1/2. Directly following the analysis of Zhang et al. (2022), we have ||θ̂ − θ*|| = O_P(σ√(p/m)). For ||θ̂||², notice that θ̂ = θ* + ε_m, where ε_m ∼ N(0, σ²I/m). Then, we have

||θ̂||² = ||θ*||² + 2ε_mᵀθ* + ||ε_m||² = ||θ*||² + O_P(σ²p/m) + O_P(σ/√m) ||θ*||.

Given p/m = O(1), combined with classic concentration inequalities, one can verify that this example satisfies Assumption 3.

C.3 Background: ECE Calculation

Recall that we denote f̂(x) = max{p̂_{−1}(x), p̂_1(x)} and the prediction result ŷ = Ĉ(x). The definition of ECE is:

ECE = E_{f̂(x)} | P[y = ŷ | f̂(x) = p] − p |.

Notice that there are two cases.

Case 1: f̂(x) = p̂_1(x). By the reparameterization v = θ̂ᵀx, we have

| P[y = ŷ | f̂(x) = p] − p | = | P[y = 1 | f̂(x) = e^{2v}/(1 + e^{2v})] − e^{2v}/(1 + e^{2v}) | = | P[y = 1 | θ̂ᵀx = v] − e^{2v}/(1 + e^{2v}) |.

Case 2: f̂(x) = p̂_{−1}(x). By reparameterization, we have

| P[y = ŷ | f̂(x) = p] − p | = | P[y = −1 | f̂(x) = 1/(1 + e^{2v})] − 1/(1 + e^{2v}) | = | (1 − P[y = 1 | θ̂ᵀx = v]) − (1 − e^{2v}/(1 + e^{2v})) | = | P[y = 1 | θ̂ᵀx = v] − e^{2v}/(1 + e^{2v}) |.

To summarize, since e^{2v}/(1 + e^{2v}) = 1/(1 + e^{−2v}),

ECE = E_{f̂(x)} | P[y = ŷ | f̂(x) = p] − p | = E_{θ̂ᵀx} | P[y = 1 | θ̂ᵀx = v] − 1/(1 + e^{−2v}) |.

Temperature scaling.

p^T_{−1}(x) = 1 / (e^{2θ̂ᵀx/T} + 1),   p^T_1(x) = e^{2θ̂ᵀx/T} / (e^{2θ̂ᵀx/T} + 1).   (15)

R-ECE = E_{θ̂ᵀx} | P[y = 1 | θ̂ᵀx = vT] − 1/(1 + e^{−2v}) |.

Platt scaling.

p^{w,b}_{−1}(x) = 1 / (e^{2wθ̂ᵀx + 2b} + 1),   p^{w,b}_1(x) = 1 / (e^{−2wθ̂ᵀx − 2b} + 1).   (16)

ECE_{w,b} = E_{θ̂ᵀx} | P[y = 1 | wθ̂ᵀx + b = v] − 1/(1 + e^{−2v}) |.

ECE calculation. The distribution of θ̂ᵀx has the following properties.
θ̂ᵀx | y = 1 ∼ z ρ1 N(θ̂ᵀθ*, σ²||θ̂||²) 1{θ̂ᵀx ∈ B(θ̂ᵀθ*, r1||θ̂||) ∪ B(−θ̂ᵀθ*, r1||θ̂||)} + (1 − z) ρ2 N(−αθ̂ᵀθ*, σ²||θ̂||²) 1{θ̂ᵀx ∈ B(αθ̂ᵀθ*, r2||θ̂||) ∪ B(−αθ̂ᵀθ*, r2||θ̂||)};

θ̂ᵀx | y = −1 ∼ z ρ1 N(−θ̂ᵀθ*, σ²||θ̂||²) 1{θ̂ᵀx ∈ B(θ̂ᵀθ*, r1||θ̂||) ∪ B(−θ̂ᵀθ*, r1||θ̂||)} + (1 − z) ρ2 N(αθ̂ᵀθ*, σ²||θ̂||²) 1{θ̂ᵀx ∈ B(αθ̂ᵀθ*, r2||θ̂||) ∪ B(−αθ̂ᵀθ*, r2||θ̂||)}.

Now, we are ready to calculate ECE. For notational simplicity, given θ̂, we denote A = B(θ̂ᵀθ*, r1||θ̂||) ∪ B(−θ̂ᵀθ*, r1||θ̂||) and B = B(αθ̂ᵀθ*, r2||θ̂||) ∪ B(−αθ̂ᵀθ*, r2||θ̂||). Meanwhile, for simplicity, we choose r1, r2 such that ρ1 = ρ2 = ρ. This is always manageable and there exist infinitely many choices; we only require S1 ∩ S2 = ∅ for any S1 ≠ S2 with S1, S2 ∈ {B(θ̂ᵀθ*, r1||θ̂||), B(−θ̂ᵀθ*, r1||θ̂||), B(αθ̂ᵀθ*, r2||θ̂||), B(−αθ̂ᵀθ*, r2||θ̂||)}. Achieving ρ1 = ρ2 = ρ depends only on r1/r2. Clearly, there exists a threshold ϕ > 0 such that if r1 and r2 are both smaller than ϕ (one can choose r1, r2 as functions of σ with appropriately chosen σ), then A ∩ B = ∅ can be achieved.

P[y = 1 | θ̂ᵀx = v] = P(θ̂ᵀx = v | y = 1) / ( P(θ̂ᵀx = v | y = 1) + P(θ̂ᵀx = v | y = −1) )
= 1{v ∈ A} / (1 + exp(−2θ̂ᵀθ* v / (σ²||θ̂||²))) + 1{v ∈ B} / (1 + exp(2αθ̂ᵀθ* v / (σ²||θ̂||²))).

C.4 Proof of Theorem 1

C.4.1 Temperature Scaling Only

A simple reparameterization leads to:

R-ECE = E_{v=θ̂ᵀx} | 1{v ∈ A} / (1 + exp(−2θ̂ᵀθ* v / (σ²||θ̂||²))) + 1{v ∈ B} / (1 + exp(2αθ̂ᵀθ* v / (σ²||θ̂||²))) − 1 / (e^{−2v/T} + 1) |.

The lower bound contains two parts. We choose the threshold ϕ mentioned previously small enough that I3 > max{r1, r2}; this can be achieved because I3 is independent of r1, r2.

Part I. Consider v ∈ B(θ̂ᵀθ*, r1||θ̂||) ⊆ A, recalling that A ∩ B = ∅. Let us choose a threshold min{I1, I3}/σ² > τ > 0. Then any T > 0 must fall into one of the following three cases.
Case 1: $T^{-1}$ and $\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)$ are far apart, with $T^{-1} - \hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2) > \tau$. Recall $v = \hat\theta^\top x$; then the contribution of $B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)$ to the R-ECE is at least
$$P(x \in B(\theta^*, r_1))\Big[\frac{1}{1 + \exp(-2T^{-1}(\hat\theta^\top\theta^* - r_1\|\hat\theta\|))} - \frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}(\hat\theta^\top\theta^* + r_1\|\hat\theta\|)\big)}\Big]$$
$$\ge P(x \in B(\theta^*, r_1))\Big[\frac{1}{1 + \exp\big(-2\big(\frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} + \tau\big)(\hat\theta^\top\theta^* - r_1\|\hat\theta\|)\big)} - \frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}(\hat\theta^\top\theta^* + r_1\|\hat\theta\|)\big)}\Big]$$
$$\ge P(x \in B(\theta^*, r_1)) \min_{c \in [I_1, I_2],\, d \in [I_3, I_4]}\Big[\frac{1}{1 + \exp(-2(c/\sigma^2 + \tau)(d - r_1))} - \frac{1}{1 + \exp(-2(c/\sigma^2)(d + r_1))}\Big] =: \beta_1.$$

Case 2: $T^{-1}$ and $\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)$ are far apart, with $\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2) - T^{-1} > \tau$. Then, analogously, the contribution of $B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)$ to the R-ECE is at least
$$P(x \in B(\theta^*, r_1))\Big[\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}(\hat\theta^\top\theta^* - r_1\|\hat\theta\|)\big)} - \frac{1}{1 + \exp(-2T^{-1}(\hat\theta^\top\theta^* + r_1\|\hat\theta\|))}\Big]$$
$$\ge P(x \in B(\theta^*, r_1))\Big[\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}(\hat\theta^\top\theta^* - r_1\|\hat\theta\|)\big)} - \frac{1}{1 + \exp\big(-2\big(\frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} - \tau\big)(\hat\theta^\top\theta^* + r_1\|\hat\theta\|)\big)}\Big]$$
$$\ge P(x \in B(\theta^*, r_1)) \min_{c \in [I_1, I_2],\, d \in [I_3, I_4]}\Big[\frac{1}{1 + \exp(-2(c/\sigma^2)(d - r_1))} - \frac{1}{1 + \exp(-2(c/\sigma^2 - \tau)(d + r_1))}\Big] =: \beta_2.$$

Case 3: $T^{-1}$ and $\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)$ are close: $|T^{-1} - \hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)| \le \tau$. Then consider $v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|) \subseteq B$. For small enough $\tau$ satisfying $\tau \le 0.2(1+\alpha)I_1/\sigma^2$, the gap between the fitted slope $T^{-1}$ and the true slope $-\frac{\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}$ on this ball is at least $(1+\alpha)I_1/\sigma^2 - \tau \ge 0.8(1+\alpha)I_1/\sigma^2$, so by the mean value theorem the contribution of this ball to the R-ECE is at least
$$P(x \in B(-\alpha\theta^*, r_2)) \min_{a \in \big[-\frac{\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2},\, \frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} + \tau\big]}\ \min_{v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)} \Big|\frac{2v e^{2va}}{(1 + e^{2av})^2}\Big| \cdot \frac{0.8(1+\alpha)I_1}{\sigma^2}$$
$$\ge P(x \in B(-\alpha\theta^*, r_2)) \min_{a \in \big[-\frac{\alpha I_2}{\sigma^2},\, \frac{I_2}{\sigma^2} + \tau\big]}\ \min_{|v| \in [\alpha I_3 - r_2 I_6,\, \alpha I_4 + r_2 I_6]} \Big|\frac{2v e^{2va}}{(1 + e^{2av})^2}\Big| \cdot \frac{0.8(1+\alpha)I_1}{\sigma^2} =: \beta_3.$$

Part III. Combining the cases together, we have $\text{R-ECE} \ge \min\{\beta_1, \beta_2, \beta_3\}$. Finally, we take $r_1 \le \min\{0.1, \tau\sigma^2/I_1\} I_3$, which ensures $\beta_i > 0$ for all $i = 1, 2, 3$.

C.4.2 Selective Calibration Only

We require $P[g(x) = 1] \ge \beta$. Let us first define $G = \{\hat\theta^\top x : g(x) = 1\}$.
Then, for any $g$, we have
$$P[y = 1 \mid \hat\theta^\top x = v, v \in G] = \frac{P(\hat\theta^\top x = v, v \in G \mid y = 1)}{P(\hat\theta^\top x = v, v \in G \mid y = 1) + P(\hat\theta^\top x = v, v \in G \mid y = -1)} = \frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)}\mathbf{1}\{v \in A \cap G\} + \frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)}\mathbf{1}\{v \in B \cap G\}.$$
Then, by choosing $\sigma$ small enough such that $I_1/\sigma^2 > 1$, the corresponding ECE is:
$$\text{S-ECE} = \mathbb{E}_{v = \hat\theta^\top x \mid \hat\theta^\top x \in G}\Big|\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)}\mathbf{1}\{v \in A \cap G\} + \frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)}\mathbf{1}\{v \in B \cap G\} - \frac{1}{e^{-2v} + 1}\Big|$$
$$= \mathbb{E}_{v = \hat\theta^\top x \mid \hat\theta^\top x \in G}\Big[\Big|\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big|\mathbf{1}\{v \in A \cap G\}\Big] + \mathbb{E}_{v = \hat\theta^\top x \mid \hat\theta^\top x \in G}\Big[\Big|\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big|\mathbf{1}\{v \in B \cap G\}\Big]$$
$$\ge \lambda_1 P(v \in A \cap G \mid v \in G) + \lambda_2 P(v \in B \cap G \mid v \in G).$$
Since $P(v \in A \cap G \mid v \in G) + P(v \in B \cap G \mid v \in G) = 1$, it is not hard to verify that $\text{S-ECE} \ge \min\{\lambda_1, \lambda_2\}$, where
$$\lambda_1 = \min_{a \in [1,\, I_2/\sigma^2]}\ \min_{v \in A}\Big|\frac{2v e^{2va}}{(1 + e^{2av})^2}\Big|, \qquad \lambda_2 = \min_{a \in [-\alpha I_2/\sigma^2,\, 1]}\ \min_{v \in B}\Big|\frac{2v e^{2va}}{(1 + e^{2av})^2}\Big|.$$

C.4.3 Selective Re-calibration

We choose $G = B^c$, i.e., $g$ rejects exactly the points with $\hat\theta^\top x \in B$, and set $T^{-1} = \frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}$; then $\text{SR-ECE} = 0$. Thus, there exists an appropriate choice of $g$ and $T$ such that $\text{SR-ECE}(g, T) = 0$.

C.5 Proof of Theorem 2

Usually, $\beta$ is much larger than $1 - \beta$; for example, $\beta = 90\%$. In this section, we impose the following assumption.

Assumption 4 The selector $g$ will retain most of the probability mass, in the sense that $\beta > 2(1 - \beta)$.

Let us denote $\xi = \beta/2 - (1 - \beta)$; under Assumption 4, $\xi$ is a positive constant. First, we have the following claim.

Claim 5 Under Assumption 4, if we further have
$$\min_{v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big| > \max_{v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big|,$$
then for $g_1 = \arg\min_{g: P[g(x) = 1] \ge \beta} \text{S-ECE}$, we have $P(x \in B(-\alpha\theta^*, r_2),\, g_1(x) = 0) = P(x \in B(-\alpha\theta^*, r_2))$; that is, $g_1$ rejects all of $B(-\alpha\theta^*, r_2)$.

Proof 2 The proof is straightforward. We denote $O = \{x : x \in B(-\alpha\theta^*, r_2),\, g_1(x) = 1\}$. We will prove that $P(x \in O) = 0$. If not, let us denote $P = P(x \in O) > 0$.
Since $\beta > 2(1 - \beta)$, even if we "throw away" all of the probability mass $1 - \beta$ by setting the $g$-value to $0$ only on points in $B(\theta^*, r_1)$, there will still be probability mass retained in $B(\theta^*, r_1)$ with $g$-value equal to $1$. Then, there exists $g_2$ such that $g_2(x) = 0$ for all $x \in B(-\alpha\theta^*, r_2)$, with $P(x \in B(\theta^*, r_1),\, g_2(x) = 1) = P(x \in B(\theta^*, r_1),\, g_1(x) = 1) + P$ (enabled by the fact that $\beta > 2(1 - \beta)$), and $g_2(x) = g_1(x)$ for $x \notin B(\theta^*, r_1) \cup B(-\alpha\theta^*, r_2)$. Since
$$\min_{v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big| > \max_{v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big|,$$
"throwing away" points in $B(-\alpha\theta^*, r_2)$ lowers the calibration error more effectively than "throwing away" points in $B(\theta^*, r_1)$, so we must have $\text{S-ECE}(g_2) < \text{S-ECE}(g_1)$, contradicting the optimality of $g_1$.

Next, we state how to set the parameters such that the condition in Claim 5 holds. As long as we choose $\sigma, r_1, r_2$ small enough such that
$$\frac{1}{1 + \exp\big(-\frac{2I_1}{\sigma^2}(I_4 + r_1 I_6)\big)} - \frac{1}{1 + \exp(-2I_1(I_3 - r_1 I_6))} < \frac{1}{1 + \exp(-2(I_4 + r_2 I_6))} - \frac{1}{1 + \exp\big(\frac{2\alpha I_2}{\sigma^2}(I_4 + r_2 I_6)\big)},$$
we have
$$\min_{v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big| > \max_{v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big|\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v} + 1}\Big|.$$
Then, following a derivation similar to that in Section C.4.1, we can prove that with suitably chosen parameters $r_1, r_2, \sigma$, we have $\text{ECE}_{S \to T} > 0$ (the ECE when selection is applied first and temperature scaling afterwards).

Lastly, let us further prove $\text{ECE}_{T \to S} > 0$. We choose $r_1$ and $r_2$ small enough such that $v > 0$ for all $v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|) \cup B(\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)$ and $v < 0$ for all $v \in B(-\hat\theta^\top\theta^*, r_1\|\hat\theta\|) \cup B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)$.
For $T^{-1} \in \big[\frac{\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}, \frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}\big]$, the absolute values resolve and we can write the R-ECE as follows:
$$\text{R-ECE} = \mathbb{E}_{v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big[\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v/T} + 1}\Big] + \mathbb{E}_{v \in B(-\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big[-\frac{1}{1 + \exp\big(-\frac{2\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} + \frac{1}{e^{-2v/T} + 1}\Big]$$
$$+ \mathbb{E}_{v \in B(\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big[-\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} + \frac{1}{e^{-2v/T} + 1}\Big] + \mathbb{E}_{v \in B(-\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big[\frac{1}{1 + \exp\big(\frac{2\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2} v\big)} - \frac{1}{e^{-2v/T} + 1}\Big].$$
Next, we take a derivative with respect to $x = 1/T$ for $x \in \big[\frac{\alpha\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}, \frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}\big]$; since the two $r_1$-balls contribute equally by symmetry, as do the two $r_2$-balls, this leads to
$$\frac{d\,\text{R-ECE}}{dx} = -2\,\mathbb{E}_{v \in B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)}\Big[\frac{2v e^{2vx}}{(e^{2vx} + 1)^2}\Big] + 2\,\mathbb{E}_{v \in B(\alpha\hat\theta^\top\theta^*, r_2\|\hat\theta\|)}\Big[\frac{2v e^{2vx}}{(e^{2vx} + 1)^2}\Big].$$
Consider the two values
$$\frac{2v e^{2vx}}{(e^{2vx} + 1)^2}, \qquad \frac{2\alpha v e^{2\alpha vx}}{(e^{2\alpha vx} + 1)^2};$$
the ratio
$$\frac{2\alpha v e^{2\alpha vx}}{(e^{2\alpha vx} + 1)^2} \Big/ \frac{2v e^{2vx}}{(e^{2vx} + 1)^2} \longrightarrow \alpha \quad \text{as } v \to 0.$$
That means that if we take suitably small $r_1, r_2$ and let $\sigma \in [c_1, c_2]$ with appropriately chosen $c_1, c_2$, then $\frac{d\,\text{R-ECE}}{dx} \neq 0$ at $x = \frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}$. Thus, we know the best choice of $1/T$ is not equal to $\frac{\hat\theta^\top\theta^*}{\sigma^2\|\hat\theta\|^2}$. Then, notice that $\beta > 2(1 - \beta)$, which means the probability mass in $B(\hat\theta^\top\theta^*, r_1\|\hat\theta\|)$ cannot all be "thrown away"; following a derivation similar to that in Section C.4.1, we can prove that with suitably chosen parameters $r_1, r_2, \sigma$, we have $\text{ECE}_{T \to S} > 0$ (the ECE when temperature scaling is applied first and selection afterwards).
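As a quick numeric sanity check on the posterior used throughout these calculations, the sketch below (with assumed toy constants `u` standing in for $\hat\theta^\top\theta^*$ and `s2` for $\sigma^2\|\hat\theta\|^2$; `normal_pdf` and `posterior` are illustrative helpers, not part of any released code) verifies that on region $A$, where the component-2 densities vanish, Bayes' rule reduces to the sigmoid $1/(1 + \exp(-2uv/s_2))$:

```python
import math

# Assumed toy constants: u stands in for theta_hat^T theta*,
# s2 for sigma^2 * ||theta_hat||^2; z, rho are the mixture/truncation weights.
u, s2, z, rho = 3.0, 0.5, 0.8, 1.0

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(v):
    # On region A the component-2 (alpha) densities are zero because the four
    # balls are disjoint, so only the z * rho * N(+-u, s2) terms survive,
    # and z, rho cancel in the likelihood ratio.
    p_pos = z * rho * normal_pdf(v, u, s2)   # density of theta_hat^T x | y = +1
    p_neg = z * rho * normal_pdf(v, -u, s2)  # density of theta_hat^T x | y = -1
    return p_pos / (p_pos + p_neg)

for v in (0.3, 0.7, 2.0):
    closed_form = 1.0 / (1.0 + math.exp(-2.0 * u * v / s2))
    assert abs(posterior(v) - closed_form) < 1e-9
```

The same cancellation is what makes both branches of $P[y = 1 \mid \hat\theta^\top x = v]$ exact sigmoids, with slope governed by $\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)$ on $A$ and by $-\alpha\hat\theta^\top\theta^*/(\sigma^2\|\hat\theta\|^2)$ on $B$.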
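The zero-error construction of Section C.4.3 can also be seen in simulation. The sketch below uses assumed toy parameters (cluster location `m` playing the role of $\hat\theta^\top\theta^*$, flipped clusters at $-\alpha m y$, noise scale `s`, and an equal-width binned estimate standing in for the ECE expectation): even a temperature matched to region $A$ leaves a large ECE on the full mixture, while rejecting the perturbed region $B$ and keeping $1/T = m/s^2$ drives the selective ECE to nearly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy parameters: clusters well separated so region B is identifiable.
m, alpha, s, z, n = 4.0, 0.5, 0.3, 0.9, 400_000
y = rng.choice([-1, 1], size=n)
flip = rng.random(n) > z                      # component 2: the "B" region
mean = np.where(flip, -alpha * m * y, m * y)
v = mean + s * rng.standard_normal(n)         # v plays the role of theta_hat^T x

correct = np.sign(v) == y

def confidence(v, T):
    # max(p_1, p_-1) for the temperature-scaled model: 1 / (1 + exp(-2|v|/T)).
    return 1.0 / (1.0 + np.exp(-2.0 * np.abs(v) / T))

def binned_ece(conf, correct, n_bins=15):
    # Equal-width binned estimate of E | P[y = yhat | f(x) = p] - p |.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    return sum(mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
               for b in range(n_bins) if (mask := bins == b).any())

T_star = s**2 / m   # 1/T = m/s^2 matches the slope of P[y = 1 | v] on region A
keep = np.abs(np.abs(v) - m) < np.abs(np.abs(v) - alpha * m)  # reject B clusters

ece_full = binned_ece(confidence(v, T_star), correct)              # T alone
ece_sel = binned_ece(confidence(v[keep], T_star), correct[keep])   # select + T
print(ece_full, ece_sel)
```

With these parameters roughly 10% of the mass sits in the flipped clusters, so temperature scaling alone is left with an ECE near that mass, whereas selection plus temperature scaling removes it almost entirely, mirroring $\text{SR-ECE}(g, T) = 0$.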