# minimumrisk_recalibration_of_classifiers__97bf21bf.pdf

Minimum-Risk Recalibration of Classifiers

Zeyu Sun University of Michigan zeyusun@umich.edu

Dogyoon Song University of Michigan dogyoons@umich.edu

Alfred Hero University of Michigan hero@eecs.umich.edu

Recalibrating probabilistic classifiers is vital for enhancing the reliability and accuracy of predictive models. Despite the development of numerous recalibration algorithms, there is still a lack of a comprehensive theory that integrates calibration and sharpness (which is essential for maintaining predictive power). In this paper, we introduce the concept of minimum-risk recalibration within the framework of mean-squared-error (MSE) decomposition, offering a principled approach for evaluating and recalibrating probabilistic classifiers. Using this framework, we analyze the uniform-mass binning (UMB) recalibration method and establish a finite-sample risk upper bound of order O(B/n + 1/B2) where B is the number of bins and n is the sample size. By balancing calibration and sharpness, we further determine that the optimal number of bins for UMB scales with n1/3, resulting in a risk bound of approximately O(n 2/3). Additionally, we tackle the challenge of label shift by proposing a two-stage approach that adjusts the recalibration function using limited labeled data from the target domain. Our results show that transferring a calibrated classifier requires significantly fewer target samples compared to recalibrating from scratch. We validate our theoretical findings through numerical simulations, which confirm the tightness of the proposed bounds, the optimal number of bins, and the effectiveness of label shift adaptation.

1 Introduction

Generating reliable probability estimates alongside accurate class labels is crucial in classification tasks. A probabilistic classifier is considered "well calibrated" when its predicted probabilities closely align with the empirical frequencies of the corresponding labels [9]. Calibration is highly desirable, particularly in high-stakes applications such as meteorological forecasting [32, 34, 10, 15], econometrics [16], personalized medicine [25, 24], and natural language processing [37, 7, 11, 53]. Unfortunately, many machine learning algorithms lack inherent calibration [18].

To tackle this challenge, various methods have been proposed for designing post hoc recalibration functions. These functions are used to assess calibration error [52, 42, 17, 4], detect miscalibration [30], and provide post-hoc recalibration [50, 51, 18, 48, 29, 21]. Despite the rapid development of recalibration algorithms, there is still a lack of a comprehensive theory that encompasses both calibration and sharpness (retaining predictive power) from a principled standpoint. Furthermore, existing methods often rely on diverse calibration metrics [17, 4], and the selection of hyperparameters is often based on heuristic approaches [44, 20] without rigorous justifications. This highlights the

Equal contribution.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

need for identifying an optimal metric to evaluate calibration, which can facilitate the development of a unified theory and design principles for recalibration functions.

In addition, the deployment of machine learning models to data distributions that differ from the training phase is increasingly common. These distribution shifts can occur naturally due to factors such as seasonality or other variations, or they can be induced artificially through data manipulation methods such as subsampling or data augmentation. Distribution shifts pose challenges to the generalization of machine learning models. Therefore, it becomes necessary to adapt the trained model to these new settings. One significant category of distribution shifts is label shift, where the marginal probabilities of the classes differ between the training and test sets while the class conditional feature distributions remain the same. With calibrated probabilistic predictions, label shift can be adjusted assuming access to class marginal probabilities [12]. However, miscalibration combined with label shift is common and remains a challenging problem [2, 14, 47, 41].

In this paper, we aim to address these issues in a twofold manner. Firstly, we develop a unified framework for recalibration that incorporates both calibration and sharpness in a principled manner. Secondly, we propose a composite estimator for recalibration in the presence of label shift that converges to the optimal recalibration. Our framework enables the adaptation of a classifier to the label-shifted domain in a sample-efficient manner.

1.1 Related work

Recalibration algorithms. Recalibration methods can be broadly categorized into parametric and nonparametric approaches. Parametric methods model the recalibration function in a parametric form and estimate the parameters using calibration data. Examples of parametric methods include Platt scaling [40], temperature scaling [18], Beta calibration [28], and Dirichlet calibration [27]. However, it has been reported that scaling methods are often less calibrated than supposed, and quantifying the degree of miscalibration can be challenging [29]. In contrast, nonparametric recalibration methods do not assume a specific parametric form for the recalibration function. These methods include histogram binning [50], isotonic regression [51], kernel density estimation [52, 42], splines [23], Gaussian processes [49], among others. Hybrid approaches, integrating both parametric and nonparametric techniques, have also been proposed. For instance, Kumar et al. [29] combine nonparametric histogram binning with parametric scaling to reduce variance and improve recalibration performance. Nevertheless, this hybrid approach is biased when its parametric assumptions fail. In this work, we consider a nonparametric histogram binning method called uniform-mass binning (UMB), which is asymptotically unbiased. Histogram binning method. Histogram binning methods are widely used for recalibration due to their simplicity and adaptability. The binning schemes can be pre-specified (e.g., uniformwidth binning [18]), data-dependent (e.g., uniform-mass binning [50]), or algorithm-induced [51]. When selecting a binning scheme, it is crucial to consider the trade-off between approximation and estimation. Coarser binning reduces estimation error (variance), leading to improved calibration, but at the expense of increased approximation error (bias), which diminishes sharpness. Thus, determining the optimal binning scheme and hyperparameters, such as the number of bins (B), remains an active area of research. [36] proposes a Bayesian binning method, but verifying the priors is often challenging. [44] suggests choosing the largest B that preserves monotonicity, which is heuristic and computationally inefficient. [20] offers a heuristic for choosing the largest B subject to a calibration constraint, lacking a quantitative characterization of sharpness. Our work builds upon the existing upper bounds for calibration risks of binning methods [29, 20] and derived upper bounds for a complementary risk component known as the sharpness risk. We quantitatively characterize the calibration-sharpness tradeoff, which yields an optimal choice for the number of bins that achieves the minimum risk. Adaptation to label shift. Label shift presents a challenge in generalizing models trained on one distribution (source) to a different distribution (target). As such, adapting to label shift has received considerable attention in the literature [12, 45, 31, 3, 2, 14]. In practical scenarios, it is common to encounter model miscalibration and label shift simultaneously [47, 41]. Empirical observations have highlighted the crucial role of probability calibration in label shift adaptation [2, 13], which is justified by subsequent theories [14]. However, to the best of our knowledge, there has been no prior work that specifically addresses the recalibration with a limited amount of labeled data from the target distribution. Our theoretical analysis points out that using only target labels achieves risk bounds of the same order as the methods using only target features [31, 3, 14].

1.2 Contributions

This paper contributes to the theory of recalibration across three key dimensions.

Firstly, we develop a comprehensive theory for recalibration in binary classification by adopting the mean-squared-error (MSE) decomposition framework commonly used in meteorology and space weather forecasting [5, 33, 8, 46]. Our approach formulates the probability recalibration problem as the minimization of a specific risk function, which can be orthogonally decomposed into calibration and sharpness components.

Secondly, utilizing the aforementioned framework, we derive a rigorous upper bound on the finitesample risk for uniform-mass binning (UMB) (Theorem 1). Furthermore, we minimize this risk bound and demonstrate that the optimal number of bins for UMB, balancing calibration and sharpness, scales on the order of n1/3, yielding the risk bound of order n 2/3, where n denotes the sample size.

Lastly, we address the challenge of recalibrating classifiers for label shift when only a limited labeled sample from the target distribution is available, a challenging situation for a direct recalibration approach. We propose a two-stage approach: first recalibrating the classfier on the abundant sourcedomain data, and then transfering it to the label-shifted target domain. We provide a finite-sample guarantee for the risk of this composite procedure (Theorem 2). Notably, to control the risk under ε, our approach requires a much smaller sample size from the target distribution than a direct recalibration on the target sample (Ω(ε 1) vs. Ω(ε 3/2), cf. Remark 4).

1.3 Organization

This paper is organized as follows. In Section 2, we introduce notation and provide an overview of calibration and sharpness. Section 3 introduces the notion of minimum-risk recalibration by defining the recalibration risk that takes into account both calibration and sharpness. In Section 4, we describe the uniform-mass histogram binning method for recalibration and provide a risk upper bound with rate analysis. We extend our approach to handle label shift in Section 5. To validate our theory and framework, we present numerical experiments in Section 6. Finally, in Section 7, we conclude the paper with a discussion and propose future research directions.

2 Preliminaries

2.1 Notation

Let N and R denote the set of positive integers and the set of real numbers, respectively. For n N, let [n] := {1, . . . , n}. For x R, let x = max{m Z : m x}. For any finite set S = {si : i [n]} R and any k [n], we let s(k) denote the k-th order statistic, which is the k-th smallest element in S.

Let (Ω, E, P) denote a generic probability space. For an event A E, the indicator function 1A : Ω {0, 1} is defined such that 1A(x) = 1 if and only if x A. We write A happens P-almost surely if P(A) = 1. For a probability measure P, define EP as the expectation, with subscript P omitted when the underlying probability measure is clear. For a probability measure P and a random variable X : Ω Rd, let PX := P X 1 be the probability measure induced by X.

Letting f, g : R R, we write f(x) = O(g(x)) as x if there exist M > 0 and x0 > 0 such that |f(x)| Mg(x) for all x x0. Likewise, we write f(x) = Ω(g(x)) if g(x) = O(f(x)). We write f(x) g(x) if f(x) = O(g(x)) and g(x) = O(f(x)). We write f(x) = O(g(x)) if there is k 1 such that f(x) = O(g(x) logk(g(x))).

2.2 Calibration and sharpness

Consider the binary classification problem; let X X and Y Y := {0, 1} denote the feature and label random variables. Letting P denote a probability measure, we want to construct a function f : X Z = [0, 1] that estimates the conditional probability, i.e.,

f(X) P[Y = 1 | X]. (1)

Since estimating the probability conditioned on the high dimensional X is difficult, the notion of calibration captures the intuition of (1) in a weaker sense [33, 9, 19]. Definition 1. A function f : X Z is (perfectly) calibrated with respect to probability measure P, if f(X) = P [Y = 1 | f(X)] P-almost surely.

Calibration itself does not guarantee a useful predictor. For instance, a constant predictor f(X) = EY is perfectly calibrated, but it does not change with the features. Such a predictor lacks sharpness [26], also known as resolution [35], another desired property which measures the variance in the target Y explained by the probabilistic prediction f(X). Definition 2. The sharpness of a function f : X Z with respect to probability measure P refers to the quantity Var E[Y | f(X)] = E h E[Y | f(X)] E[Y ] 2i .

The following decomposition of the mean squared error (MSE) suggests why it is desirable for a classifier f to be calibrated and have high sharpness; note that Var[Y ] is a quantity intrinsic to the problem, unrelated to f.

MSE(f) := E Y f(X) 2

| {z } mean-squared error

= Var[Y ] Var E Y | f(X)

| {z } sharpness

+ E f(X) E[Y | f(X)] 2

| {z } lack of calibration (2)

3 Optimal recalibration

For an arbitrary predictor f : X Z, the aim of recalibration is to identify a post-processing function h : Z Z such that h f is perfectly calibrated while maintaining the sharpness of f as much as possible. The calibration and sharpness can be evaluated using the following two notions of risks. We suppress the dependency of risks on f and P when it leads to no confusion. Definition 3. Let f : X Z and h : Z Z. The calibration risk of h over f under P is defined as

Rcal(h) = Rcal P (h; f) := EP h h f(X) EP [Y | h f(X)] 2i . (3)

Definition 4. Let f : X Z and h : Z Z. The sharpness risk of h over f under P is defined as

Rsha(h) = Rsha P (h; f) := EP h EP [Y | h f(X)] EP [Y | f(X)] 2i . (4)

Note that the calibration risk Rcal(h; f) = 0 if and only if h f is perfectly calibrated, cf. Definition 1. The sharpness risk Rsha(h; f) quantifies the decrement in sharpness of f incurred by applying the recalibration map h, and Rsha(h; f) = 0 when h is injective [29].

Next we define a comprehensive notion of risk that we will use to evaluate recalibration functions. Definition 5. Let f : X Z and h : Z Z. The recalibration risk of h over f under P is defined as R(h) = RP (h; f) := EP (h f(X) EP [Y | f(X)])2 . (5)

The following proposition shows that the recalibration risk can be decomposed into calibration risk and sharpness risk. The proof is deferred to Appendix A. Proposition 1 (Decomposition of recalibration risk). For any f : X Z and any h : Z Z,

RP (h; f) = Rcal P (h; f) + Rsha P (h; f). (6)

Note that RP (h) = 0 if and only if Rcal P (h) = 0 and Rsha P (h) = 0. This happens if and only if h f is calibrated, and the recalibration h preserves the sharpness of f in predicting Y .

If R(h) = 0, then we call h an optimal recalibration function (or minimum-risk recalibration function) of f under P. Let h f,P : Z Z be the function

h f,P (z) = EP [Y | f(X) = z], if z supp Pf(X), 0, if z supp Pf(X). (7)

Then R(h f,P ) = 0. Indeed, h is a minimum-risk recalibration function of f if and only if h = h f,P PZ-almost surely. Problem 1 (Recalibration). Suppose that we have a measurable function f : X Z and a dataset D = (xi, yi) : i [n] that is an independent and identically distributed (IID) sample drawn from P. The goal of recalibration is to estimate a recalibration function ˆh h f,P using f and D.

Problem 1 can be viewed as a regression problem, where we estimate the function form of E[Y | Z] from data {(zi, yi) : i [n]}, where zi = f(xi), i [n].

4 Recalibration via uniform-mass binning

4.1 Uniform-mass binning algorithm for recalibration

Given a dataset of prediction-label pairs {(zi, yi) Z Y : i [n]}, the histogram binning calibration method partitions Z = [0, 1] into a set of smaller bins, and estimate E[Y | Z] by taking the average in each bin. We consider the uniform mass binning, constructed using quantiles of predicted probabilities. Definition 6 (Uniform mass binning). Let S = {zi [0, 1] : i [n]}. A binning scheme B = {I1, I2, . . . , IB} is the uniform-mass binning (UMB) scheme induced by S if I1 = [u0, u1], and Ib = (ub 1, ub] b [B] \ {1}, (8) where u0 = 0, u B = 1, and ub = z( nb/B ) for b [B 1].

For our subsequent discussions, we make the following assumptions on the distribution of Y and Z:

(A1) The cumulative distribution function of Z, denoted by FZ, is absolutely continuous. (A2) h f,P , defined in (7), is monotonically non-decreasing on supp PZ.

(A3) There exists K > 0 such that if z1 z2, then h f,P (z2) h f,P (z1) K FZ(z2) FZ(z1) .

Assumption (A1) is made for the sake of analytical convenience without loss of generality; see discussions in [20, Appendix C]. An important implication of Assumption (A1) is that all intervals in B are non-empty PZ-almost surely if S = {Zi : i [n]} are IID under PZ. Assumption (A2) makes sure Z is informative to preserve the rankings of P[Y | Z] [51]. Lastly, Assumption (A3) posits that Z is sufficiently informative that P[Y = 1 | Z = z] does not change too rapidly in any interval I where PZ(I) is small. This assumption is mild but not trivial; see Appendix B.2.

Now we describe how to construct a recalibration function ˆh using UMB.

1. Given {(zi, yi) : i [n]}, and B N, let B = {I1, I2, . . . , IB} be the UMB scheme of size B induced by {zi : i [n]}.

2. Let ˆh = ˆh B : Z Z be a function such that

I B ˆµI 1I(z) where ˆµI := Pn i=1 yi 1I(zi) Pn i=1 1I(zi) , I B. (9)

4.2 Theoretical guarantees

Here we establish a high-probability upper bound on the recalibration risk R(ˆh) for the UMB estimator, ˆh defined in (9), which converges to 0 as the sample size n increases to . Theorem 1. Let P be a probability measure and f : X Z be a measurable function. Suppose that (A1) & (A2) hold. Let B be the UMB scheme induced by an IID sample of f(X), and let ˆh = ˆh B be the recalibration function based on B, cf. (9). Then there exists a universal constant c > 0 such that for any δ (0, 1), if n c |B| log 2|B|

δ , then with probability at least 1 δ,

log (4|B|/δ) 2( n/|B| 1) + 1 n/|B|

, and Rsha(ˆh)

8K2 |B|2 if (A3) holds.

Remark 1. We note that the upper bound on Rcal(ˆh) in Theorem 1 coincides with the result presented in [20] up to a constant factor in the failure probability. However, Theorem 1 provides an additional upper bound on Rsha(ˆh), thereby effectively managing the overall recalibration risk R(ˆh).

Optimal choice of the number of bins |B|. Based on Theorem 1, for any fixed sample size n, as the number of bins B = |B| increases, the calibration risk bound increases, while the sharpness risk bound decreases. This trade-off suggests we may get an optimal number of bins B by minimizing the upper bound for the overall risk, R(ˆh) = Rcal(ˆh) + Rsha(ˆh).

For the simplicity of our analysis, we assume n is divisible by B, and Assumption (A3) holds. Firstly, observe that n/B 2, and thus, n/B 1 n/(2B). Also, since B 1 and δ < 1, log( 4B

it follows that 1 n/B q

1 2(n/B 1) log 4B

1 n/B log 4B

δ . Thus, we obtain a simplified risk

bound R(ˆh) ζ(B; n, δ) where ζ(B; n, δ) := 4B

B2 . With log B terms ignored, minimizing ζ(B; n, δ) over B yields optimal B and resulting risk bounds, respectively:

B n1/3 log 1/3(1/δ), (10)

R(ˆh) Rcal(ˆh) Rsha(ˆh) = O B 2 = O n 2/3 log2/3(1/δ) . (11)

Furthermore, the asymptotic risk bound in (11) implies that R(ˆh), Rcal(ˆh), and Rsha(ˆh) are all bounded by ε with high probability if the sample size

n = Ω(ε 3/2). (12)

Comparison with the hybrid method [29]. Kumar et al. [29] proposed a hybrid recalibration approach that involves fitting a recalibration function in a parametric family H, which is then discretized using UMB. Note that the hypothesis class H may or may not include the optimal recalibration function h f,P . If h f,P H, using a similar analysis as above, we derive their high probability risk bound as O 1

δ + 1 B2 under Assumption (A3), which achieves O(n 1) when B n, exhibiting a faster decay than our R(ˆh) = O(n 2/3). While the faster rate is anticipated from employing parametric methods, we note that when h f,P / H, their method exhibits inherent bias (approximation error) induced by parametric function fitting, whereas our method is asymptotically unbiased. This distinction is corroborated by numerical simulations in Appendix E.2.

Proof sketch of Theorem 1. First, we demonstrate that the uniform-mass binning scheme B = {Ib}B b=1 satisfies two regularity conditions with high probability, when the sample size n is not too small. Specifically, we show that (i) B is 2-well-balanced [29] with respect to f(X), resulting in B bins having comparable probabilities (Lemma 3); and that (ii) the empirical mean in each bin of B uniformly concentrates to the population conditional mean of Y conditioned on f(X) being contained within the bin (Lemma 4). Thereafter, we prove that if B satisfies these two properties, then the calibration risk Rcal and the sharpness risk Rsha can be upper bounded as stated in Theorem 1; see Lemmas 5 and 6 (or Lemma 7 when (A3) holds). The detailed proof is in Appendix B.

5 Recalibration under label shift

This section extends the results from Section 4 to address label shift. In Section 5.1, we introduce the label shift assumption (Definition 7) and reframe the recalibration problem accordingly. We show that the optimal recalibration function in this context can be expressed as a composition of the optimal recalibration function (cf. Section 3) and a shift correction function. Building on this observation, we propose a two-stage estimator in Section 5.2, where each stage estimates one of the component functions. The composite estimator s overall performance is supported by theoretical guarantees.

5.1 Revisiting the problem formulation

Let P and Q denote the probability measures of the source and the target domains, respectively. We assume P and Q satisfy the label shift assumption defined below.

Definition 7 (Label shift). Probability measures P and Q are said to satisfy the label shift assumption if the following two conditions are satisfied:

(B1) P[X B | Y = k] = Q[X B | Y = k] for all k {0, 1} and all B B(X).

(B2) P[Y = 1] (0, 1) and Q[Y = 1] (0, 1).

According to Condition (B1), the class conditional distributions remain the same, while the marginal distribution of the classes may change. Condition (B2) requires all classes to be present in the source population, which is a standard regularity assumption in the discussion of label shift [31, 14]; it also posits the presence of every class in the target population.

Optimal recalibration under label shift. Under the label shift assumption between P and Q, we define the label shift correction function g : Z Z such that

g (z) = w 1z w 1z + w 0(1 z) where w k = Q[Y = k]

P[Y = k], k {0, 1}. (13)

The conditional probabilities under P and Q can be related [45] as follows:

Q[Y = 1 | X B] = g P[Y = 1 | X B] , B B(X). (14)

Recall that the optimal recalibration function for a predictor f : X Z under probability measure P is defined as h f,P (z) = P[Y = 1 | f(X) = z]; see (7). In the presence of a label shift between P and Q, we may write the optimal recalibration function for f under Q as h f,Q = g h f,P (15)

because h f,Q(z) (a) = Q[Y = 1 | f(X) = z] (b) = g (P[Y = 1 | f(X) = z]) (c) = g h f,P (z), where (a) and (c) follows from the definition of h f,P and (b) is due to (14).

Recalling the definition of the risk RP (h; f) from (5), we observe that RQ(h f,Q; f) = 0, which is consistent with the risk characterization of the optimal recalibration. Our goal is to estimate the optimal recalibration function h f,Q = g h f,P from data. Problem 2 (Recalibration under label shift). Suppose that we have a measurable function f : X Z and two IID datasets DP = (xi, yi)n P i=1 P and DQ = (x i, y i)n Q i=1 Q. The goal of recalibration under label shift is to estimate ˆh h f,Q using f, DP and DQ. Remark 2. The source (training) dataset DP may not be accessible due to privacy protections, proprietary data, or practical constraints, as is often the case when recalibrating a pre-trained black box classifier to new data. In these cases, it suffices to have estimates of the recalibration function h f,P and the marginal probabilities P[Y = k], k {0, 1} under P, for our method and analysis.

5.2 Two-stage recalibration under label shift

Method. We propose a composite estimator of h f,Q = g h f,P , which comprises two estimators ˆg g and ˆh P h f,P . Here we describe a procedure to produce this composite estimator.

1. Use DP to construct ˆh P : Z Z, the estimated recalibration function (9) (for f under P). 2. Use DP and DQ to construct ˆg : Z Z such that

ˆg(z) = ˆw1z ˆw1z + ˆw0(1 z) where ˆwk = ˆQ[Y = k] ˆP[Y = k] , k {0, 1}, (16)

where ˆP[Y = k] := 1 |DP | P|DP | i=1 1[yi = k] and ˆQ[Y = k] := 1 |DQ| P|DQ| i=1 1[y i = k] are the empirical estimates of the class marginal probabilities. 3. Let ˆh Q = ˆg ˆh P . (17)

Note that the recalibration estimator ˆh P (Step 1) remains the same with that in Section 4.1. Furthermore, the shift correction estimator ˆg (Step 2) is a plug-in estimator of the label shift correction function g in (14) based on the estimated weights, ˆw1 and ˆw0.

Theory. We present a recalibration risk upper bound for the proposed two-stage estimator. We let pk := P[Y = k], qk := Q[Y = k], and w k = qk

pk for k {0, 1}. Moreover, we let pmin := mink pk, qmin := mink pk, w min := mink w k and w max := maxk w k.

Theorem 2 (Convergence of ˆh Q). Let P, Q be probability measures and let f : X Z be a measurable function. Let DP P be an IID sample of size n P and DQ Q be an IID sample of size n Q. Suppose that Assumptions (B1) & (B2) hold. Let B be the UMB scheme induced by DP . Let ˆh P = ˆh P,B be the recalibration function (9) based on B, and let ˆg denote the shift correction function as defined in (16). Then

RQ ˆg ˆh P 2

2 + w max 3

w min 2 RP ˆh P ; f )

where ρk := ˆwk

w k , k {0, 1}.

(18) Furthermore, suppose that (A1), (A2) & (A3) hold. Then there exists a universal constant c > 0 such that for any δ (0, 1), if

n P max c, 27 pmin

|B| log 4|B|

and n Q 27 qmin log 16

then with probability at least 1 δ,

RQ ˆg ˆh P 2w max 3

1 2( n P /|B| 1) log 8|B|

+ 1 n P /|B|

+ 54 max 1 pmin n P , 1 qmin n Q

Remark 3. Note that ρk 1 as n P , n Q , and thus, the upper bound (18) reduces to

w min 2 RP ˆh P ; f . Moreover, when P = Q, we have w min = w max = 1, and this further simplifies to the recalibration risk without label shift, up to multiplicative constant 2. Remark 4 (Target sample complexity). Assume that n P n Q and the number of bins satisfies |B| n1/3 P . Then (19) implies RQ ˆg ˆh P ; f = O n 2/3 P + n 1 Q . This result indicates that the proposed recalibration method using (17) requires a significantly smaller target sample size n Q = Ω(ε 1) to control the risk, as compared to n Q = Ω(ε 3/2) in (12). Remark 5 (Comparison with label shift using unlabeled target data). When the source sample size n P is sufficiently large, we achieve a risk of RQ(ˆg ˆh P ) = O(n 1 Q ) with high probability. It is important to note that in this scenario, we only utilize the labels from the target sample to address label shift. Remarkably, the same rate applies when employing the algorithms proposed in [31, 3, 14], which solely rely on features from the target sample. For a proof sketch, please refer to Appendix D.

6 Numerical experiments

In this section, we present the results of our numerical simulations conducted to validate and reinforce the theoretical findings discussed earlier. The simulations are based on a family of recalibration functions called beta calibration [28]: Hbeta = {hbeta(z; a, b, c) : a 0, b 0, c R}, where hbeta( ; a, b, c) : [0, 1] [0, 1] is defined as

hbeta(z; a, b, c) = 1

1 + 1/ ec za (1 z)b . (20)

In addition, consider the joint distributions D(π) of X and Y , where Y Bernoulli(π), X | Y = 0 N( 2, 1), and X | Y = 1 N(2, 1), and a pre-trained probabilistic classifier f(x) = σ(x) := 1/(1 + e x). To accommodate the limitations of space, we summarize the results in Figure 1, 2, 3, and Table 1, 2, providing a concise overview. Detailed information about the simulation settings, implementation details, and further experimental results and discussions can be found in Appendix E. Our simulation code is available at https://github.com/Zeyu Sun/calibration_label_ shift.

103 105 107

Calibration Risk

B=6 B=34 B=184 B=1000

(a) Rcal(ˆh) vs. n

101 102 103

Calibration Risk

n=1.0e+02 n=4.6e+03 n=2.2e+05 n=1.0e+07

(b) Rcal(ˆh) vs. B

101 102 103

Sharpness Risk

n=1.0e+02 n=4.6e+03 n=2.2e+05 n=1.0e+07

(c) Rsha(ˆh) vs. B

101 102 103

n=1.0e+02 n=4.6e+03 n=2.2e+05 n=1.0e+07

(d) R(ˆh) vs. B

Figure 1: Medians (solid lines) and 10-90 percentile ranges (shaded areas) of quadrature estimates of population risks over 10 realizations and theoretical risk upper bounds (δ = 0.1) (dashed lines) for various n and B. (a)-(c) The empirical rates, Rcal = O(n 0.99B0.98) and Rsha = O(B 1.83), align with theoretically predicted rates, O(B/n) and O(B 2), in Thm. 1. (d) The empirically observed R and our upper bound exhibit similar trends as a function of B.

103 105 107

experiment theory

103 105 107

experiment theory n1/3

6 19 59 184 569 B

10 4 10 2 Risk

Figure 2: The optimal number of bins B for different sample sizes n plotted in linear scale (left) and log scale (middle), and the population risk for various combinations of n and B, with the optimal B marked by black dots (right). Note that the risk surface is relatively smooth around its minimum, suggesting the robustness of the optimal B.

Table 1: 90%-quantiles of the risks of Platt scaling [40], the hybrid method [29], uniform-width binning (UWB) [18], and uniform-mass binning (UMB) over 100 random calibration datasets drawn from Z Uniform[0, 1], and (a) Y | Z Bernoulli(hbeta(z; 4, 4, 0)), or (b) Y | Z Bernoulli(hbeta(z; 0.1, 4, 0)). While Platt scaling and the hybrid method achieve lower R under the correct parametric assumption, UWB and UMB may outperform when the parametric assumption fails.

(a) Correct parametric assumption

Metric ( 10 3) Rcal Rsha R MSE

Platt 0.122 0.000 0.122 10.338 Hybrid 0.119 0.212 0.315 10.532 UWB 0.661 0.194 0.855 11.071 UMB 0.647 0.212 0.839 11.055

(b) Misspecified parametric assumption

Metric ( 10 3) Rcal Rsha R MSE

Platt 3.682 0.000 3.682 21.994 Hybrid 3.117 0.251 3.360 21.672 UWB 0.572 0.238 0.810 19.122 UMB 0.560 0.251 0.797 19.109

Table 2: Risks under label shift from D(0.5) to D(0.1), with n P = 103 and n Q = 102. Standard deviations are computed from 10 random realizations. LABEL-SHIFT, only applying an injective ˆg, achieves Rsha = 0 but incurs high Rcal. SOURCE, recalibrated with B = n1/3 P on DP , incurs high Rcal. TARGET, recalibrated with B = n1/3 Q on DQ, incurs low Rcal. Our proposed COMPOSITE achieves the lowest nonzero Rsha and lowest Rcal.

Method Estimator Rcal Rsha R MSE

SOURCE ˆh P 0.016 0.005 0.0032 0.0014 0.019 0.006 0.029 0.006 TARGET ˆhtarget Q 0.0020 0.0025 0.049 0.006 0.051 0.006 0.060 0.006 LABEL-SHIFT ˆg 0.026 0.006 0 0.026 0.006 0.035 0.006 COMPOSITE ˆh Q 0.00019 0.00017 0.0032 0.0014 0.0034 0.0013 0.0127 0.0013

0.0 0.5 1.0 Predicted Probability

Empirical Frequency

Platt Hybrid UWB UMB

(a) Correct parametric assumption

0.0 0.5 1.0 Predicted Probability

Empirical Frequency

Platt Hybrid UWB UMB

(b) Misspecified parametric assumption

Figure 3: Optimal recalibration function h and recalibration function estimates by Platt Scaling [40], the hybrid method [29], UWB, and UMB when the parametric assumption is (a) correct and (b) misspecified. UMB traces the h in both cases, whereas the hybrid method traces Platt scaling, exhibiting an intrinsic bias from h in (b).

7 Discussion

This paper presents a comprehensive theory for recalibration, considering both calibration and sharpness within the mean-squared-error (MSE) decomposition framework. We use this framework to quantify the optimal calibration-sharpness balance and establish a rigorous upper bound on the finite-sample risk for uniform-mass binning (UMB). Additionally, we address the challenge of recalibration under label shift with limited access to labeled target data. Our proposed two-stage approach effectively estimates the recalibration function using ample data from the source domain and adjusts for the label shift using target domain data. Importantly, our findings suggest that transferring a calibrated classifier requires a significantly smaller target sample than recalibrating from scratch on the new domain. Numerical simulations confirm the tightness of the finite sample bounds, validate the optimal number of bins, and demonstrate the effectiveness of the label shift adaptation.

In concluding this paper, we identify several promising directions for future research.

Relaxation of the assumptions It would be worthwhile to explore whether or not the assumptions made in our analysis could be relaxed. For instance, the widely adopted monotonicity assumption (A2) and its variants [51, 52, 43, 44] may not hold in real-world settings. Thus, exploring potential relaxations of this assumption is valuable. In addition, a mild but non-trivial smoothness assumption (A3) (c.f. Remark 7 and 8) is introduced to obtain a tight sharpness risk upper bound (c.f. Remark 9); investigating its practical implications and exploring potential relaxations could be interesting future work. Calibration-sharpness framework analysis Applying our framework to analyze recalibration methods beyond UMB, such as isotonic regression [51] and kernel density estimation [52], can offer further insights into their performance and properties, providing guidance to practitioners in selecting suitable algorithms based on specific conditions and requirements. Multiclass probability recalibration The concept of calibration considered in this work extends to multi-class classification settings, known as canonical calibration [48]. Weaker, but more tractable notions of multi-class calibration have also been explored in the literature [48, 21]. While partitionbased methods, as multi-class extensions of binning methods, are known to have consistency [48] and vanishing calibration error [41], establishing upper bounds for their sharpness risk remains difficult. Additionally, designing a partition scheme in a multidimensional space is a challenging task [21]; the interplay between calibration and sharpness could potentially guide the development of partition strategies that balance both aspects in multi-class classification. Applications to real-world data Investigating the calibration-sharpness tradeoff in real-world applications, which often involve multiple classes, presents an interesting challenge. It is crucial to develop effective estimators for both calibration risk and sharpness risk in such scenarios. While estimators for calibration risk exist (e.g., binning-based and KDE-based) [6, 29, 52, 42] and a lower bound for sharpness risk has been established [39], a direct estimator of sharpness risk is still lacking.

Acknowledgments and Disclosure of Funding

The research in this paper was partially supported by ARO grants W911NF-23-1-0343 and W911NF19-1-0269 and by DOE grant DE-NA0003921.

[1] John Aitchison and Sheng M Shen. Logistic-normal distributions: Some properties and uses. Biometrika, 67(2):261 272, 1980.

[2] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maximum likelihood with biascorrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning, pages 222 232. PMLR, 2020.

[3] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2018.

[4] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727 1740, 2023.

[5] Glenn W Brier et al. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1 3, 1950.

[6] Jochen Bröcker. Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate dynamics, 39:655 667, 2012.

[7] Dallas Card and Noah A Smith. The importance of calibration for estimating proportions from annotations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1636 1646, 2018.

[8] Misty D Crown. Validation of the NOAA Space Weather Prediction Center s solar flare forecasting look-up table and forecaster-issued probabilities. Space Weather, 10(6), 2012.

[9] A Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605 610, 1982.

[10] Morris H De Groot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12 22, 1983.

[11] Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295 302, 2020.

[12] Charles Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973 978, 2001.

[13] Andrea Esuli, Alessio Molinari, and Fabrizio Sebastiani. A critical reassessment of the Saerens Latinne-Decaestecker algorithm for posterior probability adjustment. ACM Transactions on Information Systems (TOIS), 39(2):1 34, 2020.

[14] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. A unified view of label shift estimation. Advances in Neural Information Processing Systems, 33:3290 3300, 2020.

[15] Tilmann Gneiting and Adrian E Raftery. Weather forecasting with ensemble methods. Science, 310(5746):248 249, 2005.

[16] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007.

[17] Sebastian Gruber and Florian Buettner. Better uncertainty calibration via proper scores for classification and beyond. Advances in Neural Information Processing Systems, 35:8618 8632, 2022.

[18] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321 1330. PMLR, 2017.

[19] Chirag Gupta, Aleksandr Podkopaev, and Aaditya Ramdas. Distribution-free binary classification: Prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems, 33:3711 3723, 2020.

[20] Chirag Gupta and Aaditya Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, pages 3942 3952. PMLR, 2021.

[21] Chirag Gupta and Aaditya Ramdas. Top-label calibration and multiclass-to-binary reductions. In International Conference on Learning Representations, 2022.

[22] Chirag Gupta and Aaditya Ramdas. Online Platt scaling with calibeating. In International Conference on Machine Learning, 2023.

[23] Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. In International Conference on Learning Representations, 2020.

[24] Yingxiang Huang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27(4):621 633, 2020.

[25] Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263 274, 2012.

[26] Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. Advances in Neural Information Processing Systems, 28, 2015.

[27] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.

[28] Meelis Kull, Telmo M Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain wellcalibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11:5052 5080, 2017.

[29] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.

[30] Donghwan Lee, Xinmeng Huang, Hamed Hassani, and Edgar Dobriban. T-Cal: An optimal test for the calibration of predictive models. ar Xiv preprint ar Xiv:2203.01850, 2022.

[31] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122 3130. PMLR, 2018.

[32] Allan H Murphy. A new vector partition of the probability score. Journal of Applied Meteorology and Climatology, 12(4):595 600, 1973.

[33] Allan H Murphy and Edward S Epstein. Verification of probabilistic predictions: A brief review. Journal of Applied Meteorology and Climatology, 6(5):748 755, 1967.

[34] Allan H Murphy and Robert L Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society Series C: Applied Statistics, 26(1):41 47, 1977.

[35] Allan H Murphy and Robert L Winkler. A general framework for forecast verification. Monthly weather review, 115(7):1330 1338, 1987.

[36] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

[37] Khanh Nguyen and Brendan O Connor. Posterior calibration and exploratory analysis for natural language processing models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1587 1598, 2015.

[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

[39] Alexandre Perez-Lebel, Marine Le Morvan, and Gaël Varoquaux. Beyond calibration: Estimating the grouping loss of modern neural networks. In The Eleventh International Conference on Learning Representations, 2023.

[40] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61 74, 1999.

[41] Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, pages 844 853. PMLR, 2021.

[42] Teodora Popordanoska, Raphael Sayer, and Matthew Blaschko. A consistent and differentiable lp canonical calibration error estimator. Advances in Neural Information Processing Systems, 35:7933 7946, 2022.

[43] Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Richard Hartley, and Byron Boots. Intra order-preserving functions for calibration of multi-class neural networks. Advances in Neural Information Processing Systems, 33:13456 13467, 2020.

[44] Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, and Michael C Mozer. Mitigating bias in calibration error estimation. In International Conference on Artificial Intelligence and Statistics, pages 4036 4054. PMLR, 2022.

[45] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21 41, 2002.

[46] Zeyu Sun, Monica G Bobra, Xiantong Wang, Yu Wang, Hu Sun, Tamas Gombosi, Yang Chen, and Alfred Hero. Predicting solar flares using CNN and LSTM on two solar cycles of active region data. The Astrophysical Journal, 931(2):163, 2022.

[47] Junjiao Tian, Yen-Cheng Liu, Nathaniel Glaser, Yen-Chang Hsu, and Zsolt Kira. Posterior re-calibration for imbalanced datasets. Advances in Neural Information Processing Systems, 33:8101 8113, 2020.

[48] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3459 3467. PMLR, 2019.

[49] Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pages 178 190. PMLR, 2020.

[50] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 609 616, 2001.

[51] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694 699, 2002.

[52] Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, pages 11117 11128. PMLR, 2020.

[53] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697 12706. PMLR, 2021.

A Proof of Proposition 1

Proof of Proposition 1. First of all, we recall the definition of the two risks from (3) and (4):

Rcal(h; f) = E h f(X) E [Y | f(X)] 2i

Rsha(h; f) = E h E [Y | h f(X)] E [Y | f(X)] 2i .

To keep our notation concise, we use Z = f(X) as a shorthand notation, and also let YZ := E[Y | Z] and Yh(Z) = E[Y | h(Z)] throughout this proof. We can decompose the recalibration risk from Definition 5:

R(h) = E[(h(Z) YZ)2]

= E[(h(Z) Yh(Z) + Yh(Z) YZ)2]

= E[(h(Z) Yh(Z))2] + E[(Yh(Z) YZ)2] + 2E[(h(Z) Yh(Z))(Yh(Z) YZ)]

= E[(h(Z) Yh(Z))2] + E[(Yh(Z) YZ)2] + 2E[E[(h(Z) Yh(Z))(Yh(Z) YZ) | h(Z)]]

= E[(h(Z) Yh(Z))2] + E[(Yh(Z) YZ)2] + 2E[(h(Z) Yh(Z))(Yh(Z) E[YZ | h(Z)])]

= E[(h(Z) Yh(Z))2] + E[(Yh(Z) YZ)2] + 2E[(h(Z) Yh(Z))(Yh(Z) Yh(Z))]

= E[(h(Z) Yh(Z))2] | {z } Rcal(h)

+ E[(Yh(Z) YZ)2] | {z } Rsha(h)

B Proof of Theorem 1

In this section, we present a proof of Theorem 1. Let (Y, Z) Y Z be random variables that admits a joint distribution PY,Z, which we assume to be fixed throughout this section. Let S = {(yi, zi) Y Z : i [n]} and let B = {I1, I2, . . . , IB} be the uniform-mass binning scheme (cf. Definition 6) of size B induced by (zi s in) S. Note that if S is a random sample from PY,Z, then the binning scheme B induced by S is also a random variable following a derived distribution. To facilitate our analysis, we introduce the notion of well-balanced binning. Definition 8 (Well-balanced binning; [29]). Let B N, let Z be a random variable that takes value in [0, 1], and let α R such that α 1. A binning scheme B of size B is α-well-balanced with respect to Z if 1 αB P[Z Ib] α

In addition, we define two (parameterized families of) Boolean-valued functions Φbalance and Φapprox as follows: for any binning scheme B,

α R, Φbalance(B; α) := 1 1 α|B| P [Z I] α

|B|, I B , (21)

ε R, Φapprox(B; ε) := 1 max I B |ˆµI µI| ε , (22)

where 1(A) = 1 if and only if the predicate A is true, and for each interval I B,

ˆµI = Pn i=1 yi 1I(zi) Pn i=1 1I(zi) and µI = E(Y,Z) PY,Z [Y 1I(Z)] . (23)

Note that if Φbalance(B; α) = 1 for α 1, then B is α-well-balanced with respect to Z (cf. Definition 8). Also, if Φapprox(B; ε) = 1 for ε 0, then the conditional empirical mean of Y in each bin I B approximates the conditional expectation with error at most ε, uniformly for all bins.

The rest of this section is organized as follows. In Section B.1, we ensure that for an appropriate choice of α, ε R, it holds with high probability (with respect to the randomness in B) that Φbalance(B; α) = Φapprox(B; ε) = 1. In Section B.2, we establish upper bounds on the reliability risk Rcal and the sharpness risk Rsha under the premise that Φbalance(B; α) = Φapprox(B; ε) = 1. Finally, in Section B.3, we conclude the proof of Theorem 1 by combining these results together.

B.1 High-probability certification of the conditions

Well-balanced binning scheme. First of all, we observe that the uniform-bass binning scheme B induced by an IID random sample from PY,Z is 2-well-balanced with high probability, if the sample size is sufficiently large. Here we paraphrase a result from [29] in our language. Lemma 3 ([29, Lemma 4.3]). Let S = {Zi : i [n]} be an IID sample drawn from PZ and let B be the uniform-mass binning scheme of size B induced by S. There exists a universal constant c > 0 such that for any δ (0, 1), if n c B log(B/δ), then Φbalance(B, 2) = 1 with probability at least 1 δ.

Lemma 3 states that

n c B log B

= P [B is 2-well-balanced with respect to PZ] 1 δ.

While the value of the universal constant c was not specified in the original reference [29], we remark that one may set, for example, c = 2420, which can be verified by following their proof with c kept explicit.

The proof of Lemma 3 in [29] relies on a discretization argument that considers a fine-grained cover of Z = [0, 1] consisting of disjoint intervals namely, I j : j [10B] such that P[Z I j] = 1 10B for all j [10B] and then approximates each Ib by a subset of the cover. As the authors of [29] remarked, this argument provides a tighter sample complexity upper bound than naïvely applying Chernoff bounds or a standard VC dimension argument, which would yield an upper bound of order O B2 log B

δ . We omit the proof of Lemma 3 and refer interested readers to the referenced paper [29] for more details.

Uniform concentration of bin-wise means. Next, we argue that for the uniform-mass binning scheme B induced by an IID sample, the conditional empirical means of each bin concentrates to the population conditional expectation, uniformly for all bins in B. Here we restate a result from [20]. Lemma 4 ([20, Corollary 1]). Let PZ be an absolutely continuous probability measure on Z = [0, 1], and S = {Zi : i [n]} be an IID sample drawn from PZ. Let B N such that B n

2 and B be the uniform-mass binning scheme of size B induced by S. Then for any δ (0, 1),

P [Φapprox(B; εδ) = 1] 1 δ where εδ =

1 2( n/B 1) log 2B

+ 1 n/B . (24)

Lemma 4 states that under the mild regularity condition of PZ being absolutely continuous, the uniform-mass binning accurately approximates all bin-wise conditional means as long as there are at least two samples per bin in the sense that

sup b [B] |ˆµb µb|

1 2( n/B 1) log 2B

B.2 Conditional upper bounds on reliability risk and sharpness risk

In this section, we establish upper bounds on the reliability risk Rcal and the sharpness risk Rsha

for ˆh under the premise that Φbalance(B; α) = 1 and Φapprox(B; ε) = 1 for appropriate parameters α, ε R.

Preparation. To avoid clutter in the lemma statements to follow, here we recall our problem setting and set several notation that will be used throughout this section. Recall that P = PX,Y is a joint distribution on X Y and let f : X Z is a measurable function. In addition, we let S = {(xi, yi) X Y : i [n]} be an IID sample drawn from P, and let

S = n (z, y) Z Y : (x, y) S and z = f(x) o . Let B be the uniform-mass binning scheme

induced by (z s in) S, and let ˆh = ˆB : Z Z be the recalibration function derived from B as we described in Section 4.1; see (9). The dependence among P, f, S, S, B, and ˆh are summarized by a diagram in Figure 4.

Furthermore, we define the index function for a binning scheme to facilitate our analysis.

Figure 4: Stochastic dependence among P, f, S, S, B, and ˆh.

Definition 9. Let B be a binning scheme. The index function for B is the function β : Z [|B|] such that β(z) = X

I B 1(0,sup I](z). (25)

Remark 6. Note that β is a measurable function and defines an index function that identifies which bin of B the argument z [0, 1] belongs to. Specifically, suppose that B = {I1, . . . , IB} for some B N and there exists u0, u1, . . . , u B [0, 1] such that (i) 0 = u0 < u1 < < u B = 1 and (ii) Ib = (ub 1, ub] for all b [B] \ {1} and I1 = [u0, u1]. Then β(z) = b if and only if z Ib.

B.2.1 Calibration risk upper bound

We observe that if a binning scheme B produces empirical means ˆµI that approximate the true means µI with error at most ε, then the calibration risk is upper bounded by ε2. Lemma 5 (Calibration risk bound). For any ε 0, if Φapprox(B; ε) = 1, then

Rcal(ˆh; f, P) ε2.

Proof of Lemma 5. To begin with, we recall the definition of the calibration risk (Definition 3), and let Z = f(X). Then we may write

Rcal ˆh; f, P = E ˆh(Z) E[Y | ˆh(Z)] 2

= E E ˆh(Z) E[Y | ˆh(Z)] 2 β(Z) the law of total expectation

= E h ˆµIβ(Z) µIβ(Z) 2i cf. (23)

max I B ˆµI µI 2.

Note that if Φapprox(B; ε) = 1, then max I B ˆµI µI 2 ε2.

We remark that the proof of Lemma 5 is a simple application of applying Hölder s inequality. Also, we note that a similar argument was considered in [20, Proposition 1] to establish the inequalities between the Lp-counterparts of the calibration risk, which they call the ℓp-expected calibration error (ECE). In this work, we focus on the case p = 2.

B.2.2 Sharpness risk upper bound

Next, we present an upper bound for the sharpness risk that diminishes as the binning scheme B becomes more balanced. Lemma 6 (Sharpness risk bound). Suppose that the optimal post-hoc recalibration function h f,P , cf. (7), is monotonically non-decreasing. Let α R such that α 1. If Φbalance(B, α) = 1, then

Rsha(ˆh; f, P) α

Proof of Lemma 6. Letting Z = f(X), we can write the sharpness risk of ˆh over f with repsect to P as Rsha(ˆh; f, P) := E h E Y | ˆh(Z) E [Y | Z] 2i .

We recall the definition of the index function β for B (Definition 9) and observe that

E h E h Y | ˆh(Z) i E [Y | Z] 2i

E h E h Y | ˆh(Z) i E [Y | Z] i E h Y | ˆh(Z) i E [Y | Z] 1

I B E h E[Y | ˆh(Z)] E [Y | Z] 1I(Z) i

I B E h E h E[Y | ˆh(Z)] E [Y | Z] 1I(Z) β(Z) ii the law of total expectation

I B P[Z I] E h E h E[Y | ˆh(Z)] E [Y | Z] Z I ii Remark 6

I B P[Z I] sup z I h f,P (z) inf z I h f,P (z) by definition of h f,P ; cf. (7)

α |B| sup z I h f,P (z) inf z I h f,P (z) Φbalance(B, α) = 1

The inequality in the last line follows from the facts that (i) I B are mutually exclusive and (ii) h f,P (z) [0, 1] and h f,P is monotone non-decreasing.

Our proof of Lemma 6 relies on similar techniques that are used in [29, Lemmas D.5 and D.6]. However, we note that we obtain an improved constant 1 as opposed to 2 in [29, Lemma D.6] with a more refined analysis.

An improved rate with additional assumptions. It is possible to improve the rate of the sharpness risk upper bound from O(|B| 1) to O(|B| 2) with an additional regularity assumption on h f,P .

Recall that we assumed in (A3) that there exists K > 0 such that if z1 z2, then h f,P (z2) h f,P (z1) K FZ(z2) FZ(z1) , that is, h f,P is K-smooth with respect to FZ. This posits that the conditional probability P[Y = 1|Z = z] of the target variable Y given a forecast variable Z cannot vary too much in regions where the density of Z is low, or where the forecast is rarely issued. This is a reasonable assumption because if P[Y = 1|Z] changes too rapidly with respect to Z, then it suggests that we need additional information about Y beyond what Z can provide in order to improve the quality of forecasts. We remark that (A3) is indeed a fairly mild assumption to impose on, however, is not a trivial one. Remark 7 (Mildness of (A3)). Suppose that Z = f(X) has a density p Z that is uniformly lower bounded by ϵ on the support of Z. If h f,P is L-Lipschitz, then h f,P is (L/ϵ)-smooth with respect to FZ. This also provides a sufficient condition to verify (A3) in practice. Remark 8 (Non-triviality of (A3)). Notice that even if FZ is absolutely continuous and h f,P is continuous, the smoothness constant K could become large if the prediction Z is heavily miscalibrated. For instance, in Figure 6, h f,P (z) is changing fast in the interval [0.5, 0.75] where p Z(z) is small, which results in a larger value of K that can even diverge if p Z(z) 0.

Here we define the notion of ψ-smoothness to formalize Assumption (A3), and then present an improved upper bound for the sharpness risk. Definition 10 (ψ-smoothness). Let K R+ and ψ : [0, 1] [0, 1] be a monotone non-decreasing function. A function ϕ : [0, 1] [0, 1] is K-smooth with respect to ψ if for any z1, z2 [0, 1] such that z1 z2, ϕ(z2) ϕ(z1) K ψ(z2) ψ(z1) . (26) Lemma 7 (Improved sharpness risk bound). Suppose that the function h f,P (z) defined in (7) is monotonically non-decreasing and K-smooth with respect to FZ for some K 0, where FZ is the cumulative distribution function of Z = f(X). If Φbalance(B, α) = 1, then

Proof of Lemma 7. Let Z = f(X) and B = |B|. For each b [B], we let zb,max := sup Ib and zb,min := inf Ib. Then we have

Rsha(ˆh; f, P)

= E h E Y | ˆh(Z) E [Y | Z] 2i

= E h E h E Y | ˆh(Z) E [Y | Z] 2 β(Z) ii

b=1 P[Z Ib] E h E Y | ˆh(Z) E [Y | Z] 2 β(Z) = b i

b=1 P[Z Ib] h f,P (zb,max) h f,P (zb,min) 2 h f,P is non-decreasing

b=1 P[Z Ib] K FZ(zb,max) FZ(zb,min) 2 h f,P is K-smooth w.r.t. FZ

b=1 K2 P[Z Ib]3

3 Φbalance(B, α) = 1

Remark 9 (Tightness of the rate O(B 2)). The asymptotic rate Rsha = O(B 2) is tight and cannot be further improved without additional assumptions. For instance, let s consider a uniform-mass binning of size B on Z Uniform[0, 1]. In the population limit, each bin has width 1/B and withinbin variance 1/(12B2). Thus, the sharpness risk, obtained by taking expectation of the conditional variance (per each bin), is 1/(12B2), attaining the rate B 2.

B.3 Completing the proof of Theorem 1

Proof of Theorem 1. For given δ (0, 1), let δ1 = δ2 = δ/2. Then we observe that

n c |B| log |B|

= P [Φbalance(B, 2) = 1] 1 δ1 by Lemma 3

n 2|B| = P [Φapprox(B, εδ2) = 1] 1 δ2 by Lemma 4

where c > 0 is the universal constant that appears in Lemma 3 and

1 2( n/|B| 1) log 2|B|

+ 1 n/|B| .

Observe that δ1 = δ

2 and |B| 1, and thus, log |B|

log 2. Letting c := max{c , 2 log 2} and applying the union bound, we have

n c |B| log 2|B|

= P Φbalance(B, 2) = 1 and Φapprox(B, εδ/2) = 1 1 δ.

Next, we observe that if Φbalance(B, 2) = 1 and Φapprox(B, εδ2) = 1, then

Rcal(ˆh; f; P) (εδ2)2 by Lemma 5,

Rsha(ˆh; f, P) 2

|B|, by Lemma 6.

Additionally, if the Assumption (A3) also holds, then we obtain a stronger upper bound on Rsha(ˆh; f, P) by Lemma 7:

Rsha(ˆh; f, P) 8K2

C Proof of Theorem 2

This section contains a proof of Theorem 2. Prior to the proof, in Section C.1, we provide several lemmas that will be useful in our proof. Thereafter, we present a proof of Theorem 2 in its entirety in Section C.2.

C.1 Useful lemmas

C.1.1 Concentration of ˆwk to w k

First of all, we recall the binomial Chernoff bound, which is a classical result about the concentration of measures that can be found in standard textbooks on probability theory.

Lemma 8 (Binomial Chernoff bound). Let Xi be IID Bernoulli random variables with parameters p (0, 1), and let Sn := 1

n Pn i=1 Xi. Then for any δ R such that 0 < ε < 1,

P [Sn (1 + ε)p] exp ε2p

P [Sn (1 ε)p] exp ε2p

It follows from Lemma 8 that for any ε, δ (0, 1),

n 3 ε2p log 2

p > ε δ. (27)

Let P, Q be two distributions on Y = {0, 1}, and let DP P, DQ Q denote IID samples of size n P , n Q, respectively. Recall from (13) and (16) that for each k {0, 1}, we define

w k = PQ[Y = k]

PP [Y = k], and ˆwk = PDQ[Y = k]

PDP [Y = k].

Then, we let

w 0 and ρ1 := ˆw1

Now we define another parameterized family of Boolean-valued functions Φratio(DP , DQ; β) as follows. Given DP P, DQ Q, and β R such that 1 < β 2,

Φratio(DP , DQ; β) := 1 1

β ρk β, k {0, 1} . (29)

Corollary 9. Let P, Q be two distributions on Y = {0, 1}, and let DP P, DQ Q denote IID samples of size n P , n Q, respectively. For each k {0, 1}, let pk := PP [Y = k] and qk := PQ[Y = k]. Likewise, we let ˆpk = 1 n P P

yi DP 1{yi = k} and ˆqk = 1 n Q P

yi DQ 1{yi = k}. For any δ (0, 1) and any β (1, 2], if

n P 27 (β 1)2 min{p0, p1} log 8

and n Q 27 (β 1)2 min{q0, q1} log 8

then P (Φratio(DP , DQ; β) = 1) 1 δ.

Proof of Corollary 9. Let ε = β 1

3 . Since 1+x

1 x 1 + 3x for all x [0, 1/3], we have 1

1+ε < 1+ε 1 ε β. Then it follows from (27) that for each k {0, 1},

n P 3 ε2pk log 8

= P |ˆpk pk|

n Q 3 ε2qk log 8

= P |ˆqk qk|

Applying the union bound, we obtain the following implication:

n P 3 ε2 min{p0, p1} log 8

and n Q 3 ε2 min{q0, q1} log 8

= P max k {0,1} |ˆpk pk|

pk > ε or max k {0,1} |ˆqk qk|

= P max k {0,1} ρk > 1 + ε

1 ε or min k {0,1} ρk < 1 ε

= P max k {0,1} ρk > β or min k {0,1} ρk < 1

C.1.2 Regularity of the Shift Correction Function

Lemma 10. Let w = (w0, w1) R2 such that w0, w1 > 0 and w0 + w1 = 1. The function

gw : [0, 1] [0, 1] such that gw(z) = w1z w1z+w0(1 z) is L-Lipschitz where L = max n w1 w0 , w0

Proof of Lemma 10. First of all, consider the first-order derivative of gw: d dz gw(z) = w1 w1z + w0(1 z) w1z (w1 w0) w1z + w0(1 z) 2 = w1w0 w1z + w0(1 z) 2 .

We observe that gw is monotone increasing as d dzgw(z) > 0 for all z [0, 1]. Next, we consider the second-order derivative of gw:

dz2 gw(z) = 2w0w1 (w0 w1)

w1z + w0(1 z) 3

> 0, z [0, 1] if w0 > w1, = 0, z [0, 1] if w0 = w1, < 0, z [0, 1] if w0 < w1. Therefore,

sup z [0,1]

d dz gw(z) =

d dzgw(z) z=1 = w0

w1 if w0 > w1,

d dzgw(z) z=0 = w1

w0 if w0 w1.

Lemma 11. Let P, Q be joint distributions of (X, Y ) X {0, 1}, and let wk = P [Y =k]

Q[Y =k] for k {0, 1}. If P, Q satisfy the label shift assumption (Definition 7), i.e., if Assumptions (B1) and (B2) hold, then for any measurable function f : X R, the following two-sided inequality holds:

min k {0,1} wk EQ[f(X)]

EP [f(X)] max k {0,1} wk. (30)

Proof of Lemma 11. First of all, we observe that EQ [f(X)] = EQ EQ f(X) | Y by the law of total expectation

k=0 PQ[Y = k] EQ f(X) | Y

wk PP [Y = k] EP f(X) | Y . by definition of wk & the label shift assumption

Thus, it follows that mink wk EP [f(X)] EQ[f(X)] maxk wk EP [f(X)].

C.2 Completing the proof of Theorem 2

Proof of Theorem 2. This proof is presented in four steps. In Step 1, we establish a simple upper bound for the risk RQ(ˆh Q; f) that consists of two error terms: the first term quantifies the error introduced by the estimated label shift correction, ˆg, while the second term quantifies the error due to the estimated recalibration function, ˆh P . In Steps 2 and 3, we derive separate upper bounds for these two error terms. Finally, in Step 4, we combine the results from Steps 1-3 to obtain a comprehensive upper bound for RQ, which concludes the proof.

Step 1. Decomposition of RQ. Recalling the definition of the risk RQ, cf. (5), we obtain the following inequality:

RQ(ˆh Q; f) = EQ

ˆh Q f(X) EQ[Y |f(X)] 2

ˆg ˆh P f(X) g ˆh P f(X) + g ˆh P f(X) EQ[Y |f(X)] 2

ˆg ˆh P f(X) g ˆh P f(X) 2

| {z } =:T1

g ˆh P f(X) EQ[Y |f(X)] 2

| {z } =:T2

where (a) follows from the simple inequality (a + b)2 2(a2 + b2) for all a, b R.

In Step 2 and Step 3 of this proof, we establish separate upper bounds for the two terms, T1, T2.

Step 2. An upper bound for T1. Recall from (13) and (16) that

g (z) = w 1z w 1z + w 0(1 z) where w k = Q[Y = k]

P[Y = k], k {0, 1},

ˆg(z) = ˆw1z ˆw1z + ˆw0(1 z) where ˆwk = ˆQ[Y = k] ˆP[Y = k] , k {0, 1}.

Let ρ0 := ˆw0

w 0 and ρ1 := ˆw1

Then we observe that for any z (0, 1),

|ˆg(z) g (z)| = ˆw1z ˆw1z + ˆw0(1 z) w 1z w 1z + w 0(1 z)

ˆw1w 0 w 1 ˆw0 z(1 z) ˆw1z + ˆw0(1 z) w 1z + w 0(1 z)

ˆw1w 0 w 1 ˆw0 z(1 z) ˆw1w 0 + w 1 ˆw0 z(1 z)

= ˆw1w 0 w 1 ˆw0 ˆw1w 0 + w 1 ˆw0

Moreover, ˆg(0) = g (0) = 0 and ˆg(1) = g (1) = 1. Letting Zˆh := ˆh P f(X), we obtain

ˆg(Zˆh) g (Zˆh) 2 ρ0 ρ1

It remains to establish probabilistic tail bounds for ρ0, ρ1, which we will accomplish in Step 4 of this proof.

Step 3. An upper bound for T2. We observe that

g ˆh P f(X) EQ[Y | f(X)] 2

g ˆh P f(X) g EP [Y | f(X)] 2 Label shift assumption, cf. (14)

w max w min

ˆh P f(X) EP [Y | f(X)] 2 g is w max w min -Lipschitz, cf. Lemma 10

w max w min

ˆh P f(X) EP [Y | f(X)] 2 by Lemma 11

w min 2 RP ˆh P ; f .

Step 4. Concluding the proof. For given δ (0, 1), let2 δ1 = δ2 = δ/4 and δ3 = δ/2. We observe that

n P c |B| log |B|

= P [Φbalance(B, 2) = 1] 1 δ1 by Lemma 3

n P 2|B| = P [Φapprox(B, εδ2) = 1] 1 δ2 by Lemma 4

where c > 0 is the universal constant that appears in Lemma 3 and

1 2( n/|B| 1) log 2|B|

+ 1 n/|B| .

Furthermore, assuming

n P 27 min{p0, p1} log 8

and n Q 27 min{q0, q1} log 8

we may define βδ3 as a function of n P , n Q and δ3 such that

βδ3 = βδ3(n P , n Q) := 1 +

max 1 n P min{p0, p1}, 1 n Q min{q0, q1}

Then it follows from Corollary 9 that

P (Φratio(DP , DQ; β0) = 1) 1 δ3.

Observe that δ1 = δ

4 and |B| 4, and thus, log |B|

log 16 2. Let c = c . Since c 1

and log |B|

δ = log 8 δ3

, we notice that

n P max c, 27 min{p0, p1}

|B| log 4|B|

= n P max c |B| log |B|

, 2|B|, 27 min{p0, p1} log 8

In summary, we obtain that for any given δ (0, 1),

n P max c, 27 min{p0, p1}

|B| log 4|B|

and n Q 27 min{q0, q1} log 16

= P Φbalance(B, 2) = 1 & Φapprox(B, εδ/4) = 1 & Φratio(DP , DQ; βδ/2) = 1 1 δ. (36)

2We remark that our decomposition of δ into δ1, δ2, δ3 is arbitrary, and is intended to simplify the subsequent analysis.

Conditioned on the event Φbalance(B, 2) = 1 & Φapprox(B, εδ/4) = 1 & Φratio(DP , DQ; βδ/2) = 1,

βδ/2 1 βδ/2 βδ/2 + 1 βδ/2

βδ/2 1 2 , (34); also, see (29)

w min 2 RP ˆh P ; f

w min 2 ε2 δ/4 + 2

. proof of Theorem 1; Lemmas 5 & 6

Note that if Assumption (A3) holds, then we additionally have

w min 2 ε2 δ/4 + 8K2

Inserting these upper bounds for T1 and T2 into (31), (32) and recalling the expression for β in (35), we complete the proof.

D Proof sketch of the argument in Remark 5

Recall our composite recalibration function,

ˆh Q = ˆg ˆh P , (37)

does not use the features in DQ. Specifically, ˆg is parameterized by ˆw = ( ˆw0, ˆw1), which can be estimated using only labels in DP and DQ, cf. (16). According to Theorem 2, RQ(ˆh Q) = O(n 1 Q ) with high probability for sufficiently large n P .

Now suppose we are given an unlabeled target sample with unknown label shift. We can estimate w using the target features via a maximum likelihood label shift estimation approach [14], yielding ˆw ML = ( ˆw ML 0 , ˆw ML 1 ). and the calibrated classifier ˆh f . This results in a different composite recalibration function than Equation (37),

ˆh ML Q = ˆg ML ˆh P , (38)

where ˆg ML : [0, 1] [0, 1] is defined as ˆg ML(z) = ˆw ML 1 z/( ˆw ML 1 z + ˆw ML 0 (1 z)). We claim in Remark 5 that, for sufficiently large n P , the composite recalibration function in Equation (38) achieves RQ(ˆh ML Q ) = O(n 1 Q ) with high probability, enjoying the same convergence rate as ˆh Q (Equation 37). Here we give a proof sketch.

Proof sketch. Suppose we use the maximum likelihood approach in [14] to estimate w. We want to show RQ(ˆh ML Q ) = O(n 1 Q ) with high probability for sufficiently large n P . Recall RQ(ˆh Q) 2(T1 + T2) according to Equation (31) and (32), and label shift estimation error only affects T1, so it is sufficient to show T1 = O(n 1 Q ) with high probability.

For sufficiently large n P , ˆh P f is sufficient calibrated, so ˆw w 2 2 = O(n 1 Q ) by Theorem 3 in [14]. Since

ˆw w 2 2 = X

k {0,1} (ρk 1)2w2 k w min 2 X

k {0,1} (ρk 1)2 w min 2 max k {0,1}(ρk 1)2, (39)

we have ρk [1 α, 1+α] for k {0, 1}, where α = ˆ w w 2

w min > 0. For sufficiently small ˆw w 2 2, we can control α < 0.5, which bounds T1 in (34):

2 2α 2 2α 2α = 2 ˆw w 2

w min = O(n 1 Q ). (40)

The rest of the proof are the same with Appendix C.2.

E Details on the experiments

In Section E.1 and E.3, we consider a family of joint distributions D(π) of X and Y , where Y Bernoulli(π), X | Y = 0 N( 2, 1), and X | Y = 1 N(2, 1). Suppose we are given f(x) = σ(x) := 1/(1 + e x), for x R, as a probabilistic classifier. The optimal recalibration function can be derived as h f,P (z) = P[Y = 1 | f(X) = z] = σ(4σ 1(z)). (41)

In Section E.2, we consider a parametric family of recalibration functions called beta calibration [28]: Hbeta = {hbeta( ; a, b, c) : a 0, b 0, c R}, where hbeta( ; a, b, c) : [0, 1] [0, 1] is defined as

hbeta(z; a, b, c) = 1

1 + 1/ ec za (1 z)b . (42)

In addition, consider a subfamily Hlogit-normal Hbeta defined as Hlogit-normal = {hlogit-normal( ; a, c) := hbeta( ; a, a, c) = σ(aσ 1( ) + c) : a 0, c R}3. Apparently, the optimal recalibration function in Equation (41), h f,P Hlogit-normal.

E.1 Verifying results for UMB

First, we recalibrate f on data distributed as D(0.5) using UMB.

Verifying the risk convergence in Theorem 1 We vary n [102, 107] and B [6, 103] in the log scale. For each combination of (n, B), we use UMB to recalibrate f on data generated from D(0.5), and compute quadrature estimates of population Rcal(ˆh), Rsha(ˆh), and R(ˆh), as well as their high probability bounds based on Theorem 1. The constant K in Assumption (A3) is selected by numerical maximization as

K = max 0 z1<z2 1 h (z2) h (z1)

P[Z [z1, z2]] .

Figure 1 shows the bounds follow the same trends as their associated population quantities, providing valid upper bounds in all cases.

Verifying the optimal choice of the number of bins. We find empirically optimal B experiment that achieves the minimal risk for each choice of n. We compute the theoretically optimal choice of the number of bins, B theory, by minimizing the finite-sample upper bounds. Figure 2 shows B experiment

follows the same trend with B theory, both scales in O(n1/3).

E.2 Comparing recalibration methods

To highlight the benefits and drawbacks of UMB s nonparametric nature, we compare UMB with (semi-)parametric recalibration methods in scenarios where the parametric assumption is correct and where it is misspecified. We compare the method under study, uniform-mass binning (UMB), with 3 other recalibration methods: uniform-width binning (UWB) [18], Platt scaling [40]4, and a hybrid parametric-binning method [29]. Note that Platt scaling and the hybrid method adopt the parametric assumption h Hlogit-normal.

For the first setting, we construct optimal recalibration function h Hlogit-normal so that the parametric assumption of Platt scaling and the hybrid method holds. In particular, we consider the distribution Z Uniform[0, 1] and Y | Z Bernoulli(hlogit-normal(Z; a, c)) with a = 4 and c = 0. For the second setting, we construct h Hbeta but h / Hlogit-normal so that the parametric assumption fails. In particular, we consider the distribution Z Uniform[0, 1] and Y | Z Bernoulli(hbeta(z; a, b, c)) with a = 0.1, b = 4, and c = 0. For each setting, we fix calibration sample size to be n = 5000.

3We say Z Logit-Normal(µ, τ 2) if σ 1(Z) N(µ, τ 2) [1]. Similar to beta calibration [28], we adopt the name logit-normal calibration" after a simple example: if Y = Bernoulli(0.5), Z | Y = i Logit-Normal(µi, τ 2 i ) for i {0, 1}, then the optimal recalibration function E[Y | Z = z] = hlogit-normal(z; a, c) for some a, c depending on µi s and τi s. 4The original Platt scaling operates on outputs of real-valued SVM outputs [40]. For probabilistic classifiers, we follow [38, 29, 22] and implement Platt scaling by first transforming probabilities onto the real line via the logit transform σ 1.

10 2 5 100 2 5 1000 2

Platt Hybrid

R R sha R cal

(a) Correct parametric assumption

10 2 5 100 2 5 1000 2

Platt Hybrid

R R sha R cal

(b) Misspecified parametric assumption

Figure 5: Risks vs. number of bins B.

Risks as functions of the number of bins B We traverse the number of bins B [10, 2000] in the log scale and compare how each method behaves as B changes. When the parametric assumption is correct, the hybrid method achives significantly lower Rcal and overall R than UMB and UWB for sufficiently large number of bins (Figure 5a), an advantage highlighted in [29]. In contrast, when the parametric assumption fails, the binning methods UMB and UWB has better performance with the optimal number of bins (Figure 5b). This is because Platt scaling and hybrid methods are intrinsicly biased when h / Hlogit-normal, as noted in Section 4.2.

Quantitative results of risks under optimal B For each setting, we fix B that achieves low recalibration risk for UWB and UMB in Figure 5. Specifically, we choose B = 2 n1/3 = 34 for the correct parametric assumption setting, and B = n1/3 = 17 for the misspecified parametric assumption setting. Then, for each setting, we compare the 90% quantiles of risks of each recalibration method fitted on 100 random replicates of calibration datasets of size n = 5000.

Table 1 quantititively verifies that Platt and the Hybrid method achieves lower Rcal and overall R if the parametric assumption is correct, and UWB and UMB achieves lower Rcal and overall R when the parametric assumption fails.

Visualization of calibration curves We fix the calibration dataset and visualize the calibration curves for all methods under the two settings. Figure 3 shows that the binning methods (UWB and UMB) closely track the optimal recalibration function h in both settings. In contrast, the hybrid approach follows the Platt scaling estimates, leading to an inherent bias from h when the parametric assumption is invalid (Figure 3b).

E.3 Comparing recalibration schemes under label shift

We consider the label shift with source distribution D(0.5) and target distribution D(πQ), where πQ varies in {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. The results where πQ > 0.5 can be inferred by symmetry and hence not experimented. We vary n P in {10, 103, 105, 107} and n Q in {10, 103, 105}.

0.0 0.2 0.4 0.6 0.8 1.0 Predicted Probability

Empirical Frequency

0.0 0.2 0.4 0.6 0.8 1.0 Predicted Probability z

Figure 6: Left: calibration curves of for COMPOSITE ˆh Q, SOURCE ˆh P , TARGET ˆhtarget Q , and LABELSHIFT ˆg. Right: the marginal density of Z = f(X) under Q.

Aside from our proposed recalibration function ˆh Q = ˆg ˆh P (17), referred to as COMPOSITE, we consider three other calibration approaches as baselines: (1) SOURCE, denoted as ˆh P , which is only calibrated on the source data, (2) LABEL-SHIFT, denoted as ˆg, which performs label shift correction without calibration, and (3) TARGET, denoted as ˆgtarget Q , which is only calibrated on the target data.

The number of bins B are chosen to be n1/3 P for COMPOSITE and SOURCE, and n1/3 Q for TARGET.

Table 2 shows the risks for different approaches with πQ = 0.1, n P = 103, and n Q = 102. In terms of Rcal, COMPOSITE performs the best, as it is calibrated to the target distribution by taking advantage of the abundant source data. In terms of Rsha, LABEL-SHIFT achieves Rsha = 0 due to the strictly increasing ˆg, but it suffers from high Rcal. COMPOSITE and SOURCE achieve smaller Rsha than TARGET, as a result of using more bins on a larger sample. Considering the combined impact of calibration and sharpness, our approach COMPOSITE attains the lowest overall recalibration risk R as well as MSE.

Figure 6 shows the optimal recalibration function h and the recalibration functions for the four approaches. It can be seen that COMPOSITE best estimates h with the highest resolution.