Self-Calibrating Conformal Prediction

Lars van der Laan (University of Washington, lvdlaan@uw.edu) and Ahmed M. Alaa (UC Berkeley and UCSF, amalaa@berkeley.edu)

Abstract. In machine learning, model calibration and predictive inference are essential for producing reliable predictions and quantifying uncertainty to support decision-making. Recognizing the complementary roles of point and interval predictions, we introduce Self-Calibrating Conformal Prediction, a method that combines Venn-Abers calibration and conformal prediction to deliver calibrated point predictions alongside prediction intervals with finite-sample validity conditional on these predictions. To achieve this, we extend the original Venn-Abers procedure from binary classification to regression. Our theoretical framework supports analyzing conformal prediction methods that involve calibrating model predictions and subsequently constructing conditionally valid prediction intervals on the same data, where the conditioning set or conformity scores may depend on the calibrated predictions. Real-data experiments show that our method improves interval efficiency through model calibration and offers a practical alternative to feature-conditional validity.

1 Introduction

Particularly in safety-critical sectors, such as healthcare, it is important to ensure that decisions inferred from machine learning models are reliable under minimal assumptions [38, 57, 11, 46, 56]. In response, there is growing interest in predictive inference methods that quantify uncertainty in model predictions via prediction intervals [44, 28]. Conformal prediction (CP) is a popular, model-agnostic, and distribution-free framework for predictive inference, which can be applied post hoc to any prediction pipeline [59, 49, 3, 34]. Given a prediction issued by a black-box model, CP outputs a prediction interval that is guaranteed to contain the unseen outcome with a user-specified probability [34]. However, a limitation of CP is that this prediction interval only provides valid coverage marginally, i.e., when averaged across all possible contexts, with "context" referring to the information available for decision-making. Constructing informative prediction intervals that offer context-conditional coverage is generally unattainable without making additional distributional assumptions [58, 35, 4]. Consequently, there has been an upsurge in research developing CP methods that offer weaker, yet practically useful, notions of conditional validity; see, e.g., [43, 31, 47, 32, 23, 21].

In prediction settings, model calibration is a desirable property of machine learning predictors that ensures that the predicted outcomes accurately reflect the true outcomes [39, 65, 66, 22]. Specifically, a predictor is calibrated for the outcome if the average outcome among individuals with identical predictions is close to their shared prediction value [25]. Such a predictor is more robust against over- or under-estimation of the outcome in the extremes of predicted values. It also has the property that the best prediction of the outcome conditional on the model's prediction is the prediction itself, which facilitates transparent decision-making [51]. There is a rich literature studying post-hoc calibration of prediction algorithms using techniques such as Platt's scaling [45, 14], histogram binning [65, 25, 26], isotonic calibration [66, 40, 51], and Venn-Abers calibration [60].
Given the roles of both point and interval predictions in decision-making, we introduce a dual calibration objective that aims to construct (i) calibrated point predictions and (ii) associated prediction intervals with valid coverage conditional on these point predictions. Marrying model calibration and predictive inference, we propose a solution to this objective that combines two post-hoc approaches, Venn-Abers calibration [61, 60] and CP [59], to simultaneously provide point predictions and prediction intervals that achieve our dual objective in finite samples. In doing so, we extend the original Venn-Abers procedure from binary classification to the regression setting. Our theoretical and experimental results support the integration of model calibration into predictive inference methods to improve interval efficiency and interpretability.

2 Problem setup

2.1 Notation

We consider a standard regression setup in which the input $X \in \mathcal{X} \subseteq \mathbb{R}^d$ corresponds to contextual information available for decision-making, and the output $Y \in \mathcal{Y} \subseteq \mathbb{R}$ is an outcome of interest. We assume that we have access to a calibration dataset $\mathcal{C}_n = \{(X_i, Y_i)\}_{i=1}^n$ comprising $n$ i.i.d. data points drawn from an unknown distribution $P := P_X P_{Y|X}$. We assume access to a black-box predictor $f : \mathcal{X} \to \mathcal{Y}$, obtained by training an ML model on a dataset that is independent of $\mathcal{C}_n$. Throughout this paper, we do not make any assumptions on the model $f$ or the distribution $P$. For a quantile level $\alpha \in (0, 1)$, we denote the "pinball" quantile loss function $\ell_\alpha$ by
$$\ell_\alpha(f(x), y) := \mathbb{1}(y \geq f(x))\,\alpha\,(y - f(x)) + \mathbb{1}(y < f(x))\,(1 - \alpha)\,(f(x) - y).$$

2.2 Conditional predictive inference and a curse of dimensionality

Let $(X_{n+1}, Y_{n+1})$ be a new data point drawn from $P$ independently of the calibration data $\mathcal{C}_n$. Our high-level aim is to develop a predictive inference algorithm that constructs a prediction interval $\widehat{C}_n(X_{n+1})$ around the point prediction issued by the black-box model, i.e., $f(X_{n+1})$. For this prediction interval to be deemed valid, it should cover the true outcome $Y_{n+1}$ with probability $1 - \alpha$. Conformal prediction (CP) is a method for predictive inference that can be applied in a post-hoc fashion to any black-box model [59]. The vanilla CP procedure issues prediction intervals that satisfy the marginal coverage condition:
$$P\big(Y_{n+1} \in \widehat{C}_n(X_{n+1})\big) \geq 1 - \alpha, \quad (1)$$
where the probability $P$ is taken with respect to the randomness in $\mathcal{C}_n$ and $(X_{n+1}, Y_{n+1})$. However, marginal coverage might lack utility in decision-making scenarios where decisions are context-dependent. A prediction band $\widehat{C}_n(x)$ achieving 95% marginal coverage may exhibit arbitrarily poor coverage for specific contexts $x$. Ideally, we would like this coverage condition to hold for each context $x \in \mathcal{X}$; i.e., the conventional notion of conditional validity requires
$$P\big(Y_{n+1} \in \widehat{C}_n(X_{n+1}) \mid X_{n+1} = x\big) \geq 1 - \alpha \quad (2)$$
for all $x \in \mathcal{X}$. However, previous work has shown that it is impossible to achieve (2) without distributional assumptions [58, 35, 4]. While context-conditional validity as in (2) is generally unachievable, it is feasible to attain weaker forms of conditional validity. Given any finite set of groups $\mathcal{G}$ and a grouping function $G : \mathcal{X} \times \mathcal{Y} \to \mathcal{G}$, Mondrian-CP offers coverage conditional on group membership, that is, $P\big(Y_{n+1} \in \widehat{C}_n(X_{n+1}) \mid G(X_{n+1}, Y_{n+1}) = g\big) \geq 1 - \alpha$ for all $g \in \mathcal{G}$ [59, 47].
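As a point of reference for the guarantees discussed above, the following minimal Python sketch (an illustration, not code from the paper) implements vanilla split conformal prediction with absolute-residual scores, which targets the marginal guarantee in (1); a Mondrian variant would compute the same quantile separately within each group $G(x, y) = g$.

```python
# Minimal sketch of split conformal prediction with absolute-residual conformity
# scores, targeting the marginal coverage guarantee in (1). Illustrative only.
import numpy as np

def split_cp_interval(f_cal, y_cal, f_new, alpha=0.1):
    """f_cal, y_cal: predictions and outcomes on the calibration set; f_new: f(X_{n+1})."""
    scores = np.abs(np.asarray(y_cal) - np.asarray(f_cal))   # S_i = |Y_i - f(X_i)|
    n = len(scores)
    # Finite-sample-corrected empirical quantile of the calibration scores.
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
    q = np.quantile(scores, level)
    return f_new - q, f_new + q                               # C_n(x) = [f(x) - q, f(x) + q]
```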
Expanding upon group- and context-conditional coverage, a multicalibration objective was introduced in [17] that seeks to satisfy, for all $h$ in an (infinite-dimensional) class $\mathcal{F}$ of weighting functions (i.e., covariate shifts), the property:
$$E\Big[ h(X_{n+1}) \big\{ (1 - \alpha) - \mathbb{1}\{Y_{n+1} \in \widehat{C}_n(X_{n+1})\} \big\} \Big] = 0. \quad (3)$$
Gibbs et al. [21] proposed a regularized CP framework for (approximately) achieving (3) that provides a means to trade off the efficiency (i.e., width) of prediction intervals against the degree of conditional coverage achieved. However, Barber et al. [4] and Gibbs et al. [21] establish the existence of a "curse of dimensionality": as the dimension of the context increases, smaller classes of weighting functions must be considered to retain the same level of efficiency. For group-conditional coverage, this curse of dimensionality manifests in the size of the subgroup class $\mathcal{G}$ [4] via its VC dimension [55]. Thus, especially in data-rich contexts, prediction intervals with meaningful multicalibration guarantees over the context space may be too wide for decision-making.

2.3 A dual calibration objective

In decision-making, both point predictions and prediction intervals play a role. For example, in scenarios with a low signal-to-noise ratio, prediction intervals may be too wide to directly inform decision-making, as their width is typically of the order of the standard deviation of the outcome. Point predictions might be used to guide decisions, while prediction intervals help quantify deviations of point predictions from unseen outcomes and assess the risk associated with these decisions. Viewing the black-box model $f$ as a scalar dimension reduction of the context $x$, a natural relaxation of the infeasible objective of context-conditional validity in (2) is prediction-conditional validity, i.e., $P(Y_{n+1} \in \widehat{C}_n(X_{n+1}) \mid f(X_{n+1})) \geq 1 - \alpha$. Prediction-conditional validity ensures that the interval widths adapt to the outputs of the model $f(\cdot)$, so that the intervals can be reliably used to quantify the deviation of model predictions from unseen outcomes. Since prediction-conditional validity only requires coverage conditional on a one-dimensional random variable, it avoids the curse of dimensionality associated with context-conditional validity. In addition, as illustrated in our experiments in Section 5 and Appendix C, when the heteroscedasticity (e.g., variance) of the outcome is a function of its conditional mean, prediction-conditional validity can closely approximate context-conditional validity, so long as the predictor estimates the conditional mean of the outcome sufficiently well.

Given the roles of both point and interval predictions in decision-making, we introduce a novel dual calibration objective, self-calibration, that aims to construct (i) calibrated point predictions and (ii) associated prediction intervals with valid coverage conditional on these point predictions. Formally, given the model $f$ and calibration data $\mathcal{C}_n \cup \{X_{n+1}\}$, our objective is to post hoc construct a calibrated point prediction $f_{n+1}(X_{n+1})$ and a compatible prediction interval $\widehat{C}_{n+1}(X_{n+1})$ centered around $f_{n+1}(X_{n+1})$ that satisfy the following desiderata:

(i) Perfectly Calibrated Point Prediction: $f_{n+1}(X_{n+1}) = E[Y_{n+1} \mid f_{n+1}(X_{n+1})]$.
(ii) Prediction-Conditional Validity: $P(Y_{n+1} \in \widehat{C}_{n+1}(X_{n+1}) \mid f_{n+1}(X_{n+1})) \geq 1 - \alpha$.

Desideratum (i) states that the point prediction $f_{n+1}(X_{n+1})$ should be perfectly calibrated, or self-consistent [19], for the true outcome $Y_{n+1}$ [37, 25, 51].
It is widely recognized that the calibration of model predictions is important for ensuring their reliability, trustworthiness, and interpretability in decision-making [37, 65, 7, 24, 15]. Desideratum (i) also improves interval efficiency by ensuring that $\widehat{C}_{n+1}(X_{n+1})$ is centered around an unbiased prediction, meaning the interval's width is driven by outcome variation rather than by prediction bias. Desideratum (ii) is a prediction interval variant of (i) that ensures the prediction interval $\widehat{C}_{n+1}(X_{n+1})$ is calibrated with respect to the model $f_{n+1}$, providing valid coverage for $Y_{n+1}$ within contexts with the same calibrated point prediction. We refer to a predictive inference algorithm simultaneously satisfying (i) and (ii) as self-calibrating, as such a procedure is automatically able to adapt to miscalibration in the model $f(\cdot)$ due to, e.g., model misspecification or distribution shifts, ensuring that the interval is constructed from a calibrated predictive model. Self-calibration can also be motivated by a decision-making scenario where point predictions determine actions and prediction intervals are used to apply these actions selectively. When point predictions are sufficient statistics for actions, self-calibration implies that the point and interval predictions are accurate, on average, within the subset of all contexts receiving the same prediction and, therefore, the same action.

3 Self-Calibrating Conformal Prediction

A key advantage of CP is that it can be applied post hoc to any black-box model $f$ without disrupting its point predictions. However, desideratum (i) introduces a perfect calibration requirement for the point predictions of $f$, thereby interfering with the underlying model specification. In this section, we introduce Self-Calibrating CP (SC-CP), a modified version of CP that is self-calibrating in that it satisfies (i) and (ii), while preserving all the favorable properties of CP, including its finite-sample validity and post-hoc applicability. Before describing our complete procedure in Section 3.3, we provide background on point calibration and propose Venn-Abers calibration for regression.

3.1 Preliminaries on point calibration

Following the framing of van der Laan et al. [51] (see also [25]), a point calibrator is a post-hoc procedure that aims to learn a transformation $\theta_n : \mathbb{R} \to \mathbb{R}$ of the black-box model $f$ such that: (1) $\theta_n(f(X_{n+1}))$ is well-calibrated for $Y_{n+1}$ in the sense of Desideratum (i); and (2) $\theta_n \circ f$ is comparably predictive to $f$. Condition (2) ensures that, in the process of achieving (1), the quality of the model $f$ is not compromised, and it excludes trivial calibrators such as $\theta_n(f(\cdot)) := \frac{1}{n}\sum_{i=1}^n Y_i$ [26]. To our knowledge, this notion of calibration traces back to Mincer and Zarnowitz (1969) [39], who introduced the idea of regressing outcomes on predictions to achieve calibration in forecasting. Commonly employed point calibrators include Platt's scaling [45, 14], histogram (or quantile) binning [65], and isotonic calibration [66, 40]. Mechanistically, these point calibrators learn $\theta_n$ by regressing the outcomes $\{Y_i\}_{i=1}^n$ on the model predictions $\{f(X_i)\}_{i=1}^n$. Importantly, however, point calibration fundamentally differs from the regression task of learning $E_P[Y \mid f(X)]$, as calibration can be achieved without smoothness assumptions, allowing for misspecification of the regression task [25].
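To make the mechanics of point calibration concrete, here is a minimal sketch that regresses calibration outcomes on model predictions with isotonic regression; scikit-learn's IsotonicRegression is used purely for illustration (the paper's implementation instead relies on xgboost with monotonicity constraints).

```python
# Minimal sketch of post-hoc point calibration: learn theta_n by isotonic
# regression of the outcomes {Y_i} on the predictions {f(X_i)}, then report
# theta_n(f(x)) as the calibrated prediction. Illustrative only.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_calibrate(f_cal, y_cal):
    """Return theta_n, a monotone piecewise-constant map from raw to calibrated predictions."""
    theta_n = IsotonicRegression(increasing=True, out_of_bounds="clip")
    theta_n.fit(np.asarray(f_cal), np.asarray(y_cal))  # regress Y on f(X) over the calibration set
    return theta_n

# Hypothetical usage: theta_n = isotonic_calibrate(f(X_cal), Y_cal);
# calibrated = theta_n.predict(f(X_test)) gives theta_n(f(X)) for new contexts.
```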
Histogram binning is a simple and distribution-free calibration procedure [25, 26] that learns $\theta_n$ via a histogram regression over a finite (outcome-agnostic) binning of the output space $f(\mathcal{X})$. Isotonic calibration is an outcome-adaptive binning method that uses isotonic regression [5] to learn $\theta_n$ by minimizing the empirical mean squared error over all univariate, piecewise constant, monotone nondecreasing transformations. Isotonic calibration is distribution-free in that it does not rely on monotonicity assumptions and, in contrast with histogram binning, it is tuning-parameter-free and naturally preserves the mean squared error of the original predictor (as the identity transform is monotone) [51]. A key limitation of histogram binning and isotonic calibration is that their calibration guarantees are only approximate, so that desideratum (i) only holds asymptotically.

3.2 Venn-Abers calibration

For binary classification, Vovk and Petej [60] proposed Venn-Abers calibration, which iterates isotonic calibration over imputations $y \in \mathcal{Y}$ of the unseen outcome $Y_{n+1}$ to provide calibrated multiprobabilistic predictions in finite samples. In this section, we generalize the Venn-Abers calibration procedure to regression, offering finite-sample calibration guarantees for non-binary outcomes. Let $\Theta_{\mathrm{iso}}$ consist of all univariate, piecewise constant functions that are monotonically nondecreasing. Our Venn-Abers calibration procedure, outlined in Alg. 1, is derived from an oracle variant of isotonic calibration that provides a perfectly calibrated point prediction in finite samples but requires knowledge of the true outcome $Y_{n+1}$. Specifically, the Venn-Abers calibration algorithm iterates over imputed outcomes $y \in \mathcal{Y}$ for $Y_{n+1}$ and applies isotonic calibration to the augmented dataset $\mathcal{C}_n \cup \{(X_{n+1}, y)\}$ to produce a set of point predictions $f_{n, X_{n+1}}(X_{n+1}) := \{f_n^{(X_{n+1}, y)}(X_{n+1}) : y \in \mathcal{Y}\}$. When the outcome space $\mathcal{Y}$ is non-discrete, Alg. 1 may be infeasible to compute exactly and can be approximated by discretizing $\mathcal{Y}$. Nonetheless, the range of the Venn-Abers multi-prediction can be feasibly computed as $[f_n^{(x, y_{\min})}(x), f_n^{(x, y_{\max})}(x)]$, where $[y_{\min}, y_{\max}] := \operatorname{range}(\mathcal{Y})$, in light of the min-max representation of isotonic regression [33]. Unlike point calibrators, Venn-Abers calibration generates a set of calibrated predictions for each context $X_{n+1}$, indexed by $y \in \mathcal{Y}$. As we demonstrate later, this set prediction is guaranteed in finite samples to include a perfectly calibrated point prediction, namely, the oracle prediction $f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})$ corresponding to the true outcome $Y_{n+1}$. Moreover, each prediction in the set, being obtained via isotonic calibration, still enjoys the same large-sample calibration guarantees as isotonic calibration [51]. By the stability of isotonic regression, as the size of the calibration set $n$ increases, the width of this set of predictions rapidly narrows, eventually converging to a single, perfectly calibrated point prediction [60]. Venn-Abers calibration thus provides a measure of epistemic uncertainty by producing a range of values for a perfectly calibrated point prediction. In cases with small sample sizes, standard isotonic calibration can overfit, leading to poorly calibrated point predictions. When this overfitting occurs, the Venn-Abers set prediction widens, reflecting greater uncertainty in the perfectly calibrated point prediction within the set [30].
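The range computation described above (imputing the unseen outcome at $y_{\min}$ and $y_{\max}$, refitting isotonic calibration on the augmented data, and evaluating both fits at the new context) can be sketched as follows; the helper name and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Venn-Abers multi-prediction range
# [f_n^{(x, y_min)}(x), f_n^{(x, y_max)}(x)] for a single new prediction f_new = f(x).
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_range(f_cal, y_cal, f_new, y_min, y_max):
    endpoints = []
    for y_imputed in (y_min, y_max):
        preds = np.append(f_cal, f_new)          # augment predictions with f(x)
        outcomes = np.append(y_cal, y_imputed)   # augment outcomes with the imputed y
        iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
        iso.fit(preds, outcomes)
        endpoints.append(iso.predict([f_new])[0])
    lo, hi = sorted(endpoints)
    return lo, hi
```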
For the binary classification case, [60] derived a (large-sample) calibrated point prediction from the Venn-Abers multi-prediction using a shrinkage approach.

Algorithm 1: Venn-Abers Calibration
Input: calibration data $\mathcal{C}_n = \{(X_i, Y_i)\}_{i=1}^n$, model $f$, context $x \in \mathcal{X}$.
1: for each $y \in \mathcal{Y}$ do
2:   Set the augmented dataset $\mathcal{C}_n^{(x,y)} := \mathcal{C}_n \cup \{(x, y)\}$;
3:   Apply isotonic calibration to $f$ using $\mathcal{C}_n^{(x,y)}$: $\theta_n^{(x,y)} := \operatorname{argmin}_{\theta \in \Theta_{\mathrm{iso}}} \sum_{i \in \mathcal{C}_n^{(x,y)}} \{Y_i - \theta \circ f(X_i)\}^2$; set $f_n^{(x,y)} := \theta_n^{(x,y)} \circ f$;
4: end for
Output: multi-prediction $\{f_n^{(x,y)}(x) : y \in \mathcal{Y}\}$.

[Figure 1: Example SC-CP output with a small calibration set $\mathcal{C}_n$ (n = 200): a calibration plot showing the outcome, original prediction, calibrated prediction, Venn-Abers multi-prediction, and prediction interval against the original (uncalibrated) prediction.]

Algorithm 2: Self-Calibrating Conformal Prediction
Input: calibration data $\mathcal{C}_n = \{(X_i, Y_i)\}_{i=1}^n$, model $f$, context $x \in \mathcal{X}$, miscoverage level $\alpha \in (0, 1)$.
1: for each $y \in \mathcal{Y}$ do
2:   Obtain the calibrated model $f_n^{(x,y)}$ by isotonic calibration of $f$ on $\mathcal{C}_n \cup \{(x, y)\}$ as in Alg. 1;
3:   Set the self-calibrating conformity scores $S_i^{(x,y)} = |Y_i - f_n^{(x,y)}(X_i)|$ for $i \in [n]$ and $S_{n+1}^{(x,y)} = |y - f_n^{(x,y)}(x)|$;
4:   Calculate the $1 - \alpha$ empirical quantile $\rho_n^{(x,y)}(x)$ of the conformity scores with the same calibrated prediction as $x$:
$$\rho_n^{(x,y)}(x) \in \operatorname*{argmin}_{q} \; \sum_{i=1}^{n} \mathbb{1}\big\{f_n^{(x,y)}(X_i) = f_n^{(x,y)}(x)\big\}\, \ell_\alpha\big(q, S_i^{(x,y)}\big) + \ell_\alpha\big(q, S_{n+1}^{(x,y)}\big);$$
5: end for
6: Set $f_{n+1}(x) := \{f_n^{(x,y)}(x) : y \in \mathcal{Y}\}$.
7: Set $\widehat{C}_{n+1}(x) := \{y \in \mathcal{Y} : |y - f_n^{(x,y)}(x)| \leq \rho_n^{(x,y)}(x)\}$.
Output: $f_{n+1}(x) \subset \operatorname{conv}(\mathcal{Y})$, $\widehat{C}_{n+1}(x) \subset \mathcal{Y}$.

We can similarly construct, for each $x \in \mathcal{X}$, a point prediction as follows:
$$\widetilde{f}_{n+1,x}(x) := f_{n+1,x}^{\mathrm{mid}}(x) + \frac{f_n^{(x, y_{\max})}(x) - f_n^{(x, y_{\min})}(x)}{y_{\max} - y_{\min}} \Big\{ \bar{y}_n - f_{n+1,x}^{\mathrm{mid}}(x) \Big\}, \quad (4)$$
where $f_{n+1,x}^{\mathrm{mid}}(x) := \frac{1}{2}\{f_n^{(x, y_{\max})}(x) + f_n^{(x, y_{\min})}(x)\}$ is the midpoint of the multi-prediction and $\bar{y}_n := \frac{1}{n}\sum_{i=1}^n Y_i$. The behavior of $\widetilde{f}_{n+1,x}(x)$ is natural; it shrinks the point prediction $f_{n+1,x}^{\mathrm{mid}}(x)$ towards the average outcome $\bar{y}_n$ (a well-calibrated prediction) in proportion to how uncertain we are about the calibration of $f_{n+1,x}^{\mathrm{mid}}(x)$. The ratio $\frac{1}{y_{\max} - y_{\min}}\big(f_n^{(x, y_{\max})}(x) - f_n^{(x, y_{\min})}(x)\big)$ measures the sensitivity of isotonic regression to the addition of a single data point to $\mathcal{C}_n$, and a value closer to 1 corresponds to a higher degree of overfitting. In the extreme case where the calibration dataset is very large, we have $f_n^{(x, y_{\max})}(x) \approx f_n^{(x, y_{\min})}(x)$, implying that $\widetilde{f}_{n+1,x}(x) \approx f_{n+1,x}^{\mathrm{mid}}(x)$. Conversely, in the opposite extreme where the calibration dataset is very small and isotonic regression overfits, we have $f_n^{(x, y_{\max})}(x) \approx y_{\max}$ and $f_n^{(x, y_{\min})}(x) \approx y_{\min}$, in which case $\widetilde{f}_{n+1,x}(x) \approx \bar{y}_n$. We could replace $\bar{y}_n$ in (4) with any reference predictor, such as one calibrated using Platt's scaling or quantile binning.

3.3 Conformalizing Venn-Abers Calibration

In this section, we propose Self-Calibrating Conformal Prediction, which conformalizes the Venn-Abers calibration procedure to provide prediction intervals centered around the Venn-Abers multi-prediction that are self-calibrated in the sense of desiderata (i) and (ii). A simple, albeit naive, strategy for achieving (i) and (ii) without finite-sample guarantees involves using the dataset $\mathcal{C}_n$ to calibrate the point predictions of $f(\cdot)$ through isotonic regression, and then constructing prediction intervals from the $1 - \alpha$ empirical quantiles of prediction errors within subgroups defined by unique values of the calibrated point predictions.
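The naive strategy just described can be sketched in a few lines; the point of the oracle construction in the next subsection is precisely that this plug-in version (shown below under the same illustrative assumptions as the earlier snippets) lacks finite-sample guarantees because the calibration data are reused to form the level sets and their quantiles.

```python
# Minimal sketch of the naive (non-conformalized) strategy: isotonic-calibrate f on C_n,
# then take the (1 - alpha) empirical quantile of absolute residuals within the group of
# calibration points sharing the same calibrated prediction. No finite-sample guarantee.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def naive_self_calibrated_interval(f_cal, y_cal, f_new, alpha=0.1):
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    cal_fitted = iso.fit_transform(np.asarray(f_cal), np.asarray(y_cal))  # calibrated predictions on C_n
    cal_new = iso.predict([f_new])[0]                  # calibrated prediction at the new context
    level_set = np.isclose(cal_fitted, cal_new)        # points with the same calibrated prediction
    scores = np.abs(np.asarray(y_cal)[level_set] - cal_fitted[level_set])
    rho = np.quantile(scores, 1 - alpha) if scores.size else np.inf
    return cal_new - rho, cal_new + rho
```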
To motivate our SC-CP algorithm, we introduce an infeasible variant of this seemingly naive procedure that is valid in finite samples, but can only be computed by an oracle that knows the unseen outcome $Y_{n+1}$. In this oracle procedure, we compute a perfectly calibrated prediction $f^*_{n+1}(X_{n+1}) := \theta^*_{n+1}(f(X_{n+1}))$ by isotonic calibration of $f$ using the oracle-augmented calibration set $\{(X_i, Y_i)\}_{i=1}^{n+1}$, where $\theta^*_{n+1} \in \operatorname{argmin}_{\theta \in \Theta_{\mathrm{iso}}} \sum_{i=1}^{n+1} \{Y_i - \theta(f(X_i))\}^2$. Next, we compute the conformity scores $S^*_i := |Y_i - f^*_{n+1}(X_i)|$ as the absolute residuals of the calibrated predictions. An oracle prediction interval is then given by $C^*_{n+1}(X_{n+1}) := f^*_{n+1}(X_{n+1}) \pm \rho^*_{n+1}(X_{n+1})$, where $\rho^*_{n+1}(X_{n+1})$ is the empirical $1 - \alpha$ quantile of conformity scores with calibrated predictions identical to that of $X_{n+1}$, that is, scores in the set $\{S^*_i : f^*_{n+1}(X_i) = f^*_{n+1}(X_{n+1}), \; i \in [n + 1]\}$. Importantly, isotonic regression, which is an outcome-adaptive histogram binning method, ensures that the calibrated model $f^*_{n+1}$ is piecewise constant, with a sufficiently large number of observations averaged within each constant segment, typically on the order of $n^{2/3}$ [16]. Consequently, the empirical quantile $\rho^*_{n+1}(X_{n+1})$ is generally stable, with relatively low variability across realizations of $\mathcal{C}_n$. In our proofs, using the first-order conditions characterizing the optimizer $\theta^*_{n+1}$ and exchangeability, we show that $f^*_{n+1}(X_{n+1}) = E[Y_{n+1} \mid f^*_{n+1}(X_{n+1})]$, so that desideratum (i) is satisfied. Furthermore, we establish that the interval $C^*_{n+1}(X_{n+1})$ achieves desideratum (ii), i.e., $P(Y_{n+1} \in C^*_{n+1}(X_{n+1}) \mid f^*_{n+1}(X_{n+1})) \geq 1 - \alpha$. To do so, our key insight is that $\rho^*_{n+1}(X_{n+1})$ corresponds to the evaluation of the function $\rho^*_{n+1}$ computed via prediction-conditional quantile regression as:
$$\rho^*_{n+1} \in \operatorname*{argmin}_{\theta \circ f^*_{n+1};\ \theta : \mathbb{R} \to \mathbb{R}} \; \sum_{i=1}^{n+1} \ell_\alpha\big(\theta \circ f^*_{n+1}(X_i),\, S^*_i\big).$$
The first-order conditions characterizing the optimizer $\rho^*_{n+1}$, combined with the exchangeability between $\mathcal{C}_n$ and $(X_{n+1}, Y_{n+1})$, can be used to show that $C^*_{n+1}(X_{n+1})$ is multi-calibrated against the class of weighting functions $\mathcal{F}_{n+1} := \{\theta \circ f^*_{n+1} : \theta : \mathbb{R} \to \mathbb{R}\}$ in the sense of (3). Using first-order conditions to establish the theoretical validity of conformal prediction was also applied by Gibbs et al. [21] to demonstrate the multi-calibration of oracle prediction intervals obtained from quantile regression over a fixed class $\mathcal{F}$. In our case, quantile regression is performed over a data-dependent function class $\mathcal{F}_{n+1}$, learned from the calibration data, which introduces additional challenges in our proofs. Our SC-CP method, which is outlined in Alg. 2, follows a similar procedure to the above oracle procedure. Since the new outcome $Y_{n+1}$ is unobserved, we instead iterate the oracle procedure over all possible imputed values $y \in \mathcal{Y}$ for $Y_{n+1}$. As in Alg. 1, this yields a set of isotonic calibrated models $f_{n, X_{n+1}} := \{f_n^{(X_{n+1}, y)} : y \in \mathcal{Y}\}$, where $f_{n+1}(X_{n+1})$ is the Venn-Abers multi-prediction of $Y_{n+1}$. Then, for each $y \in \mathcal{Y}$ and $i \in [n]$, we define the self-calibrating conformity scores $S_i^{(X_{n+1}, y)} := |Y_i - f_n^{(X_{n+1}, y)}(X_i)|$ and $S_{n+1}^{(X_{n+1}, y)} := |y - f_n^{(X_{n+1}, y)}(X_{n+1})|$, where the dependency of our scores on the imputed outcome $y \in \mathcal{Y}$ is akin to Full (or transductive) CP [59]. Our SC-CP interval is then given by $\widehat{C}_{n+1}(X_{n+1}) := \{y \in \mathcal{Y} : S_{n+1}^{(X_{n+1}, y)} \leq \rho_n^{(X_{n+1}, y)}(X_{n+1})\}$, where $\rho_n^{(X_{n+1}, y)}(X_{n+1})$ is the empirical $1 - \alpha$ quantile of the level set $\{S_i^{(X_{n+1}, y)} : f_n^{(X_{n+1}, y)}(X_i) = f_n^{(X_{n+1}, y)}(X_{n+1}), \; i \in [n + 1]\}$. By definition, $\widehat{C}_{n+1}(X_{n+1})$ covers $Y_{n+1}$ if, and only if, the oracle interval $C^*_{n+1}(X_{n+1})$ covers $Y_{n+1}$, thereby inheriting the self-calibration of $C^*_{n+1}(X_{n+1})$.
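The following sketch mirrors the SC-CP construction of Alg. 2, replacing the unknown $Y_{n+1}$ with a finite grid of imputed outcomes; the grid, the helper name, and the use of scikit-learn isotonic regression (rather than the paper's xgboost-based implementation) are illustrative assumptions, and ties in the empirical quantile are handled crudely for brevity.

```python
# Minimal sketch of SC-CP (Alg. 2) at a single context x with prediction f_new = f(x):
# for each imputed outcome y on a grid, refit isotonic calibration on C_n u {(x, y)},
# score all points, and keep y whenever its own score falls below the (1 - alpha)
# empirical quantile of scores in the level set of the new calibrated prediction.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def sc_cp(f_cal, y_cal, f_new, y_grid, alpha=0.1):
    multi_prediction, accepted = [], []
    for y_imputed in y_grid:
        preds = np.append(f_cal, f_new)                   # augmented predictions
        outs = np.append(y_cal, y_imputed)                # augmented outcomes, C_n u {(x, y)}
        iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
        fitted = iso.fit_transform(preds, outs)           # f_n^{(x, y)} evaluated on the augmented data
        cal_new = fitted[-1]                              # f_n^{(x, y)}(x)
        multi_prediction.append(cal_new)
        scores = np.abs(outs - fitted)                    # self-calibrating conformity scores
        level_set = np.isclose(fitted, cal_new)           # includes the augmented point itself
        rho = np.quantile(scores[level_set], 1 - alpha)
        if scores[-1] <= rho:                             # S_{n+1}^{(x, y)} <= rho_n^{(x, y)}(x)
            accepted.append(y_imputed)
    interval = (min(accepted), max(accepted)) if accepted else None
    return multi_prediction, interval
```

Taking the range of the accepted grid values corresponds to converting the SC-CP set into an interval, as discussed next.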
Formally, $\widehat{C}_{n+1}(X_{n+1})$ is a set, but it can be converted to an interval by taking its range, with little efficiency loss.

3.4 Computational considerations

The main computational cost of Alg. 1 and Alg. 2 is in the isotonic calibration step, executed for each $y \in \mathcal{Y}$. Isotonic regression [5] can be scalably and efficiently computed using implementations of xgboost [12] for univariate regression trees with monotonicity constraints. Similar to Full (or transductive) CP [59], Alg. 2 may be computationally infeasible for non-discrete outcomes, and it can be approximated by iterating over a finite subset of $\mathcal{Y}$. In our implementation, we iterate over a grid of $\mathcal{Y}$ and use linear interpolation to impute the threshold $\rho_n^{(x,y)}(x)$ and score $S_{n+1}^{(x,y)}$ for each $y \in \mathcal{Y}$. As with Full CP and multicalibrated CP [21], Alg. 1 and Alg. 2 must be separately applied for each context $x \in \mathcal{X}$. The algorithms depend on $x \in \mathcal{X}$ solely through its prediction $f(x)$, so we can approximate the outputs for all $x \in \mathcal{X}$ by running each algorithm for a finite number of contexts corresponding to a finite grid of the one-dimensional output space $f(\mathcal{X}) = \{f(x) : x \in \mathcal{X}\} \subseteq \mathbb{R}$. In addition, both algorithms are fully parallelizable across both the input context $x \in \mathcal{X}$ and the imputed outcome $y \in \mathcal{Y}$. In our implementation, we use nearest-neighbor interpolation in the prediction space to impute outputs for each $x \in \mathcal{X}$. In our experiments with sample sizes ranging from n = 5,000 to 40,000, quantile binning of both $f(\mathcal{X})$ and $\mathcal{Y}$ into 200 equal-frequency bins enables execution of Alg. 1 and Alg. 2 across all contexts in minutes with negligible approximation error.

4 Theoretical guarantees

In this section, under exchangeability of the data, we establish that the Venn-Abers multi-prediction $f_{n, X_{n+1}}(X_{n+1}) := \{f_n^{(X_{n+1}, y)}(X_{n+1}) : y \in \mathcal{Y}\}$ and the SC-CP interval $\widehat{C}_{n+1}(X_{n+1})$ output by Alg. 2 satisfy desiderata (i) and (ii) in finite samples and without distributional assumptions. Under an i.i.d. condition, we further establish that, asymptotically, the Venn-Abers calibration step within the SC-CP algorithm results in better point predictions and, consequently, more efficient prediction intervals. The following theorem establishes that the Venn-Abers multi-prediction is perfectly calibrated in the sense of [61], containing a perfectly calibrated point prediction of $Y_{n+1}$ in finite samples.

C1) Exchangeability: $\{(X_i, Y_i)\}_{i=1}^{n+1}$ are exchangeable.
C2) Finite second moment: $E_P[Y^2] < \infty$.

Theorem 4.1 (Perfect calibration of Venn-Abers multi-prediction). Under Conditions C1 and C2, the Venn-Abers multi-prediction $f_{n, X_{n+1}}(X_{n+1})$ almost surely satisfies the condition
$$f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1}) = E\big[Y_{n+1} \mid f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})\big].$$

Theorem 4.1 generalizes an analogous result by [60] for the special case of binary classification. Even in this special case, our proof is novel and elucidates how Venn-Abers calibration uses exchangeability with the least-squares loss in a manner analogous to how CP uses exchangeability with the quantile loss [21]. The following theorem establishes desideratum (ii) for the interval $\widehat{C}_{n+1}(X_{n+1})$ with respect to the oracle prediction $f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})$ of Theorem 4.1. In what follows, let polylog(n) be a given sequence that grows polylogarithmically in n.

C3) The conformity scores $|Y_i - f_n^{(X_{n+1}, Y_{n+1})}(X_i)|$, $i \in [n + 1]$, are almost surely distinct.
C4) The number of constant segments of $f_n^{(X_{n+1}, Y_{n+1})}$ is at most $n^{1/3} \operatorname{polylog}(n)$.

Theorem 4.2 (Self-calibration of prediction interval). Under C1, it holds almost surely that
$$P\big(Y_{n+1} \in \widehat{C}_{n+1}(X_{n+1}) \mid f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})\big) \geq 1 - \alpha.$$
If also C3 and C4 hold, then
$$E\Big[\,\Big|\,P\big(Y_{n+1} \in \widehat{C}_{n+1}(X_{n+1}) \mid f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})\big) - (1 - \alpha)\,\Big|\,\Big] \lesssim \frac{\operatorname{polylog}(n)}{n^{2/3}}.$$

Theorem 4.2 says that $\widehat{C}_{n+1}(X_{n+1})$ satisfies desideratum (ii) with coverage that is, on average, nearly exact up to a factor of $\operatorname{polylog}(n) / n^{2/3}$. Notably, the deviation from exact coverage tends to zero at a fast, dimension-free rate and, therefore, does not suffer from a "curse of dimensionality". Condition C3 is only required to establish the upper coverage bound and is standard in CP; see, e.g., [36, 21]. Although it may fail for non-continuous outcomes, this condition can be avoided by adding a small amount of noise to all outcomes [36]. The bound of $n^{1/3} \operatorname{polylog}(n)$ on the number of constant segments in C4 is motivated by the theoretical properties of isotonic regression: assuming C2 and continuous differentiability of $t \mapsto E_P[Y \mid f(X) = t]$, it is shown in [16] that the number of observations in a given constant segment of an isotonic regression solution concentrates in probability around $n^{2/3}$. In general, without C4, our proof establishes a miscoverage upper bound of $\alpha + \frac{1}{n+1} E[N_{n+1}]$, where $E[N_{n+1}]$ is the expected number of constant segments of $f_n^{(X_{n+1}, Y_{n+1})}$.

The next theorem examines the interaction between calibration and CP within SC-CP in terms of the efficiency of the self-calibrating conformity scores. In the following, let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be arbitrary. For each $\theta \in \Theta_{\mathrm{iso}}$, define the $\theta$-transformed conformity scoring function $S_\theta : (x', y') \mapsto |y' - \theta \circ f(x')|$. Let $\theta_0 := \operatorname{argmin}_{\theta \in \Theta_{\mathrm{iso}}} \int \{S_\theta(x', y')\}^2 \, dP(x', y')$ be the optimal isotonic transformation of $f(\cdot)$ that minimizes the population mean squared error. Define the self-calibrating conformity scoring function as $S_n^{(x,y)} : (x', y') \mapsto |y' - f_n^{(x,y)}(x')|$, where $f_n^{(x,y)}$ is obtained as in Alg. 1.

C5) Independent data: $\{(X_i, Y_i)\}_{i=1}^{n+1}$ are i.i.d.
C6) Bounded outcomes: $\mathcal{Y}$ is a uniformly bounded set.

Theorem 4.3. Under C5 and C6, we have $\int \{S_n^{(x,y)}(x', y') - S_{\theta_0}(x', y')\}^2 \, dP(x', y') = O_p(n^{-2/3})$.

The above theorem indicates that the self-calibrating scoring function $S_n^{(x,y)}$ used in Alg. 2 converges asymptotically in mean squared error to the oracle scoring function $S_{\theta_0}$ at a rate of $n^{-2/3}$. Since the oracle scoring function $S_{\theta_0}$ corresponds to a model $\theta_0 \circ f$ with better mean squared error than $f$, we heuristically expect that the Venn-Abers scoring function $S_n^{(x,y)}$ will translate into narrower CP intervals, at least asymptotically. We provide experimental evidence for this heuristic in Section 5.

Limitations. The perfectly calibrated prediction $f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})$, guaranteed by Theorem 4.1 to lie in the Venn-Abers multi-prediction, typically cannot be determined precisely without knowledge of $Y_{n+1}$. However, the stability of isotonic regression implies that the width of the multi-prediction $f_{n+1}(X_{n+1})$ shrinks towards zero very quickly as the size of the calibration set increases [10]. Moreover, the large-sample theory for isotonic calibration in [51] demonstrates that the $\ell_2$-calibration error of each model $f_n^{(X_{n+1}, y)}$ with $y \in \mathcal{Y}$ is $O_p(n^{-2/3})$. One caveat of SC-CP intervals is that desideratum (ii) is satisfied with respect to the unknown, oracle point prediction $f_n^{(X_{n+1}, Y_{n+1})}(X_{n+1})$. However, we know that this oracle prediction lies within the Venn-Abers multi-prediction by Theorem 4.1, and its value can be determined with high precision even with relatively small calibration sets [60]; see, e.g., Figure 1. These limitations appear to be unavoidable, as perfectly calibrated point predictions generally cannot be constructed in finite samples without oracle knowledge [61, 60].
4.1 Related work

The work of [42] proposes a regression extension of Venn-Abers calibration that differs from ours, both algorithmically and in its objective. While our extension constructs a calibrated point prediction $f(X)$ of $Y$ such that $f(X) = E[Y \mid f(X)]$, their approach uses the original Venn-Abers calibration procedure of [60] to construct a distributional prediction $f_t(X)$ of $\mathbb{1}(Y \leq t)$ that satisfies $f_t(X) = P(Y \leq t \mid f_t(X))$ for $t \in \mathcal{Y}$. The impossibility results of [25] imply that any universal procedure providing prediction-conditionally calibrated intervals must explicitly or implicitly discretize the output of the model $f(\cdot)$. The works of [31] and [29] apply Mondrian CP [59] within the leaves of a regression tree $f$ to construct prediction intervals with prediction-conditional validity. However, this approach is restricted to tree-based predictors and does not guarantee calibrated point predictions and self-calibrated intervals. Mondrian conformal predictive distributions were applied within bins of model predictions in [9] to satisfy a coarser, distributional form of prediction-conditional validity. A limitation of Mondrian-CP approaches to prediction-conditional validity is that they require pre-specification of a binning scheme for the predictor $f(\cdot)$, which introduces a trade-off between model performance and the width of prediction intervals; moreover, they do not perform point calibration (desideratum (i)) and, thereby, do not guarantee self-calibration. In contrast, SC-CP data-adaptively discretizes the predictor $f(\cdot)$ using isotonic calibration and, in doing so, provides calibrated predictions, improved conformity scores, and self-calibrated intervals. Other notions of conditional validity have been proposed that, like prediction-conditional validity and self-calibration, avoid the curse of dimensionality of context-conditional validity. In the multi-class classification setup, [50] and [18] use Mondrian CP to provide prediction intervals with valid coverage conditional on the class label (i.e., the outcome). In [8], Mondrian CP is applied within bins categorized by context-specific difficulty estimates, such as conditional variance estimates. Multivalid-CP [32, 6] offers coverage conditional on a threshold defining the prediction interval. For multi-class classification, [41] propose a procedure for attaining valid coverage conditional on the prediction set size [2].

5 Real-Data Experiments: predicting utilization of medical services

5.1 Experimental setup

In this experiment, we illustrate how prediction-conditional validity can approximate context-conditional validity when the heteroscedasticity in outcomes is strongly associated with model predictions, thereby ensuring validity across critical subgroups without their pre-specification. We analyze the Medical Expenditure Panel Survey (MEPS) dataset [1], supplied by the Agency for Healthcare Research and Quality [13], which was used in [47] for Mondrian CP with fairness applications. We use the preprocessed dataset acquired using the Python package cqr, also associated with [47]. This dataset contains n = 15,656 observations and d = 139 features, and includes information such as age, marital status, race, and poverty status, alongside medical service utilization. Our objective is to predict each individual's healthcare system utilization, represented by a score that reflects visits to doctors' offices, hospital visits, etc.
Following [47], we designate race as the sensitive attribute A, aiming for equalized coverage, where A = 0 represents non-white individuals (n0 = 9,640) and A = 1 represents white individuals (n1 = 6,016). The outcome variable Y is transformed as Y = log(1 + utilization score) to address the skewness of the raw score. In Appendix B, we present additional experimental results for the Concrete, Community, STAR, Bike, and Bio datasets used in [48] and publicly available in the Python package cqr, associated with [48] and [47]. We randomly partition the dataset into three segments: a training set (50%) for model training, a calibration set (30%) for CP, and a test set (20%) for evaluation. For training the initial model f(·), we use the xgboost [12] implementation of gradient-boosted regression trees [20], where the maximum tree depth, number of boosting rounds, and learning rate are tuned using 5-fold cross-validation. We consider two settings for training the model. In Setting A, we train the initial model on the untransformed outcomes and then transform the predictions as $\hat{y} \mapsto \log(1 + \hat{y})$, which makes the model predictive but poorly calibrated because, in light of Jensen's inequality, it overestimates the true outcomes. In Setting B, we train the initial model on the transformed outcomes, leading to fairly well-calibrated predictions. In both settings, calibration and evaluation are applied to the transformed outcomes. For direct comparison, we compare SC-CP with baselines that leverage the standard absolute residual scoring function $S(x, y) := |y - f(x)|$ and target either marginal validity or prediction-conditional validity. The baselines are: Marginal CP [34], Mondrian CP with categories defined by bins of model predictions [59, 9], CQR [48] with model predictions used as features, and the kernel-smoothed conditional CP approach of [21] with model predictions $\{f(X_i)\}_{i=1}^n$ used as features and bandwidth tuned with cross-validation. Due to the slow computing time of the implementation provided by [21], we apply Kernel on a subset of the calibration data of size $n_{\text{cal}} = 500$. SC-CP is implemented as described in Alg. 2, using isotonic regression constrained to have at least 20 observations averaged within each constant segment to mitigate overfitting (via the min_child_weight argument of xgboost). The miscoverage level is taken to be α = 0.1. SC-CP provides calibrated point predictions and self-calibrated intervals, while the Mondrian and Kernel baselines offer approximate prediction-conditional validity, and Marginal and CQR only guarantee marginal coverage. We report empirical coverage, average interval width, and the calibration error of model predictions on the test set within the sensitive attribute. Calibration error is defined as the mean error of the point predictions, $E[\widehat{Y} - Y \mid A]$, within the sensitive attribute A, which measures model over- or under-confidence. For SC-CP, we use the calibrated point predictions from (4), while the original point predictions are used for Marginal, Mondrian, and Kernel. For CQR, we use an estimate of the conditional median, obtained from a separate xgboost quantile regression model, as the point prediction. We note that, since quantiles are preserved under monotone transformations of the outcome, we expect the conditional median model of CQR to be well-calibrated, at least in a median sense, in both Setting A and Setting B.
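For illustration, the two training settings above can be sketched as follows; the xgboost hyperparameter values shown are placeholders (the paper tunes depth, rounds, and learning rate by 5-fold cross-validation), and the data-loading step is omitted.

```python
# Minimal sketch of Setting A vs. Setting B for the MEPS experiment.
import numpy as np
import xgboost as xgb

def train_settings(X_train, util_train):
    y_log = np.log1p(util_train)                      # transformed outcome Y = log(1 + utilization)

    # Setting A: fit on the raw utilization score, then transform the predictions.
    model_a = xgb.XGBRegressor(max_depth=4, n_estimators=300, learning_rate=0.05)
    model_a.fit(X_train, util_train)
    f_a = lambda X: np.log1p(np.clip(model_a.predict(X), 0.0, None))  # predictive but miscalibrated

    # Setting B: fit directly on the transformed outcome.
    model_b = xgb.XGBRegressor(max_depth=4, n_estimators=300, learning_rate=0.05)
    model_b.fit(X_train, y_log)
    return f_a, model_b.predict                       # f_b(X) = model_b.predict(X)
```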
We include the baseline Mondrian for direct comparison with SC-CP, in which Mondrian is applied with the same number of prediction bins as data-adaptively selected by SC-CP.

5.2 Results and discussion

The experimental results for each setting are depicted in Figure 2. Each panel's left-most plot showcases a calibration plot [62] for SC-CP, illustrating original and calibrated predictions alongside prediction bands. On the right, the panels display prediction bands of our baselines as a function of the original model predictions. Visually, as expected by Theorem 4.2, the SC-CP bands adapt to outcome heteroscedasticity within model predictions, while Marginal lacks adaptation, Mondrian under-adapts due to insufficient bins, and Kernel adapts but offers wider intervals for large predictions where observations are sparse. The bands of CQR appear adaptive and similar to those of SC-CP; however, they do not guarantee finite-sample prediction-conditional validity. The calibration plots reveal that the heteroscedasticity in outcomes is primarily driven by their mean, suggesting that prediction-conditional validity may approximate context-conditional validity. This heuristic is supported by the tables in Figure 2, which display empirical coverage, average interval width, and calibration error within the sensitive attribute (A) for all methods.

[Figure 2(a) panels: calibration plot for SC-CP (prediction interval, Venn-Abers multi-prediction, calibrated prediction, original prediction, outcome) and prediction bands for CQR (marginal) and Mondrian (10 bins) against the original (uncalibrated) predictions.]

| Method | Coverage (α = 0.1), A = 0 | Coverage, A = 1 | Average width, A = 0 | Average width, A = 1 | Cal. error, A = 0 | Cal. error, A = 1 | Difference |
|---|---|---|---|---|---|---|---|
| Marginal | 0.933 | 0.865 | 3.40 | 3.40 | 0.690 | 0.513 | 0.177 |
| CQR (marginal) | 0.908 | 0.881 | 2.25 | 2.68 | -0.0952 | -0.0854 | -0.0098 |
| Mondrian (5 bins) | 0.888 | 0.893 | 3.24 | 3.63 | 0.690 | 0.513 | 0.177 |
| Mondrian (10 bins) | 0.913 | 0.886 | 3.21 | 3.54 | 0.690 | 0.513 | 0.177 |
| Mondrian (83 bins) | 0.932 | 0.925 | 3.32 | 3.74 | 0.690 | 0.513 | 0.177 |
| Kernel | 0.895 | 0.913 | 3.38 | 3.93 | 0.690 | 0.513 | 0.177 |
| SC-CP | 0.902 | 0.911 | 2.20 | 2.91 | -0.0119 | 0.000931 | -0.01283 |

(a) Setting A (poorly-calibrated f(·))

[Figure 2(b) panels: as above, for Setting B.]

| Method | Coverage (α = 0.1), A = 0 | Coverage, A = 1 | Average width, A = 0 | Average width, A = 1 | Cal. error, A = 0 | Cal. error, A = 1 | Difference |
|---|---|---|---|---|---|---|---|
| Marginal | 0.918 | 0.862 | 2.82 | 2.82 | -0.0136 | -0.0427 | 0.0291 |
| CQR (marginal) | 0.908 | 0.884 | 2.28 | 2.71 | -0.108 | -0.0862 | -0.0218 |
| Mondrian (5 bins) | 0.889 | 0.885 | 2.26 | 2.79 | -0.0136 | -0.0427 | 0.0291 |
| Mondrian (10 bins) | 0.893 | 0.896 | 2.22 | 2.87 | -0.0136 | -0.0427 | 0.0291 |
| Mondrian (101 bins) | 0.910 | 0.918 | 2.56 | 3.26 | -0.0136 | -0.0427 | 0.0291 |
| Kernel | 0.893 | 0.901 | 2.26 | 2.94 | -0.0136 | -0.0427 | 0.0291 |
| SC-CP | 0.909 | 0.924 | 2.14 | 2.86 | -0.0275 | 0.0231 | -0.0506 |

(b) Setting B (well-calibrated f(·))

Figure 2: MEPS-21 dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within the sensitive subgroup.
In Setting A, the base regression model f(·) is poorly calibrated, i.e., $E[f(X) - Y \mid A]$ is not close to 0, resulting in wider intervals, overconfidence in point predictions, and decreased interpretability for the baselines, as their intervals center around biased point predictions. In contrast, being self-calibrated, SC-CP corrects the calibration error in f, achieving the smallest interval widths and well-calibrated point predictions in both settings, as guaranteed by Theorem 4.1. In both settings, the quantile regression model of CQR appears to have worse calibration than SC-CP, which may be due to the median differing from the mean because of the skewness of the outcomes. Additionally, SC-CP predictions achieve a smaller difference in calibration error between the two subgroups than Marginal, Mondrian, and Kernel, suggesting they are less discriminatory and more fair [47]. SC-CP and Kernel achieve the desired coverage level of $1 - \alpha = 0.9$ in each subgroup and setting, whereas Marginal exhibits over- or under-coverage in each subgroup. Mondrian tends to under-cover with 5 and 10 bins and only attains good coverage when using the same number of bins data-adaptively selected by SC-CP, highlighting its sensitivity to the pre-specified binning scheme. CQR attains good coverage in the A = 0 group but undercovers in the A = 1 group, which may be explained by CQR only guaranteeing marginal coverage in finite samples. Even with SC-CP having higher coverage, the intervals of SC-CP are narrower than those of Kernel and Mondrian. This provides experimental evidence that calibration improves conformity scores and translates into greater interval efficiency, as suggested by Theorem 4.3.

6 Extensions

Our theoretical techniques can be used to analyze conformal prediction methods that involve the calibration of model predictions followed by the construction of conditionally valid prediction intervals. Our analysis can be adapted to the general case where either the conformity score or the conditioning variable depends on the calibrated model prediction. While we use the absolute residual conformity score in our work, SC-CP can be applied to other conformity scores, such as the normalized absolute residual scoring function [43], allowing for the inclusion of context-specific difficulty estimates in the SC-CP procedure. Although we use Venn-Abers calibration in SC-CP, our analysis also applies to other binning calibration methods, such as Venn calibration [61, 60]. Thus, we can replace the isotonic calibration step in Alg. 1 and 2, for example, with histogram binning [25]. Additionally, a group-valid form of SC-CP can be achieved by applying Alg. 2 separately within subgroups, similar to Multivalid CP [32]. Interesting areas for future work involve integrating point calibration with conformal prediction methods for predictive models beyond regression, such as the isotonic calibration of quantile predictions in conformal quantile regression [48].

References

[1] Medical Expenditure Panel Survey, Panel 21. Accessed: May 2024.
[2] A. Angelopoulos, S. Bates, J. Malik, and M. I. Jordan. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.
[3] V. Balasubramanian, S.-S. Ho, and V. Vovk. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014.
[4] R. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference.
Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.
[5] R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
[6] O. Bastani, V. Gupta, C. Jung, G. Noarov, R. Ramalingam, and A. Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 35:29362–29373, 2022.
[7] A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramírez-Quintana. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 128–146. IGI Global, 2010.
[8] H. Boström and U. Johansson. Mondrian conformal regressors. In Conformal and Probabilistic Prediction and Applications, pages 114–133. PMLR, 2020.
[9] H. Boström, U. Johansson, and T. Löfström. Mondrian conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 24–38. PMLR, 2021.
[10] A. Caponnetto and A. Rakhlin. Stability properties of empirical risk minimization over Donsker classes. Journal of Machine Learning Research, 7(12), 2006.
[11] D. S. Char, N. H. Shah, and D. Magnus. Implementing machine learning in health care: addressing ethical challenges. The New England Journal of Medicine, 378(11):981, 2018.
[12] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.
[13] J. W. Cohen, S. B. Cohen, and J. S. Banthin. The medical expenditure panel survey: a national information resource to support healthcare cost research and inform policy and practice. Medical Care, 47(7_Supplement_1):S44–S50, 2009.
[14] D. R. Cox. Two further applications of a model for binary regression. Biometrika, 45(3/4):562–565, 1958.
[15] S. E. Davis, T. A. Lasko, G. Chen, E. D. Siew, and M. E. Matheny. Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association, 24(6):1052–1061, 2017.
[16] H. Deng, Q. Han, and C.-H. Zhang. Confidence intervals for multiple isotonic regression and other monotone models. The Annals of Statistics, 49(4):2021–2052, 2021.
[17] Z. Deng, C. Dwork, and L. Zhang. HappyMap: A generalized multi-calibration method. arXiv preprint arXiv:2303.04379, 2023.
[18] T. Ding, A. N. Angelopoulos, S. Bates, M. I. Jordan, and R. J. Tibshirani. Class-conditional conformal prediction with many classes. arXiv preprint arXiv:2306.09335, 2023.
[19] B. Flury and T. Tarpey. Self-consistency: A fundamental concept in statistics. Statistical Science, 11(3):229–243, 1996.
[20] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[21] I. Gibbs, J. J. Cherian, and E. J. Candès. Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616, 2023.
[22] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):243–268, 2007.
[23] L. Guan. Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2023.
[24] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks.
In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
[25] C. Gupta, A. Podkopaev, and A. Ramdas. Distribution-free binary classification: prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems, 33:3711–3723, 2020.
[26] C. Gupta and A. Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, pages 3942–3952. PMLR, 2021.
[27] T. Hastie and R. Tibshirani. Generalized additive models: some applications. Journal of the American Statistical Association, 82(398):371–386, 1987.
[28] T. Heskes. Practical confidence and prediction intervals. Advances in Neural Information Processing Systems, 9, 1996.
[29] U. Johansson, H. Linusson, T. Löfström, and H. Boström. Interpretable regression trees using conformal prediction. Expert Systems with Applications, 97:394–404, 2018.
[30] U. Johansson, T. Löfström, and C. Sönströd. Well-calibrated probabilistic predictive maintenance using Venn-Abers. arXiv preprint arXiv:2306.06642, 2023.
[31] U. Johansson, C. Sönströd, H. Linusson, and H. Boström. Regression trees for streaming data with local performance guarantees. In 2014 IEEE International Conference on Big Data (Big Data), pages 461–470. IEEE, 2014.
[32] C. Jung, G. Noarov, R. Ramalingam, and A. Roth. Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145, 2022.
[33] C.-I. C. Lee. The min-max algorithm and isotonic regression. The Annals of Statistics, 11(2):467–477, 1983.
[34] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
[35] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 2014.
[36] Y. Li, L. Qi, and Y. Sun. Semiparametric varying-coefficient regression analysis of recurrent events with applications to treatment switching. Statistics in Medicine, 37:3959–3974, 2018. doi: 10.1002/sim.7856. PubMed PMID: 29992591. NIHMSID: NIHMS1033642.
[37] S. Lichtenstein, B. Fischhoff, and L. D. Phillips. Calibration of probabilities: The state of the art. In Decision Making and Change in Human Affairs: Proceedings of the Fifth Research Conference on Subjective Probability, Utility, and Decision Making, Darmstadt, 1–4 September, 1975, pages 275–324. Springer, 1977.
[38] E. B. Mandinach, M. Honey, and D. Light. A theoretical framework for data-driven decision making. In Annual Meeting of the American Educational Research Association, San Francisco, CA, 2006.
[39] J. A. Mincer and V. Zarnowitz. The evaluation of economic forecasts. In Economic Forecasts and Expectations: Analysis of Forecasting Behavior and Performance, pages 3–46. NBER, 1969.
[40] A. Niculescu-Mizil and R. Caruana. Obtaining calibrated probabilities from boosting. In UAI, volume 5, pages 413–20, 2005.
[41] G. Noarov, R. Ramalingam, A. Roth, and S. Xie. High-dimensional prediction for sequential decision making. arXiv preprint arXiv:2310.17651, 2023.
[42] I. Nouretdinov, D. Volkhonskiy, P. Lim, P. Toccaceli, and A. Gammerman. Inductive Venn-Abers predictive distribution. In Conformal and Probabilistic Prediction and Applications, pages 15–36. PMLR, 2018.
[43] H. Papadopoulos, A. Gammerman, and V. Vovk. Normalized nonconformity measures for regression conformal prediction.
In Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2008), pages 64–69, 2008.
[44] J. Patel. Prediction intervals: a review. Communications in Statistics - Theory and Methods, 18(7):2393–2465, 1989.
[45] J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[46] L. Quest, A. Charrie, L. du Croo de Jongh, and S. Roy. The risks and benefits of using AI to detect crime. Harv. Bus. Rev. Digit. Artic., 8:2–5, 2018.
[47] Y. Romano, R. F. Barber, C. Sabatti, and E. Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 2(2):4, 2020.
[48] Y. Romano, E. Patterson, and E. Candès. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.
[49] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
[50] F. Shi, C. S. Ong, and C. Leckie. Applications of class-conditional conformal predictor in multi-class classification. In 2013 12th International Conference on Machine Learning and Applications, volume 1, pages 235–239. IEEE, 2013.
[51] L. van der Laan, E. Ulloa-Pérez, M. Carone, and A. Luedtke. Causal isotonic calibration for heterogeneous treatment effects. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, Honolulu, Hawaii, USA, 2023. PMLR.
[52] L. van der Laan, E. Ulloa-Pérez, M. Carone, and A. Luedtke. Causal isotonic calibration for heterogeneous treatment effects. arXiv preprint arXiv:2302.14011, 2023.
[53] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996.
[54] A. van der Vaart and J. A. Wellner. A local maximal inequality under uniform entropy. Electronic Journal of Statistics, 5(2011):192, 2011.
[55] V. Vapnik, E. Levin, and Y. LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994.
[56] J. Vazquez and J. C. Facelli. Conformal prediction in clinical medical sciences. Journal of Healthcare Informatics Research, 6(3):241–252, 2022.
[57] M. Veale, M. Van Kleek, and R. Binns. Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–14, 2018.
[58] V. Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.
[59] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World, volume 29. Springer, 2005.
[60] V. Vovk and I. Petej. Venn-Abers predictors. arXiv preprint arXiv:1211.0025, 2012.
[61] V. Vovk, G. Shafer, and I. Nouretdinov. Self-calibrating probability forecasting. Advances in Neural Information Processing Systems, 16, 2003.
[62] M. Vuk and T. Curk. ROC curve, lift chart and calibration plot. Metodoloski zvezki, 3(1):89, 2006.
[63] M. N. Wright and A. Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17, 2017.
[64] Y. Xu and S. Yadlowsky. Calibration error for heterogeneous treatment effects. In International Conference on Artificial Intelligence and Statistics, pages 9280–9303. PMLR, 2022.
[65] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers.
In ICML, volume 1, pages 609–616. Citeseer, 2001.
[66] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699, 2002.

The methods implemented in this paper are not computationally intensive and were run in a Jupyter notebook environment on a MacBook Pro with 16GB RAM and an M1 chip. A Python implementation of SC-CP is provided in the package SelfCalibratingConformal, available via pip. Code implementing SC-CP and reproducing our experiments is available in the GitHub repository SelfCalibratingConformal, which can be accessed at the following link: https://github.com/Larsvanderlaan/SelfCalibratingConformal.

B Supplementary real data experiments

B.1 Additional results

In this section, we present the experimental results for the concrete, STAR, bike, community, and bio datasets used in [48] and publicly available in the Python package cqr, associated with [48] and [47]. For the STAR dataset, the sensitive attribute A was set to "gender". For the Bike dataset, the sensitive attribute A was set to "workingday", and for Community, it was set to "race_binary". For the remaining datasets, the sensitive attribute A was set to a dichotomization of the final column in the feature matrix, as above or below its median value.

[Figure 3 panels: calibration plot for SC-CP (prediction interval, Venn-Abers multi-prediction, calibrated prediction, original prediction, and outcome, plotted against the original uncalibrated prediction) and prediction bands for CQR (marginal), Mondrian (10 bins), and the original (uncalibrated) prediction.]

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.892, 0.896 | 0.174, 0.174 | 0.00482, 0.00961 | -0.00479
CQR (marginal) | 0.892, 0.872 | 0.173, 0.177 | 0.00302, 0.00597 | -0.00295
Mondrian (5 bins) | 0.887, 0.882 | 0.171, 0.171 | 0.00482, 0.00961 | -0.00479
Mondrian (10 bins) | 0.901, 0.886 | 0.176, 0.173 | 0.00482, 0.00961 | -0.00479
Mondrian (99 bins) | 0.860, 0.872 | 0.184, 0.179 | 0.00482, 0.00961 | -0.00479
Kernel | 0.887, 0.891 | 0.170, 0.170 | 0.00482, 0.00961 | -0.00479
SC-CP | 0.905, 0.919 | 0.180, 0.178 | 0.00607, 0.00983 | -0.00376

(a) Setting A (poorly-calibrated f(·))

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.896, 0.896 | 0.175, 0.175 | 0.00338, 0.00796 | -0.00458
CQR (marginal) | 0.896, 0.910 | 0.180, 0.182 | 0.00292, 0.00454 | -0.00162
Mondrian (5 bins) | 0.892, 0.900 | 0.174, 0.174 | 0.00338, 0.00796 | -0.00458
Mondrian (10 bins) | 0.892, 0.891 | 0.178, 0.176 | 0.00338, 0.00796 | -0.00458
Mondrian (98 bins) | 0.860, 0.863 | 0.176, 0.175 | 0.00338, 0.00796 | -0.00458
Kernel | 0.892, 0.900 | 0.171, 0.171 | 0.00338, 0.00796 | -0.00458
SC-CP | 0.892, 0.905 | 0.177, 0.177 | 0.00624, 0.01000 | -0.00376

(b) Setting B (well-calibrated f(·))

Figure 3: STAR dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within sensitive subgroup.

[Figure 4 panels: calibration plot for SC-CP and prediction bands for CQR (marginal), Mondrian (10 bins), and the original (uncalibrated) prediction.]

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.909, 0.914 | 1.46, 1.46 | 0.155, 0.147 | 0.008
CQR (marginal) | 0.884, 0.894 | 1.31, 1.10 | 0.055, 0.0238 | 0.0312
Mondrian (5 bins) | 0.878, 0.923 | 1.20, 1.17 | 0.155, 0.147 | 0.008
Mondrian (10 bins) | 0.863, 0.927 | 1.15, 1.15 | 0.155, 0.147 | 0.008
Mondrian (101 bins) | 0.872, 0.935 | 1.23, 1.22 | 0.155, 0.147 | 0.008
Kernel | 0.865, 0.927 | 1.12, 1.16 | 0.155, 0.147 | 0.008
SC-CP | 0.863, 0.933 | 0.966, 0.935 | 0.00833, -0.01100 | 0.01933

(a) Setting A (poorly-calibrated f(·))

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.885, 0.923 | 1.03, 1.03 | 0.0215, 0.00754 | 0.01396
CQR (marginal) | 0.885, 0.900 | 1.31, 1.16 | 0.0328, 0.0292 | 0.0036
Mondrian (5 bins) | 0.872, 0.925 | 0.992, 0.970 | 0.0215, 0.00754 | 0.01396
Mondrian (10 bins) | 0.863, 0.929 | 0.978, 0.955 | 0.0215, 0.00754 | 0.01396
Mondrian (101 bins) | 0.883, 0.937 | 1.12, 1.07 | 0.0215, 0.00754 | 0.01396
Kernel | 0.860, 0.926 | 0.958, 0.966 | 0.0215, 0.00754 | 0.01396
SC-CP | 0.878, 0.929 | 0.995, 0.945 | 0.00797, -0.00647 | 0.01444

(b) Setting B (well-calibrated f(·))

Figure 4: Bike dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within sensitive subgroup.
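The subgroup summaries reported in these tables (coverage, average interval width, and calibration error within the sensitive groups A = 0 and A = 1) are simple empirical averages over the test split. The following is a minimal, hypothetical sketch of how such summaries could be computed; the variable names, the median dichotomization of the final feature column, and the use of a plain signed residual as the calibration-error summary are illustrative assumptions rather than the paper's exact evaluation code (the reported calibration errors use the ℓ2-calibration-error estimator of [64]).

```python
import numpy as np
import pandas as pd

def subgroup_summary(y, lower, upper, y_calibrated, sensitive):
    """Per-subgroup coverage, interval width, and a simple calibration summary.

    y            : array of test outcomes
    lower, upper : arrays of prediction-interval endpoints
    y_calibrated : array of (calibrated) point predictions
    sensitive    : binary array encoding the sensitive attribute A
    """
    rows = []
    for a in (0, 1):
        mask = sensitive == a
        coverage = np.mean((y[mask] >= lower[mask]) & (y[mask] <= upper[mask]))
        width = np.mean(upper[mask] - lower[mask])
        # Signed average residual of the calibrated prediction within the subgroup.
        # This is only an illustrative stand-in for the ell_2-calibration-error
        # estimator of [64] used in the reported tables.
        cal_error = np.mean(y[mask] - y_calibrated[mask])
        rows.append({"A": a, "coverage": coverage, "width": width, "cal_error": cal_error})
    out = pd.DataFrame(rows)
    out["cal_error_difference"] = out["cal_error"].iloc[0] - out["cal_error"].iloc[1]
    return out

# Hypothetical usage: dichotomize the final feature column at its median to form A.
# A = (X_test[:, -1] > np.median(X_test[:, -1])).astype(int)
# print(subgroup_summary(y_test, lo, hi, y_hat_cal, A))
```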
[Figure 5 panels: calibration plot for SC-CP and prediction bands for CQR (marginal), Mondrian (10 bins), and the original (uncalibrated) prediction.]

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.870, 0.949 | 0.321, 0.321 | -0.00871, 0.0126 | -0.02131
CQR (marginal) | 0.919, 0.943 | 0.366, 0.245 | -0.0181, -0.00647 | -0.01163
Mondrian (5 bins) | 0.825, 0.847 | 0.275, 0.186 | -0.00871, 0.0126 | -0.02131
Mondrian (10 bins) | 0.928, 0.864 | 0.372, 0.204 | -0.00871, 0.0126 | -0.02131
Mondrian (98 bins) | 0.843, 0.852 | 0.340, 0.219 | -0.00871, 0.0126 | -0.02131
Kernel | 0.901, 0.886 | 0.337, 0.189 | -0.00871, 0.0126 | -0.02131
SC-CP | 0.901, 0.892 | 0.362, 0.187 | -0.01090, 0.00859 | -0.01949

(a) Setting A (poorly-calibrated f(·))

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.861, 0.949 | 0.321, 0.321 | -0.0153, 0.00739 | -0.02269
CQR (marginal) | 0.888, 0.915 | 0.333, 0.227 | -0.0346, -0.00193 | -0.03267
Mondrian (5 bins) | 0.852, 0.875 | 0.293, 0.188 | -0.0153, 0.00739 | -0.02269
Mondrian (10 bins) | 0.906, 0.903 | 0.367, 0.201 | -0.0153, 0.00739 | -0.02269
Mondrian (97 bins) | 0.821, 0.858 | 0.366, 0.214 | -0.0153, 0.00739 | -0.02269
Kernel | 0.888, 0.881 | 0.347, 0.185 | -0.0153, 0.00739 | -0.02269
SC-CP | 0.901, 0.892 | 0.359, 0.191 | -0.00977, 0.00804 | -0.01781

(b) Setting B (well-calibrated f(·))

Figure 5: Community dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within sensitive subgroup.

[Figure 6 panels: calibration plot for SC-CP and prediction bands for CQR (marginal), Mondrian (10 bins), and the original (uncalibrated) prediction.]

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.839, 0.903 | 0.529, 0.529 | -0.0120, -0.00892 | -0.00308
CQR (marginal) | 0.887, 0.861 | 0.918, 0.629 | -0.0143, 0.00432 | -0.01862
Mondrian (5 bins) | 0.968, 0.889 | 0.816, 0.551 | -0.0120, -0.00892 | -0.00308
Mondrian (10 bins) | 0.952, 0.910 | 0.783, 0.601 | -0.0120, -0.00892 | -0.00308
Mondrian (87 bins) | 0.774, 0.806 | 0.652, 0.481 | -0.0120, -0.00892 | -0.00308
Kernel | 0.952, 0.917 | 0.791, 0.532 | -0.0120, -0.00892 | -0.00308
SC-CP | 0.887, 0.861 | 0.879, 0.563 | -0.0356, -0.0462 | 0.0106

(a) Setting A (poorly-calibrated f(·))

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.839, 0.903 | 0.525, 0.525 | -0.0540, -0.0150 | -0.0390
CQR (marginal) | 0.871, 0.889 | 0.901, 0.617 | -0.00706, 0.0119 | -0.01896
Mondrian (5 bins) | 0.935, 0.903 | 0.698, 0.534 | -0.0540, -0.0150 | -0.0390
Mondrian (10 bins) | 0.935, 0.903 | 0.699, 0.504 | -0.0540, -0.0150 | -0.0390
Mondrian (86 bins) | 0.726, 0.792 | 0.602, 0.453 | -0.0540, -0.0150 | -0.0390
Kernel | 0.952, 0.875 | 0.695, 0.482 | -0.0540, -0.0150 | -0.0390
SC-CP | 0.935, 0.903 | 0.859, 0.545 | -0.0587, -0.0265 | -0.0322

(b) Setting B (well-calibrated f(·))

Figure 6: Concrete dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within sensitive subgroup.

[Figure 7 panels: calibration plot for SC-CP and prediction bands for CQR (marginal), Mondrian (10 bins), and the original (uncalibrated) prediction.]

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.907, 0.904 | 1.81, 1.81 | 0.168, 0.153 | 0.015
CQR (marginal) | 0.895, 0.897 | 1.64, 1.73 | -0.0223, -0.0062 | -0.0161
Mondrian (5 bins) | 0.908, 0.915 | 1.73, 1.84 | 0.168, 0.153 | 0.015
Mondrian (10 bins) | 0.907, 0.910 | 1.69, 1.79 | 0.168, 0.153 | 0.015
Mondrian (101 bins) | 0.903, 0.914 | 1.62, 1.79 | 0.168, 0.153 | 0.015
Kernel | 0.892, 0.901 | 1.52, 1.67 | 0.168, 0.153 | 0.015
SC-CP | 0.882, 0.908 | 1.36, 1.56 | -0.0191, 0.003 | -0.0221

(a) Setting A (poorly-calibrated f(·))

Method | Coverage (α = 0.1): A=0, A=1 | Average Width: A=0, A=1 | Cal. Error: A=0, A=1 | Difference
Marginal | 0.898, 0.904 | 1.67, 1.67 | 0.00227, -0.0191 | 0.02137
CQR (marginal) | 0.891, 0.889 | 1.66, 1.70 | -0.0209, -0.0124 | -0.0085
Mondrian (5 bins) | 0.902, 0.922 | 1.58, 1.68 | 0.00227, -0.0191 | 0.02137
Mondrian (10 bins) | 0.893, 0.923 | 1.49, 1.59 | 0.00227, -0.0191 | 0.02137
Mondrian (101 bins) | 0.894, 0.925 | 1.49, 1.61 | 0.00227, -0.0191 | 0.02137
Kernel | 0.880, 0.895 | 1.37, 1.48 | 0.00227, -0.0191 | 0.02137
SC-CP | 0.893, 0.916 | 1.38, 1.52 | -0.0174, 0.00773 | -0.02513

(b) Setting B (well-calibrated f(·))

Figure 7: Bio dataset: Calibration plot for SC-CP, prediction bands for SC-CP and baselines, and empirical coverage, width, and calibration error within sensitive subgroup.

C Supplementary synthetic data experiments

C.1 Experimental setup

In this appendix, we perform additional synthetic experiments to evaluate the prediction-conditional validity of our method and how it translates to approximate context-conditional validity in certain cases.

Synthetic datasets. We construct synthetic training, calibration, and test datasets Dtrain, Dcal, Dtest of sizes ntrain, ncal, ntest, which are respectively used to train f, apply CP, and evaluate performance. For parameters d ∈ N, κ > 0, a ≥ 0, b ≥ 0, each dataset consists of i.i.d. observations of (X, Y) drawn as follows. The covariate vector X := (X1, . . . , Xd) ∈ [0, 1]^d has coordinates drawn independently from a Beta(1, κ) distribution with shape parameter κ.
Then, conditionally on X = x, the outcome Y is drawn from a normal distribution with conditional mean µ(x) := d^{-1/2} Σ_{j=1}^{d} {x_j + sin(4x_j)} and conditional variance σ²(x) := {0.035 − a log(0.5 + 0.5x₁)/8 + b |µ(x)|⁶/20 − 0.02/2}². Here, a and b control the heteroscedasticity and the mean-variance relationship of the outcomes. For Dcal and Dtest, we set κcal = κtest = 1 and, for Dtrain, we vary κtrain to introduce distribution shift and, thereby, calibration error in f. The parameters d, a, and b are fixed across the datasets.

Baselines. To mitigate overfitting, we implement SC-CP so that each function in Θiso averages at least 20 observations within each constant segment (implemented using the minimum leaf node size argument of xgboost). When appropriate, we consider the following baseline CP algorithms for comparison. Unless stated otherwise, for all baselines, we use the scoring function S(x, y) := |y − f(x)|. The first baseline, uncond-CP, is split-CP [36], which provides only unconditional coverage guarantees. The second baseline, cond-CP, is adapted from [21] and provides conditional coverage over distribution shifts within a specified reproducing kernel Hilbert space. Following Section 5.1 of [21], we use the Gaussian kernel K(X_i, X_j) := exp(−4 ‖X_i − X_j‖²₂), with ‖·‖₂ the Euclidean norm, and select the regularization parameter λ using 5-fold cross-validation. The third baseline, Mondrian-CP, applies the Mondrian CP method [59, 8] to categories formed by dividing f's predictions into 20 equal-frequency bins based on Dcal. As an optimal benchmark, we consider the oracle satisfying (2).

C.2 Experiment 1: Calibration and efficiency

In this experiment, we illustrate how calibration of the predictor f can improve the efficiency (i.e., width) of the resulting prediction intervals. We consider the data-generating process of the previous section, with ntrain = ncal = ntest = 1000, d = 5, a = 0, and b = 0.6. The predictor f is trained on Dtrain using the ranger [63] implementation of random forests with default settings. To control the calibration error in f, we vary the distribution shift parameter κtrain for Dtrain over {1, 1.5, 2, 2.5, 3}.

Results. Figure 8a compares the average interval width across Dtest for SC-CP and baselines as the ℓ2-calibration error in f increases. Here, we estimate the calibration error using the approach of [64]. As the calibration error increases, the average interval width for SC-CP appears smaller than those of uncond-CP, Mondrian-CP, and cond-CP, especially in comparison to uncond-CP. The observed efficiency gains of SC-CP are consistent with Theorem 4.3 and provide empirical evidence that self-calibrated conformity scores translate to tighter prediction intervals, given sufficient data. To test this hypothesis under controlled conditions, we compare the widths of prediction intervals obtained using vanilla (unconditional) CP for two scoring functions: S : (x, y) ↦ |y − f(x)| and the Venn-Abers (worst-case) scoring function Scal : (x, y) ↦ max_{y′ ∈ Y} |y − f_n^{(x,y′)}(x)|. For miscoverage levels α ∈ {0.05, 0.1, 0.2}, the left panel of Figure 8b illustrates the relative efficiency gain achieved by using Scal, which we define as the ratio of the average interval width for Scal relative to that for S. The widths and calibration errors in Figure 8b are averaged across 100 data replicates.

Role of calibration set size. With too small calibration sets, the isotonic calibration step in Alg. 2 can lead to overfitting.
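As noted here and elaborated below, this overfitting is controlled by requiring every constant segment of the isotonic calibrator to contain at least 20 calibration points, via xgboost's minimum leaf-size argument. The following sketch shows one way such a monotone, piecewise-constant calibrator with a minimum segment size could be fit. It is an illustrative approximation under stated assumptions (a single monotone-constrained tree standing in for exact isotonic regression, with hypothetical parameter choices), not the implementation in the SelfCalibratingConformal package.

```python
import numpy as np
import xgboost as xgb

def fit_isotonic_calibrator(preds, y, min_leaf=20):
    """Fit a monotone, piecewise-constant calibrator theta mapping f(X) to E[Y | f(X)].

    A single boosted tree with a monotone-increasing constraint behaves like an
    isotonic regression tree; with squared-error loss each observation has unit
    hessian, so min_child_weight enforces at least `min_leaf` calibration points
    per constant segment, mirroring the overfitting control described above.
    """
    model = xgb.XGBRegressor(
        n_estimators=1,             # one tree: a piecewise-constant fit
        max_depth=12,               # deep enough to let the data choose the segments
        learning_rate=1.0,          # no shrinkage, so leaf values are plain averages
        min_child_weight=min_leaf,  # minimum observations per leaf/segment
        monotone_constraints="(1)", # nondecreasing in the original prediction
        base_score=float(np.mean(y)),
        objective="reg:squarederror",
    )
    model.fit(np.asarray(preds).reshape(-1, 1), np.asarray(y))
    return lambda new_preds: model.predict(np.asarray(new_preds).reshape(-1, 1))

# Hypothetical usage on a calibration set, with f_cal = f(X_cal):
# theta_hat = fit_isotonic_calibrator(f_cal, y_cal, min_leaf=20)
# calibrated_preds = theta_hat(f_test)
```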
In such cases, the self-calibrated conformity scores could be larger than their uncalibrated counterparts, potentially resulting in less efficient prediction intervals. In our experiments, overfitting is mitigated by constraining the minimum size of each leaf node in the isotonic regression tree to 20. For α ∈ {0.05, 0.1, 0.2}, the right panel of Figure 8b displays the relationship between ncal and the relative efficiency gain achieved by using Scal, holding the calibration error fixed (κtrain = 3). We find that calibration leads to a noticeable reduction in interval width as soon as ncal ≥ 50.

[Figure 8 panels: (a) average interval width (α = 0.1) versus the calibration error in f(·) for SC-CP (ours), Mondrian-CP, and the other baselines; (b) relative interval width (calibrated/uncalibrated) versus the calibration error in f and versus the calibration set size (30, 100, 300).]

Figure 8: Figure 8a shows the average interval widths for varying ℓ2-calibration errors in f. Figure 8b shows the relative change in average interval width using Venn-Abers calibrated versus uncalibrated predictions, with varying ℓ2-calibration error in f (left) and calibration set size (right). Values below the horizontal line signify efficiency gains for SC-CP.

C.3 Experiment 2: Coverage and Adaptivity

In this experiment, we illustrate how self-calibration of Ĉ_{n+1}(X_{n+1}) can, in some cases, translate to stronger conditional coverage guarantees. Here we take ntrain = ncal = 1000 and ntest = 2500, and no distribution shift (κtrain = 1) in Dtrain. We consider three setups: Setup A (d = 5, a = 0, b = 0.6) and Setup B (d = 20, a = 0, b = 0.6), which have a strong mean-variance relationship in the outcome process; and Setup C (d = 5, a = 0.6, b = 0), which has no such relationship. We obtain the predictor f from Dtrain using a generalized additive model [27], so that it is well calibrated and accurately estimates µ. To assess the adaptivity of SC-CP to heteroscedasticity, we report the coverage and the average interval width within subgroups of Dtest defined by quintiles of the conditional variances {σ²(X_i) : (X_i, Y_i) ∈ Dtest}.

[Figure 9 panels: (a) coverage (α = 0.1) and interval width within quintiles Q.1–Q.5 of σ²(X) and marginally, for Setup A (d = 5), Setup B (d = 20), and Setup C (adversarial, d = 5), comparing SC-CP (ours), cond-CP, uncond-CP, Mondrian-CP, and the oracle; (b) prediction bands of SC-CP, cond-CP, uncond-CP, and the oracle for a univariate example (covariate X versus outcome Y) at calibration set sizes ncal ∈ {50, 100, 300, 1000}.]

Figure 9: Figure 9a displays the coverage and average interval width of SC-CP, marginally and within quintiles of the conditional outcome variance, for Setups A, B, and C. For a d = 1 example, Figure 9b shows the adaptivity of SC-CP prediction bands (α = 0.1) across various calibration set sizes.

Results. The top two panels in Figure 9a display the coverage and average interval width results for Setup A and Setup B. In both setups, SC-CP, cond-CP, and Mondrian-CP exhibit satisfactory coverage both marginally and within the quintile subgroups. In contrast, while uncond-CP attains the nominal level of marginal coverage, it exhibits noticeable overcoverage within the first three quintiles and significant undercoverage within the fifth quintile.
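The quintile summaries in Figure 9a are empirical coverage rates and average widths within bins of the known conditional variance σ²(X) on the test set. A minimal sketch of this evaluation is given below; the function and variable names are hypothetical, and the σ²(X_i) values are taken as given (available here because the data are synthetic).

```python
import numpy as np
import pandas as pd

def coverage_by_variance_quintile(y, lower, upper, sigma2):
    """Empirical coverage and mean interval width within quintiles of sigma^2(X).

    y            : test outcomes
    lower, upper : prediction-interval endpoints on the test set
    sigma2       : conditional variances sigma^2(X_i), known for synthetic data
    """
    covered = (y >= lower) & (y <= upper)
    width = upper - lower
    # Assign each test point to a quintile of the conditional variance (Q.1-Q.5).
    quintile = pd.qcut(sigma2, q=5, labels=[f"Q.{k}" for k in range(1, 6)])
    df = pd.DataFrame({"quintile": quintile, "covered": covered, "width": width})
    by_bin = df.groupby("quintile", observed=True).agg(
        coverage=("covered", "mean"), avg_width=("width", "mean")
    )
    marginal = pd.DataFrame(
        {"coverage": [covered.mean()], "avg_width": [width.mean()]}, index=["Marginal"]
    )
    return pd.concat([by_bin, marginal])

# Hypothetical usage for one method:
# print(coverage_by_variance_quintile(y_test, lo_sccp, hi_sccp, sigma2_test))
```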
The satisfactory performance of SC-CP with respect to heteroscedasticity in Setups A and B can be attributed to the strong mean-variance relationship in the outcome process. Regarding efficiency, the average interval widths of SC-CP are competitive with those of prediction-binned Mondrian-CP and the oracle intervals. The interval widths for cond-CP are wider than those for SC-CP and Mondrian-CP, especially for Setup B. This difference can be explained by cond-CP aiming for conditional validity in a 5D and 20D covariate space, whereas SC-CP and Mondrian-CP target the 1D output space.

Limitation. If there is no mean-variance relationship in the outcomes, SC-CP is generally not expected to adapt to heteroscedasticity. The third (bottom) panel of Figure 9a displays SC-CP's performance in Setup C, where there is no such relationship. In this scenario, it is evident that the conditional coverage of both uncond-CP and SC-CP is poor, while cond-CP maintains adequate coverage.

Adaptivity and calibration set size. SC-CP can be derived by applying CP within subgroups defined by a data-dependent binning of the output space f(X), learned via Venn-Abers calibration, where the number of bins can grow with ncal. In particular, if t ↦ E[Y | f(X) = t] is asymptotically monotone, which is plausible when f consistently estimates the outcome regression, then the number of bins selected by isotonic calibration will generally increase with ncal, and the width of these bins will tend to zero. As a consequence, in such cases, the self-calibration result in Theorem 4.2 translates to conditional guarantees over finer partitions of f(X) as ncal increases. For d = 1 and a strong mean-variance relationship (a = 0, b = 0.6), Figure 9b demonstrates how the adaptivity of SC-CP bands improves as ncal increases. In this case, for ncal sufficiently large, we find that the SC-CP bands closely match those of cond-CP and the oracle.

[Figure 10 panel: marginal coverage (α = 0.1) versus the calibration error in f(·) for SC-CP (ours) and Mondrian-CP.]

Figure 10: Figure 10a shows the marginal coverage corresponding to the interval widths of Figure 8a for varying ℓ2-calibration errors in f.

Proof of Theorem 4.1. It can be verified that an isotonic calibrator class Θiso satisfies the following properties: (a) it consists of univariate regression trees (i.e., piecewise constant functions) that are monotonically nondecreasing; (b) for all elements θ ∈ Θiso and transformations g : R → R, it holds that g ∘ θ ∈ Θiso. Property (a) holds by definition. Property (b) is satisfied since constraining the maximum number of constant segments by k(n) does not constrain the possible values that the function may take in a given constant region. We note that the result of this theorem holds in general for any function class Θiso that satisfies (a) and (b). Recall from Alg. 1 that f_n^{(X_{n+1},Y_{n+1})} = θ_n^{(X_{n+1},Y_{n+1})} ∘ f, where θ_n^{(X_{n+1},Y_{n+1})} ∈ Θiso is an isotonic calibrator. By definition, the isotonic calibrator class Θiso satisfies the invariance property that, for all g : R → R, the inclusion θ ∈ Θiso implies g ∘ θ ∈ Θiso. Hence, for all g : R → R and ε > 0, it also holds that (1 + εg) ∘ θ_n^{(X_{n+1},Y_{n+1})} lies in Θiso, where (1 + εg) denotes the map t ↦ t + εg(t). Now, we use that (1 + εg) ∘ θ_n^{(X_{n+1},Y_{n+1})} ∘ f = (1 + εg) ∘ f_n^{(X_{n+1},Y_{n+1})} and that f_n^{(X_{n+1},Y_{n+1})} is an empirical risk minimizer over the class {g ∘ f : g ∈ Θiso}.
Using these two observations, the first-order derivative equations characterizing the empirical risk minimizer f_n^{(X_{n+1},Y_{n+1})} imply, for all g : R → R, that

0 = (d/dε) [ (1/(n+1)) Σ_{i=1}^{n+1} {Y_i − ((1 + εg) ∘ f_n^{(X_{n+1},Y_{n+1})})(X_i)}² ] |_{ε=0}
  = −(2/(n+1)) Σ_{i=1}^{n+1} (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_i) {Y_i − f_n^{(X_{n+1},Y_{n+1})}(X_i)},

so that

(1/(n+1)) Σ_{i=1}^{n+1} (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_i) {Y_i − f_n^{(X_{n+1},Y_{n+1})}(X_i)} = 0. (5)

Taking expectations of both sides of the above display, we conclude that

(1/(n+1)) Σ_{i=1}^{n+1} E[ (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_i) {Y_i − f_n^{(X_{n+1},Y_{n+1})}(X_i)} ] = 0. (6)

We now use the fact that {(X_i, Y_i, f_n^{(X_{n+1},Y_{n+1})}(X_i)) : i ∈ [n + 1]} are exchangeable, since {(X_i, Y_i)}_{i=1}^{n+1} are exchangeable by C1 and the function f_n^{(X_{n+1},Y_{n+1})} is invariant under permutations of {(X_i, Y_i)}_{i=1}^{n+1}. Consequently, Equation (6) remains true if we replace each (X_i, Y_i, f_n^{(X_{n+1},Y_{n+1})}(X_i)) with i ∈ [n] by (X_{n+1}, Y_{n+1}, f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})). That is,

E[ (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_{n+1}) {Y_{n+1} − f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})} ]
  = (1/(n+1)) Σ_{i=1}^{n+1} E[ (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_{n+1}) {Y_{n+1} − f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})} ]
  = (1/(n+1)) Σ_{i=1}^{n+1} E[ (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_i) {Y_i − f_n^{(X_{n+1},Y_{n+1})}(X_i)} ] = 0.

By the law of iterated conditional expectations, the preceding display further implies

E[ (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_{n+1}) {E[Y_{n+1} | f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})] − f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})} ] = 0.

Taking g : R → R to be defined by (g ∘ f_n^{(X_{n+1},Y_{n+1})})(X_{n+1}) := E[Y_{n+1} | f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})] − f_n^{(X_{n+1},Y_{n+1})}(X_{n+1}), we find

E[ {E[Y_{n+1} | f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})] − f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})}² ] = 0.

The above equality implies E[Y_{n+1} | f_n^{(X_{n+1},Y_{n+1})}(X_{n+1})] = f_n^{(X_{n+1},Y_{n+1})}(X_{n+1}) almost surely, as desired.

Proof of Theorem 4.2. Recall, for a quantile level α ∈ (0, 1), that the "pinball" quantile loss function ℓα is given by ℓα(f(x), s) := α(s − f(x)) if s ≥ f(x), and (1 − α)(f(x) − s) if s < f(x). As established in [21], each subgradient of ℓα(·, s) at f in the direction g is, for some β ∈ [α − 1, α], given by:

∂_ε^{[β]} {ℓα(f(x) + εg(x), s)} |_{ε=0} := 1(f(x) ≠ s) g(x) {α − 1(f(x) < s)} + 1(f(x) = s) β g(x).

For (x, y) ∈ X × Y, define the empirical risk minimizer:

ρ_n^{(x,y)} ∈ argmin_{θ ∘ f_n^{(x,y)} : θ : R → R} [ Σ_{i=1}^{n} ℓα(θ ∘ f_n^{(x,y)}(X_i), S_i^{(x,y)}) + ℓα(θ ∘ f_n^{(x,y)}(x), S_{n+1}^{(x,y)}) ].

Then, since the isotonic-calibrated predictor f_n^{(x,y)} is piecewise constant and the above optimization problem is unconstrained in the map θ : R → R, the evaluation ρ_n^{(x,y)}(x) lies in the solution set

argmin_q [ Σ_{i=1}^{n} K_i(f_n^{(x,y)}, x) ℓα(q, S_i^{(x,y)}) + ℓα(q, S_{n+1}^{(x,y)}) ], where K_i(f_n^{(x,y)}, x) := 1{f_n^{(x,y)}(X_i) = f_n^{(x,y)}(x)}.

Consequently, we see that the empirical quantile ρ_n^{(x,y)}(x) defined in Alg. 2 coincides with the evaluation of the empirical risk minimizer ρ_n^{(x,y)}(·) at x, as suggested by our notation. We will now theoretically analyze Alg. 2 by studying the empirical risk minimizer x′ ↦ ρ_n^{(X_{n+1},Y_{n+1})}(x′). To do so, we modify the arguments used to establish Theorem 2 of [21]. As in [21], we begin by examining the first-order equations of the convex optimization problem defining x′ ↦ ρ_n^{(X_{n+1},Y_{n+1})}(x′). To lighten notation, for the remainder of this proof we abbreviate f_n := f_n^{(X_{n+1},Y_{n+1})}, ρ_n := ρ_n^{(X_{n+1},Y_{n+1})}, and S_i := S_i^{(X_{n+1},Y_{n+1})}.

Studying the first-order equations of the convex problem: Given any transformation θ : R → R, each subgradient of the map ε ↦ ℓα(ρ_n(X_i) + ε (θ ∘ f_n)(X_i), S_i) is, for some β ∈ [α − 1, α]^{n+1}, of the following form:

∂_ε^{[β]} {ℓα(ρ_n(X_i) + ε (θ ∘ f_n)(X_i), S_i)} |_{ε=0} = (θ ∘ f_n)(X_i) {α − 1(ρ_n(X_i) < S_i)} if S_i ≠ ρ_n(X_i), and β_i (θ ∘ f_n)(X_i) if S_i = ρ_n(X_i).

Now, since ρ_n is an empirical risk minimizer of the quantile loss over the class {θ ∘ f_n : θ : R → R}, there exists some vector β* = (β*_1, . . .
. . . , β*_{n+1}) ∈ [α − 1, α]^{n+1} such that

0 = (1/(n + 1)) [ Σ_{i=1}^{n+1} 1{S_i ≠ ρ_n(X_i)} (θ ∘ f_n)(X_i) {α − 1(ρ_n(X_i) < S_i)} + Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} β*_i (θ ∘ f_n)(X_i) ].

The above display can be rewritten as:

(1/(n + 1)) Σ_{i=1}^{n+1} (θ ∘ f_n)(X_i) {α − 1(ρ_n(X_i) < S_i)} = (1/(n + 1)) Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} (1 − β*_i) (θ ∘ f_n)(X_i). (7)

Now, observe that the collection of random variables {(f_n(X_i), S_i, ρ_n(X_i)) : i ∈ [n + 1]} is exchangeable, since {(X_i, Y_i) : i ∈ [n + 1]} are exchangeable by C1 and the functions f_n(·) and ρ_n(·) are unchanged under permutations of the training data {(X_i, Y_i)}_{i ∈ [n+1]}. Thus, the expectation of the left-hand side of Equation (7) can be expressed as:

E[ (1/(n + 1)) Σ_{i=1}^{n+1} (θ ∘ f_n)(X_i) {α − 1(ρ_n(X_i) < S_i)} ]
  = (1/(n + 1)) Σ_{i=1}^{n+1} E[ (θ ∘ f_n)(X_i) {α − 1(ρ_n(X_i) < S_i)} ]
  = E[ (θ ∘ f_n)(X_{n+1}) {α − 1(ρ_n(X_{n+1}) < S_{n+1})} ],

where the final equality follows from exchangeability. Combining this with Equation (7), we find

E[ (θ ∘ f_n)(X_{n+1}) {α − 1(ρ_n(X_{n+1}) < S_{n+1})} ] (8)
  = E[ (1/(n + 1)) Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} (1 − β*_i) (θ ∘ f_n)(X_i) ]. (9)

Lower bound on coverage: We first obtain the lower coverage bound in the theorem statement. Note, for any nonnegative θ : R → R, that (9) implies:

E[ (θ ∘ f_n)(X_{n+1}) {α − 1(ρ_n(X_{n+1}) < S_{n+1})} ] ≥ 0.

This inequality holds since (1 − β*_i) ≥ 0 and (θ ∘ f_n)(X_i) ≥ 0 almost surely, for each i ∈ [n + 1]. By the law of iterated expectations, we then have

E[ (θ ∘ f_n)(X_{n+1}) {α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}))} ] ≥ 0.

Taking θ : R → R as a nonnegative map that almost surely satisfies (θ ∘ f_n)(X_{n+1}) = 1{α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1})) ≤ 0}, we obtain

−E[ {α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}))}_− ] ≥ 0,

where the map t ↦ {t}_− := |t| 1(t ≤ 0) extracts the negative part of its input. Multiplying both sides of the previous inequality by −1, we obtain

0 ≥ E[ {α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}))}_− ].

We conclude that the negative part of {α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}))} is almost surely zero. Thus, it must be almost surely true that

α ≥ P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1})).

Note that the event Y_{n+1} ∈ Ĉ_{n+1}(X_{n+1}), with Ĉ_{n+1}(X_{n+1}) = {y ∈ Y : S_{n+1}^{(X_{n+1},y)} ≤ ρ_n^{(X_{n+1},y)}(X_{n+1})}, occurs if, and only if, ρ_n(X_{n+1}) ≥ S_{n+1}. As a result, we obtain the desired lower coverage bound:

1 − α ≤ P(Y_{n+1} ∈ Ĉ_{n+1}(X_{n+1}) | f_n(X_{n+1})).

Deviation bound for the coverage: We now bound the deviation of the coverage of SC-CP from the lower bound. Note, for any θ : R → R, that (9) implies:

E[ (θ ∘ f_n)(X_{n+1}) {α − 1(ρ_n(X_{n+1}) < S_{n+1})} ] = E[ (1/(n + 1)) Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} (1 − β*_i) (θ ∘ f_n)(X_i) ].
(10)

Using that (1 − β*_i) ∈ [0, 1] and exchangeability, we can bound the right-hand side of the above display as

E[ (1/(n + 1)) Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} (1 − β*_i) (θ ∘ f_n)(X_i) ] ≤ (1/(n + 1)) E[ max_{i ∈ [n+1]} |(θ ∘ f_n)(X_i)| Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} ].

Next, since there are no ties by C3, the event {S_i = ρ_n(X_i)} for some index i ∈ [n + 1] can only occur once per piecewise-constant segment of ρ_n, since, otherwise, S_i = ρ_n(X_i) = ρ_n(X_j) = S_j for some i ≠ j. However, ρ_n is a transformation of f_n and, therefore, has the same number of constant segments as f_n. Thus, it holds that

Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} ≤ N^{(X_{n+1},Y_{n+1})},

where N^{(X_{n+1},Y_{n+1})} is the (random) number of constant segments of f_n. This implies that

(1/(n + 1)) E[ max_{i ∈ [n+1]} |(θ ∘ f_n)(X_i)| Σ_{i=1}^{n+1} 1{S_i = ρ_n(X_i)} ] ≤ (1/(n + 1)) E[ N^{(X_{n+1},Y_{n+1})} max_{i ∈ [n+1]} |(θ ∘ f_n)(X_i)| ].

Combining this bound with (10), we find

E[ (θ ∘ f_n)(X_{n+1}) {α − 1(ρ_n(X_{n+1}) < S_{n+1})} ] ≤ (1/(n + 1)) E[ N^{(X_{n+1},Y_{n+1})} max_{i ∈ [n+1]} |(θ ∘ f_n)(X_i)| ].

By the law of iterated expectations, we then have

E[ (θ ∘ f_n)(X_{n+1}) {α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}))} ] ≤ (1/(n + 1)) E[ N^{(X_{n+1},Y_{n+1})} max_{i ∈ [n+1]} |(θ ∘ f_n)(X_i)| ].

Next, let V ⊂ R denote the support of the random variable f_n(X_{n+1}). Then, taking θ to be

t ↦ 1(t ∈ V) sign{α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1}) = t)},

which falls almost surely in {−1, 1}, we obtain the mean absolute error bound:

E| α − P(ρ_n(X_{n+1}) < S_{n+1} | f_n(X_{n+1})) | ≤ (1/(n + 1)) E[ N^{(X_{n+1},Y_{n+1})} ].

Since the event Y_{n+1} ∉ Ĉ_{n+1}(X_{n+1}), with Ĉ_{n+1}(X_{n+1}) = {y ∈ Y : S_{n+1}^{(X_{n+1},y)} ≤ ρ_n^{(X_{n+1},y)}(X_{n+1})}, occurs if, and only if, ρ_n(X_{n+1}) < S_{n+1}, we conclude that

E| α − P(Y_{n+1} ∉ Ĉ_{n+1}(X_{n+1}) | f_n(X_{n+1})) | ≤ E[N^{(X_{n+1},Y_{n+1})}] / (n + 1).

Under C4, E[N^{(X_{n+1},Y_{n+1})}] ≲ n^{1/3} polylog n, such that

E| α − P(Y_{n+1} ∉ Ĉ_{n+1}(X_{n+1}) | f_n(X_{n+1})) | ≲ (n^{1/3} polylog n) / (n + 1) ≲ n^{−2/3} polylog n,

as desired.

Proof of Theorem 4.3. Let P_{n+1} denote the empirical distribution of {(X_i, Y_i)}_{i=1}^{n+1} and let P_n denote the empirical distribution of {(X_i, Y_i)}_{i=1}^{n}. For any function g : X × Y → R, we use the following empirical process notation: Pg := ∫ g(x, y) dP(x, y), P_{n+1}g := ∫ g(x, y) dP_{n+1}(x, y), and P_n g := ∫ g(x, y) dP_n(x, y). Define the risk functions

R_n^{(x,y)}(θ) := (1/(n+1)) Σ_{i=1}^{n} {S_θ(X_i, Y_i)}² + (1/(n+1)) {S_θ(x, y)}², R_{n+1}(θ) := (1/(n+1)) Σ_{i=1}^{n+1} {S_θ(X_i, Y_i)}², and R_0(θ) := ∫ {S_θ(x, y)}² dP(x, y).

Moreover, define the risk minimizers θ_n^{(x,y)} := argmin_{θ ∈ Θiso} R_n^{(x,y)}(θ) and θ_0 := argmin_{θ ∈ Θiso} R_0(θ). Observe that R_n^{(x,y)}(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_0) ≤ 0, since θ_n^{(x,y)} minimizes R_n^{(x,y)} over Θiso. Using this inequality, it follows that

R_0(θ_n^{(x,y)}) − R_0(θ_0) = {R_0(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_n^{(x,y)})} + {R_n^{(x,y)}(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_0)} + {R_n^{(x,y)}(θ_0) − R_0(θ_0)}
  ≤ {R_0(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_n^{(x,y)})} + {R_n^{(x,y)}(θ_0) − R_0(θ_0)}
  = {R_0(θ_n^{(x,y)}) − R_{n+1}(θ_n^{(x,y)})} − {R_0(θ_0) − R_{n+1}(θ_0)} + {R_{n+1}(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_n^{(x,y)})} − {R_{n+1}(θ_0) − R_n^{(x,y)}(θ_0)}.

The first pair of terms on the right-hand side of the above display can be written as

{R_0(θ_n^{(x,y)}) − R_{n+1}(θ_n^{(x,y)})} − {R_0(θ_0) − R_{n+1}(θ_0)} = (P − P_{n+1}) [ {S_{θ_n^{(x,y)}}}² − {S_{θ_0}}² ].

We now bound the second pair of terms, {R_{n+1}(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_n^{(x,y)})} − {R_{n+1}(θ_0) − R_n^{(x,y)}(θ_0)}. For any θ ∈ Θiso, observe that

R_n^{(x,y)}(θ) − R_{n+1}(θ) = (1/(n + 1)) [ {y − θ ∘ f(x)}² − {Y_{n+1} − θ ∘ f(X_{n+1})}² ].
We know that θ_n^{(x,y)} and θ_0, being defined via isotonic regression, are uniformly bounded by B := sup_{y ∈ Y} |y|, which is finite by C6. Therefore,

| {R_{n+1}(θ_n^{(x,y)}) − R_n^{(x,y)}(θ_n^{(x,y)})} − {R_{n+1}(θ_0) − R_n^{(x,y)}(θ_0)} | ≤ 8B² / (n + 1) = O(n^{−1}).

Combining the previous displays, we obtain the excess risk bound

R_0(θ_n^{(x,y)}) − R_0(θ_0) ≤ (P − P_{n+1}) [ {S_{θ_n^{(x,y)}}}² − {S_{θ_0}}² ] + O(n^{−1}). (11)

Next, we claim that R_0(θ_n^{(x,y)}) − R_0(θ_0) ≥ ‖(θ_n^{(x,y)} ∘ f) − (θ_0 ∘ f)‖²_P. To show this, expanding the squares, note, pointwise for each x ∈ X and y ∈ Y, that

{S_{θ_n^{(x,y)}}(x, y)}² − {S_{θ_0}(x, y)}² = {θ_n^{(x,y)} ∘ f(x)}² − {θ_0 ∘ f(x)}² − 2y {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}
  = {(θ_n^{(x,y)} ∘ f)(x) + (θ_0 ∘ f)(x) − 2y} {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}.

Consequently,

R_0(θ_n^{(x,y)}) − R_0(θ_0) = ∫ {(θ_n^{(x,y)} ∘ f)(x) + (θ_0 ∘ f)(x) − 2y} {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)} dP(x, y). (12)

The class Θiso consists of all isotonic functions and is, therefore, a convex space. Thus, the first-order derivative equations defining the population minimizer θ_0 imply that

∫ {θ_n^{(x,y)} ∘ f(x) − θ_0 ∘ f(x)} {y − θ_0 ∘ f(x)} dP(x, y) ≤ 0. (13)

Combining (12) and (13), we find

R_0(θ_n^{(x,y)}) − R_0(θ_0) = ∫ {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}² dP(x) + 2 ∫ {(θ_0 ∘ f)(x) − y} {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)} dP(x, y)
  ≥ ∫ {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}² dP(x),

as desired. Combining this lower bound with (11), we obtain the inequality

∫ {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}² dP(x) ≤ R_0(θ_n^{(x,y)}) − R_0(θ_0) ≤ (P_n − P) [ {S_{θ_n^{(x,y)}}}² − {S_{θ_0}}² ] + O(1/n). (14)

Define δ_n := ( ∫ {(θ_n^{(x,y)} ∘ f)(x) − (θ_0 ∘ f)(x)}² dP(x) )^{1/2}, the bound B := sup_{y ∈ Y} |y|, and the function class

Θ_{1,n} := { (x, y) ↦ {(θ_1 + θ_2) ∘ f(x) − 2y} {(θ_1 − θ_2) ∘ f(x)} : θ_1, θ_2 ∈ Θiso }.

Using this notation, (14) implies

δ²_n ≤ sup_{θ_1, θ_2 ∈ Θiso : ‖(θ_1 − θ_2) ∘ f‖_P ≤ δ_n} ∫ {(θ_1 + θ_2) ∘ f(x) − 2y} {(θ_1 − θ_2) ∘ f(x)} d(P_n − P)(x, y) + O(1/n)
  ≤ sup_{h ∈ Θ_{1,n} : ‖h‖_P ≤ 4Bδ_n} (P_n − P)h + O(1/n).

Using the above inequality and C5, we will use an argument similar to the proof of Theorem 3 in [52] to establish that δ_n = O_p(n^{−1/3}). This rate then implies the result of the theorem. To see this, note, by the reverse triangle inequality, that

| S_n^{(x,y)}(x′, y′) − S_0(x′, y′) | = | |y′ − θ_n^{(x,y)} ∘ f(x′)| − |y′ − θ_0 ∘ f(x′)| | ≤ | θ_0 ∘ f(x′) − θ_n^{(x,y)} ∘ f(x′) |.

Squaring and integrating the left- and right-hand sides, we find

∫ {S_n^{(x,y)}(x′, y′) − S_0(x′, y′)}² dP(x′, y′) ≤ ∫ {θ_0 ∘ f(x′) − θ_n^{(x,y)} ∘ f(x′)}² dP(x′) = δ²_n,

as desired. We now establish that δ_n = O_p(n^{−1/3}). For a function class F, let N(ε, F, L2(Q)) denote the ε-covering number [53] of F and define the uniform entropy integral of F by

J(δ, F) := ∫_0^δ sup_Q √(log N(ε, F, L2(Q))) dε,

where the supremum is taken over all discrete probability distributions Q. We note that

J(δ, Θ_{1,n}) = ∫_0^δ sup_Q √(log N(ε, Θ_{1,n}, L2(Q))) dε ≲ ∫_0^δ sup_Q √(log N(ε, Θiso, L2(Q ∘ f^{−1}))) dε ≲ J(δ, Θiso),

where Q ∘ f^{−1} is the push-forward probability measure for the random variable f(W). Additionally, the covering number bound for bounded monotone functions given in Theorem 2.7.5 of [53] implies that J(δ, Θ_{1,n}) ≲ J(δ, Θiso) ≲ √δ. Recall that f is obtained from an external dataset, say E_n, independent of the calibration data, so that f is deterministic conditional on the training dataset E_n. Applying Theorem 2.1 of [54] conditional on E_n, we obtain, for any δ > 0, that

E[ sup_{h ∈ Θ_{1,n} : ‖h‖_P ≤ 4Bδ} (P_n − P)h | E_n ] ≲ n^{−1/2} J(δ, Θ_{1,n}) {1 + J(δ, Θ_{1,n}) / (δ² √n)} ≲ n^{−1/2} J(δ, Θiso) {1 + J(δ, Θiso) / (δ² √n)}.

Noting that the right-hand side of the above bound is deterministic, we conclude that

E[ sup_{h ∈ Θ_{1,n} : ‖h‖_P ≤ 4Bδ} (P_n − P)h ] ≲ n^{−1/2} J(δ, Θiso) {1 + J(δ, Θiso) / (δ² √n)}.

We use the so-called "peeling" argument [53] to obtain our bound for δ_n.
Note P δ2 n n 2/32M = m=M P 2m+1 n2/3δ2 n 2m 2m+1 n2/3δ2 n 2m , δ2 n sup h Θ1,n: h 4Bδn (Pn P)h + O(1/n) 2m+1 n2/3δ2 n 2m , 22mn 2/3 sup h Θ1,n: h 4B2m+1n 1/3(Pn P)h + O(1/n) 22mn 2/3 sup h Θ1,n: h 4B2m+1n 1/3(Pn P)h + O(1/n) E h suph Θ1,n: h 4B2m+1n 1/3(Pn P)h i + O(1/n) J (2m+1n 1/3, Θiso) 1 + J (22m+2n 2/3,Θiso) n2m+1n 1/3 n22mn 2/3 + O(1/n) 22mn 2/3 2(m+1)/2n 1/6 Since P m=1 2(m+1)/2 22m < , we have that P m=M 2(m+1)/2 22m 0 as M . Therefore, for all ε > 0, we can choose M > 0 large enough so that P δ2 n n 2/32M ε. We conclude that δn = Op(n 1/3) as desired. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The abstract and introduction state our main contributions: a conformal prediction method that provides calibrated point predictions and prediction intervals with validity conditional on these point predictions. They also motivate our objective of predictionconditional validity and self-calibration as a weaker, but feasible, notion of validity compared to context-conditional validity. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The objective of this paper is to construct calibrated point predictions and self-calibrated prediction intervals satisfying desiderata (i) and (ii). Below Theorem 4.1 and Theorem 4.2, we discuss two potential limitations: namely, that we construct a perfectly calibrated Venn-Abers multior set-prediction as opposed to a perfectly calibrated point prediction, and that our prediction interval is self-calibrated (achieving (ii)) concerning the unknown (oracle) perfectly calibrated point prediction. These limitations appear to be unavoidable as perfectly calibrated point predictions can generally not be constructed in finite samples without oracle knowledge [61, 60]. Other potential limitations regarding computation and implementation, as well as possible extensions are discussed in Section 3.4. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. 
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All assumptions for theoretical results are provided in the main text and formal and complete proofs are given in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Important details of the experiment, including the dataset used for the analysis, the training, calibration, and test split, the definition of the evaluation metrics, the baseline methods for comparison, and how xgboost is trained, are all detailed in the text. Code to reproduce the experimental results is provided in the supplementary material. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. 
For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: A Python package implementing our method, as well as code for reproducing our experimental results, are provided in a supplementary zip file. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). 
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Important details such as the data split percentages, xgboost hyperparameters tuned using cross-validation, as well as parameters for SC-CP and baselines are all reported in the main text. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: The experiment involved comparisons on real data, building on the experiment performed in [47]. As such, the ground truth is not known. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The methods implemented in this paper are not computationally intensive and were run in RStudio on a Mac Book Pro with 16GB RAM and an M1 chip. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. 
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our paper follows the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Our work is foundational with no direct positive or negative societal impacts. We note, as alluded to in the introduction, that our work has the potential to improve the trustworthiness of machine learning methods by providing calibrated predictions and prediction intervals for uncertainty quantification. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our paper does not involve releasing any data or models and, therefore, there is no risk of such misuse. Guidelines: The answer NA means that the paper poses no such risks. 
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We use publicly available data and code, which are properly cited. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: No new assets are introduced in this paper. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing or research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. 
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.