# Modular Conformal Calibration

Charles Marx*¹, Shengjia Zhao*¹, Willie Neiswanger¹, Stefano Ermon¹

**Abstract**

Uncertainty estimates must be calibrated (i.e., accurate) and sharp (i.e., informative) in order to be useful. This has motivated a variety of methods for recalibration, which use held-out data to turn an uncalibrated model into a calibrated model. However, the applicability of existing methods is limited due to their assumption that the original model is also a probabilistic model. We introduce a versatile class of algorithms for recalibration in regression that we call modular conformal calibration (MCC). This framework allows one to transform any regression model into a calibrated probabilistic model. The modular design of MCC allows us to make simple adjustments to existing algorithms that enable well-behaved distribution predictions. We also provide finite-sample calibration guarantees for MCC algorithms. Our framework recovers isotonic recalibration, conformal calibration, and conformal interval prediction, implying that our theoretical results apply to those methods as well. Finally, we conduct an empirical study of MCC on 17 regression datasets. Our results show that new algorithms designed in our framework achieve near-perfect calibration and improve sharpness relative to existing methods.

*Equal contribution. ¹Computer Science Department, Stanford University. Correspondence to: Charles Marx.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Uncertainty estimates can inform human decisions (Pratt et al., 1995; Berger, 2013), flag when an automated decision system requires human review (Kang et al., 2021), and serve as an internal component of automated systems. For example, uncertainty informs treatment decisions in medicine (Begoli et al., 2019) and supports safety in autonomous navigation (Michelmore et al., 2018). In such settings, the benefits of accounting for uncertainty hinge on our ability to produce calibrated uncertainty estimates: of those events to which one assigns a probability of 90%, the events should indeed occur 90% of the time. A model that is not calibrated can consistently make confident predictions that are incorrect.

Many models, such as neural networks (Guo et al., 2017) and Gaussian processes (Rasmussen, 2003; Tran et al., 2019), achieve high accuracy but have poorly calibrated or absent uncertainty estimates. In other cases, a pretrained model is released for wide use and it is difficult to guarantee that it will produce calibrated uncertainty estimates in new settings (Zhao et al., 2021). This leads us to the question: how can we safely deploy models with high predictive value but poor or absent uncertainty estimates?

These challenges have motivated work on recalibration, whereby a model with poor uncertainty estimates is transformed into a probabilistic model that outputs calibrated probabilities (Kuleshov et al., 2018; Vovk et al., 2020; Niculescu-Mizil & Caruana, 2005; Chung et al., 2021). Recalibration methods are attractive because they require only black-box access to a given model and can return well-calibrated probabilistic predictions. However, calibration is not the only goal of probabilistic models. It is also important for a probabilistic model to predict sharp (i.e., low variance) distributions to convey more information.
Furthermore, recalibration methods need to be data efficient to calibrate models in data-poor regimes.

In this paper, we introduce modular conformal calibration (MCC), a class of algorithms that unifies existing recalibration methods and gives well-behaved distribution predictions from any model. Our main contributions are:

1. We introduce modular conformal calibration, a class of algorithms for recalibration in regression, which can be applied to recalibrate almost any regression model. MCC unifies isotonic calibration (Kuleshov et al., 2018), conformal calibration (Vovk et al., 2020), and conformal interval prediction (Vovk et al., 2005) under a single theoretical framework, and additionally leads to new algorithms.

2. We provide finite-sample calibration guarantees, showing that MCC can achieve ϵ calibration error with O(1/ϵ) samples. These results also apply to the aforementioned recalibration methods that MCC unifies.

3. We conduct an empirical study on 17 datasets to compare the performance of recalibration methods in practice. We find that new algorithms within our framework outperform existing methods in terms of both sharpness and proper scoring rules.

## 2. Background

Given an input feature vector x ∈ X (e.g., a satellite image), we want to predict a label y ∈ Y (e.g., the temperature tomorrow). We consider regression problems where Y = ℝ. We assume there is a true distribution F_{XY} over X × Y, and we have access to n i.i.d. examples (X_i, Y_i) ∼ F_{XY}. Given a feature vector x, our goal is to predict the conditional distribution of Y given X = x, denoted F_{Y|x}.

A distribution predictor is a function H : X → F(Y) that takes a feature vector x as input and returns H[x], a cumulative distribution function (CDF) over Y. Note that H[x] ∈ F(Y) is a function, intended to approximate F_{Y|x} by mapping any input y ∈ Y to a value H[x](y) in [0, 1].

We consider a two-stage process to learn a distribution predictor from data. In the first stage, we train the base predictor f : X → R, which maps a feature vector x to a prediction in a space R. The base predictor can be any model (e.g., neural network, support vector machine) and the prediction can be of any type; for example, f could give a point prediction for the mean of F_{Y|x} (i.e., R = ℝ), or f could give an interval prediction that is likely to contain Y (i.e., R = ℝ²). Alternatively, f could predict a Gaussian distribution that approximates the distribution of the label (i.e., R = F(Y)). Regardless of the base predictor we choose, the second step is to recalibrate the base predictor: we translate the base predictor f into a calibrated distribution predictor H. We construct H by fitting a wrapper function around f, meaning that H[x] only depends on x via the prediction f(x).

We focus on the second stage, that of recalibration. For those interested in which base predictors yield the best recalibrated predictors, see Section 6 for an empirical study.

**Example 1 (Linear Regression).** Consider a linear regression problem where Y_i = β⊤X_i + ϵ_i for X_i ∈ ℝ^d, β ∈ ℝ^d, and ϵ_i ∈ ℝ distributed i.i.d. with known CDF F_ϵ. Imagine we are given a base predictor f(x) = β⊤x that perfectly predicts the mean of F_{Y|x}. Then we can construct a perfect distribution predictor H[x] : Y → [0, 1] by defining:

H[x](y) := F_ϵ(y − f(x)) = F_{Y|x}(y).

Note that the CDF prediction H[x] only depends on x through the point prediction f(x), but still gives perfect distribution predictions.
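As a concrete illustration of Example 1, the following minimal Python sketch builds H[x](y) = F_ϵ(y − f(x)) from a point predictor when the noise CDF is known. The linear model, noise scale, and data below are assumptions made for the sketch, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

# Example 1 as code (illustrative): a linear model with known Gaussian noise.
rng = np.random.default_rng(0)
d, n = 3, 1000
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.3                                   # known noise scale, so F_eps = Normal(0, sigma)
X = rng.normal(size=(n, d))
Y = X @ beta + rng.normal(scale=sigma, size=n)

f = lambda x: x @ beta                         # base predictor: perfect point prediction of the mean

def H(x):
    """Distribution predictor built from the point prediction alone:
    H[x](y) = F_eps(y - f(x)), which equals the true conditional CDF here."""
    return lambda y: norm.cdf(y - f(x), scale=sigma)

# The probability integral transform H[X](Y) should be (approximately) uniform.
pit = np.array([H(X[i])(Y[i]) for i in range(n)])
print(np.quantile(pit, [0.1, 0.5, 0.9]))       # roughly [0.1, 0.5, 0.9]
```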
**Calibration.** Optimally, the distribution predictor H will output for each value x the true conditional CDF, H[x] = F_{Y|x}, as in Example 1. However, many feature vectors x only appear once in our data, making it impossible to learn a perfect distribution predictor H from data without additional assumptions (such as the assumptions of linearity and i.i.d. noise in Example 1). Rather than making additional assumptions, we aim for calibration, a weaker property than perfect distribution prediction that can be obtained in practice.

Recall that for any random variable Y, the probability integral transform F_Y(Y), obtained by evaluating the CDF with random input Y, follows a standard uniform distribution. We should expect the same behavior from our predicted CDFs; we should observe that H[X](Y) also follows a standard uniform distribution. For example, the observed label should be greater than the predicted 95th percentile for approximately 5% of examples.

**Definition 1.** Given a distribution predictor H : X → F(Y), we say that H is calibrated if H[X](Y) follows a standard uniform distribution. Formally, H is calibrated if

Pr(H[X](Y) ≤ p) = p, for all p ∈ [0, 1].  (1)

Similarly, for a value ϵ ≥ 0, we say that H is ϵ-calibrated if Equation (1) is violated by at most ϵ:

|Pr(H[X](Y) ≤ p) − p| ≤ ϵ, for all p ∈ [0, 1].  (2)

Calibration is a necessary but not sufficient condition for making good distribution predictions. Note that a distribution predictor H[x] = F_Y that ignores x and returns the marginal CDF of Y will be calibrated, but not useful. Thus, distribution predictions should be as sharp (i.e., highly concentrated) as possible, conditioned on being calibrated.
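Definition 1 suggests a simple empirical check: compute the values H[X_i](Y_i) on held-out data and compare their empirical CDF against the uniform CDF. The sketch below uses the maximum deviation as a rough proxy for calibration error; it is not the debiased ECE used later in the experiments.

```python
import numpy as np

def calibration_error(pit_values, num_levels=100):
    """Empirical check of Definition 1: for p in [0, 1], compare
    Pr(H[X](Y) <= p), estimated from held-out PIT values, against p.
    Returns the maximum absolute deviation (a simple proxy for calibration error)."""
    pit_values = np.sort(np.asarray(pit_values))
    levels = np.linspace(0.0, 1.0, num_levels + 1)
    # Fraction of PIT values <= p for each level p.
    empirical = np.searchsorted(pit_values, levels, side="right") / len(pit_values)
    return float(np.max(np.abs(empirical - levels)))

# A perfectly calibrated predictor yields uniform PIT values, so the error is small.
rng = np.random.default_rng(0)
print(calibration_error(rng.uniform(size=2000)))      # close to 0
print(calibration_error(rng.beta(2, 5, size=2000)))   # far from 0: miscalibrated
```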
**Related Work.** Post-hoc uncertainty quantification is an active field of research. Platt scaling (Platt et al., 1999) and isotonic regression (Niculescu-Mizil & Caruana, 2005) are popular methods for recalibrating binary classifiers. Platt scaling fits a logistic regression model to the scores given by a model, and isotonic regression learns a nondecreasing map from scores to the unit interval. Quantile regression (e.g., Romano et al., 2019; Chung et al., 2020) simultaneously estimates multiple quantiles of the label distribution, often via the pinball loss, which can then be combined to construct calibrated distribution predictions.

Isotonic calibration (Kuleshov et al., 2018) is an effective strategy for recalibrating a base predictor that already makes distribution predictions. Isotonic calibration computes the empirical quantiles (i.e., f[X_i](Y_i)) of a distribution predictor on the calibration dataset, and uses isotonic regression to adjust the empirical quantiles so that they are uniform on [0, 1]. Conformal calibration (Vovk et al., 2020) is similar to isotonic calibration, except it uses a randomized function to adjust the empirical quantiles instead of isotonic regression. This yields strong calibration guarantees, at the cost of discontinuous and randomized distribution predictions. Conformal prediction (Vovk et al., 2005) is a general approach to uncertainty quantification that produces prediction sets (i.e., interval predictions) with guaranteed coverage, instead of distribution predictions. In the context of these prediction sets, some prior work has also studied the connection between conformal prediction and calibration (Lei et al., 2018; Gupta et al., 2020; Angelopoulos & Bates, 2021). Our work builds on isotonic calibration, conformal calibration, and conformal prediction to construct novel recalibration algorithms for arbitrary base predictors.

Figure 1: A combination of the three components defines a modular conformal calibration algorithm. (The figure shows example choices for each component: distribution, interval, quantile, and ensemble base predictors; a calibration score; and NAF, naive, or linear interpolation.)

## 3. Modular Conformal Calibration

In this section, we introduce a new class of recalibration procedures and provide calibration guarantees for this class of algorithms. We begin with a simple example in which we recalibrate a point predictor, then we generalize this reasoning to introduce modular conformal calibration. We conclude this section by enumerating the design choices our framework introduces.

### 3.1. Warm-up

We start with a simple example to introduce the main idea. In this example, we turn a point predictor into a calibrated distribution predictor.

1. Suppose we have a base predictor f : X → ℝ that uses a satellite image X to produce a point estimate for the temperature the following day, Y. Additionally, we are given a dataset (X_1, Y_1), ..., (X_n, Y_n) where we observe both the satellite image and the temperature. Now, given a new satellite image X*, we want to predict the (unobserved) temperature Y*. It is important for us to quantify our uncertainty if this prediction will inform a decision: we will likely behave differently if the temperature is certain to be within 2 degrees of our estimate, versus if it could differ by 20 degrees.

2. We can apply the residue function ϕ(f(x), y) = y − f(x) to our predictions, giving S_1 = Y_1 − f(X_1), ..., S_n = Y_n − f(X_n); if we knew the label Y*, we could also apply the residue score to our test example to compute a residue S* = Y* − f(X*). If the data is i.i.d., then the residues are also i.i.d. random variables.

3. We can consider how large or small S* is among the set of residues {S_1, ..., S_n}. Intuitively, because of the i.i.d. assumptions, S* is equally likely to be the smallest, 2nd smallest, ..., or largest element. Formally, if we define the ranking function q as

q(t) := (1/n) Σ_{i=1}^{n} 1{S_i ≤ t},

then, up to discretization error,

Pr[q(S*) ≤ c] ≈ c, for all c ∈ [0, 1].  (3)

In fact, Eq. (3) is exactly the definition of calibration, so the reasoning above proves that the CDF predictor H[x](y) = q(ϕ(f(x), y)) is approximately calibrated. We also have to show that H[x] is a CDF. This is easy to prove, as q(ϕ(f(x), y)) is a nondecreasing function of y. We conclude that H is an approximately calibrated CDF predictor.

**Algorithm 1: Modular Conformal Calibration**
- Input: base predictor f : X → R, calibration score ϕ : R × Y → ℝ, and interpolation algorithm ψ
- Input: calibration dataset (X_1, Y_1), ..., (X_n, Y_n)
- Compute calibration scores S_i = ϕ(f(X_i), Y_i) for i = 1, ..., n
- Run the interpolation algorithm q = ψ(S_1, ..., S_n)
- Return: the CDF predictor H[x](y) = q(ϕ(f(x), y))
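The warm-up procedure (the special case of Algorithm 1 with a point predictor) is short enough to sketch directly. The helper below is an illustration, with synthetic data and names as assumptions; it uses the residue score and the step-function ranking q from this section.

```python
import numpy as np

def recalibrate_point_predictor(f, X_cal, Y_cal):
    """Warm-up recalibration (sketch): residue score phi(f(x), y) = y - f(x)
    plus the naive ranking interpolation q from Section 3.1."""
    scores = np.sort(Y_cal - f(X_cal))           # S_1, ..., S_n, sorted

    def H(x):
        def cdf(y):
            s = y - f(x)                          # residue of the candidate label y
            # q(s) = (1/n) * #{i : S_i <= s}, the empirical rank of s among the residues
            return np.searchsorted(scores, s, side="right") / len(scores)
        return cdf
    return H

# Illustrative usage with a synthetic point predictor.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=500)
Y_cal = 2.0 * X_cal + rng.normal(scale=0.5, size=500)
f = lambda x: 2.0 * x
H = recalibrate_point_predictor(f, X_cal, Y_cal)
print(H(1.0)(2.0), H(1.0)(3.0))                   # approx 0.5 and approx 0.98
```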
### 3.2. Components of a Recalibration Algorithm

We organize the design choices within modular conformal calibration into three decisions:

1. **Base predictor.** In the first step, we choose a base predictor. This can be any prediction function f : X → R. There are no restrictions on the prediction space R, as long as we can define a compatible calibration score. The only requirement is that f is not learned on the calibration dataset (X_1, Y_1), ..., (X_n, Y_n), but it could be learned on any different dataset. In the previous example, the base predictor is a point prediction function.

2. **Calibration score.** In the second step, we choose a calibration score, which is any function ϕ : R × Y → ℝ that is monotonically strictly increasing in y. In the previous example, the calibration score is the residue ϕ(f(x), y) = y − f(x). Intuitively, the calibration score should reflect how large y is relative to our prediction f(x). We can then compute the calibration score for each sample in the calibration set, S_i = ϕ(f(X_i), Y_i) for i = 1, ..., n. For convenience of computing rankings, we sort the scores into S_(1) ≤ S_(2) ≤ ... ≤ S_(n).

3. **Interpolation algorithm.** Finally, we need a map from the calibration score to the final CDF output. In the warm-up example we constructed an interpolation map (the function q) by mapping any score in (S_(i−1), S_(i)] to i/n. The interpolation algorithm used there is a very simple step function. However, the resulting CDFs are not continuous, which may be inconvenient (e.g., if we want to compute the log likelihood). More generally, we can use any interpolation algorithm: let Q be the set of monotonically non-decreasing functions ℝ → [0, 1]. An interpolation algorithm is a map ψ : ℝ^n → Q. An interpolation algorithm maps the calibration scores S_1, ..., S_n to a function q such that q(S_1), ..., q(S_n) are approximately evenly spaced on the unit interval.

**Definition 2.** An interpolation function ψ : ℝ^n → Q is λ-accurate if for any distinct inputs (u_1, u_2, ..., u_n) ∈ ℝ^n, the function q = ψ(u_1, u_2, ..., u_n) maps the i-th smallest input u_(i) close to i/(n + 1):

|q(u_(i)) − i/(n + 1)| ≤ λ/(n + 1), for all i = 1, ..., n.  (4)

If ψ is a randomized function, then the statement must hold almost surely.

If a λ-accurate interpolation algorithm is applied to calibration scores computed on a held-out dataset, the function q ∘ ϕ will be approximately calibrated on that dataset. We can write the full process for making a CDF prediction as

H[x](y) = q(ϕ(f(x), y)),

which is our prediction for Pr(Y ≤ y | X = x). This three-step process of applying a base predictor f, calibration score ϕ, and interpolation function q (learned by an interpolation algorithm ψ) is detailed in Algorithm 1 and illustrated in Figure 1. Now, we turn the intuition that H will be calibrated into a formal guarantee.

**Theorem 1.** For any base predictor f, calibration score ϕ, and λ-accurate interpolation algorithm ψ such that the random variable ϕ(f(X), Y) is absolutely continuous, Algorithm 1 is (1 + λ)/(n + 1)-calibrated.

See Appendix C for a proof of Theorem 1. Similar to conformal interval prediction, there is a rather mild regularity assumption: ϕ(f(X), Y) has to be absolutely continuous, i.e., two i.i.d. samples (X_1, Y_1) and (X_2, Y_2) almost never have the same score ϕ(f(X_1), Y_1) = ϕ(f(X_2), Y_2). In our warm-up example in Section 3.1, this condition requires that two samples (X_1, Y_1), (X_2, Y_2) almost never have exactly the same residue Y_1 − f(X_1) = Y_2 − f(X_2).
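Algorithm 1 itself is only a few lines. The sketch below keeps the three components as pluggable arguments; it is an illustration of the pseudocode above rather than the authors' implementation, and the residue score and naive interpolator shown for wiring are the ones from the warm-up. Concrete calibration scores and interpolation algorithms in the same style are sketched after Sections 4.2 and 4.3.

```python
import numpy as np

def modular_conformal_calibration(f, score, interpolate, X_cal, Y_cal):
    """Sketch of Algorithm 1. `f` is any base predictor, `score(pred, y)` is a
    calibration score that is nondecreasing in y, and `interpolate(scores)` is an
    interpolation algorithm returning a nondecreasing map q: R -> [0, 1]."""
    # Step 1: calibration scores S_i = score(f(X_i), Y_i) on held-out data.
    scores = np.array([score(f(x), y) for x, y in zip(X_cal, Y_cal)])
    # Step 2: fit the interpolation function q on the scores.
    q = interpolate(scores)
    # Step 3: the recalibrated CDF predictor H[x](y) = q(score(f(x), y)).
    return lambda x: (lambda y: q(score(f(x), y)))

# Example wiring with the residue score and the naive step-function interpolator.
residue_score = lambda pred, y: y - pred

def naive_interpolator(scores):
    s = np.sort(scores)
    return lambda u: np.searchsorted(s, u, side="right") / len(s)

# H = modular_conformal_calibration(f, residue_score, naive_interpolator, X_cal, Y_cal)
```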
## 4. Choosing a Recalibration Algorithm

In this section, we describe natural choices for the calibration score and interpolation algorithm, given different base predictors. A main motivation for introducing the modular conformal calibration framework is to make it easy to develop new recalibration procedures. Any pairing of the calibration scores and interpolation algorithms described in this section results in a recalibration algorithm with the finite-sample calibration guarantee given by Theorem 1.

### 4.1. The Base Predictor

In some cases the base predictor will be fixed, such as when fine-tuning a pretrained model to be calibrated in a new setting. In other cases, we have end-to-end control of the training process. In these cases, we must answer the question: which base predictor should I train to get the best calibrated distribution predictor? An obvious choice is to learn a distribution predictor as the base predictor and then recalibrate if needed. However, there is no guarantee that this will produce better results than learning a different type of base predictor (e.g., one of the prediction types in Table 1) and then recalibrating. In fact, in our experiments we find that even when learners are of similar power, distribution predictors are not necessarily the most effective choice of base predictor (see Section 6).

### 4.2. The Calibration Score

In this section, we introduce calibration scores for a few prediction types (see Table 1) to illustrate the role of the calibration score. Intuitively, a good calibration score should measure how large y is relative to the prediction. Recall that the calibration score ϕ : R × Y → ℝ can be any function that is non-decreasing in y. A poor choice of calibration score still guarantees calibration (see Theorem 1), but can harm other metrics such as sharpness or NLL. Additional calibration scores for quantile prediction and ensemble prediction can be found in Appendix A.

| Prediction Type | Output Space (R) | Interpretation |
| --- | --- | --- |
| Point | ℝ | e.g., an estimate of the mean |
| Interval | ℝ² | the interval is [f_1(x), f_2(x)] |
| Quantile | ℝ^K | f_k(x) predicts a quantile α_k ∈ (0, 1) |
| Distribution | F(ℝ) | f[x] is a predicted CDF for y |
| Ensemble | R_1 × ... × R_K | each f_k is the prediction of a model k |

Table 1: A collection of common prediction types.

**Point Prediction.** A natural calibration score for point predictors is the residue ϕ_residue(x, y) = y − f(x).

**Interval Prediction.** For interval predictors, a natural choice for the calibration score is the residue divided by the interval size, ϕ_interval(x, y) = (y − f_1(x)) / (f_2(x) − f_1(x)). Intuitively, if y equals the predicted upper bound f_2(x), then the calibration score is 1; if y equals the lower bound f_1(x), then the calibration score is 0. The calibration scores of all other y are linear interpolations of these two.

**Distribution Prediction.** Given a distribution prediction, i.e., a map f : X → F(Y), two natural choices for the calibration score are

ϕ_cdf(x, y) = f[x](y),
ϕ_z-score(x, y) = (y − mean(f[x])) / std(f[x]).

Numerical stability is a practical issue for ϕ_cdf. When y is small or large, the calibration score may be the same for different y due to rounding with finite numerical precision. Empirically, ϕ_z-score has better numerical stability and often better performance.
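The calibration scores above can be written as small functions that take the base prediction and a candidate label, matching the `score(pred, y)` convention used in the Algorithm 1 sketch. The prediction formats below (scalar, pair, callable CDF) are assumptions made for illustration.

```python
# Point prediction: pred is a scalar estimate of the label.
def residue_score(pred, y):
    return y - pred

# Interval prediction: pred = (lower, upper); the score is 0 at the lower bound,
# 1 at the upper bound, and linear in between (extrapolated outside).
def interval_score(pred, y):
    lower, upper = pred
    return (y - lower) / (upper - lower)

# Distribution prediction, option 1: pred is a predicted CDF given as a callable.
def cdf_score(pred_cdf, y):
    return pred_cdf(y)

# Distribution prediction, option 2: pred is a (mean, std) pair, e.g. a Gaussian.
def z_score(pred, y):
    mean, std = pred
    return (y - mean) / std
```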
### 4.3. The Interpolation Algorithm

Lastly, we discuss the choice of interpolation algorithm. We illustrate a simple linear interpolation algorithm, a randomized interpolation algorithm with strong theoretical guarantees, and a more complex approach using neural autoregressive flows. Recall that an interpolation algorithm is a function ψ that maps a vector (u_1, ..., u_n) to a nondecreasing function q such that q(u_1), ..., q(u_n) are approximately evenly spaced on the unit interval. Recall also that we write u_(i) to denote the i-th smallest input.

**Naive Discretization.** As we discussed, the interpolation algorithm in the warm-up example is

q_naive(u) = i/n, if u ∈ [u_(i), u_(i+1)).  (6)

While simple, the resulting CDF is not continuous, making quantities such as the log-likelihood undefined. It is also not 0-accurate (recall Definition 2). For better performance we need more sophisticated interpolation algorithms.

**Linear.** Linear interpolation is a simple way to get a continuous CDF with a density:

q_linear(u) = (i + (u − u_(i)) / (u_(i+1) − u_(i))) / (n + 1), for u ∈ [u_(i), u_(i+1)).

A piecewise linear CDF is differentiable almost everywhere, so the log likelihood and density function are well-defined almost everywhere. Linear interpolation can perfectly fit any monotonic sequence, and is therefore 0-accurate.

**Neural Autoregressive Flow (NAF).** To achieve even better smoothness properties, we can use a neural autoregressive flow (NAF), which is a class of deep neural networks that can universally approximate bounded continuous monotonic functions (Huang et al., 2018). The benefit of using a NAF is that the resulting CDF will be smoother. In fact, if we use a differentiable activation function for the NAF network (such as sigmoid rather than ReLU), then the NAF represents smooth CDF functions that are differentiable everywhere. The shortcoming is that a NAF can only represent arbitrary monotonic functions (and hence be a 0-accurate interpolation algorithm) if the network is infinitely wide. In our experiments, a network with 200 units is sufficient to push the errors below numerical precision.

**Random.** Finally, there is an interpolation algorithm that uses randomization, which recovers the algorithm in (Vovk et al., 2020). Let U be uniform on [0, 1]:

q_random(u) = (i + U)/(n + 1), if u ∈ [u_(i), u_(i+1)).

Compared to linear and NAF interpolation, random interpolation has some shortcomings: the CDF is not continuous and the standard deviation is undefined. However, random interpolation has an important theoretical advantage, in that it guarantees that Algorithm 1 is 0-calibrated (i.e., perfectly calibrated). In our experiments, this theoretical advantage does not lead to lower calibration error in general, as all methods have near-zero ECE. A detailed comparison of the interpolation algorithms is shown in Figure 3.
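The linear and random interpolators can be sketched in the same pluggable form as the Algorithm 1 sketch above; the NAF interpolator is omitted here because it requires fitting a monotonic neural network. This is an illustrative reading of q_linear and q_random, not reference code.

```python
import numpy as np

def linear_interpolator(scores):
    """q maps the i-th smallest score to i/(n+1) and interpolates linearly in
    between, giving a continuous piecewise-linear CDF (isotonic-style recalibration)."""
    s = np.sort(scores)
    n = len(s)
    targets = np.arange(1, n + 1) / (n + 1)
    def q(u):
        # np.interp is nondecreasing; outside the observed scores we clamp to 0 and 1.
        return float(np.interp(u, s, targets, left=0.0, right=1.0))
    return q

def random_interpolator(scores, rng=None):
    """q maps a score falling between the i-th and (i+1)-th smallest calibration
    scores to (i + U)/(n + 1) with U ~ Uniform[0, 1] (conformal-calibration style)."""
    rng = rng or np.random.default_rng()
    s = np.sort(scores)
    n = len(s)
    def q(u):
        i = int(np.searchsorted(s, u, side="right"))   # number of scores <= u
        return (i + rng.uniform()) / (n + 1)
    return q
```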
## 5. Towards Unifying Calibrated Regression

In this section, we show that modular conformal calibration recovers popular methods for calibrated regression. This implies that the calibration guarantees in this paper also apply to the methods discussed in this section. We also hope to shed light on connections between previously distinct streams of research.

We first observe that isotonic calibration (Kuleshov et al., 2018; Malik et al., 2019) is recovered by MCC.

**Observation 1 (On Isotonic Calibration).** Algorithm 1 in (Kuleshov et al., 2018) is equivalent to Algorithm 1 in our paper with a distribution base predictor, ϕ_cdf, and q_linear.

Interestingly, this allows us to give new guarantees on the performance of Algorithm 1 in (Kuleshov et al., 2018). In particular, we can use Theorem 1 and conclude that Algorithm 1 in (Kuleshov et al., 2018) is 1/(n + 1)-calibrated. This result was not available in (Kuleshov et al., 2018).

**Observation 2 (On Conformal Calibration).** Algorithm 1 in Vovk et al. (2020) is equivalent to Algorithm 1 in our paper with a distribution base predictor, ϕ_cdf, and q_random.

This makes it clear that conformal calibration and isotonic calibration are tightly connected. The most significant difference between the two methods is that conformal calibration uses a randomized interpolation algorithm. Randomization gives better calibration guarantees at the cost of worse behavior for the distribution predictions (e.g., the predicted distributions are discontinuous, so the log likelihood is ill-defined).

### 5.1. Connection to Conformal Interval Prediction

Conformal prediction (Vovk et al., 2005; Shafer & Vovk, 2008; Romano et al., 2019) is a family of (provably) exact interval forecasting algorithms (see, e.g., Proposition 1 in Appendix C). Conformal interval prediction uses a proper non-conformity score φ : X × Y → ℝ, which is any continuous function that is strictly unimodal in y (see Appendix B). Intuitively, the non-conformity score measures how well the label y matches the input x. For example, given a base point prediction function f : X → ℝ, the absolute residue of the prediction, φ(x, y) = |y − f(x)|, is a natural choice (Vovk et al., 2005). For a confidence level c ∈ (0, 1), the conformal forecast is defined as

I_c(x) = { y : (1/n) Σ_{i=1}^{n} 1{φ(X_i, Y_i) ≤ φ(x, y)} ≤ c }.

On the other hand, one can trivially construct a valid confidence interval from a calibrated distribution predictor. Consider the map η_c : F(Y) → ℝ², which maps any CDF into two numbers that represent a c-credible interval:

η_c : H[x] ↦ ( H[x]^{-1}((1 − c)/2), H[x]^{-1}((1 + c)/2) ).

Intuitively, η_c returns an interval that has probability c under the distribution H[x]. We then ask: can modular conformal calibration yield interval predictions comparable to conformal interval prediction? We answer this question in the affirmative, both theoretically and empirically.

**Theorem 2.** For the conformal interval predictor I_c with a proper non-conformity score, there exists a calibration score ϕ such that the distribution predictor H given by MCC with calibration score ϕ and any 0-exact interpolation algorithm satisfies

|H[X](U) − H[X](L) − c| ≤ (1 − c)/(n + 1) a.s.,  (8)

where L, U are the lower and upper bounds of the interval I_c(X).

See Appendix C for a proof. Theorem 2 states that the conformal prediction interval [L, U] is also a c-credible interval (up to (1 − c)/(n + 1) error) of a distribution prediction made by MCC. In other words, if we know the distribution predicted by the appropriate MCC algorithm, then we can construct the conformal prediction interval by taking a c-credible interval. We only know that the conformal prediction interval is some credible interval of the distribution prediction, but we don't know which credible interval (i.e., η_c may not be the correct credible interval). We explore this complicating factor empirically: in particular, we will show that in practice, the conformal interval predictor I_c and the credible interval η_c ∘ H[x] (with the calibration score ϕ that is associated with the non-conformity score φ) have similar performance (see Figure 2).
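The map η_c can be implemented by numerically inverting a predicted CDF. The sketch below uses bisection; the search bounds are assumptions and should bracket the plausible label range, and a continuous, increasing CDF is assumed over that bracket.

```python
from scipy.stats import norm

def credible_interval(cdf, c=0.9, lo=-1e6, hi=1e6, tol=1e-6):
    """eta_c from Section 5.1 (sketch): return the centered c-credible interval
    (H^{-1}((1-c)/2), H^{-1}((1+c)/2)) of a predicted CDF via bisection."""
    def invert(p):
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if cdf(mid) < p:
                a = mid
            else:
                b = mid
        return 0.5 * (a + b)
    return invert((1.0 - c) / 2.0), invert((1.0 + c) / 2.0)

# Example: for a standard normal CDF the centered 90% interval is about (-1.645, 1.645).
print(credible_interval(norm.cdf, c=0.9))
```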
## 6. Empirical Study of Recalibration

Our framework introduces three decisions when choosing a recalibration algorithm: the base predictor, the calibration score, and the interpolation algorithm. In this section, we investigate how those choices affect performance. We evaluate each combination of 8 base prediction types and 3 interpolation algorithms across 17 regression tasks with 16 random train/test splits per regression task. We also test all of the calibration scores defined in Section 4.2. In total, we train 7,344 calibrated distribution predictors and evaluate each predictor across 4 metrics for a total of 29,376 model evaluations. We summarize our experimental findings in Table 2, Table 3, and Figure 2.

| | STD | 95% CI Width | NLL | CRPS |
| --- | --- | --- | --- | --- |
| ZSCORE-NAF | 0.442 ± 0.003 | 1.874 ± 0.037 | 0.297 ± 0.022 | 0.232 ± 0.002 |
| ZSCORE-LINEAR | 0.435 ± 0.003 | 1.766 ± 0.016 | 0.534 ± 0.021 | 0.232 ± 0.002 |
| ZSCORE-RANDOM | 0.438 ± 0.003 | 1.776 ± 0.016 | N/A | 0.232 ± 0.002 |
| CDF-NAF | 0.446 ± 0.005 | 1.723 ± 0.027 | 0.465 ± 0.144 | 0.245 ± 0.007 |
| CDF-LINEAR* | 0.562 ± 0.032 | 1.851 ± 0.033 | 0.433 ± 0.017 | 0.233 ± 0.002 |
| CDF-RANDOM* | 0.587 ± 0.058 | 1.851 ± 0.033 | N/A | 0.217 ± 0.000 |

Table 2: A comparison of calibration scores and interpolation algorithms when the base predictor is a distribution prediction (* indicates an existing algorithm we compare against). Note that CDF-LINEAR corresponds to the isotonic recalibration baseline and CDF-RANDOM corresponds to the conformal calibration baseline. Disaggregated experimental results are shown in Appendix D.

| | STD | 95% CI Width | NLL | CRPS |
| --- | --- | --- | --- | --- |
| POINT | 0.467 ± 0.006 | 1.927 ± 0.016 | 0.611 ± 0.017 | 0.242 ± 0.002 |
| INTERVAL | 0.830 ± 0.336 | 1.832 ± 0.034 | 0.051 ± 0.025 | 0.256 ± 0.002 |
| QUANTILE-2 | 0.449 ± 0.004 | 1.790 ± 0.019 | 0.101 ± 0.019 | 0.228 ± 0.002 |
| QUANTILE-4 | 0.439 ± 0.003 | 1.692 ± 0.016 | 0.109 ± 0.027 | 0.226 ± 0.002 |
| QUANTILE-7 | 0.434 ± 0.003 | 1.629 ± 0.015 | 0.103 ± 0.021 | 0.226 ± 0.002 |
| QUANTILE-10 | 0.432 ± 0.002 | 1.625 ± 0.012 | 0.042 ± 0.032 | 0.226 ± 0.002 |
| ENSEMBLE | 0.491 ± 0.009 | 1.795 ± 0.021 | 0.384 ± 0.017 | 0.227 ± 0.002 |
| DISTRIBUTION | 0.562 ± 0.032 | 1.851 ± 0.033 | 0.433 ± 0.017 | 0.233 ± 0.002 |

Table 3: A comparison of base predictors. We find that quantile predictors outperform all other prediction types on both sharpness metrics (STD, 95% CI Width) and proper scoring rules (NLL, CRPS).

**Datasets.** We compare MCC algorithms on 17 tabular regression datasets. Most datasets come from the UCI database (Dua & Graff, 2017). For each dataset we allocate 60% of the data to learn the base predictor, 20% for recalibration, and 20% for testing.

**Base Predictors.** We compare all five prediction types considered in this paper (see Table 1). For each base predictor, we use a simple three-layer neural network and optimize it with gradient descent. The different base predictors differ only in the number of output dimensions and the learning objective (i.e., the learning objective should be a proper scoring rule for that prediction type). We try to make the architectures and optimizers of the base predictors as similar as possible across prediction types to isolate the impact of the choice of prediction type, calibration score, and interpolation algorithm, as opposed to the strength of the base predictor. We compare the following base prediction types:

- For POINT predictors, the output dimension is 1 and we minimize the L2 error.
- For QUANTILE predictors, we use 2, 4, 7, or 10 equally spaced quantiles (denoted in the plots as quantile-2, quantile-4, quantile-7, and quantile-10). For example, for quantile-4 we predict the 1/8, 3/8, 5/8, and 7/8 quantiles. We optimize the neural network with the pinball loss (sketched after this list).
- For INTERVAL predictors, we use the same setup as (Romano et al., 2019), which is equivalent to quantile regression with the 5% and 95% quantiles.
- For DISTRIBUTION predictors, the output of the neural network is 2-dimensional, and we interpret the two dimensions as the mean and standard deviation of a Gaussian. We optimize the neural network with the negative log likelihood.
- For ENSEMBLE predictors, we use the setup in (Lakshminarayanan et al., 2017) and learn an ensemble of Gaussian distribution predictors.
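For reference, the pinball (quantile) loss used to train the quantile base predictors can be written as follows. This is a NumPy sketch of the loss only; the network and training loop are omitted, and all names are illustrative.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alphas):
    """Average pinball loss for predictions of K quantile levels.
    y_pred has shape (n, K); column k estimates the alpha_k quantile of Y given X."""
    y_true = np.asarray(y_true).reshape(-1, 1)      # (n, 1)
    diff = y_true - np.asarray(y_pred)              # (n, K)
    alphas = np.asarray(alphas).reshape(1, -1)      # (1, K)
    # For each level alpha: max(alpha * (y - yhat), (alpha - 1) * (y - yhat)).
    return float(np.mean(np.maximum(alphas * diff, (alphas - 1.0) * diff)))

# e.g., quantile-4 uses the 1/8, 3/8, 5/8, 7/8 quantiles:
alphas = np.array([1, 3, 5, 7]) / 8
```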
**Metrics.** We compare five measurements of prediction quality. NLL is the negative log likelihood of the label under the predicted distribution. CRPS is the continuous ranked probability score (Hersbach, 2000); compared to NLL, CRPS is well-defined even for distributions that do not have a density. STD is the standard deviation of the predicted distribution; a smaller STD corresponds to improved sharpness and is generally preferred (all else held equal). 95% CI Width is the size of the centered 95% credible interval given by each distribution prediction; a smaller interval is better (all else being equal). ECE is the expected calibration error (Kuleshov et al., 2018); we use the debiased ECE, which should be zero if the predictions are perfectly calibrated.

**Results.** We find that different recalibration algorithms perform optimally according to different metrics. This supports the need for flexible design frameworks that apply broadly and can be adjusted to the needs of a particular problem. In general, we find that quantile predictors are very effective base predictors that, perhaps surprisingly, tended to outperform distribution base predictors in our experiments. The findings of our experiments are summarized in Table 3, Table 2, and Figure 2.

Figure 2: Comparing the interval size from conformal interval prediction (left of each pair) versus credible intervals from recalibrated predictors (right of each pair) for a variety of base prediction types (point, distribution, interval, ensemble, quantile-2/4/7/10; panels show 95% and 90% interval sizes). The intervals from both methods obtain the nominal coverage. The distributions of interval sizes are very similar between the two methods, indicating that recalibration is empirically comparable to conformal interval prediction in its ability to provide interval predictions.

**On the choice of base predictor.** We find that all base prediction types can be recalibrated to give models with very good calibration. All base predictors we tested achieved an average test ECE of less than 0.007 after recalibration, across the 17 datasets. This is consistent with the calibration guarantee given by our framework, which says that a recalibrated model will be O(1/n)-calibrated. Quantile predictors performed best on the other four metrics we considered: quantile-10 performed best on the two sharpness metrics, STD and 95% CI Width, while quantile-4 performed best in terms of NLL, and the three quantile predictors had the same performance on CRPS (see Table 3). Interestingly, quantile base predictors outperformed distribution estimators on NLL, even though the distribution predictors were directly trained to optimize NLL. These results indicate that quantile prediction is a promising strategy for learning distribution predictors.

**On the choice of calibration score.** We investigate the role of the calibration score for a base predictor that already makes distribution predictions (see Table 2). Specifically, we compare two natural choices: ϕ_cdf and ϕ_z-score. The ϕ_cdf calibration score computes the quantile of the observed label under the predicted distribution, and is the calibration score used by isotonic calibration and conformal calibration. The ϕ_z-score calibration score computes the number of standard deviations between the mean of the predicted distribution and the observed label. We find that ϕ_z-score and ϕ_cdf are effective under different metrics. The ϕ_cdf calibration score performs better for CRPS and 95% CI Width, while ϕ_z-score performs better for NLL and STD.

**On the choice of interpolation algorithm.** We compare three interpolation algorithms: linear interpolation, which is simple and stable; random interpolation, which provides improved calibration guarantees; and neural autoregressive flow (NAF) interpolation, which uses a more sophisticated neural network approach to interpolation.
We find that NAF interpolation performs best on NLL and 95% CI Width, linear interpolation performs best on STD, and random interpolation performs best on CRPS. The random interpolator leads to distribution predictions with infinite STD and undefined NLL, so it is not appropriate when those metrics are of importance. The most appropriate interpolator is likely to vary between use cases.

**On interval prediction.** In this experiment, we explore whether recalibration can yield high-quality interval predictions, by comparing to conformal interval prediction, a standard approach for producing interval predictors from any base predictor. Recall that Theorem 2 tells us that conformal interval prediction can approximately be recovered by taking some credible interval of a recalibrated predictor. However, since we cannot identify a priori which credible interval it should be, we test in these experiments whether it is sufficient to simply take the centered credible interval; for example, the interval between the 5% and 95% quantiles of the predicted distributions. We find that this recalibration yields interval predictions that effectively approximate conformal interval prediction (see Figure 2). Conformal interval prediction tends to produce slightly shorter intervals than recalibration (both methods achieve the nominal coverage). This shows that recalibration can be applied broadly, even when the downstream task is unknown. If we recalibrate a model to make distribution predictions and then decide that we need interval predictions, we can extract credible intervals from the distribution predictor that are comparable to methods designed to directly produce interval predictions.

## 7. Discussion

Recalibration is a convenient and effective way to build calibrated distribution predictors. Flexible methods for uncertainty quantification empower more practitioners to use uncertainty quantification, improving the reliability of both fully-automated systems and decision support systems. Modular conformal calibration organizes and simplifies the process of choosing a recalibration technique, and provides guarantees that the resulting models will be calibrated. As a consequence, we believe that further developing principled and adaptive techniques for choosing between these recalibration algorithms is a promising direction for future work.

**Acknowledgements.** CM is supported by the NSF GRFP. This research was supported by NSF (#1651565), AFOSR (FA95501910024), ARO (W911NF-21-1-0125), and a Sloan Fellowship.

## References

Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Begoli, E., Bhattacharya, T., and Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 1(1):20–23, 2019.

Berger, J. O. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 2013.

Chung, Y., Neiswanger, W., Char, I., and Schneider, J. Beyond pinball loss: Quantile methods for calibrated uncertainty quantification. arXiv preprint arXiv:2011.09588, 2020.

Chung, Y., Char, I., Guo, H., Schneider, J., and Neiswanger, W. Uncertainty Toolbox: an open-source library for assessing, visualizing, and improving uncertainty quantification. arXiv preprint arXiv:2109.10254, 2021.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

Gupta, C., Podkopaev, A., and Ramdas, A. Distribution-free binary classification: prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems, 33:3711–3723, 2020.

Hersbach, H. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, 2000.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078–2087. PMLR, 2018.

Kang, D. Y., DeYoung, P. N., Tantiongloc, J., Coleman, T. P., and Owens, R. L. Statistical uncertainty quantification to augment clinical decision support: a first implementation in sleep medicine. NPJ Digital Medicine, 4(1):1–9, 2021.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pp. 2796–2804. PMLR, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Malik, A., Kuleshov, V., Song, J., Nemer, D., Seymour, H., and Ermon, S. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323. PMLR, 2019.

Michelmore, R., Kwiatkowska, M., and Gal, Y. Evaluating uncertainty quantification in end-to-end autonomous driving control. arXiv preprint arXiv:1811.06817, 2018.

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625–632. ACM, 2005.

Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Pratt, J. W., Raiffa, H., Schlaifer, R., et al. Introduction to Statistical Decision Theory. MIT Press, 1995.

Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Springer, 2003.

Romano, Y., Patterson, E., and Candes, E. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32:3543–3553, 2019.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.

Tran, G.-L., Bonilla, E. V., Cunningham, J., Michiardi, P., and Filippone, M. Calibrating deep convolutional Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1554–1563. PMLR, 2019.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.

Vovk, V., Petej, I., Toccaceli, P., Gammerman, A., Ahlberg, E., and Carlsson, L. Conformal calibrators. In Conformal and Probabilistic Prediction and Applications, pp. 84–99. PMLR, 2020.

Zhao, S., Kim, M., Sahoo, R., Ma, T., and Ermon, S. Calibrating predictions to decisions: A novel approach to multi-class calibration. Advances in Neural Information Processing Systems, 34, 2021.
| | Calibration error | Log-likelihood | Computation |
| --- | --- | --- | --- |
| Random | Perfect (+) | Undefined (−) | Fast (+) |
| Linear | O(1/T) (=) | Well defined (=) | Fast (+) |
| NAF | O(1/T) w/ assumptions (−) | Best (empirically) (+) | Slow (−) |

Figure 3: High-level comparison of different interpolation algorithms on different performance benchmarks. Top: a visualization of different interpolation algorithms; given a set of arbitrary real-valued calibration scores, each interpolation algorithm maps the values to be evenly spaced across the interval [0, 1]. Bottom: +, =, and − indicate best, intermediate, and worst, respectively.

## A. Interpolation Algorithms

**Quantile Prediction.** Recall that the interpretation of a quantile prediction is that the probability that y is less than f_k(x) should be α_k, for k = 1, ..., K. We define

ϕ_quantile(x, y) =
- α_K + (y − f_K(x)), if y > f_K(x);
- α_k + ((y − f_k(x)) / (f_{k+1}(x) − f_k(x))) (α_{k+1} − α_k), if f_k(x) < y ≤ f_{k+1}(x), for k = 1, ..., K − 1;
- α_1 + (y − f_1(x)), if y ≤ f_1(x).

Intuitively, if y exactly equals the α_k-th quantile, then ϕ_quantile(x, y) = α_k. For other values we use a linear interpolation.

**Ensemble Prediction.** Given an ensemble consisting of K predictors, we can define the calibration score recursively: for each of the K predictors in the ensemble, we choose a calibration score ϕ_k; the overall calibration score ϕ_ensemble is the summed calibration score ϕ_ensemble(x, y) = Σ_k ϕ_k(x, y). Naturally, if there is prior information about the quality of these predictions, we can use a weighted sum where the higher-quality predictions are given a higher weight.

## B. Proper Non-conformity Score

To ensure that conformal prediction algorithms are well-behaved, it is typical to put some restrictions on the non-conformity score. In particular, we say that a non-conformity score is proper if it is continuous and strictly unimodal in y. We require strict unimodality and continuity to ensure that the confidence intervals change smoothly when c increases or decreases. Intuitively, an infinitesimal increase in c should lead to an infinitesimal increase in the confidence interval.

The important property of conformal prediction is that it is always 1/n-exact (as in Definition 2), regardless of the true distribution of X, Y and the non-conformity score φ.

**Proposition 1.** For any non-conformity score φ, if φ(X, Y) is absolutely continuous, then the conformal interval predictor I_c is 1/n-exact.

## C. Proofs

**Proof of Proposition 1.**

Pr[Y ∈ I_c(X)] = Pr[(1/n) #{i : φ(X_i, Y_i) ≤ φ(X, Y)} ≤ c]
= E[ Pr[(1/n) #{i : φ(X_i, Y_i) ≤ φ(X, Y)} ≤ c | {Z_1, ..., Z_n, (X, Y)}] ],

where Z_i = (X_i, Y_i).

**Theorem 1.** For any base predictor f, calibration score ϕ, and λ-accurate interpolation algorithm ψ such that the random variable ϕ(f(X), Y) is absolutely continuous, Algorithm 1 is (1 + λ)/(n + 1)-calibrated.

**Proof of Theorem 1.** Throughout, T = n denotes the number of calibration samples and a_1 ≤ ... ≤ a_T denote the sorted calibration scores. By our assumption of absolute continuity, almost surely the scores a_1, ..., a_T and ϕ(X*, Y*) are all distinct. For notational convenience we also let a_t = −∞ if t < 1 and a_t = +∞ if t > T. By the λ-accuracy assumption, (t − λ)/(T + 1) < q(a_t) < (t + λ)/(T + 1); hence if q(a) ≥ (t + λ)/(T + 1) then q(a) > q(a_t), which by monotonicity implies that a > a_t. That is,

if q(a) ≥ (t + λ)/(T + 1), then a > a_t;  (9)
if q(a) ≤ (t − λ)/(T + 1), then a < a_t.  (10)

Then:

Pr[H[X*](Y*) ≤ c]
:= Pr[q(ϕ(X*, Y*)) ≤ c]  (definition)
≤ Pr[q(ϕ(X*, Y*)) ≤ (⌈cT + c + λ⌉ − λ)/(T + 1)]
≤ Pr[ϕ(X*, Y*) < a_{⌈cT + c + λ⌉}]  ([i] and Eq. (10))
= E[ Pr[ϕ(X*, Y*) < a_{⌈cT + c + λ⌉} | {Z_1, ..., Z_T, (X*, Y*)}] ]  (tower property)
≤ ⌈cT + c + λ⌉ / (T + 1)  (symmetry)

where step [i] is based on the property that if A implies B then Pr[A] ≤ Pr[B]; the last inequality is usually an equality, except when c is close to 1 and the upper bound exceeds 1.
Similarly, we have

Pr[H[X*](Y*) ≤ c]
:= 1 − Pr[q(ϕ(X*, Y*)) > c]  (definition)
≥ 1 − Pr[q(ϕ(X*, Y*)) ≥ (⌊cT + c − λ⌋ + λ)/(T + 1)]
≥ 1 − Pr[ϕ(X*, Y*) > a_{⌊cT + c − λ⌋}]  ([i] and Eq. (9))
= 1 − E[ Pr[ϕ(X*, Y*) > a_{⌊cT + c − λ⌋} | {Z_1, ..., Z_T, (X*, Y*)}] ]  (tower property)
= E[ Pr[ϕ(X*, Y*) ≤ a_{⌊cT + c − λ⌋} | {Z_1, ..., Z_T, (X*, Y*)}] ]  (linearity)
≥ ⌊cT + c − λ⌋ / (T + 1)  (symmetry)

Therefore we have

Pr[H[X*](Y*) ≤ c] − c ≤ ⌈cT + c + λ⌉/(T + 1) − c < (cT + c + λ + 1)/(T + 1) − c = (1 + λ)/(T + 1),
Pr[H[X*](Y*) ≤ c] − c ≥ ⌊cT + c − λ⌋/(T + 1) − c > (cT + c − λ − 1)/(T + 1) − c = −(1 + λ)/(T + 1).

So, combined, we have (for any c ∈ (0, 1))

|Pr[H[X*](Y*) ≤ c] − c| ≤ (1 + λ)/(T + 1).

**Theorem 2.** For the conformal interval predictor I_c with a proper non-conformity score, there exists a calibration score ϕ such that the distribution predictor H given by MCC with calibration score ϕ and any 0-exact interpolation algorithm satisfies

|H[X](U) − H[X](L) − c| ≤ (1 − c)/(n + 1) a.s.,  (8)

where L, U are the lower and upper bounds of the interval I_c(X).

Figure 4: Illustration of the proof of Theorem 2.

**Proof of Theorem 2.** We use a constructive proof. Because the conformal prediction algorithm does not change if we add a constant to the non-conformity score, without loss of generality assume 0 is a lower bound on φ. Denote by y_min(φ; x) a global minimizer of φ(x, ·), i.e.,

φ(x, y_min(φ; x)) ≤ φ(x, y), for all y ∈ Y,  (11)

and define the calibration score

ϕ(x, y) = −φ(x, y) if y ≤ y_min(φ; x), and ϕ(x, y) = φ(x, y) if y > y_min(φ; x).

Based on this construction we have φ(x, y) = |ϕ(x, y)|. In addition, because φ is unimodal, ϕ is monotonically non-decreasing, so ϕ satisfies the condition to be a calibration score. As a notational convenience we also write ϕ(x, y) = ϕ_x(y).

Consider the conformal interval predictor (for notational convenience we drop its dependence on Z_1, ..., Z_T, X*):

Γ_interval := (L, U) = { y : (1/T) #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(y)|} ≤ c }.

First of all, observe that because φ is continuous, we must have |ϕ_{X*}(U)| = |ϕ_{X*}(L)|. This intuition is illustrated in Figure 4. Therefore

#{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(L)|} = #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U)|} = #{t : ϕ_{X*}(L) ≤ ϕ_{X_t}(Y_t) ≤ ϕ_{X*}(U)}.

Second, we wish to prove that

cT ≤ #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U)|} ≤ cT + 1.

This is because if #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U)|} < cT, then by continuity of φ, almost surely choosing U′ = U + κ for sufficiently small κ > 0 still satisfies #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U′)|} < cT; therefore U′ ∈ (L, U) but U′ > U, which is a contradiction. If, on the other hand, #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U)|} > cT + 1, then let U′ = U − κ for sufficiently small κ > 0; we have #{t : |ϕ_{X_t}(Y_t)| ≤ |ϕ_{X*}(U′)|} > cT. This means that U′ ∉ (L, U) but U′ < U, which is a contradiction.

We observe that there are two possibilities; the two situations are illustrated in Figure 4. Situation 1: there exists a t such that ϕ_{X_t}(Y_t) = ϕ_{X*}(U). Situation 2: there exists a t such that ϕ_{X_t}(Y_t) = ϕ_{X*}(L).

We first consider situation 1. Denote D = #{t : ϕ_{X*}(L) < ϕ_{X_t}(Y_t) ≤ ϕ_{X*}(U)} and B = #{t : ϕ_{X_t}(Y_t) ≤ ϕ_{X*}(L)}. We know that cT ≤ D < cT + 1. Then, by the assumption that the interpolation algorithm is 0-exact, we have

H[X*](U) = (D + B + 1)/(T + 1),  H[X*](L) ∈ [ (B + 1)/(T + 1), (B + 2)/(T + 1) ],

so their difference is bounded by

(D − 1)/(T + 1) < H[X*](U) − H[X*](L) ≤ D/(T + 1).  (12)

Combining this with cT ≤ D < cT + 1 gives

H[X*](U) − H[X*](L) − c ≤ (cT + 1)/(T + 1) − c = (1 − c)/(T + 1),
H[X*](U) − H[X*](L) − c ≥ (c − 1)/(T + 1),

and combined we have |H[X*](U) − H[X*](L) − c| ≤ |1 − c|/(T + 1).

Now we consider situation 2. Denote D = #{t : ϕ_{X*}(L) ≤ ϕ_{X_t}(Y_t) < ϕ_{X*}(U)} and B = #{t : ϕ_{X_t}(Y_t) < ϕ_{X*}(L)}. Again we know that cT ≤ D < cT + 1.
Then by the assumption that the interpolation algorithm is 0-exact we have H[X ](U) D + B + 1 T + 1 , D + B + 2 , H[X ](L) = B + 2 So their difference is bounded by T + 1 < H[X ](U) H[X ](L) D This is identical to Eq.(12) so the rest of the proof will follow identically. D. Additional Experimental Results Modular Conformal Calibration STD 95% CI Width NLL CRPS ECE blog ZSCORE-NAF 0.553 0.006 2.737 0.151 0.178 0.039 0.287 0.003 0.002 0.001 CDF-LINEAR 1.380 0.513 3.074 0.475 0.064 0.062 0.289 0.002 0.001 0.001 CDF-NAF 0.592 0.061 2.027 0.201 0.944 1.229 0.328 0.049 0.039 0.028 ZSCORE-RANDOM 0.567 0.007 2.657 0.101 4.087 0.102 0.288 0.003 0.002 0.001 CDF-RANDOM 1.381 0.513 3.074 0.475 4.069 0.096 0.289 0.002 0.001 0.001 ZSCORE-LINEAR 0.567 0.007 2.657 0.101 0.321 0.040 0.288 0.003 0.002 0.001 boston ZSCORE-NAF 0.337 0.019 1.589 0.195 0.467 0.117 0.173 0.012 0.009 0.010 CDF-LINEAR 0.372 0.031 1.527 0.118 0.642 0.075 0.173 0.012 0.009 0.009 CDF-NAF 0.357 0.025 1.539 0.124 0.363 0.070 0.174 0.012 0.009 0.010 ZSCORE-RANDOM 0.335 0.018 1.520 0.121 13.622 0.094 0.173 0.012 0.009 0.009 CDF-RANDOM N/A 1.532 0.121 13.588 0.114 N/A 0.009 0.009 ZSCORE-LINEAR 0.331 0.017 1.449 0.111 0.674 0.068 0.173 0.012 0.009 0.009 concrete ZSCORE-NAF 0.325 0.017 1.402 0.186 0.325 0.066 0.163 0.007 0.005 0.005 CDF-LINEAR 0.317 0.016 1.287 0.086 0.576 0.054 0.163 0.007 0.005 0.005 CDF-NAF 0.312 0.016 1.354 0.078 0.251 0.051 0.163 0.007 0.005 0.006 ZSCORE-RANDOM 0.310 0.014 1.287 0.087 13.206 0.090 0.163 0.007 0.005 0.005 CDF-RANDOM N/A 1.288 0.086 13.212 0.100 N/A 0.005 0.005 ZSCORE-LINEAR 0.309 0.014 1.279 0.084 0.593 0.057 0.163 0.007 0.005 0.005 crime ZSCORE-NAF 0.498 0.013 2.062 0.084 1.764 0.123 0.309 0.008 0.010 0.006 CDF-LINEAR 0.644 0.029 2.406 0.066 1.077 0.057 0.308 0.007 0.006 0.006 CDF-NAF 0.570 0.014 2.161 0.053 0.814 0.067 0.311 0.007 0.006 0.006 ZSCORE-RANDOM 0.487 0.011 1.900 0.036 13.308 0.071 0.308 0.008 0.006 0.006 CDF-RANDOM N/A 2.406 0.066 13.274 0.054 N/A 0.006 0.006 ZSCORE-LINEAR 0.487 0.011 1.900 0.036 1.901 0.130 0.308 0.008 0.006 0.006 energy ZSCORE-NAF 0.174 0.013 0.652 0.043 0.394 0.097 0.099 0.008 0.010 0.006 -efficiency CDF-LINEAR 0.168 0.010 0.668 0.040 0.027 0.108 0.099 0.008 0.010 0.006 CDF-NAF 0.168 0.010 0.681 0.039 0.342 0.101 0.099 0.008 0.010 0.006 ZSCORE-RANDOM 0.180 0.010 0.667 0.040 13.253 0.130 0.099 0.008 0.010 0.006 CDF-RANDOM N/A 0.668 0.040 13.252 0.125 N/A 0.010 0.006 ZSCORE-LINEAR 0.172 0.011 0.666 0.040 0.003 0.107 0.099 0.008 0.010 0.006 fb-comment1 ZSCORE-NAF 0.376 0.009 1.459 0.028 0.066 0.015 0.210 0.001 0.002 0.001 CDF-LINEAR 0.880 0.119 1.692 0.034 0.120 0.017 0.210 0.001 0.002 0.001 CDF-NAF 0.344 0.054 1.156 0.195 2.841 2.106 0.288 0.075 0.091 0.046 ZSCORE-RANDOM 0.397 0.004 1.692 0.034 5.196 0.030 0.210 0.001 0.002 0.001 CDF-RANDOM 0.880 0.119 1.692 0.034 5.220 0.029 0.210 0.001 0.002 0.001 ZSCORE-LINEAR 0.397 0.004 1.692 0.034 0.261 0.015 0.210 0.001 0.002 0.001 fb-comment2 ZSCORE-NAF 0.368 0.010 1.423 0.021 0.053 0.025 0.207 0.002 0.001 0.001 CDF-LINEAR 0.459 0.017 1.573 0.016 0.118 0.027 0.207 0.002 0.001 0.001 CDF-NAF 0.347 0.035 1.294 0.171 N/A 0.273 0.083 N/A ZSCORE-RANDOM 0.386 0.004 1.573 0.016 3.739 0.031 0.207 0.002 0.001 0.001 CDF-RANDOM 0.459 0.017 1.573 0.016 3.731 0.037 0.207 0.002 0.001 0.001 ZSCORE-LINEAR 0.386 0.004 1.573 0.016 0.194 0.023 0.207 0.002 0.001 0.001 forest-fires ZSCORE-NAF 1.167 0.043 4.697 0.214 1.959 0.137 0.601 0.023 0.018 0.008 CDF-LINEAR 1.285 0.162 4.874 0.380 1.964 0.086 0.601 0.022 0.017 0.008 CDF-NAF 1.219 0.078 4.801 0.339 
1.637 0.087 0.607 0.023 0.017 0.008 ZSCORE-RANDOM 1.156 0.041 4.529 0.185 13.648 0.102 0.601 0.023 0.017 0.008 CDF-RANDOM N/A 4.862 0.381 13.659 0.099 N/A 0.017 0.008 ZSCORE-LINEAR 1.147 0.043 4.455 0.219 2.121 0.124 0.601 0.023 0.017 0.008 kin8nm ZSCORE-NAF 0.300 0.006 1.107 0.016 0.127 0.012 0.152 0.002 0.003 0.002 CDF-LINEAR 0.281 0.003 1.121 0.013 0.490 0.015 0.152 0.002 0.003 0.002 CDF-NAF 0.281 0.003 1.149 0.015 0.112 0.014 0.152 0.002 0.003 0.002 ZSCORE-RANDOM 0.281 0.003 1.121 0.013 11.279 0.090 0.152 0.002 0.003 0.002 CDF-RANDOM 0.281 0.003 1.121 0.013 11.291 0.073 0.152 0.002 0.003 0.002 ZSCORE-LINEAR 0.281 0.003 1.121 0.013 0.486 0.014 0.152 0.002 0.003 0.002 Table 4: Experimental results for individual datasets. Modular Conformal Calibration STD 95% CI Width NLL CRPS ECE medical ZSCORE-NAF 0.935 0.008 4.465 0.064 1.548 0.015 0.463 0.002 0.003 0.001 -expenditure CDF-LINEAR 0.963 0.009 3.648 0.037 1.643 0.009 0.462 0.002 0.002 0.001 CDF-NAF 0.928 0.010 3.616 0.110 1.381 0.057 0.465 0.002 0.005 0.004 ZSCORE-RANDOM 0.893 0.004 3.648 0.037 10.869 0.041 0.462 0.002 0.002 0.001 CDF-RANDOM 0.963 0.009 3.648 0.037 10.866 0.034 0.462 0.002 0.002 0.001 ZSCORE-LINEAR 0.893 0.004 3.648 0.037 1.818 0.015 0.462 0.002 0.002 0.001 mpg ZSCORE-NAF 0.396 0.021 1.832 0.186 0.555 0.114 0.186 0.011 0.019 0.013 CDF-LINEAR 0.392 0.030 1.554 0.110 0.821 0.099 0.186 0.011 0.020 0.013 CDF-NAF 0.383 0.026 1.652 0.138 0.551 0.084 0.187 0.011 0.019 0.013 ZSCORE-RANDOM 0.380 0.018 1.545 0.109 13.696 0.073 0.186 0.011 0.020 0.013 CDF-RANDOM N/A 1.546 0.110 13.651 0.092 N/A 0.020 0.013 ZSCORE-LINEAR 0.377 0.018 1.538 0.108 0.756 0.096 0.186 0.011 0.019 0.013 naval ZSCORE-NAF 0.041 0.002 0.165 0.008 1.904 0.036 0.025 0.001 0.002 0.001 CDF-LINEAR 0.042 0.002 0.162 0.007 1.716 0.029 0.025 0.001 0.002 0.001 CDF-NAF 0.042 0.002 0.173 0.008 1.915 0.027 0.025 0.001 0.002 0.001 ZSCORE-RANDOM 0.042 0.002 0.162 0.007 1.355 0.125 0.025 0.001 0.002 0.001 CDF-RANDOM 0.042 0.002 0.162 0.007 1.330 0.114 0.025 0.001 0.002 0.001 ZSCORE-LINEAR 0.042 0.002 0.162 0.007 1.717 0.029 0.025 0.001 0.002 0.001 power-plant ZSCORE-NAF 0.218 0.003 0.845 0.007 0.104 0.010 0.120 0.001 0.003 0.002 CDF-LINEAR 0.225 0.003 0.854 0.009 0.244 0.011 0.120 0.001 0.003 0.002 CDF-NAF 0.224 0.003 0.886 0.011 0.101 0.010 0.121 0.001 0.003 0.002 ZSCORE-RANDOM 0.219 0.002 0.854 0.009 10.208 0.074 0.120 0.001 0.003 0.002 CDF-RANDOM 0.225 0.003 0.854 0.009 10.208 0.067 0.120 0.001 0.003 0.002 ZSCORE-LINEAR 0.218 0.002 0.854 0.009 0.251 0.011 0.120 0.001 0.003 0.002 protein ZSCORE-NAF 0.622 0.011 2.246 0.027 0.765 0.019 0.339 0.003 0.001 0.001 CDF-LINEAR 0.709 0.020 2.333 0.024 0.960 0.018 0.339 0.003 0.001 0.001 CDF-NAF 0.617 0.007 2.138 0.072 0.968 0.048 0.341 0.003 0.015 0.006 ZSCORE-RANDOM 0.627 0.006 2.333 0.024 7.366 0.072 0.339 0.003 0.001 0.001 CDF-RANDOM 0.709 0.020 2.333 0.024 7.393 0.073 0.339 0.003 0.001 0.001 ZSCORE-LINEAR 0.627 0.006 2.333 0.024 0.995 0.017 0.339 0.003 0.001 0.001 super ZSCORE-NAF 0.307 0.005 1.140 0.016 0.246 0.016 0.149 0.002 0.003 0.002 -conductivity CDF-LINEAR 0.343 0.017 1.179 0.016 0.052 0.013 0.149 0.002 0.003 0.002 CDF-NAF 0.294 0.005 1.089 0.061 0.046 0.074 0.153 0.002 0.019 0.008 ZSCORE-RANDOM 0.300 0.004 1.180 0.016 6.119 0.070 0.149 0.002 0.003 0.002 CDF-RANDOM 0.343 0.017 1.179 0.016 6.126 0.066 0.149 0.002 0.003 0.002 ZSCORE-LINEAR 0.300 0.004 1.180 0.016 0.006 0.017 0.149 0.002 0.003 0.002 wine ZSCORE-NAF 0.830 0.021 3.794 0.323 1.376 0.066 0.427 0.012 0.013 0.007 CDF-LINEAR 1.031 0.258 3.256 0.135 1.513 0.046 
0.427 0.012 0.013 0.007 CDF-NAF 0.844 0.038 3.308 0.181 1.341 0.185 0.433 0.011 0.015 0.007 ZSCORE-RANDOM 0.791 0.020 3.256 0.136 12.452 0.122 0.427 0.012 0.013 0.007 CDF-RANDOM N/A 3.256 0.136 12.436 0.101 N/A 0.013 0.007 ZSCORE-LINEAR 0.789 0.020 3.253 0.132 1.589 0.055 0.427 0.012 0.013 0.007 yacht ZSCORE-NAF 0.066 0.007 0.249 0.028 1.484 0.180 0.042 0.007 0.015 0.011 CDF-LINEAR 0.066 0.007 0.266 0.040 1.134 0.171 0.042 0.007 0.016 0.011 CDF-NAF 0.066 0.007 0.260 0.037 1.470 0.178 0.042 0.007 0.015 0.011 ZSCORE-RANDOM 0.098 0.006 0.266 0.039 12.958 0.316 0.042 0.007 0.016 0.011 CDF-RANDOM N/A 0.267 0.041 12.996 0.323 N/A 0.016 0.011 ZSCORE-LINEAR 0.068 0.009 0.259 0.037 1.163 0.176 0.042 0.007 0.016 0.011 Table 5: Experimental results for individual datasets.